Area, Delay, and Power Characteristics of Standard-Cell Implementations of the AES S-Box

Stefan Tillich, Martin Feldhofer, Thomas Popp
Institute for Applied Information Processing and Communications
Graz University of Technology, Inffeldgasse 16a, A–8010 Graz, Austria
{stillich,mfeldhof,tpopp}@iaik.tugraz.at

Johann Großschädl
University of Bristol, Department of Computer Science
Merchant Venturers Building, Woodland Road, Bristol, BS8 1UB, U.K.
johann.groszschaedl@cs.bris.ac.uk

Abstract

Cryptographic substitution boxes (S-boxes) are an integral part of modern block ciphers like the Advanced Encryption Standard (AES). There exists a rich literature devoted to the efficient implementation of cryptographic S-boxes, wherein hardware designs for FPGAs and standard cells received particular attention. In this paper we present a comprehensive study of different standard-cell implementations of the AES S-box with respect to timing (i.e. critical path), silicon area, power consumption, and combinations of these cost metrics. We examine implementations which exploit the mathematical properties of the AES S-box, constructions based on hardware look-up tables, and dedicated low-power solutions. Our results show that the timing, area, and power properties of the different S-box realizations can vary by up to almost an order of magnitude. In terms of area and area-delay product, the best choices are implementations which calculate the S-box output. On the other hand, the hardware look-up solutions are characterized by the shortest critical path. The dedicated low-power implementations not only reduce power consumption considerably, but also show good timing properties and offer the best power-delay and power-area products.

1 Introduction

The Internet of the 21st century will consist of billions of non-traditional computing systems like cell phones, PDAs, sensor nodes, and other mobile devices (“gadgets”) with wireless networking capability.
Wireless networking, along with the fact that many of these devices (e.g. sensor nodes) are easily accessible, has raised a number of security concerns. Sophisticated security protocols, in combination with well-established cryptographic primitives, can ensure privacy and integrity of communication over insecure networks. Consequently, there is an increasing demand to implement cryptographic algorithms on resource-limited embedded devices like cell phones, PDAs, or sensor nodes. Even some extremely constrained systems like Radio Frequency Identification (RFID) tags are required to perform cryptographic operations.

The Advanced Encryption Standard (AES), which was announced by NIST in 2001, defines one of the most important symmetric ciphers for the long-term future [15]. The AES algorithm is a variant of the Rijndael cipher [4] and can be implemented efficiently in both hardware and software. Common AES hardware implementations take the form of stand-alone ASICs and cryptographic coprocessors for system-on-chip (SoC) integration. In addition, hardware/software co-design techniques like extending the instruction set of a general-purpose processor have been investigated in the recent past [19].

Journal of Signal Processing Systems, vol. 50, no. 2, pp. 251–261, Feb. 2008. © Springer-Verlag, 2008. A preliminary version of this paper was published in the proceedings of SAMOS 2006, LNCS 4017, pp. 457–466.

Due to the high performance of modern microprocessors, AES software can reach throughput rates that are sufficient for most applications. Therefore, hardware implementations of the AES algorithm are mainly important for high-end server systems with extreme performance requirements and for embedded devices with a demand for low power consumption and small silicon area. Most of the published AES hardware designs focus on high speed and high throughput for implementation in FPGAs [3, 11, 16]. In addition, some ASIC implementations have been reported in the literature.
For example, Hodjat et al. developed a 3.84 Gbit/s AES crypto coprocessor with modes-of-operation support based on a 0.18 µm CMOS technology [7]. Their design features a 128-bit datapath and encrypts a block of data in 11 clock cycles. A completely different design approach is necessary when optimizing AES hardware for low power consumption or small silicon area. Feldhofer et al. introduced an AES implementation suited for passively-powered devices like RFID tags [6]. It comprises an 8-bit datapath which occupies an area of 3,595 gates (including registers and control logic) when synthesized using a 0.35 µm standard cell library. These results show that the AES algorithm allows for a wide range of trade-offs between performance, power consumption, and hardware cost [5].

Symmetric ciphers like the AES require non-linear functions in order to resist linear cryptanalysis. Substitution is a common function for introducing non-linearity. A substitution function, generally referred to as S-box, can be realized in the form of an arbitrary mapping from input bits to output bits (e.g. DES [14]) or via algebraic operations (e.g. AES). Different cipher algorithms use different numbers of S-boxes. For example, DES uses eight S-boxes which map six to four bits, while AES employs a single S-box which is a bijective mapping from eight to eight bits.

The AES algorithm makes use of its S-box in the SubBytes round transformation and the key expansion [4]. From a mathematical point of view, the AES S-box is defined as an inversion in the finite field $\mathbb{F}_{2^8}$ with a specific irreducible polynomial [9], followed by an affine transformation. The inverse S-box, which is required for the InvSubBytes round transformation for decryption, is simply the inverse of the affine transformation, followed by an inversion in $\mathbb{F}_{2^8}$. The S-box is a costly and performance-critical building block of the AES algorithm.
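The algebraic definition of the S-box given above can be illustrated with a short Python sketch. It is a software reference model only, not one of the hardware designs studied in this paper. The irreducible polynomial (0x11B) and the constants of the affine transformation are taken from FIPS-197 rather than from the text above, inversion is computed naively as the 254th power for clarity, and all function names are chosen for illustration.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of F_{2^8} modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def gf_inv(a: int) -> int:
    """Inverse in F_{2^8} computed as a^254; 0 is mapped to 0 as in the AES."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def affine(x: int) -> int:
    """Affine transformation of FIPS-197 (bit matrix over F_2 plus constant 0x63)."""
    y = 0
    for i in range(8):
        bit = (x >> i) ^ (x >> ((i + 4) % 8)) ^ (x >> ((i + 5) % 8))
        bit ^= (x >> ((i + 6) % 8)) ^ (x >> ((i + 7) % 8))
        y |= (bit & 1) << i
    return y ^ 0x63


def inv_affine(x: int) -> int:
    """Inverse of the affine transformation (constant 0x05)."""
    y = 0
    for i in range(8):
        bit = (x >> ((i + 2) % 8)) ^ (x >> ((i + 5) % 8)) ^ (x >> ((i + 7) % 8))
        y |= (bit & 1) << i
    return y ^ 0x05


def sbox(x: int) -> int:
    """Forward S-box: inversion followed by the affine transformation."""
    return affine(gf_inv(x))


def inv_sbox(y: int) -> int:
    """Inverse S-box: inverse affine transformation followed by inversion."""
    return gf_inv(inv_affine(y))
```

Evaluating `sbox` for all 256 inputs yields the table that look-up implementations store directly, whereas calculating implementations realize the inversion and the affine step as logic.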
Results from previous work [7, 21] show that the S-box lies on the critical path of many AES architectures and, hence, limits the maximum clock frequency. In addition, the S-box also impacts area and power consumption of AES hardware [17, 13]. Therefore, the AES S-box has been a subject of intensive research in recent years, which has led to a rich literature on efficient S-box design and implementation. The proposed designs can be roughly categorized into S-boxes that contain optimized circuits for arithmetic in $\mathbb{F}_{2^8}$ [2, 17, 20], constructions using hardware look-up tables [8, 11], and dedicated low-power solutions [1], all of which have their specific advantages and disadvantages with respect to area, delay, and power consumption. Although most papers introducing new S-box designs provide implementation results and discuss related work, it is generally difficult to compare the different design approaches since, for example, the implementations may have been produced using different design flows and tools, different standard cell libraries, or different optimizations (speed, area) for the synthesis process.

In this paper we analyze and compare silicon area, critical path delay, and power consumption characteristics of the most common standard-cell designs of the AES S-box in a uniform and coherent way. We consider in our study designs which exploit the mathematical properties of the S-box, constructions based on hardware look-up tables, and dedicated low-power solutions. In contrast to our previous work [18] where we used a 0.35 µm standard-cell library to evaluate different S-box designs, we conducted the present study on the basis of a more modern 0.25 µm process technology in order to provide practical insights and results that are closer to the state-of-the-art in VLSI manufacturing. We put similar effort into optimizing each of the evaluated S-box designs to ensure a fair comparison.
Our results show that the area, delay, and power figures of the different S-box designs vary significantly (up to almost an order of magnitude), which underpins the importance of selecting the best-suited S-box with respect to the requirements of the application.

The remainder of this paper is organized as follows. Section 2 briefly explains the AES algorithm and discusses hardware implementation aspects. In Section 3 we give an overview of different implementation strategies for the AES S-box. The particular S-box implementations that we used for our evaluation of area, delay, and power consumption are described in Section 4. Section 5 provides background information on the design flow and evaluation methodology. In Section 6 we discuss our experimental results and we finally conclude in Section 7.

2 The Advanced Encryption Standard

In November 2001, after several years of public evaluation, the National Institute of Standards and Technology (NIST) officially announced the algorithm for the new Federal Information Processing Standard FIPS-197 [15], also called Advanced Encryption Standard (AES). The block cipher Rijndael [4] was chosen from 15 submitted candidates and has thenceforward become the AES algorithm. The AES is a very flexible algorithm suitable for implementation on many platforms in software as well as in hardware. Its simplicity and symmetry properties facilitate optimization towards different objectives such as high performance or low cost.

The AES algorithm has a fixed block size of 128 bits. Each block is organized as a 4 × 4 matrix of bytes, referred to as State. The FIPS-197 standard defines three different key lengths: 128, 192, and 256 bits. Similar to most symmetric ciphers, the AES algorithm encrypts an input block by applying a round transformation several times. Depending on the key length, the number of rounds is either 10, 12, or 14. The round transformation modifies the 128-bit State from its initial value (i.e.
the plaintext) to obtain the ciphertext after the last round. Each round consists of non-linear, linear, and key-dependent transformations, which can all be described by means of algebraic operations over the finite field $\mathbb{F}_{2^8}$. These operations, called SubBytes, ShiftRows, MixColumns, and AddRoundKey, scramble the bytes of the State either individually, row-wise, or column-wise. Before the first round an initial AddRoundKey is performed, while in the last round the MixColumns operation is omitted.

The SubBytes transformation substitutes each byte of the State independently. This byte substitution is defined by the so-called S-box, which can be expressed through arithmetic operations in the finite fields $\mathbb{F}_2$ and $\mathbb{F}_{2^8}$. More specifically, it is composed of an inversion in $\mathbb{F}_{2^8}$ followed by an affine transformation. The affine transformation consists of a multiplication with a constant polynomial over $\mathbb{F}_2$ and addition of another constant polynomial. The SubBytes transformation is the only non-linear function of the AES algorithm. Its implementation has a major impact on the area, performance, and power consumption of an AES hardware module.

ShiftRows rotates each row of the State to the left using a specific offset. The offset equals the row index (starting at 0), which means that the first row is not rotated at all and the last row is rotated by three bytes to the left.

MixColumns operates on columns of the State. Each column is interpreted as a polynomial of degree 3 with coefficients from the field $\mathbb{F}_{2^8}$. This polynomial is multiplied by a polynomial with fixed coefficients, and the result is reduced modulo $g(t) = \{1\}t^4 + \{1\}$ (where $\{1\} \in \mathbb{F}_{2^8}$). The MixColumns operation is often expressed as a multiplication of a constant 4 × 4 matrix of $\mathbb{F}_{2^8}$ elements with the input column (interpreted as four elements of $\mathbb{F}_{2^8}$), yielding the respective output column.
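The ShiftRows and MixColumns descriptions above can be made concrete with a minimal Python sketch. The fixed polynomial {03}t^3 + {01}t^2 + {01}t + {02} and the field polynomial 0x11B are taken from FIPS-197 and are not stated in the text above. The State is modelled as a list of four rows, and the function names are illustrative assumptions.

```python
def xtime(a: int) -> int:
    """Multiply an element of F_{2^8} by {02}, reducing modulo 0x11B."""
    a <<= 1
    if a & 0x100:
        a ^= 0x11B
    return a


def shift_rows(state: list) -> list:
    """Rotate row r of the State (a list of four 4-byte rows) left by r bytes."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def mix_column(col: list) -> list:
    """Multiply one State column by the constant MixColumns matrix.

    Each matrix row is a rotation of ({02}, {03}, {01}, {01});
    {03}*a is computed as xtime(a) XOR a.
    """
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),
    ]
```

Since both operations are linear over $\mathbb{F}_2$, they map to XOR networks and wiring in hardware, which is one reason why the S-box dominates the cost of a round.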
The three aforementioned transformations form the substitution permutation network of the AES algorithm, wherein SubBytes represents the substitution part (to increase confusion) and ShiftRows and MixColumns constitute the permutation part (increasing diffusion). AddRoundKey simply combines the State with a round key by applying an XOR-operation over all 128 bits. The KeySchedule transformation produces the 128-bit round keys, whereby the first round key is equal to the cipher key. All other round keys are computed from the previous round key by using the S-box functionality and some constants referred to as Rcon.

The decryption function recovers the plaintext from a given ciphertext by executing the inverse round transformations (InvSubBytes, InvShiftRows, InvMixColumns, and AddRoundKey) in reverse order. All round keys are also used in reverse order.

2.1 Hardware Implementation Aspects

The AES is a flexible algorithm well suited for implementation in hardware. A multitude of hardware architectures are possible, which allows for optimization toward different requirements, ranging from high performance to low power consumption and small silicon area. A considerable literature exists that is devoted to efficient hardware implementation of the AES [3, 6, 7, 11, 16, 17, 21].

Depending on the target application, AES architectures can have a datapath width of between 8 and 128 bits. Additionally, it is possible to unroll several rounds and insert pipeline stages into the design. However, to support different modes of operation like the CBC mode [4], often only one round is realized in hardware and used repeatedly. The width of the datapath determines the main characteristics (i.e. performance, area, power consumption) of an AES implementation. Since the AES is byte-oriented, an 8-bit architecture with a single S-box is the natural choice for applications where small area and low power dissipation are crucial, e.g. smart cards or RFID tags.
At the other end of the spectrum are 128-bit architectures containing 16 S-boxes to compute the SubBytes function of a 128-bit data block in one pass. Due to this massive parallelism, 128-bit architectures can reach high throughput rates at the expense of large silicon area. 32-bit architectures with four S-boxes constitute a good compromise between the two aforementioned extremes; they allow for much higher performance than 8-bit architectures but demand only a fraction of the area of 128-bit implementations.

3 Implementation Strategies for the AES S-Box

All AES architectures sketched in the previous section have a common feature in that the SubBytes transformation occupies a significant portion of the overall silicon area. The size of SubBytes is, in turn, determined by the number of S-boxes and their concrete implementation. Various implementation options for the AES S-box have been investigated in the recent past, which has led to an abundant literature [1, 2, 8, 10, 12, 13, 17, 20].

The SubBytes transformation substitutes all 16 bytes of the State independently using the S-box. Furthermore, the S-box is also used in the AES key expansion. In software, the S-box is typically realized in the form of a look-up table since inversion in the finite field $\mathbb{F}_{2^8}$ cannot be calculated efficiently on general-purpose processors. In hardware, on the other hand, the implementation of the S-box is directed by the desired trade-off between area, delay, and power consumption.

The most obvious implementation approach for the S-box takes the form of hardware look-up tables [11]. However, since encryption and decryption require different tables, and each table contains 2048 bits (256 entries of 8 bits), the overall hardware cost of this approach is relatively high. An implementation option related to standard cells is the use of ROM compilers to produce hardware macros. For the technology that we used, a sufficiently large ROM would require a considerable amount of silicon area.
The critical path delay would be similar to a hardware look-up approach, but the power consumption of generated ROMs is about two to three orders of magnitude higher (see footnote 1). Therefore, we do not consider the implementation of the S-box as ROM in this paper.

More sophisticated approaches calculate the S-box function in hardware using its algebraic properties [4]. The focus of such implementations is the efficient realization of the inversion in $\mathbb{F}_{2^8}$, which can be achieved by decomposing the finite field into the sub-fields $\mathbb{F}_{2^4}$ and $\mathbb{F}_{2^2}$. An inversion in a finite field of characteristic 2 can be carried out in different ways, depending on the basis which is used to represent the field elements [9]. The two most common types of bases for $\mathbb{F}_{2^m}$ are the polynomial basis and the normal basis. A polynomial basis is a basis of the form $\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$ where $\alpha$ is a root of an irreducible polynomial $p(t)$ of degree $m$ with coefficients from $\mathbb{F}_2$. On the other hand, a normal basis can be found by selecting a field element $\beta \in \mathbb{F}_{2^m}$ such that the elements of the set $\{\beta, \beta^2, \beta^4, \ldots, \beta^{2^{m-1}}\}$ are linearly independent.

A third approach for implementing the AES S-box was proposed by Bertoni et al. in [1]. By using an intermediate one-hot encoding of the input, arbitrary logic functions (including cryptographic S-boxes) can be realized with minimal power consumption. The main drawback of this approach is that it results in relatively large silicon area.

4 Implementation Details

All AES S-box implementations analyzed in this paper can perform forward and inverse byte substitution for encryption and decryption, respectively. We implemented the S-boxes either from scratch or obtained the HDL descriptions from the authors of the respective publications. The implementations examined consist solely of combinatorial logic, i.e. no pipeline stages have been inserted since pipelining does not make sense when a feedback mode of operation like OFB or CBC is used [7].
In the following we describe a total of eight different implementations of the AES S-box which can be grouped into three basic categories: look-up implementations, calculating implementations, and low-power implementations. Four of the eight S-box implementations are illustrated in Figure 1.

[Figure 1: Comparison of four S-box implementations (sub16-lut, hw-lut, hybrid-lut, bertoni)]

Footnote 1: Unfortunately, the exact performance figures for ROMs were not accessible for the technology we used.

The simplest design in our comparison is a straight-forward implementation of a hardware look-up table [11]. The synthesizer transforms the behavioral description of the look-up table into a mass of unstructured standard cells. This approach will be denoted as hw-lut. A modification of hw-lut is to use sub-tables in order to minimize switching activity in the look-up tables to reduce power consumption. We examined such solutions with sub-tables of size 16, 32, 64, 128, and 256 bytes, but in this paper we only specify results for size 16 (sub16-lut).

Implementations which calculate the S-box transformation in hardware were first proposed by Wolkerstorfer et al. [20] and Satoh et al. [17]. The former approach decomposes the elements of $\mathbb{F}_{2^8}$ into polynomials over the sub-field $\mathbb{F}_{2^4}$ and performs inversion there. Our implementation of this solution is denoted as wolkerstorfer. Satoh’s solution decomposes the field elements further into polynomials over the sub-field $\mathbb{F}_{2^2}$, where inversion is a trivial swap of the lower and higher bit of the representation. This implementation is referred to as satoh in the following. Both of these approaches represent the field elements by using a polynomial basis.
Canright improved the calculation of the S-box by switching the representation to a normal basis [2]. As in Satoh’s solution, the elements of $\mathbb{F}_{2^8}$ are mapped to a polynomial over the sub-field $\mathbb{F}_{2^2}$. This approach will be denoted as canright.

A compromise between hardware look-up and calculation has also been examined. In this implementation (denoted as hybrid-lut) only the inversion in $\mathbb{F}_{2^8}$ is realized as look-up table. Since the inversion is used for both encryption and decryption, the size of the look-up table is halved in relation to the hw-lut approach. The affine and inverse affine transformations are performed via logic circuits just as in the calculating implementations of wolkerstorfer, satoh, and canright.

The low-power approach of Bertoni et al. [1] uses a decode stage to convert the eight bits of the input byte and the control bit which selects encryption or decryption into a one-hot representation consisting of $2^9 = 512$ bits. The substitution itself is just a rearrangement of these bits and can be done efficiently in hardware by a rewiring of lines as illustrated in Figure 1. Since two of the lines always map to the same 8-bit result (one for encryption and one for decryption), these line pairs can be combined with a logical OR to yield a one-hot decoded representation of the result consisting of 256 bits. A subsequent encoder stage transforms this result back to an 8-bit binary value. Due to this decoder-permute-encoder structure, there is only very little signal activity within the circuit when the input changes, resulting in low power consumption.

Note that the structure of Bertoni’s approach makes it easily possible to introduce pipeline stages. However, it may be necessary to add a large number of additional flip-flops when the pipeline stage is placed between the decoder and encoder, i.e. on the one-hot encoded signal lines.
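The decoder-permute-encoder structure described above can be modelled behaviourally in a few lines of Python. This is a sketch under stated assumptions, not the circuit of Bertoni et al.: it ignores the multi-stage decoder, generates the reference S-box table from the FIPS-197 definition (field polynomial 0x11B, affine constant 0x63), and uses illustrative names throughout.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in F_{2^8} modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


def build_sbox() -> list:
    """Reference table: inversion (as x^254) followed by the affine map."""
    table = []
    for x in range(256):
        inv = 1
        for _ in range(254):
            inv = gf_mul(inv, x)
        y = inv
        for shift in (1, 2, 3, 4):
            y ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        table.append(y ^ 0x63)
    return table


SBOX = build_sbox()
INV_SBOX = [SBOX.index(v) for v in range(256)]

# Fixed wiring of the permutation stage: lines 0..255 carry encryption inputs,
# lines 256..511 carry decryption inputs; each line is wired to its result line.
WIRING = SBOX + INV_SBOX


def decode(sin: int, decrypt: bool) -> list:
    """Decoder: control bit plus input byte -> one-hot vector of 2^9 = 512 lines."""
    lines = [0] * 512
    lines[(int(decrypt) << 8) | sin] = 1
    return lines


def permute_and_merge(lines: list) -> list:
    """Rewire the 512 lines and OR the two lines that share a result line."""
    merged = [0] * 256
    for index, level in enumerate(lines):
        merged[WIRING[index]] |= level
    return merged


def encode(one_hot: list) -> int:
    """Encoder: each output bit is the OR of the lines whose index has that bit set."""
    value = 0
    for index, level in enumerate(one_hot):
        if level:
            value |= index
    return value


def bertoni_sbox(sin: int, decrypt: bool = False) -> int:
    return encode(permute_and_merge(decode(sin, decrypt)))


def toggled_lines(a: int, b: int, decrypt: bool = False) -> int:
    """Number of decoder output lines that change when the input goes from a to b."""
    return sum(x != y for x, y in zip(decode(a, decrypt), decode(b, decrypt)))
```

In this model, any change of the input toggles exactly two of the 512 decoder output lines, regardless of how many input bits flip, which reflects the low signal activity that the approach relies on.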
These flip-flops will increase power consumption considerably and can easily negate the low-power advantages of this solution. For design scenarios where both power consumption and silicon area are of minor importance, Bertoni’s approach can offer the best opportunity for reaching very high clock frequencies.

We tested two implementations of Bertoni’s approach: One implementation uses a decoder with four stages as proposed in the original publication for minimal power consumption (bertoni). The second implementation, denoted as bertoni-2stg, uses a different decoder structure with only two stages in order to reduce the critical path of the circuit.

In the remainder of this paper we will refer to wolkerstorfer, satoh, and canright as calculating implementations. We will denote hw-lut and hybrid-lut as look-up implementations, and sub16-lut, bertoni, and bertoni-2stg as low-power implementations.

5 Design Flow and Evaluation Methodology

In contrast to our previous work [18] where we used a 0.35 µm standard cell library from Austriamicrosystems, all results in this paper were obtained with the VST250 standard cells from Virtual Silicon. These standard cells are built upon the 0.25 µm process technology L250 of UMC, which provides one poly-silicon layer and five metal layers. The nominal supply voltage of the VST250 cell library is 2.5 V.

We implemented the eight S-box designs described in Section 4 in VHDL according to the specifications in the respective papers. In order to ensure a fair comparison and a common interface for all implementations, we provided the input and output of each S-box with 8-bit registers. The integration of the registers made it possible to optimize for area and delay during synthesis. The logic synthesis was done using the Physically Knowledgeable Synthesis (PKS) tool from Cadence. We varied the constraints for the delay time (i.e. maximum clock frequency) from the minimum value to a value where the constraints could just be met.
The delays given in Table 1 are the actual delays of the synthesized circuit. Empty cells in the table indicate that the respective target delay could not be achieved by the synthesizer.

After synthesis, the placement and routing of the standard cells was performed with the Cadence tool First Encounter. We did not include I/O cells into the designs, i.e. we analyzed only the core of the S-boxes consisting of standard cells and the power supply rings. During placement we used an area utilization of 70%. All the figures in Table 1 are results from synthesis excluding the clock tree for the input and output registers. After the routing step we integrated the layouts of the standard cells into the design, which gave us the full layout in GDS2 format.

                                          Target delay (ns)
Design         Result            2.00   3.00   4.00   5.00   6.00   7.00   8.00   9.00
canright       Act. delay (ns)      –      –      –   4.98   5.00   6.55   6.55   6.55
               Area (GE)            –      –      –    496    400    303    303    303
               Power (µA)           –      –      –   1.78   1.78   1.81   1.81   1.81
satoh          Act. delay (ns)      –      –      –      –   5.93   6.55   6.99   6.99
               Area (GE)            –      –      –      –    438    409    385    385
               Power (µA)           –      –      –      –   2.00   1.73   1.51   1.51
wolkerstorfer  Act. delay (ns)      –      –      –   4.93   5.94   6.48   7.51   7.51
               Area (GE)            –      –      –    625    412    415    392    392
               Power (µA)           –      –      –   1.87   1.97   1.75   1.53   1.53
hw-lut         Act. delay (ns)   1.95   2.91   3.90   4.98   5.88   6.61   6.61   6.61
               Area (GE)         1545   1415   1351   1352   1302   1301   1301   1301
               Power (µA)        1.18   0.97   1.00   0.97   0.93   1.00   1.00   1.00
sub16-lut      Act. delay (ns)      –   2.94   3.92   4.46   4.46   4.46   4.46   4.46
               Area (GE)            –   2040   1979   1957   1957   1957   1957   1957
               Power (µA)           –   0.56   0.53   0.55   0.58   0.58   0.58   0.58
hybrid-lut     Act. delay (ns)      –   2.93   3.92   4.86   5.83   6.49   6.49   6.49
               Area (GE)            –   1222    840    810    799    798    798    798
               Power (µA)           –   1.34   1.02   0.98   0.95   0.98   0.98   0.98
bertoni        Act. delay (ns)   1.86   2.90   3.31   3.31   3.31   3.31   3.31   3.31
               Area (GE)         2016   1433   1399   1399   1399   1399   1399   1399
               Power (µA)        0.42   0.30   0.27   0.27   0.27   0.27   0.27   0.27
bertoni-2stg   Act. delay (ns)   1.98   2.79   3.53   3.26   3.26   3.26   3.26   3.26
               Area (GE)         1941   1446   1436   1421   1421   1421   1421   1421
               Power (µA)        0.42   0.32   0.31   0.33   0.33   0.33   0.33   0.33

Table 1: Synthesis results of the eight S-box designs depending on the target delay

We extracted a Spectre netlist from the layout using Assura RCX, where we only considered resistors larger than 1 Ω and capacitors larger than 1 pF. In contrast to our previous work [18], we obtained the power consumption of the different S-box designs through simulation with Synopsys NanoSim. All simulations were performed with BSIM3v3 transistor models characterized for the UMC L250 technology and the built-in NanoSim models for resistors and capacitors. The results of the NanoSim simulations shown in Table 1 represent the mean current consumption of the S-boxes at a supply voltage of 2.5 V. We used a clock frequency of 50 MHz (i.e. new input values are applied to the circuit with a period of 20 ns) and simulated all 256 possible input patterns.

6 Experimental Results

We synthesized all eight S-box implementations mentioned in Section 4 using the design flow described previously.
For each implementation several synthesis runs were carried out, whereby we specified different target values for the maximum critical path delay, ranging from 2 ns to 9 ns. Table 1 summarizes the actual delay, the area of the synthesized design, and the mean power consumption. We omitted the results of all synthesis runs where the timing constraints were not met, i.e. when the actual delay was higher than the target delay.

[Figure 2: Area vs. critical path delay. x-axis: target value for critical path delay (ns); y-axis: area (GE); one curve per S-box design]

Figure 2 shows the area of the eight S-box designs when synthesized for a specific critical path delay. The area is given in gate equivalents (GE), calculated as total area divided by the size of a 2-input NAND with the lowest drive strength, which is the NAND20 cell of the library we used. Amongst the three calculating implementations (at the bottom of the figure), canright is clearly the best. It has the smallest size of all eight S-boxes, but suffers from a longer critical path than the hardware look-up implementations and the low-power solutions. The calculating implementations are smaller than the other two approaches because they make use of the algebraic structure of the S-box to implement the substitution. On the other hand, this structure has a relatively long critical path. The shortest critical path can be achieved with bertoni, but its size is about three times that of canright.

Look-up implementations ignore the algebraic structure of the S-box and just aim at a straightforward realization of the Boolean equations given by the input-output relation. Hence, the synthesizer has a much higher degree of freedom for optimizing the circuit, which allows for a shorter critical path at the expense of silicon area.
The low-power implementations also ignore the algebraic properties of the substitution and simply implement the Boolean equations of the input-output relation. However, they use a specific structure (decode-permute-encode) to reduce signal activity. Although the critical path is about as short as for look-up implementations, the one-hot encoding requires more silicon area than the look-up implementations. The sub16-lut approach also has a significant area overhead introduced by the address decoding of the sub-tables, which makes it the most costly solution in terms of silicon area. Moreover, the address decode logic causes a longer critical path. As expected, the compromise between hardware look-up and calculation (hybrid-lut) lies somewhere between hw-lut and the calculating implementations with regard to both critical path delay and area.

[Figure 3: Total power consumption vs. critical path delay. x-axis: target value for critical path delay (ns); y-axis: total power (normalized); one curve per S-box design]

Figure 3 shows the total power consumption plotted against the critical path delay. All power values are normalized with respect to the power consumption of hw-lut for a delay of 5.0 ns. The low-power S-boxes based on the approach of Bertoni (bertoni, bertoni-2stg) are the clear winners of this comparison. The original implementation bertoni shows the best overall results among all eight examined designs, closely followed by the modified version bertoni-2stg. Bertoni’s approach is solely directed towards low power consumption with a minimal level of signal activity in the circuit. The sub16-lut approach, on the other hand, tries to improve a straightforward look-up table implementation (hw-lut) with low-power measures. However, sub16-lut requires almost twice as much power as bertoni, while hw-lut consumes about three times more power.
The hybrid-lut approach requires roughly the same amount of power as hw-lut. The power consumption of the calculating implementations is much higher than that of the low-power and look-up versions. The algebraic evaluation of the S-box function in calculating implementations causes a large number of internal nodes to transition even if only a few input bits toggle. This behavior entails high signal activity and, in turn, high power consumption. In look-up implementations a change of a few input bits affects the evaluation of all output bits separately. As normally some output bits will remain unchanged, the signal activity within this particular path is low, which limits the overall power consumption. The implementation of canright consumes almost twice as much power as hw-lut, and roughly an order of magnitude more power than bertoni. The other two calculating implementations, wolkerstorfer and satoh, have power characteristics similar to those of canright.

[Figure 4: Power-area product vs. critical path delay. x-axis: target value for critical path delay (ns); y-axis: power × area (normalized); one curve per S-box design]

Figure 4 shows the results of the eight S-box implementations in terms of the power-area product. This metric is particularly relevant for applications with a need for both small silicon area and low power consumption, e.g. cryptographically enhanced RFID tags or sensor nodes. Due to their large area requirements, hw-lut and sub16-lut have the worst power-area product among all eight examined implementations. The calculating S-boxes also show a relatively poor power-area product, which is mainly caused by the high power consumption of the S-box evaluation. All three calculating implementations have similar characteristics for relaxed critical path conditions.
Both satoh and wolkerstorfer also have similar properties for more stringent constraints on the critical path, whereas canright becomes more and more advantageous for faster designs. The hybrid-lut implementation is even slightly better than canright when synthesized for a delay of 5 ns. However, hybrid-lut becomes very unattractive if the critical path delay needs to be smaller. The low-power approach of bertoni achieves the best overall power-area product, closely followed by bertoni-2stg.

The power-area products shown in Figure 4 differ from those in [18] because we used a different standard cell library and a different approach for evaluating the power consumption. According to our results, the calculating implementations are more attractive than the look-up implementations, and sub16-lut is the best look-up implementation for short critical paths. The low-power designs achieve the best results for the power-area product in our study as well as in [18]. However, while our study found slight advantages for bertoni, the results in [18] show bertoni-2stg as the winner.

Figure 5: Total power consumption vs. area. [Plot: total power (normalized) against area (GE) for satoh, wolkerstorfer, canright, hw-lut, hybrid-lut, sub16-lut, bertoni, and bertoni-2stg; an arrow marks the direction of decreasing critical path delay.]

Figure 5 illustrates the power consumption in relation to the required silicon area. In general, the points further away from the point of origin represent synthesis results for shorter critical path delays. The figure shows that calculating implementations tend to sacrifice power efficiency to achieve higher speed. On the other hand, the low-power implementations trade silicon area for a shorter critical path. The sub16-lut implementation shows similar behavior. The look-up implementations hw-lut and hybrid-lut sacrifice area as well as power efficiency to roughly the same degree.
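The derived metrics used in Figures 3 to 5 can be sketched in a few lines of Python. This is an illustration only: the class SynthesisPoint, the function names, and all numeric values are hypothetical placeholders and not measurement results from this study. The sketch assumes that the power-area product is the normalized power multiplied by the area in gate equivalents (GE), with power normalized to a single reference point in the way Figure 3 uses hw-lut at 5.0 ns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthesisPoint:
    """One synthesis result of one design for one target delay."""
    name: str
    delay_ns: float  # target value for the critical path delay
    area_ge: float   # silicon area in gate equivalents
    power: float     # absolute power in an arbitrary but consistent unit


def normalized_power(point: SynthesisPoint, reference: SynthesisPoint) -> float:
    """Power relative to the chosen reference point."""
    return point.power / reference.power


def power_area_product(point: SynthesisPoint, reference: SynthesisPoint) -> float:
    """Assumed metric: normalized power times area in GE."""
    return normalized_power(point, reference) * point.area_ge


def rank_by_power_area(points, reference):
    """Sort designs from best (smallest) to worst power-area product."""
    return sorted(points, key=lambda p: power_area_product(p, reference))


# Placeholder values chosen for easy arithmetic, NOT taken from the paper.
REFERENCE = SynthesisPoint("ref", 5.0, 1600.0, 100.0)
POINTS = [
    REFERENCE,
    SynthesisPoint("design_a", 5.0, 400.0, 200.0),   # small but power hungry
    SynthesisPoint("design_b", 5.0, 1200.0, 25.0),   # larger but low power
]

if __name__ == "__main__":
    for p in rank_by_power_area(POINTS, REFERENCE):
        print(p.name, normalized_power(p, REFERENCE),
              power_area_product(p, REFERENCE))
```

With these placeholder numbers, design_b ranks first despite its larger area, which mirrors the kind of trade-off the power-area product is meant to expose: a design with a small area can still lose if its power consumption is high enough.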
In order to minimize the critical path delay, the synthesizer applies a number of optimization techniques, such as using standard cells with higher drive strengths or duplicating logic paths, which cause considerable power consumption in circuits with high switching activity. Calculating S-box implementations have an inherently high number of signal switches and, therefore, incur an over-proportional increase in power consumption when the critical path delay is reduced. Low-power implementations, on the other hand, are characterized by little signal activity and, therefore, show only a moderate increase in power consumption for shorter critical paths.

The silicon area and critical path delay figures reported in [18] (which are based on a 0.35 µm standard-cell library) correspond quite well to the current ones obtained with the UMC 0.25 µm technology. Regarding power consumption, we notice that the current figures indicate a less dramatic difference among the examined S-box implementations than those given in [18]. We attribute this discrepancy to the different standard cell libraries and the different power evaluation methods. While the results in [18] were obtained via estimations from the synthesis tool, our current figures result from a much more accurate simulation of the placed and routed designs using NanoSim. This, of course, has also led to slight differences in all other metrics which include the power consumption results.

7 Conclusions

In this paper we examined eight AES S-box implementations which follow three different design strategies. We analyzed and compared various cost metrics like critical path delay, silicon area, and power consumption of these implementations based on synthesis runs with a 0.25 µm CMOS standard cell library. Our simulation results clearly show that the characteristics of the eight S-box implementations differ significantly.
For example, the power consumption of the different S-boxes varies by almost an order of magnitude, which underpins the importance of selecting the proper S-box with respect to the requirements of the target application. We found that Canright’s S-box design is the best choice for applications where small silicon area is the main criterion (e.g. RFID tags). Bertoni’s S-box is very well suited for applications with a demand for low power or energy consumption, e.g. wireless sensor nodes. In addition, the Bertoni S-box also has the shortest critical path, followed by the look-up implementations. While the results for the calculating implementations only apply to the AES S-box, the insights from the other two implementation strategies (look-up except hybrid-lut and low-power) are also useful for other cryptographic S-boxes.

Acknowledgements

The authors would like to thank Johannes Wolkerstorfer and David Canright for providing the HDL source code of several AES S-box implementations. The research described in this paper has been supported by the Austrian Science Fund (FWF) under grant P16952–N04, the FIT-IT initiative of the Austrian Federal Ministry of Transport, Innovation, and Technology (project SNAP), and the EPSRC under grant EP/E001556/1. The research described in this paper has also been supported, in part, by the European Commission through the IST Programme under contract IST-2002-507932 ECRYPT. The information in this document reflects only the authors’ views, is provided as is, and no guarantee or warranty is given that the information is fit for any particular purpose. The user thereof uses the information at its sole risk and liability.

References

[1] G. Bertoni, M. Macchetti, L. Negri, and P. Fragneto. Power-efficient ASIC synthesis of cryptographic Sboxes. In Proceedings of the 14th ACM Great Lakes Symposium on VLSI (GLSVLSI 2004), pp. 277–281. ACM Press, 2004.
[2] D. Canright. A very compact S-Box for AES. In Cryptographic Hardware and Embedded Systems — CHES 2005, vol. 3659 of Lecture Notes in Computer Science, pp. 441–455. Springer Verlag, 2005.
[3] P. Chodowiec and K. Gaj. Very compact FPGA implementation of the AES algorithm. In Cryptographic Hardware and Embedded Systems — CHES 2003, vol. 2779 of Lecture Notes in Computer Science, pp. 319–333. Springer Verlag, 2003.
[4] J. Daemen and V. Rijmen. The Design of Rijndael: AES – The Advanced Encryption Standard. Springer Verlag, 2002.
[5] M. Feldhofer, K. Lemke, E. Oswald, F.-X. Standaert, T. Wollinger, and J. Wolkerstorfer. State of the Art in Hardware Architectures. ECRYPT deliverable D.VAM.2, available for download at http://www.ecrypt.eu.org/documents/D.VAM.2-1.0.pdf, Sept. 2005.
[6] M. Feldhofer, J. Wolkerstorfer, and V. Rijmen. AES implementation on a grain of sand. IEE Proceedings Information Security, 152(1):13–20, Oct. 2005.
[7] A. Hodjat, D. D. Hwang, B.-C. Lai, K. Tiri, and I. M. Verbauwhede. A 3.84 Gbits/s AES crypto coprocessor with modes of operation in a 0.18-µm CMOS technology. In Proceedings of the 15th ACM Great Lakes Symposium on VLSI (GLSVLSI 2005), pp. 351–356. ACM Press, 2005.
[8] H. Li. A parallel S-box architecture for AES byte substitution. In Proceedings of the 2nd International Conference on Communications, Circuits and Systems (ICCCAS 2004), vol. 1, pp. 1–3. IEEE, 2004.
[9] R. Lidl and H. Niederreiter. Finite Fields, vol. 20 of Encyclopedia of Mathematics and Its Applications. Cambridge University Press, 1996.
[10] M. Macchetti and G. Bertoni. Hardware implementation of the Rijndael SBOX: A case study. ST Journal of System Research, 0(0):84–91, July 2003.
[11] M. McLoone and J. V. McCanny. High performance single-chip FPGA Rijndael algorithm implementations. In Cryptographic Hardware and Embedded Systems — CHES 2001, vol. 2162 of Lecture Notes in Computer Science, pp. 65–76. Springer Verlag, 2001.
[12] N. Mentens, L. Batina, B. Preneel, and I. M. Verbauwhede. Systematic evaluation of compact hardware implementations for the Rijndael S-box. In Topics in Cryptology — CT-RSA 2005, vol. 3376 of Lecture Notes in Computer Science, pp. 323–333. Springer Verlag, 2005.
[13] S. Morioka and A. Satoh. An optimized S-Box circuit architecture for low power AES design. In Cryptographic Hardware and Embedded Systems — CHES 2002, vol. 2523 of Lecture Notes in Computer Science, pp. 172–186. Springer Verlag, 2002.
[14] National Institute of Standards and Technology (NIST). Data Encryption Standard (DES). Federal Information Processing Standards (FIPS) Publication 46-3, Oct. 1999.
[15] National Institute of Standards and Technology (NIST). Advanced Encryption Standard (AES). Federal Information Processing Standards (FIPS) Publication 197, Nov. 2001.
[16] N. Pramstaller and J. Wolkerstorfer. A universal and efficient AES co-processor for field programmable logic arrays. In Field Programmable Logic and Application — FPL 2004, vol. 3203 of Lecture Notes in Computer Science, pp. 565–574. Springer Verlag, 2004.
[17] A. Satoh, S. Morioka, K. Takano, and S. Munetoh. A compact Rijndael hardware architecture with S-Box optimization. In Advances in Cryptology — ASIACRYPT 2001, vol. 2248 of Lecture Notes in Computer Science, pp. 239–254. Springer Verlag, 2001.
[18] S. Tillich, M. Feldhofer, and J. Großschädl. Area, delay, and power characteristics of standard-cell implementations of the AES S-box. In Embedded Computer Systems: Architectures, Modeling, and Simulation — SAMOS 2006, vol. 4017 of Lecture Notes in Computer Science, pp. 457–466. Springer Verlag, 2006.
[19] S. Tillich and J. Großschädl. Instruction set extensions for efficient AES implementation on 32-bit processors. In Cryptographic Hardware and Embedded Systems — CHES 2006, vol. 4249 of Lecture Notes in Computer Science, pp. 270–284. Springer Verlag, 2006.
[20] J. Wolkerstorfer, E. Oswald, and M. Lamberger. An ASIC implementation of the AES SBoxes. In Topics in Cryptology — CT-RSA 2002, vol. 2271 of Lecture Notes in Computer Science, pp. 67–78. Springer Verlag, 2002.
[21] X. Zhang and K. K. Parhi. High-speed VLSI architectures for the AES algorithm. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 12(9):957–967, Sept. 2004.