A Masked AES ASIC Implementation ∗ Norbert Pramstaller1 , Elisabeth Oswald1 , Stefan Mangard1 , Frank K. Gürkaynak2 , and Simon Häne2 1 Institute for Applied Information Processing and Communications (IAIK), Graz University of Technology, Inffeldgasse 16a, 8010 Graz, Austria {Norbert.Pramstaller, Elisabeth.Oswald, Stefan.Mangard}@iaik.at 2 Integrated Systems Laboratory (IIS) ETH Zentrum, Swiss Federal Institute of Technology Zurich, Gloriastrasse 35, 8006 Zurich, Switzerland {kgf, haene}@iis.ee.ethz.ch Abstract Introduced in 1999, differential power-analysis (DPA) attacks pose a serious threat for cryptographic devices. Several countermeasures have been proposed during the last years. However, none of them leads to implementations that are provably resistant against DPA. A promising class of DPA countermeasures is masking. In this article we discuss implementations of three existing masking schemes for the Advanced Encryption Standard (AES). We present an ASIC that has been implemented and manufactured. This test chip is used to investigate the countermeasures in practice. With this test chip we have also determined the costs of masking in terms of area and execution time. Compared to an unmasked AES implementation the best masking scheme shows a performance loss about 40-50%. To the best of the authors knowledge it is the first ASIC that implements masking for AES. 1. Introduction The Advanced Encryption Standard (AES) [1] will be the most widely used symmetric block cipher in the coming years. AES seems to be resistant to state-of-the-art mathematical cryptanalysis. However, unprotected implementations of it are vulnerable to differential power-analysis (DPA) attacks. DPA attacks are successful if the power consumption of a cryptographic device depends on the processed data and the cipher key. DPA attacks can be mounted efficiently in practice [2]. Several DPA countermeasures, such as masking the intermediate values, have been proposed so far. The masking of intermediate values is achieved by adding a random mask to the input of the algorithm. In this way, all intermediate values are randomized automatically. At the end of the algorithm, the mask is removed to get the correct result. ∗ The work described in this paper has been supported in part by the FP6 sponsored research project SCARD (Side-Channel Analysis Resistant Design Flow), project number IST-2002-507270, and by the FWF research project “Investigations of Simple and Differential Power Analysis”, project number P16110-N04. The development and the implementation of this last approach are the main topics of this article. This article is structured as follows: Section 2 briefly sketches the Advanced Encryption Standard and DPA attacks. In Section 3 we briefly survey three existing masking schemes for AES. After the comparison of the masking schemes, we discuss the ASIC implementation in Section 4. Finally, we draw conclusions in Section 5. 2. AES and DPA The AES algorithm is a symmetric block cipher that operates on 128-bit data blocks. AES uses a cipher key to encrypt a 128-bit data block. The length of the cipher key can be 128 bits, 192 bits, or 256 bits—the standard defines these three key lengths to adapt the algorithm to short-, medium-, and long-term security requirements. Like most symmetric ciphers, AES encrypts an input data-block by applying the same round function iteratively. The round function alters the input data-block, which is called State, by applying non-linear and linear functions. The linear functions are AddRoundKey, ShiftRows, and MixColumns. The only non-linear transformation is SubBytes. SubBytes is composed of a byte inversion in the finite field GF(28 ) followed by an affine transformation. DPA can be applied to any unprotected implementation of any algorithm. It works because an attacker can predict an intermediate value, which is dependent on a small part of the secret key that occurs during the execution of the algorithm. As the power consumption of an implementation strongly depends on processed data, there is a correlation between the power consumption and the, for example, Hamming weight of the intermediate values. Hence, an attacker can verify the correctness of his prediction (which is based on a small part of the secret key) by calculating this correlation. Consequently, a way to counteract DPA is to randomize the intermediate values that occur during the execution of the algorithm by adding another secret value, which is called mask. The mask is changed for every execution of the algorithm. In this way, the attacker is unable to predict intermediate values. A +X 3. Masking AES In a masked AES, a random mask is added to the input data of the algorithm prior to encryption. At the end of the encryption, the mask is removed to get the correct result. In order to remove the mask, we need to keep track how the mask is modified by the algorithm. Figure 1 shows this basic principle. Y XY Y AY a-1 Plaintext X AY +XY X Random Mask Y A-1Y-1 Y-1 XY-1 add random mask a-1 A-1Y-1+ XY-1 Y Cipher Key Masked Algorithm Mask Modification Legend: a-1 GF(28) multiplication GF(28) addition GF(28) inversion A-1 + X remove random mask Ciphertext Figure 1. Basic masking principle Masking the linear AES functions is easy. Because these functions are linear, applying them on a masked value A + X, gives the same result as applying them first on the data A and then on the mask X: f (A + X) = f (A) + f (X). The SubBytes transformation is composed of a multiplicative inversion in GF(28 ) and an affine transformation. Masking the non-linear byte inversion is tricky because Inv(A + X) 6= Inv(A) + Inv(X). Without modification, the result of the byte inversion is (A + X)−1 and thus it is not possible to remove the mask at the end of the algorithm easily. Therefore, a modified byte inversion is required such that the result of the inversion equals A−1 + X. Three approaches for a modified byte inversion [6, 7, 12] will be discussed in the remainder of this section. Multiplicative masking, which was presented by Akkar et al. [6], is based on the idea that prior to the byte inversion the additive mask X is replaced by the multiplicative mask Y. After the byte inversion, the multiplicative mask is replaced by the additive mask again. We refer to this scheme, which is depicted in Figure 2, as SubBytesAkkar throughout the remainder of this article. As it can be seen in Figure 2, SubBytesAkkar requires four multipliers, two inversions and two additions in GF(28 ). Therefore, it is already clear that masking leads to a noticeable increase in terms of area. The second approach has been presented by Trichina et al. [7]. The so-called simplified adaptive multiplicative masking is based on the same scheme as [6] except that the additive mask X is reused as multiplicative mask Y. This scheme requires less operations than [6]. We refer to this masking scheme as SubBytesTrichina. Both approaches have a severe drawback: they are vulnerable to so-called zero-value attacks [8]. This attack is Figure 2. Modified byte inversion SubBytesAkkar based on the fact that multiplicative masks do not hide zero input-values: if A = 0 then AY = 0 ∀Y . Recently it has been shown that Trichina’s method is even vulnerable to ordinary DPA [14]. A third approach has been presented by Oswald et al. [12]. In contrast to [6, 7], the additive mask is never removed during the computation of the algorithm. This approach renders zero-value attacks impossible. In order to get the desired result of the inversion, which is A−1 + X, so-called correction terms are computed in parallel to the inversion. The new masking scheme, referred to as SubBytesIAIK, uses composite field arithmetic for the implementation of the AES SubBytes operation [9]. The detailed structure of SubBytesIAIK is schematically depicted in Figure 3. 3.1. Comparison of Masking Approaches For a comparison of the proposed masking schemes, the area-time (AT) products have been determined and are shown in Figure 4. To have a direct comparison with unmasked implementations, a LUT-based SubBytes implementation (referred to as SubBytesLUT) and a compositefield based SubBytes implementation [9] (referred to as SubBytes) have been implemented as well. For the implementation we have used a 0.25µ m CMOS technology. As it can be seen in Figure 4, the masking approach SubBytesAkkar requires most hardware resources, SubBytesTrichina the fewest. SubBytesIAIK lies in between but it is the only one which is secure. Compared to the unmasked SubBytes implementations it is clear that masking leads to a considerable performance loss. A point which we have omitted so far is that we need random numbers as masks: SubBytesAkkar requires an additive and a multiplicative mask, SubBytesTrichina uses an addi- A+X GF(256) to GF(16) 4 4 4 amh aml mh ml a2 dm3 c1 c2 GF (16) to GF(4) 2 a2 c31 fresh_mask c4 c5 i6 mh ml a2 dm1a dm2 dm3 c1 c2 i4 c31 c4 c5 c3 i6 i0 i5 i2 a2 c dm1 i7 i3 i4 i7 i1 i8 i5 i2 i8 mh i9 = dm i9 = dm GF(4) x GF(4) masked inversion ml c31 mh ml fresh_mask(1..0) a2 fresh_mask i3 i1 2 aml GF4 inversion dm_inv1 dm_inv_mh dm_inv1 dm_inv fresh_mask dm4 c7 i10 i13 inv_amh i12 dm5 c8 i16 i17 i21 i18 i22 fresh_mask dm4 c7 i23 inv_aml 4 4 GF(16) to GF(256) 8 A-1+X inv_amh ml c5 mh aml dm_inv_mh dm_inv_ml fresh_mask i11 i12 dm5 c8 i16 i17 i21 i18 i22 i20 i15 i10 i13 dm_inv_mh(0) i19 i14 inv_amh c1 fresh_mask mh amh dm_inv_ml fresh_mask c5 c2 mh ml aml dm_inv_mh fresh_mask i11 dm_inv_mh(1) dm_inv_ml c5 mh c1 fresh_mask mh amh dm_inv_ml dm_inv_mh dm_inv2 dm_inv dm_inv(0) fresh_mask c5 i0 2 amh c c3 dm1 2 a2 c GF(16) to GF(4) 2 c2 dm2 4 mh dm1a a2 c X 4 4 a2 A+X 8 GF(256) to GF(16) 4 fresh_mask X 8 mh(0) c31(0) fresh_mask i19 i20 i14 i23 i15 inv_amh inv_aml 2 2 GF(4) to GF(16) 4 A-1+X Figure 3. Gate-level structure of SubBytesIAIK. Inversion in GF(16) × GF(16) (left) and in GF(4) × GF(4) (right) Comparison of AT products for different SubBytes implementations (8bit) 55k 50k 45k 40k area [µm2] 35k 30k 25k 20k 15k 10k 5k 0 1 2 3 4 5 6 7 8 9 critical path [ns] 10 11 12 13 14 15 Figure 4. AT products of different masking schemes tive mask1 , and SubBytesIAIK uses additive masks. The number of random bits which are required for the masks clearly influences the performance as well. SubBytesTrichina requires 20 random bits per data bit, SubBytesAkkar 2 and SubBytesIAIK requires 1 random bit per data bit. In other words, for the encryption of one 128-bit block, SubBytesTrichina needs 10 128-bit random masks. Assuming that is is possible to generate a 128-bit random mask in one clock cycle, the throughput of SubBytesTrichina is reduced further by a factor of 10 only by the mask generation. 4. ASIC Implementation ARES We have designed and manufactured an ASIC in order to evaluate the different masking schemes in practice and to determine the costs for masking in terms of area and execution time. This test chip, which we called ARES, features several masked and unmasked AES implementations. In particular, it has a 32-bit AES (referred to as MAES32) with three different SubBytes implementations: one unmasked SubBytes (SubBytes) and two masked SubBytes (SubBytesAkkar and SubBytesIAIK). For comparisons with high throughput AES implementations, ARES also features a throughput optimized 128-bit masked AES version (referred to as MAES-128) with SubBytesIAIK only. Additionally, the test chip contains a noise generator that is based on a linear feedback shift register (LFSR). The noise engine is used to investigate the impact of noise on the DPA resistance in practice. Figure 5 shows the three main components of ARES. 4.1. MAES-32 For the implementation of MAES-32 we paid special attention on a sequential execution order. More precisely, all AES transformations are performed one after the other. 1 As described in [11], the mask values for SubBytesTrichina have some restrictions because the additive mask is reused as multiplicative mask. Figure 5. Floorplan of ARES With this design, we expect that it will be easier to perform DPA attacks with a low number of measurements. This is important for us since we intend to attack the unmasked SubBytes implementation. 4.2. MAES-128 MAES-128 is a throughput-optimized encryption engine that is used for direct comparisons with other highthroughput implementations such as Fastcore [10]. Additionally, it can be used as noise generator. 4.3. Implementation Issues of Masking When implementing SubBytesIAIK, designers must pay special attention at the gate structure of the masking scheme. As can be seen in Figure 3, a so-called fresh mask must be added to some intermediate values. This fresh-mask is fundamental for the security of the masking scheme. However, it is logically redundant, e.g. (A ⊕ B) ⊕ (C ⊕ A) = B ⊕ C. Therefore, during all steps of the design flow, designers must ensure that this fresh mask is never removed during an optimization step. The intricate computation structure of the masking schemes causes additional problems for the designer. Typically, designers try to balance the combinational paths. Unbalanced paths have the drawback that the clock period can not be exploited entirely. Long combinatorial paths also increase the power consumption due to more glitches. The combinatorial delay of the masked implementations are very long compared to the unmasked implementations (see Figure 4). Therefore, it seems natural to pipeline these structures. However, pipelining SubBytesIAIK is not straightforward. Due to the complicated structure a considerable amount of pipeline registers is required. 4.4. Noise Engine The noise engine is based on an LFSR with 65 bits. This LFSR generates pseudo random numbers which are used as input to a network of buffers. The network of buffers has been designed in such a way that the noise engine consumes about the same amount of power as the MAES-32, if the noise engine is clocked with half of its maximum frequency. The noise engine has a separate clock signal with which the amount of noise can be set. The higher the clock frequency is, the more switching activity occurs in the network of buffers and hence the more power is consumed. Due to the fact that the amount of noise can be controlled via the clock frequency, a detailed analysis of the noise impact on the DPA-resistance of the chip can be performed. 5. Conclusions and Further Work We have presented the first ASIC featuring masked AES implementations in this article. Our ASIC consists of several masked and unmasked implementations, whose security and performance characteristics have been discussed in this article as well. From our research we have concluded that DPA resistant masking schemes for AES can be defined. However, their practical implementation shows that there is still room for improvement with respect to their implementation cost. In the near future we will continue our ongoing research with the ARES chip. So far, the functional tests and scan tests of all 10 packaged samples have been successful. We intend to acquire power measurements, and perform DPA attacks on the different AES implementations. References [1] National Institute of Standards and Technology (NIST). FIPS-197: Advanced Encryption Standard, November 2001. Available online at http://www.itl.nist.gov/ fipspubs/. [2] Paul C. Kocher, Joshua Jaffe, and Benjamin Jun. Differential Power Analysis. In Michael Wiener, editor, Advances in Cryptology - CRYPTO ’99, 19th Annual International Cryptology Conference, Santa Barbara, California, USA, August 15-19, 1999, Proceedings, volume 1666 of Lecture Notes in Computer Science, pages 388–397. Springer, 1999. [3] Stefan Mangard. Hardware Countermeasures against DPA– A Statistical Analysis of Their Effectiveness. In Tatsuaki Okamoto, editor, Topics in Cryptology - CT-RSA 2004, The Cryptographers’ Track at the RSA Conference 2004, San Francisco, CA, USA, February 23-27, 2004, Proceedings, volume 2964 of Lecture Notes in Computer Science, pages 222–235. Springer, 2004. [4] Kris Tiri, Moonmoon Akmal, and Ingrid Verbauwhede. A Dynamic and Differential CMOS Logic with Signal Independent Power Consumption to Withstand Differential Power Analysis on Smart Cards. In 28th European SolidState Circuits Conference - ESSCIRC 2002, Firenze, Italy, September 24-26,2002, Proceedings, pages 403–406, 2002. [5] Kris Tiri and Ingrid Verbauwhede. Securing Encryption Algorithms against DPA at the Logic Level: Next Generation Smart Card Technology. In Colin D. Walter, Çetin Kaya Koç, and Christof Paar, editors, Cryptographic Hardware and Embedded Systems - CHES 2003, 5th International Workshop, Cologne, Germany, September 8-10, 2003, Proceedings, volume 2779 of Lecture Notes in Computer Science, pages 125–136. Springer, 2003. [6] Mehdi-Laurent Akkar and Christophe Giraud. An Implementation of DES and AES, Secure against Some Attacks. In Çetin Kaya Koç, David Naccache, and Christof Paar, editors, Cryptographic Hardware and Embedded Systems CHES 2001, Third International Workshop, Paris, France, May 14-16, 2001, Proceedings, volume 2162 of Lecture Notes in Computer Science, pages 309–318. Springer, 2001. [7] Elena Trichina, Domenico De Seta, and Lucia Germani. Simplified Adaptive Multiplicative Masking for AES. In Burton S. Kaliski Jr., Çetin Kaya Koç, and Christof Paar, editors, Cryptographic Hardware and Embedded Systems CHES 2002, 4th International Workshop, Redwood Shores, CA, USA, August 13-15, 2002, Revised Papers, volume 2523 of Lecture Notes in Computer Science, pages 187–197. Springer, 2003. [8] Jovan D. Golic and Christophe Tymen. Multiplicative Masking and Power Analysis of AES. In Burton S. Kaliski Jr., Çetin Kaya Koç, and Christof Paar, editors, Cryptographic Hardware and Embedded Systems - CHES 2002, 4th International Workshop, Redwood Shores, CA, USA, August 1315, 2002, Revised Papers, volume 2523 of Lecture Notes in Computer Science, pages 198–212. Springer, 2003. [9] Johannes Wolkerstorfer, Elisabeth Oswald, and Mario Lamberger. An ASIC Implementation of the AES Sboxes. In Bart Preneel, editor, Topics in Cryptology - CT-RSA 2002, The Cryptographer’s Track at the RSA Conference, 2002, San Jose, CA, USA, February 18-22, 2002, Proceedings, volume 2271 of Lecture Notes in Computer Science, pages 67–78. Springer, 2002. [10] Frank K. Gürkaynak, Andreas Burg, Norbert Felber, Wolfgang Fichtner, D. Gasser, F. Hug, and Hubert Kaeslin. A 2GB/s balanced AES crypto-chip implementation. In Proceedings of the Great Lakes Symposium on VLSI, Boston USA, 2004. [11] Norbert Pramstaller. An AES ASIC-Implementation Resistant to Differential Power Analysis. Master’s thesis, Institute for Applied Information Processing and Communications (IAIK), Graz University of Technology, June 2004. [12] Elisabeth Oswald, Stefan Mangard, and Norbert Pramstaller. Secure and Efficient Masking of AES - A Mission Impossible? Cryptology ePrint Archive (http://eprint.iacr. org/), Report 2004/134, 2004. [13] Norbert Pramstaller, Frank K. Gürkaynak, Simon Haene, Hubert Kaeslin, Norbert Felber, and Wolfgang Fichtner. Towards an AES Crypto-chip Resistant to Differential Power Analysis. In Proccedings 30th European Solid-State Circuits Conference - ESSCIRC 2004, Leuven, Belgium, Proceedings - to appear, 2004. [14] M.-L. Akkar, R. Bevan, and L. Goubin. Two Power Analysis Attacks against One-Mask Methods. In FSE 2004 – Preproceedings, pages 308–325, 2004.