Hardware and Software Implementation of RIJNDAEL AES Algorithm
EE-540 Project Report

Vipul Joshi
Ali Bassam
Sanjit Kurup
Nasrin Ahmed
Shafaat Qureshi

Table of Contents
1. Introduction
   1.1 Project perspective
2. Design
   2.1 Spiral Design Model
   2.2 Design Flow
   2.3 Project Modeling
3. The Advanced Encryption Algorithm
   3.1 Introduction
   3.2 Overview of Encryption
   3.3 Overview of Decryption
4. Module
   4.1 Software module
   4.2 Intel IXP2850
   4.3 Intel IXP2850 Cryptographic Unit
   4.4 Intel SDK 3.0
   4.5 Implementing AES on IXP 2850
5. Hardware Implementation of AES
   5.1 Design process
   5.2 Interface Design
   5.3 Encryption block
   5.4 Decryption block
6. Synthesis of the AES Chip
7. Conclusion
8. Reference

1. Introduction:
The Rijndael block cipher algorithm was chosen by NIST as the new Advanced Encryption Standard (AES). Because DES is no longer regarded as a standard, industry is expected to move quickly to AES for the cryptographic functions in its products. No practical attack on the full cipher has been published so far, but this strength comes at a cost in performance. Various hardware implementations of AES exist, each with its own pros and cons, and a lot of work is still being done in the area to improve on them.
1.1 Product Perspective
The growth of the network processor industry over the last two years reflects the promise of these devices, whose primary purpose is packet processing at speeds up to OC-192. Because they give much better packet-processing performance than general-purpose processors, network processor units are now an integral part of the gigabit Internet. The Advanced Encryption Standard (AES), the latest encryption standard approved by NIST, is fast becoming the default choice for encryption in networked applications. A hardware implementation of the algorithm gives better performance, but it offers less flexibility and is more difficult and time consuming to build than a software implementation. With the security block of the IXP 2850, Intel has implemented encryption as an ASIC chip on the same board as the processor. Keeping application specifics out of the block preserves flexibility across applications; that flexibility comes from an accompanying programming model that runs on the microengines.

Our objective is to implement the Advanced Encryption Standard in hardware on an FPGA chip and, in parallel, to develop a software implementation using the SDK 3.0 for Intel's IXP 2400/2800. The Intel SDK 3.0 provides an API that gives the programmer a reasonable degree of control. Comparing the two approaches using the same flow charts, block lengths, key lengths and data will show what the overheads of the hardware approach are and whether they are worth it. It will also test the performance of Intel's programming model.

2. Design
2.1 Spiral Design Model
The spiral model was adopted for our design, so that the design was continually updated as more requirements were met. System feasibility was checked first, and the model was updated as the specifications were defined. Once feasibility and specifications had been established, a prototype was developed. This led to the making of an initial system.
The initial system then developed into an enhanced system as more requirements and specifications were met.

Figure 1: Spiral Model

The spiral model (figure 1) successively refines the system as we move from initial basic systems to full-scale systems. This is advantageous because the overall system is updated incrementally and the design may never have to be restarted. In contrast, a waterfall design may have to be redone as new features spring up. One disadvantage of the spiral model is that it takes more time. This is a trade-off, however: if a system design fails, it has to be redone, which may end up consuming even more time.

2.2 Design Flow
The entire project can be broken down into two major modules, i.e. the hardware module and the software module.

[Flow chart: Requirements, Specification, Architecture, Software design and Hardware design, Integration, Testing]
Figure 2: Design flow

The requirements and specification were considered first, followed by the architecture design. There were separate designs for the hardware and software modules. The next step involved integrating the modules, followed finally by testing.

2.3 Project Modeling
Figure 3: Project Modeling

3. The Advanced Encryption Algorithm
3.1 Introduction
In 1997 the National Institute of Standards and Technology (NIST) of the United States put out a call for proposals for a new symmetric algorithm, to be called the Advanced Encryption Standard (AES). DES had been the standard symmetric algorithm since 1976, and a replacement was necessary. In 1998, 15 candidate algorithms were accepted, and after one year of research 5 of these candidates were announced as finalists:
• MARS (IBM corporation)
• RC6 (RSA Laboratories)
• Rijndael (J. Daemen & V. Rijmen)
• Serpent (E. Biham et al.)
• Twofish (B. Schneier et al.)
Extensive research has been done on all of these algorithms to find attacks or weaknesses.
According to NIST, all 5 finalists appear to offer adequate security. Much research has also been done to test the performance of these 5 algorithms in both software and hardware. In 2000 NIST announced that Rijndael was chosen as the successor of DES, the AES. The combination of security, performance, efficiency, implementability and flexibility made Rijndael an appropriate selection for the AES.

The candidates for the AES algorithm had to fulfill certain design criteria. First, the algorithm had to be a symmetric algorithm, and it had to be resistant against all known attacks. Furthermore, the AES had to be efficient in performance and memory on different platforms. The design had to be simple, and it had to handle different key lengths (128, 192 and 256 bits). The block length of the cipher had to be 128 bits.

3.2 Overview of Encryption
Figure 4: Overview of AES

The cipher starts with an initial round, followed by a number of standard rounds, and ends with the final round. Only four different operations are necessary to compute these rounds, plus a key schedule. Rijndael allows different key lengths to be used according to the security level that the application requires. Rijndael is defined as a block cipher with key lengths of 128, 192 or 256 bits, and its possible input block lengths are also 128, 192 or 256 bits. The AES algorithm is exactly the same as the Rijndael algorithm, except that it defines only one block length, 128 bits. The Rijndael algorithm is such that each bit depends on all bits from 2 rounds earlier, i.e. full diffusion is provided. The number of rounds that must be run depends on the key length.

            Key length (words)   Number of rounds (Nr)
AES-128             4                    10
AES-192             6                    12
AES-256             8                    14
Figure 5: Key length

3.2.1 AddRoundKey
The AddRoundKey operation is a simple XOR operation between the State and the RoundKey. The RoundKey is derived from the CipherKey by means of the key schedule.
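As a minimal sketch of AddRoundKey, the Python fragment below XORs a 16-byte state with a 16-byte round key. The function name `add_round_key` is an illustrative choice and is not taken from the project's VHDL. The plaintext and cipher key are the ones used in the simulation of section 5.3.4; applying the initial round to them should give the state that section 5.3.2 uses as its sub-bytes input.

```python
def add_round_key(state: bytes, round_key: bytes) -> bytes:
    """XOR a 16-byte state with a 16-byte round key, byte by byte."""
    if len(state) != 16 or len(round_key) != 16:
        raise ValueError("state and round key must both be 16 bytes")
    return bytes(s ^ k for s, k in zip(state, round_key))


# Test vector quoted in section 5.3.4 of this report.
plaintext = bytes.fromhex("3243f6a8885a308d313198a2e0370734")
cipher_key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")

state = add_round_key(plaintext, cipher_key)
print(state.hex())  # 193de3bea0f4e22b9ac68d2ae9f84808
```

Because XOR is its own inverse, calling `add_round_key(state, cipher_key)` again returns the original plaintext, which is why the same block can serve both encryption and decryption.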
The State and RoundKey are of the same size, and to obtain the next State an XOR operation is done per element:

    s'(i, j) = s(i, j) ⊕ w(i, j)    (5)

where s is the current State, s' the next State and w the round key.

Figure 6: AddRoundKey

3.2.2 Sub Bytes
The SubBytes operation is similar to the S-boxes used in the DES algorithm, but Rijndael has only one S-box. The S-box is designed to resist the known differential and linear cryptanalysis attacks as well as attacks using algebraic manipulation.

Figure 7: The AES S-Box

3.2.3 Shift Rows
Figure 8: Shift Rows

In ShiftRows the rows of the State are cyclically shifted with different offsets. Row 1 is shifted over c1 bytes, row 2 over c2 bytes, and row 3 over c3 bytes. The values of c1, c2 and c3 depend on the block length Nb; for the 128-bit AES block (Nb = 4) they are 1, 2 and 3.

3.2.4 MixColumns
Figure 9: MixColumns

The MixColumns transformation operates on each column separately. To calculate it, the columns of the current State are considered as polynomials over GF(2^8).

3.2.5 Key schedule
The RoundKeys are derived from the CipherKey by means of a key schedule. The number of RoundKeys needed to encrypt one block of information depends on the block length and key length, as these determine the number of rounds. For a block length and key length of 128 bits, 11 RoundKeys are needed (1 for the initial round, 9 for the standard rounds and 1 for the final round).

Figure 10: KeySchedule

3.3 Process for Decryption
The Inverse Cipher: Inverting the cipher presented in section 3.2 is straightforward, since decryption is just the reverse of the encryption process. This section describes the inverse cipher in detail. During decryption the state undergoes the following transformations:
a. Inverse shift row
b. Inverse sub bytes transformation using S-Box
c. Inverse mix column
d. Inverse sub key

3.3.1 Inverse Shift row
The inverse shift row step is just the inverse of the forward shift row operation used in the encryption process.
The first row is kept as it is, the second row is shifted right by one place, the third row by two places and the fourth row by three places. The inverse shift operation is shown for Nc = Nb = 4 in figure 11.

    Before                      After
    S0,0 S0,1 S0,2 S0,3         S0,0 S0,1 S0,2 S0,3
    S1,0 S1,1 S1,2 S1,3         S1,3 S1,0 S1,1 S1,2
    S2,0 S2,1 S2,2 S2,3         S2,2 S2,3 S2,0 S2,1
    S3,0 S3,1 S3,2 S3,3         S3,1 S3,2 S3,3 S3,0
Figure 11: Inverse shift row

3.3.2 Inverse Sub bytes transformation using S-Box
The inverse sub bytes transformation uses the inverse S-Box table given in figure 12.

Figure 12: Inverse S-Box table

3.3.3 Inverse Mix column
The inverse mix column transformation acts independently on every column of the state.

Figure 13: Inverse mix column

3.3.4 Inverse Round Key
The round key transformation is its own inverse, because it is an XOR. A round key is generated for each round.

4. Modules
4.1 Software Module
The software module consists of implementing the Advanced Encryption Standard on Intel's IXP 2850. The IXP 2850 contains two cryptographic units with hardware cores for AES, 3DES and SHA-1. It also comes with an SDK for implementing various network processing functions.

4.2 Intel's IXP 2850
The newest member of the Intel® second generation network processor product family, the Intel® IXP2850 network processor, delivers high-performance packet and content processing with robust security features in a single platform. By integrating capabilities that have typically required multiple specialized processors, the Intel IXP2850 provides a secure and cost-effective platform that enables a broad range of emerging applications.

4.3 IXP 2850 Cryptographic Unit
The cryptography unit performs authentication and bulk encryption. It is believed that these two datapath tasks are critical strategic functions for the network processor.
The crypto engines are innovative designs that have a very small footprint, yet the two engines together provide 10 Gb/s throughput.

4.4 IXP SDK 3.0
The Developer Workbench is an integrated development environment for assembling, compiling, linking and debugging microcode that runs on the IXP2400, IXP2800 and IXP2850 Network Processor Microengines. The Workbench is a Microsoft Win32 application that runs on Windows 2000 platforms.

Features
Important Workbench features include:
• Source level debugging.
• Debug-only project creation mode.
• Execution history.
• Statistics.
• Media Bus device and network traffic simulation for the Network Processors.
• Command line interface to the Network Processor simulators (Transactors).
• Customizable graphical user interface (GUI) components.

Debugging support
The Workbench supports debugging in four different configurations:
• Local simulation with no foreign model, in which the Workbench and the Network Processor simulator (Transactor) both run on the same Microsoft Windows platform.
• Local simulation with local foreign models, in which the Workbench, the Transactor, and one or more foreign model Dynamic-Link Libraries all run on the same Windows platform.
• Local simulation with a remote foreign model, in which the Workbench and the Transactor both run on the same Windows platform and communicate over the network with a foreign model running on a remote system.
• Hardware, in which the Workbench runs on a Windows host and communicates over a network or a serial port with a subsystem containing actual Network Processors. (Not currently available for IXP2800 Network Processors.)

4.5 Implementing AES on IXP 2850
4.5.1 Design
Figure 14: AES design

4.5.2 Overview
The software implementation performs encryption and decryption on a single microengine using one thread. The steps are as follows:
1.
Plain text to be encrypted is written from the DRAM transfer registers into the input RAM of the crypto unit.
2. The key is written into both crypto units.
3. The text is encrypted using crypto unit 0 and the resulting cipher text is written into the DRAM read registers.
4. The cipher text is copied from the DRAM read register into the DRAM write register and is then transferred to the input RAM of crypto unit 1.
5. The cipher text is decrypted and is then written into the DRAM read transfer registers.
6. The decrypted text is compared with the original text to check that the program works.

4.5.3 Software Functions
The following functions of the SDK 3.0 API are used for the steps above:

Step 1: Writing the plain text into crypto input RAM
    xbuf_alloc($$orig_plain_text, 16, write)
    crypto_write_ram($$orig_plain_text[0], DATA_RAM_ADDR, 8, ENCRYPT_UNIT, ram_sig)
    ctx_arb[ram_sig]

Step 2: Loading the Key
    crypto_load_key($$key[0], 3, DECRYPT_UNIT, CRYPTO_BANK, DECRYPT_STATE, key_sig)
    ctx_arb[iv_sig, key_sig]

Step 3: Encrypt
    crypto_cipher($$encrypt_data[0], DATA_RAM_ADDR, 8, CRYPTO_CIPHER_ENCRYPT, CRYPTO_CIPHER_NO_CBC, CRYPTO_CIPHER_AES_128, ENCRYPT_UNIT, CRYPTO_BANK, ENCRYPT_STATE, cipher_sig)
    ctx_arb[cipher_sig]

Step 4: Decrypt
    crypto_cipher($$new_plain_text[0], DATA_RAM_ADDR, 8, CRYPTO_CIPHER_DECRYPT, CRYPTO_CIPHER_NO_CBC, CRYPTO_CIPHER_AES_128, DECRYPT_UNIT, CRYPTO_BANK, DECRYPT_STATE, cipher_sig)

4.5.4 Simulation Output
Figure 15: Original plain text
The figure above shows the data watch of the simulation output, in which the text loaded into the input RAM of the crypto unit can be seen.

Figure 13: Encrypted Text
The figure above shows the cipher text after encryption with a 128-bit AES key.

Figure 14: Writing Encrypted text to DRAM

Figure 15: Decrypted text same as original plain text

5. Hardware Implementation of AES
5.1 Design Process
The hardware design process was divided into blocks and each block was handled in order.
The design started with the interface, which defines the data and instruction flow between the AES co-processor and the IXP-2400. After the interface was designed, the encryption and decryption blocks were written in VHDL and simulated to make sure the modules worked. The VHDL modules were then fed to the synthesizer to generate a netlist, which was simulated again to make sure it produced the expected results. Once the netlist was available, the floor-planning and placement tools were used to lay out the final chip.

5.2 Interface Design
The interface of the AES block is designed to allow data flow between the Intel IXP 2400 board and the security co-processor. A block diagram of the hardware implementation of the AES algorithm and the interface is shown below.

[Block diagram: PCI bus, CONTROL block, Encryption block and Decryption block. Interface signals: AES2pack, busy, P2AESburst, P2AESRnW, P2AESreq, mptaddr (32), mptdatain (32), mptdataout (32). Each of the Encryption and Decryption blocks has 128-bit Key_in, Data_in and Data_out ports.]

Burst Write Handshake:
1- The processor checks whether the busy line is high. If it is high (busy), the processor drops its request. If busy is low (not busy), the processor drives P2AESBURST high to request a burst write transfer (4 long words per burst write).
2- The processor drives a valid address on the address/data bus and drives P2AESRnW low, then drives the data and pulses P2AESREQ high for one clock cycle to signal to the AES chip that a valid address/data is available on the bus.
3- The AES chip latches the address/data and waits for the next data word, until all four long words are latched.
4- When the processor is done with its burst write transfer, it drives P2AESBURST low to signal the end of the burst write cycle.

The burst write handshake waveform is shown below.

[Waveform: CLK, P2AESBURST, MPTADDR[32] (ADDR), P2AESREQ, MPTDATA[32] (four DATA words)]
Figure 16: Burst write Handshake

Burst Read Handshake:
1- The processor checks whether the busy line is high. If it is high (busy), the processor drops its request.
If busy is low (not busy, meaning the AES chip has finished encrypting or decrypting), the processor drives P2AESREQ and P2AESBURST high to request a burst read transfer (4 long words per burst read).
2- The processor drives a valid address on the address/data bus and drives P2AESRnW high, then waits for the AES chip to place data on the address/data bus.
3- When the AES chip is ready, it drives the least significant word on the address/data bus and pulses AES2PACK high for one cycle to signal that valid data is now available on the bus.
4- The processor latches the first data word and waits for the following data.
5- The AES chip repeats step 3 until all 128 bits of data have been transferred.
6- When all 128 bits of data have been received, the processor drives P2AESREQ low to signal the end of the burst read cycle.

The burst read handshake waveform is shown below.

[Waveform: CLK, P2AESREQ, P2AESBURST, P2AESRNW, MPTADDR[32] (ADDR), AES2PACK, MPTDATA[32] (four DATA words)]
Figure 17: Burst Read Handshake

5.3 Encryption Block
The encryption process for 128-bit data and key size goes through 10 rounds. The initial round only adds the starting key to the input data, and the result is the input of round 1. In round 1 through round 10 the data goes through the sub-byte transformation, the shift rows transformation and the mix column transformation, and is then added to the round key for that round, which is generated from the previous round key. In AES the final round omits the mix column transformation. The flowchart of the top-level module controlling the encryption block is shown in figure 18. All the modules are implemented in VHDL and are provided in the appendix.

Figure 18: Top level AES encryption block

5.3.1 Control
The state machine defined in the control module and shown in figure 19 controls the AES encryption rounds. The round constants used to generate the key schedule are different for each round and are provided by the state machine.

Figure 19: Control Finite State Machine

5.3.2 Rounds
Rijndael encryption goes through 10 rounds of encryption process.
In each round the input data is mapped through the S-box, shift rows and mix column are performed on it, and the result is added to the round key. This operation is shown in figure 20.

[Block diagram: data_in, S-Box, Shift Row, Mix Column, Mux (select: last_mux_sel), Add Round Key, data_out]
Figure 20: Block diagram of Round operation

The S-box transformation, shift rows and mix column are explained in detail below, together with the simulation results for each block.

i> Sub-Bytes Transformation
In the sub-bytes transformation the input data is mapped through the S-box table shown in figure 7. For example, if the input data is "193de3bea0f4e22b9ac68d2ae9f84808", the sub-bytes transformation is done for each byte XY, where the first 4 bits are 'X' and the last 4 bits are 'Y'. In the given input data the first byte is 19, so X = 1 and Y = 9. From the table in figure 7, the mapped value for 19 is d4. The next byte is mapped in the same way, and so on. The complete transformation is shown in figure 21.

    Input                 Output
    19 a0 9a e9           d4 e0 b8 1e
    3d f4 c6 f8           27 bf b4 41
    e3 e2 8d 48           11 98 5d 52
    be 2b 2a 08           ae f1 e5 30
Figure 21: Sub-byte transformation

The mapped result is "d42711aee0bf98f1b8b45de51e415230". The simulation in figure 22 shows the "bytesub" result as "d42711aee0bf98f1b8b45de51e415230" and hence confirms that the sub-bytes module is working.

Figure 22: Simulation of rounds

ii> Shift Rows
After the sub-byte transformation the shift rows operation is performed on the result. The first row is kept as it is, the second row is shifted left once, the third row is shifted left twice and the fourth row is shifted left three times. For example, if the sub-byte output is "d42711aee0bf98f1b8b45de51e415230", shift rows is performed as shown in figure 23.

    Input                 Output
    d4 e0 b8 1e           d4 e0 b8 1e
    27 bf b4 41           bf b4 41 27
    11 98 5d 52           5d 52 11 98
    ae f1 e5 30           30 ae f1 e5
Figure 23: Shift row transformation

Reading the output matrix column by column, the shifted result is "d4bf5d30e0b452aeb84111f11e2798e5".
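The row shifts of figure 23 can be reproduced with a short software model. The sketch below is a Python reference, not the project's VHDL; the name `shift_rows` is an illustrative choice, and it assumes the usual AES layout in which the 16-byte state is stored column by column.

```python
def shift_rows(state: bytes) -> bytes:
    """Cyclically shift row r of the 4x4 state left by r bytes.

    The state is stored column by column, so the byte at row r,
    column c sits at index 4*c + r.
    """
    if len(state) != 16:
        raise ValueError("state must be 16 bytes")
    out = bytearray(16)
    for c in range(4):
        for r in range(4):
            # Row r takes its new column c from old column (c + r) mod 4.
            out[4 * c + r] = state[4 * ((c + r) % 4) + r]
    return bytes(out)


# Sub-byte output quoted in the text above.
sub_bytes_out = bytes.fromhex("d42711aee0bf98f1b8b45de51e415230")
print(shift_rows(sub_bytes_out).hex())  # d4bf5d30e0b452aeb84111f11e2798e5
```

The printed value is the shifted matrix of figure 23 read column by column, and it is also the input that yields the mix column result quoted in the next step.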
The simulation in figure 22 shows the "shiftrow" result as "d4bf5d30e0b452aeb84111f11e2798e5" and hence confirms that the shift rows module is working.

iii> Mix Column
In the mix column module the result of the shift row operation is multiplied by a fixed matrix. The operation is shown in figure 24.

    04 e0 48 28       02 03 01 01       d4 e0 b8 1e
    66 cb f8 06   =   01 02 03 01   x   bf b4 41 27
    81 19 d3 26       01 01 02 03       5d 52 11 98
    e5 9a 7a 4c       03 01 01 02       30 ae f1 e5
Figure 24: Mix Column operation

The result of the mix column operation is "046681e5e0cb199a48f8d37a2806264c". Figure 22 shows the "mixcolumn" result as "046681e5e0cb199a48f8d37a2806264c" and hence confirms that the mix column module is working.

iv> Key addition
In this step the round key generated by the key schedule, explained next, is XOR-ed with the mix column result.

5.3.3 Key Schedule
In encryption the key schedule is generated from the 128-bit key input. The block diagram of the key scheduler is shown in figure 25.

[Block diagram labels: user_key, next_key, MUX (select: key_reg_mux_sel), KEY REGISTER (load_key_reg), SHIFT and S-box on key_word(3), round_constant, XOR stages over key_word(0) to key_word(3)]
Figure 25: Key Scheduler

5.3.4 Simulation Result
The encryption block is simulated with input data and key lengths of 128 bits. The input data is "3243f6a8885a308d313198a2e0370734" and the input key is "2b7e151628aed2a6abf7158809cf4f3c". The encrypted cipher output is "3925841d02dc09fbdc118597196a0b32". The result matches the spec.

Figure 26: Encrypted result

5.4 Decryption Block
The decryption process for 128-bit data size goes through 10 rounds. The initial round only adds the starting key to the input data, and the result is the input of round 1.
In round 1 through round 10 the data goes through the inverse shift rows and inverse sub-byte transformations, is added to the round key for that round, and then goes through the inverse mix column transformation. Each round key is generated from the previous round key. The flowchart of the top-level module controlling the decryption block is shown in figure 27. All the modules are implemented in VHDL and are provided in the appendix.

Figure 27: Top level AES block controlling decryption

5.4.1 Control
The state machine defined in the control module and shown in figure 28 controls the AES decryption rounds. The round constants used to generate the key schedule are different for each round and are provided by the state machine.

[State diagram: on Decrypt=1 the machine leaves INIT, loads the inputs and steps through Round 0 to Round 10. The load state asserts data_reg_mux_sel=11, key_reg_mux_sel=0, load_data_reg=1 and load_key_reg=1. The round states assert key_reg_mux_sel=1, load_data_reg=1 and load_key_reg=1 with data_reg_mux_sel=01 (00 in one state); one state also asserts last_mux_sel=1. The round_const values used across the rounds are 00000001, 00000010, 00000100, 00001000, 00010000, 00100000, 01000000, 10000000, 00011011 and 00110110.]
Figure 28: Finite state machine

5.4.2 Rounds
Rijndael decryption goes through 10 rounds of decryption process.
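The ten round_const values that appear in the state machine of figure 28 are not arbitrary: each is the previous one multiplied by x in GF(2^8), reduced by the AES polynomial x^8 + x^4 + x^3 + x + 1 (0x11B). The Python sketch below regenerates them; the helper names `xtime` and `round_constants` are illustrative, and the listing follows the forward (encryption) order. A decryption schedule that starts from the round-10 key would presumably consume the same values in reverse.

```python
def xtime(b: int) -> int:
    """Multiply a field element by x (i.e. by 2) in GF(2^8) modulo 0x11B."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b


def round_constants(count: int = 10) -> list:
    """Return the first `count` AES round constants, starting from 0x01."""
    rcon = [0x01]
    while len(rcon) < count:
        rcon.append(xtime(rcon[-1]))
    return rcon


for i, rc in enumerate(round_constants(), start=1):
    print(f"round {i:2d}: round_const = {rc:08b}")
```

The last two values, 00011011 and 00110110, are where the doubling overflows 8 bits and the reduction polynomial is folded back in.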
In each round inverse shift rows is performed and the result is mapped through the inverse S-box. The round key is added to the result, and then the inverse mix column operation is performed. This operation is shown in figure 29.

[Block diagram: data_in, Inverse Shift Row, Inverse S Box, Add Round Key, Inv. Mix Column, Mux (select: Last_mux_sel), data_out]
Figure 29: Block diagram of Round operation

The inverse S-box transformation, inverse shift rows and inverse mix column are explained in detail below, together with the simulation results for each block.

i> Inverse Shift Rows
The inverse shift rows operation is performed on the input data. The first row is kept as it is, the second row is shifted right once, the third row is shifted right twice and the fourth row is shifted right three times. For example, if the input data is "e9317db5cb322c723d2e895faf090794", inverse shift rows is performed as shown in figure 30.

    Input                 Output
    e9 cb 3d af           e9 cb 3d af
    31 32 2e 09           09 31 32 2e
    7d 2c 89 07           89 07 7d 2c
    b5 72 5f 94           72 5f 94 b5
Figure 30: Inverse Shift row transformation

The shifted result is "e9098972cb31075f3d327d94af2e2cb5". The simulation in figure 32 shows the "shiftrow" result as "e9098972cb31075f3d327d94af2e2cb5" and hence confirms that the inverse shift rows module is working.

ii> Inverse Sub-Bytes Transformation
In the inverse sub-bytes transformation the shifted data is mapped through the inverse S-box table shown in figure 12. For example, if the input data is "e9098972cb31075f3d327d94af2e2cb5", the inverse sub-bytes transformation is done for each byte XY, where the first 4 bits are 'X' and the last 4 bits are 'Y'. In the given input data the first byte is e9, so X = e and Y = 9. From the table in figure 12, the mapped value for e9 is eb. The next byte is mapped in the same way, and so on. The complete transformation is shown in figure 31.
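The table lookup described above can be checked in software without typing in the 256-entry table. The Python sketch below rebuilds the AES S-box from its definition (multiplicative inverse in GF(2^8) followed by the affine transformation), inverts it, and applies it to the example state. All function and variable names are illustrative and are not taken from the project's VHDL, which uses a stored table.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse computed as a^254; 0 maps to 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x: int, n: int) -> int:
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox() -> list:
    """Forward S-box: field inverse followed by the AES affine map."""
    sbox = []
    for a in range(256):
        b = gf_inv(a)
        sbox.append(b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()
INV_SBOX = [0] * 256
for x, y in enumerate(SBOX):
    INV_SBOX[y] = x  # invert the permutation


def inv_sub_bytes(state: bytes) -> bytes:
    """Map every byte of the state through the inverse S-box."""
    return bytes(INV_SBOX[b] for b in state)


shifted = bytes.fromhex("e9098972cb31075f3d327d94af2e2cb5")
print(inv_sub_bytes(shifted).hex())  # eb40f21e592e38848ba113e71bc342d2
```

The same `SBOX` list gives the forward mapping used in encryption, for example `SBOX[0x19]` is `0xd4` as in the sub-bytes example of section 5.3.2.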
    Input                 Output
    e9 cb 3d af           eb 59 8b 1b
    09 31 32 2e           40 2e a1 c3
    89 07 7d 2c           f2 38 13 42
    72 5f 94 b5           1e 84 e7 d2
Figure 31: Inverse Sub-byte transformation

The mapped result is "eb40f21e592e38848ba113e71bc342d2". The simulation result shown in figure xxx shows the "bytesub" result as "eb40f21e592e38848ba113e71bc342d2" and hence confirms that the inverse sub-bytes module is working.

Figure 32: Simulation of rounds

iii> Key addition
In this step the round key generated by the key schedule, explained next, is XOR-ed with the inverse sub-bytes result.

iv> Inverse Mix Column
In the inverse mix column module the result of the key addition is multiplied by a fixed matrix. The operation is shown in figure 33.

    87 f2 4d 97       0e 0b 0d 09       47 40 a3 4c
    6e 4c 90 ec   =   09 0e 0b 0d   x   37 d4 70 9f
    46 e7 4a c3       0d 09 0e 0b       94 e4 42 3a
    a6 8c d8 95       0b 0d 09 0e       ed a5 a6 bc
Figure 33: Inverse Mix Column operation

The result of the inverse mix column operation is "876e46a6f24ce78c4d904ad897ecc395". Figure xxx shows the "mixcolumn" result as "876e46a6f24ce78c4d904ad897ecc395" and hence confirms that the inverse mix column module is working.

5.4.3 Key Schedule
In decryption the key schedule generation is the inverse of the encryption process. The block diagram of the key scheduler is shown in figure 34.

[Block diagram labels: user_key, next_key, MUX (select: key_reg_mux_sel), KEY REGISTER (load_key_reg), SHIFT and S-box on next_key_word(3), round_constant, XOR stages over key_word(0) to key_word(3)]
Figure 34: Key Scheduler

5.4.4 Simulation Result
The decryption block is simulated with input data and key lengths of 128 bits. The input data is "3925841d02dc09fbdc118597196a0b32" and the input key is the key generated at round 10 of the encryption process, which is "d014f9a8c9ee2589e13f0cc8b6630ca6". The decrypted output is "3243f6a8885a308d313198a2e0370734". The result matches the spec.

Figure 35.
Decrypted result

6 Synthesis of the AES Chip
The previous sections dealt with the front end design of the chip. Once the RTL has been written in VHDL, the next step is to synthesize the chip. This part of the document walks through the synthesis process.

6.1 Division of the Chip design work:
The chip design work was divided in the following manner:

ARCHITECTURE: Mr. Ali Bassam
"Partition of the chip in functional blocks and design of the interface"

FRONT END DESIGN: Ms. Nasreen Ahmed, Mr. Shafaat Qureshi
"Enter the HDL code and verify the functionality"

BACK END DESIGN: Mr. Sanjit Kurup
"Synthesize, Floor plan, place and route, clock tree generation, physical verification"

FABRICATION: MOSIS

Fig 6.1

Architecture design: This step involves partitioning the chip into functional units. As seen in the hardware section, the chip was divided into functional blocks such as the interface, the encryptor and the decryptor. Within each of these there were several fundamental divisions such as S-box, shift row, mix column and add round key. This task is one of the most challenging, as it requires a clear understanding of the specification.

Front End Design: The main task of the front end design team was to design the state machines of the individual blocks as per the specification. Based on the state machines or flow charts, the RTL code was written in VHDL. This code has to be tested for functionality using a simulator. The tool used in this case was NCSim from Cadence, and the results are as explained in the previous section.

Back End Design: The back end design involves synthesis of the logic circuit, verification by gate level simulation, place and route, and many other intricacies of the chip layout. This section of the document explains each of these steps in detail, especially in the context of the AES chip.
Fabrication: The AES chip has not yet been fabricated, for the reasons explained in the subsequent sections. Assuming that a GDS2 or CIF file were generated for the chip, the obvious choice for fabrication would be MOSIS (the Metal Oxide Semiconductor Implementation Service).

Before going into the details of the chip design, here is the flow chart of the steps and processes involved in chip design and synthesis. As shown in the figure, HDL entry and its functional verification are complete. Next come synthesis and simulation of the netlist generated from it.

Fig 6.2

6.2 Synthesis of the chip:
The process of converting the RTL to the equivalent gate level circuit is called synthesis. This is an automated process carried out by the synthesis software. The software used in this project is Physically Knowledgeable Synthesis (PKS) from Cadence Design Systems, Inc. Fig 6.3 gives a snapshot of the Physically Knowledgeable Synthesis tool.

Fig 6.3

Of the steps involved in synthesis, the first is loading the design libraries. The design library used is gscLib2.0. This is a generic library, and chips synthesized with it cannot be fabricated by any commercial fabrication unit, because the tools have been donated by Cadence for educational purposes only and not for commercial manufacturing or IP development.

The Technology Library
The cell library contains the definitions of the basic cells, such as:
• Inverters and buffer cells of various strengths
• Cells with basic logic functions: AND, NAND, AND_OR, OR, NOR, OR_AND, XOR, XNOR, MUX
• Various input configurations of each:
  • Two, three or four input cells
  • Example: AND2, AND3 and AND4 cells

After the library is sourced, a generic netlist is generated. This involves generating the gate level circuitry from the VHDL logic. The gates that are used are not specific to the synthesis library.
Instead, they are generic, which is why the result is called a generic netlist. To elaborate on the process, here is an example. Consider the following VHDL code:

    process (ain, bin, cin)
    begin
      if cin = '1' then
        bout <= ain;
      else
        bout <= bin;
      end if;
    end process;

(Figure: a MUX with data inputs ain and bin, select input cin and output bout.)

In the above example, the VHDL code generates a multiplexer. This is the process of generating the generic netlist. However, the library used for the actual synthesis will not necessarily contain every gate that appears in the generic netlist. For example, if the generic netlist contains a 13 input AND gate, the technology library is unlikely to have a 13 input AND gate. Instead, the library will have AND gates with a maximum of 4 inputs. Therefore, a process is needed which converts the 13 input AND gate into an equivalent netlist built from AND gates of at most 4 inputs. This is done in the second step of synthesis, called technology mapping, in which the generic netlist is redefined using the technology library. Fig 6.4 explains the concept graphically.

Fig 6.4 (Generic Map and Technology Map)

Defining the Constraint for the synthesis: Before technology mapping is performed, the synthesis constraints for the chip need to be defined. This is done in two stages. The first is to define the .tcl file, where Tcl stands for Tool Command Language. This script file defines the constraints on the input pins, clock pins and output pins.
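As an illustration of the technology mapping step described above, the following Python sketch decomposes one wide AND gate into a tree of AND cells with at most four inputs. It is a minimal sketch of the idea only, not the algorithm used by PKS; the function names, the net names (in0, n_0, ...) and the tuple format for a gate are assumptions made for this example. Only the cell names AND2, AND3 and AND4 come from the library description.

```python
def map_wide_and(inputs, max_fanin=4):
    """Decompose one wide AND into a tree of AND cells of at most max_fanin inputs.

    Returns (gates, output_net). Each gate is a tuple
    (cell_name, input_nets, output_net). Net names are hypothetical.
    """
    gates = []
    level = list(inputs)
    counter = 0
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), max_fanin):
            group = level[i:i + max_fanin]
            if len(group) == 1:
                # A single leftover net needs no gate; pass it to the next level.
                next_level.append(group[0])
                continue
            out = f"n_{counter}"
            counter += 1
            gates.append((f"AND{len(group)}", group, out))
            next_level.append(out)
        level = next_level
    return gates, level[0]


def evaluate(gates, output_net, values):
    """Evaluate the mapped netlist for a dict of primary input values."""
    nets = dict(values)
    for _cell, ins, out in gates:  # gates are already in topological order
        nets[out] = all(nets[name] for name in ins)
    return nets[output_net]


inputs = [f"in{i}" for i in range(13)]
gates, out = map_wide_and(inputs)
for cell, ins, net in gates:
    print(cell, ins, "->", net)
```

For the 13 input case the sketch groups twelve inputs into three 4 input cells, carries the thirteenth input forward, and combines the four resulting nets in one more 4 input cell. A real mapper would also weigh area, delay and drive strength when choosing cells, which this sketch ignores.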
The content of the .tcl file that was used to synthesize the AES chip is given below:

    set_global fanout_load_limit 20
    set_clock clk -period 30.30 -waveform { 0 15.15 }
    set_clock_root -clock clk clk
    set_clock_uncertainty 0.150
    set_external_delay -clock clk 2 [ get_names [ find -outputs ] ]
    set_input_delay -clock clk 2 [ get_names [ find -inputs ] ]
    set_clock_insertion_delay 0.1 {clk}

As you will observe in the content of the .tcl file, the clock frequency is defined at 33 MHz (a period of 30.30 ns) with a 50% duty cycle. The clock uncertainty is 0.15 ns. The set_input_delay command applies to all the other inputs and sets the input arrival delay at 2 ns. Similarly, set_external_delay constrains all the outputs with an external delay of 2 ns.

Fig 6.5

The figure above shows the other constraints that were defined for synthesis. As you will observe, no optimizations are defined for the synthesis. The reason for not constraining the design further is explained in the next paragraph.

After the parameters are set and the constraint files are defined, technology mapping is done. The time required by the tool for synthesis is approximately 1 hour 15 minutes. For commercial chips, synthesis has been known to take days to complete. This is the primary reason why the design was not set up for optimization: it saves time during the initial stages of synthesis. After synthesis, the output looks like the screens given below.

Fig 6.6

Fig 6.6

Fig 6.7

Fig 6.8

Fig. 6.9

Fig. 6.10

Fig 6.11

Fig. 6.7 and Fig. 6.8 give different views of the synthesized chip.

Synthesis Reports: Several reports are generated during synthesis. It is imperative to look into these reports to verify the operability of the synthesized circuit. Fig 6.8 and Fig. 6.9 give the summary of these reports. As observed, the worst slack is 14.506 ns for the 33 MHz clock, which is a comfortable margin. The core utilization and the cell area are also given.
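The timing figures quoted above can be cross-checked with a little arithmetic. The Python sketch below converts the constrained clock period into a frequency and duty cycle, and derives a first-order estimate of the shortest period the design could meet from the reported worst slack. It assumes that all values are in nanoseconds and that slack is the required time minus the arrival time on the worst path; the function names are hypothetical, and the estimate is an illustration only, not a result reported by the synthesis tool.

```python
def clock_summary(period_ns, rise_ns, fall_ns):
    """Return (frequency in MHz, duty cycle) for a clock waveform."""
    freq_mhz = 1000.0 / period_ns          # 1000 / period[ns] gives MHz
    duty = (fall_ns - rise_ns) / period_ns
    return freq_mhz, duty


def min_period_estimate(period_ns, worst_slack_ns):
    """First-order estimate: positive slack is time that could be removed from the period."""
    return period_ns - worst_slack_ns


# Values taken from the .tcl constraints and the synthesis summary.
freq_mhz, duty = clock_summary(30.30, 0.0, 15.15)
min_period = min_period_estimate(30.30, 14.506)

print(f"constrained clock: {freq_mhz:.2f} MHz, duty cycle {duty:.0%}")
print(f"estimated minimum period: {min_period:.3f} ns "
      f"(about {1000.0 / min_period:.1f} MHz)")
```

The estimate treats the clock uncertainty and the input and external delays as fixed, so it only indicates roughly how much headroom the 33 MHz constraint leaves.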
Fig 6.9 gives the details of the process technology. One of the important reports is the timing report. Fig 6.10 gives the detailed timing analysis. Although it is just a portion of the complete report, it gives the gist and can be used to find the slack, setup time and hold time.

Gate Level Simulation: One of the files generated during synthesis is the Verilog netlist. The file has the extension .v and is used to verify the functionality of the synthesized logic. A portion of the .v file generated for the AES chip is given in Fig 6.11.

    module control(reset, clk, encrypt, data_sel, load_data, key_sel,
                   round_const, last_mux_sel, load_key);
      input reset;
      input clk;
      input encrypt;
      output [1:0] data_sel;
      output load_data;
      output key_sel;
      output [7:0] round_const;
      output last_mux_sel;
      output load_key;
      wire [3:0] state;
      BUFX3 i_16753(.A(\nbus_135[1] ), .Y(data_sel[1]));
      BUFX3 i_16717(.A(n_4876), .Y(last_mux_sel));
      BUFX3 i_16705(.A(n_4861), .Y(load_data));
      BUFX3 i_16679(.A(n_4831), .Y(key_sel));
      BUFX3 i_16667(.A(n_4816), .Y(load_key));
      NAND2X1 i_49(.A(n_60), .B(n_69), .Y(n_50));
      OAI21X1 i_2(.A0(n_57), .A1(n_58), .B0(n_68), .Y(n_49));
      NAND2X1 i_3(.A(n_44), .B(n_67), .Y(n_47));
      NAND2X1 i_41(.A(n_60), .B(n_66), .Y(n_44));
      AOI21X1 i_4(.A0(state[3]), .A1(n_56), .B0(n_41), .Y(n_43));
      MX2X1 i_19(.A(state[1]), .B(n_61), .S0(state[2]), .Y(n_42));
      AOI21X1 i_37(.A0(n_59), .A1(n_57), .B0(n_42), .Y(n_41));

Fig 6.12

This Verilog file is used for the gate level simulation. The same test bench that was used for functional simulation of the VHDL file is used to apply the test vectors to this Verilog netlist. After running the simulation and fixing a few bugs, the Verilog netlist produced a waveform that was exactly the same as the one from the VHDL file shown in the previous section.

Place and Route of the Chip: The System On Chip (SoC) Encounter tool from Cadence Design Systems Inc. has been used to place and route the chip.
The snapshots in Fig 6.12 and Fig 6.13 show the placed and routed chip.

Fig 6.13

Fig 6.14

The IO pads with buffers have not been designed, which is the reason the GDS2 or CIF file has not been generated. This will be done shortly and the document will be updated. Fig 6.14 gives a short summary of the chip.

THE AES CHIP

Fig 6.15

7. Conclusions

1. The entire code running on one microengine in a single thread environment took 1185 microengine cycles. Further optimization is possible by applying parallelism and using more microengines.
2. The crypto units of the IXP 2850 operate at 700 MHz, providing high-speed bulk encryption and decryption. All the symmetric key encryption is offloaded to the crypto units, which are also used to calculate message digests, message authentication checks and checksums. The crypto units can perform the bulk of packet transformations at high speed, freeing the microengines to perform other packet processing tasks.
3. The Intel SDK 3.0 provides an excellent development environment with many built-in functions. It has a good programming environment and debugging interface.
4. The VHDL implementation of the AES algorithm was successful. Both the encryption and decryption blocks were designed and simulated, and the results are as expected.
5. The AES chip was successfully synthesized using a set of tools from Cadence Design Systems Inc. The netlist generated by synthesis was successfully tested for its functionality. The place and route of the chip was also completed.

8. References:

Daemen J, Rijmen V; “AES Proposal: Rijndael”; 1999.
“Security: Adding protection to network via the network processor”; Intel technology journal; 2002.
“Network processors”; Intel technology journal; 2002.
“Intel IXP2850 Network processor programmer reference model”; Feb 2003.
“Intel IXP2400/IXP2800 Network processor development”; Feb 2003.
“Intel IXP2400/IXP2800 Network processor programmer reference model”; Feb 2003.
“Intel integrated performance primitives for Intel architecture Reference model Volume 4 Cryptography”; 2003.
B. Gladman; “The AES Algorithm (Rijndael) in C and C++, performance of the optimized implementation”; from http://fp.gladman.plus.com/cryptography_technology/rijndael/index.htm
System on Chip CECS 460 B materials; from http://crete.cadence.com