Figure 20: Block diagram of Round operation

advertisement
Hardware and Software
Implementation of RIJNDAEL AES
Algorithm
EE-540
Project Report
Vipul Joshi
Ali Bassam
Sanjit Kurup
Nasrin Ahmed
Shafaat Qureshi
Table of Contents
1.
Introduction……………………………………………………………………..3
1.1 Project perspective………………………………………………………...3
2.
Design…………………………………………………………………………....4
2.1 Spiral Design Model………………………………………………………4
2.2 Design Flow………………………………………………………………..5
2.3 Project Modeling………………………………………………………….6
3.
The Advanced Encryption Algorithm………………………………………...6
3.1 Introduction……………………………………………………………….6
3.2 Overview of Encryption………………………………………………….7
3.3 Overview of Decryption……….…………………………………………11
4.
Module…………………………….……………………………………………13
4.1 Software module………………………………………………………….13
4.2 Intel IXP2850……………………………………………………………..14
4.3 Intel IXP2850 Cryptographic Unit………………………………………14
4.4 Intel SDK 3.0………………………………………...……………………15
4.5 Implementing AES on IXP 2850…………………………………………16
5.
Hardware Implementation of AES……………………………………………22
5.1 Design process…………………………………………………………….22
5.2 Interface Design…………………………………………………………..22
5.3 Encryption block………………………………………………………….25
5.4 Decryption block………………………………………………………….33
6.
Synthesis of the AES Chip…………………………………………………….41
7.
Conclusion……………………………………………………………………..56
8.
Reference………………………………………………………………………57
2
1.
Introduction:
The Rijndael block cipher algorithm was chosen by NIST as the new advanced
encryption standard (AES). As DES is not regarded as a Standard anymore the industry
would now rush into implementing AES for cryptographic implementations on their
products. Being the strongest encryption algorithm which never has been broken till now,
it comes with overheads like performance. Various hardware implementations for AES
exist, but have their own pros & cons and there is lot of work being done in the area to
achieve perfection.
1.1
Product Perspective
The growth of the network processor industry in last two years has signified their promise
their primary purpose being packet processing at up to OC – 192 speeds. Giving much
better performance as compared to general purposes processors Network processor units
are now a very integral part of the gigabyte Internet.
Advanced Encryption Standard (AES) the latest encryption standard approved by NIST is
by far becoming the default choice for encryption in networked applications. Hardware
implementation of the algorithm gives better performance but offers less flexibility and is
also difficult and time consuming to implement as compared to a software
implementation. With the implementation of the security block in IXP 2850 Intel has
implemented encryption as an ASIC chip on the same board as their processor. Putting
minimum specifics in the block assures flexibility for various applications. This
flexibility is achieved by including a programming model along which runs on the
microengines.
Our objective is to implement the Advanced Encryption Standard on hardware using a
FPGA chip. Simultaneously develop a software implementation using the SDK 3.0 for
Intel’s IXP 2400/2800.
3
The Intel SDK 3.0 consists of a decent API which gives relative control to the
programmer.
Comparison between these two approaches using the same flow charts, block lengths, key
lengths and same data will determine the overheads of using hardware and if they are
worth it. Also it would test the performance of the Intel’s Programming model.
2.
Design
2.1
Spiral Design Model
The spiral model was incorporated in our design. In this manner our design was
constantly updated as more requirements were met. The system feasibility was checked
and as specifications were specified the model was updated. After determining the system
feasibility and determining the specifications a prototype was developed. This led to the
making of an initial system. The initial system then developed into an enhanced system
as more requirements and specification were met.
Figure 1 – Spiral Model
The spiral model (figure 1) successively refines the system as we move through initial
basic systems to full-scale systems. This is advantageous as the overall system is updated
4
and one may never have to restart the design. This is in contrast with the waterfall design
that may have to be redesigned as new features spring up. One disadvantage of the spiral
model can be stated as that it takes more time. But this is a trade off as in case the system
design fails one may have to redesign the system, which may end up consuming even
more time.
2.2
Design Flow
The entire project can be broken down into two major modules .i.e. Hardware module
and software module.
Requirements
Specification
Architecture
Software design
Hardware design
Integration
Testing
Figure 2 Design flow
The first to be considered were the requirements and specification flowed by the
architecture design. There were two separate designs for hardware and software modules
respectively. The step involved integrating the modules and finally testing.
5
2.3
Project Modeling
Figure 3 Project Modeling
3.
The Advanced Encryption Algorithm
3.1
Introduction
In 1997 the National Institute of Standards and Technology (NIST) of the United
States put out a call for proposals for a new symmetric algorithm, that will be called the
Advanced Encryption Standard (AES). The algorithm DES was since 1976 the standard
for symmetric algorithms, and a replacement was necessary. In 1998 15 can- didate
algorithms were accepted, and after one year of research 5 of these candidates were
announced as finalists:
6
• MARS (IBM corporation)
• RC6 (RSA Laboratories)
• Rijndael (J. Daemen & V. Rijmen)
• Serpent (E. Biham et al.)
• Twofish (B. Schneier et al.)
On all of these algorithms extensive research has been done to find attacks or weaknesses. According to NIST, all 5 finalists appear to offer adequate security. Also much
research has been done to test the performs of these 5 algorithms in both software and
hardware. In 2000 NIST announced that Rijndael was chosen as the successor of DES,
the AES. The combination of security, performance, efficiency, implementability and
flexibility made Rijndael an appropriate selection for the AES.
The candidates for the AES algorithm had to fulfill certain design criteria. First, of
course the algorithm should be a symmetric algorithm and it must be resistant
against all known attacks. Furthermore, the AES must be efficient in performance and
memory for different platforms. The design must be simple, and it should be able to
handle different key lengths (128, 192 and 256 bits). The block length of the cipher
should be 128 bits.
3.2
Overview of Encryption
Figure 4 Overview of AES
7
It starts with an initial round followed by a number of standard rounds and it ends with
the final round. Only four different operations are necessary to compute these rounds and
a key schedule.
It is possible in Rijndael to use different keylengths according to the security level that is
required for the application. Rijndael is defined as a block cipher with key lengths of
128, 192 or 256 bits. The possible input block lengths are 128, 192 or 256 for the
Rijndael algorithm. The AES algorithm is exactly the same as the Rijndael algorithm,
but it only defines one block length of 128 bits.
The Rijndael algorithm is such that each bit is dependent on all bits from 2 rounds ago,
e.g. full diffusion is provided. The number of rounds that must be run is dependent on the
key length.
Key length (words)
Number of rounds (Nr )
AES-128
4
10
AES-192
6
12
AES-256
8
14
Figure 5 Key length
3.2.1 AddRoundKey
The AddRoundKey operation is a simple EXOR operation between the State and the
RoundKey. The RoundKey is derived from the Cipherkey by means of the key schedule.
The State and RoundKey are of the same size and to obtain the next State an EXOR
operation is done per element:
s (i, j)= s(i, j) ⊕ w(i, j). (5)
where s is the current State, s the next State and w the round key.
8
Figure 6 AddRoundKey
3.2.2 Sub Bytes
The operation SubBytes is similar to the S-boxes used in the DES-algorithm. Rijndael
has only one S-box. The design criteria for the S-box are such that it is resistant against
the known differential and linear cryptanalysis and attack using algebraic manipulations.
Figure 7 The AES S Box
9
3.2.3 Shift Rows
Figure 8 Shift Rows
In ShiftRows the rows of State are cyclically shifted with different offsets. Row 1 is
shifted over c1 bytes, row 2 over c2 bytes, and row 3 over c3 bytes. The values of c1, c2,
and c3 depend on the block length Nb:
3.2.4 MixColumns
Figure 9 MixColums
10
The MixColumn transformation is an operation on the different columns.
To calculate the MixColumn transformation the columns of the current state are
considered as polynomials over GF(28).
3.2.5 Key schedule
The RoundKeys are derived from the CipherKey by means of a key schedule.
The number of RoundKeys necessary to encrypt one block of information depends on the
block length and key length as this determines the number of rounds. For a block length
of 128 bits, 11 RoundKeys (1 for initial round, 9 for standard rounds and 1 for the final
round) are needed.
Figure 10 KeySchedule
3.3 Process for Decryption
The Inverse Cipher:
The inversion of the cipher code presented in section 3.2 is straightforward and is just the
reverse process of encryption.
In this section, the whole reverse cipher of the state is described in detail. The state
undergoes in a decryption sequence:
a. Inverse shift row
b. Inverse sub bytes transformation using S-Box
c. Inverse mix column
d. Inverse sub key
11
3.3.1 Inverse Shift row
The inverse shift row step is just the inverse of the forward shift row operation used
in the encryption process. The first row is kept as it is and the second row is right
shifted one place, the third row is right shifted to two places and he fourth row is right
shifted to three places. The inverse right shift operation is explained for Nc = Nb = 4
in figure 11.
S0,0
S0,1
S0,2
S0,3
S0,0
S0,1
S0,2
S0,3
S1,0
S1,1
S1,2
S1,3
S1,3
S1,0
S1,1
S1,2
S2,0
S2,1
S2,2
S2,3
S2,2
S2,3
S2,0
S2,1
S3,0
S3,1
S3,2
S3,3
S3,1
S3.2
S3,3
S3,0
Figure 11: Inverse shift row
3.3.2 Inverse Sub bytes transformation using S-Box
The inverse sub bytes transformation uses the inverse S-Box table provided in the
figure 12.
Figure 12: Inverse S-Box table
12
3.3.3
Inverse Mix column
The inverse mix column transformation acts independently on every column of the
state.
Figure 13: Inverse mix column
3.3.4
Inverse Round Key
The inverse round key transformation is its own inverse. Each round key is generated for
each round.
4.
Modules
4.1
Software Module
The software module consists of implementing the Advanced Encryption Standard on
Intel’s IXP 2850. The IXP 2850 consists of two cryptographic units having hardware
cores of AES, 3DES and SHA-1. It also consists of a SDK for implementing various
functionalities for Network Processing.
4.2
Intel’s IXP 2850
The newest member of the Intel® second generation network processor product family,
the Intel® IXP2850 network processor, delivers high-performance packet and content
processing with robust security features in a single platform. By integrating capabilities
that have typically required multiple specialized processors, the Intel IXP2850 provides a
secure and cost-effective platform that enables a broad range of emerging applications.
13
4.3
IXP 2850 Cryptographic Unit
The cryptography unit performs authentication and bulk encryption. It is believed that
these two datapath tasks are critical strategic functions for the network processor. The
crypto engines are innovative designs that have a very small footprint, yet the two
engines provide 10Gb/s throughput performance.
14
4.4
IXP SDK 3.0
The Developer Workbench is an integrated development environment for assembling,
compiling, linking, and debugging microcode that runs on the IXP2400, IXP2800, and
IXP2850 Network Processor Microengines. The Workbench is a Microsoft* Win32*
application that runs on Windows 2000* platforms.
Features
Important Workbench features include:
• Source level debugging.
• Debug-only project creation mode.
• Execution history.
• Statistics.
• Media Bus device and network traffic simulation for the Network Processors
• Command line interface to the Network Processor simulators (Transactors).
• Customizable graphical user interface (GUI) components.
Debugging support
The Workbench supports debugging in four different configurations:
• Local simulation with no foreign model, in which the Workbench and the Network
Processor simulator (Transactor) both run on the same Microsoft Windows* platform.
• Local simulation with local foreign models, in which the Workbench, the Transactor,
and one or more foreign model Dynamic-Link Libraries all run on the same Windows
platform.
15
• Local simulation with a remote foreign model, in which the Workbench and the
Transactor both run on the same Windows platform and communicate over the network
with a foreign model running on a remote system.
• Hardware, in which the Workbench runs on a Windows host and communicates over a
network or a serial port with a subsystem containing actual Network Processors. (Not
currently available for IXP2800 Network Processors).
4.5
Implementing AES on IXP 2850
4.5.1 Design
Figure 14 AES design
16
4.5.2 Overview
The software implementation encrypts encrypting and decrypting on a single microengine
using one thread. The steps for doing the above are as follows:
1. Plain text to be encrypted is written form the DRAM transfer registers into the
input RAM of the crypto unit.
2. The key is written into both crypto units.
3. The text is encrypted using crypto unit 0 and the resulting cipher text is written
into the DRAM read registers.
4. The cipher text is copied from the DRAM read register into the DRAM write
register and then is transferred to input RAM of the crypto unit 1.
5. The cipher text is decrypted and is then written into the DRAM read transfer
registers.
6. The text is then compared to the original text to see if the program works.
4.5.3 Software Functions
The following important functions of the SDK 3.0 API are used for the above
steps:
Step 1: Writing the plain text into crypto input RAM
xbuf_alloc($$orig_plain_text, 16, write)
crypto_write_ram(
$$orig_plain_text[0],
DATA_RAM_ADDR,
8,
ENCRYPT_UNIT,
ram_sig)
ctx_arb[ram_sig]
Step 2: Loading the Key
crypto_load_key(
$$key[0],
3,
DECRYPT_UNIT,
17
CRYPTO_BANK,
DECRYPT_STATE,
key_sig)
ctx_arb[iv_sig, key_sig]
Step 3: Encrypt
crypto_cipher(
$$encrypt_data[0],
DATA_RAM_ADDR,
8,
CRYPTO_CIPHER_ENCRYPT,
CRYPTO_CIPHER_NO_CBC,
CRYPTO_CIPHER_AES_128,
ENCRYPT_UNIT,
CRYPTO_BANK,
ENCRYPT_STATE,
cipher_sig)
ctx_arb[cipher_sig]
Step: 4 Decrypt
crypto_cipher(
$$new_plain_text[0],
DATA_RAM_ADDR,
8,
CRYPTO_CIPHER_DECRYPT,
CRYPTO_CIPHER_NO_CBC,
CRYPTO_CIPHER_AES_128,
DECRYPT_UNIT,
CRYPTO_BANK,
DECRYPT_STATE,
cipher_sig)
18
4.5.4 Simulation Output
Figure 15 Original plain text
The above figure shows the simulation output’s data watch in which we can see the text
loaded into the Input RAM of the crypto unit.
19
Figure 13 Encrypted Text
The above figure shows the cipher text after encryption using AES 128 bit key.
20
Figure 14 Writing Encrypted text to DRAM
Figure 15 Decrypted text same as original plain text
21
5.
Hardware Implementation of AES
5.1
Design Process
The hardware design process was divided into blocks and each block was handled in
order. The design started with the interface design defining the data and instruction flow
from AES co-processor to IXP-2400. After the interface is designed the encryption and
the decryption blocks are designed using VHDL and simulated to make sure the modules
are working. The VHDL modules are then fed to the synthesizer to generate netlist and
simulated again to make sure the module is producing expected results. After the netlist is
available the floor-planning and placement tools are used to layout the final chip.
5.2
Interface Design
The interface of the AES block is designed to allow data flow between the INTEL IXP
2400 board and the security co-processor. A block diagram of the hardware
implementation of the AES algorithm and the interface is shown below.
P
C
I
AES2pack
128
busy
128
Data_out
P2AESburst
P2AESRnW
B
u
s
P2AESreq
mptdataout
C
O
N
T
R
O
L
Key_in
128
Encryption
Data_in
128
Data_out
128
Key_in
mptaddr
Decryption
32
128 Data_in
mptdatain
32
32
22
Burst Write Handshake:
1- The Processor checks if busy line is high. If high (busy), The processor drops its
request. If busy is low (not busy), the processor then drives P2AESBURST high
requesting a burst write transfer ( 4 long words per burst write).
2- The processor drives a valid address on the address/data bus,P2AESRnW low,
then drives the data and pulses P2AESREQ high for one clock cycle signaling the
AES chip that a valid address/data is available on the bus.
3- The AES chip latches the address/data and wait for the second data until all four
long words are latched.
4- When the processor is done with its burst write transfer, it drives P2AESBURST
low signaling the end of the burst write cycle. The Burst write handshake
waveform is shown below.
CLK
P2AESBURST
ADDR
MPTADDR[32]
P2AESREQ
MPTDATA[32]
DATA
DATA
DATA
DATA
Figure 16: Burst write Handshake
Burst Read Handshake:
1- The Processor checks if busy line is high. If high (busy), The processor drops its
request. If busy is low (not busy the AES has Encrypted/decrypted), the processor
then drives P2AESREQ and P2AESBURST high requesting a burst read transfer
(4 long words per burst read).
2- The processor drives a valid address on the address/data bus,P2AESRnW high,
then waits for the AES chip to place data on the address/data bus.
3- When the AES is ready, it drives the least significant word on the address/data
bus and pulses the AES2PACK high for one cycle signaling that a valid data is
now available on the bus.
23
4- The processor then latches the first data word and wait for the following data.
5- The AES repeats step3 until all 128 bit data are transfer.
6- When all 128 bit data have been received, the processor drives the P2AESREQ
Low signaling the end of the burst read cycle. The Burst read handshake
waveform is shown below.
CLK
P2AESREQ
P2AESBURST
P2AESRNW
MPTADDR[32]
ADDR
AES2PACK
MPTDATA[32]
DATA
DATA
DATA
DATA
Figure 17: Burst Read Handshake
24
5.3
Encryption Block
The encryption process for 128-bit data size goes through 10 rounds. The initial round
only adds the starting key and the input data and the result is the input of round 1. Round
1 through round 10 the starting data goes through sub-byte transformation, shift rows
transformation, mix column transformation and then added with the specific round key
generated for each round from previous round key. The flowchart of the top-level module
controlling the encryption block is shown in figure 18. All the modules are implemented
using VHDL and are provided in the appendix.
Figure 18: Top level AES encryption block
25
5.3.1 Control
The state machine defined in the control module and shown in figure 19 controls the AES
encryption rounds. The round constants used to generate key schedule are different for
each round and is provided in the state machine.
5.3.2
Figure 19: Control Finite State Machine
26
5.3.2
Rounds
Rijndael encryption goes through 10 rounds of encryption process. Each round the input
data is mapped into s-box, performed shift rows and mix column on it and then added to
the round keys. This operation is shown in the figure 20.
data_in
S-Box
Shift Row
Mix Column
last_mux_sel
Mux
Add Round Key
data_out
Figure 20: Block diagram of Round operation
The s-box transformation, shift rows and mix column are explained in details below and
also the simulation results are shown for each block.
27
i.>
Sub-Bytes Transformation
In the sub-bytes transformation the input data is mapped from the s-box table shown in
figure 7. For example if the input data is “193de3bea0f4e22b9ac68d2ae9f84808”, subbytes transformation is done for each 8 bits of XY where the first 4 bits is ‘X’ and the last
4 bits is ‘Y’. In the given input data the first 8 bits is 19, where
X = 1 and Y = 9
From the table provided in figure 7, corresponding mapped data for 19 is d4. Similarly
for the next 8bits is mapped and so on. The complete transformation is shown in figure
21.
19
a0
9a
e9
d4
E0
B8
1e
3d
f4
c6
f8
27
Bf
B4
41
e3
e2
8d
48
11
98
5d
52
be
2b
2a
08
ae
F1
E5
30
Figure 21: Sub-byte transformation
The mapped result is “d42711aee0bf98f1b8b45de51e415230”. The simulation result
shown in figure 22 shows that the “bytesub” result is “d42711aee0bf98f1b8b45de51
e415230”and hence confirms that the module for sub-bytes transformation is working.
28
Figure 22: Simulation of rounds
ii>
Shift Rows
After sub-byte transformation the shift rows operation is performed on the result. The
process of shift rows is the first row is kept as it is , the second row is shifted left once,
the third row is shifted left twice and the fourth row is shifted left thrice. For example if
the sub-byte output is “d42711aee0bf98f1b8b45de51 e415230” then the shift rows is
performed as shown in figure 23.
d4
E0
B8
1e
d4
E0
B8
1e
27
Bf
B4
41
bf
B4
41
27
11
98
5d
52
5d
52
11
98
ae
F1
E5
30
30
ae
F1
F5
Figure 23: Shift row transformation
The shifted result is “d4bf5d30e0b452aeb84111f177de961a”. The simulation result
shown in figure 22 shows that the “shiftrow” result as “d4bf5d30e0b452aeb84111
f177de961a”and hence confirms that the module for shift rows is working.
29
iii> Mix Column
The result from shift row operation is multiplied with a given matrix in mix column
module. The operation is explained in figure 24.
02
03
01
01
01
02
03
01
D3 26
01
01
02
7a
03
01
01
04
E0
48
28
66
cb
F8
06
81
19
E5
9a
4c
==
d4
E0
B8
1e
27
Bf
B4
41
03
11
98
5d
52
02
ae
F1
E5
30
X
Figure 24: Mix Column operation
The result of mix column operation is “046681e5e0cb199a48f8d37a2806264c”. From
figure 22 it is seen that the “mixcolumn” result is “046681e5e0cb199a48f8d37
a2806264c” and hence confirms that the mix column module is working.
iv> Key addition
In this step the round key generated by the key schedule, explained next, is XOR-ed with
the mix column result.
5.3.3 Key Schedule
In encryption the key schedule is generated from the 128 bit key input. The block
diagram of the key scheduler is shown in figure 25.
30
user_key
next_key
key_reg_mux_sel
MUX
load_key_reg
KEY REGISTER
key_word(3)
SHIFT
key_word(0)
key_word(1) key_word(2)
key_word(3)
T
S BOX 4
XOR
XOR
round_constant
XOR
XOR
XOR
Figure 25: Key Scheduler
31
5.3.4 Simulation Result
The encryption block is simulated with an input data and key length of 128 bits. The data
input data is entered as “3243f6a8885a308d313198a2e0370734” and the input key is
“2b7e151628aed2a6abf7158809cf4f3c”. The encrypted cipher output is
“3925841d02dc09fbdc118597196a0b32”. The result matches with the spec.
Figure 26. Encrypted result
32
5.4
Decryption Block
The encryption process for 128-bit data size goes through 10 rounds. The initial round
only adds the starting key and the input data and the result is the input of round 1. Round
1 through round 10 the starting data goes through sub-byte transformation, shift rows
transformation, mix column transformation and then added with the specific round key
generated for each round from previous round key. The flowchart of the top-level module
controlling the encryption block is shown in figure 27. All the modules are implemented
using VHDL and are provided in the appendix.
Figure 27: Top level AES block controlling decryption
33
5.4.1 Control
The state machine defined in the control module and shown in figure 28 controls the AES
encryption rounds. The round constants used to generate key schedule are different for
each round and is provided in the state machine.
Decrypt=1
INIT
Load
Inputs
Round 0
data_reg_mux_sel=11
key_reg_mux_sel=0
load_data_reg=1
load_key_reg=1
Round10
Round9
Round8
Round7
round_const = 00000100
data_reg_mux_sel=01
key_reg_mux_sel=1
load_data_reg=1
load_key_reg=1
last_mux_sel=1
data_reg_mux_sel=01
key_reg_mux_sel=1
load_data_reg=1
load_key_reg=1
round_const = 00000001
data_reg_mux_sel=01
key_reg_mux_sel=1
load_data_reg=1
load_key_reg=1
round_const = 00000010
data_reg_mux_sel=01
key_reg_mux_sel=1
load_data_reg=1
load_key_reg=1
Round6
Round1
Round2
Round3
Round4
Round5
round_const = 00110110
data_reg_mux_sel=00
key_reg_mux_sel=1
load_data_reg=1
load_key_reg=1
round_const = 00011011
data_reg_mux_sel=01
key_reg_mux_sel=1
load_data_reg=1
load_key_reg=1
round_const = 10000000
data_reg_mux_sel=01
key_reg_mux_sel=1
load_data_reg=1
load_key_reg=1
round_const = 01000000
data_reg_mux_sel=01
key_reg_mux_sel=1
load_data_reg=1
load_key_reg=1
round_const = 00100000
data_reg_mux_sel=01
key_reg_mux_sel=1
load_data_reg=1
load_key_reg=1
round_const = 00010000
data_reg_mux_sel=01
key_reg_mux_sel=1
load_data_reg=1
load_key_reg=1
round_const = 00001000
data_reg_mux_sel=01
key_reg_mux_sel=1
load_data_reg=1
load_key_reg=1
Figure 28: Finite state machine
34
5.4.2
Rounds
Rijndael decryption goes through 10 rounds of decryption process. Each round inverse
shift rows is performed and then it is mapped into inverse s-box. Round key is added with
the result and then inverse mix column operation is performed. This operation is shown
in the figure 29.
data_in
Inverse Shift Row
Inverse S Box
Add Round Key
Inv. Mix Column
Last_mux_sel
Mux
data_out
Figure 29: Block diagram of Round operation
The inverse s-box transformation, inverse shift rows and inverse mix column are
explained in details below and also the simulation results are shown for each block.
35
i>
Inverse Shift Rows
The inverse shift rows operation is performed on the input data. The process of inverse
shift rows is the first row is kept as it is , the second row is shifted right once, the third
row is shifted right twice and the fourth row is shifted right thrice. For example if the
input data is “e9317db5cb322c723d2e895faf090794” then the shift rows is performed as
shown in figure 30.
E9
Cb
3d
af
E9
Cb
3d
af
31
32
2e
09
09
31
32
2e
7d
2c
89
07
89
07
7d
2c
B5
72
5f
94
72
5f
94
B5
Figure 30: Inverse Shift row transformation
The shifted result is “e9098972cb31075f3d327d94af2e2cb5”. The simulation result
shown in figure 32 shows that the “shiftrow” result as “e9098972cb31075f3d327d94
af2e2cb5”and hence confirms that the module for inverse shift rows is working.
ii.> Inverse Sub-Bytes Transformation
In the inverse sub-bytes transformation the shifted data is mapped from the inverse s-box
table shown in figure 12. For example if the input data is
“e9098972cb31075f3d327d94af2e2cb5”, inverse sub-bytes transformation is done for
each 8 bits of XY where the first 4 bits is ‘X’ and the last 4 bits is ‘Y’. In the given input
data the first 8 bits is e9, where
X = e and Y = 9
From the table provided in figure 12, corresponding mapped data for e9 is eb. Similarly
for the next 8bits is mapped and so on. The complete transformation is shown in figure
31.
36
E9
Cb
3d
af
Eb
59
8b
1b
09
31
32
2e
40
2e
A1
C3
89
07
7d
2c
F2
38
13
42
72
5f
94
B5
1e
84
E7
D2
Figure 31: Inverse Sub-byte transformation
The mapped result is “eb40f21e592e38848ba113e71bc342d2”. The simulation result
shown in figure xxx shows that the “bytesub” result is “eb40f21e592e38848ba113e71
bc342d2”and hence confirms that the module for inverse sub-bytes transformation is
working.
Figure 32: Simulation of rounds
iii>
Key addition
In this step the round key generated by the key schedule, explained next, is XOR-ed with
the mix column result.
37
iv>
Inverse Mix Column
The result from shift row operation is multiplied with a given matrix in inverse mix
column module. The operation is explained in figure 33.
87
F2
4d
97
6e
4c
90
Ec
46
E7
4a
C3
A6 8c
D8 95
=
0e
0b
0d
09
09
0e
0b
0d
0d
09
0e
0b
0b
0d
09
0e
X
47
40
A3 4c
37
D4 70
9f
94
E4
42
3a
Ed A5 A6 bc
Figure 33: Inverse Mix Column operation
The result of mix column operation is “876e46a6f24ce78c4d904ad897ecc395”. From
figure xxx it is seen that the “mixcolumn” result is “876e46a6f24ce78c4d904ad897
ecc395” and hence confirms that the inverse mix column module is working.
5.4.3 Key Schedule
In decryption the key schedule is generation is the inverse of the encryption process. The
block diagram of the key scheduler is shown in figure 34.
38
user_key
next_key
key_reg_mux_sel
MUX
load_key_reg
KEY REGISTER
next_key_word(3)
SHIFT
key_word(0)
key_word(1) key_word(2)
key_word(3)
XOR
T
S BOX 4
XOR
XOR
round_constant
XOR
XOR
Figure 34: Key Scheduler
39
5.4.4 Simulation Result
The decryption block is simulated with an input data and key length of 128 bits. The data
input data is entered as “3925841d02dc09fbdc118597196a0b32” and the input key is the
key generated at the round 10 of encryption process, which is “D014f9a8c9ee2589e13f0
cc8b6630ca6”. The decrypted cipher output is “3243f6a8885a308d313198a2e0370734”.
The result matches with the spec.
Figure 35. Decrypted result
40
6 Synthesis of the AES Chip
The previous sections of the documentation dealt with the front end design of the chip.
After the RTL is done in VHDL, the next immediate step is to synthesize the chip. This
part of the document will walk you through the process of synthesis of chip.
6.1 Division of the Chip design work:
The work of chip design was divided in the following manner:
ARCHITECTURE
Mr. Ali Bassam
“Partition of the chip in functional blocks and design of the interface”
FRONT END DESIGN
Ms. Nasreen Ahmed
Mr. Shafaat Qureshi
“Enter the HDL code and verify the functionality”
BACK END DESIGN
Mr. Sanjit Kurup
“Synthesize, Floor plan, place and route, clock tree generation, physical
verification”
FABRICATION
MOSIS
Fig 6.1
41
Architecture design:
The step involves partitioning of the chip in the functional units. As seen in the hardware
section, the chip was divided in different functional blocks such as the interface,
Encryptor, Decryptor. Within each of these sections, there were several fundamental
divisions such as S-box, Shift row, mix column, add round key. This task is one of the
most challenging as it involves the clear understanding of the specification.
Front End Design:
The task of the front end design team was to mainly design the state machine of the
individual blocks as per the specification. Based on the state machines or the flow charts,
RTL codes are written, and the language used is VHDL. These codes have to be tested
for their functionality using the simulator. The tool used in this case was NCSim from
Cadence and the results are as explained in the previous section.
Back End Design:
The back end design involves the synthesis of the logic circuit, verifying the gate level
simulation, place and route and many other intricacies of the chip layout. This section of
the document explains each of these steps in detail, especially in the context of the AES
chip.
Fabrication:
The AES chip is not yet fabricated for the reasons explained in the subsequent sections.
Assuming that the GDS2 file or CIF file was generated for the chip, the obvious choice of
fabrication would be the MOS Implementation Service (MOSIS).
Before going into the depths of the chip design, here is the flow chart for the different
steps and processes involved in the chip design and synthesis. As shown in the figure,
the portion of HDL entry and its functional verification is over. Next comes the synthesis
and the simulation of the netlist generated out of it.
42
Fig 6.2
6.2 Synthesis of the chip:
The process of converting the RTL to the equivalent gate level circuit is called synthesis.
This is an automated process usually undertaken by the software being used. The
software used in this project is from Cadence Design System Inc. called Physically
Knowledgeable Synthesis (PKS). Fig 6.3 gives the snap shot of the Physically
knowledgeable synthesis tool.
43
Fig 6.3
Amongst the different steps involved in the process of synthesis, the primary one is the
loading of the design libraries. The design library used is the gscLib2.0. This is a generic
library, and the chips synthesized using this library cannot be fabricated by any
commercial fabrication unit. The reason is obvious as the tools have been donated by
Cadence Inc. for education purpose only, and not for commercial manufacturability or IP
development.
The Technology Library Cell library contains the definition of the basic cells such as
Inverters and Buffer Cells of various strengths
Cells with basic logic functions
AND
NAND
AND_OR
OR
NOR
OR_AND
XOR
XNOR
MUX
Various input configurations of each
•
Two, three or four input cells
•
Example: AND2, AND3 and AND4 cells
44
After the library is sourced, a generic netlist is generated. It involves the generation of the
gate level circuitry from the VHDL logic. The gates that are used are not specific to the
synthesis library. Instead, it’s generic, which is the reason it is called generic netlist.
To elaborate on the process, here is an example. Consider the following VHDL code:
process (ain,bin,cin)
ain
begin
if cin = '1' then
bout <= ain;
bin
cin
else
MUX
bout <= bin;
end if;
end process;
bout
In the above example, the VHDL code generates a multiplexer. This is the process of
generating the generic net list. However, it is not necessary that the Library used for the
actual synthesis will have all the basic gates defined in the generic netlist.
For example, consider the following example. If the generic netlist creates a 15 input and
gates, it is not necessary that the Technology library will have a 13 input AND gate.
Instead the library will have an AND gate with maximum input of 4. Therefore, there
needs to be a process which converts the 13 inputs AND gate into its equivalent netlist
containing 4 input AND gates. This is done in the second step of synthesis, called
Technology mapping, in which the generic netlist is redefined using the technology
library. The Fig 6.4 graphically explains the concept.
45
Generic Map
Technology Map
Fig 6.4
Defining the Constraint for the synthesis:
Before actually performing technology mapping, the constraint of synthesis for the chips
needs to be defined. This is defined in two stages. First is to define the .tcl file, where the
tcl stands for Tool Control Language. This is a script file will define the constraints to the
different input pins, clock pins and the output pins. The .tcl file content that was used to
synthesize the AES chip is as given below
set_global fanout_load_limit 20
set_clock clk -period 30.30 -waveform { 0 15.15 }
set_clock_root -clock clk clk
set_clock_uncertainty 0.150
set_external_delay -clock clk 2 [ get_names [ find -outputs ] ]
set_input_delay -clock clk 2 [ get_names [ find -inputs ] ]
set_clock_insertion_delay 0.1 {clk}
46
As you will observe in the content of the .tcl file, the clock frequency is defined at 33
Mhz with a 50% duty cycle. Also, the clock uncertainty is 0.15 ns. The
set_external_delay defines all the other inputs and sets the input arrival delay at 2 ns.
Similarly, the outputs are also defined to have a delay of maximum 2ns.
Fig 6.5
The figure above shows the other constraints that have been defined during the process of
synthesis. As you will observe, there are no optimization defined for the synthesis. The
reason for not constraining the design is explained in the next paragraph.
After setting the parameters and defining the constraint files, the technology mapping is
done. The time required by the tool for synthesis is approximately 1 hour 15 minutes! For
commercial chips, synthesis has known to take days to complete. This is the primary
reason why the design was not defined to be optimized in order to save time during the
initial stages of synthesis.
After synthesis, the output looks like the screens given below. Fig 6.6
47
Fig 6.6
48
Fig 6.7
Fig 6.8
49
Fig. 6.9
Fig. 6.10
50
Fig 6.11
The fig. 6.7 and 6.8 gives different views of the synthesized chip.
Synthesis Reports:
There are several reports that are generated during the process of synthesis. It is
imperative to look into these reports to verify the operability of the synthesized circuit.
Fig 6.8 and Fig. 6.9 gives the summary of these reports. As observed, the worst slack is
14.506 for the 33 Mhz clock, which is significantly a good figure. The core utilization
and the cell area is also given. Fig 6.9 gives the details on the process technology.
One of the important reports is the timing report. Fig 6.10 gives the detailed timing
analysis. Although, it’s just a portion of the complete report, it gives the gist and can be
used to find the slack, setup time and hold time
51
Gate Level Simulation:
One of the reports generated during the process of synthesis is the verilog netlist. The file
has extension of .v and is used to verify the functionality of the synthesized logic. A
portion of the .v file generated for the AES chip is as given Fig 6.11
module control(reset, clk, encrypt, data_sel, load_data, key_sel, round_const, last_mux_sel
, load_key);
input reset;
input clk;
input encrypt;
output [1:0] data_sel;
output load_data;
output key_sel;
output [7:0] round_const;
output last_mux_sel;
output load_key;
wire [3:0] state;
BUFX3 i_16753(.A(\nbus_135[1] ), .Y(data_sel[1]));
BUFX3 i_16717(.A(n_4876), .Y(last_mux_sel));
BUFX3 i_16705(.A(n_4861), .Y(load_data));
BUFX3 i_16679(.A(n_4831), .Y(key_sel));
BUFX3 i_16667(.A(n_4816), .Y(load_key));
NAND2X1 i_49(.A(n_60), .B(n_69), .Y(n_50));
OAI21X1 i_2(.A0(n_57), .A1(n_58), .B0(n_68), .Y(n_49));
NAND2X1 i_3(.A(n_44), .B(n_67), .Y(n_47));
NAND2X1 i_41(.A(n_60), .B(n_66), .Y(n_44));
AOI21X1 i_4(.A0(state[3]), .A1(n_56), .B0(n_41), .Y(n_43));
MX2X1 i_19(.A(state[1]), .B(n_61), .S0(state[2]), .Y(n_42));
AOI21X1 i_37(.A0(n_59), .A1(n_57), .B0(n_42), .Y(n_41));
Fig 6.12
This verlog file is used to do the netlevel simulation. The test bench that was used for
testing the .VHDL file for functional simulation, the same file is used to simulate the test
vector for this verilog netlist. After running the simulation, and fixing a few bugs, the
verilog netlist generated the waveform, which was exactly the same as the VHDL file
shown in the previous section.
52
Place and Route of the Chip:
The System On Chip (SoC) Encounter tool from Cadence Design System Inc. has been
used to Place and Route the chip. The snap shot in Fig 6.12 and Fig 6.13
Fig 6.13
53
Fig 6.14
The IO Pad with buffers has not been designed, which is the reason the GDS2 or CIF file
has not been generated. This will be done shortly and the document will be upgraded.
Fig 6.14 gives the summary of the chip in short.
54
TH E AES CH IP
Fig 6.15
55
Conclusions
1.
The entire code running on one microengine in a single thread
environment took 1185 microengine cycles. Further optimization is possible by
applying parallelism and using more microengines.
2.
The crypto units of the IXP 2850 operate at 700 MHz providing high-
speed bulk encryption and decryption. All the symmetric key encryption is
offloaded to the crypto units, which also are used to calculate message digests,
message authentication checks and checksums. The crypto units can perform bulk
of packet transformations at high speeds freeing the microengines to perform
other packet processing tasks.
3. The Intel SSDK 3.0 provides an excellent development environment with lot
of built in functions. It has a good programming environment and debugging
interface.
4. The VHDL implementation of the AES algorithm was successful. Both the
encryption and decryption blocks are designed and simulated and the results
are as expected.
5. The AES chip was successfully synthesized using different set of tools from
Cadence Design System Inc. The netlist generated out of synthesis was
successfully tested for its functionality. Also, the place and route of the chip
was completed.
56
7.
References:
 Daemen J, Rijmen V; “AES Proposal: Rijndael” 1999
 “Security: Adding protection to network via the network processor”; Intel
technology journal; 2002
 “Network processors” Intel technology journal; 2002
 “Intel IXP2850 Network processor programmer reference model”; Feb
2003.
 “Intel IXP2400/IXP2800 Network processor development”; Feb 2003.
 “Intel IXP2400/IXP2800 Network processor programmer reference
model”; Feb 2003
 “Intel integrated performance primitives for Intel architecture Reference
model Volume 4 Cryptography;2003
 B. Gladman, “The AES Algorithm (Rijndael) in C and C++, performance
of the optimized implementation,” from
http://fp.gladman.plus.com/cryptography_technology/rijndael/index.htm
 System on Chip CECS 460 B materials from http://crete.cadence.com
57
Download