IEEE JOUIRNAL OF SOLID-STATE CIRCUITS, VOL. 23, NO. 3, JUNE 647 1988 Security and Performance Optimization of a New DES Data Encryption Chip INGRID VERBAUWHEDE, FRANK HOORNAERT, SENIOR MEMBER, IEEE, AND HUGO Abstract high formance result the — Cryptograpfdcaf security. This Data Encryption Standard between design transformations algorithms. wonld These strategy. CAD data encryption floor which are combined Novel tools high and It chip. is the and chip designers. not At and equivalence plan with problem do speed of a new high-per- optimization a serious optfmizations, nor the security, and layout (DES) cryptograpfdcal present both cryptographers lead to a very efficient otherwise algorithm level, demand the implementation of close cooperation system which applications paper presents minimal for data compromise with a highly routing, scrambling the structured are nsed at different DES design steps in the design process. The resnft is a single chip of 25 mmz in 3- pm double-metal CMOS, which Functionality tests show that a clock of 16.7 MHz means that a 32-Mbk/s modes. This execution is the fastest of all four DES DES can be applied, data rate can be achieved for all eight byte chip reported modes of operation yet, aflowing equally due to an originaf fast pipeline architecture. JOOS VANDEWALLE, J. DE MAN, FELLOW, IEEE sender and no one else. Both sender and receiver want be sure that the integrity that an opponent of the message is guaranteed, did not change, insert, or delete parts of the message. As electronic communications and networks grow, there is not only a need for compatibility on communication protocols but also for common methods for data protection. To set up a secure communication between different parties of different organizations all over the world s@zdards have to be defined [1], [2]. Although there are several DES chips available on the market, there is still a need for higher performance the existing chips and more secure DES chips. None combines high speed with standards. INTRODUCTION U NTIL the last decade, cryptography was the domain of the diplomatic and military world [3]. Due to the microelectronics (revolution a need for commercial c~ptography has emerged. performance digital circuits The ever cheaper and higher have caused a rapid expansion of international telephone communications, computer works, etc. Electronic mail, electronic funds transfer, recently tion theft. smart may cards are part of daily create However, security threats the same digital netand life. But this evoht- from eavesdropping circuits allow to for small and elegant cryptographical solutions to replace their rigid and heavy mechanical counterparts. A classical algorithm, like the Data Encryption Standard (DES) [1], protects data in two ways. First, priuacy is protected. After message, sent such encryption, over as electronic receiver. A second that of authentication. sure that the sender can be sure that the an insecure mail, is only and often After communication read by more important decryption, the message he received channel the intended demand is the receiver can be came from the original modes and 0018-9200/88/0600-0647 it implements output but cial the Data Encryption dependency at a certain without speed allows to include in his ciphertext. moment also on previous approach of Stan- most widely used standards for enciplner[1]. Second, it implements all operation in the standard, “DES Modes of Operreason for this is that the strength of a system depends heavily on its use. These give the user the opportunity time feedback In this way the not only depends on the input inputs or ciphertext outputs. A spe- the implementation of modes reduction and with a small all overhead in silicon. Third, one can easily use the chip for bankingrelated standards (for message authentication and financial key management), as in the generation of message authentication codes (MAC’s) and triple encryption. Key management and key security both demand ple key registers on chip. As a practical balance multibetween user convenience of key management and silicon area, four key registers are implemented, e.g., for two master keys and two session keys. This consumes the data-path algorithm, Manuscript received Se~tember 28, 1987; revised February 1, 1988. This work was performed m collaboration with the Cryptech company. I. Verbauwhede and H. J. De Man are with the IMEC Laboratory, B-3030 Heverlee, Belgium and with the ESAT Laboratory, Katholieke Universiteit Leuven, B-3030 Heverlee, Belgium. F. Hoomaert is with C J. Vandewalle is with %t&~~!-~~j~#~lsK~~~~~~e Uttiversiteit Leuven, B-3030 Heverlee, Belgium. IEEE Log Number 8820717. First, dard, one of the ing computation modes specified ation” [2]. The cryptographical of all require- ments for high security and user-friendliness. The chip presented here has a unique combination 1. to i.e., like area. The security the DES, depends about of 30 percent a publicly only of known on the security of the keys. If the keys are compromised, the whole system is. This requires that, once entered, keys may not leave the chip. Hereto special key registers are implemented which cannot be scanned out. More generally, from the system performance point of view, a cryptographic $01.00 01988 IEEE device should be easy to use :and IEEE JOURNAL 648 The algorithm - CV@ZW@ti mquirsmmts rsstridons – security 1,=-.] ------- ! /gor/thm ‘“9’”’=””” IL Fig. VOL. 23, NO. e.g., for IMPLEMENTING A MULTIMODE DES INTO A COMPACT FLOOR PLAN General design methodology. 1. bulk encryptions, satellite communications, or ISDN. The insertion of an encryption device may not slow down the overall performance of a system and must also be flexible for use in a large number of environments. If someone tries to tamper, a general reset and especially a disabling of all key registers must occur. The general applied in design methodology this particular cryptographical case, as shown requirements can also be in Fig. smaller subproblems. This results in a data path consisting of a DES part, a key management part, an input/output and a mode functionality in this A. The DES Algorithm whereby part, Encryption other devices in a system. Processor (DEP) part. For each module, the are discussed below. [14] is very flexible and its Hardware As shown in Fig. 2 a single DES calculation [1] is a sequence of a 64-bit initial permutation, a consecutive calculation of 16 rounds, and a 64-bit inverse initial permutation. A round can be compared with an iteration step, difficult with calculation and its implementation 1. All that are implemented speed of 14 Mbit/s, can calculate only three limited modes and has a synchronous 1/0 interface which makes it to communicate implemen- tation of the DES algorithm but to provide a more general cryptographical device of which the DES computation is only a part. In order to deal with all cryptographical requirements, the global problem has been partitioned into part, of an ASIC chip have been described briefly. They can be found in more detail in [4]. There exist several DES chips [12] –[14]. Most of these can compute only the DES algorithm [12]. The fastest DES chip reported yet [13], with a maximum The Data 1988 design task, which consists of The aim of this work is not a straightforward fast, 3, JUNE detail in [4]. The layout engineering task is the topic of this paper. Only the important aspects of the high level design, necessary to understand the context, are repeated here. ? Wi engineering CIRCUITS, the translation of the cryptographical and security requirements in a floor plan and architecture, is reported in full K ‘, OF SOLID-STATE the number shown round. in Fig. of iterations is fixed 3, contains hardware at 16. The DES It consists of 32- and 48-bit modulo for one DES 2 adders (XOR’S to use and is realized as a programmable microprocessor. But this is at the cost of a much slower speed, maximum Al and A2), eight nonlinear 4.7 Mbit/s, sion E, a permutation P, and two registers of 32 bits. In the floor plan and layout the S boxes are realized as the eight PLA’s (see the bottoms of Figs. 15 and 16). PLA’s are the best way of implementing these nonlinear functions, which are defined in table format, since these None and at the cost of an overhead of these chips implements with on silicon area. the same speed and the same ease MAC’s or triple encryption or can handle multiple keys. It is unique to have all of these cryptographical requirements combined in one device. This has been achieved by applying a modular design strategy, the result of which is explained in Section II. A modular tained by algorithm and efficient applying floor plan some algorithmic engineering could knowledge and layout restrictions 1). In Section the choice of algorithmic leading to a minimum routing be ob- equivalences. can only be efficiently layout III only are known and silicon The done if some (Fig. equivalences area are dis- inputs and four substitutions PLA’s that suited way, these PLA’s were possible, to be as random nonlinear arbitrary cannot be minimized that would with six as possible. in a compact functions. scattered. some structure and functions S boxes), an expan- to implement, and ONE’S are randomly the tables substitution (also called are supposed are well structured outputs because the ZERO’S Indeed, would and We found if a reduction have to be present constitute a weakness in of the cussed. Starting from the architecture and floor plan, the layout engineer wants to generate in a minimal design time layout security of the DES algorithm. On top of the PLA’s the 32-bit hardware routing for the permutation P is placed, which at the same time makes a connection between the PLA’s and the 48 bit slices of the rest of the DES part. In that is fast, compact, and highly efficient. vanced CAD tools are helpful. The whole and the related CAD tools are explained and the end could be realized without the use of routing by using the inherent structure. In this way a considerable Even for a cryptographical Therefore addesign process in Section IV. chip, the design conforms to a conventional design and layout strategy, as explained in Section V. However, testability is something special and at first sight seems to be in conflict with security requirements: one must be unable to scan out keys or intermediate ciphertext. The solution to this dilemma is discussed in Section VI. Finally, Section VII contains the implementation and test results of the first prototype. contrast, amount the large of silicon in Section 64-bit permutations at the beginning area has been saved, as will be explained III. B. The Keys The key module implements two tasks: 1) the safe storage of the keys, and 2) the key schedule calculation. For each DES round, a subkey of 48 bits has to be VERSAUWHEDE el al.: OPTIMIZATION OF DES DATA ENCRYPTION CHIP 649 St3nder Receivw ~-1 S ~1. w +3 16 DES 56 +.... munda Plain w &y I Fig. ‘T f R3rseInitial Fig. 2. 5. Definition transition l..._T__J_J ~ l..r ~–. key pitrt. for Above this and the routing, ( LR) is implemented. the implementation the Most of the ‘four key KRI – KR41 As shown’ in Fig. 4 an incoming key has 64 bits. The key parity -.J and decryption. the 56 bit slices of the keypart of the DES area is needed &iiL:& Phlin plt for left right shifting registers, ~,–-.+ PC2 is located, which consists of two 28-bit next to each other. This enables an easy slices hardware – 1 of the CBC mode, encryption between 32 bit The main parts of the DES algorithm. K’w cipher M and selection parts placed x “64 U* -+ MS DES =i enters is checked only once, when the the chip,’ so 8 bits are dropped. A hardware crisscross rcwting for the initial key permutation F’C1 could be saved by using the inherent structure; Therefore the initial key permutation PC1 is also computed on’ly once when the key enters the chip and the keys are stored in a permuted fashion. The algorithmic equivalent to save this routing is explained C. The Modes: L* Fig. 3. in Section III. Combining Feedback and Pipelining For a given fixed key and given fixed data input, the DES cipher output is always the same. An opponent could Architecture thus use frequency analysis to retrieve the original text [3]. To avoid this, several DES modes of operation are defined [2]. These provide feedback and block chaining. As an for one DES round. example, key inplr J-” in’ Fig. 5 the cipher block defined. In this mode a given fixed input key will produce different chaining (CBC) ii and a“given fixed cipher outputs, because they do not depend only on the input but also on previous inputi. From the implementation point of view,. a pipeline is introduced to increase the data rate through the chip. The inherently slow on–off chip communication “is done in parallel with the DES calculation. The previous cipher text is written out while the actual plain ciphered and the next one’ is read in. 16 Subkey generation seems to be in I example, with the feedhac~ ~modes. mode’ the new input for the following 2 sum of the -previcm? cipher The key schedule calculation and the new data input. This conflict has’ been resolved modulo ?ES For in the CIEIC the” bit-wise -t-” 4. conflict the DES device is configured .A . . . . . . -------- -~ Fig. when text is being enAt ,first siglit this calculation by placing is cy,@ut the mode generated, as shown in Fig. 4. The input key is also 64 bit, but 8 bits are used for parity checking [1]. After an initial key permutation, PCJ, the 16 subkeys, one for each round, calculation strategically between the DES hardware and the input/output hardware and by combining mode calculation with the internal data transport on chip. How the feedback loop can be combined with the pipeline is explained in Fig, 6 for the same CBC mode (encryption ahd are derived decryption). from the 56-bit One subkey is obtained after a 56- to 48-bit key selected for encryption. after some left or right shifting permutation and selection, and PC2 [1]. The key hardware is placed in the middle of the layout (see Figs. 15 and 16). At the bottom, the key permutaticm enciphered The newly entered the DES part or vice-versa done shown on data are xoRed data while they are transferred the fly between with [4], [5]. This mode-calculation every in Fig. 7. This technique two the from the 1/0 pipeline to is stages as of mode calculation does IEEE JOURNAL 650 OF SOLID-STATS CIRCUITS, 23, NO. 3, JUNE 1988 VOL. )7 10 m — DES DES (a) 10 E! ~: x :Iti . l.(+) ... . . DES (b) Fig. 6. Mode Fig. 8. Hierarchy: one head controller, and four data-path modules (c) calculation combined with internal (b) CBCenc, and (c) CBCdec. transport: -k–l pipnll”e mk communications plfall. S+a90k+l (LC’S), (a) ECB, most @peline four local controllers (cross-hatched). placed m ,, with applications in a microprocessor it is important a number modifications of different to the original ing the communication Moreover, or for DMA. that the chip systems with For can be minimal system and without decreas- speed. if h the future this DES hardware module is used as a small encryption corner on a large VLSI digital device, it is only this 1/0 frame that has to be stripped off or replaced by a smaller one, e.g., for on-chip interprocessor E. communication. The Controller In addition needed. Unit to the data Extra features path a controller like MAC’s, triple structure encryption, is etc. can be computed with the same data path but impose strict requirements on this controller. As shown in Fig. 8 the controller has also been designed module has its own local controller, only (b) Fig. 7. Fast internal with the main codes incoming mode calculation in between two pipeline stages: (a) ECB, and (b) CBCenc. controller. commands hierarchically. Each which communicates This central D. The Inp@/Output The input/output tion. On the floor limited. pendent which is hardware nous way the keys. from the Therefore, (IO) Part module plap realizes off-chip of the prototype, communica- this is still very in a second implementation an inde- 1/0 interface (1/0 frame) is being designed, divided into two main parts. The internal DES communicates with this 1/0 frame in a synchroand performs mainly a redirection of the data or The second part can read or write data to and outside world in a asynchronous or semisynchro- nous fashion. It can combine 8, 16, or 32 bits and it contains or separate data for buses of the necessary glue logic for de- part of the chip has to be activated. The local controllers can execute a limited long sequences, e.g., the local DES controller not decrease the speed of the device. The same can be shown for all other modes. Hence a unique feature of the design, not found inother implementations, is that allfour modes are equally fast. controller and decides which number of can generate sequences for DES or inverse DES (DES-1) calculations. The local controller for the modes decides the configuration of the modes instruction register for the correct internal transport and mode calculation. The main controller synchronizes the local controllers for parallel execution as, e.g., for the pipeline execution of Fig. 7. It monitors the local controllers until every module has finished. The execution times can be of varying length, e.g., depending cation on the speed of the external or depending on single or triple 1/0 communi- encryption. The communication between two modules is restricted to one common register between every pair of modules, which is strictly monitored, as shown in Fig. 8. Neighboring local controllers can give instructions to these registers but at different time instances. For example, the 1/0 register is shared by the 1/0 part and modes part. The 1/0 part internal fills or clears it, and the modes part transport. The local controllers uses it for use the common VERBAUWHEDE registers et a[.: but are allowed the 1/0 OPTIMIZATION only OF DES DATA the main controller to use it. For example, register, The main 1/0 controller starts decides if the 1/0 the modes part controller ENCRYPTION TABLE when they THE part is filling is not allowed alternatively 651 CHIP INITIAL I PERMUTATION 1P to work. the DES IP= and/or [58 50 42 34 26 18 10 2 60 .52 44 36 28 20 12 4 the chip, the key and 62 54 46 38 30 22 14 6 modes sequences in between different pipeline sections (cf. At present, the longest modes Fig. 7)I should be minimal. 64 56 48 40 32 24 16 8 and then the key and/or To maximize the data rate through modes controller. 57494133251791 sequence takes only nine clock cycles and the longest key sequence takes three clock cycles. the controller is very small, as the In tlhe first prototype objective was to test the functionality rate through the chip. The 51 43 35 27 19 11 3 53 45 37 29 21 13 5 63 .5.5 47 39 31 23 15 7 ] of the data path. In the actually completed second implementation, mance of the controller has been optimized data 59 61 challenge the perforto maximize for the main controller ‘is on the one hand to provide the external user with simple powerful high-level commands, but on the other hand to keep calculating without were not possible, data path would F. The Overall For shown not be available parts of the data path to the outside world. Floor Plan the first illustrated all parallel losing time in between subtasks. If this the internal calculation power of the prototype in Figs. the floor 15 and leads to minimum plan 16. The and layout relative wire and’ bus length. are placement Each ‘mod- ule is bit-slice designed and has the same width, though they have different word lengths. Thus modules are abutted and routing is avoided. Fig. III. ALGORITHMIC 9. EQUIVALENCES FOR AN Straightforward hardware mented compared routin for 1P which with t Ee data path. is not 48. 84’2- imple- EFFICIENT FLOOR PLAN The global layout permutations, 28-bit (Fig. 16) contains one of 56 bits (which routing) and only two hardwired is in fact two times a one of 32 bits. algorithm as defined in the standard large permutations, a 64-bit initial 64-bit final permutation 1P-1, and permutation with PC1. dedicated However, These permutations crisscross routing the DES but are not by using realized yield this hardware A. Byte-Oriented For saving will be briefly explained. IP and IP-1 Realization illustration, standard the IP table [1] as defined is given as Table 1. It reads as follows: in the bit 58 of the input is the first, most significant bit of the permuted input, bit 50 is the second and so on, and bit 7 is the last, least significant 64-bit channel bit. routing, The d@l_L. .4 9 ...1 ..56 124J574J 57.-. ‘ corresponding straightforward which is not implemented, yg%~ a shift technique and an optimal arrangement of the floor plan [4]. The three most important algorithmic equivalences which 33E~~~1 contains three more permutation 1P, a a 56-bit initial key is shown in Fig. 9 to compare it with the actual data path. If the width is taken the same as for all other modules, the height is 500 pm and this only for IP. 33 . . . Fig. [ ~ 10. 49 . . . 41...48 a Equivalent realization I 56 57. of 1P: floor-plan ..s4 organization However, IP is byte oriented and so can be realized with a shift technique [4]. IP is calculated when going on–off chip. Instead of two 64-bit complex channel routings, one for IP and one for 1P-1, the 64-bit 1/0 register is arranged as a concatenation of eight shift registers of 8 bits, One can prove [5] that this shift-register concatenacorretion together with a small floor-plan reorganization sponds to the initial Fig. 10, no scrambling permutation routing 1P. As can be seen in remains. back after the DES round calculations byte per byte performs 1P-1. Shifting the result and reading it out IEEE JOURNAL 652 -% OF SOLID-STATE CIRCUITS, u VOL. 23, NO. 3, JUNE 1988 Architecture Fig. 11. Equivalent realization of PC1: floor-plan + I I & organization. I 1 ! Conrmmmion ! FbOrplan OrgmlmUon 5 ‘-@J ‘* 16 DES rounds 12. 1 —.— u Fig. Equivalent realization: 1P and 1P-1 outside of the modes. the feedback c1 output : – – switch Iavel - transistor – pattern loops layout test B. Equivalent PCI Realization Fig. A similar applied shift technique to realize the initial with seven registers key permutation into two 28-bit registers, PC1. necessary The for key schedule calculation. Again, similar to the 1P realization, it is the switching from serial shift to parallel use that performs the permutation. C. 1P and IP-1 IP and while modes, IP-I Fig. 1P are brought however, the 8-byte in Outside the Feedback Loops of the Modes these are calculated and 12. The feedback loops on chip. For example, mode (CFB), simplification outside the feedback when going on or off chip. The are calculated cipher IP-1 outside the feedback for this is explained one obtains loops by bringing is based on the fact that IP and 1P-1 are each others inverse, so lP-l(lP(Data)) = Data. The combination of both techniques, the equivalent shift technique and bringing IP and 1P-1 outside the feedback loops, allows for very compact realization of the modes hardware. First the hardware routing for three permutations is saved. This would produce an increase of 30 percent of the data path without considering the placement problems. Second, by shifting from the 1/0 register to the internal register, only eight XOR’S, eight mukiplexers, and eight bit buses are needed, instead 64 XOR’S, 65 multiplexer, [14]. IV. llm The design of an area-consuming and 64 bit buses such as used in DESIGN PROCESSAND USE OF CAD evolution of the data path, tions to layout, is presented into five phases. from 13. Design level evolution of the data path. can be corresponding floor plan is shown in Fig. 11. There is no routing of the incoming bits. The seven shift registers are combined ~—– J_ ! Lagic Sbnulat-til TOOLS specifica- in Fig. 13. It can be broken up A. The Architecture The DES Synthesis algorithm,. its modes of operation, and the other cryptographical specifications have already been explained above. The feedback loop over architecture represents the choice of the algorithmic producing a better floor plan. B. The Floor-Plan The result Organization of the first equivalences thereby and the Cell Library iteration loop is the ji’oor-plan organization discussed above on which the optimal relative position of the different elements is established. Based on that floor-plan organization a minimum cell library defined. Besides the eight PLA’s, registers, multiplexer was and XOR’S are needed to implement the DES, key, and modes hardware. Except for the PLA’s, we have been able to build all of the other logic out of only four gates: a static XOR and three types of inverters (with small, medium, or large W/L). This is possible because DES consists essentially of data scrambling and bit-wise modulo 2 arithmetic. Pass transistors are used to implement the clocks and conditions of multiplexer and registers (a maximum of two in series is permitted). The inverters can be provided with, or without, static refresh to implement memory. As an example, a register with a two-input multiplexer at the input is given in Fig. 14. The basic cells have been characterized individually for the transistor dimensions, the driving capability, etc. “Special” cells are the eight PLA’s used to realize the substitution functions. These are also characterized separately. The cell layouts are, however, customized to the floor plan. For the width one cell different versions of the bit slice. Typical exist, depending for the DES algorithm on is et al.: OPTIMIZATION VERBAUWHEDE cl OF DES DATA Encryption CHIP 653 Phil pert system DIALOG Register implementation with two-input multiplexer, using pass transistors and refresh inverters. realized results in in such that is a way with except what 14. description base of 85 rules. The rules are formulated consistent Fig. [6]. The formal a knowledge is that everything is forbidden the rules, instead is forbidden otherwise except of accepting what everything by the rules. The reason for tl& some construction errors can pass unde- tected. The DIALOG system verifies first the correctness of the the use of different word lengths, 8 X 8, 32, 48, and 56. To generate a floor plan on which all modules have the same width, the cells had to be laid out on different pitches. This library topology. This means that every cell that does not conform to the fcmr basic cells is reported to the designer. Refresh inverters are identified by their weak W/L ratio. has been done using the pitch matching PLA’s symbolic layout editor, To implement CAMELEON the “glue” the controller, capabilities [11]. logic of the 1/0 some more logic in generating device, a standard C. Construction A novel way of design a compact including Second, clude: Be- a) DES in accordance clocking and implementation: threshold-voltage ~ drop ond to avoid leakage problems how long As shown in Fig. 14 drive at most two refreshing inverters through a pass-transistor network.” This is to avoid problems with ratioed Indeed, the preceding static refresh inverter constitutes which switches d) capacitance ional a ratioed load the refreshing clocks; of a that every is followed at least by a possible prolbof the capaci- of every maximum and The verification DES ances) input (gate) and diffusion the load capacitances, path 45-m@ The main advantage CPU level descripti~n transistors and of the capacitmachine. is that, it is comple- With simulation one detects faults for. With this verification mainly at the interface between particularly construction exceeded capacitive 6K time on a Symbolics of this verification mentary to simulation. that one is searching connection faults ules are detected, (12K routing area capacitances. of the transistor data took calculates cell and checks whether it exceeds the value specified in the rules. This load capacitances whole is mainly used for the implementation of condit- checks—DIALOG incorporates the Phil conditional clocks are followed by full static refresh; for the Phi2 latch a restoring PMOS transistor is sufficient. 3) For correct logic levels: “A ‘medium’ inuerter may logic. timing verification—this verification of the correct clock- and sec- as it is unknown a value must be stored in a register. c) “Conditional after a pass transistor, length identified; driving clocks always have to be followed by a refreshing inverter.” The first reason is for level restoration, to restore the the or it is verified tances at the input and output nodes of a pass-transistor network is calculated and problem nodes are ing is only allowed on clock Phil .“ The reason for this is that it would cause an unfeasible controller implementamemory verification—e.g., is checked, rules. These in- electrical pmblems-e.g., to detect lems with charge sharing, the ratio a priori rules, separately. the library pass transistor network level-restoring inverter; with some constructiming the system verifies’ pass-transistor b) has been the use of and treated pass network cell library. capability rules, etc. Some examples are as follows. 1) For correct clocking and control: “Conditional tion later on. 2) For correct are identified rules and the use of an expert system design practice, The basic logic gates and cells are connected rules and Rules and Verification defined construction to check consistent tion interface gates are necessary. cause they are not critical they are taken from of the different modviolations and loads. to in- D. The Layout Generation verter. 4) For capacitive a maximal loading: capacitive loads are estimated extraction programs. For a large digital “A ‘small’ load of 500 fF.” initially inverter and checked chip a rigorous may drive These capacitive afterwards construction by strategy must be followed. One cannot allow “fancy tricks” to save gates or transistors. Such local optimization carries a penalty back, when the overall system has to be tested, controlled, or verified. How the construction rules are used for a novel way of verification is discussed next. For verification the construction rules for building the data path are written (Language Expressing in the LISP-based TOpology language LEXTOC Constraints) of the ex- The floor plan, the adjusted cell layouts, and the construction rules form one global layout. Placement and routing of the different modules is done by hand. The hardwired permutations, P and PC2, are difficult to route by hand. Therefore the router of the modide generator environment (MGE) [10] has been used. Ternninal connections can be specified in a textual form using LISP. The resulting routing has a poor density compared with what could be obtained however, writing connections by hand routing. in a texttial !n this case, form is less sensitive to errors and mistakes are more easily detected because the permutations tables are copied from the stamdard [1], similar to Table I for 1P. IEEE JOURNAL 654 E. ERC and Logic DES-like The latter this will the input for a logic simulation. The first goal is to check the floor plan and in particular the global routing of the floor plan, which is part of the algorithm implementation. lated The whole because routing, be checked data path needs to be simu- e.g., for the permutations, in relation to neighboring can only modules. A second goal of the logic simulation is to generate test patterns for the processed prototypes. Indeed, a functional reference pattern for an adder or a multiplier is easy to check. But it is not obvious that for the hexadecimal 01234567 data word 89abcdef, the first S-box are generated input 567890ab cdeff31f the first right by an independent input is 7c6abae9 software checked and patterns implementation with Hardware encryption corner encryp- on a larger circuit. V. STRUCTURED DESIGN AND LAYOUT STRATEGIES It will be explained in this section that the design fulfills general design and layout strategies. The application of these techniques A. A Structured In general, eases the design effort. the complexity of IC design: larity, and locality. The implementation the consistent hierarchy, has to be changed, building application blocks. visible in layout, as shown in in Fig. 16. But regularity logic also exists in of the construction gates, registers, etc., The same construction rules, when into functional rules, sometimes termed a methodology, can be coded in a verification to allow automatic verification. B. Layout tool Strategies Typical for this design” is the use of d$ferent word lengths in different places. The DES part has a combination of 32 part slices. The key part is byte oriented has several ing is applied. for reducing modularity, regu- These techniques are closely related. of the DES algorithm itself is only a small part of the problem. The architecture for one DES round is easy to deduce and the control for this hardware has 56 bit slices and the and has eight blocks consequences on the a more or less square layout, Each module, independent of 8 bits. general layout cell stretch- of the number slices, is 4 mm wide. The DES part has a combination of of 32 and 48 bit slices. The key part has 56 bit slices and the 1/0 part is byte oriented and has eight blocks of 8 bits. Most connections between made by abutment. cells inside the modules are This means that a bit slice in the key part is only 70 pm wide compared to 120 pm for a bit slice in the DES part. The layout of a register in the DES part will thus be arranged differently from a register in the key part. Second, the relative placement for the permutations also allows word length to another. For of the routing the transitions example, PC2 channels from one is placed the 56 key bits and the 32 data registers, expansion [7] are available and if something is most clearly constructing between Design Methodoloq four techniques Regularity strategy. First, to obtain Module level in system design, as a stand-alone or as a small algorithms, the chip photograph 1/0 switch-level descriptions, and test patterns are produced. This module can be used as a black box at a higher device 1988 have only local influence. and 48 bit independently The result of the whole design process is a hardware module for the data path: full layout, transistorand tion 23, NO. 3, JUNE tion protocols, but they apply everywhere (regularity). Modularity allows also the reuse of the same modules for This F. The Result: A Ciyptographical VLSI VOL. and the key is b48132c41ef 7. Reference of the DES algorithm and available reference tables. abstraction CIRCUITS, two modules as explained in Section II. The use of these registers is submitted to very strict timing and communica- Simulation An electrical rule check is performed on the layout. The output of the extraction is a transistor-level description which can be converted easily to a switch-level description. forms OF SOLID-STATE E is placed between Also, PLA’s path, therefore are difficult they and the the 32 and 48 data slices. to fit nicely are placed on a bit-sliced at the bottom data and the terminal fitting is included in the permutation P between the 48 data slices and the eight PLA’s. Third, the variation in word lengths has influence on our power and ground routing strategy. A double-metal technology is used in the design. The two metal layers run is a well-defined sequence. But the realization and interaction of all aspects, the mode calculation, the key handling, perpendicular to each other on chip. Metall runs parallel to the data flow, in the vertical direction, because there are and the 1/0 interface make it a complex problem. It can be said that a more attractive solution is obtained when all more data contacts to be made (in metall). Meta12 runs horizontally, parallel with the control flow. Control is four techniques For instance, mostly are applied. the controller is designed in a hierarchical way. One controller is master and the others are slaves. Each controller is a separate module. Modularity means well-defined interfaces. The terminals and the timing are defined for higher levels and what is inside is transparent. This is the definition of locality. The same is valid for the data path. restricted The communication to one communication between register two modules between is every supplied in poly to make connections to the gates of (pass) transistors. Long poly wires have RC problems. In this case the maximum length amounts to 4 mm. Therefore meta12 is used to bypass these long poly wires. Power and ground lines run also horizontally, perpendicular to data flow. Because of the problem of different word lengths vertically for unders would two power and ground reasons. be unavoidable First, cannot be provided crossovers at the interface and cross- between VERRACWHSDE d af.: OPTIMIZATION OF DES DATA L scan ENCRYPTION pads 635 CHIP \ c<) n : 0 ..1. L. 1.. LL...LJ K I Fig. 15. different small Even ‘Es - 1 Fig. 16. be testable. These tests are based on signatures. A test signature uses the same propagation principle as a digital Floor plan of the data path of the first prototype. modules. A second problem signature is caused by the very pitches in some modides such as the key registers. by mirroring the cells, it would be impossible to arrange the wells and to separate transistors. horizontally Now the n-wells over the different lines are used in common the NMOS and PMOS two different typical of surges on the power double-metal classic finger power or fork single-metal-layer fork. cells. ring is the ground ring. Each horizontal is connected connections between they are multiple is typical bottleneck, our rnetall dips are unlikely of StUck-at-ZERO a MAC, is to detect of false data. A small chafige the signature. is used for ‘test sig- clear data and key pair is applied, and the DES will be tested and stuck-at-O~is, faults will be detected. The typical stuck- a overhead dynamic faults research. of static While CMOS design are a in other chip designs, an OFIlogic and chip area is necessary to realize fault propagation, The rings. Via and meta12 are necessary but and the current ionality and line through the For in DES it is inherently present. a double the data path. sides to or an insertion A known of further ring, and the meta12 ring is the at both of MAC’S. the restilt is compared with a reference cipher text after a number of encryptions and decryptions. Complete funct- open for the generation change topic that all cur- power or ground an illegal A the stem of the is more equally uted along the ring. Thus there is no current and power by distribution. this is solved by providing power cells is solved which and meta12 all around metall the lines ground has the problem one critical In this design, of metall structure technology rent passes through ring and as in is based on the CBC mode, the function natures. Usually, however, power is distributed along the signal lines to avoid surge currents on the power lines [8]. The problem which change of input data will completely The same propagation principle are connected by abutment bit slices. Power and ground between Chip photograph of the implemented prototype. distrib- The VII. IMPLEMENTATION AND TEST RESULTS, first prototype, scan with registers, has. be~~n processed using a 3-pm N-well CMOS process with double metal (see Fig. 16). It contains 12000 transistors and htis an area of 25 mm2. The aim was to test the functionality of the different parts, mainly the DES and the modes part. Tests have shown full operation up to 16.7-MHz clock concentration to occur. rate. With this clock frequency, a 20-Mbit/s data rate on all eight byte modes is achieved. VI. TESTABILITY OF A CRYPTOGRAPHICAL A second implementation is now being developed, which takes the data path as a hardware module arid which also DEVICE On the floor plan of the first prototype (Fig. 15) it can be seen that some scan registers are implemented. This scan technique design security during demands. scan out ciphertext. mercial is very useful development. localizing faults this is in conflict in a with Scan pads provide a very easy way to initialization vectors or intermediate keys, Therefore applications. for But they are not allowed They are only provided opment and are realized as independent combining them with existing registers. contains the full controller and microprocessor The scan registers have been removed. Thanks performance controller, a full DES calculation, the mode calculation, same clock frequency rate of 32 Mbit/s takes only 28 clock can be applied interface. to a highincluding cycles. If the to this device, a data can be expected. for safe comduring devel- blocks instead of This means over- head on area, but they are very easy to remove. Scan pads are removed for safe commercial applications. For validation or maintenance purposes, the chip must still VIII. CONCLUSIONS A single chip has been designed and implemented which executes DES with a number of unique cryptographical features. The original approach of mode calculation allows pipelining in combination with feedback modes. Due to 656 1Ei3EJOURNALOF SOLID-STATECIRCUITS,VOL. 23, NO. 3, the structured design strategy, optimization, tools the regular floor the global plan, at all design levels, the main building blocks commercial Checking architecture can be reused in a flexible applications at different verification rather or tial and complement for and many way for safe DES-like algorithms. design levels, including and electrical than local and the use of CAD simulation, each other. The outcome authors Rammelaere expert system service chip division wish for to their thank help I. in for chip verification, processing, for stimulating division of the IMEC Laboratory, Heverlee, Belgium, working towmds the Ph.D. degree in electrical engineering. She is a member of the research group on applications and architectural design strategies, under the direction of Dr. F. Catthoor, Prof. H. De Man, and Prof. J. Vandewalle. Her research interest is in the design and integration of cryptography applications ad currently of ~gebrtic data processing applications. In particular, the formalization of the architectural methodologies is envisioned. is a compact ACKNOWLEDGMENT The 1988 and design rule checks are essen- design which can be used as a small module in larger digital VLSI circuits, but also as a fast stand-alone device. In short, it has a number of unique features not found in other devices. De JUNE Bolsens using the and the INVOMEC and the members W. DIALOG MPC of the VSDM discussions. Frank Hoornaert was born on October 20, 1961 in Belgium. He received his education from the Klein Seminarie Roeselare and from the Katholieke Universiteit Leuven, Heverlee, Belgium. From October 1984 to September 1986 he was a Chip Designer at IMEC, Heverlee, Belgium. He worked on hardware design and development of secufity systems at the Katholieke Universiteit Leuven from October 1986 until Februarv 1987. Since March 1987 he has been a Secu~ty and Development Manager at Cryptech n.v., Brussels, Belgium. His hardware and software knowledge includes gate array and full-custom design experience, digital design with PP and TTL components, and design of cryptographic hardware. REFERENCES [1] Data [2] [3] [4] [5] [6] [7] [8] [9] [10] [Ii] [12] Encryption Standard, Federal Information Processing Standard (FIPS) 46, Nat. Bur. Stand., Jan. 1977. DES Modes of Operation, Federaf Information Processing Standard (FIPS) 81, Nat. Bur. Stand., Dec. 1980. W. Diffie and M. E. Hellman, ‘~Privacy and authentication: An introduction to cryptography,” Proc. IEEE, vol. 67, no. 3, pp. 397-427. Mar. 1979. I. Verbauwhede, F. Hoomaert, J. Vandewalle, H. De Man, and R. Govaerts, “Security considerations in the design and implementation of a new DES chip,” to be published in Proc, Eurocrypt 87 Conf., Springer-Verlag. M. Davio; Y. Desmedt, J. Goubert, F. Hoomaert, and J.-J. Quisquater, “Efficient hardware and software implementations of the DES,” in Aduances in C~ptologp, Proc. Ctypto 84, Aug. 1984. H. De I&i,, L Bolsens, E. Vanden Meersch, and J. Van “DIALOG: Ari expert debugging system for (leyenbreughel, MOSVLSI design,” IEEE Trans. Computer-Aided Des., vol. CAD-4, no. 3, pp. 303–311, July 1985. N. Weste and K. Eshraghian, Principles of CMOS VLSI besign, A Systems Perspective. Reading, MA: Addison-Wesley, 1984, pp. 238-24Q. L, A. Glasser and D. W. Doppe&ihl, The Design and Analysis of Reading, MA: Addison-Wesley, 1985, pp. 316-317. VLSI Circuits. M. Bartholomeus, L, Reynders, M. Pauwels, and H. De Man, “ PLASCO: A procedural silicon compiler for PLA based systems~’ in Proc. [EEE 1985 Custom Integrated Circuits Conf., pp. 226–229, H. De Man. J. Rabaev. P. Six. and L. Claesen. “Cathedral-II: A silicon compiler for di~tal sign’al processing,” IhEE Design&Test, pp. 13-25, Dec. 1986. K. Croes~ H. J, De Man, and P. Six, “ CAMELEON: A processtolerant symbolic layout system,” IEEE J. Solid-State Circuits, vol. 23. no. 3. DD. 705-713. June 1988. R.’ H. Cu;h’man, “Data-encryption chips provide security—or is it fake [13] [14] security?.” EDN. vol. 27. DD. 39-45. Feb. 17.1982. D. MacMil&, “Single chip en&ypts data at 14 Mb/s,” Electronics, vol. 54, pp. 161–165, June 16, 1981, R. C. Fairfield, A. Matusevich, and J. Plany, “An LSI digital encryption processor (DEP),” IEEE Cow-nun. Msg., vol. 23, no, 7, pp. 30-41, July 1985. hrgrid Verbauwhede was born in Leuven, Belgium, on October 24, 1961. She received the electrical engineering degree from the Katholieke Universiteit Letiven, Heverlee, Belgium, in 1984. Her NLS. thesis was on analog design: the study ahd design of a micropower instrumentation amplifier based on switched-capacitor techniques, Since September 1984 she has been at the VSDM (VLSI Systems Design Methodologies) Hugo J. De Man (M81–SM81–F’86) was born in Boom, Belgium, on September 19, 1940. He received the electncaf engineering degree and the Ph.D. degree in applied sciences from the Kathofieke Universiteit Leuven, Heverlee, Belgium, in 1964 and 1968, respectively. In 1968 he became a member of the staff of the Laboratory for Physics and Electronics of Semiconductors at the University of Leuven, working on device physics and integrated circuit technology. From 1969 to 1971 h= was at the Electronic Research Laboratory, University of California, Berkeley, as an ESRO-NASA Postdoctoral Research Fellow, working on computer-aided device and circuit design. In 1971 he returned to the University of Leuven as a Research Associate of the NFWO (Belgian National Science Foundation). In 1974 he became a Professor at the University of Leuven. During the winter quarter of 1974–1975 he was a Visiting Associate Professor at the University of California, Berkeley. Since 1984 he has been Vice-President of the VLSI systems design group at the IMEC Laboratory, Leuven, Belgium. Dr. De Man was ~an Associate Editor for the IEEE JOURNAL OF SOLID-STATE CIRCUITS from 1975 to 1980 and was European Associate Editor for the IEEE TRANSACTIONSON COMPUTER-AIDED DESIGN from 1982 to 1985. He received a Best Paper Award at the ISSCC of 1973 on Bipolar Device Simulation and at the 1981 ESSCIRC conference for work on an integrated CAD system. In 1986 he became Fellow of the IEEE. His actuaf field of research is the design of integrated circuits and computer-aided design.