Instruction Set Extensions for Public-Key Cryptography

Dipl.-Ing. Johann Großschädl
Institute for Applied Information Processing and Communications (IAIK)
Graz University of Technology (TUG)
Inffeldgasse 16a, A–8010 Graz, Austria
http://www.iaik.tugraz.at

Abstract: Public-key cryptography (PKC) is the basis for security and privacy in distributed systems like the Internet, for e-commerce, and for virtually all modern cryptographic protocols. Most public-key cryptosystems involve computation-intensive arithmetic operations (e.g. exponentiation in groups or finite fields), resulting in unacceptably long delays on embedded devices like mobile phones, PDAs, or smart cards. The project described in this poster is directed towards the design of instruction-level enhancements (ISA extensions) to improve both the performance and the energy efficiency of embedded RISC processors when executing cryptographic workloads. We focus on low-level arithmetic operations used in PKC, e.g. addition, multiplication, squaring, modular reduction, inversion, and division in multiplicative groups or finite fields of very high order (160 to 2048 bits). The first goal of this research project is the design, prototype implementation, and test of a SPARC V8-compatible processor with an extended instruction set optimized for PKC. The second project goal is to develop and analyze sophisticated micro-architectural enhancements for high-speed cryptography and improved security (i.e. resistance against side-channel attacks).

Public-Key Cryptography
• Widely used in security protocols like SSL, IPSec, …
  – Asymmetric encryption, key exchange, digital signatures
• Based on a “hard” mathematical problem
  – Integer factorization problem (IF), discrete logarithm problem (DLP)
• Traditional public-key cryptosystems
  – RSA, DSA, Diffie-Hellman are based on IF or DLP
  – Exponentiation in multiplicative
group, 1024- to 2048-bit operands
• Elliptic curve cryptography
  – DLP on an elliptic curve defined over a finite field GF(q)
  – Much faster, operand length between 160 and 250 bits

Arithmetic in Finite Fields
• Performance of ECC depends on field arithmetic
  – Addition, multiplication, squaring, inversion
• Prime fields GF(p) and binary extension fields GF(2^m)
  – Recommended by standards bodies (IEEE, ANSI, NIST)
• Arithmetic in GF(p)
  – Elements are the integers from 0 to p–1
  – Addition and multiplication are performed modulo the prime p
• Arithmetic in GF(2^m)
  – Elements are binary polynomials of degree up to m–1
  – Arithmetic modulo an irreducible polynomial p(t) of degree m

Multiple-Precision Multiplication
The long operands are represented by arrays of single-precision words (e.g. 32-bit unsigned integers). The multiplication algorithm spends the majority of its execution time in inner loops performing multiply/accumulate operations. Speeding up these inner loops has a dramatic impact on the overall performance.
[Figure: multiply/accumulate operation]

Instruction Set Extensions
Only six custom instructions are sufficient to accelerate the arithmetic operations in prime fields GF(p) and binary extension fields GF(2^m). The proposed custom instructions are executed in a unified multiply/accumulate unit with a “long” accumulator and two result registers (similar to MIPS32). The MAC unit can operate independently and in parallel with the ALU.
[Figure: datapath with memory (cache), registers, ALU, and MAC unit; labels: rs, rt, rd, load, store, HI (hi part), LO (lo part)]

Prototype Implementation
The multiply/accumulate (MAC) unit consists of a (32×16)-bit unified multiplier (for integers and binary polynomials) and a 72-bit accumulator.
[Figure: LEON-2 block diagram; labels: DSU, 5-stage Integer Unit, Icache, Dcache, FPU, COP, I/O, UARTs, Timers, AHB, AHB Ctrl., APB, Memory Ctrl.]
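
As a rough illustration of the inner loops described under Multiple-Precision Multiplication, the following sketch multiplies two operands stored as arrays of 32-bit words, column by column, accumulating the partial products of each column in one wide accumulator. It is a software model under stated assumptions, not the poster's implementation: the function names (`clmul`, `mp_mul`, `to_words`, `from_words`) and the column-wise (product-scanning) ordering are illustrative choices, and the poster does not list its six custom instructions. The `poly` flag switches between integer arithmetic and arithmetic on binary polynomials, where accumulation is an XOR.

```python
MASK32 = 0xFFFFFFFF  # single-precision word size assumed to be 32 bits


def clmul(a, b):
    """Carry-less product of two binary polynomials packed into integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a          # polynomial addition is XOR
        a <<= 1
        b >>= 1
    return r


def to_words(x, n):
    """Split integer x into n little-endian 32-bit words."""
    return [(x >> (32 * i)) & MASK32 for i in range(n)]


def from_words(words):
    """Reassemble little-endian 32-bit words into one integer."""
    return sum(w << (32 * i) for i, w in enumerate(words))


def mp_mul(a, b, poly=False):
    """Column-wise multiplication of two equal-length word arrays.

    Each inner-loop iteration is one multiply/accumulate step; the Python
    integer `acc` stands in for a wide hardware accumulator (assumption).
    """
    n = len(a)
    assert len(b) == n
    acc = 0
    out = []
    for k in range(2 * n - 1):                      # one result word per column
        for i in range(max(0, k - n + 1), min(k, n - 1) + 1):
            if poly:
                acc ^= clmul(a[i], b[k - i])        # GF(2)[t]: no carries
            else:
                acc += a[i] * b[k - i]              # integers: with carries
        out.append(acc & MASK32)                    # emit the low word
        acc >>= 32                                  # keep the rest for next column
    out.append(acc & MASK32)                        # most significant word
    return out
```

For example, `mp_mul([3], [3], poly=True)` returns `[5, 0]`, since (t + 1)^2 = t^2 + 1 over GF(2). Modular reduction (modulo p or modulo p(t)) would follow as a separate step and is not shown here.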
Unified Multiplier Datapath
Addition of binary polynomials is simply a logical XOR operation. Polynomial arithmetic can be easily integrated into the datapath of a conventional integer multiplier.
A dual-field adder (DFA) is a full adder capable of performing addition both with and without carry:
• Integer mode: fsel = 1 (standard FA)
• Polynomial mode: fsel = 0 (XORs)
A DFA is only slightly larger than a conventional full adder (FA). A properly designed unified multiplier composed of DFAs consumes about 30% less power in POLY-mode than in INT-mode.
[Figure: dual-field adder cell; signals: sin, pin, cin, fsel, sout, cout]

Results and Conclusions
The MAC unit has been integrated into the LEON-2 SPARC V8 processor and prototyped on FPGA.
[Table: timings, all given in clock cycles]
• Speed-up compared to conventional software: GF(p) 2x, GF(2^m) 6x
• The proposed extensions make GF(2^m) faster than GF(p)
• Binary extension fields are more energy-efficient than GF(p)
• Extra hardware cost is marginal (approximately 5,500 gates)

The research illustrated on this poster has been supported by the Austrian Science Fund (FWF) under grant no. P16952-N04 (“Instruction Set Extensions for Public-Key Cryptography”) and in part by the European Commission through the IST Program under contract IST-2002-507932 ECRYPT.

The information on this web site is provided as is, and no guarantee or warranty is given or implied that the information is fit for any particular purpose. The user thereof uses the information at its sole risk and liability.
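
As a supplementary illustration of the dual-field adder described under Unified Multiplier Datapath, the sketch below models one plausible gate-level behavior: the sum output is always the XOR of the three inputs, and the carry output is the usual majority function gated by `fsel`. The poster does not give the netlist, so this formulation, the function names (`dfa`, `dual_field_add`), and the ripple-carry arrangement are assumptions made for illustration only.

```python
def dfa(a, b, cin, fsel):
    """One-bit dual-field adder (assumed formulation).

    fsel = 1: behaves like a standard full adder.
    fsel = 0: carry output is suppressed, so the cell only XORs its inputs.
    Returns (sum_bit, carry_out).
    """
    s = a ^ b ^ cin
    majority = (a & b) | (a & cin) | (b & cin)
    cout = majority & fsel          # carry is gated off in polynomial mode
    return s, cout


def dual_field_add(x, y, fsel, width=32):
    """Add two width-bit words with a chain of DFA cells (illustrative)."""
    carry = 0
    result = 0
    for i in range(width):
        s, carry = dfa((x >> i) & 1, (y >> i) & 1, carry, fsel)
        result |= s << i
    return result                   # final carry-out is discarded


# Integer mode adds with carries; polynomial mode reduces to a bitwise XOR.
integer_sum = dual_field_add(0b1011, 0b0110, fsel=1)     # 11 + 6 = 17
polynomial_sum = dual_field_add(0b1011, 0b0110, fsel=0)  # 11 XOR 6 = 13
```

Because the two modes differ only in whether the carry path is enabled, the same adder cells can serve both GF(p) and GF(2^m) arithmetic, which is consistent with the poster's remark that a DFA is only slightly larger than a conventional full adder.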