Using the Graphics Processing Unit for Block Ciphers

Jingwei Lu (jinweil, ID 0537619)

Abstract: This paper first compares the computation power of the GPU to that of the CPU and shows that GPU computation power is growing faster than Moore's law; within a decade we may see a 212x difference in computation power between GPU and CPU. It explains why we are interested in implementing block ciphers on the GPU, gives an introduction to general-purpose programming on the GPU, and then maps block ciphers onto that programming model. I analyze the limitations of current GPUs and show that they are not a good fit for block ciphers. Finally, I introduce the improvements in Windows Graphics Foundation 2.0 (WGF 2.0) and the GPUs that will support it. With these improvements, block ciphers can easily be implemented on the GPU and can be expected to see a large performance boost.

1. Introduction

The computation power of the graphics processing unit (GPU) has increased dramatically over the last 10 years. The multi-billion-dollar gaming industry is the primary driving force behind this improvement. Consider the computation power difference between the CPU and the GPU. Today a 3.8 GHz Pentium 4 can theoretically achieve 7.6 GFLOPS. A dual-core model can reach 15.2 GFLOPS, and a dual-processor, dual-core machine can reach 30.4 GFLOPS; such a machine costs over $3000. On the other hand, an NVIDIA 7800 GTX 512M graphics card is capable of around 200 GFLOPS, and ATI's latest X1900 graphics card (2/06) has a claimed performance of 554 GFLOPS. Graphics cards can also easily be combined using SLI or CrossFire technology: four CrossFire-enabled ATI X1900 boards cost about the same as the $3000 machine but deliver roughly 70x the floating-point computation power. Dual-core graphics cards are also in development, so the cost of a GPU system will drop even further in the near future.

GPU computation power is also increasing much faster than CPU computation power. According to Mark Harris' presentation [1] from NVIDIA, GPU computation power is growing significantly faster than Moore's law, with a compound annual growth rate (CAGR) of about 2.25. With a CAGR of 2.25 we can expect roughly 3325x growth in a decade, while for the CPU, with a CAGR of 1.5, we can expect only about 58x growth. By the end of a decade we are therefore looking at roughly a 212x difference in computation power between GPU and CPU. These numbers are estimated from peak GPU and CPU performance. However, GPU and CPU are suited to different kinds of work: the GPU is essentially a SIMD machine that works well on parallel workloads. Whether we can harvest the GPU's computation power depends on how well an algorithm can be converted to the GPU's computation model.

Using the GPU for general-purpose computation has been a hot topic recently. It has been applied successfully to many areas, such as audio and signal processing, computational geometry, dynamics simulation, numerical algorithms, and database sort and search [2]. One of the goals of this paper is to study whether the GPU can be used for some block ciphers. There are two main reasons I am interested in the possibility of using the GPU in cryptography. First, when we evaluate a cipher's security we need to consider how long an exhaustive key search would take on current hardware. The GPU is very powerful hardware at low cost; given a possible 212x performance difference between GPU and CPU systems within a decade, it definitely needs to be studied whether any cipher can be implemented efficiently on the GPU.
Second, with more and more communication happening over the Internet, a cheap and efficient encryption and decryption implementation will help secure that communication. Offloading work from the CPU to the GPU also frees the CPU to do other work.

I have investigated the possibility of using the GPU in cryptography, with a focus on block ciphers. Section 2 gives an introduction to the GPU and to general-purpose computing on the GPU. Section 3 studies the limitations of the current GPU programming model for block ciphers; it also introduces the improvements coming in Shader 4.0 and DirectX 10 and explains how these two technologies can make the GPU genuinely useful for cryptography. Section 4 presents conclusions and future work.

2. GPGPU background

GPGPU stands for general-purpose programming on the GPU. First, a brief introduction to GPU architecture. The GPU has a pipeline designed specifically for graphics, with two major stages: transformation and lighting (T&L) and rasterization. The transformation and lighting stage operates on 3-D points (vertices). The application sends primitive data (triangles, lights, meshes, etc.) to the GPU, and the T&L stage transforms the vertices of these primitives into screen space. In modern GPUs, T&L is performed by a programmable vertex shader, and users can write their own code to be executed on it. The rasterization stage maps the primitive objects to screen pixels; the unit that does this is called the pixel shader, and modern GPUs all have programmable pixel shaders. The vertex shader operates on the vertex stream, while the pixel shader operates on each pixel on the screen. The final color value of a pixel produced by the pixel shader goes through the composition engine and is sent to the render target. The render target can be a memory buffer or the display screen; if it is a memory buffer, the values can be used again in the next round of pixel/vertex processing. The simplified GPU pipeline is shown in Illustration 1.

Illustration 1: GPU pipeline (vertex processor performing transformation and lighting, rasterization, pixel processor with texture access, composition, and output to the screen or a render target).

Most general-purpose computing on the GPU uses the programmable pixel shader, for three main reasons. First, a GPU generally has more processing power in the pixel shader; for example, the ATI X1900 has 48 pixel shader processors but only 8 vertex shader processors. Second, the pixel shader can read texture maps, while the vertex shader cannot (in Shader 3.0 the vertex shader can also access textures, but far fewer texture maps than the pixel shader), and almost all general-purpose computing methods on the GPU use texture maps as input. Third, the pixel shader works in screen space, which is a two-dimensional array, so it is easy to map general computations onto a grid.

To carry out computation on the GPU, we first need to map the computation onto the GPU grid-computing model. Grid computing assumes an MxN grid; each cell in the grid carries out a computation, independent of the other cells, and produces a result. This computation is called the kernel. For example, a doubly nested for-loop can easily be modeled as an MxN grid computation, and a one-dimensional for-loop can also be modeled as an MxN grid; more than two dimensions requires multiple passes to map onto the grid. Section 3 describes how to map a block cipher to grid computing. A minimal pixel shader kernel of this form is sketched below.
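To make the grid-computing model concrete, here is a minimal HLSL sketch of such a kernel in the Shader Model 2/3 style. The texture, constant, and function names are only illustrative, and the computation is a placeholder; the point is that the kernel reads one input texel per grid cell and writes one RGBA result:

    // Input data for the MxN grid, one grid cell per texel.
    sampler2D dataTex : register(s0);
    // A per-pass constant shared by every grid cell (for example, a key word).
    float4 passConstant;

    // The kernel: evaluated once per pixel when a full-screen quad is drawn.
    float4 GridKernel(float2 uv : TEXCOORD0) : COLOR
    {
        float4 cell = tex2D(dataTex, uv);    // read this cell's input
        float4 result = cell + passConstant; // placeholder computation
        return result;                       // written to the render target
    }

The application draws one full-screen quad, so the graphics pipeline evaluates GridKernel exactly once per grid cell and stores each result in the corresponding pixel of the render target.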
To program the GPU we must use a graphics API: OpenGL or Direct3D. OpenGL is mostly used in the academic community because of its platform portability and its extension mechanism. Direct3D is mostly used in the computer game industry and is only available on the Windows platform. Either API works perfectly well for general-purpose computing on the GPU; they are merely the tools through which an application interacts with the GPU, and which one to choose is simply a matter of personal preference. To program the vertex/pixel shaders themselves, we can use shader assembly language, Cg (NVIDIA), or HLSL (Microsoft). Assembly is a low-level language and is not easy to use; Cg and HLSL are C-like high-level languages. They are compiled to the Shader 1.0, 2.0, 3.0, or 4.0 standard depending on compile options, and the compiled shader program can only run on hardware that supports that shader version. Recently, a few GPGPU-friendly streaming languages/APIs, such as Brook, Sh, and Accelerator, have been developed to insulate developers from the graphics APIs. These languages provide a stream computing model; the compiler automatically translates the program into a shader language, so users can leverage GPU computing without knowledge of the graphics API or the GPU. While we enjoy the benefits of these languages, they also have limitations: performance is typically only about 50%-80% of a hand-crafted shader program, and they expose only a limited subset of shader functionality.

3. Using the GPU for block ciphers

3.1. General algorithm for implementing a block cipher on the GPU

The GPU grid-computing model can be applied to block ciphers in several ways. First, we can treat the data as a stream and map it onto an MxN grid. The computation kernel encrypts each grid point with the same key, so this model encrypts MxN blocks of data in parallel. Since each grid point's computation is independent, only ECB mode can be used. If a faster encryption/decryption implementation can be built on this model, it can be used to improve communication encryption/decryption performance. Second, we can map exhaustive key search onto the grid-computing model: we generate MxN keys and map each key to a grid point. The kernel encrypts the same data at each grid point with a different key, so each rendering pass tries MxN keys at the same time.

Data is stored in texture maps and passed to the pixel shader during the computation. Several texture formats are available, but only the 32-bit RGBA format fits block ciphers well. For the single-key encryption/decryption scenario, we store 4 bytes of data in each texture pixel and construct an MxN-pixel texture from the data; four such texture maps store MxN 4x4-byte blocks. The key can be stored in constant registers and passed to the pixel shader. For the key search scenario, we instead store the generated keys in an MxN-pixel texture. Depending on the key size, several pixels may be needed to hold one key: if it takes K pixels to hold one key, we construct an MKxN-pixel texture to hold all the generated keys, or alternatively use K texture maps. Each GPU shader version has a different limit on the number of constant registers, so the data block can be stored either in constant registers or in another texture map, depending on its size.

The output of the shader program is always a 32-bit RGBA value per pixel. This limitation may force some block ciphers to be implemented with multiple passes of calculation, each pass working on 32 bits of data and generating 32 bits of output. The alternative is to use multiple pixels of output to represent one block; for example, AES works on a 4x4-byte block, which can be computed using 4 pixels. The pixel shader program then needs to distinguish which part of the block it is working on. The multiple-render-target feature of the latest ATI and NVIDIA GPUs solves this problem: up to 4 render targets are now supported, which means the shader program can output 16 bytes per pixel, enough for most block ciphers. A sketch of a multiple-render-target output structure is shown below.
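As an illustration of the 16-bytes-per-pixel layout, here is a hedged HLSL (Shader Model 2/3 style) sketch of a kernel that writes one block across four render targets. The texture and struct names are assumptions for this example, and the "encryption" is left as a pass-through placeholder:

    // One 16-byte block per pixel, spread across four render targets.
    struct BlockOutput
    {
        float4 row0 : COLOR0;   // bytes 0..3 of the block
        float4 row1 : COLOR1;   // bytes 4..7
        float4 row2 : COLOR2;   // bytes 8..11
        float4 row3 : COLOR3;   // bytes 12..15
    };

    sampler2D dataTex0 : register(s0);  // row 0 of every block
    sampler2D dataTex1 : register(s1);  // row 1
    sampler2D dataTex2 : register(s2);  // row 2
    sampler2D dataTex3 : register(s3);  // row 3

    BlockOutput EncryptBlock(float2 uv : TEXCOORD0)
    {
        BlockOutput o;
        // Placeholder: a real cipher kernel would transform these values.
        o.row0 = tex2D(dataTex0, uv);
        o.row1 = tex2D(dataTex1, uv);
        o.row2 = tex2D(dataTex2, uv);
        o.row3 = tex2D(dataTex3, uv);
        return o;
    }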
Pseudo-algorithm for a block cipher on the GPU:

1. Set up the data textures: depending on the scenario, initialize K MxN textures to hold the grid data points (data blocks or generated keys).
2. Set up the key constants or data constants, depending on the scenario.
3. Set up textures for the mapping tables, such as the s-box or T-box tables.
4. Set up the GPU pipeline: load the shader program, set up the viewport, and set up the output buffer.
5. Draw a full-screen quad. For each pixel covered by the quad, the kernel is evaluated once in the pixel shader.
6. Read the output buffer. Go to step 1 if multiple passes are needed.

The pixel shader code implements the encryption logic of the particular block cipher. It depends heavily on the limitations of current GPUs and on the requirements of the individual algorithm, which are discussed in the next subsection.

3.2. Limitations of current GPUs

The biggest limitation preventing current GPUs from being used for block ciphers is the lack of bitwise operations. All commonly used block ciphers require bitwise operations, including XOR, shifts, and isolating bits from a byte. Without support for bitwise operations in the shader program, it is impossible to implement the whole encryption/decryption algorithm on the GPU. To work around this limitation, the encryption/decryption work has to be split between the CPU and the GPU: the CPU performs the bitwise operations and the GPU handles other calculations, such as s-box lookups. Since the majority of the work in a block cipher is bitwise operations, this kind of workaround buys us essentially nothing in performance, and the extra cost of moving data between CPU and GPU actually slows the calculation down.

Cook et al. [3] from Columbia University implemented the AES algorithm on current GPUs using the OpenGL API. The implementation uses the GPU for s-box lookups and relies on the CPU for the XOR operations. The GPU-based AES implementation is 41 times slower than an optimized C AES implementation; the test was done on a machine with an NVIDIA GeForce3 graphics card and a 1.8 GHz Pentium 4. The performance analysis in the paper shows that during the test the CPU usage of the GPU AES implementation was 100%: the time is spent in the CPU XOR operations and in the communication overhead between CPU and GPU. For a cipher like AES, the GPU has no edge over the CPU as long as there is no bitwise operation support on the GPU.

The second limitation is that current GPUs have no cheap dynamic branching. Dynamic branching is supported in shader programs starting with Shader 3.0; however, according to [1], dynamic branching in a shader program is expensive compared to the CPU, and this situation will not change in the near future. Fortunately, most block ciphers do not contain many conditional statements, so this is not a big issue, but it is something to pay attention to when a GPU cipher is designed.

Current GPUs are all designed for graphics calculations, which rely heavily on floating point, and all GPU calculation is done in floating point. Integers are supported in the high-level shader languages, but they are mapped to floating-point values inside the shader processor. Current GPUs support 32-bit floating-point values with 23 significant bits, which is enough precision to represent 2-byte integer values. For a block cipher, a byte is mapped to a 32-bit floating-point value, so the precision is sufficient. Texture lookups return a floating-point value for each RGBA component, and we need to convert that value back to an integer index before performing another texture lookup. This requires care with floating-point error, which can produce an invalid value after truncation to an integer; a small trick of adding 0.001 to the floating-point value before truncating solves this kind of problem. A sketch of this conversion is shown below.
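The following hedged HLSL (Shader Model 2/3 style) sketch shows the byte-to-index conversion described above, in the form of a dependent s-box texture lookup. The texture name and the 256x1 s-box layout are assumptions made for this example, not details taken from Cook's implementation:

    // s-box stored as a 256x1 texture; each texel's red channel holds one substituted byte.
    sampler2D sboxTex : register(s1);

    // Convert a sampled color component (0..1) back to a byte index and
    // look it up in the s-box texture.
    float SBoxLookup(float component)
    {
        // Scale to 0..255 and add a small epsilon before truncation so that
        // floating-point error does not drop us down to the previous integer.
        float byteValue = floor(component * 255.0 + 0.001);
        // Address the center of texel 'byteValue' in the 256x1 s-box texture.
        float2 uv = float2((byteValue + 0.5) / 256.0, 0.5);
        return tex2D(sboxTex, uv).r;
    }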
Graphics cards usually have 128 MB, 256 MB, or 512 MB of video memory, which is limited compared to CPU main memory. The available video memory limits how many cells can be used in the grid (the screen size determines the grid size) and how many lookup tables can be used and how large they can be. More grid points mean more parallelism, which, as shown in [3], yields better performance.

3.3. Improvements in Shader 4.0 and DirectX 10

Based on my analysis of current GPU functionality and on Cook's work [3], it is clear that even though the GPU has been used successfully in other general-purpose computing areas with more than 10x performance gains [2], it is not useful for improving block cipher performance today. However, GPU design is a fast-changing industry, and recent developments give us promising news: in the near future, perhaps within 1 or 2 years, the GPU should be usable to significantly improve block cipher performance. ATI and NVIDIA are working on new chips that will support Windows Graphics Foundation 2.0 (DirectX 10 and Shader 4.0); these chips are expected to be released in late 2006. On the software side, Windows Graphics Foundation 2.0 (WGF 2.0) will ship with Windows Vista in late 2006. New features in WGF 2.0 eliminate the major GPU limitations for block ciphers, so with the hardware and software improvements available by the end of 2006, I expect to see GPU block cipher implementations that are 10x or more faster.

The most important improvement in Shader 4.0 is hardware support for bitwise operations and integers on the GPU. This dramatically simplifies the task of mapping a block cipher algorithm onto a shader program: multiple passes per round are no longer needed, and bit operations can be done inside the shader without relying on the CPU. Unified shader processing is another improvement. The GPU will have only one type of shader processor, which can run vertex shader programs, pixel shader programs, or the new geometry shader programs. For block cipher computation we only need the pixel shader, which means all of the unified shader processors can be used for encryption/decryption; in the previous non-unified architecture, the vertex shader processors would sit idle during block cipher computation. Unified shaders therefore allow us to fully utilize the GPU's computation power.

Many block ciphers will be implementable on the GPU once Windows Graphics Foundation 2.0 is released. I will show a design for an AES implementation on the GPU. Consider the single-key, ECB-mode AES encryption scenario. First, we prepare the data as texture maps. Each AES block is a 4x4-byte array; each row is stored in a single pixel's RGBA value of a texture map, so we use four texture maps to hold the data, with texture map n holding row n of every block (n = 0...3). The output is also stored in four texture maps, each holding one row of the final encrypted blocks. Since we use a single key, round key generation is the same for all grid points, so it is easy to generate all the round keys up front on the CPU. For AES-128 we need eleven 128-bit round keys, which are stored as eleven shader constants. The shader program is then a straightforward mapping from a C implementation of AES (now that we have all the bitwise operations in the shader); I will skip the detailed implementation here, but a sketch of its structure is shown below. After each block is encrypted in the shader program, the encrypted block is written to the four output texture maps and sent back to the application.
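As a hedged illustration only, the following Shader Model 4 / HLSL sketch shows the structure such a kernel could take once integer and bitwise operations are available. The resource names and the row-by-row round key layout are assumptions for this example, and only the initial AddRoundKey step is shown rather than a full AES round function:

    // Four integer textures, one per block row; the texel at (x, y) holds that
    // row of the block assigned to grid cell (x, y), one byte per RGBA component.
    Texture2D<uint4> RowTex0;
    Texture2D<uint4> RowTex1;
    Texture2D<uint4> RowTex2;
    Texture2D<uint4> RowTex3;

    // Eleven AES-128 round keys generated on the CPU, stored row by row:
    // rk[4 * round + row] holds the 4 bytes of that row, one byte per component.
    cbuffer RoundKeys
    {
        uint4 rk[44];
    };

    struct BlockOutput
    {
        uint4 row0 : SV_Target0;
        uint4 row1 : SV_Target1;
        uint4 row2 : SV_Target2;
        uint4 row3 : SV_Target3;
    };

    BlockOutput EncryptBlock(float4 pos : SV_Position)
    {
        int3 texel = int3(pos.xy, 0);        // pixel position -> texel address
        uint4 row0 = RowTex0.Load(texel);
        uint4 row1 = RowTex1.Load(texel);
        uint4 row2 = RowTex2.Load(texel);
        uint4 row3 = RowTex3.Load(texel);

        // Initial AddRoundKey: XOR is now a native shader operation.
        row0 ^= rk[0];
        row1 ^= rk[1];
        row2 ^= rk[2];
        row3 ^= rk[3];

        // ... SubBytes, ShiftRows, MixColumns, and the remaining ten rounds
        // would follow here, mapped directly from the C implementation ...

        BlockOutput o;
        o.row0 = row0;
        o.row1 = row1;
        o.row2 = row2;
        o.row3 = row3;
        return o;
    }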
The design for the key search scenario is slightly different. We again use four texture maps, this time to hold the candidate keys: each 128-bit key is divided into four 4-byte integers and stored in the RGBA values of the four texture maps. The 4x4-byte data block does not change, so it is stored in shader constants. In the key search scenario the shader program must also implement the key scheduling algorithm in addition to the encryption algorithm; this is likewise a straightforward mapping from the C algorithm. The encrypted data blocks are written to four texture maps as before. Compared with the single-key data encryption scenario, there is extra key scheduling work in the shader program, so we should expect somewhat lower performance in this scenario.

Now the entire AES encryption algorithm is implemented on the GPU. There is no longer any overhead from doing the XOR calculations on the CPU or from passing data back and forth across multiple passes, so I expect the AES implementation to fully leverage the GPU's processing power and faster memory access. Once a GPU chip that supports Windows Graphics Foundation 2.0 is on the market, I intend to conduct this test; I expect a result more than 10x faster than a C AES implementation on a 3.8 GHz Pentium 4 machine.

4. Conclusion and future work

This paper first explained why we should look into GPU implementations of block ciphers. There are two reasons: to evaluate cipher security we need to consider the GPU, a fast-growing, low-cost piece of hardware; and we want a faster, low-CPU-usage encryption/decryption implementation for communication security. If encryption and decryption can be done entirely on the GPU, it would be a good candidate for a secure channel between a machine and a remote display. The study of current GPU functionality shows that it cannot be used effectively for block ciphers; the same is true for stream ciphers, because current GPUs lack support for bitwise operations. New developments in GPU design indicate that the next generation of GPUs will have hardware support for bitwise operations, and Windows Graphics Foundation 2.0, releasing with Windows Vista, will provide the software support. Once bitwise operations are available, it is straightforward to map block cipher algorithms onto shader programs. The paper provided a high-level design for implementing the AES cipher on the GPU. Future work will be to conduct the test once the new GPUs are available.

5. References

[1] Mark Harris, GPGPU: Beyond Graphics, http://download.nvidia.com/developer/presentations/GDC_2004/GDC2004_OpenGL_GPGPU_04.pdf
[2] General-Purpose Computation Using Graphics Hardware, www.gpgpu.org
[3] Debra L. Cook, John Ioannidis, et al., CryptoGraphics: Secret Key Cryptography Using Graphics Cards, http://www1.cs.columbia.edu/~dcook/pubs/CTRSA-corrected.pdf