Using Graphics Processing Unit in Block Cipher

Jingwei Lu
(jinweil, ID 0537619)
Abstract: This paper first discusses GPU computation power compared to the CPU and shows that GPU computation power is growing faster than Moore's law; in a decade we may see a 2^12-fold computation power difference between GPU and CPU. It explains why we are interested in implementing block ciphers on the GPU. The paper then gives an introduction to general-purpose programming on the GPU and maps block ciphers onto that programming model. I analyze the limitations of current GPUs and show that they are not a good fit for block ciphers. Finally, I introduce the improvements in Windows Graphics Foundation 2.0 (WGF2.0) and the GPU support for WGF2.0. With these improvements, block ciphers can easily be implemented on the GPU and can be expected to see a big performance boost.
1. Introduction
The computation power of the graphics processing unit (GPU) has increased dramatically in the last 10 years. The multi-billion-dollar gaming industry is the primary driving force of this improvement. Let's look at the computation power difference between the CPU and the GPU. Today a 3.8 GHz Pentium 4 machine can theoretically achieve 7.6 GFLOPS. A dual-core model can reach 15.2 GFLOPS, and a dual-processor, dual-core machine can reach 30.4 GFLOPS. The cost of such a machine will be over $3000. On the other hand, an NVIDIA 7800 GTX 512M graphics card is capable of around 200 GFLOPS, and ATI's latest X1900 graphics card (2/06) has a claimed performance of 554 GFLOPS. Graphics cards can also easily be combined with SLI or CrossFire technology: four CrossFire-enabled ATI X1900 boards cost about the same as the $3000 machine, but offer roughly 70x the floating-point computation power. Dual-core graphics cards are also in development, so the cost of a GPU system will drop even lower in the near future. GPU computation power also increases much faster than CPU computation power. According to Mark Harris' presentation [1] from NVIDIA, GPU computation power is growing significantly faster than Moore's law, with a compound average growth rate (CAGR) of about 2.25. With a CAGR of 2.25 we can expect roughly 3325x growth in a decade, while for the CPU, with a CAGR of 1.5, we expect only about 58x growth in a decade. By the end of a decade we are therefore looking at about a 2^12-fold computation power difference between GPU and CPU. These numbers are based on the GPUs' and CPUs' peak performance; however, GPU and CPU are suitable for different types of work. The GPU is more like a SIMD machine, which works well on parallel workloads. Whether we can harvest the GPU's computation power depends on how well an algorithm can be converted to the GPU computation model. Using the GPU for general computation has been a hot topic recently. It has been successfully applied to many areas, such as audio and signal processing, computational geometry, dynamics simulation, numerical algorithms, database sort and search, etc. [2]. One of the goals of this paper is to study whether the GPU can be used in some of the block ciphers.
There are two main reasons I am interested in looking into the possibility of using the GPU in cryptography. First, when we consider a cipher's security, we need to consider how long it would take current hardware to do an exhaustive key search. The GPU is very powerful hardware at low cost. Given a possible 2^12-fold performance difference between GPU and CPU systems in a decade, it definitely needs to be studied whether any cipher can be efficiently implemented on the GPU. Second, with more and more communication happening over the internet, a cheap and efficient encryption and decryption implementation will help secure communication. Offloading work from the CPU to the GPU also frees the CPU to do other work.
I have investigated the possibility of using the GPU in cryptography. My main focus is on using the GPU in block ciphers. Section 2 gives an introduction to the GPU and to general computing on the GPU. Section 3 studies the limitations of the current GPU programming model for block ciphers; in Section 3 I also introduce the new improvements in Shader 4.0 and DirectX 10 and cover how those two technologies can really help the GPU be used in cryptography. Section 4 presents the conclusion and future work.
2. GPGPU background
GPGPU stands for general-purpose computation on the GPU. First I will introduce the GPU architecture a little bit. The GPU has a pipeline designed specifically for graphics. There are two major steps in the GPU pipeline: transformation and lighting (T&L) and rasterization. The transformation and lighting step operates on 3-D points (vertices). The application sends primitive data (such as triangles, lights, meshes, etc.) to the GPU, and the T&L stage transforms the vertices of the primitives into screen space. In modern GPUs, the programmable vertex shader is the unit that does the T&L; users can write their own code to be executed in the vertex shader. The rasterization step maps the primitive objects to screen pixels; the unit that processes pixels is called the pixel shader, and modern GPUs all have programmable pixel shaders. The vertex shader operates on the vertex stream while the pixel shader operates on each pixel on the screen. The final color value of a pixel coming out of the pixel shader goes through the composite engine and is sent to the render target. The render target can be a memory buffer or the display screen. If the render target is a memory buffer, the values can be used again in the next round of pixel/vertex processing. The whole simplified GPU pipeline is illustrated in Illustration 1.
[Illustration 1: GPU pipeline — vertex processor (transformation and lighting) → pixel processor (rasterization) → composition → render target (textures / screen)]
Most general computing on the GPU uses the programmable pixel shader. There are three main reasons for this. First, a GPU generally has more processing power in the pixel shader; for example, the ATI X1900 has 48 pixel shader processors but only 8 vertex shader processors. Second, the pixel shader can read texture maps while the vertex shader cannot (in Shader 3.0 the vertex shader can also access textures, but the number of texture maps is much smaller than for the pixel shader), and almost all general computing methods on the GPU use texture maps as input. Third, the pixel shader works in screen space, which is a two-dimensional array, so it is easy to map general computing onto a grid. To carry out computing on the GPU, we first need to map the computation to the GPU grid computing model. Grid computing assumes there is an MxN grid. Each cell in the grid carries out a computation which is independent of the other cells and produces a result; this computation is called the kernel. For example, two nested for-loops can easily be modeled as an MxN grid computation, and a one-dimensional for-loop can also be modeled as an MxN grid computation. Computations with more than two dimensions have to use multiple passes to map to grid computing. In Section 3 I will describe how to map block ciphers to grid computing.
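To make the grid computing model concrete, here is a minimal sketch of a kernel written as an HLSL pixel shader. It is purely illustrative: the texture names and the element-wise addition are my own assumptions, not taken from any cipher. Each screen pixel plays the role of one (i, j) cell of the MxN grid, the inputs live in textures, and the kernel's result is written to the render target.

    // Minimal HLSL (Shader Model 2.0 style) kernel sketch.  The two input
    // "arrays" are stored in MxN textures; the kernel adds them element-wise.
    sampler2D dataA : register(s0);   // first MxN input, one value per texel
    sampler2D dataB : register(s1);   // second MxN input

    float4 AddKernel(float2 gridPos : TEXCOORD0) : COLOR
    {
        // gridPos is the normalized (i/M, j/N) coordinate of this grid cell
        float4 a = tex2D(dataA, gridPos);
        float4 b = tex2D(dataB, gridPos);
        return a + b;                 // result goes to the render target
    }

Drawing a full-screen quad causes a kernel like this to be evaluated once per grid cell, which is exactly the drawing step used in the algorithm of Section 3.1.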
To program the GPU we must use one of the graphics APIs: OpenGL or Direct3D. OpenGL is mostly used in the academic community due to its platform portability and its extension mechanism. Direct3D is mostly used in the computer game industry and is only available on the Windows platform. Either API works perfectly well for general computing on the GPU; they are merely tools for an application to interact with the GPU, and which one to choose is simply a matter of personal preference. To program the vertex/pixel shaders, we can use shader assembly language, Cg (NVIDIA), or the HLSL (Microsoft) language. Assembly language is low-level and not easy to use. Cg and HLSL are C-like high-level languages. They are compiled to the Shader 1.0, 2.0, 3.0, or 4.0 standard based on compile options, and the compiled shader program can only run on hardware that supports that shader version.
Recently, a few GPGPU-friendly streaming languages/APIs, such as Brook, Sh, and Accelerator, have been developed to insulate developers from the graphics APIs. These languages provide a stream computing model; the compiler automatically compiles the program to a shader language, so users can leverage GPU computing without knowledge of the graphics APIs or the GPU. While we enjoy the benefits of these languages, they also have limitations: performance is only about 50%-80% of a hand-crafted shader program, and they expose only a limited subset of shader programming.
3. Using GPU for block cipher
3.1. General algorithm to implement block cipher in GPU
The GPU grid computing model can be applied to block ciphers in multiple ways. First, we can think of the data as a stream and map it to an MxN grid. The computation kernel then encrypts each grid point with the same key. This computation model is actually encrypting MxN blocks of data in parallel. Since each grid point's computation is independent, only ECB mode can be used. If a faster encryption/decryption algorithm can be developed for this model, it can be used to improve communication encryption/decryption performance. Second, we can also map exhaustive key search to the grid computing model. We generate MxN keys and map each key to a grid point. The kernel encrypts the same data on each grid point with a different key, so each rendering pass tries MxN keys at the same time.
The data is stored in texture maps and passed to the pixel shader during computation. There are multiple texture formats available; only the 32-bit RGBA format fits block ciphers well. For the single-key encryption/decryption scenario, we store 4 bytes of data in each texture pixel and construct an MxN-pixel texture from the data; we use four texture maps to store MxN 4x4 data blocks. The key can be stored in constant registers and passed to the pixel shader. For the key search scenario, we need to store the generated keys in an MxN-pixel texture. Depending on the key size, we may need several pixels to hold one key. Suppose it takes K pixels to hold one key; then we construct an MKxN-pixel texture to hold all generated keys, or alternatively use K texture maps to hold the keys. Each GPU shader version has a different limit on the number of constant registers, so the data can be stored in constant registers or in another texture map depending on its size. The output of a shader program is always a 32-bit RGBA value per pixel. This limitation may force some block ciphers to be implemented with multiple passes of calculation, where each pass works on 32 bits of data and generates 32 bits of output. The other alternative is to use multiple pixel outputs to represent one block. For example, AES works on a 4x4-byte block, which can be calculated using 4 pixels; the pixel shader program then needs to distinguish which part of the block it is working on. The multiple render target feature of the latest ATI and NVIDIA GPUs solves this problem: up to 4 render targets are now supported, which means a shader program can output 16 bytes of data per pixel, enough for most block ciphers.
Pseudo algorithm for block cipher on GPU:
1. Set up textures: depending on the scenario, initialize K MxN textures to hold the grid data points (data blocks or generated keys).
2. Set up the key constant or the data constant, depending on the scenario.
3. Set up textures for the different mapping tables, such as the s-box and t-box mappings.
4. Set up the GPU pipeline: load the shader program, set up the viewport, set up the output buffer.
5. Draw a full-screen quad. For each pixel of the quad, the kernel is evaluated once in the pixel shader.
6. Read the output buffer. Go to step 1 if multiple passes are needed.
The pixel shader code implements the encryption logic of the particular block cipher. The pixel shader code depends largely on current GPU limitations and on the requirements of each algorithm; we discuss this in the next subsection.
3.2. Limitations of current GPUs
The biggest limitation preventing current GPUs from being used for block ciphers is the lack of bitwise operations. All commonly used block ciphers require bitwise operations, including XOR, shifts, and isolating bits from a byte. Without support for bitwise operations in the shader program, it is impossible to implement the whole encryption/decryption algorithm in the shader. To work around this limitation, we have to split the encryption/decryption work between the CPU and the GPU: let the CPU do the bitwise operations and use the GPU for other calculations, such as s-box lookups. Since the majority of the work in a block cipher is bitwise operations, this kind of workaround really does not buy us anything in terms of performance; the extra cost of moving data between the CPU and the GPU actually slows the calculation down. Cook [3] from Columbia University has implemented the AES algorithm on a current GPU using the OpenGL API. The algorithm uses the GPU for s-box lookups and relies on the CPU to do the XOR operations. The GPU-based AES implementation is 41 times slower than an optimized C AES implementation; the test was done on a machine with an NVIDIA GeForce3 graphics card and a 1.8 GHz Pentium 4. The paper's performance analysis shows that the CPU usage of the GPU AES implementation is 100% during the test: the time is spent on CPU XOR operations and on the communication overhead between the CPU and the GPU. For a simple block cipher like AES, the GPU really does not have any edge over the CPU given that there is no bitwise operation support in the GPU.
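To illustrate the split, here is a rough HLSL sketch of the kind of pixel shader such a workaround implies. It is my own illustration in the spirit of [3], not code from that paper, and the texture names are assumptions: the shader performs only the s-box substitution, using a 256x1 lookup texture sampled with point filtering, and all XOR steps are left to the CPU.

    // GPU side of the pre-Shader-4.0 workaround: s-box substitution only.
    sampler2D dataTex : register(s0);   // MxN texture of bytes, stored as byte/255
    sampler2D sboxTex : register(s1);   // 256x1 texture holding the s-box entries

    float4 SubBytesOnly(float2 gridPos : TEXCOORD0) : COLOR
    {
        float4 b = tex2D(dataTex, gridPos);   // four bytes, each scaled to [0,1]
        float4 r;
        // rescale each byte to the center of its s-box texel and look it up
        r.r = tex2D(sboxTex, float2((b.r * 255.0 + 0.5) / 256.0, 0.5)).r;
        r.g = tex2D(sboxTex, float2((b.g * 255.0 + 0.5) / 256.0, 0.5)).r;
        r.b = tex2D(sboxTex, float2((b.b * 255.0 + 0.5) / 256.0, 0.5)).r;
        r.a = tex2D(sboxTex, float2((b.a * 255.0 + 0.5) / 256.0, 0.5)).r;
        return r;   // substituted bytes; the XOR rounds still happen on the CPU
    }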
The second limitation is that there is no cheap dynamic branching in current GPUs. Starting with Shader 3.0, dynamic branching is supported in shader programs; however, according to [1], dynamic branching in a shader program is expensive compared to the CPU, and the situation will not change in the near future. Fortunately, most block ciphers do not have many conditional statements in the algorithm, so this will not be a big issue, but it is something to pay attention to when a GPU cipher is designed.
Current GPUs are all designed for graphics calculations, which heavily use floating point. GPU calculation is all done in floating point; integers are supported in the high-level shader programming languages but are mapped to floating point values inside the shader processor. Current GPUs support 32-bit floating point values with a 23-bit mantissa. That precision is enough to represent 2-byte integer values exactly, and since in a block cipher each byte is mapped to a 32-bit floating point value, the precision is sufficient. A texture lookup returns a floating point value for each RGBA component, and we need to convert the value to an integer before using it in another texture lookup. This requires care with floating point error, which can yield an invalid value after truncation to integer. A little trick of adding 0.001 to the floating point value before truncation solves this kind of problem.
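A minimal sketch of that truncation guard, assuming a byte is stored as byte/255 in a texture channel (all names are illustrative):

    sampler2D dataTex : register(s0);   // bytes stored as byte/255 in each channel

    float4 RecoverByte(float2 gridPos : TEXCOORD0) : COLOR
    {
        float f = tex2D(dataTex, gridPos).r;     // may come back as e.g. 126.9997/255
        // the 0.001 keeps a value like 126.9997 from truncating down to 126
        float index = floor(f * 255.0 + 0.001);  // the recovered integer byte
        return float4(index / 255.0, 0.0, 0.0, 1.0);   // re-emit it as a byte
    }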
A graphics card usually has 128 MB, 256 MB, or 512 MB of video memory. Compared to CPU main memory, video memory is limited. The available video memory limits how many cells can be used in the grid (the screen size decides the grid size) and how many lookup tables there can be and how big they can be. More grid points mean better parallelism, which, as shown in [3], gives better performance.
3.3. Improvements in Shader 4.0 and DirectX 10
Based on my analysis of current GPU functionality and Cook's work [3], it is clear that even though the GPU has been successfully used in other general computing areas, with more than 10x gains in performance [2], it is not useful for improving block cipher performance today. However, GPU design is a fast-changing industry, and recent developments give us promising news that in the near future, maybe 1 or 2 years, the GPU can be used to significantly improve block cipher performance. ATI and NVIDIA are working on new chips which will support Windows Graphics Foundation 2.0 (DirectX 10 and Shader 4.0); those chips are expected to be released in late 2006. On the software side, Windows Graphics Foundation 2.0 (WGF2.0) will ship with Windows Vista in late 2006. New features in WGF2.0 eliminate some of the major GPU limitations for block ciphers. With these hardware and software improvements by the end of 2006, I expect to see GPU block ciphers that are 10x or more faster.
The most important improvement in Shader 4.0 is hardware support for bitwise operations and integers in the GPU. This major improvement dramatically simplifies the task of mapping a block cipher algorithm onto a shader program: multiple passes per round are no longer needed, and bit operations can be done inside the shader without relying on the CPU.
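As a small illustration (a sketch assuming Shader Model 4.0 HLSL syntax; the helper names are my own), the kind of operation that previously had to go back to the CPU can now be written directly in the shader:

    // Sketch of Shader Model 4.0 HLSL: native uints and bitwise operators mean a
    // whole AES AddRoundKey step (four 32-bit XORs) stays inside the shader.
    uint4 AddRoundKey(uint4 stateWords, uint4 roundKeyWords)
    {
        return stateWords ^ roundKeyWords;   // component-wise XOR on packed bytes
    }

    // Isolating a byte from a packed 32-bit word is now a shift and a mask.
    uint ByteN(uint word, uint n)
    {
        return (word >> (8 * n)) & 0xFF;
    }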
Unified shader processing is another improvement. The GPU will have only one type of shader processor, which can run vertex shader programs, pixel shader programs, or the new geometry shader programs. For block cipher computation we only need the pixel shader, which means all of the unified shader processors can be used for encryption/decryption. In the previous non-unified shader architecture, by comparison, the vertex shader processors would sit idle during block cipher computation. Unified shaders therefore allow us to fully utilize the GPU's computation power.
Many block ciphers can be implemented on the GPU once the new Windows Graphics Foundation 2.0 is released. I will show a design of an AES implementation on the GPU. Consider the single-key, ECB-mode AES encryption scenario. First, we need to prepare the data as texture maps. Each AES encryption block is a 4x4 byte array; each row is stored in a single pixel's RGBA value of a texture map. We use four texture maps to hold the data: texture map n holds row n of every block (n = 0…3). The output is also stored in four texture maps, each holding one row of the final encrypted block. Since we use a single key, round key generation is the same for all grid points, so it is easy to generate all round keys up front on the CPU. For AES-128, we need 11 128-bit round keys, which are stored as 11 shader constants. The shader program is a straightforward mapping from an AES C implementation (now we have all the bitwise operations in the shader); I skip the detailed implementation here and give only a skeleton below. After each block is encrypted in the shader program, the encrypted block is written to the four output texture maps and sent back to the application.
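The following is a hedged skeleton of what such a shader could look like, written in Shader Model 4.0 style HLSL. It only shows the data layout described above (four row textures in, four render targets out, round keys in a constant buffer); the AES round functions themselves are elided, and all names are my own placeholders rather than a tested implementation.

    // Pixel shader skeleton for single-key ECB AES on an MxN grid of blocks.
    Texture2D<uint4> Row0 : register(t0);   // row 0 of every 4x4 block
    Texture2D<uint4> Row1 : register(t1);
    Texture2D<uint4> Row2 : register(t2);
    Texture2D<uint4> Row3 : register(t3);

    cbuffer RoundKeys : register(b0)
    {
        uint4 roundKey[11];   // 11 x 128-bit round keys, expanded on the CPU
    };

    struct PSOut              // one output row per render target
    {
        uint4 row0 : SV_Target0;
        uint4 row1 : SV_Target1;
        uint4 row2 : SV_Target2;
        uint4 row3 : SV_Target3;
    };

    PSOut EncryptBlock(float4 pos : SV_Position)
    {
        int3 texel = int3(int2(pos.xy), 0);   // this pixel's grid cell
        uint4 state0 = Row0.Load(texel);
        uint4 state1 = Row1.Load(texel);
        uint4 state2 = Row2.Load(texel);
        uint4 state3 = Row3.Load(texel);

        // ... initial AddRoundKey, then 10 rounds of SubBytes / ShiftRows /
        // MixColumns / AddRoundKey, all expressible with the integer and
        // bitwise operations of Shader 4.0 (elided here) ...

        PSOut o;
        o.row0 = state0; o.row1 = state1;
        o.row2 = state2; o.row3 = state3;
        return o;
    }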
For the key search scenario the design is a little different. We still use four texture maps, now holding random keys: each 128-bit key is divided into four 4-byte words, stored in the RGBA values of the four texture maps. The 4x4 data block does not change, so we store it in a shader constant. In the key search scenario the shader program also needs to implement the key scheduling algorithm in addition to the encryption algorithm; it is again a straightforward mapping from the C algorithm. The encrypted data blocks are written to four texture maps as before. Compared with the single-key data encryption scenario, we have the extra key scheduling work in the shader program, so we expect slower performance in this scenario.
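A minimal sketch of how the key-search kernel would gather its inputs, under the same Shader Model 4.0 assumptions as above (placeholder names; the key expansion and AES rounds are elided, and the output is simplified to a single render target rather than the four described above):

    // Each of the four key textures holds one 32-bit word of this grid point's
    // candidate 128-bit key; the plaintext block is the same for every pixel.
    Texture2D<uint> Key0 : register(t0);
    Texture2D<uint> Key1 : register(t1);
    Texture2D<uint> Key2 : register(t2);
    Texture2D<uint> Key3 : register(t3);

    cbuffer Plaintext : register(b0)
    {
        uint4 plainBlock;            // the fixed 16-byte block to encrypt
    };

    uint4 TryKey(float4 pos : SV_Position) : SV_Target
    {
        int3 texel = int3(int2(pos.xy), 0);
        uint4 key = uint4(Key0.Load(texel), Key1.Load(texel),
                          Key2.Load(texel), Key3.Load(texel));

        // ... run the AES key schedule on 'key', then the AES rounds on
        // 'plainBlock' (elided) ...
        uint4 cipherBlock = plainBlock;   // placeholder for the elided rounds

        return cipherBlock;               // compared against the target ciphertext
    }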
Now the entire AES encryption algorithm is implemented on the GPU. We no longer have the overhead of doing XOR calculations on the CPU and passing data back and forth across multiple passes. I expect the AES algorithm to fully leverage the GPU's processing power and faster memory access. Once a new GPU chip that supports Windows Graphics Foundation 2.0 is on the market, I will be interested in conducting a test; I expect it to be 10x or more faster than a C AES implementation on a 3.8 GHz Pentium 4 machine.
4. Conclusion and future work
This paper first explained why we need to look into GPU implementations of block ciphers. There are two reasons: to evaluate cipher security we need to take into account the GPU, a low-cost piece of hardware whose power is growing quickly; and to find a fast, low-CPU-usage encryption/decryption implementation for communication security. If encryption and decryption can be done entirely on the GPU, it will be a good candidate for a secure channel between a machine and a remote display.
The study of current GPU functionality shows that it cannot be used effectively for block ciphers. The same is true for stream ciphers, because current GPUs lack support for bitwise operations. New developments in GPU design indicate that the next generation of GPUs will have hardware support for bitwise operations, and Windows Graphics Foundation 2.0, releasing with Windows Vista, will provide the software support for them. Once we have bitwise operations, it is very easy to map a block cipher algorithm to a shader program. The paper provides a high-level design of how the AES cipher can be implemented on the GPU. Future work will be to conduct a test once the new GPUs are available.
5. References
[1] Mark Harris, GPGPU: Beyond Graphics, http://download.nvidia.com/developer/presentations/GDC_2004/GDC2004_OpenGL_GPGPU_04.pdf
[2] General-Purpose Computation Using Graphics Hardware, www.gpgpu.org
[3] Debra L. Cook, John Ioannidis, et al., CryptoGraphics: Secret Key Cryptography Using Graphics Cards, http://www1.cs.columbia.edu/~dcook/pubs/CTRSA-corrected.pdf