Real-time image deconvolution on the GPU

James T. Klosowski and Shankar Krishnan
AT&T Labs Research, 180 Park Ave., Florham Park, NJ USA 07932
ABSTRACT
Two-dimensional image deconvolution is an important and well-studied problem with applications to image deblurring
and restoration. Most of the best deconvolution algorithms use natural image statistics that act as priors to regularize the
problem. Recently, Krishnan and Fergus presented a fast deconvolution algorithm that yields results comparable to the
current state of the art. They use a hyper-Laplacian image prior to regularize the problem. The resulting optimization
problem is solved using alternating minimization in conjunction with a half-quadratic penalty function. In this paper, we
provide an efficient CUDA implementation of their algorithm on the GPU. Our implementation leverages many well-known CUDA optimization techniques, as well as several others that have a significant impact on this particular
algorithm. We discuss each of these, as well as make a few observations regarding the CUFFT library. Our experiments
were run on an Nvidia GeForce GTX 260. For a single channel image of size 710 x 470, we obtain over 40 fps, while on
a larger image of size 1900 x 1266, we get almost 6 fps (without counting disk I/O). In addition to running times that scale linearly with image size, we believe ours is the first implementation to perform deconvolution at video rates. Our running times also
demonstrate that our GPU implementation is over 27 times faster than the original CPU implementation.
Keywords: Deconvolution, deblurring, GPU, CUDA, hyper-Laplacian, FFT
1. INTRODUCTION
Two-dimensional image deconvolution is an important and well-studied problem with applications to image deblurring
and restoration. The original problem formulation is ill-posed, and requires some form of regularization to make it
tractable. Numerous deconvolution algorithms exist, varying greatly in their performance and solution. Simple filtering
algorithms are extremely fast, but yield poor results. Most of the best algorithms use natural image statistics that act as
priors to regularize the problem.
More recently, Krishnan and Fergus [1] presented a fast non-blind deconvolution algorithm that yields results comparable
to the current state of the art. Krishnan and Fergus use a hyper-Laplacian image prior to regularize the problem. The
resulting optimization problem is solved using alternating minimization in conjunction with a half-quadratic penalty
function. While one of the sub-problems uses a standard L2 minimization solved in closed form in the Fourier domain,
the key novelty in their approach is that the other sub-problem is a non-convex problem that is separable over the pixels.
This per-pixel step can be solved by polynomial root finding. Since the coefficients of the polynomial depend only on
system parameters and a range of image gradient values, the valid roots can be pre-computed and stored in a lookup
table (LUT). This makes their approach particularly well-suited for a GPU implementation.
There have been some attempts to implement image deconvolution algorithms on the GPU. Recently, Domanski et al.
[2] have implemented the Richardson-Lucy algorithm [3,4] on an Nvidia GTX 260 GPU using Fourier space
convolutions. They report an 8.5x performance improvement over a CPU implementation.
Our Contributions: In this paper, we provide an implementation of the Krishnan-Fergus algorithm on the GPU using
the CUDA architecture. We have made extensive use of many well-known CUDA optimization techniques, as well as
several others that have a significant impact on our particular algorithm. Our optimizations fall into four categories:
memory transfers between host and device, device memory access, parallel execution, and instruction throughput. One
of the most valuable lessons learned was to compute as much data on the GPU as possible, rather than using the CPU
and then transferring the data. We discuss this and each of the other optimizations used, as well as a few additional
observations regarding the CUFFT library.
Our experiments were performed on a Linux workstation with an Nvidia GeForce GTX 260. On a single channel image
of size 710 x 470, we obtain over 40 frames per second (fps), while on a larger image of size 1900 x 1266, we get almost
6 fps (ignoring disk I/O). On a wider assortment of images, our implementation processes over 3 megapixels (1 megapixel = 1,048,576 pixels) per second. This is over 27 times faster than the CPU implementation by Krishnan and Fergus [1].
2. KRISHNAN/FERGUS ALGORITHM
In this section, we briefly review the algorithm of Krishnan and Fergus [1] for non-blind deconvolution of images. Their
method relies on the fact that the gradients of natural images exhibit a heavy-tailed distribution, and such priors have been effective in the past for regularizing the (ill-posed) deconvolution problem. These heavy-tailed distributions can be modeled using a hyper-Laplacian function: $p(x) \propto e^{-k|x|^{\alpha}}$, where 0.5 ≤ α ≤ 0.8. We start by formulating the deconvolution problem, assuming the blur kernel is known. Most of the notation used is borrowed from the original paper.
Let x be the original uncorrupted linear grayscale image of N pixels; y be the observed image corrupted with blur and/or
noise, which is assumed to be the result of a convolution with a known blur kernel k and the addition of zero-mean
Gaussian noise. The problem is to reconstruct x given y and k. Given the ill-posed nature of the problem, they regularize
it using a penalty term $|\cdot|^{\alpha}$ that acts on the output of a set of known filters $f_1, f_2, \cdots, f_j$. In their paper, Krishnan and Fergus use the two simplest gradients in the x- and y-directions ($f_1 = [1\ {-1}]$ and $f_2 = [1\ {-1}]^{T}$). Their formulation
seeks to find a MAP (Maximum-a-Posteriori) estimate of x in a probabilistic framework. This is equivalent to the
following optimization problem (i is the pixel index, λ controls the strength of the regularization, and ⊕ is the 2D
convolution operator):
$$\min_x \sum_{i=1}^{N} \left( \frac{\lambda}{2}\,(x \oplus k - y)_i^2 + \sum_{j=1}^{2} \left| (x \oplus f_j)_i \right|^{\alpha} \right) \qquad (1)$$
Let us denote $F_i^j x \equiv (x \oplus f_j)_i$. Using the half-quadratic penalty method of Geman and Yang [5], they introduce auxiliary variables $w_i^1$ and $w_i^2$ (denoted by w), thereby allowing them to move the $F_i^j x$ terms outside the $|\cdot|^{\alpha}$ expression. This results in a new cost function to be minimized.
$$\min_{x,w} \sum_{i=1}^{N} \left( \frac{\lambda}{2}\,(x \oplus k - y)_i^2 + \frac{\beta}{2} \left( \| F_i^1 x - w_i^1 \|_2^2 + \| F_i^2 x - w_i^2 \|_2^2 \right) + \sum_{j=1}^{2} \left| w_i^j \right|^{\alpha} \right) \qquad (2)$$
β is a weight that increases continuously throughout the optimization. As β→ ∞, the solution of the modified cost
function (eqn. (2)) converges to the original cost function (eqn. (1)). Their algorithm uses an alternating minimization
scheme where the non-convex part of the problem is solved in one phase, followed by a quadratic phase which can be
efficiently solved in the frequency domain using FFTs. The alternating minimization proceeds by solving for the optimal
x for a fixed w (called the x sub-problem), and vice versa (called the w sub-problem).
2.1 Solving the x sub-problem
In this case, because w is fixed, equation (2) is quadratic in x. Let $Kx \equiv x \oplus k$. Rewriting equation (2) in matrix form and taking derivatives, x can be obtained by solving the following linear system:

$$\left( F^{1T} F^1 + F^{2T} F^2 + \frac{\lambda}{\beta} K^T K \right) x = F^{1T} w^1 + F^{2T} w^2 + \frac{\lambda}{\beta} K^T y \qquad (3)$$
Assuming circular boundary conditions, they apply 2D FFTs, which diagonalize the convolution matrices $F^1$, $F^2$ and $K$, and x can be solved for optimally as:

$$x = \mathcal{F}^{-1} \left( \frac{\mathcal{F}(F^1)^* \circ \mathcal{F}(w^1) + \mathcal{F}(F^2)^* \circ \mathcal{F}(w^2) + (\lambda/\beta)\,\mathcal{F}(K)^* \circ \mathcal{F}(y)}{\mathcal{F}(F^1)^* \circ \mathcal{F}(F^1) + \mathcal{F}(F^2)^* \circ \mathcal{F}(F^2) + (\lambda/\beta)\,\mathcal{F}(K)^* \circ \mathcal{F}(K)} \right) \qquad (4)$$
where * denotes complex conjugate and ∘ denotes component-wise multiplication. The division is also performed
component-wise.
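Although our GPU implementation is described in Section 3, a minimal sketch may help make the component-wise arithmetic of eqn. (4) concrete. The kernel below assumes that the forward FFTs of $w^1$ and $w^2$ have been computed, and that the constant numerator term $(\lambda/\beta)\mathcal{F}(K)^* \circ \mathcal{F}(y)$ and the (real-valued) denominator have been precomputed, as in Algorithm 1; the names are illustrative and this is not the exact "compute_out" kernel referred to later.

    #include <cuComplex.h>

    // Sketch of the per-element update in eqn. (4). Inputs are the FFTs of w1 and
    // w2, the conjugated filter spectra, the precomputed constant numerator term,
    // and the precomputed real-valued denominator; the output is F(x), which is
    // then passed to an inverse FFT. All names are illustrative.
    __global__ void fourier_update(const cuFloatComplex* Fw1, const cuFloatComplex* Fw2,
                                   const cuFloatComplex* F1conj, const cuFloatComplex* F2conj,
                                   const cuFloatComplex* numConst, const float* denom,
                                   cuFloatComplex* Fx, int n)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) {
            cuFloatComplex num = cuCaddf(cuCmulf(F1conj[i], Fw1[i]),
                                         cuCmulf(F2conj[i], Fw2[i]));
            num = cuCaddf(num, numConst[i]);                 // + (lambda/beta) F(K)* o F(y)
            Fx[i] = make_cuFloatComplex(cuCrealf(num) / denom[i],
                                        cuCimagf(num) / denom[i]);
        }
    }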
2.2 Solving the w sub-problem
The novelty in the Krishnan-Fergus paper stems from this step. Given a fixed x, finding the optimal w requires solving 2N independent 1D problems of the form:

$$w^* = \arg\min_{w} \left( |w|^{\alpha} + \frac{\beta}{2}\,(w - v)^2 \right) \qquad (5)$$
where $v \equiv F_i^j x$. For the special case when $\alpha = (d-1)/d$, $d \geq 2$, the solution to the above minimization problem can be converted into finding the real roots of a univariate polynomial of degree $(d+1)$. The polynomial is of the form:

$$w\,(w - v)^d + (-\mathrm{sgn}(v))^{d-1} \left( \frac{\alpha}{\beta} \right)^{d} = 0 \qquad (6)$$
We are interested in the largest real root $w^*$ of the polynomial such that

$$w^* \in \begin{cases} \left[ v, \dfrac{2v}{d+1} \right], & \text{if } v < 0 \\[2mm] \left[ \dfrac{2v}{d+1}, v \right], & \text{if } v > 0 \end{cases}$$
Empirically, it has been found that the normal range for α for natural images is between 0.5 and 0.8. This implies that there are only 4 values of d (2, 3, 4 and 5) that satisfy the above requirement. As described in their paper, Krishnan and Fergus resort to a pre-computed LUT to solve this part of the sub-problem efficiently. The range of values for d limits the degree of the polynomial to between 3 and 6. For several values of β (from 1 to 256 in multiples of 2√2) and v (200,000 values uniformly in the range [-10, 10]), we determine the optimal w using an efficient and robust root solver described in [6]. For the sake of completeness, Algorithm 1 below reviews the main steps involved in the Krishnan-Fergus algorithm.
Algorithm 1: Fast image deconvolution using hyper-Laplacian priors
Input: Blurred image y; blur kernel k; regularization weight λ; prior exponent 0.5 ≤ α ≤ 0.8;
       β regime parameters: β_0, β_inc, β_max; number of inner iterations T
Output: Deblurred image x
    β = β_0; x = y;
    Precompute constant terms in eqn. (4).
    Generate a LUT by solving eqn. (6).
    while β < β_max do
        for i = 1 to T do
            Given x, solve eqn. (5) for all pixels using the LUT to obtain w
            Given w, solve eqn. (4) to obtain x
        end for
        β = β_inc · β
    end while
    return x
3. GPU IMPLEMENTATION
The nature of the Krishnan-Fergus deconvolution algorithm lends itself nicely to a GPU implementation. The first sub-problem in the alternating scheme is to solve for w, which, as previously discussed, can be approximated very quickly by
using a lookup table. The CUDA memory hierarchy provides efficient means for implementing such a table, including
the constant, texture, and global memory spaces. The second sub-problem, which solves for x, can be found directly in
the Fourier domain using only three FFTs per iteration of the algorithm. Nvidia provides an implementation of the Fast
Fourier Transform in their CUFFT library, which includes several parallel algorithms each with different performance
and accuracy characteristics. Based on these observations and previous experience with CUDA, we believed that
mapping the Krishnan-Fergus algorithm onto the GPU could lead to significantly faster running times.
Our initial CPU-based implementation of the Krishnan-Fergus algorithm was on a system with dual quad-cores and the
running time was almost two seconds for a relatively small image (710x470 pixels). As we ported the algorithm over to
the GPU using CUDA, we utilized many performance optimizations to achieve significant gains in running time. We
have organized these optimizations into four categories and present them below in sections 3.3 - 3.6. In the next two
sections, we briefly discuss an important tool for identifying bottlenecks, as well as the CUFFT library.
3.1 CUDA Visual Profiler
Before discussing our optimizations, we want to quickly emphasize the importance of profiling the CUDA code in order
to identify bottlenecks and inefficient code. Throughout our effort to port the algorithm to the GPU, we routinely made
use of the CUDA Visual Profiler (CVP), which provides a highly detailed report on the number of times operations
occurred, together with the amount of time they took to complete. Many other statistics are also included on a per-kernel
basis, such as the occupancy, registers used, and number of instructions issued. In particular, the GPU Time Height Plot,
shown in figure 1, illustrates with a bar chart how long each kernel and memory copy operation took to complete,
thereby quickly highlighting which operations should be optimized first. In this snapshot, the second, third, and last bars
show the time it takes to copy data to and from the GPU. The remaining bars, which are all roughly the same
magnitude, highlight the running times for all of the kernels in our implementation. After the I/O operations, the next
highest bars are all from kernels executed by the CUFFT library, and thus outside of our control. The next kernel for us
to optimize would be “compute_out” which is called once per iteration of the algorithm, and whose purpose is to
perform the final computations when solving for x. In the left column of the figure, the number of times (i.e. sessions)
that we executed the profiler and saved the output (up to this point) is shown to be 23. CVP is a great help in identifying
your bottlenecks and illustrating any performance improvements that result from code changes. It should be used often.
Figure 1: The CUDA Visual Profiler time height plot. The second, third, and last bars plot the time
(microseconds) to copy the data to and from the GPU. All other bars represent each of the kernels executed.
Figure 2: The CUDA Visual Profiler time summary plot. After padding the images to be a power of two, the
four most time consuming kernels are all within the CUFFT library, and occupy 47% of the total time.
3.2 CUFFT library
The CUFFT library is Nvidia’s parallel implementation of the Fast Fourier Transform on the GPU. The library contains
several algorithms for the FFT, each with different accuracy and performance capabilities, based upon the size of the input
data. Although the CUFFT library supports single and double precision, real and complex values, and forward and
backward transformations, it does have a few limitations. Two and three-dimensional transforms must contain fewer
than 16384 elements in each dimension, and the limit for one-dimensional transforms is about 8 million elements, which
equates to an image that is roughly 2800x2800 pixels. Based on some initial experiments, 1D transforms were slightly
faster than 2D transforms, and thus were used in our implementation of the Krishnan-Fergus algorithm.
In our initial implementation, we simply called the FFT routines with data whose size was based upon the original input
image, e.g., 710 pixels wide by 470 pixels high. Because the CUFFT library is optimized for transform sizes that are
products of powers of small prime factors, e.g. powers of two, we then tried padding the original image (and necessarily all
associated blur kernels and filters), to be the next largest power of two in each dimension. In the example above, the
new padded image was 1024 pixels wide by 512 pixels high, an increase of over 57% in the total number of pixels.
However, the running time of the entire Krishnan-Fergus algorithm was then cut by almost 80%. The size of the input
data clearly has a significant impact on which internal algorithm is used in the CUFFT library. We explore this fact
further in section 4.2.
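The sketch below illustrates how the padded sizes feed into CUFFT, using the batched 1D complex-to-complex interface for the rows of a padded image; the column pass (e.g. after a transpose) that completes the 2D transform is omitted, and the buffer is assumed to already hold the padded data.

    #include <cufft.h>

    // Round a dimension up to the next power of two (our initial padding rule;
    // Section 4.2 replaces this with a small table of "nice" sizes).
    static int nextPow2(int n)
    {
        int p = 1;
        while (p < n) p <<= 1;
        return p;
    }

    // Forward-transform the rows of a padded image with a batched 1D plan.
    void forward_fft_rows(cufftComplex* d_padded, int width, int height)
    {
        int padW = nextPow2(width);                            // e.g. 710 -> 1024
        int padH = nextPow2(height);                           // e.g. 470 -> 512

        cufftHandle plan;
        cufftPlan1d(&plan, padW, CUFFT_C2C, padH);             // one 1D FFT per row
        cufftExecC2C(plan, d_padded, d_padded, CUFFT_FORWARD); // in place
        cufftDestroy(plan);
    }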
Figure 2 is another snapshot of the CVP that shows the time summary plot for the kernels and memory transfers to and
from the GPU. In parentheses next to the name of each kernel is the number of times it was called. Even after padding
our data to be a power of two in each dimension, the CUFFT kernels, whose names all begin with spRadix, are still the
four most expensive kernels in our implementation and they sum to over 47% of the current running time. There are
several alternative implementations of FFT on the GPU [7,8,9] which could potentially lead to faster running times for
our algorithm, but we have not yet experimented with them.
3.3 Memory transfers
Transferring data between the host machine and the GPU is often one of the biggest bottlenecks for CUDA
implementations, and this algorithm was no exception. As can be seen in figure 1, the two longest operations are when
we copy the lookup table from the CPU to the GPU (second bar), and again when we copy the final image back from the
GPU (last bar). What this figure does not show are the other numerous spikes related to data movement that we were
able to reduce or eliminate by being more careful about memory transfers. These optimizations fall roughly into two
categories: minimizing the amount of data transferred, and making the transfers faster.
One of the most important decisions to make when implementing any algorithm for the GPU is whether to use single or
double precision floating point variables. It is easy to fall into the mindset of using double precision for greatest
accuracy, but one really must evaluate if this is absolutely necessary when programming for the GPU. One reason for
this is that Nvidia GPUs' double-precision performance is only 1/8th that of single precision (until just recently, when
the Fermi architecture was released, at which point the double precision performance increased to one half that of single
precision, as one might expect). Another reason is the increase in all GPU resources needed for double precision:
registers, on-chip memory, on-device memory, and bandwidth required to transfer the data to and from the GPU.
Our initial lookup table was over 100MB in size, and included four values of alpha and 17 values of beta. Each entry
was also stored as a double-precision floating-point value. Upon more careful inspection of the output images when
using single versus double precision variables, we realized that the differences were imperceptible (on the order of 1 or 2
units difference out of 255 for the red, green, or blue components of the image) and therefore single precision would be
sufficient for our needs. This issue needs to be evaluated for any problem domain, but for image data, a difference of this small magnitude will not be noticeable to the end-user, whereas the difference in running times will be. In
addition to using floats, we also limited the values of alpha and beta to be exactly what was required: we could limit
alpha to only one value, and beta to only six values. In the end, our lookup table was reduced to only 4MB.
As discussed in the CUFFT section, we decided to pad our images, blur kernels, and image filters to all be a power of
two in each dimension. Initially we did this on the CPU and then transferred the data to the GPU. However, by using
CVP, we immediately realized how much time this was taking and modified our approach. In our current version of the
code, we only transfer the bare minimum data required, and then pad the data on the GPU itself using very simple
kernels. This provided a nice reduction in data movement for the input image, but a more significant savings occurred
for the blur kernel and image filters, which are much smaller, e.g. the Gaussian blur kernel is only 13x13 pixels. The
reduction in times to copy and pad the image, blur kernels, and image filters ranged from 50% to as much as 95%.
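The padding kernels themselves are trivial; the following is a sketch of the kind of kernel we mean, in which each thread copies one source pixel into a larger, pre-zeroed (cudaMemset) destination buffer at a given offset.

    // Copy an unpadded image into a zero-initialized padded buffer on the device.
    __global__ void pad_kernel(const float* src, int srcW, int srcH,
                               float* dst, int dstW, int offX, int offY)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < srcW && y < srcH)
            dst[(y + offY) * dstW + (x + offX)] = src[y * srcW + x];
    }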
In addition to limiting the amount of data to transfer to and from the GPU, there are also several standard optimizations
that help make the transfers faster. One of the optimizations we used is page-locked memory on the host, which
prevents the operating system from paging the memory out, and therefore the GPU’s access to that data over PCI-Express can be at full speed. Another well-known optimization that we used is that of asynchronous memory copies.
Rather than waiting for the lookup table to finish being copied to the GPU, we can use Nvidia’s asynchronous API
which allows kernels to execute in parallel with the memory copy. Thus, we can run our simple padding kernels while the lookup table is in flight, hiding the latency of the memory copy.
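Putting the two together, the transfer pattern looks roughly like the sketch below (buffer names and launch configuration are illustrative): the lookup table is copied asynchronously from page-locked host memory in one stream while the padding kernel sketched above runs in another.

    #include <cuda_runtime.h>

    void upload_lut_and_pad(float* d_lut, size_t lutBytes,
                            const float* d_srcImage, int srcW, int srcH,
                            float* d_padImage, int padW)
    {
        float* h_lut;
        cudaMallocHost((void**)&h_lut, lutBytes);   // page-locked, so the copy below is truly asynchronous
        /* ... fill h_lut on the CPU ... */

        cudaStream_t copyStream, padStream;
        cudaStreamCreate(&copyStream);
        cudaStreamCreate(&padStream);

        cudaMemcpyAsync(d_lut, h_lut, lutBytes, cudaMemcpyHostToDevice, copyStream);

        dim3 block(16, 16);
        dim3 grid((srcW + 15) / 16, (srcH + 15) / 16);
        pad_kernel<<<grid, block, 0, padStream>>>(d_srcImage, srcW, srcH,
                                                  d_padImage, padW, 0, 0);

        cudaDeviceSynchronize();                    // both streams finish before the solver starts
        cudaStreamDestroy(copyStream);
        cudaStreamDestroy(padStream);
        cudaFreeHost(h_lut);
    }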
3.4 Memory access
Although each GPU multiprocessor has shared memory and 32-bit registers on the chip itself, many data accesses that
the kernels make are to the global memory on the graphics card. Global memory is not cached (except on compute capability 2.x devices), and these accesses have a much longer latency than those to on-chip memory, so great care
must be taken to fully optimize them. Nvidia’s CUDA C Best Practices Guide goes into depth on how to make global
accesses coalesced, or aligned, and claims this may be the single most important performance consideration when
programming for the CUDA architecture [10]. We agree with this assertion and note that this optimization has become
commonplace when writing CUDA code. To verify that each of our kernels was making aligned accesses, we again
used CVP to see how many 32-, 64-, and 128-byte global loads and stores were made. By analyzing these numbers, we
could optimize our code and reduce the number of 32- and 64-byte accesses, thereby increasing our overall performance.
As global memory is not cached, it is important to know about the texture and constant memory spaces available in the
CUDA architecture as well. These memories are cached and are ideally suited for memory that is going to be loaded
once and then accessed in a read-only fashion. Our lookup table falls perfectly into this scenario. Our original
implementation used global memory to store the lookup table, but by accessing the data using a 1D texture, the
“solve_image” kernel’s running time was cut by 50%. As this kernel is called twice per iteration of the algorithm (12
times in all), the improvement in this one kernel reduced the total running time by 5%.
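The sketch below shows the texture path for the LUT using the texture reference API available on compute 1.x devices; the index computation is a placeholder that assumes the layout described in Section 2.2 (200,000 samples of v in [-10, 10] per value of β), which is not exactly how our table is organized.

    // Read-only LUT access through the cached 1D texture path.
    texture<float, 1, cudaReadModeElementType> lutTex;

    __device__ int lutIndex(float v, int betaStep)
    {
        // Hypothetical layout: 200,000 uniform samples of v in [-10, 10] per beta.
        float t = (v + 10.0f) * (200000.0f / 20.0f);
        int i = min(max(__float2int_rn(t), 0), 199999);
        return betaStep * 200000 + i;
    }

    __global__ void solve_w(const float* v, float* w, int betaStep, int n)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n)
            w[i] = tex1Dfetch(lutTex, lutIndex(v[i], betaStep));  // cached fetch
    }

    // Host side: bind the device LUT once, after it has been copied over.
    void bind_lut_texture(const float* d_lut, size_t lutBytes)
    {
        cudaBindTexture(NULL, lutTex, d_lut, lutBytes);
    }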
3.5 Parallel execution
Running algorithms in parallel is the core of GPU computing. Serial operations can probably best be executed on the
CPU, while parallel portions can usually be efficiently mapped to the GPU, provided some care is taken to maximize the
parallelism given its fixed resources. We have already discussed one standard optimization technique regarding
parallelism and that is overlapping memory transfers between the host machine and the GPU device with other work.
These asynchronous copies allow kernels to be executing on data already loaded onto the device, while new data is
copied over to it, effectively hiding the latency involved in the data copy.
For GPU devices with CUDA compute capability 1.x, only one kernel can be executed at a time, although the kernels are
run in multiple thread blocks. To keep the multiprocessors busy on the GPU, there should be at least as many thread
blocks as there are multiprocessors, but ideally there will be more in order to hide the latency that exists when a warp (a
grouping of 32 threads) is not ready to execute its next instruction, e.g. if it must wait for data to be loaded from global
memory or some other operations to complete. When one warp stalls, another warp can be quickly scheduled to execute
and effectively hide the latency of the memory access.
In our implementation, we have maximized the number of warps and thread blocks that are running on the GPU by
carefully managing the limited GPU resources, including registers and shared memory that are common to each thread
block. By carefully rewriting our GPU kernels, and in some cases subdividing them into smaller kernels, we were able
to reduce the number of registers and shared memory needed, thereby increasing the number of blocks that could be
executed in parallel. One immediate benefit of using floats instead of doubles is that the number of registers used was
cut in half. Another means of reducing the number of registers was by eliminating temporary variables that were
convenient when writing the code, but not strictly necessary for correct execution. For example, by reordering some of
the computations in one kernel, we could avoid several temporary variables that increased our register use.
To determine how many registers and how much shared memory were used by each kernel, we ran our code through the
CVP, which automatically reports these statistics. We could then plug this information into the CUDA Occupancy
Calculator to determine where the bottlenecks were and how we could best increase the amount of parallelism in the
kernels. The Occupancy Calculator is really just a simple spreadsheet that computes this information for you once you
have entered only four pieces of information: CUDA compute capability of the GPU, number of threads per block,
registers needed per thread, and shared memory needed per thread. Only the first item is completely determined by the
graphics hardware you have, while the three remaining items can be influenced by the user and the implementation.
Occupancy is loosely defined as the ratio of the number of warps running concurrently on a multiprocessor to the maximum number of warps that could be running concurrently based on the hardware limits of the GPU. Having an occupancy of 1 (or 100%) is the best case, but does not necessarily translate to improved performance over lower ratios. However, a low occupancy may prevent the multiprocessor from hiding latency adequately for memory-intensive kernels.
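As a concrete example of the arithmetic behind the Occupancy Calculator, the small host-side program below reproduces it for a compute 1.3 device such as the GTX 260 (per-multiprocessor limits: 16384 registers, 16 KB shared memory, 8 blocks, 32 warps), ignoring the register-allocation granularity that the real spreadsheet also accounts for; the per-kernel figures are illustrative.

    #include <stdio.h>

    int main(void)
    {
        const int maxRegs = 16384, maxSmem = 16384, maxBlocks = 8, maxWarps = 32;  // compute 1.3 limits
        int threadsPerBlock = 256, regsPerThread = 16, smemPerBlock = 2048;        // example kernel

        int warpsPerBlock = threadsPerBlock / 32;
        int blocks = maxRegs / (regsPerThread * threadsPerBlock);                  // register limit
        int bySmem = maxSmem / smemPerBlock;                                       // shared memory limit
        if (bySmem < blocks) blocks = bySmem;
        if (blocks > maxBlocks) blocks = maxBlocks;
        if (blocks * warpsPerBlock > maxWarps) blocks = maxWarps / warpsPerBlock;

        // 256 threads, 16 registers/thread, 2 KB shared memory -> 4 blocks, 32 warps, 100%
        printf("occupancy = %.0f%%\n", 100.0 * blocks * warpsPerBlock / maxWarps);
        return 0;
    }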
3.6 Instruction throughput
Other optimizations to CUDA code involve trying to minimize the number of cycles required to get the same work done.
In other words, by reducing the number of cycles needed, the GPU can process more data in the same amount of time,
thereby increasing the amount of data processed per instruction, i.e. the instruction throughput. We have utilized these
optimizations in several forms, such as minimizing slow arithmetic instructions, minimizing divergent warps, and
avoiding unnecessary type conversions.
Arithmetic instructions can vary significantly in the number of operations per clock cycle on the GPU. For CUDA
compute capability 1.x, eight 32-bit floating-point adds can occur in each clock cycle, but only one 64-bit floating-point
add [11]. 32-bit integer multiplications, on the other hand, require several instructions by themselves, so are very inefficient by comparison. However, 24-bit integer multiplications (e.g. __[u]mul24) can again be done eight times per clock cycle, and therefore should be utilized when appropriate. For example, when computing the index of a specific thread in a kernel, one could replace the standard int index = blockDim.x * blockIdx.x + threadIdx.x with the following: int index = __umul24(blockDim.x, blockIdx.x) + threadIdx.x. Another significant hit is using division or
modulo inside the kernel. For many of our kernels, since they involve two-dimensional images implemented as one-dimensional arrays, we are often dividing the index by the number of columns to determine which row of the image we
are going to operate on. Rather than doing this repeatedly inside the kernels, we could instead multiply by the reciprocal
of the number of columns, which is fixed for a given image and can be computed once and stored in constant memory.
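A sketch of these two substitutions is shown below; invNumCols is assumed to be set once by the host (via cudaMemcpyToSymbol) to 1.0f / numCols, and in practice the reciprocal must be checked for rounding error at exact multiples of the column count.

    __constant__ float invNumCols;   // 1.0f / numCols, set once from the host

    __global__ void pixel_rows(int* rowOfPixel, int n)
    {
        int index = __umul24(blockDim.x, blockIdx.x) + threadIdx.x;  // 24-bit multiply
        if (index < n)
            rowOfPixel[index] = (int)(index * invNumCols);           // replaces index / numCols
    }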
Divergence refers to the situation when threads in a single warp take different code paths due to a branch condition, e.g.
if and switch statements. As a result, the different execution paths must be serialized, thereby reducing the amount of
parallelism achieved. For all of our kernels, we made great effort to eliminate divergence within warps. For example,
by carefully choosing the number of threads per thread block, and by padding our images to be of sizes that are a power
of two, we could guarantee that every thread in every thread block was assigned the work associated with one of the
pixels in the image. Thus, we could immediately remove all of the if-statements that check boundary conditions (e.g. if (index < num_pixels)) and cause divergence within warps.
We have already spoken at some length about using floats instead of doubles; however, care must be taken to avoid all implicit conversions to doubles. By using the --keep compiler option and examining the temporary files, we were able to find several instances of literal constants that were being converted to doubles automatically. To avoid this, simply adding an “f” after the literal prevents the conversion, e.g. 1.0f. Another good lesson was to remove the sm_13 compiler directive; this prevents the compiler from using doubles at all, so warning messages are issued at compile time whenever a double is demoted to a float. Such messages allow you to easily find and fix those problems. Another optimization is to use the single-precision functions when appropriate. For example, using rintf() to round a floating point variable to an integer requires only one instruction, whereas roundf() requires eight instructions. There are many other such examples of this as well, including truncf(), ceilf(), and floorf(). Charts detailing these instruction counts can be found in the NVIDIA CUDA C Programming Guide [11].
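As a small illustration of these last two points, the kernel below keeps every constant suffixed with “f” and uses rintf() (round-to-nearest-even) rather than roundf(); values and names are purely illustrative.

    __global__ void scale_and_round(const float* in, float* out, int n)
    {
        int i = __umul24(blockDim.x, blockIdx.x) + threadIdx.x;
        if (i < n) {
            float v = in[i] * 0.5f;       // 0.5 without the suffix would be a double constant
            out[i] = rintf(v * 255.0f);   // one instruction; roundf() would take eight
        }
    }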
4. EXPERIMENTAL RESULTS
Our experiments were conducted on a dual quad-core (Intel Xeon E5506 @ 2.13GHz) 64-bit Linux workstation, with
12GB of RAM and an Nvidia GeForce GTX 260 graphics adapter. The GTX 260 has 896MB of global memory, 27
multiprocessors (216 cores total), and CUDA compute capability 1.3. The CUDA driver at the time was version 3.0.
4.1 Running Times
Figures 8 and 9 are examples of the results of our implementation of the Krishnan-Fergus algorithm for the GPU. The 3-channel color images in figure 8 are of size 710 x 470 pixels, and were processed in 0.066 seconds (i.e. the equivalent of
over 15 frames per second (fps)). The 1-channel version of this image required 0.024 seconds (about 41 fps), roughly
one-third of the time for the RGB image. The time is not exactly one-third due to the fixed start-up costs associated with
the implementation, such as copying the lookup table, kernels, and filters to the GPU, which only happen once
regardless of the number of channels (or images) being deblurred. These times indicate that our algorithm was
processing 4.8 Mpel/sec for 3-channel images, and 13 Mpel/sec for 1-channel images. Similarly, the one- and three-channel deblurring times for the images in figure 9 are respectively 0.041 seconds and 0.116 seconds (or the equivalent
of 24.4 fps and 8.6 fps). These images are slightly larger than those in figure 8 at 768 x 512 pixels, and achieved a
slightly lower throughput of 3.2 Mpel/sec for the 3-channel image, and 9.1 Mpel/sec for the 1-channel image. This
reduction is due to the larger padded size of this image (1024x1024) compared to the one in figure 8 (1024x512).
Although the number of rows in the image in figure 9 is already a power of two (i.e. 512), we have to pad all images by
half the kernel size along each border, followed by the additional padding up to the next larger power of two. Thus we
pad this image by 6 pixels around the border (increasing the width and height of each image by 12 pixels), which then
forces us to pad up to 1024 in each dimension to achieve our desired size for the CUFFT library.
Figure 3: Running time of our GPU implementation of the Krishnan-Fergus algorithm.
We plot the number of megapixels in each image versus the running time.
Figure 3 shows the running time of our GPU implementation on over 40 (3-channel) images of various sizes. Upon
initial inspection, the running times appear to increase linearly with respect to the number of megapixels in the image,
and in fact, the equation of the linear regression line was calculated to be f(x) = 0.314x + 0.0019, with r² = 0.92. This
implies that our implementation can process over 3 Mpel per second for 3-channel images. On closer inspection
however, we noticed a staircase pattern in which distinct jumps in running times were clearly evident as the image sizes
increased. This effect is an artifact of our padding the size of the images up to the next highest power of two for
performance considerations related to the CUFFT library. We explore this behavior further in the following section.
4.2 CUFFT Performance
As noted in section 3.2, by padding our image sizes to be the next largest power of two, the overall running time of the
algorithm was significantly reduced. This improvement was a direct consequence of the reduction in time taken within
the CUFFT library. Documentation on the library [12] indicates that performance is best when the size of the
transformation is a product of powers of small prime factors, e.g. 2, 3, 5, 7. While our initial optimization based on powers of two
produced significant improvements, they also seemed excessive for images that were already of sizes that were powers
of two because we would have to quadruple the number of pixels processed (essentially doubling the size in each
dimension) to get the performance gain. We asked ourselves whether we could find smaller sizes that also factor easily into small primes and that would offer similar performance gains.
We conducted an experiment in which we started with an image that was 512 x 512 pixels in size and then padded it by
12 pixels in each dimension (as required by our implementation for processing the image) to produce an image that was
524 x 524. Next, we padded the image to be n x n pixels, where n ranged from 525 all the way up to 2750 pixels. For
each padded image, we ran our algorithm to determine the overall running time. The goal was to determine which sizes
worked best with the CUFFT library and thus produced the best overall running times. Figure 4 (left) shows the plot of
the padded image size versus running time for all values. Although the figure is quite cluttered, the overall trends are
illustrated. Figure 4 (right) shows only the first 100 entries of that plot to highlight more clearly the behavior as we
increase the dimensions of the image. The running times appear to jump all over the place, but there is a clear pattern: the longest times occur exactly when the padded size is a prime number, such as 541, 547, 557, 563, etc. If you ignore those times, you will see another set of values that are higher than the rest and also appear on a monotonically increasing curve. These times correspond to padded sizes that are equal to a small prime factor, e.g. 2, 3, 5, multiplied by a large prime number. For example, the first four such entries are 526 (2 * 263), 537 (3 * 179), 538 (2 * 269), and 542 (2 * 271). The most important information in these plots, however, is the fastest running times, since they allow us to optimize our padded sizes for all images and thereby produce the fastest results.
Figure 4: Line plots showing padded image size versus running time. The plot on the right shows
a zoomed view of the first 100 entries in the left plot. Prime numbers are clearly inefficient.
To find these values, we sorted the results (pairs of sizes and running times) according to running times and then only
kept those sizes that were monotonically increasing. For example, the six fastest running times for the sizes reflected in
figure 4 were 540, 576, 600, 528, 546, and 625. If we only keep the monotonically increasing values, we get 540, 576,
600, and 625. The reason for this is that it makes no sense to ever pad an image up to size 528 x 528 because we found
that padding to 540 x 540 produces even faster running times. For this example image, the running times for these four
optimized sizes are 0.0772, 0.0843, 0.0849, and 0.0963 seconds. Again, this is attributed to the fact that these numbers
are nicely factored into small prime numbers, which allows the CUFFT library to use the most efficient kernels. For this
experiment, after sorting and keeping the monotonically increasing sizes, we obtained a lookup table that contains 24
entries: 540, 576, 600, 625, 648, 720, 729, 1024, 1080, 1125, 1296, 1323, 1536, 2048, 2058, 2160, 2304, 2400, 2500,
2592, 2625, 2646, 2700, and 2744. As you can see, there are several sizes that are faster than just padding to 1024, as
our original algorithm would do. Our new algorithm for deciding what sizes to pad images up to is based on this lookup table: if the original image size is less than or equal to 540, pad to 540; otherwise, scan the remaining entries in increasing order and pad to the first entry that is at least as large as the image size. For images that are even larger than 2750, or smaller than 512, we simply used the next larger power of two. However, 2750 appears to be the largest image that we can process with our algorithm on the GTX 260 without running out of graphics memory.
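In code, the selection rule reads roughly as follows (a sketch of our reading of the rule above; sizes below 512 or beyond the table fall back to the next power of two).

    // Choose the padded dimension for the CUFFT transforms from the 24-entry table.
    static int choose_padded_size(int n)
    {
        static const int table[24] = {
             540,  576,  600,  625,  648,  720,  729, 1024,
            1080, 1125, 1296, 1323, 1536, 2048, 2058, 2160,
            2304, 2400, 2500, 2592, 2625, 2646, 2700, 2744
        };
        if (n >= 512) {
            for (int i = 0; i < 24; ++i)
                if (n <= table[i]) return table[i];   // first entry at least as large as n
        }
        int p = 1;                                    // fallback: next power of two
        while (p < n) p <<= 1;
        return p;
    }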
To test our new algorithm, we conducted another experiment in which we resized a single image (originally 1000 x 1000
pixels) to be of size n x n pixels, where n ranges from 1 to 2000. For each resized image, we ran our software and
recorded the overall running time for our original algorithm, shown in figure 5, as well as our new algorithm, shown in
figure 6. The staircase effect that we described earlier is clearly evident in figure 5 whenever we hit a power of two.
The inset image highlights the behavior for the smallest of values, which are not discernible in the larger plot. The
results from our new algorithm in figure 6 also exhibit the staircase behavior, but at much finer levels, which allows our
running times to be significantly reduced for many image sizes. The inset image again shows a zoomed view of the
smallest image sizes, although we have truncated the data to start at 500 pixels, since the algorithms are the same up
until this point. One thing to note is that these results are for square images. When the width and height of the images
are not equal, additional staircase effects will occur. Another interesting point is that our new algorithm supports image
sizes up to 7 Mpel, whereas the original one caused us to run out of graphics memory around 4 Mpel because the padded
sizes simply became too large.
Figure 5: Padded image size plotted against running time for our original algorithm of padding up
to the next largest power of two. The inset image shows a zoomed view of the smallest sizes.
Figure 6: Padded image size plotted against running time for both our original and optimized padded
algorithms. The inset image shows a zoomed view of the smallest sizes, starting at 500 x 500 pixel images.
As a final test, we applied our new padding algorithm to the same input images from figure 3. The new running times,
(solid circles) along with the old values (crosses), are shown in figure 7. In all but a single case, the optimized algorithm
produces times that are as good as, or better than, the original algorithm. In some cases, when the algorithms pad to the
same sizes, there are normal fluctuations in running times that have one or the other slightly faster. We discount these
discrepancies and consider the results equivalent. In other cases, the new algorithm results in a reduction in time of
between 10% and 40%. The one significant exception is for an image that is of size 1600 x 1200 pixels. Based on the original algorithm, the padded size is 2048 x 2048, or 2^22 pixels, but the new algorithm pads it to 2048 x 1296, which can be written 2^15 * 3^4. In this example, the new running time is almost 15% slower than the original. The newest documentation on the CUFFT library does provide some new guidelines as to what sizes will be best for their implementation. In fact, they comment specifically on this type of example, where two candidate sizes are close but the larger is a power of a single prime factor while the smaller is a product of two prime factors. Further experimentation
could shed more light on these optimal sizes, but it would require an exhaustive test of all possible image dimensions.
5. CONCLUSIONS
We have presented our implementation of the Krishnan-Fergus deconvolution algorithm using Nvidia’s CUDA
architecture. The algorithm requires efficient solutions to two sub-problems, each of which map well onto the GPU.
One sub-problem can be approximately solved using a lookup-table, which maps perfectly to the cached texture
memory. The other sub-problem can be solved using FFTs, which can be efficiently computed in parallel thanks to the
CUFFT library provided by Nvidia. Programming for the GPU is a non-trivial task, so we have carefully described the numerous optimizations employed to achieve these interactive running times. We have also made several observations regarding the performance of the CUFFT library, and how, by carefully selecting the transform sizes, one can see
significant improvements. For 3-channel color images, our GPU implementation can process over 3 megapixels per
second (often more), which is over 27 times faster than the original CPU implementation by Krishnan and Fergus.
However, our GPU implementation is not without limitation. Currently we can process images up to around 2700 x 2700
pixels, after which we run out of graphics memory on the GTX 260 (which has 896MB). For newer graphics cards, with
additional memory, we could push this limit higher, but we could also achieve this by being more judicious in our
memory usage on the graphics card.
Figure 7: Running time of our GPU implementation of the Krishnan-Fergus algorithm using our
new padding algorithm. In all but two cases, the new algorithm results in faster running times.
REFERENCES
[1] Krishnan, D. and Fergus, R., “Fast Image Deconvolution using Hyper-Laplacian Priors,” NIPS, 1033-1041 (2009).
[2] Domanski, L., Vallotton, P. and Wang, D., “Two and Three-Dimensional Image Deconvolution on Graphics
Hardware”, MODSIM09 (2009).
[3] Richardson, W. H., “Bayesian-Based Iterative Method of Image Restoration”, Journal of the Optical Society of
America (JOSA), 62 (1):55–59 (1972).
[4] Lucy, L. B., “An iterative technique for the rectification of observed distributions”, Astronomical Journal, 79 (6):
745–754 (1974).
[5] Geman, D. and Yang, C., “Nonlinear image recovery with half-quadratic regularization”, IEEE PAMI, 4:932–946
(1995).
[6] Krishnan, S., Foskey, M., Culver, T., Keyser, J. and Manocha, D., “PRECISE: Efficient Multiprecision Evaluation
of Algebraic Roots and Predicates for Reliable Geometric Computations”, Proceedings of Seventeenth Annual
ACM Symposium on Computational Geometry, 274-283 (2001).
[7] Govindaraju, N., Lloyd, B., Dotsenko, Y., Smith, B., and Manferdelli, J., “High Performance Discrete Fourier
Transforms on Graphics Processors”, ACM/IEEE Supercomputing (2008).
[8] Nukada, A. and Matsuoka, S., “Auto-Tuning 3-D FFT Library for CUDA GPUs”, ACM/IEEE Supercomputing
(2009).
[9] Gu, L., Li, X., and Siegel, J., “An Empirically Tuned 2D and 3D FFT Library on CUDA GPU”, 24th International
Conference on Supercomputing (2010).
[10] NVIDIA CUDA™: CUDA C Best Practices Guide, version 3.2, Aug 2010.
[11] NVIDIA CUDA™: CUDA C Programming Guide, version 3.2, Oct 2010.
[12] NVIDIA CUDA™: CUDA CUFFT Library, version PG-05327-032_V02, Aug 2010.
Figure 8: Original image (top) and deblurred image (bottom). The original image size is preserved
to prevent artifacts due to resizing.
Figure 9: Original image (top) and deblurred image (bottom). The original image size is preserved
to prevent artifacts due to resizing.