Real-time image deconvolution on the GPU

James T. Klosowski and Shankar Krishnan
AT&T Labs Research, 180 Park Ave., Florham Park, NJ USA 07932
ABSTRACT
Two-dimensional image deconvolution is an important and well-studied problem with applications to image deblurring
and restoration. Most of the best deconvolution algorithms use natural image statistics that act as priors to regularize the
problem. Recently, Krishnan and Fergus presented a fast deconvolution algorithm that yields results comparable to the
current state of the art. They use a hyper-Laplacian image prior to regularize the problem. The resulting optimization
problem is solved using alternating minimization in conjunction with a half-quadratic penalty function. In this paper, we
provide an efficient CUDA implementation of their algorithm on the GPU. Our implementation leverages many well-known CUDA optimization techniques, as well as several others that have a significant impact on this particular
algorithm. We discuss each of these, as well as make a few observations regarding the CUFFT library. Our experiments
were run on an Nvidia GeForce GTX 260. For a single channel image of size 710 x 470, we obtain over 40 fps, while on
a larger image of size 1900 x 1266, we get almost 6 fps (without counting disk I/O). In addition to running times that scale linearly with image size, we believe ours is the first implementation to perform deconvolution at video rates. Our running times also
demonstrate that our GPU implementation is over 27 times faster than the original CPU implementation.
Keywords: Deconvolution, deblurring, GPU, CUDA, hyper-Laplacian, FFT
1. INTRODUCTION
Two-dimensional image deconvolution is an important and well-studied problem with applications to image deblurring
and restoration. The original problem formulation is ill-posed, and requires some form of regularization to make it
tractable. Numerous deconvolution algorithms exist, varying greatly in their performance and solution. Simple filtering
algorithms are extremely fast, but yield poor results. Most of the best algorithms use natural image statistics that act as
priors to regularize the problem.
More recently, Krishnan and Fergus [1] presented a fast non-blind deconvolution algorithm that yields results comparable
to the current state of the art. Krishnan and Fergus use a hyper-Laplacian image prior to regularize the problem. The
resulting optimization problem is solved using alternating minimization in conjunction with a half-quadratic penalty
function. While one of the sub-problems uses a standard L2 minimization solved in closed form in the Fourier domain,
the key novelty in their approach is that the other sub-problem is a non-convex problem that is separable over the pixels.
This per-pixel step can be solved by polynomial root finding. Since the coefficients of the polynomial depend only on
system parameters and a range of image gradient values, the valid roots can be pre-computed and stored in a lookup
table (LUT). This makes their approach particularly well-suited for a GPU implementation.
There have been some attempts to implement image deconvolution algorithms on the GPU. Recently, Domanski et al.
[2] have implemented the Richardson-Lucy algorithm [3,4] on an Nvidia GTX 260 GPU using Fourier space
convolutions. They report an 8.5x performance improvement over a CPU implementation.
Our Contributions: In this paper, we provide an implementation of the Krishnan-Fergus algorithm on the GPU using
the CUDA architecture. We have made extensive use of many well-known CUDA optimization techniques, as well as
several others that have a significant impact on our particular algorithm. Our optimizations fall into four categories:
memory transfers between host and device, device memory access, parallel execution, and instruction throughput. One
of the most valuable lessons learned was to compute as much data on the GPU as possible, rather than using the CPU
and then transferring the data. We discuss this and each of the other optimizations used, as well as a few additional
observations regarding the CUFFT library.
Our experiments were performed on a Linux workstation with an Nvidia GeForce GTX 260. On a single channel image
of size 710 x 470, we obtain over 40 frames per second (fps), while on a larger image of size 1900 x 1266, we get almost
6 fps (ignoring disk I/O). On a wider assortment of images, our implementation processes over 3 megapixels (1 megapixel = 1,048,576 pixels) per second. This is over 27 times faster than the CPU implementation by Krishnan and Fergus [1].
2. KRISHNAN/FERGUS ALGORITHM
In this section, we briefly review the algorithm of Krishnan and Fergus [1] for non-blind deconvolution of images. Their
method relies on the fact that the gradients of natural images exhibit a heavy-tailed distribution, and such priors have been effective in the past for regularizing the (ill-posed) deconvolution problem. These heavy-tailed distributions can be modeled using a hyper-Laplacian function: $p(x) \propto e^{-k|x|^{\alpha}}$, where 0.5 ≤ α ≤ 0.8. We start by formulating the deconvolution problem, assuming the blur kernel is known. Most of the notation used is borrowed from the original paper.
Let x be the original uncorrupted linear grayscale image of N pixels; y be the observed image corrupted with blur and/or
noise, which is assumed to be the result of a convolution with a known blur kernel k and the addition of zero-mean
Gaussian noise. The problem is to reconstruct x given y and k. Given the ill-posed nature of the problem, they regularize
it using a penalty term $|\cdot|^{\alpha}$ that acts on the output of a set of known filters $f_1, f_2, \cdots, f_j$. In their paper, Krishnan and Fergus use the two simplest gradients in the x- and y-directions ($f_1 = [1\ {-1}]$ and $f_2 = [1\ {-1}]^{T}$). Their formulation
seeks to find a MAP (Maximum-a-Posteriori) estimate of x in a probabilistic framework. This is equivalent to the
following optimization problem (i is the pixel index, λ controls the strength of the regularization, and ⊕ is the 2D
convolution operator):
$$\min_x \sum_{i=1}^{N} \left( \frac{\lambda}{2}\,(x \oplus k - y)_i^2 + \sum_{j=1}^{2} \left| (x \oplus f_j)_i \right|^{\alpha} \right) \qquad (1)$$
Let us denote $F_i^j x \equiv (x \oplus f_j)_i$. Using the half-quadratic penalty method of Geman and Yang [5], they introduce auxiliary variables $w_i^1$ and $w_i^2$ (denoted by w), thereby allowing them to move the $F_i^j x$ terms outside the $|\cdot|^{\alpha}$ expression. This results in a new cost function to be minimized.
$$\min_{x,w} \sum_{i=1}^{N} \left( \frac{\lambda}{2}\,(x \oplus k - y)_i^2 + \frac{\beta}{2} \left( \| F_i^1 x - w_i^1 \|_2^2 + \| F_i^2 x - w_i^2 \|_2^2 \right) + \sum_{j=1}^{2} \left| w_i^j \right|^{\alpha} \right) \qquad (2)$$
β is a weight that increases continuously throughout the optimization. As β→ ∞, the solution of the modified cost
function (eqn. (2)) converges to the original cost function (eqn. (1)). Their algorithm uses an alternating minimization
scheme where the non-convex part of the problem is solved in one phase, followed by a quadratic phase which can be
efficiently solved in the frequency domain using FFTs. The alternating minimization proceeds by solving for the optimal
x for a fixed w (called the x sub-problem), and vice versa (called the w sub-problem).
2.1 Solving the x sub-problem
In this case, because w is fixed, equation (2) is quadratic in x. Let $Kx \equiv x \oplus k$. Rewriting equation (2) in matrix form and taking derivatives, x can be obtained by solving the following linear system:

$$\left( F^{1T} F^1 + F^{2T} F^2 + \frac{\lambda}{\beta} K^T K \right) x = F^{1T} w^1 + F^{2T} w^2 + \frac{\lambda}{\beta} K^T y \qquad (3)$$
Assuming circular boundary conditions, they apply 2D FFTs, which diagonalize the convolution matrices $F^1$, $F^2$ and $K$, and x can be solved for optimally as:

$$x = \mathcal{F}^{-1} \left( \frac{\mathcal{F}(F^1)^* \circ \mathcal{F}(w^1) + \mathcal{F}(F^2)^* \circ \mathcal{F}(w^2) + (\lambda/\beta)\,\mathcal{F}(K)^* \circ \mathcal{F}(y)}{\mathcal{F}(F^1)^* \circ \mathcal{F}(F^1) + \mathcal{F}(F^2)^* \circ \mathcal{F}(F^2) + (\lambda/\beta)\,\mathcal{F}(K)^* \circ \mathcal{F}(K)} \right) \qquad (4)$$
where * denotes complex conjugate and ∘ denotes component-wise multiplication. The division is also performed
component-wise.
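Although our GPU implementation is described in Section 3, a minimal sketch may help make the component-wise arithmetic of eqn. (4) concrete. The kernel below assumes that the forward FFTs of $w^1$ and $w^2$ have been computed, and that the constant numerator term $(\lambda/\beta)\mathcal{F}(K)^* \circ \mathcal{F}(y)$ and the (real-valued) denominator have been precomputed, as in Algorithm 1; the names are illustrative and this is not the exact "compute_out" kernel referred to later.

    #include <cuComplex.h>

    // Sketch of the per-element update in eqn. (4). Inputs are the FFTs of w1 and
    // w2, the conjugated filter spectra, the precomputed constant numerator term,
    // and the precomputed real-valued denominator; the output is F(x), which is
    // then passed to an inverse FFT. All names are illustrative.
    __global__ void fourier_update(const cuFloatComplex* Fw1, const cuFloatComplex* Fw2,
                                   const cuFloatComplex* F1conj, const cuFloatComplex* F2conj,
                                   const cuFloatComplex* numConst, const float* denom,
                                   cuFloatComplex* Fx, int n)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n) {
            cuFloatComplex num = cuCaddf(cuCmulf(F1conj[i], Fw1[i]),
                                         cuCmulf(F2conj[i], Fw2[i]));
            num = cuCaddf(num, numConst[i]);                 // + (lambda/beta) F(K)* o F(y)
            Fx[i] = make_cuFloatComplex(cuCrealf(num) / denom[i],
                                        cuCimagf(num) / denom[i]);
        }
    }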
2.2 Solving the w sub-problem
The novelty in the Krishnan-Fergus paper stems from this step. Given a fixed x, finding the optimal w requires solving 2N independent 1D problems of the form:

$$w^* = \arg\min_{w} \left( |w|^{\alpha} + \frac{\beta}{2}\,(w - v)^2 \right) \qquad (5)$$
where $v \equiv F_i^j x$. For the special case when $\alpha = (d-1)/d$, $d \geq 2$, the solution to the above minimization problem can be converted into finding the real roots of a univariate polynomial of degree $(d+1)$. The polynomial is of the form:

$$w\,(w - v)^d + (-\mathrm{sgn}(v))^{d-1} \left( \frac{\alpha}{\beta} \right)^{d} = 0 \qquad (6)$$
We are interested in the largest real root $w^*$ of the polynomial such that

$$w^* \in \begin{cases} \left[ v, \dfrac{2v}{d+1} \right], & \text{if } v < 0 \\[2mm] \left[ \dfrac{2v}{d+1}, v \right], & \text{if } v > 0 \end{cases}$$
Empirically, it has been found that the normal range for α for natural images is between 0.5 and 0.8. This implies that there are only 4 values of d (2, 3, 4 and 5) that satisfy the above requirement. As described in their paper, Krishnan and Fergus resort to a pre-computed LUT to solve this part of the sub-problem efficiently. The range of values for d limits the degree of the polynomial to between 3 and 6. For several values of β (from 1 to 256 in multiples of 2√2) and v (200,000 values uniformly in the range [-10, 10]), we determine the optimal w using an efficient and robust root solver described in [6]. For the sake of completeness, Algorithm 1 below reviews the main steps involved in the Krishnan-Fergus algorithm.
Algorithm 1: Fast image deconvolution using hyper-Laplacian priors
Input: Blurred image y; blur kernel k; regularization weight λ; prior exponent 0.5 ≤ α ≤ 0.8;
       β regime parameters: β_0, β_inc, β_max; number of inner iterations T
Output: Deblurred image x
    β = β_0; x = y;
    Precompute constant terms in eqn. (4).
    Generate a LUT by solving eqn. (6).
    while β < β_max do
        for i = 1 to T do
            Given x, solve eqn. (5) for all pixels using the LUT to obtain w
            Given w, solve eqn. (4) to obtain x
        end for
        β = β_inc · β
    end while
    return x
3. GPU IMPLEMENTATION
The nature of the Krishnan-Fergus deconvolution algorithm lends itself nicely to a GPU implementation. The first sub-problem in the alternating scheme is to solve for w, which, as previously discussed, can be approximated very quickly by
using a lookup table. The CUDA memory hierarchy provides efficient means for implementing such a table, including
the constant, texture, and global memory spaces. The second sub-problem, which solves for x, can be found directly in
the Fourier domain using only three FFTs per iteration of the algorithm. Nvidia provides an implementation of the Fast
Fourier Transform in their CUFFT library, which includes several parallel algorithms each with different performance
and accuracy characteristics. Based on these observations and previous experience with CUDA, we believed that
mapping the Krishnan-Fergus algorithm onto the GPU could lead to significantly faster running times.
Our initial CPU-based implementation of the Krishnan-Fergus algorithm was on a system with dual quad-cores and the
running time was almost two seconds for a relatively small image (710x470 pixels). As we ported the algorithm over to
the GPU using CUDA, we utilized many performance optimizations to achieve significant gains in running time. We
have organized these optimizations into four categories and present them below in sections 3.3 - 3.6. In the next two
sections, we briefly discuss an important tool for identifying bottlenecks, as well as the CUFFT library.
3.1 CUDA Visual Profiler
Before discussing our optimizations, we want to quickly emphasize the importance of profiling the CUDA code in order
to identify bottlenecks and inefficient code. Throughout our effort to port the algorithm to the GPU, we routinely made
use of the CUDA Visual Profiler (CVP), which provides a highly detailed report on the number of times operations
occurred, together with the amount of time they took to complete. Many other statistics are also included on a per-kernel
basis, such as the occupancy, registers used, and number of instructions issued. In particular, the GPU Time Height Plot,
shown in figure 1, illustrates with a bar chart how long each kernel and memory copy operation took to complete,
thereby quickly highlighting which operations should be optimized first. In this snapshot, the second, third, and last bars
show the time it takes to copy data to and from the GPU. The remaining bars, which are all roughly the same
magnitude, highlight the running times for all of the kernels in our implementation. After the I/O operations, the next
highest bars are all from kernels executed by the CUFFT library, and thus outside of our control. The next kernel for us
to optimize would be “compute_out” which is called once per iteration of the algorithm, and whose purpose is to
perform the final computations when solving for x. In the left column of the figure, the number of times (i.e. sessions)
that we executed the profiler and saved the output (up to this point) is shown to be 23. CVP is a great help in identifying
your bottlenecks and illustrating any performance improvements that result from code changes. It should be used often.
Figure 1: The CUDA Visual Profiler time height plot. The second, third, and last bars plot the time
(microseconds) to copy the data to and from the GPU. All other bars represent each of the kernels executed.
Figure 2: The CUDA Visual Profiler time summary plot. After padding the images to be a power of two, the
four most time consuming kernels are all within the CUFFT library, and occupy 47% of the total time.
3.2 CUFFT library
The CUFFT library is Nvidia’s parallel implementation of the Fast Fourier Transform on the GPU. The library contains
several algorithms for the FFT, each with different accuracy and performance capabilities, based upon the size of the input
data. Although the CUFFT library supports single and double precision, real and complex values, and forward and
backward transformations, it does have a few limitations. Two and three-dimensional transforms must contain fewer
than 16384 elements in each dimension, and the limit for one-dimensional transforms is about 8 million elements, which
equates to an image that is roughly 2800x2800 pixels. Based on some initial experiments, 1D transforms were slightly
faster than 2D transforms, and thus were used in our implementation of the Krishnan-Fergus algorithm.
In our initial implementation, we simply called the FFT routines with data whose size was based upon the original input
image, e.g., 710 pixels wide by 470 pixels high. Because the CUFFT library is optimized for transform sizes that are
products of powers of small prime factors, e.g. powers of two, we then tried padding the original image (and necessarily all
associated blur kernels and filters), to be the next largest power of two in each dimension. In the example above, the
new padded image was 1024 pixels wide by 512 pixels high, an increase of over 57% in the total number of pixels.
However, the running time of the entire Krishnan-Fergus algorithm was then cut by almost 80%. The size of the input
data clearly has a significant impact on which internal algorithm is used in the CUFFT library. We explore this fact
further in section 4.2.
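The sketch below illustrates how the padded sizes feed into CUFFT, using the batched 1D complex-to-complex interface for the rows of a padded image; the column pass (e.g. after a transpose) that completes the 2D transform is omitted, and the buffer is assumed to already hold the padded data.

    #include <cufft.h>

    // Round a dimension up to the next power of two (our initial padding rule;
    // Section 4.2 replaces this with a small table of "nice" sizes).
    static int nextPow2(int n)
    {
        int p = 1;
        while (p < n) p <<= 1;
        return p;
    }

    // Forward-transform the rows of a padded image with a batched 1D plan.
    void forward_fft_rows(cufftComplex* d_padded, int width, int height)
    {
        int padW = nextPow2(width);                            // e.g. 710 -> 1024
        int padH = nextPow2(height);                           // e.g. 470 -> 512

        cufftHandle plan;
        cufftPlan1d(&plan, padW, CUFFT_C2C, padH);             // one 1D FFT per row
        cufftExecC2C(plan, d_padded, d_padded, CUFFT_FORWARD); // in place
        cufftDestroy(plan);
    }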
Figure 2 is another snapshot of the CVP that shows the time summary plot for the kernels and memory transfers to and
from the GPU. In parentheses next to the name of each kernel is the number of times it was called. Even after padding
our data to be a power of two in each dimension, the CUFFT kernels, whose names all begin with spRadix, are still the
four most expensive kernels in our implementation and they sum to over 47% of the current running time. There are
several alternative implementations of FFT on the GPU [7,8,9] which could potentially lead to faster running times for
our algorithm, but we have not yet experimented with them.
3.3 Memory transfers
Transferring data between the host machine and the GPU is often one of the biggest bottlenecks for CUDA
implementations, and this algorithm was no exception. As can be seen in figure 1, the two longest operations are when
we copy the lookup table from the CPU to the GPU (second bar), and again when we copy the final image back from the
GPU (last bar). What this figure does not show are the other numerous spikes related to data movement that we were
able to reduce or eliminate by being more careful about memory transfers. These optimizations fall roughly into two
categories: minimizing the amount of data transferred, and making the transfers faster.
One of the most important decisions to make when implementing any algorithm for the GPU is whether to use single or
double precision floating point variables. It is easy to fall into the mindset of using double precision for greatest
accuracy, but one really must evaluate if this is absolutely necessary when programming for the GPU. One reason for
this is that Nvidia GPUs' double-precision performance is only 1/8th that of single precision (until just recently, when
the Fermi architecture was released, at which point the double precision performance increased to one half that of single
precision, as one might expect). Another reason is the increase in all GPU resources needed for double precision:
registers, on-chip memory, on-device memory, and bandwidth required to transfer the data to and from the GPU.
Our initial lookup table was over 100MB in size, and included four values of alpha and 17 values of beta. Each entry
was also stored as a double-precision floating-point value. Upon more careful inspection of the output images when
using single versus double precision variables, we realized that the differences were imperceptible (on the order of 1 or 2
units difference out of 255 for the red, green, or blue components of the image) and therefore single precision would be
sufficient for our needs. This issue needs to be evaluated for any problem domain, but for image data, a difference of this small magnitude will not be noticeable to the end-user, whereas the difference in running times will be. In
addition to using floats, we also limited the values of alpha and beta to be exactly what was required: we could limit
alpha to only one value, and beta to only six values. In the end, our lookup table was reduced to only 4MB.
As discussed in the CUFFT section, we decided to pad our images, blur kernels, and image filters to all be a power of
two in each dimension. Initially we did this on the CPU and then transferred the data to the GPU. However, by using
CVP, we immediately realized how much time this was taking and modified our approach. In our current version of the
code, we only transfer the bare minimum data required, and then pad the data on the GPU itself using very simple
kernels. This provided a nice reduction in data movement for the input image, but a more significant savings occurred
for the blur kernel and image filters, which are much smaller, e.g. the Gaussian blur kernel is only 13x13 pixels. The
reduction in times to copy and pad the image, blur kernels, and image filters ranged from 50% to as much as 95%.
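The padding kernels themselves are trivial; the following is a sketch of the kind of kernel we mean, in which each thread copies one source pixel into a larger, pre-zeroed (cudaMemset) destination buffer at a given offset.

    // Copy an unpadded image into a zero-initialized padded buffer on the device.
    __global__ void pad_kernel(const float* src, int srcW, int srcH,
                               float* dst, int dstW, int offX, int offY)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < srcW && y < srcH)
            dst[(y + offY) * dstW + (x + offX)] = src[y * srcW + x];
    }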
In addition to limiting the amount of data to transfer to and from the GPU, there are also several standard optimizations
that help make the transfers faster. One of the optimizations we used is page-locked memory on the host, which
prevents the operating system from paging the memory out, and therefore the GPU’s access to that data over PCI-Express can be at full speed. Another well-known optimization that we used is that of asynchronous memory copies.
Rather than waiting for the lookup table to finish being copied to the GPU, we can use Nvidia’s asynchronous API
which allows kernels to execute in parallel with the memory copy. Thus, we can run our simple padding kernels while the lookup table is in flight, hiding the latency of the memory copy.
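Putting the two together, the transfer pattern looks roughly like the sketch below (buffer names and launch configuration are illustrative): the lookup table is copied asynchronously from page-locked host memory in one stream while the padding kernel sketched above runs in another.

    #include <cuda_runtime.h>

    void upload_lut_and_pad(float* d_lut, size_t lutBytes,
                            const float* d_srcImage, int srcW, int srcH,
                            float* d_padImage, int padW)
    {
        float* h_lut;
        cudaMallocHost((void**)&h_lut, lutBytes);   // page-locked, so the copy below is truly asynchronous
        /* ... fill h_lut on the CPU ... */

        cudaStream_t copyStream, padStream;
        cudaStreamCreate(&copyStream);
        cudaStreamCreate(&padStream);

        cudaMemcpyAsync(d_lut, h_lut, lutBytes, cudaMemcpyHostToDevice, copyStream);

        dim3 block(16, 16);
        dim3 grid((srcW + 15) / 16, (srcH + 15) / 16);
        pad_kernel<<<grid, block, 0, padStream>>>(d_srcImage, srcW, srcH,
                                                  d_padImage, padW, 0, 0);

        cudaDeviceSynchronize();                    // both streams finish before the solver starts
        cudaStreamDestroy(copyStream);
        cudaStreamDestroy(padStream);
        cudaFreeHost(h_lut);
    }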
3.4 Memory access
Although each GPU multiprocessor has shared memory and 32-bit registers on the chip itself, many data accesses that
the kernels make are to the global memory on the graphics card. Global memory is not cached (except on compute capability 2.x devices), and these accesses have a much longer latency than those to on-chip memory, so great care
must be taken to fully optimize them. Nvidia’s CUDA C Best Practices Guide goes into depth on how to make global
accesses coalesced, or aligned, and claims this may be the single most important performance consideration when
programming for the CUDA architecture [10]. We agree with this assertion and note that this optimization has become
commonplace when writing CUDA code. To verify that each of our kernels was making aligned accesses, we again
used CVP to see how many 32-, 64-, and 128-byte global loads and stores were made. By analyzing these numbers, we
could optimize our code and reduce the number of 32- and 64-byte accesses, thereby increasing our overall performance.
As global memory is not cached, it is important to know about the texture and constant memory spaces available in the
CUDA architecture as well. These memories are cached and are ideally suited for memory that is going to be loaded
once and then accessed in a read-only fashion. Our lookup table falls perfectly into this scenario. Our original
implementation used global memory to store the lookup table, but by accessing the data using a 1D texture, the
“solve_image” kernel’s running time was cut by 50%. As this kernel is called twice per iteration of the algorithm (12
times in all), the improvement in this one kernel reduced the total running time by 5%.
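The sketch below shows the texture path for the LUT using the texture reference API available on compute 1.x devices; the index computation is a placeholder that assumes the layout described in Section 2.2 (200,000 samples of v in [-10, 10] per value of β), which is not exactly how our table is organized.

    // Read-only LUT access through the cached 1D texture path.
    texture<float, 1, cudaReadModeElementType> lutTex;

    __device__ int lutIndex(float v, int betaStep)
    {
        // Hypothetical layout: 200,000 uniform samples of v in [-10, 10] per beta.
        float t = (v + 10.0f) * (200000.0f / 20.0f);
        int i = min(max(__float2int_rn(t), 0), 199999);
        return betaStep * 200000 + i;
    }

    __global__ void solve_w(const float* v, float* w, int betaStep, int n)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if (i < n)
            w[i] = tex1Dfetch(lutTex, lutIndex(v[i], betaStep));  // cached fetch
    }

    // Host side: bind the device LUT once, after it has been copied over.
    void bind_lut_texture(const float* d_lut, size_t lutBytes)
    {
        cudaBindTexture(NULL, lutTex, d_lut, lutBytes);
    }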
3.5 Parallel execution
Running algorithms in parallel is the core of GPU computing. Serial operations can probably best be executed on the
CPU, while parallel portions can usually be efficiently mapped to the GPU, provided some care is taken to maximize the
parallelism given its fixed resources. We have already discussed one standard optimization technique regarding
parallelism and that is overlapping memory transfers between the host machine and the GPU device with other work.
These asynchronous copies allow kernels to be executing on data already loaded onto the device, while new data is
copied over to it, effectively hiding the latency involved in the data copy.
For GPU devices with CUDA compute capability 1.x, only one kernel can be executed at a time, although the kernels are
run in multiple thread blocks. To keep the multiprocessors busy on the GPU, there should be at least as many thread
blocks as there are multiprocessors, but ideally there will be more in order to hide the latency that exists when a warp (a
grouping of 32 threads) is not ready to execute its next instruction, e.g. if it must wait for data to be loaded from global
memory or some other operations to complete. When one warp stalls, another warp can be quickly scheduled to execute
and effectively hide the latency of the memory access.
In our implementation, we have maximized the number of warps and thread blocks that are running on the GPU by
carefully managing the limited GPU resources, including registers and shared memory that are common to each thread
block. By carefully rewriting our GPU kernels, and in some cases subdividing them into smaller kernels, we were able
to reduce the number of registers and shared memory needed, thereby increasing the number of blocks that could be
executed in parallel. One immediate benefit of using floats instead of doubles is that the number of registers used was
cut in half. Another means of reducing the number of registers was by eliminating temporary variables that were
convenient when writing the code, but not strictly necessary for correct execution. For example, by reordering some of
the computations in one kernel, we could avoid several temporary variables that increased our register use.
To determine how many registers and how much shared memory were used by each kernel, we ran our code through the
CVP, which automatically reports these statistics. We could then plug this information into the CUDA Occupancy
Calculator to determine where the bottlenecks were and how we could best increase the amount of parallelism in the
kernels. The Occupancy Calculator is really just a simple spreadsheet that computes this information for you once you
have entered only four pieces of information: CUDA compute capability of the GPU, number of threads per block,
registers needed per thread, and shared memory needed per thread. Only the first item is completely determined by the
graphics hardware you have, while the three remaining items can be influenced by the user and the implementation.
Occupancy is loosely defined as the ratio of the number of warps running concurrently on a multiprocessor to the maximum number of warps that could be running concurrently based on the hardware limits of the GPU. Having an occupancy of 1 (or 100%) is the best case, but does not necessarily translate to improved performance over lower ratios. However, a low occupancy may prevent the multiprocessor from hiding latency adequately for memory-intensive kernels.
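As a concrete example of the arithmetic behind the Occupancy Calculator, the small host-side program below reproduces it for a compute 1.3 device such as the GTX 260 (per-multiprocessor limits: 16384 registers, 16 KB shared memory, 8 blocks, 32 warps), ignoring the register-allocation granularity that the real spreadsheet also accounts for; the per-kernel figures are illustrative.

    #include <stdio.h>

    int main(void)
    {
        const int maxRegs = 16384, maxSmem = 16384, maxBlocks = 8, maxWarps = 32;  // compute 1.3 limits
        int threadsPerBlock = 256, regsPerThread = 16, smemPerBlock = 2048;        // example kernel

        int warpsPerBlock = threadsPerBlock / 32;
        int blocks = maxRegs / (regsPerThread * threadsPerBlock);                  // register limit
        int bySmem = maxSmem / smemPerBlock;                                       // shared memory limit
        if (bySmem < blocks) blocks = bySmem;
        if (blocks > maxBlocks) blocks = maxBlocks;
        if (blocks * warpsPerBlock > maxWarps) blocks = maxWarps / warpsPerBlock;

        // 256 threads, 16 registers/thread, 2 KB shared memory -> 4 blocks, 32 warps, 100%
        printf("occupancy = %.0f%%\n", 100.0 * blocks * warpsPerBlock / maxWarps);
        return 0;
    }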
3.6 Instruction throughput
Other optimizations to CUDA code involve trying to minimize the number of cycles required to get the same work done.
In other words, by reducing the number of cycles needed, the GPU can process more data in the same amount of time,
thereby increasing the amount of data processed per instruction, i.e. the instruction throughput. We have utilized these
optimizations in several forms, such as minimizing slow arithmetic instructions, minimizing divergent warps, and
avoiding unnecessary type conversions.
Arithmetic instructions can vary significantly in the number of operations per clock cycle on the GPU. For CUDA
compute capability 1.x, eight 32-bit floating-point adds can occur in each clock cycle, but only one 64-bit floating-point
add [11]. 32-bit integer multiplications, on the other hand, require several instructions by themselves, so are very inefficient by comparison. However, 24-bit integer multiplications (e.g. __[u]mul24) can again be done eight times per clock cycle, and therefore should be utilized when appropriate. For example, when computing the index of a specific thread in a kernel, one could replace the standard int index = blockDim.x * blockIdx.x + threadIdx.x with the following: int index = __umul24(blockDim.x, blockIdx.x) + threadIdx.x. Another significant hit is using division or
modulo inside the kernel. For many of our kernels, since they involve two-dimensional images implemented as one-dimensional arrays, we are often dividing the index by the number of columns to determine which row of the image we
are going to operate on. Rather than doing this repeatedly inside the kernels, we could instead multiply by the reciprocal
of the number of columns, which is fixed for a given image and can be computed once and stored in constant memory.
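A sketch of these two substitutions is shown below; invNumCols is assumed to be set once by the host (via cudaMemcpyToSymbol) to 1.0f / numCols, and in practice the reciprocal must be checked for rounding error at exact multiples of the column count.

    __constant__ float invNumCols;   // 1.0f / numCols, set once from the host

    __global__ void pixel_rows(int* rowOfPixel, int n)
    {
        int index = __umul24(blockDim.x, blockIdx.x) + threadIdx.x;  // 24-bit multiply
        if (index < n)
            rowOfPixel[index] = (int)(index * invNumCols);           // replaces index / numCols
    }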
Divergence refers to the situation when threads in a single warp take different code paths due to a branch condition, e.g.
if and switch statements. As a result, the different execution paths must be serialized, thereby reducing the amount of
parallelism achieved. For all of our kernels, we made great effort to eliminate divergence within warps. For example,
by carefully choosing the number of threads per thread block, and by padding our images to be of sizes that are a power
of two, we could guarantee that every thread in every thread block was assigned the work associated with one of the
pixels in the image. Thus, we could immediately remove all of the if-statements that check boundary conditions (e.g. if (index < num_pixels)) and cause divergence within warps.
We have already spoken at some length about using floats instead of doubles; however, care must be taken to avoid all implicit conversions to doubles. By using the --keep compiler option and examining the temporary files, we were able to find several instances of literal constants that were being converted to doubles automatically. To avoid this, simply adding an “f” after the literal prevents the conversion, e.g. 1.0f. Another good lesson was to remove the sm_13 compiler directive; this prevents the compiler from using doubles at all, so warning messages are issued at compile time whenever a double is demoted to a float. Such messages allow you to easily find and fix those problems. Another optimization is to use the single-precision functions when appropriate. For example, using rintf() to round a floating point variable to an integer requires only one instruction, whereas roundf() requires eight instructions. There are many other such examples of this as well, including truncf(), ceilf(), and floorf(). Charts detailing these instruction counts can be found in the NVIDIA CUDA C Programming Guide [11].
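As a small illustration of these last two points, the kernel below keeps every constant suffixed with “f” and uses rintf() (round-to-nearest-even) rather than roundf(); values and names are purely illustrative.

    __global__ void scale_and_round(const float* in, float* out, int n)
    {
        int i = __umul24(blockDim.x, blockIdx.x) + threadIdx.x;
        if (i < n) {
            float v = in[i] * 0.5f;       // 0.5 without the suffix would be a double constant
            out[i] = rintf(v * 255.0f);   // one instruction; roundf() would take eight
        }
    }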
4. EXPERIMENTAL RESULTS
Our experiments were conducted on a dual quad-core (Intel Xeon E5506 @ 2.13GHz) 64-bit Linux workstation, with
12GB of RAM and an Nvidia GeForce GTX 260 graphics adapter. The GTX 260 has 896MB of global memory, 27
multiprocessors (216 cores total), and CUDA compute capability 1.3. The CUDA driver at the time was version 3.0.
4.1 Running Times
Figures 8 and 9 are examples of the results of our implementation of the Krishnan-Fergus algorithm for the GPU. The 3-channel color images in figure 8 are of size 710 x 470 pixels, and were processed in 0.066 seconds (i.e. the equivalent of
over 15 frames per second (fps)). The 1-channel version of this image required 0.024 seconds (about 41 fps), roughly
one-third of the time for the RGB image. The time is not exactly one-third due to the fixed start-up costs associated with
the implementation, such as copying the lookup table, kernels, and filters to the GPU, which only happen once
regardless of the number of channels (or images) being deblurred. These times indicate that our algorithm was
processing 4.8 Mpel/sec for 3-channel images, and 13 Mpel/sec for 1-channel images. Similarly, the one- and three-channel deblurring times for the images in figure 9 are respectively 0.041 seconds and 0.116 seconds (or the equivalent
of 24.4 fps and 8.6 fps). These images are slightly larger than those in figure 8 at 768 x 512 pixels, and achieved a
slightly lower throughput of 3.2 Mpel/sec for the 3-channel image, and 9.1 Mpel/sec for the 1-channel image. This
reduction is due to the larger padded size of this image (1024x1024) compared to the one in figure 8 (1024x512).
Although the number of rows in the image in figure 9 is already a power of two (i.e. 512), we have to pad all images by
half the kernel size along each border, followed by the additional padding up to the next larger power of two. Thus we
pad this image by 6 pixels around the border (increasing the width and height of each image by 12 pixels), which then
forces us to pad up to 1024 in each dimension to achieve our desired size for the CUFFT library.
Figure 3: Running time of our GPU implementation of the Krishnan-Fergus algorithm.
We plot the number of megapixels in each image versus the running time.
Figure 3 shows the running time of our GPU implementation on over 40 (3-channel) images of various sizes. Upon
initial inspection, the running times appear to increase linearly with respect to the number of megapixels in the image,
and in fact, the equation of the linear regression line was calculated to be f(x) = 0.314x + 0.0019, with r² = 0.92. This
implies that our implementation can process over 3 Mpel per second for 3-channel images. On closer inspection
however, we noticed a staircase pattern in which distinct jumps in running times were clearly evident as the image sizes
increased. This effect is an artifact of our padding the size of the images up to the next highest power of two for
performance considerations related to the CUFFT library. We explore this behavior further in the following section.
4.2 CUFFT Performance
As noted in section 3.2, by padding our image sizes to be the next largest power of two, the overall running time of the
algorithm was significantly reduced. This improvement was a direct consequence of the reduction in time taken within
the CUFFT library. Documentation on the library [12] indicates that performance is best when the size of the
transformation is a product of powers of small prime factors, e.g. 2, 3, 5, 7. While our initial optimization based on powers of two
produced significant improvements, they also seemed excessive for images that were already of sizes that were powers
of two because we would have to quadruple the number of pixels processed (essentially doubling the size in each
dimension) to get the performance gain. We asked ourselves whether we could find smaller sizes that also factor easily into small primes and that would offer similar performance gains.
We conducted an experiment in which we started with an image that was 512 x 512 pixels in size and then padded it by
12 pixels in each dimension (as required by our implementation for processing the image) to produce an image that was
524 x 524. Next, we padded the image to be n x n pixels, where n ranged from 525 all the way up to 2750 pixels. For
each padded image, we ran our algorithm to determine the overall running time. The goal was to determine which sizes
worked best with the CUFFT library and thus produced the best overall running times. Figure 4 (left) shows the plot of
the padded image size versus running time for all values. Although the figure is quite cluttered, the overall trends are
illustrated. Figure 4 (right) shows only the first 100 entries of that plot to highlight more clearly the behavior as we
increase the dimensions of the image. The running times appear to jump all over the place, but there is a clear pattern: the longest times occur exactly when the padded size is a prime number, such as 541, 547, 557, 563, etc. If you ignore those times, you will see another set of values that are higher than the rest and also appear on a monotonically increasing curve. These times correspond to padded sizes that are equal to a small prime factor, e.g. 2, 3, 5, multiplied by a large prime number. For example, the first four such entries are 526 (2 * 263), 537 (3 * 179), 538 (2 * 269), and 542 (2 * 271). The most important information in these plots, however, is the fastest running times, since they allow us to optimize our padded sizes for all images and thereby produce the fastest results.
Figure 4: Line plots showing padded image size versus running time. The plot on the right shows
a zoomed view of the first 100 entries in the left plot. Prime numbers are clearly inefficient.
To find these values, we sorted the results (pairs of sizes and running times) according to running times and then only
kept those sizes that were monotonically increasing. For example, the six fastest running times for the sizes reflected in
figure 4 were 540, 576, 600, 528, 546, and 625. If we only keep the monotonically increasing values, we get 540, 576,
600, and 625. The reason for this is that it makes no sense to ever pad an image up to size 528 x 528 because we found
that padding to 540 x 540 produces even faster running times. For this example image, the running times for these four
optimized sizes are 0.0772, 0.0843, 0.0849, and 0.0963 seconds. Again, this is attributed to the fact that these numbers
are nicely factored into small prime numbers, which allows the CUFFT library to use the most efficient kernels. For this
experiment, after sorting and keeping the monotonically increasing sizes, we obtained a lookup table that contains 24
entries: 540, 576, 600, 625, 648, 720, 729, 1024, 1080, 1125, 1296, 1323, 1536, 2048, 2058, 2160, 2304, 2400, 2500,
2592, 2625, 2646, 2700, and 2744. As you can see, there are several sizes that are faster than just padding to 1024, as
our original algorithm would do. Our new algorithm for deciding what sizes to pad images up to is based on this lookup table: if the original image size is less than or equal to 540, pad to 540; otherwise, scan the remaining entries in increasing order and pad to the first entry that is at least as large as the image size. For images that are even larger than 2750, or smaller than 512, we simply used the next larger power of two. However, 2750 appears to be the largest image that we can process with our algorithm on the GTX 260 without running out of graphics memory.
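In code, the selection rule reads roughly as follows (a sketch of our reading of the rule above; sizes below 512 or beyond the table fall back to the next power of two).

    // Choose the padded dimension for the CUFFT transforms from the 24-entry table.
    static int choose_padded_size(int n)
    {
        static const int table[24] = {
             540,  576,  600,  625,  648,  720,  729, 1024,
            1080, 1125, 1296, 1323, 1536, 2048, 2058, 2160,
            2304, 2400, 2500, 2592, 2625, 2646, 2700, 2744
        };
        if (n >= 512) {
            for (int i = 0; i < 24; ++i)
                if (n <= table[i]) return table[i];   // first entry at least as large as n
        }
        int p = 1;                                    // fallback: next power of two
        while (p < n) p <<= 1;
        return p;
    }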
To test our new algorithm, we conducted another experiment in which we resized a single image (originally 1000 x 1000
pixels) to be of size n x n pixels, where n ranges from 1 to 2000. For each resized image, we ran our software and
recorded the overall running time for our original algorithm, shown in figure 5, as well as our new algorithm, shown in
figure 6. The staircase effect that we described earlier is clearly evident in figure 5 whenever we hit a power of two.
The inset image highlights the behavior for the smallest of values, which are not discernible in the larger plot. The
results from our new algorithm in figure 6 also exhibit the staircase behavior, but at much finer levels, which allows our
running times to be significantly reduced for many image sizes. The inset image again shows a zoomed view of the
smallest image sizes, although we have truncated the data to start at 500 pixels, since the algorithms are the same up
until this point. One thing to note is that these results are for square images. When the width and height of the images
are not equal, additional staircase effects will occur. Another interesting point is that our new algorithm supports image
sizes up to 7 Mpel, whereas the original one caused us to run out of graphics memory around 4 Mpel because the padded
sizes simply became too large.
Figure 5: Padded image size plotted against running time for our original algorithm of padding up
to the next largest power of two. The inset image shows a zoomed view of the smallest sizes.
Figure 6: Padded image size plotted against running time for both our original and optimized padded
algorithms. The inset image shows a zoomed view of the smallest sizes, starting at 500 x 500 pixel images.
As a final test, we applied our new padding algorithm to the same input images from figure 3. The new running times,
(solid circles) along with the old values (crosses), are shown in figure 7. In all but a single case, the optimized algorithm
produces times that are as good as, or better than, the original algorithm. In some cases, when the algorithms pad to the
same sizes, there are normal fluctuations in running times that have one or the other slightly faster. We discount these
discrepancies and consider the results equivalent. In other cases, the new algorithm results in a reduction in time of
between 10% and 40%. The one significant exception is for an image that is of size 1600 x 1200 pixels. Based on the original algorithm, the padded size is 2048 x 2048, or 2^22 pixels, but the new algorithm pads it to 2048 x 1296, which can be written 2^15 * 3^4. In this example, the new running time is almost 15% slower than the original. The newest documentation on the CUFFT library does provide some new guidelines as to what sizes will be best for their implementation. In fact, they comment specifically on this type of example, where two candidate sizes are close but the larger is a power of a single prime factor while the smaller is a product of two prime factors. Further experimentation
could shed more light on these optimal sizes, but it would require an exhaustive test of all possible image dimensions.
5. CONCLUSIONS
We have presented our implementation of the Krishnan-Fergus deconvolution algorithm using Nvidia’s CUDA
architecture. The algorithm requires efficient solutions to two sub-problems, each of which map well onto the GPU.
One sub-problem can be approximately solved using a lookup-table, which maps perfectly to the cached texture
memory. The other sub-problem can be solved using FFTs, which can be efficiently computed in parallel thanks to the
CUFFT library provided by Nvidia. Programming for the GPU is a non-trivial task, so we have carefully described the numerous optimizations employed to achieve these interactive running times. We have also made several observations regarding the performance of the CUFFT library, and how, by carefully selecting the transform sizes, one can see
significant improvements. For 3-channel color images, our GPU implementation can process over 3 megapixels per
second (often more), which is over 27 times faster than the original CPU implementation by Krishnan and Fergus.
However, our GPU implementation is not without limitation. Currently we can process images up to around 2700 x 2700
pixels, after which we run out of graphics memory on the GTX 260 (which has 896MB). For newer graphics cards, with
additional memory, we could push this limit higher, but we could also achieve this by being more judicious in our
memory usage on the graphics card.
Figure 7: Running time of our GPU implementation of the Krishnan-Fergus algorithm using our
new padding algorithm. In all but two cases, the new algorithm results in faster running times.
REFERENCES
[1] Krishnan, D. and Fergus, R., “Fast Image Deconvolution using Hyper-Laplacian Priors,” NIPS, 1033-1041 (2009).
[2] Domanski, L., Vallotton, P. and Wang, D., “Two and Three-Dimensional Image Deconvolution on Graphics
Hardware”, MODSIM09 (2009).
[3] Richardson, W. H., “Bayesian-Based Iterative Method of Image Restoration”, Journal of the Optical Society of
America (JOSA), 62 (1):55–59 (1972).
[4] Lucy, L. B., “An iterative technique for the rectification of observed distributions”, Astronomical Journal, 79 (6):
745–754 (1974).
[5] Geman, D. and Yang, C., “Nonlinear image recovery with half-quadratic regularization”, IEEE PAMI, 4:932–946
(1995).
[6] Krishnan, S., Foskey, M., Culver, T., Keyser, J. and Manocha, D., “PRECISE: Efficient Multiprecision Evaluation
of Algebraic Roots and Predicates for Reliable Geometric Computations”, Proceedings of Seventeenth Annual
ACM Symposium on Computational Geometry, 274-283 (2001).
[7] Govindaraju, N., Lloyd, B., Dotsenko, Y., Smith, B., and Manferdelli, J., “High Performance Discrete Fourier
Transforms on Graphics Processors”, ACM/IEEE Supercomputing (2008).
[8] Nukada, A. and Matsuoka, S., “Auto-Tuning 3-D FFT Library for CUDA GPUs”, ACM/IEEE Supercomputing
(2009).
[9] Gu, L., Li, X., and Siegel, J., “An Empirically Tuned 2D and 3D FFT Library on CUDA GPU”, 24th International
Conference on Supercomputing (2010).
[10] NVIDIA CUDA™: CUDA C Best Practices Guide, version 3.2, Aug 2010.
[11] NVIDIA CUDA™: CUDA C Programming Guide, version 3.2, Oct 2010.
[12] NVIDIA CUDA™: CUDA CUFFT Library, version PG-05327-032_V02, Aug 2010.
Figure 8: Original image (top) and deblurred image (bottom). The original image size is preserved
to prevent artifacts due to resizing.
Figure 9: Original image (top) and deblurred image (bottom). The original image size is preserved
to prevent artifacts due to resizing.