GPU Accelerated High Quality Video/Image Super Resolution
Zhangzong Zhao, Li Song, Rong Xie, Xiaokang Yang
Shanghai Jiao Tong University

Outline
• Background
• The Super-Resolution convolutional neural network
• GPU optimization of the SRCNN algorithm
• Experiments
• Conclusion & Further work

The motivation: Blu-ray (HD) to UHD (4K)
• Even though 4K/UHD TVs are ready now, native 4K contents are few.
• There is a complete lack of available 4K/UHD content in the short term.
• Thus, solutions converting HD content to high-quality 4K are needed.

The Background of SR
• State-of-the-art SR methods not only increase the resolution of the input image, but also enhance perceptual quality by adding more details or high-frequency signals.
• Example-based SR methods learn the LR-HR mapping relationship from thousands of patch pairs.
• By using the latest machine learning approaches over massive datasets, this methodology has witnessed significant progress in the past years.
[Figure: detail comparison among the input, the SR frame, and the "original" frame]

SRCNN Examples
[Figures: side-by-side Bicubic vs. SRCNN results on two test images]

Which algorithm do we choose?
• The criteria: better quality and high speed.
• In general, quality and speed are contradictory.
[Figure: the quality vs. speed trade-off]

The Top SR algorithms - quality
• The top-3 SR algorithms, including CSC, CNN, and A+, have better restoration performance, i.e., better quality (PSNR).
S. Gu, W. Zuo, Q. Xie, D. Meng, X. Feng, L. Zhang, Convolutional Sparse Coding for Image Super-resolution, ICCV 2015.

The Top SR algorithms - speed
• The top-3 SR algorithms, including CSC, CNN, and A+, have very different running times.
• In all, CNN and A+ are the best ones in terms of both quality and speed. In this paper, we choose the CNN-based method as our anchor because it is friendly to parallel computing on GPUs.
• It should be noted that the original SRCNN implementation (MATLAB, on CPU) still needs 300 s to convert a 1920x1080 frame to 3840x2160.
S. Gu, W. Zuo, Q. Xie, D. Meng, X. Feng, L. Zhang, Convolutional Sparse Coding for Image Super-resolution, ICCV 2015.

Prototype: SRCNN
• SRCNN: Super-Resolution Convolutional Neural Network
• It uses the popular CNN framework to learn the end-to-end LR-to-HR mapping function F: output = F(input)

The main steps in SRCNN
• Patch extraction and representation (Conv1, W1: 64*9*9*1): F1(Y) = max(0, W1 * Y + B1)
• Non-linear mapping (Conv2, W2: 32*1*1*64): F2(Y) = max(0, W2 * F1(Y) + B2)
• Reconstruction (Conv3, W3: 1*5*5*32): F3(Y) = W3 * F2(Y) + B3
Thus, there are two basic operations in this paper: convolution and the Rectified Linear Unit (ReLU), i.e., max(0, x).

Direct GPU Implementation of Convolution
• SRCNN consists of 3 convolution layers and 2 ReLU (max) layers. Over 95% of the execution time is spent on convolution, so we focus on the GPU optimization of convolution.
• We split the convolution task into millions of mini-tasks (along the width and height of the convolution output). Each convolution output pixel is computed by a unique thread; a minimal kernel sketch is given after the optimization tricks below.
• The GPU is friendly to this kind of parallel image processing: the processing time is accelerated from 300 s/frame (CPU) to 1 s/frame (GPU, direct convolution).

GPU Memory Hierarchy
• To further accelerate our SR method, we need to exploit the GPU memory hierarchy and make full use of it.
• Three types of GPU memory:
  • Global memory: big, but slowest
  • Shared memory: small, but faster
  • Registers: very limited, but fastest
• The key point: load reused data into faster memory.

GPU Optimization Trick 1: Shared Kernel
• The reuse of filter parameters: stage them once per block in shared memory.

GPU Optimization Trick 2: Shared Patch
• The reuse of the input image patch: stage the tile each block needs in shared memory.

GPU Optimization Trick 3: Fusion of Convolution and ReLU (Max) Operations
• Fuse the ReLU into the convolution kernel to avoid a separate pass over the data; the three tricks together are illustrated in the second sketch below.
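To make the direct implementation concrete, here is a minimal CUDA sketch of the one-thread-per-output-pixel scheme, simplified to a single input and output channel (the real SRCNN layers additionally loop over input channels and filters). The kernel name, the clamped border handling, and the launch shape are our illustrative assumptions, not the paper's actual code.

```cuda
// Direct convolution: one thread per output pixel, with both the filter
// and the image read from global memory on every access.
// Single-channel simplification of one SRCNN layer.
__global__ void convDirect(const float* in,  // input image, H x W
                           const float* w,   // K x K filter weights
                           float bias,
                           float* out,       // output image, H x W
                           int H, int W, int K)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // output column
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // output row
    if (x >= W || y >= H) return;

    int r = K / 2;                                  // assume odd K (9, 1, 5)
    float acc = bias;
    for (int i = 0; i < K; ++i)
        for (int j = 0; j < K; ++j) {
            int yy = min(max(y + i - r, 0), H - 1); // clamp at the borders
            int xx = min(max(x + j - r, 0), W - 1);
            acc += w[i * K + j] * in[yy * W + xx];
        }
    out[y * W + x] = acc;  // ReLU runs as a separate pass in the direct version
}
```

A Conv1-like 9x9 launch would then be, e.g., `convDirect<<<dim3((W+15)/16, (H+15)/16), dim3(16,16)>>>(d_in, d_w, b, d_out, H, W, 9);`. Every weight and every pixel is re-fetched from slow global memory, which is exactly the traffic the tricks above remove.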
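And here is a sketch of the optimized version combining the three tricks for one single-channel layer: the filter is staged once per block in shared memory (Trick 1), the input tile plus its halo is staged in shared memory (Trick 2), and the ReLU is fused into the output store (Trick 3). TILE, KMAX, the cooperative loading pattern, and the clamped borders are again assumptions made for illustration; note also that in SRCNN only Conv1 and Conv2 are followed by ReLU, so for Conv3 the fused max would be dropped.

```cuda
#define TILE 16   // launch with dim3(TILE, TILE) thread blocks
#define KMAX 9    // largest SRCNN kernel (Conv1 is 9x9)

__global__ void convSharedFused(const float* in, const float* w, float bias,
                                float* out, int H, int W, int K)
{
    __shared__ float sw[KMAX * KMAX];                            // Trick 1: shared kernel
    __shared__ float sp[(TILE + KMAX - 1) * (TILE + KMAX - 1)];  // Trick 2: shared patch

    int tx = threadIdx.x, ty = threadIdx.y;
    int x0 = blockIdx.x * TILE, y0 = blockIdx.y * TILE;
    int r = K / 2, side = TILE + K - 1;
    int tid = ty * TILE + tx;

    // Cooperatively load the K x K filter once; all threads then reuse it.
    for (int i = tid; i < K * K; i += TILE * TILE)
        sw[i] = w[i];

    // Cooperatively load the input tile plus its halo, clamping at borders.
    for (int i = tid; i < side * side; i += TILE * TILE) {
        int yy = min(max(y0 + i / side - r, 0), H - 1);
        int xx = min(max(x0 + i % side - r, 0), W - 1);
        sp[i] = in[yy * W + xx];
    }
    __syncthreads();

    int x = x0 + tx, y = y0 + ty;
    if (x >= W || y >= H) return;

    float acc = bias;
    for (int i = 0; i < K; ++i)
        for (int j = 0; j < K; ++j)
            acc += sw[i * K + j] * sp[(ty + i) * side + (tx + j)];

    out[y * W + x] = fmaxf(acc, 0.0f);  // Trick 3: fused ReLU, max(0, x)
}
```

For the 1*1 Conv2 there is no spatial reuse to exploit, so the Shared Patch scheme loses there; a Registered Pixel variant, in which each thread keeps its own pixel's channel values in registers, wins instead, as the per-layer table in the Observations slide shows.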
Put them together: Fully GPU Implemented SRCNN
• The SR method should be fully GPU implemented:
  • to accelerate the entire procedure as much as possible
  • to minimize CPU/GPU data transfers
• In the implementation, we use the above tricks in several combinations, such as:
  • Shared Kernel & Patch for Conv3 (1*5*5*32)
  • Shared Kernel & Registered Pixel for Conv2 (32*1*1*64)

Experiments
• The open SJTU 4K video sequences are used to test speed and quality.
• Speed*: the overall execution for 1920x1080 to 3840x2160, single channel, accelerates from 300 s/frame (CPU) to 0.15 s/frame (GPU), a 2000x speed-up.
• Quality: the GPU-executed SR method has exactly the same quality as the CPU one.
* On a workstation with two Intel E5-2697V2 @ 2.7 GHz processors and an Nvidia GTX 980 Ti.

Observations
• The best solution, i.e., the maximum rate of acceleration, comes from an optimized combination that applies different schemes to different convolution layers.
• This is reasonable because each convolution layer has a specific convolution size, and therefore a specific pattern of filter loading, data input, computation, and data output. For a given convolution size, one suitable combination is better than the others.

Per-layer running times:

Method                              Conv1       Conv2        Conv3
CPU                                 13064 ms    306622 ms    5650 ms
Direct                              453 ms      475 ms       86 ms
Shared Kernel                       375 ms      396 ms       56 ms
Shared Kernel & Patch               139 ms      1602 ms      41 ms
Shared Kernel & Registered Pixel    Unable      24 ms        Unable
cuDNN                               56 ms       37 ms        150 ms

Conclusion & Further work
• We are also using GPUs to accelerate other Super-Resolution methods, which offer better quality but have more complex procedures.
• We are working on accelerating other video processing tasks with GPUs, such as frame-rate upscaling, color conversion, and LDR2HDR.
[Figure: HD to UHD cloud video processing with more GPUs: HD → 4K, 25/30p → 50/60p, 8 bit → 10/12 bit, R.709 → R.2020]

Thanks!
song_li@sjtu.edu.cn