GPU Accelerated High Quality
Video/Image Super Resolution
Zhangzong Zhao, Li Song, Rong Xie, Xiaokang Yang
Shanghai Jiao Tong University
Outline
• Background
• The Super Resolution convolutional neural network
• GPU Optimization of the SRCNN algorithm
• Experiments
• Conclusion & Further Work
The motivation: Blu-Ray(HD) To UHD(4K)
• Even though 4K/UHD TVs are ready now, native
4K content is scarce.
• There is a near-complete lack of available
4K/UHD content in the short term.
• Thus, solutions that convert HD content
to high-quality 4K are needed.
The Background of SR
• State-of-the-art SR methods not only increase the resolution
of the input image, but also enhance perceptual quality by
adding more details or high-frequency signals.
• Example-based SR methods learn the LR-HR mapping from
thousands of patch pairs.
• By applying the latest machine learning approaches to massive datasets, this
methodology has witnessed significant progress in recent years.
[Figure: the input frame, the SR frame, and the "original" frame, with zoomed crops of the details]
SRCNN Example
[Figure: side-by-side comparison of bicubic upscaling and SRCNN output]
SRCNN Example
[Figure: a second side-by-side comparison of bicubic upscaling and SRCNN output]
Which algorithm do we choose?
• The criteria: better quality and high speed
• In general, quality and speed are contradictory
[Figure: the quality vs. speed trade-off among SR algorithms]
The Top SR algorithms - quality
• The top-3 SR algorithms, CSC, CNN, and A+, achieve the best restoration
performance, i.e., quality (PSNR).
S. Gu, W. Zuo, Q. Xie, D. Meng, X. Feng, and L. Zhang, "Convolutional Sparse Coding for Image Super-Resolution," ICCV 2015.
The Top SR algorithms - speed
• The top-3 SR algorithms, CSC, CNN, and A+, have very different
running times in speed benchmarks.
• Overall, CNN and A+ are the best in terms of both quality and speed.
In this paper, we choose the CNN-based method as our anchor because it
is friendly to parallel computing on GPUs.
• It should be noted that the original SRCNN implementation (MATLAB, on
CPU) still needs 300s to convert a 1920x1080 frame to 3840x2160.
Prototype:
SRCNN
• SRCNN: Super-Resolution
Convolutional Neural Network
• Uses the popular CNN framework to learn
the end-to-end LR-to-HR mapping
function F: output = F(input)
The main steps in SRCNN
• Patch extraction and representation (Conv1, W1: 64 filters of size 9x9x1):
F1(Y) = max(0, W1 * Y + B1)
• Non-linear mapping (Conv2, W2: 32 filters of size 1x1x64):
F2(Y) = max(0, W2 * F1(Y) + B2)
• Reconstruction (Conv3, W3: 1 filter of size 5x5x32):
F3(Y) = W3 * F2(Y) + B3
Thus, there are two basic operations in this paper: convolution and
the Rectified Linear Unit (ReLU), i.e. max(0, x).
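To make these two operations concrete, here is a minimal single-channel reference sketch of one such layer (convolution followed by ReLU) in plain C++. The function name, the border clamping, and the single-channel simplification are our illustrative assumptions, not the original implementation.

#include <algorithm>
#include <vector>

// One SRCNN-style layer on a single channel:
// out(y, x) = max(0, sum_{i,j} w(i, j) * in(y + i - k/2, x + j - k/2) + b)
void conv_relu(const std::vector<float>& in, int H, int W,
               const std::vector<float>& w, int k, float b,
               std::vector<float>& out) {
    out.assign(static_cast<size_t>(H) * W, 0.0f);
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            float acc = b;
            for (int i = 0; i < k; ++i)
                for (int j = 0; j < k; ++j) {
                    // clamp coordinates at the borders ("same"-size output)
                    int yy = std::min(std::max(y + i - k / 2, 0), H - 1);
                    int xx = std::min(std::max(x + j - k / 2, 0), W - 1);
                    acc += w[i * k + j] * in[yy * W + xx];
                }
            out[y * W + x] = std::max(0.0f, acc);  // ReLU: max(0, x)
        }
}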
Direct GPU Implementation of Convolution
• SRCNN consists of 3 convolution layers
and 2 max (ReLU) layers. Over 95% of the
execution time is spent on convolution,
so we focus on the GPU optimization of
convolution.
• We split the convolution task into millions
of mini-tasks (along the width and height of
the convolution output). Each convolution
output pixel is computed by a unique
thread, as shown in the figure and in the
kernel sketch below.
• The GPU is well suited to this kind of parallel
image processing. The processing time
improves from 300s/frame (CPU)
to 1s/frame (GPU, direct convolution).
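A hedged sketch of this direct scheme in CUDA C++: the kernel name, the single-channel simplification, and the border clamping are our assumptions, but the one-thread-per-output-pixel mapping is the scheme described above.

__global__ void conv_direct(const float* in, float* out,
                            const float* w, float bias,
                            int H, int W, int K /* filter size */) {
    // one thread per output pixel, indexed along width and height
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;
    float acc = bias;
    for (int i = 0; i < K; ++i)
        for (int j = 0; j < K; ++j) {
            int yy = min(max(y + i - K / 2, 0), H - 1);  // clamp at borders
            int xx = min(max(x + j - K / 2, 0), W - 1);
            acc += w[i * K + j] * in[yy * W + xx];  // every read hits global memory
        }
    out[y * W + x] = acc;
}

// Launch: a 2-D grid covering the output, e.g. 16x16 threads per block.
// dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
// conv_direct<<<grid, block>>>(d_in, d_out, d_w, 0.0f, H, W, 9);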
GPU Memory Hierarchy
• To further accelerate our SR method, we need to exploit the
GPU memory hierarchy and make full use of it.
• Three types of GPU memory:
• Global memory: large, but slowest
• Shared memory: small, but faster
• Registers: very limited, but fastest
• The key point:
• load reused data into faster memory (see the sketch below)
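A tiny CUDA C++ sketch of how the three levels appear in code; it is purely illustrative (our example, not the paper's) and assumes a one-dimensional block of 256 threads.

__global__ void memory_levels(const float* g_in, float* g_out, int n) {
    __shared__ float tile[256];                      // shared memory: per-block, fast
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? g_in[i] : 0.0f;              // g_in lives in global memory: big, slow
    tile[threadIdx.x] = v;                           // stage the value for block-wide reuse
    __syncthreads();
    // the neighbour's value is now read from shared memory, not global memory
    float left = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : v;
    if (i < n) g_out[i] = 0.5f * (v + left);         // v and left sit in registers: fastest
}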
GPU Optimization Trick 1: Shared Kernel
The reuse of filter parameters
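A sketch of what this trick could look like in CUDA C++ (single channel, names assumed): the block cooperatively loads the KxK filter into shared memory once, and all of its threads then reuse those weights instead of re-reading them from global memory.

__global__ void conv_shared_kernel(const float* in, float* out,
                                   const float* w, int H, int W, int K) {
    extern __shared__ float w_sh[];                  // K*K filter weights
    int t = threadIdx.y * blockDim.x + threadIdx.x;
    if (t < K * K) w_sh[t] = w[t];                   // loaded once per block
    __syncthreads();

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;
    float acc = 0.0f;
    for (int i = 0; i < K; ++i)
        for (int j = 0; j < K; ++j) {
            int yy = min(max(y + i - K / 2, 0), H - 1);
            int xx = min(max(x + j - K / 2, 0), W - 1);
            acc += w_sh[i * K + j] * in[yy * W + xx];  // weight reads now hit shared memory
        }
    out[y * W + x] = acc;
}
// Launch with K*K*sizeof(float) bytes of dynamic shared memory:
// conv_shared_kernel<<<grid, block, K * K * sizeof(float)>>>(...);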
GPU Optimization Trick 2: Shared Patch
The reuse of input image patch
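A corresponding sketch of this trick (tile size and names are our assumptions): each block stages the input tile it needs, including a K/2-pixel halo, in shared memory, so neighbouring threads reuse the overlapping pixels instead of each re-reading them from global memory. Assumes a blockDim of (TILE, TILE).

#define TILE 16                                      // output pixels per block side
__global__ void conv_shared_patch(const float* in, float* out,
                                  const float* w, int H, int W, int K) {
    __shared__ float tile[TILE + 8][TILE + 8];       // halo sized for K <= 9
    int r = K / 2;
    int x0 = blockIdx.x * TILE, y0 = blockIdx.y * TILE;
    // cooperative, border-clamped load of the (TILE + 2r)^2 input tile
    for (int ty = threadIdx.y; ty < TILE + 2 * r; ty += blockDim.y)
        for (int tx = threadIdx.x; tx < TILE + 2 * r; tx += blockDim.x) {
            int yy = min(max(y0 + ty - r, 0), H - 1);
            int xx = min(max(x0 + tx - r, 0), W - 1);
            tile[ty][tx] = in[yy * W + xx];
        }
    __syncthreads();

    int x = x0 + threadIdx.x, y = y0 + threadIdx.y;
    if (x >= W || y >= H) return;
    float acc = 0.0f;
    for (int i = 0; i < K; ++i)
        for (int j = 0; j < K; ++j)
            acc += w[i * K + j] * tile[threadIdx.y + i][threadIdx.x + j];
    out[y * W + x] = acc;
}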
GPU Optimization Trick 3: Fusion of
Convolution and ReLU (Max) Operations
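Rather than launching a separate elementwise max kernel after each convolution, the ReLU is folded into the convolution kernel's final store. A fused variant of the direct kernel sketched earlier (again our illustrative form, not the authors' code):

__global__ void conv_relu_fused(const float* in, float* out,
                                const float* w, float bias,
                                int H, int W, int K) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= W || y >= H) return;
    float acc = bias;
    for (int i = 0; i < K; ++i)
        for (int j = 0; j < K; ++j) {
            int yy = min(max(y + i - K / 2, 0), H - 1);
            int xx = min(max(x + j - K / 2, 0), W - 1);
            acc += w[i * K + j] * in[yy * W + xx];
        }
    out[y * W + x] = fmaxf(acc, 0.0f);  // ReLU fused into the final write: no
                                        // separate max kernel, no extra global-
                                        // memory round trip for the intermediate
}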
Put them together: Fully GPU Implemented
SRCNN
• The SR method should be fully GPU implemented:
• to accelerate the entire procedure as much as possible
• to minimize CPU/GPU data transfers
• In the implementation, we apply the above tricks in layer-specific ways, e.g.:
• Shared Kernel & Patch for Conv3 (1x5x5x32)
• Shared Kernel & Registered Pixel for Conv2 (32x1x1x64)
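A host-side sketch of the fully GPU-resident pipeline (all kernel and buffer names here are hypothetical): the frame is uploaded once, all three layers run on the device, and only the result comes back, so the intermediate feature maps never cross the CPU/GPU boundary.

// Assumes d_in/d_f1/d_f2/d_out are pre-allocated device buffers and
// conv1_kernel/conv2_kernel/conv3_kernel are the per-layer kernels above.
void srcnn_frame(const float* h_frame, float* h_result,
                 float* d_in, float* d_f1, float* d_f2, float* d_out,
                 size_t bytes_in, size_t bytes_out, dim3 grid, dim3 block) {
    cudaMemcpy(d_in, h_frame, bytes_in, cudaMemcpyHostToDevice);    // CPU -> GPU, once
    conv1_kernel<<<grid, block>>>(d_in, d_f1);    // 9x9, 64 maps, fused ReLU
    conv2_kernel<<<grid, block>>>(d_f1, d_f2);    // 1x1, 32 maps, fused ReLU
    conv3_kernel<<<grid, block>>>(d_f2, d_out);   // 5x5, 1 map
    cudaMemcpy(h_result, d_out, bytes_out, cudaMemcpyDeviceToHost); // GPU -> CPU, once
    // d_f1 and d_f2 never leave the GPU
}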
Experiments
• The open SJTU 4K video sequences are used to test
speed and quality.
• Speed*: the overall execution for a single-channel 1920x1080 to
3840x2160 conversion accelerates from 300s/frame (CPU) to
0.15s/frame (GPU), a 2000x speed-up.
• Quality: the GPU-executed SR method has exactly the same quality as
the CPU version.
* On a workstation with two Intel E5-2697 v2 @ 2.7GHz processors and an NVIDIA GTX 980 Ti.
Observations
• The best solution, i.e., the maximum acceleration rate, comes
from an optimized combination that applies different schemes to
different convolution layers.
• This is reasonable because each convolution layer has a specific
convolution size, and therefore a specific pattern of filter loading,
data input, computation, and data output. For a given convolution
size, one suitable combination of schemes will beat the others.
Per-layer running time of each scheme:

Method                              Conv1       Conv2       Conv3
CPU                                 13064ms     306622ms    5650ms
Direct                              453ms       475ms       86ms
Shared Kernel                       375ms       396ms       56ms
Shared Kernel & Patch               139ms       1602ms      41ms
Shared Kernel & Registered Pixel    Unable      24ms        Unable
cuDNN                               56ms        37ms        150ms
Conclusion & Further work
• We will use GPUs to accelerate other super-resolution
methods with better quality but more complex procedures.
• We are also working on accelerating other video processing
tasks with GPUs, such as frame upscaling, color conversion,
and LDR2HDR.
[Figure: cloud video processing with more GPUs converts HD to UHD: HD → 4K, 25/30p → 50/60p, 8 bit → 10/12 bit, R.709 → R.2020]
Thanks!
song_li@sjtu.edu.cn