Complexity reduction of H.264 using parallel programming

COMPLEXITY REDUCTION OF H.264
USING PARALLEL PROGRAMMING
By Sudeep Gangavati
Department of Electrical Engineering
University of Texas at Arlington
Supervisor: Dr. K. R. Rao
Outline
Introduction to video compression
Why H.264
Overview of H.264
Motivation
Possible approaches
Related work
Theoretical estimation
Proposed approach
Parallel computing
NVIDIA GPUs and CUDA Programming Model
Complexity reduction using CUDA
Results
Conclusions and future work
Introduction to video compression
Video codec: software or a hardware device that compresses and decompresses digital video.
Need for compression: limited bandwidth and limited storage space.
Several codecs: H.264, VP8, AVS China, Dirac, etc.

Figure 1 Forecast of mobile data usage
Why H.264?
H.264/MPEG-4 Part 10, or AVC (Advanced Video Coding), standardized by ITU-T VCEG and ISO/IEC MPEG in 2003.
Approximately 50% bit-rate reduction over MPEG-2 at comparable video quality.
Most widely used standard.
Builds on the concepts of earlier standards such as MPEG-2.
Substantial compression efficiency.
Network-friendly data representation.
Improved error resiliency tools.
Supports various applications.
Overview of H.264

There are two parts:
◦ Encoder: carries out intra prediction, motion estimation, transform, quantization and entropy encoding to produce an H.264 bit-stream.
◦ Decoder: carries out entropy decoding, inverse quantization and inverse transform to reconstruct the encoded video.
H.264 encoder [1]
H.264 decoder [2]
Intra prediction
Exploits spatial redundancy within a frame.
9 prediction modes (8 directional plus DC) for 4 x 4 luma blocks
4 modes for 16 x 16 luma blocks
4 modes for 8 x 8 chroma blocks
Figures: the 9 modes for a 4 x 4 luma block; the 4 modes for 16 x 16 luma blocks
Inter prediction
Exploits temporal redundancy.
Involves prediction from one or more previous frames, called reference frames.

Motion estimation and compensation
Motion estimation and compensation is the process of finding a matching block in a reference frame.
A motion search is performed.
Motion vectors are obtained that give the displacement of the block.

Transform, Quantization and Encoding
The prediction residual is then transformed.
H.264 employs an integer transform, an approximation of the DCT.
After the transform, the values are quantized for compression.
Entropy encoding: CAVLC / CABAC

H.264 profiles [1]

H.264 provides several profiles for
different applications
Motivation

Time profiling [45] was performed on H.264.
Figure: pie chart of encoding time by module (Motion Estimation; Transform and quantization; Intra Prediction; VLC Encoding and others), with slices of 90%, 5%, 3% and 2%.
Motion estimation takes more time than any other module in H.264.
This time needs to be reduced by an efficient implementation, without sacrificing video quality and bitrate.
With reduced motion estimation time, the total encoding time is reduced.

Possible approaches for complexity reduction

Encoder optimization levels:
◦ Algorithmic level: develop new algorithms, such as the three-step search algorithm or fast mode decision algorithms.
◦ Compiler level: efficient programming.
◦ Implementation level: parallel programming with CUDA or OpenMP, making use of the underlying hardware.
Related work
Table columns: Author, Features, Advantages, Disadvantages
1. Chan et al. [41]: Considers a pyramid algorithm for motion estimation.
2. Lee et al. [40]: Multi-pass motion estimation. Generates local and global SADs in the first and second passes. A fast ME search algorithm is used. Advantage: 6 times speed up achieved compared to the standard implementation.
3. Rodriguez et al. [42]: Considers a tree-structured motion estimation algorithm.
4. Cheung et al. [44]: Based on simplified unsymmetrical multi-hexagon search. Divides the frame into tiles.
5. NVIDIA Encoder: Searching algorithm is not disclosed. No documentation on internal details.
Other features and advantages listed in the table:
◦ Considers the predicted motion vector to calculate SAD.
◦ Three sequential steps: 1. SAD calculation 2. Binary reduction algorithm 3. Cost reduction.
◦ 3x speed up. A thread is created for each tile.
◦ Provides 4 times speed up.
◦ Very good visual quality.
Disadvantages listed in the table:
◦ 1. Video quality degradation. 2. RD performance is not considered.
◦ 1. Focus only on speed, not on rate and distortion. 2. Threads are invoked for pixels. 3. Video resolution limits the thread creation.
◦ 1. Implementation results in higher bitrate. 2. RD performance is not shown.
◦ Penalty in video quality.
◦ Fixed search range.
Issues with previous work
Focus only on the achievable speed up.
Do not consider methods to decrease the bitrate.
Do not consider techniques to maintain video quality.
Thread creation overhead and limitations in some approaches.

Theoretical estimation by Amdahl's Law [43]

We use this law to find the maximum achievable speed up.

It is widely used in parallel computing to predict the theoretical maximum speed up using multiple processors.

Amdahl's law states that if P is the proportion of a program that can be made parallel and (1-P) is the proportion that cannot be parallelized, then the maximum speedup that can be achieved by using N processors is
$$S(N) = \frac{1}{(1 - P) + \frac{P}{N}}$$
Using Amdahl's Law

Approximation of the speed up achieved upon parallelizing a portion of the code
◦ P: parallelized portion
◦ N: number of processor cores

In the encoder code, motion estimation accounts for approximately two-thirds of the code.

Applying the law, the maximum speedup that can be achieved in our case is 2.2 times, or about 55% time reduction.
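As a rough illustration of how Amdahl's law, S = 1 / ((1 - P) + P / N), is evaluated, the sketch below computes the bound for a given P and N and converts a speed-up into a time reduction. The slides do not state the exact P and N behind the 2.2 figure, so the sketch does not try to reproduce it; N = 96 is an assumed core count (2 SMs with 48 cores each), and the function names are illustrative.

```python
def amdahl_speedup(p, n):
    """Maximum speed-up when a fraction p of the run time is parallelized over n processors."""
    return 1.0 / ((1.0 - p) + p / n)


def time_reduction(speedup):
    """Fraction of the original run time saved at a given speed-up."""
    return 1.0 - 1.0 / speedup


# P = 2/3 as quoted on the slide; N = 96 is an assumption (2 SMs x 48 cores).
print(round(amdahl_speedup(2 / 3, 96), 2))   # 2.94
# A 2.2 times speed-up corresponds to roughly a 55% time reduction.
print(round(time_reduction(2.2) * 100, 1))   # 54.5
```

Note that the bound depends strongly on P: a smaller parallel fraction lowers the achievable speed-up no matter how many cores are available.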
Proposed work

We propose the following to address the problem:
◦ Use CUDA for a parallel implementation that calculates SAD (sum of absolute differences) faster, with one thread per block instead of one thread per pixel to address the thread creation overhead and limitation.
◦ Use a better search algorithm for motion estimation to maintain the video quality.
◦ Combine SAD cost values to save bitrate.

The above methods address all the issues mentioned earlier.

In addition, we use the shared and texture memory of the GPU, which reduces global memory references and provides faster memory access.
Parallel Computing
Multi-core and many-core processors improve efficiency through parallel processing.
Parallel processing provides significant improvement.
Techniques to program software on multi-core processors:

◦ Data parallelism
◦ Task parallelism

Data Parallelism
◦ Split the large data set into smaller parts and execute them in parallel. After the execution, the results are grouped.

Task Parallelism
◦ Distribute threads to different processors
◦ Data could be common
◦ May execute same or different code
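The split, execute and group steps of data parallelism can be sketched on the CPU as follows. This is only an illustration of the structure, not a performance measurement: the helper names and the worker count are hypothetical, and Python threads are used simply because they are in the standard library.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Split `data` into at most `parts` contiguous chunks of nearly equal size."""
    if not data:
        return []
    size = -(-len(data) // parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum_of_abs(data, workers=4):
    """Process each chunk on its own worker, then group the partial results."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(lambda chunk: sum(abs(v) for v in chunk), chunks))
    return sum(partial)
```

Each worker runs the same code on a different part of the data, which is the pattern the SAD computation follows later in the slides.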
NVIDIA GPU and CUDA Programming Model

NVIDIA pioneered graphics processing unit (GPU) technology. Its first GPU, the GeForce 256 (1999), shipped with 32 MB of graphics memory.

GPUs, which consist of many processor cores, are used in applications requiring large amounts of computation.

CPU-GPU Heterogeneous Model
Host-Device Connection
Compute Unified Device Architecture (CUDA) [22]
NVIDIA introduced CUDA in 2006.
A programming model that lets programs run on the GPU.
The serial portions of the program are written as C/C++ functions.
The parallel portions are written as GPU kernels.
C/C++ functions execute on the CPU; kernels are sent to the GPU for processing.
Problem decomposition
Serial C functions run on the CPU.
CUDA kernels run on the GPU.
Hardware Architecture
Main element: streaming multiprocessor (SM)
The GT550M has 2 SMs.
Each SM has 48 cores.
Each SM can hold up to 1536 resident threads.
A total of 3072 threads can be resident concurrently.

Threading
Threads are grouped into blocks.
Blocks are grouped into grids.
All threads within a block execute on the same SM.
Complexity reduction using CUDA
Motion estimation: the process of finding the best matching block.
To find the best matching block, a search is done in the search window (or region).
The search finds the best matching block by computing the sum of absolute differences (SAD) between the current block and each candidate block.
Figure: motion vector
$$\mathrm{SAD}(dx, dy) = \sum_{m=x}^{x+N-1} \sum_{n=y}^{y+N-1} \left| I_k(m, n) - I_{k-1}(m + dx, n + dy) \right|$$
Search over a search range of 8, 16 or at most 32.
Select the block with the least SAD.
The larger the block size, the more computations are needed.
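A minimal CPU-side sketch of the SAD computation and an exhaustive search over the search range is given below. It assumes frames stored as lists of rows of luma samples, maps the index m to the column and n to the row, and skips candidates that fall outside the reference frame; the function names and that boundary policy are illustrative choices, not taken from the slides.

```python
def sad(cur, ref, x, y, dx, dy, n):
    """SAD between the n x n block of `cur` at column x, row y and the
    block of `ref` displaced by (dx, dy)."""
    total = 0
    for row in range(y, y + n):
        for col in range(x, x + n):
            total += abs(cur[row][col] - ref[row + dy][col + dx])
    return total


def full_search(cur, ref, x, y, n, search_range):
    """Test every displacement within +/- search_range that keeps the
    candidate block inside the reference frame; return (sad, dx, dy)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            inside = (0 <= x + dx and x + dx + n <= width
                      and 0 <= y + dy and y + dy + n <= height)
            if not inside:
                continue
            cost = sad(cur, ref, x, y, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

The displacement returned with the smallest SAD plays the role of the motion vector. The number of SAD evaluations grows with the square of the search range, which is why this step dominates the encoding time.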
Standard algorithm (illustrated on a 352 x 288 frame)
Divide the frame into macroblocks of size 16 x 16.
Further divide these macroblocks into sub-blocks of 8 x 8.
Search through the search area.
Compute SAD.
Obtain MVs.
Our approach
• The main idea is to:
– Minimize memory references and memory transfers
– Make use of shared memory and texture memory
– Use a single thread to compute the SAD for a single block
– Make thread block creation dependent on the frame size for scalability
– Invoke a large number of threads that run in parallel; each thread block consists of 396 threads that compute the SADs of 396 blocks of size 8 x 8
SAD mapping to threads
For a 352 x 288 frame: (352/8) * (288/8) = 1584 blocks of size 8 x 8 for which SAD is to be computed.
Total thread blocks = 4, each with 396 threads.
This makes the approach scalable. For a video with higher resolution, like 704 x 480 (4SIF) or 704 x 576 (4CIF), we can create 16 blocks, each with 396 threads. So the number of threads created depends on the video resolution.
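The mapping arithmetic can be checked with a short sketch. It assumes one thread per 8 x 8 sub-block, 396 threads per thread block, and a row-major assignment of sub-blocks to threads; the row-major order and the function names are assumptions made for illustration, since the slides do not state how indices are assigned.

```python
def launch_shape(width, height, sub_block=8, threads_per_block=396):
    """Return (number of SADs, number of thread blocks) for one frame."""
    sads = (width // sub_block) * (height // sub_block)
    blocks = -(-sads // threads_per_block)  # ceiling division
    return sads, blocks


def sub_block_origin(block_idx, thread_idx, width, sub_block=8, threads_per_block=396):
    """Top-left pixel (x, y) of the sub-block handled by one thread,
    assuming sub-blocks are numbered in row-major order."""
    per_row = width // sub_block
    index = block_idx * threads_per_block + thread_idx
    return (index % per_row) * sub_block, (index // per_row) * sub_block


formats = {"QCIF": (176, 144), "CIF": (352, 288), "4SIF": (704, 480), "4CIF": (704, 576)}
for name, (w, h) in formats.items():
    print(name, launch_shape(w, h))
```

By this arithmetic QCIF needs 1 thread block, CIF needs 4 and 4CIF needs exactly 16. 4SIF has 5280 sub-blocks, which fits in 14 thread blocks with the last one only partly filled.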
Performance enhancements

We consider rate-distortion (RD) criteria and employ the following techniques:
◦ To minimize bitrate: calculate the cost for the smaller 8 x 8 sub-blocks and combine 4 of these to form a single cost for the 16 x 16 block.
◦ To enhance video quality: incorporate the exhaustive full search algorithm, which evaluates every candidate block in the search area without skipping any, as opposed to other algorithms. Previous studies [30] show that this algorithm provides the best performance. Though it is computationally expensive, it is used with video quality in mind.
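One way to read the cost combination is that the SAD of a 16 x 16 macroblock equals the sum of the SADs of its four 8 x 8 sub-blocks when all four use the same displacement. The sketch below demonstrates that identity only; whether the thesis combines costs in exactly this way is an assumption, and the function names are illustrative.

```python
def sad(cur, ref, x, y, dx, dy, n):
    """SAD between the n x n block of `cur` at (x, y) and `ref` displaced by (dx, dy)."""
    return sum(abs(cur[r][c] - ref[r + dy][c + dx])
               for r in range(y, y + n) for c in range(x, x + n))


def macroblock_sad_from_sub_blocks(cur, ref, x, y, dx, dy):
    """16 x 16 SAD obtained by adding the four 8 x 8 sub-block SADs
    computed at the same displacement (dx, dy)."""
    return sum(sad(cur, ref, x + ox, y + oy, dx, dy, 8)
               for oy in (0, 8) for ox in (0, 8))
```

Reusing the sub-block SADs means the 16 x 16 cost comes at the price of three additions instead of another 256 absolute differences.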
Memory access

Figure: memory access from texture memory to shared memory

Memcpy API to move data into the array we allocated:
cudaMemcpyToArray(
    a_before_dilated,              // array pointer
    0,                             // array offset width
    0,                             // array offset height
    h_before_dilated,              // source
    width*height*sizeof(uchar1),   // size in bytes
    cudaMemcpyHostToDevice);       // type of memcpy
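Performance Metrics

$$\%\ \text{Time reduction} = \frac{T_{\text{reference}} - T_{\text{optimized}}}{T_{\text{reference}}} \times 100$$

$$\Delta \text{Bitrate}\ \% = \frac{\text{Bitrate}_{\text{reference}} - \text{Bitrate}_{\text{optimized}}}{\text{Bitrate}_{\text{reference}}} \times 100$$

$$\Delta \text{PSNR}\ \% = \frac{\text{PSNR}_{\text{reference}} - \text{PSNR}_{\text{optimized}}}{\text{PSNR}_{\text{reference}}} \times 100$$

$$\Delta \text{SSIM}\ \% = \frac{\text{SSIM}_{\text{reference}} - \text{SSIM}_{\text{optimized}}}{\text{SSIM}_{\text{reference}}} \times 100$$

Here T is the time taken, bitrate is in kbits/s and PSNR is in dB; "reference" and "optimized" refer to the reference software and the optimized software.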
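All four metrics share the form (reference - optimized) / reference * 100, so a single helper covers them. The numbers in the usage line are made up for illustration and are not results from the thesis.

```python
def percent_change(reference, optimized):
    """(reference - optimized) / reference * 100, the form shared by all four metrics."""
    return (reference - optimized) / reference * 100.0


# Hypothetical values: 600 s for the reference encoder, 300 s for the optimized one.
print(percent_change(600.0, 300.0))  # 50.0
```

A positive value means the optimized software produced the smaller number, which is desirable for time and bitrate but indicates a loss for PSNR and SSIM.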
QCIF and CIF formats
Test Sequences
Results
Figure: comparison of average encoding time (in seconds) for the QCIF video sequences Akiyo, Carphone, News, Container and Foreman, for the Reference Software, Optimized Software and NVIDIA Encoder.
The CPU-GPU encoder performs better than the CPU-only encoder but falls short of the NVIDIA Encoder. This is because the NVIDIA Encoder is heavily optimized at all levels of H.264, not just motion estimation. NVIDIA has also not disclosed the search algorithm it uses. The choice of motion search algorithm significantly affects quality, bitrate and speed.
The theoretical speed up was about 2.2 to 2.5 times. The results show approximately 2 times speed up. The gap can be attributed to factors not considered in the estimate: load and store operations for functions, transfer of control to the GPU, memory transfers and references, and other H.264 calculations.
Results for QCIF video sequences
Figures: PSNR (dB) vs. bitrate (200 kbps to 1000 kbps) for the Carphone, Akiyo, Container and Foreman sequences, comparing the Reference software, Optimized software and NVIDIA Encoder.
Results
Figure: SSIM for the QCIF sequences Akiyo, Carphone, News, Container and Foreman, for the Reference Software, Optimized software and NVIDIA Encoder.
SSIM measures the structural similarity between the input and output videos. Here it ranges from 0.0 (lowest quality) to 1.0 (highest quality).
Results
Figure: comparison of average encoding time (in seconds) for the CIF video sequences Akiyo, Carphone, News, Container and Foreman, for the Reference Software, Optimized software and NVIDIA Encoder.
Similar behavior is observed for the CIF video sequences.
Results for CIF video sequences
Figures: PSNR (dB) vs. bitrate (200 kbps to 1000 kbps) for CIF sequences, including News, Carphone and Container, comparing the Reference software, Optimized software and NVIDIA Encoder.
Results
Figure: SSIM for the CIF sequences Akiyo, Carphone, News, Container and Foreman, for the Reference software, Optimized software and NVIDIA Encoder.
SSIM values for our optimized software and the NVIDIA Encoder are very close.
Conclusions
Nearly 50% reduction in encoding time on various sequences, close to the theoretical estimate.
Little degradation in video quality is observed.
Bitrate is reduced by combining the SAD costs of sub-blocks into the SAD cost of the larger macroblock.
SSIM, bitrate and PSNR are close to the values obtained without optimizations.
Data parallelism is achieved.
With little modification to the code, the approach scales to better hardware and higher video resolutions.
Limitations
In a serial implementation, once the sum of SADs up to the kth row (k < 8) exceeds the current best SAD, there is no need to compute further. Because the threads work concurrently, no best SAD is available until each thread has finished calculating, so this early termination cannot be used.
The search range cannot be modified while encoding is in progress.
Since the implementation is tied to the GPU hardware, the performance largely depends on the type of hardware used.
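The early termination that a serial encoder can use, and that the concurrent threads cannot, may be sketched as follows. Blocks are assumed to be lists of rows of luma samples; the function name and the return convention are illustrative.

```python
def sad_with_early_exit(cur_block, ref_block, best_so_far):
    """Row-by-row SAD that stops once the partial sum exceeds the best SAD
    found so far. Returns (partial_or_full_sad, rows_processed)."""
    total = 0
    for rows_done, (cur_row, ref_row) in enumerate(zip(cur_block, ref_block), start=1):
        total += sum(abs(a - b) for a, b in zip(cur_row, ref_row))
        if total > best_so_far:
            return total, rows_done  # candidate cannot win; skip remaining rows
    return total, len(cur_block)
```

The saving depends on a running best SAD being known while a candidate is evaluated. When every candidate is evaluated at the same time, that value only exists after all threads have finished.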

Future work

Other operations in H.264, such as filtering and entropy encoding, can be parallelized.

Block dependencies are not considered in this approach. Handling them could be challenging but would result in higher compression efficiency.

Other profiles, such as the High and Main profiles, can be used for implementation.

Different motion estimation algorithms can be implemented in parallel and later incorporated into H.264.

CUDA can be applied to HEVC [46], the next-generation video coding standard and successor to H.264. HEVC is known to be more complex than H.264.
Thank You
References
[1] I.E. Richardson, “The H.264 advanced video compression standard”, 2nd Edition,
Wiley, 2010.
[2] S. Kwon, A. Tamhankar, and K.R. Rao, “Overview of H.264/MPEG-4 part 10”, Journal
of Visual Communication and Image Representation, vol. 17, no.2, pp. 186-216, April
2006.
[3] Draft ITU-T Recommendation and final draft international standard of joint video
specification (ITU-T Rec. H.264/ISO/IEC 14 496-10 AVC), Mar. 2003.
[4] G. Sullivan, “Overview of international video coding standards (preceding
H.264/AVC)”, ITU-T VICA Workshop, July 2005.
[5] T. Wiegand, et al, “Overview of the H.264/AVC video coding standard”, IEEE
Transactions on. Circuits and Systems for Video Technology, vol.13, pp. 560–576, July
2003.
[6] M. Jafari and S. Kasaei, “Fast intra- and inter-prediction mode decision in H.264
advanced video coding”, International Journal of Computer Science and Network
Security, vol.8, no.5, pp. 1-6, May 2008.
[7] W. Chen and H. Hang, “H.264/AVC motion estimation implementation on Compute
Unified Device Architecture (CUDA)”, 2008 IEEE International Conference on
Multimedia and Expo, pp. 697-700, 26 April 2008.
[8] Y. He, I. Ahmad and M. Liou, “ A software-based MPEG-4 video encoder using parallel
processing”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 8, no.7, pp.
909-920, November 1998.
[9] D. Marpe, T. Wiegand and G. J. Sullivan, “The H.264/MPEG-4 AVC standard and its
applications”, IEEE Communications Magazine, vol. 44, pp. 134-143, Aug. 2006.
[10] Z.Wang, et al, “ Image quality assessment : From error visibility to structural similarity”,
IEEE Transactions on Image Processing, vol 13. pp. 600-612, April 2004.
[11] G. Sullivan, P. Topiwala, and A. Luthra, “The H.264/AVC advanced video coding standard:
overview and introduction to the fidelity range extensions” SPIE Conference on Applications
of Digital Image Processing XXVII, vol. 5558, pp. 53-74, 2004.
[12] A. Puri, X. Chen and A. Luthra, “Video coding using the H.264/MPEG-4 AVC compression
standard”, Signal Processing:Image Communication , vol.19, pp. 793–849, 2004.
[13] K.R. Rao and P.Yip, Discrete cosine transform, Academic Press, 1990.
[14] H.Yadav, “Optimization of the deblocking filter in H.264 codec for real time
implementation”, M.S. Thesis, E.E. Dept, UT Arlington, 2006.
[15] https://computing.llnl.gov/tutorials/parallel_comp/, Introduction to parallel computing.
[16] J. Kim, et al, “Complexity reduction algorithm for intra mode selection in H.264/AVC
video coding” J. Blanc-Talon et al. (Eds.): ACIVS 2006, LNCS 4179, pp. 454 – 465, Springer-Verlag, Berlin, Heidelberg, 2006.
[17] B.Jung, et al, “Adaptive slice-level parallelism for real-time H.264/AVC encoder with fast
inter mode selection”, Multimedia Systems and Applications X, edited by S. Rahardja, J.W.Kim
and J.Luo, Proc. of SPIE, vol. 6777, 67770J, 2007.
[18] S.Ge, X.Tian and Y. - K. Chen, “Efficient multithreading implementation of H.264 encoder
on Intel Hyper-threading architectures”, ICICS-PCM 2003.
[19] T. Rauber and G.Runger, “Parallel programming for multicore and cluster systems”, 2nd
Edition,Wiley, 2008
[20] D. Ailawadi, M.K. Mohapatra and A. Mittal, “Frame-based parallelization of MPEG-4 on
Compute Unified Device Architecture (CUDA)”, IEEE Conference on Advanced Computing, pp.
267-272, 2010.
[21] M. A. F. Rodriguez, “CUDA: Speeding up parallel computing”, International Journal of
Computer Science and Security, November 2010.
[22] NVIDIA, NVIDIA CUDA Programming Guide,Version 3.2, NVIDIA, September 2010.
[23] “http://drdobbs.com/high-performance-computing/206900471” Jonathan Erickson, GPU
Computing Isn’t Just About Graphics Anymore, Online Article, February 2008.
[24] J. Nickolls and W. J. Dally,” The GPU Computing Era” , IEEE Computer Society Micro-IEEE,
vol. 30, Issue 2, pp. 56 - 69, April 2010.
[25] M.Abdellah, “High performance Fourier volume rendering on graphics processing units”,
M.S. Thesis, Systems and Bio-Medical Engineering Department, Cairo University, 2012.
[26] J. Sanders and E. Kandrot, “CUDA by example: an introduction to general-purpose GPU
programming” Addison-Wesley Professional, 2010.
[27] NVIDIA, NVIDIA’s Next Generation CUDA Compute Architecture:Fermi, White Paper,
Version 1.1, NVIDIA 2009.
[28] NVIDIA, Best Programming Practices, 2009.
[30] P. Kuhn, “Algorithms, complexity analysis and VLSI architectures for MPEG-4 motion
estimation”, Kluwer Academic, 1999.
[31] K. Shen and E.J. Delp, “ A spatial-temporal parallel approach for real time MPEG video
compression”, Proc. of 25th International conference on parallel processing, pp. 100-107, 1996.
[32] JM 16.0 software – http://iphome.hhi.de/suehring/tml/
[33] JM Reference Software Manual –http://iphome.hhi.de/suehring/tml/JM Reference
Software Manual (JVT-AE010).pdf
[34] D. Han, A. Kulkarni and K.R. Rao, “Fast inter-prediction mode decision algorithm for
H.264 video encoder”, ECTICON 2012, Cha Am, Thailand, May 2012.
[35] S. Sun, et al, “A highly efficient parallel algorithm for H.264 encoder based on
macro-block region partition”, Springer-Verlag, Berlin, Heidelberg, pp. 577–585, 2007.
[36] Test sequences - http://trace.eas.asu.edu/yuv/
[37] D. Kirk and W.-M. Hwu, “Programming massively parallel processors: A hands-on
approach (Applications of GPU Computing series)”, Morgan Kaufmann, 2010
[38] Flynn`s Taxonomy, http://www.phy.ornl.gov/csep/ca/node11.html
[39] T. Saxena, “Reducing the encoding time of H.264 Baseline profile using parallel
programming”, M.S. Thesis, E.E. Dept, UT Arlington, 2012.
[40] C-Y. Lee,Y-C. Lin, C-L. Wu, C-H. Chang,Y-M. Tsao and S-Y. Chien, “Multi-pass and frame
parallel algorithms of motion estimation of H.264/AVC for generic GPU”, IEEE International
Conference on Multimedia and Expo, pp. 1603-1606, 2010
[41] L. Chan, J. W.Lee, A. Rothberg and P. Weaver, “Parallelizing H.264 motion estimation
algorithm using CUDA”, Proc. of Independent Activities Period (IAP) , MIT, 2009
[42] R. Rodriguez, J.L. Martinez, G. Fernandez-Escribano, J.M. Claver and J.L. Sanchez, “
Accelerating H.264 inter prediction in GPU by using CUDA”, International Conference on
Consumer Electronics, pp. 463-464, 2010
[43] “Amdahl`s Law”, http://www.futurechips.org/thoughts-for-researchers/parallelprogramming-gene-amdahl-said.html
[44] N-M.Cheung, X.Fan, O.C.Au and M-C. Kung, “ Video coding on multicore graphics
processors”, IEEE Signal Processing Magazine, vol. 27, pp. 79-89, 2010.
[45] Finding application bottlenecks with Visual Studio Profiler http://msdn.microsoft.com/enus/magazine/cc337887.aspx
[46] B.Bross, W.-J. Han, J.-R. Ohm, G.J. Sullivan, T. Wiegand, “ High efficiency video coding
(HEVC) text specification draft 8”, JCT-VC Document, JCTVC-J1003, Stockholm, Sweden, July
2012.
Appendix
CUDA Memory Model [22]
Specifications of the GPU Hardware
used in this thesis
SSIM [10]
Techniques such as MSE or PSNR estimate perceived errors. SSIM, on the other hand, considers image degradation as a perceived change in structural information. Structural information is the idea that pixels have strong inter-dependencies, especially when they are spatially close. These dependencies carry important information about the structure of the objects in the visual scene.
The SSIM metric is calculated on various windows of an image. The measure between two windows x and y of common size N×N is:
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$
with
μx, the average of x;
μy, the average of y;
σx², the variance of x;
σy², the variance of y;
σxy, the covariance of x and y;
C1 and C2, two variables to stabilize the division with a weak denominator.
To evaluate image quality, this formula is applied only on luma. The resultant SSIM index is a decimal value between -1 and 1, and the value 1 is only reachable in the case of two identical sets of data. Typically it is calculated on window sizes of 8×8. The window can be displaced pixel-by-pixel on the image, but the authors propose to use only a subgroup of the possible windows to reduce the complexity of the calculation.
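A minimal sketch of the per-window SSIM computation is shown below. It assumes 8-bit luma (dynamic range 255), the commonly used constants C1 = (0.01 * 255)^2 and C2 = (0.03 * 255)^2, and population (divide-by-N) statistics as a simplification; windows are passed as flat lists of samples, and the function name is illustrative.

```python
def ssim_window(x, y, dynamic_range=255, k1=0.01, k2=0.03):
    """SSIM of two equally sized windows given as flat lists of luma samples."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (k1 * dynamic_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * dynamic_range) ** 2  # stabilizes the contrast/structure term
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

Identical windows give a value of 1, while a window compared with its inverted copy gives a negative value, which illustrates the -1 to 1 range. A frame-level score is typically the mean of the window scores.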