Complexity Model Based Load-balancing Algorithm For Parallel
Tools Of HEVC
Yong-Jo Ahn, Tae-Jin Hwang, Dong-Gyu Sim,
and Woo-Jin Han
2013 IEEE International Conference on Visual Communications and Image
Processing (VCIP)
Outline
• Introduction
• Related Work
• Proposed Method
• Experimental Results
• Conclusion
Introduction
• Demand for new video coding standards has
been increasing due to the recent expansion of
digital broadcasting services and the advent of
various multimedia devices.
• Newly supported coding tools bring not only
high coding efficiency but also high
computational complexity, which stems from the
decision process over the diverse modes.
Cont.
• Studies on parallel processing methods, as well
as on fast mode decision algorithms, are in
progress and are considered key parts of a
fast HEVC encoder.
• This paper introduces parallel processing methods
using the slice and tile tools supported by HEVC
and proposes a load-balancing algorithm
that enhances slice- and tile-level parallel
processing.
Related Work
• A few parallel tools are adopted in the HEVC
main profile and key tools for parallel
processing are tile [5] and wave-front parallel
processing (WPP) [6].
• Parallel method
– Tile
– Entropy slice
– WPP (wavefront parallel processing)
• [5] A. Fuldseth, M. Horowitz, S. Xu, A. Segall, and M. Zhou, "Tiles," ITU-T/ISO/IEC JCT-VC doc., JCTVC-E196, Mar. 2011.
• [6] F. Henry and S. Pateux, "Wavefront parallel processing," ITU-T/ISO/IEC JCT-VC doc., JCTVC-E196, Mar. 2011.
Cont.
[Figure: (a) Tile, (b) Entropy slice, (c) WPP]
Cont.
• To select suitable parallel options, several
factors such as encoding time saving, coding
efficiency decrease, and extensibility for the
number of processing cores should be
considered.
• Coding efficiency decrease is also one of the
most important factors in adopting parallel
processing.
Cont.
• Data-level parallelism can be applied at the
frame, slice, tile, or coding unit level,
depending on the parallelization method.
• The number of non-referenced B frames in IBBP
coding structures significantly impacts
coding efficiency and restricts the extensibility of
processing cores.
Cont.
• With WPP, the extensibility of the number of
processing cores is the highest and the coding
efficiency loss is the smallest.
• However, it is hard to expect a large encoding
time saving with WPP because of its restrictive
data dependencies.
• Generally, increasing the number of slices and
tiles affects the bitrate considerably for
low-resolution sequences, but has little
influence on the bitrate for high-resolution
sequences.
Proposed Method
• To reduce the high computational complexity of
the HEVC encoder, various technical contributions
on early termination methods and fast mode
decision algorithms have been adopted in the
reference software [7][8].
• However, it is not easy to achieve a real-time
encoder with the fast algorithms alone.
• The computational load should be balanced
among cores.
• [7] R. H. Gweon, Y.-L. Lee, and J. Lim, "Early termination of CU encoding to reduce HEVC complexity," ITU-T/ISO/IEC JCT-VC doc., JCTVC-F045, July 2011.
• [8] K. Choi and E. S. Jang, "Coding tree pruning based CU early termination," ITU-T/ISO/IEC JCT-VC doc., JCTVC-F092, July 2011.
Complexity Model For HEVC Encoder
• For the slice and tile tools, the number of CTUs
in each slice or tile should be determined before
actual encoding, using complexity prediction.
𝐢𝐢𝑖 𝑙 =
𝐢𝐸𝑀(𝑠, π‘š) × πΆπ»πΎ(𝑠, π‘š|𝑙)
π‘ πœ–π‘† π‘šπœ–π‘€
𝑆 = 𝑠 64 × 64, 32 × 32, 16 × 16, 8 × 8}
𝑀 = {m | MERGE, INTER, INTRA}
𝐢𝐻𝐾 𝑠, π‘š 𝑙 =
1, 𝑖𝑓 𝑠𝑒𝑙𝑒𝑐𝑑𝑒𝑑(𝑠, π‘š|𝑙)
0, π‘œπ‘‘β„Žπ‘’π‘Ÿπ‘€π‘–π‘ π‘’
(1)
11
Cont.
𝑅 𝑠, π‘š = π‘Ÿ(𝑠, π‘š) × 2𝑀(πΆπ‘‡π‘ˆ)/𝑀(𝑠)
𝐢𝐸𝑀 𝑠, π‘š = 𝑅(𝑠, π‘š) × π‘πΉ
• R(s, m) : complexity per unit.
• r(s, m) : complexity ratio of each CU size and mode.
• w(s) : width of CU size.
• NF : a normalization factor for fixed-point operation.
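As a rough illustration of equation (1), the complexity of one CTU can be accumulated from the CU size and mode pairs selected inside it. The Python sketch below is not from the slides: the `CEM` table values and the function name `ctu_complexity` are hypothetical placeholders. Because CHK(s, m | l) is a 0/1 indicator, each (s, m) pair contributes at most once per CTU.

```python
# Hypothetical CEM(s, m) table: fixed-point complexity per CU width and mode.
# The numbers are illustrative placeholders, not values from the paper.
CEM = {
    (64, "MERGE"): 2,  (64, "INTER"): 20, (64, "INTRA"): 12,
    (32, "MERGE"): 3,  (32, "INTER"): 26, (32, "INTRA"): 15,
    (16, "MERGE"): 5,  (16, "INTER"): 34, (16, "INTRA"): 19,
    (8,  "MERGE"): 8,  (8,  "INTER"): 45, (8,  "INTRA"): 25,
}


def ctu_complexity(selected):
    """Return CC(l) for one CTU.

    `selected` lists the (cu_width, mode) pairs chosen inside the CTU.
    CHK(s, m | l) is 1 when the pair was selected at least once, so
    duplicates are collapsed with set() before summing CEM(s, m).
    """
    return sum(CEM[pair] for pair in set(selected))


if __name__ == "__main__":
    # A CTU coded with two 32x32 INTER CUs and some 16x16 INTRA CUs.
    print(ctu_complexity([(32, "INTER"), (32, "INTER"), (16, "INTRA")]))
```

Summing these per-CTU values over the CTUs of a slice gives a per-slice complexity that a load balancer can compare across slices.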
Cont.
• The proposed complexity model for the HEVC
encoder is evaluated using the Pearson
product-moment correlation coefficient on the
HEVC common test sequences under the HEVC
common test conditions.
Cont.
• The Pearson product-moment correlation
coefficient is a measure of the linear
correlation between two variables X and Y,
giving a value between +1 and −1 inclusive,
where 1 is total positive correlation, 0 is no
correlation, and −1 is total negative
correlation.
$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$
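The coefficient above can be computed directly from paired samples, for example predicted complexity against measured encoding time. The following Python sketch only illustrates the formula; the function name `pearson` and the sample data are assumptions, not part of the original evaluation.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/n factors of the covariance and of the standard deviations
    # cancel, so plain sums of products are sufficient.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)


if __name__ == "__main__":
    predicted = [12.0, 30.0, 45.0, 51.0]   # hypothetical model outputs
    measured = [1.1, 2.9, 4.2, 5.3]        # hypothetical encoding times
    print(pearson(predicted, measured))
```

A value close to +1 would indicate that the predicted complexity tracks the measured encoding time linearly.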
Complexity Model Based Load-balancing
Algorithm For Parallel Tools Of HEVC
• Number of CTUs for each temporal level slice
$$L_i^j(k) = \frac{\mathit{CTUinFrame}}{N} + \mathit{offset}_i^j(k) \qquad (2)$$
$$\mathit{offset}_i^j(k) = \mathit{offset}_{i-1}^j(k) + \left( \frac{1}{N} - \frac{SC_{i-1}^j(k)}{\sum_{n=0}^{N-1} SC_{i-1}^j(n)} \right) \times \mathit{CTUinFrame} \qquad (3)$$
$$SC_i(k) = \sum_{l=0}^{L(k)-1} CC_i(l)$$
• L(k) : the number of CTUs assigned to k-th slice.
• i : frame index.
• j : temporal layer id.
• k : slice number.
• N : the number of slices in a frame.
• CTUinFrame : the number of CTUs in the frame.
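The slice-level update can be sketched in a few lines of Python. This is a minimal illustration of equations (2) and (3) under stated assumptions: the function names, the frame size of 400 CTUs, and the complexity values are hypothetical, and the results are left as real numbers because the slides do not say how they are rounded to whole CTUs.

```python
def update_slice_offsets(prev_offsets, prev_complexity, ctu_in_frame):
    """Equation (3): new offset per slice from the previous frame's data.

    prev_offsets    -- offsets used for the previous frame of this temporal layer
    prev_complexity -- SC(k) of each slice in that previous frame
    """
    n = len(prev_offsets)
    total = sum(prev_complexity)
    return [
        off + (1.0 / n - sc / total) * ctu_in_frame
        for off, sc in zip(prev_offsets, prev_complexity)
    ]


def slice_sizes(offsets, ctu_in_frame):
    """Equation (2): CTUs per slice = uniform share plus offset."""
    n = len(offsets)
    return [ctu_in_frame / n + off for off in offsets]


if __name__ == "__main__":
    ctu_in_frame = 400                      # hypothetical frame size in CTUs
    offsets = [0.0, 0.0, 0.0, 0.0]          # start from a uniform split
    complexity = [40.0, 20.0, 20.0, 20.0]   # slice 0 was twice as costly
    offsets = update_slice_offsets(offsets, complexity, ctu_in_frame)
    print(slice_sizes(offsets, ctu_in_frame))
```

The per-slice corrections sum to zero, so the slice sizes still add up to CTUinFrame. A slice that consumed more than its 1/N share of the complexity receives fewer CTUs in the next frame of the same temporal layer.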
Cont.
• For tile-level parallel processing, the number of CTUs assigned to each tile in a
temporal layer is adjusted with column and row offsets for load balancing.
$$L_i^j(k) = \left( \frac{\mathit{CTUinWidth}}{N_{Width}} + WO_i^j(k) \right) \times \left( \frac{\mathit{CTUinHeight}}{N_{Height}} + HO_i^j(k) \right) \qquad (4)$$
$$WO_i^j(k) = WO_{i-1}^j(k) + \left( \frac{1}{N_{Width}} - \frac{WLC_{i-1}^j(k)}{\sum_{n=0}^{N_{Width}-1} WLC_{i-1}^j(n)} \right) \times \mathit{CTUinWidth} \qquad (5)$$
$$HO_i^j(k) = HO_{i-1}^j(k) + \left( \frac{1}{N_{Height}} - \frac{HLC_{i-1}^j(k)}{\sum_{n=0}^{N_{Height}-1} HLC_{i-1}^j(n)} \right) \times \mathit{CTUinHeight} \qquad (6)$$
• L(k) : the number of CTUs assigned to k-th tile.
• i : frame index.
• j : temporal layer id.
• k : tile number.
• N_Width and N_Height : number of tiles composing a frame in the horizontal and vertical
directions.
• CTUinWidth and CTUinHeight : number of CTUs in a frame in the horizontal and vertical
directions.
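A minimal Python sketch of the tile-level update in equations (4) to (6) follows. It assumes that WLC and HLC are the complexities accumulated per tile column and per tile row in the previous frame (the slides do not define them), and that each tile is addressed by a (row, column) pair. All names and numbers are hypothetical, and rounding to whole CTUs is left out.

```python
def update_offsets(prev_offsets, prev_load, extent):
    """Equations (5) and (6): the same update for columns or rows.

    prev_load -- assumed per-column (WLC) or per-row (HLC) complexity
    extent    -- CTUinWidth for columns, CTUinHeight for rows
    """
    n = len(prev_offsets)
    total = sum(prev_load)
    return [
        off + (1.0 / n - load / total) * extent
        for off, load in zip(prev_offsets, prev_load)
    ]


def tile_sizes(width_offsets, height_offsets, ctu_in_width, ctu_in_height):
    """Equation (4): CTUs per tile as column width times row height."""
    widths = [ctu_in_width / len(width_offsets) + o for o in width_offsets]
    heights = [ctu_in_height / len(height_offsets) + o for o in height_offsets]
    return [[w * h for w in widths] for h in heights]


if __name__ == "__main__":
    ctu_w, ctu_h = 30, 17                                 # hypothetical frame, in CTUs
    wo = update_offsets([0.0, 0.0], [60.0, 40.0], ctu_w)  # left column was costlier
    ho = update_offsets([0.0, 0.0], [50.0, 50.0], ctu_h)  # rows were balanced
    for row in tile_sizes(wo, ho, ctu_w, ctu_h):
        print(row)
```

Because a column offset changes the width of every tile in that column, one adjustment moves the load of several tiles at once.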
Cont.
• Balancing complexity is harder for tile-level
parallelism than for slice-level parallelism
because the size of a tile is determined only
by the tile width and height, not by the CTU
offset used in load balancing for slice-level
parallelism.
Experimental Results
• The HM 11.0 reference software was used.
• The evaluation ran on a PC with an Intel® Core™ i7-3930K CPU and
16 GB of memory, using the Intel® C++ 64-bit compiler XE 13.0 on the
Windows 7 64-bit operating system.
• Each frame is partitioned into four slices or tiles for fair
evaluation.
• Two fast encoding algorithms adopted in HM, CFM [7] and ECU [8],
are employed to evaluate the proposed load-balanced parallelization.
• [7] R. H. Gweon, Y.-L. Lee, and J. Lim, "Early termination of CU encoding to reduce HEVC complexity," ITU-T/ISO/IEC JCT-VC doc., JCTVC-F045, July 2011.
• [8] K. Choi and E. S. Jang, "Coding tree pruning based CU early termination," ITU-T/ISO/IEC JCT-VC doc., JCTVC-F092, July 2011.
Conclusion
• To maximize the encoding time gain of parallel
processing for the HEVC encoder, load-balancing
algorithms based on a complexity prediction model
are proposed.
• Slice-level parallel processing achieves an average
ATS gain of 12.05% by adaptively adjusting the
number of CTUs. The average ATS gain of tile-level
parallel processing is 3.81%.
• The ATS gain obtained by the load-balancing algorithm is
higher in slice-level than in tile-level parallelism.