Complexity Model Based Load-balancing Algorithm For Parallel Tools Of HEVC
Yong-Jo Ahn, Tae-Jin Hwang, Dong-Gyu Sim, and Woo-Jin Han
2013 IEEE International Conference on Visual Communications and Image Processing (VCIP)

Outline
• Introduction
• Related Work
• Proposed Method
• Experimental Results
• Conclusion

Introduction
• Demand for new video coding standards has been increasing due to the recent expansion of digital broadcasting services and the advent of various multimedia devices.
• Newly supported coding tools bring not only high coding efficiency but also high computational complexity, caused by the decision process over the diverse modes.
• Parallel processing methods and fast mode decision algorithms are regarded as key parts of ongoing work on fast HEVC encoders.
• This paper introduces parallel processing methods using the slice and tile tools supported by HEVC and proposes a load-balancing algorithm that enhances slice and tile parallel processing.

Related Work
• A few parallel tools are adopted in the HEVC main profile; the key tools for parallel processing are tiles [5] and wavefront parallel processing (WPP) [6].
• Parallel methods
  – Tile
  – Entropy slice
  – WPP (wavefront parallel processing)
[5] A. Fuldseth, M. Horowitz, S. Xu, A. Segall, and M. Zhou, "Tiles," ITU-T/ISO/IEC JCT-VC doc., JCTVC-E196, Mar. 2011.
[6] F. Henry and S. Pateux, "Wavefront parallel processing," ITU-T/ISO/IEC JCT-VC doc., JCTVC-E196, Mar. 2011.
[Figure: (a) Tile, (b) Entropy slice, (c) WPP]
• To select suitable parallel options, several factors should be considered: encoding time saving, coding efficiency loss, and extensibility in the number of processing cores.
• Coding efficiency loss is also one of the most important factors in adopting parallel processing.
• Data-level parallelism can be applied at the frame, slice, tile, or coding-unit level, depending on the parallelization method.
• The number of non-referenced B frames in IBBP coding structures significantly affects coding efficiency and restricts extensibility in the number of processing cores.
• With WPP, extensibility in the number of processing cores is the highest and the coding efficiency loss is the smallest.
• However, a large encoding time saving is hard to expect with WPP because of the restrictions imposed by its data dependencies.
• Generally, increasing the number of slices and tiles affects the bitrate considerably for low-resolution sequences but has little influence on the bitrate for high-resolution sequences.

Proposed Method
• To reduce the high computational complexity of the HEVC encoder, various contributions on early termination methods and fast mode decision algorithms have been adopted in the reference software [7][8].
• However, it is not easy to achieve a real-time encoder with the fast algorithms alone.
• Computational load should be balanced among cores.
[7] R. H. Gweon, Y.-L. Lee, and J. Lim, "Early termination of CU encoding to reduce HEVC complexity," ITU-T/ISO/IEC JCT-VC doc., JCTVC-F045, July 2011.
[8] K. Choi and E. S. Jang, "Coding tree pruning based CU early termination," ITU-T/ISO/IEC JCT-VC doc., JCTVC-F092, July 2011.

Complexity Model For HEVC Encoder
• For the slice and tile tools, the number of CTUs per slice or tile should be determined before actual encoding, using complexity prediction.

$$CC(i) = \sum_{s \in S} \sum_{m \in M} CER(s, m) \times CHK(s, m \mid i) \tag{1}$$

$$S = \{\, s \mid 64 \times 64,\ 32 \times 32,\ 16 \times 16,\ 8 \times 8 \,\}, \qquad M = \{\, m \mid \text{MERGE},\ \text{INTER},\ \text{INTRA} \,\}$$

$$CHK(s, m \mid i) = \begin{cases} 1, & \text{if } Selected(s, m \mid i) \\ 0, & \text{otherwise} \end{cases}$$

$$R(s, m) = r(s, m) \times 2^{w(CTU)/w(s)}$$

$$CER(s, m) = R(s, m) \times NF$$

• R(s, m) : complexity per unit.
• r(s, m) : complexity ratio of each CU size and mode.
• w(s) : width of CU size s.
• NF : a normalization factor for fixed-point operation.
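The double sum in Eq. (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the index i is taken to identify a CTU, the `CER` table holds made-up fixed-point values (the slides do not list any), and `ctu_complexity` and its `selected` argument are hypothetical names, not part of HM.

```python
from typing import Dict, Set, Tuple

CU_SIZES = (64, 32, 16, 8)              # the set S (CU widths)
MODES = ("MERGE", "INTER", "INTRA")     # the set M

# Hypothetical CER(s, m) values in fixed-point units.
# The slides give no numbers; these exist only to make the sketch runnable.
CER: Dict[Tuple[int, str], int] = {
    (64, "MERGE"): 1, (64, "INTER"): 4,  (64, "INTRA"): 3,
    (32, "MERGE"): 2, (32, "INTER"): 8,  (32, "INTRA"): 6,
    (16, "MERGE"): 4, (16, "INTER"): 16, (16, "INTRA"): 12,
    (8,  "MERGE"): 8, (8,  "INTER"): 32, (8,  "INTRA"): 24,
}


def ctu_complexity(selected: Set[Tuple[int, str]],
                   cer: Dict[Tuple[int, str], int] = CER) -> int:
    """CC(i): sum of CER(s, m) * CHK(s, m | i) over all sizes and modes.

    `selected` is the set of (size, mode) pairs chosen inside one CTU,
    so CHK is 1 for pairs in the set and 0 otherwise.
    """
    total = 0
    for s in CU_SIZES:
        for m in MODES:
            chk = 1 if (s, m) in selected else 0
            total += cer[(s, m)] * chk
    return total


# A CTU in which 32x32 INTER and 8x8 INTRA CUs were selected.
example = ctu_complexity({(32, "INTER"), (8, "INTRA")})
```

Because CHK is a 0/1 indicator, the model counts each selected (size, mode) pair once per CTU, regardless of how many CUs of that kind the CTU contains.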
• The proposed complexity model for the HEVC encoder is evaluated using the Pearson product-moment correlation on the HEVC common test sequences under the HEVC common test conditions.
• The Pearson product-moment correlation coefficient measures the linear correlation between two variables X and Y. It takes a value between +1 and −1 inclusive, where +1 is total positive correlation, 0 is no correlation, and −1 is total negative correlation.

$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$

Complexity Model Based Load-balancing Algorithm For Parallel Tools Of HEVC
• Number of CTUs for each slice at a given temporal level:

$$L_i^j(k) = \frac{CTUinFrame}{N} + Offset_i^j(k) \tag{2}$$

$$Offset_i^j(k) = Offset_{i-1}^j(k) + \left( \frac{1}{N} - \frac{PC_{i-1}^j(k)}{\sum_{n=0}^{N-1} PC_{i-1}^j(n)} \right) \times CTUinFrame \tag{3}$$

$$PC_i^j(k) = \sum_{c=0}^{L_i^j(k)-1} CC(c)$$

• L(k) : the number of CTUs assigned to the k-th slice.
• i : frame index.
• j : temporal layer id.
• k : slice number.
• N : the number of slices in a frame.
• CTUinFrame : the number of CTUs in the frame.

• CTUs are assigned to each tile of a temporal layer using column and row offsets, which provides load balancing for tile-level parallel processing.

$$L_i^j(k) = \left( \frac{CTUinWidth}{N_{Width}} + WO_i^j(k) \right) \times \left( \frac{CTUinHeight}{N_{Height}} + HO_i^j(k) \right) \tag{4}$$

$$WO_i^j(k) = WO_{i-1}^j(k) + \left( \frac{1}{N_{Width}} - \frac{WLC_{i-1}^j(k)}{\sum_{n=0}^{N_{Width}-1} WLC_{i-1}^j(n)} \right) \times CTUinWidth \tag{5}$$

$$HO_i^j(k) = HO_{i-1}^j(k) + \left( \frac{1}{N_{Height}} - \frac{HLC_{i-1}^j(k)}{\sum_{n=0}^{N_{Height}-1} HLC_{i-1}^j(n)} \right) \times CTUinHeight \tag{6}$$

• L(k) : the number of CTUs assigned to the k-th tile.
• i : frame index.
• j : temporal layer id.
• k : tile number.
• $N_{Width}$ and $N_{Height}$ : number of tiles composing a frame in the horizontal and vertical directions.
• CTUinWidth and CTUinHeight : number of CTUs in the horizontal and vertical directions of the frame.
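Eqs. (3), (5), and (6) share one update rule: each partition's offset moves by the gap between an equal share (1/N) and the share of complexity it consumed in the previous frame of the same temporal layer. The Python sketch below illustrates that rule under stated assumptions. The function names are hypothetical, WLC and HLC are assumed to be per-column and per-row complexity totals, offsets are kept as floats (the slides do not say how they are rounded to whole CTUs), and the zero-complexity guard is an addition for robustness.

```python
from typing import List, Sequence


def update_offsets(prev_offsets: Sequence[float],
                   prev_complexity: Sequence[float],
                   total_units: float) -> List[float]:
    """Offset update in the form shared by Eqs. (3), (5), and (6).

    prev_complexity[k] plays the role of PC (slices) or WLC/HLC (tile
    columns/rows) from the previous frame; total_units is CTUinFrame,
    CTUinWidth, or CTUinHeight.
    """
    n = len(prev_complexity)
    total = sum(prev_complexity)
    if n == 0 or total == 0:
        # Assumption: with no complexity data, keep the previous offsets.
        return list(prev_offsets)
    return [off + (1.0 / n - pc / total) * total_units
            for off, pc in zip(prev_offsets, prev_complexity)]


def slice_ctu_counts(offsets: Sequence[float], ctu_in_frame: float) -> List[float]:
    """Eq. (2): equal share of the frame plus the per-slice offset."""
    n = len(offsets)
    return [ctu_in_frame / n + off for off in offsets]


def tile_ctu_count(wo: float, ho: float,
                   ctu_in_width: float, n_width: int,
                   ctu_in_height: float, n_height: int) -> float:
    """Eq. (4): adjusted tile width times adjusted tile height, in CTUs."""
    return (ctu_in_width / n_width + wo) * (ctu_in_height / n_height + ho)


# Four slices in a 120-CTU frame; slice 0 used 40% of the predicted complexity.
offsets = update_offsets([0.0] * 4, [40.0, 20.0, 20.0, 20.0], total_units=120)
counts = slice_ctu_counts(offsets, ctu_in_frame=120)
```

In this example the slice that accounted for 40% of the previous frame's complexity shrinks from 30 to about 12 CTUs and the other three grow to about 36. The offset changes always sum to zero, so the total number of CTUs in the frame is preserved.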
• Controlling the complexity balance is harder for tile-level parallelism than for slice-level parallelism, because the size of a tile is determined only by the tile width and height, not by a CTU offset as used in load balancing for slice-level parallelism.

Experimental Results
• The HM 11.0 reference software is used.
• A PC equipped with an Intel® Core™ i7-3930K CPU and 16 GB of memory was used for the evaluation, with the Intel® C++ 64-bit compiler XE 13.0 on the Windows 7 64-bit operating system.
• A frame is partitioned into four slices or tiles for fair evaluation.
• Two fast encoding algorithms adopted in HM, CFM [7] and ECU [8], are employed to evaluate the proposed load-balanced parallelization.

Conclusion
• To maximize the encoding time gain of parallel processing in an HEVC encoder, load-balancing algorithms based on a complexity prediction model are proposed.
• Slice-level parallel processing achieves an average ATS gain of 12.05% by adaptively adjusting the number of CTUs. Tile-level parallel processing achieves an average ATS gain of 3.81%.
• The ATS gain obtained by the load-balancing algorithm is higher for slice-level than for tile-level parallelism.