STREAMING DAY 2010 UDINE Context-Aware Fast 3D DCT/IDCT Algorithm for Low-power Video Codec in Mobile Embedded Systems Sergio Saponara, Luca Fanucci University of Pisa, Italy Contact: sergio.saponara@iet.unipi.it STDAY2010, Udine, Sept. 2010 Outline Application of Multidimensional DCT in video coding Fast algorithm for 3D DCT • Fast techniques based on radix-factorization • Fast techniques based on context-aware processing • Algorithmic results VLSI Architectures for 3D DCT CMOS implementation results Conclusion STDAY2010, Udine, Sept. 2010 Application of Multidimensional DCT in video coding Fast algorithm for 3D DCT • Fast techniques based on radix-factorization • Fast techniques based on context-aware processing • Algorithmic results VLSI Architectures for 3D DCT CMOS implementation results Conclusion STDAY2010, Udine, Sept. 2010 2D DCT for video coding 2D DCT allows for the reduction of spatial data redundancy - Conventional algorithm adopted in H.26X (by ITU-T) and MPEGX (by ISO/IEC) video CoDec (encoder/decoder) - 2D DCT is applied to image blocks of NxN pixels (usually N=8) Core of motioncompensated H.26x Encoder STDAY2010, Udine, Sept. 2010 3D DCT for video coding (1/2) 3D DCT extends the spatial compression properties to time With respect to H.26x/MPEGx CoDecs 3D DCT offers: - Lower cost: 3D DCT (spatio-temporal compression) instead of 2D DCT (for spatial compression) plus motion estimation (for temporal compression) - Symmetric complexity of decoder and encoder much lower than motion-compensated H.26x/MPEGx encoders - Optimal solution for applications requiring real-time coding/decoding in the same terminal: interactive TV and web services, video telephony, video conferencing, face recognition - Same coding efficiency for slow motion videos or small/medium image formats; higher error-resilience STDAY2010, Udine, Sept. 2010 3D DCT for video coding (2/2) 3D DCT is applied to cube of NxNxN pixels (usually N=8) As in H.26X/MPEGx each frame of a video is divided in blocks 1 Cube NxNxN = image blocks of NxN pixels belonging to N consecutive frames Video sequence 3D DCT entropy coding quantize bit stream channel Decoded video 3D IDCT dequantize entropy decoding STDAY2010, Udine, Sept. 2010 Application of Multidimensional DCT in video coding Fast algorithm for 3D DCT • Fast techniques based on radix-factorization • Fast techniques based on context-aware processing • Algorithmic results VLSI Architectures for 3D DCT CMOS implementation results Conclusion STDAY2010, Udine, Sept. 2010 3D-DCT radix-factorization (1/2) Equation of a N3-point 3D DCT Z (l , m, n) 8 N3 N 1 N 1 N 1 x(i, j, k )Ci,l C j ,m C k ,n C p,0 k 0 j 0 i 0 (2 p 1)q , C p , q cos 2N 2 1 A direct implementation of the equation requires N3 multiplications and additions (MAC) The N3-point 3D DCT is implemented by 3 N-point 1D DCT plus proper transposition matrixes Complexity of 3N MAC Memory cost: T1 of N2 words plus T2 of N3 words STDAY2010, Udine, Sept. 2010 3D-DCT radix-factorization (2/2) Blocks 0,..,N-1 1D DCT T1 1D DCT T2 1D DCT Each N-point 1D-DCT is factorized in simpler radix-2 butterflies STDAY2010, Udine, Sept. 2010 3D-DCT/IDCT data correlation Switching bits between consecutive input samples % Occurence 40 Miss America Akiyo Coastguard Foreman 30 20 10 0 0 1 2 3 bits 4 5 6 7 With MissAmerica, Akiyo, Foreman, Coastguard up to 60-70% of the rows are null in IDCT mode Distribution of the amplitude of AC coefficients for Foreman vs. the coefficient number (1 to 512 in the 8x8x8 cube) STDAY2010, Udine, Sept. 2010 8 Context-aware 3D-DCT Insert before a 1D stage a pre-processor that for each row Xi of N samples: - analyzes the statistics of the DCT/IDCT input samples in each computing N 1 stage 1 N 1 A x N i 0 i SAD ( xi xi 1 ) i 1 -based on heuristic rules decides if the DCT/IDCT computation can be avoided - If A = 0 and SAD = 0 or If A ≠ 0 and SAD<TH1 the transform result is forced to zero In these cases the transform result is estimated to have a small residual energy and most likely would be cancelled by the quantizer STDAY2010, Udine, Sept. 2010 % Computation saving 60 MissAmerica 50 Akiyo Coastguard Foreman 40 30 20 10 0 Context-aware vs. classic 3D DCT/IDCT STDAY2010, Udine, Sept. 2010 Rate-distortion performance of context-aware fast 3D DCT PSNR (dB) 49 46 43 Baseline 3D DCT/IDCT Codec 40 Fast Context-Aware 3D DCT/IDCT Codec 37 0 100 200 300 bit-rate (kbits/s) 400 500 Rate-distortion curve for Akiyo PSNR variation at fixed bit-rate: context-aware vs. classic 3D DCT STDAY2010, Udine, Sept. 2010 Application of Multidimensional DCT in video coding Fast algorithm for 3D DCT • Fast techniques based on radix-factorization • Fast techniques based on context-aware processing • Algorithmic results VLSI Architectures for 3D DCT CMOS implementation results Conclusion STDAY2010, Udine, Sept. 2010 Why VLSI HW design? A SW optimized design of a 3D DCT/IDCT reaches real-time time VGA 24 Hz on Intel Core 2 6300@1.86 GHz [T. Fryza et al.] The Core 2 6300 processor, in 65 nm CMOS, integrates two cores and up to 4 MB of L2 cache. The die size is 143 mm2 for 290 M transistors; at 1.86 GHz the power consumption is up to 65 W For battery-powered terminals VLSI HW design is needed STDAY2010, Udine, Sept. 2010 1D-DCT circuit engine RAC RAC r r RAC n+r r+1 2-1 Distribuited Arithmetic RAC (ROM+Accumulator) instead of a (Multiplier + Accumulator) n 1 K n 1 K j j z ci xi ci xi 0 xij 2 ci xi 0 ci xij 2 i 1 i 1 j 1 i 1 j 1 i 1 K K STDAY2010, Udine, Sept. 2010 Rounding RAC ROM 16 words Accumulator RAC 4 Sum/Subtract RAC Parallel Input Serial Output RAC Sum/Subtract (IDCT) RAC Sum/Subtract (DCT) Bit Extraction Serial Input Parallel Output RAC m 3D-DCT architectures: schemes B lock 0 1D DCT/IDCT T1 1D DCT/IDCT R0 B lock 1 1D DCT/IDCT T1 1D DCT/IDCT R1 1D DCT/IDCT B lock N-1 1D DCT/IDCT 1D DCT/IDCT T1 RN-1 Stage 1 3D architectures with different degrees of parallelism and power vs. area trade-offs FULL PARALLEL (PA) Stage 2 Blocks 0,..,N-1 1 D DCT Blocks 0, .., N-1 T1 1 D DCT T2 1 D DCT CASCADE (CS) T2 1D ITERATIVE (IT) DCT/IDCT STDAY2010, Udine, Sept. 2010 3D-DCT architectures: performance and complexity STDAY2010, Udine, Sept. 2010 Application of Multidimensional DCT in video coding Fast algorithm for 3D DCT • Fast techniques based on radix-factorization • Fast techniques based on context-aware processing • Algorithmic results VLSI Architectures for 3D DCT CMOS implementation results Conclusion STDAY2010, Udine, Sept. 2010 CMOS implementation results 0.18 m, 1.6 V, 6 metal levels standard-cell 250 QCIF CIF 4 CIF 16 CIF 200 FULL PARALLEL CASCADE 150 ITERATIVE 100 50 0 0 15 30 45 60 75 Circuit complexity (Kgates) vs. Power consumption (mW) Dotted lines refer to the elaboration of the same video formats STDAY2010, Udine, Sept. 2010 Power consumption with contextaware saving Power consumption of the CS architecture Power consumption of the PA architecture STDAY2010, Udine, Sept. 2010 Conclusions 3D-DCT/IDCT is a promising solution for real-time, low-power, low-complexity, Implementation of video encoders and decoders in battery powered terminals STDAY2010, Udine, Sept. 2010 Thanks for your attention!!! STDAY2010, Udine, Sept. 2010