3D DCT for video coding

advertisement
STREAMING DAY 2010
UDINE
Context-Aware Fast 3D DCT/IDCT
Algorithm for Low-power Video
Codec in Mobile Embedded
Systems
Sergio Saponara, Luca Fanucci
University of Pisa, Italy
Contact: sergio.saponara@iet.unipi.it
STDAY2010, Udine, Sept. 2010
Outline
 Application of Multidimensional DCT in video coding
 Fast algorithm for 3D DCT
• Fast techniques based on radix-factorization
• Fast techniques based on context-aware processing
• Algorithmic results
 VLSI Architectures for 3D DCT
 CMOS implementation results
 Conclusion
STDAY2010, Udine, Sept. 2010
 Application of Multidimensional DCT in video coding
 Fast algorithm for 3D DCT
• Fast techniques based on radix-factorization
• Fast techniques based on context-aware processing
• Algorithmic results
 VLSI Architectures for 3D DCT
 CMOS implementation results
 Conclusion
STDAY2010, Udine, Sept. 2010
2D DCT for video coding
2D DCT allows for the reduction of spatial data redundancy
- Conventional algorithm adopted in H.26X (by ITU-T) and
MPEGX (by ISO/IEC) video CoDec (encoder/decoder)
- 2D DCT is applied to image blocks of NxN pixels (usually N=8)
Core of motioncompensated H.26x
Encoder
STDAY2010, Udine, Sept. 2010
3D DCT for video coding (1/2)
3D DCT extends the spatial compression properties to time
With respect to H.26x/MPEGx CoDecs 3D DCT offers:
-
Lower cost: 3D DCT (spatio-temporal compression) instead of
2D DCT (for spatial compression) plus motion estimation (for
temporal compression)
-
Symmetric complexity of decoder and encoder much lower than
motion-compensated H.26x/MPEGx encoders
-
Optimal solution for applications requiring real-time coding/decoding
in the same terminal: interactive TV and web services, video
telephony, video conferencing, face recognition
-
Same coding efficiency for slow motion videos or small/medium
image formats; higher error-resilience
STDAY2010, Udine, Sept. 2010
3D DCT for video coding (2/2)
3D DCT is applied to cube of NxNxN pixels (usually N=8)
As in H.26X/MPEGx each frame of a video is divided in blocks
1 Cube NxNxN = image blocks of NxN pixels belonging to N
consecutive frames
Video sequence
3D DCT
entropy
coding
quantize
bit stream
channel
Decoded video
3D IDCT
dequantize
entropy
decoding
STDAY2010, Udine, Sept. 2010
 Application of Multidimensional DCT in video coding
 Fast algorithm for 3D DCT
• Fast techniques based on radix-factorization
• Fast techniques based on context-aware processing
• Algorithmic results
 VLSI Architectures for 3D DCT
 CMOS implementation results
 Conclusion
STDAY2010, Udine, Sept. 2010
3D-DCT radix-factorization (1/2)
 Equation of a N3-point 3D DCT
Z (l , m, n) 
8
N3
N 1 N 1 N 1
 x(i, j, k )Ci,l C j ,m C k ,n
C p,0 
k 0 j 0 i 0
 (2 p  1)q 
, C p , q  cos

2N
2


1
 A direct implementation of the equation requires N3
multiplications and additions (MAC)
 The N3-point 3D DCT is implemented by 3 N-point 1D
DCT plus proper transposition matrixes
 Complexity of 3N MAC
 Memory cost: T1 of N2 words plus T2 of N3 words
STDAY2010, Udine, Sept. 2010
3D-DCT radix-factorization (2/2)
Blocks 0,..,N-1
1D DCT
T1
1D DCT
T2
1D DCT
 Each N-point 1D-DCT is factorized in simpler radix-2 butterflies
STDAY2010, Udine, Sept. 2010
3D-DCT/IDCT data correlation
Switching bits between
consecutive input samples
% Occurence
40
Miss America
Akiyo
Coastguard
Foreman
30
20
10
0
0
1
2
3 bits 4
5
6
7
With MissAmerica, Akiyo, Foreman, Coastguard
up to 60-70% of the rows are null in IDCT mode
Distribution of the amplitude of AC
coefficients for Foreman
vs. the coefficient number
(1 to 512 in the 8x8x8 cube)
STDAY2010, Udine, Sept. 2010
8
Context-aware 3D-DCT
Insert before a 1D stage a pre-processor that for each row Xi of N samples:
- analyzes the statistics of the DCT/IDCT input samples in each computing
N 1
stage
1 N 1
A
x

N
i 0
i
SAD   ( xi  xi 1 )
i 1
-based on heuristic rules decides if the DCT/IDCT computation can be
avoided
- If A = 0 and SAD = 0 or If A ≠ 0 and SAD<TH1 the transform result is
forced to zero
In these cases the transform result is estimated to have a small residual
energy and most likely would be cancelled by the quantizer
STDAY2010, Udine, Sept. 2010
% Computation saving
60
MissAmerica
50
Akiyo
Coastguard
Foreman
40
30
20
10
0
Context-aware vs. classic 3D DCT/IDCT
STDAY2010, Udine, Sept. 2010
Rate-distortion performance of
context-aware fast 3D DCT
PSNR (dB)
49
46
43
Baseline 3D DCT/IDCT
Codec
40
Fast Context-Aware 3D
DCT/IDCT Codec
37
0
100
200
300
bit-rate (kbits/s)
400
500
Rate-distortion curve for Akiyo
PSNR variation at fixed bit-rate: context-aware vs. classic 3D DCT
STDAY2010, Udine, Sept. 2010
 Application of Multidimensional DCT in video coding
 Fast algorithm for 3D DCT
• Fast techniques based on radix-factorization
• Fast techniques based on context-aware processing
• Algorithmic results
 VLSI Architectures for 3D DCT
 CMOS implementation results
 Conclusion
STDAY2010, Udine, Sept. 2010
Why VLSI HW design?
A SW optimized design of a 3D DCT/IDCT reaches real-time
time VGA 24 Hz on Intel Core 2 6300@1.86 GHz
[T. Fryza et al.]
The Core 2 6300 processor, in 65 nm CMOS, integrates two
cores and up to 4 MB of L2 cache. The die size is 143 mm2
for 290 M transistors; at 1.86 GHz the power consumption is
up to 65 W
For battery-powered terminals VLSI HW design is needed
STDAY2010, Udine, Sept. 2010
1D-DCT circuit engine
RAC
RAC
r
r
RAC
n+r
r+1
2-1
Distribuited Arithmetic
RAC (ROM+Accumulator) instead of a (Multiplier + Accumulator)
n 1
K
n 1


K
 j
j


z   ci xi   ci   xi 0   xij 2    ci xi 0    ci xij 2
i 1
i 1
j 1
i 1
j 1  i 1



K
K
STDAY2010, Udine, Sept. 2010
Rounding
RAC
ROM
16
words
Accumulator
RAC
4
Sum/Subtract
RAC
Parallel Input Serial Output
RAC
Sum/Subtract (IDCT)
RAC
Sum/Subtract (DCT)
Bit Extraction
Serial Input Parallel Output
RAC
m
3D-DCT architectures: schemes
B lock
0
1D
DCT/IDCT
T1
1D
DCT/IDCT
R0
B lock
1
1D
DCT/IDCT
T1
1D
DCT/IDCT
R1
1D
DCT/IDCT
B lock
N-1
1D
DCT/IDCT
1D
DCT/IDCT
T1
RN-1
Stage 1
3D architectures with
different degrees of
parallelism and power vs.
area trade-offs
FULL PARALLEL (PA)
Stage 2
Blocks 0,..,N-1
1 D DCT
Blocks 0, .., N-1
T1
1 D DCT
T2
1 D DCT
CASCADE (CS)
T2
1D
ITERATIVE (IT)
DCT/IDCT
STDAY2010, Udine, Sept. 2010
3D-DCT architectures:
performance and complexity
STDAY2010, Udine, Sept. 2010
 Application of Multidimensional DCT in video coding
 Fast algorithm for 3D DCT
• Fast techniques based on radix-factorization
• Fast techniques based on context-aware processing
• Algorithmic results
 VLSI Architectures for 3D DCT
 CMOS implementation results
 Conclusion
STDAY2010, Udine, Sept. 2010
CMOS implementation results
0.18 m, 1.6 V, 6 metal levels standard-cell
250
QCIF CIF
4 CIF
16 CIF
200
FULL PARALLEL
CASCADE
150
ITERATIVE
100
50
0
0
15
30
45
60
75
Circuit complexity (Kgates) vs. Power consumption (mW)
Dotted lines refer to the elaboration of the same video formats
STDAY2010, Udine, Sept. 2010
Power consumption with contextaware saving
Power consumption of the CS architecture
Power consumption of the PA architecture
STDAY2010, Udine, Sept. 2010
Conclusions
3D-DCT/IDCT is a promising solution for
real-time,
low-power,
low-complexity,
Implementation of video encoders and decoders
in battery powered terminals
STDAY2010, Udine, Sept. 2010
Thanks for your attention!!!
STDAY2010, Udine, Sept. 2010
Download