Efficient Integer DCT Architectures for HEVC

advertisement
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
1
Efficient Integer DCT Architectures for HEVC
Pramod Kumar Meher, Senior Member, IEEE, Sang Yoon Park, Member, IEEE,
Basant Kumar Mohanty, Senior Member, IEEE, Khoon Seong Lim, and Chuohao Yeo, Member, IEEE
Abstract—In this paper, we present area- and power-efficient
architectures for the implementation of integer discrete cosine
transform (DCT) of different lengths to be used in High Efficiency
Video Coding (HEVC). We show that an efficient constant matrixmultiplication scheme can be used to derive parallel architectures
for 1-D integer DCT of different lengths. We also show that
the proposed structure could be reusable for DCT of lengths
4, 8, 16, and 32 with a throughput of 32 DCT coefficients
per cycle irrespective of the transform size. Moreover, the
proposed architecture could be pruned to reduce the complexity
of implementation substantially with only a marginal affect on
the coding performance. We propose power-efficient structures
for folded and full-parallel implementations of 2-D DCT. From
the synthesis result, it is found that the proposed architecture
involves nearly 14% less area-delay product (ADP) and 19% less
energy per sample (EPS) compared to the direct implementation
of the reference algorithm, in average, for integer DCT of lengths
4, 8, 16, and 32. Also, an additional 19% saving in ADP and 20%
saving in EPS can be achieved by the proposed pruning algorithm
with nearly the same throughput rate. The proposed architecture
is found to support Ultra-High-Definition (UHD) 7680×4320 @
60 fps video which is one of the applications of HEVC.
Index Terms—Discrete cosine transform (DCT), High Efficiency Video Coding, HEVC, H.265, integer DCT, video coding.
I. I NTRODUCTION
The discrete cosine transform (DCT) plays a vital role
in video compression due to its near-optimal de-correlation
efficiency [1]. Several variations of integer DCT have been
suggested in the last two decades to reduce the computational
complexity [2]–[6]. The new H.265/HEVC (High Efficiency
Video Coding) standard [7] has been recently finalized and
poised to replace H.264/AVC [8]. Some hardware architectures
for the integer DCT for HEVC have also been proposed for its
real-time implementation. Ahmed et al [9] have decomposed
the DCT matrices into sparse sub-matrices where the multiplications are avoided by using lifting scheme. Shen et al [10]
have used the multiplierless multiple constant multiplication
(MCM) approach for 4-point and 8-point DCT, and have used
the normal multipliers with sharing techniques for 16 and 32point DCTs. Park et al [11] have used Chen’s factorization
of DCT where the butterfly operation has been implemented
by the processing element with only shifters, adders, and
Manuscript submitted on xx, xx, xxxx. (Corresponding author: S. Y. Park.)
P. K. Meher, S. Y. Park, K. S. Lim, and C. Yeo are with the Institute
for Infocomm Research, 138632 Singapore (e-mail: {pkmeher, sypark, kslim,
chyeo}@i2r.a-star.edu.sg). B. K. Mohanty is with the Department of Electronics and Communication Engineering, Jaypee University of Engineering
and Technology, Raghogarh, Guna, Madhya Pradesh, India-473226 (e-mail:
[email protected]).
Copyright (c) 2013 IEEE. Personal use of this material is permitted.
However, permission to use this material for any other purposes must be
obtained from the IEEE by sending an email to [email protected]
multiplexors. Budagavi and Sze [12] have proposed a unified
structure to be used for forward as well as inverse transform
after the matrix decomposition.
One key feature of HEVC is that it supports DCT of
different sizes such as 4, 8, 16, and 32. Therefore, the hardware
architecture should be flexible enough for the computation
of DCT of any of these lengths. The existing designs for
conventional DCT based on constant matrix multiplication
(CMM) and MCM can provide optimal solutions for the
computation of any of these lengths but they are not reusable
for any length to support the same throughput processing of
DCT of different transform lengths. Considering this issue,
we have analyzed the possible implementations of integer
DCT for HEVC in the context of resource requirement and
reusability, and based on that, we have derived the proposed
algorithm for hardware implementation. We have designed
scalable and reusable architectures for 1-D and 2-D integer
DCTs for HEVC which could be reused for any of the
prescribed lengths with the same throughput of processing
irrespective of transform size.
In the next section, we present algorithms for hardware
implementation of the HEVC integer DCTs of different lengths
4, 8, 16, and 32. In Section III, we illustrate the design of the
proposed architecture for the implementation of 4-point and 8point integer DCT along with a generalized design of integer
DCT of length N , which could be used for the DCT of length
N = 16 and 32. Moreover, we demonstrate the reusability
of the proposed solutions in this section. In Section IV,
we propose power-efficient designs of transposition buffers
for full-parallel and folded implementations of 2-D Integer
DCT. In Section V, we propose a bit-pruning scheme for
the implementation of integer DCT and present the impact
of pruning on forward and inverse transforms. In Section VI,
we compare the synthesis result of the proposed architecture
with those of existing architectures for HEVC.
II. A LGORITHM FOR H ARDWARE I MPLEMENTATION OF
I NTEGER DCT FOR HEVC
In the Joint Collaborative Team-Video Coding (JCT-VC),
which manages the standardization of HEVC, Core Experiment 10 (CE10) studied the design of core transforms over
several meeting cycles [13]. The eventual HEVC transform
design [14] involves coefficients of 8-bit size, but does not
allow full factorization unlike other competing proposals [13].
It however allows for both matrix multiplication and partial
butterfly implementation. In this section, we have used the
partial-butterfly algorithm of [14] for the computation of
integer DCT along with its efficient algorithmic transformation
for hardware implementation.
2
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
A. Key Features of Integer DCT for HEVC
1
The N -point integer DCT for HEVC given by [14] can be
computed by partial butterfly approach using a (N/2)-point
DCT and a matrix-vector product of (N/2) × (N/2) matrix
with an (N/2)-point vector as




a(0)
y(0)


 y(2) 
a(1)








·
·





 = CN/2 

·
·
(1a)








·
·




a(N/2 − 2)
y(N − 4)
a(N/2 − 1)
y(N − 2)
and



b(0)
y(1)


 y(3) 
b(1)








·
·








·
·

 = MN/2 





·
·




b(N/2 − 2)
y(N − 3)
b(N/2 − 1)
y(N − 1)

(1b)
where
a(i) = x(i) + x(N − i − 1)
b(i) = x(i) − x(N − i − 1)
(4)
and

64
89

83

75
C8 = 
64

50

36
18

64 64 64 64 64 64 64
75 50 18 −18 −50 −75 −89

36 −36 −83 −83 −36 36 83

−18 −89 −50 50 89 18 −75
 . (5)
−64 −64 64 64 −64 −64 64

−89 18 75 −75 −18 89 −50

−83 83 −36 −36 83 −83 36
−50 75 −89 89 −75 50 −18
Based on (1) and (2), hardware oriented algorithms for DCT
computation can be derived in three stages as in Table I.
For 8, 16, and 32-point DCT, even indexed coefficients of
[y(0), y(2), y(4), · · ·y(N − 2)] are computed as 4, 8, and 16point DCTs of [a(0), a(1), a(2), · · ·a(N/2 − 1)], respectively,
according to (1a). In Table II, we have listed the arithmetic
complexities of the reference algorithm and the MCM-based
algorithm for 4, 8, 16, and 32-point DCT. Algorithms for
Inverse DCT (IDCT) can also be derived in a similar way.
(2)
for i = 0, 1, · · ·, N/2 − 1. X = [x(0), x(1), · · ·, x(N − 1)]
is the input vector and Y = [y(0), y(1), · · ·, y(N − 1)] is N point DCT of X. CN/2 is (N/2)-point integer DCT kernel
matrix of size (N/2) × (N/2). MN/2 is also a matrix of size
(N/2) × (N/2) and its (i, j)th entry is defined as
2i+1,j
mi,j
for 0 ≤ i, j ≤ N/2 − 1
N/2 = cN
are represented, respectively, as


64
64
64
64
83
36 −36 −83

C4 = 
64 −64 −64
64
36 −83
83 −36
(3)
c2i+1,j
N
is the (2i + 1, j)th entry of the matrix CN .
where
Note that (1a) could be similarly decomposed, recursively,
further using CN/4 and MN/4 . We have referred to the direct
implementation of DCT based on (1)-(3) as the reference
algorithm in the rest of the paper.
B. Hardware Oriented Algorithm
Direct implementation of (1) requires N 2 /4 + MULN/2
multiplications, N 2 /4+N/2+ADDN/2 additions and 2 shifts
where MULN/2 and ADDN/2 are the number of multiplications and additions/subtractions of (N/2)-point DCT, respectively. Computation of (1) could be treated as a CMM problem
[15]–[17]. Since the absolute values of the coefficients in all
the rows and columns of matrix M in (1b) are identical, CMM
problem can be implemented as a set of N/2 MCMs which
will result in a highly regular architecture and will have lowcomplexity implementation.
The kernel matrices for 4, 8, 16, and 32-point integer DCT
for HEVC are given in [14], and 4 and 8-point integer DCT
1 While the standard defines the inverse DCT, we show the forward DCT
here since it is more directly related to the later parts of this paper. Also, note
that the transpose of the inverse DCT can be used as the forward DCT.
III. P ROPOSED A RCHITECTURES FOR I NTEGER DCT
C OMPUTATION
In this section, we present the proposed architecture for the
computation of integer DCT of length 4. We also present a
generalized architecture for integer DCT of higher lengths.
A. Proposed Architecture for 4-Point Integer DCT
The proposed architecture for 4-point integer DCT is shown
in Fig. 1(a). It consists of an input adder unit (IAU), a shiftadd unit (SAU) and an output adder unit (OAU). The IAU
computes a(0), a(1), b(0), and b(1) according to STAGE-1 of
the algorithm as described in Table I. The computation of ti,36
and ti,83 are performed by two SAUs according to STAGE2 of the algorithm. The computation of t0,64 and t1,64 does
not consume any logic since the shift operations could be rewired in hardware. The structure of SAU is shown in Fig. 1(b).
Outputs of the SAU are finally added by the OAU according
to STAGE-3 of the algorithm.
B. Proposed Architecture for Integer DCT of Length 8 and
Higher Length DCTs
The generalized architecture for N -point integer DCT based
on the proposed algorithm is shown in Fig. 2. It consists
of four units, namely the IAU, (N/2)-point integer DCT
unit, SAU, and OAU. The IAU computes a(i) and b(i) for
i = 0, 1, ..., N/2 − 1 according to STAGE-1 of the algorithm
of Section II-B. The SAU provides the result of multiplication
of input sample with DCT coefficient by STAGE-2 of the
algorithm. Finally, the OAU generates the output of DCT from
a binary adder tree of log2 N − 1 stages. Figs. 3(a), 3(b),
MEHER et al.: EFFICIENT INTEGER DCT ARCHITECTURES FOR HEVC
3
TABLE I
3-STAGE S H ARDWARE O RIENTED A LGORITHMS FOR THE C OMPUTATION OF 4, 8, 16, AND 32-P OINT DCT.
STAGE-1
STAGE-2
STAGE-3
STAGE-1
STAGE-2
STAGE-3
STAGE-1
STAGE-2
STAGE-3
STAGE-1
STAGE-2
STAGE-3
4-Point DCT Computation
a(i) = x(i) + x(3 − i), b(i) = x(i) − x(3 − i) for i = 0 and 1.
mi,9 = 9b(i) = (b(i) << 3) + b(i), ti,64 = 64a(i) = a(i) << 6, ti,83 = 83b(i) = (b(i) << 6) + (mi,9 << 1) + b(i),
ti,36 = 36b(i) = mi,9 << 2 for i = 0 and 1.
y(0) = t0,64 + t1,64 , y(2) = t0,64 − t1,64 , y(1) = t0,83 + t1,36 , y(3) = t0,36 − t1,83 .
8-Point DCT Computation
a(i) = x(i) + x(7 − i), b(i) = x(i) − x(7 − i) for 0 ≤ i ≤ 3.
mi,9 = 9b(i) = (b(i) << 3) + b(i), mi,25 = 25b(i) = (b(i) << 4) + mi,9 , ti,18 = 18b(i) = mi,9 << 1,
ti,50 = 50b(i) = mi,25 << 1, ti,75 = 75b(i) = ti,50 + mi,25 , ti,89 = 89b(i) = (b(i) << 6) + mi,25 for 0 ≤ i ≤ 3.
y(1) = t0,89 + t1,75 + t2,50 + t3,18 , y(3) = t0,75 − t1,18 − t2,89 − t3,50 ,
y(5) = t0,50 − t1,89 + t2,18 + t3,75 , y(7) = t0,18 − t1,50 + t2,75 − t3,89 .
16-Point DCT Computation
a(i) = x(i) + x(15 − i), b(i) = x(i) − x(15 − i) for 0 ≤ i ≤ 7.
mi,8 = 8b(i) = b(i) << 3, ti,9 = 9b(i) = mi,8 + b(i), mi,18 = 18b(i) = ti,9 << 1, mi,72 = 72b(i) = ti,9 << 3,
ti,25 = 25b(i) = (b(i) << 4) + ti,9 , ti,43 = 43b(i) = mi,18 + ti,25 , ti,57 = 57b(i) = ti,25 + (b(i) << 5),
ti,70 = 70b(i) = mi,72 − (b(i) << 1), ti,80 = 80b(i) = mi,72 + mi,8 , ti,87 = 87b(i) = (ti,43 << 1) + b(i),
ti,90 = 90b(i) = mi,72 + mi,18 for 0 ≤ i ≤ 7.
y(1) = t0,90 + t1,87 + t2,80 + t3,70 + t4,57 + t5,43 + t6,25 + t7,9 , y(3) = t0,87 + t1,57 + t2,9 − t3,43 − t4,80 − t5,90 − t6,70 − t7,25 ,
y(5) = t0,80 + t1,9 − t2,70 − t3,87 − t4,25 + t5,57 + t6,90 + t7,43 , y(7) = t0,70 − t1,43 − t2,87 + t3,9 + t4,90 + t5,25 − t6,80 − t7,57 ,
y(9) = t0,57 − t1,80 − t2,25 + t3,90 − t4,9 − t5,87 + t6,43 + t7,70 , y(11) = t0,43 − t1,90 + t2,57 + t3,25 − t4,87 + t5,70 + t6,9 − t7,80 ,
y(13) = t0,25 − t1,70 + t2,90 − t3,80 + t4,43 + t5,9 − t6,57 + t7,87 , y(15) = t0,9 − t1,25 + t2,43 − t3,57 + t4,70 − t5,80 + t6,87 − t7,90 .
32-Point DCT Computation
a(i) = x(i) + x(31 − i), b(i) = x(i) − x(31 − i) for 0 ≤ i ≤ 15.
mi,2 = 2b(i) = b(i) << 1, ti,4 = 4b(i) = b(i) << 2, ti,9 = 9b(i) = (b(i) << 3) + b(i), mi,18 = 18b(i) = ti,9 << 1,
mi,72 = 72b(i) = ti,9 << 3, ti,13 = 13b(i) = ti,9 + ti,4 , mi,52 = 52b(i) = ti,13 << 2, ti,22 = 22b(i) = mi,18 + ti,4 ,
ti,31 = 31b(i) = (b(i) << 5) − b(i), ti,38 = 38b(i) = (ti,9 << 2) + mi,2 , ti,46 = 46b(i) = (ti,22 << 1) + mi,2 ,
ti,54 = 54b(i) = mi,52 + mi,2 , ti,61 = 61b(i) = (ti,31 << 1) − b(i), ti,67 = 67b(i) = ti,54 + ti,13 ,
ti,73 = 73b(i) = mi,72 + b(i), ti,78 = 78b(i) = mi,52 + (ti,13 << 1), ti,82 = 82b(i) = ti,78 + ti,4 ,
ti,85 = 85b(i) = mi,72 + ti,13 , ti,88 = 88b(i) = ti,22 << 2, ti,90 = 90b(i) = mi,72 + mi,18 for 0 ≤ i ≤ 15.
y(1) = t0,90 + t1,90 + t2,88 + t3,85 + t4,82 + t5,78 + t6,73 + t7,67 + t8,61 + t9,54 + t10,46 + t11,38 + t12,31 + t13,22 + t14,13 + t15,4 ,
y(3) = t0,90 + t1,82 + t2,67 + t3,46 + t4,22 − t5,4 − t6,31 − t7,54 − t8,73 − t9,85 − t10,90 − t11,88 − t12,78 − t13,61 − t14,38 − t15,13 ,
y(5) = t0,88 + t1,67 + t2,31 − t3,13 − t4,54 − t5,82 − t6,90 − t7,78 − t8,46 − t9,4 + t10,38 + t11,73 + t12,90 + t13,85 + t14,61 + t15,22 ,
y(7) = t0,85 + t1,46 − t2,13 − t3,67 − t4,90 − t5,73 − t6,22 + t7,38 + t8,82 + t9,88 + t10,54 − t11,4 − t12,61 − t13,90 − t14,78 − t15,31 ,
y(9) = t0,82 + t1,22 − t2,54 − t3,90 − t4,61 + t5,13 + t6,78 + t7,85 + t8,31 − t9,46 − t10,90 − t11,67 + t12,4 + t13,73 + t14,88 + t15,38 ,
y(11) = t0,78 − t1,4 − t2,82 − t3,73 + t4,13 + t5,85 + t6,67 − t7,22 − t8,88 − t9,61 + t10,31 + t11,90 + t12,54 − t13,38 − t14,90 − t15,46 ,
y(13) = t0,73 − t1,31 − t2,90 − t3,22 + t4,78 + t5,67 − t6,38 − t7,90 − t8,13 + t9,82 + t10,61 − t11,46 − t12,88 − t13,4 + t14,85 + t15,54 ,
y(15) = t0,67 − t1,54 − t2,78 + t3,38 + t4,85 − t5,22 − t6,90 + t7,4 + t8,90 + t9,13 − t10,88 − t11,31 + t12,82 + t13,46 − t14,73 − t15,61 ,
y(17) = t0,61 − t1,73 − t2,46 + t3,82 + t4,31 − t5,88 − t6,13 + t7,90 − t8,4 − t9,90 + t10,22 + t11,85 − t12,38 − t13,78 + t14,54 + t15,67 ,
y(19) = t0,54 − t1,85 − t2,4 + t3,88 − t4,46 − t5,61 + t6,82 + t7,13 − t8,90 + t9,38 + t10,67 − t11,78 − t12,22 + t13,90 − t14,31 − t15,73 ,
y(21) = t0,46 − t1,90 + t2,38 + t3,54 − t4,90 + t5,31 + t6,61 − t7,88 + t8,22 + t9,67 − t10,85 + t11,13 + t12,73 − t13,82 + t14,4 + t15,78 ,
y(23) = t0,38 − t1,88 + t2,73 − t3,4 − t4,67 + t5,90 − t6,46 − t7,31 + t8,85 − t9,78 + t10,13 + t11,61 − t12,90 + t13,54 + t14,22 − t15,82 ,
y(25) = t0,31 − t1,78 + t2,90 − t3,61 + t4,4 + t5,54 − t6,88 + t7,82 − t8,38 − t9,22 + t10,73 − t11,90 + t12,67 − t13,13 − t14,46 + t15,85 ,
y(27) = t0,22 − t1,61 + t2,85 − t3,90 + t4,73 − t5,38 − t6,4 + t7,46 − t8,78 + t9,90 − t10,82 + t11,54 − t12,13 − t13,31 + t14,67 − t15,88 ,
y(29) = t0,13 − t1,38 + t2,61 − t3,78 + t4,88 − t5,90 + t6,85 − t7,73 + t8,54 − t9,31 + t10,4 + t11,22 − t12,46 + t13,67 − t14,82 + t15,90 ,
y(31) = t0,4 − t1,13 + t2,22 − t3,31 + t4,38 − t5,46 + t6,54 − t7,61 + t8,67 − t9,73 + t10,78 − t11,82 + t12,85 − t13,88 + t14,90 − t15,90 .
and 3(c) respectively illustrate the structures of IAU, SAU,
and OAU in the case of 8-point integer DCT. Four SAUs are
required to compute ti,89 , ti,75 , ti,50 , and ti,18 for i = 0, 1,
2, and 3 according to STAGE-2 of the algorithm. The outputs
of SAUs are finally added by two-stage adder tree according
to STAGE-3 of the algorithm. Structures for 16 and 32-point
integer DCT can also be obtained similarly.
TABLE II
C OMPUTATIONAL C OMPLEXITIES OF MCM-BASED H ARDWARE
O RIENTED A LGORITHM AND R EFERENCE A LGORITHM .
N
4
8
16
32
Reference Algorithm
MULT
ADD
SHIFT
4
20
84
340
8
28
100
372
2
2
2
2
MCM-Based Algorithm
ADD
SHIFT
14
50
186
682
10
30
86
278
C. Reusable Architecture for Integer DCT
The proposed reusable architecture for the implementation of DCT of any of the prescribed lengths is shown in
Fig. 4(a). There are two (N/2)-point DCT units in the structure. The input to one (N/2)-point DCT unit is fed through
(N/2) 2:1 MUXes that selects either [a(0), ..., a(N/2 −
1)] or [x(0), ..., x(N/2 − 1)], depending on whether it is
used for N -point DCT computation or for the DCT of a
lower size. The other (N/2)-point DCT unit takes the input
[x(N/2), ..., x(N − 1)] when it is used for the computation
of DCT of N/2 point or a lower size, otherwise, the input is
reset by an array of (N/2) AND gates to disable this (N/2)point DCT unit. The output of this (N/2)-point DCT unit is
4
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
output adder unit
t0,64
A
input adder unit
A
x(0)
a(0)
<< 6
x(0)
x(7)
x(1)
x(6)
A
a(0)
x(2)
x(5)
A
A
A
x(3)
b(0)
shift-add unit
b(0)
a(1)
x(4)
A
A
A
A
A
b(1)
a(2)
b(2)
a(3)
b(3)
y(0)
_
t0,83
t0,36
x(3)
A
_
_
_
y(1)
(-)
x(1)
A
a(1)
t1,64
<< 6
(-)
A
(a)
y(2)
an.pdf
A
x(2)
shift-add unit
(-)
t1,36
t1,83
A
ti,75
A
y(3)
(-)
SAU-i.pdf
(a)
A
<< 3
A
<< 4
<< 6
b(i)
ti,89
A
<< 6
b(1)
<< 1
b(i)
A
<< 3
<< 1
ti,50
<< 1
ti,18
ti,83
A
(b)
ti,36
<< 1
t0,89 t3,18 t1,75 t2,50 t2,89 t1,18 t0,75 t3,50 t1,89 t2,18 t3,75 t0,50 t3,89 t0,18 t2,75 t1,50
W+6 W+8
W+8
W+7
A
A
(b)
W+8
_
_
<<A2
A
_
AA
A
_
A
ti,89/16A
stage-1
W+8
>> 1
Fig. 1.
Proposed architecture of 4-point integer DCT. (a) 4-point DCT
architecture. (b) Structure of SAU.
_
A
A
ti,75/16
stage-2
<< 1
A
A
A
prune-b.pdf
A
ti,50/16
W+9
a(1)
x(1)
N/2
y(1)
shift-add unit
N/2
shift-add unit
x(N-2)
b(N/2-1)
N/2
shift-add unit
ty(7)
i,18/16
<< 3
oau.pdf
y(N-2)
x(N-1)
y(5)
Fig. 3. Proposed architecture of 8-point integer DCT and IDCT. (a) Structure
of IAU. (b) Structure of SAU. (c) Structure of OAU.
output adder unit
input adder unit
b(1)
y(3)A
(c)
y(2)
(N/2)-point
integer
DCT unit
a(N/2-1)
b(0)
y(1)
b(i)
y(0)
a(0)
x(0)
y(3)
y(N-1)
combinational logics for control units are shown in Figs. 4(b)
and 4(c) for N = 16 and 32, respectively. For N = 8, m8 is a
1-bit signal which is used as sel 1 while sel 2 is not required
since 4-point DCT is the smallest DCT. The proposed structure
can compute one 32-point DCT, two 16-point DCTs, four 8point DCTs, and eight 4-point DCTs, while the throughput
remains the same as 32 DCT coefficients per cycle irrespective
of the desired transform size.
Fig. 2. Proposed generalized architecture for integer DCT of length N = 8,
16, and 32.
multiplexed with that of the OAU, which is preceded by the
SAUs and IAU of the structure. The N AND gates before
IAU are used to disable the IAU, SAU, and OAU when the
architecture is used to compute (N/2)-point DCT computation
or a lower size. The input of the control unit, mN is used to
decide the size of DCT computation. Specifically, for N = 32,
m32 is a 2-bits signal which is set to {00}, {01}, {10}, and
{11} to compute 4, 8, 16, and 32-point DCT, respectively.
The control unit generates sel 1 and sel 2, where sel 1 is
used as control signals of N MUXes and input of N AND
gates before IAU. sel 2 is used as the input m(N/2) to two
lower size reusable integer DCT units in recursive manner. The
IV. P ROPOSED S TRUCTURE FOR 2-D I NTEGER DCT
Using its separable property, a (N × N )-point 2-D integer DCT could be computed by row-column decomposition
technique in two distinct stages:
•
•
STAGE-1: N -point 1-D integer DCT is computed for
each column of the input matrix of size (N × N ) to
generate an intermediate output matrix of size (N × N ).
STAGE-2: N -point 1-D DCT is computed for each row
of the intermediate output matrix of size (N × N ) to
generate desired 2-D DCT of size (N × N ).
We present here a folded architecture and a full-parallel
architecture for the 2-D integer DCT, along with the necessary transposition buffer to match them without internal data
movement.
MEHER et al.: EFFICIENT INTEGER DCT ARCHITECTURES FOR HEVC
5
mN
control unit
sel_2
sel_1
0
control unit
y(2)
(N/2)-point
reusable integer
DCT unit
MUX
y(0)
a(0) a(1) a(N/2-1)
input adder unit
flexible-dct.pdf
x(N/2-1)
shift-add unit
a(0)a(1) a(N/2-1)
x(N/2)
x(N/2+1)
b(1)
x(0)
input adder unit
x(1)
b(N/2-1)
x(N-1)
x(N/2-1)
x(N-1)
b(0)
b(1)
N/2
shift add
shift-add
unit unit
N/2
N/2
shift add unit
(N/2)-point
reusable integer
b(N/2-1) DCT unit
N/2
0
1
y(1)
0
y(N-2)
1
y(3)
y(1)
0
y(3)
1
MUX
x(1)
y(2)
output adder unit
(N/2)-point N/2
reusable
shift-add
unit
integer
DCT unit N/2
b(0)
output adder unit
x(0)
y(N-2)
MUX
MUX unit
(N/2 2:1 MUX)
0
1
1
MUX
sel_2
sel_1
y(0)
MUX
1
0
MUX
mN
y(N-1)
y(N-1)
shift add unit
sel_2
(a)
flexible-dct.pdf
sel_1
control.pdf
m32(0)
m32(1)
sel_1
sel_2
m16(0)
m16(1)
sel_2(0)
sel_2(1)
(b)
(c)
Fig. 4. Proposed reusable architecture of integer DCT. (a) Proposed reusable architecture for N = 8, 16, and 32. (b) Control unit for N = 16. (c) Control
unit for N = 32.
A. Folded Structure for 2-D Integer DCT
The folded structure for the computation of (N × N )-point
2-D integer DCT is shown in Fig. 5(a). It consists of one
N -point 1-D DCT module and a transposition buffer. The
structure of the proposed 4 × 4 transposition buffer is shown
in Fig. 5(b). It consists of 16 registers arranged in 4 rows and
4 columns. (N × N ) transposition buffer can store N values
in any one column of registers by enabling them by one of the
enable signals ENi for i = 0, 1, · · ·, N − 1. One can select the
content of one of the rows of registers through the MUXes.
During the first N successive cycles, the DCT module
receives the successive columns of (N × N ) block of input
for the computation of STAGE-1, and stores the intermediate
results in the registers of successive columns in the transposition buffer. In the next N cycles, contents of successive rows
of the transposition buffer are selected by the MUXes and fed
as input to the 1-D DCT module. N MUXes are used at the
input of the 1-D DCT module to select either the columns
from the input buffer (during the first N cycles) or the rows
from the transposition buffer (during the next N cycles).
B. Full-Parallel Structure for 2-D Integer DCT
The full-parallel structure for (N × N )-point 2-D integer
DCT is shown in Fig. 6(a). It consists of two N -point 1D DCT modules and a transposition buffer. The structure
of the 4 × 4 transposition buffer for full-parallel structure
is shown in Fig. 6(b). It consists of 16 register cells (RC)
(shown in Fig. 6(c)) arranged in 4 rows and 4 columns. N ×N
transposition buffer can store N values in a cycle either rowwise or column-wise by selecting the inputs by the MUXes at
the input of RCs. The output from RCs can also be collected
either row-wise or column-wise. To read the output from the
buffer, N number of (2N − 1):1 MUXes (shown in Fig. 6(d))
are used, where outputs of the ith row and the ith column of
RCs are fed as input to the ith MUX. For the first N successive
cycles, the ith MUX provides output of N successive RCs on
6
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
MUX
MUX
rows
of 2-D
DCT
output
N-point
integer
DCT unit
rows
of 2-D
DCT
output
MUX
columns
of 2-D
DCT
input
NxN
transposition
buffer
N-point
integer
DCT unit
columns
of 2-D
DCT
input
N-point integer
DCT unit
(a)
NxN
transposition
buffer
rows of
intermediate
output
columns of
intermediate
output
x0
y00
RC00
(a)
EN1
EN0
R03
R02
RC10
R21
N-point R13
integer
DCT unit
R22
rows
of 2-D
DCT
output
RC20
R23
y10
RC30
RC01
RC11
RC21
y20
y30
RC30
RC21
RC31
y30
RC31
R30
R31
R32
R33
xi
xj
xi
xj
MUX
y0
MUX
y1
MUX
y2
MUX
y3
RCijyij
Rij
ENij
clock
yij
Rij
ENij
clock
y01
y11
y11
y21
y21
y31
y31
(b)y0i
RCij
MUX
x3
R20
R12
MUX
x2
R11 NxN
transposition
buffer
N-point
R10
integer
DCT unit
RC02
y02
x2
RC11
y20
RC20
x1
columns
of 2-D
DCT
input
y00
y10
RC10
clock
R01
x3
RC03
y03
x3
clock
EN3
RC00
R00
y01
x1
clock
x0
RC01
x0
EN2
x2
x1
s0
s1
s2 ss0
1
s2
RC02
RC12
RC12
RC22
RC22
RC32
RC32
y02
y12
RC13
RC23
y32
RC33
RC33
y23
y23
RC23
y32
y13
y13
RC13
y22
y22
y03
RC03
y12
y33
y33
y1i y2i y3i yi0 yi1 yi2 yi3
y0i y1i y2i y3i yi0 yi1 yi2 yi3
7-to-1 MUX
7-to-1 MUX
i=0,1,2,3
i=0,1,2,3
yi
yi
(c)
(d)
(b)
Fig. 5. Folded structure of (N × N )-point 2-D integer DCT. (a) The folded
2-D DCT architecture. (b) Structure of the transposition buffer for input size
4 × 4.
the ith row. In the next N successive cycles, the ith MUX
provides output of N successive RCs on the ith column. By
this arrangement, in the first N cycles, we can read the output
of N successive columns of RCs and in the next N cycles,
we can read the output of N successive rows of RCs. The
transposition buffer in this case allows both read and write
operations concurrently. If for the N cycles, results are read
and stored column-wise now, then in the next N successive
cycles, results are read and stored in the transposition buffer
row-wise. The first 1-D DCT module receives the inputs
column-wise from the input buffer. It computes a column of
intermediate output and stores in the transposition buffer. The
second 1-D DCT module receives the rows of the intermediate
result from the transposition buffer and computes the rows of
2-D DCT output row-wise.
Suppose that in the first N cycles, the intermediate results
are stored column-wise and all the columns are filled in with
Fig. 6. Full-parallel structure of (N × N )-point 2-D integer DCT. (a) The
full-parallel 2-D DCT architecture. (b) Structure of the transposition buffer for
input size 4×4. (c) Register cell RCij. (d) 7-to-1 MUX for 4×4 transposition
buffer.
intermediated results, then in the next N cycles, contents of
successive rows of the transposition buffer are selected by
the MUXes and fed as input to the 1-D DCT module of the
second stage. During this period, the output of the 1-D DCT
module of first stage is stored row-wise. In the next N cycles,
results are read and written column-wise. The alternating
column-wise and row-wise read and write operations with the
transposition buffer continues. The transposition buffer in this
case introduces a pipeline latency of N cycles required to fill
in the transposition buffer for the first time.
V. P RUNED DCT A RCHITECTURE
Fig. 7(a) illustrates a typical hybrid video encoder, e.g.,
HEVC, and Fig. 7(b) shows the breakdown of the blocks
in the shaded region in Fig. 7(a) for Main Profile, where
the encoding and decoding chain involves DCT, quantization,
de-quantization, and IDCT based on the transform design
Encoder block diagram
MEHER et al.: EFFICIENT INTEGER DCT ARCHITECTURES FOR HEVC
input
video
-
transform
entropy
coding
quantization
7
bitstream
21 20
23 22
inverse
quantization
18
17
16
8
9
7
6
9
DCT-1
ti,50
ti,89
A
<< 6
ti,75
b(i)<<6
log2N
+22
DCT-2
inverse
quantization
0
mi,25
A
transform
>>
log2N-1
1
2
ti,75
reference
frame buffer
16
3
4
reconstructed
video
(a)
log2N
+15
5
ti , 50
in-loop
processing
inter
prediction
19
t i ,18
inverse
transform
intra
prediction
truncated
bits
shift-add unit (SAU)
SAU-i.pdf
>>
log2N+6
t i , 89
16
ti ,18
ti ,50
ti ,75
t i , 89
quantization
A
<< 4
b(i)
A
<< 3
<< 1
ti,50
<< 1
ti,18
mi,25
output of SAU
(a)
16
IDCT-1
log2N
+22
>>7
log2N
+15
clipping
log2N-1
16
IDCT-2
log2N
+22
>>12
log2N
+10
ti,89/16
A
<< 2
>> 1
inverse transform
ti,75/16
A
prune-b.pdf
ti,50/16
A
<< 1
(b)
b(i)
ti,18/16
A
Fig. 7. Encoding and decoding chain involving DCT in the HEVC codec. (a)
Block diagram of a typical hybrid video encoder, e.g., HEVC. Note that the
decoder contains a subset of the blocks in the encoder, except for the having
entropy decoding instead of entropy coding. (b) Breakdown of the blocks in
the shaded region in (a) for Main Profile.
<< 3
(b)
shift-add unit (SAU)
24
proposed in [14]. As shown in Fig. 7(b), a data before and after
the DCT/IDCT transform is constrained to have maximum 16
bits regardless of internal bit depth and DCT length. Therefore,
scaling operation is required to retain the wordlength of intermediate data. In the Main Profile which supports only 8-bit
samples, if bit truncations are not performed, the wordlength
of the DCT output would be log2 N + 6 bits more than that
of the input to avoid overflow. The output wordlengths of the
first and second forward transforms are scaled down to 16 bits
by truncating least significant log2 N − 1 and log2 N + 6 bits,
respectively, as shown in the figure. The resulting coefficients
from the inverse transforms are also scaled down by the
fixed scaling factor of 7 and 12. It should be noted that
additional clipping of log2 N − 1 most significant bits is
required to maintain 16 bits after the first inverse transform
and subsequent scaling.
The scaling operation, however, could be integrated with
the computation of the transform without significant impact
on the coding result. The SAU includes several left-shift
operations as shown in Figs. 1(b) and 3(b) whereas the scaling
process is equivalent to perform the right shift. Therefore,
by manipulating the shift operations in SAU circuit, we can
optimize the complexity of the proposed DCT structure.
Fig. 8(a) shows the dot diagram of the SAU to illustrate
the pruning with an example of the second forward DCT of
23
22
21
20
19
18
17
16
9
8
7
6
5
4
t0,89
t1, 75
t 2 ,50
t 3 ,18
pad.pdf
truncated
bits
output adder unit (OAU)
final truncation
y (0)
(c)
Fig. 8. (a) Dot diagram for the pruning of the SAU for the second 8-point
forward DCT. (b) The modified structure of the SAU after pruning. (c) Dot
diagram for the truncation after the SAU and the OAU to generate y(0) for
8-point DCT.
length 8. Each row of dot diagram contains 17 dots, which
represents output of the IAU or its shifted form (for 16 bits of
the input wordlength). The final sum without truncation should
pad.pdf
be 25 bits. But, we use only 16 bits in the final sum, and the
rest 9 bits are finally discarded. To reduce the computational
complexity, some of the least significant bits (LSB) in the SAU
(in the gray area in Fig. 8(a)) can be pruned. It is noted that the
worst-case error by the pruning scheme occurs when all the
truncated bits are one. In this case, the sum of truncated values
8
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
TABLE III
T HE I NTERNAL R EGISTER W ORDLENGTH AND THE N UMBER OF P RUNED
B ITS FOR THE I MPLEMENTATION OF E ACH T RANSFORM (DCT-1, DCT-2,
IDCT-1 AND IDCT-2 IN F IG . 7( B )).
TABLE IV
E XPERIMENTAL R ESULTS FOR P ROPOSED M ETHODS C OMPARED TO
HM10.0 A NCHOR . BD- RATE R ESULTS FOR THE L UMA AND C HROMA (Cb
AND Cr) C OMPONENTS ARE SHOWN AS [Y(%) / Cb(%) / Cr(%)].
N
T
A
B
C
D
E
F
G
H
I
J
Configuration
PF(%)
PI(%)
PB(%)
9
16
16
16
8
8
7
7
0
0
1
1
1
4
4
4
0
4
3
3
0
0
0
5
0
0
1
0
1
8
7
12
16
16
16
16
16
16
16
12
Main AI
0.1/0.1/0.1
0.8/1.9/2.0
1.0/2.1/2.1
4
T1
T2
T3
T4
Main RA
0.1/0.1/0.2
0.8/1.5/1.6
0.9/1.7/1.8
Main LDB
0.1/0.3/0.0
0.9/1.9/1.4
1.0/1.8/1.6
0.1/0.1/0.0
1.0/1.7/1.4
1.1/1.6/1.5
8
T1
T2
T3
T4
9
16
16
16
8
8
7
7
1
1
2
2
2
4
4
4
0
4
3
3
0
1
0
5
0
0
2
0
2
9
7
12
15
16
16
16
16
16
16
13
Main LDP
16
T1
T2
T3
T4
9
16
16
16
8
8
7
7
2
2
3
3
3
4
4
4
0
4
3
3
0
2
0
5
0
0
3
0
3
10
7
12
14
16
16
16
16
16
16
14
32
T1
T2
T3
T4
9
16
16
16
8
8
7
7
3
3
4
4
4
4
4
4
0
4
3
3
0
3
0
5
0
0
4
0
4
11
7
12
13
16
16
16
16
16
16
15
A: input wordlength, B: bit increment in IAU/SAU without truncation, C:
bit increment in OAU without truncation, D: the number of truncated bits in
SAU, E: the number of truncated bits after SAU, F: the number of truncated
bits after OAU, G: the number of clipped bits, H: total scale factor
(H=D+E+F), I: the number of bits of SAU output (I=A+B-D-E), and J: the
number of bits of DCT output (J=A+B+C-D-E-F-G). T1, T2, T3, and T4
means DCT-1, DCT-2, IDCT-1, and IDCT-2 in Fig. 7(b), respectively.
amounts to 88, but it is only 17% of the weight of LSB of the
final output, 29 . Therefore, the impact of the proposed pruning
on the final output is not significant. However, as we prune
more bits, the truncation error increases rapidly. Fig. 8(b)
shows the modified structure of the SAU after pruning.
The output of the SAU is the product of DCT coefficient
with the input values. In the HEVC transform design [14], 16
bit multipliers are used for all internal multiplications. To have
the same precision, the output of SAU is also scaled down to
16 bits by truncating 4 more LSBs, which is shown in the
gray area before the OAU in Fig. 8(c) for the output addition
corresponding to the computation of y(0). The LSB of the
result of the OAU is truncated again to retain 16 bits transform
output. By careful pruning, the complexity of SAU and OAU
can be significantly reduced without significant impact on
the error performance. The overall reduction of the hardware
complexity by the pruning scheme is discussed in Section VI.
Table III lists the internal register wordlength and the number of pruned bits for the implementation of each transform
(DCT-1, DCT-2, IDCT-1, and IDCT-2 in Fig. 7(b)). In the
case of the second forward 8-point DCT (in the 7-th row
in Table III), the number of input wordlength is 16 bits. If
any truncation is not performed in the processing of IAU,
SAU, and OAU, the number of output wordlength becomes
25 bits by summation of values in column A, B, and C. By
the truncation procedure illustrated in Fig. 8 using the number
of truncated bits listed in the column D, E, and F of Table III,
the wordlength of DCT output becomes 16 bits. Note that the
scale factor listed in the column H satisfies the configuration
in Fig. 7(b) and the wordlength of the multiplier output and the
final DCT output as listed in the column I and J, respectively
is less than or equal to 16 bits satisfying constraints described
in [14] over all the transforms T1-T4 for N = 4, 8, 16, and
32.
In Fig. 2, as the odd-indexed output samples, y(1), y(3),...,
y(N − 1) are scaled down by pruning procedure, evenindexed output samples, y(0), y(2),..., y(N − 2) which are
from (N/2)-point integer DCT should also be scaled down
correspondingly. For example, the scaling factor of 8-point
second DCT (T2) is 1/29 as shown in column H in Table III,
the output of internal 4-point DCT should also be scaled down
by 1/29 . Since we have 4-point DCT with scaling factor 1/28
as shown in third row and column H, we can reuse it, and
perform additional scaling by 1/2.
To understand the impact of using the pruned transform
within HEVC, we have integrated them into the HEVC reference software, HM10.02 , for both luma and chroma residuals.
The proposed pruned forward DCT is a non-normative change
and can be implemented in encoders to produce HEVCcompliant bitstreams. Therefore, we perform one set of experiments, prune-forward (PF), in which the forward DCT
used is the proposed pruned version, while the inverse DCT
is the same as that defined in HEVC Final Draft International
Standard (FDIS) [7]. In addition, we perform two other sets
of experiments: (i) prune-inverse (PI), in which the forward
DCT is not pruned but the inverse DCT is pruned, and (ii)
prune-both (PB), where both forward and inverse DCT are
the pruned versions. Both PI and PB are no longer standards
compliant, since the pruned inverse DCT is different from what
is specified in the HEVC standard. We also note here that
both the encoder and decoder use the same pruned inverse
transforms, such that any coding loss is due to the loss in
accuracy of the inverse transform rather than drift between
encoder and decoder. We performed the tests using the JCT-VC
HM10.0 common test conditions [18] for the Main Profile, i.e.,
on the test sequences3 mandated for All-Intra (AI), RandomAccess (RA), Low-Delay B (LDB), and Low-Delay P (LDP)
configuration, over QP values of 22, 27, 32, and 37. Using the
HM10.0 as the anchor, we measure the BD-Rate [19] values
of each of the PF, PI and PB experiments to get a measure
2 Available
through SVN via http://hevc.hhi.fraunhofer.de/svn/svn
HEVCSoftware/tags/HM-10.0/.
3 The test sequences are classified into Classes A-F. Classes A through D
contain sequences with resolutions of 2560 × 1600, 1920 × 1080, 832 × 480,
416 × 240 respectively. Class E consists of video conferencing sequences of
resolution 1280 × 720, while Class F consists of screen content videos of
resolutions ranging from 832 × 480 to 1280 × 720.
MEHER et al.: EFFICIENT INTEGER DCT ARCHITECTURES FOR HEVC
9
TABLE V
A REA , T IME , AND P OWER C OMPLEXITIES OF P ROPOSED A RCHITECTURES AND D IRECT I MPLEMENTATION OF R EFERENCE A LGORITHM OF I NTEGER
DCT FOR VARIOUS L ENGTHS BASED ON S YNTHESIS R ESULT U SING TSMC 90-nm CMOS L IBRARY.
Design
Architecture
N
Area
Consumption
Computation
Time
Power
Consumption
Area-Delay
Product
Energy per
Sample
Excess AreaDelay Product
Excess Energy
per Sample
Using Reference Algorithm
4
8
16
32
4656
21457
88451
332889
2.84
3.25
3.67
3.96
0.28
1.68
7.45
29.75
13223
69735
324615
1318240
0.68
2.01
4.48
8.96
18.95%
51.23%
57.79%
51.06%
27.18%
59.03%
67.20%
68.66%
Proposed Algorithm
Before Pruning
4
8
16
32
4399
16772
66130
253432
2.95
3.34
3.92
4.48
0.26
1.31
5.73
23.17
12977
56018
259229
1135375
0.63
1.57
3.45
6.98
16.73%
21.48%
26.00%
30.10%
17.10%
24.02%
28.77%
31.46%
Proposed Algorithm
After Pruning
4
8
16
32
3794
13931
52480
196103
2.93
3.31
3.92
4.45
0.22
1.06
4.45
17.63
11116
46111
205721
872658
0.54
1.26
2.68
5.31
−−
−−
−−
−−
−−
−−
−−
−−
Area, computation time, and power consumption are measured in sq.um, ns, and mW, respectively. Area-delay product and energy per sample are in the
units of sq.um×ns and mW×ns, respectively.
of the coding performance impact. One can interpret the BDRate number as the percentage bitrate difference for the same
peak signal to noise ratio (PSNR); note that a positive BD-Rate
number indicates coding loss compared to the anchor. Table IV
summarizes the experimental results of using the different
combinations of the propose pruned forward and inverse DCT.
Using the pruned forward DCT alone (PF case) gives average
coding loss over each coding configuration of about 0.1% in
the luma component, suggesting that the use of the pruned
forward DCT in encoder has a insignificant coding impact.
On the other hand, PI gives average coding loss ranging from
0.8% to 1.0% in the luma component over the different coding
configurations. This is expected since the bit truncation after
each transform stage is larger. Finally, PB gives average coding
loss ranging from 0.9% - 1.1% in the luma component. It is
interesting to note that the coding loss in PB is approximately
the sum of that of PF and PI. We have also observed that the
coding performance difference for each of PF, PI and PB are
fairly consistent across the different classes of test sequences.
VI. I MPLEMENTATION R ESULTS AND D ISCUSSIONS
A. Synthesis Results of 1-D Integer DCT
We have coded the architecture derived from the reference
algorithm of Section II as well as the proposed architectures
for different transform lengths in VHDL, and synthesized
by Synopsys Design Compiler using TSMC 90-nm General
Purpose (GP) CMOS library [20]. The wordlength of input
samples are chosen to be 16 bits. The area, computation
time, and power consumption (at 100 MHz clock frequency)
obtained from the synthesis reports are shown in Table V. It is
found that the proposed architecture involves nearly 14% less
area-delay product (ADP) and 19% less energy per sample
(EPS) compared to the direct implementation of reference
algorithm, in average, for integer DCT of lengths 4, 8, 16, and
32. Additional 19% saving in ADP and 20% saving in EPS
are also achieved by the pruning scheme with nearly the same
throughput rate. The pruning scheme is more effective for
higher length DCT since the percentage of total area occupied
by the SAUs increases as DCT length increases, and hence
more adders are affected by the pruning scheme.
B. Comparison with the Existing Architectures
We have named the proposed reusable integer DCT architecture before applying pruning as reusable architecture-1 and
that after applying pruning as reusable architecture-2. The
processing rate of the proposed integer DCT unit is 16 pixels
per cycle considering 2-D folded structure since 2-D transform
of 32 × 32 block can be obtained in 64 cycles. In order to
support 8K Ultra-High-Definition (UHD) (7680×4320) @ 30
frames/sec (fps) and 4:2:0 YUV format which is one of the applications of HEVC [21], the proposed reusable architectures
should work at the operating frequency faster than 94 MHz
(7680×4320×30×1.5/16). The computation times of 5.56ns
and 5.27ns for reusable architectures-1 and 2, respectively
(obtained from the synthesis without any timing constraint)
are enough for this application. Also, the computation time
less than 5.358ns is needed in order to support 8K UHD @
60 fps, which can be achieved by slight increase in silicon
area when we synthesize the reusable architecture-1 with the
desired timing constraint. Table VI lists the synthesis results
of the proposed reusable architectures as well as the existing
architectures for HEVC for N = 32 in terms of gate count
which is normalized by area of 2-input NAND gate, maximum operating frequency, processing rate, throughput, and
supporting video format. The proposed reusable architecture2 requires larger area than the design of [11], but offers much
higher throughput. Also, the proposed architectures involve
less gate counts as well as higher throughput than the design of
[10]. Specially, the designs of [10] and [11] require very high
operational frequencies of 761 MHz and 1403 MHz in order to
support UHD @ 60 fps since 2-D transform of 32 × 32 block
can be computed in 261 cycles and 481 cycles, respectively.
However, 187 MHz operating frequency which is needed to
support 8K UHD @ 60 fps by the proposed architecture can be
10
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
TABLE VI
C OMPARISON OF D IFFERENT 1-D I NTEGER DCT A RCHITECTURES WITH THE P ROPOSED A RCHITECTURES
Design
Technology
Gate Counts
Shen et al [10]
Park et al [11]
Reusable Architecture-1
Reusable Architecture-2
0.13-µm
0.18-µm
90-nm
90-nm
134 K
52 K
131 K
103 K
Max Freq.
350
300
187
187
TABLE VII
2-D I NTEGER T RANSFORM WITH F OLDED S TRUCTURE AND
F ULL -PARALLEL S TRUCTURE .
Architecture
Technology
Total Gate Counts
Operating Frequency
Processing Rate (pels/cycle)
Throughput
Total Power Consumption
Energy per Sample (EPS)
Folded
Full-parallel
TSMC 90-nm GP CMOS
208 K
347 K
187 MHz
16
32
2.992 G
5.984 G
40.04 mW
67.57 mW
13.38 pJ
11.29 pJ
obtained using TSMC 90-nm or newer technologies as shown
in Table VI. Also, we could obtain the frequency of 94 MHz
for UHD @ 30 fps and 4:2:0 YUV format using TSMC 0.15µm or newer technologies.
C. Synthesis Results of 2-D Integer DCT
We also synthesized the the folded and full-parallel structures for 2-D integer DCT. We have listed total gate counts,
processing rate, throughput, power consumption, and EPS in
Table VII. We set the operational frequency to 187 MHz for
both cases to support UHD @ 60 fps. The 2-D full-parallel
structure yields 32 samples in each cycle after initial latency
of 32 cycles providing double the throughput of the folded
structure. However, the full-parallel architecture consumes
1.69 times more power than the folded architecture since it
has two 1-D DCT units and nearly the same complexity of
transposition buffer while the throughput of full-parallel design
is double the throughput of folded design. Thus, the fullparallel design involves 15.6% less EPS.
VII. S UMMARY AND C ONCLUSIONS
In this paper, we have proposed area- and power-efficient architectures for the implementation of integer DCT of different
lengths to be used in HEVC. The computation of N -point 1-D
DCT involves an (N/2)-point 1-D DCT and a vector-matrix
multiplication with a constant matrix of size (N/2) × (N/2).
We have shown that the MCM-based architecture is highly
regular and involves significantly less area-delay complexity
and less energy consumption than the direct implementation
of matrix multiplication for odd DCT coefficients. We have
used the proposed architecture to derive a reusable architecture
for DCT which can compute the DCT of lengths 4, 8, 16,
and 32 with throughput of 32 output coefficients per cycle.
We have also shown that proposed design could be pruned to
reduce the area and power complexity of forward DCT without
MHz
MHz
MHz
MHz
Pixels/Cycle
4
2.12
16
16
Throughput
1.40
0.63
2.99
2.99
Gsps
Gsps
Gsps
Gsps
Supporting Video Format
4096x2048
3840x2160
7680x4320
7680x4320
@
@
@
@
30
30
60
60
fps
fps
fps
fps
significant loss of quality. We have proposed power-efficient
architectures for folded and full-parallel implementations of
2-D DCT, where no data movement takes place within the
transposition buffer. From the synthesis result, it is found that
the proposed algorithm with pruning involves nearly 30% less
ADP and 35% less EPS compared to the reference algorithm
in average for integer DCT of lengths 4, 8, 16, and 32 with
nearly the same throughput rate.
ACKNOWLEDGMENT
The authors would like to thank Dr. Arnaud Tisserand for
providing us HDL code for the computation of the integer
DCT based on their algorithm.
R EFERENCES
[1] N. Ahmed, T. Natarajan, and K. Rao, “Discrete cosine transform,” IEEE
Trans. Comput., vol. 100, no. 1, pp. 90–93, Jan. 1974.
[2] W. Cham and Y. Chan, “An order-16 integer cosine transform,” IEEE
Trans. Signal Process., vol. 39, no. 5, pp. 1205–1208, May 1991.
[3] Y. Chen, S. Oraintara, and T. Nguyen, “Video compression using integer
DCT,” in Proc. IEEE International Conference on Image Processing,
Sep. 2000, pp. 844–845.
[4] J. Wu and Y. Li, “A new type of integer DCT transform radix and
its rapid algorithm,” in Proc. International Conference on Electric
Information and Control Engineering, Apr. 2011, pp. 1063–1066.
[5] A. M. Ahmed and D. D. Day, “Comparison between the cosine and
Hartley based naturalness preserving transforms for image watermarking
and data hiding,” in Proc. First Canadian Conference on Computer and
Robot Vision, May 2004, pp. 247–251.
[6] M. N. Haggag, M. El-Sharkawy, and G. Fahmy, “Efficient fast
multiplication-free integer transformation for the 2-D DCT H.265 standard,” in Proc. International Conference on Image Processing, Sep.
2010, pp. 3769–3772.
[7] B. Bross, W.-J. Han, J.-R. Ohm, G. J. Sullivan, Y.-K. Wang, and
T. Wiegand, “High Efficiency Video Coding (HEVC) text specification
draft 10 (for FDIS & Consent),” in JCT-VC L1003, Geneva, Switzerland,
Jan. 2013.
[8] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, “Overview
of the H.264/AVC video coding standard,” IEEE Trans. Circuits Syst.
Video Technol., vol. 13, no. 7, pp. 560–576, Jul. 2003.
[9] A. Ahmed, M. U. Shahid, and A. Rehman, “N -point DCT VLSI
architecture for emerging HEVC standard,” VLSI Design, 2012.
[10] S. Shen, W. Shen, Y. Fan, and X. Zeng, “A unified 4/8/16/32-point
integer IDCT architecture for multiple video coding standards,” in Proc.
IEEE International Conference on Multimedia and Expo, Jul. 2012, pp.
788–793.
[11] J.-S. Park, W.-J. Nam, S.-M. Han, and S. Lee, “2-D large inverse
transform (16x16, 32x32) for HEVC (high efficiency video coding),”
Journal of Semiconductor Technology and Science, vol. 12, no. 2, pp.
203–211, Jun. 2012.
[12] M. Budagavi and V. Sze, “Unified forward+inverse transform architecture for HEVC,” in Proc. International Conference on Image Processing,
Sep. 2012, pp. 209–212.
[13] JCTVC-G040, CE10: Summary report on core transform
design. [Online]. Available: http://phenix.int-evry.fr/jct/doc end user/
documents/7 Geneva/wg11/JCTVC-G040-v3.zip
[14] JCTVC-G495, CE10: Core transform design for HEVC: proposal for
current HEVC transform. [Online]. Available: http://phenix.int-evry.fr/
jct/doc end user/documents/7 Geneva/wg11/JCTVC-G495-v2.zip
MEHER et al.: EFFICIENT INTEGER DCT ARCHITECTURES FOR HEVC
[15] M. Potkonjak, M. B. Srivastava, and A. P. Chandrakasan, “Multiple
constant multiplications: Efficient and versatile framework and algorithms for exploring common subexpression elimination,” IEEE Trans.
Comput.-Aided Des. Integr. Circuits Syst., vol. 15, no. 2, pp. 151–165,
Feb. 1996.
[16] M. D. Macleod and A. G. Dempster, “Common subexpression elimination algorithm for low-cost multiplierless implementation of matrix
multipliers,” Electronics Letters, vol. 40, no. 11, pp. 651–652, May 2004.
[17] N. Boullis and A. Tisserand, “Some optimizations of hardware multiplication by constant matrices,” IEEE Trans. Comput., vol. 54, no. 10,
pp. 1271–1282, Oct. 2005.
[18] F. Bossen, “Common HM test conditions and software reference configurations,” in JCT-VC L1100, Geneva, Switzerland, Jan. 2013.
[19] G. Bjontegaard, “Calculation of average PSNR differences between RD
curves,” in ITU-T SC16/Q6, VCEG-M33, Austin, USA, Apr. 2001.
[20] TSMC 90nm General-Purpose CMOS Standard Cell Libraries tcbn90ghp. [Online]. Available: http://www.tsmc.com/
[21] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, “Overview of the
high efficiency video coding (HEVC) standard,” IEEE Trans. Circuits
Syst. Video Technol., vol. 22, no. 12, pp. 1649–1668, Dec. 2012.
Pramod Kumar Meher (SM’03) received the B.
Sc. (Honours) and M.Sc. degree in physics, and the
Ph.D. degree in science from Sambalpur University,
India, in 1976, 1978, and 1996, respectively.
Currently, he is a Senior Scientist with the Institute for Infocomm Research, Singapore. Previously,
he was a Professor of Computer Applications with
Utkal University, India, from 1997 to 2002, and a
Reader in electronics with Berhampur University,
India, from 1993 to 1997. His research interest includes design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image and
video processing, communication, bio-informatics and intelligent computing.
He has contributed nearly 200 technical papers to various reputed journals
and conference proceedings.
Dr. Meher has served as a speaker for the Distinguished Lecturer Program
(DLP) of IEEE Circuits Systems Society during 2011 and 2012 and Associate
Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMSII: EXPRESS BRIEFS during 2008 to 2011. Currently, he is serving as
Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND
SYSTEMS-I: REGULAR PAPERS, the IEEE TRANSACTIONS ON VERY
LARGE SCALE INTEGRATION (VLSI) SYSTEMS, and Journal of Circuits,
Systems, and Signal Processing. Dr. Meher is a Fellow of the Institution of
Electronics and Telecommunication Engineers, India. He was the recipient of
the Samanta Chandrasekhar Award for excellence in research in engineering
and technology for 1999.
Sang Yoon Park (S’03-M’11) received the B.S.
degree in electrical engineering and the M.S. and
Ph.D. degrees in electrical engineering and computer
science from Seoul National University, Seoul, Korea, in 2000, 2002, and 2006, respectively. He joined
the School of Electrical and Electronic Engineering,
Nanyang Technological University, Singapore as a
Research Fellow in 2007. Since 2008, he has been
with Institute for Infocomm Research, Singapore,
where he is currently a Research Scientist. His
research interest includes design of dedicated and
reconfigurable architectures for low-power and high-performance digital signal
processing systems.
11
Basant Kumar Mohanty (M’06-SM’11) received
M.Sc degree in Physics from Sambalpur University,
India, in 1989 and received Ph.D degree in the
field of VLSI for Digital Signal Processing from
Berhampur University, Orissa in 2000. In 2001 he
joined as Lecturer in EEE Department, BITS Pilani,
Rajasthan. Then he joined as an Assistant Professor
in the Department of ECE, Mody Institute of Education Research (Deemed University), Rajasthan. In
2003 he joined Jaypee University of Engineering and
Technology, Guna, Madhya Pradesh, where he becomes Associate Professor in 2005 and full Professor in 2007. His research interest includes design and implementation of low-power and high-performance
digital systems for resource constrained digital signal processing applications.
He has published nearly 50 technical papers. Dr.Mohanty serving as Associate
Editor for the Journal of Circuits, Systems, and Signal Processing. He is a
life time member of the Institution of Electronics and Telecommunication
Engineering, New Delhi, India, and he was the recipient of the Rashtriya
Gaurav Award conferred by India International friendship Society, New Delhi
for the year 2012.
Khoon Seong Lim received B.Eng. (with honors)
and M.Eng. degrees in electrical and electronic engineering from Nanyang Technological University,
Singapore, in 1999 and 2001, respectively. He was
a R&D DSP engineer with BITwave Private Limited, Singapore, working in the area of adaptive
microphone array beamforming in 2002. He joined
the School of Electrical and Electronic Engineering,
Nanyang Technological University, as a Research
Associate in 2003. Since 2006, he has been with
the Institute for Infocomm Research, Singapore,
where he is currently a Senior Research Engineer. His research interests
include embedded systems, sensor array signal processing, and digital signal
processing in general.
Chuohao Yeo (S’05, M’09) received the S.B. degree
in electrical science and engineering and the M.Eng.
degree in electrical engineering and computer science from the Massachusetts Institute of Technology
(MIT), Cambridge, MA, in 2002, and the Ph.D.
degree in electrical engineering and computer sciences from the University of California, Berkeley in
2009.
From 2005 to 2009, he was a graduate student
researcher at the Berkeley Audio Visual Signal Processing and Communication Systems Laboratory at
UC Berkeley. Since 2009, he has been a Scientist with the Signal Processing
Department at the Institute for Infocomm Research, Singapore, where he
leads a team that is actively involved in HEVC standardization activities.
His current research interests include image and video processing, coding
and communications, distributed source coding, and computer vision.
Dr. Yeo was a recipient of the Singapore Government Public Service
Commission Overseas Merit Scholarship from 1998 to 2002, and a recipient of
Singapore’s Agency for Science, Technology and Research National Science
Scholarship from 2004 to 2009. He received a Best Student Paper Award at
SPIE VCIP 2007 and a Best Short Paper Award at ACM MM 2008.
Download