Toward Memory-efficient Design of Video Encoders for Multimedia Applications

Avishek Saha∗, Santosh Ghosh∗, Shamik Sural†, Jayanta Mukherjee∗
∗ Department of Computer Science and Engg., IIT Kharagpur, WB, India
{avishek,santosh,jay}@cse.iitkgp.ernet.in
† School of Information Technology, IIT Kharagpur, WB, India
shamik@sit.iitkgp.ernet.in
Abstract
In this paper, we suggest a scheme to reduce the memory requirement of the Full Search Block Matching (FSBM) algorithm. Next, we explore the memory design space of embedded multimedia applications and suggest a speed-energy-optimized cache hierarchy.
1. Introduction
Multimedia applications, in general, need a large amount of memory and bandwidth to store and access frames. However, the increasing gap between memory and processor performance and the high power consumption of deep cache structures have shifted the performance bottleneck in these applications to memory operations. The two most common approaches [4] to improving the performance of multimedia applications are either (a) to satisfy the computational requirements of the encoder/decoder, or (b) to design instruction extensions for multimedia applications.
In this paper, we suggest two approaches to overcome the memory performance bottleneck in embedded multimedia applications. Our first approach reduces the memory requirement of the FSBM algorithm to half the size of its search area. This technique can be extended to full frames, where the memory savings are substantial. Our second approach explores memory hierarchy designs to support real-time video encoding on low-power, portable platforms.
2. Memory Reduction Approach
In FSBM-based motion estimation, each N×N macroblock (MB) of the current frame is compared with all candidate macroblocks in a pre-defined (N+2p)×(N+2p) search window in the previous frame, where p is the search range. The motion vector is determined by identifying the best-matching MB using some matching criterion, most commonly the Sum of Absolute Differences (SAD). The high computational requirement of FSBM and its regular processing scheme make its hardware implementation a preferred choice [7],[8],[5],[6].
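For concreteness, a minimal C sketch of FSBM with the SAD criterion (function and parameter names are our own illustration, not taken from the paper):

    #include <stdint.h>
    #include <stdlib.h>

    #define N 16   /* macroblock size */
    #define P 8    /* search range p  */

    /* Exhaustive search: cur is the N×N current-frame MB, win the
       (N+2P)×(N+2P) search window from the previous frame. All
       (2P+1)^2 = 289 candidate offsets are scored with SAD and the
       offset of the best match is returned as the motion vector. */
    void fsbm_sad(const uint8_t cur[N][N],
                  const uint8_t win[N + 2*P][N + 2*P],
                  int *mv_x, int *mv_y)
    {
        uint32_t best = UINT32_MAX;
        for (int dy = 0; dy <= 2*P; dy++) {
            for (int dx = 0; dx <= 2*P; dx++) {
                uint32_t sad = 0;
                for (int i = 0; i < N; i++)
                    for (int j = 0; j < N; j++)
                        sad += (uint32_t)abs((int)cur[i][j]
                                             - (int)win[dy + i][dx + j]);
                if (sad < best) {
                    best = sad;
                    *mv_y = dy - P;   /* offsets relative to window center */
                    *mv_x = dx - P;
                }
            }
        }
    }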
Let us consider a reference MB of size 16×16 (i.e., N = 16) and a search region of size 32×32 (i.e., p = 8). For FSBM, we need to calculate the SAD at (2p+1)² = 289 different locations. We denote the SAD calculation at location n as SADn, where the numbering is done in row-major fashion. It should be observed that after each SAD operation, a pixel at some location becomes obsolete. After SAD0, the pixel at (m,n)=(0,0) in the search region is no longer used. Again, after SAD15, the entire row m=0 of the search region becomes obsolete. At this point,
we can replace the obsolete data in row m=0 with the data of row m=16. Thus, we do not need any extra memory to store row m=16. Similarly, after SAD31, row m=17 is stored in place of row m=1; after SAD47, row m=18 is stored in place of row m=2, and so on.
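A minimal C sketch of this row-replacement idea as a half-height circular buffer (buffer layout and names are our illustration):

    #include <stdint.h>
    #include <string.h>

    #define W    32        /* search window width/height, N + 2p */
    #define ROWS (W / 2)   /* buffer keeps only half the rows    */

    static uint8_t buf[ROWS][W];

    /* Search-region row m lives in slot m % ROWS, so once row m is
       obsolete, row m + ROWS is written into the slot it vacates and
       no extra memory is needed; frame_row points at that row's W
       pixels in the previous frame. */
    void replace_row(int m, const uint8_t *frame_row)
    {
        memcpy(buf[m % ROWS], frame_row, W);
    }

    /* Read pixel (m,n) of the search region while row m is live. */
    static inline uint8_t win_pixel(int m, int n)
    {
        return buf[m % ROWS][n];
    }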
We implement an architecture based on the proposed memory reduction scheme. The design contains 9 pipeline stages. The first stage is an array of 256 Processing Elements (PEs), each generating an 8-bit absolute difference. The second stage contains 128 8-bit adders, the third stage 64 9-bit adders, and so on, until a 16-bit SAD output is generated. The 9th stage of the pipeline is the slowest, as it adds two 15-bit operands. Assuming that each addition takes a single cycle, the first SAD is obtained after 9 clock cycles; thereafter, one SAD output is generated every cycle. So we need a total of 9 + (288 × 1) = 297 cycles to compute the 289 SADs in a 32×32 search region.
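The stage structure can be modeled in software as a pairwise reduction (a sketch only; in hardware all 256 differences are computed in parallel and each tree level is a pipeline stage):

    #include <stdint.h>

    /* Software model of the 9-stage pipeline: stage 1 forms the 256
       absolute differences (the PE array); stages 2-9 are a binary
       adder tree that halves the operand count 128 -> 64 -> ... -> 1,
       with operand width growing by one bit per stage up to the
       16-bit SAD (max value 256 * 255 = 65280). */
    uint16_t sad_tree(const uint8_t cur[256], const uint8_t cand[256])
    {
        uint16_t v[256];
        for (int i = 0; i < 256; i++)        /* stage 1: 256 PEs */
            v[i] = (uint16_t)(cur[i] > cand[i] ? cur[i] - cand[i]
                                               : cand[i] - cur[i]);
        for (int n = 128; n >= 1; n >>= 1)   /* stages 2-9: adder tree */
            for (int i = 0; i < n; i++)
                v[i] = (uint16_t)(v[2*i] + v[2*i + 1]);
        return v[0];                         /* 16-bit SAD */
    }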
Table 1. Comparison for N=16 and p=8

App | Memory                        | HW Complexity                                                | Cycles | Freq (MHz)
[7] | FSR, extra memory, cMB        | 256×(4 registers, 1 2:1 MUX, 1 ADC, 1 16-bit adder, 1 RAM)   | 256    | 40
[8] | FSR, cMB                      | 256×(1 ALU, 2 registers, 1 3:1 MUX, 1 SAM module)            | N.A.   | N.A.
[5] | FSR, cMB                      | 256×(1 ADC, 1 2:1 MUX, 1 16-bit acc, 1 register)             | 256    | 12.5
[6] | FSR, cMB                      | 17×(1 ADC, 1 16-bit adder, 1 8-bit and 1 16-bit register)    | 4370   | 103.84
[*] | Half FSR, cMB, one 32b buffer | 256 ADCs, 255 adders and registers (128 8-bit, 64 9-bit, 32 10-bit, 16 11-bit, 8 12-bit, 4 13-bit, 2 14-bit, 1 15-bit) | 297 | 232.78

ADC → Absolute Difference Calculator, SAM → Search Area Module
FSR → Full Search Region, cMB → current Macroblock
The proposed design has been implemented in Verilog HDL and verified with RTL simulations using Mentor Graphics ModelSim SE. The Verilog RTL has been synthesized on a Xilinx Virtex II 2V8000ff1152 FPGA with speed grade -5. The synthesized design achieves an operating frequency of 232.78 MHz (4.296 ns clock cycle). Our design has been compared with representative architectures in terms of (a) memory requirement (Memory), (b) hardware complexity (HW Complexity), (c) cycle consumption per macroblock (Cycles), and (d) operating frequency (Freq), for a macroblock of size 16×16 and a search region of 32×32, i.e., N=16 and p=8. A comparative analysis of our architecture against the existing designs is given in Table 1. The column App refers to the design under consideration, and [*] refers to our proposed hardware. As can be seen, our design outperforms the others in terms of required memory, hardware complexity and operating frequency. Each of our PEs performs a single absolute-difference operation, and the entire design requires a total of 256 subtractors and 255 adders. Thus, our design has lower hardware complexity than [7], [8] and [5], as it requires no multiplexer, accumulator, SAM (Search Area Module) or RAM (Random Access Memory). Only [6] has simpler hardware, but at the cost of more cycles per block and a lower operating frequency. Although our architecture consumes more cycles per block than [7] and [5], the pipelined nature of our design accounts for its higher operating frequency.
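To relate the cycle count and frequency to encoding rates, a back-of-envelope check (the CIF frame geometry is our assumption; the paper does not state a target format):

    #include <stdio.h>

    /* Throughput at the synthesized frequency: 297 cycles per motion
       vector at 232.78 MHz; a CIF frame (352×288) has 396 macroblocks. */
    int main(void)
    {
        const double f_hz          = 232.78e6;
        const double cycles_per_mb = 297.0;
        const double mbs_per_frame = (352.0 / 16.0) * (288.0 / 16.0); /* 396 */
        const double mb_rate       = f_hz / cycles_per_mb;  /* ~783,770 MB/s */
        printf("MBs/s: %.0f  CIF frames/s: %.0f\n",
               mb_rate, mb_rate / mbs_per_frame);           /* ~1979 fps */
        return 0;
    }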
3. Memory Hierarchy Design
In this section, we present our cache design space exploration. Our aim is to select an optimal cache configuration that balances the conflicting design parameters of speed and power consumption. We examine the cache miss rate, processor cycles and energy consumption as the cache configuration changes.
Our experimental framework is based on the Trimaran system, designed for research in instruction-level parallelism [2]. Power estimates are based on data from the Cacti toolset [1]. We assume a 232 MHz processor, which is the frequency achieved by our proposed design. Simulations have been performed on an H.263 [3] encoder. Table 2 presents the results on the test sequence Carphone, for the L1 instruction and data caches and the L2 unified cache.
Table 2. Cache Exploration Results

Type  | Size | Asso | Miss Rate (%) | Cycles      | Energy (mJ)
L1 IC | 4K   | 1    | 0.03          | 51908953541 | 84272
L1 IC | 8K   | 1    | 0.02          | 51849179836 | 100609
L1 IC | 16K  | 1    | 0.01          | 51772893596 | 124782
L1 IC | 32K  | 1    | 0.01          | 51742836299 | 180558
L1 IC | 32K  | 1    | 0.01          | 51742836299 | 180558
L1 IC | 32K  | 2    | 0.01          | 57689046540 | 161138
L1 IC | 32K  | 4    | 0.01          | 58870661609 | 153357
L1 IC | 32K  | 8    | 0.01          | 60049336177 | 154311
L1 DC | 4K   | 4    | 1.01          | 54001438874 | 121753
L1 DC | 8K   | 4    | 0.74          | 53185342958 | 129887
L1 DC | 16K  | 4    | 0.39          | 52047935100 | 145827
L1 DC | 32K  | 4    | 0.29          | 51742836299 | 180558
L1 DC | 32K  | 1    | 0.63          | 50135292184 | 217336
L1 DC | 32K  | 2    | 0.35          | 51491204189 | 191978
L1 DC | 32K  | 4    | 0.29          | 51742836299 | 180558
L1 DC | 32K  | 8    | 0.29          | 52161989006 | 179120
L2 UC | 32K  | 4    | 0.06          | 51832861259 | 180004
L2 UC | 64K  | 4    | 0.04          | 51742836299 | 180558
L2 UC | 128K | 4    | 0.03          | 51749999904 | 180370
L2 UC | 256K | 4    | 0.01          | 51651261958 | 181939
L2 UC | 64K  | 1    | 0.07          | 51826240276 | 182593
L2 UC | 64K  | 2    | 0.06          | 51805362285 | 181204
L2 UC | 64K  | 4    | 0.04          | 51742836299 | 180558
L2 UC | 64K  | 8    | 0.04          | 51743544800 | 183322
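As an aside, energy totals of this kind are typically obtained by weighting simulated access counts with per-access energies reported by Cacti for a given cache configuration; a hypothetical sketch (the interface and names are our assumption, not the actual toolflow):

    /* Combine simulated access counts with per-access energies (nJ)
       into a total in mJ, as in the Energy column of Table 2. */
    double cache_energy_mJ(unsigned long long accesses,
                           unsigned long long misses,
                           double e_access_nJ,  /* per-access energy */
                           double e_miss_nJ)    /* added cost per miss */
    {
        return (accesses * e_access_nJ + misses * e_miss_nJ) * 1e-6;
    }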
For the instruction cache, the miss rate is a metric of low importance. Table 2 shows that increasing the instruction cache size reduces processor cycles due to fewer cache misses, but increases power consumption. Increasing the cache size from 16K to 32K marginally reduces processor cycles but greatly increases energy consumption, while both sizes have the same miss rate. A 16K instruction cache offers 30.891% lower power consumption than a 32K cache at the cost of only a 0.058% loss in speed. It is also interesting to note that energy consumption is minimum at instruction cache associativity 4; further increases in associativity increase both processor cycles and energy consumption. Hence, we suggest an optimal instruction cache size of 16K with associativity 4.

Cache miss rate is a more important parameter in the exploration of the data cache design space. As Table 2 shows, increasing the data cache size drastically reduces processor cycles through a substantial reduction in cache miss rate; these applications are heavily data-intensive and usually thrive on large data caches. However, compared to a 16K cache, a 32K cache incurs a 19.24% increase in power consumption for only a 0.59% improvement in speed, with a comparable miss rate. Increasing associativity from 4 to 8 degrades speed by 0.81% while improving power consumption by only 0.796%. Therefore, a 16K cache with 4-way associativity appears to be the optimal data cache configuration.

Table 2 also shows that execution time and power consumption for the L2 unified cache remain almost constant across changes in cache size and associativity. A 64K L2 unified cache gives moderately high speed and low power consumption compared to the other sizes, and associativity 4 again appears to be the best choice, scoring over the other options in all respects. So the best L2 unified cache configuration is 64K with 4-way associativity. Thus, the optimal cache hierarchy in our case is a 16K 4-way L1 instruction cache, a 16K 4-way L1 data cache and a 64K 4-way L2 unified cache.
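The percentages above follow directly from the Table 2 entries; a small check (the denominator conventions are inferred so as to reproduce the quoted numbers):

    #include <stdio.h>

    /* Reproduce the percentages quoted above from Table 2. */
    int main(void)
    {
        /* L1 IC: 16K vs 32K, direct-mapped rows */
        printf("IC energy saving: %.3f%%\n",
               100.0 * (180558.0 - 124782.0) / 180558.0);  /* 30.891 */
        printf("IC speed loss:    %.3f%%\n",
               100.0 * (51772893596.0 - 51742836299.0)
                     / 51742836299.0);                     /* 0.058 */
        /* L1 DC: 32K vs 16K, 4-way rows */
        printf("DC energy increase: %.2f%%\n",
               100.0 * (180558.0 - 145827.0) / 180558.0);  /* 19.24 */
        printf("DC speed gain:      %.2f%%\n",
               100.0 * (52047935100.0 - 51742836299.0)
                     / 52047935100.0);                     /* 0.59 */
        return 0;
    }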
4. Conclusions
In this paper, we have suggested two approaches to overcome the memory limitations of multimedia applications. First, we have designed improved hardware using our proposed memory reduction algorithm for FSBM-based motion estimation. Next, we have performed a memory hierarchy design space exploration to select an energy-optimized memory configuration. Our work shows the importance of considering energy as a performance metric in memory design space exploration.
References
[1] Interactive CACTI. http://www.ece.ubc.ca/~stevew/cacti/.
[2] Trimaran infrastructure. http://www.trimaran.org.
[3] Video coding for low bit rate communication. ITU-T Recommendation H.263, February 1998.
[4] H. Chen, K. Li, and B. Wei. Memory performance optimizations for real-time software HDTV decoding. The Journal of VLSI Signal Processing, 41(2):193–207, September 2005.
[5] Y. Lai, Y. Lai, Y. Liu, P. Wu, and L. Chen. VLSI implementation of the motion estimator with two-dimensional data-reuse. IEEE Transactions on Consumer Electronics, 44:623–629, 1998.
[6] H. Loukil, F. Ghozzi, A. Samet, M. Ben Ayed, and N. Masmoudi. Hardware implementation of block matching algorithm with FPGA technology. In Proceedings of the Intl. Conf. on Microelectronics, pages 542–546, December 2004.
[7] V. Moshnyaga and K. Tamaru. A memory efficient array architecture for full-search block matching algorithm. In Proc. of IEEE ICASSP, volume 5, pages 4109–4112, April 1997.
[8] J. Tuan and C. Jen. An architecture of full-search block matching for minimum memory bandwidth requirement. In Proc. of IEEE 8th GLSVLSI, pages 152–156, February 1998.