Toward Memory-efficient Design of Video Encoders for Multimedia Applications

Avishek Saha∗, Santosh Ghosh∗, Shamik Sural†, Jayanta Mukherjee∗

∗ Department of Computer Science and Engg., IIT Kharagpur, WB, India
{avishek,santosh,jay}@cse.iitkgp.ernet.in

† School of Information Technology, IIT Kharagpur, WB, India
shamik@sit.iitkgp.ernet.in

Abstract

In this paper, we suggest a scheme to reduce the memory requirement of the Full Search Block Matching (FSBM) algorithm. Next, we explore the memory design space of embedded multimedia applications and suggest a speed-energy optimized cache hierarchy.

1. Introduction

Multimedia applications, in general, need a large amount of memory and bandwidth to store and access frames. However, the increasing gap between memory and processor performance, together with the high power consumption of deep cache structures, has shifted the performance bottleneck in these applications to memory operations. The two most common approaches [4] to improving the performance of multimedia applications are either (a) to satisfy the computational requirements of the encoder/decoder, or (b) to design instruction extensions for multimedia applications.

In this paper, we suggest two approaches to overcoming the memory performance bottleneck in embedded multimedia applications. Our first approach reduces the memory requirement of the FSBM algorithm to half the size of its search area. This technique can be extended to entire frames, where the memory savings are substantial. Our second approach explores memory hierarchy designs to support real-time video encoding on low-power, portable platforms.

2. Memory Reduction Approach

In FSBM-based motion estimation, each N×N macroblock (MB) of the current frame is compared with all candidate macroblocks in a pre-defined (N+2p)×(N+2p) search window in the previous frame, where p is the search range. The motion vector is determined by identifying the best matching MB according to some matching criterion, the most common of which is the Sum of Absolute Differences (SAD). The high computational requirement of FSBM and its regular processing scheme make hardware implementation a preferred choice [7], [8], [5], [6].

Let us consider a reference MB of size 16×16 (i.e., N = 16) and a search region of size 32×32 (i.e., p = 8). For FSBM, we need to calculate the SAD at (2p+1)² = 289 different locations. We denote the SAD calculation at location n as SADn, where the locations are numbered in row-major order starting from zero. It is to be observed that after each SAD operation, a pixel at some location becomes obsolete. After SAD0, the pixel at (m,n) = (0,0) in the search region is no longer used. Again, after SAD16, which completes the first row of candidate locations, the entire row m=0 of the search region becomes obsolete. At this point of time, we can replace the obsolete data in row m=0 by the data in row m=16. Thus, we do not need any extra memory to store row m=16. Similarly, after SAD33, row m=17 is stored in row m=1; after SAD50, row m=18 is stored in row m=2; and so on. As a result, only N of the N+2p search-window rows need to be resident at any time. A behavioral sketch of this row-recycling scheme is given after the next paragraph.

We implement an architecture based on our proposed memory reduction scheme. The design contains 9 pipelined stages. The first stage is an array of 256 Processing Elements (PEs), each of which generates an 8-bit absolute difference. The second stage contains 128 8-bit adders, the third stage contains 64 9-bit adders, and so on, until the final 16-bit SAD output is generated. The 9th stage of the pipeline is the slowest, as it adds two 15-bit operands. A software model of this datapath follows the row-recycling sketch below.
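As an illustration of the row-recycling scheme described above, the following C sketch performs the full search while keeping only N rows of the search window in a circular row buffer. It is a behavioral reference under simplifying assumptions (8-bit pixels, a row-major previous frame), not our hardware; fetch_row() is a hypothetical stand-in for whatever line buffer or DMA channel feeds the search window.

    #include <stdint.h>
    #include <stdlib.h>

    #define N  16            /* macroblock size                   */
    #define P  8             /* search range                      */
    #define W  (N + 2*P)     /* search window width = 32          */
    #define R  (2*P + 1)     /* candidate positions per axis = 17 */

    /* Hypothetical helper: copy one search-window row from the frame. */
    static void fetch_row(const uint8_t *frame, int row, uint8_t dst[W]) {
        for (int x = 0; x < W; x++)
            dst[x] = frame[row * W + x];
    }

    /* Full search keeping only N of the W window rows in memory.   */
    /* Window row m lives in buffer slot m % N while it is needed.  */
    static uint32_t fsbm_half_window(const uint8_t cur[N][N],
                                     const uint8_t *frame,
                                     int *best_dy, int *best_dx) {
        uint8_t  buf[N][W];
        uint32_t best = UINT32_MAX;

        for (int m = 0; m < N; m++)              /* preload rows 0..N-1 */
            fetch_row(frame, m, buf[m]);

        for (int dy = 0; dy < R; dy++) {
            for (int dx = 0; dx < R; dx++) {     /* SAD at one location */
                uint32_t sad = 0;
                for (int i = 0; i < N; i++) {
                    const uint8_t *row = buf[(dy + i) % N];
                    for (int j = 0; j < N; j++)
                        sad += (uint32_t)abs((int)cur[i][j] - (int)row[dx + j]);
                }
                if (sad < best) {
                    best = sad;
                    *best_dy = dy - P;           /* displacement -p..+p */
                    *best_dx = dx - P;
                }
            }
            /* Candidate row dy is done: window row dy is obsolete, so */
            /* recycle its slot for window row dy + N, if it exists.   */
            if (dy + N < W)
                fetch_row(frame, dy + N, buf[dy % N]);
        }
        return best;
    }

For N = 16 and p = 8, the buffer occupies 16×32 bytes, half of the 32×32 search window, which is exactly the saving claimed above.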
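The 9-stage datapath, in turn, can be summarized by the following C model, again only a behavioral sketch: the first loop corresponds to the 256 absolute-difference PEs, and each pass of the while loop corresponds to one of the 8 adder stages that halve the operand count (256 → 128 → … → 1) while widening the operands by one bit, up to the 16-bit SAD. In the hardware, each stage is a pipeline register boundary, so one SAD completes per clock once the pipeline is full.

    #include <stdint.h>
    #include <stdlib.h>

    /* 9-stage SAD datapath, modeled sequentially: one abs-diff stage */
    /* followed by an 8-level binary adder tree over 256 operands.    */
    static uint16_t sad_adder_tree(const uint8_t a[256], const uint8_t b[256]) {
        uint16_t v[256];
        int n = 256;

        for (int i = 0; i < n; i++)          /* stage 1: 256 PEs        */
            v[i] = (uint16_t)abs((int)a[i] - (int)b[i]);

        while (n > 1) {                      /* stages 2..9: adder tree */
            for (int i = 0; i < n / 2; i++)
                v[i] = (uint16_t)(v[2*i] + v[2*i + 1]);
            n /= 2;
        }
        return v[0];   /* at most 256*255 = 65280, fits in 16 bits */
    }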
Assuming that each addition operation takes a single cycle, the first SAD is obtained after 9 clock cycles. Thereafter, one SAD output is generated in every cycle. So, we need a total of 9 + (288×1) = 297 cycles to compute the 289 SADs in a 32×32 search region; this count is spelled out in the short sketch at the end of this section.

Table 1. Comparison for N=16 and p=8

App  Memory                        HW Complexity                                                Cycles  Freq (MHz)
[7]  FSR, extra memory, cMB        256×(4 registers, 1 2:1 MUX, 1 ADC, 1 16-bit adder, 1 RAM)    256     40
[8]  FSR, cMB                      256×(1 ALU, 2 registers, 1 3:1 MUX, 1 SAM module)             N.A.    N.A.
[5]  FSR, cMB                      256×(1 ADC, 1 2:1 MUX, 1 16-bit acc, 1 register)              256     12.5
[6]  FSR, cMB                      17×(1 ADC, 1 16-bit adder, 1 8-bit and 1 16-bit register)     4370    103.84
[*]  HalfFSR, cMB, one 32b buffer  256 ADCs; 255 adders and registers (128 8-bit, 64 9-bit,      297     232.78
                                   32 10-bit, 16 11-bit, 8 12-bit, 4 13-bit, 2 14-bit, 1 15-bit)

ADC → Absolute Difference Calculator; SAM → Search Area Module; FSR → Full Search Region; cMB → current Macroblock

The proposed design has been implemented in Verilog HDL and verified with RTL simulations using Mentor Graphics ModelSim SE. The Verilog RTL has been synthesized on a Xilinx Virtex-II 2V8000ff1152 FPGA with speed grade -5. The synthesized design achieves an operating frequency of 232.78 MHz (4.296 ns clock period). Our design has been compared with representative architectures in terms of (a) memory requirement (Memory), (b) hardware complexity (HW Complexity), (c) cycles consumed per macroblock (Cycles), and (d) operating frequency (Freq), for a macroblock of size 16×16 and a search region of 32×32, i.e., N=16 and p=8. A comparative analysis of our architecture with the existing designs is given in Table 1; the column App refers to the design being considered, and [*] refers to our proposed hardware.

As can be seen, our design outperforms the others in terms of required memory, hardware complexity and operating frequency. Each of our PEs performs a single absolute-difference operation, and the entire design requires a total of 256 subtractors and 255 adders. Thus, our design has a lower hardware complexity than [7], [8] and [5], as it requires no multiplexer, accumulator, SAM or RAM. Only [6] has simpler hardware, but at the cost of more cycles per block and a lower operating frequency. Although our architecture consumes more cycles per block than [7] and [5], the pipelined nature of our design accounts for its higher operating frequency.
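For completeness, the per-macroblock cycle count above is simply the pipeline depth plus the number of candidate locations minus one. A small helper, under the same single-cycle-per-stage assumption (the function name is ours, for illustration):

    /* Cycles per macroblock: the first SAD emerges after `depth`     */
    /* cycles (pipeline fill), then one SAD completes in every cycle. */
    static unsigned long mb_cycles(unsigned depth, unsigned p) {
        unsigned long sads = (2UL * p + 1) * (2UL * p + 1);
        return depth + (sads - 1);    /* depth=9, p=8: 9 + 288 = 297 */
    }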
3. Memory Hierarchy Design

In this section, we present our cache design space exploration. Our aim is to select an optimal cache configuration that satisfies the conflicting design parameters of speed and power consumption. We examine the cache miss rate, processor cycles and energy consumption as the cache configuration is varied. Our experimental framework is based on the Trimaran system, designed for research in instruction-level parallelism [2]. Power estimates are based on data from the Cacti toolset [1]. We assume a 232 MHz processor, which is the frequency achieved by our proposed design. Simulations have been performed on an H.263 [3] encoder. Table 2 presents the results on the test sequence Carphone, for the L1 instruction cache (IC), the L1 data cache (DC) and the L2 unified cache (UC).

Table 2. Cache Exploration Results

Type    Size  Asso  Miss Rate  Cycles        Energy (mJ)
L1 IC    4K    1      0.03     51908953541     84272
L1 IC    8K    1      0.02     51849179836    100609
L1 IC   16K    1      0.01     51772893596    124782
L1 IC   32K    1      0.01     51742836299    180558
L1 IC   32K    1      0.01     51742836299    180558
L1 IC   32K    2      0.01     57689046540    161138
L1 IC   32K    4      0.01     58870661609    153357
L1 IC   32K    8      0.01     60049336177    154311
L1 DC    4K    4      1.01     54001438874    121753
L1 DC    8K    4      0.74     53185342958    129887
L1 DC   16K    4      0.39     52047935100    145827
L1 DC   32K    4      0.29     51742836299    180558
L1 DC   32K    1      0.63     50135292184    217336
L1 DC   32K    2      0.35     51491204189    191978
L1 DC   32K    4      0.29     51742836299    180558
L1 DC   32K    8      0.29     52161989006    179120
L2 UC   32K    4      0.06     51832861259    180004
L2 UC   64K    4      0.04     51742836299    180558
L2 UC  128K    4      0.03     51749999904    180370
L2 UC  256K    4      0.01     51651261958    181939
L2 UC   64K    1      0.07     51826240276    182593
L2 UC   64K    2      0.06     51805362285    181204
L2 UC   64K    4      0.04     51742836299    180558
L2 UC   64K    8      0.04     51743544800    183322

For the instruction cache, the miss rate is of low importance as a performance metric. Table 2 shows that increasing the instruction cache size reduces processor cycles due to fewer cache misses, but increases energy consumption. Moving from a 16K to a 32K cache marginally reduces processor cycles but greatly increases energy consumption, while both sizes have the same miss rate. In fact, a 16K instruction cache offers 30.891% lower energy consumption than a 32K cache at the cost of only a 0.058% loss in speed. Again, it is interesting to note that the energy consumption is minimum at instruction cache associativity 4; further increases in associativity increase both processor cycles and energy consumption. Hence, we suggest an optimal instruction cache of size 16K with associativity 4.

The cache miss rate is a more important parameter in the exploration of the data cache design space. Increasing the data cache size drastically reduces processor cycles owing to a substantial reduction in the cache miss rate, as shown in Table 2; these applications are heavily data-intensive and usually thrive on large data caches. Still, compared to a 16K cache, a 32K data cache incurs a 19.24% increase in energy consumption for only a 0.59% improvement in speed and a comparable miss rate. Increasing the associativity from 4 to 8 degrades speed by 0.81% while improving energy consumption by only 0.796%. Therefore, a 16K cache with 4-way associativity appears to be the optimal data cache configuration.

Table 2 also shows that the execution time and energy consumption of the L2 unified cache remain almost constant across changes in cache size and associativity. A 64K L2 unified cache gives moderately high speed and low power consumption compared to the other cache sizes, and associativity 4 again appears to be the best choice, as it scores over the other options in all respects. So, the best L2 unified cache configuration is 64K with 4-way associativity.

Thus, the optimal cache hierarchy in our case is a 16K 4-way L1 instruction cache, a 16K 4-way L1 data cache and a 64K 4-way L2 unified cache. The sketch below reproduces the tradeoff arithmetic behind these choices.
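As an illustration, the following C program recomputes the instruction-cache percentages quoted above from the corresponding rows of Table 2, measuring each size against the fastest (32K) configuration. It sketches the tradeoff arithmetic only; it is not part of the Trimaran/Cacti flow, and the structure and names are ours.

    #include <stdio.h>

    struct config { const char *size; double cycles; double energy; };

    int main(void) {
        /* L1 instruction-cache size sweep from Table 2 (energy in mJ). */
        struct config c[] = {
            { "4K",  51908953541.0,  84272.0 },
            { "8K",  51849179836.0, 100609.0 },
            { "16K", 51772893596.0, 124782.0 },
            { "32K", 51742836299.0, 180558.0 },
        };
        const struct config *base = &c[3];   /* fastest option: 32K */

        for (int i = 0; i < 4; i++) {
            double slowdown = 100.0 * (c[i].cycles - base->cycles) / base->cycles;
            double saving   = 100.0 * (base->energy - c[i].energy) / base->energy;
            /* 16K prints ~0.058% slowdown for ~30.891% energy saving. */
            printf("%-3s: %+.3f%% cycles, %7.3f%% energy saved vs 32K\n",
                   c[i].size, slowdown, saving);
        }
        return 0;
    }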
4. Conclusions

In this paper, we have suggested two approaches to overcome the memory limitations of multimedia applications. First, we have designed improved hardware using our proposed memory reduction algorithm for FSBM-based motion estimation. Next, we have performed a memory hierarchy design space exploration to select an energy-optimized memory configuration. Our work shows the importance of considering energy as a performance metric in memory design space exploration.

References

[1] Interactive CACTI. http://www.ece.ubc.ca/~stevew/cacti/.
[2] Trimaran infrastructure. http://www.trimaran.org.
[3] Video coding for low bit rate communication. ITU-T Recommendation H.263, Feb. 1998.
[4] H. Chen, K. Li, and B. Wei. Memory performance optimizations for real-time software HDTV decoding. The Journal of VLSI Signal Processing, 41(2):193–207, September 2005.
[5] Y. Lai, Y. Lai, Y. Liu, P. Wu, and L. Chen. VLSI implementation of the motion estimator with two-dimensional data-reuse. IEEE Trans. Cons. Elec., 44:623–629, 1998.
[6] H. Loukil, F. Ghozzi, A. Samet, M. Ben Ayed, and N. Masmoudi. Hardware implementation of block matching algorithm with FPGA technology. In Proceedings of the Intl. Conf. on Microelectronics, pages 542–546, December 2004.
[7] V. Moshnyaga and K. Tamaru. A memory efficient array architecture for full-search block matching algorithm. In Proc. of IEEE ICASSP, volume 5, pages 4109–4112, April 1997.
[8] J. Tuan and C. Jen. An architecture of full-search block matching for minimum memory bandwidth requirement. In Proc. of IEEE 8th GLSVLSI, pages 152–156, February 1998.