H.264 Intra Frame Coder System Design
Özgür Taşdizen
Microelectronics Program at Sabanci University
4/8/2005

OUTLINE
• Introduction
• Hardware Architectures For Intra Frame Coder Modules
• Top Level Intra Frame Coder Hardware
• H.264 Intra Frame Coder System
• Conclusions and Future Work

H.264 VIDEO CODING STANDARD
• The latest video coding standard
• Developed with the collaboration of ITU-T and MPEG
• Includes 3 Profiles and 14 Levels
[Figure: timeline of video coding standards, 1984 to 2004. ITU-T: H.261, H.263, H.263+, H.263++. MPEG: MPEG-1, MPEG-4. Joint ITU-T / MPEG: H.262 / MPEG-2, H.264 / MPEG-4 Part 10.]

H.264 VIDEO CODING STANDARD
It Provides Significant Performance Gains

Average Bit Rate Savings
Coder    MPEG-4 ASP   H.263 HLP   MPEG-2
H.264    38.62%       48.80%      64.46%

90-minute DVD-quality movie (download time at 700 Kbps)
                            MPEG-2   MPEG-4 (ASP)   H.264
Bandwidth Required (Mbps)   3.0      1.8            1.1
Storage Utilization (MB)    2025     1234           727
Download Time (Minutes)     386      235            139

H.264 Encoder Block Diagram
[Figure: encoder block diagram. Blocks: Current Frame, Reference Frame, Motion Estimation, Motion Compensation, Choose Intra Mode, Intra Prediction, Mode Decision, Residue, Transform, Quant, Reorder, Entropy Coder, Inverse Quant, Inverse Transform, Reconstruction, Deblocking Filter, Reconstructed Frame. One region of the diagram is labelled Intra Frame Coder.]

Transform and Quantization Algorithms
[Figure: block diagram. Blocks: Residue, Forward Transform, Hadamard Transform, Quantizer, VLC, Inverse Quantizer, Inverse Hadamard Transform, Inverse Transform, Reconstruction.]

H.264 Transform Algorithm
• A multiply-free 4x4 integer transform is used. It only requires additions and shifts.
• For 16x16 intra coded luminance blocks and for 8x8 chrominance blocks, a second transform, the Hadamard Transform, is applied to the DC coefficients.
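As a minimal sketch of the multiply-free transform described above, the following Python models a 4x4 forward integer transform using only additions, subtractions and left shifts. The butterfly decomposition and the function names are illustrative assumptions, not the hardware datapath presented in these slides; the coefficient rows (1, 1, 1, 1), (2, 1, -1, -2), (1, -1, -1, 1) and (1, -2, 2, -1) are those of the H.264 core transform.

```python
def forward_transform_1d(a, b, c, d):
    """4-point H.264 core transform using only adds, subtracts and shifts."""
    s0 = a + d          # outer sum
    s1 = b + c          # inner sum
    d0 = a - d          # outer difference
    d1 = b - c          # inner difference
    return (
        s0 + s1,            #  a +  b +  c +  d
        (d0 << 1) + d1,     # 2a +  b -  c - 2d
        s0 - s1,            #  a -  b -  c +  d
        d0 - (d1 << 1),     #  a - 2b + 2c -  d
    )


def forward_transform_4x4(block):
    """Apply the 1-D transform to every row, then to every column."""
    rows = [forward_transform_1d(*row) for row in block]
    cols = [forward_transform_1d(*col) for col in zip(*rows)]
    # cols[j] holds output column j; transpose back to row-major order.
    return [list(row) for row in zip(*cols)]


if __name__ == "__main__":
    residual = [[5, 11, 8, 10],
                [9, 8, 4, 12],
                [1, 10, 11, 4],
                [19, 6, 15, 7]]
    for row in forward_transform_4x4(residual):
        print(row)
```

The row-then-column ordering is a design choice of this sketch; because the transform is separable, column-then-row gives the same result.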
[Figure: transform matrices for the 4x4 Forward Integer Transform, 4x4 Hadamard Transform, 2x2 Hadamard Transform and 4x4 Inverse Integer Transform.]

H.264 Transform Algorithm
• 4x4 Forward Integer Transform is applied to all the blocks except -1, 16, 17
• 4x4 Hadamard Transform is applied to -1 if intra 16x16 mode is selected
• 2x2 Hadamard Transform is applied to 16, 17

Block numbering within a macroblock (blocks -1, 16 and 17 hold the DC coefficients):
LUMA (-1)        CHROMA CB (16)   CHROMA CR (17)
 0  1  4  5      18 19            22 23
 2  3  6  7      20 21            24 25
 8  9 12 13
10 11 14 15

Transform Hardware
Register 0 stores: (x0+x4+x8+x12)
Register 1 stores: (x1+x5+x9+x13)
Register 2 stores: (x2+x6+x10+x14)
Register 3 stores: (x3+x7+x11+x15)
Pipelining registers are used to increase the maximum clock frequency.
Register 4 stores the result of the transform operations:
  (x0+x4+x8+x12) + (x1+x5+x9+x13) + (x2+x6+x10+x14) + (x3+x7+x11+x15)
  2*(x0+x4+x8+x12) + (x1+x5+x9+x13) - (x2+x6+x10+x14) - 2*(x3+x7+x11+x15)
  (x0+x4+x8+x12) - (x1+x5+x9+x13) - (x2+x6+x10+x14) + (x3+x7+x11+x15)
  (x0+x4+x8+x12) - 2*(x1+x5+x9+x13) + 2*(x2+x6+x10+x14) - (x3+x7+x11+x15)

Quantization Hardware
QP ranges from 0 to 51.
$qbits = 15 + \lfloor QP/6 \rfloor$
AC Coefficients: $|Z_{ij}| = (|W_{ij}| \cdot MF + f) \gg qbits$, $\mathrm{sign}(Z_{ij}) = \mathrm{sign}(W_{ij})$
DC Coefficients: $|Z_{ij}| = (|Y_{ij}| \cdot MF + 2f) \gg (qbits + 1)$, $\mathrm{sign}(Z_{ij}) = \mathrm{sign}(Y_{ij})$

Inverse Quantization
AC Coefficients: $W'_{ij} = Z_{ij} \cdot V \cdot 2^{\lfloor QP/6 \rfloor}$
DC Coefficients:
  If $QP \ge 12$: $W'_{ij} = Wq_{ij} \cdot V \cdot 2^{\lfloor QP/6 \rfloor - 2}$
  Else: $W'_{ij} = [\, Wq_{ij} \cdot V + 2^{1 - \lfloor QP/6 \rfloor} \,] \gg (2 - \lfloor QP/6 \rfloor)$

Transform and Quantization Hardware
Hardware Implementation Results
In the worst case, it takes 2500 cycles to complete the TQIQIT operations (transform, quantization, inverse quantization, inverse transform) of a macroblock.

FPGA implementation
• Works at 81 MHz and can code 27 VGA frames per second
                      Excluding I/O Register Files   Including I/O Register Files
Function Generators   2497                           4054
CLB Slices            1249                           2027
Dffs or Latches       581                            583
Block Multipliers     1                              1

0.18 µm ASIC implementation
• Works at 210 MHz and can code 70 VGA frames per second
                                                                  Critical Path Delay [ns]   Gate Count
Transform part of the Datapath                                    2.77                       1978
Datapath                                                          4.78                       12773
Datapath + Control Unit                                           4.8                        23162
Datapath + Control + Input Register File + Output Register File   4.8                        130505

Context Adaptive Variable Length Encoder Hardware
1) After prediction, transformation and quantization, blocks typically contain zeros and ones
2) The highest-frequency non-zero coefficients after the zig-zag scan are often sequences of +/-1.
3) The number of non-zero coefficients in neighbouring blocks is correlated
4) The magnitude of non-zero coefficients tends to be higher at the start of the zig-zag-ordered coefficients

Intra Prediction Hardware
• 9 prediction modes for 4x4 luma blocks
• 4 prediction modes for 16x16 luma and 8x8 chroma blocks
[Figure: intra prediction hardware block diagram. Blocks: Inputs from Top-Level, Reconstructed Pixels, Address Generation Hardware, Neighbouring Buffers, Top Level Mode Controller, Internal Buffers, Controller and Datapath for 4x4 Luma Prediction Modes, Controller and Datapath for 16x16 Luma Prediction Modes, Controller and Datapath for 8x8 Chroma Prediction Modes, Output MUX, Prediction Buffer (384x8).]

Top Level Intra Frame Coder Hardware
[Figure: top-level datapath. Input Register File, SEARCH HARDWARE, Pipelining Register File, CODER HARDWARE, Output Register File.]
[Figure: macroblock pipeline timing. Functional units Search Hardware and Coder Hardware process the 1st to 4th MB; time axis in cycles, marked at 4000, 8000, 12000 and 16000.]
CIF @ 30 fps requires processing 11880 macroblocks per second (396 macroblocks per frame x 30 frames).

Cycles available per macroblock:
Level                 @30MHz   @40MHz   @50MHz   @60MHz   @70MHz   @80MHz
2.0 (CIF @ 30 fps)    2525     3367     4208     5050     5892     6734

Search Hardware
[Figure: search hardware block diagram; block labels follow.]
384 x 8 Current MB; Reg. for 16 DC coefs.; Luma 16x16 Intra Pred.; Residue; Mux; Neighbors; Hadamard Transform; Chroma 8x8; 384 x 8 Predicted MB; QP; 256 x 8 Current MB; Neighbors; Luma 4x4 Intra Pred.;
256 x 8 Predicted MB; Residue; Hadamard Transform; Mode Decision; Mode

Mode Decision
SATD based mode decision algorithm
1) Compute the cost of each 4x4 mode
   Select the 4x4 mode with lowest cost
2) Compute the cost of each 16x16 mode
   Select the 16x16 mode with lowest cost
3) Compute the cost of each 8x8 mode
   Select the 8x8 mode with lowest cost
4) Compare selected 4x4 and 16x16 costs and select the best mode
5) Start the coder hardware with selected mode information

Intra 4x4 vs Intra 16x16 Cost Comparator
[Figure: comparator datapath. Inputs Cost4x4 and Cost16x16; a << 3 shifter, Mux, Add/Sub unit (Add_sub control) and Register produce Result; signal widths of 9, 18 and 19 bits are marked.]
1. Cycle: Register = 8 x
2. Cycle: Register = 16 x
3. Cycle: Register = 24 x
4. Cycle: Register = 4x4cost + 24 x
5. Cycle: Register = 16x16cost - (4x4cost + 24 x )

High Speed Hadamard Transform Hardware
• Performs SATD computation
• 13-bit adders/subtractors
• Requires only 18 cycles for a 4x4 block
• Two-stage pipeline
[Figure: Hadamard transform datapath with inputs z0 to z15, a network of add/sub units, pipeline registers (P. Register) and an output Register.]

Coder Hardware
[Figure: coder hardware block diagram; block labels follow.]
384 x 8 Current MB; 384 x 9 Residue; Transform; Quant; 384 x 16 Reg. file; Reg. file; HT; IHT; CAVLC; 384 x 8 Predicted MB; Inverse Transform; Inverse Quant; 192 x 32; Intra Pred.; 16 x 16; Reconstruct; Reg. File; Reg.
File; 384 x 8 Reconstructed MB; Bitstream

Scheduling of Intra 4x4 modes
[Figure: schedule chart. Modules: Intra Prediction, Residue, TQ, IQIT, CAVLC, Reconstruction, shown for the 1st and 2nd blocks. Time axis (cycles) marked at 0, 24, 42, 86, 142, 160, 202, 246, 302, 320.]
Worst-case cycle counts required to complete a 4x4 block: TQIQIT = 100, CAVLC = 120, Residue & Reconstruction = 18, Intra Prediction = 24

Scheduling of Intra 16x16 modes
[Figure: schedule chart. Modules: Intra Prediction, Residue, TQ, HT, IQIT, CAVLC, Reconstruction, shown for the 1st, 2nd and 16th blocks. Time axis (cycles) marked at 0, 24, 42, 48, 75, 86, 130, 384, 402, 746, 800, 860, 880, 920, 1040.]

Implementation Results for H.264 Intra Frame Coder Hardware
• Synthesized at 61.4 MHz and placed & routed at 53.8 MHz
• The total equivalent gate count is 1,051,458

Device Utilization for XC2V8000 FPGA
Resources             Used     Available   Utilization
IOs                   418      1108        37.73%
Global Buffers        2        16          12.50%
Function Generators   21404    93184       22.97%
CLB Slices            10702    46592       22.97%
Dffs or Latches       3881     96508       4.02%
Block RAMs            1        168         0.60%
Block Multipliers     1        168         0.60%

System Overview
• A PC is used to develop Verilog modules and debug the system
• The Multi-ICE debugger communicates with the development board
• The development board is used for testing the designed hardware
• A color LCD panel is used for visual verification

ARM-based Development Platform
[Figure: development platform. Logic Tile: Xilinx Virtex II 8000 FPGA. Versatile Platform Baseboard: Xilinx Virtex II 2000 FPGA and ARM926EJ-S processor based Development Chip. Bus: ARM AMBA 2.0.]

Software Implementation
• Matlab and C codes are developed
• ARM AXD Tool is used to debug the system
• C codes run on the ARM926EJ-S processor
• SRAM available on the Logic Tile is used to store image data
[Figure: software flow; box labels follow.]
Capturing the image in RGB format; SRAM; Converting the image from RGB format to YCbCr format; Partitioning the image into macroblocks; H.264 Intra Frame Coder
Hardware; Displaying the reconstructed image; SRAM; Converting the image from YCbCr format to RGB format; 4:2:0 Sampling; SRAM; Reconstructing the image in raster-scan order

Hardware Implementation
• The ARM development board implements tri-state AHB buses.
• An AHB master is designed to read image data from and write it to the SRAMs available on the logic tile.
• 2 SRAM controllers are instantiated in the design as slaves on the AHB M1 and AHB M2 buses.
• The system arbiter controls the multiplexing.

Design Flow
[Figure: design flow chart. Verilog modules enter Leonardo Spectrum (Compiler, Logic Optimizer) with Synthesis Constraints and High Effort for Speed. "Constraints Met?" No: Modify HDL files or Modify Constraints. Yes: Netlist for XC2V8000. The netlist enters Xilinx Project Navigator (Translator, Mapper, Placer, Router) with Place and Route Constraints, High Effort for Speed and Bitstream Options. "Constraints Met?" No: Modify Constraints. Yes: Bitstream for XC2V8000.]

Conclusions
• Transform-Quant architecture is designed and verified to work at 81 MHz.
• Mode Decision, Intra Prediction and CAVLC are integrated.
• Top-level design is synthesized at 61.4 MHz and placed & routed at 53.8 MHz.
• Device utilization for the XC2V8000 FPGA is approximately 23% with a total equivalent gate count of 1,051,458.
• The H.264 Intra Frame Coder System is verified to work on an ARM Versatile Platform development board.

Future Work
• Implementing header generation functionality
• Further verification by decoding the generated bitstream using an H.264 compliant decoder
• Implementing low-power techniques such as clock gating
• Adding a camera to the system for real-time video capturing and coding
• Developing an ASIC implementation and fabricating a prototype
• Creating a complete H.264 video coding system by integrating motion estimation, motion compensation, deblocking filter, intra vs.
inter mode decision and rate control units

Thanks
Questions?
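As a supplementary check, the per-macroblock cycle budgets quoted for Level 2.0 (CIF @ 30 fps) on the Top Level Intra Frame Coder Hardware slide can be reproduced with simple arithmetic. The sketch below assumes the standard CIF resolution of 352x288 pixels and 16x16 macroblocks, which gives 396 macroblocks per frame; the function names are illustrative and do not come from the original design.

```python
def macroblocks_per_second(width, height, fps):
    """Number of 16x16 macroblocks that must be processed each second."""
    return (width // 16) * (height // 16) * fps


def cycles_per_macroblock(clock_hz, width=352, height=288, fps=30):
    """Whole clock cycles available per macroblock at the given clock rate.

    Defaults assume CIF (352x288) at 30 frames per second.
    """
    return clock_hz // macroblocks_per_second(width, height, fps)


if __name__ == "__main__":
    print("Macroblocks per second:", macroblocks_per_second(352, 288, 30))
    for mhz in (30, 40, 50, 60, 70, 80):
        budget = cycles_per_macroblock(mhz * 1_000_000)
        print(f"{mhz} MHz: {budget} cycles per macroblock")
```

Calling `cycles_per_macroblock(53_800_000)` gives the corresponding budget at the placed-and-routed clock frequency reported in the conclusions.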