HW/SW Co-Design of an MPEG-2 Decoder
Pradeep Dhananjay Kiran Divakar Leela Kishore Kothamasu Anthony Weerasinghe

Outline
• Objectives
• Background
• HW-SW Partitioning
• SW/HW Design
• Testing and Debug
• VGA Display Driver
• Results
• Lessons Learned
• Future Work

Objectives
• Accelerate MPEG-2 Decoder
  – Identify bottlenecks
  – Isolate bottleneck functions and partition design
  – Convert SW functions to HW blocks
  – Design HW/SW interfaces for communication
  – Measure accelerated performance
• Design VGA display driver on FPGA
  – Attempt to display decoded stream in real time

Background
• Development Platform
  – TLL-5000 prototyping board
    • ARM9, Spartan-3 FPGA, VGA DAC (ADV7125)
• Source code for MPEG2 Decoder
  – Obtained from sourceforge.net

Background – MPEG2
• Consists of a sequence of groups of pictures (GOPs)
• Types of pictures
  – I-picture (Intra coded)
  – P-picture (Forward predicted)
  – B-picture (Bidirectionally predicted)

HW-SW Partitioning
• Linux profiling done to determine critical functions
  – Results based on a particular input (mpeg file)
  – Assumed to be representative of a typical use case
  – Profiling done on x86 Linux as well as on the board
    • gmon.out generated on board

Profiling on x86-Linux
[Bar chart, % time spent in function (axis 0 to 30): mpeg2_slice, MC_put_o_16_c, mpeg2_idct_copy_c, MC_avg_xy_16_c, MC_put_xy_16_c, get_mpeg1_non_intra_block, rgb_c_32_420, mpeg2_idct_add_c]

Profiling on ARM-Linux
[Bar chart, % time spent in function (axis 0 to 50): mpeg2_slice, MC_put_o_8_c, get_mpeg1_non_intra_block, MC_put_o_16_c, mpeg2_idct_copy_c, mpeg2_idct_add_c]

SW Design
• IDCT function uses pointers to access an input array
  – Not suitable for synthesis by Catapult-C
  – Converted all pointer accesses to array accesses
• IDCT performs non-sequential accesses with varying stride
  – Modified caller of the IDCT function to reorganize the access pattern into sequential form
  – Created temporary array, which is passed to the function
  – Return array from function is
    re-distributed to correct locations
• Changes to software verified using golden code

SW Flow Chart
• Software side, at each IDCT function call in the MPEG2 SW code
  – Create temporary buffer
  – Pass input values in temporary buffer to FPGA memory
  – Issue Start command to FPGA
  – Wait for interrupt
  – Read values from FPGA memory into temporary buffer
  – Store values from temporary buffer back to original array in order
• FPGA side, after the Start command
  – IDCT does computation and stores data back in FPGA memory
  – Generates interrupt signal after computation is done

HW Design
• Mentor Catapult-C Synthesis Tool
  – High-level synthesis from C/C++ to Verilog RTL

HW Design
• High-Level Synthesis
  – Tool schedules operations on a cycle-by-cycle basis
  – Constrained to available resources
    • Uses target device and library information
  – Built RTL as an interface + controller + datapath

Example: Y = A*C + B*D

HW Design
• Code conversion for synthesis
  – Isolate IDCT function from MPEG2 code
  – Merge initialization functions
    • One initialization construct was needed
  – Remove all global variables
    • Few dependencies for the IDCT function
  – Convert pointer arithmetic to array offsets
    • Most work needed for this conversion
    • No standard guidelines available

HW Design
• Pointer conversions

HW Design
• Hardware Interface

HW Design
• Verifying isolated IDCT function in C and RTL
  – C testbench written to test isolated IDCT function
  – Catapult-C allows testing of C function vs.
    RTL
    • Ensure RTL generation matches expected behavior
    • Un-converted pointer code generated wrong RTL

HW Design
• Integration with communication interface
  – Communication FSM given
  – Integrate IDCT block

Problems Faced
• IDCT RTL would not synthesize to 66 MHz
  – 27 MHz clock used instead
• IDCT code takes ~30 minutes to synthesize
  – Inefficiency of using Catapult-C to generate code
• Catapult code difficult to debug
• Some reads not returning correct values
  – Read/write alignment
  – Synthesis could be a problem

Debug Techniques
• Removed IDCT block for fast synthesis
  – Used to check interface memory writes
  – Showed 16-bit writes were not successful
• Routed state bits to board LEDs
  – Helpful when program hangs due to lack of DTACK
  – OR’d DTACK with DIP switch to prevent hang
• printf and printk statements to check addresses and data being sent

Delay Values
• Hardware delay
  – Approximately 10 μs to compute IDCT
    • Based on cycle count provided by Catapult-C and 27 MHz clock frequency of FPGA
• Pure software implementation
  – Approximately 30 μs
• Overhead for communication
  – ~15000 μs

VGA Display Block Diagram
[Block diagram labels: ARM (generated ppm files, VGA Application, VGA Driver); FPGA (VGA Controller, Main FSM, RAM 1, RAM 2); on-board ADV7125; Monitor]

VGA Hardware: ADV7125 Video DAC
• ADV7125 has triple 8-bit video DACs
• VGA DAC requires 8-bit R, G, B values
• Needs H-Sync and V-Sync signals

VGA Controller
• Used double buffer to store frame data
  – FIFO implementation didn’t work
    • ARM cannot keep up with the display data rate requirement
  – Frame resolution: 64×48
  – Each frame transfer requires 3072 words
  – Used 12 KB of RAM to implement double buffer
• One full frame transferred with a single driver call
  – Reduces system call overhead
  – Each call overhead ~26 μs
• Interrupt used to signal the user application
  – Application then fills the next buffer

VGA Display Demonstration

Lessons Learned
• Debugging on an FPGA is difficult!
• Hand-converting the C code to RTL could have been more efficient
• Create test bench to simulate ARM-FPGA communication
  – Allows quick debug of FPGA hardware
  – Visibility into internal signals
• Hardware partition should have a high computation-to-communication ratio
  – IDCT called many times with small computation time
  – ~10 μs of computation; ~15000 μs of communication

Future Work
• Fix erroneous reads from IDCT
• Integrate VGA display driver and MPEG2 Decoder

Thank you!
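
Appendix: Computation-to-Communication Estimate

As a rough check on the last lesson learned, the quoted delay figures can be combined into a per-call estimate. The sketch below assumes that the ~15000 μs communication overhead is paid on every IDCT call and that hardware compute time and communication time simply add. The function and macro names are illustrative and are not taken from the project code.

```c
#include <assert.h>

/* Approximate delays in microseconds, as quoted in the slides. */
#define HW_COMPUTE_US      10.0     /* IDCT on the FPGA at 27 MHz */
#define SW_COMPUTE_US      30.0     /* pure software IDCT on the ARM */
#define COMM_OVERHEAD_US   15000.0  /* ARM <-> FPGA transfer (assumed per call) */

/* Total time for one offloaded IDCT call: hardware compute plus communication. */
double offload_time_us(double hw_us, double comm_us)
{
    return hw_us + comm_us;
}

/* Speedup of offloading relative to software; values below 1 mean a slowdown. */
double offload_speedup(double sw_us, double hw_us, double comm_us)
{
    double total_us = offload_time_us(hw_us, comm_us);
    assert(total_us > 0.0);  /* guard against division by zero */
    return sw_us / total_us;
}

/* Largest per-call communication overhead at which offloading breaks even. */
double break_even_comm_us(double sw_us, double hw_us)
{
    return sw_us - hw_us;
}
```

With the quoted figures, one offloaded call costs about 15010 μs against 30 μs in software, roughly a 500× slowdown. Under these assumptions, offloading the IDCT only breaks even once the communication overhead falls below about 20 μs per call.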