HW/SW Co-Design of an
MPEG-2 Decoder
Pradeep Dhananjay
Kiran Divakar
Leela Kishore Kothamasu
Anthony Weerasinghe
Outline
• Objectives
• Background
• HW-SW Partitioning
• SW/HW Design
• Testing and Debug
• VGA Display Driver
• Results
• Lessons Learned
• Future Work
Objectives
• Accelerate MPEG-2 Decoder
– Identify bottlenecks
– Isolate bottleneck functions and partition design
– Convert SW functions to HW blocks
– Design HW/SW interfaces for communication
– Measure accelerated performance
• Design VGA display driver on FPGA
– Attempt to display decoded stream in real-time
Background
• Development Platform
– TLL-5000 prototyping board
• ARM9 processor, Spartan-3 FPGA, VGA DAC (ADV7125)
• Source code for MPEG2 Decoder
– Obtained from sourceforge.net
Background – MPEG2
• A stream consists of a sequence of groups of pictures (GOPs)
• Types of pictures
– I-picture (intra coded)
– P-picture (forward predicted)
– B-picture (bidirectionally predicted)
Background – MPEG2
HW-SW Partitioning
• Linux profiling done to determine critical functions
– Results based on one particular input (MPEG file)
– Assumed to be representative of a typical use case
– Profiling done both on x86 Linux and on the board
• gmon.out generated on board
Profiling on x86-Linux
[Bar chart: % time spent in function, axis 0 to 30. Functions shown: mpeg2_slice, MC_put_o_16_c, mpeg2_idct_copy_c, MC_avg_xy_16_c, MC_put_xy_16_c, get_mpeg1_non_intra_block, rgb_c_32_420, mpeg2_idct_add_c]
Profiling on ARM-Linux
[Bar chart: % time spent in function, axis 0 to 50. Functions shown: mpeg2_slice, MC_put_o_8_c, get_mpeg1_non_intra_block, MC_put_o_16_c, mpeg2_idct_copy_c, mpeg2_idct_add_c]
HW-SW Partitioning
SW Design
• IDCT function uses pointers to access an input array
– Not suitable for synthesis by Catapult-C
– Converted all pointer accesses to array accesses
• IDCT performs non-sequential accesses with varying stride
– Modified caller of the IDCT function to re-organize access pattern into sequential form
– Created temporary array, which is passed to function
– Return array from function is re-distributed to correct locations
• Changes to software verified using golden code
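The caller-side re-organization can be sketched as follows. This is a minimal illustration, not the project's code: `gather_block`, `scatter_block`, `idct_block`, and the stride value are hypothetical names, and the kernel body is a placeholder that only doubles each value instead of computing a real IDCT.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_DIM  8
#define BLOCK_SIZE (BLOCK_DIM * BLOCK_DIM)

/* Placeholder kernel (hypothetical): array indexing only, purely
 * sequential access. It doubles each value; it is NOT a real IDCT. */
static void idct_block(const int16_t in[BLOCK_SIZE], int16_t out[BLOCK_SIZE])
{
    for (int i = 0; i < BLOCK_SIZE; i++)
        out[i] = (int16_t)(in[i] * 2);
}

/* Gather an 8x8 block out of a strided buffer into a contiguous
 * temporary array, so the kernel sees a sequential access pattern. */
static void gather_block(const int16_t *frame, int stride,
                         int16_t tmp[BLOCK_SIZE])
{
    assert(stride >= BLOCK_DIM);
    for (int r = 0; r < BLOCK_DIM; r++)
        for (int c = 0; c < BLOCK_DIM; c++)
            tmp[r * BLOCK_DIM + c] = frame[r * stride + c];
}

/* Re-distribute the kernel's contiguous result to its strided locations. */
static void scatter_block(const int16_t tmp[BLOCK_SIZE], int16_t *frame,
                          int stride)
{
    assert(stride >= BLOCK_DIM);
    for (int r = 0; r < BLOCK_DIM; r++)
        for (int c = 0; c < BLOCK_DIM; c++)
            frame[r * stride + c] = tmp[r * BLOCK_DIM + c];
}
```

A caller would run `gather_block`, then `idct_block`, then `scatter_block`; elements outside the 8x8 block are left untouched.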
SW Flow Chart
[Flow chart: MPEG2 SW code and the IDCT block on the FPGA, linked by "start" and "interrupt" signals]
MPEG2 SW code, at the IDCT function call:
• Create temporary buffer
• Pass input values in temporary buffer to FPGA memory
• Issue Start command to FPGA
• Wait for interrupt
• Read values from FPGA memory to temporary buffer
• Store values from temp buffer back to original array in order
IDCT on FPGA, after start:
• IDCT does computation and stores data back in FPGA memory
• Generates interrupt signal after computation is done
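The start/interrupt handshake in this flow can be sketched in plain C with the FPGA side simulated in software. Everything here is an assumption made for illustration: `fpga_mem`, `fpga_issue_start`, `wait_for_interrupt`, and `idct_offload` are hypothetical names, the real design goes through a driver and memory-mapped FPGA RAM, and the simulated hardware only adds 1 to each value rather than computing an IDCT.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 64

/* Simulated FPGA-side state (assumed layout, not the real memory map). */
static int16_t fpga_mem[BLOCK_SIZE];
static volatile int irq_pending = 0;

/* Stand-in for the hardware block: adds 1 to every value so the round
 * trip is visible, then raises the "done" interrupt. */
static void fpga_sim_run(void)
{
    for (int i = 0; i < BLOCK_SIZE; i++)
        fpga_mem[i] = (int16_t)(fpga_mem[i] + 1);
    irq_pending = 1;
}

/* On hardware this would be a write to a control register. */
static void fpga_issue_start(void)
{
    fpga_sim_run();
}

/* Poll the flag; a real driver would sleep until the interrupt fires. */
static void wait_for_interrupt(void)
{
    while (!irq_pending) { }
    irq_pending = 0;
}

/* Software side of the flow: write inputs, start, wait, read back. */
static void idct_offload(int16_t tmp[BLOCK_SIZE])
{
    assert(!irq_pending);                    /* no stale interrupt */
    for (int i = 0; i < BLOCK_SIZE; i++)     /* temp buffer -> FPGA memory */
        fpga_mem[i] = tmp[i];
    fpga_issue_start();
    wait_for_interrupt();
    for (int i = 0; i < BLOCK_SIZE; i++)     /* FPGA memory -> temp buffer */
        tmp[i] = fpga_mem[i];
}
```

The caller passes the contiguous temporary buffer and afterwards copies its contents back to the original array.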
HW Design
• Mentor Catapult-C Synthesis Tool
– High level synthesis from C/C++ to Verilog RTL
HW Design
• High Level Synthesis
– Tool schedules operations on a cycle-by-cycle basis
– Constrained to available resources
• Uses target device and library information
– RTL built as an interface + controller + datapath
Example: Y = A*C + B*D
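As a rough illustration of resource-constrained scheduling for this expression, the sketch below gives the C source a tool would start from and a toy latency model. The model is an assumption, not Catapult-C's algorithm: it supposes each multiply and the add take one cycle with no operation chaining, and `mac2` and `schedule_cycles` are hypothetical names.

```c
#include <assert.h>

/* The expression as a C function: two multiplies feeding one add. */
static int mac2(int a, int b, int c, int d)
{
    return a * c + b * d;
}

/* Toy latency model (assumption: one cycle per operation, no chaining).
 * The two products need ceil(2 / num_multipliers) cycles, and the add
 * needs one more cycle after the last product. */
static int schedule_cycles(int num_multipliers)
{
    assert(num_multipliers >= 1);
    const int num_products = 2;
    int mult_cycles = (num_products + num_multipliers - 1) / num_multipliers;
    return mult_cycles + 1;
}
```

Under this model one multiplier gives a 3-cycle schedule and two multipliers give a 2-cycle schedule, which is the area versus latency trade-off the tool explores.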
HW Design
• Code conversion for synthesis
– Isolate IDCT function from MPEG2 code
– Merge initialization functions
• One initialization construct was needed
– Remove all global variables
• Few dependencies for the IDCT function
– Convert pointer arithmetic to array offsets
• Most work needed for this conversion
• No standard guidelines available
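A before/after pair shows the kind of rewrite involved. This is a hedged sketch on a made-up row-sum routine, not the decoder's IDCT; `sum_rows_ptr`, `sum_rows_idx`, and `verify_conversion` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_DIM  8
#define BLOCK_SIZE (BLOCK_DIM * BLOCK_DIM)

/* Before: pointer-walking style, with the pointers advanced in place. */
static int32_t sum_rows_ptr(const int16_t *block, int32_t row_sum[BLOCK_DIM])
{
    int32_t total = 0;
    for (int r = 0; r < BLOCK_DIM; r++) {
        const int16_t *p = block;
        int32_t s = 0;
        for (int c = 0; c < BLOCK_DIM; c++)
            s += *p++;
        row_sum[r] = s;
        total += s;
        block += BLOCK_DIM;          /* advance to the next row */
    }
    return total;
}

/* After: fixed-size array with explicit offsets, no pointer arithmetic. */
static int32_t sum_rows_idx(const int16_t block[BLOCK_SIZE],
                            int32_t row_sum[BLOCK_DIM])
{
    int32_t total = 0;
    for (int r = 0; r < BLOCK_DIM; r++) {
        int32_t s = 0;
        for (int c = 0; c < BLOCK_DIM; c++)
            s += block[r * BLOCK_DIM + c];
        row_sum[r] = s;
        total += s;
    }
    return total;
}

/* Check the converted version against the original ("golden") one. */
static void verify_conversion(const int16_t block[BLOCK_SIZE])
{
    int32_t golden[BLOCK_DIM], converted[BLOCK_DIM];
    int32_t t_ptr = sum_rows_ptr(block, golden);
    int32_t t_idx = sum_rows_idx(block, converted);
    assert(t_ptr == t_idx);
    for (int r = 0; r < BLOCK_DIM; r++)
        assert(golden[r] == converted[r]);
}
```

Keeping the pointer version around as a reference makes it cheap to confirm that each converted function still produces identical results.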
HW Design
• Pointer conversions
HW Design
• Hardware Interface
HW Design
• Verifying Isolated IDCT function in C and RTL
– C testbench written to test isolated IDCT function
– Catapult-C allows testing of C function vs. RTL
• Ensure RTL generation matches expected behavior
• Un-converted pointer code generated wrong RTL
HW Design
• Integration with communication interface
– Communication FSM given
– Integrate IDCT block
Problems Faced
• IDCT RTL would not synthesize to 66 MHz
– 27 MHz clock used instead
• IDCT code takes ~30 minutes to synthesize
– Inefficiency of using Catapult-C to generate code
• Catapult code difficult to debug
• Some reads did not return correct values
– Suspected causes: read/write alignment, or a synthesis problem
Debug Techniques
• Removed IDCT block for fast synthesis
– Used to check interface memory writes
– Showed 16 bit writes were not successful
• Routed state bits to board LEDs
– Helpful when program hangs due to lack of DTACK
– OR’d DTACK with DIP switch to prevent hang
• printf and printk statements to check addresses and data being sent
Delay Values
• Hardware Delay
– Approximately 10 μs to compute IDCT
• Based on cycle count provided by Catapult-C and 27 MHz clock frequency of FPGA
• Pure software implementation
– Approximately 30 μs
• Overhead for communication
– ~15000 μs
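A little arithmetic puts these figures side by side. This is a rough reading that assumes all three delays are per IDCT call, which the slide does not state explicitly; the cycle count is back-computed from the 10 μs figure, not taken from the Catapult-C report.

```latex
\begin{align*}
N_{\text{cycles}} &\approx t_{\text{HW}} \cdot f_{\text{clk}}
  = 10\,\mu\text{s} \times 27\,\text{MHz} = 270 \\
t_{\text{offload}} &\approx t_{\text{HW}} + t_{\text{comm}}
  = 10\,\mu\text{s} + 15000\,\mu\text{s} = 15010\,\mu\text{s} \\
\frac{t_{\text{offload}}}{t_{\text{SW}}}
  &\approx \frac{15010\,\mu\text{s}}{30\,\mu\text{s}} \approx 500
\end{align*}
```

Under these assumptions the IDCT kernel alone is about 3 times faster in hardware, but each offloaded call is roughly 500 times slower than the software version once communication is included.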
VGA Display Block Diagram
[Block diagram. ARM side: generated ppm files, VGA application, VGA driver. FPGA side: VGA controller with main FSM, RAM 1, and RAM 2. Output: on-board ADV7125, monitor.]
VGA Hardware: ADV7125 Video DAC
• ADV7125 has three 8-bit video DACs
• VGA DAC requires 8-bit R, G, and B values
• Needs H-Sync and V-Sync signals
VGA Controller
• Used double buffer to store frame data
– FIFO implementation didn’t work
• ARM cannot keep up with the display data rate requirement
– Frame resolution: 64×48
– Each frame transfer requires 3072 words
– Used 12 KB of RAM to implement double buffer
• One full frame transferred with single driver call
– Reduces system call overhead
– Each call overhead ~26 μs
• Interrupt used to signal the user application
– Application then fills the next buffer
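The double-buffer scheme can be modeled in a few lines of C. This is a software sketch under stated assumptions, not the FPGA design: the 16-bit word size is inferred from 12 KB / (2 buffers × 3072 words), and `vga_double_buffer`, `vga_write_frame`, `vga_swap`, and `vga_displayed` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define FRAME_W     64
#define FRAME_H     48
#define FRAME_WORDS (FRAME_W * FRAME_H)      /* 3072 words per frame */

typedef struct {
    /* Assumed 16-bit words: 2 buffers x 3072 words x 2 bytes = 12 KB. */
    uint16_t ram[2][FRAME_WORDS];
    int display_idx;                         /* buffer being scanned out */
} vga_double_buffer;

static void vga_init(vga_double_buffer *db)
{
    memset(db, 0, sizeof *db);
}

/* One call transfers a whole frame into the buffer that is NOT being
 * displayed, so the scan-out never sees a half-written frame. */
static void vga_write_frame(vga_double_buffer *db, const uint16_t *frame,
                            size_t words)
{
    assert(words == FRAME_WORDS);
    memcpy(db->ram[1 - db->display_idx], frame, words * sizeof(uint16_t));
}

/* Swap point: after this, the application can be signaled (the interrupt
 * in the real design) to fill the buffer that was just released. */
static void vga_swap(vga_double_buffer *db)
{
    db->display_idx = 1 - db->display_idx;
}

static const uint16_t *vga_displayed(const vga_double_buffer *db)
{
    return db->ram[db->display_idx];
}
```

A typical loop would call `vga_write_frame` once per frame and `vga_swap` at the end of each displayed frame, so the writer and the scan-out never touch the same RAM.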
VGA Display Demonstration
Lessons Learned
• Debugging on an FPGA is difficult!
• Hand-conversion of C code could have been more efficient
• Create test bench to simulate ARM-FPGA communication
– Allows quick debug of FPGA hardware
– Visibility into internal signals
• Hardware partition should have a high computation-to-communication ratio
– IDCT called many times with small computation time
– ~10 μs of computation; ~15000 μs of communication
Future Work
• Fix erroneous reads from IDCT
• Integrate VGA display driver and MPEG2 Decoder
Thank you!