Implementing Image Processing Pipelines in a Hardware/Software Environment

Heather Quinn (1), Dr. Miriam Leeser
Northeastern University
(1) hquinn@ece.neu.edu

Dr. Laurie Smith King
College of the Holy Cross

Motivation
  Accelerate image processing tasks through efficient use of FPGAs.
  Combine already designed components at runtime to implement series of transformations (pipelines).

Our Environment
  A heterogeneous network of workstations (NOW).
  FPGAs are expensive, available on some hosts but not others.
  NOW provide coarse-grained parallelism, FPGAs provide fine-grained parallelism.

Hardware Systems
  Designed with JHDL (from BYU).
  Input and output image in FPGA's on-board memory.
  Input image communicated from host to FPGA at beginning of processing.
  Output image communicated from FPGA to host at end of processing.
  Host and FPGA connected through PCI bus.

Software
  Designed in Java.
  Done on a single host.

Efficient Use of FPGAs
  Deciding at runtime to execute in software or hardware is simple for one algorithm processing one image.
  Using hardware incurs execution costs not present in software: hardware initialization, communicating the image, and reprogramming.
  The software algorithm's runtime for small images is less than the hardware costs.
  Profiling the hardware and software runtimes for different image sizes determines the crossover point.

[Figure: execution timeline. Labels: Start App, HW Init, Send Data, FPGA, Median, FPGA Repgm, Edge Det, Get Data, Display, Hardware Processing, Software processing.]

Median Filter and Edge Detection Profiles

[Figure: Median and Edge Runtimes (with Init Times). x axis: Total Pixels (0 to 8000). y axis: Milliseconds (0 to 1400). Series: median sw, median hw, edge sw, edge hw.]

Image Processing Pipelines
  Series of image processing algorithms applied to an image.
  Each algorithm has a software and hardware implementation.
  Finding the crossover point for a pipeline is complicated:
    Exponential number of implementations
    Reprogramming costs
    hw/sw boundaries
  Need a strategy to find a fast pipeline implementation at runtime.

An Example: Median Filter & Edge Detection

Possible Implementations
  1a) sw/sw implementation: SW Median Filter, SW Edge Detect
  1b) sw/hw implementation: SW Median Filter, Pad Image, HW Edge Detect, SW Remove Padding
  1c) hw/sw implementation: SW Pad Image, HW Median Filter, SW Remove Padding, Edge Detect
  1d) hw/hw implementation: SW Pad Image, HW Median Filter, RPRG, Fix Padding, Edge Detect, SW Remove Padding
  Red boxes are fixing image edges. Green boxes are reprogramming. Blue boxes are Edge Detect.

[Figure: Median to Edge Running Time (with Initialization Time). x axis: Total Pixels (0 to 8000). y axis: Milliseconds (0 to 2000). Series: sw/sw total, sw/hw total, hw/sw total, hw/hw total.]

The Library of Components
  Each algorithm has two implementations: hardware and software.
  Each implementation has known runtimes for a set of images.
  Interpolation used for rest of image sizes.
  Each hardware implementation has a known area size.
  All components are image in/image out.

Problem Statement
  Inputs: a profiled library of image processing implementations (components), a pipeline, and an image.
  Output: an assignment of each component to a hardware or software implementation.
  Need pipeline implementations that minimize reprogramming and communication costs.

Assumptions
  Reprogramming and communication costs incurred at sw/hw boundaries.
  Might need to fix image edge in between components.
  Problem sizes of 20 or fewer stages.
  500 ms to make a decision.

Solving Pipeline Assignment

Exhaustive Search
  Find: Optimal solutions
  How: Search entire problem space
  Algorithm Runtime: O(2^N), where N is the number of pipeline stages

ILP
  Find: Optimal solutions
  How: AMPL model running on CPLEX
  Need: ILP formulation of the problem statement
  Algorithm Runtime: Unknown

Greedy
  Find: Sub-optimal solutions
  How: Make optimal decisions for each pipeline stage based on hardware area usage and speedup values
  Algorithm Runtime: O(N), where N is the number of pipeline stages

Local Search
  Find: Sub-optimal solutions
  How: Improve upon initial solutions (found through Greedy or randomly)
  Algorithm Runtime: runs for user supplied amount of time

Experiments
  Synthetic components arranged into pipelines of length 1 to 20.
  Exhaustive algorithm run to completion:
    Used as a baseline for solution quality
    Timed to find 500 ms boundary
  ILP solver constrained to 500 ms:
    Ability to solve dependent on components
  Local Search returns best solution found within time limit.

Results

[Figure: Runtime for Exhaustive and ILP Algorithms. x axis: Problem Size (0 to 20). y axis: milliseconds (1 to 1000000, logarithmic). Series: Average Exhaustive Runtime, Maximum Exhaustive Runtime, Average ILP Runtime, Maximum ILP Runtime.]

  Optimal solutions in 500 ms
    Exhaustive: optimal solutions for up to 11 stages
    ILP: all pipelines up to 13 stages, some pipelines up to 18 stages, and none larger
  Sub-optimal solutions in 500 ms
    Greedy and local search: all problem sizes
  Strategy
    Exhaustive and ILP up to 13 stages
    Greedy or local search for more than 13 stages

Future Work
  ADAPT: Algorithm that calls exhaustive, ILP and local search algorithms to solve pipeline assignment problem based on problem size.
  Decision Time: Study how the amount of time allotted affects ADAPT results.
  Virtex II Pro: Add scheduling support for using embedded PowerPC cores.

Publications
  L. Smith King, H. Quinn, M. Leeser, D. Galatopoullos and E. S. Manolakos, “Run-time Execution of Reconfigurable Hardware in a Java Environment”, International Conference on Computer Design, September 2001.
  H. Quinn, M. Leeser, and L. Smith King, “Accelerating Image Processing in a Software/Hardware Environment”, MAPLD International Conference, September 2002.
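The exhaustive search over all 2^N hardware/software assignments can be illustrated with a short Python sketch. It is not the poster's implementation (which is described as Java-based). The cost model, the constants `HW_INIT_MS`, `COMM_MS` and `REPROGRAM_MS`, and the function names are all hypothetical, chosen only to show how initialization, communication at sw/hw boundaries, and reprogramming could enter the total. Edge fixing and hardware area are left out for brevity.

```python
from itertools import product

# Hypothetical cost constants in milliseconds; none come from the poster.
HW_INIT_MS = 100.0    # one-time hardware initialization if any stage uses the FPGA
COMM_MS = 20.0        # moving the image across a sw/hw boundary
REPROGRAM_MS = 50.0   # loading a hardware component onto the FPGA


def pipeline_cost(stages, assignment):
    """Total runtime of one assignment under the assumed cost model.

    stages: list of (sw_ms, hw_ms) runtimes, one pair per pipeline stage.
    assignment: tuple of "sw" or "hw", one entry per stage.
    """
    total = 0.0
    prev = "sw"  # assumption: the image starts on the host
    for (sw_ms, hw_ms), where in zip(stages, assignment):
        if where == "hw":
            total += REPROGRAM_MS       # each hardware stage loads its own component
            if prev == "sw":
                total += COMM_MS        # host -> FPGA
            total += hw_ms
        else:
            if prev == "hw":
                total += COMM_MS        # FPGA -> host
            total += sw_ms
        prev = where
    if prev == "hw":
        total += COMM_MS                # final image returns to the host
    if "hw" in assignment:
        total += HW_INIT_MS
    return total


def exhaustive_search(stages):
    """Try all 2^N assignments and return (best_assignment, best_cost)."""
    best = min(product(("sw", "hw"), repeat=len(stages)),
               key=lambda a: pipeline_cost(stages, a))
    return best, pipeline_cost(stages, best)
```

With made-up runtimes such as `exhaustive_search([(300.0, 40.0), (200.0, 30.0)])`, the search picks hardware for both stages, while very cheap software stages keep the whole pipeline in software because the fixed hardware costs dominate. The loop over `product` is what makes the runtime exponential in the number of stages.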
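The component library stores runtimes for a set of profiled image sizes and interpolates for the rest, and the crossover point is where hardware becomes at least as fast as software. A minimal Python sketch of that idea follows, assuming linear interpolation between profiled points and clamping outside the profiled range. The poster does not state which interpolation is used, and the function names and profile values here are hypothetical.

```python
from bisect import bisect_left


def interpolate_runtime(profile, pixels):
    """Estimate a runtime in ms from a profile sorted by pixel count.

    profile: list of (pixels, ms) pairs in ascending pixel order.
    Values outside the profiled range are clamped (an assumption).
    """
    sizes = [p for p, _ in profile]
    if pixels <= sizes[0]:
        return profile[0][1]
    if pixels >= sizes[-1]:
        return profile[-1][1]
    i = bisect_left(sizes, pixels)          # first profiled size >= pixels
    (x0, y0), (x1, y1) = profile[i - 1], profile[i]
    return y0 + (y1 - y0) * (pixels - x0) / (x1 - x0)


def crossover_point(sw_profile, hw_profile, max_pixels, step=1):
    """Smallest scanned image size where hardware is no slower than software.

    Returns None if hardware never catches up within max_pixels.
    """
    for pixels in range(0, max_pixels + 1, step):
        hw_ms = interpolate_runtime(hw_profile, pixels)
        sw_ms = interpolate_runtime(sw_profile, pixels)
        if hw_ms <= sw_ms:
            return pixels
    return None


# Made-up profiles: software grows quickly with image size, hardware starts
# with a fixed overhead (initialization, communication) but grows slowly.
sw_profile = [(0, 0.0), (8000, 800.0)]
hw_profile = [(0, 200.0), (8000, 400.0)]
crossover = crossover_point(sw_profile, hw_profile, 8000)
```

For a single algorithm this scan is enough to decide between hardware and software. For a pipeline the same profiles feed a search over assignments, since boundary costs make the per-stage crossover points interact.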