Presentation - MIT Lincoln Laboratory

Sherman Braganza Prof. Miriam Leeser ReConfigurable Laboratory Northeastern University Boston, MA Outline  Introduction  Motivation   Optical Quadrature Microscopy Phase unwrapping  Algorithms  Minimum LP norm phase unwrapping  Platforms  Reconfigurable Hardware and Graphics Processors  Implementation  FPGA and GPU specifics  Verification details  Results  Performance  Power  Cost  Conclusions and Future Work Motivation – Why Bother With Phase Unwrapping?  Used in phase based imaging applications  IFSAR, OQM microscopy  High quality results are computationally expensive  Only difficult in 2D or higher Wrapped embryo image  Integrating gradients with noisy data  Residues and path dependency 0.1 0.3 0.1 0.3 -0.1 -0.3 -0.1 -0.2 No residues Residues Algorithms – Which One Do We Choose?  Many phase unwrapping algorithms  Goldstein’s, Flynns, Quality maps, Mask Cuts, multi-grid, PCG, Minimum LP norm (Ghiglia and Pritt, “Two Dimensional Phase Unwrapping”, Wiley, NY, 1998.  We need: High quality (performance is secondary)  Abilitity to handle noisy data  Choose Minimum LP Norm algorithm: Has the highest computational cost a) Software embryo unwrap Using matlab ‘unwrap’ b) Software embryo unwrap Using Minimum LP Norm Breaking Down Minimum LP Norm  Minimizes existence of differences between measured data and calculated data  Iterates Preconditioned Conjugate Gradient (PCG)  94% of total computation time  Also iterative  Two steps to PCG   Preconditioner (2D DCT, Poisson calculation and 2D IDCT) Conjugate Gradient Platforms – Which Accelerator Is Best For Phase Unwrapping?  FPGAs  Fine grained control  Highly parallel  Limited program memory  Floating point?  High implementation cost Xilinx Virtex II Pro architecture http://www.xilinx.com/ Platforms - GPUs  GPUs  Data parallel architecture   Less flexibility Floating point  Inter processor communication?  Lower implementation cost  Limited # of execution units  Large program memory G80 Architecture [nvidia.com/cuda] Platform Comparison FPGAs GPUs •Absolute control: Can specific custom •Need to fit application to bit-widths/architectures to optimally architecture suit application •Can have fast processor-processor communication •Multiprocessor-multiprocessor communication is slow •Low clock frequency •Higher frequency •High degree of implementation freedom => higher implementation effort. VHDL. •Relatively straightforward to develop for. Uses standard C syntax •Small program space. High reprogramming time •Relatively large program space. Low reprogramming time. Platform Description • FPGA and GPU on different platforms 4 years apart • Effects of Moore’s Law Platform specifications • Machine 3 in the Results: Cost section has a Virtex 5 and two Core2Quads Software unwrap execution time Implementation: Preconditioning On An FPGA  Need to account for bitwidth  Minimum of 28 bit needed – Use 24 bit + block exponent  Implement a 2D 1024x512 DCT/IDCT using 1D row/column decomposition  Implement a streaming floating point kernel to solve discretized Poisson equation 27 bit software unwrap 28 bit software unwrap Minimum P L Norm On A GPU  NVIDIA provides 2D FFT kernel  Use to compute 2D DCT  Can use CUDA to implement floating point solver  Few accuracy issues  No area constraints on GPU  Why not implement whole algorithm?  Multiple kernels, each computing one CG or LP norm step  One host to accelerator transfer per unwrap Verifying Our Implementations  Look at residue counts as algorithm progresses  Less than 0.1% difference  Visual inspection: Glass bead gives worst case results Software unwrap GPU unwrap FPGA unwrap Verifying Our Implementations  Differences between software and accelerated version GPU vs. Software FPGA vs. Software Results: FPGA  Implemented preconditioner in hardware and measured algorithm speedup  Maximum speedup assuming zero preconditioning calculation time : 3.9x  We get 2.35x on a V2P70, 3.69x on a V5 (projected) Results: GPU  Implemented entire LP norm kernel on GPU and measured algorithm speedup  Speedups for all sections except disk IO  5.24x algorithm speedup. 6.86x without disk IO Results: FPGAs vs. GPUs  Preconditioning only  Similar platform generation. Projected FPGA results.  Includes FPGA data transfer, not GPU  Buses? Currently use PCI-X for FPGA, PCI-E for GPU Data transfer Preconditioning Time Time (s) Computation 5 4.5 4 3.5 3 2.5 2 1.5 1 0.5 0 GPU FPGA 3 Core V5 (Projected) Results: Power  GPU power consumption increases significantly  FPGA power decreases Power consumption (W) Cost  Machine 3 includes an AlphaData board with a Xilinx Virtex 5 FPGA platform and two Core2Quads  Performance is given by 1/Texec  Proportional to FLOPs Machine 2 $2200 Machine 3 $10000 Performance/Cost ratio Performance/Cost ratio 4.00 3.50 3.00 2.50 2.00 1.50 1.00 0.50 0.00 Machine 2 (GPU) Machine 3 (V5 FPGA) Performance To Watt-Dollars • Metric to include all parameters Performance/(Cost, Power) 0.06 0.05 0.04 0.03 0.02 0.01 0.00 Machine 2 (GPU) Machine 3 (V5 FPGA) Conclusions And Future Work  For phase unwrapping GPUs provide higher performance  Higher power consumption  FPGAs have low power consumption  High reprogramming time  OQM: GPUs are the best fit. Cost effective and faster:  Images already on processor  FPGAs have a much stronger appeal in the embedded domain  Future Work  Experiment with new GPUs (GTX 280) and platforms (Cell, Larrabee, 4x2 multicore)  Multi-FPGA implementation Thank You! Any Questions? Sherman Braganza (braganza.s@neu.edu) Miriam Leeser (mel@coe.neu.edu) Northeastern University ReConfigurable Laboratory http://www.ece.neu.edu/groups/rcl

Presentation - MIT Lincoln Laboratory

Related documents

Products

Support

Presentation - MIT Lincoln Laboratory

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib