An Energy Efficient Time-sharing Pyramid Pipeline for Multi-resolution Computer Vision
Qiuling Zhu, Navjot Garg, Yun-Ta Tsai, Kari Pulli (NVIDIA)

Applications of Multi-resolution Processing
Image pyramids underlie many vision and computational-photography tasks, including panorama stitching, HDR imaging, detail enhancement, and optical flow.

Background: Linear Pipeline and Segment Pipeline
Linear pipeline: duplicates the processing elements (PEs) for each pyramid level; all PEs work together on all pyramid levels in parallel.
• Pro: low demand on off-chip memory bandwidth
• Con: inefficient use of PE resources; area and power overhead
Segment pipeline: a recirculating design that uses a single PE to generate all pyramid levels, one level after another.
• Pro: saves computational resources
• Con: requires very high memory bandwidth

Our Approach: Time-sharing Pipeline Architecture
A combined approach: the same PE serves all pyramid levels in parallel, in a time-sharing pipeline manner. In each work-cycle the PE computes 1 pixel of G2 (the coarsest level), then 4 pixels of G1, then 16 pixels of G0 (the finest level); the next cycle starts again at G2, and so forth.
• The single PE runs at full speed, as in a segment pipeline
• Memory traffic is as low as in a linear pipeline
The block diagram shows a convolution-based time-sharing pyramid engine (e.g., a 3-level Gaussian pyramid engine with a 3x3 convolution window), built from a single PE, a line-buffer pyramid, and a timing MUX. The convolution engine can be replaced with other processing elements to form a more complicated multi-resolution pyramid system.

Application Demonstration: Gaussian Pyramid, Laplacian Pyramid, and Hierarchical Lucas-Kanade Optical Flow
In the optical-flow engine, three time-sharing pipelines work simultaneously:
• Two for Gaussian pyramid construction (fine to coarse scale)
• One for motion estimation (coarse to fine scale)
The engine only needs to read the two source images from main memory and write the resulting motion vectors back to memory.

Line Buffer, Sliding-Window Registers, and Block-linear Processing
For sliding-window operations, pixels stream into an on-chip line buffer for temporary storage. The line-buffer size is proportional to the image width, which makes the line-buffer cost for high-resolution images huge. Block-linear image processing, inspired by the GPU block-linear texture memory layout, reduces the line-buffer size significantly: the line-buffer width equals the block width, at the cost of refetching data at block boundaries.

Hardware Synthesis in 32nm CMOS
A Genesis-based chip generator (with a GUI) encapsulates all design parameters (e.g., window size, pyramid levels) and automatically generates synthesizable HDL, enabling design-space exploration. All design points run at 500 MHz in 32nm CMOS.

Area Evaluation
Time-Sharing Pipeline (TP) vs. Linear Pipeline (LP): TP consumes much less PE area because the same PE is time-shared among the pyramid levels; the cost of the extra shift registers and control logic needed for time-sharing is negligible compared with the saved PE cost.
TP vs. Segment Pipeline (SP): TP consumes increasingly more area than SP as the number of pyramid levels grows, but the overhead remains fairly small for designs with small windows.

Memory Bandwidth Evaluation
TP's DRAM traffic is an order of magnitude less than SP's: TP accesses DRAM only to read the source images and to write the resulting motion vectors back; all other intermediate memory traffic is completely eliminated.

Overall Performance & Energy Evaluation
TP is almost 2x faster than SP, and only slightly slower than LP while eliminating the duplicated-PE logic cost. Energy consumption is dominated by DRAM accesses:
• vs. SP: about 10x saving on DRAM accesses (log scale), with similar on-chip memory and logic-processing costs
• vs. LP: similar DRAM access cost, but lower energy on on-chip logic processing

http://www.c2s2.org

Simulation Result: Hierarchical Lucas-Kanade Optical Flow
[Figure: (a) source image frames; (b) segment-pipeline optical flow; (c) time-sharing-pipeline optical flow]
P(N) = parallel degree. B(N) = number of blocks.
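As a rough illustration of the B(N) trade-off, the sketch below models line-buffer storage and boundary-refetch overhead as functions of the number of blocks. The cost functions are assumptions made for illustration, not the poster's measured synthesis data.

```python
# Illustrative block-linear cost model (assumed, not the poster's synthesis
# results). Splitting a W-pixel-wide image into B vertical blocks shrinks
# the line buffer to one block width per buffered line, but each interior
# block boundary forces a refetch of (win - 1) halo columns for a
# win x win sliding window.

def linebuffer_pixels(width, win, blocks):
    """On-chip line-buffer storage: (win - 1) buffered lines, each one
    block wide plus a (win - 1)-column halo."""
    return (win - 1) * (width // blocks + (win - 1))

def refetch_overhead(width, height, win, blocks):
    """Extra DRAM pixels fetched at the (blocks - 1) interior boundaries."""
    return (blocks - 1) * (win - 1) * height

# Trade-off for a 1080p image and a 5x5 window: more blocks means a smaller
# line buffer and more boundary refetch, while throughput is unchanged.
for b in (1, 2, 4, 8):
    print(b, linebuffer_pixels(1920, 5, b), refetch_overhead(1920, 1080, 5, b))
```

With a single block the model degenerates to a conventional full-width line buffer with zero refetch, matching the baseline design.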
Increasing the number of blocks reduces the line-buffer area while keeping the same throughput; the chart demonstrates these design trade-offs. Optical flow (velocity) is shown on a benchmark image with a left-to-right movement: the proposed TP-based implementation produces the same motion vectors as the SP-based implementation, validating the approach.
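The coarse-to-fine scheme that the simulation validates can be sketched in software. The NumPy toy below recovers a single global translation between two frames (the hardware engine computes a dense motion-vector field), but the hierarchical propagation — estimate at the coarsest level, double the shift, refine at the next finer level — is the same idea; the image, pyramid depth, and iteration counts are made-up test values.

```python
# Coarse-to-fine (hierarchical) Lucas-Kanade, sketched in NumPy. This toy
# version estimates one global translation; the hardware produces a dense
# motion-vector field, but the coarse-to-fine propagation is analogous.
import numpy as np

def fshift(img, dy, dx):
    """Periodic sub-pixel shift: out(y, x) = img(y - dy, x - dx)."""
    fy = np.fft.fftfreq(img.shape[0])[:, None]
    fx = np.fft.fftfreq(img.shape[1])[None, :]
    return np.real(np.fft.ifft2(np.fft.fft2(img)
                                * np.exp(-2j * np.pi * (fy * dy + fx * dx))))

def downsample(img):
    """2x2 box-filter decimation (a simplified Gaussian-pyramid step)."""
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2]
                   + img[0::2, 1::2] + img[1::2, 1::2])

def lk_refine(a, b, d, iters=10):
    """Iteratively refine the shift d = [dy, dx] so that b ~ fshift(a, *d)."""
    for _ in range(iters):
        bw = fshift(b, -d[0], -d[1])             # warp b back by the estimate
        gy, gx = np.gradient(a)                  # spatial gradients
        g = np.stack([gy.ravel(), gx.ravel()], axis=1)
        r, *_ = np.linalg.lstsq(g, -(bw - a).ravel(), rcond=None)
        d = d + r                                # Lucas-Kanade update
    return d

def hierarchical_lk(a, b, levels=3):
    """Estimate motion coarsest-first, doubling the shift at each level."""
    pyr = [(a, b)]
    for _ in range(levels - 1):
        pa, pb = pyr[-1]
        pyr.append((downsample(pa), downsample(pb)))
    d = np.zeros(2)
    for i, (pa, pb) in enumerate(reversed(pyr)):  # coarse -> fine
        if i:
            d = 2.0 * d                           # propagate to finer level
        d = lk_refine(pa, pb, d)
    return d

# Synthetic benchmark: a Gaussian blob moved right by 5 px and down by 1 px.
y, x = np.mgrid[0:64, 0:64].astype(float)
a = np.exp(-((x - 32) ** 2 + (y - 32) ** 2) / (2 * 8.0 ** 2))
b = fshift(a, 1.0, 5.0)
print(hierarchical_lk(a, b))   # close to [1, 5]
```

Starting the solver at the coarsest level keeps each per-level residual shift small, which is what makes the linearized Lucas-Kanade update valid for large overall motion.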
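The interleaved work-cycle described earlier (1 pixel of G2, then 4 of G1, then 16 of G0, then back to G2) can be written down as a simple schedule generator. The Python rendering and function name are illustrative, not the poster's HDL.

```python
# Time-sharing schedule for an N-level pyramid: in one work-cycle the shared
# PE produces 1 pixel of the coarsest level, then 4x more pixels for each
# successively finer level (1 for G2, 4 for G1, 16 for G0 in a 3-level
# pyramid); the next work-cycle starts over at the coarsest level.

def work_cycle_schedule(levels=3):
    """Return the sequence of pyramid levels served in one work-cycle."""
    slots = []
    for k in range(levels):              # k = 0 is the coarsest level
        level = levels - 1 - k           # G2, G1, G0 for levels = 3
        slots += [f"G{level}"] * 4 ** k  # 1, 4, 16 pixels per cycle
    return slots

sched = work_cycle_schedule(3)
print(len(sched), sched)   # 21 slots: 1x G2, 4x G1, 16x G0
```

Because every slot keeps the single PE busy, it runs at full speed as in a segment pipeline, while each level's data stays on chip as in a linear pipeline.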