An Energy Efficient Time-sharing Pyramid Pipeline for
Multi-resolution Computer Vision
Qiuling Zhu, Navjot Garg, Yun-Ta Tsai, Kari Pulli
NVIDIA
Applications of Multi-resolution Processing
 Panorama Stitching
 HDR
 Detail Enhancement
 Optical Flow
Background: Linear Pipeline and Segment Pipeline
 Linear pipeline: Duplicate processing elements (PEs) for each pyramid level; all PEs work together on all pyramid levels in parallel
• Pro: Less demand on off-chip memory bandwidth
• Con: Inefficient use of the PE resources; area and power overhead
 Segment pipeline: A recirculating design that uses one processing element to generate all pyramid levels, one level after another
• Pro: Saves computational resources
• Con: Requires very high memory bandwidth

Our Approach: Time-Sharing Pipeline Architecture
A combined approach:
 The same PE works on all the pyramid levels in parallel, in a time-sharing pipeline manner
 Each work cycle, compute:
-> 1 pixel for G2 (the coarsest level)
-> 4 pixels for G1
-> 16 pixels for G0 (the finest level)
-> next cycle, back to G2, and so forth
 One single PE runs at full speed, as in the segment pipeline
 As low memory traffic as the linear pipeline
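The round-robin work-cycle schedule above (1 pixel for G2, 4 for G1, 16 for G0) can be sketched in a few lines of Python. This is an illustrative software model only, not the authors' hardware; the level numbering and the `work_cycle` helper are assumptions made here.

```python
# Illustrative model (not the authors' RTL) of the time-sharing work cycle:
# one PE emits 1 pixel of G2, 4 of G1, and 16 of G0 per cycle, then repeats.

def work_cycle(levels=3, decimation=4):
    """Return the (level, slot) sequence for one work cycle.
    Level 0 is the finest (G0); the coarsest level gets a single slot."""
    schedule = []
    for level in range(levels - 1, -1, -1):         # coarsest first: G2, G1, G0
        slots = decimation ** (levels - 1 - level)  # 1, 4, 16 pixels per cycle
        schedule.extend((level, slot) for slot in range(slots))
    return schedule

cycle = work_cycle()
counts = {lvl: sum(1 for l, _ in cycle if l == lvl) for lvl in range(3)}
# 21 PE slots per cycle, split 16:4:1 across G0:G1:G2
```

Because each coarser level has 4x fewer pixels, this 16:4:1 interleaving keeps the single PE busy in every slot while producing all pyramid levels at matched rates.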
Time-Sharing Pipeline in Gaussian Pyramid and Laplacian Pyramid
(Block diagrams: Gaussian and Laplacian pyramid engines built from a single PE, a linebuffer pyramid, and a timing MUX)
 The convolution engine can be replaced with other processing elements for a more complicated multi-resolution pyramid system

Time-sharing Pipeline in Optical Flow Estimation (L-K)
 Three time-sharing pipelines work simultaneously:
• Two for Gaussian pyramid construction (fine to coarse scale)
• One for motion estimation (coarse to fine scale)
 Only needs to read the two source images from the main memory and write the resulting motion vectors back to the memory

Line Buffer, Sliding Window Registers and Blocklinear
 Pixels stream into the on-chip line buffer for temporary storage, feeding the sliding-window operations
 Line buffer size is proportional to the image width, making the line buffer cost huge for high-resolution images
 Blocklinear image processing, inspired by the GPU block-linear texture memory layout:
• The linebuffer width is equal to the block width, which significantly reduces the linebuffer size
• Data are refetched at the block boundaries
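The blocklinear idea can be checked with a small software sketch: process the image in vertical block columns with a one-pixel halo refetched at each boundary, and compare against a full-frame pass. The 3x3 box filter, the pure-Python arrays, and the helper names are illustrative assumptions, not the poster's actual PE.

```python
# Toy check (not the authors' hardware) that blocklinear processing with a
# halo refetch at block boundaries reproduces the full-frame result exactly,
# while the linebuffer only needs to span block_w + 2 pixels, not the image width.

def conv3x3(img):
    """Full-frame 3x3 box filter with zero padding (pure-Python reference)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            s = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        s += img[yy][xx]
            out[y][x] = s / 9.0
    return out

def conv3x3_blocklinear(img, block_w):
    """Filter the image in vertical block columns of width block_w.
    Each block refetches a one-pixel halo column at its boundaries, so
    interior block edges see their true neighbors."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for x0 in range(0, w, block_w):
        x1 = min(x0 + block_w, w)
        lo, hi = max(0, x0 - 1), min(w, x1 + 1)   # halo refetch at boundaries
        block = [[row[x] for x in range(lo, hi)] for row in img]
        filtered = conv3x3(block)
        for y in range(h):
            for x in range(x0, x1):               # keep only the block's own columns
                out[y][x] = filtered[y][x - lo]
    return out

img = [[(3 * y + x) % 7 for x in range(12)] for y in range(8)]
full = conv3x3(img)
blocked = conv3x3_blocklinear(img, block_w=4)
# blocked == full: the halo refetch preserves exact results
```

The refetched halo columns are the "data refetch at boundary" cost; in exchange, the on-chip storage per block row shrinks from the image width to the block width plus halo.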
Hardware Synthesis in 32nm CMOS
 A Genesis-based chip generator encapsulates all the design parameters (e.g., window size, pyramid levels) and allows the automated generation of synthesizable HDL hardware for design space exploration (hardware chip generator GUI)
 Design points run at 500 MHz in 32nm CMOS
(Block diagram of a convolution-based time-sharing pyramid engine, e.g., a 3-level Gaussian pyramid engine with a 3x3 convolution window)

Area Evaluation
Time-Sharing Pipeline (TP) vs. Linear Pipeline (LP)
 TP consumes much less PE area, thanks to time-sharing the same PE among the different pyramid levels
 The cost of the extra shift registers and control logic for the time-sharing configuration is negligible compared with the reduction in PE cost
Time-Sharing Pipeline (TP) vs. Segment Pipeline (SP)
 TP consumes increasingly more area than SP as the number of pyramid levels grows
 The overhead of TP over SP is fairly small for designs with small windows

Memory Bandwidth Evaluation
 TP's DRAM traffic is an order of magnitude less than SP's, which translates into energy savings
 TP only reads the source images from DRAM and writes the resulting motion vectors back to DRAM
 All other intermediate memory traffic is completely eliminated

Overall Performance & Energy Evaluation
 TP is almost 2x faster than SP
 TP is only slightly slower than LP, while eliminating all the duplicated logic costs
 Energy consumption is dominated by DRAM accesses
 vs. SP: 10x saving on DRAM accesses (log scale); similar on-chip memory access and logic processing cost
 vs. LP: similar DRAM access cost, but less energy spent on on-chip logic processing

BlockLinear Design Evaluation
 P(N) = parallel degree; B(N) = number of blocks
 Increasing the number of blocks reduces the linebuffer area while keeping the same throughput
 The chart demonstrates the various design trade-offs

Simulation Result of Hierarchical Lucas-Kanade
 Optical flow (velocity) on a benchmark image with a left-to-right movement: (a) source image frames, (b) segment-pipeline optical flow, (c) time-sharing-pipeline optical flow
 The proposed TP-based implementation produces the same motion vectors as the SP-based implementation, validating the approach

http://www.c2s2.org
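As a back-of-the-envelope companion to the bandwidth claims above, one can tally the whole-image DRAM transfers per frame pair. This is a toy word count under loudly stated assumptions (3-level pyramids, 1080p frames, one 2-component flow vector per pixel, the `traffic` helper invented here), not the paper's measurements; the measured gap (10x) is larger than this simple tally, since it also reflects traffic this model does not count.

```python
# Toy DRAM-traffic tally (illustrative assumptions, not measured data).
# SP spills every intermediate pyramid level to DRAM and reads it back;
# TP keeps all intermediates in on-chip linebuffers.

def pyramid_pixels(n, levels=3):
    """Pixel count per pyramid level, with 4x decimation per level."""
    return [n // (4 ** l) for l in range(levels)]

def traffic(n, levels=3):
    """Return (sp_words, tp_words): whole-image DRAM traffic per frame pair."""
    px = pyramid_pixels(n, levels)
    src = 2 * px[0]                      # read the two source frames
    vectors = 2 * px[0]                  # write one 2-component vector per pixel
    intermediates = 2 * 2 * sum(px[1:])  # SP: write + read coarser levels, both frames
    return src + vectors + intermediates, src + vectors

sp_words, tp_words = traffic(1920 * 1080)
# TP's tally has no intermediate term: that traffic is fully eliminated
```

Whatever the frame size, TP's total is just sources in plus vectors out; everything between the pyramid levels stays on chip.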