Sonic Millip3De:
Massively Parallel 3D Stacked Accelerator
for 3D Ultrasound
Richard Sampson* Ming Yang† Siyuan Wei†
Chaitali Chakrabarti† Thomas F. Wenisch*
*University of Michigan
†Arizona State University
Portable Medical Imaging Devices
• Medical imaging moving towards portability
– MEDICS (X-Ray CT) [Dasika ‘10]
– Handheld 2D Ultrasound [Fuller ‘09]
• Not just a matter of convenience
– Improved patient health [Gunnarsson ‘00, Weinreb ‘08]
– Access in developing countries
• Why ultrasound?
– Low transmit power [Nelson ‘10]
– No dangers or side-effects
Handheld 3D Ultrasound
• 3D has numerous benefits over 2D
– Easier to interpret images
– Greater volumetric accuracy
• … as well as many challenges
– 12k transducers, 10M image points
• 10-20x beyond state of the art
– High raw data bandwidth (6Tb/s)
• Major bottleneck in state of the art
– Tight handheld power budget (5W)
Why a Custom Accelerator?
• Software algorithms load/store intensive
– von Neumann designs inefficient
• Large system would require over 700 DSPs
– General purpose CPUs even less efficient
Architecture              Energy/Scanline (1 fps)   Time/Scanline (single core)
Intel Core i7-2670        25.08 J                   4.46 s
ARM Cortex-A8             33.04 J                   132.18 s
TI C6678 DSP              2.84 J                    2.27 s
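As a quick reading of the table, dividing each platform's energy per scanline by the DSP's shows how far the general-purpose CPUs trail. The `PLATFORMS` dictionary and `energy_ratio` helper are illustrative names only; the numbers are copied from the table.

```python
# Per-scanline figures from the table above: (energy in joules, time in seconds).
PLATFORMS = {
    "Intel Core i7-2670": (25.08, 4.46),
    "ARM Cortex-A8": (33.04, 132.18),
    "TI C6678 DSP": (2.84, 2.27),
}

def energy_ratio(a: str, b: str) -> float:
    """How many times more energy per scanline platform `a` uses than platform `b`."""
    return PLATFORMS[a][0] / PLATFORMS[b][0]

for name in PLATFORMS:
    print(f"{name}: {energy_ratio(name, 'TI C6678 DSP'):.1f}x the DSP's energy per scanline")
# Intel Core i7-2670: 8.8x the DSP's energy per scanline
# ARM Cortex-A8: 11.6x the DSP's energy per scanline
# TI C6678 DSP: 1.0x the DSP's energy per scanline
```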
Contributions
• Iterative delay calculation algorithm
– Reduces storage by over 400x
– Enables streaming data flow
• Sonic Millip3De design
– Leverages 3D die stacking technology
– Transform-select-reduce accelerator framework
• Power and image analysis of Sonic Millip3De
– Negligible change in image quality
– Able to meet 5W power budget by 11nm node
Outline
• Introduction
• Ultrasound background
• Algorithm design
• System design
– Sonic Millip3De
– Select Sub-Unit
• Results and analysis
• Conclusions
Ultrasound: Transmit and Receive
[Figure: transmit transducer, receive transducer, and focal points in the image space; round-trip delay 𝜏; receive raw channel data]
Each transducer stores an array of raw receive data
Ultrasound: Image Reconstruction
Image reconstructed from the raw data based on each focal point's round-trip delay
Images from each transducer combined to produce full frame
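A naive software version of this reconstruction (commonly called delay-and-sum) is sketched below for one scanline of a 2D slice. It assumes a single transmit from the array origin, receive transducers on a line at lateral offsets `x_i`, nearest-sample selection, and the f_s and c values from the parameter table; `delay_index` and `reconstruct_scanline` are hypothetical names, not the design's interfaces.

```python
import numpy as np

FS = 160e6  # interpolated sampling frequency f_s (Hz), from the parameter table
C = 1540.0  # speed of sound in tissue (m/s)

def delay_index(r, theta, x_i):
    """Round-trip delay in samples: transmit from the array origin to the focal
    point at (r, theta), then back to a receive transducer at lateral offset x_i."""
    return FS / C * (r + np.sqrt(r**2 + x_i**2 - 2 * r * x_i * np.sin(theta)))

def reconstruct_scanline(channel_data, x_positions, radii, theta):
    """Naive delay-and-sum for one scanline.

    channel_data: (num_transducers, num_samples) array of received samples
    x_positions:  lateral offset (m) of each receive transducer
    radii:        depth (m) of each focal point along the scanline
    """
    image = np.zeros(len(radii))
    for samples, x_i in zip(channel_data, x_positions):
        idx = np.rint(delay_index(radii, theta, x_i)).astype(int)  # nearest sample
        valid = idx < samples.shape[0]                             # ignore delays past the buffer
        image[valid] += samples[idx[valid]]                        # sum across transducers
    return image
```

With a single point scatterer, every channel contributes at the scatterer's focal point, so that image point sums to the number of transducers while the others stay near zero.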
Delay Index Calculation
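• Iterate through all image points $P$ for each transducer $i$ and calculate delay index $\tau_P$:
$$\tau_P = \frac{f_s}{c}\left(R_P + \sqrt{R_P^2 + X_i^2 - 2R_P X_i \sin\theta}\right)$$
– $R_P$: distance from the array origin to $P$; $\theta$: angle of $P$'s scanline; $X_i$: position of transducer $i$; $c$: speed of sound; $f_s$: sampling frequency
• Often done with lookup tables (LUTs) instead
• 50 GB LUT required for target 3D system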
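The delay formula can be checked numerically. Below, `delay_index` evaluates it directly, while `delay_index_cartesian` (a hypothetical cross-check) places $P$ at $(R_P\sin\theta, R_P\cos\theta)$ and the transducer at $(X_i, 0)$, assuming the transmit originates at the array origin and $\theta$ is measured from the array normal. The two agree because the square-root term is just the law of cosines for the receive path.

```python
import math

FS = 160e6  # interpolated sampling frequency f_s (Hz), from the parameter table
C = 1540.0  # speed of sound in tissue c (m/s)

def delay_index(r_p, theta, x_i, fs=FS, c=C):
    """Delay index tau_P (in samples) for focal point P at (r_p, theta) and a
    receive transducer at lateral position x_i, evaluated from the formula."""
    return fs / c * (r_p + math.sqrt(r_p**2 + x_i**2 - 2 * r_p * x_i * math.sin(theta)))

def delay_index_cartesian(r_p, theta, x_i, fs=FS, c=C):
    """Same quantity from explicit coordinates: transmit path from the origin
    to P, then the receive path from P back to the transducer at (x_i, 0)."""
    px, pz = r_p * math.sin(theta), r_p * math.cos(theta)
    return fs / c * (r_p + math.hypot(px - x_i, pz))

print(delay_index(0.03, math.pi / 16, 0.004), delay_index_cartesian(0.03, math.pi / 16, 0.004))
```

In hardware the result would be rounded to the nearest stored sample; evaluating this per point and per transducer is what the LUT approach precomputes.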
Challenges of Handheld 3D Ultrasound
• Delay index LUT requires too much storage
– New iterative algorithm reduces necessary
constant storage by 400x
• Peak raw data bandwidth (6Tb/s) infeasible
– Sub-aperture multiplexing reduces peak data rate,
but requires more transmits
• Handheld power budget very tight (5W)
– 3D stacked, highly parallel data streaming design
reconstructs images efficiently
Iterative Delay Index Calculation
• Deltas between adjacent focal points on a scanline form smooth curve
• Fit piecewise quadratic approx. to delta function
• Two sections sufficient for negligible error
[Figure: delta function with piecewise quadratic fit, Section 1 and Section 2]
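The fitting step can be sketched with NumPy: compute exact delays along one scanline, take deltas between adjacent focal points, and fit one quadratic $An^2 + Bn + C$ per section. The transducer offset, scanline angle, section boundary, and least-squares fit criterion are all assumptions for illustration; the slides do not say how the sections are chosen or fitted.

```python
import numpy as np

FS, C = 160e6, 1540.0          # f_s (Hz) and c (m/s) from the parameter table
X_I, THETA = 0.01, np.pi / 8   # hypothetical transducer offset (m) and scanline angle (rad)

def delay_index(r):
    """Exact delay index (in samples) for focal points at depths r."""
    return FS / C * (r + np.sqrt(r**2 + X_I**2 - 2 * r * X_I * np.sin(THETA)))

radii = np.linspace(0.001, 0.06, 4096)   # 4,096 focal points along the scanline
deltas = np.diff(delay_index(radii))     # delta between adjacent focal points

split = deltas.size // 4                 # hypothetical boundary between sections 1 and 2
sections = (slice(0, split), slice(split, None))

coeffs, approx = [], []
for s in sections:
    local_n = np.arange(deltas[s].size)   # n restarts at 0 in each section
    p = np.polyfit(local_n, deltas[s], 2) # [A, B, C] for A*n^2 + B*n + C
    coeffs.append(p)
    approx.append(np.polyval(p, local_n))
approx = np.concatenate(approx)

max_err = np.abs(approx - deltas).max()
print(f"max delta error: {max_err:.4f} samples")
```

Only the three coefficients per section need to be stored, instead of one delay per focal point.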
Sub-aperture Multiplexing
• Peak raw data bandwidth (6Tb/s) infeasible
• Solution: sub-aperture multiplexing
– Transmit multiple times from same location
– Receive with subset of transducers (sub-aperture)
– Sum images together
• Prior work: reduces data rate
• Our design: also reduces HW and power requirements
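The 6Tb/s figure is consistent with the parameter table if every transducer streams one 12-bit sample per 40 MHz ADC clock (an assumption: no compression or overhead). Receiving with one 1,024-transducer sub-aperture at a time divides the peak rate by the number of sub-apertures. The constant names are illustrative.

```python
TOTAL_TRANSDUCERS = 12_288
SUB_APERTURES = 12
SAMPLE_RATE_HZ = 40e6   # ADC sampling frequency
BITS_PER_SAMPLE = 12

def peak_rate_tbps(active_channels):
    """Raw receive bandwidth in Tb/s, assuming each active channel streams one
    12-bit sample per ADC clock."""
    return active_channels * SAMPLE_RATE_HZ * BITS_PER_SAMPLE / 1e12

full = peak_rate_tbps(TOTAL_TRANSDUCERS)
sub = peak_rate_tbps(TOTAL_TRANSDUCERS // SUB_APERTURES)
print(f"full aperture: {full:.2f} Tb/s, one sub-aperture: {sub:.2f} Tb/s")
# full aperture: 5.90 Tb/s, one sub-aperture: 0.49 Tb/s
```

The cost is 12 times as many transmits per frame, one per sub-aperture for each transmit source.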
System Design
Sonic Millip3De comprises 1,024 parallel pipelines
System Design: Transducers
Interchangeable CMOS transducer layer; can use older process
System Design: ADC/Storage
Separate storage layer to reduce wire lengths
System Design: Transform-Select-Reduce
Accelerator units in fast, low power process
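The transform-select-reduce structure can be illustrated in software as three composable stages per channel. This is a sketch under assumptions: the transform is taken to be the 4x upsampling from the parameter table, done here with linear interpolation as a stand-in (the slides do not name the filter); select indexes precomputed delay indices; reduce sums across channels. All function names are hypothetical.

```python
import numpy as np

INTERP = 4  # interpolation factor: 40 MHz raw samples -> 160 MHz

def transform_stage(raw):
    """Transform: upsample one channel by INTERP (linear interpolation as a stand-in)."""
    t_raw = np.arange(raw.size)
    t_fine = np.arange((raw.size - 1) * INTERP + 1) / INTERP
    return np.interp(t_fine, t_raw, raw)

def select_stage(fine, delay_indices):
    """Select: pick the interpolated sample at each focal point's delay index."""
    return fine[delay_indices]

def reduce_stage(selected_per_channel):
    """Reduce: sum the selected samples across channels, per focal point."""
    return np.sum(selected_per_channel, axis=0)

def beamform(raw_channels, delays_per_channel):
    """Compose the three stages for a set of channels."""
    return reduce_stage([select_stage(transform_stage(raw), d)
                         for raw, d in zip(raw_channels, delays_per_channel)])
```

Because each channel's transform and select are independent, they map naturally onto parallel per-channel pipelines, with only the reduce step combining channels.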
Select Sub-Unit Design
Selects the sample closest to each focal point using the iterative delay algorithm
Select Sub-Unit Design
[Figure: select sub-unit, Section 1 and Section 2]
All delays for a scanline estimated using 9 constants
$$A(n+1)^2 + B(n+1) + C = (An^2 + Bn + C) + 2An + (A + B)$$
Adders calculate next iteration of quadratic approximation
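The identity above means each step needs only additions: keep the current value and its first difference, and grow the difference by $2A$ per step (forward differencing). The function name is illustrative, and a hardware version would presumably use fixed-point registers rather than Python numbers.

```python
def quadratic_by_adders(a, b, c, steps):
    """Evaluate f(n) = a*n^2 + b*n + c for n = 0..steps-1 using only additions,
    following f(n+1) = f(n) + (2*a*n + (a + b))."""
    values = []
    f = c        # f(0)
    d = a + b    # first difference f(1) - f(0) = 2*a*0 + (a + b)
    dd = 2 * a   # the first difference grows by 2a each step
    for _ in range(steps):
        values.append(f)
        f += d   # adder 1: next value of the quadratic
        d += dd  # adder 2: next first difference
    return values
```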
Decrementor selects sample for next image focal point
Section decrementor indicates when to change constants
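The pieces of the select sub-unit can be modeled together in software: adders update the quadratic delta, a sample decrementor counts down the samples streaming past until the one nearest the next focal point, and a section decrementor counts focal points until the next section's constants are loaded. This is a hypothetical model: the `(A, B, C, num_points)` section format, the restart of $n$ at 0 in each section, round-to-nearest, and non-negative deltas are all assumptions, and it does not claim to match how the 9 hardware constants are laid out.

```python
def select_unit(samples, start_index, sections):
    """Hypothetical software model of the select sub-unit for one channel and scanline.

    samples:     interpolated samples for one channel, in arrival order
    start_index: sample index of the first focal point
    sections:    list of (A, B, C, num_points); within a section the delta between
                 focal points n and n+1 is A*n^2 + B*n + C (assumed non-negative)
    """
    out = []
    pos = 0                   # index of the current sample in the stream
    target = start_index      # running (possibly fractional) delay index
    for a, b, c, num_points in sections:
        delta, diff = c, a + b          # quadratic state, updated with adders only
        remaining = num_points          # section decrementor
        while remaining > 0:
            countdown = round(target) - pos  # sample decrementor
            while countdown > 0:             # let samples stream past
                pos += 1
                countdown -= 1
            out.append(samples[pos])         # sample nearest this focal point
            target += delta
            delta += diff
            diff += 2 * a
            remaining -= 1                   # at zero, load the next section's constants
    return out
```

For example, `select_unit(list(range(100)), 10, [(0, 0, 2, 5), (0, 0, 3, 3)])` selects samples 10, 12, 14, 16, 18 with the first section's constant delta of 2, then 20, 23, 26 after switching to the second section.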
System Parameters
Parameter                               Value
Sub-apertures                           12
Transmit Sources                        16
Transmits per Frame                     192
Transducers per Sub-aperture            1,024
Total Transducers                       12,288
Storage per Transducer                  4,096 x 12 bits
Focal Points per Scanline               4,096
Image Depth                             6 cm
Image Angular Width                     π/4
Sampling Frequency                      40 MHz
Interpolation Factor                    4x
Interpolated Sampling Frequency (fs)    160 MHz
Speed of Sound (tissue)                 1,540 m/s
Target Frame Rate                       1 fps
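A few of these parameters can be cross-checked with simple arithmetic. The sketch below assumes transmits per frame equals sub-apertures times transmit sources, and compares the time covered by one transducer's 4,096-sample buffer against the on-axis round trip to the full image depth (off-axis receive paths are longer and are not checked here).

```python
SUB_APERTURES = 12
TRANSMIT_SOURCES = 16
TRANSDUCERS_PER_SUB_APERTURE = 1_024
SAMPLES_PER_TRANSDUCER = 4_096
ADC_RATE_HZ = 40e6
INTERP_FACTOR = 4
IMAGE_DEPTH_M = 0.06
SPEED_OF_SOUND = 1540.0  # m/s

transmits_per_frame = SUB_APERTURES * TRANSMIT_SOURCES            # 192
total_transducers = SUB_APERTURES * TRANSDUCERS_PER_SUB_APERTURE  # 12,288
fs = ADC_RATE_HZ * INTERP_FACTOR                                  # 160 MHz

# Time covered by one transducer's buffer vs. the on-axis round trip to full depth.
buffer_time_s = SAMPLES_PER_TRANSDUCER / ADC_RATE_HZ              # 102.4 us
round_trip_s = 2 * IMAGE_DEPTH_M / SPEED_OF_SOUND                 # about 77.9 us

print(transmits_per_frame, total_transducers, fs, buffer_time_s, round_trip_s)
```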
Image Quality Comparison
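Simulations using Field II [Jensen ‘92, ‘95]
[Figure: simulated images for the ideal system, our design (12 bit), and an 11-bit design]
Bits     CNR (contrast-to-noise ratio)
Ideal    2.972
14       2.942
13       2.960
12       2.942
11       2.536
10       2.233
At 12 bits, our design's CNR (2.942) is within about 1% of the ideal system (2.972)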
Power Analysis and Scaling
[Figure: power (W) by component (transducers, ADC, SRAM, accelerator, network wires, memory interface, DRAM) at the 45, 32, 22, 16, and 11 nm technology nodes]
Can meet 5W by 11nm node
Conclusions
• 3D die stacked Sonic Millip3De design is able
to meet 5W power budget by 11nm
• Algorithm/HW co-design enables
order-of-magnitude gains
– Power and output quality goals often in conflict
– Need guidance from domain experts to balance
• Architects have much to offer for
application-specific system designs
Questions?
Special thanks to:
Brian Fowlkes
Oliver Kripfgans
Ron Dreslinski