Sonic Millip3De: Massively Parallel 3D Stacked Accelerator for 3D Ultrasound
Richard Sampson*, Ming Yang†, Siyuan Wei†, Chaitali Chakrabarti†, Thomas F. Wenisch*
*University of Michigan  †Arizona State University

Portable Medical Imaging Devices
• Medical imaging moving towards portability
  – MEDICS (X-Ray CT) [Dasika ‘10]
  – Handheld 2D Ultrasound [Fuller ‘09]
• Not just a matter of convenience
  – Improved patient health [Gunnarsson ‘00, Weinreb ‘08]
  – Access in developing countries
• Why ultrasound?
  – Low transmit power [Nelson ‘10]
  – No dangers or side-effects

Handheld 3D Ultrasound
• 3D has numerous benefits over 2D
  – Easier to interpret images
  – Greater volumetric accuracy
• … as well as many challenges
  – 12k transducers, 10M image points
    • 10-20x beyond state of the art
  – High raw data bandwidth (6 Tb/s)
    • Major bottleneck in state of the art
  – Tight handheld power budget (5 W)

Why a Custom Accelerator?
• Software algorithms load/store intensive
  – von Neumann designs inefficient
• Large system would require over 700 DSPs
  – General purpose CPUs even less efficient

Architecture          Energy/Scanline (1 fps)   Single Core Time/Scanline
Intel Core i7-2670    25.08 J                   4.46 s
ARM Cortex-A8         33.04 J                   132.18 s
TI C6678 DSP          2.84 J                    2.27 s

Contributions
• Iterative delay calculation algorithm
  – Reduces storage by over 400x
  – Enables streaming data flow
• Sonic Millip3De design
  – Leverages 3D die stacking technology
  – Transform-select-reduce accelerator framework
• Power and image analysis of Sonic Millip3De
  – Negligible change in image quality
  – Able to meet 5 W power budget by 11 nm node

Outline
• Introduction
• Ultrasound background
• Algorithm design
• System design
  – Sonic Millip3De
  – Select Sub-Unit
• Results and analysis
• Conclusions

Ultrasound: Transmit and Receive
[Figure: image space with focal points P, transmit transducer, receive transducers, and received raw channel data]
Each transducer stores array of raw receive data

Ultrasound: Image Reconstruction
• Image reconstructed from data based on round trip delay
• Images from each transducer combined to produce full frame

Delay Index Calculation
• Iterate through all image points $P$ for each transducer and calculate delay index $\tau_P$

  $$\tau_P = \frac{f_s}{c}\left(R_P + \sqrt{R_P^2 + X_i^2 - 2 R_P X_i \sin\theta}\right)$$

• Often done with lookup tables (LUTs) instead
• 50 GB LUT required for target 3D system

Challenges of Handheld 3D Ultrasound
• Delay index LUT requires too much storage
  – New iterative algorithm reduces necessary constant storage by 400x
• Peak raw data bandwidth (6 Tb/s) infeasible
  – Sub-aperture multiplexing reduces peak data rate, but requires more transmits
• Handheld power budget very tight (5 W)
  – 3D stacked, highly parallel data streaming design reconstructs images efficiently

Iterative Delay Index Calculation
• Deltas between adjacent focal points on a scanline form smooth curve
• Fit piecewise quadratic approx.
  to delta function
• Two sections sufficient for negligible error
[Figure: delta curve divided into Section 1 and Section 2]

Sub-aperture Multiplexing
• Peak raw data bandwidth (6 Tb/s) infeasible
• Solution: sub-aperture multiplexing
  – Transmit multiple times from same location
  – Receive with subset of transducers (sub-aperture)
  – Sum images together
• Prior work: reduce data rate
• Our design: also reduces HW and power requirements

System Design
Sonic Millip3De comprises 1,024 parallel pipelines

System Design: Transducers
Interchangeable CMOS transducer layer; can use older process

System Design: ADC/Storage
Separate storage layer to reduce wire lengths

System Design: Transform-Select-Reduce
Accelerator units in fast, low power process

Select Sub-Unit Design
[Figure: select sub-unit datapath, with the delta curve divided into Section 1 and Section 2]
• Selects sample closest to each focal point using our algorithm
• All delays for a scanline estimated using 9 constants
• Adders calculate next iteration of quadratic approximation

  $$A(n+1)^2 + B(n+1) + C = (An^2 + Bn + C) + 2An + (A + B)$$

• Decrementor selects sample for next image focal point
• Section decrementor indicates when to change constants

System Parameters

Parameter                                Value
Sub-apertures                            12
Transmit Sources                         16
Transmits per Frame                      192
Transducers per Sub-aperture             1,024
Total Transducers                        12,288
Storage per Transducer                   4,096 x 12 bits
Focal Points per Scanline                4,096
Image Depth                              6 cm
Image Angular Width                      π/4
Sampling Frequency                       40 MHz
Interpolation Factor                     4x
Interpolated Sampling Frequency (fs)     160 MHz
Speed of Sound (tissue)                  1,540 m/s
Target Frame Rate                        1 fps

Image Quality Comparison
Simulations using Field II [Jensen ‘92, ‘95]
[Figure: simulated images for Ideal, Our Design (12 bit), and 11 bit]

Bits     CNR
Ideal    2.972
14       2.942
13       2.960
12       2.942
11       2.536
10       2.233

Our design has negligible difference from ideal system

Power Analysis and Scaling
[Figure: stacked bar chart of Power (W), 0 to 20, versus Technology Node (45, 32, 22, 16, 11); legend: DRAM, Memory Interface, Network, Wires, Accelerator, SRAM, ADC, Transducers]
Can meet 5 W by 11 nm node

Conclusions
• 3D die stacked Sonic Millip3De design is able to meet 5 W power budget by 11 nm
• Algorithm/HW co-design enables order-of-magnitude gains
  – Power and output quality goals often in conflict
  – Need guidance from domain experts to balance
• Architects have much to offer for application-specific system designs

Questions?
Special thanks to: Brian Fowlkes, Oliver Kripfgans, Ron Dreslinski
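
Supplementary sketch: iterative delay index calculation
The Delay Index Calculation and Select Sub-Unit Design slides describe computing an exact delay index per focal point, then replacing it with deltas generated by adders from a piecewise quadratic. The Python below is only an illustrative software model of that idea, not the hardware design. The function names, the `Section` tuple layout, and the reading of R_P, X_i, and θ as focal-point range, transducer offset, and steering angle are assumptions; the slides do not specify how the 9 constants are laid out, so this sketch does not try to reproduce them.

```python
import math
from typing import Iterator, List, Tuple

FS_HZ = 160e6        # interpolated sampling frequency, from System Parameters
C_M_PER_S = 1540.0   # speed of sound in tissue, from System Parameters

# Hypothetical layout: (A, B, C, number of focal points covered by the section).
Section = Tuple[float, float, float, int]


def delay_index(r_p: float, x_i: float, theta: float) -> float:
    """Exact delay index from the Delay Index Calculation slide.

    Assumed meanings: r_p is the focal point range (m), x_i the transducer
    offset (m), theta the steering angle (rad).
    """
    path = r_p + math.sqrt(r_p**2 + x_i**2 - 2.0 * r_p * x_i * math.sin(theta))
    return FS_HZ / C_M_PER_S * path


def quadratic_by_adders(a: float, b: float, c: float, count: int) -> Iterator[float]:
    """Yield A*n^2 + B*n + C for n = 0..count-1 using only additions per step."""
    value = c          # q(0)
    step = a + b       # q(1) - q(0)
    two_a = 2 * a      # constant second difference
    for _ in range(count):
        yield value
        value += step  # q(n+1) = q(n) + 2An + (A + B)
        step += two_a


def iterative_delay_indices(first_index: float, sections: List[Section]) -> List[float]:
    """Accumulate piecewise-quadratic deltas into delay indices along a scanline."""
    indices = [first_index]
    current = first_index
    for a, b, c, count in sections:   # moving to a new section loads new constants
        for delta in quadratic_by_adders(a, b, c, count):
            current += delta
            indices.append(current)
    return indices
```

For example, `iterative_delay_indices(100, [(0, 0, 2, 3), (1, 0, 1, 2)])` walks one constant-delta section and one quadratic section. In a real system the per-section constants would come from fitting the deltas of `delay_index` along each scanline, and each result would be rounded to pick the closest stored sample.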
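
Supplementary sketch: checking the system parameters
Several figures in the System Parameters table can be cross-checked with simple arithmetic. The Python below is a rough sanity check, not part of the original analysis. It assumes that the peak raw data rate is simply transducers × sample width × sampling frequency; the slides do not state this formula, though the result is close to the quoted 6 Tb/s. All constant and function names are hypothetical.

```python
# Values copied from the System Parameters table.
SUB_APERTURES = 12
TRANSMIT_SOURCES = 16
TRANSDUCERS_PER_SUB_APERTURE = 1024
BITS_PER_SAMPLE = 12
SAMPLING_FREQUENCY_HZ = 40_000_000
INTERPOLATION_FACTOR = 4


def transmits_per_frame() -> int:
    """Each transmit source fires once per sub-aperture."""
    return SUB_APERTURES * TRANSMIT_SOURCES


def total_transducers() -> int:
    """All sub-apertures together cover the full transducer array."""
    return SUB_APERTURES * TRANSDUCERS_PER_SUB_APERTURE


def raw_data_rate_bps(transducers: int) -> int:
    """Assumed peak rate: every transducer streams one sample per clock."""
    return transducers * BITS_PER_SAMPLE * SAMPLING_FREQUENCY_HZ


def interpolated_sampling_frequency_hz() -> int:
    """Sampling frequency after interpolation (fs in the table)."""
    return SAMPLING_FREQUENCY_HZ * INTERPOLATION_FACTOR


if __name__ == "__main__":
    full = raw_data_rate_bps(total_transducers())
    sub = raw_data_rate_bps(TRANSDUCERS_PER_SUB_APERTURE)
    print(f"transmits per frame: {transmits_per_frame()}")
    print(f"full aperture: {full / 1e12:.2f} Tb/s")
    print(f"one sub-aperture: {sub / 1e9:.1f} Gb/s")
```

Under this assumption, receiving with one 1,024-transducer sub-aperture at a time cuts the peak rate by the number of sub-apertures (12x), at the cost of 192 transmits per frame.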