Energy-Efficient System Design for Mobile Processing Platforms

by

Rahul Rithe

B.Tech., Indian Institute of Technology Kharagpur (2008)
S.M., Massachusetts Institute of Technology (2010)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the Massachusetts Institute of Technology, June 2014.

© Massachusetts Institute of Technology 2014. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 20, 2014 (signature redacted)
Certified by: Anantha P. Chandrakasan, Joseph F. and Nancy P. Keithley Professor of Electrical Engineering, Thesis Supervisor (signature redacted)
Accepted by: Leslie A. Kolodziejski, Chair, Department Committee on Graduate Students (signature redacted)

Energy-Efficient System Design for Mobile Processing Platforms
by Rahul Rithe

Submitted to the Department of Electrical Engineering and Computer Science on May 20, 2014, in partial fulfillment of the requirements for the degree of Doctor of Philosophy

Abstract

Portable electronics has fueled the rich emergence of multimedia applications that have led to exponential growth in content creation and consumption. New energy-efficient integrated circuits and systems are necessary to enable increasingly complex applications, such as high-performance multimedia, augmented reality, "big-data" processing and smart healthcare, in real-time on the mobile platforms of the future. This thesis presents an energy-efficient system design approach with algorithm, architecture and circuit co-design for multiple application areas.

A shared transform engine, capable of supporting multiple video coding standards in real-time with ultra-low power consumption, is developed. The transform engine, implemented using 45 nm CMOS technology, supports Quad Full-HD (4k x 2k) video coding with reconfigurable processing for the H.264 and VC-1 standards at 0.5 V and operates down to 0.3 V to maximize energy-efficiency. Algorithmic and architectural optimizations, including matrix factorization, transpose memory elimination and data dependent processing, achieve significant savings in area and power consumption.

A reconfigurable processor for computational photography is presented. An efficient implementation of the 3D bilateral grid structure supports a wide range of non-linear filtering applications, including high dynamic range imaging, low-light enhancement and glare reduction. The processor, implemented using 40 nm CMOS technology, enables real-time processing of HD images, while operating down to 0.5 V and achieving 280x higher energy-efficiency compared to software implementations on state-of-the-art mobile processors. A scalable architecture enables 8x energy scalability for the same throughput performance, trading off output resolution for energy.

Widespread use of medical imaging techniques has been limited by factors such as size, weight, cost and complex user interfaces. A portable medical imaging platform for accurate objective quantification of skin condition progression, using robust computer vision techniques, is presented. Clinical validation shows 95% accuracy in progression assessment.
Algorithmic optimizations, reducing the memory bandwidth and computational complexity by over 80%, pave the way for an energy-efficient hardware implementation to enable real-time portable medical imaging.

Thesis Supervisor: Anantha P. Chandrakasan
Title: Joseph F. and Nancy P. Keithley Professor of Electrical Engineering

Acknowledgments

Since the first time I came to MIT in August 2008 and navigated my way to 38-107, trying to make sense of MIT's (still) incomprehensible building numbering system, it has been a wonderful journey of exploration - filled with numerous challenges and exciting rewards of scientific discovery. I have been fortunate to have had exceptional advisors and mentors to guide me through this journey.

I am extremely grateful to my advisor, Prof. Anantha Chandrakasan, for being a great mentor, role model and constant source of inspiration. I learned from Anantha that conducting great research is a process that involves working in collaboration with researchers, industry partners and funding agencies, while constantly pushing the boundaries of the state-of-the-art. The collaborative research environment that Anantha has fostered in the lab not only motivated me to produce great results but also afforded opportunities to work with graduate and undergraduate students and learn how to mentor and motivate others in realizing their full potential as researchers. I learned invaluable lessons in organization and management, inspired by Anantha's visionary leadership of EECS while managing a large research group. Thank you Anantha for giving me the freedom to explore my interests and helping me grow both professionally and personally throughout my graduate studies at MIT!

I am thankful to the members of my Ph.D. thesis committee, Prof. William Freeman, Prof. Li-Shiuan Peh and Prof. Vivienne Sze, for their advice, feedback and support. Prof. Freeman's advice on the computer vision related work for medical imaging was extremely valuable. I would like to thank Vivienne for her help and support throughout my graduate work at MIT - first as a senior graduate student and then as a faculty member at MIT - from helping me learn digital design to long discussions about research and reviewing paper drafts. I am extremely grateful to Prof. Fredo Durand for several valuable discussions on topics ranging from research to photography to career options.

I had the privilege of working with Dr. Dennis Buss, chief scientist at Texas Instruments and visiting scientist at MIT, during my master's research. I am immensely thankful to Dennis for all the insightful discussions over the last six years on topics ranging from research and industry collaboration to the past, present and future of the semiconductor industry.

The work was made possible by the generous support of our industry partners. I would like to acknowledge the Foxconn Technology Group, Texas Instruments and the MIT Presidential Fellowship for providing funding support and the TSMC University Shuttle Program for chip fabrication.

I consider teaching to be an integral part of the graduate experience and I am grateful to Prof. Harry Lee for giving me the rare opportunity to serve as a recitation instructor for the undergraduate 'Circuits and Electronics' class. I would like to thank Prof. Harry Lee, Prof. Karl Berggren, Prof. John Kassakian and Prof. Khurram Afridi for helping me further my passion for teaching and enhance my abilities as a teacher.
One of the best things about MIT is the people you get to interact and work with day-to-day. I would like to thank Chih-Chi Cheng and Mahmut Sinangil for working long hours with me on the video coding project. I am extremely thankful to Priyanka Raina, Nathan Ickes and Srikanth Tenneti for their tremendous help in bringing the computational photography project from an idea to a live demonstration platform. It has been a great experience for me to work with two 'SuperUROP' students - Michelle Chen and Qui Nguyen - on the smartphone-based medical imaging platform and I am thankful to them for being such enthusiastic collaborators. I would also like to thank Dr. Vaneeta Sheth from the Brigham and Women's Hospital for bringing her dermatology expertise to our medical imaging work and conducting a pilot study to demonstrate its effectiveness during treatment.

When I first arrived at MIT, I could not have imagined a work environment better than what Ananthagroup has offered me over the last six years. It has been an absolute pleasure to work with all the members of Ananthagroup - past and present. The diverse set of expertise, thoughtful discussions and "procrastination circles" have helped create the best workplace for research.

All work and no play is no fun. I would like to thank Masood Qazi for teaching me everything I know about playing squash and for those amazing trips to Burdick's for the best hot chocolate ever! I would also like to thank the members of the "Ananthagroup Tennis Club" - Arun, Phil and Nachiket - for quite a few evenings well spent, braving wind, rain and cold on the tennis courts. Margaret Flaherty, our administrative assistant, is the reason everything in 38-107 runs so smoothly. I would like to thank Margaret for her relentless work and attention to detail.

Saurav Bandyopadhyay, Rishabh Singh and I went to IIT Kharagpur together and continued our journey at MIT together, including that first crammed flight from Delhi to Boston. I am extremely thankful to Saurav and Rishabh for being such great friends over the years.

The foundation of my work rests on the unconditional love and support of my family. The pride and joy of my late grandparents, Nirmalabai and Namdevrao Wankhade, in every one of my achievements over the years has been and will continue to be a constant source of inspiration for me. The love of my grandfather, Panjabrao Rithe, for education and the hardships he endured for it have been the driving force for me on this academic journey. The steadfast belief of my parents, Rajani and Jagdish Rithe, and my sister Bhagyashree, their support through all my endeavors and encouragement to follow my dreams, have made this journey from a small village in India to the present moment possible. And for that I am eternally grateful!

Rahul Rithe
Cambridge, MA
1 May 2014

Contents

1 Introduction
  1.1 Mobile Computing Challenges
  1.2 Energy-Efficient System Design
    1.2.1 Parallel Processing
    1.2.2 Application Specific Processing
    1.2.3 Reconfigurable Hardware
    1.2.4 Low-Voltage Circuits
  1.3 Thesis Contributions

2 Transform Engine for Video Coding
  2.1 Transform Engine Design
    2.1.1 Integer Transform: H.264/AVC & VC-1
    2.1.2 Matrix Factorization for Hardware Sharing
    2.1.3 Eliminating Transpose Memory
    2.1.4 Data Dependent Processing
  2.2 Future Video Coding Standards
  2.3 Statistical Methodology for Low-Voltage Design
  2.4 Implementation
  2.5 Measurement Results
  2.6 Summary and Conclusions

3 Reconfigurable Processor for Computational Photography
  3.1 Bilateral Filtering
    3.1.1 Bilateral Grid
  3.2 Bilateral Filter Engine
    3.2.1 Grid Assignment
    3.2.2 Grid Filtering
    3.2.3 Grid Interpolation
    3.2.4 Memory Management
    3.2.5 Scalable Grid
  3.3 Applications
    3.3.1 High Dynamic Range Imaging
    3.3.2 Glare Reduction
    3.3.3 Low-Light Enhanced Imaging
  3.4 Low-Voltage Operation
    3.4.1 Statistical Design Methodology
    3.4.2 Multiple Voltage Domains
  3.5 Memory Bandwidth Optimization
  3.6 Measurement Results
    3.6.1 Energy Scalable Processing
    3.6.2 Energy Efficiency
  3.7 System Integration
  3.8 Summary and Conclusions

4 Portable Medical Imaging Platform
  4.1 Skin Conditions - Diagnosis & Treatment
    4.1.1 Clinical Assessment: Current Approaches
    4.1.2 Quantitative Dermatology
  4.2 Skin Condition Progression: Quantitative Analysis
    4.2.1 Color Correction
    4.2.2 Contour Detection
    4.2.3 Progression Analysis
    4.2.4 Auto-tagging
    4.2.5 Skin Condition Progression: Summary
  4.3 Experimental Results
    4.3.1 Clinical Validation
    4.3.2 Progression Quantification
    4.3.3 Auto-tagging Performance
    4.3.4 Energy-Efficient Processing
    4.3.5 Limitations
  4.4 Mobile Application
  4.5 Multispectral Imaging: Future Work
  4.6 Summary and Conclusions
5 Conclusions and Future Directions
  5.1 Summary of Contributions
    5.1.1 Video Coding
    5.1.2 Computational Photography
    5.1.3 Medical Imaging
  5.2 Conclusions
  5.3 Future Directions
    5.3.1 Computational Photography and Computer Vision
    5.3.2 Portable Medical Imaging

A Integer Transform
  A.1 H.264/AVC Integer Transform
  A.2 VC-1 Integer Transform

B Clinical Pilot Study for Vitiligo Progression Analysis
  B.1 Subjects for Pilot Study
  B.2 Progression Analysis

Acronyms

Bibliography

List of Figures

1-1 Evolution of computing and multimedia processing. (Analytical Engine: London Science Museum)
1-2 Processor feature scaling and Performance/Watt trends. (Data courtesy Stanford CPU DB: cpudb.stanford.edu)
1-3 Processor energy/operation scaling with performance. (Data courtesy Stanford CPU DB: cpudb.stanford.edu)
1-4 Energy efficiency of processors: from CPUs to ASICs.
1-5 Delay scaling with VDD. Corner delay scales by 15x, whereas total delay (corner + 3σ stochastic delay) scales by 36x.
2-1 Hardware architecture of the even component. The figure shows data paths exercised in (a) H.264 and (b) VC-1.
2-2 Hardware architecture of the odd component. The figure shows data paths exercised in (a) H.264 and (b) VC-1.
2-3 Column-wise 1D transform: 8x8 data is processed over four clock cycles, C0 to C3: columns 0 and 7 in C0, 1 and 6 in C1, 2 and 5 in C2, 3 and 4 in C3. Two transformed columns are generated in each clock cycle.
2-4 Row-wise 1D transform: partial products for all 64 coefficients are computed in each clock cycle, using the 2x8 data obtained by transposing the two columns generated by the 1D column-wise transform. The partial products are stored in the output buffer. At the end of four clock cycles, the output buffer contains the complete 2D transformed output.
2-5 Hardware architecture of the (a) even and (b) odd component. Std = {0: H.264, 1: VC-1}.
2-6 Histogram of the prediction residue for a number of test sequences.
2-7 Correlation between input switching activity and system switching activity. The plot also shows linear regression for the data. Measured correlation is 0.83.
2-8 Switching activity and power consumption in the transform as a function of DC bias applied to the input data.
2-9 Hardware architecture of the even component for shared 8x8 transform for H.264, VC-1 and HEVC. The highlighted blocks are the same as those used in the shared H.264/VC-1 architecture, shown in Figure 2-1.
2-10 Hardware architecture of the odd component for shared 8x8 transform for H.264, VC-1 and HEVC. The highlighted blocks are the same as those used in the shared H.264/VC-1 architecture, shown in Figure 2-2.
2-11 Switching activity in HEVC transform as a function of DC bias applied to the input data.
2-12 Delay PDF of a representative timing path at 0.5 V. STA estimate of the global corner delay is 14.1 ns, the 3σ delay estimate using Gaussian SSTA is 23.2 ns and the 3σ delay estimate using Monte-Carlo analysis is 31.8 ns.
2-13 Graphic illustration in xi-space of the convolution integral, and the operating point.
2-14 Delay PDF of a representative timing path at 0.5 V, estimated using Gaussian SSTA, Monte-Carlo and OPA.
2-15 Typical timing path.
2-16 OPA based statistical design methodology for low voltage operation.
2-17 Block diagram of the 2D transform engine design.
2-18 Die photo and design statistics of the fabricated IC.
2-19 Measured power consumption and frequency scaling with VDD for different transform implementations. (a) Frequency scaling with VDD, (b) power consumption while operating at the frequency shown in (a).
2-20 Power consumption for transform modules with and without transpose memory, with and without shared architecture for H.264 and VC-1.
2-21 Switching activity and power consumption in the transform as a function of DC bias applied to the input data.
3-1 System block diagram for the reconfigurable computational photography processor.
3-2 Comparison of Gaussian filtering and bilateral filtering. Bilateral filtering effectively reduces noise while preserving scene details.
3-3 Construction of a 3D bilateral grid from a 2D image.
3-4 Architecture of the bilateral filtering engine. Grid scalability is achieved by gating processing engines and SRAM banks.
3-5 Architecture of the grid assignment engine.
3-6 Architecture of the convolution engine for grid filtering.
3-7 Architecture of the interpolation engine. Trilinear interpolation is implemented as three pipelined stages of linear interpolations.
3-8 Memory management by task scheduling.
3-9 Camera curves that map the pixel intensity values on to the incident exposure.
3-10 HDR creation module.
3-11 HDR image scaled to 8 bit/pixel/color for displaying on LDR media. (HDR radiance map courtesy Paul Debevec [121].)
3-12 Processing flow for HDR creation and tone-mapping for displaying HDR images on LDR media.
3-13 Tone-mapped HDR image. (HDR radiance map courtesy Paul Debevec [121].)
3-14 Processor configuration for HDR imaging.
3-15 Input low dynamic range images: (a) under-exposed image, (b) normally exposed image, (c) over-exposed image. Output image: (d) tone-mapped HDR image.
3-16 Contrast adjustment module. Contrast is increased or decreased depending on the adjustment factor.
3-17 Processing flow for glare reduction.
3-18 Processor configuration for glare reduction.
3-19 (a) Input image with glare. (b) Output image with reduced glare.
3-20 Processing flow for low-light enhancement.
3-21 Processor configuration for low-light enhancement.
3-22 Generating a mask representing regions with high scene details.
3-23 Merging flash and no-flash images with shadow correction.
3-24 (a) Image with flash, (b) image without flash, (c) no-flash base layer, (d) flash detail layer, (e) edge mask, (f) low-light enhanced output.
3-25 Input images: (a) image with flash, (b) image without flash. Output image: (c) low-light enhanced image.
3-26 Comparison of the image quality performance from the proposed approach with that of [138] and [139]. (a) Output from our approach, (b) output from [138], (c) output from [139], (d) difference image between (a) and (b), amplified 5x, (e) difference image between (a) and (c), amplified 5x.
3-27 Delay PDF of a representative timing path from the computational photography processor at 0.5 V. STA estimate of the global corner delay is 21.9 ns; the 3σ delay estimate using OPA is 36.1 ns.
3-28 Separate voltage domains for logic and memory. Level shifters are used to transition between domains.
3-29 Memory bandwidth and estimated power consumption for 2D bilateral filtering, 3D bilateral grid and bilateral grid with memory management using task scheduling.
3-30 Die photo of the testchip. Highlighted boxes indicate SRAMs. HDR, CR and SC refer to HDR create, contrast reduction and shadow correction modules respectively.
3-31 Processor performance: trade-off of energy vs. performance for varying VDD.
3-32 Processor area (number of gates) and power breakdown.
3-33 Energy scalable processing. Grid resolution vs. energy trade-off at 0.9 V.
3-34 Energy/resolution scalable processing. HDR imaging outputs for (a) grid block size: 16x16, intensity levels: 16, (b) grid block size: 128x128, intensity levels: 16, (c) grid block size: 16x16, intensity levels: 4, (d) grid block size: 128x128, intensity levels: 4.
3-35 Energy/resolution scalable processing. Low-light enhancement outputs for (a) grid block size: 16x16, intensity levels: 16, (b) grid block size: 128x128, intensity levels: 16, (c) grid block size: 16x16, intensity levels: 4, (d) grid block size: 128x128, intensity levels: 4.
3-36 Energy efficiency of processors ranging from CPUs and mobile processors to FPGAs and ASICs.
. . . . . . . . . . . . . . . . 121 3-37 Processor integration with external memory, camera and display. . . . . . . 123 3-38 Printed circuit board and system integration with camera and display. . 12 124 LIST OF FIGURES 18 4-1 Standardized assessments for estimating the degree of pigmentation to derive the Vitiligo Area Scoring Index. At 100% depigmentation, no pigment is present; at 90%, specks of pigment are present; at 75%, the depigmented area exceeds the pigmented area; at 50%, the depigmented and pigmented areas are equal; at 25%, the pigmented area exceeds the depigmented area; and at 10%, only specks of depigmentation are present. (Figure reproduced with permission from [167]) . . . . . . . . . . . . . . . . . . . . . . . . . . 129 . . . . . . . . . . . . . 134 4-2 Processing flow for skin lesion progression analysis. 4-3 Color correction by histogram matching. Images captured with normal room lighting (a) and with color chart white-balance calibration (b). Images after color correction and contrast enhancement (c) of images in (a). . 136 4-4 Level set segmentation. (a) Original image with intensity inhomogeneity and initialization of the level set function. (b) Homogeneous image obtained at the end of iterations and the corresponding level set function. . . . . . . 139 4-5 Narrowband implementation of level set segmentation. LSM variables are tracked only for pixels that fall within a narrow band defined around the zero level set in the current iteration. . . . . . . . . . . . . . . . . . . . . . 140 4-6 Number of pixels processed using the narrowband implementation over 50 LSM iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 4-7 Lesion segmentation using K-means. . . . . . . . . . . . . . . . . . . . . . 143 4-8 Contour evolution for lesion segmentation using narrowband LSM. . . . . . 143 4-9 SIFT feature matching performed on the highlighted narrow band of pixels in the vicinity of the contour. . . . . . . . . . . . . . . . . . . . . . . . . . 144 4-10 Color correction for a sequence of images by R, G, B histogram modification. (a) Original image sequence, (b) Color corrected image sequence. The lesion color changes due to phototherepy. . . . . . . . . . . . . . . . . 151 4-11 Image segmentation using LSM for lesion contour detection despite intensity/color inhomogeneities in the image. . . . . . . . . . . . . . . . . . . . 151 19 LIST OF FIGURES 4-12 Image registration based on matching features with respect to the reference image at the beginning of the treatment. . . . . . . . . . . . . . . . . . . . 152 4-13 Sequence of images during treatment. (a) Images captured with normal room lighting. (b) Processed image sequence. . . . . . . . . . . . . . . . . 152 4-14 Image registration through feature matching. (a) Images of a lesion from different camera angles, (b) Images after contour detection and alignment. Area matches to 98% accuracy and pixel overlap to 97% accuracy. . . . . . 153 4-15 Progression analysis. (a) Artificial image sequence with known area change, created from a lesion image. (b) Image sequence after applying scaling, rotation and perspective mismatch. (c) Output image sequence after lesion alignment and fill factor computation . . . . . . . . . . . . . . . . . . . . . 154 4-16 Memory bandwidth and estimated power consumption for full image LSM and SIFT compared to the optimized narrowband implementations of LSM and SIFT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
4-17 Image segmentation fails to accurately identify lesion contours where the lesions don't have well defined boundaries.
4-18 Architecture of the mobile application with cloud integration.
4-19 User interface of the mobile application. (Contributed by Michelle Chen and Qui Nguyen)
4-20 A conceptual diagram of the portable imaging module for multispectral polarized light imaging.
5-1 Secure cloud-based medical imaging platform.
B-1 Progression of skin lesions over time. Lesion contours are identified from the color corrected images and the lesions are aligned using SIFT feature matching to determine the fill factor.

List of Tables

2.1 Separable 2D transform definitions for H.264/AVC and VC-1
2.2 Row-wise transform computations for even-odd components over four clock cycles
2.3 Full-chip timing analysis
2.4 Transform engines implemented in this design
2.5 Measurement results for implemented transform modules
2.6 Overheads and advantages of proposed ideas
2.7 Performance comparison of proposed approach with previous publications
3.1 Setup/hold timing analysis at 0.5 V
3.2 Performance comparison with mobile processor implementations at 0.9 V
4.1 Summary of clinical assessment and quantitative dermatology approaches
4.2 Bit width representations of LSM variables
4.3 Performance enhancement through algorithmic optimizations
B.1 Demographics of the subjects for clinical study
B.2 Progression of skin lesions during treatment

Chapter 1
Introduction

In 1837, Charles Babbage proposed the concept of the Analytical Engine [1], the first Turing-complete computer with an arithmetic logic unit, control flow and integrated memory. If it had been completely built, the Analytical Engine would have been vast and would have needed to be operated by a steam engine [2]. The idea of computing devices that are astronomically more powerful and yet can fit in the palm of a person's hand, while operating on tiny batteries built into the devices themselves, would have been unthinkable. Integrated circuits, driven by semiconductor process scaling following "Moore's Law" [3] and "Dennard Scaling" [4] over the last half century, have transformed computing through exponential enhancements in performance, power efficiency and cost.

Today we are moving ever closer to the era of all computing being mobile. The vision of ubiquitous computing [5] and portable wireless terminals for real-time multimedia access and processing, heralded by the Xerox ParcTab [6] and the InfoPad [7,8], has been realized by the emergence of portable multimedia devices like smartphones and tablets. We are surrounded by computing devices that form the "internet of things" - gateways to the hyper-connected world.
The exponential growth in computing has fueled advances in increasingly complex multimedia processing applications - from the first color photograph, created by Thomas Sutton and James Clerk Maxwell in 1861 based on Maxwell's three-color method [9] (a method that forms the foundation of virtually all color imaging techniques to this day), to modern multimedia processing capabilities that have enabled real-time High Definition (HD) video, computational photography, computer vision and graphics, and biological and biomedical imaging. Figure 1-1 shows the evolution of computing and multimedia processing.

Figure 1-1: Evolution of computing and multimedia processing. (Analytical Engine: London Science Museum)

Next generation mobile platforms will need to extend these capabilities multifold to enable efficient multimedia processing, natural user interfaces through gesture and speech recognition, real-time interpretation and "big data" inference from sensors interfacing with the world, and portable smart healthcare solutions for continuous health monitoring. Regardless of the specific functionality, these applications share a common set of challenges. They are computationally complex, typically require large non-linear filter kernels or large block sizes for processing (64 x 64 or more), and have high memory size and bandwidth requirements. To support real-time performance (1080p at 30 frames per second (fps)), the throughput requirements for such applications can exceed 1 TOPS. The processing is often non-localized, with data dependencies across multiple rows in an image or even across multiple frames in a video sequence. Many algorithms are iterative, such as in image deblurring or segmentation, which limits parallelism and places further constraints on real-time performance. This presents the significant challenge of meeting a high computing performance requirement while ensuring ultra-low power operation, to be efficient on battery-operated mobile devices.

1.1 Mobile Computing Challenges

The energy budget of a mobile platform is constrained by its battery capacity. While processing power has increased exponentially, battery energy density has followed a roughly linear trajectory [10]. Over the last 15 years, processor performance has increased by 100x and transistor count by 1000x, whereas battery capacity has increased only by a factor of 2.6 [11]. At the same time, even as the number of transistors has followed "Moore's Law" exponential growth, and continues to do so with process scaling and 3D integration, we are no longer able to achieve exponential gains in performance per Watt of power consumption from process scaling alone, due to the lack of operating voltage scaling [12]. Figure 1-2 shows these trends over the last 40 years [13].

Figure 1-2: Processor feature scaling and Performance/Watt trends. (Data courtesy Stanford CPU DB: cpudb.stanford.edu)

The lack of significant energy density enhancements in batteries, combined with flattening performance enhancements per unit of power consumption, has led to a major challenge in mobile computing. Energy has become the key limiting factor in scaling computing performance on mobile platforms. The significant performance enhancements needed to enable high complexity applications on future mobile platforms will only be achievable through significant enhancements in the energy-efficiency of such systems.
1.2 Energy-Efficient System Design

Fine-grained parallelism and low voltage operation are powerful tools for low-power design that take advantage of the exponential scaling in transistor costs to trade off silicon area for lower power consumption [14-17]. Technology scaling, circuit topologies and architecture trends are aligning to take advantage of these trade-offs for severely energy-constrained applications on mobile platforms.

1.2.1 Parallel Processing

Parallel processing has become a cornerstone of low-power digital design [14] because of its remarkable ability, when coupled with voltage scaling, to enhance energy efficiency at no overall performance cost. It allows each individual processing engine or core to operate at less than its peak performance, which enables the operating voltage to be scaled down and achieves a super-linear reduction in energy per operation. Figure 1-3 shows the normalized energy/operation scaling vs. performance for processors over 20 years. For applications that support data parallelism, a processor can have two processing engines, each running at half the required performance, that together achieve the same throughput as a single processing engine running at full performance. Due to the super-linear scaling in energy per operation as performance is lowered, the two engines combined consume less power than one engine running at full performance.

Figure 1-3: Processor energy/operation scaling with performance. (Data courtesy Stanford CPU DB: cpudb.stanford.edu)
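To make this trade-off concrete, the sketch below evaluates the switching power P = αCV²f (the model discussed in Section 1.2.4) for one engine at full rate versus two parallel engines at half rate. The activity factor, capacitance and voltages are illustrative assumptions chosen for this sketch, not measured values from this work; the lower VDD simply stands in for the relaxed timing target that half-rate operation permits.

```python
# Illustrative sketch of the parallelism + voltage-scaling trade-off.
# Assumes switching power P = a*C*V^2*f; all numbers are hypothetical.

def switching_power(alpha, cap, vdd, freq):
    """Dynamic power of one processing engine."""
    return alpha * cap * vdd ** 2 * freq

ALPHA, CAP = 0.15, 1e-9        # activity factor, switched capacitance (F)

# One engine meets the throughput target at full frequency and nominal VDD.
p_single = switching_power(ALPHA, CAP, vdd=1.0, freq=500e6)

# Two engines at half frequency: the relaxed timing target lets VDD drop
# (0.7 V is an assumed point on the delay-voltage curve, not a measurement).
p_dual = 2 * switching_power(ALPHA, CAP, vdd=0.7, freq=250e6)

print(f"single engine: {p_single * 1e3:.1f} mW")  # 75.0 mW
print(f"two engines:   {p_dual * 1e3:.1f} mW")    # 36.8 mW, same throughput
```

The same throughput is delivered at roughly half the power, which is the super-linear saving referred to above.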
Algorithmic optimizations can also significantly enhance processing locality that enables a large number of computations per memory access and amortizes the energy cost. This approach is inherently application specific. A general purpose processor spends significant amount of resources on the control and memory overhead associated with each computation. The high cost of programability is reflected in the relatively small fraction of energy (2-5%) spent in actual computation as opposed to control (45-50%) and memory access (40-45%) [13]. This makes software implementations of high-complexity applications extremely inefficient. Maximizing energy efficiency necessitates a significant reduction in this overhead by minimizing the control complexity and amortizing the cost of memory accesses over several computations. Application specific hardware implementations provide the best solutions to trade-off programmability for high energy-efficiency and take full advantage of algorithmic optimizations. Figure 1-4 shows the energy-efficiency of processors with different architectures - from CPUs to ASICs, where an operation is defined as a 16 bit addition. 29 2 1.2 Ener~m-Efficient Systemn Design 104 E 10 3 -----------x -----------m-------------- ASIC (Video Decoder) 0 0 0Mobile ]p---- ----------- - -m- Processor --- M- 102 a, LU x 101 CMu Ui a, 100 10-1 1 2 3 4 6 5 Processors 7 8 9 Figure 1-4: Energy efficiency of processors: from CPUs to ASICs. Processor Description 1 Intel Sandy Bridge [20] 2 Intel Ivy Bridge [21] 3 24 Core Programable Processor [22] 4 Multimedia DSP [23] 5 Mobile Processors [24,25] 6 GPGPU Application Processor [26] 7 Object Recognition ASIC [27] 8 SVD ASIC [28] 9 Video Decoder ASIC [29] Hardware implementations minimize the control requirement, maximize processing data locality that allows large number of computations per memory access, taking advantage of spatial and temporal parallelism to reduce memory size and bandwidth, and enable deep pipelines with flexible bit-widths. Application specific hardware implementations are the key to achieving exponential enhancements in performance without increasing the energy budget. Introduction 30 1.2.3 Reconfigurable Hardware Flexibility in implementing various applications after the hardware has been implemented is a desirable feature. However, depending on the architecture used to provide flexibility, there can be a 2 to 3 orders of magnitude difference in energy-efficiency between these implementations, as seen from Figure 1-4. Fully customized hardware implementations are well suited for applications that have well defined standards, such as video coding. Most desktop and mobile processors today have embedded hardware accelerators for video coding. However it is impractical to develop hardware implementations for every iteration of an algorithm in areas such as computer vision and biomedical signal processing, where the algorithms are constantly evolving. Even for standardized applications, existence of multiple competing standards makes it difficult to develop individual hardware implementations for all the standards. For example, it is impractical for most application processors to implement individual video coding accelerators for more than ten video coding standards with more than 20 different coding profiles. 
Dedicated video coding engines, such as IVA-HD [30], support multiple video coding standards though a reconfigurable architecture that implements optimized core functional units, such as motion estimation, transform and entropy coding engines, and uses a configurable pipeline with distributed control. A closer examination of these areas reveals that it may not be necessary to develop hardware accelerators for each individual algorithm. A vast number of computational photography and computer vision applications, for example, use a well defined set of functions, such as non-linear filtering [31], Gaussian or Laplacian pyramids [32,33], Scale Invariant Feature Transform (SIFT) [34], Histogram of Gaussians (HoG) [35] or Haar features [36], etc. These functions are well established and form the foundation of the OpenCV library [37] used for software implementations of almost all computer vision applications. A hardware implementation with highly optimized processing units supporting such functions, and the ability to activate these processing units and configure the datapaths based on 31 1.2 Energy-Efficient System Design the application requirements, provides a very attractive alternative that maintains high energy-efficiency while supporting a large class of applications. An important aspect of reconfigurable implementations is architecture scalability. The use of individual processing units as well as the amount of parallelism within each unit, is application specific. Video coding with 4k x 2k resolution at 60 fps has a 20 x higher performance requirement than 720p at 30 fps. Different processing block sizes or filter kernels (4 x 4 to 128 x 128 or more) result in different optimal configurations in a parallel processor. Scalable architectures also enable us to explore energy vs. output quality tradeoffs, where the user can determine the amount of energy spent in processing, depending on the desired output for the specific application. The ability to effectively turn-off processing units and memory banks, through clock and power gating when not used, is key to minimizing energy that is simply being wasted by the system. This thesis demonstrates examples of efficient reconfigurable and scalable hardware implementations for video coding and computational photography applications. 1.2.4 Low-Voltage Circuits For parallelism to yield enhancements in energy-efficiency, it must be coupled with voltage scaling. The power consumption of CMOS digital circuits operating at voltage VDD, frequency f and driving a load modeled as a capacitance C, is given by: Ptotal = Pswitching + Pleakage =ax CxVDD X f + Leakage X VDD - where, a is the switching activity of a logic gate and Ileakage is the leakage current. For varying performance requirements, scaling frequency only provides a linear scaling in power consumption in the switching-power dominated region of operation. However, scaling VDD along with the frequency, to match the peak performance of the proces- Introduction 32 sor, provides a cubic scaling in power consumption. To take full advantage of Dynamic Voltage-Frequency Scaling (DVFS) [38], circuit implementations must be capable of operating across a wide voltage range, from nominal VDD down to the minimum energy point, which typically occurs near or below the threshold voltage (VT) and minimizes the energy per operation [39]. When VDD is reduced to the range of 0.5 V, statistical variations in the transistor threshold voltage becomes an important factor in determining logic performance. 
Random Dopant Fluctuations (RDF) are a dominant source of variations at low voltage, causing random, local threshold voltage shifts [40-42]. Local variations have long been known in analog design and in SRAM design [43,44]. With technology scaling, they have become a major concern for digital design as well. At nominal voltage, local variations in VT may result in 5%-10% variation in the logic timing. However, at low voltage, these variations can result in timing path delays with standard deviation comparable to the global corner delay, and must be accounted for during timing closure in order to ensure a robust, manufacturable design. Figure 1-5 shows the delay of a 28 nm CMOS logic gate as the voltage is lowered from 1 V to 0.5 V. The nominal delay scales by a factor of 15. But taking into account stochastic variations, the total 3- delay scales by a factor of 36. Typically reliability at 40 , . 1 I-36 x -.- 30 ----0 - 20- S 100 Total Delay Corner Delay 15x 0.5 0.6 0.7 0.8 0.9 1.0 VDD M Figure 1-5: Delay scaling with VDD. Corner delay scales by 15x, whereas total delay (corner + 3o- stochastic delay) scales by 36 x. 1.3 Thesis Contributions 33 low-voltage is achieved by over-designing the system with large design margins to account for variations. Such design margins have a significant energy cost [12]. This thesis demonstrates low-voltage design using statistical static timing analysis techniques that minimize the overhead of large design margins to account for variations, while ensuring reliable low-voltage operation with 3a- confidence. 1.3 Thesis Contributions The broad focus of this thesis is to address the challenges of implementing high-complexity applications with high-performance requirements on mobile platforms through a comprehensive view of system design, where algorithms are designed and optimized to enhance processing locality and enable highly parallel architectures that can be implemented using low-power low-voltage circuits to achieve maximally energy-efficient systems. This is accomplished by starting with application areas and exploring key features that form the basis of a wide array of functionalities in that area. The algorithms underlying these features are optimized for hardware implementation, considering trade-offs that reduce computational complexity and memory requirements. Parallel architectures with reconfigurability and scalability are developed to support real-time performance at low frequencies. Finally, circuits are implemented to provide a wide voltage-frequency operating range and ensure minimum energy operation. The main contributions of this thesis are in the following areas: o Shared Transform Engine for Video Coding: A shared transform engine for H.264 and VC-1 video coding standards that supports Quad full-HD (4kx2k) resolutions at 30 fps is presented in Chapter 2. Transform engine is a critical part of video encoding/decoding process. High coding efficiency often comes at a cost of increased complexity in the transform module. This work explores algorithmic optimizations where a larger transform matrix (8 x 8 or larger) is factorized into multiple Introduction 34 small (2 x 2) matrices that can be computed much more efficiently. The factorization can also be formulated in such a way that Discrete Cosine Transform (DCT) based transform matrices corresponding to multiple video coding standards result in the same factors. This is key to achieving an efficient shared implementation. 
The size of transpose memory for 2D transform becomes a key concern for large transforms. Architectural schemes to eliminate an explicit transpose memory and reuse an output buffer to save area and power are explored. Data dependent processing is used to further reduce the power consumption of the transform engine by lowering switching activity. Both the forward and inverse integer transforms are implemented to support encoding as well as decoding operations. The proposed techniques are demonstrated through a testchip, implemented using 45 nm CMOS technology. Statistical circuit design techniques ensure a wide operating range and reliable operation down to 0.3 V. The testchip is used to benchmark different implementations of transform engines such as reconfigurable implementation vs. individual implementations for the two standards, implementations with and without transpose memory, and evaluate the different architectures for power and area efficiency. * Reconfigurable Processor for Computational Photography: A wide array of computational photography applications such as High Dynamic Range (HDR) imaging, low-light enhancement, tone management and video enhancement rely on non-linear filtering techniques such as bilateral filtering. Chapter 3 presents the development of a reconfigurable architecture for multiple computational photography applications. Algorithmic optimizations, leveraging the bilateral grid structure, are explored to transform an inefficient non-linear filtering operation into an efficient linear filtering operation with significant reductions in computational and memory requirements. Algorithm-architecture co-design enables a highly parallel and scalable architecture that can be configured to implement various functionalities, including HDR imaging, low-light enhancement and glare reduction. Memory management techniques are explored to minimize the external DRAM bandwidth 35 1.3 Thesis Contributions and power consumption. The scalable architecture enables users to explore en- ergy/resolution trade-offs for energy-scalable processing. The proposed techniques are demonstrated through a testchip, implemented using 40 nm CMOS technology. Careful design for low-voltage operation ensures reliable operation down to 0.5 V, while achieving real-time performance. The comprehensive system design approach from algorithms to circuits enables a 280x enhancement in energy-efficiency compared to implementations on commercial mobile processors. e Portable Platform for Medical Imaging: Medical imaging techniques are important tools in diagnosis and treatment of various skin conditions. Widespread use of such imaging techniques has been limited by factors such as size, weight, cost and complex user interface. Treatments for skin conditions require reliable outcome measures to compare studies and to assess the changes over time. Chapter 4 presents the development of a portable medical imaging platform for accurate objective quantification of skin lesion progression. Computer vision techniques are extended and enhanced to identify lesion contours in images captured using smartphones and quantify the progression through feature matching. The approach is validated through a pilot study in collaboration with the Brigham and Women's Hospital. Algorithmic optimizations are explored to improve software run-time performance, memory bandwidth and power consumption. 
These optimizations pave the way for energy-efficient hardware implementations that could enable real-time processing on mobile platforms.

Chapter 2
Transform Engine for Video Coding

Multimedia applications, such as video playback, have become prevalent in portable multimedia devices. Video accounted for 53% of mobile data traffic in 2013 and is expected to increase 14x between 2013 and 2018, accounting for 69% of total mobile data traffic by 2018 [45]. Such applications present the unique challenge of a high-performance requirement combined with ultra-low power operation, to be efficient on battery-operated mobile devices. Low-power hardware implementations targeted to a specific standard, such as application processors for H.264 video encoding [46] and decoding [47,48], have been proposed. A universal media player, however, requires supporting multiple video coding standards. The high power and area cost of dedicated video encoding/decoding for each standard necessitates the development of a shared architecture for multi-standard video coding. Dedicated video coding engines supporting multiple standards have recently been proposed using reconfigurable architectures. The IVA-HD video coding engine [30] supports encoding and decoding for multiple standards, such as H.264, H.263, MPEG 4, MPEG 1/2, WMV9, VC-1, MJPEG and AVS. It implements optimized core functional units, such as motion estimation, transform and entropy coding engines, and uses a configurable pipeline with distributed control to achieve programmability for the different standards. A multi-format video codec application processor, supporting H.264, H.263, MPEG 4, MPEG 2, VC-1 and VP8, is proposed in [49]. Hardwired logic is combined with a dedicated ARMv5 architecture CPU to provide programmability for supporting multiple standards.

Energy efficiency of circuits is a critical concern for portable multimedia applications. It is important not only to optimize functionality but also to achieve low energy per operation. Dynamic Voltage-Frequency Scaling (DVFS) is an important technique for reducing power consumption while achieving high peak computational performance [50]. The energy efficiency of digital circuits is maximized at very low supply voltages, near or below the transistor threshold voltage, such as 0.5 V [51]. This makes the ability to operate at low voltage (VDD < 0.5 V) a key component of achieving low power operation. This work explores power reduction techniques at multiple levels: algorithms, architectures and circuits. Combining aggressive voltage scaling, by operating at VDD = 0.5 V, with increased parallelism and pipelining, by processing 16 pixels in each clock cycle, provides an effective way of reducing power while achieving high performance, such as 4k x 2k Quad Full-HD (3840 x 2160) video coding at 30 frames per second (fps), at low frequency.

The transform engine is a critical part of the video encoding/decoding process. High coding efficiency often comes at a cost of increased complexity in the transform module, such as variable size transforms (4x4, 8x8, 8x4, 4x8, etc.) as well as hierarchical transforms, where Discrete Cosine Transform (DCT) coefficients are further encoded using the Hadamard transform. The DCT is the most commonly used transform in video and image coding applications. The DCT has an excellent energy compaction property, which leads to good compression efficiency. However, the irrational numbers in the transform matrix make its exact implementation with finite precision hardware impossible, leading to a drift (difference between reconstructed video frames in encoder and decoder) between forward and inverse transform coefficients. Recent video coding standards, such as H.264/AVC [52,53] and VC-1 [54-56], use a variation of the DCT, known as the integer transform, where the transform matrix is an integer approximation of the DCT. This allows exact computation of the inverse transform using integer arithmetic and also allows implementation using additions and shifts, without any multiplications [57].
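A minimal numpy sketch of this drift-free property follows, using the familiar scaled H.264 4x4 core matrix (the thesis's transform matrices are given in Appendix A and eq. (2.1)). The rounding and quantizer scaling of the real codec are omitted, so this only illustrates why an integer matrix admits an exact inverse.

```python
import numpy as np

# H.264 4x4 core transform matrix (scaled integer approximation of the
# DCT; codec rounding and quantizer scaling are omitted in this sketch).
T = np.array([[1,  1,  1,  1],
              [2,  1, -1, -2],
              [1, -1, -1,  1],
              [1, -2,  2, -1]], dtype=np.int64)

X = np.random.randint(-255, 256, (4, 4))  # prediction residue block
Y = T @ X @ T.T                           # separable 2D forward transform

# T @ T.T is diagonal -- diag(4, 10, 4, 10) -- so the exact inverse is a
# rescale by rational numbers; no irrational DCT coefficients appear.
D = np.diag(T @ T.T).astype(np.float64)
X_rec = (T.T / D) @ Y @ (T / D[:, None])
assert np.allclose(X_rec, X)              # drift-free round trip
print("integer transform inverts exactly")
```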
However, the irrational numbers in the transform matrix make its exact implementation with finite precision hardware impossible, leading to a drift (difference between reconstructed video frames in encoder and decoder) between forward and inverse transform coefficients. Recent video coding standards, such as H.264/AVC [52,53] and VC-1 [54-56] use a variation of the DCT, known as integer transform, where the transform matrix is an integer approximation of the DCT. This allows exact computation of inverse transform using integer arithmetic and also allows implementation using addi- 2.1 Transform Engine Design 39 tions and shifts, without any multiplications [57]. H.264/AVC and VC-1 also use variable size transforms, such as 8x8 and 4x4 in H.264/AVC (High profile) and 8x8, 8x4, 4x8 and 4x4 in VC-1 (Advance profile), to more effectively exploit the spatial correlation and improve coding efficiency. Construction of computationally efficient integer transform matrices is proposed in [58], which allows implementation using 16 bit arithmetic with rate distortion performance similar to 32 bit or floating point DCT implementations. Recent research has focused on efficient implementation of the integer transforms. Matrix decomposition is used to implement 4x4 and 8x8 integer transforms for VC-1 in [59]. A hardware sharing scheme for inverse integer transforms of H.264, MPEG-4 and VC-1 using delta coefficient matrix is proposed in [60]. Matrix decomposition with sparse matrices and matrix offset computations is proposed in [61] for a shared ID inverse integer transform of H.264 and VC-1. Matrix decomposition and transform symmetry is used to develop a computationally efficient approach for ID 8x8 inverse transform for VC-1 in [62]. Similar ideas are used to achieve a shared architecture for 1D 8 x 8 forward and inverse transforms of H.264 in [63]. A circuit architecture that can be applied to standards such as MPEG 1/2/4, H.264 and VC-1 is proposed in [64] based on similarity of 4x4 and 8x8 DCT matrices. In this work, a shared transform for H.264/AVC and VC-1 video coding standards is proposed [65]. Forward integer transform and inverse integer transform are both implemented to support encoding as well as decoding operations. We also propose a scheme to eliminate an explicit transpose memory, which is required in 2D transform implementation, to save area and power. This work also explores data dependent processing to further reduce the power consumption of the transform engine. 2.1 Transform Engine Design This section explores the ideas of matrix factorization for hardware sharing, eliminating an explicit transpose memory in 2D transform and data dependent processing to reduce Transform Engine for Video Coding 40 switching activity, to achieve a shared transform engine for H.264/AVC and VC-1 video coding standards. The objective is to design a transform engine that can support video coding with Quad Full-HD (QFHD) resolution at 30 fps, with very low power consumption. 2.1.1 Integer Transform: H.264/AVC & VC-1 H.264/AVC uses 4x4 transform in baseline and main profile and both 4x4 and 8x8 transforms in the high profile. VC-1 uses 4x4, 4x8, 8x4 and 8x8 transforms in the advance profile. The transform matrices for H.264/AVC and VC-1 standards are defined in Appendix A. The 4x4 transform matrices for H.264 and VC-1, as well as the 8x8 transform matrices, are structurally identical. 
This allows us to generate a unified 4x4 transform matrix and a unified 8x8 transform matrix for H.264 and VC-1, as defined by eq. (2.1) and eq. (2.2) respectively.

T_4 = \begin{bmatrix} \alpha & \alpha & \alpha & \alpha \\ \beta & \gamma & -\gamma & -\beta \\ \alpha & -\alpha & -\alpha & \alpha \\ \gamma & -\beta & \beta & -\gamma \end{bmatrix}    (2.1)

H.264: α = 1, β = 1, γ = 1/2 and VC-1: α = 17, β = 22, γ = 10.

T_8 = \begin{bmatrix}
a & b & f & c & a & d & g & e \\
a & c & g & -e & -a & -b & -f & -d \\
a & d & -g & -b & -a & e & f & c \\
a & e & -f & -d & a & c & -g & -b \\
a & -e & -f & d & a & -c & -g & b \\
a & -d & -g & b & -a & -e & f & -c \\
a & -c & g & e & -a & b & -f & d \\
a & -b & f & -c & a & -d & g & -e
\end{bmatrix}    (2.2)

H.264: a = 8, b = 12, c = 10, d = 6, e = 3, f = 8, g = 4
VC-1: a = 12, b = 16, c = 15, d = 9, e = 4, f = 16, g = 6.

The separable 2D transforms are defined as given in Table 2.1, where m = {8, 4} and n = {8, 4}, X is the prediction residue and Y is the transformed data.

Table 2.1: Separable 2D transform definitions for H.264/AVC and VC-1

|                   | H.264                                          | VC-1                                                              |
| Forward Transform | Y_{m×m} = T_m · X_{m×m} · T_m^T                | Y_{m×n} = (T_m · X_{m×n} · T_n^T) ∘ N_{m×n}                       |
| Inverse Transform | X_{m×m} = T_m^T · Y_{m×m} · T_m                | X_{m×n} = (T_m^T · Y_{m×n} · T_n) / 1024                          |

where N_{m×n} is an element-wise normalization matrix. The scaling factors in the transform definitions can be absorbed in the quantization process. This work focuses on implementing the transform matrix computations.

2.1.2 Matrix Factorization for Hardware Sharing

The transform matrices for H.264/AVC and VC-1 have identical structure, as shown in eq. (2.1) and eq. (2.2). In this section, we exploit this fact to design a shared transform engine for H.264/AVC and VC-1. The 8x8 transform matrix can be decomposed into two 4x4 matrices using even-odd decomposition [66], given by eq. (2.3).

T_8 = B_8 · M_8 · P_8    (2.3)

where

M_8 = \begin{bmatrix}
a & f & a & g & 0 & 0 & 0 & 0 \\
a & g & -a & -f & 0 & 0 & 0 & 0 \\
a & -g & -a & f & 0 & 0 & 0 & 0 \\
a & -f & a & -g & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & e & -d & c & -b \\
0 & 0 & 0 & 0 & d & -b & e & c \\
0 & 0 & 0 & 0 & c & -e & -b & -d \\
0 & 0 & 0 & 0 & b & c & d & e
\end{bmatrix}    (2.4)

P_8 is a permutation matrix, which groups the even-indexed and odd-indexed inputs together and has zero computational complexity, and B_8 is a butterfly matrix that can be implemented using 8 adders:

B_8 = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & -1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & -1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & -1 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & -1
\end{bmatrix} \quad \text{and} \quad P_8 \cdot [x_0, \ldots, x_7]^T = [x_0, x_2, x_4, x_6, x_1, x_3, x_5, x_7]^T    (2.5)

We propose further factorization of the even and odd components of M_8 to achieve hardware sharing between the H.264 and VC-1 matrices. The factorization scheme is derived in such a way that the H.264 and VC-1 matrices yield the maximum number of common factors. The even component of H.264 is factorized as shown in eq. (2.6).

H_e = \begin{bmatrix} 8 & 8 & 8 & 4 \\ 8 & 4 & -8 & -8 \\ 8 & -4 & -8 & 8 \\ 8 & -8 & 8 & -4 \end{bmatrix}
= \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & -1 \\ 1 & 0 & -1 & 0 \end{bmatrix}
\cdot \begin{bmatrix} 8 & 0 & 8 & 0 \\ 8 & 0 & -8 & 0 \\ 0 & 8 & 0 & 4 \\ 0 & 4 & 0 & -8 \end{bmatrix}
= F_{1e} \cdot 4F_{2e}    (2.6)

where

F_{2e} = \begin{bmatrix} 2 & 0 & 2 & 0 \\ 2 & 0 & -2 & 0 \\ 0 & 2 & 0 & 1 \\ 0 & 1 & 0 & -2 \end{bmatrix}

The even component of VC-1 is factorized as shown in eq. (2.7).

V_e = \begin{bmatrix} 12 & 16 & 12 & 6 \\ 12 & 6 & -12 & -16 \\ 12 & -6 & -12 & 16 \\ 12 & -16 & 12 & -6 \end{bmatrix}
= F_{1e} \cdot (6F_{2e} + 4F_{3e}), \quad
F_{3e} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & -1 \end{bmatrix}    (2.7)

Similarly, we propose factorizing the odd component of H.264 as shown in eq. (2.8).

H_o = \begin{bmatrix} 3 & -6 & 10 & -12 \\ 6 & -12 & 3 & 10 \\ 10 & -3 & -12 & -6 \\ 12 & 10 & 6 & 3 \end{bmatrix}
= \begin{bmatrix} 3 & 2 & -2 & 0 \\ -2 & 0 & -3 & 2 \\ 2 & -3 & 0 & 2 \\ 0 & 2 & 2 & 3 \end{bmatrix}
\cdot \begin{bmatrix} 1 & 0 & 0 & -4 \\ 0 & 1 & 4 & 0 \\ 0 & 4 & -1 & 0 \\ 4 & 0 & 0 & 1 \end{bmatrix}
= (2F_{2o} + 3F_{3o}) \cdot F_{1o}    (2.8)

where

F_{2o} = \begin{bmatrix} 0 & 1 & -1 & 0 \\ -1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{bmatrix}, \quad
F_{3o} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}

And the odd component of VC-1 is factorized as shown in eq. (2.9).
V_o = \begin{bmatrix} 4 & -9 & 15 & -16 \\ 9 & -16 & 4 & 15 \\ 15 & -4 & -16 & -9 \\ 16 & 15 & 9 & 4 \end{bmatrix}
= \begin{bmatrix} 4 & 3 & -3 & 0 \\ -3 & 0 & -4 & 3 \\ 3 & -4 & 0 & 3 \\ 0 & 3 & 3 & 4 \end{bmatrix} \cdot F_{1o}
= (3F_{2o} + 4F_{3o}) \cdot F_{1o}    (2.9)

Notice that the major factors, F_{1e} and F_{2e}, are common between the even components of H.264 and VC-1. The factor F_{3e} for VC-1 is a very sparse matrix and has very little computational complexity. Similarly, all the factors, F_{1o}, F_{2o} and F_{3o}, are common between the odd components of H.264 and VC-1. This factorization allows us to maximize hardware sharing between the even as well as the odd components of H.264 and VC-1.

The hardware architecture for the shared implementation of the even component for H.264 and VC-1, using the factorization defined by eq. (2.6) and eq. (2.7), is shown in Figure 2-1. The architecture for the odd component, using the factorization defined by eq. (2.8) and eq. (2.9), is shown in Figure 2-2. A column of input data is represented as:

[x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7]^T    (2.10)

Figure 2-1: Hardware architecture of the even component. The figure shows the data paths exercised in (a) H.264 and (b) VC-1.

Figure 2-2: Hardware architecture of the odd component. The figure shows the data paths exercised in (a) H.264 and (b) VC-1.

Reconfigurability is achieved by using multiplexers to program the datapath, controlled by a flag indicating the standard (H.264 or VC-1) being used. The shared 4x4 transform for H.264 and VC-1 is achieved in a similar manner, as defined by eq. (2.11), where T_4 is defined by eq. (2.1):

T_4^H = (F_{1e} \cdot F_{2e}) \gg 1 \quad \text{and} \quad T_4^V = F_{1e} \cdot (8F_{2e} + 4F_{3e} + F_4)    (2.11)

where F_{1e}, F_{2e} and F_{3e} are defined in eq. (2.6) and eq. (2.7), and F_4 in eq. (2.12).

F_4 = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 1 & 0 & -1 & 0 \\ 0 & 2 & 0 & 2 \\ 0 & 2 & 0 & -2 \end{bmatrix}    (2.12)
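These factorizations are straightforward to verify numerically. The following Python sketch (a verification aid written for this text, not the hardware description; the matrix names follow eqs. (2.6)-(2.9)) reconstructs the even and odd components of both standards from the shared factors:

```python
import numpy as np

# Shared factors from eqs. (2.6)-(2.9)
F1e = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [0, 1, 0, -1], [1, 0, -1, 0]])
F2e = np.array([[2, 0, 2, 0], [2, 0, -2, 0], [0, 2, 0, 1], [0, 1, 0, -2]])
F3e = np.array([[0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, -1]])
F1o = np.array([[1, 0, 0, -4], [0, 1, 4, 0], [0, 4, -1, 0], [4, 0, 0, 1]])
F2o = np.array([[0, 1, -1, 0], [-1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]])
F3o = np.array([[1, 0, 0, 0], [0, 0, -1, 0], [0, -1, 0, 0], [0, 0, 0, 1]])

He = F1e @ (4 * F2e)             # H.264 even component, eq. (2.6)
Ve = F1e @ (6 * F2e + 4 * F3e)   # VC-1 even component, eq. (2.7)
Ho = (2 * F2o + 3 * F3o) @ F1o   # H.264 odd component, eq. (2.8)
Vo = (3 * F2o + 4 * F3o) @ F1o   # VC-1 odd component, eq. (2.9)

# The reconstructed components match the even-odd decomposition of
# eq. (2.2) with the H.264 and VC-1 coefficient sets
assert (He == np.array([[8, 8, 8, 4], [8, 4, -8, -8],
                        [8, -4, -8, 8], [8, -8, 8, -4]])).all()
assert (Ve[0] == np.array([12, 16, 12, 6])).all()
assert (Ho[3] == np.array([12, 10, 6, 3])).all()
assert (Vo[3] == np.array([16, 15, 9, 4])).all()
```

In hardware, the shared factors map onto a single adder/shift datapath, with the small standard-specific scalings (4 versus 6 in the even part, 2 versus 3 in the odd part) selected by multiplexers.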
2.1.3 Eliminating Transpose Memory

Conventional row-column decomposition uses the same 1D transform architecture for both row and column operations. This requires a transpose memory between the row-wise 1D transform and the column-wise 1D transform. The transpose memory can be a significant part of the total area (as high as 48% of the gate count in one of the benchmarked designs) and power consumed by the 2D transform. We propose an approach that avoids the transpose memory by using separate designs for the row-wise and column-wise 1D transforms and using the output buffer to store intermediate data. By enabling the output buffer to read and write 2D data through a wide set of ports, referred to as a 2D output buffer, an explicit transposition is avoided. In this implementation, we spread the processing of an 8x8 block over 4 clock cycles. In each cycle, we process 8x2 data, i.e. two columns (0 and 7, 1 and 6, 2 and 5, 3 and 4) of the 8x8 input, to obtain two transformed columns, as shown in Figure 2-3.

Figure 2-3: Column-wise 1D transform: 8x8 data is processed over four clock cycles, C0 to C3: columns 0 and 7 in C0, 1 and 6 in C1, 2 and 5 in C2, 3 and 4 in C3. Two transformed columns are generated in each clock cycle.

For the row-wise computation, an entire row (transposed column) is not available in each clock cycle without using a transpose memory. To overcome this problem, we compute only partial products for all 8x8 coefficients in each clock cycle and store them in the 8x8 output buffer, as shown in Figure 2-4. The processing in Figure 2-3 and Figure 2-4 is shown as a direct inner product for simplicity. The implementation performs the same processing using the matrix decomposition approach described in Section 2.1.2.

Figure 2-4: Row-wise 1D transform: Partial products for all 64 coefficients are computed in each clock cycle, using the 2x8 data obtained by transposing the two columns generated by the 1D column-wise transform. The partial products are stored in the output buffer. At the end of four clock cycles, the output buffer contains the complete 2D transformed output.

Over four clock cycles, we add and accumulate the results for all 8x8 coefficients in the output buffer with 64 reads/writes each cycle, so that at the end of the fourth clock cycle we have the complete result for the entire 8x8 block. The partial products computed in each clock cycle, for the column vector [u_{00}, u_{01}, u_{02}, u_{03}, u_{04}, u_{05}, u_{06}, u_{07}]^T, are shown in Table 2.2. These partial products are generated by the hardware architectures shown in Figure 2-5. The appropriate coefficients are selected by the multiplexers in each clock cycle.
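The scheduling of Figures 2-3 and 2-4 can be summarized in a short behavioral model. The sketch below (our own illustration of the stated schedule, not the RTL; it uses a direct matrix product in place of the factorized datapath, with the unified matrix of eq. (2.2) written with its rows as basis vectors and the H.264 coefficient set) computes Y = T·X·T^T over four "cycles" by pairing columns (c, 7-c) and accumulating row-wise partial products directly in the 2D output buffer:

```python
import numpy as np

def transform_2d_no_transpose(X, T):
    """Compute Y = T @ X @ T.T for an 8x8 block over four cycles,
    accumulating partial products in a 2D output buffer instead of
    storing a transposed intermediate."""
    Y = np.zeros((8, 8), dtype=np.int64)      # the 8x8 output buffer
    for c in range(4):                        # clock cycles C0..C3
        for j in (c, 7 - c):                  # two columns per cycle
            v = T @ X[:, j]                   # column-wise 1D transform
            # Row-wise partial products: transformed column j
            # contributes v[i] * T[k, j] to every output Y[i, k].
            Y += np.outer(v, T[:, j])
    return Y

# Check against the direct 2D transform for a random residue block
T = np.array([[8, 8, 8, 8, 8, 8, 8, 8],
              [12, 10, 6, 3, -3, -6, -10, -12],
              [8, 4, -4, -8, -8, -4, 4, 8],
              [10, -3, -12, -6, 6, 12, 3, -10],
              [8, -8, -8, 8, 8, -8, -8, 8],
              [6, -12, 3, 10, -10, -3, 12, -6],
              [4, -8, 8, -4, -4, 8, -8, 4],
              [3, -6, 10, -12, 12, -10, 6, -3]], dtype=np.int64)
X = np.random.randint(-128, 128, (8, 8)).astype(np.int64)
assert (transform_2d_no_transpose(X, T) == T @ X @ T.T).all()
```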
Table 2.2: Row-wise transform computations for the even and odd components over four clock cycles

Even component (outputs y_e0 to y_e3, H.264 / VC-1):

| Clk | y_e0             | y_e1               | y_e2               | y_e3               |
| C0  | 8u_00 / 12u_00   | 8u_00 / 12u_00     | 8u_00 / 12u_00     | 8u_00 / 12u_00     |
| C1  | 4u_06 / 6u_06    | -8u_06 / -16u_06   | 8u_06 / 16u_06     | -4u_06 / -6u_06    |
| C2  | 8u_02 / 16u_02   | 4u_02 / 6u_02      | -4u_02 / -6u_02    | -8u_02 / -16u_02   |
| C3  | 8u_04 / 12u_04   | -8u_04 / -12u_04   | -8u_04 / -12u_04   | 8u_04 / 12u_04     |

Odd component (outputs y_o0 to y_o3, H.264 / VC-1):

| Clk | y_o0              | y_o1               | y_o2               | y_o3              |
| C0  | -12u_07 / -16u_07 | 10u_07 / 15u_07    | -6u_07 / -9u_07    | 3u_07 / 4u_07     |
| C1  | 3u_01 / 4u_01     | 6u_01 / 9u_01      | 10u_01 / 15u_01    | 12u_01 / 16u_01   |
| C2  | 10u_05 / 15u_05   | 3u_05 / 4u_05      | -12u_05 / -16u_05  | 6u_05 / 9u_05     |
| C3  | -6u_03 / -9u_03   | -12u_03 / -16u_03  | -3u_03 / -4u_03    | 10u_03 / 15u_03   |

Figure 2-5: Hardware architecture of the (a) even and (b) odd component. Std = {0: H.264, 1: VC-1}.

2.1.4 Data Dependent Processing

In addition to processing optimization, it is also important to take the nature of the input data into account to achieve further power savings. By exploiting the characteristics of the data being processed, architectures can be designed to minimize switching activity, optimize pipeline bit widths and perform a variable number of operations per block [67]. Application-specific SRAM designs for video coding applications that exploit the correlation of stored data and signal statistics to reduce the bit-line switching activity, and consequently the energy consumption, are proposed in [68,69].

The transform engine operates on the 8 bit prediction residue. Figure 2-6 shows the histogram of the prediction residue for a number of test sequences. This analysis shows that more than 80% of the prediction residue lies in the range -32 to +32. Due to 2's complement processing, a large number of bits are flipped every time a number changes from a small negative value to a small positive value. At the input, this results in high switching activity around zero. Switching activity at the input propagates through the system, though the effect is different at different nodes. For example, a node implementing functionality similar to XOR shows high switching activity, whereas other nodes show significantly lower switching activity. Because of this, different input patterns affect the system switching activity differently. Overall, we observe that high switching activity at the input results in high switching activity for the entire system.

Figure 2-6: Histogram of the prediction residue for a number of test sequences (Horsecab, Rally, Splash, Waterskiing).

Figure 2-7 shows the correlation between switching activity at the input and the system switching activity for 150 different input sequences. Input switching activity of 0 refers to no bits changing at the input and 1 refers to all the input bits switching simultaneously from 0 to 1 or 1 to 0. For the system switching activity, 0 refers to no activity, which corresponds to leakage power, and 1 refers to maximum power consumption. The plot shows a strong correlation of 0.83 between input switching activity and system switching activity. This indicates that reducing the input switching activity yields a significant reduction in system switching activity.
Figure 2-7: Correlation between input switching activity and system switching activity. The plot also shows a linear regression of the data. The measured correlation is 0.83.

In order to reduce the switching activity, we pre-process the input data by adding a fixed DC bias to the prediction residue. To accommodate the added bias, the dynamic range is increased from 8 bit to 9 bit. The DC bias shifts the input histogram to the right. For example, for a DC bias of 32, more than 80% of the input data falls within 0 to 64, so fewer than 6 LSBs are flipped during most operations, reducing the overall switching activity. Note that the DC bias only affects the DC coefficient in the transform output. This can be easily corrected by subtracting a corresponding bias from the DC coefficient at the output. Figure 2-8 shows the reduction in switching activity and power as a function of the DC bias value, despite the one bit increase in bit width, for different video sequences.

Figure 2-8: Switching activity and power consumption in the transform as a function of the DC bias applied to the input data.

On average, the switching activity and power consumption reach a minimum for a DC bias of about 64 and then start to increase again. This is because, as a higher DC bias is applied, more MSBs start switching, partially offsetting the reduction in switching activity in the LSBs. The data dependent processing scheme has less than 5% hardware cost and reduces the average switching activity by 30% and the average power by 15% for a DC bias of 64.
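The effect of the DC bias is easy to reproduce in simulation. The sketch below (our own construction using a synthetic Laplacian-like residue, not measured chip data; the helper names are ours) counts the average number of bits toggling between consecutive 2's-complement samples, with and without a bias of 64:

```python
import numpy as np

def avg_toggles(samples, bits=9):
    """Average Hamming distance between consecutive samples,
    encoded as `bits`-wide 2's-complement words."""
    codes = np.asarray(samples, dtype=np.int64) & ((1 << bits) - 1)
    diffs = codes[1:] ^ codes[:-1]
    return np.mean([bin(int(d)).count("1") for d in diffs])

rng = np.random.default_rng(0)
# Synthetic prediction residue: zero-mean, mostly within -32..+32
residue = rng.laplace(0, 12, 100_000).astype(np.int64).clip(-128, 127)

print(avg_toggles(residue))       # small +/- values flip many bits near 0
print(avg_toggles(residue + 64))  # biased input stays positive: fewer flips
```

Only the DC term of the transform output is affected by the bias, since a constant input excites only the all-a DC basis of the transform (every other basis row of eq. (2.2) sums to zero), so a single subtraction at the output restores the correct coefficients.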
2.2 Future Video Coding Standards

The ideas proposed in this work have general applicability beyond the H.264/AVC and VC-1 video coding standards. In this section, we look at applying these ideas to the 8x8 transform of the next generation video coding standard, High-Efficiency Video Coding (HEVC) [70]. The HEVC standard recommendation [70] defines the 8x8 1D transform as given by eq. (2.13).

T_8 = \begin{bmatrix}
64 & 89 & 83 & 75 & 64 & 50 & 36 & 18 \\
64 & 75 & 36 & -18 & -64 & -89 & -83 & -50 \\
64 & 50 & -36 & -89 & -64 & 18 & 83 & 75 \\
64 & 18 & -83 & -50 & 64 & 75 & -36 & -89 \\
64 & -18 & -83 & 50 & 64 & -75 & -36 & 89 \\
64 & -50 & -36 & 89 & -64 & -18 & 83 & -75 \\
64 & -75 & 36 & 18 & -64 & 89 & -83 & 50 \\
64 & -89 & 83 & -75 & 64 & -50 & 36 & -18
\end{bmatrix}    (2.13)

Notice that the structure of this transform matrix is the same as the generalized matrix for H.264/AVC and VC-1, defined in eq. (2.2), where: a = 64, b = 89, c = 75, d = 50, e = 18, f = 83, g = 36. The idea of matrix decomposition for hardware sharing, as described in Section 2.1.2, can be applied to eq. (2.13) as well. An extension of even-odd decomposition for the HEVC transform to reduce hardware complexity is described in [71]. Even-odd decomposition, performed as defined in eq. (2.3), gives the even and odd components of the 8x8 HEVC matrix, defined by eq. (2.14) and eq. (2.15) respectively.

HEVC_e = \begin{bmatrix} 64 & 83 & 64 & 36 \\ 64 & 36 & -64 & -83 \\ 64 & -36 & -64 & 83 \\ 64 & -83 & 64 & -36 \end{bmatrix}    (2.14)

HEVC_o = \begin{bmatrix} 18 & -50 & 75 & -89 \\ 50 & -89 & 18 & 75 \\ 75 & -18 & -89 & -50 \\ 89 & 75 & 50 & 18 \end{bmatrix}    (2.15)

The even and odd components can be further factorized as given by eq. (2.16) and eq. (2.17) respectively.

HEVC_e = F_{1e} \cdot (32 F_{2e} + 4 F_{4e} + 15 F_{3e})    (2.16)

HEVC_o = (15 F_{2o} + 22 F_{3o} + F_{4o}) \cdot F_{1o} + 5 F_{5o}    (2.17)

Notice that the factors F_{1e}, F_{2e}, F_{3e}, F_{1o}, F_{2o} and F_{3o} are the same as those defined in eq. (2.6) through eq. (2.9) for the H.264 and VC-1 factorization. F_{4e}, F_{4o} and F_{5o}, defined by eq. (2.18), eq. (2.19) and eq. (2.20) respectively, are extremely sparse matrices.

F_{4e} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & -1 \end{bmatrix}    (2.18)

F_{4o} = \begin{bmatrix} 0 & 0 & 0 & -1 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}    (2.19)

F_{5o} = \begin{bmatrix} 0 & -1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & -1 \\ 0 & 0 & 1 & 0 \end{bmatrix}    (2.20)

Since most of the factors of the HEVC transform matrix are the same as those for H.264 and VC-1, it is possible to achieve an efficient hardware implementation with a shared architecture between H.264, VC-1 and HEVC, as shown in Figure 2-9 and Figure 2-10 for the even and odd components respectively. This demonstrates that matrix factorization can be extended to standards beyond H.264 and VC-1 to achieve shared hardware implementations for multiple standards.

Figure 2-9: Hardware architecture of the even component of the shared 8x8 transform for H.264, VC-1 and HEVC. Std = {0: VC-1, 1: H.264, 2: HEVC}. The highlighted blocks are the same as those used in the shared H.264/VC-1 architecture, shown in Figure 2-1.

Figure 2-10: Hardware architecture of the odd component of the shared 8x8 transform for H.264, VC-1 and HEVC. Std = {0: VC-1, 1: H.264, 2: HEVC}. The highlighted blocks are the same as those used in the shared H.264/VC-1 architecture, shown in Figure 2-2.

The identical structure of the transform matrix, as given by eq. (2.2), for H.264, VC-1 and HEVC arises because of the symmetric nature of the coefficients in the DCT, which forms the basis of the transforms in all of these standards. As long as a video coding standard uses a transform based on the DCT, it will always result in a matrix of the form of eq. (2.2). The transform matrices for different standards are near-multiples of each other with slight variations, and can be factorized into very similar factors to maximize sharing.
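The degree of reuse can again be checked numerically. In the sketch below (a verification aid written for this text, not the hardware description), the H.264/VC-1 factors from eqs. (2.6)-(2.9) combine with the sparse HEVC-only factors of eqs. (2.18)-(2.20) to reproduce eq. (2.14) and eq. (2.15):

```python
import numpy as np

# Factors shared with the H.264/VC-1 factorization (Section 2.1.2)
F1e = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [0, 1, 0, -1], [1, 0, -1, 0]])
F2e = np.array([[2, 0, 2, 0], [2, 0, -2, 0], [0, 2, 0, 1], [0, 1, 0, -2]])
F3e = np.array([[0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, -1]])
F1o = np.array([[1, 0, 0, -4], [0, 1, 4, 0], [0, 4, -1, 0], [4, 0, 0, 1]])
F2o = np.array([[0, 1, -1, 0], [-1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]])
F3o = np.array([[1, 0, 0, 0], [0, 0, -1, 0], [0, -1, 0, 0], [0, 0, 0, 1]])
# HEVC-only sparse factors, eqs. (2.18)-(2.20)
F4e = np.array([[0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 1], [0, 1, 0, -1]])
F4o = np.array([[0, 0, 0, -1], [0, -1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0]])
F5o = np.array([[0, -1, 0, 0], [1, 0, 0, 0], [0, 0, 0, -1], [0, 0, 1, 0]])

HEVCe = F1e @ (32 * F2e + 4 * F4e + 15 * F3e)        # eq. (2.16)
HEVCo = (15 * F2o + 22 * F3o + F4o) @ F1o + 5 * F5o  # eq. (2.17)

assert (HEVCe[0] == np.array([64, 83, 64, 36])).all()
assert (HEVCo[3] == np.array([89, 75, 50, 18])).all()
```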
The idea of eliminating the explicit transpose memory in the 2D transform, as described in Section 2.1.3, is equally applicable to HEVC. The processing, over four clock cycles, can be done in the same way as used for H.264 and VC-1, with the results accumulated in the output buffer.

Data dependent processing, as described in Section 2.1.4, is independent of the video coding standard being used. Since the nature of the input data (the prediction residue), as shown in Figure 2-6, is the same for HEVC as for H.264 and VC-1, we can use data dependent processing to reduce switching activity and power consumption in an HEVC transform engine as well. Figure 2-11 shows the results of switching activity simulations for the HEVC transform architecture proposed above. We consistently observe data dependent processing resulting in an average 25% reduction in switching activity, demonstrating the applicability of this idea beyond H.264 and VC-1.

Figure 2-11: Switching activity in the HEVC transform as a function of the DC bias applied to the input data.

It should also be noted that the ideas of even-odd decomposition and matrix factorization, as well as elimination of the explicit transpose memory, can be applied to transform matrices of larger sizes, such as 16x16 and 32x32. The ideas proposed in this work can potentially be extended to future video coding standards that use DCT based transforms.

The benefits of these optimizations become even more significant for larger transforms. For example, for the 32x32 transform in HEVC [71], the transform weights are 8 bits wide, as opposed to 5 bits in H.264 [57]. In addition, each 1D coefficient computation requires 32 add-multiply operations as opposed to 8. This leads to 6.4x more complexity per pixel in the HEVC transform compared to H.264 (32/8 operations x 8/5 bit width = 6.4). The 32x32 HEVC transform also requires a 16x larger transpose memory than the 8x8 transform in H.264. A hardware implementation of the HEVC decoder, proposed in [72], shows that the transform module constitutes about 17% of the decoder area and power consumption. This indicates that the area and power savings achieved by the ideas proposed in this work can contribute significantly towards a low power video encoder/decoder implementation for future video coding standards, such as HEVC.

2.3 Statistical Methodology for Low-Voltage Design

The performance of logic circuits is highly sensitive to variation in threshold voltage (VT) at low voltages, and the extremes of VT variation can also result in functional failures. For minimum geometry transistors, threshold voltage variation of 25 mV to 50 mV is typical. At nominal VDD, such as 1 V or 1.2 V, local variations in threshold voltage may result in 5% to 10% variation in logic timing. However, for low voltage operation (VDD ≈ 0.5 V), these variations can result in timing path delays with a standard deviation comparable to the global corner delay, and they must be accounted for during timing closure in order to ensure a robust, manufacturable design.

This challenge has been recognized [42,73,74], and circuit design techniques for low-voltage operation have begun to take into account Statistical Static Timing Analysis (SSTA) approaches for estimating circuit performance [75]. A logic gate design methodology accounting for global process corners, which identifies logic gates with severely asymmetric pullup/pulldown networks, is proposed in [76]. Nominal delay and delay variability models valid in both the above-threshold and subthreshold regions are proposed in [77]. A transistor sizing methodology to manage the trade-off between reducing variability and minimizing energy overhead is proposed in [78].

Most of these statistical approaches assume that the impact of variations on circuit performance can be modeled as a Gaussian distribution. This assumption is usually accurate at nominal voltage [79,80], but it fails to capture the non-linear impact of variations on circuit performance at low voltage, which results in highly non-Gaussian delay distributions. This phenomenon is depicted in Figure 2-12, which shows the delay Probability Density Function (PDF) of a representative path at 0.5 V, estimated using Gaussian SSTA and Monte-Carlo analysis. Static Timing Analysis (STA) estimates the global corner delay for the path to be 14.1 ns.
Modeling the impact of variations using Gaussian SSTA results in a 3σ delay estimate of 23.1 ns. However, Monte-Carlo analysis suggests that Gaussian SSTA is not adequate to fully capture the impact of variations, and results in a 3σ delay estimate of 31.8 ns.

Figure 2-12: Delay PDF of a representative timing path at 0.5 V. The STA estimate of the global corner delay is 14.1 ns, the 3σ delay estimate using Gaussian SSTA is 23.2 ns and the 3σ delay estimate using Monte-Carlo analysis is 31.8 ns.

Performing large Monte-Carlo simulations for processor designs with millions of transistors is impractical. We use a computationally efficient approach, called Operating Point Analysis (OPA) [81], that can perform accurate path-based timing analysis in the regime where delay is a highly non-linear function of the random variables and/or the PDFs of the random variables are non-Gaussian. OPA provides an approximation to the fσ value of a random variable D, when D is a linear or non-linear function D(x_1, x_2, ..., x_N) of random variables x_i, which can be Gaussian or non-Gaussian. The fσ operating point is the point in x_i-space where the joint probability density function of the x_i is maximum, subject to the constraint that D(x_1, x_2, ..., x_N) = D_{fσ}. In other words, the operating point represents the most likely combination of random variables x_i that results in the fσ delay for the logic gate or the timing path. Figure 2-13 illustrates the convolution integrand and the operating point for a case where delay is a non-linear function of two variables. A transcendental relationship is established between the unknown operating point and the unknown fσ delay, and this equation is solved iteratively.

Figure 2-13: Graphic illustration in x_i-space of the convolution integral and the operating point on the constraint surface D(x_1, x_2) = D_{fσ}.

The methodology, developed in [82], is summarized below.

Standard Cell Library Characterization

For the 45 nm process used in this work, local variations induced by Random Dopant Fluctuations (RDF) were modeled by two compact model parameters for each transistor. These transistor random variables (also called mismatch parameters) are statistically independent with approximately Gaussian PDFs. The OPA approach is applicable for any local variations given by a compact model of transistor mismatch parameters.

The goal of cell characterization is to predict the delay PDF for each arc of each cell. An arc is defined by the input rise or fall, the input slew rate and the output capacitance. At nominal voltage, cell delay is approximately linear in the transistor random variables, with the result that the cell delay is approximately Gaussian. However, at 0.5 V, cell delay is highly non-linear in the transistor random variables, with the result that the cell delay has a non-Gaussian PDF. OPA is used to perform stochastic characterization of the standard cell library at VDD = 0.5 V. This characterization ensures functionality and quantifies the performance of the standard cells at VDD = 0.5 V. Standard cells that fail functionality or do not satisfy the performance requirement are not used in the design. The functionality and setup/hold performance of flip-flops are also verified using the cell characterization approach.
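To make the operating-point idea concrete, the toy example below (our own two-variable construction, not the thesis characterization flow) defines a hypothetical delay that is exponentially sensitive to normalized VT mismatch variables x1 and x2. The 3σ delay is estimated two ways: directly, as the 99.865th percentile of a Monte-Carlo run, and via the operating point, as the delay level whose most likely point on the constraint surface lies three standard deviations from the origin:

```python
import numpy as np

def delay(x1, x2):
    # Hypothetical near-threshold delay model: exponential in the
    # normalized VT mismatch variables (standard normal x1, x2)
    return 10.0 * np.exp(0.45 * x1 + 0.08 * x1**2 + 0.25 * x2)

# Monte-Carlo reference: the 3-sigma quantile (99.865th percentile)
rng = np.random.default_rng(1)
x1, x2 = rng.standard_normal((2, 1_000_000))
d_mc = np.percentile(delay(x1, x2), 99.865)

# Operating-point estimate: beta(d) is the distance from the origin to
# the exceedance region {delay >= d}; its minimizer is the most likely
# point on the constraint surface. Solve beta(d) = 3 by scanning d.
g1, g2 = np.meshgrid(np.linspace(-5, 5, 801), np.linspace(-5, 5, 801))
D, R = delay(g1, g2), np.hypot(g1, g2)

def beta(d):
    return R[D >= d].min()

cands = np.linspace(0.5 * d_mc, 1.5 * d_mc, 400)
d_op = cands[np.argmin([abs(beta(d) - 3.0) for d in cands])]
print(f"3-sigma delay: Monte-Carlo {d_mc:.1f}, operating point {d_op:.1f}")
```

For Gaussian mismatch variables, maximizing the joint PDF on the constraint surface is equivalent to minimizing the distance to the origin, which is why the scan above reduces to a distance computation; for this mildly non-linear example the two estimates land in the same range even though the delay distribution itself is strongly skewed.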
Timing Path Analysis

The goal of timing path analysis is to compute the 3σ (or, in general, fσ) stochastic delay of a timing path. OPA is used, along with the pre-characterized standard cell library, to determine the 3σ setup/hold performance of individual paths from the design at 0.5 V. Figure 2-14 shows the PDF computed using OPA superimposed on the PDF computed using Monte-Carlo, for the path analyzed in Figure 2-12 at 0.5 V. Monte-Carlo analysis results in a 3σ delay estimate of 31.8 ns. OPA shows excellent agreement with Monte-Carlo, with a 3σ delay estimate of 30.7 ns.

Figure 2-14: Delay PDF of a representative timing path at 0.5 V, estimated using Gaussian SSTA, Monte-Carlo and OPA.

Full-Chip Timing Closure

Given the size of the design, it is not practical to analyze each path individually to determine its 3σ setup/hold performance. At nominal voltage, paths that fail the setup/hold requirement are determined using corner-based analysis, and timing closure is achieved by performing setup/hold fixes on these paths. However, at low voltage, it is not possible to consider only the paths that fail the setup/hold requirement in the corner analysis and determine their 3σ setup/hold performance, since a path with a larger corner delay need not have a larger stochastic variation. A three phase approach, outlined below, is used to reduce the number of paths that need to be analyzed for setup/hold constraints using OPA; a schematic sketch of the pruning flow follows the list.

1. All paths are analyzed with traditional STA using the corner delay plus the 3σ stochastic delay for each cell. This is a pessimistic analysis, so the paths that pass can be removed from further consideration.

2. The paths that did not pass the first phase are re-analyzed, this time using OPA for the launch and capture clock paths, as defined in Figure 2-15, and STA with the corner delay plus the 3σ stochastic delay for the cells in the data paths. Again, this is a pessimistic analysis and any paths that pass this phase need no further consideration.

3. Lastly, the few remaining paths are analyzed using OPA for the entire path.

The paths that fail the 3σ setup or hold performance test are optimized to fix the setup/hold violations. This process is repeated until all the timing paths in the design meet the 3σ setup and hold performance computed using OPA. Setup/hold fixing using OPA ensures that cells that are very sensitive to VT variations are not used in the critical paths.

Figure 2-15: Typical timing path, showing the launch and capture clock paths, the common clock path and the data path between registers.
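The pruning funnel can be illustrated with a toy model. In the sketch below (entirely schematic: each path is reduced to a corner delay plus per-cell sigmas, and the three analyses are stand-ins of decreasing pessimism rather than real STA/OPA engines), each phase re-examines only the paths that failed the previous, cheaper phase:

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy paths: (corner delay in ns, per-cell delay sigmas in ns)
paths = [(rng.uniform(20, 38), rng.uniform(0.05, 0.3, 20))
         for _ in range(20_000)]

def phase1(corner, sig):   # STA, corner + 3-sigma per cell: most pessimistic
    return corner + 3 * sig.sum()

def phase2(corner, sig):   # OPA on clock paths only: partially statistical
    return corner + 3 * (0.5 * sig.sum() + 0.5 * np.sqrt((sig ** 2).sum()))

def phase3(corner, sig):   # full OPA: statistical combination across cells
    return corner + 3 * np.sqrt((sig ** 2).sum())

t_clk = 40.0  # ns, i.e. a 25 MHz clock
fail1 = [p for p in paths if phase1(*p) > t_clk]
fail2 = [p for p in fail1 if phase2(*p) > t_clk]
fail3 = [p for p in fail2 if phase3(*p) > t_clk]  # paths needing fixes
print(len(paths), "->", len(fail1), "->", len(fail2), "->", len(fail3))
```

Because each successive phase is strictly less pessimistic, a path cleared at any phase can never fail a later one, so the expensive full-path analysis runs on only a small residue of the original path set.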
Table 2.3 shows statistics on the number of paths analyzed during each phase of timing closure, for both setup and hold analysis of the entire chip.

Table 2.3: Full-chip Timing Analysis

Setup Analysis (at 25 MHz):

| Phase | Data Path  | Clock Path | Paths Analyzed | Worst Slack | % Fail |
| 1     | STA (+3σ)  | STA (-3σ)  | 20k            | -14.2 ns    | 5%     |
| 2     | STA (+3σ)  | OPA        | 1k             | -3.2 ns     | 9%     |
| 3     | OPA        | OPA        | 87             | -0.2 ns     | 12%    |

Paths requiring fixing (before timing closure): 10

Hold Analysis:

| Phase | Data Path  | Clock Path | Paths Analyzed | Worst Slack | % Fail |
| 1     | STA (-3σ)  | STA (+3σ)  | 20k            | -11.2 ns    | 7%     |
| 2     | STA (-3σ)  | OPA        | 1.4k           | -2.5 ns     | 8%     |
| 3     | OPA        | OPA        | 112            | -0.1 ns     | 14%    |

Paths requiring fixing (before timing closure): 16

The overall statistical design methodology can be summarized as shown in Figure 2-16.

Figure 2-16: OPA based statistical design methodology for low voltage operation: SPICE netlist extraction, data/clock path extraction, 3-phase path pruning, OPA analysis of potentially critical paths, and setup/hold fixes for failing paths until timing closure is achieved.

2.4 Implementation

In this work, we implemented ten different versions of the transform engine, listed in Table 2.4, and compared their relative performance. All transforms have been implemented to complete an 8x8 transform over 4 clock cycles.

Table 2.4: Transform engines implemented in this design

| Tr. Type | Description                                              |
| HVF8     | Shared 8x8 forward transform without transpose memory    |
| HVI8     | Shared 8x8 inverse transform without transpose memory    |
| HVF8M    | Shared 8x8 forward transform with transpose memory       |
| HVI8M    | Shared 8x8 inverse transform with transpose memory       |
| HF8      | 8x8 forward transform for H.264 without transpose memory |
| HI8      | 8x8 inverse transform for H.264 without transpose memory |
| VF8      | 8x8 forward transform for VC-1 without transpose memory  |
| VI8      | 8x8 inverse transform for VC-1 without transpose memory  |
| HVF4     | Shared 4x4 forward transform                             |
| HVI4     | Shared 4x4 inverse transform                             |

In this design, the output buffer has been implemented as a register bank of size 8x8 with each element 8 bits wide. The architecture of the 2D transform engine, along with the output buffer, is shown in Figure 2-17. Figure 2-18 shows the die photo of the IC, fabricated in a commercial 45 nm CMOS technology. The gate counts in Figure 2-18 include the output buffer.

Figure 2-17: Block diagram of the 2D transform engine design.

Figure 2-18: Die photo and design statistics of the fabricated IC. Technology: 45 nm CMOS; active area: 1.5 mm2; I/O pads: 96. Gate counts: HVF8 44.7k, HVF8M 66.5k, HF8 30.9k, VF8 35.6k, HVF4 18.8k; HVI8 45.1k, HVI8M 66.8k, HI8 31.6k, VI8 35.8k, HVI4 18.9k.

The proposed shared transform engine design uses separate 1D transforms for the column-wise and row-wise computations and does not use a transpose memory. The 1D column-wise and row-wise transforms are designed using the shared architectures described in Sections 2.1.2 and 2.1.3 respectively. The 2D output buffer is used to store intermediate data. The shared transform modules with transpose memory are implemented using the shared 1D transform architecture described in Section 2.1.2 for both column-wise and row-wise transforms. Each 1D transform processes 8x2 data in each clock cycle, and a 16x8 transpose memory, which constitutes 48% of the gate count, is used to allow operation in ping-pong mode to achieve a throughput of one 8x8 2D transform every 4 cycles. An alternative approach to achieving the same throughput is to process 8x4 data in each clock cycle and use an 8x8 transpose memory.
This has not been implemented on chip; however, synthesis results show a 15% higher overall gate count for that approach.

2.5 Measurement Results

The shared architecture for the 8x8 transform (HVF8/HVI8) achieves a 25% reduction in area compared to the combined area of the individual 8x8 transforms for H.264 (HF8/HI8) and VC-1 (VF8/VI8). Eliminating the explicit transpose memory saves 23% area compared to the implementation that uses a transpose memory (HVF8M/HVI8M). A decoder only uses the inverse transforms; an encoder requires both forward and inverse transforms, thus doubling the area savings due to hardware sharing. Figure 2-19 shows the measured power consumption and frequency for the different transform modules as a function of VDD.

Figure 2-19: Measured power consumption and frequency scaling with VDD for different transform implementations. (a) Frequency scaling with VDD; (b) power consumption while operating at the frequency shown in (a).

All the transform modules implemented on this chip have been verified to be operational in support of video encoding/decoding with Quad Full-HD (3840 x 2160) resolution at 30 fps. The shared 8x8 transform achieves video encoding/decoding in both H.264 and VC-1 with 3840 x 2160 (QFHD) resolution at 30 fps while operating at 25 MHz at 0.52 V. The module also achieves 1080p (Full-HD) at 30 fps while operating at 6.3 MHz at 0.41 V, and 720p (HD) at 30 fps while operating at 2.8 MHz at 0.35 V. Measurement results for all the modules are summarized in Table 2.5.

Table 2.5: Measurement results for the implemented transform modules

| Transform Type | QFHD@30fps, 25 MHz: VDD (V) | Power (uW) | 1080p@30fps, 6.3 MHz: VDD (V) | Power (uW) | 720p@30fps, 2.8 MHz: VDD (V) | Power (uW) |
| HVF8  | 0.52 | 214 | 0.41 | 79 | 0.35 | 43 |
| HVI8  | 0.53 | 218 | 0.42 | 81 | 0.36 | 44 |
| HVF8M | 0.50 | 270 | 0.40 | 95 | 0.33 | 51 |
| HVI8M | 0.49 | 268 | 0.40 | 94 | 0.33 | 50 |
| HF8   | 0.51 | 175 | 0.41 | 67 | 0.34 | 35 |
| HI8   | 0.50 | 172 | 0.40 | 66 | 0.33 | 34 |
| VF8   | 0.51 | 189 | 0.41 | 70 | 0.35 | 38 |
| VI8   | 0.51 | 188 | 0.41 | 70 | 0.34 | 37 |
| HVF4  | 0.49 | 127 | 0.39 | 55 | 0.33 | 31 |
| HVI4  | 0.48 | 124 | 0.40 | 54 | 0.33 | 30 |

Figure 2-20 compares the power consumption of the shared transform without transpose memory, the shared transform with transpose memory, and the individual transform implementations for H.264 and VC-1. While supporting Quad Full-HD resolution, eliminating the explicit transpose memory reduces the power consumption of the 8x8 transform by 26%.

Figure 2-20: Power consumption for transform modules with and without transpose memory, and with and without the shared architecture for H.264 and VC-1, at QFHD, 1080p and 720p resolutions (30 fps).

Data dependent processing affects different architectures differently because of the varying degrees of correlation between input switching activity and system switching activity. Figure 2-21 shows the switching activity and power consumption for different transform modules as a function of the input DC bias. We observe a reduction in switching activity of 25%-30% across the modules, resulting in a 15%-20% power saving.

Figure 2-21: Switching activity and power consumption in the transform modules as a function of the DC bias applied to the input data.
Table 2.6 summarizes the overheads and advantages of the three key ideas proposed in this work. Applying the DC bias requires 16 adders (with one fixed input, i.e. the DC bias) that cause a 5% increase in area and a 4% increase in power, but it reduces the switching activity by 30%, which results in a 15% overall power saving for the design. Hardware sharing requires 26 additional 2:1 multiplexers that consume 9% area and 6% power, but sharing allows the H.264 and VC-1 transforms to be implemented using 78 adders and 62 multiplexers (including the overhead), as opposed to 126 adders and 60 multiplexers for individual H.264 and VC-1 implementations, which reduces the overall area by 25%. The scheme for eliminating the transpose memory requires accessing 8x8 data in each clock cycle for the row-wise transform computations. This increases the data accesses by 4x for the row-wise computations, as opposed to the implementation that uses a transpose memory, and leads to a 7% increase in power consumption. However, eliminating the transpose memory saves 23% area and 26% power. Overall, the proposed design optimizations reduce the power consumption by about 40%, despite the overheads.

Table 2.6: Overheads and advantages of the proposed ideas

| Feature                                  | Overhead                | Advantage                                                    |
| Data Dependent Processing                | 5% area, 4% power       | 30% reduction in switching activity, 15% reduction in power  |
| Hardware Sharing between H.264 and VC-1  | 9% area, 6% power       | 25% reduction in area                                        |
| Transpose Memory Elimination             | 4x data access, 7% power | 23% reduction in area, 26% reduction in power               |

Table 2.7 shows a performance comparison of the proposed approach to 2D transform implementation with some previously published approaches. The comparison shows that the proposed approach achieves a significant reduction in power compared to the previous approaches. Assuming a roughly 4x scaling in power due to technology scaling from 180 nm to 45 nm, the architectural techniques proposed in this work reduce the power consumption by over 45x compared to [83] (38.7 mW / 4 ≈ 9.7 mW, versus 214 uW) and 68x compared to [84], while achieving the same throughput at VDD = 0.52 V.

Table 2.7: Performance comparison of the proposed approach with previous publications

|                     | Huang'08 [83] | Fan'11 [84]              | Wang'11 [85]             | Chen'11 [86] | This Work (Low VDD) | This Work (Nominal VDD) |
| Technology          | 180 nm        | 180 nm                   | 130 nm                   | 180 nm       | 45 nm               | 45 nm                   |
| Gates               | 39.8k         | 95.1k                    | 23.1k                    | 17.7k        | 44.7k               | 44.7k                   |
| Parallelism         | 8x            | 8x                       | 8x                       | 4x           | 16x                 | 16x                     |
| Throughput          | 400M pixels/s | 400M pixels/s            | 800M pixels/s            | 1000M pixels/s | 400M pixels/s     | 4640M pixels/s          |
| Frequency           | 50 MHz        | 50 MHz                   | 100 MHz                  | 250 MHz      | 25 MHz              | 290 MHz                 |
| Voltage             | 1.8 V         | 1.8 V                    | 1.2 V                    | 1.8 V        | 0.52 V              | 1.0 V                   |
| Power               | 38.7 mW       | 58.01 mW                 | -                        | 54 mW        | 214 uW              | 4.1 mW                  |
| Supported Standards | MPEG, H.264   | MPEG, H.264, AVS, VC-1   | MPEG, H.264, AVS, VC-1   | H.264        | H.264, VC-1         | H.264, VC-1             |
| Transform Type      | Forward       | Inverse                  | Forward, Inverse         | Inverse      | Forward, Inverse    | Forward, Inverse        |

2.6 Summary and Conclusions

The ability to perform very high-resolution video encoding/decoding for multiple standards at ultra-low voltage, to achieve low power operation, is critical in multimedia devices.
In this work, we have developed a shared architecture for an H.264/AVC and VC-1 transform engine. The similarity between the structures of the transform matrices is exploited through matrix decomposition to maximize hardware sharing. The shared architecture saves more than 30% hardware compared to the total hardware requirement of individual H.264/AVC and VC-1 transform implementations. An approach to eliminating the explicit transpose memory is demonstrated, using a 2D output buffer and separately designed row-wise and column-wise 1D transforms. This reduces the area by 23% and the power by 26% compared to the implementation that uses a transpose memory. We have also demonstrated that data dependent processing can reduce the switching activity by more than 30% and further reduce power consumption. The implementation supports Quad Full-HD (3840 x 2160) video encoding/decoding at 30 fps while operating at 0.52 V.

The ideas of matrix factorization for hardware sharing, transpose memory elimination and data dependent processing could potentially be extended to other coding standards as well. As bigger block sizes such as 32x32 and 64x64 are explored in future video coding standards like HEVC, these ideas could lead to even higher savings in the area and power requirements of the transform engine, allowing efficient implementation in multi-standard multimedia devices.

Exploration of the ideas proposed in this work leads to the following conclusions.

1. Reconfigurable hardware architectures that implement optimized core functional units for a class of applications, such as video coding, and enable configurable datapaths with distributed control, are key to supporting efficient processing for multiple applications. Algorithmic optimizations that reframe the algorithms are important for enabling hardware reconfigurability.

2. Data dependent processing can be a powerful tool for reducing system power consumption. By exploiting the characteristics of the data being processed, architectures can be designed to minimize switching activity, optimize pipeline bit widths and perform a variable number of operations per block. The reduction in computations and switching activity has a direct impact on the system power consumption.

3. Memory size and power consumption can have a significant impact on system efficiency. Architectural approaches that trade off small increases in logic complexity for significant reductions in memory size and power consumption can provide the most effective system design solutions.

4. Low-voltage operation of circuits is important to provide a wide voltage/frequency operating range and to attain minimum energy operation. Global and local variations have a significant impact on circuit performance at low voltages. This impact cannot be fully captured with corner-based STA or Gaussian SSTA techniques. Statistical design approaches that take into account the non-linear impact of variations on circuit performance at low voltage must be used to ensure reliable low-voltage operation.

Chapter 3

Reconfigurable Processor for Computational Photography

Computational photography is transforming digital photography by significantly enhancing and extending the capabilities of a digital camera.
The field encompasses a wide range of techniques, such as High Dynamic Range (HDR) imaging [87], low-light enhancement [138,139], panorama stitching [88], image deblurring [89] and light field photography [90], that allow users not only to capture a scene flawlessly but also to reveal details that could otherwise not be seen. Non-linear filtering techniques, such as bilateral filtering [31,91,92], anisotropic diffusion [93,94] and optimization-based methods [95,96], form a significant part of computational photography. The behaviors of such techniques have been well studied and characterized [97-102]. These techniques have a wide range of applications, including denoising [103,104], HDR imaging [87], low-light enhancement [138,139], tone management [105,106], video enhancement [107,108] and optical flow estimation [109,110]. The high computational complexity of such multimedia processing applications necessitates fast hardware implementations [111,112] to enable real-time processing in an energy-efficient manner.

Recent research has focused on specialized image sensors to capture information that is not captured by a regular CMOS image sensor. An image sensor with multi-bucket pixels is proposed in [113] to enable time multiplexed exposure, which improves the image dynamic range and detects structured light illumination. A back-illuminated stacked CMOS sensor is proposed in [114] that uses spatially varying pixel exposures to support HDR imaging. An approach to reducing the temporal readout noise in an image sensor is proposed in [115] to improve low-light imaging. However, computational photography applications using the regular CMOS image sensors currently found in commercial cameras have so far remained software based. Such CPU/GPU based implementations lead to high energy consumption and typically do not support real-time processing.

This work implements a reconfigurable multi-application processor for computational photography by exploring power reduction techniques at various design stages - algorithms, architectures and circuits. The algorithms are optimized to reduce the computational complexity and memory requirement. A parallel and pipelined architecture enables high throughput while operating at low frequencies, which allows real-time processing of HD images. Circuit design for low voltage operation ensures reliable performance down to 0.5 V. The reconfigurable hardware implementation performs HDR imaging, low-light enhanced imaging and glare reduction, as shown in Figure 3-1. The filtering engine can also be accessed from off-chip and used with other applications.

Figure 3-1: System block diagram for the reconfigurable computational photography processor, showing the application-specific preprocessing, the two bilateral filter engines (grid assignment, convolution engine and grid interpolation) and the postprocessing stages (HDR creation, contrast adjustment and correction) for HDR imaging, low-light enhancement and glare reduction.

The input images are pre-processed for the specific functions. The core of the processing unit is a pair of bilateral filter engines that operate in parallel and decompose an image into a low frequency base layer and a high frequency detail layer. Each bilateral filter engine uses further parallelism within it. The choice of two parallel engines is based on the throughput requirement for real-time processing and the amount of memory bandwidth available to keep all the engines active. The processor is able to access 8 pixels per clock cycle and each filtering engine is capable of processing 4 pixels per clock cycle. Bilateral filtering is performed using a bilateral grid structure [116] that converts an input image into a three dimensional data structure and filters it by convolving with a three dimensional Gaussian kernel. Parallel processing allows enhanced throughput while operating at low frequency and low voltage.
Parallel processing al- 79 3.1 Bilateral Filtering Preprocessing IF INF Weighted j - Average IG Grid Grid 'El Assignment Assignment '3 HDR Creation - _j Convolution IM Engine IHDR *' ITM Contrast IRG Adjustment IBF *Grid IBFInterpolation ILLE * Convolution Engine O Grid Interpolation hdo Correction Postprocessing Figure 3-1: System block diagram for the reconfigurable computational photography processor lows enhanced throughput while operating at low frequency and low voltage. The bilateral filtered images are post processed to generate the outputs for the specific functions. This chapter describes bilateral filtering and its efficient implementation using the bilateral grid. A scalable hardware architecture for the bilateral filter engine is described in Section 3.2. Implementation of HDR imaging, low-light enhancement and glare reduction using bilateral filtering is discussed in Section 3.3. The challenges of low voltage operation and approaches to address process variation are described in Section 3.4. The significance of architectural optimizations for reducing external memory bandwidth and power consumption - crucial to enhance the system energy-efficiency, is described in Section 3.5. Section 3.6 provides measurement results for the testchip. 3.1 Bilateral Filtering Bilateral filtering is a non-linear filtering technique that traces its roots to the non-linear Gaussian filters proposed in [31] for edge-preserving diffusion. It takes into account the difference in the pixel intensities as well as the pixel locations while assigning weights, as Reconfigurable Processor for Computational Photography 80 opposed to linear Gaussian filtering that assigns filter weights based solely on the pixel locations [91,92]. For an image I at pixel position p, the bilateral filtered output, 1B, is defined by eq. (3.1). N Gs(n) - G1 (I(p) - I(p - n)) - I(p - n) IB(P) = (3.1) n=-N where, N W (p)= 1 Gs (n) -G, (I(p) -I (p -n)) n=-N The output value at each pixel in the image is a weighted average of the values in a neighborhood, where the weight is the product of a Gaussian on the spatial distance (Gs) with standard deviation a, and a Gaussian on the pixel intensity/range difference (GI) with standard deviation a,. In linear Gaussian filtering, on the other hand, the weights are determined solely by the spatial term. In bilateral filtering, the range term GI(I(p) - I(p - n)) ensures that only those pixels in the vicinity that have similar intensities contribute significantly towards filtering. This avoids blurring across edges and results in an output that effectively reduces the noise while preserving the scene details. Figure 3-2 compares Gaussian filtering and bilateral filtering in reducing image noise and preserving details. However, non-linear filtering is inefficient and slow to implement because the filter kernel is spatially variant and needs to be recomputed for filtering every pixel. In addition, most computational photography applications require large filter kernels, 64 x 64 or more. A direct implementation of bilateral filtering can take several minutes to process HD images on a CPU. Faster approaches for bilateral filtering have been proposed. A separable approximation of the bilateral filter is proposed in [117] that speeds up processing and improves efficiency for applications that use small filter kernels, such as denoising. 
However, non-linear filtering is inefficient and slow to implement directly, because the filter kernel is spatially variant and must be recomputed for every pixel. In addition, most computational photography applications require large filter kernels, 64x64 or more. A direct implementation of bilateral filtering can take several minutes to process an HD image on a CPU. Faster approaches to bilateral filtering have been proposed. A separable approximation of the bilateral filter is proposed in [117] that speeds up processing and improves efficiency for applications that use small filter kernels, such as denoising. Optimization techniques have been proposed that reduce the processing time by filtering subsampled versions of the image with discrete intensity kernels and reconstructing the filtered results using linear interpolation [87,118]. A fast approach to bilateral filtering based on a box spatial kernel, which can be iterated to yield smooth spatial falloff, is proposed in [119]. However, real-time processing of HD images requires further speed-up.

3.1.1 Bilateral Grid

The bilateral grid structure for fast bilateral filtering is proposed in [116], where the processing complexity is reduced by down-sampling the image for filtering. To preserve details while down-sampling, a third intensity dimension is added, so that pixels with very different intensities within a block being down-sampled are assigned to different intensity levels, preserving the intensity differences. This results in a three dimensional structure. Creating a 3D bilateral grid and processing it requires a large amount of storage (65 MB for a 10 megapixel image). In this work, we implement bilateral filtering using a reconfigurable grid. To translate the grid structure efficiently into hardware, we convert it into a data structure. The storage requirement is reduced to 21.5 kB by scheduling the filtering engine tasks so that only two grid rows need to be stored at a time. The implementation is flexible to allow varying grid sizes for energy/resolution scalable image processing.

The bilateral grid structure used by this chip is constructed as follows (a code sketch of this construction is given after the list below). The input image is partitioned into blocks of size σ_s x σ_s pixels and a histogram of pixel intensity values is generated for each block. Each histogram has 256/σ_r bins, where each bin corresponds to an intensity level in the grid. This results in a 3D representation of the 2D image, as shown in Figure 3-3. Each grid cell (i, j, r) stores the number of pixels in the block corresponding to that intensity bin (W_{i,j,r}) and their summed intensity (I_{i,j,r}). To provide flexibility in grid creation and processing, the processor supports block sizes ranging from 16x16 to 128x128 pixels, with 4 to 16 intensity bins in the histogram.

Figure 3-3: Construction of a 3D bilateral grid from a 2D image. Each grid cell stores a histogram bin's pixel count and summed intensity.

The bilateral grid has two key advantages:

* Aggressive down-sampling: The size of the blocks (σ_s x σ_s) used while creating the grid and the number of intensity bins (256/σ_r) determine the amount by which the image is down-sampled. σ_s controls smoothing and σ_r controls the extent of edge preservation. Most computational photography applications only require a coarse grid resolution. The hardware implementation merges blocks of 16x16 to 128x128 pixels into 4 to 16 grid cells. This significantly reduces the number of computations required for processing as well as the amount of on-chip storage required.

* Built-in edge awareness: Two pixels that are spatially adjacent but have very different intensities end up far apart in the grid along the intensity dimension. When the grid is filtered level-by-level using a 3D linear Gaussian kernel, only intensity levels that are near each other influence the filtering, and levels that are far apart do not contribute to each other's filtering. Without any downsampling (σ_s = σ_r = 1), this operation is identical to performing bilateral filtering on the 2D image. Filtering a down-sampled grid using a 3D Gaussian kernel provides a good approximation to bilateral filtering the image for most computational photography applications.
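As referenced above, the following sketch (our own model of the grid assignment step, assuming a uint8 input whose dimensions are multiples of σ_s; the helper names are ours) builds the summed-intensity and weight arrays of Figure 3-3:

```python
import numpy as np

def build_grid(img, sigma_s=16, sigma_r=16):
    """img: 2D uint8 array, sides divisible by sigma_s.
    Returns (summed intensity, weight) arrays of shape
    (H/sigma_s, W/sigma_s, 256/sigma_r)."""
    h, w = img.shape
    bins = 256 // sigma_r
    gh, gw = h // sigma_s, w // sigma_s
    intensity = np.zeros((gh, gw, bins), dtype=np.int64)
    weight = np.zeros((gh, gw, bins), dtype=np.int64)
    bin_of = np.minimum(img // sigma_r, bins - 1)  # intensity bin per pixel
    for i in range(gh):
        for j in range(gw):
            blk = (slice(i * sigma_s, (i + 1) * sigma_s),
                   slice(j * sigma_s, (j + 1) * sigma_s))
            r = bin_of[blk].ravel()
            np.add.at(intensity[i, j], r, img[blk].ravel().astype(np.int64))
            np.add.at(weight[i, j], r, 1)
    return intensity, weight
```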
3.2 Bilateral Filter Engine

The intensity levels in the bilateral grid can be processed in parallel. This enables a highly parallel architecture, in which the 256/σ_r intensity levels are created, filtered and interpolated in a parallel and pipelined manner. The bilateral filter engine using the bilateral grid is implemented as shown in Figure 3-4. It consists of three components: the grid assignment engine, the grid filtering engine and the grid interpolation engine. The spatial and intensity down-sampling factors, σ_s and σ_r, are programmed by the user at the start of processing. The image is scanned pixel by pixel in a block-wise manner. The size of the block is scalable from 16x16 pixels (σ_s = 16) to 128x128 pixels (σ_s = 128). Depending on the intensity of the input pixel, it is assigned to one of the intensity bins. The number of intensity bins is also scalable, from 4 (σ_r = 64) to 16 (σ_r = 16). As the data structure is stored on-chip, the different intensity levels in the grid can be processed in parallel.

Figure 3-4: Architecture of the bilateral filtering engine. Grid scalability is achieved by gating processing engines and SRAM banks.

3.2.1 Grid Assignment

The pixels are assigned to the appropriate grid cells by the grid assignment engines. The hardware has 16 Grid Assignment (GA) engines that can operate in parallel to process 16 intensity levels in the grid, but 4, 8 or 12 grid assignment engines can be activated if the grid uses fewer intensity levels. Figure 3-5 shows the architecture of the grid assignment engine. For each pixel in each block, its intensity is compared with the boundaries of the intensity bins using digital comparators. If the pixel intensity is within a bin's boundaries, it is assigned to that intensity bin. The intensities of all the pixels assigned to a bin are summed by an accumulator, and a weight counter maintains the count of the number of pixels assigned to the bin. Both the summed intensity and the weight are stored for each bin in on-chip memory.

Figure 3-5: Architecture of the grid assignment engine.

3.2.2 Grid Filtering

The Convolution (Conv) engine, shown in Figure 3-6, convolves the grid intensities and weights with a 3x3x3 Gaussian kernel, which is equivalent to bilateral filtering in the image domain, and returns the normalized intensity. The convolution is performed by multiplying the 27 coefficients of the filter kernel with the 27 grid cells and adding the products using a 3-stage adder tree. The intensity and weight are convolved in parallel, and the convolved intensity is normalized by the convolved weight using a fixed point divider to ensure that there is no intensity scaling during filtering. The filter coefficients are programmable, enabling filtering operations of different types, including non-separable filters, to be performed using the same reconfigurable hardware. The coefficients are programmed by the user at the beginning of processing; otherwise the default 3x3x3 Gaussian kernel is used. The hardware has 16 convolution engines that can operate in parallel to filter a grid with 16 intensity levels, but 4, 8 or 12 of them can be activated if fewer intensity levels are used in the grid.

Figure 3-6: Architecture of the convolution engine for grid filtering.
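A behavioral model of the convolution engine is sketched below (our own reference code, not the RTL; the hardware uses a programmable 27-coefficient kernel, a 3-stage adder tree and a fixed point divider, whereas the sketch uses floating point and a default separable Gaussian-like kernel for clarity):

```python
import numpy as np

def filter_grid(intensity, weight):
    """Convolve summed intensity and weight with a 3x3x3 kernel and
    normalize, approximating bilateral filtering in the grid domain."""
    k1 = np.array([1.0, 2.0, 1.0])                 # default Gaussian-like taps
    kernel = k1[:, None, None] * k1[None, :, None] * k1[None, None, :]
    pad_i = np.pad(intensity.astype(float), 1, mode="edge")
    pad_w = np.pad(weight.astype(float), 1, mode="edge")
    gh, gw, bins = intensity.shape
    out = np.zeros((gh, gw, bins))
    for i in range(gh):
        for j in range(gw):
            for r in range(bins):
                wi = (kernel * pad_i[i:i+3, j:j+3, r:r+3]).sum()
                ww = (kernel * pad_w[i:i+3, j:j+3, r:r+3]).sum()
                out[i, j, r] = wi / ww if ww > 0 else 0.0  # normalize
    return out
```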
The convolution is performed by multiplying the 27 coefficients of the filter kernel with the corresponding 27 grid cells and summing the products using a 3-stage adder tree. The intensity and weight are convolved in parallel, and the convolved intensity is normalized by the convolved weight using a fixed-point divider, ensuring that there is no intensity scaling during filtering. The filter coefficients are programmable, which allows different filtering operations, including non-separable filters, to be performed on the same reconfigurable hardware. The coefficients are programmed by the user at the beginning of processing; otherwise the default 3 × 3 × 3 Gaussian kernel is used. The hardware has 16 convolution engines that can operate in parallel to filter a grid with 16 intensity levels; only 4, 8 or 12 of them are activated if fewer intensity levels are used in the grid.

Figure 3-6: Architecture of the convolution engine for grid filtering.

3.2.3 Grid Interpolation

The interpolation engine, shown in Figure 3-7, reconstructs the filtered 2D image from the filtered grid. The filtered intensity value at pixel (x, y) is obtained by trilinear interpolation of the 2 × 2 × 2 filtered grid values surrounding the location (x/σs, y/σs, I_{x,y}/σr). Trilinear interpolation is equivalent to performing linear interpolations independently across each of the three dimensions of the grid. To meet throughput requirements, the interpolation engine is implemented as three pipelined stages of linear interpolations. The output value I_BF(x, y) is calculated from the filtered grid values F using four parallel linear interpolations along the i dimension, given by eq. (3.2):

F^{j,r} = F_{i,j,r} · w_i + F_{i+1,j,r} · w_{i+1}
F^{j+1,r} = F_{i,j+1,r} · w_i + F_{i+1,j+1,r} · w_{i+1}
F^{j,r+1} = F_{i,j,r+1} · w_i + F_{i+1,j,r+1} · w_{i+1}
F^{j+1,r+1} = F_{i,j+1,r+1} · w_i + F_{i+1,j+1,r+1} · w_{i+1}    (3.2)

followed by two parallel linear interpolations along the j dimension, given by eq. (3.3):

F^r = F^{j,r} · w_j + F^{j+1,r} · w_{j+1}
F^{r+1} = F^{j,r+1} · w_j + F^{j+1,r+1} · w_{j+1}    (3.3)

followed by an interpolation along the r dimension, given by eq. (3.4):

I_BF(x, y) = F^r · w_r + F^{r+1} · w_{r+1}    (3.4)

The interpolation weights, given by eq. (3.5), are computed from the output pixel location (x, y), the intensity I_{x,y} of the original input pixel at that location, and the grid cell index (i, j, r):

w_i = i + 1 − x/σs ;  w_{i+1} = x/σs − i
w_j = j + 1 − y/σs ;  w_{j+1} = y/σs − j
w_r = r + 1 − I_{x,y}/σr ;  w_{r+1} = I_{x,y}/σr − r    (3.5)

The pixel location (x, y) and the grid cell index (i, j, r) are maintained in internal counters. The original pixel intensity I_{x,y} is read from the DRAM in chunks of 32 pixels per read request to fully utilize the memory bandwidth.

Figure 3-7: Architecture of the interpolation engine. Trilinear interpolation is implemented as three pipelined stages of linear interpolations.

The assigned and filtered grid cells are stored in on-chip memory. The last three assigned blocks are stored in a temporary buffer and the two previous rows of grid blocks are stored in the SRAM. The last two filtered blocks are stored in the temporary buffer and one filtered grid row is stored in the SRAM. The on-chip SRAM can store up to 256 blocks per row with 16 intensity levels.
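Continuing the behavioral model started above, the sketch below filters the grid and reconstructs one output pixel following eqs. (3.2)-(3.5). It is again illustrative: the kernel construction, helper names and NumPy/SciPy formulation are our assumptions, standing in for the adder trees and pipelined interpolation stages of the chip.

```python
import numpy as np
from scipy.ndimage import convolve

def gaussian_kernel_3d(size=3, sigma=1.0):
    ax = np.arange(size) - size // 2
    g = np.exp(-ax**2 / (2 * sigma**2))
    k = np.einsum('i,j,k->ijk', g, g, g)   # separable 3D Gaussian
    return k / k.sum()

def filter_grid(I, W, kernel=None):
    """Convolve summed intensities and weights with a 3x3x3 kernel,
    then normalize so that filtering does not scale intensity."""
    k = gaussian_kernel_3d() if kernel is None else kernel
    If = convolve(I, k, mode='nearest')    # boundary cells replicated
    Wf = convolve(W, k, mode='nearest')
    return If / np.maximum(Wf, 1e-9)       # normalized intensity

def interpolate_pixel(Fg, image, x, y, sigma_s=16, sigma_r=16):
    """Trilinear interpolation of the filtered grid Fg at
    (x/sigma_s, y/sigma_s, I[y, x]/sigma_r), per eqs. (3.2)-(3.5)."""
    gx, gy, gr = x / sigma_s, y / sigma_s, image[y, x] / sigma_r
    i, j, r = int(gx), int(gy), int(gr)
    wi1, wj1, wr1 = gx - i, gy - j, gr - r   # w_{i+1}, w_{j+1}, w_{r+1}
    wi, wj, wr = 1 - wi1, 1 - wj1, 1 - wr1   # w_i, w_j, w_r
    j1 = min(j + 1, Fg.shape[0] - 1)         # clamp at grid boundaries
    i1 = min(i + 1, Fg.shape[1] - 1)
    r1 = min(r + 1, Fg.shape[2] - 1)
    # stage 1: four interpolations along i; stage 2: two along j;
    # stage 3: one along r -- mirroring the three pipeline stages
    f_jr   = Fg[j,  i, r ] * wi + Fg[j,  i1, r ] * wi1
    f_j1r  = Fg[j1, i, r ] * wi + Fg[j1, i1, r ] * wi1
    f_jr1  = Fg[j,  i, r1] * wi + Fg[j,  i1, r1] * wi1
    f_j1r1 = Fg[j1, i, r1] * wi + Fg[j1, i1, r1] * wi1
    f_r  = f_jr  * wj + f_j1r  * wj1
    f_r1 = f_jr1 * wj + f_j1r1 * wj1
    return f_r * wr + f_r1 * wr1
```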
3.2.4 Memory Management

The grid processing tasks are scheduled to minimize local storage requirements and memory traffic. Figure 3-8 shows the memory management scheme based on task scheduling. Grid processing is performed cell by cell in a row-wise manner. The last three blocks are stored in the temporary buffer and the last two rows are stored in the SRAM. Once a 3 × 3 × 3 neighborhood of assigned blocks is available, the convolution engine begins filtering the grid. When block A, shown in Figure 3-8, is being assigned, the convolution engine is filtering block F. As filtering proceeds to the next block in the row, the first assigned block, stored in the SRAM, becomes redundant and is replaced by the first assigned block in the temporary buffer. The last two filtered blocks are stored in the temporary buffer and the previous row of filtered blocks is stored in the SRAM. As 2 × 2 × 2 neighborhoods of filtered blocks become available, the interpolation engine begins reconstructing the output 2D image. When block F, shown in Figure 3-8, is being filtered, the interpolation engine is reconstructing the output 2D image from block I. As interpolation proceeds to the next block in the row, the first filtered block, stored in the SRAM, becomes redundant and is replaced by the first filtered block in the temporary buffer. Boundary rows and columns are replicated for processing boundary cells.

Figure 3-8: Memory management by task scheduling.

This scheduling scheme allows processing without storing the entire grid: only two assigned grid rows and one filtered grid row need to be stored locally at a time. Memory management reduces the on-chip memory requirement to 21.5 kB for processing a 10 megapixel image and allows grids of arbitrary height to be processed using the same amount of on-chip memory.

3.2.5 Scalable Grid

Energy efficiency is the key concern in processing on mobile platforms. The ability to trade off computational quality for energy is highly desirable, making algorithm structures and systems that enable this trade-off extremely useful to explore [120]. A user performing computational photography on a mobile device might choose to trade off output resolution for energy, depending on the current state of the battery and the energy requirement of the task. This trade-off could also be made based on the intended usage of the image. For example, if the output image is intended for social media or web-based applications, a lower resolution, such as 2 megapixel, might be most appropriate, whereas for generating high-quality prints the user would want the highest resolution possible. This makes an architecture that enables energy-scalable processing extremely valuable.

We develop an architecture that enables the energy vs. quality trade-off by scaling the size of the bilateral grid to support the desired output resolution. The size of the grid is determined by the image size and the downsampling factors. For an image of size I_W × I_H pixels with spatial and intensity/range downsampling factors σs and σr respectively, the grid width (G_W) and height (G_H) are given by eq.
(3.6) and the number of grid cells (N_G) is given by eq. (3.7):

G_W = I_W / σs ;  G_H = I_H / σs    (3.6)

N_G = G_W × G_H × (256 / σr)    (3.7)

The number of computations as well as the storage depends directly on the size of the grid. Setting the downsampling factors equal to the standard deviations of the spatial and intensity/range Gaussians in the bilateral filter (eq. (3.1)) provides a good trade-off between output quality and processing complexity. The choice of downsampling factors is guided by the image content and the application. Most applications work well with a coarse grid resolution on the order of 32 pixels with 8 to 12 intensity bins. If the image has high spatial detail, a smaller σs results in better preservation of those details in the output. Similarly, a smaller σr helps preserve fine intensity details. The grid size is configured by adjusting σs from 16 to 128, which scales the block size from 16 × 16 to 128 × 128 pixels, and σr from 16 to 64, which scales the number of intensity levels from 16 to 4. For a 10 megapixel (4096 × 2592) image, the number of grid cells scales from 663552 (σs = 16, σr = 16) to 2592 (σs = 128, σr = 64).

The architecture achieves energy scalability by activating only the required number of hardware units for a given grid resolution. The 21.5 kB of on-chip SRAM is used to store two rows of created grid cells and one row of filtered grid cells. The SRAM is implemented as 8 banks supporting a maximum of 256 cells in each row of the grid with 16 intensity levels, corresponding to the worst case of σs = 16, σr = 16. Each bank is power gated to save energy when a lower resolution grid is used. Only one bank is used when σs = 128 and all 8 banks are used when σs = 16. The bilateral filter engine achieves scalability by activating only the required number of processing engines and SRAM banks, and power gating the remaining engines and memory banks, for the desired grid resolution.
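As a quick check of the scaling numbers quoted above, a few lines suffice to evaluate eqs. (3.6)-(3.7) for the extreme grid configurations (the helper is ours, for illustration):

```python
def grid_cells(iw, ih, sigma_s, sigma_r):
    """Number of bilateral-grid cells per color channel, eqs. (3.6)-(3.7)."""
    levels = 256 // sigma_r                      # intensity levels, 4..16
    return (iw * ih) // (sigma_s ** 2) * levels  # (I_W/ss)(I_H/ss)(256/sr)

# 10 megapixel image: finest vs. coarsest grid configuration
print(grid_cells(4096, 2592, 16, 16))    # 663552 cells
print(grid_cells(4096, 2592, 128, 64))   # 2592 cells
```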
3.3 Applications

The testchip has two bilateral filter engines, each processing 4 pixels/cycle. The processor performs HDR imaging, low-light enhanced imaging and glare reduction using the bilateral filter engines.

3.3.1 High Dynamic Range Imaging

The range of intensities captured in an image is limited by the resolution of the image sensor. Typically, image sensors use 8 bits/pixel, which limits the dynamic range of intensities captured in an image to 256:1. The range of intensities we encounter in the real world, on the other hand, spans 5 to 6 orders of magnitude. HDR imaging is a technique for capturing a greater dynamic range between the brightest and darkest regions of an image than a traditional digital camera can. Multiple images of the same scene are captured with varying exposure levels, such that the low-exposure images capture the bright regions of the scene without loss of detail and the high-exposure images capture the dark regions. These differently exposed images are then combined into a high dynamic range image that more faithfully represents the brightness values in the scene.

HDR Creation

The first step in HDR imaging is to create a composite HDR image from multiple differently exposed images, representing the true scene radiance value at each pixel [121]. The true scene radiance value at each pixel is recovered from the recorded intensity I and the exposure time Δt as follows. The exposure E is defined as the product of the sensor irradiance R (the amount of light hitting the camera sensor, which is proportional to the scene radiance) and the exposure time Δt. The intensity I is a nonlinear function of the exposure E, given by eq. (3.8):

I = f(E) = f(R × Δt)    (3.8)

We can then obtain the sensor irradiance as given by eq. (3.9), where g = log f⁻¹:

log(R) = g(I) − log(Δt)    (3.9)

The mapping g is known as the camera curve [121]. Figure 3-9 shows the camera curves for the RGB color channels of a typical camera sensor.

Figure 3-9: Camera curves that map the pixel intensity values onto the incident exposure.

The HDR creation module, shown in Figure 3-10, takes the values of a pixel from three different exposures (I_E1, I_E2, I_E3) and generates an output pixel representing the true scene radiance value (I_HDR) at that location. Since we are working with a finite range of discrete pixel values (8 bits/color), the camera curves are stored as combinational look-up tables (LUTs) to enable fast access. The true (log) exposure values are obtained from the pixel intensities using the camera curves, followed by exposure time correction to obtain the (log) scene radiance. The three resulting (log) radiance values represent the radiance of the same location in the scene. A weighted average of these three values is taken to obtain the final (log) radiance value. The weighting function gives a higher weight to the exposures in which the pixel value is closer to the middle of the response function, avoiding high contributions from images where the pixel value is saturated. Finally, an exponentiation is performed to obtain the final radiance value (16 bits/pixel/color).

Figure 3-10: HDR creation module.

Processing is performed in the log domain for two reasons. First, the human visual system responds to ratios of intensities rather than absolute differences, and ratios are represented effectively in the log domain. Second, it simplifies the computations to additions and subtractions instead of multiplications and divisions.
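To make the HDR creation data path concrete, here is a minimal software sketch of eqs. (3.8)-(3.9) followed by the weighted average. The camera-curve LUT g and the hat-shaped weighting function are stand-ins for illustration; the chip stores g as combinational LUTs per color channel, and the exact on-chip weighting function is not reproduced here.

```python
import numpy as np

def hdr_create(exposures, dts, g):
    """Merge differently exposed 8-bit images into one radiance map.

    exposures: list of 2D uint8 arrays (one color channel)
    dts:       exposure times for each image
    g:         256-entry LUT mapping pixel value -> log exposure (eq. 3.9)
    """
    log_rad = np.zeros(exposures[0].shape)
    weight_sum = np.zeros_like(log_rad)
    for img, dt in zip(exposures, dts):
        # hat weighting: favor mid-range pixels, suppress saturated ones
        w = 1.0 - np.abs(img.astype(float) - 127.5) / 127.5
        log_rad += w * (g[img] - np.log(dt))   # eq. (3.9), per pixel
        weight_sum += w
    log_rad /= np.maximum(weight_sum, 1e-6)
    return np.exp(log_rad)                     # back to linear radiance
```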
Tone Mapping

High dynamic range images (16 bits/pixel/color) cannot be properly displayed on low dynamic range media (8 bits/pixel/color), which constitute almost all displays in common use. Figure 3-11 shows how an HDR image appears on a Low Dynamic Range (LDR) display when simply scaled from 16 bits/pixel/color to 8 bits/pixel/color. Properly preserving the dynamic range captured in the HDR image while displaying it on LDR media requires tone mapping, which compresses the image dynamic range through contrast adjustment [87,94,122-124]. In this work, we leverage the local contrast adjustment based tone-mapping approach proposed in [87] and implement two-stage decomposition [125,126] using bilateral filtering in hardware.

Figure 3-11: HDR image scaled to 8 bits/pixel/color for display on LDR media. (HDR radiance map courtesy Paul Debevec [121].)

Figure 3-12 shows the processing flow for HDR imaging, including HDR creation and tone mapping. The 16 bit/pixel/color HDR image is split into intensity and color channels. A low-frequency base layer is created by bilateral filtering the HDR intensity in the log domain, and a high-frequency detail layer is created by subtracting the base layer from the log intensity (equivalent to division in the linear domain). The dynamic range of the base layer is compressed by a scaling factor in the log domain. The scaling factor is user programmable to control the base contrast and achieve a desired look for the image; by default, a scaling factor of 5 is used. The detail layer is left untouched to preserve the details, and the colors are scaled linearly to 8 bits/pixel/color. Merging the compressed base layer, the detail layer and the color channels results in a tone-mapped HDR image (I_TM). Figure 3-13 shows the tone-mapped version of the image shown in Figure 3-11.

Figure 3-12: Processing flow for HDR creation and tone-mapping for displaying HDR images on LDR media.

Figure 3-13: Tone-mapped HDR image. (HDR radiance map courtesy Paul Debevec [121].)

Figure 3-14 shows the hardware configuration for HDR imaging. The hardware performs HDR imaging by activating the HDR Create module for pre-processing, which merges three LDR exposures into one 16 bit/pixel/color HDR image, and the Contrast Adjustment module for post-processing, which performs contrast scaling and merging of the intensity and color data. Both bilateral grids are configured to perform filtering in an interleaved manner, with each grid processing alternate blocks in parallel. The processor also preserves the 16 bit/pixel/color HDR image in external memory, where it can be tone-mapped by a different software or hardware implementation. Figure 3-15 shows a set of input low dynamic range exposures that capture different ranges of intensities in the scene, together with the tone-mapped HDR output image.

Figure 3-14: Processor configuration for HDR imaging.

Figure 3-15: Input low dynamic range images: (a) under-exposed image, (b) normally exposed image, (c) over-exposed image. Output image: (d) tone-mapped HDR image.
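The tone-mapping flow just described can be sketched in a few lines. Here `bilateral_filter` is a placeholder for the grid pipeline of Section 3.2, and the final normalization is a simplification of the chip's linear color scaling; both are our assumptions for illustration.

```python
import numpy as np

def tone_map(hdr_intensity, bilateral_filter, base_scale=5.0):
    """Two-stage decomposition tone mapping in the log domain.

    hdr_intensity:    16-bit/pixel linear HDR intensity channel
    bilateral_filter: callable implementing Section 3.2's grid pipeline
    base_scale:       base contrast adjustment factor (default 5)
    """
    log_i = np.log(hdr_intensity + 1e-6)
    base = bilateral_filter(log_i)       # low-frequency base layer
    detail = log_i - base                # detail layer (log-domain subtraction)
    compressed = base / base_scale       # compress base-layer dynamic range
    out = np.exp(compressed + detail)    # recombine, leave the log domain
    return out / out.max() * 255.0       # scale to 8 bits for LDR display
```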
3.3.2 Glare Reduction

Images captured with a bright light source in or near the field of view are significantly affected by glare, which reduces contrast and color vibrance and often leads to loss of scene detail due to pixel saturation. The effect of veiling glare on HDR image capture and display is measured in [127]. An approach to quantifying veiling glare and related optical artifacts, and reducing glare through deconvolution by a measured glare spread function, is proposed in [128]. Glare removal in HDR imaging, by estimating a global glare spread function for a scene based on fitting a radially symmetric polynomial to the fall-off of light around bright pixels, is proposed in [129]. Glare is modeled as a 4D ray-space phenomenon in [130], where an approach to remove glare by outlier rejection in ray-space is proposed.

In this work, we address the effects of glare by improving the contrast and enhancing the colors of the captured image. This process is similar to a single-image HDR tone-mapping operation, except that the contrast is increased instead of compressed. The programmability of the contrast adjustment module, shown in Figure 3-16, allows us to achieve this by simply using a different contrast adjustment factor than the one used for HDR imaging.

Figure 3-16: Contrast adjustment module. Contrast is increased or decreased depending on the adjustment factor.

Figure 3-17 shows the processing flow for glare reduction. The input image is split into intensity and color channels. A low-frequency base layer and a high-frequency detail layer are created by bilateral filtering the intensity. The contrast of the base layer is enhanced using the contrast adjustment module, the same module used in HDR tone mapping. The adjustment factor is user programmable to achieve the desired look for the output image; an adjustment factor of 0.25 is used as the default for glare reduction. The scaled color data is merged with the contrast-enhanced base layer and the detail layer to obtain a glare-reduced output image. Figure 3-18 shows the processor configuration for glare reduction. Figure 3-19 shows an input image with glare and the glare-reduced output image. Glare reduction recovers details that are washed out in the original image and enhances the image colors and contrast.

Figure 3-17: Processing flow for glare reduction.

Figure 3-18: Processor configuration for glare reduction.

Figure 3-19: (a) Input image with glare. (b) Output image with reduced glare.
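Because glare reduction reuses the tone-mapping data path, the `tone_map` sketch above applies directly with a different adjustment factor. Assuming, as in that sketch, that the module divides the log-domain base layer by the programmed factor (an assumption about the factor's semantics, not confirmed by the text), the default factor of 0.25 expands the base contrast by 4x:

```python
# Glare reduction with the tone_map sketch above: a factor below 1
# expands the base-layer contrast instead of compressing it.
glare_reduced = tone_map(input_intensity, bilateral_filter, base_scale=0.25)
```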
3.3.3 Low-Light Enhanced Imaging

Photography in low-light situations is challenging due to a number of conflicting requirements. Capturing dynamic scenes without blurring requires short exposure times, but the inadequate amount of light entering the image sensor in this short duration results in images that are dark, noisy and lacking in detail. A possible solution is to use a flash to add artificial light to the scene. This addresses the problems of brightness, noise and lack of detail while enabling short exposure times that avoid blurring. However, using the flash defeats the original purpose of creating a realistic representation of the scene. The artificial light of the flash destroys the natural scene ambience: it makes objects near the flash appear extremely bright, while objects beyond the range of the flash appear very dark. In addition, it introduces unpleasant artifacts due to flash shadows.

Combining the information captured in images of the same scene taken in quick succession with flash (high detail and low noise) and without flash (natural scene ambience) provides a way to avoid the limitations of either image alone. Using flash and no-flash image pairs to estimate the ambient illumination, and using that information for color balancing, is proposed in [131]. Creating enhanced images by processing a stack of registered images of the same scene is proposed in [132]; this approach allows users to combine multiple images captured under varying lighting conditions and blend them in a spatially varying manner. Acquiring multiple images with different levels of flash intensity, including no flash, and subsequently adjusting the flash level by linearly interpolating among these images is proposed in [133]. Images of the same scene captured with different aperture and focal-length settings, but not with different flash settings, are merged by interpolating between the settings in [134]. Approaches for synthetically relighting scenes using sequences of images captured under varying lighting conditions have also been proposed [135-137].

In this work, we implement an approach for low-light enhancement, similar to the approaches proposed in [138] and [139], that merges two images captured in quick succession, one taken without flash (I_NF) and one with flash (I_F). The main difference between our approach and [138,139] lies in the flash shadow treatment and how it affects the overall filtering operation. The large-scale features in an image can be considered representative of the scene illumination, and the details representative of the scene albedo [140]. Both the flash and no-flash images are decomposed into a large-scale base layer and a high-frequency detail layer through bilateral filtering. To preserve the natural scene ambience in the output, the large-scale base layer from the no-flash image is selected. This layer is merged with the detail layer from the flash image to achieve high detail and low noise in the output. However, flash shadows need to be considered during the merging process and treated to avoid artifacts in the final output.

The approach in [138] assumes that regions in flash shadow should appear exactly the same in both the flash and no-flash images, so any regions where the intensity differences between the flash and no-flash images are small are treated as shadow regions. A shadow mask representing such regions is created, and details from the flash detail layer are only added to the no-flash base layer in regions not covered by the mask. This avoids flash shadows, but regions that are farther away from the flash and do not receive sufficient illumination are also detected as shadows. Since no details are added in these regions, large areas of the image often appear smooth and lacking in detail. The approach in [139] makes a similar assumption to detect flash shadows: the regions where the intensity differences between the flash and no-flash images are lowest are considered the umbra of the flash shadows, and the gradients at the flash shadow boundaries are then analyzed to determine the penumbra regions. The shadow mask, consisting of the umbra and penumbra regions, is then used to exclude shadow pixels from bilateral filtering.
In this scheme, while filtering a pixel, only the pixels in its neighborhood that are outside the shadow region are used; pixels in the shadow region receive no weight. This approach also assigns colors from the flash image to the final output. For the shadow regions, local color correction is performed by copying colors from illuminated regions of the flash image. Since this approach requires a specialized type of bilateral filtering that takes the shadow mask into account, it cannot easily be implemented using the bilateral grid.

To address these challenges, we took an algorithm/architecture co-design approach and developed a technique that decouples the core filtering operation from the shadow correction operation. This enables us to perform bilateral filtering efficiently using the bilateral grid and to correct for flash shadows as a post-processing step. Figure 3-20 shows the processing flow for low-light enhancement. The RGB color channels are processed independently and merged at the end to generate the final output.

Figure 3-20: Processing flow for low-light enhancement.

Figure 3-21 shows the processor configuration for low-light enhancement. The bilateral grid is used to decompose both images into base and detail layers. The scene ambience is captured in the base layer of the no-flash image and the details are captured in the detail layer of the flash image. In this mode, one bilateral filter engine is configured to perform bilateral filtering on the flash image and the other to perform cross-bilateral filtering on the no-flash image using the flash image, given by eq. (3.10): the location of the grid cell is determined by the flash image and the intensity value is determined by the no-flash image.

Figure 3-21: Processor configuration for low-light enhancement.

I_CB(p) = (1 / W(p)) Σ_{n=−N}^{N} G_s(n) · G_I(I_F(p) − I_F(p − n)) · I_NF(p − n)    (3.10)

where

W(p) = Σ_{n=−N}^{N} G_s(n) · G_I(I_F(p) − I_F(p − n))    (3.11)
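For reference, a direct (unaccelerated) software reading of eqs. (3.10)-(3.11) is sketched below; the chip approximates this operation with the bilateral grid rather than evaluating it per pixel. The kernel widths, and the assumption of 8-bit intensities, are illustrative.

```python
import numpy as np

def cross_bilateral(I_nf, I_f, N=8, sigma_s=4.0, sigma_r=25.0):
    """Cross-bilateral filter per eqs. (3.10)-(3.11): spatial and
    edge-stopping weights come from the flash image I_f, while the
    filtered values come from the no-flash image I_nf."""
    h, w = I_nf.shape
    out = np.zeros((h, w))
    ys, xs = np.mgrid[-N:N+1, -N:N+1]
    Gs = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))   # spatial kernel
    pad = lambda a: np.pad(a.astype(float), N, mode='edge')
    Ipf, Ipn = pad(I_f), pad(I_nf)
    for y in range(h):
        for x in range(w):
            win_f = Ipf[y:y+2*N+1, x:x+2*N+1]          # flash neighborhood
            win_n = Ipn[y:y+2*N+1, x:x+2*N+1]          # no-flash neighborhood
            Gi = np.exp(-(float(I_f[y, x]) - win_f)**2 / (2 * sigma_r**2))
            wgt = Gs * Gi                               # eq. (3.11)
            out[y, x] = (wgt * win_n).sum() / wgt.sum() # eq. (3.10)
    return out
```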
Shadow Correction

A shadow correction module merges the details from the flash image with the base layer of the cross-bilateral filtered no-flash image, and corrects for flash shadows to avoid artifacts in the output image. The shadow correction algorithm was developed in collaboration with Srikanth Tenneti. Instead of detecting the flash shadows and attempting to avoid them while adding details, we create a mask representing regions with high detail in the scene. This is done by detecting edges that appear in the bilateral filtered no-flash image, which preserves the scene details but avoids spurious edges due to noise. Figure 3-22 shows the mask generation process. Gradients are computed at each pixel over blocks of 4 × 4 pixels. If the gradient at a pixel is higher than the average gradient of its block, the pixel is marked as an edge pixel. This results in a binary mask that highlights all the strong edges in the scene but no false edges due to flash shadows. The details from the flash image are added to the filtered no-flash image, as shown in Figure 3-23, only in the regions represented by the mask. A linear filter is used to smooth the mask to ensure that the resulting image does not have discontinuities. This implementation of the shadow correction module handles shadows effectively and produces low-light enhanced images without artifacts.

Figure 3-22: Generating a mask representing regions with high scene details.

Figure 3-23: Merging flash and no-flash images with shadow correction.

Figure 3-24 shows a set of flash and no-flash images, the no-flash base layer from bilateral filtering, the flash detail layer, the edge mask created using the process described in Figure 3-22, and the low-light enhanced output image. Figure 3-25 shows a set of flash and no-flash images and the low-light enhanced output image. The enhanced output effectively reduces noise while preserving the natural look and scene details and avoiding artifacts due to flash shadows. Figure 3-26 compares the output from our approach with those from [138] and [139]. Our approach and the approach in [138] use colors from the no-flash image for the final output; the approach in [139] uses colors from the flash image. Our approach achieves output quality comparable to the previous approaches, as indicated by the difference images. Decoupling the shadow correction process from the core bilateral filtering process enables efficient implementation using the bilateral grid.

Figure 3-24: (a) Image with flash, (b) image without flash, (c) no-flash base layer, (d) flash detail layer, (e) edge mask, (f) low-light enhanced output.

Figure 3-25: Input images: (a) image with flash, (b) image without flash. Output image: (c) low-light enhanced image.

Figure 3-26: Comparison of image quality from the proposed approach with that of [138] and [139]. (a) Output from our approach, (b) output from [138], (c) output from [139], (d) difference image between (a) and (b), amplified 5x, (e) difference image between (a) and (c), amplified 5x.
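The mask generation of Figure 3-22 can be sketched as follows. The L1 gradient magnitude and the NumPy formulation are our assumptions for illustration, and the subsequent linear smoothing filter is omitted.

```python
import numpy as np

def detail_mask(base_nf, block=4):
    """Binary mask of high-detail regions, per Figure 3-22: a pixel is an
    edge pixel if its gradient magnitude exceeds the mean gradient of its
    4x4 block. Computed on the bilateral-filtered no-flash base layer, so
    noise and flash-shadow edges are excluded."""
    gy, gx = np.gradient(base_nf.astype(float))
    g = np.abs(gx) + np.abs(gy)            # gradient magnitude (L1, assumed)
    h, w = g.shape
    mask = np.zeros((h, w), dtype=np.uint8)
    for y in range(0, h - h % block, block):       # full blocks only
        for x in range(0, w - w % block, block):
            blk = g[y:y+block, x:x+block]
            mask[y:y+block, x:x+block] = (blk > blk.mean()).astype(np.uint8)
    return mask
```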
3.4 Low-Voltage Operation

In addition to algorithm optimizations and highly parallel processor architectures, the third key component of energy-efficient system design is implementing such architectures using ultra-low power circuits. The energy consumed by a digital circuit can be minimized by operating at the optimal VDD, which requires the ability to operate at low voltage [17,38,141,142].

3.4.1 Statistical Design Methodology

In this work, we use a statistical timing analysis approach, similar to the Operating Point Analysis (OPA) based statistical design methodology outlined in Section 2.3, to ensure reliable low-voltage operation. One important difference in the approach, however, is that the transistor random variables corresponding to local variations, also known as mismatch parameters, were not available from the foundry for the 40 nm CMOS library used in this design. In the absence of mismatch parameters, we used the global corner delays, which model the impact of global variations, to estimate the impact of local variations. The typical corner delay provides the nominal delay for a standard cell, while the best and worst corner delays are used to model the −3σ and +3σ global corner delays respectively. At low voltage, the impact of local variations is comparable to that of global variations [81]. As a result, we use the standard deviation (σ) obtained from the global corner delays to model the impact of local variations as a Gaussian Probability Density Function (PDF) with the mean delay given by the global corner delay. A subset of the standard cells from the 40 nm CMOS logic library is analyzed in this manner to model the impact of variations at 0.5 V. These models of standard cell variations are then used to perform setup/hold analysis for the timing paths in the processor.

The setup/hold timing closure for the processor, with a 3σ performance requirement at 0.5 V, is performed using the OPA based approach. The PDF of delay at 0.5 V for a representative path from the design, computed using the models of standard cell variations described above, is shown in Figure 3-27. The global corner delay for this path is 21.9 ns; after accounting for local variations, however, OPA estimates the 3σ delay to be 36.1 ns. Note that even if the standard cell delay PDFs are modeled as Gaussians, the timing path delay PDF can be non-Gaussian.

Figure 3-27: Delay PDF of a representative timing path from the computational photography processor at 0.5 V. The STA estimate of the global corner delay is 21.9 ns; the 3σ delay estimate using OPA is 36.1 ns.

Table 3.1 shows statistics on the number of paths analyzed for both setup and hold analysis of the chip. Setup/hold fixing using OPA ensures that cells that are very sensitive to VT variations are not used in the critical paths. This helps improve the 3σ performance at 0.5 V by 32%, from 17 MHz to 25 MHz. The OPA analysis for timing paths ensures reliable functionality at 0.5 V.

Table 3.1: Setup/Hold Timing Analysis at 0.5 V

Setup Analysis @ 25 MHz
| Phase | Data Path | Clock Path | Paths Analyzed | Worst Slack | % Fail |
| 1 | STA (+3σ) | STA (−3σ) | 95k | −10.7 ns | 3.6% |
| 2 | STA (+3σ) | OPA | 3.4k | −2.9 ns | 1.5% |
| 3 | OPA | OPA | 52 | −0.05 ns | 13.4% |
Paths requiring fixing (before timing closure): 7

Hold Analysis
| Phase | Data Path | Clock Path | Paths Analyzed | Worst Slack | % Fail |
| 1 | STA (−3σ) | STA (+3σ) | 95k | −8.2 ns | 2.8% |
| 2 | STA (−3σ) | OPA | 2.7k | −1.8 ns | 2.4% |
| 3 | OPA | OPA | 65 | −0.13 ns | 13.8% |
Paths requiring fixing (before timing closure): 9
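To illustrate the idea of deriving local-variation statistics from corner delays, here is a toy Monte Carlo sketch, not the production OPA flow (which analyzes operating points on the timing graph): each cell delay is drawn from a Gaussian with mean equal to the typical corner delay and σ = (worst − best)/6, and the 3σ path delay is read off the resulting distribution. The example path and its corner delays are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def path_delay_3sigma(cells, n_trials=100_000):
    """Monte Carlo estimate of the 3-sigma delay of a single path.

    cells: list of (typ, best, worst) corner delays per standard cell.
    Each cell's local-variation delay is modeled as a Gaussian with
    mean = typ and sigma = (worst - best) / 6, per the approximation
    described above (global corners standing in for mismatch data).
    """
    total = np.zeros(n_trials)
    for typ, best, worst in cells:
        sigma = (worst - best) / 6.0
        total += rng.normal(typ, sigma, n_trials)
    return np.percentile(total, 99.865)   # ~ mean + 3 sigma point

# hypothetical 20-cell path: 1 ns typical, 0.6/1.6 ns best/worst corners
path = [(1.0, 0.6, 1.6)] * 20
print(path_delay_3sigma(path))
```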
3.4.2 Multiple Voltage Domains

SRAM based on the six-transistor (6T) cell is the most common form of embedded memory in processor design. However, low-voltage operation of 6T SRAM faces significant challenges from process variations, bit cell stability and sensing. Threshold voltage variations among the transistors that constitute the 6T cell significantly degrade the read/write stability of the bit cell, especially at low voltages [143]. To ensure that the memory operates reliably as the logic voltage is scaled down, we use separate voltage domains for logic and memory. This allows us to operate the memory at the nominal voltage of 0.9 V while scaling the logic voltage down to 0.5 V. Voltage level shifters, capable of transitioning between 0.5 V and 0.9 V, are used to move signals between the logic and memory voltage domains. Figure 3-28 shows the logic and memory voltage domains in the processor and the level shifters used to transition between them. The logic domain operates at voltage VDDL and the memory domain at voltage VDDM.

Figure 3-28: Separate voltage domains for logic and memory. Level shifters are used to transition between domains.

3.5 Memory Bandwidth Optimization

The target external memory consists of two 64 M × 16 bit DDR2 DRAM modules with a burst length of 8. The processor generates 23 bit addresses for accessing the DRAM, divided as a 13 bit row address, 3 bit bank address and 7 bit column address. A 32 bit wide 266 MHz DDR2 memory controller is implemented using a Xilinx XC5VLX50 FPGA. We use a modified version of the Xilinx MIG DRAM controller that supports a lazy pre-charge policy: a row is only pre-charged when an access is made to a different row in the same bank. The 256 bit DDR2 interface is connected to the 64 bit processor interface through asynchronous FIFOs. This enables the processor to work with any 256 bit DRAM system and allows the processor and memory to operate at different frequencies.

The goal of memory optimization is to reduce the memory size and bandwidth required to support real-time operation. To process 1080p images (1920 × 1080 at 30 fps) in real time with a naive bilateral filtering implementation in the 2D image domain, using a 64 × 64 filter kernel and a 4 kB cache storing 64 × 64 pixels (8 bits each), the DRAM bandwidth is:

BW_2D Bilateral = (1080 × 30 × 64 × 64 + 1919 × 1080 × 30 × 64) × 3 colors = 11.5 GB/s    (3.12)

Here 64 × 64 pixels are accessed for the first element in each row and cached in the buffer; for subsequent pixels in the same row, only the next 64 pixels need to be accessed. The processing for the RGB color channels is performed independently.

Algorithmic optimizations that leverage the bilateral grid structure to perform bilateral filtering in the 3D grid domain, with 16 × 16 pixel blocks, 16 intensity levels and a 3 × 3 × 3 filter kernel, reduce the bandwidth requirement to:

BW_3D Grid = BW_Grid Creation + BW_Grid Filtering + BW_Grid Interpolation    (3.13)

where

BW_Grid Creation = BW_Image Read + BW_Grid Write
= (1920 × 1080 × 30 + (1920 × 1080 × 30) / (16 × 16 blocks) × 16 levels × 4 B/level) × 3 colors
= 222.5 MB/s    (3.14)

BW_Grid Filtering = BW_Grid Read + BW_Filtered Grid Write
= (1920 × 1080 × 30) / (16 × 16 blocks) × 16 levels × 4 B/level × (3 × 3 × 3 kernel) × 3 colors
+ (1920 × 1080 × 30) / (16 × 16 blocks) × 16 levels × 1 B/level × 3 colors
= 1212.5 MB/s    (3.15)

BW_Grid Interpolation = BW_Filtered Grid Read + BW_Output Image Write
= (1920 × 1080 × 30) / (16 × 16 blocks) × 16 levels × 1 B/level × 3 colors
+ 1920 × 1080 × 30 × 3 colors
= 189.1 MB/s    (3.16)

Combining the bandwidth requirements for grid creation, filtering and interpolation, from eq. (3.14), eq. (3.15) and eq. (3.16), the total bandwidth requirement for processing the 3D bilateral grid, from eq. (3.13), is:

BW_3D Grid = 222.5 MB/s + 1212.5 MB/s + 189.1 MB/s = 1624.1 MB/s    (3.17)

The significant downsampling and reduction in computational complexity enabled by the bilateral grid, compared to bilateral filtering in the 2D image domain, provides a bandwidth reduction of 86%, from 11.5 GB/s to 1.6 GB/s.
Architectural optimizations and the memory management approach described in Section 3.2.4, which uses task scheduling and the 21.5 kB on-chip SRAM as a cache for intermediate data, further reduce the memory bandwidth. This approach only requires reading the original image and writing back the filtered output, resulting in a bandwidth requirement of:

BW_Processor = BW_Image Read + BW_Output Image Write
= 1920 × 1080 × 30 × 3 colors + 1920 × 1080 × 30 × 3 colors
= 356 MB/s    (3.18)

The memory management approach enables processing in the 3D grid domain while storing only two rows of created grid blocks and one row of filtered grid blocks, without having to create the entire grid before processing. This data can be stored efficiently on-chip in SRAM, avoiding a significant number of off-chip DRAM accesses and reducing the memory bandwidth by 97% compared to bilateral filtering in the 2D image domain, from 11.5 GB/s to 356 MB/s.

Based on the number of memory accesses, we can estimate the memory power using a memory power consumption model [144]. The memory size is optimized for the specific implementation. For the 2D bilateral filtering implementation and for our grid implementation with task scheduling, the DRAM only stores an input image and an output image, which requires 12 MB of memory; the 3D grid implementation without task scheduling additionally stores the created and filtered grids, which requires 13.7 MB of memory. Figure 3-29 shows the memory bandwidth and estimated power consumption for 2D bilateral filtering, after algorithmic optimizations with the 3D bilateral grid, and after architectural optimizations involving memory management with task scheduling. The bilateral grid reduces the memory power consumption by 75%, from 697 mW to 175 mW. Architectural optimizations with memory management further reduce the memory power to 108 mW, an overall savings of 85% compared to bilateral filtering in the 2D image domain. The memory power does not scale linearly with the bandwidth because of the standby power consumption of the memory. This comparison demonstrates the significance of algorithm/architecture co-design and of considering trade-offs to optimize power consumption not only for the processor core but for the system as a whole, including external memory and communication costs.

Figure 3-29: Memory bandwidth and estimated power consumption for 2D bilateral filtering, 3D bilateral grid processing, and bilateral grid with memory management using task scheduling.
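The bandwidth arithmetic of eqs. (3.12)-(3.18) is easy to reproduce. The short script below (ours, with MB meaning 2^20 bytes and GB meaning 2^30 bytes, matching the figures in the text) prints the values quoted above.

```python
mb = lambda x: x / 2**20        # bytes/s -> MB/s
gb = lambda x: x / 2**30        # bytes/s -> GB/s

W, H, FPS, C = 1920, 1080, 30, 3          # 1080p at 30 fps, RGB
pix = W * H * FPS                          # pixels/s per color
blocks = pix / (16 * 16)                   # 16x16 blocks/s, 16 levels

bw_2d     = (1080*30*64*64 + 1919*1080*30*64) * C     # eq. (3.12)
bw_image  = pix * C                                   # 1 B/pixel/color
bw_create = bw_image + blocks * 16 * 4 * C            # eq. (3.14)
bw_filter = blocks*16*4*27*C + blocks*16*1*C          # eq. (3.15)
bw_interp = blocks*16*1*C + bw_image                  # eq. (3.16)
bw_grid   = bw_create + bw_filter + bw_interp         # eq. (3.17)
bw_chip   = 2 * bw_image                              # eq. (3.18)

print(gb(bw_2d))                                      # ~11.5 GB/s
print(mb(bw_create), mb(bw_filter), mb(bw_interp))    # ~222.5, 1212.5, 189.1
print(mb(bw_grid), mb(bw_chip))                       # ~1624.1, 356.0
```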
3.6 Measurement Results

The testchip, shown in Figure 3-30, is implemented in 40 nm CMOS technology with an active area of 1.1 mm × 1.1 mm, 1.94 million transistors and 21.5 kB of SRAM. The processor is verified to be operational from 25 MHz at 0.5 V to 98 MHz at 0.9 V, with the SRAMs operating at 0.9 V. The chip is designed to function as an accelerator core within a larger microprocessor system, utilizing the system's existing DRAM resources. For standalone testing of the chip, a 32 bit wide 266 MHz DDR2 memory controller was implemented using a Xilinx XC5VLX50 FPGA.

Figure 3-30: Die photo of the testchip. Highlighted boxes indicate SRAMs. HDR, CR and SC refer to the HDR create, contrast reduction and shadow correction modules respectively.

The performance vs. energy trade-off of the testchip over a range of VDD is shown in Figure 3-31. At the best image quality settings (grid block size 16 × 16 with 16 intensity levels), the processor operates from 25 MHz at 0.5 V with 2.3 mW power consumption to 98 MHz at 0.9 V with 17.8 mW power consumption. The processing run-time scales linearly with image size, reaching 60 megapixels/second at 0.9 V.

Figure 3-31: Processor performance: trade-off of energy vs. performance for varying VDD.

Figure 3-32 shows the area and power breakdowns of the processor across the bilateral filter engines and the pre-processing and post-processing modules. The power breakdown is obtained from post-layout simulations. The shadow correction module is power gated during HDR imaging, and the HDR creation and contrast adjustment modules are power gated during low-light enhancement.

Figure 3-32: Processor area (number of gates) and power breakdown.

3.6.1 Energy Scalable Processing

The grid scalability described in Section 3.2.5 provides a trade-off between grid resolution and the amount of energy required for processing. Figure 3-33 demonstrates this trade-off at 0.9 V for grid block sizes varying from 16 × 16 pixels to 128 × 128 pixels and the number of intensity levels varying from 4 to 16.

Figure 3-33: Energy scalable processing. Grid resolution vs. energy trade-off at 0.9 V.

The energy consumption has a roughly linear dependence on the number of grid intensity levels, because the number of active processing engines and memory banks is proportional to the number of intensity levels; this produces an approximately linear scaling in power consumption while the processing run-time remains unchanged. The energy consumption varies roughly quadratically with the grid block size, because the number of blocks to process decreases quadratically with the downsampling factor (equal to the block size); this produces an approximately quadratic scaling in run-time while the power consumption remains unaffected. A combination of these grid scaling parameters enables energy scalability from 0.19 mJ to 1.37 mJ per megapixel at 0.9 V.

The energy vs. image quality trade-off is illustrated by comparing output images for different grid configurations, for HDR imaging and low-light enhancement, in Figure 3-34 and Figure 3-35 respectively. The impact of intensity downsampling on image quality is much more significant than that of spatial downsampling, because the edge-preserving nature of the bilateral grid depends on the number of intensity levels.
Figure 3-34: Energy/resolution scalable processing. HDR imaging outputs for (a) grid block size 16 × 16, 16 intensity levels, 13.7 mJ; (b) grid block size 128 × 128, 16 intensity levels, 4.2 mJ; (c) grid block size 16 × 16, 4 intensity levels, 6.4 mJ; (d) grid block size 128 × 128, 4 intensity levels, 1.9 mJ.

Figure 3-35: Energy/resolution scalable processing. Low-light enhancement outputs for (a) grid block size 16 × 16, 16 intensity levels, 13.7 mJ; (b) grid block size 128 × 128, 16 intensity levels, 4.2 mJ; (c) grid block size 16 × 16, 4 intensity levels, 6.4 mJ; (d) grid block size 128 × 128, 4 intensity levels, 1.9 mJ.

3.6.2 Energy Efficiency

Image processing pipelines typically involve a complex set of interconnected operations, where each processing stage has large data dependencies. These operations do not automatically lend themselves to spatial and temporal parallelism. Several memory read/write operations are required at every stage of processing, making the cost of memory accesses often higher than the cost of computation [145]. This makes it difficult to achieve efficient software implementations without significant effort to manually optimize the code, including decisions about memory access patterns and processing order. Significant effort is also required to enhance processing locality and parallelism using intrinsics and other low-level programming techniques [146,147].

Table 3.2 shows a comparison of the processor performance at 0.9 V with implementations on other mobile processors. Software that replicates the functionality of the testchip and maintains identical image quality was implemented on the mobile processors. The implementations are optimized for multi-threading and multi-core processing, and take advantage of the available GPU resources on the processors. Processing run-time and power consumption during software execution were measured. The processor achieves more than 5.2x faster performance than the fastest software implementation and consumes less than 1/40th the power of the most power-efficient one, resulting in an energy reduction of more than 280x compared to software implementations on recent mobile processors while maintaining the same output image quality.

Table 3.2: Performance comparison with mobile processor implementations at 0.9 V.

| Processor | Technology (nm) | Frequency (MHz) | Power (mW) | Runtime* (s) | Energy* (mJ) |
| Intel Atom [148] | 32 | 1800 | 870 | 4.96 | 4315 |
| Qualcomm Snapdragon [24] | 28 | 1500 | 760 | 5.19 | 3944 |
| Samsung Exynos [25] | 32 | 1700 | 1180 | 4.05 | 4779 |
| TI OMAP [149] | 45 | 1000 | 770 | 6.47 | 4981 |
| This Work | 40 | 98 | 17.8 | 0.771 | 13.7 |

*Image size: 10 megapixel

To make software implementations more efficient and easier to develop without significant manual tuning, the Halide image processing language [150] proposes decoupling the algorithm definition from its execution strategy and automating the search for optimized mappings of the pipelines to parallel processors and memory hierarchies.
An optimizing compiler generates higher-performance implementations from an algorithm and a schedule described in Halide. We compared the processing performance using Halide with a C implementation and with the hardware implementation of our processor. A moderately optimized implementation generated using Halide for an ARM core, running on a Qualcomm Snapdragon processor [24], was able to process a 10 megapixel image in 2.1 seconds; with better optimization, the run-time could be reduced even further. This compares with 4.05 seconds for the manually optimized C implementation running on the same processor. The hardware implementation completed the processing in 771 ms. Halide provided significant performance gains while making the software easier to implement.

It is also useful to quantify the energy efficiency of processors in terms of operations performed per second per unit of power consumption (MOPS/mW), which highlights the trade-offs associated with different architectures. Figure 3-36 shows such a comparison for processors ranging from fully programmable CPUs and mobile processors to FPGAs and ASICs. An operation is defined as a 16 bit add computation.

Figure 3-36: Energy efficiency of processors ranging from CPUs and mobile processors to FPGAs and ASICs.
1. Intel Sandy Bridge [20]
2. Intel Ivy Bridge [21]
3. Multimedia DSP [23]
4. Mobile Processors [24,25]
5. GPGPU Application Processor [26]
6. FPGA with hierarchical interconnects [151]
7. SVD ASIC [28]
8. Video Decoder ASIC [29]
9. Multi-Granularity FPGA [152]
10. This work (0.5 V)

The significant enhancement in processing speed and the reduction in power consumption achieved by the hardware implementation in this work, resulting in 2 to 3 orders of magnitude higher energy efficiency, can be attributed to several factors.

1. The algorithmic and architectural optimizations maximize data locality and enable spatial and temporal parallelism that helps maximize the number of computations performed per memory access. This amortizes the cost of memory accesses over computations and reduces the memory bandwidth. Even an optimized software implementation has very limited control over the processing architecture and memory management strategies of a general purpose processor, and so cannot achieve the most optimal implementation.

2. The high degree of parallelism enabled by algorithm/architecture co-design facilitates real-time performance while operating at less than 100 MHz, compared to other processors operating above 1 GHz.

3. The hardware implementation allows careful pipelining with flexible bit widths, preserving the full resolution of fixed-point computations at each stage of the pipeline, whereas software implementations are restricted to fixed 32 bit or 64 bit operations. Attempting to adapt bit widths to match the required resolution at each pipeline stage often degrades performance on these cores instead of enhancing it, because it introduces additional typecasting operations in software processing.
4. Hardware implementations tailored to specific applications avoid the significant overhead of the control unit that a general purpose processor requires to configure its processing units and complex memory hierarchies. The performance and power overhead of the instruction fetch and decode unit alone can be significant. Even with an optimized software implementation, it is hard to avoid uneven pipelines and variable memory latencies, resulting in stalls that prevent optimal resource utilization.

5. The ability to scale voltage and frequency is key to ensuring minimum energy consumption for the desired performance. The active power consumption of circuits scales quadratically with voltage. Circuits that operate reliably down to near-threshold voltage enable minimum energy point operation for maximizing efficiency. General purpose processors rarely provide such flexibility to optimize energy and performance.

3.7 System Integration

The processor is integrated, as shown in Figure 3-37, with external DDR2 memory, a camera and a display. A 32 bit wide 266 MHz DDR2 memory controller and a USB interface for communicating with a host PC are implemented using a Xilinx XC5VLX50 FPGA. A software application running on the host PC was developed for processor configuration, image capture, activating processing and displaying results.

Figure 3-37: Processor integration with external memory, camera and display.

The Printed Circuit Board (PCB) that integrates the processor, memory and interfaces is shown in Figure 3-38, along with a setup that connects to a camera and display. The system provides a portable platform for live computational photography.

Figure 3-38: Printed circuit board and system integration with camera and display.

3.8 Summary and Conclusions

In this work, we developed a reconfigurable processor for computational photography that enables real-time processing in an energy-efficient manner. The processor performs HDR imaging, low-light enhancement and glare reduction using a scalable bilateral grid. Algorithmic optimizations that leverage the 3D bilateral grid structure map the computationally complex non-linear filtering operation onto an efficient linear filtering operation in the 3D grid domain, significantly reducing the computational complexity and memory requirement, enhancing processing locality and enabling a highly parallel architecture. Architectural optimizations exploit this parallelism to deliver high-throughput real-time performance while operating at low frequency, and achieve hardware scalability that enables energy vs. output quality trade-offs for energy/resolution scalable processing. Through algorithm/architecture co-design, an approach for low-light enhancement and flash shadow correction was developed that can be implemented efficiently using the bilateral grid architecture. Circuit design for low-voltage operation ensures reliable performance down to 0.5 V, enabling a wide voltage operating range for voltage/frequency scaling and minimum energy operation for the desired performance. The processor is implemented in 40 nm CMOS technology and verified to be operational from 98 MHz at 0.9 V with 17.8 mW power consumption to 25 MHz at 0.5 V with
At 0.9 V, it can process up to 60 megapixel/s. The scalability of the architecture enables processing from 0.19 mJ/megapixel to 1.37 mJ/megapixel for different grid configurations at 0.9 V, while trading-off output quality for energy. The processor achieves 280 x energy reduction compared to identical software implementations on recent mobile processors. The energy scalable implementation proposed in this work enables efficient integration into portable multimedia devices for real time computational photography. Based on the system design approach, from algorithms to circuit implementation, adopted in this work, the following conclusions can be drawn. 1. Hardware oriented algorithmic reframing is key to efficient implementation. The efficiency gains achievable for a system through architectural and circuit optimizations are limited if the algorithm requires sequential processing with large data dependencies. The significant reduction in computational complexity, memory size and bandwidth, achieved through algorithmic transformation from inefficient non-linear filtering in the image domain to efficient linear filtering in the 3D grid domain, demonstrates the significance of algorithmic trade-offs in system design. 2. Scalable architectures, with efficient clock and power gating, enable energy vs. performance/quality trade-offs that are extremely desirable for mobile processing. This energy-scalable processing allows the user to determine the energy usage for a task, based on the battery state or intended usage for the output. 3. Memory management - both on-chip memory size and off-chip memory bandwidth is critical to maximizing the system energy-efficiency. Reduction in external memory bandwidth from 11.5 GB/s to 356 MB/s and the corresponding power consumption from 697 mW to 108 mW, through algorithm/architecture co-design, careful task scheduling and use of on-chip SRAM cache, demonstrates this effect. 4. Low-voltage circuit operation is important to enable voltage/frequency scaling and attain minimum energy operation for the desired performance. 126 Reconfigurable Processor for Computational Photography Chapter 4 Portable Medical Imaging Platform Medical imaging techniques play a crucial role in the diagnosis and treatment of numerous medical conditions. Traditionally, medical diagnostic systems have been restricted to sophisticated clinical environments due to cost, size and expertise required to operate such equipment. Recent advances in computational photography and computer vision, coupled with efficient high-performance processing on portable multimedia devices, provide a unique opportunity for high quality and highly capable medical imaging systems to become much more portable and cost efficient. Image processing techniques such as High Dynamic Range (HDR) imaging, contrast enhancement, image segmentation and registration, could be used to ease the requirements of high-precision optical front-ends for medical imaging systems that make such equipment bulky and expensive, and enable digital cameras and smartphones to be used for medical imaging. Proliferation of connected portable devices presents an opportunity for making sophisticated medical imaging systems available to small clinics and individuals in rural areas and emerging countries to enable early diagnosis and better treatment outcomes. 
4.1 Skin Conditions - Diagnosis & Treatment

Skin conditions are among the top five leading causes of nonfatal disease burden globally [153] and can have a significant negative impact on quality of life. Chronic skin conditions are often easily visible and can be characterized by multiple features, including pigmentation, erythema, scale and other secondary features. Vitiligo is one such common condition, found in up to 2% of the worldwide population [154]. The disease is characterized by loss of pigment in the skin, hair and mucous membranes, caused in part by autoimmune destruction of epidermal melanocytes [155,156]. Due to its appearance on visible areas of the skin, Vitiligo can have a significant negative impact on the quality of life of affected children and adults.

4.1.1 Clinical Assessment: Current Approaches

Treatments of skin conditions aim to arrest disease progression and induce repigmentation of affected skin. Several surgical and non-surgical treatments, such as topical immunomodulators, phototherapy, and surgical grafting and transplantation, are available [157,158]. However, diagnosis is primarily based on visual clinical evaluation. Dermoscopy [159,160] is a noninvasive technique that aids visual observation by allowing the clinician to perform direct microscopic examination of diagnostic features in pigmented skin lesions and to visualize pigmented cutaneous lesions in vivo [161,162]. Commercially available dermoscopy tools, such as DermLite [163], aim to improve the ease and accuracy of visual evaluations by providing magnification, LED lighting and polarizing filters to enhance the field of view and reduce glare and shadows. However, reliable objective outcome measures, to allow for comparison of studies and to accurately assess changes over time, are currently lacking [164-166]. Several tissue lesions can be identified based on measurable features extracted from a lesion, making accurate quantification of tissue lesion features essential in clinical practice. Current outcome measures include the Physician's Global Assessment (PGA), which grades patient improvement based on broad categories of percentage repigmentation over time (0-25%, 25-50%, 50-75% and 75-100%), and the Vitiligo Area and Severity Index (VASI) [167], which measures percentage repigmentation graded over the area of involvement, summed over the body sites involved. Figure 4-1, reproduced with permission from [167], shows an example of VASI assessment.

Figure 4-1: Standardized assessments for estimating the degree of pigmentation to derive the Vitiligo Area Scoring Index. At 100% depigmentation, no pigment is present; at 90%, specks of pigment are present; at 75%, the depigmented area exceeds the pigmented area; at 50%, the depigmented and pigmented areas are equal; at 25%, the pigmented area exceeds the depigmented area; and at 10%, only specks of depigmentation are present. (Figure reproduced with permission from [167])

These outcome measures rely on subjective clinical assessment through visual observation, which cannot exclude inter-observer bias and has limited accuracy, reproducibility and quantifiability. Two recent studies [165,166] conclude that the current outcome measures have poor methodological quality and unclear clinical relevance, as well as a lack of consensus among clinicians, researchers and patients.
Recent studies have begun using image analysis to evaluate treatment efficacy, but these trials rely on investigator-defined boundaries of skin lesions, which can be biased, and these programs require user involvement to analyze each image separately, which can be time-consuming [168,169]. An objective measurement tool that accurately quantifies repigmentation could overcome these limitations and serve as a diagnostic tool for dermatologists. Image processing techniques can be applied to identify skin lesions and extract their features, allowing much more accurate determination of disease progression. The ability to more objectively quantify change over time will significantly improve the physician's ability to perform clinical trials and determine the efficacy of therapies.

4.1.2 Quantitative Dermatology

Algorithms for quantitative dermatology are being developed. A framework to detect and label moles in skin images is proposed in [170]. The method searches the image for skin regions using a non-parametric skin detection scheme and uses difference-of-Gaussian filters to find possible mole candidates. A trained Support Vector Machine (SVM) is used to classify the candidates as moles. An approach for registering micro-level features in high-resolution face images is proposed in [171]. The approach registers features in images captured with different light polarizations by approximating the face surface as a collection of quasi-planar skin patches and estimating spatially varying homographies using feature matching and quasiconvex optimization. A supervised learning technique to automatically detect acne-like lesions and enable computer-assisted counting of acne lesions in skin images is proposed in [172]; it models skin regions by a six-dimensional vector using temporal and spatial features and detects the separating boundary between the patch images. Quantitative assessment of wound healing through dimensional measurements and tissue classification is proposed in [173]. The approach computes a 3D model from multiple views of the wound. Tissue classification is performed from color and texture region descriptors computed after unsupervised segmentation. Principal component analysis followed by image segmentation is used in [174] to analyze and determine areas of skin that have undergone repigmentation during the treatment of Vitiligo. This approach converts an RGB image into an image that represents skin areas due to melanin and haemoglobin, and determines the change in the area of such regions over time. All the images taken over time are assumed to be accurately registered with respect to each other and to have uniform color profiles. A technique for melanocytic lesion segmentation based on image thresholding is proposed in [175]. Thresholding schemes work well when the lesion and background skin have distinct intensity and color profiles. However, their accuracy is limited when the image has intensity and/or color inhomogeneities. Table 4.1 summarizes the current approaches for clinical assessment and recent work in quantitative dermatology. A review of automated analysis techniques for pigmented skin lesions [176], applied to dermoscopic and clinical images, finds that even though several approaches for analyzing individual lesions have been proposed, there is a scarcity of approaches for automating lesion change detection.
The study concludes that computer-aided diagnosis systems based on individual pigmented skin lesion image analysis cannot yet be used to provide the best diagnostic results.

In this work, we develop a system for skin lesion detection and progression analysis and apply it to clinical images for Vitiligo, obtained from ten different subjects during treatment. Institutional Review Board approval was obtained for data analysis (MIT Protocol Number: 1301005500) as well as for the clinical pilot study in collaboration with the Brigham and Women's Hospital (BWH Protocol Number: 2012-P-002185/1). Image segmentation is used to accurately determine the lesion contours in an image, and a registration scheme using feature matching is implemented to align a sequence of images for a lesion. A progress metric called fill factor, which accurately quantifies repigmentation of skin lesions, is proposed.

Table 4.1: Summary of clinical assessment and quantitative dermatology approaches

Reference | Description
[160,162] | Dermoscopy - microscopic examination of diagnostic features in pigmented skin lesions.
[163] | DermLite - commercial dermoscope providing magnification, LED lighting and polarizing filters.
[165] | PGA - patient improvement based on broad categories of percentage repigmentation over time (0-25%, 25-50%, 50-75% and 75-100%).
[167] | VASI - percentage repigmentation, based on visual observation, graded over the area of involvement and summed over body sites.
[170] | A framework, based on difference-of-Gaussian filters and a trained SVM, to detect and label moles in skin images.
[171] | Registering micro-level features, using feature matching and quasiconvex optimization, in high-resolution face images captured with different light polarizations.
[172] | A supervised learning technique, using temporal and spatial features, to automatically detect and count acne-like lesions in images.
[173] | Quantitative assessment of wound healing by computing a 3D model from multiple views of the wound, with tissue classification based on color and texture region descriptors.
[174] | Principal component analysis and image segmentation of images captured with standardized lighting and alignment to determine repigmented skin areas during treatment.
[175] | Melanocytic lesion segmentation based on image thresholding.

4.2 Skin Condition Progression: Quantitative Analysis

The focus of this work is on developing a system for lesion detection and progression analysis of skin conditions, based not only on standardized clinical imaging but also on images captured by patients at home, using smartphone or digital cameras, without any standardization. The main contributions of this work are to leverage algorithmic techniques from different areas of computer vision, such as color correction, image segmentation and feature matching; to optimize and modify them to enhance accuracy for the skin imaging application and to reduce computational and memory complexity for efficient and fast implementation; and to develop an easy-to-use automated mobile system that could be used by patients as well as doctors for frequent monitoring of skin conditions.

The overall processing flow, from the non-standardized image sequence to quantification of progression, is summarized in Figure 4-2. The progress of a skin lesion is recorded by capturing images of the lesion at regular intervals of time. This is done for all lesions located on different body areas. Color correction is performed by adjusting the R, G, B histograms to neutralize the effects of varying lighting and enhance contrast. A Level Set Method (LSM) based image segmentation approach is implemented to identify the lesion contours. In the vicinity of the lesion contours, Scale Invariant Feature Transform (SIFT) based feature detection is performed to identify key features of the lesion. The first set of images of the skin lesions is manually tagged based on their location on the body. For all future images, the tagging is performed automatically by comparing the features of the new image with the previous set of images for all skin lesions. Once the new image is tagged to a specific lesion, it is registered with the first image in the sequence for that lesion, using pre-computed SIFT features. The warped lesion contours are computed after alignment, and their area is compared to the area of the first lesion in the sequence to determine the fill factor, which indicates the change in area and quantifies the progress over time.

Figure 4-2: Processing flow for skin lesion progression analysis.
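As a road map for the sections that follow, the end-to-end flow of Figure 4-2 can be sketched as below. The stage functions are illustrative placeholders for the techniques described in Sections 4.2.1 through 4.2.4, not the thesis implementation.

```python
def analyze_new_image(img, lesion_db, stages):
    """Illustrative end-to-end flow for one newly captured lesion image.
    `lesion_db` maps a lesion tag to its time-ordered list of records;
    `stages` bundles the per-step functions sketched in later sections."""
    img = stages.color_correct(img)                 # R,G,B histogram adjustment
    contour = stages.detect_contour(img)            # narrowband level set segmentation
    feats = stages.sift_near_contour(img, contour)  # SIFT restricted to a contour band

    # Auto-tag: best feature match against the latest image of each known lesion.
    tag = max(lesion_db,
              key=lambda i: stages.match_score(feats, lesion_db[i][-1]["feats"]))

    # Register to the first image of that lesion and quantify the progression.
    H = stages.homography(feats, lesion_db[tag][0]["feats"])
    fill = stages.fill_factor(stages.warp(contour, H), lesion_db[tag][0]["contour"])

    lesion_db[tag].append({"contour": contour, "feats": feats})
    return tag, fill
```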
4.2.1 Color Correction

Accurate color information of skin lesions is significant for dermatological diagnosis and treatment [177,178]. However, different lighting conditions and non-uniform illumination during image capture often lead to images with varying color profiles. Having a consistent color profile in the images captured over time is important both for visual comparison and for accurately determining the progression over time. Some approaches for color normalization in dermatological applications have proposed normalizing the color profiles of the instruments to match images captured with different devices, with users characterizing and calibrating the color response [179]. An approach to build color normalization filters by analyzing features in a large data set of images for a skin condition is proposed in [180]; it extracts image features from the inside, outside and peripheral regions of the tumor and builds multiple regression models with statistical feature selection.

We developed a color correction scheme that automatically corrects for color variations and enhances image contrast using color histograms. Histogram equalization is typically used to enhance contrast in intensity images. However, performing histogram equalization on the R, G and B color channels independently brings the color peaks into alignment and results in an image that closely resembles one captured in a neutral lighting environment. For an image I, the color histogram for channel c (R, G or B) is modified by adjusting the pixel color values $I_c(x, y)$ to span the entire dynamic range D, as given by eq. (4.1):

$$I_c^M(x, y) = \frac{I_c(x, y) - I_c^l}{I_c^u - I_c^l} \times D \qquad (4.1)$$

where $I_c^u$ and $I_c^l$ represent the upper and lower limits of the histogram. The approach can be summarized as follows:

1. Compute histograms for the R, G and B color channels.

2. Determine the upper and lower limits of the R, G and B histograms as the +2σ limit ($I_c^u$, greater than the intensity of 97.8% of the pixels) and the -2σ limit ($I_c^l$, less than the intensity of 97.8% of the pixels). This avoids histogram skewing due to long tails and results in better peak alignment.

3. Expand the R, G, B histograms to occupy the entire dynamic range D of 0 to 255 using eq. (4.1).
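A minimal numpy sketch of these three steps is shown below, assuming an 8-bit RGB image; the percentile calls stand in for the +/-2σ histogram limits described above.

```python
import numpy as np

def color_correct(img):
    """Per-channel histogram expansion per eq. (4.1): clip each of R, G, B
    at its +/-2-sigma percentile limits and stretch to the full [0, 255] range."""
    out = np.empty_like(img, dtype=np.uint8)
    for c in range(3):
        ch = img[..., c].astype(np.float64)
        lo, hi = np.percentile(ch, [2.2, 97.8])  # -2sigma / +2sigma limits
        ch = np.clip(ch, lo, hi)
        out[..., c] = ((ch - lo) / max(hi - lo, 1e-6) * 255).astype(np.uint8)
    return out
```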
Figure 4-3 shows the performance of this approach for images of two different skin lesions. The approach achieves performance comparable to white-balance calibration with a color chart, while also enhancing the contrast to make the lesion more prominent.

Figure 4-3: Color correction by histogram matching. Images captured with normal room lighting (a) and with color chart white-balance calibration (b). Images after color correction and contrast enhancement (c) of the images in (a).

4.2.2 Contour Detection

Accurately determining the contours of skin lesions is critical to diagnosis and treatment, as the contour shape is often an important feature in determining the skin condition. It is also important for determining the response to treatment and the progress over time. Due to non-uniform illumination, skin curvature and camera perspective, the images tend to have intensity and color variations within lesions. This makes it difficult for segmentation algorithms that rely on intensity/color uniformity to accurately identify the lesion contours. Segmentation approaches for images with intensity bias have been proposed [181-184]. A level set approach is proposed in [184] that models the distribution of intensity belonging to each tissue as a Gaussian distribution with spatially varying mean and variance, and creates a level set formulation by defining a maximum likelihood objective function. An LSM based approach called distance regularized level set evolution was proposed in [185] and extended in [186] to a region-based image segmentation scheme that can take intensity inhomogeneities into account. Based on a model of images with intensity inhomogeneities, the approach in [186] derives a local intensity clustering property of the image intensities and defines a local clustering criterion function for the image intensities in a neighborhood of each point.

We leverage the level set method for image segmentation [186], which provides good accuracy in lesion segmentation with intensity/color inhomogeneities. However, this approach has very high computational complexity and memory requirements, as described below. We develop an efficient and accurate narrowband implementation that significantly reduces the computational complexity and memory requirement. A distance regularized level set function, similar to that proposed in [185], is used to update the level set values during the iterations. Our implementation only performs updates to the level set function, and the related variables (energy function, bias field, etc., defined below), for a small subset of pixels that fall within a narrow band around the current segmentation contour in an iteration.
This limits the computations and memory accesses to this small subset of pixels, instead of the entire image. The following section describes the approach in further detail.

Level Set Method for Segmentation

The original image I with a non-uniform intensity profile is modeled as a combination of the homogeneous image J and a bias field b that captures all the intensity inhomogeneities in I, as given by eq. (4.2):

$$I = bJ + n \qquad (4.2)$$

where n is additive zero-mean Gaussian noise. A Level Set Function (LSF) $\phi(x)$ is defined for every pixel x in the image. The image is segmented into two regions $\Omega_1$ and $\Omega_2$ based on the values of the level set function in these regions, such that:

$$\Omega_1 = \{x : \phi(x) > 0\}, \quad \Omega_2 = \{x : \phi(x) < 0\} \qquad (4.3)$$

The segmentation contours are represented by the 'zero level set': $\{x : \phi(x) = 0\}$. The level set function is initialized over the entire image and iteratively evolved to achieve the final segmentation. The unknown homogeneous image J is modeled by two constants $c_1$ and $c_2$ in regions $\Omega_1$ and $\Omega_2$ respectively. An energy function, $\mathcal{F}(\phi, \{c_1, c_2\}, b)$, is defined over $\Omega_1$, $\Omega_2$, $c_1$, $c_2$ and b. The optimal regions $\Omega_1$ and $\Omega_2$ are obtained by minimizing the energy $\mathcal{F}$ in a variational framework. The energy minimization is performed iteratively with respect to one variable at a time, while the other variables are set to their values from the previous iteration. The iterative process is implemented numerically using a finite difference scheme [185]. This process iteratively converges to the homogeneous image J and the corresponding level set function $\phi(x)$, as shown by the sketch in Figure 4-4.

Figure 4-4: Level set segmentation. (a) Original image with intensity inhomogeneity and initialization of the level set function. (b) Homogeneous image obtained at the end of the iterations and the corresponding level set function.

The iterative process achieves accurate segmentation despite intensity inhomogeneities. However, it requires storage and update of the level set function $\phi(x)$, the bias field b, the homogeneous image model $\{c_1, c_2\}$ and the corresponding energy function $\mathcal{F}(\phi, \{c_1, c_2\}, b)$ for every pixel in each iteration. Bit widths for representing this data are given in Table 4.2.

Table 4.2: Bit width representations of LSM variables.

Variable | Bit Width
I(x) | 8 bits/pixel
J(x) | 8 bits/pixel
b(x) | 8 bits/pixel
$\phi(x)$ | 2 bits/pixel
$\mathcal{F}(\phi, \{c_1, c_2\}, b)$ | 16 bits/pixel

This results in a 42 bits/pixel requirement for the level set approach. Processing a 2 megapixel (1920 x 1080) image requires storing 11 MB of data and updating it in each iteration. On-chip SRAM in a processor is typically not suited to such a large memory requirement, necessitating an external DRAM for storing these variables. Processing a 2 megapixel image with 50 LSM iterations in one second requires a memory bandwidth of:

$$BW_{LSM} = BW_I^{Read} + BW_J^{Read/Write} + BW_\phi^{Read/Write} + BW_b^{Read/Write} + BW_{\mathcal{F}}^{Read/Write}$$
$$= 1920 \times 1080 \times (8 + 2 \times 8 + 2 \times 2 + 2 \times 8 + 2 \times 16) \times 50 \text{ iterations} = 985 \text{ MB/s} \qquad (4.4)$$

To enable energy-efficient implementation and real-time processing, we need to optimize the algorithm and reduce the computational and memory requirements.
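The figure in eq. (4.4) follows directly from Table 4.2; the quick check below assumes one read of I and read-plus-write traffic for the other four variables, as in the equation.

```python
# Reproduce the memory-bandwidth estimate of eq. (4.4) from Table 4.2.
pixels = 1920 * 1080                  # 2 megapixel image
iters = 50                            # LSM iterations per second
bits = 8 + 2*8 + 2*2 + 2*8 + 2*16     # I read; J, phi, b, F read+write
bw_bytes = pixels * bits * iters / 8
print(f"{bw_bytes / 1e6:.0f} MB/s")   # -> 985 MB/s
```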
Narrowband LSM

We develop a narrowband implementation of the approach where, instead of storing and updating the LSM variables for all the pixels in the image in each iteration, we only need to track a small subset of pixels that fall within a narrow band defined around the zero level set, as depicted in Figure 4-5.

Figure 4-5: Narrowband implementation of level set segmentation. LSM variables are tracked only for pixels that fall within a narrow band defined around the zero level set in the current iteration.

The narrowband implementation is achieved by limiting the computations to a narrow band around the zero level set [185]. The LSF at a pixel x = (i, j) in the image is denoted by $\phi_{i,j}$, and a set of zero-crossing pixels is determined as the pixels (i, j) such that $\phi_{i+1,j}$ and $\phi_{i-1,j}$, or $\phi_{i,j+1}$ and $\phi_{i,j-1}$, have opposite signs. If the set of zero-crossing pixels is denoted by Z, the narrowband B is constructed as given by eq. (4.5):

$$B = \bigcup_{(i,j) \in Z} N_{i,j} \qquad (4.5)$$

where $N_{i,j}$ is a 5 x 5 pixel window centered around pixel (i, j). The 5 x 5 window is experimentally determined to provide a good trade-off between computational complexity and quality of the results. The LSF based segmentation using the narrowband can be summarized as follows:

1. Initialize the LSF to $\phi_{i,j}^0$, where $\phi_{i,j}^k$ indicates the LSF value during iteration k. Construct the narrowband $B^0$ using eq. (4.5).

2. Update the LSF on the narrowband using a finite difference scheme [185] as $\phi_{i,j}^{k+1} = \phi_{i,j}^k + \Delta t \cdot L(\phi_{i,j}^k)$, where $\Delta t$ is the time step of the iteration and $L(\phi_{i,j}^k) \approx \partial\phi/\partial t$.

3. Determine the set of zero-crossing pixels of $\phi^{k+1}$ and update the narrowband $B^{k+1}$ using eq. (4.5).

4. For pixels (i, j) that are part of the updated narrowband $B^{k+1}$ but were not part of the narrowband $B^k$, set $\phi_{i,j}^{k+1} = 3$ if $\phi_{i,j}^{k+1} > 0$ and $\phi_{i,j}^{k+1} = -3$ otherwise.

5. Continue the iterations until the narrowband stops changing ($B^{k+1} = B^k = B^{k-1}$) or the limit on the maximum number of iterations is reached.

The set of zero-crossing points at the end of the iterations represents the segmentation contour. The narrowband approach significantly reduces the computational cost as well as the memory requirement of LSM segmentation. Figure 4-6 shows the number of pixels processed for five 2 megapixel images of skin lesions over 50 LSM iterations using the narrowband implementation. On average, 400,000 pixels are processed per iteration. Compared to the 2 million pixels processed per iteration using the original LSM, this represents an 80% reduction in processing cost and reduces the average memory bandwidth to 197 MB/s.

Figure 4-6: Number of pixels processed using the narrowband implementation over 50 LSM iterations for five images.
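The narrowband construction of eq. (4.5) amounts to finding zero crossings of the LSF and dilating them with the 5 x 5 window; a compact numpy sketch, illustrative rather than the thesis implementation, follows.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def narrowband(phi):
    """Build the narrowband B of eq. (4.5): zero-crossing pixels of the
    level set function phi, dilated by the 5x5 window N_ij."""
    pos = phi > 0
    zc = np.zeros_like(pos)
    # Opposite signs across vertical or horizontal neighbors mark a crossing.
    zc[1:-1, :] |= pos[2:, :] != pos[:-2, :]
    zc[:, 1:-1] |= pos[:, 2:] != pos[:, :-2]
    return binary_dilation(zc, structure=np.ones((5, 5), dtype=bool))
```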
Two-Step Segmentation

This narrowband implementation, however, has one important limitation. We perform updates on the LSM variables only for the pixels in a small neighborhood of the segmentation contour in the current iteration. If the LSF isn't properly initialized, it is possible for the energy function to get trapped in a local minimum, resulting in inaccurate segmentation. This can be easily avoided by starting with a good initialization. We achieve this by using a two-step approach:

* Step 1: A simple segmentation technique, such as thresholding or K-means, is used. This step is computationally very efficient and generates segmentation contours that are not completely accurate but serve as a good starting point for our narrowband LSM implementation.

* Step 2: Contours generated in Step 1 are used to initialize the LSF. Narrowband LSM then iteratively refines these contours to achieve the final segmentation.

Figure 4-7 shows the segmentation achieved by K-means in Step 1 for a skin lesion. Using these contours to initialize the LSM iterations, Figure 4-8 shows the evolution of the contours during the LSM iterations in Step 2.

Figure 4-7: Lesion segmentation using K-means, showing the original image, the K-means segmentation and the initial contours.

Figure 4-8: Contour evolution for lesion segmentation using narrowband LSM, from the initial contours through 50 iterations to the final contours.

4.2.3 Progression Analysis

The ability to accurately determine the progression of a skin condition over time is an important aspect of diagnosis and treatment. In this work, we capture images of the same skin lesions using a handheld digital camera over an extended period of time during treatment and analyze them to determine the progress. However, the lesion contours determined in individual images cannot be directly compared, as the images typically have scaling, orientation and perspective mismatch.

We propose an image registration scheme based on SIFT feature matching [34] for progression analysis. The skin surface typically does not have significant features that could be detected and matched across images by SIFT. However, the lesion boundary creates distinct features due to the transition in color and intensity from the regular skin to the lesion. To further highlight these features, we superimpose the identified contour onto the original image before feature detection. The lesion contours change over time as the treatment progresses; however, this change is typically slow and non-uniform. Repigmentation often occurs within the lesion, and some parts of the contour shrink while others remain the same. Performing SIFT results in several matching features corresponding to the areas of the lesion that haven't significantly changed.

Matching SIFT features over large images can be computationally expensive. Also, on relatively featureless skin surfaces, most useful SIFT features are concentrated around the lesion contour, where there is a change in intensity and color. To take advantage of this, we restrict feature matching using SIFT to a narrow band of pixels in the neighborhood of the contour, defined in the same manner as the narrow band in Section 4.2.2 by eq. (4.5). Figure 4-9 shows a pair of images of the same lesion with some of the matching SIFT features identified on them.

Figure 4-9: SIFT feature matching performed on the highlighted narrow band of pixels in the vicinity of the contour.

This significantly speeds up the processing by reducing the number of computations and the memory requirement, while providing significant features near the contour that can be matched across images. For a 2 megapixel image, instead of performing SIFT feature detection over 2 million pixels, this approach requires processing only 250,000 pixels on average - a reduction of 88%. This also reduces the memory requirement from 2 MB to about 250 kB, which could be efficiently implemented as on-chip SRAM instead of external DRAM.
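Restricting feature detection to the contour band can be expressed with OpenCV's SIFT and a detection mask, as in the sketch below; the band width parameter is an assumed value, and the contour is taken to be an OpenCV-style point array.

```python
import cv2
import numpy as np

def sift_near_contour(gray, contour, band_px=10):
    """Detect SIFT features only in a narrow band of pixels around the
    lesion contour, instead of over the whole image."""
    mask = np.zeros(gray.shape, dtype=np.uint8)
    cv2.drawContours(mask, [contour], -1, color=255, thickness=band_px)  # band mask
    sift = cv2.SIFT_create()
    return sift.detectAndCompute(gray, mask)  # keypoints, descriptors
```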
SIFT is performed only once on any given image, the first time it is analyzed. The SIFT features for the image are stored in the database and used for subsequent analyses. Once the SIFT features are determined for all the images in a sequence, we identify matching features across images using Random Sample Consensus (RANSAC) [187] and compute homography transforms that map every image in the sequence to the first image. The homographies are used to warp the images in the sequence to align with the first image. The lesion contours in the warped images can then be used to compare the lesions and determine the progression over time. The lesion area, confined by the warped contours, is determined for each image in the sequence. We define a quantitative metric called the fill factor ($F_t$) at time t as the change in the area of the lesion with respect to the reference (the first image, captured before the beginning of the treatment), as given by eq. (4.6):

$$F_t = 1 - \frac{A_t}{A_0} \qquad (4.6)$$

where $A_t$ is the lesion area at time t and $A_0$ is the lesion area in the reference image.

A limitation of the narrowband approach for feature matching is that it can be difficult to determine a significant number of matching features if the lesion contours in two subsequent images have changed dramatically. If the images are only collected during clinical visits, which are usually more than a month apart, it is possible to have significant changes in the lesion contours, depending on the skin condition. The images collected for this work, as part of the pilot study for Vitiligo, were usually a month apart, and the approach worked well for these images. A goal of this work is to facilitate frequent image collection by enabling patients to capture images at home, achieving accurate feature matching between subsequent images as well as frequent feedback for doctors and patients.
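A minimal sketch of the registration and fill-factor computation of eq. (4.6) is shown below, assuming OpenCV for the RANSAC homography and contour warping; `pts_t` and `pts_ref` are matched feature coordinates between the new image and the reference.

```python
import cv2
import numpy as np

def fill_factor(contour_t, pts_t, pts_ref, area_ref):
    """Warp the lesion contour at time t onto the reference image using a
    RANSAC-estimated homography, then compute F_t = 1 - A_t / A_0 (eq. 4.6)."""
    H, _ = cv2.findHomography(pts_t, pts_ref, cv2.RANSAC)
    pts = contour_t.reshape(-1, 1, 2).astype(np.float32)
    warped = cv2.perspectiveTransform(pts, H)
    area_t = cv2.contourArea(warped.astype(np.float32))
    return 1.0 - area_t / area_ref
```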
4.2.4 Auto-tagging

Many skin conditions typically result in lesions in multiple body areas. For a patient or a doctor to be able to keep track of the various lesions, it is important to be able to classify the lesions based on the body areas and to maintain individual databases for the sequence of images from each lesion. In this work, we implement a scheme where the subject needs to manually identify the lesions only once, during initial setup, and all future instances of the same lesion are automatically classified and entered into the database for analysis. The auto-tagging approach is developed in collaboration with undergraduate researchers Michelle Chen and Qui Nguyen.

The well-studied problem of image classification is similar, and several image classification techniques exist that may adapt well to this application. In an image classification problem, there are several predefined classes of images into which unknown images must be classified. In this case, the different affected areas on the body can be thought of as the classes, and we would like to classify new photographs taken by the patient. A large body of research exists on image classification. Most approaches generally involve determining the distinctive features of each class and then comparing the features of unknown images to these known patterns to determine their likely classifications. SIFT features are a very popular option for general image classification because they are resistant to changes in scale and transformations [34]. The SIFT descriptors have high discriminative power, while at the same time being robust to local variations [188]. SIFT has been shown to significantly outperform descriptors such as pixel intensities [189,190], edge points [191] and steerable pyramids [192]. Features such as the Harris-Affine detector [193] and geometric blur descriptors [194] have also emerged as alternatives to SIFT for image matching and classification. Furthermore, given the nature of skin images, where a lightly colored lesion is surrounded by darker skin with very few other features present, the main distinguishing feature of each image is simply the shape of the lesion. This enables us to use descriptors designed for shape recognition, such as shape contexts [195]. The accuracy of shape context matching is strongly correlated with the accuracy of the segmentation that determines the lesion contour. The accuracy of segmentation increases for darker skin types, especially in the presence of intensity inhomogeneities. In this work, we implemented and analyzed both SIFT based and shape context based classification.

One important difference between classic image classification algorithms and the approach used in this work is how the definitive features of each class are determined. In classic image classification, there are a large number of training examples that can be used to determine the distinctive pattern of features for each class, and machine learning techniques are often used to do this. In our case, however, there are only a few examples per class, and because the lesions change over time, older examples are less relevant. As a result, we do not use machine learning to combine the examples. Instead, we use the features of the most recent photograph in each class to represent that class. At the beginning of the treatment, all skin lesions are photographed and manually tagged based on the body areas. An image of lesion i captured at time t is denoted by $L_i^t$. The images ($L_i^0$) are processed to perform color correction and contour detection, as described in Sections 4.2.1 and 4.2.2.

SIFT-based classification

SIFT features are computed for each image and stored along with the image as $S_i^0$. When a new image ($L_j^1$) is captured at time t = 1, the same processing is performed to determine the contour and SIFT features $S_j^1$. The SIFT features for the new image ($S_j^1$) are compared with those determined earlier ($S_i^0$) to find matches using a two nearest neighbor approach. The largest set of inliers ($I_{i,j}$) with $N_{i,j}$ elements and the total symmetric transfer error ($e_{i,j}$), normalized over the range [0, 1], for every combination $\{S_i^0, S_j^1\}$ are determined using RANSAC. The image ($L_j^1$) is then classified as belonging to lesion i if the given i maximizes the matching criterion $M_{i,j}$, defined by eq. (4.7):

$$M_{i,j} = N_{i,j} (1 + \lambda(1 - e_{i,j})) \qquad (4.7)$$

where $\lambda$ is a constant, set to 0.2 in this work. The homography $H_i^{0,1}$ corresponding to the best match is stored for later use in the progression analysis.
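The matching criterion of eq. (4.7) can be sketched as below, assuming OpenCV descriptor matching with the two-nearest-neighbor ratio test and a RANSAC homography; the error term `e` here is an illustrative stand-in for the normalized symmetric transfer error, not the exact computation used in the thesis.

```python
import cv2
import numpy as np

def sift_match_score(kp_new, des_new, kp_old, des_old, lam=0.2):
    """Score a candidate lesion per eq. (4.7): M = N * (1 + lam * (1 - e)),
    with N the RANSAC inlier count and e a normalized matching error."""
    bf = cv2.BFMatcher()
    pairs = bf.knnMatch(des_new, des_old, k=2)  # two nearest neighbors
    good = [p[0] for p in pairs
            if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    if len(good) < 4:
        return 0.0
    src = np.float32([kp_new[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_old[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC)
    N = int(inliers.sum()) if inliers is not None else 0
    e = 1.0 - N / len(good)  # stand-in for the normalized transfer error
    return N * (1 + lam * (1 - e))
```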
Shape context based classification

Shape context descriptors [195], which for a given point on the contour encode the distribution of the relative locations of all the other points, are computed for each image and stored along with the image as $SC_i^0$. When a new image ($L_j^1$) is captured at time t = 1, the same processing is performed to determine the contour and the shape context descriptors $SC_j^1$. The shape context descriptors for the new image ($SC_j^1$) are compared with those determined earlier ($SC_i^0$) to find the minimum cost matching between these points, using the difference between the shape context descriptors as the cost of matching two points. Finally, a thin plate spline transformation [196,197] between the two contours is computed using the minimum cost matching. The overall difference between two lesion images is then represented as a combination of the cost of the matching ($SC_{i,j}^{cost}$) and the size of the transformation ($T_{i,j}$). The image ($L_j^1$) is then classified as belonging to lesion i if the given i maximizes the matching criterion $M_{i,j}$, defined by eq. (4.8):

$$M_{i,j} = -1 \times (SC_{i,j}^{cost} + k \times T_{i,j}) \qquad (4.8)$$

where k is a constant that represents how much the size of the transformation is weighted relative to the cost of the matching. In this work, k is set to 10. The same process is applied for tagging any future image $L_j^t$ by comparing it against the previously captured set of images $L^{t-1}$.
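The minimum-cost point matching underlying eq. (4.8) can be sketched with the Hungarian algorithm, as below; the chi-squared histogram cost is a common choice for shape contexts but is an assumption here, and the thin plate spline term is omitted.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def shape_context_cost(sc_new, sc_old):
    """Minimum-cost matching between two sets of shape context histograms
    (rows = contour points, columns = log-polar bins)."""
    def chi2(a, b):
        # Chi-squared distance between two descriptor histograms.
        return 0.5 * np.sum((a - b) ** 2 / (a + b + 1e-9))
    cost = cdist(sc_new, sc_old, metric=chi2)
    rows, cols = linear_sum_assignment(cost)  # Hungarian matching
    return cost[rows, cols].sum()             # the SC cost term of eq. (4.8)
```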
4.2.5 Skin Condition Progression: Summary

The overall processing, involving image tagging, lesion contour detection and progression analysis, can be summarized as follows.

* Initial Setup

1. Manually tag the images ($L_i^0$) based on the location i of the lesion.
2. Perform color correction and segmentation to determine the lesion contours ($C_i^0$).
3. Compute SIFT features ($S_i^0$) in the vicinity of the lesion contour ($C_i^0$). Store $C_i^0$ and $S_i^0$ for future analysis.
4. Shape context based tagging: Compute shape context features ($SC_i^0$) on the lesion contour ($C_i^0$). Store $SC_i^0$ for future analysis.

* Subsequent Analysis

1. For an image $L_j^t$ captured at time t, perform color correction and contour detection ($C_j^t$).
2. Compute SIFT features ($S_j^t$) in the vicinity of the lesion contour ($C_j^t$).
3. Perform feature matching for every combination $\{S_i^{t-1}, S_j^t\}$ and tag $L_j^t$ to lesion i using eq. (4.7). Store the best-match homography $H_i^{t-1,t}$ for further analysis.
4. Shape context based tagging: Perform shape context matching for every combination $\{SC_i^{t-1}, SC_j^t\}$ and tag $L_j^t$ to lesion i using eq. (4.8).
5. Using the pre-computed contours ($C_i^t$) and homographies ($H_i^{t-1,t}$), register a sequence of n images of the same lesion captured over time to the first image ($L_i^0$).
6. Compare the areas of the warped lesion contours to determine the progression over time and compute the fill factor ($F_t$) using eq. (4.6).

4.3 Experimental Results

4.3.1 Clinical Validation

Institutional Review Board approval was obtained for data analysis (MIT Protocol Number: 1301005500) as well as for the clinical pilot study in collaboration with the Brigham and Women's Hospital (BWH Protocol Number: 2012-P-002185/1). Ten subjects, ages 18 years and older, with a dermatologist diagnosis of Vitiligo were recruited by Dr. Vaneeta Sheth. The subjects had a variety of skin phototypes and disease characteristics. As this was a pilot study, no standardized intervention was performed. Rather, subjects were treated with standard therapies used for Vitiligo based on clinical characteristics and patient preference. Further subject-specific details, along with the various treatment modalities, are outlined in Appendix B. Photographs of skin lesions were taken at the beginning of treatment and during at least two subsequent clinical follow-up visits, using normal room lighting and a handheld digital camera.

4.3.2 Progression Quantification

The approach to analyze the individual images and determine the progress over time is implemented using MATLAB. For a sequence of images of a skin lesion captured over time, we process each image to perform color correction and contrast enhancement. Figure 4-10 shows a sequence of images with their R, G, B histograms and the outputs after color correction.

Figure 4-10: Color correction for a sequence of images by R, G, B histogram modification. (a) Original image sequence. (b) Color corrected image sequence. The lesion color changes due to phototherapy.

The color corrected images are then processed to perform lesion contour detection. Figure 4-11 shows a sequence of images with the detected contours overlaid. LSM based image segmentation accurately detects the lesion boundaries despite intensity/color inhomogeneities in the image.

Figure 4-11: Image segmentation using LSM for lesion contour detection despite intensity/color inhomogeneities in the image.

Feature matching is performed across images to correct for scaling, orientation and perspective mismatch. A homography transform, computed based on the matching features, is used to warp all the images in a sequence with respect to the first image, which is used as a reference. Figure 4-12 shows a sequence of warped images. The warped lesions are compared with respect to the reference lesion at the beginning of the treatment to determine the progress over time in terms of the fill factor.

Figure 4-12: Image registration based on matching features with respect to the reference image at the beginning of the treatment (Nov'12, fill factor = 0; Mar'13, 27%; Jul'13, 51%; Sep'13, 57%).

A sequence of captured and processed images of a different skin lesion from another subject is shown in Figure 4-13, and the fill factor is computed by comparing the warped lesions.

Figure 4-13: Sequence of images during treatment. (a) Images captured with normal room lighting. (b) Processed outputs after contour detection and alignment (Nov'12, fill factor = 0; Dec'12, 6%; Jan'13, 16%; Feb'13, 22%).

The approach for image registration is independently validated by analyzing images of the same skin lesion captured from different camera angles. Contour detection is performed on the individual images, which are then aligned by feature matching. Figure 4-14 shows one such comparison. The aligned lesions are compared in terms of their area as well as the number of pixels that overlap. Analysis of 100 images from 25 lesions, with four real and artificial camera angles each, shows 96% accuracy in area and 95% accuracy in pixel overlap.

Figure 4-14: Image registration through feature matching. (a) Images of a lesion from different camera angles. (b) Images after lesion contour detection and alignment. The area matches to 98% accuracy and the pixel overlap to 97% accuracy.

To validate the progression analysis, we take one image each from 50 different lesions and artificially generate a sequence of 4 images for each lesion with a known change in area.
We then apply rotation, scaling and perspective mismatch to the new images. This artificial sequence is used as input to our system, which determines the lesion contours, aligns the sequence and computes the fill factor. We compare the fill factor with the known change in area from the artificial sequence and also compute the pixel overlap between the lesions identified in the original sequence (before adding mismatch) and those in the processed sequence. Figure 4-15 shows one such comparison. Analysis of 200 images from 50 such sequences shows 95% accuracy in fill factor computation and pixel overlap.

Figure 4-15: Progression analysis. (a) Artificial image sequence with known area change (fill factor 0 to 30%), created from a lesion image. (b) Image sequence after applying scaling, rotation and perspective mismatch. (c) Output image sequence after lesion alignment and fill factor computation (computed fill factor 0 to 31%, pixel overlap 96-100%).

The proposed approach is used to analyze 174 images corresponding to 50 skin lesions from ten subjects to determine the progression over time. The progression of multiple lesions during treatment, as well as a detailed analysis of the progression for all ten patients in the clinical study, is presented in Appendix B.

4.3.3 Auto-tagging Performance

The performance of the auto-tagging technique is evaluated by analyzing images of twenty lesions from ten subjects, with five images in each sequence. The first twenty images, captured at the beginning of the treatment, are manually tagged. The auto-tagging techniques using SIFT and shape contexts are used to classify the remaining 80 images. For each technique, the performance is evaluated as follows. For each image, we use the technique to calculate the similarity between that image and all images from the previous timestep, defined by the matching criteria in eq. (4.7) or eq. (4.8). If the image from the previous timestep with the highest similarity is from the same set, then the technique classifies the image correctly. Otherwise, the classification is incorrect. The SIFT based classification approach is able to accurately tag 70 of the 80 images, achieving an accuracy of 87%. The shape context based approach is able to accurately classify 72 of the 80 images, achieving an accuracy of 90%.

The images in this test data set were captured one to three months apart, which resulted in significant changes in the lesion contours for some of the test images. If the contours are significantly different, it becomes difficult for both SIFT based feature matching and shape context matching to identify enough matching features for robust classification. Enabling more frequent data collection, where adjacent images have far fewer changes in lesion shape, will further enhance the accuracy of tagging. The processing steps for tagging using SIFT are part of the steps necessary for contour detection and progression analysis, so this approach adds very little overhead while achieving good accuracy. The shape context based approach requires computing shape contexts and transforms, but this is a small overhead (less than 5%) in the overall processing.
4.3.4 Energy-Efficient Processing

The algorithmic optimizations outlined in Section 4.2.2 and Section 4.2.3, for segmentation and SIFT based progression analysis respectively, provide significant reductions in computational complexity and in memory size and bandwidth requirements. We can estimate the reduction in processing complexity through a comparison of run-times for the different implementations. Three implementations - full LSM with full SIFT, narrowband LSM with full SIFT, and narrowband LSM with narrowband SIFT - are created in MATLAB. All three implementations are run on a computer with a 2.4 GHz Intel Core i5 processor and 8 GB of 1600 MHz DDR3 memory. Run-times are determined as the average of fifty runs of the same implementation for processing two 2 megapixel images. Table 4.3 shows the comparison of run-times for the different implementations.

Table 4.3: Performance enhancement through algorithmic optimizations.

Segmentation | Feature Matching | Run Time | Power | Energy
Full LSM | Full SIFT | 11.4 sec | 20.6 W | 235 J
Narrowband LSM | Full SIFT | 4.3 sec | 21.2 W | 91 J
Narrowband LSM | Narrowband SIFT | 3.1 sec | 20.2 W | 63 J

The narrowband LSM implementation enhances the performance by 62% compared to full LSM. The narrowband SIFT implementation improves the performance by 28% compared to full SIFT. A combination of both results in a 73% performance enhancement compared to full LSM and SIFT. The power consumption of the CPU during processing is measured using Intel Power Gadget [198]. The algorithmic optimizations result in a 73% reduction in the overall energy consumption.

For processing a 2 megapixel image in one second, based on the number of memory accesses, we can estimate the memory power using a memory power consumption model [144]. Figure 4-16 shows the memory bandwidth and estimated power consumption for processing with full LSM segmentation and SIFT feature matching, compared with the optimized narrowband LSM segmentation and narrowband SIFT feature matching. The algorithmic optimizations leading to the narrowband implementations of both LSM segmentation and SIFT feature matching result in an 80% reduction in memory bandwidth and a 45% reduction in memory power.

Figure 4-16: Memory bandwidth and estimated power consumption for full image LSM and SIFT compared to the optimized narrowband implementations of LSM and SIFT.

These algorithmic enhancements pave the way for energy-efficient hardware implementations that could enable real-time processing on mobile platforms.
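The energy column of Table 4.3 is simply the measured average power integrated over the run time; the quick check below reproduces it.

```python
# Energy = average CPU power x run time, reproducing Table 4.3.
runs = [("Full LSM + Full SIFT", 11.4, 20.6),
        ("Narrowband LSM + Full SIFT", 4.3, 21.2),
        ("Narrowband LSM + Narrowband SIFT", 3.1, 20.2)]
for name, t_sec, p_watt in runs:
    print(f"{name}: {t_sec * p_watt:.0f} J")  # -> 235 J, 91 J, 63 J
```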
4.3.5 Limitations

The performance of the system depends on several factors, including image exposure, skin type and the location of the lesion. For example, it is harder to accurately segment and align lesions that do not have well defined boundaries, such as a lesion that wraps around a finger or foot. Figure 4-17 shows an example where segmentation fails to identify the right lesion contours. Capturing multiple images of the lesion, each zoomed in on a narrow patch, could help improve the performance in such cases.

Figure 4-17: Image segmentation fails to accurately identify lesion contours where the lesions don't have well defined boundaries (November'12 and January'13 images).

All the data analyzed in this work is based on image collection that happens only during the patient's visits to the doctor. Such visits may be far apart (a month or more), and the lesions may have changed too significantly to accurately determine matching features between the new image and the previously collected image. One of the goals of the mobile application is to enable patients to frequently capture and analyze images, even outside clinical visits. Frequent data collection and analysis would not only further enhance the performance of the system, but would also provide doctors with near real-time feedback on the response to treatment that could be used to tailor the treatment for the best outcomes. The approach is validated for the Vitiligo skin condition, but it has general applicability and could be extended to other skin conditions as well.

4.4 Mobile Application

A key objective of this work is to enable patients to perform imaging and progression analysis of skin lesions at home, much more frequently than is possible when usage is limited to dermatologists in a clinical environment. The ability to perform imaging and analysis using mobile devices, such as smartphones, is important for achieving this goal. Together with undergraduate researchers Michelle Chen and Qui Nguyen, we are developing a mobile application for the Android platform that enables image capture of the skin lesions and provides a simple user interface to analyze the images. The analysis is performed using a cloud-based system that integrates with the mobile application. Figure 4-18 shows the architecture of the mobile application and cloud integration.

Figure 4-18: Architecture of the mobile application with cloud integration.

The application allows the user to capture images of the skin lesion using the built-in camera on the mobile device. The images are uploaded to the cloud server. The first time that a patient uses the application, they are asked to label each image manually. For all subsequent usage, the images are tagged automatically based on the labels originally provided by the user. The tag is suggested to the user for confirmation, to prevent mislabeling in cases where auto-tagging might result in a wrong classification. A database of all the images, organized according to the tags and the date of capture, is maintained on the cloud server. The user can select a region to analyze, which activates processing on the cloud server. After the processing is complete, the results are retrieved and displayed on the mobile device as an animation sequence that takes the user through all the images of that lesion, warped to align with the first image in the sequence, and shows the progression in terms of the corresponding fill factors. Figure 4-19 shows some of the screens that form the user interface of the application, which is currently under development.

Figure 4-19: User interface of the mobile application: image capture, manual tagging on first usage, auto-tagging with user confirmation on subsequent usage, region selection and progression display. (Contributed by Michelle Chen and Qui Nguyen)
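The client-server exchange described above could look like the following sketch; the endpoint URLs and JSON fields are hypothetical placeholders for illustration, not the actual service interface.

```python
import requests

SERVER = "https://example.org/skin-api"  # hypothetical endpoint

def upload_and_analyze(image_path, confirmed_tag=None):
    """Illustrative client flow: upload an image, confirm the suggested tag,
    then request the progression analysis for that lesion."""
    with open(image_path, "rb") as f:
        r = requests.post(f"{SERVER}/images", files={"image": f})
    image_id = r.json()["id"]
    tag = confirmed_tag or r.json()["suggested_tag"]  # user confirms or overrides
    requests.post(f"{SERVER}/images/{image_id}/tag", json={"tag": tag})
    result = requests.get(f"{SERVER}/lesions/{tag}/progression").json()
    return result["fill_factors"]  # time series for the progression display
```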
Light polarization also makes it possible to distinguish between single backscattering from epithelial-cell nuclei and multiply scattered light. Polarized light imaging gives relevant information on the borders of skin lesions that are not visible to the naked eye. Many skin conditions typically originate in the superficial regions of the skin (epidermal basement membrane) where polarized light imaging is most effective [199]. A number of polarized light imaging systems have been used in clinical imaging [199-201]. However, widespread use of these systems has been limited by their complexity and cost. Portable Medical Imaging Platform 160 Portable Medical Imaging Platform 160 Left Hand Initial Screen Subsequent Usage: Auto-tagging and user confirmation Image Capture First Usage: Manual Tagging Select Region to Analyze Figure 4-19: User interface of the mobile application. (Contributed Nguyen). Display Progression by Michelle Chen and Qui Some of the commercially available Dermlite [163] systems are useful for eliminating glare and shadows from the field of view but do not provide information on the backscattered degree of polarization and superficial light scattering. More complex systems based on confocal microscopy [202] trade-off portability and cost for high resolution and depth information. 161 4.5 Multispectral Imaging: Future Work We envision a portable imaging module with multispectral polarized light for medical imaging that could serve as an optical front-end for the skin imaging and analysis system developed in this work. A conceptual diagram of the imaging module is shown in Figure 4-20. The imaging module could function as an attachment to a smartphone and Cross Polarization Multispectral Illumination Figure 4-20: A conceptual diagram of the portable imaging module for multispectral polarized light imaging augment the built-in camera by enabling image capture under different spectral wavelengths, ranging from infrared to ultraviolet, and light polarization. The multispectral illumination could be created using LEDs of varying wavelengths that are trigged sequentially and synchronized with the camera to capture a stack of images of the same lesion under different wavelength illumination. The synchronization could be achieved through a wired or wireless interface, such as USB or Bluetooth, with the smartphone. The images captured under multispectral illumination provide a way to optically dissect a skin lesion by analyzing the features visible under different wavelengths. For example, surface pigmentation using blue light, superficial vascularity under yellow light, and deeper pigmentation and vascularity with the deeper-penetrating red light [203]. Such a device could enable early detection of skin conditions, even before the lesions fully manifest on the skin surface, as well as more accurate diagnosis and treatment by providing dermatologists with far more details of the lesion morphology than are visible under white light illumination. 162 4.6 Portable Medical Imaging Platform Summary and Conclusions In this work, we developed and implemented a system for identifying skin lesions and determining the progression of the skin condition over time. The approach is applied to clinical images of skin lesions captured using a handheld digital camera during the course of treatment. 
4.6 Summary and Conclusions

In this work, we developed and implemented a system for identifying skin lesions and determining the progression of the skin condition over time. The approach is applied to clinical images of skin lesions captured using a handheld digital camera during the course of treatment. This work leverages computer vision techniques, such as SIFT feature matching and LSM image segmentation, and makes application-specific modifications, such as color/contrast enhancement, contour based feature detection and contour detection in the presence of intensity/color inhomogeneities. A system that integrates all of these aspects into a seamless flow is developed, enabling lesion detection and progression analysis of skin conditions based not only on standardized clinical imaging but also on images captured by patients at home, using smartphone or digital cameras, without any standardization. The algorithmic enhancements and optimizations with the narrowband implementations of level set segmentation and SIFT feature matching improve the software run-time performance by over 70% and the CPU energy consumption by 73%. These optimizations also reduce the estimated memory bandwidth requirement by 80% and the memory power consumption by 45%, paving the way for energy-efficient hardware implementations that could enable real-time processing on mobile platforms.

Based on the images of skin lesions obtained from the pilot study, in collaboration with the Brigham and Women's Hospital, the results indicate that the lesion segmentation and progression analysis approach is able to effectively handle images captured under varying lighting conditions, without the need for specialized imaging equipment. R, G, B histogram matching and expansion neutralizes the effect of lighting variations while also enhancing the contrast to make the skin lesions more prominent. LSM based segmentation accurately identifies the lesion contours despite intensity/color inhomogeneities in the image. Feature matching using SIFT effectively corrects for scaling, orientation and perspective mismatch in camera angles for a sequence of images captured over time, and aligns the lesions so that they can be compared to determine the progress over time. The fill factor provides objective quantification of the progression with 95% accuracy, representing a significant improvement over current subjective outcome metrics, such as the Physician's Global Assessment and VASI, which have an assessment variability of more than 25%.

Based on the analysis of existing assessment techniques and the contributions of this work, the following conclusions can be drawn:

1. The current assessment techniques for skin conditions are primarily based on subjective clinical assessment by physicians. The lack of quantification tools also has a significant impact on patient compliance. There is a significant need for quantitative dermatology approaches to aid doctors in determining important lesion features and accurately tracking progression over time, as well as to give patients confidence that a treatment is having the desired impact.

2. A diverse set of computer vision functionalities needs to be integrated to enable skin imaging and analysis without any standardization in image capture. Application requirements, such as image segmentation in the presence of intensity/color inhomogeneities and feature matching on relatively featureless skin surfaces, pose important challenges. This work leverages recent approaches in level set methods and feature matching, and enhances them for robustness with application-specific modifications.

3. The algorithms have high computational complexity and memory requirements.
Efficient software and hardware implementations require algorithmic optimizations that significantly reduce the processing complexity without sacrificing accuracy. The narrowband image segmentation and feature matching approaches proposed in this work achieve this objective. These algorithmic optimizations could enable efficient hardware implementations for real-time analysis on mobile devices.

4. It is important to have a simple tool with reproducible results. The proposed system is demonstrated to achieve these goals through a pilot study for vitiligo. This approach provides a significant tool for accurate and objective assessment of progress, with impact on patient compliance. The precise quantification of progression would enable physicians to perform an objective follow-up study and test the efficacy of therapeutic procedures for best outcomes.

5. Combining efficient mobile processing with portable optical front-ends that enable enhanced image acquisition, such as multispectral imaging, polarized lighting and macro/microscopic imaging, will be key to developing portable medical imaging systems. Such devices could be deployed widely at low cost for early detection and monitoring of diseases in rural areas and emerging countries.

Chapter 5

Conclusions and Future Directions

The energy cost of processor programmability is very high, due to the overhead associated with supporting a fine-grained instruction set compared to the actual cost of computation. As we go from CPUs and DSPs to FPGAs and ASICs, we progressively reduce this overhead and trade off programmability to gain energy-efficiency [204]. It is important to note, however, that the energy cost is ultimately determined by the desired operation and the underlying algorithms. An algorithm that requires high-precision floating point operations to maintain functionality and accuracy will not be able to achieve energy-efficiency comparable to one that can be implemented using small bit-width fixed point operations. The same is true of the performance enhancement that a parallel architecture can achieve, as described by Amdahl's Law: parallelizing a fraction p of a computation across n processors bounds the speedup to S = 1/((1 - p) + p/n), so the serial fraction ultimately limits the gain. The energy requirement of an algorithm with large data dependencies will be dominated by the cost of memory accesses; even a highly optimized hardware implementation will not significantly improve the energy-efficiency of such a system. The development of a system that maximizes energy-efficiency must therefore begin with the algorithms, often reframing the problem and optimizing processing without changing functionality or impacting accuracy, and co-designing algorithms and architectures.

5.1 Summary of Contributions

This thesis demonstrates the significance of the co-design approach for mobile platforms through energy-efficient system design for multiple application areas.

5.1.1 Video Coding

Reconfigurability is key to enabling a class of closely related functionalities efficiently in hardware. Algorithmic rearrangements and optimizations for transform matrix computations were key to developing a reconfigurable transform engine for multiple video coding standards. The optimizations maximized hardware sharing and minimized the amount of computation required to implement large transform matrices. The shared transform resulted in a 30% hardware saving compared to the total hardware requirement of individual H.264/AVC and VC-1 transform implementations.
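To make the hardware-sharing idea concrete, the following is a minimal NumPy sketch (an illustration, not the engine's RTL) of the even-odd factorization that both 8-point matrices of Appendix A admit: one shared butterfly stage feeds two 4x4 constant multipliers, and only the constants change between standards.

```python
import numpy as np

# 1-D 8-point integer transform matrices, as reproduced in Appendix A
# (H_8 for H.264/AVC, V_8 for VC-1).
H8_H264 = np.array([
    [ 8,   8,   8,   8,   8,   8,   8,   8],
    [12,  10,   6,   3,  -3,  -6, -10, -12],
    [ 8,   4,  -4,  -8,  -8,  -4,   4,   8],
    [10,  -3, -12,  -6,   6,  12,   3, -10],
    [ 8,  -8,  -8,   8,   8,  -8,  -8,   8],
    [ 6, -12,   3,  10, -10,  -3,  12,  -6],
    [ 4,  -8,   8,  -4,  -4,   8,  -8,   4],
    [ 3,  -6,  10, -12,  12, -10,   6,  -3]], dtype=np.int64)

V8_VC1 = np.array([
    [12,  12,  12,  12,  12,  12,  12,  12],
    [16,  15,   9,   4,  -4,  -9, -15, -16],
    [16,   6,  -6, -16, -16,  -6,   6,  16],
    [15,  -4, -16,  -9,   9,  16,   4, -15],
    [12, -12, -12,  12,  12, -12, -12,  12],
    [ 9, -16,   4,  15, -15,  -4,  16,  -9],
    [ 6, -16,  16,  -6,  -6,  16, -16,   6],
    [ 4,  -9,  15, -16,  16, -15,   9,  -4]], dtype=np.int64)

def factored_1d_transform(x, T):
    """Every even row of both matrices is symmetric and every odd row is
    antisymmetric, so y = T @ x factors into one shared butterfly stage
    (4 adds, 4 subtracts) followed by two 4x4 constant multiplies."""
    s = x[:4] + x[:3:-1]          # butterflies: x[i] + x[7-i]
    d = x[:4] - x[:3:-1]          # butterflies: x[i] - x[7-i]
    y = np.empty(8, dtype=np.int64)
    y[0::2] = T[0::2, :4] @ s     # even-indexed outputs from the sums
    y[1::2] = T[1::2, :4] @ d     # odd-indexed outputs from the differences
    return y

x = np.arange(1, 9, dtype=np.int64)
for T in (H8_H264, V8_VC1):
    assert np.array_equal(factored_1d_transform(x, T), T @ x)
```

In a shared datapath, the butterfly stage is common to both standards and the 4x4 constant multipliers are reconfigured per standard, which is the kind of sharing the factorization enables.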
Algorithmic modifications for data dependent processing, to optimize pipeline bit widths and reduce the switching activity of the system, reduced the power consumption by 15%. Moving away from conventional 2D transform architectures, an approach to eliminate an explicit transpose memory was demonstrated, by reusing the output buffer to store intermediate data and separately designing the row-wise and column-wise 1D transforms. It reduced the area by 23% and the power by 26% compared to the implementation using a transpose memory. Low-voltage circuit design using statistical performance analysis ensured reliable operation down to 0.35 V. The transform engine was demonstrated to support video encoding/decoding in both the H.264 and VC-1 standards at Quad Full-HD (3840 x 2160) resolution and 30 fps, while operating at 25 MHz and 0.52 V and consuming 214 µW of power. This provided 250x higher power efficiency at the same throughput than previous state-of-the-art ASIC implementations. The design provided efficient performance scalability: 1080p (1920 x 1080) at 30 fps, while operating at 6.3 MHz and 0.41 V with 79 µW of power consumption, and 720p (1280 x 720) at 30 fps, while operating at 2.8 MHz and 0.35 V with 43 µW of power consumption.

The ideas of matrix factorization for hardware sharing, transpose memory elimination and data dependent processing have general applicability. As bigger block sizes such as 32x32 and 64x64 are explored in new video coding standards like HEVC, these ideas could lead to even higher savings in the area and power requirements of the transform engine, allowing efficient implementation in multi-standard multimedia devices.

5.1.2 Computational Photography

The importance of reframing algorithms for efficient hardware implementation is clearly demonstrated by the optimizations, leveraging the 3D bilateral grid, that led to significant reductions in computational complexity, memory size and bandwidth, while preserving the output quality. The bilateral grid implementation enhanced processing locality by reducing the data dependencies from multiple image rows to a few grid blocks in the neighborhood, and enabled highly parallel processing. Architectural optimizations exploiting parallelism, with two bilateral filter engines operating in parallel and each supporting 16x parallel processing, enabled high-throughput real-time performance while operating at less than 100 MHz. Combining algorithmic optimizations, parallelism and processing data locality with careful memory management reduced the external memory bandwidth by 97%, from 5.6 GB/s to 165.9 MB/s, and the DDR2 memory power consumption by 74%, from 380 mW to 99 mW. Through algorithm/architecture co-design, an approach for low-light enhancement and flash shadow correction was developed that enables efficient implementation using the bilateral grid architecture. Circuit design for low-voltage operation and multiple voltage domains enabled the processor to achieve a wide operating range, from 25 MHz at 0.5 V with 2.3 mW power consumption to 98 MHz at 0.9 V with 17.8 mW power consumption. Co-designing algorithms, architectures and circuits enabled the processor to achieve 280x higher energy-efficiency compared to software implementations with identical functionality on state-of-the-art mobile processors.
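To make the locality argument concrete, here is a minimal sketch of the bilateral grid's splat/blur/slice flow [116], assuming NumPy and SciPy and a single-channel float image in [0, 1]. It uses nearest-neighbor slicing for brevity where a full implementation interpolates trilinearly, and it illustrates the data structure, not the processor's pipeline.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def bilateral_grid_filter(img, sigma_s=16, sigma_r=0.1):
    """Bilateral filtering via the 3D bilateral grid: splat the image into a
    coarse (y, x, intensity) grid, blur the grid, then slice it back at full
    resolution. Each output pixel depends only on a small neighborhood of
    grid cells, which is the processing locality exploited in hardware."""
    h, w = img.shape
    gh, gw, gd = h // sigma_s + 2, w // sigma_s + 2, int(1 / sigma_r) + 2
    grid = np.zeros((gh, gw, gd))     # accumulated intensities
    weight = np.zeros((gh, gw, gd))   # accumulated counts (homogeneous coord.)

    ys, xs = np.mgrid[0:h, 0:w]
    gy, gx = ys // sigma_s, xs // sigma_s
    gz = (img / sigma_r).astype(int)
    np.add.at(grid, (gy, gx, gz), img)     # splat
    np.add.at(weight, (gy, gx, gz), 1.0)

    grid = gaussian_filter(grid, sigma=1)   # blur in all three dimensions
    weight = gaussian_filter(weight, sigma=1)

    # Slice (nearest-neighbor for brevity) and normalize by the weights.
    return grid[gy, gx, gz] / np.maximum(weight[gy, gx, gz], 1e-8)
```

The grid is tiny compared to the image (roughly h/sigma_s by w/sigma_s by 1/sigma_r cells), which is why the grid-based reformulation cuts memory size and bandwidth so sharply.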
A scalable architecture, with clock and power gating, enables users to perform energy/resolution-scalable processing and was demonstrated to achieve energy scalability from 0.19 mJ/megapixel to 1.37 mJ/megapixel for different grid configurations at 0.9 V, trading off output quality for energy.

5.1.3 Medical Imaging

The current assessment techniques for skin conditions are primarily based on subjective clinical assessment by physicians. The algorithmic enhancements that extended computer vision techniques, from image segmentation in the presence of inhomogeneities to feature matching on relatively featureless surfaces, were key to developing a system for objective quantification of skin condition progression. The system achieved robust performance in clinical validation with 95% accuracy, representing a significant improvement over the current subjective outcome metrics, such as the Physician's Global Assessment and VASI, that have assessment variability of more than 25%. Algorithmic optimizations with the narrowband implementations of level set segmentation and SIFT feature matching improved the software run-time performance and CPU energy consumption by over 70%. These optimizations also reduced the estimated memory bandwidth requirement by 80% and memory power consumption by 45%, paving the way for energy-efficient hardware implementations that could enable real-time processing on mobile platforms.

5.2 Conclusions

This thesis focuses on addressing the challenges of implementing high-complexity applications with high-performance requirements on mobile platforms through a comprehensive view of system design, where algorithms are designed and optimized to enhance processing locality and enable highly parallel architectures that can be implemented using low-power, low-voltage circuits to achieve maximally energy-efficient systems. The investigation of multiple application areas in this thesis leads to the following conclusions.

1. Application Specific Processing: With the performance-per-watt gains due to technology scaling saturating, and the tight energy constraints of mobile platforms, energy-efficiency is the key bottleneck in scaling performance. Application specific hardware units that trade off programmability for high energy-efficiency are becoming an increasingly important part of processor architectures. Hardware-optimized algorithm design is crucial to maximizing performance and efficiency gains.

2. Reconfigurable Architectures: A hardware implementation with highly optimized processing units supporting the core functionalities of a class of applications (for example, computational photography or video coding), along with the ability to activate these processing units and configure the datapaths based on application requirements, provides a very attractive alternative to individual hardware implementations for each algorithm or application: it maintains high energy-efficiency while supporting the whole class.

3. Scalable Architectures: Scalable architectures, with efficient clock and power gating, enable energy vs. performance/quality trade-offs that are extremely desirable for mobile processing. Energy-scalable processing allows the user to determine the energy usage for a task, based on the battery state or the intended usage of the output.

4. Data Dependent Processing: Data dependent processing can be a powerful tool in reducing system power consumption.
Applications such as multimedia processing have high data dependency: the intensities of pixels in an image, pixel blocks in consecutive frames of a video sequence, or utterances in a speech sequence are highly correlated. By exploiting the characteristics of the data being processed, architectures can be designed to minimize switching activity, optimize pipeline bit widths and perform a variable number of operations per block [67], as illustrated by the sketch after this list. The reduction in the number of computations and in switching activity has a direct impact on the system power consumption.

5. Low-Voltage Circuit Design: Low-voltage circuit operation is important to enable voltage/frequency scaling and attain minimum-energy operation for the desired performance. Variations play a key role in determining circuit performance at low voltage. The non-linear impact of local variations on performance must be taken into account to ensure a robust design at low voltage.

6. Memory Bandwidth and Power: External memory bandwidth and power consumption are a key bottleneck in achieving maximally efficient systems for data-intensive applications. If the power consumption of the external memory and of the interface between the memory and the processor is the dominant source of system power consumption, optimizing the processor alone adds very little to the system efficiency. New technology solutions such as embedded DRAM [205, 206], which enables DRAM integration onto the processor die, can play a crucial role in maximizing system energy-efficiency by minimizing the cost of memory accesses while enabling significantly higher bandwidths.
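As a concrete illustration of the zero-skipping flavor of data dependent processing in the spirit of [67], the sketch below (a minimal NumPy model, not the hardware) inverse-transforms 4x4 coefficient blocks with the H.264/AVC matrix of Appendix A and skips all arithmetic for all-zero blocks.

```python
import numpy as np

# 1-D 4x4 H.264/AVC integer transform matrix (Equation A.6).
H4 = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]], dtype=np.int64)

def inverse_transform_with_zero_skip(coeff_blocks):
    """Inverse-transform a batch of 4x4 coefficient blocks, skipping the
    matrix arithmetic entirely for all-zero blocks. In quantized video data
    many high-frequency blocks are all zero, so the skipped operations (and,
    in hardware, the switching activity they would cause) dominate the
    savings."""
    out = []
    for blk in coeff_blocks:
        if not blk.any():                 # data-dependent early exit
            out.append(np.zeros_like(blk))
            continue
        out.append(H4 @ blk @ H4.T)       # full inverse transform (Eq. A.5)
    return out

blocks = [np.zeros((4, 4), dtype=np.int64), np.eye(4, dtype=np.int64)]
results = inverse_transform_with_zero_skip(blocks)
```

In hardware, the same test gates the datapath clocks rather than branching, so skipped blocks also avoid the associated switching activity.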
5.3 Future Directions

5.3.1 Computational Photography and Computer Vision

With recent advances in photography incorporating computer vision and computational photography techniques, we have just begun to scratch the surface of what the cameras of the future could achieve. For example, embedded computer vision aspires to enable an ever-expanding range of applications such as image and video search, scene reconstruction, 3D scanning and modeling. Enabling such applications requires a processor capable of sustained computational loads and memory bandwidths, while operating within the tight constraints of low-power mobile platforms. Chapter 3 presents the algorithm/architecture/circuit co-design approach as it relates to a set of computational photography applications. Such a comprehensive system design approach will be essential to enable computational photography for embedded vision and video processing on mobile devices. This opens up a new dimension in video processing, with possibilities such as light-field video, where the video could be manipulated in real time during playback, refocusing frames and changing viewpoints. New research in image sensors [207, 208], along with multi-sensor arrays, could be coupled with energy-efficient processing to realize exciting new possibilities for future generation cameras and smartphones, in applications such as 3D image and video capture, depth sensing, multi-view video and gesture control. Combining the ability to interpret very complex real 3D environments using computational photography with object and feature recognition techniques from computer vision, and with natural human interfaces such as gesture and speech recognition, is key to making a truly immersive environment, like the Holodeck, a reality [209]. The performance and energy constraints of such a system would necessitate novel architectural and circuit design innovations.

Many of the underlying algorithms in computational photography and computer vision are still in a nascent stage, which requires reconfigurability and programmability in the hardware implementations. For example, an efficient processor for OpenCV [37], the library of programming functions for computer vision, could dramatically transform the way computer vision applications are implemented. The challenge for such processors would lie in implementing computationally complex and memory-intensive hardware primitives while ensuring the flexibility for new software innovations to be realized.

5.3.2 Portable Medical Imaging

The proliferation of connected portable devices and cloud computing provides a unique opportunity to revolutionize the delivery of affordable primary health care. A secure and portable medical imaging platform is a key milestone in making this goal a reality. Computational imaging is becoming an integral part of portable devices such as smartphones. Extending this functionality to medical imaging applications will enable portable non-invasive medical monitoring. A cloud-based service can then allow the patient and the doctor to share this medical database and perform image analysis to help with diagnosis and to monitor progress. Strong security guarantees are essential to ensure that patient-doctor confidentiality is respected by such services. Cryptographic primitives like homomorphic encryption [210] provide potential ways to enable secure processing in the encrypted domain, which would ensure user privacy and protect patient data. Figure 5-1 shows a conceptual representation of such a cloud-based processing platform.

Figure 5-1: Secure cloud-based medical imaging platform: the patient captures and encrypts clinical images, a secure database and encrypted-domain processing extract lesion features, and the doctor decrypts and views the results for diagnosis and treatment.

One of the major challenges in using this approach is the extremely high computational complexity and memory requirement of processing in the encrypted domain. This makes software-based processing extremely inefficient and real-time operation impractical. Optimized encryption algorithms with efficient hardware implementations would be essential to make secure real-time processing a reality.
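As a toy illustration of the encrypted-domain idea, the sketch below implements a Paillier-style additively homomorphic scheme in the spirit of [210], with deliberately tiny, insecure parameters chosen for demonstration only; real deployments need vetted libraries and moduli of roughly 2048 bits, and this is not the platform's design. (Requires Python 3.9+ for math.lcm and modular inverse via pow.)

```python
import math, random

# Toy parameters: two small primes; n is the public modulus.
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)            # Carmichael function of n
mu = pow(lam, -1, n)                    # valid because the generator g = n + 1

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:          # randomness must be a unit mod n
        r = random.randrange(1, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    l = (pow(c, lam, n2) - 1) // n      # Paillier's "L" function
    return (l * mu) % n

# The cloud can combine two encrypted patient measurements by multiplying
# ciphertexts, without ever seeing the underlying values:
assert decrypt((encrypt(17) * encrypt(25)) % n2) == 17 + 25
```

Even this toy scheme shows why encrypted-domain processing is so costly: a single homomorphic addition requires large modular exponentiations, which is exactly the overhead that optimized hardware would have to absorb.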
The work presented in Chapter 4 provides a foundation for developing efficient hardware implementations to integrate medical imaging into mobile devices. This would enable real-time processing of hundreds of images, captured over time, to provide doctors and patients immediate feedback that could be used to determine the future course of treatment. The enormous performance and energy advantages that efficient hardware implementations provide could be used to transform medical imaging applications, such as Optical Coherence Tomography (OCT), Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) scan reconstruction, and shift the analysis from bulky GPU clusters to portable devices. Such systems could significantly enhance medical imaging and finally bring the Tricorder from the realms of science fiction to reality! The intersection of cutting-edge algorithms, massively-parallel architectures with specialized reconfigurable accelerators and ultra-low power circuits is ripe for exploration.

The future of technology innovation will be defined by societal imperatives such as affordable healthcare, energy-efficiency and security, and the biggest challenge of this era will be to revolutionize these fields just as the era of CMOS scaling revolutionized computing, communication and consumer entertainment. In just a decade, the relationship among our daily activities, our data, and the mediums of content creation and consumption will be radically different. This thesis attempts to define the challenges and propose system design solutions to help build the technologies that will define this relationship.

Appendix A

Integer Transform

The most commonly used transform in video and image coding applications is the Discrete Cosine Transform (DCT). The DCT has an excellent energy compaction property, which leads to good compression efficiency. However, the irrational numbers in the transform matrix make its exact implementation impossible, leading to a drift between forward and inverse transform coefficients. The H.264/AVC and VC-1 video coding standards use a variation of the DCT, known as the integer transform, in which the transform matrices contain only integers. This makes an exact inverse possible using integer arithmetic. The following sections give the definitions of the integer transforms for the H.264/AVC and VC-1 video coding standards.

A.1 H.264/AVC Integer Transform

The separable 2-D 8x8 forward transform for H.264/AVC can be written as:

\[ F_8 = H_8^T \cdot X_{8\times8} \cdot H_8 \quad (A.1) \]

and the separable 2-D 8x8 inverse transform can be written as:

\[ I_8 = H_8 \cdot Y_{8\times8} \cdot H_8^T \quad (A.2) \]

where the 1-D 8x8 integer transform matrix for H.264/AVC is defined as:

\[ H_8 = \begin{bmatrix}
8 & 8 & 8 & 8 & 8 & 8 & 8 & 8 \\
12 & 10 & 6 & 3 & -3 & -6 & -10 & -12 \\
8 & 4 & -4 & -8 & -8 & -4 & 4 & 8 \\
10 & -3 & -12 & -6 & 6 & 12 & 3 & -10 \\
8 & -8 & -8 & 8 & 8 & -8 & -8 & 8 \\
6 & -12 & 3 & 10 & -10 & -3 & 12 & -6 \\
4 & -8 & 8 & -4 & -4 & 8 & -8 & 4 \\
3 & -6 & 10 & -12 & 12 & -10 & 6 & -3
\end{bmatrix} \quad (A.3) \]

Similarly, the separable 2-D 4x4 forward transform for H.264/AVC can be written as:

\[ F_4 = H_4^T \cdot X_{4\times4} \cdot H_4 \quad (A.4) \]

and the separable 2-D 4x4 inverse transform can be written as:

\[ I_4 = H_4 \cdot Y_{4\times4} \cdot H_4^T \quad (A.5) \]

where the 1-D 4x4 transform matrix for H.264/AVC is defined as:

\[ H_4 = \begin{bmatrix}
1 & 1 & 1 & 1 \\
2 & 1 & -1 & -2 \\
1 & -1 & -1 & 1 \\
1 & -2 & 2 & -1
\end{bmatrix} \quad (A.6) \]

A.2 VC-1 Integer Transform

VC-1 uses 8x8, 8x4, 4x8 and 4x4 transforms. The 2-D separable m x n forward integer transform for VC-1, where m = 8, 4 and n = 8, 4, is given as:

\[ F_{m\times n} = (V_m \cdot X_{m\times n} \cdot V_n^T) \circ N_{m\times n} \quad (A.7) \]

where \circ denotes element-wise scaling by the normalization matrix N_{m x n}. The m x n inverse integer transform for VC-1 is given as:

\[ I_{m\times n} = \frac{V_m^T \cdot Y_{m\times n} \cdot V_n}{1024} \quad (A.8) \]

The denominator is chosen to be the power of 2 closest to the squared norm of the basis functions (288, 289 and 292) of the 1-D transformation.
In order to preserve one extra bit of precision, the 1-D transform operations are performed as:

\[ D_{m\times n} = \frac{Y_{m\times n} \cdot V_n}{16} \quad \text{and} \quad I_{m\times n} = \frac{V_m^T \cdot D_{m\times n}}{64} \quad (A.9) \]

The 1-D 8x8 transform matrix is defined as:

\[ V_8 = \begin{bmatrix}
12 & 12 & 12 & 12 & 12 & 12 & 12 & 12 \\
16 & 15 & 9 & 4 & -4 & -9 & -15 & -16 \\
16 & 6 & -6 & -16 & -16 & -6 & 6 & 16 \\
15 & -4 & -16 & -9 & 9 & 16 & 4 & -15 \\
12 & -12 & -12 & 12 & 12 & -12 & -12 & 12 \\
9 & -16 & 4 & 15 & -15 & -4 & 16 & -9 \\
6 & -16 & 16 & -6 & -6 & 16 & -16 & 6 \\
4 & -9 & 15 & -16 & 16 & -15 & 9 & -4
\end{bmatrix} \quad (A.10) \]

and the 1-D 4x4 transform matrix is defined as:

\[ V_4 = \begin{bmatrix}
17 & 17 & 17 & 17 \\
22 & 10 & -10 & -22 \\
17 & -17 & -17 & 17 \\
10 & -22 & 22 & -10
\end{bmatrix} \quad (A.11) \]
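A minimal NumPy check of the exact-inverse property claimed at the start of this appendix, using the H.264/AVC 4x4 transform of Equations A.4-A.6 (an illustration, not part of the thesis flow):

```python
import numpy as np

# 1-D 4x4 H.264/AVC integer transform matrix (Equation A.6).
H4 = np.array([[1,  1,  1,  1],
               [2,  1, -1, -2],
               [1, -1, -1,  1],
               [1, -2,  2, -1]], dtype=np.int64)

X = np.random.randint(-128, 128, size=(4, 4)).astype(np.int64)

F = H4.T @ X @ H4          # forward transform (Equation A.4)
R = H4 @ F @ H4.T          # inverse transform applied to F (Equation A.5)

# Because H4 @ H4.T = diag(4, 10, 4, 10), the round trip returns X scaled
# elementwise by known integer factors, which the codec folds into the
# quantization/rescaling step; no irrational arithmetic or drift occurs.
d = np.diag(H4 @ H4.T)     # [4, 10, 4, 10]
assert np.array_equal(R, X * np.outer(d, d))
```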
Appendix B

Clinical Pilot Study for Vitiligo Progression Analysis

B.1 Subjects for Pilot Study

Institutional Review Board approval was obtained for data analysis (MIT Protocol Number: 1301005500) as well as for the clinical pilot study in collaboration with the Brigham and Women's Hospital (BWH Protocol Number: 2012-P-002185/1). Ten subjects ages 18 years and older with a dermatologist diagnosis of vitiligo were recruited by Dr. Vaneeta Sheth. Subjects had a variety of skin phototypes and disease characteristics, as outlined in Table B.1. As this was a pilot study, no standardized intervention was performed. Rather, subjects were treated with standard therapies used for vitiligo based on clinical characteristics and patient preference.

Table B.1: Demographics of the subjects for clinical study.

Subject | Age (Years) | Gender | Ethnicity | Vitiligo Phenotype | Treatment Modalities
1 | 21 | F | Hispanic | Acrofacial | None
2 | 59 | M | African-American | Non-segmental vitiligo | NBUVB*, oral corticosteroids
3 | 57 | M | Caucasian, Native American | Non-segmental vitiligo | NBUVB
4 | 29 | M | Caucasian | Mucosal/genital | Topical calcineurin inhibitor, NBUVB
5 | 43 | M | Caucasian | Non-segmental/common vitiligo | NBUVB
6 | 27 | F | South Asian | Segmental | NBUVB
7 | 46 | M | Greek | Acrofacial | Topical corticosteroids
8 | 43 | M | Caucasian | Non-segmental/common vitiligo | NBUVB, topical corticosteroids
9 | 35 | F | South Asian | Acrofacial | NBUVB
10 | 43 | F | African-American | Segmental | NBUVB, topical immunomodulators, topical bimatoprost

*NBUVB: Narrow-band Ultraviolet B

B.2 Progression Analysis

The proposed approach is used to analyze 174 images corresponding to 50 skin lesions from ten subjects to determine the progression over time. Figure B-1 shows the progression of five lesions through 20 images captured during treatment. A detailed analysis of the progression for all ten patients in the clinical study is presented in Table B.2.

Figure B-1: Progression of skin lesions over time (fill factors for the five lesions range from -9% to 59% across the image sequences). Lesion contours are identified from the color corrected images and the lesions are aligned using SIFT feature matching to determine the fill factor.

Table B.2: Progression of skin lesions during treatment (fill factor, %); "-" marks dates without measurements.

Subject 1: Site | Dec'12 | Jan'13 | Jun'13
Left Hand | 0 | -2 | -19
Right Hand | 0 | 1 | -9

Subject 2: Site | Nov'12 | Dec'12 | Jan'13 | Feb'13 | Mar'13
Chest | 0 | 8 | 24 | 63 | 78
Left Elbow | 0 | 2 | 9 | 17 | 31
Right Elbow | 0 | 4 | 17 | 26 | 59

Subject 3: Site | Nov'12 | Dec'12 | Jan'13 | Feb'13
Left Popliteal Fossa | 0 | 5 | 9 | 10
Left Wrist | 0 | 4 | 10 | 16
Right Popliteal Fossa | 0 | 3 | 5 | 11
Right Antecubital Fossa | 0 | 6 | 16 | 22
Right Forearm | 0 | 11 | 25 | 36
Right Wrist | 0 | 9 | 25 | 28

Subject 4: Site | Dec'12 | Mar'13 | May'13 | Jun'13 | Jul'13 | Oct'13
Left Foot | 0 | -3 | 2 | 5 | 13 | 17
Left Hand | 0 | 1 | 5 | 14 | 21 | 22
Left Knee | 0 | 7 | 1 | 14 | 17 | 26
Right Hand | 0 | 2 | 3 | 24 | 33 | 39
Right Foot | 0 | 2 | 4 | -5 | 6 | 18
Right Knee | 0 | 0 | 3 | 7 | 19 | 52

Subject 5: Site | Jan'13 | Feb'13 | Mar'13 | Apr'13
Genital | 0 | 2 | 5 | -
Left Eye | 0 | 3 | 19 | 28
Left Neck | 0 | 3 | 4 | 6
Left Preauricular | 0 | 6 | 53 | -

Subject 6: Site | May'13 | Jul'13 | Sep'13
Left Forehead | 0 | 2 | -2
Left Hand | 0 | 1 | 3
Right Forehead | 0 | 3 | 7
Right Hand | 0 | -2 | 1

Subject 7: Site | Jun'13 | Jul'13 | Aug'13 | Oct'13 | Nov'13
Forehead | 0 | 2 | - | 3 | 16
Left Temple | 0 | 46 | - | -83 | -91
Right Temple | 0 | -16 | -11 | 84 | 95

Subject 8: Site | Jun'13 | Jul'13 | Sep'13
Left Cutaneous Lower Lip | 0 | 4 | 7
Right Oral Commissure | 0 | -4 | -8
R. Cutaneous Upper Lip | 0 | 2 | 21
Right Preauricular | 0 | 8 | 86

Subject 10: Site | Nov'12 | Mar'13 | Jul'13 | Sep'13
Right Cheek | 0 | 27 | 52 | 57

Acronyms

ASIC Application Specific Integrated Circuit
BW Bandwidth
CC Camera Curves
CMOS Complementary Metal Oxide Semiconductor
Conv Convolution
CPU Central Processing Unit
CT Computed Tomography
DCT Discrete Cosine Transform
DRAM Dynamic Random Access Memory
DSP Digital Signal Processor
DVFS Dynamic Voltage-Frequency Scaling
FIFO First In First Out
FPGA Field Programmable Gate Array
fps frames per second
GA Grid Assignment
GPGPU General Purpose Graphics Processing Unit
GPU Graphics Processing Unit
HD High Definition
HDR High Dynamic Range
HEVC High-Efficiency Video Coding
HoG Histogram of Oriented Gradients
IC Integrated Circuit
LDR Low Dynamic Range
LED Light Emitting Diode
LSB Least Significant Bit
LSF Level Set Function
LSM Level Set Method
LUT Look-Up Table
MBPS Megabytes Per Second
MRI Magnetic Resonance Imaging
MSB Most Significant Bit
NBUVB Narrow-band Ultraviolet B
OCT Optical Coherence Tomography
OPA Operating Point Analysis
OPS Operations Per Second
PC Personal Computer
PCB Printed Circuit Board
PDF Probability Density Function
PGA Physician's Global Assessment
QFHD Quad Full-HD
RANSAC Random Sample Consensus
RDF Random Dopant Fluctuations
SIFT Scale Invariant Feature Transform
SRAM Static Random Access Memory
SSTA Statistical Static Timing Analysis
STA Static Timing Analysis
SVD Singular Value Decomposition
SVM Support Vector Machine
VASI Vitiligo Area and Severity Index

Bibliography

[1] C. Babbage, "On the mathematical powers of the calculating engine," original manuscript, Museum of the History of Science, Oxford, December 1837.
[2] B. Collier, "The little engines that could've: The calculating machines of Charles Babbage," Doctoral dissertation, Harvard University, August 1970.
[3] G. E. Moore, "Cramming more components onto integrated circuits," Electronics, pp. 114-117, April 1965.
[4] R. H. Dennard, F. Gaensslen, H. Yu, L. Rideout, E. Bassous, and A.
LeBlanc, "Design of ion-implanted MOSFET's with very small physical dimensions," IEEE Journal of Solid-State Circuits, vol. SC-9, pp. 256-268, October 1974.
[5] M. Weiser, "The computer for the 21st century," Scientific American, vol. 265, pp. 94-104, September 1991.
[6] R. Want, W. Schilit, N. Adams, R. Gold, K. Petersen, D. Goldberg, J. Ellis, and M. Weiser, "An overview of the ParcTab ubiquitous computing experiment," IEEE Personal Communications, vol. 2, pp. 28-43, December 1995.
[7] R. Broderson, "Infopad - an experiment in system level design and integration," Design Automation Conference, pp. 313-314, 1997.
[8] A. Chandrakasan, A. Burstein, and R. W. Brodersen, "A low power chipset for portable multimedia applications," International Solid-State Circuits Conference, pp. 82-83, 1994.
[9] J. C. Maxwell, "Experiments on color, as perceived by the eye, with remarks on color-blindness," Transactions of the Royal Society of Edinburgh, vol. 21, no. 2, pp. 275-298, 1855.
[10] J. A. Paradiso and T. Starner, "Energy scavenging for mobile and wireless electronics," IEEE Pervasive Computing, vol. 4, pp. 18-27, January 2005.
[11] Y. Miyabe, "Smart life solutions: from home to city," International Solid-State Circuits Conference, pp. 12-17, 2013.
[12] R. H. Dennard, J. Cai, and A. Kumar, "A perspective on today's scaling challenges and possible future directions," Solid-State Electronics, vol. 51, pp. 518-525, April 2007.
[13] M. Horowitz, "Computing's energy problem (and what we can do about it)," International Solid-State Circuits Conference, pp. 10-14, 2014.
[14] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-power CMOS digital design," IEEE Journal of Solid-State Circuits, vol. 27, pp. 473-484, April 1992.
[15] B. Davari, R. H. Dennard, and G. G. Shahidi, "CMOS scaling for high performance and low-power-the next ten years," Proceedings of the IEEE, vol. 83, pp. 595-606, April 1995.
[16] M. Horowitz, E. Alon, D. Patil, S. Naffziger, R. Kumar, and K. Bernstein, "Scaling, power, and the future of CMOS," IEEE International Electron Devices Meeting, pp. 7-15, 2005.
[17] K. Itoh, "Adaptive circuits for the 0.5-V nanoscale CMOS era," IEEE International Solid-State Circuits Conference, pp. 14-20, 2009.
[18] G. M. Amdahl, "Validity of the single processor approach to achieving large-scale computing capabilities," AFIPS Spring Joint Computer Conference, pp. 483-485, 1967.
[19] W. Dally, "The path to high-efficiency computing," Computational Sciences and Engineering Conference, 2013. [online] http://computing.ornl.gov/workshops/SMC13/presentations/3-SMC_0913_Dally.pdf.
[20] M. Yuffe, E. Knoll, M. Mehalel, J. Shor, and T. Kurts, "A fully integrated multi-CPU, GPU and memory controller 32nm processor," International Solid-State Circuits Conference, pp. 264-265, 2011.
[21] S. Damaraju, V. George, S. Jahagirdar, T. Khondker, R. Milstrey, S. Sarkar, S. Siers, I. Stolero, and A. Subbiah, "A 22nm IA multi-CPU and GPU system-on-chip," International Solid-State Circuits Conference, pp. 56-57, 2012.
[22] P. Ou, J. Zhang, H. Quan, Y. Li, M. He, Z. Yu, X. Yu, S. Cui, J. Feng, S. Zhu, J. Lin, M. Jing, X. Zeng, and Z. Yu, "A 65nm 39GOPS/W 24-core processor with 11Tb/s/W packet-controlled circuit-switched double-layer network-on-chip and heterogeneous execution array," International Solid-State Circuits Conference, pp. 56-57, 2013.
[23] G. Gammie, N. Ickes, M. E. Sinangil, R. Rithe, J. Gu, A. Wang, H. Mair, S. Datla, B. Rong, S. Honnavara-Prasad, L. Ho, G. Baldwin, D. Buss, A. P. Chandrakasan, and U.
Ko, "A 28nm 0.6V low-power DSP for mobile applications," InternationalSolid-State Circuits Conference, pp. 132-133, 2011. [24] "Dragonboard Snapdragon S4 plus APQ8060A mobile development board," [online] https: //developer .qualcomm.com/mobile-development/development-devices/ dragonboard. [25] "Samsung Exynos 5 dual Arndale board," [online] wiki/index .php/Main.Page. http: //www. arndaleboard. org/ [26] Y. Park, C. Yu, K. Lee, H. Kim, Y. Park, C. Kim, Y. Choi, J. Oh, C. Oh, G. Moon, S. Kim, H. Jang, J. A. Lee, C. Kim, and S. Park, "72.5GFlops 240Mpixel/s 1080p 60fps multi-format video codec application processor enabled with GPGPU for fused multimedia application," InternationalSolid-State Circuits Conference, pp. 160-161, 2013. [27] J. Park, I. Hong, G. Kim, Y. Kim, K. Lee, S. Park, K. Bong, and H. J. Yoo, "A 646GOPS/W multi-classifier many-core processor with cortex-like architecture for superresolution recognition," InternationalSolid-State Circuits Conference, pp. 168-169, 2013. BIBLIOGRAPHY 191 [28] D. Markovic, R. W. Brodersen, and B. Nikolic, "A 70GOPS, 34mW multi-carrier MIMO chip in 3.5mm 2 ," IEEE Symposium on VLSI Circuits, pp. 158-159, 2006. [29] C. T. Huang, M. Tikekar, C. Juvekar, V. Sze, and A. Chandrakasan, "A 249Mpixel/s HEVC video-decoder chip for quad full HD applications," International Solid-State Circuits Conference, pp. 162-163, 2013. [30] M. Mehendale, S. Das, M. Sharma, M. Mody, R. Reddy, J. Meehan, H. Tamama, B. Carlson, and M. Polley, "A true multistandard, programmable, low-power, full HD video-codec engine for smartphone SoC," International Solid-State Circuits Conference, pp. 226-227, 2012. [31] V. Aurich and J. Weule, "Non-linear gaussian filters performing edge preserving diffusion," Springer Berlin Heidelberg, pp. 538-545, 1995. [32] P. J. Burt, "Fast algorithms for estimating local image properties," Computer Vision, Graphics, and Image Processing,vol. 21, pp. 368-382, March 1983. [33] P. J. Burt and E. H. Adelson, "The laplacian pyramid as a compact image code," IEEE Transactionson Communication, vol. 31, pp. 532-540, April 1983. [34] D. Lowe, "Distinctive image features from scale-invariant keypoints," InternationalJournal of Computer Vision, vol. 60, pp. 91-110, February 2004. [35] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," Computer Vision and Pattern Recognition Conference, pp. 886-893, 2005. [36] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," Computer Vision and Pattern Recognition Conference, pp. 511-518, 2001. [37] "OpenCV: Open source computer vision," [online] http: //opencv. org/. [38] A. P. Chandrakasan, D. C. Daly, D. F. Finchelstein, J. Kwong, Y. K. Ramadass, M. E. Sinangil, V. Sze, and N. Verma, "Technologies for ultradynamic voltage scaling," Proceedings of the IEEE, vol. 98, pp. 191-214, February 2010. [39] B. Calhoun, A. Wang, and A. Chandrakasan, "Modeling and sizing for minimum energy operation in subthreshold circuits," IEEE Journal of Solid-State Circuits, vol. 40, pp. 1778-1786, September 2005. [40] A. Asenov, "Random dopant induced threshold voltage lowering and fluctuations in sub0.1 pm MOSFET's: A 3-D "atomistic" simulation study," IEEE Transactions on Electron Devices, vol. 45, pp. 2505-2513, December 1998. [41] P. Andrei and I. Mayergoyz, "Random doping-induced fluctuations of subthreshold characteristics in MOSFET devices," Solid-State Electronics, vol. 47, pp. 2055-2061, November 2003. [42] B. Zhai, S. Hanson, D. Blaauw, and D. 
Sylvester, "Analysis and mitigation of variability in subthreshold design," International Symposium on Low Power Electronics and Design, pp. 20-25, 2005.
[43] M. J. M. Pelgrom, A. C. J. Duinmaijer, and A. P. G. Welbers, "Matching properties of MOS transistors," IEEE Journal of Solid-State Circuits, vol. 24, pp. 1433-1440, October 1989.
[44] B. H. Calhoun and A. P. Chandrakasan, "Static noise margin variation for sub-threshold SRAM in 65-nm CMOS," IEEE Journal of Solid-State Circuits, vol. 41, pp. 1673-1679, July 2006.
[45] "Cisco visual networking index: Global mobile data traffic forecast update, 2013-2018," [online] http://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white_paper_c11-520862.html.
[46] Y. K. Lin, D. W. Li, C. C. Lin, T. Y. Kuo, S. J. Wu, W. C. Tai, W. C. Chang, and T. S. Chang, "A 242mW 10mm² 1080p H.264/AVC high-profile encoder chip," International Solid-State Circuits Conference, pp. 314-315, 2008.
[47] D. F. Finchelstein, V. Sze, M. E. Sinangil, Y. Koken, and A. P. Chandrakasan, "A low-power 0.7-V H.264 720p video decoder," IEEE Asian Solid-State Circuits Conference, pp. 173-176, 2008.
[48] K. Yu, M. Takahashi, T. Maeda, H. Hara, H. Arakida, H. Yamamoto, Y. Hagiwara, T. Fujita, M. Watanabe, T. Shimazawa, Y. Ohara, T. Miyamori, M. Hamada, and Y. Oowaki, "A 222mW H.264 full-HD decoding application processor with x512b stacked DRAM in 40nm," International Solid-State Circuits Conference, pp. 326-327, 2010.
[49] Y. Park, C. Yu, K. Lee, H. Kim, Y. Park, C. Kim, Y. Choi, J. Oh, C. Oh, G. Moon, S. Kim, H. Jang, J. A. Lee, C. Kim, and S. Park, "72.5GFlops 240Mpixel/s 1080p 60fps multi-format video codec application processor enabled with GPGPU for fused multimedia application," International Solid-State Circuits Conference, pp. 160-161, 2013.
[50] T. Burd and R. Broderson, "Design issues for dynamic voltage scaling," IEEE International Symposium on Low Power Electronics and Design, pp. 9-14, 2000.
[51] B. H. Calhoun and A. P. Chandrakasan, "Characterizing and modeling minimum energy operation for subthreshold circuits," IEEE International Symposium on Low Power Electronics and Design, pp. 90-95, 2004.
[52] ITU-T Recommendation H.264, "Advanced video coding for generic audiovisual services."
[53] T. Wiegand and G. J. Sullivan, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 560-576, July 2003.
[54] SMPTE 421M, "VC-1 compressed video bitstream format and decoding process."
[55] H. Kalva and J. Lee, "The VC-1 video coding standard," IEEE Multimedia, vol. 14, pp. 88-91, October 2007.
[56] S. Srinivasan, P. Hsu, T. Holcomb, K. Mukerjee, S. L. Regunathan, B. Lin, J. Liang, M.-C. Lee, and J. Ribas-Corbera, "Windows Media Video 9: Overview and applications," Signal Processing: Image Communication, vol. 19, pp. 851-875, October 2004.
[57] H. S. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, "Low-complexity transform and quantization in H.264/AVC," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 598-603, July 2003.
[58] S. Srinivasan, S. Regunathan, and B. Lin, "Computationally efficient transforms for video coding," IEEE International Conference on Image Processing, pp. 11-14, 2005.
[59] S. Srinivasan and J. Liang, "Fast video codec transform implementations," U.S. Patent 20050256916, November 2005.
[60] S. Lee and K.
Cho, "Design of transform and quantization circuit for multi-standard integrated video decoder," IEEE Workshop on Signal Processing Systems, pp. 181-186, 2007.
[61] C.-P. Fan and G.-A. Su, "Efficient low-cost sharing design of fast 1-D inverse integer transform algorithms for H.264/AVC and VC-1," IEEE Signal Processing Letters, vol. 15, pp. 926-929, 2008.
[62] C.-P. Fan and G.-A. Su, "Efficient fast 1-D 8x8 inverse integer transform for VC-1 application," IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, pp. 584-590, April 2009.
[63] G.-A. Su and C.-P. Fan, "Cost effective hardware sharing architecture for fast 1-D 8x8 forward and inverse integer transforms of H.264/AVC high profile," IEEE Asia Pacific Conference on Circuits and Systems, pp. 1332-1335, 2008.
[64] S. Lee and K. Cho, "Design of high-performance transform and quantization circuit for unified video codec," IEEE Asia Pacific Conference on Circuits and Systems, pp. 1450-1453, 2008.
[65] R. Rithe, C. C. Cheng, and A. Chandrakasan, "Quad full-HD transform engine for dual-standard low-power video coding," IEEE Asian Solid-State Circuits Conference, pp. 401-404, 2011.
[66] W.-H. Chen, C. Smith, and S. Fralick, "A fast computational algorithm for the discrete cosine transform," IEEE Transactions on Communications, vol. 25, pp. 1004-1009, September 1977.
[67] T. Xanthopoulos and A. P. Chandrakasan, "A low-power IDCT macrocell for MPEG-2 MP@ML exploiting data distribution properties for minimal activity," IEEE Journal of Solid-State Circuits, vol. 34, pp. 693-703, May 1999.
[68] H. Fujiwara, K. Nii, H. Noguchi, J. Miyakoshi, Y. Murachi, Y. Morita, H. Kawaguchi, and M. Yoshimoto, "Novel video memory reduces 45% of bitline power using majority logic and data-bit reordering," IEEE Transactions on Very Large Scale Integration Systems, vol. 16, pp. 620-627, June 2008.
[69] M. E. Sinangil and A. P. Chandrakasan, "Application-specific SRAM design using output prediction to reduce bit-line switching activity and statistically gated sense amplifiers for up to 1.9x lower energy/access," IEEE Journal of Solid-State Circuits, vol. 49, pp. 107-117, January 2014.
[70] "ITU-T recommendation H.265 and ISO/IEC 23008-2: High Efficiency Video Coding," [online] http://www.itu.int/ITU-T/recommendations/rec.aspx?rec=11885, 2013.
[71] M. Budagavi, A. Fuldseth, G. Bjontegaard, V. Sze, and M. Sadafale, "Core transform design in the high efficiency video coding (HEVC) standard," IEEE Journal of Selected Topics in Signal Processing, vol. 7, pp. 1029-1041, December 2013.
[72] M. Tikekar, C.-T. Huang, C. Juvekar, V. Sze, and A. P. Chandrakasan, "A 249-Mpixel/s HEVC video-decoder chip for 4k ultra-HD applications," IEEE Journal of Solid-State Circuits, vol. 49, pp. 61-72, January 2014.
[73] K. J. Kuhn, "Reducing variation in advanced logic technologies: Approaches to process and design for manufacturability of nanoscale CMOS," IEEE International Electron Devices Meeting, pp. 471-474, 2007.
[74] L. Cheng, P. Gupta, C. Spanos, K. Qian, and L. He, "Physically justifiable die-level modeling of spatial variation in view of systematic across wafer variability," Design Automation Conference, pp. 104-109, 2009.
[75] D. Blaauw, K. Chopra, A. Srivastava, and L. Scheffer, "Statistical timing analysis: From basic principles to state of the art," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, pp. 589-607, April 2008.
[76] A. Wang and A.
Chandrakasan, "A 180-mV subthreshold FFT processor using a minimum energy design methodology," IEEE Journal of Solid-State Circuits, vol. 40, pp. 310-319, January 2005.
[77] Y. Cao and L. T. Clark, "Mapping statistical process variations toward circuit performance variability: An analytical modeling approach," ACM IEEE Design Automation Conference, pp. 658-663, 2005.
[78] J. Kwong, Y. K. Ramadass, N. Verma, and A. P. Chandrakasan, "A 65 nm sub-Vt microcontroller with integrated SRAM and switched capacitor DC-DC converter," IEEE Journal of Solid-State Circuits, vol. 44, pp. 115-126, January 2009.
[79] H. Mahmoodi, S. Mukhopadhyay, and K. Roy, "Estimation of delay variations due to random-dopant fluctuations in nanoscale CMOS circuits," IEEE Journal of Solid-State Circuits, vol. 40, pp. 1787-1796, September 2005.
[80] S. Sundareswaran, J. A. Abraham, A. Ardelea, and R. Panda, "Characterization of standard cells for intra-cell mismatch variations," International Symposium on Quality Electronic Design, pp. 213-219, 2008.
[81] R. Rithe, S. Chao, J. Gu, A. Wang, S. Datla, G. Gammie, D. Buss, and A. Chandrakasan, "The effect of random dopant fluctuations on logic timing at low voltage," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, pp. 911-924, May 2012.
[82] R. Rithe, "SSTA design methodology for low voltage operation," Master's thesis, Massachusetts Institute of Technology, 2010.
[83] C. Y. Huang, L. F. Chen, and Y. K. Lai, "A high-speed 2D transform architecture with unique kernel for multi-standard video applications," IEEE International Symposium on Circuits and Systems, pp. 21-24, 2008.
[84] C. P. Fan, C. H. Fang, C. W. Chang, and S. J. Hsu, "Fast multiple inverse transforms with low-cost hardware sharing design for multistandard video decoding," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 58, pp. 517-521, August 2011.
[85] K. Wang, J. Chen, W. Cao, Y. Wang, L. Wang, and J. Tong, "A reconfigurable multi-transform VLSI architecture supporting video codec design," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 58, pp. 432-436, July 2011.
[86] Y.-H. Chen, T.-Y. Chang, and C.-W. Lu, "A low-cost and high-throughput architecture for H.264/AVC integer transform by using four computation streams," IEEE International Symposium on Integrated Circuits, pp. 380-383, 2011.
[87] F. Durand and J. Dorsey, "Fast bilateral filtering for the display of high-dynamic-range images," ACM Transactions on Graphics, vol. 21, pp. 257-266, July 2002.
[88] M. Brown and D. G. Lowe, "Automatic panoramic image stitching using invariant features," International Journal of Computer Vision, vol. 74, pp. 59-73, August 2007.
[89] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman, "Efficient marginal likelihood optimization in blind deconvolution," IEEE Conference on Computer Vision and Pattern Recognition, pp. 2657-2664, 2011.
[90] R. Ng, M. Levoy, M. Bredif, G. Duval, M. Horowitz, and P. Hanrahan, "Light-field photography with a handheld plenoptic camera," Stanford University Computer Science Tech Report, April 2005.
[91] C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images," IEEE International Conference on Computer Vision, pp. 839-846, 1998.
[92] S. M. Smith and J. M. Brady, "SUSAN - a new approach to low level image processing," International Journal of Computer Vision, vol. 23, pp. 45-78, May 1997.
[93] P. Perona and J.
Malik, "Scale-space and edge detection using anisotropic diffusion," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, pp. 629-639, July 1990.
[94] J. Tumblin and G. Turk, "LCIS: A boundary hierarchy for detail-preserving contrast reduction," ACM SIGGRAPH Conference, pp. 83-90, 1999.
[95] A. Levin, A. Rav-Acha, and D. Lischinski, "Spectral matting," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 1699-1712, October 2008.
[96] D. Lischinski, Z. Farbman, M. Uyttendaele, and R. Szeliski, "Interactive local adjustment of tonal values," ACM Transactions on Graphics, vol. 25, pp. 646-653, March 2006.
[97] N. Sochen, R. Kimmel, and A. M. Bruckstein, "Diffusions and confusions in signal and image processing," Journal of Mathematical Imaging and Vision, vol. 14, pp. 237-244, May 2001.
[98] M. Elad, "On the bilateral filter and ways to improve it," IEEE Transactions on Image Processing, vol. 11, pp. 1141-1151, October 2002.
[99] J. van de Weijer and R. van den Boomgaard, "On the equivalence of local-mode finding, robust estimation and mean-shift analysis as used in early vision tasks," International Conference on Pattern Recognition, pp. 927-930, 2002.
[100] D. Barash and D. Comaniciu, "A common framework for nonlinear diffusion, adaptive smoothing, bilateral filtering and mean shift," Image and Vision Computing, vol. 22, pp. 73-81, January 2004.
[101] A. Buades, B. Coll, and J.-M. Morel, "Neighborhood filters and PDE's," Numerische Mathematik, vol. 105, pp. 1-34, November 2006.
[102] P. Mrazek, J. Weickert, and A. Bruhn, "On robust estimation and smoothing with spatial and tonal kernels," Geometric Properties for Incomplete Data, Springer, vol. 31, pp. 335-352, 2006.
[103] M. Aleksic, M. Smirnov, and S. Goma, "Novel bilateral filter approach: Image noise reduction with sharpening," Proceedings of the SPIE, vol. 6069, pp. 141-147, May 2006.
[104] C. Liu, W. T. Freeman, R. Szeliski, and S. Kang, "Noise estimation from a single image," IEEE Computer Vision and Pattern Recognition Conference, pp. 901-908, 2006.
[105] S. Bae, S. Paris, and F. Durand, "Two-scale tone management for photographic look," ACM Transactions on Graphics, vol. 25, pp. 637-645, July 2006.
[106] M. Elad, "Retinex by two bilateral filters," Scale-Space Conference, pp. 217-229, July.
[107] E. Bennet and L. McMillan, "Video enhancement using per-pixel virtual exposures," ACM Transactions on Graphics, vol. 24, pp. 845-852, July 2005.
[108] H. Winnemoller, S. C. Olsen, and B. Gooch, "Real-time video abstraction," ACM Transactions on Graphics, vol. 25, pp. 1221-1226, August 2006.
[109] J. Xiao, H. Cheng, H. Awhney, C. Rao, and M. Isnardi, "Bilateral filtering based optical flow estimation with occlusion detection," European Conference on Computer Vision, pp. 211-224, 2006.
[110] P. Sand and S. Teller, "Particle video: Long-range motion estimation using point trajectories," International Journal of Computer Vision, vol. 80, pp. 72-91, January 2008.
[111] E.-H. Woo, J.-H. Sohn, H. Kim, and H.-J. Yoo, "A 195 mW, 9.1 Mvertices/s fully programmable 3-D graphics processor for low-power mobile devices," IEEE Journal of Solid-State Circuits, vol. 43, pp. 2370-2380, July 2008.
[112] F. Sheikh, S. K. Mathew, M. A. Anders, H. Kaul, S. K. Hsu, A. Agarwal, R. K. Krishnamurthy, and S. Borkar, "A 2.05 Gvertices/s 151 mW lighting accelerator for 3D graphics vertex and pixel shading in 32 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 48, pp. 128-139, January 2013.
[113] G. Wan, X. Li, G.
Agranov, M. Levoy, and M. Horowitz, "CMOS image sensors with multi-bucket pixels for computational photography," IEEE Journal of Solid-State Circuits, vol. 47, pp. 1031-1042, April 2012.
[114] S. Sukegawa, T. Umebayashi, T. Nakajima, H. Kawanobe, K. Koseki, I. Hirota, T. Haruta, M. Kasai, K. Fukumoto, T. Wakano, K. Inoue, H. Takahashi, T. Nagano, Y. Nitta, T. Hirayama, and N. Fukushima, "A 1/4-inch 8Mpixel back-illuminated stacked CMOS image sensor," IEEE International Solid-State Circuits Conference, pp. 484-485, 2013.
[115] Y. Chen, Y. Xu, Y. Chae, A. Mierop, X. Wang, and A. Theuwissen, "A 0.7e-rms-temporal-readout-noise CMOS image sensor for low-light-level imaging," IEEE International Solid-State Circuits Conference, pp. 384-385, 2012.
[116] J. Chen, S. Paris, and F. Durand, "Real time edge-aware image processing with the bilateral grid," ACM Transactions on Graphics, vol. 26, July 2007.
[117] T. Q. Pham and L. J. V. Vliet, "Separable bilateral filtering for fast video preprocessing," IEEE International Conference on Multimedia and Expo, pp. 4-8, 2005.
[118] S. Paris and F. Durand, "A fast approximation of the bilateral filter using a signal processing approach," International Journal of Computer Vision, vol. 81, pp. 24-52, January 2009.
[119] B. Weiss, "Fast median and bilateral filtering," ACM Transactions on Graphics, vol. 25, pp. 519-526, July 2006.
[120] A. Sinha, A. Wang, and A. Chandrakasan, "Energy scalable system design," IEEE Transactions on Very Large Scale Integration Systems, vol. 10, pp. 135-145, April 2002.
[121] P. E. Debevec and J. Malik, "Recovering high dynamic range radiance maps from photographs," ACM Conference on Computer Graphics and Interactive Techniques, pp. 369-378, 1997.
[122] G. W. Larson, H. Rushmeier, and C. Piatko, "A visibility matching tone reproduction operator for high dynamic range scenes," IEEE Transactions on Visualization and Computer Graphics, vol. 3, pp. 291-306, October 1997.
[123] J. DiCarlo and B. Wandell, "Rendering high dynamic range images," Proceedings of the SPIE: Image Sensors, pp. 392-401, 2000.
[124] J. Cohen, C. Tchou, T. Hawkins, and P. Debevec, "Real-time high-dynamic range texture mapping," Eurographics Workshop on Rendering, pp. 313-320, October 2001.
[125] D. J. Jobson, Z. U. Rahman, and G. A. Woodell, "A multi-scale retinex for bridging the gap between color images and the human observation of scenes," IEEE Transactions on Image Processing, vol. 6, pp. 965-976, July 1997.
[126] S. N. Pattanaik, J. A. Ferwerda, M. D. Fairchild, and D. P. Greenberg, "A multiscale model of adaptation and spatial vision for realistic image display," ACM SIGGRAPH Conference, pp. 287-298, 1997.
[127] J. J. McCann and A. Rizzi, "Veiling glare: The dynamic range limit of HDR images," Human Vision and Electronic Imaging XII, SPIE, vol. 6492, 2007.
[128] E. V. Talvala, A. Adams, M. Horowitz, and M. Levoy, "Veiling glare in high dynamic range imaging," ACM Transactions on Graphics, vol. 26, July 2007.
[129] E. Reinhard, G. Ward, S. Pattanaik, and P. Debevec, "High dynamic range imaging: acquisition, display and image-based lighting," Morgan Kaufmann Publishers, 2006.
[130] R. Raskar, A. Agrawal, C. A. Wilson, and A. Veeraraghavan, "Glare aware photography: 4D ray sampling for reducing glare effects of camera lenses," ACM Transactions on Graphics, vol. 27, pp. 56:1-56:10, August 2008.
[131] J. M. DiCarlo, F. Xiao, and B. A. Wandell, "Illuminating illumination," Color Imaging Conference, pp. 27-34, 2001.
[132] M. F. Cohen, A.
Colburn, and S. Drucker, "Image stacks," MSR Technical Report, vol. 40, July 2003.
[133] H. Hoppe and K. Toyama, "Continuous flash," MSR Technical Report, vol. 63, October 2003.
[134] K. Toyama and B. Schoelkopf, "Interactive images," MSR Technical Report, vol. 64, December 2003.
[135] P. Debevec, T. Hawkins, C. Tchou, H. Duiker, W. Sarokin, and M. Sagar, "Acquiring the reflectance field of the human face," ACM SIGGRAPH Conference, pp. 145-156, 2000.
[136] V. Masselus, P. Dutre, and F. Anrys, "The free-form light stage," Eurographics Rendering Symposium, pp. 247-256, 2002.
[137] D. Akers, F. Losasso, J. Klingner, M. Agrawala, J. Rick, and P. Hanrahan, "Conveying shape and features with image-based relighting," IEEE Visualization, pp. 349-354, 2003.
[138] G. Petschnigg, M. Agrawala, H. Hoppe, R. Szeliski, M. Cohen, and K. Toyama, "Digital photography with flash and no-flash image pairs," ACM Transactions on Graphics, vol. 23, pp. 664-672, August 2004.
[139] E. Eisemann and F. Durand, "Flash photography enhancement via intrinsic relighting," ACM Transactions on Graphics, vol. 23, pp. 673-678, August 2004.
[140] B. M. Oh, M. Chen, J. Dorsey, and F. Durand, "Image-based modeling and photo editing," ACM SIGGRAPH Conference, 2001.
[141] A. Wang and A. P. Chandrakasan, "A 180mV FFT processor using subthreshold circuit technologies," IEEE International Solid-State Circuits Conference, pp. 292-293, 2004.
[142] S. Sridhara, M. DiRenzo, S. Lingam, S.-J. Lee, R. Blazquez, J. Maxey, S. Ghanem, Y.-H. Lee, R. Abdallah, P. Singh, and M. Goel, "Microwatt embedded processor platform for medical system-on-chip applications," IEEE Symposium on VLSI Circuits, pp. 15-16, 2010.
[143] M. Qazi, M. E. Sinangil, and A. P. Chandrakasan, "Challenges and directions for low-voltage SRAM," IEEE Design & Test of Computers, vol. 28, pp. 32-43, January 2011.
[144] "Intel Atom processor Z2760," [online] http://www.intel.com/content/www/us/en/processors/atom/atom-z2760-datasheet.html.
[145] A. Adams, E. Talvala, S. H. Park, D. E. Jacobs, B. Ajdin, N. Gelfand, J. Dolson, D. Vaquero, J. Baek, M. Tico, H. P. A. Lensch, W. Matusik, K. Pulli, M. Horowitz, and M. Levoy, "The Frankencamera: An experimental platform for computational photography," ACM Transactions on Graphics, vol. 29, pp. 29:1-29:12, July 2010.
[146] "Intel integrated performance primitives," [online] https://software.intel.com/en-us/intel-ipp.
[147] "The OpenMP API specification for parallel programming," [online] http://openmp.org/.
[148] "DDR2 SDRAM system-power calculator," [online] http://www.micron.com/products/dram/ddr2-sdram.
[149] "Pandaboard: Open OMAP 4 mobile software development platform," [online] http://pandaboard.org/content/platform.
[150] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe, "Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines," ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 519-530, 2013.
[151] C. C. Wang, F. L. Yuan, H. Chen, and D. Markovic, "A 1.1 GOPS/mW FPGA with hierarchical interconnect fabric," IEEE Symposium on VLSI Circuits, pp. 136-137, 2011.
[152] C. C. Wang, F. L. Yuan, T. H. Yu, and D. Markovic, "A multi-granularity FPGA with hierarchical interconnects for efficient and flexible mobile computing," International Solid-State Circuits Conference, pp. 460-461, 2014.
[153] R. J. Hay, N. E. Johns, H. C. Williams, I. W. Bolliger, R. P. Dellavalle, D. J. Margolis, R. Marks, L.
Naldi, M. A. Weinstock, S. K. Wulf, C. Michaud, C. J. L. Murray, and M. Naghavi, "The global burden of skin disease in 2010: An analysis of the prevalence and impact of skin conditions," Journal of Investigative Dermatology, November 2013.
[154] P. E. Grimes, "New insight and new therapies in vitiligo," The Journal of the American Medical Association, vol. 293, pp. 730-735, February 2005.
[155] A. Alikhan, L. M. Felsten, M. Daly, and V. Petronic-Rosic, "Vitiligo: a comprehensive overview part I. Introduction, epidemiology, quality of life, diagnosis, differential diagnosis, associations, histopathology, etiology, and work-up," Journal of the American Academy of Dermatology, vol. 65, pp. 473-491, September 2011.
[156] K. Ezzedine, H. W. Lim, T. Suzuki, I. Katayama, I. Hamzavi, C. C. Lan, B. K. Goh, T. Anbar, C. S. de Castro, A. Y. Lee, D. Prasad, N. V. Geel, I. C. L. Poole, N. Oiso, L. Benzekri, R. Spritz, Y. Gauthier, S. K. Hann, M. Picardo, and A. Taieb, "Revised classification/nomenclature of vitiligo and related issues: The vitiligo global issues consensus conference," Pigment Cell and Melanoma Research, vol. 25, pp. E1-13, May 2012.
[157] D. J. Gawkrodger, A. D. Ormerod, L. Shaw, I. Mauri-Sole, M. E. Whitton, M. J. Watts, A. V. Anstey, J. Ingham, and K. Young, "Guideline for the diagnosis and management of vitiligo," British Journal of Dermatology, vol. 159, pp. 1051-1076, November 2008.
[158] R. M. Halder and J. L. Chappell, "Vitiligo update," Seminars in Cutaneous Medicine and Surgery, vol. 28, pp. 86-92, June 2009.
[159] G. C. do Carmo and M. R. e Silva, "Dermoscopy: basic concepts," International Journal of Dermatology, vol. 47, pp. 712-719, July 2008.
[160] M. E. Vestergaard, P. Macaskill, P. E. Holt, and S. W. Menzies, "Dermoscopy compared with naked eye examination for the diagnosis of primary melanoma: a meta-analysis of studies performed in a clinical setting," The British Journal of Dermatology, vol. 159, pp. 669-676, September 2008.
[161] W. Stolz, P. Bilek, M. Landthaler, T. Merkle, and O. Braun-Falco, "Skin surface microscopy," The Lancet, vol. 334, pp. 864-865, October 1989.
[162] R. P. Braun, H. S. Rabinovitz, M. Oliviero, A. W. Kopf, and J. H. Saurat, "Dermoscopy of pigmented skin lesions," Journal of the American Academy of Dermatology, vol. 52, pp. 109-121, January 2005.
[163] "Dermlite," [online] http://dermlite.com/.
[164] U. Gonzalez, M. Whitton, V. Eleftheriadou, M. Pinart, J. Batchelor, and J. Leonardi-Bee, "Guidelines for designing and reporting clinical trials in vitiligo," Archives of Dermatology, vol. 147, pp. 1428-1436, December 2011.
[165] V. Eleftheriadou, K. S. Thomas, M. E. Whitton, J. M. Batchelor, and J. C. Ravenscroft, "Which outcomes should we measure in vitiligo? Results of a systematic review and a survey among patients and clinicians on outcomes in vitiligo trials," The British Journal of Dermatology, vol. 167, pp. 804-814, October 2012.
[166] C. Vrijman, M. L. Homan, J. Limpens, W. van der Veen, A. Wolkerstorfer, C. B. Terwee, and P. I. Spuls, "Measurement properties of outcome measures for vitiligo: A systematic review," Archives of Dermatology, vol. 17, pp. 1-8, September 2012.
[167] I. Hamzavi, H. Jain, D. McLean, J. Shapiro, H. Zeng, and H. Lui, "Parametric modeling of narrowband UV-B phototherapy for vitiligo using a novel quantitative tool: The vitiligo area scoring index," Archives of Dermatology, vol. 140, pp. 677-683, June 2004.
[168] T. S. Oh, O. Lee, J. E. Kim, S. W. Son, and C. H.
Oh, "Quantitative method for measuring therapeutic efficacy of the 308 nm excimer laser for vitiligo," Skin Research and Technology, vol. 18, pp. 347-355, August 2012.
[169] M. W. L. Homan, A. Wolkerstorfer, M. A. Sprangers, and J. L. V. der Veen, "Digital image analysis vs. clinical assessment to evaluate repigmentation after punch grafting in vitiligo," Journal of the European Academy of Dermatology and Venereology, vol. 27, pp. 235-238, February 2013.
[170] T. S. Cho, W. T. Freeman, and H. Tsao, "A reliable skin mole localization scheme," in International Conference on Computer Vision, pp. 1-8, IEEE, 2007.
[171] S. K. Madan, K. J. Dana, and O. G. Cula, "Quasiconvex alignment of multimodal skin images for quantitative dermatology," in Computer Vision and Pattern Recognition Workshops, pp. 117-124, IEEE, 2009.
[172] S. K. Madan, K. J. Dana, and O. Cula, "Learning-based detection of acne-like regions using time-lapse features," in Signal Processing in Medicine and Biology Symposium, pp. 1-6, IEEE, 2011.
[173] H. Wannous, Y. Lucas, and S. Treuillet, "Enhanced assessment of the wound-healing process by accurate multiview tissue classification," IEEE Transactions on Medical Imaging, vol. 30, pp. 315-326, February 2011.
[174] H. Nugroho, M. H. A. Fadzil, V. V. Yap, S. Norashikin, and H. H. Suraiya, "Determination of skin repigmentation progression," IEEE International Conference of the Engineering in Medicine and Biology Society, pp. 3442-3445, 2007.
[175] F. Peruch, F. Bogo, M. Bonazza, V. M. Cappelleri, and E. Peserico, "Simpler, faster, more accurate melanocytic lesion segmentation through MEDS," IEEE Transactions on Biomedical Engineering, vol. 61, pp. 557-565, February 2014.
[176] K. Korotkov and R. Garcia, "Computerized analysis of pigmented skin lesions: A review," Artificial Intelligence in Medicine, vol. 56, pp. 69-90, October 2012.
[177] R. J. Friedman, D. S. Rigel, and A. W. Kopf, "Early detection of malignant melanoma: The role of physician examination and self-examination of the skin," CA: A Cancer Journal for Clinicians, vol. 35, pp. 130-151, May 1985.
[178] J. Chen, J. Stanley, R. H. Moss, and W. V. Stoecker, "Color analysis of skin lesion regions for melanoma discrimination in clinical images," Skin Research and Technology, vol. 9, pp. 94-104, May 2003.
[179] C. Grana, G. Pellacani, and S. Seidenari, "Practical color calibration for dermoscopy, applied to a digital epiluminescence microscope," Skin Research and Technology, vol. 11, pp. 242-247, November 2005.
[180] H. Iyatomi, M. E. Celebi, G. Schaefer, and M. Tanaka, "Automated color normalization for dermoscopy images," in International Conference on Image Processing, pp. 4357-4360, IEEE, 2010.
[181] M. Styner, C. Brechbuhler, G. Szekely, and G. Gerig, "Parametric estimate of intensity inhomogeneities applied to MRI," IEEE Transactions on Medical Imaging, vol. 19, pp. 153-165, March 2000.
[182] J. Milles, Y. Zhu, G. Gimenez, C. Guttmann, and I. Magnin, "MRI intensity nonuniformity correction using simultaneously spatial and gray-level histogram information," Journal of Computerized Medical Imaging and Graphics, vol. 31, pp. 81-90, March 2007.
[183] U. Vovk, F. Pernus, and B. Likar, "Review of methods for correction of intensity inhomogeneity in MRI," IEEE Transactions on Medical Imaging, vol. 26, pp. 405-421, March 2007.
[184] K. Zhang, L. Zhang, and S. Zhang, "A variational multiphase level set approach to simultaneous segmentation and bias correction," in International Conference on Image Processing, pp. 4105-4108, IEEE, 2010.
[185] C. Li, C. Xu, C. Gui, and M. D. Fox, "Distance regularized level set evolution and its application to image segmentation," IEEE Transactions on Image Processing, vol. 19, pp. 3243-3254, December 2010.
[186] C. Li, R. Huang, Z. Ding, C. Gatenby, D. N. Metaxas, and J. C. Gore, "A level set method for image segmentation in the presence of intensity inhomogeneities with application to MRI," IEEE Transactions on Image Processing, vol. 20, pp. 2007-2016, July 2011.
[187] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, pp. 381-395, June 1981.
[188] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, pp. 1615-1630, October 2005.
[189] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," IEEE Conference on Computer Vision and Pattern Recognition, pp. 524-531, 2005.
[190] A. Bosch, A. Zisserman, and X. Munoz, "Scene classification using a hybrid generative/discriminative approach," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 712-727, April 2008.
[191] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," IEEE Conference on Computer Vision and Pattern Recognition, pp. 2169-2178, 2006.
[192] J. J. Kivinen, E. B. Sudderth, and M. Jordan, "Learning multiscale representations of natural scenes using Dirichlet processes," IEEE International Conference on Computer Vision, pp. 1-8, 2007.
[193] K. Grauman and T. Darrell, "Efficient image matching with distributions of local invariant features," IEEE Conference on Computer Vision and Pattern Recognition, pp. 627-634, 2005.
[194] A. Frome, Y. Singer, F. Sha, and J. Malik, "Learning globally-consistent local distance functions for shape-based image retrieval and classification," IEEE International Conference on Computer Vision, pp. 1-8, 2007.
[195] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 509-522, April 2002.
[196] J. Duchon, "Splines minimizing rotation-invariant semi-norms in Sobolev spaces," Constructive Theory of Functions of Several Variables, vol. 571, pp. 85-100, 1977.
[197] J. Meinguet, "Multivariate interpolation at arbitrary points made simple," Journal of Applied Mathematics and Physics, vol. 5, pp. 439-468, 1979.
[198] "Intel power gadget," [online] https://software.intel.com/en-us/articles/intel-power-gadget-20.
[199] S. L. Jacques, J. C. Ramella-Roman, and K. Lee, "Imaging skin pathology with polarized light," Journal of Biomedical Optics, vol. 7, pp. 329-340, July 2002.
[200] M. H. Smith, P. Burke, A. Lompado, E. Tanner, and L. W. Hillman, "Mueller matrix imaging polarimetry in dermatology," Proceedings of SPIE, 2000.
[201] J. A. Muccini, N. Kollias, S. B. Phillips, R. R. Anderson, A. J. Sober, M. J. Stiller, and L. A. Drake, "Polarized light photography in the evaluation of photoaging," Journal of the American Academy of Dermatology, vol. 33, pp. 765-769, November 1995.
[202] R. Langley, M. Rajadhyaksha, P. Dwyer, A. Sober, T. Flotte, and R. R.
Anderson, "Confocal scanning laser microscopy of benign and malignant melanocytic skin lesions in vivo," Journal of the American Academy of Dermatology, vol. 45, pp. 365-376, September 2001.
[203] D. Kapsokalyvas, N. Bruscino, D. Alfieri, V. de Giorgi, G. Cannarozzo, T. Lotti, and F. S. Pavone, "Imaging of human skin lesions with the multispectral dermoscope," Proceedings of the SPIE, 2010.
[204] R. W. Brodersen, "Low power design, past and future," 2014.
[205] E. J. Fluhr, J. Friedrich, D. Dreps, V. Zyuban, G. Still, C. Gonzalez, A. Hall, D. Hogenmiller, F. Malgioglio, R. Nett, J. Paredes, J. Pille, D. Plass, R. Puri, P. Restle, D. Shan, K. Stawiasz, Z. T. Deniz, D. Wendel, and M. Ziegler, "POWER8: A 12-core server-class processor in 22nm SOI with 7.6Tb/s off-chip bandwidth," IEEE International Solid-State Circuits Conference, pp. 96-97, 2014.
[206] N. Kurd, M. Chowdhury, E. Burton, T. P. Thomas, C. Mozak, B. Boswell, M. Lal, A. Deval, J. Douglas, M. Elassal, A. Nalamalpu, T. M. Wilson, M. Merten, S. Chennupaty, W. Gomes, and R. Kumar, "Haswell: A family of IA 22nm processors," IEEE International Solid-State Circuits Conference, pp. 112-113, 2014.
[207] A. Wang, P. R. Gill, and A. Molnar, "An angle-sensitive CMOS imager for single-sensor 3D photography," IEEE International Solid-State Circuits Conference, pp. 412-413, 2011.
[208] W. Kim, Y. Wang, I. Ovsiannikov, S. H. Lee, Y. Park, C. Chung, and E. Fossum, "A 1.5 Mpixel RGBZ CMOS image sensor for simultaneous color and range image capture," IEEE International Solid-State Circuits Conference, pp. 392-393, 2012.
[209] L. T. Su, "Architecting the future through heterogeneous computing," IEEE International Solid-State Circuits Conference, pp. 8-11, 2011.
[210] C. Gentry, "Computing arbitrary functions of encrypted data," Communications of the ACM, vol. 53, pp. 97-105, March 2010.