Energy-Efficient System Design for Mobile Processing Platforms
by
Rahul Rithe
B.Tech., Indian Institute of Technology Kharagpur (2008)
S.M., Massachusetts Institute of Technology (2010)
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2014

© Massachusetts Institute of Technology 2014. All rights reserved.
Author: [Signature redacted]
Department of Electrical Engineering and Computer Science
May 20, 2014

Certified by: [Signature redacted]
Anantha P. Chandrakasan
Joseph F. and Nancy P. Keithley Professor of Electrical Engineering
Thesis Supervisor

Accepted by: [Signature redacted]
Leslie A. Kolodziejski
Chair, Department Committee on Graduate Students
Energy-Efficient System Design for Mobile Processing Platforms
by
Rahul Rithe
Submitted to the Department of Electrical Engineering and Computer Science
on May 20, 2014, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
Abstract
Portable electronics has fueled the rich emergence of multimedia applications that have led to the
exponential growth in content creation and consumption. New energy-efficient integrated circuits
and systems are necessary to enable the increasingly complex augmented-reality applications,
such as high-performance multimedia, "big-data" processing and smart healthcare, in real-time
on mobile platforms of the future. This thesis presents an energy-efficient system design approach
with algorithm, architecture and circuit co-design for multiple application areas.
A shared transform engine, capable of supporting multiple video coding standards in real-time
with ultra-low power consumption, is developed. The transform engine, implemented using
45 nm CMOS technology, supports Quad Full-HD (4k x 2k) video coding with reconfigurable
processing for H.264 and VC-1 standards at 0.5 V and operates down to 0.3 V to maximize
energy-efficiency. Algorithmic and architectural optimizations, including matrix factorization,
transpose memory elimination and data dependent processing, achieve significant savings in area
and power consumption.
A reconfigurable processor for computational photography is presented. An efficient implementation of the 3D bilateral grid structure supports a wide range of non-linear filtering applications,
including high dynamic range imaging, low-light enhancement and glare reduction. The processor, implemented using 40 nm CMOS technology, enables real-time processing of HD images,
while operating down to 0.5 V and achieving 280x higher energy-efficiency compared to software implementations on state-of-the-art mobile processors. A scalable architecture enables 8x
energy scalability for the same throughput performance, while trading off output resolution for
energy.
Widespread use of medical imaging techniques has been limited by factors such as size, weight,
cost and complex user interface. A portable medical imaging platform for accurate objective quantification of skin condition progression, using robust computer vision techniques, is
presented. Clinical validation shows 95% accuracy in progression assessment. Algorithmic optimizations, reducing the memory bandwidth and computational complexity by over 80%, pave the
way for energy-efficient hardware implementation to enable real-time portable medical imaging.
Thesis Supervisor: Anantha P. Chandrakasan
Title: Joseph F. and Nancy P. Keithley Professor of Electrical Engineering
Acknowledgments
Since the first time I came to MIT in August 2008 and navigated my way to 38-107, trying to
make sense of MIT's (still) incomprehensible building numbering system, it has been a wonderful journey of exploration - filled with numerous challenges and exciting rewards of scientific
discovery.
I have been fortunate to have had exceptional advisors and mentors to guide me through this
journey. I am extremely grateful to my advisor, Prof. Anantha Chandrakasan, for being a
great mentor, role model and a constant source of inspiration. I learned from Anantha that
conducting great research is a process that involves working in collaboration with researchers,
industry partners and funding agencies, while constantly pushing the boundaries of the state-of-the-art. The collaborative research environment that Anantha has fostered in the lab not
only motivated me to produce great results but also afforded the opportunities to work with
graduate and undergraduate students and learn how to mentor and motivate others in realizing
their full potential as researchers. I learned invaluable lessons in organization and management,
from being inspired by Anantha's visionary leadership of EECS, while managing a large research
group. Thank you Anantha for giving me the freedom to explore my interests and helping me
grow both professionally and personally throughout my graduate studies at MIT!
I am thankful to the members of my Ph.D. thesis committee, Prof. William Freeman, Prof.
Li-Shiuan Peh and Prof. Vivienne Sze, for their advice, feedback and support. Prof. Freeman's
advice on the computer vision related work for medical imaging was extremely valuable. I would
like to thank Vivienne for her help and support throughout my graduate work at MIT - first
as a senior graduate student and then as a faculty member at MIT - from helping me learn
digital design to long discussions about research and reviewing paper drafts. I am extremely
grateful to Prof. Frédo Durand for several valuable discussions on topics ranging from research
to photography to career options.
I had the privilege of working with Dr. Dennis Buss, chief scientist at Texas Instruments and
visiting scientist at MIT, during my master's research. I am immensely thankful to Dennis for
all the insightful discussions over the last six years on topics ranging from research and industry
collaboration to the past, present and future of the semiconductor industry.
The work was made possible by the generous support of our industry partners. I would like
to acknowledge the Foxconn Technology Group, Texas Instruments and the MIT Presidential
Fellowship for providing funding support and the TSMC University Shuttle Program for chip
fabrication.
I consider teaching to be an integral part of the graduate experience and I am grateful to
Prof. Harry Lee for giving me the rare opportunity to serve as a recitation instructor for the
undergraduate 'Circuits and Electronics' class. I would like to thank Prof. Harry Lee, Prof.
Karl Berggren, Prof. John Kassakian and Prof. Khurram Afridi for helping me further my
passion for teaching and enhance my abilities as a teacher.
One of the best things about MIT is the people you get to interact and work with day-to-day.
I would like to thank Chih-Chi Cheng and Mahmut Sinangil for working long hours with me
on the video coding project. I am extremely thankful to Priyanka Raina, Nathan Ickes and
Srikanth Tenneti for their tremendous help in bringing the computational photography project
from an idea to a live demonstration platform. It has been a great experience for me to work
with two 'SuperUROP' students - Michelle Chen and Qui Nguyen - on the smartphone-based
medical imaging platform and I am thankful to them for being such enthusiastic collaborators.
I would also like to thank Dr. Vaneeta Sheth from the Brigham and Women's Hospital for
bringing her dermatology expertise to our medical imaging work and conducting a pilot study
to demonstrate its effectiveness during treatment.
When I first arrived at MIT, I could not have imagined a work environment better than what
Ananthagroup has offered me over the last six years. It has been an absolute pleasure to work
with all the members of Ananthagroup - past and present. The diverse set of expertise, thoughtful
discussions and "procrastination circles" have helped create the best workplace for research. All
work and no play is no fun. I would like to thank Masood Qazi for teaching me everything
I know about playing Squash and those amazing trips to Burdick's for the best hot chocolate
ever! I would also like to thank the members of the "Ananthagroup Tennis Club" - Arun, Phil
and Nachiket - for quite a few evenings well spent, braving wind, rain and cold on the tennis
courts.
Margaret Flaherty, our administrative assistant, is the reason everything in 38-107 runs so
smoothly. I would like to thank Margaret for her relentless work and attention to detail.
Saurav Bandyopadhyay, Rishabh Singh and I went to IIT Kharagpur together and continued
our journey at MIT together, including that first crammed flight from Delhi to Boston. I am
extremely thankful to Saurav and Rishabh for being such great friends over the years.
The foundation of my work rests on the unconditional love and support from my family. The
pride and joy of my late grandparents, Nirmalabai and Namdevrao Wankhade, in every one of my
achievements over the years has been and will continue to be a constant source of inspiration for
me. The love of my grandfather, Panjabrao Rithe, for education and the hardships he endured
for it has been the driving force for me on this academic journey. The steadfast belief of my
parents, Rajani and Jagdish Rithe, and my sister Bhagyashree, their support through all my
endeavors and encouragement to follow my dreams, has made this journey from a small village
in India to the present moment possible. And for that I am eternally grateful!
Rahul Rithe
Cambridge, MA
May 1, 2014
Contents

1 Introduction
  1.1 Mobile Computing Challenges
  1.2 Energy-Efficient System Design
    1.2.1 Parallel Processing
    1.2.2 Application Specific Processing
    1.2.3 Reconfigurable Hardware
    1.2.4 Low-Voltage Circuits
  1.3 Thesis Contributions

2 Transform Engine for Video Coding
  2.1 Transform Engine Design
    2.1.1 Integer Transform: H.264/AVC & VC-1
    2.1.2 Matrix Factorization for Hardware Sharing
    2.1.3 Eliminating Transpose Memory
    2.1.4 Data Dependent Processing
  2.2 Future Video Coding Standards
  2.3 Statistical Methodology for Low-Voltage Design
  2.4 Implementation
  2.5 Measurement Results
  2.6 Summary and Conclusions

3 Reconfigurable Processor for Computational Photography
  3.1 Bilateral Filtering
    3.1.1 Bilateral Grid
  3.2 Bilateral Filter Engine
    3.2.1 Grid Assignment
    3.2.2 Grid Filtering
    3.2.3 Grid Interpolation
    3.2.4 Memory Management
    3.2.5 Scalable Grid
  3.3 Applications
    3.3.1 High Dynamic Range Imaging
    3.3.2 Glare Reduction
    3.3.3 Low-Light Enhanced Imaging
  3.4 Low-Voltage Operation
    3.4.1 Statistical Design Methodology
    3.4.2 Multiple Voltage Domains
  3.5 Memory Bandwidth Optimization
  3.6 Measurement Results
    3.6.1 Energy Scalable Processing
    3.6.2 Energy Efficiency
  3.7 System Integration
  3.8 Summary and Conclusions

4 Portable Medical Imaging Platform
  4.1 Skin Conditions - Diagnosis & Treatment
    4.1.1 Clinical Assessment: Current Approaches
    4.1.2 Quantitative Dermatology
  4.2 Skin Condition Progression: Quantitative Analysis
    4.2.1 Color Correction
    4.2.2 Contour Detection
    4.2.3 Progression Analysis
    4.2.4 Auto-tagging
    4.2.5 Skin Condition Progression: Summary
  4.3 Experimental Results
    4.3.1 Clinical Validation
    4.3.2 Progression Quantification
    4.3.3 Auto-tagging Performance
    4.3.4 Energy-Efficient Processing
    4.3.5 Limitations
  4.4 Mobile Application
  4.5 Multispectral Imaging: Future Work
  4.6 Summary and Conclusions

5 Conclusions and Future Directions
  5.1 Summary of Contributions
    5.1.1 Video Coding
    5.1.2 Computational Photography
    5.1.3 Medical Imaging
  5.2 Conclusions
  5.3 Future Directions
    5.3.1 Computational Photography and Computer Vision
    5.3.2 Portable Medical Imaging

A Integer Transform
  A.1 H.264/AVC Integer Transform
  A.2 VC-1 Integer Transform

B Clinical Pilot Study for Vitiligo Progression Analysis
  B.1 Subjects for Pilot Study
  B.2 Progression Analysis

Acronyms

Bibliography
List of Figures

1-1 Evolution of computing and multimedia processing. (Analytical Engine: London Science Museum)
1-2 Processor feature scaling and Performance/Watt trends. (Data courtesy Stanford CPU DB: cpudb.stanford.edu)
1-3 Processor energy/operation scaling with performance. (Data courtesy Stanford CPU DB: cpudb.stanford.edu)
1-4 Energy efficiency of processors: from CPUs to ASICs.
1-5 Delay scaling with VDD. Corner delay scales by 15x, whereas total delay (corner + 3σ stochastic delay) scales by 36x.
2-1 Hardware architecture of the even component. The figure shows data paths exercised in (a) H.264 and (b) VC-1.
2-2 Hardware architecture of the odd component. The figure shows data paths exercised in (a) H.264 and (b) VC-1.
2-3 Column-wise 1D transform: 8x8 data is processed over four clock cycles, C0 to C3: columns 0 and 7 in C0, 1 and 6 in C1, 2 and 5 in C2, 3 and 4 in C3. Two transformed columns are generated in each clock cycle.
2-4 Row-wise 1D transform: Partial products for all 64 coefficients are computed in each clock cycle, using the 2x8 data obtained by transposing the two columns generated by the 1D column-wise transform. The partial products are stored in the output buffer. At the end of four clock cycles, the output buffer contains the complete 2D transformed output.
2-5 Hardware architecture of the (a) even and (b) odd component. Std = {0: H.264, 1: VC-1}.
2-6 Histogram of the prediction residue for a number of test sequences.
2-7 Correlation between input switching activity and system switching activity. The plot also shows linear regression for the data. Measured correlation is 0.83.
2-8 Switching activity and power consumption in the transform as a function of DC bias applied to the input data.
2-9 Hardware architecture of the even component for shared 8x8 transform for H.264, VC-1 and HEVC. The highlighted blocks are the same as those used in the shared H.264/VC-1 architecture, shown in Figure 2-1.
2-10 Hardware architecture of the odd component for shared 8x8 transform for H.264, VC-1 and HEVC. The highlighted blocks are the same as those used in the shared H.264/VC-1 architecture, shown in Figure 2-2.
2-11 Switching activity in HEVC transform as a function of DC bias applied to the input data.
2-12 Delay PDF of a representative timing path at 0.5 V. The STA estimate of the global corner delay is 14.1 ns, the 3σ delay estimate using Gaussian SSTA is 23.2 ns and the 3σ delay estimate using Monte-Carlo analysis is 31.8 ns.
2-13 Graphic illustration in xi-space of the convolution integral, and the operating point.
2-14 Delay PDF of a representative timing path at 0.5 V, estimated using Gaussian SSTA, Monte-Carlo and OPA.
2-15 Typical timing path.
2-16 OPA based statistical design methodology for low voltage operation.
2-17 Block diagram of the 2D transform engine design.
2-18 Die photo and design statistics of the fabricated IC.
2-19 Measured power consumption and frequency scaling with VDD for different transform implementations. (a) Frequency scaling with VDD, (b) power consumption while operating at the frequency shown in (a).
2-20 Power consumption for transform modules with and without transpose memory, with and without shared architecture for H.264 and VC-1.
2-21 Switching activity and power consumption in the transform as a function of DC bias applied to the input data.
3-1 System block diagram for the reconfigurable computational photography processor.
3-2 Comparison of Gaussian filtering and bilateral filtering. Bilateral filtering effectively reduces noise while preserving scene details.
3-3 Construction of a 3D bilateral grid from a 2D image.
3-4 Architecture of the bilateral filtering engine. Grid scalability is achieved by gating processing engines and SRAM banks.
3-5 Architecture of the grid assignment engine.
3-6 Architecture of the convolution engine for grid filtering.
3-7 Architecture of the interpolation engine. Trilinear interpolation is implemented as three pipelined stages of linear interpolations.
3-8 Memory management by task scheduling.
3-9 Camera curves that map the pixel intensity values on to the incident exposure.
3-10 HDR creation module.
3-11 HDR image scaled to 8 bit/pixel/color for displaying on LDR media. (HDR radiance map courtesy Paul Debevec [121].)
3-12 Processing flow for HDR creation and tone-mapping for displaying HDR images on LDR media.
3-13 Tone-mapped HDR image. (HDR radiance map courtesy Paul Debevec [121].)
3-14 Processor configuration for HDR imaging.
3-15 Input low-dynamic range images: (a) under exposed image, (b) normally exposed image, (c) over exposed image. Output image: (d) tone-mapped HDR image.
3-16 Contrast adjustment module. Contrast is increased or decreased depending on the adjustment factor.
3-17 Processing flow for glare reduction.
3-18 Processor configuration for glare reduction.
3-19 (a) Input image with glare. (b) Output image with reduced glare.
3-20 Processing flow for low-light enhancement.
3-21 Processor configuration for low-light enhancement.
3-22 Generating a mask representing regions with high scene details.
3-23 Merging flash and no-flash images with shadow correction.
3-24 (a) Image with flash, (b) image without flash, (c) no-flash base layer, (d) flash detail layer, (e) edge mask, (f) low-light enhanced output.
3-25 Input images: (a) image with flash, (b) image without flash. Output image: (c) low-light enhanced image.
3-26 Comparison of the image quality performance from the proposed approach with that of [138] and [139]. (a) Output from our approach, (b) output from [138], (c) output from [139], (d) difference image between (a) and (b) - amplified 5x, (e) difference image between (a) and (c) - amplified 5x.
3-27 Delay PDF of a representative timing path from the computational photography processor at 0.5 V. The STA estimate of the global corner delay is 21.9 ns, the 3σ delay estimate using OPA is 36.1 ns.
3-28 Separate voltage domains for logic and memory. Level shifters are used to transition between domains.
3-29 Memory bandwidth and estimated power consumption for 2D bilateral filtering, 3D bilateral grid and bilateral grid with memory management using task scheduling.
3-30 Die photo of the testchip. Highlighted boxes indicate SRAMs. HDR, CR and SC refer to HDR create, contrast reduction and shadow correction modules respectively.
3-31 Processor performance: trade-off of energy vs. performance for varying VDD.
3-32 Processor area (number of gates) and power breakdown.
3-33 Energy scalable processing. Grid resolution vs. energy trade-off at 0.9 V.
3-34 Energy/resolution scalable processing. HDR imaging outputs for (a) grid block size: 16x16, intensity levels: 16, (b) grid block size: 128x128, intensity levels: 16, (c) grid block size: 16x16, intensity levels: 4, (d) grid block size: 128x128, intensity levels: 4.
3-35 Energy/resolution scalable processing. Low-light enhancement outputs for (a) grid block size: 16x16, intensity levels: 16, (b) grid block size: 128x128, intensity levels: 16, (c) grid block size: 16x16, intensity levels: 4, (d) grid block size: 128x128, intensity levels: 4.
3-36 Energy efficiency of processors ranging from CPUs and mobile processors to FPGAs and ASICs.
3-37 Processor integration with external memory, camera and display.
3-38 Printed circuit board and system integration with camera and display.
4-1 Standardized assessments for estimating the degree of pigmentation to derive the Vitiligo Area Scoring Index. At 100% depigmentation, no pigment is present; at 90%, specks of pigment are present; at 75%, the depigmented area exceeds the pigmented area; at 50%, the depigmented and pigmented areas are equal; at 25%, the pigmented area exceeds the depigmented area; and at 10%, only specks of depigmentation are present. (Figure reproduced with permission from [167])
4-2 Processing flow for skin lesion progression analysis.
4-3 Color correction by histogram matching. Images captured with normal room lighting (a) and with color chart white-balance calibration (b). Images after color correction and contrast enhancement (c) of images in (a).
4-4 Level set segmentation. (a) Original image with intensity inhomogeneity and initialization of the level set function. (b) Homogeneous image obtained at the end of iterations and the corresponding level set function.
4-5 Narrowband implementation of level set segmentation. LSM variables are tracked only for pixels that fall within a narrow band defined around the zero level set in the current iteration.
4-6 Number of pixels processed using the narrowband implementation over 50 LSM iterations.
4-7 Lesion segmentation using K-means.
4-8 Contour evolution for lesion segmentation using narrowband LSM.
4-9 SIFT feature matching performed on the highlighted narrow band of pixels in the vicinity of the contour.
4-10 Color correction for a sequence of images by R, G, B histogram modification. (a) Original image sequence, (b) color corrected image sequence. The lesion color changes due to phototherapy.
4-11 Image segmentation using LSM for lesion contour detection despite intensity/color inhomogeneities in the image.
4-12 Image registration based on matching features with respect to the reference image at the beginning of the treatment.
4-13 Sequence of images during treatment. (a) Images captured with normal room lighting. (b) Processed image sequence.
4-14 Image registration through feature matching. (a) Images of a lesion from different camera angles, (b) images after contour detection and alignment. Area matches to 98% accuracy and pixel overlap to 97% accuracy.
4-15 Progression analysis. (a) Artificial image sequence with known area change, created from a lesion image. (b) Image sequence after applying scaling, rotation and perspective mismatch. (c) Output image sequence after lesion alignment and fill factor computation.
4-16 Memory bandwidth and estimated power consumption for full image LSM and SIFT compared to the optimized narrowband implementations of LSM and SIFT.
4-17 Image segmentation fails to accurately identify lesion contours where the lesions don't have well defined boundaries.
4-18 Architecture of the mobile application with cloud integration.
4-19 User interface of the mobile application. (Contributed by Michelle Chen and Qui Nguyen.)
4-20 A conceptual diagram of the portable imaging module for multispectral polarized light imaging.
5-1 Secure cloud-based medical imaging platform.
B-1 Progression of skin lesions over time. Lesion contours are identified from the color corrected images and the lesions are aligned using SIFT feature matching to determine the fill factor.
List of Tables

2.1 Separable 2D transform definitions for H.264/AVC and VC-1
2.2 Row-wise transform computations for even-odd components over four clock cycles
2.3 Full-chip Timing Analysis
2.4 Transform engines implemented in this design
2.5 Measurement results for implemented transform modules
2.6 Overheads and advantages of proposed ideas
2.7 Performance comparison of proposed approach with previous publications
3.1 Setup/Hold Timing Analysis at 0.5 V
3.2 Performance comparison with mobile processor implementations at 0.9 V
4.1 Summary of clinical assessment and quantitative dermatology approaches
4.2 Bit Width Representations of LSM Variables
4.3 Performance enhancement through algorithmic optimizations
B.1 Demographics of the subjects for clinical study
B.2 Progression of Skin Lesions During Treatment
Chapter 1
Introduction
In 1837, Charles Babbage proposed the concept of the Analytical Engine [1], the first
Turing complete computer with an arithmetic logic unit, control flow and integrated
memory. If it had been completely built, the Analytical Engine would have been vast
and would have needed to be operated by a steam engine [2]. The idea of computing devices that are astronomically more powerful and yet can fit in the palm of a person's hand, while operating on tiny batteries built into the devices themselves, would have been
unthinkable. Integrated circuits, driven by the semiconductor process scaling following
"Moore's Law" [3] and "Dennard Scaling" [4] over the last half century, have transformed
computing through exponential enhancements in performance, power efficiency and cost.
Today we are moving ever closer to the era of all computing being mobile. The vision of
ubiquitous computing [5] and portable wireless terminals for real-time multimedia access
and processing, heralded by the Xerox ParcTab [6] and the InfoPad [7,8], has been realized through the emergence of portable multimedia devices like smartphones and tablets. We are surrounded by computing devices that form the "internet of
things" - gateways to the hyper-connected world.
The exponential growth in computing has fueled advances in increasingly complex multimedia processing applications - from the first color photograph, created by Thomas Sutton and James Clerk Maxwell in 1861 based on Maxwell's three-color method¹ [9], to modern-day multimedia processing capabilities that have enabled real-time High Definition
(HD) video, computational photography, computer vision and graphics, and biological
and biomedical imaging. Figure 1-1 shows the evolution of computing and multimedia
processing.
Figure 1-1: Evolution of computing and multimedia processing, from the Analytical Engine (1837) and the first color photograph (1861). (Analytical Engine: London Science Museum)
Next generation mobile platforms will need to extend these capabilities multifold to enable efficient multimedia processing, natural user interfaces through gesture and speech
recognition, real-time interpretation and "big data" inference from sensors interfacing
with the world, and provide portable smart healthcare solutions for continuous health
monitoring.
¹ The three-color method forms the foundation of virtually all color imaging techniques to this day.
Regardless of the specific functionality, these applications have a common set of challenges.
They are computationally complex, typically require large non-linear filter kernels or large
block sizes for processing (64 x 64 or more) and have high memory size and bandwidth
requirements. To support real-time performance (1080p at 30 frames per second (fps)),
the throughput requirements for such applications can exceed 1 TOPS. The processing
is often non-localized, with data dependencies across multiple rows in an image or even across multiple frames in a video sequence. Many algorithms are iterative, such as in image deblurring or segmentation, which limits parallelism and places further constraints on real-time performance. This presents the significant challenge of delivering high computing performance while ensuring ultra-low power operation, to be efficient on battery-operated
mobile devices.
1.1 Mobile Computing Challenges
The energy budget of a mobile platform is constrained by its battery capacity. While processing power has increased exponentially, battery energy density has followed a roughly
linear trajectory [10]. Over the last 15 years, processor performance has increased by 100x and transistor count by 1000x, whereas battery capacity has increased only by a factor of 2.6 [11]. At the same time, even as the number of transistors has followed "Moore's Law" exponential growth and continues to do so with process scaling and 3D integration,
we are no longer able to achieve exponential gains in performance per Watt of power
consumption from process scaling alone, due to the lack of operating voltage scaling [12].
Figure 1-2 shows these trends over the last 40 years [13].
The lack of significant energy density enhancements in batteries combined with flattening
performance enhancements per unit of power consumption has led to a major challenge
in mobile computing. Energy has become the key limiting factor in scaling computing
performance on mobile platforms. The significant performance enhancements needed to
enable high complexity applications on future mobile platforms will only be achievable through significant enhancements in the energy-efficiency of such systems.
Figure 1-2: Processor feature scaling and Performance/Watt trends. (Data courtesy Stanford CPU DB: cpudb.stanford.edu)
1.2 Energy-Efficient System Design
Fine-grained parallelism and low voltage operation are powerful tools for low-power design that take advantage of the exponential scaling in transistor costs to trade-off silicon
area for lower power consumption [14-17]. Technology scaling, circuit topologies, and
architecture trends are aligning to take advantage of these trade-offs for severely energy-constrained applications on mobile platforms.
1.2.1 Parallel Processing
Parallel processing has become a cornerstone of low-power digital design [14] because of its remarkable ability, when coupled with voltage scaling, to enhance energy efficiency at no overall performance cost. It allows each individual processing engine or core to operate at less than its peak performance, which enables the operating voltage to be scaled down and achieves a super-linear reduction in energy per operation. Figure 1-3 shows the normalized energy/op scaling vs. performance for processors over 20 years. For applications that support data parallelism, a processor can have two processing engines, each running at half the required performance, that together achieve the same throughput as a single processing engine running at the full performance. But due to the super-linear scaling in energy per operation as we lower performance, the two engines combined consume lower power than one engine running at full performance.
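A first-order sketch of this trade-off is shown below. This is our illustration, not a design from the thesis; the activity factor, capacitance, supply and frequency values are assumed purely for the example.

    # First-order CMOS switching power model: P = alpha * C * VDD^2 * f.
    # All parameter values are illustrative assumptions, not measured data.
    def switching_power(alpha, c_eff, vdd, freq):
        """Dynamic switching power (W) of one processing engine."""
        return alpha * c_eff * vdd ** 2 * freq

    ALPHA, C_EFF = 0.1, 1e-9  # switching activity, effective capacitance (F)

    # One engine at full rate and nominal supply.
    p_single = switching_power(ALPHA, C_EFF, vdd=1.0, freq=200e6)

    # Two engines at half the rate each; relaxed timing permits a lower supply.
    p_dual = 2 * switching_power(ALPHA, C_EFF, vdd=0.7, freq=100e6)

    print(f"one engine, full rate:  {p_single * 1e3:.1f} mW")  # 20.0 mW
    print(f"two engines, half rate: {p_dual * 1e3:.1f} mW")    # 9.8 mW, same throughput

The quadratic dependence of switching power on supply voltage is what makes the two half-rate engines cheaper in total than one full-rate engine, despite the doubled area.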
Figure 1-3: Processor energy/operation scaling with performance. (Data courtesy Stanford CPU DB: cpudb.stanford.edu)
Over the last decade, the transition from single core processing to multi-core processing, taking advantage of parallelism, allowed us to continue to scale overall system performance without increasing the energy budget. However, it is also evident from Figure 1-3 that continuing to reduce peak performance for increasing energy efficiency has diminishing returns: moving between low energy points causes large shifts in performance for small energy changes. This puts a limit on the performance enhancements achievable from multi-core processing alone.
1.2.2 Application Specific Processing
The maximum performance enhancement achievable through parallelism is further limited by "Amdahl's Law" [18], which states that the speedup of a program using parallel processing is limited by the time needed for the sequential fraction of the program. If 50% of the processing involved in an algorithm is sequential, then the maximum performance enhancement achievable through parallelism cannot exceed 2x the performance of a single core processor. Achieving significantly higher performance enhancements requires a reformulation of the problem with algorithmic design and optimization that reduces computational complexity and enables highly parallel processing by minimizing sequential dependencies. The energy-efficiency achievable through parallelism is often limited by the energy spent in memory accesses. A 16 bit data access consumes about 5 pJ of energy from on-chip SRAM and about 160 pJ of energy from external DRAM. This compares to about 160 fJ of energy consumed by a 16 bit add operation [19]. Algorithmic optimizations can also significantly enhance processing locality, enabling a large number of computations per memory access and amortizing the energy cost. This approach is inherently application specific.
A general purpose processor spends a significant amount of resources on the control and memory overhead associated with each computation. The high cost of programmability is reflected in the relatively small fraction of energy (2-5%) spent in actual computation as opposed to control (45-50%) and memory access (40-45%) [13]. This makes software implementations of high-complexity applications extremely inefficient. Maximizing energy efficiency necessitates a significant reduction in this overhead by minimizing the control complexity and amortizing the cost of memory accesses over several computations.
Application specific hardware implementations provide the best solutions to trade off programmability for high energy-efficiency and take full advantage of algorithmic optimizations. Figure 1-4 shows the energy-efficiency of processors with different architectures - from CPUs to ASICs, where an operation is defined as a 16 bit addition.
Figure 1-4: Energy efficiency of processors: from CPUs to ASICs.
Processor | Description
----------|------------
1         | Intel Sandy Bridge [20]
2         | Intel Ivy Bridge [21]
3         | 24 Core Programmable Processor [22]
4         | Multimedia DSP [23]
5         | Mobile Processors [24,25]
6         | GPGPU Application Processor [26]
7         | Object Recognition ASIC [27]
8         | SVD ASIC [28]
9         | Video Decoder ASIC [29]
Hardware implementations minimize the control requirement, maximize processing data locality that allows a large number of computations per memory access, take advantage of spatial and temporal parallelism to reduce memory size and bandwidth, and enable deep pipelines with flexible bit-widths. Application specific hardware implementations are the key to achieving exponential enhancements in performance without increasing the energy budget.
1.2.3 Reconfigurable Hardware
Flexibility in implementing various applications after the hardware has been built is a desirable feature. However, depending on the architecture used to provide flexibility, there can be a 2 to 3 orders of magnitude difference in energy-efficiency between these implementations, as seen from Figure 1-4.

Fully customized hardware implementations are well suited for applications that have well defined standards, such as video coding. Most desktop and mobile processors today have embedded hardware accelerators for video coding. However, it is impractical to develop hardware implementations for every iteration of an algorithm in areas such as computer vision and biomedical signal processing, where the algorithms are constantly evolving. Even for standardized applications, the existence of multiple competing standards makes it difficult to develop individual hardware implementations for all the standards. For example, it is impractical for most application processors to implement individual video coding accelerators for more than ten video coding standards with more than 20 different coding profiles. Dedicated video coding engines, such as IVA-HD [30], support multiple video coding standards through a reconfigurable architecture that implements optimized core functional units, such as motion estimation, transform and entropy coding engines, and uses a configurable pipeline with distributed control.
A closer examination of these areas reveals that it may not be necessary to develop hardware accelerators for each individual algorithm. A vast number of computational photography and computer vision applications, for example, use a well defined set of functions, such as non-linear filtering [31], Gaussian or Laplacian pyramids [32,33], Scale Invariant Feature Transform (SIFT) [34], Histogram of Oriented Gradients (HoG) [35] or Haar features [36], etc. These functions are well established and form the foundation of the OpenCV library [37] used for software implementations of almost all computer vision applications. A hardware implementation with highly optimized processing units supporting such functions, and the ability to activate these processing units and configure the datapaths based on the application requirements, provides a very attractive alternative that maintains high energy-efficiency while supporting a large class of applications.
An important aspect of reconfigurable implementations is architecture scalability. The use of individual processing units, as well as the amount of parallelism within each unit, is application specific. Video coding with 4k x 2k resolution at 60 fps has a 20x higher performance requirement than 720p at 30 fps. Different processing block sizes or filter kernels (4 x 4 to 128 x 128 or more) result in different optimal configurations in a parallel processor. Scalable architectures also enable us to explore energy vs. output quality trade-offs, where the user can determine the amount of energy spent in processing, depending on the desired output for the specific application. The ability to effectively turn off processing units and memory banks, through clock and power gating when not used, is key to minimizing energy that is simply wasted by the system.

This thesis demonstrates examples of efficient reconfigurable and scalable hardware implementations for video coding and computational photography applications.
1.2.4 Low-Voltage Circuits
For parallelism to yield enhancements in energy-efficiency, it must be coupled with voltage scaling. The power consumption of CMOS digital circuits operating at voltage VDD, frequency f and driving a load modeled as a capacitance C, is given by:

\[
P_{\text{total}} = P_{\text{switching}} + P_{\text{leakage}} = \alpha \, C \, V_{DD}^{2} \, f + I_{\text{leakage}} \, V_{DD}
\]

where α is the switching activity of a logic gate and I_leakage is the leakage current.
For varying performance requirements, scaling frequency only provides a linear scaling in power consumption in the switching-power dominated region of operation. However, scaling VDD along with the frequency, to match the peak performance of the processor, provides a cubic scaling in power consumption. To take full advantage of Dynamic Voltage-Frequency Scaling (DVFS) [38], circuit implementations must be capable of operating across a wide voltage range, from nominal VDD down to the minimum energy point, which typically occurs near or below the threshold voltage (VT) and minimizes the energy per operation [39].
When VDD is reduced to the range of 0.5 V, statistical variations in the transistor threshold voltage become an important factor in determining logic performance. Random Dopant Fluctuations (RDF) are a dominant source of variations at low voltage, causing random, local threshold voltage shifts [40-42]. Local variations have long been known in analog design and in SRAM design [43,44]. With technology scaling, they have become a major concern for digital design as well. At nominal voltage, local variations in VT may result in 5%-10% variation in the logic timing. However, at low voltage, these variations can result in timing path delays with standard deviation comparable to the global corner delay, and must be accounted for during timing closure in order to ensure a robust, manufacturable design. Figure 1-5 shows the delay of a 28 nm CMOS logic gate as the voltage is lowered from 1 V to 0.5 V. The nominal delay scales by a factor of 15. But taking into account stochastic variations, the total 3σ delay scales by a factor of 36.
Figure 1-5: Delay scaling with VDD. Corner delay scales by 15x, whereas total delay (corner + 3σ stochastic delay) scales by 36x.
Typically, reliability at low-voltage is achieved by over-designing the system with large design margins to account
for variations. Such design margins have a significant energy cost [12].
This thesis demonstrates low-voltage design using statistical static timing analysis techniques that minimize the overhead of large design margins to account for variations, while ensuring reliable low-voltage operation with 3σ confidence.
1.3 Thesis Contributions
The broad focus of this thesis is to address the challenges of implementing high-complexity
applications with high-performance requirements on mobile platforms through a comprehensive view of system design, where algorithms are designed and optimized to enhance
processing locality and enable highly parallel architectures that can be implemented using
low-power low-voltage circuits to achieve maximally energy-efficient systems.
This is accomplished by starting with application areas and exploring key features that
form the basis of a wide array of functionalities in that area. The algorithms underlying these features are optimized for hardware implementation, considering trade-offs
that reduce computational complexity and memory requirements. Parallel architectures
with reconfigurability and scalability are developed to support real-time performance at
low frequencies. Finally, circuits are implemented to provide a wide voltage-frequency
operating range and ensure minimum energy operation.
The main contributions of this thesis are in the following areas:
• Shared Transform Engine for Video Coding: A shared transform engine for H.264 and VC-1 video coding standards that supports Quad Full-HD (4k x 2k) resolution at 30 fps is presented in Chapter 2. The transform engine is a critical part of the video encoding/decoding process. High coding efficiency often comes at a cost of increased complexity in the transform module. This work explores algorithmic optimizations where a larger transform matrix (8 x 8 or larger) is factorized into multiple
small (2 x 2) matrices that can be computed much more efficiently. The factorization can also be formulated in such a way that Discrete Cosine Transform (DCT)
based transform matrices corresponding to multiple video coding standards result in
the same factors. This is key to achieving an efficient shared implementation. The
size of transpose memory for 2D transform becomes a key concern for large transforms. Architectural schemes to eliminate an explicit transpose memory and reuse
an output buffer to save area and power are explored. Data dependent processing is
used to further reduce the power consumption of the transform engine by lowering
switching activity. Both the forward and inverse integer transforms are implemented
to support encoding as well as decoding operations. The proposed techniques are
demonstrated through a testchip, implemented using 45 nm CMOS technology. Statistical circuit design techniques ensure a wide operating range and reliable operation
down to 0.3 V. The testchip is used to benchmark different implementations of transform engines, such as a reconfigurable implementation vs. individual implementations for the two standards and implementations with and without transpose memory, and to evaluate the different architectures for power and area efficiency.
• Reconfigurable Processor for Computational Photography: A wide array
of computational photography applications such as High Dynamic Range (HDR)
imaging, low-light enhancement, tone management and video enhancement rely on
non-linear filtering techniques such as bilateral filtering. Chapter 3 presents the
development of a reconfigurable architecture for multiple computational photography applications. Algorithmic optimizations, leveraging the bilateral grid structure,
are explored to transform an inefficient non-linear filtering operation into an efficient linear filtering operation with significant reductions in computational and
memory requirements. Algorithm-architecture co-design enables a highly parallel
and scalable architecture that can be configured to implement various functionalities, including HDR imaging, low-light enhancement and glare reduction. Memory
management techniques are explored to minimize the external DRAM bandwidth and power consumption. The scalable architecture enables users to explore energy/resolution trade-offs for energy-scalable processing. The proposed techniques
are demonstrated through a testchip, implemented using 40 nm CMOS technology.
Careful design for low-voltage operation ensures reliable operation down to 0.5 V,
while achieving real-time performance. The comprehensive system design approach
from algorithms to circuits enables a 280x enhancement in energy-efficiency compared to implementations on commercial mobile processors.
• Portable Platform for Medical Imaging: Medical imaging techniques are important tools in the diagnosis and treatment of various skin conditions. Widespread
use of such imaging techniques has been limited by factors such as size, weight,
cost and complex user interface.
Treatments for skin conditions require reliable
outcome measures to compare studies and to assess the changes over time. Chapter 4 presents the development of a portable medical imaging platform for accurate
objective quantification of skin lesion progression. Computer vision techniques are
extended and enhanced to identify lesion contours in images captured using smartphones and quantify the progression through feature matching. The approach is
validated through a pilot study in collaboration with the Brigham and Women's
Hospital. Algorithmic optimizations are explored to improve software run-time performance, memory bandwidth and power consumption. These optimizations pave
the way for energy-efficient hardware implementations that could enable real-time
processing on mobile platforms.
Chapter 2

Transform Engine for Video Coding
Multimedia applications, such as video playback, have become prevalent in portable multimedia devices. Video accounted for 53% of the mobile data traffic in 2013 and is expected
to increase 14x between 2013 and 2018, accounting for 69% of total mobile data traffic
by 2018 [45]. Such applications present the unique challenge of requiring high performance while ensuring ultra-low power operation, to be efficient on battery-operated mobile
devices. Low-power hardware implementations targeted to a specific standard, such as
application processors for H.264 video encoding [46] and decoding [47,48], have been
proposed. A universal media player requires supporting multiple video coding standards.
The high power and area cost of dedicated video encoding/decoding for each standard necessitates the development of a shared architecture for multi-standard video coding. Dedicated
video coding engines supporting multiple standards have recently been proposed using reconfigurable architectures. The IVA-HD video coding engine [30] supports encoding and
decoding for multiple standards, such as H.264, H.263, MPEG 4, MPEG 1/2, WMV9,
VC-1, MJPEG and AVS. It implements optimized core functional units, such as motion
estimation, transform and entropy coding engines, and uses a configurable pipeline with
distributed control to achieve programability for the different standards. A multi-format
video codec application processor, supporting H.264, H.263, MPEG 4, MPEG 2, VC-1
and VP8, is proposed in [49]. Hardwired logic is combined with a dedicated ARMv5 architecture CPU to provide programmability for supporting multiple standards.
Energy efficiency of circuits is a critical concern for portable multimedia applications. It
is important not only to optimize functionality but also achieve low energy per operation.
Dynamic Voltage-Frequency Scaling (DVFS) is an important technique for reducing power
consumption while achieving high peak computational performance [50]. The energy efficiency of digital circuits is maximized at very low supply voltages, near or below the transistor threshold voltage, such as 0.5 V [51]. This makes the ability to operate at low voltage (VDD < 0.5 V) a key component of achieving low power operation. This work explores power reduction techniques at various stages, such as algorithms, architectures and circuits. Combining aggressive voltage scaling, by operating at VDD ≈ 0.5 V, with increased parallelism and pipelining, by processing 16 pixels in each clock cycle, provides an effective way of reducing power while achieving high performance, such as 4k x 2k Quad Full-HD (3840 x 2160) video coding at 30 frames per second (fps), at low frequency.
The transform engine is a critical part of the video encoding/decoding process. High coding efficiency often comes at a cost of increased complexity in the transform module, such as
variable size transforms (4x4, 8x8, 8x4, 4x8, etc.) as well as hierarchical transform,
where Discrete Cosine Transform (DCT) coefficients are further encoded using Hadamard
transform. DCT is the most commonly used transform in video and image coding applications. DCT has an excellent energy compaction property, which leads to good compression
efficiency of the transform. However, the irrational numbers in the transform matrix make
its exact implementation with finite precision hardware impossible, leading to a drift (difference between reconstructed video frames in encoder and decoder) between forward and
inverse transform coefficients. Recent video coding standards, such as H.264/AVC [52,53]
and VC-1 [54-56] use a variation of the DCT, known as integer transform, where the
transform matrix is an integer approximation of the DCT. This allows exact computation
of the inverse transform using integer arithmetic and also allows implementation using additions and shifts, without any multiplications [57]. H.264/AVC and VC-1 also use variable
size transforms, such as 8x8 and 4x4 in H.264/AVC (High profile) and 8x8, 8x4, 4x8
and 4x4 in VC-1 (Advanced profile), to more effectively exploit the spatial correlation and
improve coding efficiency. Construction of computationally efficient integer transform matrices is proposed in [58], which allows implementation using 16 bit arithmetic with rate-distortion performance similar to 32 bit or floating point DCT implementations.
Recent research has focused on efficient implementation of the integer transforms. Matrix
decomposition is used to implement 4x4 and 8x8 integer transforms for VC-1 in [59]. A
hardware sharing scheme for inverse integer transforms of H.264, MPEG-4 and VC-1 using
delta coefficient matrix is proposed in [60]. Matrix decomposition with sparse matrices and
matrix offset computations is proposed in [61] for a shared 1D inverse integer transform
of H.264 and VC-1. Matrix decomposition and transform symmetry are used to develop a
computationally efficient approach for the 1D 8x8 inverse transform for VC-1 in [62]. Similar
ideas are used to achieve a shared architecture for 1D 8 x 8 forward and inverse transforms
of H.264 in [63]. A circuit architecture that can be applied to standards such as MPEG
1/2/4, H.264 and VC-1 is proposed in [64] based on similarity of 4x4 and 8x8 DCT
matrices.
In this work, a shared transform for H.264/AVC and VC-1 video coding standards is
proposed [65]. Forward integer transform and inverse integer transform are both implemented to support encoding as well as decoding operations. We also propose a scheme
to eliminate an explicit transpose memory, which is required in 2D transform implementation, to save area and power. This work also explores data dependent processing to
further reduce the power consumption of the transform engine.
2.1 Transform Engine Design
This section explores the ideas of matrix factorization for hardware sharing, eliminating
an explicit transpose memory in 2D transform and data dependent processing to reduce
switching activity, to achieve a shared transform engine for H.264/AVC and VC-1 video
coding standards. The objective is to design a transform engine that can support video
coding with Quad Full-HD (QFHD) resolution at 30 fps, with very low power consumption.
2.1.1 Integer Transform: H.264/AVC & VC-1
H.264/AVC uses 4x4 transform in baseline and main profile and both 4x4 and 8x8
transforms in the high profile. VC-1 uses 4x4, 4x8, 8x4 and 8x8 transforms in the
Advanced profile. The transform matrices for the H.264/AVC and VC-1 standards are defined
in Appendix A.
The 4x4 transform matrices for H.264 and VC-1, as well as the 8x8 transform matrices,
are structurally identical. This allows us to generate a unified 4x4 transform matrix and
a unified 8x8 transform matrix for H.264 and VC-1, as defined by eq. (2.1) and eq. (2.2)
respectively.
$$
T_4 = \begin{bmatrix}
\alpha & \alpha & \alpha & \alpha \\
\beta & \gamma & -\gamma & -\beta \\
\alpha & -\alpha & -\alpha & \alpha \\
\gamma & -\beta & \beta & -\gamma
\end{bmatrix} \tag{2.1}
$$

H.264: α = 1, β = 1, γ = 1/2 and VC-1: α = 17, β = 22, γ = 10.
$$
T_8 = \begin{bmatrix}
a & b & f & c & a & d & g & e \\
a & c & g & -e & -a & -b & -f & -d \\
a & d & -g & -b & -a & e & f & c \\
a & e & -f & -d & a & c & -g & -b \\
a & -e & -f & d & a & -c & -g & b \\
a & -d & -g & b & -a & -e & f & -c \\
a & -c & g & e & -a & b & -f & d \\
a & -b & f & -c & a & -d & g & -e
\end{bmatrix} \tag{2.2}
$$

H.264: a = 8, b = 12, c = 10, d = 6, e = 3, f = 8, g = 4
VC-1: a = 12, b = 16, c = 15, d = 9, e = 4, f = 16, g = 6.
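To make the shared structure concrete, the following sketch (Python with NumPy, illustrative only; the hardware realizes these matrices with adders and shifts rather than multiplications) instantiates eq. (2.2) for both standards:

```python
import numpy as np

def unified_t8(a, b, c, d, e, f, g):
    """Generalized 8x8 integer transform matrix of eq. (2.2)."""
    return np.array([
        [a,  b,  f,  c,  a,  d,  g,  e],
        [a,  c,  g, -e, -a, -b, -f, -d],
        [a,  d, -g, -b, -a,  e,  f,  c],
        [a,  e, -f, -d,  a,  c, -g, -b],
        [a, -e, -f,  d,  a, -c, -g,  b],
        [a, -d, -g,  b, -a, -e,  f, -c],
        [a, -c,  g,  e, -a,  b, -f,  d],
        [a, -b,  f, -c,  a, -d,  g, -e],
    ])

T8_H264 = unified_t8(8, 12, 10, 6, 3, 8, 4)    # H.264/AVC coefficients
T8_VC1  = unified_t8(12, 16, 15, 9, 4, 16, 6)  # VC-1 coefficients
```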
The separable 2D transforms are defined as given in Table 2.1, where m = {8, 4} and n = {8, 4}, X is the prediction residue and Y is the transformed data.

Table 2.1: Separable 2D transform definitions for H.264/AVC and VC-1

            Forward Transform                 Inverse Transform
H.264       Tm · Xmxm · Tm^T                  Tm^T · Ymxm · Tm
VC-1        (Tm · Xmxn · Tn^T) ∘ Nmxn         (Tm^T · Ymxn · Tn) / 1024
The scaling factors in transform definitions can be absorbed in the quantization process.
This work focuses on implementing the transform matrix computations.
2.1.2 Matrix Factorization for Hardware Sharing
Transform matrices for H.264/AVC and VC-1 have identical structure, as shown in eq. (2.1)
and eq. (2.2). In this section, we will exploit this fact to design a shared transform engine
for H.264/AVC and VC-1.
The 8x8 transform matrix can be decomposed into two 4x4 matrices using even-odd
decomposition [66], given by eq. (2.3).
$$
T_8 = B_8 \cdot M_8 \cdot P_8 \tag{2.3}
$$

where

$$
M_8 = \begin{bmatrix}
a & f & a & g & 0 & 0 & 0 & 0 \\
a & g & -a & -f & 0 & 0 & 0 & 0 \\
a & -g & -a & f & 0 & 0 & 0 & 0 \\
a & -f & a & -g & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & e & -d & c & -b \\
0 & 0 & 0 & 0 & d & -b & e & c \\
0 & 0 & 0 & 0 & c & -e & -b & -d \\
0 & 0 & 0 & 0 & b & c & d & e
\end{bmatrix} \tag{2.4}
$$

P8 is a permutation matrix that has zero computational complexity and B8 can be implemented using 8 adders.
$$
B_8 = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & -1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & -1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & -1 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & -1
\end{bmatrix}
\quad\text{and}\quad
P_8 = \begin{bmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{bmatrix} \tag{2.5}
$$
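A minimal numeric check of the even-odd decomposition of eq. (2.3), assuming the matrices as reconstructed above and reusing unified_t8 from the earlier sketch:

```python
import numpy as np

def even_odd_factors(a, b, c, d, e, f, g):
    """Factors of eq. (2.3): T8 = B8 @ M8 @ P8."""
    Te = np.array([[a, f, a, g], [a, g, -a, -f],
                   [a, -g, -a, f], [a, -f, a, -g]])      # even 4x4 block
    To = np.array([[e, -d, c, -b], [d, -b, e, c],
                   [c, -e, -b, -d], [b, c, d, e]])       # odd 4x4 block
    M8 = np.block([[Te, np.zeros((4, 4), int)],
                   [np.zeros((4, 4), int), To]])
    P8 = np.eye(8, dtype=int)[[0, 2, 4, 6, 1, 3, 5, 7]]  # input permutation
    B8 = np.zeros((8, 8), dtype=int)                     # 8-adder butterfly
    for i in range(4):
        B8[i, i] = 1;     B8[i, 7 - i] = 1               # y_i     = e_i + o_{3-i}
        B8[7 - i, i] = 1; B8[7 - i, 7 - i] = -1          # y_{7-i} = e_i - o_{3-i}
    return B8, M8, P8

B8, M8, P8 = even_odd_factors(8, 12, 10, 6, 3, 8, 4)     # H.264 coefficients
assert np.array_equal(B8 @ M8 @ P8, unified_t8(8, 12, 10, 6, 3, 8, 4))
```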
We propose further factorization of the even and odd components of M8 to achieve hardware sharing between the H.264 and VC-1 matrices. The factorization scheme is derived in such a way that the H.264 and VC-1 matrices yield the maximum number of common factors.
The even component of H.264 is factorized as shown in eq. (2.6).
$$
H_e = \begin{bmatrix}
8 & 8 & 8 & 4 \\
8 & 4 & -8 & -8 \\
8 & -4 & -8 & 8 \\
8 & -8 & 8 & -4
\end{bmatrix}
= \begin{bmatrix}
1 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 \\
0 & 1 & 0 & -1 \\
1 & 0 & -1 & 0
\end{bmatrix}
\cdot 4 \begin{bmatrix}
2 & 0 & 2 & 0 \\
2 & 0 & -2 & 0 \\
0 & 2 & 0 & 1 \\
0 & 1 & 0 & -2
\end{bmatrix}
= F_{1e} \cdot 4F_{2e} \tag{2.6}
$$
The even component of VC-1 is factorized as shown in eq. (2.7).
$$
V_e = \begin{bmatrix}
12 & 16 & 12 & 6 \\
12 & 6 & -12 & -16 \\
12 & -6 & -12 & 16 \\
12 & -16 & 12 & -6
\end{bmatrix}
= F_{1e} \cdot \left( 6 F_{2e} + 4 \begin{bmatrix}
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 0 & -1
\end{bmatrix} \right)
= F_{1e} \cdot (6F_{2e} + 4F_{3e}) \tag{2.7}
$$
Similarly, we propose factorizing the odd component for H.264 as shown in eq. (2.8).
$$
H_o = \begin{bmatrix}
3 & -6 & 10 & -12 \\
6 & -12 & 3 & 10 \\
10 & -3 & -12 & -6 \\
12 & 10 & 6 & 3
\end{bmatrix}
= \begin{bmatrix}
3 & 2 & -2 & 0 \\
-2 & 0 & -3 & 2 \\
2 & -3 & 0 & 2 \\
0 & 2 & 2 & 3
\end{bmatrix}
\cdot \begin{bmatrix}
1 & 0 & 0 & -4 \\
0 & 1 & 4 & 0 \\
0 & 4 & -1 & 0 \\
4 & 0 & 0 & 1
\end{bmatrix}
= (2F_{2o} + 3F_{3o}) \cdot F_{1o} \tag{2.8}
$$

with

$$
F_{2o} = \begin{bmatrix} 0 & 1 & -1 & 0 \\ -1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{bmatrix},
\qquad
F_{3o} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & -1 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
$$
And the odd component for VC-1 is factorized as shown in eq. (2.9).
$$
V_o = \begin{bmatrix}
4 & -9 & 15 & -16 \\
9 & -16 & 4 & 15 \\
15 & -4 & -16 & -9 \\
16 & 15 & 9 & 4
\end{bmatrix}
= \begin{bmatrix}
4 & 3 & -3 & 0 \\
-3 & 0 & -4 & 3 \\
3 & -4 & 0 & 3 \\
0 & 3 & 3 & 4
\end{bmatrix}
\cdot F_{1o}
= (3F_{2o} + 4F_{3o}) \cdot F_{1o} \tag{2.9}
$$
Notice that the major factors, F1e and F2e, are common between the even components of H.264 and VC-1. The factor F3e for VC-1 is a very sparse matrix with very little computational complexity. Similarly, all the factors, F1o, F2o and F3o, are common between the odd components of H.264 and VC-1. This factorization maximizes hardware sharing between both the even and odd components of H.264 and VC-1.
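The claimed sharing can be verified numerically. The following sketch checks eqs. (2.6) through (2.9) with the factor matrices as reconstructed above:

```python
import numpy as np

F1e = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [0, 1, 0, -1], [1, 0, -1, 0]])
F2e = np.array([[2, 0, 2, 0], [2, 0, -2, 0], [0, 2, 0, 1], [0, 1, 0, -2]])
F3e = np.array([[0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, -1]])
F1o = np.array([[1, 0, 0, -4], [0, 1, 4, 0], [0, 4, -1, 0], [4, 0, 0, 1]])
F2o = np.array([[0, 1, -1, 0], [-1, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 0]])
F3o = np.array([[1, 0, 0, 0], [0, 0, -1, 0], [0, -1, 0, 0], [0, 0, 0, 1]])

# Even components, eqs. (2.6)-(2.7): only the very sparse F3e differs.
He = np.array([[8, 8, 8, 4], [8, 4, -8, -8], [8, -4, -8, 8], [8, -8, 8, -4]])
Ve = np.array([[12, 16, 12, 6], [12, 6, -12, -16],
               [12, -6, -12, 16], [12, -16, 12, -6]])
assert np.array_equal(F1e @ (4 * F2e), He)
assert np.array_equal(F1e @ (6 * F2e + 4 * F3e), Ve)

# Odd components, eqs. (2.8)-(2.9): all factors are shared.
Ho = np.array([[3, -6, 10, -12], [6, -12, 3, 10],
               [10, -3, -12, -6], [12, 10, 6, 3]])
Vo = np.array([[4, -9, 15, -16], [9, -16, 4, 15],
               [15, -4, -16, -9], [16, 15, 9, 4]])
assert np.array_equal((2 * F2o + 3 * F3o) @ F1o, Ho)
assert np.array_equal((3 * F2o + 4 * F3o) @ F1o, Vo)
```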
The hardware architecture for the shared implementation of the even component for H.264 and VC-1, using the factorization defined by eq. (2.6) and eq. (2.7), is shown in Figure 2-1. The architecture for the odd component, using the factorization defined by eq. (2.8) and eq. (2.9), is shown in Figure 2-2. A column of input data is represented as:
$$
[x_0, x_1, x_2, x_3, x_4, x_5, x_6, x_7]^T \tag{2.10}
$$
Reconfigurability is achieved by using multiplexers to program the datapath, enabled by
a flag indicating the standard (H.264 or VC-1) being used.
The shared 4x4 transform for H.264 and VC-1 is achieved in a similar manner, as defined by eq. (2.11), where T4 is defined by eq. (2.1):

$$
T_H = (F_{1e} \cdot F_{2e}) \gg 1
\quad\text{and}\quad
T_V = F_{1e} \cdot (8F_{2e} + 4F_{3e} + F_4) \tag{2.11}
$$

where F1e, F2e and F3e are defined in eq. (2.6) and eq. (2.7), and F4 in eq. (2.12).
$$
F_4 = \begin{bmatrix}
1 & 0 & 1 & 0 \\
1 & 0 & -1 & 0 \\
0 & 2 & 0 & 2 \\
0 & 2 & 0 & -2
\end{bmatrix} \tag{2.12}
$$
2.1.3 Eliminating Transpose Memory
Conventional row-column decomposition uses the same 1D transform architecture for both row and column operations. This requires a transpose memory between the row-wise 1D transform and the column-wise 1D transform. The transpose memory can be a significant part of the total area (as high as 48% of the gate count in one of the benchmarked designs) and power consumed by the 2D transform.

We propose an approach that avoids the transpose memory by using separate designs for the row-wise and column-wise 1D transforms and using the output buffer to store intermediate data. By providing the output buffer with enough ports to read/write 2D data, referred to as a 2D output buffer, an explicit transposition is avoided.
Figure 2-1: Hardware architecture of the even component. The figure shows data paths exercised in (a) H.264 and (b) VC-1.
Figure 2-2: Hardware architecture of the odd component. The figure shows data paths exercised in (a) H.264 and (b) VC-1.
In this implementation, we spread the processing of an 8x8 block over 4 clock cycles. In each cycle, we process 8x2 data, i.e., two columns (0 and 7, 1 and 6, 2 and 5, 3 and 4) of the 8x8 input, to obtain two transformed columns, as shown in Figure 2-3.
Figure 2-3: Column-wise 1D transform: 8x8 data is processed over four clock cycles, C0 to C3: columns 0 and 7 in C0, 1 and 6 in C1, 2 and 5 in C2, 3 and 4 in C3. Two transformed columns are generated in each clock cycle.
For the row-wise computation, an entire row (transposed column) is not available in each clock cycle without a transpose memory. To overcome this problem, we compute only partial products for all 8x8 coefficients in each clock cycle and store them in the 8x8 output buffer, as shown in Figure 2-4.
The processing in Figure 2-3 and Figure 2-4 is shown as a direct inner product for simplicity. The implementation performs the same processing using the matrix decomposition approach described in Section 2.1.2.
Figure 2-4: Row-wise 1D transform: Partial products for all 64 coefficients are computed in each clock cycle, using the 2x8 data obtained by transposing the two columns generated by the 1D column-wise transform. The partial products are stored in the output buffer. At the end of four clock cycles, the output buffer contains the complete 2D transformed output.
Over four clock cycles, we add and accumulate the results for all 8x8 coefficients in the output buffer with 64 reads/writes each cycle, so that at the end of the fourth clock cycle we get the complete result for the entire 8x8 block. The partial products computed in each clock cycle, for the column vector [u00, u01, u02, u03, u04, u05, u06, u07]^T, are shown in Table 2.2.
These partial products are generated by the hardware architectures shown in Figure 2-5.
The appropriate coefficients are selected by the multiplexers in each clock cycle.
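The scheduling can be summarized functionally. The sketch below (illustrative only; the chip computes the partial products through the factored even-odd datapaths rather than direct inner products) shows how accumulating outer-product partial products in the output buffer over four cycles yields the full 2D transform with no transpose memory:

```python
import numpy as np

def transform_2d_no_transpose(X, T):
    """2D transform Y = T @ X @ T.T without a transpose memory.

    Two columns are transformed per cycle (Figure 2-3); the row-wise stage
    then accumulates outer-product partial products for all 64 outputs in
    the 2D output buffer (Figure 2-4)."""
    Y = np.zeros((8, 8), dtype=np.int64)                  # 2D output buffer
    for c0, c1 in [(0, 7), (1, 6), (2, 5), (3, 4)]:       # cycles C0..C3
        u = T @ X[:, [c0, c1]]                            # column-wise 1D transform
        Y += u[:, [0]] * T[:, c0] + u[:, [1]] * T[:, c1]  # partial products
    return Y

X = np.random.randint(-128, 128, (8, 8))
T = unified_t8(8, 12, 10, 6, 3, 8, 4)                     # from the earlier sketch
assert np.array_equal(transform_2d_no_transpose(X, T), T @ X @ T.T)
```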
Table 2.2: Row-wise transform computations for the even and odd components over four clock cycles (each entry: H.264 / VC-1)

Even component:
Clk   Output 0          Output 1          Output 2          Output 3
C0    8u00 / 12u00      8u00 / 12u00      8u00 / 12u00      8u00 / 12u00
C1    4u06 / 6u06       -8u06 / -16u06    8u06 / 16u06      4u06 / 6u06
C2    8u02 / 16u02      4u02 / 6u02       -4u02 / -6u02     -8u02 / -16u02
C3    8u04 / 12u04      -8u04 / -12u04    -8u04 / -12u04    8u04 / 12u04

Odd component:
Clk   Output 0          Output 1          Output 2          Output 3
C0    -12u07 / -16u07   10u07 / 15u07     -6u07 / -9u07     3u07 / 4u07
C1    3u01 / 4u01       6u01 / 9u01       10u01 / 15u01     12u01 / 16u01
C2    10u05 / 15u05     3u05 / 4u05       -12u05 / -16u05   6u05 / 9u05
C3    -6u03 / -9u03     -12u03 / -16u03   -3u03 / -4u03     10u03 / 15u03

2.1.4 Data Dependent Processing
In addition to processing optimization, it is also important to take into account the nature
of input data to further achieve power savings. By exploiting the characteristics of the data
being processed, architectures can be designed to minimize switching activity, optimize
pipeline bit widths and perform a variable number of operations per block [67]. Application-specific SRAM designs for video coding applications that exploit the correlation of storage
data and signal statistics to reduce the bit-line switching activity and consequently the
energy consumption are proposed in [68,69].
The transform engine operates on the 8-bit prediction residue. Figure 2-6 shows the histogram
of the prediction residue for a number of test sequences. This analysis shows that more
than 80% of the prediction residue lies in the range -32 to +32. Due to 2's complement
processing, a large number of bits are flipped every time a number changes from a small negative value to a small positive value. At the input, this results in high switching activity around zero.

Figure 2-5: Hardware architecture of the (a) even and (b) odd component. Std = {0: H.264, 1: VC-1}.

Switching activity at the input propagates through the system, though the
effect is different at different nodes.
For example, a node implementing functionality
similar to XOR shows high switching activity, whereas other nodes show significantly lower
switching activity. Because of this, different input patterns affect the system switching
activity differently. Overall, we observe that high switching activity at the input results
in a high switching activity for the entire system.
Figure 2-6: Histogram of the prediction residue for a number of test sequences.

Figure 2-7 shows the correlation between switching activity at the input and the system switching activity for 150 different input sequences. Zero input switching activity refers
to no bits changing at the input and 1 refers to all the input bits switching simultaneously
from 0 to 1 or 1 to 0. For the system switching activity, 0 refers to no activity, which
corresponds to leakage power, and 1 refers to maximum power consumption. The plot
shows a strong correlation of 0.83 between input switching activity and system switching
activity. This indicates that reducing the input switching activity can significantly reduce the switching activity of the entire system.
Figure 2-7: Correlation between input switching activity and system switching activity. The plot also shows a linear regression for the data. Measured correlation is 0.83.
In order to reduce the switching activity, we pre-process the input data by adding a fixed DC bias to the prediction residue. To accommodate the added bias, the dynamic range is increased from 8 bit to 9 bit. The DC bias shifts the input histogram to the right. For example, for a DC bias of 32, more than 80% of the input data falls within 0 to 64. Thus, fewer than 6 LSBs flip during most operations, reducing the overall switching activity. Note that the DC bias only affects the DC coefficient in the transform output. This can easily be corrected by subtracting a corresponding bias from the DC coefficient at the output. Figure 2-8 shows the reduction in switching activity and power as a function of the DC bias value, despite the one-bit increase in bit width, for different video sequences.
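The effect of the DC bias on input switching activity can be illustrated with a toy experiment (the residue distribution below is synthetic; the thesis measurements use real test sequences):

```python
import numpy as np

def toggles(a, b, bits=9):
    """Number of bit positions that differ between two 2's-complement words."""
    mask = (1 << bits) - 1
    return bin((a & mask) ^ (b & mask)).count("1")

rng = np.random.default_rng(0)
residue = rng.integers(-32, 33, 10_000)  # synthetic residue, peaked at zero

for bias in (0, 64):
    flips = sum(toggles(int(x) + bias, int(y) + bias)
                for x, y in zip(residue, residue[1:]))
    print(f"bias = {bias:3d}: {flips} input bit flips")
```

With no bias, every sign change flips nearly all 9 bits; with a bias of 64 the values stay positive and only the low-order bits toggle.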
Figure 2-8: Switching activity and power consumption in the transform as a function of the DC bias applied to the input data.
On average, the switching activity and power consumption reach a minimum for DC
bias of about 64 and then start to increase again. This is because as a higher DC bias is
applied, more MSBs start switching, partially offsetting the effect of reduction in switching
activity in the LSBs. The data dependent processing scheme has less than 5% hardware cost and reduces the average switching activity by 30% and the average power by 15% at a DC bias of 64.
2.2 Future Video Coding Standards
The ideas proposed in this work have general applicability beyond H.264/AVC and VC-1
video coding standards. In this section, we will look at applying these ideas to the 8x8 transform of the next-generation video coding standard, High-Efficiency Video Coding
(HEVC) [70].
The HEVC standard recommendation [70] defines the 8x8 1D transform as given by
eq. (2.13).
$$
T_8 = \begin{bmatrix}
64 & 89 & 83 & 75 & 64 & 50 & 36 & 18 \\
64 & 75 & 36 & -18 & -64 & -89 & -83 & -50 \\
64 & 50 & -36 & -89 & -64 & 18 & 83 & 75 \\
64 & 18 & -83 & -50 & 64 & 75 & -36 & -89 \\
64 & -18 & -83 & 50 & 64 & -75 & -36 & 89 \\
64 & -50 & -36 & 89 & -64 & -18 & 83 & -75 \\
64 & -75 & 36 & 18 & -64 & 89 & -83 & 50 \\
64 & -89 & 83 & -75 & 64 & -50 & 36 & -18
\end{bmatrix} \tag{2.13}
$$
Notice that the structure of this transform matrix is the same as that of the generalized matrix for H.264/AVC and VC-1, defined in eq. (2.2), where: a = 64, b = 89, c = 75, d = 50, e = 18, f = 83, g = 36.
The idea of matrix decomposition for hardware sharing, as described in Section 2.1.2,
can be applied to eq. (2.13) as well. Extension of even-odd decomposition for HEVC
transform to reduce hardware complexity is described in [71]. Even-Odd decomposition,
performed as defined in eq. (2.3), gives the even and odd components for the 8x8 HEVC
matrix, defined by eq. (2.14) and eq. (2.15) respectively.
$$
HEVC_e = \begin{bmatrix}
64 & 83 & 64 & 36 \\
64 & 36 & -64 & -83 \\
64 & -36 & -64 & 83 \\
64 & -83 & 64 & -36
\end{bmatrix} \tag{2.14}
$$

$$
HEVC_o = \begin{bmatrix}
18 & -50 & 75 & -89 \\
50 & -89 & 18 & 75 \\
75 & -18 & -89 & -50 \\
89 & 75 & 50 & 18
\end{bmatrix} \tag{2.15}
$$
The even and odd components can be further factorized as given by eq. (2.16) and
eq. (2.17) respectively.
$$
HEVC_e = F_{1e} \cdot (32 F_{2e} + 15 F_{3e} + 4 F_{4e}) \tag{2.16}
$$

$$
HEVC_o = (15 F_{2o} + 22 F_{3o} + F_{4o}) \cdot F_{1o} + 5 F_{5o} \tag{2.17}
$$
Notice that the factors F1e, F2e, F3e, F1o, F2o and F3o are the same as those defined in eq. (2.6), eq. (2.7), eq. (2.8) and eq. (2.9) for the H.264 and VC-1 factorization. F4e, F4o and F5o, defined by eq. (2.18), eq. (2.19) and eq. (2.20) respectively, are extremely sparse matrices.
$$
F_{4e} = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & -1 \end{bmatrix} \tag{2.18}
$$

$$
F_{4o} = \begin{bmatrix} 0 & 0 & 0 & -1 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix} \tag{2.19}
$$

$$
F_{5o} = \begin{bmatrix} 0 & -1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & -1 \\ 0 & 0 & 1 & 0 \end{bmatrix} \tag{2.20}
$$
Since most of the factors for the HEVC transform matrix are the same as those for H.264 and VC-1, it is possible to achieve an efficient hardware implementation with a shared
architecture between H.264, VC-1 and HEVC, as shown in Figure 2-9 and Figure 2-10,
for even and odd components respectively.
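A numeric check of eqs. (2.16) and (2.17), reusing the shared factor matrices from the earlier sketch together with the HEVC-specific factors of eqs. (2.18) through (2.20):

```python
import numpy as np

# HEVC-specific sparse factors of eqs. (2.18)-(2.20); F1e..F3o are the
# shared H.264/VC-1 factors from the earlier sketch.
F4e = np.array([[0, 0, 0, 0], [0, 0, 0, 0], [0, 1, 0, 1], [0, 1, 0, -1]])
F4o = np.array([[0, 0, 0, -1], [0, -1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0]])
F5o = np.array([[0, -1, 0, 0], [1, 0, 0, 0], [0, 0, 0, -1], [0, 0, 1, 0]])

HEVCe = np.array([[64, 83, 64, 36], [64, 36, -64, -83],
                  [64, -36, -64, 83], [64, -83, 64, -36]])
HEVCo = np.array([[18, -50, 75, -89], [50, -89, 18, 75],
                  [75, -18, -89, -50], [89, 75, 50, 18]])

assert np.array_equal(F1e @ (32 * F2e + 15 * F3e + 4 * F4e), HEVCe)
assert np.array_equal((15 * F2o + 22 * F3o + F4o) @ F1o + 5 * F5o, HEVCo)
```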
This demonstrates that matrix factorization can be extended to standards beyond H.264
and VC-1 to achieve shared hardware implementations for multiple standards.
The identical structure of the transform matrix, as given by eq. (2.2), for H.264, VC-1
and HEVC arises because of the symmetric nature of coefficients in the DCT, which forms
the basis of transforms in all of these standards. As long as a video coding standard uses
a transform based on the DCT, it will always result in a matrix with the structure of eq. (2.2). Transform
matrices for different standards are multiples of each other with slight variations and can
be factorized into very similar factors to maximize sharing.
The idea of eliminating an explicit transpose memory in 2D transform, as described in
Section 2.1.3, is equally applicable to HEVC. The processing, over four clock cycles, can
be done in the same way as used for H.264 and VC-1, with the results accumulated in the
output buffer.
Figure 2-9: Hardware architecture of the even component for the shared 8x8 transform for H.264, VC-1 and HEVC. Std = {0: VC-1, 1: H.264, 2: HEVC}. The highlighted blocks are the same as those used in the shared H.264/VC-1 architecture, shown in Figure 2-1.
Data dependent processing, as described in Section 2.1.4, is independent of the video
coding standard being used. Since the nature of the input data (the prediction residue),
as shown in Figure 2-6, is the same for HEVC as for H.264 and VC-1, we can use data
dependent processing to reduce switching activity and power consumption in the HEVC transform engine as well. Figure 2-11 shows results of switching activity simulations for the
HEVC transform architecture proposed above. We consistently observe data dependent
processing resulting in an average 25% reduction in switching activity, demonstrating the
applicability of this idea beyond H.264 and VC-1.
Figure 2-10: Hardware architecture of the odd component for the shared 8x8 transform for H.264, VC-1 and HEVC. Std = {0: VC-1, 1: H.264, 2: HEVC}. The highlighted blocks are the same as those used in the shared H.264/VC-1 architecture, shown in Figure 2-2.

Figure 2-11: Switching activity in the HEVC transform as a function of the DC bias applied to the input data.

It should also be noted that the ideas of even-odd decomposition and matrix factorization
as well as eliminating an explicit transpose memory can be applied to transform matrices
of larger sizes such as 16x16 and 32x32. The ideas proposed in this work can potentially be extended to future video coding standards that use DCT-based transforms.
The benefits of these optimizations become even more significant for larger transform sizes. For example, for the 32x32 transform in HEVC [71], the transform weights are 8
bit wide as opposed to 5 bit in H.264 [57]. In addition, each 1D coefficient computation
requires 32 add-multiply operations as opposed to 8 add-multiply operations. This leads
to 6.4x higher complexity per pixel in the HEVC transform compared to H.264. The 32x32
HEVC transform also requires 16x larger transpose memory compared to 8x8 transform
in H.264. A hardware implementation of the HEVC decoder, proposed in [72], shows that
the transform module constitutes about 17% of the decoder area and power consumption.
This indicates that the area and power savings achieved by the ideas proposed in this work
can be significant towards achieving a low power video encoder/decoder implementation
for future video coding standards, such as HEVC.
2.3 Statistical Methodology for Low-Voltage Design
The performance of logic circuits is highly sensitive to variation in threshold voltage (VT) at low voltages; the extremes of VT variation can also cause functional failures.
For minimum geometry transistors, threshold voltage variation of 25 mV to 50 mV is
typical. At nominal VDD such as 1 V or 1.2 V, local variations in threshold voltage may
result in 5% to 10% variation in the logic timing. However, for low voltage operation
(VDD ≈ 0.5 V), these variations can result in timing path delays with a standard deviation
comparable to the global corner delay, and must be accounted for during timing closure
in order to ensure a robust, manufacturable design.
This challenge has been recognized [42,73,74] and circuit design techniques for low-voltage
operation have begun to take into account Statistical Static Timing Analysis (SSTA) approaches for estimating circuit performance [75].
A logic gate design methodology accounting for global process corners, which identifies logic gates with severely asymmetric pullup/pulldown networks, is proposed in [76]. Nominal delay and delay variability models valid in both above- and sub-threshold regions are proposed in [77]. A transistor sizing
methodology to manage the trade-off between reducing variability and minimizing energy
overhead is proposed in [78]. Most of these statistical approaches make the assumption
that the impact of variations on circuit performance can be modeled as a Gaussian distribution. This assumption is usually accurate at nominal voltage [79,80], but fails to capture
the non-linear impact of variations on circuit performance at low-voltage that results in
highly non-Gaussian delay distributions. This phenomenon is depicted in Figure 2-12,
which shows the delay Probability Density Function (PDF) of a representative path at
0.5 V, estimated using Gaussian SSTA and Monte-Carlo analysis. Static Timing Analysis
(STA) estimates the global corner delay for the path to be 14.1 ns. Modeling the impact
of variations using Gaussian SSTA results in the 3a delay estimate of 23.1 ns. However,
Monte-Carlo analysis suggests that Gaussian SSTA is not adequate to fully capture the
impact of variations and results in the 3a delay estimate of 31.8 ns.
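The failure of the Gaussian assumption can be illustrated with a toy Monte-Carlo experiment (all model parameters below are assumed for illustration only; the actual analysis uses SPICE-level compact models):

```python
import numpy as np

rng = np.random.default_rng(1)

def path_delay(n_gates=20, sigma_vt=0.035):
    """Toy path delay: near-threshold gate delay grows roughly exponentially
    with VT mismatch (a ~40 mV subthreshold slope is assumed), so a Gaussian
    VT spread produces a heavy-tailed, non-Gaussian path delay."""
    dvt = rng.normal(0.0, sigma_vt, n_gates)   # per-gate VT mismatch (V)
    return np.sum(np.exp(dvt / 0.040))         # normalized gate delays

samples = np.array([path_delay() for _ in range(100_000)])
gauss_3sigma = samples.mean() + 3 * samples.std()  # Gaussian-SSTA-style estimate
mc_3sigma = np.quantile(samples, 0.99865)          # true 3-sigma quantile
print(f"Gaussian 3-sigma: {gauss_3sigma:.1f}  Monte-Carlo 3-sigma: {mc_3sigma:.1f}")
```

The Monte-Carlo quantile exceeds the mean-plus-3σ estimate, mirroring the gap between the Gaussian SSTA and Monte-Carlo delay estimates above.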
Figure 2-12: Delay PDF of a representative timing path at 0.5 V. The STA estimate of the global corner delay is 14.1 ns, the 3σ delay estimate using Gaussian SSTA is 23.2 ns and the 3σ delay estimate using Monte-Carlo analysis is 31.8 ns.
Performing large Monte-Carlo simulations for processor designs with millions of transistors
is impractical. We use a computationally efficient approach, called the Operating Point
Analysis (OPA) [81], that can perform accurate path-based timing analysis in the regime
where delay is a highly non-linear function of the random variables and/or the PDFs of
63
2.3 Statistical Methodology for Low-Voltage Design
the random variables are non-Gaussian. OPA provides an approximation to the fσ value of a random variable D, when D is a linear or non-linear function D(x1, x2, ..., xN) of random variables xi, which can be Gaussian or non-Gaussian. The fσ operating point is the point in xi-space where the joint probability density function of the xi is maximum, subject to the constraint that D(x1, x2, ..., xN) = D_fσ. In other words, the operating point represents the most likely combination of random variables xi that results in the fσ delay for the logic gate or the timing path. Figure 2-13 illustrates the convolution integrand and the operating point, where delay is a non-linear function of two variables. A transcendental relationship is established between the unknown operating point and the unknown fσ delay, and this equation is solved iteratively.
Figure 2-13: Graphic illustration in xi-space of the convolution integral and the operating point.
The methodology, developed in [82], is summarized below.
Standard Cell Library Characterization
For the 45 nm process used in this work, Random Dopant Fluctuations (RDF) induced local variations were modeled by two compact model parameters for each transistor. These
transistor random variables (also called mismatch parameters) are statistically independent with approximately Gaussian PDFs. The OPA approach is applicable for any local
variations given by a compact model of transistor mismatch parameters. The goal of cell
characterization is to predict the delay PDF for each arc of each cell. An arc is defined by the input transition direction (rise or fall), input slew rate and output capacitance. At nominal voltage, cell delay is
approximately linear in the transistor random variables, with the result that the cell delay
is approximately Gaussian. However, at 0.5 V, cell delay is highly non-linear in transistor
random variables, with the result that the cell delay has a non-Gaussian PDF.
OPA is used to perform stochastic characterization of the standard cell library at VDD = 0.5 V. This characterization ensures functionality and quantifies the performance of standard cells at VDD = 0.5 V. Standard cells that fail functionality or do not satisfy
the performance requirement are not used in the design. The functionality and setup/hold
performance of flip-flops are also verified using the cell characterization approach.
Timing Path Analysis
The goal of timing path analysis is to compute the 3σ (or, in general, fσ) stochastic delay of a timing path. OPA is used, along with the pre-characterized standard cell library, to determine the 3σ setup/hold performance for individual paths from the design at 0.5 V. Figure 2-14 shows the PDF computed using OPA superimposed on the PDF computed using Monte-Carlo, for the path analyzed in Figure 2-12 at 0.5 V. Monte-Carlo analysis results in a 3σ delay estimate of 31.8 ns. OPA shows excellent agreement with Monte-Carlo, with a 3σ delay estimate of 30.7 ns.
Full-Chip Timing Closure
Given the size of the design, it is not practical to analyze each path individually to determine the 3σ setup/hold performance. At nominal voltage, paths that fail the setup/hold requirement are identified using corner-based analysis and timing closure is achieved by performing setup/hold fixes on these paths. However, at low voltage, it is not possible
Figure 2-14: Delay PDF of a representative timing path at 0.5 V, estimated using Gaussian SSTA, Monte-Carlo and OPA.
to consider only the paths that fail the setup/hold requirement in the corner analysis and determine their 3σ setup/hold performance, since a path with a larger corner delay need not have a larger stochastic variation.
A three-phase approach, outlined below, is used to reduce the number of paths that need to be analyzed for setup/hold constraints using OPA analysis; a sketch of the pruning flow follows the list.

1. All paths are analyzed with traditional STA using the corner delay plus the 3σ stochastic delay for each cell. This is a pessimistic analysis, so the paths that pass it can be removed from further consideration.

2. The paths that did not pass the first phase are re-analyzed, this time using OPA for the launch and capture clock paths, as defined in Figure 2-15, and STA with the corner delay plus the 3σ stochastic delay for cells in the data paths. Again, this is a pessimistic analysis and any paths that pass during this phase need no further consideration.

3. Lastly, the few remaining paths are analyzed using OPA for the entire path.
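A minimal sketch of this pruning flow is shown below (the analysis callables are hypothetical stand-ins for the STA and OPA engines, and a path is treated as failing setup when its pessimistic delay exceeds the clock period):

```python
def sta_pessimistic(cells):
    """Corner delay plus per-cell 3-sigma stochastic delay (pessimistic)."""
    return sum(c.corner_delay + 3 * c.sigma for c in cells)

def prune_paths(paths, t_clk, opa_clock_margin, opa_path_delay):
    """Three-phase setup pruning; opa_* are hypothetical OPA-engine hooks."""
    # Phase 1: full-path pessimistic STA; passing paths are dropped.
    remaining = [p for p in paths if sta_pessimistic(p.cells) > t_clk]
    # Phase 2: OPA on launch/capture clock paths, pessimistic STA on data cells.
    remaining = [p for p in remaining
                 if opa_clock_margin(p) + sta_pessimistic(p.data_cells) > t_clk]
    # Phase 3: full OPA on the few survivors; these need setup/hold fixing.
    return [p for p in remaining if opa_path_delay(p) > t_clk]
```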
Figure 2-15: Typical timing path.

The paths that fail the 3σ setup or hold performance test are optimized to fix the setup/hold violations. This process is repeated until all the timing paths in the design meet the 3σ setup and hold performance computed using OPA. Setup/hold fixing using OPA ensures that cells that are very sensitive to VT variations are not used in the critical paths. Table 2.3 shows statistics on the number of paths analyzed during each phase of timing closure, for both setup and hold analysis of the entire chip.
Table 2.3: Full-chip timing analysis

Setup Analysis @ 25 MHz
Phase   Data Path    Clock Path   Paths Analyzed   Worst Slack   % Fail
1       STA (+3σ)    STA (-3σ)    20k              -14.2 ns      5%
2       STA (+3σ)    OPA          1k               -3.2 ns       9%
3       OPA          OPA          87               -0.2 ns       12%
Paths requiring fixing (before timing closure): 10

Hold Analysis
Phase   Data Path    Clock Path   Paths Analyzed   Worst Slack   % Fail
1       STA (-3σ)    STA (+3σ)    20k              -11.2 ns      7%
2       STA (-3σ)    OPA          1.4k             -2.5 ns       8%
3       OPA          OPA          112              -0.1 ns       14%
Paths requiring fixing (before timing closure): 16
The overall statistical design methodology can be summarized as shown in Figure 2-16.
Figure 2-16: OPA-based statistical design methodology for low voltage operation.
2.4 Implementation
In this work, we implemented ten different versions of the transform engine, listed in
Table 2.4, and compared their relative performance.
All transforms have been implemented to complete an 8 x 8 transform over 4 clock cycles.
Table 2.4: Transform engines implemented in this design

Tr. Type   Description
HVF8       Shared 8x8 forward transform without transpose memory
HVI8       Shared 8x8 inverse transform without transpose memory
HVF8TM     Shared 8x8 forward transform with transpose memory
HVI8TM     Shared 8x8 inverse transform with transpose memory
HF8        8x8 forward transform for H.264 without transpose memory
HI8        8x8 inverse transform for H.264 without transpose memory
VF8        8x8 forward transform for VC-1 without transpose memory
VI8        8x8 inverse transform for VC-1 without transpose memory
HVF4       Shared 4x4 forward transform
HVI4       Shared 4x4 inverse transform
In this design, the output buffer has been implemented as a register bank of size 8 x 8 with
each element being 8 bit wide. The architecture of the 2D transform engine, along with
the output buffer, is shown in Figure 2-17.
Figure 2-18 shows the die photo of the IC fabricated using commercial 45 nm CMOS
technology. The gate counts in Figure 2-18 include the output buffer as well.
The proposed shared transform engine design uses separate 1D transforms for column
and row-wise computations and does not use a transpose memory. The 1D column and
row-wise transforms are designed using the shared architectures described in Sections 2.1.2 and 2.1.3 respectively. The 2D output buffer is used to store intermediate data.
The shared transform modules with transpose memory are implemented using the shared
1D transform architecture described in Section 2.1.2 for both column- and row-wise transforms. Each 1D transform processes 8x2 data in each clock cycle and a 16x8 transpose
Figure 2-17: Block diagram of the 2D transform engine design.
Figure 2-18: Die photo and design statistics of the fabricated IC.

Design statistics: 1.5 mm² active area, 45 nm technology, 96 I/O pads. Gate counts: HVF8 44.7k, HVF8TM 66.5k, HF8 30.9k, VF8 35.6k, HVF4 18.8k; HVI8 45.1k, HVI8TM 66.8k, HI8 31.6k, VI8 35.8k, HVI4 18.9k.
memory, which constitutes 48% of the gate count, is used to allow operation in ping-pong mode to achieve a throughput of one 8x8 2D transform every 4 cycles. An alternative approach
to achieve the same throughput is to process 8 x 4 data in each clock cycle and use an 8 x 8
transpose memory. This has not been implemented on chip; however, synthesis results show a 15% higher overall gate count for this approach.
2.5 Measurement Results
The shared architecture for the 8x8 transform (HVF8/HVI8) achieves a 25% reduction in area compared to the combined area of the individual 8x8 transforms for H.264 (HF8/HI8) and VC-1 (VF8/VI8). Eliminating the explicit transpose memory helps save 23% area compared to the implementation that uses a transpose memory (HVF8TM/HVI8TM). The decoder only uses inverse transforms. The encoder requires both forward and inverse transforms, thus doubling the area savings due to hardware sharing.
Figure 2-19 shows the measured power consumption and frequency for different transform modules as a function of VDD.
Figure 2-19: Measured power consumption and frequency scaling with VDD for different transform implementations. (a) Frequency scaling with VDD. (b) Power consumption while operating at the frequency shown in (a).
All the transform modules implemented on this chip have been verified to be operational
to support video encoding/decoding with Quad-Full HD (3840 x 2160) resolution at 30 fps.
The shared 8x8 transform is able to achieve video encoding/decoding in both H.264 and
VC-1 with 3840 x 2160 (QFHD) resolution at 30 fps, while operating at 25 MHz frequency
at 0.52 V. The module is also able to achieve 1080p (Full-HD) at 30 fps, while operating at
6.3 MHz at 0.41 V and 720p (HD) at 30 fps, while operating at 2.8 MHz at 0.35 V.
Measurement results for all the modules are summarized in Table 2.5.
Table 2.5: Measurement results for implemented transform modules

             QFHD@30fps, 25 MHz     1080p@30fps, 6.3 MHz   720p@30fps, 2.8 MHz
Transform    VDD (V)   Power (µW)   VDD (V)   Power (µW)   VDD (V)   Power (µW)
HVF8         0.52      214          0.41      79           0.35      43
HVI8         0.53      218          0.42      81           0.36      44
HVF8TM       0.50      270          0.40      95           0.33      51
HVI8TM       0.49      268          0.40      94           0.33      50
HF8          0.51      175          0.41      67           0.34      35
HI8          0.50      172          0.40      66           0.33      34
VF8          0.51      189          0.41      70           0.35      38
VI8          0.51      188          0.41      70           0.34      37
HVF4         0.49      127          0.39      55           0.33      31
HVI4         0.48      124          0.40      54           0.33      30
Figure 2-20 compares the power consumption of the shared transform without transpose memory, the shared transform with transpose memory and the individual transform implementations
for H.264 and VC-1. While supporting Quad Full-HD resolution, eliminating explicit
transpose memory helps reduce power consumption of the 8x8 transform by 26%.
Figure 2-20: Power consumption for transform modules with and without transpose memory, with and without shared architecture for H.264 and VC-1.
Data dependent processing affects different architectures differently because of varying
degrees of correlation between input switching activity and system switching activity.
Figure 2-21 shows the switching activity and power consumption for different transform
modules as a function of the input DC bias. We observe a reduction in switching activity
by 25%-30% across the modules, resulting in a 15%-20% power saving.
Table 2.6 summarizes the overheads and advantages of the three key ideas proposed in this work.

Figure 2-21: Switching activity and power consumption in the transform as a function of the DC bias applied to the input data.

Applying the DC bias requires 16 adders (with one fixed input, i.e., the DC bias) that
cause a 5% increase in area and a 4% increase in power. But it helps reduce the switching
activity by 30%, which results in a 15% overall power saving for the design. Hardware
sharing requires 26 additional 2:1 multiplexers that consume 9% area and 6% power.
But sharing helps us implement the H.264 and VC-1 transforms using 78 adders and
62 multiplexers (including the overhead), as opposed to 126 adders and 60 multiplexers
for individual H.264 and VC-1 implementations, which reduces the overall area by 25%.
The scheme for eliminating transpose memory requires us to access 8x8 data in each
clock cycle for row-wise transform computations. This increases the data accesses by 4x
for the row-wise computations, as opposed to the implementation that uses a transpose
memory. The increased data accesses lead to a 7% increase in power consumption. However,
the ability to eliminate transpose memory saves 23% area and 26% power. Overall, the
proposed design optimizations help reduce the power consumption by about 40%, despite
the overhead.
Table 2.6: Overheads and advantages of proposed ideas

Feature                                   Overhead                    Advantage
Data Dependent Processing                 5% area, 4% power           30% reduction in switching activity, 15% reduction in power
Hardware Sharing between H.264 and VC-1   9% area, 6% power           25% reduction in area
Transpose Memory Elimination              4x data access, 7% power    23% reduction in area, 26% reduction in power
Table 2.7 shows a performance comparison of the proposed approach for 2D transform
implementation with some previously published approaches. The comparison shows that
the proposed approach achieves a significant reduction in power compared to the previous
approaches. Assuming a roughly 4x scaling in power due to technology scaling from
180 nm to 45 nm, the architectural techniques proposed in this work achieve a reduction
in power consumption by over 45x compared to [83] and 68x compared to [84], while
achieving the same throughput at VDD = 0.52 V.
Table 2.7: Performance comparison of the proposed approach with previous publications

                 Huang'08 [83]   Fan'11 [84]              Wang'11 [85]             Chen'11 [86]     This Work (Low VDD)   This Work (Nominal VDD)
Technology       180 nm          180 nm                   130 nm                   180 nm           45 nm                 45 nm
Gates            39.8k           95.1k                    23.1k                    17.7k            44.7k                 44.7k
Parallelism      8x              8x                       8x                       4x               16x                   16x
Throughput       400M pixels/s   400M pixels/s            800M pixels/s            1000M pixels/s   400M pixels/s         4640M pixels/s
Frequency        50 MHz          50 MHz                   100 MHz                  250 MHz          25 MHz                290 MHz
Voltage          1.8 V           1.8 V                    1.2 V                    1.8 V            0.52 V                1.0 V
Power            38.7 mW         58.01 mW                 -                        54 mW            214 µW                4.1 mW
Standards        MPEG, H.264     MPEG, H.264, AVS, VC-1   MPEG, H.264, AVS, VC-1   H.264            H.264, VC-1           H.264, VC-1
Transform Type   Forward         Inverse                  Forward, Inverse         Inverse          Forward, Inverse      Forward, Inverse
2.6 Summary and Conclusions
The ability to perform very high-resolution video encoding/decoding for multiple standards at ultra-low voltage to achieve low power operation is critical in multimedia devices.
In this work, we have developed a shared architecture for H.264/AVC and VC-1 transform
engine. Similarity between the structure of transform matrices is exploited to perform
matrix decomposition to maximize hardware sharing. The shared architecture helps to
save more than 30% hardware compared to the total hardware requirement of the individual H.264/AVC and VC-1 transform implementations. An approach to eliminate an explicit transpose memory is demonstrated, by using a 2D output buffer and separately designing the row-wise and column-wise 1D transforms. This helps us reduce the area by 23% and
save power by 26% compared to the implementation that uses transpose memory. We
have demonstrated that data dependent processing can help reduce the switching activity
by more than 30% and further reduce power consumption. The implementation is able to
support Quad-Full HD (3840 x 2160) video encoding/decoding at 30 fps while operating
at 0.52 V.
The ideas of matrix factorization for hardware sharing, eliminating transpose memory
and data dependent processing could potentially be extended to other coding standards
as well. As bigger block sizes such as 32x32 and 64x64 are explored in future video
coding standards like HEVC, these ideas could lead to even higher savings in area and
power requirement of the transform engine, allowing their efficient implementation in
multi-standard multimedia devices.
Exploration of the ideas proposed in this work leads to the following conclusions.
1. Reconfigurable hardware architectures that implement optimized core functional
units for a class of applications, such as video coding, and enable configurable datapaths with distributed control, are key to supporting efficient processing for multiple
applications. Algorithmic optimizations that reframe the algorithms are important
for enabling hardware reconfigurability.
2. Data dependent processing can be a powerful tool in reducing system power consumption. By exploiting the characteristics of the data being processed, architectures can be designed to minimize switching activity, optimize pipeline bit widths
and perform a variable number of operations per block. The reduction in computations
and switching activity has a direct impact on the system power consumption.
3. Memory size and power consumption can have a significant impact on system efficiency. Architectural approaches that trade off small increases in logic complexity
for significant reductions in memory size and power consumption can provide the best overall system design solutions.
4. Low-voltage operation of circuits is important to provide a wide voltage/frequency operating range and attain minimum-energy operation. Global and local variations have a significant impact on circuit performance at low voltages. This impact cannot be fully captured with corner-based STA or Gaussian SSTA techniques. Statistical design approaches that take into account the non-linear impact of variations
on circuit performance at low-voltage must be used to ensure reliable low-voltage
operation.
Chapter 3
Reconfigurable Processor for Computational Photography
Computational photography is transforming digital photography by significantly enhancing and extending the capabilities of a digital camera. The field encompasses a wide range
of techniques such as High Dynamic Range (HDR) imaging [87], low-light enhancement
[138,139], panorama stitching [88], image deblurring [89] and light field photography [90],
that allow users to not just capture a scene flawlessly, but also reveal details that could
otherwise not be seen.
Non-linear filtering techniques, such as bilateral filtering [31,91,92], anisotropic diffusion
[93,94] and optimization [95,96], form a significant part of computational photography.
The behaviors of such techniques have been well studied and characterized [97-102]. These
techniques have a wide range of applications, including denoising [103,104], HDR imaging
[87], low-light enhancement [138,139], tone management [105,106], video enhancement
[107,108] and optical flow estimation [109,110]. The high computational complexity of
such multimedia processing applications necessitates fast hardware implementations [111,
112] to enable real-time processing in an energy-efficient manner.
Recent research has focused on specialized image sensors to capture information that is
not captured by a regular CMOS image sensor. An image sensor with multi-bucket pixels
is proposed in [113] to enable time multiplexed exposure that improves the image dynamic
range and detects structured light illumination. A back-illuminated stacked CMOS sensor
is proposed in [114] that uses spatially varying pixel exposures to support HDR imaging.
An approach to reduce the temporal readout noise in an image sensor is proposed in [115]
to improve low-light-level imaging. However, computational photography applications
using regular CMOS image sensors that are currently used in the commercial cameras
have so far remained software based. Such CPU/GPU based implementations lead to
high energy consumption and typically do not support real-time processing.
This work implements a reconfigurable multi-application processor for computational photography by exploring power reduction techniques at various design stages - algorithms,
architectures and circuits. The algorithms are optimized to reduce the computational
complexity and memory requirement. A parallel and pipelined architecture enables high
throughput while operating at low frequencies, which allows real-time processing on HD
images. Circuit design for low voltage operation ensures reliable performance down to
0.5 V.
The reconfigurable hardware implementation performs HDR imaging, low-light enhanced
imaging and glare reduction, as shown in Figure 3-1. The filtering engine can also be accessed from off-chip and used with other applications. The input images are pre-processed
for the specific functions. The core of the processing unit is a pair of bilateral filter engines
that operate in parallel and decompose an image into a low frequency base layer and a
high frequency detail layer. Each bilateral filter uses further parallelism within it. The
choice of two parallel engines is based on the throughput requirements for real-time processing and the amount of memory bandwidth available to keep all the engines active.
The processor is able to access 8 pixels per clock cycle and each filtering engine is capable
of processing 4 pixels per clock cycle. Bilateral filtering is performed using a bilateral grid
structure [116] that converts an input image into a three dimensional data structure and
filters it by convolving with a three-dimensional Gaussian kernel.

Figure 3-1: System block diagram for the reconfigurable computational photography processor.

Parallel processing allows enhanced throughput while operating at low frequency and low voltage. The bilateral
filtered images are post processed to generate the outputs for the specific functions.
This chapter describes bilateral filtering and its efficient implementation using the bilateral grid. A scalable hardware architecture for the bilateral filter engine is described in
Section 3.2. Implementation of HDR imaging, low-light enhancement and glare reduction
using bilateral filtering is discussed in Section 3.3. The challenges of low voltage operation
and approaches to address process variation are described in Section 3.4. The significance
of architectural optimizations for reducing external memory bandwidth and power consumption - crucial to enhance the system energy-efficiency, is described in Section 3.5.
Section 3.6 provides measurement results for the testchip.
3.1 Bilateral Filtering
Bilateral filtering is a non-linear filtering technique that traces its roots to the non-linear
Gaussian filters proposed in [31] for edge-preserving diffusion. It takes into account the
difference in the pixel intensities as well as the pixel locations while assigning weights, as
opposed to linear Gaussian filtering that assigns filter weights based solely on the pixel
locations [91,92]. For an image I at pixel position p, the bilateral filtered output, I_B, is defined by eq. (3.1).

$$
I_B(p) = \frac{1}{W(p)} \sum_{n=-N}^{N} G_s(n)\, G_I\big(I(p) - I(p-n)\big)\, I(p-n) \tag{3.1}
$$

where

$$
W(p) = \sum_{n=-N}^{N} G_s(n)\, G_I\big(I(p) - I(p-n)\big)
$$
n=-N
The output value at each pixel in the image is a weighted average of the values in a
neighborhood, where the weight is the product of a Gaussian on the spatial distance
(G_s) with standard deviation σ_s and a Gaussian on the pixel intensity/range difference (G_I) with standard deviation σ_r. In linear Gaussian filtering, on the other hand, the
weights are determined solely by the spatial term. In bilateral filtering, the range term
GI(I(p) - I(p - n)) ensures that only those pixels in the vicinity that have similar intensities contribute significantly towards filtering. This avoids blurring across edges and
results in an output that effectively reduces the noise while preserving the scene details.
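A direct, unaccelerated reference implementation of eq. (3.1) might look as follows (a sketch for grayscale images in [0, 1]; its cost is what motivates the grid-based implementation described later):

```python
import numpy as np

def bilateral_filter(img, radius=4, sigma_s=2.0, sigma_r=0.1):
    """Direct bilateral filter of eq. (3.1) on a grayscale image in [0, 1].

    O(pixels * radius^2): far too slow for real-time HD processing, which
    is what motivates the bilateral grid."""
    h, w = img.shape
    pad = np.pad(img, radius, mode="edge")
    out = np.zeros_like(img)
    norm = np.zeros_like(img)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = pad[radius + dy:radius + dy + h, radius + dx:radius + dx + w]
            gs = np.exp(-(dx * dx + dy * dy) / (2 * sigma_s ** 2))   # spatial term
            gi = np.exp(-(img - shifted) ** 2 / (2 * sigma_r ** 2))  # range term
            out += gs * gi * shifted
            norm += gs * gi
    return out / norm
```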
Figure 3-2 compares Gaussian filtering and bilateral filtering in reducing image noise and
preserving details.
However, non-linear filtering is inefficient and slow to implement because the filter kernel
is spatially variant and needs to be recomputed for filtering every pixel. In addition, most
computational photography applications require large filter kernels, 64 x 64 or more. A direct implementation of bilateral filtering can take several minutes to process HD images on
a CPU. Faster approaches for bilateral filtering have been proposed. A separable approximation of the bilateral filter is proposed in [117] that speeds up processing and improves
efficiency for applications that use small filter kernels, such as denoising. Optimization
techniques have been proposed that reduce the processing time by filtering subsampled
versions of the image with discrete intensity kernels and reconstructing the filtered results using linear interpolation [87,118].

Figure 3-2: Comparison of linear Gaussian filtering and non-linear bilateral filtering. Bilateral filtering effectively reduces noise while preserving scene details.

A fast approach to bilateral filtering based on a box
spatial kernel, which can be iterated to yield smooth spatial falloff, is proposed in [119].
However, real-time processing of HD images requires further speed-up.
3.1.1 Bilateral Grid
The bilateral grid structure for fast bilateral filtering is proposed in [116], where the processing complexity is reduced by down-sampling the image for filtering. But to preserve
the details while down-sampling, a third intensity dimension is added so that pixels with
very different intensities, within a block being down-sampled, are assigned to different intensity levels, thus preserving the intensity differences. This results in a three dimensional
structure. Creating a 3D bilateral grid and processing it requires a large amount of storage
(65 MB for a 10 megapixel image). In this work, we implement bilateral filtering using a
reconfigurable grid. To translate the grid structure efficiently into hardware, we convert
Reconfigurable Processor for Computational Photography
82
it into a data structure. The storage requirement is reduced to 21.5 kB by scheduling
the filtering engine tasks so that only two grid rows need to be stored at a time. The
implementation is flexible to allow varying grid sizes for energy/resolution scalable image
processing.
The bilateral grid structure used by this chip is constructed as follows. The input image is partitioned into blocks of size σs x σs pixels and a histogram of pixel intensity values is generated for each block. Each histogram has 256/σr bins, where each bin corresponds to an intensity level in the grid. This results in a 3D representation of the 2D image, as shown in Figure 3-3. Each grid cell (i, j, r) stores the number of pixels in the block corresponding to that intensity bin (Wi,j,r) and their summed intensity (Ii,j,r). To provide flexibility in grid creation and processing, the processor supports block sizes ranging from 16x16 to 128x128 pixels with 4 to 16 intensity bins in the histogram.
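A functional sketch of this grid construction (illustrative Python; block size and bin count are assumed to divide the image dimensions and intensity range evenly):

```python
import numpy as np

def build_grid(img, sigma_s=16, sigma_r=16):
    """Grid assignment: per-block intensity histograms storing pixel count
    (weight) and summed intensity per bin, for an 8-bit grayscale image."""
    h, w = img.shape
    bins = 256 // sigma_r
    gh, gw = h // sigma_s, w // sigma_s
    weight = np.zeros((gh, gw, bins))
    summed = np.zeros((gh, gw, bins))
    for i in range(gh):
        for j in range(gw):
            block = img[i * sigma_s:(i + 1) * sigma_s,
                        j * sigma_s:(j + 1) * sigma_s]
            r = np.minimum(block // sigma_r, bins - 1)   # intensity bin index
            for level in range(bins):
                mask = (r == level)
                weight[i, j, level] = mask.sum()
                summed[i, j, level] = block[mask].sum()
    return summed, weight
```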
Figure 3-3: Construction of a 3D bilateral grid from a 2D image.
The bilateral grid has two key advantages:

* Aggressive down-sampling: The size of the blocks (σs x σs) used while creating the grid and the number of intensity bins (256/σr) determine the amount by which the image is down-sampled. σs controls smoothing and σr controls the extent of edge preservation. Most computational photography applications only require a coarse grid resolution. The hardware implementation merges blocks of 16x16 to 128x128 pixels into 4 to 16 grid cells. This significantly reduces the number of computations required for processing as well as the amount of on-chip storage required.
• Built-in edge awareness: Two pixels that are spatially adjacent but have very different intensities end up far apart in the grid along the intensity dimension. When the grid is filtered level-by-level using a 3D linear Gaussian kernel, only intensity levels that are near each other influence the filtering; levels that are far apart do not contribute to each other's filtering. Without any downsampling (σ_s = σ_r = 1), this operation is identical to performing bilateral filtering on the 2D image. Filtering a down-sampled grid using a 3D Gaussian kernel provides a good approximation to bilateral filtering the image for most computational photography applications.
3.2 Bilateral Filter Engine
Intensity levels in the bilateral grid can be processed in parallel. This enables a highly parallel architecture, where 256/σ_r intensity levels are created, filtered and interpolated in a parallel and pipelined manner. The bilateral filter engine using the bilateral grid is implemented as shown in Figure 3-4. It consists of three components: the grid assignment engine, the grid filtering engine and the grid interpolation engine. The spatial and intensity down-sampling factors, σ_s and σ_r, are programmed by the user at the start of the processing. The image is scanned pixel by pixel in a block-wise manner. The size of the block is scalable from 16 × 16 pixels (σ_s = 16) to 128 × 128 pixels (σ_s = 128). Depending on the intensity of the input pixel, it is assigned to one of the intensity bins. The number of intensity bins is also scalable from 4 (σ_r = 64) to 16 (σ_r = 16). As the data structure is stored on-chip, the different intensity levels in the grid can be processed in parallel.
3.2.1 Grid Assignment
The pixels are assigned to the appropriate grid cells by the grid assignment engines. The hardware has 16 Grid Assignment (GA) engines that can operate in parallel to process 16 intensity levels in the grid, but only 4, 8 or 12 of them are activated if the grid uses fewer intensity levels.
Figure 3-4: Architecture of the bilateral filtering engine. Grid scalability is achieved by gating processing engines and SRAM banks.
Figure 3-5 shows the architecture of the grid assignment engine. For each pixel from each block, its intensity is compared with the boundaries of the intensity bins using digital comparators. If the pixel intensity is within the bin boundaries, it is assigned to that intensity bin. Intensities of all the pixels assigned to a bin are summed by an accumulator, and a weight counter maintains the number of pixels assigned to the bin. Both the summed intensity and the weight are stored for each bin in on-chip memory.
Figure 3-5: Architecture of the grid assignment engine.
3.2.2 Grid Filtering
The Convolution (Conv) engine, shown in Figure 3-6, convolves the grid intensities and weights with a 3 × 3 × 3 Gaussian kernel, which is equivalent to bilateral filtering in the image domain, and returns the normalized intensity. The convolution is performed by multiplying the 27 coefficients of the filter kernel with the 27 grid cells and adding the products using a 3-stage adder tree. The intensity and weight are convolved in parallel, and the convolved intensity is normalized by the convolved weight using a fixed-point divider to ensure that there is no intensity scaling during filtering. The filter coefficients are programmable to enable filtering operations of different types, including non-separable filters, to be performed using the same reconfigurable hardware. The coefficients are programmed by the user at the beginning of processing; otherwise, the default 3 × 3 × 3 Gaussian kernel is used. The hardware has 16 convolution engines that can operate in parallel to filter a grid with 16 intensity levels, but only 4, 8 or 12 of them are activated if fewer intensity levels are used in the grid.
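A software sketch of the filtering step is shown below; it is an illustration rather than the fixed-point hardware, and the separable [1, 2, 1] binomial kernel used here as the default 3 × 3 × 3 Gaussian is an assumption.

    import numpy as np

    def filter_grid(intensity, weight, kernel=None):
        """Convolve the summed-intensity and weight grids with the same
        3x3x3 kernel, then normalize so filtering does not scale
        intensities. Boundary cells are replicated, as in the hardware."""
        if kernel is None:                       # assumed default kernel
            g = np.array([1.0, 2.0, 1.0])
            kernel = np.einsum('i,j,k->ijk', g, g, g)
            kernel /= kernel.sum()
        gh, gw, gl = intensity.shape
        pi = np.pad(intensity, 1, mode='edge')
        pw = np.pad(weight, 1, mode='edge')
        fi = np.zeros_like(intensity)
        fw = np.zeros_like(weight)
        for dj in range(3):                      # 27 taps in total
            for di in range(3):
                for dr in range(3):
                    c = kernel[dj, di, dr]
                    fi += c * pi[dj:dj+gh, di:di+gw, dr:dr+gl]
                    fw += c * pw[dj:dj+gh, di:di+gw, dr:dr+gl]
        # Normalized intensity; guard empty cells against divide-by-zero.
        return np.where(fw > 0, fi / np.maximum(fw, 1e-9), 0.0)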
Figure 3-6: Architecture of the convolution engine for grid filtering.
3.2.3 Grid Interpolation
The interpolation engine, shown in Figure 3-7, reconstructs the filtered 2D image from the filtered grid. The filtered intensity value at pixel (x, y) is obtained by trilinear interpolation of the 2 × 2 × 2 filtered grid values surrounding the location (x/σ_s, y/σ_s, I_xy/σ_r). Trilinear interpolation is equivalent to performing linear interpolations independently across each of the three dimensions of the grid. To meet throughput requirements, the interpolation engine is implemented as three pipelined stages of linear interpolations. The output value I_BF(x, y) is calculated from the filtered grid values F using four parallel linear interpolations along the i dimension, given by eq. (3.2):
F_j^r = F_{i,j}^r × w_i + F_{i+1,j}^r × w_{i+1}
F_{j+1}^r = F_{i,j+1}^r × w_i + F_{i+1,j+1}^r × w_{i+1}
F_j^{r+1} = F_{i,j}^{r+1} × w_i + F_{i+1,j}^{r+1} × w_{i+1}
F_{j+1}^{r+1} = F_{i,j+1}^{r+1} × w_i + F_{i+1,j+1}^{r+1} × w_{i+1}    (3.2)
followed by two parallel linear interpolations along the j dimension, given by eq. (3.3):
F^r = F_j^r × w_j + F_{j+1}^r × w_{j+1}
F^{r+1} = F_j^{r+1} × w_j + F_{j+1}^{r+1} × w_{j+1}    (3.3)
followed by an interpolation along the r dimension, given by eq. (3.4):
I_BF(x, y) = F^r × w_r + F^{r+1} × w_{r+1}    (3.4)
The interpolation weights, given by eq. (3.5), are computed based on the output pixel location (x, y), the intensity I_xy of the original pixel at location (x, y) in the input image, and the grid cell index (i, j, r).
w_i = i + 1 − x/σ_s;      w_{i+1} = x/σ_s − i;
w_j = j + 1 − y/σ_s;      w_{j+1} = y/σ_s − j;
w_r = r + 1 − I_xy/σ_r;    w_{r+1} = I_xy/σ_r − r    (3.5)
The pixel location (x, y) and the grid cell index (i, j, r) are maintained in internal counters. The original pixel intensity I_xy is read from the DRAM in chunks of 32 pixels per read request to fully utilize the memory bandwidth.
Figure 3-7: Architecture of the interpolation engine. Trilinear interpolation is implemented as three pipelined stages of linear interpolations.
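The following sketch evaluates eqs. (3.2)-(3.5) directly. F is assumed to hold normalized filtered intensities indexed [j, i, r], and the per-pixel loop stands in for the three pipelined hardware stages.

    import numpy as np

    def slice_grid(F, image, sigma_s=16, sigma_r=16):
        """Reconstruct the filtered 2D image from the filtered grid F by
        trilinear interpolation at (x/sigma_s, y/sigma_s, I_xy/sigma_r)."""
        h, w = image.shape
        out = np.zeros((h, w))
        Fp = np.pad(F, ((0, 1), (0, 1), (0, 1)), mode='edge')  # +1 indices
        for y in range(h):
            for x in range(w):
                gx, gy, gr = x / sigma_s, y / sigma_s, image[y, x] / sigma_r
                i, j, r = int(gx), int(gy), int(gr)
                wi1, wj1, wr1 = gx - i, gy - j, gr - r  # w_{i+1}, w_{j+1}, w_{r+1}
                wi, wj, wr = 1 - wi1, 1 - wj1, 1 - wr1  # w_i, w_j, w_r (eq. 3.5)
                # Four interpolations along i (eq. 3.2) ...
                f00 = Fp[j, i, r] * wi + Fp[j, i+1, r] * wi1
                f10 = Fp[j+1, i, r] * wi + Fp[j+1, i+1, r] * wi1
                f01 = Fp[j, i, r+1] * wi + Fp[j, i+1, r+1] * wi1
                f11 = Fp[j+1, i, r+1] * wi + Fp[j+1, i+1, r+1] * wi1
                # ... two along j (eq. 3.3), one along r (eq. 3.4).
                f0 = f00 * wj + f10 * wj1
                f1 = f01 * wj + f11 * wj1
                out[y, x] = f0 * wr + f1 * wr1
        return out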
The assigned and filtered grid cells are stored in the on-chip memory. The last three assigned blocks are stored in a temporary buffer and the two previous rows of grid blocks are stored in the SRAM. The last two filtered blocks are stored in the temporary buffer and one filtered grid row is stored in the SRAM. The on-chip SRAM can store up to 256 blocks per row with 16 intensity levels.
3.2.4 Memory Management
The grid processing tasks are scheduled to minimize local storage requirements and memory traffic. Figure 3-8 shows the memory management scheme based on task scheduling. Grid processing is performed cell-by-cell in a row-wise manner. The last three blocks are stored in the temporary buffer and the last two rows are stored in the SRAM. Once a 3 × 3 × 3 block is available, the convolution engine begins filtering the grid. When block A, shown in Figure 3-8, is being assigned, the convolution engine is filtering block F. As filtering proceeds to the next block in the row, the first assigned block, stored in the SRAM, becomes redundant and is replaced by the first assigned block in the temporary buffer. The last two filtered blocks are stored in the temporary buffer and the previous row of filtered blocks is stored in the SRAM. As 2 × 2 × 2 filtered blocks become available, the interpolation engine begins reconstructing the output 2D image. When block F, shown in Figure 3-8, is being filtered, the interpolation engine is reconstructing the output 2D image from block I. As interpolation proceeds to the next block in the row, the first filtered block, stored in the SRAM, becomes redundant and is replaced by the first filtered block in the temporary buffer. Boundary rows and columns are replicated for processing boundary cells. This scheduling scheme allows processing without storing the entire grid: only two assigned grid rows and one filtered grid row need to be stored locally at a time. Memory management reduces the memory requirement to 21.5 kB for processing a 10 megapixel image and allows processing grids of arbitrary height using the same amount of on-chip memory.

Figure 3-8: Memory management by task scheduling. A denotes the block being assigned, F the block being filtered and I the block being interpolated.
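The scheduling idea can be expressed as a toy event trace; the sketch below is an illustration only (boundary replication and the temporary buffers are omitted, and the readiness conditions are simplified to interior blocks).

    def schedule(grid_w, grid_h):
        """Emit the order in which grid blocks are assigned, filtered and
        interpolated. Filtering of block (j-1, i-1) starts once its 3x3
        neighborhood of assigned blocks is complete; interpolation of
        block (j-2, i-2) starts once its 2x2 neighborhood of filtered
        blocks is complete."""
        events = []
        for j in range(grid_h):
            for i in range(grid_w):
                events.append(('assign', j, i))
                if j >= 2 and i >= 2:
                    events.append(('filter', j - 1, i - 1))
                if j >= 3 and i >= 3:
                    events.append(('interpolate', j - 2, i - 2))
        return events

Replaying such a trace shows that reads only ever touch the two most recently assigned grid rows and the most recently filtered row, which is what bounds the on-chip storage.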
3.2.5 Scalable Grid
Energy-efficiency is the key concern in processing on mobile platforms. The ability to trade off computational quality for energy is highly desirable, making algorithm structures and systems that enable this trade-off extremely useful to explore [120]. A user performing computational photography on a mobile device might choose to trade off output resolution for energy, depending on the current state of the battery and the energy requirement for the task. This trade-off could also be made based on the intended usage for the image. For example, if the output image is intended for use on social media or web-based applications, a lower resolution, such as 2 megapixel, might be most appropriate, whereas for generating high-quality prints, the user would like to achieve the highest resolution possible. This makes an architecture that enables energy-scalable processing extremely valuable.
We develop an architecture that enables the energy vs. quality trade-off by scaling the size of the bilateral grid to support the desired output resolution. The size of the grid is determined by the image size and the downsampling factors. For an image of size I_W × I_H pixels with spatial and intensity/range downsampling factors σ_s and σ_r respectively, the grid width (G_W) and height (G_H) are given by eq. (3.6) and the number of grid cells (N_G) is given by eq. (3.7).
G_W = I_W/σ_s;    G_H = I_H/σ_s    (3.6)

N_G = G_W × G_H × 256/σ_r    (3.7)
The number of computations as well as the storage depends directly on the size of the grid. Selecting the downsampling factors equal to the standard deviations of the spatial and intensity/range Gaussians in the bilateral filter (eq. (3.1)) provides a good trade-off between output quality and processing complexity. The choice of downsampling factors is guided by the image content and the application. Most applications work well with a coarse grid resolution on the order of 32 pixels with 8 to 12 intensity bins. If the image has high spatial detail, a smaller σ_s results in better preservation of that detail in the output. Similarly, a smaller σ_r helps preserve fine intensity details. The grid size is configurable by adjusting σ_s from 16 to 128, which scales the block size from 16 × 16 to 128 × 128 pixels, and σ_r from 16 to 64, which scales the number of intensity levels from 16 to 4. For a 10 megapixel (4096 × 2592) image, the number of grid cells scales from 663552 (σ_s = 16, σ_r = 16) to 2592 (σ_s = 128, σ_r = 64). The architecture achieves energy scalability by activating only the required number of hardware units for a given grid resolution.
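As a quick check of these figures, the cell count of eq. (3.7) can be evaluated directly (edge rounding ignored):

    def grid_cells(iw, ih, sigma_s, sigma_r):
        """Number of grid cells for an iw x ih image, per eqs. (3.6)-(3.7)."""
        return (iw / sigma_s) * (ih / sigma_s) * (256 / sigma_r)

    print(grid_cells(4096, 2592, 16, 16))    # 663552.0 cells (finest grid)
    print(grid_cells(4096, 2592, 128, 64))   # 2592.0 cells (coarsest grid)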
The 21.5 kB of on-chip SRAM is used to store two rows of created grid cells and one row of filtered grid cells. The SRAM is implemented as 8 banks supporting a maximum of 256 cells in each row of the grid with 16 intensity levels, corresponding to the worst case of σ_s = 16, σ_r = 16. Each bank is power gated to save energy when a lower resolution grid is used: only one bank is used when σ_s = 128 and all 8 banks are used when σ_s = 16. The bilateral filter engine achieves scalability by activating only the required number of processing engines and SRAM banks, and power gating the remaining engines and memory banks, for the desired grid resolution.
3.3 Applications
The testchip has two bilateral filter engines, each processing 4 pixels/cycle. The processor
performs HDR imaging, low-light enhanced imaging and glare reduction using the bilateral
filter engines.
3.3.1 High Dynamic Range Imaging
The range of intensities captured in an image is limited by the resolution of the image sensor. Typically, image sensors use 8 bits/pixel resolution, which limits the dynamic range of intensities captured in an image to 256:1. The range of intensities we encounter in the real world, on the other hand, spans 5 to 6 orders of magnitude. HDR imaging is a technique for capturing a greater dynamic range between the brightest and darkest regions of an image than a traditional digital camera. It is done by capturing multiple images of the same scene with varying exposure levels, such that the low-exposure images capture the bright regions of the scene without loss of detail and the high-exposure images capture the dark regions. These differently exposed images are then combined into a high dynamic range image, which more faithfully represents the brightness values in the scene.
HDR Creation
The first step in HDR imaging is to create a composite HDR image, from multiple differently exposed images, which represents the true scene radiance value at each pixel of the image [121]. The true scene radiance value at each pixel is recovered from the recorded intensity I and the exposure time Δt as follows. The exposure E is defined as the product of the sensor irradiance R (the amount of light hitting the camera sensor, proportional to the scene radiance) and the exposure time Δt. The intensity I is a non-linear function of the exposure E, given by eq. (3.8):

I = f(E) = f(R × Δt)    (3.8)

We can then obtain the sensor irradiance as given by eq. (3.9), where g = log f^{-1}:

log(R) = g(I) − log(Δt)    (3.9)
The mapping g is known as the camera curve [121]. Figure 3-9 shows the camera curves for the RGB color channels of a typical camera sensor.
Figure 3-9: Camera curves that map the pixel intensity values onto the incident exposure.
The HDR creation module, shown in Figure 3-10, takes the values of a pixel from three different exposures (I_E1, I_E2, I_E3) and generates an output pixel representing the true scene radiance value (I_HDR) at that location. Since we are working with a finite range of discrete pixel values (8 bits/color), the camera curves are stored as combinational look-up tables (LUTs) to enable fast access.

Figure 3-10: HDR creation module.

The true (log) exposure values are obtained from the pixel intensities using the camera curves, followed by exposure time correction to obtain the (log) scene radiance. The three resulting (log) radiance values obtained from the three images represent the radiance of the same location in the scene. A weighted average of these three values is taken to obtain the final (log) radiance value. The weighting function gives a higher weight to the exposures in which the pixel value is closer to the middle of the response function, avoiding high contributions from images where the pixel value is saturated. Finally, an exponentiation is performed to get the final radiance value (16 bits/pixel/color). Processing is performed in the log domain for two reasons. First, the human visual system responds to the ratio of intensities rather than the absolute difference in intensities, which is represented effectively in the log domain. Second, it simplifies the computations to additions and subtractions instead of multiplications and divisions.
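A minimal sketch of this merge is shown below. The camera curve and weighting LUTs are hypothetical stand-ins (a toy linear response and a hat-shaped weighting), not the calibrated curves of Figure 3-9.

    import numpy as np

    def hdr_create(exposures, dts, g, weights):
        """Merge differently exposed 8-bit images into a radiance map.
        g[I] = log f^-1(I) is the camera-curve LUT; eq. (3.9) gives
        log R = g(I) - log(dt), averaged across exposures."""
        num = np.zeros(exposures[0].shape)
        den = np.zeros(exposures[0].shape)
        for img, dt in zip(exposures, dts):
            log_r = g[img] - np.log(dt)   # per-pixel log radiance estimate
            w = weights[img]              # favor mid-range, unsaturated pixels
            num += w * log_r
            den += w
        return np.exp(num / np.maximum(den, 1e-9))  # final exponentiation

    lvl = np.arange(256)
    g = np.log(np.linspace(1e-3, 1.0, 256))         # toy linear response
    weights = np.minimum(lvl, 255 - lvl) + 1.0      # hat weighting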
Tone Mapping
High dynamic range images (16 bits/pixel/color) cannot be properly displayed on low dynamic range media (8 bits/pixel/color), which constitute almost all commonly used displays. Figure 3-11 shows how an HDR image appears on a Low Dynamic Range (LDR) display if it is simply scaled from 16 bits/pixel/color to 8 bits/pixel/color. Properly preserving the dynamic range captured in the HDR image while displaying on LDR media requires tone mapping, which compresses the image dynamic range through contrast adjustment [87, 94, 122-124]. In this work, we leverage the local contrast adjustment based tone-mapping approach proposed in [87] and implement two-stage decomposition [125, 126] using bilateral filtering in hardware.
Figure 3-12 shows the processing flow for HDR imaging, including HDR creation and tone mapping. The 16 bits/pixel/color HDR image is split into intensity and color channels. A low-frequency base layer is created by bilateral filtering the HDR intensity in the log domain and a high-frequency detail layer is created by dividing the log intensity by the base layer. The dynamic range of the base layer is compressed by a scaling factor in the log domain. The scaling factor is user programmable to control the base contrast and achieve a desired look for the image; by default, a scaling factor of 5 is used. The detail layer is left untouched to preserve the details and the colors are scaled linearly to 8 bits/pixel/color. Merging the compressed base layer, the detail layer and the color channels results in a tone-mapped HDR image (I_TM). Figure 3-13 shows the tone-mapped version of the image shown in Figure 3-11.

Figure 3-11: HDR image scaled to 8 bits/pixel/color for displaying on LDR media. (HDR radiance map courtesy Paul Debevec [121].)

Figure 3-12: Processing flow for HDR creation and tone mapping for displaying HDR images on LDR media.
Figure 3-13: Tone-mapped HDR image. (HDR radiance map courtesy Paul Debevec [121].)
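The decomposition and base-contrast compression can be sketched as follows; the bilateral filter is passed in as a callable standing in for the grid engine, and the final normalization to the display range is an assumption of this illustration.

    import numpy as np

    def tone_map(radiance, bilateral_filter, base_scale=5.0):
        """Two-layer tone mapping: compress the base layer in the log
        domain by base_scale, keep the detail layer untouched."""
        log_i = np.log(radiance + 1e-6)
        base = bilateral_filter(log_i)      # low-frequency base layer
        detail = log_i - base               # detail layer (a division in
                                            # the linear domain)
        out = np.exp(base / base_scale + detail)
        return out / out.max() * 255.0      # scale to the 8-bit display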
Figure 3-14 shows the hardware configuration for HDR imaging. The hardware performs HDR imaging by activating the HDR Create module for pre-processing, which merges three LDR exposures into one 16 bits/pixel/color HDR image, and the Contrast Adjustment module for post-processing, which performs contrast scaling and merging of the intensity and color data. Both bilateral grids are configured to perform filtering in an interleaved manner, where each grid processes alternate blocks in parallel. The processor also preserves the 16 bits/pixel/color HDR image in external memory, which can be tone-mapped using a different software or hardware implementation. Figure 3-15 shows a set of input low dynamic range exposures that capture different ranges of intensities in the scene and the tone-mapped HDR output image.
Figure 3-14: Processor configuration for HDR imaging.
Figure 3-15: Input low dynamic range images: (a) under-exposed image, (b) normally exposed image, (c) over-exposed image. Output image: (d) tone-mapped HDR image.
3.3.2 Glare Reduction
Images captured with a bright light source in or near the field of view are affected significantly by glare, which reduces contrast and color vibrance and often leads to loss of scene details due to pixel saturation. The effect of veiling glare on HDR image capture and display is measured in [127]. An approach to quantify the presence of veiling glare and related optical artifacts, and to reduce glare through deconvolution by a measured glare spread function, is proposed in [128]. Glare removal in HDR imaging, by estimating a global glare spread function for a scene based on fitting a radially symmetric polynomial to the fall-off of light around bright pixels, is proposed in [129]. Glare is modeled as a 4D ray-space phenomenon in [130] and an approach to remove glare by outlier rejection in ray-space is proposed.
In this work, we address the effects of glare by improving the contrast and enhancing
colors in the captured image. This process is similar to performing a single image HDR
tone-mapping operation, with the exception that the contrast is increased instead of
compressed. Programmability of the contrast adjustment module, shown in Figure 3-16,
allows us to achieve this by simply using a different contrast adjustment factor than the
one used for HDR imaging.
Figure 3-16: Contrast adjustment module. Contrast is increased or decreased depending on the adjustment factor.
Figure 3-17 shows the processing flow for glare reduction.
Figure 3-17: Processing flow for glare reduction.
The input image is split into intensity and color channels. A low-frequency base layer and a high-frequency detail layer are created by bilateral filtering the intensity. The contrast of the base layer is enhanced using the contrast adjustment module, which is also used in HDR tone mapping. The adjustment factor is user programmable to achieve the desired look for the output image; an adjustment factor of 0.25 is used by default for glare reduction. The scaled color data is merged with the contrast-enhanced base layer and the detail layer to obtain a glare-reduced output image. Figure 3-18 shows the processor configuration for glare reduction.
Figure 3-19 shows an input image with glare and the glare reduced output image. Glare
reduction recovers details that are white-washed in the original image and enhances the
image colors and contrast.
Figure 3-18: Processor configuration for glare reduction.
Figure 3-19: (a) Input image with glare. (b) Output image with reduced glare.
3.3.3 Low-Light Enhanced Imaging
Photography in low-light situations is a challenging task due to a number of conflicting requirements. Capturing dynamic scenes without blurring requires short exposure times. However, the inadequate amount of light entering the image sensor in this short duration results in images that are dark, noisy and lacking detail. A possible solution to this problem is to use a flash to add artificial light to the scene. This addresses the problems of brightness, noise and lack of detail, while enabling short exposure times to avoid blurring. However, use of the flash defeats the original purpose of creating a realistic representation of the scene in the photograph. The artificial light of the flash destroys the natural scene ambience: it makes objects near the flash appear extremely bright, while objects beyond the range of the flash appear very dark. In addition, it introduces unpleasant artifacts due to flash shadows.
Combining the information captured in images of the same scene with flash (high details
and low noise) and without flash (natural scene ambience) in quick succession provides
a possible way to avoid the limitations of an image with flash or an image without flash
alone. Using flash and no-flash images to estimate ambient illumination and using that
information for color balancing is proposed in [131]. Creating enhanced images by processing a stack of registered images of the same scene is proposed in [132]. This approach
allows users to combine multiple images, captured under varying lighting conditions, and
blend them in a spatially varying manner. Acquiring multiple images with different levels
of flash intensity, including no flash, and subsequently adjusting the flash level by linearly
interpolating among these images is proposed in [133]. Images of the same scene, captured
with different aperture and focal-length settings, but not with different flash settings, are
merged by interpolating between the settings in [134]. Approaches for synthetically relighting scenes using sequences of images captured under varying lighting conditions have
also been proposed [135-137].
In this work, we implement an approach for low-light enhancement, similar to the approaches proposed in [138] and [139], that merges two images captured in quick succession, one taken without flash (I_NF) and one with flash (I_F). The main difference between our approach and [138, 139] lies in the flash shadow treatment and how that affects the overall filtering operation. The large scale features in an image can be considered as representative of the scene illumination, and the details as representative of the scene albedo [140]. Both the images with and without flash are decomposed into a large scale base layer and a high frequency detail layer through bilateral filtering. To preserve the natural scene ambience in the output, the large scale base layer from the no-flash image is selected. This layer is merged with the detail layer from the flash image to achieve high detail and low noise in the output. However, flash shadows need to be considered during the merging process and treated to avoid artifacts in the final output.
The approach used in [138] assumes that regions in flash shadow appear exactly the same in both the flash and the no-flash image, so any regions where the differences in intensities between the flash and no-flash image are small are considered shadow regions. A shadow mask representing such regions is created and the details from the flash detail layer are only added to the no-flash base layer in the regions not covered by the mask. This approach avoids flash shadows, but regions that are farther away from the flash and do not receive sufficient illumination are also detected as shadows. Since no details are added in these regions, large areas of the image often tend to appear smooth and lacking detail.
The approach in [139] makes a similar assumption to detect flash shadows. The regions
where the differences in intensities between the flash and no-flash image are the lowest
are considered to be the umbra of the flash shadows. The gradients at the flash shadow
boundaries are then analyzed to determine the penumbra regions. The shadow mask,
consisting of the umbra and the penumbra regions is then used to exclude shadow pixels
from bilateral filtering.
In this scheme, while filtering a pixel, only the pixels in its
neighborhood that are outside the shadow region are used and the pixels in the shadow
region receive no weight. This approach also assigns colors from the flash image to the
final output. For the shadow regions, local color correction is performed that copies colors from illuminated regions in the flash image. Since this approach requires a specialized type of bilateral filtering that takes the shadow mask into account, it cannot be easily implemented using the bilateral grid.

Figure 3-20: Processing flow for low-light enhancement.
To address these challenges, we took an algorithm/architecture co-design approach and developed a technique that decouples the core filtering operation from the shadow correction operation. This enables us to perform bilateral filtering efficiently using the bilateral grid and correct for flash shadows as a post-processing step. Figure 3-20 shows the processing flow for low-light enhancement. The RGB color channels are processed independently and merged at the end to generate the final output. Figure 3-21 shows the processor configuration for low-light enhancement.
The bilateral grid is used to decompose both images into base and detail layers. The scene
ambience is captured in the base layer of the no-flash image and details are captured in
the detail layer of the flash image. In this mode, one bilateral filter engine is configured
to perform bilateral filtering on the flash image and the other to perform cross-bilateral
filtering, given by eq. (3.10), on the no-flash image using the flash image. The location
of the grid cell is determined by the flash image and the intensity value is determined by
the no-flash image.
Figure 3-21: Processor configuration for low-light enhancement.
I_CB(p) = (1/W(p)) × Σ_{n=−N}^{N} G_S(n) · G_I(I_F(p) − I_F(p − n)) · I_NF(p − n)    (3.10)

where,

W(p) = Σ_{n=−N}^{N} G_S(n) · G_I(I_F(p) − I_F(p − n))    (3.11)
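For reference, a direct (unaccelerated) evaluation of eqs. (3.10)-(3.11) is sketched below; the processor instead computes this through the bilateral grid, and the kernel radius N and Gaussian parameters here are illustrative.

    import numpy as np

    def cross_bilateral(I_nf, I_f, N=8, sigma_s=4.0, sigma_r=0.1):
        """Cross-bilateral filter: range weights come from the low-noise
        flash image I_f; the averaged values come from the no-flash
        image I_nf. Images are floats in [0, 1]."""
        h, w = I_nf.shape
        out = np.zeros_like(I_nf)
        ys, xs = np.mgrid[-N:N+1, -N:N+1]
        Gs = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))  # spatial kernel
        pf = np.pad(I_f, N, mode='edge')
        pnf = np.pad(I_nf, N, mode='edge')
        for y in range(h):
            for x in range(w):
                nf_win = pnf[y:y+2*N+1, x:x+2*N+1]
                f_win = pf[y:y+2*N+1, x:x+2*N+1]
                Gi = np.exp(-(I_f[y, x] - f_win)**2 / (2 * sigma_r**2))
                W = (Gs * Gi).sum()                        # eq. (3.11)
                out[y, x] = (Gs * Gi * nf_win).sum() / W   # eq. (3.10)
        return out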
Shadow Correction
A shadow correction module is implemented that merges the details from the flash image with the base layer of the cross-bilateral filtered no-flash image and corrects for the flash shadows to avoid artifacts in the output image. The shadow correction algorithm was developed in collaboration with Srikanth Tenneti. Instead of detecting the flash shadows and attempting to avoid them while adding details, we create a mask representing regions with high detail in the scene. This is done by detecting edges that appear in the bilateral filtered no-flash image, which preserves the scene details but avoids spurious edges due to noise. Figure 3-22 shows the mask generation process. Gradients are computed at each pixel within blocks of 4 × 4 pixels. If the gradient at a pixel is higher than the average gradient for that block, the pixel is assigned as an edge pixel. This results in a binary mask that highlights all the strong edges in the scene but no false edges due to the flash shadows.

Figure 3-22: Generating a mask representing regions with high scene details.
The details from the flash image are added to the filtered no-flash image, as shown in Figure 3-23, only in the regions represented by the mask. A linear filter is used to smooth the mask to ensure that the resulting image does not have discontinuities. This implementation of the shadow correction module handles shadows effectively to produce low-light enhanced images without artifacts.

Figure 3-23: Merging flash and no-flash images with shadow correction.
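A sketch of the mask generation and merge follows; the gradient operator and the smoothing filter are stand-ins for the hardware's 4 × 4 block gradients (Figure 3-22) and linear smoother.

    import numpy as np

    def detail_mask(base_nf, block=4):
        """Mark a pixel as an edge pixel if its gradient magnitude exceeds
        the mean gradient of its 4x4 block."""
        gy, gx = np.gradient(base_nf)
        g = np.abs(gx) + np.abs(gy)
        mask = np.zeros_like(g)
        h, w = g.shape
        for j in range(0, h - h % block, block):
            for i in range(0, w - w % block, block):
                blk = g[j:j+block, i:i+block]
                mask[j:j+block, i:i+block] = blk > blk.mean()
        return mask

    def merge_with_shadow_correction(base_nf_log, detail_f_log, mask, smooth):
        """Add flash details to the no-flash base only where the smoothed
        mask is set; adding in the log domain multiplies detail into the
        base in the linear domain."""
        return np.exp(base_nf_log + smooth(mask) * detail_f_log)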
Figure 3-24 shows a set of flash and no-flash images, the no-flash base layer from bilateral
filtering, the flash detail layer, the edge mask, created using the process described in
Figure 3-22, and the low-light enhanced output image.
Figure 3-25 shows a set of flash and no-flash images and the low-light enhanced output
image. The enhanced output effectively reduces noise while preserving the natural look
and scene details, and avoiding artifacts due to flash shadows.
Figure 3-26 compares the output from our approach with that from [138] and [139].
Our approach and the approach in [138] use colors from the no-flash image for the final
output. The approach in [139] uses the colors from the flash image for the output. Our
approach achieves output quality comparable to the previous approaches, as indicated by
the difference images. Decoupling the shadow correction process from the core bilateral
filtering process enables efficient implementation using the bilateral grid.
Figure 3-24: (a) Image with flash, (b) image without flash, (c) no-flash base layer, (d) flash detail layer, (e) edge mask, (f) low-light enhanced output.
Figure 3-25: Input images: (a) image with flash, (b) image without flash. Output image: (c) low-light enhanced image.
Figure 3-26: Comparison of the image quality of the proposed approach with that of [138] and [139]. (a) Output from our approach, (b) output from [138], (c) output from [139], (d) difference image between (a) and (b), amplified 5×, (e) difference image between (a) and (c), amplified 5×.
3.4 Low-Voltage Operation
In addition to algorithm optimizations and highly-parallel processor architectures, the
third key component of energy-efficient system design is implementing such architectures
using ultra-low power circuits. The energy consumed by a digital circuit can be minimized
by operating at the optimal VDD, which requires the ability to operate at low voltage
[17,38,141,142].
3.4.1 Statistical Design Methodology
In this work, we use a statistical timing analysis approach, similar to the Operating Point Analysis (OPA) based statistical design methodology outlined in Section 2.3, to ensure reliable low-voltage operation. One important difference in the approach, however, is that the transistor random variables corresponding to local variations, also known as mismatch parameters, were not available from the foundry for the 40 nm CMOS library used in this design. In the absence of the mismatch parameters, we used the global corner delays, which model the impact of global variations, to estimate the impact of local variations. The typical corner delay provides the nominal delay for the standard cell. The best and worst corner delays are used to model the −3σ and +3σ global corner delays respectively. At low voltage, the impact of local variations is comparable to global variations [81]. As a result, we use the standard deviation (σ) obtained from the global corner delays to model the impact of local variations as a Gaussian Probability Density Function (PDF) with the mean delay given by the global corner delay. A subset of the standard cells from the 40 nm CMOS logic library is analyzed in this manner to model the impact of variations at 0.5 V. These models of standard cell variations are then used to perform setup/hold analysis for the timing paths in the processor.
The setup/hold timing closure for the processor, with a 3σ performance requirement at 0.5 V, is performed using the OPA based approach. The PDF of delay at 0.5 V for a representative path from the design, computed using the models of standard cell variations as described above, is shown in Figure 3-27. The global corner delay for this path is 21.9 ns. However, after accounting for the local variations, OPA estimates the 3σ delay to be 36.1 ns. Note that even if the standard cell delay PDFs are modeled as Gaussians, the timing path delay PDF can be non-Gaussian.
Figure 3-27: Delay PDF of a representative timing path from the computational photography processor at 0.5 V. The STA estimate of the global corner delay is 21.9 ns; the 3σ delay estimate using OPA is 36.1 ns.
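The underlying statistical reasoning can be illustrated with a small Monte Carlo experiment. This is not the OPA algorithm itself, which propagates the delay PDFs analytically; the 20-cell path with per-cell σ = 0.25 is hypothetical.

    import numpy as np

    def path_delay_3sigma(nominal, sigma, trials=100_000, seed=0):
        """Sample each cell delay as a Gaussian (mean = corner delay,
        sigma from the global corners), sum along the path, and read off
        the 3-sigma (99.87th percentile) path delay."""
        rng = np.random.default_rng(seed)
        cells = rng.normal(nominal, sigma, size=(trials, len(nominal)))
        return np.percentile(cells.sum(axis=1), 99.865)

    nom = np.full(20, 1.0)
    sig = np.full(20, 0.25)
    print(path_delay_3sigma(nom, sig))  # ~23.4, vs. a nominal sum of 20.0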
Table 3.1 shows statistics on the number of paths analyzed for both setup and hold analysis of the chip. Setup/hold fixing using OPA ensures that cells that are very sensitive to VT variations are not used in the critical paths. This helps improve the 3σ performance at 0.5 V by 32%, from 17 MHz to 25 MHz. The OPA analysis for timing paths ensures reliable functionality at 0.5 V.
3.4.2 Multiple Voltage Domains
SRAM based on the six-transistor (6T) cell is the most common form of embedded memory in processor design. However, low-voltage operation of 6T SRAM faces significant challenges from process variations, bit cell stability and sensing. Threshold voltage variations among the transistors that constitute the 6T cell significantly degrade the read/write stability of the bit cell, especially at low voltages [143].
Table 3.1: Setup/Hold Timing Analysis at 0.5 V

Setup Analysis @ 25 MHz
Phase   Data Path   Clock Path   Paths Analyzed   Worst Slack   % Fail
1       STA (+3σ)   STA (−3σ)    95k              −10.7 ns      3.6%
2       STA (+3σ)   OPA          3.4k             −2.9 ns       1.5%
3       OPA         OPA          52               −0.05 ns      13.4%
Paths requiring fixing (before timing closure): 7

Hold Analysis
Phase   Data Path   Clock Path   Paths Analyzed   Worst Slack   % Fail
1       STA (−3σ)   STA (+3σ)    95k              −8.2 ns       2.8%
2       STA (−3σ)   OPA          2.7k             −1.8 ns       2.4%
3       OPA         OPA          65               −0.13 ns      13.8%
Paths requiring fixing (before timing closure): 9
To ensure that the memory operates reliably as the logic voltage is scaled down, we use separate voltage domains for logic and memory. This allows us to operate the memory at the nominal voltage of 0.9 V, while scaling the logic voltage down to 0.5 V. Voltage level shifters, capable of transitioning between 0.5 V and 0.9 V, are used to transition signals between the logic and memory voltage domains. Figure 3-28 shows the logic and memory voltage domains in the processor and the level shifters used to transition between the domains. The logic domain operates at voltage VDDL and the memory domain at voltage VDDM.
3.5 Memory Bandwidth Optimization
The target external memory consists of two 64M × 16 bit DDR2 DRAM modules with a burst length of 8. The processor generates 23 bit addresses for accessing the DRAM, divided as a 13 bit row address, 3 bit bank address and 7 bit column address. A 32 bit wide 266 MHz DDR2 memory controller is implemented using a Xilinx XC5VLX50 FPGA.
Figure 3-28: Separate voltage domains for logic and memory. Level shifters are used to transition between domains.
We use a modified version of the Xilinx MIG DRAM controller which supports a lazy pre-charge policy. Hence, a row is only pre-charged when an access is made to a different row in the same bank. The 256 bit DDR2 interface is connected to the 64 bit processor interface through asynchronous FIFOs. This enables the processor to work with any 256 bit DRAM system and allows the processor and memory to operate at different frequencies.
The goal of memory optimization is to reduce the memory size and bandwidth required to support real-time operation. For a naive bilateral filtering implementation in the 2D image domain, with a 64 × 64 filter kernel and a 4 kB cache to store 64 × 64 pixels (8 bits each), processing 1080p images (1920 × 1080 at 30 fps) in real time requires a DRAM bandwidth of:

BW_2D Bilateral = (1080 × 30 × 64 × 64 + 1919 × 1080 × 30 × 64) × 3 colors
                = 11.5 GB/s    (3.12)
64 × 64 pixels are accessed for the first element in each row and cached in the buffer. For subsequent pixels in the same row, only the next 64 pixels need to be accessed. The processing for the RGB color channels is performed independently.
Algorithmic optimizations that leverage the bilateral grid structure to perform bilateral filtering in the 3D grid domain, with 16 × 16 pixel blocks, 16 intensity levels and a 3 × 3 × 3 filter kernel, reduce the bandwidth requirement to:
BW_3D Grid = BW_Grid Creation + BW_Grid Filtering + BW_Grid Interpolation    (3.13)

where,

BW_Grid Creation = BW_Image Read + BW_Grid Write
                 = (1920 × 1080 × 30 + (1920 × 1080 × 30)/(16 × 16 blocks) × 16 levels × 4 B/level) × 3 colors
                 = 222.5 MB/s    (3.14)

BW_Grid Filtering = BW_Grid Read + BW_Filtered Grid Write
                  = (1920 × 1080 × 30)/(16 × 16 blocks) × 16 levels × 4 B/level × (3 × 3 × 3 kernel) × 3 colors
                    + (1920 × 1080 × 30)/(16 × 16 blocks) × 16 levels × 1 B/level × 3 colors
                  = 1212.5 MB/s    (3.15)
BW_Grid Interpolation = BW_Filtered Grid Read + BW_Output Image Write
                      = (1920 × 1080 × 30)/(16 × 16 blocks) × 16 levels × 1 B/level × 3 colors
                        + 1920 × 1080 × 30 × 3 colors
                      = 189.1 MB/s    (3.16)
Combining the bandwidth requirements for grid creation, filtering and interpolation, from
eq. (3.14), eq. (3.15) and eq. (3.16), the total bandwidth requirement for processing the
3D bilateral grid, from eq. (3.13), is:
BW_3D Grid = 222.5 MB/s + 1212.5 MB/s + 189.1 MB/s = 1624.1 MB/s    (3.17)
The significant downsampling and the reduction in computational complexity enabled by the bilateral grid, compared to bilateral filtering in the 2D image domain, provide a bandwidth reduction of 86%, from 11.5 GB/s to 1.6 GB/s.
Architectural optimizations and the memory management approach described in Section 3.2.4, which uses task scheduling and the 21.5 kB on-chip SRAM as a cache for intermediate data, further reduce the memory bandwidth. This approach only requires reading the original image and writing back the filtered output, resulting in a bandwidth requirement of:
BW_Processor = BW_Image Read + BW_Output Image Write
             = 1920 × 1080 × 30 × 3 colors + 1920 × 1080 × 30 × 3 colors
             = 356 MB/s    (3.18)
The memory management approach enables processing in the 3D grid domain while storing only two rows of created grid blocks and one row of filtered grid blocks, without having to create an entire grid before processing. This data can be stored efficiently on-chip using SRAM, avoiding a significant number of off-chip DRAM accesses and reducing the memory bandwidth by 97% compared to bilateral filtering in the 2D image domain, from 11.5 GB/s to 356 MB/s.
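These estimates are straightforward to reproduce; the snippet below re-evaluates eqs. (3.12), (3.17) and (3.18).

    MB, GB = 1024**2, 1024**3
    px_per_s = 1920 * 1080 * 30                 # 1080p30 pixel rate
    blocks_per_s = px_per_s / (16 * 16)         # 16x16 blocks, 16 levels

    bw_2d = (1080 * 30 * 64 * 64 + 1919 * 1080 * 30 * 64) * 3 / GB
    bw_create = (px_per_s + blocks_per_s * 16 * 4) * 3 / MB
    bw_filter = (blocks_per_s * 16 * 4 * 27 + blocks_per_s * 16) * 3 / MB
    bw_interp = (blocks_per_s * 16 * 3 + px_per_s * 3) / MB
    bw_sched = 2 * px_per_s * 3 / MB

    print(bw_2d)                                 # ~11.5 GB/s  (eq. 3.12)
    print(bw_create + bw_filter + bw_interp)     # ~1624 MB/s  (eq. 3.17)
    print(bw_sched)                              # ~356 MB/s   (eq. 3.18)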
Based on the number of memory accesses, we can estimate the memory power using a
memory power consumption model [144]. The memory size is optimized for the specific
implementation. For example, for the 2D bilateral filtering implementation and for our grid with task scheduling implementation, the DRAM only stores an input image and an output image, which requires 12 MB of memory. In contrast, the 3D grid implementation without task scheduling requires storing the created and filtered grids as well as the input and output images, which requires 13.7 MB of memory. Figure 3-29 shows the memory bandwidth and estimated power consumption for 2D bilateral filtering, after algorithmic optimizations with the 3D bilateral grid, and after architectural optimizations involving memory management with task scheduling. The bilateral grid reduces the memory power consumption by 75%, from 697 mW to 175 mW. Architectural optimizations with memory management further reduce the memory power to 108 mW, an overall savings of 85% compared to bilateral filtering in the 2D image domain.

Figure 3-29: Memory bandwidth and estimated power consumption for 2D bilateral filtering, 3D bilateral grid and bilateral grid with memory management using task scheduling.
linearly with the bandwidth because of the standby power consumption of the memory.
This comparison demonstrates the significance of algorithm/architecture co-design and
considering trade-offs for optimizing power consumption not only for the processor core
but for the system as a whole, including external memory and communication costs.
3.6 Measurement Results
The testchip, shown in Figure 3-30, is implemented in 40 nm CMOS technology with an active area of 1.1 mm × 1.1 mm, 1.94 million transistors and 21.5 kB of SRAM. The processor is verified to be operational from 25 MHz at 0.5 V to 98 MHz at 0.9 V with the SRAMs operating at 0.9 V.
This chip is designed to function as an accelerator core as part of a larger microprocessor
system, utilizing the system's existing DRAM resources.

Figure 3-30: Die photo of the testchip. Highlighted boxes indicate SRAMs. HDR, CR and SC refer to the HDR create, contrast reduction and shadow correction modules respectively.

For standalone testing of this chip, a 32 bit wide 266 MHz DDR2 memory controller was implemented using a Xilinx
XC5VLX50 FPGA. The performance vs. energy trade-off of the testchip over a range of VDD is shown in Figure 3-31. For the best image quality settings (grid block size 16 × 16 with 16 intensity levels), the processor operates from 25 MHz at 0.5 V with 2.3 mW power consumption to 98 MHz at 0.9 V with 17.8 mW power consumption.
Figure 3-31: Processor performance: trade-off of energy vs. performance for varying VDD.
The processing run-time scales linearly with the image size, reaching 60 megapixels/second at 0.9 V. Figure 3-32 shows the area and power breakdowns of the processor for the bilateral filter engines and the pre-processing and post-processing modules. The power breakdown is obtained from post-layout simulations. The shadow correction module is power gated during HDR imaging, and the HDR creation and contrast adjustment modules are power gated during low-light enhancement.
Figure 3-32: Processor area (number of gates) and power breakdown.
3.6.1 Energy Scalable Processing
The grid scalability, described in Section 3.2.5, provides a trade-off between grid resolution
and the amount of energy required for processing. Figure 3-33 demonstrates this trade-off
at 0.9 V for grid block size varying from 16 x 16 pixels to 128 x 128 pixels and the number
of intensity levels varying from 4 to 16.
Figure 3-33: Energy scalable processing. Grid resolution vs. energy trade-off at 0.9 V.
The energy consumption has a roughly linear dependence on the number of grid intensity levels. This is because the number of active processing engines and memory banks is proportional to the number of intensity levels, which results in an approximately linear scaling in power consumption while the processing run-time remains unchanged. The energy consumption varies roughly quadratically with the grid block size, because the number of blocks to process decreases quadratically with the downsampling factor (the same as the block size). This results in an approximately quadratic scaling in run-time while the processing power consumption remains unaffected. A combination of these grid scaling parameters enables processing energy scalability from 0.19 mJ to 1.37 mJ per megapixel at 0.9 V.
The energy vs. image quality trade-off is depicted by a comparison of output images for
different grid configurations, for HDR imaging and low-light enhancement in Figure 3-34
and Figure 3-35 respectively. The impact of intensity downsampling on the image quality
is much more significant than spatial downsampling because the edge-preserving nature
of the bilateral grid depends on the number of intensity levels.
Figure 3-34: Energy/resolution scalable processing. HDR imaging outputs for (a) grid block size: 16 × 16, intensity levels: 16, energy: 13.7 mJ; (b) grid block size: 128 × 128, intensity levels: 16, energy: 4.2 mJ; (c) grid block size: 16 × 16, intensity levels: 4, energy: 6.4 mJ; (d) grid block size: 128 × 128, intensity levels: 4, energy: 1.9 mJ.
Figure 3-35: Energy/resolution scalable processing. Low-light enhancement outputs for (a) grid block size: 16 × 16, intensity levels: 16, energy: 13.7 mJ; (b) grid block size: 128 × 128, intensity levels: 16, energy: 4.2 mJ; (c) grid block size: 16 × 16, intensity levels: 4, energy: 6.4 mJ; (d) grid block size: 128 × 128, intensity levels: 4, energy: 1.9 mJ.
3.6.2 Energy Efficiency
Image processing pipelines typically involve a complex set of interconnected operations, where each processing stage has large data dependencies. These operations do not automatically lend themselves to spatial and temporal parallelism. Several memory read/write operations are required for every stage of processing, making the cost of memory accesses often higher than the cost of computations [145]. This makes it difficult to achieve efficient software implementations without significant effort to manually optimize the code, including decisions regarding memory access patterns and order of processing. Significant effort is also required to enhance processing locality and parallelism using intrinsics and other low-level programming techniques [146, 147].
Table 3.2 shows a comparison of the processor performance with implementations on
other mobile processors at 0.9 V. Software that replicates the functionality of the testchip
and maintains identical image quality is implemented on the mobile processors.
The
implementations are optimized for multi-threading and multi-core processing as well as
taking advantage of available GPU resources on the processors. Processing runtime and power consumption during software execution are measured. The processor achieves more than 5.2× faster performance than the fastest software implementation and consumes over 40× less power than the most power-efficient one, resulting in an energy reduction of more than 280× compared to software implementations on some of the recent mobile processors while maintaining the same output image quality.
Table 3.2: Performance comparison with mobile processor implementations at 0.9 V.

Processor                    Technology (nm)   Frequency (MHz)   Power (mW)   Runtime* (s)   Energy* (mJ)
Intel Atom [148]             32                1800              870          4.96           4315
Qualcomm Snapdragon [24]     28                1500              760          5.19           3944
Samsung Exynos [25]          32                1700              1180         4.05           4779
TI OMAP [149]                45                1000              770          6.47           4981
This Work                    40                98                17.8         0.771          13.7

*Image size: 10 megapixel
To make software implementations more efficient and easier to develop without significant manual tuning, the Halide image processing language [150] proposes decoupling the algorithm definition from its execution strategy and automating the search for optimized mappings of the pipelines to parallel processors and memory hierarchies. An optimizing compiler generates higher performance implementations from an algorithm and a schedule described using Halide. We compared the processing performance using Halide with a C implementation and the hardware implementation of our processor. A moderately optimized implementation generated using Halide for an ARM core, running on a Qualcomm Snapdragon processor [24], was able to process a 10 megapixel image in 2.1 seconds; with better optimization, the runtime could be reduced even further. This compares with 4.05 seconds for the manually optimized C implementation running on the same processor. The hardware implementation completed the processing in 771 ms. Halide provided significant performance gains while making the software easier to implement.
It is also useful to quantify the energy-efficiency of processors in terms of operations performed per second per unit of power consumption (MOPS/mW), which highlights the trade-offs associated with different architectures. Figure 3-36 shows such a comparison for processors ranging from fully programmable CPUs and mobile processors to FPGAs and ASICs. An operation is defined as a 16 bit add computation.
Figure 3-36: Energy efficiency of processors ranging from CPUs and mobile processors to FPGAs and ASICs.
Processor   Description
1           Intel Sandy Bridge [20]
2           Intel Ivy Bridge [21]
3           Multimedia DSP [23]
4           Mobile Processors [24, 25]
5           GPGPU Application Processor [26]
6           FPGA with hierarchical interconnects [151]
7           SVD ASIC [28]
8           Video Decoder ASIC [29]
9           Multi-Granularity FPGA [152]
10          This work (0.5 V)
The significant enhancement in processing speed, as well as the reduction in power consumption, achieved by the hardware implementation in this work, resulting in 2 to 3 orders of magnitude higher energy-efficiency, can be attributed to several factors.
1. The algorithmic and architectural optimizations maximize data locality and enable spatial and temporal parallelism that helps maximize the number of computations performed per memory access. This amortizes the cost of memory accesses over computations and reduces the memory bandwidth. Even an optimized software implementation has very limited control over the processing architecture and memory management strategies of a general purpose processor, and so cannot achieve an optimal implementation.
2. The high amount of parallelism enabled by algorithm/architecture co-design facilitates real-time performance while operating at less than 100 MHz, compared to other processors operating at frequencies higher than 1 GHz.
3. The hardware implementation allows careful pipelining with flexible bit widths, preserving the full resolution of fixed-point computations at each stage of the pipeline, whereas software implementations are restricted to fixed bit widths of 32 or 64 bits. Attempting to adapt bit widths to match the required resolution at pipeline stages often leads to a degradation in performance on these cores instead of an enhancement, because it introduces additional typecasting operations in software processing.
4. Hardware implementations tailored to specific applications avoid the significant overhead of the control unit that is essential in a general purpose processor to configure the processing units and complex memory hierarchies. The performance and power overhead of just the instruction fetch and decode unit can be significant. Even with an optimized software implementation, it is hard to avoid uneven pipelines and variable memory latencies, resulting in stalls that prevent optimal resource utilization.
5. The ability to scale voltage and frequency is key to ensuring minimum energy consumption for the desired performance. The active power consumption of circuits scales quadratically with voltage. Circuits that operate reliably down to near-threshold voltage enable minimum energy point operation for maximizing efficiency. General purpose processors rarely provide such flexibility to optimize energy and performance requirements.
3.7 System Integration
The processor is integrated, as shown in Figure 3-37, with external DDR2 memory, a camera and a display. A 32 bit wide 266 MHz DDR2 memory controller and a USB interface for communicating with a host PC are implemented using a Xilinx XC5VLX50 FPGA. A software application running on the host PC is developed for processor configuration, image capture, activating processing and displaying results.
Figure 3-37: Processor integration with external memory, camera and display.
The Printed Circuit Board (PCB) that integrates the processor, memory and interfaces
is shown in Figure 3-38, along with a setup that connects to a camera and display. The
system provides a portable platform for live computational photography.
Figure 3-38: Printed circuit board and system integration with camera and display.
3.8 Summary and Conclusions
In this work, we developed a reconfigurable processor for computational photography that enables real-time processing in an energy-efficient manner. The processor performs HDR imaging, low-light enhancement and glare reduction using a scalable bilateral grid. Algorithmic optimizations that leverage the 3D bilateral grid structure map the computationally complex non-linear filtering operation onto an efficient linear filtering operation in the 3D grid domain, significantly reducing the computational complexity and memory requirement, enhancing processing locality and enabling a highly parallel architecture. Architectural optimizations exploit parallelism to enable high-throughput real-time performance while operating at low frequency, and achieve hardware scalability that enables energy vs. output quality trade-offs for energy/resolution scalable processing. Through algorithm/architecture co-design, an approach for low-light enhancement and flash shadow correction that enables efficient implementation using the bilateral grid architecture is developed. Circuit design for low voltage operation ensures reliable performance down to 0.5 V, enabling a wide voltage operating range for voltage/frequency scaling and achieving minimum energy operation for the desired performance.
The processor is implemented using 40 nm CMOS technology and verified to be operational from 98 MHz at 0.9 V with 17.8 mW power consumption down to 25 MHz at 0.5 V with 2.3 mW power consumption. At 0.9 V, it can process up to 60 megapixel/s. The scalability of the architecture enables processing from 0.19 mJ/megapixel to 1.37 mJ/megapixel for different grid configurations at 0.9 V, while trading off output quality for energy. The processor achieves 280x energy reduction compared to identical software implementations on recent mobile processors. The energy-scalable implementation proposed in this work enables efficient integration into portable multimedia devices for real-time computational photography.
Based on the system design approach, from algorithms to circuit implementation, adopted
in this work, the following conclusions can be drawn.
1. Hardware-oriented algorithmic reframing is key to efficient implementation. The efficiency gains achievable for a system through architectural and circuit optimizations are limited if the algorithm requires sequential processing with large data dependencies. The significant reduction in computational complexity, memory size and bandwidth, achieved through the algorithmic transformation from inefficient non-linear filtering in the image domain to efficient linear filtering in the 3D grid domain, demonstrates the significance of algorithmic trade-offs in system design.
2. Scalable architectures, with efficient clock and power gating, enable energy vs. performance/quality trade-offs that are extremely desirable for mobile processing. This
energy-scalable processing allows the user to determine the energy usage for a task,
based on the battery state or intended usage for the output.
3. Memory management, covering both on-chip memory size and off-chip memory bandwidth, is critical to maximizing system energy-efficiency. The reduction in external memory bandwidth from 11.5 GB/s to 356 MB/s, and the corresponding power consumption from 697 mW to 108 mW, through algorithm/architecture co-design, careful task scheduling and use of an on-chip SRAM cache, demonstrates this effect.
4. Low-voltage circuit operation is important to enable voltage/frequency scaling and
attain minimum energy operation for the desired performance.
Chapter 4

Portable Medical Imaging Platform
Medical imaging techniques play a crucial role in the diagnosis and treatment of numerous medical conditions. Traditionally, medical diagnostic systems have been restricted
to sophisticated clinical environments due to cost, size and expertise required to operate
such equipment. Recent advances in computational photography and computer vision,
coupled with efficient high-performance processing on portable multimedia devices, provide a unique opportunity for high quality and highly capable medical imaging systems
to become much more portable and cost efficient. Image processing techniques such as
High Dynamic Range (HDR) imaging, contrast enhancement, image segmentation and
registration, could be used to ease the requirements of high-precision optical front-ends
for medical imaging systems that make such equipment bulky and expensive, and enable
digital cameras and smartphones to be used for medical imaging. Proliferation of connected portable devices presents an opportunity for making sophisticated medical imaging
systems available to small clinics and individuals in rural areas and emerging countries to
enable early diagnosis and better treatment outcomes.
4.1 Skin Conditions - Diagnosis & Treatment
Skin conditions are among the top five leading causes of nonfatal disease burden globally [153] and can have a significant negative impact on the quality of life. Chronic skin
conditions are often easily visible and can be characterized by multiple features including
pigmentation, erythema, scale or other secondary features. Vitiligo is one such common
condition found in up to 2% of the worldwide population [154]. The disease is characterized by loss of pigment in the skin, hair and mucous membranes caused in part by
autoimmune destruction of epidermal melanocytes [155,156]. Due to its appearance on
visible areas of the skin, Vitiligo can have a significant negative impact on the quality of
life in affected children and adults.
4.1.1 Clinical Assessment: Current Approaches
Treatments of skin conditions aim to arrest disease progression and induce repigmentation of affected skin. Several surgical and non-surgical treatments, such as topical immunomodulators, phototherapy, and surgical grafting and transplantation, are available [157,158]. However, diagnosis is primarily based on visual clinical evaluation. Dermoscopy [159,160] is a noninvasive technique that aids visual observations by allowing the clinician to perform direct microscopic examination of diagnostic features in pigmented skin lesions and visualization of pigmented cutaneous lesions in vivo [161,162]. Commercially available dermoscopy tools, such as DermLite [163], aim to improve the ease and accuracy of visual evaluations by providing magnification, LED lighting and polarizing filters to enhance the field of view and reduce glare and shadows. However, reliable objective outcome measures, allowing comparison across studies and accurate assessment of changes over time, are currently lacking [164-166].
Many tissue lesions can be identified based on measurable features extracted from the lesion, making accurate quantification of lesion features essential in clinical practice.
Current outcome measures include the Physician's Global Assessment (PGA), which grades patient improvement based on broad categories of percentage repigmentation over time (0-25%, 25-50%, 50-75% and 75-100%), and the Vitiligo Area and Severity Index (VASI) [167], which measures percentage repigmentation graded over the area of involvement summed over the body sites involved. Figure 4-1, reproduced with permission from [167], shows an example of VASI assessment.
Figure 4-1: Standardized assessments for estimating the degree of pigmentation to derive the
Vitiligo Area Scoring Index. At 100% depigmentation, no pigment is present; at 90%, specks
of pigment are present; at 75%, the depigmented area exceeds the pigmented area; at 50%, the
depigmented and pigmented areas are equal; at 25%, the pigmented area exceeds the depigmented area; and at 10%, only specks of depigmentation are present. (Figure reproduced with
permission from [167])
These outcome measures rely on subjective clinical assessment through visual observation, which cannot exclude inter-observer bias and has limited accuracy, reproducibility and quantifiability. Two recent studies [165,166] conclude that the current outcome measures have poor methodological quality and unclear clinical relevance, and that consensus among clinicians, researchers and patients is lacking. Recent studies have begun using image analysis to evaluate treatment efficacy, but these trials rely on investigator-defined boundaries of skin lesions, which can be biased, and the programs require user involvement to analyze each image separately, which can be time-consuming [168,169]. An objective measurement tool that accurately quantifies repigmentation could overcome these limitations and serve as a diagnostic tool for dermatologists. Image processing techniques can be applied to identify skin lesions and extract their features, allowing much more accurate determination of disease progression. The ability to quantify change over time more objectively will significantly improve the physician's ability to perform clinical trials and determine the efficacy of therapies.
4.1.2 Quantitative Dermatology
Algorithms for quantitative dermatology are being developed. A framework to detect
and label moles on skin images is proposed in [170]. The method searches the image for
skin regions using a non-parametric skin detection scheme and uses difference of Gaussian
filters to find possible mole candidates. A trained Support Vector Machine (SVM) is
used to classify the candidates as moles. An approach for registering micro-level features
in high-resolution face images is proposed in [171]. The approach registers features in
images captured with different light polarizations by approximating the face surface as
a collection of quasi-planar skin patches and estimates spatially varying homographies
using feature matching and quasiconvex optimization. A supervised learning technique
to automatically detect acne-like lesions and enable computer assisted counting of acne
lesions in skin images is proposed in [172], which models skin regions by a six dimensional
vector using temporal and spatial features, and detects the separating boundary between
the patch images. Quantitative assessment of wound healing through dimensional measurements and tissue classification is proposed in [173]. The approach computes a 3D
model from multiple views of the wound. Tissue classification is performed from color
and texture region descriptors computed after unsupervised segmentation. Principal component analysis followed by image segmentation is used in [174] to analyze and determine
areas of skin that have undergone repigmentation during the treatment of Vitiligo. This
approach converts an RGB image into images that represent skin areas due to melanin and haemoglobin, and determines the change in area of such regions over time. All the
images taken over time are assumed to be accurately registered with respect to each other
and have uniform color profiles. A technique for melanocytic lesion segmentation based on
image thresholding is proposed in [175]. Thresholding schemes work well when the lesion
and background skin have distinct intensity and color profiles. However, their accuracy
is limited when the image has intensity and/or color inhomogeneities.
Table 4.1 summarizes the current approaches for clinical assessment and recent work in
quantitative dermatology.
A review of the automated analysis techniques for pigmented skin lesions [176], applied to
dermoscopic and clinical images, finds that even though several approaches for analyzing
individual lesions have been proposed, there is a scarcity of approaches on the automation
of lesion change detection. The study concludes that computer-aided diagnosis systems
based on individual pigmented skin lesion image analysis cannot yet be used to provide
the best diagnostic results.
In this work, we develop a system for skin lesion detection and progression analysis and apply it to clinical images for Vitiligo, obtained from ten different subjects during treatment. Institutional Review Board approval was obtained for data analysis (MIT Protocol Number: 1301005500) as well as for the clinical pilot study in collaboration with the Brigham and Women's Hospital (BWH Protocol Number: 2012-P-002185/1). Image segmentation is used to accurately determine the lesion contours in an image, and a registration scheme using feature matching is implemented to align a sequence of images for a lesion.
Table 4.1: Summary of clinical assessment and quantitative dermatology approaches

Reference   Description
[160,162]   Dermoscopy - Microscopic examination of diagnostic features in pigmented skin lesions.
[163]       DermLite - Commercial dermoscope providing magnification, LED lighting and polarizing filters.
[165]       PGA - Patient improvement based on broad categories of percentage repigmentation over time (0-25%, 25-50%, 50-75% and 75-100%).
[167]       VASI - Percentage repigmentation, based on visual observation, graded over the area of involvement summed over body sites involved.
[170]       A framework, based on difference of Gaussian filters and a trained SVM, to detect and label moles in skin images.
[171]       Registration of micro-level features, using feature matching and quasiconvex optimization, in high-resolution face images captured with different light polarizations.
[172]       A supervised learning technique, using temporal and spatial features, to automatically detect and count acne-like lesions in images.
[173]       Quantitative assessment of wound healing by computing a 3D model from multiple views of the wound, and tissue classification based on color and texture region descriptors.
[174]       Principal component analysis and image segmentation of images captured with standardized lighting and alignment to determine repigmented skin areas during treatment.
[175]       Melanocytic lesion segmentation based on image thresholding.
A progress metric called fill factor, which accurately quantifies repigmentation of skin lesions, is proposed.
4.2 Skin Condition Progression: Quantitative Analysis
The focus of this work is on developing a system for lesion detection and progression analysis of skin conditions, based not only on standardized clinical imaging but also on images captured by patients at home, using smartphone or digital cameras, without any standardization. The main contributions of this work are to leverage algorithmic techniques from different areas of computer vision, such as color correction, image segmentation and feature matching; to optimize and modify them to enhance accuracy for the skin imaging application and reduce computational and memory complexity for efficient and fast implementation; and to develop an easy-to-use automated mobile system that could be used by patients as well as doctors for frequent monitoring of skin conditions.
The overall processing flow, from the non-standardized image sequence to quantification
of progression, is summarized in Figure 4-2.
The progress of a skin lesion is recorded by capturing images of the lesion at regular intervals of time. This is done for all lesions located on different body areas. Color correction is performed by adjusting the R, G, B histograms to neutralize the effects of varying lighting and enhance contrast. A Level Set Method (LSM) based image segmentation approach is implemented to identify the lesion contours. In the vicinity of the lesion contours, Scale Invariant Feature Transform (SIFT) based feature detection is performed to identify key features of the lesion. The first set of images of the skin lesions is manually tagged based on the lesions' locations on the body. For all future images, the tagging is performed automatically by comparing the features from the new image with the previous set of images for all skin lesions. Once the new image is tagged to a specific lesion, it is registered with the first image in the sequence for that lesion, using pre-computed SIFT features. The warped lesion contours are computed after alignment, and their area is compared to the area of the first lesion in the sequence to determine the fill factor, which indicates the change in area and quantifies the progress over time.
[Flowchart: Lesion image sequence → Color Correction (histogram adjustment) → Contour Detection (Level Set Method segmentation) → Feature Detection (SIFT in the vicinity of the contour) → Manual Tagging of the first image in the sequence (based on lesion location) or Auto-Tagging (feature comparison with previous images of all lesions) → store contour and SIFT features → Image Alignment (homography warping using feature matching) → Fill Factor (warped lesion area comparison).]
Figure 4-2: Processing flow for skin lesion progression analysis.
4.2.1 Color Correction
Accurate color information of skin lesions is significant for dermatology diagnosis and
treatment [177,178]. However, different lighting conditions and non-uniform illumination
during image capture often lead to images with varying color profiles. Having a consistent
color profile in the images captured over time is important for both visual comparison as
well as to accurately determine the progression over time. Some approaches for color normalization in dermatological applications have proposed normalizing color profiles of the
instruments to match the images captured with different devices, through users characterizing and calibrating the color response [179]. An approach to build color normalization
filters by analyzing features in a large data set of images for a skin condition is proposed
in [180], which extracts image features from the inside, outside, and peripheral regions of
the tumor and builds multiple regression models with statistical feature selection.
We developed a color correction scheme that automatically corrects for color variations and enhances image contrast using color histograms. Histogram equalization is typically used to enhance contrast in intensity images. However, performing histogram equalization on the R, G and B color channels independently brings the color peaks into alignment and results in an image that closely resembles one captured in a neutral lighting environment. For an image I, the color histogram for channel c (R, G or B) is modified by adjusting the pixel color values I_c(x, y) to span the entire dynamic range D, as given by eq. (4.1):
I_c^M(x, y) = [(I_c(x, y) - I_c^l) / (I_c^u - I_c^l)] × D    (4.1)

where I_c^u and I_c^l represent the upper and lower limits of the histogram for channel c.
The approach can be summarized as follows; a code sketch follows the list.

1. Compute histograms for the R, G and B color channels.

2. Determine the upper and lower limits of the R, G and B histograms as the +2σ limit (I_c^u > intensity of 97.8% of pixels) and the -2σ limit (I_c^l < intensity of 97.8% of pixels). This avoids histogram skewing due to long tails and results in better peak alignment.

3. Expand the R, G, B histograms to occupy the entire dynamic range (D) of 0 to 255 using eq. (4.1).
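As a concrete illustration, the following Python sketch applies the per-channel stretching of eq. (4.1). It is our own minimal rendering, not the thesis implementation (which was written in MATLAB); the percentile values approximate the ±2σ limits described above.

```python
import numpy as np

def color_correct(image, low_pct=2.3, high_pct=97.7, dynamic_range=255):
    """Per-channel histogram stretching following eq. (4.1).

    Each channel's upper/lower limits are taken as percentiles that
    approximate the +/-2 sigma points, so long histogram tails do not
    skew the stretch (illustrative values, not from the thesis).
    """
    out = np.empty_like(image, dtype=np.uint8)
    for c in range(image.shape[2]):            # R, G, B channels
        chan = image[:, :, c].astype(np.float32)
        lo = np.percentile(chan, low_pct)      # I_c^l
        hi = np.percentile(chan, high_pct)     # I_c^u
        stretched = (chan - lo) / max(hi - lo, 1e-6) * dynamic_range
        out[:, :, c] = np.clip(stretched, 0, dynamic_range).astype(np.uint8)
    return out
```

Clipping the limits at the ±2σ points, rather than the absolute minimum and maximum, prevents a few saturated pixels from compressing the useful part of the histogram.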
Figure 4-3 shows the performance of this approach for images of two different skin lesions.
The approach achieves performance comparable to white-balance calibration with a color
chart, while also enhancing the contrast to make the lesion more prominent.
[Histogram panels (pixel count vs. intensity, 0-250) for: (a) Captured with Color Tinted Lighting; (b) Captured with Neutral Lighting; (c) After Color Correction and Contrast Enhancement.]
Figure 4-3: Color correction by histogram matching. Images captured with normal room lighting
(a) and with color chart white-balance calibration (b). Images after color correction and contrast
enhancement (c) of images in (a).
4.2.2 Contour Detection
Accurately determining the contours of skin lesions is critical to diagnosis and treatment
as the contour shape is often an important feature in determining the skin condition.
It is also important for determining the response to treatment and the progress over
time. Due to non-uniform illumination, skin curvature and camera perspective, the images tend to have intensity and color variations within lesions. This makes it difficult
for segmentation algorithms that rely on intensity/color uniformity to accurately identify
the lesion contours. Segmentation approaches for images with intensity bias have been
proposed [181-184]. A level set approach is proposed in [184] that models the distribution of intensity belonging to each tissue as a Gaussian distribution with spatially varying
mean and variance and creates a level set formulation by defining a maximum likelihood
objective function. An LSM based approach called the distance regularized level set evolution was proposed in [185] and extended in [186] to a region-based image segmentation
scheme that can take into account intensity inhomogeneities. Based on a model of images
with intensity inhomogeneities, the approach in [186] derives a local intensity clustering
property of the image intensities, and defines a local clustering criterion function for the
image intensities in a neighborhood of each point.
We leverage the level set method for image segmentation [186], which provides good accuracy in lesion segmentation with intensity/color inhomogeneities. However, this approach has very high computational complexity and memory requirements, as described below. We develop an efficient and accurate narrowband implementation that significantly reduces the computational complexity and memory requirement. A distance regularized level set function, similar to that proposed in [185], is used to update the level set values during iterations. Our implementation only performs updates to the level set function, and the related variables (energy function, bias field, etc., defined below), for the small subset of pixels that fall within a narrow band around the current segmentation contour in an iteration. This limits the computations and memory accesses to this small subset
of pixels, instead of the entire image. The following section describes the approach in
further detail.
Level Set Method for Segmentation
The original image I with a non-uniform intensity profile is modeled as a combination of the homogeneous image J and a bias field b that captures all the intensity inhomogeneities in I, as given by eq. (4.2):

I = bJ + n    (4.2)

where n is additive zero-mean Gaussian noise.
A Level Set Function (LSF) φ(x) is defined for every pixel x in the image. The image is segmented into two regions Ω_1 and Ω_2 based on the values of the level set function in these regions, such that:

Ω_1 = {x : φ(x) > 0},  Ω_2 = {x : φ(x) < 0}    (4.3)

The segmentation contours are represented by the 'zero level set': {x : φ(x) = 0}. The level set function is initialized over the entire image and iteratively evolved to achieve the final segmentation.
The unknown homogeneous image J is modeled by two constants c_1 and c_2 in regions Ω_1 and Ω_2 respectively. An energy function F(φ, {c_1, c_2}, b) is defined over Ω_1, Ω_2, c_1, c_2 and b. The optimal regions Ω_1 and Ω_2 are obtained by minimizing the energy F in a variational framework. The energy minimization is performed in an iterative manner with respect to one variable at a time, while the other variables are set to their values from the previous iteration. The iterative process is implemented numerically using a finite difference scheme [185].

This process iteratively converges to the homogeneous image J and the corresponding level set function φ(x), as shown by the sketch in Figure 4-4.
Figure 4-4: Level set segmentation. (a) Original image with intensity inhomogeneity and initialization of the level set function. (b) Homogeneous image obtained at the end of iterations
and the corresponding level set function.
The iterative process achieves accurate segmentation despite intensity inhomogeneities. However, it requires storage and update of the level set function φ(x), the bias field b, the homogeneous image model {c_1, c_2} and the corresponding energy function F(φ, {c_1, c_2}, b) for every pixel in each iteration. Bit widths for representing this data are given in Table 4.2, resulting in a total requirement of 42 bits/pixel for the level set approach.
Table 4.2: Bit width representations of LSM variables.

Variable                Bit Width
I(x)                    8 bits/pixel
J(x)                    8 bits/pixel
b(x)                    8 bits/pixel
φ(x)                    2 bits/pixel
F(φ, {c_1, c_2}, b)     16 bits/pixel
Processing a 2 megapixel (1920 × 1080) image requires storing 11 MB of data and updating it in each iteration. On-chip SRAM in a processor is typically not suited to such a large memory requirement, necessitating an external DRAM for storing these variables. To process a 2 megapixel image with 50 LSM iterations in one second requires a memory bandwidth of:
BW_LSM = BW_I^{Read} + BW_J^{Read/Write} + BW_φ^{Read/Write} + BW_b^{Read/Write} + BW_F^{Read/Write}
       = 1920 × 1080 × (8 + 2×8 + 2×2 + 2×8 + 2×16) bits × (50 iterations)    (4.4)
       = 985 MB/s
To enable energy efficient implementations and real-time processing, we need to optimize
the algorithm and reduce the computational and memory requirements.
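The arithmetic is easy to check; the snippet below (our own back-of-the-envelope verification, with the per-variable read/write split inferred from Table 4.2 and eq. (4.4)) reproduces the storage and bandwidth figures:

```python
# Full-image LSM traffic: I read once; J, phi, b and F read and written
# each iteration (bit widths per Table 4.2).
pixels = 1920 * 1080
bits_per_iter = 8 + 2 * 8 + 2 * 2 + 2 * 8 + 2 * 16   # 76 bits/pixel
iterations = 50

storage_mb = pixels * 42 / 8 / 1e6                    # 42 bits/pixel of state
bandwidth_mbps = pixels * bits_per_iter / 8 * iterations / 1e6

print(f"storage: {storage_mb:.1f} MB, bandwidth: {bandwidth_mbps:.0f} MB/s")
# -> storage: 10.9 MB, bandwidth: 985 MB/s
```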
Narrowband LSM
We develop a narrowband implementation of the approach, where instead of storing and
updating the LSM variables for all the pixels in the image in each iteration, we only need
to track a small subset of pixels that fall within a narrow band defined around the zero
level set, as depicted in Figure 4-5.
Figure 4-5: Narrowband implementation of level set segmentation. LSM variables are tracked only for pixels that fall within a narrow band defined around the zero level set in the current iteration.
The narrowband implementation is achieved by limiting the computations to a narrow band around the zero level set [185]. The LSF at a pixel x = (i, j) in the image is denoted by φ_{i,j}, and a set of zero-crossing pixels is determined as the pixels (i, j) such that either φ_{i+1,j} and φ_{i-1,j} or φ_{i,j+1} and φ_{i,j-1} have opposite signs. If the set of zero-crossing pixels is denoted by Z, the narrowband B is constructed as given by eq. (4.5):

B = ∪_{(i,j)∈Z} N_{i,j}    (4.5)

where N_{i,j} is a 5 × 5 pixel window centered around pixel (i, j). The 5 × 5 window is experimentally determined to provide a good trade-off between computational complexity and quality of the results.
The LSF based segmentation using the narrowband can be summarized as follows; a code sketch of the narrowband update follows the list.

1. Initialize the LSF to φ^0_{i,j}, where φ^k denotes the LSF value during iteration k. Construct the narrowband B^0 using eq. (4.5).

2. Update the LSF on the narrowband using a finite difference scheme [185] as φ^{k+1}_{i,j} = φ^k_{i,j} + Δt · L(φ^k_{i,j}), where Δt is the time step of the iteration and L(φ^k_{i,j}) approximates ∂φ/∂t.

3. Determine the set of zero-crossing pixels of φ^{k+1} and update the narrowband B^{k+1} using eq. (4.5).

4. For pixels (i, j) that are part of the updated narrowband B^{k+1} but were not part of the narrowband B^k, set φ^{k+1}_{i,j} = 3 if φ^{k+1}_{i,j} > 0 and φ^{k+1}_{i,j} = -3 otherwise.

5. Continue iterations till the narrowband stops changing (B^{k+1} = B^k = B^{k-1}) or the limit on maximum iterations is reached. The set of zero-crossing points at the end of the iterations represents the segmentation contour.
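A minimal NumPy sketch of this procedure follows. It is our own illustration, with the update term L(φ) of [185] left as a caller-supplied function; the window size and re-initialization values follow the steps above.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def zero_crossings(phi):
    """Pixels where phi changes sign between horizontal or vertical neighbors."""
    z = np.zeros(phi.shape, dtype=bool)
    z[1:-1, :] |= (phi[2:, :] * phi[:-2, :]) < 0
    z[:, 1:-1] |= (phi[:, 2:] * phi[:, :-2]) < 0
    return z

def narrowband(phi, half_width=2):
    """Union of 5x5 windows centered on zero-crossing pixels (eq. 4.5)."""
    win = np.ones((2 * half_width + 1,) * 2, dtype=bool)
    return binary_dilation(zero_crossings(phi), structure=win)

def evolve(phi, L, dt=0.1, max_iters=50):
    """Narrowband LSF evolution; L(phi) is the finite-difference
    update term of [185], supplied by the caller."""
    bands = [narrowband(phi)]
    for _ in range(max_iters):
        band = bands[-1]
        phi = np.where(band, phi + dt * L(phi), phi)  # update inside band only
        new_band = narrowband(phi)
        entered = new_band & ~band                    # pixels entering the band
        phi[entered] = np.where(phi[entered] > 0, 3.0, -3.0)
        bands.append(new_band)
        # Stop once the band is unchanged for two iterations (B^{k+1}=B^k=B^{k-1})
        if len(bands) >= 3 and np.array_equal(bands[-1], bands[-2]) \
                and np.array_equal(bands[-2], bands[-3]):
            break
    return phi, zero_crossings(phi)
```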
The narrowband approach significantly reduces the computational costs as well as the memory requirements of LSM segmentation. Figure 4-6 shows the number of pixels processed for five 2 megapixel images of skin lesions over 50 LSM iterations using the narrowband implementation. On average, 400,000 pixels are processed per iteration. Compared to the 2 million pixels processed per iteration using the original LSM, this represents an 80% reduction in the processing cost and reduces the average memory bandwidth to 197 MB/s.
Figure 4-6: Number of pixels processed using the narrowband implementation over 50 LSM
iterations.
Two-Step Segmentation
This narrowband implementation, however, has one important limitation. We perform updates on the LSM variables only for the pixels in a small neighborhood of the segmentation contour in the current iteration. If the LSF isn't properly initialized, it is possible for the energy function to get trapped in a local minimum, resulting in inaccurate segmentation. This can be avoided by starting with a good initialization. We achieve this by using a 2-step approach:
* Step 1: A simple segmentation technique such as thresholding or K-means is used. This step is computationally very efficient and generates segmentation contours that are not completely accurate but serve as a good starting point for our narrowband LSM implementation.

* Step 2: Contours generated in Step 1 are used to initialize the LSF. Narrowband LSM then iteratively refines these contours to achieve the final segmentation.
Figure 4-7 shows the segmentation achieved by K-means in Step 1 for a skin lesion. Using
these contours to initialize the LSM iterations, Figure 4-8 shows the evolution of contours
during LSM iterations in Step 2.
[Panels: Original Image; K-means segmentation; Initial contours]
Figure 4-7: Lesion segmentation using K-means.
[Panels: Initial contours; 10, 20, 30 and 40 iterations; 50 iterations: final contours]
Figure 4-8: Contour evolution for lesion segmentation using narrowband LSM.
4.2.3 Progression Analysis
The ability to accurately determine the progression of a skin condition over time is an important aspect of diagnosis and treatment. In this work, we capture images of the same skin lesions using a handheld digital camera over an extended period of time during treatment and analyze them to determine the progress. However, the lesion contours determined in individual images cannot be directly compared, as the images typically have scaling, orientation and perspective mismatch.
We propose an image registration scheme based on SIFT feature matching [34] for progression analysis. The skin surface typically does not have significant features that could be
detected and matched across images by SIFT. However, the lesion boundary creates distinct features due to transition in color and intensity from the regular skin to the lesion.
To further highlight these features, we superimpose the identified contour onto the original image before feature detection. The lesion contours change over time as the treatment progresses; however, this change is typically slow and non-uniform. Repigmentation often occurs within the lesion, and some parts of the contour shrink while others remain the same. Performing SIFT results in several matching features corresponding to the areas of the lesion that haven't significantly changed. Matching SIFT features over large images can be computationally expensive. Also, on relatively featureless skin surfaces, most
useful SIFT features are concentrated around the lesion contour, where there is change in
intensity and color. To take advantage of this, we restrict feature matching using SIFT to
a narrow band of pixels in the neighborhood of the contour, defined in the same manner
as the narrow band in Section 4.2.2 by eq. (4.5). Figure 4-9 shows a pair of images of the
same lesion with some of the matching SIFT features identified on them.
Figure 4-9: SIFT feature matching performed on the highlighted narrow band of pixels in the
vicinity of the contour.
This significantly speeds up the processing by reducing the number of computations and
memory requirement, while providing significant features near the contour that can be
matched across images. For a 2 megapixel image, instead of performing SIFT feature
detection over 2 million pixels, this approach requires processing only 250,000 pixels on
average - a reduction of 88%. This also reduces the memory requirement from 2 MB to
about 250 kB which could be efficiently implemented as on-chip SRAM instead of external
DRAM.
SIFT is performed only once on any given image, the first time it is analyzed. The SIFT
features for the image are stored in the database and used for subsequent analyses. Once
the SIFT features are determined in all the images in a sequence, we identify matching
features across images using Random Sample Consensus (RANSAC) [187] and compute
homography transforms that map every image in the sequence to the first image. The
homographies are used to warp images in the sequence to align with the first image.
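The following OpenCV-based Python sketch illustrates one way to realize this masked SIFT detection, homography estimation and warping. It is our own rendering, not the thesis implementation; the band half-width, the Lowe ratio of 0.75 and the RANSAC reprojection threshold are illustrative assumptions, and `contour` is assumed to be an OpenCV-style point array.

```python
import cv2
import numpy as np

def contour_band_mask(shape, contour, half_width=10):
    """Mask restricting SIFT to a narrow band around the lesion contour."""
    mask = np.zeros(shape[:2], dtype=np.uint8)
    cv2.drawContours(mask, [contour], -1, 255, thickness=2 * half_width)
    return mask

def register_to_reference(ref_img, ref_contour, img, contour):
    """Warp `img` onto `ref_img` using SIFT features near the contours."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(ref_img,
                                      contour_band_mask(ref_img.shape, ref_contour))
    kp2, des2 = sift.detectAndCompute(img,
                                      contour_band_mask(img.shape, contour))
    # Two-nearest-neighbor matching with a ratio test
    matches = cv2.BFMatcher().knnMatch(des2, des1, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    src = np.float32([kp2[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp1[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    # RANSAC-estimated homography mapping the new image onto the reference
    H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = ref_img.shape[:2]
    return cv2.warpPerspective(img, H, (w, h)), H, inliers
```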
Lesion contours in the warped images can be used to compare the lesions and determine the progression over time. The lesion area, confined by the warped contours, is determined for each image in the sequence. We define a quantitative metric called fill factor (F_t) at time t as the change in area of the lesion with respect to the reference (the first image, captured before the beginning of the treatment), given by eq. (4.6):
F_t = 1 − A_t / A_0    (4.6)

where A_t is the lesion area at time t and A_0 is the lesion area in the reference image.
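In code, the metric reduces to a pixel count over binary lesion masks; a minimal sketch (ours, assuming the masks come from the segmentation and warping steps above):

```python
import numpy as np

def fill_factor(ref_mask, warped_mask):
    """Eq. (4.6): F_t = 1 - A_t / A_0, from binary lesion masks.

    ref_mask:    lesion region in the reference image (t = 0)
    warped_mask: lesion region after warping the image at time t
    """
    a0 = np.count_nonzero(ref_mask)      # A_0
    at = np.count_nonzero(warped_mask)   # A_t
    return 1.0 - at / a0
```

For example, a lesion that shrinks from 10,000 to 7,000 pixels after warping gives F_t = 1 − 7000/10000 = 0.3, i.e., 30% repigmentation.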
A limitation of the narrowband approach for feature matching is that it can be difficult to determine a significant number of matching features if the lesion contours in two subsequent images have changed dramatically. If the images are only collected during clinical visits that are usually more than a month apart, it is possible to have significant changes in the lesion contours, depending on the skin condition. The images collected for this work, as part of the pilot study for Vitiligo, were usually a month apart, and the approach worked well for these images. A goal of this work is to facilitate frequent image collection by enabling patients to capture images at home, achieving accurate feature matching between subsequent images as well as frequent feedback for doctors and patients.
4.2.4 Auto-tagging
Many skin conditions typically result in lesions in multiple body areas. For a patient or a doctor to be able to keep track of the various lesions, it is important to be able to classify the lesions based on the body areas and maintain individual databases for the sequence of images from each lesion. In this work, we implement a scheme where the subject needs to manually identify the lesions only once, during initial setup, and all future instances of the same lesion are automatically classified and entered into the database for analysis. The auto-tagging approach was developed in collaboration with undergraduate researchers Michelle Chen and Qui Nguyen.
The well-studied problem of image classification is similar and several image classification
techniques exist that may adapt well to this application. In an image classification problem, there are several predefined classes of images into which unknown images must be
classified. In this case, the different affected areas on the body can be thought of as the
classes, and we would like to classify new photographs taken by the patient. A large body
of research exists on image classification. Most approaches generally involve determining
distinctive features of each class and then comparing the features of unknown images to
these known patterns to determine their likely classifications. SIFT features are a very
popular option for general image classification, because they are resistant to changes in
scale and transformations [34].
The SIFT descriptors have high discriminative power,
while at the same time being robust to local variations [188]. SIFT has been shown to
significantly outperform descriptors such as pixel intensities [189,190], edge points [191]
and steerable pyramids [192]. Features such as the Harris-Affine detector [193] and geometric blur descriptors [194] have also emerged as alternatives to SIFT for image matching
and classification.
Furthermore, given the nature of skin images, where a lightly colored lesion is surrounded
by darker skin with very few other features present, the main distinguishing feature of each
image is simply the shape of the lesion. This enables us to use descriptors designed for
shape recognition, such as shape contexts [195]. The accuracy of shape context matching is strongly correlated with the accuracy of the segmentation used to determine the lesion contour. The accuracy of segmentation increases for darker skin types, especially in the presence of intensity inhomogeneities.
In this work, we implemented and analyzed both SIFT based and shape context based
classification. One important difference between classic image classification algorithms
and the approach that we used in this work is how the definitive features of each class are
determined. In classic image classification, there are a large number of training examples
that can be used to determine the distinctive pattern of features for each class, and
machine learning techniques are often used to do this. In our case, however, there are
only a few examples per class, and because the lesions change over time, older examples
are less relevant. As a result, we do not use machine learning to combine the examples.
Instead, we use the features of the most recent photograph in each class to represent that
class.
At the beginning of the treatment, all skin lesions are photographed and manually tagged based on the body areas. An image of lesion i captured at time t is denoted by L_i^t. The images (L_i^0) are processed to perform color correction and contour detection, as described in Sections 4.2.1 and 4.2.2.
SIFT-based classification
SIFT features are computed for each image and stored along with the image as S_i^0. When a new image (L_j^1) is captured at time t = 1, the same processing is performed to determine the contour and SIFT features S_j^1. SIFT features for the new image (S_j^1) are compared with those determined earlier (S_i^0) to find matches using a two-nearest-neighbor approach. The largest set of inliers (I_{i,j}) with N_{i,j} elements and the total symmetric transfer error (e_{i,j}) (normalized over the range [0, 1]) for every combination {S_i^0, S_j^1} are determined using RANSAC. The image (L_j^1) is then classified as belonging to lesion i if that i maximizes the matching criterion M_{i,j}, defined by eq. (4.7):

M_{i,j} = N_{i,j} (1 + λ(1 − e_{i,j}))    (4.7)

where λ is a constant, set to 0.2 in this work. The homography H_i^{0,1} corresponding to the best match is stored for later use in progression analysis.
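A compact sketch of this decision rule (our own illustration; the inlier counts and transfer errors are assumed to come from the RANSAC step described above):

```python
def matching_criterion(n_inliers, transfer_error, lam=0.2):
    """Eq. (4.7): M_ij = N_ij * (1 + lambda * (1 - e_ij)).

    n_inliers:      size of the largest RANSAC inlier set, N_ij
    transfer_error: total symmetric transfer error e_ij, normalized to [0, 1]
    """
    return n_inliers * (1 + lam * (1 - transfer_error))

def classify(candidates):
    """Tag the new image to the lesion i maximizing M_ij.

    candidates: dict mapping lesion id -> (N_ij, e_ij) from RANSAC matching.
    """
    return max(candidates, key=lambda i: matching_criterion(*candidates[i]))
```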
Shape context based classification
Shape context descriptors [195], which for a given point on the contour encode the distribution of the relative locations of all the other points, are computed for each image and stored along with the image as SC_i^0. When a new image (L_j^1) is captured at time t = 1, the same processing is performed to determine the contour and shape context descriptors SC_j^1. Shape context descriptors for the new image (SC_j^1) are compared with those determined earlier (SC_i^0) to find the minimum cost matching between these points, using the difference between the shape context descriptors as the cost of matching two points. Finally, a thin plate spline transformation [196,197] between the two contours is computed using the minimum cost matching. The overall difference between two lesion images is then represented as a combination of the cost of the matching (SC_{cost}) and the size of the transformation (T_{i,j}). The image (L_j^1) is then classified as belonging to lesion i if that i maximizes the matching criterion M_{i,j}, defined by eq. (4.8):

M_{i,j} = −(SC_{cost} + k × T_{i,j})    (4.8)

where k is a constant that represents how much the size of the transformation is weighted relative to the cost of the matching. In this work, k is set to 10.
The same process is applied for tagging any future image L_j^t by comparing it against the previously captured set of images L_i^{t−1}.
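For concreteness, a minimal NumPy sketch of the log-polar shape context histograms follows. It is our own simplified rendering of the descriptor of [195]; the bin counts and radius limits are illustrative, and the minimum cost matching and thin plate spline steps that feed eq. (4.8) are not included.

```python
import numpy as np

def shape_context(points, n_r=5, n_theta=12):
    """Log-polar histograms of relative point positions, after [195].

    points: (N, 2) array of contour point coordinates.
    Returns an (N, n_r * n_theta) array, one descriptor per point.
    """
    diff = points[None, :, :] - points[:, None, :]      # pairwise offsets
    dist = np.linalg.norm(diff, axis=2)
    angle = np.arctan2(diff[..., 1], diff[..., 0])
    mean_d = dist[dist > 0].mean()                      # scale normalization
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1) * mean_d
    r_bin = np.digitize(dist, r_edges) - 1              # radial bin per pair
    t_bin = ((angle + np.pi) / (2 * np.pi) * n_theta).astype(int) % n_theta
    desc = np.zeros((len(points), n_r * n_theta))
    for i in range(len(points)):
        valid = (np.arange(len(points)) != i) & (r_bin[i] >= 0) & (r_bin[i] < n_r)
        np.add.at(desc[i], r_bin[i][valid] * n_theta + t_bin[i][valid], 1)
    return desc / np.maximum(desc.sum(axis=1, keepdims=True), 1)
```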
4.2.5 Skin Condition Progression: Summary
The overall processing, involving image tagging, lesion contour detection and progression analysis, can be summarized as follows.

* Initial Setup

1. Manually tag images (L_i^0) based on the location i of the lesion.

2. Perform color correction and segmentation to determine lesion contours (C_i^0).

3. Compute SIFT features (S_i^0) in the vicinity of the lesion contour (C_i^0). Store C_i^0 and S_i^0 for future analysis.

4. Shape context based tagging: Compute shape context features (SC_i^0) on the lesion contour (C_i^0). Store SC_i^0 for future analysis.

* Subsequent Analysis

1. For an image L_j^t captured at time t, perform color correction and contour detection (C_j^t).

2. Compute SIFT features (S_j^t) in the vicinity of the lesion contour (C_j^t).

3. Perform feature matching for every combination {S_i^{t−1}, S_j^t} and tag L_j^t to lesion i using eq. (4.7). Store the best match homography H_i^{0,t} for further analysis.

4. Shape context based tagging: Perform shape context matching for every combination {SC_i^{t−1}, SC_j^t} and tag L_j^t to lesion i using eq. (4.8).

5. Using the pre-computed contours (C_i^t) and homographies (H_i^{0,t}), register a sequence of n images of the same lesion captured over time to the first image (L_i^0).

6. Compare the areas of the warped lesion contours to determine the progression over time and compute the fill factor (F_t) using eq. (4.6).
4.3 Experimental Results

4.3.1 Clinical Validation
Institutional Review Board approval was obtained for data analysis (MIT Protocol Number: 1301005500) as well as the clinical pilot study in collaboration with the Brigham
and Women's Hospital (BWH Protocol Number: 2012-P-002185/1). Ten subjects aged 18 years and older with a dermatologist diagnosis of Vitiligo were recruited by Dr. Vaneeta Sheth. Subjects had a variety of skin phototypes and disease characteristics.
As this
was a pilot study, no standardized intervention was performed. Rather, subjects were
treated with standard therapies used for Vitiligo based on clinical characteristics and patient preference. Further subject specific details, along with various treatment modalities
are outlined in Appendix B. Photographs of skin lesions were taken at the beginning of
treatment and during at least two subsequent clinical follow-up visits, using normal room
lighting and a handheld digital camera.
4.3.2 Progression Quantification
The approach to analyze the individual images and determine the progress over time is
implemented using MATLAB.
For a sequence of images of a skin lesion captured over time, we process each image to
perform color correction and contrast enhancement. Figure 4-10 shows a sequence of
images with their R, G, B histograms and the outputs after color correction.
The color corrected images are then processed to perform lesion contour detection. Figure 4-11 shows a sequence of images with the detected contours overlaid. LSM based image segmentation accurately detects the lesion boundaries despite intensity/color inhomogeneities in the image.
Feature matching is performed across images to correct for scaling, orientation and perspective mismatch.
Figure 4-10: Color correction for a sequence of images by R, G, B histogram modification. (a) Original image sequence, (b) Color corrected image sequence. The lesion color changes due to phototherapy.
Figure 4-11: Image segmentation using LSM for lesion contour detection despite intensity/color
inhomogeneities in the image.
A homography transform, computed based on the matching features,
is used to warp all the images in a sequence with respect to the first image, which is
used as a reference. Figure 4-12 shows a sequence of warped images. The warped lesions
are compared with respect to the reference lesion at the beginning of the treatment to
determine the progress over time in terms of the fill factor.
[Nov'12: Fill Factor = 0; Mar'13: Fill Factor = 27%; Jul'13: Fill Factor = 51%; Sep'13: Fill Factor = 57%]
Figure 4-12: Image registration based on matching features with respect to the reference image
at the beginning of the treatment.
A sequence of captured and processed images of a different skin lesion from another subject is shown in Figure 4-13, and the fill factor is computed by comparing the warped lesions.
[(a) Images captured with normal room lighting. (b) Processed outputs after contour detection and alignment: Nov'12: Fill Factor = 0; Dec'12: 6%; Jan'13: 16%; Feb'13: 22%]
Figure 4-13: Sequence of images during treatment. (a) Images captured with normal room
lighting. (b) Processed image sequence.
The approach for image registration is independently validated by analyzing images of the same skin lesion captured from different camera angles. Contour detection is performed on the individual images, which are then aligned by feature matching. Figure 4-14 shows one such comparison. The aligned lesions are compared in terms of their area as well as the number of pixels that overlap.
Figure 4-14: Image registration through feature matching. (a) Images of a lesion from different
camera angles, (b) Images after contour detection and alignment. Area matches to 98% accuracy
and pixel overlap to 97% accuracy.
Analysis of 100 images from 25 lesions, with four real and artificial camera angles each, shows a 96% accuracy in area and 95% accuracy in pixel overlap.
To validate the progression analysis, we take one image each from 50 different lesions and
artificially generate a sequence of 4 images for each lesion with known change in area. We
then apply rotation, scaling and perspective mismatch to the new images. This artificial
sequence is used as input to our system, which determines the lesion contours, aligns the
sequence and computes the fill factor. We compare the fill factor with the known change
in area from the artificial sequence and also compute pixel overlap between the lesions
identified on the original sequence (before adding mismatch) and those on the processed
sequence. Figure 4-15 shows one such comparison. Analysis of 200 images from 50 such
sequences shows a 95% accuracy in fill factor computation and pixel overlap.
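A sketch of how such a mismatched sequence can be synthesized (our own illustration using OpenCV; the perturbation magnitudes are arbitrary):

```python
import cv2
import numpy as np

def perturb(image, angle_deg=8.0, scale=1.1, perspective=1e-4, seed=0):
    """Apply known rotation, scaling and perspective mismatch to an image,
    mimicking the artificial validation sequences described above."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    # Rotation and scaling about the image center (2x3 affine matrix)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, scale)
    H = np.vstack([M, [0, 0, 1]])
    # Mild random perspective terms in the last row of the homography
    H[2, 0] = rng.uniform(-perspective, perspective)
    H[2, 1] = rng.uniform(-perspective, perspective)
    return cv2.warpPerspective(image, H, (w, h)), H
```

The system is then run on the perturbed sequence, and the recovered contours are compared against the known ground truth to obtain the fill factor and pixel overlap accuracies reported above.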
[(a) Artificial image sequence with known area change: Fill Factor = 0, 6%, 13%, 18%, 23%, 30%. (b) Artificial image sequence with added mismatch in scaling, rotation and perspective. (c) Aligned image sequence with computed fill factor: Fill Factor = 0, 8%, 14%, 21%, 25%, 31%; Pixel Overlap = 100%, 97%, 98%, 96%, 96%, 97%.]
Figure 4-15: Progression analysis. (a) Artificial image sequence with known area change, created from a lesion image. (b) Image sequence after applying scaling, rotation and perspective
mismatch. (c) Output image sequence after lesion alignment and fill factor computation.
The proposed approach is used to analyze 174 images corresponding to 50 skin lesions
from ten subjects to determine the progression over time. The progression of multiple
lesions during treatment, as well as a detailed analysis of progression for all ten patients
in the clinical study is presented in Appendix B.
4.3.3 Auto-tagging Performance
Performance of the auto-tagging technique is evaluated by analyzing images of twenty lesions from ten subjects, with five images in each sequence. The first twenty images, captured at the beginning of the treatment, are manually tagged. The auto-tagging techniques using SIFT and shape contexts are then used to classify the remaining 80 images.
For each technique, the performance is evaluated as follows. For each image, we use the technique to calculate the similarity between that image and all images from the previous timestep, defined by the matching criterion in eq. (4.7) or eq. (4.8). If the image from the previous timestep with the highest similarity is from the same set, the technique classifies the image correctly. Otherwise, the classification is incorrect.
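This protocol amounts to nearest-neighbor classification against the previous timestep; a minimal sketch (ours; the data layout and function names are illustrative):

```python
def tagging_accuracy(images, similarity):
    """Nearest-neighbor classification accuracy, as described above.

    images:     list of (lesion_id, timestep, data) tuples
    similarity: matching criterion of eq. (4.7) or eq. (4.8)
    """
    correct = total = 0
    for lesion, t, data in images:
        if t == 0:
            continue                      # first images are manually tagged
        prev = [img for img in images if img[1] == t - 1]
        best = max(prev, key=lambda p: similarity(data, p[2]))
        correct += (best[0] == lesion)    # correct if same lesion wins
        total += 1
    return correct / total
```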
The SIFT based classification approach is able to accurately tag 70 of the 80 images,
achieving an accuracy of 87%. The shape context based approach is able to accurately
classify 72 of the 80 images, achieving an accuracy of 90%. The images in this test data
set were captured one to three months apart, which resulted in significant changes in the
lesion contours for some of the test images. If the contours are significantly different, it
becomes difficult for both SIFT based feature matching and shape context matching to
identify enough matching features for robust classification. Enabling more frequent data
collection, where adjacent images have far fewer changes in the lesion shape, will further
help enhance the accuracy of tagging.
The processing steps for tagging using SIFT are part of the steps necessary for contour detection and progression analysis, so this approach adds very little overhead while achieving good accuracy. The shape context based approach requires computing shape contexts and transforms, but this is a small overhead (less than 5%) in the overall processing.
4.3.4 Energy-Efficient Processing
The algorithmic optimizations outlined in Sections 4.2.2 and 4.2.3, for segmentation and SIFT based progression analysis respectively, provide significant reductions in computational complexity and in memory size and bandwidth requirements.
We can estimate the reduction in processing complexity through a comparison of run-times for the different implementations. Three different implementations, with full LSM and full SIFT, narrowband LSM and full SIFT, and narrowband LSM and narrowband SIFT, are created in MATLAB. All three implementations are run on a computer with a 2.4 GHz Intel Core i5 processor and 8 GB of 1600 MHz DDR3 memory. Run-times are determined as the average of fifty runs of the same implementation for processing two 2 megapixel images. Table 4.3 compares the run-times of the different implementations. The narrowband LSM implementation enhances the performance by 62% compared to full LSM.
Table 4.3: Performance enhancement through algorithmic optimizations.

Segmentation     Feature Matching    Run Time   Power    Energy
Full LSM         Full SIFT           11.4 sec   20.6 W   235 J
Narrowband LSM   Full SIFT           4.3 sec    21.2 W   91 J
Narrowband LSM   Narrowband SIFT     3.1 sec    20.2 W   63 J
The narrowband SIFT implementation improves the performance by 28% compared to full SIFT. A combination of both results in a 73% performance enhancement compared to full LSM and SIFT. The power consumption of the CPU during processing is measured using Intel Power Gadget [198]. The algorithmic optimizations result in a 73% reduction in the overall energy consumption.
For processing a 2 megapixel image in one second, based on the number of memory accesses, we can estimate the memory power using a memory power consumption model [144].
Figure 4-16 shows the memory bandwidth and estimated power consumption for processing with full LSM segmentation and SIFT feature matching compared with the optimized
Figure 4-16: Memory bandwidth and estimated power consumption for full image LSM and
SIFT compared to the optimized narrowband implementations of LSM and SIFT.
narrowband LSM segmentation and narrowband SIFT feature matching. The algorithmic optimizations leading to the narrowband implementations of both LSM segmentation and SIFT feature matching result in an 80% reduction in memory bandwidth and a 45% reduction in memory power. These algorithmic enhancements pave the way for energy-efficient hardware implementations that could enable real-time processing on mobile platforms.
4.3.5 Limitations
The performance of the system depends on several factors, including image exposure, skin type and the location of the lesion. For example, it is harder to accurately segment and align lesions that do not have well defined boundaries, such as a lesion that wraps around a finger or foot. Figure 4-17 shows an example where segmentation fails to identify the right lesion contours. Capturing multiple images of the lesion, each zoomed in on a narrow patch, could help improve the performance in such cases.
[Panels: November'12; January'13]
Figure 4-17: Image segmentation fails to accurately identify lesion contours where the lesions don't have well defined boundaries.
All the data analyzed in this work is based on image collection that happens only during the patient's visits to the doctor. Such visits may be far apart (a month or more), and the lesions may have changed too significantly to accurately determine matching features between the new image and the previously collected image. One of the goals of the mobile application is to enable patients to frequently capture and analyze images, even outside clinical visits. Frequent data collection and analysis would not only further enhance the performance of the system, but also provide doctors with near real-time feedback on the response to treatment that could be used to tailor the treatment for best outcomes. The approach is validated for the Vitiligo skin condition, but it has general applicability and could be extended to other skin conditions as well.
4.4 Mobile Application
A key objective of this work is to enable patients to perform imaging and progression
analysis of skin lesions at home much more frequently than having the usage limited to
dermatologists in a clinical environment. The ability to perform imaging and analysis
using mobile devices, such as smartphones, is important towards achieving this goal.
Along with undergraduate researchers Michelle Chen and Qui Nguyen, we are developing a
mobile application for the Android platform that enables image capture of the skin lesions
and provides a simple user interface to analyze the images. The analysis is performed using
a cloud-based system that integrates with the mobile application. Figure 4-18 shows the
architecture of the mobile application and cloud integration.
Figure 4-18: Architecture of the mobile application with cloud integration.
The application allows the user to capture images of the skin lesion using the built-in
camera on the mobile device. The images are uploaded to the cloud server. The first
time that a patient uses the application, they are asked to label each image manually. For
all subsequent usage, the images are tagged automatically based on the labels originally
provided by the user. The tag is suggested to the user for confirmation to prevent mislabeling in cases where auto-tagging might result in a wrong classification. A database of
all the images, organized according to the tags and the date of capture, is maintained in
the cloud server. The user can select a region to analyze, which activates processing on
the cloud server. After the processing is complete, the results are retrieved and displayed
on the mobile device as an animation sequence that takes the user through all the images
of that lesion, warped to align with the first image in the sequence, and shows the progression in terms of the corresponding fill factors. Figure 4-19 shows some of the screens
that form the user interface of the application that is currently under development.
4.5 Multispectral Imaging: Future Work
Medical imaging techniques are important tools in the diagnosis and treatment of various skin conditions, including skin cancers such as melanoma. Defining the true border of skin lesions and detecting their features are critical for dermatology. Imaging techniques such as multi-spectral imaging with polarized light provide non-invasive tools for probing the structure of living epithelial cells in situ, without the need for tissue removal. Light polarization also makes it possible to distinguish between single backscattering from epithelial-cell
nuclei and multiply scattered light. Polarized light imaging gives relevant information on
the borders of skin lesions that are not visible to the naked eye. Many skin conditions
typically originate in the superficial regions of the skin (epidermal basement membrane)
where polarized light imaging is most effective [199].
A number of polarized light imaging systems have been used in clinical imaging [199-201].
However, widespread use of these systems has been limited by their complexity and cost.
Figure 4-19: User interface of the mobile application: initial screen, image capture, manual tagging on first usage, auto-tagging with user confirmation on subsequent usage, selecting a region to analyze, and progression display. (Contributed by Michelle Chen and Qui Nguyen).
Some of the commercially available Dermlite [163] systems are useful for eliminating glare and shadows from the field of view, but do not provide information on the backscattered degree of polarization and superficial light scattering. More complex systems based on confocal microscopy [202] trade off portability and cost for high resolution and depth information.
We envision a portable imaging module with multispectral polarized light for medical imaging that could serve as an optical front-end for the skin imaging and analysis system developed in this work. A conceptual diagram of the imaging module is shown in Figure 4-20. The imaging module could function as an attachment to a smartphone and augment the built-in camera by enabling image capture under different spectral wavelengths, ranging from infrared to ultraviolet, and under different light polarization.

Figure 4-20: A conceptual diagram of the portable imaging module for multispectral polarized light imaging (cross polarization, multispectral illumination).

The multispectral illumination could be created using LEDs of varying wavelengths that are triggered sequentially and synchronized with the camera to capture a stack of images of the same lesion under different wavelength illumination. The synchronization could be achieved through a wired or wireless interface, such as USB or Bluetooth, with the smartphone.
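To make the capture sequencing concrete, the following is a minimal sketch of such a synchronization loop, assuming a hypothetical module driver that exposes set_led(wavelength, on) and grab_frame() callables over the chosen USB or Bluetooth link; the wavelength list is illustrative only.

    import time

    # Illustrative wavelengths only; the actual LED set would be chosen clinically.
    WAVELENGTHS_NM = [940, 630, 590, 470, 365]   # IR, red, yellow, blue, UV

    def capture_multispectral_stack(set_led, grab_frame, settle_s=0.05):
        """Trigger each LED in turn and capture one synchronized frame per wavelength.
        set_led(wl, on) and grab_frame() are supplied by the (hypothetical) module driver."""
        stack = {}
        for wl in WAVELENGTHS_NM:
            set_led(wl, True)            # turn on the LED for this wavelength
            time.sleep(settle_s)         # let the illumination settle before exposure
            stack[wl] = grab_frame()     # capture one image under this illumination
            set_led(wl, False)
        return stack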
The images captured under multispectral illumination provide a way to optically dissect a skin lesion by analyzing the features visible under different wavelengths: for example, surface pigmentation using blue light, superficial vascularity under yellow light, and deeper pigmentation and vascularity with the deeper-penetrating red light [203]. Such a device could enable early detection of skin conditions, even before the lesions fully manifest on the skin surface, as well as more accurate diagnosis and treatment by providing dermatologists with far more details of the lesion morphology than are visible under white light illumination.
4.6 Summary and Conclusions
In this work, we developed and implemented a system for identifying skin lesions and
determining the progression of the skin condition over time. The approach is applied to
clinical images of skin lesions captured using a handheld digital camera during the course
of treatment.
This work leverages computer vision techniques, such as SIFT feature matching and LSM image segmentation, and makes application-specific modifications, such as color/contrast enhancement, contour-based feature detection and contour detection in the presence of intensity/color inhomogeneities. A system is developed that integrates all of these aspects into a seamless flow and enables lesion detection and progression analysis of skin conditions, based not only on standardized clinical imaging but also on images captured by patients at home, using smartphone or digital cameras, without any standardization.
The algorithmic enhancements and optimizations, with the narrowband implementations of level set segmentation and SIFT feature matching, improve the software run-time performance by over 70% and reduce CPU energy consumption by 73%. These optimizations also reduce the estimated memory bandwidth requirement by 80% and memory power consumption by 45%, paving the way for energy-efficient hardware implementations that could enable real-time processing on mobile platforms.
The results, based on the images of skin lesions obtained from the pilot study conducted in collaboration with the Brigham and Women's Hospital, indicate that the lesion segmentation and progression analysis approach is able to effectively handle images captured under varying lighting conditions without the need for specialized imaging equipment. R, G, B histogram matching and expansion neutralizes the effect of lighting variations while also enhancing the contrast to make the skin lesions more prominent. LSM based segmentation accurately identifies the lesion contours despite intensity/color inhomogeneities in the image. Feature matching using SIFT effectively corrects for scaling, orientation and perspective mismatch in camera angles for a sequence of images captured over time and aligns the lesions
that can then be compared to determine progress over time. The fill factor provides
objective quantification of the progression with 95% accuracy, representing a significant
improvement over the current subjective outcome metrics such as the Physician's Global
Assessment and VASI that have assessment variability of more than 25%.
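As a concrete sketch of this pipeline, the snippet below uses the OpenCV [37] SIFT implementation to align a follow-up image to the baseline via a RANSAC-estimated homography, and then computes a fill factor from binary lesion masks; the mask-based fill-factor expression here is a simplified stand-in for the definition used in this work.

    import cv2
    import numpy as np

    def align_to_baseline(baseline_gray, followup_gray):
        """Warp the follow-up image onto the baseline using SIFT matches and RANSAC."""
        sift = cv2.SIFT_create()
        k1, d1 = sift.detectAndCompute(baseline_gray, None)
        k2, d2 = sift.detectAndCompute(followup_gray, None)
        pairs = cv2.BFMatcher().knnMatch(d2, d1, k=2)
        good = [m for m, n in pairs if m.distance < 0.7 * n.distance]  # Lowe's ratio test
        src = np.float32([k2[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
        dst = np.float32([k1[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        h, w = baseline_gray.shape
        return cv2.warpPerspective(followup_gray, H, (w, h))

    def fill_factor(baseline_mask, aligned_mask):
        """Fraction of the baseline lesion area that has recovered (negative if it grew)."""
        base = baseline_mask.astype(bool)
        now = aligned_mask.astype(bool)
        return 1.0 - now.sum() / base.sum()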
Based on the analysis of existing assessment techniques and the contributions of this work,
the following conclusions can be drawn:
1. The current assessment techniques for skin conditions are primarily based on subjective clinical assessment by physicians. The lack of quantification tools also has a significant impact on patient compliance. There is a significant need for quantitative dermatology approaches to aid doctors in determining important lesion features and accurately tracking progression over time, as well as giving patients confidence that a treatment is having the desired impact.
2. A diverse set of computer vision functionalities needs to be integrated to enable skin imaging and analysis without any standardization in image capture. Application requirements, such as image segmentation in the presence of intensity/color inhomogeneities and feature matching on relatively featureless skin surfaces, pose important challenges. This work leverages recent approaches in level set methods and feature matching, and enhances them for robustness with application-specific modifications.
3. The algorithms have high computational complexity and memory requirements. Efficient software and hardware implementations require algorithmic optimizations that significantly reduce the processing complexity without sacrificing accuracy. The narrowband image segmentation and feature matching approaches proposed in this work achieve this objective. These algorithmic optimizations could enable efficient hardware implementations for real-time analysis on mobile devices.
4. It is important to have a simple tool with reproducible results. The proposed system is demonstrated to achieve these goals through a pilot study for Vitiligo. This approach provides a significant tool for accurate and objective assessment of the progress, with impact on patient compliance. The precise quantification of progression would enable physicians to perform an objective follow-up study and test the efficacy of therapeutic procedures for best outcomes.
5. Combining efficient mobile processing with portable optical front-ends that enable
enhanced image acquisition, such as multispectral imaging, polarized lighting and
macro/microscopic imaging, will be key to developing portable medical imaging
systems. Such devices could be deployed widely at low cost for early detection and
monitoring of diseases in rural areas and emerging countries.
Chapter 5
Conclusions and Future Directions
The energy cost of processor programmability is very high due to the overhead associated with supporting a fine-grained instruction set compared to the actual cost of computation. As we go from CPUs and DSPs to FPGAs and ASICs, we progressively reduce this overhead and trade off programmability to gain energy-efficiency [204]. It is important to note, however, that the energy cost is ultimately determined by the desired operation and underlying algorithms. An algorithm that requires high precision floating point operations to maintain functionality and accuracy will not be able to achieve energy-efficiency comparable to one that can be implemented using small bit-width fixed point operations. The same is true of the performance enhancement that a parallel architecture can achieve, as bounded by Amdahl's Law. The energy requirement of an algorithm with large data dependencies will be dominated by the cost of memory accesses; even a highly optimized hardware implementation will not significantly improve the energy-efficiency of such a system. The development of a system that maximizes energy-efficiency must begin with algorithms, often reframing the problem and optimizing processing without changing functionality or impacting accuracy, and co-designing algorithms and architectures.
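As a one-line reminder of the Amdahl's Law bound referenced above:

    def amdahl_speedup(parallel_fraction, n_units):
        """Upper bound on speedup when only a fraction of the work parallelizes."""
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_units)

    # A kernel that is 90% parallelizable gains at most 6.4x even on 16 parallel units:
    print(amdahl_speedup(0.9, 16))   # 6.4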
5.1 Summary of Contributions
This thesis demonstrates the significance of the co-design approach for mobile platforms
through energy-efficient system design for multiple application areas.
5.1.1 Video Coding
Reconfigurability is key to enabling a class of closely related functionalities efficiently in hardware. Algorithmic rearrangements and optimizations for transform matrix computations were key to developing a reconfigurable transform engine for multiple video coding standards. The optimizations maximized hardware sharing and minimized the amount of computation required to implement large transform matrices. The shared transform resulted in 30% hardware savings compared to the total hardware requirement of individual H.264/AVC and VC-1 transform implementations. Algorithmic modifications for data dependent processing, to optimize pipeline bit widths and reduce the switching activity of the system, reduced the power consumption by 15%. Moving away from conventional 2D transform architectures, an approach to eliminate an explicit transpose memory was demonstrated by reusing the output buffer to store intermediate data and separately designing the row-wise and column-wise 1D transforms. This reduced the area by 23% and power by 26% compared to the implementation using a transpose memory.
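A behavioral sketch of the idea, in NumPy, is shown below: the row-wise 1D transforms write into a buffer that the column-wise pass then reads directly in column order, so no dedicated transpose memory is required. This is only a functional model, not the pipelined hardware of the actual engine.

    import numpy as np

    def transform_2d_no_transpose(block, H):
        """Separable 2D transform (H^T . X . H) without an explicit transpose memory."""
        n = block.shape[0]
        buf = np.empty_like(block)
        for r in range(n):                 # row-wise 1D transforms
            buf[r, :] = H.T @ block[r, :]
        out = np.empty_like(block)
        for c in range(n):                 # column-wise 1D transforms read the same buffer
            out[:, c] = H.T @ buf[:, c]
        return out                         # equals H.T @ block @ H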
Low-voltage circuit design using statistical performance analysis ensured reliable operation down to 0.35 V. The transform engine was demonstrated to support video encoding/decoding in both H.264 and VC-1 standards with Quad Full-HD (3840 x 2160) resolution at 30 fps, while operating at 25 MHz and 0.52 V and consuming 214 µW of power. This provided 250x higher power efficiency while supporting the same throughput as the previous state-of-the-art ASIC implementations. The design provided efficient performance scalability with 1080p (1920 x 1080) at 30 fps, while operating at 6.3 MHz and 0.41 V with 79 µW of power consumption, and 720p (1280 x 720) at 30 fps, while operating at 2.8 MHz and 0.35 V with 43 µW of power consumption.
The ideas of matrix factorization for hardware sharing, transpose memory elimination and data dependent processing have general applicability. As bigger block sizes such as 32x32 and 64x64 are explored in new video coding standards like HEVC, these ideas could lead to even higher savings in the area and power requirements of the transform engine, allowing efficient implementation in multi-standard multimedia devices.
5.1.2 Computational Photography
The importance of reframing algorithms for efficient hardware implementations is clearly
demonstrated by the optimizations, leveraging the 3D bilateral grid, that led to significant
reductions in computational complexity, memory size and bandwidth, while preserving
the output quality. The bilateral grid implementation enhanced processing locality by
reducing the data dependencies from multiple image rows to a few grid blocks in the
neighborhood, and enabled highly parallel processing.
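A compact NumPy sketch of the grid construction and slicing described above is shown below (grayscale input in [0, 1], nearest-neighbor slicing, and a box blur standing in for the Gaussian; the hardware implementation additionally uses fixed-point arithmetic and block-level scheduling, which the sketch omits).

    import numpy as np

    def bilateral_grid_filter(img, s_sigma=16, r_sigma=0.1):
        """Grayscale bilateral filtering via the 3D bilateral grid: splat, blur, slice."""
        h, w = img.shape
        gh, gw = h // s_sigma + 2, w // s_sigma + 2
        gr = int(1.0 / r_sigma) + 2
        grid = np.zeros((gh, gw, gr))      # accumulated intensities
        weight = np.zeros((gh, gw, gr))    # homogeneous coordinate (pixel counts)

        ys, xs = np.mgrid[0:h, 0:w]
        gy = (ys // s_sigma).ravel()       # spatial grid coordinates
        gx = (xs // s_sigma).ravel()
        gz = (img / r_sigma).astype(int).ravel()    # range (intensity) coordinate
        np.add.at(grid, (gy, gx, gz), img.ravel())  # splat pixels into grid cells
        np.add.at(weight, (gy, gx, gz), 1.0)

        for axis in range(3):              # blur the small grid along each dimension
            grid = (np.roll(grid, 1, axis) + grid + np.roll(grid, -1, axis)) / 3.0
            weight = (np.roll(weight, 1, axis) + weight + np.roll(weight, -1, axis)) / 3.0

        out = grid[gy, gx, gz] / np.maximum(weight[gy, gx, gz], 1e-6)  # slice
        return out.reshape(h, w)

Because the grid is orders of magnitude smaller than the image, the expensive filtering happens on the grid, and each output pixel depends only on a few neighboring grid cells, which is exactly the locality the hardware exploits.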
Architectural optimizations exploiting parallelism, with two bilateral filter engines operating in parallel and each supporting 16x parallel processing, enabled high-throughput real-time performance while operating at less than 100 MHz. Combining algorithmic optimizations, parallelism and processing data locality with careful memory management helped reduce the external memory bandwidth by 97% (from 5.6 GB/s to 165.9 MB/s) and the DDR2 memory power consumption by 74% (from 380 mW to 99 mW). Through algorithm/architecture co-design, an approach for low-light enhancement and flash shadow correction was developed that enables efficient implementation using the bilateral grid architecture.
Circuit design for low-voltage operation and multiple voltage domains enabled the processor to achieve a wide operating range, from 25 MHz at 0.5 V with 2.3 mW power consumption to 98 MHz at 0.9 V with 17.8 mW power consumption. Co-designing algorithms, architectures and circuits enabled the processor to achieve 280x higher energy-efficiency compared to software implementations with identical functionality on state-of-the-art mobile processors. A scalable architecture, with clock and power gating, enabled users to perform energy/resolution scalable processing and was demonstrated to achieve energy scalability from 0.19 mJ/megapixel to 1.37 mJ/megapixel for different grid configurations at 0.9 V, while trading off output quality for energy.
5.1.3 Medical Imaging
The current assessment techniques for skin conditions are primarily based on subjective clinical assessment by physicians. The algorithmic enhancements that extended computer vision techniques, from image segmentation in the presence of inhomogeneities to feature matching on relatively featureless surfaces, were key to developing a system for objective quantification of skin condition progression. The system achieved robust performance in clinical validation with 95% accuracy, representing a significant improvement over the current subjective outcome metrics such as the Physician's Global Assessment and VASI that have assessment variability of more than 25%. Algorithmic optimizations with the narrowband implementations of level set segmentation and SIFT feature matching helped improve the software run-time performance and CPU energy consumption by over 70%. These optimizations also reduced the estimated memory bandwidth requirement by 80% and memory power consumption by 45%, paving the way for energy-efficient hardware implementations that could enable real-time processing on mobile platforms.
5.2 Conclusions
This thesis focuses on addressing the challenges of implementing high-complexity applications with high-performance requirements on mobile platforms through a comprehensive view of system design, where algorithms are designed and optimized to enhance processing locality and enable highly parallel architectures that can be implemented using low-power, low-voltage circuits to achieve maximally energy-efficient systems. The investigation of multiple application areas in this thesis leads to the following conclusions.
1. Application Specific Processing: With the performance-per-watt gains due to technology scaling saturating and the tight energy constraints of mobile platforms, energy-efficiency is the key bottleneck in scaling performance. Application specific hardware units that trade off programmability for high energy-efficiency are becoming an increasingly important part of processor architectures. Hardware-optimized algorithm design is crucial to maximizing performance and efficiency gains.
2. Reconfigurable Architectures: A hardware implementation with highly optimized processing units supporting the core functionalities of a class of applications (for example, computational photography or video coding), together with the ability to activate these processing units and configure the datapaths based on application requirements, provides a very attractive alternative to individual hardware implementations for each algorithm or application: it maintains high energy-efficiency while supporting an entire application class.
3. Scalable Architectures: Scalable architectures, with efficient clock and power
gating, enable energy vs. performance/quality trade-offs that are extremely desirable
for mobile processing. This energy-scalable processing allows the user to determine
the energy usage for a task, based on the battery state or intended usage for the
output.
4. Data Dependent Processing: Data dependent processing can be a powerful tool in reducing system power consumption. Applications such as multimedia processing have high data dependency, where intensities of pixels in an image, pixel blocks in consecutive frames of a video sequence or utterances in a speech sequence are highly correlated. By exploiting the characteristics of the data being processed, architectures can be designed to minimize switching activity, optimize pipeline bit widths and perform a variable number of operations per block [67] (see the sketch following this list). The reduction in the number of computations and switching activity has a direct impact on the system power consumption.
5. Low-Voltage Circuit Design: Low-voltage circuit operation is important to enable voltage/frequency scaling and attain minimum energy operation for the desired performance. Variations play a key role in determining circuit performance for low-voltage operation. The non-linear impact of local variations on performance must be taken into account to ensure a robust design at low voltage.
6. Memory Bandwidth and Power: External memory bandwidth and power consumption are key bottlenecks in achieving maximally efficient systems for data intensive applications. If the power consumption of the external memory and the interface between the memory and the processor is the dominant source of system power consumption, optimizing the processor alone adds very little to the system efficiency. New technology solutions such as embedded DRAM [205,206], which enables DRAM integration onto the processor die, can play a crucial role in maximizing the system energy-efficiency by minimizing the cost of memory accesses while enabling significantly higher bandwidths.
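The sketch below illustrates the data dependent processing idea from conclusion 4 for an inverse transform, in the spirit of [67]: all-zero and DC-only coefficient blocks, which are common in video residuals, bypass the full computation. The block classification shown (and the omitted DC basis scaling) are illustrative simplifications.

    import numpy as np

    def inverse_transform_blocks(blocks, H):
        """Skip or shortcut the 2D inverse transform based on each block's coefficients."""
        out = []
        for blk in blocks:
            if not blk.any():                          # all-zero block: no computation
                out.append(np.zeros_like(blk))
            elif not blk[1:, :].any() and not blk[0, 1:].any():
                # DC-only block: the result is flat, so one value replaces the full
                # transform (the DC basis scaling is omitted in this sketch)
                out.append(np.full_like(blk, blk[0, 0]))
            else:                                      # general case: full separable inverse
                out.append(H @ blk @ H.T)
        return out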
5.3 Future Directions

5.3.1 Computational Photography and Computer Vision
With recent advances in photography, incorporating computer vision and computational photography techniques, we have just begun to scratch the surface of what cameras of the future could achieve. For example, embedded computer vision aspires to enable an ever expanding range of applications such as image and video search, scene reconstruction, 3D scanning and modeling. Enabling such applications requires a processor capable of sustaining high computational loads and memory bandwidths, while operating within the tight constraints of low-power mobile platforms. Chapter 3 presents the algorithm/architecture/circuit co-design approach, as it relates to a set of computational
photography applications. Such a comprehensive system design approach will be essential
to enable computational photography for embedded vision and video processing on mobile devices. This opens up a new dimension in video processing, with possibilities such as light-field video, where the video could be manipulated in real time during playback, refocusing frames and changing viewpoints. New research in image sensors [207,208], along
with multi-sensor arrays, could be coupled with energy efficient processing to realize exciting new possibilities for future generation cameras and smartphones, in applications such
as 3D image and video capture, depth sensing, multi-view video and gesture control.
Combining the ability to interpret very complex real 3D environments using computational photography with object and feature recognition techniques from computer vision, and natural human interfaces such as gesture and speech recognition, is key to making a truly immersive environment, like the Holodeck, a reality [209]. The performance and energy constraints of such a system would necessitate novel architectural and circuit design innovations. Many of the underlying algorithms in computational photography and computer vision are still in a nascent stage, which requires reconfigurability and programmability in the hardware implementations. For example, an efficient processor for OpenCV [37], the library of programming functions for computer vision, could dramatically transform the way computer vision applications are implemented. The challenge for such processors would lie in implementing computationally complex and memory intensive hardware primitives while ensuring flexibility for new software innovations to be realized.
5.3.2 Portable Medical Imaging
The proliferation of connected portable devices and cloud computing provides a unique opportunity to revolutionize the delivery of affordable primary health care. A secure and portable medical imaging platform is a key milestone in making this goal a reality. Computational imaging is becoming an integral part of portable devices such as smartphones. Extending this functionality for medical imaging applications will enable portable non-invasive medical monitoring. A cloud based service can then allow the patient and the doctor to share this medical database and perform image analysis to help with the diagnosis and monitor the progress. Strong security guarantees are essential to ensure that patient-doctor confidentiality is respected by such services. Strong cryptographic primitives like homomorphic encryption [210] provide potential ways to enable secure processing in the encrypted domain, which would ensure user privacy and protect patient data. Figure 5-1 shows the conceptual representation of such a cloud-based processing platform.
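As a toy illustration of the additive homomorphism such schemes provide, the sketch below implements textbook Paillier encryption with tiny parameters; real deployments use keys thousands of bits long, and this is in no way a secure or complete construction.

    import math, random

    # Textbook Paillier: Dec(Enc(m1) * Enc(m2) mod n^2) = m1 + m2 mod n.
    p, q = 1009, 1013                  # toy primes, far too small for real security
    n, n2 = p * q, (p * q) ** 2
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)               # valid because the generator g = n + 1 is used

    def encrypt(m):
        r = random.randrange(1, n)
        while math.gcd(r, n) != 1:     # r must be coprime to n
            r = random.randrange(1, n)
        return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

    def decrypt(c):
        return ((pow(c, lam, n2) - 1) // n * mu) % n

    # The cloud can add two encrypted measurements without ever decrypting them:
    c = (encrypt(123) * encrypt(456)) % n2
    assert decrypt(c) == 579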
Figure 5-1: Secure cloud-based medical imaging platform.

One of the major challenges in using this approach is the extremely high computational complexity and memory requirement of processing in the encrypted domain. This makes
software-based processing extremely inefficient and real-time operation impractical. Optimized encryption algorithms with efficient hardware implementations would be essential to make secure real-time processing a reality.
The work presented in Chapter 4 provides a foundation for developing efficient hardware implementations to integrate medical imaging in mobile devices. This would enable
real-time processing of hundreds of images, captured over time, to provide doctors and patients immediate feedback that could be used to determine the future course of treatment.
The enormous performance and energy advantages that efficient hardware implementations provide could be used to transform medical imaging applications, such as Optical Coherence Tomography (OCT), Magnetic Resonance Imaging (MRI) and Computed Tomography (CT) scan reconstruction, and shift the analysis from bulky GPU clusters to
portable devices. Such systems could significantly enhance medical imaging and finally
bring the Tricorder from the realms of science fiction to reality!
The intersection of cutting-edge algorithms, massively parallel architectures with specialized reconfigurable accelerators, and ultra-low power circuits is ripe for exploration. The future of technology innovation will be defined by societal imperatives such as affordable healthcare, energy-efficiency and security, and the biggest challenge of this era will be to revolutionize these fields just as the era of CMOS scaling revolutionized computing, communication and consumer entertainment. In just a decade, the relationship among our daily activities, our data, and the media of content creation and consumption will be radically different. This thesis attempts to define the challenges and propose system design solutions to help build the technologies that will define this relationship.
Appendix A
Integer Transform
The most commonly used transform in video and image coding applications is the Discrete Cosine Transform (DCT). The DCT has an excellent energy compaction property, which leads to good compression efficiency. However, the irrational numbers in the transform matrix make its exact implementation impossible, leading to a drift between forward and inverse transform coefficients.
H.264/AVC as well as VC-1 video coding standards use a variation of the DCT, known as
Integer transform. In these transforms, the transform matrices are defined to have only
integers. This makes exact inverse possible using integer arithmetic.
The following sections describe the definitions of integer transforms for H.264/AVC and
VC-1 video coding standards.
A.1 H.264/AVC Integer Transform
The separable 2-D 8x8 forward transform for H.264/AVC can be written as:

    F8 = H8^T · X8x8 · H8    (A.1)
and the separable 2-D 8x8 inverse transform can be written as:

    I8 = H8 · Y8x8 · H8^T    (A.2)
Where the 1-D 8x8 integer transform matrix for H.264/AVC is defined as:

    H8 = [ 8  12   8  10   8   6   4   3
           8  10   4  -3  -8 -12  -8  -6
           8   6  -4 -12  -8   3   8  10
           8   3  -8  -6   8  10  -4 -12
           8  -3  -8   6   8 -10  -4  12
           8  -6  -4  12  -8  -3   8 -10
           8 -10   4   3  -8  12  -8   6
           8 -12   8 -10   8  -6   4  -3 ]    (A.3)
Similarly, the separable 2-D 4x4 forward transform for H.264/AVC can be written as:

    F4 = H4^T · X4x4 · H4    (A.4)

and the separable 2-D 4x4 inverse transform can be written as:

    I4 = H4 · Y4x4 · H4^T    (A.5)
Where the 1-D 4x4 transform matrix for H.264/AVC is defined as:

    H4 = [ 1   2   1   1
           1   1  -1  -2
           1  -1  -1   2
           1  -2   1  -1 ]    (A.6)
A.2 VC-1 Integer Transform
VC-1 uses 8x8, 8x4, 4x8 and 4x4 transforms.

The 2-D separable m x n forward integer transform for VC-1, where m = 8, 4 and n = 8, 4, is given as:

    Fmxn = (Vm^T · Xmxn · Vn) ∘ Nmxn    (A.7)

And the m x n inverse integer transform for VC-1 is given as:

    Imxn = (Vm · Ymxn · Vn^T) / 1024    (A.8)
The denominator is chosen to be the power of 2 closest to the squared norm of the basis functions (288, 289 and 292) of the 1-D transformation.

In order to preserve one extra bit of precision, the 1-D transform operations are performed as:

    Dmxn = (Ymxn · Vn^T) / 16   and   Imxn = (Vm · Dmxn) / 64    (A.9)
The 1-D 8x8 transform matrix is defined as:

    V8 = [ 12  16  16  15  12   9   6   4
           12  15   6  -4 -12 -16 -16  -9
           12   9  -6 -16 -12   4  16  15
           12   4 -16  -9  12  15  -6 -16
           12  -4 -16   9  12 -15  -6  16
           12  -9  -6  16 -12  -4  16 -15
           12 -15   6   4 -12  16 -16   9
           12 -16  16 -15  12  -9   6  -4 ]    (A.10)
and the 1-D 4x4 inverse transform matrix is defined as:
    V4 = [ 17  22  17  10
           17  10 -17 -22
           17 -10 -17  22
           17 -22  17 -10 ]    (A.11)
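A quick NumPy check that the two-stage form (A.9) is identical to the single-stage inverse (A.8), since 16 x 64 = 1024 (the integer rounding offsets of the actual VC-1 specification are omitted here):

    import numpy as np

    V4 = np.array([[17,  22,  17,  10],
                   [17,  10, -17, -22],
                   [17, -10, -17,  22],
                   [17, -22,  17, -10]])

    Y = np.random.randint(-255, 256, (4, 4)).astype(float)   # dequantized coefficients

    one_stage = (V4 @ Y @ V4.T) / 1024       # inverse transform as in (A.8)
    D = (Y @ V4.T) / 16                      # first 1-D pass of (A.9)
    two_stage = (V4 @ D) / 64                # second 1-D pass of (A.9)
    assert np.allclose(one_stage, two_stage)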
Appendix B
Clinical Pilot Study for Vitiligo
Progression Analysis
B.1 Subjects for Pilot Study
Institutional Review Board approval was obtained for data analysis (MIT Protocol Number: 1301005500) as well as the clinical pilot study in collaboration with the Brigham
and Women's Hospital (BWH Protocol Number: 2012-P-002185/1). Ten subjects, aged 18 years and older, with a dermatologist's diagnosis of vitiligo were recruited by Dr. Vaneeta
Sheth. Subjects had a variety of skin phototypes and disease characteristics, as outlined
in Table B.1. As this was a pilot study, no standardized intervention was performed.
Rather, subjects were treated with standard therapies used for vitiligo based on clinical
characteristics and patient preference.
Table B.1: Demographics of the subjects for clinical study.

Subject | Age (Years) | Gender | Ethnicity | Vitiligo Phenotype | Treatment Modalities
1 | 21 | F | Hispanic | Acrofacial | None
2 | 59 | M | African-American | Non-segmental vitiligo | NBUVB*, oral corticosteroids
3 | 57 | M | Caucasian, Native American | Non-segmental vitiligo | NBUVB
4 | 29 | M | Caucasian | Mucosal/genital | Topical calcineurin inhibitor, NBUVB
5 | 43 | M | Caucasian | Non-segmental common vitiligo | NBUVB
6 | 27 | F | South Asian | Segmental | NBUVB
7 | 46 | M | Greek | Acrofacial | Topical corticosteroids
8 | 43 | M | Caucasian | Non-segmental/common vitiligo | NBUVB, topical corticosteroids
9 | 35 | F | South Asian | Acrofacial | NBUVB
10 | 43 | F | African-American | Segmental | NBUVB, topical immunomodulators, topical bimatoprost

*NBUVB: Narrow-band Ultraviolet B
B.2 Progression Analysis
The proposed approach is used to analyze 174 images corresponding to 50 skin lesions from ten subjects to determine the progression over time. Figure B-1 shows the progression of five lesions through 20 images captured during treatment. A detailed analysis of progression for all ten patients in the clinical study is presented in Table B.2.
Figure B-1: Progression of skin lesions over time. Lesion contours are identified from the color corrected images and the lesions are aligned using SIFT feature matching to determine the fill factor.
Table B.2: Progression of Skin Lesions During Treatment. Entries are fill factors (%); the dates in parentheses are the imaging sessions for each subject.

Subject 1 (Dec'12, Jan'13, Jun'13):
Left Hand | 0 | -2 | -19
Right Hand | 0 | 1 | -9

Subject 2 (Nov'12, Dec'12, Jan'13, Feb'13, Mar'13):
Chest | 0 | 8 | 24 | 63 | 78
Left Elbow | 0 | 2 | 9 | 17 | 31
Right Elbow | 0 | 4 | 17 | 26 | 59

Subject 3 (Nov'12, Dec'12, Jan'13, Feb'13):
Left Popliteal Fossa | 0 | 5 | 9 | 10
Left Wrist | 0 | 4 | 10 | 16
Right Popliteal Fossa | 0 | 3 | 5 | 11
Right Antecubital Fossa | 0 | 6 | 16 | 22
Right Forearm | 0 | 11 | 25 | 36
Right Wrist | 0 | 9 | 25 | 28

Subject 4 (Dec'12, Mar'13, May'13, Jun'13, Jul'13, Oct'13):
Left Foot | 0 | -3 | 2 | 5 | 13 | 17
Left Hand | 0 | 1 | 5 | 14 | 21 | 22
Left Knee | 0 | 7 | 1 | 14 | 17 | 26
Right Hand | 0 | 2 | 3 | 24 | 33 | 39
Right Foot | 0 | 2 | 4 | -5 | 6 | 18
Right Knee | 0 | 0 | 3 | 7 | 19 | 52
Subject 5 (Feb'13, Mar'13, Apr'13, May'13, Jun'13, Jul'13):
- | 0 | 2 | 5
Left Eye | 0 | 3 | 19 | 28
Left Neck | 0 | 3 | 4 | 6
Left Preauricular | 0 | 6 | 53

Subject 6 (May'13, Jul'13, Sep'13):
Left Forehead | 0 | 2 | -2
Left Hand | 0 | 1 | 3
Right Forehead | 0 | 3 | 7
Right Hand | 0 | -2 | 1

Subject 7 (Jun'13, Jul'13, Aug'13, Oct'13, Nov'13):
Forehead | 0 | 2 | - | 3 | 16
Left Temple | 0 | - | - | -83 | -91
Right Temple | 0 | - | - | -16 | -11

Subject 8 (Jan'13, -, Oct'13, Nov'13):
Genital | 0 | 46 | 84 | 95

Subject 9 (Jun'13, Jul'13, Sep'13):
Left Cutaneous Lower Lip | 0 | 4 | 7
Right Oral Commissure | 0 | -4 | -8
R. Cutaneous Upper Lip | 0 | 2 | 21
Right Preauricular | 0 | 8 | 86

Subject 10 (Nov'12, Mar'13, Jul'13, Sep'13):
Right Cheek | 0 | 27 | 52 | 57
Acronyms
ASIC    Application Specific Integrated Circuit
BW      Bandwidth
CC      Camera Curves
CMOS    Complementary Metal Oxide Semiconductor
Conv    Convolution
CPU     Central Processing Unit
CT      Computed Tomography
DCT     Discrete Cosine Transform
DRAM    Dynamic Random Access Memory
DSP     Digital Signal Processor
DVFS    Dynamic Voltage-Frequency Scaling
FIFO    First In First Out
FPGA    Field Programmable Gate Array
fps     frames per second
GA      Grid Assignment
GPGPU   General Purpose Graphics Processing Unit
GPU     Graphics Processing Unit
HD      High Definition
HDR     High Dynamic Range
HEVC    High-Efficiency Video Coding
HoG     Histogram of Oriented Gradients
IC      Integrated Circuit
LDR     Low Dynamic Range
LED     Light Emitting Diode
LSB     Least Significant Bit
LSF     Level Set Function
LSM     Level Set Method
LUT     Look-Up Table
MBPS    Megabytes per second
MRI     Magnetic Resonance Imaging
MSB     Most Significant Bit
NBUVB   Narrow-band Ultraviolet B
OCT     Optical Coherence Tomography
OPA     Operating Point Analysis
OPS     Operations Per Second
PC      Personal Computer
PCB     Printed Circuit Board
PDF     Probability Density Function
PGA     Physician's Global Assessment
QFHD    Quad Full-HD
RANSAC  Random Sample Consensus
RDF     Random Dopant Fluctuations
SIFT    Scale Invariant Feature Transform
SRAM    Static Random Access Memory
SSTA    Statistical Static Timing Analysis
STA     Static Timing Analysis
SVD     Singular Value Decomposition
SVM     Support Vector Machine
VASI    Vitiligo Area and Severity Index
Bibliography
[1] C. Babbage, "On the mathematical powers of the calculating engine," Original Manuscript: Museum of History of Science, Oxford, December 1837.
[2] B. Collier, "The little engines that could've: The calculating machines of Charles Babbage," Doctoral dissertation, Harvard University, August 1970.
[3] G. E. Moore, "Cramming more components onto integrated circuits," Electronics, pp. 114-
117, April 1965.
[4] R. H. Dennard, F. Gaensslen, H. Yu, L. Rideout, E. Bassous, and A. LeBlanc, "Design of ion-implanted MOSFET's with very small physical dimensions," IEEE Journal of Solid-State Circuits, vol. SC-9, pp. 256-268, October 1974.
[5] M. Weiser, "The computer for the 21st century," Scientific American, vol. 265, pp. 94-104,
September 1991.
[6] R. Want, W. Schilit, N. Adams, R. Gold, K. Petersen, D. Goldberg, J. Ellis, and M. Weiser,
"An overview of the ParcTab ubiquitous computing experiment," IEEE Personal Communications, vol. 2, pp. 28-43, December 1995.
[7] R. Brodersen, "Infopad - an experiment in system level design and integration," Design Automation Conference, pp. 313-314, 1997.
[8] A. Chandrakasan, A. Burstein, and R. W. Brodersen, "A low power chipset for portable multimedia applications," International Solid-State Circuits Conference, pp. 82-83, 1994.
[9] J. C. Maxwell, "Experiments on color, as perceived by the eye, with remarks on colorblindness," Transactions of the Royal Society of Edinburgh, vol. 21, no. 2, pp. 275-298,
1855.
[10] J. A. Paradiso and T. Starner, "Energy scavenging for mobile and wireless electronics,"
IEEE Pervasive Computing, vol. 4, pp. 18-27, January 2005.
[11] Y. Miyabe, "Smart life solutions: from home to city," International Solid-State Circuits Conference, pp. 12-17, 2013.
[12] R. H. Dennard, J. Cai, and A. Kumar, "A perspective on today's scaling challenges and
possible future directions," Solid-State Electronics, vol. 51, pp. 518-525, April 2007.
[13] M. Horowitz, "Computing's energy problem (and what we can do about it)," International Solid-State Circuits Conference, pp. 10-14, 2014.
[14] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen, "Low-power CMOS digital design," IEEE Journal of Solid-State Circuits, vol. 27, pp. 473-484, April 1992.
[15] B. Davari, R. H. Dennard, and G. G. Shahidi, "CMOS scaling for high performance and low power - the next ten years," Proceedings of the IEEE, vol. 83, pp. 595-606, April 1995.
[16] M. Horowitz, E. Alon, D. Patil, S. Naffziger, R. Kumar, and K. Bernstein, "Scaling, power, and the future of CMOS," IEEE International Electron Devices Meeting, pp. 7-15, 2005.
[17] K. Itoh, "Adaptive circuits for the 0.5-V nanoscale CMOS era," IEEE International Solid-State Circuits Conference, pp. 14-20, 2009.
[18] G. M. Amdahl, "Validity of the single processor approach to achieving large-scale computing capabilities," AFIPS Spring Joint Computer Conference, pp. 483-485, 1967.
[19] W. Dally, "The path to high-efficiency computing," Computational Sciences and Engineering Conference [online] http://computing.ornl.gov/workshops/SMC13/presentations/3-SMC_0913_Dally.pdf, 2013.
[20] M. Yuffe, E. Knoll, M. Mehalel, J. Shor, and T. Kurts, "A fully integrated multi-CPU, GPU and memory controller 32nm processor," International Solid-State Circuits Conference, pp. 264-265, 2011.
[21] S. Damaraju, V. George, S. Jahagirdar, T. Khondker, R. Milstrey, S. Sarkar, S. Siers, I. Stolero, and A. Subbiah, "A 22nm IA multi-CPU and GPU system-on-chip," International Solid-State Circuits Conference, pp. 56-57, 2012.
[22] P. Ou, J. Zhang, H. Quan, Y. Li, M. He, Z. Yu, X. Yu, S. Cui, J. Feng, S. Zhu, J. Lin, M. Jing, X. Zeng, and Z. Yu, "A 65nm 39GOPS/W 24-core processor with 11Tb/s/W packet-controlled circuit-switched double-layer network-on-chip and heterogeneous execution array," International Solid-State Circuits Conference, pp. 56-57, 2013.
[23] G. Gammie, N. Ickes, M. E. Sinangil, R. Rithe, J. Gu, A. Wang, H. Mair, S. Datla, B. Rong, S. Honnavara-Prasad, L. Ho, G. Baldwin, D. Buss, A. P. Chandrakasan, and U. Ko, "A 28nm 0.6V low-power DSP for mobile applications," International Solid-State Circuits Conference, pp. 132-133, 2011.
[24] "Dragonboard Snapdragon S4 plus APQ8060A mobile development board," [online] https://developer.qualcomm.com/mobile-development/development-devices/dragonboard.
[25] "Samsung Exynos 5 dual Arndale board," [online] http://www.arndaleboard.org/wiki/index.php/Main.Page.
[26] Y. Park, C. Yu, K. Lee, H. Kim, Y. Park, C. Kim, Y. Choi, J. Oh, C. Oh, G. Moon, S. Kim, H. Jang, J. A. Lee, C. Kim, and S. Park, "72.5GFlops 240Mpixel/s 1080p 60fps multi-format video codec application processor enabled with GPGPU for fused multimedia application," International Solid-State Circuits Conference, pp. 160-161, 2013.
[27] J. Park, I. Hong, G. Kim, Y. Kim, K. Lee, S. Park, K. Bong, and H. J. Yoo, "A 646GOPS/W multi-classifier many-core processor with cortex-like architecture for super-resolution recognition," International Solid-State Circuits Conference, pp. 168-169, 2013.
[28] D. Markovic, R. W. Brodersen, and B. Nikolic, "A 70GOPS, 34mW multi-carrier MIMO chip in 3.5mm²," IEEE Symposium on VLSI Circuits, pp. 158-159, 2006.
[29] C. T. Huang, M. Tikekar, C. Juvekar, V. Sze, and A. Chandrakasan, "A 249Mpixel/s
HEVC video-decoder chip for quad full HD applications," International Solid-State Circuits Conference, pp. 162-163, 2013.
[30] M. Mehendale, S. Das, M. Sharma, M. Mody, R. Reddy, J. Meehan, H. Tamama, B. Carlson, and M. Polley, "A true multistandard, programmable, low-power, full HD video-codec
engine for smartphone SoC," International Solid-State Circuits Conference, pp. 226-227,
2012.
[31] V. Aurich and J. Weule, "Non-linear gaussian filters performing edge preserving diffusion,"
Springer Berlin Heidelberg, pp. 538-545, 1995.
[32] P. J. Burt, "Fast algorithms for estimating local image properties," Computer Vision, Graphics, and Image Processing, vol. 21, pp. 368-382, March 1983.
[33] P. J. Burt and E. H. Adelson, "The Laplacian pyramid as a compact image code," IEEE Transactions on Communications, vol. 31, pp. 532-540, April 1983.
[34] D. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, pp. 91-110, February 2004.
[35] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," Computer
Vision and Pattern Recognition Conference, pp. 886-893, 2005.
[36] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features,"
Computer Vision and Pattern Recognition Conference, pp. 511-518, 2001.
[37] "OpenCV: Open source computer vision," [online] http://opencv.org/.
[38] A. P. Chandrakasan, D. C. Daly, D. F. Finchelstein, J. Kwong, Y. K. Ramadass, M. E.
Sinangil, V. Sze, and N. Verma, "Technologies for ultradynamic voltage scaling," Proceedings of the IEEE, vol. 98, pp. 191-214, February 2010.
[39] B. Calhoun, A. Wang, and A. Chandrakasan, "Modeling and sizing for minimum energy operation in subthreshold circuits," IEEE Journal of Solid-State Circuits, vol. 40,
pp. 1778-1786, September 2005.
[40] A. Asenov, "Random dopant induced threshold voltage lowering and fluctuations in sub-0.1 µm MOSFET's: A 3-D "atomistic" simulation study," IEEE Transactions on Electron Devices, vol. 45, pp. 2505-2513, December 1998.
[41] P. Andrei and I. Mayergoyz, "Random doping-induced fluctuations of subthreshold characteristics in MOSFET devices," Solid-State Electronics, vol. 47, pp. 2055-2061, November
2003.
[42] B. Zhai, S. Hanson, D. Blaauw, and D. Sylvester, "Analysis and mitigation of variability in subthreshold design," International Symposium on Low Power Electronics and Design, pp. 20-25, 2005.
[43] M. J. M. Pelgrom, A. C. J. Duinmaijer, and A. P. G. Welbers, "Matching properties of MOS transistors," IEEE Journal of Solid-State Circuits, vol. 24, pp. 1433-1440, October 1989.
[44] B. H. Calhoun and A. P. Chandrakasan, "Static noise margin variation for sub-threshold SRAM in 65-nm CMOS," IEEE Journal of Solid-State Circuits, vol. 41, pp. 1673-1679, July 2006.
[45] "Cisco visual networking index: Global mobile data traffic forecast update, 2013-2018," [online] http://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/white_paper_c11-520862.html.
[46] Y. K. Lin, D. W. Li, C. C. Lin, T. Y. Kuo, S. J. Wu, W. C. Tai, W. C. Chang, and T. S. Chang, "A 242mW 10mm² 1080p H.264/AVC high-profile encoder chip," International Solid-State Circuits Conference, pp. 314-315, 2008.
[47] D. F. Finchelstein, V. Sze, M. E. Sinangil, Y. Koken, and A. P. Chandrakasan, "A low-power 0.7-V H.264 720p video decoder," IEEE Asian Solid-State Circuits Conference, pp. 173-176, 2008.
[48] K. Yu, M. Takahashi, T. Maeda, H. Hara, H. Arakida, H. Yamamoto, Y. Hagiwara, T. Fujita, M. Watanabe, T. Shimazawa, Y. Ohara, T. Miyamori, M. Hamada, and Y. Oowaki, "A 222mW H.264 full-HD decoding application processor with x512b stacked DRAM in 40nm," International Solid-State Circuits Conference, pp. 326-327, 2010.
[49] Y. Park, C. Yu, K. Lee, H. Kim, Y. Park, C. Kim, Y. Choi, J. Oh, C. Oh, G. Moon, S. Kim, H. Jang, J. A. Lee, C. Kim, and S. Park, "72.5GFlops 240Mpixel/s 1080p 60fps multi-format video codec application processor enabled with GPGPU for fused multimedia application," International Solid-State Circuits Conference, pp. 160-161, 2013.
[50] T. Burd and R. Brodersen, "Design issues for dynamic voltage scaling," IEEE International Symposium on Low Power Electronics and Design, pp. 9-14, 2000.
[51] B. H. Calhoun and A. P. Chandrakasan, "Characterizing and modeling minimum energy operation for subthreshold circuits," IEEE International Symposium on Low Power Electronics and Design, pp. 90-95, 2004.
[52] ITU-T Recommendation H.264, "Advanced video coding for generic audiovisual services."
[53] T. Wiegand and G. J. Sullivan, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 560-576, July 2003.
[54] SMPTE 421M, "VC-1 compressed video bitstream format and decoding process."
[55] H. Kalva and J. Lee, "The VC-1 video coding standard," IEEE Multimedia, vol. 14,
pp. 88-91, October 2007.
[56] S. Srinivasan, P. Hsu, T. Holcomb, K. Mukerjee, S. L. Regunathan, B. Lin, J. Liang, M.-C.
Lee, and J. Ribas-Corbera, "Windows Media Video 9: Overview and applications," Signal
Processing: Image Communication,vol. 19, pp. 851-875, October 2004.
[57] H. S. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, "Low-complexity transform
and quantization in H.264/AVC," IEEE Transactions on Circuits and Systems for Video
Processing, vol. 13, pp. 598-603, July 2003.
[58] S. Srinivasan, S. Regunathan, and B. Lin, "Computationally efficient transforms for video
coding," IEEE International Conference on Image Processing, pp. 11-14, 2005.
[59] S. Srinivasan and J. Liang, "Fast video codec transform implementations," U.S. Patent
20050256916, November 2005.
[60] S. Lee and K. Cho, "Design of transform and quantization circuit for multi-standard
integrated video decoder," IEEE Workshop on Signal Processing Systems, pp. 181-186,
2007.
[61] C.-P. Fan and G.-A. Su, "Efficient low-cost sharing design of fast 1-D inverse integer transform algorithms for H.264/AVC and VC-1," IEEE Signal Processing Letters, vol. 15, pp. 926-929, 2008.
[62] C.-P. Fan and G.-A. Su, "Efficient fast 1-D 8x8 inverse integer transform for VC-1 application," IEEE Transactions on Circuits and Systems for Video Technology, vol. 19,
pp. 584-590, April 2009.
[63] G.-A. Su and C.-P. Fan, "Cost effective hardware sharing architecture for fast 1D 8x8
forward and inverse integer transforms of H.264/AVC high profile," IEEE Asia Pacific
Conference on Circuits and Systems, pp. 1332-1335, 2008.
[64] S. Lee and K. Cho, "Design of high-performance transform and quantization circuit for
unified video codec," IEEE Asia Pacific Conference on Circuits and Systems, pp. 1450-
1453, 2008.
[65] R. Rithe, C. C. Cheng, and A. Chandrakasan, "Quad full-HD transform engine for dualstandard low-power video coding," IEEE Asian Solid-State Circuits Conference, pp. 401-
404, 2011.
[66] W.-H. Chen, C. Smith, and S. Fralick, "A fast computational algorithm for the discrete cosine transform," IEEE Transactions on Communications,vol. 25, pp. 1004-1009, Septem-
ber 1977.
[67] T. Xanthopoulos and A. P. Chandrakasan, "A low-power IDCT macrocell for MPEG-2
MP©ML exploiting data distribution properties for minimal activity," IEEE Journal of
Solid-State Circuits, vol. 34, pp. 693-703, May 1999.
[68] H. Fujiwara, K. Nii, H. Noguchi, J. Miyakoshi, Y. Murachi, Y. Morita, H. Kawaguchi, and
M. Yoshimoto, "Novel video memory reduces 45% of bitline power using majority logic
and data-bit reordering," IEEE Transactions on Very Large Scale Integration Systems,
vol. 16, pp. 620-627, June 2008.
[69] M. E. Sinangil and A. P. Chandrakasan, "Application-specific SRAM design using output
prediction to reduce bit-line switching activity and statistically gated sense amplifiers for
up to 1.9x lower energy/access," IEEE Journal of Solid-State Circuits, vol. 49, pp. 107-
117, January 2014.
[70] "ITU-T recommendation H.265 and ISO/IEC 23008-2: High Efficiency Video Coding," [online] http://www.itu.int/ITU-T/recommendations/rec.aspx?rec=11885, 2013.
[71] M. Budagavi, A. Fuldseth, G. Bjontegaard, V. Sze, and M. Sadafale, "Core transform design in the high efficiency video coding (HEVC) standard," IEEE Journal of Selected Topics in Signal Processing, vol. 7, pp. 1029-1041, December 2013.
[72] M. Tikekar, C.-T. Huang, C. Juvekar, V. Sze, and A. P. Chandrakasan, "A 249-Mpixel/s
HEVC video-decoder chip for 4k ultra-HD applications," IEEE Journal of Solid-State
Circuits, vol. 49, pp. 61-72, January 2014.
[73] K. J. Kuhn, "Reducing variation in advanced logic technologies: Approaches to process and design for manufacturability of nanoscale CMOS," IEEE International Electron Devices Meeting, pp. 471-474, 2007.
[74] L. Cheng, P. Gupta, C. Spanos, K. Qian, and L. He, "Physically justifiable die-level modeling of spatial variation in view of systematic across wafer variability," Design Automation
Conference, pp. 104-109, 2009.
[75] D. Blaauw, K. Chopra, A. Srivastava, and L. Scheffer, "Statistical timing analysis: From
basic principles to state of the art," IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, vol. 27, pp. 589-607, April 2008.
[76] A. Wang and A. Chandrakasan, "A 180-mV subthreshold FFT processor using a minimum
energy design methodology," IEEE Journal of Solid-State Circuits, vol. 40, pp. 310-319,
January 2005.
[77] Y. Cao and L. T. Clark, "Mapping statistical process variations toward circuit performance variability: An analytical modeling approach," ACM IEEE Design Automation
Conference, pp. 658-663, 2005.
[78] J. Kwong, Y. K. Ramadass, N. Verma, and A. P. Chandrakasan, "A 65 nm sub-Vt microcontroller with integrated SRAM and switched capacitor DC-DC converter," IEEE Journal of Solid-State Circuits, vol. 44, pp. 115-126, January 2009.
[79] H. Mahmoodi, S. Mukhapadhyay, and K. Roy, "Estimation of delay variations due to random-dopant fluctuations in nanoscale CMOS circuits," IEEE Journal of Solid-State Circuits, vol. 40, pp. 1787-1796, September 2005.
[80] S. Sundareswaran, J. A. Abraham, A. Ardelea, and R. Panda, "Characterization of standard cells for intra-cell mismatch variations," International Symposium on Quality Electronic Design, pp. 213-219, 2008.
[81] R. Rithe, S. Chao, J. Gu, A. Wang, S. Datla, G. Gammie, D. Buss, and A. Chandrakasan,
"The effect of random dopant fluctuations on logic timing at low voltage," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 20, pp. 911-924, May 2012.
[82] R. Rithe, "SSTA design methodology for low voltage operation," Master's thesis, Massachusetts Institute of Technology, 2010.
[83] C. Y. Huang, L. F. Chen, and Y. K. Lai, "A high-speed 2D transform architecture with
unique kernel for multi-standard video applications," IEEE InternationalSymposium on
Circuits and Systems, pp. 21-24, 2008.
[84] C. P. Fan, C. H. Fang, C. W. Chang, and S. J. Hsu, "Fast multiple inverse transforms with low-cost hardware sharing design for multistandard video decoding," IEEE Transactions on Circuits and Systems-II: Express Briefs, vol. 58, pp. 517-521, August 2011.
[85] K. Wang, J. Chen, W. Cao, Y. Wang, L. Wang, and J. Tong, "A reconfigurable multitransform VLSI architecture supporting video codec design," IEEE Transactions on Circuits and Systems-II: Express Briefs, vol. 58, pp. 432-436, July 2011.
[86] Y.-H. Chen, T.-Y. Chang, and C.-W. Lu, "A low-cost and high-throughput architecture
for H.264/AVC integer transform by using four computation streams," IEEE International
Symposium on Integrated Circuits, pp. 380-383, 2011.
[87] F. Durand and J. Dorsey, "Fast bilateral filtering for the display of high-dynamic-range
images," ACM Transactions on Graphics, vol. 21, pp. 257-266, July 2002.
[88] M. Brown and D. G. Lowe, "Automatic panoramic image stitching using invariant features," International Journal of Computer Vision, vol. 74, pp. 59-73, August 2007.
[89] A. Levin, Y. Weiss, F. Durand, and W. T. Freeman, "Efficient marginal likelihood optimization in blind deconvolution," IEEE Conference on Computer Vision and Pattern
Recognition, pp. 2657-2664, 2011.
[90] R. Ng, M. Levoy, M. Bredif, G. Duval, M. Horowitz, and P. Hanrahan, "Light-field photography with a handheld plenoptic camera," Stanford University Computer Science Tech
Report, April 2005.
[91] C. Tomasi and R. Manduchi, "Bilateral filtering for gray and color images," IEEE International Conference on Computer Vision, pp. 839-846, 1998.
[92] S. M. Smith and J. M. Brady, "SUSAN - a new approach to low level image processing,"
International Journal of Computer Vision, vol. 23, pp. 45-78, May 1997.
[93] P. Perona and J. Malik, "Scale-space and edge detection using anisotropic diffusion," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, pp. 629-639, July 1990.
[94] J. Tumblin and G. Turk, "LCIS: A boundary hierarchy for detail-preserving contrast reduction," ACM SIGGRAPH Conference, pp. 83-90, 1999.
[95] A. Levin, A. Rav-Acha, and D. Lischinski, "Spectral matting," IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 30, pp. 1699-1712, October 2008.
[96] D. Lischinski, Z. Farbman, M. Uyttendaele, and R. Szeliski, "Interactive local adjustment
of tonal values," ACM Transactions on Graphics, vol. 25, pp. 646-653, March 2006.
[97] N. Sochen, R. Kimmel, and A. M. Bruckstein, "Diffusions and confusions in signal and
image processing," Journal of Mathematical Imaging and Vision, vol. 14, pp. 237-244,
May 2001.
[98] M. Elad, "On the bilateral filter and ways to improve it," IEEE Transactions on Image Processing, vol. 11, pp. 1141-1151, October 2002.
[99] J. van de Weijer and R. van den Boomgaard, "On the equivalence of local-mode finding,
robust estimation and mean-shift analysis as used in early vision tasks," International
Conference on Pattern Recognition, pp. 927-930, 2002.
[100] D. Barash and D. Comaniciu, "A common framework for nonlinear diffusion, adaptive smoothing, bilateral filtering and mean shift," Image and Video Computing, vol. 22, pp. 73-81, January 2004.
[101] A. Buades, B. Coll, and J.-M. Morel, "Neighborhood filters and PDE's," Numerische
Mathematik, vol. 105, pp. 1-34, November 2006.
[102] P. Mrazek, J. Weickert, and A. Bruhn, "On robust estimation and smoothing with spatial and tonal kernels," Springer Geometric Properties for Incomplete Data, vol. 31, pp. 335-352, 2006.
[103] M. Aleksic, M. Smirnov, and S. Goma, "Novel bilateral filter approach: Image noise
reduction with sharpening," Proceedings of the SPIE, vol. 6069, pp. 141-147, May 2006.
[104] C. Liu, W. T. Freeman, R. Szeliski, and S. Kang, "Noise estimation from a single image,"
IEEE Computer Vision and Pattern Recognition Conference, pp. 901-908, 2006.
[105] S. Bae, S. Paris, and F. Durand, "Two-scale tone management for photographic look,"
ACM Transactions on Graphics, vol. 25, pp. 637-645, July 2006.
[106] M. Elad, "Retinex by two bilateral filters," Scale-Space Conference, pp. 217-229, July.
[107] E. Bennet and L. McMillan, "Video enhancement using per-pixel virtual exposures," ACM Transactions on Graphics, vol. 24, pp. 845-852, July 2005.
[108] H. Winnemoller, S. C. Olsen, and B. Gooch, "Real-time video abstraction," ACM Transactions on Graphics, vol. 25, pp. 1221-1226, August 2006.
[109] J. Xiao, H. Cheng, H. Awhney, C. Rao, and M. Isnardi, "Bilateral filtering based optical flow estimation with occlusion detection," European Conference on Computer Vision,
pp. 211-224, 2006.
[110] P. Sand and S. Teller, "Particle video: Long-range motion estimation using point trajectories," International Journal of Computer Vision, vol. 80, pp. 72-91, January 2008.
[111] E.-H. Woo, J.-H. Sohn, H. Kim, and H.-J. Yoo, "A 195 mW, 9.1 Mvertices/s fully programmable 3-D graphics processor for low-power mobile devices," IEEE Journal of Solid-State Circuits, vol. 43, pp. 2370-2380, July 2008.
[112] F. Sheikh, S. K. Mathew, M. A. Anders, H. Kaul, S. K. Hsu, A. Agarwal, R. K. Krishnamurthy, and S. Borkar, "A 2.05 Gvertices/s 151 mW lighting accelerator for 3D graphics
vertex and pixel shading in 32 nm CMOS," IEEE Journal of Solid-State Circuits, vol. 48,
pp. 128-139, January 2013.
[113] G. Wan, X. Li, G. Agranov, M. Levoy, and M. Horowitz, "CMOS image sensors with multi-bucket pixels for computational photography," IEEE Journal of Solid-State Circuits, vol. 47, pp. 1031-1042, April 2012.
[114] S. Sukegawa, T. Umebayashi, T. Nakajima, H. Kawanobe, K. Koseki, I. Hirota, T. Haruta, M. Kasai, K. Fukumoto, T. Wakano, K. Inoue, H. Takahashi, T. Nagano, Y. Nitta, T. Hirayama, and N. Fukushima, "A 1/4-inch 8Mpixel back-illuminated stacked CMOS image sensor," IEEE International Solid-State Circuits Conference, pp. 484-485, 2013.
[115] Y. Chen, Y. Xu, Y. Chae, A. Mierop, X. Wang, and A. Theuwissen, "A 0.7e-rms-temporal-readout-noise CMOS image sensor for low-light-level imaging," IEEE International Solid-State Circuits Conference, pp. 384-385, 2012.
[116] J. Chen, S. Paris, and F. Durand, "Real time edge-aware image processing with the bilateral grid," ACM Transactions on Graphics, vol. 26, July 2007.
[117] T. Q. Pham and L. J. V. Vliet, "Separable bilateral filtering for fast video preprocessing," IEEE International Conference on Multimedia and Expo, pp. 4-8, 2005.
[118] S. Paris and F. Durand, "A fast approximation of the bilateral filter using a signal processing approach," International Journal of Computer Vision, vol. 81, pp. 24-52, January 2009.
[119] B. Weiss, "Fast median and bilateral filtering," ACM Transactions on Graphics, vol. 25,
pp. 519-526, July 2006.
[120] A. Sinha, A. Wang, and A. Chandrakasan, "Energy scalable system design," IEEE Transactions on Very Large Scale Integration Systems, vol. 10, pp. 135-145, April 2002.
[121] P. E. Debevec and J. Malik, "Recovering high dynamic range radiance maps from photographs," ACM Conference on Computer Graphics and Interactive Techniques, pp. 369-
378, 1997.
[122] G. W. Larson, H. Rushmeier, and C. Piatko, "A visibility matching tone reproduction operator for high dynamic range scenes," IEEE Transactions on Visualization and Computer
Graphics, vol. 3, pp. 291-306, October 1997.
[123] J. DiCarlo and B. Wandell, "Rendering high dynamic range images," Proceedings of the
SPIE: Image Sensors, pp. 392-401, 2000.
[124] J. Cohen, C. Tchou, T. Hawkins, and P. Debevec, "Real-time high-dynamic range texture
mapping," Eurographics Workshop on Rendering, pp. 313-320, October 2001.
[125] D. J. Jobson, Z. U. Rahman, and G. A. Woodell, "A multi-scale retinex for bridging the
gap between color images and the human observation of scenes," IEEE Transactions on
Image Processing, vol. 6, pp. 965-976, July 1997.
[126] S. N. Pattanaik, J. A. Ferwerda, M. D. Fairchild, and D. P. Greenberg, "A multiscale
model of adaptation and spatial vision for realistic image display," ACM SIGGRAPH
Conference, pp. 287-298, 1997.
[127] J. J. McCann and A. Rizzi, "Veiling glare: The dynamic range limit of HDR images,"
Human Vision and Electronic Imaging XII, SPIE, vol. 6492, 2007.
[128] E. V. Talvala, A. Adams, M. Horowitz, and M. Levoy, "Veiling glare in high dynamic
range imaging," ACM Transactions on Graphics, vol. 26, July 2007.
[129] E. Reinhard, G. Ward, S. Pattanaik, and P. Debevec, "High dynamic range imaging: Acquisition, display and image-based lighting," Morgan Kaufmann Publishers, 2006.
[130] R. Raskar, A. Agrawal, C. A. Wilson, and A. Veeraraghavan, "Glare aware photography: 4D ray sampling for reducing glare effects of camera lenses," ACM Transactions on
Graphics, vol. 27, pp. 56:1-56:10, August 2008.
[131] J. M. DiCarlo, F. Xiao, and B. A. Wandell, "Illuminating illumination," Color Imaging
Conference, pp. 27-34, 2001.
[132] M. F. Cohen, A. Colburn, and S. Drucker, "Image stacks," MSR Technical Report, vol. 40,
July 2003.
[133] H. Hoppe and K. Toyama, "Continuous flash," MSR Technical Report, vol. 63, October
2003.
[134] K. Toyama and B. Schoelkopf, "Interactive images," MSR Technical Report, vol. 64, December 2003.
[135] P. Debevec, T. Hawkins, C. Tchou, H. Duiker, W. Sarokin, and M. Sagar, "Acquiring the
reflectance field of the human face," ACM SIGGRAPH Conference, pp. 145-156, 2000.
[136] V. Masselus, P. Dutre, and F. Anrys, "The free-form light stage," Eurographics Rendering
Symposium, pp. 247-256, 2002.
[137] D. Akers, F. Losasso, J. Klingner, M. Agrawala, J. Rick, and P. Hanrahan, "Conveying
shape and features with image-based relighting," IEEE Visualization, pp. 349-354, 2003.
[138] G. Petschnigg, M. Agrawala, H. Hoppe, R. Szeliski, M. Cohen, and K. Toyama, "Digital
photography with flash and no-flash image pairs," ACM Transactions on Graphics, vol. 23,
pp. 664-672, August 2004.
[139] E. Eisemann and F. Durand, "Flash photography enhancement via intrinsic relighting,"
ACM Transactions on Graphics, vol. 23, pp. 673-678, August 2004.
[140] B. M. Oh, M. Chen, J. Dorsey, and F. Durand, "Image-based modeling and photo editing,"
ACM SIGGRAPH Conference, 2001.
[141] A. Wang and A. P. Chandrakasan, "A 180mV FFT processor using subthreshold circuit
technologies," IEEE InternationalSolid State Circuits Conference, pp. 292-293, 2004.
[142] S. Sridhara, M. DiRenzo, S. Lingam, S.-J. Lee, R. Blazquez, J. Maxey, S. Ghanem, Y.-H.
Lee, R. Abdallah, P. Singh, and M. Goel, "Microwatt embedded processor platform for
medical system-on-chip applications," IEEE Symposium on VLSI Circuits, pp. 15-16, 2010.
[143] M. Qazi, M. E. Sinangil, and A. P. Chandrakasan, "Challenges and directions for low-voltage SRAM," IEEE Design & Test of Computers, vol. 28, pp. 32-43, January 2011.
[144] "Intel Atom processor Z2760," [online]
ddr2-sdram.
http://www.micron.com/products/dram/
[145] A. Adams, E. Talvala, S. H. Park, D. E. Jacobs, B. Ajdin, N. Gelfand, J. Dolson, D. Vaquero, J. Baek, M. Tico, H. P. A. Lensch, W. Matusik, K. Pulli, M. Horowitz, and
M. Levoy, "The Frankencamera: An experimental platform for computational photography," ACM Transactions on Graphics, vol. 29, pp. 29:1-29:12, July 2010.
[146] "Intel integrated performance primitives,"
[online] https://software.intel.com/en-us/intel-ipp.
[147] "The OpenMP api specification for parallel programming," [online] http: //openmp. org/.
[148] "DDR2 SDRAM system-power calculator," [online] http: //www. intel. com/content/
www/us/en/processors/atom/atom-z2760-datasheet.html.
[149] "Pandaboard: Open OMAP 4 mobile software development platform," [online]
//pandaboard. org/content/platform.
http:
"Halide:
[150] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe,
A language and compiler for optimizing parallelism, locality, and recomputation in image
and
processing pipelines," ACM SIGPLAN Conference on Programming Language Design
Implementation, pp. 519-530, 2013.
[151] C. C. Wang, F. L. Yuan, H. Chen, and D. Markovic, "A 1.1 GOPS/mW FPGA with hierarchical interconnect fabric," IEEE International Symposium on VLSI Circuits, pp. 136-137, 2011.
[152] C. C. Wang, F. L. Yuan, T. H. Yu, and D. Markovic, "A multi-granularity FPGA with hierarchical interconnects for efficient and flexible mobile computing," International Conference on Solid-State Circuits, pp. 460-461, 2014.
[153] R. J. Hay, N. E. Johns, H. C. Williams, I. W. Bolliger, R. P. Dellavalle, D. J. Margolis,
R. Marks, L. Naldi, M. A. Weinstock, S. K. Wulf, C. Michaud, C. J. L. Murray, and
M. Naghavi, "The global burden of skin disease in 2010: An analysis of the prevalence
and impact of skin conditions," Journal of Investigative Dermatology, November 2013.
[154] P. E. Grimes, "New insight and new therapies in vitiligo," The Journal of the American Medical Association, vol. 293, pp. 730-735, February 2005.
[155] A. Alikhan, L. M. Felsten, M. Daly, and V. Petronic-Rosic, "Vitiligo: a comprehensive overview part I. Introduction, epidemiology, quality of life, diagnosis, differential diagnosis, associations, histopathology, etiology, and work-up," Journal of the American Academy of Dermatology, vol. 65, pp. 473-491, September 2011.
[156] K. Ezzedine, H. W. Lim, T. Suzuki, I. Katayama, I. Hamzavi, C. C. Lan, B. K. Goh, T. Anbar, C. S. de Castro, A. Y. Lee, D. Prasad, N. V. Geel, I. C. L. Poole, N. Oiso, L. Benzekri, R. Spritz, Y. Gauthier, S. K. Hann, M. Picardo, and A. Taieb, "Revised classification/nomenclature of vitiligo and related issues: The vitiligo global issues consensus conference," Pigment Cell and Melanoma Research, vol. 25, pp. E1-13, May 2012.
[157] D. J. Gawkrodger, A. D. Ormerod, L. Shaw, I. Mauri-Sole, M. E. Whitton, M. J. Watts, A. V. Anstey, J. Ingham, and K. Young, "Guideline for the diagnosis and management of vitiligo," British Journal of Dermatology, vol. 159, pp. 1051-1076, November 2008.
[158] R. M. Halder and J. L. Chappell, "Vitiligo update," Seminars in Cutaneous Medicine and Surgery, vol. 28, pp. 86-92, June 2009.
[159] G. C. do Carmo and M. R. e Silva, "Dermoscopy: basic concepts," International Journal of Dermatology, vol. 47, pp. 712-719, July 2008.
[160] M. E. Vestergaard, P. Macaskill, P. E. Holt, and S. W. Menzies, "Dermoscopy compared with naked eye examination for the diagnosis of primary melanoma: a meta-analysis of studies performed in a clinical setting," The British Journal of Dermatology, vol. 159, pp. 669-676, September 2008.
"Skin surface mi[161] W. Stolz, P. Bilek, M. Landthaler, T. Merkle, and 0. Braun-Falco,
croscopy," The Lancet, vol. 334, pp. 864-865, October 1989.
"Dermoscopy
[162] R. P. Braun, H. S. Rabinovitz, M. Oliviero, A. W. Kopf, and J. H. Saurat,
vol. 52,
Dermatology,
of
of pigmented skin lesions," Journal of the American Academy
pp. 109-121, January 2005.
[163] "Dermlite," [online] http://dermlite.com/.
[164] U. Gonzalez, M. Whitton, V. Eleftheriadou, M. Pinart, J. Batchelor, and J. Leonardi-Bee,
"Guidelines for designing and reporting clinical trials in vitiligo," Archives of Dermatology,
vol. 147, pp. 1428-1436, December 2011.
[165] V. Eleftheriadou, K. S. Thomas, M. E. Whitton, J. M. Batchelor, and J. C. Ravenscroft, "Which outcomes should we measure in vitiligo? Results of a systematic review and a survey among patients and clinicians on outcomes in vitiligo trials," The British Journal of Dermatology, vol. 167, pp. 804-814, October 2012.
[166] C. Vrijman, M. L. Homan, J. Limpens, W. van der Veen, A. Wolkerstorfer, C. B. Terwee, and P. I. Spuls, "Measurement properties of outcome measures for vitiligo: A systematic review," Archives of Dermatology, vol. 17, pp. 1-8, September 2012.
[167] I. Hamzavi, H. Jain, D. McLean, J. Shapiro, H. Zeng, and H. Lui, "Parametric modeling of narrowband UV-B phototherapy for vitiligo using a novel quantitative tool: The vitiligo area scoring index," Archives of Dermatology, vol. 140, pp. 677-683, June 2004.
[168] T. S. Oh, O. Lee, J. E. Kim, S. W. Son, and C. H. Oh, "Quantitative method for measuring therapeutic efficacy of the 308 nm excimer laser for vitiligo," Skin Research and Technology, vol. 18, pp. 347-355, August 2012.
"Digital image
[169] M. W. L. Homan, A. Wolkerstorfer, M. A. Sprangers, and J. L. V. der Veen,
analysis vs. clinical assessment to evaluate repigmentation after punch grafting in vitiligo,"
Journal of the European Academy of Dermatology and Venereology, vol. 27, pp. 235-238,
February 2013.
[170] T. S. Cho, W. T. Freeman, and H. Tsao, "A reliable skin mole localization scheme," in International Conference on Computer Vision, pp. 1-8, IEEE, 2007.
[171] S. K. Madan, K. J. Dana, and O. G. Cula, "Quasiconvex alignment of multimodal skin images for quantitative dermatology," in Computer Vision and Pattern Recognition Workshops, pp. 117-124, IEEE, 2009.
[172] S. K. Madan, K. J. Dana, and O. Cula, "Learning-based detection of acne-like regions using time-lapse features," in Signal Processing in Medicine and Biology Symposium, pp. 1-6, IEEE, 2011.
[173] H. Wannous, Y. Lucas, and S. Treuillet, "Enhanced assessment of the wound-healing process by accurate multiview tissue classification," IEEE Transactions on Medical Imaging, vol. 30, pp. 315-326, February 2011.
[174] H. Nugroho, M. H. A. Fadzil, V. V. Yap, S. Norashikin, and H. H. Suraiya, "Determination of skin repigmentation progression," IEEE International Conference of the Engineering in Medicine and Biology Society, pp. 3442-3445, 2007.
[175] F. Peruch, F. Bogo, M. Bonazza, V. M. Cappelleri, and E. Peserico, "Simpler, faster,
more accurate melanocytic lesion segmentation through MEDS," IEEE Transactions on
Biomedical Engineering, vol. 61, pp. 557-565, February 2014.
[176] K. Korotkov and R. Garcia, "Computerized analysis of pigmented skin lesions: A review,"
Artificial Intelligence in Medicine, vol. 56, pp. 69-90, October 2012.
[177] R. J. Friedman, D. S. Rigel, and A. W. Kopf, "Early detection of malignant melanoma:
The role of physician examination and self-examination of the skin," CA: A Cancer Journal
for Clinicians, vol. 35, pp. 130-151, May 1985.
[178] J. Chen, J. Stanley, R. H. Moss, and W. V. Stoecker, "Color analysis of skin lesion regions
for melanoma discrimination in clinical images," Skin Research and Technology, vol. 9,
pp. 94-104, May 2003.
[179] C. Grana, G. Pellacani, and S. Seidenari, "Practical color calibration for dermoscopy,
applied to a digital epiluminescence microscope," Skin Research and Technology, vol. 11,
pp. 242-247, November 2005.
[180] H. Iyatomi, M. E. Celebi, G. Schaefer, and M. Tanaka, "Automated color normalization for dermoscopy images," in International Conference on Image Processing, pp. 4357-4360, IEEE, 2010.
[181] M. Styner, C. Brechbuhler, G. Szekely, and G. Gerig, "Parametric estimate of intensity inhomogeneities applied to MRI," IEEE Transactions on Medical Imaging, vol. 19, pp. 153-165, March 2000.
[182] J. Milles, Y. Zhu, G. Gimenez, C. Guttmann, and I. Magnin, "MRI intensity nonuniformity
correction using simultaneously spatial and gray-level histogram information," Journal of
Computerized Medical Imaging and Graphics, vol. 31, pp. 81-90, March 2007.
[183] U. Vovk, F. Pernus, and B. Likar, "Review of methods for correction of intensity inhomogeneity in MRI," IEEE Transactions on Medical Imaging, vol. 26, pp. 405-421, March
2007.
[184] K. Zhang, L. Zhang, and S. Zhang, "A variational multiphase level set approach to simultaneous segmentation and bias correction," in International Conference on Image Processing,
pp. 4105-4108, IEEE, 2010.
[185] C. Li, C. Xu, C. Gui, and M. D. Fox, "Distance regularized level set evolution and its
application to image segmentation," IEEE Transactions on Image Processing, vol. 19,
pp. 3243-3254, December 2010.
[186] C. Li, R. Huang, Z. Ding, C. Gatenby, D. N. Metaxas, and J. C. Gore, "A level set method
for image segmentation in the presence of intensity inhomogeneities with application to
MRI," IEEE Transactions on Image Processing,vol. 20, pp. 2007-2016, July 2011.
[187] M. A. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, pp. 381-395, June 1981.
[188] K. Mikolajczyk and C. Schmid, "A performance evaluation of local descriptors," IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 27, pp. 1615-1630, October 2005.
[189] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," IEEE Conference on Computer Vision and Pattern Recognition, pp. 524-531,
2005.
[190] A. Bosch, A. Zisserman, and X. Munoz, "Scene classification using a hybrid generative/discriminative approach," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, pp. 712-727, April 2008.
[191] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," IEEE Conference on Computer Vision and
Pattern Recognition, pp. 2169-2178, 2006.
[192] J. J. Kivinen, E. B. Sudderth, and M. Jordan, "Learning multiscale representations of natural scenes using Dirichlet processes," IEEE Conference on Computer Vision, pp. 1-8, 2007.
[193] K. Grauman and T. Darrell, "Efficient image matching with distributions of local invariant
features," IEEE Conference on Computer Vision and Pattern Recognition, pp. 627-634,
2005.
[194] A. Frome, Y. Singer, F. Sha, and J. Malik, "Learning globally-consistent local distance functions for shape-based image retrieval and classification," IEEE Conference on Computer Vision, pp. 1-8, 2007.
[195] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using
shape contexts," IEEE Transactionson PatternAnalysis and Machine Intelligence, vol. 24,
pp. 509-522, April 2002.
[196] J. Duchon, "Splines minimizing rotation-invariant semi-norms in Sobolev spaces," Constructive Theory of Functions of Several Variables, vol. 571, pp. 85-100, 1977.
[197] J. Meinguet, "Multivariate interpolation at arbitrary points made simple," Journal of
Applied Mathematics and Physics, vol. 5, pp. 439-468, 1979.
[198] "Intel power gadget," [online] https://software.intel.com/en-us/articles/intel-power-gadget-20.
[199] S. L. Jacques, J. C. Ramella-Roman, and K. Lee, "Imaging skin pathology with polarized light," Journal of Biomedical Optics, vol. 7, pp. 329-340, July 2002.
[200] M. H. Smith, P. Burke, A. Lompado, E. Tanner, and L. W. Hillman, "Mueller matrix imaging polarimetry in dermatology," Proceedings of SPIE, 2000.
[201] J. A. Muccini, N. Kollias, S. B. Phillips, R. R. Anderson, A. J. Sober, M. J. Stiller, and L. A. Drake, "Polarized light photography in the evaluation of photoaging," Journal of the American Academy of Dermatology, vol. 33, pp. 765-769, November 1995.
[202] R. Langley, M. Rajadhyaksha, P. Dwyer, A. Sober, T. Flotte, and R. R. Anderson, "Confocal scanning laser microscopy of benign and malignant melanocytic skin lesions in vivo," Journal of the American Academy of Dermatology, vol. 45, pp. 365-376, September 2001.
[203] D. Kapsokalyvas, N. Bruscino, D. Alfieri, V. de Giorgi, G. Cannarozzo, T. Lotti, and F. S.
Pavone, "Imaging of human skin lesions with the multispectral dermoscope," Proceedings
of the SPIE, 2010.
[204] R. W. Brodersen, "Low power design, past and future," 2014.
[205] E. J. Fluhr, J. Friedrich, D. Dreps, V. Zyuban, G. Still, C. Gonzalez, A. Hall, D. Hogenmiller, F. Malgioglio, R. Nett, J. Paredes, J. Pille, D. Plass, R. Puri, P. Restle, D. Shan,
K. Stawiasz, Z. T. Deniz, D. Wendel, and M. Ziegler, "Power8: A 12-core server-class
processor in 22nm SOI with 7.6Tb/s off-chip bandwidth," IEEE International Solid-State Circuits Conference, pp. 96-97, 2014.
[206] N. Kurd, M. Chowdhury, E. Burton, T. P. Thomas, C. Mozak, B. Boswell, M. Lal, A. Deval, J. Douglas, M. Elassal, A. Nalamalpu, T. M. Wilson, M. Merten, S. Chennupaty,
W. Gomes, and R. Kumar, "Haswell: A family of IA 22nm processors," IEEE International Solid-State Circuits Conference, pp. 112-113, 2014.
[207] A. Wang, P. R. Gill, and A. Molnar, "An angle-sensitive CMOS imager for single-sensor
3D photography," InternationalSolid-State Circuits Conference, pp. 412-413, 2011.
[208] W. Kim, Y. Wang, I. Ovsiannikov, S. H. Lee, Y. Park, C. Chung, and E. Fossum, "A
1.5 Mpixel RGBZ CMOS image sensor for simultaneous color and range image capture,"
International Solid-State Circuits Conference, pp. 392-393, 2012.
[209] L. T. Su, "Architecting the future through heterogeneous computing," International Solid-State Circuits Conference, pp. 8-11, 2011.
[210] C. Gentry, "Computing arbitrary functions of encrypted data," Communications of the
ACM, vol. 53, pp. 97-105, March 2010.