
Tensor Photography: Exploring Space of 4D
Modulations Inside Traditional Camera Designs
by
Kshitij Marwah
Bachelor of Technology, Department of Computer Science and
Engineering, Indian Institute of Technology, Delhi, India (2010)
Master of Technology, Department of Computer Science and
Engineering, Indian Institute of Technology, Delhi, India (2010)
Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning
in partial fulfillment of the requirements for the degree of
Master of Science in Media Arts and Sciences
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2013
© Massachusetts Institute of Technology 2013. All rights reserved.
Author: Signature redacted
Program in Media Arts and Sciences
August 9, 2013
Certified by: Signature redacted
Ramesh Raskar
Associate Professor
Program in Media Arts and Sciences
Thesis Supervisor
Accepted by: Signature redacted
Professor Patricia Maes
Associate Academic Head
Program in Media Arts and Sciences
Tensor Photography: Exploring Space of 4D Modulations
Inside Traditional Camera Designs
by
Kshitij Marwah
Submitted to the Program in Media Arts and Sciences,
School of Architecture and Planning
on August 9, 2013, in partial fulfillment of the
requirements for the degree of
Master of Science in Media Arts and Sciences
Abstract
Light field photography has gained significant research interest in the last two decades: today, commercial light field cameras are widely available, demonstrating capabilities such as post-capture refocus, 3D photography, and viewpoint changes. However, most traditional acquisition approaches either multiplex a low-resolution light field into a single sensor image or require multiple photographs to be taken to acquire a high-resolution light field. In this thesis, we design, implement, and analyze a new light field camera architecture that allows capture and reconstruction of higher-resolution light fields in a single shot. The proposed architecture comprises three key components: light field atoms as a sparse representation of natural light fields, an optical design that allows capture of optimized 2D light field projections, and robust sparse reconstruction methods to recover a 4D light field from a single coded 2D projection. In addition, we explore other applications including compressive focal stack reconstruction, light field compression, and denoising.
Thesis Supervisor: Ramesh Raskar
Title: Associate Professor
Program in Media Arts and Sciences
Tensor Photography: Exploring Space of 4D Modulations
Inside Traditional Camera Designs
by
Kshitij Marwah
Thesis Reader: Signature redacted
Edward Boyden
Associate Professor of Media Arts and Sciences
Program in Media Arts and Sciences
Thesis Reader: Signature redacted
Joseph Paradiso
Associate Professor of Media Arts and Sciences
Program in Media Arts and Sciences
Acknowledgments
This thesis would not have been possible without the tremendous support of my collaborators, friends, and family, who have patiently stuck with me. I would first like to thank Professor Ramesh Raskar, who saw something in me and brought me into the MIT Media Lab. I will always be indebted to him for the great opportunity he gave me to explore this area of research and life at the MIT Media Lab. My collaborators Gordon Wetzstein and Yosuke Bando have been instrumental in getting this work noticed and published. Gordon is one of the most rigorous, hardworking, systematic, and professional people I have come across. His grasp of detail is unparalleled, and I have learned tremendously from him. I have never met an engineer as amazing as Yosuke. Every piece of code or data he gave me in the course of the thesis was flawless, and discussions with him helped shape what we now have: a great camera design.
Technical Contributions. I would like to acknowledge Ramesh, who gave me the problem of high resolution light field photography to solve and let it be a part of my thesis. Gordon helped me understand light fields, camera designs, and the tiniest details of light field display and synthesis; I would especially like to acknowledge his contribution to the succinct writing and explanations in Chapters 3 and 4. Yosuke helped develop both hardware designs in Section 4.3 of Chapter 4. This camera would not have been a working prototype without him.
External Collaborators. I would like to thank Ashok Veeraraghavan, Kaushik Mitra, Guillermo Sapiro, Matthew O'Toole, Micha Feigin, Amit Aggarwal, and Rohit Pandharkar for various discussions and comments.
My family, mom, dad, and my sister, patiently waited for my phone calls back home while I was working to make this happen. They are the best family one could have asked for, and I love them. My partner Vartika Srivastava waited for two years in India for me to finish this work. I don't know how much of this thesis she will understand, but if not for the strength she gave me, this work wouldn't have happened. My friends at the Media Lab, Prashant Patil, Anirudh Sharma, Austin Lee, Misha Sra, Andrea Colaco, Anette Von Kapri, Nan Zhao, Aydin Arpa, Ayush Bhandari, Belen Masia, and many many more, made my stay here enriching and wonderful. My friends back in India, Jyoti Johar, Ridhima Sodhi, Aanchal Jain, Ekta Jaiswal, Rahul Gupta, and so many more, with whom I have spent innumerable beautiful moments.
Contents

1 Introduction
  1.1 Light Field Photography
  1.2 Spatio-Angular Resolution Trade-Off in Light Field Camera Designs
  1.3 Contributions
  1.4 Organizational Overview

2 Prior Art
  2.1 Light Field Acquisition
    2.1.1 Hundred Years Ago
    2.1.2 Hand-held Light Field Photography
    2.1.3 Dappled Photography
    2.1.4 Focused Plenoptic Camera
    2.1.5 Camera Arrays and Gantry
  2.2 Compressive Sensing and Sparse Representations
  2.3 Compressive Computational Photography
    2.3.1 Single Pixel Camera
    2.3.2 Compressive Video Acquisition
    2.3.3 Compressive Light Field Imaging (Simulation)

3 Sparse Coding and Representations
  3.1 Primer on Sparse Coding
    3.1.1 Approximate Solutions
    3.1.2 The Analysis Problem
  3.2 Overview of Dictionary Learning Methods
    3.2.1 Light Field Atoms
    3.2.2 Generating "Good" Training Sets with Coresets
  3.3 Evaluating System Design Parameters
    3.3.1 Dictionary Design Parameters
    3.3.2 Sparse Reconstruction Parameters
    3.3.3 Evaluating Optimal Projection Matrices and Mask Patterns

4 Compressive Light Field Photography
  4.1 Problem Formulation
    4.1.1 Acquiring Coded Light Field Projections
    4.1.2 Reconstructing Light Fields from Projections
    4.1.3 Learning Light Field Atoms
  4.2 Analysis
    4.2.1 Interpreting Light Field Atoms
    4.2.2 What are Good Modulation Patterns?
    4.2.3 Are More Shots Better?
    4.2.4 Evaluating Depth of Field
    4.2.5 Comparing Computational Light Field Cameras
  4.3 Implementation
    4.3.1 Primary Hardware Design
    4.3.2 Secondary Hardware Design
    4.3.3 Software
  4.4 Results

5 Additional Applications
  5.1 "Undappling" Images with Coupled Dictionaries
  5.2 Light Field Compression
  5.3 Light Field Denoising
  5.4 Compressive Focal Stacks

6 Discussion and Conclusion
  6.1 Benefits and Limitations
  6.2 Future Work
List of Figures

1-1 A conventional camera sums over all the individual rays, or alternatively all the individual views as recorded by different points on the aperture, to generate the recorded image.

1-2 A light field camera works by placing a refractive or non-refractive element between the lens and the sensor (a micro lens array in this case) to record this spatio-angular sampling of rays on the image sensor. This can also be thought of as a two-dimensional array of two-dimensional images, each from a different view point on the aperture.

1-3 Effects such as post-capture refocusing can be shown via capturing the 4D light field.

2-1 The first light field camera design to sample angular information on a 2D film using a micro lens array. This was made by Lippmann in 1908 [37].

2-2 A main lens is added to Lippmann's design that is directly focused on the micro-lens array. The micro-lens refracts the incoming light based on incident angle to allow for the 4D sampling of light rays on the 2D sensor [44].

2-3 A cosine mask is placed in front of the sensor to modulate the light field in the frequency domain. The mask heterodynes the incoming light field, which can be tiled back together and inverted to recover the 4D radiance function [54].

2-4 In this design the main lens is not focused on the micro-lens but at a plane just before the micro-lens. This allows a flexible trade-off between angular and spatial resolution, though the product still remains equal to the sensor resolution [38].

2-5 One of the first light field acquisition devices, formed by carefully aligning and placing multiple cameras. Effects such as looking through bushes and synthetic aperture photography can be implemented effectively with this bulky and expensive design [58].

2-6 This gantry contains a rotating camera for capturing light fields by recording images at various view points. Image based rendering approaches can then be used to interpolate between the view points [35].

2-7 A dictionary learned from a number of natural image patches. This dictionary captures essential edges and features that contribute to image formation. These "atoms" are different from traditionally known bases such as Fourier, DCT, or wavelets.

2-8 RICE's single pixel camera is the first physical prototype based on compressive sensing. It allows reconstruction of a 2D image by taking a series of measurements on a single photodetector, using $\ell_1$-minimization techniques [18].

2-9 In this design an LCoS based system is used to control per pixel exposure, implementing a mask pattern that modulates and sums over frames during capture. Dictionaries are learned from training videos, as shown on the left, to reconstruct video from the exposure in a single shot [27].

2-10 Conceptually, two compressive light field architectures have been proposed. The one on the left multiplexes by putting a random code on the aperture, while the one on the right places a coded lenslet array between the sensor and the lens. Both of these architectures require multiple shots to be taken for high resolution light field reconstruction [5].

3-1 Evaluating various approaches to dictionary learning and choosing training sets. The two most important evaluation criteria are speed (and memory footprint) of the learning stage as well as quality of the learned light field atoms. The former is indicated by timings, whereas the latter is indicated by the PSNR and visual quality of reconstructions from the same measurements (2D light field projected onto a 1D sensor image with a random modulation mask). We evaluate the K-SVD algorithm and its nonnegative variant and online sparse coding as implemented in the SPAMS software package. K-SVD is a very slow algorithm that limits practical application to small training sets containing 27,000 training patches (top left); coresets are tested to reduce a much larger training set with 1,800,000 patches to the same, smaller size (27,000 patches, top center). In all cases, K-SVD learns high-quality light field atoms that capture different spatial features in the training set, sheared with different amounts in the spatio-angular domain. Unfortunately, K-SVD becomes too slow for practical application with large high-dimensional training sets; nevertheless, we use it as a benchmark in these flatland experiments. Online sparse coding (bottom row) is much faster and capable of handling large training sets. Unfortunately, the learned atoms are very noisy and lead to low-quality reconstructions (bottom left and center). Using coresets, much of the redundancy in large training sets can be removed prior to learning the atoms (lower right), thereby removing noise in the atoms and achieving the highest-quality reconstructions in the least amount of time.

3-2 Evaluating dictionary overcompleteness. The color-coded visualizations of dictionaries (top row) and the histograms (center row) illustrate the intrinsic dimension of these dictionaries. For more than 2x overcomplete dictionaries, most of the atoms are rarely used to adequately represent the training set. Reconstructions of a 2D light field that was not in the training set using the respective dictionaries (bottom row) show that the quality (PSNR) is best for 1-2x overcomplete dictionaries and drops below and above. Dictionaries with 0.1-0.5x overcompleteness do not perform well, because they simply do not contain enough atoms to sparsely represent the test light field. On the other hand, excessive overcompleteness (larger d) does not improve sparsity (smaller k) further, and k log(d/k) begins to increase, leaving the number of measurements insufficient.

3-3 Evaluating light field atom sizes. A synthetic light field (lower left) is projected onto a sensor image with a random modulation mask and reconstructed with dictionaries comprised of varying atom sizes. The angular resolution of light field and atoms is 3 x 3, but the spatial resolution of the atoms ranges from 9 x 9 to 21 x 21. Dictionaries for all experiments are 5x overcomplete. We observe best reconstruction quality for an atom size of 11 x 11. If chosen too small, the spatio-angular shears of objects at a distance to the focal plane will not be adequately captured by the light field atoms. Furthermore, the ratio between the number of measurements and number of unknowns is too low for a robust reconstruction with sparse coding approaches. For increasing atom resolutions, that ratio becomes more favorable for reconstruction; however, with an increasing spatial extent of an atom it also becomes increasingly difficult to represent the underlying light fields sparsely, leading to lower-quality reconstructions.

3-4 Evaluating sparse reconstruction methods. We reconstruct a 2D light field (top row) using the same dictionary and simulated sensor image with four different methods of sparse coding (rows 2-6) and different window merging functions. Although fast, orthogonal matching pursuit (OMP, second row) only achieves a low reconstruction quality. Reweighted $\ell_1$, as implemented in the NESTA package, yields high-quality reconstructions but takes about 15 times as long as OMP (third row). Basis pursuit denoise, as implemented in the SPGL1 solver package, is also high-quality but almost 400 times slower than OMP (fourth row). Due to its immense time requirements, this implementation is not suitable for high-resolution 4D light field reconstructions. The homotopy approach to solving the basis pursuit denoise problem (fifth row) is even faster than OMP and results in a reconstruction quality that is comparable with the best, but much slower, SPGL1 implementation.

3-5 Evaluating convergence for all algorithms using a single patch. Please note that OMP is a greedy algorithm; while all other algorithms minimize the $\ell_1$-norm of coefficients, OMP increments the number of used coefficients by one in each iteration.

3-6 Evaluating dictionary learning at various resolutions. We learn two dictionaries from a training set consisting of black and white text at resolutions of 128 x 128 and 64 x 64 at different depths from a simulated camera aperture of 0.25 cm. Reconstructions of a resolution chart placed at the same depths but with different resolutions are then performed. As shown in the plot, the PSNRs do not vary significantly for dictionaries learned at 64 x 64 and reconstruction at 128 x 128, and vice versa. This can be attributed to our patch-by-patch reconstruction that captures features at a resolution of 9 x 9 in this case. At such a patch size, features are generally scale-invariant. Moreover, for the case of resolution charts, these features are generally edges (spatial or angular) that do not depend on resolutions.

3-7 Evaluating optimal mask patterns. We evaluate three kinds of mask patterns for both sliding and distinct patch reconstructions for a flatland 2D light field. A light field is divided into patches of resolution 20 (space) x 5 (angle) and is projected down with the corresponding mask to a 20 pixel coded sensor image. As seen in the figure, given a dictionary, an optimized mask pattern performs much better than a traditional random mask. Jointly optimizing for both mask and dictionary performs slightly better than the optimized mask, albeit at much higher computational times. The coherence value $\mu$ as shown in the figure decreases as we optimize for both the mask and the dictionary.

4-1 Illustration of ray optics, light field modulation through coded attenuation masks, and corresponding projection matrix. The proposed optical setup comprises a conventional camera with a coded attenuation mask mounted at a slight offset in front of the sensor (left). This mask optically modulates the light field (center) before it is projected onto the sensor. The coded projection operator is expressed as a sparse matrix $\Phi$, here illustrated for a 2D light field with three views projected onto a 1D sensor (right).

4-2 Visualization of light field atoms captured in an overcomplete dictionary. Light field atoms are the essential building blocks of natural light fields; most light fields can be represented by the weighted sum of very few atoms. We show that light field atoms are crucial for robust light field reconstruction from coded projections and useful for many other applications, such as 4D light field compression and denoising.

4-3 Compressibility of a 4D light field in various high-dimensional bases. As compared to popular basis representations, the proposed light field atoms provide better compression quality for natural light fields (plots, second row). Edges and junctions are faithfully captured (third row); for the purpose of 4D light field reconstruction from a single coded 2D projection, the proposed dictionaries combined with sparse coding techniques perform best in this experiment (bottom row).

4-4 Evaluating optical modulation codes and multiple shot acquisition. We simulate light field reconstructions from coded projections for one, two, and five captured camera images. One tile of the corresponding mask patterns is shown in the insets. For all optical codes, an increasing number of shots increases the number of measurements, hence reconstruction quality. Nevertheless, optimized mask patterns facilitate single-shot reconstructions with a quality that other patterns can only achieve with multiple shots.

4-5 Evaluating depth of field. As opposed to lenslet arrays, the proposed approach preserves most of the image resolution at the focal plane. Reconstruction quality, however, decreases with distance to the focal plane. Central views are shown (on focal plane) for full-resolution light field, lenslet acquisition, and compressive reconstruction; compressive reconstructions are also shown for two other distances. The three plots evaluate reconstruction quality for varying aperture diameters with a dictionary learned from data corresponding to the blue plot (aperture diameter 0.25 cm).

4-6 Illustration of different optical light field camera setups with a quantitative value $\mu$ for the expected reconstruction quality (lower value is better). While lenslet arrays have the best light transmission T (higher value is better), reconstructions are expected to be of lower quality. Masks coded with random or optimized patterns perform best of all systems with 50% or more transmission. Two masks are expected to perform slightly better with our reconstruction, but at the cost of reduced light efficiency.

4-7 Prototype light field camera. We implement an optical relay system that emulates a spatial light modulator (SLM) being mounted at a slight offset in front of the sensor (right inset). We employ a reflective LCoS as the SLM (lower left insets).

4-8 Pinhole array mask and captured images. For conciseness, only the region covering 12 x 10 pinholes is shown here. (a) Pinhole array displayed on the LCoS. (b) Image sensor recording of the LCoS pinhole array mask for a white cardboard scene. (c) Image sensor recording of the LCoS pinhole array mask for a newspaper scene. (d) Normalized image, i.e., (c) divided by (b). (e) Image sensor recording of the LCoS pinhole array mask for a white cardboard scene where each pinhole has a corresponding attenuation value according to the mask pattern.

4-9 Calibrated projection matrix. Left: visualization of the matrix as a multi-view image. Right: magnified crops from the two views.

4-10 Printed transparency-based prototype camera (a) and calibration setups (b,c).

4-11 Light field reconstruction from an image modulated by a static film mask. (a) Captured image. (b) One view from the calibrated projection matrix $\Phi$. (c,d) Two views from the reconstructed light field. Yellow lines are superimposed to show the parallax. (e) All of the 5 x 5 views from the reconstructed light field.

4-12 An illustration of placement of the optimal or random mask for two commonly used camera designs. For traditional DSLRs with a lens of focal length about 50 mm, the mask needs to be placed about 1.6 mm away from the sensor to capture 5 x 5 angular views. For a mobile phone camera this distance reduces to about 160 microns, due to reduced focal length.

4-13 Training light fields and central views for physical experiments.

4-14 Training light fields and central views for simulated experiments.

4-15 Light field reconstruction from a single coded 2D projection. The scene is composed of diffuse objects at different depths; processing the 4D light field allows for post-capture refocus.

4-16 Reconstruction of a partly-occluded scene. Two views of a light field reconstructed from a single camera image. Areas occluded by high-frequency structures can be recovered by the proposed methods, as seen in the close-ups.

4-17 Light field reconstructions of an animated scene. We capture a coded sensor image for multiple frames of a rotating carousel (left) and reconstruct 4D light fields for each of them. The techniques explored in this thesis allow for higher-resolution light field acquisition than previous single-shot approaches.

5-1 "Undappling" a mask-modulated sensor image (left). The known projection of the mask pattern can be divided out; remaining noise patterns in out-of-focus regions are further reduced using a coupled dictionary method (right).

5-2 Light field compression. A light field is divided into small 4D patches and represented by only few coefficients. Light field atoms achieve a higher image quality than DCT coefficients.

5-3 Light field denoising. Sparse coding and the proposed 4D dictionaries can remove noise from 4D light fields.

5-4 Focal stack setup. The focus mechanism of an SLR camera lens is intercepted and controlled by an Arduino board. Synchronized with a remote shutter, this allows us to capture high-quality focal stacks.

5-5 Focal stack training set. Central focal slices for each scene are shown along with downsampled versions of all the slices.

5-6 Compressive focal stack results. A single coded projection is simulated (lower left) and used to recover all six slices (three of them are shown in the bottom row) of the original focal stack (top row). While the focal stack is successfully recovered, we observe a slight loss of image sharpness for in-focus image regions. This could be overcome with optimized projections (as demonstrated for 4D light fields in the primary text) or by acquiring multiple shots.

6-1 We compare our mask-based light field camera design with existing commercial technologies, i.e., micro lens based approaches and camera arrays. This comparison can help commercial camera makers choose the right technology for their specific purposes.
Chapter 1
Introduction
Since the invention of the first cameras, photographers have been striving to capture moments on film. Traditional cameras, be it film or digital, do not capture the individual rays of light that contribute to a moment. Instead, each pixel or film grain captures the sum of all rays over different angles contributing to a point in the scene. This leads to a loss of angular information that is critical for inferring view point changes and depth. To circumvent this problem, light field cameras were introduced to capture both spatial and angular information by multiplexing the incoming 4D light field onto a 2D image sensor. These cameras have defined a new era in camera technology, allowing consumers to easily capture, edit, and share moments. Commercial products such as those by LYTRO and Raytrix exist in the market, facilitating novel user experiences such as digital refocusing and 3D imaging.

Unfortunately, the technological foundations for these cameras are a century old and have not fundamentally changed since that time. Most currently available devices trade spatial resolution for the ability to capture different views of a light field, oftentimes reducing the final resolution by orders of magnitude. This trend directly counteracts the increasing resolution demands of the industry, with the race for megapixels being the most significant driving factor behind camera technology in the last decade.
In this dissertation, we introduce a new mathematical and computational framework, combined with a novel optical coding strategy, to overcome these fundamental limitations of light field acquisition systems. The price we pay for recording high resolution light fields on a traditional 2D image sensor is increased computational processing. The goal of this thesis is to show how to transfer the fundamental optical limits into computational processing, a price fair enough to create high resolution light field acquisition systems.

Figure 1-1: A conventional camera sums over all the individual rays, or alternatively all the individual views as recorded by different points on the aperture, to generate the recorded image.

Figure 1-2: A light field camera works by placing a refractive or non-refractive element between the lens and the sensor (a micro lens array in this case) to record this spatio-angular sampling of rays on the image sensor. This can also be thought of as a two-dimensional array of two-dimensional images, each from a different view point on the aperture.
1.1 Light Field Photography
As shown in Figure 1-1, a conventional camera takes in a 3D scene and sums over all the incoming angles to generate the final image. A light field camera, on the other hand, places either a refractive or non-refractive element between the lens and the sensor to disambiguate this angular information in spatial pixels. Figure 1-2 shows a light field camera design with the main lens focused at a micro lens array. As rays hit the micro lens system, they get refracted based on the angle of incidence, thereby hitting the sensor pixels at different locations. This implicitly records the incoming angular information onto spatial pixels. The recorded light field is a 4D function and can be thought of as a 2D array of 2D images, each with a different view point over the aperture. It is a simplification of the full plenoptic function in free space, which records the space of all light rays emanating from a scene, including position, angle, wavelength, and time. Once this 4D radiance function is recorded, one can either average all the views to get a conventional image or shift and add the views to demonstrate refocusing at different focal planes [44], as shown in Figure 1-3. Furthermore, the captured light field can also be used for depth map generation and for generating 3D content for new glasses-free light field displays [56].

Figure 1-3: Effects such as post-capture refocusing can be shown via capturing the 4D light field.
1.2 Spatio-Angular Resolution Trade-Off in Light Field Camera Designs
Though there are immense applications of light field camera designs, the price one has to pay is a permanent trade-off between spatial and angular resolution. As seen in Figure 1-2, the spatial resolution of each view is reduced by a factor of the number of angular samples. Mathematically, for any view or refocused image, the following equation always holds:

$$ S = p_x \times p_y \times s_x \times s_y \qquad (1.1) $$

where $S$ is the total sensor resolution, $p_x$ and $p_y$ are the number of angular views in the x and y dimensions respectively, and $s_x$ and $s_y$ are the resolution of each view. This drop in resolution can go up to about 100 times, depending on the number of angular views that need to be sampled. In commercial cameras like LYTRO, a high-resolution 20 MP sensor is converted into a measly 0.1 MP light field for refocusing. It turns out that as long as one operates in the well-known Shannon-Nyquist regime of signal processing, this equation will always hold true.
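As a quick sanity check of Equation 1.1, the following back-of-the-envelope sketch plugs in illustrative numbers; the 14 x 14 angular sampling is an assumed value chosen only to reproduce the rough 20 MP to 0.1 MP drop quoted above, not a specification of any particular camera.

```python
# Spatio-angular trade-off of Equation 1.1: S = p_x * p_y * s_x * s_y.
sensor_pixels = 20e6            # total sensor resolution S (20 MP)
p_x, p_y = 14, 14               # angular views (assumed, for illustration)

per_view_pixels = sensor_pixels / (p_x * p_y)   # s_x * s_y
print(f"per-view spatial resolution: {per_view_pixels / 1e6:.2f} MP")
# -> ~0.10 MP per view: a roughly 200x drop from the raw sensor
```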
Recent advances in the theory of sampling and reconstruction of sparse signals [10] have shown that one can go beyond the Shannon-Nyquist sampling rate if the signal is known to be sparse or compressible in a certain domain. These domains or bases can be traditionally known orthonormal basis functions such as Fourier, DCT, and PCA, or redundant dictionaries. These coherent and overcomplete dictionaries learn essential features from a series of training data, adapting to the signal under consideration. This generally allows for a data-aware representation, in contrast to the traditional known bases, which are signal agnostic.

We build on these recent theories and frameworks to propose, develop, and evaluate a new light field camera design that overcomes these traditional limits of spatio-angular resolution trade-offs. We show, with practical results, a working prototype of a light field camera with enhanced resolution, using priors captured in these overcomplete dictionaries.
1.3 Contributions
We explore compressive light field photography with a novel algorithmic framework and optical co-design strategies to overcome traditional resolution limits. In particular, in this thesis, we make the following contributions:

- We propose compressive light field photography as a system combining optically-coded light field projections and nonlinear computational reconstructions that utilize overcomplete dictionaries as a sparse representation of natural light fields.

- We introduce light field atoms as the essential building blocks of natural light fields; these atoms are not only useful for high-resolution light field reconstruction from coded projections but also for compressing and denoising 4D light fields.

- We analyze existing compressive light field cameras and evaluate sparse representations for such high-dimensional signals. We demonstrate that the proposed atoms combined with optimized optical codes allow for light field reconstruction from a single photograph.

- We build a prototype compressive light field camera and demonstrate successful recovery of partially-occluded environments, refractions, reflections, and animated scenes.
1.4 Organizational Overview

- Chapter 2 includes a survey of related work in the fields of computational photography, compressive sensing, and light field capture.

- Chapter 3 provides a primer on sparse coding and representation strategies, with a rigorous evaluation of various design parameters for dictionary learning and sparsity-exploiting reconstructions.

- Chapter 4 demonstrates a practical light field camera design derived from these evaluations, including results and reconstructions for natural light field scenarios from a single coded image.

- Chapter 5 showcases other applications of overcomplete dictionaries, including compressive focal stack acquisition, compression, and denoising.

- Chapter 6 discusses future directions towards building the ultimate camera system that can capture the full plenoptic function, including the parameters of time and wavelength.
Chapter 2
Prior Art
2.1 Light Field Acquisition

2.1.1 Hundred Years Ago
The art of capturing light fields started in 1902, when Frederick Ives [28] patented the idea of a parallax stereogram camera. He placed a pinhole mask near the sensor of the camera to sample the angular content of the light field. The most significant contribution, however, was made by Gabriel Lippmann [37], who placed a micro lens array in front of the film, as shown in Figure 2-1.

Figure 2-1: The first light field camera design to sample angular information on a 2D film using a micro lens array. This was made by Lippmann in 1908 [37].
Figure 2-2: A main lens is added to Lippmann's design that is directly focused on the micro-lens array. The micro-lens refracts the incoming light based on incident angle to allow for the 4D sampling of light rays on the 2D sensor [44].

2.1.2 Hand-held Light Field Photography
In 2005, researchers at Stanford added a main lens to Lippmann's design, with its focus on the micro lens array [44]. As shown in Figure 2-2, the light field incident on the micro lens is refracted based on the incidence angle to record angular samples on the 2D image sensor. Recently, lenslet-based systems have been integrated into digital cameras [1, 44]; consumer products are now widely available. Since spatial pixels are used to record angular samples, a trade-off in resolution occurs, as established in the previous chapter. This permanently reduces the final resolution of the captured image by a factor of the number of angular samples.

2.1.3 Dappled Photography
In 2007, non-refractive light-modulating codes or masks were introduced to capture light fields with better light efficiency than pinhole arrays [54, 31, 55]. These codes optically heterodyne the incoming light field, creating copies in the Fourier domain. The distance of the mask to the sensor is computed so as to avoid aliasing in the frequency domain. By taking an inverse Fourier transform of the recorded image and aligning the angular samples properly, the light field can be reconstructed. It is assumed that the light field is band-limited for these designs. Figure 2-3 shows a couple of designs for mask-based capture of light fields that are more light-efficient than traditional pinhole arrays. Nevertheless, all of these approaches sacrifice image resolution: the number of sensor pixels is the upper limit on the number of light rays captured. Other techniques have been proposed to correct lens aberrations by creating spatially optimal filters [45]. This technique combines the optical heterodyning approaches discussed above with locally optimal mask patterns that turn the deblurring problem into a well-posed one, allowing effective solutions.

Figure 2-3: A cosine mask is placed in front of the sensor to modulate the light field in the frequency domain. The mask heterodynes the incoming light field, which can be tiled back together and inverted to recover the 4D radiance function [54].
2.1.4 Focused Plenoptic Camera

Within these limits of resolution trade-offs [24, 33], alternative designs have been proposed that favor spatial resolution over angular resolution [38], as shown in Figure 2-4. By slightly changing the optical design such that the main lens is focused not on the micro lens array but at a plane before it, one can allow a flexible trade-off between spatial and angular resolution. The product between the two still remains equal to the number of sensor pixels. A number of techniques have been developed to compute high resolution images using super resolution on the 4D light field. These techniques rely on pixel shifts between different views and often reconstruct a high resolution focal stack from the given light field. In these approaches, one can think of a low resolution refocused image as a high resolution image blurred by a depth-dependent kernel. Since depth is known from the micro lens array alignment and distance, we can invert the problem (as in super resolution) to get a high resolution refocused image.

Figure 2-4: In this design the main lens is not focused on the micro-lens but at a plane just before the micro-lens. This allows a flexible trade-off between angular and spatial resolution, though the product still remains equal to the sensor resolution [38].

Figure 2-5: One of the first light field acquisition devices, formed by carefully aligning and placing multiple cameras. Effects such as looking through bushes and synthetic aperture photography can be implemented effectively with this bulky and expensive design [58].

2.1.5 Camera Arrays and Gantry
In order to fully preserve image resolution, current options include either camera arrays [58] or taking multiple photographs with a single camera [35, 25, 36]. Camera arrays, as shown in Figure 2-5, are usually bulky and expensive, and need to be carefully calibrated, aligned, and fired to record high resolution light fields. The alternative is to use time-sequential approaches, in which one can either modulate with a mask over the aperture and take multiple shots, or capture photos with a rotating gantry as shown in Figure 2-6. These approaches do not work for dynamic scenes.

Figure 2-6: This gantry contains a rotating camera for capturing light fields by recording images at various view points. Image based rendering approaches can then be used to interpolate between the view points [35].
2.2 Compressive Sensing and Sparse Representations
It is traditionally known that, to faithfully reconstruct a signal from its measurements, the number of samples has to be greater than twice the bandwidth of the signal (Shannon-Nyquist theorem). In recent years, this assumption has been challenged for the case of sparse or compressible signals [10]. The theory of compressive sensing shows that a k-sparse signal requires significantly fewer measurements for exact recovery, provided it is sampled in a way that preserves signal energy. More specifically, given a measurement vector $i \in \mathbb{R}^m$ that contains a coded down-projection of a k-sparse signal $l \in \mathbb{R}^n$, we wish to recover a set of sparse coefficients $\alpha \in \mathbb{R}^d$ that form an accurate representation of the signal in some basis or dictionary $\Psi \in \mathbb{R}^{n \times d}$:

$$ i = \Phi l = \Phi \Psi \alpha \qquad (2.1) $$

It is shown that if the number of measurements satisfies $m \geq \mathrm{const} \cdot k \log(d/k)$ and the sensing matrix $\Phi$ satisfies the restricted isometry property (RIP) [11], exact recovery is guaranteed. For general compressive sensing frameworks, $\Phi$ is taken to be a random Gaussian or Bernoulli matrix; RIP guarantees for these matrices are known and proven [11].

Figure 2-7: A dictionary learned from a number of natural image patches. This dictionary captures essential edges and features that contribute to image formation. These "atoms" are different from traditionally known bases such as Fourier, DCT, or wavelets.
The problem of finding the right basis that represents a class of signals sparsely is still an open and active area of research. Generally available transformations such as Fourier, DCT, or wavelets are signal agnostic and are known to represent a wide variety of signals with few coefficients. Recently, the field of dictionary learning has emerged, which tries to learn essential features from a wide variety of training data by solving a convex optimization problem [3]. This general problem of learning an overcomplete dictionary $\Psi$ that sparsifies a class of signals can be expressed as

$$ \underset{\{\Psi, A\}}{\text{minimize}} \; \| L - \Psi A \|_F \quad \text{subject to} \quad \forall i, \; \| a_i \|_0 \leq k $$

Here, $\Psi \in \mathbb{R}^{n \times d}$ is the overcomplete dictionary, and $L \in \mathbb{R}^{n \times q}$ is the training set that contains q signals or patches that represent the desired class of signals well. An algorithm tackling this problem will not only compute $\Psi$ but also the sparse coefficients $A \in \mathbb{R}^{d \times q}$ that approximate the training set. Figure 2-7 shows an overcomplete dictionary learned from millions of image patches. As seen, these newly learned basis elements capture essential features of natural image formation, such as edges at different orientations. Any natural image can be thought of as a superposition of these dictionary elements with varied weights.
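For intuition, here is a minimal sketch of this dictionary learning step using scikit-learn's MiniBatchDictionaryLearning; the random training patches, dictionary size, and sparsity weight are all placeholder choices for illustration, not the pipeline used in this thesis (which relies on K-SVD and SPAMS, see Chapter 3).

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# Stand-in training set L: q = 10,000 vectorized 8x8 patches (random here;
# a real run would extract patches from natural images or light fields).
rng = np.random.default_rng(0)
patches = rng.standard_normal((10000, 8 * 8))
patches -= patches.mean(axis=1, keepdims=True)   # remove DC per patch

# Learn a 2x overcomplete dictionary: d = 128 atoms for n = 64 pixels.
learner = MiniBatchDictionaryLearning(n_components=128, alpha=1.0,
                                      batch_size=256, random_state=0)
learner.fit(patches)

psi = learner.components_.T             # dictionary Psi, one atom per column
codes = learner.transform(patches[:5])  # sparse coefficients for 5 patches
print(psi.shape, codes.shape)           # (64, 128) (5, 128)
```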
2.3 Compressive Computational Photography

2.3.1 Single Pixel Camera
One of the first practical demonstrations of compressive sensing has been the single pixel camera developed at RICE University [18]. Their system images the scene through a lens onto a DMD device. The DMD implements a Bernoulli matrix, either allowing a light ray to pass through or deflecting it away. A number of measurements are recorded on a single photodiode. The signal being estimated is assumed to be sparse in the DCT domain and is recovered using compressive sensing approaches.
2.3.2 Compressive Video Acquisition

Following this, compressive sensing has recently been applied to video acquisition, either by modulating the incoming light source [48] or through per-pixel exposure control [27]; see Figure 2-9.
Modulating the light source with random strobing allows for the reconstruction of periodic events that are sparse in the Fourier domain (a periodic event has only one frequency component in the Fourier domain). On the other hand, per-pixel control of exposure lets one multiplex multiple frames of a video into a single coded image that can be inverted, using compressive sensing, to recover the video during the exposure.

Figure 2-8: RICE's single pixel camera is the first physical prototype based on compressive sensing. It allows reconstruction of a 2D image by taking a series of measurements on a single photodetector, using $\ell_1$-minimization techniques [18].

Figure 2-9: In this design an LCoS based system is used to control per pixel exposure, implementing a mask pattern that modulates and sums over frames during capture. Dictionaries are learned from training videos, as shown on the left, to reconstruct video from the exposure in a single shot [27].
2.3.3 Compressive Light Field Imaging (Simulation)

The idea of compressive light field acquisition itself is not new, either. Kamal et al. [29] and Park and Wakin [46], for instance, simulate a compressive camera array. Recently, researchers have started to explore compressive light field acquisition with a single camera. Optical coding strategies include randomly coded apertures [5, 6], coded lenslets [5], a combination of coded mask and aperture [59], and random mirror reflections [23]. Unfortunately, all of these techniques are based on simulations, still require multiple images to be recorded to reconstruct the light field, and are not suited for dynamic scenes. They do, however, succeed in reducing the number of shots compared to their non-compressive counterparts [36]. Fergus et al. [23] require significant changes to the optical setup, making it difficult to capture conventional 2D images. The work by Xu and Lam [59] is most closely related to ours. However, they only show simulated results and employ simple light field priors based on total variation (TV). Furthermore, they propose an optical setup using dual-layer masks, but their choice of mask patterns (random and sum-of-sinusoids) reduces the light efficiency of the optical system to less than 5%. It could also be argued that light field superresolution [8] is a form of compressive light field acquisition; higher-resolution information is recovered from microlens-based measurements under Lambertian scene assumptions. The fundamental resolution limits of microlens cameras, however, are depth-dependent [47]. We show that mask-based camera designs are better suited for compressive light field sensing and derive optimized single-device acquisition setups.

In this thesis, we demonstrate that light field atoms captured in overcomplete dictionaries represent natural light fields more sparsely than previously employed bases. We evaluate a variety of light field camera architectures and show that mask-based approaches provide a good tradeoff between expected reconstruction quality and optical light efficiency; we derive optimized mask patterns with approximately 50% light transmission that allow for high-quality light field reconstructions from a single coded projection.

Figure 2-10: Conceptually, two compressive light field architectures have been proposed. The one on the left multiplexes by putting a random code on the aperture, while the one on the right places a coded lenslet array between the sensor and the lens. Both of these architectures require multiple shots to be taken for high resolution light field reconstruction [5].
Chapter 3
Sparse Coding and Representations
Our work builds on recent advances in the signal processing community. Here we outline and evaluate relevant mathematical tools that help in a robust and efficient inversion of the underdetermined system of equations formulated in Equation 2.1. As mentioned, one of the key challenges in resolution-preserving light field photography is the choice of a dictionary in which natural light fields are sparse. We discuss approaches to learn light field atoms, the essential building blocks of natural light fields that sparsely represent such high-dimensional signals. In this chapter, our focus is to give readers an intuitive understanding of the various design parameters for sparse coding and reconstruction; we consider the light field as the signal of interest and evaluate these parameters in 2D flatland. This helps motivate the choice of values as we move ahead in building a high resolution compressive light field camera in the next chapter.
3.1 Primer on Sparse Coding
This section reviews the mathematical background of sparse coding. Much progress on this topic has been made in the signal and information theory community throughout the last decade. The field has exploded in the last five years, and this section serves merely as a concise overview of the mathematical tools relevant for compressive light field photography. For our purposes of building practical camera designs, sparse reconstruction approaches are also evaluated based on the following criteria: speed, quality, and ease of implementation.
Let us start with the problem statement. Given a vectorized sensor image $i \in \mathbb{R}^m$ that contains a coded projection of the incident vectorized light field $l \in \mathbb{R}^n$, we wish to recover a set of sparse coefficients $\alpha \in \mathbb{R}^d$ that form an accurate representation of the light field in some basis or dictionary $\Psi \in \mathbb{R}^{n \times d}$:

$$ i = \Phi l = \Phi \Psi \alpha \qquad (3.1) $$

The challenges here are twofold. First, the number of measurements (sensor pixels) m is significantly lower than the number of unknowns (light field rays) n. Second, assuming that tools from the signal processing community can be employed to solve this problem, which basis or dictionary $\Psi$ provides a sufficiently sparse representation for natural light fields? While the next section is mainly concerned with answering the latter question, we outline tools that are at our disposal to tackle the first challenge in this section.
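To make these dimensions concrete, the following flatland toy builds a projection where each sensor pixel sums its angular views after mask attenuation, yielding an underdetermined system of the form of Equation 3.1. All sizes and the per-ray random mask model here are illustrative assumptions, not the calibrated setup of Chapter 4.

```python
import numpy as np

rng = np.random.default_rng(1)

# Flatland toy: 20 spatial x 5 angular rays project onto a 20-pixel sensor,
# so m = 20 measurements for n = 100 unknowns.
n_space, n_views = 20, 5
n, m = n_space * n_views, n_space
d = 2 * n                                     # 2x overcomplete dictionary

# Each sensor pixel x sums its n_views rays, attenuated by a mask value.
mask = rng.random((n_space, n_views))
Phi = np.zeros((m, n))
for x in range(n_space):
    for v in range(n_views):
        Phi[x, x * n_views + v] = mask[x, v]

Psi = rng.standard_normal((n, d))             # stand-in dictionary
Psi /= np.linalg.norm(Psi, axis=0)            # unit-norm atoms

alpha = np.zeros(d)
alpha[rng.choice(d, 5, replace=False)] = 1.0  # a k = 5 sparse coefficient set
i = Phi @ (Psi @ alpha)                       # Equation 3.1: i = Phi Psi alpha
print(Phi.shape, i.shape)                     # (20, 100) (20,)
```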
Solving an underdetermined linear system, such as Equation 3.1, is challenging because it has infinitely many solutions. Our work builds on recent advances in compressed sensing (see, e.g., [16, 12]) to solve such equation systems. The general idea is that, under certain conditions, underdetermined systems can be solved if the unknowns are either sparse or can be represented in a sparse basis or overcomplete dictionary. A signal is said to be k-sparse if it has at most k nonzero coefficients. Mathematically, sparsity is expressed by the $\ell_0$ pseudo-norm $\| \cdot \|_0$, which simply counts the number of nonzero elements in a vector.
The answer to the following problem is at the core of finding a robust solution to Equation 3.1:

$$ \underset{\{\alpha\}}{\text{minimize}} \; \|\alpha\|_0 \quad \text{subject to} \quad \|i - \Phi \Psi \alpha\|_2 \leq \epsilon \qquad (3.2) $$

This formulation seeks the sparsest representation of a signal that achieves a prescribed approximation error. Equation 3.2 can similarly be stated with equality constraints; we focus on bounded errors, as the measurements taken by computational cameras usually contain sensor noise. A different, but closely related, problem is that of sparsity-constrained approximation:

$$ \underset{\{\alpha\}}{\text{minimize}} \; \|i - \Phi \Psi \alpha\|_2 \quad \text{subject to} \quad \|\alpha\|_0 \leq \kappa \qquad (3.3) $$

Here, the objective is to find a signal that has at most $\kappa$ nonzero coefficients and minimizes the residual. Problems 3.2 and 3.3 can be expressed using a Lagrangian function that balances the twin objectives of minimizing both error and sparsity:

$$ \underset{\{\alpha\}}{\text{minimize}} \; \|i - \Phi \Psi \alpha\|_2^2 + \lambda \|\alpha\|_0 \qquad (3.4) $$

The above problems (Eqs. 3.2-3.4) are combinatorial in nature. Finding optimal solutions is NP-hard and, therefore, intractable for high resolutions [42]. The two most common approaches to tackle these problems are greedy methods and convex relaxation methods (see, e.g., [50, 51]). The remainder of this section outlines methods for both.
3.1.1 Approximate Solutions
In this section, we briefly review the two most common approximate solutions to the
problems outlined in the previous section: greedy algorithms and convex relaxation
methods.
Greedy Methods
Greedy methods for sparse approximation are simple and fast. In these iterative
approaches, the dictionary atom that is most strongly correlated to the residual part
of the signal is chosen and the corresponding coefficient added to the sparse coefficient
vector (MP). In addition, a least-squares minimization can be added to each iteration
so as to significantly improve convergence (OMP).
Matching Pursuit (MP) is a greedy method for sparse approximation that constructs a sparse approximation one step at a time by selecting the atom most strongly correlated with the residual part of the signal and using it to update the current approximation. One way to accomplish this is to choose the atom most strongly correlated with the residual by computing inner products between the residual and all atoms. Then, the coefficient corresponding to the chosen atom is updated with the inner product of that atom and the residual [51]. This is done iteratively.
Orthogonal Matching Pursuit (OMP) has been used for decades and is one of the earliest approaches to sparse approximation. While independently discovered by several researchers, a detailed treatise can be found in [52]. Just like MP, OMP is a greedy algorithm and therefore extremely fast. The improvement over MP comes from the fact that OMP adds an additional least-squares minimization in each iteration. Similarly to MP, in each iteration OMP picks the atom that contributes most to the overall residual and adds the corresponding coefficient to the sparse representation. In addition to picking that coefficient, OMP runs a least-squares minimization over all coefficients picked until the current iteration to obtain the best approximation over the atoms that have already been chosen.

One of the disadvantages of OMP is that exact recovery of the signal is only guaranteed for a very low coherence value and that the sparsity of the signal has to be known. The latter is rarely the case in practice, however.
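A minimal numpy sketch of OMP as described above follows; it is for intuition only (all sizes are arbitrary), and a practical reconstruction would use one of the optimized solvers evaluated in Figure 3-4.

```python
import numpy as np

def omp(A, i, k):
    """Greedy OMP: pick the atom most correlated with the residual, then
    re-fit all chosen coefficients by least squares, for k iterations."""
    alpha = np.zeros(A.shape[1])
    support = []
    residual = i.copy()
    for _ in range(k):
        j = int(np.argmax(np.abs(A.T @ residual)))  # best-matching atom
        if j not in support:
            support.append(j)
        # Least-squares fit over all atoms selected so far.
        coeffs, *_ = np.linalg.lstsq(A[:, support], i, rcond=None)
        alpha[:] = 0.0
        alpha[support] = coeffs
        residual = i - A @ alpha
    return alpha

# Tiny demo with an exactly 3-sparse signal.
rng = np.random.default_rng(2)
A = rng.standard_normal((40, 120))
A /= np.linalg.norm(A, axis=0)              # unit-norm columns
truth = np.zeros(120)
truth[[5, 50, 99]] = [1.0, -2.0, 0.5]
est = omp(A, A @ truth, k=3)
print(np.flatnonzero(est))                  # typically recovers {5, 50, 99}
```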
Convex Relaxation Methods

Convex relaxation methods follow a different strategy than greedy methods. Rather than trying to solve a difficult problem, these methods solve a slightly different problem that can be solved much more easily, in the hope that the solutions will be close to those of the original problem. In particular, the $\ell_0$-norm that makes Equations 3.2-3.4 NP-hard is replaced by the $\ell_1$-norm, which is convex yet non-smooth. This approach replaces a combinatorial sparse approximation problem with a related convex problem, which can be solved in polynomial time. Although there are no theoretical guarantees that these numerical methods actually solve sparse approximation problems, it has been shown that, under certain conditions, the solutions of the convex relaxation problems converge to those of the combinatorial problems with very high probability [10, 17].
One of the most important achievements of the recent literature on compressive sensing is the derivation of bounds on recoverability and the required number of measurements for the convex relaxation methods described in this section. A lower bound can, for instance, be placed on the number of measurements m for a k-sparse d-dimensional signal [12]:

$$ m \geq \mathrm{const} \cdot k \log\left(\frac{d}{k}\right) \qquad (3.5) $$

Solutions to the convexified problems can be found with a variety of approaches, including linear programming or nonlinear programming, such as interior point methods [9]. In the following, we discuss several different formulations for convex relaxations of Equations 3.2-3.4.
Basis Pursuit (BP) replaces the $\ell_0$-norm of Equation 3.2 with an $\ell_1$-norm, but also uses equality constraints for the measurements [15]:

$$ \underset{\{\alpha\}}{\text{minimize}} \; \|\alpha\|_1 \quad \text{subject to} \quad \Phi \Psi \alpha = i \qquad (3.6) $$

Efficient implementations of this problem can, for instance, be found in the SPGL1 solver package [53]. Basis pursuit works well so long as the number of measurements $m \geq \mathrm{const} \cdot k \log(d/k)$ and the measurement matrix is sufficiently incoherent. While this problem is important for many applications, in the context of light field cameras one usually has to deal with sensor noise, which makes it difficult to use the equality constraints in the above formulation.
Basis Pursuit Denoise (BPDN) is very similar to basis pursuit; the main difference is that the equality constraints are replaced by inequality constraints [15]:

    minimize_{α}   ||α||_1
    subject to    ||i − ΦDα||_2 ≤ ε    (3.7)
The parameter ε is essentially the noise level of a recorded sensor image in our application. Both BP and BPDN can be summarized as minimizing the sparsity of a coefficient vector while complying with the observations, at least up to some threshold ε. For practical implementation, the constraints of Equation 3.7 can be directly included in the objective function using a Lagrangian formulation:

    minimize_{α}   ||i − ΦDα||_2² + λ ||α||_1    (3.8)
Equation 3.8 is an unconstrained, convex, quadratic problem and can, therefore,
be easily solved with existing solvers. Efficient implementations are readily available
in the SPGL1 package [53] and also in the NESTA solver [7]. Yang et al. [60] provide an excellent overview of recent and especially fast ℓ1 minimization algorithms; upon request, the authors also provide source code. We found their implementation of a homotopy method to solve the BPDN problem most efficient (see Fig. 3-4).
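As an illustration of how the Lagrangian form of Equation 3.8 can be minimized, the following sketches a basic iterative shrinkage-thresholding (ISTA) loop in Python. This is not the homotopy method used in our experiments, merely a simple stand-in that solves the same objective; all names are illustrative.

    import numpy as np

    def soft_threshold(x, t):
        # Proximal operator of the l1-norm.
        return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

    def ista_bpdn(A, y, lam, iters=500):
        """Minimize ||y - A a||_2^2 + lam * ||a||_1 with ISTA.
        A plays the role of Phi*D; y is the coded measurement vector."""
        # Step size from the Lipschitz constant of the quadratic term.
        L = 2.0 * np.linalg.norm(A, 2) ** 2
        a = np.zeros(A.shape[1])
        for _ in range(iters):
            grad = 2.0 * A.T @ (A @ a - y)   # gradient of the data term
            a = soft_threshold(a - grad / L, lam / L)
        return a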
Lasso
is a problem closely related to BPDN and a convex relaxation of Equation 3.4:
    minimize_{α}   ||i − ΦDα||_2
    subject to    ||α||_1 ≤ κ    (3.9)
Solving this problem provides the best approximation of a signal using a linear
combination of (ideally) K atoms or fewer from the dictionary [49]. Lasso can also be
solved with the unconstrained, Lagrangian form of the BPDN problem (Eq. 3.8); the
parameter A is simply chosen to give a different tradeoff between sparsity and error.
An implementation of Lasso can be found in the SPGL1 package [53].
3.1.2 The Analysis Problem
The last section gave a brief overview of approaches to sparse coding and approximate
solutions. All problems are formulated as synthesis problems, however, which are theoretically only valid if the dictionary D is an orthonormal basis. In case it is coherent, overcomplete, or redundant, Candes et al. [13] have shown that solving the following ℓ1-analysis problem theoretically results in superior reconstructions:

    minimize_{l}   ||D* l||_1
    subject to    ||i − Φl||_2 ≤ ε    (3.10)

Candes et al. [13] use the reweighted ℓ1 method [14] to solve the basis pursuit denoise analysis problem. Reweighted ℓ1 is an iterative method; an efficient implementation is readily available in the NESTA solver package [7]. We compare NESTA's reweighted ℓ1 solver with the BPDN implementation of SPGL1 in Figure 3-4 and conclude that both result in a comparable reconstruction quality for our problem.
3.2 Overview of Dictionary Learning Methods
One of the most critical parts of any sparse coding approach is the choice of basis or dictionary that sparsifies the signal appropriately. As shown in the primary
text, natural light fields are poorly sparsified by standard bases such as the discrete
cosine transform, the Fourier basis, or wavelets. Instead, we propose to learn the fundamental building blocks of natural light fields-light field atoms-in overcomplete,
redundant, and possibly coherent dictionaries. Within the last few years, sparse coding with coherent and redundant dictionaries has gained a lot of interest in the signal
processing community [13].
3.2.1 Light Field Atoms
The general problem of learning an overcomplete dictionary D that sparsifies a class
of signals can be expressed as
    minimize_{D,A}   ||L − DA||_F²
    subject to    ∀i, ||a_i||_0 ≤ k    (3.11)

Here, D ∈ R^{n×d} is the overcomplete dictionary and L ∈ R^{n×q} is the training set that contains q signals or patches that represent the desired class of signals well. An algorithm tackling Equation 3.11 will not only compute D but also the sparse coefficients A ∈ R^{d×q} that approximate the training set. The Frobenius matrix norm in the above problem is defined as ||X||_F² = Σ_{i,j} x_{ij}².
A variety of solutions to Equation 3.11 exist; in the following, we discuss the most widely used methods.
K-SVD
The K-SVD algorithm [3] is a simple, yet powerful method of learning overcomplete dictionaries from a training dataset. K-SVD applies a two-stage process to solve Equation 3.11: given an initial estimate of D and A, in the first stage (sparse coding stage) the coefficient vectors a_i, i = 1 … q, are updated independently using any pursuit method with the fixed dictionary. In the second stage (codebook update stage), D is updated by picking the singular vector of the residual matrix E = L − DA that contributes most to the error; the corresponding coefficients in A and D are updated so as to minimize the residual with that singular vector. The vector can easily be found by applying a singular value decomposition (SVD) to the residual matrix E and picking the singular vector corresponding to the largest singular value. The K-SVD algorithm alternates between these two stages, sparse coding and codebook update, in an iterative fashion and is therefore similar in spirit to alternating least-squares methods.
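The two-stage structure can be summarized in a compact sketch; the code below assumes an OMP routine like the one sketched earlier and is illustrative rather than the reference implementation of [3].

    import numpy as np

    def ksvd_step(D, L, k):
        """One K-SVD iteration (sketch). D: (n, d) dictionary,
        L: (n, q) training patches, k: target sparsity."""
        n, q = L.shape
        # Stage 1 (sparse coding): encode each patch with the fixed
        # dictionary, e.g., using the OMP sketch above.
        A = np.column_stack([omp(D, L[:, j], k) for j in range(q)])
        # Stage 2 (codebook update): revisit every atom in turn.
        for j in range(D.shape[1]):
            users = np.nonzero(A[j, :])[0]   # patches that use atom j
            if users.size == 0:
                continue
            A[j, users] = 0.0
            # Residual of those patches without atom j's contribution.
            E = L[:, users] - D @ A[:, users]
            # The best rank-1 approximation of E (via SVD) yields the
            # updated atom and its coefficients.
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, j] = U[:, 0]
            A[j, users] = s[0] * Vt[0, :]
        return D, A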
Nonnegative K-SVD
The nonnegative variant of K-SVD [4] (K-SVD NN) allows only positive values in the dictionary atoms as well as in the coefficients, i.e., D_ij ≥ 0 and A_ij ≥ 0. The K-SVD NN algorithm itself is a slight modification of K-SVD: in the sparse coding stage, any pursuit algorithm can be applied that only allows for positive coefficients; the SVD in the codebook update stage is replaced by finding the closest nonnegative rank-1 matrix (in Frobenius norm) that approximates E. Algorithms for nonnegative matrix factorization (NMF) are well-explored (e.g., [32]).
The advantage of nonnegative matrix factorizations and dictionary learning algorithms is that these often result in decompositions that carry physical meaning. In
particular, a visual analysis of the extracted or decomposed features often allows for
intuitive interpretations of its parts. For the application of learning light field atoms,
nonnegative dictionaries allow for intuitive interpretations of what the basic building
blocks of natural light fields are. Figure 3-1 compares the atoms learned with K-SVD
NN to alternative learning methods that allow negative atoms.
Online Sparse Coding
With online sparse coding, we refer to the online optimization algorithm proposed by
Mairal et al. [40]. The goal of this algorithm is to overcome the limited size of training datasets that can be handled by alternative dictionary learning methods, such as
K-SVD. For this purpose, Mairal et al. propose a method based on stochastic approximation that is specifically designed to handle large training sets of visual data consisting of small patches in an efficient manner. The proposed approach is implemented
in the open source software package SPAMS (http://spams-devel.gforge.inria.fr).
Assume that a large training set of small patches is available. Conventional algorithms, such as K-SVD, randomly pick the maximum number of training patches that
can be processed given a limited memory or time budget. Online sparse coding uses
a batch approach, where the full training set is processed by picking a random patch
at a time and updating the dictionary and coefficients accordingly. This is similar in
spirit to matching pursuit algorithms.
While being able to handle very large training sets and being very fast, online sparse coding has the disadvantage of very slow convergence rates. Fixing the number of iterations and comparing it to K-SVD, as illustrated in Figure 3-1, shows that the resulting light field atoms are much noisier and of lower quality. This is mainly due to the fact that it is unclear how to choose the next training patch at any given iteration of the algorithm. Ideally, one that maximizes the amount of new information should be chosen; it may be impossible to know which one that is, so a random sample is drawn that may not add any new information at all.
Generating "Good" Training Sets with Coresets
3.2.2
Coresets are a means to preprocess large training sets so as to make subsequent dictionary learning methods more efficient. A comprehensive survey of coresets can be found in [2]. Given a training set L ∈ R^{n×q}, a coreset C ∈ R^{n×c} with c ≪ q is extracted and directly used as a surrogate training set for the dictionary learning process (Eq. 3.11). Coresets have two advantages: first, the size of the training set is significantly reduced (i.e., c ≪ q) and, second, redundancies in the training set are removed, significantly improving convergence rates of batch-sequential algorithms such as online sparse coding. In a way, coresets can be interpreted as clustering algorithms specialized for the application of picking "good" training sets for dictionary learning [22].
A simple yet efficient approach to computing coresets is to select the c patches of the training set that have a sufficiently high variance [21]. Feigin et al. showed that the best dictionary learned from a coreset computed with their algorithm is very close to the best dictionary that can be learned from the original training dataset. We implement the method described in [21] and evaluate it with K-SVD and online sparse coding in the following section.
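A minimal sketch of this variance-based selection is shown below; the full coreset construction of [21] includes further details (such as weighting) that are omitted here, and the function name is illustrative.

    import numpy as np

    def variance_coreset(L, c):
        """Keep the c highest-variance training patches of L (n, q),
        a simple stand-in for the coreset construction of [21]."""
        variances = np.var(L, axis=0)        # one variance per patch
        keep = np.argsort(variances)[-c:]    # c most varied patches
        return L[:, keep]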
3.3 Evaluating System Design Parameters
Before we can effectively apply these recent theories of sparse coding and representation to high-dimensional signals, we take a step back and evaluate various system design parameters to help guide us toward resolution-preserving light field recovery. In this section, we evaluate several learning methods in flatland, with straightforward extensions to the full, four-dimensional case. The flatland analysis simply allows for more intuitive interpretations.
[Figure 3-1 panels: K-SVD; K-SVD with coresets; nonnegative K-SVD; SPAMS; SPAMS with coresets.]
Figure 3-1: Evaluating various approaches to dictionary learning and choosing training sets. The two most important evaluation criteria are speed (and memory footprint) of the learning stage as well as quality of the learned light field atoms. The
former is indicated by timings, whereas the latter is indicated by the PSNR and visual quality of reconstructions from the same measurements (2D light field projected
onto a 1D sensor image with a random modulation mask). We evaluate the K-SVD
algorithm and its nonnegative variant and online sparse coding as implemented in the
SPAMS software package. K-SVD is a very slow algorithm that limits practical application to small training sets containing 27,000 training patches (top left); coresets are
tested to reduce a much larger training set with 1,800,000 patches to the same, smaller
size (27,000 patches, top center). In all cases, K-SVD learns high-quality light field
atoms that capture different spatial features in the training set sheared with different
amounts in the spatio-angular domain. Unfortunately, K-SVD becomes too slow for
practical application with large high-dimensional training sets; nevertheless, we use
it as a benchmark in these flatland experiments. Online sparse coding (bottom
row)
is much faster and capable of handling large training sets. Unfortunately, the learned
atoms are very noisy and lead to low-quality reconstructions (bottom left and center).
Using coresets, much of the redundancy in large training sets can be removed prior to
learning the atoms (lower right), thereby removing noise in the atoms and achieving
the highest-quality reconstructions in the least amount of time.
[Figure 3-2 panels: dictionaries with 0.1x (10 atoms), 0.5x (50 atoms), 1x (100 atoms), 2x (200 atoms), 5x (500 atoms), and 10x (1000 atoms) overcompleteness.]
Figure 3-2: Evaluating dictionary overcompleteness. The color-coded visualizations
of dictionaries (top row) and the histograms (center row) illustrate the intrinsic dimension of these dictionaries. For more than 2x overcomplete dictionaries, most of
the atoms are rarely used to adequately represent the training set. Reconstructions
of a 2D light field that was not in the training set using the respective dictionaries
(bottom row) show that the quality (PSNR) is best for 1 - 2x overcomplete dictionaries and drops below and above. Dictionaries with 0.1 - 0.5x overcompleteness
do not perform well, because they simply do not contain enough atoms to sparsely
represent the test light field. On the other hand, excessive overcompleteness (larger
d) does not improve sparsity (smaller k) further, and k log(d/k) starts to increase,
leaving the number of measurements insufficient.
3.3.1 Dictionary Design Parameters
Here, we evaluate various approaches to and design parameters of the learning stage.
We conclude that coresets applied to large training sets in combination with online
sparse coding implemented in the SPAMS package gives the best results in the shortest
amount of time.
Speed and Quality of Learning Stage
We evaluate several different dictionary
learning methods in Figure 3-1. While the K-SVD algorithm results in high-quality
light field atoms, online sparse coding usually extracts noisy atoms for a comparable
number of iterations in both algorithms. Unfortunately, K-SVD becomes increasingly
slow and infeasible for large training sets of high-resolution, four-dimensional
light field patches.
Applying coresets to large training sets prior to the learning
stage, however, removes redundancies in the training data and allows SPAMS to
learn atoms that result in higher-quality reconstructions than K-SVD in very little
time (Fig. 3-1, lower right).
We also evaluate the nonnegative variant of K-SVD for the purpose of improved
interpretability of the learned atoms (Fig. 3-1, upper right).
As with all other ap-
proaches, nonnegative K-SVD atoms exhibit spatial features sheared with different
amounts in the spatio-angular space.
Edges in these atoms, however, are much
smoother than in the approaches that allow for negative entries.
This allows for
improved blending of several nonnegative atoms to form light field "molecules", such
as T-junctions observed in occlusions. Atoms that allow for negative values can easily create such "molecules" when added. The reconstruction quality of nonnegative K-SVD is low, however; we conclude that this algorithm is well suited for analyzing learned atoms but, given its low reconstruction quality and immense compute times, not fit for practical processing in this application.
[Figure 3-3 panels: target light field view; reconstructions with atom sizes 9x9x3x3 (PSNR 28.8 dB), 11x11x3x3 (31.0 dB), 13x13x3x3 (27.7 dB), 17x17x3x3 (28.7 dB), and 21x21x3x3 (28.5 dB).]
Figure 3-3: Evaluating light field atom sizes. A synthetic light field (lower left) is
projected onto a sensor image with a random modulation mask and reconstructed
with dictionaries comprised of varying atom sizes. The angular resolution of light
field and atoms is 3 x 3, but the spatial resolution of the atoms ranges from 9 x 9
to 21 x 21. Dictionaries for all experiments are 5x overcomplete. We observe best
reconstruction quality for an atom size of 11 x 11. If chosen too small, the spatio-angular shears of objects at a distance to the focal plane will not be adequately captured by the light field atoms. Furthermore, the ratio between the number of measurements and the number of unknowns is too low for a robust reconstruction with sparse coding approaches. For increasing atom resolutions, that ratio becomes more favorable for reconstruction; however, with an increasing spatial extent of an atom it also becomes increasingly difficult to represent the underlying light fields sparsely, leading to lower-quality reconstructions.
Atom Size
The size (or resolution) of light field atoms is an important design
parameter. Consider an atom size of n = p_x² × p_v²; the number of measurements is then always m = p_x². Assuming a constant sparseness k of the light field in the coefficient space, the minimum number of measurements should ideally follow Equation 3.5, i.e., m ≥ const · k log(d/k). As the spatial atom size is increased for a given angular
size, the recovery problem becomes more well-posed because m grows linearly with
the atom size, whereas the right hand side only grows logarithmically. On the other
hand, an increasing spatial light field atom size may decrease the compressibility of
the light field expressed in terms of these atoms. Figure 3-3 evaluates the sensitivity of
the light field recovery process with respect to the spatial atom size. We conclude that
there is an optimal tradeoff between the above arguments (number of measurements
vs. sparsity). We heuristically determine the optimal atom size to be p_x = 11 for our
application.
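The tradeoff can be made concrete with a quick numeric check of Equation 3.5; the sparsity level and overcompleteness used below are illustrative assumptions, not measured values.

    import numpy as np

    # Back-of-the-envelope check of Eq. 3.5: measurements m = px^2
    # versus k*log(d/k) (up to the constant) for a 3x3 angular
    # resolution; k and the 5x overcompleteness are assumptions.
    k = 10
    for px in (9, 11, 13, 17, 21):
        n = px * px * 3 * 3        # unknowns per 4D atom
        d = 5 * n                  # dictionary size (5x overcomplete)
        m = px * px                # measurements per sensor patch
        print(px, m, k * np.log(d / k))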
Overcompleteness
We also evaluate how overcomplete dictionaries should ideally
be, that is, how many atoms should be learned from a given training set. Conventional orthonormal bases are, in this notation, "1x" overcomplete: D is square. The overcompleteness of dictionaries, however, can be chosen arbitrarily in the learning process. We evaluate dictionaries that are 0.1x, 0.5x, 1x, 2x, 5x, and 10x overcomplete in Figure 3-2. The color-coded visualizations of the atoms in the respective dictionaries indicate how many times each of the atoms is actually being used for the training set (on a normalized scale). The histograms in the center row count how many times each atom was used to represent the training set. We observe that for a growing dictionary size, the redundancy grows as well. While all coefficients in the 0.1x and 0.5x dictionaries are being used almost equally often, for 5x and particularly 10x, most of the coefficients are rarely being used, hence overly redundant. We conclude that 1-2x overcomplete dictionaries adequately represent this particular training set consisting of 27,000 light field patches (reduced from 1,800,000 randomly chosen ones using coresets); all atoms have a resolution of 5 x 20 in angle and space.
At this point, we would like to remind the reader of the number of required
measurements in convex relaxation methods, such as basis pursuit denoise used in
this experiment, as outlined in supplementary Equation 3.5. For a fixed sparsity, a
linearly increasing dictionary size (or overcompleteness) requires a logarithmically growing number of measurements. As seen in the top and center rows of Figure 3-2, the intrinsic dimension of these dictionaries, that is, the number of coefficients required to adequately represent the training set, is about 1-2x. As the overcompleteness grows (5-10x), there are simply not enough measurements in the sensor image, hence the PSNR of the reconstructions drops (Fig. 3-2, bottom row).
3.3.2 Sparse Reconstruction Parameters
A comparison of several of the above discussed sparse coding algorithms is shown in
Figure 3-4. We evaluate these approaches based on reconstruction quality and speed.
The choice of algorithm is based on publicly available code and ease of implementation; while there is a huge variety of different algorithms in the literature, we pick the
most popular ones that can easily be implemented or are readily available. Figure 3-4 shows that reweighted ℓ1, as implemented in NESTA [7], basis pursuit denoise implemented in SPGL1 [53], and basis pursuit denoise using the homotopy method [60] perform equally well in terms of reconstruction quality. However, the processing time for both reweighted ℓ1 and BPDN (SPGL1) prohibits practical use for high-resolution 4D light field reconstructions. Hence, we chose the homotopy method described by Yang et al. [60] with source code provided by the authors.
Sliding Window Reconstructions
At this stage we would like to point out that
reconstructions are performed independently for each pixel in the sensor image. That is, a window of patch size p_x × p_x, p_x² = m, is centered around a sensor pixel and represents the measurement vector i ∈ R^m (see Eq. 3.1). A sparse set of 4D light field atoms, each of size p_x × p_x × p_v × p_v, p_x² p_v² = n, is then reconstructed for each sensor pixel.
Following standard practice [20], we use this sliding window approach and reconstruct the 4D window for each sensor pixel separately in all results in the paper. As
a post-processing step, overlapping patches are merged using a median function. We
compare several different choices for the merging function, including average, median,
and simply picking the spatial center of the 4D window in Figure 3-4.
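A flatland sketch of this sliding-window pipeline is given below; the per-patch solver reconstruct_patch is assumed to be supplied (for instance, a BPDN routine like the one sketched earlier), and the indexing conventions are illustrative.

    import numpy as np

    def sliding_window_recover(i_sensor, px, pv, reconstruct_patch):
        """Flatland sketch: recover a 2D light field (space x angle)
        from a 1D coded sensor line. One (px, pv) patch is recovered
        per sensor pixel; overlapping estimates are merged with a
        median, as in our pipeline."""
        w = len(i_sensor)
        half = px // 2
        # Collect the overlapping per-sample estimates.
        estimates = [[[] for _ in range(pv)] for _ in range(w)]
        for x in range(half, w - half):
            window = i_sensor[x - half : x + half + 1]  # m = px samples
            patch = reconstruct_patch(window)           # (px, pv) patch
            for dx in range(px):
                for v in range(pv):
                    estimates[x - half + dx][v].append(patch[dx, v])
        out = np.zeros((w, pv))
        for x in range(w):
            for v in range(pv):
                if estimates[x][v]:
                    out[x, v] = np.median(estimates[x][v])
        return out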
Evaluating Effect of Resolution
We evaluate the effect of different resolutions
between the learning and reconstruction phases. We learn two dictionaries from a
plane of text placed at different depths at resolutions of 128 x 128 and 64 x 64,
respectively. Given the 128 x 128 dictionary for benchmarking, we reconstruct a resolution chart at the same depths at the same resolution. To compare, we reconstruct
an anti-aliased resolution chart downsampled to 64 x 64 with the dictionary learned
at the resolution of 128 x 128. Similarly, we also reconstruct a resolution chart of
128 x 128 with the dictionary learned at the resolution of 64 x 64.
For this experiment, no significant difference was found in PSNR as shown in
the plot in Figure 3-6.
Figure 3-4: Evaluating sparse reconstruction methods. We reconstruct a 2D light
field (top row) using the same dictionary and simulated sensor image with four different methods of sparse coding (rows 2-6) and different window merging functions.
Although fast, orthogonal matching pursuit (OMP, second row) only achieves a low reconstruction quality. Reweighted ℓ1, as implemented in the NESTA package, yields high-quality reconstructions but takes about 15 times as long as OMP (third row). Basis pursuit denoise, as implemented in the SPGL1 solver package, is also high-quality but almost 400 times slower than OMP (fourth row). Due to its immense time requirements, this implementation is not suitable for high-resolution 4D light field reconstructions. The homotopy approach to solving the basis pursuit denoise problem (fifth row) is even faster than OMP and results in a reconstruction quality that is comparable with the best, but much slower, SPGL1 implementation.
[Figure 3-5 panels: convergence of OMP, reweighted ℓ1, BPDN (SPGL1), and BPDN (homotopy).]
Figure 3-5: Evaluating convergence for all algorithms using a single patch. Please note that OMP is a greedy algorithm; while all other algorithms minimize the ℓ1-norm of the coefficients, OMP increments the number of used coefficients by one in each iteration.
Figure 3-6: Evaluating dictionary learning at various resolutions. We learn two dictionaries from a training set consisting of black and white text at resolutions of 128 x 128 and 64 x 64, at different depths from a simulated camera aperture of 0.25 cm, and run reconstructions of a resolution chart placed at the same depths but with different resolutions to compare the performance. As shown in the plot, the PSNRs do not vary significantly for dictionaries learned at 64 x 64 and reconstructions at 128 x 128, and vice versa. This can be attributed to our patch-by-patch reconstruction of 9 x 9 in this example. At such a patch size, features are captured at a resolution at which they are generally scale-invariant. Moreover, for the case of resolution charts, the textures are generally edges (spatial or angular) that do not depend on resolution.

Though the dictionary learning is not proven to be scale- or rotation-invariant, it aims to capture the natural building blocks of light fields. In the case of a textual planar scene at different depths, these blocks contain spatial and angular edges that are scale-invariant. As shown in [41], since dictionary learning is performed on patches at low resolutions (11 x 11 or 9 x 9 in our case), changes in the overall resolution have no profound effect on the learned representations.
3.3.3 Evaluating Optimal Projection Matrices and Mask Patterns
Traditional sensing matrices in compressive sensing project the high-dimensional signal to a lower-dimensional subspace. This is usually done with random matrices, because random matrices are known to satisfy the well-known Restricted Isometry Property (RIP), which guarantees preservation of energy for a sparse vector when it is projected down to a lower-dimensional subspace. Recently, it has been shown that, for overcomplete dictionaries, there exists a sensing matrix that can be optimized for a given basis [19]. Furthermore, dictionary and sensing matrices can be optimized together to be most orthonormal when operating on sparse vectors.
Figure 3-7: Evaluating optimal mask patterns. We evaluate three kinds of mask
patterns for both sliding and distinct patch reconstructions for a flatland 2D light
field. A light field is divided into patches of resolution 20 (space) x 5 (angle) and is
projected down with the corresponding mask to a 20 pixel coded sensor image. As
seen in the figure, given a dictionary, an optimized mask pattern performs much better
than a traditional random mask. Jointly optimizing for both mask and dictionary
performs slightly better than the optimized mask, albeit at much higher computational cost. The coherence value µ shown in the figure decreases as we optimize for both the mask and the dictionary.
Mathematically, this optimality criterion can be expressed as

    minimize_{Φ}   ||I − GᵀG||_F    (3.12)
where G is the product of the sensing matrix Φ and the overcomplete dictionary D. This unconstrained problem is known to be convex, but the sensing matrices generated are generally dense and cannot be physically realized. To generate physically realizable optimized codes, we add physical constraints on our 1D mask patterns as

    minimize_{f}   ||I − GᵀG||_F    (3.13)
    subject to    0 ≤ f_i ≤ 1, ∀i

where f ∈ R^m corresponds to the 1D mask pattern along the diagonals of the submatrices of Φ, which are zero otherwise. A further extension, shown in [19], is to perform coupled learning where, given a training set, one optimizes both the dictionary and the sensing matrix:
    minimize_{D,Φ,A}   λ ||L − DA||_F² + ||ΦL − ΦDA||_F²    (3.14)

Here L corresponds to the light field training set, D is the learned overcomplete dictionary, A is the sparse coefficient matrix, and Φ is the optimized sensing matrix. The constraints for the physically realizable mask pattern are then added as
before. Figure 3-7 compares reconstruction quality for all three mask patterns. As
shown, given a fixed dictionary, the optimal mask pattern performs much better than
the random projection. Joint optimization of dictionary and mask patterns performs
marginally better than the stand-alone optimized mask albeit at an increased computational overhead that will not scale up to 4D light fields given current computational
power.
Based on the insights drawn in this chapter on various design and system parameters, we describe the system, algorithms, and results of a resolution-preserving light field camera design in the following chapter.
Chapter 4
Compressive Light Field
Photography
This chapter describes the system design, analysis, implementation and results for
a practical compressive light field camera design.
We build on the mathematical foundations described in the previous chapter to overcome century-old fundamental limitations on high-resolution light field capture in a single shot. We start with a problem formulation that fits the theory, proceed with an analysis of light field dictionaries, and then implement a real prototype. We end with results from our new light field camera design, showcasing complicated illumination effects such as refraction, as well as dynamic scenes.
4.1 Problem Formulation

4.1.1 Acquiring Coded Light Field Projections
An image i(x) captured by a camera sensor is the projection of an incident spatio-angular light field l(x, v) along its angular dimension v over the aperture area V:

    i(x) = ∫_V l(x, v) dv.    (4.1)
Figure 4-1: Illustration of ray optics, light field modulation through coded attenuation
masks, and corresponding projection matrix. The proposed optical setup comprises
a conventional camera with a coded attenuation mask mounted at a slight offset in
front of the sensor (left). This mask optically modulates the light field (center) before
it is projected onto the sensor. The coded projection operator is expressed as a sparse matrix Φ, here illustrated for a 2D light field with three views projected onto a 1D sensor (right).
We adopt a two-plane parameterization [35, 25] for the light field where x is the 2D
spatial dimension on the sensor plane and v denotes the 2D position on the aperture
plane at distance d_a (see Fig. 4-1, left). For brevity of notation, the light field in Equation 4.1 absorbs vignetting and other angle-dependent factors [43]. We propose to insert a coded attenuation mask f(ξ) at a distance d_l from the sensor, which optically modulates the light field prior to projection as

    i(x) = ∫_V f(x + s(v − x)) l(x, v) dv,    (4.2)

where s = d_l/d_a is the shear of the mask pattern with respect to the light field (see
Fig. 4-1, center). In discretized form, coded light field projection can be expressed as
a matrix-vector multiplication:

    i = Φl,   Φ = [Φ_1 Φ_2 … Φ_{p_v²}],    (4.3)

where i ∈ R^m and l ∈ R^n are the vectorized sensor image and light field, respectively. All p_v × p_v angular light field views l_j (j = 1 … p_v²) are stacked in l. Note that each submatrix Φ_j ∈ R^{m×m} is a sparse matrix containing the sheared mask code on its diagonal (see Fig. 4-1, right). For multiple recorded sensor images, the individual photographs and corresponding measurement matrices are stacked in i and Φ.
The observed image i = Σ_j Φ_j l_j sums the light field views, each multiplied with the same mask code but sheared by different amounts. If the mask is mounted directly on the sensor, the shear vanishes (s = 0) and the views are averaged. If the mask is located in the aperture (s = 1), the diagonals of each submatrix Φ_j become constants, which results in a weighted average of all light field views. In this case, however, the angular weights do not change over the sensor area. Intuitively, the most random, or similarly incoherent, sampling of different angular samples happens when the mask is located between sensor and aperture; we evaluate this effect in Section 4.2.5.
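The structure of Φ can be illustrated with a small flatland sketch; reducing the continuous shear of Equation 4.2 to an integer per-view shift is a simplification, and all names and parameters below are illustrative.

    import numpy as np

    def build_phi(mask, num_views, s, view_spacing):
        """Assemble Phi = [Phi_1 ... Phi_p] for a flatland light field
        with num_views angular samples. mask is the 1D attenuation code
        (one entry per sensor pixel) and s = dl/da the mask shear;
        view_spacing is the aperture sample spacing expressed in mask
        pixels (an assumption of this sketch)."""
        m = len(mask)
        blocks = []
        for j in range(num_views):
            v = j - (num_views - 1) / 2.0             # signed view index
            shift = int(round(s * v * view_spacing))  # sheared mask code
            blocks.append(np.diag(np.roll(mask, shift)))
        return np.hstack(blocks)                      # (m, m * num_views)

    # A coded sensor line is then i = build_phi(mask, 5, 0.5, 4) @ l,
    # where l stacks the five views of the flatland light field (Eq. 4.3).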
Equations 4.1-4.3 model a captured sensor image as the angular projection of
the incident light field. These equations can be interpreted to either describe the
entire sensor image or small neighborhoods of sensor pixels (2D patches) as the projection of the corresponding 4D light field patch. The sparsity priors discussed in the following sections exclusively operate on such small two-dimensional and four-dimensional patches.
4.1.2 Reconstructing Light Fields from Projections
The inverse problem of reconstructing a light field from a coded projection requires
a linear system of equations (Eq. 4.3) to be inverted.
For a single sensor image,
the number of measurements is significantly smaller than the number of unknowns,
i.e., m ≪ n. We leverage sparse coding techniques discussed in the previous chapter to solve the ill-posed underdetermined problem. For this purpose, we assume that natural light fields are sufficiently compressible in some basis or dictionary D ∈ R^{n×d} such that

    i = Φl = ΦDα,    (4.4)
where most of the coefficients in α ∈ R^d have values close to zero. As shown earlier (e.g., [16, 12]), we seek a robust solution to Equation 4.4 as

    minimize_{α}   ||α||_1
    subject to    ||i − ΦDα||_2 ≤ ε    (4.5)

which is the basis pursuit denoise (BPDN) problem [15]. In practice, we solve the Lagrangian formulation of Equation 4.5 as

    minimize_{α}   ||i − ΦDα||_2² + λ ||α||_1.    (4.6)
Assuming that the light field is k-sparse, that is, it can be well represented by a linear combination of at most k columns in D, a lower bound on the required number of measurements m is O(k log(d/k)) [13]. While Equation 4.6 is not constrained to penalize negative values in the reconstructed light field l = Dα, we have not observed any resulting artifacts in practice.
Any compressive computational photography method faces two main challenges: a "good" sparsity basis has to be known, and reconstruction times have to scale to high resolutions. In the following, we show how to learn dictionaries
of small light field atoms that sparsely represent natural light fields. A side effect of
using light field atoms is that scalability is intrinsically addressed as follows: instead
of attempting to solve a single, large optimization problem, many small and independent problems are solved simultaneously. As discussed in the following, light field
atoms model local spatio-angular coherence in the 4D light field sparsely. Therefore,
a small 4D light field patch is reconstructed from a 2D image patch centered around
each sensor pixel. The recovered light field patches are merged into a single reconstruction. Performance is optimized through parallelization and quick convergence
of each subproblem; the reconstruction time grows linearly with increasing sensor
resolution.
Figure 4-2: Visualization of light field atoms captured in an overcomplete dictionary.
Light field atoms are the essential building blocks of natural light fields-most light
fields can be represented by the weighted sum of very few atoms. We show that light
field atoms are crucial for robust light field reconstruction from coded projections and
useful for many other applications, such as 4D light field compression and denoising.
4.1.3 Learning Light Field Atoms
Following recent trends in the information theory community (e.g., [13]), we propose
to learn the fundamental building blocks of natural light fields-light field atoms-in
overcomplete dictionaries. We consider 4D spatio-angular light field patches of size
n = p_x × p_x × p_v × p_v. Given a large set of such patches, randomly chosen from a collection of training light fields, we learn a dictionary D ∈ R^{n×d} as

    minimize_{D,A}   ||L − DA||_F²
    subject to    ∀j, ||a_j||_0 ≤ k    (4.7)

where L ∈ R^{n×q} is a training set comprised of q light field patches and A = [a_1, …, a_q] ∈ R^{d×q} is a set of k-sparse coefficient vectors. The Frobenius matrix norm is ||X||_F² = Σ_{i,j} x_{ij}², the ℓ0 pseudo-norm counts the number of nonzero elements in a vector, and k (k ≪ d) is the sparsity level we wish to enforce.
In practice, training sets for the dictionary learning process are extremely large
and often contain a lot of redundancy. Solving Equation 4.7, however, is computationally expensive. Coresets, as described earlier, have been introduced as a means to
cheaply reduce large dictionary training sets to manageable sizes. Feigin et al. [21],
for instance, simply pick a subset of training samples in L that have a sufficiently
high variance; we follow their approach.
4.2 Analysis
In this section, we analyze the structure of light field atoms and dictionaries, evaluate
the design parameters of dictionaries, derive optimal modulation patterns for coded
projections, evaluate the proposed camera architecture, and compare it with a range
of alternative light field camera designs.
4.2.1 Interpreting Light Field Atoms
As discussed in Section 4.1.3, overcomplete dictionaries are learned from training sets
of natural light fields. The columns of these dictionaries are designed to sparsely
represent the respective training set, hence capture their essential building blocks
or atoms. Obviously, the structure of these building blocks mainly depends on the
specific training set; intuitively, large and diverse collections of natural light fields
should exhibit some common structures, just like natural images. Based on recent
insights, such as the dimensionality gap [43, 34], one would expect that the increased
dimensionality from 2D images to 4D light fields introduces a lot of redundancy. The
dimensionality gap is a 3D manifold in the 4D light field space, which successfully
models diffuse objects within a certain depth range. Unfortunately, occlusions, specularities, and high-dimensional edges are not accounted for in this prior. In contrast,
light field atoms do not model a specific lower-dimensional manifold, rather they
sparsely represent the elemental structures of natural light fields.
Figure 4-2 visualizes an artificially-colored dictionary showing the central views of
all its atoms; two of them are magnified and shown as 4D mosaics. We observe that
light field atoms capture high-dimensional edges as well as high-frequency structures
exhibiting different amounts of rotation and shear. Please note that these atoms also
[Figure 4-3 panels: target light field; quantitative plot of reconstruction quality versus compression ratio in % coefficients for 4D DCT, 4D Haar wavelets, 4D PCA, 4D FFT, and 4D light field atoms; qualitative comparison of compressibility and reconstruction from coded 2D projections for 4D DCT and 4D light field atoms.]
Figure 4-3: Compressibility of a 4D light field in various high-dimensional bases. As
compared to popular basis representations, the proposed light field atoms provide
better compression quality for natural light fields (plots, second row). Edges and
junctions are faithfully captured (third row); for the purpose of 4D light field reconstruction from a single coded 2D projection, the proposed dictionaries combined with
sparse coding techniques perform best in this experiment (bottom row).
contain negative values; combining a few atoms allows complex lighting effects to be
formed, such as reflections and refractions as well as junctions observed in occlusions
(see Fig. 4-3).
4.2.2 What are Good Modulation Patterns?
The proposed optical setup consists of a conventional camera with a coded attenuation mask mounted in front of the sensor. A natural question emerges: what should
the mask patterns be? In the compressive sensing literature, most often dense sensing matrices of random Gaussian noise are employed. The proposed optical setup,
however, restricts the measurement matrix Φ to be very sparse (see Fig. 4-1, right).
In the following, we discuss several choices of mask codes with respect to both computational and optical properties. A good mask design should facilitate high quality
reconstructions while also providing a high light transmission.
Tiled Broadband Codes
Broadband codes, such as arrays of pinholes, sum-of-
sinusoids (SoS), or MURA patterns, are common choices for light field acquisition
with attenuation masks. These patterns are designed to multiplex angular light information into the spatial sensor layout; under bandlimited assumptions, the 4D
light field is reconstructed using linear demosaicking
[55].
In previous applications,
the number of reconstructed light field elements is limited to the number of sensor
pixels. The proposed nonlinear framework allows for a larger number of light rays to
be recovered than available sensor pixels; however, these could also be reconstructed
from measurements taken with broadband codes and the sparse reconstruction algorithms proposed in this paper. We evaluate such codes in Figure 4-4 and show that
the achieved quality is lower than for random or optimized masks.
Random Mask Patterns For high resolutions, random measurement matrices
provide incoherent signal projections with respect to most sparsity bases, including
overcomplete dictionaries, with a high probability. This is one of the main reasons why
random codes are by far the most popular choice in compressive sensing applications.
In our application, the structure of the measurement matrix is dictated by the optical
setup-it is extremely sparse. Each sensor pixel integrates over only a few incident
light rays, hence the corresponding matrix row only has that many non-zero entries.
While random modulation codes are a popular choice in compressive computational
photography applications, these are not necessarily the best choice for overcomplete
dictionaries, as shown in the following.
Optimizing Mask Patterns As discussed in the previous chapter, research has recently focused on deriving optimal measurement matrices for a given dictionary [19]. The intuition here is that projections of higher-dimensional signals should be as orthogonal as possible in the lower-dimensional projection space. Poor choices of codes would allow high-dimensional signals to project onto the same measurement, whereas optimal codes remove such ambiguities as best as possible. Mathematically, this optimality criterion can be expressed as
    minimize_{f}   ||I − GᵀG||_F
    subject to    0 ≤ f_i ≤ 1, ∀i    (4.8)
                  Σ_i f_i / m ≥ τ

where G is ΦD with normalized columns and f ∈ R^m is the mask pattern along the diagonals of the submatrices in Φ (see Fig. 4-1, right). Hence, each column of G is
the normalized projection of one light field atom into the measurement basis. The
individual elements of GTG are inner products of each of these projections, hence
measuring the distance between them. Whereas diagonal elements of GTG are always
one, the off-diagonal elements correspond to mutual distances between projected light
field atoms. To maximize these distances, the objective function attempts to make
GᵀG as close to identity as possible. To further optimize for the light efficiency of the system, we add an additional constraint τ on the mean light transmission of the mask code f.
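As an illustration, candidate mask codes can be scored directly with the objective of Equation 4.8. The sketch below simply compares random candidates that meet the transmission target rather than running a constrained solver; it reuses the flatland build_phi helper sketched earlier, so D is assumed to be a dictionary over flatland patches of size m × num_views, and all names are illustrative.

    import numpy as np

    def mask_score(f, D, num_views, s, view_spacing):
        """Evaluate ||I - G^T G||_F for a candidate 1D mask code f,
        where G is Phi*D with normalized columns (Eq. 4.8)."""
        G = build_phi(f, num_views, s, view_spacing) @ D
        G = G / (np.linalg.norm(G, axis=0) + 1e-12)   # normalize columns
        gram = G.T @ G
        return np.linalg.norm(np.eye(gram.shape[0]) - gram, 'fro')

    def pick_mask(D, m, num_views, s, view_spacing, tau=0.5, trials=200):
        """Crude stand-in for the constrained optimization: draw random
        masks, reject those below the transmission target tau, and keep
        the candidate with the lowest score."""
        rng = np.random.default_rng(0)
        best, best_score = None, np.inf
        for _ in range(trials):
            f = rng.uniform(0.0, 1.0, m)
            if f.mean() < tau:                 # light-transmission bound
                continue
            score = mask_score(f, D, num_views, s, view_spacing)
            if score < best_score:
                best, best_score = f, score
        return best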
Figure 4-4: Evaluating optical modulation codes and multiple shot acquisition. We
simulate light field reconstructions from coded projections for one, two, and five captured camera images. One tile of the corresponding mask patterns is shown in the
insets. For all optical codes, an increasing number of shots increases the number
of measurements, hence reconstruction quality. Nevertheless, optimized mask patterns facilitate single-shot reconstructions with a quality that other patterns can only
achieve with multiple shots.
4.2.3 Are More Shots Better?
We strongly believe that the most viable light field camera design would be able to
reconstruct a high-quality and high-resolution light field from a single photograph.
Nevertheless, it may be argued that more measurements give even better results. This argument is supported by the experiments shown in Figure 4-4, where we
evaluate multi-shot reconstructions for different mask patterns. In all cases, quality
measured in peak signal-to-noise ratio (PSNR) is improved for an increasing number of shots, each captured with a different mask pattern. However, after a certain
number of shots reconstruction quality is not significantly increased further-in the
shown experiment, the gain from two to five shots is rather low. We also observe
that a "good" choice of modulation codes equally improves reconstruction quality.
In particular, optimized mask patterns allow for a single-shot reconstruction quality
that can only be achieved with multiple shots otherwise. Capturing multiple shots
with optimized mask patterns does not significantly improve image quality.
4.2.4 Evaluating Depth of Field
We evaluate the depth of field achieved with the proposed method in Figure 4-5.
For this experiment, we render light fields containing a single planar resolution chart
at different distances to the camera's focal plane (located at 50 cm).
Each light
field has a resolution of 128 x 128 pixels and 5 x 5 views. The physical distances
correspond to those in our camera prototype setup described in Section 4.3.1. While
the reconstruction quality is high when the chart is close to the focal plane, it decreases
with an increasing distance. Compared to capturing this scene with a lenslet array,
however, the proposed approach results in a significantly increased image resolution.
The training data for this experiment contains white planes with random text at
different distances to the focal plane, rendered with an aperture diameter of 0.25 cm.
Whereas parallax within the range of the training data can be faithfully recovered
(magenta and blue plots), a drop in reconstruction quality is observed when parallax
exceeds that of the training data (green plot).
[Figure 4-5 plot: reconstruction quality versus distance to the camera aperture (45-80 cm) for aperture diameters of 0.1, 0.25, and 0.5 cm, with the focal plane at 50 cm. Image panels: original, lenslet acquisition, and proposed reconstructions at z = 50, 60, and 70 cm.]
Figure 4-5: Evaluating depth of field. As opposed to lenslet arrays, the proposed
approach preserves most of the image resolution at the focal plane. Reconstruction
quality, however, decreases with distance to the focal plane. Central views are shown
(on focal plane) for full-resolution light field, lenslet acquisition, and compressive
reconstruction; compressive reconstructions are also shown for two other distances.
The three plots evaluate reconstruction quality for varying aperture diameters with a
dictionary learned from data corresponding to the blue plot (aperture diameter 0.25
cm).
[Figure 4-6 panels: lenslet array, coded lens array, coded aperture, coded mask, and coded aperture & mask, each annotated with its transmission τ and coherence value µ; the values shown include µ = 0.4071, 0.3814, µ_broad = 0.3793, µ_rand = 0.3790, and µ_opt = 0.3739.]
Figure 4-6: Illustration of different optical light field camera setups with a quantitative
value µ for the expected reconstruction quality (lower value is better). While lenslet
arrays have the best light transmission T (higher value is better), reconstructions are
expected to be of lower quality. Masks coded with random or optimized patterns
perform best of all systems with 50% or more transmission. Two masks are expected
to perform slightly better with our reconstruction, but at the cost of reduced light
efficiency.
4.2.5 Comparing Computational Light Field Cameras
Two criteria are important when comparing different light field camera designs: optical light efficiency and expected quality of the computational reconstruction. Light efficiency is measured as the mean light transmission τ of the optical system, whereas the value µ = ||I − GᵀG||_F quantifies the expected reconstruction quality based on Equation 4.8 (lower value is better).
We compare lenslet arrays [37, 44], randomly coded lenslet arrays and coded
apertures [5], coded broadband masks [28, 54, 31] (we only show URA masks as
the best general choice for this resolution), random masks and optimized masks,
as proposed in this paper, as well as randomly coded apertures combined with a
coded mask [59]. All optical camera designs are illustrated in Figure 4-6. Optically,
lenslet arrays perform best with little loss of light; most mask-based designs have a
light transmission of approx. 50%, except for pinholes. Combining randomly coded
apertures with a modulation mask results in an overall transmission of about 25%,
although Xu and Lam's choice of sum-of-sinusoids masks results in transmissions of
less than 5%.
Figure 4-6 also shows the quantitative value µ for the expected quality. Under this aspect, lenslet arrays perform worst, followed by coded lenslet arrays, coded apertures, and previously proposed tiled broadband codes (µ_broad). Random modulation masks (µ_rand) and the optimized patterns (µ_opt) proposed in Section 4.2.2 have the best expected quality of all setups with a mean transmission of 50% or higher. Although the dual-layer design proposed by Xu and Lam [59] has a lower µ-value, their design is significantly less light efficient than ours. While the quantitative differences between the µ-values of these camera designs are subtle, qualitative differences of reconstructions
are much more pronounced, as shown in Figures 4-4 and 4-5 and in the supplemental
video. The discussed comparison is performed by assuming that all optical setups
use the reconstruction method and overcomplete dictionaries proposed in this paper,
as opposed to previously proposed PCA sparsity bases [5] or simple total variation
priors [59].
4.3 Implementation

4.3.1 Primary Hardware Design
For experiments with real scenes, it is necessary to easily change mask patterns for
calibration and capturing training light fields. To this end, we implement a capture
system using a liquid crystal on silicon (LCoS) display (SiliconMicroDisplay ST1080).
An LCoS acts as a mirror where each pixel can independently change the polarization
state of incoming light. In conjunction with a polarizing beam splitter and relay
optics, as shown in Figure 4-7, the optical system emulates an attenuation mask
mounted at an offset in front of the sensor. As a single pixel on the LCoS cannot
be well resolved with the setup, we treat blocks of 4 x 4 LCoS pixels as macropixels,
resulting in a mask resolution of 480 x 270. The SLR camera lens (Nikon 105 mm
f/2.8D) is not focused on the LCoS but in front of it, thereby optically placing the
(virtual) image sensor behind the LCoS plane. A Canon EF 50 mm f/1.8 II lens is
used as the imaging lens and focused at a distance of 50 cm; scenes are placed within
a depth range of 30-100 cm. The f-number of the system is the maximum of both
lenses (f/2.8).
Figure 4-7: Prototype light field camera. We implement an optical relay system that
emulates a spatial light modulator (SLM) being mounted at a slight offset in front of
the sensor (right inset). We employ a reflective LCoS as the SLM (lower left insets).
Adjusting the Mask-Sensor Distance The distance d_l between the mask (LCoS plane) and the virtual image sensor is adjusted by changing the focus of the SLR camera lens. For capturing light fields with p_v × p_v angular resolution (p_v = 5 in our experiments), the distance is chosen as that of a conventional mask-based method that would result in the desired angular resolution, albeit at lower spatial resolution [54]. Specifically, we display a pinhole array on the LCoS where adjacent pinholes are p_v macropixels apart while imaging a white calibration object. We then adjust the
p, macropixels apart while imaging a white calibration object. We then adjust the
focus of the SLR camera lens so that disc-shaped blurred images under the pinholes
almost abut each other. In this way, angular light field samples impinging on each
sensor pixel pass through distinct macropixels on the LCoS with different attenuation
values before getting integrated on the sensor.
Capturing Coded Light Field Projections
We capture mask-modulated light
field projections by displaying a pattern on the LCoS macropixels and resizing the sensor images accordingly.
Figure 4-8: Pinhole array mask and captured images. For conciseness, only the region covering 12 x 10 pinholes is shown here. (a) Pinhole array displayed on the LCoS.
(b) Image sensor recording of the LCoS pinhole array mask for a white cardboard
scene. (c) Image sensor recording of the LCoS pinhole array mask for a newspaper
scene. (d) Normalized image, i.e., (c) divided by (b). (e) Image sensor recording of
the LCoS pinhole array mask for a white cardboard scene where each pinhole has a
corresponding attenuation value according to the mask pattern.
Capturing Training Light Fields For the dictionary learning stage, we capture a variety of scenes using a traditional pinhole array. For this purpose, p_v × p_v (= 25) images are recorded with shifting pinholes on the LCoS to obtain full-resolution light fields. However, as shown in Figure 4-8(b), the discs under the pinholes (i.e., point-spread functions, or PSFs) have color-dependent nonuniform intensity distributions due to birefringence introduced by the LCoS pixels. To remove this effect from the training light fields, we record images of a white cardboard scene, also using shifting pinhole arrays, to obtain a set of PSFs under each LCoS macropixel, and normalize the training light fields by these images. Figure 4-8(c) shows an example of a raw pinhole array image of a newspaper scene. By dividing Figure 4-8(c) by the LCoS PSF image shown in Figure 4-8(b), we obtain the normalized pinhole image shown in Figure 4-8(d).
Calibration We measure the projection matrix Φ by capturing the light field of a white cardboard scene modulated by the mask pattern. Again we use a shifting pinhole array, but rather than making each pinhole fully open (1.0) as in the case of training light field capture, we assign each pinhole a corresponding value (∈ [0, 1]) in the mask. An example of a mask-modulated pinhole array image is shown in Figure 4-8(e).
Figure 4-9: Calibrated projection matrix. Left: visualization of the matrix as a
multi-view image. Right: magnified crops from the two views.
We observed that the actual attenuation introduced by the LCoS was greater than the specified mask value. That is, the ratio of a captured pixel value in Figure 4-8(e) to the corresponding pixel value in Figure 4-8(b) was less than the LCoS macropixel value. To compensate for this nonlinearity, we assumed a gamma-curve relationship between LCoS pixel values and actual attenuations, and performed a linear search for the optimal gamma value for each color channel to obtain the expected attenuation ratios.
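The per-channel gamma fit amounts to a one-dimensional search; below is a minimal sketch under the assumption that pairs of commanded LCoS values and measured attenuation ratios are available (all variable names are illustrative).

    import numpy as np

    def fit_gamma(commanded, measured, gammas=np.linspace(1.0, 4.0, 301)):
        """Linear search for the gamma that best maps commanded LCoS
        macropixel values (in [0, 1]) to measured attenuation ratios,
        assuming attenuation ~ commanded**gamma; run once per channel."""
        errors = [np.sum((commanded ** g - measured) ** 2) for g in gammas]
        return gammas[int(np.argmin(errors))]

    # e.g., gamma_red = fit_gamma(lcos_values, measured_ratios_red)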
Figure 4-9 shows the calibrated projection matrix Φ, obtained from pinhole array images captured in the above-mentioned manner. Here, the projection "matrix" is shown as a multi-view image by rearranging the pixels of the pinhole images. Each view corresponds to the diagonal of one of the submatrices Φ_j. That is, the diagonal elements of Φ_j are the lexicographically ordered pixel values of the j-th view. These views contain three effects: 1) the shifted mask pattern (our optimized mask tiled over the LCoS area), 2) the color-dependent nonuniform LCoS PSF effect, leading to color and intensity variations between the views, and 3) view-dependent vignetting from the imaging lens.
Figure 4-10: Printed transparency-based prototype camera (a) and calibration setups
(b,c).
4.3.2 Secondary Hardware Design
Once the optical parameters are determined and dictionaries are learned, the proposed
light field camera could be implemented in a more compact form factor with a static
attenuation mask as shown in Figure 4-10(a). We fabricate a mask holder that fits
into the sensor housing of a Lumenera Lw11059 monochrome camera, and attach a
film with a random mask pattern. As the printer guarantees 25 µm resolution, we conservatively pick a mask resolution of 50 µm, which roughly corresponds to 6 x 6 pixels on the sensor. We therefore downsample the sensor image by a factor of 6, and crop out the center 200 x 160 region for light field reconstruction in order to avoid mask holder reflection and vignetting. The distance between the mask and the sensor is 1.6 mm. A Canon EF 50 mm f/1.8 II lens is used and focused at a distance of 50 cm. As the mask is static, we use an aperture-based light field capture method to calibrate the projection matrix Φ. We place a 10 x 10 mm² external aperture immediately in front of the lens as shown in Figure 4-10(b), and capture a white cardboard scene with 5 x 5 sub-apertures, each having an area of 2 x 2 mm², as shown in Figure 4-10(c).
Figure 4-11 shows reconstruction results. The reconstruction quality is modest because we did not learn dictionaries or optimize mask patterns for this setup (we used the dictionary learned for the LCoS setup and a random mask). Nevertheless, parallax was recovered.
Figure 4-11: Light field reconstruction from an image modulated by a static film mask. (a) Captured image. (b) One view from the calibrated projection matrix Φ. (c,d) Two views from the reconstructed light field. Yellow lines are superimposed to show the parallax. (e) All of the 5 x 5 views from the reconstructed light field.
[Figure 4-12 panels: a traditional camera (scene, lens, sensor); a mask-equipped DSLR (F = 50 mm lens, mask at t = 1.6 mm in front of the sensor); and a mask-equipped cell-phone camera (F = 5 mm lens).]
Figure 4-12: An illustration of placement of the optimal or random mask for two
commonly used camera designs. For traditional DSLRs with a lens of focal length
about 50 mm, the mask needs to be placed about 1.6 mm away from the sensor to
capture 5 x 5 angular views. For a mobile phone camera this distance reduces to
about 160 microns, due to reduced focal length.
Figure 4-13: Training light fields and central views for physical experiments.
4.3.3 Software
The algorithmic framework is a two-step process involving an offline dictionary learning stage and a nonlinear reconstruction.
Dictionary Learning We capture five training light fields, each captured with an
aperture setting of approx. 0.5 cm (f/2.8), with our prototype setup and randomly
extract one million 4D light field patches, each with a spatial resolution of 11 x
11 pixels and 5 x 5 angular samples. After applying coreset reduction [21], 50,000
remaining patches are used to learn a 1.7x overcomplete dictionary consisting of
5,000 light field atoms. The memory footprint of this learned dictionary is about 111
MB. We employ the Sparse Modeling Software [39] to learn this dictionary on an
workstation equipped with a 24-core Intel Xeon processor and 200 GB RAM in about
10 hours. This is a one-time preprocessing step.
For the physical experiments we capture five training sets, as shown in Figure 4-13. Each light field in the training set has a resolution of 480 x 270 pixels in space and 5 x 5 views. We randomly extract about 360,000 overlapping patches from each of the training light fields, each patch with a spatial resolution of 11 x 11 pixels and an angular resolution of 5 x 5. To increase variability amongst these extracted patches we employ the coreset technique discussed in Section B.2 to reduce this set to a tractable size of about 50,000 patches. This process is repeated for all the training light fields to generate a training set of about 250,000 patches. Coresets are again applied to reduce the final training set to about 50,000 patches.
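A minimal sketch of this learning stage, using the python bindings of the Sparse Modeling Software [39], is shown below. The training matrix is a random stand-in for the coreset-reduced patches, and the solver settings lambda1 and iter are illustrative assumptions; the exact values are not reported here.

```python
import numpy as np
import spams  # python bindings of the Sparse Modeling Software [39]

# Stand-in for the coreset-reduced training set: 50,000 vectorized 4D
# light field patches, 11 x 11 spatial x 5 x 5 angular = 3,025 entries each.
rng = np.random.default_rng(0)
X = np.asfortranarray(rng.standard_normal((11 * 11 * 5 * 5, 50000)))

# Learn a ~1.7x overcomplete dictionary of 5,000 light field atoms.
D = spams.trainDL(X, K=5000, lambda1=0.1, iter=1000)
print(D.shape)  # (3025, 5000)
```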
Figure 4-14: Training light fields and central views for simulated experiments.
Sparse Reconstruction For the real-world experiments, each light field is reconstructed with 5 x 5 views from a single sensor image with a resolution of 480 x 270 pixels. For this purpose, the coded sensor image is divided into overlapping 2D patches, each with a resolution of 11 x 11 pixels, by centering a sliding window around each sensor pixel. Subsequently, a small 4D light field patch is recovered for each of these windows. We employ the fast ℓ1-relaxed homotopy method described by Yang et al. [60] with the sparsity penalizing parameter λ set to 10, the tolerance to 0.001, and the number of iterations to 10,000. It should be noted that at each step the product of the calibrated projection matrix and the dictionary needs to be normalized for correct minimization of the ℓ1 norm. The reconstructed overlapping 4D patches are merged with a median filter. Each light field patch takes about 0.1 seconds to recover, resulting in a runtime of about 18 hours for all three color channels on an 8-core Intel i7 workstation with 16 GB of memory, on which the reconstruction is performed in parallel. Although the proposed method requires extensive computation times, we note that each patch is independent of all other patches. Hence, the reconstruction can be easily parallelized and significantly accelerated with modern high-end GPUs that have up to thousands of cores, or with cloud-based infrastructures.
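The per-patch recovery can be sketched as follows. This is not the homotopy solver of Yang et al. [60]; scikit-learn's Lasso serves as a stand-in ℓ1 solver, the mapping between λ and its alpha parameter is only approximate, and the function name and patch layout are illustrative. The column normalization of ΦD mentioned above is included.

```python
import numpy as np
from sklearn.linear_model import Lasso  # stand-in for the homotopy solver of [60]

def reconstruct_patch(x, Phi, D, lam=10.0):
    """Recover one 4D light field patch from its coded 2D projection.

    x   : (121,)      vectorized 11 x 11 coded sensor window
    Phi : (121, 3025) calibrated projection matrix for this window
    D   : (3025, N)   learned dictionary of light field atoms
    """
    A = Phi @ D
    norms = np.linalg.norm(A, axis=0)   # normalize columns of Phi*D so the
    norms[norms == 0] = 1.0             # l1 penalty weighs all atoms equally
    solver = Lasso(alpha=lam / x.size, max_iter=10000)
    solver.fit(A / norms, x)
    alpha = solver.coef_ / norms        # undo the column normalization
    return (D @ alpha).reshape(11, 11, 5, 5)  # one recovered 4D patch
```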
4.4 Results
All results discussed in this section are captured with our prototype compressive light field camera and reconstructed from a single sensor image. This image is a coded projection of the light field, and the employed mask pattern is optimized for the computed dictionary with the technique described in Section 4.2.2. The same optical code and dictionary are used in all examples, the latter being learned from captured light fields that do not include any of the shown objects. All training sets and captured data are publicly available on the project website or upon request.
Layered Diffuse Objects Figure 4-15 shows results for a set of cards at different distances from the camera's focal plane. A 4D light field with 5 x 5 views (upper right) is reconstructed from a single coded projection (upper left). Parallax for out-of-focus objects is observed (center row). By shearing the 4D light field and averaging all views, a synthetically refocused camera image can be computed in post-processing (bottom row).
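The shear-and-average operation behind the bottom row can be sketched as follows; this is a minimal illustration under an assumed (view, view, y, x) memory layout, with the refocus slope d chosen per target depth.

```python
import numpy as np
from scipy.ndimage import shift

def refocus(L, d):
    """Shear-and-average refocus of a 4D light field.

    L : (nu, nv, H, W) light field, e.g. 5 x 5 views
    d : refocus slope, in pixels of spatial shift per unit view offset
    """
    nu, nv = L.shape[:2]
    cu, cv = (nu - 1) / 2.0, (nv - 1) / 2.0
    out = np.zeros(L.shape[2:])
    for u in range(nu):
        for v in range(nv):
            # shear: translate each view in proportion to its angular offset
            out += shift(L[u, v], ((u - cu) * d, (v - cv) * d), order=1)
    return out / (nu * nv)  # average the sheared views
```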
Partly-occluded Environments Reconstructed views of a scene exhibiting more complex structures are shown in Figure 4-16. The toy is partly occluded by a high-frequency shrub; occluded areas of the toy are faithfully reconstructed.
Reflections and Refractions Complex lighting effects, such as the reflections and refractions exhibited by the dragon and the tiger in Figure ??, are successfully reconstructed with the proposed technique. In this scene, parts of the dragon are refracted through the head and shoulders of the glass tiger, whereas its back reflects and refracts the green background. We show images synthetically focused on foreground and background objects as well.
Animated Scenes The proposed algorithms allow 4D light fields to be recovered from a single 2D sensor image. Dynamic events can be recovered this way; to demonstrate this capability, we show several frames of an animated miniature
Figure 4-15: Light field reconstruction from a single coded 2D projection. The scene
is composed of diffuse objects at different depths; processing the 4D light field allows
for post-capture refocus.
Figure 4-16: Reconstruction of a partly-occluded scene. Two views of a light field reconstructed from a single camera image. Areas occluded by high-frequency structures
can be recovered by the proposed methods, as seen in the close-ups.
Figure 4-17: Light field reconstructions of an animated scene. We capture a coded sensor image for multiple frames of a rotating carousel (left) and reconstruct 4D light fields for each of them. The techniques explored in this thesis allow for higher-resolution light field acquisition than previous single-shot approaches.
carousel in Figure 4-17. Coded sensor images and reconstructions are shown. A visual demonstration of the recovered parallax and refocusing effects can be found in the video at http://www.kshitijmarwah.com/index.php?/research/compressivelight-field-photography/.
The light field camera described in this chapter is a first attempt to overcome the traditional resolution limits that have constrained light field capture for the last hundred years. Given an m x n sensor image, this practical camera design and algorithmic framework recovers a light field with m x n samples in space and p x p in angle. Though the reconstruction quality is depth dependent, we believe this is the first camera to demonstrate these resolutions in a single capture.
Chapter 5
Additional Applications
In this chapter, we outline a variety of additional applications for light field dictionaries and sparse coding techniques. In particular, we show applications in 4D light field compression and denoising. We show how to remove the coded patterns introduced by modulation masks so as to retrieve a conventional 2D photograph. We also show how to acquire high-resolution focal stacks from a single capture by slightly changing the optical setup and learning 3D dictionaries. This is critical for applications such as refocusing, where one does not need the full light field but only a high-resolution refocused image.
5.1 "Undappling" Images with Coupled Dictionaries
Although the optical acquisition setup proposed in this thesis allows for light fields to be recovered from a single sensor image, a photographer may want to capture a conventional 2D image as well. In a commercial implementation, this could be achieved if the proposed optical system were implemented with programmable spatial light modulators or with modulation masks that can be mechanically moved out of the optical path. The most straightforward method to remove the mask pattern is to divide out the 2D projection of the mask pattern itself. That leaves the in-focus region free of mask dapples but renders the out-of-focus areas with high-frequency artifacts. Another approach is to first recover a 4D light field and average the views to get the image back, but that is computationally costly.
To mitigate this problem, we use joint sparse coding to implicitly learn mappings from a modulated image to a demodulated or "undappled" image that remains on the two-dimensional manifold of images, in lieu of light field recovery. Let us consider two feature spaces: X (consisting of mask-modulated images of scenes as captured by our setup after dividing out the mask pattern) and Y (consisting of images of the same scenes as if there were no mask). Given these training sets, unlike the traditional sparse coding discussed so far, we employ joint sparse coding techniques to simultaneously learn two dictionaries D_x and D_y for the two feature sets, such that the sparse representation of a mask-modulated image x_i in terms of D_x is the same as the representation of the corresponding demodulated image y_i in terms of D_y. Hence, if we know our measurement x_i we can recover its underlying signal y_i.
Mathematically, the joint sparse coding scheme [61] can be realized as:

$$\min_{\{D_x, D_y, \alpha_i\}} \;\; \|x_i - D_x \alpha_i\|_2^2 + \|y_i - D_y \alpha_i\|_2^2 + \lambda \|\alpha_i\|_1 \tag{5.1}$$
The formulation requires a common sparse representation α_i that reconstructs both x_i and y_i. As shown by Yang et al. [61], the problem can be converted into a standard sparse coding scheme over the concatenated feature space of X and Y. Now, given a new test modulated image x_t, we find its representation in the learned dictionary D_x. The resulting coefficient vector is then multiplied by the dictionary D_y to generate the demodulated image.
We capture modulated images with our setup and divide out the mean mask pattern. Each image is divided into overlapping patches of resolution 7 x 7, forming a training set for the first feature space. Correspondingly, training images of the same scenes without any modulation are also captured and divided into overlapping patches of resolution 7 x 7 to form the other training set. Joint sparse coding is then performed using the software package described in [62].
Figure 5-1: "Undappling" a mask-modulated sensor image (left). The known projection of the mask pattern can be divided out; remaining noise patterns in out-of-focus
regions are further reduced using a coupled dictionary method (right).
Given a new modulated image, we first divide out the mean mask pattern and divide the result into overlapping patches of resolution 7 x 7. Using our jointly trained dictionary, we reconstruct demodulated image patches, which are then merged.
Following [61], we learn a coupled dictionary D_2D = [D_dap; D_undap] from a training set containing projected light fields both with and without the mask patterns. One part of that dictionary, D_dap, is used for reconstructing the 2D coefficients α, whereas the other is used to synthesize the "undappled" image as D_undap α. As opposed to the framework discussed in the previous chapter, the dictionaries and coefficients for this application are purely two-dimensional.
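A minimal sketch of this coupled pipeline, using the concatenation trick of Yang et al. [61] with the SPAMS python bindings, is given below; the training matrices are random stand-ins for the dappled/undappled patch pairs, and K and lambda1 are illustrative settings.

```python
import numpy as np
import spams

rng = np.random.default_rng(0)
X_dap = rng.standard_normal((49, 20000))    # mask-divided 7 x 7 patches (stand-in)
X_undap = rng.standard_normal((49, 20000))  # corresponding clean patches (stand-in)

# Stack both feature spaces and learn a single dictionary; its top half
# is D_dap, its bottom half D_undap, sharing one set of coefficients.
Z = np.asfortranarray(np.vstack([X_dap, X_undap]))
D = spams.trainDL(Z, K=512, lambda1=0.1, iter=500)
D_dap, D_undap = D[:49], D[49:]

# Undappling a new patch: sparse-code it in D_dap, synthesize with D_undap.
x = np.asfortranarray(rng.standard_normal((49, 1)))
alpha = spams.lasso(x, D=np.asfortranarray(D_dap), lambda1=0.1)
y = D_undap @ alpha.toarray()               # demodulated patch (vectorized)
```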
5.2 Light Field Compression
We illustrate the compressibility of a 4D light field in Figure 4-3, both quantitatively and, for a single 4D patch, also qualitatively. Compression is achieved by finding the best representation of a light field with a fixed number of coefficients. This representation can be found by solving the sparse approximation problem [42]

$$\min_{\alpha} \;\; \|l - D\alpha\|_2 \quad \text{subject to} \quad \|\alpha\|_0 \le k \tag{5.2}$$

In this formulation, l is a 4D light field patch that is represented by at most k atoms.
Figure 5-2: Light field compression. A light field is divided into small 4D patches and represented by only a few coefficients. Light field atoms achieve a higher image quality (PSNR 27.2 dB) than DCT coefficients (PSNR 24.5 dB).
As opposed to sparse reconstructions from coded 2D projections, light field compression strives to reduce the required data size of a given 4D light field. As this technique is independent of the light field acquisition process, we envision future applications in high-dimensional data storage and transfer.
Figure 5-2 compares an example light field compressed into a fixed number of DCT coefficients and light field atoms. For this experiment, a light field with 5 x 5 views is divided into distinct 9 x 9 x 5 x 5 spatio-angular patches that are individually compressed. Light field atoms allow for high image quality with a low number of coefficients and smooth transitions between neighboring patches. While this experiment demonstrates improved compressibility of a single example light field using atoms, further investigation is required to analyze the suitability of overcomplete dictionaries for compressing a wide range of different light fields.
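The per-patch step of Equation 5.2 can be sketched with a greedy ℓ0 solver such as orthogonal matching pursuit; the dictionary and patch below are random stand-ins, and k = 25 is an arbitrary coefficient budget.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
D = rng.standard_normal((9 * 9 * 5 * 5, 4000))  # stand-in learned dictionary
D /= np.linalg.norm(D, axis=0)                  # unit-norm atoms
l = rng.standard_normal(9 * 9 * 5 * 5)          # one 9x9x5x5 patch, vectorized

omp = OrthogonalMatchingPursuit(n_nonzero_coefs=25)  # enforce ||alpha||_0 <= 25
omp.fit(D, l)
l_hat = D @ omp.coef_                           # compressed approximation of l
```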
Figure 5-3: Light field denoising. Sparse coding and the proposed 4D dictionaries can
remove noise from 4D light fields.
5.3 Light Field Denoising
Another popular application of dictionary-based sparse coding techniques is image denoising [20]. Following this trend, we apply sparse coding techniques to denoise 4D light fields. Similar to light field compression, the goal of denoising is not to reconstruct higher-resolution data from a smaller number of measurements, but to represent a given 4D light field by a linear combination of a small number of noise-free atoms. In practice, this can be achieved in the same way as compression, i.e., by applying Equation 5.2. For the case of a noisy target light field l, this effectively applies a nonlinear four-dimensional denoising filter to the light field.
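A minimal denoising sketch follows: rather than fixing k, atoms are greedily added until the residual drops to the expected noise energy. The dictionary, patch, and noise level are illustrative stand-ins.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(1)
D = rng.standard_normal((9 * 9 * 5 * 5, 4000))  # stand-in learned dictionary
D /= np.linalg.norm(D, axis=0)
noisy = rng.standard_normal(9 * 9 * 5 * 5)      # stand-in noisy 4D patch
sigma = 0.1                                      # assumed per-pixel noise std

# Stop once the squared residual norm reaches the expected noise energy.
omp = OrthogonalMatchingPursuit(tol=noisy.size * sigma**2)
omp.fit(D, noisy)
denoised = D @ omp.coef_                         # sparse, noise-free approximation
```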
Figure 5-3 shows the central view and close-ups of one row of a noisy 4D light
field and its denoised representation.
We hope that 4D light field denoising will
find applications in emerging commercial light field cameras, as this technique is
independent of the proposed compressive reconstruction framework and could be
applied to light fields captured with arbitrary optical setups.
5.4 Compressive Focal Stacks
Figure 5-4: Focal stack setup. The focus mechanism of an SLR camera lens is intercepted and controlled by an Arduino board. Synchronized with a remote shutter, this allows us to capture high-quality focal stacks.

One application of 4D light fields is post-capture refocus. It could be argued that recovering the full 4D light field from a 2D sensor image is unnecessary for this particular application; a 3D focal stack may suffice. We evaluate compressive focal stacks in this section.
We capture focal stacks using a Canon T3 SLR camera with a modified Canon 50mm lens. The connection between the lens and camera body is blocked and intercepted with wires that are connected to an Arduino Duemilanove board. The latter is controlled from a PC and synchronized with a remote camera trigger that is also controlled, in software, from the same PC. With the lens set to autofocus, its focus can be programmed and adjusted to the desired focal setting. Figure 5-4 shows the setup with a scene in the background.
Using the described setup, we capture multiple scenes as training sets for the dictionary learning stage. The central focal slices of all training focal stacks are shown in Figure 5-5. Each stack contains six images focused at different distances. A 2x overcomplete dictionary with 3D focal stack atoms, each with a resolution of 10 x 10 x 6 pixels, is learned from this training set.
We simulate a coded focal sweep for another scene. The sweep refocuses the camera lens throughout the exposure time of a single image. An additional light modulator is assumed to be located directly on the sensor. Alternatively, the readout of each pixel could be controlled, making such a light modulator unnecessary.
Figure 5-5: Focal stack training set. Central focal slices for each scene are shown
along with downsampled versions of all the slices.
Figure 5-6: Compressive focal stack results. A single coded projection is simulated (lower left) and used to recover all six slices (three of them are shown in the bottom row) of the original focal stack (top row). While the focal stack is successfully recovered, we observe a slight loss of image sharpness for in-focus image regions. This could be overcome with optimized projections (as demonstrated for 4D light fields in the previous chapter) or by acquiring multiple shots.
For this experiment, we employ random codes: each focal slice is multiplied with a random pattern and integrated to form a coded 2D projection, as shown in Figure 5-6 (left). Given this image, we employ the compressive reconstruction techniques discussed in the previous chapter together with the learned dictionary to recover the six focal stack slices, as seen in Figure 5-6.
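The simulated coded sweep itself reduces to a per-slice modulation followed by integration; a minimal sketch, with a random stand-in for the captured focal stack:

```python
import numpy as np

rng = np.random.default_rng(0)
stack = rng.random((6, 480, 270))             # stand-in for six focal slices
codes = rng.integers(0, 2, size=stack.shape)  # random binary code per slice
coded = (codes * stack).sum(axis=0)           # single coded 2D projection
```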
While the optical setup for this experiment differs from the mask-based setup used for 4D light field acquisition, we gain a number of interesting insights into the sparsity-exploiting reconstruction techniques proposed in this thesis. First, coded projections of high-dimensional visual signals can be captured with a variety of different optical setups. The choice of setup depends on available hardware and optimal code designs; ideally, the optical design captures a single or multiple optimized coded projections. Second, the essential building blocks of high-dimensional visual signals can be learned from appropriate training sets; we demonstrate this for
3D focal stacks and 4D light fields. This approach, however, is equally applicable for
other signal types, such as 8D reflectance fields or the full plenoptic function. Using
the learned sparse representations or atoms of the visual signals, high-dimensional
reconstructions can be computed from the lower-dimensional coded projections. The
quality of the results depends on a variety of factors, such as sparsity of the signal
types in their atoms, dimensionality gap between projections and signal, quality of
the optical codes, as well as the number of captured projections.
Chapter 6
Discussion and Conclusion
In summary, this thesis explores compressive light field acquisition by analyzing and
evaluating sparse representations of natural light fields, optimized optical coding
strategies, robust high-dimensional light field reconstruction from lower-dimensional
coded projections, and additional applications such as 4D light field compression and
denoising.
Compressive light field acquisition is closely related to emerging compressive light
field displays [56, 30, 57]. These displays are compressive in the sense that the display
hardware has insufficient degrees of freedom to exactly represent the target light
field and relies on an optimization process to determine a perceptually acceptable
approximation. Compressive cameras are constrained in their degrees of freedom to
capture each ray of a light field and instead record coded projections with subsequent
sparsity-exploiting reconstructions. We envision future compressive image acquisition
and display systems to be a single, integrated framework that exploits the duality
between computational light acquisition and display. Most recently, researchers have
started to explore such ideas for display-adaptive rendering [26].
6.1 Benefits and Limitations
The primary benefits of the proposed computational camera architecture compared
to previous techniques are increased light field resolution and a reduced number of
required photographs. We show that reconstructions from coded light field projections captured in a single image can achieve a high quality; this is facilitated by the
proposed co-design of optical codes, nonlinear reconstruction techniques, and sparse
representations of natural light fields.
However, the achieved resolution of photographed objects decreases at larger distances from the camera's focal plane. Attenuation masks lower the light efficiency of the optical system as compared to refractive optical elements, such as lenslet arrays, yet they are less costly. The mask patterns are fundamentally limited by diffraction. Dictionaries have to be stored along with sparse reconstructions, thereby increasing memory requirements. Processing times of the discussed compressive camera design are higher than those of most other light field cameras. While these seem prohibitive at the moment, each small 4D patch is reconstructed independently; the computational routines discussed in this thesis are well suited for parallel implementation, for instance on GPUs.
The camera prototype exhibits a number of artifacts, including angle-dependent color and intensity nonlinearities as well as limited contrast. The observed color shifts are intrinsic to the LCoS, due to birefringence of the liquid crystals; this spatial light modulator (SLM) is designed to work with collimated light, but we operate it outside its designed angular range so as to capture ground truth light fields for evaluation and dictionary learning. Current resolution limits of the captured results are imposed by the limited contrast of the LCoS; multiple pixels have to be binned. Simple coded transparencies or alternative SLMs could overcome these optical limitations in future hardware implementations.
Atoms captured in overcomplete dictionaries are shown to represent light fields more sparsely than other basis representations. However, these atoms are adapted to the training data, including its depth range, aperture diameter, and general scene structures, such as occlusions and high-frequency textures. We demonstrate that even a few training light fields that include reflections, refractions, texture, and occlusions suffice to reconstruct a range of scene types. Nevertheless, we expect reconstruction quality to degrade for scenes that contain structures not captured in the training data, as for instance shown for parallax exceeding that of the training data in Section 4.2.4. A detailed analysis of how target-specific light field atoms are with respect to all possible parameters, however, is left for future work.
6.2 Future Work
Our current prototype camera is designed as a multipurpose device capturing coded
projections as well as reference light fields for dictionary learning and evaluating
reconstructions. Future devices will decouple this process. Whereas coded projections
can be recorded with conventional cameras enhanced by coded masks, the dictionary
learning process will rely increasingly on large online datasets of natural light fields.
These are likely to appear as a direct result of the commercial success of light field
cameras on the consumer market. Such developments have two advantages. First, a
larger range of different training data will make light field dictionaries more robust
and better adapted to specific applications.
Second, widely available dictionaries
will fuel research on novel optical camera designs or commercial implementations of
compressive light field cameras.
While we evaluate a range of existing light field camera designs and devise optimal coding strategies for them, we would like to explore new optical setups in the future. Evaluation with error metrics other than PSNR, such as perceptually-driven metrics, is an interesting avenue of future work. Finally, we plan to explore compressive acquisition of the full plenoptic function, adding temporal and spectral light variation to the proposed framework. While this increases the dimensionality of the dictionary and of the reconstruction problem, we believe that exactly this increase in dimensionality will further improve the compressibility and sparsity of the underlying visual signals.
The proposed compressive camera architecture is facilitated by the synergy of optical design and computational processing. We believe that the exploration of sparse
representations of high-dimensional visual signals has only just begun; fully understanding the latent structures of the plenoptic function, including spatial, angular,
spectral, and temporal light variation, seems one step closer but still not within reach. Novel optical designs and improved computational routines, both for data analysis and reconstruction, will have to be devised, placing future camera systems at the intersection of scientific computing, information theory, and optics engineering. We end by showcasing a comparison with existing commercial architectures in Figure 6-1. We believe that this thesis provides many insights indispensable for future computational camera designs.

Camera Design                                Spatial     Angular      Depth       Computation  Fabrication  Light
                                             Resolution  Resolution   Resolution               Cost         Transmission
Microlens Based (Lytro)                      Low         High         Medium      Low          Medium       High
Camera Arrays (Pelican Imaging)              High        High         High        Low          High         High
Compressive Mask-Based Light Field (Ours)    High        Medium/High  Medium      High         Low          Medium

Figure 6-1: We compare our mask-based light field camera design with existing commercial technologies, i.e., microlens-based approaches and camera arrays. This comparison can help commercial camera makers choose the right technology for their specific purposes.
Bibliography
[1] Edward Adelson and John Wang. Single Lens Stereo with a Plenoptic Camera.
IEEE Trans. PAMI, 14(2):99-106, 1992.
[2] Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. Approximating Extent Measures of Points. J. ACM, 51(4):606-635, 2004.
[3] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation. IEEE Trans. Signal Processing, 54(11):4311-4322, 2006.
[4] Michal Aharon, Michael Elad, and Alfred M. Bruckstein. K-SVD and its Nonnegative Variant for Dictionary Design. In Proc. SPIE Conference Wavelets,
pages 327-339, 2005.
[5] Amit Ashok and Mark A. Neifeld. Compressive Light Field Imaging. In Proc.
SPIE 7690, page 76900Q, 2010.
[6] S.D. Babacan, R. Ansorge, M. Luessi, P.R. Mataran, R. Molina, and A.K. Katsaggelos. Compressive Light Field Sensing. IEEE Trans. Im. Proc., 21(12):4746-4757, 2012.
[7] Stephen Becker, Jerome Bobin, and Emmanuel Candes. NESTA: A Fast and Accurate First-Order Method for Sparse Recovery. In Applied and Computational Mathematics, 2009. http://www-stat.stanford.edu/~candes/nesta.
[8] Tom Bishop, Sara Zanetti, and Paolo Favaro. Light-Field Superresolution. In
Proc. ICCP, pages 1-9, 2009.
[9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[10] E. Candes, J. Romberg, and T. Tao. Stable Signal Recovery from Incomplete
and Inaccurate Measurements. Comm. Pure Appl. Math., 59:1207-1223, 2006.
[11] Emmanuel Candes, Justin Romberg, and Terence Tao. Robust uncertainty
principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Information Theory, 52(2):489-509, 2006.
[12] Emmanuel Candes and Michael B. Wakin. An Introduction to Compressive
Sampling. IEEE Signal Processing, 25(2):21-30, 2008.
[13] Emmanuel J. Candes, Yonina C. Eldar, Deanna Needell, and Paige Randall.
Compressed Sensing with Coherent and Redundant Dictionaries. Appl. and
Comp. Harmonic Analysis, 31(1):59-73, 2011.
[14] Emmanuel J. Candes, Michael B. Wakin, and Stephen P. Boyd. Enhancing Sparsity by Reweighted ℓ1 Minimization. Journal of Fourier Analysis and Applications, 15(5):877-905, 2008.
[15] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic Decomposition by Basis Pursuit. SIAM J. on Scientific Computing, 20:33-61, 1998.
[16] D. Donoho. Compressed Sensing. IEEE Trans. Inform. Theory, 52(4):1289-1306,
2006.
[17] D. Donoho. For Most Large Underdetermined Systems of Linear Equations, the Minimal ℓ1-Norm Solution is also the Sparsest Solution. Comm. Pure Appl. Math., 59(6):797-829, 2006.
[18] Marco F. Duarte, Mark A. Davenport, Dharmpal Takhar, Jason Laska, Ting Sun, Kevin Kelly, and Richard Baraniuk. Single Pixel Imaging via Compressive Sampling. IEEE Signal Processing Magazine, 2008.
[19] J.M. Duarte-Carvajalino and G. Sapiro. Learning to sense sparse signals: Simultaneous Sensing Matrix and Sparsifying Dictionary Optimization. IEEE Trans.
Im. Proc., 18(7):1395-1408, 2009.
[20] M. Elad and M. Aharon. Image Denoising Via Sparse and Redundant Representations Over Learned Dictionaries. IEEE Trans. Im. Proc., 15(12):3736-3745,
2006.
[21] Micha Feigin, Dan Feldman, and Nir A. Sochen. From High Definition Image
to Low Space Optimization. In Scale Space and Var. Methods in Comp. Vision,
volume 6667, pages 459-470, 2012.
[22] Dan Feldman. Coresets and Their Applications. PhD thesis, Tel-Aviv University,
2010.
[23] Rob Fergus, A. Torralba, and W. T. Freeman. Random Lens Imaging. Technical Report TR-2006-058, MIT, 2006.
[24] T. Georgiev and A. Lumsdaine. Spatio-angular Resolution Tradeoffs in Integral Photography. Proc. EGSR, pages 263-272, 2006.
[25] S. Gortler, R. Grzeszczuk, R. Szeliski, and M. Cohen. The Lumigraph. In Proc.
ACM SIGGRAPH, pages 43-54, 1996.
[26] Felix Heide, Gordon Wetzstein, Ramesh Raskar, and Wolfgang Heidrich. Adaptive Image Synthesis for Compressive Displays. ACM Trans. Graph. (SIGGRAPH), 32(4):1-11, 2013.
[27] Yasunobu Hitomi, Jinwei Gu, Mohit Gupta, Tomoo Mitsunaga, and Shree K.
Nayar. Video from a Single Coded Exposure Photograph using a Learned OverComplete Dictionary. In Proc. IEEE ICCV, 2011.
[28] H. Ives. Parallax Stereogram and Process of Making Same. US patent 725,567,
1903.
[29] M.H. Kamal, M. Golbabaee, and P. Vandergheynst. Light Field Compressive
Sensing in Camera Arrays. In Proc. ICASSP, pages 5413-5416, 2012.
[30] D. Lanman, G. Wetzstein, M. Hirsch, W. Heidrich, and R. Raskar. Polarization
Fields: Dynamic Light Field Display using Multi-Layer LCDs. ACM Trans.
Graph. (SIGGRAPH Asia), 30:1-9, 2011.
[31] Douglas Lanman, Ramesh Raskar, Amit Agrawal, and Gabriel Taubin. Shield
Fields: Modeling and Capturing 3D Occluders. ACM Trans. Graph. (SIGGRAPH Asia), 27(5):131, 2008.
[32] Daniel D. Lee and Sebastian Seung. Learning the Parts of Objects by Nonnegative Matrix Factorization. Nature, 401:788-791, 1999.
[33] Anat Levin, William T. Freeman, and Frédo Durand. Understanding Camera
Trade-Offs through a Bayesian Analysis of Light Field Projections. In Proc.
ECCV, pages 88-101, 2008.
[34] Anat Levin, Samuel W. Hasinoff, Paul Green, Fredo Durand, and William T.
Freeman. 4D Frequency Analysis of Computational Cameras for Depth of Field
Extension. ACM Trans. Graph. (SIGGRAPH), 28(3):97, 2009.
[35] M. Levoy and P. Hanrahan. Light Field Rendering. In Proc. ACM SIGGRAPH,
pages 31-42, 1996.
[36] Chia-Kai Liang, Tai-Hsu Lin, Bing-Yi Wong, Chi Liu, and Homer H. Chen. Programmable Aperture Photography: Multiplexed Light Field Acquisition. ACM
Trans. Graph. (SIGGRAPH), 27(3):1-10, 2008.
[37] Gabriel Lippmann. La Photographie Intégrale. Académie des Sciences, 146:446-451, 1908.
[38] A. Lumsdaine and T. Georgiev. The Focused Plenoptic Camera. In Proc. ICCP,
pages 1-8, 2009.
[39] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online Dictionary Learning For
Sparse Coding. In International Conference on Machine Learning, 2009.
[40] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online Learning for Matrix Factorization and Sparse Coding. Journal of Machine Learning Research, 11:19-60, 2010.
[41] Julien Mairal, Guillermo Sapiro, and Michael Elad. Multiscale Sparse Representations with Learned Dictionaries. Proc. ICIP, 2007.
[42] B. K. Natarajan. Sparse Approximate Solutions to Linear Systems. SIAM J.
Computing, 24:227-234, 1995.
[43] Ren Ng. Fourier Slice Photography. ACM Trans. Graph. (SIGGRAPH), 24(3):735-744, 2005.
[44] Ren Ng, Marc Levoy, Mathieu Bredif, Gene Duval, Mark Horowitz, and Pat Hanrahan. Light Field Photography with a Hand-Held Plenoptic Camera. Technical
report, Stanford University, 2005.
[45] Rohit Pandharkar, Ahmed Kirmani, and Ramesh Raskar. Lens Aberration Correction Using Locally Optimal Mask Based Low Cost Light Field Cameras. Optical Society of America, 2010.
[46] Jae Young Park and Michael B Wakin. A geometric approach to multi-view
compressive imaging. EURASIP Journal on Advances in Signal Processing, 37,
2012.
[47] Christian Perwass and Lennart Wietzke. Single Lens 3D-Camera with Extended
Depth-of-Field. In Proc. SPIE 8291, pages 29-36, 2012.
[48] D. Reddy, A. Veeraraghavan, and R. Chellappa. P2C2: Programmable Pixel
Compressive Camera for High Speed Imaging. In Proc. IEEE CVPR, pages
329-336, 2011.
[49] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal Statist.
Soc B, 58(1):267-288, 1996.
[50] J. A. Tropp and S. J. Wright. Computational Methods for Sparse Solution of
Linear Inverse Problems. Proc. IEEE, 98(6):948-958, 2010.
[51] Joel A. Tropp. Topics in Sparse Approximation. PhD thesis, University of Texas
at Austin, 2004.
[52] Joel A. Tropp and Anna C. Gilbert. Signal Recovery From Random Measurements Via Orthogonal Matching Pursuit. IEEE Trans. Information Theory,
53(12):4655-4666, 2007.
[53] E. van den Berg and M. P. Friedlander. Probing the Pareto frontier for basis
pursuit solutions. SIAM Journal on Scientific Computing, 31(2):890-912, 2008.
http://www.cs.ubc.ca/labs/scl/spgl1.
[54] Ashok Veeraraghavan, Ramesh Raskar, Amit Agrawal, Ankit Mohan, and Jack
Tumblin. Dappled Photography: Mask Enhanced Cameras for Heterodyned
Light Fields and Coded Aperture Refocusing. ACM Trans. Graph. (SIGGRAPH), 26(3):69, 2007.
[55] G. Wetzstein, I. Ihrke, and W. Heidrich. On Plenoptic Multiplexing and Reconstruction. IJCV, pages 1-16, 2012.
[56] G. Wetzstein, D. Lanman, W. Heidrich, and R. Raskar. Layered 3D: Tomographic Image Synthesis for Attenuation-based Light Field and High Dynamic
Range Displays. ACM Trans. Graph. (SIGGRAPH), 2011.
[57] G. Wetzstein, D. Lanman, M. Hirsch, and R. Raskar. Tensor Displays: Compressive Light Field Synthesis using Multilayer Displays with Directional Backlighting. ACM Trans. Graph. (SIGGRAPH), 31:1-11, 2012.
[58] Bennett Wilburn, Neel Joshi, Vaibhav Vaish, Eino-Ville Talvala, Emilio Antunez,
Adam Barth, Andrew Adams, Mark Horowitz, and Marc Levoy. High Performance Imaging using Large Camera Arrays. ACM Trans. Graph. (SIGGRAPH),
24(3):765-776, 2005.
[59] Zhimin Xu and Edmund Y. Lam. A High-resolution Lightfield Camera with
Dual-mask Design. In Proc. SPIE 8500, page 85000U, 2012.
[60] Allen Yang, Arvind Ganesh, Shankar Sastry, and Yi Ma. Fast ℓ1-Minimization Algorithms and An Application in Robust Face Recognition: A Review. Technical report, UC Berkeley, 2010.
[61] Jianchao Yang, Zhaowen Wang, Zhe Lin, S. Cohen, and T. Huang. Coupled Dictionary Training for Image Super-Resolution. IEEE Trans. Im. Proc.,
21(8):3467-3478, 2012.
[62] Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using
sparse-representations. In Proc. Int. Conference on Curves and Surfaces, pages
711-730, 2012.