Tensor Photography: Exploring Space of 4D Modulations Inside Traditional Camera Designs

by Kshitij Marwah

Bachelor of Technology, Department of Computer Science and Engineering, Indian Institute of Technology, Delhi, India (2010)
Master of Technology, Department of Computer Science and Engineering, Indian Institute of Technology, Delhi, India (2010)

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, in partial fulfillment of the requirements for the degree of Master of Science in Media Arts and Sciences at the Massachusetts Institute of Technology, September 2013.

© Massachusetts Institute of Technology 2013. All rights reserved.

Author (signature redacted): Program in Media Arts and Sciences, August 9, 2013

Certified by (signature redacted): Ramesh Raskar, Associate Professor, Program in Media Arts and Sciences, Thesis Supervisor

Accepted by (signature redacted): Professor Patricia Maes, Associate Academic Head, Program in Media Arts and Sciences

Tensor Photography: Exploring Space of 4D Modulations Inside Traditional Camera Designs

by Kshitij Marwah

Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, on August 9, 2013, in partial fulfillment of the requirements for the degree of Master of Science in Media Arts and Sciences

Abstract

Light field photography has gained significant research attention in the last two decades: today, commercial light field cameras are widely available, demonstrating capabilities such as post-capture refocus, 3D photography, and viewpoint changes. But most traditional acquisition approaches either multiplex a low-resolution light field into a single sensor image or require multiple photographs to acquire a high-resolution light field. In this thesis, we design, implement, and analyze a new light field camera architecture that allows capture and reconstruction of higher-resolution light fields in a single shot. The proposed architecture comprises three key components: light field atoms as a sparse representation of natural light fields, an optical design that allows capture of optimized 2D light field projections, and robust sparse reconstruction methods to recover a 4D light field from a single coded 2D projection. In addition, we explore other applications including compressive focal stack reconstruction, light field compression, and denoising.

Thesis Supervisor: Ramesh Raskar
Title: Associate Professor, Program in Media Arts and Sciences

Tensor Photography: Exploring Space of 4D Modulations Inside Traditional Camera Designs

by Kshitij Marwah

Thesis Reader (signature redacted): Edward Boyden, Associate Professor of Media Arts and Sciences, Program in Media Arts and Sciences

Thesis Reader (signature redacted): Joseph Paradiso, Associate Professor of Media Arts and Sciences, Program in Media Arts and Sciences

Acknowledgments

This thesis and work would not have been possible without the tremendous support of my collaborators, friends, and family, who have patiently stuck with me. I would first like to thank Professor Ramesh Raskar, who saw something in me and brought me into the MIT Media Lab. I will always be indebted to him for the great opportunity he gave me to explore this area of research and life at the MIT Media Lab.
My collaborators Gordon Wetzstein and Yosuke Bando have been instrumental in getting this work noticed and published. Gordon is one of the most rigorous, hard-working, systematic, and professional people I have come across. His grasp of detail is unparalleled, and I have learned tremendously from him. I have never met an engineer as amazing as Yosuke. Every piece of code or data he gave me in the course of the thesis was flawless, and discussions with him helped shape what is now a great camera design.

Technical Contributions

I would like to acknowledge Ramesh, who gave me the problem of high-resolution light field photography to solve and let it be a part of my thesis. Gordon helped me understand light fields, camera designs, and the tiniest of details in light field display and synthesis; I would especially like to acknowledge his contribution to Chapters 3 and 4 in their succinct writing and explanation. Yosuke helped develop both hardware designs in Section 4.3 of Chapter 4. This camera would not have been a working prototype without him.

External Collaborators

I would like to thank Ashok Veeraraghavan, Kaushik Mitra, Guillermo Sapiro, Matthew O'Toole, Micha Feigin, Amit Aggarwal, and Rohit Pandharkar for various discussions and comments.

My family, mom, dad, and my sister, for patiently waiting for my phone calls back home while I was working to make this work happen. They are the best family one could have asked for and I love them. My partner Vartika Srivastava, who waited for two years in India for me to finish this work. I don't know how much of this thesis she will understand, but if not for the strength she gave, this work wouldn't have happened. My friends at the Media Lab, Prashant Patil, Anirudh Sharma, Austin Lee, Misha Sra, Andrea Colaco, Anette Von Kapri, Nan Zhao, Aydin Arpa, Ayush Bhandari, Belen Masia, and many, many more who made my stay here enriching and wonderful. My friends back in India, Jyoti Johar, Ridhima Sodhi, Aanchal Jain, Ekta Jaiswal, Rahul Gupta, and so many more with whom I have spent innumerable beautiful moments.

Contents

1 Introduction
    1.1 Light Field Photography
    1.2 Spatio-Angular Resolution Trade-Off in Light Field Camera Designs
    1.3 Contributions
    1.4 Organizational Overview

2 Prior Art
    2.1 Light Field Acquisition
        2.1.1 Hundred Years Ago
        2.1.2 Hand-held Light Field Photography
        2.1.3 Dappled Photography
        2.1.4 Focused Plenoptic Camera
        2.1.5 Camera Arrays and Gantry
    2.2 Compressive Sensing and Sparse Representations
    2.3 Compressive Computational Photography
        2.3.1 Single Pixel Camera
        2.3.2 Compressive Video Acquisition
        2.3.3 Compressive Light Field Imaging (Simulation)

3 Sparse Coding and Representations
    3.1 Primer on Sparse Coding
        3.1.1 Approximate Solutions
        3.1.2 The Analysis Problem
    3.2 Overview of Dictionary Learning Methods
        3.2.1 Light Field Atoms
        3.2.2 Generating "Good" Training Sets with Coresets
    3.3 Evaluating System Design Parameters
        3.3.1 Dictionary Design Parameters
        3.3.2 Sparse Reconstruction Parameters
        3.3.3 Evaluating Optimal Projection Matrices and Mask Patterns

4 Compressive Light Field Photography
    4.1 Problem Formulation
        4.1.1 Acquiring Coded Light Field Projections
        4.1.2 Reconstructing Light Fields from Projections
        4.1.3 Learning Light Field Atoms
    4.2 Analysis
        4.2.1 Interpreting Light Field Atoms
        4.2.2 What are Good Modulation Patterns?
        4.2.3 Are More Shots Better?
        4.2.4 Evaluating Depth of Field
        4.2.5 Comparing Computational Light Field Cameras
    4.3 Implementation
        4.3.1 Primary Hardware Design
        4.3.2 Secondary Hardware Design
        4.3.3 Software
    4.4 Results

5 Additional Applications
    5.1 "Undappling" Images with Coupled Dictionaries
    5.2 Light Field Compression
    5.3 Light Field Denoising
    5.4 Compressive Focal Stacks

6 Discussion and Conclusion
    6.1 Benefits and Limitations
    6.2 Future Work

List of Figures

1-1 A conventional camera sums over all the individual rays, or alternatively all the individual views as recorded by different points on the aperture, to generate the recorded image.

1-2 A light field camera works by placing a refractive or non-refractive element between the lens and the sensor (a microlens array in this case) to record this spatio-angular sampling of rays on the image sensor. This can also be thought of as a two-dimensional array of two-dimensional images, each from a different viewpoint on the aperture.

1-3 Effects such as post-capture refocusing can be shown via capturing the 4D light field.

2-1 The first light field camera design to sample angular information on 2D film using a microlens array. This was made by Lippmann in 1908 [37].

2-2 A main lens is added to Lippmann's design that is directly focused on the microlens array. The microlenses refract the incoming light based on incident angle to allow for the 4D sampling of light rays on the 2D sensor [44].
2-3 A cosine mask is placed in front of the sensor to modulate the light field in the frequency domain. The mask heterodynes the incoming light field, which can be tiled back together and inverted to recover the 4D radiance function [54].

2-4 In this design the main lens is not focused on the microlens array but at a plane just before it. This allows a flexible trade-off between angular and spatial resolution, though the product still remains equal to the sensor resolution [38].

2-5 One of the first light field acquisition devices, formed by carefully aligning and placing multiple cameras. Effects such as looking through bushes and synthetic aperture photography can be implemented effectively with this bulky and expensive design [58].

2-6 This gantry contains a rotating camera for capturing a light field by recording images at various viewpoints. Image-based rendering approaches can then be used to interpolate between viewpoints [35].

2-7 A dictionary learned from a number of natural image patches. This dictionary captures essential edges and features that contribute to image formation. These "atoms" are different from traditionally known bases such as Fourier, DCT, or wavelets.

2-8 RICE's single pixel camera is the first physical prototype based on compressive sensing. It allows reconstruction of a 2D image by taking a series of measurements on a single photodetector and using L1-minimization techniques [18].

2-9 In this design an LCoS-based system is used to control per-pixel exposure, implementing a mask pattern that modulates and sums over frames during capture. Dictionaries are learned from training videos, as shown on the left, to reconstruct video from a single coded exposure [27].

2-10 Conceptually, two compressive light field architectures have been proposed. The one on the left multiplexes by putting a random code on the aperture, while the one on the right places a coded lenslet array between the sensor and the lens. Both architectures require multiple shots for high resolution light field reconstruction [5].

3-1 Evaluating various approaches to dictionary learning and choosing training sets. The two most important evaluation criteria are speed (and memory footprint) of the learning stage as well as quality of the learned light field atoms. The former is indicated by timings, whereas the latter is indicated by the PSNR and visual quality of reconstructions from the same measurements (2D light field projected onto a 1D sensor image with a random modulation mask). We evaluate the K-SVD algorithm and its nonnegative variant and online sparse coding as implemented in the SPAMS software package. K-SVD is a very slow algorithm that limits practical application to small training sets containing 27,000 training patches (top left); coresets are tested to reduce a much larger training set with 1,800,000 patches to the same, smaller size (27,000 patches, top center). In all cases, K-SVD learns high-quality light field atoms that capture different spatial features in the training set, sheared by different amounts in the spatio-angular domain.
Unfortunately, K-SVD becomes too slow for practical application with large high-dimensional training sets; nevertheless, we use it as a benchmark in these flatland experiments. Online sparse coding (bottom row) is much faster and capable of handling large training sets. Unfortunately, the learned atoms are very noisy and lead to low-quality reconstructions (bottom left and center). Using coresets, much of the redundancy in large training sets can be removed prior to learning the atoms (lower right), thereby removing noise in the atoms and achieving the highest-quality reconstructions in the least amount of time.

3-2 Evaluating dictionary overcompleteness. The color-coded visualizations of dictionaries (top row) and the histograms (center row) illustrate the intrinsic dimension of these dictionaries. For more than 2x overcomplete dictionaries, most of the atoms are rarely used to adequately represent the training set. Reconstructions of a 2D light field that was not in the training set using the respective dictionaries (bottom row) show that the quality (PSNR) is best for 1-2x overcomplete dictionaries and drops below and above that range. Dictionaries with 0.1-0.5x overcompleteness do not perform well, because they simply do not contain enough atoms to sparsely represent the test light field. On the other hand, excessive overcompleteness (larger d) does not improve sparsity (smaller k) further, and k log(d/k) starts to increase, leaving the number of measurements insufficient.

3-3 Evaluating light field atom sizes. A synthetic light field (lower left) is projected onto a sensor image with a random modulation mask and reconstructed with dictionaries comprised of varying atom sizes. The angular resolution of light field and atoms is 3 x 3, but the spatial resolution of the atoms ranges from 9 x 9 to 21 x 21. Dictionaries for all experiments are 5x overcomplete. We observe best reconstruction quality for an atom size of 11 x 11. If chosen too small, the spatio-angular shears of objects at a distance to the focal plane will not be adequately captured by the light field atoms. Furthermore, the ratio between the number of measurements and number of unknowns is too low for a robust reconstruction with sparse coding approaches. For increasing atom resolutions, that ratio becomes more favorable for reconstruction; however, with an increasing spatial extent of an atom it also becomes increasingly difficult to represent the underlying light fields sparsely, leading to lower-quality reconstructions.

3-4 Evaluating sparse reconstruction methods. We reconstruct a 2D light field (top row) using the same dictionary and simulated sensor image with four different methods of sparse coding (rows 2-6) and different window merging functions. Although fast, orthogonal matching pursuit (OMP, second row) only achieves a low reconstruction quality. Reweighted $\ell_1$, as implemented in the NESTA package, yields high-quality reconstructions but takes about 15 times as long as OMP (third row). Basis pursuit denoise, as implemented in the SPGL1 solver package, is also high-quality but almost 400 times slower than OMP (fourth row). Due to its immense time requirements, this implementation is not suitable for high-resolution 4D light field reconstructions. The homotopy approach to solving the basis pursuit denoise problem (fifth row) is even faster than OMP and results in a reconstruction quality that is comparable with the best, but much slower, SPGL1 implementation.
3-5 Evaluating convergence for all algorithms using a single patch. Please note that OMP is a greedy algorithm; while all other algorithms minimize the $\ell_1$-norm of coefficients, OMP increments the number of used coefficients by one in each iteration.

3-6 Evaluating dictionary learning at various resolutions. We learn two dictionaries from a training set consisting of black and white text at resolutions of 128 x 128 and 64 x 64 at different depths from a simulated camera aperture of 0.25 cm. Reconstructions of a resolution chart placed at the same depths but with different resolutions are then performed. As shown in the plot, the PSNRs do not vary significantly for dictionaries learned at 64 x 64 and reconstruction at 128 x 128, and vice versa. This can be attributed to our patch-by-patch reconstruction that captures features at a resolution of 9 x 9 in this case. At such a patch size, features are generally scale-invariant. Moreover, for the case of resolution charts, these features are generally edges (spatial or angular) that do not depend on resolution.

3-7 Evaluating optimal mask patterns. We evaluate three kinds of mask patterns for both sliding and distinct patch reconstructions for a flatland 2D light field. A light field is divided into patches of resolution 20 (space) x 5 (angle) and is projected down with the corresponding mask to a 20-pixel coded sensor image. As seen in the figure, given a dictionary, an optimized mask pattern performs much better than a traditional random mask. Jointly optimizing for both mask and dictionary performs slightly better than the optimized mask, albeit at much higher computational times. The coherence value $\mu$ as shown in the figure decreases as we optimize for both the mask and the dictionary.

4-1 Illustration of ray optics, light field modulation through coded attenuation masks, and corresponding projection matrix. The proposed optical setup comprises a conventional camera with a coded attenuation mask mounted at a slight offset in front of the sensor (left). This mask optically modulates the light field (center) before it is projected onto the sensor. The coded projection operator is expressed as a sparse matrix $\Phi$, here illustrated for a 2D light field with three views projected onto a 1D sensor (right).

4-2 Visualization of light field atoms captured in an overcomplete dictionary. Light field atoms are the essential building blocks of natural light fields: most light fields can be represented by the weighted sum of very few atoms. We show that light field atoms are crucial for robust light field reconstruction from coded projections and useful for many other applications, such as 4D light field compression and denoising.

4-3 Compressibility of a 4D light field in various high-dimensional bases. As compared to popular basis representations, the proposed light field atoms provide better compression quality for natural light fields (plots, second row). Edges and junctions are faithfully captured (third row); for the purpose of 4D light field reconstruction from a single coded 2D projection, the proposed dictionaries combined with sparse coding techniques perform best in this experiment (bottom row).

4-4 Evaluating optical modulation codes and multiple-shot acquisition.
We simulate light field reconstructions from coded projections for one, two, and five captured camera images. One tile of the corresponding mask patterns is shown in the insets. For all optical codes, an increasing number of shots increases the number of measurements, hence reconstruction quality. Nevertheless, optimized mask patterns facilitate single-shot reconstructions with a quality that other patterns can only achieve with multiple shots.

4-5 Evaluating depth of field. As opposed to lenslet arrays, the proposed approach preserves most of the image resolution at the focal plane. Reconstruction quality, however, decreases with distance to the focal plane. Central views are shown (on focal plane) for full-resolution light field, lenslet acquisition, and compressive reconstruction; compressive reconstructions are also shown for two other distances. The three plots evaluate reconstruction quality for varying aperture diameters with a dictionary learned from data corresponding to the blue plot (aperture diameter 0.25 cm).

4-6 Illustration of different optical light field camera setups with a quantitative value $\mu$ for the expected reconstruction quality (lower value is better). While lenslet arrays have the best light transmission T (higher value is better), reconstructions are expected to be of lower quality. Masks coded with random or optimized patterns perform best of all systems with 50% or more transmission. Two masks are expected to perform slightly better with our reconstruction, but at the cost of reduced light efficiency.

4-7 Prototype light field camera. We implement an optical relay system that emulates a spatial light modulator (SLM) being mounted at a slight offset in front of the sensor (right inset). We employ a reflective LCoS as the SLM (lower left insets).

4-8 Pinhole array mask and captured images. For conciseness, only the region covering 12 x 10 pinholes is shown here. (a) Pinhole array displayed on the LCoS. (b) Image sensor recording of the LCoS pinhole array mask for a white cardboard scene. (c) Image sensor recording of the LCoS pinhole array mask for a newspaper scene. (d) Normalized image, i.e., (c) divided by (b). (e) Image sensor recording of the LCoS pinhole array mask for a white cardboard scene where each pinhole has a corresponding attenuation value according to the mask pattern.

4-9 Calibrated projection matrix. Left: visualization of the matrix as a multi-view image. Right: magnified crops from the two views.

4-10 Printed transparency-based prototype camera (a) and calibration setups (b,c).

4-11 Light field reconstruction from an image modulated by a static film mask. (a) Captured image. (b) One view from the calibrated projection matrix $\Phi$. (c,d) Two views from the reconstructed light field. Yellow lines are superimposed to show the parallax. (e) All of the 5 x 5 views from the reconstructed light field.

4-12 An illustration of placement of the optimal or random mask for two commonly used camera designs. For traditional DSLRs with a lens of focal length about 50 mm, the mask needs to be placed about 1.6 mm away from the sensor to capture 5 x 5 angular views.
For a mobile phone camera this distance reduces to about 160 microns, due to the reduced focal length.

4-13 Training light fields and central views for physical experiments.

4-14 Training light fields and central views for simulated experiments.

4-15 Light field reconstruction from a single coded 2D projection. The scene is composed of diffuse objects at different depths; processing the 4D light field allows for post-capture refocus.

4-16 Reconstruction of a partly-occluded scene. Two views of a light field reconstructed from a single camera image. Areas occluded by high-frequency structures can be recovered by the proposed methods, as seen in the close-ups.

4-17 Light field reconstructions of an animated scene. We capture a coded sensor image for multiple frames of a rotating carousel (left) and reconstruct 4D light fields for each of them. The techniques explored in this thesis allow for higher-resolution light field acquisition than previous single-shot approaches.

5-1 "Undappling" a mask-modulated sensor image (left). The known projection of the mask pattern can be divided out; remaining noise patterns in out-of-focus regions are further reduced using a coupled dictionary method (right).

5-2 Light field compression. A light field is divided into small 4D patches and represented by only a few coefficients. Light field atoms achieve a higher image quality than DCT coefficients.

5-3 Light field denoising. Sparse coding and the proposed 4D dictionaries can remove noise from 4D light fields.

5-4 Focal stack setup. The focus mechanism of an SLR camera lens is intercepted and controlled by an Arduino board. Synchronized with a remote shutter, this allows us to capture high-quality focal stacks.

5-5 Focal stack training set. Central focal slices for each scene are shown along with downsampled versions of all the slices.

5-6 Compressive focal stack results. A single coded projection is simulated (lower left) and used to recover all six slices (three of them are shown in the bottom row) of the original focal stack (top row). While the focal stack is successfully recovered, we observe a slight loss of image sharpness for in-focus image regions. This could be overcome with optimized projections (as demonstrated for 4D light fields in Chapter 4) or by acquiring multiple shots.

6-1 We compare our mask-based light field camera design with existing commercial technologies, i.e., microlens-based approaches and camera arrays. This comparison can help commercial camera makers choose the right technology for their specific purposes.

Chapter 1

Introduction

Since the invention of the first cameras, photographers have been striving to capture moments on film. Traditional cameras, be they film or digital, do not capture the individual rays of light that contribute to a moment. Instead, each pixel or film grain captures the sum of all rays over different angles contributing to a point in the scene. This leads to a loss of angular information that is critical for inferring viewpoint changes and depth.
To circumvent this problem, light field cameras were introduced to capture both spatial and angular information by multiplexing the incoming 4D light field onto a 2D image sensor. These cameras have defined a new era in camera technology, allowing consumers to easily capture, edit, and share moments. Commercial products such as those by LYTRO and Raytrix exist in the market, facilitating novel user experiences such as digital refocusing and 3D imaging.

Unfortunately, the technological foundations of these cameras are a century old and have not fundamentally changed since that time. Most currently available devices trade spatial resolution for the ability to capture different views of a light field, often reducing the final resolution by orders of magnitude. This trend directly counteracts the increasing resolution demands of the industry, with the race for megapixels being the most significant driving factor behind camera technology in the last decade.

In this dissertation we introduce a new mathematical and computational framework, combined with a novel optical coding strategy, to overcome these fundamental limitations in light field acquisition systems. The price we pay for recording high resolution light fields on a traditional 2D image sensor is increased computational processing. The goal of this thesis is to show how to transfer the fundamental optical limits into computational processing, a fair price for creating high resolution light field acquisition systems.

Figure 1-1: A conventional camera sums over all the individual rays, or alternatively all the individual views as recorded by different points on the aperture, to generate the recorded image.

Figure 1-2: A light field camera works by placing a refractive or non-refractive element between the lens and the sensor (a microlens array in this case) to record this spatio-angular sampling of rays on the image sensor. This can also be thought of as a two-dimensional array of two-dimensional images, each from a different viewpoint on the aperture.

1.1 Light Field Photography

As shown in Figure 1-1, a conventional camera takes in a 3D scene and sums over all incoming angles to generate the final image. A light field camera, on the other hand, places either a refractive or non-refractive element between the lens and the sensor to disambiguate this angular information in spatial pixels. Figure 1-2 shows a light field camera design with the main lens focused on a microlens array. As rays hit the microlens system, they are refracted based on their angle of incidence and thereby hit the sensor pixels at different locations. This implicitly records the incoming angular information onto spatial pixels.

Figure 1-3: Effects such as post-capture refocusing can be shown via capturing the 4D light field.

The recorded light field is a 4D function and can be thought of as a 2D array of 2D images, each with a different viewpoint over the aperture. It is a simplification of the full plenoptic function in free space, which records the space of all light rays emanating from a scene, including position, angle, wavelength, and time. Once this 4D radiance function is recorded, one can either average all the views to get a conventional image, or shift and add the views to refocus at different focal planes [44], as shown in Figure 1-3. Furthermore, the captured light field can also be used for depth map generation and for generating 3D content for new glasses-free light field displays [56].
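To make the shift-and-add refocusing operation concrete, the sketch below averages all views after shifting each one in proportion to its position on the aperture. This is a minimal illustration, assuming the light field is stored as a 4D NumPy array indexed by angular coordinates (u, v) and spatial coordinates (x, y); the helper name `refocus` and the integer-pixel shifts are illustrative simplifications, not the exact procedure of [44].

```python
import numpy as np

def refocus(light_field, alpha):
    """Shift-and-add refocusing: average all views after shifting each one
    in proportion to its position on the aperture.

    light_field: 4D array of shape (pu, pv, sx, sy), angular x spatial.
    alpha: focus parameter; 0 reproduces the conventional photograph,
           other values synthetically focus at different depths.
    """
    pu, pv, sx, sy = light_field.shape
    image = np.zeros((sx, sy))
    for u in range(pu):
        for v in range(pv):
            # integer-pixel shift proportional to the signed aperture position
            du = int(round(alpha * (u - pu // 2)))
            dv = int(round(alpha * (v - pv // 2)))
            image += np.roll(light_field[u, v], (du, dv), axis=(0, 1))
    return image / (pu * pv)
```

Averaging without any shift (alpha = 0) reproduces the conventional photograph of Figure 1-1; sweeping alpha produces the refocus sequence of Figure 1-3.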
1.2 Spatio-Angular Resolution Trade-Off in Light Field Camera Designs

Though there are immense applications of light field camera designs, the price one has to pay is a permanent trade-off between spatial and angular resolution. As seen in Figure 1-2, the spatial resolution of each view is reduced by a factor equal to the number of angular samples. Mathematically, for any view or refocused image the following equation always holds:

$$S = p_x \times p_y \times s_x \times s_y \qquad (1.1)$$

where $S$ is the total sensor resolution, $p_x$ and $p_y$ are the numbers of angular views in the x and y dimensions, respectively, and $s_x$ and $s_y$ give the spatial resolution of each view. This drop in resolution can be up to about 100 times, depending on the number of angular views that need to be sampled. In commercial cameras like LYTRO, a high-resolution 20 MP sensor is converted into a measly 0.1 MP light field for refocusing. It turns out that as long as one operates in the well-known Shannon-Nyquist regime of signal processing, this equation will always hold.

Recent advances in the theory of sampling and reconstruction of sparse signals [10] have shown that one can go beyond the Shannon-Nyquist sampling rate if the signal is known to be sparse or compressible in a certain domain. These domains can be traditionally known orthonormal bases such as Fourier, DCT, and PCA, or redundant dictionaries. Such coherent and overcomplete dictionaries learn essential features from a series of training data, adapting to the signal under consideration. This generally allows for a data-aware representation, as opposed to traditional bases that are signal-agnostic.

We build on these recent theories and frameworks to propose, develop, and evaluate a new light field camera design that overcomes the traditional limits of the spatio-angular resolution trade-off. We show, with practical results, a working prototype of a light field camera with enhanced resolution, using priors captured in these overcomplete dictionaries.

1.3 Contributions

We explore compressive light field photography with a novel algorithmic framework and optical co-design strategies to overcome traditional resolution limits. In particular, in this thesis we make the following contributions:

- We propose compressive light field photography as a system combining optically-coded light field projections and nonlinear computational reconstructions that utilize overcomplete dictionaries as a sparse representation of natural light fields.

- We introduce light field atoms as the essential building blocks of natural light fields; these atoms are not only useful for high-resolution light field reconstruction from coded projections but also for compressing and denoising 4D light fields.

- We analyze existing compressive light field cameras and evaluate sparse representations for such high-dimensional signals. We demonstrate that the proposed atoms combined with optimized optical codes allow for light field reconstruction from a single photograph.

- We build a prototype compressive light field camera and demonstrate successful recovery of partially-occluded environments, refractions, reflections, and animated scenes.

1.4 Organizational Overview

- Chapter 2 includes a survey of related works in the fields of computational photography, compressive sensing, and light field capture.

- Chapter 3 gives a primer on sparse coding and representation strategies, with a rigorous evaluation of various design parameters for dictionary learning and sparsity-exploiting reconstructions.
- Chapter 4 demonstrates a practical light field camera design derived from these evaluations, including results and reconstructions for natural light field scenarios from a single coded image.

- Chapter 5 showcases other applications of overcomplete dictionaries, including compressive focal stack acquisition, compression, and denoising.

- Chapter 6 discusses future directions towards building the ultimate camera system that can capture the full plenoptic function, including the parameters of time and wavelength.

Chapter 2

Prior Art

2.1 Light Field Acquisition

2.1.1 Hundred Years Ago

The art of capturing light fields started in 1902, when Frederick Ives [28] patented the idea of a parallax stereogram camera. He placed a pinhole mask near the sensor of the camera to sample the angular content of the light field. The most significant contribution, though, was made by Gabriel Lippmann [37], who placed a microlens array in front of the film, as shown in Figure 2-1.

Figure 2-1: The first light field camera design to sample angular information on 2D film using a microlens array. This was made by Lippmann in 1908 [37].

Figure 2-2: A main lens is added to Lippmann's design that is directly focused on the microlens array. The microlenses refract the incoming light based on incident angle to allow for the 4D sampling of light rays on the 2D sensor [44].

2.1.2 Hand-held Light Field Photography

In 2005, researchers at Stanford added a main lens to Lippmann's design, with its focus on the microlens array [44]. As shown in Figure 2-2, the light field incident on the microlenses is refracted based on the incidence angle to record angular samples on the 2D image sensor. Recently, lenslet-based systems have been integrated into digital cameras [1, 44]; consumer products are now widely available. Since spatial pixels are used to record angular samples, a trade-off in resolution occurs, as established in the previous chapter. This permanently reduces the final resolution of the captured image by a factor equal to the number of angular samples.

2.1.3 Dappled Photography

In 2007, non-refractive light-modulating codes or masks were introduced to capture light fields with better light efficiency than pinhole arrays [54, 31, 55]. These codes optically heterodyne the incoming light field, creating copies of its spectrum in the Fourier domain. The distance of the mask to the sensor is chosen so as to avoid aliasing in the frequency domain. By taking an inverse Fourier transform of the recorded image and aligning the angular samples properly, the light field can be reconstructed. It is assumed that the light field is band-limited for these designs.

Figure 2-3: A cosine mask is placed in front of the sensor to modulate the light field in the frequency domain. The mask heterodynes the incoming light field, which can be tiled back together and inverted to recover the 4D radiance function [54].

Figure 2-3 shows a couple of designs for mask-based capture of light fields that are more light-efficient than traditional pinhole arrays. Nevertheless, all of these approaches sacrifice image resolution: the number of sensor pixels is the upper limit on the number of light rays captured. Other techniques have been proposed to correct lens aberrations by creating spatially optimal filters [45]. This technique combines the optical heterodyning approach discussed above with locally optimal mask patterns that turn the deblurring problem into a well-posed one, allowing effective solutions.
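The frequency-domain recovery described above can be sketched as follows in flatland. This is a conceptual illustration only, assuming a 2D (space x angle) light field whose spectral replicas sit side by side in the 1D sensor spectrum; the tile ordering, windowing, and normalization details of the actual method [54] are omitted, and the function name is hypothetical.

```python
import numpy as np

def heterodyne_demux_flatland(sensor_line, n_views):
    """Conceptual flatland demultiplexing: the mask places n_views spectral
    replicas of the 2D (space x angle) light field side by side in the 1D
    sensor spectrum. Assuming the light field is band-limited, cutting the
    spectrum into tiles, stacking them, and inverting in 2D recovers the
    flatland light field.
    """
    spectrum = np.fft.fft(sensor_line)
    n = sensor_line.size // n_views               # width of one spectral tile
    tiles = spectrum[: n * n_views].reshape(n_views, n)
    # rows now index angular frequency; a 2D inverse FFT yields space x angle
    return np.fft.ifft2(tiles).real
```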
2.1.4 Focused Plenoptic Camera

Within these limits of resolution trade-offs [24, 33], alternative designs have been proposed that favor spatial resolution over angular resolution [38], as shown in Figure 2-4. By slightly changing the optical design so that the main lens is focused not on the microlens array but at a plane before it, one can flexibly trade off spatial and angular resolution. The product of the two still remains equal to the number of sensor pixels.

Figure 2-4: In this design the main lens is not focused on the microlens array but at a plane just before it. This allows a flexible trade-off between angular and spatial resolution, though the product still remains equal to the sensor resolution [38].

A number of techniques have been developed to compute high resolution images using super-resolution on the 4D light field. These techniques rely on pixel shifts between different views and often reconstruct a high resolution focal stack from the given light field. In these approaches, one can think of a low-resolution refocused image as a high-resolution image blurred by a depth-dependent kernel. Since depth is known based on the microlens array alignment and distance, we can invert the problem (as in super-resolution) to get a high-resolution refocused image.

2.1.5 Camera Arrays and Gantry

In order to fully preserve image resolution, current options include either camera arrays [58] or taking multiple photographs with a single camera [35, 25, 36]. Camera arrays, as shown in Figure 2-5, are usually bulky and expensive and need to be carefully calibrated, aligned, and fired to record high resolution light fields. The alternative is to use time-sequential approaches, in which one can either modulate with a mask over the aperture and take multiple shots, or capture photos with a rotating gantry as shown in Figure 2-6. These approaches do not work for dynamic scenes.

Figure 2-5: One of the first light field acquisition devices, formed by carefully aligning and placing multiple cameras. Effects such as looking through bushes and synthetic aperture photography can be implemented effectively with this bulky and expensive design [58].

Figure 2-6: This gantry contains a rotating camera for capturing a light field by recording images at various viewpoints. Image-based rendering approaches can then be used to interpolate between viewpoints [35].

2.2 Compressive Sensing and Sparse Representations

It is traditionally known that to reconstruct a signal faithfully from its measurements, the number of samples has to be greater than twice the bandwidth of the signal (the Shannon-Nyquist theorem). In recent years, this assumption has been challenged for the case of sparse or compressible signals [10]. The theory of compressive sensing shows that a $k$-sparse signal requires significantly fewer measurements for exact recovery, provided it is sampled in a way that preserves signal energy. More specifically, given a measurement vector $i \in \mathbb{R}^m$ that contains a coded down-projection of a $k$-sparse signal $l \in \mathbb{R}^n$, we wish to recover a set of sparse coefficients $\alpha \in \mathbb{R}^d$ that form an accurate representation of the signal in some basis or dictionary $D \in \mathbb{R}^{n \times d}$:

$$i = \Phi l = \Phi D \alpha. \qquad (2.1)$$
It is shown that if the number of measurements satisfies $m \geq \mathrm{const} \cdot k \log(d/k)$ and the sensing matrix $\Phi$ satisfies the restricted isometry property (RIP) [11], exact recovery is guaranteed. In general compressive sensing frameworks, $\Phi$ is taken to be a random Gaussian or Bernoulli matrix; RIP guarantees for these matrices are known and proven [11].

The problem of finding the right basis that represents a class of signals sparsely is still an open and active area of research. Generally available transformations such as Fourier, DCT, or wavelets are signal-agnostic and are known to represent a wide variety of signals with few coefficients. Recently, the field of dictionary learning has emerged, which tries to learn essential features from a wide variety of training data by solving an optimization problem [3]. This general problem of learning an overcomplete dictionary $D$ that sparsifies a class of signals can be expressed as

$$\underset{D, A}{\text{minimize}} \;\; \|L - DA\|_F \quad \text{subject to} \quad \forall i, \; \|A_i\|_0 \leq k.$$

Here, $D \in \mathbb{R}^{n \times d}$ is the overcomplete dictionary and $L \in \mathbb{R}^{n \times q}$ is the training set that contains $q$ signals or patches that represent the desired class of signals well. An algorithm tackling this problem will not only compute $D$ but also the sparse coefficients $A \in \mathbb{R}^{d \times q}$ that approximate the training set.

Figure 2-7: A dictionary learned from a number of natural image patches. This dictionary captures essential edges and features that contribute to image formation. These "atoms" are different from traditionally known bases such as Fourier, DCT, or wavelets.

Figure 2-7 shows an overcomplete dictionary learned from millions of image patches. As seen, these newly learned basis functions capture essential features of natural image formation, such as edges at different orientations. Any natural image can be thought of as a superposition of these dictionary elements with varied weights.

2.3 Compressive Computational Photography

2.3.1 Single Pixel Camera

One of the first practical demonstrations of compressive sensing has been the single pixel camera developed at Rice University [18]. Their system images the scene through a lens onto a DMD device. The DMD implements a Bernoulli matrix, either allowing each light ray to pass through or deflecting it away. A number of measurements are recorded sequentially on a single photodiode. The signal under estimation is assumed to be sparse in the DCT domain and is recovered using compressive sensing approaches.

Figure 2-8: Rice's single pixel camera is the first physical prototype based on compressive sensing. It allows reconstruction of a 2D image by taking a series of measurements on a single photodetector and using L1-minimization techniques [18].

2.3.2 Compressive Video Acquisition

Following this, compressive sensing has recently been applied to video acquisition, either by modulating the incoming light source [48] or through per-pixel exposure control [27]; see Figure 2-9. Modulating the light source with random strobing allows for the reconstruction of periodic events that are sparse in the Fourier domain (a periodic event has only one frequency component in the Fourier domain). On the other hand, per-pixel control of exposure multiplexes multiple frames of video into a single coded image that can be inverted to recover the video during the exposure using compressive sensing.

Figure 2-9: In this design an LCoS-based system is used to control per-pixel exposure, implementing a mask pattern that modulates and sums over frames during capture. Dictionaries are learned from training videos, as shown on the left, to reconstruct video from a single coded exposure [27].
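As an illustration of the per-pixel coded exposure model of [27], the sketch below simulates how multiple video frames are modulated and summed into a single coded sensor image; the array shapes and the random on/off code are hypothetical stand-ins for the LCoS hardware.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: T video frames of H x W pixels during one exposure.
T, H, W = 8, 64, 64
video = rng.random((T, H, W))             # stand-in for the scene over time
code = rng.integers(0, 2, (T, H, W))      # per-pixel on/off exposure pattern

# The sensor integrates the code-modulated frames into one coded image.
coded_image = (code * video).sum(axis=0)

# Recovery would then sparse-code 'coded_image' against a dictionary learned
# from training videos, using the pursuit methods discussed in Chapter 3.
```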
2.3.3 Compressive Light Field Imaging (Simulation)

The idea of compressive light field acquisition itself is not new, either. Kamal et al. [29] and Park and Wakin [46], for instance, simulate a compressive camera array. Recently, researchers have started to explore compressive light field acquisition with a single camera. Optical coding strategies include randomly coded apertures [5, 6], coded lenslets [5], a combination of coded mask and aperture [59], and random mirror reflections [23]. Unfortunately, all of these techniques are based on simulations, still require multiple images to be recorded to reconstruct the light field, and are not suited for dynamic scenes. They do, however, succeed in reducing the number of shots compared to their non-compressive counterparts [36]. Fergus et al. [23] require significant changes to the optical setup, making it difficult to capture conventional 2D images. The work by Xu and Lam [59] is most closely related to ours. However, they only show simulated results and employ simple light field priors based on total variation (TV). Furthermore, they propose an optical setup using dual-layer masks, but their choice of mask patterns (random and sum-of-sinusoids) reduces the light efficiency of the optical system to less than 5%. It could also be argued that light field super-resolution [8] is a form of compressive light field acquisition; higher-resolution information is recovered from microlens-based measurements under Lambertian scene assumptions. The fundamental resolution limits of microlens cameras, however, are depth-dependent [47]. We show that mask-based camera designs are better suited for compressive light field sensing and derive optimized single-device acquisition setups.

Figure 2-10: Conceptually, two compressive light field architectures have been proposed. The one on the left multiplexes by putting a random code on the aperture, while the one on the right places a coded lenslet array between the sensor and the lens. Both architectures require multiple shots for high resolution light field reconstruction [5].

In this thesis, we demonstrate that light field atoms captured in overcomplete dictionaries represent natural light fields more sparsely than previously employed bases. We evaluate a variety of light field camera architectures and show that mask-based approaches provide a good tradeoff between expected reconstruction quality and optical light efficiency; we derive optimized mask patterns with approximately 50% light transmission that allow for high-quality light field reconstructions from a single coded projection.

Chapter 3

Sparse Coding and Representations

Our work builds on recent advances in the signal processing community. Here we outline and evaluate relevant mathematical tools that help in a robust and efficient inversion of the underdetermined system of equations formulated in Equation 2.1. As mentioned, one of the key challenges in resolution-preserving light field photography is the choice of a dictionary in which natural light fields are sparse. We discuss approaches to learn light field atoms: essential building blocks of natural light fields that sparsely represent such high-dimensional signals.

In this chapter, our focus is to give readers an intuitive understanding of the various design parameters for sparse coding and reconstruction; we treat the light field as the signal under consideration and evaluate these parameters in 2D flatland. This motivates the choice of values as we move ahead to build a high-resolution compressive light field camera in the following chapter.

3.1 Primer on Sparse Coding

This section reviews the mathematical background of sparse coding. Much progress on this topic has been made in the signal and information theory community throughout the last decade. The field has exploded in the last five years, and this section serves merely as a concise overview of mathematical tools relevant for compressive light field photography. For our purposes of building practical camera designs, sparse reconstruction approaches are also evaluated based on the following criteria: speed, quality, and ease of implementation.
In this chapter, our focus is to give readers an intuitive understanding of various design parameters for sparse coding and reconstruction we consider the signal under consideration as the light field and evaluate them in 2D flat land. This helps motivate choice of values as we move ahead in building a high resolution compressive light field camera in the chapter ahead. 3.1 Primer on Sparse Coding This section reviews the mathematical background of sparse coding. Much progress on this topic has been made in the signal and information theory community throughout the last decade. This field has exploded in the last five years and this section serves merely as a concise overview of mathematical tools relevant for compressive light field photography. For our purposes of building practical camera designs sparse reconstruction approaches are also evaluated based on the following criteria: speed, 39 quality, and ease of implementation. Let us start with the problem statement. Given a vectorized sensor image i C R" that contains a coded projection of the incident vectorized light field 1 C R', we wish to recover a set of sparse coefficients a C Rd that form an accurate representation of the light field in some basis or dictionary V C Rnxd i = (PI= (Pvc. (3.1) The challenges here are twofold. First, the number of measurements (sensor pixels) m is significantly lower than the number of unknowns (light field rays) n. Second, assuming that tools from the signal processing community can be employed to solve this problem, which basis or dictionary V provides a sufficiently sparse representation for natural light fields? While the next section is mainly concerned with answering the latter question, we outline tools that are at our disposal to tackle the first challenge in this section. Solving an underdetermined linear system, such as Equation 3.1, is challenging Our work builds on recent advances in because it has infinitely many solutions. compressed sensing (see, e.g. [16, 12]) to solve such equation systems. The general idea is that, under certain conditions, underdetermined systems can be solved if the unknowns are either sparse or can be represented in a sparse basis or overcomplete dictionary. A signal is said to be k-sparse, if it has at most k nonzero coefficients. Mathematically, sparsity is expressed by the to pseudo-norm .0 that simply counts the number of nonzero elements in a vector. The answer to the following problem is at the core of finding a robust solution to Equation 3.1: minimize subject to ||aH(o @} 11i - 'tVaI12 < c (3.2) This formulation seeks the sparsest representation of a signal that achieves a prescribed approximation error. Equation 3.2 can similarly be stated with equality 40 constrains; we focus on bounded errors as the measurements taken by computational cameras usually contain sensor noise. A different, but closely related problem is that of sparsity-constrained approximation: minimize 1|i - <b'a||2 subject to |ka||O f Cel (3.3) < K Here, the objective is to find a signal that has at most ; nonzero coefficients and minimizes the residual. Problems 3.2 and 3.3 can be expressed using a Lagrangian function that balances the twin objectives of minimizing both error and sparsity as minimize 1i - <bDa12 + A ||a||o . {a} (3.4) The above problems (Eqs. 3.2- 3.4) are combinatorial in nature. Finding optimal solutions is NP-hard and, therefore, intractable for high resolutions [42]. 
The two most common approaches to tackle these problems are greedy methods and convex relaxation methods (see, e.g., [50, 51]). The remainder of this section outlines methods for both.

3.1.1 Approximate Solutions

In this section, we briefly review the two most common approximate solutions to the problems outlined above: greedy algorithms and convex relaxation methods.

Greedy Methods

Greedy methods for sparse approximation are simple and fast. In these iterative approaches, the dictionary atom most strongly correlated to the residual part of the signal is chosen and the corresponding coefficient is added to the sparse coefficient vector (MP). In addition, a least-squares minimization can be added to each iteration so as to significantly improve convergence (OMP).

Matching Pursuit (MP) is a greedy method for sparse approximation that constructs a sparse approximation one step at a time by selecting the atom most strongly correlated with the residual part of the signal and using it to update the current approximation. One way to accomplish this is to choose the atom most strongly correlated with the residual by computing inner products between the residual and all atoms. Then, the coefficient corresponding to the chosen atom is updated with the inner product of that atom and the residual [51]. This is done iteratively.

Orthogonal Matching Pursuit (OMP) has been used for decades and is one of the earliest approaches to sparse approximation. While independently discovered by several researchers, a detailed treatise can be found in [52]. Just as MP, OMP is a greedy algorithm and therefore extremely fast. The improvement over MP comes from the fact that OMP adds an additional least-squares minimization in each iteration. Similarly to MP, in each iteration OMP picks the atom that contributes most to the overall residual and adds the corresponding coefficient to the sparse representation. In addition to picking that coefficient, OMP runs a least-squares minimization over all coefficients picked up to the current iteration to obtain the best approximation over the atoms that have already been chosen. One disadvantage of OMP is that exact recovery of the signal is only guaranteed for a very low coherence value and that the sparsity of the signal has to be known; the latter is rarely the case in practice.
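A minimal OMP implementation along these lines is sketched below for reference; it assumes a generic measurement matrix with roughly normalized columns and is not the optimized solver used in our experiments.

```python
import numpy as np

def omp(A, y, k, tol=1e-6):
    """Orthogonal matching pursuit: greedily build a k-sparse x with y ~ A x.

    A: (m, d) combined measurement/dictionary matrix (e.g., Phi @ D) with
       roughly normalized columns; y: (m,) measurements; k: target sparsity.
    """
    x = np.zeros(A.shape[1])
    residual = y.astype(float).copy()
    support = []
    coeffs = np.array([])
    for _ in range(k):
        # atom most strongly correlated with the current residual
        j = int(np.argmax(np.abs(A.T @ residual)))
        if j not in support:
            support.append(j)
        # least-squares fit over all atoms chosen so far (the "O" in OMP)
        coeffs, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coeffs
        if np.linalg.norm(residual) < tol:
            break
    x[support] = coeffs
    return x
```

Dropping the least-squares step and accumulating inner products instead would turn this sketch into plain MP.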
A lower bound can, for instance, be placed on the number of measurements m for a k-sparse d-dimensional signal [12]: m > const k log (). (3.5) Solutions to the convexified problems can be found with a variety of approaches, including linear programming or nonlinear programming, such as interior point methods [9]. In the following, we discuss several different formulations for convex relaxations of Equations 3.2- 3.4. Basis Pursuit (BP) replaces the to-norm of Equation 3.2 with an fi-norm, but also uses equality constraints for the measurements [15]: minimize ||a|I( subject to <bDc = i fal (3.6) Efficient implementations of this problem can, for instance, be found in the SPGL1 solver package [53]. Basis pursuit works well so long as the number of measurements m > const k log (d/k) and the measurement matrix is sufficiently incoherent. While this problem is important for many applications, in the context of light field cameras one usually has to deal with sensor noise which makes it difficult to use the equality constraints in the above formulation. Basis Pursuit Denoise (BPDN) is very similar to basis pursuit; the main dif- ference is that the equality constraints are replaced by inequality constraints [15]: 43 minimize IQ} subject to |Hall( (3.7) Ili - <Da11 2 < C The parameter E is basically the noise level of a recorded sensor image in our application. Both, BP and BPND can be summarized as minimizing the sparsity of a coefficient vector while complying with the observations, at least up to some threshold E. For practical implementation, the constraints of Equation 3.7 can be directly included in the objective function using a Lagrangian formulation minimize 11i - <Da 2 + A |HaliK (3.8) Equation 3.8 is an unconstrained, convex, quadratic problem and can, therefore, be easily solved with existing solvers. Efficient implementations are readily available in the SPGL1 package [53] and also in the NESTA solver [7]. Yang et al. [60] provide an excellent overview of recent and especially fast fi minimization algorithms; upon request the authors also provide source code-we found their implementation of a homotopy method to solve the BPDN problem most efficient (see Fig. 3-4). Lasso is a problem closely related to BPDN and a convex relaxation of Equation 3.4: minimize {a} Hi subject to Hlall1 < - bVah2 (3.9) K Solving this problem provides the best approximation of a signal using a linear combination of (ideally) K atoms or fewer from the dictionary [49]. Lasso can also be solved with the unconstrained, Lagrangian form of the BPDN problem (Eq. 3.8); the parameter A is simply chosen to give a different tradeoff between sparsity and error. An implementation of Lasso can be found in the SPGL1 package [53]. 44 3.1.2 The Analysis Problem The last section gave a brief overview of approaches to sparse coding and approximate solutions. All problems are formulated as synthesis problems, however, which are theoretically only valid if the dictionary V is an orthonormal basis. In case it is coherent, overcomplete, or redundant, Candes et al. [13] have shown that solving the following fi-analysis problems theoretically results in superior reconstructions: minimize ||D* 1|1( (3.10) subject to Ii - <1l 2 < 6 [13] use the reweighted ti method [14] to solve the Basis Pursuit Denoise analysis problem. Reweighted f, is an iterative method; an efficient implementation is readily available in the NESTA solver package [7]. 
We compare NESTA's reweighted $\ell_1$ solver with the BPDN implementation of SPGL1 in Figure 3-4 and conclude that both result in comparable reconstruction quality for our problem.

3.2 Overview of Dictionary Learning Methods

One of the most critical parts of any sparse coding approach is the choice of a basis or dictionary that sparsifies the signal appropriately. As shown later in Figure 4-3, natural light fields are poorly sparsified by standard bases such as the discrete cosine transform, the Fourier basis, or wavelets. Instead, we propose to learn the fundamental building blocks of natural light fields, light field atoms, in overcomplete, redundant, and possibly coherent dictionaries. Within the last few years, sparse coding with coherent and redundant dictionaries has gained a lot of interest in the signal processing community [13].

3.2.1 Light Field Atoms

The general problem of learning an overcomplete dictionary D that sparsifies a class of signals can be expressed as

$$\underset{\{D, A\}}{\text{minimize}} \quad \|L - DA\|_F^2 \quad \text{subject to} \quad \forall i, \ \|A_i\|_0 \le k \qquad (3.11)$$

Here, $D \in \mathbb{R}^{n \times d}$ is the overcomplete dictionary and $L \in \mathbb{R}^{n \times q}$ is the training set, which contains q signals or patches that represent the desired class of signals well. An algorithm tackling Equation 3.11 will not only compute D but also the sparse coefficients $A \in \mathbb{R}^{d \times q}$ that approximate the training set. The Frobenius matrix norm in the above problem is defined as $\|X\|_F = \sqrt{\sum_{i,j} X_{ij}^2}$. A variety of solutions to Equation 3.11 exist; in the following, we discuss the most widely used methods.

K-SVD

The K-SVD algorithm [3] is a simple yet powerful method for learning overcomplete dictionaries from a training dataset. K-SVD applies a two-stage process to solve Equation 3.11: given an initial estimate of D and A, in the first stage (sparse coding stage) the coefficient vectors $A_i$, i = 1 ... q, are updated independently using any pursuit method with the dictionary fixed. In the second stage (codebook update stage), D is updated by picking the singular vector of the residual matrix E = L - DA that contributes most to the error; the corresponding coefficients in A and D are updated so as to minimize the residual with that singular vector. This vector can easily be found by applying a singular value decomposition (SVD) to the residual matrix E and picking the singular vector corresponding to the largest singular value. The K-SVD algorithm alternates between these two stages, sparse coding and codebook update, in an iterative fashion and is therefore similar in spirit to alternating least-squares methods.
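A minimal sketch of one K-SVD sweep is given below, using the standard per-atom codebook update and reusing the omp() sketch from above as the pursuit method. The names and the exact update order are illustrative; production implementations add atom replacement and other refinements.

```python
import numpy as np

def ksvd_iteration(D, L, k):
    """One K-SVD sweep: sparse coding stage, then codebook update stage.

    D : (n, d) dictionary with unit-norm columns
    L : (n, q) training patches
    k : sparsity level enforced per training signal
    """
    n, d = D.shape
    q = L.shape[1]
    # Stage 1 (sparse coding): code every training patch with D fixed,
    # using any pursuit method -- here the omp() sketch from above.
    A = np.column_stack([omp(D, L[:, i], k) for i in range(q)])
    # Stage 2 (codebook update): revisit each atom in turn.
    for j in range(d):
        users = np.nonzero(A[j, :])[0]
        if users.size == 0:
            continue
        # Residual without atom j's contribution, restricted to its users.
        E = L[:, users] - D @ A[:, users] + np.outer(D[:, j], A[j, users])
        # Rank-1 approximation of E via SVD gives the new atom and its
        # coefficients (the singular vector with the largest singular value).
        U, s, Vt = np.linalg.svd(E, full_matrices=False)
        D[:, j] = U[:, 0]
        A[j, users] = s[0] * Vt[0, :]
    return D, A
```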
Nonnegative K-SVD

The nonnegative variant of K-SVD [4] (K-SVD NN) allows only positive values in the dictionary atoms as well as in the coefficients, i.e., $D_{ij} \ge 0$ and $A_{ij} \ge 0$. The K-SVD NN algorithm itself is a slight modification of K-SVD: in the sparse coding stage, any pursuit algorithm can be applied that only allows positive coefficients; the SVD in the codebook update stage is replaced by finding the closest nonnegative rank-1 matrix (in the Frobenius norm) that approximates E. Algorithms for nonnegative matrix factorization (NMF) are well explored (e.g., [32]). The advantage of nonnegative matrix factorizations and dictionary learning algorithms is that they often result in decompositions that carry physical meaning; in particular, a visual analysis of the extracted features often allows for intuitive interpretations of their parts. For the application of learning light field atoms, nonnegative dictionaries allow for intuitive interpretations of what the basic building blocks of natural light fields are. Figure 3-1 compares the atoms learned with K-SVD NN to alternative learning methods that allow negative atoms.

Online Sparse Coding

With online sparse coding, we refer to the online optimization algorithm proposed by Mairal et al. [40]. The goal of this algorithm is to overcome the limit on the size of training sets that can be handled by alternative dictionary learning methods, such as K-SVD. For this purpose, Mairal et al. propose a method based on stochastic approximation that is specifically designed to handle large training sets of visual data, consisting of small patches, in an efficient manner. The proposed approach is implemented in the open-source software package SPAMS (http://spams-devel.gforge.inria.fr).

Assume that a large training set of small patches is available. Conventional algorithms, such as K-SVD, randomly pick the maximum number of training patches that can be processed given a limited memory or time budget. Online sparse coding instead processes the full training set in a batch-sequential fashion, picking one random patch at a time and updating the dictionary and coefficients accordingly; this is similar in spirit to matching pursuit algorithms. While able to handle very large training sets and being very fast, online sparse coding has the disadvantage of very slow convergence rates. Fixing the number of iterations and comparing it to K-SVD, as illustrated in Figure 3-1, shows that the resulting light field atoms are much noisier and of lower quality. This is mainly because it is unclear how to choose the next training patch at any given iteration of the algorithm: ideally, the patch that maximizes the amount of new information should be chosen, but since it is generally impossible to know which one that is, a random sample is drawn, which may not contribute any new information at all.

3.2.2 Generating "Good" Training Sets with Coresets

Coresets are a means of preprocessing large training sets so as to make subsequent dictionary learning more efficient. A comprehensive survey of coresets can be found in [2]. Given a training set $L \in \mathbb{R}^{n \times q}$, a coreset $C \in \mathbb{R}^{n \times c}$ is extracted and directly used as a surrogate training set for the dictionary learning process (Eq. 3.11). Coresets have two advantages: first, the size of the training set is significantly reduced (i.e., $c \ll q$) and, second, redundancies in the training set are removed, significantly improving the convergence rates of batch-sequential algorithms such as online sparse coding. In a way, coresets can be interpreted as clustering algorithms specialized for picking "good" training sets for dictionary learning [22]. A simple yet efficient approach to computing coresets is to select the c patches of the training set that have sufficiently high variance [21]. Feigin et al. showed that the best dictionary learned from a coreset computed with their algorithm is very close to the best dictionary that can be learned from the original training set. We implement the method described in [21] and evaluate it with K-SVD and online sparse coding in the following section.
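The core of the variance-based selection can be sketched in a few lines; note that this is only the selection step, and that Feigin et al.'s full algorithm [21] includes refinements that we omit here.

```python
import numpy as np

def variance_coreset(L, c):
    """Reduce a training set to a coreset of its c highest-variance
    patches, in the spirit of the method of Feigin et al. [21].

    L : (n, q) matrix of vectorized training patches
    c : desired coreset size (c << q)
    """
    v = np.var(L, axis=0)        # per-patch variance
    keep = np.argsort(v)[-c:]    # indices of the c most varied patches
    return L[:, keep]
```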
3.3 Evaluating System Design Parameters

Before we can effectively apply these recent theories of sparse coding and representation to high-dimensional signals, we take a step back and evaluate various system design parameters to help guide us toward resolution-preserving light field recovery. In this section, we evaluate several learning methods in flatland, with straightforward extensions to the full, four-dimensional case; the flatland analysis simply allows for more intuitive interpretations.

Figure 3-1: Evaluating various approaches to dictionary learning and choosing training sets. [Panels: K-SVD, K-SVD with coresets, nonnegative K-SVD, SPAMS, and SPAMS with coresets.] The two most important evaluation criteria are speed (and memory footprint) of the learning stage as well as quality of the learned light field atoms. The former is indicated by timings, whereas the latter is indicated by the PSNR and visual quality of reconstructions from the same measurements (a 2D light field projected onto a 1D sensor image with a random modulation mask). We evaluate the K-SVD algorithm, its nonnegative variant, and online sparse coding as implemented in the SPAMS software package. K-SVD is a very slow algorithm, which limits practical application to small training sets containing 27,000 training patches (top left); coresets are tested to reduce a much larger training set with 1,800,000 patches to the same, smaller size (27,000 patches, top center). In all cases, K-SVD learns high-quality light field atoms that capture different spatial features in the training set, sheared by different amounts in the spatio-angular domain. Unfortunately, K-SVD becomes too slow for practical application with large high-dimensional training sets; nevertheless, we use it as a benchmark in these flatland experiments. Online sparse coding (bottom row) is much faster and capable of handling large training sets. Unfortunately, the learned atoms are very noisy and lead to low-quality reconstructions (bottom left and center). Using coresets, much of the redundancy in large training sets can be removed prior to learning the atoms (lower right), thereby removing noise in the atoms and achieving the highest-quality reconstructions in the least amount of time.

Figure 3-2: Evaluating dictionary overcompleteness. [Panels: 0.1x (10 atoms), 0.5x (50 atoms), 1x (100 atoms), 2x (200 atoms), 5x (500 atoms), and 10x (1000 atoms) overcomplete dictionaries.] The color-coded visualizations of dictionaries (top row) and the histograms (center row) illustrate the intrinsic dimension of these dictionaries. For more than 2x overcomplete dictionaries, most of the atoms are rarely used to represent the training set. Reconstructions of a 2D light field that was not in the training set using the respective dictionaries (bottom row) show that the quality (PSNR) is best for 1-2x overcomplete dictionaries and drops below and above that range. Dictionaries with 0.1-0.5x overcompleteness do not perform well because they simply do not contain enough atoms to sparsely represent the test light field. On the other hand, excessive overcompleteness (larger d) does not improve sparsity (smaller k) further, while k log(d/k) starts to increase, leaving the number of measurements insufficient.

3.3.1 Dictionary Design Parameters

Here, we evaluate various approaches to, and design parameters of, the learning stage.
We conclude that coresets applied to large training sets, in combination with online sparse coding as implemented in the SPAMS package, give the best results in the shortest amount of time.

Speed and Quality of Learning Stage

We evaluate several different dictionary learning methods in Figure 3-1. While the K-SVD algorithm results in high-quality light field atoms, online sparse coding usually extracts noisy atoms for a comparable number of iterations in both algorithms. Unfortunately, K-SVD becomes increasingly slow and infeasible for large training sets of high-resolution and four-dimensional light field patches. Applying coresets to large training sets prior to the learning stage, however, removes redundancies in the training data and allows SPAMS to learn atoms that result in higher-quality reconstructions than K-SVD in very little time (Fig. 3-1, lower right).

We also evaluate the nonnegative variant of K-SVD for the purpose of improved interpretability of the learned atoms (Fig. 3-1, upper right). As with all other approaches, nonnegative K-SVD atoms exhibit spatial features sheared by different amounts in the spatio-angular space. Edges in these atoms, however, are much smoother than in the approaches that allow for negative entries. This allows for improved blending of several nonnegative atoms to form light field "molecules", such as the T-junctions observed in occlusions, whereas atoms that allow for negative values can easily create such "molecules" when added together. The reconstruction quality of nonnegative K-SVD is low, however; we conclude that this algorithm is well suited to analyzing learned atoms but, considering the low reconstruction quality and immense compute times, not fit for practical processing in this application.

Figure 3-3: Evaluating light field atom sizes. [Panels: target light field view and reconstructions with atom sizes 9x9x3x3, 11x11x3x3, 13x13x3x3, 17x17x3x3, and 21x21x3x3, each annotated with its PSNR.] A synthetic light field (lower left) is projected onto a sensor image with a random modulation mask and reconstructed with dictionaries comprising varying atom sizes. The angular resolution of light field and atoms is 3 x 3, but the spatial resolution of the atoms ranges from 9 x 9 to 21 x 21. Dictionaries for all experiments are 5x overcomplete. We observe the best reconstruction quality for an atom size of 11 x 11. If chosen too small, the spatio-angular shears of objects at a distance to the focal plane will not be adequately captured by the light field atoms; furthermore, the ratio between the number of measurements and the number of unknowns is too low for a robust reconstruction with sparse coding approaches. For increasing atom resolutions, that ratio becomes more favorable for reconstruction; however, with an increasing spatial extent of an atom it also becomes increasingly difficult to represent the underlying light fields sparsely, leading to lower-quality reconstructions.

Atom Size

The size (or resolution) of light field atoms is an important design parameter. Consider an atom size of $n = p_x^2 \, p_v^2$; the number of measurements is then always $m = p_x^2$. Assuming a constant sparsity k of the light field in the coefficient space, the minimum number of measurements should ideally follow Equation 3.5, i.e., $m \ge \mathrm{const} \cdot k \log(d/k)$.
As the spatial atom size is increased for a given angular size, the recovery problem becomes more well-posed, because m grows linearly with the atom size whereas the right-hand side only grows logarithmically. On the other hand, an increasing spatial light field atom size may decrease the compressibility of the light field expressed in terms of these atoms. Figure 3-3 evaluates the sensitivity of the light field recovery process with respect to the spatial atom size. We conclude that there is an optimal tradeoff between the above arguments (number of measurements vs. sparsity); we heuristically determine the optimal atom size to be $p_x = 11$ for our application.

Overcompleteness

We also evaluate how overcomplete dictionaries should ideally be, that is, how many atoms should be learned from a given training set. Conventional orthonormal bases are "1x" overcomplete in this unit, i.e., D is square. The overcompleteness of dictionaries, however, can be arbitrarily chosen in the learning process. We evaluate dictionaries that are 0.1x, 0.5x, 1x, 2x, 5x, and 10x overcomplete in Figure 3-2. The color-coded visualizations of the atoms in the respective dictionaries indicate how many times each of the atoms is actually used for the training set (on a normalized scale); histograms counting how many times each atom was used to represent the training set are shown in the center row. We observe that for a growing dictionary size, the redundancy grows as well. While all coefficients in the 0.1x and 0.5x dictionaries are used almost equally often, for 5x and particularly 10x, most of the coefficients are rarely used; such dictionaries are overly redundant. We conclude that 1-2x overcomplete dictionaries adequately represent this particular training set consisting of 27,000 light field patches (reduced from 1,800,000 randomly chosen ones using coresets); all atoms have a resolution of 5 x 20 in angle and space.

At this point, we remind the reader of the number of required measurements in convex relaxation methods, such as the basis pursuit denoise used in this experiment, as outlined in Equation 3.5. For a fixed sparsity, a linearly increasing dictionary size (or overcompleteness) requires a logarithmically growing number of measurements. As seen in the top and center rows of Figure 3-2, the intrinsic dimension of these dictionaries, that is, the number of coefficients required to adequately represent the training set, is about 1-2x. As the overcompleteness grows (5-10x), there are simply not enough measurements in the sensor image, hence the PSNR of the reconstructions drops (Fig. 3-2, bottom row).

3.3.2 Sparse Reconstruction Parameters

A comparison of several of the sparse coding algorithms discussed above is shown in Figure 3-4. We evaluate these approaches based on reconstruction quality and speed. The choice of algorithms is based on publicly available code and ease of implementation; while there is a huge variety of algorithms in the literature, we pick the most popular ones that can easily be implemented or are readily available. Figure 3-4 shows that reweighted $\ell_1$, as implemented in NESTA [7], basis pursuit denoise as implemented in SPGL1 [53], and basis pursuit denoise using the homotopy method [60] perform equally well in terms of reconstruction quality. However, the processing time of both reweighted $\ell_1$ and BPDN (SPGL1) prohibits practical use for high-resolution 4D light field reconstructions. Hence, we choose the homotopy method described by Yang et al. [60], with source code provided by the authors.
Sliding Window Reconstructions

At this stage, we would like to point out that reconstructions are performed independently for each pixel in the sensor image. That is, a window of patch size $p_x \times p_x$, $p_x^2 = m$, is centered around a sensor pixel and represents the measurement vector $i \in \mathbb{R}^m$ (see Eq. 3.1). A sparse set of 4D light field atoms, each of size $p_x \times p_x \times p_v \times p_v$, $p_x^2 p_v^2 = n$, is then reconstructed for each sensor pixel. Following standard practice [20], we use this sliding window approach and reconstruct the 4D window for each sensor pixel separately in all results in this thesis. As a post-processing step, overlapping patches are merged using a median function. We compare several different choices for the merging function, including averaging, taking the median, and simply picking the spatial center of the 4D window, in Figure 3-4.

Evaluating the Effect of Resolution

We evaluate the effect of differing resolutions between the learning and reconstruction phases. We learn two dictionaries from a plane of text placed at different depths at resolutions of 128 x 128 and 64 x 64, respectively. Given the 128 x 128 dictionary for benchmarking, we reconstruct a resolution chart at the same depths at the same resolution. To compare, we reconstruct an anti-aliased resolution chart downsampled to 64 x 64 with the dictionary learned at a resolution of 128 x 128. Similarly, we also reconstruct a resolution chart of 128 x 128 with the dictionary learned at a resolution of 64 x 64. For this experiment, no significant difference was found in PSNR, as shown in the plot in Figure 3-6, even though dictionary learning is not proven to be scale- or rotation-invariant.

Figure 3-4: Evaluating sparse reconstruction methods. We reconstruct a 2D light field (top row) using the same dictionary and simulated sensor image with four different methods of sparse coding (rows 2-6) and different window merging functions. Although fast, orthogonal matching pursuit (OMP, second row) only achieves a low reconstruction quality. Reweighted $\ell_1$, as implemented in the NESTA package, yields high-quality reconstructions but takes about 15 times as long as OMP (third row). Basis pursuit denoise, as implemented in the SPGL1 solver package, is also high quality but almost 400 times slower than OMP (fourth row); due to its immense time requirements, this implementation is not suitable for high-resolution 4D light field reconstructions. The homotopy approach to solving the basis pursuit denoise problem (fifth row) is even faster than OMP and results in a reconstruction quality that is comparable with the best, but much slower, SPGL1 implementation.

Figure 3-5: Evaluating convergence for all algorithms using a single patch. [Panels: convergence of OMP, reweighted $\ell_1$, BPDN (SPGL1), and BPDN (homotopy).] Please note that OMP is a greedy algorithm; while all other algorithms minimize the $\ell_1$-norm of the coefficients, OMP increments the number of used coefficients by one in each iteration.

Figure 3-6: Evaluating dictionary learning at various resolutions. We learn two dictionaries from a training set consisting of black and white text at resolutions of 128 x 128 and 64 x 64, placed at different depths and rendered from a simulated camera aperture of 0.25 cm. Reconstructions of a resolution chart placed at the same depths but with differing resolutions are then performed. As shown in the plot, the PSNRs do not vary significantly for a dictionary learned at 64 x 64 and reconstruction at 128 x 128, and vice versa.
This can be attributed to our patch-by-patch reconstruction: dictionary learning is performed on patches of small resolution (11 x 11 or 9 x 9 in our case), and at such patch sizes features are generally scale-invariant. Moreover, as shown in [41], the learned atoms capture natural building blocks of light fields, such as spatial and angular edges; in the case of a textual planar scene at different depths, these building blocks are themselves largely scale-invariant.

3.3.3 Evaluating Optimal Projection Matrices and Mask Patterns

Traditional compressive sensing systems project the high-dimensional signal down to a lower-dimensional subspace using random sensing matrices. This is because random matrices are known to satisfy the well-known Restricted Isometry Property (RIP), which guarantees preservation of the energy of a sparse vector when it is projected down to a lower-dimensional subspace. Recently, it has been shown that, for a given basis or overcomplete dictionary, there exists an optimized sensing matrix [19]. Furthermore, the sensing matrix and the dictionary can be optimized together such that their product is as close to orthonormal as possible when operating on sparse vectors.

Figure 3-7: Evaluating optimal mask patterns. We evaluate three kinds of mask patterns for both sliding and distinct patch reconstructions of a flatland 2D light field. The light field is divided into patches of resolution 20 (space) x 5 (angle) and is projected down with the corresponding mask to a 20-pixel coded sensor image. As seen in the figure, given a dictionary, an optimized mask pattern performs much better than a traditional random mask. Jointly optimizing for both mask and dictionary performs slightly better than the optimized mask, albeit at much higher computational times. The coherence value $\mu$ shown in the figure decreases as we optimize for both the mask and the dictionary.

Mathematically, this optimality criterion can be expressed as

$$\underset{\{\Phi\}}{\text{minimize}} \quad \|I - G^T G\|_F^2 \qquad (3.12)$$

where G is the product of the sensing matrix $\Phi$ and the overcomplete dictionary D. This unconstrained problem is known to be convex, but the sensing matrices it generates are generally dense and cannot be physically realized. To generate physically realizable optimized codes, we add physical constraints on our 1D mask patterns:

$$\underset{\{f\}}{\text{minimize}} \quad \|I - G^T G\|_F^2 \quad \text{subject to} \quad 0 \le f_i \le 1, \ \forall i \qquad (3.13)$$

where $f \in \mathbb{R}^m$ corresponds to the 1D mask pattern placed along the diagonals of the submatrices of $\Phi$, which is zero otherwise. A further extension, shown in [19], is to perform coupled learning where, given a training set, one optimizes both the dictionary and the sensing matrix:

$$\underset{\{D, \Phi, A\}}{\text{minimize}} \quad \lambda \|L - DA\|_F^2 + \|\Phi L - \Phi D A\|_F^2 \qquad (3.14)$$

Here, L corresponds to the light field training set, D is the learned overcomplete dictionary, A is the matrix of sparse coefficient vectors, and $\Phi$ is the optimized sensing matrix. The constraints for a physically realizable mask pattern are then added as before. Figure 3-7 compares the reconstruction quality of all three mask patterns. As shown, given a fixed dictionary, the optimized mask pattern performs much better than a random projection. Joint optimization of dictionary and mask patterns performs marginally better than the stand-alone optimized mask, albeit at an increased computational overhead that will not scale up to 4D light fields given current computational power.
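The following sketch illustrates the mask optimization of Equation 3.13 for the flatland case by projected gradient descent. It is a simplified illustration, not the solver used for our results: it omits the column normalization of G and the mean-transmission constraint added later (Eq. 4.8), and all names are illustrative.

```python
import numpy as np

def build_phi(f, px, pv):
    """Flatland projection matrix: view j's submatrix carries the mask
    code f, circularly sheared by j, on its diagonal (cf. Fig. 4-1)."""
    return np.hstack([np.diag(np.roll(f, j)) for j in range(pv)])

def optimize_mask(D, px, pv, steps=2000, lr=1e-3, seed=0):
    """Projected gradient descent on ||I - G^T G||_F^2 with G = Phi(f) D;
    f is kept physically realizable, i.e. in [0, 1], by clipping."""
    rng = np.random.default_rng(seed)
    f = rng.uniform(0.3, 0.7, px)            # random initial mask
    eye = np.eye(D.shape[1])
    for _ in range(steps):
        Phi = build_phi(f, px, pv)
        G = Phi @ D
        grad_G = -4.0 * G @ (eye - G.T @ G)  # d/dG of ||I - G^T G||_F^2
        grad_Phi = grad_G @ D.T
        # Chain rule onto f: accumulate the gradient over every matrix
        # entry where each mask value appears (the sheared diagonals).
        grad_f = np.zeros(px)
        for j in range(pv):
            diag = np.diagonal(grad_Phi[:, j * px:(j + 1) * px])
            grad_f += np.roll(diag, -j)
        f = np.clip(f - lr * grad_f, 0.0, 1.0)
    return f
```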
Based on the insights drawn in this chapter regarding various design and system parameters, we describe the system, algorithms, and results needed to create a resolution-preserving light field camera design in the following chapter.

Chapter 4

Compressive Light Field Photography

This chapter describes the system design, analysis, implementation, and results for a practical compressive light field camera. We build on the mathematical foundations described in the previous chapter to overcome century-old fundamental limitations on high-resolution light field capture in a single shot. We start with a problem formulation that fits the theory, move on to an analysis of light field dictionaries, and then implement a real prototype. We end with results from our new light field camera design, showcasing complicated illumination effects such as refraction as well as dynamic scenes.

4.1 Problem Formulation

4.1.1 Acquiring Coded Light Field Projections

An image i(x) captured by a camera sensor is the projection of the incident spatio-angular light field l(x, v) along its angular dimension v over the aperture area V:

$$i(x) = \int_V l(x, v) \, dv. \qquad (4.1)$$

Figure 4-1: Illustration of ray optics, light field modulation through coded attenuation masks, and the corresponding projection matrix. The proposed optical setup comprises a conventional camera with a coded attenuation mask mounted at a slight offset in front of the sensor (left). This mask optically modulates the light field (center) before it is projected onto the sensor. The coded projection operator is expressed as a sparse matrix $\Phi$, here illustrated for a 2D light field with three views projected onto a 1D sensor (right).

We adopt a two-plane parameterization [35, 25] for the light field, where x is the 2D spatial dimension on the sensor plane and v denotes the 2D position on the aperture plane at distance $d_a$ (see Fig. 4-1, left). For brevity of notation, the light field in Equation 4.1 absorbs vignetting and other angle-dependent factors [43]. We propose to insert a coded attenuation mask $f(\xi)$ at a distance $d_l$ from the sensor, which optically modulates the light field prior to projection as

$$i(x) = \int_V f(x + s(v - x)) \, l(x, v) \, dv, \qquad (4.2)$$

where $s = d_l / d_a$ is the shear of the mask pattern with respect to the light field (see Fig. 4-1, center). In discretized form, coded light field projection can be expressed as a matrix-vector multiplication:

$$i = \Phi l, \quad \Phi = [\Phi_1 \ \Phi_2 \ \cdots \ \Phi_{p_v^2}] \qquad (4.3)$$

where $i \in \mathbb{R}^m$ and $l \in \mathbb{R}^n$ are the vectorized sensor image and light field, respectively. All $p_v \times p_v$ angular light field views $l_j$ ($j = 1 \ldots p_v^2$) are stacked in l. Note that each submatrix $\Phi_j \in \mathbb{R}^{m \times m}$ is a sparse matrix containing the sheared mask code on its diagonal (see Fig. 4-1, right). For multiple recorded sensor images, the individual photographs and corresponding measurement matrices are stacked in i and $\Phi$.

The observed image $i = \sum_j \Phi_j l_j$ sums the light field views, each multiplied with the same mask code but sheared by a different amount. If the mask is mounted directly on the sensor, the shear vanishes (s = 0) and the views are simply averaged. If the mask is located in the aperture (s = 1), the diagonal of each submatrix $\Phi_j$ becomes constant, which results in a weighted average of all light field views; in this case, however, the angular weights do not change over the sensor area. Intuitively, the most random, or similarly incoherent, sampling of the different angular samples happens when the mask is located between sensor and aperture; we evaluate this effect in Section 4.2.5.
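To make the role of the shear s concrete, the following is a schematic flatland implementation of Equation 4.2 with nearest-neighbor sampling and border clamping; the sampling geometry of the actual prototype differs, and the normalization by the number of views is an arbitrary choice for this sketch.

```python
import numpy as np

def coded_projection(lf, mask, s, v_coords):
    """Flatland version of Eq. 4.2: i(x) = sum_v f(x + s*(v - x)) l(x, v).

    lf       : (pv, m) light field, lf[j, x] = l(x, v_j)
    mask     : (m,) attenuation code f, sampled at sensor resolution
    s        : shear s = d_l / d_a (0: mask on sensor, 1: mask in aperture)
    v_coords : (pv,) aperture positions v_j, in sensor-pixel units
    """
    pv, m = lf.shape
    x = np.arange(m)
    i = np.zeros(m)
    for j, v in enumerate(v_coords):
        idx = np.clip(np.round(x + s * (v - x)).astype(int), 0, m - 1)
        i += mask[idx] * lf[j]
    return i / pv
```

Note that s = 0 reproduces the view-averaging case and s = 1 the constant per-view weights discussed above.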
Equations 4.1-4.3 model a captured sensor image as the angular projection of the incident light field. These equations can be interpreted to describe either the entire sensor image or small neighborhoods of sensor pixels (2D patches) as the projection of the corresponding 4D light field patch. The sparsity priors discussed in the following sections exclusively operate on such small two-dimensional and four-dimensional patches.

4.1.2 Reconstructing Light Fields from Projections

The inverse problem of reconstructing a light field from a coded projection requires a linear system of equations (Eq. 4.3) to be inverted. For a single sensor image, the number of measurements is significantly smaller than the number of unknowns, i.e., $m \ll n$. We leverage the sparse coding techniques discussed in the previous chapter to solve this ill-posed, underdetermined problem. For this purpose, we assume that natural light fields are sufficiently compressible in some basis or dictionary $D \in \mathbb{R}^{n \times d}$ such that

$$i = \Phi l = \Phi D \alpha, \qquad (4.4)$$

where most of the coefficients in $\alpha \in \mathbb{R}^d$ have values close to zero. As shown earlier (e.g., [16, 12]), we seek a robust solution to Equation 4.4 as

$$\underset{\{\alpha\}}{\text{minimize}} \quad \|\alpha\|_1 \quad \text{subject to} \quad \|i - \Phi D \alpha\|_2 \le \epsilon \qquad (4.5)$$

which is the basis pursuit denoise (BPDN) problem [15]. In practice, we solve the Lagrangian formulation of Equation 4.5:

$$\underset{\{\alpha\}}{\text{minimize}} \quad \|i - \Phi D \alpha\|_2^2 + \lambda \|\alpha\|_1 \qquad (4.6)$$

Assuming that the light field is k-sparse, that is, it can be well represented by a linear combination of at most k columns of D, a lower bound on the required number of measurements m is $O(k \log(d/k))$ [13]. While Equation 4.6 is not constrained to penalize negative values in the reconstructed light field $l = D\alpha$, we have not observed any resulting artifacts in practice.

The challenges for any compressive computational photography method are twofold: a "good" sparsity basis has to be known, and reconstruction times have to scale up to high resolutions. In the following, we show how to learn dictionaries of small light field atoms that sparsely represent natural light fields. A side effect of using light field atoms is that scalability is intrinsically addressed: instead of attempting to solve a single, large optimization problem, many small and independent problems are solved simultaneously. As discussed in the following, light field atoms sparsely model local spatio-angular coherence in the 4D light field. Therefore, a small 4D light field patch is reconstructed from a 2D image patch centered around each sensor pixel, and the recovered light field patches are merged into a single reconstruction. Performance is optimized through parallelization and the quick convergence of each subproblem; the reconstruction time grows linearly with increasing sensor resolution.
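The merging step mentioned above can be sketched as follows for one spatial slice of the recovered 4D patches; every patch votes for each pixel it covers, and the per-pixel median is kept. This is only an illustration of the idea (names and the brute-force voting structure are ours, not the thesis implementation).

```python
import numpy as np

def merge_patches(patches, centers, H, W):
    """Merge overlapping reconstructed patches with a per-pixel median.

    patches : list of (p, p) arrays (one spatial slice of each 4D patch)
    centers : list of (y, x) sensor-pixel centers, one per patch
    H, W    : output image size
    """
    p = patches[0].shape[0]
    r = p // 2
    votes = [[[] for _ in range(W)] for _ in range(H)]
    for patch, (cy, cx) in zip(patches, centers):
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                y, x = cy + dy, cx + dx
                if 0 <= y < H and 0 <= x < W:
                    votes[y][x].append(patch[dy + r, dx + r])
    out = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            if votes[y][x]:
                out[y, x] = np.median(votes[y][x])
    return out
```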
Figure 4-2: Visualization of light field atoms captured in an overcomplete dictionary. Light field atoms are the essential building blocks of natural light fields; most light fields can be represented by the weighted sum of very few atoms. We show that light field atoms are crucial for robust light field reconstruction from coded projections and useful for many other applications, such as 4D light field compression and denoising.

4.1.3 Learning Light Field Atoms

Following recent trends in the information theory community (e.g., [13]), we propose to learn the fundamental building blocks of natural light fields, light field atoms, in overcomplete dictionaries. We consider 4D spatio-angular light field patches of size $n = p_x \times p_x \times p_v \times p_v$. Given a large set of such patches, randomly chosen from a collection of training light fields, we learn a dictionary $D \in \mathbb{R}^{n \times d}$ as

$$\underset{\{D, A\}}{\text{minimize}} \quad \|L - DA\|_F^2 \quad \text{subject to} \quad \forall j, \ \|\alpha_j\|_0 \le k \qquad (4.7)$$

where $L \in \mathbb{R}^{n \times q}$ is a training set comprised of q light field patches and $A = [\alpha_1, \ldots, \alpha_q] \in \mathbb{R}^{d \times q}$ is a set of k-sparse coefficient vectors. The Frobenius matrix norm is $\|X\|_F = \sqrt{\sum_{i,j} x_{ij}^2}$, the $\ell_0$ pseudo-norm counts the number of nonzero elements in a vector, and k ($k \ll d$) is the sparsity level we wish to enforce.

In practice, training sets for the dictionary learning process are extremely large and often contain a lot of redundancy; solving Equation 4.7, however, is computationally expensive. Coresets, as described earlier, have been introduced as a means of cheaply reducing large dictionary training sets to manageable sizes. Feigin et al. [21], for instance, simply pick a subset of training samples in L that have sufficiently high variance; we follow their approach.

4.2 Analysis

In this section, we analyze the structure of light field atoms and dictionaries, evaluate the design parameters of dictionaries, derive optimal modulation patterns for coded projections, evaluate the proposed camera architecture, and compare it with a range of alternative light field camera designs.

4.2.1 Interpreting Light Field Atoms

As discussed in Section 4.1.3, overcomplete dictionaries are learned from training sets of natural light fields. The columns of these dictionaries are designed to sparsely represent the respective training set, hence capturing its essential building blocks or atoms. Obviously, the structure of these building blocks mainly depends on the specific training set; intuitively, large and diverse collections of natural light fields should exhibit some common structures, just like natural images.

Based on recent insights, such as the dimensionality gap [43, 34], one would expect that the increased dimensionality from 2D images to 4D light fields introduces a lot of redundancy. The dimensionality gap is a 3D manifold in the 4D light field space, which successfully models diffuse objects within a certain depth range. Unfortunately, occlusions, specularities, and high-dimensional edges are not accounted for by this prior. In contrast, light field atoms do not model a specific lower-dimensional manifold; rather, they sparsely represent the elemental structures of natural light fields.

Figure 4-2 visualizes an artificially-colored dictionary showing the central views of all its atoms; two of them are magnified and shown as 4D mosaics. We observe that light field atoms capture high-dimensional edges as well as high-frequency structures exhibiting different amounts of rotation and shear. Please note that these atoms also contain negative values; combining a few atoms allows complex lighting effects to be formed, such as reflections and refractions as well as the junctions observed in occlusions (see Fig. 4-3).
Figure 4-3: Compressibility of a 4D light field in various high-dimensional bases. [Panels: target light field; quantitative compressibility plot comparing 4D DCT, 4D Haar wavelets, 4D PCA, 4D FFT, and 4D light field atoms over the compression ratio in % coefficients; qualitative comparison of 4D DCT and 4D light field atoms.] As compared to popular basis representations, the proposed light field atoms provide better compression quality for natural light fields (plots, second row). Edges and junctions are faithfully captured (third row); for the purpose of 4D light field reconstruction from a single coded 2D projection, the proposed dictionaries combined with sparse coding techniques perform best in this experiment (bottom row).

4.2.2 What are Good Modulation Patterns?

The proposed optical setup consists of a conventional camera with a coded attenuation mask mounted in front of the sensor. A natural question emerges: what should the mask patterns be? In the compressive sensing literature, dense sensing matrices of random Gaussian noise are most often employed. The proposed optical setup, however, restricts the measurement matrix $\Phi$ to be very sparse (see Fig. 4-1, right). In the following, we discuss several choices of mask codes with respect to both computational and optical properties; a good mask design should facilitate high-quality reconstructions while also providing high light transmission.

Tiled Broadband Codes

Broadband codes, such as arrays of pinholes, sum-of-sinusoids (SoS), or MURA patterns, are common choices for light field acquisition with attenuation masks. These patterns are designed to multiplex angular light information into the spatial sensor layout; under bandlimited assumptions, the 4D light field is reconstructed using linear demosaicking [55]. In previous applications, the number of reconstructed light field elements is limited to the number of sensor pixels. The proposed nonlinear framework allows a larger number of light rays to be recovered than available sensor pixels; these could also be reconstructed from measurements taken with broadband codes and the sparse reconstruction algorithms proposed in this thesis. We evaluate such codes in Figure 4-4 and show that the achieved quality is lower than for random or optimized masks.

Random Mask Patterns

For high resolutions, random measurement matrices provide, with high probability, incoherent signal projections with respect to most sparsity bases, including overcomplete dictionaries. This is one of the main reasons why random codes are by far the most popular choice in compressive sensing applications. In our application, the structure of the measurement matrix is dictated by the optical setup: it is extremely sparse. Each sensor pixel integrates over only a few incident light rays, hence the corresponding matrix row has only that many nonzero entries. While random modulation codes are a popular choice in compressive computational photography applications, they are not necessarily the best choice for overcomplete dictionaries, as shown in the following.

Optimizing Mask Patterns

As discussed in the previous chapter, recent research has focused on deriving optimal measurement matrices for a given dictionary [19]. The intuition here is that projections of higher-dimensional signals should be as orthogonal as possible in the lower-dimensional projection space. Poor choices of codes allow different high-dimensional signals to project onto the same measurement, whereas optimal codes remove such ambiguities as best as possible.
Mathematically, this optimality criterion can be expressed as

$$\begin{aligned} \underset{\{f\}}{\text{minimize}} \quad & \|I - G^T G\|_F^2 \\ \text{subject to} \quad & 0 \le f_i \le 1, \ \forall i \\ & \textstyle\sum_i f_i / m \ge \tau \end{aligned} \qquad (4.8)$$

where G is $\Phi D$ with normalized columns and $f \in \mathbb{R}^m$ is the mask pattern along the diagonals of the submatrices in $\Phi$ (see Fig. 4-1, right). Hence, each column of G is the normalized projection of one light field atom into the measurement basis. The individual elements of $G^T G$ are inner products of pairs of these projections, hence measuring the distance between them. Whereas the diagonal elements of $G^T G$ are always one, the off-diagonal elements correspond to mutual distances between projected light field atoms. To maximize these distances, the objective function attempts to make $G^T G$ as close to the identity as possible. To further optimize the light efficiency of the system, we add an additional constraint $\tau$ on the mean light transmission of the mask code f.

Figure 4-4: Evaluating optical modulation codes and multiple-shot acquisition. We simulate light field reconstructions from coded projections for one, two, and five captured camera images. One tile of the corresponding mask patterns is shown in the insets. For all optical codes, an increasing number of shots increases the number of measurements, hence reconstruction quality. Nevertheless, optimized mask patterns facilitate single-shot reconstructions with a quality that other patterns can only achieve with multiple shots.

4.2.3 Are More Shots Better?

We strongly believe that the most viable light field camera design would be able to reconstruct a high-quality and high-resolution light field from a single photograph. Nevertheless, it may be argued that more measurements give even better results. This argument is supported by the experiments shown in Figure 4-4, where we evaluate multi-shot reconstructions for different mask patterns. In all cases, quality measured in peak signal-to-noise ratio (PSNR) improves for an increasing number of shots, each captured with a different mask pattern. However, after a certain number of shots, reconstruction quality does not increase significantly further; in the shown experiment, the gain from two to five shots is rather low. We also observe that a "good" choice of modulation codes equally improves reconstruction quality. In particular, optimized mask patterns allow for a single-shot reconstruction quality that can otherwise only be achieved with multiple shots. Capturing multiple shots with optimized mask patterns does not significantly improve image quality.

4.2.4 Evaluating Depth of Field

We evaluate the depth of field achieved with the proposed method in Figure 4-5. For this experiment, we render light fields containing a single planar resolution chart at different distances to the camera's focal plane (located at 50 cm). Each light field has a resolution of 128 x 128 pixels and 5 x 5 views. The physical distances correspond to those in our camera prototype described in Section 4.3.1. While the reconstruction quality is high when the chart is close to the focal plane, it decreases with increasing distance. Compared to capturing this scene with a lenslet array, however, the proposed approach results in a significantly increased image resolution. The training data for this experiment contains white planes with random text at different distances to the focal plane, rendered with an aperture diameter of 0.25 cm.
Whereas parallax within the range of the training data can be faithfully recovered (magenta and blue plots), a drop in reconstruction quality is observed when the parallax exceeds that of the training data (green plot).

Figure 4-5: Evaluating depth of field. [Panels: PSNR over distance to the camera aperture (45-80 cm) for aperture diameters of 0.1 cm, 0.25 cm, and 0.5 cm, with the focal plane at 50 cm; central views of the original light field, lenslet acquisition, and the proposed reconstructions at z = 50, 60, and 70 cm.] As opposed to lenslet arrays, the proposed approach preserves most of the image resolution at the focal plane. Reconstruction quality, however, decreases with distance to the focal plane. Central views are shown (on the focal plane) for the full-resolution light field, lenslet acquisition, and compressive reconstruction; compressive reconstructions are also shown for two other distances. The three plots evaluate reconstruction quality for varying aperture diameters with a dictionary learned from data corresponding to the blue plot (aperture diameter 0.25 cm).

Figure 4-6: Illustration of different optical light field camera setups with a quantitative value $\mu$ for the expected reconstruction quality (lower is better). [Panels: lenslet array, coded lens array, coded aperture, coded mask, and coded aperture & mask, each annotated with its light transmission $\tau$ and coherence value $\mu$.] While lenslet arrays have the best light transmission $\tau$ (higher is better), reconstructions are expected to be of lower quality. Masks coded with random or optimized patterns perform best of all systems with 50% or more transmission. Two masks are expected to perform slightly better with our reconstruction, but at the cost of reduced light efficiency.

4.2.5 Comparing Computational Light Field Cameras

Two criteria are important when comparing different light field camera designs: optical light efficiency and the expected quality of the computational reconstruction. Light efficiency is measured as the mean light transmission $\tau$ of the optical system, whereas the value $\mu = \|I - G^T G\|_F$ quantifies the expected reconstruction quality based on Equation 4.8 (a lower value is better). We compare lenslet arrays [37, 44], randomly coded lenslet arrays and coded apertures [5], coded broadband masks [28, 54, 31] (we only show URA masks as the best general choice for this resolution), random and optimized masks, as proposed in this thesis, as well as randomly coded apertures combined with a coded mask [59]. All optical camera designs are illustrated in Figure 4-6.

Optically, lenslet arrays perform best with little loss of light; most mask-based designs have a light transmission of approx. 50%, except for pinholes. Combining randomly coded apertures with a modulation mask results in an overall transmission of about 25%, although Xu and Lam's choice of sum-of-sinusoids masks results in transmissions of less than 5%.

Figure 4-6 also shows the quantitative value $\mu$ for the expected quality. Under this aspect, lenslet arrays perform worst, followed by coded lenslet arrays, coded apertures, and previously proposed tiled broadband codes ($\mu_{broad}$). The random modulation masks ($\mu_{rand}$) and optimized patterns ($\mu_{opt}$) proposed in Section 4.2.2 have the best expected quality of all setups with a mean transmission of 50% or higher. Although the dual-layer design proposed by Xu and Lam [59] has a lower $\mu$ value, their design is significantly less light efficient than ours.
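Both comparison metrics follow directly from their definitions above and can be computed in a few lines (illustrative function names):

```python
import numpy as np

def expected_quality(Phi, D):
    """Coherence metric mu = ||I - G^T G||_F, with G = Phi @ D
    column-normalized; lower values predict better reconstructions."""
    G = Phi @ D
    G = G / np.linalg.norm(G, axis=0, keepdims=True)
    d = G.shape[1]
    return np.linalg.norm(np.eye(d) - G.T @ G, "fro")

def mean_transmission(mask):
    """Light efficiency tau: mean transmission of the mask code."""
    return float(np.mean(mask))
```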
While the quantitative differences between the $\mu$-values of these camera designs are subtle, the qualitative differences between reconstructions are much more pronounced, as shown in Figures 4-4 and 4-5 and in the supplemental video. The discussed comparison is performed by assuming that all optical setups use the reconstruction method and overcomplete dictionaries proposed in this thesis, as opposed to previously proposed PCA sparsity bases [5] or simple total variation priors [59].

4.3 Implementation

4.3.1 Primary Hardware Design

For experiments with real scenes, it is necessary to easily change mask patterns for calibration and for capturing training light fields. To this end, we implement a capture system using a liquid crystal on silicon (LCoS) display (SiliconMicroDisplay ST1080). An LCoS acts as a mirror where each pixel can independently change the polarization state of incoming light. In conjunction with a polarizing beam splitter and relay optics, as shown in Figure 4-7, the optical system emulates an attenuation mask mounted at an offset in front of the sensor. As a single pixel on the LCoS cannot be well resolved with this setup, we treat blocks of 4 x 4 LCoS pixels as macropixels, resulting in a mask resolution of 480 x 270. The SLR camera lens (Nikon 105 mm f/2.8D) is not focused on the LCoS but in front of it, thereby optically placing the (virtual) image sensor behind the LCoS plane. A Canon EF 50 mm f/1.8 II lens is used as the imaging lens and focused at a distance of 50 cm; scenes are placed within a depth range of 30-100 cm. The f-number of the system is the maximum of both lenses (f/2.8).

Figure 4-7: Prototype light field camera. We implement an optical relay system that emulates a spatial light modulator (SLM) mounted at a slight offset in front of the sensor (right inset). We employ a reflective LCoS as the SLM (lower left insets).

Adjusting the Mask-Sensor Distance

The distance $d_l$ between the mask (LCoS plane) and the virtual image sensor is adjusted by changing the focus of the SLR camera lens. For capturing light fields with $p_v \times p_v$ angular resolution ($p_v = 5$ in our experiments), the distance is chosen as that of a conventional mask-based method that would result in the desired angular resolution, albeit at lower spatial resolution [54]. Specifically, we display a pinhole array on the LCoS where adjacent pinholes are $p_v$ macropixels apart while imaging a white calibration object. We then adjust the focus of the SLR camera lens so that the disc-shaped blurred images under the pinholes almost abut each other. In this way, the angular light field samples impinging on each sensor pixel pass through distinct macropixels on the LCoS with different attenuation values before being integrated on the sensor.

Capturing Coded Light Field Projections

We capture mask-modulated light field projections by displaying a pattern on the LCoS macropixels and resizing the sensor images accordingly.

Figure 4-8: Pinhole array mask and captured images. For conciseness, only the region covering 12 x 10 pinholes is shown here. (a) Pinhole array displayed on the LCoS. (b) Image sensor recording of the LCoS pinhole array mask for a white cardboard scene. (c) Image sensor recording of the LCoS pinhole array mask for a newspaper scene. (d) Normalized image, i.e., (c) divided by (b). (e) Image sensor recording of the LCoS pinhole array mask for a white cardboard scene where each pinhole has a corresponding attenuation value according to the mask pattern.
Capturing Training Light Fields

For the dictionary learning stage, we capture a variety of scenes using a traditional pinhole array. For this purpose, $p_v \times p_v$ (= 25) images are recorded with shifting pinholes on the LCoS to obtain full-resolution light fields. However, as shown in Figure 4-8(b), the discs under the pinholes (i.e., point-spread functions, or PSFs) have color-dependent, nonuniform intensity distributions due to birefringence introduced by the LCoS pixels. To remove this effect from the training light fields, we record images of a white cardboard scene, also using shifting pinhole arrays, to obtain a set of PSFs under each LCoS macropixel, and normalize the training light fields by these images. Figure 4-8(c) shows an example of a raw pinhole array image of a newspaper scene; dividing Figure 4-8(c) by the LCoS PSF image shown in Figure 4-8(b) yields the normalized pinhole image shown in Figure 4-8(d).

Calibration

We measure the projection matrix $\Phi$ by capturing the light field of a white cardboard scene modulated by the mask pattern. Again we use a shifting pinhole array, but rather than making each pinhole fully open (1.0) as in the case of training light field capture, we assign each pinhole the corresponding value ($\in [0, 1]$) in the mask. An example of a mask-modulated pinhole array image is shown in Figure 4-8(e).

Figure 4-9: Calibrated projection matrix. Left: visualization of the matrix as a multi-view image. Right: magnified crops from two of the views.

We observed that the actual attenuation introduced by the LCoS was greater than the specified mask value; that is, the ratio of a captured pixel value in Figure 4-8(e) to the corresponding pixel value in Figure 4-8(b) was less than the LCoS macropixel value. To compensate for this nonlinearity, we assumed a gamma-curve relationship between LCoS pixel values and actual attenuations, and performed a linear search for the optimal gamma value of each color channel to obtain the expected attenuation ratios.

Figure 4-9 shows the calibrated projection matrix $\Phi$, obtained from pinhole array images captured in the above-mentioned manner. Here, the projection "matrix" is shown as a multi-view image by rearranging the pixels of the pinhole images. Each view corresponds to the diagonal of one of the submatrices $\Phi_j$; that is, the diagonal elements of $\Phi_j$ are the lexicographically ordered pixel values of the j-th view. These views contain three effects: 1) the shifted mask pattern (our optimized mask tiled over the LCoS area), 2) the color-dependent, nonuniform LCoS PSF effect, leading to color and intensity variations between the views, and 3) view-dependent vignetting from the imaging lens.

Figure 4-10: Printed transparency-based prototype camera (a) and calibration setups (b, c).

4.3.2 Secondary Hardware Design

Once the optical parameters are determined and dictionaries are learned, the proposed light field camera can be implemented in a more compact form factor with a static attenuation mask, as shown in Figure 4-10(a). We fabricate a mask holder that fits into the sensor housing of a Lumenera Lw11059 monochrome camera and attach a film with a random mask pattern. As the printer guarantees 25 μm resolution, we conservatively pick a mask resolution of 50 μm, which roughly corresponds to 6 x 6 pixels on the sensor. We therefore downsample the sensor image by 6 and crop out the center 200 x 160 region for light field reconstruction in order to avoid mask holder reflections and vignetting.
The distance between the mask and the sensor is 1.6 mm. A Canon EF 50 mm f/1.8 II lens is used and focused at a distance of 50 cm. As the mask is static, we use an aperture-based light field capture method to calibrate the projection matrix $\Phi$. We place a 10 x 10 mm² external aperture immediately in front of the lens, as shown in Figure 4-10(b), and capture a white cardboard scene with 5 x 5 sub-apertures, each having an area of 2 x 2 mm², as shown in Figure 4-10(c). Figure 4-11 shows reconstruction results. The reconstruction quality is modest because we did not learn dictionaries or optimize mask patterns for this setup (we used the dictionary learned for the LCoS setup and a random mask); nevertheless, parallax is recovered.

Figure 4-11: Light field reconstruction from an image modulated by a static film mask. (a) Captured image. (b) One view from the calibrated projection matrix $\Phi$. (c, d) Two views from the reconstructed light field; yellow lines are superimposed to show the parallax. (e) All of the 5 x 5 views from the reconstructed light field.

Figure 4-12: An illustration of the placement of the optimized or random mask for two commonly used camera designs. For a traditional DSLR with a lens of focal length about 50 mm, the mask needs to be placed about 1.6 mm away from the sensor to capture 5 x 5 angular views. For a mobile phone camera, this distance reduces to about 160 microns due to the reduced focal length.

Figure 4-13: Training light fields and central views for physical experiments.

4.3.3 Software

The algorithmic framework is a two-step process involving an offline dictionary learning stage and a nonlinear reconstruction.

Dictionary Learning

We capture five training light fields, each with an aperture setting of approx. 0.5 cm (f/2.8), with our prototype setup and randomly extract one million 4D light field patches, each with a spatial resolution of 11 x 11 pixels and 5 x 5 angular samples. After applying coreset reduction [21], the 50,000 remaining patches are used to learn a 1.7x overcomplete dictionary consisting of 5,000 light field atoms. The memory footprint of this learned dictionary is about 111 MB. We employ the Sparse Modeling Software [39] to learn this dictionary on a workstation equipped with a 24-core Intel Xeon processor and 200 GB of RAM in about 10 hours; this is a one-time preprocessing step.

For the physical experiments, we capture five training sets, as shown in Figure 4-13. Each light field in the training set has a resolution of 480 x 270 pixels in space and 5 x 5 views. We randomly extract about 360,000 overlapping patches from each of the training light fields; each patch has a spatial resolution of 11 x 11 pixels and an angular resolution of 5 x 5. To increase the variability amongst these extracted patches, we employ the coreset technique discussed in Section 3.2.2 to reduce this set to a tractable size of about 50,000 patches. This process is repeated for all training light fields to generate a training set of about 250,000 patches; coresets are then applied again to reduce the final training set to about 50,000 patches.

Figure 4-14: Training light fields and central views for simulated experiments.

Sparse Reconstruction

For the real-world experiments, each light field is reconstructed with 5 x 5 views from a single sensor image with a resolution of 480 x 270 pixels.
For this purpose, the coded sensor image is divided into overlapping 2D patches, each with a resolution of 11 x 11 pixels, by centering a sliding window around each sensor pixel. Subsequently, a small 4D light field patch is recovered for each of these windows. We employ the fast $\ell_1$-relaxed homotopy method described by Yang et al. [60] with the sparsity-penalizing parameter $\lambda$ set to 10, the tolerance set to 0.001, and the maximum number of iterations set to 10,000. The reconstruction is performed in parallel on an 8-core Intel i7 workstation with 16 GB of RAM; each light field patch takes about 0.1 seconds to recover, resulting in a runtime of about 18 hours for all three color channels of one light field. The reconstructed overlapping 4D patches are merged with a median filter. It should be noted that, at each step, the product of the calibrated projection matrix and the dictionary needs to be column-normalized for correct minimization of the $\ell_1$-norm. Although the proposed method requires extensive computation times, we note that each patch is independent of all other patches; hence, the reconstruction can be easily parallelized and significantly accelerated with modern high-end GPUs that have up to thousands of cores, or with cloud-based infrastructures.
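The column normalization mentioned above can be sketched as follows; the helper name is ours, and the usage line assumes any $\ell_1$ solver, such as the ista() sketch from Chapter 3.

```python
import numpy as np

def normalized_system(Phi, D):
    """Column-normalize A = Phi @ D before the l1 solve, so that every
    atom's projection competes on equal footing; the returned scales
    undo the normalization on the recovered coefficients."""
    A = Phi @ D
    scales = np.linalg.norm(A, axis=0)
    scales[scales == 0] = 1.0            # guard against empty columns
    return A / scales, scales

# Usage with any l1 solver:
# A_n, s = normalized_system(Phi, D)
# alpha = ista(A_n, i, lam) / s        # rescale before forming D @ alpha
```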
4.4 Results

All results discussed in this section are captured with our prototype compressive light field camera and reconstructed from a single sensor image. This image is a coded projection of the light field, and the employed mask pattern is optimized for the computed dictionary with the technique described in Section 4.2.2. The same optical code and dictionary are used in all examples, the latter being learned from captured light fields that do not include any of the shown objects. All training sets and captured data are publicly available on the project website or upon request.

Layered Diffuse Objects  Figure 4-15 shows results for a set of cards at different distances to the camera's focal plane. A 4D light field with 5 x 5 views (upper right) is reconstructed from a single coded projection (upper left). Parallax for out-of-focus objects is observed (center row). By shearing the 4D light field and averaging all views, a synthetically refocused camera image can be computed in post-processing (bottom row).

Figure 4-15: Light field reconstruction from a single coded 2D projection. The scene is composed of diffuse objects at different depths; processing the 4D light field allows for post-capture refocus.

Partly-occluded Environments  Reconstructed views of a scene exhibiting more complex structures are shown in Figure 4-16. The toy is partly occluded by a high-frequency shrub; occluded areas of the toy are faithfully reconstructed.

Figure 4-16: Reconstruction of a partly-occluded scene. Two views of a light field reconstructed from a single camera image. Areas occluded by high-frequency structures can be recovered by the proposed methods, as seen in the close-ups.

Reflections and Refractions  Complex lighting effects, such as the reflections and refractions exhibited by the dragon and the tiger in Figure ??, are successfully reconstructed with the proposed technique. In this scene, parts of the dragon are refracted through the head and shoulders of the glass tiger, whereas its back reflects and refracts the green background. We show images synthetically focused on foreground and background objects as well.

Animated Scenes  The proposed algorithms allow 4D light fields to be recovered from a single 2D sensor image. Dynamic events can be recovered this way; to demonstrate this capability, we show several frames of an animated miniature carousel in Figure 4-17. Coded sensor images and reconstructions are shown. A visual description of the recovered parallax and refocusing effects can be found in the attached video at http://www.kshitijmarwah.com/index.php?/research/compressivelight-field-photography/.

Figure 4-17: Light field reconstructions of an animated scene. We capture a coded sensor image for multiple frames of a rotating carousel (left) and reconstruct 4D light fields for each of them. The techniques explored in this thesis allow for higher-resolution light field acquisition than previous single-shot approaches.

The light field camera described in this chapter is a first attempt to overcome the traditional resolution limits imposed over the last hundred years. Given an m x n sensor image, this practical camera design and algorithmic framework recovers a light field that is m x n in space and p x p in angle. Though the reconstruction quality is depth-dependent, we believe this is the first camera to demonstrate these resolutions in a single capture.

Chapter 5

Additional Applications

In this chapter, we outline a variety of additional applications for light field dictionaries and sparse coding techniques. In particular, we show applications in 4D light field compression and denoising. We show how to remove the coded patterns introduced by modulation masks so as to retrieve a conventional 2D photograph. We also show how to acquire high-resolution focal stacks from a single capture by slightly changing the optical setup and learning 3D dictionaries. This is critical for applications such as refocusing, where one does not need the full light field but only a high-resolution refocused image.

5.1 "Undappling" Images with Coupled Dictionaries

Although the optical acquisition setup proposed in this thesis allows for light fields to be recovered from a single sensor image, a photographer may want to capture a conventional 2D image as well. In a commercial implementation, this could be achieved if the proposed optical system were implemented with programmable spatial light modulators or with modulation masks that can be mechanically moved out of the optical path. The most straightforward method to remove the mask pattern is to divide out the 2D projection of the mask pattern itself. That leaves the in-focus regions free of mask dapples but renders the out-of-focus areas with high-frequency artifacts. Another approach is to first recover a 4D light field and average its views to get the image back, but that is computationally costly. To mitigate this problem, we use joint sparse coding to implicitly learn mappings from a modulated image to a demodulated or "undappled" image that remains in the two-dimensional manifold of images, in lieu of light field recovery.
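As a rough illustration, here is a minimal sketch of the straightforward divide-out step described above; the array names and the clamping constant eps are assumptions for illustration.

```python
import numpy as np

def divide_out_mask(captured, mask_projection, eps=1e-3):
    """Naive 'undappling': divide the coded image by the known 2D projection
    of the mask pattern. In-focus regions become dapple-free, but
    out-of-focus regions retain high-frequency artifacts."""
    return captured / np.maximum(mask_projection, eps)  # clamp to avoid dividing by ~0
```

The coupled-dictionary scheme developed next then suppresses the residual out-of-focus artifacts without a full 4D light field reconstruction.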
Let us consider two feature spaces X (consisting of mask-modulated images of scenes as captured by our setup after dividing out the mask pattern) and Y (consisting of images of the same scenes as if there were no mask). Given these training sets, unlike the traditional sparse coding discussed so far, we employ joint sparse coding techniques to simultaneously learn two dictionaries D_x and D_y for the two feature sets such that the sparse representation of a mask-modulated image x_i in terms of D_x is the same as the representation of the corresponding demodulated image y_i in terms of D_y. Hence, if we know our measurement x_i, we can recover its underlying signal y_i. Mathematically, the joint sparse coding scheme [61] can be realized as

\[
\underset{D_x,\, D_y,\, \alpha_i}{\text{minimize}} \;\; \|x_i - D_x \alpha_i\|_2^2 + \|y_i - D_y \alpha_i\|_2^2 + \lambda \|\alpha_i\|_1
\tag{5.1}
\]

The formulation requires a common sparse representation α_i that reconstructs both x_i and y_i. As shown by Yang et al. [61], the problem can be converted into a standard sparse coding scheme over the concatenated feature space of X and Y. Now, given a new test modulated image x_t, we find its representation in the learned dictionary D_x. The resulting coefficient vector is then multiplied by the dictionary D_y to generate the demodulated image.

We captured modulated images with our setup and divided out the mean mask pattern. Each image is divided into overlapping patches of resolution 7 x 7, forming a training set for the first feature space. Correspondingly, training images of the same scenes without any modulation are captured and divided into overlapping 7 x 7 patches to form the other training set. Joint sparse coding is then performed using the software package described in [62]. Given a new modulated image, we first divide out the mean mask pattern and divide the result into overlapping 7 x 7 patches. Using our jointly trained dictionary, we reconstruct demodulated image patches, which are then merged.

Figure 5-1: "Undappling" a mask-modulated sensor image (left). The known projection of the mask pattern can be divided out; remaining noise patterns in out-of-focus regions are further reduced using a coupled dictionary method (right).

Following [61], we learn a coupled dictionary D = [D_dap; D_undap] from a training set containing projected light fields both with and without the mask patterns. One part of that dictionary, D_dap, is used for reconstructing the 2D coefficients α, whereas the other is used to synthesize the "undappled" image as D_undap α. As opposed to the framework discussed in the previous chapter, the dictionaries and coefficients for this application are purely two-dimensional.
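A compact sketch of this coupled training and the subsequent demodulation, under the concatenation strategy of Yang et al. [61]: stacked (x_i; y_i) patch pairs are fed to a single dictionary learner and the result is split in two. The SPAMS calls and parameter values are illustrative assumptions, not the exact code of [62].

```python
import numpy as np
import spams

# Hypothetical paired training patches (columns): X holds 7x7 divided-out
# modulated patches, Y the corresponding unmodulated patches.
n = 100000
X = np.random.rand(49, n)
Y = np.random.rand(49, n)

# Joint training: stack both feature spaces and learn one dictionary, so
# each atom has a "dappled" part and an "undappled" part (Eq. 5.1).
Z = np.asfortranarray(np.vstack([X, Y]))
D = spams.trainDL(Z, K=512, lambda1=0.15, iter=500)
D_dap, D_undap = D[:49, :], D[49:, :]

# Demodulate a new patch: code it against the dappled half only, then
# synthesize with the undappled half. Column normalization of D_dap is
# needed for a correct l1 minimization.
x_t = np.asfortranarray(np.random.rand(49, 1))
norms = np.linalg.norm(D_dap, axis=0)
A = np.asfortranarray(D_dap / norms)
alpha = spams.lasso(x_t, D=A, lambda1=0.1, mode=2).toarray().ravel()
y_t = D_undap @ (alpha / norms)   # "undappled" 7x7 patch, flattened
```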
5.2 Light Field Compression

We illustrate the compressibility of a 4D light field in Figure 4-3 both quantitatively and, for a single 4D patch, also qualitatively. Compression is achieved by finding the best representation of a light field with a fixed number of coefficients. This representation can be found by solving the sparse approximation problem [42]

\[
\underset{\alpha}{\text{minimize}} \;\; \|l - D\alpha\|_2^2 \quad \text{subject to} \quad \|\alpha\|_0 \le K
\tag{5.2}
\]

In this formulation, l is a 4D light field patch that is represented by at most K atoms. As opposed to sparse reconstructions from coded 2D projections, light field compression strives to reduce the required data size of a given 4D light field. As this technique is independent of the light field acquisition process, we envision future applications in high-dimensional data storage and transfer.

Figure 5-2: Light field compression. A light field is divided into small 4D patches and represented by only a few coefficients. Light field atoms achieve a higher image quality (PSNR 27.2 dB) than DCT coefficients (PSNR 24.5 dB).

Figure 5-2 compares an example light field compressed into a fixed number of DCT coefficients and light field atoms. For this experiment, a light field with 5 x 5 views is divided into distinct 9 x 9 x 5 x 5 spatio-angular patches that are individually compressed. Light field atoms allow for high image quality with a low number of coefficients and smooth transitions between neighboring patches. While this experiment demonstrates improved compressibility of a single example light field using atoms, further investigation is required to analyze the suitability of overcomplete dictionaries for compressing a wide range of different light fields.
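As a sketch of Equation 5.2 in practice, the following compresses one vectorized 4D patch with orthogonal matching pursuit, assuming SPAMS's spams.omp as a stand-in for whichever ℓ0 solver is preferred; the names, sizes, and K value are illustrative. The same routine, applied to noisy patches, also acts as the nonlinear denoising filter of Section 5.3 below.

```python
import numpy as np
import spams

def compress_patch(l, D, K=20):
    """Approximate a flattened 4D light field patch l by at most K atoms of D."""
    Dn = np.asfortranarray(D / np.linalg.norm(D, axis=0))  # unit-norm atoms
    x = np.asfortranarray(l.reshape(-1, 1))
    alpha = spams.omp(x, Dn, L=K).toarray().ravel()  # greedy solver for Eq. (5.2)
    return Dn @ alpha, alpha                         # reconstruction and K-sparse code

# Illustrative usage on a random stand-in for a 9 x 9 x 5 x 5 patch:
D = np.random.randn(9 * 9 * 5 * 5, 4000)             # stand-in dictionary
l = np.random.rand(9 * 9 * 5 * 5)
l_hat, alpha = compress_patch(l, D, K=20)
# Storing the K nonzero coefficients and their indices, instead of all
# 2025 samples, is what yields the compression.
print(np.count_nonzero(alpha))                       # <= 20
```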
Figure 5-3: Light field denoising. Sparse coding and the proposed 4D dictionaries can remove noise from 4D light fields.

5.3 Light Field Denoising

Another popular application of dictionary-based sparse coding techniques is image denoising [20]. Following this trend, we apply sparse coding techniques to denoise 4D light fields. Similar to light field compression, the goal of denoising is not to reconstruct higher-resolution data from a smaller number of measurements, but to represent a given 4D light field by a linear combination of a small number of noise-free atoms. In practice, this can be achieved in the same way as compression, i.e., by applying Equation 5.2. For a noisy target light field l, this effectively applies a nonlinear four-dimensional denoising filter to the light field. Figure 5-3 shows the central view and close-ups of one row of a noisy 4D light field and its denoised representation. We hope that 4D light field denoising will find applications in emerging commercial light field cameras, as this technique is independent of the proposed compressive reconstruction framework and could be applied to light fields captured with arbitrary optical setups.

5.4 Compressive Focal Stacks

One application of 4D light fields is post-capture refocus. It could be argued that recovering the full 4D light field from a 2D sensor image is unnecessary for this particular application; a 3D focal stack may suffice. We evaluate compressive focal stacks in this section.

Figure 5-4: Focal stack setup. The focus mechanism of an SLR camera lens is intercepted and controlled by an Arduino board. Synchronized with a remote shutter, this allows us to capture high-quality focal stacks.

We capture focal stacks using a Canon T3 SLR camera with a modified Canon 50mm lens. The connection between lens and camera body is blocked and intercepted with wires that are connected to an Arduino Duemilanove board. The latter is controlled from a PC and synchronized with a remote camera trigger that is also controlled, in software, from the same PC. With the lens set to autofocus, its focus can be programmed and adjusted to the desired focal setting. Figure 5-4 shows the setup with a scene in the background.

Using the described setup, we capture multiple scenes as training sets for the dictionary learning stage. The central focal slices of all training focal stacks are shown in Figure 5-5. Each stack contains six images focused at different distances. A 2x overcomplete dictionary with 3D focal stack atoms, each with a resolution of 10 x 10 x 6 pixels, is learned from this training set.

Figure 5-5: Focal stack training set. Central focal slices for each scene are shown along with downsampled versions of all the slices.

We simulate a coded focal sweep for another scene. The sweep refocuses the camera lens throughout the exposure time of a single image. An additional light modulator is assumed to be located directly on the sensor. Alternatively, the readout of each pixel can be controlled, making such a light modulator unnecessary. For this experiment, we employ random codes: each focal slice is multiplied with a random pattern and integrated to get a coded 2D projection as shown in Figure 5-6 (left). Given this image, we employ the compressive reconstruction techniques discussed in the previous chapter together with the learned dictionary to recover the six focal stack slices, as seen in Figure 5-6.

Figure 5-6: Compressive focal stack results. A single coded projection is simulated (lower left) and used to recover all six slices (three of them are shown in the bottom row) of the original focal stack (top row). While the focal stack is successfully recovered, we observe a slight loss of image sharpness for in-focus image regions. This could be overcome with optimized projections (as demonstrated for 4D light fields in Chapter 4) or by acquiring multiple shots.
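To make the forward model concrete, here is a minimal sketch of the simulated coded focal sweep, assuming a focal stack stored as an array and per-pixel, per-slice random on/off codes; the names and the binary code choice are illustrative assumptions.

```python
import numpy as np

def coded_focal_sweep(stack, seed=None):
    """Simulate a single coded 2D projection of a focal stack.

    stack : (H, W, S) array of S focal slices (here S = 6)
    Returns the coded image and the codes needed for reconstruction.
    """
    rng = np.random.default_rng(seed)
    H, W, S = stack.shape
    codes = rng.integers(0, 2, size=(H, W, S)).astype(np.float64)  # random on/off per pixel and slice
    coded = (stack * codes).sum(axis=2)  # each slice modulated, then integrated over the sweep
    return coded, codes

# The per-patch sensing matrix for reconstruction with the 3D focal stack
# dictionary is then Phi = [diag(c_1) ... diag(c_S)], assembled from the
# codes of a 10 x 10 window across all S slices.
```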
While the optical setup for this experiment is different from the mask-based setup used for 4D light field acquisition, we gain a number of interesting insights into the sparsity-exploiting reconstruction techniques proposed in this thesis. First, coded projections of high-dimensional visual signals can be captured with a variety of different optical setups. The choice of setup depends on available hardware and optimal code designs; ideally, the optical design captures a single or multiple optimized coded projections. Second, the essential building blocks of high-dimensional visual signals can be learned from appropriate training sets; we demonstrate this for 3D focal stacks and 4D light fields. This approach, however, is equally applicable to other signal types, such as 8D reflectance fields or the full plenoptic function. Using the learned sparse representations, or atoms, of the visual signals, high-dimensional reconstructions can be computed from the lower-dimensional coded projections. The quality of the results depends on a variety of factors, such as the sparsity of the signal types in their atoms, the dimensionality gap between projections and signal, the quality of the optical codes, as well as the number of captured projections.

Chapter 6

Discussion and Conclusion

In summary, this thesis explores compressive light field acquisition by analyzing and evaluating sparse representations of natural light fields, optimized optical coding strategies, robust high-dimensional light field reconstruction from lower-dimensional coded projections, and additional applications such as 4D light field compression and denoising.

Compressive light field acquisition is closely related to emerging compressive light field displays [56, 30, 57]. These displays are compressive in the sense that the display hardware has insufficient degrees of freedom to exactly represent the target light field and relies on an optimization process to determine a perceptually acceptable approximation. Compressive cameras are constrained in their degrees of freedom to capture each ray of a light field and instead record coded projections with subsequent sparsity-exploiting reconstructions. We envision future compressive image acquisition and display systems to be a single, integrated framework that exploits the duality between computational light acquisition and display. Most recently, researchers have started to explore such ideas for display-adaptive rendering [26].

6.1 Benefits and Limitations

The primary benefits of the proposed computational camera architecture compared to previous techniques are increased light field resolution and a reduced number of required photographs. We show that reconstructions from coded light field projections captured in a single image can achieve high quality; this is facilitated by the proposed co-design of optical codes, nonlinear reconstruction techniques, and sparse representations of natural light fields. However, the achieved resolution of photographed objects decreases at larger distances from the camera's focal plane. Attenuation masks lower the light efficiency of the optical system compared to refractive optical elements, such as lenslet arrays, yet they are less costly. The mask patterns are fundamentally limited by diffraction. Dictionaries have to be stored along with sparse reconstructions, thereby increasing memory requirements. Processing times of the discussed compressive camera design are higher than those of most other light field cameras. While these seem prohibitive at the moment, each small 4D patch is reconstructed independently; the computational routines discussed in this thesis are well suited for parallel implementation, for instance on GPUs.

The camera prototype exhibits a number of artifacts, including angle-dependent color and intensity nonlinearities as well as limited contrast. Observed color shifts are intrinsic to the LCoS, due to birefringence of the liquid crystals; this spatial light modulator (SLM) is designed to work with collimated light, but we operate it outside its designed angular range so as to capture ground truth light fields for evaluation and dictionary learning. Current resolution limits of the captured results are imposed by the limited contrast of the LCoS: multiple pixels have to be binned. Simple coded transparencies or alternative SLMs could overcome these optical limitations in future hardware implementations.

Atoms captured in overcomplete dictionaries are shown to represent light fields more sparsely than other basis representations. However, these atoms are adapted to the training data, including its depth range, aperture diameter, and general scene structures, such as occlusions and high-frequency textures. We demonstrate that even a few training light fields that include reflections, refractions, texture, and occlusions suffice to reconstruct a range of scene types. Nevertheless, we expect reconstruction quality to degrade for scenes that contain structures not captured in the training data, as for instance shown for parallax exceeding that of the training data in Section 4.2.4. A detailed analysis of how specific light field atoms are to their training data with respect to all possible parameters, however, is left for future work.

6.2 Future Work

Our current prototype camera is designed as a multipurpose device capturing coded projections as well as reference light fields for dictionary learning and for evaluating reconstructions. Future devices will decouple this process. Whereas coded projections can be recorded with conventional cameras enhanced by coded masks, the dictionary learning process will rely increasingly on large online datasets of natural light fields. These are likely to appear as a direct result of the commercial success of light field cameras in the consumer market. Such developments have two advantages.
First, a larger range of different training data will make light field dictionaries more robust and better adapted to specific applications. Second, widely available dictionaries will fuel research on novel optical camera designs and commercial implementations of compressive light field cameras.

While we evaluate a range of existing light field camera designs and devise optimal coding strategies for them, we would like to explore new optical setups in the future. Evaluation with error metrics alternative to PSNR, such as perceptually-driven strategies, is an interesting avenue of future work. Finally, we plan to explore compressive acquisition of the full plenoptic function, adding temporal and spectral light variation to the proposed framework. While this increases the dimensionality of the dictionary and reconstruction problem, we believe that exactly this increase in dimensionality will further improve the compressibility and sparsity of the underlying visual signals.

The proposed compressive camera architecture is facilitated by the synergy of optical design and computational processing. We believe that the exploration of sparse representations of high-dimensional visual signals has only just begun; fully understanding the latent structures of the plenoptic function, including spatial, angular, spectral, and temporal light variation, seems one step closer but still not within reach. Novel optical designs and improved computational routines, both for data analysis and reconstruction, will have to be devised, placing future camera systems at the intersection of scientific computing, information theory, and optics engineering. We end by showcasing a comparison with existing commercial architectures in Figure 6-1. We believe that this thesis provides many insights indispensable for future computational camera designs.

Figure 6-1: We compare our mask-based light field camera design with existing commercial technologies, i.e., microlens-based approaches and camera arrays. This comparison can help commercial camera makers choose the right technology for their specific purposes.

Camera Design                               | Spatial Resolution | Angular Resolution | Depth Resolution | Computation | Fabrication Cost | Light Transmission
Microlens-Based (Lytro)                     | Low                | High               | Medium           | Low         | Medium           | High
Camera Arrays (Pelican Imaging)             | High               | High               | High             | Low         | High             | High
Compressive Mask-Based Light Field (Ours)   | High               | Medium-High        | Medium           | High        | Low              | Medium

Bibliography

[1] Edward Adelson and John Wang. Single Lens Stereo with a Plenoptic Camera. IEEE Trans. PAMI, 14(2):99-106, 1992.
[2] Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. Approximating Extent Measures of Points. J. ACM, 51(4):606-635, 2004.
[3] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation. IEEE Trans. Signal Processing, 54(11):4311-4322, 2006.
[4] Michal Aharon, Michael Elad, and Alfred M. Bruckstein. K-SVD and its Non-negative Variant for Dictionary Design. In Proc. SPIE Conference Wavelets, pages 327-339, 2005.
[5] Amit Ashok and Mark A. Neifeld. Compressive Light Field Imaging. In Proc. SPIE 7690, page 76900Q, 2010.
[6] S.D. Babacan, R. Ansorge, M. Luessi, P.R. Mataran, R. Molina, and A.K. Katsaggelos. Compressive Light Field Sensing. IEEE Trans. Im. Proc., 21(12):4746-4757, 2012.
[7] Stephen Becker, Jerome Bobin, and Emmanuel Candes. NESTA: A Fast and Accurate First-Order Method for Sparse Recovery. In Applied and Computational Mathematics, 2009. http://www-stat.stanford.edu/~candes/nesta.
[8] Tom Bishop, Sara Zanetti, and Paolo Favaro. Light-Field Superresolution. In Proc. ICCP, pages 1-9, 2009.
[9] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[10] E. Candes, J. Romberg, and T. Tao. Stable Signal Recovery from Incomplete and Inaccurate Measurements. Comm. Pure Appl. Math., 59:1207-1223, 2006.
[11] Emmanuel Candes, Justin Romberg, and Terence Tao. Robust Uncertainty Principles: Exact Signal Reconstruction from Highly Incomplete Frequency Information. IEEE Trans. Information Theory, 52(2):489-509, 2006.
[12] Emmanuel Candes and Michael B. Wakin. An Introduction to Compressive Sampling. IEEE Signal Processing, 25(2):21-30, 2008.
[13] Emmanuel J. Candes, Yonina C. Eldar, Deanna Needell, and Paige Randall. Compressed Sensing with Coherent and Redundant Dictionaries. Appl. and Comp. Harmonic Analysis, 31(1):59-73, 2011.
[14] Emmanuel J. Candes, Michael B. Wakin, and Stephen P. Boyd. Enhancing Sparsity by Reweighted ℓ1 Minimization. Journal of Fourier Analysis and Applications, 15(5):877-905, 2008.
[15] Scott Shaobing Chen, David L. Donoho, and Michael A. Saunders. Atomic Decomposition by Basis Pursuit. SIAM J. on Scientific Computing, 20:33-61, 1998.
[16] D. Donoho. Compressed Sensing. IEEE Trans. Inform. Theory, 52(4):1289-1306, 2006.
[17] D. Donoho. For Most Large Underdetermined Systems of Linear Equations, the Minimal ℓ1-Norm Solution is also the Sparsest Solution. Comm. Pure Appl. Math., 59(6):797-829, 2006.
[18] Marco F. Duarte, Mark A. Davenport, Dharmpal Takhar, Jason Laska, Ting Sun, Kevin Kelly, and Richard Baraniuk. Single Pixel Imaging via Compressive Sampling. IEEE Signal Processing Magazine, 2008.
[19] J.M. Duarte-Carvajalino and G. Sapiro. Learning to Sense Sparse Signals: Simultaneous Sensing Matrix and Sparsifying Dictionary Optimization. IEEE Trans. Im. Proc., 18(7):1395-1408, 2009.
[20] M. Elad and M. Aharon. Image Denoising Via Sparse and Redundant Representations Over Learned Dictionaries. IEEE Trans. Im. Proc., 15(12):3736-3745, 2006.
[21] Micha Feigin, Dan Feldman, and Nir A. Sochen. From High Definition Image to Low Space Optimization. In Scale Space and Var. Methods in Comp. Vision, volume 6667, pages 459-470, 2012.
[22] Dan Feldman. Coresets and Their Applications. PhD thesis, Tel-Aviv University, 2010.
[23] Rob Fergus, A. Torralba, and W. T. Freeman. Random Lens Imaging. Technical Report TR-2006-058, MIT, 2006.
[24] T. Georgiev and A. Lumsdaine. Spatio-angular Resolution Tradeoffs in Integral Photography. Proc. EGSR, pages 263-272, 2006.
[25] S. Gortler, R. Grzeszczuk, R. Szeliski, and M. Cohen. The Lumigraph. In Proc. ACM SIGGRAPH, pages 43-54, 1996.
[26] Felix Heide, Gordon Wetzstein, Ramesh Raskar, and Wolfgang Heidrich. Adaptive Image Synthesis for Compressive Displays. ACM Trans. Graph. (SIGGRAPH), 32(4):1-11, 2013.
[27] Yasunobu Hitomi, Jinwei Gu, Mohit Gupta, Tomoo Mitsunaga, and Shree K. Nayar. Video from a Single Coded Exposure Photograph using a Learned Over-Complete Dictionary. In Proc. IEEE ICCV, 2011.
[28] H. Ives. Parallax Stereogram and Process of Making Same. US patent 725,567, 1903.
[29] M.H. Kamal, M. Golbabaee, and P. Vandergheynst. Light Field Compressive Sensing in Camera Arrays. In Proc. ICASSP, pages 5413-5416, 2012.
[30] D. Lanman, G. Wetzstein, M. Hirsch, W. Heidrich, and R. Raskar. Polarization Fields: Dynamic Light Field Display using Multi-Layer LCDs. ACM Trans. Graph. (SIGGRAPH Asia), 30:1-9, 2011.
[31] Douglas Lanman, Ramesh Raskar, Amit Agrawal, and Gabriel Taubin.
Shield Fields: Modeling and Capturing 3D Occluders. ACM Trans. Graph. (SIGGRAPH Asia), 27(5):131, 2008.
[32] Daniel D. Lee and Sebastian Seung. Learning the Parts of Objects by Non-negative Matrix Factorization. Nature, 401:788-791, 1999.
[33] Anat Levin, William T. Freeman, and Frédo Durand. Understanding Camera Trade-Offs through a Bayesian Analysis of Light Field Projections. In Proc. ECCV, pages 88-101, 2008.
[34] Anat Levin, Samuel W. Hasinoff, Paul Green, Frédo Durand, and William T. Freeman. 4D Frequency Analysis of Computational Cameras for Depth of Field Extension. ACM Trans. Graph. (SIGGRAPH), 28(3):97, 2009.
[35] M. Levoy and P. Hanrahan. Light Field Rendering. In Proc. ACM SIGGRAPH, pages 31-42, 1996.
[36] Chia-Kai Liang, Tai-Hsu Lin, Bing-Yi Wong, Chi Liu, and Homer H. Chen. Programmable Aperture Photography: Multiplexed Light Field Acquisition. ACM Trans. Graph. (SIGGRAPH), 27(3):1-10, 2008.
[37] Gabriel Lippmann. La Photographie Intégrale. Académie des Sciences, 146:446-451, 1908.
[38] A. Lumsdaine and T. Georgiev. The Focused Plenoptic Camera. In Proc. ICCP, pages 1-8, 2009.
[39] J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online Dictionary Learning for Sparse Coding. In International Conference on Machine Learning, 2009.
[40] Julien Mairal, Jean Ponce, and Guillermo Sapiro. Online Learning for Matrix Factorization and Sparse Coding. Journal of Machine Learning Research, 11:19-60, 2010.
[41] Julien Mairal, Guillermo Sapiro, and Michael Elad. Multiscale Sparse Representations with Learned Dictionaries. Proc. ICIP, 2007.
[42] B. K. Natarajan. Sparse Approximate Solutions to Linear Systems. SIAM J. Computing, 24:227-234, 1995.
[43] Ren Ng. Fourier Slice Photography. ACM Trans. Graph. (SIGGRAPH), 24(3):735-744, 2005.
[44] Ren Ng, Marc Levoy, Mathieu Brédif, Gene Duval, Mark Horowitz, and Pat Hanrahan. Light Field Photography with a Hand-Held Plenoptic Camera. Technical report, Stanford University, 2005.
[45] Rohit Pandharkar, Ahmed Kirmani, and Ramesh Raskar. Lens Aberration Correction Using Locally Optimal Mask Based Low Cost Light Field Cameras. Optical Society of America, 2010.
[46] Jae Young Park and Michael B. Wakin. A Geometric Approach to Multi-View Compressive Imaging. EURASIP Journal on Advances in Signal Processing, 37, 2012.
[47] Christian Perwass and Lennart Wietzke. Single Lens 3D-Camera with Extended Depth-of-Field. In Proc. SPIE 8291, pages 29-36, 2012.
[48] D. Reddy, A. Veeraraghavan, and R. Chellappa. P2C2: Programmable Pixel Compressive Camera for High Speed Imaging. In Proc. IEEE CVPR, pages 329-336, 2011.
[49] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. J. Royal Statist. Soc B, 58(1):267-288, 1996.
[50] J. A. Tropp and S. J. Wright. Computational Methods for Sparse Solution of Linear Inverse Problems. Proc. IEEE, 98(6):948-958, 2010.
[51] Joel A. Tropp. Topics in Sparse Approximation. PhD thesis, University of Texas at Austin, 2004.
[52] Joel A. Tropp and Anna C. Gilbert. Signal Recovery From Random Measurements Via Orthogonal Matching Pursuit. IEEE Trans. Information Theory, 53(12):4655-4666, 2007.
[53] E. van den Berg and M. P. Friedlander. Probing the Pareto Frontier for Basis Pursuit Solutions. SIAM Journal on Scientific Computing, 31(2):890-912, 2008. http://www.cs.ubc.ca/labs/scl/spgl1.
[54] Ashok Veeraraghavan, Ramesh Raskar, Amit Agrawal, Ankit Mohan, and Jack Tumblin. Dappled Photography: Mask Enhanced Cameras for Heterodyned Light Fields and Coded Aperture Refocusing. ACM Trans. Graph. (SIGGRAPH), 26(3):69, 2007.
[55] G.
Wetzstein, I. Ihrke, and W. Heidrich. On Plenoptic Multiplexing and Reconstruction. IJCV, pages 1-16, 2012.
[56] G. Wetzstein, D. Lanman, W. Heidrich, and R. Raskar. Layered 3D: Tomographic Image Synthesis for Attenuation-based Light Field and High Dynamic Range Displays. ACM Trans. Graph. (SIGGRAPH), 2011.
[57] G. Wetzstein, D. Lanman, M. Hirsch, and R. Raskar. Tensor Displays: Compressive Light Field Synthesis using Multilayer Displays with Directional Backlighting. ACM Trans. Graph. (SIGGRAPH), 31:1-11, 2012.
[58] Bennett Wilburn, Neel Joshi, Vaibhav Vaish, Eino-Ville Talvala, Emilio Antunez, Adam Barth, Andrew Adams, Mark Horowitz, and Marc Levoy. High Performance Imaging using Large Camera Arrays. ACM Trans. Graph. (SIGGRAPH), 24(3):765-776, 2005.
[59] Zhimin Xu and Edmund Y. Lam. A High-resolution Lightfield Camera with Dual-mask Design. In Proc. SPIE 8500, page 85000U, 2012.
[60] Allen Yang, Arvind Ganesh, Shankar Sastry, and Yi Ma. Fast ℓ1-Minimization Algorithms and An Application in Robust Face Recognition: A Review. Technical report, UC Berkeley, 2010.
[61] Jianchao Yang, Zhaowen Wang, Zhe Lin, S. Cohen, and T. Huang. Coupled Dictionary Training for Image Super-Resolution. IEEE Trans. Im. Proc., 21(8):3467-3478, 2012.
[62] Roman Zeyde, Michael Elad, and Matan Protter. On Single Image Scale-Up Using Sparse-Representations. In Proc. Int. Conference on Curves and Surfaces, pages 711-730, 2012.