Efficient Cloud Photo Enhancement using Compact Transform Recipes

by Michael Gharbi

M.S. in Applied Mathematics, Ecole Polytechnique (2013)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Science in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, June 2015.

© Massachusetts Institute of Technology 2015. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 19, 2015
Certified by: Frédo Durand, Professor of Electrical Engineering and Computer Science, Thesis Supervisor
Accepted by: Leslie A. Kolodziejski, Chairman, Department Committee on Graduate Students

Abstract

Cloud image processing is often proposed as a solution to the limited computing power and battery life of mobile devices: it allows complex algorithms to run on powerful servers with a virtually unlimited energy supply. Unfortunately, this idyllic picture overlooks the cost of uploading the input and downloading the output images, which can be significantly higher, in both time and energy, than that of local computation. For most practical scenarios, the compression ratios required to make cloud image processing energy- and time-efficient would degrade an image so severely as to make the result unacceptable. Our work focuses on image enhancements that preserve the overall content of an image. Our key insight is that, in this case, the server can compute and transmit a description of the transformation from input to output, which we call a transform recipe. At equivalent quality, recipes are much more compact than jpeg images: this reduces the data to download. Furthermore, we can fit the parameters of a recipe from a highly degraded input-output pair, which decreases the data to upload. The client reconstructs an approximation of the true output by applying the recipe to its local high-quality input. We show, on a dataset of 168 images spanning 10 applications, that recipes can handle a diverse set of image enhancements and, for the same amount of transmitted data, provide much higher fidelity compared to jpeg-compressing both the input and output. We further demonstrate their usefulness in a practical proof-of-concept system with which we measure energy consumption and latency for both local and cloud computation.

Thesis Supervisor: Frédo Durand
Title: Professor of Electrical Engineering and Computer Science

Acknowledgments

First and foremost, I would like to thank my advisor Frédo Durand, whose guidance and insights proved invaluable in navigating research in Computer Science. Frédo always gave me the freedom to explore my interests and sharpen my abilities. His enthusiasm kept me going whenever uncertainty, frustration or discouragement arose. I thank Sylvain Paris, who has been a precious mentor and an inspiration since I met him. His optimism and knowledge pushed me to seek new challenges. Thanks to my collaborators Gaurav, Jonathan and Yichang, who made this work possible.
Thanks to the people of the MIT Graphics Group: you are the very reason this environment is so uniquely welcoming and stimulating. In particular, I thank my two office-mates, Tiam and Abe, for reminding me that life is not so serious after all. Thanks to Bryt for keeping the soul of this place alive through generations of grad students. Finally, I thank my family and friends for their unconditional support and the interest they always put in both my work and my personal development. Thank you Robi for elevating my mind and keeping me on track with the pursuit of happiness.

Contents

1 Introduction
  1.1 Related Work
2 Reducing Data Transfers
  2.1 Transform Recipes
    2.1.1 Residual
    2.1.2 High Frequencies
    2.1.3 Chrominance channels
    2.1.4 Luminance channel
  2.2 Reconstruction
  2.3 Data compression
    2.3.1 Upstream Compression
    2.3.2 Input Pre-processing
3 Evaluation
  3.1 Test enhancements
  3.2 Expressiveness and robustness
  3.3 Prototype system
4 Conclusion
A Recipe Parameters

List of Figures

1-1 Overview of our cloud photo enhancement pipeline with transform recipes.
1-2 Transform recipes can be estimated from severely degraded input-output pairs and applied to the original high-quality input.
2-1 Reconstruction of the output image from the input and the recipe coefficients.
2-2 Recipe coefficients computed for the photo in Figure 1-2.
2-3 Adding noise to the upsampled degraded image makes the fitting well-posed.
2-4 Processing the downsampled input directly fails to capture the high frequencies of the transformation.
2-5 Overlapping blocks with linear interpolation avoid blocking artifacts.
3-1 Impact of the additional luminance features on reconstruction quality.
3-2 Reconstruction quality for different recipe settings and degrees of input compression.
3-3 Example results on a variety of photographic enhancements.
3-4 Additional examples.
3-5 Power consumption over time for local, JPEG-transfer, and recipe-based processing.
A-1 The block size w controls the quality-compactness trade-off of transform recipes. Small blocks yield a fine-grained description of the transformation but require more storage space, whereas large blocks form a coarser and more concise description; with larger blocks, the recipe loses some spatial accuracy. In practice, we recommend values between 32 and 128 for w. These results are for an uncompressed input.
A-2 The number of linear segments p in the piecewise-linear tone curve has a different impact depending on the application. For instance, Recoloring does not transform the luminance channel of the input and therefore does not need more precision here. We settled on p = 6 in all our experiments.

List of Tables

2.1 Upsampling the degraded input and adding noise is critical to reconstruction quality.
3.1 Impact of the luminance curve and multiscale features on reconstruction quality.
3.2 Reconstruction results per enhancement (non-degraded and "standard" settings).
3.3 Reconstruction results per enhancement ("medium" and "low" settings).
3.4 End-to-end latency and energy cost of recipe-based, jpeg-based, and local processing.
A.1 Per-enhancement PSNR and SSIM for recipes with w = 32 and no input compression.
A.2 Per-enhancement PSNR and SSIM for recipes with w = 32 and an input compressed to %up = 1.0. Features as in Table A.1.
A.3 Per-enhancement PSNR and SSIM for recipes with w = 64 and an input compressed to %up = 0.4. Features as in Table A.1.
Chapter 1

Introduction

[Figure 1-1 here: pipeline diagram. Mobile device: the original input (2.3 MB) is downsampled and compressed (17 kB) and uploaded; the cloud processes it and fits a recipe; the recipe is downloaded (41 kB); the client reconstructs the output (PSNR = 30 dB against the 2.5 MB ground-truth output).]

Figure 1-1: Cloud computing is often thought of as an ideal solution to enable complex algorithms on mobile devices; by exploiting remote resources, one expects to reduce running time and energy consumption. However, this ignores the cost of transferring data over the network. When accounted for, this overhead often takes away the benefits of the cloud, especially for data-heavy photo applications. We introduce a new photo processing pipeline that reduces the amount of transmitted data. The core of our approach is the transform recipe, a representation of the image transformation applied to a photo that is compact and can be accurately estimated from an aggressively compressed input photo. These properties make cloud computing more efficient in situations where transferring standard jpeg-compressed images would be too costly in terms of energy and/or time. In the example above, where we used our most aggressive setting, our approach reduces the amount of transferred data by about 80x compared to using images compressed with standard jpeg settings, while producing an output visually similar to the ground-truth result computed directly from the original uncompressed photo.
Mobile devices such as cell phones and cameras are exciting platforms for computational photography, but their limited computing power, memory and battery life can make it challenging to implement advanced algorithms. Cloud computing is often hailed as a solution that allows connected devices to benefit from the speed and virtually unlimited power supply of remote servers. Cloud processing is also imperative for algorithms that require large databases, e.g. [15, 23], and can streamline application development, especially in the face of mobile platform fragmentation. Unfortunately, the cost of transmitting the input and output images can dwarf that of local computation, making cloud solutions surprisingly expensive both in time and energy [2, 10]. For instance, a 6-second computation on a 16-megapixel image captured by a Samsung Galaxy S5 would consume 20 J. Processing the same image in the cloud would take 14 seconds and draw 54 J.¹ In short, efficient cloud photo enhancement requires a reduction of the data transferred that is beyond what lossy image compression can achieve while keeping image quality high.

¹Transferring the jpeg images with their default compression settings (4 MB each way) over a 4G network with a bandwidth of 4 Mbit/s, according to our estimate of the average power draw and a model like that of [14].
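As a rough back-of-the-envelope reading of this footnote (the arithmetic below is ours, not the thesis's; the quoted 14 s presumably overlaps transfer with server computation):

$$t \approx \frac{2 \times 4\,\mathrm{MB} \times 8\,\mathrm{bit/byte}}{4\,\mathrm{Mbit/s}} = 16\,\mathrm{s}, \qquad \frac{54\,\mathrm{J}}{16\,\mathrm{s}} \approx 3.4\,\mathrm{W},$$

i.e., the radio alone stays busy for roughly 16 s, and the quoted energy corresponds to an implied average draw of about 3.4 W, a plausible figure for a cellular radio.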
In this work, we focus on photographic enhancements that preserve the overall content of an image (in the sense of style vs. content) and do not spatially warp it. This includes content-aware photo enhancement [12], dehazing [13], edge-preserving enhancements [17, 7], colorization [16], style transfer [22, 1], and time-of-day hallucination [23], but excludes dramatic changes such as inpainting. For this class of enhancements, the input and output images are usually highly correlated. We exploit this observation to construct a representation of the transformation from input to output that we call a transform recipe. By design, recipes are much more compact than images. And because they capture the transformation applied to an image, they are forgiving of a lower input quality. These are the key properties that allow us to cut down the data transfers. With our method, the mobile client uploads a downsampled and highly degraded image instead of the original input, thus reducing the upload cost. The server upsamples this image back to its original resolution and computes the enhancement. From this low-quality input-output pair, the server fits the recipe and sends it back to the client instead of the output, incurring a reduced download cost. In turn, the client combines the recipe with the original high-quality input to reconstruct the output (Fig. 1-1). While our recipes are lossy, they provide a high-fidelity approximation of the full enhancement yet are one to two orders of magnitude more compact.

Transform recipes use a combination of local affine transformations of the YUV channels of a shallow Laplacian pyramid, as well as local tone curves for the highest pyramid level. This makes them flexible and expressive, so as to adapt to a variety of enhancements, with a moderate computational footprint on the mobile device.

We evaluate the quality of our reconstruction on several standard photo editing tasks, including detail enhancement, general tonal and color adjustment, and recolorization. We demonstrate that, in practice, our approach translates into shorter processing times and lower energy consumption with a proof-of-concept implementation running on a smartphone instrumented to measure its power usage.

1.1 Related Work

Image compression. Image compression has been a long-standing problem in the image processing community [18]. The data used to represent digital images can be reduced by exploiting their structure and redundancy. Lossless image compression methods and formats generally rely on general-purpose compression techniques such as run-length encoding (BMP format), adaptive dictionary algorithms [31, 28], entropy coding [11, 29], or a combination thereof [6]. These methods typically lower the data size of natural images to around 50% to 70% of that of an uncompressed bitmap, which is virtually never enough for cloud-based image processing tasks. Alternatively, lossy image compression methods offer more aggressive compression ratios at the expense of some visible degradation. They often try to minimize the loss of information relevant to the human visual system based on heuristics (e.g., using chroma sub-sampling). The most widespread lossy compression methods build on transform coding techniques: the image is transformed to a different basis, and the transform coefficients are quantized. Popular transforms include the Discrete Cosine Transform (JPEG [27]) and the Wavelet Transform (JPEG2000 [24]). Images compressed using these methods can reach 10% to 20% of their original size with very little visible degradation, but often this is still not sufficient for cloud photo applications. More recently, methods that exploit the intra-frame predictive components of video codecs have emerged (WebP, BPG). While they do improve the compression/quality trade-off, the improvements at very low bit-rates remain marginal, and these formats have struggled to spread as they compete with more mainstream ones.

Unlike traditional image compression techniques, we seek to compress the description of a transformation from the input of an image enhancement to its output. We build upon, and use, traditional image compression techniques as building blocks in our strategy: any lossy compression technique can be used to compress the input image before sending it to the cloud. JPEG is a natural candidate since it is ubiquitous and often implemented in hardware on mobile devices. We also use lossless image compression to further compress our recipes.

Recipes as a regression on image data. Using the input image to predict the output is inspired by shape recipes [8, 26]. In the context of stereo shape estimation, shape recipes are regression coefficients that predict band-passed shape information from an observed image of this shape. They form a low-dimensional representation of the high-dimensional shape data. Shape recipes describe the shape in relation to the image and constitute a set of rules to reconstruct the shape from the image. These recipes are simple, as they let the image bear much of the complexity of the representation. Similarly, our transform recipes capture the transformation between the input and output images, factoring out their structural complexity.
Whereas shape recipes are described in terms of convolution kernels with possible non-linear extensions, the complexity of our model is constrained by the amount of data we can afford to transfer back to the client. Our recipes also differ from shape recipes in that we have a notion of spatially varying transformation: we fit a different model for each block in a grid subdivision of the image. In contrast, shape recipes apply globally to a whole image sub-band. Finally, our recipes are computed from low-quality images and applied to the high-quality input, whereas shape recipes do not have this constraint.

[Figure 1-2 here: (a) degraded input sent to the server (4x4 bicubic downsampling, JPEG compression Q = 20); (b) output of the filter applied to the degraded input, PSNR = 18.8 dB; (c) original full-quality input; (d) close-up on (a); (e) close-up on (b); (f) our reconstruction, %down = 0.4%, PSNR = 32.6 dB; (g) reference output computed on the full-quality input.]

Figure 1-2: A major advantage of our transform recipes is that we can evaluate them using severely degraded input photos (a,d). Processing such a strongly compressed image generates a result with numerous jpeg artifacts (b,e). We can nevertheless use such pairs of degraded input-output photos to build a recipe that, when applied to the original artifact-free photo (c), produces a high-quality output (f) that closely approximates the reference output (g) directly computed from the original photo. The close-ups (c-g) show the region with highest error in our method (f).

Chapter 2

Reducing Data Transfers

We reduce the upload cost by downsampling and aggressively compressing the input image $I$. This yields a degraded image $\tilde{I}$. Upon reception by the server, $\tilde{I}$ is upsampled back to the original resolution. Let us denote this new low-quality image by $\bar{I}$. The server processes $\bar{I}$ to generate a low-fidelity output $\bar{O}$. The server then assembles a recipe $\hat{r}$ from the pair $(\bar{I}, \bar{O})$ such that $\hat{r}(I) \approx O$, where $O$ is the ground-truth output that would result from the direct processing of $I$. Recipes are designed to be compact and therefore incur a low download cost (Fig. 1-1). To reconstruct the final output, the client downloads the recipe and combines it with the original high-quality input $I$ (Fig. 2-1). Our objective with recipes is to represent a variety of standard photo enhancements. In particular, we seek to capture effects that may be spatially varying, multi-scale, and nonlinear.

2.1 Transform Recipes

We define recipes over $w \times w$ blocks, where $w$ controls the trade-off between compactness and accuracy. Small blocks yield a fine-grained description of the transformation but require more storage space, whereas large blocks form a coarser and more concise description. We use overlapping blocks to allow for smooth transitions between blocks and prevent blocking artifacts (Fig. 2-5): we lay out the $w \times w$ blocks on a grid with spacing $\frac{w}{2}$, so that each pair of adjacent blocks shares half its area (see the sketch below). We work in the same YCbCr color space as JPEG to separate the luminance $Y$ from the chrominance $(C_b, C_r)$ while still manipulating values that span $[0; 255]$ [9].
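To make the layout concrete, here is a minimal numpy sketch (our own illustration, not the thesis code) of the stride-$\frac{w}{2}$ block grid and of the per-pixel weights used later to linearly blend overlapping blocks; the tent-shaped weight profile is our assumption, chosen so that the blend amounts to a linear interpolation between neighboring blocks:

```python
import numpy as np

def block_grid(height, width, w=64):
    """Top-left corners of the w x w blocks, laid out with stride w/2 so that
    each pair of adjacent blocks shares half its area (assumes dimensions
    that are multiples of w/2; a real implementation would pad)."""
    stride = w // 2
    ys = range(0, height - w + 1, stride)
    xs = range(0, width - w + 1, stride)
    return [(y, x) for y in ys for x in xs]

def blend_weights(height, width, w=64):
    """Accumulate a tent-shaped weight for every block. Dividing per-block
    reconstructions, accumulated with the same tents, by this map linearly
    interpolates between overlapping blocks and avoids blocking artifacts."""
    t = 1.0 - np.abs(np.linspace(-1.0, 1.0, w))    # 1D tent profile
    tent = np.outer(t, t)                          # 2D tent, peak at block center
    acc = np.zeros((height, width))
    for (y, x) in block_grid(height, width, w):
        acc[y:y + w, x:x + w] += tent
    return np.maximum(acc, 1e-6)                   # guard zeros at image borders
```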
[Figure 2-1 here: pipeline diagram. The original input undergoes a multiscale decomposition; the recipe received from the server is applied to produce the reconstructed output. Low-frequency residual: linear scaling (3 coefs/block), linear upsampling. High-frequency chrominance: affine transform (8 coefs/block). High-frequency intensity: affine transform (4 coefs/block), piecewise-linear remapping (5 coefs/block), per-level linear scaling (log2(w) coefs/block); linear blending; pyramid collapse.]

Figure 2-1: To reconstruct the output image, we decompose the input as a shallow Laplacian pyramid. The lowpass residual is linearly re-scaled using the ratio coefficients $R_c(p)$. For the chrominance, we do not use the complete multiscale decomposition: each block in the high-pass of the chrominance undergoes an affine transformation parameterized by $A_c$ and $b_c$. The luminance channel is reconstructed using a combination of an affine transformation ($A_Y$, $b_Y$), a piecewise-linear transformation ($\{q_i\}$), and a linear scaling of the Laplacian pyramid levels ($\{m_\ell\}$).

To capture multi-scale effects, we decompose the images using Laplacian pyramids [3]. That is, we split $I$ and $O$ into $n+1$ levels $\{L_\ell[I]\}$ and $\{L_\ell[O]\}$, where the resolution of each level is half that of the preceding one. The first $n$ levels represent the details at increasingly coarser scales and the last level is a low-frequency residual. We set $n = \log_2(w)$ so that each block is represented by a single pixel in the residual. We compute a recipe in three steps: we first represent the transformation of the residual, then that of the other pyramid levels, and finally we quantize and encode the result of the first two steps in an off-the-shelf compressed format.

2.1.1 Residual

We found that even small errors in the residual propagate to large areas in the final reconstruction and can produce conspicuous artifacts in smooth regions like skies. Since the residual is only a small fraction of the data, we can afford to represent it with greater precision. We represent the low-frequency part of the transformation by the ratio of the residuals, i.e., for each pixel $p$ and each channel $c \in \{Y, C_b, C_r\}$:

$$R_c(p) = \frac{L_n[O_c](p) + 1}{L_n[I_c](p) + 1} \qquad (2.1)$$

where 1 is added to prevent divisions by zero and to ensure that $R_c$ is constant when $L_n[O_c] = L_n[I_c]$, that is, for edits like sharpening that only modify the high frequencies. This compresses better in the final stage.

2.1.2 High Frequencies

Representing the high-frequency part of the transformation requires much more data. Using per-pixel ratios would not be competitive with standard image compression. To model the high-frequency effect of the transformation, we combine all the levels of the pyramids except the residual into a single layer:

$$H[I] = \sum_{\ell=0}^{n-1} \mathrm{up}(L_\ell[I]) \qquad (2.2)$$

where $\mathrm{up}(\cdot)$ is an operator that upsamples the levels back to the size of the original image $I$. In practice, we use the property of Laplacian pyramids that all the levels, including the residual, add up to the original image to compute $H[I] = I - \mathrm{up}(L_n[I])$, which is more efficient. We use the same scheme to compute $H[O]$.

Next, we process $H[I]$ and $H[O]$ block by block, independently. Our strategy is to fit a parametric function to the pixels within each block to represent their transformation. In our early experiments, we found that while the chrominance transformations were simple enough to be well approximated by an affine function, no single simple parametric function was able to capture alone the diversity of luminance transformations generated by common photographic edits. Instead, our strategy is to rely on a combination of several functions that depend on many parameters, and to use a sparse regression to keep only a few of them in the final representation.
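The decomposition and the residual ratio of Equations (2.1) and (2.2) can be sketched in a few lines of numpy (again our own illustrative code; the blur kernel and the interpolation order of up(·) are our simplifications, as the thesis does not pin them down here):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def up(level, shape):
    """Upsample a pyramid level back to a target resolution (bilinear)."""
    return zoom(level, (shape[0] / level.shape[0],
                        shape[1] / level.shape[1]), order=1)

def laplacian_pyramid(img, n):
    """Shallow Laplacian pyramid: n detail levels plus a lowpass residual."""
    levels, cur = [], img.astype(np.float64)
    for _ in range(n):
        low = gaussian_filter(cur, sigma=1.0)[::2, ::2]   # blur + decimate
        levels.append(cur - up(low, cur.shape))           # detail level L_l
        cur = low
    levels.append(cur)                                    # residual L_n
    return levels

def highpass(img, n):
    """H[I] = I - up(L_n[I]): all detail levels collapsed into one layer."""
    residual = laplacian_pyramid(img, n)[-1]
    return img.astype(np.float64) - up(residual, img.shape)

def residual_ratio(res_out, res_in):
    """Eq. (2.1): R = (L_n[O] + 1) / (L_n[I] + 1), per pixel and channel."""
    return (res_out + 1.0) / (res_in + 1.0)
```

With these helpers, `residual_ratio(laplacian_pyramid(O_bar, n)[-1], laplacian_pyramid(I_bar, n)[-1])` yields the $R_c$ map of Equation (2.1) for one channel of the degraded pair.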
2.1.3 Chrominance channels

Let $O_{CC}$ denote the two chrominance channels of $O$: $O_{CC}(p)$ is the 2D vector containing the chrominance values of $O$ at pixel $p$. Within each $w \times w$ block $B$, we model the high frequencies of the chrominance $H[O_{CC}](p)$ as an affine function of $H[I](p)$. We use a standard least-squares regression and minimize:

$$\sum_{p \in B} \left\| H[O_{CC}](p) - A_c(B)\,H[I](p) - b_c(B) \right\|^2 \qquad (2.3)$$

where $A_c$ and $b_c$ are the $2 \times 3$ matrix and the 2D vector defining the affine model.

2.1.4 Luminance channel

We represent the high frequencies of the luminance $H[O_Y]$ with a more sophisticated function comprising several components. The first component is an affine function akin to that used for the chrominance channels: $A_Y(B)\,H[I](p) + b_Y(B)$, with $A_Y$ and $b_Y$ the $1 \times 3$ matrix and the scalar constant defining the affine function. Intuitively, this affine component captures the local changes of brightness and contrast. Next, we capture scale-dependent effects with a linear function of the pyramid levels: $\sum_{\ell=0}^{n-1} m_\ell(B)\,\mathrm{up}(L_\ell[I_Y])$, where $\{m_\ell(B)\}$ are the coefficients of the linear combination within the block $B$ and $\mathrm{up}(\cdot)$ ensures that all levels are upsampled back to the original resolution. Last, we add a term to account for nonlinear effects. We use a piecewise-linear function made of $p$ segments over the luminance range of the block. To define this function, we introduce $p-1$ regularly spaced luminance values $y_i = \min_B H[I_Y] + \frac{i}{p}\left(\max_B H[I_Y] - \min_B H[I_Y]\right)$, $i \in \{1, \dots, p-1\}$, and the segment functions $s_i(\cdot) = \max(\cdot - y_i, 0)$, where we omit the dependency of $y_i$ and $s_i$ on $B$ for clarity's sake. Intuitively, $s_i$ generates a unit change of slope at the $i$th node $y_i$. Equipped with these functions, we define our nonlinear term as $\sum_{i=1}^{p-1} q_i(B)\,s_i(H[I_Y])$, where the $\{q_i(B)\}$ coefficients control the change of slope between two consecutive linear segments.

To fit the complete model, we use a LASSO regression [25], i.e., we minimize a standard least-squares term:

$$\sum_{p \in B} \left( H[O_Y](p) - A_Y(B)\,H[I](p) - b_Y(B) - \sum_{\ell=0}^{n-1} m_\ell(B)\,\mathrm{up}(L_\ell[I_Y])(p) - \sum_{i=1}^{p-1} q_i(B)\,s_i(H[I_Y](p)) \right)^2 \qquad (2.4)$$

to which we add the $L_1$ norm of the free parameters $A_Y(B)$, $b_Y(B)$, $\{m_\ell(B)\}$, and $\{q_i(B)\}$.

Solving the optimization problems (Eq. 2.3 and Eq. 2.4) for each overlapping block, and computing the lowpass ratio (Eq. 2.1), we now have all the coefficients of the recipe (Fig. 2-2). In the next section, we describe how we reconstruct the output image from the full-quality input image and these recipe coefficients.

[Figure 2-2 here: six coefficient maps.]

Figure 2-2: Recipe coefficients computed for the photo in Figure 1-2. We remapped the values to the $[0; 1]$ range for display purposes. (a) lowpass residual $R_c$. (b,c) affine coefficients of the chrominance, $A_c$ and $b_c$. (d) affine coefficients of the luminance, $A_Y$ and $b_Y$. (e) multiscale coefficients $\{m_\ell\}$. (f) nonlinear coefficients $\{q_i\}$.
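To make the per-block regressions concrete, here is a sketch of the fits of Equations (2.3) and (2.4) for a single block, using numpy least squares for the chrominance and scikit-learn's Lasso as a stand-in for the L1-regularized regression of the text (the regularization weight `alpha` is our placeholder; the thesis does not state its value here). Note that the tone-curve nodes $y_i$ must be kept, or recomputed from the block's luminance range, for reconstruction:

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_chrominance(h_in, h_out_cc):
    """Eq. (2.3): least-squares affine fit H[O_CC] ~ A_c H[I] + b_c on one
    block. h_in: (npix, 3) high-pass YCbCr input; h_out_cc: (npix, 2)."""
    X = np.hstack([h_in, np.ones((h_in.shape[0], 1))])
    coefs, *_ = np.linalg.lstsq(X, h_out_cc, rcond=None)
    return coefs[:3].T, coefs[3]           # A_c: (2, 3) matrix, b_c: (2,) vector

def tone_nodes(h_in_y, p=6):
    """The p-1 regularly spaced nodes y_i over the block's luminance range."""
    lo, hi = h_in_y.min(), h_in_y.max()
    return lo + (np.arange(1, p) / p) * (hi - lo)

def luminance_features(h_in, h_in_y, pyr_levels_y, nodes):
    """Design matrix of Eq. (2.4): affine terms, upsampled pyramid levels
    up(L_l[I_Y]), and the hinge functions s_i(x) = max(x - y_i, 0)."""
    hinges = np.maximum(h_in_y[:, None] - nodes[None, :], 0.0)
    return np.hstack([h_in,                          # A_Y terms
                      np.ones((h_in.shape[0], 1)),   # b_Y
                      pyr_levels_y,                  # m_l terms
                      hinges])                       # q_i terms

def fit_luminance(h_in, h_in_y, pyr_levels_y, h_out_y, alpha=0.1):
    nodes = tone_nodes(h_in_y)
    X = luminance_features(h_in, h_in_y, pyr_levels_y, nodes)
    model = Lasso(alpha=alpha, fit_intercept=False).fit(X, h_out_y)
    return model.coef_, nodes              # sparse coefficients + tone nodes
```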
2.2 Reconstruction

We reconstruct the output $O$ using the recipe described in the previous section and the input $I$. We proceed in two steps: we first deal with the multiscale part of the recipe and then with its high-frequency component. This process is illustrated in Figure 2-1.

From Equation 2.1, for each pixel $p$ and each channel $c \in \{Y, C_b, C_r\}$, we get the expression for the lowpass residual:

$$L_n[O_c](p) = R_c(p)\left(L_n[I_c](p) + 1\right) - 1 \qquad (2.5)$$

Then, we reconstruct the intensity levels $\ell \in [0; n-1]$ of the multiscale component of Equation 2.4:

$$L_\ell[O_Y] = m_\ell\,L_\ell[I_Y] \qquad (2.6)$$

Together with the luminance channel of Equation 2.5, this gives a complete Laplacian pyramid $\{L_\ell[O_Y]\}$. We collapse this pyramid to get an intermediate output luminance channel $\hat{O}_Y$. At this point, the reconstruction is missing some high-frequency luminance components from Equation 2.4 and the chrominance channels $C_r$ and $C_b$. To reconstruct the remaining high-frequency information $H[O]$, we first compute $H_B[O]$ within each block, then linearly blend the results of overlapping blocks. For the chrominance, we apply the affine remapping that we estimated in Equation 2.3:

$$H_B[O_{CC}](p) = A_c(B)\,H[I](p) + b_c(B) \qquad (2.7)$$

And for the intensity channel, we apply the affine remapping and the nonlinear functions that we computed in Equation 2.4:

$$H_B[O_Y](p) = A_Y(B)\,H[I](p) + b_Y(B) + \sum_{i=1}^{p-1} q_i(B)\,s_i(H[I_Y](p)) \qquad (2.8)$$

We linearly interpolate $H_B[O_{CC}]$ and $H_B[O_Y]$ between overlapping blocks to get $H[O_{CC}]$ and $H[O_Y]$.

Finally, we put all the pieces together. We linearly upsample the chrominance residual (Eq. 2.5) and add the collapsed luminance pyramid, the linearly interpolated high-frequency chrominance (Eq. 2.7) and the luminance (Eq. 2.8):

$$O = \mathrm{up}(L_n[O_{CC}]) + \hat{O}_Y + H[O_{CC}] + H[O_Y] \qquad (2.9)$$

where we implicitly lift the quantities on the right-hand side into YCbCr vectors depending on the channels they contain.
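The per-block part of this reconstruction is a handful of array operations; a minimal sketch follows (ours, for illustration; the helpers and the tent-weight blending from the earlier sketches are assumed in scope):

```python
import numpy as np

def reconstruct_residual(res_in, R):
    """Eq. (2.5): L_n[O] = R (L_n[I] + 1) - 1."""
    return R * (res_in + 1.0) - 1.0

def scale_pyramid(levels_in_y, m):
    """Eq. (2.6): L_l[O_Y] = m_l L_l[I_Y] for the n detail levels."""
    return [m_l * L for m_l, L in zip(m, levels_in_y[:-1])]

def block_chroma(h_in, A_c, b_c):
    """Eq. (2.7): per-block affine remapping of the high-pass chrominance."""
    return h_in @ A_c.T + b_c

def block_luma(h_in, h_in_y, a_y, b_y, q, nodes):
    """Eq. (2.8): affine part plus the piecewise-linear tone terms; the
    nodes y_i come from the recipe fit on the degraded pair."""
    hinges = np.maximum(h_in_y[:, None] - nodes[None, :], 0.0)
    return h_in @ a_y + b_y + hinges @ q
```

The per-block outputs are accumulated with the tent weights and divided by the `blend_weights` map, the scaled pyramid of Equation (2.6) is collapsed into $\hat{O}_Y$, and Equation (2.9) sums the pieces.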
2.3 Data compression

Once computed by the server, we assemble the recipe coefficients described in Section 2.1 as a multi-channel image (Fig. 2-2). Each channel of the recipe is independently and uniformly quantized to 8 bits. In practice, the quantization error is small (less than 0.1 dB) and does not affect the quality of the reconstruction. We compress the coefficient maps using an off-the-shelf standard lossless image compression method. In our implementation, we save the lowpass residual as a 16-bit float TIFF file, and the highpass coefficients as tiles in an 8-bit grayscale PNG file. This compression benefits both from the spatial redundancy of the recipe and from the sparsity induced by the LASSO regression, and further reduces the data by about a factor of 2.

2.3.1 Upstream Compression

So far, we have only described how to cut down the data downloaded by the client using transform recipes. Because recipes capture the transformation applied to an image, they are resilient to degradations of the input (Fig. 1-2). We leverage this property and reduce the data uploaded to the server by sub-sampling and compressing the input image using a lossy compression algorithm with aggressive settings (we use JPEG). Upon receiving the data, the server upsamples the degraded image back to its original resolution and adds a small amount of Gaussian noise to it ($\sigma$ = 0.5% of the range, see Section 2.3.2). The server then processes the upsampled image to produce a low-quality output. The fitting procedure of Section 2.1 is applied to this low-quality input/output pair. The client in turn reconstructs the output image by combining its high-quality input with the recipe.

2.3.2 Input Pre-processing

Adding noise serves two purposes. Since the downsampling and JPEG compression both drastically reduce the high-frequency content of the input image, adding noise allows us to re-introduce some high frequencies from which we can estimate the transformation. Furthermore, the low quality of the input can make the fitting procedures ill-posed, which creates severe artifacts at reconstruction time (Fig. 2-3). The added noise acts as a regularizer.

Alternatively, we could apply the desired image enhancement directly to the low-resolution input, which would also lower the computational cost. This would require adapting the parameters of the enhancement accordingly, but in certain cases it is unclear how the low-resolution processing relates to the high-resolution processing. Besides, with this approach we cannot estimate the high-frequency part of the transformation. As shown in Figure 2-4, this approach fails for enhancements that have a non-trivial high-frequency component. We summarize the importance of upsampling the input and adding noise in Table 2.1 (a sketch of this server-side pre-processing follows below).

Table 2.1: Upsampling the degraded input and adding a small amount of noise is critical to the quality of our reconstruction. We report the average PSNR on the whole dataset for a 4x4 downsampling of the input and JPEG compression with quality Q = 50. The recipes use a block size w = 64, and both the multiscale and non-linear features are enabled.

                     with upsampling    without upsampling
    with noise       34.9 dB            28.7 dB
    without noise    33.5 dB            27.6 dB

Figure 2-3: Adding a small amount of noise to the upsampled degraded image makes the fitting process well-posed and enables our reconstruction (b, PSNR = 34.7 dB) to closely approximate the ground-truth output (a). Because of the downsampling and JPEG compression, the degraded input (not shown) exhibits large flat areas (in particular in the higher Laplacian pyramid levels). Without the added noise, the fitting procedure might use the corresponding recipe coefficients as affine offsets. This generates artifacts (c, PSNR = 28.1 dB) when reconstructing from the high-quality input (which does have high-frequency content).

Figure 2-4: In this close-up from a Detail Manipulation example, processing the downsampled input directly before fitting the recipe, without upsampling (c, PSNR = 24.2 dB), fails to capture the high frequencies of the transformation of the reference (a); our method (b) reaches PSNR = 34.7 dB. Errors are particularly visible in the eyes and hair.

Figure 2-5: We overlap the blocks of our recipes (b, PSNR = 42.1 dB) and linearly interpolate the reconstructed pixel values between overlapping blocks so as to avoid the visible blocking artifacts that appear without overlap (c, PSNR = 34.7 dB); (a) is the reference.
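The server-side pre-processing of Sections 2.3.1-2.3.2 amounts to a few lines; this sketch (our own, using Pillow for the JPEG round trip) applies the paper's "standard" degradation setting and the stated noise level of 0.5% of the range:

```python
import io
import numpy as np
from PIL import Image

def degrade(img, factor=2, quality=30):
    """Client side: bicubic downsampling plus aggressive JPEG compression
    (the 'standard' setting: 2x smaller in each dimension, Q = 30)."""
    small = img.resize((img.width // factor, img.height // factor),
                       Image.BICUBIC)
    buf = io.BytesIO()
    small.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf)

def server_preprocess(degraded, full_size, sigma=0.005 * 255):
    """Server side: upsample back to full resolution and add Gaussian noise
    (sigma = 0.5% of the range); the noise re-introduces high frequencies
    and regularizes the per-block fits."""
    up = degraded.resize(full_size, Image.BICUBIC)
    arr = np.asarray(up, dtype=np.float64)
    return np.clip(arr + np.random.normal(0.0, sigma, arr.shape), 0, 255)
```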
Chapter 3

Evaluation

We evaluate the quality of our reconstruction on several image processing applications (Section 3.1). We conduct stress tests by degrading the quality of the input to several degrees and analyze the effect of our design choices (Section 3.2). Latency and physical energy consumption are measured using a proof-of-concept implementation on an Android smartphone (Section 3.3). The full reconstruction results assessing the different quality settings and the impact of each feature in the pipeline are provided in Appendix A.

We express the compression performance as a fraction of the original, uncompressed 24-bit bitmap. Though an uncompressed bitmap would never be used in the context of cloud image processing, this provides us with a fixed reference for all our comparisons. A typical out-of-the-phone JPEG has a data ratio in the 10% to 15% range; a PNG image has a typical data ratio of 50% to 75% for natural color images. Besides visual comparison, we report both PSNR and SSIM to quantify the quality of our reconstruction.

3.1 Test enhancements

We selected a range of applications representative of typical photo editing scenarios. Some of these applications might not necessarily require cloud processing, but they demonstrate the expressiveness of our model. We gathered 168 high-quality test images from Flickr and the MIT-Adobe fiveK dataset [4] to use as input to the image processing algorithms. The image resolutions range between 2 and 8 megapixels. The MIT-Adobe fiveK dataset provides raw images that give us a baseline for reconstruction quality. From these high-quality inputs, we generated degraded inputs at several quality levels using a combination of bicubic downsampling (2 to 8 times smaller images in each dimension) and aggressive JPEG compression (with quality parameters ranging from 30 to 80, using Adobe Photoshop's quantization tables). This new set of images illustrates what a typical client would send for processing in the cloud using our technique. We ran each algorithm on this large set of inputs for different quality settings of the degraded input and measured end-to-end approximation fidelity.

Photo editing and Photoshop actions. We manually applied various image editing tools to the input images and recorded them as Photoshop Actions. The edits include tonal adjustments, color correction, non-trivial masking, spatially varying blurs, unsharp masking, high-pass boost, vignetting, white balance adjustments, and detail enhancement. The set of operations includes both global and local edits, and was created as a stress test for our model. These actions are an example of filters that would simply not be available on a mobile device.

Dehazing. This algorithm [13] estimates an atmospheric light vector from the input image, computes a transmission map and removes the corresponding haze. The transmission map, and therefore the transformation applied to the input image, exhibits sharp transitions, but they follow the input's content.

Edge-preserving detail enhancement. We test both the Local Laplacian Filter [17] and another edge-preserving decomposition [7]. These methods are challenging for our approach, which does not have any notion of edge preservation.

Style Transfer. This method [1] alters the distribution of gradients in an image to match a target distribution using the Local Laplacian filter. The algorithm outputs an image whose style matches that of the example image.

Portrait Style Transfer. This method [22] spatially matches the input image to a target example whose style is to be imitated, using dense correspondence. Because of the correspondence step and the database query, this algorithm is a good candidate for cloud processing. The multi-scale, spatially varying transfer is of interest to test our recipe. The algorithm also adds post-processing not well tied to our model (adding highlights in the eyes); we disabled this post-processing during our evaluation.

Time of Day Hallucination. This method [23] is an example of an algorithm that requires a large database and costly computer vision matching.

Colorization. This method [16] requires solving a large sparse linear system, which becomes costly on mobile platforms for large images. We input scribbles on images from the MIT5K dataset.

Matting. Matting is beyond the scope of our work, since the output image does not resemble a photograph. It is nonetheless useful in the photo editing context and is a good stress test for expressiveness. We use the implementation of KNN-matting available on the authors' webpage [5], and use as inputs the publicly available data from [21].

L0 smoothing. This method [30] is representative of methods where the stylized output is significantly different from the input. We include it as a failure case.

3.2 Expressiveness and robustness

We processed our entire dataset at various levels of compression and found that, in all cases except the most extreme settings, our approach produces outputs visually similar to the ground truth, i.e., the direct processing of the uncompressed input photo. Figures 3-3 and 3-4 show a few example reconstructions for the "standard" quality setting. We compare our technique to two baseline alternatives set to match our data rate:

- jpeg: the client sends a JPEG; the server processes it and transmits the output back as a JPEG image.

- jdiff: the client sends a JPEG input image. The server sends back the difference between this input and the output it computes from it. The difference is also JPEG compressed. The client adds the difference back to its local input (see the sketch below).
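The jdiff baseline can be summarized in a few lines (our sketch; shifting and halving the signed difference into [0, 255] before the JPEG round trip is an encoding choice of ours, as the text does not specify one):

```python
import io
import numpy as np
from PIL import Image

def jpeg_roundtrip(arr, quality=80):
    """Encode an array as JPEG and decode it; also return the payload size."""
    buf = io.BytesIO()
    Image.fromarray(arr.astype(np.uint8)).save(buf, format="JPEG",
                                               quality=quality)
    data = buf.getvalue()
    return np.asarray(Image.open(io.BytesIO(data)), dtype=np.float64), len(data)

def jdiff_server(degraded_in, degraded_out, quality=80):
    """Halve and shift the signed difference into [0, 255], then JPEG it."""
    diff = (degraded_out - degraded_in) / 2.0 + 128.0
    return jpeg_roundtrip(np.clip(diff, 0, 255), quality)

def jdiff_client(local_in, coded_diff):
    """Add the decoded difference back onto the client's local input."""
    return np.clip(local_in + 2.0 * (coded_diff - 128.0), 0, 255)
```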
As shown in Table 3.3, for all applications our approach performs as well as or better than these alternatives, often by a large margin. The complete and detailed performance report, with comparisons to this baseline, can be found in Appendix A. Figure 3-2 illustrates the robustness of our recipes for various compression ratios of the input and output. We recommend two levels of compression: "standard", which corresponds to about 1.5% of the data and generates images often indistinguishable from the ground truth, and "medium quality", which corresponds to about 0.5% of the data and only introduces minor deviations. In practice, one would choose the level of accuracy/compression depending on the application.

The extra features for the luminance channel are critical to the expressiveness of our model. Both the non-linear luminance curve and the multiscale features help capture subtle and local effects (Fig. 3-1). We quantify the improvement these features provide in Table 3.1. A detailed per-enhancement report can be found in Appendix A.

Table 3.1: De-activating either the luminance curve or the multiscale features decreases the reconstruction quality by around 2 dB. We report the average PSNR on the whole dataset for a non-degraded input. The recipes use a block size w = 64.

                            with multiscale    without multiscale
    with luminance curve    40.2 dB            38.5 dB
    without                 38.3 dB            36.6 dB

Figure 3-1: The additional features used to model the transformation of the luminance are critical to capture subtle local effects of the ground-truth output (a). (b) Using only the affine terms (PSNR = 32.1 dB), most of the effect on high frequencies is lost on the wall and the wooden shutter. (c) Adding the Laplacian pyramid coefficients (PSNR = 33.6 dB), the high-frequency detail is more faithfully captured on the wall. (d) With both the non-linear and multiscale terms (PSNR = 36.4 dB), our model better captures local contrast variations. This is particularly visible on the texture of the wooden shutter.
non-degraded input, w = 32
                                    %down                  PSNR (dB)              SSIM
enhancement           #    %up   ours  jpeg  jdiff    ours  jpeg  jdiff    ours  jpeg  jdiff
Local Laplacian      20   71.7    1.7   9.1   7.3     38.1  35.6  36.6     0.97  0.94  0.94
Dehazing             20   51.4    1.4   9.8   7.6     42.6  39.0  41.8     0.98  0.95  0.97
Detail Manipulation  20   71.7    1.6   5.6   4.2     39.0  37.1  41.7     0.98  0.94  0.98
L0                   12   62.5    2.0   3.7   7.8     36.7  42.6  35.4     0.95  0.98  0.84
Matting              12   41.7    1.0   3.1   7.2     37.1  48.3  42.7     0.97  1.00  0.96
Photoshop            10   68.3    1.4  11.9   9.4     44.3  42.4  44.8     0.99  0.98  0.98
Recoloring           10   33.4    0.8   3.2   1.1     49.3  42.5  47.7     1.00  0.96  0.98
Portrait Transfer    20   31.5    1.1   3.9   2.5     42.9  41.6  41.3     0.99  0.98  0.97
Time of Day          24   47.5    1.8  14.0  12.4     42.7  39.9  40.8     0.98  0.98  0.97
Style Transfer       20   76.8    1.1  15.3  12.4     37.3  43.4  42.0     0.97  0.99  0.98
all                 168   55.6    1.4   7.9   7.2     41.0  41.2  41.5     0.98  0.97  0.96

standard: 2x downsampling, Q = 30, w = 32
                                    %down                  PSNR (dB)              SSIM
enhancement           #    %up   ours  jpeg  jdiff    ours  jpeg  jdiff    ours  jpeg  jdiff
Local Laplacian      20    0.8    1.8   2.4   2.1     37.0  26.0  26.7     0.97  0.67  0.70
Dehazing             20    0.9    1.4   2.1   1.7     36.1  28.0  28.9     0.97  0.75  0.79
Detail Manipulation  20    0.8    1.7   1.9   1.7     36.8  27.9  28.8     0.98  0.81  0.85
L0                   12    1.8    2.1   1.8   2.9     35.8  34.7  33.0     0.94  0.95  0.89
Matting              12    0.8    0.9   1.0   2.5     35.2  32.0  32.1     0.96  0.97  0.92
Photoshop            10    0.8    1.5   2.1   1.9     42.6  29.6  30.4     0.99  0.80  0.82
Recoloring           10    0.7    1.4   1.0   0.4     47.3  38.2  39.4     0.99  0.92  0.94
Portrait Transfer    20    0.8    1.1   1.3   1.1     41.1  32.9  33.8     0.99  0.89  0.90
Time of Day          24    1.5    1.8   2.9   2.9     38.6  26.7  27.3     0.97  0.75  0.78
Style Transfer       20    1.0    1.1   2.8   2.3     34.9  25.6  26.0     0.96  0.67  0.70
all                 168    1.0    1.5   1.9   2.0     38.5  30.2  30.6     0.97  0.82  0.83

Table 3.2: Reconstruction results per enhancement. %up refers to the compression of the input as a fraction of the uncompressed bitmap data, %down to that of the recipe (or of the output in the case of jpeg and jdiff). For an uncompressed input, our method matches the quality of a jpeg-compressed image. The benefits of our approach become more apparent as the input image is further compressed. In the "standard" setting, the input is downsampled by a factor of 2 in each dimension and JPEG compressed with quality Q = 30; the block size for the recipe is w = 32. The jpeg and jdiff methods are given the same input and use Q = 80 for the output compression. The L0 enhancement is a failure case for our method.
medium: 4x downsampling, Q = 50, w = 64
                                    %down                  PSNR (dB)              SSIM
enhancement           #    %up   ours  jpeg  jdiff    ours  jpeg  jdiff    ours  jpeg  jdiff
Local Laplacian      20    0.3    0.5   0.7   0.6     34.6  23.3  23.6     0.96  0.55  0.56
Dehazing             20    0.3    0.4   0.6   0.5     34.3  26.1  26.7     0.95  0.68  0.70
Detail Manipulation  20    0.3    0.5   0.6   0.5     33.9  24.3  24.7     0.97  0.70  0.72
L0                   12    0.7    0.6   0.6   0.9     32.5  29.9  29.6     0.90  0.90  0.88
Matting              12    0.3    0.4   0.3   0.9     32.9  28.9  29.0     0.94  0.95  0.91
Photoshop            10    0.3    0.4   0.6   0.6     40.0  26.9  27.3     0.98  0.74  0.75
Recoloring           10    0.3    0.4   0.4   0.1     44.6  35.6  36.3     0.99  0.89  0.91
Portrait Transfer    20    0.4    0.4   0.4   0.4     36.7  30.1  30.6     0.97  0.83  0.83
Time of Day          24    0.6    0.6   0.9   0.9     35.8  24.8  25.1     0.95  0.66  0.68
Style Transfer       20    0.4    0.3   0.8   0.7     32.8  23.0  23.2     0.94  0.56  0.57
all                 168    0.4    0.5   0.6   0.6     35.8  27.3  27.6     0.96  0.75  0.75

low: 8x downsampling, Q = 80, w = 128
                                    %down                  PSNR (dB)              SSIM
enhancement           #    %up   ours  jpeg  jdiff    ours  jpeg  jdiff    ours  jpeg  jdiff
Local Laplacian      20    0.1    0.2   0.2   0.2     32.0  21.3  21.4     0.95  0.45  0.46
Dehazing             20    0.1    0.1   0.2   0.2     31.5  24.3  24.7     0.93  0.63  0.64
Detail Manipulation  20    0.1    0.1   0.2   0.2     31.0  21.8  22.1     0.95  0.63  0.64
L0                   12    0.3    0.2   0.2   0.3     30.1  26.5  26.7     0.87  0.85  0.85
Matting              12    0.1    0.2   0.1   0.3     30.2  25.9  26.0     0.92  0.94  0.90
Photoshop            10    0.1    0.1   0.2   0.2     36.2  25.0  25.3     0.97  0.71  0.72
Recoloring           10    0.1    0.1   0.1   0.0     40.9  32.6  32.9     0.99  0.87  0.87
Portrait Transfer    20    0.2    0.1   0.2   0.1     31.8  27.6  27.8     0.94  0.78  0.78
Time of Day          24    0.3    0.2   0.3   0.3     33.5  23.3  23.4     0.93  0.61  0.61
Style Transfer       20    0.2    0.1   0.2   0.2     30.3  20.8  20.9     0.92  0.46  0.47
all                 168    0.2    0.1   0.2   0.2     32.7  24.9  25.1     0.94  0.69  0.69

Table 3.3: Reconstruction results per enhancement. %up refers to the compression of the input as a fraction of the uncompressed bitmap data, %down to that of the recipe (or of the output in the case of jpeg and jdiff). The "medium" setting is a 4x downsampling with Q = 50 and w = 64. The "low" quality setting is an 8x downsampling of the input with Q = 80 and w = 128. The jpeg and jdiff methods are given the same input and use Q = 80 for the output compression.

[Figure 3-2 here: plot of PSNR as a function of the total data ratio (%up + %down) for our method at %up = 1.0, 0.5 and 0.1, compared to the jpeg and jdiff baselines.]

Figure 3-2: We analyze the quality of our reconstruction for different settings of our recipe and various degrees of input compression. Not surprisingly, the reconstruction quality decreases as input and output compression increase. We report the average PSNR on the whole dataset.

3.3 Prototype system

We implemented a client-server proof-of-concept system to validate runtime and energy consumption. We use a Samsung Galaxy S5 as the client device. Three processing scenarios are tested: on-device processing; naive cloud processing, where we transfer the input and output compressed as JPEG with the default settings from the cellphone; and our recipe-based processing. We consider network connections through WiFi and a cellular LTE network. Our server runs an Intel Xeon E5645 CPU, using 12 cores. For on-device processing, we use a Halide-optimized implementation [20, 19] of the test filters, accessed through the Java Native Interface on Android. We evaluate the Local Laplacian filter [17], with fifty pyramids, for detail enhancement on an 8-megapixel input.

To ensure accurate power measurements, we replace the cellphone battery with a power generator and record the current and voltage drained from the generator. The power usage plot in Fig. 3-5 shows the energy consumption at each step for the three scenarios on the Local Laplacian filter. By reducing the amount of data transferred both upstream and downstream, our method greatly reduces transmission costs and cuts down both end-to-end computation time and power usage (Table 3.4).
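Schematically, the client logic of the prototype ties the earlier sketches together (our pseudo-driver: `server.process` and `apply_recipe` are hypothetical stand-ins for the RPC wrapper and the reconstruction of Section 2.2, not APIs from the thesis):

```python
def cloud_enhance(client_img, server):
    """Client side, schematically: upload a degraded proxy, download a
    compact recipe, reconstruct the output at full quality locally."""
    degraded = degrade(client_img, factor=2, quality=30)  # small upload
    recipe = server.process(degraded)                     # recipe fit remotely
    return apply_recipe(client_img, recipe)               # local reconstruction
```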
[Figure 3-3 here: panels for Dehazing, Detail Manipulation, Local Laplacian and Matting; each row is annotated with its PSNR, %up and %down (e.g., Local Laplacian: PSNR = 32.4 dB, %up = 0.4, %down = 1.9; Matting: PSNR = 37.7 dB, %up = 0.3, %down = 0.7).]

Figure 3-3: Our method handles a large variety of photographic enhancements. We filter a highly degraded copy (not shown) of the reference input (a). From the resulting degraded output, we compute the recipe parameters. We then reconstruct an output image (c) that closely approximates the reference output (b) computed from the original high-quality input. (d) and (e) are close-ups on the region of highest error in our reconstruction. (f) is a rescaled difference map that emphasizes the location of our errors. As shown on the L0 smoothing example, our method cannot handle well filters that significantly alter the structure of the input.

[Figure 3-4 here: panels for Portrait Transfer (PSNR = 38.5 dB), Recoloring (PSNR = 48.7 dB), Style Transfer (PSNR = 34.6 dB), Time of Day (PSNR = 37.2 dB) and Photoshop (PSNR = 46.1 dB), with %up and %down annotated per row.]

Figure 3-4: Additional examples. We filter a highly degraded copy (not shown) of the reference input (a). From the resulting degraded output, we compute the recipe parameters. We then reconstruct an output image (c) that closely approximates the reference output (b) computed from the original high-quality input. (d) and (e) are close-ups on the region of highest error in our reconstruction. (f) is a rescaled difference map that emphasizes the location of our errors.

                     Local Laplacian                     Style Transfer
                     energy          time                energy          time
WiFi   recipe        10.6 ± 2.4 J    4.9 ± 1.9 s         7.9 ± 0.6 J     3.9 ± 0.3 s
WiFi   jpeg          30.6 ± 1.5 J    14.7 ± 0.5 s        20.7 ± 0.7 J    12.4 ± 2.2 s
LTE    recipe        14.6 ± 1.2 J    9.2 ± 0.2 s         23.2 ± 0.7 J    11.4 ± 1.5 s
LTE    jpeg          44.1 ± 2.4 J    23.6 ± 0.9 s        30.5 ± 1.5 J    17.3 ± 0.7 s
local                135.6 ± 1.2 J   30.6 ± 0.1 s        23.2 ± 1.3 J    7.0 ± 0.3 s

Table 3.4: We compare the end-to-end latency and energy cost of using our recipes for cloud processing to local processing of the enhancement and to the jpeg baseline. Our method always saves energy and time compared to the baseline jpeg method. The measurements are from a Samsung Galaxy S5 processing an 8-megapixel image and are averaged over 20 runs (we also report one standard deviation).

[Figure 3-5 here: power (W) versus time (s), lower is better; each curve is annotated with its upload, idle, download and reconstruction phases. Totals: our approach 14.3 J in 5.8 s; jpeg transfer 33.1 J in 17.1 s; local 22.1 J in 6.6 s.]

Figure 3-5: Plot of the power consumption over time of three computation schemes: purely local, JPEG transfer, and our approach. For this measurement, we used a Samsung Galaxy S3 connected to a 3G network; the test application is Local Laplacian Filters and the input is a 4-megapixel image. Our approach uses less energy and is faster than the other two.

Chapter 4

Conclusion

Cloud image enhancement can only be made more efficient if the power and time cost of data transfers is minimized.
In the case of algorithms that preserve the content of an image and do not spatially warp it, we show that we can exploit the similarity between the input and the output. We introduce transform recipes, a flexible representation that captures how images are locally modified by an enhancement. Recipes dramatically reduce the bandwidth required for cloud computation: they are very compact, they can be fit on the server from a highly-degraded input-output pair, and the client applies them to its high-quality input. Finally, we demonstrate that transform recipes can support a wide range of algorithms and, in practical scenarios, lead to reduced power and time costs.

Our method assumes a good correlation between the input and the output, and no spatial transformation. This excludes operations such as barrel distortion correction and perspective rectification. A parametric warp could be added, but we chose to focus on stylistic transformations. Our method also cannot handle significant addition of content, such as inpainting techniques or clone brushing. Overall, the development of good recipes is a trade-off between expressiveness, compactness and reconstruction costs. In some scenarios, enhancement-specific recipes might provide significant gains.

Appendix A

Recipe Parameters

We provide detailed quality comparisons to evaluate our design choices. In particular, we motivate the use of block overlapping, non-linear luminance features and multiscale decomposition. Table A.1 reports the per-enhancement PSNR and SSIM for recipes with w = 32 and no input compression. Table A.2 summarizes the metrics for recipes with w = 32 and an input compressed to %up = 1.0 (the standard quality level). Table A.3 is for recipes with w = 64 and an input compressed to %up = 0.4 (the medium quality level). We illustrate the impact of the number of linear segments of the luminance tone curve in Figure A-2. The impact of w is shown in Figure A-1.
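To make the non-linear luminance feature concrete, here is a minimal sketch of fitting a piecewise-linear tone curve with p segments by least squares, using a hat-function (linear B-spline) basis. The function names and this particular parameterization are ours for illustration, not the exact formulation of Chapter 2.

```python
import numpy as np

def hat_basis(x, p):
    """Evaluate p + 1 hat functions with knots uniformly spaced on [0, 1].
    A weighted sum of these is a piecewise-linear curve with p segments."""
    knots = np.linspace(0.0, 1.0, p + 1)
    return np.maximum(0.0, 1.0 - np.abs(x[:, None] - knots[None, :]) * p)

def fit_tone_curve(lum_in, lum_out, p=6):
    """Least-squares fit of a piecewise-linear curve mapping input luminance
    to output luminance; both arrays hold values in [0, 1]."""
    B = hat_basis(lum_in.ravel(), p)                 # (n, p + 1) design matrix
    coeffs, *_ = np.linalg.lstsq(B, lum_out.ravel(), rcond=None)
    return coeffs                                    # p + 1 numbers per curve

def apply_tone_curve(lum, coeffs):
    """Evaluate the fitted curve on new luminance values."""
    p = coeffs.size - 1
    return (hat_basis(lum.ravel(), p) @ coeffs).reshape(lum.shape)
```

A curve costs only p + 1 coefficients, which is why this feature is cheap to transmit; Figure A-2 motivates our choice of p = 6.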
[Table A.1: per-enhancement %down, PSNR and SSIM for the eight on/off combinations of the block overlap, multiscale and non-linear features; averaged over all enhancements, PSNR rises from 36.9 dB with none of the features to 41.0 dB with all three enabled.]

Table A.1: Per-enhancement PSNR and SSIM for recipes with w = 32 and no input compression. Block overlap allows us to linearly interpolate between overlapping blocks and avoids blocking artifacts; it roughly multiplies the data ratio by 4. Multiscale refers to the use of several pyramid levels to represent the high-pass of the luminance channel; this allows us to capture subtle multiscale effects. Non-linear refers to the use of a piecewise-linear tone curve in the representation of the high-pass luminance; this allows us to better capture tonal variations.
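The block overlap feature corresponds to the blending step sketched below: blocks are laid out with 50% overlap, and each pixel's reconstruction is a weighted average of the transformed blocks that cover it, with weights falling off linearly towards block edges. The separable tent window and half-block stride shown here are our illustrative choices; `transform_block` stands for any per-block recipe evaluation.

```python
import numpy as np

def blend_overlapping_blocks(image, transform_block, w=64):
    """Apply a per-block transform on blocks with 50% overlap and blend the
    results with a separable tent window, removing blocking artifacts.
    `transform_block(block, by, bx)` returns the transformed block."""
    h, width = image.shape[:2]
    trailing = (1,) * (image.ndim - 2)               # broadcast over channels
    acc = np.zeros(image.shape, dtype=np.float64)
    wsum = np.zeros((h, width) + trailing)
    # Tent weights, strictly positive so every pixel keeps a valid normalizer.
    ramp = 1.0 - np.abs(np.arange(w) - (w - 1) / 2.0) / (w / 2.0)
    for by in range(0, h, w // 2):
        for bx in range(0, width, w // 2):
            block = image[by:by + w, bx:bx + w]
            win = np.outer(ramp[:block.shape[0]], ramp[:block.shape[1]])
            win = win.reshape(win.shape + trailing)
            acc[by:by + w, bx:bx + w] += win * transform_block(block, by, bx)
            wsum[by:by + w, bx:bx + w] += win
    return acc / wsum                                # normalize the blend
```

With a half-block stride, each pixel is covered by up to four blocks, which is consistent with the roughly 4x data-ratio increase noted in the table captions.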
[Table A.2: per-enhancement %down, PSNR and SSIM for the eight on/off combinations of the block overlap, multiscale and non-linear features; averaged over all enhancements, PSNR rises from 35.4 dB with none of the features to 38.5 dB with all three enabled, and SSIM from 0.95 to 0.97.]

Table A.2: Per-enhancement PSNR and SSIM for recipes with w = 32 and an input compressed to %up = 1.0. Block overlap allows us to linearly interpolate between overlapping blocks and avoids blocking artifacts; it roughly multiplies the data ratio by 4. Multiscale refers to the use of several pyramid levels to represent the high-pass of the luminance channel; this allows us to capture subtle multiscale effects. Non-linear refers to the use of a piecewise-linear tone curve in the representation of the high-pass luminance; this allows us to better capture tonal variations.
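To make the multiscale feature concrete, the sketch below decomposes a luminance channel into a small Laplacian pyramid in the spirit of Burt and Adelson [3]. For brevity it uses 2x box down/upsampling instead of Gaussian filtering, and the even-size cropping is an implementation shortcut of ours.

```python
import numpy as np

def laplacian_pyramid(lum, levels=3):
    """Split a 2-D luminance array into `levels` band-pass images plus a
    low-pass residual, using 2x box down/upsampling for simplicity."""
    bands, current = [], lum.astype(np.float64)
    for _ in range(levels):
        h2 = current.shape[0] // 2 * 2               # crop to even size
        w2 = current.shape[1] // 2 * 2
        down = current[:h2, :w2].reshape(h2 // 2, 2, w2 // 2, 2).mean(axis=(1, 3))
        band = current.copy()
        band[:h2, :w2] -= np.repeat(np.repeat(down, 2, axis=0), 2, axis=1)
        bands.append(band)                           # detail lost at this scale
        current = down
    bands.append(current)                            # low-pass residual
    return bands

def collapse_pyramid(bands):
    """Exact inverse of laplacian_pyramid."""
    current = bands[-1]
    for band in reversed(bands[:-1]):
        up = np.repeat(np.repeat(current, 2, axis=0), 2, axis=1)
        out = band.copy()
        out[:up.shape[0], :up.shape[1]] += up
        current = out
    return current
```

The recipe represents the luminance high-pass with a few such levels; that choice is what the multiscale column in Tables A.1 to A.3 toggles.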
[Table A.3: per-enhancement %down, PSNR and SSIM for the eight on/off combinations of the block overlap, multiscale and non-linear features; with all three features enabled, the average PSNR is 35.8 dB and the average SSIM 0.96.]

Table A.3: Per-enhancement PSNR and SSIM for recipes with w = 64 and an input compressed to %up = 0.4. Block overlap allows us to linearly interpolate between overlapping blocks and avoids blocking artifacts; it roughly multiplies the data ratio by 4. Multiscale refers to the use of several pyramid levels to represent the high-pass of the luminance channel; this allows us to capture subtle multiscale effects. Non-linear refers to the use of a piecewise-linear tone curve in the representation of the high-pass luminance; this allows us to better capture tonal variations.

[Figure A-1: average PSNR as a function of the block size w in {16, 32, 64, 128, 256}, one curve per enhancement.]

Figure A-1: The block size w controls the quality-compactness trade-off of transform recipes. Small blocks yield a fine-grained description of the transformation but require more storage space, whereas large blocks form a coarser and more concise description. With larger blocks, the recipe loses some spatial accuracy. In practice, we recommend values of w between 32 and 128. These results are for an uncompressed input.

[Figure A-2: average PSNR as a function of the number of luminance bins, from 0 to 10, one curve per enhancement.]

Figure A-2: The number of linear segments p in the piecewise-linear tone curve has a different impact depending on the application. For instance, Recoloring does not transform the luminance channel of the input and therefore does not need more precision here. We settled on p = 6 in all our experiments.

Bibliography

[1] Mathieu Aubry, Sylvain Paris, Samuel W. Hasinoff, Jan Kautz, and Frédo Durand. Fast local laplacian filters: Theory and applications. ACM Trans.
Graph., 33(5):167:1-167:14, September 2014.

[2] Kenneth C. Barr and Krste Asanović. Energy-aware lossless data compression. ACM Trans. Comput. Syst., 24(3):250-291, August 2006.

[3] Peter J. Burt and Edward H. Adelson. The laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31(4):532-540, 1983.

[4] Vladimir Bychkovsky, Sylvain Paris, Eric Chan, and Frédo Durand. Learning photographic global tonal adjustment with a database of input/output image pairs. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.

[5] Qifeng Chen, Dingzeyu Li, and Chi-Keung Tang. KNN matting. In IEEE Conference on Computer Vision and Pattern Recognition, pages 869-876, June 2012.

[6] P. Deutsch. Deflate compressed data format specification version 1.3, 1996.

[7] Zeev Farbman, Raanan Fattal, Dani Lischinski, and Richard Szeliski. Edge-preserving decompositions for multi-scale tone and detail manipulation. In ACM Transactions on Graphics (SIGGRAPH), SIGGRAPH '08, pages 67:1-67:10, New York, NY, USA, 2008. ACM.

[8] William T. Freeman and Antonio Torralba. Shape recipes: Scene representations that refer to the image. In Vision Sciences Society Annual Meeting, pages 25-47. MIT Press, 2002.

[9] Eric Hamilton. JPEG file interchange format. C-Cube Microsystems, 1992.

[10] Junxian Huang, Feng Qian, Alexandre Gerber, Z. Morley Mao, Subhabrata Sen, and Oliver Spatscheck. A close examination of performance and power characteristics of 4G LTE networks, 2012.

[11] D. A. Huffman. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098-1101, September 1952.

[12] Liad Kaufman, Dani Lischinski, and Michael Werman. Content-aware automatic photo enhancement. Computer Graphics Forum, 31(8):2528-2540, 2012.

[13] Jin-Hwan Kim, Won-Dong Jang, Jae-Young Sim, and Chang-Su Kim. Optimized contrast enhancement for real-time image and video dehazing. J. Vis. Commun. Image Represent., 24(3):410-425, April 2013.

[14] Karthik Kumar, Jibang Liu, Yung-Hsiang Lu, and Bharat Bhargava. A survey of computation offloading for mobile systems. Mob. Netw. Appl., 18(1):129-140, February 2013.

[15] Pierre-Yves Laffont, Zhile Ren, Xiaofeng Tao, Chao Qian, and James Hays. Transient attributes for high-level understanding and editing of outdoor scenes. ACM Transactions on Graphics (SIGGRAPH), 33(4):149:1-149:11, July 2014.

[16] Anat Levin, Dani Lischinski, and Yair Weiss. Colorization using optimization. ACM Transactions on Graphics (SIGGRAPH), 23(3):689-694, August 2004.

[17] Sylvain Paris, Samuel W. Hasinoff, and Jan Kautz. Local laplacian filters: Edge-aware image processing with a laplacian pyramid. In ACM Transactions on Graphics (SIGGRAPH), SIGGRAPH '11, pages 68:1-68:12, New York, NY, USA, 2011. ACM.

[18] Majid Rabbani and Paul W. Jones. Digital Image Compression Techniques. Society of Photo-Optical Instrumentation Engineers (SPIE), Bellingham, WA, USA, 1st edition, 1991.

[19] Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy, Saman Amarasinghe, and Frédo Durand. Decoupling algorithms from schedules for easy optimization of image processing pipelines. ACM Transactions on Graphics, 31(4):32:1-32:12, July 2012.

[20] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines.
In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '13, pages 519-530, New York, NY, USA, 2013. ACM.

[21] C. Rhemann, C. Rother, Jue Wang, M. Gelautz, P. Kohli, and P. Rott. A perceptually motivated online benchmark for image matting. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1826-1833, June 2009.

[22] YiChang Shih, Sylvain Paris, Connelly Barnes, William T. Freeman, and Frédo Durand. Style transfer for headshot portraits. ACM Transactions on Graphics (SIGGRAPH), 33(4):148:1-148:14, July 2014.

[23] YiChang Shih, Sylvain Paris, Frédo Durand, and William T. Freeman. Data-driven hallucination of different times of day from a single outdoor photo. ACM Transactions on Graphics (SIGGRAPH), 32(6):200:1-200:11, November 2013.

[24] Athanassios Skodras, Charilaos Christopoulos, and Touradj Ebrahimi. The JPEG 2000 still image compression standard. IEEE Signal Processing Magazine, 18(5):36-58, 2001.

[25] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.

[26] A. Torralba and W. T. Freeman. Properties and applications of shape recipes. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages II-383-390, June 2003.

[27] G. K. Wallace. The JPEG still picture compression standard. IEEE Transactions on Consumer Electronics, 38(1):xviii-xxxiv, February 1992.

[28] T. A. Welch. A technique for high-performance data compression. Computer, 17(6):8-19, June 1984.

[29] Ian H. Witten, Radford M. Neal, and John G. Cleary. Arithmetic coding for data compression. Communications of the ACM, 30(6):520-540, 1987.

[30] Li Xu, Cewu Lu, Yi Xu, and Jiaya Jia. Image smoothing via L0 gradient minimization. ACM Transactions on Graphics (SIGGRAPH), 30(6):174:1-174:12, December 2011.

[31] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory, 23(3):337-343, May 1977.