Efficient Cloud Photo Enhancement using 7

advertisement
Efficient Cloud Photo Enhancement
using
Compact Transform Recipes
ARCHNES
MASSACHUSETTS INSTITUTE
OF TECH NOLOLGY
JUL 0 7 2015
by
LIBRARIES
Michael Gharbi
M.S. in Applied Mathematics, Ecole Polytechnique (2013)
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Science in Electrical Engineering and Computer Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2015
@ Massachusetts Institute of Technology 2015. All rights reserved.
Author ............
Signature redacted
Department
f
Certified by.........Signature
ring and Computer Science
May 19, 2015
redacted...............
Fredo Durand
Professor of Electrical Engineering and Computer Science
Thesis Supervisor
Accepted by..............
Signature redacted
/6U"eslie A. Kolodziejski
Chairman, Department Committee on Graduate Students
2
Efficient Cloud Photo Enhancement
using
Compact Transform Recipes
by
Michael Gharbi
Submitted to the Department of Electrical Engineering and Computer Science
on May 19, 2015, in partial fulfillment of the
requirements for the degree of
Master of Science in Electrical Engineering and Computer Science
Abstract
Cloud image processing is often proposed as a solution to the limited computing
power and battery life of mobile devices: it allows complex algorithms to run on
powerful servers with virtually unlimited energy supply. Unfortunately, this idyllic
picture overlooks the cost of uploading the input and downloading the output images, which can be significantly higher, in both time and energy, than that of local
computation. For most practical scenarios, the compression ratios required to make
cloud image processing energy and time efficient would degrade an image so severely
as to make the result unacceptable. Our work focuses on image enhancements that
preserve the overall content of an image. Our key insight is that, in this case, the
server can compute and transmit a description of the transformationfrom input to
output, which we call a transform recipe. At equivalent quality, recipes are much
more compact than jpeg images: this reduces the data to download. Furthermore, we
can fit the parameters of a recipe from a highly-degraded input-output pair, which
decreases the data to upload. The client reconstructs an approximation of the true
output by applying the recipe to its local high quality input. We show on a dataset
of 168 images spanning 10 applications, that recipes can handle a diverse set of image
enhancements and, for the same amount of transmitted data, provide much higher
fidelity compared to jpeg-compressing both the input and output. We further demonstrate their usefulness in a practical proof of concept system with which we measure
energy consumption and latency for both local and cloud computation.
Thesis Supervisor: Fredo Durand
Title: Professor of Electrical Engineering and Computer Science
3
4
Acknowledgments
First and foremost, I would like to thank my advisor Fredo Durand whose guidance
and insights proved invaluable in navigating research in Computer Science. Fr6do
always gave me the freedom to explore my interests and sharpen my abilities. His
enthusiasm kept me going whenever uncertainty, frustration or discouragement arose.
I thank Sylvain Paris who has been a precious mentor and an inspiration since I
met him. His optimism and knowledge pushed me to seek new challenges. Thanks to
my collaborators Gaurav, Jonathan and Yichang, who made this work possible.
Thanks to the people of the MIT Graphics Group: you are the very reason this
environment is so uniquely welcoming and stimulating. In particular, I thank my
two office-mates Tiam, and Abe for reminding me that life is not so serious after all.
Thanks to Bryt for keeping the soul of this place alive through generations of grad
students.
Finally I thank my family and friends for their unconditional support and the
interest they always put in both my work and my personal development.
Thank you Robi for elevating my mind and keeping me on track with the pursuit
of happiness.
5
6
Contents
1
2
Introduction
19
1.1
21
Reducing Data Transfers
2.1
3
Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Transform Recipes .......
25
............................
25
2.1.1
R esidual . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
27
2.1.2
High Frequencies . . . . . . . . . . . . . . . . . . . . . . . . .
27
2.1.3
Chrominance channels . . . . . . . . . . . . . . . . . . . . . .
28
2.1.4
Luminance channel . . . . . . . . . . . . . . . . . . . . . . . .
28
2.2
Reconstruction
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
29
2.3
Data com pression . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
31
2.3.1
Upstream Compression . . . . . . . . . . . . . . . . . . . . . .
31
2.3.2
Input Pre-processing . . . . . . . . . . . . . . . . . . . . . . .
32
Evaluation
35
3.1
Test enhancements . . . . . . . . . . . . . . . . . . . . . . . . . . . .
35
3.2
Expressiveness and robustness . . . . . . . . . . . . . . . . . . . . . .
38
3.3
Prototype system . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
42
4 Conclusion
47
A Recipe Parameters
49
7
8
List of Figures
1-1
Cloud computing is often thought as an ideal solution to enable complex algorithms on mobile devices; by exploiting remote resources,
one expects to reduce running time and energy consumption.
How-
ever, this is ignoring the cost of transferring data over the network.
When accounted for, this overhead often takes away the benefits of the
cloud, especially for data-heavy photo applications. We introduce a
new photo processing pipeline that reduces the amount of transmitted
data. The core of our approach is the transform recipe, a represehtation of the image transformation applied to a photo that is compact
and can be accurately estimated from an aggressively compressed input photo. These properties make cloud computing more efficient in
situations where transferring standard jpeg-compressed images would
be too costly in terms of energy and/or time. In the example above
where we used our most aggressive setting, our approach reduces the
amount of transferred data by about 80x compared to using images
compressed with standard jpeg settings while producing an output visually similar to the ground-truth result computed directly from the
original uncompressed photo.
. . . . . . . . . . . . . . . . . . . . . .
9
19
1-2
A major advantage of our transform recipes is that we can evaluate
them using severely degraded input photos (a,d). Processing such a
strongly compressed image generates a result with numerous jpeg artifacts (b,e). We can nevertheless use such pairs of degraded inputoutput photos to build a recipe that, when applied on the original
artifact-free photo (c) produces a high quality output (f) that closely
approximates the reference output (g) directly computed from the original photo. The close-ups (c-g) show the region with highest error in
our method (f).
2-1
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
24
To reconstruct the output image we decompose the input as a shallow
Laplacian pyramid.
The lowpass residual is linearly re-scaled using
the ratio coefficients Rc(p). For the chrominance, we don't use the
complete multiscale decomposition: each block in the high-pass of the
chrominance undergoes and affine transformation parameterized by Ac
and b,. The luminance channel is reconstructed using a combination
of an affine transformation (Ay, by), a piecewise linear transformation
({qi}), and a linear scaling of the Laplacian pyramid level ({me}).
2-2
.
26
Recipe coefficients computed for the photo in Figure 1-2. We remapped
the values to the [0; 1] range for display purposes. (a) lowpass residual
R,. (b,c) affine coefficients of the chrominance A, and b,. (d) affine
coefficients of the luminance Ay and by.
{me}. (f) non linear coefficients {qi}.
10
(e) multiscale coefficients
. . . . . . . . . . . . . . . . .
32
2-3
Adding a small amount of noise to the upsampled degraded image
makes the fitting process well-posed and enables our reconstruction
(b) to closely approximate the ground-truth output (a). Because of
the downsampling and JPEG compression, the degraded input (not
shown) exhibits large flat areas (in particular in the higher Laplacian
pyramid levels). Without the added noise, the fitting procedure might
use the corresponding recipe coefficients as affine offsets. This generates artifacts (c) when reconstructing from the high quality input
(which does have high frequency content).
. . . . . . . . . . . . . . .
33
2-4 In this close-up from a Detail Manipulation example, processing directly the downsampled input before fitting the recipe fails to capture
the high frequencies of the transformation. Errors are particularly visible in the eyes and hair.
2-5
. . . . . . . . . . . . . . . . . . . . . . . .
34
We overlap the blocks of our recipes (b) and linearly interpolate the
reconstructed pixel values between overlapping blocks so as to avoid
visible blocking artifacts (c).
3-1
. . . . . . . . . . . . . . . . . . . . . .
34
The additional features to model the transformation of the luminance
are critical to capture subtle local effects of the ground truth output
(a). (b) Using only the affine terms, most of the effect on high frequencies is lost on the wall, and the wooden shutter. (c) Adding the
Laplacian pyramid coefficients, the high frequency detail is more faithfully captured on the wall. (d) With both the non-linear and multiscale
terms, our model better captures local contrast variations. This is particularly visible on the texture of the wooden shutter.
3-2
. . . . . . . .
39
We analyze the quality of our reconstruction for different settings of
our recipe and various degrees of input compression. Not suprisingly
the reconstruction quality decreases as input and output compression
increase. We report the average PSNR on the whole dataset.
11
. . . .
42
3-3
Our method handles a large variety of photographic enhancements. We
filter a highly degraded copy (not shown) of the reference input (a).
From the resulting degraded output, we compute the recipe parameters. We then reconstruct an output image (c) that closely approximates the reference output (b) computed from the original high-quality
input. (d) and (e) are a close-up on the region of highest error in our
reconstruction.
(f) is a rescaled difference map that emphasizes the
location of our errors. As shown on the Lo smoothing example, our
method cannot handle well filters that significantly alter the structure
of the input. .......
3-4
43
...............................
Additional examples. We filter a highly degraded copy (not shown)
of the reference input (a). From the resulting degraded output, we
compute the recipe parameters. We then reconstruct an output image
(c) that closely approximates the reference output (b) computed from
the original high-quality input. (d) and (e) are a close-up on the region
of highest error in our reconstruction. (f) is a rescaled difference map
that emphasizes the location of our errors. . . . . . . . . . . . . . . .
3-5
44
Plot of the power consumption over time of three computation schemes:
purely local, JPEG transfer, and our approach. For this measurement,
we used a Samsung Galaxy S3 connected to a 3G network; the test
application is Local Laplacian Filters and the input is a 4-megapixels
image. Our approach uses less energy and is faster than the other two.
45
A-1 The block size w controls the quality-compacity trade-off of transform
recipes. Small blocks yield a fine-grained description of the transformation but require more storage space whereas large blocks form a
coarser and more concise description. With larger blocks, the recipe
lose some spatial accuracy. In practice, we recommend values between
32 and 128 for w. These results are for an uncompressed input.
12
.
. .
53
A-2 The number of linear segments p in the piecewise-linear tone curve
has a different impact depending on the application.
For instance,
Recoloring does not transform the luminance channel of the input and
therefore does not need more precision here. We setteled with p = 6
in all our experim ents. . . . . . . . . . . . . . . . . . . . . . . . . . .
13
54
14
List of Tables
2.1
Upsampling the degraded input and adding a small amount of noise
is critical to the quality of our reconstruction. We report the average
PSNR on the whole dataset for a 4 x 4 downsampling of the input and
JPEG compression with quality
Q=
50. The recipes use a block size
w = 64 and both the multiscale and non-linear features are enabled.
3.1
33
De-activating either the luminance curve or the multiscale features,
decreases the reconstruction quality by around 2 dB. We report the
average PSNR on the whole dataset for a non-degraded input. The
recipes use a block size w = 64.
3.2
. . . . . . . . . . . . . . . . . . . . .
39
Reconstruction results per enhancement. %,, refers to the compression
of the input as a fraction of the uncompressed bitmap data,
%down
to
that of the recipe (or output in case of jpeg and jdiff). For an uncompressed input, our method matches the quality of a jpeg compressed
image. The benefits of become more apparent as the input image is
further compressed. In the "standard" setting, the input is downsampled by a factor 2 in each dimension and JPEG compressed with a
quality
Q=
30, the block size for the recipe is w
jdiff methods are given the same input, and use
=
32. The jpeg and
Q=
80 for the output
compression. The LO enhancement is a failure case for our method.
15
.
40
3.3
Reconstruction results per enhancement. %,p refers to the compression
of the input as a fraction of the uncompressed bitmap data,
%down
to
that of the recipe (or output in case of jpeg and jdiff). The "medium"
setting is a 4x downsampling with
Q
= 50 and w = 64. The "low"
quality setting is a 8 x downsampling of the input with
Q
= 80 and
w = 128. The jpeg and jdiff methods are given the same input, and
use
3.4
Q=
80 for the output compression.
. . . . . . . . . . . . . . . .
41
We compare the end-to-end latency and energy cost of using our recipes
for cloud processing compared to local processing of the enhancement
and the jpeg baseline. Our method always saves energy and time compared to the baseline jpeg method.
The measurements are from a
Samsung Galaxy S5, processing a 8-megapixels image, and are averaged over 20 runs (we also report one standard deviation).
. . . . . .
45
A. 1 Per-enhancement PSNR and SSIM for recipes with w = 32 and no input compression. block overlap allows us to linearly interpolate between
overlapping block and avoids blocking artifacts. It roughly multiplies
the data ratio by 4. multiscale refers to the use of several pyramid
levels to represent the high-pass of the luminance channel. This allows
us to capture subtle multiscale effects.
non-linear refers to the use
of a piecewise-linear tone curve in the representation of the high-pass
luminance. This allows us to better capture tonal variations.
16
. . . .
50
A.2 Per-enhancement PSNR and SSIM for recipes with w = 32 and an
input compressed to %., = 1.0.
block overlap allows us to linearly
interpolate between overlapping block and avoids blocking artifacts.
It roughly multiplies the data ratio by 4. multiscale refers to the use
of several pyramid levels to represent the high-pass of the luminance
channel. This allows us to capture subtle multiscale effects. non-linear
refers to the use of a piecewise-linear tone curve in the representation
of the high-pass luminance.
variations.
This allows us to better capture tonal
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
51
A.3 Per-enhancement PSNR and SSIM for recipes with w = 64 and an
input compressed to %up = 0.4.
block overlap allows us to linearly
interpolate between overlapping block and avoids blocking artifacts.
It roughly multiplies the data ratio by 4. multiscale refers to the use
of several pyramid levels to represent the high-pass of the luminance
channel. This allows us to capture subtle multiscale effects. non-linear
refers to the use of a piecewise-linear tone curve in the representation
of the high-pass luminance. This allows us to better capture tonal
variations.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17
52
18
Chapter 1
Introduction
ground-truth output
2.5 MB
CLOUD
MOBI LE DEVICE
original input
2.3 MB
download - 41 kB
dowurmple
compress
\0
reconstruct
compressed input - 17 kB
/
8-
reconstruction
PSNR = 30 dB
Figure 1-1: Cloud computing is often thought as an ideal solution to enable complex algorithms on mobile devices; by exploiting remote resources , one expects to reduce running time
and energy consumption. However, this is ignoring the cost of transferring data over the
network. When accounted for , this overhead often takes away the benefits of the cloud, especially for data-heavy photo applications. We introduce a new photo processing pipeline that
reduces the amount of transmitted data. The core of our approach is the transform recipe,
a representation of the image transformation applied to a photo that is compact and can be
accurately estimated from an aggressively compressed input photo. These properties make
cloud computing more efficient in situations where transferring standard jpeg-compressed
images would be too costly in terms of energy and/ or time. In the example above where we
used our most aggressive setting, our approach reduces the amount of transferred data by
about 80 x compared to using images compressed with standard jpeg settings while producing an output visually similar to the ground-truth result computed directly from the original
uncompressed photo.
19
Mobile devices such as cell phones and cameras are exciting platforms for computational photography, but their limited computing power, memory and battery life
can make it challenging to implement advanced algorithms. Cloud computing is often
hailed as a solution allowing connected devices to benefit from the speed and infinite
power supply of remote servers. Cloud processing is also imperative for algorithms
that require large databases, e.g. [15, 23] and can streamline applications development, especially in the face of the mobile platform fragmentation. Unfortunately, the
cost of transmitting the input and output images can dwarf that of local computation,
making cloud solutions surprisingly expensive both in time and energy [2, 101. For
instance, a 6 seconds computation on a 16-megapixels image captured by a Samsung
Galaxy S5 would consume 20J. Processing the same image on the cloud would take
14 seconds and draw 54J1 . In short, efficient cloud photo enhancement requires a
reduction of the diata transferred that is beyond what lossy image compression can
achieve while keeping image quality high.
In this work, we focus on photographic enhancements that preserve the overall
content of an image (in the sense of style vs. content) and do not spatially warp it.
This includes content-aware photo enhancement [12], dehazing [13], edge-preserving
enhancements [17, 7], colorization [161, style transfer [22, 1], time-of-day hallucination [23] but excludes dramatic changes such as inpainting. For this class of enhancements, the input and output images are usually highly correlated. We exploit this
observation to construct a representation of the transformationfrom input to output
that we call a transform recipe. By design, recipes are much more compact than
images. And because they capture the transformation applied to an image, they are
forgiving to a lower input quality. These are the key properties that allow us to cut
down the data transfers. With our method, the mobile client uploads a downsampled
and highly degraded image instead of the original input, thus reducing the upload
cost. The server upsamples this image back to its original resolution and computes
the enhancement. From this low quality input-output pair, the server fits the recipe
'Transferring the jpeg images with their default compression settings (4MB each way), over a
4G network with bandwidth 4Mbits/s, according to our estimate of the average power draw and
modeling like in [141
20
and sends it back to the client instead of the output, incurring a reduced download
cost. In turns, the client combines the recipe with the original high quality input to
reconstruct the output (Fig. 1-1). While our recipes are lossy, they provide a highfidelity approximations of the full enhancement yet are one to two orders of magnitude
more compact.
Transform recipes use a combination of local affine transformations of the YUV
channels of a shallow Laplacian pyramid as well as local tone curves for the highest
pyramid level. This makes them flexible and expressive, so as to adapt to a variety
of enhancements, with a moderate computational footprint on the mobile device.
We evaluate the quality of our reconstruction on several standard photo editing tasks
including detail enhancement, general tonal and color adjustment, and recolorization.
We demonstrate that in practice, our approach translates into shorter processing times
and lower energy consumption with a proof-of-concept implementation running on a
smartphone instrumented to measure its power usage.
1.1
Related Work
Image compression
Image compression has been a long standing problem in the
image processing community [18]. The data used to represent digital images can be
reduced by exploiting their structure and redundancy. Lossless image compression
methods and formats generally rely on general purpose compression techniques such
as run-length encoding (BMP format), adaptive dictionary algorithm [31, 28], entropy
coding [11, 29] or a combination thereof [6]. These methods typically lower the data
size of natural images to around 50% to 70% that of an uncompressed bitmap, which
is virtually never enough for cloud-based image processing tasks. Alternatively, lossy
image compression methods offer more aggressive compression ratios at the expense of
some visible degradation. They often try to minimize the loss of information relevant
to the human visual system based on heuristics (e.g. using chroma sub-sampling). The
most widespread lossy compression methods build on transform coding techniques:
the image is transformed to a different basis, and the transform coefficients are quan21
tized. Popular transforms include the Discrete Cosine Transform (JPEG [27]) and the
Wavelet Transform (JPEG2000 [24]). Images compressed using these methods can
reach 10% to 20% of their original size with very little visible degradation but often,
this is still not sufficient for cloud photo applications. More recently, methods that
exploit the intra-frame predictive components of video codecs have emerged (WebP,
BPG). While they do improve the compression/quality trade-off, the improvements
at very low bit-rates remain marginal, and these methods struggle to pervade as
they compete with more mainstream formats. Unlike traditional image compression
techniques, we seek to compress the description of a transformation from the input
of an image enhancement to the output. We build upon, and use traditional image
compression techniques as building blocks in our strategy: any lossy compression
technique can be used to compress the input image before sending it to the cloud.
JPEG is a natural candidate since it is ubiquitous and often implemented in hardware
on mobile devices. We also use lossless image compression to further compress our
recipes.
Recipes as a regression on image data Using the input image to predict the
output is inspired from shape recipes [8, 26]. In the context of stereo shape estimation,
shape recipes are regression coefficients that predict band-passed shape information
from an observed image of this shape. They form a low-dimensional representation
of the high-dimensional shape data. Shape recipes describe the shape in relation to
the image, and constitute a set of rules to reconstruct the shape from the image.
These recipes are simple as they let the image bear much of the complexity of the
representation. Similarly, our transform recipes capture the transformation between
the input and output images, factoring out their structural complexity.
Whereas
shape recipes are described in terms of convolution kernel with possible non-linear
extensions, the complexity of our model is constrained by the amount of data we
can afford to transfer back to the client. Our recipes also differ from shape recipes
in that we have a notion of spatially varying transformation: we fit different models
for each block, in a grid subdivision of the image. In contrast, shape recipes apply
22
globally to a whole image sub-band.
Finally, our recipes are computed from low
quality images and applied to the high quality input whereas shape recipes do not
have this constraint.
23
(b) output of the filter
applied to the degraded input
PSNR = 18.8 dB
(a) degraded input sent to the server
4 x 4 bicubic downsampling
JPEG compression Q = 20
( c) original full
quality input
(d) degraded input
close-up on (a)
( f) our reconstruction
%down = Q.4%
PSNR = 32.6 dB
( e) output of the filter
close-up on (b)
(g) reference output
computed on the
full-quality input
Figure 1-2: A major advantage of our transform recipes is that we can evaluate them
using severely degraded input photos (a,d). Processing such a strongly compressed image
generates a result with numerous jpeg artifacts (b,e). We can nevertheless use such pairs of
degraded input- output photos to build a recipe that , when applied on the original artifactfree photo (c) produces a high quality output (f) that closely approximates the reference
output (g) directly computed from the original photo. The close-ups ( c- g) show the region
with highest error in our method (f).
24
Chapter 2
Reducing Data Transfers
We reduce the upload cost by downsampling and aggressively compressing the input
image I. This yield a degraded image I. Upon reception by the server, I is upsampled
back to the original resolution. Let us denote this new low quality image by i. The
server processes I to generate a low fidelity output
0.
The server then assembles a
recipe fia such that flC(I) ~ 0, where 0 is the ground-truth output that would
result from the direct processing of I.
Recipes are designed to be compact and
therefore incur a low download cost (Fig. 1-1).
To reconstruct the final ouput, the client downloads the recipe and combines
it with the original high-quality input I (Fig. 2-1). Our objective with recipes is to
represent a variety of standard photo enhancements. In particular, we seek to capture
effects that may be spatially varying, multi-scale, and nonlinear.
2.1
Transform Recipes
We define recipes over w x w blocks where w controls the trade-off between compactness and accuracy. Small blocks yield a fine-grained description of the transformation
but require more storage space whereas large blocks form a coarser and more concise
description. We use overlapping blocks to allow for smooth transitions between blocks
and prevent blocking artifacts (Fig 2-5), that is, we lay out the w x w blocks on a
x 1 grid so that each pair of adjacent blocks share half their area. We work in the
W
25
same YCbCr color space as JPEG to separate the luminance Y from the chrominance
(Cb , Cr) while still manipulating values that span [O; 255] [9].
ORIGINAL
INPUT
RECIPE RECEIVED FROM THE SERVER
RECONSTRUCTED
OUTPUT
low-frequency residual
multiscale
decomposition
linear scaling . .
3 coefs / block
linear upsampling
high-frequency
chrominance
111111111111111111111111
111111111111
111111111111
high-frequency intensity pyramid
111111111111111 ••
11111111111111
1111111111
II Ill
-
-
-
affine transform
8 coefs / block
.. 111111111111
111111111111
11111111
1111111111
linear
blending
pyramid
collapse
affine transform . . .
4 coefs / block
piecewise-linear
remapping
..
5 coefs / block
per-level linear
scaling
logl wJ coefs /
..
block
Figure 2-1: To reconstruct the output image we decompose the input as a shallow Laplacian
pyramid. The lowpass residual is linearly re-scaled using the ratio coefficients Rc(P) . For
the chrominance, we don 't use the complete multiscale decomposition: each block in the
high-pass of the chrominance undergoes and affine transformation parameterized by Ac and
be. The luminance channel is reconstructed using a combination of an affine transformation
( Ay , by) , a piecewise linear transformation ({qi}) , and a linear scaling of the Laplacian
pyramid level ({mt}).
To capture multi-scale effects, we decompose the images using Laplacian pyramids [3]. That is, we split I and 0 into n
+ 1 levels {Lt [!]} and {Le[O]} where the
resolution of each level is half that of the preceding one. The first n levels represent
the details at increasingly coarser scales and the last level is a low-frequency residual.
We set n = log 2 ( w) so that each block is represented by a single pixel in the residual. We compute a recipe in three steps: we first represent the transformation of the
residual, then that of the other pyramid levels , and finally we quantize and encode
the result of the first two steps in an off-the-shelf compressed format.
26
2.1.1
Residual
We found that even small errors in the residual propagate to large areas in the final
reconstruction and can produce conspicuous artifacts in smooth regions like skies.
Since the residual is only a small fraction of the data, we can afford to represent it
with greater precision.
We represent the low-frequency part of the transformation by the ratio of the
residuals, i.e., for each pixel p and each channel c E {Y, Cb, Cr}:
L,[c]p)+ 1
Re(p) = Ln[0c](P) +1
Ln [Ic] (P) + I
(2.1)
where 1 is added to prevent divisions by zero and to ensure that Rc is constant
when Ln[Oc] = L,[Ic], that is, for edits like sharpening that only modify the high
frequencies. This compresses better in the final stage.
2.1.2
High Frequencies
Representing the high-frequency part of the transformation requires much more data.
Using per-pixel ratios would not be competitive with standard image compression.
To model the high-frequency effect of the transformation, we combine all the levels
of the pyramids except the residual into a single layer:
n-1
H[I] = E up(Lj[I])
f=0
(2.2)
where up(.) is an operator that upsamples the levels back to the size of the original
image I. In practice, we use the property of Laplacian pyramids that all the levels
including the residual add up to the original image to compute H[I] = I - up(Ln[I]),
which is more efficient. We use the same scheme to compute H[0].
Next, we process H[I] and H[O] block by block independently. Our strategy is to
fit a parametric function to the pixels within each block to represent their transformation. In our early experiments, we found that while the chrominance transformation
were simple enough to be well approximated by an affine function, no single simple
27
parametric function was able to capture alone the diversity of luminance transformations generated by common photographic edits. Instead, our strategy is to rely on a
combination of several functions that depend on many parameters and use a sparse
regression to keep only a few of them in the final representation.
2.1.3
Chrominance channels
Let Occ denote the two chrominance channels of 0. Occ(p) is the the 2D vector
containing the chrominance values of 0 at pixel p. Within each w x w block B, we
model the high frequencies of the chrominance H[Occ](p) as an affine function of
H[I](p). We use a standard least-squares regression and minimize:
ZIIH[Occ](p) - Ac(B)H[I](p) - bc(B) 1 2
(2.3)
pEB
where A, and bc are the 2 x 3 matrix and the 2D vector defining the affine model.
2.1.4
Luminance channel
We represent the high frequencies of the luminance H[Oy] with a more sophisticated
function comprising several components. The first component is an affine function
akin to that used for the chrominance channel: Ay(B)H[I](p) + by(B) with Ay
and by the 1 x 3 matrix and the scalar constant defining the affine function. Intuitively, this affine component captures the local changes of brightness and contrast.
Next, we capture scale-dependent effects with a linear function of the pyramid levels:
n-1
me(B) up(L, [Iy]) where {me(B)} are the coefficients of the linear combination
within the block B and up(-) is used to ensure that all levels are upsampled back
to the original resolution. Last, we add a term to account for the nonlinear effects.
We use a piecewise linear function made of p segments over the luminance range of
the block. To define this function, we introduce p - 1 regularly spaced luminance
values yj = minu H[Iy] + I(maxL H[Iy] - minBH[Iy]), i E {1, . . ,p - 1} and the
segment functions si(-)
=
max(. - yi, 0) where we omit the dependency on B of yi and
28
si for clarity's sake. Intuitively, si generates a unit change of slope at the ith node yi.
Equipped with these functions, we define our nonlinear term as
Z_-j
qj(B)sj(H[Iy])
where the {qj(B)} coefficients control the change of slope between two consecutive
linear segments.
To fit the complete model, we use LASSO regression 1251, i.e., we minimize a
standard least-squares term:
S
H[Oy](p) - Ay(B)H[I](p) - by(B)
pEB
n-1
- 5
p-1
mj(B) up(L,[Iy])(p) -
2
qj(B)s(H[Iy](p))
(2.4)
=i=1
to which we add the L, norm of the free parameters Ay(B), by(B), {me(B)}, and
{qj (B) }.
Solving the optimization problems (Eq. 2.3 and Eq. 2.4) for each overlapping
block, and computing the lowpass ratio (Eq. 2.1), we now have all the coefficients of
the recipe (Fig. 2-2). In the next section, we describe how we reconstruct the output
image from the full quality input image and these recipe coefficients.
2.2
Reconstruction
We reconstruct the output 0 using the recipe described in the previous section and
the input I. We proceed in two steps; we first deal with the multiscale part of the
recipe and then with its high-frequency component.
This process is illustrated in
Figure 2-1.
From Equation 2.1, for each pixel p and each channel c E {Y, Cb, Cr}, we get the
expression for the lowpass residual:
Ln[Oc](p) = Rc(p)(Ln[Ic](p) + 1) - 1
29
(2.5)
Then, we reconstruct the intensity levels f E [0; n - 1] of the multiscale component
of Equation 2.4:
Lf [0y] = mjL[Iy]
(2.6)
Together with the luminance channel of Equation 2.5, this gives a complete Laplacian pyramid {Lj [0y] }. We collapse this pyramid to get an intermediate output luminance channel 6y. At this point, the reconstruction is missing some high-frequency
luminance components from Equation 2.4 and the chrominance channels Cr and Cb.
To reconstruct the remaining high-frequency information H[O], we first compute
H3[0] within each block then linearly blend the results of overlapping blocks. For
the chrominance, we apply the affine remapping that we estimated in Equation 2.3:
H8 [Occ](p) = Ac (B)H[I](p) + bc(3)
(2.7)
And for the intensity channel, we apply the affine remapping and the nonlinear
functions that we computed in Equation 2.4:
P-1
H8 [Oy](p) = Ay3(B)H[I](p) + by(B) +
qi(B)s(H [Iy](p))
(2.8)
We linearly interpolate H8 [Occ] and H8 [Oy] between overlapping blocks to get
H[Occ] and H[Oy]
Finally, we put all the pieces together. We linearly upsample the chrominance
residual (Eq. 2.5) and add the multiscale component (Eq. 2.6), and the linearly interpolated high-frequency chrominance (Eq. 2.7) and luminance (Eq. 2.8):
0
= up(Ln[Occ]) +
+ H[Occ] + H[Oy]
(2.9)
where we implicitly lift the quantities in the right hand side into YCbCr vectors
depending on the channels they contain.
30
2.3
Data compression
Once computed by the server, we assemble the recipe coefficients described in Section 3.1 as a multi channel image (Fig.2-2). Each channel of the recipe is independently uniformly quantized to 8 bits. In practice the quantization error is small and
does not affect the quality of the reconstruction. (less than 0.1 dB)
We compress the coefficient maps using an off the shelf standard lossless image
compression method. In our implementation, we save the lowpass residual as a 16-bits
float TIFF file, and the highpass coefficients as tiles in a 8-bits grayscale PNG file.
This compression takes benefit both from the spatial redundancy of the recipe and
the sparsity induced by the LASSO regression and further reduces the data by about
a factor of 2.
2.3.1
Upstream Compression
So far, we have only described how to cut down the data downloaded by the client
using transform recipes. Because recipes capture the transformation applied to an
image they are resilient to degradations of the input (Fig. 1-2). We leverage this
property and reduce the data uploaded to the server by sub-sampling and compressing
the input image using a lossy compression algorithm with aggressive settings (we use
JPEG). Upon receiving the data, the server upsamples the degraded image back to
its original resolution and adds a small amount of Gaussian noise to it (a- = 0.5%
of the range, see Section 2.3.2). The server then processes the upsampled image to
produce a low quality output. The fitting procedure of Section 2.1 is applied to this
low quality input/output pair. The client in turn reconstructs the output image by
combining his high quality input with the recipe.
31
-
...
J1~
~ •
~
-·
-..
-'1
(c)
.J
\..
'·
~.
.
j
.~./~):?\ •
)
..
-
.
I
'
,,
•
•
•
'
-
(d}
(e)
(f}
Figure 2-2: Recipe coefficients computed for the photo in Figure 1-2. We remapped
the values to the [O; 1) range for display purposes. (a) lowpass residual R e. (b,c) affine
coefficients of the chrominance Ac and he. (d) affine coefficients of the luminance Ay and
by. (e) multiscale coefficients {mt}. (f) non linear coefficients {qi}.
2.3.2
Input Pre-processing
Adding noise serves two purposes. Since the downsampling and JPEG compression
both drastically reduce the high frequency content of the input image, adding noise
allows us to re-introduce some high frequencies from which we can estimate the transformation. Furthermore, the low quality of the input can make the fitting procedures
ill-posed which creates severe artifacts at reconstruction time (Fig. 2-3). The added
noise acts as a regularizer.
Alternatively we could apply the desired image enhancement directly to the low
resolution input , which would also lower the computational cost. This would require
adapting the parameters of the enhancement accordingly but in certain cases , it is
32
unclear how the low-resolution processing relates to the high-resolution processing.
Besides, with this approach, we cannot estimate the high-frequency part of the transformation. As shown in Figure 2-4, this approach fails for enhancements that have a
non trivial high-frequency component. We summarize the importance of upsampling
the input and adding noise in Table 2.1.
with upsampling
without upsampling
with noise
without noise
34.9 dB
28.7 dB
33.5 dB
27.6 dB
Table 2.1: Upsampling the degraded input and adding a small amount of noise is critical
to the quality of our reconstruction. We report the average PSNR on the whole dataset for
a 4 x 4 downsampling of the input and JPEG compression with quality Q = 50. The recipes
use a block size w = 64 and both the multiscale and non-linear features are enabled.
(a) reference
(b) our method
PSNR = 34.7 dB
(c) no noise added to the
upsampled input before
processing, PSNR = 28.1 dB
Figure 2-3: Adding a small amount of noise to the upsampled degraded image makes
the fitting process well-posed and enables our reconstruction (b) to closely approximate
the ground-truth output (a). Because of the downsampling and JPEG compression, the
degraded input (not shown) exhibits large flat areas (in particular in the higher Laplacian
pyramid levels). Without the added noise, the fitting procedure might use the corresponding
recipe coefficients as affine offsets. This generates artifacts (c) when reconstructing from the
high quality input (which does have high frequency content).
33
(a) reference
(b) our method
PSNR = 34.7 dB
(c) no input upsampling before
processing, PSNR. = 24.2 dB
Figure 2-4: In this close-up from a Detail Manipulation example, processing directly the
downsampled input before fitting the recipe fails to capture the high frequencies of the
transformation. Errors are particularly visible in the eyes and hair.
(a) reference
(b) our method
PSNR = 42.1 dB
(c) no block overlap,
PSNR = 34.7 dB
Figure 2-5: We overlap the blocks of our recipes (b) and linearly interpolate the reconstructed pixel values between overlapping blocks so as to avoid visible blocking artifacts (c).
34
Chapter 3
Evaluation
We evaluate the quality of our reconstruction on several image processing applications
(
3.1).
We conduct stress tests by degrading the quality of the input to several
degrees and analyze the effect of our design choices
(
3.2). Latency and physical
energy consumption are measured using a proof-of-concept implementation on an
Android smartphone
(
3.3).
The full reconstruction results assessing the different quality settings and the
impact of each features in the pipeline are provided in Appendix A. We express the
compression performance as a fraction of the original, uncompressed 24-bits bitmap.
Though an uncompressed bitmap would never be used in the context of cloud image
processing, this provides us with a fixed reference for all our comparisons. A typical
out-of-the-phone JPEG has a data ratio in the 10% to 15% realm. A PNG image has a
typical data ratio of 50% to 75% for natural color images. Besides visual comparison,
we report both PSNR and SSIM to quantify the quality of our reconstruction.
3.1
Test enhancements
We selected a range of applications representative of typical photo editing scenarios.
Some of these applications might not necessarily require cloud processing, but they
demonstrate the expressiveness of our model.
We gathered 168 high quality test
images from Flickr and the MIT-Adobe fiveK dataset [4] to use as input to the
35
image processing algorithms. The image resolution range between 2-megapixels and
8-megapixels. MIT-Adobe fiveK dataset provides raw images that give us baseline
for reconstruction quality.
From these high quality inputs, we generated degraded inputs for several quality
levels using a combination of bicubic downsampling (2 to 8 times smaller images in
each dimension) and aggressive JPEG compression (with quality parameters ranging
from 30 to 80, using Adobe Photoshop's quantization tables). Our JPEG compression
uses the quantization tables from Adobe Photoshop. This new set of images illustrate
what a typical client would send for processing in the cloud using our technique.
We ran each algorithm on this large set of inputs for different quality settings of
the degraded input and measured end-to-end approximation fidelity.
Photo editing and Photoshop actions
We manually applied various image edit-
ing tools to the input images and recorded them as Photoshop Actions. The edits
include tonal adjustments, color correction, non-trivial masking, spatially varying
blurs, unsharp masking, high pass boost, vignetting, white balance adjustments, and
detail enhancement. The set of operations include both global and local edits, and
were created as a stress test for our model. These actions are an example of filters
that would simply not be available on a mobile device.
Dehazing
This algorithm 113] estimates an atmospheric light vector from the input
image, computes a transmission map and removes the corresponding haze. The transmission map, and therefore the transformation applied to the input image exhibits
sharp transitions but they follow the input's content.
Edge preserving detail enhancement
We test both the Local Laplacian Fil-
ter [17] and another edge-preserving decomposition [7]. These methods are challenging for our approach, which does not have any notion of edge preservation.
36
Style Transfer
[1] alters the distribution of gradients in an image to match a
target distribution using the Local Laplacian filter. The algorithm outputs an image
whose style matches that of the examples image.
Portrait Style Transfer
[22] spatially matches the input image to a target ex-
ample whose style is to be imitated using dense correspondence.
Because of the
correspondence step and the database query, this algorithm is a good candidate for
cloud processing. The multi-scale, spatially varying transfer is of interest to test our
recipe. The algorithm also adds post-processing not well tied to our model (adding
highlights in the eyes). We disabled this post-processing during our evaluation.
Time of Day Hallucination
[231 is an example of algorithm that requires a large
database and costly computer vision matching.
Colorization
[16] requires solving a large sparse linear system which becomes
costly on mobile platforms for large images. We input scribbles on images from the
MIT5K dataset.
Matting
Is beyond the scope of our work since the output image does not resembles
a photography. It is nonetheless useful in the photo editing context and is a good
stress test for expressiveness. We use the implementation of KNN-matting available
on the authors' webpage 15], and use as inputs the publicly available data from [21].
LO smoothing
[30] is a paradigm of methods where the stylized output is signifi-
cantly different from the input. We include it as a failure case.
37
3.2
Expressiveness and robustness
We processed our entire dataset at various levels of compression and found that
in all cases except the most extreme settings our approach produces outputs visually
similar to the ground truth, i.e., the direct processing of the uncompressed input
photo. Figure 3-3 Figure 3-4 and shows a few example reconstruction for the "standard" quality setting. We compare our technique to two baseline alternatives set to
match our data rate:
" jpeg: the client sends a JPEG, the server processes it and transmit the output
back as a JPEG image.
" jdiff: the client sends a JPEG input image. The server sends back the difference
between this input and the output it computes from it. The difference is also
JPEG compressed. The client adds the difference back to its local input.
As shown in Table 3.3, for all applications, our approach performs equally or better than these alternatives, often by a large margin.
The complete and detailed
performance report with comparison to this baseline can be found in Appendix A.
Figure 3-2 illustrates the robustness of our recipes for various compression ratios of
the input and output. We recommend two level of compression: "standard" that
corresponds to about 1.5% of the data and generates images often indistinguishable
from the ground truth, and "medium quality" that corresponds to about 0.5% of the
data and only introduce minor deviations. In practice, one would choose the level of
accuracy/compression depending on the application.
The extra features for the luminance channel are critical to the expressiveness
of our model. Both the non-linear luminance curve and the multiscale features help
capture subtle and local effects (Fig. 3-1). We quantify the improvement these features
provide in Table 3.1. A detailed per enhancement report can be found in Appendix A.
38
with luminance curve
without
40.2 dB
38.5 dB
38.3 dB
36.6 dB
with multiscale
without
Table 3.1: De-activating either the luminance curve or the multiscale features, decreases
the reconstruction quality by around 2 dB. We report the average PSNR on the whole
dataset for a non-degraded input. The recipes use a block size w = 64.
(a) reference output
(b) affine terms only, PSNR = 32.1 dB
(c) affine and multiscale terms for the luminance, PSNR = 33.6 dB
(d) affine, multiscale and non-linear
terms for the luminance, PSNR = 36.4 dB
Figure 3-1: The additional features to model the transformation of the luminance are
critical to capture subtle local effects of the ground truth output (a). (b) Using only the
affine terms, most of the effect on high frequencies is lost on the wall, and the wooden
shutter. (c) Adding the Laplacian pyramid coefficients, the high frequency detail is more
faithfully captured on the wall. (d) With both the non-linear and multiscale terms, our
model better captures local contrast variations. This is particularly visible on the texture
of the wooden shutter.
39
PSNR (dB)
%down
SSIM
quality
enhancement
#
%,
ours
jpeg
jdiff
ours
jpeg
jdiff
ours
jpeg
jdiff
non-degraded input
w = 32
LocalLaplacian
Dehazing
Detail Manipulation
Lo
Matting
Photoshop
Recoloring
Portrait Transfer
Time of Day
Style Transfer
20
20
20
12
12
10
10
20
24
20
71.7
51.4
71.7
62.5
41.7
68.3
33.4
31.5
47.5
76.8
1.7
1.4
1.6
2.0
1.0
1.4
0.8
1.1
1.8
1.1
9.1
9.8
5.6
3.7
3.1
11.9
3.2
3.9
14.0
15.3
7.3
7.6
4.2
7.8
7.2
9.4
1.1
2.5
12.4
12.4
38.1
42.6
39.0
36.7
37.1
44.3
49.3
42.9
42.7
37.3
35.6
39.0
37.1
42.6
48.3
42.4
42.5
41.6
39.9
43.4
36.6
41.8
41.7
35.4
42.7
44.8
47.7
41.3
40.8
42.0
0.97
0.98
0.98
0.95
0.97
0.99
1.00
0.99
0.98
0.97
0.94
0.95
0.94
0.98
1.00
0.98
0.96
0.98
0.98
0.99
0.94
0.97
0.98
0.84
0.96
0.98
0.98
0.97
0.97
0.98
all
168
55.6
1.4
7.9
7.2
41.0
41.2
41.5
0.98
0.97
0.96
LocalLaplacian
Dehazing
Detail Manipulation
Lo
Matting
Photoshop
Recoloring
Portrait Transfer
Time of Day
Style Transfer
20
20
20
12
12
10
10
20
24
20
0.8
0.9
0.8
1.8
0.8
0.8
0.7
0.8
1.5
1.0
1.8
1.4
1.7
2.1
0.9
1.5
1.4
1.1
1.8
1.1
2.4
2.1
1.9
1.8
1.0
2.1
1.0
1.3
2.9
2.8
2.1
1.7
1.7
2.9
2.5
1.9
0.4
1.1
2.9
2.3
37.0
36.1
36.8
35.8
35.2
42.6
47.3
41.1
38.6
34.9
26.0
28.0
27.9
34.7
32.0
29.6
38.2
32.9
26.7
25.6
26.7
28.9
28.8
33.0
32.1
30.4
39.4
33.8
27.3
26.0
0.97
0.97
0.98
0.94
0.96
0.99
0.99
0.99
0.97
0.96
0.67
0.75
0.81
0.95
0.97
0.80
0.92
0.89
0.75
0.67
0.70
0.79
0.85
0.89
0.92
0.82
0.94
0.90
0.78
0.70
all
168
1.0
1.5
1.9
2.0
38.5
30.2
30.6
0.97
0.82
0.83
standard
2 x downsampling
Q = 30
w = 32
Table 3.2: Reconstruction results per enhancement. %,, refers to the compression of the input as a fraction of the uncompressed bitmap
data, %down to that of the recipe (or output in case of jpeg and jdiff). For an uncompressed input, our method matches the quality of
a jpeg compressed image. The benefits of become more apparent as the input image is further compressed. In the "standard" setting,
the input is downsampled by a factor 2 in each dimension and JPEG compressed with a quality Q = 30, the block size for the recipe is
w = 32. The jpeg and jdiff methods are given the same input, and use Q = 80 for the output compression. The LO enhancement is a
failure case for our method.
0
PSNR (dB)
ours jpeg jdiff
ours
SSIM
jpeg
jdiff
0.6
0.5
0.5
0.9
0.9
0.6
0.1
0.4
0.9
0.7
34.6
34.3
33.9
32.5
32.9
40.0
44.6
36.7
35.8
32.8
23.3
26.1
24.3
29.9
28.9
26.9
35.6
30.1
24.8
23.0
23.6
26.7
24.7
29.6
29.0
27.3
36.3
30.6
25.1
23.2
0.96
0.95
0.97
0.90
0.94
0.98
0.99
0.97
0.95
0.94
0.55
0.68
0.70
0.90
0.95
0.74
0.89
0.83
0.66
0.56
0.56
0.70
0.72
0.88
0.91
0.75
0.91
0.83
0.68
0.57
0.6
0.6
35.8
27.3
27.6
0.96
0.75
0.75
0.2
0.1
0.1
0.2
0.2
0.1
0.1
0.1
0.2
0.1
0.2
0.2
0.2
0.2
0.1
0.2
0.1
0.2
0.3
0.2
0.2
0.2
0.2
0.3
0.3
0.2
0.0
0.1
0.3
0.2
32.0
31.5
31.0
30.1
30.2
36.2
40.9
31.8
33.5
30.3
21.3
24.3
21.8
26.5
25.9
25.0
32.6
27.6
23.3
20.8
21.4
24.7
22.1
26.7
26.0
25.3
32.9
27.8
23.4
20.9
0.95
0.93
0.95
0.87
0.92
0.97
0.99
0.94
0.93
0.92
0.45
0.63
0.63
0.85
0.94
0.71
0.87
0.78
0.61
0.46
0.46
0.64
0.64
0.85
0.90
0.72
0.87
0.78
0.61
0.47
0.1
0.2
0.2
32.7
24.9
25.1
0.94
0.69
0.69
%down
quality
enhancement
#
%,,
ours
medium
4x downsampling
Q = 50
w = 64
Local Laplacian
Dehazing
Detail Manipulation
LO
Matting
Photoshop
Recoloring
Portrait Transfer
Time of Day
Style Transfer
20
20
20
12
12
10
10
20
24
20
0.3
0.3
0.3
0.7
0.3
0.3
0.3
0.4
0.6
0.4
0.5
0.4
0.5
0.6
0.4
0.4
0.4
0.4
0.6
0.3
0.7
0.6
0.6
0.6
0.3
0.6
0.4
0.4
0.9
0.8
all
168
0.4
0.5
Local Laplacian
Dehazing
Detail Manipulation
Lo
Matting
Photoshop
Recoloring
Portrait Transfer
Time of Day
Style Transfer
20
20
20
12
12
10
10
20
24
20
0.1
0.1
0.1
0.3
0.1
0.1
0.1
0.2
0.3
0.2
all
168
0.2
low
8 x downsampling
Q = 80
w = 128
jpeg jdiff
Table 3.3: Reconstruction results per enhancement. %,, refers to the compression of the input as a fraction of the uncompressed bitmap
data, %down to that of the recipe (or output in case of jpeg and jdiff). The "medium" setting is a 4x downsampling with Q = 50 and
w = 64. The "low" quality setting is a 8x downsampling of the input with Q = 80 and w = 128. The jpeg and jdiff methods are given
the same input, and use Q = 80 for the output compression.
40
35 -
-
our method %.,
=
1.0
v- our method %up = 0.5
*-o our method %., = 0.1
Z
0-0
jpeg
:-
jdiff
30-
251
0.2
0.4
1.0
0.8
0.6
Total data ratio I (%.p + %Vd.-)
1.2
1.4
our
Figure 3-2: We analyze the quality of our reconstruction for different settings of
quality
reconstruction
recipe and various degrees of input compression. Not suprisingly the
decreases as input and output compression increase. We report the average PSNR on the
whole dataset.
3.3
Prototype system
enWe implemented a client-server poof-of-concept system to validate runtime and
image
ergy consumption. We use a Samsung Galaxy S5 as the client device. Three
processing scenarios are tested: on-device processing, naive cloud processing - where
we transfer the input and output compressed as JPEG with the default settings from
the cellphone, and our recipe-based processing.
through WIFI and cellular LTE network.
We consider network connections
Our server runs an Intel Xeon E5645
CPU, using 12 cores.For on-device processing we use a Halide-optimized implemenWe
tation [20, 191 of the test filters, through the Java Native Interface on Android.
evaluate local Laplacian filter
[17]
with fifty pyramids for detail enhancement, on an
8-megapixels input.
a
To ensure accurate power measurement, we replace the cellphone battery with
generator. The
power generator, and record the current and voltage drained from the
three
power usage plot in Fig. 3-5 shows the energy consumption at each step for the
scenarios on the Local Laplacian filter. By reducing the amount of data transferred
and
both upstream and downstream, our method greatly reduces transmission costs
cuts down both end-to-end computation time and power usage (Fig. 3.4).
42
De hazing
--
PSNR
=
34.3 dB
PSNR
= 42.5 dB
0.5
%down =
1.8
% P= O.l
%down =
1.5
%down=
2.4
%up =
)
Detail Manipulation
11
~·-
PSNR = 25.6 dB
Local Laplacian
PSNR = 32.4 dB
%up = 0.4
%down=
1.9
Matting
PSNR = 37. 7 dB
%up = 0.3
%dow n =
0.7
(a) input
(b) reference
(c) our
output
reconstruction
( d) reference
( e) highest
error p atch
( f) rescaled
difference
Figure 3-3: Our method handles a large variety of photographic enhancements. We filter
a highly degraded copy (not shown) of the reference input (a). From the resulting degraded
output , we compute the recipe parameters. We then reconstruct an output image (c) that
closely approximates the reference output (b) computed from the original high-quality input.
(d) and (e) are a close-up on the region of highest error in our reconstruction. (f) is a rescaled
difference map that emphasizes the location of our errors. As shown on the Lo smoothing
example, our method cannot handle well filters that significantly alter the structure of the
input.
43
Portrait Transfer
PSNR = 38.5 dB
3
Recoloring
PSNR = 48. 7 dB
Style Transfer
PSNR = 34.6 dB
Time of Day
Photoshop
PSNR
(a) input
=
0.4
%down=
1.1
3 up = 0.2
%down =
1.5
%up=
0.3
%down =
1.1
PSNR = 37.2 dB
3 up =
0.4
%down =
1.6
= 46.l dB
3 up =
0.7
%down =
1.5
(b) reference
(c) our
output
reconstruction
Iii)
( d) reference
( e) highest
error patch
( f) rescaled
difference
Figure 3-4: Additional examples. We filter a highly degraded copy (not shown) of the
reference input (a). From the resulting degraded output , we compute the recipe parameters.
We then reconstruct an output image (c) that closely approximates the reference output
(b) computed from the original high-quality input. (d) and (e) are a close-up on the region
of highest error in our reconstruction. (f) is a rescaled difference map that emphasizes the
location of our errors.
44
WIFI
LTE
enhancement
Local Laplacian
Style Transfer
recipe
jpeg
recipe
jpeg
local
1.5 J
0.5 s
30.6
14.7
2.4 J
1.9 s
10.6
4.9
1.2 J
0.2 s
14.6
9.2
0.6 J
0.3 s
7.9
3.9
1.2 J
0.1 s
135.6 2.4 J
30.6 + 0.9 s
44.1
23.6
0.7 J
2.2 s
20.7
12.4
1.5 J
0.7 s
30.5
17.3
0.7 J
1.5 s
23.2
11.4
1.3 J
0.3 s
23.2
7.0
Table 3.4: We compare the end-to-end latency and energy cost of using our recipes for
cloud processing compared to local processing of the enhancement and the jpeg baseline.
Our method always saves energy and time compared to the baseline jpeg method. The
measurements are from a Samsung Galaxy S5, processing a 8-megapixels image, and are
averaged over 20 runs (we also report one standard deviation).
power consumption (W) lower is better
6
recons truction
download
die.:I
I
dle
upload:
our approach
-
jpeg
local
14.3J
33.1J
22.1J
5.8s
17.1s
6.6s
download
dle
upload
0L
0
2
4
6
10
8
12
14
16
time (s)
Figure 3-5: Plot of the power consumption over time of three computation schemes: purely
local, JPEG transfer, and our approach. For this measurement, we used a Samsung Galaxy
S3 connected to a 3G network; the test application is Local Laplacian Filters and the input
is a 4-megapixels image. Our approach uses less energy and is faster than the other two.
45
46
Chapter 4
Conclusion
Cloud image enhancement can only be made more efficient if the power and time cost
of data transfers is minimized. In the case of algorithms that preserve the content
of an image and do not spatially warp it, we show that we can exploit the similarity
between the input and the output. We introduce transform recipes, a flexible representation which capture how images are locally modified by an enhancement. Recipes
dramatically reduce the bandwidth required for cloud computation because they are
very compact and allow us to work on the server with a highly-degraded input-output
pair and apply the recipe on the client to the high-quality input. Finally we demonstrate that transform recipes can support a wide range of algorithms and, in practical
scenarios, lead to reduced power and time costs.
Our method assumes a good correlation between the input and the output, and no
spatial transformation. This excludes operations such as barrel distortion correction
and perspective rectification. A parametric warp could be added but we chose to focus
on stylistic transformations. Our method also cannot handle significant addition of
content, such as inpainting techniques or clone brushing. Overall, the development
of good recipes is a trade-off between expressiveness, compactness and reconstruction
costs. In some scenarios, enhancement-specific recipes might provide significant gains.
47
48
Appendix A
Recipe Parameters
We provide detailed quality comparisons to evaluate our design choices. In particular,
we motivate the use of block overlapping, non-linear luminance features and multiscale
decomposition. Table A.1 reports the per-enhancement PSNR and SSIM for recipes
with w
with w
=
=
32 and no input compression. Table A.2 summarizes the metrics for recipes
32 and an input compressed to %., = 1.0 (the standard quality level).
Table A.2 is for recipes with w = 64 and an input compressed to %,, = 0.4 (the
medium quality level).
We illustrate the impact of the number of linear segments of the luminance tone
curve in Figure A-2. The impact of w is shown in Figure A-1.
49
%down
PSNR (dB)
block overlap
multiscale
non-linear
/
-(-/
v/-/
/
Local Laplacian
Dehazing
Detail Manipulation
Lo
Matting
Photoshop
Recoloring
Portrait Transfer
Time of Day
Style Transfer
0.3
0.3
0.3
0.4
1.0
0.2
all
Local Laplacian
Dehazing
Detail Manipulation
0.3
0.8
0.9
0.4
0.4
0.4
0.4
0.4
0.4
1.1
0.5
0.4
0.8
0.5
0.5
0.3
0.4
0.2
0.5
0.3
0.4
0.2
/
0.5
1.3
1.1
1.3
1.5
1.4
1.1
1.3
1.6
0.7
0.7
0.5
0.6
0.4
1.1
1.1
0.4
0.7
0.8
0.6
0.9
0.8
1.4
1.4
1.4
0.4
1.7
1.4
1.6
2.0
1.0
1.4
0.3
0.5
0.3
0.3
0.5
0.2
0.2
0.7
0.7
0.3
0.4
0.6
0.3
0.3
0.7
0.3
0.4
1.1
1.1
0.4
33.5
34.3
34.7
35.9
35.5
37.0
37.0
38.1
40.4
41.1
40.6
41.7
41.3
42.5
41.7
42.6
32.9
31.8
33.7
33.8
32.7
34.4
36.1
34.3
34.4
34.6
33.3
35.8
36.8
35.1
35.0
35.6
34.4
36.6
38.0
35.7
36.3
39.0
36.7
37.1
39.5
40.3
41.1
42.0
41.8
42.8
43.5
44.3
47.9
36.8
49.2
38.1
47.9
39.0
48.0
39.1
49.2
40.7
49.3
40.4
48.0
41.1
49.3
42.9
41.2
41.6
41.2
42.2
41.7
42.8
42.1
42.7
31.1
31.7
31.8
35.8
32.5
36.6
36.6
37.3
all
36.9
37.7
38.1
38.8
39.0
39.8
40.0
41.0
Local Laplacian
Dehazing
Detail Manipulation
Lo
Matting
Photoshop
Recoloring
Portrait Transfer
Time of Day
Style Transfer
0.94
0.95
0.95
0.96
0.96
0.97
0.98
0.98
0.98
0.98
0.95
0.88
0.97
0.98
1.00
0.97
0.98
0.93
0.96
0.89
0.97
0.98
0.97
0.96
0.90
0.96
0.98
0.98
0.94
0.98
0.96
0.92
0.97
0.98
0.97
0.98
0.97
0.96
0.96
1.00
1.00
0.99
1.00
0.99
1.00
0.98
0.98
0.93
0.98
0.98
0.93
0.99
0.98
0.94
0.98
0.98
0.97
0.99
1.00
0.98
all
0.96
0.96
0.97
0.97
0.97
Lo
Matting
Photoshop
Recoloring
Portrait Transfer
Time of Day
Style Transfer
SSIMv
-
/
features
0.2
0.2
0.4
0.1
1.1
0.92
0.97
0.98
1.00
0.98
0.98
0.96
0.97
0.98
0.94
1.1
1.8
1.1
0.98
0.98
0.95
0.97
0.99
1.00
0.98
0.97
0.99
0.98
0.97
0.97
0.98
Table A.1: Per-enhancement PSNR and SSIM for recipes with w = 32 and no input
compression. block overlap allows us to linearly interpolate between overlapping block and
avoids blocking artifacts. It roughly multiplies the data ratio by 4. multiscale refers to
the use of several pyramid levels to represent the high-pass of the luminance channel. This
allows us to capture subtle multiscale effects. non-linear refers to the use of a piecewiselinear tone curve in the representation of the high-pass luminance. This allows us to better
capture tonal variations.
50
%down
PSNR (dB)
SSIM
block overlap
--
multiscale
non-linear
-
-
-
Local Laplacian
Dehazing
Detail Manipulation
LO
Matting
Photoshop
Recoloring
Portrait Transfer
0.3
0.3
0.3
0.4
0.1
0.3
0.3
0.2
1.1
1.0
1.0
1.2
0.1
0.9
0.8
0.5
0.4
0.4
0.4
0.5
0.1
0.4
0.4
0.3
0.5
0.4
0.4
0.5
0.1
0.4
0.3
0.3
Time of Day
0.4
1.1
0.5
0.5
Style Transfer
0.1
0.3
0.2
0.2
all
0.3
0.8
0.4
0.4
Local Laplacian
32.5
33.2
32.2
Dehazing
35.2
35.5
35.0
Detail Manipulation
32.2
33.0
LO
31.3
Matting
Photoshop
Recoloring
32.7
38.6
46.4
Portrait Transfer
Time of Day
Style Transfer
-
-
-
features
1.5
1.2
1.3
1.6
0.2
1.1
1.1
0.9
0.6
0.5
0.5
0.6
0.1
0.5
0.4
0.4
1.8
1.4
1.7
2.1
0.9
1.5
1.4
1.1
1.4
1.5
0.6
1.8
0.7
0.7
0.3
1.1
1.1
1.1
0.4
1.5
34.3
32.8
35.2
33.7
37.0
35.7
35.2
36.0
35.4
36.1
34.7
33.5
35.4
34.4
36.0
36.8
32.1
33.7
32.6
34.5
33.6
34.9
35.8
33.4
39.3
47.4
33.1
39.8
46.3
34.2
40.8
46.5
33.7
40.5
47.3
35.0
41.5
47.5
34.3
42.0
46.4
35.2
42.6
47.3
35.9
38.4
30.4
37.0
39.0
31.1
38.0
38.0
30.8
37.9
38.3
34.0
39.4
38.6
31.5
38.9
38.9
34.8
39.9
38.0
34.2
41.1
38.6
34.9
all
35.4
36.1
36.2
36.8
36.9
37.6
37.5
38.5
Local Laplacian
Dehazing
Detail Manipulation
LO
Matting
Photoshop
Recoloring
Portrait Transfer
Time of Day
Style Transfer
0.92
0.96
0.94
0.86
0.96
0.97
0.99
0.97
0.97
0.92
0.93
0.96
0.95
0.88
0.97
0.98
1.00
0.97
0.97
0.93
0.89
0.95
0.97
0.92
0.96
0.98
0.99
0.98
0.96
0.92
0.94
0.96
0.95
0.89
0.96
0.98
0.99
0.98
0.97
0.95
0.90
0.95
0.97
0.93
0.97
0.98
0.99
0.98
0.97
0.93
0.95
0.96
0.96
0.90
0.97
0.98
1.00
0.98
0.97
0.96
0.90
0.95
0.97
0.93
0.95
0.98
0.99
0.98
0.96
0.95
0.97
0.97
0.98
0.94
0.96
0.99
0.99
0.99
0.97
0.96
all
0.95
0.95
0.95
0.96
0.96
0.96
0.96
0.97
1.5
1.2
1.4
1.6
0.2
1.2
1.1
0.8
Table A.2: Per-enhancement PSNR and SSIM for recipes with w = 32 and an input
compressed to %,p = 1.0. block overlap allows us to linearly interpolate between overlapping
block and avoids blocking artifacts. It roughly multiplies the data ratio by 4. multiscale
refers to the use of several pyramid levels to represent the high-pass of the luminance channel.
This allows us to capture subtle multiscale effects. non-linear refers to the use of a piecewiselinear tone curve in the representation of the high-pass luminance. This allows us to better
capture tonal variations.
51
%down
PSNR (dB)
SSIM
block overlap
multiscale
non-linear
-
-
-
features
-
--
'
0.1
0.1
0.1
0.1
0.1
0.1
0.4
0.3
0.4
/
-/.
-/
0.4
0.3
0.4
0.2
0.1
0.1
0.5
0.4
0.5
0.5
0.5
0.2
0.6
0.1
0.3
0.3
0.3
0.1
0.3
0.3
0.3
0.0
0.1
0.1
0.1
0.4
0.4
0.4
0.4
0.2
0.5
0.5
0.2
0.6
0.1
0.2
0.2
0.1
0.3
0.1
0.3
0.3
0.1
0.5
29.4
32.4
32.4
31.2
30.8
36.6
43.3
34.1
33.6
28.8
32.4
34.0
30.6
29.9
31.9
38.4
43.4
34.4
36.8
32.0
29.9
32.9
32.9
31.7
31.5
37.7
44.6
35.4
34.2
29.5
33.2
34.6
31.4
30.5
32.8
39.0
44.8
35.5
37.5
32.9
30.4
32.6
33.2
31.9
32.1
38.8
43.3
35.3
35.1
32.1
34.6
34.3
33.9
32.5
32.9
40.0
44.6
36.7
35.8
32.8
33.9
33.2
34.4
34.0
35.2
34.5
35.8
0.91
0.94
0.92
0.92
0.95
0.93
0.83
0.91
0.95
0.93
0.95
0.93
0.84
0.92
0.96
0.94
0.95
0.94
0.85
0.92
0.96
0.96
0.95
0.97
Lo
Matting
Photoshop
Recoloring
Portrait Transfer
Time of Day
Style Transfer
0.81
0.94
0.96
0.99
0.95
0.96
0.89
0.81
0.95
0.96
0.99
0.96
0.97
0.90
0.88
0.94
0.96
0.99
0.96
0.92
0.89
0.83
0.94
0.97
0.99
0.96
0.96
0.94
0.89
0.94
0.97
0.99
0.97
0.93
0.90
0.84
0.94
0.97
0.99
0.97
0.96
0.95
0.89
0.94
0.97
0.99
0.96
0.94
0.93
0.90
0.94
0.98
0.99
0.97
0.95
0.94
all
0.93
0.93
0.92
0.94
0.93
0.95
0.94
0.96
Local Laplacian
Dehazing
Detail Manipulation
0.1
0.1
0.1
0.3
0.3
0.3
LO
0.1
0.3
0.1
0.1
Matting
Photoshop
Recoloring
Portrait Transfer
0.0
0.1
0.1
0.1
0.0
0.2
0.2
0.2
0.0
0.1
0.1
0.1
0.0
0.1
0.1
0.1
Time of Day
0.1
0.3
0.2
Style Transfer
0.0
0.1
0.1
all
0.1
0.2
0.1
Local Laplacian
Dehazing
Detail Manipulation
LO
Matting
Photoshop
Recoloring
Portrait Transfer
Time of Day
Style Transfer
30.7
33.6
29.5
29.1
30.5
36.3
43.4
32.7
37.3
28.6
31.4
34.1
30.2
29.6
31.3
37.0
44.7
33.8
37.9
29.3
all
33.2
Local Laplacian
Dehazing
Detail Manipulation
Table A.3: Per-enhancement PSNR and SSIM for recipes with w = 64 and an input
compressed to %,p = 0.4. block overlap allows us to linearly interpolate between overlapping
block and avoids blocking artifacts. It roughly multiplies the data ratio by 4. multiscale
refers to the use of several pyramid levels to represent the high-pass of the luminance channel.
This allows us to capture subtle multiscale effects. non-linear refers to the use of a piecewiselinear tone curve in the representation of the high-pass luminance. This allows us to better
capture tonal variations.
52
.....
.....
.....
.....
55
50
Local Laplacian
Dehazing
Detail Manipulation
Lo
Matting
.....
.....
....
.....
Photoshop
Recoloring
Portrait Transfer
Time of Day
Style Transfer
45
i:cl
~
~
z
00.
40
~
35
30
25--~...._~~~_._~~~~~~~~---~~~~~~~~~~~~~~~~~
16
32
64
128
block size w
256
Figure A-1: The block size w controls the quality-compacity trade-off of transform recipes.
Small blocks yield a fine-grained description of the transformation but require more storage
space whereas large blocks form a coarser and more concise description. With larger blocks,
the recipe lose some spatial accuracy. In practice, we recommend values between 32 and
128 for w. These results are for an uncompressed input.
53
....
....
....
50
45
Local Laplacian
Dehazing
Detail Manipulation
Lo
....
....
Photoshop
Recoloring
Portrait Transfer
Time of Day
.......~~~~~--~~~~~~--~~~~~~
30---~~~~~--~~~~~~
0
2
6
4
8
10
number of luminance bins
Figure A-2: The number of linear segments p in the piecewise-linear tone curve has a
different impact depending on the application. For instance, Recoloring does not transform
the luminance channel of the input and therefore does not need more precision here. We
setteled with p = 6 in all our experiments.
54
Bibliography
[1] Mathieu Aubry, Sylvain Paris, Samuel W. Hasinoff, Jan Kautz, and Fr6do Durand. Fast local laplacian filters: Theory and applications. ACM Trans. Graph.,
33(5):167:1-167:14, September 2014.
12] Kenneth C. Barr and Krste Asanovi6. Energy-aware lossless data compression.
ACM Trans. Comput. Syst., 24(3):250-291, August 2006.
[3] Peter J Burt and Edward H Adelson. The laplacian pyramid as a compact image
code. IEEE Transactions on Communications, 31(4):532-540, 1983.
[4] Vladimir Bychkovsky, Sylvain Paris, Eric Chan, and Fredo Durand. Learning
photographic global tonal adjustment with a database of input / output image
pairs. In IEEE Conference on Computer Vision and Pattern Recognition, 2011.
15] Qifeng Chen, Dingzeyu Li, and Chi-Keung Tang. Knn matting. In IEEE Conference on Computer Vision and PatternRecognition, pages 869-876, June 2012.
[61 P. Deutsch. Deflate compressed data format specification version 1.3, 1996.
[7] Zeev Farbman, Raanan Fattal, Dani Lischinski, and Richard Szeliski. Edgepreserving decompositions for multi-scale tone and detail manipulation. In A CM
Transaction on Graphics (SIGGRAPH), SIGGRAPH '08, pages 67:1-67:10, New
York, NY, USA, 2008. ACM.
[81 William T. Freeman and Antonio Torralba. Shape recipes: Scene representations
that refer to the image. In Vision Sciences Society Annual Meeting, pages 25-47.
MIT Press, 2002.
19] Eric Hamilton. Jpeg file interchange format. C-Cube Microsystems, 1992.
[10] Junxian Huang, Feng Qian, Alexandre Gerber, Z. Morley Mao, Subhabrata Sen,
and Oliver Spatscheck. A close examination of performance and power characteristics of 4g lte networks, 2012.
[11] D.A. Huffman. A method for the construction of minimum-redundancy codes.
Proceedings of the IRE, 40(9):1098-1101, Sept 1952.
[12] Liad Kaufman, Dani Lischinski, and Michael Werman. Content-aware automatic
photo enhancement. Computer Graphics Forum, 31(8):2528-2540, 2012.
55
[131 Jin-Hwan Kim, Won-Dong Jang, Jae-Young Sim, and Chang-Su Kim. Optimized
contrast enhancement for real-time image and video dehazing. J. Vis. Comun.
Image Represent., 24(3):410-425, April 2013.
[14] Karthik Kumar, Jibang Liu, Yung-Hsiang Lu, and Bharat Bhargava. A survey
of computation offloading for mobile systems. Mob. Netw. Appl., 18(1):129-140,
February 2013.
[15] Pierre-Yves Laffont, Zhile Ren, Xiaofeng Tao, Chao Qian, and James Hays. Transient attributes for high-level understanding and editing of outdoor scenes. A CM
Transaction on Graphics (SIGGRAPH), 33(4):149:1-149:11, July 2014.
[161 Anat Levin, Dani Lischinski, and Yair Weiss. Colorization using optimization.
ACM Transaction on Graphics (SIGGRAPH), 23(3):689-694, August 2004.
[17] Sylvain Paris, Samuel W. Hasinoff, and Jan Kautz. Local laplacian filters:
Edge-aware image processing with a laplacian pyramid. In ACM Transaction
on Graphics (SIGGRAPH), SIGGRAPH '11, pages 68:1-68:12, New York, NY,
USA, 2011. ACM.
[18] Majid Rabbani and Paul W. Jones. Digital Image Compression Techniques.
Society of Photo-Optical Instrumentation Engineers (SPIE), Bellingham, WA,
USA, 1st edition, 1991.
[19] Jonathan Ragan-Kelley, Andrew Adams, Sylvain Paris, Marc Levoy, Saman
Amarasinghe, and Fredo Durand. Decoupling algorithms from schedules for easy
optimization of image processing pipelines. ACM Transactions on Graphics,
31(4):32:1-32:12, July 2012.
[20] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frodo
Durand, and Saman Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In
Proceedings of the 34th ACM SIGPLAN Conference on Programming Language
Design and Implementation, PLDI, pages 519-530, New York, NY, USA, 2013.
ACM.
[21] C. Rhemann, C. Rother, Jue Wang, M. Gelautz, P. Kohli, and P. Rott. A perceptually motivated online benchmark for image matting. In IEEE Conference
on Computer Vision and Pattern Recognition, pages 1826-1833, June 2009.
[22] YiChang Shih, Sylvain Paris, Connelly Barnes, William T. Freeman, and Fredo
Durand. Style transfer for headshot portraits. ACM Transaction on Graphics
(SIGGRAPH), 33(4):148:1-148:14, July 2014.
[23] Yichang Shih, Sylvain Paris, Fredo Durand, and William T. Freeman. Datadriven hallucination of different times of day from a single outdoor photo. ACM
Transaction on Graphics (SIGGRAPH), 32(6):200:1-200:11, November 2013.
56
The
jpeg 2000 still image compression standard. IEEE Signal Processing Magazine,
18(5):36-58, 2001.
124] Athanassios Skodras, Charilaos Christopoulos, and Touradj Ebrahimi.
[251 Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of
the Royal Statistical Society, Series B, 58:267-288, 1994.
126] A. Torralba and W.T. Freeman. Properties and applications of shape recipes. In
IEEE Conference on Computer Vision and PatternRecognition, volume 2, pages
11-383-90 vol.2, June 2003.
1271 G.K. Wallace. The jpeg still picture compression standard. IEEE Transactions
on Consumer Electronics, 38(1):xviii-xxxiv, Feb 1992.
1281 T.A. Welch. A technique for high-performance data compression. Computer,
17(6):8-19, June 1984.
129] Ian H Witten, Radford M Neal, and John G Cleary. Arithmetic coding for data
compression. Communications of the ACM, 30(6):520-540, 1987.
[30] Li Xu, Cewu Lu, Yi Xu, and Jiaya Jia. Image smoothing via 10 gradient minimization. ACM Transaction on Graphics (SIGGRAPH), 30(6):174:1-174:12,
December 2011.
[31] J. Ziv and A. Lempel. A universal algorithm for sequential data compression.
IEEE Transactions on Information Theory, 23(3):337-343, May 1977.
57
Download