PULSE: Self-Supervised Photo Upsampling via
Latent Space Exploration of Generative Models
arXiv:2003.03808v3 [cs.CV] 20 Jul 2020
Sachit Menon*, Alexandru Damian*, Shijia Hu, Nikhil Ravi, Cynthia Rudin
Duke University
Durham, NC
{sachit.menon,alexandru.damian,shijia.hu,nikhil.ravi,cynthia.rudin}@duke.edu
* denotes equal contribution

Abstract
The primary aim of single-image super-resolution is to
construct a high-resolution (HR) image from a corresponding low-resolution (LR) input. In previous approaches,
which have generally been supervised, the training objective typically measures a pixel-wise average distance between the super-resolved (SR) and HR images. Optimizing such metrics often leads to blurring, especially in high
variance (detailed) regions. We propose an alternative formulation of the super-resolution problem based on creating
realistic SR images that downscale correctly. We present
a novel super-resolution algorithm addressing this problem, PULSE (Photo Upsampling via Latent Space Exploration), which generates high-resolution, realistic images
at resolutions previously unseen in the literature. It accomplishes this in an entirely self-supervised fashion and is
not confined to a specific degradation operator used during
training, unlike previous methods (which require training
on databases of LR-HR image pairs for supervised learning). Instead of starting with the LR image and slowly
adding detail, PULSE traverses the high-resolution natural
image manifold, searching for images that downscale to the
original LR image. This is formalized through the “downscaling loss,” which guides exploration through the latent
space of a generative model. By leveraging properties of
high-dimensional Gaussians, we restrict the search space
to guarantee that our outputs are realistic. PULSE thereby
generates super-resolved images that both are realistic and
downscale correctly. We show extensive experimental results demonstrating the efficacy of our approach in the domain of face super-resolution (also known as face hallucination). We also present a discussion of the limitations and
biases of the method as currently implemented with an accompanying model card with relevant metrics. Our method
outperforms state-of-the-art methods in perceptual quality
at higher resolutions and scale factors than previously possible.
1. Introduction
Figure 1. (x32) The input (top) gets upsampled to the SR image
(middle) which downscales (bottom) to the original image.
In this work, we aim to transform blurry, low-resolution images into sharp, realistic, high-resolution images. Here, we
focus on images of faces, but our technique is generally applicable. In many areas (such as medicine, astronomy, microscopy, and satellite imagery), sharp, high-resolution images are difficult to obtain due to issues of cost, hardware
restriction, or memory limitations [24]. This leads to the
capture of blurry, low-resolution images instead. In other
cases, images could be old and therefore blurry, or even in
a modern context, an image could be out of focus or a person could be in the background. In addition to being visually unappealing, this impairs the use of downstream analysis methods (such as image segmentation, action recognition, or disease diagnosis) which depend on having highresolution images [20] [23]. In addition, as consumer laptop, phone, and television screen resolution has increased
over recent years, popular demand for sharp images and
video has surged. This has motivated recent interest in the
computer vision task of image super-resolution, the creation
of realistic high-resolution (henceforth HR) images that a
given low-resolution (LR) input image could correspond to.
While the benefits of methods for image super-resolution
are clear, the difference in information content between HR
and LR images (especially at high scale factors) hampers
efforts to develop such techniques. In particular, LR images
inherently possess less high-variance information; details
can be blurred to the point of being visually indistinguishable. The problem of recovering the true HR image depicted
by an LR input, as opposed to generating a set of potential
such HR images, is inherently ill-posed, as the size of the
total set of these images grows exponentially with the scale
factor [3]. That is to say, many high-resolution images can
correspond to the exact same low-resolution image.
Traditional supervised super-resolution algorithms train
a model (usually, a convolutional neural network, or CNN)
to minimize the pixel-wise mean-squared error (MSE) between the generated super-resolved (SR) images and the
corresponding ground-truth HR images [15, 8]. However,
this approach has been noted to neglect perceptually relevant details critical to photorealism in HR images, such as
texture [16]. Optimizing on an average difference in pixel space between HR and SR images has a blurring effect, encouraging detailed areas of the SR image to be smoothed
out to be, on average, more (pixelwise) correct. In fact, in
the case of mean squared error (MSE), the ideal solution is
the (weighted) pixel-wise average of the set of realistic images that downscale properly to the LR input (as detailed
later). The inevitable result is smoothing in areas of high
variance, such as areas of the image with intricate patterns
or textures. As a result, MSE should not be used alone as a
measure of image quality for super-resolution.
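As a toy illustration of this averaging effect (a synthetic one-dimensional example, not from the paper): if two equally likely sharp signals downscale to the same LR observation, the expected-MSE-optimal output is their pixel-wise mean, which is flatter than either plausible solution.

```python
import numpy as np

# Two equally plausible "sharp" HR signals that average-pool to the same LR signal.
hr_a = np.array([0.0, 1.0, 0.0, 1.0])
hr_b = np.array([1.0, 0.0, 1.0, 0.0])
downscale = lambda x: x.reshape(-1, 2).mean(axis=1)
assert np.allclose(downscale(hr_a), downscale(hr_b))  # identical LR observations

# Expected MSE over the two possibilities is minimized by their mean,
# a flat, detail-free signal, rather than by either sharp candidate.
for name, guess in {"hr_a": hr_a, "hr_b": hr_b, "mean": (hr_a + hr_b) / 2}.items():
    expected_mse = 0.5 * np.mean((guess - hr_a) ** 2) + 0.5 * np.mean((guess - hr_b) ** 2)
    print(f"{name}: expected MSE = {expected_mse:.3f}")
```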
Some researchers have attempted to extend these MSE-based methods to additionally optimize on metrics intended to encourage realism, serving as a force opposing the smoothing pull of the MSE term [16, 8]. This essentially drags the MSE-based solution in the direction of the natural image manifold (the subset of R^{M×N} that represents the
set of high-resolution images). This compromise, while improving perceptual quality over pure MSE-based solutions,
makes no guarantee that the generated images are realistic.
Images generated with these techniques still show signs of blurring in high variance areas of the images, just as in the pure MSE-based solutions.

Figure 2. FSRNet tends towards an average of the images that downscale properly. The discriminator loss in FSRGAN pulls it in the direction of the natural image manifold, whereas PULSE always moves along this manifold.
To avoid these issues, we propose a new paradigm for
super-resolution. The goal should be to generate realistic
images within the set of feasible solutions; that is, to find
points which actually lie on the natural image manifold and
also downscale correctly. The (weighted) pixel-wise average of possible solutions yielded by the MSE does not generally meet this goal for the reasons previously described.
We provide an illustration of this in Figure 2.
Our method generates images using a (pretrained) generative model approximating the distribution of natural images under consideration. For a given input LR image, we
traverse the manifold, parameterized by the latent space of
the generative model, to find regions that downscale correctly. In doing so, we find examples of realistic images
that downscale properly, as shown in Figure 1.
Such an approach also eschews the need for supervised
training, being entirely self-supervised with no ‘training’
needed at the time of super-resolution inference (except
for the unsupervised generative model). This framework
presents multiple substantial benefits. First, it allows the
same network to be used on images with differing degradation operators even in the absence of a database of corresponding LR-HR pairs (as no training on such databases
takes place). Furthermore, unlike previous methods, it does
not require super-resolution task-specific network architectures, which take substantial time on the part of the researcher to develop without providing real insight into the
problem; instead, it proceeds alongside the state-of-the-art
in generative modeling, with zero retraining needed.
Our approach works with any type of generative model
with a differentiable generator, including flow-based models, variational autoencoders (VAEs), and generative adversarial networks (GANs); the particular choice is dictated by
the tradeoffs each make in approximating the data manifold. For this work, we elected to use GANs due to recent
advances yielding high-resolution, sharp images [13, 12].
One particular subdomain of image super-resolution
deals with the case of face images. This subdomain – known
as face hallucination – finds application in consumer photography, photo/video restoration, and more [28]. As such,
it has attracted interest as a computer vision task in its own
right. Our work focuses on face hallucination, but our methods extend to a more general context.
Because our method always yields a solution that both
lies on the natural image manifold and downsamples correctly to the original low-resolution image, we can provide
a range of interesting high-resolution possibilities e.g. by
making use of the stochasticity inherent in many generative
models: our technique can create a set of images, each of which is visually convincing yet looks different from the others, where (without ground truth) any of the images could plausibly have been the source of the low-resolution input.

Figure 3. We show here how visually distinct images, created with PULSE, can all downscale (represented by the arrows) to the same LR image.
Our main contributions are as follows.
1. A new paradigm for image super-resolution. Previous efforts take the traditional, ill-posed perspective of
attempting to ‘reconstruct’ an HR image from an LR
input, yielding outputs that, in effect, average many
possible solutions. This averaging introduces undesirable blurring. We introduce a new approach to super-resolution: a super-resolution algorithm should create
realistic high-resolution outputs that downscale to the
correct LR input.
2. A novel method for solving the super-resolution
task. In line with our new perspective, we propose
a new algorithm for super-resolution. Whereas traditional work has at its core aimed to approximate the LR → HR map using supervised learning (especially with neural networks), our approach centers on the use of unsupervised generative models of HR data. Using generative adversarial networks, we explore the latent space to find regions that map to realistic images and downscale correctly. No retraining is required. Our particular implementation, using StyleGAN [13], allows for the creation of any number of realistic SR samples that correctly map to the LR input.
3. An original method for latent space search under
high-dimensional Gaussian priors. In our task and
many others, it is often desirable to find points in a generative model’s latent space that map to realistic outputs. Intuitively, these should resemble samples seen
during training. At first, it may seem that traditional
log-likelihood regularization by the latent prior would
accomplish this, but we observe that the ‘soap bubble’
effect (that much of the density of a high dimensional
Gaussian lies close to the surface of a hypersphere)
contradicts this. Traditional log-likelihood regularization actually tends to draw latent vectors away from
this hypersphere and, instead, towards the origin. We
therefore constrain the search space to the surface of
that hypersphere, which ensures realistic outputs in
higher-dimensional latent spaces; such spaces are otherwise difficult to search.
2. Related Work
While there is much work on image super-resolution
prior to the advent of convolutional neural networks
(CNNs), CNN-based approaches have rapidly become
state-of-the-art in the area and are closely relevant to our
work; we therefore focus on neural network-based approaches here. Generally, these methods use a pipeline
where a low-resolution (LR) image, created by downsampling a high-resolution (HR) image, is fed through a
CNN with both convolutional and upsampling layers, generating a super-resolved (SR) output. This output is then
used to calculate the loss using the chosen loss function and
the original HR image.
2.1. Current Trends
Recently, supervised neural networks have come to dominate current work in super-resolution. Dong et al. [9]
proposed the first CNN architecture to learn this non-linear
LR to HR mapping using pairs of HR-LR images. Several groups have attempted to improve the upsampling step
by utilizing sub-pixel convolutions and transposed convolutions [22]. Furthermore, the application of ResNet architectures to super-resolution (started by SRResNet [16]) has
yielded substantial improvement over more traditional con-
volutional neural network architectures. In particular, the
use of residual structures allowed for the training of larger
networks. Currently, there exist two general trends: one,
towards networks that primarily better optimize pixel-wise
average distance between SR and HR, and two, networks
that focus on perceptual quality.
2.2. Loss Functions
Towards these different goals, researchers have designed
different loss functions for optimization that yield images
closer to the desired objective. Traditionally, the loss function for the image super-resolution task has operated on a
per-pixel basis, usually using the L2 norm of the difference between the ground truth and the reconstructed image,
as this directly optimizes PSNR (the traditional metric for
the super-resolution task). More recently, some researchers
have started to use the L1 norm since models trained using
L1 loss seem to perform better in PSNR evaluation. The
L2 norm (as well as pixel-wise average distances in general) between SR and HR images has been heavily criticized for not correlating well with human-observed image
quality [16]. In face super-resolution, the state-of-the-art
for such metrics is FSRNet [8], which used a facial prior to
achieve previously unseen PSNR.
Perceptual quality, however, does not necessarily increase with higher PSNR. As such, different methods, and
in particular, objective functions, have been developed to increase perceptual quality. In particular, methods that yield
high PSNR result in blurring of details. The information required for details is often not present in the LR image and
must be ‘imagined’ in. One approach to avoiding the direct
use of the standard loss functions was demonstrated in [25],
which draws a prior from the structure of a convolutional
network. This method produces similar images to the methods that focus on PSNR, which lack detail, especially in
high frequency areas. Because this method cannot leverage
learned information about what realistic images look like, it
is unable to fill in missing details. Methods that try to learn
a map from LR to HR images can try to leverage learned information; however, as mentioned, networks optimized on
PSNR are still explicitly penalized for attempting to hallucinate details they are unsure about; thus, optimizing on PSNR still results in blurring and lack of detail.
To resolve this issue, some have tried to use generative
model-based loss terms to provide these details. Neural
networks have lent themselves to application in generative
models of various types (especially generative adversarial
networks–GANs–from [10]), to image reconstruction tasks
in general, and more recently, to super-resolution. Ledig et
al. [16] created the SRGAN architecture for single-image
upsampling by leveraging these advances in deep generative models, specifically GANs. Their general methodology
was to use the generator to upscale the low-resolution input
image, which the discriminator then attempts to distinguish
from real HR images, then propagate the loss back to both
networks. Essentially, this optimizes a supervised network
much like MSE-based methods with an additional loss term
corresponding to how fake the discriminator believes the
generated images to be. However, this approach is fundamentally limited as it essentially results in an averaging of
the MSE-based solution and a GAN-based solution, as we
discuss later. In the context of faces, this technique has been
incorporated into FSRGAN, resulting in the current perceptual state-of-the-art in face super-resolution at ×8 upscaling factors up to resolutions of 128 × 128. Although these
methods use a ‘generator’ and a ‘discriminator’ as found in
GANs, they are trained in a completely supervised fashion;
they do not use unsupervised generative models.
2.3. Generative Networks
Our algorithm does not simply use GAN-style training;
rather, it uses a truly unsupervised GAN (or, generative
model more broadly). It searches the latent space of this
generative model for latents that map to images that downscale correctly. The quality of cutting-edge generative models is therefore of interest to us.
As GANs have produced the highest-quality high-resolution images of deep generative models to date, we chose to focus on these for our implementation. Here we provide a brief review of relevant GAN methods with high-resolution outputs. Karras et al. [12] presented some of
the first high-resolution outputs of deep generative models
in their ProGAN algorithm, which grows both the generator
and the discriminator in a progressive fashion. Karras et al.
[13] further built upon this idea with StyleGAN, aiming to
allow for more control in the image synthesis process relative to the black-box methods that came before it. The input
latent code is embedded into an intermediate latent space,
which then controls the behavior of the synthesis network
with adaptive instance normalization applied at each convolutional layer. This network has 18 layers (2 for each
resolution from 4 × 4 to 1024 × 1024). After every other
layer, the resolution is progressively increased by a factor of
2. At each layer, new details are introduced stochastically
via Gaussian input to the adaptive instance normalization
layers. Without perturbing the discriminator or loss functions, this architecture leads to the option for scale-specific
mixing and control over the expression of various high-level
attributes and variations in the image (e.g. pose, hair, freckles, etc.). Thus, StyleGAN provides a very rich latent space
for expressing different features, especially in relation to
faces.
3. Method
We begin by defining some universal terminology necessary to any formal description of the super-resolution problem. We denote the low-resolution input image by I_LR. We aim to learn a conditional generating function G that, when applied to I_LR, yields a higher-resolution super-resolved image I_SR. Formally, let I_LR ∈ R^{m×n}. Then our desired function SR is a map R^{m×n} → R^{M×N} where M > m, N > n. We define the super-resolved image I_SR ∈ R^{M×N} as

I_SR := SR(I_LR).    (1)

In a traditional approach to super-resolution, one considers that the low-resolution image could represent the same information as a theoretical high-resolution image I_HR ∈ R^{M×N}. The goal is then to best recover this particular I_HR given I_LR. Such approaches therefore reduce the problem to an optimization task: fit a function SR that minimizes

L := ‖I_HR − I_SR‖_p^p,    (2)

where ‖·‖_p denotes some l_p norm.
In practice, even when trained correctly, these algorithms fail to enhance detail in high variance areas. To see why this is, fix a low resolution image I_LR. Let M be the natural image manifold in R^{M×N}, i.e., the subset of R^{M×N} that resembles natural realistic images, and let P be a probability distribution over M describing the likelihood of an image appearing in our dataset. Finally, let R be the set of images that downscale correctly, i.e., R = {I ∈ R^{N×M} : DS(I) = I_LR}. Then in the limit as the size of our dataset tends to infinity, our expected loss when the algorithm outputs a fixed image I_SR is

∫_{M∩R} ‖I_HR − I_SR‖_p^p dP(I_HR).    (3)

This is minimized when I_SR is an l_p average of I_HR over M ∩ R. In fact, when p = 2, this is minimized when

I_SR = ∫_{M∩R} I_HR dP(I_HR),    (4)

so the optimal I_SR is a weighted pixelwise average of the set of high resolution images that downscale properly. As a result, the lack of detail in algorithms that rely only on an l_p norm cannot be fixed simply by changing the architecture of the network. The problem itself has to be rephrased.
We therefore propose a new framework for single image super-resolution. Let M, DS be defined as above. Then for a given LR image I_LR ∈ R^{m×n} and ε > 0, our goal is to find an image I_SR ∈ M with

‖DS(I_SR) − I_LR‖_p ≤ ε.    (5)

In particular, we can let R_ε ⊂ R^{N×M} be the set of images that downscale properly, i.e.,

R_ε = {I ∈ R^{N×M} : ‖DS(I) − I_LR‖_p^p ≤ ε}.    (6)

Figure 4. While traveling from z_init to z_final in the latent space L, we travel from I_init ∈ M to I_final ∈ M ∩ R_ε.

Then we are seeking an image I_SR ∈ M ∩ R_ε. The set M ∩ R_ε is the set of feasible solutions, because a solution is not feasible unless it both downscales properly and looks realistic. It is also interesting to note that the intersections M ∩ R_ε, and in particular M ∩ R_0, are guaranteed to be nonempty, because they must contain the original HR image (i.e., what traditional methods aim to reconstruct).
3.1. Downscaling Loss

Central to the problem of super-resolution, unlike general image generation, is the notion of correctness. Traditionally, this has been interpreted to mean how well a particular ground truth image I_HR is ‘recovered’ by the application of the super-resolution algorithm SR to the low-resolution input I_LR, as discussed in the related work section above. This is generally measured by some l_p norm between I_SR and the ground truth I_HR; the outputs of such algorithms only look somewhat like real images because minimizing this metric drives the solution somewhat nearer to the manifold. However, such methods have no way to ensure that I_SR lies close to M. In contrast, in our framework, we never deviate from M, so such a metric is not necessary. For us, the critical notion of correctness is how well the generated SR image I_SR corresponds to I_LR.

We formalize this through the downscaling loss, which explicitly penalizes a proposed SR image for deviating from its LR input (similar loss terms have been proposed in [2], [25]). This is inspired by the following observation: for a proposed SR image to represent the same information as a given LR image, it must downscale to this LR image. That is,

I_LR ≈ DS(I_SR) = DS(SR(I_LR)),    (7)

where DS(·) represents the downscaling function. Our downscaling loss therefore penalizes SR the more its outputs violate this:

L_DS(I_SR, I_LR) := ‖DS(I_SR) − I_LR‖_p^p.    (8)
It is important to note that the downscaling loss can be
used in both supervised and unsupervised models for super-resolution; it does not depend on an HR reference image.
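As an illustration, a minimal PyTorch sketch of this loss, assuming bicubic downscaling as DS and a mean rather than a summed norm (both choices are ours for readability, not necessarily those of the released implementation):

```python
import torch
import torch.nn.functional as F

def downscaling_loss(sr_image: torch.Tensor, lr_image: torch.Tensor, p: int = 2) -> torch.Tensor:
    """L_DS = ||DS(I_SR) - I_LR||_p^p with bicubic interpolation standing in for DS.

    sr_image: (B, C, M, N) proposed super-resolved image.
    lr_image: (B, C, m, n) low-resolution input.
    """
    ds = F.interpolate(sr_image, size=lr_image.shape[-2:],
                       mode="bicubic", align_corners=False)
    return (ds - lr_image).abs().pow(p).mean()
```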
3.2. Latent Space Exploration
How might we find regions of the natural image manifold
M that map to the correct LR image under the downscaling
operator? If we had a differentiable parameterization of the
manifold, we could progress along the manifold to these
regions by using the downscaling loss to guide our search.
In that case, images found would be guaranteed to be high
resolution as they came from the HR image manifold, while
also being correct as they would downscale to the LR input.
In reality, we do not have such convenient, perfect parameterizations of manifolds. However, we can approximate such a parameterization by using techniques from unsupervised learning. In particular, much of the field of deep
generative modeling (e.g. VAEs, flow-based models, and
GANs) is concerned with creating models that map from
some latent space to a given manifold of interest. By leveraging advances in generative modeling, we can even use
pretrained models without the need to train our own network. Some prior work has aimed to find vectors in the
latent space of a generative model to accomplish a task;
see [2] for creating embeddings and [6] in the context of
compressed sensing. (However, as we describe later, this
work does not actually search in a way that yields realistic
outputs as intended.) In this work, we focus on GANs, as
recent work in this area has resulted in the highest quality
image-generation among unsupervised models.
Regardless of its architecture, let the generator be called
G, and let the latent space be L. Ideally, we could approximate M by the image of G, which would allow us to
rephrase the problem above as the following: find a latent
vector z ∈ L with
‖DS(G(z)) − I_LR‖_p^p ≤ ε.    (9)
Unfortunately, in most generative models, simply requiring
that z ∈ L does not guarantee that G(z) ∈ M; rather,
such methods use an imposed prior on L. In order to ensure G(z) ∈ M, we must be in a region of L with high
probability under the chosen prior. One idea to encourage
the latent to be in the region of high probability is to add a
loss term for the negative log-likelihood of the prior. In the
case of a Gaussian prior, this takes the form of l2 regularization. Indeed, this is how the previously mentioned work
[6] attempts to address this issue. However, this idea does
not actually accomplish the goal. Such a penalty forces vectors towards 0, but most of the mass of a high-dimensional
Gaussian is located near the surface of a sphere of radius √d (see [26]). To get around this, we observed that we could replace the Gaussian prior on R^d with a uniform prior on √d · S^{d−1}. This approximation can be used for any method with high-dimensional spherical Gaussian priors.
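The concentration claim is easy to check numerically; a short, illustrative script (not from the paper) shows that standard Gaussian samples in d = 512 dimensions have norms tightly clustered around √d ≈ 22.6, far from the origin that a log-likelihood penalty pulls toward:

```python
import numpy as np

d, n = 512, 10_000
z = np.random.randn(n, d)                  # samples from a standard Gaussian in R^d
norms = np.linalg.norm(z, axis=1)
print(f"sqrt(d)      = {np.sqrt(d):.2f}")
print(f"mean ||z||   = {norms.mean():.2f}")
print(f"std of ||z|| = {norms.std():.2f}")  # O(1) spread, independent of d
```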
We can let L′ = √d · S^{d−1} (where S^{d−1} ⊂ R^d is the unit sphere in d-dimensional Euclidean space) and reduce the problem above to finding a z ∈ L′ that satisfies Equation (9). This reduces the problem from gradient descent in the entire latent space to projected gradient descent on a sphere.
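A minimal sketch of this projected gradient descent on the sphere, written against the downscaling loss above (the generator G, the plain SGD optimizer, and the hyperparameters are illustrative assumptions, not the released PULSE code, which additionally searches StyleGAN's extended latent space with a geodesic cross loss; see the Appendix):

```python
import torch

def pulse_search(G, lr_image, d=512, steps=100, lr=0.4):
    """Search sqrt(d) * S^{d-1} for a latent whose generated image downscales to lr_image."""
    z = torch.randn(1, d, device=lr_image.device)
    z = z * (d ** 0.5 / z.norm(dim=-1, keepdim=True))   # start at a random point on the sphere
    z.requires_grad_(True)
    opt = torch.optim.SGD([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = downscaling_loss(G(z), lr_image)          # guide the search with L_DS
        loss.backward()
        opt.step()
        with torch.no_grad():                            # project back onto sqrt(d) * S^{d-1}
            z.mul_(d ** 0.5 / z.norm(dim=-1, keepdim=True))
    return G(z).detach()
```

The 100 steps and learning rate of 0.4 used above mirror the settings reported in Section 4.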
4. Experiments
We designed various experiments to assess our method.
We focus on the popular problem of face hallucination, enhanced by recent advances in GANs applied to face generation. In particular, we use Karras et al.’s pretrained
Face StyleGAN (trained on the Flickr Face HQ Dataset, or
FFHQ) [13]. We adapted the implementation found at [27]
in order to transfer the original StyleGAN-FFHQ weights
and model from TensorFlow [1] to PyTorch [19]. For each
experiment, we used 100 steps of spherical gradient descent
with a learning rate of 0.4 starting with a random initialization. Each image was therefore generated in ∼5 seconds on
a single NVIDIA V100 GPU.
4.1. Data
We evaluated our procedure on the well-known high-resolution face dataset CelebA HQ. (Note: this is not to
be confused with CelebA, which is of substantially lower
resolution.) We performed these experiments using scale
factors of 64×, 32×, and 8×. For our qualitative comparisons, we upscale at scale factors of both 8× and 64×,
i.e., from 16 × 16 to 128 × 128 resolution images and
1024 × 1024 resolution images. The state-of-the-art for
face super-resolution in the literature prior to this point was
limited to a maximum of 8× upscaling to a resolution of
128×128, thus making it impossible to directly make quantitative comparisons at high resolutions and scale factors.
We followed the traditional approach of training the supervised methods on CelebA HQ. We tried comparing with supervised methods trained on FFHQ, but they failed to generalize and yielded very blurry and distorted results when
evaluated on CelebA HQ; therefore, in order to compare our
method with the best existing methods, we elected to train
the supervised models on CelebA HQ instead of FFHQ.
4.2. Qualitative Image Results
Figure 5 shows qualitative results to demonstrate the visual quality of the images from our method. We observe
levels of detail that far surpass competing methods, as exemplified by certain high frequency regions (features like
eyes or lips). More examples and full-resolution images are
in the appendix.
Figure 5. Comparison of PULSE with bicubic upscaling, FSRNet, and FSRGAN. In the first image, PULSE adds a messy patch in the hair
to match the two dark diagonal pixels visible in the middle of the zoomed in LR image.
4.3. Quantitative Comparison

Here we present a quantitative comparison with state-of-the-art face super-resolution methods. Due to constraints on the peak resolution that previous methods can handle, evaluation methods were limited, as detailed below.

We conducted a mean-opinion-score (MOS) test as is common in the perceptual super-resolution literature [16, 14]. For this, we had 40 raters examine images from six sources: the original HR images and upscalings by nearest-neighbors, bicubic interpolation, FSRNet, FSRGAN, and our PULSE. For this comparison, we used a scale factor of 8 and a maximum resolution of 128 × 128, despite our method's ability to go substantially higher, due to this being the maximum limit for the competing methods. After being exposed to 20 examples of a 1 (worst) rating exemplified by nearest-neighbors upsampling, and a 5 (best) rating exemplified by high-quality HR images, raters provided a score from 1-5 for each of the 240 images. All images fell within the appropriate ε = 1e−3 for the downscaling loss. The results are displayed in Table 1.

Method    MOS
HR        3.74
Nearest   1.01
Bicubic   1.34
FSRNet    2.77
FSRGAN    2.92
PULSE     3.60

Table 1. MOS Score for various algorithms at 128 × 128. Higher is better.

PULSE outperformed the other methods and its score approached that of the HR dataset. Note that the HR's 3.74 average image quality reflects the fact that some of the HR images in the dataset had noticeable artifacts. All pairwise differences were highly statistically significant (p < 10^−5 for all 15 comparisons) by the Mann-Whitney U test. The results demonstrate that PULSE outperforms current methods in generating perceptually convincing images that downscale correctly.
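For reference, the pairwise significance testing can be reproduced with a few lines of Python; the rating vectors below are placeholders, not the collected MOS data:

```python
import numpy as np
from itertools import combinations
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# One 1-5 rating vector per image source; replace these placeholders with real MOS ratings.
ratings = {name: rng.integers(1, 6, size=40)
           for name in ["HR", "Nearest", "Bicubic", "FSRNet", "FSRGAN", "PULSE"]}

for a, b in combinations(ratings, 2):          # 15 pairwise comparisons
    _, p = mannwhitneyu(ratings[a], ratings[b], alternative="two-sided")
    print(f"{a} vs {b}: p = {p:.3g}")
```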
To provide another measure of perceptual quality, we
evaluated the Naturalness Image Quality Evaluator (NIQE)
score [18], previously used in perceptual super-resolution
[11, 5, 29]. This no-reference metric extracts features from
images and uses them to compute a perceptual index (lower
is better). As such, however, it only yields meaningful results at higher resolutions. This precluded direct comparison with FSRNet and FSRGAN, which produce images of
at most 128 × 128 pixels.
We evaluated NIQE scores for each method at a resolution of 1024 × 1024 from an input resolution of 16 × 16, for a scale factor of 64. All images for each method fell within the appropriate ε = 1e−3 for the downscaling loss. The results are in Table 2. PULSE surpasses even the CelebA HQ images in terms of NIQE here, further showing the perceptual quality of PULSE's generated images. This is possible as NIQE is a no-reference metric which solely considers perceptual quality; unlike reference metrics like PSNR, performance is not bounded above by that of the HR images typically used as reference.

Method    NIQE
HR        3.90
Nearest   12.48
Bicubic   7.06
PULSE     2.47

Table 2. NIQE Score for various algorithms at 1024 × 1024. Lower is better.

4.4. Image Sampling

As referenced earlier, we initialize the point we start at in the latent space by picking a random point on the sphere. We found that we did not encounter any issues with convergence from random initializations. In fact, this provided us one method of creating many different outputs with high-level feature differences: starting with different initializations. An example of the variation in outputs yielded by this process can be observed in Figure 3.

Furthermore, by utilizing a generative model with inherent stochasticity, we found we could sample faces with fine-level variation that downscale correctly; this procedure can be repeated indefinitely. In our implementation, we accomplish this by resampling the noise inputs that StyleGAN uses to fill in details within the image.

5. Robustness

The main aim of our algorithm is to perform perceptually realistic super-resolution with a known downscaling operator. However, we find that even for a variety of unknown downscaling operators, we can apply our method using bicubic downscaling as a stand-in for the more substantial degradations applied; see Figure 6. In this case, we provide only the degraded low-resolution image as input. We find that the output downscales approximately to the true, non-noisy LR image (that is, the bicubically downscaled HR) rather than to the degraded LR given as input. This is desired behavior, as we would not want to create an image that matches the additional degradations. PULSE thus implicitly denoises images. This is due to the fact that we restrict the outputs to only realistic faces, which in turn can only downscale to reasonable LR faces. Traditional supervised networks, on the other hand, are sensitive to added noise and changes in the domain and must therefore be explicitly trained with the noisy inputs (e.g., [4]).

Figure 6. (x32) We show the robustness of PULSE under various degradation operators. In particular, these are downscaling followed by Gaussian noise (std=25, 50), motion blur in random directions with length 100 followed by downscaling, and downscaling followed by salt-and-pepper noise with a density of 0.05.

6. Bias

While we initially chose to demonstrate PULSE using StyleGAN (trained on FFHQ) as the generative model for its impressive image quality, we noticed some bias when evaluated on natural images of faces outside of our test set. In particular, we believe that PULSE may illuminate some biases inherent in StyleGAN. We document this in a more structured way with a model card in Figure 8, where we also examine success/failure rates of PULSE across different subpopulations. We propose a few possible sources for this bias:
Bias inherited from latent space constraint: If StyleGAN
pathologically placed people of color in areas of lower
density in the latent space, bias would be introduced by
PULSE’s constraint on the latent space which is necessary
to consistently generate high resolution images. To evaluate
this, we ran PULSE with different radii for the hypersphere
PULSE searches on, corresponding to different samples.
This did not seem to have an effect.
Failure to converge: In the initial code we released on
GitHub, PULSE failed to return “no image found” when
at the end of optimization it still did not find an image
that downscaled correctly (within ). The concern could
therefore be that it is harder to find images in the outputs
of StyleGAN that downscale to people of color than to
white people. To test this, we found a new dataset with
better representation to evaluate success/failure rates on,
“FairFace: Face Attribute Dataset for Balanced Race,
Gender, and Age” [30]. This dataset was labeled by
third-party annotators on these fields. We sample 100
examples per subgroup and calculate the success/failure rate across groups with ×64 downscaling after running PULSE 5 times per image. The results of this experiment can be found in Table 3. There is some variation in these percentages, but it does not seem to be the primary cause of what we observed. Note that this metric is lacking in that it only reports whether an image was found, which does not reflect the diversity of images found over many runs on many images, an important measure that was difficult to quantify.

Race:
Black            79.2%
East Asian       87.0%
Indian           87.4%
Latino/Hispanic  90.2%
Middle Eastern   87.0%
Southeast Asian  87.4%
White            83.4%

Gender:
Female           91.4%
Male             88.6%

Table 3. Success rates (frequency with which PULSE finds an image in the outputs of the generator that downscales correctly) of PULSE with StyleGAN-FFHQ across various groups, evaluated on FairFace. See "Failure to converge" in Section 6 for a full explanation of this analysis and its limitations.
Bias inherited from optimization: This would imply that the
constrained latent space contains a wide range of images of
people of color but that PULSE’s optimization procedure
does not find them. However, if this is the case then we
should be able to find such images with enough random
initializations in the constrained latent space. We ran this
experiment and this also did not seem to have an effect.
Bias inherited from StyleGAN: Some have noted that it
seems more diverse images can be found in an augmented
latent space of StyleGAN per [2]. However, this is not close
to the set of images StyleGAN itself generates when trained
on faces: for example, in the same paper, the authors display
images of unrelated domains (such as cats) being embedded
successfully as well. In our work, PULSE is constrained to
images StyleGAN considers realistic face images (actually,
a slight expansion of this set; see below and Appendix for
an explanation of this).
More technically: in StyleGAN, they sample a latent
vector z, which is fed through the mapping network to become a vector w, which is duplicated 18 times to be fed
through the synthesis network. In [2], they instead find 18
different vectors to feed through the synthesis network that
correspond to any image of interest (whether faces or otherwise). In addition, while each of these latent vectors would have norm ≈ √512 when sampled, this augmented latent
space allows them to vary freely, potentially finding points
in the latent space very far from what would have been seen
in training. Using this augmented latent space therefore removes any guarantee that the latent recovered corresponds
to a realistic image of a face.
In our work, instead of duplicating the vector 18 times
as StyleGAN does, we relax this constraint and encourage
these 18 vectors to be approximately equal to each other so
as to still generate realistic outputs (see Appendix). This
relaxation means the set of images PULSE can generate
should be broader than the set of images StyleGAN could
produce naturally. We found that loosening any of StyleGAN’s constraints on the latent space further generally led
to unrealistic faces or images that were not faces.
Overall, it seems that sampling from StyleGAN yields
white faces much more frequently than faces of people of
color, indicating more of the prior density may be dedicated to white faces. Recent work by Salminen et al. [21]
describes the implicit biases of StyleGAN in more detail,
which seem to confirm these observations. In particular, we
note their analysis of the demographic bias of the outputs of
the model:
Results indicate a racial bias among the generated
pictures, with close to three-[fourths] (72.6%) of
the pictures representing White people. Asian
(13.8%) and Black (10.1%) are considerably less
frequent, while Indians represent only a minor
fraction of the pictures (3.4%).
This bias extends to any downstream application of StyleGAN, including the implementation of PULSE using StyleGAN.
7. Discussion and Future Work
Through these experiments, we find that PULSE produces perceptually superior images that also downscale correctly. PULSE accomplishes this at resolutions previously
unseen in the literature. All of this is done with unsupervised methods, removing the need for training on paired
datasets of LR-HR images. The visual quality of our images as well as MOS and NIQE scores demonstrate that our
proposed formulation of the super-resolution problem corresponds with human intuition. Starting with a pre-trained
GAN, our method operates only at test time, generating
each image in about 5 seconds on a single GPU. However,
we also note significant limitations when evaluated on natural images past the standard benchmark.
One reasonable concern when searching the output space
of GANs for images that downscale properly is that while
GANs generate sharp images, they need not cover the whole
distribution as, e.g., flow-based models must. In our experiments using CelebA and StyleGAN, we did not observe any
manifestations of this, which may be attributable to bias; see Section 6. (The “mode collapse” behavior of GANs may exacerbate dataset bias and contribute to the results described in Section 6 and the model card, Figure 8.) Advances in
generative modeling will allow for generative models with
better coverage of larger distributions, which can be directly
used with PULSE without modification.
Another potential concern that may arise when considering this unsupervised approach is the case of an unknown
downscaling function. In this work, we focused on the most
prominent SR use case: on bicubically downscaled images.
In fact, in many use cases, the downscaling function is either
known analytically (e.g., bicubic) or is a (known) function
of hardware. However, methods have shown that the degradations can be estimated in entirely unsupervised fashions
for arbitrary LR images (that is, not necessarily those which
have been downscaled bicubically) [7, 31]. Through such
methods, we can retain the algorithm’s lack of supervision;
integrating these is an interesting topic for future work.
8. Conclusions
We have established a novel methodology for image
super-resolution as well as a new problem formulation.
This opens up a new avenue for super-resolution methods
along different tracks than traditional, supervised work with
CNNs. The approach is not limited to a particular degradation operator seen during training, and it always maintains
high perceptual quality.
Acknowledgments: Funding was provided by the Lord
Foundation of North Carolina and the Duke Department of
Computer Science. Thank you to the Google Cloud Platform research credits program.
References
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.

[2] Rameen Abdal, Yipeng Qin, and Peter Wonka. Image2StyleGAN: How to embed images into the StyleGAN latent space? In Proceedings of the International Conference on Computer Vision (ICCV), 2019.

[3] Simon Baker and Takeo Kanade. Limits on super-resolution and how to break them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, pages 372–379. IEEE, 2000.

[4] Yijie Bei, Alexandru Damian, Shijia Hu, Sachit Menon, Nikhil Ravi, and Cynthia Rudin. New techniques for preserving global structure and denoising with low information loss in single-image super-resolution. In 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 874–881. IEEE Computer Society, 2018.

[5] Yochai Blau, Roey Mechrez, Radu Timofte, Tomer Michaeli, and Lihi Zelnik-Manor. The 2018 PIRM challenge on perceptual image super-resolution. In The European Conference on Computer Vision (ECCV) Workshops, September 2018.

[6] Ashish Bora, Ajil Jalal, Eric Price, and Alexandros G. Dimakis. Compressed sensing using generative models. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.

[7] Adrian Bulat, Jing Yang, and Georgios Tzimiropoulos. To learn image super-resolution, use a GAN to learn how to do image degradation first. In Proceedings of the European Conference on Computer Vision (ECCV), pages 185–200, 2018.

[8] Yu Chen, Ying Tai, Xiaoming Liu, Chunhua Shen, and Jian Yang. FSRNet: End-to-end learning face super-resolution with facial priors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2492–2501, 2018.

[9] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In Proceedings of the European Conference on Computer Vision (ECCV), pages 184–199, 2014.

[10] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[11] Seokhwa Jeong, Inhye Yoon, and Joonki Paik. Multi-frame example-based super-resolution using locally directional self-similarity. IEEE Transactions on Consumer Electronics, 61(3):353–358, 2015.

[12] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. CoRR, abs/1710.10196, 2018. Appeared at the 6th Annual International Conference on Learning Representations.

[13] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4401–4410, 2019.

[14] Deokyun Kim, Minseon Kim, Gihyun Kwon, and Dae-Shik Kim. Progressive face super-resolution via attention to facial landmark. In Proceedings of the 30th British Machine Vision Conference (BMVC), 2019.

[15] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[16] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4681–4690, 2017.

[17] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 220–229, 2019.

[18] A. Mittal, R. Soundararajan, and A. C. Bovik. Making a completely blind image quality analyzer. IEEE Signal Processing Letters, 20(3):209–212, March 2013.

[19] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8026–8037, 2019.

[20] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[21] Joni Salminen, Soon-gyo Jung, Shammur Chowdhury, and Bernard J. Jansen. Analyzing demographic bias in artificially generated facial pictures. In Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, page 18. ACM, April 2020.

[22] Wenzhe Shi, Jose Caballero, Ferenc Huszar, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[23] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.

[24] Amanjot Singh and Jagroop Singh Sidhu. Super resolution applications in modern digital image processing. International Journal of Computer Applications, 150(2):0975–8887, 2016.

[25] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[26] Roman Vershynin. Random Vectors in High Dimensions, pages 38–69. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.

[27] Thomas Viehmann and Lernapparat. PyTorch implementation of the StyleGAN generator. https://github.com/lernapparat/lernapparat/, 2019.

[28] Nannan Wang, Dacheng Tao, Xinbo Gao, Xuelong Li, and Jie Li. A comprehensive survey to face hallucination. International Journal of Computer Vision, 106(1):9–30, 2014.

[29] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. ESRGAN: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, September 2018.

[30] D. Xu, S. Yuan, L. Zhang, and X. Wu. FairGAN+: Achieving fair data generation and classification through generative adversarial nets. In 2019 IEEE International Conference on Big Data (Big Data), pages 1401–1406, 2019.

[31] Tianyu Zhao, Changqing Zhang, Wenqi Ren, Dongwei Ren, and Qinghua Hu. Unsupervised degradation learning for single image super-resolution. arXiv preprint arXiv:1812.04240, 2018.
Model Card - PULSE with StyleGAN FFHQ Generative Model Backbone
Model Details
• PULSE developed by researchers at Duke University, 2020, v1.
– Latent Space Exploration Technique.
– PULSE does no training, but is evaluated by downscaling loss (equation 9) for fidelity to input low-resolution image.
– Requires pre-trained generator to parameterize natural image manifold.
• StyleGAN developed by researchers at NVIDIA, 2018, v1.
– Generative Adversarial Network.
– StyleGAN trained with adversarial loss (WGAN-GP).
Intended Use
• PULSE was intended as a proof of concept for one-to-many super-resolution (generating multiple high resolution outputs
from a single image) using latent space exploration.
• Intended use of implementation using StyleGAN-FFHQ (faces) is purely as an art project - seeing fun pictures of imaginary
people that downscale approximately to a low-resolution image.
• Not suitable for facial recognition/identification. PULSE makes imaginary faces of people who do not exist, which should
not be confused for real people. It will not help identify or reconstruct the original image.
• Demonstrates that face recognition is not possible from low resolution or blurry images because PULSE can produce visually
distinct high resolution images that all downscale correctly.
Factors
• Similarly to [17]: “based on known problems with computer vision face technology, potential relevant factors include groups
for gender, age, race, and Fitzpatrick skin type.” Additional factors include lighting, background content, hairstyle, pose,
camera focal length, and accessories.
Metrics
• Evaluation metrics include the success/failure rate of PULSE using StyleGAN (when it does not find an image that downscales appropriately). Note that this metric is lacking in that it only reports whether an image was found - which does not
reflect the diversity of images found over many runs on many images, an important measure that was difficult to quantify. In
original evaluation experiments, the failure rate was zero.
• To better evaluate the way that this metric varies across groups, we found a new dataset with better representation to evaluate
success/failure rates on, “FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age” [30]. This dataset was
labeled by third-party annotators on these fields. We sample 100 examples per subgroup and calculate the success/failure
rate across groups with ×64 downscaling after running PULSE 5 times per image. We note that this is a small sample size
per subgroup. The results of this analysis can be found in Table 3.
Training Data

• PULSE is never trained itself; it leverages a pretrained generator.
• StyleGAN is trained on FFHQ [13].

Evaluation Data

• CelebA HQ, test data split, chosen as a basic proof of concept.
• MOS Score evaluated via ratings from third party annotators (Amazon MTurk).
Ethical Considerations
• Evaluation data - CelebA HQ: Faces based on public figures (celebrities). Only image data is used (no additional annotations). However, we point out that this dataset has been noted to have a severe imbalance of white faces compared to faces
of people of color (almost 90% white) [30]. This leads to evaluation bias. As this has been the accepted benchmark in face
super-resolution, issues of bias in this field may go unnoticed.
Caveats and Recommendations
• FairFace appears to be a better dataset to use than CelebA HQ for evaluation purposes.
• Due to lack of available compute, we could not at this time analyze intersectional identities and the associated biases.
• For an in depth discussion of the biases of StyleGAN, see [21].
• Finally, again similarly to [17]:
1. Does not capture race or skin type, which has been reported as a source of disproportionate errors.
2. Given gender classes are binary (male/not male), which we include as male/female. Further work needed to evaluate
across a spectrum of genders.
3. An ideal evaluation dataset would additionally include annotations for Fitzpatrick skin type, camera details, and environment (lighting/humidity) details.
PULSE: Supplementary Material
Sachit Menon*, Alexandru Damian*, Nikhil Ravi, Shijia Hu, Cynthia Rudin
Duke University
Durham, NC
{sachit.menon,alexandru.damian,nikhil.ravi,shijia.hu,cynthia.rudin}@duke.edu
1. Appendix A: Additional Figures
Here we provide further samples of the output of our
super-resolution method for illustration in Figure 1. These
results were obtained with ×8 scale factor from an input of
resolution 16 × 16. This highlights our method’s capacity
to illustrate detailed features that we did not have space to
show in the main paper, such as noses and a wider variety of eyes. We also present additional examples depicting
PULSE’s robustness to additional degradation operators in
Figure 2 and some randomly selected generated samples in
Figures 3, 4, 5, and 6.
2. Appendix B: Implementation Details
2.1. StyleGAN

In order to generate experimental results using our method, we had to pick a pretrained generative model to work with. For this we chose StyleGAN due to its state-of-the-art performance on high resolution image generation.

StyleGAN consists of a mapping network M : R^512 → R^512, a tiling function T : R^512 → R^{18×512}, and a synthesis network S : R^{18×512} × N → R^{1024×1024}, where N is a collection of Euclidean spaces of varying dimensions representing the domains of each of the noise vectors fed into the synthesis network. To generate images, a vector z is sampled uniformly at random from the surface of the unit sphere in R^512. This is transformed into another 512-dimensional vector by the mapping network, which is replicated 18 times by the tiling function T. The new 18 × 512 dimensional vector is input to the synthesis network, which uses it to generate a high-resolution, 1024 × 1024 pixel image. More precisely, the synthesis network consists of 18 sequential layers, and the resolution of the generated image is doubled after every other layer; each of these 18 layers is fed the 512-dimensional output of the mapping network, hence the tiling function. The synthesis network also takes as input noise sampled from the unit Gaussian, which it uses to stochastically add details to the generated image. Formally, η is sampled from the Gaussian prior on N, at which point the output is obtained by computing S(T(M(z)), η).

2.2. Latent Space Embedding

Experimentally, we observed that optimizing directly on z ∈ S^511 ⊂ R^512 yields poor results; this latent space is not expressive enough to map to images that downscale correctly. A logical next step would be to use the expanded 18 × 512-dimensional latent space that the synthesis network takes as input, as noted by Abdal, Qin, and Wonka [1]. By ignoring the mapping network, S can be applied to any vector in R^{18×512}, rather than only those consisting of a single 512-dimensional vector repeated 18 times. This expands the expressive potential of the network; however, by allowing the 18 512-dimensional input vectors to S to vary independently, the synthesis network is no longer constrained to the original domain of StyleGAN.
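To make the shapes concrete, here is a schematic sketch of the factorization S(T(M(z)), η) described in Section 2.1 and of the relaxed 18-vector input discussed above; M, S, and the noise inputs stand in for the pretrained StyleGAN networks, and only the tensor shapes are meant literally:

```python
import torch

def generate(M, S, z, noise):
    """Standard StyleGAN sampling: S(T(M(z)), eta)."""
    w = M(z)                                    # map z in R^512 to w in R^512
    w_tiled = w.unsqueeze(1).repeat(1, 18, 1)   # tiling T: duplicate w for all 18 layers
    return S(w_tiled, noise)                    # synthesize a 1024 x 1024 image

def generate_relaxed(S, w_stack, noise):
    """PULSE's relaxation: 18 separate per-layer vectors, shape (batch, 18, 512),
    kept close to one another via the (geo)cross loss of the next subsection."""
    return S(w_stack, noise)
```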
2.3. Cross Loss
For the purposes of super-resolution, such approaches
are problematic because they void the guarantee that the algorithm is traversing a good approximation of M, the natural image manifold. The synthesis network was trained with
a limited subset of R18×512 as input; the further the input
it receives is from that subset, the less we know about the
output it will produce. The downscaling loss, defined in the
main paper, is alone not enough to guide PULSE to a realistic image (only an image that downscales correctly). Thus,
we want to make some compromise between the vastly increased expressive power of allowing the input vectors to
vary independently and the realism produced by tiling the
input to S 18 times. Instead of optimizing on downscaling loss alone, we need some term in the loss discouraging
straying too far in the latent space from the original domain.
To accomplish this, we introduce another metric, the “cross loss.” For a set of vectors v_1, ..., v_k, we define the cross loss of v_1, ..., v_k to be

CROSS(v_1, ..., v_k) = Σ_{i<j} |v_i − v_j|_2^2.

The cross loss imposes a penalty based on the Euclidean distance between every pair of vectors input to S. This can be considered a simple form of relaxation on the original constraint that the 18 input vectors be exactly equal.

Figure 1. Further comparison of PULSE with bicubic upscaling, FSRNet, and FSRGAN.

When v_1, ..., v_k are sampled from a sphere, it makes more sense to compare geodesic distances along the sphere. This is the approach we used in generating our results. Let θ(v, w) denote the angle between the vectors v and w. We then define the geodesic cross loss to be

GEOCROSS(v_1, ..., v_k) = Σ_{i<j} θ(v_i, v_j)^2.
Empirically, by allowing the 18 input vectors to S to vary
while applying the soft constraint of the (geo)cross loss, we
can increase the expressive potential of the network without
large deviations from the natural image manifold.
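A direct (if unoptimized) PyTorch sketch of this geodesic cross loss follows; the pairwise loop and the clamp guarding acos are illustrative choices rather than the exact released implementation:

```python
import torch

def geocross(latents: torch.Tensor) -> torch.Tensor:
    """GEOCROSS: sum of squared angles between every pair of per-layer latent vectors.

    latents: (k, d) tensor, e.g. k = 18, d = 512 for StyleGAN's synthesis inputs.
    """
    k = latents.shape[0]
    total = latents.new_zeros(())
    for i in range(k):
        for j in range(i + 1, k):
            vi, vj = latents[i], latents[j]
            cos_ij = torch.clamp(vi @ vj / (vi.norm() * vj.norm()), -1.0, 1.0)
            total = total + torch.acos(cos_ij) ** 2   # squared geodesic angle
    return total
```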
2.4. Approximating the input distribution of S
StyleGAN begins with a uniform distribution on S 511 ⊂
R512 , which is pushed forward by the mapping network to a
transformed probability distribution over R512 . Therefore,
another requirement to ensure that S([v1 , ..., v18 ], η) is a
realistic image is that each vi is sampled from this pushforward distribution. While analyzing this distribution, we
found that we could transform this back to a distribution on
the unit sphere without the mapping network by simply applying a single linear layer with a leaky-ReLU activation–
an entirely invertible transformation. We therefore inverted
this function to obtain a sampling procedure for this distribution. First, we generate a latent w from S 511 , and then
apply the inverse of our transformation.
2.5. Noise Inputs
The second parameter of S controls the stochastic variation that StyleGAN adds to an image. When the noise
is set to 0, StyleGAN generates smoothed-out, detail-free
images. The synthesis network takes 18 noise vectors at
varying scales, one at each layer. The earlier noise vectors influence more global features, for example the shape
of the face, while the later noise vectors add finer details,
such as hair definition. Our first approach was to sample
the noise vectors before we began traversing the natural image manifold, keeping them fixed throughout the process.
In an attempt to increase the expressive power of the synthesis network, we also tried to perform gradient descent
on both the latent input and the noise input to S simultaneously, but this tended to take the noise vectors out of the
spherical shell from which they were sampled and produced
unrealistic images. Using a standard Gaussian prior forced
the noise vectors towards 0 as mentioned in the main paper. We therefore experimented with two approaches for
the noise input:
1. Fixed noise: Especially when upsampling from 16 ×
16 to 1024 × 1024, StyleGAN was already expressive
enough to upsample our images correctly and so we
did not need to resort to more complicated techniques.
2. Partially trainable noise: In order to slightly increase
the expressive power of the network, we optimized on
the latent and the first 5-7 noise vectors, allowing us
to slightly modify the facial structure of the images we
generated to better match the LR images while maintaining image quality. This was the approach we used to generate the images presented in this paper.

Figure 2. (x32) Additional robustness results for PULSE under additional degradation operators (these are downscaling followed by Gaussian noise (std=25, 50), motion blur in random directions with length 100 followed by downscaling, and downscaling followed by salt-and-pepper noise with a density of 0.05.)

3. Appendix C: Alternative Metrics

For completeness, we provide the metrics of PSNR and SSIM here. These results were obtained with ×8 scale factor from an input of resolution 16 × 16. Note that we explicitly do not aim to optimize on this pixel-wise average distance from the high-resolution image, so these metrics do not have meaningful implications for our work.

Method    PSNR   SSIM
Nearest   21.78  0.51
Bicubic   23.40  0.63
FSRNet    25.93  0.74
FSRGAN    24.55  0.66
PULSE     22.01  0.55
4. Appendix D: Robustness
Traditional supervised approaches using CNNs are notoriously sensitive to tiny changes in the input domain, as any
perturbations are propagated and amplified through the network. This caused some problems when trying to train FSRNet and FSRGAN on FFHQ and then test them on CelebA HQ. However, PULSE never feeds an LR image through
a convolutional network and never applies filters to the LR
input images. Instead, the LR image is only used as a term
in the downscaling loss. Because the generator is not capable of producing “noisy” images, it will seek an SR image
that downscales to the closest point on the LR natural image
manifold to the noisy LR input. This means that PULSE
outputs an SR image that downscales to the projection of
the noisy LR input onto the LR natural image manifold, and
if the noise is not too strong, this should be close to the
“true” unperturbed LR. This may explain why PULSE had
no problems when applied to different domains, and why
we could demonstrate robustness when the low resolution
image was degraded with various types of noise.
References
[1] Rameen Abdal, Yipeng Qin, and Peter Wonka.
Image2StyleGAN: How to embed images into the StyleGAN latent space? In Proceedings of the International Conference
on Computer Vision (ICCV), 2019.
Figure 3. (64x) Sample 1
Figure 4. (64x) Sample 2
Figure 5. (64x) Sample 3
Figure 6. (64x) Sample 4