VAE-U-nets for Image Resolution Enhancement

VAE-U-nets: Addressing the Weaknesses of the GANs Used For Image Resolution Enhancement
High resolution pathology images, used for cancer detection, require costly equipment to obtain. In order to
ensure that the poor in developing nations can have access to high resolution pathology images and hence
better cancer diagnoses, machine learning methods to digitally enhance originally low resolution images have
been experimented with. Researchers have mostly focussed on the use of Generative Adversarial Networks
(GANs), and although they have been quite promising in the sense that the images produced are often highly
detailed, they suffer from a number of drawbacks, such as mode collapse and the fact that they
do not optimise reconstruction accuracy. With that in mind, Variational Autoencoders (VAEs) have been experimented
with in this project in hopes of enhancing pathology images in a probabilistically rigorous manner. U-net
architecture has also been incorporated into VAEs to improve their reconstruction accuracy, with much
success. The models were trained to enhance artificially downsampled images into the original ones.
1. Introduction
When it comes to detecting cancer, pathologists meticulously analyse images of a patient’s tissue for
abnormalities. However, pathologists’ ability to do so is limited by the resolution of the images received, and
it goes without saying that ultra-high resolution pathology images require expensive advanced microscopes to
obtain [1][2]. The issue of cost is especially pronounced in developing nations, where being unable to capture
high quality pathology images could cause pathologists to miss tumorous cells and thus mean the difference
between survival and death for the patient. Beyond this, ultra-high resolution pathology images are also
unrealistic to obtain via digital pathology because of the very long scanning durations required for such high
resolutions [3][4].
To combat both these issues, researchers have been looking at using Generative Adversarial Networks (GANs)
to turn previously low resolution pathology images into high resolution ones. While GANs have indeed been
able to produce very high quality images, there are two problems which make medical professionals hesitant about,
if not downright opposed to, using them in real diagnoses [5]. Firstly, GANs tend to suffer from mode collapse
[6][7][8], meaning that the model learns to produce only a certain type of image as opposed to all variations.
Secondly, GANs are trained to optimise how real an image looks instead of how accurate it is [9].
Consequently, GANs sometimes produce images which look very detailed but are unrealistic and inaccurate,
which is especially problematic in the field of medicine where accuracy is paramount.
Besides GANs, U-nets have also been used to enhance the resolution of images. Although they reconstruct images
reasonably well, they are, like GANs, not probabilistic in nature. So even though they do not suffer from mode
collapse as GANs do, it is difficult if not impossible to assess the probability of the generated enhanced
image being real.
With all that in mind, the use of Variational Autoencoders (VAEs) rather than GANs to enhance the resolution
of pathology images is proposed. VAEs produce more accurate and thus reliable data because they in part
optimise the accuracy of the images produced [10]. On top of that, they can theoretically be used to assign a
probability [11][12][13][14] to generated images since they essentially estimate the probability distribution of
the images. That way, doctors have a more objective way of judging if a machine learning-enhanced image
should be trusted.
In this work, our unique contributions are using VAEs to perform image resolution enhancement and
demonstrating that augmenting U-nets with VAEs’ probabilistic nature can make reconstructions probabilistic
without any loss to image enhancement quality.
2. Models and Methodology
Data used
Pathology images were obtained from the Patch Camelyon dataset [15], which consists of 327,680 colour
images (96 x 96 px) extracted from histopathologic scans of lymph node sections. Of the 262,144 training and
32,768 validation images originally available in the Patch Camelyon dataset, the same 26,240 training and 984
validation images were used across all experiments.
To obtain low resolution images, the original images were downsampled by factors of 2 and 4. This was done
by applying an operation to the original images that was similar to a max pool layer with 2x2 and 4x4
windows respectively. Below are samples of the original and downsampled images.
Fig 2.1: Samples of the original images and of the images downsampled by 2 and 4 times
In this project, models were trained to enhance the downsampled images obtained in 2.1 into the original ones.
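The downsampling step can be illustrated with a minimal NumPy sketch. It assumes a plain non-overlapping max over each window; the text only says the operation was "similar to" a max pool layer, so the function name `max_pool_downsample` and the exact behaviour are illustrative assumptions rather than the operation actually used.

```python
import numpy as np

def max_pool_downsample(image: np.ndarray, factor: int) -> np.ndarray:
    """Downsample an (H, W, C) image by taking the max over factor x factor windows.

    Assumption: non-overlapping windows, as in a max pool layer whose
    stride equals its window size.
    """
    h, w, c = image.shape
    if h % factor or w % factor:
        raise ValueError("image height and width must be divisible by factor")
    # Split each spatial axis into (number of windows, window size),
    # then reduce over the two window axes.
    windows = image.reshape(h // factor, factor, w // factor, factor, c)
    return windows.max(axis=(1, 3))

rng = np.random.default_rng(0)
original = rng.random((96, 96, 3))          # stand-in for a 96 x 96 colour patch
low_2x = max_pool_downsample(original, 2)   # shape (48, 48, 3)
low_4x = max_pool_downsample(original, 4)   # shape (24, 24, 3)
```

Because each output pixel is the maximum of its window, the mean intensity of the result can never be lower than that of the input.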
Loss function:
The formula for the loss the models sought to minimise is
$$\mathrm{Loss} = 96 \cdot 96 \cdot 3 \cdot \sum_i y_i \log(\hat{y}_i) + \lambda \sum_i \left(\sigma_i + \mu_i - \log(\sigma_i) - 1\right)$$
The first term represents the reconstruction loss, which measures how close the generated image is to the
ground truth, in other words, the model’s accuracy. The second term represents the Kullback–Leibler
divergence loss, which measures how close the encoded latent vectors are to the standard normal distribution.
The lambda value used in this project was 0.1 [16].
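A loss of this two-term form can be sketched in NumPy as follows. The sketch makes several assumptions that the formula as printed does not spell out: the reconstruction term is the full binary cross-entropy with its conventional negative sign, averaged over pixels before being scaled by 96 * 96 * 3; the encoder outputs a log-variance; and the Kullback–Leibler term uses the standard closed form for a diagonal Gaussian against a standard normal, with a squared mean and a factor of 1/2. The function name `vae_loss` is hypothetical.

```python
import numpy as np

def vae_loss(y_true, y_pred, mu, log_var, lam=0.1, eps=1e-7):
    """Reconstruction loss plus lam times the KL divergence (sketch, see assumptions)."""
    # Clip predictions so that the logarithms stay finite.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    # Assumption: mean binary cross-entropy per pixel, scaled by the pixel count.
    bce = -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    reconstruction = 96 * 96 * 3 * bce
    # Assumption: standard closed-form KL(N(mu, sigma^2) || N(0, 1)), sigma^2 = exp(log_var).
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - log_var - 1.0)
    return reconstruction + lam * kl

rng = np.random.default_rng(0)
target = (rng.random((96, 96, 3)) > 0.5).astype(np.float64)
perfect = vae_loss(target, target, mu=np.zeros(16), log_var=np.zeros(16))
shifted = vae_loss(target, target, mu=np.ones(16), log_var=np.zeros(16))
```

With a perfect reconstruction and a latent distribution equal to the standard normal, both terms are close to zero; moving the latent mean away from zero increases only the KL term.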
Model parameters:
The parameters used for all models regardless of the experiment are as follows.
[Table of model parameters. Surviving labels: Batch size; Learning rate; (variance layer); He normal. The corresponding values are missing.]
Model architectures:
Below is an illustration of the VAE model.
The encoder starts off with 4 repeated units. Each repeated unit consists of two 3×3 convolutions (each
followed by a rectified linear unit (ReLU) activation), followed by a 3×3 convolution with stride 2 for
downsampling, with the exception of the 4th repeated unit which has a flattening layer instead. Starting off at
32, the number of feature channels is doubled at each downsampling step. The 4 repeated units are followed
by a 128 unit dense layer, which is connected to 2 dense layers, one each to represent the mean and variance
of the probability distribution describing the encoded latent vector. A sampling layer is then used to randomly
sample a latent vector from the distribution. This latent vector will be fed into the decoder as input. Similarly,
the decoder consists of a 128 unit dense layer followed by 4 repeated units. Every repeated unit comprises
a transpose convolutional layer with stride 2 that doubles the dimensions of the feature map and halves the
number of feature channels, followed by a concatenation with the corresponding feature map from the encoder,
and two 3×3 convolutions.
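The encoder description above can be traced with a short NumPy sketch. It assumes "same" padding, so that only the stride-2 convolutions change the spatial size, and it assumes the variance layer outputs a log-variance that is sampled with the usual reparameterisation trick; neither detail is stated in the text, and the function names are hypothetical.

```python
import numpy as np

def encoder_feature_shapes(input_size=96, base_channels=32, num_units=4):
    """Return (height, width, channels) of the 3x3 convolutions in each repeated unit."""
    shapes = []
    size, channels = input_size, base_channels
    for _ in range(num_units):
        shapes.append((size, size, channels))
        size //= 2       # a stride-2 convolution halves height and width
        channels *= 2    # the channel count doubles at each downsampling step
    # The last unit is flattened instead of downsampled, so the final
    # update of size and channels is simply never used.
    return shapes

def sample_latent(mu, log_var, rng):
    """Reparameterisation trick: z = mu + sigma * epsilon with epsilon ~ N(0, 1)."""
    epsilon = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * epsilon

shapes = encoder_feature_shapes()
flattened_length = int(np.prod(shapes[-1]))   # input size of the 128 unit dense layer

rng = np.random.default_rng(0)
z = sample_latent(np.zeros(16), np.zeros(16), rng)   # one latent vector, 16 latent units
```

Under these assumptions the feature maps go from 96 x 96 x 32 to 12 x 12 x 256 before flattening, and the sampled latent vector `z` is what the decoder would receive as input.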
Besides the standard VAE model, a model incorporating the U-net architecture into that of the standard VAE
(referred to as VAE-U-net) was used. U-nets contain shortcuts, which concatenate feature maps in the
decoder with their corresponding encoder feature maps. That way, features “learnt” by the encoder at different
levels of abstraction can be used by the decoder.
Fig 2.2: Diagram of the Standard VAE and VAE-U-net model architectures
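The shortcut mechanism can be illustrated at the level of array shapes. In the sketch below, the transposed convolution is replaced by a stand-in (nearest-neighbour upsampling followed by a random, unlearned channel projection), purely so that the example runs without a deep learning framework; the feature map sizes assume the 24 x 24 level of a 96 x 96 input, and all names are hypothetical.

```python
import numpy as np

def upsample_and_halve_channels(x, rng):
    """Stand-in for a stride-2 transposed convolution (random weights, not learned)."""
    h, w, c = x.shape
    x = x.repeat(2, axis=0).repeat(2, axis=1)                    # double height and width
    projection = rng.standard_normal((c, c // 2)) / np.sqrt(c)   # halve the channel count
    return x @ projection

def apply_shortcut(decoder_map, encoder_map):
    """Concatenate a decoder feature map with its encoder counterpart along channels."""
    if decoder_map.shape[:2] != encoder_map.shape[:2]:
        raise ValueError("feature maps must share height and width")
    return np.concatenate([decoder_map, encoder_map], axis=-1)

rng = np.random.default_rng(0)
encoder_map = rng.random((24, 24, 128))     # encoder features saved at the 24 x 24 level
decoder_in = rng.random((12, 12, 256))      # decoder features one level below

upsampled = upsample_and_halve_channels(decoder_in, rng)   # shape (24, 24, 128)
merged = apply_shortcut(upsampled, encoder_map)            # shape (24, 24, 256)
```

The concatenated map carries the encoder's features unchanged in its last 128 channels, which is how the convolutions that follow gain direct access to detail from the input.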
Besides vanilla VAE models with 8, 16 and 32 latent units, VAE-U-net models with only shortcut 4 and with
all shortcuts (both using 16 latent units), as well as the traditional U-net model were also tested.
3. Results and Discussion
Enhancement of downsampled by 2x images
Fig 3.1: Training and Validation loss against epochs graph for the 8, 16 and 32 latent units models of downsampled by
2x images
Although the final training loss decreases as the number of latent units used increases, the validation loss
shows no further decrease beyond 16 latent units, suggesting that 16 is the optimal number of latent
units to use. That said, the images generated by all models (shown in Fig 3.4) were far from realistic and
detailed. This, coupled with the fact that the SSIM and PSNR values (Table 3.1) actually deteriorated after
processing by the models, indicates that a regular VAE is insufficient for enhancing detailed
pathology images.
All three models’ validation losses were much higher than their respective training losses, a sign that the
models have overfitted.
Fig 3.2 & 3.3: Training and Validation loss against epochs graphs for the Standard VAE (16 latent units), VAE-U-net (only
Shortcut 4), VAE-U-net (all shortcuts), and traditional U-net models respectively of downsampled by 2x images
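[Table 3.1 body. Rows: SSIM (2); PSNR (2); SSIM (4); PSNR (4). Columns include the standard VAE models with 8, 16 and 32 latent units, the VAE-U-net (only shortcut 4) and VAE-U-net (all shortcuts) models, and the U-net model. The numeric values are missing.]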
Table 3.1: Structural Similarity Index Measure (SSIM) and Peak Signal to Noise Ratio (PSNR) values of downsampled
images and the enhanced images of different models compared to the original images. The numbers 2 and 4 within the
brackets indicate that the SSIM/PSNR values in that same row correspond to the model’s performance when trained and
tested on images that were downsampled by 2 and 4 times respectively
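For reference, PSNR can be computed directly from the mean squared error between two images. The sketch below assumes that both images have the same shape and that pixel values are scaled to the range 0 to 1; the function name `psnr` is illustrative, and the project's own implementation is not described. SSIM is more involved, and a library routine such as `skimage.metrics.structural_similarity` from scikit-image is one option for it.

```python
import numpy as np

def psnr(reference, test, max_value=1.0):
    """Peak Signal to Noise Ratio in decibels between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    test = np.asarray(test, dtype=np.float64)
    if reference.shape != test.shape:
        raise ValueError("images must have the same shape")
    mse = np.mean((reference - test) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

reference = np.zeros((96, 96, 3))
noisy = np.full((96, 96, 3), 0.1)    # constant error of 0.1 per pixel
value = psnr(reference, noisy)       # MSE is 0.01, so PSNR is 20 dB
```

A higher PSNR means a smaller pixel-wise error relative to the maximum possible pixel value.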
Fig 3.4: Sample enhanced images generated by 8, 16 and 32 latent Standard VAE models, VAE-U-net models, and the
traditional U-net model compared to the input (downsampled by 2x images) and the target (original images)
Not only were the training and validation losses of the VAE-U-net models much lower than those of the standard
models, the enhanced images produced by them were also far closer to the originals. In addition, the
VAE-U-net models’ training and validation loss graphs coincide, suggesting that the models have managed to
effectively model the distribution of the real data.
The SSIM and PSNR values tell a similar story, improving after the use of the models to process the
downsampled images. The fact that the VAE-U-net (all shortcuts) model gave rise to a much greater
improvement in SSIM values than the marginal one the VAE-U-net (only shortcut 4) produced goes to show
that the use of shortcuts is indeed what has driven the enhancement of the downsampled images.
Moreover, not only are the VAE-U-net (all shortcuts) model’s losses just as low as those of the traditional
U-net model, its SSIM and PSNR values are just as high as those of the traditional U-net model as well. This
means that the VAE-U-net model is able to enjoy the benefit of probabilistic modelling at no expense
to reconstruction accuracy.
Refer to Figure 3.4 for samples of the enhanced images.
Enhancement of down sampled by 4x images
Fig 3.5 & 3.6: Training and Validation loss against epochs graphs for the Standard VAE (16 latent units), VAE-U-net (only
Shortcut 4 and all shortcuts), and traditional U-net models respectively of downsampled by 4x images
Fig 3.7: Sample enhanced images generated by VAE-U-net (only Shortcut 4 and all shortcuts) and traditional U-net
models compared to the input (downsampled by 4x images) and the target (original images)
The results for the models trained to enhance images downsampled by 4x were similar to those for the
images downsampled by 2x. Firstly, increasing the number of connections led to a lower final training loss
while making a marginal, if any, improvement in terms of validation loss. Secondly, the training and validation
losses of the VAE-U-net (all shortcuts) and the traditional U-net models were identical.
However, these models did not perform as well as their counterparts trained on images downsampled by 2x, both
in terms of training and validation losses and in terms of the accuracy of the images produced. On top of that,
the validation losses of all models were higher than the corresponding training losses, indicating that, unlike
the models trained on images downsampled by 2x, they had all suffered from overfitting. Since the VAE-U-net
(only Shortcut 4) model’s validation loss graph fluctuated much more than those of the other 2 models, it has
likely overfitted much more as well.
As with the models for images downsampled by 2 times, the use of all three models helped to improve the SSIM
values, though to a much smaller extent (refer to Table 3.1). This could indicate that models are increasingly
less able to enhance images as the resolution of the downsampled images decreases. As
such, even though no experiments were done to test this, it is suspected that such models will result in little to
no meaningful improvement in resolution for images downsampled beyond 4 times. That being said, the
improvement in PSNR values brought about by the models was similar to that of the models for images downsampled
by 2 times.
Refer to Figure 3.7 for samples of the enhanced images.
4. Limitations and Future Work
As shown in section 3.1, VAEs alone failed to enhance the resolution of the downsampled images. This could
have to do with the fact that Gaussian distributions were used to estimate the distribution of the latent vectors
of the images, which is problematic because natural images, especially pathology ones, have distributions
which are far more complex than that [17]. This hypothesis could be tested by using the same vanilla VAEs
to model and enhance images of greater simplicity, such as those in the MNIST dataset. Even so, it is highly
unusual that the images produced by VAEs alone were much worse than the initial downsampled images.
The research community has suggested various ways of tackling this issue. These include Infusion
training, where a Markov chain is trained to gradually converge to the data distribution [18], as well as
Deep Feature Consistent VAEs, which are trained with a loss function that first feeds the original and
reconstructed images into a pre-trained convolutional neural network (CNN) to extract higher level features
and then compares these features to compute a so-called perceptual loss [19].
Even though no experiments on images downsampled beyond 4 times were done, the fact that the models for
images downsampled by 4 times only managed to achieve minor resolution enhancement suggests
that even the models described in the previous paragraph are unlikely to be able to enhance images downsampled
beyond 4 times. At that point, the images likely lack sufficient information for enhancement to the
original to be possible.
That being said, the models’ poor performance on images downsampled by 4 times might in part be attributed to the
method by which these images were generated from the originals. The downsampled images
are noticeably brighter than the originals, which could be a sign that the downsampling operation used in this
project is not sophisticated enough to produce fully realistic downsampled images.
5. Conclusion
In this project, VAE-based probabilistic models were successfully used to enhance the resolution of pathology
images. As expected, VAEs alone were inadequate. What was surprising, however, was that the images
produced by VAEs alone were much worse than the initial downsampled images. That said, the incorporation
of U-net architecture managed to substantially make up for this weakness. Future work includes using
methods other than U-nets to augment vanilla VAEs and achieve better image resolution enhancement.
