RO050 | VAE-U-nets: Addressing the Weaknesses of the GANs Used For Image Resolution Enhancement

Abstract

High resolution pathology images, used for cancer detection, require costly equipment to obtain. To ensure that the poor in developing nations can have access to high resolution pathology images, and hence better cancer diagnoses, machine learning methods to digitally enhance originally low resolution images have been experimented with. Researchers have mostly focused on the use of Generative Adversarial Networks (GANs), and although these have been quite promising in the sense that the images produced are often highly detailed, they suffer from a number of problematic drawbacks, such as mode collapse and the fact that they do not optimise reconstruction accuracy. With that in mind, Variational Autoencoders (VAEs) were experimented with in this project in hopes of enhancing pathology images in a probabilistically rigorous manner. The U-net architecture was also incorporated into VAEs to improve their reconstruction accuracy, with much success. The models were trained to enhance artificially downsampled images into the original ones.

1. Introduction

When it comes to detecting cancer, pathologists meticulously analyse images of a patient's tissue for abnormalities. However, pathologists' ability to do so is limited by the resolution of the images received, and ultra-high resolution pathology images require expensive advanced microscopes to obtain [1][2]. The issue of cost is especially pronounced in developing nations, where being unable to capture high quality pathology images could cause pathologists to miss tumorous cells and thus mean the difference between survival and death for the patient. Beyond this, ultra-high resolution pathology images are also impractical to obtain via digital pathology because of the very long scanning durations required at such resolutions [3][4].

To combat both these issues, researchers have been looking at using Generative Adversarial Networks (GANs) to turn previously low resolution pathology images into high resolution ones. While GANs have indeed been able to produce very high quality images, there are two problems which make medical professionals hesitant, if not downright opposed, to using them in real diagnoses [5]. Firstly, GANs tend to suffer from mode collapse [6][7][8], meaning that the model learns to produce only certain types of images as opposed to all variations. Secondly, GANs are trained to optimise how real an image looks instead of how accurate it is [9]. Consequently, GANs sometimes produce images which look very detailed but are unrealistic and inaccurate, which is especially problematic in the field of medicine, where accuracy is paramount.

Besides GANs, U-nets have also been used to enhance the resolution of images. Although they perform decently well reconstruction-wise, they, like GANs, are not probabilistic in nature. So even though they do not suffer from mode collapse as GANs do, it is difficult if not impossible to assess the probability of a generated enhanced image being real.

With all that in mind, the use of Variational Autoencoders (VAEs) rather than GANs to enhance the resolution of pathology images is proposed. VAEs produce more accurate and thus more reliable data because they in part optimise the accuracy of the images produced [10]. On top of that, they can theoretically be used to assign a probability [11][12][13][14] to generated images, since they essentially estimate the probability distribution of the images. That way, doctors have a more objective way of judging whether a machine learning-enhanced image should be trusted. In this work, our unique contributions are using VAEs to perform image resolution enhancement and demonstrating that augmenting U-nets with VAEs' probabilistic nature can make reconstructions probabilistic without any loss of image enhancement quality.

2. Models and Methodology

2.1 Data used

Pathology images were obtained from the Patch Camelyon dataset [15], which consists of 327,680 colour images (96 × 96 px) extracted from histopathologic scans of lymph node sections. Of the 262,144 training and 32,768 validation images originally available in the Patch Camelyon dataset, the same 26,240 training and 984 validation images were used across all experiments.

To obtain low resolution images, the original images were downsampled by 2 and 4 times. This was done by applying an operation on the original images similar to a max pool layer with 2x2 and 4x4 windows, as sketched below. Samples of the original and downsampled images are shown in Fig 2.1.

Fig 2.1: Samples of the original images and the images downsampled by 2 and 4 times
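As a concrete illustration, the max-pool-style downsampling described above can be implemented as follows. This is a minimal sketch assuming images stored as NumPy arrays with values in [0, 1]; the exact operation used in the project may differ in detail.

```python
# Sketch of a max-pool-style downsampling step, as described above.
# Assumes images are float arrays of shape (96, 96, 3); block_reduce
# with np.max mimics a max pooling layer with a k x k window.
import numpy as np
from skimage.measure import block_reduce

def downsample(image: np.ndarray, factor: int) -> np.ndarray:
    """Downsample an HxWxC image by `factor` using max pooling."""
    return block_reduce(image, block_size=(factor, factor, 1), func=np.max)

# Example: a 96x96x3 image becomes 48x48x3 (factor 2) or 24x24x3 (factor 4).
original = np.random.rand(96, 96, 3).astype(np.float32)
low_res_2x = downsample(original, 2)  # shape (48, 48, 3)
low_res_4x = downsample(original, 4)  # shape (24, 24, 3)
```

One property of max pooling worth noting is that it keeps the brightest value in each window, which is consistent with the observation in Section 4 that the downsampled images appear brighter than the originals.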
2.2 Models

In this project, models were trained to enhance the downsampled images obtained in 2.1 into the original ones.

Loss function: The formula for the loss the models sought to minimise is

\[ \mathrm{Loss} = \frac{1}{96 \times 96 \times 3} \sum_{i=1}^{n} -y_i \log(\hat{y}_i) \;+\; \lambda \sum_{j=1}^{m} \left( \sigma_j^2 + \mu_j^2 - \log(\sigma_j^2) - 1 \right) \]

where y_i and ŷ_i are the pixel values of the ground truth and generated images, and μ_j and σ_j² are the mean and variance output by the encoder for latent unit j. The first term represents the reconstruction loss, which measures how close the generated image is to the ground truth; in other words, the model's accuracy. The second term represents the Kullback–Leibler divergence loss, which measures how close the encoded latent vectors are to the standard normal distribution. The lambda value used in this project was 0.1 [16].
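For reference, the loss above can be expressed in code. The following is a minimal TensorFlow/Keras sketch under stated assumptions (batch-first tensors, pixel values in [0, 1], and an encoder that outputs the mean and log-variance of the latent distribution); it mirrors the formula rather than any specific implementation from the project.

```python
# Minimal sketch of the loss above in TensorFlow/Keras. Assumptions:
# y_true / y_pred are batches of 96x96x3 images with values in [0, 1],
# and the encoder outputs z_mean and z_log_var (so sigma^2 = exp(z_log_var)).
import tensorflow as tf

LAMBDA = 0.1  # KL coefficient used in the paper

def vae_loss(y_true, y_pred, z_mean, z_log_var):
    # Reconstruction term: pixel-wise cross-entropy, normalised by 96*96*3.
    # (The paper's formula shows only the y*log(y_hat) term; a full binary
    # cross-entropy would also include (1 - y)*log(1 - y_hat).)
    recon = -tf.reduce_sum(y_true * tf.math.log(y_pred + 1e-7), axis=[1, 2, 3])
    recon = recon / (96 * 96 * 3)
    # KL term: sum over latent units of sigma^2 + mu^2 - log(sigma^2) - 1.
    kl = tf.reduce_sum(tf.exp(z_log_var) + tf.square(z_mean) - z_log_var - 1.0, axis=1)
    return tf.reduce_mean(recon + LAMBDA * kl)
```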
Model parameters: The parameters used for all models, regardless of the experiment, are as follows.

Batch size:                               32
Epochs:                                   60
Optimiser:                                Adam
Learning rate:                            0.001
KL coefficient (lambda):                  0.1
Parameter initializer (default):          He normal
Parameter initializer (variance layer):   Zeros

Model architectures: An illustration of the VAE model is given in Fig 2.2. The encoder starts off with 4 repeated units. Each repeated unit consists of two 3×3 convolutions (with rectified linear unit (ReLU) activations), followed by a 3×3 convolution with stride 2 for downsampling, with the exception of the 4th repeated unit, which has a flattening layer instead. Starting at 32, the number of feature channels is doubled at each downsampling step. The 4 repeated units are followed by a 128-unit dense layer, which is connected to 2 dense layers, one each to represent the mean and variance of the probability distribution describing the encoded latent vector. A sampling layer is then used to randomly sample a latent vector from the distribution. This latent vector is fed into the decoder as input.

Similarly, the decoder consists of a 128-unit dense layer followed by 4 repeated units. Every repeated unit comprises a transpose convolutional layer with stride 2 that doubles the dimensions of the feature map and halves the number of feature channels, followed by a concatenation with the corresponding feature map from the encoder (in the models with shortcuts), and two 3×3 convolutions.

Besides the standard VAE model, a model incorporating the U-net architecture into that of the standard VAE (referred to as the VAE-U-net) was used. U-nets contain shortcuts, which concatenate feature maps in the decoder with their corresponding encoder feature maps. That way, features "learnt" by the encoder at different levels of abstraction can be used by the decoder.

Fig 2.2: Diagram of the standard VAE and VAE-U-net model architectures

Besides vanilla VAE models with 8, 16 and 32 latent units, VAE-U-net models with only shortcut 4 and with all shortcuts (both using 16 latent units), as well as the traditional U-net model, were also tested.
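The following is a compressed Keras sketch of the VAE-U-net idea described above: an encoder with strided 3×3 convolutions, a reparameterisation ("sampling") layer, and a decoder whose transpose convolutions are concatenated with the matching encoder feature maps. The layer counts and sizes here are illustrative, not the exact configuration used in the project (the paper uses He normal initialisation throughout and four encoder units; Keras defaults and two downsampling levels are used here for brevity).

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

LATENT_UNITS = 16

def sampling(args):
    """Reparameterisation trick: z = mu + sigma * eps, with log-variance input."""
    z_mean, z_log_var = args
    eps = tf.random.normal(tf.shape(z_mean))
    return z_mean + tf.exp(0.5 * z_log_var) * eps

inputs = layers.Input(shape=(96, 96, 3))

# --- Encoder: conv blocks with stride-2 downsampling, channels doubling ---
s1 = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)  # 96x96x32
d1 = layers.Conv2D(64, 3, strides=2, padding="same")(s1)              # 48x48x64
s2 = layers.Conv2D(64, 3, activation="relu", padding="same")(d1)      # 48x48x64
d2 = layers.Conv2D(128, 3, strides=2, padding="same")(s2)             # 24x24x128

flat = layers.Flatten()(d2)
h = layers.Dense(128, activation="relu")(flat)
z_mean = layers.Dense(LATENT_UNITS)(h)
z_log_var = layers.Dense(LATENT_UNITS, kernel_initializer="zeros")(h)  # variance layer init to zeros
z = layers.Lambda(sampling)([z_mean, z_log_var])

# --- Decoder: mirror of the encoder, with U-net shortcut concatenations ---
h2 = layers.Dense(24 * 24 * 128, activation="relu")(z)
u0 = layers.Reshape((24, 24, 128))(h2)
u1 = layers.Conv2DTranspose(64, 3, strides=2, padding="same")(u0)     # 48x48x64
u1 = layers.Concatenate()([u1, s2])                                   # shortcut
u1 = layers.Conv2D(64, 3, activation="relu", padding="same")(u1)
u2 = layers.Conv2DTranspose(32, 3, strides=2, padding="same")(u1)     # 96x96x32
u2 = layers.Concatenate()([u2, s1])                                   # shortcut
u2 = layers.Conv2D(32, 3, activation="relu", padding="same")(u2)
outputs = layers.Conv2D(3, 3, activation="sigmoid", padding="same")(u2)

vae_u_net = Model(inputs, [outputs, z_mean, z_log_var])
```

The Concatenate layers implement the shortcuts; removing them recovers the standard VAE decoder, and the number of such connections kept is what distinguishes the "only shortcut 4" and "all shortcuts" variants tested below.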
3. Results and Discussion

3.1 Enhancement of downsampled by 2x images

Fig 3.1: Training and validation loss against epochs for the 8, 16 and 32 latent unit models on downsampled by 2x images

Although the final training loss decreases as the number of latent units increases, the validation loss shows no further decrease beyond 16 latent units, suggesting that 16 is the optimal number of latent units to use. That said, the images generated by all the standard VAE models (shown in Fig 3.4) were far from realistic and detailed. This, coupled with the fact that the SSIM and PSNR values (Table 3.1) actually deteriorated after processing by the models, confirms that a regular VAE is inadequate for enhancing detailed pathology images. All three models' validation losses were much higher than their respective training losses, a sign that the models suffered from overfitting.

Fig 3.2 & 3.3: Training and validation loss against epochs for the standard VAE (16 latent units), VAE-U-net (only shortcut 4), VAE-U-net (all shortcuts), and traditional U-net models on downsampled by 2x images

Images                        SSIM (2)   PSNR (2)   SSIM (4)   PSNR (4)
Downsampled image             0.622      16.5       0.258      11.8
8 latent units VAE            0.122      15.0       -          -
16 latent units VAE           0.123      15.1       -          -
32 latent units VAE           0.125      15.1       -          -
VAE-U-net (only shortcut 4)   0.635      19.6       0.317      16.6
VAE-U-net (all shortcuts)     0.729      20.7       0.337      16.8
U-net (all shortcuts)         0.727      20.7       0.344      16.9

Table 3.1: Structural Similarity Index Measure (SSIM) and Peak Signal to Noise Ratio (PSNR) values of the downsampled images and of the enhanced images from the different models, compared to the original images. The numbers 2 and 4 in brackets indicate that the SSIM/PSNR values in that column correspond to the model's performance when trained and tested on images downsampled by 2 and 4 times respectively.

Fig 3.4: Sample enhanced images generated by the 8, 16 and 32 latent unit standard VAE models, the VAE-U-net models, and the traditional U-net model, compared to the input (downsampled by 2x images) and the target (original images)

Not only were the training and validation losses of the VAE-U-net models much lower than those of the standard models, the enhanced images they produced were far closer to the originals. Moreover, the VAE-U-net models' training and validation loss graphs coincide, meaning that the models generalised to unseen data rather than overfitting. The SSIM and PSNR values tell a similar story, improving after the models were used to process the downsampled images. The fact that the VAE-U-net (all shortcuts) model gave rise to a much greater improvement in SSIM values than the marginal one produced by the VAE-U-net (only shortcut 4) shows that the use of shortcuts is indeed what has driven the enhancement of the downsampled images.

Moreover, not only are the VAE-U-net (all shortcuts) model's losses just as low as those of the traditional U-net model, its SSIM and PSNR values are just as high as the traditional U-net model's as well. This means that the VAE-U-net model enjoys the benefit of probabilistic modelling without any expense to reconstruction accuracy. Refer to Figure 3.4 for samples of the enhanced images.

3.2 Enhancement of downsampled by 4x images

Fig 3.5 & 3.6: Training and validation loss against epochs for the standard VAE (16 latent units), VAE-U-net (only shortcut 4 and all shortcuts), and traditional U-net models on downsampled by 4x images

Fig 3.7: Sample enhanced images generated by the VAE-U-net (only shortcut 4 and all shortcuts) and traditional U-net models, compared to the input (downsampled by 4x images) and the target (original images)

The results for the models trained to enhance downsampled by 4x images were similar to those for the downsampled by 2x images. Firstly, increasing the number of connections led to a lower final training loss while making a marginal, if any, improvement in validation loss. Secondly, the training and validation losses of the VAE-U-net (all shortcuts) and the traditional U-net models were completely identical. However, these models did not perform as well as their downsampled by 2x counterparts, both in terms of training and validation losses and in the accuracy of the images produced. On top of that, the validation losses of all the models were higher than the corresponding training losses, indicating that, unlike the downsampled by 2x models, they had all suffered from overfitting. Since the VAE-U-net (only shortcut 4) model's validation loss graph fluctuated much more than those of the other 2 models, it has likely overfitted much more as well.

As with the downsampled by 2 times models, the use of all three models helped to improve the SSIM values, though to a much smaller extent (refer to Table 3.1). This could indicate that the models become increasingly less able to enhance images as the downsampled images to be enhanced become of lower and lower resolution. As such, even though no experiments were done to test this, it is suspected that such models will yield little to no meaningful improvement in resolution for images downsampled beyond 4 times. That being said, the improvement in PSNR values brought about by the models was similar to that of the downsampled by 2 times models. Refer to Figure 3.7 for samples of the enhanced images.
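For completeness, metrics like those in Table 3.1 can be computed with standard library implementations. Below is a minimal sketch using scikit-image, assuming float RGB images in [0, 1]; the exact settings used to produce Table 3.1 are not specified in this report.

```python
# Sketch of how SSIM/PSNR comparisons like those in Table 3.1 can be computed
# with scikit-image (>= 0.19 for channel_axis). Assumes float RGB images in [0, 1].
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate_pair(original, enhanced):
    """Return (SSIM, PSNR) for one original/enhanced image pair."""
    ssim = structural_similarity(original, enhanced, channel_axis=-1, data_range=1.0)
    psnr = peak_signal_noise_ratio(original, enhanced, data_range=1.0)
    return ssim, psnr

def evaluate_set(originals, enhanced_images):
    """Average SSIM/PSNR over a validation set, as reported per model."""
    scores = [evaluate_pair(o, e) for o, e in zip(originals, enhanced_images)]
    return tuple(np.mean(scores, axis=0))
```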
4. Limitations and Future Work

As shown in section 3.1, VAEs alone failed to enhance the resolution of the downsampled images. This could have to do with the fact that Gaussian distributions were used to estimate the distribution of the latent vectors of the images, which is problematic because natural images, especially pathology ones, have distributions far more complex than that [17]. This hypothesis could be tested by using the same vanilla VAEs to model and enhance images of greater simplicity, such as those in the MNIST dataset. Even so, it is highly unusual that the images produced by VAEs alone were much worse than the initial downsampled images.

The research community has suggested various ways of tackling this issue. These include Infusion training, where a Markov chain is trained to gradually converge to the data distribution [18], as well as Deep Feature Consistent VAEs, which are trained with a loss function that first feeds the original and reconstructed images into a pre-trained convolutional neural network (CNN) to extract higher level features and then compares these features to compute a so-called perceptual loss [19] (a sketch of this perceptual loss is given at the end of this section).

Even though no experiments were done on images downsampled beyond 4 times, given that the downsampled by 4 models only managed to achieve minor resolution enhancement, it is likely that even the models described in the earlier paragraph will not be able to enhance images downsampled to such an extent. At that point, the images simply lack sufficient information for enhancement to the original to be possible. That being said, the models' poor performance on downsampled by 4 images might in part be attributed to the method by which the downsampled by 4 images were generated from the originals. The downsampled images are noticeably brighter than the originals, which could be a sign that the downsampling operation used in this project is not sophisticated enough to produce fully realistic downsampled images.
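The perceptual loss of the Deep Feature Consistent VAE mentioned above can be sketched as follows. This is a minimal illustration in the spirit of [19], not the configuration from that paper: it assumes a pre-trained VGG19 as the feature extractor, and the choice of feature layer (block3_conv3) is arbitrary.

```python
# Minimal sketch of a perceptual ("deep feature consistent") loss in the
# spirit of [19]. Assumptions: inputs are 96x96x3 images in [0, 1]; VGG19
# pre-trained on ImageNet is the feature extractor; block3_conv3 is an
# illustrative (not prescribed) choice of feature layer.
import tensorflow as tf
from tensorflow.keras.applications.vgg19 import VGG19, preprocess_input
from tensorflow.keras import Model

_vgg = VGG19(include_top=False, weights="imagenet", input_shape=(96, 96, 3))
_features = Model(_vgg.input, _vgg.get_layer("block3_conv3").output)
_features.trainable = False

def perceptual_loss(y_true, y_pred):
    """Mean squared distance between CNN features of original and reconstruction."""
    f_true = _features(preprocess_input(y_true * 255.0))  # VGG expects 0-255 inputs
    f_pred = _features(preprocess_input(y_pred * 255.0))
    return tf.reduce_mean(tf.square(f_true - f_pred))
```

Such a term would replace or complement the pixel-wise reconstruction term in the loss from Section 2.2, rewarding reconstructions that match the original at the level of learned features rather than raw pixels.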
5. Conclusion

In this project, VAE-based probabilistic models were successfully used to enhance the resolution of pathology images. As expected, VAEs alone were inadequate. What was surprising, however, was that the images produced by VAEs alone were much worse than the initial downsampled images. That said, the incorporation of the U-net architecture managed to substantially make up for this weakness. Future work includes using methods beyond U-nets to augment vanilla VAEs for better image resolution enhancement.

References

[1] Paul Fontelo, John Faustorilla, Alex Gavino, and Alvin Marcelo (2012), "Digital pathology – implementation challenges in low-resource countries", Analytical Cellular Pathology, 35(1), 31-36, https://doi.org/10.3233/ACP-2011-0024

[2] Hallgrímur Benediktsson, John Whitelaw, Indrojit Roy (2007), "Pathology Services in Developing Countries: A Challenge", Arch Pathol Lab Med, 131(11), 1636–1639, https://doi.org/10.1043/1543-2165(2007)131[1636:PSIDCA]2.0.CO;2

[3] Hartman, D., Pantanowitz, L., McHugh, J. et al. (2017), "Enterprise Implementation of Digital Pathology: Feasibility, Challenges, and Opportunities", J Digit Imaging, 30, 555–560, https://doi.org/10.1007/s10278-017-9946-9

[4] C. Higgins (2015), "Applications and challenges of digital pathology and whole slide imaging", Biotechnic & Histochemistry, 90(5), 341-347, doi: 10.3109/10520295.2015.1044566

[5] Kazeminia, S., Baur, C., Kuijper, A., van Ginneken, B., Navab, N., Albarqouni, S., & Mukhopadhyay, A. (2020), "GANs for Medical Image Analysis", Artificial Intelligence in Medicine, 109, https://doi.org/10.1016/j.artmed.2020.101938

[6] D. Bau et al. (2019), "Seeing What a GAN Cannot Generate", IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), pp. 4501-4510, doi: 10.1109/ICCV.2019.00460

[7] Zhaoyu Zhang, Mengyan Li, and Jun Yu (2018), "On the convergence and mode collapse of GAN", In SIGGRAPH Asia 2018 Technical Briefs (SA '18), Association for Computing Machinery, New York, NY, USA, Article 21, 1–4, https://doi.org/10.1145/3283254.3283282

[8] H. Thanh-Tung and T. Tran (2020), "Catastrophic forgetting and mode collapse in GANs", 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, United Kingdom, pp. 1-10, doi: 10.1109/IJCNN48605.2020.9207181

[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio (2020), "Generative adversarial networks", Communications of the ACM, 63(11), 139–144, https://doi.org/10.1145/3422622

[10] Kingma, Diederik P. and Welling, Max (2013), "Auto-Encoding Variational Bayes", In the 2nd International Conference on Learning Representations (ICLR)

[11] V. Edupuganti, M. Mardani, S. Vasanawala and J. Pauly (2020), "Uncertainty Quantification in Deep MRI Reconstruction", IEEE Transactions on Medical Imaging, https://doi.org/10.1109/TMI.2020.3025065

[12] Jinwon An, Sungzoon Cho (2015), "Variational Autoencoder based Anomaly Detection using Reconstruction Probability", SNU Data Mining Center

[13] Y. Kawachi, Y. Koizumi and N. Harada (2018), "Complementary Set Variational Autoencoder for Supervised Anomaly Detection", 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, pp. 2366-2370, doi: 10.1109/ICASSP.2018.8462181

[14] Zijing Luo, Yihui Xiong, Renguang Zuo (2020), "Recognition of geochemical anomalies using a deep variational autoencoder network", Applied Geochemistry, Volume 122

[15] B. S. Veeling, J. Linmans, J. Winkens, T. Cohen, M. Welling (2018), "Rotation Equivariant CNNs for Digital Pathology", arXiv:1806.03962

[16] Andermatt S., Horváth A., Pezold S., Cattin P. (2019), "Pathology Segmentation Using Distributional Differences to Images of Healthy Origin", In: Crimi A., Bakas S., Kuijf H., Keyvan F., Reyes M., van Walsum T. (eds) Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries, BrainLes 2018, Lecture Notes in Computer Science, vol 11383, Springer, Cham, https://doi.org/10.1007/978-3-030-11723-8_23

[17] Shengjia Zhao, Jiaming Song, Stefano Ermon (2017), "Towards a Deeper Understanding of Variational Autoencoding Models"

[18] Florian Bordes, Sina Honari, Pascal Vincent (2017), "Learning to Generate Samples from Noise through Infusion Training", In the 5th International Conference on Learning Representations (ICLR)

[19] Xianxu Hou, Linlin Shen, Ke Sun, Guoping Qiu (2016), "Deep Feature Consistent Variational Autoencoder"