GENERATIVE ADVERSARIAL NETWORKS FOR IMAGE-TO-IMAGE TRANSLATION GENERATIVE ADVERSARIAL NETWORKS FOR IMAGE-TO-IMAGE TRANSLATION Edited by ARUN SOLANKI Assistant Professor, Department of Computer Science and Engineering, Gautam Buddha University, Greater Noida, India ANAND NAYYAR Lecturer, Researcher and Scientist, Duy Tan University, Da Nang, Viet Nam MOHD NAVED Assistant Professor, Analytics Department, Jagannath University, Delhi NCR, India Academic Press is an imprint of Elsevier 125 London Wall, London EC2Y 5AS, United Kingdom 525 B Street, Suite 1650, San Diego, CA 92101, United States 50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom Copyright © 2021 Elsevier Inc. All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions. This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein). Notices Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary. Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility. To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein. Library of Congress Cataloging-in-Publication Data A catalog record for this book is available from the Library of Congress British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library ISBN: 978-0-12-823519-5 For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals Publisher: Mara Conner Acquisitions Editor: Chris Katsaropoulos Editorial Project Manager: Gabriela D. Capille Production Project Manager: Niranjan Bhaskaran Cover Designer: Christian J. Bilbow Typeset by SPi Global, India Contributors Er. 
Aarti Department of Computer Science & Engineering, Lovely Professional University, Phagwara, Punjab, India Supavadee Aramvith Multimedia Data Analytics and Processing Research Unit, Department of Electrical Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand Tanvi Arora Department of CSE, CGC College of Engineering, Landran, Mohali, Punjab, India Betul Ay Firat University Computer Engineering Department, Elazig, Turkey Galip Aydin Firat University Computer Engineering Department, Elazig, Turkey Junchi Bin University of British Columbia, Kelowna, BC, Canada Erik Blasch MOVEJ Analytics, Dayton, OH, United States Udaya Mouni Boppana Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia Najihah Chaini Faculty of Applied Sciences and Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia Amir H. Gandomi University of Technology Sydney, Ultimo, NSW, Australia Aashutosh Ganesh Radboud University, Nijmegen, The Netherlands Thittaporn Ganokratanaa Department of Electrical Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand xi xii Contributors Koshy George SRM University—AP, Guntur District, Andhra Pradesh, India Meenu Gupta Department of Computer Science & Engineering, Chandigarh University, Ajitgarh, Punjab, India Álvaro S. Hervella CITIC Research Center; VARPA Research Group, Biomedical Research Institute of A Coruña (INIBIC), University of A Coruña, A Coruña, Spain Kavikumar Jacob Faculty of Applied Sciences and Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia Rachna Jain Department of Computer Science & Engineering, Bharati Vidyapeeth’s College of Engineering, Delhi, India S. Jayalakshmy IFET College of Engineering, Villupuram, India Leta Tesfaye Jule Department of Physics, College of Natural and Computational Science; Centre for Excellence in Indigenous Knowledge, Innovative Technology Transfer and Entrepreneurship, Dambi Dollo University, Dambi Dollo, Ethiopia A. Sampath Kumar Department of Computer Science and Engineering, Dambi Dollo University, Dambi Dollo, Ethiopia Meet Kumari Department of Electronics & Communication Engineering, Chandigarh University, Ajitgarh, Punjab, India Lakshay Department of Computer Science & Engineering, Bharati Vidyapeeth’s College of Engineering, Delhi, India Zheng Liu University of British Columbia, Kelowna, BC, Canada H.R Mamatha Department of CSE, PES University, Bengaluru, India Omkar Metri Department of CSE, PES University, Bengaluru, India Contributors Aida Mustapha Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia D. Nagarajan Department of Mathematics, Hindustan Institute of Technology and Science, Chennai, India Jorge Novo CITIC Research Center; VARPA Research Group, Biomedical Research Institute of A Coruña (INIBIC), University of A Coruña, A Coruña, Spain Marcos Ortega CITIC Research Center; VARPA Research Group, Biomedical Research Institute of A Coruña (INIBIC), University of A Coruña, A Coruña, Spain Lakshmi Priya Manakula Vinayaga Institute of Technology, Pondicherry, India Krishnaraj Ramaswamy Centre for Excellence in Indigenous Knowledge, Innovative Technology Transfer and Entrepreneurship; Department of Mechanical Engineering, Dambi Dollo University, Dambi Dollo, Ethiopia S. 
Mohamed Mansoor Roomi Department of Electronics and Communication Engineering, Thiagarajar College of Engineering, Madurai, India Jose Rouco CITIC Research Center; VARPA Research Group, Biomedical Research Institute of A Coruña (INIBIC), University of A Coruña, A Coruña, Spain Angel D. Sappa ESPOL Polytechnic University, CIDIS-FIEC, Guayaquil, Ecuador; Computer Vision Center, Edifici O, Campus UAB, Bellaterra, Barcelona, Spain K. Saruladha Pondicherry Engineering College, Department of Computer Science and Engineering, Puducherry, India A. Sasithradevi School of Electronics Engineering, VIT University, Chennai, India R. Sivaranjani Department of Electronics and Communication Engineering, Sethu Institute of Technology, Madurai, India Rituraj Soni Department of CSE, Engineering College Bikaner, Bikaner, Rajasthan, India xiii xiv Contributors S. Sountharrajan School of Computing Science and Engineering, VIT Bhopal University, Bhopal, India Patricia L. Suárez ESPOL Polytechnic University, CIDIS-FIEC, Guayaquil, Ecuador Gnanou Florence Sudha Pondicherry Engineering College, Pondicherry, India E. Thirumagal Pondicherry Engineering College, Department of Computer Science and Engineering, Puducherry; REVA University, Bengaluru, India Boris X. Vintimilla ESPOL Polytechnic University, CIDIS-FIEC, Guayaquil, Ecuador N. Yuuvaraj Research and Development, ICT Academy, Chennai, India Ran Zhang University of British Columbia, Kelowna, BC, Canada CHAPTER 1 Super-resolution-based GAN for image processing: Recent advances and future trends Meenu Guptaa, Meet Kumarib, Rachna Jainc, and Lakshayc a Department of Computer Science & Engineering, Chandigarh University, Ajitgarh, Punjab, India Department of Electronics & Communication Engineering, Chandigarh University, Ajitgarh, Punjab, India c Department of Computer Science & Engineering, Bharati Vidyapeeth’s College of Engineering, Delhi, India b 1.1 Introduction One can think of the term machine as older than the computer itself. In 1950, the computer scientist, logician, and mathematician Alan Turing penned a paper for the generations to come, “Computing Machinery and Intelligence” [1]. Today, computers can not only match humans but have outperformed them completely. Sometimes people think about not achieving the superhumanity face recognition or cleaning the medical image of the patient accurately, even for the small algorithm, as a machine learning algorithm is the best at pattern reorganization in existing image data using features for tasks such as classification and regression. When we try to generate new data, however, the computer has struggled [2]. An algorithm can easily defeat a chess grandmaster, classify whether a transaction is fraudulent or not, and classify in a medical report whether the given medical report has any disease or not, but fail on humanity’s most basic and essential capacities—including crafting an original creation or a pleasant conversation. Mahdizadehaghdam et al. [3] proposed some tests named the Imitation game, also known as the Turing Test. Behind a closed door, an unknown observer talks with two counterparts means a computer and a human. In 2014, all of the above problems were solved when Ian Goodfellow invented generative adversarial networks (GANs). This technique has enabled computers to generate realistic data by using two separate neural networks. Before GANs, different ways have been proposed by the programmer to analyze the generated data. But the result received from the generated data was not up to the mark. 
When GANs were introduced the first time, it showed a remarkable result as there was no difference between the generated fake images or photograph-image and gave the same result as the real-world-like quality. GANs turn scribbled images to a photograph-like image [4]. Generative Adversarial Networks for Image-to-Image Translation https://doi.org/10.1016/B978-0-12-823519-5.00030-0 Copyright © 2021 Elsevier Inc. All rights reserved. 1 2 Generative adversarial networks for image-to-Image translation Fig. 1.1 Improves realism of the image as general adversarial networks varies [4]. In recent years, how far GANs have changed the meaning of generating or improving the real image is shown in Fig. 1.1. Fig. 1.1 was first produced by GAN in 2014 and shows how human faces continuously improve in generating fake images. The machine could produce as a blurred image, and even that achievement is celebrated as a success. In just the next 3 years, we could not classify which is fake or which qualifies as high-resolution portrait photographs [5]. GANs are a category of machine learning techniques that uses two simultaneously trained models: the first is the generator to generate fake data and the discriminator is used to discrete the raw data from the real dataset images. The word generative indicates creating new data from the given data. GAN generates the data which learn from the choice of the given training set. The term adversarial points to maintaining the dynamic between the two models that are the generator and the discriminator. Here two networks are continually trying to trick the other as the generator generates better fake images to get convincing data. The better discriminator is trying to distinguish the real data examples from the fake generated ones. The word networks indicates the class of machine models. The generator and the discriminator commonly use the neural network. As a complex, the neural network is more complex than the implementation of GAN [6]. GAN has two models. First, it works where we put the input and then we get the output. The goal is to form two models that combine and run simultaneously so that the first discriminator receives input from the real data that come from the training dataset, and the second time onward there are two input sources that are the actual data and the fake examples coming from the given generator. A random number vectors is passed through the generator. The output acquired from the generator is Fake examples that try to convince as far as possible the real data. The discriminator predicted the probability of the input real. The main purpose of creating two models separately is to overcome the problem of fake data that is generated from the training dataset. The discriminator’s goal is to differentiate between the fake data generated from the generator and the real input example from the dataset. This section further discusses the training parts of the discriminator and the generator in Sections 1.1.1 and 1.1.2 [7]. Super-resolution-based GAN for image processing (a) (c) and (d) x x* (b) Discriminator Classification error Generator z Fig. 1.2 Train the discriminator. 1.1.1 Train the discriminator Fig. 1.2 discusses the trained model of the discriminator and the steps are as follows [8]: (a) First, get a random real example x from the given training dataset. (b) Now get a new random vector z and, utilizing the generator network, synthesize a fake example as x*. (c) Utilize the discriminator network to distinguish between x* and x. 
(d) Find the classification error and back-propagate. Then try to minimize the classification error to update the discriminator biases and weight. 1.1.2 Train the generator Fig. 1.3 shows the trained model of the generator as you can see the labeling of these steps as follows: (a) First, choose a random new image from the dataset as vector z, using a generator to create an x*, i.e., a fake example. (b) It used a discriminator to categorize real and fake examples. (c) Find the classification error and back-propagate. Then try to minimize the classification error due to which the total error to renovate the discriminator biases and weight [9]. 1.1.3 Organization of the chapter This chapter is further classified into different sections. Section 1.2 discusses the background study of this work and also different research views. Section 1.3 discusses the SR GAN model for image processing. Section 1.4 discusses the application-based GAN case studies to enhance object detection. Section 1.5 discusses the open issues and challenges faced in the working of GAN. Section 1.6 concludes the chapter with its future scope. 3 4 Generative adversarial networks for image-to-Image translation (b) and (c) r x x` (a) Discriminator Classification error Generator z Fig. 1.3 Train the generator. 1.2 Background study The goal of Perera et al. [10] is to determine whether the given query is from the same class or different class. Their solution is based on learning the latent representation of an in-class example using a de-noising and auto-encoder network. This method gives good results for COIL, MINST f-MINST dataset. They give a new thinking to GANs as in face recognition we create fake images, which help to identity theft and privacy breaches. In this chapter, they proposed a technique to recognize the forensic face. They use the deep face recognition system as a core for their model and create fake images repeatedly to help data augmentation [11]. Tripathy et al. [12] present a generic face animator that controls the pose and expression using a given face image. They implemented a two-stage neural network model, which is learned in a self-supervised manner. Rathgeb et al. [13] proposed a supervised deep learning algorithm using CNNs to detect synthetic images. The proposed algorithm gives an accuracy of 99.83 for distinguishing real images from dataset and fake images generated using GANs. Yu et al. [14] have shown advances to complete a new level as they proposed a method to visualize forensic and model attribution. The model supports image attribution, enables fine-grained model authentication, persists across different image frequencies, fingerprint frequencies and paths, and is not biased. Lian et al. [15] provide a guidance module that is introduced in FG-SRGAN, which is utilized to reduce the space of possible mapping functions and helps to learn the correct Super-resolution-based GAN for image processing mapping function from a low-resolution domain to a high-resolution domain. The guidance module is used to greatly reduce adversarial loss. Takano and Alaghband [16] proposed the SRGAN model. In this chapter, they solved the problem of sharpening the images. It can give a slight hint on how the real image looks like from the blurry image as they convert a low-resolution image to a high-resolution image. Dou et al. [17] proposed PCA-GAN, which greatly improves the performance of GAN-based models on super-resolving face images. 
The model focuses on cumulative discrimination in the orthogonal projection space spanned by PCA projection to details into the discriminator. Jiang et al. [18] proposed to improve the perception of the CT image using SRGAN, which leads to greatly enhance the spatial resolution of the image, as perception increases the disease analysis on a tiny portion of areas and pathological features. They introduced a diluted convolution module. The mean structural similarity (MSSIM) loss is also introduced to improve the perceptual loss function. Li et al. [19] provide an improvement method of SRGAN and a solution for the problem of image distortion in textile flow detection, a super-resolution image reconstruction. Here the result of an experiment shows that the PNSR of SRGAN is 0.83 higher than that of the Bilinear, and the SSIM is higher than 0.0819. SRGAN can get a clearer image and reconstruct a richer texture, with more high-frequency details, and that is easier to identify defects, which is important in the flaw detection of fabrics. Wang et al. [20] decided to use dense convolutional network blocks (dense blocks), which connect each layer to every other layer in a feed-forward manner as our very deep generator networks. GAN solves the problem of spectral normalization as the method offers better training stability and visual improvements. Nan et al. [21] solved the complex computation, unstable network, and slow learning speed problems of a generative adversarial network for image super-resolution (SRGAN). We proposed a single image super-resolution reconstruction model called Res_WGAN based on ResNeXt. Li et al. [22] discussed an edge-enhanced super-resolution network (EESR), which proposed better generation of high-frequency structures in blind super-resolution. EESR is able to recover textures with 4 times upsampling and gained a PTS of 0.6210 on the DIV2K test set, which is much better than the state-of-the-art methods. Sood et al. [23] worked on magnetic resonance (MR) images to obtain highresolution images for which the patients have to wait for a long time in a still state. Obtaining low-resolution images and then converting them to high-resolution images uses four models: SRGAN, SRCNN, SRResNet, and sparse representation; among them, SRGAN gives the best result. 5 6 Generative adversarial networks for image-to-Image translation Lee et al. [24] present a super-resolution model specialized for license plate images, CSRGAN, trained with a novel character-based perceptual loss. Specifically, they focus on character-level recognizability of super-resolved images rather than pixel-level reconstruction. Chen et al. [25] divided the technique into two different parts: the first one is to improve PSNR and the second one is to improve visual quality. They propose a new dense block, which uses complex connections between each layer to build a more powerful generator. Next, to improve perceptual quality, they found a new set of feature maps to compute the perceptual loss, which would make the output image look more real and natural. Jeon et al. [26] proposed a method to increase the similarity between pixels by performing the operation of the ResNet module, which has an effect similar to that of the ensemble operation. That gives a better high-resolution image. As the resolutions of remote sensing images are low, to improve the performance we required high-level resolution. In this chapter, they first optimize the generator and residual-in-residual dense without BN (batch normalization) is used. 
First, a relativistic GAN is introduced and then the perceptual loss is improved [27].

1.3 SR-GAN model for image processing

Image super-resolution is the task of increasing the size of an image while keeping the loss of quality to a minimum, i.e., creating a high-resolution image from a low-resolution image using the details of the original image. The problem is difficult because, for a given low-resolution input, multiple high-resolution solutions exist. SR-GAN has numerous applications such as medical image processing, satellite imagery, aerial image analysis, etc. [28].

1.3.1 Architecture of SR-GAN

Many single-image super-resolution methods are fast and accurate, yet they still fail to recover the texture of the original image. The aim is to recover the low-resolution image in such a way that the produced image is not distorted; some of the resulting errors can be corrected afterward, but not all of them. The main shortcoming is that a result can have a high peak signal-to-noise ratio (PSNR), and therefore appear to be of good quality, while still lacking high-frequency details. Earlier methods also measure similarity in pixel space, which leads to blurry, unsatisfying images. For this reason SR-GAN was introduced, a model that can capture the perceptual difference between the ground-truth image and the model output. Fig. 1.4 shows the architecture of SRGAN [29]. (Fig. 1.4 Architecture of SRGAN [29]: the generator maps the LR image to an SR image, and the discriminator compares SR and HR images through a content loss and a GAN loss.)

The training algorithm of SRGAN consists of the following steps:
(a) Downsample the HR (high-resolution) images to obtain LR (low-resolution) samples; both LR and HR images are required for training.
(b) Pass the LR images through the generator, which upsamples them and produces SR (super-resolution) images.
(c) Classify SR and HR images by passing them through the discriminator, and backpropagate the losses [30].

Fig. 1.5 presents the generator and discriminator networks. They contain convolution layers, parametric ReLU (PReLU), and batch normalization. The generator also implements skip connections similar to ResNet [31].

1.3.2 Network architecture

Very deep networks are difficult to train; the residual learning framework makes training easier and enables the networks to go substantially deeper, improving performance. In the generator, a total of 16 residual blocks are used [32]. A subpixel (pixel-shuffle) layer is used in the generator for upsampling the feature maps: each time pixel shuffle is applied, it rearranges the elements of an L*B*(H*r*r) tensor into an (rL)*(rB)*H tensor. To reduce computation, the bicubic filter has been removed from the pipeline. Parametric ReLU is used instead of ReLU or LeakyReLU; PReLU adds a learnable parameter, so the coefficient of the negative part is learned adaptively. The convolution layer "k3n64s1" denotes 3*3 kernel filters outputting 64 channels with stride 1. Similarly, "k3n256s1" and "k9n3s1" are other convolution layers added [33]. (Fig. 1.5 SRGAN-based model architecture for seismic images [30]: the generator stacks residual blocks of convolution, batch normalization, and ReLU with elementwise-sum skip connections; the discriminator stacks convolution, batch normalization, LeakyReLU, and dense layers; conditional masks may be fused early, in the middle, or late; notation: k = kernel size, n = number of channels, s = stride.) A brief code sketch of these building blocks follows.
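To make the layer notation above concrete, the following is a minimal PyTorch-style sketch of the generator building blocks described in Section 1.3.2: a residual block with PReLU and batch normalization, and a pixel-shuffle upsampling block. The channel counts and block counts follow the "k3n64s1"-style notation quoted above, but the exact details here are an assumption for illustration, not the authors' reference implementation.

```python
# Illustrative sketch (assumed details), not the authors' reference SRGAN code.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """k3n64s1 -> BN -> PReLU -> k3n64s1 -> BN, with an elementwise-sum skip connection."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(channels),
            nn.PReLU(),  # learnable coefficient for the negative part
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)  # ResNet-style skip connection

class UpsampleBlock(nn.Module):
    """k3n256s1 -> PixelShuffle(2) -> PReLU: rearranges (C*r*r, H, W) into (C, rH, rW)."""
    def __init__(self, channels=64, scale=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels * scale * scale, kernel_size=3, stride=1, padding=1),
            nn.PixelShuffle(scale),
            nn.PReLU(),
        )

    def forward(self, x):
        return self.body(x)

class Generator(nn.Module):
    """k9n64s1 head, 16 residual blocks, two 2x upsampling stages (4x total), k9n3s1 tail."""
    def __init__(self, num_blocks=16):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(3, 64, kernel_size=9, stride=1, padding=4), nn.PReLU())
        self.blocks = nn.Sequential(*[ResidualBlock(64) for _ in range(num_blocks)])
        self.mid = nn.Sequential(nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(64))
        self.upsample = nn.Sequential(UpsampleBlock(64, 2), UpsampleBlock(64, 2))
        self.tail = nn.Conv2d(64, 3, kernel_size=9, stride=1, padding=4)

    def forward(self, lr):
        x = self.head(lr)
        x = self.mid(self.blocks(x)) + x  # long skip connection over all residual blocks
        return self.tail(self.upsample(x))

# Example: a 24x24 LR patch becomes a 96x96 SR patch.
# sr = Generator()(torch.randn(1, 3, 24, 24))  # -> shape (1, 3, 96, 96)
```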
1.3.3 Perceptual loss

The perceptual loss $L^{SR}$, shown in Eq. (1.1), builds on the commonly used mean-square-error model but evaluates the solution with respect to perceptually relevant characteristics. Here $L^{SR}_X$ denotes the content loss and the last term is the adversarial loss; the weighted sum of the two gives the perceptual loss (with a VGG-based content loss):

$L^{SR} = L^{SR}_X + 10^{-3}\, L^{SR}_{Adv}$   (1.1)

1.3.3.1 Content loss

The pixel-wise mean square error loss is evaluated as

$L^{SR}_{MSE} = \dfrac{1}{r^2 L B} \sum_{x=1}^{rL} \sum_{y=1}^{rB} \big( I^{HR}_{x,y} - G_{\theta_G}(I^{LR})_{x,y} \big)^2$   (1.2)

Eq. (1.2) is the most widely used optimization target for image SR, on which many state-of-the-art approaches depend [34, 35]. However, while achieving particularly high PSNR, solutions of MSE optimization problems often lack high-frequency content, which results in perceptually unsatisfying, overly smooth surfaces [36]. Rather than relying on pixel-wise losses, we build on the ideas of Shi et al. [35], Denton et al. [37], and Ledig et al. [4] and use a loss function that is closer to perceptual similarity. We define the VGG loss based on the ReLU activation layers of the pretrained 19-layer VGG network described in Simonyan and Zisserman [38]. With $\phi_{i,j}$ denoting the feature map obtained by the j-th convolution before the i-th max-pooling layer within VGG19, the VGG loss is the Euclidean distance between the feature representations of the reconstructed image $G_{\theta_G}(I^{LR})$ and the reference image $I^{HR}$, as shown in Eq. (1.3) [39]:

$L^{SR}_{VGG/i,j} = \dfrac{1}{L_{i,j} B_{i,j}} \sum_{x=1}^{L_{i,j}} \sum_{y=1}^{B_{i,j}} \big( \phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}\big(G_{\theta_G}(I^{LR})\big)_{x,y} \big)^2$   (1.3)

Here $L_{i,j}$ and $B_{i,j}$ describe the length and the breadth of the respective feature map within the VGG network.

1.3.3.2 Adversarial loss

In addition to the content loss, the generative component of the GAN is added to the perceptual loss. It encourages the network to favor solutions that reside on the manifold of natural images by attempting to fool the discriminator. The generative loss $L^{SR}_{Adv}$ is defined on the basis of the probabilities the discriminator assigns to the generated images over all training samples, as shown in Eq. (1.4) [40]:

$L^{SR}_{Adv} = \sum_{n=1}^{N} -\log D_{\theta_D}\big(G_{\theta_G}(I^{LR})\big)$   (1.4)

where $D_{\theta_D}(G_{\theta_G}(I^{LR}))$ is the probability that the discriminator classifies the generated image $G_{\theta_G}(I^{LR})$ as a real high-resolution image. For better gradient behavior, we minimize $-\log D_{\theta_D}(G_{\theta_G}(I^{LR}))$ rather than $\log(1 - D_{\theta_D}(G_{\theta_G}(I^{LR})))$ [41].

1.4 Case study

This section presents different case studies: the application of EE-GAN to enhance object detection, edge-enhanced GAN for remote sensing images, the application of SRGAN to video surveillance and forensics, and super-resolution of video using SRGAN.

1.4.1 Case study 1: Application of EE-GAN to enhance object detection

Detection performance for small objects in remote sensing images has not been as good as for large objects, especially in noisy and low-resolution images.
Thus, enhanced super-resolution GAN (ESRGAN) provides significant image enhancement 9 10 Generative adversarial networks for image-to-Image translation output. However, reconstructed images generally lose high-frequency edge data. Thus, object detection performance gives small objects decrement on low-resolution and noisy remote sensing images. Thus, residual-in-residual dense blocks (RRDB) for both the EEN and ESRGAN and EEN, for the detector system used a high-speed region-based convolutional network (FRCNN) as well as a single-shot detector (SSD) [42]. 1.4.2 Case study 2: Edge-enhanced GAN for remote sensing image The recent super-resolution (SR) techniques that are dependent on deep learning have provided significant comparative merits. Still, they remain not desirable in highfrequency edge details for the recovery of pictures in noise-contaminated image conditions, such as remote sensing satellite images. Thus, a GAN-based edge-enhancement network (EEGAN) is used for reliable satellite image SR reconstruction with the adversarial learning method, which is noise insensitive. Especially EEGAN comprises two primary subnetworks: an edge-enhancement subnetwork (EESN) and an ultra-dense subnetwork (UDSN). First, in UDSN, 2-D dense blocks are collected for feature extraction to gain an intermediate image in high-resolution result, which looks sharp but offers artifacts and noise. After that, EESN is generated to enhance and extract the image contours by purifying the noise-contaminated components with mask processing. The recovered enhanced edges and intermediate image can be joined to produce high credibility and clear content results. Extensive experiments on Jilin-1 video satellite images, Kaggle Open Source Data set as well as Digital globe provide a more optimum reconstruction performance than previous SR methods [43]. 1.4.3 Case study 3: Application of SRGAN on video surveillance and forensic application Person reidentification (REID) is a significant work in forensics and video applications. Several past methods are based on a primary assumption that several person images have sufficiently high and uniform resolutions. Several scale mismatching and low resolution always present in the open-world REID. This is known as scale-adaptive low-resolution person re-identification (SALR-REID). The intuitive method to address this issue is to improve several low resolutions to a high resolution uniformly. Thus, SRGAN is one of the popular image super-resolution deep networks constructed with a fixed upscaling parameter. But it is yet not suitable for SALR-REID work that requires a network not only to synthesize image features for judging a person’s identity but also to enhance the capability of scale-adaptive upscaling. We group multiple SRGANs in series to supplement the ability of image feature representation as well by plugging in an identification network. Thus, a cascaded super-resolution GAN (CSRGAN) framework with a unified formulation can be used [44]. Super-resolution-based GAN for image processing 1.4.4 Case study 4: Super-resolution of video using SRGAN SRGAN techniques are used to improve the image quality. There are several methods of image transformation where the computing system gets input and sends it in the output image. GAN is the deep neural network that consists of two networks, discriminator and generator. GANs are about designing, such as portrait drawing or symphony composition. SRGAN gives various merits over methods. 
It proposes a perceptual loss factor that comprises the merits of content and adversarial losses. Here the discriminator block discriminates between real HR images from produced super-resolved images [45], while the generator function is used for propagating model training. Adversarial loss function utilizes a discriminator network that is trained to discriminate already between the two pictures. However, content loss function utilizes perceptual similarity despite the pixel space similarity. The superior thing about SRGAN is that it produces the same data as real data. SRGANs learn the representations that are internal to produce upscale images [46]. The neural network is faithful in photo-realistic textures that are recovered from downgraded images. The SRGAN methods demit with a high peak to signal noise ratio but also give high visual perception and efficiency. Joining the adversarial and perceptual loss will produce a high-quality, super-resolution image. Moreover, the training phase perceptual losses evaluate image similarities robustly compared to per-pixel losses. Further, perceptual loss functions identify the high-level semantic and perceptual differences between the generated images [42]. 1.5 Open issues and challenges When we train our GAN models, we suffer many major problems. Some problems are nonconvergence, model collapse, and diminished gradient unbalanced between the two models. GAN is sensitive toward the hyper-parameter factors. In GAN, sometimes the partial model is collapsed [45]. The gradient corresponds to ILR approaches zero, and then our model is collapsed. When we restart our model, the training in the discriminator detects the single-mode impact. The discriminator will take charge and change a single point to the next most likely point [46]. Overfitting is one of the main challenges as the balance between the generator and the discriminator. Some programmers give the solution. Someone proposes to use cost function with a nonvanishing gradient instead. Nonconvergence occurs due to both low and high mesh quality [47]. As we cannot apply GAN on static data due to a more complex convolution layer being required as the real and fake static data, we have not classified the data. There are some results theoretically but cannot be implemented [7]. Again, alongside various merits of GANs, there are yet open challenges that require to be solved for their medical imaging employment. In cross-modality image and image 11 12 Generative adversarial networks for image-to-Image translation reconstruction synthesis, most tasks still adopt traditional shallow reference advantages like PSNR, SSIM, or MAE for quantitative analysis. However, these measures do not relate to the image’s visual quality, e.g., pixel-wise loss direct optimization generated a blurry result but gave higher numbers compared to using adversarial loss [48]. It provides great difficulty in interpreting these horizontal comparison numbers of GAN-based tasks, particularly when other losses are presented. One method to reduce this issue is to utilize downstream works like classification or segmentation to validate the quality of the produced sample. Some other method is to recruit domain experts but this method is time-consuming, expensive, and hard to scale [49]. Today, we have applied GAN for more than 20 basic applications. All the applications have a broad area where GAN is applied, as most important are satellite images where GAN is best for training and testing of images. 
In medical images like MRI and X-ray images as they are of low resolution, the images and edges are not sharp enough, due to which extraction of more features is not possible with the help of SR-GAN and EE-GAN GAN [50]. 1.6 Conclusion and future scope In the past years before the discovery of GANs, image processing of satellite images or medical X-ray images was quite hard for feature extraction purposes. Classification is also somewhat hard due to the presence of a high error rate at the time. In a single image, every 1px represents at least 10 m, due to which feature extraction is significantly reduced. As the images are of low quality, the objects are blurry to get the high-resolution image, due to which SR-GAN is used. As both the models run at the same time, it greatly reduces the training time. GAN is used to generate fake data in today’s world. Hence, many algorithms are proposed, which make fake things appear real. GAN has several other applications, including making recipes, songs, fake images of a person, generating Cartoon characters, generating new human poses, face aging, and photo blending. These are the areas generally used in the present scenarios where GAN is freely applied. In future, by using GAN, we can create videos of robot motion and train a robot for progressive enhancement. Some researchers are working on the Novo generation of new molecules for extracting the desired properties in silica molecules. Many of the researchers are also working on the application of autonomous driving of a self-driving car using GAN. References [1] K. Jiang, Z. Wang, P. Yi, G. Wang, T. Lu, J. Jiang, Edge-enhanced GAN for remote sensing image superresolution, IEEE Trans. Geosci. Remote Sens. 57 (8) (2019) 5799–5812. [2] V. Ramakrishnan, A.K. Prabhavathy, J. Devishree, A survey on vehicle detection techniques in aerial surveillance, Int. J. Comput. Appl. 55 (18) (2012). Super-resolution-based GAN for image processing [3] S. Mahdizadehaghdam, A. Panahi, H. Krim, Sparse generative adversarial network, in: Proceedings of the IEEE International Conference on Computer Vision Workshops (ICCVW), 2019, pp. 3063–3071. [4] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, W. Shi, Photo-realistic single image super-resolution using a generative adversarial network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690. [5] S. Borman, R.L. Stevenson, Super-resolution from image sequences—a review, in: 1998 Midwest Symposium on Circuits and Systems (Cat. No. 98CB36268), IEEE, 1998, pp. 374–378. [6] D. Dai, R. Timofte, L. Van Gool, Jointly optimized regressors for image super-resolution, in: Computer Graphics Forum, vol. 34, 2015, pp. 95–104. No. 2. [7] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, Y. Fu, Image super-resolution using very deep residual channel attention networks, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 286–301. [8] N.M. Nawawi, M.S. Anuar, M.N. Junita, Cardinality improvement of Zero Cross Correlation (ZCC) code for OCDMA visible light communication system utilizing catenated-OFDM modulation scheme, Optik 170 (2018) 220–225. [9] P. Mamoshina, L. Ojomoko, Y. Yanovich, A. Ostrovski, A. Botezatu, P. Prikhodko, I.O. Ogu, Converging blockchain and next-generation artificial intelligence technologies to decentralize and accelerate biomedical research and healthcare, Oncotarget 9 (5) (2018) 5665–5690. [10] P. Perera, R. Nallapati, B. 
Xiang, Ocgan: one-class novelty detection using gans with constrained latent representations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2898–2906. [11] N.T. Do, I.S. Na, S.H. Kim, Forensics face detection from gans using convolutional neural network, in: Proceeding of 2018 International Symposium on Information Technology Convergence (ISITC 2018), 2018. [12] S. Tripathy, J. Kannala, E. Rahtu, Icface: interpretable and controllable face reenactment using gans, in: The IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 3385–3394. [13] C. Rathgeb, A. Dantcheva, C. Busch, Impact and detection of facial beautification in face recognition: an overview, IEEE Access 7 (2019) 152667–152678. [14] N. Yu, L.S. Davis, M. Fritz, Attributing fake images to gans: learning and analyzing gan fingerprints, in: Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7556–7566. [15] S. Lian, H. Zhou, Y. Sun, Fg-srgan: a feature-guided super-resolution generative adversarial network for unpaired image super-resolution, in: International Symposium on Neural Networks, Springer, Cham, 2019, pp. 151–161. [16] N. Takano, G. Alaghband, Srgan: Training Dataset Matters, arXiv, 2019. preprint arXiv:1903.09922. [17] H. Dou, C. Chen, X. Hu, Z. Hu, S. Peng, PCA-SRGAN: Incremental Orthogonal Projection Discrimination for Face Super-Resolution, arXiv, 2020. preprint arXiv:2005.00306. [18] X. Jiang, Y. Xu, P. Wei, Z. Zhou, CT image super resolution based on improved SRGAN, in: 2020 5th International Conference on Computer and Communication Systems (ICCCS), IEEE, 2020, pp. 363–367. [19] H. Li, C. Zhang, H. Li, N. Song, White-light interference microscopy image super-resolution using generative adversarial networks, IEEE Access 8 (2020) 27724–27733. [20] M. Wang, Z. Chen, Q.J. Wu, M. Jian, Improved face super-resolution generative adversarial networks, Mach. Vis. Appl. 31 (2020) 1–12. [21] F. Nan, Q. Zeng, Y. Xing, Y. Qian, Single image super-resolution reconstruction based on the ResNeXt network, in: Multimedia Tools and Applications, 2020, pp. 1–12. [22] Y.Y. Li, Y.D. Zhang, X.W. Zhou, W. Xu, EESR: edge enhanced super-resolution, in: 2018 14th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT), IEEE, 2018, pp. 1–3. [23] R. Sood, M. Rusu, Anisotropic super resolution in prostate Mri using super resolution generative adversarial networks, in: 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), IEEE, 2019, pp. 1688–1691. [24] S. Lee, J.H. Kim, J.P. Heo, Super-resolution of license plate images via character-based perceptual loss, in: 2020 IEEE International Conference on Big Data and Smart Computing (BigComp), IEEE, 2020, pp. 560–563. 13 14 Generative adversarial networks for image-to-Image translation [25] B.X. Chen, T.J. Liu, K.H. Liu, H.H. Liu, S.C. Pei, Image super-resolution using complex dense block on generative adversarial networks, in: 2019 IEEE International Conference on Image Processing (ICIP), IEEE, 2019, pp. 2866–2870. [26] W.S. Jeon, S.Y. Rhee, Single image super resolution using residual learning, in: 2019 International Conference on Fuzzy Theory and its Applications (iFUZZY), IEEE, 2019, pp. 1–4. [27] J. Wenjie, L. Xiaoshu, Research on super-resolution reconstruction algorithm of remote sensing image based on generative adversarial networks, in: 2019 IEEE 2nd International Conference on Automation, Electronics and Electrical Engineering (AUTEEE), IEEE, 2019, pp. 
438–441. [28] V.K. Ha, J. Ren, X. Xu, S. Zhao, G. Xie, V.M. Vargas, Deep learning based single image superresolution: a survey, in: International Conference on Brain Inspired Cognitive Systems, Springer, Cham, 2018, pp. 106–119. [29] X. Wang, K. Yu, C. Dong, C. Change Loy, Recovering realistic texture in image super-resolution by deep spatial feature transform, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 606–615. [30] W.S. Lai, J.B. Huang, N. Ahuja, M.H. Yang, Fast and accurate image super-resolution with deep laplacian pyramid networks, IEEE Trans. Pattern Anal. Mach. Intell. 41 (11) (2018) 2599–2613. [31] W.S. Lai, J.B. Huang, N. Ahuja, M.H. Yang, Deep laplacian pyramid networks for fast and accurate super-resolution, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 624–632. [32] D. Kim, H.U. Jang, S.M. Mun, S. Choi, H.K. Lee, Median filtered image restoration and anti-forensics using adversarial networks, IEEE Signal Process Lett. 25 (2) (2017) 278–282. [33] T. Tong, G. Li, X. Liu, Q. Gao, Image super-resolution using dense skip connections, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4799–4807. [34] C. Dong, C.C. Loy, X. Tang, Accelerating the super-resolution convolutional neural network, in: European Conference on Computer Vision, Springer, Cham, 2016, pp. 391–407. [35] W. Shi, J. Caballero, F. Huszár, J. Totz, A.P. Aitken, R. Bishop, Z. Wang, Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883. [36] L. Yue, H. Shen, J. Li, Q. Yuan, H. Zhang, L. Zhang, Image super-resolution: the techniques, applications, and future, Signal Process. 128 (2016) 389–408. [37] E.L. Denton, S. Chintala, R. Fergus, Deep generative image models using a laplacian pyramid of adversarial networks, in: Advances in Neural Information Processing Systems, 2015, pp. 1486–1494. [38] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv, 2014. preprint arXiv:1409.1556. [39] X. Li, Y. Wu, W. Zhang, R. Wang, F. Hou, Deep learning methods in real-time image superresolution: a survey, J. Real-Time Image Proc. (2019) 1–25. [40] K. Hayat, Multimedia super-resolution via deep learning: a survey, Digital Signal Process. 81 (2018) 198–217. [41] M.S. Sajjadi, B. Scholkopf, M. Hirsch, Enhancenet: single image super-resolution through automated texture synthesis, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4491–4500. [42] X. Zhao, Y. Zhang, T. Zhang, X. Zou, Channel splitting network for single MR image superresolution, IEEE Trans. Image Process. 28 (11) (2019) 5649–5662. [43] J. Kim, J. Kwon Lee, K. Mu Lee, Accurate image super-resolution using very deep convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1646–1654. [44] R. Timofte, E. Agustsson, L. Van Gool, M.H. Yang, L. Zhang, Ntire 2017 challenge on single image super-resolution: methods and results, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 114–125. [45] X. Song, Y. Dai, X. Qin, Deep depth super-resolution: learning depth super-resolution using deep convolutional neural network, in: Asian Conference on Computer Vision, Springer, Cham, 2016, pp. 360–376. 
[46] L. Zhang, P. Wang, C. Shen, L. Liu, W. Wei, Y. Zhang, A. Van Den Hengel, Adaptive importance learning for improving lightweight image super-resolution network, Int. J. Comput. Vis. 128 (2) (2020) 479–499. [47] Y. Li, J. Hu, X. Zhao, W. Xie, J. Li, Hyperspectral image super-resolution using deep convolutional neural network, Neurocomputing 266 (2017) 29–41. [48] Y. Liang, J. Wang, S. Zhou, Y. Gong, N. Zheng, Incorporating image priors with deep convolutional neural networks for image super-resolution, Neurocomputing 194 (2016) 340–347. [49] Q. Chang, K.W. Hung, J. Jiang, Deep learning based image super-resolution for nonlinear lens distortions, Neurocomputing 275 (2018) 969–982. [50] B. Lim, S. Son, H. Kim, S. Nah, K. Mu Lee, Enhanced deep residual networks for single image super-resolution, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 136–144.

CHAPTER 2 GAN models in natural language processing and image translation

E. Thirumagal (a, b) and K. Saruladha (a); (a) Pondicherry Engineering College, Department of Computer Science and Engineering, Puducherry, India; (b) REVA University, Bengaluru, India

2.1 Introduction

In recent years, GANs have shown significant progress in modeling the complex data distributions of images and speech. The introduction of GANs and VAEs made it possible to train on big datasets in an unsupervised manner.

2.1.1 Variational auto encoders

Variational auto encoders (VAEs) [1] were used for generating images before GANs. A VAE has a probabilistic encoder and a probabilistic decoder. The real samples "r" are fed into the encoder. The encoder outputs an encoded representation to which the noise "n" is added; its distribution is given by Xe(n | r). A sample from Xe(n | r) is given as input to the decoder, whose distribution is given by Yd(r | n) and which generates the fake (reconstructed) image. The loss function L(e, d) between encoder and decoder is computed at every iteration. The VAE uses a mean-square-error-based loss function, which is given by

$L(e, d) = \mathbb{E}_{n \sim X_e(n\mid r)}\big[Y_d(r\mid n)\big] + \mathrm{KLD}\big(X_e(n\mid r)\,\|\,Y_d(n)\big)$   (2.1)

where

$\mathrm{KLD}\big(X_e(n\mid r)\,\|\,Y_d(n)\big) = \sum_n X_e(n\mid r)\,\log\dfrac{X_e(n\mid r)}{Y_d(n)}$

The Kullback-Leibler divergence (KLD) is the distance metric that measures how close the encoder distribution Xe(n | r) is to the distribution Yd(n) of the images generated by the decoder. If the loss function yields a large value, the decoder does not generate fake images similar to the real samples. Backpropagation takes place at every iteration until the decoder generates images similar to the real images: using stochastic gradient descent, the weights and biases of the encoder and decoder are adjusted and image generation happens again. The optimal value of the loss function is 0.5. When the loss function of the decoder becomes 0.5, the decoder generates images similar to the real images.

2.1.1.1 Drawback of VAE

The VAE uses the Kullback-Leibler divergence (KLD). When the generated image distribution Yd(n) does not match the real image distribution Xe(n | r), Yd(n) becomes 0 at points where Xe(n | r) still has mass. The KLD then goes to ∞ (infinity), which means learning will not take place for the encoder and decoder. This leads to the invention of GANs.
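To make the KLD blow-up concrete, here is a small numerical sketch (an illustration added here, not taken from the chapter) that evaluates the discrete KL divergence from Eq. (2.1) for two toy distributions. When the decoder distribution assigns (near-)zero probability where the encoder distribution has mass, the divergence explodes, which is exactly the failure mode described above.

```python
# Toy illustration (assumed example): discrete KL divergence
# KLD(Xe || Yd) = sum_n Xe(n) * log(Xe(n) / Yd(n)), as in Eq. (2.1).
import numpy as np

def kld(xe, yd, eps=0.0):
    """Discrete Kullback-Leibler divergence; eps optionally avoids division by zero."""
    xe = np.asarray(xe, dtype=float)
    yd = np.asarray(yd, dtype=float) + eps
    mask = xe > 0  # terms with Xe(n) = 0 contribute nothing
    return float(np.sum(xe[mask] * np.log(xe[mask] / yd[mask])))

xe = np.array([0.25, 0.25, 0.25, 0.25])       # encoder distribution Xe(n | r)
yd_good = np.array([0.24, 0.26, 0.25, 0.25])  # decoder roughly matches -> small KLD
yd_bad = np.array([0.50, 0.50, 0.00, 0.00])   # decoder puts zero mass where Xe has mass

print(kld(xe, yd_good))            # ~0.0004: a well-behaved learning signal
print(kld(xe, yd_bad, eps=1e-12))  # ~12.8, and it diverges to infinity as eps -> 0
```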
2.1.2 Brief introduction to GAN Generative adversarial networks (GANs) [2–4] are generative neural network models introduced by Ian Goodfellow in 2014. Recently, GANs have been used in numerous applications such as discovery and prevention of security attacks, clothing translation, text-to-image conversion, photo blending, video games, etc. GANs have generator (G) and discriminator (D) which can be convolutional neural network (CNN), feed forward neural networks, or recurrent neural networks (RNNs). The generator (G) will generate fake images, by taking random noise distribution as input. The real samples and the generated fake images are given as input to the discriminator (D), which will output whether the image is from the real sample or from the generator (1 or 0). The loss functions are computed to check whether (i) the generator is generating images close to real samples and (ii) the discriminator is correctly discriminating between real and fake images. If the loss function yields a big value, then backpropagate to Generator and Discriminator neural networks, adjust its weights and bias which is called as optimization. There are various optimization algorithms such as stochastic gradient descent, RMSProp, Adam, AdaGrad, etc. Therefore, both G and D learn simultaneously. This chapter is organized as follows: The various GAN architectures are discussed in Section 2.2. Section 2.3 describes the applications of GANs in natural language processing. The applications of GANs in image generation and translation are discussed in Section 2.4. The Section 2.5 discusses the evaluation metrics that can be used for checking the performance of the GAN. The tools and the languages used for GAN research are discussed in Section 2.6. The open challenges for further research are discussed in Section 2.7. 2.2 Basic GAN model classification based on learning From the literature survey made, the classification of various GANs is made based on the learning methods as shown in Fig. 2.1. The learning can be supervised or unsupervised. Supervised learning is making the machine learning with the labeled data. The classification and regression algorithms come under supervised learning. The unsupervised learning [5] will take place when the data is unlabeled. The machine will act on the data GAN models in natural language processing and image translation Fig. 2.1 GAN architecture classification. based on the similarities, differences, and patterns. The clustering and association algorithms come under unsupervised learning. 2.2.1 Unsupervised learning The generative models before GANs used the Markov chain [6] method for training which has various drawbacks such as high computational complexity, low efficiency, etc. As shown in Fig. 2.1, vanilla GAN, WGAN, WGAN-GP, Info GAN, BEGAN, Unsupervised Sequential GAN, Parallel GAN, and Cycle GAN are categorized under unsupervised learning [7]. The above said GANs will take the data or the real samples without labels as input. The architectures of all the above GANs are shown in the following sections. This section details about each GAN architecture with its loss functions and optimization techniques. 2.2.1.1 Vanilla GAN The Vanilla GAN [8, 9] is the basic GAN architecture. The real samples are given by “r.” The random noise is given by “n.” The random noise distribution pn(n) is given as input to the G which will generate the fake image. The real sample distribution pd(r) and fake images are given as input to D. 
D will discriminate whether the image is real (output 1) or fake (output 0), as shown in Fig. 2.2. (Fig. 2.2 Vanilla GAN, WGAN, WGAN-GP architecture.) The losses of G and D are then calculated with the binary cross-entropy loss function, Eqs. (2.3) and (2.4). The binary cross-entropy loss function (Goodfellow [2]) is given by Eq. (2.2):

$L(x', x) = x\log x' + (1 - x)\log(1 - x')$   (2.2)

where x' is the prediction for the generated fake image and x is the label of the real image. When the image comes from a real sample "r", D has to output 1. Substituting x' = D(r) and x = 1 in Eq. (2.2) leads to the following equation:

$L_{GAN}(D) = \mathbb{E}_{r\sim p_d(r)}\big[\log(D(r))\big]$   (2.3)

When the image comes from G, i.e., G(n), D has to output 0. Substituting x' = D(G(n)) and x = 0 in Eq. (2.2) leads to the following equation:

$L_{GAN}(G) = \mathbb{E}_{n\sim p_n(n)}\big[\log(1 - D(G(n)))\big]$   (2.4)

Using min-max game theory, D has to output 1 if the image is a real sample, hence the objective is maximized over D; D has to output 0 if the image has come from the generator, hence the objective is minimized over G. The loss function is given by

$\min_G \max_D L_{GAN}(G, D) = \min_G \max_D \big\{ \mathbb{E}_{r\sim p_d(r)}[\log(D(r))] + \mathbb{E}_{n\sim p_n(n)}[\log(1 - D(G(n)))] \big\}$   (2.5)

The optimal value of D is given by

$D^{*} = \dfrac{p_d(r)}{p_d(r) + p_g(r)}$   (2.6)

If the optimal value of D (0.5) is obtained, then D cannot differentiate between real and fake images. The optimal value of the generator loss is given by

$G^{*} = -\log 4 + 2\,\mathrm{JSD}\big(p_d(r)\,\|\,p_g(r)\big)$   (2.7)

where

$\mathrm{JSD}\big(p_d(r)\,\|\,p_g(r)\big) = \dfrac{1}{2}\Big[\mathrm{KLD}\Big(p_d(r)\,\Big\|\,\dfrac{p_d + p_g}{2}\Big) + \mathrm{KLD}\Big(p_g(r)\,\Big\|\,\dfrac{p_d + p_g}{2}\Big)\Big]$

If the loss function yields a large value, then by using backpropagation and stochastic gradient descent [6] the weights and biases are adjusted every epoch until D discriminates properly.

2.2.1.2 WGAN

WGAN [10] stands for Wasserstein GAN. The architecture is the same as that of the vanilla GAN, as shown in Fig. 2.2. To avoid the drawback of using the Jensen-Shannon divergence, the Wasserstein distance metric is used: JSD compares the real and fake image distributions along the vertical axis, whereas the Wasserstein distance compares them along the horizontal axis. The Wasserstein distance is also called the earth mover's distance. The Wasserstein distance between the real image distribution Pr and the generated image distribution Pg is given by

$W(P_r, P_g) = \inf_{\delta\in\Pi(P_r, P_g)} \mathbb{E}_{(a,b)\sim\delta}\big[\,|a - b|\,\big]$   (2.8)

Here δ ranges over Π(Pr, Pg), the set of transport plans (joint distributions with marginals Pr and Pg); a transport plan tells how mass is moved between the real and generated distributions. Eq. (2.8) is intractable. To make it tractable, the Kantorovich-Rubinstein duality gives the W-distance as

$W(P_r, P_g) = \sup_{\|D\|_L \le 1} \mathbb{E}\big[D(r)\big] - \mathbb{E}\big[D(G_g(n))\big]$   (2.9)

If the slope between any two points of a function is less than or equal to k, the function is called k-Lipschitz; when k is 1, it is 1-Lipschitz. Wasserstein GAN uses the 1-Lipschitz constraint, and to enforce it the weights are clipped to the range (-1, 1). The discriminator loss function is given by

$L_{WGAN}(D) = \mathbb{E}_{r\sim p_d(r)}\big[D(r)\big]$   (2.10)

The generator loss function is given by

$L_{WGAN}(G) = -\mathbb{E}_{n\sim p_n(n)}\big[D(G(n))\big]$   (2.11)

The overall parametric loss function is given by

$L_{WGAN}(G, D) = \min_G \max_{w\in W} \mathbb{E}_{r\sim p_d(r)}\big[D(r)\big] - \mathbb{E}_{n\sim p_n(n)}\big[D(G(n))\big]$   (2.12)

In WGAN, the discriminator does not return 0 or 1; rather, it returns an estimate of the Wasserstein distance. WGAN uses the RMSProp optimizer, which alters the weights and biases of G and D every iteration until D cannot discriminate between real and fake images. A minimal training-step sketch contrasting the two formulations is given below.
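The following sketch (an illustration under the equations above, not code from the chapter) shows one update step for the vanilla GAN binary cross-entropy objectives of Eqs. (2.3)-(2.5) next to the WGAN critic loss, weight clipping, and RMSProp update of Eqs. (2.10)-(2.12). The network definitions, dimensions, learning rates, and clipping threshold are assumed placeholders.

```python
# Sketch only: vanilla GAN vs. WGAN updates for one batch (assumed MLP models).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

bce = nn.BCEWithLogitsLoss()  # D outputs raw scores, so the sigmoid is applied inside the loss
opt_d = torch.optim.RMSprop(D.parameters(), lr=5e-5)  # WGAN uses RMSProp
opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-5)

def vanilla_gan_step(real):
    """Eqs. (2.3)-(2.5): D maximizes log D(r) + log(1 - D(G(n)))."""
    fake = G(torch.randn(real.size(0), 64)).detach()
    loss_d = bce(D(real), torch.ones(real.size(0), 1)) + \
             bce(D(fake), torch.zeros(real.size(0), 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # G minimizes log(1 - D(G(n))); in practice it maximizes log D(G(n)).
    loss_g = bce(D(G(torch.randn(real.size(0), 64))), torch.ones(real.size(0), 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

def wgan_step(real, clip=0.01):
    """Eqs. (2.10)-(2.12): critic maximizes E[D(r)] - E[D(G(n))], weights clipped."""
    fake = G(torch.randn(real.size(0), 64)).detach()
    loss_d = -(D(real).mean() - D(fake).mean())  # ascend the Wasserstein estimate
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    for p in D.parameters():
        p.data.clamp_(-clip, clip)  # clipping threshold is a small symmetric hyperparameter
    loss_g = -D(G(torch.randn(real.size(0), 64))).mean()  # Eq. (2.11)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# Usage: for each batch of flattened 28x28 images `real`, call one of the steps, e.g.
# vanilla_gan_step(torch.randn(32, 784)); wgan_step(torch.randn(32, 784))
```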
2.2.1.3 WGAN-GP

WGAN-GP [11, 12] (Wasserstein GAN with gradient penalty) has an architecture identical to the vanilla GAN and WGAN, as shown in Fig. 2.2. To avoid the drawback of weight clipping as a way of enforcing the Lipschitz constraint, a gradient penalty term is incorporated into the WGAN loss function; the penalty grows as the gradient norm moves away from 1. The loss function of WGAN-GP is given by

$L_{WGAN\text{-}GP}(G, D) = \min_G \max_{w\in W} \mathbb{E}_{r\sim p_d(r)}\big[D(r)\big] - \mathbb{E}_{n\sim p_n(n)}\big[D(G(n))\big] + \lambda\,\mathbb{E}_{\hat{x}\sim P(\hat{x})}\Big[\big(\|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1\big)^2\Big]$   (2.13)

The gradient penalty term is thus added to the WGAN loss. In Eq. (2.13), $\hat{x}$ is sampled along the line between a generated sample (obtained from the noise "n") and a real image "r": $\hat{x} = t\,\tilde{x} + (1 - t)\,x$, where $\tilde{x} = G(n)$, x is the real image, and t is sampled between 0 and 1; λ is a hyperparameter. The Adam optimizer, when used with WGAN-GP, generates good, clear images.

2.2.1.4 Info GAN

Info GAN [8, 13] is the information-maximizing GAN. Semantic information is added to the noise and given to G, which outputs the fake image. The generated fake image and the real sample are given to D, which outputs 0 (fake) or 1 (real), as shown in Fig. 2.3. (Fig. 2.3 Info GAN architecture.) Then the loss function is computed, and stochastic gradient descent is used for optimizing the neural networks. The noise "n" with semantic information "si" is fed to G, which is written G(n, si). The mutual information MI(si; G(n, si)) between the semantic information "si" and the generator output G(n, si) has to be maximized; MI(si; G(n, si)) is the amount of information about si obtained from knowledge of G(n, si). Maximizing the mutual information directly is not easy, so a variational lower bound of MI(si; G(n, si)) is used, obtained by defining an auxiliary semantic distribution Q(si | r). The variational lower bound LB(G, Q) of the mutual information is given by

$LB(G, Q) = \mathbb{E}_{si\sim P(si),\, r\sim G(n, si)}\big[\log Q(si \mid r)\big] + H(si) \le MI\big(si;\, G(n, si)\big)$   (2.14)

where H(si) is the entropy of the latent codes. The loss function is given by

$\min_{G, Q}\max_D L_{InfoGAN}(G, D) = \mathbb{E}_{r\sim p_d(r)}\big[\log(D(r))\big] + \mathbb{E}_{n\sim p_n(n)}\big[\log(1 - D(G(n)))\big] - \lambda\, LB(G, Q)$   (2.15)

The wake-sleep algorithm has been used with InfoGAN: the lower bound of the generator log PG(x) is optimized and updated in the wake phase, and the auxiliary distribution Q is updated in the sleep phase by sampling from the generator distribution instead of the real data distribution. The cost is only a little more than that of the vanilla GAN.

2.2.1.5 BEGAN

BEGAN [14, 15] stands for Boundary Equilibrium GAN. BEGAN is mainly developed to achieve a Nash equilibrium. The architecture is the same as that of the vanilla GAN with one difference: to maintain equilibrium, proportional control theory is used, as shown in Fig. 2.4. In BEGAN, the generator acts as a decoder, and the discriminator acts as an autoencoder that also discriminates between real and fake images. Instead of matching the data distributions of the real and generated images directly, BEGAN computes an autoencoder loss for the real image and for the generated image, and the Wasserstein distance is computed between the distributions of these autoencoder losses. The autoencoder loss is given by

$L(s) = |s - AF(s)|^{\eta}$   (2.16)

where L(s) is the loss for training the autoencoder.
2.2.1.5 BEGAN
BEGAN [14, 15] stands for Boundary Equilibrium GAN. BEGAN was mainly developed to achieve Nash equilibrium. The architecture is the same as that of the vanilla GAN with one difference: for maintaining equilibrium, proportional control theory is used, as shown in Fig. 2.4. In BEGAN, the generator acts as a decoder, while the discriminator acts as an autoencoder and also discriminates between real and fake images. Instead of matching the data distributions of the real and generated images, the autoencoder loss is calculated for the real image and the generated image, and the Wasserstein distance is computed between the autoencoder losses of the real and generated images. The autoencoder loss is given by

$$L(s) = |s - AF(s)|^{\eta} \tag{2.16}$$

where L(s) is the loss for training the autoencoder, "s" is a sample of dimension "d," AF is the autoencoder function, which maps a sample of dimension "d" to a sample of dimension "d," and η is the target norm, which takes a value in {1, 2}.

Fig. 2.4 BEGAN architecture.

The loss of D is given by

$$L_D = L(r) - k_i\, L(G(n_D)) \tag{2.17}$$

The loss of G is given by

$$L_G = L(G(n_G)) \tag{2.18}$$

BEGAN makes use of the proportional control model to preserve the equilibrium $\mathbb{E}[L(G(n))] = \gamma\, \mathbb{E}[L(r)]$, where γ is a hyperparameter that takes a value in (0, 1). For maintaining equilibrium, it uses the variable $k_i$, which takes a value in (0, 1), to control the generator loss during gradient descent, where $k_i$ is updated as

$$k_{i+1} = k_i + \lambda_k\left(\gamma\, L(r) - L(G(n_G))\right) \tag{2.19}$$

Initially, $k_0 = 0$; $\lambda_k$ is the learning rate of k.

2.2.1.6 Unsupervised sequential GAN
The sequential GAN [16, 17] involves a sequence of generators and discriminators. The noise vector "n" is given as input to generator G1, which produces fake image1 "f" as output. Fake image1 and the real sample "r" are given as input to discriminator D1, which discriminates between real image1 and fake image1. Fake image1 is then given as input to generator G2, which produces fake image2 as output. Fake image2 and real image2 are given as input to discriminator D2, which discriminates between the real and fake image, as shown in Fig. 2.5. The loss function considering G1 and D1 is given by

$$L_{adv}(G1, D1, n, r) = \mathbb{E}_{r \sim p_d(r)}[\log D1(r)] + \mathbb{E}_{n \sim p_n(n)}[\log(1 - D1(G1(n)))] \tag{2.20}$$

The loss function considering G2 and D2 is given by

$$L_{img2img}(G2, D2, f, r) = \mathbb{E}_{r \sim p_d(r)}[\log D2(r)] + \mathbb{E}_{f \sim p_f(f)}[\log(1 - D2(G2(f)))] \tag{2.21}$$

Fig. 2.5 Unsupervised sequential GAN architecture.

The loss function of the unsupervised sequential GAN is given by

$$L_{unseqGAN}(G1, D1, G2, D2) = L_{adv}(G1, D1, n, r) + L_{img2img}(G2, D2, f, r) \tag{2.22}$$

2.2.1.7 Parallel GAN
The architecture of the parallel GAN [18] is shown in Fig. 2.6. Whenever there are bimodal images to be processed, or when multiple images need to be generated at the same time, the parallel GAN can be used. The noise vector is given to generators G1 and G2 in parallel, and G1 and G2 generate fake image1 and fake image2 in parallel. Real image1 and fake image1 are given as input to discriminator D1, which discriminates between real image1 and fake image1. Real image2 and fake image2 are given as input to discriminator D2, which discriminates between real image2 and fake image2 in parallel with D1.

Fig. 2.6 Parallel GAN architecture.

The binary cross-entropy loss is computed for (G1, D1) and (G2, D2) in parallel. If D1 and D2 are not discriminating between real and fake images properly, then by using backpropagation and stochastic gradient descent, the weights and biases of (G1, D1) and (G2, D2) are adjusted for every iteration until D1 and D2 discriminate correctly.
2.2.1.8 Cycle GAN
The cycle GAN is otherwise called the cycle-consistent GAN [19, 20]. The noise vector "n" is given as input to generator G1, which produces a feature map "f" as output. The feature map and the real sample "r" are given as input to discriminator D1, which discriminates between the real sample and the feature map. The feature map is given as input to generator G2, which produces a fake image as output. The fake image and the real image are given as input to discriminator D2, which discriminates between the real and fake image, as shown in Fig. 2.7. The loss function considering G1 and D1 is given by

$$L_1(G1, D1, n, r) = \mathbb{E}_{r \sim p_d(r)}[\log D1(r)] + \mathbb{E}_{n \sim p_n(n)}[\log(1 - D1(G1(n)))] \tag{2.23}$$

The loss function considering G2 and D2 is given by

$$L_2(G2, D2, f, r) = \mathbb{E}_{r \sim p_d(r)}[\log D2(r)] + \mathbb{E}_{f \sim p_f(f)}[\log(1 - D2(G2(f)))] \tag{2.24}$$

The cycle consistency loss is given by

$$L_{cycle}(G1, G2) = \mathbb{E}_{n \sim p_n(n)}\left[\|G2(G1(n)) - n\|_1\right] + \mathbb{E}_{f \sim p_f(f)}\left[\|G1(G2(f)) - f\|_1\right] \tag{2.25}$$

The cycle GAN loss is given by

$$L_{cycleGAN}(G1, G2, D1, D2) = L_1(G1, D1, n, r) + L_2(G2, D2, f, r) + L_{cycle}(G1, G2) \tag{2.26}$$

Fig. 2.7 Cycle GAN architecture.
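The cycle-consistency term of Eq. (2.25) is simply an L1 penalty on round-trip reconstructions. A minimal sketch, assuming two generators that map between the two domains (the linear layers here are placeholders, not the convolutional generators of [19]):

```python
# Minimal sketch of the cycle-consistency loss in Eq. (2.25); placeholder generators.
import torch
import torch.nn as nn

dim_a, dim_b = 128, 128
G1 = nn.Linear(dim_a, dim_b)   # domain A -> domain B
G2 = nn.Linear(dim_b, dim_a)   # domain B -> domain A
l1 = nn.L1Loss()

a = torch.randn(8, dim_a)      # samples from domain A
b = torch.randn(8, dim_b)      # samples from domain B

# Round trips A -> B -> A and B -> A -> B should reproduce the inputs
cycle_loss = l1(G2(G1(a)), a) + l1(G1(G2(b)), b)
# In training, this term is added to the two adversarial losses of Eqs. (2.23)-(2.24), as in Eq. (2.26)
```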
2.2.2 Semisupervised learning
In semisupervised learning, the discriminator D is trained with class labels, i.e., D does supervised learning, while the generator is not trained with the class labels, so its learning is unsupervised. The semi GAN comes under this semisupervised learning category and is discussed in the following section.

2.2.2.1 Semi GAN
The architecture of the semi GAN [21, 22] is shown in Fig. 2.8. The class labels are added to the real samples and given as input to discriminator D, so its learning becomes supervised. The noise vector is given as input to generator G, which generates the fake sample. The real samples with the class labels and the fake image generated by G are given as input to D. D discriminates between real and fake images and also classifies the image according to the class it belongs to. The loss functions are computed for G and D. If D is not discriminating properly, then by using backpropagation and stochastic gradient descent, the parameters of G and D are adjusted for every iteration until D discriminates correctly. The discriminator loss function is given by

$$L_{semiGAN}(D) = \mathbb{E}_{r \sim p_d(r)}\left[\log D(r \mid c)\right] \tag{2.27}$$

The generator loss function is given by

$$L_{semiGAN}(G) = \mathbb{E}_{n \sim p_n(n)}\left[\log(1 - D(G(n)))\right] \tag{2.28}$$

Using min-max game theory, D has to output 1 if the image is a real sample, hence D has to be maximized; D has to output 0 if the image is from the generator, hence G has to be minimized. The loss function is given by

$$\min_G \max_D L_{SemiGAN}(G, D) = \min_G \max_D \left\{ \mathbb{E}_{r \sim p_d(r)}[\log D(r \mid c)] + \mathbb{E}_{n \sim p_n(n)}[\log(1 - D(G(n)))] \right\} \tag{2.29}$$

Fig. 2.8 Semi GAN.

The generator is not trained with the class labels, but the discriminator has been trained with the class labels. The following sections describe supervised learning.

2.2.3 Supervised learning
Supervised learning means making the machine learn from labeled data. CGAN, BiGAN, ACGAN, and the supervised sequential GAN are the GAN architectures that learn in a supervised manner. The architectures of the abovementioned GANs are shown in the following sections, which detail each GAN architecture along with the loss functions and optimization techniques used in each.

2.2.3.1 CGAN
CGAN stands for conditional GAN [23, 24]. The architecture of CGAN is shown in Fig. 2.9. The class labels are attached to the real samples, and both generator G and discriminator D are trained with the class labels. The noise vector along with the class labels is given as input to G, which outputs the fake image. The class labels, the real image, and the fake image generated by G are given as input to D. D discriminates between the real and fake images and also finds out which class the image belongs to. The loss function is the same as that of the vanilla GAN with one difference: the class labels "c" are added to the real sample, discriminator, and generator terms. The binary cross-entropy loss [25] is used, and stochastic gradient descent is used for optimizing G and D when D is not discriminating properly. The discriminator loss function is given by

$$L_{CGAN}(D) = \mathbb{E}_{r \sim p_d(r)}\left[\log D(r \mid c)\right] \tag{2.30}$$

The generator loss function is given by

$$L_{CGAN}(G) = \mathbb{E}_{n \sim p_n(n)}\left[\log(1 - D(G(n \mid c)))\right] \tag{2.31}$$

Fig. 2.9 CGAN architecture.

Using min-max game theory, D has to output 1 if the image is a real sample, hence D has to be maximized; D has to output 0 if the image is from the generator, hence G has to be minimized. The loss function is given by

$$\min_G \max_D L_{CGAN}(G, D) = \min_G \max_D \left\{ \mathbb{E}_{r \sim p_d(r)}[\log D(r \mid c)] + \mathbb{E}_{n \sim p_n(n)}[\log(1 - D(G(n \mid c)))] \right\} \tag{2.32}$$

CGANs [26] with multilabel predictions can be used for automated image tagging, where the generator generates the tag vector distribution conditioned on image features.

2.2.3.2 BiGAN
BiGAN stands for bidirectional GAN [8, 27, 28]. The architecture of BiGAN is shown in Fig. 2.10. The noise vector is given as input to generator G, which generates the fake image. The real sample is given as input to the encoder, which outputs the encoded image to which the noise is added. The encoded image, noise, real image, and generated fake image are given as input to discriminator D, which discriminates between real and fake images. The loss functions are computed for G and D. If D is not discriminating properly, then by using backpropagation and stochastic gradient descent, the parameters of G and E are adjusted for every iteration until D discriminates correctly. The discriminator loss function is given by

$$L_{BiGAN}(D) = \mathbb{E}_{r \sim p_d(r)}\, \mathbb{E}_{n \sim p_E(n \mid r)}\left[\log D(r, n)\right] \tag{2.33}$$

The discriminator is trained with the real data, noise, and encoded image distribution. The generator loss function is given by

$$L_{BiGAN}(G) = \mathbb{E}_{n \sim p_n(n)}\, \mathbb{E}_{r \sim p_G(r \mid n)}\left[\log(1 - D(r, n))\right] \tag{2.34}$$

Fig. 2.10 BiGAN architecture.

Using min-max game theory, D has to output 1 if the image is a real sample, hence D has to be maximized; D has to output 0 if the image is from the generator, hence G has to be minimized. The loss function is given by

$$\min_{G, E} \max_D L_{BiGAN}(D, E, G) = \min_{G, E} \max_D \mathbb{E}_{r \sim p_d(r)}\, \mathbb{E}_{n \sim p_E(n \mid r)}[\log D(r, n)] + \mathbb{E}_{n \sim p_n(n)}\, \mathbb{E}_{r \sim p_G(r \mid n)}[\log(1 - D(r, n))] \tag{2.35}$$
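The label conditioning used by the semi GAN and CGAN described above usually comes down to concatenating a label representation with the other inputs. The sketch below uses assumed sizes, one-hot labels, and fully connected networks; it is an illustration rather than the implementation of [23]:

```python
# Minimal CGAN-style conditioning sketch: labels are one-hot encoded and concatenated
# with the noise (for G) and with the image (for D). All sizes are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

noise_dim, n_classes, img_dim = 100, 10, 784
G = nn.Sequential(nn.Linear(noise_dim + n_classes, 256), nn.ReLU(), nn.Linear(256, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim + n_classes, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

labels = torch.randint(0, n_classes, (16,))
c = F.one_hot(labels, n_classes).float()
n = torch.randn(16, noise_dim)

fake = G(torch.cat([n, c], dim=1))            # G(n | c)
score_fake = D(torch.cat([fake, c], dim=1))   # D(G(n | c))
real = torch.randn(16, img_dim)               # placeholder real batch
score_real = D(torch.cat([real, c], dim=1))   # D(r | c)
```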
2.2.3.3 ACGAN
The architecture of ACGAN [12, 29] is shown in Fig. 2.11. The architecture is the same as that of CGAN with one difference: the class labels "c" are conditioned on the real samples and the noise vector given as input to generator G, but the class labels are not conditioned on the discriminator D. The training is based on the log-probability of the correct source, i.e., whether the image is real or was generated by G, and the log-probability of the correct class to which the sample belongs. Stochastic gradient descent is used to adjust the weights and biases of G and D for every iteration if D is not discriminating correctly. The log-probability of the correct source, i.e., whether the image is from the real sample or was generated by generator G, is given by

$$L_{source} = \mathbb{E}\left[\log P(source = real \mid R_{real})\right] + \mathbb{E}\left[\log P(source = fake \mid R_{fake})\right] \tag{2.36}$$

Fig. 2.11 ACGAN architecture.

The log-probability of the correct class, i.e., of the image being classified correctly, is given by

$$L_{class} = \mathbb{E}\left[\log P(class = c \mid R_{real})\right] + \mathbb{E}\left[\log P(class = c \mid R_{fake})\right] \tag{2.37}$$

The image samples are denoted by "R," and conditional probabilities are used. Training is carried out such that D maximizes $L_{source} + L_{class}$ and G maximizes $L_{class} - L_{source}$.

2.2.3.4 Supervised seq-GAN
The supervised sequential GAN [25, 30, 31] architecture is shown in Fig. 2.12. The real image is given as input to the encoder, which outputs the encoded image. The encoded image is given as input to G1, which in turn generates fake image1. Fake image1 is given as input to G2, which generates fake image2. The noise vector, encoded image, and fake image1 are given as input to D1; the noise vector, encoded image, and fake image2 are given as input to D2. D1 and D2 discriminate between real and fake images. The loss function considering G1 and D1 is given by

$$L_{adv}(G1, D1, n, r) = \mathbb{E}_{r \sim p_d(r)}[\log D1(r)] + \mathbb{E}_{n \sim p_n(n)}[\log(1 - D1(G1(n)))] \tag{2.38}$$

The loss functions considering G2 and D2, and the encoder, are given by the following equations:

$$L_{img2img}(G2, D2, f, r) = \mathbb{E}_{r \sim p_d(r)}[\log D2(r)] + \mathbb{E}_{f \sim p_f(f)}[\log(1 - D2(G2(f)))] \tag{2.39}$$

$$L_{encoder}(r, n) = \mathbb{E}_{r \sim p_d(r)}\, \mathbb{E}_{n \sim p_E(n \mid r)}[\log D(r, n)] + \mathbb{E}_{n \sim p_n(n)}\, \mathbb{E}_{r \sim p_G(r \mid n)}[\log(1 - D(r, n))] \tag{2.40}$$

The loss function of the supervised sequential GAN is given by the following equation:

$$L_{SupseqGAN}(G1, D1, G2, D2) = L_{adv}(G1, D1, n, r) + L_{img2img}(G2, D2, f, r) + L_{encoder}(r, n) \tag{2.41}$$

Fig. 2.12 Supervised sequential GAN.
2.2.4 Comparison of GAN models
This section discusses the comparison of the GAN models. Table 2.1 summarizes the activation function, loss function, distance metric, and optimization technique used by each GAN model.

Table 2.1 Loss functions and distance metrics of GANs.

| GAN | Activation function | Loss function | Distance metric | Optimization technique |
| --- | --- | --- | --- | --- |
| Vanilla GAN | Rectified linear unit (ReLU) | Binary cross-entropy loss | Jensen-Shannon divergence | Backpropagation with stochastic gradient descent |
| WGAN | ReLU, leaky ReLU, tanh | Kantorovich-Rubinstein duality loss | Wasserstein distance | RMSProp |
| WGAN-GP | ReLU, leaky ReLU, tanh | Kantorovich-Rubinstein duality + penalty term added when the gradient moves away from 1 | Wasserstein distance | Adam |
| InfoGAN | Rectified linear unit (ReLU) | Binary cross-entropy + variational information regularization | Jensen-Shannon divergence | Stochastic gradient descent |
| BEGAN | Exponential linear unit (ELU) | Autoencoder loss + proportional control theory | Wasserstein distance | Adam |
| Unsupervised seq-GAN | Rectified linear unit (ReLU) | Binary cross-entropy + image-to-image conversion loss | Jensen-Shannon divergence and Kullback-Leibler divergence | RMSProp |
| Parallel GAN | Rectified linear unit (ReLU) | Binary cross-entropy loss | Jensen-Shannon divergence | Stochastic gradient descent |
| Cycle GAN | ReLU, sigmoid | Binary cross-entropy loss + cycle consistency loss | Jensen-Shannon divergence | Batch normalization |
| Semi GAN | ReLU | Binary cross-entropy loss with labels included with real samples | Jensen-Shannon divergence | Stochastic gradient descent |
| CGAN | ReLU | Binary cross-entropy loss with labels included | Jensen-Shannon divergence | Stochastic gradient descent |
| BiGAN | ReLU | Binary cross-entropy loss + guarantee that G and E are inverses | Jensen-Shannon divergence | Stochastic gradient descent |
| ACGAN | ReLU | Log-likelihood of real source + log-likelihood of correct label | Jensen-Shannon divergence | Stochastic gradient descent |
| Supervised seq-GAN | ReLU | Binary cross-entropy + image-to-image conversion loss + autoencoder loss | Jensen-Shannon divergence and Kullback-Leibler divergence | RMSProp |

2.2.5 Pros and cons of the GAN models
This section discusses the pros and cons of the GAN models. Table 2.2 summarizes the pros and cons of the various GAN models.

Table 2.2 Pros and cons of GAN models.

| GAN | Pros | Cons |
| --- | --- | --- |
| Vanilla GAN | Can generate samples that are more similar to the real samples; can learn deep representations of the data | When the real and generated image distributions do not overlap, the Jensen-Shannon divergence between them becomes log 2; the derivative of log 2 is 0, which means learning will not take place at the initial start of backpropagation |
| WGAN | Experiments conducted using WGAN reveal that it does not lead to the problem of mode collapse | The Lipschitz constraint is applied through weight clipping, which is a simple technique but leads to poor-quality image generation |
| WGAN-GP | The training is very balanced, hence the model can be trained effortlessly and converges properly | Attaining the Nash equilibrium state is very hard; batch normalization cannot be used, as the gradient penalty is applied to all data samples |
| InfoGAN | Used for learning disentangled data representations by using information theory extensions | The mutual information added to the generator extracts the significant attributes from the data and assigns them to the semantic information while learning is in progress; if the λ hyperparameter is not fine-tuned accurately, it will not generate good-quality images |
| BEGAN | The training is fast and stable | The hyperparameter γ and the learning rate must be fine-tuned properly; if not, it does not generate images of good clarity |
| Unsupervised seq-GAN | Extracts deeper features | Hard to achieve Nash equilibrium |
| Parallel GAN | Multiple images can be generated at the same time | Hard to achieve Nash equilibrium |
| Cycle GAN | The dataset requirement is low; two image styles can be converted arbitrarily | When doing image-to-image translation, considering various parameters such as color, texture, geometry, etc. is very difficult |
| Semi GAN | An effective model which can be used for regression tasks | The generator cannot generate a realistic enough image to fool the discriminator, as the discriminator is strong enough to discriminate since it has been trained with the class labels |
| CGAN | The class labels are included, which increases the performance of the GAN; it can be used for many applications, such as shadow map generation, image synthesis, etc. | The training is not stable; the stability in training can still be improved |
| BiGAN | As class labels are included, it can generate good realistic images | The real image sample given to the encoder must be of good clarity, and it cannot perform well when the data distributions are complex |
| ACGAN | As class labels are included, it can generate good realistic images | The GAN training is not stable; the ACGAN training can still be improved |
| Supervised seq-GAN | Extracts deeper features | The real sample given to the encoder should be of good clarity, otherwise it will not generate a realistic image |

2.3 GANs in natural language processing
Currently, many GAN architectures are emerging and yielding good results for natural language processing applications. Various GAN architectures have been proposed in recent years, including SeqGAN with policy gradient, which is used for generating speech, poems, and music and outperforms other architectures, and RankGAN, which is used for generating sentences, where the discriminator acts as the ranker. The following subsection elaborates on the various GAN architectures proposed for the applications of NLP.

2.3.1 Application of GANs in natural language processing
This section discusses the various GAN architectures, such as SeqGAN, RankGAN, UGAN, Quasi-GAN, BFGAN, TH-GAN, etc., proposed for applications of natural language processing.
2.3.1.1 Generation of semantically similar human-understandable summaries using SeqGAN with policy gradient
In recent years, generating text summaries has become attractive in the area of natural language processing, and the SeqGAN with policy gradient architecture has been proposed for generating text summaries. The proposed SeqGAN with policy gradient architecture [32] has three neural networks, namely one generator (G) and two discriminators, D1 and D2, as shown in Fig. 2.13. G is a sequential model which takes the raw text as input and generates the summary of the text as output. D1 trains G to output summaries which are human readable; hence G and D1 form the GAN. D1 is trained to distinguish between the input text and the summary generated by G, and G is trained to fool D1. As D1 trains the generator to generate the human-readable summary, it is called the human-readable summary discriminator. With only G and D1, the summary generated by the generator might be irrelevant; hence another discriminator, D2, is added to the architecture for checking the semantic similarity between the input raw text and the generated human-readable summary. SeqGAN incorporates reinforcement learning: policy gradient is the optimization technique used for updating the parameters (weights and biases) of G by obtaining rewards from D1 and D2. Hence D1 trains G to generate a human-readable summary and D2 trains G to generate a semantically similar summary.

Fig. 2.13 SeqGAN for generation of human-readable summary.

Semantic similarity discriminator
The semantic similarity discriminator is trained as a classifier using the text summarization dataset, as shown in Fig. 2.14. This discriminator teaches the generator to generate a semantically similar and more concise summary. The raw text and the human-readable summary are given as inputs to the encoders individually to generate the encoded representations Ri and Rs. Ri and Rs are concatenated, their product and difference are computed, and the result is given to a four-class classifier which classifies the human-readable summary into four classes, namely similar, dissimilar, redundant, and incomplete. The softmax outputs the probability distribution.

Fig. 2.14 Semantic similarity discriminator.
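The policy-gradient update used to pass the discriminator rewards back to the generator can be sketched in a few lines. The following is a REINFORCE-style illustration under assumed toy sizes (vocabulary, LSTM width, reward source); it is not the exact training procedure of [32]:

```python
# Minimal sketch of a policy-gradient (REINFORCE-style) generator update,
# where discriminator scores act as rewards for sampled summaries.
import torch
import torch.nn as nn

vocab, hidden = 1000, 128
embed = nn.Embedding(vocab, hidden)
lstm = nn.LSTM(hidden, hidden, batch_first=True)
head = nn.Linear(hidden, vocab)

tokens = torch.randint(0, vocab, (4, 20))   # stand-in for a batch of sampled summaries
rewards = torch.rand(4)                     # stand-in for D1/D2 scores of each summary

out, _ = lstm(embed(tokens))
log_probs = torch.log_softmax(head(out), dim=-1)
seq_log_prob = log_probs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1).sum(dim=1)

# REINFORCE: raise the log-likelihood of summaries in proportion to their reward
g_loss = -(rewards * seq_log_prob).mean()
g_loss.backward()
```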
2.3.1.2 Generation of quality language descriptions and ranking using RankGAN
Language generation plays a major role in many NLP applications, such as image caption generation, machine translation, dialogue generation systems, etc. RankGAN has therefore been proposed to generate high-quality language descriptions. RankGAN [33] consists of two neural networks, a generator G and a ranker R. The generative model used is a long short-term memory (LSTM) network that generates sentences, called machine-written sentences. Instead of the discriminator being trained as a binary classifier, RankGAN uses a ranker which is trained to rank the human-written sentences higher than the machine-written sentences. The ranker trains the generator to generate machine-written sentences which are similar to human-written sentences; in this way the generator fools the ranker into ranking the machine-written sentences higher than the human-written ones. The policy gradient method is used for optimizing the training. The architecture of RankGAN is shown in Fig. 2.15. G generates the sentences from the synthetic dataset. The human-written sentences together with the generated machine-written sentences are given as input to the ranker, and the reference human-written sentence is also given as input to the ranker. The ranker has to rank human-written sentences higher than the machine-written sentences, while the generator G is trained to fool the ranker so that it ranks the machine-written sentence higher than the human-written one. The ranker computes the rank score using

$$R(i \mid S, C) = \mathbb{E}_{s \sim S}\left[P(i \mid s, C)\right], \qquad P(i \mid s, C) = \frac{\exp\left(\beta\, \alpha(i \mid s)\right)}{\sum_{i' \in C} \exp\left(\beta\, \alpha(i' \mid s)\right)} \tag{2.42}$$

and $\alpha(i \mid s) = \mathrm{cosine}(x_i, x_s)$, where $x_i$ is the feature vector of the input sentence and $x_s$ is the feature vector of the reference sentence. The value of the parameter β is set empirically during the experiments. The reference set S is constructed by sampling reference sentences from the human-written sentences. C is the comparison set, sampled from both the human-written and the machine-generated sentence sets. s is the reference sentence sampled from set S.

Fig. 2.15 Architecture of RankGAN.
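The rank score of Eq. (2.42) is a softmax over cosine similarities against a reference sentence. A small sketch with random stand-in feature vectors (the temperature β and the feature dimension are assumptions):

```python
# Minimal sketch of the rank score in Eq. (2.42): cosine similarities against a reference,
# turned into a softmax over the comparison set C. Feature vectors here are random stand-ins.
import torch
import torch.nn.functional as F

beta = 2.0                      # assumed temperature; set empirically in [33]
x_ref = torch.randn(64)         # feature vector of one reference sentence s
x_comp = torch.randn(5, 64)     # feature vectors of the comparison set C (5 sentences)

alpha = F.cosine_similarity(x_comp, x_ref.unsqueeze(0), dim=1)   # alpha(i | s)
p = torch.softmax(beta * alpha, dim=0)                           # P(i | s, C)
# Averaging p over several references s drawn from S gives the rank score R(i | S, C)
```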
2.3.1.3 Dialogue generation using reinforce GAN
Dialogue generation is the most important module in applications such as Siri, Google Assistant, etc. The reinforce GAN [34] was proposed for dialogue generation using reinforcement learning; its architecture is shown in Fig. 2.16. The reinforce GAN has two neural networks, namely generator G and discriminator D. The input dialogue history is given to the generator, which outputs the machine-generated dialogue. The {input dialogue history, machine-generated dialogue} pair is given to the hierarchical encoder, which outputs the vector representation of the dialogue. The vector representation is given as input to the discriminator D, which in turn outputs the probability that the dialogue is human generated or machine generated. The policy gradient optimization technique is used: the weights and biases of G and D are adjusted by the rewards generated during training. The discriminator outputs are used as rewards to train the generator, so that the generator can generate dialogue which is more similar to human-generated dialogue.

Fig. 2.16 Architecture of reinforce GAN.

2.3.1.4 Text style transfer using UGAN
Text style transfer is an important research application of natural language processing which aims at rephrasing the input text into the style desired by the user. Text style transfer has applications in many scenarios, such as transferring a positive review into a negative one, converting informal text into formal text, etc. Many techniques used for text style transfer are unidirectional, i.e., they transfer a sentence from positive to negative form. UGAN (Unified Generative Adversarial Networks) [35] is the only architecture which does multidirectional text style transfer, as shown in Fig. 2.17. The input to the architecture is a sentence and a target attribute, for example the sentence "chicken is delicious" with the target attribute "negative"; the output of the architecture is the transferred sentence, "chicken is horrible," and vice versa. UGAN has two networks, namely the generator and the discriminator. The LSTM generator network takes the sentence and the target attribute as input and generates the output sentence as per the target attribute. The transferred sentence generated by the LSTM is given as input to the discriminator. The discriminator uses the RankGAN rank score computation equations to rank the original sentence and the generated sentence, and it also classifies whether the sentence is "positive" or "negative."

Fig. 2.17 Architecture of UGAN.

2.3.1.5 Tibetan question-answer corpus generation using QuGAN
In recent years, many question answering systems have been designed for many languages using deep learning models, but it is hard to design question-answering systems for languages with few resources, such as Tibetan. To solve this problem, QuGAN [36] has been proposed for a question answering system; its architecture is shown in Fig. 2.18. Initially, by using maximum likelihood, some amount of data is sampled from the data in the database. This is done to reduce the distance between the probability distributions of the real and the generated data. The randomly sampled data is given to the generator (a quasi-recurrent neural network, QRNN), which generates the questions and answers; these in turn are given to the BERT model to correct grammatical and syntax errors. The generated and the real data are given as input to the discriminator (a long short-term memory network, LSTM), which classifies between the real and the generated data. The policy gradient and Monte Carlo search optimization techniques are used to optimize the training of the neural networks by adjusting their weights and biases.

Fig. 2.18 Architecture of QuGAN.
2.3.1.6 Generation of sentences with lexical constraints using BFGAN
Nowadays, lexical constraints are incorporated into models for generating meaningful sentences, which has applications in machine translation, dialogue systems, etc. For generating lexically constrained meaningful sentences, BFGAN (backward-forward GAN) [37] has been proposed, as shown in Fig. 2.19. BFGAN has two generators, namely a forward and a backward generator, and one discriminator. An LSTM dynamic attention-based model called attRNN is used for the generators. The discriminator can be a CNN-based binary classifier that distinguishes real sentences from machine-generated meaningful sentences. The input sentence is split into words and given as input to the backward generator, which generates the first half of the sentence in the backward direction. The backward sentence is reversed and fed as input to the forward generator, which in turn outputs the complete sentence with lexical constraints. The discriminator is used to make the backward and forward generators powerful by training them using the Monte Carlo optimization technique. The real sentence and the generated sentences are given as input to the discriminator, which classifies between real sentences and machine-generated complete sentences with the lexical constraints incorporated in them.

Fig. 2.19 Architecture of BFGAN.

2.3.1.7 Short-spoken language intent classification with cSeq-GAN
Intent classification in dialog systems has grabbed attention in industry. For intent classification, cSeq-GAN [38] has been proposed, as shown in Fig. 2.20. cSeq-GAN has two neural networks, namely the generator (LSTM) and the discriminator (CNN). The real questions, with and without tags, are given as input to the generator, which in turn generates questions with classes. The generated questions and the real questions with tags are given as input to the discriminator. The CNN discriminator is implemented with both a sigmoid and a softmax layer: the sigmoid layer classifies the real and the generated questions, and the softmax layer classifies the questions into their respective intent classes. The policy gradient optimization technique is used for adjusting the weights and biases of the generator and discriminator during training.

Fig. 2.20 Architecture of cSeq-GAN.
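The two-headed discriminator described for cSeq-GAN — one sigmoid output for real versus generated questions and one softmax output over intent classes — can be sketched as below. The convolutional text encoder, vocabulary size, and number of intents are illustrative assumptions, not the configuration of [38]:

```python
# Sketch of a CNN text discriminator with a sigmoid (real/fake) head and a softmax (intent) head.
import torch
import torch.nn as nn

vocab, emb, n_intents, seq_len = 5000, 64, 8, 30
embed = nn.Embedding(vocab, emb)
conv = nn.Conv1d(emb, 128, kernel_size=3, padding=1)
pool = nn.AdaptiveMaxPool1d(1)
real_fake_head = nn.Linear(128, 1)        # sigmoid head: real vs. generated question
intent_head = nn.Linear(128, n_intents)   # softmax head: intent class

questions = torch.randint(0, vocab, (16, seq_len))
h = pool(torch.relu(conv(embed(questions).transpose(1, 2)))).squeeze(-1)  # (16, 128)

real_fake_logit = real_fake_head(h)   # train with BCEWithLogitsLoss
intent_logits = intent_head(h)        # train with CrossEntropyLoss
```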
2.3.1.8 Recognition of Chinese characters using TH-GAN
Historical Chinese characters come as low-quality images. In order to enhance the quality of historical Chinese character images, TH-GAN (transfer learning-based historical Chinese character recognition) [39] has been proposed, as shown in Fig. 2.21. The generator used is the U-Net architecture, and the WGAN model is used. The source Chinese character is given as input to the generator, which outputs the generated Chinese character. The target image, the real character image, and the generated character image are given as input to the discriminator, which classifies between the real and the fake character image. Policy gradient is the technique used for adjusting the weights and biases of the generator and the discriminator during training. The following section discusses the NLP datasets.

Fig. 2.21 Architecture of TH-GAN.

2.3.2 NLP datasets
The open-source, freely available NLP datasets for research are shown in Table 2.3.

Table 2.3 NLP datasets.

| NLP dataset | Description | Link |
| --- | --- | --- |
| CNN/Daily Mail dataset | A text summarization dataset with two features, namely the documents to be summarized (article) and the target text summary (highlights) | https://github.com/abisee/cnn-dailymail |
| News summarization dataset | Has the author details, date of the news, headlines, and the detailed news link | https://www.kaggle.com/sunnysai12345/news-summary |
| Chinese poem dataset | Has small poems; each poem has 4–5 lines and each line has 4–5 words | https://github.com/Disiok/poetry-seq2seq; https://github.com/XingxingZhang/rnnpg |
| COCO (common objects in context) captions | An object detection and caption dataset with five sections: info, licenses, images, annotations, and category | http://cocodataset.org/#download |
| Shakespeare's plays | Consists of 715 characters of Shakespeare plays, with the continuous set of lines spoken by each character in a play; can be used for text generation | https://www.tensorflow.org/federated/api_docs/python/tff/simulation/datasets/shakespeare/load_data |
| Open subtitles dataset | A group of translated movie subtitles in 62 languages | https://github.com/PolyAI-LDN/conversational-datasets/tree/master/opensubtitles |
| YELP | A business reviews and user dataset: 5,200,000 user business reviews, information about 174,000 businesses, and data about 11 metropolitan areas | https://www.kaggle.com/yelp-dataset/yelp-dataset |
| Amazon | An Amazon review dataset | https://registry.opendata.aws/?search=managedBy:amazon |
| Caption | Consists of approximately 3.3 million image-caption pairs | https://ai.googleblog.com/2018/09/conceptual-captions-new-dataset-and.html |
| Noisy speech | Noisy and clean speech dataset; can be used for speech enhancement applications | https://datashare.is.ed.ac.uk/handle/10283/2791 |
| OpinRank | Consists of 3,000,000 reviews on cars and hotels collected from TripAdvisor | http://kavita-ganesan.com/entity-ranking-data/#.XuxKF2gzY2z |
| Legal case reports | Consists of text summaries of about 4000 cases; can be used for training text summarization tasks | https://archive.ics.uci.edu/ml/datasets/Legal+Case+Reports |

2.4 GANs in image generation and translation
In recent years, many GAN architectures have been proposed for image generation and translation, such as CycleGAN, DualGAN, DiscoGAN, etc. The following section discusses the various applications of GANs in image generation and translation.

2.4.1 Applications of GANs in image generation and translation
The following subsections discuss the various applications of GANs in image generation and translation.

2.4.1.1 Ensemble learning GANs in face forensics
Fake images generated by newer image generation methods such as Face2Face and deepfake are really hard to distinguish using previous face-forensics methods. To overcome this, a novel generative adversarial ensemble learning method [40] has been proposed, as shown in Fig. 2.22. In this GAN, two generators with the same architecture are used, but they are trained in different ways. The feedback face generator gets feedback from the discriminators and generates a more fine-tuned image. ResNet and DenseNet are used as discriminators. The ability to discriminate between real and fake images is achieved by combining the feature maps of both ResNet and DenseNet: an image is fed to both networks and a 1024-dimensional output feature is extracted from each using global average pooling; a 2048-dimensional feature vector is then formed by concatenating the output features of both networks, and a softmax function is used to normalize the two-dimensional scores. During the training process of the GAN, the spectral normalization method is used to stabilize the process.

Fig. 2.22 Architecture of ensemble learning GAN.
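A rough sketch of the feature-level fusion described above — globally pooled features from two backbones concatenated and scored with a softmax — is given below. The torchvision backbones, their feature sizes, and the two-class head are illustrative assumptions rather than the exact networks of [40]:

```python
# Sketch of ensemble feature fusion: pooled features from two CNN backbones are
# concatenated and classified as real/fake with a softmax. Backbones/sizes are assumed.
import torch
import torch.nn as nn
from torchvision import models

resnet = models.resnet18()
densenet = models.densenet121()
resnet_features = nn.Sequential(*list(resnet.children())[:-1])                             # pooled, 512-d
densenet_features = nn.Sequential(densenet.features, nn.ReLU(), nn.AdaptiveAvgPool2d(1))   # pooled, 1024-d

head = nn.Linear(512 + 1024, 2)   # two-dimensional score: real vs. fake

x = torch.randn(4, 3, 224, 224)
f1 = resnet_features(x).flatten(1)     # (4, 512)
f2 = densenet_features(x).flatten(1)   # (4, 1024)
scores = torch.softmax(head(torch.cat([f1, f2], dim=1)), dim=1)
```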
2.4.1.2 Spherical image generation from a 2D sketch using SGAN
Most VR applications rely heavily on panoramic images or videos, yet most image generation models focus on 2D images and ignore the spherical structure of panoramic images. To solve this, a panoramic image generation method based on spherical convolution and GANs, called SGAN [41], has been proposed, as shown in Fig. 2.23. A sketch map of the image is taken as input, which provides a very good representation of the geometric structure. A custom-designed generator is used to generate the spherical image; it reduces distortion in the image using spherical convolution, and a least-squares loss is used to describe the constraint on whether the discriminator is able to distinguish the generated image from the real image. The spherical convolution is used for observing the data from multiple angles. The discriminator is used to distinguish between generated images and real images, and for image generation a multiscale discriminator is used, which is quite common and has the advantage of decreasing the burden on the network.

Fig. 2.23 Architecture of SGAN.

2.4.1.3 Generation of radar images using TsGAN
Radar data become really hard to understand due to imbalanced data, which becomes a bottleneck for some operations. To support radar operations, a two-stage generative adversarial network (TsGAN) [42] has been introduced, as shown in Fig. 2.24. In the first stage, it generates samples which are similar to real data and distinguishes their eligibility. To generate radar image sequences, each frame is decomposed into content information and motion information; an RNN is also used for capturing data such as the flow of clouds. Two discriminators are used: one for distinguishing between the radar image and the generated image, and a second one for the motion information, i.e., the image generation sequence. The second stage is used to define the relationship between intervals and adjacent frames. The rank discriminator is used for computing the rank loss between the generated motion sequences, the real motion sequences, and the enhanced generated motion sequences.

Fig. 2.24 Architecture of TsGAN.
2.4.1.4 Generation of CT from MRI using MCRCGAN
MRI (magnetic resonance imaging) is really useful in radiation treatment planning because of the functional information it provides compared with CT (computed tomography). However, there are some applications where MRI cannot be used because of the absence of electron density information. To apply MRI to these types of applications, the MCRCGAN (multichannel residual conditional GAN) [43] has been introduced, which generates pseudo-CT, as shown in Fig. 2.25. MCRCGAN has two parts: the generator, which generates the pseudo-CT (p-CT) image according to the input MR images, and the discriminator, which is used to distinguish the p-CT images from the real ones and to measure the degree of mismatch, since this helps the network adjust accordingly in the next iteration for better efficiency. MCRCGAN adopts a multichannel ResNet as the generator and a CNN as the discriminator.

Fig. 2.25 Architecture of MCRCGAN.

2.4.1.5 Generation of scenes from text using text-to-image GAN
Generating an image from text is a vividly interesting research topic with very unique use cases, but it is quite difficult, since language descriptions and images represent different aspects of the world, and current image generation models tend to mix the generation of background and foreground, which leads to objects that are submerged into the background. The goal is to make sure that image generation keeps the background and the foreground separate, and to achieve this, combining a VAE (variational autoencoder) with a GAN has proved to be robust. Here the generator contains three modules, namely a downsampling module, an upsampling module, and a residual module. The architecture of the text-to-image GAN [44] is shown in Fig. 2.26.

Fig. 2.26 The architecture of text-to-image GAN.

2.4.1.6 Gastritis image generation using PG-GAN
For the detection of gastric cancer, gastric X-ray images are used. Because these X-ray images are relatively large in size, LC-PGGAN (loss function-based conditional progressive growing GAN) [45] has been introduced, as shown in Fig. 2.27. This GAN generates images which are effective for gastritis classification and have all the necessary details for spotting any symptoms. For the generation of synthetic images, divided patched images are used. The whole process is divided into two steps: (1) the low-resolution step, where fake and real images are given to the discriminator, which sends the loss values to (2) the high-resolution step, where fake images along with randomly sampled patches and real images with patches are given to the discriminator to finalize the output.

Fig. 2.27 Architecture of LC-PGGAN.
2.4.1.7 Image-to-image translation using quality-aware GAN
Image-to-image translation is one of the most widely practiced tasks with GANs, and many works have been proposed for it, but they all depend on a pretrained network structure or rely on image pairs, so they cannot be applied to unpaired images. To solve these issues, a unified quality-aware GAN-based framework [46] was proposed, as shown in Fig. 2.28. Two different implementations of the quality loss are used: one is based on the image quality score between the real and reconstructed images, and the other is based on an adaptive deep network-based loss that computes the score between the real image and the image reconstructed by the generator. The generators generate images such that each constructed image has a score similar or close to that of the real image. The loss function includes an adversarial loss, a reconstruction loss, a quality-aware loss, an IQA loss, and a content-based loss.

Fig. 2.28 Architecture of quality aware GAN.

2.4.1.8 Generation of images from ancient text using encoder-based GAN
Ancient texts are of great use, since they help us get to know our past and perhaps some keys to our future. To retrieve and understand these texts, an encoder-based GAN [9] has been introduced to generate remote sensing images from text retrieved from different sources, as shown in Fig. 2.29. To train this particular network, satellite images and ancient images are used. Here the generator is conditioned on the text encodings of the training set, and the corresponding texts are synthesized. The discriminator is used to predict the sources of the input images, i.e., whether they are real or synthesized. A text encoder and a noise generator are used prior to the input.

Fig. 2.29 Architecture of encoder-based GAN.

2.4.1.9 Generation of footprint images from satellite images using IGAN
For many architectural and planning purposes, building footprints play an important role. To convert satellite images into footprint images, an IGAN (improved GAN) [26] was proposed, as shown in Fig. 2.30. This GAN uses a CGAN with a cost function based on the Wasserstein distance, integrated with a gradient penalty. The generator is provided with noise and a satellite image; using Leaky ReLU as the activation function, it generates a footprint image, which is then sent to the discriminator to obtain a score, and if the score is not as close as possible to that of the real image, it goes to the generator again, with every iteration providing better results. The dataset was based on Munich and Berlin, which provided 256 × 256 images to work on. Segmentation is also used on the images to obtain visible gradients.

Fig. 2.30 Architecture of IGAN.

2.4.1.10 Underwater image enhancement using a multiscale dense generative adversarial network
Underwater image improvement has become popular in underwater vision research. Underwater images suffer from various problems such as underexposure, color distortion, and fuzziness. To address these problems, a multiscale dense block generative adversarial network (MSDB-GAN) [47] for enhancing underwater images has been proposed, as shown in Fig. 2.31. The random noise and the image to be enhanced are given as input to the generator. The multiscale dense block is embedded within the generator; the MSDB is used for concatenating all the local features of the image, using the Leaky ReLU activation function. The discriminator discriminates between the real and the generated image.

Fig. 2.31 Architecture of MSDB-GAN.
2.4.2 Image datasets
The open-source, freely available image datasets for research are shown in Table 2.4. The following section discusses the various evaluation metrics.

Table 2.4 Image datasets.

| Image dataset | Description | Link |
| --- | --- | --- |
| CelebA-HQ | Consists of 30,000 face images of high resolution | https://www.tensorflow.org/datasets/catalog/celeb_a_hq |
| AOI | Consists of 685,000 building footprints | https://spacenetchallenge.github.io/datasets/spacenetBuildingsV2summary.html |
| MRI brain tumor | Consists of 96 MRI brain tumor images | https://www.kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection |
| CUB | Consists of images of 200 bird species | http://www.vision.caltech.edu/visipedia/CUB-200.html |
| Oxford 102 | Consists of images of 102 flower categories | https://www.robots.ox.ac.uk/~vgg/data/flowers/102/ |
| CelebA | Consists of 200,000 celebrity images | http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html |
| OpenStreetMap | Map data can be downloaded by selecting smaller areas from the map | https://www.openstreetmap.org/#map=5/21.843/82.795 |
| Visual Genome | Consists of 108,077 images with captions of people, signs, buildings, etc. | http://visualgenome.org/api/v0/api_home.html |
| Open Images | Consists of approximately 9,000,000 images annotated with labels and bounding boxes for 600 object categories | https://storage.googleapis.com/openimages/web/download.html |
| CIFAR 10/100 | CIFAR-10 consists of 60,000 images in 10 classes; CIFAR-100 extends this to 100 classes, each with 600 images | https://www.cs.toronto.edu/~kriz/cifar.html |
| Caltech 256 | Consists of 30,000 images categorized into 256 classes | https://www.kaggle.com/jessicali9530/caltech256 |
| LabelMe | Consists of 190,000 images, 60,000 annotated images, and 658,000 labeled objects | http://labelme.csail.mit.edu/Release3.0/browserTools/php/dataset.php |
| COIL-20 | Consists of images of 100 different toys, each photographed in 72 poses, giving 7200 images | https://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php |

2.5 Evaluation metrics
This section discusses the various evaluation metrics that are needed to assess the performance of the GAN models.

2.5.1 Precision
Precision (P) refers to the percentage of predicted results that are relevant. It is given by the ratio of true positives to all predicted positive results:

$$P = \frac{TP}{TP + FP}$$

where TP is true positives and FP is false positives.

2.5.2 Recall
Recall (R) refers to the percentage of the relevant results that are correctly identified by the classifier. It is given by the ratio of true positives to all actual positive results:

$$R = \frac{TP}{TP + FN}$$

where TP is true positives and FN is false negatives.

2.5.3 F1 score
The F1 score is defined as the harmonic mean of precision and recall. It is given by twice the product of precision and recall divided by their sum:

$$F1\ score = 2 \times \frac{P \times R}{P + R}$$

where P is precision and R is recall.

2.5.4 Accuracy
Accuracy refers to how accurately the model predicts the results. It is given by the ratio of true positive and true negative results to the total number of results obtained:

$$Accuracy = \frac{TP + TN}{Total}$$

where TP is true positives and TN is true negatives.
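The four count-based metrics above can be computed directly from predicted and ground-truth labels. A small, self-contained sketch in plain Python with hypothetical example labels:

```python
# Precision, recall, F1, and accuracy from binary labels, following the formulas above.
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_true)
print(precision, recall, f1, accuracy)   # 0.75 0.75 0.75 0.75
```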
2.5.5 Fréchet inception distance
The Fréchet inception distance (FID) is a metric used to evaluate the quality of the images generated by GANs. If the FID is low, the generator has generated good-quality images; if the FID is high, the generator has generated lower-quality images.

$$FID = \|\mu_1 - \mu_2\|^2 + \mathrm{Tr}\left(C_1 + C_2 - 2\sqrt{C_1 C_2}\right)$$

where $\mu_1$ and $\mu_2$ are the feature-wise means of the real and generated images, $C_1$ and $C_2$ are the covariance matrices of the real and generated image feature vectors, and Tr indicates the trace linear algebra function.

2.5.6 Inception score
The inception score (IS) is a metric used for measuring both the quality of the generated images and the difference between the generated and the real images. For measuring the quality of the images, the Inception network can be used to classify the generated and the real images. The difference between the real and the generated image distributions is computed using the KL divergence.

$$IS = \exp\left(\mathbb{E}_{g \sim G}\, D_{KL}\left(p(r \mid g)\, \|\, p(r)\right)\right)$$

where g is the generated image, r denotes the real samples with labels, and $D_{KL}$ is the Kullback-Leibler divergence, which measures the distance between the real and generated image probability distributions.

2.5.7 IoU score
Intersection over union (IoU), otherwise called the Jaccard index, is a metric which computes the overlap between the predicted results and the ground truth samples. The score ranges from 0 to 1, where 0 indicates no overlap.

$$IoU\ score = \frac{TP}{TP + FP + FN}$$

where TP is the true positive results, FP is the false positive results, and FN is the false negative results.

2.5.8 Sensitivity
Sensitivity measures the percentage of the true positives that are correctly identified.

$$Sensitivity = \frac{TP}{TP + FN}$$

where TP is the true positive results and FN is the false negative results.

2.5.9 Specificity
Specificity measures the percentage of the true negatives that are correctly identified.

$$Specificity = \frac{TN}{TN + FP}$$

where TN is the true negative results and FP is the false positive results.

2.5.10 BLEU score
The bilingual evaluation understudy (BLEU) score is a metric used for measuring the similarity between the system-generated text and the input reference text.

$$BLEU = \frac{N}{T}$$

where N is the total number of words matching between the system-generated text and the input reference text, and T is the total number of system-generated words.

2.5.11 ROUGE score
The recall-oriented understudy for gisting evaluation (ROUGE) score is used for evaluating automatic text summarization. It evaluates by computing the ROUGE recall and precision:

$$ROUGE\ Precision = \frac{N}{T}, \qquad ROUGE\ Recall = \frac{N}{R}$$

where N is the total number of words matching between the system-generated text and the input reference text, T is the total number of words in the system-generated text, and R is the total number of words in the input reference text.
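The simplified word-overlap formulas above (which omit the n-gram and brevity-penalty machinery of the full BLEU/ROUGE definitions) can be sketched as follows, with a made-up example pair of sentences:

```python
# Word-overlap precision/recall in the simplified form used above (not full BLEU/ROUGE).
from collections import Counter

def overlap_count(generated, reference):
    g, r = Counter(generated.split()), Counter(reference.split())
    return sum(min(g[w], r[w]) for w in g)

generated = "the cat sat on the mat"
reference = "the cat is on the mat"

n = overlap_count(generated, reference)   # matching words N
precision = n / len(generated.split())    # N / T
recall = n / len(reference.split())       # N / R
print(n, precision, recall)               # 5 0.833... 0.833...
```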
The next section discusses the various languages and tools used for research.

2.6 Tools and languages used for GAN research
This section discusses the various languages that can be used for training the neural networks, i.e., the generator and the discriminator.

2.6.1 Python
For training the generator:
(1) Pandas is used for data manipulation.
(2) Using os, the data path is set.
(3) Train and test data are divided using the pd.DataFrame() function.
(4) In the infinite loop,
• pd.read_csv is used to read the data from the CSV file
• labels are retrieved from the list using the data.iloc() function
• they are appended to an array using the append() function
(5) Inside generator(), batch_size and shuffle_data are used:
• empty [] lists are initialized for the arrays
• cv2.imread is used to read images (if there are any)
• the arrays are built using np.array
For training the discriminator:
(1) Keras can be used.
(2) The discriminator is defined using def define_discriminator(n_inputs=2).
(3) Define the model type and the activation functions to be used:
• model = Sequential()
• model.add(Dense(25, activation='relu', kernel_initializer='he_uniform', input_dim=n_inputs))
• model.add(Dense(1, activation='sigmoid'))
(4) Compile the model by specifying the loss function and the optimizer to be used: model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
(5) return model
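Assembled into a single runnable function, the Keras steps above look roughly as follows; the layer sizes mirror the calls listed in the steps, while the import style and the demonstration at the bottom are assumptions:

```python
# Runnable version of the Keras discriminator outlined in the steps above.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def define_discriminator(n_inputs=2):
    model = Sequential()
    model.add(Dense(25, activation='relu', kernel_initializer='he_uniform', input_dim=n_inputs))
    model.add(Dense(1, activation='sigmoid'))   # probability that a sample is real
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

if __name__ == "__main__":
    disc = define_discriminator()
    disc.summary()
```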
2.6.2 R programming
(1) Install the neural network package using install.packages("neuralnet").
(2) Load the neural network package using library("neuralnet").
(3) Read the CSV file using read.csv().
(4) Preview the dataset using View().
(5) To view the structure and verify the ID variable, the str() function is used.
(6) To set the input variables to the same scale, scale(Any_var[1:12]) is used.
(7) Generate a random seed using set.seed(200).
(8) Split the dataset into a 70-30 train and test set using:
• ind <- sample(2, nrow(Any_var), replace = TRUE, prob = c(0.7, 0.3))
• train.data <- Any_var[ind == 1, ]
• test.data <- Any_var[ind == 2, ]
(9) A neural network with one hidden layer, two nodes, and linear output set to false is given by: nn <- neuralnet(formula = FSTAT ~ AGE + SEX + CPK + SHO + CHF + MIORD + MITYPE + YEAR + YRGRP + LENSTAY + DSTAT + LENFOL, data = train.data, hidden = 2, err.fct = "ce", linear.output = FALSE)
(10) A summary can be generated using summary(nn).
(11) Visualize the neural network using plot(nn).

2.6.3 MatLab
(1) Set the dataset path using the fullfile(path) function.
(2) Load the data as an ImageDatastore using imageDatastore(path).
(3) To view data from the dataset, the imshow() function is used.
(4) Divide the dataset into train and test sets using the splitEachLabel() function.
(5) Define the neural network model, e.g., a convolutional neural network, by specifying [imageInputLayer(dimension), convolution2dLayer(dimension), reluLayer, maxPooling2dLayer(stride dimension), fullyConnectedLayer(10), softmaxLayer, classificationLayer].
(6) Set the training options using the function trainingOptions(optimization technique, maximum no. of epochs, initial learning rate).
(7) Train the model using the function trainNetwork(traindata, layers, optionsset).
(8) Prediction can be performed using the function classify().
(9) Compute the accuracy.

2.6.4 Julia
(1) Load the train and test data using dataset_name.traindata() and dataset_name.testdata().
(2) Add a channel layer using unsqueeze(traindata, layerno) and unsqueeze(testdata, layerno).
(3) Encode the labels using the functions onehotbatch(traindata, 0:9) and onehotbatch(testdata, 0:9).
(4) Create the complete dataset using DataLoader(traindata, batchsize=size).
(5) To implement a CNN, use the function Chain(Conv(dimension), pad=2, stride=2, activation_function).
(6) Max pooling can be implemented using the function maxpooling(), and average pooling can be implemented using GlobalMeanPool().
(7) The binary cross-entropy loss is given by crossentropy(model(x), y).
(8) The gradient descent optimizer is given by Descent(learning rate) and the Adam optimizer by ADAM(learning rate).
(9) Train the model using @epochs number_of_epochs Flux.train!(loss, params, train_data, opt), where params collects the weights and biases and opt is the optimizer used.
(10) Compute the accuracy.
The next section discusses the open challenges for future research.

2.7 Open challenges for future research
This section discusses the open challenges of GANs for future research.
• Vanishing gradients is a problem that many GAN architectures suffer from. First, the discriminator is trained to classify between real and fake images; then the generator is trained, but initially G generates fake images that are easily classified by D, so D's output for generated samples is close to 0 and the slope of the loss is also close to 0. Hence a useful gradient cannot be calculated.
• Mode collapse means that the generator sometimes collapses and always generates the same or similar fake images of one type, i.e., the generator generates a limited variety of fake samples. GAN architectures have to be designed in such a way that they do not suffer from this problem.
• In many GANs, it is very hard to achieve Nash equilibrium, and a large number of epochs has to be run to achieve it. The research challenge is to develop a technique which helps G and D achieve Nash equilibrium easily.
• The challenge of training G and D simultaneously is that they fail to converge many times. Sometimes, instead of attaining Nash equilibrium, G might oscillate between generating specific samples.
• How can the stability of training be increased?
• Which learning rate to set for G and D, and what the effect of changing the learning rate is, remains a challenge.
• Tuning hyperparameters, i.e., deciding on the values to set for the hyperparameters, is a challenge, as it affects the training stability.
• New activation techniques can be proposed for activating the neurons in the network so that learning can be stable.
• Training G and D is very hard. Optimizing the loss functions is very difficult and needs much trial and error. New optimization techniques can be proposed for still better fine-tuning of G and D when the discriminator is not discriminating properly.
• If one network (either G or D) is not trained properly, then the performance of the entire system will degrade.
2.8 Conclusion
This chapter provides an overview of generative adversarial networks, the classification of GAN models based on their learning setting, and their pros and cons. The various applications of GANs in natural language processing, image generation, and translation are discussed, and the various natural language processing and image datasets are listed. The evaluation metrics needed for assessing GAN performance have also been discussed, and the tools available for GAN research are mentioned. Finally, the chapter summarizes the open challenges for future research.

References
[1] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, L. Carin, Variational autoencoder for deep learning of images, labels and captions, in: 30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, 2016, pp. 1–9.
[2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems 27, Montreal, Quebec, Canada, 2014, pp. 2672–2680.
[3] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, New York, 2016.
[4] I. Goodfellow, NIPS 2016 tutorial: generative adversarial networks, arXiv:1701.00160 (2016).
[5] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv:1511.06434 (2015).
[6] M. Roveri, Learning discrete-time Markov chains under concept drift, IEEE Trans. Neural Netw. Learn. Syst. 30 (9) (2019) 2570–2582.
[7] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I.J. Goodfellow, A. Bergeron, N. Bouchard, Y. Bengio, Theano: new features and speed improvements, in: Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
[8] S. Hitawala, D.R. Cheriton, Comparative study on generative adversarial networks, arXiv:1801.04271v1 (2018).
[9] M.B. Bejiga, F. Melgani, A. Vascotto, Retro-remote sensing: generating images from ancient texts, IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 12 (3) (2019) 950–960, https://doi.org/10.1109/JSTARS.2019.2895693.
[10] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein GAN, arXiv:1701.07875 (2017).
[11] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. Courville, Improved training of Wasserstein GANs, arXiv:1704.00028 (2017).
[12] Q. Jin, R. Lin, F. Yang, E-WACGAN: enhanced generative model of signaling data based on WGAN-GP and ACGAN, IEEE Syst. J. 14 (3) (2020) 3289–3300, https://doi.org/10.1109/JSYST.2019.2935457.
[13] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, P. Abbeel, InfoGAN: interpretable representation learning by information maximizing generative adversarial nets, in: Proc. 30th Conf. Neural Information Processing Systems, Barcelona, Spain, 2016, pp. 2172–2180.
[14] D. Berthelot, T. Schumm, L. Metz, BEGAN: boundary equilibrium generative adversarial networks, arXiv:1703.10717 (2017).
[15] R.D. Hjelm, A.P. Jacob, T. Che, K. Cho, Y. Bengio, Boundary seeking generative adversarial networks, arXiv preprint arXiv:1702.08431 (2017).
[16] L.T. Yu, W.N. Zhang, J. Wang, Y. Yu, SeqGAN: sequence generative adversarial nets with policy gradient, arXiv:1609.05473 (2016).
[17] X. Yang, Y. Lin, Z. Wang, X. Li, K. Cheng, Bi-modality medical image synthesis using semisupervised sequential generative adversarial networks, IEEE J. Biomed. Health Inform. (2019) 1–11.
[18] D.J. Im, H. Ma, C.D. Kim, G.W. Taylor, Generative adversarial parallelization, arXiv:1612.04021v1 (2016).
Taylor, Generative adversarial parallelization, arXiv:1612.04021v1 (2016). [19] J.-Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, arXiv:1703.10593 (2018). [20] W. Yang, C. Hui, Z. Chen, J. Xue, Q. Liao, FV-GAN: finger vein representation using generative adversarial networks, IEEE Trans. Inf. Foren. Sec. 14 (9) (2019) 2512–2524. [21] Z. Dai, Z. Yang, F. Yang, W.W. Cohen, R.R. Salakhutdinov, Good semi-supervised learning that requires a bad GAN, in: Advances in Neural Information Processing Systems, 2017, pp. 6510–6520. [22] Z. Yang, J. Hu, R. Salakhutdinov, W. Cohen, Semisupervised QA with generative domain-adaptive nets, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, 2017, pp. 1040–1050. [23] M. Mirza, S. Osindero, Conditional generative adversarial nets, arXiv: 1411.1784 (2014). [24] C. Esteban, S.L. Hyland, G. Ratsch, Real-valued (medical)¨ time series generation with recurrent conditional GANs, arXiv preprint arXiv:1706.02633 (2017). [25] P. Costa, A. Galdran, M.I. Meyer, M. Niemeijer, M. Abrámoff, A.M. Mendonça, A. Campilho, Endto-end adversarial retinal image synthesis, IEEE Trans. Med. Imag. 37 (3) (2018) 781–791. [26] Y. Shi, Q. Li, X.X. Zhu, Building footprint generation using improved generative adversarial networks, IEEE Geosci. Remote Sens. Lett. 16 (4) (2019) 603–607, https://doi.org/10.1109/ LGRS.2018.2878486. [27] J. Donahue, P. Krahenbuhl, T. Darrell, Adversarial feature learning, arXiv: 1605.09782 (2017). [28] Z. Zhang, S. Liu, M. Li, M. Zhou, E. Chen, Bidirectional generative adversarial networks for neural machine translation, in: Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018), pp. 190–199, Brussels, Belgium, October 31-November 1, 2018. [29] A. Odena, C. Olah, J. Shlens, Conditional image synthesis with auxiliary classifier GANs, arXiv: 1610.09585 (2017). [30] J.E. Iglesias, E. Konukoglu, D. Zikic, B. Glocker, K. Van Leemput, B. Fischl, Is synthesizing mri contrast useful for inter-modality analysis? in: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2013, pp. 631–638. [31] S. Semeniuta, A. Severyn, S. Gelly, On accurate evaluation of GANs for language generation, arXiv:1806.04936v3 (2019). GAN models in natural language processing and image translation [32] H. Zhuang, W. Zhang, Generating semantically similar and human-readable summaries with generative adversarial networks, IEEE Access 7 (2019) 169426–169433, https://doi.org/10.1109/ ACCESS.2019.2955087. [33] K. Lin, D. Li, X. He, Z. Zhang, M.-T. Sun, Adversarial ranking for language generation; 2017. arXiv:1705.11001. [34] J. Li, W. Monroe, T. Shi, S. Jean, A. Ritter, D. Jurafsky, Adversarial learning for neural dialogue generation; 2017. arXiv:1701.06547v5. [35] W. Yu, T. Chang, X. Guo, X. Wang, B. Liu, Y. He, UGAN: unified generative adversarial networks for multidirectional text style transfer, IEEE Access 8 (2020) 55170–55180, https://doi.org/10.1109/ ACCESS.2020.2980898. [36] Y. Sun, C. Chen, T. Xia, X. Zhao, QuGAN: quasi generative adversarial network for tibetan question answering corpus generation, IEEE Access 7 (2019) 116247–116255, https://doi.org/10.1109/ ACCESS.2019.2934581. [37] D. Liu, J. Fu, Q. Qu, J. Lv, BFGAN: backward and forward generative adversarial networks for lexically constrained sentence generation, IEEE/ACM Trans. Audio Speech Language Process. 
27 (12) (2019) 2350–2361, https://doi.org/10.1109/TASLP.2019.2943018. [38] X. Zhou, Y. Peng, Short-spoken language intent classification with conditional sequence generative adversarial network, in: 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), Portland, OR, USA, 2019, pp. 1753–1756, https://doi.org/10.1109/ICTAI.2019.00261. [39] J. Cai, L. Peng, Y. Tang, C. Liu, P. Li, TH-GAN: generative adversarial network based transfer learning for historical Chinese character recognition, in: 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia, 2019, pp. 178–183, https://doi.org/10.1109/ ICDAR.2019.00037. [40] J. Baek, Y. Yoo, S. Bae, Generative adversarial ensemble learning for face forensics, IEEE Access 8 (2020) 45421–45431, https://doi.org/10.1109/ACCESS.2020.2968612. [41] Y. Duan, C. Han, X. Tao, B. Geng, Y. Du, J. Lu, Panoramic image generation: from 2-D sketch to spherical image, IEEE J. Sel. Top. Signal Process. 14 (1) (2020) 194–208, https://doi.org/10.1109/ JSTSP.2020.2968772. [42] C. Zhang, X. Yang, Y. Tang, W. Zhang, Learning to generate radar image sequences using two-stage generative adversarial networks, IEEE Geosci. Remote Sens. Lett. 17 (3) (2020) 401–405, https://doi. org/10.1109/LGRS.2019.2922326. [43] K. Xu, et al., Multichannel residual conditional GAN-leveraged abdominal pseudo-CT generation via Dixon MR images, IEEE Access 7 (2019) 163823–163830, https://doi.org/10.1109/ ACCESS.2019.2951924. [44] C. Zhang, Y. Peng, Stacking VAE and GAN for context-aware text-to-image generation, in: 2018 IEEE Fourth International Conference on Multimedia Big Data (BigMM), Xi’an, 2018, pp. 1–5, https://doi.org/10.1109/BigMM.2018.8499439. [45] R. Togo, T. Ogawa, M. Haseyama, Synthetic gastritis image generation via loss function-based conditional PGGAN, IEEE Access 7 (2019) 87448–87457, https://doi.org/10.1109/ ACCESS.2019.2925863. [46] L. Chen, L. Wu, Z. Hu, M. Wang, Quality-aware unpaired image-to-image translation, IEEE Trans. Multimedia 21 (10) (2019) 2664–2674, https://doi.org/10.1109/TMM.2019.2907052. [47] Y. Guo, H. Li, P. Zhuang, Underwater image enhancement using a multiscale dense generative adversarial network, IEEE J. Ocean. Eng. 45 (3) (2020) 862–870, https://doi.org/10.1109/ JOE.2019.2911447. 57 CHAPTER 3 Generative adversarial networks and their variants Er. Aarti Department of Computer Science & Engineering, Lovely Professional University, Phagwara, Punjab, India 3.1 Introduction of generative adversarial network (GAN) Generative adversarial networks [1] were projected to figure out the weaknesses of other generative structures also proven successful in the field of unsupervised learning. GAN has acquired wide consideration in the AI area for their capability to learn highdimensional and complex real information circulation. In particular, they do not depend on any suppositions about the distribution and can create pictures that look genuine such as latent space in a basic way. This ground-breaking property drives GAN to be applied to different applications, for example, image translation, image synthesis, domain adaptation, image attribute editing, and other scholastic fields [2]. The most persuasive explanation that GANs are broadly considered, created, and utilized is a result of their prosperity. GANs have had the option to create photographs so sensible that people can’t tell whether they are scenes, items, and individuals that do not exist in reality [3]. 
Generating a picture from a given text depiction has two objectives: visual authenticity and semantic consistency. Although huge advancement has been made in creating highcaliber and outwardly sensible pictures utilizing generative adversarial networks, ensuring semantic consistency between the text depiction and visual substance stays very challenging. Various interesting applications of GANs are image-to-image translation, superresolution, semantic-image-to-photo translation, generation of new human poses, photos to emojis, photograph editing, face aging, photo blending, and many more. In a game-theoretic scheme, the generator system is required to contend against an adversary by completing the objective, as generative adversarial networks depend on this scheme. Adversarial games are the domain of AI where two or more agents play opposite to each other. GANs are an exciting recent innovation in deep learning. GANs are one of the new state-of-the-art neural networks that can be used to do many things. Recovering corrupted data, text-to-image generation, and many more endless applications generative adversarial network has. Generative models can be thought of as containing more information than their discriminative counterparts since they also are used for discriminative tasks such as classification or regression. The adversarial modeling structure is generally straight to apply when both Generative Adversarial Networks for Image-to-Image Translation https://doi.org/10.1016/B978-0-12-823519-5.00003-8 Copyright © 2021 Elsevier Inc. All rights reserved. 59 60 Generative adversarial networks for image-to-image translation the frameworks are multilayer perceptrons. No doubt, adversarial networks act as a general-purpose solution to image-to-image translation issues. These systems not only acquire the information regarding mapping from an intake picture to yield picture but also get a loss activity to prepare for this mapping. The probability distribution can be duplicated by GAN so that they could, therefore, utilize loss activity, which depicts the distance among the dissemination of the information produced through the GAN and the dispersion of the original information. GANs are a way to deal with generative modeling by utilizing DL strategies, for example, CNN. An improved method called deep convolutional GAN, or DCGAN prompted increasingly stable models. These days, most of the GANs are at least loosely dependent on the DCGAN design. It is also one of the variants of GAN. Generative modeling is an unsupervised task in artificial intelligence which contains automatically searching and learning the patterns or regularities as intake information. The framework is utilized to produce new structures that possibly can be taken from the initial dataset [3]. GANs design for naturally preparing a generative framework by considering the independent issue as supervised and utilizing both generative and discriminative structures. Fig. 3.1 shows the design of the generative adversarial network. (See Table 3.1.) It is a deep-learning system and one of the most encouraging techniques for independent learning in complex dissemination. Deep-learning techniques can be utilized as generative structures. Two mainstream models incorporate the restricted Boltzmann machine (RBM) and the deep belief network (DBN). Two present-day instances of deep-learning generative framework algorithms incorporate the variational autoencoder (VAE) and the GAN [3]. 
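Since most modern GANs at least loosely follow the DCGAN design mentioned above, a brief sketch may help make that design concrete. The PyTorch generator below, with its 100-dimensional latent vector and 64 x 64 three-channel output, is a standard DCGAN-style illustration with hypothetical sizes, not an architecture taken from this chapter.

import torch
import torch.nn as nn

# Minimal DCGAN-style generator sketch (hypothetical sizes): a 100-d latent
# vector is upsampled with transposed convolutions to a 3x64x64 image in [-1, 1].
class DCGANGenerator(nn.Module):
    def __init__(self, latent_dim=100, feat=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(latent_dim, feat * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(feat * 8), nn.ReLU(True),          # 4x4
            nn.ConvTranspose2d(feat * 8, feat * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 4), nn.ReLU(True),          # 8x8
            nn.ConvTranspose2d(feat * 4, feat * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat * 2), nn.ReLU(True),          # 16x16
            nn.ConvTranspose2d(feat * 2, feat, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feat), nn.ReLU(True),              # 32x32
            nn.ConvTranspose2d(feat, 3, 4, 2, 1, bias=False),
            nn.Tanh(),                                        # 64x64 output image
        )

    def forward(self, z):
        # Reshape the flat latent vector to (N, latent_dim, 1, 1) before upsampling.
        return self.net(z.view(z.size(0), -1, 1, 1))

fake = DCGANGenerator()(torch.randn(4, 100))   # -> tensor of shape (4, 3, 64, 64)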
GANs are a special case of generative models that learn features particularly well because of their adversarial training. They are an elegant way of training a generative model by framing the task as a supervised learning problem with two submodels. The GAN architecture consists of two components: a generative model and a discriminative model. Both models are trained together in an adversarial procedure, and each can be any neural network, such as a convolutional neural network (CNN), a recurrent neural network (RNN), or a long short-term memory (LSTM) network.
Fig. 3.1 GAN architecture: the generator maps a latent variable z to fake data x_fake, while the discriminator distinguishes the real data x_real from the generated fake data [4].
Table 3.1 Comparison between various GAN-based approaches.
Method | Input | Output | Characteristics | Loss function | Resolution | Code
SRGAN [18] | Image | Image | High upscaling factor | Adversarial + feature | Arbitrary |
FCGAN [52] | Face | Face | | Adversarial + distance | 128 x 128 |
VGAN [47] | Noise vector | Video | | Adversarial | 64 x 64 | T
TGAN [43] | Noise vector | Video | Temporal generator | Adversarial | 64 x 64 | Ch
VariGAN [41] | Human + view | Human | Coarse to fine | Adversarial + distance | 128 x 128 |
StackGAN [71] | Text | Image | High-quality | Adversarial | 256 x 256 | T + PT + TF
cycleGAN [19] | Image | Image | Unpaired data | Adversarial + cycle consistency | 256 x 256 | T + PT
pix2pix [54] | Image | Image | General framework | Adversarial + distance | 256 x 256 | T + TF
Age-cGAN [64] | Face + age | Face | Identity preserved | Identity preserving | |
Context Encoder [40] | Image + holes | Image | | Adversarial + distance | 128 x 128 |
TP-GAN [68] | Face | Face | Two pathway | Adversarial + distance + identity preserving + tv + symmetry | 128 x 128 | TF
In the Code column, T, TF, Ch, and PT denote Torch, TensorFlow, Chainer, and PyTorch, respectively.
The training process of the G and D networks is called adversarial training. G and D are trained together in an adversarial fashion so that each improves the other: the parameters of G are adjusted to minimize log(1 − D(G(z))) and the parameters of D to maximize log D(x) [5], while the two compete in a two-player min-max game with value function V(G, D):

\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]   (3.1)

where V(G, D) is a binary cross-entropy-style objective of the kind used in binary classification problems, D(x) is a multilayer perceptron, p_z(z) is the distribution of the input noise variables, and p_data(x) and p_z(z) in Eq. (3.1) denote, respectively, the real-data probability distribution defined on the data space X and the probability distribution of z defined on the latent space Z. G maps z from Z into an element of X, whereas D takes an input x and decides whether x is a real sample or a fake sample generated by G [6].
3.1.1 Generative model (GM)
A generator model learns to create images that look real, and during training it gradually becomes better at producing images that appear genuine. It takes a fixed-length random vector as input and produces a sample in the domain, as shown in Fig. 3.2. The vector is drawn randomly from a Gaussian distribution and is used to seed the generative process. After training, points in this multidimensional vector space correspond to points in the problem space, forming a compressed representation of the data distribution. This vector space is known as the latent space and contains latent variables.
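Purely as an illustration, the following PyTorch sketch shows how the value function in Eq. (3.1), together with the latent vector z just described, translates into the two losses that are optimized in practice; the tiny multilayer-perceptron G and D and every dimension here are hypothetical placeholders rather than an architecture from the chapter.

import torch
import torch.nn as nn

# Hypothetical toy networks: G maps a latent vector z to a sample, D outputs
# the probability that its input is real. All sizes are illustrative only.
latent_dim, data_dim = 16, 2
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

x_real = torch.randn(64, data_dim)           # stand-in for a batch of real data
z = torch.randn(64, latent_dim)              # latent vectors z ~ p_z(z)
x_fake = G(z)                                # generated samples G(z)

eps = 1e-8                                   # numerical safety for log()
# Value function of Eq. (3.1): V(G, D) = E[log D(x)] + E[log(1 - D(G(z)))]
V = torch.log(D(x_real) + eps).mean() + torch.log(1 - D(x_fake) + eps).mean()

d_loss = -V                                       # D ascends V, i.e. minimizes -V
g_loss = torch.log(1 - D(x_fake) + eps).mean()    # G descends its term of V
print(float(V), float(d_loss), float(g_loss))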
A latent variable is a random variable that matters for a domain but is not directly observable. Latent variables are often interpreted as a compression or projection of the data distribution. In the case of GANs, meaning is assigned to points in a chosen latent space, so that new points drawn from that space can be given to the generator model as input and used to produce new and distinct output examples [3]. The main purpose of the generator is to deceive the discriminator by generating new plausible examples from the problem domain, mostly images, whereas the discriminator examines the data produced by the generator and decides whether an image is authentic or machine generated [7]. In the original GAN formulation, G and D are not required to be neural networks; they only need to be able to fit the corresponding generation and discrimination functions. In practice, however, deep neural networks are commonly used as G and D, and both can be nonlinear mapping functions such as multilayer perceptrons.
Fig. 3.2 GAN generator model: a random input vector is mapped by the generator model to a generated example.
3.1.2 Discriminator model (DM)
A discriminator model attempts to classify images as either authentic or fake, and during training it becomes better at revealing that distinction. It takes an example (real or generated) from the domain as input and outputs a binary class label, real or generated, as shown in Fig. 3.3. The real examples come from the training dataset, so the discriminator is an ordinary classification model. After training, the discriminator is usually discarded because the interest lies in the generator. The process reaches equilibrium when the discriminator can no longer distinguish real images from fakes.
Fig. 3.3 GAN discriminator model: an input example is mapped by the discriminator model to a binary classification, real or fake.
The potential of GANs, for both good and bad, is enormous because they can learn to imitate any distribution of data. They can therefore create artificial worlds strikingly similar to our own in any domain: music, images, writing, and speech. Their output can be impressively convincing, but they can also be used to create fake media content and are the technology behind deepfakes. They have been used for many applications, particularly image synthesis, because of their ability to produce high-quality pictures, and in recent years various GAN variants have been proposed that generate excellent results for image generation. GANs belong to the family of generative models, which means they can create new content [8]. A GAN does not work with an explicit density function [9]; in the game-theoretic approach, it learns to generate from the training distribution through a two-player game. The samples it produces are among the best available, but it can be tricky and unstable to train and cannot answer inference queries. GANs rely on a game scenario in which the generator network must compete against an adversary and directly produces samples, while the discriminator network, its adversary, tries to distinguish samples drawn from the training data from samples produced by the generator [3]. This mutual game training generally produces a reasonably good result.
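As a rough end-to-end illustration of the adversarial training just described, and of the equilibrium in which D can no longer tell real from fake, here is a self-contained PyTorch sketch that fits a toy 1-D Gaussian; the architectures, learning rates, and target distribution are all invented for the example, and the generator uses the common non-saturating loss rather than the saturating form in Eq. (3.1).

import torch
import torch.nn as nn

# Hypothetical toy setup: learn a 1-D Gaussian (mean 3, std 0.5) with an MLP
# generator and an MLP discriminator; all sizes and rates are illustrative.
latent_dim = 8
G = nn.Sequential(nn.Linear(latent_dim, 16), nn.ReLU(), nn.Linear(16, 1))
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()
ones, zeros = torch.ones(128, 1), torch.zeros(128, 1)

for step in range(2000):
    x_real = 3.0 + 0.5 * torch.randn(128, 1)   # samples from the target data distribution
    z = torch.randn(128, latent_dim)

    # Discriminator step: push D(x_real) -> 1 and D(G(z)) -> 0.
    d_loss = bce(D(x_real), ones) + bce(D(G(z).detach()), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step (non-saturating form): push D(G(z)) -> 1.
    g_loss = bce(D(G(z)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# Near equilibrium the discriminator cannot tell real from fake: D(.) ~ 0.5.
with torch.no_grad():
    print(D(G(torch.randn(512, latent_dim))).mean())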
An amazing GAN application requires a reasonable preparation strategy; alternatively, the yield might be unsuitable because of the flexibility of the neural system model [7]. Advantages 1. It is a better modeling of data distribution. 2. In theory, GANs can train any type of generator network. Other frameworks require generator networks to have some specific form of functionality, such as the output layer being Gaussian. 3. There is no need to use the Markov chain to repeatedly sample, without inferring in the learning process, without complicated variational lower bounds, avoiding the difficulty of approximating the difficult probability of calculation. Disadvantages 1. It is hard to train, unstable. Proper synchronization is required between the generator and the discriminator, but in actual training, it is easy for D to converge and G to diverge. D/G training requires careful design. 2. It has mode collapse issue. The learning process of GANs may have a missing pattern, the generator begins to degenerate, and the same sample points are always generated, and the learning cannot be continued. 3. It cannot solve inference queries such as p(x). 3.2 Related work Goodfellow et al. [1] portrayed the GAN architecture in 2014 and discussed the nonsaturating loss function. It also provides the derivation for the optimal discriminator and demonstrates the effectiveness empirically on the MNIST, TFD, and CIFAR-10 image datasets. Radford et al. [10] introduced a class of deep convolutional GANs (DCGANs) that imposes empirical constraints on the network architecture to solve the problem of potential instability during training. Salimans et al. [11] provided a set of tools to avoid instability and mode collapsing, which includes historical averaging, minibatch discrimination, one-sided label smoothing, feature matching, and virtual batch normalization. Che et al. [12] used regularization methods for the objective to avoid the problem of missing modes. Arjovsky et al. [13] suggested minimization of the Wasserstein-1 or Earth-Mover distance among generator and data distribution with theoretical reasoning. In a follow-up paper, Gulrajani et al. [14] projected an enhanced approach for training the discriminator—termed critic by Arjovsky et al. [13]—which behaves stably, even with deep ResNet architectures. GANs have mostly been investigated on pictures, showing significant success with tasks such as image generation [15–17], image superresolution [18], style transfer [19, 20], and many others. Generative adversarial networks and their variants 3.3 Deep-learning methods There has been a gigantic advancement in framework demonstration and perception after presenting the advanced models for deep learning (DL). DL techniques quickly developed and extended applications in different logical and engineering areas. Deep learning is a growing area of AI (ML) research. It includes various concealed layers of artificial neural systems. The methodology applies nonlinear alterations and structures the deliberations of a high level in the huge collection of data. The current improvements in deeplearning structures inside various fields have just given huge commitments in AI. Current analysis has applied deep learning as the principal tool for digital image processing. A convolutional neural networks (CNN) is used for Iris recognition considered as more powerful in comparison with customary Iris sensor [21]. Deep learning is a subset of the field of ML, which is a subfield of AI [22]. 
Health informatics, bioinformatics, safety, energy, economic, security, urban informatics, hydrological systems modeling, and computational mechanisms are the advanced application field of DL [23]. Deep-learning techniques are quickly advancing for better performance. Recently, DL algorithms have come out of AI and soft computing strategies. From that point, a few DL algorithms are currently acquainted with mainstream researchers and used in different application areas. Nowadays, their use has evolved into fundamental because of its knowledge, effective learning, precision, and strength in the model structure. Deep-learning strategies are quickly developing. Some of them have progressed to be had practical experience in a specific application area. Literature incorporates sufficient survey papers on the advancing designs in particular usage areas, such as superresolution imaging, multimedia analytics, cardiovascular image analysis, transportation systems, radiology, medical ultrasound analysis, 3D sensed data classification, activity recognition in radar, sentiment classification, renewable energy forecasting, image cytometry, 3D sensed data classification, text detection, apache-spark, and hyperspectral [24–29]. The convolutional neural network, recurrent neural network, de-noising autoencoder, deep belief network, and long short-term memory techniques have been recognized as the most famous deep-learning strategies [23]. 3.3.1 Convolutional neural network It is one of the well-known structures of deep-learning procedures. It includes three sorts of a layer with various pooling, convolutional, and completely associated layers shown in Fig. 3.4. There are two phases for the preparation procedure in each CNN, the feedforward, and the back-propagation phase. GoogLeNet [30], AlexNet [31], ZFNet [32], ResNet [33], and VGGNet [34] are the most widely recognized CNN designs. In spite of the fact that it is basically known and commonly utilized for image processing applications. 65 66 Generative adversarial networks for image-to-image translation Output Input Layer Conventional Layer Sub_sampling Layer Conventional Layer Sub_sampling Layer Fully Connected Layer Fig. 3.4 CNN architecture [23]. 3.3.2 Recurrent neural network (RNN) It is moderately current deep-learning strategy. RNN is intended to perceive groupings and patterns, for example, handwriting, text, speech, and many more applications [23]. It has advantages in the structure of cyclic associations which utilize repetitive calculations to successively measure the intake information [35]. It is essentially an ideal neural system which has been extended beyond time through edges that feed into whenever step into rather than stepping into the following layer in a similar time. Every past source of input information is carried a state vector in concealed units, and further such vectors are used to process the yields. The expert systems, hydrological prediction, economics, energy, and navigation are its present applications. Fig. 3.5 depicts the architecture of RNN. Output Layer Hidden Layer Input Layer Fig. 3.5 RNN architecture [23]. Generative adversarial networks and their variants 3.3.3 Deep belief network (DBN) It is recognized as a composite multilayered neural system which includes undirected and coordinated associations. It is utilized for high structural manifolds learning of information. 
The strategy consists of various layers which include associations among the layers with the exception of associations between units inside every fold. It also contains restricted Boltzmann machines (RBM) that are prepared in an insatiable way [36] in which each layer connects with both the past and resulting layer [37, 38]. The structure is comprised of a feed-forward system and a few folds of RBM as characteristic extractors [39]. The two layers of an RBM [40] are hidden and visible layers. Fig. 3.6 depicts the design of the DBN strategy. Deep belief network is one of the most dependable DL strategies having computational proficiency and high precision [23]. Human emotion discovery, time arrangement expectation, sustainable power source forecast, cancer diagnosis, and financial estimating are the public application area. 3.3.4 Long short-term memory It is an RNN technique that advantages input associations to be utilized as a generalpurpose computer. The technique is used for two arrangements such as patterns recognition and image processing applications. Mainly, it consists of three central parts which Hidden Layer Hidden Layer Hidden Layer Visible Layer Output Layer Weights RBM1 Fig. 3.6 DBN architecture [23]. RBM2 RBM3 RBM4 67 68 Generative adversarial networks for image-to-image translation Input gate Output gate Cell Input modulation gate Forget gate Fig. 3.7 LSTM architecture. include information, yield, and forget doors that can be controlled on choosing when to allow the data to come inside the neuron and also to recollect what was figure out during the last time step. As it chooses, whole this that relies on the present intake is one of the fundamental qualities of the LSTM technique [23]. Fig. 3.7 presents the design of the LSTM technique. LSTM has demonstrated incredible possibilities in various environmental areas such as hydrological prediction, hazard modeling, air quality, and geological modeling. LSTM design may be appropriate for some application areas because of its speculation capacities such as solar power modeling, energy demand and consumption, and wind energy industry [23]. 3.4 Variants of GAN With the advancement of technology, various improvements are made to the variants of GAN. 3.4.1 Vari GAN Vari GAN represents variational GAN [41] which was proposed to create multiview individual pictures from a solitary perspective. This GAN replaces a coarse-to-fine manner. Vari GAN has been made out of three systems: a coarse image generator, a fine picture generator, and a restrictive discriminator. The coarse image generator GC utilizes a restrictive VAE design [7] where VAE represents variational autoencoder. With an input picture i and an objective view v, a low-quality picture was created independently with Generative adversarial networks and their variants the objective view i-v (low quality). The fine picture generator GF is made out of double-way U-Net [42] design. The U-Net is named after its symmetric shape. This maps i-v (low quality) to a high-quality picture conditioned on the input picture. Discriminator D looks at the high-quality picture adapted on the input picture. GF and discriminator are jointly prepared with a target function comprising a content loss and an affective loss estimating L1 distinction between (i-v) high quality picture and ground truth [10]. 3.4.2 TGAN TGAN represents a temporal generative adversarial net that was suggested by Saito et al. [43] for video generation. It comprises a discriminator, a temporal, and a picture generator. 
The temporal generator delivers a grouping of inactive frame vectors [V11, V12, V13, …, V1S] from an arbitrary variable V0, where S is the count of video frames. The picture generator takes V0 and a frame vector Z1t (0< t < S + 1) as intake and generates the t-th video frame. Here additionally, the discriminator accepts the entire video as intake and attempts to recognize it from genuine ones. TGAN follows WGAN [44] for stable preparation, however, applying further particular clipping value rather than weight clipping to the discriminator [10]. Recently, Temporal GAN (TGAN) [45] manages the instability in video generation by sending a frame-wise generation framework. A generative model is utilized to sample frames for image generation; a temporal generator preserves temporal consistency and controls this model. This model separates essential pieces of a video as a frontal area from background or dynamic from static patterns to manage the instability of preparing GANs. It accepts a latent space of pictures and considers that a video clip is produced by navigating the points in the dormant space. Video clips of various lengths relate to dormant space directions of multiple lengths. 3.4.3 Laplacian pyramid of generative adversarial network (LAPGAN) Denton et al. [17] projected the creation of pictures in a coarse-to-fine manner utilizing a cascade of convolutional GANs having the structure of a Laplacian pyramid with N levels. This method utilizes multiple numbers of the generator and discriminator system and different levels of the Laplacian pyramid. A GAN is prepared by downsampling the picture at first at each phase of the level, N, and then it is again upscaled at each layer in a backward pass where a noise vector is mapped to a picture from the Conditional GAN with the coarsest quality until it reaches its original size. At each degree of the pyramid with the exception of the coarse stone, a different CGAN is prepared that considers the yield picture in the coarser level as a restrictive variable to produce the picture at this stage. This approach is mainly used because it can create pictures with higher quality in a coarse-to-fine manner [10]. This methodology permitted to exploitation the 69 70 Generative adversarial networks for image-to-image translation multiscale model of regular pictures, assembling a progression of generative models, each catching picture structure at a specific degree of the Laplacian pyramid which is made from a Gaussian pyramid utilizing upsampling u(.) and downsampling d(.) capacities. Assume G(I) ¼ [I0; I1; …; IK] be the Gaussian pyramid where I0 ¼ I and IK are k rehashed utilization of d(.) to I. At that point, the coefficient hk at level k of the Laplacian pyramid is given by the difference among the neighboring levels in the Gaussian pyramid, expanding the little one with u(.). hk ¼ Lk ðI Þ ¼ Gk ðI Þ uðGk + 1 ðI ÞÞ ¼ Ik uðIk + 1 Þ (3.2) Laplacian pyramid coefficients [h1; …; hk] reconstruction can be performed by backward recurrence given as follows: Ik ¼ uðIk + 1 + hk Þ (3.3) So, a set of convolutional generative models G0; G1; … Gk, is used while preparing a LAPGAN where each of which captures the dispersion of coefficients hk for various phases of the Laplacian pyramid. The generative structures are utilized to generate hk0 s during reconstruction. Modification of Eq. (3.2) is given as follows: I’k ¼ uððI’k + 1 Þ + h’k Þ ¼ uðI’k + 1 Þ + G’k ðzk, uðI’k + 1 ÞÞ (3.4) Training image I is used for the construction of a Laplacian pyramid. 
At each level, a stochastic choice is then made as to whether the coefficient h_k is constructed by the standard procedure or produced by G_k [46]. D and G compete in the two-player minimax game with value function V(G, D):

\min_G \max_D V(D, G) = \mathbb{E}_{y, x \sim p_{data}(y, x)}[\log D(y, x)] + \mathbb{E}_{x \sim p_x,\, z \sim p_z(z)}[\log(1 - D(G(z, x), x))]   (3.5)

LAPGAN is a cascaded system in which a set of images is ordered by quality from low to high. Starting from a low-quality sample, it first produces a low-quality image and then passes it, as input together with a higher-quality image, to the next level. At each level, the generator is paired with a discriminator that decides whether an input image is authentic or fake. After repeated stages of refinement, the quality of the output image is greatly improved and looks more authentic. The approach is well suited to high-quality images because it is trained in a supervised fashion. The advantages of LAPGAN are that it is easy to apply; it learns residuals, so a different distribution can be learned at each stage by the corresponding generator and passed as supplementary information to the next layer; the stages can be trained step by step and independently; and it increases the capability of GANs. In addition, it incorporates the CGAN idea, turning an unsupervised approach into supervised learning with a significant performance improvement. The disadvantage is that it must be trained with supervision.
3.4.4 Video generative adversarial network (VGAN)
The video GAN (VGAN) framework proposes using separate streams for generating the foreground and the background. Vondrick et al. [47] hypothesized that a video clip is a point in a latent space and in 2016 proposed a GAN that generates video [9] with a spatiotemporal convolutional design; it adapts the DCGAN model to predict future frames, create videos, and classify human actions. VGAN is a GAN for video in which the entire video is assumed to be composed of a stationary background scene and a dynamic foreground clip. The background is generated as a single image and then replicated over time, and a jointly trained mask selects between foreground and background to produce the video. To encourage the network to use the background stream, a sparsity prior is added to the mask during learning. The model therefore uses a two-stream generator in which a noise vector is the input to both streams: the background stream produces the stationary background image with 2-D convolutional layers, while the moving-foreground generator produces the 3-D foreground video cube and the corresponding 3-D foreground mask with spatiotemporal 3-D CNN layers, predicting plausible future frames. The discriminator takes the entire generated video as input and tries to distinguish it from real videos. Because VGAN treats video as a 3-D cube, it requires a large amount of storage; experiments suggest that the framework can generate short videos of up to a second at full frame rate better than simple baselines [21]. Investigations also show that the internal model learns features useful for recognizing actions with minimal supervision, suggesting that scene dynamics are a promising signal for representation learning. Several earlier attempts approached video generation with GANs [1], but past work concentrated largely on small patches and evaluated them for video classification. This system likewise learns a mapping from the latent space to video clips.
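The two-stream composition described above, video = mask * foreground + (1 - mask) * background, with the background replicated over time and a sparsity prior on the mask, can be illustrated with the following short PyTorch sketch; all tensors and the penalty weight are random stand-ins rather than values from the chapter.

import torch

# Toy illustration of the two-stream composition (tensors are random stand-ins;
# shapes follow a channels x time x height x width convention).
C, T, H, W = 3, 32, 64, 64
foreground = torch.rand(C, T, H, W)        # output of the 3-D foreground stream
background = torch.rand(C, 1, H, W)        # static image from the 2-D background stream
mask = torch.rand(1, T, H, W)              # per-pixel, per-frame mask in [0, 1]

# video = m * f + (1 - m) * b, with the background replicated over time.
video = mask * foreground + (1 - mask) * background.expand(C, T, H, W)

# The sparsity prior on the mask can be encouraged with an L1 penalty.
sparsity_penalty = 0.1 * mask.abs().mean()
print(video.shape, float(sparsity_penalty))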
Assuming that an entire video clip is a single point in the latent space, however, unnecessarily increases the complexity of the problem, because videos of the same action performed at different speeds are represented by different points in the latent space. In addition, this assumption forces every generated video clip to have the same length, whereas the length of real-world video clips varies. Applying GANs to video generation is generally considered difficult because video has an additional temporal dimension, which involves much larger computation and storage costs, and maintaining temporal coherence is also nontrivial.
3.4.5 Superresolution GAN (SRGAN)
SRGAN takes a low-resolution image as input and produces an upsampled image with a 4x upscaling factor. The main objective of superresolution (SR) is to improve the quality of a low-resolution image by upsampling it. The problem is ill-posed because the recovered high-resolution image lacks the high-frequency information lost during upscaling, particularly for large upscaling factors. Many other deep-learning-based methods [4, 45, 48] were proposed to handle this problem, but they did not perform well on very low-resolution images. Superresolution GAN uses deep-learning ideas to produce higher-quality images. During training, a high-resolution image is first converted into a low-resolution image by downsampling; the generator of the GAN is responsible for converting the low-resolution image back into a high-resolution image, and the discriminator is responsible for classifying the generated images [21]. Ledig et al. [18] proposed SRGAN, which takes a low-resolution image as input and produces an upsampled image at 4x the resolution. The network design of SRGAN follows the guidelines of the DCGAN [36] architecture, and the generator uses both convolutional and residual networks [21]. The objective function incorporates an adversarial loss together with a feature loss, rather than a pixel-wise mean-squared-error loss [9], to improve the realism of the restored image and achieve the 4x upscaling reconstruction. It also uses a perceptual loss based on features extracted by a convolutional neural network: by comparing the features of the generated image with those of the target image after the convolutional network, the generated image and the target image become more similar in semantics and texture [49]. The feature loss is evaluated as the distance between the feature maps of the generated, upscaled image and those of the real image, where the feature maps are extracted by feeding the images into a pretrained VGG19 network. Experiments show that SRGAN outperforms the best available methods on public datasets [21]. The loss is calculated as a weighted combination of a regularization, an adversarial, and a content loss, where the content term measures the difference between the two high-resolution images. The SRGAN generator G takes a low-resolution image I^LR and outputs its high-resolution estimate I^SR, with θ_G denoting the parameters of G:

\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N} \sum_{n=1}^{N} l^{SR}\left(G_{\theta_G}(I_n^{LR}),\, I_n^{HR}\right)   (3.6)

The SRGAN discriminator D classifies whether a high-resolution image is I^HR or I^SR, with θ_D denoting the parameters of D:

\min_{\theta_G} \max_{\theta_D} \ \mathbb{E}_{I^{HR} \sim p_{train}(I^{HR})}\left[\log D_{\theta_D}(I^{HR})\right] + \mathbb{E}_{I^{LR} \sim p_G(I^{LR})}\left[\log\left(1 - D_{\theta_D}\left(G_{\theta_G}(I^{LR})\right)\right)\right]   (3.7)

Wang et al. [50] proposed an enhanced SRGAN that improved the adversarial loss, the network design, and the perceptual loss.
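In the spirit of Eqs. (3.6) and (3.7), the sketch below combines a VGG19 feature ("content") loss with an adversarial term, as SRGAN does; the image tensors and the discriminator output are random stand-ins, the torchvision weights argument assumes a recent torchvision (older versions use pretrained=True), and the 10^-3 weighting mirrors the one reported for SRGAN.

import torch
import torch.nn as nn
from torchvision import models

# Frozen VGG19 feature extractor (deep layers, around conv5_4), used only to
# compare feature maps; weights="DEFAULT" downloads pretrained ImageNet weights.
vgg_features = models.vgg19(weights="DEFAULT").features[:36].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

mse, bce = nn.MSELoss(), nn.BCELoss()

sr_img = torch.rand(2, 3, 96, 96)          # stand-in for G_theta_G(I_LR)
hr_img = torch.rand(2, 3, 96, 96)          # ground-truth high-resolution image I_HR
d_sr = torch.sigmoid(torch.randn(2, 1))    # stand-in for D_theta_D(G_theta_G(I_LR))

content_loss = mse(vgg_features(sr_img), vgg_features(hr_img))   # VGG feature distance
adversarial_loss = bce(d_sr, torch.ones_like(d_sr))              # push D(G(I_LR)) -> 1
perceptual_loss = content_loss + 1e-3 * adversarial_loss         # weighting as in SRGAN
print(float(perceptual_loss))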
Generative adversarial networks and their variants 3.4.6 Face conditional generative adversarial network (FCGAN) FCGAN is face conditional GAN which focuses on facial picture SR. Berthelot et al. [51] projected BEGAN, which aims to try to maintain a balance that can be adapted for the trade-off among variety and trait. Huang projected FCGAN [52] that concentrates on facial picture SR. Inside the system design, both generator and the discriminator utilize a decoder, an encoder alongside skip associations. It creates excellent outcomes with 4* scaling factor. In preparation, the target function incorporates a loss, i.e., content loss, which is evaluated by the L1 pixel-wise dissimilarity between the produced upsampled picture and the ground truth. 3.5 Applications of GAN The significant function of GAN is the systems that create cases with a similar dispersion as genuine information, for example, producing photo-realistic pictures. GANs can likewise be utilized to handle the issue of inadequate preparation of cases for supervised or semi-supervised learning. As of now, a favorable use of GAN is computer vision which includes pictures and video, for example, image-to-image translation, video generation, generation of cartoon characters, text-to-image translation, and many more. In this segment, the application scope of GANs is discussed [49]. GANs have some genuinely helpful practical applications, which incorporate the following. A. The application in the image • Image generation Generative systems can be utilized to create reasonable pictures after being prepared for sample pictures. For instance, to produce new pictures of dogs, a GAN can be prepared on thousands of samples of pictures of dogs. When the preparation has been completed, the generator system will have the option to create new pictures that are not quite the same as the pictures in the preparation set. Image generation is utilized in social media, marketing, entertainment, logo generation, and so on. Hanock et al. [53] projected composite GAN, which creates fractional pictures by various generators and lastly combined the whole picture. • Image-to-image translation It is utilized to change over pictures taken in the day to pictures taken around evening time, to change over portrayals to artistic creations, to style pictures to look such as Picasso or Van Gogh works of art, to change over airborne pictures to satellite pictures consequently, and to change over pictures of ponies to pictures of zebras. These utilization cases are ground-breaking since they can spare time. Phillip et al. [54] exhibited GAN’s, precisely pix2pix method for the image-to-image translation undertakings. Jun et al. [19] presented the renowned cycle GAN as well as the setup of noteworthy image-to-image translation models. Cycle GAN is a significant 73 74 Generative adversarial networks for image-to-image translation application framework of GAN in the field of a picture. It depends on two sorts of pictures that need no matching. A crying face can be transformed into a laugh through composition or zebra to the horse. Star GAN is a further advancement of Cycle GAN, where solidarity is taken to prepare a single classification for the next class. Star GAN is used to change the smiling look into a crying look, alongside a collection of appearances, for example, shock, disappointment, and so on. • High-resolution picture generation GANs can assist in creating high-quality pictures taken from low-quality camera pictures without losing any necessary details. 
Superresolution is a field in which GAN depicts a very remarkable outcome with commercial chances [55]. This can be valuable on websites. The utilization of GAN for SR tackles the inadequacies of the ordinary strategies, which includes the DL techniques, with absences of high recurrence data. Customary deep CNN can enhance the imperfection by choosing the target function. GAN can likewise take care of this issue and acquire fulfilling observation [49]. Christian et al. [18] show the utilization of GANs, explicitly SRGAN framework, to produce yield pictures having enriched pixel quality and sometimes even more. Huang et al. [56] utilize GAN to make variants of photos of personal appearances. Subeesh et al. [57] provide a case of GAN to make high-quality photos, concentrating on the road scene. • Photo inpainting The fundamental idea of this application is to fill the gaps of a picture. Numerous deep-learning procedures have come to tackle this issue, and the significant task is to fill the enormous gaps of a picture to make an ideal one. There are convolutional systems for picture inpainting however these are bad at filling the gaps with appropriate highlights, and henceforth generative models are utilized for searching the relevant highlights which are to be filled with, and these highlights are known through the preparation process [21]. Pathak et al. [50] have projected another technique for picture inpainting called context encoders which depend on convolutional systems prepared mostly to produce pictures at a discretionary. So these systems need to comprehend both full images and pictures with holes to recognize the highlights with which need to supplant with. The method proposed by Pathak et al. depends on encoder-decoder design. That framework is fit for taking pictures with input size 128 128 with gaps. The yield of that proposed framework is either the gap of the picture or the whole picture. The gap of the picture size will be 64 64, and the full picture is 128 128. GANs can assist in recovering those areas in the picture that has some missing parts. Deepak et al. [40] portrayed the utilization of GAN, explicitly context encoder to execute photo inpainting that is covering a region of a photo which was expelled for unknown reasons. Raymond et al. [58] used GAN to fill in and fix purposefully Generative adversarial networks and their variants • • • • • • corrupted photos of the human face. Yijun et al. [59] likewise used GAN for inpainting and remaking harmed photos of personal appearances [60]. Generation of realistic photograph Andrew et al. [61] demonstrated the creation of synthetic photos with BigGAN strategy, which are in every practical sense undefined from authentic photographs. 3D object generation 3D objects can be created with GAN [55]. Jiajun et al. [62] showed a GAN for producing new three-dimensional new items such as car, sofa, chair, and table. Matheus et al. [63] used GAN to produce 3D models that provide two-dimensional pictures of items from various points of view [60]. Face aging The fundamental point of this is to create a human picture at some age. On the off chance, if the present age of an individual is 20 years, the GAN is utilized to create a picture of that individual at 40 years. Face aging techniques change a facial picture to another age, while as yet keeping character [21]. A large portion of the GAN utilized for face aging includes conditional GAN. The primary point is to produce a picture with an objective mark age from a given initial face picture. 
This can be extremely valuable for both the surveillance and entertainment businesses. It is especially helpful for face verification since it implies that an organization does not have to change its security frameworks as individuals get older. An Age-cGAN [64] system can create pictures at various ages, which then could be utilized to prepare a reliable model for face confirmation. Grigory et al. [64] utilized GAN to create photos of faces having various evident ages, such as from young to old one. Zhifei et al. [65] utilized a GANdependent strategy for de-aging the photos of different faces. Generate photos of the human face Tero et al. [66] exhibited the creation of conceivable, reasonable photos of individual faces. It is reasonable to call the striking outcome because of genuine looks. In that capacity, the consequences got a lot of media consideration. Face generation is usually prepared on examples such as big name, implying that components of current superstars are in the produced faces, causing to appear to be recognizable, however not precisely. Their techniques were likewise used to show the generation of items and scenes. Few instances were utilized from this paper in a 2018 report to exhibit the quick advancement of GANs from 2014 to 2017 [60]. Generation of new human poses Liqian et al. [67] gave a case of creating current photos of individual structures with recent postures. Face frontal view generation Rui et al. [68] showed the utilization of GAN for creating front-view photos of individual faces provided photos taken at some particular point. The created front- 75 76 Generative adversarial networks for image-to-image translation on photographs can be utilized as intake is the concept behind it for face verification or face identification framework. • Generation of cartoon character Yanghua et al. [69] showed the preparation and usage of a GAN for creating anime characters’ faces which are Japanese comic book characters. Motivated by the anime models, many individuals have attempted to develop Pokemon characters, for example, the poke GANventure and produce the Pokemon with DCGAN task having constrained achievement [22]. B. The Application with the Video • Video synthesis GANs can likewise be utilized to produce videos. They can create content in less time than if somehow managed to make content physically. They can also improve the efficiency of filmmakers and furthermore engage specialists who need to build innovative videos in their available time. Carl et al. [47] portray the utilization of GAN for video forecast, explicitly foreseeing as long as a moment of video frames with progress, principally for stationary components of the picture. • Video frame prediction It represents determining the future frame regarding the current frames [21]. Mathieu et al. [70] firstly used GAN preparation for video prediction in which the generator can produce the last frame of the video dependent on the prior arrangement of the frames, and the discriminator is utilized to finish up the frame. All the frames aside from the last frame are genuine pictures. The discriminator can adequately utilize the data of the time measurement and furthermore helps to make the produced frame predictable with all the past frames is its advantage. Test outcomes depict that the frames are clearer than the other algorithms created by confrontation preparation. C. Application of human-computer interaction • Text-to-image synthesis It is the earlier application of domain-transfer GAN. 
No doubt, generating multiple pictures from text details is an intriguing use case of GANs. This can be useful in the film business, as a GAN is equipped for creating new information relied on some content that can be made up. In the comic industry, it is conceivable to naturally create arrangements of a story. Han et al. [71] exhibited the utilization of GAN, explicitly the Stack GAN to create practical appearing photos from textual portrayals of necessary items such as flying creatures and blossoms. • Auxiliary automatic driving Santana et al. [72] actualized the assisted automatic driving with GAN. Initially, a picture is created, which is reliable with the appropriation of the official movement scene picture, and afterward, a progress framework is prepared dependent on the cyclic neural system to anticipate the following movement pictures. Generative adversarial networks and their variants 3.6 Conclusion Nowadays, GANs are one of the most fascinating thoughts for many researchers to work on it and suggesting various models based on GAN with regard to computer engineering. Generative adversarial networks and their variants are the most promising generative approaches in the discipline of computer vision. In this chapter, a comprehensive review of GAN and their variants are provided. It can be seen that the latest variants of GAN are unsupervised and more stable than the previous models that can produce realistic content and texture details, which will be an advantage to various applications such as superresolution, image inpainting, etc. They are also applicable in different areas such as image classification, image-to-image translation, recovery of corrupted data, text-to-image generation, and many more endless applications. Comparison is also done between various GAN-based methods. References [1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, Curran Associates, 2014, pp. 2672–2680. [2] A. Mosavi, S. Ardabili, A.R. Várkonyi-Kóczy, List of Deep Learning Models. (2019), https://doi.org/ 10.20944/preprints201908.0152.v1 (Preprint). [3] J. Brownlee, A Gentle Introduction to Generative Adversarial Networks (GANs), Retrieved from https:// machinelearningmastery.com/what-are-generative-adversarial-networks-gans/, 2019, July 19. [4] C. Dong, C.C. Loy, K. He, X. Tang, Learning a deep convolutional network for image super-resolution. in: Computer Vision—ECCV 2014, 2014, pp. 184–199, https://doi.org/ 10.1007/978-3-319-10593-2_13. [5] M. Mirza, S. Osindero, Conditional Generative Adversarial Nets, (2014)ArXiv: abs/1411.1784. [6] Y. Hong, U. Hwang, J. Yoo, S. Yoon, How generative adversarial networks and their variants work. ACM Comput. Surv. 52 (1) (2019) 1–43, https://doi.org/10.1145/3301282. [7] K. Sohn, X.C. Yan, H. Lee, Learning structured output representation using deep conditional generative models, in: Proceedings of the 28th International Conference on Neural Information Processing Systems—vol. 2 (NIPS’15), MIT Press, Cambridge, MA, USA, 2015, pp. 3483–3491. [8] J. Le, The 10 Deep Learning Methods AI Practitioners Need to Apply, Retrieved fromhttps:// medium.com/cracking-the-data-science-interview/the-10-deep-learning-methods-aipractitioners-need-to-apply-885259f402c1, 2020, May 10. [9] Y. Hong, U. Hwang, J. Yoo, S. Yoon, How generative adversarial networks and their variants work: an overview, ACM Comput. Surv. 52 (2019) 1–43. [10] W. Sun, B. 
CHAPTER 4
Comparative analysis of filtering methods in fuzzy C-means: Environment for DICOM image segmentation
D. Nagarajan a, Kavikumar Jacob b, Aida Mustapha c, Udaya Mouni Boppana c, and Najihah Chaini b
a Department of Mathematics, Hindustan Institute of Technology and Science, Chennai, India
b Faculty of Applied Sciences and Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia
c Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia

4.1 Introduction
Early medical image analysis relied on a sequential application of low-level pixel processing and mathematical modeling to build rule-based systems, while artificial intelligence developed along similar lines. In the 1980s, magnetic resonance and computed tomography imaging systems were introduced, together with the machinery to encode and decode their image output. Digital imaging and communications in medicine (DICOM) has since improved the communication mechanism in the medical environment. In modalities such as CT, MR, X-ray, NM, RT, and US, DICOM is used to store images, print information about the patient's condition, and transmit correct information about the radiological images. The standard comprises both a file format and a network communication protocol, and it supports receiving images and patient data in DICOM format. DICOM has been widely adopted across medical environments, and derivatives of the standard are used in other application areas; it is also the basis of digital imaging and communication in nondestructive testing and in security. DICOM data consist of many attributes, including information such as the patient name, ID, and the image pixel data. A single DICOM object can have only one attribute containing pixel data, and that pixel data can be compressed using a variety of standards, including JPEG, JPEG Lossless, JPEG 2000, and run-length encoding.

Image processing is a rapidly growing field in which numerous techniques, especially image segmentation and edge detection, are important for diagnosing problems and disease. Digital images are used to obtain productive results and to support data recovery. Spatial variations in MRI caused by the radio-frequency coil affect the tissue statistics [1]. Medical image segmentation is an essential task in clinical diagnosis, yet most medical images contain overlapping gray-scale intensities of various tissues.
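As a concrete illustration of how these attributes and the single pixel-data element can be accessed programmatically, the short Python sketch below uses the pydicom library. It is only an illustration: the chapter's own experiments are carried out in MATLAB, and the file name here is purely hypothetical.

```python
# Minimal sketch: reading a DICOM object and extracting attributes and pixel data.
# The file name "slice_001.dcm" is hypothetical; which attributes are present
# depends on the particular file.
import numpy as np
import pydicom

ds = pydicom.dcmread("slice_001.dcm")            # parse the DICOM object

# A DICOM object carries many attributes alongside the single pixel-data element.
print(ds.PatientName, ds.PatientID, ds.Modality)

# pixel_array decodes the Pixel Data element (JPEG, JPEG 2000, RLE, ...)
# into a NumPy array, provided a matching decoder is installed.
img = ds.pixel_array.astype(np.float32)
print(img.shape, img.min(), img.max())
```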
Medical image data are inherently uncertain owing to noise, blur introduced during acquisition, and partial-volume effects from the sensor, all of which lower image quality. These issues can be handled with fuzzy sets, which provide membership functions; fuzzy clustering is therefore a suitable method for segmenting medical images. Cluster analysis groups a data set into clusters of similar individuals, and image segmentation partitions image pixels into homogeneous regions, so clustering algorithms are naturally suited to image segmentation [2]. Ordinary (crisp) clustering methods assign every data point to exactly one cluster, whereas fuzzy clustering allows overlapping membership in two or more clusters. For this reason, fuzzy clustering has been widely used in many fields, including image segmentation. The fuzzy C-means algorithm in particular has been applied extensively in image processing, including medical image segmentation, to classify the major tissues in MRI of the human brain; it is scale and shift invariant and readily incorporates multidimensional data [3]. Clustering data streams has also become important because of the growing volume of data collected over time. Dunn developed fuzzy C-means as a clustering methodology for image segmentation, and Bezdek later improved it [4, 5].

Due to noise and inhomogeneity, accurate segmentation of medical images is difficult. In the conventional approach, the color image is first transformed into a gray-scale image [6]. Training data are selected for each target class, and the image is clustered after applying filters to reduce noise; however, some clusters may still contain more than one target class and must be partitioned again until no such clusters remain. Because medical images, including ultrasound, X-ray mammography, and MRI, are represented and stored digitally, the application of image-processing methods has increased tremendously in recent years, and MRI in particular has been used in many studies. MRI is a powerful tool for detecting unusual changes in various parts of the brain at an early stage and is well suited to acquiring brain images with a high contrast level. Its acquisition parameters can be adjusted to obtain different gray-scale levels for various tissues and types of neuropathology. Although segmentation of the brain image is a difficult task, it is essential for detecting tumors, necrotic tissue, and edema in a diagnostic system. Many methods have been applied to this task, namely thresholding, statistical models, region growing, clustering, and active contour models. Because the intensity distribution in medical images is generally very complex, thresholding tends to fail, and region growing, which extends thresholding, requires seed points in every region and faces the same difficulty in maintaining homogeneity [6]. The most popular clustering algorithms used for segmentation are expectation-maximization (EM) and fuzzy C-means.
The EM method assumes that the intensity follows a particular distribution, typically a normal distribution, which makes it unsuitable for noisy images, whereas FCM considers only the image intensity and can be applied directly for clustering. Clustering is an unsupervised learning method in which similar observations are grouped together; it is an objective-function-based approach whose aim is to divide the observations into clusters that are as homogeneous as possible. The FCM algorithm is an unsupervised fuzzy clustering algorithm that produces a soft partition, in which each element can belong partially to multiple clusters. Such a partition need not be a strict fuzzy partition, since the total membership of a point may exceed one, but most algorithms do generate soft, i.e., fuzzy, partitions in which the membership degrees of every point across the clusters add up to one. In earlier days, computer-aided detection of abnormal tissue growth was motivated by the need for the best attainable accuracy; that process cannot be compared with recent, fully digital technologies, which allow the volume and location of the unwanted tissue to be observed directly [7]. Because every object can hold membership in more than one cluster, fuzzy partitions are more flexible than crisp ones. FCM clustering can use a simple color feature with adequate information to cluster video frames efficiently. Clustering algorithms are widely used in pattern recognition, data mining, computational biology, and computer vision; the objective is to group elements into clusters with a high degree of within-cluster similarity and a high degree of between-cluster dissimilarity, where dissimilarity can be measured through distance, symmetry, curvature, or intensity derived from the data set [8–10]. FCM clustering is also an instrument for categorizing image blocks and supports stepwise, detailed searching. FCM is thus a fuzzy classification model in which each datum contributes to a cluster and is identified by a membership degree. Various modifications of FCM clustering have been applied to crisp numbers, but only very few have been extended to noncrisp numbers, since that requires complicated equations and tedious calculations. Developing algorithms that can deal with the uncertainty in a data set is an important task. Type-1 membership functions can be generated automatically, rather than only from human experts and their perception, using FCM, self-organizing feature maps, or robust agglomerative mixture decomposition methods. Image segmentation with type-1 fuzzy sets may give unsatisfactory results, and applying type-2 fuzzy sets can deliver more desirable ones. Because the secondary grades of type-2 fuzzy sets are equal to one, these sets control the level of uncertainty in the data more efficiently than conventional methods, and using interval type-2 fuzzy sets reduces the computational complexity. The clustering procedure for data in a fuzzy environment is called fuzzy C numbers; these numbers may be normal, triangular, or trapezoidal fuzzy numbers [11–14]. Modeling membership functions from similarity decomposition and cluster centroids is the most important task in fuzzy clustering.
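The toy example below (not from the chapter, just a small NumPy illustration) contrasts a crisp partition with a fuzzy membership matrix whose rows sum to one.

```python
import numpy as np

# Hard (crisp) partition: each point belongs to exactly one cluster.
hard = np.array([[1, 0],
                 [0, 1],
                 [1, 0]])

# Fuzzy (soft) partition: each point has a membership degree in every cluster,
# and the degrees for any single point add up to one.
fuzzy = np.array([[0.90, 0.10],
                  [0.20, 0.80],
                  [0.55, 0.45]])

print(hard.sum(axis=1))   # [1 1 1]
print(fuzzy.sum(axis=1))  # [1. 1. 1.]
```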
In fuzzy cluster analysis, the membership matrix represents the relationships among the data and gives a more comprehensive view of those relationships, raising the expressiveness of the cluster analysis. In conventional methods, data that are equally distant from several representatives are still assigned to a single cluster [15].

4.1.1 Organization of chapter
The remainder of the chapter is organized as follows. Section 4.2 reviews the literature that defines the aim and scope of this work. Section 4.3 presents the methodology, including the proposed algorithm and the basic concepts needed to follow the work. Section 4.4 describes the experimental analysis of DICOM image segmentation with the FCM clustering algorithm, Section 4.5 analyzes its performance, and Section 4.6 presents the results and discussion. Section 4.7 concludes the work and points to future directions.

4.2 Related works
Ahmed et al. [1] introduced a new algorithm for fuzzy segmentation of MRI data that estimates intensity inhomogeneities. Their modified fuzzy C-means algorithm compensates for these inhomogeneities by allowing the label of a pixel to be influenced by its immediate neighborhood, and they illustrated its efficiency on synthetic images and MRI data. Yang et al. [2] proposed an alternative FCM algorithm for MRI segmentation to distinguish abnormal from normal tissues in ophthalmology and concluded that it outperforms the existing fuzzy C-means algorithm when detecting abnormal tissues, depending on the window selection. An extended version of the FCM clustering algorithm was introduced in Ref. [16] to overcome its sensitivity to noise. Roy et al. [3] studied intensity shading, variable cluster size, and the smoothness of the membership functions of the FCM algorithm in detail and introduced a new parameter, compactness, to capture additional information about the clusters; with it they proposed a fuzzy C-means algorithm with variable compactness for analyzing the major tissues in brain MRI. Hore et al. [4] presented an online fuzzy clustering algorithm for partitioning large, streaming data and concluded that it yields partitions of large MRI volumes comparable to clustering all the data at once. An automatic method based on FCM clustering was proposed to identify exudates in low-contrast digital retinal images of retinopathy patients with nondilated pupils [5]. Balafar [6] introduced an FCM-based method that converts the color image to a gray-level image using user-selected training data and reduces noise with an anisotropic filter. Suri and Sardana [17] predicted the gold price using FCM clustering with a known fuzzy membership function, weighted least squares, and the Takagi-Sugeno model. In 2011, Christ and Parvathi [7] proposed a new technique for medical image segmentation using the Silhouette method, spatial FCM, and hidden Markov random field-based FCM algorithms. Havens et al. [8] analyzed large databases using three new incremental kernelized FCM algorithms: rse-Kernelized FCM, sp-Kernelized FCM, and o-Kernelized FCM.
They compared the performance of all three algorithms and recommended rse-Kernelized FCM as the most suitable for large computational problems. Asadi and Charkari [9] produced video summaries using FCM clustering with a new keyframe extraction scheme based on the maximum membership grade, yielding static video summaries with high accuracy and a low error rate. Pimentel and Souza [10] introduced a novel multivariate approach that derives the memberships from the essential information in all features of the image. Biswas et al. [11] achieved fast fractal image compression using FCM clustering, taking the pixel patterns along the column direction of an image block as the classification features and applying a two-level classification method for stepwise, precise classification. Mulyana [12] identified medicinal plants using FCM clustering based on fractal features, namely the fractal dimension and fractal code, extracted from images of 20 varieties of medicinal plants with 30 samples each; the experiments achieved 85.04% and 79.94% for fuzzy clustering based on the fractal dimension and the fractal code, respectively. Moreno and Lopez [13] described the development of a trajectory planning system that combines fuzzy algorithms and machine vision methods to control the movement of a tele-commanded mobile robot. Hadi et al. [14] proposed a vector form of FCM that simplifies the application of FCM clustering to fuzzy numbers. Warunsin and Chitsobhuk [18] built a cyclone identification system based on histograms, comparing support vector machine classification with FCM clustering. Fredo et al. [19] segmented subcortical brain regions such as the corpus callosum (CC) and brain stem (BS) using FCM clustering and suggested that the resulting skeleton could be used to diagnose neural disorders such as autism automatically. Doganay et al. [20] developed a fully automatic algorithm for lung tissue segmentation in which a fast FCM clustering algorithm segments the lung region in two-dimensional high-resolution computed tomography images. Liu et al. [21] presented a variant of the fuzzy local information C-means clustering algorithm that incorporates region-level spatial, spectral, and structural information together with a region-level Markov random field model to segment color texture images accurately. Recently, Vani and Anusuya [22] implemented an isolated Kannada word recognizer using FCM and vector quantization. To improve the efficiency and speed of FCM, Stetco et al. [15] proposed the fuzzy C-means++ algorithm, which performed well on both artificially generated and real-world data sets. Velmurugan and Naveen [23] examined the use of clustering and preprocessing methods to predict disease in MRI brain images. Mohammed et al. [24] introduced an improved FCM algorithm that finds clusters in less time and applied it to image segmentation. Kaur and Tulsi [25] proposed an FCM method that gives good results for images with complex backgrounds, overcoming the failure to compute a threshold value when there is no significant change in the gray levels of the pixels.
Heriana [26] performed edge detection using FCM with an objective function based on the distribution of the mean and standard deviation of the four gradient-direction magnitudes computed at each pixel. Rai [27] introduced soft metaphor detection by assigning membership values to fuzzy sets that represent varying degrees of metaphoricity. Jebari et al. [28] proposed an automatic genetic FCM algorithm that uses newly defined genetic operators, including a new mutation operator, a crossover operator, and tournament selection, to determine the number of clusters and provide the initial centroids. Sivasaravanababu et al. [29] converted the captured RGB image into a gray-scale image and improved its illumination with image enhancement techniques. Zhang et al. [30] added a new diversity regularization term to the traditional FCM objective and solved the resulting objective with an optimization algorithm that converges to local optima with adequate time complexity. Edge detection on DICOM images [31] and image extraction from MRI DICOM images [32] were studied using MATLAB in a type-2 fuzzy setting, converting the DICOM image into a 2D gray-scale image. Jinlin et al. [33] introduced a new FCM clustering algorithm based on multiobjective optimization with a fuzzy distance measure that adjusts the weights of the local pixel information, improving performance and computation time when segmenting images corrupted by different types of noise. Santiago et al. [34] performed mass abnormality segmentation and classification with a modified FCM using histograms and a binary decision tree, highlighting the importance of preprocessing and fuzzy methods for the segmentation and classification of mammographic images. Torra [35] studied and analyzed the effect of the parameter m, which controls the degree of fuzziness of the solution obtained from the unsupervised FCM algorithm. Umoren et al. [36] refined an ophthalmic diagnostic system using the FCM algorithm and showed that the resulting pathological clusterings are obtained faster and more reliably. Srivastava et al. [37] analyzed images with the FCM algorithm by an apportionment procedure in which the image is treated as an object and subdivided into classes to overcome noise sensitivity. Tolentino et al. [38] proposed a new distance measure, combining trigonometric functions with the Manhattan distance, to improve the speed and accuracy of FCM. Vernanda et al. [39] addressed the clustering of data on students who continue to college, introducing school clustering based on FCM. Borthakur et al. [40] identified suitable metrics from heart-rate-variability analysis for sonification and investigated the use of auditory displays, supported by unsupervised machine learning techniques, to aid that analysis. Katircioglu et al. [41] determined the air permeability of denim fabric using the FCM algorithm, analyzing fabric samples under a microscope to count the bright pixel areas after improving the images with image processing.
Gan [42] proposed safe semisupervised FCM clustering and introduced a MinMax FCM to overcome issues such as wrongly labeled samples, which are examined carefully by constraining the corresponding predictions to those yielded by unsupervised clustering. However, image segmentation on the DICOM images of a patient's MRI has not yet been studied in the literature, so it remains open to many possibilities for innovative research, especially in the context of FCM clustering. Hence, in this chapter we study and analyze the performance of the fuzzy C-means clustering (FCMC) algorithm combined with different image filtering methods on a digital imaging and communications in medicine (DICOM) data set. The significance of this study lies in obtaining a lower false-positive rate together with a high detection rate. For this purpose, the DICOM color images are first converted to gray scale, and various filters are applied to reduce the noise.

4.3 Methodology
4.3.1 Proposed algorithm
In this section, edge detection is performed on the DICOM image of a magnetic resonance imaging (MRI) patient using the fuzzy C-means clustering (FCMC) algorithm.

Algorithm 4.1: Fuzzy C-means clustering algorithm
1. Convert the CT scan files to DICOM and flip them: mri = flipdim(mri,1);
2. Import the background image and show it on the axes: bg = imread('background.png');
3. Prevent plotting over the background and turn the axis off, making sure the background stays behind all the other uicontrols;
4. Convert RGB to the green-channel complement: GIm = imcomplement(green);
5. Apply contrast-limited adaptive histogram equalization (CLAHE);
6. Create the structuring element, se = strel('ball',8,8), and apply the morphological open: gopen = imopen(HIm,se);
7. Remove the optic disk, godisk = HIm - gopen, and smooth it with the 2D filter: medfilt = imguidedfilter(godisk,'DegreeOfSmoothing',1);
8. Segment using fuzzy C-means: ffcm1 = (['The 1st Cluster = ' num2str(ccc1)]); ffcm2 = (['The 2nd Cluster = ' num2str(ccc2)]);
9. Detect the edges of the segmented image: SegmentedImage = get(LTproject.segmented_image,'Userdata').

4.3.2 Evaluation metrics
Fuzzy C-means clustering [3] minimizes the energy function

$$J_{\mathrm{FCM}} = \sum_{i \in \mho}\;\sum_{j=1}^{C} u_{ij}^{\,p}\,\lVert y_i - v_j \rVert^{2} \tag{4.1}$$

where $y_i$ is the intensity of the observed image at the $i$th pixel, $C$ is the number of classes, $v_j$ is the centroid of the $j$th class, $\mho$ is the domain of the image, and $u_{ij}$ is the (nonnegative) membership of the $i$th pixel in the $j$th class, with $\sum_{j=1}^{C} u_{ij} = 1$ for all $i \in \mho$. The parameter $p$ is the weighting exponent, with $p > 1$; if $p = 1$, FCM reduces to the hard K-means algorithm with binary membership functions. The membership function and the cluster centers are updated by

$$u_{ij} = \left[\sum_{k=1}^{C}\left(\frac{d(y_i, v_j)}{d(y_i, v_k)}\right)^{\frac{2}{p-1}}\right]^{-1}, \qquad v_j = \frac{\sum_{k=1}^{n} u_{kj}^{\,f}\, y_k}{\sum_{k=1}^{n} u_{kj}^{\,f}} \tag{4.2}$$

where $f$ is the degrees of freedom. The classification metrics are

$$\mathrm{Accuracy} = \frac{N_{\mathrm{TP}} + N_{\mathrm{TN}}}{N_{\mathrm{TP}} + N_{\mathrm{TN}} + N_{\mathrm{FP}} + N_{\mathrm{FN}}} \tag{4.3}$$

Precision is expressed as

$$\mathrm{Precision} = \frac{N_{\mathrm{TP}}}{N_{\mathrm{TP}} + N_{\mathrm{FP}}} \tag{4.4}$$

and Eq. (4.5) gives the harmonic mean of precision and sensitivity:

$$\mathrm{Harmonic\ mean} = \frac{2\,N_{\mathrm{TP}}}{2\,N_{\mathrm{TP}} + N_{\mathrm{FP}} + N_{\mathrm{FN}}} \tag{4.5}$$
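The following NumPy sketch illustrates the alternating updates behind Eqs. (4.1) and (4.2) on a one-dimensional set of pixel intensities. It is only an illustration of the standard FCM iteration, not the chapter's MATLAB implementation; the synthetic intensities and parameter values are made up for the example.

```python
import numpy as np

def fcm_intensities(y, c=2, p=2.0, iters=50, eps=1e-9, seed=0):
    """Fuzzy C-means on a 1-D array of pixel intensities y.
    Returns the membership matrix u (len(y) x c) and the centroids v."""
    rng = np.random.default_rng(seed)
    v = rng.choice(y, size=c, replace=False).astype(float)      # initial centroids
    for _ in range(iters):
        d = np.abs(y[:, None] - v[None, :]) + eps               # d(y_i, v_j)
        # Membership update, Eq. (4.2): u_ij = 1 / sum_k (d_ij / d_ik)^(2/(p-1))
        u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (p - 1.0)), axis=2)
        # Centroid update: weighted mean of the intensities with weights u_ij^p
        v = (u ** p).T @ y / np.sum(u ** p, axis=0)
    return u, v

# Toy usage on a synthetic "image" with two intensity populations.
y = np.concatenate([np.random.default_rng(1).normal(50, 5, 500),
                    np.random.default_rng(2).normal(150, 5, 500)])
u, v = fcm_intensities(y, c=2)
labels = np.argmax(u, axis=1)        # defuzzify to obtain a hard segmentation
print(np.round(v, 1), np.bincount(labels))
```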
4.3.3 Morphological operations
A median filter removes noise from an image effectively. It is a classical preprocessing step that improves the results of later processing such as edge detection, and under some conditions it preserves edges while reducing noise. Hence, this filter has been used widely in digital image processing [24].

4.3.3.1 2D median filter
The 2D median filter is a nonlinear digital filtering technique used to remove noise from an image. Noise removal is a preprocessing step for subsequent operations such as edge detection, and because the median filter preserves edges while reducing noise, it is widely used in both image and signal processing.

4.3.3.2 Imguided filter
The Imguided filter performs edge-preserving smoothing on an image using a guidance image, which is the content of a second image. The guidance image can be a different version of the input image or an entirely different image. Guided image filtering is a neighborhood operation in which the statistics of the corresponding region in the guidance image are taken into account when computing the value of each output pixel. When the guidance image and the image to be filtered are the same, their structures coincide; when they differ, structures in the guidance image influence the filtered image.

4.3.3.3 Imfilter
Imfilter computes the value of each output pixel using double-precision, floating-point arithmetic and can filter images by convolution or by correlation. It handles data types using the rules of arithmetic saturation, and the output image has the same data type as the input image. If a result exceeds the range of the data type, the filter truncates it to the allowed range, and if the data type is integer, fractional values are rounded. Because of this truncation behavior, the image can be converted to another data type before calling Imfilter; if the input image is of class double, the output may contain negative values.

4.3.3.4 Wiener 2 filtering
Wiener 2 is a two-dimensional, linear, noise-removal filter that adapts itself to the local variance of the image. Where the variance is large, it performs little smoothing; where the variance is small, it smooths more strongly. This adaptive behavior preserves edges and other high-frequency parts of the image, and the filter also handles the preliminary computations before it is applied to the input image.

4.3.3.5 Gaussian filter
The Gaussian filter is a linear filter used to reduce noise; used alone, it blurs edges and lowers contrast. It is faster than the other filters, and Imadjust is not necessary when it is used.
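For orientation, the sketch below applies open-source counterparts of the filters of Section 4.3.3 to a synthetic noisy slice using SciPy. It is illustrative only: the MATLAB functions named above (imguidedfilter, imfilter, wiener2) are approximated rather than reproduced, and the kernel sizes are arbitrary.

```python
import numpy as np
from scipy import ndimage, signal

rng = np.random.default_rng(0)
img = rng.normal(0.5, 0.1, (128, 128))               # stand-in for a noisy slice

median  = ndimage.median_filter(img, size=3)          # 2D median filter
gauss   = ndimage.gaussian_filter(img, sigma=1.0)     # Gaussian smoothing
wiener2 = signal.wiener(img, mysize=3)                # locally adaptive Wiener filter
# Plain linear filtering by correlation with an averaging kernel
# (the role played by Imfilter in the chapter):
kernel  = np.ones((3, 3)) / 9.0
linear  = ndimage.correlate(img, kernel, mode="nearest")
# (An edge-preserving guided filter needs a guidance image and is omitted here
#  to keep the sketch small.)

for name, out in [("median", median), ("gaussian", gauss),
                  ("wiener2", wiener2), ("linear", linear)]:
    print(name, round(float(out.std()), 4))           # smoother output -> lower std
```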
4.3.4 Research design
Fig. 4.1 depicts the methodology used for image segmentation on DICOM data using FCM. As shown in Fig. 4.1, DICOM image segmentation begins with reading the DICOM image.

Fig. 4.1 DICOM image segmentation using FCM.

4.4 Experimental analysis
The proposed methods are implemented in the MATLAB 2015a environment. The DICOM files are shown as a montage in Fig. 4.2, and the slice with the best view is chosen from Fig. 4.2 and shown in full in Fig. 4.3. The data set used in this chapter is sourced from the digital imaging and communications in medicine (DICOM) database of brain images. The color type of the image is gray scale and the modality is computed tomography; the study description is facial bone from a 50-year-old female, and the slice thickness is 4. Fig. 4.3 shows an excerpt of the DICOM data set.

Fig. 4.2 Montage of the DICOM file.
Fig. 4.3 Image with the best view.

Medical imaging lends itself to convolutional models for image segmentation, but medical image segmentation data sets are limited, and little annotated data is available for training. While surgery is one treatment for brain tumors, radiation and chemotherapy may be used to slow the growth of tumors that cannot be physically removed. Magnetic resonance imaging (MRI) furnishes detailed images of the brain and is also a common test used to diagnose brain tumors. Accordingly, brain tumor segmentation from MR images can have a great impact on improved diagnostics, growth-rate prediction, and treatment planning. Some tumors are easily segmented, while others are difficult to identify, locate, and diagnose: they are often diffuse, poorly contrasted, and extend tentacle-like structures that make them difficult to segment. Another basic difficulty is that brain tumors can appear with any shape and size, anywhere in the brain. The brain is typically described as three tissue classes: white matter, gray matter, and cerebrospinal fluid. Brain tumor segmentation aims to detect the location and extent of the tumor regions by separating abnormal areas from normal tissue, and the borders of abnormal tissue are often fuzzy and hard to distinguish from healthy tissue.

4.5 Performance analysis
The performance of the entire process of image segmentation on the DICOM image using FCM is illustrated in Fig. 4.4. Information on the physical object is captured as 3D presentation states (3DPR), which are intended to store all parameters and relevant information of a 3D visualization. The main purpose of 3DPR is to allow the presentation of an image to be stored and distributed: what 2D presentation states do for single images, 3DPR does for volume data. The experiment therefore develops a systematic, DICOM-conformant parameterization of 3D visualization. This corresponds to parameterizing all procedures of 3D medical visualization and storing all necessary parameters and data in a 3DPR object, which can then be used to rerun the procedures automatically and regenerate the 3D visualization. The procedures to be parameterized are preprocessing, segmentation, and postprocessing. Instead of storing only the segmentation parameters, the segmented voxel data can be stored using lossless compression, and various compression methods are used across diverse test cases. Clear visibility of the image is obtained from the green-channel image with high contrast, as shown in Fig. 4.4. In the denoising step, the main disadvantage of existing methods is over-amplification in relatively homogeneous regions of the image; to overcome this, contrast-limited adaptive histogram equalization is used, as shown in Fig. 4.4. Noise is then removed effectively with the median filter, a classical preprocessing step that improves later processing such as edge detection and that, under some conditions, preserves edges during noise reduction.
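A rough Python/scikit-image rendering of this preprocessing chain, written only to illustrate the order of operations in Algorithm 4.1, is given below. The flat disk structuring element and the parameter values are assumptions, not the chapter's exact MATLAB settings (which use a ball-shaped element and a guided filter).

```python
import numpy as np
from scipy import ndimage
from skimage import exposure, morphology

def preprocess(rgb):
    """Sketch of the chain described above: green-channel complement -> CLAHE ->
    morphological opening -> bright-region (disk) removal -> median smoothing."""
    green = rgb[..., 1].astype(float) / 255.0
    gim = 1.0 - green                                      # channel complement
    him = exposure.equalize_adapthist(gim)                 # CLAHE
    opened = morphology.opening(him, morphology.disk(8))   # flat-disk opening
    no_disk = np.clip(him - opened, 0.0, 1.0)              # suppress large bright blobs
    return ndimage.median_filter(no_disk, size=3)          # final denoising

# Dummy RGB input just to exercise the pipeline end to end.
rgb = (np.random.default_rng(0).random((128, 128, 3)) * 255).astype(np.uint8)
out = preprocess(rgb)
print(out.shape, round(float(out.min()), 3), round(float(out.max()), 3))
```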
Hence this filter is used widely in digital image processing, as shown in Fig. 4.4. In this part, the image is processed with the morphological open, removal of the bright disk, a 2D median filter, background removal, and image adjustment, as shown in Fig. 4.4. After the background removal and image adjustment, edge detection is applied to the segmented images to detect the edges.

Fig. 4.4 Different filter segmentation.

4.6 Results and discussion
In the proposed system, the 2D median filter is found to be the best filter for extracting the image from the DICOM information. The classification output of the experiment reveals an image-extraction accuracy of 97%, with 5% sensitivity, 99% specificity, 12% PPV, and a 7% harmonic mean of precision and sensitivity. The classification outputs are shown in Table 4.1 and Fig. 4.5. The accuracy of image extraction is 97% for the 2D median filter, 94% for Imguided, 96% for Imfilter, 96% for Wiener 2, and 96% for Medfilter, so in terms of accuracy the 2D median filter is the best of the filters for extracting the image from the DICOM information.

Table 4.1 Classification outputs.
Filters      Accuracy   Sensitivity   Specificity   FPR      PPV      Harmonic mean
2D median    0.9771     0.1418        0.9858        0.0192   0.1363   0.1734
Imguided     0.9431     0.2314        0.9568        0.0432   0.0933   0.1330
Imfilter     0.9662     0.0468        0.9848        0.0022   0.1174   0.0649
Wiener 2     0.9684     0.0996        0.9783        0.0117   0.1168   0.1236
Medfilter    0.9618     0.1354        0.9794        0.0106   0.1125   0.1654

Fig. 4.5 Accuracy comparison of all filters.
Fig. 4.6 Sensitivity comparison of all filters.
Fig. 4.7 Specificity comparison of all filters.
Fig. 4.8 FPR comparison of all filters.
Fig. 4.9 PPV comparison of all filters.
Fig. 4.10 Harmonic mean comparison of all filters.

As shown in Fig. 4.6, the sensitivity of image extraction is 4% for the 2D median filter, 23% for Imguided, 4% for Imfilter, 9% for Wiener 2, and 13% for Medfilter; in the proposed system the 2D median filter and Imfilter give the same sensitivity. Fig. 4.7 shows that the specificity is 98% for the 2D median filter, 23% for Imguided, 4% for Imfilter, 9% for Wiener 2, and 13% for Medfilter; the 2D median filter and Imfilter give the same specificity. Fig. 4.8 shows that the false positive rate (FPR) is 0% for the 2D median filter, 4% for Imguided, 0% for Imfilter, 0% for Wiener 2, and 0% for Medfilter. Fig. 4.9 shows that the PPV is 12% for the 2D median filter, 9% for Imguided, 27% for Imfilter, 16% for Wiener 2, and 21% for Medfilter. Fig. 4.10 shows that the harmonic mean is 7% for the 2D median filter, 13% for Imguided, 6% for Imfilter, 12% for Wiener 2, and 16% for Medfilter. Overall, the classification output of the experiment reveals an image-extraction accuracy of 97%, with 5% sensitivity, 99% specificity, 12% PPV, and a 7% harmonic mean of precision and sensitivity.
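To show how the columns of Table 4.1 follow from Eqs. (4.3)–(4.5), the small sketch below computes them from confusion-matrix counts. The counts used here are hypothetical and do not reproduce the reported values.

```python
def metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, FPR, PPV, and the harmonic mean
    (Eqs. 4.3-4.5) from confusion-matrix counts."""
    acc  = (tp + tn) / (tp + tn + fp + fn)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    fpr  = fp / (fp + tn)
    ppv  = tp / (tp + fp)
    hm   = 2 * tp / (2 * tp + fp + fn)
    return acc, sens, spec, fpr, ppv, hm

# Hypothetical counts per filter, only to show how such a table is assembled.
counts = {"2D median": (140, 30, 9000, 850),
          "Gaussian":  (120, 60, 8970, 870)}
for name, (tp, fp, tn, fn) in counts.items():
    print(name, [round(m, 4) for m in metrics(tp, fp, tn, fn)])
```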
In the proposed system, the 2D median filter is thus one of the best filters for extracting the image from the DICOM information.

4.7 Conclusion
Most attempts fail to segment images because of noise, uneven content, low contrast, and inhomogeneity in the image to be segmented, and it is for these reasons that the methods above are needed to reduce the error. Image segmentation is the procedure of separating a digital image into numerous segments; its aim is to turn the image into a more meaningful representation that is easier to interpret and analyze. With it, one can locate objects, curves, and lines in images, and each pixel in the image is labeled so that pixels sharing a label share certain characteristics. Image segmentation is therefore very useful in digital image processing. In this chapter, image segmentation has been performed on the DICOM image of a patient's MRI, and it has been observed that very little memory is needed to save the file. In the future, the process may be extended to neutrosophic and plithogenic environments.

Acknowledgment
This research is supported by Universiti Tun Hussein Onn Malaysia, Malaysia under GPPS Vote No: H346.

References
[1] M.N. Ahmed, S.M. Yamany, N. Mohamed, A.A. Farag, T. Moriarty, A modified fuzzy C-means algorithm for bias field estimation and segmentation of MRI data, IEEE Trans. Med. Imaging 21 (3) (2002) 193–199.
[2] M.S. Yang, Y.J. Hu, K.C.R. Lin, C.C.L. Lin, Segmentation techniques for tissue differentiation in MRI of ophthalmology using fuzzy clustering algorithms, Magn. Reson. Imaging 20 (2002) 173–179.
[3] S. Roy, H. Agarwal, A. Carass, Y. Bai, D.L. Pham, J.L. Prince, Fuzzy c-means with variable compactness, in: IEEE International Symposium on Biomedical Imaging: From Nano to Macro, 2008, pp. 452–455, https://doi.org/10.1109/isbi.2008.4541030.
[4] P. Hore, P.O. Hall, D.B. Goldgof, W. Cheng, Online fuzzy c means, in: NAFIPS 2008—2008 Annual Meeting of the North American Fuzzy Information Processing Society, 2008, https://doi.org/10.1109/nafips.2008.4531233.
[5] A. Sopharak, B. Uyyanonvara, S. Barman, Automatic exudate detection from non-dilated diabetic retinopathy retinal images using fuzzy C-means clustering, Sensors 9 (2009) 2148–2161.
[6] M.A. Balafar, A.B.D.R. Ramli, M.I. Saripan, S. Mashohor, Medical image segmentation using fuzzy C-mean (FCM) and user specified data, J. Circuits Syst. Comput. 19 (1) (2010) 1–14.
[7] M.C.J. Christ, R.M.S. Parvathi, Fuzzy c-means algorithm for medical image segmentation, in: 2011 3rd International Conference on Electronics Computer Technology, 2011, pp. 33–36, https://doi.org/10.1109/icectech.2011.5941851.
[8] T.C. Havens, J.C. Bezdek, M. Palaniswami, Incremental kernel fuzzy c-means, in: Computational Intelligence, Springer, 2012, pp. 3–18.
[9] E. Asadi, N.M. Charkari, Video summarization using fuzzy C-means clustering, in: 20th Iranian Conference on Electrical Engineering (ICEE2012), 2012, pp. 690–694, https://doi.org/10.1109/IranianCEE.2012.6292442.
[10] B.A. Pimentel, R.M.C.R.D. Souza, A multivariate fuzzy c-means method, Appl. Soft Comput. 13 (2013) 1592–1607.
[11] A.K. Biswas, S. Karmakar, S. Sharma, M.K. Kowar, Fast fractal image compression by pixels pattern using fuzzy c-means, J. Eng. Res. 1 (3) (2013) 109–121.
[12] I. Mulyana, Y. Herdiyeni, S.H.
Wijaya, Identification of medical plant based on fractal by using clustering fuzzy C-means, in: The Second International Conference on Information Technology and Business Application (ICIBA2013), 2013, ISBN: 978-979-3877-16-7.
[13] R.J. Moreno, J.D. Lopez, Trajectory planning for a robotic mobile using fuzzy c-means and machine vision, in: Symposium of Signals, Images and Artificial Vision (STSIVA2013), 2013, https://doi.org/10.1109/stsiva.2013.6644912.
[14] M. Hadi, K. Morteza, S.Y. Hadi, Vector fuzzy C-means, J. Intell. Fuzzy Syst. 24 (2013) 363–381.
[15] A. Stetco, X.J. Zeng, J. Keane, Fuzzy C-means++: fuzzy C-means with effective seeding initialization, Expert Syst. Appl. 42 (21) (2015) 7541–7548.
[16] Y. Yang, Image segmentation by fuzzy C-means clustering algorithm with a novel penalty term, Comput. Inform. 26 (2007) 17–31.
[17] P.R. Suri, N. Sardana, Forecasting gold prices using fuzzy C means, J. Comput. 3 (3) (2011) 99–106.
[18] K. Warunsin, O. Chitsobhuk, Cyclone identification using fuzzy C mean clustering, in: 13th International Symposium on Communications and Information Technologies (ISCIT), 2013, pp. 369–373, https://doi.org/10.1109/ISCIT.2013.6645884.
[19] A.R.J. Fredo, G. Kavitha, S. Ramakrishnan, Analysis of sub-cortical regions in cognitive processing using fuzzy c-means clustering and geometrical measure in autistic MR images, in: 2014 40th Annual Northeast Bioengineering Conference (NEBEC), 2014, https://doi.org/10.1109/NEBEC.2014.6972791.
[20] E. Doganay, S. Kara, H.K. Ozcelik, Automatic segmentation of the lungs from HRCT scans by using fuzzy C-means, in: International Symposium on Sustainable Development (ISSD 2014), 2014, p. 77.
[21] G. Liu, P. Li, Y. Zhang, Color texture image segmentation method based on fuzzy c-means clustering and region-level Markov random field model, Math. Probl. Eng. 2014 (2015) 1–9.
[22] H.Y. Vani, M.A. Anusuya, Isolated speech recognition using fuzzy C means technique, in: 2015 International Conference on Emerging Research in Electronics, Computer Science and Technology, 2015, pp. 353–357, https://doi.org/10.1109/ERECT.2015.7499040.
[23] T. Velmurugan, A. Naveen, Analysing MRI brain images using fuzzy C-means algorithm, Int. J. Control Theory Appl. 9 (10) (2016) 4661–4675.
[24] H.R. Mohammed, H.H. Alnoamani, A.A. Jalil, Improved fuzzy C-mean algorithm for image segmentation, Int. J. Adv. Res. Artif. Intell. 5 (6) (2016) 7–10.
[25] B. Kaur, K.P. Tulsi, Improving the color image segmentation using fuzzy-C-means, in: 2016 International Conference on Advanced Communication Control and Computing Technologies (ICACCCT), 2016, pp. 789–794, https://doi.org/10.1109/ICACCCT.2016.7831747.
[26] O. Heriana, A.N. Rahman, M.T. Miftahushudur, Image edge detection using objective function and fuzzy C means, in: 2017 International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET), 2017, pp. 149–153, https://doi.org/10.1109/icramet.2017.8253165.
[27] S. Rai, S. Chakraverty, D.K. Tayal, Y. Kukreti, Soft metaphor detection using fuzzy c-means, Lect. Notes Comput. Sci. (2017) 402–411, https://doi.org/10.1007/978-3-319-71928-3_38.
[28] K. Jebari, A. Elmoujahid, A. Ettouhami, Automatic genetic fuzzy c-means, J. Intell. Syst. (2018) 1–11, https://doi.org/10.1515/jisys-2018-0063.
[29] S. Sivasaravanababu, M.R. Barasu, G.S. Siva Priya, P. Punitha, K. Shanmuga Priya, Bronchogenic carcinoma identification with X-ray image using fuzzy C means, Int. J.
Pure Appl. Math. 119 (15) (2018) 727–730.
[30] L. Zhang, M. Luo, J. Liu, Z. Li, Q. Zheng, Diverse fuzzy c-means for image clustering, Pattern Recogn. Lett. (2018), https://doi.org/10.1016/j.patrec.2018.07.004.
[31] D. Nagarajan, M. Lathamaheswari, R. Sujatha, J. Kavikumar, Edge detection on DICOM image using triangular norms in type-2 fuzzy, Int. J. Adv. Comput. Sci. Appl. 9 (11) (2018) 462–475.
[32] D. Nagarajan, M. Lathamaheswari, J. Kavikumar, Hamzha, A type-2 fuzzy in image extraction for DICOM image, Int. J. Adv. Comput. Sci. Appl. 9 (12) (2018) 351–362.
[33] C. Jinlin, Y. Chunzhi, X. Guangkui, L. Zing, Image segmentation method using fuzzy C mean clustering based on multi-objective optimization, J. Phys. Conf. Ser. 1004 (2018) 012035, https://doi.org/10.1088/1742-6596/1004/1/012035.
[34] V.D. Santiago, Q.R. Martinez, L.A. Mecias, B.M.L. Romanach, Mammographic mass segmentation using fuzzy C-means and decision trees, Lect. Notes Comput. Sci. (2018) 1–10, https://doi.org/10.1007/978-3-319-94544-6_1.
[35] V. Torra, On the selection of m for fuzzy c-means, in: 9th Conference of the European Society for Fuzzy Logic and Technology, 2015, pp. 1571–1577, https://doi.org/10.2991/ifsa-eusflat15.2015.224.
[36] I. Umoren, G. Usua, F. Osang, Analytic medical process for ophthalmic pathologies using fuzzy C-mean algorithm, Innov. Syst. Softw. Eng. 7 (2019) 67–84.
[37] A. Srivastava, B. Hazela, P. Khanna, D. Arora, Application of fuzzy C-means (FCM) algorithm in image appointment, IOSR J. Eng. (2019) 4–8.
[38] J.A. Tolentino, B.D. Gerardo, P.M. Ruji, Enhanced Manhattan-based clustering using fuzzy C-means algorithm, in: The 14th International Conference on Computing and Information Technology (IC2IT 2018), 2019, pp. 126–134, https://doi.org/10.1007/978-3-319-93692-5_13.
[39] D. Vernanda, N.N. Purnawan, T.H. Apandi, School clustering using fuzzy C means method, SinkrOn J. Penelit. Tek. Inform. 4 (1) (2019), https://doi.org/10.33395/sinkron.v4i1.10168.
[40] D. Borthakur, V. Grace, P. Batchelor, H. Dubey, Fuzzy C-means clustering and sonification of HRV features, in: 2019 IEEE/ACM 4th International Conference on Connected Health: Applications, Systems and Engineering Technologies, 2019. arXiv:1908.07107 [cs.HC].
[41] G. Katircioglu, E.K. Aydogan, M. Ozmen, E. Akgul, Determination of Denim fabric's air permeability with image processing using fuzzy C means, in: International Conference on Intelligent and Fuzzy Systems (INFUS 2019), 2019, pp. 1208–1214, https://doi.org/10.1007/978-3-030-23756-1_142.
[42] H. Gan, Safe semi-supervised fuzzy c-means clustering, IEEE Access (2019) 1–6, https://doi.org/10.1109/access.2019.2929307.

CHAPTER 5
A review of the techniques of images using GAN
Rituraj Soni a and Tanvi Arora b
a Department of CSE, Engineering College Bikaner, Bikaner, Rajasthan, India
b Department of CSE, CGC College of Engineering, Landran, Mohali, Punjab, India

5.1 Introduction to GANs
Generative adversarial networks (GANs) are models that have been constructed for image-to-image translation. They are considered a powerful class of neural networks for unsupervised learning. The concept of the GAN was introduced by Ian J. Goodfellow [1] in 2014. The name can be divided into three parts:
• Generative: describes how the data are generated.
• Adversarial: the training of the model is carried out in a competitive manner.
• Networks: deep neural networks are used for the training process.
A GAN basically consists of two networks, a generator network and a discriminator network, as shown in Fig. 5.1. The two networks compete with each other, and in doing so they also train each other through multiple cycles of generation and discrimination. The generator network aims to generate new images, text, audio, and so on; these new items are fake. The discriminator checks, with the help of the trained model, whether these items are fake or real, and it performs this analysis using feedback and loss functions. Figs. 5.2–5.4 display the output obtained from different types of GANs: Fig. 5.2 shows the transformation of one object into another depending on the given inputs, Fig. 5.3 shows a GAN generating high-resolution images, and Fig. 5.4 shows a GAN performing image-to-image translation, thereby enlarging the dataset with images that are close to realistic.

Fig. 5.1 Basic structure of GAN [2].
Fig. 5.2 Example of GANs transforming zebra to horse [3].
Fig. 5.3 Example of GANs generating high-resolution images [3].
Fig. 5.4 Example of image-to-image translation [3].

5.1.1 Need for GANs
GANs have gained popularity in just 2–3 years. They can generate very realistic images and videos, which can help implement image editors and processors on our tablets and smartphones. GANs are capable of modeling data distributions and can produce clearer and sharper images. A GAN can train any type of generator network, with no limitation, whereas other techniques impose restrictions on the generator network and can only be used in specific cases. Moreover, GAN models do not depend on a Markov chain to generate samples. These advantages make GANs promising solutions for generating the image datasets required to train deep learning models, which need a large number of training items. The cost of physically collecting and labeling items is quite high, whereas a GAN can generate dataset items with minimum effort and at quite low cost. GANs can also generate face photos, cartoon characters, emoji-style photos, and models for advertisements automatically, all from a single base photo, and the different variants can be generated automatically. GAN models are also needed for photo editing: they can make photos clearer and improve their resolution, which can be used to derive meaningful information from otherwise unclear images. They can help researchers generate large numbers of realistic-looking images from input given as sketches or semantic images. Beyond that, GANs can generate images from text descriptions, and image-to-image conversion can also be carried out with their help.
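Before turning to the remaining applications, a minimal PyTorch sketch of the two-network game described above is given here. The layer sizes, optimizers, and the 28 × 28 image shape are arbitrary choices for illustration, not a reference implementation.

```python
import torch
import torch.nn as nn

# Tiny generator and discriminator for flattened 28x28 gray images.
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(),
                  nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1))                      # raw logit

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):                     # real: (batch, 784) scaled to [-1, 1]
    b = real.size(0)
    z = torch.randn(b, 64)
    fake = G(z)

    # Discriminator: push real toward 1 and fake toward 0.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(b, 1)) + bce(D(fake.detach()), torch.zeros(b, 1))
    loss_d.backward()
    opt_d.step()

    # Generator: try to make the discriminator output 1 on fakes.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(b, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()

# Dummy "real" batch just to show the call; in practice it comes from a dataset.
print(train_step(torch.rand(16, 784) * 2 - 1))
```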
GAN models can be used for photo editing to such an extent that one can produce different kinds of images reflecting variation in facial expressions, gestures, lip movements, gender, hair color, and so on. It can therefore be seen that GAN models are needed for generating synthetic datasets, for image-to-image conversion, for text-to-image conversion, for editing blurred or low-resolution images, for forecasting the looks of an individual after a certain age, and for generating 3D models. The ultimate need for GANs is to generate data for training neural network-based models, since the accuracy of such models depends on the quality of the training data. On the other hand, the success of a GAN application depends on how well the GAN architecture is trained; if training is not carried out well, the results may not be good enough for research on real-time applications. In Section 5.2, the various GAN architectures are discussed along with their underlying models and operation.

5.2 GAN architectures
This section provides essential insight into the operation and modeling of the different GAN architectures. Each architecture has its own way of working and thus contributes to generating images for datasets in various research problems.

5.2.1 Fully connected GANs
A basic concept in GAN research is the use of deep convolutional neural networks (CNNs) for image synthesis tasks; in this traditional approach, the pooling layers and the fully connected layers [4, 5] are removed or minimized from the GAN. Barua et al. [6] proposed a fully connected and convolutional net architecture for GANs (FCC-GAN), arguing that using multiple fully connected layers together with convolution layers gives better performance than the conventional architecture. In conventional GANs, a single deep convolutional pass generates the images, whereas the work of Barua et al. [6] describes a two-step process for image generation with FCC-GANs. The first step obtains high-dimensional image features from the low-dimensional input noise, and the second step generates the image from these high-dimensional features. The fully connected layers help capture the relationship among the input noise features and thus produce final image features that are closer to natural images; convolution layers cannot achieve this global mapping because of their emphasis on local connectivity. The methodology of Barua et al. [6] accomplishes the following aims:
• The combined use of fully connected and convolution layers generates higher-quality images on different benchmark datasets than existing GAN methods.
• FCC-GANs learn faster than conventional GANs and produce high-quality, realistic images within a few epochs of training.
• FCC-GANs give better results on measures such as the Fréchet inception distance and the inception score than existing CNN architectures on the benchmark datasets.
• The proposed architecture is robust and stable compared with existing CNN architectures.
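The sketch below gives one possible PyTorch rendering of this two-step idea, with fully connected layers expanding the noise into high-dimensional features before a small transposed-convolution stack. The layer sizes are illustrative and are not the exact architecture of Barua et al. [6].

```python
import torch
import torch.nn as nn

class FCCGenerator(nn.Module):
    """FCC-GAN-style generator sketch: fully connected layers expand the noise
    vector into a high-dimensional feature map, which transposed convolutions
    then turn into a 32x32 RGB image. Sizes are illustrative only."""
    def __init__(self, z_dim=64):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(z_dim, 512), nn.ReLU(),
            nn.Linear(512, 4 * 4 * 256), nn.ReLU())            # high-dim features
        self.conv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),  # 4x4  -> 8x8
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),   # 8x8  -> 16x16
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh())     # 16x16 -> 32x32

    def forward(self, z):
        h = self.fc(z).view(-1, 256, 4, 4)   # reshape flat features to a 4x4 map
        return self.conv(h)

img = FCCGenerator()(torch.randn(8, 64))
print(img.shape)   # torch.Size([8, 3, 32, 32])
```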
A simple example of the FCC-GAN proposed by Barua et al. [6] is shown in Fig. 5.5, and the corresponding conventional GAN in Fig. 5.6. These models create 32 × 32 × 3 RGB images from the random noise vector z. In Fig. 5.5 the number in each box denotes the number of nodes, whereas in the conventional architecture of Fig. 5.6 the number in each box indicates the shape of the output layer. The FCC-GAN of Fig. 5.5 can be used for images of different resolutions by changing the depth and shape of the convolution stack. The experiments in Barua et al. [6] were carried out on four datasets: MNIST [7], CIFAR-10 [8], SVHN [9], and CelebA [10]. Experiments on these datasets showed that FCC-GANs produce higher-quality images and converge faster than the traditional GAN approach. The stability of FCC-GANs was demonstrated with different parameters, indicating their value for Pix2Pix image generation. The most important advantage of FCC-GANs is that the idea can be combined with any GAN method and can also be used in complex networks such as ResNet [11].

Fig. 5.5 (Top) FCC-GAN generator; (Bottom) FCC-GAN discriminator [6].
Fig. 5.6 (Left) Conventional generator; (Right) conventional discriminator [6].

5.2.2 Conditional GANs
The concept of the conditional generative adversarial network (CGAN) was first introduced by Mehdi Mirza and Simon Osindero [12]. The idea is an augmentation of the GAN and is used in machine learning for training image-to-image generative models. In the traditional GAN model, no conditions are applied to the generator and the discriminator, so there is no control over the type of data such GANs generate; if the given framework does not require that data, the effort is simply wasted. In a CGAN, by contrast, a condition can be applied to both the generator and the discriminator. The condition can be based on the class labels of the images or on some other property [13]. An available GAN model can therefore be converted into a CGAN by applying additional conditioning information y to the generator and the discriminator. As Fig. 5.7 shows, along with the input z, a condition is also applied to the GAN to convert it into a CGAN. Another example of the CGAN is shown in Fig. 5.8; here the condition y is added to the generator as well as the discriminator to obtain the desired output.

Fig. 5.7 An example of the conditional adversarial net [12].
Fig. 5.8 An example of the conditional adversarial net [14].
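A minimal PyTorch sketch of this conditioning, in which an embedded label y is concatenated to the generator input z and to the discriminator input x, is shown below. The network sizes and the ten-class setup are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

n_classes, z_dim = 10, 64

class CondGenerator(nn.Module):
    """Conditional generator: the class label y is embedded and concatenated to z."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(nn.Linear(z_dim + n_classes, 256), nn.ReLU(),
                                 nn.Linear(256, 784), nn.Tanh())
    def forward(self, z, y):
        return self.net(torch.cat([z, self.embed(y)], dim=1))

class CondDiscriminator(nn.Module):
    """Conditional discriminator: the same label information is appended to the image."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(nn.Linear(784 + n_classes, 256), nn.LeakyReLU(0.2),
                                 nn.Linear(256, 1))
    def forward(self, x, y):
        return self.net(torch.cat([x, self.embed(y)], dim=1))

z = torch.randn(4, z_dim)
y = torch.randint(0, n_classes, (4,))
fake = CondGenerator()(z, y)
score = CondDiscriminator()(fake, y)
print(fake.shape, score.shape)
```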
5.2.3 Adversarial autoencoders
The adversarial autoencoder [15] is a probabilistic autoencoder that uses a GAN to perform variational inference by matching the aggregated posterior of the autoencoder's hidden code vector with an arbitrary prior distribution. Autoencoders [16] work in a way similar to feed-forward neural networks and use the concepts of unsupervised learning. The autoencoder's main task is to encode information about the input in the middle of the architecture and then reconstruct that information as faithfully as possible at the output. As shown in Fig. 5.10, the first part of the network encodes the information up to the middle layer and is therefore known as the encoder. The middle layer of the architecture is termed the encoded vector (code). The layers that follow the middle layer are termed the decoder; they reconstruct the information contained in the code. The input data received by the autoencoder are thus compressed into the middle layer, which is essential because it holds a reduced-dimension representation of the data.
Fig. 5.10 A simple autoencoder, consisting of an encoder, a code layer, and a decoder [16].
Makhzani et al. [15] propose a variation of the GANs called the adversarial autoencoder (AAE) that converts an autoencoder into a generative model. The job of the autoencoder here is to generate new random data from the given input data. The only difference between the GANs and the AAEs is that the latter control the encoder output with the assistance of a prior distribution. The encoded vector comprises a mean value and a standard deviation and, along with this, a prior distribution is imposed on it. The decoder, in turn, can map the imposed prior distribution to the data distribution with the help of the deep generative model. The prior can be any distribution, for example, a normal (Gaussian) distribution, a gamma distribution, and so on.
Fig. 5.11 A simple adversarial autoencoder: a discriminator receives either a sample from the prior (real) or the encoder's code (fake) [16].
The prominent concept is to push the distribution of the encoded values toward the prior distribution.
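The following minimal PyTorch sketch illustrates this adversarial regularization of the code: a standard autoencoder is trained for reconstruction while a small discriminator on the latent space pushes the encoder's codes toward a Gaussian prior. The network sizes and the choice of prior are assumptions used only for illustration.

```python
import torch
import torch.nn as nn

LATENT = 8

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, LATENT))
decoder = nn.Sequential(nn.Linear(LATENT, 256), nn.ReLU(), nn.Linear(256, 784), nn.Sigmoid())
# The discriminator sees latent codes, not images: it must tell codes drawn from
# the prior apart from codes produced by the encoder.
latent_disc = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

x = torch.rand(32, 784)                     # a batch of flattened images
z_fake = encoder(x)                         # samples of the aggregated posterior
z_real = torch.randn(32, LATENT)            # samples from the imposed prior (Gaussian here)

recon_loss = nn.functional.mse_loss(decoder(z_fake), x)          # autoencoder objective
bce = nn.functional.binary_cross_entropy
d_loss = bce(latent_disc(z_real), torch.ones(32, 1)) + \
         bce(latent_disc(z_fake.detach()), torch.zeros(32, 1))   # train the critic on codes
g_loss = bce(latent_disc(z_fake), torch.ones(32, 1))             # push q(z) toward p(z)
print(recon_loss.item(), d_loss.item(), g_loss.item())
```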
Therefore, the decoder performs the mapping from the prior distribution to the data distribution. Fig. 5.11 demonstrates the simple AAE: the standard autoencoder in the top row generates the image x from the latent code z, while the second network in the bottom row discriminates whether a sample comes from the user-specified prior distribution or from the hidden code of the autoencoder. Makhzani et al. [15] show that the AAE attains competitive test likelihoods on the Toronto Face Dataset [17] and on real-valued MNIST. The proposed method can also be applied to semisupervised scenarios, where it obtains excellent semisupervised classification performance on the SVHN and MNIST datasets. The AAEs find applications in dimensionality reduction, data visualization, disentangling the content and style of images, and unsupervised clustering.
5.2.4 Deep convolution GANs
The GANs, as discussed in the earlier sections, consist of two primary networks, the generator and the discriminator, that carry out different tasks. To make the GANs more powerful and able to handle more complex applications, both the generator and the discriminator are built from convolutional neural network layers. This structure is known as the deep convolution GAN. The concept of the deep convolution GANs (DCGANs) was introduced by Radford et al. [18] in 2015, who succeeded in bringing the ConvNet idea into the GANs. Incorporating ConvNets into GANs makes the DCGAN one of the most suitable candidates for implementing unsupervised learning. Many attempts were made to integrate CNNs with GANs to improve performance. The approach used by Radford et al. [18] uses a family of architectures to train the model on a large number of datasets and allows training of higher-resolution and deeper networks. The DCGANs [18] were implemented through the following three approaches:
• The first step is the concept given by Springenberg et al. [19], which replaces max-pooling with strided convolutions so that the network learns its own downsampling.
• The second step is to eliminate all fully connected layers on top of the convolutional features. This follows Mordvintsev et al. [20], where global average pooling is applied in image classification models; global average pooling gives ample stability to the model.
• The third and last step is to apply batch normalization [21]. It helps stabilize the learning process by normalizing each unit to zero mean and unit variance. This stabilization solves training issues caused by poor initialization and helps gradients flow in deeper models. In this way, it helps the generators begin learning and prevents them from collapsing to a single point.
Fig. 5.12 demonstrates the implementation of the DCGAN generator in detail. A 100-dimensional uniform distribution Z is projected to a small spatial-extent convolutional representation with many feature maps. This high-level representation is then converted to a 64×64-pixel image through a series of four fractionally strided convolutions. No fully connected layers are used in this architecture.
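A minimal PyTorch sketch of such a generator is shown below. It follows the project-and-reshape plus fractionally strided convolution recipe of Fig. 5.12, but the exact channel widths and the use of a transposed convolution for the projection step are illustrative assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn

class DCGANGenerator(nn.Module):
    """Project a 100-d noise vector to a 4x4 feature map, then repeated
    fractionally strided convolutions with batch normalization produce a
    64x64 RGB image; no fully connected or pooling layers are used."""
    def __init__(self, z_dim: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, 1024, 4, 1, 0, bias=False),  # 1x1 -> 4x4
            nn.BatchNorm2d(1024), nn.ReLU(True),
            nn.ConvTranspose2d(1024, 512, 4, 2, 1, bias=False),    # 4x4 -> 8x8
            nn.BatchNorm2d(512), nn.ReLU(True),
            nn.ConvTranspose2d(512, 256, 4, 2, 1, bias=False),     # 8x8 -> 16x16
            nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1, bias=False),     # 16x16 -> 32x32
            nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 3, 4, 2, 1, bias=False),       # 32x32 -> 64x64
            nn.Tanh(),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z.view(z.size(0), -1, 1, 1))

print(DCGANGenerator()(torch.randn(2, 100)).shape)  # torch.Size([2, 3, 64, 64])
```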
In another work, Durall et al. [22] discussed a method to handle the stabilization problem that occurs in the training phase of the GANs. A new framework called OC-GAN (Octave-GAN), which uses octave convolutions, is proposed in this work. It reduces the problem of mode collapse in existing GANs and generates images of higher quality. The method is tested on the CelebA dataset.
Fig. 5.12 DCGAN generator used for LSUN scene modeling: the 100-dimensional noise z is projected and reshaped to a 4×4×1024 feature map and passed through four stride-2 convolutions (CONV 1 to CONV 4) to produce the 64×64×3 output G(z) [23].
5.2.5 StackGANs
A StackGAN consists of two stacks, referred to as stage-1 and stage-2. The function of the stage-1 GAN is to produce low-resolution images based on the description given by the user. Such images contain only rough sketches and basic colors and act as a low-resolution preview. After stage-1 generates these images, they are passed into stage-2, where high-resolution images that appear more realistic are generated from them. The image-generation process is driven by instructions given in the form of text or a text embedding. The stage-2 network adds all the relevant details specified by the text instructions and thus produces images at proper resolutions that are very close to realistic images. The working of the StackGAN can be compared with that of a painter. For a complex painting, a painter first draws edges, rough sketches, and lines to prepare an overview of the image. In the next stage, the painter fills in the relevant colors, adds more specific details, and shapes the artwork; it is in this second stage that the painter gives a realistic look to the picture. Similarly, stage-1 produces low-resolution images from the given text description, and stage-2, which works on the stage-1 output, tries to recover the details that stage-1 missed and adds further information to the images. The support of the model distribution generated from a roughly aligned low-resolution image has a better probability of intersecting with the support of the image distribution [24]. Fig. 5.13 depicts the architecture of the StackGAN. As discussed earlier, it is composed of two stages, with a generator and a discriminator for each stage. At each level the StackGAN consists of a text encoder, a conditioning augmentation network, a generator network, a discriminator network, and an embedding compressor network. As is clear from Fig. 5.13, the stage-1 GAN generates low-resolution images of size 64×64, and the stage-2 GAN then takes these images as inputs and applies conditioning augmentation to them to generate high-resolution images of size 256×256. Fig. 5.14 displays examples of images generated by StackGAN [24] from the text descriptions given to the system; here the StackGAN is applied to the Oxford flowers dataset [25]. Fig. 5.15 displays examples of images generated by StackGAN [24] from text descriptions of rooms on the COCO dataset [26].
Fig. 5.13 StackGAN architecture: stage-1 takes the given text as input and produces low-resolution images by sketching a rough shape; stage-2 then generates more detailed high-resolution images by correcting the defects [24].
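As a small illustration of the conditioning augmentation network mentioned above, the PyTorch sketch below draws a conditioning vector from a Gaussian whose mean and variance are predicted from the text embedding. The embedding and conditioning dimensions are assumptions, not the values used in [24].

```python
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    """Turns a text embedding into a smoothed conditioning vector by sampling
    c = mu + sigma * eps, so the generators see slightly different conditions
    for the same sentence."""
    def __init__(self, embed_dim: int = 1024, cond_dim: int = 128):
        super().__init__()
        self.fc = nn.Linear(embed_dim, cond_dim * 2)   # predicts mu and log-variance

    def forward(self, text_embedding: torch.Tensor):
        mu, logvar = self.fc(text_embedding).chunk(2, dim=1)
        c = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)        # reparameterization
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # smoothness regularizer
        return c, kl

c, kl = ConditioningAugmentation()(torch.randn(4, 1024))
print(c.shape, kl.item())   # torch.Size([4, 128]) and a scalar regularizer
```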
Fig. 5.14 Text-to-image generation using StackGAN: for each text description of a flower (e.g., "this flower has a lot of small purple petals in a dome-like configuration"), the 64×64 GAN-INT-CLS output is compared with the 256×256 StackGAN output [24].
Fig. 5.15 Results on the COCO dataset [26] using StackGAN [24].
Thus, the method of [24] performs much better than the other methods in this domain and produces high-resolution images that are remarkably close to realistic images.
5.2.6 CycleGANs
CycleGAN [27] is one of the models used for image-to-image translation and is built on the GAN architecture. It is an enhancement of the GAN model that simultaneously trains two generator and two discriminator models. In this model, two domains of images are formulated. The CycleGAN [28] is shown in a simplified manner in Figs. 5.16 and 5.17. The first generator is fed images from the first domain and outputs images belonging to the second domain; conversely, the second generator takes images from the second domain as input and outputs images belonging to the first domain. A discriminator model checks how believable the images coming from each generator are, and the generator models are fine-tuned accordingly. The model described so far can check the correctness of the images generated for each domain, but that alone is not sufficient for translating images. Therefore, for the purpose of image-to-image translation, the CycleGAN has an add-on extension called cycle consistency. In this, the output of the first generator is fed to the second generator's input, and the output produced by the second generator is matched against the initial image fed to the first generator. Likewise, the reverse operation also holds: the second generator's output can serve as input to the first generator, and the result produced should be the same as the input originally fed to the second generator. The cycle consistency is used as a regularization measure for the generator models that helps the image-to-image translation process.
Fig. 5.16 Flow A-B-A starts from an input in domain A [28].
Fig. 5.17 Flow B-A-B starts from an input in domain B [28].
The CycleGAN model can be explained with the help of an example where the aim is to translate images of winter landscapes into images of summer landscapes. The two seasons clearly produce different images of the same landscape. So, in this case, one image domain will consist of winter landscape images and the other of summer landscape images, as depicted in Fig. 5.18. The CycleGAN architecture contains two GANs, and each GAN has a discriminator and a generator model, so there are four models in total. The system therefore has two GAN generators: one takes images of the winter landscape and generates images of the summer landscape, while the other takes images of the summer landscape and generates images of the winter landscape. The discriminator models then check whether both generators are producing images as intended; based on the discriminators' judgment, the generators are trained further to obtain the exact translation.
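A minimal PyTorch sketch of the cycle-consistency term for this winter/summer example is given below. The toy one-layer "generators" and the weight lam are placeholders for illustration only; real CycleGAN generators are deep convolutional networks.

```python
import torch
import torch.nn as nn

def cycle_consistency_loss(G_ab: nn.Module, G_ba: nn.Module,
                           real_a: torch.Tensor, real_b: torch.Tensor,
                           lam: float = 10.0) -> torch.Tensor:
    """L1 penalty forcing A -> B -> A and B -> A -> B to reproduce the inputs."""
    l1 = nn.L1Loss()
    loss_a = l1(G_ba(G_ab(real_a)), real_a)   # winter -> summer -> winter
    loss_b = l1(G_ab(G_ba(real_b)), real_b)   # summer -> winter -> summer
    return lam * (loss_a + loss_b)

# Toy generators standing in for the real convolutional ones (assumption).
G_ab = nn.Conv2d(3, 3, 3, padding=1)   # "winter to summer"
G_ba = nn.Conv2d(3, 3, 3, padding=1)   # "summer to winter"
winter = torch.rand(2, 3, 64, 64)
summer = torch.rand(2, 3, 64, 64)
print(cycle_consistency_loss(G_ab, G_ba, winter, summer).item())
```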
The CycleGANs can be used in varied domains such as style transfer, object transfiguration, season transfer, generating photographs from paintings, and photograph enhancement.
5.2.7 Wasserstein GANs
The idea of the WGAN, or Wasserstein GAN, was given by Arjovsky et al. [30]. It can be described as an augmentation of the existing GAN architecture. The main aim of the WGAN is to improve the stability of training and also to provide a loss function that reflects the quality of the images generated by the model. The WGAN uses an approach that better approximates the distribution of the data provided in the training dataset. The WGAN proposes using a critic in place of the discriminator; the critic judges the realness or fakeness of a given image through the score it assigns rather than through a probability.
Fig. 5.18 Example of CycleGAN for summer-to-winter translation [29].
The whole theory behind the WGAN is based on mathematical distance measures: the generator must minimize the distance between the distribution of the data observed in the training dataset and the distribution of the generated examples. The paper by Arjovsky et al. [30] discusses various distribution distance measures, such as the Jensen-Shannon (JS) divergence [31], the Kullback-Leibler (KL) divergence [32], and the Wasserstein distance (Earth-Mover [EM] distance). The usefulness of each distance depends on how it behaves under convergence of sequences of probability distributions. It was shown that the WGAN can train the generator more effectively using the properties of the Wasserstein distance than with the other distribution distances.
Fig. 5.19 A simple WGAN architecture [34].
Fig. 5.19 depicts a simple WGAN architecture. The concept of the WGAN revolves around the fact that the Wasserstein distance is continuous and differentiable, which means that the critic can be trained until it reaches an optimal value: the longer the critic is trained, the more reliable the Wasserstein gradient it provides, thanks to the differentiable nature of the Wasserstein distance. In the case of the JS divergence, by contrast, a well-trained critic saturates locally, the true gradient becomes zero, and vanishing gradients are obtained. The critic in the WGAN does not saturate; it provides a clean gradient because it reduces to an approximately linear function, whereas a standard discriminator may learn to tell real from fake quickly and then supply almost no reliable gradient information. The most crucial advantage of using the WGAN is that it makes the training process stable and far less sensitive to the choice of hyperparameter configurations. The WGAN aims to decrease the critic's loss, and a lower loss corresponds to better quality of the generated images. The WGAN thus keeps driving the generator loss lower, whereas other GANs try to reach an equilibrium between the generative and discriminative models.
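The PyTorch sketch below shows the WGAN critic and generator losses together with the weight clipping used in [30] to keep the critic approximately Lipschitz. The tiny fully connected networks and the clipping threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
generator = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())

real = torch.rand(64, 784)
fake = generator(torch.randn(64, 100))

# The critic maximizes the score gap between real and generated samples, which
# approximates the Wasserstein (Earth-Mover) distance; no sigmoid is used.
critic_loss = critic(fake.detach()).mean() - critic(real).mean()
generator_loss = -critic(fake).mean()

# The original WGAN keeps the critic roughly 1-Lipschitz by clipping its weights
# (later work replaces this with a gradient penalty [4]).
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-0.01, 0.01)
print(critic_loss.item(), generator_loss.item())
```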
The applications of the WGANs include the simulation of isolated electromagnetic showers in a realistic setup of a multilayer sampling calorimeter [33]. Similarly, one critical step in the analysis of medical images is the structure-preserving denoising of 3D magnetic resonance imaging (MRI) images; Ran et al. [35] presented a residual encoder-decoder Wasserstein generative adversarial network (RED-WGAN) for MRI denoising. The next section discusses some of the open issues and research gaps in the domain of GAN applications.
5.3 Discussion on research gaps
There are many open problems and research gaps where the GANs can be applied to achieve better results as compared to the traditional machine learning approach. The work by Barua et al. [6] emphasizes using the fully connected GANs for unsupervised training, so the effect of using fully connected convolutional (FCC) GANs on semisupervised training can be studied. Similarly, the inclusion of the CGANs could improve the results, and complex networks such as ResNet [11] can be examined with the help of the FCC-GANs. Zhao et al. [36] propose a method using adversarially regularized autoencoders for training deep latent variable models on simple discrete structures, such as short sentences and binary digits. Therefore, scope remains to apply the training to more complex structures such as documents and old manuscripts. Balabka [37] proposes a model using AAEs to recognize human activity with the help of semisupervised learning; the semisupervised learning exploits the unlabeled data through the AAE training. An open challenge in the domain of the AAEs is to explore the many hyperparameters that can be tuned to improve the performance of the model aggressively. Ruiz-Garcia et al. [38] suggested a new method that uses a generative adversarial stacked autoencoder, which helps in mapping facial expressions to an illumination-invariant facial representation. The open research problem in this domain includes developing a method that can handle the scenario in which the multipose datasets are unlabeled. Lu et al. [39] discuss a deep learning-based method (DA-DCGAN) for practical domain-shifting DC series arc fault detection in photovoltaic systems, where the GAN serves to generate dummy domain-shifted arc data. But the problem of implementing the GANs on application-specific integrated circuits at low cost and with improved reliability remains open. Padala et al. [40] proposed studying the effect of varying the input noise applied to the GANs; the method's inference is that the noise makes a remarkable contribution to the generation of the images. The gap in this study is the lack of a theoretical analysis relating the high-dimensional data to the low-dimensional noise distribution. Kim [41] proposes a new variation of the GANs called Bool GAN, applied to a dataset containing images of cars and taking the model proposed by Radford et al. [18] as its baseline. The inclusion of dropout and convolution layers improves the efficiency of the model. The open issue in this study is to perform more experiments on adding layers to find the optimum hyperparameters and the scheduling of the learning rate; such a study may give new dimensions to the performance of the model. Durall et al.
[22] proposed the use of Octave GANs and state that Bayesian optimization can be explored as future work. Cheng et al. [42] presented a novel method called SeqAttnGAN for creating images through interactive image-editing software. The method is implemented on two benchmark datasets, DeepFashion-Seq and Zap-Seq, whose images are paired with proper textual descriptions, and it gives excellent results compared with the baseline methods. Its future work is to create human faces with an interactive image editor and to explore the generation of consistent image sequences from given attributes and other factors. Vougioukas et al. [43] proposed a novel method to generate video signals from speech, achieved by applying temporal GANs. The performance of the method is evaluated on the GRID [44] and TCD-TIMIT [45] datasets. The method can produce videos with proper facial expressions, including blinking; the open problem is to capture the mood and gestures of the speaker and to show them through the facial expression. Zhao et al. [46] suggested a novel idea based on the CGANs to retrieve the lost and missing information from solar observation images. This information is missing because the images become overexposed during the solar observation process due to violent solar bursts. The idea uses CGANs and integrates an edge mass loss, a masked L1 loss, and an adversarial loss, and the model is trained on a new dataset of overexposed images. The work is still open to the problem of images that are highly textured and have large overexposed areas. Zhu et al. [47] proposed a novel method that implements the CGANs to solve the issue of producing multiple outputs of an image-to-image translation from a single input. The mapping ambiguity is resolved by randomly sampling a low-dimensional latent vector, and the generator learns to map the input, together with the latent code, to a possible output. Thus, the method encourages bijective consistency between the output modes and the latent encoding. Future work for this method is to produce image-to-image translations controlled by different user parameters along with meaningful attributes. Nataraj et al. [48] discussed a model for detecting fake images generated by the GANs. It combines deep learning with co-occurrence matrices of the pixels, which are computed on the different color channels; a deep convolutional neural network is then trained on them to discriminate real images from the fake images generated by the GAN. The future work in this domain is to make use of the locations of the pixels manipulated in the GAN-generated fake images and to rectify them. Liu et al. [49] used the idea of coupled GANs to perform image-to-image translation in an unsupervised setting, the aim being to use information about the images in two different domains and perform the translation. The open issue that needs to be addressed here is to prevent instability of the training caused by the saddle-point searching problem.
Along with that, there is the issue of removing the system's unimodal limitation, which stems from the assumption of a Gaussian latent space. Other open problems in this domain include the need for automatic metrics for judging the performance of the different types of generative networks and the need to consider the nondeterministic training losses for future prediction. The next section discusses some of the applications that can be solved with the assistance of the GANs.
5.4 GAN applications
There are many areas in which GANs can be applied to achieve remarkable results, and these are as follows:
• Generation of images: Image generation is one of the prominent areas where GANs are applied, giving researchers ample datasets with which to carry out different experiments. The generated images are realistic in nature. The image-generation process starts from some sample images, based on which the GAN can generate a large number of new images with the help of the generator and discriminator; the newly generated images differ from the existing sample images. Image generation is used extensively in animation, social media, marketing, the entertainment world, and the generation of logos in the digital world.
• Synthesis of images using text: The most exciting feature of the GANs is the synthesis of images from a text description. Such applications are used in the entertainment industry: with the help of a text (story), an animation character together with its gestures can be created.
• Aging of face: GANs, and in particular their CGAN variant, can be used to predict faces at targeted ages. The GAN architecture can create and predict people's faces at different ages, so such a system can be useful to companies for face verification of their employees. It works on the principle of semisupervised learning for age progression, and datasets with face images labeled with ages are available in the public domain for experimental purposes.
• Image-to-image translation: With the help of the generator and discriminator, images can be translated into other images. Images taken at night can be translated into daytime images; drawings and sketches can be translated into beautiful paintings; aerial images can be translated into satellite images; and images of zebras can be converted into horses. CGANs can be applied to synthesize photos from label maps as described by Nayak [14] and to create and fill colors in images from edge maps as discussed by Wang et al. [50] and Isola et al. [51].
• Synthesis of video: Video synthesis can also be performed with the assistance of the GANs. They take less time to create videos than manual or real-time production would. In this manner, this property of GANs can encourage animation creators to make optimum use of the new technology to develop and promote their videos in less time and close to the real world. GANs can also be used to predict the future frames of any video sequence, as mentioned by Villegas et al. [52].
• Generating high-quality images: The GANs allow converting low-resolution images taken by ordinary cameras into high-resolution, high-quality images.
Thus, it helps to observe minute details of the images that cannot be seen in the low-resolution versions.
• Missing part generation of images: The GAN network can be used to generate the missing parts of partially degraded images and thus recover the original images.
• Generating shadow maps: Nguyen et al. [53] apply a conditional (sensitivity) parameter to the generator of a CGAN to parameterize the loss of the trained shadow detector, which is more efficient than other GANs.
• Speech enhancement: Phan et al. [54] propose two architectures, ISEGAN and DSEGAN, for speech enhancement. The main motive behind speech enhancement is to remove the unnecessary background and irrelevant noises that create problems in speech recognition. Speech enhancement will further help in cochlear implants, hearing aids, and communication systems. Therefore, GANs play an important role in the domain of speech recognition by enhancing the speech samples.
• Fault diagnosis: The GAN system can be implemented in the detection and diagnosis of the DC arc faults that occur in photovoltaic systems, as described by Lu et al. [39]. The source and target domain data are available during operation in the field, but the fault data are not; therefore, GANs can be used to generate the dummy fault data.
5.5 Conclusion
This chapter mainly covers the introduction to the GANs, their need, and the detailed architecture of the various models, that is, fully connected GANs, CGANs, AAEs, deep convolution GANs, StackGANs, CycleGANs, and Wasserstein GANs. The advantages and disadvantages of the models are listed. The chapter also focuses on the various research gaps identified in the different GAN architectures; these research gaps should encourage students and scholars in this domain to contribute to the development of GAN algorithms. Various applications fall into the area of GANs and are listed in the last section of the chapter. These applications, if solved using GAN-based approaches, will provide better results as compared to the traditional machine learning approaches. The chapter also covers various examples of image-to-image translation described by researchers. In recent years, GANs have emerged as one of the novel methodologies for generating data from the rough information given. They are considered a robust and powerful class of neural networks for unsupervised learning. With the GAN idea, large image datasets can be created that are very close to real images, which satisfies researchers' need for data to implement their models. Along with the great advantage of generating huge numbers of images, GANs have the limitation that they generate better results only when the input data maps into the learned subspace; unseen data that is not mapped correctly may give poor results. Another problem associated with the GANs is mode collapse, in which the generator always produces output from a small part of the space. Similarly, GANs at certain stages are challenging to converge during the training process. The machines and resources required to implement GAN training models must have an exceptionally high configuration and are expensive; GAN training and implementation require extensive use of GPUs along with CPUs.
The need for memory for accounting the large data is also an issue in the GANs. The researchers in this domain can work on the complex problems of artificial intelligence by implementing the advanced version of the GANs. It will enhance the capabilities of the machines and provides the human race with a new solution to existing problems in the different areas of science and engineering. References [1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, 2014, pp. 2672–2680. [2] A. Mittal, Generative Adversarial Networks (GAN), 2020. https://codeburst.io/generativeadversarial-networks-gan-3c8978ba99a6. [3] J. Hui, GAN some cool applications of GAN, Medium (2020). https://medium.com/@jonathan_hui/ gan-some-cool-applications-of-gans-4c9ecca35900. [4] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A.C. Courville, Improved training of Wasserstein GANs, in: Advances in Neural Information Processing Systems, 2017, pp. 5767–5777. [5] T. Miyato, T. Kataoka, M. Koyama, Y. Yoshida, Spectral normalization for generative adversarial networks, in: International Conference on Learning Representations, 2018. [6] S. Barua, S. Monazam Erfani, J. Bailey, FCC-GAN: a fully connected and convolutional net architecture for GANs, arXiv E-Prints, arXiv-1905 (2019). [7] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324. [8] A. Krizhevsky, Learning Multiple Layers of Features From Tiny Images (Master’s thesis), Department of Computer Science, University of Toronto, 2009. [9] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, A.Y. Ng, Reading digits in natural images with unsupervised feature learning, NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011. 121 122 Generative adversarial networks for image-to-Image translation [10] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3730–3738. [11] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778. [12] M. Mirza, S. Osindero, Conditional generative adversarial nets, arXiv preprint arXiv:1411.1784 (2014). [13] I. Goodfellow, M. Mirza, A. Courville, Y. Bengio, Multi-prediction deep Boltzmann machines, in: Advances in Neural Information Processing Systems, 2013, pp. 548–556. [14] M. Nayak, An introduction to conditional GANs (CGANs), Medium (2019). https://medium.com/ datadriveninvestor/an-introduction-to-conditional-gans-cgans-727d1f5bb011. [15] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, B. Frey, Adversarial autoencoders, arXiv Preprint arXiv:1511 (2015). [16] C. Rubiks, Introduction to Adversarial Autoencoders, 2019. https://rubikscode.net/2019/01/14/ introduction-to-adversarial-autoencoders/. [17] J. Susskind, A. Anderson, G.E. Hinton, The Toronto face dataset, Technical Report UTML TR 2010001, U. Toronto, 2010. tech. rep. [18] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv preprint arXiv:1511.06434 (2015). [19] J.T. Springenberg, A. Dosovitskiy, T. Brox, M. Riedmiller, Striving for simplicity: the all convolutional net, arXiv preprint arXiv:1412.6806 (2014). [20] A. Mordvintsev, C. Olah, M. 
Tyka, Inceptionism: going deeper into neural networks, 2015. https:// research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html. [21] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in: International Conference on Machine Learning, PMLR, 2015, pp. 448–456. [22] R. Durall, F.-J. Pfreundt, J. Keuper, Stabilizing GANs with octave convolutions, arXiv preprint arXiv:1905.12534 (2019). [23] C. Shorten, DCGANs (Deep Convolutional Generative Adversarial Networks), Medium (2019). https://towardsdatascience.com/dcgans-deep-convolutional-generative-adversarial-networksc7f392c2c8f8. [24] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, D.N. Metaxas, StackGAN: text to photorealistic image synthesis with stacked generative adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5907–5915. [25] M.-E. Nilsback, A. Zisserman, Automated flower classification over a large number of classes, in: 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, IEEE, 2008, pp. 722–729. [26] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft coco: common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp. 740–755. [27] Y. Gan, J. Gong, M. Ye, Y. Qian, K. Liu, S. Zhang, GANs with multiple constraints for image translation, Complexity 2018 (2018) 1–27, https://doi.org/10.1155/2018/4613935. [28] H. Bansal, A. Rathore, Understanding and implementing CycleGAN in tensorflow, 2017. [29] J.-Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232. [30] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein generative adversarial networks, in: International Conference on Machine Learning, PMLR, 2017, pp. 214–223. [31] A.P. Majtey, P.W. Lamberti, D.P. Prato, Jensen-Shannon divergence as a measure of distinguishability between mixed quantum states, Phys. Rev. A 72 (5) (2005) 052310. [32] D.A. Klein, S. Frintrop, Center-surround divergence of feature statistics for salient object detection, in: 2011 International Conference on Computer Vision, IEEE, 2011, pp. 2214–2219. [33] M. Erdmann, J. Glombitza, T. Quast, Precise simulation of electromagnetic calorimeter showers using a Wasserstein Generative Adversarial Network, Comput. Software Big Sci. 3 (1) (2019) 4. [34] V. Chandak, P. Saxena, M. Pattanaik, G. Kaushal, Semantic image completion and enhancement using deep learning, in: 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT), IEEE, 2019, pp. 1–6. A review of the techniques of images using GAN [35] M. Ran, J. Hu, Y. Chen, H. Chen, H. Sun, J. Zhou, Y. Zhang, Denoising of 3D magnetic resonance images using a residual encoder-decoder Wasserstein generative adversarial network, Med. Image Anal. 55 (2019) 165–180. [36] J. Zhao, Y. Kim, K. Zhang, A. Rush, Y. LeCun, Adversarially regularized autoencoders, in: International Conference on Machine Learning, 2018, pp. 5902–5911. [37] D. Balabka, Semi-supervised learning for human activity recognition using adversarial autoencoders, in: Adjunct Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers, 2019, pp. 685–688. [38] A. Ruiz-Garcia, V. 
Palade, M. Elshaw, M. Awad, Generative adversarial stacked autoencoders for facial pose normalization and emotion recognition, in: 2020 International Joint Conference on Neural Networks (IJCNN), IEEE, 2020, pp. 1–8. [39] S. Lu, T. Sirojan, B.T. Phung, D. Zhang, E. Ambikairajah, DA-DCGAN: an effective methodology for DC series arc fault diagnosis in photovoltaic systems, IEEE Access 7 (2019) 45831–45840. [40] M. Padala, D. Das, S. Gujar, Effect of input noise dimension in GANs, arXiv preprint arXiv:2004.06882 (2020). [41] D.H. Kim, Deep convolutional GANs for car image generation, arXiv preprint arXiv:2006.14380 (2020). [42] Y. Cheng, Z. Gan, Y. Li, J. Liu, J. Gao, Sequential attention GAN for interactive image editing, in: Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 4383–4391. [43] K. Vougioukas, S. Petridis, M. Pantic, End-to-end speech-driven realistic facial animation with temporal GANs, in: CVPR Workshops, 2019, pp. 37–40. [44] M. Cooke, J. Barker, S. Cunningham, X. Shao, An audio-visual corpus for speech perception and automatic speech recognition, J. Acoust. Soc. Am. 120 (5) (2006) 2421–2424. [45] N. Harte, E. Gillen, TCD-TIMIT: an audio-visual corpus of continuous speech, IEEE Trans. Multimedia 17 (5) (2015) 603–615. [46] D. Zhao, L. Xu, L. Chen, Y. Yan, L.-Y. Duan, Mask-Pix2Pix network for overexposure region recovery of solar image, Adv. Astron. 2019 (2019). [47] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A.A. Efros, O. Wang, E. Shechtman, Toward multimodal image-to-image translation, in: Advances in Neural Information Processing Systems, 2017, pp. 465–476. [48] L. Nataraj, T.M. Mohammed, B.S. Manjunath, S. Chandrasekaran, A. Flenner, J.H. Bappy, A.K. RoyChowdhury, Detecting GAN generated fake images using co-occurrence matrices, Electron. Imaging 2019 (5) (2019). 532-1–532-7. [49] M.-Y. Liu, T. Breuel, J. Kautz, Unsupervised image-to-image translation networks, in: Advances in Neural Information Processing Systems, 2017, pp. 700–708. [50] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, B. Catanzaro, High-resolution image synthesis and semantic manipulation with conditional GANs, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8798–8807. [51] P. Isola, J.-Y. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134. [52] R. Villegas, J. Yang, S. Hong, X. Lin, H. Lee, Decomposing motion and content for natural video sequence prediction, in: 5th International Conference on Learning Representations, ICLR 2017, International Conference on Learning Representations, ICLR, 2017. [53] V. Nguyen, T.F. Yago Vicente, M. Zhao, M. Hoai, D. Samaras, Shadow detection with conditional generative adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4510–4518. [54] H. Phan, I.V. McLoughlin, L. Pham, O.Y. Chen, P. Koch, M. De Vos, A. Mertins, Improving GANs for speech enhancement, IEEE Signal Process. Lett. 27 (2020) 1700–1704. 
CHAPTER 6
A review of techniques to detect the GAN-generated fake images
Tanvi Arora (a) and Rituraj Soni (b)
(a) Department of CSE, CGC College of Engineering, Landran, Mohali, Punjab, India
(b) Department of CSE, Engineering College Bikaner, Bikaner, Rajasthan, India
6.1 Introduction
The generative adversarial network (GAN) is an artificial intelligence-based technique built on the deep learning modalities of the machine learning paradigm. It is an unsupervised learning technique. The GANs were initially created in 2014 to generate new data points from existing data points. In them, two competing neural networks are made to work against each other to improve their quality. The working principle of the GANs is best described by the example of a generator that is producing some output and a tester that is testing the generated output for its authenticity. The tester knows what is correct, so, based on the tester's feedback, the generator keeps improving its output. The generator is like a blind man who improves his results based on the rejection or acceptance of his output. The GANs are used for generative modeling, that is, a model is used to create new instances from preexisting instances, such as the creation of new images that are quite similar to, yet still different from, the already existing images. The GAN-based models work like a game in which both players try to trick each other and ultimately solve the puzzle. If the GAN-based methods are properly trained, they can be effectively used to create new data items matching the specifications of the items in the training set. The GAN is composed of two contending neural networks that work against each other in a competing mode to investigate, capture, and duplicate the variations in the dataset. The GANs are composed of three distinct components:
• Generative: aimed at learning the generative model that describes how the data are generated in terms of a probabilistic model.
• Adversarial: the adversarial-setting-based training of the model is carried out by this unit.
• Networks: the training of the model is carried out using deep learning-based artificial intelligence methods.
The working of the GANs is shown in Fig. 6.1, which has two components, the generator and the discriminator.
Fig. 6.1 Simple architecture of GANs (https://www.geeksforgeeks.org/generative-adversarial-networkgan/).
The task of the generator component is to create self-made illustrations of the data, which may be images, audio, or videos, and the discriminator component tries to classify the input given to it as either real data or a self-made illustration. In the GANs, the generator and the discriminator are both neural networks, and they compete against each other in the training phase. The training of the generator and discriminator is carried out over several iterations, and with each subsequent iteration the capabilities of both components are enhanced: the generator learns to generate better illustrative samples, and the discriminator gets more proficient at judging the illustrations as fake samples.
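A minimal sketch of one such training iteration is given below in PyTorch. The tiny fully connected generator and discriminator and the flattened-image shapes are assumptions used only to illustrate the alternating updates.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.rand(32, 784)                     # stand-in for a batch of real images
ones, zeros = torch.ones(32, 1), torch.zeros(32, 1)

# 1) Discriminator step: label real samples 1 and generated samples 0.
fake = G(torch.randn(32, 64))
d_loss = bce(D(real), ones) + bce(D(fake.detach()), zeros)
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# 2) Generator step: try to make the discriminator output 1 for generated samples.
fake = G(torch.randn(32, 64))
g_loss = bce(D(fake), ones)
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
print(round(d_loss.item(), 3), round(g_loss.item(), 3))
```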
In short, the GANs are based on a minimax game, where the discriminator tries to maximize its gains (its success in telling real from fake), while the generator tries to minimize the discriminator's gains, that is, to maximize its losses. Although GANs emerged only a few years ago, a large number of variants have appeared over a short span of time. The most commonly used variants of the GANs are:
1. Vanilla GAN: The basic multilayer perceptron-based generator and discriminator neural networks are used here; it is one of the simplest implementations of the GAN, and it optimizes its objective using the stochastic gradient descent approach.
2. Conditional GAN (CGAN): This GAN implementation is based on setting up a condition-based parameter in the deep learning approach. The extra condition parameter is given to the generator component of the GAN to generate the output data. The discriminator is fed input data that has labels associated with it to help it distinguish between the actual data and the morphed fake data.
3. Deep convolutional GAN (DCGAN): This implementation of the GAN uses convolutional neural networks in place of the multilayer perceptrons, but these CNNs do not contain max-pooling layers, which are substituted with strided convolutions, and the layers of the CNN are not fully connected. Over the years, this has become the most widely used as well as the most promising implementation of the GANs.
4. Laplacian pyramid GAN (LAPGAN): This implementation of the GANs is mainly used for producing superior-quality images. Initially the image is downsampled, and then it is upsampled again until it returns to its original size, an approach that introduces noise into the images. This implementation has a large number of generator and discriminator modules along with the distinct levels of the Laplacian pyramid.
5. Super-resolution GAN (SRGAN): This implementation aims at producing high-resolution images; it enhances low-resolution images while taking care that the upscaling does not introduce noise or errors into the images. This implementation is made up of deep neural and adversarial networks.
The GAN-based techniques can be used in a large number of applications, such as:
• to augment an image dataset by creating more synthetic images
• to create different facial expressions
• to create real-looking images
• to create images of cartoons
• to create images based on a text description
• to create emojis from images
• to create edited photographs
• to improve the resolution of images
• to transform the clothing in images
• 3D object creation
• completing incomplete images
• to synthesize videos
These are just a few of the applications of GANs, but recently they have created a lot of excitement as one of the most fascinating applications of current AI advancements, and it is hoped that many more exciting applications will become available in the near future. The GANs have many advantages associated with them; the first and foremost is that they are unsupervised learning models with self-learning capability: they do not require labeled data and can learn from the data itself. Moreover, the GAN-based methods can be used to produce data that is as good as the real data.
The GANs have the capability not only to generate numeric or alphanumeric data but also to generate multimedia data, that is, images and videos, that are indistinguishable from and on par with the real data. Thus, GAN-generated images have diverse applications in the fields of marketing, gaming, mass-media advertisements, and other domains. The GANs not only learn from the data itself, but can also understand complex data, and they have a wide range of applications in machine learning. The GANs have gained a lot of hype in recent times, but they do have their limitations as well, and what they produce is only a product of virtual imagination. They depend on the training data: if the training data is not correct or of good quality, the GAN-based methods can fail. They cannot simply be used to create novel things; they can only reformulate things based on previous examples. The real strength of the GANs depends on the coordination of the generator and the discriminator components. Both need to be fine-tuned; the strength of the generator will be of no use if the discriminator is weak, and vice versa. They both need to work in synchronization to produce correct results. The content thus generated by the GANs is termed DeepFakes.
6.2 DeepFake
A new term, DeepFake, has emerged in the digital world. It is derived from the two terms deep learning and fake and is a new product created by artificial intelligence. In layman's terms, DeepFakes can be defined as false media, images, videos, or sounds, created using deep learning techniques. Deep learning is a branch of artificial intelligence in which networks with a large number of layers, working the way the neurons of the human brain work, can make informed decisions using a set of algorithms. The intelligence of the deep learning-based algorithms has created the fear of producing something that does not exist but mimics the real-world existence of the things it purports to depict. Fig. 6.2 shows an example of a DeepFake image created by morphing an original image. Thus, we can say that DeepFakes are morphed video clips, images, sounds, or other digital representations created using erudite artificial intelligence algorithms that fabricate media which gives the impression of being realistic. The DeepFakes emerged just a couple of years ago, but the technology has been refined many times and poses a great threat to public figures such as celebrities, political leaders, and technology leaders. The first incidence of a DeepFake caught the attention of the media in 2017, when video footage showing famous Bollywood figures was made viral; the footage was not real but an amalgamation of the face of the celebrity with the details of some other actor, created using the DeepFake technology. DeepFakes can be created from the large number of images that are available on the internet. Therefore, popular public figures, for whom large numbers of images and videos are readily available on the internet due to media coverage, are the ones who can be targeted at large to generate DeepFakes.
Fig. 6.2 Fake image generated using GAN (https://spectrum.ieee.org).
To reveal how the technology of deep learning can be abused, scientists at the University of Washington created and posted a DeepFake-generated video of President Barack Obama on social media; the scientists proved that they could make a fake video of President Obama say whatever they wanted. We can well imagine what harm this technology can do to the image of public figures and what a threat it can pose to the security of the world at large. Thus, fake news and DeepFakes can club together to taint authentic information and create misunderstandings and miscommunications supported by fake facts. The DeepFakes emerged just a few years ago, but much development has taken place and the technology is improving at a rapid rate. Scientists have developed methods that even allow them to edit the transcript of a video and alter the words spoken by the person whose video is being manipulated. In yet another work, researchers at Stanford University have developed methods that can not only manipulate facial expressions but can also produce three-dimensional head movements of the characters in a video, make them blink their eyes, or let them gaze at particular objects, all with the help of the GANs. These features can help the movie industry to easily dub movies into other languages, as all these things look unbelievably photorealistic. But the concern of the research fraternity lies with the counterpoint: what if these techniques are abused and used for illegitimate activities? Initially, it was believed that DeepFakes could only be created for celebrities or public figures for whom a large number of similar images are available in the public domain. But recent developments by Samsung's AI lab have created living portraits of Salvador Dali, Marilyn Monroe, and many more; moreover, they have created an image of a smiling Mona Lisa, and all of this has been achieved using just a limited number of photographs, as illustrated with examples in Fig. 6.3.
Fig. 6.3 Sample source and generated fake images (https://miro.medium.com).
The requirement of only a limited number of photographs has thus raised concern among ordinary people, who initially believed they were invulnerable to DeepFakes because not enough of their images were available to train the computer procedures that create DeepFakes. Having seen how realistic the images created by artificial intelligence are, the big question now is how to control and protect against the misuse of this technology, which has a large number of advantages associated with it but still carries a few threats in its usage. There are many questions that need to be pondered: should laws be passed to bind the social networking websites to detect DeepFakes and subsequently remove them? Moreover, should the intention behind creating a DeepFake be given any consideration while removing DeepFake images, and, apart from that, can DeepFakes be differentiated based on whether they were created for entertainment or for perniciousness?
6.3 DeepFake challenges
The rapid development in the domain of artificial intelligence has posed a serious threat to the authenticity of the multimedia content that is being generated. The recent advancements in the field of deep learning that led to DeepFakes have further intensified the generation of misinformation, and it is believed that over the years this problem of misinformation supported by fake multimedia content will intensify further. As the technology develops, the approaches used to generate automated fake multimedia content will improve and it will become more challenging to discriminate between original and fake content. DeepFake-based multimedia content poses a great many challenges; it can:
1. create distress during difficult situations
2. be a threat to the reputation of famous personalities
3. spread hatred and disrespect toward the innocent
4. cause loss of faith in digital content
5. turn into a deceiver's dividend, whereby a person can always deny his words by claiming the content is fake, even though it has been said or done by him
6. be used to create fake pornography and cause mental distress to the affected
7. allow fake images or even biometrics to be used for financial fraud
8. be used to create fake news and hoaxes and may cause social distress
9. be harmful to individuals or organizations
10. cause exploitation
11. lead to sabotage
12. misrepresent democratic discourse
13. manipulate elections
14. corrode belief in establishments
15. aggravate social separation
16. weaken public protection
17. discourage international relations
18. endanger national security
19. discourage journalism
20. lead to false allegations
These are only a few of the affected domains; in brief, DeepFakes are a forthcoming challenge for national security, individual privacy, and democracy. Therefore, there is an urgent need to restrain the spread of fake digital content and to devise methods that can detect fake content and destroy it before it spreads further (DeepFakes: A Looming Challenge for Privacy, Democracy, and National Security).
6.4 GAN-based techniques for generating DeepFake
There are mainly two broad ways in which the GANs can generate DeepFakes: one uses image-to-image translation and the other is text-to-image synthesis. In the following sections, the GAN-based techniques for generating DeepFakes are discussed.
6.4.1 Image-to-image translation
Image-to-image translation aims to convert one image into another image; the goal is to learn how the input image can be mapped to an output image. This technique can be used in a variety of ways, such as style transfer, image super-resolution, image inpainting, object transfiguration, transferring the season of an image, and image enhancement. The image-to-image translation method is also termed PIX2PIX translation for image generation; a sample is illustrated in Fig. 6.4. In this, the conditional GAN (CGAN) models are used. Image generation existed earlier as well, but in that case a separate model was required for each type of translation.
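A minimal sketch of such a conditional translation objective is given below in PyTorch: the discriminator sees (input, output) pairs and the generator is additionally tied to the paired target with an L1 term, in the spirit of the PIX2PIX approach. The toy networks and the weighting are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

def translation_loss(G: nn.Module, D: nn.Module,
                     source: torch.Tensor, target: torch.Tensor,
                     lam: float = 100.0):
    """Generator objective of a pix2pix-style translator: fool a discriminator
    that sees (source, output) pairs, plus an L1 term tying the output to the
    paired ground-truth target."""
    fake = G(source)
    pair = torch.cat([source, fake], dim=1)            # condition D on the input image
    adv = nn.functional.binary_cross_entropy(D(pair), torch.ones_like(D(pair)))
    l1 = nn.functional.l1_loss(fake, target)
    return adv + lam * l1

# Tiny stand-in networks (assumed shapes; real models are deep encoder-decoders).
G = nn.Conv2d(3, 3, 3, padding=1)
D = nn.Sequential(nn.Conv2d(6, 1, 3, padding=1), nn.Sigmoid())
src, tgt = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
print(translation_loss(G, D, src, tgt).item())
```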
The introduction of CycleGAN paved the way for a cycle-consistency loss that enables inverse conversion without loss of information. The cyclic technique does not require paired images for training; instead, it can train GAN networks on two distinct domains, learning the features of each domain so that one can be translated into the other seamlessly. Apart from CycleGAN, other models have been developed for image-to-image translation, such as BicycleGAN and StarGAN; they are discussed in the following sections.

6.4.1.1 StarGAN: Unified generative adversarial networks for multidomain image-to-image translation
StarGAN is a scalable technique for image-to-image translation that can handle several domains with a single model. Its unified architecture allows a single StarGAN network to be trained concurrently on multiple datasets from distinct domains [1]. It produces images of superior quality and is very flexible in translating input images into distinct target domains. In this work, the authors demonstrate the results by transforming the facial expressions of the input images.

6.4.1.2 Toward multimodal image-to-image translation
This method proposes the BicycleGAN model for image-to-image translation, which joins two distinct GAN models, a conditional variational autoencoder GAN and a conditional latent regressor GAN [2]. It harnesses the strengths of both approaches, and the combined model enforces the connection between the latent encoding and the output in both directions, which enables it to achieve superior results. The method has been compared against several state-of-the-art encoder-based methods and outperforms all of them.

Fig. 6.4 Sample images based on image-to-image translation (https://miro.medium.com).

6.4.1.3 U-GAT-IT: Unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation
This is an unsupervised image-to-image translation method that uses a novel attention module and a learnable normalization module and operates in an end-to-end way. The attention function acts as a guide, directing the method to pay more attention to the important regions that distinguish the source and target image domains, based on the attention map created by an auxiliary classifier [3]. The method is able to withstand geometric or shape changes in the target images. The authors also integrate an adaptive layer-instance normalization procedure that supports the attention-guided function in managing the extent of shape and texture modification, based on parameters acquired from the dataset during the learning phase.

6.4.1.4 Image-to-image translation with conditional adversarial networks
Conditional adversarial networks are the baseline method for image-to-image translation: they learn the mapping from the source image to the target image and also learn a loss function for training further images from this mapping [4]. Because of this, the method can apply the same loss formulation to a wide variety of images; a minimal sketch of such an objective is given below.
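The following is a minimal, hypothetical PyTorch sketch of the kind of objective described above, combining a conditional adversarial loss with an L1 reconstruction term in the spirit of the pix2pix formulation [4]; the tiny generator and discriminator are placeholders for illustration, not the architectures used in the original work.

```python
# Hypothetical sketch of a pix2pix-style objective: conditional GAN loss + L1 term.
# The networks below are deliberately tiny placeholders, not the original architectures.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)

class TinyDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        # Receives the source image concatenated with either the real or the fake target.
        self.net = nn.Sequential(
            nn.Conv2d(6, 16, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(16, 1, 3, padding=1),  # patch-wise real/fake logits
        )

    def forward(self, src, tgt):
        return self.net(torch.cat([src, tgt], dim=1))

def generator_loss(G, D, src, real_tgt, l1_weight=100.0):
    """Adversarial loss (fool D) plus L1 distance to the paired target image."""
    fake_tgt = G(src)
    logits = D(src, fake_tgt)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    rec = F.l1_loss(fake_tgt, real_tgt)
    return adv + l1_weight * rec

if __name__ == "__main__":
    G, D = TinyGenerator(), TinyDiscriminator()
    src = torch.randn(2, 3, 64, 64)       # stand-in source images
    real_tgt = torch.randn(2, 3, 64, 64)  # stand-in paired targets
    print(generator_loss(G, D, src, real_tgt).item())
```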
This method has associated software named PIX2PIX, which has been widely used by artists to experiment with the approach because of its ease of use and broad applicability.

6.4.1.5 Multichannel attention selection GAN with cascaded semantic guidance for cross-view image translation
The multichannel attention selection GAN with cascaded semantic guidance for cross-view image translation targets the translation of images with completely different views, which may suffer from a high degree of deformation, a quite challenging task. The proposed work carries out this task with very good precision: the system can create natural scene images from arbitrary viewpoints, guided by an input image of the desired scene along with a novel semantic map [5]. The method takes the semantic maps as input and is a two-step process. In the first step, the input image and the desired semantic map are fed into a cycled semantic-guided generator to create initial coarse results. In the second step, these coarse results are refined by the multichannel attention selection module.

6.4.1.6 Cross-view image synthesis using geometry-guided CGANs
In this work, the authors propose a cross-view image synthesis method based on geometry-guided CGANs. The approach preserves pixel information between the two viewpoints so that the generated output images retain a realistic appearance derived from the input image. To achieve this, a homography guides the mapping of images between the distinct views based on their overlapping regions, so that details of the input image are preserved [6]. To make the result realistic, the regions missing after the transformation are inpainted using GANs. Because geometric constraints are used, fine details can be added to the generated image, and the approach gives markedly better results for cross-view image generation than simple pixel-based generation methods.

6.4.1.7 Cross-view image synthesis using CGANs
In this work, the authors propose CGANs to generate cross-view images from natural scenes, translating aerial views to street views and street views to aerial views [7], which is a challenging task in computer vision. It becomes even more challenging when new images must be generated for a completely different view, because understanding and transforming image appearance and semantics across viewpoints is a nontrivial process. The work introduces two novel architectures, Crossview Fork and Crossview Sequential, capable of generating images at resolutions of 64 × 64 and 256 × 256. Crossview Fork uses one generator and one discriminator; its generator hallucinates the output image together with its semantic segmentation. Crossview Sequential uses two CGANs: the first unit creates the output image, which is fed into the second unit to generate the semantic segmentation map. To improve the results, feedback from the second unit is supplied to the first unit to raise the quality of the images; a minimal sketch of such a two-stage pipeline is given below.
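The sketch below is a simplified stand-in for the Crossview Sequential idea described above: a first conditional generator produces the target-view image and a second generator maps that image to a semantic segmentation map. The layer sizes and the class count are arbitrary assumptions made for the example, and the adversarial training and feedback path are omitted.

```python
# Simplified two-stage pipeline in the spirit of Crossview Sequential:
# stage 1 maps the source view to the target-view image,
# stage 2 maps that image to a semantic segmentation map.
import torch
import torch.nn as nn

class ImageGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, src_view):
        return self.net(src_view)

class SegmentationGenerator(nn.Module):
    def __init__(self, num_classes=8):  # class count is an arbitrary assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_classes, 3, padding=1),
        )

    def forward(self, image):
        return self.net(image)  # per-pixel class logits

if __name__ == "__main__":
    stage1, stage2 = ImageGenerator(), SegmentationGenerator()
    aerial = torch.randn(1, 3, 64, 64)   # stand-in aerial-view input
    street = stage1(aerial)              # synthesized street-view image
    seg_logits = stage2(street)          # semantic map of the synthesized view
    print(street.shape, seg_logits.shape)
```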
The proposed method works well for generating natural scene images through cross-view image-to-image translation.

6.4.1.8 WarpGAN: Automatic caricature generation
With improvements in GAN-based architectures, automatic caricature generation methods have been developed that can generate caricatures from an input face image. The WarpGAN architecture not only produces caricatures but can also transfer texture styles [8]. It works by automatically learning to predict a set of control points that are used to warp the image into a caricature, while preserving the identity of the original photograph. The caricatures generated by WarpGAN are quite similar to hand-drawn caricatures, but with the prominent facial features more exaggerated. This is possible because WarpGAN uses an identity-preserving adversarial loss that helps the discriminator differentiate between the images under study, and it also allows the generated caricatures to be customized by controlling the style and the degree of exaggeration in the output image.

6.4.1.9 CariGANs: Unpaired photo-to-caricature translation
CariGAN is the first architecture used for creating caricatures from an input image. It is based on a two-step process: in the first step, geometric exaggeration is carried out, and in the second step, the look-and-feel style is applied. Two distinct GAN models are used for these steps, CariGeoGAN and CariStyGAN [9]. CariGeoGAN carries out the geometric transformation from the input image to the target caricature, while CariStyGAN translates the look and feel of the caricature onto the input photo without changing the geometry. By breaking the task into two steps, the method handles cross-domain translation easily; the output images closely resemble hand-drawn caricatures, and the results can be controlled by tuning parameters that adjust the color and texture of the output.

6.4.1.10 Unpaired photo-to-caricature translation on faces in the wild
Unpaired photo-to-caricature translation on faces in the wild can transform an input photo into a caricature in distinct styles, and the same model can be used for other high-end image-to-image translation applications. The design uses a two-path approach to capture the overall structure and the local features required for the translation process [10]. Two discriminators are used, one coarse and one fine. The generator also incorporates an extra perceptual loss, in addition to the adversarial and cycle-consistency losses, to learn across the two distinct domains. The model can additionally learn different styles from supplementary noise given as input.

6.4.2 Text-to-image synthesis
GAN-based deep learning architectures have the unique ability to generate images from text descriptors. The system takes a text phrase as input, and the GAN model generates an image matching the description. A sample architecture is shown in Fig. 6.5, which demonstrates the image-creation pipeline based on the Reed et al. model.
This illustrative GAN-based model successfully converts text phrases into images, and the diagram shows how text strings fit into a sequential image generation model. In the generator network, the text input is processed by fully connected neural network layers and concatenated with random noise in the form of a vector z; in the discriminator network, the text input is likewise compressed by a fully connected model, then reshaped and concatenated with the image.

Fig. 6.5 Text-to-image synthesis process (https://www.oreilly.com).
Fig. 6.6 Sample text-to-image synthesis (https://cdn-images-1.medium.com).

Fig. 6.6 shows a sample of how text phrases can be converted into actual images of flowers. GAN-based models have been refined and fine-tuned to generate photorealistic images from text descriptors alone. The following sections discuss various state-of-the-art methods used to generate high-resolution images from textual information.

6.4.2.1 Generative adversarial text-to-image synthesis
Recent advances in artificial neural networks have given computer systems the power to transform text into pixels, thereby enabling the generation of images from text descriptions [11]. This has been made possible by the development of deep convolutional GAN networks. In the proposed work, the authors generate images of birds and flowers from detailed descriptions of their structure and features. Considerable effort went into creating an efficient GAN model and a training dataset that can produce images of birds and flowers from human-written text descriptors. Five distinct text descriptors were used along with the Caltech-UCSD dataset for birds and the Oxford-102 dataset for flowers.

6.4.2.2 StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks
This approach generates images from text descriptors using stacked generative adversarial networks (StackGAN) [12] to produce 256 × 256 images that mimic realistic photographs. Photorealism is achieved by refining sketches, decomposing the sketch-refinement process into subtasks. In the initial stage, a GAN draws the basic shapes and colors of the objects from the given text description, yielding a basic low-resolution image. In the second stage, the results of the first stage are combined with the text descriptors to generate more realistic images, correcting the defects of the first stage and yielding high-resolution photorealistic images. To further improve realism, the model adopts a conditioning augmentation technique that smooths and conditions the image generation to a great extent; a minimal sketch of this step is given below. The approach is therefore capable of producing high-resolution images of very good quality.
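As an illustration of the conditioning augmentation step mentioned above, the sketch below reparameterizes a text embedding into a Gaussian latent code that is concatenated with noise before being fed to the generator. The embedding and latent dimensions are assumptions made for the example, not the values used in StackGAN.

```python
# Minimal sketch of StackGAN-style conditioning augmentation:
# the text embedding parameterizes a Gaussian, and a sample from it
# is concatenated with random noise as the generator input.
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    def __init__(self, text_dim=256, cond_dim=64):  # dimensions are illustrative assumptions
        super().__init__()
        self.fc = nn.Linear(text_dim, cond_dim * 2)  # predicts mean and log-variance

    def forward(self, text_embedding):
        mu, logvar = self.fc(text_embedding).chunk(2, dim=1)
        eps = torch.randn_like(mu)
        c = mu + eps * torch.exp(0.5 * logvar)       # reparameterized sample
        # KL term keeps the conditioning manifold smooth around a unit Gaussian.
        kl = 0.5 * torch.mean(mu.pow(2) + logvar.exp() - 1.0 - logvar)
        return c, kl

if __name__ == "__main__":
    ca = ConditioningAugmentation()
    text_embedding = torch.randn(4, 256)   # stand-in sentence embeddings
    z = torch.randn(4, 100)                # random noise vector
    c, kl = ca(text_embedding)
    generator_input = torch.cat([c, z], dim=1)
    print(generator_input.shape, kl.item())
```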
6.4.2.3 MC-GAN: Multiconditional generative adversarial network for image synthesis
The proposed method generates an image from text descriptors when a background base image is already given, so that a new object can be created at a specified location. This approach extends text-to-image generation: new objects can now be added to preexisting images at locations specified by the text descriptors. This is made possible by the multiconditional generative adversarial network (MC-GAN) [13], which conditions on the background and the desired object simultaneously. The model employs a synthesis block that disentangles the object and the background during training, enabling MC-GAN to generate near-real images at a resolution of 128 × 128 by balancing the background information from the given base image with the foreground details derived from the text descriptors. The method smoothly blends the plausible orientation and layout of the object with the background image. Its excellent results stem from the MC-GAN model acting like a pixel-wise gating function that regulates how much evidence is taken from the background image, aided by the text descriptors of the new foreground object.

6.4.2.4 MirrorGAN: Learning text-to-image generation by redescription
The authors developed a three-stage method, named MirrorGAN [14], to generate images from text descriptors by redescription. The three modules are: STEM, a semantic text embedding module that generates word- and sentence-level embeddings; GLAM, a cascaded architecture that generates target images from coarse to fine scales, leveraging both local word attention and global sentence attention to progressively improve the diversity and semantic consistency of the produced images; and STREAM, which regenerates the text description from the image so that it aligns with the original semantics.

6.4.2.5 StackGAN++: Realistic image synthesis with stacked generative adversarial networks
This is an improvement over StackGAN: the approach takes the low-resolution images generated by StackGAN together with the text descriptions and creates high-resolution images that look as good as real photographs. The method is based on a multistage GAN architecture and is suitable for both conditional and unconditional GAN-based image generation [15]. The architecture contains multiple generators and discriminators organized in a tree structure, generating images of the same scene at multiple scales from the distinct branches of the tree. Training is more stable than in StackGAN because multiple distributions are approximated jointly. Conditioning augmentation is also integrated to improve the smoothness and diversity of the generated images.

6.4.2.6 Conditional image generation and manipulation for user-specified content
In this approach, the authors created a dataset named CelebTD-HQ that contains facial images and associated text descriptors.
The dataset was generated using a two-step pipeline: in the first step, a textStyleGAN model is trained on the text, and in the second step, the pretrained weights of that textStyleGAN model are used to carry out semantic manipulation of the facial images. The approach aims to learn semantic directions in the latent space [16]. The method is capable of producing conditional images based on semantic manipulation driven by the text descriptors.

6.4.2.7 Controllable text-to-image generation
This method generates high-quality images from natural-language text descriptors using controllable GANs. The generator is based on word-level spatial attention and channel-wise attention, so it can generate and manipulate subregions of the image corresponding to the relevant textual words [17]. The method also employs a supervisory feedback mechanism based on the text descriptors, which establishes a correlation between the text and the regions of the image, creating an efficient training scheme that can change particular visual attributes without disturbing the rest of the image content. The method can therefore generate and manipulate synthetic images from text-based descriptors; the main focus of the work is changing the category, color, and texture of images via text.

6.4.2.8 DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis
This approach tries to overcome the drawbacks of early text-to-image synthesis methods, which rely heavily on the quality of the initial base image and on how the descriptor contributes to different parts of the image. The authors use a dynamic memory-based GAN [18] to synthesize good-quality images from text descriptors. Fuzzy regions of the initial image are refined using the dynamic memory module, which has two gates: a memory writing gate and a response gate. The memory writing gate selects the textual information relevant to the base image content, which improves the quality of the images generated from the text descriptors, and the response gate combines the information retrieved from the dynamic memory with the image features. The method has been tested on the Caltech-UCSD Birds 200 and Microsoft Common Objects in Context datasets.

6.4.2.9 Object-driven text-to-image synthesis via adversarial training
The Obj-GAN method aims to generate realistic images by efficiently capturing object-level textual information, which is required for creating realistic scenes. The model consists of three components: an object-driven attentive image generator, an object-wise discriminator, and an object-driven attention mechanism [19]. The text descriptors and a pregenerated semantic layout are given as input to the image generator, which creates high-resolution synthetic images by iteratively refining coarse images into high-quality ones. At each iteration, the generator improves regions of the image by attending to the words associated with the bounding box of that region.
The role of the attention layer is to form class labels for the words used to query each region, and the discriminator checks all bounding boxes to validate that the generated objects are consistent with the pregenerated semantic layout of the image.

6.4.2.10 AttnGAN: Fine-grained text-to-image generation with attentional generative adversarial networks
The proposed method generates fine-grained synthetic images using attentional GANs [20], which apply an attention-driven, multistage refinement mechanism to produce photorealistic images from text descriptors. The method subdivides the image into subregions according to the text descriptors associated with them. It also deploys a deep attentional multimodal similarity model that measures how well the image matches the text and trains the generator using this dissimilarity. The method produces a more refined image after each stage and has been tested on the CUB and COCO datasets.

6.4.2.11 Cycle text-to-image GAN with BERT
In this work, the authors create images from image captions using attention GAN models, in which the models learn word-to-image-feature attention mappings. For fine-tuning, a cyclic design is used that can map the generated images back to the image captions [21]. The authors also integrate a pretrained BERT language model to encode the initial textual features of the image. The proposed model outperforms the standard attention GAN.

6.4.2.12 Dualattn-GAN: Text-to-image synthesis with dual attentional generative adversarial network
A text-to-image synthesis approach has been described using the Dual Attentional Generative Adversarial Network architecture. The authors use two attention mechanisms to improve local details and the overall structure by relating text-descriptor features to the corresponding regions of the image [22]. One attention module is textual and the other is visual: textual attention improves the interface between the text constructs and the visuals, while visual attention models the internal representation of the image along the spatial and channel axes, helping to capture the global structure of the image. An attention embedding module is used to fuse features from multiple paths. Training of the GAN is stabilized using spectral normalization, and the capability of the CNNs is improved using an inverted-residual structure.

Within just a few years, GAN-based models have been created that can generate fake images either by transforming existing images or by taking text descriptors as input. These models are quite helpful to researchers for generating training datasets for deep learning models, where large amounts of data are required; on the other hand, the same technology can be used for illegitimate purposes and thus poses a serious threat to society.
Therefore, methods need to be developed that can distinguish between real and synthetic images. In the following sections, artificial intelligence-based methods for detecting DeepFakes are discussed.

6.5 Artificial intelligence-based methods to detect DeepFakes
GAN-based architectures can produce photorealistic images that raise security concerns; they may be used to deceive others by spreading false news over social media and falsifying information, causing mental agony and revulsion. With further advances in GANs, the quality of these false images will improve substantially and may lead to even more serious issues. It is therefore crucial to devise methods that can distinguish between real images and GAN-generated false ones. Although GAN-generated images can easily fool individuals, they cannot escape computer-based, artificial intelligence-powered detectors, which are robust and not vulnerable to the biases humans are prone to. The following sections discuss the state-of-the-art methods developed so far to detect DeepFake images.

6.5.1 Can forensic detectors identify GAN-generated images?
This work investigates how to distinguish between real images and GAN-generated fake images by verifying the authenticity and originality of images with forensic detectors [23]. The authors use two approaches to detect fake images generated by GANs. The first approach is intrusive: the detector is built using the GAN architecture itself, so some functions of the GAN are used in the detector to recognize GAN-generated images. The second approach is nonintrusive, meaning no module of the GAN is available and the detector is built independently, without any input from the GAN that created the images. The authors evaluate three nonintrusive methods: inception scores, a face quality assessment method, and a trained VGG16 network model based on deep features. The intrusive approach detects fake images quite efficiently; among the nonintrusive approaches, the VGG-based approach detects fake images well when it has sufficient training data, but the results degrade when there is a mismatch between the training and test datasets.

6.5.2 Detection of deep network-generated images using disparities in color components
The proposed method detects fake images using disparities in the color components of the images [24]. DeepFake images generated by deep networks in the RGB color space, with no explicit constraints on the correlation among color components, are comparatively easy to distinguish from real images. The method captures the color-component statistics of fake images and distinguishes them from real images. The distinction is made using a compact and effective feature set validated on different binary classifiers, and the method works whether the generative models are known or unknown; a minimal sketch of such color-statistics features is given below.
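To make the idea of color-component features concrete, the sketch below computes a few simple cross-channel statistics in RGB and HSV space and feeds them to an SVM. This is only a toy illustration of the general approach, not the feature set proposed in Ref. [24].

```python
# Toy illustration of color-component features for real/fake classification:
# cross-channel correlations and saturation statistics, fed to an SVM.
import numpy as np
from skimage.color import rgb2hsv          # scikit-image
from sklearn.svm import SVC

def color_features(img_rgb):
    """img_rgb: HxWx3 float array in [0, 1]. Returns a small feature vector."""
    r, g, b = img_rgb[..., 0].ravel(), img_rgb[..., 1].ravel(), img_rgb[..., 2].ravel()
    sat = rgb2hsv(img_rgb)[..., 1].ravel()
    return np.array([
        np.corrcoef(r, g)[0, 1],            # inter-channel correlations
        np.corrcoef(r, b)[0, 1],
        np.corrcoef(g, b)[0, 1],
        sat.mean(), sat.std(),              # saturation statistics
        (sat > 0.95).mean(),                # fraction of highly saturated pixels
    ])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in data: random "images" labeled 0 (real) or 1 (fake).
    images = rng.random((40, 32, 32, 3))
    labels = rng.integers(0, 2, size=40)
    X = np.stack([color_features(im) for im in images])
    clf = SVC(kernel="rbf").fit(X, labels)
    print(clf.predict(X[:5]))
```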
6.5.3 Detecting and simulating artifacts in GAN fake images
Classifying images as fake or real is challenging because a training dataset is usually unavailable and the model used by the attacker to generate the fake images is also not readily accessible. In this approach, the authors therefore simulate the fake-image generation process using an AutoGAN model, which reproduces the artifacts of the most common GAN pipelines; they also locate the artifacts introduced by the upsampling operation during fake-image generation [25]. In doing so, they discovered that these artifacts appear as replicated patterns in the frequency-domain spectrum, so they propose a spectrum-based classifier rather than a pixel-by-pixel classifier to distinguish fake from real images. The approach gives very good results in detecting CycleGAN-generated fake images.

6.5.4 Detecting GAN-generated fake images using cooccurrence matrices
In this work, the authors propose a combined approach based on deep learning and co-occurrence matrices to detect fake images generated by GANs. The co-occurrence matrices are computed on the color channels of the pixels, and a deep learning-based CNN model is then trained for classification [26]. The pixel-based co-occurrence matrices are passed directly to the deep learning model to classify real and fake images, and hence detect DeepFake images generated using GAN-based models. The method also gives good results when trained and tested on distinct datasets.

6.5.5 Detecting GAN-generated imagery using color cues
The proposed method distinguishes between fake and real images using color- and saturation-based forensic parameters. For the color-based forensics, the authors observe that GAN-generated images exhibit higher correlation between pixels in chromaticity space than real-world images. For the saturation-based forensics, the frequency of underexposed and saturated pixels is reduced because the generator component of the GAN performs a normalization step [27]. Real and GAN-generated images are then classified using an SVM classifier. The approach achieves an AUC of 70% on the NIST MFC2018 dataset.

6.5.6 Attributing fake images to GANs: Analyzing fingerprints in generated images
This method classifies images as real or fake based on GAN-specific fingerprints embedded in GAN-generated images. The authors also identify which GAN network generated a fake image, since each GAN network leaves a different fingerprint, and even a small difference in GAN training changes the fingerprint of the generated image [28]. A learning-based attribution network is used to map an input image to its corresponding fingerprint image. For each GAN model, a model fingerprint is estimated, and the image fingerprint and model fingerprints are then used to discriminate between real images and GAN-generated images; a simplified correlation-based sketch of this fingerprint idea is given below.
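The sketch below is a greatly simplified, correlation-based stand-in for the fingerprint idea: each image's high-frequency residual is treated as its fingerprint, per-model fingerprints are averaged from training images, and a test image is attributed to the model whose fingerprint correlates best. The method of Ref. [28] learns the fingerprints with an attribution network rather than using fixed residuals, so this is only illustrative.

```python
# Simplified fingerprint attribution: residual = image minus a smoothed copy,
# per-model fingerprints are mean residuals, attribution is by correlation.
# Illustrative stand-in only, not the learned attribution network of Ref. [28].
import numpy as np
from scipy.ndimage import uniform_filter

def residual(img):
    """High-frequency residual of a grayscale image (HxW float array)."""
    return img - uniform_filter(img, size=3)

def model_fingerprints(images_by_model):
    """images_by_model: dict model_name -> list of HxW arrays."""
    return {name: np.mean([residual(im) for im in ims], axis=0)
            for name, ims in images_by_model.items()}

def attribute(img, fingerprints):
    """Return the model whose fingerprint correlates best with the image residual."""
    r = residual(img).ravel()
    scores = {name: np.corrcoef(r, fp.ravel())[0, 1] for name, fp in fingerprints.items()}
    return max(scores, key=scores.get)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    train = {"gan_a": [rng.random((32, 32)) for _ in range(10)],
             "gan_b": [rng.random((32, 32)) for _ in range(10)]}
    fps = model_fingerprints(train)
    test_image = rng.random((32, 32))
    print(attribute(test_image, fps))
```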
The proposed method achieves 99.5% accuracy on the CelebA dataset with fake images generated by distinct GAN models such as ProGAN, SNGAN, CramerGAN, and MMDGAN.

6.5.7 FakeSpotter: A simple baseline for spotting AI-synthesized fake faces
In this work, the authors propose FakeSpotter, which detects fake images by analyzing the behavior of neurons: it has been observed that for fake images the layer-wise activation patterns differ, and this can act as a strong feature for detecting artificially generated images. The neuron behavior is captured for both real and fake images, and an SVM classifier is then trained to separate them [29]. The method achieves an accuracy of 84.7% based on the FaceNet model and has been tested on the CelebA-HQ and FFHQ datasets.

6.5.8 Incremental learning for the detection and classification of GAN-generated images
In this work, the authors propose a method to detect previously unseen fake images. They use a detection model based on multitask incremental learning that can both locate and classify GAN-produced fake images. Classifiers are placed at different positions following the iCarl algorithm to support incremental learning; the two models used are a multitask multiclassifier and a multitask single classifier [30]. The model has been tested on five distinct GAN models, namely CycleGAN, StyleGAN, StarGAN, ProGAN, and Glow, using the XceptionNet model for detecting the GAN-generated fake images.

6.5.9 Unmasking DeepFakes with simple features
DeepFake generation methods have progressed to the point where an image can be generated even from a text description alone. However, GAN-based methods leave artifacts in the fake images that may be missed by the human eye; artificial intelligence-based methods can catch these artifacts and readily discriminate between real and fake images. In this work, the authors capture frequency-domain features of the images and train a classifier to separate real from fake [31]. The main strength of the method is its ability to give very good results with a limited annotated training set, and it also works well with unsupervised classifiers. The method achieves 100% accuracy after being trained with only 20 annotated images; a minimal sketch of such frequency-domain features is given below.

6.5.10 DeepFake detection by analyzing convolutional traces
This approach analyzes the human faces present in DeepFake images, based on the fact that artificially generated images leave behind fingerprints that can be detected with forensic tools. The authors propose an expectation-maximization (EM) approach that extracts local features specific to the model used to generate the false images [32]. The method has been tested on the CelebA dataset and is able to detect fake images as well as identify the different network architectures used to create them.
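A minimal sketch of the frequency-domain feature described in Section 6.5.9 is given below: the 2D power spectrum of a grayscale image is azimuthally averaged into a 1D profile and used as the input to a simple classifier. It loosely follows the idea in Ref. [31]; the image size, number of radial bins, and choice of classifier are assumptions made for the example.

```python
# Sketch of frequency-domain features for fake-image detection:
# azimuthally averaged 2D power spectrum -> 1D profile -> logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

def radial_power_spectrum(gray, n_bins=32):
    """gray: HxW float array. Returns the azimuthally averaged log power spectrum."""
    spec = np.fft.fftshift(np.fft.fft2(gray))
    power = np.log1p(np.abs(spec) ** 2)
    h, w = gray.shape
    y, x = np.indices((h, w))
    r = np.hypot(y - h / 2, x - w / 2)
    bins = np.minimum((r / r.max() * n_bins).astype(int), n_bins - 1)
    profile = np.bincount(bins.ravel(), weights=power.ravel(), minlength=n_bins)
    counts = np.bincount(bins.ravel(), minlength=n_bins)
    return profile / np.maximum(counts, 1)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    # Stand-in data: random "images" labeled 0 (real) or 1 (fake).
    images = rng.random((40, 64, 64))
    labels = rng.integers(0, 2, size=40)
    X = np.stack([radial_power_spectrum(im) for im in images])
    clf = LogisticRegression(max_iter=1000).fit(X, labels)
    print(clf.score(X, labels))
```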
6.5.11 Face X-ray for more general face forgery detection
This work proposes a method to detect fake images by converting the input image into a gray-scale representation and using it to decide whether the image is real or forged. If the gray-scale image can be decomposed into two separately blended images, the image is revealed to be fake because it contains a blending boundary; otherwise it is categorized as real [33]. Most image manipulation methods blend the altered region into the background of the original image. The method does not perform well on low-resolution images, because in that case the evidence of blending is less pronounced and hence harder to detect.

6.5.12 DeepFake image detection based on pairwise learning
Detecting GAN-generated fake images is still a challenge; in this approach, the authors propose a deep learning method based on contrastive loss. Initially, the most recent GAN architectures are used to create pairs of fake and real images [34]. Next, a two-stream DenseNet model is deployed, into which the pairwise information is fed. A joint network is thus trained on the fake features via pairwise learning, allowing it to discern real from artificially created images based on those features. As a final step, a classifier is attached to the end of the fake-feature network to distinguish between artificially generated and real images.

6.6 Comparative study of artificial intelligence-based techniques to detect the face manipulation in GAN-generated fake images
GANs have ushered in a new era in which artificial images can be created by specifying the textual features of an image or by transforming and manipulating the pixels of existing digital images. Although GANs emerged only a few years ago, both fake-image generation and fake-image detection methods have developed rapidly. The previous section discussed various methods that can detect GAN-generated fake images. From the literature survey, it has been observed that most of the harm done by fake images is attributable to face manipulation. This section carries out a comparative study of the methods proposed for detecting fake images. For the comparative analysis, we consider four types of methods, corresponding to the following types of fake images: (1) construction of a new face, (2) swapping of the facial identity, (3) manipulation of facial features, and (4) manipulation of facial expressions. For each type of fake image, the comparison covers the features and the classifier used. Performance figures are not compared, because different researchers have used different performance measures and distinct datasets, so a fair comparison is not feasible.

6.6.1 Techniques for detecting the construction of a new face
The researchers in Ref. [35] investigated the working of GAN architectures to trace artifacts that can differentiate original images from synthetic ones.
The system was evaluated using color-based features, and classification was carried out with a linear SVM classifier. The method achieves an AUC of nearly 70% on the NIST MFC2018 dataset [36]. Yu et al. [37] discovered that each GAN-based architecture leaves a unique fingerprint in the synthetic images; they formulated a learning-based approach using an attribution network model that maps an input image to its corresponding fingerprint image. The approach derives a correlation index between the fingerprint of an image and the corresponding model fingerprint, which is then used to classify the images. The method has been tested on the CelebA dataset [38], which contains real images, together with fake images synthesized using the different GANs proposed in [39–42]; it is reported to achieve 99.5% accuracy. Although the system gives very good results, it fails when the images are blurred, compressed, noisy, or cropped. The authors of Ref. [43] inferred that observing neuron behavior can help detect synthetic faces, because the activation of neurons across different layers generates distinct patterns and captures features that reveal manipulated facial attributes; they applied different deep learning-based face recognition systems [44–46] to learn about real and fake faces. Based on the learned features, an SVM classifier was trained to discriminate between real and fake images. The work achieved an accuracy of 84.7% using the FaceNet model on the CelebA-HQ [39] and FFHQ [47] datasets of real images and on images generated by InterFaceGAN [48] and StyleGAN [47]. An analysis of distinct face manipulation approaches was presented by Stehouwer et al. [49], who showed that novel attention mechanisms give good results [50], as they guide the process and enhance the feature maps of CNN architectures. The method achieves 100% AUC and 0.1% EER on the real faces of the datasets in Refs. [38, 47, 51] and has been tested with synthetic images created using the GAN-based models of Refs. [39, 47]. A fake-face synthesis detection approach [52] has been proposed based on steganalysis and the statistics of real-world natural images, using a combination of pixel co-occurrence matrices and CNN-based deep learning models. The approach was initially validated on images created with CycleGAN [53] and has also been validated on fake images created with different GAN architectures. The same group applied the approach in their later work [54], where validation was carried out on the 100K-Face database and an EER of 7.2% was achieved. Different fake-face synthesis detection systems were assessed by Neves et al. [54] on several datasets; they concluded that if the experiments are performed under controlled conditions, EERs as low as 0.8% can be achieved, but if the detection experiments are performed in real-world scenarios, the performance of the systems degrades considerably. To test the methods in real-world scenarios, Marra et al. [55] performed experiments to detect previously unseen fake images.
They used a multitask incremental learning model and tried to find fake images generated by distinct GAN networks. The comparative analysis of the different techniques for detecting the construction of a new face is given in Table 6.1; it can be inferred that most work in this domain has used CNN classifiers, and most researchers have relied on image-related features to distinguish between real and fake images.

Table 6.1 Comparison of different techniques for detecting the construction of a new face.
Work | Features | Classifier
McCloskey and Albright [35] | Color related | SVM
Yu et al. [37] | GAN related | CNN
Kim et al. [43] | CNN neuron behavior | SVM
Stehouwer et al. [49] | Image related | CNN + attention mechanism
Nataraj et al. [56] | Steganalysis | CNN
Neves et al. [54] | Image related | CNN
Marra et al. [55] | Image related | CNN + incremental learning

6.6.2 Techniques for detecting the swapping of the facial identity
The first study on detecting face swapping was proposed by Zhou et al. [57], in which the authors used a two-stream network to detect face manipulation. They fused a face classification stream, a CNN based on GoogLeNet [58], with an SVM-based classification stream that used a triplet network trained on steganalysis features, measuring the triplet loss between patches of the images under consideration, to detect the swapping of facial identity. The SwapMe app was evaluated by Li et al. [59] to check the generalization capacity of a previously trained model for detecting face or identity swapping; this method turned out to be one of the most robust for detecting face swaps on the Celeb-DF dataset. Mesoscopic features of images were exploited in two different neural network models with different numbers of layers [60]. One model is a CNN comprising four convolutional layers and a fully connected layer (Meso-4); in the second model, Meso-4 is modified with a different inception module, as proposed by Szegedy et al. [58], and is named MesoInception-4. The method was initially tested on a self-created database for detecting fake images, attaining an accuracy of 98.4%; it was later tested on an unseen dataset [59] and proved robust on other datasets as well, including FaceForensics++. The vulnerabilities of recent face recognition approaches, namely VGG [44] and FaceNet [46], to DeepFakes were described by Korshunov and Marcel [61] using the DeepFakeTIMIT dataset. In addition, they evaluated the challenges associated with detecting fake digital content using baseline methods. They used a principal component analysis-based approach for feature reduction and an RNN with long short-term memory to discriminate between real and fake digital content, as proposed in Korshunov and Marcel [62]. They also used image quality measures [63] and raw faces as features for detecting fake images, employing a total of 129 features based on signal-to-noise ratio, specularity, blur, and the like.
PCA with LDA or SVM classifiers was used for classification, achieving an EER of 3.3% for low-quality (LQ) and 8.9% for high-quality (HQ) videos on the DeepFakeTIMIT dataset. DeepFakes are generally created by merging a synthetic face region with a real image, and doing so leaves artifacts that can be traced when the 3D head pose of the image is analyzed, as shown by Yang et al. [64]. To support this claim, they investigated the differences between head poses estimated from the complete set of facial landmarks (68 features were extracted) and from the central face region only. The extracted features were normalized, and an SVM classifier was used for classification; the method was tested on the UADFV dataset and achieved an AUC of 89%. In Li and Lyu [65], the same group extended this work to the detection of fake faces using face warping artifacts, employing CNNs to detect the artifacts. The system was trained using four different CNN variants, as proposed in Refs. [66, 67], and tested on the UADFV and DeepFakeTIMIT datasets with very good results. The authors of Ref. [51] analyzed face swapping approaches, evaluated them with distinct face-swap detection methods, and validated the results on the FaceForensics++ dataset. Their evaluation considered a CNN-based system using steganalysis features [68], a CNN with specially tuned layers that suppress image content [69], a CNN with a global pooling layer [70], the MesoInception-4 CNN [60], and a CNN based on XceptionNet [71] pretrained on the ImageNet dataset [72]. They concluded that the XceptionNet-based CNN [71] gave the best overall results. A fake-image detection method based on elementary visual features, such as eye color, missing details of the eyes, teeth, or reflections, which are normally present in natural images, was proposed by Matern et al. [73]. They used logistic regression and a multilayer perceptron [74] for classification and achieved an AUC of 85.1%. A fake-face detection method using a CNN and an attention mechanism was proposed by Stehouwer et al. [49]; it aims at improving the feature maps of the classifiers being used. The attention map can be inserted into any backbone network by adding a convolutional layer, and the method achieved an AUC of 99.43% and an EER of 3.1%. Recognizing the popularity and relevance of the topic, Facebook, which holds a huge database of images, launched the DeepFake Detection Challenge in collaboration with other organizations. The baseline results were provided using a CNN with six convolutional layers plus a fully connected layer, and an XceptionNet model trained on face crops and on full images; these baseline models achieve a precision of 93% with a recall of 8.4%.
The comparison of the different techniques for detecting the swapping of the facial identity is presented in Table 6.2. It can be inferred that most researchers have used image-related features with a CNN classifier, in some cases combining the CNN with other state-of-the-art classifiers.

Table 6.2 Comparison of different techniques for detecting the swapping of the facial identity.
Work | Features | Classifier
Zhou et al. [57] | Image-related steganalysis | CNN and SVM
Afchar et al. [60] | Mesoscopic level | CNN
Korshunov and Marcel [61] | Lip image-audio speech, image related | PCA + RNN; PCA + LDA, SVM
Güera and Delp [75] | Image + temporal information | CNN + RNN
Yang et al. [64] | Head pose estimation | SVM
Li and Lyu [65] | Face warping artifacts | CNN
Rössler et al. [51] | Image-related steganalysis | CNN
Matern et al. [73] | Visual artifacts | Logistic regression, MLP
Nguyen et al. [76] | Image related | Autoencoder
Stehouwer et al. [49] | Image related | CNN + attention mechanism
Dolhansky et al. [77] | Image related | CNN
Agarwal et al. [78] | Facial expressions and pose | SVM
Sabir et al. [79] | Image + temporal information | CNN + RNN

6.6.3 Techniques for detecting the manipulation of facial features
In the early days, manipulation of facial attributes was studied to check the robustness of facial recognition techniques, and the manipulations tested included cosmetic surgery, makeup, and occlusion of the face by external factors. With the advent of DeepFakes, interest in detecting images with manipulated facial attributes has revived. In Bharati et al. [80], a restricted Boltzmann machine-based approach was used to detect images containing manipulated facial features. The detection system was given patches of the face so that it could learn the distinct facial features and classify an image as authentic or as one with manipulated features. The system was validated using synthetic datasets generated from the ND-IIITD dataset [81] and a set of images of famous celebrities. The dataset images were manipulated on features such as the smile, eye color, lip shape, and skin texture. The system achieved accuracies of 96.2% and 87.1% on the celebrity dataset and the ND-IIITD dataset, respectively. Different variants of CNN architectures were evaluated by Tariq et al. [82] for detecting facial-attribute manipulation using the CelebA dataset [38] of real images; two distinct approaches were adopted to generate the fake images, one using the ProGAN [39] architecture and the other using Adobe Photoshop. The manipulations included cosmetic makeup, adding glasses, changing the hairstyle, and putting on hats. Images of two sizes were considered, 32 × 32 and 256 × 256. The GAN-generated images were detected with 99.99% AUC, whereas the Photoshop-generated images were detected with 74.9% AUC; the CNN model detects machine-generated fake images with very good accuracy but gives only average results for images created with Adobe Photoshop. An application named FakeSpotter was proposed by Kim et al. [43], based on the principle that neuron behavior varies across layers: the activation functions of neurons in different layers capture distinct features that reveal manipulated images. They used the face recognition systems proposed by Parkhi et al. [44], Amos et al. [45], and Schroff et al.
[46] to extract the features and then trained an SVM classifier to classify manipulated and original images. The method was tested using the datasets described in Karras et al. [39] and Karras et al. [47] for the original images, together with synthetic datasets generated with the InterFaceGAN and StyleGAN approaches, and achieved an accuracy of 84.7% with the FaceNet model. A facial-feature manipulation detection system was proposed by Jain et al. [83] using a CNN architecture with six convolutional layers and two fully connected layers, along with residual connections as proposed by He et al. [67]. The system is fed nonoverlapping patches of the face in order to learn the distinct facial features. Classification is carried out with an SVM classifier, and the model detects manipulated images with almost 100% accuracy on the datasets proposed by Bharati et al. [80] and on a StarGAN-generated [84] dataset trained on CelebA [38]. Attention mechanisms that enhance the feature maps of different CNN architectures were proposed by Stehouwer et al. [49]. They used FaceApp to create fake images with manipulated facial features, applying 28 distinct filters that change the hairstyle or skin color, add or remove a beard, and so on, together with fake images synthesized using the StarGAN model over a set of 40 distinct attributes. They tested the approach on the DFFD dataset and achieved an AUC of 99.9%. The authors of Ref. [85] collected real images and created synthetic images using the Adobe Photoshop tool Face Aware Liquify, with some manipulated images created by professional artists editing the facial features. Human subjects asked to classify the images as real or fake managed only about 50% accuracy, whereas a deep recurrent network applied to the same dataset detected the machine-generated fakes with 99.8% accuracy and the human-created fakes with 99.7% accuracy. A steganalysis-based method [56] detects fake images with 99.4% accuracy using StarGAN-generated [84] fake images with manipulated facial features and the real images of the Liu et al. [38] dataset (a minimal sketch of the co-occurrence features underlying such steganalysis appears after Table 6.3). Detection of fake images in the spectrum domain was carried out by Zhang et al. [86]: the RGB channels of the input image are subjected to a 2D DFT, and a frequency image is generated for each channel. Classification is performed using AutoGAN, which can create GAN-like artifacts without the aid of any trained GAN model. StarGAN [84] and GauGAN [87] were used for evaluation; StarGAN-generated images are detected in the frequency domain with 100% accuracy, whereas GauGAN-generated images are detected with only 50% accuracy. Table 6.3 compares the techniques proposed so far for detecting manipulated facial features. Most researchers have used CNN-based classifiers with image-related features to distinguish between real and fake images.

Table 6.3 Comparison of different techniques for detecting the manipulation of facial features.
Work | Features | Classifier
Bharati et al. [80] | Face patches | RBM
Tariq et al. [82] | Image related | CNN
Kim et al. [43] | CNN neuron behavior | SVM
Jain et al. [83] | Face patches | CNN + SVM
Stehouwer et al. [49] | Image related | CNN + attention mechanism
Wang et al. [85] | Image related | DRN
Nataraj et al. [56] | Steganalysis | CNN
Marra et al. [55] | Image related | CNN + incremental learning
Zhang et al. [86] | Frequency domain | GAN discriminator
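The following sketch shows one simple way to build the pixel co-occurrence matrices that steganalysis-style detectors such as Refs. [26, 56] feed to a CNN: for each color channel, horizontally adjacent pixel pairs are counted into a 256 × 256 matrix. The CNN itself is omitted, and restricting the counts to horizontal neighbors is a simplifying assumption made for illustration.

```python
# Sketch of pixel co-occurrence features: for each color channel, count
# horizontally adjacent pixel-value pairs into a 256x256 matrix. In detectors
# such as Refs. [26, 56] these matrices become the input tensor of a CNN.
import numpy as np

def cooccurrence_matrix(channel):
    """channel: HxW uint8 array. Returns a 256x256 matrix of horizontal pair frequencies."""
    left = channel[:, :-1].ravel().astype(np.int64)
    right = channel[:, 1:].ravel().astype(np.int64)
    mat = np.zeros((256, 256), dtype=np.float64)
    np.add.at(mat, (left, right), 1.0)      # accumulate pair counts
    return mat / mat.sum()                  # normalize to a joint distribution

def cooccurrence_tensor(img_rgb):
    """img_rgb: HxWx3 uint8 array. Returns a 3x256x256 stack, one matrix per channel."""
    return np.stack([cooccurrence_matrix(img_rgb[..., c]) for c in range(3)])

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # stand-in image
    tensor = cooccurrence_tensor(img)
    print(tensor.shape, tensor[0].sum())    # (3, 256, 256), each matrix sums to 1
```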
6.6.4 Techniques for detecting the manipulated facial expressions
Advances in technology have enabled computer software to change a speaker's speech along with the facial expressions [88]. In Stehouwer et al. [49], the researchers proposed a technique to detect this kind of facial manipulation using the DFFD dataset, achieving an AUC of 99.4%. It has been observed in Refs. [37, 51, 54] that results are good in controlled environments, but most of the methods fail in real scenarios; new methods therefore need to be explored that can work in real scenarios with variation in blur, noise, and compression. The second issue that needs to be addressed is the robustness of the proposed methods against unseen face manipulations: it has been observed in Refs. [55, 59] that the systems have very poor generalization capability and therefore fail to give good results in real-world scenarios. Images produced by GAN-based methods such as StyleGAN [47] can be detected with quite good accuracy; this accuracy is attributed to the fingerprints left as artifacts in GAN-generated fake images. Going further, one research group proposed eliminating the fingerprints generated by the GAN models so as to make GAN-generated fake images hard to detect [54], using autoencoders and degradation of the image quality, but this resulted in a loss of detection rates. Table 6.4 compares the different techniques for detecting manipulated facial expressions. Researchers have mostly used image-related features and CNN classifiers, evaluated on benchmark datasets, to check the performance of methods for detecting fake images generated with different GAN architectures, and most of the methods achieve high accuracy.

Table 6.4 Comparison of different techniques for detecting the manipulated facial expressions.
Work | Features | Classifier
Zhou et al. [57] | Mesoscopic level | CNN
Rössler et al. [51] | Image-related steganalysis | CNN
Matern et al. [73] | Visual artifacts | Logistic regression, MLP
Nguyen et al. [76] | Image related | Autoencoder
Stehouwer et al. [49] | Image related | CNN + attention mechanism
Sabir et al. [79] | Image + temporal information | CNN + RNN

6.7 Legal and ethical considerations
With the rapid development of artificial intelligence, a situation has emerged in which we can create images simply by describing their features in text, or manipulate any existing image. The images thus generated are as good as real images and are termed DeepFakes, because they are generally created using deep learning-based methods.
The DeepFake images can be quite innovative and can add significant value to the creative and educational domains, but this innovative methodology also carries a plethora of threats, which can have harmful social, political, and financial implications. The main concern is that DeepFakes are very hard to discern with the human eye, as they blur the line between original and fake images. Moreover, with the proliferation of digital media and digital platforms, DeepFake images can spread like wildfire across social platforms. Therefore, we need to address the legal and ethical implications attached to DeepFakes. To address this point, DeepFakes can be categorized into four different categories. Face swapping carried out to take revenge and defamation of public figures are the hard cases and can have serious legal and ethical complications. On the contrary, DeepFakes created for the illustration of creativity or for reducing recapturing have a social benefit associated with them; they belong to the lighter category and carry relatively few legal and ethical complications. The emergence of DeepFakes has posed a serious problem, for which we need to look at the cause of the problem and correct it, instead of merely treating the symptoms associated with it. The harms associated with the generated fake content are manifold, ranging from the spread of misinformation to the humiliation of victims and the propagation of fake news. The most challenging task is how to prevent the propagation of false information and save society at large from the direct implications of DeepFakes. Some countries have passed laws to control the implications of DeepFakes, under which the propagator is held responsible for posting a DeepFake on social media; this can act as a limiting factor that deters others from taking the same path. But harm and humiliation, once done, are hard to reverse. Hence, in the era of internet virality and the proliferation of social media, the spread of misinformation is beyond control, and social media platforms have developed into mediums for political and social discourse. To curb this menace, either laws should be framed to tackle the issues associated with DeepFakes, or the platforms through which DeepFakes spread like wildfire should be equipped with technologies that can ascertain the truthfulness of the content being posted and censor fake content, so that it never gets a platform from which to be launched, thereby controlling the implications associated with it. Table 6.5 illustrates the legalities associated with DeepFake images.

6.8 Conclusion and future scope
Every coin has two sides: GAN-based systems have a large number of applications, but a few of these applications can serve malicious purposes as well. It has already been witnessed how deep learning-based approaches have been harnessed by fraudsters to generate artificial intelligence-based synthetic images and even videos that can be used by criminals for carrying out scams, fraudulent activities, or to create fake images and even fake news.
Along the same lines, the computational intelligence of GANs can be harnessed by fraudsters to use GAN-generated images and videos for malicious activities; they can improve their artificial intelligence-based methods by generating synthetic images of the innocent individuals whom they have chosen to victimize.

Table 6.5 Legalities associated with DeepFake.
Face swapping. Cases: swapping the face of the victim with that of others in order to defame the victim. Benefits: the person who swaps the face takes revenge and gets satisfaction. Alarms: mental torture and humiliation of the victim. Affects: can have mental torture, abuse, and financial implications for the victim. Legalities: criminal proceedings can be initiated.
Defamation of public figures. Cases: creating images of events that never happened. Benefits: freedom of expression. Alarms: defames a public figure, distorts the reputation, and can even alter election results. Affects: destroys international relations, creates polarization, and erodes trust in organizations. Legalities: public and private lawsuits can be filed.
Reducing recapturing. Cases: dubbing the same video in multiple languages. Benefits: reducing the effort of repetitive tasks. Alarms: redundant data creation. Affects: may impact the IPR. Legalities: private lawsuits can be filed.
Creativity. Cases: creation of memes. Benefits: freedom of expression, creativity. Alarms: people may feel offended. Affects: may impact the IPR. Legalities: public and private lawsuits can be filed.

The research fraternity has put considerable effort into finding techniques to detect fake images. Most of the work has been carried out using CNN-based classifiers to discern between fake and real images, using image-related features. Although many innovations have taken place in the field of artificial intelligence, little importance has been given to the security risks that artificial intelligence-based innovations have posed or may pose in the years to come. In the endeavor to develop intelligent machines that mimic human-like traits and make the work of humans easier, not much importance has been given to the security, privacy, and other risks associated with these advancements.

References
[1] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, J. Choo, StarGAN: unified generative adversarial networks for multi-domain image-to-image translation, CoRR abs/1711.09020 (2017) 1–15. https://arxiv.org/pdf/1711.09020.pdf.
[2] J. Zhu, R. Zhang, D. Pathak, T. Darrell, A.A. Efros, O. Wang, E. Shechtman, Toward multimodal image-to-image translation, CoRR abs/1711.11586 (2017) 1–12. https://arxiv.org/pdf/1711.11586.pdf.
[3] J. Kim, M. Kim, H. Kang, K. Lee, U-GAT-IT: unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation, CoRR abs/1907.10830 (2019) 1–19. http://arxiv.org/abs/1907.10830.
[4] P. Isola, J. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, CoRR abs/1611.07004 (2016) 1–17. http://arxiv.org/abs/1611.07004.
[5] H. Tang, D. Xu, N. Sebe, Y. Wang, J.J. Corso, Y. Yan, Multi-channel attention selection GAN with cascaded semantic guidance for cross-view image translation, CoRR abs/1904.06807 (2019) 1–20. http://arxiv.org/abs/1904.06807.
[6] K. Regmi, A. Borji, Cross-view image synthesis using geometry-guided conditional GANs, CoRR abs/1808.05469 (2018) 1–11. http://arxiv.org/abs/1808.05469.
[7] K. Regmi, A.
Borji, Cross-view image synthesis using conditional GANs, CoRR abs/ 1803.03396 (2018) 1–10. http://arxiv.org/abs/1803.03396. [8] Y. Shi, D. Deb, A.K. Jain, WarpGAN: automatic caricature generation, CoRR abs/1811.10100 (2018) 1–15. http://arxiv.org/abs/1811.10100. [9] K. Cao, J. Liao, L. Yuan, CariGANs: unpaired photo-to-caricature translation, CoRR abs/ 1811.00222 (2018) 1–14. http://arxiv.org/abs/1811.00222. [10] Z. Zheng, H. Zheng, Z. Yu, Z. Gu, B. Zheng, Photo-to-caricature translation on faces in the wild, CoRR abs/1711.10735 (2017) 1–28. http://arxiv.org/abs/1711.10735. [11] S.E. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, H. Lee, Generative adversarial text to image synthesis, CoRR abs/1605.05396 (2016) 1–10. http://arxiv.org/abs/1605.05396. [12] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, D.N. Metaxas, StackGAN: text to photorealistic image synthesis with stacked generative adversarial networks, CoRR abs/1612.03242 (2016) 1–14. http://arxiv.org/abs/1612.03242. [13] H. Park, Y.J. Yoo, N. Kwak, MC-GAN: multi-conditional generative adversarial network for image synthesis, CoRR abs/1805.01123 (2018) 1–13. http://arxiv.org/abs/1805.01123. [14] T. Qiao, J. Zhang, D. Xu, D. Tao, MirrorGAN: learning text-to-image generation by redescription, CoRR abs/1903.05854 (2019) 1–10. http://arxiv.org/abs/1903.05854. [15] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, D.N. Metaxas, StackGAN++: realistic image synthesis with stacked generative adversarial networks, CoRR abs/1710.10916 (2017) 1–16. http:// arxiv.org/abs/1710.10916. [16] D. Stap, M. Bleeker, S. Ibrahimi, M. ter Hoeve, Conditional image generation and manipulation for user-specified content, ArXiv abs/2005.04909 (2020) 1–10. [17] B. Li, X. Qi, T. Lukasiewicz, P.H.S. Torr, Controllable text-to-image generation, in: NeurIPS2019, pp. 1–11. [18] M. Zhu, P. Pan, W. Chen, Y. Yang, DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis, CoRR abs/1904.01310 (2019) 1–9. http://arxiv.org/abs/1904.01310. [19] W. Li, P. Zhang, L. Zhang, Q. Huang, X. He, S. Lyu, J. Gao, Object-driven text-to-image synthesis via adversarial training, CoRR abs/1902.10740 (2019) 1–23. http://arxiv.org/abs/1902.10740. [20] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, X. He, AttnGAN: fine-grained text to image generation with attentional generative adversarial networks, CoRR abs/1711.10485 (2017) 1–9. http://arxiv.org/abs/1711.10485. [21] T. Tsue, S.K. Sen, J. Li, Cycle text-to-image GAN with BERT, ArXiv abs/2003.12137 (2020) 1–8. [22] Y. Cai, X. Wang, Z. Yu, F. Li, P. Xu, Y. Li, L. Li, Dualattn-GAN: text to image synthesis with dual attentional generative adversarial network, IEEE Access 7 (2019) 183706–183716. [23] H. Li, H. Chen, B. Li, S. Tan, Can forensic detectors identify GAN generated images? in: 2018 AsiaPacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC) 2018, pp. 722–727. A review of techniques to detect the GAN-generated fake images [24] H. Li, B. Li, S. Tan, J. Huang, Detection of deep network generated images using disparities in color components, CoRR abs/1808.07276 (2018) 1–26. http://arxiv.org/abs/1808.07276. [25] X. Zhang, S. Karaman, S. Chang, Detecting and simulating artifacts in GAN fake images, in: 2019 IEEE International Workshop on Information Forensics and Security (WIFS)2019, pp. 1–6. [26] L. Nataraj, T.M. Mohammed, B.S. Manjunath, S. Chandrasekaran, A. Flenner, J.H. Bappy, A.K. 
RoyChowdhury, Detecting GAN generated fake images using co-occurrence matrices, CoRR abs/ 1903.06836 (2019) 1–6. http://arxiv.org/abs/1903.06836. [27] S. McCloskey, M. Albright, Detecting GAN-generated imagery using color cues, CoRR abs/ 1812.08247 (2018) 1–7. http://arxiv.org/abs/1812.08247. [28] N. Yu, L. Davis, M. Fritz, Attributing fake images to GANs: analyzing fingerprints in generated images, CoRR abs/1811.08180 (2018) 1–41. http://arxiv.org/abs/1811.08180. [29] R. Wang, L. Ma, F. Juefei-Xu, X. Xie, J. Wang, Y. Liu, FakeSpotter: a simple baseline for spotting AI-synthesized fake faces, CoRR abs/1909.06122 (2019) 1–8. http://arxiv.org/abs/1909.06122. [30] F. Marra, C. Saltori, G. Boato, L. Verdoliva, Incremental learning for the detection and classification of GAN-generated images, CoRR abs/1910.01568 (2019) 1–6. http://arxiv.org/abs/1910.01568. [31] R. Durall, M. Keuper, F. Pfreundt, J. Keuper, Unmasking DeepFakes with simple features, CoRR abs/ 1911.00686 (2019) 1–8. http://arxiv.org/abs/1911.00686. [32] L. Guarnera, O. Giudice, S. Battiato, DeepFake detection by analyzing convolutional traces, CoRR abs/2004.10448 (2020) 1–10. https://arxiv.org/abs/2004.10448. [33] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, B. Guo, Face X-ray for more general face forgery detection, CoRR abs/1912.13458 (2019) 1–10.http://arxiv.org/abs/1912.13458. [34] C.-C. Hsu, Y.-X. Zhuang, C.-Y. Lee, Deep fake image detection based on pairwise learning. Appl. Sci. 10 (2020) 370, https://doi.org/10.3390/app10010370. [35] S. McCloskey, M. Albright, Detecting GAN-generated imagery using color cues, CoRR abs/ 1812.08247 (2018) 1–7. http://arxiv.org/abs/1812.08247. [36] H. Guan, M. Kozak, E. Robertson, Y. Lee, A.N. Yates, A. Delgado, D. Zhou, T. Kheyrkhah, J. Smith, J. Fiscus, MFC datasets: large-scale benchmark datasets for media forensic challenge evaluation, in: 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW)2019, pp. 63–72. [37] N. Yu, L. Davis, M. Fritz, Attributing fake images to GANs: analyzing fingerprints in generated images, CoRR abs/1811.08180 (2018) 1–41. http://arxiv.org/abs/1811.08180. [38] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, CoRR abs/ 1411.7766 (2014) 1–11. http://arxiv.org/abs/1411.7766. [39] T. Karras, T. Aila, S. Laine, J. Lehtinen, Progressive growing of GANs for improved quality, stability, and variation, CoRR abs/1710.10196 (2017) 1–26. http://arxiv.org/abs/1710.10196. [40] T. Miyato, T. Kataoka, M. Koyama, Y. Yoshida, Spectral normalization for generative adversarial networks, CoRR abs/1802.05957 (2018) 1–26. http://arxiv.org/abs/1802.05957. [41] M.G. Bellemare, I. Danihelka, W. Dabney, S. Mohamed, B. Lakshminarayanan, S. Hoyer, R. Munos, The Cramer distance as a solution to biased Wasserstein gradients, CoRR abs/1705.10743 (2017) 1–20. http://arxiv.org/abs/1705.10743. [42] M. Binkowski, D.J. Sutherland, M. Arbel, A. Gretton, Demystifying MMD GANs, ArXiv abs/ 1801.01401 (2018) 1–36. [43] R. Wang, L. Ma, F. Juefei-Xu, X. Xie, J. Xiang Wang, Y. Liu, FakeSpotter: a simple baseline for spotting AI-synthesized fake faces, ArXiv abs/1909.06122 (2019) 1–8. [44] O.M. Parkhi, A. Vedaldi, A. Zisserman, Deep face recognition, Deep Face Recogn. abs/ 1909.06122 (2015) 1–31. [45] B. Amos, B. Ludwiczuk, M. Satyanarayanan, OpenFace: a general-purpose face recognition library with mobile applications, CMU-CS-16-118, CMU School of Computer Science, 2016 Tech. Rep. [46] F. Schroff, D. Kalenichenko, J. 
Philbin, FaceNet: a unified embedding for face recognition and clustering, CoRR abs/1503.03832 (2015) 1–10. http://arxiv.org/abs/1503.03832. [47] T. Karras, S. Laine, T. Aila, A style-based generator architecture for generative adversarial networks, CoRR abs/1812.04948 (2018) 1–12. http://arxiv.org/abs/1812.04948. [48] Y. Shen, J. Gu, X. Tang, B. Zhou, Interpreting the latent space of GANs for semantic face editing, CoRR abs/1907.10786 (2019) 1–10. http://arxiv.org/abs/1907.10786. 157 158 Generative adversarial networks for image-to-image translation [49] J. Stehouwer, H. Dang, F. Liu, X. Liu, A. Jain, On the detection of digital face manipulation, ArXiv abs/1910.01717 (2019) 1–12. [50] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, CoRR abs/1706.03762 (2017) 1–15. http://arxiv.org/abs/1706.03762. [51] A. R€ ossler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, M. Nießner, FaceForensics++: learning to detect manipulated facial images, CoRR abs/1901.08971 (2019) 1–14. http://arxiv.org/abs/1901. 08971. [52] B. Biggio, P. Korshunov, T. Mensink, G. Patrini, D. Rao, A. Sadhu, Synthetic realities: deep learning for detecting audiovisual fakes, in: International Conference on Machine Learning2019 https://sites. google.com/view/audiovisualfakes-icml2019/ abs/1901.08971. [53] J. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, CoRR abs/1703.10593 (2017) 1–18. http://arxiv.org/abs/1703.10593. [54] J. Neves, R. Tolosana, R. Vera-Rodriguez, V. Lopes, H. Proença, Real or fake? Spoofing state-of-theart face synthesis detection systems, CoRR abs/1911.05351 (2019) 1–8. http://arxiv.org/abs/1911. 05351. [55] F. Marra, C. Saltori, G. Boato, L. Verdoliva, Incremental learning for the detection and classification of GAN-generated images, in: 2019 IEEE International Workshop on Information Forensics and Security (WIFS)2019, pp. 1–6. [56] L. Nataraj, T.M. Mohammed, B.S. Manjunath, S. Chandrasekaran, A. Flenner, J.H. Bappy, A.K. RoyChowdhury, Detecting GAN generated fake images using co-occurrence matrices, CoRR abs/ 1903.06836 (2019) 1–6. http://arxiv.org/abs/1903.06836. [57] P. Zhou, X. Han, V.I. Morariu, L.S. Davis, Two-stream neural networks for tampered face detection, CoRR abs/1803.11276 (2018) 1–9. http://arxiv.org/abs/1803.11276. [58] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S.E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, CoRR abs/1409.4842 (2014) 1–12. http://arxiv. org/abs/1409.4842. [59] Y. Li, X. Yang, P. Sun, H. Qi, S. Lyu, Celeb-DF: a large-scale challenging dataset for DeepFake forensics, arXiv CR (2020) 1–10. [60] D. Afchar, V. Nozick, J. Yamagishi, I. Echizen, MesoNet: a compact facial video forgery detection network, CoRR abs/1809.00888 (2018) 1–7. http://arxiv.org/abs/1809.00888. [61] P. Korshunov, S. Marcel, DeepFakes: a new threat to face recognition? Assessment and detection, CoRR abs/1812.08685 (2018) 1–5. http://arxiv.org/abs/1812.08685. [62] P. Korshunov, S. Marcel, Speaker inconsistency detection in tampered video, in: 2018 26th European Signal Processing Conference (EUSIPCO)2018, pp. 2375–2379. [63] J. Galbally, S. Marcel, J. Fierrez, Image quality assessment for fake biometric detection: application to iris, fingerprint, and face recognition, IEEE Trans. Image Process. 23 (2) (2014) 710–724. [64] X. Yang, Y. Li, S. 
Lyu, Exposing deep fakes using inconsistent head poses, in: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)2019, pp. 8261–8265. [65] Y. Li, S. Lyu, Exposing DeepFake videos by detecting face warping artifacts, CoRR abs/ 1811.00656 (2018) 1–7. http://arxiv.org/abs/1811.00656. [66] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv cs.CV, (2015) pp. 1–14. [67] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, CoRR abs/ 1512.03385 (2015) 1–12. http://arxiv.org/abs/1512.03385. [68] D. Cozzolino, G. Poggi, L. Verdoliva, Recasting residual-based local descriptors as convolutional neural networks: an application to image forgery detection, CoRR abs/1703.04615 (2017) 1–7. http:// arxiv.org/abs/1703.04615. [69] B. Bayar, M.C. Stamm, A deep learning approach to universal image manipulation detection using a new convolutional layer. in: Proceedings of the 4th ACM Workshop on Information Hiding and Multimedia SecurityAssociation for Computing Machinery, New York, NY, 2016, pp. 5–10, https://doi.org/10.1145/2909827.2930786. A review of techniques to detect the GAN-generated fake images [70] N. Rahmouni, V. Nozick, J. Yamagishi, I. Echizen, Distinguishing computer graphics from natural images using convolution neural networks, in: 2017 IEEE Workshop on Information Forensics and Security (WIFS)2017, pp. 1–6. [71] F. Chollet, Xception: deep learning with depthwise separable convolutions, CoRR abs/ 1610.02357 (2016) 1–8. http://arxiv.org/abs/1610.02357. [72] J. Deng, W. Dong, R. Socher, L. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition2009, pp. 248–255. [73] F. Matern, C. Riess, M. Stamminger, Exploiting visual artifacts to expose Deepfakes and face manipulations, in: 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW)2019, pp. 83–92. [74] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016. http://www. deeplearningbook.org. [75] D. G€ uera, E.J. Delp, Deepfake video detection using recurrent neural networks, in: 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS)2018, pp. 1–6. [76] H.H. Nguyen, F. Fang, J. Yamagishi, I. Echizen, Multi-task learning for detecting and segmenting manipulated facial images and videos, CoRR abs/1906.06876 (2019) 1–8. http://arxiv.org/abs/ 1906.06876. [77] B. Dolhansky, R. Howes, B. Pflaum, N. Baram, C. Canton-Ferrer, The Deepfake detection challenge (DFDC) preview dataset, arXiv cs.CV abs/1910.08854 (2019) 1–14. [78] S. Agarwal, H. Farid, Y. Gu, M. He, K. Nagano, H. Li, Protecting world leaders against deep fakes, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June2019, pp. 38–45. [79] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, P. Natarajan, Recurrent convolutional strategies for face manipulation detection in videos, CoRR abs/1905.00582 (2019) 1–8. http://arxiv.org/ abs/1905.00582. [80] A. Bharati, R. Singh, M. Vatsa, K.W. Bowyer, Detecting facial retouching using supervised deep learning, IEEE Trans. Inf. Forensics Secur. 11 (9) (2016) 1903–1913. [81] P. Flynn, K. Bowyer, P.J. Phillips, Assessment of time dependency in face recognition: an initial study. Audio- Video-Based Biom. Pers. Authentication (2003) 44–51, https://doi.org/10.1007/3-54044887-X_6. [82] S. Tariq, S. Lee, H. Kim, Y. Shin, S.S. 
Woo, Detecting both machine and human created fake face images in the wild. in: Proceedings of the 2nd International Workshop on Multimedia Privacy and SecurityAssociation for Computing Machinery, New York, NY, 2018, pp. 81–87, https://doi.org/ 10.1145/3267357.3267367. [83] A. Jain, R. Singh, M. Vatsa, On detecting GANs and retouching based synthetic alterations, CoRR abs/1901.09237 (2019) 1–7. http://arxiv.org/abs/1901.09237. [84] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, CoRR abs/ 1411.7766 (2014) 1–11. http://arxiv.org/abs/1411.7766. [85] S. Wang, O. Wang, A. Owens, R. Zhang, A.A. Efros, Detecting photoshopped faces by scripting photoshop, CoRR abs/1906.05856 (2019) 1–16. http://arxiv.org/abs/1906.05856. [86] X. Zhang, S. Karaman, S. Chang, Detecting and simulating artifacts in GAN fake images, CoRR abs/ 1907.06515 (2019) 1–10. http://arxiv.org/abs/1907.06515. [87] P. Isola, J. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, CoRR abs/1611.07004 (2016) 1–17. http://arxiv.org/abs/1611.07004. [88] S. Suwajanakorn, S.M. Seitz, I. Kemelmacher-Shlizerman, Synthesizing Obama: learning lip sync from audio. ACM Trans. Graph. 36 (4) (2017), https://doi.org/10.1145/3072959.3073640. 159 CHAPTER 7 Synthesis of respiratory signals using conditional generative adversarial networks from scalogram representation S. Jayalakshmya, Lakshmi Priyab, and Gnanou Florence Sudhac a IFET College of Engineering, Villupuram, India Manakula Vinayaga Institute of Technology, Pondicherry, India c Pondicherry Engineering College, Pondicherry, India b 7.1 Introduction Chronic respiratory disorders are a genre of long-term illness influencing the air passage and the anatomy of the respiratory system. Lung disorders are graded as the biggest donor to the global effect of diseases in the world. Some of the respiratory disorders include chronic obstructive pulmonary diseases (COPD), asthma, occupational lung illness, chronic bronchitis, pneumonia, etc. The diagnostic studies of the World Health Organization (WHO) and Healthy People 2020 revealed that currently over 25 million people in the United States (US) have been diagnosed with asthma and roughly 14.8 million adults have been identified with COPD [1]. In addition, the Forum of International Respiratory Societies (FIRS) has anticipated that more than millions of people live along with increased pressure in the pulmonary system and more than 50 million individuals suffer from occupation-related lung infections, thereby figuring out that more than one billion people are being affected from chronic respiratory conditions [2]. The causes of COPD include deficiency of alpha-1-antitrypsin protein, long-term exposure to air pollution, chemicals, fumes, and dust inhalation in the workplace. This deficiency causes the lungs to deteriorate and also can affect the liver. The common monitoring and diagnostic techniques for pulmonary diseases are spirometry, CT scans, arterial blood gas tests, and auscultation of the lungs. Respiratory auscultation is the most preferred diagnostic tool for the examination of pulmonary disorders and it provides the physiological and pathological information for the medical experts to proceed with the therapeutic procedure [3]. The respiratory sound (RS) is one of the most significant bio-signals used to diagnose certain respiratory abnormalities. RS detected from the chest wall and mouth may be classified as normal (vesicular) and adventitious sounds. 
Some of the abnormal sounds include wheezes, rhonchi (low-pitched wheezes), stridor, and crackles, which vary in frequency. The use of a conventional stethoscope to listen to lung sounds makes auscultation simple, easy to use, and the most popular noninvasive method for diagnosis. To assist medical practitioners further in the process of their diagnosis, deep learning-based computer-aided diagnosis (CAD) systems have been extensively used over the past few years. However, for these deep learning algorithms to precisely differentiate even feeble abnormal breathing patterns, a huge volume of training data is required. On the other hand, only a very limited number of normal and abnormal respiratory sound recordings are available in the publicly accessible datasets. Working with these insufficient resources, deep learning models tend to suffer from restrictions such as overfitting, wherein the model performs well on the training data but not on other, unobserved data. This can be considered the greatest challenge in conventional deep learning-based CAD systems. Although several training schemes and architectural designs have been deployed in research findings, training a model with a scarce amount of data remains a demanding task. Therefore, to acquire more samples, data augmentation is the sensible solution. To scale up datasets, the conventional augmentation approach applies simple variations to the given images in order to obtain different facets of the original images. Typical modifications involve rotation, translation, reflection, and resizing of the images [4]. From the respiratory signal perspective, however, applying such conventional augmentation to the 2D representation of the audio signal is not appropriate. In explicit terms, random flipping around the time axis means that the signal is reversed in time, and random Y flipping completely changes the interpretation of frequency. In the same way, random X translation implies only a time shift, while translation in Y signifies that the frequency spectrum is modified, which may not be a true representation of the original signal. Scaling randomly in the X direction in order to simulate slower or faster breathing, while not changing the frequency characteristics, would be physically meaningful, but this in turn adds random noise to the signal representation. To resolve these issues, this study aims at enlarging the dataset artificially with the help of a generative adversarial network (GAN). In this study, it is proposed to synthesize a respiratory signal using a GAN architecture by virtue of a time-frequency representation, which gives a picture of the signal. The scalogram, the visual representation of the energy density of the signal obtained through the continuous wavelet transform, is found to differentiate well between the different classes of respiratory signals [5]. The fact that the continuous wavelet transform is invertible enables the use of a GAN architecture to indirectly synthesize signals. GAN is a class of deep learning framework which generates images from a known representation of data called the latent space.
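As a concrete illustration of the scalogram representation discussed above, the sketch below computes a continuous wavelet transform of a synthetic respiratory-like signal with PyWavelets and keeps the coefficient magnitudes as the scalogram. The chapter's experiments use the MATLAB wavelet toolbox with Morse wavelets; the Morlet wavelet, scale range, sampling rate, and synthetic signal used here are stand-in choices for illustration only.

```python
import numpy as np
import pywt

fs = 4000                                   # assumed sampling rate (Hz)
t = np.arange(0, 2.0, 1 / fs)               # 2 s of signal
# Synthetic stand-in for a lung sound: a slow breath envelope modulating
# a wheeze-like 400 Hz tone, plus a little noise.
signal = (0.5 + 0.5 * np.sin(2 * np.pi * 0.3 * t)) * np.sin(2 * np.pi * 400 * t)
signal += 0.05 * np.random.randn(t.size)

scales = np.arange(1, 128)                  # wavelet scales (coarse illustration)
coeffs, freqs = pywt.cwt(signal, scales, "morl", sampling_period=1 / fs)

scalogram = np.abs(coeffs)                  # energy density over (scale, time)
print(scalogram.shape)                      # (127, len(t)); can be rendered as an image
```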
GAN comprises two deep neural networks, called the generator and the discriminator, competing against each other in order to learn the probability distribution of the known training set images; hence the term adversarial. The outcome of this study will address the challenge of restricted training data faced in deep learning-based respiratory signal classification. Furthermore, the proposed approach can be used as a data augmentation technique in the abovementioned signal classification task. The remaining part of this chapter is structured as follows: related work on the application of GANs in diverse fields and on augmentation approaches is presented in Section 7.2; the basic GAN, the cGAN, and the proposed model are explained in Section 7.3; the dataset details, the generator and discriminator network architectures, and the classifier results are presented in Section 7.4; and finally the research findings are explained in Section 7.5.

7.2 Related work
Over the past few years, the focus on deep learning models and algorithms has gradually gained significant importance in addressing several issues in the field of medical imaging. Several studies have been reported in the literature using supervised learning techniques, wherein a huge amount of training data is required to prepare a strong model. Owing to the wide range of images in the medical field, the collection of data samples continues to be a great challenge. Introducing only minute changes in the original images limits the classification performance, as the augmentation methods induce additional details in the training samples. Furthermore, a proportion of the expanded dataset may appear quite distinct from the real-world objects, making it unsuitable for other databases. In order to overcome these issues, Goodfellow et al. [6] proposed an alternative approach to data augmentation wherein synthetic images are generated by employing generative adversarial networks (GANs). The implementation of this structured adversarial modeling in turn yielded very sharp distributions compared with Markov chain models. GANs are a kind of unsupervised learning used for mapping small-scale hidden vectors to high-dimensional data. In the literature, GANs have lately been put into practice in diverse fields, and many initiatives have been carried out on medical images employing image-to-image translation. In the year 2017, Costa et al. [7] explored a U-net for generating new retinal fundus images from vessel segmentations with the help of GANs. The results indicated that the original and synthetic images were visibly different even though both were derived from the same vessel tree. In addition, the produced synthetic images retained most of the quality of the real image set. Lei Bi et al. [8] proposed multichannel generative adversarial networks (M-GAN) for boosting the training data of positron emission tomography (PET) images and provided more realistic images in comparison with a conventional GAN. In 2018, Frid-Adar et al.
[9] proposed classical data augmentation as the first stage to expand a dataset of CT images and synthetic data augmentation using a GAN as the second stage for the classification of liver lesions; the classic data augmentation approach yielded 78.6% sensitivity and 88.4% specificity, and with the help of synthetic image creation the classification performance improved to 85.7% sensitivity and 92.4% specificity. Furthermore, in 2018, Salehinejad et al. [10] also demonstrated the expansion of dataset samples by implementing a GAN to produce artificial images for the classification of lesions in chest X-ray images. The authors utilized a deep convolutional neural network (DCNN) to identify disorders across five different classes of chest X-rays, and the performance results were found to be improved. In 2019, Bhattacharya et al. [11] suggested a deep convolutional generative adversarial network (DCGAN) and experimented on the NIH chest X-ray open database to enhance the efficiency of a CNN model using the GAN, yielding 65.3% accuracy. With the aid of a structure-correcting GAN, Dai et al. [12] carried out segmentation of the lung and heart regions in chest X-ray images. In that work, the authors introduced a critic network in order to capture the higher-level structures and achieve realistic segmentation outcomes. Several comprehensive attempts with this method resulted in realistic segmentation with high precision. Onishi et al. [13] explored a deep CNN (DCNN) and a GAN to create a sufficient number of images in order to differentiate malignant and benign lung nodules. In that work, the images were generated using the pixel value distribution present in the mid-portion of the pulmonary nodule. This approach of pretraining and fine-tuning using the DCNN enabled discrimination of almost 66.7% of benign nodules and 93.9% of malignant pulmonary nodules, with a classification accuracy about 20% higher than with the original images. Apart from CT and PET images, Chaudhari et al. [14] trialed the augmentation approach on a gene expression dataset using a modified generator GAN (MG-GAN) and compared the performance with a basic GAN and a KNN classifier. The results proved that MG-GAN improved the accuracy by 18.8% and 11.9%, and further the loss value of the error function was reduced drastically from 0.6978 to 0.0082, making it suitable for applications with sensitive data. Luo et al. [15] explored progressive growing GAN-based augmentation on electroluminescence images for the classification of faulty photovoltaic component cells and improved the performance by up to 14% using the enlarged dataset. Apart from this, Li et al. [16] focused their research on gear safety in the transmission industry for reliability classification using GANs, wherein the authors introduced a bounded-GAN to create gear data with different settings and trained the model using the ADAM optimizer. The research findings show that the proposed bounded-GAN outperforms other approaches on operational measures. GANs also find application in diverse fields such as magnetic resonance imaging, video surveillance, clinical informatics, computational biology, the automotive domain, etc. Furthermore, GAN-based methods show possible gains in the audio synthesis field as well, with reference to the analysis, processing, and classification of signals.
In spite of the latest developments in the field of artificial intelligence and generative models, the generation of natural-sounding audio through neural networks remains an open problem. In 2017, Shrivastava et al. [17] substantiated that realistic samples can be synthesized at a faster pace with the help of GANs. The authors proposed a combination of simulated and unsupervised learning and attempted numerous changes to the basic GAN, such as self-regularization, a local adversarial loss, and updating the discriminator with a history of refined images, which resulted in a good performance improvement. Donahue et al. [18] showed in 2019 that GANs can also be applied to audio signal generation. The authors introduced WaveGAN as an initial attempt at the unsupervised synthesis of raw-waveform audio and achieved promising results. Owing to the complexity of its structure, audio signal generation is a key issue, as it depends on several time scales. Therefore, it is advantageous to train a network on a richer representation rather than utilizing raw samples in the temporal dimension. Time-frequency analysis is used to distinguish and handle nonstationary signals in a better way. As one such example, Shen et al. [19] demonstrated a novel architecture named Tacotron 2 for synthesizing audio signals through Mel-frequency spectrograms. These Mel spectrograms are fed as input to a network named WaveNet, and this resulted in a mean opinion score of 4.53. Even though several achievements have been enabled through TF representations, the visual representation of the spectrum of frequencies varying with time (the spectrogram) is not directly invertible. Marafioti et al. [20] discussed the key points of the neural architectures explored for producing TF representations, especially for speech synthesis using the STFT. In that work, the authors introduced TiFGAN, which creates audio unconditionally, but it has limitations in producing audio of substantial quality. In addition, the spectrogram type of representation offers only constant resolution. These drawbacks can be resolved by employing the continuous wavelet transform, which is invertible and whose TF representation, the scalogram, is produced with variable resolution. The use of an unconditional generative model provides no control over the generated modes of data. Several efforts have been made to generate audio signals in an unconditional manner [21–23]. Despite that, all these techniques utilize the autoregressive method, which takes noise samples as input and creates samples of the audio signal serially. By conditioning the model on some extra information, it becomes possible to supervise the process of data generation. This conditioning may rely on class labels, on tags for a portion of the data, or on the data as a whole. Mirza et al. [24] introduced conditional adversarial nets and generated images conditioned on MNIST class labels. The authors proved the potency of conditional adversarial nets and their useful applications with such tags used individually. Conditional GANs (cGANs) are a kind of GAN wherein information concerning the conditions is imposed on the basic GAN. The results show superior performance compared with nonconditional GANs. The majority of application areas, and the medical field in particular, experience the inability to gain access to big data for analysis.
Even though several data augmentation approaches are possible with GANs for both medical images and audio [13–23], some of the technical gaps observed in the existing studies concern the quality of the generated data samples, their distortion, and their distribution in the dataset, which are often not good enough and result in poor classification accuracy, since accuracy is significantly dependent on both qualitative and quantitative terms. In addition, the speed of the image generation process is very slow, specifically in the field of speech synthesis. This poses limitations on high-dimensional data due to the nature of autoregressive modeling. Furthermore, in certain instances, network models trained with artificially generated sample data fail to perform well when fed with real images. All these gaps can be addressed in the proposed study with the help of a conditional GAN. Inspired by the performance improvement obtained by conditional GANs in Ref. [24], the proposed study employs the scalogram method of TF representation combined with a cGAN for improved targeting and for synthesizing respiratory sounds in order to discriminate between normal and abnormal lung sounds.

7.3 GAN for signal synthesis
In this section, the architectures of the simple GAN and the conditional GAN, and the proposed system model using the conditional GAN to synthesize respiratory sound signals from the wavelet-based time-frequency representation, are explained in detail. The conditional GAN is utilized in this proposed study to artificially generate a larger number of scalogram images. With this proposed model as a data augmentation technique, better prediction accuracy through computer-aided diagnosis is expected.

7.3.1 Simple GAN
A simple GAN comprises two networks, named the generator and the discriminator, that are trained concurrently. The generator learns to generate a new image mimicking the data in the latent space by estimating its underlying probability distribution, while the discriminator plays the role of a binary classifier by mapping the input image either to the real-image dataset or to the generated set of images. The generator model has to be trained such that the generated image very closely resembles the images used for training, thereby making it difficult for the discriminator to distinguish between the original and generated sets of images. The discriminator in turn learns to make sure that its performance is better than that of the generator. This adversarial learning behavior of the GAN results in the generation of images which are very close to the real training set images. Fig. 7.1 shows the architecture of the simple GAN [25].

Fig. 7.1 Architecture of simple GAN.

7.3.2 Conditional generative adversarial networks
In contrast to the basic GAN, the cGAN utilizes a supervised method where both the generator and discriminator neural networks are adapted to meet a condition during the training phase with the help of certain extra information. Fig. 7.2 shows the architecture of the conditional GAN [26].

Fig. 7.2 Architecture of conditional GAN.

The latent space information in the form of noise and the condition labels are fed as inputs to the generator block. The images generated by the generator, along with the condition labels and the real images, are given as input to the discriminator block. This block detects the similarity between the given labels and images. The data augmentation is achieved by incorporating the conditional variable y into the model.
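To make the conditioning idea of Fig. 7.2 concrete, the sketch below defines a minimal conditional generator and discriminator in PyTorch, with the class label injected through an embedding in both networks. The layer sizes loosely mirror those described later for the proposed cGAN (a noise vector of length 100, a label embedding, and 64 × 64 × 3 scalogram images), but this is an illustrative sketch rather than the chapter's exact implementation; the four-class setup is assumed from the lung-sound categories used in this study.

```python
import torch
import torch.nn as nn

NOISE_DIM, NUM_CLASSES, EMBED_DIM = 100, 4, 50  # 4 lung-sound classes assumed

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.project = nn.Linear(NOISE_DIM, 4 * 4 * 1024)          # noise -> 4x4x1024 map
        self.embed = nn.Sequential(nn.Embedding(NUM_CLASSES, EMBED_DIM),
                                   nn.Linear(EMBED_DIM, 4 * 4))    # label -> one 4x4 plane
        self.net = nn.Sequential(                                   # 4x4 -> 64x64
            nn.ConvTranspose2d(1025, 256, 5, stride=1), nn.BatchNorm2d(256), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 5, stride=2, padding=2, output_padding=1),
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 5, stride=2, padding=2, output_padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 5, stride=2, padding=2, output_padding=1),
            nn.Tanh(),                                              # RGB scalogram in [-1, 1]
        )

    def forward(self, z, labels):
        h = self.project(z).view(-1, 1024, 4, 4)
        y = self.embed(labels).view(-1, 1, 4, 4)
        return self.net(torch.cat([h, y], dim=1))                   # concatenate noise and label maps

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Sequential(nn.Embedding(NUM_CLASSES, EMBED_DIM),
                                   nn.Linear(EMBED_DIM, 64 * 64))   # label -> one 64x64 plane
        self.net = nn.Sequential(                                    # 64x64 -> real/fake logit
            nn.Conv2d(4, 64, 5, stride=2, padding=2), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.BatchNorm2d(128), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 5, stride=2, padding=2), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 512, 5, stride=2, padding=2), nn.BatchNorm2d(512), nn.LeakyReLU(0.2),
            nn.Conv2d(512, 1, 4),                                    # 4x4 -> 1x1 logit
        )

    def forward(self, images, labels):
        y = self.embed(labels).view(-1, 1, 64, 64)
        return self.net(torch.cat([images, y], dim=1)).view(-1)

# Example forward pass with random inputs.
z = torch.randn(8, NOISE_DIM)
labels = torch.randint(0, NUM_CLASSES, (8,))
fake = Generator()(z, labels)               # (8, 3, 64, 64) generated scalograms
score = Discriminator()(fake, labels)       # (8,) real/fake scores
```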
7.3.3 Conditional GAN for respiratory sound synthesis
The fundamental idea in the proposed study is to train the conditional GAN to generate realistic scalogram images of various respiratory sounds.

7.3.3.1 System model
Fig. 7.3 shows the proposed framework for synthesizing respiratory sounds using the cGAN. Different types of lung sound, such as vesicular, wheeze, crackle, and low-pitched wheeze signals, are given as input to the time-frequency transform. The continuous wavelet transform maps the time-domain respiratory signal to a time-scale analysis, represented in the form of a scalogram. These real scalograms are fed as input to the conditional GAN in order to generate more artificial scalogram images with the help of the generator network in the cGAN. Finally, the inverse CWT is applied to synthesize the respiratory sound signals in the time domain.

Fig. 7.3 Proposed system model for synthesizing respiratory sounds.

7.3.3.2 Time-scale representation using CWT
The continuous wavelet transform (CWT) modifies the temporal length of the basis function for the purpose of achieving a changeable time-frequency localization. To interpret very small changes in frequency, the CWT utilizes longer basis functions at the cost of confined localization in time, and uses shorter basis functions to ascertain high localization in time [27]. This time-frequency transform elicits a spectrum of time scale versus amplitude named the scalogram. Compared to spectrograms, scalograms are useful for analyzing realistic signals at diverse scales. As the frequencies in the CW transform are logarithmic in nature, the obtained scalogram plot also uses a log-scale frequency axis.

7.3.3.3 Generator and discriminator network architecture of cGAN
The network architectures of the generator and discriminator are shown in Figs. 7.4 and 7.5, and the corresponding analysis results are tabulated in Tables 7.1 and 7.2. In the cGAN, the generator network comprises sequential transposed convolutional layers with batch normalization for scaling up arrays of different dimensions. The small-scale noise vector given to the fully connected section is transformed into 1024 high-scale image features, which are reshaped to 4 × 4 × 1024 and, after concatenation with the reshaped label embedding, form a 4 × 4 × 1025 tensor that is fed to the convolution module. Several successive stages of transposed convolution layers transform the interim features to produce an output image of dimension 64 × 64 × 3. This generator network generates synthetic scalograms for each class of respiratory sound individually in order to attain a wider class population.

Fig. 7.4 Generator network architecture.

On the other hand, the discriminator network is modeled using multiple convolutional layers with leaky ReLU to produce prediction values. From the input image given to the discriminator block, which has a dimension of 64 × 64 × 3, high-scale features with dimension 4 × 4 × 512 are extracted by the series of convolution layers. Further, these features are flattened and given as input to the fully connected network.
This network gradually assigns the features to a low scale for classification. 169 170 Generative adversarial networks for image-to-image translation Images image input Layer Labels Image input Layer Drop out (0.25) Leaky ReLU Layer 2 Convolution 2D Layer 3 Concatenation Layer Batch Normalization Layer 3 Convolution 2D Layer 1 Leaky ReLU Layer 3 Leaky ReLU Layer 1 Convolution 2D Layer 4 Convolution 2D Layer 2 Batch Normalization Layer 4 Batch Normalization Layer 2 Leaky ReLU Layer 4 Convolution 2D Layer 5 Fig. 7.5 Discriminator network architecture. 7.3.3.4 Algorithm Algorithm 7.1. Generation of Scalograms using cGAN for the training process with Epochs 5 500, 1000, 1500, and 2000, learn rate 5 0.0002 and no. of classes 5 4 Input: Real scalograms S0,S1,S2…Sn and noise input vector: Y, Conditional metrics: C0,C1,C2…Cn Formulate bounds for G and D of cGAN with input dimension [64 64 3] for num of training phases do Ø Update the discriminator using S0,S1,S2…Sn with C0,C1,C2…Cn Ø Generate scalograms Z0,Z1,Z2…Zn using noise input vector with C0,C1,C2…Cn Ø Revise the D network using Z0,Z1,Z2…Zn with C0,C1,C2…Cn Ø Revise the G network using Z0,Z1,Z2…Zn with C0,C1,C2…Cn end for Output: 200 no. of observations for each class. Scalogram-based respiratory signal synthesis Table 7.1 Analysis result of generator network. Sl. no. 1 2 3 4 5 6 7 8 9 10 11 12 13 Name Type Activations Learnables Noise 1 1 100 images Proj Project and Reshape Layer with output size 4 4 1024 Labels 1 1 1 images emb Reshape Layer with output size 4 4 Image input 1 1 100 – Project and reshape 4 4 1024 Weights 16,384 100 Bias 16,384 1 Image input 111 – Embed and reshape layer 441 Concatenation 4 4 1025 Embedding weights 50 4 Fully connecting weights 16 50 Fully connecting Bias 16 1 – Transposed convolution layer 8 8 256 Weights 5 5 256 1025 Bias 1 1 256 Batch normalization 8 8 256 Offset 1 1 256 Scale 1 1 256 Relu 8 8 256 – Transposed convolution layer 16 16 128 Weights 5 5 128 256 Bias 1 1 128 Batch normalization 16 16 128 Offset 1 1 128 Scale 1 1 128 Relu 16 16 128 – Transposed convolution layer 32 32 64 Weights 5 5 64 128 Bias 1 1 64 Batch normalization 32 32 64 Offset 1 1 64 Scale 1 1 64 Cat Concatenation of 2 inputs along dimension 3 tconv1 256 5 5 1025 transposed convolutions with stride[1 1] and cropping [0 0 0 0] bn1 Batch normalization with 256 channels Relu1 Relu tconv2 128 5 5 256 transposed convolutions with stride [2 2] and cropping same bn2 Batch normalization with 128 channels Relu2 Relu tconv3 64 5 5 128 transposed convolutions with stride [2 2] and cropping same bn3 Batch normalization with 64 channels Continued 171 172 Generative adversarial networks for image-to-image translation Table 7.1 Analysis result of generator network—cont’d Sl. no. 14 15 16 Name Type Activations Learnables Relu3 Relu tconv4 3 5 5 64 transposed convolutions with stride [2 2] and cropping same tanh Hyperbolic tangent Relu 32 32 64 – Transposed convolution layer 64 64 3 Weights 5 5 3 64 Bias 1 1 3 Tanh 64 64 3 – Table 7.2 Analysis result of discriminator network. Sl. no. 
1 2 3 4 5 6 7 8 9 Name Type Activations Learnables Images 64 64 3 images Dropped 25% dropout Labels 1 1 1 images emb Reshape Layer with output size 64 64 3 Image input 64 64 3 – Drop out Image input 64 64 3 111 – – Embed and reshape layer 64 64 1 Concatenation 64 64 1 Embedding weights 50 4 Fully connecting weights 4096 50 Fully connecting Bias 4096 1 – Convolution 32 32 64 Weights 5 5 4 64 Bias 1 1 64 Leaky Relu 32 32 64 – Convolution 16 16 128 Weights 5 5 64 128 Bias 1 1 128 Batch normalization 16 16 128 Weights 1 1 128 Bias 1 1 128 Cat Concatenation of 2 inputs along dimension 3 conv1 64 5 5 4 convolutions with stride [2 2] and padding same lrelu1 Leaky relu with scale 0.2 Conv2 128 5 5 64 convolutions with stride [2 2] and padding same bn2 Batch normalization with 128 channels Scalogram-based respiratory signal synthesis Table 7.2 Analysis result of discriminator network—cont’d Sl. no. 10 11 12 13 14 15 16 17 Name Type Activations Learnables lRelu2 Leaky ReLU with scale 0.2 Conv3 256 5 5 128 convolutions with stride [2 2] and padding same bn3 Batch normalization with 256 channels lRelu3 Leaky ReLU with scale 0.2 Conv4 512 5 5 256 convolutions with stride [2 2] and padding same bn4 Batch normalization with 512 channels lRelu4 Leaky ReLU with scale 0.2 Conv5 1 4 4 512 convolutions with stride [1 1] and padding [0 0 0 0] Leaky Relu 16 16 128 – Convolution 8 8 256 Weights 5 5 128 256 Bias 1 1 256 Batch normalization 8 8 256 Leaky Relu 8 8 256 Weights 1 1 256 Bias 1 1 256 – Convolution 4 4 512 Weights 5 5 256 512 Bias 1 1 512 Batch normalization 4 4 512 Leaky Relu 4 4 512 Weights 1 1 512 Bias 1 1 512 – Convolution 111 Weights 4 4 512 Bias 1 1 7.3.3.5 Steps The stages for the generation of original, realistic scalograms, and synthesis of respiratory sounds are outlined as follows: Step 1: Examine and analyze the given respiratory signals having integral number of time-varying frequencies using continuous wavelet transform. Step 2: Produce Morse scalogram representations for various lung sounds with the aid of MATLAB wavelet tool box. Step 3: Input the obtained scalograms of the original lung sound signals to Conditional GAN. Step 4: Generate realistic scalogram images with the help of modeled generator network. Step 5: Synthesize the original respiratory signal by giving the generated scalograms to inverse CWT. 173 174 Generative adversarial networks for image-to-image translation Step 6: Provide the original scalogram images extracted through continuous wavelet transform and generated scalogram images through cGAN to the pretrained models Alexnet CNN [5], GoogLenet [28], and ResNet 50 to quantify the performance. 7.4 Results and discussion To demonstrate the performance of the proposed data augmentation technique using conditional GAN, the original scalogram images extracted through continuous wavelet transform and generated scalogram images through cGAN are input to different pretrained models. The classification performance is compared for all the classes of respiratory sounds with and without augmentation. 7.4.1 Dataset The dataset used in this proposed study was acquired from various sources namely RALE (Respiration acoustics Laboratory Environment) repository [29], Think labs Lung sound library [30], and ICBHI [31] benchmark publicly available databank. These archives comprise gender-based normal and abnormal lung sounds of several kinds. For the training and testing phase, the entire lung sound database has been arbitrarily split into 70% and 30%. 
By and large, the database has 73 normal files, 281 crackle sound files, 33 rhonchi files, and 122 wheeze files.

7.4.2 Data augmentation using conditional GAN
The training phase of the data augmentation process using the cGAN is explained in this section. For the process of experimentation, (i) the number of latent inputs for the generator network is taken to be 100; initially, the generator produces random RGB noise as scalograms. (ii) Using the modeled convolutional filters, the discriminator network attempts to figure out the difference between the random noise scalograms and the real scalogram images of respiratory sounds. (iii) To confuse the discriminator network, the generator learns through its transposed convolution filters. (iv) The process is continued until the discriminator is confused to the greatest extent. In the proposed approach, the network is trained for four different numbers of epochs, namely 500, 1000, 1500, and 2000, and the cost function is observed in all cases. The visual representation of the progress of training, with the scores of both networks, is shown in Fig. 7.6. To check for the convergence of the network during training, the scores are plotted on a scale from 0 to 1. The score of the generator network is defined as the average of the likelihoods, corresponding to the discriminator output, for the generated images. In case 1 of Fig. 7.6, i.e., 500 epochs, mode collapse happens, which indicates that the generator is incapable of learning the scalogram representations corresponding to diverse inputs. Therefore, in order to increase the ability of the generator to produce more varied outputs, the number of epochs is increased. The training plots of cases 2, 3, and 4 indicate that the generator score reaches the value 0 and the score of the discriminator extends to almost 1, which signifies that the discriminator network is dominating the generator network and therefore classifies most of the images correctly. Since the plots are almost stable in cases 3 and 4, the training phase is stopped at 1500 epochs. Further increasing the number of iterations increases the computational time of the network.

Fig. 7.6 Training plots of generator and discriminator for different epochs.

7.4.3 Samples of generated scalogram images for different classes
Fig. 7.7 shows samples of the scalogram images generated by the generator network for 1500 epochs.

Fig. 7.7 Samples of generated scalogram images for different classes using 1500 epochs.

7.4.4 Synthesis of respiratory sounds using inverse CWT
The inverse CWT is applied to the generated scalograms, and the acquired respiratory sound signals for the cases of normal and abnormal lung sounds are plotted using the signal analyzer app, as shown in Fig. 7.8.

Fig. 7.8 Sample of normal synthesis from scalogram using ICWT.

7.4.5 Performance results
To evaluate the performance of data augmentation using the conditional GAN, pretrained deep learning models such as AlexNet, GoogLeNet, and ResNet 50 are used for classification. The classification is performed for all classes of respiratory sounds without augmentation and with augmentation, and the results are compared for all the deep learning models. For classifying the dataset without augmentation, 357 images are considered for training and 153 images are used for testing.
Similarly, for classification with augmentation, 200 images are generated for each class of respiratory sounds using the cGAN. Out of these, 500 images are considered for training and 300 for testing. The experimental settings for modeling the network are listed in Table 7.3.

Table 7.3 Parameter settings for the trained network model.
Hyperparameter                 Value
Momentum                       0.9
Initial learning rate          0.0001
Learning rate drop factor      0.2
Learning rate drop period      5
Number of epochs               20
Batch size                     10
Optimizer                      Sgdm

Table 7.4 Classification accuracy for the various pretrained models without augmentation and with augmentation using cGAN.
Classifier     Without augmentation (%)   500 epochs (%)   1000 epochs (%)   1500 epochs (%)
AlexNet        68.63                      93.13            95.13             96.38
GoogLeNet      73.86                      93.45            96.88             96.88
ResNet 50      81.37                      95.23            97.82             98.75
ResNet 50 provides the highest accuracy compared with the other two methods.

The performance metric, accuracy, of a pretrained network is usually estimated by calculating the testing accuracy with the help of a confusion matrix. Accuracy measures the number of correctly classified normal and abnormal sound files relative to the total number of test samples. Table 7.4 shows the classification accuracy obtained for the various deep network models without and with augmentation, for different numbers of epochs. The results in Table 7.4 indicate that the pretrained CNN model ResNet 50 performs well in both cases, with and without augmentation. For the case of classification with real images, i.e., without augmentation, the ResNet 50 model produces an accuracy of 81.37%, which is high compared with the AlexNet and GoogLeNet classifiers. Furthermore, generating new images with the cGAN and training the ResNet 50 model produces the highest classification accuracy of 98.75% at 1500 epochs, compared with the other deep network models. The training progress and confusion matrices of the ResNet 50 network model with real images and with generated images are shown in Figs. 7.9 and 7.10 and Tables 7.5 and 7.6. The columns of the confusion matrices plotted in Tables 7.5 and 7.6 indicate the true cases for each class, and the rows indicate the cases classified as belonging to that class. To be more specific, in Table 7.5 the number of actual cases in the class crackle is 77, with 31 in normal, 16 in rhonchi, and 29 in wheeze; the numbers of cases correctly classified as belonging to each class are 71 for crackle, 18 for normal, 10 for rhonchi, and 25 for wheeze. With data augmentation, in Table 7.6, the number of actual cases in the class crackle is 72, with 79 in normal, 75 in rhonchi, and 74 in wheeze, and the numbers of cases correctly classified are 72 for crackle, 75 for normal, 75 for rhonchi, and 74 for wheeze. From the confusion matrices, other metrics such as precision, recall, and F1 score are also calculated for all types of lung sounds and are tabulated in Tables 7.7 and 7.8. Precision measures the fraction of positive class predictions that are actually positive. Recall relates the correct positive predictions to all positive samples in the dataset, while the F1 score is the harmonic mean of precision and recall.
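As a quick illustration of how the per-class precision, recall, and F1 values in Tables 7.7 and 7.8 can be derived from a confusion matrix, the sketch below computes them from the without-augmentation ResNet 50 matrix of Table 7.5, taking rows as the classifier's predictions and columns as the true classes, following the text. This is an illustrative computation, not code from the chapter.

```python
import numpy as np

classes = ["crackle", "normal", "rhonchi", "wheeze"]
# Confusion matrix from Table 7.5: rows = predicted class, columns = true class.
cm = np.array([[71,  9,  1,  3],
               [ 3, 18,  0,  1],
               [ 0,  0, 10,  0],
               [ 3,  4,  5, 25]])

for i, name in enumerate(classes):
    tp = cm[i, i]
    precision = tp / cm[i, :].sum()   # correct predictions / all predictions of this class
    recall = tp / cm[:, i].sum()      # correct predictions / all true samples of this class
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{name:8s} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# Matches Table 7.7 approximately, e.g. crackle 0.85 / 0.92 / 0.88 and wheeze 0.68 / 0.86 / 0.76.
```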
Scalogram-based respiratory signal synthesis 100 90 Final Accurary 80 70 60 Validation Accuracy = 81.37% 50 40 30 20 10 10 0 0 100 200 300 20 400 500 600 700 Iterations 2 Loss 1.5 1 Final 0.5 20 10 0 0 100 200 300 400 Iterations 500 600 700 Fig. 7.9 Training progress of ResNet 50 network for the case of real images (without augmentation). From the tables, it is observed that the class-wise accuracy is found to be high for all classes almost more than 98% for the case of classification with augmentation, whereas it is comparatively low for the case of without augmentation. In addition, the F1 score is high for all classes in Table 7.8 which reveals that the validation accuracy is better for augmented data in all cases. 7.4.6 Analysis The proposed method in this chapter demonstrates the training of CNNs with an alternative method for data augmentation by way of generating synthetic scalogram images by using conditional generative adversarial networks. The proposed method is experimented with three pretrained models namely Alexnet, GoogLeNet, and ResNet50 classifiers. The pretrained Alexnet model with five convolutional and three fully connected layers produces an accuracy of 68.63% for the case of original scalogram images whereas the network trained with 1500 epochs using cGAN data augmentation approach yields an accuracy of 96.38%. The same number of images when trained with GoogLeNet classifier with 22 layers deep yields an accuracy of 73.86% without augmentation and 96.88% with data augmentation. The third model ResNet 50 comprises 49 convolutional layers and a fully connected layer. This network when trained, yields an accuracy of 81.37% 179 Generative adversarial networks for image-to-image translation 100 90 80 Validation Accuracy: 98.75% Accuracy 70 60 50 40 30 20 10 10 0 200 400 20 600 800 1000 800 1000 1200 Iteration 1.6 1.4 Loss 180 1.2 1 0.8 0.6 0.4 0.2 0 200 400 600 Iteration Fig. 7.10 Training progress of ResNet 50 network with augmentation. Table 7.5 Confusion matrix of ResNet 50 network without augmentation. Crackle Normal Rhonchi Wheeze Crackle Normal Rhonchi Wheeze 71 3 0 3 9 18 0 4 1 0 10 5 3 1 0 25 Table 7.6 Confusion matrix of ResNet 50 network with augmentation. Crackle Normal Rhonchi Wheeze Crackle Normal Rhonchi Wheeze 72 0 0 0 3 75 0 1 0 0 75 0 0 0 0 74 1200 Scalogram-based respiratory signal synthesis Table 7.7 Precision and recall for ResNet 50 model without augmentation. Class Accuracy (%) Precision Recall F1 score Crackle Normal Rhonchi Wheeze 87.58 88.89 96.08 89.54 0.85 0.82 1 0.68 0.92 0.58 0.63 0.86 0.88 0.68 0.77 0.76 Table 7.8 Precision and recall for ResNet 50 model with augmentation. Class Accuracy (%) Precision Recall F1 score Crackle Normal Rhonchi Wheeze 99 98.67 100 99.67 0.96 1 1 0.99 1 0.95 1 1 0.98 0.97 1 0.99 without augmentation and 98.75% for the case of with augmentation. This indicates that deeper networks prove efficient both in terms of computation and the number of parameters. In addition, the model with good accuracy both in case of with and without augmentation, i.e., ResNet 50 is assessed with various metrics namely accuracy, precision, recall, and F1 score. Based on results from Table 7.8, the high values of precision, recall, and F1 score shows that the validation accuracy is better for augmented data in all classes of respiratory sounds. 
7.5 Conclusion and future scope Owing to the challenges in incorporating the conventional data augmentation techniques for time-frequency representation of the signal, a novel data augmentation approach has experimented for the signal under study. In this chapter, GAN an unsupervised learning structure is utilized to generate the synthetic images for the different classes of respiratory sounds. For improved targeting on the image generation, the conditional information is imposed to basic GAN. The contradictory learning behavior of conditional GAN gives rise to the generation of scalogram images really close to original scalogram images of respiratory sounds. It is also found that the performance of modeled discriminator network predominates the generator network and therefore categorizes the majority of the images accurately. In addition, the performance of the data augmentation approach is evaluated with different pretrained deep learning classifiers and compared with original images without augmentation. The results show that there is a significant improvement in the classification accuracy of all models in the data augmentation approach in comparison with without cGAN. Furthermore, the testing accuracy of ResNet 50 model produces an increased accuracy of 98.75% with high values of precision, recall, and F1 score for all 181 182 Generative adversarial networks for image-to-image translation classes of respiratory sounds resulting in better prediction. This study can be further extended with other types of GAN such as cycle GANs and Wasserstein GANs for the synthetic generation of images. The same setup can be compared with the generation of images using variational convolutional autoencoder to produce a better prediction model. References [1] https://www.healthypeople.gov/2020/topics-objectives/topic/respiratory-diseases (Respiratory Diseases—Accessed 10 May 2020). [2] https://www.who.int/gard/publications/The_Global_Impact_of_Respiratory_Disease.pdf (Global impact of Respiratory diseases—Accessed 05 May 2020). [3] M. Sarkar, I. Madabhavi, N. Niranjan, M. Dogra, Auscultation of the respiratory system, Ann. Thoracic Med. 10 (3) (2015) 158. [4] A. Sharif Razavian, H. Azizpour, J. Sullivan, S. Carlsson, CNN features off-the-shelf: an astounding baseline for recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014, pp. 806–813. [5] S. Jayalakshmy, G.F. Sudha, Scalogram based prediction model for respiratory disorders using optimized convolutional neural networks, Artif. Intell. Med. 103 (2020) 101809. [6] I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, et al., Generative adversarial networks, 2014. arXiv preprint arXiv:1406.2661. [7] P. Costa, A. Galdran, M.I. Meyer, M.D. Abràmoff, M. Niemeijer, A.M. Mendonça, A. Campilho, Towards Adversarial Retinal Image Synthesis, 2017. arXiv preprint arXiv: 1701.08974. [8] L. Bi, J. Kim, A. Kumar, D. Feng, M. Fulham, Synthesis of positron emission tomography (PET) images via multi-channel generative adversarial networks (GANs), in: Molecular Imaging, Reconstruction and Analysis of Moving Body Organs, and Stroke Imaging and Treatment, Springer, Cham, 2017, pp. 43–51. [9] M. Frid-Adar, E. Klang, M. Amitai, J. Goldberger, H. Greenspan, Synthetic data augmentation using GAN for improved liver lesion classification, in: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), IEEE, 2018, April, pp. 289–293. [10] H. Salehinejad, S. Valaee, T. Dowdell, E. 
Colak, J. Barfett, Generalization of deep neural networks for chest pathology classification in x-rays using generative adversarial networks, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, April, pp. 990–994. [11] D. Bhattacharya, S. Banerjee, S. Bhattacharya, B.U. Shankar, S. Mitra, GAN-based novel approach for data augmentation with improved disease classification, in: Advancement of Machine Intelligence in Interactive Medical Image Analysis, Springer, Singapore, 2020, pp. 229–239. [12] W. Dai, J. Doyle, X. Liang, H. Zhang, N. Dong, Y. Li, E.P. Xing, Scan: structure correcting adversarial network for chest x-rays organ segmentation, arXiv (2017). arXiv preprint arXiv: 1703.08770. [13] Y. Onishi, A. Teramoto, M. Tsujimoto, T. Tsukamoto, K. Saito, H. Toyama, H. Fujita, Automated pulmonary nodule classification in computed tomography images using a deep convolutional neural network trained by generative adversarial networks, BioMed Res. Int. 2019 (2019) 6051939, https://doi.org/10.1155/2019/6051939. [14] P. Chaudhari, H. Agrawal, K. Kotecha, Data augmentation using MG-GAN for improved cancer classification on gene expression data, Soft Comput. 24 (2019) 11381–11391. [15] Z. Luo, S.Y. Cheng, Q.Y. Zheng, GAN-based augmentation for improving CNN performance of classification of defective photovoltaic module cells in electroluminescence images, in: IOP Conference Series: Earth and Environmental Science, vol. 354 (1), IOP Publishing, 2019, October, p. 012106. [16] J. Li, H. He, L. Li, G. Chen, A novel generative model with bounded-GAN for reliability classification of gear safety, IEEE Trans. Ind. Electr. 66 (11) (2019) 8772–8781. Scalogram-based respiratory signal synthesis [17] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, R. Webb, Learning from simulated and unsupervised images through adversarial training, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2107–2116. [18] C. Donahue, J. McAuley, M. Puckette, Adversarial Audio Synthesis, 2018. arXiv preprint arXiv: 1802.04208. [19] J. Shen, R. Pang, R.J. Weiss, M. Schuster, N. Jaitly, Z. Yang, R.A. Saurous, Natural TTS synthesis by conditioning wavenet on Mel spectrogram predictions, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, April, pp. 4779–4783. [20] A. Marafioti, N. Perraudin, N. Holighaus, P. Majdak, Adversarial generation of time-frequency features with application in audio synthesis, in: International Conference on Machine Learning, 2019, May, pp. 4352–4362. [21] S. Dieleman, A.V.D. Oord, K. Simonyan, The challenge of realistic music generation: modelling raw audio at scale, 2018. arXiv preprint arXiv:1806.10474. [22] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W.Z. Teoh, J. Sotelo, et al., Melgan: generative adversarial networks for conditional waveform synthesis, in: Advances in Neural Information Processing Systems, 2019, pp. 14910–14921. [23] S. Vasquez, M. Lewis, Melnet: A Generative Model for Audio in the Frequency Domain, 2019. arXiv preprint arXiv: 1906.01083. [24] M. Mirza, S. Osindero, Conditional Generative Adversarial Nets, 2014. arXiv preprint arXiv: 1411.1784. [25] Y. Qian, H. Hu, T. Tan, Data augmentation using generative adversarial networks for robust speech recognition, Speech Commun. 114 (2019) 1–9. 
[26] https://www.mathworks.com/help/deeplearning/ug/train-conditional-generative-adversarialnetwork.html (Conditional Generative Adversarial Networks—Accessed 20 April 2020). [27] A.H. Najmi, J. Sadowsky, The continuous wavelet transform and variable resolution time-frequency analysis, Johns Hopkins APL Techn. Digest 18 (1) (1997) 134–140. [28] L. Balagourouchetty, J.K. Pragatheeswaran, B. Pottakkat, G. Ramkumar, GoogLeNet based ensemble FCNet classifier for focal liver lesion diagnosis, IEEE J. Biomed. Health Inform. 24 (6) (2020) 1686–1694. [29] The R.A.L.E. Repository. Rale.ca.N.P., 2017. Web. 28 February 2017. [30] https://www.thinklabs.com/lung-sounds (Lung sounds Library—Accessed 24 January 2019). [31] B.M. Rocha, D. Filos, L. Mendes, G. Serbes, S. Ulukaya, Y.P. Kahya, E. Kaimakamis, An open access database for the evaluation of respiratory sound classification algorithms, Physiol. Measur. 40 (3) (2019), 035001. 183 CHAPTER 8 Visual similarity-based fashion recommendation system Betul Ay and Galip Aydin Firat University Computer Engineering Department, Elazig, Turkey 8.1 Introduction Visual similarity search systems have become one of the most popular application areas of image retrieval systems in recent years. Content-based image retrieval (CBIR) [1] aims to search images based on their contents such as shapes, objects, colors, local geometry, or texture rather than the metadata associated with the image file such as file names, descriptions, or keywords. Image retrieval systems have been used in a large array of application areas such as search engines, personalized recommendation systems, art galleries management, retail systems, fashion design, and more commonly in e-commerce applications [2]. For any CBIR system, there are two major steps: finding the most important features of each image so that images can be described as feature vectors and calculating distances between images for similarity. Therefore, the success of a CBIR system relies heavily on the quality of the vector feature representation. The task of extracting high-quality, accurate, and efficient vector feature representation for each image is challenging due to the fact that images might have a wide variety of different properties such as content, size, resolution, etc. Also labeling large amounts of data is another major issue. Also supervised trainings with sufficient annotated data might limit the generalizability of the feature representations to novel classes. Recently, semisupervised and unsupervised learning approaches have gained popularity to overcome these difficulties. One of the most interesting applications of deep learning in recent years is creating feature representations of images as vectors. CNNs have widely been used for this purpose and show great promise [3–6]. Another interesting approach for creating vector representations is generative adversarial networks (GANs) [7]. GANs are being utilized in semisupervised learning due to the fact that they can learn deep image representations from unlabeled data. GANs are created using two models: a generative model G and a discriminator model D. The generative network creates samples while the discriminative network tries to distinguish the generated samples from true data. GANs are conceptually considered as a form of unsupervised learning because no labeled data is needed. 
In recent years, GANs have been one of the most popular fields Generative Adversarial Networks for Image-to-Image Translation https://doi.org/10.1016/B978-0-12-823519-5.00023-3 Copyright © 2021 Elsevier Inc. All rights reserved. 185 186 Generative adversarial networks for image-to-Image translation of research due to their ability to learn high-dimensional and complex data distributions by taking advantage of the use of unlabeled data for model training. Furthermore, GANs can be leveraged to build powerful models in any domains including images, speech, and text. Although GANs have been created for unsupervised learning, they have proven to be successful in semisupervised and reinforcement learning as well. GANs have successfully been used in various applications such as generating visually realistic images and style transfer however their effectiveness is not limited to these scenarios. Image-to-image translation (CycleGAN [8]), creating high-resolution images from low-resolution samples (SRGAN [9]), image generation from text (StackGAN [10]), learning to discover relationships between different domains such as fashion items (DiscoGAN [11]), transferring facial makeup from a reference image (Beautygan [12]), and other applications reviewed in Ref. [13] are some examples. Extracting deep features from images is another area of use for GANs. The idea is that since GANs can generate realistic images, vector representations of these images can also be used to describe a given image. And since the discriminator pushes the generator to generate more realistic images with enough training, the system produces higher quality representations in comparison to CNNs such as VGG. For instance, Hou et al. [14] utilize GANs for extracting features from images. They utilize a pretrained CNN namely 19-layer VGGNet [15] network which was trained on ImageNet for feature extraction. The proposed system contains generator, VGGNet, and discriminator networks. The generator network does not feed the real and fake images directly to the discriminator, unlike traditional GANs. Instead, the real and generated fake images are first preprocessed using VGGNet and corresponding features are extracted. The extracted features are subsequently fed to the discriminative network. The authors evaluated the results via a web interface created for human evaluators and they report that the resulting face images cannot be distinguished from real face images easily and the proposed model generates more realistic images compared to DCGAN [16] and DFC-VAE [17]. Since GANs are able to provide us with accurate image representations as vectors, they can be employed in similar product search for e-commerce applications. Worldwide growth of e-commerce economy has resulted in many innovative solutions including recommender systems to be deployed. Traditionally recommendation systems make use of past customer interactions, clicks, and purchase history to recommend new and related products. Content-based, collaborative, or hybrid recommender systems create recommendations based on recorded customer behaviors and similar decisions but ignore the product image content. A new type of recommendation system being emerged in e-commerce sites is visual similarity recommendation which creates a list of visually similar items to the query image. 
In this chapter, we present a visual recommendation system for e-commerce sites which utilizes GANs for image feature vector generation and a vector similarity search Visual similarity-based fashion recommendation system library for fast and accurate querying similarity. The proposed GAN is trained on a largescale shoe image dataset of 156,896 images. We also compare the precision and time performance of the proposed GAN with existing pretrained deep learning models on a standard fashion benchmark dataset, UT-Zap50K [18]. Since the proposed system does not require annotated image dataset, it is easy to extend for other types of fashion items other than shoes. We also prove this feature by extending the model with handbag images and conduct performance tests as well. The system as a whole presents a deep learning-based similar image recommendation solution. We also provide comparisons for several different GAN architectures for shoe similarity recommendation. The rest of the chapter is organized as follows: In Section 8.2 we briefly discuss related literature and present background on GANs and CNNs. Section 8.3 presents the proposed fashion recommendation system architecture. Experimental results are discussed in Section 8.4 and conclusions are presented in Section 8.5. 8.2 Related works The major research problem in this study is finding accurate vector representations for image features. Fast and accurate feature extraction from images allows us to build a robust and effective visual similarity recommendation system. Traditionally machine learning is used to recommend items for e-commerce customers [19–23]. However, in fashion domain image retrieval is subtle and subjective due to the fact that humans tend to have very different opinions on fashion items. Therefore, CBIR is an active area of research for e-commerce [24] and traditional recommender systems can be extended with visual similarity recommendations. In Ref. [25] the authors presented an architecture for retrieving most similar images to the query image. AlexNet [26] and VGG-16 pretrained networks are used to extract local and deep features from the activation of the intermediate layers. Representations from the fc6 and fc8 layers are used as feature vectors. Hamming distance is used as the similarity metric. To evaluate the efficiency of the system 2399 women’s fashion images obtained from Pinterest are collected and manually labeled into nine categories. Kiapour et al. [27] proposed an architecture for street to shop image retrieval which aims to find similar clothing items to a given real world in an online shop. In Ref. [28] the authors have proposed a solution for cross-domain fashion product retrieval by retrieving similar clothing items from online shopping images. Shankar et al. [29] proposed a visual search and recommendation system for e-commerce. Similarly, they use CNN for generating image feature vectors for each fashion product. Images from the Fashionista dataset [30] and Flipkart catalog images are labeled to create a large annotated dataset. This section introduces a theoretical overview of GANs [13]. We define the difference between Vanilla GAN and InfoGAN architectures depicted in Fig. 8.1 and highlight the strengths of the adversarial training process chosen for our recommendation task. 187 188 Generative adversarial networks for image-to-Image translation Fig. 8.1 Overview of GAN and InfoGAN architectures Lastly, we provide an intuitive overview of state-of-the-art architectures based on CNNs. 
8.2.1 Vanilla GAN GAN architecture is comprised of two distinct networks: generator and discriminator. These networks are trained simultaneously and perform adversarial training by playing a two-player minimax game. The generator generates new data samples, while the discriminator decides whether each sample it receives belongs to the training dataset. For an image generation task, the generator network receives a random vector z which is sampled from a known distribution and generates a new fake image. Fake data generated from generator network and real data taken from the real dataset are fed into the discriminator. The discriminator takes account of all the data fed into it and returns a probability of whether an image is real or fake. More formally, this learning process is given in the following steps: • The goal of generator G is to build mapping from a prior noise distribution where random noise z ℝZ, to a data space referred to as fake data G(z). • The goal of discriminator D is to estimate the probability of a sample coming from real data x ¼ {x1, …, xN } and fake data G(z), being real D(x) or fake D(G(z)). • The value function represents a two-player minimax game that tries to maximize its value with respect to D and minimize its value with respect to G, which is expressed by the following equation: min G max D VI ðD, GÞ ¼ xPdata ½ logDðxÞ + zPnoise ½ log ð1 DðGðzÞÞÞ: Here, Pdata and Pnoise indicate real data distribution and noise distribution, respectively. • While G captures the data distribution in the training set and tries to fool the D with minimization of zPnoise ½ log ð1 DðGðzÞÞÞ term, D network performs a binary classifier with the maximization of the both xPdata ½ log DðxÞ and zPnoise ½ log ð1 DðGðzÞÞÞ. 8.2.2 InfoGAN Information maximizing generative adversarial networks (InfoGAN) described in Ref. [31] controls the different attributes of the generated images, unlike the other vanilla GAN Visual similarity-based fashion recommendation system Fig. 8.2 Performance comparison of the state-of-the-art deeper CNNs. architectures that control the generated images a little or not. The InfoGAN, an information-theoretic extension of GANs, uses information theory concepts so that the noise term is transformed into latent code, which provides systemic and predictable control on the output. It learns how to decompose the input noise vector into two parts; a source of incompressible noise z and a latent code c. It is trained to maximize the mutual information between c and the output of generator (generated image) G(z, c). Fig. 8.1B depicts the architecture of InfoGAN. Information-regulated min-max objective of InfoGAN is formulated as follows by adding a constant regularization term with a hyperparameter λ: min max VI ðD, GÞ ¼ V ðD, GÞ λI ðc; Gðz, c ÞÞ G D Here, I(c; G(z, c)) is the mutual information between c and G(z, c). While Vanilla GAN formulation (1) uses a single unstructured noise vector z, the generator of InfoGAN takes the concatenated vector (z, c) where c represents structured semantic features of the data distribution [31]. For a given set of structured latent variables c1, c2, …, cL, the QLlatent code c denotes the concatenation of all latent variables ci, which calculated as i¼1P(ci). It also uses a neural network Q(c j x) that shares the same network structure with the discriminator D (except the last layer), by adding to the Vanilla GAN a negligible computation cost. 
The InfoGAN uses variational information maximization technique named as lower bounding mutual information, which reduces computationally complex of the mutual information calculation, to maximize the mutual information. The final objective function of InfoGAN with a variational lower bound LI(G, Q) of the mutual information under the condition LI(G, Q) I(c; G(z, c)) is defined as follows: 189 190 Generative adversarial networks for image-to-Image translation min max VInfoGAN ðD, G, QÞ ¼ V ðD, GÞ λLI ðG, QÞ G, Q D 8.2.3 CNN-based architectures This section gives a review of CNN and dives deeper to explore the top CNN architectures which have proven themselves at visual tasks including image classification, object detection, and semantic segmentation. CNN emerged in the 1990s when Yann LeCun et al. [32] put forward new neural network architecture for classification of handwritten digits. Although the first CNN, known as LeNet, recognized digits of zip codes effectively, it could not cope with more difficult and complex data. Nevertheless, the work of these and many other researchers led to the development of larger and deeper CNNs. With the ImageNet visual recognition competition, the latest known CNN architectures have emerged. In 2012, the best invention was AlexNet architecture, developed by Alex Krizhevsky and his colleagues [26] at the University of Toronto. The first popular use of deep learning in computer vision began with the AlexNet architecture, which has 60 million parameters (over 1000 times higher than that of LeNet [33]), five convolutional layers and three fully connected layers. Deeper CNNs, which appeared first on the issue of ImageNet classification, have been more efficient in coming up with solutions to classification problems. Accuracy comparison on ImageNet of most popular CNN architectures, also used in this study for quantitative comparison, are depicted in Fig. 8.2. These architectures are summarized below: • VGG: This architecture was developed by Simonyan et al. in 2014 [15], based on the notion that deeper networks are stronger networks. The overall number of trainable parameters (around 140 million) is over 2.3 times higher than that of AlexNet. On the other hand, smaller filters have been used when compared with AlexNet. This architecture uses fixed 3 3 kernel filters in all convolution layers. Two versions of this architecture are available, VGG16 and VGG19, each with the layers 16 and 19. Both of the networks have been built with five blocks followed by a max-pooling layer and the blocks contain sequential convolutional layers. • Inception: The architecture, known as GoogLeNet, was developed by Google researchers [34]. Although VGG architecture has better accuracy performance than AlexNet, it needs to use too much memory due to the number of parameters. On the other hand, Inception has been performed higher accuracy than AlexNet and VGG16 with fewer trainable parameters (about 5 million for InceptionV1 version). It has been built with inception blocks that contain convolutional layers with variable-sized filters of 1 1, 3 3, and 5 5. The architecture, continuously improving upon, has also different versions (InceptionV1, InceptionV2, InceptionV3, so on). • ResNet: Researchers have noticed in the previous architectures that adding layers to deep architectures, while performance up to one point increased, a drop declined rapidly after one point. 
This problem, known as the vanishing gradient, arose during Visual similarity-based fashion recommendation system network training with back propagation. More briefly, as the number of layer increases, gradient values decrease and approach zero. ResNet [35] solved the vanishing gradient problem by introducing a shortcut connection. The architecture was built with residual blocks consisting of convolutional layers and the shortcut connection that also named as skip connection that connects the first layer input to the last layer output. The shortcut connection also named as skip connection because the network with this connection can skip some of the convolutional layers. Different residual networks with different layers of 18, 34, 50, 101, and 152 have been proposed. The common feature of all ResNet architectures is that they have 7 7 convolutional layer followed by 3 3 max-pooling layer before residual blocks, and average pooling after the blocks (before fully connected output layer). • DenseNet: Like ResNet, DenseNet [36] also focused on solving the vanishing gradient problem with fewer parameters. DenseNet architecture inspired from ResNet was constructed of dense blocks consisting of convolutional layers. Unlike ResNet, there are direct connections from the one convolutional layer to all sublayers. An input of any layer is the concatenated of the feature maps generated by all preceding layers. The architecture has been built with multiple dense blocks and it uses the same layers of ResNet before and after dense blocks. There are various versions of the network with different number of dense blocks such as DenseNet121, DenseNet169, DenseNet201, and DenseNet264. • MobileNets: The main goal of this architecture [37] is to create light-weight and lowlatency models, which can be used on limited-memory or resource-limited devices. The architecture (MobileNetV2) was built with inverted residual blocks and linear bottlenecks, where the input and output of the residual blocks are the bottlenecks layers [38]. When compared with previous architectures trained on ImageNet dataset, MobileNet have higher inference time and smaller model size, but it has worse classification performance (see in Fig. 8.2). This performance is acceptable given its ability to near real-time work on mobile devices. There are different versions such as MobileNet V1, V2, and V3. The abovementioned architectures, which also have pretrained models, are still very popular today and are often used as benchmarks to compare new proposed architectures or make sure a new dataset is reliable. 8.3 Fashion recommendation system The system architecture depicted in Fig. 8.3 consists of three major modules: 1. A deep neural network model for generating effective image representations or vectors 2. Feature vector database for querying image similarity 3. Web interface for interacting with the system 191 192 Generative adversarial networks for image-to-Image translation Fig. 8.3 Overview of fashion recommendation system. Arguably, the most important part of any image retrieval system is a feature extraction module since the accuracy of the results depends on the quality of the extracted features. Feature extraction is basically the process of dimensionality reduction in which a given input is converted into more manageable values so that redundant information in the input are ignored and computational power required to process a large number of inputs is decreased. 
The quality end effectiveness of the extracted features leads to more successful learning and generalization steps. Extracted features from a given input are generally represented as feature vectors in which only the important or selected portions of the input are preserved. Almost all machine learning tasks require some form of feature extraction to deal with large datasets and hence several algorithms and approaches have been developed over the years. Sometimes experts use domain knowledge to extract important features from the data which is called feature engineering while several algorithms such as PCA (principal component analysis), Autoencoders, or LSA (latent semantic analysis) are used to extract important features. Traditionally, in image processing edge, corner or blob detection algorithms are used to extract features. However, deep neural networks have lately paved the way for automatic feature extraction approaches. DNNs such as CNNs have proved to capture the significant properties of the images and create vector representations for later use. With little or no preprocessing, the images are fed into a pretrained DNN and vector representation or embedding of the image is obtained. Several studies show that the embedding obtained from DNNs can capture semantic and inherent features in the images and thus similar images are represented with closer vectors. Although deep learning models such as CNNs give successful results for the general image similarity problem, it may not be possible to get sufficient quality results within the fashion domain. For instance, there are only a few general shoe types (bots, sneakers, oxfords, etc.) and products in one of these categories resemble each other. Therefore, Visual similarity-based fashion recommendation system image similarity approaches in such domains require finer grain resolutions for successful results. In this study, we use GANs for generating vector representations of images. Image similarity task is the process of retrieving similar images to the queried product image. In our case where we try to identify similar shoe images, similarity can be looked at three major properties: color, shape, and texture. The success of the similarity query must account for all three dimensions; hence we experimented with several GAN architectures and evaluated the results according to these three dimensions. The second module is the vector database which stores the vector representations of images and returns the similar image ids for similarity queries. With the widespread use of artificial intelligence models, the need to effectively query the vector representations of both text and image data has emerged. In the recommender systems, for example, each product or user is represented as a vector and for each vector, there needs to be a list of best recommendations generated. However, it can be quite expensive to generate such lists where the number of data is very large because traditional databases and algorithms are not suitable for storing and querying for similar vectors among hundreds of thousands or even millions of vectors. In recent years several similarity search libraries emerged such as FAISS, Annoy, and NMSLib. Most of these libraries generate a list of approximate nearest neighbors for a given query and employ several different types of indexes to perform this task effectively. 
FAISS (Facebook AI Similarity Search) [39] is a library developed by Facebook AI Research group which is used for efficient searching of dense vectors or document embeddings. It contains many useful algorithms for searching arbitrary sizes of vectors. We employ FAISS in our architecture as a similarity search engine. FAISS implements some of the search algorithms on GPU and is extremely scalable in comparison with the traditional SQL database engines. FAISS is best utilized with documents are represented as vectors and are identified by an integer id. Image embeddings or representation vectors are extracted using the trained model and are inserted into the FAISS index. FAISS can compare the vectors using L2 (Euclidean) distances, dot products or cosine similarity. When a query image is presented the system creates the vector embedding of the image and queries the FAISS index for similar images. FAISS returns the lowest L2 distances (or the highest dot product) with the query vector [39]. In our system, we first build a FAISS index with XXX shoe image vectors. When a query image is presented, the system first generates a vector of the image using the GAN model and queries the FAISS index for closest vectors. The FAISS library returns a list of image ids of which the L2 distances are smallest to the query vector. The third and last module is the web interface which is used to interact with the system. We demonstrate selected shoe images and similar images. Successful deep learning models require large amount of data for training. It is imperative to provide a sufficient amount of data so that after enough training the DNN can capture the essence of the domain and visually understand what is in it and what is not. 193 194 Generative adversarial networks for image-to-Image translation For a request image, the image is passed into the trained model and feature vector of the image is extracted. Similarity score is computed across all of the images stored in feature vector DB. The best similar images that have the lowest similarity score are ranked for the recommendation. Our goal is to achieve the optimal model as the inference time has to be fast. 8.3.1 Deep network architectures We make an effort to build a recommender system based on visual similarity in this chapter. Firstly, we explore the best model, which learns the best visual features of fashion items. The best model in this study refers to the optimum neural network, which gives high accuracy with fast inference time. Since the various visual features such as colors, edges, corners, and other different patterns in the model are learned during the training process, deep learning is also referred to as feature learning. We present our feature learning experiments of following neural networks under this section. 8.3.1.1 Proposed network The proposed architecture of discriminator and generator networks inspired by InfoGAN [31] is defined in Table 8.1. The discriminator model D takes an image with tree color channels and 207 207 pixel in size and outputs a binary prediction as fake or real. Instead of binary prediction, we use the discriminator to extract a feature vector with a size of 1 12,544 after saving the trained discriminator model. Generator model G input is a concatenated 108dimensional vector consisting of noise variable (100) and latent code (8) representing class information. We conduct unsupervised learning which we use no labeled data. 
We assume that our data consists essentially of eight different classes (heels, sandals, sports, boots, high boots, loafers, slippers, and flats), so the latent code value is set to eight. We change the latent code value to nine for adding handbag to further extend. We use batch normalization [40] in the models to stabilize the training. We also apply a regularization layer (dropout [41]) after all convolutional layers to solve the overfitting problem and memorization limitation. For D (Discriminator), all convolution layers use Leaky ReLU (LReLU) and the output layer using sigmoid activation is used to get the prediction score of the images over two classes as real (class ¼ 1) and fake (class ¼ 0). For G (generator), all transposed convolutional layers use ReLU and tanh activation function is used in the last layer, which is described in Ref. [16]. Similar to the InfoGAN, D and Q share the same network structure using convolutional layers except for the fully connected output layers. At the output layer, Q uses tanh activation function. The discriminator and generator loss curves are depicted in Fig. 8.4. From the patterns in the loss curves, it can be seen that both discriminator loss and generator loss decrease up to about 2500th iteration. After about this iteration point, generator loss is increasing rapidly and discriminator losses are dropping, which mean that discriminator is getting too strong to distinguish the real and fake samples and generator is not able to generate better Visual similarity-based fashion recommendation system Table 8.1 The networks used for training the shoe and handbag dataset. Discriminator Model D/Recognition Network Q Input 207x207 Color image 3x3 conv2d. 16 LReLU. stride 2. Dropout (.5) 3x3 conv2d. 32 LReLU. stride 1. batchnorm Dropout (.5) 3x3 conv2d. 64 LReLU. stride 2. batchnorm Dropout (.5) 3x3 conv2d. 128 LReLU. stride 1. Batchnorm Dropout(.5) 3x3 conv2d. 256 LReLU. stride 2. Batchnorm Dropout(.5) 3x3 conv2d. 16 LReLU. stride 1. batchnorm FC. 1 sigmoid for D (output layer for D) FC. 8 Tanh for Q (output layer for Q) Generator Model G Input ℝ108 FC. 4x6x256 3x3 conv2d_transpose. 128 ReLU. stride 1. batchnorm Dropout (.6) 3x3 conv2d_transpose. 64 ReLU. stride 1. batchnorm Dropout (.6) 3x3 conv2d_transpose. 32 ReLU. stride 1. batchnorm Dropout(.6) 3x3 conv2d_transpose. 16 ReLU. stride 1. batchnorm Dropout(.6) 3x3 conv2d_transpose. 16 ReLU. stride 1. batchnorm Dropout(.6) 3x3 conv2d_transpose. 3 Tanh. stride 1. Dropout(.6) Fig. 8.4 Training results of the proposed network. 195 196 Generative adversarial networks for image-to-Image translation samples. We accept the 2000–2500 intervals are the ideal checkpoints for our network and the generator samples at this iterations confirm that the model has learned the shoe features well. We can observe from the Fig. 8.4 that the shoe samples generated by the proposed network have some basic shoe features such as color and class patterns including heels, sandals, sports, boots, high boots, loafers, slippers, and flats. The model also overcomes the image quality and diversity barriers, which the GAN models often suffer the lack of the diversity of generated images. 8.3.1.2 State-of-the-art CNNs In this study, we retrain 12 pretrained models trained for the ImageNet classification (showed in Fig. 8.2) to use as feature extractor by removing the last (dense) layer. We remove the dense layer that has 1000 labels (classes) and add dropout and a new dense layer consisting of one label that represents shoe class. 
We leverage learning deep features from shoe images and extract the feature vectors of these new models trained on general shoe domain instead of the shoe classification task. In a nutshell, we transfer the pretrained features into shoe domain by using the power of transfer learning. Feature vectors containing the new visual features belong to shoe domain are used for computing a distance metric between similar shoe items. The process of extraction feature vectors for a given input is called inference. While the light-weight models have low inference time, the heavy models with large number of parameters are expensive for inference. The inference time has to be fast because the high inference time leads to negative user experience. Inference time or the response speed of the trained network is as important as the accuracy. 8.4 Experiments and results To measure the performance of the aforementioned models and the overall architecture we have conducted several experiments. AI models created and employed in this study are developed using Tensorflow framework. We use a public standard benchmark dataset for measuring the performance of the models. 8.4.1 Experimental setup The training experiments and performance tests have been conducted on a server that has 24-core Intel Xeon E5-2628L CPU, 256 GB RAM which runs Ubuntu Server 16.04 OS. 8 NVidia GTX 1080-Ti GPUs on the server have been used for training the models. The baseline framework is TensorFlow for all model experiments. The shoe dataset used for this study is collected from Turkish e-commerce sites: https://www.flo.com.tr, https://www.trendyol.com, and https://www.n11.com. We scraped the handbag data from various web sites: https://www.flo.com.tr, https://www.amazon.com, https:// www.hepsiburada.com.tr, https://www.boyner.com.tr, https://www.trendyol.com, Visual similarity-based fashion recommendation system https://www.ayakkabidunyasi.com.tr, https://www.morhipo.com, and https://www. n11.com. The overall training dataset consists of 156,896 shoe images and 130,540 handbag images. We conducted the performance tests of all models on 10,000 randomly selected shoe images from UT-Zap50K benchmark dataset. 8.4.2 Comparative results Fig. 8.5 depicts the test results of all models on UT-Zap50K benchmark dataset. For each network architecture, Fig. 8.5A shows the precision results and Fig. 8.5B shows the inference times. It is known that the performance results of unsupervised learning hard to evaluate, hence no universally agreed performance metrics are available for visual recommendation applications. However, the return of similarity results from irrelevant classes for a query product of a particular class indicates that the model is working poorly. For example, when the customer clicks on a product belonging to the heel class, it is expected that similar products from the heel class will be recommended. Therefore, we firstly compare the models with the standard precision metric which is formulated as follows: Precision ¼ #of relevant items retrieved #of retrieved items Precision values for each model are calculated using eight classes in the UT-Zap50K dataset. For a given query image Zap50K class information is used as the ground truth. If the retrieved image is of the same class then the result is counted as relevant and marked irrelevant if otherwise. Performance results shown in Fig. 8.5B shows that inference time increases linearly with the depth of the neural network. 
Large number of parameters (weights) makes the networks memory-inefficient and computationally expensive. Inference time for our proposed model is around 0.04 s which is significantly shorter than other pretrained models. Our model also provides higher precision rates. The proposed model performs the best in terms of precision and inference time among all models tested in this study. Fig. 8.6 shows the results of all models tested in this study for a sample shoe image. It can be observed that all versions of DenseNet model have returned similarity results in irrelevant classes for a query image given in the type of sneakers. Sample visual recommendation results with similarity scores have been displayed in Fig. 8.7 (woman shoes) and Fig. 8.8 (woman handbags). 8.4.3 Web interface for visual inspection Success of image retrieval systems is hard to measure due to the fact that the concept of image similarity is highly subjective. We have created a web interface for visual inspection of similarity results as another performance evaluation. We have asked human 197 Fig. 8.5 Performance comparison for the models used in this study: (left) precision results and (right) inference time per model. Visual similarity-based fashion recommendation system Fig. 8.6 Visual similarity search results for the proposed model and other pretrained models. Fig. 8.7 Sample visual similarity results of the proposed model on unseen shoe images retrieved from beymen.com (Query image and top-5 similar images—the results taken from 7723 indexed images). 199 200 Generative adversarial networks for image-to-Image translation Fig. 8.8 Sample visual similarity results of the proposed model on unseen handbag images retrieved from beymen.com (Query image and top-5 similar images—the results taken from 3507 indexed images). annotators to select the best results among alternative results. Fig. 8.9 shows the annotation interface. The annotator clicks a shoe image on the left and results of different images are shown on the right. Then the annotator selects the rows which he/she thinks contain the most similar images. We use this information to determine which model performs the best in terms of actual human feedback. 8.5 Conclusion and future works In this chapter, we outlined our work visual similarity-based fashion recommendation system. This chapter is extended from our DeepML-2019 submission [42]. The system consists of a GAN-based image retrieval module and a high-performance image feature search library. We have collected a large set of shoe images from e-commerce to train the GAN and used the model we obtained from this training to create a CBIR system. We also created another set of shoe image from https://www.flo.com.tr which is used to create a web interface to demonstrate the results of our GAN model along with other models we tested in this study. The experimental results show that the proposed model achieved superior performance in terms of precision and query time. The results also show that the system can be used in real-world e-commerce platforms as well. Based on the findings in this study, we plan to build an end-to-end online fashion recommendation system for e-commerce sites. The system will contain recommendation support for various fashion items such as shoes, clothes, and accessories. We also plan to explore other GAN architectures for generating successful image representation for different fashion categories. Fig. 
8.9 Web interface checked human annotators to selection of the best model: (A) VGG19, (B) MobileNetV2, (C) ResNet152, (D) proposed model (modified InfoGAN). 202 Generative adversarial networks for image-to-Image translation References [1] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain, Content-based image retrieval at the end of the early years, IEEE Trans. Pattern Anal. Mach. Intell. 22 (12) (2000) 1349–1380. [2] V.N. Gudivada, V.V. Raghavan, Content-based image retrieval systems, Computer 28 (9) (1995) 18–22. [3] E. Simo-Serra, H. Ishikawa, Fashion style in 128 floats: joint ranking and classification using weak data for feature extraction, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016. [4] G. Scarpa, M. Gargiulo, A. Mazza, R. Gaetano, A CNN-based fusion method for feature extraction from sentinel data, Remote Sens. 10 (2) (2018) 236. [5] D. Weimer, B. Scholz-Reiter, M. Shpitalni, Design of deep convolutional neural network architectures for automated feature extraction in industrial inspection, CIRP Ann. Manuf. Technol. 65 (2016) 417–420. [6] W. Zhao, S. Du, Spectral-spatial feature extraction for hyperspectral image classification: a dimension reduction and deep learning approach, IEEE Trans. Geosci. Remote Sens. 54 (8) (2016) 4544–4554. [7] I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’14), vol. 2, MIT Press, Cambridge, MA, 2014, pp. 2672–2680. [8] J.Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017. [9] C. Ledig, et al., Photo-realistic single image super-resolution using a generative adversarial network, in: Proceedings—30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017. [10] H. Zhang, et al., StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017. [11] T. Kim, M. Cha, H. Kim, J.K. Lee, J. Kim, Learning to discover cross-domain relations with generative adversarial networks, in: 34th International Conference on Machine Learning, ICML 2017, 2017. [12] T. Li, et al., Beautygan: instance-level facial makeup transfer with deep generative adversarial network, in: MM 2018—Proceedings of the 2018 ACM Multimedia Conference, 2018. [13] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, A.A. Bharath, Generative adversarial networks: an overview, IEEE Signal Process. Mag. 35 (1) (2018) 53–65. [14] X. Hou, K. Sun, G. Qiu, Deep feature similarity for generative adversarial networks, in: Proceedings— 4th Asian Conference on Pattern Recognition, ACPR 2017, 2018. [15] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv Prepr. arXiv1409.1556(2014). [16] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, in: 4th International Conference on Learning Representations, ICLR 2016—Conference Track Proceedings, 2016. [17] X. Hou, L. Shen, K. Sun, G. Qiu, Deep feature consistent variational autoencoder, in: Proceedings— 2017 IEEE Winter Conference on Applications of Computer Vision, WACV 2017, 2017. [18] A. Yu, K. 
Grauman, Fine-grained visual comparisons with local learning, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2014. [19] J. Fu, J. Wang, Z. Li, M. Xu, H. Lu, Efficient clothing retrieval with semantic-preserving visual phrases, in: Asian Conference on Computer Vision, Springer, Berlin, Heidelberg, pp. 420–431. [20] Q. Liu, S. Wu, L. Wang, Deepstyle: learning user preferences for visual recommendation, in: SIGIR 2017—Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2017. [21] X. Wang, T. Zhang, Clothes search in consumer photos via color matching and attribute learning, in: MM’11—Proceedings of the 2011 ACM Multimedia Conference and Co-Located Workshops, 2011. [22] Z. Zhou, Y. Xu, J. Zhou, L. Zhang, Interactive image search for clothing recommendation, in: MM 2016—Proceedings of the 2016 ACM Multimedia Conference, 2016. Visual similarity-based fashion recommendation system [23] S. Liu, Z. Song, G. Liu, C. Xu, H. Lu, S. Yan, Street-to-shop: cross-scenario clothing retrieval via parts alignment and auxiliary set, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2012. [24] Z. Feng, Z. Yu, Y. Yang, Y. Jing, J. Jiang, M. Song, Interpretable partitioned embedding for customized multi-item fashion outfit composition, in: ICMR 2018—Proceedings of the 2018 ACM International Conference on Multimedia Retrieval, 2018. [25] Y. Jing, et al., Visual search at pinterest, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015. [26] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, Adv. Neural Inform. Process. Syst. 25 (2012) 1097–1105. [27] M.H. Kiapour, X. Han, S. Lazebnik, A.C. Berg, T.L. Berg, Where to buy it: matching street clothing photos in online shops, in: Proceedings of the IEEE International Conference on Computer Vision, 2015. [28] J. Huang, R. Feris, Q. Chen, S. Yan, Cross-domain image retrieval with a dual attribute-aware ranking network, in: Proceedings of the IEEE International Conference on Computer Vision, 2015. [29] D. Shankar, S. Narumanchi, H.A. Ananya, P. Kompalli, K. Chaudhury, Deep learning based large scale visual recommendation and search for e-commerce, arXiv Prepr. arXiv1703.02344(2017). [30] K. Yamaguchi, M.H. Kiapour, L.E. Ortiz, T.L. Berg, Parsing clothing in fashion photographs, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2012. [31] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, P. Abbeel, InfoGAN: interpretable representation learning by information maximizing generative adversarial nets, in: Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS’16), Curran Associates Inc., Red Hook, NY, 2016, pp. 2180–2188 [32] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, Handwritten digit recognition with a back-propagation network, in: Proceedings of the 2nd International Conference on Neural Information Processing Systems (NIPS’89), MIT Press, Cambridge, MA, 1989, pp. 396–404. [33] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proc. IEEE 86 (11) (1998) 2278–2324, https://doi.org/10.1109/5.726791. [34] C. 
Szegedy, et al., Going deeper with convolutions, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2015. [35] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016. [36] G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected convolutional networks, in: Proceedings—30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017. [37] A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: efficient convolutional neural networks for mobile vision applications Andrew, Rep. Pract. Oncol. Radiother. (2017) arXiv preprint arXiv:1704.04861. [38] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.C. Chen, MobileNetV2: inverted residuals and linear bottlenecks, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018. [39] J. Johnson, M. Douze, H. Jegou, Billion-scale similarity search with GPUs. IEEE Trans. Big Data (2019), https://doi.org/10.1109/TBDATA.2019.2921572. [40] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, in: 32nd International Conference on Machine Learning, ICML 2015, 2015. [41] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (1) (2014) 1929–1958. [42] B. Ay, G. Aydın, Z. Koyun, M. Demir, A visual similarity recommendation system using generative adversarial networks, in: 2019 International Conference on Deep Learning and Machine Learning in Emerging Applications (Deep-ML), 2019, pp. 44–48. 203 CHAPTER 9 Deep learning-based vegetation index estimation Patricia L. Suáreza, Angel D. Sappaa,b, and Boris X. Vintimillaa a ESPOL Polytechnic University, CIDIS-FIEC, Guayaquil, Ecuador Computer Vision Center, Edifici O, Campus UAB, Bellaterra, Barcelona, Spain b 9.1 Introduction Computer vision applications can be found in almost every domain, including topics such as medical imaging, gaming, video surveillance, multimedia, industrial applications, and remote sensing, just to mention a few. In most of the cases, these applications are based on images obtained from cameras working at the visible spectrum. There are some cases, in particular in medical imaging and remote sensing, where cross-spectral and multispectral images are considered. The appealing factor of using images from different spectral bands lies on the one hand on the possibility to obtain information that cannot be seen at the visible spectrum; on the other hand, on the combined use of information that can be considered to generate some kind of high-level reasoning; for instance, in remote sensing the combined use of images from different spectral bands is considered to generate vegetation indexes (VIs). These VIs are used to determine the health and strength of vegetation and their definitions involve several factors, such as soil reflectance, vegetation density, etc. All this information would help to increase the yield of crops [1, 2]. The obtained information is used for monitoring and evaluating the Earth’s vegetative cover using several factors, such as soil reflectance, atmosphere, vegetation density, etc., with the aim to obtain those formulas that get more reliable information about vegetation based on remotely sensed values. 
The usual form of a VI is a ratio of reflectance measured in two bands, or their algebraic combination. Spectral ranges (bands) to be used in VI calculation are selected depending on the spectral properties of plants. Lately, techniques based on sensors sensitive to multiple spectra have been implemented to perform remote sensing to evaluate the biophysical variables of vegetation in both forestry and agriculture [3, 4]. Furthermore, Panda et al. [5] proposed a method for processing high-end images in order to determine the importance of spectral VIs in the field of agricultural crop yield prediction using a neural network. In Ref. [6], the authors proposed to analyze the climatological phenomena that affect the local climate. According to their theory, these phenomena have a direct effect on crop yield.

The index could be computed using several spectral bands that are sensitive to plant biomass and health. For instance, it is known that healthy vegetation reflects light strongly in the near-infrared band and less strongly in the visible portion of the spectrum. Thus, the relation between the light reflected in the near-infrared and in the visible spectrum is generally used to detect areas that potentially have healthy vegetation. Among the different indexes proposed in the literature, the Normalized Difference Vegetation Index (NDVI) is the most widely used [7]; NDVI is often used to monitor drought, forecast agricultural production, and assist in forecasting fire zones and mapping desert expansion [8]. NDVI is preferable for global vegetation monitoring since it helps to compensate for changes in lighting conditions, surface slope exposure, and other external factors. In general, it is used to determine the condition, developmental stages, and biomass of cultivated plants and to forecast their yields. This index is calculated as the ratio between the difference and the sum of the reflectance in the NIR and red regions:

\mathrm{NDVI} = \frac{R_{NIR} - R_{RED}}{R_{NIR} + R_{RED}},   (9.1)

where R_{NIR} is the reflectance of NIR radiation and R_{RED} is the reflectance of visible red radiation. This index takes values from −1.0 to 1.0, basically representing greenness, where negative values mainly correspond to clouds, water, and snow, and values close to 0 primarily correspond to rocks and bare soil. Very small values (0.1 or less) of the NDVI correspond to empty areas of rocks, sand, or snow. Moderate values (from 0.2 to 0.3) represent shrubs and meadows, while large values (from 0.6 to 0.8) indicate temperate and tropical forests [9, 10].
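To make Eq. (9.1) concrete, the following minimal NumPy sketch computes the NDVI from co-registered NIR and red reflectance arrays; the function name, the sample values, and the small epsilon added to the denominator are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Compute the Normalized Difference Vegetation Index of Eq. (9.1).

    Both inputs are reflectance maps of the same scene, assumed to be
    already registered to a common reference system; `eps` is an
    illustrative safeguard against division by zero over water or
    shadowed pixels.
    """
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    return (nir - red) / (nir + red + eps)

# Example: values close to 1 indicate dense, healthy vegetation.
nir_band = np.array([[0.60, 0.55], [0.20, 0.05]])
red_band = np.array([[0.10, 0.12], [0.18, 0.04]])
print(ndvi(nir_band, red_band))
```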
Proposals that use images of several spectra, whether cross-spectral or multispectral, depend on the use of multiple sensors. In the case of VIs such as NDVI, it is required to have images of the visible spectrum and the near-infrared spectrum of the same scene, acquired by different cameras at the same time. These images are required to calculate the values of Eq. (9.1). It should be noted that, before calculating Eq. (9.1), the images must be accurately registered, that is, the information must be referred to the same reference system. Since the images of different spectra can appear quite different, the challenge is to find the same reference points in the images of both spectra [11]. Recently, techniques based on convolutional networks have been proposed to solve this problem and find correspondences in cross-spectral domains [12, 13]. With the correlated information, the images can be registered to a single reference system.

In Ref. [14], the authors proposed to use the NDVI to measure the changes in the ecosystem over a given interval of time. The changes in the index values allow us to infer how the climate impacts the health of the crops. With this method, the impact of climate change can be determined and controlled planning can be managed, focusing efforts on the most affected areas. This is valuable in determining effective and smart reforestation plans.

In this chapter, a novel approach to perform an image-to-image translation is proposed, in which the NDVI is estimated using a synthetic NIR image. The proposed model is able to use unpaired data to estimate a synthetic NIR image just from a grayscale image using a CycleGAN. A similar technique has been recently presented in Ref. [15], where an NDVI is generated from a near-infrared (NIR) image, and also in Ref. [16], where the VI is estimated just from a single image of the visible spectrum. Although interesting results have been obtained, the weak point of these approaches lies in the need for NIR images, which are not as common as visible spectrum images. In other words, these approaches depend on paired samples for the training process. The solution proposed in the current chapter consists of a model where the index is estimated with an unpaired learning-based approach, in which a cycled generative adversarial network (CycleGAN) [17] is trained with a large data set. In the proposed approach, an unsupervised learning model takes a set of unpaired images as input, one from the visible spectrum and the other corresponding to an NIR image; each one is fed into a CycleGAN to perform the image domain translation. Additionally, a multiple loss function is used to obtain a better optimization of the model, and a residual network (ResNet) architecture is used to go deeper without degradation in accuracy and error rate.

The chapter is organized as follows. Section 9.2 presents works related to the NDVI problem, as well as the basic concepts and notation of GAN and CycleGAN networks. The proposed approach is detailed in Section 9.3. The experimental results with a set of real images are presented in Section 9.4. Finally, the conclusions are given in Section 9.5.

9.2 Related work

Solutions based on computer vision to tackle problems related to precision agriculture have been widely used. This technology enables better identification, analysis, and management of temporal and spatial in-field variability. Nowadays, with NIR sensors, all the captured crop information can be filed to obtain yearly statistical information, trying to predict the health of future plantations for better crop productivity. Many computer vision techniques have evolved to offer solutions for these kinds of agricultural prediction problems; these methods range from mathematical and statistical models to deep learning neural networks. This section reviews works related to VI estimation, covering classical approaches as well as convolutional neural network (CNN)-based approaches.
207 208 Generative adversarial networks for image-to-Image translation 9.2.1 Vegetation index: Formulations and applications In this section, agricultural approaches focused on the use of the NDVI to perform adequate control of crop production using the index information to monitor plant health at each stage of their growth are reviewed. In Ref. [18], the authors proposed to use SAR images to estimate missing spectral features through data fusion and deep learning, exploiting both temporal and cross-sensor dependencies on Sentinel-1 and Sentinel-2 time series, in order to obtain the NDVI. Another approach is presented in Ref. [19]; the authors propose a technique to predict the vegetation dynamics behavior using Moderate Resolution Imaging Spectroradiometer (MODIS) NDVI time series data sets and long short term memory model network, an advanced technique adapted from the artificial neural network. Another approach, presented by Ulsig et al. [20], introduces an automated technique to detect and count individual palm trees from UAV using a combination of spectral and spatial analyses. The proposed approach comprises a step that discriminates the vegetation from the surrounding objects by applying the normalized difference VI and another step used to detect individual palm trees using a combination of circular Hough transform (CHT) and the morphological operators. Damian et al. [21] propose to use the information obtained from the normalized difference VI using satellite images to increase the productivity improving the task of delimiting management zones for annual crops. For this research three crop productivity maps, from 2009 to 2015, were used for each area of analysis, developing a descriptive and geostatistical case study. According to Ulsig et al. [22], long-term observations of vegetation phenology can be used to monitor the response of terrestrial ecosystems to climate change. They propose a method for observing phenological events by analyzing time series of VIs such as the normalized difference VI to investigate the potential of a Photochemical Reflection Index (PRI) to improve the accuracy of MODIS-based phenological estimates in an evergreen coniferous forest. The results suggest that PRI can serve as an effective indicator of spring seasonal transitions, and confirm the usefulness of MODIS PRI for detecting phenology. In addition, Li et al. present a study to evaluate the economic benefits of greening programs (e.g., planting urban trees, adding or enhancing parks, providing incentives for green roofs) using low-cost NDVI data from satellite imagery, using the spatial lag-Tobit models [23], which predict tree canopy cover from NDVI. In another research [24], the authors focus on temporal NDVI and surface temperature, the methodology used altogether for the assessment of resolution dynamic Urban Heat Island (UHI) change on environmental condition with different environmental conditions, geographical locations, and demography. The research demonstrates the correlation between temporal NDVI and surface temperature exemplified with a case study conducted over two different regions, geographically as well as economically. In Ref. [25], the authors present a Deep learning-based vegetation index estimation method to reconstruct NDVI time series datasets for monitoring long-term changes in terrestrial vegetation. This temporal-spatial iteration (TSI) method was developed to estimate the NDVIs of contaminated pixels, based on reliable data. 
The TSI method will be most applicable when large numbers of contaminated pixels exist. Also, in Ref. [26], the authors present a method to analyze the use of the NDVI to evaluate crop yields, using a multispectral sensor mounted on a UAV with the objective of predicting biomass variations and grain production. In another work, presented by Taghizadeh et al. [27], the authors propose an approach to extract the phenological parameters based on time series of the NDVI, and these variables are used with crop rotation to predict the organic carbon content of the surface layer. Also in Ref. [28], the authors present a high-throughput phenotyping platform to dynamically monitor NDVI during the growing season for the contrasting wheat crops. The high-throughput phenotyping platform captured the variation of NDVI among crops and treatments (i.e., irrigation, nitrogen, and sowing). The high-throughput phenotyping platform can be used in agronomy, physiology, and breeding to explore the complex interaction of genotype, environment, and management of the soil in a farmland area. Additionally, in Ref. [29], the authors illustrate how the normalized difference VI, leaf area index (LAI), and fractional vegetation cover are related to each other, using a simple radiative transfer model with vegetation, soil, and atmospheric components. Another approach [30] presents a local modeling technique to estimate regression models with spatially varying relationships, using geographically weighted regression (GWR), to investigate the spatially nonstationary relationships between NDVI and climatic factors at multiple scales in northern China. The results indicate that all GWR models with appropriate bandwidth represented significant improvements in model performance over the ordinary least-squares (OLS) models. The results revealed that the ecogeographical transition zone and the GWR model can improve the model ability to address spatial, nonstationary, and scale-dependent problems in landscape ecology. 9.2.2 Deep learning-based approaches Deep learning models have obtained state-of-the-art results on some computer vision complex problems. Nevertheless, there are many challenging problems in agricultural pending to be solved and deep learning approaches are the most likely to be used, obviating the need for a pipeline of specialized and hand-crafted methods used before. Some researchers have proposed deep learning-based approaches for remote sensing and agricultural applications. In Ref. [31], the authors propose to use SAR images to estimate missing spectral features through data fusion and deep learning, exploiting both temporal and cross-sensor dependencies on Sentinel-1 and Sentinel-2 time series, in order to obtain the normalized difference VI. 209 210 Generative adversarial networks for image-to-Image translation Huang et al. [32] proposed a novel method for effective and efficient topographic shadow detection for the images obtained from Sentinel-2A multispectral imager (MSI) by combining both the spectral and spatial information. This method uses a CNN, operating directly on indexes input due to its remarkable classification performance, exploiting the spatial contextual information and spectral features for effective topographic extraction. In addition, in Ref. [33], a decision-level fusion approach is proposed with a simpler architecture for the task of dense semantic labeling. 
This method first obtains two initial probabilistic labelings resulting from a fully convolutional neural network and a simple classifier (for example, logistic regression), exploiting spectral channels and LiDAR data, respectively. A conditional random field (CRF) inference then estimates the final dense semantic labeling results. In Ref. [34], the authors present a methodology to predict the NDVI by training a crop growth model with historical data. Although they use a very simple soybean growth model, the methodology could be extended to other crops and more complex models.

All the approaches presented earlier are just a selection of recent publications where the usefulness of VIs, in particular the NDVI, can be appreciated. Unfortunately, to compute the NDVI, registered images from different spectra (i.e., visible and NIR) are needed, which is sometimes a challenging task since they may look different. So, the problem is how to find the same set of features in both spectra (e.g., points [11]) to be used as a reference for the registration process. Recently, some deep learning-based approaches have been proposed to overcome this problem and to obtain correspondences in cross-spectral domains (e.g., [13, 35]). Once correspondences are obtained, the image registration can proceed by mapping both images to a single reference system; then VIs can be easily computed. As mentioned in the previous section, some approaches for estimating NDVI have recently been proposed (e.g., [15, 16]) implementing GAN networks using NIR or RGB images; both approaches depend on the existence of accurately registered images. Having in mind the registration drawback involved in estimating the NDVI, and to overcome this problem, in the current work an unsupervised learning model is proposed (a CycleGAN architecture). The model is trained with a set of unpaired images (grayscale and NDVI images) under an unsupervised scheme.

To understand generative adversarial networks (GANs), a summary is given here. GANs are powerful and flexible tools quite useful in several computer vision problems; one of their most common applications is image generation. Fig. 9.1 depicts this architecture.

Fig. 9.1 Illustration of a generative adversarial network.

In the GAN framework [36], generative models are estimated via an adversarial process in which two models are trained simultaneously: (i) a generative model G that captures the data distribution, and (ii) a discriminative model D that estimates the probability that a sample came from the training data rather than from G. The training procedure for G is to maximize the probability of D making a mistake. In this architecture, it is possible to apply certain conditions to improve the learning process. According to Ref. [37], to learn the generator's distribution pg over data x, the generator builds a mapping function from a prior noise distribution pz(z) to a data space G(z; θg), and the discriminator, D(x; θd), outputs a single scalar representing the probability that x came from the training data rather than from pg.
G and D are both trained simultaneously: the parameters of G are adjusted to minimize log(1 − D(G(z))), while those of D are adjusted to maximize log D(x), following the value function V(D, G):

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))].   (9.2)

GANs can be extended to a conditional model if both the generator and discriminator are conditioned on some extra information y (see Fig. 9.2). This information could be any kind of auxiliary information, such as class labels or data from other modalities. The conditioning can be performed by feeding y into both the discriminator and the generator as an additional input layer. The objective function of the two-player minimax game would then be

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x|y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z|y)))].   (9.3)

Fig. 9.2 Illustration of a conditional generative adversarial network.

The discriminator performs a binary classification including the extra information fed to the network; as a result, the discriminator and generator will obtain more accurate gradients. Conditional GANs enhance the stability of the model, but this affects the learning of the semantic characteristics of the image samples.
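The minimax objectives of Eqs. (9.2) and (9.3) can be sketched as follows. This is a hedged illustration assuming PyTorch modules D and G whose outputs are probabilities; the optional label y implements the conditioning of Eq. (9.3), and the function and tensor names are assumptions.

```python
import torch

def gan_losses(D, G, x_real, z, y=None):
    """Losses derived from the minimax game of Eqs. (9.2)/(9.3).

    D and G are assumed to be torch.nn.Module instances returning
    probabilities; when the label tensor `y` is given, it is passed to
    both networks, which is the conditioning scheme of Eq. (9.3).
    """
    x_fake = G(z, y) if y is not None else G(z)
    d_real = D(x_real, y) if y is not None else D(x_real)
    d_fake = D(x_fake.detach(), y) if y is not None else D(x_fake.detach())

    # Discriminator: ascend log D(x) + log(1 - D(G(z))), i.e. minimize the negative.
    d_loss = -(torch.log(d_real + 1e-8).mean()
               + torch.log(1.0 - d_fake + 1e-8).mean())

    # Generator: descend log(1 - D(G(z))), as written in Eq. (9.2).
    d_fake_for_g = D(x_fake, y) if y is not None else D(x_fake)
    g_loss = torch.log(1.0 - d_fake_for_g + 1e-8).mean()
    return d_loss, g_loss
```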
9.3 Proposed approach

This work proposes to estimate the NDVI using a synthetic NIR image generated just from a single image of the visible spectrum by means of a CycleGAN. The architecture used in this approach is based on the one presented in Ref. [17], a previous work on unpaired image-to-image translation through a CycleGAN. This type of network permits domain style transfer, which is a convenient method for image-to-image translation problems because it is not necessary to have a set of input images that capture the scene at the same time and place in different spectra. Obtaining such a set of images could be time consuming and quite difficult, depending on the domains the image data set is being translated between. In Ref. [38], the authors present a general-purpose image-to-image translation model trained in a supervised manner using conditional adversarial networks; these networks not only learn the mapping from an input image to an output image but also learn a loss function to train this mapping. Before presenting the proposed approach, a brief description of CycleGAN is given.

9.3.1 Cycle generative adversarial networks

Image-to-image translation is the process of transforming an image from one domain to another, where the goal is to learn the mapping between an input image and an output image. This task has generally been performed by using a training set of aligned image pairs. However, for many tasks, paired training data are not available, and preparing them often requires a lot of work from specialized personnel to obtain thousands of paired images, especially for complex image translations. CycleGAN is an architecture that addresses this problem because it learns to perform image translations without explicit pairs of images; no one-to-one image pairs are required (see Fig. 9.3 for the corresponding scheme). CycleGAN will learn to perform style transfer from the two sets even though the individual images have vastly different compositions. According to Zhu et al. [17], the CycleGAN presents an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples (see Fig. 9.4 for a description of domain translation with paired samples, left, and unpaired samples, right); in our case, a translation of unpaired images is used. Thus, the goal is to learn a mapping G: X → Y such that the distribution of images from G(X) is indistinguishable from the distribution of Y using an adversarial loss.

Fig. 9.3 Cycle generative adversarial network, original scheme proposed in Ref. [17].
Fig. 9.4 (Left) Supervised training (paired data). (Right) Unsupervised training (unpaired data).

Because this mapping is highly under-constrained, an inverse mapping F: Y → X is necessary, together with a cycle consistency loss that enforces F(G(X)) ≈ X (and vice versa). The model includes two mapping functions, G: X → Y and F: Y → X. In addition, it introduces two adversarial discriminators Dx and Dy, where Dx aims to distinguish between images x and translated images F(y); in the same way, Dy aims to discriminate between y and G(x). Besides, the proposed approach includes two types of loss terms: adversarial losses [36] for matching the distribution of generated images to the data distribution of real images in the target domain, and a cycle consistency loss to prevent the learned mappings G and F from contradicting each other.
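The cycle consistency term described above can be written as a short sketch; the two mappings G and F and the L1 formulation follow Ref. [17], while the function signature and the PyTorch framing are assumptions.

```python
import torch

def cycle_consistency_loss(G, F, x, y):
    """Cycle consistency term of CycleGAN (Ref. [17]) as an L1 penalty.

    G maps domain X to Y and F maps Y back to X; both are assumed to be
    torch.nn.Module generators. The forward cycle requires F(G(x)) ~ x
    and the backward cycle requires G(F(y)) ~ y.
    """
    forward_cycle = torch.mean(torch.abs(F(G(x)) - x))
    backward_cycle = torch.mean(torch.abs(G(F(y)) - y))
    return forward_cycle + backward_cycle
```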
9.3.2 Residual learning model (ResNet)

Deep neural networks have evolved from simple to very complex architectures depending on the type of problem to be solved, whether classification, segmentation, recognition, identification, etc. One of the first implementations of deep convolutional networks is presented in Ref. [39], where the authors present an approach to classify 1000 different classes from the ImageNet dataset. The model was designed to support very deep CNN training to classify 1.2 million high-resolution images into 1000 different classes. The model has 60 million parameters and 650,000 neurons. The architecture consists of five convolutional layers, followed in some cases by max-pooling layers, and three fully connected layers with a final softmax layer of 1000 elements. The authors also implemented a very efficient convolution operation over multiple GPUs to reduce training time, and in the fully connected layers they employed dropout for regularization, which proved to be very effective.

Another technique that continues the work on very deep learning networks is the one presented in Ref. [40]. According to the authors, deep networks naturally integrate low/mid/high-level features and classifiers in an end-to-end multilayer way, and the "levels" of features can be enriched by the number of stacked layers (depth). When deeper networks are able to start converging, a degradation problem can appear: as the network depth increases, accuracy gets saturated and then degrades rapidly; this degradation indicates that not all models are similarly easy to optimize. There exists a solution by construction for the deeper model: the added layers are identity mappings, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. In Ref. [40], a deep residual learning framework is presented where, instead of waiting for the stacked layers to fit a desired underlying mapping directly, these layers are allowed to fit a residual mapping. If the desired underlying mapping is denoted by H(x), the stacked nonlinear layers are allowed to fit another mapping, F(x) := H(x) − x, and the original mapping is recast into F(x) + x. The authors hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. In the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers. The formulation of F(x) + x can be realized by feed-forward neural networks with "shortcut connections" that perform identity mapping, whose outputs are added to the outputs of the stacked layers (see Fig. 9.5); moreover, an identity shortcut connection adds neither extra parameters nor computational complexity. The entire network can still be trained end-to-end by stochastic gradient descent (SGD) with backpropagation.

Fig. 9.5 Residual block used on the generator network.

Residual connections also help with the vanishing gradient problem: when the gradient is backpropagated to earlier layers, repeated multiplication may make it vanishingly small, so that, as the network goes deeper, its performance can get saturated or even start degrading rapidly. To avoid these problems, our generator and discriminator are implemented so that larger gradients are propagated to the initial layers, which can then learn as fast as the final layers, giving the ability to train deeper networks. ResNet is a model designed for deep neural network architectures; it consists of convolutional building blocks in which a residual of the input is added to the output.
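A minimal residual block in the sense of F(x) + x might look as follows; the layer ordering mirrors the residual block sketched later in Fig. 9.6 (BatchNorm, ReLU, 3 × 3 convolution, repeated twice), while the channel count is an assumption.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block of the form F(x) + x (see Figs. 9.5 and 9.6).

    The layer order (BatchNorm, ReLU, 3x3 convolution, repeated twice)
    follows the residual block sketched in Fig. 9.6; the channel count
    is an illustrative assumption.
    """
    def __init__(self, channels: int = 256):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Identity shortcut: adds no parameters or extra computation.
        return x + self.body(x)
```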
9.3.3 Proposed architecture

This section presents the approach proposed for NDVI estimation from just a single image of the visible spectrum. As mentioned earlier, it uses an architecture similar to the one proposed in Ref. [17], a recent work for unpaired image-to-image translation based on a CycleGAN. CycleGAN is a convenient method for image-to-image translation problems, such as style transfer, because it relies only on an unconstrained input set and output set rather than on specific corresponding input/output pairs; collecting such pairs could be time consuming, unfeasible, or even impossible depending on the two image types being translated between. Another approach, presented in Ref. [38], has shown results synthesizing photos from label maps and reconstructing objects from edge maps, but it still depends on some kind of correlated labeling. Our architecture is based on the approach presented in Ref. [17] with respect to cycle-consistent learning and loss functions; in our work, it is used to estimate synthetic NIR images. The proposed model can learn to translate images from the visible spectrum to the corresponding NIR spectrum without the need for accurately registered RGB/NIR pairs. This allows us to use these synthetic NIR images in the calculation of the NDVI and in solutions oriented to problems related to the state of crops and their corresponding level of productivity. Another advantage of being able to count on synthetic images of the NIR spectrum is that, undoubtedly, the costs of the solutions are decreased, since there is no need to buy acquisition devices sensitive to that part of the electromagnetic spectrum.

Additionally, our architecture uses the ResNet [40] to perform the image transformation from one spectrum to another. The core idea of ResNet is to introduce a so-called "identity shortcut connection" that skips one or more layers (see Section 9.3.2). In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers; identity shortcut connections add neither extra parameters nor computational complexity, and the entire network can still be trained end-to-end by SGD with backpropagation. These skip connections ensure that properties from previous layers are available for later layers as well, so that the outputs do not deviate much from the original grayscale input; otherwise, the characteristics of the original images would not be retained in the output and the results would be very unrealistic.

Fig. 9.6 depicts the CycleGAN model proposed in the current work.

Fig. 9.6 Cycle generative adversarial generator network detailed architecture.

As shown in Fig. 9.6, the CycleGAN architecture used to generate synthetic NIR images is composed of two generators, G and F, and two discriminators, Dx and Dy. In order to generate a synthetic image, the architecture takes advantage of the combination of cycle consistency and least-square losses [41], in addition to the usual discriminator and generator losses. The results of the experiments have shown that these loss functions demand that the model maintain the textural information of the visible and NIR images and generate uniform synthetic outputs. According to Zhu et al. [17], the objective of a CycleGAN is to learn mapping functions between two domains X and Y given training samples {x_i}, i = 1, …, N, with x_i ∈ X, and {y_j}, j = 1, …, M, with y_j ∈ Y. The generator network architecture designed to estimate the synthetic NIR is described in Fig. 9.6. Also, Figs. 9.9 and 9.10 depict the CycleGAN scheme proposed in the current work.
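As an illustration of the generator layout named in Fig. 9.6 (convolutional blocks A1–A3, residual blocks B1–B9, deconvolutional blocks C1–C2, and a Tanh output), a possible sketch is given below. It reuses the ResidualBlock class from the sketch in Section 9.3.2, and the channel widths, kernel sizes, and strides are assumptions, since the figure does not fix them.

```python
import torch.nn as nn

def build_generator(in_ch: int = 1, out_ch: int = 1, base: int = 64) -> nn.Sequential:
    """Generator sketch following the block layout of Fig. 9.6.

    Three convolutional blocks (A1-A3), nine residual blocks (B1-B9),
    two deconvolutional blocks (C1-C2), and a Tanh output. Channel
    widths, kernel sizes, and strides are illustrative assumptions.
    """
    def conv_block(cin, cout, stride):
        return nn.Sequential(
            nn.Conv2d(cin, cout, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(cout),
            nn.ReLU(inplace=True),
        )

    def deconv_block(cin, cout):
        return nn.Sequential(
            nn.ConvTranspose2d(cin, cout, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.BatchNorm2d(cout),
            nn.ReLU(inplace=True),
        )

    residual_blocks = [ResidualBlock(base * 4) for _ in range(9)]  # B1..B9
    return nn.Sequential(
        conv_block(in_ch, base, stride=1),         # A1
        conv_block(base, base * 2, stride=2),      # A2
        conv_block(base * 2, base * 4, stride=2),  # A3
        *residual_blocks,
        deconv_block(base * 4, base * 2),          # C1
        deconv_block(base * 2, base),              # C2
        nn.Conv2d(base, out_ch, kernel_size=3, padding=1),
        nn.Tanh(),  # outputs in (-1, 1), matching the normalized data range
    )
```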
The model includes two mapping functions, G: X → Y and F: Y → X. In addition, it introduces two adversarial discriminators Dx and Dy, where Dx aims to distinguish between images x and translated images F(y); in the same way, Dy aims to discriminate between y and G(x). Besides, the proposed approach includes two types of loss terms: adversarial losses [36] for matching the distribution of generated synthetic NIR images to the data distribution of real NIR images in the target domain, and a cycle consistency loss to prevent the learned mappings G and F from contradicting each other.

9.3.4 Loss functions

The adversarial losses, according to Goodfellow et al. [36], are applied to both mapping functions. For the mapping function G: X → Y and its discriminator Dy, the objective is defined as

L_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))],   (9.4)

where G tries to generate images G(x) that look similar to images from domain Y, while Dy aims to distinguish between translated samples G(x) and real samples y. For the mapping function F: Y → X and its discriminator Dx, the objective is defined as

L_{GAN}(F, D_X, Y, X) = \mathbb{E}_{x \sim p_{data}(x)}[\log D_X(x)] + \mathbb{E}_{y \sim p_{data}(y)}[\log(1 - D_X(F(y)))],   (9.5)

where F tries to generate images F(y) that look similar to images from domain X, while Dx aims to distinguish between translated samples F(y) and real samples x. Also, according to Zhu et al. [17], to reduce the space of possible mapping functions, the learned mapping functions should be cycle consistent: for each image x from domain X, the image translation cycle should be able to bring x back to the original image, that is, x → G(x) → F(G(x)) ≈ x, which is called forward cycle consistency. Likewise, for each image y from domain Y, G and F should also satisfy backward cycle consistency: y → F(y) → G(F(y)) ≈ y. This cycle consistency loss is defined as

L_{cycle}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\|G(F(y)) - y\|_1].   (9.6)

9.3.5 Least-square GAN loss

In the current work, a least-square loss has been implemented [41] to accelerate the training process. This loss is able to move the fake samples toward the decision boundary, in other words, to generate samples that are closer to the real data, in our case the synthetic NIR image. The experiments performed with this loss instead of the negative log likelihood showed better results. Eqs. (9.4), (9.5) are replaced with the least-square losses, which are defined as

L_{LSGAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[(D_Y(y) - 1)^2] + \mathbb{E}_{x \sim p_{data}(x)}[D_Y(G(x))^2]   (9.7)

and

L_{LSGAN}(F, D_X, Y, X) = \mathbb{E}_{x \sim p_{data}(x)}[(D_X(x) - 1)^2] + \mathbb{E}_{y \sim p_{data}(y)}[D_X(F(y))^2].   (9.8)

For this unsupervised approach, the standard CycleGAN L_{CYCLE} (cycle-consistent loss) and L_{LSGAN} (least-square loss) have been implemented, each with its corresponding weight in the multiple loss function. For the first unsupervised variant, the weighted sum of the individual loss terms designed to obtain the best results is defined as

L_{FINAL-SYNNIR-CYCLEGAN} = 0.38 L_{GAN} + 0.62 L_{CYCLE}.   (9.9)

The second loss evaluated in this unsupervised approach is the LSGAN loss, where the weighted sum of the individual loss terms is defined as

L_{FINAL-SYNNIR-CYCLELSGAN} = 0.65 L_{LSGAN} + 0.35 L_{CYCLE}.   (9.10)

The combination of the weights associated with each loss function is focused on improving the quality of the images for human perception; at the same time, the weights act as regularization terms that determine which loss function is the most significant in the optimization of the model for the generation of the synthetic VI. An inappropriate weight balance increases the risk that the model generates synthetic indexes with too many artifacts and cannot generalize properly. Once the synthetic NIR image is estimated, the NDVI is computed by using Eq. (9.1) together with the information from the red channel of the given image.
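The least-square terms of Eqs. (9.7)–(9.8) and the weighted sum of Eq. (9.10) can be sketched as follows; the tensor names, the generator-side target of 1, and the PyTorch framing are assumptions beyond the equations themselves.

```python
import torch

def lsgan_d_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Discriminator side of the least-square loss (Eqs. 9.7/9.8):
    push real outputs toward 1 and translated outputs toward 0."""
    return torch.mean((d_real - 1.0) ** 2) + torch.mean(d_fake ** 2)

def lsgan_g_loss(d_fake: torch.Tensor) -> torch.Tensor:
    """Generator side of the least-square loss: move fake samples
    toward the real-label decision boundary, i.e. minimize (D(G(x)) - 1)^2."""
    return torch.mean((d_fake - 1.0) ** 2)

def total_objective(lsgan_term: torch.Tensor, cycle_term: torch.Tensor,
                    w_gan: float = 0.65, w_cycle: float = 0.35) -> torch.Tensor:
    """Weighted sum of Eq. (9.10): 0.65 * L_LSGAN + 0.35 * L_CYCLE."""
    return w_gan * lsgan_term + w_cycle * cycle_term
```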
9.4 Results and discussions

9.4.1 Datasets for training and testing

The proposed approach has been evaluated using grayscale images and unpaired NDVI images; the architecture of the generator implemented is the one presented in Fig. 9.6, and the model receives as input a single image of the visible spectrum representation from Brown and Süsstrunk [42]. From the aforementioned data set, the country, mountain, and field categories have been considered for evaluating the performance of the proposed approach; examples of this dataset are presented in Fig. 9.7. This dataset consists of 477 registered images categorized into 9 groups captured in RGB (visible spectrum) and NIR (near-infrared spectrum). The country category contains 52 pairs of images of 1024 × 680 pixels, while the field category contains 51 pairs of images of 1024 × 680 pixels. In order to train the network to generate the VI from each of these categories, a data augmentation process has been applied to avoid overfitting or underfitting the model, so that it can converge and generalize; this process is carried out automatically by a specialized algorithm. It should be noted that during the training process paired images do not belong to the same scene, because there is no need to have correspondences as input for the proposed CycleGAN model.

9.4.2 Data augmentation

The proposed architecture uses as input an unpaired dataset from Brown and Süsstrunk [42], with the RGB images converted to grayscale and the NIR images. In order to enlarge the size of the training dataset, we have implemented an automatic data augmentation process that creates modified versions of the grayscale and NIR images by taking random crops of a parameterized size, randomly selecting the coordinates of the region to crop before the training phase. The creation of multiple variations of the images can improve the performance and the ability of the fitted models to generalize what they have learned to new images. The data augmentation process executed for this approach (see Fig. 9.8) has provided a total of 70 different variations of size 256 × 256 for each image per category existing in the data set; 3500 pairs of images of 256 × 256 pixels have been generated, both for the grayscale versions of the RGB images and for the corresponding NIR images (the NIR images are used to compute the ground-truth NDVI indexes, which are represented as images). Additionally, 1000 pairs of images of 256 × 256 pixels per category have also been generated for testing, and 100 pairs of images per category for validation, which can be used to feed the learning network to synthesize VIs, increasing the performance and accelerating the generalization of the model.

Fig. 9.7 Some examples of cross-spectral images: (first row) RGB images; (second row) unpaired NIR images; (third row) ground-truth NDVI images. (Images from M. Brown, S. Süsstrunk, Multi-spectral SIFT for scene category recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2011, pp. 177–184.)
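A possible sketch of the random-crop augmentation described above is shown below; the 256 × 256 crop size and the 70 variations per image come from the text, while the sampling routine, function name, and use of NumPy are illustrative assumptions.

```python
import numpy as np

def random_crops(image, crop=256, n=70, rng=None):
    """Generate random square crops of one image, as in Section 9.4.2.

    The crop size (256 x 256) and the number of variations per image
    (70) follow the text; the uniform sampling of crop coordinates is
    an illustrative assumption.
    """
    rng = rng if rng is not None else np.random.default_rng()
    h, w = image.shape[:2]
    crops = []
    for _ in range(n):
        top = rng.integers(0, h - crop + 1)
        left = rng.integers(0, w - crop + 1)
        crops.append(image[top:top + crop, left:left + crop].copy())
    return crops

# Example: a 1024 x 680 grayscale image yields 70 crops of 256 x 256.
crops = random_crops(np.zeros((680, 1024)), crop=256, n=70)
```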
9.4.3 Evaluation metrics

Digital images resulting from an artificial intelligence process, such as deep neural networks, are subject to a wide variety of distortions, which may result in a degradation of visual quality. Quality is a very important parameter for all objects and their functionalities. The goal of research in the objective evaluation of image quality is to develop measures that can automatically predict the perceived image quality. In an image-based technique, image quality is a prime criterion. Commonly, for a good image quality evaluation, full-reference metrics are applied, like the mean square error (MSE), one of the most used image quality metrics. The MSE measures the average of the squares of the errors or deviations; that is, large differences between actual and predicted values are punished more with MSE. However, this error does not match human visual perception. In contrast to MSE, a perceptual metric that measures image quality, the Structural Similarity (SSIM) index, has recently been developed to compare the structural and feature similarity between restored and original images on the basis of perception.

Fig. 9.8 Algorithm proposed for data augmentation.

For our approach, we have used the Root Mean Squared Error (RMSE) and the SSIM index as metrics, with which we were able to compute the results of the experiments and obtain consistent results. However, RMSE does not measure the representation of the textures of the images, unlike the SSIM index, which captures the structural content of the images. Additionally, from a semantic perspective, the SSIM index gives better measurements than the RMSE error, and it performs well for perception- and saliency-based errors. According to Wang et al. [43], the SSIM index evaluates images accounting for the fact that the human visual perception system is sensitive to changes in local structure; this index defines the structural information in an image as those attributes that represent the structure of the objects in the scene. The structural loss for a pixel p is defined as

L_{SSIM} = \frac{1}{NM} \sum_{p=1}^{P} \left(1 - \mathrm{SSIM}(p)\right),   (9.11)

where SSIM(p) is the structural similarity index (see Ref. [43] for more details) centered in pixel p of the patch P.
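The two metrics can be computed, for instance, as in the following sketch; the use of scikit-image for SSIM and the [0–255] scaling (taken from the notes of Table 9.1) are assumptions about how the evaluation could be reproduced, not the authors' actual code.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def evaluate_ndvi(estimated: np.ndarray, ground_truth: np.ndarray):
    """RMSE and SSIM between an estimated NDVI map and its ground truth.

    Both maps are assumed to be scaled to the [0, 255] range, as in the
    notes of Table 9.1; data_range is set accordingly.
    """
    diff = estimated.astype(np.float64) - ground_truth.astype(np.float64)
    rmse = float(np.sqrt(np.mean(diff ** 2)))
    ssim_index = ssim(estimated.astype(np.float64),
                      ground_truth.astype(np.float64), data_range=255)
    return rmse, ssim_index
```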
9.4.4 Experimental results

The proposed approach (see Figs. 9.9 and 9.10) has been evaluated using NIR and RGB images together with the corresponding NDVI obtained from Eq. (9.1), in which the red channel of the RGB image was used; the cross-spectral data set used in our implementation came from Brown and Süsstrunk [42]. This dataset consists of 477 registered images categorized into 9 groups captured in RGB (visible) and NIR (near-infrared) spectral bands. The country, mountain, and field categories have been considered for evaluating the performance of the proposed approach. The country category contains 52 pairs of images of 1024 × 680 pixels, the mountain category contains 55 pairs of images of 1024 × 680 pixels, while the field category contains 51 pairs of images of 1024 × 680 pixels.

Fig. 9.9 Cycle generative adversarial model F: Y (NIR) → X (grayscale) and its discriminator Dx.
Fig. 9.10 Cycle generative adversarial model G: X (grayscale) → Y (NIR) and its discriminator Dy.

In order to increase the training dataset, a data augmentation process was performed to improve the accuracy of our network in generating synthetic NIR images. The data augmentation consists of applying flipping, rotating, and transposing over the original images. After the data augmentation process, 600 pairs of images from the visible and NIR spectra have been generated for each category. Additionally, for each category 40 pairs of images for testing and 20 pairs of images for validation from the visible and NIR spectra have been used. It is important to emphasize that, although the dataset images are registered, for the CycleGAN training process we use unpaired images. On average, every training process took about 80 hours using a 3.2 GHz 8-core processor with 32 GB of memory and an NVIDIA TITAN XP GPU.

Some illustrations with the corresponding NIR results obtained with the proposed CycleGAN approach are depicted in Fig. 9.11 for qualitative evaluation. These synthetic NIR images obtained with the CycleGAN are then used for estimating the NDVIs. Figs. 9.12–9.14 present some illustrations of NDVIs estimated per category (country, field, and mountain) using the generated synthetic NIR images. Also, Figs. 9.15–9.17 present illustrations of the NDVI generated with the proposed approach compared with results from Ref. [17], showing better qualitative results. Quantitative evaluations are presented in Table 9.1: the average root mean square error (RMSE) and the SSIM index computed over the validation set are depicted for different combinations of the proposed loss functions. Our experiments used the standard loss function for GANs, which is based on the negative log likelihood, and also used the least-square loss to obtain better quantitative results and avoid the vanishing gradient problem, where a deep feed-forward network is unable to propagate valid gradient information from the output back to the first layers of the model. We implement the least-square loss to accelerate and stabilize the training process. Additionally, results from Refs. [16, 17] are presented in this table. It can be appreciated that in all the cases the results obtained with the least-square loss in the proposed CycleGAN are better than those obtained with the approaches presented in Refs. [16, 17]. It should be mentioned that the least-square loss permits accelerating the network convergence, allowing a better optimization of the network. To increase the cycle loss effect over the network we used L1 (λ). The proposed CycleGAN network has been trained using the stochastic Adam optimizer, since it is well suited for problems with deep networks and large datasets and helps to avoid overfitting. The image dataset was normalized to the (−1, 1) range and rescaled to 256 × 256 to avoid memory problems during the training process. The following hyperparameters were used during the training process: learning rate 0.0003, epsilon = 1e−08, exponential decay rate for the first moment (momentum) 0.6, L1 (λ) 10.5, weight decay 1e−2, leaky ReLU slope 0.20.
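A hedged sketch of the optimizer setup with the reported hyperparameters is given below; mapping the values onto torch.optim.Adam arguments is an assumption (the exponential decay rate for the first moment is used as beta1, and beta2 keeps its default since the text does not report it).

```python
import torch

def make_optimizers(G, F, D_x, D_y):
    """Adam optimizers with the hyperparameters reported in Section 9.4.4.

    The mapping of the reported values onto torch.optim.Adam arguments
    is an assumption: the "exponential decay rate for the first moment"
    (0.6) is used as beta1, and beta2 keeps its default of 0.999.
    """
    gen_params = list(G.parameters()) + list(F.parameters())
    dis_params = list(D_x.parameters()) + list(D_y.parameters())
    opt_g = torch.optim.Adam(gen_params, lr=3e-4, betas=(0.6, 0.999),
                             eps=1e-8, weight_decay=1e-2)
    opt_d = torch.optim.Adam(dis_params, lr=3e-4, betas=(0.6, 0.999),
                             eps=1e-8, weight_decay=1e-2)
    return opt_g, opt_d
```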
Fig. 9.11 Illustration of NIR images obtained by the proposed CycleGAN, which are later on used to estimate the corresponding NDVIs. (First row) RGB images. (Second row) Grayscale images used as input into the CycleGAN. (Third row) Estimated NIR images. (Fourth row) Ground-truth NIR images. (Images from M. Brown, S. Süsstrunk, Multi-spectral SIFT for scene category recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2011, pp. 177–184; country, field, and mountain categories.)
Fig. 9.12 Images of NDVIs from the country category obtained with the synthetic NIR generated by the proposed CycleGAN. (Left) Ground-truth NDVI images. (Right) Estimated NDVIs. (Images from M. Brown, S. Süsstrunk, Multi-spectral SIFT for scene category recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2011, pp. 177–184.)
Fig. 9.13 Images of NDVIs from the field category obtained with the synthetic NIR generated by the proposed CycleGAN. (Left) Ground-truth NDVI images. (Right) Estimated NDVIs. (Images from M. Brown, S. Süsstrunk, Multi-spectral SIFT for scene category recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2011, pp. 177–184.)
Fig. 9.14 Images of NDVIs from the mountain category obtained with the synthetic NIR generated by the proposed CycleGAN. (Left) Ground-truth NDVI images. (Right) Estimated NDVIs. (Images from M. Brown, S. Süsstrunk, Multi-spectral SIFT for scene category recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2011, pp. 177–184.)
Fig. 9.15 Images of NDVIs from the country category obtained with the CycleGAN implemented in this chapter: (first col) NDVI estimated with Ref. [17]; (second col) NDVI estimated by the proposed CycleGAN; (third col) ground-truth NDVI. (Images from M. Brown, S. Süsstrunk, Multi-spectral SIFT for scene category recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2011, pp. 177–184.)
Fig. 9.16 Images of NDVIs from the field category obtained with the CycleGAN implemented in this chapter: (first col) NDVI estimated with Ref. [17]; (second col) NDVI estimated by the proposed CycleGAN; (third col) ground-truth NDVI. (Images from M. Brown, S. Süsstrunk, Multi-spectral SIFT for scene category recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2011, pp. 177–184.)
Fig. 9.17 Images of NDVIs from the mountain category obtained with the CycleGAN implemented in this chapter: (first col) NDVI estimated with Ref. [17]; (second col) NDVI estimated by the proposed CycleGAN; (third col) ground-truth NDVI. (Images from M. Brown, S. Süsstrunk, Multi-spectral SIFT for scene category recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2011, pp. 177–184.)
Table 9.1 Average root mean squared errors (RMSE) and structural similarities (SSIM) obtained between the estimated NDVI and the real one computed from Eq. (9.1) (for SSIM, the bigger the better).

Training                                                   RMSE                          SSIM
                                                           Country  Field  Mountain     Country  Field  Mountain
Supervised approach: results from Ref. [16]                3.53     3.70   –            0.94     0.91   –
Unsupervised approach: results from Ref. [17]              3.46     3.53   3.82         0.93     0.90   0.88
Proposed NDVI estimation with L_FINAL-SYNNIR-CYCLELSGAN    3.39     3.56   3.81         0.94     0.92   0.89

Notes: NDVI values are scaled up to a range of [0–255] since they are depicted as images, as shown in Figs. 9.15–9.17.

9.5 Conclusions

This chapter tackles the challenging problem of generating the NDVI using a synthetic NIR image and its corresponding RGB representation. NIR images are estimated by using a CycleGAN network. Results have shown that in most of the cases the network is able to obtain reliable synthetic NIR representations that can be used to obtain VIs. As mentioned in Section 9.4, this approach does not have the limitation of needing paired NIR-RGB images for training. As work in progress, we are considering the use of a CycleGAN architecture with continual learning through deep generative replay, feeding the generator with RGB images and their corresponding NIR images to speed up generalization. Future work will also consider other loss functions to improve the training process.

Acknowledgments

This work has been partially supported by the ESPOL project PRAIM (FIEC-09-2015); the Spanish Government under Project TIN2017-89723-P; and the "CERCA Programme/Generalitat de Catalunya." The authors also thank NVIDIA for GPU donations and the CYTED Network "Ibero-American Thematic Network on ICT Applications for Smart Cities" (REF-518RT0559).

References

[1] S.F. Di Gennaro, F. Rizza, F.W. Badeck, A. Berton, S. Delbono, B. Gioli, P. Toscano, A. Zaldei, A. Matese, UAV-based high-throughput phenotyping to discriminate barley vigour with visible and near-infrared vegetation indices, Int. J. Remote Sens. 39 (15–16) (2018) 5330–5344.
[2] M.F. Dreccer, G. Molero, C. Rivera-Amado, C. John-Bejai, Z. Wilson, Yielding to the image: how phenotyping reproductive growth can assist crop improvement and production, Plant Sci. 282 (2019) 73–82.
[3] M. Wójtowicz, A. Wójtowicz, J. Piekarczyk, Application of remote sensing methods in agriculture, Commun. Biometry Crop Sci. 11 (1) (2016) 31–50.
[4] T. Adão, J. Hruška, L. Pádua, J. Bessa, E. Peres, R. Morais, J. Sousa, Hyperspectral imaging: a review on UAV-based sensors, data processing and applications for agriculture and forestry, Remote Sens. 9 (11) (2017) 1110.
[5] S.S. Panda, D.P. Ames, S. Panigrahi, Application of vegetation indices for agricultural crop yield prediction using neural network techniques, Remote Sens. 2 (3) (2010) 673–696.
[6] S.S. Dahikar, S.V. Rode, Agricultural crop yield prediction using artificial neural network approach, Int. J. Innov. Res. Electr. Electron. Instrum. Control Eng. 2 (1) (2014) 683–686.
[7] J. Rouse Jr., R.H. Haas, J.A. Schell, D.W. Deering, Monitoring Vegetation Systems in the Great Plains with ERTS, NASA Technical Reports Server (NTRS), 1974. Tech. Rep.
[8] S. Skakun, C.O. Justice, E. Vermote, J.-C. Roger, Transitioning from MODIS to VIIRS: an analysis of inter-consistency of NDVI data sets for agricultural monitoring, Int. J. Remote Sens. 39 (4) (2018) 971–992.
[9] T.N. Carlson, D.A.
Ripley, On the relation between NDVI, fractional vegetation cover, and leaf area index, Remote Sens. Environ. 62 (3) (1997) 241–252. [10] A.H. Junges, D.C. Fontana, C.S. Lampugnani, Relationship between the normalized difference vegetation index and leaf area in vineyards, Bragantia 78 (2) (2019) 297–305. [11] P. Ricaurte, C. Chilán, C.A. Aguilera-Carrasco, B.X. Vintimilla, A.D. Sappa, Feature point descriptors: infrared and visible spectra, Sensors 14 (2) (2014) 3690–3701. [12] C.A. Aguilera, F.J. Aguilera, A.D. Sappa, C. Aguilera, R. Toledo, Learning cross-spectral similarity measures with deep convolutional neural networks, in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, JunIEEE, Las Vegas, USA, 2016, p. 9. [13] P.L. Suárez, A.D. Sappa, B.X. Vintimilla, Cross-spectral image patch similarity using convolutional neural network, in: IEEE International Workshop of Electronics, Control, Measurement, Signals and their Application to Mechatronics (ECMSM)IEEE, 2017, pp. 1–5. [14] N. Pettorelli, A.L.M. Chauvenet, J.P. Duffy, W.A. Cornforth, A. Meillere, J.E.M. Baillie, Tracking the effect of climate change on ecosystem functioning using protected areas: Africa as a case study, Ecol. Indic. 20 (2012) 269–276. [15] P.L. Suárez, A.D. Sappa, B.X. Vintimilla, Learning image vegetation index through a conditional generative adversarial network, in: 2nd Ecuador Technical Chapters Meeting2017, pp. 27–35. [16] P.L. Suárez, A.D. Sappa, B.X. Vintimilla, Vegetation index estimation from monospectral images, in: International Conference Image Analysis and RecognitionSpringer, 2018, pp. 353–362. [17] J.-Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision2017, pp. 2223–2232. [18] A. Marra, M. Gargiulo, G. Scarpa, R. Gaetano, Estimating the NDVI from SAR by Convolutional Neural Networks, in: IGARSS 2018—2018 IEEE International Geoscience and Remote Sensing SymposiumIEEE, 2018, pp. 1954–1957. [19] D.S. Reddy, P.R.C. Prasad, Prediction of vegetation dynamics using NDVI time series data and LSTM, Model. Earth Syst. Environ. 4 (1) (2018) 409–419. [20] S. Al Mansoori, A. Kunhu, H. Al Ahmad, Automatic palm trees detection from multispectral UAV data using normalized difference vegetation index and circular Hough transform, in: High-Performance Computing in Geoscience and Remote Sensing VIII, 10792 International Society for Optics and Photonics, 2018, pp. 11–19. [21] J.M. Damian, M.R. Cherubin, A.Z. da Fonseca, E.Z. Fornari, A.L. Santi, O.H. de Castro Pias, Applying the NDVI from satellite images in delimiting management zones for annual crops., Sci. Agric. 77 (1) (2020) 1–11. [22] L. Ulsig, C. Nichol, K. Huemmrich, D. Landis, E. Middleton, A. Lyapustin, I. Mammarella, J. Levula, A. Porcar-Castell, Detecting inter-annual variations in the phenology of evergreen conifers using longterm MODIS vegetation index time series, Remote Sens. 9 (1) (2017) 49. [23] W. Li, J.-D.M. Saphores, T.W. Gillespie, A comparison of the economic benefits of urban green spaces estimated with NDVI and with high-resolution land cover data, Landsc. Urban Plan. 133 (2015) 105–117. [24] M. Rani, P. Kumar, P.C. Pandey, P.K. Srivastava, B.S. Chaudhary, V. Tomar, V.P. 
Mandal, Multitemporal NDVI and surface temperature analysis for Urban Heat Island inbuilt surrounding of sub- 233 234 Generative adversarial networks for image-to-Image translation humid region: a case study of two geographical regions, Remote Sens. Appl. Soc. Environ. 10 (2018) 163–172. [25] L. Xu, B. Li, Y. Yuan, X. Gao, T. Zhang, A temporal-spatial iteration method to reconstruct NDVI time series datasets, Remote Sens. 7 (7) (2015) 8906–8924. [26] M.A. Hassan, M. Yang, A. Rasheed, G. Yang, M. Reynolds, X. Xia, Y. Xiao, Z. He, A rapid monitoring of NDVI across the wheat growth cycle for grain yield prediction using a multi-spectral UAV platform, Plant Sci. 282 (2019) 95–103. [27] R. Taghizadeh-Mehrjardi, K. Schmidt, A. Amirian-Chakan, T. Rentschler, M. Zeraatpisheh, F. Sarmadian, R. Valavi, N. Davatgar, T. Behrens, T. Scholten, Predicting machine learning models and rescanning covariate space, Remote Sens. 12 (7) (2020) 1095. [28] T. Duan, S.C. Chapman, Y. Guo, B. Zheng, Dynamic monitoring of NDVI in wheat agronomy and breeding trials using an unmanned aerial vehicle, Field Crops Res. 210 (2017) 71–80. [29] Z. Jiang, A.R. Huete, J. Chen, Y. Chen, J. Li, G. Yan, X. Zhang, Analysis of NDVI and scaled difference vegetation index retrievals of vegetation fraction, Remote Sens. Environ. 101 (3) (2006) 366–378. [30] Z. Zhao, J. Gao, Y. Wang, J. Liu, S. Li, Exploring spatially variable relationships between NDVI and climatic factors in a transition zone using geographically weighted regression, Theor. Appl. Climatol. 120 (3–4) (2015) 507–519. [31] A. Mazza, M. Gargiulo, G. Scarpa, R. Gaetano, Estimating the NDVI from SAR by convolutional neural networks, in: IGARSS IEEE International Geoscience and Remote Sensing SymposiumIEEE, 2018, pp. 1954–1957. [32] H. Huang, G. Sun, J. Ren, J. Rang, A. Zhang, Y. Hao, Spectral-spatial topographic shadow detection from Sentinel-2A MSI imagery via convolutional neural networks, in: IGARSS IEEE International Geoscience and Remote Sensing SymposiumIEEE, 2018, pp. 661–664. [33] Y. Liu, S. Piramanayagam, S.T. Monteiro, E. Saber, Dense semantic labeling of very-high-resolution aerial imagery and lidar with fully-convolutional neural networks and higher-order CRFs, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops2017, pp. 76–85. [34] A. Berger, G. Ettlin, C. Quincke, P. Rodrı́guez-Bocca, Predicting the normalized difference vegetation index (NDVI) by training a crop growth model with historical data, Comput. Electron. Agr. 161 (2019) 305–311. [35] C.A. Aguilera, A.D. Sappa, C. Aguilera, R. Toledo, Cross-spectral local descriptors via quadruplet network, Sensors 17 (4) (2017) 873. [36] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems2014, pp. 2672–2680. [37] M. Mirza, S. Osindero, Conditional Generative Adversarial Nets, ArXiv abs-1411-1784 (2014). [38] P. Isola, J.-Y. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition2017, pp. 1125–1134. [39] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems2012, pp. 1097–1105. [40] K. He, X. Zhang, S. Ren, J. 
Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition2016, pp. 770–778. [41] X. Mao, Q. Li, H. Xie, R.Y.K. Lau, Z. Wang, S.P. Smolley, Least squares generative adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision2017, pp. 2794–2802. [42] M. Brown, S. S€ usstrunk, Multi-spectral SIFT for scene category recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern RecognitionIEEE, 2011, pp. 177–184. [43] Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Trans. Image Process. 13 (4) (2004) 600–612. CHAPTER 10 Image generation using generative adversarial networks Omkar Metri and H.R Mamatha Department of CSE, PES University, Bengaluru, India 10.1 Introduction to deep learning Neural networks and deep learning are one of the greatest innovations in the field of artificial intelligence. The reason for the majority of the real tasks at hand such as image and face recognition, speech recognition, object detection, and natural language processing having the best solutions is neural networks and deep learning. Likewise, the majority of machine learning algorithms are great in learning patterns, classifications tasks such as category assignment and regression for interpreting the numerical values based on the available data. But the computers have struggled when asked for data generation. The only way out for gathering data to train the models is collection from different sources or manual creation of the data. This gave rise to generative modeling. In a nutshell, generative modeling is a robust way of understanding the data distributions in an unsupervised manner. These models aim to generate new data by learning true distributions of the data. The data distributions cannot be featured with perfection. As a consequence, with the help of the neural networks and deep learning, it can be approximated as a function of the true distribution. Two elegant architectures used for generation purposes are variational autoencoder (VAE) and generative adversarial network (GAN). 10.1.1 Generative deep learning A few questions may strike the mind such as the differences between generative and discriminative models, the need of generative models, and others. This section answers these questions. A generative model basically categorizes the data sample based on how the data was generated. In other words, categorizing the data sample based on generation assumptions. On the other hand, discriminative models categorize the data sample based on the differences ignoring the data generation details. If we imagine the speech to language classification task, the generative approach would be learning the languages and classifying based on the gained knowledge, and the discriminative approach would be determining the linguistic differences without learning any language and predicting the language of the speech. For the x inputs and y labels, the generative algorithms would learn the Generative Adversarial Networks for Image-to-Image Translation https://doi.org/10.1016/B978-0-12-823519-5.00007-5 Copyright © 2021 Elsevier Inc. All rights reserved. 235 236 Generative adversarial networks for image-to-Image translation joint probability, i.e., p(x, y). Similarly, discriminative algorithms would learn the conditional probability, i.e., p(y j x) [1]. 
The generative models are being extensively used for producing realistic images of artwork, simulation purposes, time-series data, reinforcement learning as well as for generalizing features. A few prominent models are nuclear autoregressive density estimator (NADE) [2], masked autoencoder density estimator (MADE) [2], pixel recurrent neural networks (PixelRNN), pixel convolution neural networks (PixelCNN), variational autoencoder (VAE), Markov chains, and generative adversarial network (GAN). PixelRNN made an impact and became one of the promising solutions for image compression, reconstruction, generation, and others [3]. The model basically loads the image and at a given point of time scans one row and one pixel within that row. The idea is to predict the distribution of the next pixel with the possible values. Joint distribution of the pixel is basically the product of the conditional probabilities, thus making a sequence problem. Two types of architecture are experimented with, i.e., row LSTM (long-short-term memory) and diagonal BiLSTMs. The former uses a unidirectional layer of LSTM to scan the image row by row along with one-dimensional convolution. The diagonal BiLSTMs scan the image in the diagonal fashion. To increase the convergence speed and propagation of the signals explicitly, residual connections were added to the architecture. Usually, the realistic pictures have three channels, i.e., RGB. Hence, while predicting a pixel for the R channel, the context is the previously generated pixels to the left and above. Similarly, the context for the G channel remains the previously generated pixels along with the dependency on the R channel. Likewise, the B channel will have dependency on the generated pixels, the R channel and the G channel. To ensure the dependencies, masks are applied, i.e., masked convolution. Also, PixelCNN has been modeled using CNN. The model has been tested on MNIST, CIFAR-10, and ImageNet data. Fig. 10.1 shows the generation samples of CIFAR-10 and 32 32 ImageNet. The sequential generation is slow, which is a major drawback of pixelRNN/CNN. Also, the training of pixelCNN is faster compared to pixelRNN. Van den Oord et al. [4] present an improvement over the pixelCNN termed a Gated PixelCNN. It is computationally efficient and surpasses the pixelRNN in Ref. [3]. The vanilla PixelCNN ignored the content to the right of the current pixel, termed the blind spot (Fig. 1 in Ref. [3]). The gated architecture took off the blind spots with the help of two convolution stacks: one for a horizontal stack (current row till the current pixel) and the other for a vertical stack (all the rows above the current row). Using conditional modeling of images, the conditional PixelCNN model generated realistic images of different classes. Also, the model was tested on human images. It performed well on generating images of the same person with different postures (Fig. 10.2) and modeling on eight classes (Fig. 10.3). It also demonstrated the use of PixelCNN as an image decoder. The generated samples were of average quality depicting the model’s capability of capturing the variations on objects. Another improvement on Image generation using generative adversarial networks Fig. 10.1 Generation samples of CIFAR-10 (left) and 32 32 ImageNet (right) [3]. Fig. 10.2 Source image (left) and image generation samples (right) [4]. PixelCNN with a logistic likelihood is proposed in Ref. [5]. 
The next section deals with a brief introduction of autoencoders and a thorough explanation of the VAE. 10.1.2 Variational autoencoder Autoencoders are feed-forward neural networks where the input is equivalent to the output. The autoencoder has three parts, i.e., encoder, code, and decoder. The input is fed to the encoder which compresses and produces a code. The code is termed latent space representation. In turn, the decoder regenerates the input using the latent representation. The output is a degraded representation of the input [6]. Autoencoders are used for 237 238 Generative adversarial networks for image-to-Image translation Fig. 10.3 Conditional image generation on eight classes [4]. anomaly detection purposes, information retrieval, image denoising, medical aging, popularity prediction, and others [7]. Many variations of the autoencoders have been used in a variety of applications. The prominent ones are sparse, stacked, denoising, and variational autoencoders. The standard autoencoder encodes the images as latent vectors, trying to memorize the images. Therefore, generating new images is not possible as the task of producing latent vectors depends on the input images. VAE solves the problem efficiently by representing the data into latent space which roughly follows the normal Gaussian distribution. Hence, feeding a randomly sampled data from the distribution to the decoder will generate a new image. To measure the efficiency, two separate losses are calculated. One is the mean squared error (generative loss) and the other is the Kullback-Leibler (KL) divergence measuring the proximity between latent space and normal Gaussian distribution. The simplest method to optimize KL divergence is by allowing the encoder to produce a code of two vectors (means and standard deviations) and thereby picking a sample for generating an image [8, 9]. Image generation using generative adversarial networks The availability of huge amounts of data has driven visual forecasting. The drawbacks of the traditional approaches like nearest neighbor algorithms and transferring raw trajectories are computationally expensive, output space with high dimensions, encoding difficulty due to the pixel color variation in the frames, and blurry predictions. The challenges are addressed by predicting the dense pixel trajectories using conditional VAE [10]. The results indicate the model’s capability of learning representations with less data and commendable visual forecasting from static images (Figs. 3 and 4 in Ref. [10]). VAEs have been used in the music sector for style transfer between music genres [11]. Also, VAEs have been employed in medicine [12] and anomaly detection [13]. 10.2 Introduction to GAN GANs [14] are an amazing AI innovation fit for making pictures, sound, and recordings that are unclear from the real thing. The section explains GAN in relation to the concept of game theory, i.e., Nash equilibrium, architecture, and training problems. 10.2.1 Nash equilibrium Nash equilibrium is an important concept of game theory named after the inventor, i.e., John Nash. The best end result is dependent on the behavior and interaction of the participants in the game. In other words, the optimal solution in a noncooperative game is where the player cannot change the initial strategy. Basically, the player does not gain anything by deviating from the initial strategy under the assumption that other players keep the strategies unchanged. The game may include multiple equilibria or none of them [15, 16]. 
For example, assume two companies A and B. The companies are determining if they should start an advertising campaign to launch their products. If both the companies choose to advertise, each company acquires 100 customers. If only one of them chooses to campaign, then the company will acquire 200 customers. If neither of them campaigns, then no customers are acquired (Table 10.1). Company A should advertise as it provides a good profit and reward, rather than not advertise. A similar scenario comes up for company B as well. Hence, both the companies opting for advertisement is a Nash equilibrium. Another common example associated with Nash equilibrium is the prisoner’s dilemma [16]. Table 10.1 Reward table. Company A, B Advertise Do not advertise Advertise Do not advertise 100, 100 0, 200 200, 0 0, 0 239 240 Generative adversarial networks for image-to-Image translation 10.2.2 GAN and Nash equilibrium A GAN setup includes two neural network systems in opposition to one another—one to create fakes (generator) and one to spot them (discriminator). The term generative indicates the idea of creating new data depending upon the training data. The term adversarial indicates a gamelike framework with two networks, i.e., generator and discriminator. The generator produces the realistic data which is similar to the training data, whereas the discriminator’s task is to identify fake data produced by the generator from the real data coming from the training sample (Fig. 10.4) [18–23]. GAN is built on the concept of a zero-sum noncooperative game, i.e., minimax. In other words, one player maximizes its action and the other player minimizes those actions. Before diving into deep mathematics, Eq. (10.1) represents the variable notations with Eqs. (10.2)–(10.4) representing log-loss, KL divergence, and Jensen-Shanon divergence (JSD) function notations used in the rest of the chapter. z ! Noise vector x ! Original training data ! xr GðzÞ ! Generator output ! xf (10.1) DðxÞ ! Discriminator output for xr DðGðzÞÞ ! Discriminator output for xf 1X ðyi log ðpi ÞÞ + ðð1 yi Þ log ð1 pi ÞÞ E ðpj yÞ ¼ N ð pðxÞ DKL ðP k QÞ ¼ pðxÞ log dx qðxÞ 1 1 JSDðP k QÞ ¼ DKL ðP k M Þ + DKL ðQ k M Þ 2 2 (10.2) (10.3) (10.4) The discriminator is a binary classifier with the intention of categorizing the image as real or fake. As a consequence, the model should showcase high and low probabilities for Fig. 10.4 GAN architecture [17]. Image generation using generative adversarial networks real and fake data. Hence, the outcome of D(x) and D(G(z)) lies in the range 0 and 1. The discriminator’s task is to maximize and minimize the probabilities of D(x) and D(G(z)), respectively, whereas the generator maximizes the probability of D(G(z)). The discriminator and generator aim at achieving Eqs. (10.5), (10.6), respectively. Therefore, the value function of the minimax game is defined as Eq. (10.7). E() in Eq. (10.7) is the binary cross entropy, i.e., log-loss function. The noise z follows normal or uniform distribution. According to game theory, the convergence of GAN is achieved when the discriminator and generator reach Nash equilibrium, i.e., loge4 (details in Section 10.2.2.1). Discriminator : max ExpðxÞ ½ log DðxÞ + EzpðzÞ ½ log ð1 DðGðzÞÞÞ Generator : max EzpðzÞ ½ log DðGðzÞÞ (10.5) (10.6) GAN : minG maxD V ðD, GÞ ¼ ExpðxÞ ½ log DðxÞ + EzpðzÞ ½ log ð1 DðGðzÞÞÞ (10.7) 10.2.2.1 Nash equilibrium proof Assume, A ¼ Pðxr Þ B ¼ P xf y ¼ DðxÞ (10.8) From the Radon-Nikodym theorem, G(z) can be approximated to x. In addition, from Eqs. 
(10.2), (10.8), Eq. (10.7) can be written as ð minG maxD V ðD, GÞ ¼ A log y + B log ð1 yÞ dy (10.9) y The optimal discriminator is obtained by maximizing the integrand in Eq. (10.9). Hence, the integrand is rewritten as f ðyÞ ¼ A log y + B log ð1 yÞ (10.10) To find the maximum of Eq. (10.10), f 0 (y) ¼ 0 and f 00 (y) < 0 should satisfy. f 0 ðyÞ ¼ 0 A B ¼0 y 1y y¼ (10.11) A A+B Eq. (10.11) boils down to, DðxÞ ¼ P ðxr Þ 1 ¼ jif jP ðxr Þ ¼ P xf 2 P ðxr Þ + P xf (10.12) 241 242 Generative adversarial networks for image-to-Image translation Eq. (10.12) indicates the discriminator’s puzzled situation on feeding original and generated images. Hence, Eq. (10.9) becomes: ð A B + B log dy V ðD, GÞ ¼ A log A+B A+B y ð A B ¼ A log + B log + ð log e 2 log e 2ÞðA + BÞdy A+B A+B y ð ð A B ¼ A log + B log + log e 2ðA + BÞdy log e 2ðA + BÞdy A+B A+B y y ð ð A B + A log e 2 + B log + B log e 2dy log e 2 ðA + BÞdy ¼ A log A+B A+B y x ð A B ¼ A log + B log dy 2 log e 2 A+B A+B y 2 2 A+B A+B + DKL B k 2 log e 2 ¼DKL A k 2 2 ¼2 JSDðA k BÞ log e 4 (10.13) Eq. (10.13) is rewritten as V ðD, GÞ ¼ loge 4 + 2 JSD P ðxr Þ k P xf (10.14) From the JSD divergence, JSD(P(xr) k P(xf)) is zero when P(xr) ¼ P(xf). Hence, Eq. (10.14) reduces to loge4 proving that Nash equilibrium (global minimum) for the minimax game is loge4. 10.2.2.2 Training problems Since the advent of GAN, a lot of research has been conducted. Consequently, numerous problems are associated with the training. A few are stated below [17, 24]: 1. Modal collapse: is the collapsing of the generator to produce the same kind of image for the possibly different latent vectors. The aim of the generator is Eq. (10.6). Consider the generator is trained substantially without updating the generator. In this case, the generated images will converge on finding the optimal image fooling the discriminator, thereby becoming independent of z (Eq. 10.15). Hence, the gradient turns zero w.r.t. z and the mode collapses to a single point. One prominent affecting factor Image generation using generative adversarial networks is the learning rate. It is recommended to use a low learning rate and add noise to real and generated images during training. Mode collapse is a challenging problem to date. x∗ ¼ argmax DðxÞ (10.15) 2. Nonconvergence and stability: Due to the diminished gradient, the discriminator becomes more powerful and the generator’s gradient vanishes. Hence, GAN does not learn anything. Also, the stability of the model is a major concern. In other words, model parameters destabilize and therefore are responsible for nonconvergence. Lipschitz regularization has shown great success in stabilizing GAN training [25] and a few important considerations during training are cited in Ref. [26]. 3. Early stopping and hyperparameters: Early stopping is terminating the training due to an abrupt increase or decrease in loss. Hyperparameter selection like loss function and imbalance between the two components can result in overfitting. 10.2.3 VAE-GAN Larsen et al. [27] presented VAEGAN architecture by combining VAE and GAN outplaying the trivial VAE. The flow is indicated in Fig. 10.5A. The disadvantage with the VAE is the generation of blurry images. Hence, the loss function of the VAE’s decoder is Fig. 10.5 VAE-GAN architecture [27]: (A) vanilla VAE-GAN and (B) VAE-GAN with auxiliary generator. 243 244 Generative adversarial networks for image-to-Image translation replaced with the loss metric learned through the GAN. 
Thus, no assumptions are made regarding the loss function as it uses the GAN’s discriminator to categorize the image as real or fake. In other words, the generator (VAE decoder) uses this information to produce less blurry images. The original VAE generates samples which differ in a Gaussian way resulting in blurry images. VAEGAN overcomes the problem by sending the reconstructions and ground truth to the discriminator assuming that one hidden layer of the discriminator will differ in a Gaussian way. Hence, one of the hidden layers of the GAN discriminator is used for VAE loss. The reason for not choosing the final output layer is due to the lack of variation learning between the real and fake. LGAN ¼ log ðDisðxÞÞ + log ð1 DisðDec ðzÞÞÞ + log ð1 DisðDec ðEnc ðxÞÞÞÞ (10.16) The architecture is refined by adding an auxiliary generator over the generator cum decoder (Fig. 10.5B). The discriminator is set to receive and classify images from three sources, i.e., original (x), samples from normal distributions (xp), and VAE (x ). Hence, the outputs generated by the auxiliary and decoder generator are treated as fake. The objective of the GAN is Eq. (10.16) and the results obtained are better compared to vanilla VAE-GAN (Fig. 10.6). 10.3 Applications Since the innovation of GANs, they have been extensively used by the experts and researchers. It is one of the remarkable innovations in the field of deep learning. GANs can be used for image editing by reconstructing the image. GANs can be employed for Fig. 10.6 Reconstruction using VAE variations [27]. Image generation using generative adversarial networks Table 10.2 Implementation details. Sno GAN flavor/link Paper 1 2 3 4 5 6 7 8 9 10 11 Pix2Pix, CycleGAN Pix2Pix CoGAN BiGAN StarGAN StarGAN V2 SRGAN Art2Real Monkey-Net First Order Model StackGAN [19, 20] [19] [21] [23] [28] [29] [30] [31] [32] [33] [34] strengthening security by creating fake threats and training the model to identify these threats. The availability of data in the healthcare industry is limited. Hence, the GANs can be used for synthetic generation of data. On a similar basis, they can be employed to generate 3D objects and for prediction of future video frames and thereby generating videos. In addition to it, they are making a huge impact in the movie making, gaming, music, and fashion industries. The section introduces a few image translation applications using flavors of GAN. Tables 10.2 and 10.3 provide the implementation and dataset details. 10.3.1 Image-to-image translation using {c, cycle}-GAN The goal of the image-to-image translation is to align the input image to output image with the help of aligned image pairs. Mirza and Osindero [18] extended the vanilla GAN to a conditional model. The architecture supplied additional information, i.e., class labels to the generator and discriminator by adding an extra input layer. The experiment was conducted on the MNIST dataset. Parzen’s window-based log likelihood estimate is calculated and the comparative results are cited in Table 1 of Ref. [18]. On a similar basis, tag vectors were conditioned on the images for the automatic tagging of images. The objective remained the same in Ref. [19] with the discriminator’s task unchanged and the generator’s task being to fool the discriminator along with nearing the output to that of the ground truth in the L1 sense as L1 supports little haziness. 
The architecture is named as pix2pix and the experiments have been performed on various datasets like cityscapes, CMP (Centre for Machine Perception) facades, Google maps, edges, sketches, and others. Pix2pix architecture consists of two components, i.e., U-Net generator and PatchGAN discriminator. The U-Net generator is an autoencoder with skip connections. As a consequence of the matching spatial connections between the layers, the skip connections do not require resizing or projections. The aim of the patchGAN 245 246 Generative adversarial networks for image-to-Image translation Table 10.3 Dataset description. Sno Dataset name/link Description 1 Open Image V6 2 3 Cityscapes, Edge2handbags, Edge2shoes, Facades, Maps CelebA 4 5 6 CelebA-HQ RaFD AFHQ 7 8 9 10 Monet2Photo Landscape2Photos Portrait2Photo UvA-Nemo 11 12 13 14 BAIR Robo Pushing CUB Bird VoxCeleb Oxford-102 15 COCO 9M images with image-level labels, bounding box, and segmentation masks of objects, visual relationships, and localized narratives Urban street scenes, sketches, buildings, and Google maps 202,599 number of face images with 40 annotations per image of 10,177 unique identities High-resolution images of CelebA 67 models with 8 different emotional expressions 15,000 high-resolution images of cat, dog, and wildlife Monet images Landscape images Portrait images 1240 smile videos from 400 subjects (597 spontaneous and 643 posed) 59,000 examples of robot pushing motions 6033 images of 200 bird species Short clips of video extracted from YouTube videos 102 flower categories with the number of images per class ranging between 40 and 258 COCO is a large-scale object detection, segmentation, and captioning dataset discriminator is to classify the N N patch in an image as real/fake. In addition, it has fewer parameters and is faster when compared to classifying the entire image. The advantage of pix2pix architecture is being generic and learning the objective during training without making any assumptions between two types of images. Hence, it is flexible for various situations. Human scoring and semantic segmentation are the two evaluation strategies [19, 35, 36]. Zhu et al. [20] is a notable extension of GAN architecture with two generator and two discriminator models trained simultaneously. For the simplicity of understanding, d1 and d2 are two domains. One generator takes the input images from d1 and generates images from d2. Similarly, the other generator takes the input images from d2 and outputs images from d1. The discriminators play the same role and generators update accordingly. Cycle consistency is the add-on to this architecture from the machine translation domain. It states that the phrase translated from Kannada to English should translate from English to Kannada with the same efficiency. In case of CycleGAN, the idea is to feed the output of one generator as input to the other generator and the output should be the same as the original image. The reverse is also true. The loss is calculated in two parts, i.e., forward Image generation using generative adversarial networks Table 10.4 Evaluation of different models on Cityscapes dataset. Rank Model Year Per Pixel Acc (%) Per Class Acc (%) Class IOU Paper 1 2 3 4 5 Pix2Pix CycleGAN CoGAN SimGAN BiGAN 2016 2017 2016 2016 2016 71 52 40 20 19 25 17 10 10 6 0.18 0.16 0.06 0.04 0.02 [19] [20] [22] [21] [23] and backward cycle consistency loss [20, 37]. 
Impressive applications like collection style transfer, object transfiguration, season transfer, and image generation from paintings are demonstrated as well. A few more noteworthy extensions of GAN are simGAN, coGAN, and BiGAN [21–23]. Table 10.4 summarizes the scores obtained using different flavors of GAN on the cityscapes dataset. Pictorial results are showcased in Fig. 10.7. Pix2Pix (Fig. 10.8A) and CycleGAN (Fig. 10.8B) can be employed for many real applications at hand. 10.3.2 Face generation using StarGAN Face generation is the task of generating different variations of the face from the existing dataset. The task of generating the face from the given image is related to a particular aspect, i.e., changing the color of the hair, smiling to angry, etc. The important part of face generation is the requirement of high-quality images. Choi et al. [28] use two datasets, i.e., CelebA consists of images with 40 attributes such as hair color, skin tone, etc. and RaFD consists of 8 emotional expressions per face image. The trivial versions are inefficient and less productive for translation among multidomain datasets as it would require k(k 1) generators with k value set to the number of domains. StarGAN is an efficient solution for learning the differences of various domains with the help of a single generator and discriminator. The trivial versions learn fixed translation such as black-togray hair while the StarGAN generator is fed with image and domain information, thereby learning to translate the images. The domain is either fed as a binary or onehot vector and the target domain is randomly generated during the train, giving the control and flexibility of generating the images in any domain during the test phase. Fig. 10.7 Different flavors of GANs on the Cityscapes dataset [20]. 247 248 Generative adversarial networks for image-to-Image translation Fig. 10.8 Use of GANs in the fashion industry, paintings, and realistic image generation: (A) Edges to handbags using Pix2Pix [19] and (B) CycleGAN image generation [20]. The approach allows training on multiple datasets by ignoring the unknown labels and concentrating on the necessary labels. Basically, the model learns the features of CelebA along with the RaFD dataset like happy, fearful and incorporates these emotional features on the CelebA dataset (Fig. 10.10A). Two variations have been experimented with and the results depict the success of the model, not only on multiple domains of a single dataset but also on multiple domains across multiple datasets. Image generation using generative adversarial networks The limitation of starGAN is the indication of the domain with a predetermined label as the generator receives a fixed label, thereby producing the same output image for every domain. The same set of researchers [28] came up with the scalable approach to generate images having multiple domains. In other words, the generated image can have more than one domain incorporated. This is achieved by replacing the domain information with the domain-specific style code [29]. To achieve this, two modules are in the architecture. One is the mapping network which transforms the latent code to various style codes and one of them is randomly selected during training. Another module is the style encoder with the functionality of extraction of the style code from the image. StarGAN and StarGAN V2 architecture are presented in Fig. 10.9A and B, respectively. Apart from the scalable approach and good results (Fig. 10.10B), Choi et al. 
[29] present an AFHQ Fig. 10.9 StarGAN architecture: (A) StarGAN [28] and (B) StarGAN V2 [29]. 249 250 Generative adversarial networks for image-to-Image translation Fig. 10.10 Image-to-image translation: (A) StarGAN [28] and (B) StarGAN V2 [29]. dataset consisting of high-quality animal face images having inter and intra domain differences. The dataset is publicly available. 10.3.3 Photo-realistic images using SRGAN and Art2Real Single image superresolution (SR) is the task of improving and reckoning a highresolution (HR) image using a low-resolution (LR) image. Dong et al. [38] is a major Image generation using generative adversarial networks Fig. 10.11 Left to right: Bicubic, SRResNEt, SRGAN, and original image [30]. breakthrough with the proposal of SRCNN. The model structure is simple and achieved SOTAa compared to the traditional sparse coding-based SR methods. The major unsolvable problem in SR is recovering the finer texture details when resolved at the upper scale [30]. Ledig et al. [30] proposed SRGAN with the combination of deep learning and adversarial networks. The ultimate goal of the generator is the estimation of the HR image using the LR image, achieved with the help of feed-forward CNN. On the other hand, the discriminator is similar to the VGG network with the exception of dropping off the max-pooling layer throughout the component. The work highlighted the perceptual loss function defined as the weighted sum of content loss and adversarial loss. It accounts for the loss more sensitive to human perception. Content loss is defined as the difference between the features of the generated image and original image such as PSNR (peak signal to noise ratio), MSE (mean square error), and SSIM (structural similarity index). Adversarial loss accounts for the probability that the generated image is real or fake. It is worth mentioning that no objective measures can match the level of human perception (Fig. 10.11). Hence, mean opinion score (MOS) testing was performed where 26 evaluators rated the generated images on a range of 1–5. The MOS test proved that SRGAN results are superior with SOTAa performance and measures such as SSIM, PSNR fail to estimate the image quality. D-SRGANb has also been used in the field of the Digital Elevation Model (DEM). DEM is a 3D representation of the terrain’s surface such as the Earth, moon, or asteroid (Fig. 10.12) [39]. The accuracy of the DEM models depends on various factors apart from source, horizontal and vertical precision of the elevation data. DEM data is used for prediction of soil attributes [40], developing the product, decision-making, mapping purpose, 3D simulations, river channel estimation, contour maps creation, and so on [41]. a b SOTA, State of the art. D-SRGAN, Dem-Super resolution generative adversarial network. 251 252 Generative adversarial networks for image-to-Image translation Fig. 10.12 3D rendering of a DEM of Tithonium Chasma on Mars [39]. Lakshmi and Yarrakula [41] present the research in DEM generation and various techniques developed over a decade. A variety of satellite images are available at different resolutions such as spectral, radioactive, and temporal. This reduces the time, cost, and effort required to collect DEM data. SRGAN and D-SRGAN [42] differ in network architecture with D-SRGAN outperforming other methods used in the domain of DEM with the model performing well on a flatter terrain compared to a steeper terrain. On similar lines, Tomei et al. 
[31] proposed semantic aware architecture to generate realistic images from the paintings. The architecture is based on memory banks from which realistic details are recovered at patch level (Fig. 10.13). Hence, it includes preparation of memory banks to support generation. One memory bank (Bc) consists of patches of one semantic class such as Bsky, Brock. These patches are extracted using a sliding window policy if 20% of pixels belong to the semantic class c. Once the preparation of the memory banks is complete, the painting is transformed to a realistic image by pairing generated and real patches. The generated patches are segmentation maps from the original/generated painting and the real patches are memory banks. The dataset consists of paintings from specific artists, artworks from Wikiart, landscape photos from Flickr, and people photos from CelebA. The evaluation metric is Frechet Inception Distance (FID) which measures the difference between two Gaussians and the results outperform in comparison with cycleGAN, Unsupervised Image-to-Image Translation (UNIT), Disentangled Representation for Image-to-Image Translation (DRIT), and style-transferred real methods. A lower FID conveys the realism of the generated images. Fig. 10.14 depicts the qualitative results of the Art2Real framework. Fig. 10.13 Art2Real architecture [31]. 254 Generative adversarial networks for image-to-Image translation Fig. 10.14 Qualitative results of Art2Real [31]. 10.3.4 Image animation and scene generation using monkey net, first-order motion, and StackGAN Animation in the field of computer graphics and vision is defined as the ability of generating moving images [43]. It is also referred to as computer-generated imagery [44] which usually comprises 3D computer graphics. For 3D animation, objects are rendered on the computer, whereas for 2D figure animation, separate objects are used and the animator moves the specific parts like eyes, mouth based on keyframes. Early works show the use of deep learning and GAN for deep motion transfer [45, 46]. Siarohin et al. [32] generate image animation on the target object given a driving sequence and source image. The requirement of pretrained models for object detection, ground truth data availability, animating any kind of object, and video translations from one domain to another are the few limitations of animations. To address these challenges, it introduced a three-module framework. The first module extracts the object keypoints in an unsupervised fashion. The second module generates motion heatmaps from the keypoints to encode the motion information and the third module incorporates the heatmaps and appearance from the source image to produce a short video. The novelty of the approach is image animation on arbitrary objects and a game plan of transferring the motion information by learning the pixel movement in an unsupervised manner. The architecture follows a self-learning scheme named as Monkey-Net (Fig. 10.15). The UvA-Nemo, Tai-Chai, and BAIR datasets are used for experimental purposes. The evaluation metrics are Average Keypoints Distance (AKD), Missing Keypoints Distance (MKD), AED (Average Euclidean Distance (AED), and FID. The results outperform when compared to the X2face method (Fig. 10.16). The image animation was followed by Image generation using generative adversarial networks Fig. 10.15 (A) Image animation (B) monkey architecture [32]. image-to-video translation. The qualitative evaluation was performed using amazon mechanical turk (MTurk). 
MTurk makes tasks like survey, data validation easier by assigning the tasks to distributed forces who perform them virtually. Three videos were shown to users: driving video and two videos generated using X2face and Monkey methods. The Monkey generated videos are preferred more than 80% of times over the X2face method. The weakness of Monkey-Net is the poor generation quality due to pose changes in case of large objects [33]. To overcome the weakness, the keypoints detector models complex motions with the help of local affine transformations. To improve their estimation, equivariance loss is used during keypoints training. If the driving video contains large motions, then the generator should infer the object parts from the context which are not clear in the input image. Hence, an occlusion-aware generator is set up. Most of the frameworks fail to handle HR data, whereas the experimental results on the HR dataset depict the success of the framework in image animations when compared to other SOTAa methods (Fig. 10.16). In addition, the authors have released a new Thau-Chi-HD dataset. Deep learning in combination with GANs has been used in scene generation as well. Generating quality images from the text is an interesting field of computer vision termed scene generation. Early approaches fail to capture the object and generate low-quality images. Zhang et al. [34] presented a novel approach to generate 256 256 quality images conditioned on text data. The idea is to decompose the problem into subproblems by extracting the primitive shape and colors of the object, thereby producing LR images with the help of given words. In the next stage, HR realistic images are produced using the generated LR images and text descriptions. This is achieved by stacking up two GANs (Fig. 10.17). The first step is converting the text to embedding. The less text description results in the discontinuous latent data, thereby behaving as an obstacle for the generator. Hence, the embedding is condition augmented which is combined with noise to generate an image. The discriminator generates the decision score. The rough LR images along with the text embedding are fed to the second GAN with the generator following the encoderdecoder network with residual blocks. CUB, Oxford-102, and MS COCO datasets have been used for experimental purposes with inception score as evaluation metrics. The results generate realistic images when compared to the existing methods (Figs. 10.18 and 10.19). 255 Fig. 10.16 Image animation: reference video (first row), X2face (second row), Monkey-Net (third row), and first-order model (fourth row) [33]. Fig. 10.17 StackGAN architecture [34]. Fig. 10.18 Scene generation using different methods on the CUB dataset [34]. Fig. 10.19 Scene generation on the Oxford-102 and COCO datasets [34]. 260 Generative adversarial networks for image-to-Image translation 10.4 Future of GANs Research in GANs is soaring in image and video generation to a great extent. GANs are a perfect setup to govern the generated samples belonging to the distribution of interest with the help of the adversarial discriminator. The results proved to synthesize facial expressions, swapping the horse with a zebra, fashion industry, paintings to realistic images, animation and scene generation compared to the other SOTA1 methods. A statement by Facebook AI Research—“GANS are the most interesting idea of the decade”—is somewhat true and experienced to the present day. 
A good amount of research is currently ongoing in natural language processing and GANs. In addition, the possible future applications are text, audio, and music generation, drug discovery, and medical imaging. With the amount of research ongoing in various fields, GANs remain a promising and prominent solution. References [1] A.Y. Ng, M.I. Jordan, On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes, in: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, 2001, pp. 841–848. [2] M.M. Khapra, CS7015 (Deep Learning): Lecture 22 Autoregressive Models (NADE, MADE), 2020, [Online] Available at: https://www.cse.iitm.ac.in/miteshk/CS7015/Slides/Handout/Lecture22. pdf. (Accessed 6 July 2020). [3] A.V.D. Oord, N. Kalchbrenner, K. Kavukcuoglu, Pixel Recurrent Neural Networks, 2016. arXiv 2016. arXiv preprint arXiv:1601.06759. [4] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, Conditional image generation with pixelcnn decoders, in: Advances in Neural Information Processing Systems, 2016, pp. 4790–4798. [5] T. Salimans, A. Karpathy, X. Chen, D.P. Kingma, Pixelcnn++: Improving the Pixelcnn with Discretized Logistic Mixture Likelihood and Other Modifications, 2017. arXiv preprint arXiv:1701.05517. [6] A. Dertat, Applied Deep Learning—Part 3: Autoencoders, 2020, [Online] Available at: https:// towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798. (Accessed 6 July 2020). [7] Wikipedia Contributors, Autoencoder—Wikipedia, the Free Encyclopedia, 2020, [Online] Available at: https://en.wikipedia.org/w/index.php?title¼Autoencoder&oldid¼957588290. (Accessed 6 July 2020). [8] D.P. Kingma, M. Welling, An Introduction to Variational Autoencoders, 2019. arXiv preprint arXiv:1906.02691. [9] K. Frans, Variational Autoencoders Explained, 2020, [Online] Available at: http://kvfrans.com/ variational-autoencoders-explained. (Accessed 6 July 2020). [10] J. Walker, C. Doersch, A. Gupta, M. Hebert, An uncertain future: forecasting from static images using variational autoencoders, in: European Conference on Computer Vision, Springer, Cham, 2016, October, pp. 835–851. [11] G. Brunner, A. Konrad, Y. Wang, R. Wattenhofer, MIDI-VAE: Modeling Dynamics and Instrumentation of Music with Applications to Style Transfer, 2018. arXiv preprint arXiv:1809.07600. [12] Q. Zhao, E. Adeli, N. Honnorat, T. Leng, K.M. Pohl, Variational autoencoder for regression: Application to brain aging analysis, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Cham, 2019, October, pp. 823–831. Image generation using generative adversarial networks [13] J. An, S. Cho, Variational Autoencoder Based Anomaly Detection Using Reconstruction Probability, 2015. [14] I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A.C. Courville, Y. Bengio, Generative Adversarial Networks, 2014. arXiv preprint arXiv:1406.2661. [15] CFI Education Inc, Nash Equilibrium, 2020, [Online] Available at: https://corporatefinanceinstitute. com/resources/knowledge/economics/nash-equilibrium-game-theory/. (Accessed 6 July 2020). [16] J. Chen, Nash Equilibrium, 2020, [Online] Available at: https://www.investopedia.com/terms/n/ nash-equilibrium.asp#::text¼The%20Nash%20equilibrium%20in%20this,one%20prisoner’s% 20outcome%20is%20worse. (Accessed 6 July 2020). [17] J. Hui, Gan—Why It Is So Hard to Train Generative Adversarial Networks! 
2020, [Online] Available at: https://medium.com/@jonathan_hui/gan-why-it-is-so-hard-to-train-generative-advisorynetworks-819a86b3750b. (Accessed 6 July 2020). [18] M. Mirza, S. Osindero, Conditional Generative Adversarial Nets, 2014. arXiv 2014. arXiv preprint arXiv:1411.1784. [19] P. Isola, J.Y. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134. [20] J.Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232. [21] M. Liu, O. Tuzel, Coupled Generative Adversarial Networks, 2016. arXiv 2016. arXiv preprint arXiv:1606.07536. [22] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, R. Webb, Learning from simulated and unsupervised images through adversarial training, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2107–2116. [23] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, A. Courville, Adversarially Learned Inference, 2016. arXiv preprint arXiv:1606.00704. [24] M. Pasini, 10 Lessons I Learned Training GANs for One Year, 2020, [Online] Available at: https:// towardsdatascience.com/10-lessons-i-learned-training-generative-adversarial-networks-gans-for-ayear-c9071159628. (Accessed 6 July 2020). [25] Y. Qin, N. Mitra, P. Wonka, How Does Lipschitz Regularization Influence GAN Training?, 2018. arXiv preprint arXiv:1811.09567. [26] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved Techniques for Training Gans, 2016. arXiv 2016. arXiv preprint arXiv:1606.03498. [27] A.B. Larsen, S.K. Sønderby, H. Larochelle, O. Winther, Autoencoding Beyond Pixels Using a Learned Similarity Metric, 2016. arXiv preprint arXiv:1512.09300. [28] Y. Choi, M. Choi, M. Kim, J.W. Ha, S. Kim, J. Choo, Stargan: unified generative adversarial networks for multi-domain image-to-image translation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8789–8797. [29] Y. Choi, Y. Uh, J. Yoo, J.W. Ha, Stargan v2: diverse image synthesis for multiple domains, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 8188–8197. [30] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, W. Shi, Photo-realistic single image super-resolution using a generative adversarial network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690. [31] M. Tomei, M. Cornia, L. Baraldi, R. Cucchiara, Art2real: unfolding the reality of artworks via semantically-aware image-to-image translation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5849–5859. [32] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, N. Sebe, Animating arbitrary objects via deep motion transfer, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2377–2386. 261 262 Generative adversarial networks for image-to-Image translation [33] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, N. Sebe, First order motion model for image animation, in: Advances in Neural Information Processing Systems, 2019, pp. 7137–7147. [34] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, D.N. 
Metaxas, Stackgan: text to photorealistic image synthesis with stacked generative adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5907–5915. [35] C. Shorten, Pix2Pix, 2020, [Online] Available at: https://towardsdatascience.com/pix2pix869c17900998. (Accessed 6 July 2020). [36] Neurohive, Pix2pix—Image-to-Image Translation Neuralnetwork, 2020, [Online] Available at: https://neurohive.io/en/popular-networks/pix2pix-image-to-image-translation/. (Accessed 6 July 2020). [37] J. Brownlee, A Gentle Introduction to Cyclegan for Image Translation, 2020, [Online] Available at: https://machinelearningmastery.com/what-is-cyclegan/. (Accessed 6 July 2020). [38] C. Dong, C.C. Loy, K. He, X. Tang, Image super-resolution using deep convolutional networks, IEEE Trans. Pattern Anal. Mach. Intell. 38 (2016) 295–307. [39] Wikipedia Contributors, Digital Elevation Model—Wikipedia, the Free Encyclopedia, 2020, [Online] Available at: https://en.wikipedia.org/w/index.php?title¼Digital_elevation_model& oldid¼962013302. (Accessed 6 July 2020). [40] J. Thompson, J. Bell, C. Butler, Digital elevation model resolution: effects on terrain attribute calculation and quantitative soil-landscape modeling, Geoderma 100 (2001) 67–89, https://doi.org/ 10.1016/S0016-7061(00)00081-1. [41] S. Lakshmi, K. Yarrakula, Review and critical analysis on digital elevation models, Geofizika 35 (2019) 129–157, https://doi.org/10.15233/gfz.2018.35.7. [42] B.Z. Demiray, M. Sit, I. Demir, D-SRGAN: DEM Super-Resolution With Generative Adversarial Network, 2020. arXiv preprint arXiv:2004.04788. [43] Wikipedia Contributors, Computer Animation—Wikipedia, the Free Encyclopedia, 2020, [Online] Available at: https://en.wikipedia.org/w/index.php?title¼Computer_animation&oldid¼966153819. (Accessed 6 July 2020). [44] Wikipedia Contributors, Computer-Generated Imagery—Wikipedia, the Free Encyclopedia, 2020, [Online] Available at: https://en.wikipedia.org/w/index.php?title¼Computer-generated_imagery& oldid¼962950245. (Accessed 6 July 2020). [45] O. Wiles, A. Sophia Koepke, A. Zisserman, X2face: a network for controlling face generation using images, audio, and pose codes, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 670–686. [46] A. Bansal, S. Ma, D. Ramanan, Y. Sheikh, Recycle-gan: unsupervised video retargeting, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 119–135. CHAPTER 11 Generative adversarial networks for histopathology staining Aashutosh Ganesha,b and Koshy Georgea,c,d a PES Center for Intelligent Systems, PES University, Bangalore, India Radboud University, Nijmegen, The Netherlands c Department of Electronics and Communication Engineering, PES University, Bangalore, India d SRM University—AP, Guntur District, Andhra Pradesh, India b 11.1 Introduction Generative adversarial networks (GANs), a type of deep learning proposed in Ref. [1], consist of two networks, the generator and the discriminator. The former belongs to the class of generative or forward models, which depends on unsupervised learning to determine the distribution of the training data, and the latter belongs to the class of discriminative or backward models that ascertains the decision boundaries via supervised learning [2]. (Generative modeling and some applications are treated in Ref. [3]. Some recent books on GANs are Refs. [4, 5].) 
While generative methods model class-conditional distributions and prior probabilities, discriminative methods estimate posterior probabilities without explicitly modeling the probability distributions. Note that the discriminator is a classifier. In the context of GANs, it attempts to distinguish between real data and the data created by the generator. Albeit it is possible to arrive at several generativediscriminative pairs, what makes GANs unique is that the generator and the discriminator are pitted against each other in a two-player game seeking to find the Nash equilibrium [6]. Several discriminators have been proposed that successfully map a high-dimensional input to a class label [7–9]. This has been made possible due to the back-propagation algorithm and the use of piecewise linear activation functions with well-behaved gradients [10–12]. Dropout algorithms [13–15] have also contributed to this success. A GAN essentially is a procedure for creating generators that mitigates the difficulties faced in creating meaningful generator-discriminator models. These obstacles include approximating intractable probabilistic computations and leverage piecewise linear activation functions. GANs are similar to variational autoencoders (VAE) [16, 17] in that both approaches are used to determine the distribution of data using unsupervised learning. Accordingly, both have two networks. While the decoder is generative, the encoder is a recognition model. Such an approach leads to an intractable distribution which then has to be Generative Adversarial Networks for Image-to-Image Translation https://doi.org/10.1016/B978-0-12-823519-5.00010-5 Copyright © 2021 Elsevier Inc. All rights reserved. 263 264 Generative adversarial networks for image-to-Image translation approximated by another tractable distribution and then use the method of variational inference. The different approach adopted makes GANs better than VAEs. Suppose that G and D are the two feed-forward neural networks, respectively, representing the generative and discriminative networks. In a GAN, G and D simultaneously participate in the following two-player game: min max fx ½log DðxÞ + z ½log f1 DðGðzÞÞgg G D (11.1) An input prior p(z) is first defined, and then mapped to the space GðzÞ. While the discriminator maximizes the probability of assigning the correct label to training examples (real data) and minimizes the probability to samples from G, the generator G is trained to maximize the probability assigned by the discriminator to those samples it generates. Thus, G attempts to capture the data distribution and D estimates the probability that a sample came from the training data rather than from G. The applications of GANs have been quite varied, and includes image segmentation, text-to-image synthesis, and high-resolution image generation [18–20]. (A recent survey of applications is available in Ref. [21].) In particular, GANs have been found useful in medical imaging; see Refs. [22–25]. Deep learning and GANs have helped in the automation of diagnostics of diseases such as breast cancer and gastrointestinal disease, segmentation of nuclei, image reconstruction, and image translation of X-ray image to CT scans [26–31]. This has been made possible largely due to strides in computing power, storage capacity, and image capture techniques [32]. Histology and histopathology are the careful study of microscopic tissues. 
Histopathology is important for diagnosis and is considered a gold standard; for example, it is required for cancer diagnosis, where the microscopic tissue is analyzed by a pathologist. A fundamental step in histopathology is staining caused by chemical reactions induced in the tissue under analysis, and results in accentuated features that help in diagnosis. The stains range from the commonly used hematoxylin and eosin (H&E) stain—devised independently by Wissowzky in 1876 and Busch in 1877 [33]—to the relatively rare Grocott-G€ om€ ori methenamine silver (GMS) stain proposed by G€ om€ ori in 1946 [34]. Different stains affect the tissue on a slide distinctively thereby highlighting particular features for the pathologist [35]. Whenever required, developing diverse stained histopathological slides of the same tissue sample is a parallel, laborious, and a time-consuming process. Moreover, it is subject to human error. Evidently, histology staining and histological analysis are cumbersome processes, where automation can be beneficial to diagnosticians. With recent developments in deep learning, accelerated computing, and storage, histological image analysis has had some transformative changes. In Ref. [36], breast cancer classification has an accuracy of over 98.4% with a recurrent patch-based convolutional neural network (CNN). GANs have showcased its usefulness in histopathology: stain Generative adversarial networks for histopathology staining normalization is introduced in Ref. [37], InfoGAN [38] and WGAN [39] are used for feature extraction in Ref. [40], and synthetic histopathology image generation is discussed in Ref. [41]. The process of histology staining has also shown some scope for automation, as illustrated in Refs. [42, 43], where histopathology staining is achieved through style transfer [44] or through residual GANs [45]. From a machine learning perspective, each stain results in a different feature space, and a transformative network has the ability to transform one space to another. The latter is a classic image-to-image translation problem. Thus, we can frame the problem of transforming one stained tissue to another as an image-to-image translation problem [46]. In this chapter, we consider the problem of transformation of a feature space corresponding to one stain to another posed as an image-to-image translation problem, and present a solution based on GANs. Specifically, the use case and limits of the image-to-image translation utilizing GANs are demonstrated here using the images from the Automatic Nonrigid Histological Image Registration (the ANHIR) challenge dataset [47–52]. (ANHIR challenge was part of the IEEE International Symposium on Biomedical Imaging [ISBI] in 2019, where the call was to register tissues across different samples for large images.) Histology staining in this chapter is framed as a domain adaptation problem for each stain. The dataset consists of various tissues with different types of stains per tissue. This challenge requires the tissues to be registered for a given pair of input and target images. Registration is an important task in medical imaging as it allows diagnosticians to extract more information from one image than they typically do from single samples. Histology registration postalignment allows the viewer to see the information from multiple stains on the same sample. 
However, since these samples of differently stained tissues are not readily available, there is a potential application of converting one tissue stain to another stain type. As mentioned earlier, GANs have proved effective in various image generation tasks such as segmentation and synthetic data generation. The ANHIR dataset is utilized here to demonstrate this domain adaptation problem wherein a stained histology image leads to a histology image with a different stain. In this chapter, we discuss the details of the implementation. Specifically, the preparation of the dataset and the methodology to solve an image-to-image translation problem are discussed. Moreover, the efficacy of GANs when the number of available images is relatively small is showcased and we illustrate some techniques that yield better performance. The results of this chapter are based on an implementation of the code primarily done in python, specifically TensorFlow [53]. Due to constraints on available datasets, it must be emphasized that the suggested methodology may not be completely clinically viable as yet. This chapter is organized as follows. GANs are presented in Section 11.2. In this section, we present the vanilla GAN and other variations relevant in our context, the objective functions considered for optimization, and the image-quality metrics. 265 266 Generative adversarial networks for image-to-Image translation The image-to-image transformation problem is discussed in Section 11.3. Histology is outlined in Section 11.4. The networks and the dataset used in this chapter are described in Section 11.5, and the results presented and discussed in Section 11.6, followed by conclusions in Section 11.7. 11.2 Generative adversarial networks The vanilla GAN introduced in Ref. [1] ensured that the generator-discriminator pair played a two-player min-max game seeking the Nash equilibrium (a saddle point), as described in Eq. (11.1). Both DðGðzÞÞ and DðxÞ represent probabilities. A trained discriminator D is such that it maximizes the probability DðxÞ for an image x that belongs to the input distribution. From a prior distribution p(z), a sample z is input to G. The resulting output GðzÞ of an untrained generator evidently does not belong to the input distribution and hence rightly classified as a fake image by the trained discriminator; that is, the value of DðGðzÞÞ is nowhere near unity. The generator G is trained so that the probability DðGðzÞÞ is maximized. Essentially, D tries to reject these images as fake while G attempts to fool D into thinking they are real. Eventually, G learns sufficiently to generate samples that correspond to the distribution of the real data. 11.2.1 Improvements to vanilla GAN Albeit the vanilla GAN is comparatively better than its contemporaries, a number of issues can affect its performance. Some of the issues are as follows: First, it is likely that the generator network does not improve as fast as the discriminator network causing the former to output less than ideal images. Second, the generator produces samples from a limited class; this issue is called mode collapse. Third, the networks are trained using the back-propagation algorithm and hence require the computation of gradients. The problem of unstable gradients, vanishing gradients associated with Kullback-Leibler (KL) [54] and Jensen-Shannon (JS) [55] divergences have been well reported. 
Hyperparameter optimization is essential to strike a balance between the generator and discriminator, wherein one does not improve at a rate that the other cannot keep up with. Several improvements to the architecture, batch training, and techniques to avoid mode collapse have been proposed [56]. In this chapter, we suggest the improvements required for the image-to-image translation problem.

11.2.2 Deep convolutional GANs

Introduced by Radford et al. [57], the deep convolutional GAN (DCGAN) showcased a marked improvement in generating natural images from multiple modalities of real-world data. The DCGAN adopted the following steps to improve its efficacy: strided convolutions as opposed to downsampling through max pooling [58]; batch normalization [59] to ensure zero mean and unit variance; the use of the leaky ReLU activation function for the discriminator; and the use of the inception score to evaluate the efficacy. In particular, DCGAN improved the generation of images from the ImageNet, faces, and CIFAR-10 datasets [56, 57, 60]. However, for the specific application in medical imaging, there is some more room for improvement.

11.2.3 Variations in optimization functions

As implementations of GANs are susceptible to issues in the gradients and in the quality of generated images, better results may be obtained by varying the performance objective. These functions are used during the training of GANs. Some possibilities are listed here:

1. The L2 loss function was introduced in Ref. [61] to overcome the problem of vanishing gradients, especially for those fake samples sufficiently far from the real data but classified correctly. Since the least-squares loss function penalizes such samples, the least-squares GAN (LSGAN) attempts to generate samples closer to the real data. The suggested objective functions for the 0–1 coding scheme are as follows:

J_D = \mathbb{E}_x[(D(x) - 1)^2] + \mathbb{E}_z[(D(G(z)))^2]   (11.2)

J_G = \mathbb{E}_z[(D(G(z)) - 1)^2]   (11.3)

2. In order to minimize the pixel-wise distance between the generated and target images, the mean square error (MSE) [42] is often used in the objective function. Let x be the input image, \hat{x} the output image of the generator, and y the target image. Then

J_D = \mathbb{E}[\log(1 - D(\hat{x}))] + \mathbb{E}[\log D(y)]   (11.4)

J_G = -\mathbb{E}[\log D(\hat{x})] + \lambda\, E_{MSE}(\hat{x}, y)   (11.5)

where \lambda is a regularization parameter and

E_{MSE}(A, B) = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} (a_{ij} - b_{ij})^2   (11.6)

Here, A and B are two images of dimensions N \times M pixels, and a_{ij} and b_{ij}, respectively, are the values of the pixels in the (i, j)th position in images A and B.

3. Instead of MSE, some researchers prefer to use the mean absolute error (MAE) [62]. This also has an effect on the pixel-wise distance between the target and output images:

J_D = \mathbb{E}[\log(1 - D(\hat{x}))] + \mathbb{E}[\log D(y)]   (11.7)

J_G = -\mathbb{E}[\log D(\hat{x})] + \lambda\, E_{MAE}(\hat{x}, y)   (11.8)

where \lambda is a regularization parameter and

E_{MAE}(A, B) = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} |a_{ij} - b_{ij}|   (11.9)

The other quantities are as defined earlier.

11.2.4 Image-quality metrics

We digress briefly to explore image-quality metrics to measure the efficacy of GANs. This is due to an inherent problem in generative modeling, and to a degree in unsupervised learning: there is no single objective function whose value directly reflects the quality of the generated samples. In the case of an image-to-image translation problem, a reasonable method of measuring network performance is image-quality metrics. The following image-quality metrics are used here to evaluate the performance of GANs in our context.
1. Suppose that A and B are two images of dimensions N \times M pixels. Let a_{ij} and b_{ij}, respectively, be the values of the pixels in the (i, j)th position in images A and B. Then, the pixel-wise MSE metric [63] between the two images A and B is as defined earlier, and is repeated here for convenience:

E_{MSE}(A, B) = \frac{1}{NM} \sum_{i=1}^{N} \sum_{j=1}^{M} (a_{ij} - b_{ij})^2   (11.10)

Evidently, the smaller the value of MSE, the closer are the two images A and B with reference to this measure. However, it may be noted that this metric may not correlate well with subjective analysis of quality. In our context, images A and B, respectively, correspond to the target image and the output of the GAN.

2. The peak signal-to-noise ratio (PSNR) is derived from the MSE metric E_{MSE}(A, B). Similar to MSE, PSNR also does not correlate well with human quality assessment. This metric [63] is computed as follows:

E_{PSNR}(A, B) = 10 \log_{10}\!\left(\frac{255^2}{E_{MSE}(A, B)}\right)   (11.11)

A and B are closer with respect to PSNR if the corresponding measure is large. Clearly, the smaller the value of MSE, the larger the value of PSNR.

3. The structural similarity index (SSI) [63] primarily deals with three aspects of similarity—luminance, contrast, and structure. While luminance is defined as the brightness of the image, the contrast is the difference in luminance divided by the average luminance of the image. If A and B are two images, this metric is computed as follows:

L(A, B) = \frac{2 \mu_A \mu_B + c_1}{\mu_A^2 + \mu_B^2 + c_1}   (11.12)

C(A, B) = \frac{2 \sigma_A \sigma_B + c_2}{\sigma_A^2 + \sigma_B^2 + c_2}   (11.13)

S(A, B) = \frac{\rho_{AB} + c_3}{\sigma_A \sigma_B + c_3}   (11.14)

E_{SSI}(A, B) = L^p(A, B)\, C^q(A, B)\, S^r(A, B)   (11.15)

In these equations, \mu_A and \mu_B are, respectively, the average values of the pixels in A and B, \sigma_A^2 and \sigma_B^2 the corresponding variances, and \rho_{AB} is the correlation coefficient between A and B. The constants c_1, c_2, and c_3 are introduced to avoid division by zero, or near-division by zero, and the constants p, q, and r represent the relative importance of the three components. The measure SSI is considered to be related to perception in the human visual system. Its values lie in the interval [0, 1]; a value of unity indicates that the two images are the same.

In the context of image compression, a comparison of these metrics is available in Ref. [64], where an analytical link between PSNR and SSI has been shown for common degradations in an image, including additive Gaussian noise and Gaussian blur. We note that there are full-reference, partial-reference, and nonreference methods of image comparison. In the context of medical image comparison, full-reference methods are a stronger indicator of network performance. The aforementioned metrics belong to this class of methods.

11.3 The image-to-image translation problem

The task of image-to-image translation is one of the prominent developments in deep learning applied to image processing. This task can be described as one that transforms the feature space of an image into another. Applications include transforming aerial photographs to maps, removal of background from images, and colorization of black-and-white images. Generative networks have been found to be rather useful for this. The goal of the network is to learn the map between the input and target images to suitably transform the former to the latter. That is, if U is the set of input images and T is the set of target images, the trained generator network should be such that G : U → T
and the distribution of images G(U) is indistinguishable from the distribution of images in T. CNNs have showcased remarkable results in several fields including medical image classification. An evident drawback of this class of networks is the requirement for large quantities of data to realize the full potential of CNNs. Moreover, some problems in medical imaging require pixel-level classification; for example, medical image segmentation. The U-net architecture builds upon the fully convolutional network [26]. The expansive and contracting paths are somewhat symmetric, leading to a U-shaped architecture. Some of the principal differences relative to typical CNNs are that the pooling layers are replaced with upsampling layers and successive convolutional layers have a large number of filters. In addition, extensive data augmentation is adopted to compensate for smaller datasets. We note that the U-net is essentially an autoencoder (AE). In general, AEs are networks utilized to learn encodings, and they are predominantly used in unsupervised learning [65]. They consist of an encoder and a decoder, the former to encode the data distribution to a latent space or bottleneck and the latter to decode this latent space. AEs are typically used for principal component analysis.

Conditional GANs (CGANs) [66] are explored in Ref. [62] to deal with the image-to-image translation problem. Here, the pair of networks learns a conditional generative model [1], making them suitable to tackle such problems. In Ref. [62], the generator is the U-net architecture described earlier, which allows it to encode the images into a bottleneck. The discriminator is a convolutional PatchGAN classifier, which allows the network to perform patch-level classification. This encourages the generator to learn patch-level features. In contrast to regular GANs, wherein the generative model learns a mapping from the random noise vector z to an output image y = G(z), CGANs learn the mapping from the input image and the noise vector to the output image: y = G(x, z). The objective function for the CGAN can therefore be described as

J_{G,D} = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]   (11.16)

The optimal CGAN is then the following:

\arg\min_G \max_D J_{G,D}

This can be regularized as follows:

\arg\min_G \max_D J_{G,D} + \lambda\, \mathbb{E}_{x,y,z}[\| y - G(x, z) \|_1]   (11.17)

where the L1 metric is preferred over the L2 metric to mitigate the effect of blurring [67]. The discriminator that we use is the convolutional PatchGAN classifier, originally proposed in Ref. [68]. The generator with the L1 distance function accurately captures the low frequencies. Accordingly, the GAN discriminator needs to only enforce correctness at the higher frequencies. To accomplish this, it is sufficient to restrict attention to the structure in local image patches, leading to the terminology PatchGAN, which only penalizes structure at the scale of patches. Thus, the discriminator classifies whether or not each P × P patch in an image is real or fake. The classifier is run convolutionally across the image, and the responses are averaged to yield the discriminator output. PatchGAN is computationally efficient in that it has fewer parameters, runs faster, and can be applied to images of arbitrary sizes. The mathematical background is that it models the image as a Markov random field, assuming that pixels separated by more than the patch dimensions are statistically independent.
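A compact sketch of the regularized objective in Eqs. (11.16), (11.17) is shown below. It follows the pix2pix-style formulation with a small PatchGAN discriminator; the layer counts, the weight LAMBDA, and the conditional (paired-input) discriminator form are assumptions for illustration and are not necessarily the exact configuration used in this chapter (the chapter's own discriminator, listed in the Appendix, takes a single image as input).

```python
import tensorflow as tf
from tensorflow.keras import layers

LAMBDA = 100.0  # weight of the L1 term in Eq. (11.17); value assumed, not from the chapter
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def build_patchgan_discriminator():
    """Convolutional PatchGAN: maps an (input, target) pair to a grid of real/fake logits."""
    inp = layers.Input(shape=(256, 256, 3))
    tar = layers.Input(shape=(256, 256, 3))
    x = layers.Concatenate()([inp, tar])          # condition on the input image, as in a CGAN
    for filters in (64, 128, 256):
        x = layers.Conv2D(filters, 4, strides=2, padding='same')(x)
        x = layers.LeakyReLU(0.2)(x)
    patch_logits = layers.Conv2D(1, 4, padding='same')(x)   # one logit per local patch
    return tf.keras.Model([inp, tar], patch_logits)

def generator_loss(disc_fake_logits, gen_output, target):
    adv = bce(tf.ones_like(disc_fake_logits), disc_fake_logits)   # fool the discriminator
    l1 = tf.reduce_mean(tf.abs(target - gen_output))              # ||y - G(x, z)||_1 term
    return adv + LAMBDA * l1

def discriminator_loss(disc_real_logits, disc_fake_logits):
    real = bce(tf.ones_like(disc_real_logits), disc_real_logits)  # D(x, y) -> real
    fake = bce(tf.zeros_like(disc_fake_logits), disc_fake_logits) # D(x, G(x, z)) -> fake
    return real + fake
```

Applying the loss over the grid of patch logits (equivalently, averaging the per-patch responses) is what restricts the discriminator's attention to local structure, as described above.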
This idea has been successfully used for both texture synthesis and style transfer. Thus, the end-to-end pipeline for the image-to-image translation problem uses an autoencoder (U-net) as the generator and a PatchGAN discriminator. The latter is trained to distinguish the real images (actual data) from the fake or synthesized images. The goal of the autoencoder is to transform one domain space into another by utilizing a learned transformation encoded in the latent space. In contrast, the objective of the discriminator is to distinguish between real and fake images.

11.4 Histology and medical imaging

As referenced in previous sections, histology is the examination and study of microscopic structures present on tissues, and histopathology is the application of this examination to diagnosis [69]. The overall goal of histopathological analysis is to understand and establish the relationship between the structures present on the tissue and the affliction of the subject under analysis. Histopathological analysis is primarily qualitative, where certified pathologists comb through the tissue slide images directly from the microscope, or through the scanned whole slide images, to determine the affliction. A fundamental step in histology is staining, where the stains react with the cell structures and chemical compounds present on the slide to accentuate the features relevant for diagnosis. The information inherently present on the tissue is not immediately visible, requiring a pathologist to use reagents to provide contrast; this allows the tissue to be evaluated better. Different stains are utilized depending on the region of interest, where different tissue sections react differently. The entire process behind histopathology, from analysis to diagnosis, is necessary but time-consuming. Analysis of histology primarily revolves around structures present on the extracted sample. Important precursory steps before analysis are the biopsy of the tissue, fixing of the tissue, and dyeing/staining; the last step is the relevant one in our discussion. The dyes reveal cellular structures, and counterstains are used for contrast. The most common stain used is hematoxylin and eosin (H&E), since it is relatively quick to apply and stains a large number of cell types well. However, in some cases, it does not provide the contrast required. For such cases, different types of stains are used depending on the affliction. Histopathology additionally examines the extent of the affliction, called disease grading. It is very useful in distinguishing between different subtypes of diseases, especially in cancers. The scope for automation is evident; companies such as Leica have introduced physical batch stainers that automate the staining process [70, 71]. The scope for automation is quite high for histopathology image analysis as well. With techniques such as whole slide scanning, it has become easier to devise algorithms that are able to comb through the image for diagnostic purposes. While most of the past medical image analysis has centered on cytology, histopathology serves as the gold standard for diagnostics, especially for cancers. While histopathology provides a rich feature space, there are challenges in automating its analysis. For this chapter, we will primarily be examining histopathology stains as different feature spaces.
11.4.1 Histology as different feature spaces

Since stains are reagents that interact with cells present on the tissue, they produce different features in that they accentuate structures based on the underlying chemical reactions. H&E clearly dyes structures such as the cytoplasm, nuclei, organelles, and extracellular components. It aids in the diagnosis of diseases based on the organization of the tissue. The H&E chemicals are basic and acidic, respectively, and they operate on the cell nucleus and on the cytoplasm and cell walls, respectively. Specialized stains have been developed that deal with sections not normally dyed by H&E. For example, Masson's trichrome stains connective tissue, where the basic structures are stained blue. Alcian blue stains heavy proteins such as mucin blue. With the established fact that each stain provides different information to the diagnosticians, each of them can be considered to be a different feature space. Special stains are sometimes used in conjunction with routine stains to extract more information from the tissue. However, there is scope for error in fixing and staining the sample. Additionally, these special stains require expensive chemicals. Therefore, there is a use case for generating the newly stained tissue through image processing and machine learning-based methods. In summary, the stains generate unique features, and these unique features are required for diagnosis. Moreover, conventional staining is prone to human error, especially during redyeing of the tissue, and is time consuming. Further, the process is expensive. Thus, an automated system to generate a reasonable approximation of the stained tissue has the potential to save resources. Accordingly, pathologists need to resort to conventional methods only when required.

11.5 Network architecture and dataset

The methodology adopted in this chapter to use GANs for histology staining is described here. The principal issue is the unavailability of a large dataset of unstained tissues. Therefore, our goal is virtual staining of a histology slide, which transforms a slide from one feature space into another. As mentioned earlier, virtual staining has been explored through style transfer in Ref. [43] and GANs in Ref. [42]. The fundamental difference between Ref. [42] and this study is the choice of viewing tissue samples. While Ref. [42] utilizes autofluorescent images to generate bright-field images, we attempt here to learn a map between two bright-field images. Additionally, we examine the transformation between two types of stains rather than stain a sample from its unstained equivalent image. We again emphasize that unstained images are usually not available. We use the ANHIR dataset to showcase the efficacy of GANs for histology staining. Specifically, we demonstrate that an image of a tissue stained with one chemical is transformed to an image of the same tissue stained with a different chemical.

11.5.1 ANHIR dataset

The Automatic Nonrigid Histological Image Registration (ANHIR) dataset is designed for image registration of large-scale images. Registration is required to align multiple tissue sections into one representation in 3D space, where the individual images are stacked upon each other. This allows a pathologist to extract more information from the tissue samples from multiple features and biomarkers. The characteristics of the dataset are depicted in Table 11.1.
This dataset has a variety of tissues from different organs, and the tissues are stained with a diverse set of histological stains. We note that the ANHIR dataset provides more stains than indicated in Table 11.1. (For example, images of an H&E-stained and an IHC-stained lung lesion tissue are both available. In addition, the dataset provides images with CD10, CD31, and Ki67 stains, which are not considered here.) Moreover, the dataset also provides different levels of magnification for each tissue, ranging from 10× to 40×. We opt for higher magnification in order to obtain a larger image dataset to achieve our objective.

Table 11.1 The ANHIR dataset.
Tissue | Stains | Magnification | Average size
Lung lesion | H&E, IHC | 40× | 18k × 15k
Lung lobe | H&E, CD31 | 10× | 11k × 6k
Mammary gland | H&E, ER, PR | 40× | 12k × 4k
Mice-kidney | PAS, SMA, CD31 | 20× | 37k × 30k
Colon adenocarcinoma | H&E, IHC, CD | 10× | 60k × 50k
Gastric adenocarcinoma | H&E, IHC | 40× | 60k × 75k
Breast tissue | H&E, IHC, PR | 40× | 65k × 60k
Human kidney | H&E, PAS, MAS | 40× | 18k × 55k

A brief explanation of the stains and their effects is as follows:
• CD31 is a type of immunohistochemical (IHC) stain that targets a protein mediating cell-to-cell adhesion. It is typically used in tonsils, skin, liver, and kidneys.
• Estrogen receptor (ER) antibody stains target proteins activated by the estrogen hormone. They are extensively used in breast carcinoma.
• Progesterone receptor (PR) antibody stains are used for breast cancer detection and are typically used in conjunction with ERs.
• Periodic acid-Schiff (PAS) stains are used for detecting the presence of carbohydrates in tissues such as connective tissues, mucus, etc. These are used in the diagnosis of diseases such as glycogen storage disease.
• Masson's trichrome (MAS) stain is a three-color staining procedure. It produces red for keratin and muscle fibers, blue and green for collagen, light red or pink for cytoplasm, and dark brown to black for cell nuclei. This stain is typically used in cardiac, kidney, muscular, and hepatic pathologies.
• Smooth muscle actin (SMA) is a special type of stain, which is typically used in specialized cancer diagnostics.
Further details about these stains can be found in Refs. [72, 73]. As indicated in the table, the available magnifications and the average size of the images vary. For the purpose of this chapter, we choose a subset of the ANHIR dataset as indicated in Table 11.2. The number of available images is also indicated.

Table 11.2 Subset of the ANHIR dataset.
Tissue | Stains | Scaling | Number of images
Lung lesion | H&E, CD31 | 50 | 1250
Kidney | H&E, MAS | 25 | 368
Lung lobe | H&E, CD31 | 25 | 151

11.5.2 Dataset preparation

The tissues and stains considered here are listed in Table 11.2. The whole slide scan of the histological image is first divided into smaller sections of size 256 × 256. The pixel values in each image are scaled to lie within the range [−1, 1]; this increases the numerical stability during training. (Moreover, it has been our experience that using floating-point arithmetic is better.) A short code sketch of this preparation is given below.

11.5.3 Network architectures

The architectures for the generator and discriminator have been arrived at after several experiments. The networks are based on the U-net for the generator and the PatchGAN architecture for the discriminator [62]. This architecture maps closely to CGANs. The generator architecture includes convolutional layers with skip connections. The number of filters is 64, 128, 256, 512, and 1024.
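As noted in Section 11.5.2, the patch extraction and scaling can be summarized in a few lines of code. This is a minimal sketch, assuming the whole-slide scans are already available as 8-bit RGB arrays; the helper names (extract_patches, scale_to_unit_range) are illustrative rather than taken from the chapter's implementation.

```python
import numpy as np

def extract_patches(slide: np.ndarray, patch_size: int = 256) -> np.ndarray:
    """Tile a whole-slide RGB array of shape (H, W, 3) into non-overlapping patches."""
    h, w, c = slide.shape
    rows, cols = h // patch_size, w // patch_size
    patches = (
        slide[: rows * patch_size, : cols * patch_size]        # drop the ragged border
        .reshape(rows, patch_size, cols, patch_size, c)
        .transpose(0, 2, 1, 3, 4)                               # group by (row, col) patch index
        .reshape(-1, patch_size, patch_size, c)
    )
    return patches

def scale_to_unit_range(patches: np.ndarray) -> np.ndarray:
    """Map 8-bit pixel values in [0, 255] to floats in [-1, 1] for numerical stability."""
    return patches.astype(np.float32) / 127.5 - 1.0
```

Paired patches cut from the registered input-stain and target-stain slides at the same coordinates then form the input/target pairs fed to the network.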
The activation function for all but the final layer is the ReLU [74], and the final layer uses the sigmoidal activation function tanh(v). The discriminator network features batch normalization and the leaky ReLU activation function. The other details of the architectures are provided in the "Appendix" section.

We consider here the performance objectives mean absolute error (MAE) and root mean square error (RMSE). The networks were trained using the Adam optimizer [75], at learning rates of 10^{-5} and 10^{-4} for the discriminator and the generator networks, respectively. Batch sizes of 5, 10, and 15 were used in different trials. Shuffling was utilized, and finally the images were trained for 201 epochs with the number of input images as indicated in Table 11.2. The approach to evaluate the efficacy of the proposed GAN should include both qualitative (visual inspection) and quantitative image metrics. Both are required, as the latter alone may not suffice in that good performance with respect to quantitative metrics may not indicate clinical viability. We emphasize here that the output images must eventually be validated by a diagnostician.

11.6 Results and discussions

As indicated earlier, the subset of the ANHIR dataset considered here is listed in Table 11.2. Sample input and target images for the three tissues are shown in Fig. 11.1A–F. Here, an image of a lung lesion tissue stained with H&E is shown in Fig. 11.1A. This is the input image to the proposed GAN. The target output image is shown in Fig. 11.1B, which is the image of the same tissue but stained with CD31. Likewise, images of a kidney tissue stained with H&E and MAS stains are, respectively, shown in Fig. 11.1C and D, and images of a lung lobe tissue stained with H&E and CD31 stains are, respectively, shown in Fig. 11.1E and F.

Fig. 11.1 Lung lesion tissue: (A) Input image with H&E stain. (B) Target image with CD31 stain. Kidney tissue: (C) Input image with H&E stain. (D) Target image with MAS stain. Lung lobe tissue: (E) Input image with H&E stain. (F) Target image with CD31 stain.

As mentioned earlier, the images belonging to a particular tissue are input to the proposed GAN architecture described in the previous section and in the "Appendix" section. The performance of the GAN depends on the choice of objective functions. In this chapter, we consider two objective functions. In what follows, a GAN trained with the MSE objective function described in Eqs. (11.4), (11.5) is referred to as GANMSE, and a GAN trained with the MAE objective function described in Eqs. (11.7), (11.8) is denoted GANMAE. The output of the generative network should resemble the corresponding target images. The closeness of the images is quantified using the metrics MSE, PSNR, and SSI. (We note that due to a lack of resources, validation by a pathologist was not possible.) Sample sets of input, target, and output images in the case of the lung lesion tissue are shown in Fig. 11.2, showing the results with the proposed GANMSE and GANMAE. We note that an Adam optimizer has been used during the training process. (The chosen parameters are β1 = 0.9, β2 = 0.999, and ε = 10^{-7}.) From Fig. 11.2, it is clear that there is a close approximation between the target and output images, respectively shown in Fig. 11.2B and C, with mildly blurry results. In contrast, it is evident from Fig. 11.2E and F that GANMAE generates deep blue images, which is less than ideal for our application.
Thus, a closer comparison between the target and output images clearly indicates that the MSE loss function provides better results. These observations are further validated using the image-quality metrics. Indeed, with GANMSE the respective averaged values of SSI, PSNR, and MSE are 0.8455, 27.365, and 0.0650, and the corresponding values for GANMAE are 0.5938, 9.0791, and 0.1678. (These values are also depicted in Table 11.3.) All these measures indicate that the output of the generator and the target image are reasonably close. However, GANMSE performs better on the test images. Evidently, if an MSE loss function is used during training, the MSE image-quality metric yields a smaller value.

Fig. 11.2 Lung lesion tissue. With GANMSE: (A) Input image; (B) target image; (C) output image. With GANMAE: (D) Input image; (E) target image; (F) output image.

Table 11.3 Averaged image metric values.
Tissue | Type | SSI | PSNR | MSE
Lung lesion | GANMSE | 0.8455 | 27.365 | 0.0650
Lung lesion | GANMAE | 0.5938 | 9.0791 | 0.1678
Kidney | GANMSE | 0.3757 | 12.6759 | 0.0052
Kidney | GANMAE | 0.4038 | 12.0887 | 0.0739
Lung lobe | GANMSE | 0.7213 | 21.0530 | 0.0046
Lung lobe | GANMAE | 0.6931 | 21.1157 | 0.0140

In addition to these observations, it is evident from Fig. 11.2B and C that some of the finer details have not been well captured by the network. Equivalently, the generative network has not completely learned the transformation from one feature space to another. The primary reason for this is the smaller size of the dataset. It may be recalled that only 1250 images were available. From the observations with reference to the lung lesion tissue, it is evident that the higher-level features of the images, such as the overall structure, are captured completely. However, the generated images can be blurry, and there are marked variations between the images with GANMSE and GANMAE. Therefore, our approach to histology staining posed as an image-to-image translation problem is promising. Moreover, the quantitative measures SSIM and PSNR may not be the ideal metrics in our context. Despite the scores, the images are slightly blurry, as observed in Fig. 11.2C. Further, there are artefacts introduced if the number of epochs is less than 100; these eventually disappear with additional training. As will be evident, these remarks hold for both kidney and lung lobe tissues. Some issues with these tissue images are exacerbated due to the smaller size of the datasets.

The results with kidney tissues are shown in Fig. 11.3 for both GANMSE and GANMAE. Similarly, the results with lung lobe tissues are shown in Fig. 11.4 for both variants of GANs considered here.

Fig. 11.3 Kidney tissue. With GANMSE: (A) Input image; (B) target image; (C) output image. With GANMAE: (D) Input image; (E) target image; (F) output image.

Fig. 11.4 Lung lobe tissue. With GANMSE: (A) Input image; (B) target image; (C) output image. With GANMAE: (D) Input image; (E) target image; (F) output image.

From Table 11.3, the differences between the two GANs are less pronounced for both kidney and lung lobe tissues, with the performance of GANMSE better than that of the other. (There is a negligible discrepancy in these observations with respect to PSNR for the lung lobe tissue.) Nonetheless, minor visual differences can be observed in the kidney tissue even with GANMSE.
This is due to the relatively higher complexity of the kidney tissue when compared with the lung lesion or lung lobe tissues. An additional issue is the relatively smaller number of images in the dataset. At present, we have only 368 images corresponding to the kidney tissue, which is rather small considering the complexity. In contrast, even with a much smaller dataset of 151 images, the correlation between the output and target images for the lung lobe tissue is reasonably good for both GANMSE and GANMAE, with the former outperforming the latter both in the quantitative measures depicted in Table 11.3 and visually. This is clearly due to the fact that these tissues are much less complex. An interesting point to highlight is that despite the dataset being smaller in size for the lung lobe tissue, the transformation is significantly better. Again, this is due to the fact that these images are far less complex when compared to the images corresponding to the kidney tissue. Evidently, the network performance varies strongly with the complexity of the tissue. The network showcases reasonably good performance on tissues such as lung lesions and lung lobes. However, the performance decreases with complex tissue samples such as kidneys. Accordingly, a larger distribution of tissue is required for the generated samples to be modeled with a higher degree of accuracy.

An important point to note is the imperfections present due to the image capture of the histology and due to human error while fixing the histology. The imperfections present on the target image in Fig. 11.5A, where the tissue is ripped and stringy, are reflected in the output image in Fig. 11.5B generated with GANMAE. This emphasizes the fact that the images used for training the GAN are to be chosen carefully, which requires domain knowledge from a pathologist.

Fig. 11.5 Errors in the lung lobe tissue: (A) Target with imperfections; (B) output image with distortions.

To summarize, these results showcase the image-to-image transformation from an input image corresponding to one stain to another image corresponding to a different stain. When the number of samples in the dataset is large, the quality of the images is quite satisfactory. In some cases, however, the results are slightly blurry, which may be addressed if the dataset is larger. Variations in the objective function clearly have an effect on the output image. The images that resulted with an MSE objective function are better than those obtained with an MAE objective function. Due to the complexity of the image, the output images corresponding to the kidney tissue are rather diffuse, as the intricate features have not been captured completely correctly. The finer details corresponding to the lung lobes have been well captured. As far as the metrics are concerned, there is a strong indication that both SSI and PSNR do not indicate the visual quality of the images. Moreover, the imperfections present on the target tissue appear to be transferred to the output as well. This can prove problematic, as the imperfections may be due to human errors. Further, artefacts are sometimes introduced at lower epochs. However, with sufficient training, these artefacts are eventually removed.

11.7 Conclusions

GANs in the past have been used for image translation tasks.
In our medical image translation task, we utilize the conditional GAN algorithm to generate differently stained images from paired input images. The results have been demonstrated to be satisfactory. If the size of the dataset is suitably large, the results are better, as the network captures both high-level and low-level features. Unfortunately, the imperfections in the images are captured as well by the network. Therefore, the training dataset ought to be carefully chosen. Additionally, histology staining is dependent on the presence of a particular compound/tissue on the slide. Accordingly, the network needs to see a large distribution of histology stains in different circumstances/stages. This is important, as the network must learn enough settings to be able to learn the transformation sufficiently well to be viable clinically. While the PSNR and the SSIM are popular image-quality metrics, a potential quality metric that could be utilized for future evaluation is the relative target registration error. Here, the landmarks can give a better indication of whether the networks are performing better or worse. However, such a study requires hand labeling by experts. We have showcased the use of CGANs for the image-to-image translation between two stained tissues. Therefore, the approach has the potential for transforming images of unstained tissues to stained ones, given enough data with different settings. However, the issues highlighted earlier are to be addressed before a more complete end-to-end system is available that can transform an unstained tissue to a stained one. Finally, the networks should also learn a general mapping of how to transform stained tissues. Since the images are paired from one image to another, the network might learn a map where it transforms the input image to the target image, rather than altering the stain. This issue is a known problem when using a one-to-one mapping between inputs and targets. Thus, future work in this space should explore unpaired translation of such images to make an effective end-to-end staining algorithm.

Appendix: Network architectures

The generator and discriminator networks are described here. The generator network consists of 38 layers with several sets of convolutional and max-pooling layers, and is listed in Tables 11.4 and 11.5. Each layer is characterized by a triplet (n1, n2, n3), which indicates an image of dimensions n1 × n2 with n3 the number of filters. There are four skip connections. Specifically, the outputs of Layers 3, 6, 9, and 13 are also inputs to Layers 35, 30, 25, and 20, respectively. The 12 layers of the discriminator network are listed in Table 11.6.

Table 11.4 Generator architecture: Part A. No.
Layer Input Output 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 Input layer Conv2D Conv2D MaxPooling2D Conv2D Conv2D MaxPooling2D Conv2D Conv2D MaxPooling2D Conv2D Conv2D Dropout MaxPooling2D Conv2D Conv2D Dropout Upsampling2D Conv2D (256, 256, 3) (256, 256, 3) (256, 256, 64) (256, 256, 64) (128, 128, 64) (128, 128, 128) (128, 128, 128) (64, 64, 128) (64, 64, 256) (64, 64, 256) (32, 32, 256) (32, 32, 512) (32, 32, 512) (32, 32, 512) (16, 16, 512) (16, 16, 1024) (16, 16, 1024) (16, 16, 1024) (32, 32, 1024) (256, 256, 3) (256, 256, 64) (256, 256, 64) (128, 128, 64) (128, 128, 128) (128, 128, 128) (64, 64, 128) (64, 64, 256) (64, 64, 256) (32, 32, 256) (32, 32, 512) (32, 32, 512) (32, 32, 512) (16, 16, 512) (16, 16, 1024) (16, 16, 1024) (16, 16, 1024) (32, 32, 1024) (32, 32, 512) Remark To Layer 35 To Layer 30 To Layer 25 To Layer 20 Table 11.5 Generator architecture: Part B. No. Layer Input Output 20 Concatenate (32, 32, 1024) 21 22 23 24 25 Conv2D Conv2D Upsampling2D Conv2D Concatenate 26 27 28 29 30 Conv2D Conv2D Upsampling2D Conv2D Concatenate 31 32 33 34 35 Conv2D Conv2D Upsampling2D Conv2D Concatenate 36 37 38 Conv2D Conv2D Conv2D (32, 32, 512) (32, 32, 512) (32, 32, 1024) (32, 32, 512) (32, 32, 512) (64, 64, 512) (64, 64, 256) (64, 64, 256) (64, 64, 512) (64, 64, 256) (64, 64, 256) (128, 128, 256) (128, 128, 128) (128, 128, 128) (128, 128, 256) (128, 128, 128) (128, 128, 128) (256, 256, 128) (256, 256, 64) (256, 256, 64) (256, 256, 128) (256, 256, 64) (256, 256, 64) Remark From Layer 13 (32, 32, 512) (32, 32, 512) (64, 64, 512) (64, 64, 256) (64, 64, 512) From Layer 9 (64, 64, 256) (64, 64, 256) (128, 128, 256) (128, 128, 128) (128, 128, 256) From Layer 6 (128, 128, 128) (128, 128, 128) (256, 256, 128) (256, 256, 64) (256, 256, 128) From Layer 3 (256, 256, 64) (256, 256, 64) (256, 256, 3) Generative adversarial networks for histopathology staining Table 11.6 Discriminator architecture. No. Layer Input Output 1 2 3 4 5 6 7 8 9 10 11 12 Input layer Conv2D Leaky ReLU Dropout Conv2D Leaky ReLU Dropout Conv2D Leaky ReLU Dropout Flatten Dense (256, 256, 3) (256, 256, 3) (128, 128, 64) (128, 128, 64) (128, 128, 64) (64, 64, 128) (64, 64, 128) (64, 64, 128) (32, 32, 256) (32, 32, 256) (32, 32, 256) (262144) (256, 256, 3) (128, 128, 64) (128, 128, 64) (128, 128, 64) (64, 64, 128) (64, 64, 128) (64, 64, 128) (32, 32, 256) (32, 32, 256) (32, 32, 256) (262144) (1) References [1] I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks, in: Proceedings of the 28th International Conference on Neural Information Processing Systems (NIPS’14), Montreal, Quebec, Canada, 2014, pp. 2672–2680. [2] A.Y. Ng, M.I. Jordan, On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes, in: Proceedings of the 27th International Conference on Neural Information Processing Systems (NIPS’01), Vancouver, British Columbia, Canada, 2001, pp. 841–848. [3] D. Foster, Generative Deep Learning: Teaching Machines to Paint, Write, Compose, and Play, O’Reilly Media, Sebastopol, CA, 2019. [4] K. Ganguly, Learning Generative Adversarial Networks: Next-Generation Deep Learning Simplified, Packt Publishing, Birmingham, UK, 2017. [5] J. Langr, V. Bok, GANs in Action: Deep Learning With Generative Adversarial Networks, Manning Publications, Shelter Island, NY, 2019. [6] J.F. Nash, Jr., Equilibrium points in n-person game, Proc. Natl. Acad. Sci. 36 (1) (1950) 48–49. [7] G. Hinton, L. Deng, D. Yu, G.E. Dahl, A.-R. 
Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups, IEEE Signal Process. Mag. 29 (6) (2012) 82–97. [8] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, Commun. ACM 60 (6) (2017) 84–90. [9] R. Yamashita, M. Nishio, R.K.G. Do, K. Togashi, Convolutional neural networks: an overview and application to radiology, Insights Imaging 9 (2018) 611–629. [10] K. Jarrett, K. Kavukcuoglu, M. Ranzato, Y. LeCun, What is the best multi-stage architecture for object recognition? in: Proceedings of the 12th International Conference on Computer Vision (ICCV’09), Kyoto, Japan, 2009, pp. 2146–2153. [11] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS’11), Ft. Lauderdale, FL, USA, 2011, pp. 315–323. [12] F. Agostinelli, M. Hoffman, P. Sadowski, P. Baldi, Learning activation functions to improve deep neural networks, in: Proceedings of the 3rd International Conference on Learning Representations Workshop (ICLR), San Diego, CA, USA, 2015. [13] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (1) (2014) 1929–1958. 283 284 Generative adversarial networks for image-to-Image translation [14] Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: representing model uncertainty in deep learning, in: Proceedings of the 33rd International Conference on Machine Learning (ICML 16), New York, NY, USA, 2016, pp. 1050–1059. [15] Y. Gal, Z. Ghahramani, A theoretically grounded application of dropout in recurrent neural networks, in: Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS’16), Barcelona, Spain, 2016, pp. 1027–1035. [16] D.P. Kingma, M. Welling, Auto-encoding variational Bayes, in: Proceedings of the 2nd International Conference on Learning Representations (ICLR 2014), Banff, AB, Canada, 2014. [17] D.J. Rezende, S. Mohamed, D. Wierstra, Stochastic backpropagation and approximate inference in deep generative models, in: Proceedings of the 31st International Conference on Machine Learning (ICML), Beijing, China, 2014, pp. 1278–1286. [18] E. Denton, S. Chintala, A. Szlam, R. Fergus, Deep generative image models using a Laplacian pyramid of adversarial networks, in: Proceedings of the 29th International Conference on Neural Information Processing Systems (NIPS’15), Montreal, Quebec, Canada, 2015, pp. 1486–1494. [19] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, H. Lee, Generative adversarial text to image synthesis, in: Proceedings of the 33rd International Conference on Machine Learning (ICML), New York, NY, USA, 2016, pp. 1060–1069. [20] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, W. Shi, Photo-realistic single image super-resolution using a generative adversarial network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 105–114. [21] H. Alqahtani, M. Kavakli-Thorne, G. Kumar, Applications of generative adversarial networks (GANs): an updated review, Arch. Comput. Methods Eng. 28 (2021) 525–552, https://doi.org/10.1007/ s11831-019-09388-y. [22] S. Kazeminia, C. Baur, A. Kuijper, B. 
van Ginneken, N. Navab, S. Albarqouni, A. Mukhopadhyay, GANs for medical image analysis, Artif. Intell. Med. 109 (2020), 101938. [23] X. Yi, E. Walia, P. Babyn, Generative adversarial network in medical imaging: a review, Med. Image Anal. 58 (2019) 1–24. [24] S. Kaji, S. Kida, Overview of image-to-image translation by use of deep neural networks: denoising, super-resolution, modality conversion, and reconstruction in medical imaging, Radiol. Phys. Technol. 12 (3) (2019) 235–248. [25] K. Armanious, C. Jiang, M. Fischer, T. K€ ustner, T. Hepp, K. Nikolaou, S. Gatidis, B. Yang, MedGAN: medical image translation using GANs, Comput. Med. Imaging Graph. 79 (2020) 101684. [26] O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image segmentation, in: N. Navab, J. Hornegger, W. Wells, A. Frangi (Eds.), Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), Lecture Notes in Computer Science, vol. 9351, Springer, Switzerland, 2015, pp. 234–241. [27] J.K. Min, M.S. Kwak, J.M. Cha, Overview of deep learning in gastrointestinal endoscopy, Gut Liver 13 (4) (2019) 388–393. [28] D. Hachuel, A. Jha, C.D. Velez, A. Martinez, Mo2049—augmenting gastrointestinal health: a deep learning approach to human stool recognition and characterization in macroscopic images, Gastroenterology 156 (6 suppl 1) (2019) S-937. [29] F. Mahmood, D. Borders, R. Chen, G.N. McKay, K.J. Salimian, A. Baras, N.J. Durr, Deep adversarial training for multi-organ nuclei segmentation in histopathology images, IEEE Trans. Med. Imaging (2019), https://doi.org/10.1109/TMI.2019.2927182. [30] H. Tsuda, K. Hotta, Cell image segmentation by integrating pix2pixs for each class, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 2019, pp. 1065–1073. [31] S. Pandey, P.R. Singh, J. Tian, An image augmentation approach using two-stage generative adversarial network for nuclei image segmentation, Biomed. Signal Process. Control 57 (2020) 101782. [32] H. Wang, B. Raj, On the origin of deep learning, arXiv:1702.07800 (2017). [33] D. Wittekind, Traditional staining for routine diagnostic pathology including the role of tannic acid. 1. Value and limitations of hematoxylin-eosin stain, Biotech. Histochem. 78 (5) (2003) 261–270. Generative adversarial networks for histopathology staining [34] R.G. Grocott, A stain for fungi in tissue sections and smears using Gomori’s methenamine-silver nitrate technic, Am. J. Clin. Pathol. 25 (8) (1955). 975–959. [35] H.A. Alturkistani, F.M. Tashkandi, Z.M. Mohammedsaleh, Histological stains: a literature review and case study, Glob. J. Health Sci. 8 (3) (2016) 72–79. [36] D. Bardou, K. Zhang, S.M. Ahmad, Classification of breast cancer based on histology images using convolutional neural networks, IEEE Access 6 (2018) 24680–24693. [37] M.T. Shaban, C. Baur, N. Navab, S. Albarqouni, StainGAN: stain style transfer for digital histological images, in: Proceedings of the IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), Venice, Italy, 2019, pp. 953–956. [38] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, P. Abbeel, InfoGAN: interpretable representation learning by information maximizing generative adversarial nets, in: Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS’16), Barcelona, Spain, 2016, pp. 2180–2188. [39] M. Arjovsky, S. Chintala, L. 
Bottou, Wasserstein generative adversarial networks, in: Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 2017, pp. 214–223. [40] B. Hu, Y. Tang, E.I.-C. Chang, Y. Fan, M. Lai, Y. Xu, Unsupervised learning for cell-level visual representation in histopathology images with generative adversarial networks, IEEE J. Biomed. Health Informatics 23 (3) (2019) 1316–1328. [41] L. Hou, A. Agarwal, D. Samaras, T.M. Kurc, R.R. Gupta, J.H. Saltz, Robust histopathology image analysis: to label or synthesize, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 8525–8534. [42] Y. Rivenson, H. Wang, Z. Wei, K. de Haan, Y. Zhang, Y. Wu, H. G€ unaydin, J.E. Zuckerman, T. Chong, A.E. Sisk, L.M. Westbrook, W.D. Wallace, A. Ozcan, Virtual histological staining of unlabelled tissue-autofluorescence images via deep learning, Nat. Biomed. Eng. 3 (6) (2019) 466–477. [43] A. Ganesh, N.R. Vasanth, A.S. Ramaswamy, K. George, Staining of unstained histology using style transfer with color-based segmentation, in: Proceedings of the 2019 IEEE TENCON, Kochi, India, 2019. [44] L.A. Gatys, A.S. Ecker, M. Bethge, A neural algorithm of artistic style, arXiv:1508.06576 (2015). [45] L. Zhang, C. Long, X. Zhang, C. Xiao, RIS-GAN: explore residual and illumination with generative adversarial networks for shadow removal, in: Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), New York, NY, USA, 2020. [46] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A.A. Efros, O. Wang, E. Shechtman, Toward multimodal image-to-image translation, in: Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 2017, pp. 465–476. [47] R. Fernandez-Gonzalez, A. Jones, E. Garcia-Rodriguez, P.Y. Chen, A. Idica, S.J. Lockett, M.H. Barcellos-Hoff, C. Ortiz-De-Solorzano, System for combined three-dimensional morphological and molecular analysis of thick tissue specimens, Microsc. Res. Tech. 59 (6) (2002) 522–530. [48] G. Bueno, O. Deniz, AIDPATH: Academia and Industry Collaboration in Digital Pathology, http:// aidpath.eu/?page_id¼279. [49] J. Borovec, A. Munoz-Barrutia, J. Kybic, Benchmarking of image registration methods for differently stained histological slides, in: Proceedings of the 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 2018, pp. 3368–3372. [50] L. Gupta, B.M. Klinkhammer, P. Boor, D. Merhof, M. Gadermayr, Stain independent segmentation of whole slide images: a case study in renal histology, in: Proceedings of the 15th IEEE International Symposium on Biomedical Imaging (ISBI), Washington, DC, USA, 2018. [51] J. Borovec, J. Kybic, I. Arganda-Carreras, D.V. Sorokin, G. Bueno, A.V. Khvostikov, S. Bakas, E.I. Chang, S. Heldmann, K. Kartasalo, L. Latonen, J. Lotz, M. Noga, S. Pati, K. Punithakumar, P. Ruusuvuori, A. Skalski, N. Tahmasebi, M. Valkonen, L. Venet, Y. Wang, N. Weiss, M. Wodzinski, Y. Xiang, Y. Xu, Y. Yan, P. Yushkevic, S. Zhao, A. Muñoz-Barrutia, ANHIR: automatic non-rigid histological image registration challenge, IEEE Trans. Med. Imaging, https://doi.org/10. 1109/TMI.2020.2986331. 285 286 Generative adversarial networks for image-to-Image translation [52] J. Borovec, BIRL: benchmark on image registration methods with landmark validation, arXiv:1912.13452 (2020). [53] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G.S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. 
Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: large-scale machine learning on heterogeneous distributed systems, arXiv:1603.04467 (2016). [54] S. Kullback, R.A. Leibler, On information and sufficiency, Ann. Math. Stat. 22 (1) (1951) 79–86. [55] J. Burbea, C.R. Rao, On the convexity of some divergence measures based on entropy functions, IEEE Trans. Inf. Theory 28 (3) (1982) 489–495. [56] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for training GANs, in: Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS’16), Barcelona, Spain, 2016, pp. 2234–2242. [57] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, in: Proceedings of the International Conference on Learning Representations (ICLR 2016), San Juan, Puerto Rico, 2016. [58] D. Scherer, A. M€ uller, S. Behnke, Evaluation of pooling operations in convolutional architectures for object recognition, in: K. Diamantaras, W. Duch, L.S. Iliadis (Eds.), Proceedings of the 20th International Conference on Artificial Neural Networks (ICANN), Thessaloniki, Greece, September 15–18, vol. 6354, Springer, Switzerland, 2010, pp. 92–101. [59] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift, in: Proceedings of the 32nd International Conference on Machine Learning (ICML’15), Lille, France, 2015, pp. 448–456. [60] S. Ravuri, S. Mohamed, M. Rosca, O. Vinyals, Learning implicit generative models with method of learning moments, in: Proceedings of the 35th International Conference on Machine Learning (ICML), Stockholm, Sweden, 2018, pp. 4314–4323. [61] X. Mao, Q. Li, H. Xie, R.Y.K. Lau, Z. Wang, S.P. Smolley, Least squares generative adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 2017, pp. 2813–2821. [62] P. Isola, J.-Y. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Honolulu, HI, USA, 2017, pp. 5967–5976. [63] Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Trans. Image Process. 13 (4) (2004) 600–612. [64] A. Hore, D. Ziou, Image quality metrics: PSNR vs. SSIM, in: Proceedings of the 20th IEEE International Conference on Pattern Recognition (ICPR), Istanbul, Turkey, 2010, pp. 2366–2369. [65] P. Baldi, Autoencoders, unsupervised learning, and deep architectures, in: Proceedings of the ICML Workshop on Unsupervised and Transfer Learning, Bellevue, WA, USA, 2011, pp. 37–49. [66] M. Mirza, S. Osindero, Conditional generative adversarial nets, arXiv:1411.1784 (2014). [67] A.B.L. Larsen, S.K. Sønderby, H. Larochelle, O. Winther, Autoencoding beyond pixels using a learned similarity metric, in: Proceedings of the 33th International Conference on Machine Learning (ICML), New York, NY, USA, 2016, pp. 1558–1566. [68] C. Li, M. Wand, Precomputed real-time texture synthesis with Markovian generative adversarial networks, in: B. Leibe, J. Matas, N. Sebe, M. 
Welling (Eds.), Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, October 11–14, Lecture Notes in Computer Science, vol. 9907, Springer, Switzerland, 2016, pp. 702–716. [69] M.N. Gurcan, L.E. Boucheron, A. Can, A. Madabhushi, N.M. Rajpoot, B. Yener, Histopathological image analysis: a review, IEEE Rev. Biomed. Eng. 2 (2009) 147–171. [70] S. Thiem, C. Dalkidis, Automatic stainer having a heating station, Leica Microsystems Nussloch, GmbH, Nussloch, Germany, US Patent 6827900 B2 (December 2004). [71] A.R. Morales, M. Nassiri, Automation of the histology laboratory, Lab. Med. 38 (7) (2007) 405–410. Generative adversarial networks for histopathology staining [72] P. Gattuso, V.B. Reddy, O. David, D.J. Spitz, M.H. Haber, Differential Diagnosis in Surgical Pathology, Saunders Elsevier, Philadelphia, PA, 2010. [73] D.J. Dabbs, Diagnostic Immunohistochemistry: Theranostic and Genomic Applications, Elsevier, Philadelphia, PA, 2018. [74] V. Nair, G.E. Hinton, Rectified linear units improve restricted Boltzmann machines, in: Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, 2010, pp. 807–814. [75] D.P. Kingma, J.L. Ba, ADAM: a method for stochastic optimization, in: Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, 2015. 287 CHAPTER 12 Analysis of false data detection rate in generative adversarial networks using recurrent neural network A. Sampath Kumara, Leta Tesfaye Juleb,c, Krishnaraj Ramaswamyc,d, S. Sountharrajane, N. Yuuvarajf, and Amir H. Gandomig a Department of Computer Science and Engineering, Dambi Dollo University, Dambi Dollo, Ethiopia Department of Physics, College of Natural and Computational Science, Dambi Dollo University, Dambi Dollo, Ethiopia c Centre for Excellence in Indigenous Knowledge, Innovative Technology Transfer and Entrepreneurship, Dambi Dollo University, Dambi Dollo, Ethiopia d Department of Mechanical Engineering, Dambi Dollo University, Dambi Dollo, Ethiopia e School of Computing Science and Engineering, VIT Bhopal University, Bhopal, India f Research and Development, ICT Academy, Chennai, India g University of Technology Sydney, Ultimo, NSW, Australia b 12.1 Introduction Recently, the generative adversarial networks (GANs) have emerged as a potential class of generative models, where it operates as a joint optimization model with two neural networks of contrasting goals. Since decades, the generative adversarial network (GAN) [1–10] is emerging as a viable solution for most application with its adversarial training ability in optimizing the generative ability. GANs are an emergent model for unsupervised and semisupervised learning. The learning is achieved with implicit modeling of high-dimensional data distribution [6]. GANs are considered interesting since it moves away from a viewpoint of likelihood maximization; however, it uses an adversarial game approach for the process of training the generative models [11]. Conventional GAN has no prior training data on the distribution of data, which provide the goal of generating the samples from the distributions [12]. Effective training on GANs is rather challenging. The generator and discriminator model capacities are balanced for the generator to learn effectively. The lack of unambiguous convergence criterion tends to complement this problem [13]. 
Various attempts in existing researches to scale up the GAN are unsuccessful due to its reliability of identifying fake or false data and unstable training. GAN encounters various difficulties, while scaling up its robustness and scalability. Hence, a stable training model across a limited range of datasets with deeper learning algorithm can be made significant to scale up the operation of GAN in finding the false detection rate. Generative Adversarial Networks for Image-to-Image Translation https://doi.org/10.1016/B978-0-12-823519-5.00012-9 Copyright © 2021 Elsevier Inc. All rights reserved. 289 290 Generative adversarial networks for image-to-image translation The problem of imbalanced learning can be defined as a problem of learning from a binary or multiclass dataset, where for one of the classes called the majority class the number of instances is significantly greater than in the remaining classes called the minority classes [14]. In unbalanced datasets, standard learning methods work poorly because they are a prejudice to the majority classes. In particular, minority classes contribute lesser to minimize the objective function during training in a standard classification method [15]. Designing the GANs is still difficult in practice, even if GANs have achieved great success in image generation. In case a GAN would be unstable, network architectures should be well designed. Various GAN methods are developed [16–18] for improving the stabilization ability of GANs learning. The instability associated with GAN learning is caused by the saturation occurring while sampling the data in the discriminator [19]. Hence to resolve such problem, this chapter uses optimal weight selection to avoid the saturation, thereby increasing the stability of operation. In this chapter, we develop a GAN and the operation in the GAN is scaled up using a recurrent neural network (RNN). It uses its neighborhood relationship between the samples to generate the target output and error generated. The errors are then propagated in the backward direction over the GAN to update the network weights to estimate the output. The RNN generator, on the other hand, reduces the probability of the RNN discriminator in identifying the false generated samples and increasing the probability of discriminator in identifying correctly the real samples. The objective function in RNN is designed in such a way that its gradient operator for the false samples is quite far from the decision boundary of RNN discriminator, thereby producing the increasing true classification rate. 12.1.1 Contributions The main contributions of the study are presented below: • The author(s) uses RNN to scale up the operation of GAN for stable training. • The discriminator uses the same RNN to classify the generated and real data samples by updating the weights. • The aim of RNN in the GAN structure is to delimit the error rate using its time series prediction based on its past inputs. It further uses its neighborhood relationship between the samples to generate the target output and error generated. • The errors are then propagated in the backward direction over the GAN to update the network weights to estimate the output. The RNN generator, on the other hand, reduces the probability of the RNN discriminator in identifying the false generated samples and increasing the probability of discriminator in identifying correctly the real samples. 
• The objective function in the RNN is designed so that its gradient for the false samples lies far from the decision boundary of the RNN discriminator, thereby increasing the true classification rate.
• Experiments carried out on real-world time series datasets show accurate classification with a higher false data detection rate than the benchmark GAN method.

The outline of the chapter is as follows: Section 12.2 discusses the related works. Section 12.3 presents the proposed method and Section 12.4 the GAN-RNN architecture with the modifications made to improve classifier performance. Section 12.5 evaluates the entire work, and Section 12.6 concludes it with possible directions for future work.

12.2 Related works
Lyu et al. [20] designed a denoising GAN that removes mixed noise by combining three different GAN elements: a feature extractor network, a discriminator, and a generator. The feature extractor network additionally trains the generator-discriminator pair through a mutual game, and direct mapping is implemented to eliminate the noise and improve the quality of the input data. Chen et al. [21] developed a denoising method with GAN (D-GAN) to remove speckle noise. The generator is trained with ground-truth values to map the noisy regions of an image, while the discriminator performs a similarity comparison using the loss function on the reconstructed input data. The generator-discriminator pair finally eliminates the noise through stable training to achieve effective denoising. Nawi et al. [22] formed a GAN embedded with a discriminative metric for generating realistic samples using deep metric learning. The feature extractor acts as a discriminator that identifies both the discriminative and the preserving loss. It further reduces the estimated distance between the real and generated samples, which protects the input data from losses and improves model stability through a weight adaptation strategy. Li et al. [23] used a multitask learning-based GAN (MTL-GAN) to segment the grains or noise present in images of alloy microstructures. It detects grains at the edges and segments them using rich convolutional features, and fine-tuning of the GAN is employed to find hidden grains and extract quantitative indicators. The above methods achieve accurate noise reduction, but their accuracy tends to drop as the number of data samples increases.

Zhong et al. [24] developed a deep-GAN model fed with the output noises of a decoder-encoder model. The GAN is improved with adversarial training and Bayesian inference. The entire model is pretrained to map the noise vector to optimal features that act as the input feed for the generator, and the generator is trained with a dataset carrying intrinsic distribution information to reduce the set of errors. Cui et al. [25] use an encoder-decoder network in a GAN to remove noise from the input signal. The network classifies the original input data from the adversarial and pixel losses, and thereby the original information is preserved. Sun et al. [26] modified the GAN model and formulated it with a t-distribution noise mixture in a latent generative space. The learning components of the t-distribution mixture maintain the diversity of classification even in the presence of the noise vector.
The classification loss, combined with the generator-discriminator losses, stabilizes the entire network, and the contribution of the adversarial loss decreases as the noise vector stabilizes with more data samples. Ak et al. [27] developed an enhanced attention GAN with improved stability using an integrated attention module for linear modulation. It is designed to avoid matching losses between features that would significantly increase the classification errors; training stability is thereby improved, resolving system collapse by reducing instance noise. Xu et al. [28] developed a multigranularity deep GAN to resolve unstable training through reconstructive sampling; the multigranularity GAN is then decomposed to remove the noise present in the input data and further enhance training. Deep neural network-based melanoma detection was introduced by Banuselvasaraswathy et al. [29]; the deep learning-based training analysis suggested there could be used to propose an improved deep GAN. Zhang et al. [30] developed a deep-GAN whose explicit training data map the noise distribution to real-time data; generating fake samples further helps balance the datasets for stable training. These methods tend to reduce the sampling probability under increased data samples using the first part of the generator.

Zhang and Sheng [31] used a Wasserstein GAN to improve the model's generalization ability. It detects the input seismic signal and can distinguish the noise from the original signal, especially under low signal-to-noise ratio (SNR) conditions; hence, stability is improved on adversarial sample datasets. Yang et al. [32] developed a GAN architecture with a loss function that distinguishes a clean signal from a parallel noisy signal. It uses a supervised loss representation, i.e., a high-level loss in the hidden layer of the GAN, to capture the losses under low SNR and in resource-constrained environments. Both methods suffer from unstable training, and the inclusion of an RNN can effectively stabilize the training pattern.

In Refs. [26, 33–36], various machine learning and deep learning algorithms for disease classification are proposed. The optimization techniques suggested there for diagnosing breast cancer are effective, and the big data analytics techniques are noteworthy for handling big data, which in turn reveals the possibility of reducing traffic in the network. Sampathkumar et al. [37–40] proposed various optimization approaches for accurately identifying the gene associated with a particular disease. Various classifiers and statistical approaches are used for gene feature selection with reduced dimensionality, which proves that the algorithms are effective. These methods, however, fail to discriminate between original and fake samples, resulting in poor determination accuracy. Congestion control in WSNs can be improved with multiple routing paths, which can be achieved by priority-based scheduling algorithms, as discussed in Ref. [41]. Much research has addressed accuracy prediction with machine learning algorithms. In this chapter, a GAN with an RNN [42–45] concentrates on reducing false samples, which has not been carried out in previous works. The RNN is used to effectively process temporal data, a capability unavailable in the existing methods.
The utilization of recurrent layers overcomes the challenges associated with the loss function. The RNN can further enable unsupervised learning, in contrast to conventional supervised learning models [46–48].

12.3 Methods
The proposed method is designed by integrating a GAN with an RNN [22, 25, 32]. The GAN is responsible for the optimization of the neural network through adversarial training. The generator network is the first network; it maps the input from the low-dimensional source toward the high-dimensional dataset of the target domain. The adversarial or discriminator network is the second network; it produces an output indicating whether a sample comes from the real dataset, and the generator aims to produce outputs that the discriminator fails to distinguish from real dataset samples. The generator-discriminator pair is optimized with respect to the discriminator output. To attain optimal results, the study uses the RNN algorithm to generate the optimal weights and achieve accurate predictive results. Training the regression model [49] benefits the adversarial network. The adversarial network processes the predicted and real classifications with a receptive field and a weight function that quantifies the task well with global real-time prediction, reducing the errors associated with the classification task. Hence, it is necessary to design a loss function that contributes well to adversarial training.

Fig. 12.1 shows the components of the model: a generator G operating on a noise vector z, which is sampled from an input distribution pz; recurrent layers that transform the noise vector z into a sample x; and a discriminator D that separates the generated samples from samples of the real data distribution pdata.
Fig. 12.1 Generative adversarial network.

The workflow of the entire system is given in the following steps:
Step 1: Generate synthetic samples from the real data distribution using the generator network.
Step 2: Transform the noise vector from its distribution into new samples.
Step 3: Run the RNN; the weights of the RNN are updated using the fitness function.
Step 4: Select the fitness value with the improved probability estimator to strengthen the classifier, with the probable weights updated based on the new samples.
Step 5: Feed the new samples and the samples from the generator to the discriminator network.
Step 6: Find the similarity between these two sets of data.

12.4 GAN-RNN architecture
The GAN is designed with two neural networks. First, the generator network generates convincing synthetic samples x resembling draws from the real data distribution pdata by transforming the noise vector z, drawn from the distribution pz, into new samples x. Second, the discriminator network receives samples from the real data distribution pdata as well as samples from G and classifies between the two. GAN training solves the optimization problem in which the discriminator objective is maximized and the generator objective is minimized:

\min_G \max_D f(D, G) = E_{x \sim p_{data}}[\log D(x)] + E_{z \sim p_z}[\log(1 - D(G(z)))]   (12.1)

where G is the generator, D is the discriminator, and f(D, G) is the objective function. The final layer of the second network uses a sigmoid activation function, so D(x), D(G(z)) ∈ [0, 1]. The discriminator output is maximized when its prediction error with respect to the target values on real/fake samples is minimized. In contrast, the generator reduces the discriminator's ability to detect the fake samples. Hence, the generator loss directly depends on the discriminator's performance.
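To make the adversarial objective in Eq. (12.1) concrete, the following minimal PyTorch sketch shows one training step in which a GRU-based generator and discriminator are updated against each other. This is an illustrative sketch and not the authors' implementation: the GRU sizes, noise dimension, learning rates, and data shapes are assumptions, and the common non-saturating form of the generator loss is used.

```python
# Minimal sketch: one adversarial training step for a GAN whose generator and
# discriminator are built on recurrent (GRU) layers, in the spirit of Eq. (12.1).
import torch
import torch.nn as nn

class RNNGenerator(nn.Module):
    def __init__(self, noise_dim=16, hidden_dim=64, feature_dim=10):
        super().__init__()
        self.rnn = nn.GRU(noise_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, feature_dim)

    def forward(self, z):                 # z: (batch, seq_len, noise_dim)
        h, _ = self.rnn(z)
        return self.out(h)                # synthetic sequence: (batch, seq_len, feature_dim)

class RNNDiscriminator(nn.Module):
    def __init__(self, feature_dim=10, hidden_dim=64):
        super().__init__()
        self.rnn = nn.GRU(feature_dim, hidden_dim, batch_first=True)
        self.out = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, x):                 # x: (batch, seq_len, feature_dim)
        h, _ = self.rnn(x)
        return self.out(h[:, -1])         # D(x) in [0, 1], as required by Eq. (12.1)

G, D = RNNGenerator(), RNNDiscriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCELoss()

def train_step(real_x):                   # real_x: (batch, seq_len, feature_dim)
    batch = real_x.size(0)
    z = torch.randn(batch, real_x.size(1), 16)          # noise vector z ~ p_z
    fake_x = G(z)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z)))
    opt_d.zero_grad()
    d_loss = bce(D(real_x), torch.ones(batch, 1)) + \
             bce(D(fake_x.detach()), torch.zeros(batch, 1))
    d_loss.backward()
    opt_d.step()

    # Generator step: push D(G(z)) toward 1 (non-saturating generator loss)
    opt_g.zero_grad()
    g_loss = bce(D(fake_x), torch.ones(batch, 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

print(train_step(torch.randn(8, 20, 10)))  # toy real batch in place of p_data samples
```

In the chapter's setting, `real_x` would correspond to time series samples drawn from pdata, and the RNN weight update of Section 12.4.1 would drive the discriminator's parameters.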
12.4.1 Optimization of GAN using RNN
Consider a sample (a, b), where a is the input and b is the label from class C. The probability estimate produced by the RNN model is defined as Y = softmax(f(a; θ)). The study uses a cross-entropy objective function that updates the weights of the RNN through backward learning, i.e., backpropagation, during classifier training, as shown in Fig. 12.2.
Fig. 12.2 GAN-RNN learning module.

The fitness function is selected to be robust, with an improved probability estimator that increases the performance of the GAN-RNN classifier through probable weight updates. The weights are updated optimally as in Ref. [36]. Let L(b, Y) be the loss function; the RNN is tasked with finding the best weights {w1, w2, ..., wK} that combine the results of the loss function at each iteration.

The study uses a cross-entropy loss function to validate the performance of the GAN-RNN classifier (Fig. 12.3). The error produced when validating the model lies within the range reported in Ref. [22] (Fig. 12.4). The cross-entropy increases the prediction probability of the input labels, or actual labels; reducing the average value of the loss function amounts to maximizing the log-likelihood of the input data. The cross-entropy loss function is defined by the following expressions:

CE = -\sum_i (b \log(w_i) + (1 - b) \log(1 - w_i))   (12.2)

f = \sum_{c=1}^{M} CE(c, Y) \log CE(c, Y)   (12.3)

where M is the total number of classes, log is the natural logarithm, b is the binary indicator with c as the class label of the correctly classified result for an observation O, and Y is the predicted observation O from class c. With N training samples, the loss function is defined as

\min_w f = \sum_{i=1}^{N} f(b_i, Y_i)   (12.4)

where b_i is the true label and Y_i is the predicted label.

Fig. 12.3 Average performance of the GAN-RNN model (mean squared error of the training, validation, and test sets over the training iterations).
Fig. 12.4 Average error and response plots over the entire datasets (targets, outputs, and errors for the training, validation, and test sets).
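The following NumPy sketch spells out the cross-entropy objective of Eqs. (12.2)–(12.4) as written; it is illustrative only, with toy labels and predictions, and the helper names (cross_entropy, fitness, total_loss) are not from the chapter.

```python
# Toy NumPy sketch of the cross-entropy fitness that drives the RNN weight update.
import numpy as np

def cross_entropy(b, w):
    """Eq. (12.2): CE = -sum_i [ b_i*log(w_i) + (1 - b_i)*log(1 - w_i) ]."""
    w = np.clip(w, 1e-12, 1.0 - 1e-12)          # guard against log(0)
    return -np.sum(b * np.log(w) + (1.0 - b) * np.log(1.0 - w))

def fitness(per_class_ce):
    """Eq. (12.3): f = sum_c CE(c, Y) * log CE(c, Y), aggregated over the M classes."""
    ce = np.clip(np.asarray(per_class_ce, dtype=float), 1e-12, None)
    return float(np.sum(ce * np.log(ce)))

def total_loss(true_labels, predictions):
    """Eq. (12.4): the summed loss over the N training samples; the weights that
    minimize this quantity are the ones kept after backpropagation."""
    return sum(cross_entropy(b, y) for b, y in zip(true_labels, predictions))

# Two candidate weight settings produce two sets of predicted probabilities;
# the setting with the lower total loss would be selected as the fitter one.
labels  = [np.array([1, 0, 1]), np.array([0, 1, 1])]
preds_a = [np.array([0.9, 0.1, 0.8]), np.array([0.2, 0.7, 0.9])]
preds_b = [np.array([0.6, 0.5, 0.5]), np.array([0.5, 0.5, 0.6])]
print(total_loss(labels, preds_a) < total_loss(labels, preds_b))   # True: setting A is fitter
```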
12.5 Performance evaluation
This section compares the results of the proposed GAN-RNN model with various other GAN models: GAN [26], Wasserstein GAN [30], deep-GAN [29], D-GAN [31], and MTL-GAN [23].

12.5.1 Dataset collection
The multivariate datasets mix integer and real-valued attributes; the categorical attributes are removed. The following datasets are used to evaluate the GAN-RNN model: Dataset-1, breast cancer Wisconsin (original), with 699 instances; Dataset-2, breast cancer Wisconsin (diagnostic), with 569 instances; Dataset-3, heart disease, with 303 instances; Dataset-4, heart failure clinical records, with 299 instances; Dataset-5, diabetes 130-US hospitals, with 100,000 instances; and Dataset-6, lung cancer, with 32 instances [50]. The features and classes used in this chapter are summarized in Tables 12.1 and 12.2.

Table 12.1 Attributes of the first three datasets for prediction.

Attributes                                         Dataset-1      Dataset-2      Dataset-3
Data set characteristics                           Multivariate   Multivariate   Multivariate
Attribute characteristics (*categorical removed)   Integer        Real           Integer and real
Number of instances                                699            569            303
Number of attributes                               10             32             75
Missing values or errors present                   Yes            No             Yes

Table 12.2 Attributes of the last three datasets for prediction.

Attributes                                         Dataset-4          Dataset-5      Dataset-6
Data set characteristics                           Multivariate       Multivariate   Multivariate
Attribute characteristics (*categorical removed)   Integer and real   Integer        Integer
Number of instances                                299                100,000        32
Number of attributes                               13                 55             56
Missing values or errors present                   No                 Yes            Yes

12.5.2 Performance metrics
The performance of the GAN-RNN model is validated against several metrics: accuracy, sensitivity, specificity, F-measure, geometric mean (G-mean), percentage error, and training performance. The definitions of the performance metrics are given below.

Accuracy is the proportion of correct predictions needed to ensure that the GAN-RNN model predicts the output correctly without noise. It is defined as the ratio of total correct predictions to total predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (12.5)

where TP is the number of true positive cases, TN the true negative cases, FP the false positive cases, and FN the false negative cases.

F-measure is the weighted harmonic mean of the sensitivity and specificity values and ranges between zero and one:

F-measure = 2TP / (2TP + FP + FN)   (12.6)

The higher the F-measure, the better the performance of the GAN-RNN classifier.

Sensitivity is the ability of the GAN-RNN model to correctly identify the true positive rate:

Sensitivity = TP / (TP + FN)   (12.7)

Specificity is the ability of the GAN-RNN model to correctly identify the true negative rate:

Specificity = TN / (TN + FP)   (12.8)

The geometric mean (G-mean) aggregates the specificity and sensitivity measures and maintains the trade-off between them when the dataset is imbalanced:

G-mean = \sqrt{ (TP / (TP + FN)) \times (TN / (TN + FP)) }   (12.9)

The higher the G-mean, the better the performance of the GAN-RNN, and vice versa.

The mean absolute percentage error (MAPE) measures prediction accuracy by estimating the total losses incurred while predicting data instances from the preprocessed datasets. MAPE is defined from the ratio of the difference between the actual classes (At) and the predicted classes (Ft) to the actual class; this value is multiplied by 100 to obtain a percentage and divided by the number of fit points, or actual data points, n:

MAPE = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right|   (12.10)
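A hedged sketch of how the metrics in Eqs. (12.5)–(12.10) can be computed from binary predictions is given below; it is plain NumPy with illustrative inputs and is not taken from the chapter's experimental code.

```python
# Confusion-matrix metrics of Eqs. (12.5)-(12.9) and the MAPE of Eq. (12.10).
import numpy as np

def confusion_counts(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return tp, tn, fp, fn

def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    accuracy    = (tp + tn) / (tp + tn + fp + fn)        # Eq. (12.5)
    f_measure   = 2 * tp / (2 * tp + fp + fn)            # Eq. (12.6)
    sensitivity = tp / (tp + fn)                         # Eq. (12.7)
    specificity = tn / (tn + fp)                         # Eq. (12.8)
    g_mean      = np.sqrt(sensitivity * specificity)     # Eq. (12.9)
    return accuracy, f_measure, sensitivity, specificity, g_mean

def mape(actual, predicted):
    """Eq. (12.10): mean absolute percentage error over the n fit points."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 100.0 / len(actual) * np.sum(np.abs((actual - predicted) / actual))

y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1])
print(classification_metrics(y_true, y_pred))
print(mape([10, 12, 9], [11, 12, 8]))
```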
12.5.3 Discussions
Table 12.3 shows the accuracy of the proposed GAN-RNN (Method-6) compared with the benchmark GAN (Method-1), Wasserstein GAN (Method-2), deep-GAN (Method-3), D-GAN (Method-4), and MTL-GAN (Method-5). The results are validated against the various datasets discussed in Section 12.5. Validation against the existing models on each dataset shows that the GAN-RNN obtains higher classification accuracy than the other GAN methods. The simulation results on the large Dataset-5 show that the GAN-RNN and the existing methods all suffer from reduced accuracy, owing to the complexity of computing the larger set of data features; however, the GAN-RNN operates with lower complexity than the other methods.

Table 12.3 Accuracy of the GAN variants on Datasets 1–6.

Dataset     GAN      W_GAN    Deep_GAN  D_GAN    MTL_GAN  GAN_RNN
Dataset-1   96.6507  97.3779  97.4082   97.489   97.4991  97.6304
Dataset-2   97.9334  97.9536  97.9738   97.9738  97.9839  97.9839
Dataset-3   96.0952  96.1154  96.1255   96.2164  96.2366  96.2972
Dataset-4   97.6395  97.6395  97.7203   97.7203  97.7405  97.791
Dataset-5   94.1348  94.2863  94.3671   94.4176  94.5994  94.6095
Dataset-6   97.5698  97.5698  97.6506   97.6506  97.6708  97.7213

Table 12.4 shows the sensitivity of the proposed GAN-RNN compared with the existing GAN models over the various datasets. The results show that the GAN-RNN obtains higher sensitivity than the other GAN methods. The sensitivity on the large Dataset-5 is lower for all GAN methods; however, the GAN-RNN still operates with higher sensitivity than the other methods.
Table 12.4 Sensitivity of the GAN variants on Datasets 1–6.

Dataset     GAN      W_GAN    Deep_GAN  D_GAN    MTL_GAN  GAN_RNN
Dataset-1   65.6131  67.0181  69.8168   72.5236  82.0721  85.7091
Dataset-2   86.0626  92.7609  93.8729   94.4587  94.7415  94.7516
Dataset-3   62.5821  63.3194  64.2193   64.5829  66.4211  67.2605
Dataset-4   87.6594  87.6594  88.4775   88.6088  89.3562  89.4269
Dataset-5   61.3297  61.6933  62.2185   63.8456  64.3506  64.6738
Dataset-6   87.7241  87.7241  88.5432   88.6745  89.4219  89.4936

Table 12.5 shows the specificity of the proposed GAN-RNN compared with the existing GAN models over the various datasets. The results show that the GAN-RNN obtains higher specificity than the other GAN methods. The specificity on the large Dataset-5 is lower for all GAN methods; however, the GAN-RNN still operates with higher specificity than the other methods.

Table 12.5 Specificity of the GAN variants on Datasets 1–6.

Dataset     GAN      W_GAN    Deep_GAN  D_GAN    MTL_GAN  GAN_RNN
Dataset-1   93.9537  93.9739  93.9739   93.9739  93.9739  94.6203
Dataset-2   93.7214  94.5193  94.7213   94.8021  94.8223  94.8829
Dataset-3   91.8923  91.9529  91.9933   93.2366  93.6507  94.0042
Dataset-4   94.7304  94.7405  94.8314   94.8314  94.9223  94.9223
Dataset-5   92.4276  92.5892  92.5993   92.6498  92.6902  92.7407
Dataset-6   94.6607  94.6708  94.7617   94.7617  94.8526  94.8526

Table 12.6 shows the F-measure of the proposed GAN-RNN compared with the existing GAN models over the various datasets. The results show that the GAN-RNN obtains a higher F-measure than the other GAN methods. The F-measure on the large Dataset-5 is lower for all GAN methods; however, the GAN-RNN still operates with a higher F-measure than the other methods.

Table 12.6 F-measure of the GAN variants on Datasets 1–6.

Dataset     GAN      W_GAN    Deep_GAN  D_GAN    MTL_GAN  GAN_RNN
Dataset-1   91.3151  92.7604  92.9119   93.427   93.6391  94.2855
Dataset-2   79.383   79.5143  80.0395   81.1313  81.8181  82.111
Dataset-3   60.0657  71.6534  71.9777   74.8875  78.1508  81.3838
Dataset-4   87.971   88.0922  90.0526   90.0829  91.4565  91.4666
Dataset-5   54.1239  61.8736  62.217    62.6816  64.2279  64.3693
Dataset-6   88.0326  88.1538  90.1152   90.1455  91.5201  91.5313

Table 12.7 shows the G-mean of the proposed GAN-RNN compared with the existing GAN models over the various datasets. The results show that the GAN-RNN obtains a higher G-mean than the other GAN methods. The G-mean on the large Dataset-5 is lower for all GAN methods; however, the GAN-RNN still operates with a higher G-mean than the other methods.
Table 12.7 G-mean of the GAN variants on Datasets 1–6.

Dataset     GAN      W_GAN    Deep_GAN  D_GAN    MTL_GAN  GAN_RNN
Dataset-1   95.4278  98.8628  99.4183   99.7213  99.8627  99.8627
Dataset-2   96.1954  96.1954  96.5691   96.6398  96.9731  97.0135
Dataset-3   81.1616  81.6767  82.212    82.4443  83.5664  84.031
Dataset-4   96.2621  96.2621  96.6368   96.7075  97.0408  97.0812
Dataset-5   81.4545  81.6969  81.9696   82.9796  83.3028  83.4957
Dataset-6   83.9301  84.7885  86.4257   88.0013  93.0937  94.6289

Table 12.8 shows the MAPE of the proposed GAN-RNN compared with the existing GAN models over the various datasets. The results show that the GAN-RNN obtains a lower MAPE than the other GAN methods. The MAPE on Dataset-5 is the highest of all datasets because of its larger size; however, the GAN-RNN still operates with a lower MAPE than the other methods.

Table 12.8 MAPE of the GAN variants on Datasets 1–6.

Dataset     GAN      W_GAN    Deep_GAN  D_GAN    MTL_GAN  GAN_RNN
Dataset-1   28.4113  27.0064  24.2077   21.5108  11.9523  10.2453
Dataset-2   80.7386  20.5224  13.6918   2.60904  48.4075  14.4503
Dataset-3   31.4423  30.704   29.8051   29.4516  27.6023  26.7741
Dataset-4   32.6947  32.3311  31.8059   30.1788  29.6738  29.3506
Dataset-5   72.7857  72.6039  64.6269   63.2826  55.851   55.1632
Dataset-6   27.7352  27.5534  19.5814   18.2381  10.8116  10.1248

Further, the proposed method is tested on a high-dimensional dataset to assess its efficacy and robustness. It is additionally tested on Dataset-5 while varying the amount of training data, and the results are given below. From Fig. 12.5, we find that the proposed system with 90% training data performs best across the various class labels. Hence, we conclude that with more training labels, the supervised method performs best under reduced noise labels, as is evident from the figure. We therefore choose 90% training data with 10% noise labels for further evaluation. The results show that the proposed method has higher performance and data stability than the other models in all three settings, as reported in Tables 12.9–12.11, and that with 90% training data the stability of the system is higher than that of the other models.
With 90% of the training data, we found that the GAN with RNN generates optimal synthetic samples from the datasets. The classification of data by the discriminator network against the real data distribution shows an accurate transformation of the noise vector into new samples. This resolves the optimization problems, and the use of a cross-entropy objective function updates the RNN weights through backward learning, which increases the robustness of the classifier with minimal losses. However, as the amount of training data is reduced, performance tends to degrade because the noise vector transformation is less accurate, which affects the stability of the system.

Fig. 12.5 Comparison of GAN-RNN accuracy over Dataset-5 with varying amounts of training data ((A) 60%, (B) 75%, (C) 90%) and varying proportions of noise labels (10%, 20%, 30%).

Table 12.9 Comparison of other parameters of the GAN-RNN with 60% training data and 10% label noise.

Metrics       GAN      W_GAN    Deep_GAN  D_GAN    MTL_GAN  GAN_RNN
F-measure     75.1937  75.4599  75.6402   75.8311  80.668   86.5453
G-mean        75.5129  75.7568  77.5809   79.7666  82.3765  85.2939
MAPE          73.5287  69.6674  62.5377   43.3257  40.2916  38.3286
Sensitivity   83.5006  76.7431  77.5066   79.3848  79.5969  86.8634
Specificity   75.9477  77.9414  81.1982   86.7468  88.1371  88.5719

Table 12.10 Comparison of the GAN-RNN with 75% training data and 10% label noise.

Metrics       GAN      W_GAN    Deep_GAN  D_GAN    MTL_GAN  GAN_RNN
F-measure     42.0106  44.2377  56.1514   56.321   58.8355  90.0036
G-mean        78.2172  78.4622  80.0529   80.0953  80.5301  92.0408
MAPE          31.3378  28.2188  26.7341   23.997   23.3819  18.3954
Sensitivity   66.7712  70.4946  78.8757   92.009   92.7089  98.3610
Specificity   79.9575  80.159   83.8824   83.9036  85.3575  86.2483

Table 12.11 Comparison of the GAN-RNN with 90% training data and 10% label noise.

Metrics       GAN      W_GAN    Deep_GAN  D_GAN    MTL_GAN  GAN_RNN
F-measure     72.0748  72.1278  73.0727   74.2605  79.5969  85.4954
G-mean        47.4309  61.1378  64.3522   48.7045  81.9841  92.4756
MAPE          21.8431  19.0105  18.9151   13.7695  12.3591  11.1395
Sensitivity   82.1325  84.9757  85.0712   90.2051  91.6272  92.8468
Specificity   79.1197  82.175   83.1718   86.5241  88.6461  90.8636

12.6 Conclusions
In this chapter, the scaling of the GAN is improved with an RNN that learns the network recursively for classification. Recursive learning from past records and the neighborhood relationship reduces the error rate when analyzing time series data. Backward propagation through the GAN with updated network weights explicitly enables the output estimation. The target output is projected by the GAN-RNN model, and the most accurate predictions are thus made. Therefore, the probability of identifying the false samples is reduced using the RNN discriminator operated with a gradient-operator-based objective function. The simulation results show that the GAN-RNN model (97.01%) has a higher classification rate than the existing methods, namely the benchmark GAN (96.67%), Wasserstein GAN (96.82%), deep-GAN (96.87%), D-GAN (96.91%), and MTL-GAN (96.96%). Training the model with synthetic datasets has enabled higher predictive results with a reduced false classification rate. In the future, the system may adopt an optimal loss function that contributes better to adversarial training.
References
[1] M.Y. Liu, O. Tuzel, Coupled generative adversarial networks, in: Advances in Neural Information Processing Systems, 2016, pp. 469–477.
[2] H. Zhang, I. Goodfellow, D. Metaxas, A. Odena, Self-attention generative adversarial networks, in: International Conference on Machine Learning, 2019, pp. 7354–7363.
[3] C. Li, M. Wand, Precomputed real-time texture synthesis with Markovian generative adversarial networks, in: European Conference on Computer Vision, Springer, Cham, 2016, pp. 702–716.
[4] T. Karras, S. Laine, T. Aila, A style-based generator architecture for generative adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4401–4410.
[5] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, D.N. Metaxas, StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5907–5915.
[6] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, A.A. Bharath, Generative adversarial networks: an overview, IEEE Signal Process. Mag. 35 (1) (2018) 53–65.
[7] T. Schlegl, P. Seeböck, S.M. Waldstein, U. Schmidt-Erfurth, G. Langs, Unsupervised anomaly detection with generative adversarial networks to guide marker discovery, in: International Conference on Information Processing in Medical Imaging, Springer, Cham, 2017, pp. 146–157.
[8] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, S. Belongie, Stacked generative adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5077–5086.
[9] E.L. Denton, S. Chintala, R. Fergus, Deep generative image models using a Laplacian pyramid of adversarial networks, in: Advances in Neural Information Processing Systems, 2015, pp. 1486–1494.
[10] J.M. Wolterink, T. Leiner, M.A. Viergever, I. Išgum, Generative adversarial networks for noise reduction in low-dose CT, IEEE Trans. Med. Imaging 36 (12) (2017) 2536–2545.
[11] K. Roth, A. Lucchi, S. Nowozin, T. Hofmann, Stabilizing training of generative adversarial networks through regularization, in: Advances in Neural Information Processing Systems, 2017, pp. 2018–2028.
[12] G.J. Qi, Loss-sensitive generative adversarial networks on Lipschitz densities, Int. J. Comput. Vis. 128 (2020) 1118–1140.
[13] D. Warde-Farley, Y. Bengio, Improving generative adversarial networks with denoising feature matching, in: International Conference on Learning Representations, Toulon, France, April 24–26, 2017.
[14] G. Douzas, F. Bacao, Effective data generation for imbalanced learning using conditional generative adversarial networks, Expert Syst. Appl. 91 (2018) 464–471.
[15] K.M. Ting, An instance-weighting method to induce cost-sensitive trees, IEEE Trans. Knowl. Data Eng. 14 (3) (2002) 659–665.
[16] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv preprint arXiv:1511.06434, 2015.
[17] J. Zhao, M. Mathieu, Y. LeCun, Energy-based generative adversarial network, arXiv preprint arXiv:1609.03126, 2016.
[18] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for training GANs, in: Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
[19] X. Mao, Q. Li, H. Xie, R.Y. Lau, Z. Wang, Multi-class generative adversarial networks with the L2 loss function, arXiv preprint arXiv:1611.04076, 2016.
[20] Q. Lyu, M. Guo, Z. Pei, DeGAN: mixed noise removal via generative adversarial networks, Appl. Soft Comput. 95 (2020) 106478.
[21] Z. Chen, Z. Zeng, H. Shen, X. Zheng, P. Dai, P. Ouyang, DN-GAN: denoising generative adversarial networks for speckle noise reduction in optical coherence tomography images, Biomed. Signal Process. Control 55 (2020) 101632.
[22] N.M. Nawi, A. Khan, M.Z. Rehman, H. Chiroma, T. Herawan, Weight optimization in recurrent neural networks with hybrid metaheuristic cuckoo search techniques for data classification, Math. Probl. Eng. 2015 (2015).
[23] M. Li, D. Chen, S. Liu, F. Liu, Grain boundary detection and second phase segmentation based on multi-task learning and generative adversarial network, Measurement 162 (2020) 107857.
[24] G. Zhong, W. Gao, Y. Liu, Y. Yang, D.H. Wang, K. Huang, Generative adversarial networks with decoder-encoder output noises, Neural Netw. 127 (2020) 19–28.
[25] Z. Cui, K. Henrickson, R. Ke, Y. Wang, Traffic graph convolutional recurrent neural network: a deep learning framework for network-scale traffic learning and forecasting, IEEE Trans. Intell. Transp. Syst. (2019), https://doi.org/10.1109/TITS.2019.2950416.
[26] J. Sun, G. Zhong, Y. Chen, Y. Liu, T. Li, K. Huang, Generative adversarial networks with mixture of t-distributions noise for diverse image generation, Neural Netw. 122 (2020) 374–381.
[27] K.E. Ak, J.H. Lim, J.Y. Tham, A.A. Kassim, Semantically consistent text to fashion image synthesis with an enhanced attentional generative adversarial network, Pattern Recogn. Lett. 135 (2020) 22–29.
[28] L. Xu, X. Zeng, W. Li, Z. Huang, Multi-granularity generative adversarial nets with reconstructive sampling for image inpainting, Neurocomputing 402 (2020) 220–234.
[29] B. Banuselvasaraswathy, A. Sampathkumar, P. Jayarajan, N. Sheriff, M. Ashwin, V. Sivasankaran, A review on thermal and QoS aware routing protocols for health care applications in WBASN, in: IEEE International Conference on Communication and Signal Processing, July 28–30, India, 2020.
[30] W. Zhang, X. Li, X.D. Jia, H. Ma, Z. Luo, X. Li, Machinery fault diagnosis with imbalanced data using deep generative adversarial networks, Measurement 152 (2020) 107377.
[31] J. Zhang, G. Sheng, First arrival picking of microseismic signals based on nested U-net and Wasserstein generative adversarial network, J. Pet. Sci. Eng. (2020), https://doi.org/10.1016/j.petrol.2020.107527.
[32] F. Yang, Z. Wang, J. Li, R. Xia, Y. Yan, Improving generative adversarial networks for speech enhancement through regularization of latent representations, Speech Comm. 118 (2020) 1–9.
[33] A. Sampathkumar, S. Murugan, R. Rastogi, M.K. Mishra, S. Malathy, R. Manikandan, Energy efficient ACPI and JEHDO mechanism for IoT device energy management in healthcare, in: G. Kanagachidambaresan, R. Maheswar, V. Manikandan, K. Ramakrishnan (Eds.), Internet of Things in Smart Technologies for Sustainable Urban Development, EAI/Springer Innovations in Communication and Computing, Springer, Cham, 2020.
[34] S. Pascual, J. Serra, A. Bonafonte, Time-domain speech enhancement using generative adversarial networks, Speech Comm. 114 (2019) 10–21.
[35] Z. Chen, C. Wang, H. Wu, K. Shang, J. Wang, DMGAN: discriminative metric-based generative adversarial networks, Knowl.-Based Syst. 192 (2020) 105370.
[36] X. Li, Y. Makihara, C. Xu, Y. Yagi, M. Ren, Gait recognition invariant to carried objects using alpha blending generative adversarial networks, Pattern Recogn. 105 (2020) 107376.
[37] A. Sampathkumar, J. Mulerikkal, M. Sivaram, Glowworm swarm optimization for effectual load balancing and routing strategies in wireless sensor networks, Wirel. Netw. 26 (6) (2020) 4227–4238, https://doi.org/10.1007/s11276-020-02336-w.
[38] A. Sampathkumar, S. Murugan, A.A. Elngar, L. Garg, R. Kanmani, A.C.J. Malar, A novel scheme for an IoT-based weather monitoring system using a wireless sensor network, in: Integration of WSN and IoT for Smart Cities, 2020, pp. 181–191.
[39] S.R. Thennarasu, M. Selvam, K. Srihari, A new whale optimizer for workflow scheduling in cloud computing environment, J. Ambient Intell. Human. Comput. (2020) 1–8.
[40] K. Kamaraj, C. Arvind, K. Srihari, A weight optimized artificial neural network for automated software test oracle, Soft Comput. (2020) 1–11.
[41] M. Hibat-Allah, M. Ganahl, L.E. Hayward, R.G. Melko, J. Carrasquilla, Recurrent neural network wave functions, Phys. Rev. Res. 2 (2) (2020) 023358.
[42] M.Z. Uddin, M.M. Hassan, A. Alsanad, C. Savaglio, A body sensor data fusion and deep recurrent neural network-based behavior recognition approach for robust healthcare, Inf. Fusion 55 (2020) 105–115.
[43] K. Guo, Y. Hu, Z. Qian, H. Liu, K. Zhang, Y. Sun, B. Yin, Optimized graph convolution recurrent neural network for traffic prediction, IEEE Trans. Intell. Transp. Syst. 22 (2) (2021) 1138–1149.
[44] M. Almiani, A. AbuGhazleh, A. Al-Rahayfeh, S. Atiewi, A. Razaque, Deep recurrent neural network for IoT intrusion detection system, Simul. Model. Pract. Theory 101 (2020) 102031.
[45] X. Li, Z. Xu, S. Li, H. Wu, X. Zhou, Cooperative kinematic control for multiple redundant manipulators under partially known information using recurrent neural network, IEEE Access 8 (2020) 40029–40038.
[46] A. Sampathkumar, P. Vivekanandan, Gene selection using multiple queen colonies in large scale machine learning, J. Electr. Eng. 9 (6) (2020) 97–111.
[47] A. Sampathkumar, P. Vivekanandan, Gene selection using PLOA method in microarray data for cancer classification, J. Med. Imag. Health Inf. 9 (2019) 1294–1300.
[48] A. Sampathkumar, R. Rastogi, Arukonda, S.A. Shankar, S. Kautish, M. Sivaram, An efficient hybrid methodology for detection of cancer-causing gene using CSC for micro array data, J. Ambient Intell. Human. Comput., Springer, 2020, https://doi.org/10.1007/s12652-020-01731-7.
[49] A. Sherstinsky, Fundamentals of recurrent neural network (RNN) and long short-term memory (LSTM) network, Physica D 404 (2020) 132306.
[50] T.H. Wen, S. Young, Recurrent neural network language generation for spoken dialogue systems, Comput. Speech Lang. 63 (2020) 101017.

CHAPTER 13
WGGAN: A wavelet-guided generative adversarial network for thermal image translation

Ran Zhang (a,*), Junchi Bin (a,*), Zheng Liu (a), and Erik Blasch (b)
a University of British Columbia, Kelowna, BC, Canada
b MOVEJ Analytics, Dayton, OH, United States
* These authors contributed equally to this work.

13.1 Introduction
Thermal or infrared (IR) images are widely used in different applications, including night-vision navigation and surveillance [1], face recognition [2], and remote sensing [3].
IR images are produced by infrared cameras to record the thermal information of objects. These images are monochrome and are usually shown in gray scale [4]. They differ from RGB-converted gray scale images, which maintain the texture information. Compared with RGB images, IR images are less affected by environmental factors such as illumination differences, fog, and smoke. However, they do not contain color and texture information, which is critical to understanding the objects in images. RGB and IR images therefore have different characteristics and advantages for capturing information about objects: IR images tend to capture more thermal structure, while RGB images are more sensitive to colors. Visible and IR images can be fused [1, 5, 6] to generate more useful and comprehensive images that take advantage of both sources. However, acquiring RGB images in dark conditions is difficult because the hardware requires light. RGB images are more easily understood by humans and play an important role in machine vision applications. Therefore, IR-to-RGB translation is needed [7].

Traditional methods of image translation require manually specified colors [8], a reference image [4], or paired image datasets [9]. Manually specifying colors can produce vivid, colorful images under human guidance [8]. However, it requires a great deal of labor and is relatively time-consuming compared with automatic methods. Reference-based methods colorize images automatically by establishing feature connections from the source to the target image. They can be combined with manual intervention to improve performance, but they need reference images that are selected manually. With the advances of convolutional neural networks (CNNs), image translation can be fully automated. CNNs extract both low-level and semantic features; an object can be localized and colorized once the CNN learns semantic information [10]. CNN-based translation methods are trained on a large number of images covering various objects in different situations. During testing, these methods are fully automatic and do not need reference images. Although CNN-based translation has high accuracy, all of these methods need fully paired images to learn the direct mapping between IR and RGB images, and the acquisition of paired images is challenging due to the difficulty of hardware calibration in industrial applications.

A generative adversarial network (GAN) has been proposed to address unpaired image translation by including discriminators while training generative models, and recent state-of-the-art image translation methods are based on GANs [11–13]. However, when transferring IR images to RGB images, contemporary GANs are unable to keep the structure of the objects or produce clear texture information. In this study, we propose a wavelet-guided generative adversarial network (WGGAN) to address this challenge. Similar to contemporary methods, the WGGAN comprises an autoencoder for image translation and a discriminator for training.
To deal with the spatial distortion problem, we combine the discrete wavelet transformation with a variational autoencoder to keep structural information in the early stages of the network. This yields clear synthetic RGB images, as shown in Fig. 13.1. In addition, both qualitative and quantitative analyses are carried out to evaluate the proposed method's performance; compared with recent methods, the proposed method achieves more promising results in both. To conclude, our contributions are as follows:
• A design combining the discrete wavelet transformation and a variational autoencoder for IR-to-RGB image translation, which improves both the qualitative and quantitative results. To the best of our knowledge, this is the first method to adopt the discrete wavelet transformation in GAN-based IR-to-RGB translation.
• Robust performance, as the WGGAN does not require paired IR and RGB image datasets, facilitating thermal image translation when paired images are not available.

Fig. 13.1 An example of IR-to-RGB translation via the proposed wavelet-guided generative adversarial network (WGGAN).

The rest of the chapter is organized as follows. Section 13.2 introduces the progress in infrared image translation and relevant applications of GANs in image translation. Section 13.3 introduces the proposed WGGAN, from the overall architecture to implementation details. Section 13.4 presents the experimental setup and results compared with contemporary methods. Finally, Section 13.5 concludes the experiments and the WGGAN.

13.2 Related work
13.2.1 Infrared image translation
Infrared image translation aims to transfer single-channel gray scale images to multichannel RGB images that contain color and texture information. Methods can be divided into scribble-based [10, 14–16], reference-based [17–19], and fully automatic [20, 21] approaches. Scribble-based methods assume that adjacent areas have a similar color; the scribbles can be added by human intervention or by an edge detection algorithm. Reference-based methods rely on reference images whose structure is similar to the source image. The reference images can be selected automatically by feature matching, and the IR images can then be transferred to color images by image analogy [18]. Fully automatic methods usually utilize CNNs [22] or GANs [22, 23] to extract features and automatically establish the pixel-wise mapping from source images to target images, so IR images can be transferred directly to RGB images without manual intervention or reference images. However, they are usually supervised methods and require a paired training dataset, that is, the IR images should have corresponding calibrated RGB images, which are hard to obtain in most scenarios. Our proposed WGGAN needs only unpaired IR and RGB images for training, giving it significant practical value compared with other fully automatic methods.

13.2.2 GANs in image translation
Transferring IR images to RGB images can be considered a specific application of image translation, and much research has been conducted in this field. Image translation focuses on transferring the style of an image from one domain to another. Depending on whether the dataset is paired or unpaired, image translation can be divided into paired and unpaired settings. Paired and unpaired data can also be combined to train a model [24], thus obtaining the advantages of both paired and unpaired methods. Training an image translation model with paired data can lead to better performance, even though paired data are not easy to collect and calibrate. The conditional GAN [25] is used for pixel-to-pixel paired image-to-image translation and showed great performance on paired datasets. Unpaired image translation methods are more widely exploited as they place fewer limitations on the datasets. These methods usually contain more than one autoencoder, which generate target images first and then reconstruct images from the target domain back to the source domain.
The training image translation model with paired data can lead to better performance despite that the paired data are not easy to collect and calibrate. Conditional GAN [25] is used in the pixel-to-pixel level paired image-to-image translation. It showed great performance on the paired datasets. Unpaired image translation methods are more widely exploited as they have fewer limitations on the datasets. These methods usually contain more than one autoencoder, which generates target images first and reconstructs 315 316 Generative adversarial networks for image-to-image translation images from the target domain to the source domain. The reconstructed images should be similar or consistent with the source image in this process. CycleGAN [11] introduces cycle consistency loss to keep the reconstructed images similar to the source image. Resembling cycle consistency loss, reconstructed loss is designed in DiscoGAN [26] to keep the similarities between original and reconstructed images. DualGAN [27] adopts dual learning to GANs and learns to transfer images between two domains. UNIT [28] makes a shared-latent space assumption for transferring images. UGATIT [12] produces attention masks and fuses the generation output with the attention mechanism to generate higher quality target images. Apart from the above unimodal image translation problem, which is limited between two domains, multimodal image translation using a single model in unpaired datasets is more challenging. StarGAN [29] performs image-to-image translation for multiple domains. StarGAN v2 [30] enhances the performance by introducing the mapping network and style encoder. MUNIT [13] is another multimodal translation model which assumes that the image composition can be decomposed into a domain-invariant content code and a domain-specific style code. However, an empirical study reveals that these methods may have strong spatial distortion during paired thermal image translation [7]. Moreover, the quantitative results also indicate the unsatisfactory of translated images of both CycleGAN and UNIT. To address the problem of deformed translation, our proposed WGGAN aims to preserve structural information by adopting discrete wavelet transformation to variational autoencoder for unpaired thermal image translation. 13.3 Wavelet-guided generative adversarial network 13.3.1 Overall architecture The proposed wavelet-guided generative adversarial network (WGGAN) is designed for converting images from the IR domain to the RGB domain. Fig. 13.2 shows the overall Fig. 13.2 The architecture of the proposed WGGAN. WGGAN: A wavelet-guided generative adversarial network architecture of the proposed method. The overall process consists of training and testing stages. The training stage includes a proposed wavelet-guided variational autoencoder (WGVA), a discriminator, and a cycle consistency loss. The proposed WGVA aims to generate RGB images from real IR images, while a discriminator aims to recognize the transferred RGB images from real RGB images. Moreover, the cycle consistency loss [11] is implemented to train the WGVA due to the unaligned IR-RGB image pairs. After the training stage, the WGVA can be launched to translate IR images as a standalone model without the other components during the testing stage. 13.3.2 Wavelet-guided variational autoencoder The proposed wavelet-guided variational autoencoder (WGVA) is developed based on deep variational autoencoder and discrete wavelet transformation (DWT) [31]. As shown in Fig. 
As shown in Fig. 13.3, the WGVA consists of two subnetworks: an encoder E that converts the IR image to a latent space z, and a generator G that reconstructs the RGB image from z. Like MUNIT, StarGAN, and CycleGAN [13, 29], the design of the WGVA follows the standard residual autoencoder architecture, with two convolutional layers, two pooling layers, and four residual blocks in the encoder; the architecture of the generator is the symmetric reverse of the encoder. Details of the residual autoencoder can be found in Ref. [28]. Unlike the standard residual autoencoder, the latent space z is reparameterized to a variational distribution, which enables smooth and accurate generation [13]. In addition, inspired by the DWT, wavelet pooling and wavelet unpooling are designed to substitute for the conventional pooling layers of the standard residual autoencoder. The high-frequency components then skip across to bridge the corresponding pooling and unpooling layers, improving generative resolution, as illustrated in Fig. 13.3. The details of the reparameterization and wavelet pooling are introduced in the following sections.

Fig. 13.3 The illustration of the wavelet-guided variational autoencoder (WGVA).

13.3.2.1 Reparameterization in latent space
The encoder-generator pair {E, G} constitutes the WGVA for IR-to-RGB translation, as shown in Fig. 13.3. The latent space z is the representation of an input IR image x. According to the theoretical study, z should represent the variational distribution of the data in order to produce smooth generation [13]. To achieve this, the reparameterization is implemented as follows:

z \sim q(z \mid x) = \mathcal{N}(z; \mu, \sigma^2 I)   (13.1)

z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I)   (13.2)

where ⊙ is the element-wise product, \mathcal{N}(\cdot) is the normal distribution, μ is the mean of z, σ is the standard deviation of z, I is the identity matrix, and ε is normal noise. Both σ and μ are learnable variables in the WGVA; in other words, after training, σ and μ can be regarded as the approximate mean and standard deviation of the entire dataset. Through the above equations, the latent space z is standardized to the intended distribution with σ and μ. The normal noise ε is added to smooth z for stochastic optimization [13]. Therefore, after reparameterization with the learned σ and μ, the latent space z can be regarded as the variational distribution of the input image.

13.3.2.2 Discrete wavelet transformation for pooling
According to the empirical studies [13, 31], standard residual blocks may generate distorted and blurred images, mainly because the generator lacks structural information from the encoder. To address this issue, we adopt the discrete wavelet transformation (DWT) to extract structural information at the pooling layers of the proposed method [31]. The DWT has four kernels, {LL^T, LH^T, HL^T, HH^T}, where the low-pass (L) and high-pass (H) filters are

L^T = \frac{1}{\sqrt{2}}[1, 1], \quad H^T = \frac{1}{\sqrt{2}}[-1, 1]   (13.3)

Thus, the DWT generates four types of output, denoted LL, LH, HL, and HH, respectively. Fig. 13.4 shows examples after the DWT. The LL output retains the smooth texture of the image, while the remaining outputs capture the vertical, horizontal, and diagonal edges [31]. For simplicity, we denote the LL output as the low-frequency component and the LH, HL, and HH outputs as the high-frequency components. The DWT enables the proposed model to control the IR-to-RGB conversion through the different components separately: the low-frequency component affects the overall generated texture, while the high-frequency components affect the generated structure. By not processing the high-frequency components in the generative network, the structural information can be well maintained in these components.

Fig. 13.4 The illustration of discrete wavelet transformation (DWT).
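The following PyTorch sketch illustrates the two mechanisms just described: a stride-2 Haar wavelet pooling built from the fixed filters of Eq. (13.3), and the latent reparameterization of Eqs. (13.1)–(13.2). It is a minimal sketch under stated assumptions (per-channel depthwise filtering, noise drawn from a standard normal), not the authors' implementation.

```python
# Sketch only: Haar-based wavelet pooling and latent reparameterization for the WGVA.
import torch
import torch.nn.functional as F

def haar_kernels(channels):
    """Fixed 2x2 Haar filters LL, LH, HL, HH applied per channel (depthwise)."""
    l = torch.tensor([1.0, 1.0]) / (2 ** 0.5)      # low-pass  L of Eq. (13.3)
    h = torch.tensor([-1.0, 1.0]) / (2 ** 0.5)     # high-pass H of Eq. (13.3)
    bank = torch.stack([torch.outer(l, l), torch.outer(l, h),
                        torch.outer(h, l), torch.outer(h, h)])   # (4, 2, 2)
    return bank.repeat(channels, 1, 1).unsqueeze(1)              # (4*C, 1, 2, 2)

def wavelet_pool(x):
    """Stride-2 DWT: returns the low-frequency band and the three high-frequency bands."""
    c = x.size(1)
    w = haar_kernels(c).to(x.device)
    y = F.conv2d(x, w, stride=2, groups=c)          # (B, 4*C, H/2, W/2), weights frozen
    y = y.view(x.size(0), c, 4, y.size(2), y.size(3))
    ll, high = y[:, :, 0], y[:, :, 1:]              # LL and (LH, HL, HH)
    return ll, high

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I), realizing the sampling of Eq. (13.2)."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

x = torch.randn(1, 3, 64, 64)
ll, high = wavelet_pool(x)     # ll: (1, 3, 32, 32); high: (1, 3, 3, 32, 32) skipped to unpooling
```

In this sketch the high-frequency bands would simply be carried around the residual blocks and concatenated back at the symmetric unpooling layer, mirroring the skip connections of Fig. 13.3.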
Specifically, the low-frequency component can affect the overall generative texture, while the high-frequency components affect the generative structure. Without processing the generative network’s high-frequency components, the structural information can be well maintained in these components. From this point of view, the WGGAN: A wavelet-guided generative adversarial network Fig. 13.4 The illustration of discrete wavelet transformation (DWT). Fig. 13.5 The illustration of wavelet pooling and wavelet unpooling. wavelet pooling and wavelet unpooling are proposed to use these components in autoencoder for better IR-to-RGB translation, as shown in Fig. 13.5 [31]. The wavelet pooling applies DWT to the encoder layer to have low frequency and high-frequency components. The kernels of the convolutional layer are changed to DWT kernels to apply the DWT in the deep neural layer. On the other hand, the wavelet pooling layer is locked during the optimization. Moreover, the stride of the layer is 2 to have downsampling features as same as conventional pooling layers [31]. The lowfrequency component will be further processed by the network. Meanwhile, the high-frequency components skipped to the symmetrical wavelet unpooling layer in the generator. In the wavelet unpooling layer of the generator, both high-frequency and low-frequency components are concatenated. Then, the concatenated components are processed to have upsampling features by transpose convolution [31]. 13.3.3 Objective functions in adversarial training The full objective of the WGGAN comprises four loss functions: cycle-consistency loss, ELBO loss, perceptual loss, and GAN loss [11, 13, 32]. Cycle-consistency loss. To train the proposed method with unpaired RGB and IR images, we adopt the cycle consistency loss which is similar to MUNIT and CycleGAN [11]. The basic idea of cycle consistency loss aims to include two generative networks for constraining the generative images. Two generative adversarial networks: GAN1 ¼ {E1, G1, D1} 319 320 Generative adversarial networks for image-to-image translation for IR-to-RGB translation and GAN2 ¼ {E2, G2, D2} for RGB-to-IR translation are used in training, where E, G, and D denote encoder, generator, and discriminator, respectively. For simplicity, E(x) ¼ z indicates the latent space z is generated by encoder E. The theory of the loss is that the image translation cycle should be capable of bringing converted images back to original images, i.e., x ! E(x) ! G1(z) ! G2G1(z) x. The cycle-consistency loss is shown as below: LCC ðE1 , G1 , E2 , G2 Þ ¼ x1 pðx1 Þ ½k G2 ðG1 ðz1 ÞÞ x1 k + x2 pðx2 Þ ½k G1 ðG2 ðz2 ÞÞ x2 k (13.4) where k k represents the ‘1 distance. ELBO loss. ELBO aims to minimize the variational upper bound of latent space z. The objective function is LE ðE, GÞ ¼ λ1 KL qðzj xÞkpη ðzÞ λ2 zqðzj xÞ log pG ðxj zÞ (13.5) X P ðxÞ (13.6) KLðP|jQÞ ¼ P ðxÞlog QðxÞ where the hyperparameters λ1 and λ2 control the weights of the objective terms and the KL divergence terms that penalize deviation of the distribution of latent space from the prior distribution pη(z). Here q(.) represents the reparameterization mentioned in the previous section, while pη(z) represents zero-mean Gaussian distribution. pG(.) is the Laplacian distribution based on generator according to empirical studies [13]. Perceptual loss. Perceptual loss is a conventional loss function of neural style transfer with the assistant of pretrained VGG-16 [33] as shown in the following equation. 
The perceptual loss consists of two parts: the first term is the content loss and the second term is the style loss.

$\mathcal{L}_{P}(E, G, x_c, x_s) = \frac{1}{C_j H_j W_j} \left\lVert \phi_j(G(z)) - \phi_j(x_c) \right\rVert_2^2 + \frac{1}{C_j H_j W_j} \left\lVert Gr\left(\phi_j(G(z))\right) - Gr\left(\phi_j(x_s)\right) \right\rVert_2^2$   (13.7)

where ϕj(x) represents the feature map of the jth convolutional layer, of shape Cj × Hj × Wj, in the pretrained VGG-16; xc denotes the content image, while xs denotes the style image; and Gr is the Gram matrix, which is used to represent image style. From this point of view, the perceptual loss aims to transfer the style of the images while maintaining the image structure. The details of the perceptual loss can be found in Ref. [32].

GAN loss. The GAN loss aims to ensure that the translated images resemble the images of the respective target domains [14]. For example, if the discriminator regards the synthetic IR images as real IR images, the synthetic IR images are successful.

$\mathcal{L}_{GAN}(E, G, D) = \mathbb{E}_{x \sim p(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim q(z \mid x)}\left[\log\left(1 - D(G(z))\right)\right]$   (13.8)

Full loss. Finally, the complete loss function can be written as

$\mathcal{L}_{total} = \lambda_E\, \mathcal{L}_{E}(E, G) + \lambda_P\, \mathcal{L}_{P}(E_1, G_1, x_1, x_2) + \lambda_{GAN}\, \mathcal{L}_{GAN}(E, G, D) + \lambda_{CC}\, \mathcal{L}_{CC}(E_1, G_1, E_2, G_2)$   (13.9)

where λE = 0.1, λP = 0.1, λGAN = 1, and λCC = 10 represent the weights of the ELBO loss, perceptual loss, GAN loss, and cycle-consistency loss, respectively.

13.4 Experiments

This section presents the details of the experiments. First, it describes the dataset and the evaluation methods used in the experiments. Then, the baselines and the relevant experimental setup are introduced. Finally, both qualitative and quantitative analyses of the translation results are presented.

13.4.1 Data description

The dataset used is FLIR ADAS [34], an open dataset for autonomous driving. The dataset contains RGB and IR images captured from the same driving car. However, the recorded RGB and IR images are unpaired due to the different properties of the cameras [34]. For all experiments, the training and testing splits follow the dataset benchmark. The training set contains 8862 IR images and 8363 RGB images, while the testing set contains 1363 IR images and 1257 RGB images. The statistics of FLIR ADAS are presented in Table 13.1.

Table 13.1 Statistics of FLIR ADAS.
Dataset     Image type    # of frames    Image size
Training    IR            8862           640 × 512
            RGB           8363           1800 × 1600
Testing     IR            1363           640 × 512
            RGB           1257           1800 × 1600

13.4.2 Evaluation methods

Qualitative analysis. In research on generative models, a human perceptual study is a direct way to compare the translation quality of different models. In this study, several graduate students and computer vision engineers are invited to subjectively evaluate the translated results of the proposed method and the baseline methods alongside the source IR image. They are then asked to select the output with the best quality and to give short comments.

Quantitative analysis. Numerically evaluating the proposed WGGAN and the other comparative methods is challenging since no paired RGB images are available. To measure the translation quality, we include four Inception-based metrics: the 1-Nearest Neighbor classifier (1-NN), kernel maximum mean discrepancy (KMMD), Frechet inception distance (FID), and Wasserstein distance (WD) [35]. These metrics compute the distance between the features of the target and generated images extracted from the Inception network [35]. If an IR image is well translated, these metrics will have small values, which indicates that the generated RGB image is close to the distribution of the target RGB images.
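As an illustration of these Inception-feature-based metrics, the following is a minimal NumPy sketch of the kernel maximum mean discrepancy (KMMD) between two sets of extracted features. It is only a sketch under assumptions: the feature matrices are assumed to have been extracted beforehand (one row per image), a Gaussian kernel with a hand-picked bandwidth is used, and the simple biased estimator is computed rather than the exact protocol of Ref. [35].

```python
import numpy as np

def gaussian_kernel(a: np.ndarray, b: np.ndarray, sigma: float) -> np.ndarray:
    """Gaussian (RBF) kernel matrix between two sets of feature vectors."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def kernel_mmd(real_feats: np.ndarray, fake_feats: np.ndarray, sigma: float = 20.0) -> float:
    """Squared kernel MMD between features of real target-domain images and
    generated images; smaller values mean the generated distribution lies
    closer to the target distribution."""
    k_rr = gaussian_kernel(real_feats, real_feats, sigma)
    k_ff = gaussian_kernel(fake_feats, fake_feats, sigma)
    k_rf = gaussian_kernel(real_feats, fake_feats, sigma)
    return float(k_rr.mean() + k_ff.mean() - 2.0 * k_rf.mean())
```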
In addition, two no-reference image quality assessment (NR-IQA) methods, the blind/referenceless image spatial quality evaluator (BRISQUE) [36] and the natural image quality evaluator (NIQE) [37], are used to evaluate the generated images independently, without any paired or unpaired reference images. Small values of these measures indicate a high quality of the translated images. Moreover, a multicriteria decision analysis, TOPSIS [38], is included to summarize all the quantitative evaluation metrics.

13.4.3 Baselines

CycleGAN. CycleGAN consists of two standard residual autoencoders trained with a GAN loss and a cycle-consistency loss [11].

MUNIT. MUNIT is similar to CycleGAN in that it also consists of two autoencoders. To produce diverse generated images, the encoder of MUNIT has a content branch and a style branch; inspired by neural style transfer [13], the two branches are combined by adaptive instance normalization in the generator for image reconstruction.

StarGAN. StarGAN is a state-of-the-art generative method for facial attribute transfer and facial expression synthesis. It includes a mask vector and domain classification to generate diverse outputs [29].

UGATIT. UGATIT adds an attention mechanism to the residual autoencoder with an auxiliary classifier inspired by weakly supervised learning. Moreover, it introduces adaptive layer-instance normalization into the residual generator [12]. It achieves strong performance in anime translation and style transfer tasks.

13.4.4 Experimental setup

Adaptive moment estimation (ADAM) [12] is used as the optimizer for training the proposed method, with the learning rate set to 0.00001 and the momentum terms set to 0.5 and 0.99. To improve the model's robustness, the batch size is set to 1 with instance normalization after each neural layer. The discriminator is adopted from PatchGAN [11]. Moreover, all neuron activation functions are set to the rectified linear unit (ReLU), while the activation function of the output layer is Tanh for generating the synthetic images. To make a fair comparison, both the WGGAN and the baseline models are trained for 27 epochs with batch size 1. In addition, all images are resized to 512 × 512 before being fed to the network. A desktop with an NVIDIA TITAN RTX, an Intel Core i7, and 64 GB of memory is used throughout the experiments.

13.4.5 Translation results

This subsection presents both qualitative and quantitative analyses of the translation results compared with the baselines. In the qualitative analysis, examples of translations are illustrated together with subjective comments. The quantitative analysis then presents numerical results comparing the proposed WGGAN with the baselines.

13.4.5.1 Qualitative analysis

Fig. 13.6 illustrates the translation results on the test set of FLIR ADAS. StarGAN shows the worst translation performance, with unsuitable colors and noisy black spots on the images. The remaining methods can generate clear edges for solid objects such as vehicles. However, UGATIT is not capable of clearly translating objects such as trees and houses. Compared with WGGAN, CycleGAN, and MUNIT, participants also point out that the road texture is not well translated by UGATIT, as shown in Fig. 13.6.
The texture of the road is too smooth to show details such as the curb. On the other hand, CycleGAN can accurately translate the objects from the IR images with sharp edges and textures. However, several participants also mention that there are some incorrectly mapped objects in the RGB images generated by CycleGAN; for example, trees should not appear in the sky in Fig. 13.6B. In contrast, both the proposed WGGAN and MUNIT are able to translate the IR images with clear texture information for the objects, although some parts of the images, such as the sky and people, are not correctly translated, as shown in Fig. 13.6C. In the qualitative evaluation, participants indicate that the proposed WGGAN generates the best-quality images, with clear textures and correctly mapped objects. Compared with the other state-of-the-art methods, the generated images contain less scattered noise. In conclusion, the participants consider that the proposed WGGAN has the best performance in IR-to-RGB translation.

Fig. 13.6 Examples of (A) source IR images, (B) proposed WGGAN, (C) CycleGAN, (D) MUNIT, (E) StarGAN, and (F) UGATIT.

13.4.5.2 Quantitative analysis

Table 13.2 presents the quantitative results of the IR-to-RGB translation; the best result of each evaluation method is highlighted.

Table 13.2 Quantitative results of contemporary methods.
Models      1-NN     KMMD     FID      WD       BRISQUE    NIQE
CycleGAN    0.961    0.318    0.222    61.60    15.17      2.730
MUNIT       0.927    0.237    0.157    67.50    27.99      2.750
StarGAN     0.992    0.397    0.121    75.73    36.55      6.392
UGATIT      0.959    0.283    0.098    65.88    36.81      3.663
WGGAN       0.924    0.175    0.046    65.97    28.89      2.477

Table 13.3 Ranking results by TOPSIS based on the quantitative results.
Models      CycleGAN    MUNIT    StarGAN    UGATIT    WGGAN
TOPSIS      0.479       0.569    0.313      0.559     0.796

It is difficult to identify a single best method among the contemporary methods: CycleGAN achieves excellent performance in the NR-IQA evaluation, while MUNIT performs better in 1-NN and KMMD. Unlike the contemporary methods, the proposed WGGAN outperforms all of them, with the smallest values in 1-NN, KMMD, FID, and NIQE. For the Inception-based metrics, WGGAN shows 26.1% and 53.1% improvements in KMMD and FID, respectively, which means that the generated RGB images are similar to the target RGB domain. In addition, WGGAN achieves the best performance in NIQE, which indicates that the generated images are similar to natural images in terms of statistical regularities. To summarize all the evaluation metrics, we use a common multicriteria decision model, TOPSIS, to rank the image translation models. Table 13.3 shows that WGGAN has the highest TOPSIS value. The ranking results also demonstrate that the proposed WGGAN generates higher-quality RGB images than the other recent image translation methods.

13.5 Conclusion

In this chapter, an IR-to-RGB image translation method, the wavelet-guided generative adversarial network (WGGAN), is proposed for context enhancement. A wavelet-guided variational autoencoder (WGVA), which combines variational inference and the discrete wavelet transformation, is proposed for generating smooth and clear RGB images from the IR domain. In addition, further objective functions, such as the ELBO loss and the perceptual loss, are introduced to improve the generative quality. Both qualitative and quantitative results demonstrate the effectiveness of the proposed WGGAN in enabling better context enhancement for IR-to-RGB translation.
Many industrial applications can benefit from the proposed method, such as object detection at night for applications of semiautonomous driving, unmanned aerial vehicle (UAV) surveillance, and urban security. Although the proposed WGGAN has promising results in thermal image translation, there is still room for improvement. For example, the objects’ colors should be more discriminative from the background. Therefore, we first aim to add numerous IR and RGB images for fully training the proposed WGGAN. Then, more advanced modules, such as adaptive instance normalization, will be included to enhance the translation. 325 326 Generative adversarial networks for image-to-image translation Acknowledgment The first three authors are supported in part by grants from TerraSense Analytics Ltd. and Advanced Research Computing, University of British Columbia. References [1] G. Bhatnagar, Z. Liu, A novel image fusion framework for night-vision navigation and surveillance, Signal Image Video Process. 9 (1) (2015) 165–175. [2] G. Hermosilla, F. Gallardo, G. Farias, C.S. Martin, Fusion of visible and thermal descriptors using genetic algorithms for face recognition systems, Sensors 15 (8) (2015) 17944–17962. [3] X. Chang, L. Jiao, F. Liu, F. Xin, Multicontourlet-based adaptive fusion of infrared and visible remote sensing images, IEEE Geosci. Remote Sens. Lett. 7 (3) (2010) 549–553. [4] T. Hamam, Y. Dordek, D. Cohen, Single-band infrared texture-based image colorization, in: 2012 IEEE 27th Convention of Electrical and Electronics Engineers in Israel, 2012, pp. 1–5. [5] J. Ma, et al., Infrared and visible image fusion via detail preserving adversarial learning, Inf. Fusion 54 (2020) 85–98. [6] X. Jin, et al., A survey of infrared and visual image fusion methods, Infrared Phys. Technol. 85 (2017) 478–501. [7] S. Liu, V. John, E. Blasch, Z. Liu, Y. Huang, IR2VI: enhanced night environmental perception by unsupervised thermal image translation, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, June, vol. 2018, 2018, pp. 1234–1241. [8] H. Chang, O. Fried, Y. Liu, S. DiVerdi, A. Finkelstein, Palette-based photo recoloring, ACM Trans. Graph. 34 (4) (2015). [9] G. Larsson, M. Maire, G. Shakhnarovich, Learning representations for automatic colorization, in: Proceedings of European Conference on Computer Vision (ECCV), LNCS, vol. 9908, 2016, pp. 577–593. [10] A.Y.-S. Chia, et al., Semantic colorization with internet images, ACM Trans. Graph. 30 (6) (2011) 1–8. [11] J.Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, October, vol. 2017, 2017, pp. 2242–2251. [12] J. Kim, M. Kim, H. Kang, L. Kwanhee, U-GAT-IT: unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation, in: ICLR 2020, 2020, pp. 1–19. [13] X. Huang, M.Y. Liu, S. Belongie, J. Kautz, Multimodal unsupervised image-to-image translation, in: Proceedings of European Conference on Computer Vision (ECCV), LNCS, vol. 11207, 2018, pp. 179–196. [14] I.J. Goodfellow, et al., Generative adversarial nets, Adv. Neural Inf. Process. Syst. 3 (2014) 2672–2680. [15] A. Levin, D. Lischinski, Y. Weiss, Colorization using optimization, in: ACM SIGGRAPH 2004 Papers, 2004, pp. 689–694. [16] Q. Luan, F. Wen, D. Cohen-Or, L. Liang, Y.-Q. Xu, H.-Y. 
Shum, Natural image colorization, in: Proceedings of the 18th Eurographics Conference on Rendering Techniques, 2007, pp. 309–320. [17] R.K. Gupta, A.Y.-S. Chia, D. Rajan, H.Z. Ee Sin Ng, Image colorization using similar images, in: Proceedings of the 20th ACM international conference on Multimedia (MM’12), 2012. [18] A. Hertzmann, C.E. Jacobs, N. Oliver, B. Curless, D.H. Salesin, Image analogies, in: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques, 2001, pp. 327–340. [19] Y. Zheng, E. Blasch, Z. Liu, Multispectral Image Fusion and Night Vision Colorization, Society of Photo-Optical Instrumentation Engineers, 2018. [20] Z. Cheng, Q. Yang, B. Sheng, Deep colorization, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), vol. 2015, 2015, pp. 415–423. WGGAN: A wavelet-guided generative adversarial network [21] M. Limmer, H.P.A. Lensch, Infrared colorization using deep convolutional neural networks, in: 2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA), 2016, pp. 61–68. [22] P.L. Suárez, A.D. Sappa, B.X. Vintimilla, Infrared image colorization based on a triplet DCGAN architecture, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 212–217. [23] P.L. Suárez, A.D. Sappa, B.X. Vintimilla, Learning to colorize infrared images, in: PAAMS, 2017. [24] S. Tripathy, J. Kannala, E. Rahtu, Learning image-to-image translation using paired and unpaired training samples, in: Proceedings of Asian Conference on Computer Vision (ACCV), LNCS, vol. 11362, 2019, pp. 51–66. [25] P. Isola, J.Y. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, in: Proceedings—30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, January, vol. 2017, 2017, pp. 5967–5976. [26] T. Kim, M. Cha, H. Kim, J.K. Lee, J. Kim, Learning to discover cross-domain relations with generative adversarial networks, in: The 34th International Conference on Machine Learning, March, vol. 4, 2017, pp. 2941–2949. [27] Z. Yi, H. Zhang, P. Tan, M. Gong, DualGAN: unsupervised dual learning for image-to-image translation, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), October, vol. 2017, 2017, pp. 2868–2876. [28] M.Y. Liu, T. Breuel, J. Kautz, Unsupervised image-to-image translation networks, Adv. Neural Inf. Process. Syst. 2017 (2017) 701–709. [29] Y. Choi, M. Choi, M. Kim, J.W. Ha, S. Kim, J. Choo, StarGAN: unified generative adversarial networks for multi-domain image-to-image translation, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018, pp. 8789–8797. [30] Y. Choi, Y. Uh, J. Yoo, J.-W. Ha, StarGAN v2: diverse image synthesis for multiple domains, in: CoRR, vol. abs/1912.0, 2019. [31] J. Yoo, Y. Uh, S. Chun, B. Kang, J.W. Ha, Photorealistic style transfer via wavelet transforms, in: Proceedings of the IEEE International Conference on Computer Vision, October, vol. 2019, 2019, pp. 9035–9044. [32] J. Johnson, A. Alahi, L. Fei-Fei, Perceptual losses for real-time style transfer and super-resolution, in: Proceedings of the 14th European Conference on Computer Vision, Lecture Notes in Computer Science (LNCS), vol. 9906, Springer, 2016, pp. 694–711. [33] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: 3rd International Conference on Learning Representations. ICLR 2015—Conference Track Proceedings, 2015, pp. 1–14. 
[34] F.A. Group, FLIR thermal dataset for algorithm training, 2018, [Online]. Available from: https:// www.flir.in/oem/adas/adas-dataset-form. [35] Q. Xu, et al., An empirical study on evaluation metrics of generative adversarial networks, in: CoRR, vol. arXiv:1806, 2018, pp. 1–14. [36] A. Mittal, A.K. Moorthy, A.C. Bovik, No-reference image quality assessment in the spatial domain, IEEE Trans. Image Process. 21 (12) (2012) 4695–4708. [37] A. Mittal, R. Soundararajan, A.C. Bovik, Making a ‘completely blind’ image quality analyzer, IEEE Signal Process. Lett. 20 (3) (2013) 209–212. [38] V. Yadav, S. Karmakar, P.P. Kalbar, A.K. Dikshit, PyTOPS: a python based tool for TOPSIS, SoftwareX 9 (2019) 217–222. 327 CHAPTER 14 Generative adversarial network for video analytics A. Sasithradevia, S. Mohamed Mansoor Roomib, and R. Sivaranjanic a School of Electronics Engineering, VIT University, Chennai, India Department of Electronics and Communication Engineering, Thiagarajar College of Engineering, Madurai, India c Department of Electronics and Communication Engineering, Sethu Institute of Technology, Madurai, India b 14.1 Introduction The objective of video analytics is to recognize the events in videos automatically. Video analytics can detect events such as a sudden burst of flames, suspicious movement of vehicles and pedestrians, abnormal movement of a vehicle not obeying traffic signs. A commonly known application in the research field of video analytics is video surveillance which has started evolving 50 years ago. The principle behind the video surveillance is to involve human operators to monitor the events occurring in a public area, room, or desired space. In general, an operator is given full responsibility for several cameras and studies have shown that increasing the number of cameras to be monitored per operator degrades the performance of the operator. Hence, video analysis software aims to provide a better trade-off between accurate event detection and huge video information [1–3]. Machine learning, in particular, its descendant namely deep learning has prompted the research in the video analytics domain. The fundamental purpose of deep learning is to identify the sophisticated model that signifies the probability distributions over the different samples of videos which need analytics. Generative adversarial network (GAN) provides an efficient way to learn deep representations with minimal training data. GAN is an evolving technique for generating and representing the samples using both unsupervised and semisupervised learning methods. It is accomplished through the implicit modeling of high-dimensional data distribution. The underlying working principle of GAN is to train the pair of networks in competition with each other. Among these networks, one acts like an imitator and the other as a skillful specialist. From the formal description of GAN, the generator creates fake data mimicking the realistic one and the discriminator is an expert trained to distinguish the real samples from the forger ones. Both the networks are trained simultaneously in competition with each other. This generic framework for GAN is shown in Fig. 14.1. Both the generator and the discriminator are the neural networks where the former generates new instances and the latter assesses whether the instances belong to the dataset. Generative Adversarial Networks for Image-to-Image Translation https://doi.org/10.1016/B978-0-12-823519-5.00008-7 Copyright © 2021 Elsevier Inc. All rights reserved. 
329 330 Generative adversarial networks for image-to-Image translation Fig. 14.1 Generic framework for generative adversarial networks. For the purpose of classification, the discriminator plays the role of a classifier to distinguish the real from the fake. To build a GAN, one needs to have a training dataset and a clear idea about the desired output. Initially, GAN learns from simple distribution of 2D data, later GAN could be able to mimic high-dimensional data distribution along with eventual training. During the training phase, both the competing networks get the attributes regarding the distribution of data. The data samples generated by the generator along with the real data samples are used to train the discriminator. After sufficient training, the generator is trained against the discriminator. Thus the generator learns to map any random data samples. Consider the scenario as Fig. 14.1, where a D-dimensional noise vector obtained from the latent space is fed into the generator which converts them into new data samples. The discriminator then processes both the real and fake samples for classifying it. The main advantage of GAN relies on its randomness which aids it to create new data samples rather than the exact replica of the real data. Another crucial advantage of GAN over Autoencoders [4] and Boltzmann machine [5] is that GAN does not rely on Markov chain for the purpose of generating training models. GANs were designed to eliminate the high complexity associated with Markov chains. Also, the generator function undergoes a minimum restriction compared to Boltzman machines. Owing to these advantages, GANs have been attracted toward a variety of applications and the craving to utilize it in numerous areas is increasing. They have been effectively used in a wide variety of tasks like image to image translation, obtaining high-resolution images from lowresolution images, deciding the drugs for treating desired diseases, retrieving images, object recognition, text-image translation, intelligent video analysis [6], and so on. In this article, we present an overview of the working principle of GANs and its variants available for video analytics. We also emphasize the pros, cons, and the challenges for the fruitful implementation of GANs in different video analytic problems. Generative adversarial network for video analytics The remainder of this chapter is organized as follows: Section 14.2 provides the building blocks of GANs, its driving factor called objective functions and the challenging issues of GANs. Section 14.3 highlights the variants of GANs emerged for the problem of video analytics in past years. Section 14.4 discusses the possible future works in the area of video analytics based on GAN. Section 14.5 concludes this chapter. 14.2 Building blocks of GAN This section describes the basic building blocks of GAN and the different objective functions used for training the GAN architectures. 14.2.1 Training process The training process involving the objective or cost function is the basic building block for GANs. Training of GAN is a dual process which includes choosing the parameters for a generator that confuses the discriminator with fake data and discriminator that maximize the accuracy for any given application. The algorithm involved in the training process is described as follows: Algorithm 14.1 Step 1: Update parameters of discriminator “θD”: Input: “m” samples from real frames and “m” samples from noise data. 
Do: Compute the expected gradient ∇θD of the discriminator objective JθD(θD; θG). Update: θD using ∇θD. Step 2: Update parameters of the generator "θG": Input: "m" samples from noise data and the current θD. Do: Compute the expected gradient ∇θG of the generator objective JθG(θG; θD). Update: θG using ∇θG.

The objective or cost function V(G, D) used for training depends on the two competing networks. The training process involves both maximization and minimization:

$\max_{D} \min_{G} V(G, D)$   (14.1)

where $V(G, D) = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right] + \mathbb{E}_{x \sim p_{g}(x)}\left[\log(1 - D(x))\right]$.

As illustrated in Algorithm 14.1, one model's parameters are updated while the other's are fixed. For any fixed generator G, the optimal discriminator is $D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_{g}(x)}$ [7]. The generator is optimal when Pg(x) = Pdata(x), which shows that the generator reaches its optimum only when the discriminator is totally confused in discriminating real data from fakes. The discriminator is not trained completely until the generator reaches its optimum, but the generator is updated simultaneously with the discriminator. An alternative cost function typically used for updating the generator is maxG log D(G(Z)) instead of minG log(1 − D(G(Z))).

14.2.2 Objective functions

The main objective of generative models is to make Pg(x) equivalent to the real data distribution Pdata(x). Hence, the underlying principle for training the generator is to reduce the dissimilarity between the two distributions [8]. In recent years, researchers have attempted to utilize various dissimilarity measures to improve the performance of GANs. This section describes the computation of the different measures and objective functions.

f-Divergence: A dissimilarity measure between two distributions defined through a convex function f. The f-divergence between the two distributions Pdata(x) and Pg(x) [8] is written as

$D_f\left(P_{data} \,\Vert\, P_g\right) = \int_{x_1}^{x_2} P_g(x)\, f\!\left(\frac{P_{data}(x)}{P_g(x)}\right) dx$   (14.2)

Integral probability metric: It provides the maximal dissimilarity between two distributions over a class of functions [8]. Consider a data space X ⊆ ℝ^d with the probability distributions defined on it denoted P(X). The IPM distance metric between the distributions Pdata, Pg ∈ P(X) is defined as

$d_F\left(P_{data}, P_g\right) = \sup_{f \in F}\, \left( \mathbb{E}_{x \sim P_{data}}\left[f(x)\right] - \mathbb{E}_{x \sim P_g}\left[f(x)\right] \right)$   (14.3)

Auxiliary objective functions: The auxiliary functions related to the adversarial objective are the reconstruction and classification objective functions.
• Reconstruction objective function: The goal of the reconstruction objective is to minimize the difference between the output image of the neural network and the real image provided as its input [9, 10]. This type of objective helps the generator preserve the content of the real image data and allows the use of an autoencoder architecture for the GAN's discriminator [11, 12]. The discrepancy evaluated with the reconstruction objective usually involves the L1-norm measure.
• Classification objective function: The discriminator network can also be used as a classifier [13, 14] when a cross-entropy loss is employed as the discriminator's objective. Cross-entropy loss is widely used in many GAN applications for semisupervised learning and domain adaptation. This objective function can also be used to train the generator and the discriminator jointly for classification purposes.
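To ground the alternating updates of Algorithm 14.1 and the minimax objective of Eq. (14.1), the following is a minimal PyTorch-style sketch of one training iteration for an unconditional GAN. It is a generic illustration rather than any specific architecture from this chapter: the generator, discriminator, optimizers, and latent dimension are assumed to be defined elsewhere, the discriminator is assumed to end with a sigmoid, and the generator uses the common "non-saturating" form maxG log D(G(z)) mentioned above.

```python
import torch
import torch.nn.functional as F

def gan_training_step(G, D, opt_G, opt_D, real_frames, latent_dim=100):
    """One alternating update: first the discriminator (Step 1), then the generator (Step 2)."""
    m = real_frames.size(0)
    device = real_frames.device
    # D is assumed to output one sigmoid probability per sample, shape (m, 1).
    ones = torch.ones(m, 1, device=device)
    zeros = torch.zeros(m, 1, device=device)

    # Step 1: update the discriminator with m real and m generated samples.
    z = torch.randn(m, latent_dim, device=device)
    fake_frames = G(z).detach()                    # freeze G while updating D
    d_loss = F.binary_cross_entropy(D(real_frames), ones) + \
             F.binary_cross_entropy(D(fake_frames), zeros)
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Step 2: update the generator against the current discriminator.
    z = torch.randn(m, latent_dim, device=device)
    g_loss = F.binary_cross_entropy(D(G(z)), ones)  # max log D(G(z)) form
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```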
14.3 GAN variations for video analytics

In recent years, intelligent video analytics has become an emerging technology and research field in academia and industry. The scenes recorded by cameras enable the surveillance of events occurring in areas where human observation alone falls short. Recently, large numbers of cameras have been deployed for purposes [6] such as fire detection, person detection and tracking, vehicle detection, smoke detection, and the detection of unknown objects and crimes at country borders, shopping malls, airports, sports stadiums, underground stations, residential areas, academic campuses, and so on. Manually monitoring these videos is cumbersome due to obstacles such as operator drowsiness and distraction caused by increased responsibilities. This motivates the need for semisupervised approaches for analyzing the events in videos [15, 16]. Hence, intelligent video analytics remains a challenging problem in computer vision, one in which deep networks have not yet fully superseded classical handcrafted features. To date, video analytics has traveled a long journey from holistic features such as motion history images [17] (MHI), motion energy images [18] (MEI), and action banks [19], through local feature-based approaches such as HOG3D [20], the spatiotemporal histogram of radon projections (STHRP) [21], and histograms of optical flow [22], up to tracking approaches. One efficient approach is to employ deep networks to learn from and analyze videos without knowledge of class labels but with the sequential organization of frames, a setting termed "weak supervision." This technique still requires a little supervision in the strategies for providing input to the deep neural networks, such as sampling, encoding, and organizing methods. Unlike conventional deep networks, generative models called GANs [23, 24] have been successfully applied in the field of video analytics, without human intervention for labeling the videos, to applications such as future video frame prediction [25]. Over time, the GAN architecture has been modified for various applications such as video generation, video prediction, action recognition, video summarization, and video understanding, as listed in Table 14.1.

Table 14.1 GAN variations.
S. no    GAN variation            Application
1        MoCoGAN                  Video generation
2        VGGAN                    Video generation
3        LGGAN                    Video generation
4        TGANs                    Video generation
5        Dynamic transfer GAN     Video generation
6        FTGAN                    Video generation
7        DMGAN                    Video prediction
8        AMCGAN                   Video prediction
9        Discrimnet               Action recognition
10       HiGAN                    Video recognition
11       DCycle GAN               Face translation between images and videos
12       PoseGAN                  Human pose estimation
13       Recycle GAN              Video retargeting
14       DTRGAN                   Video summarization

14.3.1 GAN variations for video generation and prediction

Recent progress in generative models [26] has attracted researchers to image synthesis. In particular, GANs have been employed to synthesize images from random data distributions, through nonlinear transformations from a prime image to a synthesized one, or by generating synthesized images from a source domain. These advances in image synthesis have encouraged the use of GANs for generating video sequences. One of the challenging issues in using GANs for generating and predicting videos is that the output of the GAN architecture must provide meaningful video responses. This challenge places a large responsibility on the GAN, which must understand both the spatial and the temporal content of the video. One such extension of the GAN is MoCoGAN [27], which is used for generating videos with no prior knowledge about a priming image.
This variant of the GAN architecture partitions the input data distribution into two subspaces, namely a content subspace and a motion subspace. The content subspace is sampled from a Gaussian distribution, whereas the motion subspace is sampled by an RNN. These two subspaces correspond to two discriminators, called the content and motion discriminators. Even though MoCoGAN can generate videos of variable length, the motion discriminator was designed to handle only a limited number of frames. As shown in Fig. 14.2, the spatial content was generated for different instances of appearance while the motion was fixed to the same expression.

Fig. 14.2 Example frames generated by MoCoGAN [27].

Another useful variant of the GAN is the dynamic transfer GAN [28]. It attempts to generate a video sequence by transferring the temporal motion dynamics available in a source video sequence onto a prime target image. This target image contains the spatial content of the video data, and the dynamic information is obtained from the arbitrary motion. An RNN is used for spatiotemporal encoding. This dynamic GAN can generate video sequences of variable length using the competition between a generator and two discriminator networks. Of the two discriminators, one acts as a spatial discriminator to monitor the fidelity of the generated video sequence, while the other acts as a dynamic discriminator to maintain the integrity of the entire video sequence. The authors provided visualizations to demonstrate the ability of the dynamic GAN to encode the enriched dynamics from source videos while suppressing the appearance features. Fig. 14.3 shows an example of frames generated using the dynamic GAN for the anger expression.

Fig. 14.3 Frames generated by dynamic GAN [28] for anger expression.

Ohnishi et al. developed the flow and texture generative adversarial network (FTGAN) model [29], used to generate videos hierarchically from orthogonal information. FTGAN comprises two networks, namely FlowGAN and TextureGAN. This variation of the GAN architecture was proposed to explore the representation and generate videos without an enormous annotation cost. FlowGAN is used to generate the optical flow, which provides the edge and motion information for the video to be generated. The RGB videos are then generated from the optical flow using TextureGAN. The generic framework of FTGAN is shown in Fig. 14.4.

Fig. 14.4 Generic framework for FTGAN [29].
The TextureGAN and FlowGAN are trained on these dataset for 60 k iterations. The accuracy obtained on SURREAL dataset is 44% and 54% while using textureGAN and flowGAN respectively. On Penn Action dataset, the accuracy obtained through textureGAN is 72% and flowGAN is 58%. A multistage dynamic generative adversarial network (MSDGAN) was proposed for generating time-lapse videos of high resolution. The process involved in MSDGAN [30] is twofolds: at the initial stage, realistic information is generated for each frame in the video. The next stage prunes the videos generated by the first stage through the use of motion dynamics which could make the videos closer to the real one. The authors had used a large-scale time-lapse dataset to test the videos. This model generates realistic videos of up to 128 128 resolution for 32 frames. They had collected over 5000 timelapse videos from YouTube and short clips are created manually from it. After that, the short video clips are partitioned into frames and MSDGAN is used to generate clips. A short video clip can be generated from continuous 32 frames. Fig. 14.5 shows the frames generated by the MSDGAN and the red circle indicates motion between adjacent frames. Fig. 14.5 Frames generated by MSDGAN [30], given the start frame 1. Generative adversarial network for video analytics A robust one-stream video generation architecture which is an extension of Wasserstein GAN architecture known as improved video generative adversarial network (iVGAN) is another variation of GAN. This model generates the whole video clip without separating foreground from background. Similar to classical GAN, iVGAN [31] model has two networks called a generator and a critic/discriminator network. The aim of using a generator network is to create videos from a low-dimensional latent code. Critic network discriminates the real and fake data and updates in competence with the generator. This iVGAN architecture tackles the challenging issues in video analytics such as future frame prediction, video colonization, and in painting. The authors used different dataset such as stabilized videos collected from YouTube and Airplanes dataset. This model works by constantly filling the damaged holes to reconstruct the spatial and temporal information of videos. Fig. 14.6 depicts the example video frames generated by iVGAN. One of the useful efforts for generating the videos for the given description/caption has been taken by Pan et al. [32]. These kinds of video generation from the text description are attracted toward real-time applications. It is attained through the efficient extension of GAN architecture termed as temporal GAN (TGAN). TGAN consists of a generator and three discriminator networks. The input to the generator network is the combination of noise vector and the encoded sentences derived from LSTM network. The generator produces the frames of video sequences using 3D convolution operator. Three discriminators are utilized in TGANs for purposes such as video, frames, and motion discrimination. Among the discriminators, the function of two networks is to distinguish the real and fake videos or frames formed by the generator. In addition to this, these discriminator networks discriminate the semantically matched and mismatched video or frame text description pairs. The need for last discriminator network is to improve the temporal coherence between the real and generated frames. The whole TGAN architecture has undergone end to end learning. 
This GAN variant was evaluated on datasets such as SMBG, TBMG, and MSVD for generating videos from captions. The coherence metric reflects the readability and temporal coherence of the videos; a coherence value of 1.86 is reported for TGANs. Table 14.2 lists the datasets used for evaluating the GAN variants proposed for video generation.

Fig. 14.6 Video frames generated using iVGAN [31].

Table 14.2 Datasets available to validate video generation techniques.
S. no    Model                    Dataset
1        MoCoGAN                  MUG facial expression dataset, YouTube videos, Weizmann action dataset, and UCF101
2        TGANs                    SMBG, TBMG, MSVD
3        Dynamic transfer GAN     CASIA
4        FTGAN                    Penn Action, SURREAL
5        iVGAN                    Tiny videos, Airplane dataset
6        MSDGAN                   YouTube videos, Beach dataset, Golf dataset

14.3.2 GAN variations for video recognition

Video recognition is usually performed with a large number of labeled videos during training. For a new test task, however, many videos are unlabeled and must be annotated, which requires human effort for every video; annotating a large dataset is therefore a tedious process. To overcome this, Yu et al. proposed a novel approach called hierarchical generative adversarial networks (HiGAN) [3], in which fully labeled images are utilized to recognize unlabeled videos. The idea behind the HiGAN model is
In the previous researches the following approaches were implemented for video retargeting [38]: The first one is specifically performed domain wise which is not applicable for other domains. And the second one is implemented across the domain which needs manual supervision for labeling and alignment of information and the last approach is unsupervised and unpaired image translation where learning is mutually done in different domains which is also shown insufficient information for processing. Bansal et al. [38] propose a new unsupervised data-driven approach for effective video retargeting which incorporates spatiotemporal information with conditional generative adversarial networks (GANs). It combines both spatial and temporal information along with adversarial loses for translating content and also preserving style. The publicly available Viper dataset is used for experimentation for image-to-labels and labels-to-image to evaluate the results of video retargeting. The performance measures such as mean pixel accuracy (M), average class accuracy (AC), and intersection over union (IoU) provides comparatively better results for the combination of cycle GAN and recycle-GAN 339 340 Generative adversarial networks for image-to-Image translation Jang and Kim [39] developed appearance and motion conditions generative adversarial network (AMC-GAN) which consists of a generator, two discriminators, and perceptual ranking module. The two discriminators monitor the appearance and motion features. They used a new conditioning scheme that helps the training by varying appearance and motion conditions. The perceptual ranking module enables AMCGAN for understanding the events in the video. AMCGAN model is evaluated on MUG facial expressions and NATOPS human action dataset. The MUG dataset consists of 931 video clips which contain six basic emotions like anger, disgust, fear, happy, sad, and surprise. It is preprocessed to get 32 frames of resolution 64 64 pixels. The NATOPS human action dataset has 9600 videos containing 24 different actions. In unsupervised video representation future frame prediction is a challenging task. Existing methods operate directly on pixels which result blurry prediction of the future frame. Liang et al. [26] proposed a dual motion generative adversarial net (GAN) architecture to predict future frame in video sequence through dual learning mechanism. The future frame prediction and dual future flow prediction form a close loop. It achieves better video prediction by generating informative feedback signals to each other. This dual motion GAN has fully differentiable network architecture for video prediction. Extensive experiments on video frame prediction, flow prediction, and unsupervised video representation learning demonstrate the contributions of Dual Motion GAN to motion encoding and predictive learning. Caltech and YouTube Clips are taken for future frame analysis to show the performance of video recognition using dual motion GAN compared to other existing approaches in the KITTI dataset. The performance evaluation metrics such as mean square error (MSE), peak signal-to-noise ratio (PSNR), and structural similarity index metrics (SSIM) are used to evaluate the image quality of future frame prediction. Higher PSNR and SSIM are achieved via dual motion GAN. The implementations are based on the public Torch7 platform on a single NVIDIA GeForce GTX 1080. Dual motion GAN takes around 300 ms to predict one future frame. 
14.3.3 GAN variations for video summarization Due to the availability of the huge amount of multimedia data produced by the progressive growth of video capturing devices, video summarization [12,40–42] plays a crucial role in video analytics problem. Video summarization [43] extracts the representative and useful content from the video for data analysis and it is highly useful in large scale video analysis. One of the efficient approaches in video summarization is that deriving the suitable key frames from the entire video and those set of key frames are enough to portray the story of the video. To enhance the quality of summarization, there exists some challenges need to be tackled by the summarization techniques. The first challenge is to choose a fine key frame selection strategy which takes into account the temporal relation Generative adversarial network for video analytics of the frames within the video and the importance of the key frames. The next challenge is to devise a mechanism to assess the preciseness and completeness of the selected key frames. To address these issues, several models have been introduced so far like feature-based approaches [12], Long-short-term memory (LSTM)-based models [1,44] and determinantal point process (DPP)-based techniques [45]. Owing to the memory problems that arise in LSTM as well as DPP and redundant key frames issue in feature-based approaches, GAN has attracted the researchers in this community because of its regularization ability. One of the GAN variants proposed for video summarization namely dilated temporal relational generative adversarial network (DTRGAN) [46] is shown in Fig. 14.7. The generator contains two units namely dilated temporal relational (DTR) and bidirectional LSTM (Bi-LSTM). The generator gets the video and the real summary of the respective video as input. The DTR unit aims to tackle the first challenge. The inputs to the discriminator are real, generated, and random summary pairs and the purpose of the discriminator is to optimize the player losses at the time of training. A supervised generator loss term is introduced to attain the completeness and preciseness nature of the key frames. Fig. 14.7 Architecture of DTRGAN [46]. 341 342 Generative adversarial networks for image-to-Image translation 14.3.4 PoseGAN Walker et al. [25] have developed a video forecasting technique by generating future pose using generative adversarial network (GAN) and variational autoencoders (VAEs). In this approach, video forecasting is attained by generating videos directly in pixel space. This approach models the whole structure of the videos including the scene dynamics conjointly under unconstrained environment. The authors divided the video forecasting problem into two stages: The first stage handles the high-level features of video like human scenes and uses VAE to predict the future actions of a human. The authors used UCF101 dataset for evaluating the poseGAN architecture in predicting the future poses of human. 14.4 Discussion 14.4.1 Advantages of GAN One of the major advantages of GAN is that it does not require knowledge about the shape of the generator’s probability distribution model. Hence, GANs avoids the need for the determined density shapes for representing high complex and high-density data distribution. Reduced time complexity: The sampling of generated data can be parallelized in GANs and it makes them pretty faster than PixelRNN [47], Wavenet [48], and PixelCNN [49]. 
In the future frame prediction problem [39], the autoregressive models rely on the value of the previous frame’s pixel value for the prediction of the probability distribution of the future frame’s pixel. Hence, the generation of the future frame is too slow and the time consumption is even worse for high-dimensional data. But GANs uses a simple feed-forward neural network strategy for mapping in the generator. The generator creates all the future frame pixels at the same time itself rather than pixel by pixel approach followed by autoregressive models. This pace of GAN processing attracted many researchers in various fields. Accurate results: Form the study of different GAN variants it is evident that GAN can produce astounding results for video analytics problems. Also, the performance is far better than variational autoencoder (VAE), one of the generator models which assume the probability distribution of pixels as a normal distribution. As GAN can master in capturing the high-frequency parts of the data, the generator develops to guide the highfrequency parts to betray the discriminator. Lack of assumptions: Even though VAE attempts to maximize likelihood through variational lower bound, it needs assumptions on the prior and posterior probability distributions of data. On the other hand, GANs do not need any strong assumptions about the probability distribution. Generative adversarial network for video analytics 14.4.2 Disadvantages of GAN Trade-off between discriminator and generator: The imbalance occurs between generator and discriminator because of nonconvergence and mode collapse. Mode collapse is a commonly occurring and difficult issue in GAN models. It happens in the case when the generator is offered with images that look similar. Also, when the generator is trained extensively without updating any information to the discriminator, the mode collapses. Owing to this mode collapse, the generator will converge to an optimal data which fools discriminator the most and it is the best realistic image from the perspective of the discriminator. A partial mode collapse occurs in GANs frequently than a complete mode collapse. Thus, the training process involved in GAN is heuristic in nature. Hyperparameters and training: The need for suitable hyperparameters to attain the cost function is a major concern in GANs. The tuning of these parameters is also a time-consuming process. 14.5 Conclusion GAN is growing as an efficient generative model through the generation of real-like data using random latent spaces. The underlying fact in the GAN process is that it does not need the understanding of real data samples and high-level mathematical foundations. This merit allowed the GANs to be extensively used in various academic and engineering fields. In this chapter, we introduced the basics and working principle of GAN, several variations of GAN available for various applications like video generation, video prediction, action recognition, and video summarization in the area of video analytics. The enormous growth of GAN in the video analytics domain is not only due to its ability to learn the deep representation and nonlinear mapping but due to its potential to use the enormous amount of unlabeled video data. There are huge openings in the development of algorithms and architectures of GAN for using it in different application domains apart from video analytics, such as prediction, superresolution, generating new human poses, and face frontal view generation. 
The future scope in video recognition includes exploiting large-scale web images for video recognition which will further improve the recognition accuracy. Video retargeting can be accomplished more precisely using spatiotemporal generative models and further, it can be extended to multiple source domain adaptation. Also, the spatiotemporal neural network architecture can be applied for video retargeting in future. The realworld videos with complex motion interactions can be attempted for video recognition through the modeling of multiagent dependencies. Also, the alternative can be made for loss function, evaluation metrics, RNN, and synthetically generated videos to improve the performance of video recognition system. Generative adversarial neural networks can be the next step in deep learning evolution and while they provide better results across several application domains. 343 344 Generative adversarial networks for image-to-Image translation References [1] H. Chen, G. Ding, Z. Lin, S. Zhao, J. Han, Show, observe and tell: attribute-driven attention model for image captioning, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2018, pp. 606–612. [2] L.C. Chen, G. Papandreou, S.F. Adam, Rethinking Atrous Convolution for Semantic Image Segmentation, (2017). arXiv preprint arXiv:170605587. [3] F. Yu, X. Wu, et al., Exploiting Images for Video Recognition with Hierarchical Generative Adversarial Networks, (2018). arXiv:1805.04384v1 [cs.CV]. [4] S. Skansi, Autoencoders, in: Introduction to Deep Learning, Springer, Berlin, 2018, pp. 153–163. [5] D.H. Ackley, G.E. Hinton, T.J. Sejnowski, A learning algorithm for Boltzmann machines. Cogn. Sci. 9 (1) (1985) 147–169, https://doi.org/10.1207/s15516709cog0901_7. [6] Wahyono, A. Filonenko, K.-H. Jo, Designing interface and integration framework for multi-channel intelligent surveillance system, in: IEEE Conference on Human System Interactions, 2016. [7] J. Goodfellow, M. Pouget-Abadie, B.X. Mirza, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, 2014, pp. 2672–2680. [8] Y. Hong, U. Hwang, J. Yoo, S. Yoon, Show Generative Adversarial Networks and Their Variants Work: An Overview, (2017). arXiv preprint arXiv:1711.05914. [9] T. Che, Y. Li, A.P. Jacob, Y. Bengio, W. Li, Mode regularized generative adversarial networks, in: Proc. ICLR, 2017, 2017. [10] J.Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to image translation using cycle-consistent adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232. [11] D. Berthelot, T. Schumm, L. Metz, Began: Boundary Equilibrium Generative Adversarial Networks, https://arxiv.org/abs/1703.10717, 2017. [12] B. Zhao, E.P. Xing, Quasi real-time summarization for consumer videos, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2513–2520. [13] A. Odena, C. Olah, J. Shlens, Conditional image synthesis with auxiliary classifier GANs, in: Proceedings of the 34th International Conference on Machine Learning, vol. 70, 2017, pp. 2642–2651. [14] J.T. Spring Enberg, Unsupervised and Semi-Supervised Learning with Categorical Generative Adversarial Networks, (2015). arXiv preprint arXiv:1511.06390. [15] B. Fernando, H. Bilen, E. Gavves, S. Gould, Self-Supervised Video Representation Learning With Odd-One-out Networks, arXiv preprint arXiv:1611.06646(2016). [16] I. Misra, C.L. Zitnick, M. 
Hebert, Shuffle and learn: unsupervised learning using temporal order verification, in: European Conference on Computer Vision, 2016, pp. 527–544. [17] M.A.R. Ahad, J.K. Tan, H. Kim, S. Ishikawa, Motion history image: its variants and applications, Mach. Vis. Appl. 23 (2) (2012) 255–281. [18] A.F. Bobick, J.W. Davis, The recognition of human movement using temporal templates, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2001) 257–267. [19] S. Sadanand, J.J. Corso, Action bank: a high-level representation of activity in video. in: IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, 2012, pp. 1234–1241, https://doi.org/10.1109/CVPR.2012.6247806. [20] N. Li, X. Cheng, S. Zhang, et al., Realistic human action recognition by fast HOG3D and selforganization feature map. Mach. Vis. Appl. 25 (2014) 1793–1812, https://doi.org/10.1007/s00138014-0639-9. [21] A. Sasithradevi, S.M.M. Roomi, Video classification and retrieval through spatio-temporal radon features. Pattern Recogn. 99 (2020) 107099, https://doi.org/10.1016/j.patcog.2019.107099. [22] J. Pers, V. Suli´c, M. Kristan, M. Perˇse, K. Polanec, S. Kovaˇciˇc, Histograms of optical flow for efficient representation of body motion, Pattern Recogn. Lett. 31 (11) (2010) 1369–1376. [23] J. Zhao, M. Mathieu, Y. LeCun, Energy-Based Generative Adversarial Network, (2016). arXiv preprint arXiv:1609.03126. [24] Z. Huang, B. Kratzwald, et al., Face Translation Between Images and Videos Using Identity-Aware CycleGAN, (2017). arXiv:1712.00971v1 [cs.CV]. Generative adversarial network for video analytics [25] J. Walker, K. Marino, et al., The Pose Knows: Video Forecasting by Generating Pose Futures, (2017). (arXiv:1705.00053v1 [cs.CV). [26] X. Liang, Lisa, et al., Dual Motion GAN for Future-Flow Embedded Video Prediction, (2017) arXiv:1708.00284v2 [cs.CV]. [27] S. Tulyakov, M.-Y. Liu, X. Yang, J. Kautz, Mocogan, Decomposing Motion and Content for Video Generation, (2017). arXiv preprint arXiv:1707.04993. [28] W.J. Baddar, G. Gu, et al., Dynamics Transfer GAN: Generating Video by Transferring Arbitrary Temporal Dynamics from a Source Video to a Single Target Image, (2017). arXiv:1712.03534v1 [cs.CV]. [29] K. Ohnishi, S. Yamamoto, et al., Hierarchical Video Generation from Orthogonal Information: Optical Flow and Texture, (2017). arXiv:1711.09618v2 [cs.CV]. [30] W. Xiong, W. Luo, et al., Learning to Generate Time-Lapse Videos Using Multi-Stage Dynamic Generative Adversarial Networks, (2018)arXiv:1709.07592v3 [cs.CV]. [31] B. Kratzwald, Z. Huang, et al., Improving Video Generation for Multi-Functional Applications, arXiv:1711.11453v2 [cs.CV](2018). [32] Y. Pan, Z. Qiu, et al., To Create What you Tell: Generating Videos from Captions, (2018). arXiv:1804.08264v1 [cs.CV]. [33] K. Soomro, A.R. Zamir, M. Shah, UCF101: a dataset of 101 human actions classes from videos in the wild, in: Computer Vision and Pattern Recognition (cs.CV), 2012 arXiv:1212.0402. [34] H. Kuehne, H. Jhuang, E.´ı. Garrote, T. Poggio, T. Serre, Hmdb: a large video database for human motion recognition, in: International Conference on Computer Vision (ICCV), IEEE, 2011, pp. 2556–2563. [35] D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3D convolutional networks, in: ICCV, 2015, pp. 4489–4497. [36] C. Spampinato, S. Palazzo, et al., Adversarial Framework for Unsupervised Learning of Motion Dynamics in Videos, (2019)arXiv:1803.09092v2 [cs.CV]. [37] U. Ahsan, H. 
CHAPTER 15
Multimodal reconstruction of retinal images over unpaired datasets using cyclical generative adversarial networks
Álvaro S. Hervella, José Rouco, Jorge Novo, and Marcos Ortega
CITIC Research Center, University of A Coruña, A Coruña, Spain; VARPA Research Group, Biomedical Research Institute of A Coruña (INIBIC), University of A Coruña, A Coruña, Spain

15.1 Introduction
The recent rise of deep learning has revolutionized medical imaging, making a significant impact on modern medicine [1]. Nowadays, in clinical practice, medical imaging technologies are key tools for the prevention, diagnosis, and follow-up of numerous diseases [2]. There exists a large variety of imaging modalities that allow clinicians to visualize the different organs and tissues in the human body [3]. Thus, clinicians can select the most adequate imaging modality to study the different anatomical or pathological structures in detail. Nevertheless, the detailed analysis of the images can be a tedious and difficult task for a clinical specialist. For instance, many diseases in their early stages are only evidenced by very small lesions or subtle anomalies.
In these scenarios, factors such as the clinicians' expertise and workload can affect the reliability of the final analysis. Thus, the use of deep learning algorithms helps to accelerate the process and to produce a more reliable analysis of the images. Ultimately, this will result in a better diagnosis and treatment for the patients.

Deep neural networks (DNNs) have been demonstrated to provide a superior performance for numerous image analysis problems in comparison to more classical methods [4]. For instance, nowadays, deep learning represents the state-of-the-art approach for typical tasks such as image segmentation [5] or image classification [6]. Besides the remarkable improvements in these canonical image analysis problems, deep learning also makes possible the emergence of novel applications. For instance, these algorithms can be used for the transformation of images among different modalities [7], or for the training of future clinical professionals using realistic generated images [8]. These novel applications, among others, certainly benefit from the particular advantages of generative adversarial networks (GANs) [9]. This creative setting, consisting of different networks with opposite objectives, has been demonstrated to further exploit the capacity of DNNs.

Multimodal reconstruction is a novel application driven by DNNs that consists of the translation of medical images among complementary modalities [7]. Nowadays, complementary imaging modalities, representing the same organs or tissues, are commonly available in most medical specialties [3]. The differences among modalities can be due to the use of different capture devices, and also due to the use of contrasts that enhance certain tissues. The clinicians choose the most adequate imaging modality according to different factors, such as the target organs or tissues, the evidence of disease, or the risk factors of the patient. In this sense, it is particularly important to consider the properties of the different anatomical and pathological structures, given that some structures can be enhanced in one modality and be completely missing in another. This significant change in appearance, dependent on the properties of the tissues and organs, can make the translation among modalities very challenging. However, the challenge that complicates the training of the multimodal reconstruction is beneficial if we are interested in using the task for representation learning purposes. This is due to the fact that a harder task will enforce the network to learn more complex representations during the training. In this regard, the multimodal reconstruction has already demonstrated a successful performance as a pretraining task for transfer learning in medical imaging [10].

In this chapter, we study the use of GANs for the multimodal reconstruction between complementary imaging modalities. In particular, the multimodal reconstruction is addressed by using a cyclical GAN methodology, which allows training the adversarial setting with independent sets of two different image modalities [11]. Nowadays, GANs represent the quintessential approach for image-to-image translation tasks [12].
However, these kinds of applications are typically focused on producing realistic and aesthetically pleasing images. In contrast, in the multimodal reconstruction of medical images, the realism and aesthetics of the generated images are not as important as producing medically accurate reconstructions. In particular, this means that the generated color patterns and textures must be coherent with the expected visualization of the real organs or tissues in the target modality. Additionally, this may involve the omission of certain structures, or even the enhancement of those that are only vaguely appreciated in the original modality. We evaluate all these aspects in order to assess the validity of the studied cyclical GAN method for the multimodal reconstruction. The study presented in this chapter is focused on ophthalmic imaging. In particular, we use the retinography and the fluorescein angiography as the original and target imaging modalities in the multimodal reconstruction. These imaging modalities, which represent the eye fundus, are useful for the study of important ocular and systemic diseases, such as glaucoma or diabetes [2]. A representative example of retinography and fluorescein angiography for the same eye is depicted in Fig. 15.1. The main difference between them is that the fluorescein angiography uses a contrast dye, which is injected to the patient, to produce the fluorescence of the blood. Thus, the fluorescein angiography depicts an enhanced representation of the retinal vasculature and related lesions. Multimodal reconstruction of retinal images Fig. 15.1 Example of retinography and fluorescein angiography for the same eye: (A) retinography and (B) angiography. In this context, the successful training of a deep neural network in the multimodal reconstruction of the angiography from the retinography will provide a model able to produce a contrast-free estimation of the enhanced retinal vasculature. Additionally, due to the challenges of the transformation, which is mainly mediated by the presence of blood flow in the different tissues, the neural networks will need to learn rich high level representations of the data. This represents a remarkable potential for transfer learning purposes [13, 14]. The presented study includes an extensive evaluation of the cyclical GAN methodology for the multimodal reconstruction between complementary imaging modalities. For this purpose, two different multimodal datasets containing both retinography and fluorescein angiography images are used. Additionally, in order to further analyze the advantages and limitations of the methodology, we present an extensive comparison with a state-of-the-art approach for the multimodal reconstruction of these ophthalmic images [15]. In contrast with the cyclical GAN methodology, this other approach requires the use of multimodal paired data for training, i.e., retinography and angiography of the same eye. Therefore, the cyclical GAN presents an important advantage, avoiding not only the necessity of paired data but also the unnecessary preprocessing for the alignment of the different image pairs. 15.2 Related research Generative adversarial networks (GANs) represent a relatively new deep learning framework for the estimation of generative models [16]. The original GAN setting consists of two different networks with opposite objectives. 
In particular, a discriminator that learns to distinguish between real and fake samples and a generator that learns to produce fake samples that the discriminator misclassifies as real. Based on this original idea, several variations were developed in posterior works, aiming at applying the novel paradigm in different scenarios [17]. 349 350 Generative adversarial networks for image-to-Image translation In recent years, GANs have been extensively used for addressing different vision problems and graphics tasks. The use of GANs has been especially groundbreaking for computer graphics applications due to the visually appealing results that are obtained. Similarly, a kind of vision problem that has been revolutionized by the use of GANs is image-to-image translation, which consists of performing a mapping between different image domains or imaging modalities [12]. An early work addressing this problem with GANs, known as Pix2Pix [18], relied on the availability of paired data for learning the generative model. In particular, Isola et al. [18] show that their best results are achieved by combining a traditional pixel-wise loss and a conditional GAN framework. Given the difficulty of gathering the paired data in many application domains, posterior works have proposed alternatives to learn the task by using unpaired training data. Among the different proposals, the work of Zhu et al. [19], known as CycleGAN, has been especially influential. CycleGAN compensates for the lack of paired data by learning not only the desired mapping function but also the inverse mapping. This allows introducing a cycle-consistency loss whereby the subsequent application of both mapping functions must return the original input image. Concurrently, this same idea with different naming was also proposed in DualGAN [20] and DiscoGAN [21]. Additionally, besides the cycle-consistency alternative, other different proposals have been presented in different works [12] although the use of these other alternatives is not as extended in posterior applications. In medical imaging, GANs have also been used for different applications, including the mapping between complementary imaging modalities. In particular, GANs have been successfully applied in tasks such as image denoising [22], multimodal reconstruction [11], segmentation [23], image synthesis [24], or anomaly detection [25]. Among these different tasks, several of them can be directly addressed as an image-to-image translation [8]. In these cases, the adaption of those state-of-the-art approaches that already demonstrated a good performance in natural images has been common. In particular, numerous works in medical imaging are based on the use of Pix2Pix or CycleGAN methodologies [8]. Similarly to other application domains, the choice between one or other approach is conditioned by the availability of paired data for training. However, in medical imaging, the paired data is typically easy to obtain, which is evidenced by the prevalence of paired approaches in the literature [8]. With regard to the multimodal reconstruction, the difficulty in these cases is to perform an accurate registration of the available image pairs. An important concern regarding the use of GANs in medical imaging is the hallucination of nonexistent structures by the networks [8]. This is a concomitant risk with the use of GANs due to the high capacity of these frameworks to model the given training data. Cohen et al. 
[26] demonstrated that this risk is especially elevated when the training data is heavily unbalanced. For instance, a GAN framework that is trained for multimodal reconstruction with a large majority of pathological images will tend to hallucinate pathological structures when processing healthy images. This behavior can be in part mitigated by the addition of pixel-wise losses if paired data is available. Nevertheless, regarding the multimodal reconstruction, even when paired data is available, most of the works still use the GAN framework together with the pixel-wise loss [8]. In this regard, the work of Hervella et al. [15] is an example of multimodal reconstruction without GANs, using instead the Structural Similarity (SSIM) as the loss function. The motivation for this is that, for many applications in medical imaging, it is not necessary to generate realistic or aesthetically pleasing images. In this context, the results obtained in Ref. [15] show that, without the use of GANs, the generated images lack realism and can be easily identified as synthetic samples.

15.3 Multimodal reconstruction of retinal images
Multimodal reconstruction is an image translation task between complementary medical imaging modalities [7]. The objective of this task is, given a certain medical image, to reconstruct the underlying tissues and organs according to the characteristics of a different complementary imaging modality. Particularly, this chapter is focused on the multimodal reconstruction of the fluorescein angiography from the retinography. These two complementary retinal imaging modalities represent the eye fundus, including the main anatomical structures and possible lesions in the eye. The main difference between retinography and angiography is that the latter requires the injection of a contrast dye before capturing the images. The injection of this contrast dye results in an enhancement of the retinal vasculature as well as those pathological structures with blood flow. Simultaneously, those other retinal structures and tissues where there is a lack of blood flow may be attenuated in the resulting images. Thus, there is an intricate relation between retinography and angiography, given that the visual transformation between the modalities depends on physical properties such as the presence of blood flow in the different tissues. As a reference, the transformation between retinography and angiography for the main anatomical and pathological structures in the retina can be visualized in Fig. 15.2.

Fig. 15.2 Example of retinography and fluorescein angiography for the same eye. The included images depict the main anatomical structures as well as the two main types of lesions in the retina.

Recently, the difficulty of performing the multimodal reconstruction between retinography and angiography has been overcome by using DNNs [7]. In this regard, the required multimodal transformation can be modeled as a mapping function GR2A: R → A that, given a certain retinography r ∈ R, returns the corresponding angiography a = GR2A(r) ∈ A for the same eye. In this scenario, the mapping function GR2A can be parameterized by a DNN. Thus, the function parameters can be learned by applying an adequate training strategy. In this regard, we present two different deep learning-based approaches for learning the mapping function GR2A: the cyclical GAN methodology [11] and the paired SSIM methodology [15].
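To make this formulation concrete, the following minimal sketch (assuming a PyTorch-style implementation; the class, layer, and tensor sizes are illustrative placeholders, not the architecture used in the chapter, which is described in Section 15.3.3) shows how GR2A can be represented as a trainable module that maps a retinography tensor to an angiography tensor of the same spatial size:

```python
import torch
import torch.nn as nn

class GeneratorR2A(nn.Module):
    """Illustrative placeholder for the mapping GR2A: R -> A.

    Any image-to-image network that preserves the spatial dimensions
    could be plugged in here in place of the single convolution.
    """
    def __init__(self, in_channels=3, out_channels=1):
        super().__init__()
        # Single convolution as a stand-in for the full generator.
        self.net = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, r):
        # r: batch of retinographies (B, 3, H, W) -> angiographies (B, 1, H, W)
        return self.net(r)

G_R2A = GeneratorR2A()
r = torch.randn(1, 3, 576, 720)  # a retinography-sized input (H=576, W=720)
a = G_R2A(r)                     # generated angiography, shape (1, 1, 576, 720)
```

The parameters of this module are what the two training strategies described next aim to learn.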
15.3.1 Cyclical GAN methodology
The cyclical GAN methodology is based on the use of generative adversarial networks (GANs) for learning the mapping function from retinography to angiography [11]. In this regard, GANs have been demonstrated to be useful tools for learning the data distribution of a certain training set, allowing the generation of new images that resemble those contained in the training data [16]. This means that, by using GANs and a sufficiently large training set of unlabeled angiographies, it is possible to generate new fake angiographies that are theoretically indistinguishable from the real ones. However, in the presented multimodal reconstruction, the generated images not only need to resemble real angiographies but also need to represent the physical attributes given by a particular retinography. Thus, in contrast with the original GAN approach [16], the presented methodology does not generate new images from a random noise vector, but rather from another image with the same spatial dimensions as the one that is being generated. In practice, this image-to-image transformation is achieved by using an encoder-decoder network as the generator, whereas the discriminator is still a classification network, as in the original GAN approach. Applying this setting, the multimodal reconstruction could theoretically be trained by using two independent unlabeled sets of images, one of retinographies and the other of angiographies.

An inherent difficulty of training an image-to-image GAN is that, typically, the generator network has enough capacity to generate a variety of plausible images while ignoring the characteristics of the network input. In the case of the multimodal reconstruction, this would mean that the physical attributes of the retinographies are not successfully transferred to the generated angiographies. In this regard, early image-to-image GAN approaches addressed the issue by explicitly conditioning the generated images on the network input [18]. In particular, this is achieved by using a paired dataset instead of two independent datasets for training. For instance, the use of retinography-angiography pairs, instead of independent retinography and angiography samples, allows training a discriminator to distinguish between fake and real angiographies conditioned on a given real retinography. The use of such a discriminator forces the generator to analyze and take into account the attributes of the input retinography. Additionally, in Ref. [18], the use of paired datasets is even further exploited by complementing the adversarial feedback to the generator with a pixel-wise similarity metric between the generator output and the available ground truth. However, in this case, it is not only necessary to have paired data, but the available image pairs must also be aligned.

In contrast with previous alternatives, the presented cyclical GAN methodology addresses the issue of the generator potentially ignoring the characteristics of its input in a different manner that does not require the use of paired datasets. In particular, the cyclical GAN solution is based on the use of a double transformation [19]. The idea is to simultaneously learn GR2A and its inverse mapping function GA2R: A → R that, given a certain angiography a ∈ A, produces a retinography r = GA2R(a) ∈ R of the same eye. Then, the subsequent application of both transformations should be equivalent to the identity function.
For instance, if a retinography is transformed into an angiography and then transformed back into a retinography, the resulting image should be identical to the original retinography that was used as input. However, if either of the two transformations ignores the characteristics of its input, the resulting retinography will differ from the original. Therefore, it is possible to ensure that the input image characteristics are not being ignored by enforcing the identity between the original retinography and the one that is transformed back from the angiography. This is referred to as cycle-consistency, and it can be applied by using any similarity metric between the original and the reconstructed input image. An important advantage of this solution is that it does not require the use of paired datasets; only two independent sets of unlabeled retinographies and angiographies are necessary.

In order to obtain the best performance for the multimodal reconstruction, the presented cyclical GAN methodology involves the use of two complementary training cycles: (1) from retinography to angiography to retinography (R2A2R) and (2) from angiography to retinography to angiography (A2R2A). A flowchart showing the complete training procedure is depicted in Fig. 15.3. It is observed that two different generators, GR2A and GA2R, and two different discriminators, DA and DR, are used during the training. The discriminators DA and DR are trained to distinguish between generated and real images. Simultaneously, the generators GR2A and GA2R are trained to generate images that the discriminators misclassify as real. This adversarial training is performed using a least-squares loss, which has been demonstrated to produce a more stable learning process in comparison to the original loss in regular GANs [27]. Regarding the discriminator training, the target values are 1 for the real images and 0 for the generated images. Thus, the adversarial training losses for the discriminators are defined as

\mathcal{L}_{adv}^{D_A} = \mathbb{E}_{r \sim R}\left[D_A(G_{R2A}(r))^2\right] + \mathbb{E}_{a \sim A}\left[(D_A(a) - 1)^2\right]    (15.1)

\mathcal{L}_{adv}^{D_R} = \mathbb{E}_{a \sim A}\left[D_R(G_{A2R}(a))^2\right] + \mathbb{E}_{r \sim R}\left[(D_R(r) - 1)^2\right]    (15.2)

Fig. 15.3 Flowchart for the complete training procedure in the cyclical GAN methodology. This approach involves the use of two complementary training cycles that only differ in which imaging modality is being used as input and which one is the target. For each training cycle, the appearance of the target modality in the generated images is enforced by the feedback of the discriminator. Simultaneously, the cycle-consistency is used to ensure that the input image characteristics, such as the anatomical and pathological structures, are not being ignored by the networks.

In the case of the generator training, the objective is that the discriminator assigns a value of 1 to the generated images. Thus, the adversarial training losses for the generators are defined as

\mathcal{L}_{adv}^{G_{R2A}} = \mathbb{E}_{r \sim R}\left[(D_A(G_{R2A}(r)) - 1)^2\right]    (15.3)

\mathcal{L}_{adv}^{G_{A2R}} = \mathbb{E}_{a \sim A}\left[(D_R(G_{A2R}(a)) - 1)^2\right]    (15.4)

Regarding the cycle consistency in the presented approach, the L1-norm between the original image and its reconstructed version is used as a loss function. In particular, the complete cycle-consistency loss, including both training cycles, is defined as

\mathcal{L}_{cyc} = \mathbb{E}_{r \sim R}\left[\lVert G_{A2R}(G_{R2A}(r)) - r \rVert_1\right] + \mathbb{E}_{a \sim A}\left[\lVert G_{R2A}(G_{A2R}(a)) - a \rVert_1\right]    (15.5)

As can be observed in the previous equations as well as in Fig. 15.3, there is a strong parallelism between both training cycles, R2A2R and A2R2A. In particular, the only difference is the imaging modality that each training cycle starts with, which determines which imaging modality is being used as input and which one is the target.
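As an illustration of Eqs. (15.1)-(15.5), the following sketch (assuming PyTorch; the function names are illustrative and not taken from the original implementation) shows how the least-squares adversarial losses and the L1 cycle-consistency loss can be computed for a batch of images:

```python
import torch
import torch.nn.functional as F

def lsgan_discriminator_loss(D, real, fake):
    # Least-squares discriminator loss (Eqs. 15.1 and 15.2): generated
    # samples should be scored 0 and real samples should be scored 1.
    loss_fake = D(fake.detach()).pow(2).mean()
    loss_real = (D(real) - 1).pow(2).mean()
    return loss_fake + loss_real

def lsgan_generator_loss(D, fake):
    # Least-squares generator loss (Eqs. 15.3 and 15.4): the generator
    # tries to make the discriminator output 1 for its samples.
    return (D(fake) - 1).pow(2).mean()

def cycle_consistency_loss(G_R2A, G_A2R, r, a):
    # L1 cycle-consistency loss (Eq. 15.5): applying both mappings in
    # sequence should return the original image in each training cycle.
    r_rec = G_A2R(G_R2A(r))
    a_rec = G_R2A(G_A2R(a))
    return F.l1_loss(r_rec, r) + F.l1_loss(a_rec, a)
```

Here the `detach()` call prevents the discriminator update from backpropagating into the generator, which is a common implementation choice rather than something prescribed by the chapter.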
Finally, the complete loss function that is used for simultaneously training all the networks is defined as

\mathcal{L} = \mathcal{L}_{adv}^{G_{R2A}} + \mathcal{L}_{adv}^{D_A} + \mathcal{L}_{adv}^{G_{A2R}} + \mathcal{L}_{adv}^{D_R} + \lambda \mathcal{L}_{cyc}    (15.6)

where λ is a parameter that controls the relative importance of the cycle-consistency loss with respect to the adversarial losses. For the experiments presented in this chapter, this parameter is set to a value of λ = 10, which was also previously adopted in Ref. [19].

The optimization of the loss function during the training is performed with the Adam algorithm [28]. Regarding the hyperparameters of Adam, the exponential decay rates for the moment estimates are β1 = 0.5 and β2 = 0.999. In comparison to the original values recommended by Kingma et al. [28], this set of values has been demonstrated to provide a more stable learning process when training GANs [29]. The optimization is performed with a batch size of 1 image. The learning rate is set to an initial value of α = 2e-4 and is kept constant for 200,000 iterations. Then, following the approach previously adopted in Ref. [19], the learning rate is linearly reduced to zero over the same number of iterations. The number of iterations before starting to reduce the learning rate is established empirically through the analysis of both the learning curves and the generated images in a training subset that is reserved for validation.

Finally, a data augmentation strategy is applied to avoid possible overfitting to the training set. In particular, random spatial and color augmentations are applied to the images. The spatial augmentations consist of affine transformations, and the color augmentations are linear transformations of the image channels in HSV (Hue-Saturation-Value) color space. In the case of the angiographies, which have a single channel, a linear transformation is directly applied over the raw intensity values. This augmentation strategy has been previously applied for the analysis of retinal images, demonstrating a good performance in avoiding overfitting with limited training data [10, 30]. The particular range for the transformations was validated before training in order to ensure that the augmented images still resemble valid retinas.
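The optimization setup described above can be sketched as follows (a PyTorch-based sketch under the stated hyperparameters; the grouping of the networks into two optimizers and the helper names are assumptions, not taken from the original code):

```python
import itertools
import torch

LAMBDA_CYC = 10.0         # weight of the cycle-consistency loss in Eq. (15.6)
CONSTANT_ITERS = 200_000  # iterations at the initial learning rate
DECAY_ITERS = 200_000     # iterations over which the rate decays linearly to zero

def linear_decay(iteration):
    # Multiplicative learning-rate factor: 1.0 during the first 200,000
    # iterations, then a linear ramp down to 0 over the next 200,000.
    if iteration < CONSTANT_ITERS:
        return 1.0
    return max(0.0, 1.0 - (iteration - CONSTANT_ITERS) / DECAY_ITERS)

def build_optimizers(G_R2A, G_A2R, D_A, D_R, lr=2e-4, betas=(0.5, 0.999)):
    # Adam with beta1 = 0.5 and beta2 = 0.999, as indicated in the text.
    opt_G = torch.optim.Adam(
        itertools.chain(G_R2A.parameters(), G_A2R.parameters()), lr=lr, betas=betas)
    opt_D = torch.optim.Adam(
        itertools.chain(D_A.parameters(), D_R.parameters()), lr=lr, betas=betas)
    sched_G = torch.optim.lr_scheduler.LambdaLR(opt_G, lr_lambda=linear_decay)
    sched_D = torch.optim.lr_scheduler.LambdaLR(opt_D, lr_lambda=linear_decay)
    return opt_G, opt_D, sched_G, sched_D
```

At each training iteration, the generator update would then minimize the two generator adversarial terms plus LAMBDA_CYC times the cycle-consistency loss, the discriminator update would minimize the two discriminator terms, and both schedulers would be stepped once.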
15.3.2 Paired SSIM methodology
An alternative methodology for the multimodal reconstruction between retinography and angiography was proposed in Ref. [7]. In this case, the authors avoid the use of GANs by taking advantage of existing multimodal paired data, in particular, a set of retinography-angiography pairs in which both images correspond to the same eye. The motivation for this lies in the fact that, in contrast to other application domains, in medical imaging the paired data is easy to obtain. Nowadays, in modern clinical practice, the use of different imaging modalities is broadly extended across most medical services. In this sense, although for many patients the use of a single imaging modality can be enough for diagnostic purposes, there is still a large number of cases where the use of several imaging modalities is required. In this latter scenario, it is also common to use more complex or invasive techniques, such as those requiring the injection of contrasts. This is the case of retinography and angiography in retinal imaging. While retinography is a broadly extended modality, typically used in screening programs, angiography is only used when it is clearly required. However, each time an angiography is taken for a patient, a retinography is typically also available. This facilitates the gathering of these paired multimodal datasets.

Technically, the advantage of using paired training data is that it allows directly comparing the network output with a ground truth image. In particular, during the training, for each retinography that is fed to the network, an angiography of the same eye is also available. Thus, the training feedback can be obtained by computing any similarity metric between the generated and the real angiography. In order to facilitate this measurement of similarity, the retinography and angiography within each multimodal pair are registered. The registration produces an alignment of the different retinal structures between the retinography and the angiography. Consequently, there will also be an alignment between the network output and the real angiography that is used as ground truth. This allows the use of common pixel-wise metrics for the measurement of the similarity between the network output and the target image.

In the presented methodology [15], the registration is performed following a domain-specific method that relies on the vascular structures of the retina [31]. In particular, this registration method presents two different steps. The first step is a landmark-based registration where the landmarks are the crossings and the bifurcations of the retinal vasculature. This first registration produces a coarse alignment of the images that is later refined by performing a subsequent intensity-based registration. This second registration is based on the optimization of a similarity metric of the vessels between both images. The complete registration procedure allows generating a paired and registered multimodal dataset, which is used for directly training the generator network GR2A. The complete methodology for training the multimodal reconstruction is depicted in Fig. 15.4. As it is observed, an advantage of this methodology is that only a single neural network is required.

Fig. 15.4 Flowchart for the complete training procedure of the paired SSIM methodology. The first step is the multimodal registration of the paired retinal images, which can be performed off-line before the actual network training. Then, the training feedback is provided by the structural similarity (SSIM), which is a pixel-wise similarity metric.
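The domain-specific registration method of Ref. [31] is not reproduced here, but the general two-stage scheme (coarse landmark-based alignment followed by an intensity-based refinement) can be sketched with generic tools. The following OpenCV-based sketch assumes that matched vessel crossings and bifurcations are already available from a separate detector, and it replaces the vessel-similarity refinement of the original method with the generic ECC criterion, so it is only an analogue of the procedure described above:

```python
import cv2
import numpy as np

def register_angio_to_retino(angio, retino_gray, pts_angio, pts_retino):
    """Two-stage registration sketch (not the method of Ref. [31]).

    pts_angio / pts_retino: matched (N, 2) float32 arrays of vascular
    landmarks (crossings and bifurcations), given in angiography and
    retinography coordinates. Both images are single-channel arrays.
    """
    # Stage 1: coarse landmark-based affine alignment (angio -> retino).
    coarse, _ = cv2.estimateAffinePartial2D(pts_angio, pts_retino)

    # Stage 2: intensity-based refinement with the ECC criterion.
    # ECC expects a warp mapping template (retino) coordinates to input
    # (angio) coordinates, so the coarse transform is inverted first.
    warp = cv2.invertAffineTransform(coarse).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 200, 1e-6)
    _, warp = cv2.findTransformECC(retino_gray.astype(np.float32),
                                   angio.astype(np.float32),
                                   warp, cv2.MOTION_AFFINE, criteria)

    # Resample the angiography into the retinography frame.
    h, w = retino_gray.shape
    registered = cv2.warpAffine(angio, warp, (w, h),
                                flags=cv2.INTER_LINEAR + cv2.WARP_INVERSE_MAP)
    return registered, warp
```

The output of such a step is a registered retinography-angiography pair, which is what the SSIM-based training described next requires.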
Regarding the training of the generator, the similarity between the network output and the target angiography is evaluated by using the structural similarity (SSIM) [32]. This metric, which was initially proposed for image quality assessment, measures the similarity between images by independently considering the intensity, contrast, and structural information. The measurement is performed at a local level, considering a small neighborhood for each pixel. In particular, an SSIM map between two images (x, y) is computed with a set of local statistics as

SSIM(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}    (15.7)

where μx and μy are the local averages for x and y, respectively, σx and σy are the local standard deviations for x and y, respectively, and σxy is the local covariance between x and y. These local statistics are computed for each pixel by weighting its neighborhood with an isotropic two-dimensional Gaussian with σ = 1.5 pixels [32]. Then, given that SSIM is a similarity metric, the loss function for training GR2A is defined by using the negative SSIM:

\mathcal{L}_{SSIM} = -\mathbb{E}_{(r,a) \sim (R,A)}\left[\mathrm{SSIM}(G_{R2A}(r), a)\right]    (15.8)

The optimization of the loss function during the training is performed with the Adam algorithm [28]. Regarding the hyperparameters of Adam, the exponential decay rates for the moment estimates are set as β1 = 0.9 and β2 = 0.999, which are the default values recommended by Kingma et al. [28]. The optimization is performed with a batch size of 1 image. The learning rate is set to an initial value of α = 2e-4 and is then reduced by a factor of 10 when the validation loss ceases to improve for 1250 iterations. Finally, the training is early stopped after 5000 iterations without improvement in the validation loss. These hyperparameters are established empirically according to the evolution of the learning curves during the training.

Finally, a data augmentation strategy is also applied to avoid possible overfitting to the training set. In particular, random spatial and color augmentations are applied to the images. The spatial augmentations consist of affine transformations and the color augmentations are linear transformations of the image channels in HSV (Hue-Saturation-Value) color space. In this case, the color augmentations are only applied to the retinography, which is the only imaging modality being used as input to a neural network. In contrast, the same affine transformation is applied to the retinography and the angiography in each multimodal image pair. This is necessary to keep the alignment between the images and to make possible the measurement of the pixel-wise similarity, namely SSIM, between the network output and the target angiography. As in the cyclical GAN methodology, the particular range for the transformations is validated before training in order to ensure that the augmented images still resemble valid retinas.
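A self-contained sketch of this loss is given below (PyTorch-based, assuming single-channel images with intensities scaled to [0, 1]; the constants C1 and C2 follow the usual defaults of Ref. [32], and the window size of 11 pixels is an assumption, since only the Gaussian σ of 1.5 pixels is stated in the text):

```python
import torch
import torch.nn.functional as F

def gaussian_window(size=11, sigma=1.5):
    # Isotropic 2D Gaussian used to weight each pixel's neighborhood.
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    g = g / g.sum()
    return torch.outer(g, g).view(1, 1, size, size)

def ssim_map(x, y, window, c1=0.01 ** 2, c2=0.03 ** 2):
    # Local means, variances, and covariance from Gaussian-weighted
    # statistics, combined as in Eq. (15.7).
    pad = window.shape[-1] // 2
    mu_x = F.conv2d(x, window, padding=pad)
    mu_y = F.conv2d(y, window, padding=pad)
    var_x = F.conv2d(x * x, window, padding=pad) - mu_x ** 2
    var_y = F.conv2d(y * y, window, padding=pad) - mu_y ** 2
    cov_xy = F.conv2d(x * y, window, padding=pad) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den

def negative_ssim_loss(generated, target, window):
    # Training loss of Eq. (15.8): the negative mean SSIM between the
    # generated angiography and the registered ground-truth angiography.
    return -ssim_map(generated, target, window).mean()

# Example usage with (1, 1, H, W) tensors:
# window = gaussian_window()
# loss = negative_ssim_loss(prediction, ground_truth, window)
```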
15.3.3 Network architectures
Regarding the neural networks, the same network architectures are used for the two presented methodologies, cyclical GAN and paired SSIM. This eases the comparison between the methodologies, excluding the network architecture as a factor in the possible performance differences. In particular, the experiments that are presented in this chapter are performed with the same network architectures that were previously used in Ref. [19].

The generator, which is used in both cyclical GAN and paired SSIM, is a fully convolutional neural network consisting of an encoder, a decoder, and several residual blocks between them. A diagram of the network and the details of the different blocks are depicted in Fig. 15.5 and Table 15.1, respectively. In contrast with other common encoder-decoder architectures, this network presents a small encoder and decoder, which is compensated by the large number of layers that are present in the middle residual blocks. As a consequence, there is also a small spatial reduction of the input data through the network. In particular, the height and width of the internal representations within the network are reduced up to a factor of 4. This relatively low spatial reduction allows keeping an adequate level of spatial accuracy without the necessity of additional features such as skip connections [34].

Fig. 15.5 Diagram of the network architecture for the generator. Each colored block represents the output of a layer in the neural network. The width of the blocks represents the number of channels whereas the height represents the spatial dimensions. The details of the different layers are in Table 15.1.

Table 15.1 Building blocks of the generator architecture.
Block     | Layers            | Kernel | Stride | Out features
Encoder   | Conv/IN/ReLU      | 7 × 7  | 1      | 64
          | Conv/IN/ReLU      | 3 × 3  | 2      | 128
          | Conv/IN/ReLU      | 3 × 3  | 2      | 256
Residual  | Conv/IN/ReLU      | 3 × 3  | 1      | 256
          | Conv/IN           | 3 × 3  | 1      | 256
          | Residual addition | –      | –      | 256
Decoder   | ConvT/IN/ReLU     | 3 × 3  | 2      | 128
          | ConvT/IN/ReLU     | 3 × 3  | 2      | 64
          | Conv/IN/ReLU      | 7 × 7  | 1      | Image channels
Conv, convolution; IN, instance normalization [33]; ConvT, convolution transpose.

Another particularity of the network is the use of instance normalization [33] layers after each convolution, in contrast to the more extended use of batch normalization. In this regard, instance normalization was initially proposed for improving the performance of style-transfer applications and has been demonstrated to be also effective for cyclical GANs. Additionally, these normalization layers can be seen as an effective way of dealing with the problems of using batch normalization with small batch sizes. In this sense, it should be noticed that both the experiments presented in this chapter and the experiments in Ref. [19] are performed with a batch size of 1 image.

In contrast with the generator, the discriminator network is only used in the cyclical GAN methodology. The selected architecture is the one that was also used in Ref. [19]. In particular, the discriminator is a fully convolutional neural network, which allows working on arbitrarily sized images. This kind of discriminator architecture is typically known as PatchGAN [18], given that the decision of the discriminator is produced at the level of overlapping image patches. A diagram of the network and the details of the different layers are depicted in Fig. 15.6 and Table 15.2, respectively. The characteristics of the different layers are similar to those in the generator network. The main difference is the use of Leaky ReLU instead of ReLU as the activation function, which has been demonstrated to be a useful modification for the adequate training of GANs [29]. With regard to the discriminator output, this architecture provides a decision for overlapping image patches of size 70 × 70.

Fig. 15.6 Diagram of the network architecture for the discriminator. Each colored block represents the output of a layer in the network. The width of the blocks represents the number of channels whereas the height represents the spatial dimensions. The details of the different layers are in Table 15.2.

Table 15.2 Layers of the discriminator architecture.
Layers             | Kernel | Stride | Out features
Conv/Leaky ReLU    | 4 × 4  | 2      | 64
Conv/IN/Leaky ReLU | 4 × 4  | 2      | 128
Conv/IN/Leaky ReLU | 4 × 4  | 2      | 256
Conv/IN/Leaky ReLU | 4 × 4  | 1      | 512
Conv               | 4 × 4  | 1      | 1
Conv, convolution; IN, instance normalization.
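Following Tables 15.1 and 15.2, the generator and the PatchGAN discriminator can be sketched as follows (a PyTorch sketch; the number of residual blocks is not stated in the tables, so the value of nine used in Ref. [19] for images of this size is assumed here, and zero padding is used where the padding scheme is not specified):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Residual block of Table 15.1: Conv/IN/ReLU, Conv/IN, then the addition.
    def __init__(self, ch=256):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.InstanceNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.block(x)

def make_generator(in_ch=3, out_ch=1, n_res=9):
    # Encoder (spatial reduction by a factor of 4), residual blocks, decoder.
    layers = [
        nn.Conv2d(in_ch, 64, 7, stride=1, padding=3), nn.InstanceNorm2d(64), nn.ReLU(inplace=True),
        nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.InstanceNorm2d(128), nn.ReLU(inplace=True),
        nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.InstanceNorm2d(256), nn.ReLU(inplace=True),
    ]
    layers += [ResidualBlock(256) for _ in range(n_res)]
    layers += [
        nn.ConvTranspose2d(256, 128, 3, stride=2, padding=1, output_padding=1),
        nn.InstanceNorm2d(128), nn.ReLU(inplace=True),
        nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1),
        nn.InstanceNorm2d(64), nn.ReLU(inplace=True),
        # Final block as listed in Table 15.1 (Conv/IN/ReLU, 7x7, image channels).
        nn.Conv2d(64, out_ch, 7, stride=1, padding=3), nn.InstanceNorm2d(out_ch), nn.ReLU(inplace=True),
    ]
    return nn.Sequential(*layers)

def make_discriminator(in_ch=1):
    # 70x70 PatchGAN of Table 15.2: the output is a grid of patch-level scores.
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.InstanceNorm2d(128), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.InstanceNorm2d(256), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(256, 512, 4, stride=1, padding=1), nn.InstanceNorm2d(512), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(512, 1, 4, stride=1, padding=1))
```

Note that the final Conv/IN/ReLU block follows Table 15.1 literally; other public implementations of this generator end with a tanh activation instead, so the output activation should be treated as an implementation choice.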
15.4 Experiments and results
15.4.1 Datasets
The experiments presented in this chapter are performed on a multimodal dataset consisting of 118 retinography-angiography pairs. This multimodal dataset is created from two different collections of images. In particular, half of the images are taken from a public multimodal dataset provided by Isfahan MISP [35] whereas the other half have been gathered from a local hospital [15]. The Isfahan MISP collection consists of 59 retinography-angiography pairs including both pathological and healthy cases. In particular, 30 image pairs correspond to patients that were diagnosed with diabetic retinopathy whereas the other 29 image pairs correspond to healthy retinas. All the images in the collection present a size of 720 × 576 pixels. The private collection consists of 59 additional retinography-angiography pairs. Most of the images correspond to pathological cases, including representative samples of several common ophthalmic diseases. Additionally, the original images presented different sizes and, therefore, they were resized to a fixed size of 720 × 576 pixels. This collection of images has been gathered from the ophthalmic services of Complexo Hospitalario Universitario de Santiago de Compostela (CHUS) in Spain.

To perform the different experiments, the complete multimodal dataset is randomly split into two subsets of equal size, i.e., 59 image pairs each. One of these subsets is held out as a test set and the other is used for training the multimodal reconstruction. Additionally, the training image pairs are randomly split into a validation subset of nine image pairs and a training subset of 50 image pairs. The purpose of this split is to control the training progress through the validation subset, as described in Section 15.3. Finally, it should be noticed that, although the same subset of image pairs is used for the training of both methodologies, the images are considered as unpaired for the cyclical GAN approach.

15.4.2 Qualitative evaluation of the reconstruction
Firstly, the quality and coherence of the generated angiographies are evaluated through visual analysis. To that end, Figs. 15.7 and 15.8 depict some representative examples of generated images together with the original retinographies and angiographies. The examples are taken from the holdout test set.

Fig. 15.7 Examples of generated angiographies together with the corresponding original retinographies and angiographies. Some representative examples of microaneurysms (green), microhemorrhages (blue), and bright lesions (yellow) are marked with circles.

Fig. 15.8 Examples of generated angiographies together with the corresponding original retinographies and angiographies. Some representative examples of microaneurysms (green) and microhemorrhages (blue) are marked with circles.

In general, both methodologies were able to learn an adequate transformation for the main anatomical structures in the retina, namely, the vasculature, fovea, and optic disc. In particular, it is observed that the retinal vasculature is successfully enhanced in all the cases, which is one of the main characteristics of real angiographies. This vascular enhancement evidences a high-level understanding of the different structures in the retina, given that other dark-colored structures in the retinography, such as the fovea, are mainly kept with a dark tone in the reconstructed angiographies. This means that the applied transformation is structure-specific and guided by the semantic information in the images instead of low-level information such as the color. In contrast with the vasculature, the reconstructed optic discs are not as similar to those in the real angiographies. However, this can be explained by the fact that the appearance of the optic disc is not as consistent among angiographies.
In this sense, both methodologies learn to reconstruct the optic disc with a slight higher intensity, which may indicate that this is the predominant appearance of this anatomical structure in the training set. Multimodal reconstruction of retinal images With regard to the pathological structures, there are greater differences between the presented methodologies. For instance, microaneurysms are only generated or enhanced by the cyclical GAN methodology. Microaneurysms are tiny vascular lesions that, in contrast to other pathological structures, remain connected to the bloodstream. Therefore, they are directly affected by the injected contrast dye in the angiography. As it is observed in Fig. 15.7, the cyclical GAN methodology is able to enhance these small lesions. However, neither all the microaneurysms in the ground truth angiography are reconstructed nor all the reconstructed microaneurysms are present in the ground truth. This may indicate that part of these microaneurysms are artificially created by the network or that small microhemorrhages are being misidentified as microaneurysms. Nevertheless, it must be considered that the detection of microaneurysms is a very challenging task in the field. Thus, despite the possible errors, the fact that these small structures were identified by the cyclical GAN methodology is a significative outcome. In contrast to the previous analysis about microaneurysms, the examples of Fig. 15.7 evidence that the paired SSIM methodology provides a better reconstruction for other pathological structures. In particular, bright lesions that are present in the retinography should not be visible in the angiography. However, the cyclical GAN approach fails to completely remove these lesions, especially if they are large such as those in the top-left quarter or the retina shown in Fig. 15.7B. The paired SSIM approach provides a more accurate reconstruction regarding these kinds of lesions although in the previous case there still remains a show in the area of the lesion. Finally, regarding the microhemorrhages, these kinds of lesions are also more accurately reconstructed by the paired SSIM approach. In particular, these lesions present a dark appearance in both retinography and angiography. In the depicted examples, it is observed that paired SSIM reconstructs the microhemorrhages, as expected. However, the cyclical GAN approach tends to remove these lesions. Additionally, in some cases, the small microhemorrhages are reconstructed with a bright tone like the microaneurysms. Besides the anatomical and pathological structures in the retina, the main difference that is observed between both methodologies is the general appearance of the generated angiographies. In this regard, the images generated by the cyclical GAN present a more realistic look and they could be easier misidentified as real angiographies. The main reason for this is the texture in the images. In particular, cyclical GAN produces a textured retinal background that mimics the appearance of a real angiography. In contrast, the retinal background in the angiographies generated by paired SSIM is very homogeneous, which gives away the synthetic nature of the images. The explanation for this difference between both approaches is the use of GANs in the cyclical GAN methodology. In this sense, the discriminator network has the capacity to learn and distinguish the main characteristics of the angiography, including the textured background. 
Thus, a synthetic angiography with a smooth background would be easily identified as fake by the discriminator. Consequently, during the training, the generator will learn to generate 363 364 Generative adversarial networks for image-to-Image translation the textured background in order to trick the discriminator. In the case of the paired SSIM, the presented results show that SSIM does not provide the feedback that is required to learn this characteristic. Additionally, according to the results presented in Ref. [15], the use of L1-norm or L2-norm in the loss function does not provide that feedback either. In this regard, it should be noticed that these are full-reference pixel-wise metrics that directly compare the network output against a specific ground truth image. Thus, even if an angiography-like texture is generated, this will not necessarily minimize the loss function if the generated texture does not exactly match the one in the provided ground truth. It could be the case that the specific texture of each angiography was impossible to infer from the corresponding retinography. In that scenario, the generator could never completely reduce the loss portion corresponding to the textured background. The resulting outcome could be the generation of a homogeneous background that minimizes the loss throughout the training set. This explanation fits with what is observed in Figs. 15.7 and 15.8. 15.4.3 Quantitative evaluation of the reconstruction The multimodal reconstruction is quantitatively evaluated by measuring the reconstruction error between the generated and the ground truth angiographies. In particular, the reconstruction is evaluated by means of SSIM, mean average error (MAE), and mean squared error (MSE), which are common evaluation metrics for image reconstruction and image quality assessment. The presented evaluation is performed on the paired data of the holdout test set. When comparing the two presented methodologies, it must be considered that the paired SSIM relies on the availability of paired data for training. The paired data represent a richer source of information in comparison to the unpaired counterpart and, therefore, it is expected that the paired SSIM provided better performance than cyclical GAN for the same number of training samples. Additionally, it should be also considered that the paired data, despite being commonly available in medical imaging, is inherently harder to collect than the unpaired counterpart. For these reasons, the presented evaluation not only compares the performance of both methodologies when using the complete training set but, also, it compares the performance when there are more unpaired than paired images available for training. This is an expected scenario in practical applications. The results of the quantitative evaluation are depicted in Fig. 15.9. In the case of paired SSIM, the presented results correspond to several experiments with a varying number of training samples, ranging from 10 to 50 image pairs. In the case of cyclical GAN, the presented results are obtained after training with the complete training subset, i.e., 50 image pairs. Firstly, it is observed that the paired SSIM always provides better results than the cyclical GAN considering SSIM although that is not the case for MAE and MSE. Considering these two metrics, the paired SSIM obtains similar Multimodal reconstruction of retinal images Fig. 15.9 Comparison of cyclical GAN and paired SSIM with a varying number of training samples for paired SSIM. 
The evaluation is performed by means of (A) SSIM, (B) MAE, and (C) MSE. 365 366 Generative adversarial networks for image-to-Image translation or worse results depending on the number of training samples. In general, it is clear that, up to 30 image pairs, the paired SSIM experiments a positive evolution with the addition of more training data. Then, between 30 and 50 image pairs, the evolution stagnates and there is no improvement with the addition of more images. In the case of MAE and MSE, the final results to which the paired SSIM converges are approximately the same as those obtained by the cyclical GAN. This may indicate an existent upper bound in the performance of the multimodal reconstruction with this experimental setting. Regarding the comparison by means of SSIM, there is an important difference between both methodologies independently of the number of training images for paired SSIM. On the one hand, this may be explained by the fact that the generator of the paired SSIM has been explicitly trained to maximize SSIM. Thus, this network excels when it is evaluated by means of this metric. On the other hand, however, it must be considered that SSIM is a more complex metric in comparison to MAE or MSE. In particular, SSIM does not directly measure the difference between pixels but, instead, it measures local similarities that include high-level information such as structural coherence. Thus, it could be possible that subtle structural errors, which are not evidenced by MAE or MSE, contribute to the worse performance of cyclical GAN considering SSIM. 15.4.4 Ablation analysis of the generated images In order to better understand the obtained results, we present a more detailed quantitative analysis in this section. In particular, the presented analysis considers the possible differences in error distribution among different retinal regions. As it was shown in Section 15.4.2, both methodologies seem to provide a similar enhancement of the retinal vasculature. However, there are important differences in the reconstructed retinal background and certain pathological structures. Therefore, it is interesting to study how the reconstruction error is distributed between the vasculature and the background, and whether this distribution is different between both methodologies. To that end, the reconstruction errors are recalculated using a binary vascular mask to separate between vasculature and background regions. Given that only a broad approximation of the vasculature is necessary, the vascular mask is computed by applying some common image processing techniques. First, the multiscale Laplacian operator proposed in Ref. [31] is applied to the original angiography. This operation further enhances the retinal vasculature, resulting in an image with much greater contrast between vasculature and background [36]. Then, the vascular region is dilated to ensure that the resulting mask not only includes the vessels but also their surrounding pixels. This way, the reconstruction error in the vasculature will also include the error due to inappropriate vessel edges. Finally, the vascular mask is binarized by applying Otsu’s thresholding method [37]. An example of the produced binary vascular mask together with the original angiography is depicted in Fig. 15.10. Multimodal reconstruction of retinal images Fig. 15.10 Example of vascular mask used for evaluation: (A) angiography and (B) resulting vessel mask for (A). 
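A rough sketch of this evaluation mask and of the region-wise error computation is given below, using generic image processing tools rather than the specific multiscale Laplacian operator of Ref. [31]; the scales, the amount of dilation, and the use of MAE as the example metric are illustrative assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_laplace, binary_dilation
from skimage.filters import threshold_otsu

def vessel_mask(angio, scales=(1, 2, 4), dilation_iters=3):
    """Approximate binary vessel mask computed from a real angiography.

    Generic stand-in for the multiscale Laplacian enhancement of Ref. [31]:
    bright tubular vessels give a negative Laplacian-of-Gaussian response,
    so the negated response is taken, maximized over a few scales, dilated
    to also cover the vessel edges, and binarized with Otsu's threshold.
    """
    angio = angio.astype(np.float32)
    enhanced = np.max([-gaussian_laplace(angio, sigma=s) for s in scales], axis=0)
    mask = enhanced > threshold_otsu(enhanced)
    return binary_dilation(mask, iterations=dilation_iters)

def region_errors(generated, target, mask):
    # Reconstruction error (MAE here) computed separately for the vessel
    # region and for the background, as in the presented ablation analysis.
    err = np.abs(generated.astype(np.float32) - target.astype(np.float32))
    return err[mask].mean(), err[~mask].mean()
```

The same masking scheme can be applied to the SSIM and MSE maps to reproduce the complete region-wise comparison.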
The results of the quantitative evaluation using the computed vascular masks are depicted in Fig. 15.11. Firstly, it is observed that, in all the cases, the reconstruction error is greater in the vessels than in the background. This may indicate that the reconstruction of the retinal background is an easier task in comparison to the retinal vasculature. In this regard, it must be noticed that the retinal vasculature is an intricate network with numerous intersection and bifurcations, which increases the difficulty of the reconstruction. The background also includes some pathological structures, which can be a source of errors as seen in Section 15.4.2. However, these pathological structures neither are present in all the images nor occupy a significantly large area of the background. Moreover, the bright lesions in the angiography, i.e., the microaneurysms, are included within the vascular mask, as can be seen in Fig. 15.10. This balances the contribution of the pathological structures between both regions. Regarding the comparison between cyclical GAN and paired SSIM, the analysis is the same as in the previous evaluation. This happens independently of the retinal region that is analyzed, vasculature or background. In particular, the performance of paired SSIM experiments the same evolution with the increase in the number of training images. Considering MAE and MSE, paired SSIM converges again to the same results that are achieved by cyclical GAN, resulting in a similar performance. In contrast, there is still an important difference between the methodologies when considering SSIM. Finally, it is interesting to observe that the error distribution between regions is the same for paired SSIM and cyclical GAN, even when there is a clear visual difference in the reconstructed background between both methodologies (see Fig. 15.7). This shows that the more realistic look provided by the textured background does not necessarily lead to a better reconstruction in terms of full-reference pixel-wise metrics. In particular, the same reconstruction error can be achieved by producing a homogeneous background with an adequate tone, as paired SSIM does. This explains why the use of these metrics as a loss function does not encourage the generator to produce a textured background. Moreover, in the case of SSIM, which is the metric used by paired SSIM during training, the reconstruction error for the textured background is even greater than that of the homogeneous version. 367 368 Generative adversarial networks for image-to-Image translation Fig. 15.11 Comparison of cyclical GAN and paired SSIM with a varying number of training samples for paired SSIM. The evaluation is conducted independently for vessels and the background of the images. The evaluation is performed by means of (A) SSIM, (B) MAE, and (C) MSE. Multimodal reconstruction of retinal images 15.4.5 Structural coherence of the generated images An observation that remains to be explained after the previous analyses is the different results obtained whether the evaluation is performed by means of SSIM or MAE/ MSE. In particular, both methodologies achieve similar results in MAE and MSE, although paired SSIM always performs better in terms of SSIM. Given that SSIM is characterized by including higher level information such as the structural coherence between images, the generated images are visually inspected to find possible structural differences. Fig. 
15.12 depicts some composite images using a checkerboard pattern that is used to perform the visual analysis. In particular, the depicted images show the generated angiography together with the original retinography (Fig. 15.12A and C) as well as the generated angiography together with the ground truth angiography (Fig. 15.12B and D). At a glance, it seems that both angiographies, from paired SSIM and cyclical GAN, are perfectly reconstructed. However, on closer examination, it is observed that in the angiographies generated by cyclical GAN there are small displacements with respect to the originals. Examples of these displacements are shown in detail in Fig. 15.12. As it is observed, the displacement occurs, at least, in the retinal vasculature. Moreover, it can be observed that the displacement is consistent among the zoomed patches even when they are distant in the images. This indicates that the observed displacement could be the result of an affine transformation. With regard to the cause of the displacement, an initial hypothesis is based on the fact that cyclical GAN does not put any hard constraint on the structure of the generated angiography. The only requirements are that the image must look like a real angiography and that it must be possible to reconstruct the original retinography from it. Thus, although the more straightforward way to reconstruct the original retinography seems to be to keep the original structure as it is, nothing enforces the networks to do so. Nevertheless, it must be considered that if GR2A applies any spatial transformation to the generated angiographies, and then GA2R must learn to apply the inverse transformation when reconstructing the original retinography. This synergy between the networks is necessary to still minimize the cycle-consistency loss in the cyclical GAN methodology. Although not straightforward, this situation seems plausible given that the observed displacement is very subtle. The presented situation may initiate if the first network, GR2A, starts to reconstruct the vessels of the angiography over the vessel edges of the input retinography. This is likely to happen given the facility of a neural network to detect edges in an image. Moreover, the vessel edges are easier to detect than the vessel centerlines. To verify this hypothesis, the angiographies generated during the first stages of the training have been revised. A representative example of these images is depicted in Fig. 15.13. As it can be observed, there are some bright lines that seems to be drawn over the edges of the subtle dark vessels. This evidences the origin of the issue, although the ultimate cause is the underconstrained training setting of cyclical GAN. 369 370 Generative adversarial networks for image-to-Image translation Fig. 15.12 Comparison of generated angiographies against (A, C) the corresponding original retinographies and (B, D) the corresponding ground truth angiographies. (A, B) Angiography generated using paired SSIM. (C, D) Angiography generated using cyclical GAN. Additionally, cropped regions are depicted in detail for each case. Fig. 15.13 Representative example of generated angiography on the first stages of training for cyclical GAN: (A) original retinography and (B) generated angiography. Multimodal reconstruction of retinal images 15.5 Discussion and conclusions In this chapter, we have presented a cyclical GAN methodology for the multimodal reconstruction of retinal images [11]. 
15.5 Discussion and conclusions
In this chapter, we have presented a cyclical GAN methodology for the multimodal reconstruction of retinal images [11]. This multimodal reconstruction is a novel task that consists of the translation of medical images between complementary modalities [7]. This allows the estimation of either more invasive or less affordable imaging modalities from a readily available alternative. For instance, this chapter addresses the estimation of fluorescein angiography from retinography, where the former requires the injection of a contrast dye into the patients. Despite the recent technical advances in the field, the direct use of generated images in clinical practice is still only a potential future application. However, there are several other possible applications where this multimodal reconstruction can be taken advantage of. For instance, multimodal reconstruction has already been demonstrated to be a successful pretraining task for transfer learning in medical image analysis [13, 14]. This is an important application that reduces the necessity of large collections of expert-annotated data in medical imaging [10]. In order to provide a comprehensive analysis of the cyclical GAN methodology, we have also presented an exhaustive comparison against a state-of-the-art approach where no GANs were used [15]. This way, it is possible to study the particular advantages and disadvantages of using GANs for multimodal reconstruction. The provided comparison is performed under the fairest possible conditions, by using the same dataset, network architectures, and training strategies. In this regard, the only differences are those intrinsically due to the methodologies themselves. Regarding the presented results, it is seen that both approaches are able to produce an adequate estimation of the angiography from retinography. However, there are important differences in several aspects of the generated angiographies. Moreover, the requirements for training each of the two approaches must also be considered in the comparison. Regarding the training requirements of both approaches, the main difference is the use of unpaired data in cyclical GAN and paired data in paired SSIM. In broad-domain applications, i.e., those performed on natural images, this would represent an insurmountable obstacle for the paired SSIM methodology. However, in medical imaging, paired data can be relatively easy to obtain due to the common use of complementary imaging modalities in clinical practice. In this case, however, the disadvantage of paired SSIM is the necessity of registered image pairs in which the different anatomical and pathological structures must be aligned. The multimodal registration method that is applied in paired SSIM has been demonstrated to be reliable for the alignment of retinography-angiography pairs [31]. Moreover, it has been successfully applied for the registration of the multimodal dataset that is used in the experiments herein described. However, the results presented in Ref. [31] also show that, quantitatively, the registration performance is lower for the most complex cases, which can be due to, e.g., low-quality images or severe pathologies. This could potentially limit the variety of images in an extended version of the dataset including more challenging scenarios. Additionally, the registration method in paired SSIM is domain-specific and, therefore, cannot be directly applied to other types of multimodal image pairs. This means that the use of paired SSIM in other medical specialties would require the availability of adequate registration methods.
Although image registration is a common task in medical imaging, the availability of such multimodal registration algorithms cannot be taken for granted. In contrast, cyclical GAN can be directly applied to any kind of multimodal setting without the need for registered or paired data. Another important difference between the presented approaches is the complexity of the training procedure. In this sense, cyclical GAN represents a more complex approach, including four different neural networks and two training cycles, as described in Section 15.3.1. In comparison, once the multimodal image registration is performed, paired SSIM only requires the training of a single neural network. The use of four different networks in cyclical GAN means that, computationally, more memory is required for training. In a situation of limited resources, which is the common practical scenario, this will negatively affect the size and number of images that can be included in each batch during training. Moreover, in practice, cyclical GAN also requires longer training times than paired SSIM, which further increases the computational costs. This is in part due to the use of a single network in paired SSIM, but also to the use of a full-reference pixel-wise metric for the loss functions. The feedback provided by this more classical alternative results in a faster convergence in comparison to the adversarial training. Regarding the performance of the multimodal reconstruction, the examples depicted in Figs. 15.7 and 15.8 show that both methodologies are able to successfully recognize the main anatomical structures in the retina. In that sense, despite the evident aesthetic differences, the transformations applied to the anatomical structures are adequate in both cases. Thus, both approaches show a similar potential for transfer learning regarding the analysis of the retinal anatomy. However, when considering the pathological structures, there are important differences between both methodologies. In this case, neither of the methodologies perfectly reconstructs all the lesions. In particular, the examples depicted in Fig. 15.7 indicate that each methodology gives preference to different types of lesions in the generated images. Thus, it is not clear which alternative would be a better option toward the pathological analysis of the retinal images. In this regard, given the mixed results that are obtained, future works could explore the development of hybrid methods for the multimodal reconstruction of retinal images. The objective, in this case, would be to combine the good properties of cyclical GAN and paired SSIM. One of the main differences between cyclical GAN and paired SSIM is the appearance of the generated angiographies. Due to the use of a GAN framework in cyclical GAN, the generated angiographies look realistic and aesthetically pleasing. In contrast, the angiographies generated by paired SSIM present a more synthetic appearance. The importance of this difference in the appearance of the generated angiographies depends on the specific application. On the one hand, for representation learning purposes, the priority is the proper recognition of the different retinal structures. Additionally, even for the potential clinical interpretation of the images, realism is not as important as the accurate reconstruction of the different structures.
On the other hand, there exist potential applications, such as data augmentation or clinical simulations, where the realism of the images is of great importance. Finally, a relevant observation presented in this chapter is the fact that cyclical GAN does not necessarily keep the exact same structure of the input image. This is a known possible issue, given the underconstrained training setting in cyclical GANs. Nevertheless, in this chapter, we have presented empirical evidence of this issue in the form of small displacements of the reconstructed blood vessels. According to the evidence presented in Section 15.4.5, it is not possible to predict whether these displacements will happen or exactly what form they will take. In this sense, the particular structural displacements produced by the networks are affected by the stochasticity of the training procedure. Moreover, although we have only noticed this structural incoherence in the blood vessels, similar subtle structural transformations may exist for other elements in the images. In line with prior observations in the presented comparison, the importance of these structural errors depends on the specific application in which the multimodal reconstruction is used. For instance, this kind of small structural variation should not significantly affect the quality of the internal representations learned by the network. However, it would impede the use of cyclical GAN as a tool for accurate multimodal image registration. The development of hybrid methodologies, as previously discussed, could also be a solution to this structural issue while keeping the good properties of GANs. For instance, according to the results presented in Section 15.4.3, the addition of a small number of paired training samples could be sufficient for improving the structural coherence of the cyclical GAN approach. Additionally, a hybrid approach of this kind could still incorporate those more challenging paired images that may not be successfully registered. To conclude, the presented cyclical GAN approach has been demonstrated to be a valid alternative for the multimodal reconstruction of retinal images. In particular, the provided comparison shows that cyclical GAN has both advantages and disadvantages with respect to the state-of-the-art approach paired SSIM. In this regard, these two approaches are complementary to each other when considering their strengths and weaknesses. This motivates the future development of hybrid methods aiming at taking advantage of the strengths of both alternatives.

Acknowledgments
This work was supported by Instituto de Salud Carlos III, Government of Spain, and the European Regional Development Fund (ERDF) of the European Union (EU) through the DTS18/00136 research project, and by Ministerio de Ciencia, Innovación y Universidades, Government of Spain, through the RTI2018-095894-B-I00 research project. The authors of this work also receive financial support from the ERDF and European Social Fund (ESF) of the EU and Xunta de Galicia through Centro de Investigación de Galicia, ref. ED431G 2019/01, and the predoctoral grant contract ref. ED481A-2017/328.

Conflict of interest
The authors declare no conflicts of interest.

References
[1] G. Litjens, T. Kooi, B.E. Bejnordi, A.A.A. Setio, F. Ciompi, M. Ghafoorian, J.A. van der Laak, B. van Ginneken, C.I. Sánchez, A survey on deep learning in medical image analysis, Med. Image Anal.
42 (2017) 60–88, https://doi.org/10.1016/j.media.2017.07.005. [2] E.D. Cole, E.A. Novais, R.N. Louzada, N.K. Waheed, Contemporary retinal imaging techniques in diabetic retinopathy: a review, Clin. Exp. Ophthalmol. 44 (4) (2016) 289–299, https://doi.org/ 10.1111/ceo.12711. [3] T. Farncombe, K. Iniewski, Medical Imaging: Technology and Applications, CRC Press, 2017. [4] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, M.S. Lew, Deep learning for visual understanding: a review, Neurocomputing 187 (2016) 27–48, https://doi.org/10.1016/j.neucom.2015.09.116. [5] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, P. Martinez-Gonzalez, J. GarciaRodriguez, A survey on deep learning techniques for image and video semantic segmentation, Appl. Soft Comput. 70 (2018) 41–65. ISSN 15684946 https://doi.org/10.1016/j.asoc.2018.05.018. [6] W. Rawat, Z. Wang, Deep convolutional neural networks for image classification: a comprehensive review, Neural Comput. 29 (9) (2017) 2352–2449, https://doi.org/10.1162/neco_a_00990. [7] A.S. Hervella, J. Rouco, J. Novo, M. Ortega, Retinal image understanding emerges from selfsupervised multimodal reconstruction, in: Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2018, https://doi.org/10.1007/978-3-030-00928-1_37. [8] S. Engelhardt, L. Sharan, M. Karck, R.D. Simone, I. Wolf, Cross-domain conditional generative adversarial networks for stereoscopic hyperrealism in surgical training, in: Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2019, https://doi.org/10.1007/978-3-030-322540_18. [9] X. Yi, E. Walia, P. Babyn, Generative adversarial network in medical imaging: a review, Med. Image Anal. 58 (2019) 101552. ISSN 1361-8415 https://doi.org/10.1016/j.media.2019.101552. [10] Á.S. Hervella, J. Rouco, J. Novo, M. Ortega, Learning the retinal anatomy from scarce annotated data using self-supervised multimodal reconstruction, Appl. Soft Comput. 91 (2020) 106210, https://doi. org/10.1016/j.asoc.2020.106210. [11] A.S. Hervella, J. Rouco, J. Novo, M. Ortega, Deep multimodal reconstruction of retinal images using paired or unpaired data, in: International Joint Conference on Neural Networks (IJCNN), 2019, https://doi.org/10.1109/IJCNN.2019.8852082. [12] L. Wang, W. Chen, W. Yang, F. Bi, F.R. Yu, A state-of-the-art review on image synthesis with generative adversarial networks, IEEE Access 8 (2020) 63514–63537, https://doi.org/10.1109/ ACCESS.2020.2982224. [13] A.S. Hervella, L. Ramos, J. Rouco, J. Novo, M. Ortega, Multi-modal self-supervised pre-training for joint optic disc and cup segmentation in eye fundus images, in: 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, https://doi.org/10.1109/ ICASSP40776.2020.9053551. [14] J. Morano, A.S. Hervella, N. Barreira, J. Novo, J. Rouco, Multimodal transfer learning-based approaches for retinal vascular segmentation, in: 24th European Conference on Artificial Ingelligence (ECAI), 2020. Multimodal reconstruction of retinal images [15] Á.S. Hervella, J. Rouco, J. Novo, M. Ortega, Self-supervised multimodal reconstruction of retinal images over paired datasets, Expert Syst. Appl. (2020) 113674, https://doi.org/10.1016/j. eswa.2020.113674. [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems (NIPS), 27, 2014, pp. 2672–2680. [17] Z. Pan, W. Yu, X. Yi, A. Khan, F. Yuan, Y. 
Zheng, Recent progress on generative adversarial networks (GANs): a survey, IEEE Access 7 (2019) 36322–36333, https://doi.org/10.1109/ ACCESS.2019.2905015. [18] P. Isola, J.-Y. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, https://doi.org/10.1109/CVPR.2017.632. [19] J.-Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in: 2017 IEEE International Conference on Computer Vision (ICCV), 2017, https://doi.org/10.1109/ICCV.2017.244. [20] Z. Yi, H. Zhang, P. Tan, M. Gong, DualGAN: unsupervised dual learning for image-to-image translation, in: The IEEE International Conference on Computer Vision (ICCV), 2017, https://doi.org/ 10.1109/ICCV.2017.310. [21] T. Kim, M. Cha, H. Kim, J.K. Lee, J. Kim, Learning to discover cross-domain relations with generative adversarial networks, in: Proceedings of the 34th International Conference on Machine Learning, vol. 70, 2017, pp. 1857–1865. [22] J.M. Wolterink, T. Leiner, M.A. Viergever, I. Išgum, Generative adversarial networks for noise reduction in low-dose CT, IEEE Trans. Med. Imaging 36 (12) (2017) 2536–2545, https://doi.org/10.1109/ TMI.2017.2708987. [23] Y. Xue, T. Xu, H. Zhang, L. Long, X. Huang, SegAN: adversarial network with multi-scale L1 loss for medical image segmentation, Neuroinformatics 16 (3–4) (2018) 383–392. ISSN 1539-2791 https:// doi.org/10.1007/s12021-018-9377-x. [24] M. Frid-Adar, I. Diamant, E. Klang, M. Amitai, J. Goldberger, H. Greenspan, GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification, Neurocomputing 321 (2018) 321–331. ISSN 0925-2312 https://doi.org/10.1016/j.neucom.2018.09.013. [25] T. Schlegl, P. Seeb€ ock, S.M. Waldstein, G. Langs, U. Schmidt-Erfurth, F-AnoGAN: fast unsupervised anomaly detection with generative adversarial networks, Med. Image Anal. 54 (2019) 30–44. ISSN 1361-8415 https://doi.org/10.1016/j.media.2019.01.010. [26] J. Cohen, M. Luck, S. Honari, Distribution matching losses can hallucinate features in medical image translation, in: Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2018, https://doi.org/10.1007/978-3-030-00928-1_60. [27] X. Mao, Q. Li, H. Xie, R.Y. Lau, Z. Wang, S. Paul Smolley, Least squares generative adversarial networks, in: The IEEE International Conference on Computer Vision (ICCV), 2017, https://doi.org/ 10.1109/ICCV.2017.304. [28] D.P. Kingma, J. Ba, Adam: a method for stochastic optimization, in: International Conference on Learning Representations (ICLR), 2015. [29] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, in: International Conference on Learning Representations (ICLR), 2016. [30] Á.S. Hervella, J. Rouco, J. Novo, M.G. Penedo, M. Ortega, Deep multi-instance heatmap regression for the detection of retinal vessel crossings and bifurcations in eye fundus images, Comput. Methods Prog. Biomed. 186 (2020) 105201. ISSN 0169-2607 https://doi.org/10.1016/j.cmpb. 2019.105201. [31] A.S. Hervella, J. Rouco, J. Novo, M. Ortega, Multimodal registration of retinal images using domainspecific landmarks and vessel enhancement, in: International Conference on Knowledge-Based and Intelligent Information and Engineering Systems (KES), 2018, https://doi.org/10.1016/j. procs.2018.07.213. [32] Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. 
Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Trans. Image Process. 13 (4) (2004) 600–612, https://doi.org/10.1109/TIP.2003.819861. [33] D. Ulyanov, A. Vedaldi, V.S. Lempitsky, Improved texture networks: maximizing quality and diversity in feed-forward stylization and texture synthesis, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4105–4113, https://doi.org/10.1109/CVPR.2017.437. [34] O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015, https://doi.org/10.1007/978-3-319-24574-4_28. [35] S.H.M. Alipour, H. Rabbani, M.R. Akhlaghi, Diabetic Retinopathy Grading by Digital Curvelet Transform, Computational and Mathematical Methods in Medicine, 2012, https://doi.org/10.1155/2012/761901. [36] A.S. Hervella, J. Rouco, J. Novo, M. Ortega, Self-supervised deep learning for retinal vessel segmentation using automatically generated labels from multimodal data, in: International Joint Conference on Neural Networks (IJCNN), 2019, https://doi.org/10.1109/IJCNN.2019.8851844. [37] N. Otsu, A threshold selection method from gray-level histograms, IEEE Trans. Syst. Man Cybern. 9 (1) (1979) 62–66, https://doi.org/10.1109/TSMC.1979.4310076.

CHAPTER 16
Generative adversarial network for video anomaly detection
Thittaporn Ganokratanaa and Supavadee Aramvith
Department of Electrical Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand; Multimedia Data Analytics and Processing Research Unit, Department of Electrical Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand

16.1 Introduction
Video anomaly detection (VAD) has gained increasing recognition in surveillance systems for ensuring security. VAD is a challenging task due to the complex appearance structure of the images combined with the motion between frames. This research topic has drawn interest from researchers in the computer vision area. Traditional approaches, including the social force model (SF) [1], the mixture of probabilistic principal component analyzers (MPPCA) [2], the mixture of dynamic textures (MDT), and the combination SF + MPPCA [3, 4], sparse reconstruction [5–7], one-class learning machines [8], K-nearest neighbors [9], and tracklet analysis [10, 11], have been proposed to tackle the anomaly detection problem due to their performance in detecting multiple objects. However, the traditional approaches do not perform well on the anomaly detection problem, since it is a complex task that mostly arises in crowded scenes, making it difficult for these approaches to generalize. Thus, deep learning approaches, such as the deep Gaussian mixture model (GMM) [12, 13], autoencoders [14, 15], and deep pretrained convolutional neural networks (CNN) [16–19], are employed to achieve a higher anomaly detection rate. Even with these deep learning approaches, the problem remains open when dealing with all issues of anomaly detection. Specifically, the major challenges in the anomaly detection task fall into three types: complex scenes, small anomaly samples, and object localization at the pixel level. A complex scene may consist of multiple moving objects with clutter and occlusions, which cause difficulty in detecting and localizing objects.
This issue also relates to crowded scenes, which are more challenging than uncrowded ones. The second challenge is the small number of samples with abnormal ground truth in the available anomaly datasets, which makes the training of data-hungry deep learning models a struggle. In practice, it is impossible to train on all anomalous events, as they occur randomly. Therefore, the anomaly detection task is addressed in an unsupervised learning manner, since no data labeling is required for the rare positive class. Another important issue is the pixel-level localization of the objects in the scene. Previous works [1, 6, 15, 20] struggled with this challenging task, achieving high accuracy only at frame-level anomaly detection. On the other hand, the accuracy of pixel-level anomaly localization is significantly poorer. In recent works [21–23], researchers have tried to improve the performance to cover all evaluation criteria, but achieve good performance at either the frame or the pixel level only in some complex scenes. This happens as a consequence of insufficient input features for training the model, such as the appearance and motion patterns of the objects. The features of the foreground objects should be extracted sufficiently and efficiently during training to make the model understand all their characteristics. To deal with these challenges, the unsupervised deep learning-based approach is the most suitable technique for the anomaly detection problem, since it does not require any labeled data on abnormalities. Unsupervised learning is a key domain of deep generative models, such as adversarially trained autoencoders (AAE) [24], variational autoencoders (VAE) [25], and generative adversarial networks (GAN) [26, 27]. Generative models for anomaly detection aim to model only the normal events in training, as they constitute the majority of the patterns. The abnormal events can then be distinguished by evaluating the distance from the learned normal events. Early generative works are mostly based on handcrafted features [1, 3, 5, 11, 28] or CNNs [17, 18] to extract and learn the important features. However, the performance of anomaly detection and localization still needs to be improved, due to the difficulty of approximating many intractable probabilistic computations and of leveraging piecewise linear units in generative models [29–31]. Hence, recent trends for video anomaly detection focus more on GANs [20–22], an effective approach that achieves high performance in image generation and synthesis, affords data augmentation, and overcomes classification problems in complex scenarios.

16.1.1 Anomaly detection for surveillance videos
Video surveillance has gained increasing popularity since it is widely used to ensure security. Closed-circuit television (CCTV) cameras are used to monitor the scene, record certain situations, and provide evidence. They generally serve as a post-event video forensic tool that allows human operators to manually investigate previous events for abnormalities [32]. This manual process is difficult for the operators, since abnormalities can occur in any situation, whether crowded or uncrowded, indoors or outdoors.
Additionally, abnormalities may cause serious problems, including terrorist attacks, robberies, and area invasions, leading to personal injury or death and property damage [33]. Thus, to enhance the performance of video surveillance, it is crucial to build an intelligent system for anomaly detection and localization. The anomaly is defined as “a person or thing that is different from what is usual, or not in agreement with something else and therefore not satisfactory” [34]. Multiple terms stand for the anomaly, including anomalous events, abnormal events, unusual events, abnormality, irregularity, and suspicious activity. In VAD, an abnormal event can be seen as a distinctive pattern or motion that differs from the neighboring areas or from the majority of the activities in the scene. Specifically, the normal events are the frequently occurring objects and common moving patterns that represent the majority of the patterns, while the abnormal events are varied and rarely occur; they are infrequent events that may include unseen objects and have a significantly lower probability than the normal events. Examples of different abnormal events are shown in Fig. 16.1.

Fig. 16.1 Examples of abnormal events in crowds from the UCSD pedestrian [4], UMN [1], and CUHK Avenue datasets [6].

Anomaly detection for surveillance videos is challenging because of the complex patterns of real scenes (e.g., moving foreground objects with large amounts of occlusion and clutter in crowds) captured by static CCTV cameras. VAD relies on fixed CCTV cameras and takes only the moving foreground objects into account while disregarding the static background. The goal of VAD is to accurately identify all possible anomalous events among the regular normal patterns in crowded and complex scenes from the video sequences. To design effective anomaly detection for surveillance videos, all information about the objects should be learned from both their appearance (spatial) and motion (temporal) features in an unsupervised or semisupervised learning manner. In the model training, with the unsupervised learning task, only the frames of normal events are used, meaning that there is no data labeling of abnormalities. This benefits the use of VAD in real-world environments, where any type of abnormal event can occur unpredictably. Then, all videos are fed into the model during testing. Any pattern that deviates from the trained normal samples is identified as an abnormal event, which can be detected by evaluating an anomaly score, defined as the error of the predictive model in a vector space or the posterior probability of the test samples.
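As a simple illustration of this scoring idea, the sketch below turns per-frame reconstruction errors into normalized anomaly scores; the min-max normalization and the threshold value are assumptions for the example, not specifics from the chapter.

```python
import numpy as np

def frame_anomaly_scores(reconstruction_errors):
    """Normalize per-frame reconstruction errors of a test video to [0, 1]
    so that they can be thresholded as anomaly scores."""
    e = np.asarray(reconstruction_errors, dtype=np.float64)
    return (e - e.min()) / (e.max() - e.min() + 1e-12)

# Frames whose score exceeds a chosen threshold are flagged as abnormal.
scores = frame_anomaly_scores([0.12, 0.10, 0.55, 0.60, 0.11])
is_abnormal = scores > 0.5
```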
16.1.2 A broader view of generative adversarial network for anomaly detection in videos
GANs have been studied for years. The success of GAN comes from the effectiveness of its structure in improving image generation and classification tasks with a pair of networks. GAN presents an end-to-end deep learning framework for modeling the likelihood of normal events in videos and provides flexibility in model training, since it does not require annotated abnormal samples. Its learning is achieved through backpropagation, which computes the error of each parameter in both the generator and discriminator networks. The goals of GAN are to produce synthetic output that cannot be identified as different from the real data and to automatically learn a loss function that achieves this indistinguishable output. The GAN loss attempts to classify whether the synthetic output is fake or real, while at the same time the generative model is trained to minimize it. This loss makes GAN beneficial in various applications, since it can adapt to the data without requiring different loss functions, unlike the loss functions of traditional CNN approaches. Specifically, in video anomaly detection, GAN is formulated as a two-player minimax game between a generator G and a discriminator D, providing high-accuracy output. G attempts to fool D by generating synthetic images that are similar to the real data, whereas D strives to discriminate whether its input belongs to the real or the synthetic data. This minimax game benefits data augmentation and implicit data management: D assists G in reducing the distance between its samples and the training data distribution and in training on small benchmarks, without the need to define an explicit parametric function or additional classifiers. Therefore, GAN is one of the most distinctive approaches for dealing with complex anomaly detection tasks, since it achieves good results in reconstructing, translating, and classifying images. Following the unsupervised GAN for image-to-image translation [35], it can extract significant features of the objects of interest (e.g., moving foreground objects) and efficiently translate them from spatial to temporal representations without any prior knowledge of anomalies or direct information on anomaly types. In this way, GAN can provide comprehensive information concerning appearance and motion features. Hence, we focus on reviewing GAN for the anomaly detection task in videos and also introduce our proposed method, named deep spatiotemporal translation network or DSTN, as a novel unsupervised GAN approach to detect and localize anomalies in crowded scenes [26]. This chapter contains five sections. In Section 16.1, anomaly detection for surveillance videos and a broader view of GAN are reviewed. Section 16.2 presents a literature review, including the basic structure of GAN along with the literature on anomaly detection in videos based on GAN. We elaborate on GAN training in Section 16.3, which includes the image-to-image translation and our proposed DSTN. The performance of DSTN is discussed in Section 16.4 along with its related details, including the publicly available anomaly benchmarks, the evaluation criteria, the comparison of GAN with an autoencoder, and the advantages and limitations of GAN for anomaly detection in videos. Finally, Section 16.5 provides a conclusion for this chapter.

16.2 Literature review
We introduce the basic structure of the GAN and review the related works on anomaly detection for surveillance videos based on GAN. The details of the GAN architecture and its state-of-the-art methods in video anomaly detection are described as follows.

16.2.1 The basic structure of generative adversarial network
Although the concept of generative models has been studied in machine learning for many years, it gained wide recognition with Goodfellow et al. [27], who introduced a novel adversarial process named GAN.
The basic structure of GAN consists of two networks working simultaneously against each other, the generator G and the discriminator D, as shown in Fig. 16.2.

Fig. 16.2 Generative adversarial network architecture.

In general, G produces a synthetic image n from the input noise z, whereas D attempts to differentiate between n and a real image r. The goal of G is to generate synthesized examples of objects that look like the real ones and thereby fool D into making the wrong decision that the data generated by G are real. On the other hand, D is trained on a dataset with image labels. D tries its best to discriminate whether its input data are fake or real by comparing them with the real training data. In other words, G is a counterfeiter producing fake checks, while D is an officer trying to catch G. Specifically, G becomes good at creating synthesized images because it optimizes its parameters using gradients propagated through D, making it more challenging for D to differentiate its input data. The training of this minimax game makes both networks better until, at some point, the probability distribution of G and that of the real data are equivalent (given enough capacity and training time), so that G and D are no longer able to improve. Thus, D is unable to differentiate between these two distributions. From the perspective of generative adversarial network training, G takes the input noise z from a probability distribution p_z(z), generates fake data, and feeds it into D as D(G(z)). D(x) denotes the probability that x comes from the distribution of real data p_data rather than from the generator distribution p_g. The discriminator D takes two inputs, from G(z) and from p_data. D is trained to maximize the probability of assigning the correct label to both the real and the synthesized examples. Specifically, the goal of D is to accurately classify its input samples by giving a label of 1 to real samples and a label of 0 to synthetic ones. D solves a binary classification problem based on a neural network with a sigmoid output, giving values in the range [0, 1]. Then, G is simultaneously trained to minimize log(1 − D(G(z))). These two adversarial networks, G and D, are represented with the value function V(D, G) as follows:

\[ \min_G \max_D V(D, G) \]  (16.1)

\[ V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \]  (16.2)

where E[log D(x)] is the objective function of the discriminator, representing the entropy of the real data distribution p_data passing through D, which tries to maximize E[log D(x)]. Note that the objective function of the discriminator is maximized when both real and synthetic samples are accurately classified as 1 and 0, respectively. The objective function of the generator is E[log(1 − D(G(z)))], representing the entropy of the random noise samples z passing through G to generate the synthetic samples, or fake data, which then pass through D; G tries to minimize E[log(1 − D(G(z)))]. The goal of the generator's objective function is to fool D into making a wrong classification by encouraging D to identify the synthetic samples as real ones, i.e., to label the synthetic samples as 1. In other words, it attempts to minimize the likelihood that D classifies these samples correctly as fake data. Thus, E[log(1 − D(G(z)))] is reduced when the synthesized samples are wrongly labeled as 1.
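The value function in Eqs. (16.1)-(16.2) can be written directly as a pair of loss terms. The following is a minimal PyTorch-style sketch, assuming G and D are modules where D ends in a sigmoid and outputs one probability per sample; it illustrates the objective and is not the chapter's implementation.

```python
import torch

def discriminator_loss(D, G, x_real, z):
    """Negative of the inner maximization in Eq. (16.2):
    D maximizes E[log D(x)] + E[log(1 - D(G(z)))]."""
    eps = 1e-8  # numerical stability for the logarithms
    d_real = D(x_real)
    d_fake = D(G(z).detach())  # do not backpropagate into G here
    return -(torch.log(d_real + eps).mean()
             + torch.log(1.0 - d_fake + eps).mean())

def generator_loss(D, G, z, non_saturating=True):
    """G minimizes E[log(1 - D(G(z)))]; the non-saturating variant
    maximizes E[log D(G(z))] instead."""
    eps = 1e-8
    d_fake = D(G(z))
    if non_saturating:
        return -torch.log(d_fake + eps).mean()
    return torch.log(1.0 - d_fake + eps).mean()
```

The non-saturating option anticipates the practical modification discussed next.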
However, in practice, G is poor at generating synthetic samples in the early training stage, making it too easy for D to separate the synthesized samples from G and the real samples from the dataset, due to their great difference. To solve this problem, the generator's objective function can be changed from minimizing log(1 − D(G(z))) to maximizing log D(G(z)), which provides enough gradient for G. This alternative objective function of the generator provides a stronger gradient in the early stage of the generator's training. As the two objective functions are distinct, the two networks are trained together by alternating the gradient updates following the standard gradient rule with a momentum parameter. There are two main procedures for training the G and D networks to update their gradients alternately. The first step is to freeze G and train only D. This alternating gradient update is motivated by the fact that the discriminator needs to learn the outputs of the generator in order to separate the real data from the fake ones. Thus, the generator is required to be frozen. The discriminator network can be updated as shown in the following equation:

\[ \nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \Big[ \log D\big(x^{(i)}\big) + \log\big(1 - D\big(G\big(z^{(i)}\big)\big)\big) \Big] \]  (16.3)

Specifically, the gradient updates are different for the two networks: D uses stochastic gradient ascent, while G uses stochastic gradient descent. D uses a hyperparameter k, the number of update steps performed for each step of G. To update D, stochastic gradient ascent performs the update k times to increase the likelihood that D accurately labels both kinds of samples (fake and real data). These updates are achieved using backpropagation on two equally sized batches, one of real and one of synthesized examples. Let the m noise samples {z(1), z(2), …, z(m)} be drawn from the generator prior pg(z) and the m real examples {x(1), x(2), …, x(m)} from the real data distribution pdata. Once D is updated, only G is trained to update its gradient, as shown in the following equation:

\[ \nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log\big(1 - D\big(G\big(z^{(i)}\big)\big)\big) \]  (16.4)

Concerning the update of the generator network, m noise samples are input into G only once to generate m synthesized examples. G uses stochastic gradient descent to minimize the likelihood that D labels the synthesized samples correctly. The generator's objective function aims to minimize log(1 − D(G(z))) in order to boost the likelihood that synthetic examples are classified as real examples. This process computes the gradients during backpropagation for both networks. Still, it only updates the parameters of G. D is kept constant during the training of G to prevent the possibility that G might never converge.
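The alternating procedure of Eqs. (16.3) and (16.4) translates into a short training loop. Below is a minimal PyTorch sketch, assuming D ends in a sigmoid and outputs one score per sample and that the data loader yields batches of real images; the optimizer settings and the reuse of the same real batch for the k discriminator steps are simplifications for illustration.

```python
import torch

def train_gan(G, D, data_loader, noise_dim, k=1, epochs=1, lr=2e-4):
    """Alternating updates: k discriminator steps (gradient ascent, written
    here as descent on the negated objective via BCE) per generator step."""
    opt_d = torch.optim.SGD(D.parameters(), lr=lr, momentum=0.9)
    opt_g = torch.optim.SGD(G.parameters(), lr=lr, momentum=0.9)
    bce = torch.nn.BCELoss()
    for _ in range(epochs):
        for x_real in data_loader:
            m = x_real.size(0)
            ones, zeros = torch.ones(m, 1), torch.zeros(m, 1)
            # --- k updates of D while G is frozen (Eq. 16.3) ---
            for _ in range(k):
                z = torch.randn(m, noise_dim)
                d_loss = bce(D(x_real), ones) + bce(D(G(z).detach()), zeros)
                opt_d.zero_grad(); d_loss.backward(); opt_d.step()
            # --- one update of G while D is held constant (Eq. 16.4) ---
            z = torch.randn(m, noise_dim)
            g_loss = bce(D(G(z)), ones)  # non-saturating form: maximize log D(G(z))
            opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```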
16.2.2 The literature of video anomaly detection based on generative adversarial network
Here we review recent literature works that use GANs for anomaly detection in crowds. Three outstanding video anomaly detection works, ordered by publication year, are described as follows.

16.2.2.1 Cross-channel generative adversarial networks
Starting with the work proposed by Ravanbakhsh et al. [20], this approach applies conditional GANs (cGANs), in which the generator G and the discriminator D are both conditioned on the real data, and also relies on the idea of image-to-image translation [35]. Following the characteristics of cGANs, the input image x is fed to G to produce a generated image p that looks realistic. G attempts to deceive D into believing that p is real, while D strives to distinguish p from the real data. This paper states that the U-Net structure [36] in the generative network and a patch discriminator (Markovian discriminator) benefit the transformation of images into different representations (e.g., spatial to temporal representations). Thus, the authors adopt this concept to translate the appearance of a frame into the motion of optical flow, aiming to learn only the normal patterns. To detect an abnormality, they compare the generated image with the real image by using a simple pixel-by-pixel difference along with a network pretrained on ImageNet [37]. The framework of anomaly detection in videos using cGANs during testing is shown in Fig. 16.3. More specifically, the authors train two networks: N^{F→O}, which uses frames F to generate optical flow O, and N^{O→F}, which uses optical flow O to generate frames F. Assume that F_t is the frame of the training video sequence with RGB channels at time t and O_t is the optical flow containing three channels (horizontal, vertical, and magnitude). O_t is obtained from two consecutive frames, F_t and F_{t+1}, following the computation in Ref. [38]. As both the generative and discriminative models are conditional networks, G generates output from its inputs, consisting of an image x and a noise vector z, providing a synthetic output image p = G(x, z). In the case of N^{F→O}, x is assigned as the current frame, x = F_t; hence, its corresponding optical flow (the target of the synthetic output image p) is represented as y = O_t. D takes a pair of inputs, either (x, y) or (x, p), and yields the probability of the class to which the pair belongs. The loss functions are defined as a reconstruction loss L_L1 and a conditional GAN loss L_cGAN, as shown in Eqs. (16.5) and (16.6), respectively. For N^{F→O}, L_L1 is determined with the training set X = {(F_t, O_t)} as

\[ \mathcal{L}_{L1}(x, y) = \lVert y - G(x, z) \rVert_1 \]  (16.5)

whereas L_cGAN is assigned as

\[ \mathcal{L}_{cGAN}(D, G) = \mathbb{E}_{(x,y) \in X}[\log D(x, y)] + \mathbb{E}_{x \in \{F_t\},\, z \in Z}[\log(1 - D(x, G(x, z)))] \]  (16.6)

Fig. 16.3 A framework of video anomaly detection using conditional generative adversarial nets (cGANs) during testing in Ref. [20]. There are two generator networks: (i) producing a corresponding optical flow image from its input frames and (ii) reconstructing an appearance from a real optical flow image.

In contrast, the training set of N^{O→F} is X = {(O_t, F_t)}_{t=1}^{N}. Once the training is finished, the only model used during testing is G, consisting of the G^{F→O} and G^{O→F} networks. Neither network is able to reconstruct abnormalities, since they have been trained with only normal events. Then, the abnormality can be found by subtracting pixels to obtain the difference between O and p_O, ΔO = O − p_O, where p_O is the optical flow reconstruction obtained from F, i.e., G^{F→O}(F). The other network, G^{O→F}(O), produces the appearance reconstruction p_F. However, ΔO provides more information than the difference between F and p_F. In this case, the authors added an additional network to find the difference from a semantic perspective, ΔS, by using AlexNet [39] with its fifth convolutional layer h, defined as ΔS = h(F) − h(p_F). These two differences, ΔO and ΔS, are normalized to [0, 1] and combined into an abnormality map. Finally, the final abnormality heatmap is obtained by summing the normalized semantic difference map N_S and the normalized optical flow difference map N_O, A = N_S + λN_O, where λ = 2.
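A minimal sketch of this final fusion step is shown below, assuming the real and generated optical flow and the two semantic feature maps are already available as NumPy arrays of compatible sizes; λ = 2 follows the text, while the use of summed absolute differences is an illustrative choice.

```python
import numpy as np

def normalize01(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

def abnormality_heatmap(flow_real, flow_gen, feat_real, feat_gen, lam=2.0):
    """Combine the optical-flow difference (Delta O) and the semantic
    difference (Delta S) into the final map A = N_S + lam * N_O."""
    delta_o = np.abs(flow_real - flow_gen).sum(axis=-1)  # per-pixel flow error
    delta_s = np.abs(feat_real - feat_gen).sum(axis=0)   # per-location feature error
    # In practice the coarser semantic map would be upsampled to the image size.
    return normalize01(delta_s) + lam * normalize01(delta_o)
```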
16.2.2.2 Future frame prediction based on generative adversarial network
Apart from the above work, there is an approach for the video future frame prediction of abnormalities based on GAN, proposed by Liu et al. [21]. This work is motivated by the fact that most anomaly detection methods are based on minimizing the reconstruction error of the training data. Instead, the authors proposed unsupervised feature learning for video prediction and leveraged the difference between their predicted frame and the real data for anomaly detection. The framework of video future prediction for detecting anomalies is shown in Fig. 16.4. In the training stage, only normal events are learned, since they are considered predictable patterns, using both appearance and motion constraints. Then, during testing, all frames are input and compared with the predicted frame. If the input frame agrees with the predicted frame, it is a normal event; if it does not, it becomes an anomalous event.

Fig. 16.4 Video future prediction framework for anomaly detection [21]. A U-Net structure and a pretrained FlowNet are used to predict a target frame and to obtain optical flow, respectively. Adversarial training is used to distinguish whether a predicted frame is real or fake.

Using a good predictor is the key in this work; thus, the U-Net network [36] is chosen due to its performance in translating images with the GAN model. In mathematical terms, consider a video sequence containing t frames I_1, I_2, …, I_t. In this work, the future frame is defined as I_{t+1}, while the predicted future frame is Î_{t+1}. The goal is to make Î_{t+1} close to I_{t+1}, so as to determine whether Î_{t+1} is an abnormal or normal event, by minimizing their distance in terms of intensity and gradient. In addition, optical flow is used to represent the temporal features between the frames I_{t+1} and I_t, and between Î_{t+1} and I_t. We first take a look at the generator objective function L_G, consisting of appearance (intensity L_int and gradient L_gd), motion L_op, and adversarial training L^G_adv terms, in Eq. (16.7):

\[ \mathcal{L}_G = \lambda_{int} \mathcal{L}_{int}(\hat{I}_{t+1}, I_{t+1}) + \lambda_{gd} \mathcal{L}_{gd}(\hat{I}_{t+1}, I_{t+1}) + \lambda_{op} \mathcal{L}_{op}(\hat{I}_{t+1}, I_{t+1}, I_t) + \lambda_{adv} \mathcal{L}^{G}_{adv}(\hat{I}_{t+1}) \]  (16.7)

The discriminator objective function L_D is defined in Eq. (16.8):

\[ \mathcal{L}_D = \mathcal{L}^{D}_{adv}(\hat{I}_{t+1}, I_{t+1}) \]  (16.8)

The authors followed the work in Ref. [40] by using intensity and gradient differences. Specifically, the intensity and gradient penalties ensure the similarity of all pixels and the sharpness of the generated images, respectively. Suppose Î_{t+1} is denoted Î and I_{t+1} is denoted I. The ℓ2 distance between Î and I is minimized in intensity to guarantee similarity in the RGB space, as shown in the following equation:

\[ \mathcal{L}_{int}(\hat{I}, I) = \lVert \hat{I} - I \rVert_2^2 \]  (16.9)

Then the gradient loss is defined as follows [40] in Eq. (16.10):

\[ \mathcal{L}_{gd}(\hat{I}, I) = \sum_{i,j} \Big( \big\lVert\, |\hat{I}_{i,j} - \hat{I}_{i-1,j}| - |I_{i,j} - I_{i-1,j}| \,\big\rVert_1 + \big\lVert\, |\hat{I}_{i,j} - \hat{I}_{i,j-1}| - |I_{i,j} - I_{i,j-1}| \,\big\rVert_1 \Big) \]  (16.10)

where i and j index the pixels of the frame. Then, optical flow estimation is applied by using a pretrained network, FlowNet [41], denoted as f. The temporal loss is defined in the following equation:

\[ \mathcal{L}_{op}(\hat{I}_{t+1}, I_{t+1}, I_t) = \big\lVert f(\hat{I}_{t+1}, I_t) - f(I_{t+1}, I_t) \big\rVert_1 \]  (16.11)
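To make the generator-side objective concrete, the sketch below implements the intensity and gradient terms of Eqs. (16.9)-(16.10) and assembles Eq. (16.7) in PyTorch. The flow network f and the adversarial term are assumed to be provided elsewhere, the losses are averaged rather than summed (a constant scaling), and the λ weights are illustrative defaults rather than the paper's values.

```python
import torch

def intensity_loss(pred, target):
    """Eq. (16.9): squared l2 distance in RGB space (here averaged per pixel)."""
    return torch.mean((pred - target) ** 2)

def gradient_loss(pred, target):
    """Eq. (16.10): l1 distance between horizontal/vertical image gradients."""
    def grads(img):
        gx = torch.abs(img[..., :, 1:] - img[..., :, :-1])
        gy = torch.abs(img[..., 1:, :] - img[..., :-1, :])
        return gx, gy
    pgx, pgy = grads(pred)
    tgx, tgy = grads(target)
    return torch.mean(torch.abs(pgx - tgx)) + torch.mean(torch.abs(pgy - tgy))

def generator_objective(pred, target, prev, flow_net, adv_loss,
                        lam_int=1.0, lam_gd=1.0, lam_op=2.0, lam_adv=0.05):
    """Eq. (16.7): weighted sum of appearance, motion, and adversarial terms."""
    l_op = torch.mean(torch.abs(flow_net(pred, prev) - flow_net(target, prev)))
    return (lam_int * intensity_loss(pred, target)
            + lam_gd * gradient_loss(pred, target)
            + lam_op * l_op
            + lam_adv * adv_loss(pred))
```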
In the manner of adversarial networks, the training is an alternating update. The U-Net is used as the generator, while a patch discriminator is used as the discriminator, following Ref. [35]. To train the discriminator D, they assign a label of 0 to a fake image and a label of 1 to a real image. The goal of D is to categorize the real future frame I_{t+1} into class 1 and the predicted future frame Î_{t+1} into class 0. During the training of the discriminator D, the weights of G are fixed, and the mean square error (MSE) loss function, denoted L_MSE, is used. Hence, the adversarial loss of D can be defined in the following equation:

\[ \mathcal{L}^{D}_{adv}(\hat{I}, I) = \sum_{i,j} \frac{\mathcal{L}_{MSE}\big(D(I)_{i,j}, 1\big)}{2} + \sum_{i,j} \frac{\mathcal{L}_{MSE}\big(D(\hat{I})_{i,j}, 0\big)}{2} \]  (16.12)

where i and j index the patches. The MSE loss function L_MSE is defined in the following equation:

\[ \mathcal{L}_{MSE}(\hat{Y}, Y) = (\hat{Y} - Y)^2 \]  (16.13)

where the value of Y is in [0, 1], while Ŷ ∈ [0, 1]. In contrast, the objective of the generator G is to reconstruct images that fool D into labeling them as 1. The weights of D are fixed during the training of G. Thus, the adversarial loss of G is defined as shown in the following equation:

\[ \mathcal{L}^{G}_{adv}(\hat{I}) = \sum_{i,j} \frac{\mathcal{L}_{MSE}\big(D(\hat{I})_{i,j}, 1\big)}{2} \]  (16.14)

To conclude, the appearance, motion, and adversarial training ensure that normal events are generated. The events with a great difference between the prediction and the real data are classified as abnormalities.

16.2.2.3 Cross-channel adversarial discriminators
Continuing from Ref. [20], the same authors proposed another GAN-based approach [22] for abnormality detection in crowd behavior. The training procedure is the same as in Ref. [20]: only the frames of normal events are used to train the cross-channel networks based on conditional GANs, engaging G to translate the raw pixel image into optical flow, inspired by Isola et al. [35]. This paper takes advantage of the U-Net framework [36] for translating one image into another and for taking multichannel data, i.e., spatial and temporal representations, into account, similarly to Refs. [20] and [21]. The novel part is in the testing, where the authors proposed an end-to-end framework without additional classifiers, using the learned discriminator as the classifier of abnormalities. The framework of the cross-channel adversarial discriminators is shown in Fig. 16.5. Briefly, G and D are simultaneously trained only on the frames of normal events. G generates the synthetic image from the learned normal events, while D learns to differentiate whether its input belongs to the normal events or not based on the data distribution, with abnormal events being defined as outliers in this sense. During testing, only D is used directly to classify anomalies in the scene. In this way, there is no need to reconstruct images at testing time, unlike the common GAN-based models [20, 21] that use G in testing.

Fig. 16.5 Cross-channel adversarial discriminators flow diagram with additional detail on parameters following [22]. Two generator networks are used during training: (i) generating a corresponding optical flow image and (ii) reconstructing an appearance. At testing time, only the discriminative networks are used, represented as a learned decision boundary to detect anomalies.

Specifically, there are two networks used for training: N^{F→O} and N^{O→F}. Suppose that F_t is a frame (at time t) of a video sequence and O_t is the optical flow acquired from two consecutive frames, F_t and F_{t+1}, following the computation of the optical flow-based theory for warping [38]. In this work, G and D are both conditional networks. G takes an image x and a noise vector z to output a synthetic optical flow r = G(x, z). For N^{F→O}, let x be the current frame, x = F_t.
Then, the generation of its corresponding optical flow r can be represented as y = O_t. Conversely, D takes a pair of inputs, either (x, y) or (x, r), to obtain the probability of the class to which the pair belongs. The reconstruction loss L_L1 and the conditional GAN loss L_cGAN can be obtained as follows. In the case of N^{F→O}, L_L1 is determined with the training set X = {(F_t, O_t)}, as shown in the following equation:

\[ \mathcal{L}_{L1}(x, y) = \lVert y - G(x, z) \rVert_1 \]  (16.15)

whereas L_cGAN is represented in the following equation:

\[ \mathcal{L}_{cGAN}(D, G) = \mathbb{E}_{(x,y) \in X}[\log D(x, y)] + \mathbb{E}_{x \in \{F_t\},\, z \in Z}[\log(1 - D(x, G(x, z)))] \]  (16.16)

In contrast, the training set of N^{O→F} is X = {(O_t, F_t)}_{t=1}^{N}. Note that all training procedures are the same as in Ref. [20]. G acts as implicit supervision for D. Both the G^{F→O} and G^{O→F} networks lack the ability to reconstruct the abnormal events, because they observe only the normal events during training, while D^{F→O} and D^{O→F} have learned the patterns needed to distinguish real data from artifacts. The discriminator is considered as the learned decision boundary that splits the densest area (i.e., the normal events x3) from the rest (i.e., abnormal events x1 and generated images x2). Since the goal is to detect the abnormal events x1, the samples lying outside the decision boundary are judged as outliers by D. During testing, the authors focus only on the discriminative networks for the two-channel transformation tasks. The patch-based discriminators D̂^{F→O} and D̂^{O→F} are applied to the test frame F and its corresponding optical flow O with the same 30 × 30 grid, resulting in two 30 × 30 score maps, represented as S_O for D̂^{F→O} and S_F for D̂^{O→F}. In detail, a patch p_F on F and a patch p_O on O are input to D̂^{F→O}. Any abnormal events occurring in these patches (p_F and/or p_O) are considered outliers according to the distribution learned by D̂^{F→O}, resulting in a low probability score of D̂^{F→O}(p_F, p_O). To finalize the anomaly maps, the normalized channel score maps are fused with equal weights, S = S_O + S_F, in the range [0, 1], and a range of thresholds is then applied to compute the ROC curves. We notice some interesting points in this work: (i) the authors state that D^{F→O} provides higher performance than D^{O→F}, since the input of D^{F→O} is the real frame, which contains more information than the optical flow frame; (ii) their proposed end-to-end framework is simpler and also faster in testing than Ref. [20], since it does not require the generative models during testing or any additional classifiers added on top of the model, such as a pretrained AlexNet [39]. The observations from these early works inspire us to propose our method [26], which we discuss in Section 16.3.2.
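The testing-time fusion of the two discriminator score maps and the ROC computation can be sketched as follows, assuming the per-frame 30 × 30 score maps and binary frame labels are available as NumPy arrays; taking the maximum of 1 − S as the frame-level score is an illustrative choice rather than a detail specified in the text.

```python
import numpy as np
from sklearn.metrics import roc_curve

def normalize01(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-12)

def fuse_score_maps(s_o, s_f):
    """S = S_O + S_F with equal weights, rescaled to [0, 1]; low discriminator
    scores indicate patches that are likely abnormal."""
    return normalize01(normalize01(s_o) + normalize01(s_f))

def frame_level_roc(score_maps, frame_labels):
    """One anomaly score per frame (here the strongest low-confidence patch),
    evaluated against binary ground truth over a range of thresholds."""
    frame_scores = [float((1.0 - fuse_score_maps(s_o, s_f)).max())
                    for s_o, s_f in score_maps]
    return roc_curve(frame_labels, frame_scores)
```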
16.3 Training a generative adversarial network
16.3.1 Using generative adversarial network based on the image-to-image translation
The observation of anomalous objects using an unsupervised learning approach can be considered a structured problem of the reconstruction model, known as the per-pixel classification or regression problem. The common framework used to solve this type of problem, and the one explored by the state-of-the-art works in the anomaly detection task [20–22], is the generative image-to-image translation network constructed by Isola et al. [35]. In general, they use this network to learn an optimal mapping from the input to the output image based on the objective function of GAN. Their experimental results show that the network is good at generating synthesized images in tasks such as colorization, object reconstruction from edges, and translation from label maps. From an overall perspective, based on the original GANs [27], the input of G consists of the image x and the noise vector z, and the mapping of the original GANs is represented as G: z → y (learning from z to the output image y). In contrast, conditional GANs learn from two inputs, x and z, to y, represented as G: {x, z} → y. However, z is not strictly necessary for the network, as G can still learn the mapping without z; in particular, G tends to learn to ignore z early in training. Thus, the authors decided to use z in the form of dropout in both the training and testing processes. Considering the objective function, since the architecture of the image-to-image translation network is based on conditional GANs, its objective functions are the same as those we explained for Refs. [20] (see Eqs. 16.5 and 16.6) and [22] (see Eqs. 16.15 and 16.16), where two inputs are required for the discriminator, D(x, y), and an L1 loss is used to help output a sharper image. On the other hand, the future frame prediction work [21] applied an unconditional GAN that uses only one input for the discriminator, D(y) (see Eq. 16.12). It relied on traditional L2 regression (the MSE loss function) to tie the output to the input. This forced condition results in lower performance (i.e., blurrier images) in frame-level anomaly detection compared to Refs. [20] and [22].

16.3.2 Unsupervised learning of generative adversarial network for video anomaly detection
In this section, we introduce our proposed method, named DSTN [26]. We take advantage of the image-to-image translation architecture with the U-Net network [36] to translate the spatial domain into the temporal domain. In this way, we can obtain comprehensive information on the objects from both appearance and motion information (optical flow). The proposed DSTN differs from the previous works [20–22] in that we focus on only one deep spatiotemporal translation network to enhance the anomaly detection performance at the frame level and the challenging anomaly localization at the pixel level with regard to accuracy and computational time. Specifically, we include preprocessing and postprocessing stages to assist the learning of the GAN without using any pretrained network to help in the classification, making the DSTN faster and more flexible. Besides, we differ from Ref. [35] in that our target output is the motion information of the object corresponding to its appearance, not realistic images. There are two main procedures for each of the training and testing phases. For training, a feature collection and a spatiotemporal translation play essential roles in sufficiently collecting information and effectively learning the model, respectively. Then, a differentiation and an edge wrapping are utilized at testing time. We shall explain the main components of our proposed method in detail, including the system overview for both training and testing, as follows.

16.3.2.1 System overview
We first start with the system overview of our DSTN. The DSTN is based on GAN and combined with preprocessing and postprocessing procedures to improve the performance in learning normalities and localizing anomalies. Overall, the main components of the DSTN are fourfold: a feature collection, a spatiotemporal translation, a differentiation, and an edge wrapping.
The feature collection is a key initial process for extracting the appearances of objects. These features are fed into the model to learn the normal patterns. In our case, the generator G is used in both training and testing time, while the discriminator D only at the training time. During training, G learns the normal patterns from the Generative adversarial network for video anomaly detection training videos. Hence it understands and has only the knowledge of what normal patterns look like. The reason why we feed only the frames of normal patterns is that we need the model to be flexible and able to handle all possible anomalous events in realworld environments without labels of anomalies. In testing, all videos, including normal and abnormal events, are input into the model where G tries to reconstruct the appearance and the motion representation from the learned normal events. Since G has not learned any abnormal samples, it is unable to reconstruct the abnormal area properly. We then take this inability to correctly reconstruct anomalous events to detect the anomalies in the scene. The anomalies can be exposed by subtracting pixels in the local area between the synthesized image and the real image and then applying the edge wrapping at the final stage to achieve precise edges of the abnormal objects. Specifically, during training, only normal events of original frames f are input with background removal frames fBR into the generative network G, which contains encoder En and decoder De, to generate dense optical flow (DIS) frames OFgen representing the motion of the normal objects. To attain good optical flow, the real DIS optical flow frame OFdis and fBR are fused to eliminate noise that frequently occurs in OFdis, giving Fused Dense Optical Flow frames OFfus. The patches of f and fBR are concatenated and fed into G to produce the patches of OFgen, while D has two alternate inputs, the patches of OFfus (real optical flow image) and OFgen (synthetic optical flow image), and tries to discriminate whether OFgen is fake or real. The training framework of DSTN is shown in Fig. 16.6. After training, the DSTN model understands a mapping of the appearance representation of normal events to its corresponding dense optical flow (motion representation). All parameters used in training are also used in testing. During testing, the unknown events from the testing videos will be reconstructed by G. However, the reconstruction of G provides results of unstructured blobs based on its knowledge of the learned normal patterns. Thus, these unstructured blobs are considered as anomalies. To capture the anomalies, the differentiation is computed by subtracting the patches of OFgen with the patches of OFfus. Note that not only the anomaly detection is significantly essential for real-world use, but also the anomaly localization. Therefore, edge wrapping (EW) is proposed to obtain the final output by retaining only the actual edges corresponding to the real abnormal objects and suppressing the rest. The DSTN framework at the testing time is shown in Fig. 16.7. 16.3.2.2 Feature collection We explain our proposed DSTN based on its training and testing time. During training, even GAN is good at data augmentation and image generation on small datasets, it still desires for sufficient features (e.g., appearance and motion features of objects) from data examples to feed its data-hungry characteristics of the deep learning-based model. 
The importance of feature extraction is recognized and represented as the preprocessing procedure before learning the model.

Fig. 16.6 Training framework of the proposed DSTN.
Fig. 16.7 Testing framework of the proposed DSTN [26].

There are several procedures in the feature collection, including (i) background removal, (ii) fusion, (iii) patch extraction, and (iv) concatenated spatiotemporal features, as described below.

In (i) background removal, we take only the moving foreground objects into account because we focus on the real situation from CCTV cameras. Thus the static background is ignored in this sense. This method benefits the extraction of object features and removes irrelevant pixels in the background, keeping only the important appearance information. Let f_t be the current frame of the video at time t and f_{t-1} be the previous frame. The background removal f_{BR} is computed using the frame absolute difference, as shown in the following equation:

f_{BR} = | f_t - f_{t-1} |     (16.17)

After computing the frame absolute difference, the background removal output is binarized and then concatenated with the original frame f to acquire more information on the appearance, assisting the learning of the generator. It can simply be concluded that more input features mean better performance of the generator. The significance of the concatenated f_{BR} and f frames is that they deliver extra features on the appearance of foreground objects: f contains all the information, whereas f_{BR} may lose some of it during the subtraction process.

For (ii) fusion, according to the literature on GAN for video anomaly detection [20–22], all of these works apply the theory for warping [38] to represent the temporal features. However, it has problems in capturing all information on the objects and also has high time complexity. Since the development of video anomaly detection requires reliable performance in terms of accuracy and running time, the theory for warping is not suitable for this task. To achieve the best performance on motion representation, we use dense inverse search (DIS) [42] to represent the motion features of foreground objects in surveillance videos, due to its high accuracy and low time complexity in detecting and tracking objects. The DIS optical flow OF_{dis} can be obtained from two consecutive frames f_t and f_{t-1}, as shown in Fig. 16.8, where the resolution of the f_t and f_{t-1} frames is 238 × 158 pixels, following the UCSD Ped1 dataset [4]. The number of channels c_p for the f_t and f_{t-1} frames is 1 (c_p = 1), while c_p of OF_{dis} is 3 (c_p = 3). However, OF_{dis} contains noise dispersed in the background as well as around the objects, as shown in Fig. 16.9. Thus, we propose a novel fusion between OF_{dis} and f_{BR}, using the good foreground object from f_{BR} together with OF_{dis}, to acquire both appearance and motion information and to assist noise reduction in OF_{dis}. The fusion provides a clean background and explicit foreground objects. Fig. 16.9 clearly shows that the fusion effectively helps to remove noise from OF_{dis}.

Fig. 16.8 Dense inverse search optical flow framework.
Fig. 16.9 Fusion between background subtraction and real optical flow.
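Before giving the exact fusion equation, the following minimal sketch illustrates the background-removal step of Eq. (16.17) and the DIS optical flow computation described above. It assumes OpenCV (cv2) and NumPy, grayscale input frames, and an illustrative binarization threshold; the function names are ours and do not come from Ref. [26].

```python
import cv2
import numpy as np

def background_removal(frame_t, frame_t_prev, thresh=25):
    """Eq. (16.17): f_BR = |f_t - f_{t-1}|, followed by binarization.

    Both inputs are grayscale uint8 frames. The chapter states that f_BR is
    binarized but does not give the threshold, so `thresh` is an assumption.
    """
    diff = cv2.absdiff(frame_t, frame_t_prev)
    _, f_br = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    return f_br

def dis_optical_flow(frame_t_prev, frame_t):
    """Dense inverse search (DIS) optical flow [42] between two consecutive
    grayscale frames; returns an H x W x 2 field of (dx, dy) vectors."""
    dis = cv2.DISOpticalFlow_create()   # speed/accuracy presets are also available
    return dis.calc(frame_t_prev, frame_t, None)

def flow_to_color(flow):
    """Render the 2-channel flow field as the 3-channel color image OF_dis
    (hue encodes direction, brightness encodes magnitude)."""
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
    hsv[..., 0] = (ang * 180 / np.pi / 2).astype(np.uint8)
    hsv[..., 1] = 255
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```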
Specifically, the noise reduction is implemented efficiently by observing the f_{BR} values, which equal 0 or 255, and then masking OF_{dis} with f_{BR} to change its values. Let ζ be a constant value. The new output, a fused version of OF_{dis} denoted OF_{fus}, is defined in Eq. (16.18):

OF_{fus} = OF_{dis} \cdot \lfloor f_{BR} / (f_{BR} + \zeta) \rfloor     (16.18)

Apart from (i) background removal and (ii) fusion, (iii) patch extraction plays a part in the feature collection process and supports acquiring more spatial and temporal features of the moving foreground objects in local pixels. By doing so, it can achieve better information than directly extracting features from the full image. Patch extraction is implemented using the full-size moving foreground object appearance in the current frame f along with its direction, motion, and magnitude from the frame-by-frame dense optical flow image. We normalize all patch elements to the range [−1, 1]. The patch size is defined as (w/a) × h × c_p, where w is the frame width, h is the frame height, a is a scale value, and c_p is the number of channels. A sliding-window method with a stride d is applied to the input frames of the generator G (i.e., f and f_{BR}) and the discriminator D (i.e., OF_{fus}). Fig. 16.10 shows examples of patch extraction. We extract the patches with the scale value a = 4 and d = w/a to obtain more local information from the spatial and temporal representations. Then, each extracted patch image is scaled up to a 256 × 256 full image to gain more information on the appearance from the semantic information and is input into the model for further processing.

Fig. 16.10 Examples of patch extraction on a spatial frame.

The final process of the feature collection is (iv) the concatenation of spatiotemporal features for data preparation. We input the appearance information to the generative model to output the motion information at both training and testing time, as shown in Fig. 16.11. Since providing sufficient feature inputs to G is significantly important for producing good corresponding optical flow images, the patches of f and f_{BR} are concatenated to cover all possible low-level appearance information of the normal patterns for G to understand and learn them extensively. More specifically, f_{BR} provides precise foreground object contours, whereas f provides inclusive knowledge of the whole scene. The sizes of the input and target images are fixed to the 256 × 256 full image as the default value in our proposed framework. The concatenated frames have two channels (c_p = 2) and the temporal target output has three channels (c_p = 3). As a final point, this concatenation process underpins the ability of the spatiotemporal translation model to learn its desired temporal target output.

Fig. 16.11 Overview of our data preparation showing spatiotemporal input features (concatenated patches) and output feature (generated dense optical flow patch).

16.3.2.3 Spatiotemporal translation model
In this section, we describe our training structure, a GAN based on the U-Net architecture [36], for translating the spatial inputs (f and f_{BR}) to the temporal output (OF_{gen}), and we present how the interplay between the generative and discriminative networks works during training. The details of the proposed spatiotemporal translation model are explained in the following. The generative network G performs an image transformation from the concatenated f and f_{BR} appearances to the OF_{gen} motion representations.
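Before detailing the generator and discriminator, the sketch below illustrates the fusion of Eq. (16.18) and the sliding-window patch extraction just described, assuming NumPy/OpenCV and the binarized 0/255 form of f_BR. The value of ζ is not stated in the chapter, so it is set arbitrarily here, and the bracketed term is implemented as a 0/1 foreground indicator, which is what the masking is meant to achieve.

```python
import cv2
import numpy as np

def fuse(of_dis, f_br, zeta=1.0):
    """Eq. (16.18): keep OF_dis only where the foreground mask f_BR is active.

    With f_BR in {0, 255}, the bracketed term acts as a 0/1 indicator:
    0 on background pixels and 1 on foreground pixels. np.ceil is used so
    the indicator behaves this way for any small positive zeta.
    """
    f = f_br.astype(np.float32)
    mask = np.ceil(f / (f + zeta))                  # 0 where f_BR = 0, 1 where f_BR = 255
    return of_dis.astype(np.float32) * mask[..., None]

def extract_patches(image, a=4):
    """Sliding-window patch extraction: patch size (w/a) x h with stride d = w/a."""
    h, w = image.shape[:2]
    d = w // a
    return [image[:, x:x + d] for x in range(0, w - d + 1, d)]

def prepare_patch(patch, size=256):
    """Resize a patch to the 256 x 256 model input and scale it to [-1, 1]."""
    patch = cv2.resize(patch, (size, size))
    return patch.astype(np.float32) / 127.5 - 1.0
```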
Generally, there are two inputs to G, an image x and noise z, used to generate an image e as the output with the same size as x but different channels, e = G(x, z) [27, 43, 44]. In our case, however, the additional Gaussian noise z is not prominent for G, since G can learn to ignore z in the early stage of training. In addition, z is not that effective for transforming the spatial representation of the input into the temporal representation. Therefore, dropout [35] is applied in the decoder together with batch normalization [45] instead of z, resulting in e = G(x). On closer inspection, the full generator network, consisting of the encoder En and decoder De architectures, is constructed with skip connections, or residual connections [35], as shown in Fig. 16.12. The idea of skip connections is to link the layers from the encoder straight to the decoder, making the network easier to optimize and providing greater quality and less complexity for image translation than traditional CNN architectures, e.g., AlexNet [39] and VGG nets [46, 47].

Fig. 16.12 Generator architecture consisting of an encoder and a decoder with skip connections [26].

More specifically, let t be the total number of layers in the generative network. The skip connections are introduced at each layer i of En and layer t − i of De. The data can be transferred from the first to the final layer by integrating the channels of layer i with those of layer t − i. The architectures of En and De are illustrated in Fig. 16.13. En compresses the spatial representation of the data to a higher-level representation, while De performs the reverse process to generate OF_{gen}. En uses the Leaky-ReLU (L-ReLU) activation function. Conversely, De uses the ReLU activation function, which helps accelerate the learning of the model and saturate the color distributions [36]. To achieve accurate OF_{gen}, the objective function is assigned and optimized by the Adam optimization algorithm [48] during training.

Fig. 16.13 Encoder and decoder architectures [26].

The encoder module acts as a data compression from a high-dimensional space into a low-dimensional latent space representation that is passed to the decoder module. The first layer in the encoder is a convolution using CNNs as the learnable feature extraction; compared with the handcrafted feature approach, such learned features are better suited to capturing obscure data structures. The convolution is a linear operation implemented by covering an n × n input image I with a k × k sliding window w. The output of the convolution function on cell c of the image I is defined as shown in the following equation:

y_c = \sum_{i=1}^{k \times k} (w_i \cdot I_{i,c}) + b_c     (16.19)

where y_c is the output after the convolution and b_c is the bias. Let p be the padding and s be the stride. The output size of the convolution, O, is calculated as shown in the following equation:

O = (n - k + 2p)/s + 1     (16.20)

The convolution operation on an image cell with b = 0, n = 8, k = 3, p = 0, and s = 1 is illustrated in Fig. 16.14 for more understanding; in this case O = (8 − 3 + 0)/1 + 1 = 6, so the output is a 6 × 6 feature map.

Fig. 16.14 An example of a convolution operation on an image cell.

Once the convolution operation is completed, batch normalization is applied by normalizing the convolution output following the normal distribution as in Ref. [45], to reduce the training time and avoid the vanishing gradient problem.
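As an illustration of the encoder–decoder generator with skip connections described above, the following Keras sketch (the chapter reports a Keras/TensorFlow implementation) builds a U-Net-style generator mapping the two-channel concatenated input to the three-channel optical-flow output. The exact layer and filter counts of Ref. [26] are reported later in Section 16.4.4; the values below are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_generator(input_shape=(256, 256, 2), out_channels=3):
    """Encoder-decoder generator with skip connections (U-Net style).

    Input: concatenated f / f_BR patch (2 channels).
    Output: generated dense optical flow OF_gen (3 channels).
    """
    inp = layers.Input(shape=input_shape)

    # Encoder: Conv + BatchNorm + Leaky-ReLU, downsampling by stride 2.
    skips, x = [], inp
    for filters in (64, 128, 256, 512, 512, 512):
        x = layers.Conv2D(filters, 4, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.2)(x)          # a = 0.2 as in Eq. (16.22)
        skips.append(x)

    # Decoder: transposed Conv + BatchNorm + Dropout + ReLU, with skip
    # connections concatenating encoder layer i with decoder layer t - i.
    for filters, skip in zip((512, 512, 256, 128, 64), reversed(skips[:-1])):
        x = layers.Conv2DTranspose(filters, 4, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Dropout(0.5)(x)            # dropout plays the role of the noise z
        x = layers.ReLU()(x)
        x = layers.Concatenate()([x, skip])

    out = layers.Conv2DTranspose(out_channels, 4, strides=2,
                                 padding="same", activation="tanh")(x)
    return Model(inp, out, name="generator")
```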
Suppose y denotes the convolution output values over a mini-batch B = {y_1, y_2, …, y_m}, γ and β are learnable parameters, and ε is a constant to avoid zero variance. The normalized output S can be obtained by scaling and shifting as defined in the following equation:

S_i = \gamma \hat{y}_i + \beta \equiv BN_{\gamma,\beta}(y_i)     (16.21)

with
• Mini-batch mean: \mu_B = (1/m) \sum_{i=1}^{m} y_i
• Mini-batch variance: \sigma_B^2 = (1/m) \sum_{i=1}^{m} (y_i - \mu_B)^2
• Normalization: \hat{y}_i = (y_i - \mu_B) / \sqrt{\sigma_B^2 + \epsilon}

The final layer of En adopts an activation function to introduce a nonlinear mapping from the input to the output (response variable). The nonlinear mapping performs the transformation from one scale to another and decides whether the neurons can pass through it. The nonlinearity makes the network more expressive, resulting in a stronger model for learning complex input data. The L-ReLU is used in the proposed DSTN to avoid the vanishing gradient problem. The property of L-ReLU is to allow negative values to pass through the neuron by mapping them to small negative values of the response variable, improving the flow of gradients through the model. The L-ReLU function is defined as shown in Eq. (16.22):

f(s) = s, if s ≥ 0;   f(s) = a·s, otherwise     (16.22)

where s is the input, a is a coefficient (a = 0.2) allowing negative values to pass through the neuron, and f(s) is the response variable. Fig. 16.15 shows the graph of the input data s mapped to the response variable.

Fig. 16.15 Leaky-ReLU activation function.

Regarding the decoder module De, it is the inverse process of the encoding part, in which the residual connections are passed from the encoder to the corresponding layers of the decoder. Dropout is used in De to represent the noise vector z, as it eliminates neuron connections with a default probability, helping to prevent overfitting during training and to improve the performance of the GAN. Let h ∈ {1, 2, …, H} be the hidden layers of the network, z^{(h+1)} be the output of layer h + 1, and r^{(h)} be a random variable following a Bernoulli distribution with probability p [49]. The feed-forward operation can be described by the following equation:

r^{(h)} \sim Bernoulli(p),
\tilde{f}(s)^{(h)} = r^{(h)} * f(s)^{(h)},
z_i^{(h+1)} = w_i^{(h+1)} \cdot \tilde{f}(s)^{(h)} + b_i^{(h+1)}     (16.23)

Apart from the generator, we now discuss the discriminative network D used during training. D distinguishes the real patch OFfus (y = OFfus) from the synthetic patch OFgen (e = OFgen). As a result, D delivers a scalar output determining the probability that its input comes from the real data. In the discriminative architecture, a PatchGAN is constructed and applied to each partial image to help accelerate the training of the GAN, resulting in better performance than using a full-image discriminator at a resolution of 256 × 256 pixels. The discriminator D is implemented by subsampling the 256 × 256 OFfus image into 64 × 64 pixel patches, providing 16 patches of OFfus that pass through the PatchGAN model to classify whether OFgen is real or fake, as shown in Fig. 16.16. The reason why we use the 64 × 64 PatchGAN is that it provides good pixel accuracy and good intensity on the appearance, making the synthetic image more recognizable. Experimental results on the impact of using the 64 × 64 PatchGAN can be found in Ref. [26].

Fig. 16.16 Discriminator architecture with PatchGAN model [26].
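The sketch below gives a PatchGAN-style discriminator in Keras matching the 64 × 64 patch input and the 64 → 32 → … → 1 progression described above, with the flattened 512-unit output fed to a fully connected softmax head. The filter counts and the use of batch normalization on every layer are assumptions where the chapter does not give exact values.

```python
from tensorflow.keras import layers, Model

def build_patch_discriminator(patch_shape=(64, 64, 3)):
    """PatchGAN discriminator for 64 x 64 optical-flow patches (real vs. fake)."""
    inp = layers.Input(shape=patch_shape)
    x = inp
    for filters in (64, 128, 256, 512, 512, 512):   # 64 -> 32 -> 16 -> 8 -> 4 -> 2 -> 1
        x = layers.Conv2D(filters, 4, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.2)(x)
    x = layers.Flatten()(x)                          # 1 x 1 x 512 -> 512 neurons
    out = layers.Dense(2, activation="softmax")(x)   # FC + Softmax: [real, fake]
    return Model(inp, out, name="patch_discriminator")
```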
To define the objective function and optimization, we first discuss two objective functions used during training: a GAN loss, L_GAN, and an L1 loss (or generator loss), L_L1. Note that our proposed DSTN comprises only one translation network from the spatial (appearance) to the temporal (motion) image representation. The motion representation is computed based on the dense optical flow using arrays of horizontal and vertical components with the magnitude. Let y be the output image OFfus, x be the input image for G (the concatenated f and fBR image), and z be the additional Gaussian noise vector. Since dropout is adopted in place of z, G can be represented as G(x). The objective functions are given in Eqs. (16.24) and (16.25):

• GAN loss:
L_{GAN}(G, D) = E_y[\log D(y)] + E_x[\log(1 - D(G(x)))]     (16.24)

• L1 loss:
L_{L1}(G) = E_{x,y}[\, \|y - G(x)\|_1 \,]     (16.25)

Then, the optimization of G can be defined as in the following equation:

G^* = \arg\min_G \max_D \; L_{GAN}(G, D) + \lambda L_{L1}(G)     (16.26)

The advantage of using one spatiotemporal translation network is that it has less complexity while providing sufficiently important object features for the learning of the GAN.

16.3.3 Anomaly detection
After training, the spatiotemporal translation network has learned the transformation from the concatenated f and fBR appearance to the OFfus motion representation. All parameters from training are applied in testing. To detect anomalies, we input two consecutive frames (f_t and f_{t-1}) from the test videos to the model. During testing, G is used to reconstruct OFgen following its trained knowledge. However, since G has been trained with only the normal patterns, G is unable to regenerate the unknown events in the same way as the normal ones. We exploit this inability of the generator to reconstruct abnormal events correctly in order to detect all possible anomalies that occur in the scene. The anomalies can be exposed by subtracting the patches of OFfus and OFgen to locate the difference in local pixels. To be more accurate in object localization, edge wrapping is proposed to highlight the actual local pixels of anomalies.

For anomaly detection, differentiation is a simple and effective method to obtain abnormalities. The pixels of a patch of OFfus (real image) and a patch of OFgen (fake image) are subtracted to determine whether there are anomalous events in the scene. This differentiation is defined in the following equation:

\Delta OF = OF_{fus} - OF_{gen} > 0     (16.27)

where ΔOF is the differentiation output, whose values are greater than 0 (ΔOF > 0). The reason why ΔOF can successfully indicate abnormal events in the scene is that the differentiation between OFfus and OFgen gives a large difference in the anomalous areas, where G is unable to reconstruct the abnormal events in OFgen as they appear in OFfus (the real abnormal object from the testing video sequence). In other words, G tries to reconstruct OFgen to match OFfus, but it can only reconstruct unstructured blobs based on its knowledge of the learned normal events, making the abnormal regions of OFgen differ from OFfus. ΔOF provides a score indicating the probability that a pixel belongs to normal or abnormal events. The range of pixel values of each ΔOF from the test videos is between 0 and 1, where the highest pixel value is considered an anomaly.
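A minimal sketch of the objective functions in Eqs. (16.24)–(16.27) in TensorFlow/Keras, assuming the discriminator returns, for each patch, the probability that the patch is real. The λ weight is not stated in the chapter and is set here to a commonly used value as an assumption; the Adam settings are those reported later in Section 16.4.2.

```python
import numpy as np
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
LAMBDA = 100.0   # weight of the L1 term; assumed, not stated in the chapter

def discriminator_loss(d_real, d_fake):
    """L_GAN, Eq. (16.24): real patches should score 1, generated patches 0."""
    return bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)

def generator_loss(d_fake, of_gen, of_fus):
    """Generator objective, Eqs. (16.25)-(16.26): fool D plus lambda * L1."""
    adv = bce(tf.ones_like(d_fake), d_fake)
    l1 = tf.reduce_mean(tf.abs(of_fus - of_gen))
    return adv + LAMBDA * l1

def differentiation(of_fus, of_gen):
    """Eq. (16.27): positive per-pixel difference used to expose anomalies."""
    return np.maximum(of_fus - of_gen, 0.0)

# Adam settings reported in Section 16.4.2.
g_opt = tf.keras.optimizers.Adam(2e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
d_opt = tf.keras.optimizers.Adam(2e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
```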
To normalize the probability score from ΔOF, the maximum value M_OF of all components is computed over the range of pixel values of each test video. With this normalization, we can gradually vary the threshold on the probability scores of anomalies to define the best decision boundary for obtaining ROC curves. Suppose the position of a pixel in the image is (i, j). The normalization of ΔOF, denoted N_OF, is given in the following equation:

N_{OF}(i, j) = (1 / M_{OF}) \cdot \Delta OF(i, j)     (16.28)

However, even though the normalized differentiation N_OF reveals anomalies in the scene, some problems remain in the experimental results, such as false positive detections on normal events (i.e., a normal event is detected as an abnormal event) and overdetection of the pixels around the actual abnormal object (i.e., the area of the detected abnormal object is too large). This is because the performance of object localization is not effective enough. Therefore, we propose the edge wrapping (EW) method to overcome these problems and specifically enhance the pixel-level anomaly localization performance. Our EW is performed using the Canny edge detector [50] to preserve only the edges of the actual abnormal object and suppress the rest (e.g., noise and insignificant edges that do not belong to the abnormal object), providing precise abnormal event detection and localization. EW is a multistage process comprising three phases: a noise reduction, an intensity gradient, and a nonmaximum suppression.

To eliminate background noise and irrelevant pixels around abnormal objects, a Gaussian filter of size w_e × h_e × c_e is applied to blur the normalized differentiation N_OF, where w_e and h_e are the width and height of the filter and c_e is the number of channels (e.g., a grayscale image has c_e = 1 and a color image has c_e = 3). Our differentiation output is a grayscale image, so c_e = 1. Considering the intensity gradient, an edge gradient G_e is obtained using a gradient operator that filters the image in the horizontal direction (G_x) and the vertical direction (G_y) to obtain the gradient magnitude perpendicular to the edge direction at each pixel. The derivative filter has the same size as the Gaussian filter. The first derivative is computed as shown in Eqs. (16.29) and (16.30):

G_e = \sqrt{G_x^2 + G_y^2}     (16.29)

\theta = \tan^{-1}(G_y / G_x)     (16.30)

Then, a threshold is defined to conserve only the significant edges. This process is known as nonmaximum suppression. In this phase, the gradient magnitude at each pixel is examined to determine whether it is greater than a threshold T, where we use a value of 50 as it gives the best results, as discussed in Ref. [26]. If it is greater than T, it marks an edge point corresponding to a local maximum over all possible neighborhoods. Hence, we preserve the local maxima and suppress the rest to 0 to acquire the edges corresponding to the true anomalies. In addition, the Gaussian filter with a kernel size of w_e × h_e × c_e is applied once more to suppress noise in the image, giving the output EW used for the final anomaly localization output O_L, where ζ is a constant value.
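A minimal sketch of the normalization of Eq. (16.28) and the edge-wrapping stages described above, using OpenCV. cv2.Canny bundles the gradient computation and nonmaximum suppression of the Canny detector [50]; the Gaussian kernel size is an assumption, since the chapter gives the filter as w_e × h_e × c_e without numeric values.

```python
import cv2
import numpy as np

def normalize_diff(delta_of):
    """Eq. (16.28): scale the differentiation map by its maximum value M_OF."""
    m_of = delta_of.max()
    return delta_of / m_of if m_of > 0 else delta_of

def edge_wrapping(n_of, kernel=(5, 5), threshold=50):
    """Edge wrapping: Gaussian noise reduction, intensity gradient, and
    nonmaximum suppression with threshold T = 50, then a final smoothing."""
    gray = (n_of * 255).astype(np.uint8)              # grayscale map, c_e = 1
    blurred = cv2.GaussianBlur(gray, kernel, 0)       # noise reduction
    edges = cv2.Canny(blurred, threshold, threshold)  # gradient + NMS, T = 50
    return cv2.GaussianBlur(edges, kernel, 0)         # final Gaussian on EW
```

The final localization map O_L then masks ΔOF with the wrapped edges, as given in Eq. (16.31) below.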
The anomaly localization O_L can be computed as shown in the following equation:

O_L = \Delta OF \cdot \lfloor EW / (EW + \zeta) \rfloor     (16.31)

16.4 Experimental results
The performance of the DSTN is evaluated on the standard public benchmarks used in the video anomaly detection task: UCSD pedestrian [4], UMN [1], and CUHK Avenue [6]. These datasets are recorded in crowds containing indoor and outdoor scenes. Our experimental results are compared with various competing methods with respect to accuracy at both the frame and pixel level and to computational time. Additionally, we examine the impact of the GAN-based U-Net network with residual connections compared with another popular architecture, the autoencoder, and address the advantages and the disadvantages of GAN for anomaly detection. Each subtopic is explained in detail as follows.

16.4.1 Dataset
16.4.1.1 UCSD dataset
The UCSD pedestrian dataset [4] consists of crowds of walking pedestrians in two outdoor scenes with various anomalies, e.g., cycling, skateboarding, driving vehicles, and rolling wheelchairs. It is a well-known video benchmark for the anomaly detection task due to its complex scenes in real environments with low-resolution images. There are two subfolders, Ped1 and Ped2, where Ped stands for pedestrian. UCSD Ped1 contains 5500 normal frames in 34 training video sequences and 3400 anomalous frames in 16 testing video sequences. The image resolution of UCSD Ped1 is 238 × 158 pixels and that of UCSD Ped2 is 360 × 240 pixels for all frames. In UCSD Ped2, there are 346 frames of normal events in 16 training video sequences and 1652 frames of anomalous events in 12 testing video sequences. The characteristic of Ped2 is a crowd of pedestrians walking parallel to the camera plane. Examples of UCSD are shown in Fig. 16.17, where (A) is Ped1 and (B) is Ped2.

Fig. 16.17 UCSD pedestrian dataset: (A) Ped1 and (B) Ped2.

16.4.1.2 UMN dataset
The UMN dataset [1] is one of the publicly available benchmarks in the video anomaly detection task designed for identifying anomalies in crowds. It contains 11 videos with 7700 frames recorded in various indoor and outdoor scenarios. All frames have a resolution of 320 × 240 pixels. Both indoor and outdoor scenes show walking pedestrians as the normal event and running pedestrians as the abnormal event, as shown in Fig. 16.18. All video sequences start with walking patterns and end with running patterns.

Fig. 16.18 UMN dataset.

16.4.1.3 CUHK Avenue dataset
The CUHK Avenue dataset [6] consists of crowded scenes on a campus. The total number of frames is 30,652, consisting of 15,328 frames in 16 training videos and 15,324 frames in 21 test videos. Each video sequence has a length of 1–2 min, at 25 frames per second (fps). This dataset is challenging due to its various moving objects in crowds and its types of anomaly patterns related to human actions, including object-related human actions (a person throwing, grabbing, and leaving objects), running, jumping, and loitering. In contrast, the normal pattern is a crowd walking parallel to the image plane. Examples of the CUHK Avenue dataset are shown in Fig. 16.19.

16.4.2 Implementation details
We implement the proposed DSTN framework using the Keras [51] machine learning platform with a TensorFlow [52] backend, and Matlab.
An NVIDIA GeForce GTX 1080 Ti GPU with 3584 CUDA cores and 484 GB/s memory bandwidth is used during the training procedure. Additionally, the testing measurement is carried out on an Intel Core i9-7960 CPU with a 2.80 GHz processor base frequency. The model performs transformation learning from spatial to temporal representations with the help of the Adam optimizer. The learning rate is set to 0.0002, while the exponential decay rates β1 and β2 are set to 0.9 and 0.999, respectively, with epsilon 10^{-8}.

Fig. 16.19 CUHK Avenue dataset.

16.4.3 Evaluation criteria
16.4.3.1 Receiver operating characteristic (ROC)
The receiver operating characteristic (ROC) curve is a standard method for evaluating the performance of an anomaly detection system. It is a plot comparing the true positive rate (TPR) and the false positive rate (FPR) at various threshold criteria [53] and supports the analysis of the decision-making process. In the anomaly detection setting, the abnormal events that are correctly determined as positive detections (abnormal events) out of all positive ground-truth data are captured by the TPR, known as the probability of detection. The higher the TPR curve rises, the better the detection accuracy of abnormal events. The normal events (negative data) that are incorrectly determined as positive detections out of all negative ground-truth data are captured by the FPR. A higher FPR means a higher rate of misclassification of normal events. There are four types of binary predictions used in the TPR and FPR computation, as described below.

True positive (TP) is the correct positive detection of an abnormal event, when the prediction outcome and the ground-truth data are both positive (abnormal event). False positive (FP) is a false positive detection, when the outcome is predicted as positive (abnormal event) but the ground-truth data is negative (normal event), meaning that a normal event is incorrectly detected as an abnormal event. This problem often occurs in the video anomaly detection task (e.g., a walking person is detected as an anomaly). True negative (TN) is the correct detection of a normal event, when the outcome is predicted as negative (normal event) and the ground-truth data is also negative. False negative (FN) is an incorrect detection, when the outcome is predicted as negative (normal event) but the ground-truth data is positive (abnormal event). Hence, TPR and FPR can be computed as shown in Eqs. (16.32) and (16.33), respectively:

TPR = TP / (TP + FN)     (16.32)

FPR = FP / (FP + TN)     (16.33)

16.4.3.2 Area under curve (AUC)
The area under curve (AUC) is used in classification analysis to define the best prediction model. It is computed as the area under the ROC curve, where TPR is plotted against FPR. A higher AUC value indicates superior performance of the model. Ideally, the model is a perfect classifier when all positive data are ranked above all negative data (AUC = 1). In practice, most AUC results are required to lie in the range between 0.5 and 1.0 (AUC ∈ [0.5, 1]), meaning that random positive data are ranked higher than random negative data (greater than 50%). Conversely, the worst case is when all negative data are ranked above all positive data, leading to an AUC of 0 (AUC = 0).
Hence, AUC classifiers can be defined over AUC ∈ [0, 1], where AUC values useful for real-world use are greater than 0.5. AUC values less than 0.5 are not acceptable for the model [53]. In summary, higher AUC values are preferred over lower ones.

16.4.3.3 Equal error rate (EER)
Apart from the AUC, the performance of the model can be quantified by observing the receiver operating characteristic equal error rate (ROC-EER). The EER is the operating point at which the misclassification rates of positive and negative data are equal. Specifically, the EER can be obtained from the intersection of the ROC curve with the diagonal EER line by varying a threshold until the FPR equals the miss rate 1 − TPR. Lower EER values indicate better model performance.

16.4.3.4 Frame-level and pixel-level evaluations for anomaly detection
In general, the quantitative performance evaluation of anomaly detection uses two criteria: frame-level and pixel-level evaluations. The frame-level evaluation focuses on the detection rate of anomalous events in the scene. If one or more anomalous pixels are detected, the frame is labeled as an abnormal frame, regardless of the size and location of the abnormal objects. In this case, the detected frame is counted as TP if the actual frame is also abnormal. Conversely, if the actual frame is normal, the detected frame is counted as FP. The pixel-level evaluation determines the correct location of the detected anomalous objects in the scene. This evaluation is a challenging criterion in anomaly detection and localization research since it focuses on local pixels. It is remarkably more demanding and stricter than the frame-level evaluation due to the complexity of localizing anomalies, which in turn improves the accuracy of frame-level anomaly detection. For a frame to count as a true positive (TP), the detected abnormal area must overlap the ground truth by more than 40% [3]. In addition, a frame is counted as a false positive (FP) if at least one of its pixels is incorrectly detected as an abnormal event.

16.4.3.5 Pixel accuracy
In standard semantic segmentation evaluation, the pixel accuracy metric [54] is computed to define the correctness of the pixels belonging to each semantic class. In the proposed DSTN, two semantic classes are defined: a foreground semantic class and a background semantic class. The pixel accuracy is defined as \sum_i n_{ii} / \sum_i n_{ti}, where n_{ii} is the number of correctly classified pixels of class i and n_{ti} is the total number of pixels of class i.

16.4.3.6 Structural similarity index (SSIM)
The SSIM index is a perceptual metric measuring the image quality of a predicted image with respect to its original image [55]. Under the SSIM index, the model is more effective when the predicted image is more similar to the target image. In our case, we use SSIM to analyze the similarity of the dense optical flow generated by the generator to the real dense optical flow obtained from two consecutive video frames.

16.4.4 Performance of DSTN
We evaluate the proposed DSTN with regard to accuracy and time complexity. The ROC curve is used to illustrate the performance of anomaly detection at the frame level and the pixel level and to compare the experimental results with other state-of-the-art works. Additionally, the AUC and the EER are used as criteria for assessing the results.
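Before turning to the results, the following sketch shows how the frame-level ROC/AUC, the EER, and the SSIM criteria defined above can be computed with scikit-learn and scikit-image (both assumed available; a recent scikit-image with the channel_axis argument is assumed, and the function names are illustrative).

```python
import numpy as np
from sklearn.metrics import roc_curve, auc
from skimage.metrics import structural_similarity as ssim

def frame_level_auc_eer(labels, scores):
    """labels: 0/1 ground-truth anomaly flags per frame; scores: per-frame
    anomaly scores (e.g., the maximum of N_OF over the frame)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    roc_auc = auc(fpr, tpr)
    # EER: the operating point where FPR equals the miss rate 1 - TPR.
    idx = np.nanargmin(np.abs(fpr - (1 - tpr)))
    eer = (fpr[idx] + (1 - tpr[idx])) / 2
    return roc_auc, eer

def flow_ssim(of_gen, of_fus):
    """SSIM between the generated and the real dense optical flow images."""
    return ssim(of_gen, of_fus, channel_axis=-1,
                data_range=float(of_fus.max() - of_fus.min()))
```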
The performance of DSTN is first evaluated on the UCSD dataset, using the 10 and 12 videos of UCSD Ped1 and UCSD Ped2 that have pixel-level ground truth, under both the frame-level and pixel-level protocols. In the first stage of DSTN, patch extraction is applied to provide the appearance features of the foreground objects and their motion in terms of the vector changes in each patch. The patches are extracted independently from each original image, with sizes of 238 × 158 pixels (UCSD Ped1) and 360 × 240 pixels (UCSD Ped2), using the (w/4) × h × c_p scheme. As a result, we obtain 22k patches from UCSD Ped1 and 13.6k patches from UCSD Ped2. Then, before feeding them into the spatiotemporal translation model, we resize all patches to the 256 × 256 default size at both training and testing time. During training, the input of G (the concatenation of the f and fBR patches) and the target data (the generated dense optical flow OFgen) are set to the same default resolution of 256 × 256 pixels.

The encoding and decoding modules in G are implemented differently. In the encoder network, the image resolution is encoded from 256 → 128 → 64 → 32 → 16 → 8 → 4 → 2 → 1 to obtain the latent space representing the spatial image in a one-dimensional data space. The CNN performs this downscaling with 3 × 3 kernels and stride s = 2. Additionally, the number of neurons corresponding to the image resolution in En is set in each layer to 6 → 64 → 128 → 256 → 512 → 512 → 512 → 512 → 512. In contrast, De decodes the latent space to reach the target data (the temporal representation OFgen) with a size of 256 × 256 pixels, using the same structure as En. Dropout is employed in De as the noise z, removing neuron connections with probability p = 0.5 and thereby preventing overfitting on the training samples. Since D needs to push G toward correct classification between real and fake images at training time, PatchGAN is applied by inputting a patch of 64 × 64 pixels and outputting the probability of the class label for the object. The PatchGAN architecture is constructed from 64 → 32 → 16 → 8 → 4 → 2 → 1, which is then flattened to 512 neurons and connected to fully connected (FC) and Softmax layers. The use of PatchGAN benefits the model in terms of time complexity. This is probably because there are fewer parameters to learn on the partial image, making the model less complex and able to achieve a good running time for the training process. For testing, G is specifically employed to reconstruct OFgen in order to analyze the real motion information OFfus. The image resolutions for testing and training are set to the same values for all datasets.
The quantitative performance of DSTN is presented in Table 16.1, where we compare the DSTN with various state-of-the-art works, e.g., AMDN [15], GMM-FCN [12], Convolutional AE [14], and future frame prediction [21]. From Table 16.1 it can be observed that the DSTN surpasses most of the methods under both frame-level and pixel-level criteria, achieving higher AUC and lower EER on the UCSD dataset.

Table 16.1 EER and AUC comparison of DSTN with other methods on the UCSD dataset [26].

Method | Ped1 frame EER | Ped1 frame AUC | Ped1 pixel EER | Ped1 pixel AUC | Ped2 frame EER | Ped2 frame AUC | Ped2 pixel EER | Ped2 pixel AUC
MPPCA | 40% | 59.0% | 81% | 20.5% | 30% | 69.3% | – | –
Social Force (SF) | 31% | 67.5% | 79% | 19.7% | 42% | 55.6% | 80% | –
SF + MPPCA | 32% | 68.8% | 71% | 21.3% | 36% | 61.3% | 72% | –
Sparse Reconstruction | 19% | – | 54% | 45.3% | – | – | – | –
MDT | 25% | 81.8% | 58% | 44.1% | 25% | 82.9% | 54% | –
Detection at 150 fps | 15% | 91.8% | 43% | 63.8% | – | – | – | –
SR + VAE | 16% | 90.2% | 41.6% | 64.1% | 18% | 89.1% | – | –
AMDN (double fusion) | 16% | 92.1% | 40.1% | 67.2% | 17% | 90.8% | – | –
GMM | 15.1% | 92.5% | 35.1% | 69.9% | – | – | – | –
Plug-and-Play CNN | 8% | 95.7% | 40.8% | 64.5% | 18% | 88.4% | – | –
GANs | 8% | 97.4% | 35% | 70.3% | 14% | 93.5% | – | –
GMM-FCN | 11.3% | 94.9% | 36.3% | 71.4% | 12.6% | 92.2% | 19.2% | 78.2%
Convolutional AE | 27.9% | 81% | – | – | 21.7% | 90% | – | –
Liu et al. | 23.5% | 83.1% | – | 33.4% | 12% | 95.4% | – | 40.6%
Adversarial discriminator | 7% | 96.8% | 34% | 70.8% | 11% | 95.5% | – | –
AnomalyNet | 25.2% | 83.5% | – | 45.2% | 10.3% | 94.9% | – | 52.8%
DSTN (proposed method) | 5.2% | 98.5% | 27.3% | 77.4% | 9.4% | 95.5% | 21.8% | 83.1%

Moreover, we show the qualitative performance of DSTN using the standard evaluation for anomaly detection research, the ROC curve, where we vary a threshold from 0 to 1 to plot the curve of TPR against FPR. The qualitative performance of DSTN is compared with other approaches in both frame-level evaluation (see Fig. 16.20A) and pixel-level evaluation (see Fig. 16.20B) on UCSD Ped1, and in frame-level evaluation on UCSD Ped2, as presented in Figs. 16.20 and 16.21, respectively. According to Figs. 16.20 and 16.21, the DSTN (circle) shows the strongest growth of the TPR curve and surpasses all the competing methods at both the frame and pixel level. This means that the DSTN is a reliable and effective method, able to detect and localize anomalies with high precision.

Fig. 16.20 ROC comparison of DSTN with other methods on the UCSD Ped1 dataset: (A) frame-level evaluation and (B) pixel-level evaluation [26].
Fig. 16.21 ROC comparison of DSTN with other methods on the UCSD Ped2 dataset at frame-level evaluation [26].

Examples of the experimental results of DSTN on the UCSD Ped1 and Ped2 datasets are illustrated in Fig. 16.22 to present its performance in detecting and localizing anomalies in the scene. According to Fig. 16.22, the proposed DSTN is able to detect and locate various types of abnormalities effectively for single objects, e.g., (A) a wheelchair, (B) a vehicle, (C) a skateboard, and (D) a bicycle, or even more than one anomaly in the same scene, e.g., (E) bicycles, (F) a vehicle and a bicycle, and (G) a bicycle and a skateboard. However, we face a false positive problem in Fig. 16.22H (a bicycle and a skateboard), where a walking person (normal event) is detected as an anomaly. Even though the bicycle and the skateboard are correctly detected as anomalies in Fig. 16.22H, the false detection of the walking person still makes this frame incorrect. The false positive anomaly detection is probably caused by the walking speed being similar to the cycling speed in the scene.

Fig. 16.22 Examples of DSTN performance in detecting and localizing anomalies on the UCSD Ped1 and Ped2 datasets: (A) a wheelchair, (B) a vehicle, (C) a skateboard, (D) a bicycle, (E) bicycles, (F) a vehicle and a bicycle, (G) a bicycle and a skateboard, and (H) a bicycle and a skateboard [26].

For the UMN dataset, the performance of DSTN is evaluated using the same training parameters and network configuration as for the UCSD pedestrian dataset. Table 16.2 shows the AUC comparison of the DSTN with various competing works such as GANs [20], adversarial discriminator [22], AnomalyNet [23], and so on. Table 16.2 shows that the proposed DSTN achieves the best AUC result, equal to that of Ref. [23], outperforming all other methods. Noticeably, most of the competing methods achieve a high AUC on the UMN dataset. This is because the UMN dataset has less complexity in its abnormal patterns than the UCSD pedestrian and Avenue datasets. Fig. 16.23 shows the performance of DSTN in detecting and localizing anomalies in different scenarios on the UMN dataset, including an indoor scene in (C) and outdoor scenes in (A), (B), and (D), where we can detect most of the individual objects in the crowded scene.
Table 16.2 AUC comparison of DSTN with other methods on the UMN dataset [26].

Method | AUC
Optical-flow | 0.84
SFM | 0.96
Sparse reconstruction | 0.976
Commotion | 0.988
Plug-and-play CNN | 0.988
GANs | 0.99
Adversarial discriminator | 0.99
AnomalyNet | 0.996
DSTN (proposed method) | 0.996

Fig. 16.23 Examples of DSTN performance in detecting and localizing anomalies on the UMN dataset, where (A), (B), and (D) contain running activity outdoors while (C) is indoors [26].

Apart from evaluating DSTN on the UCSD and the UMN datasets, we also assess its performance on the challenging CUHK Avenue dataset, with the same parameter and configuration settings as for the UCSD and the UMN datasets. Table 16.3 presents the performance comparison in terms of EER and AUC of the DSTN with other competing works [6, 12, 14, 21, 23], in which the proposed DSTN surpasses all state-of-the-art works under both protocols. We show examples of the DSTN performance in detecting and localizing various types of anomalies, e.g., (A) jumping, (B) throwing papers, (C) falling papers, and (D) grabbing a bag, on the CUHK Avenue dataset in Fig. 16.24. The DSTN can effectively detect and localize anomalies in this dataset, even in Fig. 16.24D, which contains only small movements for the abnormal events (only the human head and the fallen bag are slightly moving).

Table 16.3 EER and AUC comparison of DSTN with other methods on the CUHK Avenue dataset [26].

Method | EER | AUC
Convolutional AE | 25.1% | 70.2%
Detection at 150 fps | – | 80.9%
GMM-FCN | 22.7% | 83.4%
Liu et al. | – | 85.1%
AnomalyNet | 22% | 86.1%
DSTN (proposed method) | 20.2% | 87.9%

Fig. 16.24 Examples of DSTN performance in detecting and localizing anomalies on the CUHK Avenue dataset: (A) jumping, (B) throwing papers, (C) falling papers, and (D) grabbing a bag [26].

To indicate the significance of our performance for real-time use, we then compare the running time of DSTN during testing, in seconds per frame, with other competing methods [3–6, 15] in Table 16.4, following the environment and the computational times reported in Ref. [15].

Table 16.4 Running time comparison on the testing measurement (seconds per frame).

Method | CPU (GHz) | GPU | Memory (GB) | Ped1 | Ped2 | UMN | Avenue
Sparse Reconstruction | 2.6 | – | 2.0 | 3.8 | – | 0.8 | –
Detection at 150 fps | 3.4 | – | 8.0 | 0.007 | – | – | 0.007
MDT | 3.9 | – | 2.0 | 17 | 23 | – | –
Li et al. | 2.8 | – | 2.0 | 0.65 | 0.80 | – | –
AMDN (double fusion) | 2.1 | Nvidia Quadro K4000 | 32 | 5.2 | – | – | –
DSTN (proposed method) | 2.8 | – | 24 | 0.315 | 0.319 | 0.318 | 0.334

Regarding Table 16.4, we achieve a lower running time than most of the competing methods except for Ref. [6]. This is because the architecture of DSTN relies on a deep learning framework using multiple layers of convolutional neural networks, which is more complex than Ref. [6], which uses the learning of a sparse dictionary and has fewer connections.
However, according to the experimental results in Tables 16.1 and 16.3, our proposed DSTN provides significantly higher AUC and lower EER, with respect to the frame and the pixel level, on the CUHK Avenue and the UCSD pedestrian datasets than Ref. [6]. Regarding running time, the proposed method runs at 3.17 fps on the UCSD Ped1 dataset, 3.15 fps on the UCSD Ped2 dataset, 3.15 fps on the UMN dataset, and 3 fps on the CUHK Avenue dataset. Finally, we compare the proposed DSTN with the other competing works [3–6, 15] in terms of frame-level AUC and running time in seconds per frame for the UCSD Ped1 and Ped2 datasets, as presented in Figs. 16.25 and 16.26, respectively. Considering Figs. 16.25 and 16.26, our proposed method achieves the best results in terms of both AUC and running time. We can therefore conclude that our DSTN surpasses other state-of-the-art approaches, reaching the highest AUC values for frame-level anomaly detection and pixel-level localization while offering computational times suitable for real-world applications.

Fig. 16.25 Frame-level AUC comparison and running time on the UCSD Ped1 dataset.
Fig. 16.26 Frame-level AUC comparison and running time on the UCSD Ped2 dataset.

16.4.5 The comparison of generative adversarial network with an autoencoder
The GAN-based U-Net architecture is a practical approach to shortcut low-level information across the network. The skip connections in the generator play a significant role in our proposed framework. We highlight their significance with experiments on UCSD Ped2, comparing against an autoencoder, which can be constructed by removing the skip connections from the U-Net architecture. All training videos are learned with both the skip connections and the autoencoder for 40 epochs to observe the performance in minimizing the L1 loss, as shown in Fig. 16.27. Fig. 16.27 demonstrates that the loss curve of the skip connections reaches a lower error over the training time than the loss curve of the autoencoder, showing the superior performance of the skip connections over the autoencoder.

Fig. 16.27 Performance comparison on the UCSD Ped2 dataset between the GAN-based U-Net architecture (the residual connection) and the autoencoder [26].

Besides, we observe the ability of the skip connections and the autoencoder to generate temporal information (generated dense optical flow) using the test videos from UCSD Ped2 and compare it to the dense optical flow ground truth, as displayed in Fig. 16.28. The autoencoder is unable to recover the motion information, as shown in Fig. 16.28C. In contrast, the skip connections in Fig. 16.28B can produce motion information of dense optical flow that correctly corresponds to its ground truth in Fig. 16.28A, giving good synthesized image quality.

Fig. 16.28 Qualitative results in generating (A) dense optical flow on the UCSD Ped2 dataset, compared between (B) the residual connection and (C) the autoencoder [26].
To quantify the performance of the skip connections and the autoencoder, the structural similarity index (SSIM) [55] and the FCN-score [54] are evaluated for each architecture on UCSD Ped2, as presented in Table 16.5. A higher value means better performance for both evaluation criteria. Table 16.5 shows that the GAN-based U-Net architecture with skip connections is better suited to transferring the low-level information, since it achieves superior results to the autoencoder for both evaluation metrics, especially the SSIM.

Table 16.5 FCN-score and SSIM comparison on the UCSD Ped2 dataset between the residual connection and the autoencoder.

Network architecture | Pixel accuracy | SSIM
Autoencoder | 0.83 | 0.82
Residual connection | 0.9 | 0.96

16.4.6 Advantages and limitations of generative adversarial network for video anomaly detection
Generative adversarial networks for anomaly detection have certain advantages over traditional CNNs. The GAN framework does not require any labeled data or inference during the learning procedure. In addition, GAN can generate example data without using different entries in a sequential sample and does not need the Markov chain Monte Carlo (MCMC) method to train the model, unlike the adversarially trained AAE [24] and the VAE [25]. Instead, it computes only backpropagation to obtain the gradients. As regards the statistical advantage, the GAN model may capture the density distribution of the example data through the generator network, which is trained and updated with the gradients flowing through the discriminator rather than directly updated with the example data. In this way, for GAN in the video anomaly detection task, the objective function of the generator is strengthened to be able to generate synthetic output that looks real given the input image, since the parameters of the generator do not directly access the components of the target image. Apart from the advantages mentioned above, the generator network provides a very sharp synthetic image, while the visual performance of the VAE network based on the MCMC method presents a blurry image owing to the mixing of the modes in chains.

As for the limitations of GAN, the training of GAN is unstable compared with VAE, resulting in difficulty in predicting the value of each pixel of the whole image and causing artifact noise in the synthetic image. The major limitation of anomaly detection using GAN in current research is that only the static camera scenario is implemented to obtain the appearance and motion features from the moving foreground objects. Besides, GAN also has a problem in learning and generating small objects (the full appearance of the objects) in crowded scenes, making it challenging to enhance the accuracy of the model, especially at the pixel level.

16.5 Summary
In this chapter, we extensively explain the architecture of GANs and explore its applications in video anomaly detection research. DSTN, a novel unsupervised anomaly detection and localization method, is introduced to extend the knowledge of GAN and improve the performance of the system with respect to the accuracy of anomaly detection at the frame level, localization at the pixel level, and computational time. The DSTN is intended to comprehensively master the features from the spatial to the temporal representations by employing the novel fusion between the background removal and the real dense optical flow. The concatenation of patches is presented to assist the learning of the generative network. The proposed method is unsupervised since only normalities are used in training to obtain the corresponding generated dense optical flow, without labeling abnormal data.
Since all videos are input into the model during testing, the unrecognized patterns are classified as abnormalities because the model has no prior knowledge of any abnormal events. The abnormalities can simply be detected by taking the difference in local pixels between the real and the generated dense optical flow images. To the best of our knowledge, the proposed DSTN is the first attempt to boost pixel-level anomaly localization with the edge wrapping method as the postprocessing step of the GAN framework. We evaluated on three publicly available benchmarks: the UCSD pedestrian, UMN, and CUHK Avenue datasets. The performance of DSTN was compared with various methods and analyzed against the autoencoder to show the significance of using the skip connections of GAN. From the experimental results, the proposed DSTN outperforms other state-of-the-art works in anomaly detection and localization as well as in time consumption. The advantages and limitations of GAN are addressed in the final section to deliver a comprehensive view of the use of GAN for the video anomaly detection task.

References
[1] R. Mehran, A. Oyama, M. Shah, Abnormal crowd behavior detection using social force model, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 935–942.
[2] J. Kim, K. Grauman, Observe locally, infer globally: a space-time MRF for detecting abnormal activities with incremental updates, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 2921–2928.
[3] W. Li, V. Mahadevan, N. Vasconcelos, Anomaly detection and localization in crowded scenes, IEEE Trans. Pattern Anal. Mach. Intell. 36 (2013) 18–32.
[4] V. Mahadevan, W. Li, V. Bhalodia, N. Vasconcelos, Anomaly detection in crowded scenes, in: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, 2010, pp. 1975–1981.
[5] Y. Cong, J. Yuan, J. Liu, Sparse reconstruction cost for abnormal event detection, in: CVPR 2011, IEEE, 2011, pp. 3449–3456.
[6] C. Lu, J. Shi, J. Jia, Abnormal event detection at 150 fps in Matlab, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2720–2727.
[7] Y. Yuan, Y. Feng, X. Lu, Structured dictionary learning for abnormal event detection in crowded scenes, Pattern Recogn. 73 (2018) 99–110.
[8] S. Wang, E. Zhu, J. Yin, F. Porikli, Video anomaly detection and localization by local motion based joint video representation and OCELM, Neurocomputing 277 (2018) 161–175.
[9] X. Zhang, S. Yang, X. Zhang, W. Zhang, J. Zhang, Anomaly Detection and Localization in Crowded Scenes by Motion-Field Shape Description and Similarity-Based Statistical Learning, 2018 (arXiv preprint arXiv:1805.10620).
[10] H. Mousavi, S. Mohammadi, A. Perina, R. Chellali, V. Murino, Analyzing tracklets for the detection of abnormal crowd behavior, in: 2015 IEEE Winter Conference on Applications of Computer Vision, IEEE, 2015, pp. 148–155.
[11] H. Mousavi, M. Nabi, H. Kiani, A. Perina, V. Murino, Crowd motion monitoring using tracklet-based commotion measure, in: 2015 IEEE International Conference on Image Processing (ICIP), IEEE, 2015, pp. 2354–2358.
[12] Y. Fan, G. Wen, D. Li, S. Qiu, M.D. Levine, F. Xiao, Video anomaly detection and localization via Gaussian mixture fully convolutional variational autoencoder, Comput. Vis. Image Underst. 195 (2020) 102920.
[13] Y. Feng, Y. Yuan, X. Lu, Learning deep event models for crowd anomaly detection, Neurocomputing 219 (2017) 548–556.
[14] M. Hasan, J. Choi, J. Neumann, A.K. Roy-Chowdhury, L.S. Davis, Learning temporal regularity in video sequences, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 733–742.
[15] D. Xu, Y. Yan, E. Ricci, N. Sebe, Detecting anomalous events in videos by learning deep representations of appearance and motion, Comput. Vis. Image Underst. 156 (2017) 117–127.
[16] S. Bouindour, M.M. Hittawe, S. Mahfouz, H. Snoussi, Abnormal Event Detection Using Convolutional Neural Networks and 1-Class SVM Classifier, IET Digital Library, 2017.
[17] M. Ravanbakhsh, M. Nabi, H. Mousavi, E. Sangineto, N. Sebe, Plug-and-play CNN for crowd motion analysis: an application in abnormal event detection, in: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2018, pp. 1689–1698.
[18] M. Sabokrou, M. Fayyaz, M. Fathy, Z. Moayed, R. Klette, Deep-anomaly: fully convolutional neural network for fast anomaly detection in crowded scenes, Comput. Vis. Image Underst. 172 (2018) 88–97.
[19] H. Wei, Y. Xiao, R. Li, X. Liu, Crowd abnormal detection using two-stream fully convolutional neural networks, in: 2018 10th International Conference on Measuring Technology and Mechatronics Automation (ICMTMA), IEEE, 2018, pp. 332–336.
[20] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. Regazzoni, N. Sebe, Abnormal event detection in videos using generative adversarial nets, in: 2017 IEEE International Conference on Image Processing (ICIP), IEEE, 2017, pp. 1577–1581.
[21] W. Liu, W. Luo, D. Lian, S. Gao, Future frame prediction for anomaly detection—a new baseline, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6536–6545.
[22] M. Ravanbakhsh, E. Sangineto, M. Nabi, N. Sebe, Training adversarial discriminators for cross-channel abnormal event detection in crowds, in: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2019, pp. 1896–1904.
[23] J.T. Zhou, J. Du, H. Zhu, X. Peng, Y. Liu, R.S.M. Goh, AnomalyNet: an anomaly detection network for video surveillance, IEEE Trans. Inf. Forensics Secur. 14 (2019) 2537–2550.
[24] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, B. Frey, Adversarial Autoencoders, 2015 (arXiv preprint arXiv:1511.05644).
[25] J. An, S. Cho, Variational autoencoder based anomaly detection using reconstruction probability, in: Special Lecture on IE, vol. 2, 2015, pp. 1–18.
[26] T. Ganokratanaa, S. Aramvith, N. Sebe, Unsupervised anomaly detection and localization based on deep spatiotemporal translation network, IEEE Access 8 (2020) 50312–50329.
[27] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, Adv. Neural Inf. Proces. Syst. 2 (2014) 2672–2680.
[28] J. Sun, X. Wang, N. Xiong, J. Shao, Learning sparse representation with variational auto-encoder for anomaly detection, IEEE Access 6 (2018) 33353–33361.
[29] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
[30] I.J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, Y. Bengio, Maxout Networks, 2013 (arXiv preprint arXiv:1302.4389).
[31] K. Jarrett, K. Kavukcuoglu, M.A. Ranzato, Y. LeCun, What is the best multi-stage architecture for object recognition?, in: 2009 IEEE 12th International Conference on Computer Vision, IEEE, 2009, pp. 2146–2153.
[32] Y. Mashalla, Impact of computer technology on health: computer vision syndrome (CVS), Med. Pract. Rev. 5 (2014) 20–30.
[33] K. Gates, Professionalizing police media work: surveillance video and the forensic sensibility, in: Images, Ethics, Technology, Routledge, 2015.
[34] C. Dictionary, Cambridge Advanced Learner's Dictionary, PONS-Worterbucher, Klett Ernst Verlag, 2008.
[35] P. Isola, J.-Y. Zhu, T. Zhou, A. Efros, Image-to-image translation with conditional adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[36] O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2015, pp. 234–241.
[37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, ImageNet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (2015) 211–252.
[38] T. Brox, A. Bruhn, N. Papenberg, J. Weickert, High accuracy optical flow estimation based on a theory for warping, in: European Conference on Computer Vision, Springer, 2004, pp. 25–36.
[39] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[40] M. Mathieu, C. Couprie, Y. LeCun, Deep Multi-Scale Video Prediction Beyond Mean Square Error, 2015 (arXiv preprint arXiv:1511.05440).
[41] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, T. Brox, FlowNet: learning optical flow with convolutional networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2758–2766.
[42] T. Kroeger, R. Timofte, D. Dai, L. Van Gool, Fast optical flow using dense inverse search, in: European Conference on Computer Vision, Springer, 2016, pp. 471–488.
[43] A. Radford, L. Metz, S. Chintala, Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, 2015 (arXiv preprint arXiv:1511.06434).
[44] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for training GANs, in: Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
[45] S. Ioffe, C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015 (arXiv preprint arXiv:1502.03167).
[46] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[47] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014 (arXiv preprint arXiv:1409.1556).
[48] D.P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, 2014 (arXiv preprint arXiv:1412.6980).
[49] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (2014) 1929–1958.
[50] J. Canny, A computational approach to edge detection, IEEE Trans. Pattern Anal. Mach. Intell. 8 (1986) 679–698.
[51] F. Chollet, Keras document, Keras, GitHub, 2015.
[52] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, TensorFlow: a system for large-scale machine learning, in: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
Index
Note: Page numbers followed by f indicate figures, t indicate tables, and b indicate boxes.

A
Ablation analysis, 366–368
Accuracy, 50, 88
  of filters, 94f
AC GAN, 30–31, 30f
  loss functions and distance metrics, 32–33t
  pros and cons of, 34–35t
Adam algorithm, 355, 357
Adaptive moment optimization (ADAM), 322
Adversarial autoencoders (AAEs), 107–108, 117
Adversarial loss, 9, 213–214, 218
Adversarial network, 293, 293f
Adversarial preparation, 61
Adversarial training, 140
Age-cGAN, 61t, 75
Aging of face, 119
AlexNet, 65, 178–181, 190
Alternative FCM algorithm, 84–85
Amazon, 43t
Animation, 254–259
AOI, 49t
Appearance and motion conditions GAN (AMC-GAN), 334t, 340
Area under curve (AUC), 405–406
Artificial intelligence-based methods, 142–146
Art2Real, 250–253, 253–254f
Attentional generative adversarial networks (AttnGAN), 141
attRNN, 40
Autoencoder, 414–416
Automatic caricature generation, 135–136
Automatic nonrigid histological image registration (ANHIR) dataset, 265, 273–275
Auxiliary automatic driving, 76
Auxiliary object functions, 332–333

B
Background removal, 391–395, 417
Backward forward GAN (BFGAN), 40, 40f
BicycleGAN model, 132–134
Bidirectional GAN (BiGAN), 29–30, 29f
  loss functions and distance metrics, 32–33t
  pros and cons of, 34–35t
Bidirectional LSTM (Bi-LSTM), 341, 341f
Bilingual evaluation understudy (BELU) score, 52
Bool GANs, 117
Boundary equilibrium GAN (BEGAN), 23–24, 24f
  loss functions and distance metrics, 32–33t
  pros and cons of, 34–35t

C
Caltech 256, 49t
Caption, 43t
CariGANs, 136
Cartoon character generation, 76
Cascaded super-resolution GAN (CSRGAN), 6, 10
CD31 stain, 273
CelebA dataset, 49t, 247–248
CelebA-HQ, 49t
Chinese poem dataset, 43t
CHUK Avenue dataset, 404, 404f
CIFAR 10/100, 49t
Classification objective function, 332
Closed-circuit television (CCTV) cameras, 378
Cluster analysis, 81–82
CNN-based architectures, 185, 190–191, 196
CNN/Daily Mail dataset, 43t
COCO dataset, 43t
COIL-20, 49t
Compactness, 84–85
Computer-aided diagnosis (CAD) systems, 162
Conditional adversarial networks, 134, 383–384
Conditional generative adversarial networks (CGANs), 28–29, 28f, 105–106, 126, 135, 139, 270
  architecture, 166–167, 167f
  loss functions and distance metrics, 32–33t
  pros and cons of, 34–35t
  respiratory sound synthesis
    algorithm, 170–172
    analysis, 179–181
    data augmentation, 174–175
    dataset, 174–181
    discriminator network architecture, 169, 170f
    generator network architecture, 168–169, 169f
    inverse CWT, 176, 177f
    performance results, 177–179
    scalograms, 170b, 176, 176f
    steps, 173–174
    system model, 167–168, 168f
    time-scale representation, 168
    trained network model, 177, 177t
Content-based image retrieval (CBIR), 185, 187, 200
Context Encoder, 61t
Continuous wavelet transform (CWT), 168
Controllable GANs, 139–140
Conventional generative adversarial networks (cGANs), 289
Convolutional neural network (CNN), 18, 65, 269–270, 397, 407–408, 416–417
  architecture, 66f
Convolutional traces, 145
Convolution operation, 398, 398f
Cooccurrence matrices, 143
Critic network discriminates, 337
Cross-channel adversarial discriminators, 387–389
Cross-channel generative adversarial networks, 383–385
Cross entropy loss, 332
Crossview Fork, 135
Cross-view image synthesis, 135
Crossview Sequential, 135
cSeq-GAN, 40–41, 41f
CUB, 49t
Cycle-consistency loss, 319–320
Cycle generative adversarial networks (CGAN), 25–26, 26f, 61t, 113–114, 115f, 132, 212–213, 322–325, 324f, 348, 350, 352–355
  image-to-image translation, 245–247, 247–248f
  loss functions and distance metrics, 32–33t
  model, 223f
  normalized difference vegetation index (NDVI), 213–216
    architecture, 217–218, 217f
  pros and cons of, 34–35t
  qualitative evaluation, 225f
Cycle text-to-image GAN, 141

D
Data augmentation, 220, 222f, 355, 357
Datasets
  image, 49, 49t
  for video generation techniques, 338t
Decision-level fusion approach, 210
Deep belief network (DBN), 67
  architecture, 67f
Deep convolutional GAN (DCGAN), 64, 108–110, 127, 164, 266–267
DeepFake
  artificial intelligence-based methods, 142–146
  challenges, 131
  definition, 128
  face swapping, 148–150
  facial expressions, manipulation of, 152–153
  facial features, manipulation of, 150–152
  GAN-based techniques
    image-to-image translation, 132–136
    text-to-image synthesis, 136–142
  legal and ethical considerations, 153–154
  new face construction, 146–147
  sample source and generated fake images, 128–130, 129–130f
Deep generative adversarial networks (GANs) model, 291–292
Deep learning (DL), 65–68, 209–212, 347, 349, 352, 377–378, 391–393, 397
  end-to-end, 379–380
  generative, 235–237, 238f
  overview, 235–239
  unsupervised approach, 378
  variational autoencoder (VAE), 237–239
Deep learning-based (DA-DCGAN), 117
Deep network architectures, 194–196
Deep neural networks (DNNs), 347–348, 352
Deep spatiotemporal translation network (DSTN), 390
  dataset
    CHUK Avenue, 404, 404f
    UCSD, 403, 403f
    UMN, 403–404
  feature collection, 391–396
  implementation, 404
  overview, 390–391, 392f
  performance, 407–413, 414f
  spatiotemporal translation model, 396–400
  testing framework, 391, 392f
  training framework, 391, 392f
Denoising-based generative adversarial networks (D-GAN), 291
Dense inverse search (DIS), 393
DenseNet, 22–23, 191
Digital elevation model (DEM), 251–252, 252f
Digital imaging and communications in medicine (DICOM), 81
  experimental analysis, 90–92
  image segmentation, 90, 90f
  montage of, 90, 91f
  performance analysis, 92, 93f
Dilated temporal relational GAN (DTRGAN), 340–341, 341f
Discrete wavelet transformation, 318–319
Discriminator, 240–241
Discriminator model (DM), 63–64, 63f, 210–211, 211f
Discriminator network, 293, 293f
DNNs. See Deep neural networks (DNNs)
DSTN. See Deep spatiotemporal translation network (DSTN)
Dual attentional generative adversarial network (Dualattn-GAN), 141–142
Dual motion GAN (DMGAN), 334t, 340
Dynamic memory generative adversarial networks (DM-GAN), 140
Dynamic transfer GAN, 334–335, 334t, 335f

E
Earth mover distance, 21
e-commerce, 185–187, 196–197, 200
Edge-enhanced GAN (EE-GAN), 9
  for remote sensing image, 10
Edge-enhanced super-resolution network (EESR), 5
Edge wrapping (EW), 401–402
ELBO loss, 320
Encoder-based GAN, 47, 47f
Encoder-decoder network, 291–292
Enhanced super-resolution GAN (ESRGAN), 9–10
Ensemble learning GANs, 42, 44f
Equal error rate (EER), 406
Errors lung lobe tissue, 280, 280f
Estrogen receptor (ER) antibody stains, 273
Expectation-maximization (EM), 82–83

F
Face aging, 75
Facebook AI Similarity Search (FAISS), 193
Face conditional GAN (FCGAN), 61t, 73
Face frontal view generation, 75
Face generation, 75, 247–250
Face swapping, 148–150
Facial expressions, manipulation of, 152–153
Facial features, manipulation of, 150–152
FakeSpotter, 144
False data detection rate. See Recurrent neural network (RNN), generative adversarial networks (GANs)
False positive rate (FPR), 405
Fashion recommendation system, 191–196, 192f
Fault diagnosis, 120
f-Divergence, 332
Feature collection, 391–396
FG-SRGAN, 4–5
Filters
  accuracy comparison, 94f
  classification outputs, 93, 94t
  FPR comparison, 94, 95f
  harmonic mean, 95, 96f
  PPV comparison, 95, 95f
  sensitivity comparison, 93–94, 94f
  specificity comparison, 94, 95f
Fingerprints, 144
Flow and texture generative adversarial network (FTGAN), 334t, 335, 335f
FlowGAN, 335
Fluorescein angiography, 348–349, 349f, 351, 351f, 371
Forum of International Respiratory Societies (FIRS), 161
Frame-level anomaly detection, 377–378, 406
Frechet inception distance (FID), 50–51
F1 score, 50
Fully connected convolutional GANs (FCC-GANs), 103–104, 117
Fully connected GANs, 103–104
Fusion, 391–395, 394f
Fuzzy C-means, 82–83
Fuzzy C-means clustering (FCMC), 83–84, 87–88

G
GANs. See Generative adversarial networks (GANs)
Gaussian filter, 89
Generative adversarial networks (GANs), 1, 2f, 18, 59–64, 185–186, 379–380
  advantages, 127–128, 329–330, 342, 416–417
  applications, 73–76, 119–120, 127
  architectures, 19f, 102–116, 125–126, 126f, 381f
  vs. autoencoder, 379–380
  based on image-to-image translation, 389–390
  basic structure, 99, 100f
  building blocks of, 331–332
  components, 125
  cross-channel, 383–385
  cross-channel adversarial discriminators, 387–389
  cyclical, 348, 352–355
  design of, 60f
  disadvantages, 343
  fake images (see DeepFake)
  future frame prediction, 385–387
  generic framework, 329, 330f
  image-to-image, 352–353
  issues and challenges, 11–12
  limitations, 128, 416–417
  loss functions and distance metrics, 32–33t
  model, 125
  need for, 99–102
  objective functions, 332
  parts, 99
  pros and cons of, 34–35t
  research gaps, 117–119
  structure of, 381–383
  training process, 331–332
  variants, 126
  for video generation and prediction, 333–337
  for video recognition, 337–340
  for video summarization, 340–341
  working principle, 125
Generative adversarial text-to-image synthesis, 137–138
Generator model (GM), 62, 62f, 210–211, 211f
Generator network, 293
Geographically weighted regression (GWR), 209
Geometry-guided CGANs, 135
GoogLeNet, 65, 178–181, 190
Gradient penalty, 22
Grocott-Gomori methenamine silver (GMS) stain, 264
Guided image filtering, 89

H
Harmonic mean, 88
  of filters, 95, 96f
Hematoxylin and eosin (H&E) stain, 264, 271–273, 273t, 275, 276f
Hierarchical generative adversarial networks (HiGAN), 334t, 337–339
High-quality images, 101f, 120
High-resolution picture generation, 74
Histopathology staining, GANs for, 266
  applications, 264
  automatic nonrigid histological image registration (ANHIR) dataset, 273–275
  conditional GANs (CGANs), 270
  dataset, 272–275
  deep convolutional GAN (DCGAN), 266–267
  discriminator, 263, 283t
  errors lung lobe tissue, 280, 280f
  generator, 263, 282t
  histology, 264, 271–272
  histopathological analysis, 271
  histopathology, 264–265
  image-quality metrics, 268–269
  image-to-image translation, 265, 269–271
  kidney tissue, 278–279, 278f
  lung lesion tissue, 275–276, 276–277f
  lung lobe tissue, 278–279, 279f
  machine learning, 265
  medical imaging, 271–272
  network architectures, 272–275, 281–282
  optimization functions, 267–268
  vanilla, 266

I
Identity shortcut connection. See ResNet
Image datasets, 49, 49t
Image generation, 73, 119
  applications, 244–259
    face generation, 247–250
    image animation, 254–259, 256f
    image-to-image translation, 245–247
    photo-realistic images, 250–253
    scene generation, 254–259, 258–259f
  generative adversarial network (GAN)
    architecture, 240f
    Art2Real, 250–253, 253–254f
    cycleGAN, 245–247, 247–248f
    dataset, 246t
    first-order motion, 254–259
    implementation, 245t
    monkey net, 254–255, 255–256f
    Nash equilibrium, 239–243
    stackGAN, 254–259, 257f
    starGAN, 247–250
    superresolution (SR), 250–253
  variational autoencoder (VAE), 243–244, 244f
ImageNet, 191, 196
Image segmentation, 81–82
Image super-resolution, 6
Image synthesis, 119
Image-to-image translation, 73, 101f, 119, 132–136, 213–214, 389–390
  histopathology staining, 265, 269–271
  using cycle-GAN, 245–247, 247–248f
Imfilter, 89
Imguided filter, 89
Imitation game, 1
Improved GAN (IGAN), 48, 48f
Improved video generative adversarial network (iVGAN), 337, 338t, 338f
Inception score (IS), 51
Incremental learning, 144
Info GAN, 22–23, 23f, 188–190, 188f
  loss functions and distance metrics, 32–33t
  pros and cons of, 34–35t
Infrared image translation, 315
  generative adversarial network in, 315–316
Integral probability metric, 332
Intersection over union (IoU), 51
Inverse CWT, 176, 177f
IR-to-RGB translation, 313–314, 314f, 318–319, 319f, 323–325

J
Jaccard index, 51
Jensen-Shannon (JS) divergence, 266
Jensen-Shannon divergence (JSD), 240, 242
Julia, 54

K
Kernelized FCM, 84–85
Kernel maximum mean discrepancy (KMMD), 323–325
Kidney tissue, 278–279, 278f
k-Lipschitz constant, 21
Kullback-Leibler divergence (KLD), 17–18, 238, 266

L
LabelMe, 49t
Laplacian pyramid GAN (LAPGAN), 69–70, 127
Least-square loss, 219
Legal case reports, 43t
LeNet, 190
Long short-term memory (LSTM), 37, 67–68
  architecture, 68f
Loss function-based conditional progressive growing GAN (LC-PGGAN), 46, 46f
Lung lesion tissue, 275–276, 276–277f
Lung lobe tissue, 278–279, 279f

M
Machine learning, 381–382, 404
Magnetic resonance images (MRI), 45
Markov Chain Monte Carlo (MCMC) method, 416–417
Masson’s trichome (MAS) stain, 274
MatLab, 53–54
Mean absolute error (MAE), 275
Mean absolute percentage error (MAPE), 298
Mean average error (MAE), 364–366
Mean squared error (MSE), 221–222, 364–366
Medical imaging, 347–348, 350–351, 355–356, 364, 371–372
Microaneurysms, 361–362f, 363, 367
MinMax FCM, 86–87
MirrorGAN, 139
Missing part generation, 120
MNIST digit generation, 106, 106f
MobileNet, 191
MoCoGAN, 333–334, 334t, 334f
Mode collapse, 54
Modified generator GAN (MG-GAN), 164
Monkey net, 254–259
Motion energy image (MEI), 333
Motion history image (MHI), 333
MRI brain tumor, 49t
Multichannel attention selection GAN, 134
Multichannel residual conditional GAN (MCRCGAN), 45, 45f
Multiconditional generative adversarial network (MC-GAN), 138
Multidomain image-to-image translation, 132
Multimodal image-to-image translation, 132–134
Multimodal reconstruction, 348
  of retinal image, 351–360
    ablation analysis, 366–368
    cyclical generative adversarial networks (GANs), 348–349, 352–355
    datasets, 360
    network architectures, 357–360
    qualitative evaluation, 360–364
    quantitative evaluation, 364–366
    SSIM methodology, 355–357
    structural coherence, 369–370, 370f
Multiscale dense block generative adversarial network (MSDB-GAN), 48–49, 48f
Multistage dynamic generative adversarial network (MSDGAN), 336, 336f, 338t
Multitask learning (MTL), 291
MUNIT, 315–316, 322–325, 324f
Mutual information (MI), 22–23

N
Nash equilibrium, 54, 239–243
Natural language processing
  datasets, 42, 43t
  GAN application in, 33–41
NDVI. See Normalized difference vegetation index (NDVI)
Near-infrared (NIR) images, 207, 220
Near-infrared (NIR) spectrum, 216–217
New face construction, 146–147
News summarization dataset, 43t
Noisy speech, 43t
Normalized difference vegetation index (NDVI)
  applications, 208–209
  cycle generative adversarial networks, 212–216
    architecture, 217–218, 217f
    model, 223f
    qualitative evaluation, 225f
  data augmentation, 220, 222f
  datasets, 217f
  deep learning-based approaches, 209–212
  estimation
    country category, 226f, 229f
    field category, 227f, 230f
    mountain category, 228f, 231f
  evaluation metrics, 221–222
  formulations, 208–209
  least-square loss, 219
  loss functions, 218–219
  overview, 205–207
  residual learning model (ResNet), 214–215

O
Object-driven generative adversarial networks (ObJGAN), 140
Octave GANs, 109–110, 117
o-Kernelized FCM, 84–85
Open Images, 49t
OpenStreetMap, 49t
Open subtitles dataset, 43t
OpinRank, 43t
Oxford 102, 49t

P
Pairwise learning, 145–146
Parallel GAN, 25–26, 25f
  loss functions and distance metrics, 32–33t
  pros and cons of, 34–35t
Parameterized ReLu (PrelU), 7
Patch extraction, 391–395, 395f, 407
PatchGAN, 270–271, 322, 359–360, 399, 400f, 407–408
PCA-GAN, 5
Peak signal-to-noise ratio (PSNR), 6, 268
Perceptual loss, 7–9, 320
Periodic acid-Schiff (PAS) stains, 274
Person reidentification (REID), 10
PG-GAN, 46, 46f
Photo inpainting, 74
Photo-realistic images, 250–253
Pixel accuracy, 406
Pixel convolution neural networks (PixelCNN), 235–237
Pixel-level anomaly localization, 377–378, 406
Pixel recurrent neural networks (PixelRNN), 235–236
Pix2pix, 61t
PoseGAN, 342
Precision, 50, 88
Progesterone receptor (PR) antibody stains, 274
Python, 52–53

Q
Quality-aware GAN, 47, 47f
Quasi recurrent neural network (QRNN), 39–40
QuGAN, 39–40, 39f

R
Radon-Nikodym theorem, 241
RaFD dataset, 247–248
RankGAN, 37–38, 37f
Realistic photograph generation, 75
Recall, 50
Recall-oriented understudy for gisting evaluation (ROUGE) score, 52
Receiver operating characteristic (ROC), 405
Reconstruction objective function, 332
Rectified linear unit (ReLU), 322
Recurrent neural network (RNN), 18, 66, 290
  architecture, 66f
  generative adversarial networks (GANs)
    accuracy, 298–300, 299–300t
    adversarial/discriminator network, 293, 293f
    architecture, 294–296
    deep-GAN model, 291–292
    denoising method, 291
    encoder-decoder network, 291–292
    enhanced attention, 292
    F-measure, 298, 300–305, 303–304t
    generator network, 293
    geometric-mean (G-mean), 298, 305, 305–306t
    learning module, 295f
    mean absolute percentage error (MAPE), 298, 305–308, 306–307t
    multideep, 292
    multitask learning (MTL), 291
    optimization, 294–296
    performance, 295f, 297–308
    sensitivity, 298, 300, 301–302t
    specificity, 298, 300, 302–303t
    Wasserstein, 292
Reinforce GAN, 38, 38f
Residual blocks, 7
ResNet, 6, 65, 190–191, 214–215, 216f, 217–218
Resnet 50 network model, 178–182, 179–180f, 180–181t
ResNeXt, 5
Respiratory sound synthesis, conditional GAN
  algorithm, 170–172
  analysis, 179–181
  data augmentation, 174–175
  dataset, 174–181
  discriminator network architecture, 169, 170f
  generator network architecture, 168–169, 169f
  inverse CWT, 176, 177f
  performance, 177–179
  scalograms, 170b, 176, 176f
  steps, 173–174
  system model, 167–168, 168f
  time-scale representation, 168
  trained network model, 177, 177t
Restricted Boltzmann machines (RBM), 67
Res_WGAN, 5
Retinal image, multimodal reconstruction of, 348, 351–360
  ablation analysis, 366–368
  cyclical generative adversarial networks (GANs), 348–349, 352–355
  datasets, 360
  network architectures, 357–360
  qualitative evaluation, 360–364
  quantitative evaluation, 364–366
  SSIM methodology, 355–357
  structural coherence, 369–370
RNN. See Recurrent neural network (RNN)
Root mean squared error (RMSE), 224, 232t, 275
R programming, 53

S
Scale-adaptive low-resolution person reidentification (SALR-REID), 10
Scalograms, 170b, 176, 176f
Scene generation, 254–259, 258–259f
Seismic images, SRGAN-based model, 7, 8f
Semantic similarity discriminator, 36, 37f
Semi GAN, 27–28, 27f
  loss functions and distance metrics, 32–33t
  pros and cons of, 34–35t
Semisupervised learning, 19f, 27
  Semi GAN, 27–28, 27f
Sensitivity, 51
SeqAttnGAN, 118
SeqGAN, 36, 36f
Sequential GAN
  supervised, 31, 31f
  unsupervised, 24–25, 25f
SGAN, 42–44, 44f
Shadow maps, 120
Silhouette method, 84–85
Simple generative adversarial networks, 166
Smooth muscle actin (SMA) stain, 274
Spatial FCM, 84–85
Spatiotemporal translation model, 396–400
Specificity, 51
Speech enhancement, 120
sp-Kernelized FCM, 84–85
SRResNet, 5
Stacked generative adversarial networks (StackGAN), 61t, 110–113, 138, 254–259
Stacked generative adversarial networks (StackGAN++), 139
StarGAN, 73, 132, 247–250, 322–323, 324f
Stochastic Adam optimizer, 224–225
Structural similarity index, 221–222, 224, 232t, 268
  video anomaly detection (VAD), 406–407
Superresolution (SR), 74, 250–253
Super-resolution GAN (SR-GAN), 5, 61t, 71–72, 127
  architecture of, 6–7, 7–8f
  image quality and, 11
  network architecture, 7
  perceptual loss, 7–9
    adversarial loss, 9
    content loss, 8–9
  video surveillance and forensic application, 10
Supervised learning, 18–19, 28
  ACGAN, 30–31, 30f
  bidirectional GAN (BiGAN), 29–30, 29f
  conditional GAN (CGAN), 28–29, 28f
Supervised sequential GAN, 31, 31f
  loss functions and distance metrics, 32–33t
  pros and cons of, 34–35t

T
Temporal GAN (TGAN), 61t, 69, 334t, 337, 338t
Text-to-image GAN, 45–46, 46f
Text-to-image synthesis, 76, 136–142
TextureGAN, 335
Thermal image translation. See Wavelet-guided generative adversarial network (WGGAN)
TH-GAN, 41, 41f
2D median filter, 89
3D object generation, 75
3D presentation states (3DPR), 92
Threshold, 402
Tithonium Chasma, 252f
Trained model
  of discriminator, 3, 3f
  of generator, 3, 4f
True positive rate (TPR), 405
Turing Test, 1
Two-stage general adversarial network (TsGAN), 44–45, 45f

U
UCSD dataset, 403, 403f, 407
UGATIT, 322–323, 324f
UMN dataset, 403–404
U-Net, 269–271, 385–387, 414–415
U-Net generator, 217f, 220
Unified GAN (UGAN), 39, 39f, 132
Unpaired photo-to-caricature translation, 136
Unsupervised generative attentional networks (U-GAT-IT), 134
Unsupervised learning, 18–19, 19f, 210, 377–379, 389
  BEGAN, 23–24, 24f
  cycle GAN, 25–26, 26f
  of generative adversarial network, 390–400
  InfoGAN, 22–23, 23f
  parallel GAN, 25–26, 25f
  sequential GAN, 24–25, 25f
  vanilla GAN, 19–20, 20f
  Wasserstein GAN, 20f, 21–22
  WGAN-GP, 20f, 22
UT-Zap50K benchmark dataset, 186–187, 196–197

V
VAD. See Video anomaly detection (VAD)
VAE. See Variational autoencoder (VAE)
Vanilla GAN, 19–20, 19–20f, 126, 188, 266
  loss functions and distance metrics, 32–33t
  pros and cons of, 34–35t
Vanishing gradients, 54
Variational autoencoder (VAE), 17–18, 237–239, 243–244, 244f, 314–315
  wavelet-guided, 317–319
Vari GAN, 61t, 68–69
Vegetation indexes (VIs), 205
  normalized difference vegetation index (NDVI)
    applications, 208–209
    formulations, 208–209
VGGNet, 65, 186
Video analytics, 329
  generative adversarial network (GAN), 334t, 338t
    for video generation and prediction, 333–337
    for video recognition, 337–340
    for video summarization, 340–341
Video anomaly detection (VAD), 377–378
  deep spatiotemporal translation network (see Deep spatiotemporal translation network (DSTN))
  evaluation
    area under curve (AUC), 405–406
    equal error rate (EER), 406
    frame-level evaluations, 406
    pixel accuracy, 406
    pixel-level evaluations, 406
    receiver operating characteristic (ROC), 405
    structural similarity index (SSIM), 406–407
  generative adversarial network (GAN), 379–380
    advantages, 416–417
    based on image-to-image translation, 389–390
    cross-channel, 383–385
    cross-channel adversarial discriminators, 387–389
    limitations, 416–417
    prediction based on, 385–387
    structure of, 381–383
    for surveillance videos, 378–379
Video frame prediction, 76
Video GAN (VGAN), 61t, 71
Video retargeting, 339
Video surveillance, 378–379
Video synthesis, 76, 120
Video understanding, 333
Visual Genome, 49t
Visual similarity search systems
  fashion recommendation system (see Fashion recommendation system)
  test results, 197, 198–200f
  web interface, 197–200, 201f

W
WarpGAN, 135–136
Wasserstein distance, 21
Wasserstein GAN (WGAN), 20f, 21–22, 114–116, 292
  loss functions and distance metrics, 32–33t
  pros and cons of, 34–35t
Wavelet-guided generative adversarial network (WGGAN)
  adaptive moment optimization (ADAM), 322
  architecture, 316–317
  cycleGAN, 322–325, 324f
  FLIR ADAS dataset, 321, 321t, 323
  MUNIT, 315–316, 322–325, 324f
  qualitative analysis, 321
    translation results, 323, 324f
  quantitative analysis, 322
    translation results, 323–325, 325t
  StarGAN, 322–323, 324f
  UGATIT, 322–323, 324f
  wavelet-guided variational autoencoder (WGVA), 317–319
    cycle-consistency loss, 319–320
    discrete wavelet transformation, 318–319
    ELBO loss, 320
    full loss, 321
    GAN loss, 320–321
    perceptual loss, 320
    reparameterization, 318
Wavelet-guided variational autoencoder (WGVA), 317–319, 325
  cycle-consistency loss, 319–320
  discrete wavelet transformation, 318–319
  ELBO loss, 320
  full loss, 321
  GAN loss, 320–321
  perceptual loss, 320
  reparameterization, 318
Weak supervision, 333
WGAN-GP, 20f, 22
  loss functions and distance metrics, 32–33t
  pros and cons of, 34–35t
WGGAN. See Wavelet-guided generative adversarial network (WGGAN)
WGVA. See Wavelet-guided variational autoencoder (WGVA)
Wiener 2 filtering, 89

Y
YELP dataset, 43t

Z
ZFNet, 65