
Generative Adversarial Networks for Image-to-Image Translation

Edited by
ARUN SOLANKI
Assistant Professor, Department of Computer Science and
Engineering, Gautam Buddha University, Greater
Noida, India
ANAND NAYYAR
Lecturer, Researcher and Scientist, Duy Tan University,
Da Nang, Viet Nam
MOHD NAVED
Assistant Professor, Analytics Department, Jagannath
University, Delhi NCR, India
Academic Press is an imprint of Elsevier
125 London Wall, London EC2Y 5AS, United Kingdom
525 B Street, Suite 1650, San Diego, CA 92101, United States
50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States
The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom
Copyright © 2021 Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical,
including photocopying, recording, or any information storage and retrieval system, without permission in writing
from the publisher. Details on how to seek permission, further information about the Publisher’s permissions policies
and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency,
can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as
may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our
understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any
information, methods, compounds, or experiments described herein. In using such information or methods they
should be mindful of their own safety and the safety of others, including parties for whom they have a professional
responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for
any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any
use or operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library
ISBN: 978-0-12-823519-5
For information on all Academic Press publications
visit our website at https://www.elsevier.com/books-and-journals
Publisher: Mara Conner
Acquisitions Editor: Chris Katsaropoulos
Editorial Project Manager: Gabriela D. Capille
Production Project Manager: Niranjan Bhaskaran
Cover Designer: Christian J. Bilbow
Typeset by SPi Global, India
Contributors
Er. Aarti
Department of Computer Science & Engineering, Lovely Professional University, Phagwara,
Punjab, India
Supavadee Aramvith
Multimedia Data Analytics and Processing Research Unit, Department of Electrical Engineering,
Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand
Tanvi Arora
Department of CSE, CGC College of Engineering, Landran, Mohali, Punjab, India
Betul Ay
Firat University Computer Engineering Department, Elazig, Turkey
Galip Aydin
Firat University Computer Engineering Department, Elazig, Turkey
Junchi Bin
University of British Columbia, Kelowna, BC, Canada
Erik Blasch
MOVEJ Analytics, Dayton, OH, United States
Udaya Mouni Boppana
Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn
Malaysia, Parit Raja, Johor, Malaysia
Najihah Chaini
Faculty of Applied Sciences and Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja,
Johor, Malaysia
Amir H. Gandomi
University of Technology Sydney, Ultimo, NSW, Australia
Aashutosh Ganesh
Radboud University, Nijmegen, The Netherlands
Thittaporn Ganokratanaa
Department of Electrical Engineering, Faculty of Engineering, Chulalongkorn University,
Bangkok, Thailand
Koshy George
SRM University—AP, Guntur District, Andhra Pradesh, India
Meenu Gupta
Department of Computer Science & Engineering, Chandigarh University, Ajitgarh, Punjab,
India
Álvaro S. Hervella
CITIC Research Center; VARPA Research Group, Biomedical Research Institute of A Coruña
(INIBIC), University of A Coruña, A Coruña, Spain
Kavikumar Jacob
Faculty of Applied Sciences and Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja,
Johor, Malaysia
Rachna Jain
Department of Computer Science & Engineering, Bharati Vidyapeeth’s College of Engineering,
Delhi, India
S. Jayalakshmy
IFET College of Engineering, Villupuram, India
Leta Tesfaye Jule
Department of Physics, College of Natural and Computational Science; Centre for Excellence in
Indigenous Knowledge, Innovative Technology Transfer and Entrepreneurship, Dambi Dollo
University, Dambi Dollo, Ethiopia
A. Sampath Kumar
Department of Computer Science and Engineering, Dambi Dollo University, Dambi Dollo,
Ethiopia
Meet Kumari
Department of Electronics & Communication Engineering, Chandigarh University, Ajitgarh,
Punjab, India
Lakshay
Department of Computer Science & Engineering, Bharati Vidyapeeth’s College of Engineering,
Delhi, India
Zheng Liu
University of British Columbia, Kelowna, BC, Canada
H.R Mamatha
Department of CSE, PES University, Bengaluru, India
Omkar Metri
Department of CSE, PES University, Bengaluru, India
Aida Mustapha
Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn
Malaysia, Parit Raja, Johor, Malaysia
D. Nagarajan
Department of Mathematics, Hindustan Institute of Technology and Science, Chennai, India
Jorge Novo
CITIC Research Center; VARPA Research Group, Biomedical Research Institute of A Coruña
(INIBIC), University of A Coruña, A Coruña, Spain
Marcos Ortega
CITIC Research Center; VARPA Research Group, Biomedical Research Institute of A Coruña
(INIBIC), University of A Coruña, A Coruña, Spain
Lakshmi Priya
Manakula Vinayaga Institute of Technology, Pondicherry, India
Krishnaraj Ramaswamy
Centre for Excellence in Indigenous Knowledge, Innovative Technology Transfer and
Entrepreneurship; Department of Mechanical Engineering, Dambi Dollo University, Dambi
Dollo, Ethiopia
S. Mohamed Mansoor Roomi
Department of Electronics and Communication Engineering, Thiagarajar College of
Engineering, Madurai, India
Jose Rouco
CITIC Research Center; VARPA Research Group, Biomedical Research Institute of A Coruña
(INIBIC), University of A Coruña, A Coruña, Spain
Angel D. Sappa
ESPOL Polytechnic University, CIDIS-FIEC, Guayaquil, Ecuador; Computer Vision Center,
Edifici O, Campus UAB, Bellaterra, Barcelona, Spain
K. Saruladha
Pondicherry Engineering College, Department of Computer Science and Engineering,
Puducherry, India
A. Sasithradevi
School of Electronics Engineering, VIT University, Chennai, India
R. Sivaranjani
Department of Electronics and Communication Engineering, Sethu Institute of Technology,
Madurai, India
Rituraj Soni
Department of CSE, Engineering College Bikaner, Bikaner, Rajasthan, India
S. Sountharrajan
School of Computing Science and Engineering, VIT Bhopal University, Bhopal, India
Patricia L. Suárez
ESPOL Polytechnic University, CIDIS-FIEC, Guayaquil, Ecuador
Gnanou Florence Sudha
Pondicherry Engineering College, Pondicherry, India
E. Thirumagal
Pondicherry Engineering College, Department of Computer Science and Engineering,
Puducherry; REVA University, Bengaluru, India
Boris X. Vintimilla
ESPOL Polytechnic University, CIDIS-FIEC, Guayaquil, Ecuador
N. Yuuvaraj
Research and Development, ICT Academy, Chennai, India
Ran Zhang
University of British Columbia, Kelowna, BC, Canada
CHAPTER 1
Super-resolution-based GAN for image processing: Recent advances and future trends
Meenu Gupta(a), Meet Kumari(b), Rachna Jain(c), and Lakshay(c)
(a) Department of Computer Science & Engineering, Chandigarh University, Ajitgarh, Punjab, India
(b) Department of Electronics & Communication Engineering, Chandigarh University, Ajitgarh, Punjab, India
(c) Department of Computer Science & Engineering, Bharati Vidyapeeth's College of Engineering, Delhi, India
1.1 Introduction
One can think of the term machine as older than the computer itself. In 1950, the computer scientist, logician, and mathematician Alan Turing penned a paper for the generations to come, "Computing Machinery and Intelligence" [1]. Today, computers not only match humans but have outperformed them in many tasks. We sometimes take for granted achievements such as superhuman face recognition or accurately cleaning a patient's medical image, even by a small algorithm, because a machine learning algorithm excels at pattern recognition in existing image data, using features for tasks such as classification and regression. When we try to generate new data, however, the computer has struggled [2]. An algorithm can easily defeat a chess grandmaster, classify whether a transaction is fraudulent, and classify whether a medical report indicates a disease, yet it fails at humanity's most basic and essential capacities, including crafting an original creation or holding a pleasant conversation. Mahdizadehaghdam et al. [3] discuss a test named the imitation game, also known as the Turing test: behind a closed door, an unknown observer talks with two counterparts, a computer and a human.
In 2014, many of these problems were addressed when Ian Goodfellow invented generative adversarial networks (GANs). This technique has enabled computers to generate realistic data by using two separate neural networks. Before GANs, programmers had proposed different ways to produce and analyze generated data, but the results were not up to the mark. When GANs were introduced for the first time, they showed remarkable results: the generated fake images were hardly distinguishable from real photographs and had real-world-like quality. GANs can even turn scribbled sketches into photograph-like images [4].
Fig. 1.1 Improvement in the realism of generated images as generative adversarial networks have evolved [4].
Fig. 1.1 shows how far GANs have come in recent years in generating and improving realistic images. The first GAN results in 2014 could only produce blurred human faces, and even that achievement was celebrated as a success; the generated faces have improved continuously since then. Just 3 years later, we could no longer tell which images were fake and which were genuine high-resolution portrait photographs [5].
GANs are a category of machine learning techniques that use two simultaneously trained models: a generator, which generates fake data, and a discriminator, which separates the generated data from the images in the real dataset. The word generative indicates creating new data from the given data; a GAN generates data whose characteristics it learns from the given training set. The term adversarial points to the dynamic maintained between the two models, the generator and the discriminator. The two networks continually try to outwit each other: the generator produces better fake images in order to become more convincing, while the discriminator tries to distinguish the real data examples from the generated fakes. The word networks indicates the class of machine learning models used: the generator and the discriminator are commonly implemented as neural networks, and the more complex the neural networks, the more complex the implementation of the GAN [6].
A GAN has two models that are combined and run simultaneously. The discriminator first receives input from the real data coming from the training dataset; from then on it has two input sources, the actual data and the fake examples coming from the generator. A random noise vector is passed through the generator, and the output acquired from the generator is a fake example that tries to resemble the real data as closely as possible. The discriminator predicts the probability that its input is real. The main purpose of creating two separate models is to handle the fake data that are generated alongside the training dataset: the discriminator's goal is to differentiate between the fake data produced by the generator and the real input examples from the dataset. This section further discusses the training of the discriminator and the generator in Sections 1.1.1 and 1.1.2 [7].
Fig. 1.2 Train the discriminator.
1.1.1 Train the discriminator
Fig. 1.2 illustrates the training of the discriminator; the steps are as follows [8]:
(a) First, get a random real example x from the given training dataset.
(b) Now get a new random vector z and, utilizing the generator network, synthesize a
fake example as x*.
(c) Utilize the discriminator network to distinguish between x* and x.
(d) Compute the classification error and back-propagate it; the discriminator's weights and biases are then updated to minimize the classification error.
1.1.2 Train the generator
Fig. 1.3 shows the training of the generator; the steps are labeled as follows (a minimal code sketch of both training procedures is given after this list):
(a) First, sample a new random vector z and use the generator to create x*, i.e., a fake example.
(b) Use the discriminator to classify the real and fake examples.
(c) Compute the classification error and back-propagate it; the classification error is then minimized to update the generator's weights and biases [9].
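To make the two training procedures above concrete, the following is a minimal PyTorch sketch of the alternating updates, assuming toy fully connected networks and random stand-in data; it is an illustrative sketch, not a reference implementation from the cited literature.

```python
import torch
import torch.nn as nn

# Toy networks for illustration only; practical GANs use deeper convolutional models.
generator = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)

real_data = torch.rand(64, 784)            # stand-in for a batch of real examples x
real_lbl = torch.ones(64, 1)
fake_lbl = torch.zeros(64, 1)

for step in range(100):
    # ---- Train the discriminator (steps (a)-(d) of Section 1.1.1) ----
    z = torch.randn(64, 100)               # (b) random vector z
    x_fake = generator(z).detach()         #     synthesized fake example x*
    d_loss = bce(discriminator(real_data), real_lbl) + \
             bce(discriminator(x_fake), fake_lbl)          # (c)-(d) classification error
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # ---- Train the generator (steps (a)-(c) of Section 1.1.2) ----
    z = torch.randn(64, 100)               # (a) new random vector z
    g_loss = bce(discriminator(generator(z)), real_lbl)    # (b)-(c) try to fool the discriminator
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

Detaching the generated batch during the discriminator step keeps generator weights frozen there, so each network is only updated in its own step, matching the alternating procedure described above.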
1.1.3 Organization of the chapter
This chapter is organized as follows. Section 1.2 discusses the background of this work and different research views. Section 1.3 discusses the SR-GAN model for image processing. Section 1.4 presents application-based GAN case studies on enhancing object detection. Section 1.5 discusses the open issues and challenges faced when working with GANs. Section 1.6 concludes the chapter with its future scope.
Fig. 1.3 Train the generator.
1.2 Background study
The goal of Perera et al. [10] is to determine whether a given query is from the same class or a different class. Their solution is based on learning the latent representation of in-class examples using a denoising auto-encoder network. This method gives good results on the COIL, MNIST, and f-MNIST datasets.
Do et al. [11] offer a new perspective on GANs: in face recognition, fake images can be created that aid identity theft and privacy breaches. They proposed a forensic face recognition technique that uses a deep face recognition system as the core of their model and repeatedly creates fake images to help with data augmentation [11].
Tripathy et al. [12] present a generic face animator that controls the pose and expression using a given face image. They implemented a two-stage neural network model,
which is learned in a self-supervised manner.
Rathgeb et al. [13] proposed a supervised deep learning algorithm using CNNs to detect synthetic images. The proposed algorithm gives an accuracy of 99.83% in distinguishing real dataset images from fake images generated using GANs.
Yu et al. [14] take the field to a new level by proposing a method for visual forensics and model attribution. The model supports image attribution, enables fine-grained model authentication, persists across different image frequencies, fingerprint frequencies, and paths, and is not biased.
Lian et al. [15] introduce a guidance module in FG-SRGAN, which is utilized to reduce the space of possible mapping functions and helps to learn the correct mapping function from the low-resolution domain to the high-resolution domain. The guidance module also greatly reduces the adversarial loss.
Takano and Alaghband [16] worked with the SRGAN model. In their work they addressed the problem of sharpening images: converting a low-resolution image to a high-resolution image can give at least a hint of what the real image looks like from a blurry input.
Dou et al. [17] proposed PCA-SRGAN, which greatly improves the performance of GAN-based models on super-resolving face images. The model performs incremental discrimination in the orthogonal projection space spanned by PCA, progressively feeding projection details into the discriminator.
Jiang et al. [18] proposed to improve the perceptual quality of CT images using SRGAN, which greatly enhances the spatial resolution of the image; better perception improves disease analysis on tiny regions and pathological features. They introduced a dilated convolution module, and a mean structural similarity (MSSIM) loss is also introduced to improve the perceptual loss function.
Li et al. [19] provide an improved SRGAN method, a super-resolution image reconstruction solution for the problem of image distortion in textile flaw detection. Their experimental results show that the PSNR of SRGAN is 0.83 higher than that of bilinear interpolation, and the SSIM is 0.0819 higher. SRGAN produces a clearer image and reconstructs richer texture with more high-frequency details, making defects easier to identify, which is important in the flaw detection of fabrics.
Wang et al. [20] use dense convolutional network blocks (dense blocks), which connect each layer to every other layer in a feed-forward manner, as their very deep generator networks. Spectral normalization is applied to the GAN, as the method offers better training stability and visual improvements.
Nan et al. [21] addressed the complex computation, unstable network, and slow learning speed of the generative adversarial network for image super-resolution (SRGAN). They proposed a single-image super-resolution reconstruction model called Res_WGAN based on ResNeXt.
Li et al. [22] discussed an edge-enhanced super-resolution network (EESR), which achieves better generation of high-frequency structures in blind super-resolution. EESR is able to recover textures with 4x upsampling and obtained a PTS of 0.6210 on the DIV2K test set, which is much better than state-of-the-art methods.
Sood et al. [23] worked on magnetic resonance (MR) images, for which obtaining high-resolution scans requires patients to wait for a long time in a still position. Obtaining low-resolution images and then converting them to high-resolution images was evaluated with four models: SRGAN, SRCNN, SRResNet, and sparse representation; among them, SRGAN gives the best result.
Lee et al. [24] present a super-resolution model specialized for license plate images,
CSRGAN, trained with a novel character-based perceptual loss. Specifically, they focus
on character-level recognizability of super-resolved images rather than pixel-level
reconstruction.
Chen et al. [25] divided the technique into two different parts: the first one is to
improve PSNR and the second one is to improve visual quality. They propose a new
dense block, which uses complex connections between each layer to build a more powerful generator. Next, to improve perceptual quality, they found a new set of feature
maps to compute the perceptual loss, which would make the output image look more
real and natural.
Jeon et al. [26] proposed a method to increase the similarity between pixels by performing the operation of the ResNet module, which has an effect similar to that of the
ensemble operation. That gives a better high-resolution image.
Because the resolution of remote sensing images is low, high resolution is required to improve performance. In the cited work [27], the generator is first optimized and residual-in-residual dense blocks without BN (batch normalization) are used. A relativistic generative adversarial network is then introduced, and the perceptual loss is improved [27].
1.3 SR-GAN model for image processing
Image super-resolution is defined as increasing the size of an image while keeping the reduction in quality to a minimum, or as creating a high-resolution image from a low-resolution image by using the details of the original image. The problem is difficult because, for a given low-resolution input image, multiple solutions are available. SR-GAN has numerous applications such as medical image processing, satellite imagery, aerial image analysis, etc. [28].
1.3.1 Architecture of SR-GAN
Many methods achieve good, fast, and accurate single-image super-resolution, but what is still missing is the texture of the original features of the image: the low-resolution image must be recovered in a way that the produced image is not distorted. Some of these errors can be recovered later, but not all of the errors that are produced. The main issue is that a result with a high peak signal-to-noise ratio (PSNR) appears to indicate good image quality but still lacks high-frequency details. Previous approaches also measure similarity in pixel space, which leads to blurry or unsatisfying images. For this reason SR-GAN was introduced, a model that can capture the perceptual difference between the ground-truth image and the model output. Fig. 1.4 shows the architecture of SRGAN [29].
Fig. 1.4 Architecture of SRGAN [29].
The training algorithm of SRGAN consists of the following steps (a data-preparation sketch for step (a) is shown below):
(a) Downsample the HR (high-resolution) images to obtain the corresponding LR (low-resolution) images; both LR and HR images are required for training.
(b) Then pass the LR images through the generator, which upsamples them and produces SR (super-resolution) images.
(c) The SR and HR images are classified by passing them through the discriminator, and the losses are back-propagated [30].
Fig. 1.5 presents the network formed by the generator and the discriminator. It contains convolution layers, parametric ReLU (PReLU) activations, and batch normalization. The generator also implements skip connections similar to ResNet [31].
1.3.2 Network architecture
Deep learning networks are difficult to train; the residual learning framework makes training easier and enables the networks to go substantially deeper, improving performance. In the generator, a total of 16 residual blocks are used [32].
In the generator, a sub-pixel convolution layer is used to upsample the feature maps. Every time pixel shuffle is applied, it rearranges the elements of an L*B*(H*r*r) tensor into an rL*rB*H tensor. To limit computation, the bicubic filter has been removed from the pipeline. Parametric ReLU (PReLU) is used instead of ReLU or LeakyReLU; PReLU adds a learnable parameter, which lets the negative-part coefficient be learned adaptively. The convolution layer "k3n64s1" denotes 3*3 kernel filters outputting 64 channels with stride 1; similarly, "k3n256s1" and "k9n3s1" are other convolution layers that are added (a hedged sketch of these building blocks follows) [33].
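The following sketch illustrates what a "k3n64s1" residual block and a pixel-shuffle upsampling block of this kind might look like in PyTorch. The layer sizes follow the naming convention above but are illustrative assumptions rather than the exact configuration of the model in Fig. 1.5.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """k3n64s1 -> BN -> PReLU -> k3n64s1 -> BN, followed by an elementwise-sum skip connection."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(channels),
            nn.PReLU(),                          # parametric ReLU with a learnable negative slope
            nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)                  # skip connection (elementwise sum)

class UpsampleBlock(nn.Module):
    """k3n256s1 followed by PixelShuffle(r): rearranges C*r*r channels into r-times larger spatial dims."""
    def __init__(self, channels: int = 64, r: int = 2):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels * r * r, kernel_size=3, stride=1, padding=1)
        self.shuffle = nn.PixelShuffle(r)
        self.act = nn.PReLU()

    def forward(self, x):
        return self.act(self.shuffle(self.conv(x)))

x = torch.rand(1, 64, 24, 24)                    # feature map from the LR branch
y = UpsampleBlock()(ResidualBlock()(x))
print(y.shape)                                   # torch.Size([1, 64, 48, 48])
```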
1.3.3 Perceptual loss
The perceptual loss, denoted L^{SR}, is commonly modeled on the mean square error. It is composed of a content loss, denoted L^{SR}_X, and an adversarial loss L^{SR}_{Adv} as the last term, as shown in Eq. (1.1). The weighted sum of both gives the perceptual loss (with a VGG-based content loss):

L^{SR} = L^{SR}_X + 10^{-3} L^{SR}_{Adv}    (1.1)

Fig. 1.5 SRGAN-based model architecture for seismic images (notation: k, kernel size; n, number of channels; s, stride) [30].
1.3.3.1 Content loss
The pixel-wise mean square error loss is evaluated as
L^{SR}_{MSE} = \frac{1}{r^2 L B} \sum_{x=1}^{rL} \sum_{y=1}^{rB} \left( I^{HR}_{x,y} - G_{\theta_G}(I^{LR})_{x,y} \right)^2    (1.2)
Eq. (1.2) is the most widely used optimization target for image SR, on which many state-of-the-art approaches depend [34, 35]. However, while achieving particularly high PSNR, solutions of MSE optimization problems often lack high-frequency content, which results in perceptually unsatisfying images with overly smooth textures [36].
Rather than relying on pixel-wise losses, the approach builds on the ideas of Shi et al. [35], Denton et al. [37], and Ledig et al. [4] and uses a loss function that is closer to perceptual similarity. The VGG loss is defined based on the ReLU activation layers of the pretrained 19-layer VGG network described in Simonyan and Zisserman [38]. With ∅_{i,j} we denote the feature map obtained by the j-th convolution before the i-th max-pooling layer within VGG19, and the VGG loss is defined as the Euclidean distance between the feature representations of the reconstructed image G_{\theta_G}(I^{LR}) and the reference image I^{HR}, as shown in Eq. (1.3) [39]:
L^{SR}_{VGG/i,j} = \frac{1}{L_{i,j} B_{i,j}} \sum_{x=1}^{L_{i,j}} \sum_{y=1}^{B_{i,j}} \left( ∅_{i,j}(I^{HR})_{x,y} - ∅_{i,j}\left(G_{\theta_G}(I^{LR})\right)_{x,y} \right)^2    (1.3)
Here Li, j and Bi, j describe the length and the width of the given feature map used in the
VGG system.
1.3.3.2 Adversarial loss
In addition to the content loss discussed so far, the generative (adversarial) component of the GAN is also included in the perceptual loss. It encourages the network to favor solutions that reside on the manifold of natural images by attempting to fool the discriminator. The generative loss L^{SR}_{adv} is defined based on the probabilities of the discriminator D_{\theta_D}(G_{\theta_G}(I^{LR})) over all training samples, as shown in Eq. (1.4) [40]:
L^{SR}_{adv} = \sum_{n=1}^{N} -\log D_{\theta_D}\left(G_{\theta_G}(I^{LR})\right)    (1.4)
where D_{\theta_D}(G_{\theta_G}(I^{LR})) is the probability that the discriminator classifies the generated image G_{\theta_G}(I^{LR}) as a real high-resolution image. For better gradient behavior, -\log D_{\theta_D}(G_{\theta_G}(I^{LR})) is minimized rather than \log(1 - D_{\theta_D}(G_{\theta_G}(I^{LR}))) [41].
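Combining Eqs. (1.1), (1.3), and (1.4), a hedged PyTorch sketch of the perceptual loss could look as follows. The VGG feature cut-off, the 10^{-3} weighting, and the stand-in tensors are illustrative assumptions; in practice the VGG19 network would be loaded with ImageNet-pretrained weights.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

# Feature extractor: activations of a (pretrained) VGG19 up to a late ReLU layer (Eq. 1.3).
vgg_features = vgg19(weights=None).features[:36].eval()   # load pretrained weights in practice
for p in vgg_features.parameters():
    p.requires_grad = False

mse = nn.MSELoss()

def perceptual_loss(sr, hr, d_fake, eps=1e-8):
    """L_SR = VGG content loss + 1e-3 * adversarial loss (Eqs. 1.1, 1.3, 1.4)."""
    content = mse(vgg_features(sr), vgg_features(hr))       # Eq. (1.3)
    adversarial = -torch.log(d_fake + eps).mean()           # Eq. (1.4): -log D(G(I_LR))
    return content + 1e-3 * adversarial                     # Eq. (1.1)

# Illustrative usage with random stand-in tensors.
sr = torch.rand(2, 3, 96, 96)       # generator output G(I_LR)
hr = torch.rand(2, 3, 96, 96)       # ground-truth HR image
d_fake = torch.rand(2, 1)           # discriminator output D(G(I_LR)), a probability in (0, 1)
print(perceptual_loss(sr, hr, d_fake))
```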
1.4 Case study
This includes the different case studies as applications of EE-GAN to enhance object
detection, edge-enhanced GAN for remote sensing image, application of SRGAN on
video surveillance, and forensic application and super-resolution of video using SRGAN.
1.4.1 Case study 1: Application of EE-GAN to enhance object detection
Detection performance for small objects in remote sensing images has been less satisfactory than for large objects, especially in noisy and low-resolution images. Enhanced super-resolution GAN (ESRGAN) provides significant image-enhancement output; however, reconstructed images generally lose high-frequency edge information, so object detection performance for small objects decreases on low-resolution and noisy remote sensing images. The approach therefore uses residual-in-residual dense blocks (RRDB) in both the ESRGAN and the edge-enhancement network (EEN), and for the detector system uses a faster region-based convolutional network (FRCNN) as well as a single-shot detector (SSD) [42].
1.4.2 Case study 2: Edge-enhanced GAN for remote sensing image
Recent super-resolution (SR) techniques based on deep learning have provided significant comparative merits. Still, they do not recover high-frequency edge details well in noise-contaminated conditions such as remote sensing satellite images. Thus, a GAN-based edge-enhancement network (EEGAN) is used for reliable satellite image SR reconstruction with an adversarial learning method that is insensitive to noise. EEGAN comprises two primary subnetworks: an edge-enhancement subnetwork (EESN) and an ultra-dense subnetwork (UDSN). First, in the UDSN, 2-D dense blocks are assembled for feature extraction to obtain an intermediate high-resolution image that looks sharp but contains artifacts and noise. After that, the EESN enhances and extracts the image contours by purifying the noise-contaminated components with mask processing. The recovered enhanced edges and the intermediate image are then combined to produce clear, high-credibility results. Extensive experiments on Jilin-1 video satellite images, the Kaggle Open Source Dataset, and DigitalGlobe data show better reconstruction performance than previous SR methods [43].
1.4.3 Case study 3: Application of SRGAN on video surveillance and
forensic application
Person reidentification (REID) is a significant task in forensics and video applications. Several past methods are based on the primary assumption that person images have sufficiently high and uniform resolutions, but scale mismatch and low resolution are always present in open-world REID. This is known as scale-adaptive low-resolution person re-identification (SALR-REID). The intuitive way to address this issue is to uniformly upscale the various low resolutions to a high resolution. SRGAN is one of the popular image super-resolution deep networks, but it is constructed with a fixed upscaling parameter, so it is not yet suitable for the SALR-REID task, which requires a network not only to synthesize image features for judging a person's identity but also to provide scale-adaptive upscaling capability. Multiple SRGANs are therefore grouped in series to supplement the ability of image feature representation, with an identification network plugged in. Thus, a cascaded super-resolution GAN (CSRGAN) framework with a unified formulation can be used [44].
1.4.4 Case study 4: Super-resolution of video using SRGAN
SRGAN techniques are used to improve image quality. There are several image transformation methods in which the computing system takes an input and produces an output image. A GAN is a deep neural network that consists of two networks, a discriminator and a generator; GANs are about creation, such as portrait drawing or symphony composition. SRGAN offers various merits over other methods: it proposes a perceptual loss factor that combines the merits of content and adversarial losses. Here the discriminator block discriminates real HR images from the produced super-resolved images [45], while the generator is trained through the propagated model errors. The adversarial loss function utilizes a discriminator network that has already been trained to discriminate between the two kinds of pictures, whereas the content loss function utilizes perceptual similarity instead of pixel-space similarity. The notable thing about SRGAN is that it produces data resembling real data, and SRGANs learn internal representations to produce upscaled images [46]. The neural network faithfully recovers photo-realistic textures from downgraded images. SRGAN methods do not merely aim at a high peak signal-to-noise ratio but also give high visual perception quality and efficiency. Joining the adversarial and perceptual losses produces a high-quality super-resolution image. Moreover, during the training phase perceptual losses evaluate image similarities more robustly than per-pixel losses. Further, perceptual loss functions identify the high-level semantic and perceptual differences between the generated images [42].
1.5 Open issues and challenges
When we train GAN models, we face several major problems, including nonconvergence, mode collapse, diminished gradients, and imbalance between the two models. GANs are sensitive to hyperparameter choices. Sometimes the model partially collapses [45]: when the gradient with respect to I^{LR} approaches zero, the model collapses. When the model is restarted, training of the discriminator runs into the single-mode effect: the discriminator takes charge and simply shifts from one single point to the next most likely point [46].
Overfitting is another main challenge, as is maintaining the balance between the generator and the discriminator. Several solutions have been proposed; one is to use a cost function with a nonvanishing gradient instead. Nonconvergence occurs with both low and high mesh quality [47].
GANs also cannot easily be applied to static data, since more complex convolution layers would be required to classify the real and fake static data, and this has not been achieved. Some theoretical results exist but cannot yet be implemented [7].
Again, alongside the various merits of GANs, there are still open challenges that need to be solved for their employment in medical imaging. In cross-modality image synthesis and image reconstruction, most works still adopt traditional shallow reference metrics like PSNR, SSIM, or MAE for quantitative analysis. However, these measures do not correlate well with the visual quality of the image; e.g., direct optimization of a pixel-wise loss generates blurry results but yields higher numbers than using an adversarial loss [48]. This makes it very difficult to interpret such head-to-head comparison numbers for GAN-based works, particularly when other losses are present. One way to reduce this issue is to use downstream tasks such as classification or segmentation to validate the quality of the produced samples. Another method is to recruit domain experts, but this method is time-consuming, expensive, and hard to scale [49].
Today, GANs have been applied in more than 20 basic applications, each covering a broad area. Among the most important are satellite images, for which GANs are well suited for training and testing. Medical images such as MRI and X-ray images are of low resolution and their edges are not sharp enough, so extracting more features is not possible without the help of SR-GAN and EE-GAN [50].
1.6 Conclusion and future scope
In the years before the discovery of GANs, image processing of satellite images or medical X-ray images was quite hard for feature extraction purposes, and classification was also difficult due to the high error rates at the time. In a single satellite image, each pixel can represent 10 m or more on the ground, which significantly limits feature extraction. Because such images are of low quality and the objects in them are blurry, SR-GAN is used to obtain high-resolution images. As both models run at the same time, training time is greatly reduced. In today's world, GAN is also used to generate fake data. Hence,
many algorithms are proposed, which make fake things appear real. GAN has several
other applications, including making recipes, songs, fake images of a person, generating
Cartoon characters, generating new human poses, face aging, and photo blending. These
are the areas generally used in the present scenarios where GAN is freely applied. In
future, by using GANs, we can create videos of robot motion and train a robot through progressive enhancement. Some researchers are working on the de novo generation of new molecules with desired properties in silico. Many researchers are also working on applying GANs to the autonomous driving of self-driving cars.
References
[1] K. Jiang, Z. Wang, P. Yi, G. Wang, T. Lu, J. Jiang, Edge-enhanced GAN for remote sensing image
superresolution, IEEE Trans. Geosci. Remote Sens. 57 (8) (2019) 5799–5812.
[2] V. Ramakrishnan, A.K. Prabhavathy, J. Devishree, A survey on vehicle detection techniques in aerial
surveillance, Int. J. Comput. Appl. 55 (18) (2012).
Super-resolution-based GAN for image processing
[3] S. Mahdizadehaghdam, A. Panahi, H. Krim, Sparse generative adversarial network, in: Proceedings of
the IEEE International Conference on Computer Vision Workshops (ICCVW), 2019, pp. 3063–3071.
[4] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, W. Shi, Photo-realistic single
image super-resolution using a generative adversarial network, in: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2017, pp. 4681–4690.
[5] S. Borman, R.L. Stevenson, Super-resolution from image sequences—a review, in: 1998 Midwest
Symposium on Circuits and Systems (Cat. No. 98CB36268), IEEE, 1998, pp. 374–378.
[6] D. Dai, R. Timofte, L. Van Gool, Jointly optimized regressors for image super-resolution, in: Computer Graphics Forum, vol. 34, 2015, pp. 95–104. No. 2.
[7] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, Y. Fu, Image super-resolution using very deep residual
channel attention networks, in: Proceedings of the European Conference on Computer Vision
(ECCV), 2018, pp. 286–301.
[8] N.M. Nawawi, M.S. Anuar, M.N. Junita, Cardinality improvement of Zero Cross Correlation (ZCC)
code for OCDMA visible light communication system utilizing catenated-OFDM modulation
scheme, Optik 170 (2018) 220–225.
[9] P. Mamoshina, L. Ojomoko, Y. Yanovich, A. Ostrovski, A. Botezatu, P. Prikhodko, I.O. Ogu, Converging blockchain and next-generation artificial intelligence technologies to decentralize and accelerate biomedical research and healthcare, Oncotarget 9 (5) (2018) 5665–5690.
[10] P. Perera, R. Nallapati, B. Xiang, Ocgan: one-class novelty detection using gans with constrained latent
representations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2898–2906.
[11] N.T. Do, I.S. Na, S.H. Kim, Forensics face detection from gans using convolutional neural network,
in: Proceeding of 2018 International Symposium on Information Technology Convergence (ISITC
2018), 2018.
[12] S. Tripathy, J. Kannala, E. Rahtu, Icface: interpretable and controllable face reenactment using gans, in:
The IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 3385–3394.
[13] C. Rathgeb, A. Dantcheva, C. Busch, Impact and detection of facial beautification in face recognition:
an overview, IEEE Access 7 (2019) 152667–152678.
[14] N. Yu, L.S. Davis, M. Fritz, Attributing fake images to gans: learning and analyzing gan fingerprints, in:
Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7556–7566.
[15] S. Lian, H. Zhou, Y. Sun, Fg-srgan: a feature-guided super-resolution generative adversarial network
for unpaired image super-resolution, in: International Symposium on Neural Networks, Springer,
Cham, 2019, pp. 151–161.
[16] N. Takano, G. Alaghband, Srgan: Training Dataset Matters, arXiv, 2019. preprint arXiv:1903.09922.
[17] H. Dou, C. Chen, X. Hu, Z. Hu, S. Peng, PCA-SRGAN: Incremental Orthogonal Projection Discrimination for Face Super-Resolution, arXiv, 2020. preprint arXiv:2005.00306.
[18] X. Jiang, Y. Xu, P. Wei, Z. Zhou, CT image super resolution based on improved SRGAN, in: 2020
5th International Conference on Computer and Communication Systems (ICCCS), IEEE, 2020, pp.
363–367.
[19] H. Li, C. Zhang, H. Li, N. Song, White-light interference microscopy image super-resolution using
generative adversarial networks, IEEE Access 8 (2020) 27724–27733.
[20] M. Wang, Z. Chen, Q.J. Wu, M. Jian, Improved face super-resolution generative adversarial networks,
Mach. Vis. Appl. 31 (2020) 1–12.
[21] F. Nan, Q. Zeng, Y. Xing, Y. Qian, Single image super-resolution reconstruction based on the
ResNeXt network, in: Multimedia Tools and Applications, 2020, pp. 1–12.
[22] Y.Y. Li, Y.D. Zhang, X.W. Zhou, W. Xu, EESR: edge enhanced super-resolution, in: 2018 14th
IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT), IEEE,
2018, pp. 1–3.
[23] R. Sood, M. Rusu, Anisotropic super resolution in prostate Mri using super resolution generative
adversarial networks, in: 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI
2019), IEEE, 2019, pp. 1688–1691.
[24] S. Lee, J.H. Kim, J.P. Heo, Super-resolution of license plate images via character-based perceptual loss,
in: 2020 IEEE International Conference on Big Data and Smart Computing (BigComp), IEEE, 2020,
pp. 560–563.
[25] B.X. Chen, T.J. Liu, K.H. Liu, H.H. Liu, S.C. Pei, Image super-resolution using complex dense block
on generative adversarial networks, in: 2019 IEEE International Conference on Image Processing
(ICIP), IEEE, 2019, pp. 2866–2870.
[26] W.S. Jeon, S.Y. Rhee, Single image super resolution using residual learning, in: 2019 International
Conference on Fuzzy Theory and its Applications (iFUZZY), IEEE, 2019, pp. 1–4.
[27] J. Wenjie, L. Xiaoshu, Research on super-resolution reconstruction algorithm of remote sensing image
based on generative adversarial networks, in: 2019 IEEE 2nd International Conference on Automation,
Electronics and Electrical Engineering (AUTEEE), IEEE, 2019, pp. 438–441.
[28] V.K. Ha, J. Ren, X. Xu, S. Zhao, G. Xie, V.M. Vargas, Deep learning based single image superresolution: a survey, in: International Conference on Brain Inspired Cognitive Systems, Springer,
Cham, 2018, pp. 106–119.
[29] X. Wang, K. Yu, C. Dong, C. Change Loy, Recovering realistic texture in image super-resolution by
deep spatial feature transform, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 606–615.
[30] W.S. Lai, J.B. Huang, N. Ahuja, M.H. Yang, Fast and accurate image super-resolution with deep laplacian pyramid networks, IEEE Trans. Pattern Anal. Mach. Intell. 41 (11) (2018) 2599–2613.
[31] W.S. Lai, J.B. Huang, N. Ahuja, M.H. Yang, Deep laplacian pyramid networks for fast and accurate
super-resolution, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 624–632.
[32] D. Kim, H.U. Jang, S.M. Mun, S. Choi, H.K. Lee, Median filtered image restoration and anti-forensics
using adversarial networks, IEEE Signal Process Lett. 25 (2) (2017) 278–282.
[33] T. Tong, G. Li, X. Liu, Q. Gao, Image super-resolution using dense skip connections, in: Proceedings
of the IEEE International Conference on Computer Vision, 2017, pp. 4799–4807.
[34] C. Dong, C.C. Loy, X. Tang, Accelerating the super-resolution convolutional neural network, in:
European Conference on Computer Vision, Springer, Cham, 2016, pp. 391–407.
[35] W. Shi, J. Caballero, F. Huszár, J. Totz, A.P. Aitken, R. Bishop, Z. Wang, Real-time single image and
video super-resolution using an efficient sub-pixel convolutional neural network, in: Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.
[36] L. Yue, H. Shen, J. Li, Q. Yuan, H. Zhang, L. Zhang, Image super-resolution: the techniques, applications, and future, Signal Process. 128 (2016) 389–408.
[37] E.L. Denton, S. Chintala, R. Fergus, Deep generative image models using a laplacian pyramid of
adversarial networks, in: Advances in Neural Information Processing Systems, 2015, pp. 1486–1494.
[38] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition,
arXiv, 2014. preprint arXiv:1409.1556.
[39] X. Li, Y. Wu, W. Zhang, R. Wang, F. Hou, Deep learning methods in real-time image superresolution: a survey, J. Real-Time Image Proc. (2019) 1–25.
[40] K. Hayat, Multimedia super-resolution via deep learning: a survey, Digital Signal Process. 81 (2018)
198–217.
[41] M.S. Sajjadi, B. Scholkopf, M. Hirsch, Enhancenet: single image super-resolution through automated
texture synthesis, in: Proceedings of the IEEE International Conference on Computer Vision, 2017,
pp. 4491–4500.
[42] X. Zhao, Y. Zhang, T. Zhang, X. Zou, Channel splitting network for single MR image superresolution, IEEE Trans. Image Process. 28 (11) (2019) 5649–5662.
[43] J. Kim, J. Kwon Lee, K. Mu Lee, Accurate image super-resolution using very deep convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016,
pp. 1646–1654.
[44] R. Timofte, E. Agustsson, L. Van Gool, M.H. Yang, L. Zhang, Ntire 2017 challenge on single image
super-resolution: methods and results, in: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition Workshops, 2017, pp. 114–125.
[45] X. Song, Y. Dai, X. Qin, Deep depth super-resolution: learning depth super-resolution using deep
convolutional neural network, in: Asian Conference on Computer Vision, Springer, Cham, 2016,
pp. 360–376.
[46] L. Zhang, P. Wang, C. Shen, L. Liu, W. Wei, Y. Zhang, A. Van Den Hengel, Adaptive importance
learning for improving lightweight image super-resolution network, Int. J. Comput. Vis. 128 (2) (2020)
479–499.
[47] Y. Li, J. Hu, X. Zhao, W. Xie, J. Li, Hyperspectral image super-resolution using deep convolutional
neural network, Neurocomputing 266 (2017) 29–41.
[48] Y. Liang, J. Wang, S. Zhou, Y. Gong, N. Zheng, Incorporating image priors with deep convolutional
neural networks for image super-resolution, Neurocomputing 194 (2016) 340–347.
[49] Q. Chang, K.W. Hung, J. Jiang, Deep learning based image Super-resolution for nonlinear lens distortions, Neurocomputing 275 (2018) 969–982.
[50] B. Lim, S. Son, H. Kim, S. Nah, K. Mu Lee, Enhanced deep residual networks for single image superresolution, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Workshops, 2017, pp. 136–144.
CHAPTER 2
GAN models in natural language processing and image translation
E. Thirumagal(a,b) and K. Saruladha(a)
(a) Pondicherry Engineering College, Department of Computer Science and Engineering, Puducherry, India
(b) REVA University, Bengaluru, India
2.1 Introduction
In recent years, GANs have shown significant progress in modeling complex image and speech data distributions. The introduction of GANs and VAEs made training on big datasets in an unsupervised manner possible.
2.1.1 Variational auto encoders
The variational auto encoders (VAEs) [1] were used for generating images before GANs.
The VAE has a probabilistic encoder and a probabilistic decoder. The real samples "r" are fed into the encoder. The encoder outputs an encoded image to which the noise "n" is added, whose distribution is given by Xe(n | r). The distribution Xe(n | r) is given as input to the decoder, whose distribution is given by Yd(r | n) and which generates the fake image.
The loss function L(e,d) between encoder and decoder is computed for every iteration.
The VAE uses a mean square loss function, which is given by
L(e, d) = E_{n \sim X_e(n \mid r)}\left[Y_d(r \mid n)\right] + \mathrm{KLD}\left(X_e(n \mid r) \,\|\, Y_d(n)\right)    (2.1)
where

\mathrm{KLD}\left(X_e(n \mid r) \,\|\, Y_d(n)\right) = \sum X_e(n \mid r)\, \log \frac{X_e(n \mid r)}{Y_d(n)}
The Kullback-Leibler divergence (KLD) is the distance metric that computes the similarity between the real samples given to the encoder Xe and the fake images generated by the decoder Yd. If the loss function yields a large value, it means the decoder does not generate fake images similar to the real samples. Backpropagation takes place every iteration until the decoder generates images similar to the real images. Using stochastic gradient descent, the weights and biases of the encoder and decoder are adjusted and image generation happens again. The optimal value of the loss function is 0.5.
When the loss function of the decoder becomes 0.5, it means the decoder generates the
image similar to the real image.
2.1.1.1 Drawback of VAE
VAE uses Kullback-Leibler divergence (KLD). When the generated image distribution
Yd(n) does not match the real image distribution Xe(n j r), then Yd(n) value will become 0.
The KLD will lead to ∞ (infinity), which means learning will not take place for the
encoder and decoder. This leads to the invention of GANs.
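A tiny numeric illustration of this drawback, assuming two simple discrete distributions (the labels mirror the notation above, but the numbers are made up):

```python
import math

def kld(p, q):
    """Kullback-Leibler divergence KLD(p || q) for discrete distributions given as lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue               # 0 * log(0 / q) is taken as 0
        if qi == 0:
            return math.inf        # p has mass where q has none: the divergence blows up
        total += pi * math.log(pi / qi)
    return total

p = [0.5, 0.5, 0.0]                # real-image distribution Xe(n | r)
q_good = [0.4, 0.5, 0.1]           # generated distribution Yd(n) that overlaps with p
q_bad = [0.0, 0.5, 0.5]            # generated distribution Yd(n) that is 0 where p > 0

print(kld(p, q_good))              # finite value: a usable learning signal
print(kld(p, q_bad))               # inf: no learning signal, matching the drawback above
```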
2.1.2 Brief introduction to GAN
Generative adversarial networks (GANs) [2–4] are generative neural network models
introduced by Ian Goodfellow in 2014. Recently, GANs have been used in numerous
applications such as discovery and prevention of security attacks, clothing translation,
text-to-image conversion, photo blending, video games, etc. GANs have generator
(G) and discriminator (D) which can be convolutional neural network (CNN), feed forward neural networks, or recurrent neural networks (RNNs). The generator (G) will
generate fake images, by taking random noise distribution as input. The real samples
and the generated fake images are given as input to the discriminator (D), which will
output whether the image is from the real sample or from the generator (1 or 0). The
loss functions are computed to check whether (i) the generator is generating images close
to real samples and (ii) the discriminator is correctly discriminating between real and fake
images. If the loss function yields a large value, the error is backpropagated to the generator and discriminator networks and their weights and biases are adjusted, which is called optimization. There are various optimization algorithms such as stochastic gradient descent, RMSProp, Adam, AdaGrad, etc. Therefore, both G and D learn simultaneously.
This chapter is organized as follows: The various GAN architectures are discussed in
Section 2.2. Section 2.3 describes the applications of GANs in natural language processing. The applications of GANs in image generation and translation are discussed in
Section 2.4. The Section 2.5 discusses the evaluation metrics that can be used for checking the performance of the GAN. The tools and the languages used for GAN research are
discussed in Section 2.6. The open challenges for further research are discussed in
Section 2.7.
2.2 Basic GAN model classification based on learning
From the literature survey, the various GANs are classified based on their learning methods, as shown in Fig. 2.1. Learning can be supervised or unsupervised.
Supervised learning is making the machine learning with the labeled data. The classification and regression algorithms come under supervised learning. The unsupervised
learning [5] will take place when the data is unlabeled. The machine will act on the data
based on the similarities, differences, and patterns. The clustering and association algorithms come under unsupervised learning.

Fig. 2.1 GAN architecture classification.
2.2.1 Unsupervised learning
The generative models before GANs used the Markov chain [6] method for training
which has various drawbacks such as high computational complexity, low efficiency,
etc. As shown in Fig. 2.1, vanilla GAN, WGAN, WGAN-GP, Info GAN, BEGAN,
Unsupervised Sequential GAN, Parallel GAN, and Cycle GAN are categorized under
unsupervised learning [7]. These GANs take the data or real samples without labels as input. The architectures of all these GANs are shown in the following sections, and this section details each GAN architecture with its loss functions and optimization techniques.
2.2.1.1 Vanilla GAN
The Vanilla GAN [8, 9] is the basic GAN architecture. The real samples are given by “r.”
The random noise is given by “n.” The random noise distribution pn(n) is given as input
to the G which will generate the fake image. The real sample distribution pd(r) and fake
images are given as input to D. D will discriminate whether the image is real (which
means 1) or fake (which means 0) which is shown in Fig. 2.2. Then by using the binary
cross entropy loss function, the loss of G and D will be calculated by Eqs. (2.3) and (2.4).
Binary cross-entropy loss function (Goodfellow [2]) is given by Eq. (2.2).
L(x', x) = x \log x' + (1 - x) \log(1 - x')    (2.2)

where x' is the generated fake image and x is the real image.
When the image comes from a real sample "r" to D, then D has to output 1. Substituting x' = D(r) and x = 1 in Eq. (2.2) leads to the following equation:

L_{GAN}(D) = E_{r \sim p_d(r)}[\log(D(r))]    (2.3)
Fig. 2.2 Vanilla GAN, WGAN, WGAN-GP architecture.
When the image comes from the generator, G(n), then D has to output 0. Substituting x' = D(G(n)) and x = 0 in Eq. (2.2) leads to the following equation:

L_{GAN}(G) = E_{n \sim p_n(n)}[\log(1 - D(G(n)))]    (2.4)
Using min-max game theory, the D has to output 1 if the image is the real sample.
Hence the D has to be maximized. The D has to output 0 if the image has come from the
generator. Hence the G has to be minimized. The loss function is given by
\min_G \max_D L_{GAN}(G, D) = \min_G \max_D \left\{ E_{r \sim p_d(r)}[\log(D(r))] + E_{n \sim p_n(n)}[\log(1 - D(G(n)))] \right\}    (2.5)
The optimal value of D is given by
D^{*} = \frac{p_d(r)}{p_d(r) + p_g(r)}    (2.6)
If the optimal value of D (0.5) is obtained, then D cannot differentiate between real
and fake images. The optimal value of G is given by
G^{*} = -\log 4 + 2\, \mathrm{JSD}\left(p_d(r) \,\|\, p_g(r)\right)    (2.7)

where

\mathrm{JSD}\left(p_d(r) \,\|\, p_g(r)\right) = \frac{1}{2}\left[ \mathrm{KLD}\left(p_d(r) \,\Big\|\, \frac{p_d + p_g}{2}\right) + \mathrm{KLD}\left(p_g(r) \,\Big\|\, \frac{p_d + p_g}{2}\right) \right]
If the loss function yields a large value, then using backpropagation and stochastic gradient descent [6], the weights and biases are adjusted every epoch until D discriminates properly (a small code sketch of these losses follows).
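A minimal sketch of the two sides of Eq. (2.5), written directly with the log terms of Eqs. (2.3) and (2.4); the toy discriminator outputs are made-up probabilities, not results from the chapter.

```python
import torch

def discriminator_loss(d_real, d_fake, eps=1e-8):
    """Discriminator side of Eq. (2.5): maximize E[log D(r)] + E[log(1 - D(G(n)))].
    Returned negated so it can be minimized by a standard optimizer."""
    return -(torch.log(d_real + eps).mean() + torch.log(1 - d_fake + eps).mean())

def generator_loss(d_fake, eps=1e-8):
    """Generator side of Eq. (2.5): minimize E[log(1 - D(G(n)))]."""
    return torch.log(1 - d_fake + eps).mean()

# d_real = D(r) and d_fake = D(G(n)) are discriminator outputs in (0, 1).
d_real = torch.full((16, 1), 0.9)
d_fake = torch.full((16, 1), 0.1)
print(discriminator_loss(d_real, d_fake), generator_loss(d_fake))
# At the optimum D* = 0.5 (Eq. 2.6) the discriminator can no longer tell real from fake.
```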
2.2.1.2 WGAN
WGAN [10] stands for Wasserstein GAN. The architecture is the same as that of the vanilla GAN, as shown in Fig. 2.2. In order to avoid the drawbacks of the Jensen-Shannon divergence (JSD) distance, the Wasserstein distance metric is used. JSD matches the real-image and fake-image distributions along the vertical axis, whereas the Wasserstein distance matches them along the horizontal axis. The Wasserstein distance is otherwise called the earth mover distance.
The Wasserstein distance between the real image distribution “Pr” and the generated
image distribution “Pg” is given by
W\left(P_r, P_g\right) = \inf_{\delta \in \Pi(p_r, p_g)} E_{(a, b) \sim \delta}\left[\,|a - b|\,\right]    (2.8)
Π is the transport plan which tells how the distribution changes from real and generated image.
Eq. (2.8) is intractable. To make it tractable, using Kantorovich-Rubinstein duality the
W-distance is given by
W\left(P_r, P_g\right) = \sup_{\|D\|_{L} \le 1} E\left[D(r) - D\left(G_g(n)\right)\right]    (2.9)
When taking the slope between the real and fake image distributions, if the slope is less than or equal to k, the function is called k-Lipschitz; when k is 1 it is 1-Lipschitz. Wasserstein GAN uses the 1-Lipschitz constraint and, to achieve it, the weights are clipped to the range (-1, 1). The discriminator loss function is given by

L_{WGAN}(D) = E_{r \sim p_d(r)}[D(r)]    (2.10)
The generator loss function is given by
L_{WGAN}(G) = -E_{n \sim p_n(n)}\left[D(G(n))\right]    (2.11)
The overall parametric loss function is given by
L_{WGAN}(G, D) = \min_G \max_{w \in W} E_{r \sim p_d(r)}[D(r)] - E_{n \sim p_n(n)}\left[D(G(n))\right]    (2.12)
In WGAN, the discriminator does not return 0 or 1; rather, it returns a score used to estimate the Wasserstein distance. WGAN uses the RMSProp optimizer, which alters the weights and biases of G and D every iteration until D cannot discriminate between real and fake images (a sketch of the critic update follows).
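A hedged sketch of one critic update following Eqs. (2.10)-(2.12): the score difference E[D(r)] - E[D(G(n))] is maximized with RMSProp, and the weights are clipped afterwards to respect the Lipschitz constraint. The toy network, batch tensors, and clipping bound are illustrative assumptions.

```python
import torch
import torch.nn as nn

# The WGAN critic outputs an unbounded score, so there is no sigmoid at the end.
critic = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_d = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
c = 1.0                                       # clipping bound from the text; 0.01 is also a common choice

real = torch.rand(64, 784)                    # stand-in for real samples r
fake = torch.rand(64, 784)                    # stand-in for generated samples G(n), detached from G

# Eq. (2.12), critic step: maximize E[D(r)] - E[D(G(n))], i.e. minimize its negation.
w_loss = -(critic(real).mean() - critic(fake).mean())
opt_d.zero_grad(); w_loss.backward(); opt_d.step()

# Enforce the (approximate) 1-Lipschitz constraint by clipping the weights after each update.
with torch.no_grad():
    for p in critic.parameters():
        p.clamp_(-c, c)
```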
2.2.1.3 WGAN-GP
The WGAN-GP [11, 12] (Wasserstein GAN with Gradient Penalty) architecture is identical to the vanilla GAN and WGAN, as shown in Fig. 2.2. To avoid the drawback of weight clipping used to enforce the Lipschitz constraint, a gradient penalty term is incorporated into the WGAN loss function. The gradient penalty term grows as the gradient norm moves away from 1. The loss function of WGAN-GP is given by
L_{WGAN\text{-}GP}(G, D) = \min_G \max_{w \in W} E_{r \sim p_d(r)}[D(r)] - E_{n \sim p_n(n)}\left[D(G(n))\right] + \lambda\, E_{\hat{x} \sim P(\hat{x})}\left[\left(\left\|\nabla_{\hat{x}} D(\hat{x})\right\| - 1\right)^2\right]    (2.13)
The gradient penalty term is added to the loss function of WGAN. In Eq. (2.13), \hat{x} is sampled between a real image and a generated image as \hat{x} = t\,x + (1 - t)\,\tilde{x}, where x is a real sample, \tilde{x} is a generated sample, and t is sampled between 0 and 1; λ is a hyperparameter. If the Adam optimizer is used with WGAN-GP, it generates good, clear images (a sketch of the penalty term follows).
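A hedged sketch of the gradient penalty term of Eq. (2.13): interpolate between a real and a generated batch, compute the gradient norm of the critic output at the interpolate, and penalize its deviation from 1. The critic, batch shapes, and λ = 10 are illustrative assumptions, not values from the chapter.

```python
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

def gradient_penalty(critic, real, fake, lam=10.0):
    """lam * E[(||grad_xhat D(xhat)||_2 - 1)^2] with xhat = t*real + (1 - t)*fake (Eq. 2.13)."""
    t = torch.rand(real.size(0), 1)                           # t sampled uniformly in (0, 1)
    xhat = (t * real + (1 - t) * fake).requires_grad_(True)
    d_xhat = critic(xhat)
    grads = torch.autograd.grad(outputs=d_xhat, inputs=xhat,
                                grad_outputs=torch.ones_like(d_xhat),
                                create_graph=True)[0]
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

real = torch.rand(64, 784)
fake = torch.rand(64, 784)
critic_loss = -(critic(real).mean() - critic(fake).mean()) + gradient_penalty(critic, real, fake)
print(critic_loss)
```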
2.2.1.4 Info GAN
Info GAN [8, 13] is the information maximizing GAN. The semantic information is
added with noise and given to the G. G outputs the fake image. The fake image that
is generated and the real sample are given to D which will output 0 (fake) or 1 (real)
shown in Fig. 2.3. Then the loss function is computed. The stochastic gradient descent
is used for optimizing the neural network. The noise “n” together with the semantic information (latent code) “si” is fed to G, written G(n, si). The mutual information MI(si; G(n, si)) between the semantic information “si” and the generator output G(n, si) has to be maximized. MI(si; G(n, si)) is the amount of information about “si” obtained from knowledge of G(n, si). Maximizing the mutual information directly is not easy, so a variational lower bound of MI(si; G(n, si)) is obtained by
Fig. 2.3 Info GAN architecture.
defining an auxiliary semantic distribution Q(si | r). The variational lower bound LB(G, Q) of the mutual information is given by
$LB(G, Q) = \mathbb{E}_{si \sim P(si),\, r \sim G(n, si)}[\log Q(si \mid r)] + H(si) \le MI(si; G(n, si))$   (2.14)
where H(si) is the entropy of the latent codes. The loss function is given by
$\min_{G,Q} \max_D L_{InfoGAN}(G, D) = \mathbb{E}_{r \sim p_d(r)}[\log(D(r))] + \mathbb{E}_{n \sim p_n(n)}[\log(1 - D(G(n)))] - \lambda\, LB(G, Q)$   (2.15)
The wake-sleep algorithm has been used with InfoGAN. The lower bound of the generator log-likelihood $\log P_G(x)$ is optimized and updated in the wake phase. The auxiliary distribution Q is updated in the sleep phase by sampling from the generator distribution instead of the real data distribution. The computational cost is only slightly higher than that of the vanilla GAN.
2.2.1.5 BEGAN
BEGAN [14, 15] stands for Boundary Equilibrium GAN. BEGAN is mainly developed to achieve Nash equilibrium. The architecture is the same as that of the vanilla GAN with one difference: to maintain equilibrium, proportional control theory is used, as shown in Fig. 2.4. In BEGAN the generator acts as a decoder, while the discriminator acts as an autoencoder and also discriminates between real and fake images. Instead of matching the data distributions of the real and generated images, BEGAN computes an autoencoder loss for the real and generated images and the Wasserstein distance between these two autoencoder losses. The autoencoder loss is given by
$L(s) = |s - AF(s)|^{\eta}$   (2.16)
where L(s) is the loss for training the autoencoder, “s” is a sample of dimension “d,” AF is the autoencoder function that maps a sample of dimension “d” to a sample of dimension “d,” and η ∈ {1, 2} is the target norm.
Fig. 2.4 BEGAN architecture.
The loss of D is given by
$L_D = L(r) - k_i\, L(G(n_D))$   (2.17)
The loss of G is given by
$L_G = L(G(n_G))$   (2.18)
BEGAN uses the proportional control model to preserve the equilibrium $\mathbb{E}[L(G(n))] = \gamma\, \mathbb{E}[L(r)]$, where γ is a hyperparameter taking a value in (0, 1). To maintain equilibrium, it uses the variable $k_i \in (0, 1)$ to control the generator loss during gradient descent, where $k_i$ is updated by
$k_{i+1} = k_i + \lambda_k\big(\gamma\, L(r) - L(G(n_G))\big)$   (2.19)
Initially $k_0 = 0$, and $\lambda_k$ is the learning rate of k.
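The equilibrium bookkeeping of Eqs. (2.17)–(2.19) reduces to a few lines; the following sketch (not the authors' code) assumes the per-batch autoencoder losses are already available as scalars, with assumed values γ = 0.5 and λk = 0.001.

# A minimal sketch (not the authors' code) of the BEGAN balance updates in Eqs. (2.17)-(2.19);
# loss_real = L(r) and loss_fake = L(G(n)) are assumed to be precomputed scalar autoencoder losses.
def began_update(loss_real, loss_fake, k, gamma=0.5, lambda_k=0.001):
    loss_D = loss_real - k * loss_fake                    # Eq. (2.17)
    loss_G = loss_fake                                    # Eq. (2.18)
    k = k + lambda_k * (gamma * loss_real - loss_fake)    # Eq. (2.19)
    k = min(max(k, 0.0), 1.0)                             # keep k in (0, 1)
    return loss_D, loss_G, k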
2.2.1.6 Unsupervised sequential GAN
The sequential GAN [16, 17] involves a sequence of generators and discriminators. The noise vector “z” is given as input to generator G1, which produces fake image1 “f” as output. Fake image1 and the real sample “r” are given as input to discriminator D1, which discriminates between real image1 and fake image1. Fake image1 is then given as input to generator G2, which produces fake image2 as output. Fake image2 and real image2 are given as input to discriminator D2, which discriminates between the real and fake images, as shown in Fig. 2.5. The loss function considering G1 and D1 is given by
$L_{adv}(G1, D1, n, r) = \mathbb{E}_{r \sim p_d(r)}[\log(D1(r))] + \mathbb{E}_{n \sim p_n(n)}[\log(1 - D1(G1(n)))]$   (2.20)
The loss function considering G2 and D2 is given by
$L_{img2img}(G2, D2, f, r) = \mathbb{E}_{r \sim p_d(r)}[\log(D2(r))] + \mathbb{E}_{f \sim p_f(f)}[\log(1 - D2(G2(f)))]$   (2.21)
Fig. 2.5 Unsupervised sequential GAN architecture.
The loss function of unsupervised sequential GAN is given by
$L_{unseqGAN}(G1, D1, G2, D2) = L_{adv}(G1, D1, n, r) + L_{img2img}(G2, D2, f, r)$   (2.22)
2.2.1.7 Parallel GAN
The architecture of the parallel GAN [18] is shown in Fig. 2.6. Whenever bimodal images have to be processed, or multiple images need to be generated at the same time, the parallel GAN can be used. The noise vector is given to generators G1 and G2 in parallel, and G1 and G2 generate fake image1 and fake image2 in parallel. Real image1 and fake image1 are given as input to discriminator D1, which discriminates between real image1 and fake image1. Real image2 and fake image2 are given as input to discriminator D2, which discriminates between
Fig. 2.6 Parallel GAN architecture.
real image2 and fake image2 in parallel with D1. The binary cross-entropy loss is computed for (G1, D1) and (G2, D2) in parallel. If D1 and D2 do not discriminate properly between real and fake images, then, using backpropagation and stochastic gradient descent, the weights and biases of (G1, D1) and (G2, D2) are adjusted at every iteration until D1 and D2 discriminate correctly.
2.2.1.8 Cycle GAN
The cycle GAN, also called the cycle-consistent GAN [19, 20], works as follows. The noise vector “z” is given as input to generator G1, which produces a feature map “f” as output. The feature map and the real sample “r” are given as input to discriminator D1, which discriminates between the real sample and the feature map. The feature map is then given as input to generator G2, which produces a fake image as output. The fake image and the real image are given as input to discriminator D2, which discriminates between the real and fake images, as shown in Fig. 2.7. The loss function considering G1 and D1 is given by
$L1(G1, D1, n, r) = \mathbb{E}_{r \sim p_d(r)}[\log(D1(r))] + \mathbb{E}_{n \sim p_n(n)}[\log(1 - D1(G1(n)))]$   (2.23)
The loss function considering G2 and D2 is given by
$L2(G2, D2, f, r) = \mathbb{E}_{r \sim p_d(r)}[\log(D2(r))] + \mathbb{E}_{f \sim p_f(f)}[\log(1 - D2(G2(f)))]$   (2.24)
The cycle consistency loss is given by
$L_{cycle}(G1, G2) = \mathbb{E}_{n \sim p_n(n)}\big[\|G2(G1(n)) - n\|_1\big] + \mathbb{E}_{f \sim p_f(f)}\big[\|G1(G2(f)) - f\|_1\big]$   (2.25)
The cycle GAN loss is given by
$L_{cycleGAN}(G1, G2, D1, D2) = L1(G1, D1, n, r) + L2(G2, D2, f, r) + L_{cycle}(G1, G2)$   (2.26)
Fig. 2.7 Cycle GAN architecture.
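The cycle-consistency term of Eq. (2.25) can be illustrated with a short sketch (not from the chapter), assuming PyTorch and two hypothetical generators G1 (domain A to B) and G2 (domain B to A).

# A minimal sketch (not from the chapter) of the cycle-consistency loss in Eq. (2.25),
# assuming PyTorch and two hypothetical generators G1 (A -> B) and G2 (B -> A).
import torch
import torch.nn.functional as F

def cycle_consistency_loss(G1, G2, batch_a, batch_b):
    # A -> B -> A should reconstruct A; B -> A -> B should reconstruct B (L1 norms)
    loss_a = F.l1_loss(G2(G1(batch_a)), batch_a)
    loss_b = F.l1_loss(G1(G2(batch_b)), batch_b)
    return loss_a + loss_b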
2.2.2 Semisupervised learning
In semisupervised learning, the discriminator D is trained with class labels, i.e., D performs supervised learning, while the generator is not trained with class labels, so its learning remains unsupervised. The Semi GAN falls under this semisupervised learning category and is discussed in the following section.
2.2.2.1 Semi GAN
The architecture of the Semi GAN [21, 22] is shown in Fig. 2.8. The class labels are attached to the real samples and given as input to discriminator D, so its learning becomes supervised. The noise vector is given as input to generator G, which generates the fake sample. The real samples with the class labels and the fake image generated by G are given as input to D. D discriminates between real and fake images and also classifies the image into the class to which it belongs. The loss functions are computed for G and D. If D does not discriminate properly, then, using backpropagation and stochastic gradient descent, the parameters of G and D are adjusted at every iteration until D discriminates correctly. The discriminator loss function is given by
$L_{semiGAN}(D) = \mathbb{E}_{r \sim p_d(r)}[\log(D(r \mid c))]$   (2.27)
The generator loss function is given by
$L_{semiGAN}(G) = \mathbb{E}_{n \sim p_n(n)}[\log(1 - D(G(n)))]$   (2.28)
Using min-max game theory, D has to output 1 if the image is a real sample; hence D has to be maximized. D has to output 0 if the image comes from the generator; hence G has to be minimized. The loss function is given by
$\min_G \max_D L_{SemiGAN}(G, D) = \min_G \max_D \big\{\mathbb{E}_{r \sim p_d(r)}[\log(D(r \mid c))] + \mathbb{E}_{n \sim p_n(n)}[\log(1 - D(G(n)))]\big\}$   (2.29)
Fig. 2.8 Semi GAN.
The generator is not trained with the class labels, but the discriminator has been trained with them. The following sections describe supervised learning.
2.2.3 Supervised learning
Supervised learning means training the machine with labeled data. CGAN, BiGAN, ACGAN, and the supervised sequential GAN are the GAN architectures that learn in a supervised manner. The following sections describe each of these architectures along with the loss functions and optimization techniques used in each of them.
2.2.3.1 CGAN
CGAN stands for conditional GAN [23, 24]. The architecture of CGAN is shown in Fig. 2.9. The class labels are attached to the real samples, and both the generator G and the discriminator D are trained with the class labels. The noise vector along with the class labels is given as input to G, which outputs the fake image. The class labels, the real image, and the fake image generated by G are given as input to D. D discriminates between the real and fake images and also finds out to which class the image belongs. The loss function is the same as that of the vanilla GAN, with one difference: the class labels “c” are added to the real sample, discriminator, and generator terms. The binary cross-entropy loss [25] is used, and stochastic gradient descent is used for optimizing G and D when D does not discriminate properly.
The discriminator loss function is given by
$L_{CGAN}(D) = \mathbb{E}_{r \sim p_d(r)}[\log(D(r \mid c))]$   (2.30)
The generator loss function is given by
$L_{CGAN}(G) = \mathbb{E}_{n \sim p_n(n)}[\log(1 - D(G(n \mid c)))]$   (2.31)
Fig. 2.9 CGAN architecture.
Using min-max game theory, D has to output 1 if the image is a real sample; hence D has to be maximized. D has to output 0 if the image comes from the generator; hence G has to be minimized. The loss function is given by
$\min_G \max_D L_{CGAN}(G, D) = \min_G \max_D \big\{\mathbb{E}_{r \sim p_d(r)}[\log(D(r \mid c))] + \mathbb{E}_{n \sim p_n(n)}[\log(1 - D(G(n \mid c)))]\big\}$   (2.32)
CGANs [26] with multilabel predictions can be used for automated image tagging
where the generator can generate the tag vector distribution conditioned on image
features.
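The conditioning on class labels used in Eqs. (2.30)–(2.32) is usually implemented by concatenating a one-hot label vector with the generator and discriminator inputs; the following is an illustrative sketch (not the authors' code), assuming PyTorch and an assumed number of classes.

# A minimal sketch (not the authors' code) of CGAN conditioning, assuming PyTorch;
# the class label c is one-hot encoded and concatenated with the noise (for G) and the image (for D).
import torch

def generator_input(noise, labels, num_classes=10):
    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    return torch.cat([noise, one_hot], dim=1)        # G receives (n | c) as in Eq. (2.31)

def discriminator_input(image_batch, labels, num_classes=10):
    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    flat = image_batch.view(image_batch.size(0), -1)
    return torch.cat([flat, one_hot], dim=1)         # D receives (r | c) as in Eq. (2.30)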
2.2.3.2 BiGAN
BiGAN stands for bidirectional GAN [8, 27, 28]. The architecture of BiGAN is shown in Fig. 2.10. The noise vector is given as input to generator G, which generates the fake image. The real sample is given as input to the encoder, which outputs the encoded representation, to which noise is added. The encoded representation, noise, real image, and generated fake image are given as input to discriminator D, which discriminates between real and fake images. The loss functions are computed for G and D. If D does not discriminate properly, then, using backpropagation and stochastic gradient descent, the parameters of G and D are adjusted at every iteration until D discriminates correctly. The discriminator loss function is given by
$L_{BiGAN}(D) = \mathbb{E}_{r \sim p_d(r)}\,\mathbb{E}_{n \sim p_E(n \mid r)}[\log(D(r, n))]$   (2.33)
The discriminator has been trained with the real data, noise, and encoded image distribution. The generator loss function is given by
Fig. 2.10 BiGAN architecture.
$L_{BiGAN}(G) = \mathbb{E}_{n \sim p_n(n)}\,\mathbb{E}_{r \sim p_G(r \mid n)}[\log(1 - D(r, n))]$   (2.34)
Using min-max game theory, D has to output 1 if the image is a real sample; hence D has to be maximized. D has to output 0 if the image comes from the generator; hence G has to be minimized. The loss function is given by
$\min_{G,E} \max_D L_{BiGAN}(D, E, G) = \min_{G,E} \max_D \mathbb{E}_{r \sim p_d(r)}\,\mathbb{E}_{n \sim p_E(n \mid r)}[\log(D(r, n))] + \mathbb{E}_{n \sim p_n(n)}\,\mathbb{E}_{r \sim p_G(r \mid n)}[\log(1 - D(r, n))]$   (2.35)
2.2.3.3 ACGAN
The architecture of ACGAN [12, 29] is shown in Fig. 2.11. The architecture is the same as that of CGAN with one difference: the class labels “c” are conditioned on the real samples and on the noise vector given as input to generator G; the class labels are not conditioned on the discriminator D. The training is based on the log probability of the correct source, i.e., whether an image is a real sample or a fake generated by G, and on the log probability of the correct class to which the sample belongs. Stochastic gradient descent is used to adjust the weights and biases of G and D at every iteration if D does not discriminate correctly.
The log probability of the correct source, i.e., whether the image is from the real samples or generated by G, is given by
$L_{source} = \mathbb{E}[\log P(source = real \mid R_{real})] + \mathbb{E}[\log P(source = fake \mid R_{fake})]$   (2.36)
The log probability of the correct class to which the image belongs to or classified
correctly is given by
Fig. 2.11 ACGAN architecture.
$L_{class} = \mathbb{E}[\log P(class = c \mid R_{real})] + \mathbb{E}[\log P(class = c \mid R_{fake})]$   (2.37)
The image samples are denoted by “R,” and conditional probabilities are used. Training is carried out so that D maximizes $L_{source} + L_{class}$ while G maximizes $L_{class} - L_{source}$.
2.2.3.4 Supervised seq-GAN
The supervised sequential GAN [25, 30, 31] architecture is shown in Fig. 2.12. The real image is given as input to the encoder, which outputs the encoded image. The encoded image is given as input to G1, which generates fake image1. Fake image1 is given as input to G2, which generates fake image2. The noise vector, the encoded image, and fake image1 are given as input to D1; the noise vector, the encoded image, and fake image2 are given as input to D2. D1 and D2 discriminate between real and fake images. The loss function considering G1 and D1 is given by
$L_{adv}(G1, D1, n, r) = \mathbb{E}_{r \sim p_d(r)}[\log(D1(r))] + \mathbb{E}_{n \sim p_n(n)}[\log(1 - D1(G1(n)))]$   (2.38)
The loss function by considering G2 and D2 is given by the following equations.
$L_{img2img}(G2, D2, f, r) = \mathbb{E}_{r \sim p_d(r)}[\log(D2(r))] + \mathbb{E}_{f \sim p_f(f)}[\log(1 - D2(G2(f)))]$   (2.39)
$L_{encoder}(r, n) = \mathbb{E}_{r \sim p_d(r)}\,\mathbb{E}_{n \sim p_E(n \mid r)}[\log(D(r, n))] + \mathbb{E}_{n \sim p_n(n)}\,\mathbb{E}_{r \sim p_G(r \mid n)}[\log(1 - D(r, n))]$   (2.40)
The loss function of the supervised sequential GAN is given by the following equation:
$L_{SupseqGAN}(G1, D1, G2, D2) = L_{adv}(G1, D1, n, r) + L_{img2img}(G2, D2, f, r) + L_{encoder}(r, n)$   (2.41)
Fig. 2.12 Supervised sequential GAN.
2.2.4 Comparison of GAN models
This section discusses the comparison of GAN models. Table 2.1 summarizes the activation function, loss function, distance metrics, and optimization techniques used by the
GAN models.
Table 2.1 Loss functions and distance metrics of GANs.

GAN | Activation function | Loss function | Distance metric | Optimization technique
Vanilla GAN | Rectified linear unit (ReLU) | Binary cross-entropy loss | Jensen-Shannon divergence | Backpropagation with stochastic gradient descent
WGAN | ReLU, leaky ReLU, tanh | Kantorovich-Rubinstein duality loss | Wasserstein distance | RMSProp
WGAN-GP | ReLU, leaky ReLU, tanh | Kantorovich-Rubinstein duality + penalty term added when the gradient moves away from 1 | Wasserstein distance | Adam
Info GAN | Rectified linear unit (ReLU) | Binary cross-entropy + variational information regularization | Jensen-Shannon divergence | Stochastic gradient descent
BEGAN | Exponential linear unit (ELU) | Autoencoder loss + proportional control theory | Wasserstein distance | Adam
Unsupervised seq-GAN | Rectified linear unit (ReLU) | Binary cross-entropy + image-to-image conversion loss | Jensen-Shannon divergence and Kullback-Leibler divergence | RMSProp
Parallel GAN | Rectified linear unit (ReLU) | Binary cross-entropy loss | Jensen-Shannon divergence | Stochastic gradient descent
Cycle GAN | ReLU, sigmoid | Binary cross-entropy loss + cycle consistency loss | Jensen-Shannon divergence | Batch normalization
Semi GAN | ReLU | Binary cross-entropy loss with labels included with real samples | Jensen-Shannon divergence | Stochastic gradient descent
CGAN | ReLU | Binary cross-entropy loss with labels included | Jensen-Shannon divergence | Stochastic gradient descent
BiGAN | ReLU | Binary cross-entropy loss + guarantee that G and E are inverse | Jensen-Shannon divergence | Stochastic gradient descent
AC GAN | ReLU | Log likelihood of real source + log likelihood of correct label | Jensen-Shannon divergence | Stochastic gradient descent
Supervised seq-GAN | ReLU | Binary cross-entropy + image-to-image conversion loss + autoencoder loss | Jensen-Shannon divergence and Kullback-Leibler divergence | RMSProp
2.2.5 Pros and cons of the GAN models
This section discusses the pros and cons of GAN models. Table 2.2 summarizes the pros
and cons of the various GAN models.
2.3 GANs in natural language processing
Currently, many GAN architectures are emerging and yielding good results for natural language processing applications. Various GAN architectures have been proposed in recent years, including SeqGAN with policy gradient, which is used for generating speech, poems, and music and outperforms other architectures, and RankGAN, which is used for generating sentences with the discriminator acting as a ranker. The following subsections elaborate on the various GAN architectures proposed for NLP applications.
2.3.1 Application of GANs in natural language processing
This section discusses the various GAN architectures such as SeqGAN, RankGAN,
UGAN, Quasi-GAN, BFGAN, TH-GAN, etc., proposed for the applications of natural
language processing.
Table 2.2 Pros and cons of GAN models.

GAN | Pros | Cons
Vanilla GAN | GAN can generate samples that are very similar to the real samples and can learn deep representations of the data | When the real image distribution and the generated fake image distribution do not overlap, the Jensen-Shannon divergence between them becomes log 2. The derivative of log 2 is 0, which means learning does not take place at the initial start of backpropagation
WGAN | The experiments conducted using WGAN reveal that it does not lead to the problem of mode collapse | The Lipschitz constraint is enforced through weight clipping, which is a simple technique but leads to poor quality image generation
WGAN-GP | The training is very balanced; hence the model can be trained effortlessly and converges properly | Attaining the Nash equilibrium state is very hard. Batch normalization cannot be used since the gradient penalty is applied to all data samples
Info GAN | It is used for learning data representations that are disentangled by using information theory extensions | The mutual information included in the generator separates the significant attributes from the data and assigns them to the semantic information while learning is in progress; if the λ hyperparameter is not fine-tuned accurately, it will not generate good quality images
BEGAN | The training is fast and stable | The hyperparameter γ must be fine-tuned properly, and an appropriate learning rate has to be set; if not, it does not generate images of good clarity
Unsupervised seq-GAN | It extracts deeper features | Hard to achieve Nash equilibrium
Parallel GAN | Multiple images can be generated at the same time | Hard to achieve Nash equilibrium
Cycle GAN | The dataset requirement is low, and two image styles can be converted arbitrarily | When doing image-to-image translation, considering various parameters such as color, texture, geometry, etc. is very difficult
Semi-GAN | It is an effective model that can be used for regression tasks | The generator cannot generate a sufficiently realistic image to fool the discriminator, as the discriminator is strong enough to discriminate since it has been trained with the class labels
CGAN | The class labels are included, which increases the performance of the GAN, and it can be used for many applications such as shadow map generation, image synthesis, etc. | The training is not stable, and the stability in training can still be improved
BiGAN | As class labels are included, it can generate good realistic images | The real image sample given to the encoder must be of good clarity, and it cannot perform well when the data distributions are complex
AC GAN | As class labels are included, it can generate good realistic images | The GAN training is not stable; the ACGAN training can still be improved
Supervised seq-GAN | It extracts deeper features | The real sample given to the encoder should be of good clarity; otherwise it will not generate realistic images
2.3.1.1 Generation of semantically similar human-understandable summaries
using SeqGAN with policy gradient
In recent years, generating text summaries has become attractive in the area of natural
language processing. The SeqGAN with policy gradient architecture has been proposed
for generating text summary. The proposed SeqGAN with policy gradient architecture
[32] has three neural networks namely one generator (G) and two discriminators, viz., D1
and D2 as shown in Fig. 2.13. The G is the sequential model which takes the raw text as
input and generates the summary of the text as output. The D1 trains G to output summaries which are human readable. Hence G and D1 form the GAN. The D1 is trained to
distinguish between input text and the summary generated by G. G is trained to fool D1.
As D1 trains the generator to generate the human-readable summary, it is called as
human-readable summary discriminator. The summary generated by the generator
might be irrelevant with only G and D1. Hence another discriminator D2 is added to
the architecture for checking the semantic similarity between the input raw text and
the generated human-readable summary. SeqGAN incorporates reinforcement learning.
Policy gradient is the optimization technique used for updating the parameters (weights and biases) of G by obtaining rewards from D1 and D2. Hence D1 trains G to generate a human-readable summary, and D2 trains G to generate a semantically similar summary.
Semantic similarity discriminator
The semantic similarity discriminator is trained as the classifier using the text summarization dataset shown in Fig. 2.14. This discriminator will teach the generator to generate
a semantically similar and more concise summary. The raw text and the human-readable
summary are given as inputs to the encoders individually to generate the encoded representations namely Ri and Rs. The Ri and Rs are concatenated, product and difference
are performed and given to the four-class classifiers which classify the human-readable
summary into four classes namely similar, dissimilar, redundant, and incomplete class.
The softmax outputs the probability distribution.
Fig. 2.13 SeqGAN for generation of human-readable summary.
Fig. 2.14 Semantic similarity discriminator.
2.3.1.2 Generation of quality language descriptions and ranking
using RankGAN
Language generation plays a major role in many NLP applications such as image caption
generation, machine translation, dialogue generation systems, etc. Hence the RankGAN
has been proposed to generate high-quality language descriptions. The RankGAN [33]
consists of two neural networks such as generator G and ranker R. The generative model
used is long short-term memory (LSTM) to generate the sentences which are called
machine-written sentences. Instead of the discriminator being trained to be a binary classifier, RankGAN uses a ranker which has been trained to rank the human-written sentences more than the machine-written sentences. The ranker will train the generator to
generate machine-written sentences which are similar to human-written sentences. In
this way, generator fools the ranker to rank the machine-written sentences more than
the human-written one. The policy gradient method is used for optimizing the training.
The architecture of RankGAN is shown in Fig. 2.15. The G generates the sentences
from the synthetic dataset. The human-written sentences with the generated machinewritten sentences are given as input to the ranker. The reference human-written sentence
Fig. 2.15 Architecture of RankGAN.
is also given as input to the ranker. The ranker has to rank human-written sentences more
than the machine-written sentence. The generator G will be trained to fool the ranker,
hence the ranker will rank the machine-written sentence more than the human-written
sentence. The ranker will compute the rank score by using
Rðij S,C Þ ¼ ΕsS ½P ðij s, C Þ
where P ðij s, C Þ ¼ P
exp ðβαðij sÞÞ
s 0 EC 0
exp ðβαði 0 j sÞÞ
(2.42)
and α(ij s) ¼ cosine(xi, xs)
xi is the feature vector of input sentences. xs is the feature vector of reference sentences. The parameter β value is set during the experiment empirically. The reference
set S is constructed by sampling reference sentences from human-written sentences. C
is the comparison set sampled from both human-written and machine-generated sentence set. s is the reference sentence sampled from set S.
2.3.1.3 Dialogue generation using reinforce GAN
Dialogue generation is the most important module in applications such as Siri, Google
assistant, etc. The reinforce GAN [34] was proposed for dialogue generation using reinforcement learning. The architecture of reinforced GAN is shown in Fig. 2.16. The reinforce GAN has two neural network architectures namely generator G and discriminator
D. The input dialogue history is given to the generator which outputs the machine-generated dialogue. The {input dialogue history, machine-generated dialogue} pair is
given to the hierarchical encoder which outputs the vector representation of dialogue.
The vector representation is given as input to the discriminator D which in turn outputs
the probability that the dialogue is human generated or machine generated. The policy
gradient optimization technique is used. The weights and bias of G and D are adjusted by
the rewards generated by them during training. The discriminator outputs will be used as
rewards to train the generator so that the generator can generate a dialogue which is more
similar to the human-generated dialogue.
Fig. 2.16 Architecture of reinforce GAN.
2.3.1.4 Text style transfer using UGAN
Text style transfer is an important research application of natural language processing which
aims at rephrasing the input text into the style that is desired by the user. Text style transfer
has its application in many scenarios such as transferring the positive review into a negative
one, conversion of informal text into a formal one, etc. Many techniques that are used for text style transfer are unidirectional, i.e., they transfer a sentence only from the positive to the negative form. UGAN (Unified Generative Adversarial Network) [35] is the only architecture that performs multidirectional text style transfer, as shown in Fig. 2.17. Input to the architecture
will be the sentence and the target attribute, for example input: sentence: “chicken is
delicious” and target attribute: “negative.” Output of the architecture will be the transferred sentence. Output: “chicken is horrible” and vice versa. UGAN has two networks
namely generator and discriminator. The LSTM is the generator network which takes the
sentence and the target attribute as input and generates the output sentence as per the target
attribute. The output transferred sentence generated by LSTM is given as the input to the
discriminator. The discriminator uses the RankGAN rank score computation equations to
rank the original sentence and the generated sentence. The classification of the sentence
whether “positive” or “negative” is done by the discriminator.
2.3.1.5 Tibetan question-answer corpus generation using Qu-GAN
In recent years, many question answering systems have been designed for many languages
using deep learning models. It is hard to design a question-answering system for languages with few resources, such as Tibetan. To solve this problem, QuGAN [36] has been
proposed for a question answering system. The architecture of the QuGAN is shown in
Fig. 2.18. Initially, by using maximum likelihood, some amount of data is sampled from
Fig. 2.17 Architecture of UGAN.
Fig. 2.18 Architecture of QuGAN.
the data in the database. This is done to reduce the distance between the probability distribution of the real and the generated data. The randomly sampled data is given to the
generator (quasi recurrent neural network—QRNN) which generates the question and
answers which in turn is given to the BERT model to correct the grammatical errors and
syntax. The generated and the real data are given as an input to the discriminator (long
short-term memory—LSTM) which classifies between the real and the generated data.
The policy gradient and Monte Carlo search optimization techniques are used to
optimize the training of the neural networks by adjusting their weights and bias.
2.3.1.6 Generation of the sentence with lexical constraints using BFGAN
Nowadays, for generating meaningful sentences, lexical constraints are incorporated into the
model which has applications in machine translation, dialogue system, etc. For generating
lexically constrained meaningful sentences, BFGAN (backward forward) [37] has been
proposed as shown in Fig. 2.19. BFGAN has two generators namely forward and backward generators and one discriminator. The LSTM dynamic attention-based model
called as attRNN is used as the generators. The discriminator can be CNN-based binary
classifier to classify between real sentences and machine-generated meaningful sentences.
The input sentence is split into words and given as input to the backward generator which
generates the first half of the sentence in the backward direction. The backward sentence
is reversed and fed as input to the forward generator which in turn outputs the complete
sentence with lexical constraints. The discriminator is used for making the backward and
forward generators powerful by training them using the Monte Carlo optimization technique. The real sentence and the generated sentences are given as input to the discriminator, which classifies between the real and the machine-generated complete sentence
with the lexical constraints incorporated in it.
2.3.1.7 Short-spoken language intent classification with cSeq-GAN
Intent classification in dialog systems has gained attention in industry. For intent classification, cSeq-GAN [38] has been proposed, as shown in Fig. 2.20. cSeq-GAN has two
Fig. 2.19 Architecture of BFGAN.
Fig. 2.20 Architecture of cSeq-GAN.
neural networks namely the generator (LSTM) and the discriminator (CNN). The real
questions with no tags and with tags are given as input to the generator. The generator in
turn generates questions with classes. The generated and the real questions with tags are
given as input to the discriminator. The CNN is used as the discriminator which has been
implemented with both the sigmoid and the softmax layer. The sigmoid layer classifies
the real and the generated questions. The softmax layer is used for classifying the questions to the respective intent class. The policy gradient optimization technique is used for
adjusting the weights and bias of the generator and discriminator during training.
2.3.1.8 Recognition of Chinese characters using TH-GAN
Historical Chinese character images are of low quality. In order to enhance the quality
of the historical Chinese character images, TH-GAN (transfer learning-based historical
Chinese character recognition) [39] has been proposed shown in Fig. 2.21. The generator
used is the U-Net architecture. The WGAN model has been used. The source Chinese
character is given as input to the generator which outputs generated Chinese character.
The target image, real character image, and the generated character images are given as
input to the discriminator. The discriminator classifies between the real and the fake character image. The policy gradient is the technique used for adjusting weights and bias of
the generator and the discriminator during training. The following section discusses the
NLP datasets.
Fig. 2.21 Architecture of TH-GAN.
2.3.2 NLP datasets
The open-source, free NLP datasets available for research are shown in Table 2.3.
2.4 GANs in image generation and translation
In recent years, for image generation and translation, many GAN architectures have been
proposed such as cycleGAN, DualGAN, DiscoGAN, etc. The following section discusses the various applications of GANs in image generation and translation.
2.4.1 Applications of GANs in image generation and translation
The following subsections discuss the various applications of GANs in image generation
and translation.
2.4.1.1 Ensemble learning GANs in face forensics
Fake images generated by newer image generation methods such as face2face and deepfake are really hard to distinguish using previous face-forensics methods. To overcome
the same, a novel generative adversarial ensemble learning method [40] has been proposed as shown in Fig. 2.22. In this GAN, two generators with the same architecture
are used but both of them are trained in different ways. The feedback face generator gets
the feedback from the discriminators and generates a more fine-tuned image. As discriminators, a ResNet and a DenseNet are used. The ability to discriminate between real and fake images is achieved by combining the feature maps of both ResNet and DenseNet: an image is fed to both networks, a 1024-dimensional output feature is extracted from each using global average pooling, the two output features are concatenated into a 2048-dimensional feature vector, and a softmax function is then used to normalize the two class scores. During the training process of the GAN, the spectral normalization method is used to stabilize the process.
2.4.1.2 Spherical image generation from the 2D sketch using SGANs
Most of the VR applications rely mostly on panoramic images or videos, and most of the
image generation models just focus on 2D images and ignore the spherical structure of the
panoramic images. To solve this, a panoramic image generation method based on spherical convolution and GAN, called SGAN [41], has been proposed, as shown in Fig. 2.23. As input, a sketch map of the image is taken, which provides a very good representation of the geometric structure. A custom-designed generator is used to generate the spherical image and reduce distortion using spherical convolution, and a least-squares loss is used to express the constraint of whether the discriminator can distinguish the generated image from the real image. The spherical convolution is used for observing the data from multiple angles. The discriminator is used to distinguish between generated
Table 2.3 NLP datasets.

NLP dataset | Description | Link
CNN/Daily Mail dataset | It is a text summarization dataset with two features, namely the documents to be summarized (article) and the target text summary (highlights) | https://github.com/abisee/cnn-dailymail
News summarization dataset | It has the author details, date of the news, headlines, and the detailed news link | https://www.kaggle.com/sunnysai12345/news-summary
Chinese poem dataset | It has small poems. Each poem has 4–5 lines and each line has 4–5 words | https://github.com/Disiok/poetry-seq2seq and https://github.com/XingxingZhang/rnnpg
COCO (common objects in context) captions | It is an object detection and caption dataset. The dataset has five sections: info, licenses, images, annotations, and category | http://cocodataset.org/#download
Shakespeare's plays | It consists of 715 characters of Shakespeare plays, with the continuous set of lines spoken by each character in a play. It can be used for text generation | https://www.tensorflow.org/federated/api_docs/python/tff/simulation/datasets/shakespeare/load_data
Open subtitles dataset | It has a collection of translated movie subtitles in 62 languages | https://github.com/PolyAI-LDN/conversational-datasets/tree/master/opensubtitles
YELP | It is a business reviews and user dataset. It has 5,200,000 user business reviews, information about 174,000 businesses, and data about 11 metropolitan areas | https://www.kaggle.com/yelp-dataset/yelp-dataset
Amazon | It is an Amazon review dataset | https://registry.opendata.aws/?search=managedBy:amazon
Caption | It consists of approximately 3.3 million image-caption pairs | https://ai.googleblog.com/2018/09/conceptual-captions-new-dataset-and.html
Noisy speech | Noisy and clean speech dataset. It can be used for speech enhancement applications | https://datashare.is.ed.ac.uk/handle/10283/2791
OpinRank | It consists of 3,000,000 reviews on cars and hotels collected from TripAdvisor | http://kavita-ganesan.com/entity-ranking-data/#.XuxKF2gzY2z
Legal case reports | It consists of text summaries of about 4000 cases. It can be used for training text summarization tasks | https://archive.ics.uci.edu/ml/datasets/Legal+Case+Reports
Fig. 2.22 Architecture of ensemble learning GAN.
Fig. 2.23 Architecture of SGAN.
images and real images. For image generation, a multiscale discriminator is used, which is quite common and has the advantage of decreasing the burden on the network.
2.4.1.3 Generation of radar images using TsGAN
Radar data are hard to interpret because of data imbalance, which becomes a bottleneck for some operations. To support radar operations, a two-stage generative adversarial network (TsGAN) [42] has been introduced, as shown in Fig. 2.24. In the first stage, it generates samples that are similar to the real data and checks their eligibility. To generate radar image sequences, each frame is decomposed into content information and motion information, and an RNN is used to capture data such as the flow of clouds. Two discriminators are used: one for distinguishing between real radar images and generated images, and a second one for the motion information of the generated image sequence. The second stage is used to define the relationship between intervals and adjacent frames. The rank discriminator is used for computing
Fig. 2.24 Architecture of TsGAN.
the rank loss between generated motion sequences, real motion sequences and the
enhanced generated motion sequences.
2.4.1.4 Generation of CT from MRI using MCRCGAN
MRI (magnetic resonance imaging) is really useful in radiation treatment planning because of the functional information it provides compared with CT (computed tomography). However, there are some applications where MRI cannot be used because of the absence of electron density information. To apply MRI to these types of applications, the MCRCGAN (multichannel residual conditional GAN) [43] has been introduced, which generates pseudo-CT (p-CT) images, as shown in Fig. 2.25. MCRCGAN has two parts: a generator, which generates the pseudo-CT image from the input MR images, and a discriminator, which distinguishes the p-CT images from the real ones and measures the degree/number of mismatches, so that the network can be fed accordingly in the next iteration for better efficiency. MCRCGAN adopts a multichannel ResNet as the generator and a CNN as the discriminator.
2.4.1.5 Generation of scenes from text using text-to-image GAN
Generating an image from text is a vividly interesting research topic with very unique use cases, but it is quite difficult since language descriptions and images vary across different parts
Fig. 2.25 Architecture of MCRCGAN.
of the world, and current image generation models tend to mix the generation of background and foreground, which leads to objects in images that are submerged into the background. To make sure that image generation keeps the background and foreground separate, a combination of a VAE (variational autoencoder) and a GAN has proved to be robust. Here the generator contains three modules, namely a downsampling module, an upsampling module, and a residual module. The architecture of the text-to-image GAN [44] is shown in Fig. 2.26.
2.4.1.6 Gastritis image generation using PG-GAN
For the detection of gastric cancer, gastric X-ray images are used. These X-ray images are relatively large in size, so LC-PGGAN (loss function-based conditional progressive growing GAN) [45] has been introduced, as shown in Fig. 2.27. This GAN generates images that are effective for gastritis classification and contain all the details necessary to look for any sort of symptoms. For the generation of synthetic images, divided patch images are used. The whole process is divided into two sections: (1) a low-resolution step, where fake and real images are given to the discriminator, which sends the loss values to (2) a high-resolution step, where randomly sampled fake image patches and real image patches are given to the discriminator to finalize the output.
Fig. 2.26 The architecture of text-to-image GAN.
Fig. 2.27 Architecture of LC-PGGAN.
2.4.1.7 Image-to-image translation using quality-aware GAN
Image-to-image translation is one of the most widely practiced applications of GANs, and many works have been proposed for it, but they all depend on a pretrained network structure or rely on image pairs, so they cannot be applied to unpaired images. To solve these issues, a unified quality-aware GAN-based framework [46] was proposed, as shown in Fig. 2.28. Two different implementations of the quality loss are used: one is based on the image quality score between the real and reconstructed image, and the other is based on an adaptive deep network-based loss that calculates the score between the real image and the image reconstructed by the generator. The generators are trained so that each reconstructed image has a score similar or close to that of the real image. The loss function includes an adversarial loss, reconstruction loss, quality-aware loss, IQA loss, and content-based loss.
2.4.1.8 Generation of images from ancient text using encoder-based GAN
Ancient texts are of great use since they help us learn about our past and perhaps hold some keys to our future. To retrieve and understand these texts, an encoder-based GAN [9] has been introduced to generate remote sensing images from text gathered from different sources, as shown in Fig. 2.29. To train this particular network, satellite images and ancient images are used. The generator is conditioned on the text encodings of the training set, and images corresponding to the texts are synthesized. The discriminator is used to predict the source of the input images, i.e., whether they are real or synthesized. A text encoder and a noise generator are used prior to the input.
Fig. 2.28 Architecture of quality aware GAN.
Fig. 2.29 Architecture of encoder-based GAN.
2.4.1.9 Generation of footprint images from satellite images using IGAN
For many architectural and planning purposes, building footprints play an important role. To convert satellite images into footprint images, an IGAN (improved GAN) [26] was proposed, as shown in Fig. 2.30. This GAN uses a CGAN with a cost function based on the Wasserstein distance and integrated with a gradient penalty. The generator is provided with noise and a satellite image and, using Leaky ReLU as the activation function, generates a footprint image that is then sent to the discriminator to obtain a score; if the score is not close enough to that of the real image, it goes back to the generator, and the iterations provide better results every time. The dataset was based on Munich and Berlin and provided 256 × 256 images to work on. Segmentation is also used on the images to obtain the visible gradients.
2.4.1.10 Underwater image enhancement using a multiscale dense generative
adversarial network
Underwater image improvement has become more popular in underwater vision
research. The underwater images suffer from various problems such as underexposure,
color distortion, and fuzz. To address these problems, multiscale dense block generative
adversarial network (MSDB-GAN) [47] for enhancing underwater images has been
proposed as shown in Fig. 2.31. The random noise and the image to be enhanced are
given as input to the generator. The multiscale dense block is embedded within the generator. The MSDB is used for concatenating all the local features of the image using the
Fig. 2.30 Architecture of IGAN.
Fig. 2.31 Architecture of MSDB-GAN.
Leaky ReLU activation function. The discriminator discriminates between the real and
the generated image.
2.4.2 Image datasets
The open-source free image datasets available for research are shown in Table 2.4.
The following section discusses the various evaluation metrics.
Table 2.4 Image datasets.

Image dataset | Description | Link
CelebA-HQ | It consists of 30,000 face images of high resolution | https://www.tensorflow.org/datasets/catalog/celeb_a_hq
AOI | It consists of 685,000 footprints of buildings | https://spacenetchallenge.github.io/datasets/spacenetBuildingsV2summary.html
MRI brain tumor | It consists of 96 images of MRI brain tumors | https://www.kaggle.com/navoneel/brain-mri-images-for-brain-tumor-detection
CUB | It consists of images of 200 different bird species | http://www.vision.caltech.edu/visipedia/CUB-200.html
Oxford 102 | It consists of images of 102 different flower categories | https://www.robots.ox.ac.uk/~vgg/data/flowers/102/
CelebA | It consists of 200,000 celebrity images | http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
OpenStreetMap | The map data can be downloaded by selecting smaller areas from the map | https://www.openstreetmap.org/#map=5/21.843/82.795
Visual Genome | It consists of 108,077 images with captions of people, signs, buildings, etc. | http://visualgenome.org/api/v0/api_home.html
Open Images | It consists of approximately 9,000,000 images annotated with labels and bounding boxes for 600 object categories | https://storage.googleapis.com/openimages/web/download.html
CIFAR 10/100 | CIFAR-10 consists of 60,000 images in 10 classes. CIFAR-100 extends this to 100 classes, each consisting of 600 images | https://www.cs.toronto.edu/~kriz/cifar.html
Caltech 256 | It consists of 30,000 images categorized into 256 classes | https://www.kaggle.com/jessicali9530/caltech256
LabelMe | It consists of 190,000 images, 60,000 annotated images, and 658,000 labeled objects | http://labelme.csail.mit.edu/Release3.0/browserTools/php/dataset.php
COIL-20 | It consists of 100 different toy images, each toy photographed in 72 poses; hence 7200 images for 100 toys are present | https://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
2.5 Evaluation metrics
This section discusses the various evaluation metrics that are needed to assess the performance of the GAN models.
2.5.1 Precision
Precision (P) refers to the percentage of relevant results obtained during prediction. It is given by the ratio of the true positives to all positive predictions:
$P = \dfrac{TP}{TP + FP}$
where TP is the number of true positives and FP is the number of false positives.
2.5.2 Recall
Recall (R) refers to the percentage of relevant results that are correctly retrieved by the classifier. It is given by the ratio of the true positives to all actual positives:
$R = \dfrac{TP}{TP + FN}$
where TP is the number of true positives and FN is the number of false negatives.
2.5.3 F1 score
The F1 score is defined as the harmonic mean of precision and recall, i.e., twice the product of precision and recall divided by their sum:
$F1 = 2\,\dfrac{P \cdot R}{P + R}$
where P is precision and R is recall.
2.5.4 Accuracy
Accuracy refers to how accurately the model predicts the results. It is given by the ratio of the true positive and true negative results to the total number of results obtained:
$Accuracy = \dfrac{TP + TN}{Total}$
where TP is the number of true positives and TN is the number of true negatives.
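The four metrics above can be computed directly from confusion-matrix counts; the following is a small illustrative sketch (the counts in the example are made up).

# A minimal sketch (not from the chapter) computing precision, recall, F1, and accuracy from counts.
def classification_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, accuracy

# Example with made-up counts: 80 TP, 10 FP, 20 FN, 90 TN
print(classification_metrics(80, 10, 20, 90))   # (0.888..., 0.8, 0.842..., 0.85)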
2.5.5 Fréchet inception distance
The Fréchet inception distance (FID) is a metric used to evaluate the quality of the images generated by GANs. A lower FID means the generator has produced
higher-quality images; a higher FID means the generator has produced lower-quality images.
$FID = \|\mu_1 - \mu_2\|^2 + Tr\big(C_1 + C_2 - 2\sqrt{C_1 C_2}\big)$
where $\mu_1$ and $\mu_2$ are the feature-wise means of the real and generated images, $C_1$ and $C_2$ are the covariance matrices of the real and generated image feature vectors, and Tr denotes the trace operator from linear algebra.
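A sketch of the FID formula above, assuming NumPy/SciPy and that the Inception features of the real and generated images are already available as (N, d) arrays; this is an illustration, not a reference implementation.

# A minimal sketch (not the authors' code) of FID from precomputed feature arrays.
import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, fake_feats):
    mu1, mu2 = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    c1 = np.cov(real_feats, rowvar=False)
    c2 = np.cov(fake_feats, rowvar=False)
    covmean = sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):          # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2.0 * covmean))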
2.5.6 Inception score
The inception score (IS) is the metric used for measuring both the quality of the generated image and the difference between the generated and the real image. For measuring
the quality of the image, the inception network can be used to classify the generated and
the real images. The difference between the real and the generated image is computed
using KL-divergence.
$IS = \exp\big(\mathbb{E}_{g \sim G}\, D_{KL}(p(r \mid g) \,\|\, p(r))\big)$
where g is a generated image and r denotes the real samples with labels. $D_{KL}$ is the Kullback-Leibler divergence, which measures the distance between the real and generated image probability distributions.
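A sketch of the inception score, assuming the class probabilities from a pretrained classifier over the generated images are available as an (N, K) array; illustrative only.

# A minimal sketch (not from the chapter): probs[i, k] = p(class k | generated image i).
import numpy as np

def inception_score(probs, eps=1e-12):
    p_y = probs.mean(axis=0, keepdims=True)                                   # marginal class distribution
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)    # per-image KL divergence
    return float(np.exp(kl.mean()))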
2.5.7 IoU score
Intersection over union (IoU), also called the Jaccard index, is the metric that computes the overlap between the predicted results and the ground truth samples. The score ranges from 0 to 1, where 0 indicates no overlap.
$IoU = \dfrac{TP}{TP + FP + FN}$
where TP is the true positive results, FP is the false positive results, and FN is the false negative results.
2.5.8 Sensitivity
Sensitivity measures the percentage of the true positives that are correctly identified.
$Sensitivity = \dfrac{TP}{TP + FN}$
where TP is the true positive results and FN is the false negative results.
2.5.9 Specificity
Specificity measures the percentage of the true negatives that are correctly identified.
$Specificity = \dfrac{TN}{TN + FP}$
where TN is the true negative results and FP is the false positive results.
2.5.10 BLEU score
The bilingual evaluation understudy (BLEU) score is a metric used for measuring the similarity between the system-generated text and the input reference text.
$BLEU = \dfrac{N}{T}$
where N is the total number of words matching between the system-generated text and the input reference text, and T is the total number of system-generated words.
2.5.11 ROUGE score
The recall-oriented understudy for gisting evaluation (ROUGE) score is used for evaluating automatic text summarization. It is computed in terms of ROUGE precision and recall.
$ROUGE\ Precision = \dfrac{N}{T}, \qquad ROUGE\ Recall = \dfrac{N}{R}$
where N is the total number of words matching between the system-generated text and the input reference text, T is the total number of words in the system-generated text, and R is the total number of words in the input reference text.
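The simplified word-overlap definitions of the BLEU and ROUGE scores given above can be sketched as follows (practical BLEU/ROUGE implementations use n-gram statistics; this sketch follows only the formulas in this section).

# A minimal sketch of the simplified word-overlap scores defined above (not full n-gram BLEU/ROUGE).
def overlap_scores(system_text, reference_text):
    sys_words, ref_words = system_text.split(), reference_text.split()
    n = sum(1 for w in sys_words if w in ref_words)   # matching words N
    bleu = n / len(sys_words)                         # N / T
    rouge_precision = n / len(sys_words)              # N / T
    rouge_recall = n / len(ref_words)                 # N / R
    return bleu, rouge_precision, rouge_recall

print(overlap_scores("the cat sat on the mat", "the cat is on the mat"))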
The next section discusses the various languages and tools used for research.
2.6 Tools and languages used for GAN research
This section discusses the various languages that can be used for training the neural networks such as generator and the discriminator.
2.6.1 Python
For training the generator,
(1) Pandas is used for data manipulation
(2) Using the os module, the data path is set
(3) Train and test data are split using the pd.DataFrame() function
(4) In the infinite loop,
• pd.read_csv is used to read the data from the csv file
• Labels are retrieved from the list using the data.iloc() function
• They are appended into an array using the append() function
(5) Inside generator(), batch_size and shuffle_data are used,
• To set the array lists empty, [] lists are initialized
• cv2.imread is used to read images (if there are any)
• The arrays are read using np.array
For training the discriminator,
(1) Keras can be used
(2) The discriminator is defined using def define_discriminator(n_inputs=2)
(3) Define the model type and the activation functions to be used by,
• model = Sequential()
• model.add(Dense(25, activation='relu', kernel_initializer='he_uniform', input_dim=n_inputs))
• model.add(Dense(1, activation='sigmoid'))
(4) Compile the model by specifying the loss function and the optimizer to be used, using model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
(5) return model
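The Keras steps above can be consolidated into a single runnable sketch (assuming the standalone keras package or tensorflow.keras; the layer sizes follow the listing, everything else is standard Keras usage).

# A consolidated, runnable sketch of steps (1)-(5) above.
from keras.models import Sequential
from keras.layers import Dense

def define_discriminator(n_inputs=2):
    model = Sequential()
    model.add(Dense(25, activation='relu', kernel_initializer='he_uniform', input_dim=n_inputs))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

model = define_discriminator()
model.summary()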
2.6.2 R programming
(1) Install the neural network package using install.packages("neuralnet")
(2) Load the neuralnet package using library("neuralnet")
(3) Read the CSV file using read.csv()
(4) Preview the dataset using View()
(5) To view the structure and verify the ID variable, the str() function is used.
(6) To set the input variables to the same scale, scale(Anyvar[1:12]) is used.
(7) Generate a random seed using set.seed(200)
(8) Split the dataset into a 70-30 train and test set using,
• ind <- sample(2, nrow(Anyvar), replace = TRUE, prob = c(0.7, 0.3))
• train.data <- Anyvar[ind == 1, ]
• test.data <- Anyvar[ind == 2, ]
(9) A neural network with one hidden layer, two nodes, and linear output set to false is given by,
nn <- neuralnet(formula = FSTAT ~ AGE + SEX + CPK + SHO + CHF + MIORD + MITYPE + YEAR + YRGRP + LENSTAY + DSTAT + LENFOL, data = train.data, hidden = 2, err.fct = "ce", linear.output = FALSE)
(10) A summary can be generated using summary(nn)
(11) Visualize the neural network using plot(nn)
2.6.3 MatLab
(1) Set the dataset path using fullfile(path) function
(2) Load data as ImageDatastore using imageDatastore(path)
(3) To view the data from the dataset, imshow() function is used
(4) Divide the dataset into train and test set using splitEachLabel() function
(5) Define the neural network model, e.g., a convolutional neural network, by specifying [imageInputLayer(dimension), convolution2dLayer(dimension), reluLayer, maxPooling2dLayer(stride dimension), fullyConnectedLayer(10), softmaxLayer, classificationLayer]
(6) Set the training option settings using the function trainingOptions(optimization
technique, maximum no. of epochs, initial learning rate)
(7) Train the model using function trainNetwork (traindata, layers, optionsset)
(8) Prediction can be performed using the function classify()
(9) Compute accuracy
2.6.4 Julia
(1) Load the train and test data using dataset_name.traindata() and dataset_name.testdata()
(2) Add channel layer by unsqueeze(traindata,layerno) and unsqueeze (testdata,
layerno)
(3) Encode the labels by using the functions onehotbatch(traindata, 0:9) and
onehotbatch(testdata, 0:9)
(4) Create the complete dataset by using DataLoader(traindata, batchsize=size)
(5) To implement a CNN, use the function Chain(Conv(dimension), pad=2, stride=2, activation_function)
(6) Maxpooling can be implemented by using the function maxpooling() and average
pooling can be implemented by using GlobalMeanPool()
(7) Binary cross entropy loss is given by crossentropy(model(x), y)
(8) Gradient descent optimizer is given by Descent(learning rate) and adam optimizer
is given by adam(learning rate)
(9) Train the model using @epochs number of epochs Flux.train!(loss, b,w, train_data,
opt) where b is bias, w is weight, opt is the optimizer used
(10) Compute accuracy
The next section discusses the open challenges for future research.
2.7 Open challenges for future research
This section discusses the open challenges of GAN for future research.
• Vanishing gradients is a problem that many GAN architectures suffer from. First, the discriminator is trained to classify between real and fake images. Then the generator is trained, but initially G generates fake images that are easily classified by D, so D(G(n)) is close to 0 and the slope of the generator's loss is also close to 0. Hence, almost no useful gradient can be calculated at the initial start of backpropagation.
• Mode collapse means that the generator sometimes collapses and always generates the same or similar fake images of one type, i.e., the generator produces only a limited variety of fake samples. GAN architectures have to be designed in such a way that they do not suffer from this problem.
• In many GANs it is very hard to achieve Nash equilibrium, and a large number of epochs have to be run to reach it. The research challenge is to develop a technique that will help G and D achieve Nash equilibrium easily.
• When G and D are trained simultaneously, they often fail to converge. Instead of attaining Nash equilibrium, G may oscillate between generating specific samples.
• How to increase the stability of training?
• Choosing the learning rates for G and D, and understanding the effect of changing them, is a challenge.
• Hyperparameter tuning, i.e., deciding which values to set, is a challenge because it strongly affects training stability.
• New activation techniques can be proposed for the neurons in the network so that learning can be more stable.
• Training G and D is very hard: optimizing the loss functions is difficult and needs much trial and error. New optimization techniques can be proposed for better fine-tuning of G and D when the discriminator is not discriminating properly.
• If one network (either G or D) is not trained properly, the performance of the entire system degrades.
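As a minimal illustration of the vanishing-gradient issue in the first bullet above, the following PyTorch sketch compares the saturating generator loss log(1 − D(G(z))) with the commonly used non-saturating alternative −log D(G(z)); the toy d_fake tensor stands in for the discriminator's outputs on generated samples and is not code from this chapter.

```python
import torch

def generator_losses(d_fake):
    # d_fake: discriminator outputs D(G(z)) in (0, 1) for a batch of generated samples.
    eps = 1e-8
    saturating = torch.log(1.0 - d_fake + eps).mean()      # G minimizes this in the minimax game
    non_saturating = -torch.log(d_fake + eps).mean()       # commonly used alternative with stronger early gradients
    return saturating, non_saturating

# Early in training D(G(z)) is close to 0, so the saturating loss is nearly flat
# (its gradient vanishes), while the non-saturating loss still provides a useful signal.
d_fake = torch.full((16,), 0.01, requires_grad=True)
sat, non_sat = generator_losses(d_fake)
sat.backward()
print(d_fake.grad.abs().mean())    # very small gradient
d_fake.grad = None
non_sat.backward()
print(d_fake.grad.abs().mean())    # much larger gradient
```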
2.8 Conclusion
This chapter provides an overview of the generative adversarial networks, classification of
the GAN models based on learning and their pros and cons. The various applications of
GAN in natural language processing, image generation, and translation are discussed. The
various natural language processing and image datasets are listed. The evaluation metrics
needed for assessing the GAN performance have also been discussed. The tools available
for GAN research are also mentioned. Finally, the chapter summarizes the open challenges for future research.
CHAPTER 3
Generative adversarial networks and
their variants
Er. Aarti
Department of Computer Science & Engineering, Lovely Professional University, Phagwara, Punjab, India
3.1 Introduction of generative adversarial network (GAN)
Generative adversarial networks [1] were proposed to overcome the weaknesses of other generative structures and have also proven successful in the field of unsupervised learning. GANs have acquired wide consideration in the AI area for their capability to learn high-dimensional, complex real data distributions. In particular, they do not depend on any assumptions about the distribution and can generate real-looking samples from a latent space in a simple manner. This ground-breaking property drives GANs to be applied to
different applications, for example, image translation, image synthesis, domain adaptation, image attribute editing, and other scholastic fields [2]. The most persuasive explanation that GANs are broadly considered, created, and utilized is a result of their
prosperity. GANs have had the option to create photographs so sensible that people can’t
tell whether they are scenes, items, and individuals that do not exist in reality [3]. Generating a picture from a given text depiction has two objectives: visual authenticity and
semantic consistency. Although huge advancement has been made in creating high-quality, visually realistic pictures utilizing generative adversarial networks, ensuring
semantic consistency between the text depiction and visual substance stays very challenging. Various interesting applications of GANs are image-to-image translation, superresolution, semantic-image-to-photo translation, generation of new human poses,
photos to emojis, photograph editing, face aging, photo blending, and many more.
In a game-theoretic scheme, the generator system is required to contend against an
adversary by completing the objective, as generative adversarial networks depend on this
scheme. Adversarial games are the domain of AI where two or more agents play opposite
to each other. GANs are an exciting recent innovation in deep learning and one of the new state-of-the-art classes of neural networks that can be used for many tasks, such as recovering corrupted data, text-to-image generation, and many more applications.
Generative models can be thought of as containing more information than their discriminative counterparts, since they can also be used for discriminative tasks such as classification or regression. The adversarial modeling structure is generally straightforward to apply when both
models are multilayer perceptrons. No doubt, adversarial networks act as a general-purpose solution to image-to-image translation problems. These systems not only learn the mapping from an input picture to an output picture but also learn a loss function to train this mapping. GANs can replicate a probability distribution, and they can therefore utilize a loss function that reflects the distance between the distribution of the data generated by the GAN and the distribution of the original data.
GANs are a way to deal with generative modeling by utilizing DL strategies, for
example, CNN. An improved method called deep convolutional GAN, or DCGAN
prompted increasingly stable models. These days, most of the GANs are at least loosely
dependent on the DCGAN design. It is also one of the variants of GAN. Generative
modeling is an unsupervised task in artificial intelligence which contains automatically
searching and learning the patterns or regularities as intake information. The framework
is utilized to produce new structures that possibly can be taken from the initial dataset [3].
GANs are designed to train a generative framework automatically by framing the unsupervised problem as a supervised one and utilizing both generative and discriminative structures.
Fig. 3.1 shows the design of the generative adversarial network. (See Table 3.1.)
It is a deep-learning system and one of the most promising techniques for unsupervised learning on complex distributions. Deep-learning techniques can be utilized as
generative structures. Two mainstream models incorporate the restricted Boltzmann
machine (RBM) and the deep belief network (DBN). Two present-day instances of
deep-learning generative framework algorithms incorporate the variational autoencoder
(VAE) and the GAN [3]. GANs are a special case of generative models which are able to
predict features in a much better way due to the adversarial training.
They are a smart way of preparing a generative framework by presenting the issue as a
supervised learning issue with two submodels. The architecture of GAN goes through
two components in the system: generative and discriminative models. Both of these
models are prepared altogether by an adversarial procedure. Each model can be any neural network, such as a convolutional neural network (CNN), a recurrent neural network
(RNN), or a long short-term memory (LSTM).
Fig. 3.1 GAN [4].
Table 3.1 Comparison between various GAN-based approaches.

| Methods | Input | Output | Characteristics | Loss function | Resolution | Code |
|---|---|---|---|---|---|---|
| SRGAN [18] | Image | Image | High upscaling factor | Adversarial + feature | Arbitrary | T + TF |
| FCGAN [52] | Face | Face | | Adversarial + distance | 128 × 128 | |
| VGAN [47] | Noise vector | Video | | Adversarial | 64 × 64 | T |
| TGAN [43] | Noise vector | Video | Temporal generator | Adversarial | 64 × 64 | Ch |
| VariGAN [41] | Human + view | Human | Coarse to fine | Adversarial + distance | 128 × 128 | |
| StackGAN [71] | Text | Image | High-quality | Adversarial | 256 × 256 | |
| cycleGAN [19] | Image | Image | Unpaired data | Adversarial + cycle consistency | 256 × 256 | T + PT + TF |
| pix2pix [54] | Image | Image | General framework | Adversarial + distance | 256 × 256 | T + PT |
| Age-cGAN [64] | Face + age | Face | Identity preserved | Adversarial + identity preserving | | |
| Context Encoder [40] | Image + holes | Image | | Adversarial + distance | 128 × 128 | TF |
| TP-GAN [68] | Face | Face | Two pathway | Adversarial + distance + identity preserving + tv + symmetry | 128 × 128 | |

In the code column, T, TF, PT, and Ch denote Torch, TensorFlow, PyTorch, and Chainer, respectively.
The training process of the G and D networks is called adversarial training. The G and D structures are trained together in an adversarial fashion to improve each other, adjusting the parameters of G to minimize log(1 − D(G(z))) and the parameters of D to maximize log D(x) + log(1 − D(G(z))) [5], while competing in the two-player min-max game with value function V(G, D):
$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \tag{3.1}$$
where V(G, D) is a binary cross-entropy function of the kind commonly used in binary classification problems, D(x) is a multilayer perceptron, pz(z) is the distribution of the input noise variables, and pdata(x) and pz(z) in Eq. (3.1) denote the real data probability distribution defined on the data space X and the probability distribution of z defined on the latent space Z, respectively. G maps z from Z to an element of X, whereas D takes an input x and distinguishes whether x is a real sample or a fake sample generated by G [6].
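To make Eq. (3.1) concrete, the following PyTorch sketch performs one alternating update of D and G using the binary cross-entropy form of the objective. G, D, the optimizers, and latent_dim are assumptions, not code from this chapter; note that, in practice, the non-saturating generator loss −log D(G(z)) is typically used in place of log(1 − D(G(z))).

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()

def gan_step(G, D, real_batch, opt_g, opt_d, latent_dim=100):
    """One alternating update for Eq. (3.1); D is assumed to output a probability in (0, 1)."""
    batch = real_batch.size(0)
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator step: push D(x) toward 1 and D(G(z)) toward 0 (ascent on V).
    z = torch.randn(batch, latent_dim)
    fake = G(z).detach()
    d_loss = bce(D(real_batch), ones) + bce(D(fake), zeros)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: the non-saturating form -log D(G(z)) used in practice.
    z = torch.randn(batch, latent_dim)
    g_loss = bce(D(G(z)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```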
3.1.1 Generative model (GM)
A generator model mainly learns how to make pictures that look genuine. During training, the generator continuously becomes better at producing pictures that appear real. It takes a fixed-length random vector as input and produces a sample in the domain, as shown in Fig. 3.2. The vector is drawn randomly from a Gaussian distribution and used to seed the generative procedure. After training, points in this multidimensional vector space correspond to points in the problem domain, forming a compressed representation of the data distribution. The vector space is known as a latent space and contains latent variables. A latent variable is a random variable that is important for a domain but not directly observable; latent variables are often viewed as a compression or projection of the data distribution. In the case of GANs, the generator attaches meaning to points in a chosen latent space, so that new points drawn from that space can be fed to the generator model as input and used to produce new and distinct output samples [3].
The main purpose of the generator is to deceive the discriminator by generating new plausible samples (mostly pictures) from the problem domain, whereas the discriminator examines the data produced by the generator and determines whether a picture is authentic or machine generated [7]. In the original GAN formulation, G and D are not required to be neural networks; they only need to be able to fit the corresponding generation and discrimination functions. However, deep neural networks are commonly utilized as G and D. Both can be a nonlinear mapping function, such as a multilayer perceptron.
Fig. 3.2 GAN generator model.
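As an illustration of this mapping, a minimal PyTorch generator sketch is shown below; the latent size, layer widths, and 28 × 28 output are assumptions, not the chapter's model.

```python
import torch
import torch.nn as nn

# Illustrative only: a small MLP generator mapping a 100-dimensional Gaussian latent
# vector to a flattened 28 x 28 image.
generator = nn.Sequential(
    nn.Linear(100, 256), nn.ReLU(),
    nn.Linear(256, 512), nn.ReLU(),
    nn.Linear(512, 28 * 28), nn.Tanh(),   # outputs scaled to [-1, 1]
)

z = torch.randn(16, 100)        # points drawn from the latent space
fake_images = generator(z)      # (16, 784) generated samples
```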
3.1.2 Discriminator model (DM)
A discriminator model attempts to classify pictures as either authentic or fake, and during training it becomes better at making that separation. It takes a sample as input (genuine or generated) from the domain and predicts a binary class label of real or generated, as shown in Fig. 3.3. The genuine samples originate from the training dataset. The discriminator is a typical classification model, and it is discarded after the training procedure because the interest is in the generator.
The procedure reaches equilibrium when the discriminator can no longer distinguish genuine pictures from fakes. The potential of GANs for both good and bad is tremendous because they can learn to imitate any distribution of data. In this way, they can create worlds strikingly similar to our own in any domain: music, pictures, writing, and speech. In a sense, they are robot artists, and their output is impressive. However, they can likewise be utilized to create false media content and are the technology supporting deepfakes. They have been utilized for many applications, particularly for image synthesis, on account of their capacity to produce high-quality pictures. In recent years, various variants of GAN have been proposed, and they have generated excellent outcomes for image generation. GANs belong to the family of generative structures, which implies that they can create new content [8]. A GAN does not work with an explicit density function [9]. Using a game-theoretic methodology, it learns to generate from the training distribution through a two-player game. The generated samples are among the best, but GANs can be tricky and unstable to train, with no support for inference queries. GANs depend on a game-theoretic situation wherein the generator network must compete against an adversary and directly generates samples. The discriminator network, the adversary, tries to distinguish samples drawn from the training data from samples produced by the generator [3]. This mutual game training produces a reasonably good outcome. A compelling GAN application requires a suitable training strategy; otherwise, the output might be unsatisfactory because of the flexibility of the neural network model [7].
Fig. 3.3 GAN discriminator model.
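A matching minimal PyTorch discriminator sketch is shown below; again, the layer sizes are assumptions, not the chapter's model.

```python
import torch
import torch.nn as nn

# Illustrative only: a small MLP discriminator that takes a flattened 28 x 28 image and
# returns the probability that the image is real rather than generated.
discriminator = nn.Sequential(
    nn.Linear(28 * 28, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

x = torch.randn(16, 28 * 28)     # a batch of real or generated samples
p_real = discriminator(x)        # (16, 1) probabilities of "real"
```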
Advantages
1. It is a better modeling of data distribution.
2. In theory, GANs can train any type of generator network. Other frameworks require
generator networks to have some specific form of functionality, such as the output
layer being Gaussian.
3. There is no need for Markov chains or repeated sampling, no inference is required during the learning process, and no complicated variational lower bounds are needed, which avoids the difficulty of approximating intractable probabilities.
Disadvantages
1. It is hard to train, unstable. Proper synchronization is required between the generator
and the discriminator, but in actual training, it is easy for D to converge and G to
diverge. D/G training requires careful design.
2. It has the mode collapse issue. The learning process of GANs may miss modes: the generator begins to degenerate, the same sample points are always generated, and learning cannot continue.
3. It cannot solve inference queries such as p(x).
3.2 Related work
Goodfellow et al. [1] portrayed the GAN architecture in 2014 and discussed the nonsaturating loss function. It also provides the derivation for the optimal discriminator and
demonstrates the effectiveness empirically on the MNIST, TFD, and CIFAR-10 image
datasets. Radford et al. [10] introduced a class of deep convolutional GANs (DCGANs)
that imposes empirical constraints on the network architecture to solve the problem of
potential instability during training. Salimans et al. [11] provided a set of tools to avoid
instability and mode collapsing, which includes historical averaging, minibatch discrimination, one-sided label smoothing, feature matching, and virtual batch normalization.
Che et al. [12] used regularization methods for the objective to avoid the problem of
missing modes. Arjovsky et al. [13] suggested minimization of the Wasserstein-1 or
Earth-Mover distance among generator and data distribution with theoretical reasoning.
In a follow-up paper, Gulrajani et al. [14] projected an enhanced approach for training the
discriminator—termed critic by Arjovsky et al. [13]—which behaves stably, even with
deep ResNet architectures. GANs have mostly been investigated on pictures, showing
significant success with tasks such as image generation [15–17], image superresolution
[18], style transfer [19, 20], and many others.
3.3 Deep-learning methods
There has been a gigantic advancement in framework demonstration and perception after
presenting the advanced models for deep learning (DL). DL techniques quickly developed and extended applications in different logical and engineering areas. Deep learning
is a growing area of AI (ML) research. It includes various concealed layers of artificial
neural networks. The methodology applies nonlinear transformations to form high-level abstractions from large collections of data. The current improvements in deep-learning structures across various fields have already made significant contributions to AI. Current
analysis has applied deep learning as the principal tool for digital image processing.
A convolutional neural network (CNN) used for iris recognition is considered more powerful than a customary iris sensor [21]. Deep learning is a subset of the
field of ML, which is a subfield of AI [22]. Health informatics, bioinformatics, safety,
energy, economic, security, urban informatics, hydrological systems modeling, and computational mechanisms are the advanced application field of DL [23]. Deep-learning
techniques are quickly advancing for better performance.
Recently, DL algorithms have come out of AI and soft computing strategies. From
that point, a few DL algorithms are currently acquainted with mainstream researchers and
used in different application areas. Nowadays, their use has become fundamental because of their effective learning, precision, and robustness in model structure. Deep-learning strategies are quickly developing, and some of them have come to specialize in a specific application area. The literature includes ample
survey papers on the advancing designs in particular usage areas, such as superresolution
imaging, multimedia analytics, cardiovascular image analysis, transportation systems,
radiology, medical ultrasound analysis, 3D sensed data classification, activity recognition
in radar, sentiment classification, renewable energy forecasting, image cytometry, 3D
sensed data classification, text detection, apache-spark, and hyperspectral [24–29]. The
convolutional neural network, recurrent neural network, de-noising autoencoder, deep
belief network, and long short-term memory techniques have been recognized as the
most famous deep-learning strategies [23].
3.3.1 Convolutional neural network
It is one of the well-known structures of deep-learning procedures. It includes three sorts
of a layer with various pooling, convolutional, and completely associated layers shown in
Fig. 3.4. There are two phases for the preparation procedure in each CNN, the feedforward, and the back-propagation phase. GoogLeNet [30], AlexNet [31], ZFNet
[32], ResNet [33], and VGGNet [34] are the most widely recognized CNN designs.
It is best known for, and most commonly utilized in, image-processing applications.
Fig. 3.4 CNN architecture [23].
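As a concrete illustration of this layer structure, here is a minimal PyTorch sketch of such a CNN; the channel counts, 28 × 28 input, and 10-class output are assumptions.

```python
import torch
import torch.nn as nn

# A minimal CNN of the kind described above: convolutional, pooling (subsampling),
# and fully connected layers.
cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                          # subsampling layer
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),                # fully connected layer for 10 classes
)

logits = cnn(torch.rand(1, 1, 28, 28))        # forward pass on a dummy 28 x 28 grayscale image
```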
3.3.2 Recurrent neural network (RNN)
It is moderately current deep-learning strategy. RNN is intended to perceive groupings
and patterns, for example, handwriting, text, speech, and many more applications [23]. It
has advantages in the structure of cyclic associations which utilize repetitive calculations
to sequentially process the input information [35]. It is essentially a standard neural network that has been unrolled across time, with edges that feed into the next time step rather than into the next layer at the same time step. Every past input is carried in a state vector in the hidden units, and these vectors are used to compute the outputs. Expert systems, hydrological prediction, economics, energy, and navigation are its present applications. Fig. 3.5 depicts the architecture of RNN.
Fig. 3.5 RNN architecture [23].
3.3.3 Deep belief network (DBN)
It is recognized as a composite multilayered neural network which includes both undirected and directed connections. It is utilized for learning high-dimensional manifolds of data. The strategy consists of various layers, with connections among the layers but no connections between units within each layer. It also contains restricted Boltzmann machines (RBM) that are trained in a greedy manner [36], in which each layer connects with both the preceding and the following layer [37, 38]. The structure is comprised of a feed-forward network and a few layers of RBMs as feature extractors [39]. The two layers of an RBM [40] are the hidden and visible layers. Fig. 3.6 depicts the
design of the DBN strategy.
The deep belief network is one of the most dependable DL strategies, offering computational efficiency and high precision [23]. Human emotion detection, time series prediction, renewable energy forecasting, cancer diagnosis, and financial estimation are its popular application areas.
Fig. 3.6 DBN architecture [23].
3.3.4 Long short-term memory
LSTM is an RNN variant that uses feedback connections and can be utilized as a general-purpose sequence model. The technique is used for applications such as pattern recognition and image processing. Mainly, it consists of three central parts, the input, output, and forget gates, which control when new data is allowed into the cell and what is remembered from the previous time step. Deciding all of this based on the present input is one of the fundamental qualities of the LSTM technique [23]. Fig. 3.7 presents the design of the LSTM technique.
Fig. 3.7 LSTM architecture.
LSTM has demonstrated incredible possibilities in various environmental areas such as hydrological prediction, hazard modeling, air quality, and geological modeling. The LSTM design may also be appropriate for other application areas, such as solar power modeling, energy demand and consumption, and the wind energy industry, because of its generalization capabilities [23].
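To make the gate description concrete, here is a small NumPy sketch of a single LSTM step; the stacked parameter layout (W, U, b) is an implementation assumption, not part of the chapter.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step showing the four gates described above.
    W, U, b hold parameters for the input (i), forget (f), output (o), and
    input-modulation (g) gates; shapes (4H, D), (4H, H), (4H,) are assumed."""
    z = W @ x + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c_prev + i * g          # forget gate keeps old state; input gate admits new content
    h = o * np.tanh(c)              # output gate controls what the cell exposes
    return h, c

# Toy usage with input size 4 and hidden size 8.
h, c = np.zeros(8), np.zeros(8)
x = np.random.randn(4)
W, U, b = np.random.randn(32, 4), np.random.randn(32, 8), np.zeros(32)
h, c = lstm_step(x, h, c, W, U, b)
```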
3.4 Variants of GAN
With the advancement of technology, various improvements are made to the variants
of GAN.
3.4.1 VariGAN
VariGAN stands for variational GAN [41], which was proposed to create multiview person images from a single view. This GAN generates images in a coarse-to-fine manner. VariGAN is made up of three networks: a coarse image generator, a fine image generator, and a conditional discriminator. The coarse image generator GC utilizes a conditional VAE design [7], where VAE stands for variational autoencoder. With an input picture i and a target view v, a low-quality picture is first created independently with
the target view, i-v (low quality). The fine image generator GF is made up of a dual-path U-Net [42] design; the U-Net is named after its symmetric shape. It maps i-v (low quality) to a high-quality picture conditioned on the input picture, and the discriminator D examines the high-quality picture conditioned on the input picture. GF and the discriminator are jointly trained with an objective function comprising an adversarial loss and a content loss measuring the L1 difference between the generated high-quality picture and the ground truth [10].
3.4.2 TGAN
TGAN stands for temporal generative adversarial net, which was suggested by Saito et al. [43] for video generation. It comprises a discriminator, a temporal generator, and an image generator. The temporal generator delivers a sequence of latent frame vectors [z1(1), z1(2), ..., z1(S)] from a random variable z0, where S is the number of video frames. The image generator takes z0 and a frame vector z1(t) (0 < t < S + 1) as input and generates the t-th video frame. Here, too, the discriminator accepts the entire video as input and attempts to distinguish it from genuine ones. TGAN follows WGAN [44] for stable training, however applying singular value clipping rather than weight clipping to the discriminator [10].
A more recent temporal GAN [45] manages the instability in video generation by employing a frame-wise generation framework. A generative model is utilized to sample frames for image generation, while a temporal generator preserves temporal consistency and controls this model. The model separates essential parts of a video, such as foreground from background or dynamic from static patterns, to manage the instability of training GANs. It assumes a latent space of pictures and considers a video clip to be produced by traversing points in that latent space; video clips of various lengths correspond to latent-space trajectories of various lengths.
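A minimal sketch of the TGAN-style factorization is shown below; temporal_generator, image_generator, and all sizes are hypothetical placeholders (TGAN itself uses deconvolutional networks), intended only to show how z0 and the per-frame latents combine.

```python
import torch
import torch.nn as nn

d0, d1, S = 100, 100, 16                     # latent sizes and number of frames (assumptions)
temporal_generator = nn.Linear(d0, S * d1)   # z0 -> a sequence of S frame latents
image_generator = nn.Sequential(nn.Linear(d0 + d1, 3 * 64 * 64), nn.Tanh())  # (z0, z1(t)) -> frame t

z0 = torch.randn(8, d0)                               # one z0 per video in the batch
z1 = temporal_generator(z0).view(8, S, d1)            # per-frame latent vectors
frames = [image_generator(torch.cat([z0, z1[:, t]], dim=1)) for t in range(S)]
video = torch.stack(frames, dim=1)                    # (8, S, 3*64*64): one generated frame per latent
```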
3.4.3 Laplacian pyramid of generative adversarial network (LAPGAN)
Denton et al. [17] projected the creation of pictures in a coarse-to-fine manner utilizing a
cascade of convolutional GANs having the structure of a Laplacian pyramid with
N levels. This method utilizes multiple generator and discriminator networks at the different levels of the Laplacian pyramid. The picture is first downsampled at each of the N levels, and it is then upscaled again in a backward pass, in which a noise vector is mapped to a picture by the conditional GAN at the coarsest level and refined until the picture reaches its original size. At each level of the pyramid except the coarsest one, a separate CGAN is trained that takes the output picture from the coarser level as a conditioning variable to produce the picture at this level. This approach is mainly used because it can create pictures of higher quality in a coarse-to-fine manner [10]. This methodology permitted the exploitation of the
multiscale model of regular pictures, assembling a progression of generative models, each
catching picture structure at a specific degree of the Laplacian pyramid which is made
from a Gaussian pyramid utilizing upsampling u(.) and downsampling d(.) capacities.
Assume G(I) = [I0, I1, ..., IK] is the Gaussian pyramid, where I0 = I and Ik is the result of k repeated applications of d(.) to I. Then the coefficient hk at level k of the Laplacian pyramid is given by the difference between adjacent levels of the Gaussian pyramid, upsampling the smaller one with u(.):
$$h_k = L_k(I) = G_k(I) - u(G_{k+1}(I)) = I_k - u(I_{k+1}) \tag{3.2}$$
Reconstruction from the Laplacian pyramid coefficients [h1, ..., hK] can be performed by the backward recurrence:
$$I_k = u(I_{k+1}) + h_k \tag{3.3}$$
So a set of convolutional generative models G0, G1, ..., GK is used while training a LAPGAN, each of which captures the distribution of the coefficients hk at a different level of the Laplacian pyramid. During reconstruction, the generative models are used to produce the coefficients h̃k, modifying Eq. (3.3) as follows:
$$\tilde{I}_k = u(\tilde{I}_{k+1}) + \tilde{h}_k = u(\tilde{I}_{k+1}) + G_k\big(z_k, u(\tilde{I}_{k+1})\big) \tag{3.4}$$
A training image I is used for the construction of a Laplacian pyramid, and a stochastic choice is made at each level as to whether the coefficient hk is constructed using the standard procedure or generated by Gk [46].
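To make Eqs. (3.2) and (3.3) concrete, here is a small PyTorch sketch that builds a Laplacian pyramid and reconstructs the image exactly; the average-pooling d(.) and bilinear u(.) operators are simplifying assumptions, not the exact operators used in the LAPGAN paper.

```python
import torch
import torch.nn.functional as F

def d(x):  # downsample by 2 (stand-in for the Gaussian-pyramid downsampling)
    return F.avg_pool2d(x, kernel_size=2)

def u(x):  # upsample by 2 (stand-in for the upsampling operator)
    return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)

def build_laplacian(img, levels=3):
    gaussian = [img]
    for _ in range(levels):
        gaussian.append(d(gaussian[-1]))               # G(I) = [I0, I1, ..., IK]
    laplacian = [gaussian[k] - u(gaussian[k + 1])      # h_k = I_k - u(I_{k+1})   (Eq. 3.2)
                 for k in range(levels)]
    return laplacian, gaussian[-1]

def reconstruct(laplacian, coarsest):
    img = coarsest
    for h in reversed(laplacian):
        img = u(img) + h                               # I_k = u(I_{k+1}) + h_k   (Eq. 3.3)
    return img

x = torch.rand(1, 3, 64, 64)
lap, coarse = build_laplacian(x)
print(torch.allclose(reconstruct(lap, coarse), x, atol=1e-5))   # exact reconstruction
```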
D and G compete in the two-player minimax game with value function V(G, D):
$$\min_G \max_D V(D, G) = \mathbb{E}_{y,x \sim p_{\mathrm{data}}(y,x)}[\log D(y, x)] + \mathbb{E}_{x \sim p_x,\, z \sim p_z(z)}[\log(1 - D(G(z, x), x))] \tag{3.5}$$
LAPGAN is a tandem system in which a set of pictures are adjusted orderly as per their
quality from less to more. Based on a low-quality sample, it first produced a low-quality
picture and then considered intake with a higher quality picture to the successive level. At
each level, the generator corresponds to a discriminator that determines whether an
intake picture is authentic or fake. The quality of the output image will be greatly
improved and more authentic after many times of feature extraction. It is more advisable
for high-quality pictures because it is trained under supervised learning.
The advantages of LAPGAN are that it is easy to approach; it learns residuals, so that different distributions can be learned at each stage by the generator and passed as supplementary information to the next level; it allows step-by-step independent training; and it increases the capability of GANs. In addition, it combines with CGAN to turn an unsupervised methodology into supervised learning with significant performance improvement. The disadvantage is that it must be trained under supervision.
3.4.4 Video generative adversarial network (VGAN)
Video GAN (VGAN) framework proposes the utilization of independent streams for creating frontal area and background. Vondrick et al. [47] hypothesized that a video clip is a
point in a latent space and suggested GANs generating video [9] with a spatiotemporal
convolutional design in 2016. It adjusts the DCGAN model to predict future frames,
create videos, and classify human actions. VGAN is a GAN for video in which it is considered that the entire video is joined by a stationary background scene and a dynamic
foreground clip. The background is produced as a picture and afterward duplicated over
time. A mutually prepared cover chooses among foreground and background to produce
videos. To encourage the network to utilize the background stream, a sparsity prior is added to the mask during learning. Hence, it uses a two-stream generator in which the input to both streams is a noise vector. The background stream produces the stationary background picture with 2D convolutional layers, while the moving-foreground generator attempts to create the 3D foreground video cube and the corresponding 3D foreground mask with spatio-temporal 3D CNN layers, predicting plausible future frames. The discriminator takes the entire produced video as input and attempts to distinguish it from original videos. VGAN treats video as a 3D cube, which requires huge storage space; experiments suggested this framework can produce small videos, up to a second long at full frame rate, better than simple baselines [21]. Also, investigations and visualizations show that the internal model learns useful features for recognizing actions with minimal supervision, suggesting that scene dynamics are a promising signal for representation learning.
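A compact sketch of this two-stream composition is given below, with illustrative tensor shapes; the mask-based blending follows the description above, but all names and sizes are assumptions.

```python
import torch

# Two-stream composition: a static background image is replicated over time and combined
# with a spatio-temporal foreground volume through a soft mask.
B, C, T, H, W = 2, 3, 16, 64, 64
foreground = torch.rand(B, C, T, H, W)           # from the 3D convolutional stream
mask = torch.rand(B, 1, T, H, W)                 # in [0, 1]; a sparsity prior is applied during training
background = torch.rand(B, C, 1, H, W)           # single image from the 2D stream
background = background.expand(B, C, T, H, W)    # replicated over time

video = mask * foreground + (1 - mask) * background   # generated video cube
```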
A few attempts to approach the video generation issue were made through GANs
[1]. However, past work concentrated mostly on small patches and evaluated them for video classification. This system also learns a mapping from the latent space to video clips. Yet assuming that a video clip is a point in the latent space unnecessarily increases the complexity of the problem, since videos of the same action performed at different speeds are represented by different points in the latent space. In addition, this assumption forces every generated video clip to have the same length, while the length of real-world video clips varies. Applying GANs to video generation is considered difficult because video has an additional temporal dimension involving much larger computation and storage costs, and it is also not trivial to maintain temporal coherence.
3.4.5 Superresolution GAN (SRGAN)
It takes a low-quality picture as input and produces an upsampled picture with a 4x upscaling factor. The main objective of SR is to enhance the quality of a low-resolution picture by upsampling it. Basically, this problem is ill-posed because the recovered high-quality picture misses high-frequency details
during the upscaling of the image, particularly for large upscaling factors. Numerous other deep-learning-based strategies [4, 45, 48] were proposed to handle this issue, but those could not perform well with very low-resolution pictures. This superresolution GAN utilizes deep-learning ideas to give higher quality pictures. During training, a high-quality picture is always converted into a low-quality picture by downsampling. The generator of the GAN is responsible for converting the low-quality picture to a high-quality picture, and the discriminator is responsible for classifying the produced pictures [21].
Ledig et al. [18] proposed SRGAN, which takes a low-quality picture as input and produces an upsampled picture with 4x upscaling. The network design of SRGAN implemented by Ledig et al. builds on the guidelines of the DCGAN [36] architecture, and the design of the generator utilizes both convolutional and residual networks [21]. The objective function incorporates an adversarial loss and also a feature loss, rather than a pixel-wise mean-squared error loss [9], to improve the realism of the reconstructed image and achieve the 4x upscaling reconstruction. It also uses a perceptual loss based on features extracted by a convolutional neural network: by comparing the features of the generated picture with those of the target picture after the convolutional neural network, the generated picture and the target picture become increasingly similar in semantics and texture [49]. The feature loss is computed as the distance between the feature maps of the generated upscaled picture and the real picture, where the feature maps are extracted from a pretrained VGG19 network by feeding the picture into it. Experiments show that SRGAN outperforms the best available methods on public datasets [21].
The loss is calculated as a weighted combination of regularization, adversarial, and content losses, where the content loss measures the difference between the two high-resolution images. The SRGAN generator G takes a low-quality image I^LR and outputs the corresponding high-quality image I^SR; θG are the parameters of G.
$$\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N} \sum_{n=1}^{N} l^{SR}\big(G_{\theta_G}(I_n^{LR}), I_n^{HR}\big) \tag{3.6}$$
The SRGAN discriminator D classifies whether a high-quality image is I^HR or I^SR; θD are the parameters of D.
$$\min_{\theta_G} \max_{\theta_D} \mathbb{E}_{I^{HR} \sim p_{\mathrm{train}}(I^{HR})}\big[\log D_{\theta_D}(I^{HR})\big] + \mathbb{E}_{I^{LR} \sim p_G(I^{LR})}\big[\log\big(1 - D_{\theta_D}(G_{\theta_G}(I^{LR}))\big)\big] \tag{3.7}$$
Wang et al. [50] proposed an enhanced SRGAN which improved the adversarial loss, the network design, and the perceptual loss.
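As a rough illustration of the SRGAN-style generator objective described above, the following sketch combines a feature (VGG-style) loss with an adversarial term; G, D, feature_extractor, and the weighting are assumptions, not the chapter's or the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def srgan_generator_loss(G, D, feature_extractor, lr_img, hr_img, adv_weight=1e-3):
    """G: generator, D: discriminator outputting a probability, feature_extractor: a
    pretrained network truncated at some feature layer (all assumed to exist)."""
    sr_img = G(lr_img)                                   # upsampled image I^SR = G(I^LR)
    # Feature loss: distance between feature maps of the generated and real images.
    feat_loss = F.mse_loss(feature_extractor(sr_img), feature_extractor(hr_img))
    # Adversarial loss: encourage D to rate the super-resolved image as real.
    adv_loss = -torch.log(D(sr_img) + 1e-8).mean()
    return feat_loss + adv_weight * adv_loss
```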
3.4.6 Face conditional generative adversarial network (FCGAN)
FCGAN is the face conditional GAN, which focuses on facial image SR. Berthelot et al. [51] proposed BEGAN, which aims to maintain an adjustable balance in the trade-off between diversity and quality. Huang proposed FCGAN [52], which concentrates on facial image SR. Within the network design, both the generator and the discriminator utilize an encoder and a decoder along with skip connections. It produces excellent outcomes with a 4x scaling factor. In training, the objective function incorporates a content loss, which is evaluated as the L1 pixel-wise dissimilarity between the produced upsampled picture and the ground truth.
3.5 Applications of GAN
The main function of GANs is to build systems that create samples with the same distribution as genuine data, for example, producing photo-realistic pictures. GANs can also be utilized to handle the problem of insufficient training samples for supervised or semi-supervised learning. At present, a favorable use of GAN is computer vision, which
includes pictures and video, for example, image-to-image translation, video generation,
generation of cartoon characters, text-to-image translation, and many more. In this segment, the application scope of GANs is discussed [49]. GANs have some genuinely
helpful practical applications, which incorporate the following.
A. The application in the image
• Image generation
Generative systems can be utilized to create reasonable pictures after being prepared for sample pictures. For instance, to produce new pictures of dogs, a GAN
can be prepared on thousands of samples of pictures of dogs. When the preparation
has been completed, the generator system will have the option to create new pictures
that are not quite the same as the pictures in the preparation set. Image generation is
utilized in social media, marketing, entertainment, logo generation, and so on.
Hanock et al. [53] proposed the composite GAN, which creates partial pictures with various generators and finally combines them into the whole picture.
• Image-to-image translation
It is utilized to change over pictures taken in the day to pictures taken around
evening time, to change over portrayals to artistic creations, to style pictures to look
such as Picasso or Van Gogh works of art, to automatically convert aerial pictures to satellite pictures, and to convert pictures of horses to pictures of zebras.
These utilization cases are ground-breaking since they can spare time. Phillip et al.
[54] exhibited GAN’s, precisely pix2pix method for the image-to-image translation
undertakings. Jun et al. [19] presented the renowned cycle GAN as well as the setup
of noteworthy image-to-image translation models. Cycle GAN is a significant
application framework of GAN in the field of images. It depends on two sets of pictures that need no pairing. A crying face can be transformed into a laughing one, or a zebra into a horse. StarGAN is a further advancement of CycleGAN, where a single model is trained to translate among multiple classes. StarGAN can be used to change a smiling expression into a crying one, along with a collection of other expressions, for example, shock, disappointment, and so on.
• High-resolution picture generation
GANs can assist in creating high-quality pictures taken from low-quality camera
pictures without losing any necessary details. Superresolution is a field in which GAN
depicts a very remarkable outcome with commercial chances [55]. This can be valuable on websites. The utilization of GAN for SR tackles the inadequacies of the ordinary strategies, which includes the DL techniques, with absences of high recurrence
data. Customary deep CNN can enhance the imperfection by choosing the target
function. GAN can likewise take care of this issue and acquire fulfilling observation
[49]. Christian et al. [18] show the utilization of GANs, explicitly SRGAN framework, to produce yield pictures having enriched pixel quality and sometimes even
more. Huang et al. [56] utilize GAN to make variants of photos of personal appearances. Subeesh et al. [57] provide a case of GAN to make high-quality photos,
concentrating on the road scene.
• Photo inpainting
The fundamental idea of this application is to fill the gaps of a picture. Numerous
deep-learning procedures have come to tackle this issue, and the significant task is to
fill the enormous gaps of a picture to make an ideal one. There are convolutional systems for picture inpainting however these are bad at filling the gaps with appropriate
highlights, and henceforth generative models are utilized for searching the relevant
highlights which are to be filled with, and these highlights are known through the
preparation process [21]. Pathak et al. [50] have projected another technique for picture inpainting called context encoders which depend on convolutional systems prepared mostly to produce pictures at a discretionary. So these systems need to
comprehend both full images and pictures with holes to recognize the highlights with
which need to supplant with. The method proposed by Pathak et al. depends on
encoder-decoder design. That framework is fit for taking pictures with input size
128 128 with gaps. The yield of that proposed framework is either the gap of the
picture or the whole picture. The gap of the picture size will be 64 64, and the full
picture is 128 128.
GANs can assist in recovering those areas in the picture that has some missing parts.
Deepak et al. [40] described the utilization of GAN, specifically the context encoder, to perform photo inpainting, that is, filling in a region of a photo that was removed for some reason. Raymond et al. [58] used GAN to fill in and repair intentionally
corrupted photos of the human face. Yijun et al. [59] likewise used GAN for inpainting and reconstructing damaged photos of human faces [60].
• Generation of realistic photographs
Andrew et al. [61] demonstrated the creation of synthetic photos with the BigGAN technique, which are for all practical purposes indistinguishable from authentic photographs.
• 3D object generation
3D objects can be created with GANs [55]. Jiajun et al. [62] showed a GAN for producing new three-dimensional objects such as cars, sofas, chairs, and tables. Matheus et al. [63] used GAN to produce 3D models given two-dimensional pictures of objects from various points of view [60].
• Face aging
The fundamental aim here is to create a picture of a person at some target age. For example, if the present age of an individual is 20 years, the GAN can be utilized to create a picture of that individual at 40 years. Face aging techniques change a facial picture to another age while still preserving identity [21]. A large portion of the GAN utilized
for face aging includes conditional GAN. The primary point is to produce a picture
with an objective mark age from a given initial face picture. This can be extremely
valuable for both the surveillance and entertainment businesses. It is especially helpful
for face verification since it implies that an organization does not have to change its
security frameworks as individuals get older. An Age-cGAN [64] system can create
pictures at various ages, which then could be utilized to prepare a reliable model
for face confirmation. Grigory et al. [64] utilized GAN to create photos of faces having
various apparent ages, from young to old. Zhifei et al. [65] utilized a GAN-based strategy for de-aging photos of different faces.
• Generate photos of the human face
Tero et al. [66] exhibited the creation of plausible, realistic photos of human faces. The outcome is striking because the faces look genuine, and as such the results received a lot of media attention. Face generation models are usually trained on celebrity examples, meaning that elements of existing celebrities appear in the produced faces, which makes them seem familiar, but not exactly so. Their techniques were likewise used to show the generation of objects and scenes. A few instances from this paper were used in a 2018 report to exhibit the rapid advancement of GANs from 2014 to 2017 [60].
• Generation of new human poses
Liqian et al. [67] gave an example of generating new photos of human figures with new poses.
• Face frontal view generation
Rui et al. [68] showed the utilization of GAN for creating front-view photos of human faces given photos taken at a particular angle. The idea is that the generated front-on photographs can be utilized as input to a face verification or face identification framework.
• Generation of cartoon character
Yanghua et al. [69] showed the preparation and usage of a GAN for creating anime
characters’ faces which are Japanese comic book characters. Motivated by the anime
models, many individuals have attempted to develop Pokemon characters, for example, the poke GANventure and produce the Pokemon with DCGAN task having
constrained achievement [22].
B. The Application with the Video
• Video synthesis
GANs can likewise be utilized to produce videos. They can create content in less time than it would take to create the content manually. They can also improve the efficiency of filmmakers and engage artists who want to build creative videos in their spare time. Carl et al. [47] describe the utilization of GAN for video prediction, specifically predicting up to a second of future video frames, mainly successfully for stationary elements of the scene.
• Video frame prediction
This means predicting future frames from the current frames [21]. Mathieu et al. [70] first used GAN training for video prediction, in which the generator produces the last frame of the video based on the prior sequence of frames, and the discriminator is utilized to judge that frame. All the frames apart from the last one are genuine pictures. Its advantage is that the discriminator can effectively utilize information along the time dimension and also helps make the produced frame consistent with all the past frames. Experimental outcomes show that the frames created by adversarial training are clearer than those of other algorithms.
C. Application of human-computer interaction
• Text-to-image synthesis
This is an early application of domain-transfer GANs. Generating pictures from text descriptions is an intriguing use case of GANs. It can be useful in the film business, as a GAN is capable of creating new content based on text that can be made up, and in the comic industry, it is conceivable to automatically create sequences of a story. Han et al. [71] exhibited the utilization of GAN, specifically the StackGAN, to create realistic-looking photos from textual descriptions of simple objects such as birds and flowers.
• Auxiliary automatic driving
Santana et al. [72] implemented assisted automatic driving with GAN. Initially, a picture is generated that is consistent with the distribution of real traffic scene pictures, and afterward a transition model based on a recurrent neural network is trained to predict the following traffic pictures.
3.6 Conclusion
Nowadays, GANs are one of the most fascinating ideas in computer engineering; many researchers work on them and propose various GAN-based models. Generative adversarial networks and their variants are among the most promising generative approaches in the discipline of computer vision. In this chapter, a comprehensive review of GANs and their variants is provided. It can be seen that the latest variants of GAN are unsupervised and more stable than previous models and can produce realistic content
and texture details, which will be an advantage to various applications such as superresolution, image inpainting, etc. They are also applicable in different areas such as image
classification, image-to-image translation, recovery of corrupted data, text-to-image
generation, and many more endless applications. Comparison is also done between
various GAN-based methods.
References
[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,
Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, Curran
Associates, 2014, pp. 2672–2680.
[2] A. Mosavi, S. Ardabili, A.R. Várkonyi-Kóczy, List of Deep Learning Models. (2019), https://doi.org/
10.20944/preprints201908.0152.v1 (Preprint).
[3] J. Brownlee, A Gentle Introduction to Generative Adversarial Networks (GANs), Retrieved from https://
machinelearningmastery.com/what-are-generative-adversarial-networks-gans/, 2019, July 19.
[4] C. Dong, C.C. Loy, K. He, X. Tang, Learning a deep convolutional network for image
super-resolution. in: Computer Vision—ECCV 2014, 2014, pp. 184–199, https://doi.org/
10.1007/978-3-319-10593-2_13.
[5] M. Mirza, S. Osindero, Conditional Generative Adversarial Nets, (2014)ArXiv: abs/1411.1784.
[6] Y. Hong, U. Hwang, J. Yoo, S. Yoon, How generative adversarial networks and their variants work.
ACM Comput. Surv. 52 (1) (2019) 1–43, https://doi.org/10.1145/3301282.
[7] K. Sohn, X.C. Yan, H. Lee, Learning structured output representation using deep conditional generative models, in: Proceedings of the 28th International Conference on Neural Information Processing
Systems—vol. 2 (NIPS’15), MIT Press, Cambridge, MA, USA, 2015, pp. 3483–3491.
[8] J. Le, The 10 Deep Learning Methods AI Practitioners Need to Apply, Retrieved from https://medium.com/cracking-the-data-science-interview/the-10-deep-learning-methods-ai-practitioners-need-to-apply-885259f402c1, 2020, May 10.
[9] Y. Hong, U. Hwang, J. Yoo, S. Yoon, How generative adversarial networks and their variants work: an
overview, ACM Comput. Surv. 52 (2019) 1–43.
[10] W. Sun, B. Zheng, W. Qian, Automatic feature learning using multichannel ROI based on deep structured algorithms for computerized lung cancer diagnosis. Comput. Biol. Med. 89 (2017) 530–539,
https://doi.org/10.1016/j.compbiomed.2017.04.006.
[11] T. Salimans, I.J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved Techniques
for Training GANs, (2016)ArXiv: abs/1606.03498.
[12] T. Che, Y. Li, A.P. Jacob, Y. Bengio, W. Li, Mode Regularized Generative Adversarial Networks,
(2016)ArXiv: abs/1612.02136.
[13] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein Gan, (2017)arXiv preprint arXiv: 1701.07875.
[14] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A.C. Courville, Improved Training of Wasserstein
GANs, (2017)ArXiv: abs/1704.00028.
[15] T. Karras, T. Aila, S. Laine, J. Lehtinen, Progressive Growing of GANs for Improved Quality, Stability,
and Variation, (2018)ArXiv: abs/1710.10196.
[16] D.J. Im, C.D. Kim, H. Jiang, R. Memisevic, Generating Images with Recurrent Adversarial Networks,
(2016)ArXiv: abs/1602.05110.
[17] E.L. Denton, S. Chintala, A. Szlam, R. Fergus, Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks, (2015)ArXiv: abs/1506.05751.
[18] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, … W. Shi, Photo-realistic
single image super-resolution using a generative adversarial network. in: 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), 2017, pp. 105–114, https://doi.org/10.1109/
cvpr.2017.19.
[19] J. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in: 2017 IEEE International Conference on Computer Vision (ICCV), 2017,
pp. 2242–2251.
[20] F. Jurie, A new log-polar mapping for space variant imaging: application to face detection and tracking,
Pattern Recogn. 32 (1999) 865–875.
[21] N. Yashwanth, P. Navya, M. Rukhiya, K.S. Prasad, K.S. Deepthi, Survey on generative adversarial
networks, Int. J. Adv. Res. Innov. Ideas Technol. 5 (2019) 239–244.
[22] J. Brownlee, 18 Impressive Applications of Generative Adversarial Networks (GANs), Retrieved from https://machinelearningmastery.com/impressive-applications-of-generative-adversarial-networks/, 2019, July 12.
[23] R. Vargas, A. Mosavi, R. Ruiz, Deep Learning: A Review, (2018), https://doi.org/10.20944/preprints201810.0218.v1 (Preprint).
[24] M. Biswas, V. Kuppili, L. Saba, D.R. Edla, H.S. Suri, E. Cuadrado-Godía, J.R. Laird, R.T. Marinhoe,
J.M. Sanches, A. Nicolaides, J.S. Suri, State-of-the-art review on deep learning in medical imaging,
Front. Biosci. 24 (2019) 392–426.
[25] L. Bote-Curiel, S. Muñoz-Romero, A. Gerrero-Curieses, J.L. Rojo-álvarez, Deep learning and big
data in healthcare: a double review for critical beginners, Appl. Sci. 9 (2019) 2331.
[26] Y. Feng, H.S. Teh, Y. Cai, Deep learning for chest radiology: a review, Curr. Radiol. Rep. 7 (2019).
[27] D. Griffiths, J. Boehm, A review on deep learning techniques for 3D sensed data classification, Remote
Sens. 11 (2019) 1499.
[28] A. Gupta, P.J. Harrison, H. Wieslander, N. Pielawski, K. Kartasalo, G. Partel, L. Solorzano, A. Suveer,
A.H. Klemm, O. Spjuth, I. Sintorn, C. Wählby, Deep learning in image cytometry: a review,
Cytometry 95 (2019) 366–380.
[29] V.K. Ha, J. Ren, X. Xu, S. Zhao, G. Xie, V.M. Vargas, Deep learning based single image
super-resolution: a survey, in: BICS, 2018.
[30] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,
A. Rabinovich, Going deeper with convolutions, in: 2015 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2015, pp. 1–9.
[31] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: NIPS, 2012.
[32] M.D. Zeiler, R. Fergus, Visualizing and understanding convolutional networks, in: ECCV, 2014.
[33] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[34] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition,
(2015)CoRR: abs/1409.1556.
[35] S. Min, B. Lee, S. Yoon, Deep learning in bioinformatics, Brief. Bioinform. 18 (2017) 851–869.
[36] A. Radford, L. Metz, S. Chintala, Unsupervised Representation Learning with Deep Convolutional
Generative Adversarial Networks, (2015)CoRR: abs/1511.06434.
[37] Q. Zhang, Y. Xiao, W. Dai, J. Suo, C. Wang, J. Shi, H. Zheng, Deep learning based classification of
breast tumors with shear-wave elastography. Ultrasonics 72 (2016) 150–157, https://doi.org/10.1016/
j.ultras.2016.08.004.
[38] D. Wulsin, J.R. Gupta, R. Mani, J.A. Blanco, B. Litt, Modeling electroencephalography waveforms
with semi-supervised deep belief nets: fast classification and anomaly measurement, J. Neural Eng. 8 (3)
(2011) 036015.
[39] Y.-J. Cao, L.-L. Jia, Y.-X. Chen, N. Lin, C. Yang, B. Zhang, et al., Recent advances of generative
adversarial networks in computer vision. IEEE Access 7 (2019) 14985–15006, https://doi.org/
10.1109/access.2018.2886814.
[40] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, A.A. Efros, Context encoders: feature learning by
inpainting. in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016,
pp. 1–12, https://doi.org/10.1109/cvpr.2016.278.
[41] B. Zhao, X. Wu, Z. Cheng, H. Liu, J. Feng, Multi-view image generation from a single-view,
in: Proceedings of the 26th ACM international conference on Multimedia, 2018.
[42] O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image segmentation. Lect. Notes Comput. Sci (2015) 234–241, https://doi.org/10.1007/978-3-319-24574-4_28.
[43] M. Saito, E. Matsumoto, S. Saito, Temporal generative adversarial nets with singular value clipping,
in: 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2849–2858.
[44] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein generative adversarial networks, in: Proceedings of
the 34th International Conference on Machine Learning, in PMLR, vol. 70, 2017, pp. 214–223.
[45] J. Kim, J.K. Lee, K.M. Lee, Deeply-recursive convolutional network for image super-resolution.
in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016,
pp. 1637–1645, https://doi.org/10.1109/cvpr.2016.181.
[46] K. Cheng, R. Tahir, L.K. Eric, M. Li, An analysis of generative adversarial networks and variants for
image synthesis on MNIST dataset. Multimed. Tools Appl. 79 (19–20) (2020) 13725–13752, https://
doi.org/10.1007/s11042-019-08600-2.
[47] C. Vondrick, H. Pirsiavash, A. Torralba, Generating Videos with Scene Dynamics, (2016)ArXiv: abs/
1609.02612.
[48] W. Shi, J. Caballero, F. Huszar, J. Totz, A.P. Aitken, R. Bishop, et al., Real-time single image and
video super-resolution using an efficient sub-pixel convolutional neural network. in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1874–1883, https://doi.
org/10.1109/cvpr.2016.207.
[49] L. Gonog, Y. Zhou, A review: generative adversarial networks, in: 2019 14th IEEE Conference on
Industrial Electronics and Applications (ICIEA), 2019, pp. 505–510.
[50] X. Wang, K. Yu, S. Wu, J. Gu, Y. Liu, C. Dong, C.C. Loy, Y. Qiao, X. Tang, ESRGAN: Enhanced
Super-Resolution Generative Adversarial Networks, (2018)ArXiv: abs/1809.00219.
[51] D. Berthelot, T. Schumm, L. Metz, BEGAN: Boundary Equilibrium Generative Adversarial Networks, (2017)arXiv preprint arXiv: 1703.10717, 2017.
[52] H. Bin, W. Chen, X. Wu, L. Chun-Liang, High-Quality Face Image SR Using Conditional Generative Adversarial Networks, (2017)ArXiv: abs/1707.00737.
[53] H. Kwak, B.-T. Zhang, Generating Images Part by Part with Composite Generative Adversarial Networks, (2016) ArXiv:1607.05387.
[54] P. Isola, J. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016,
pp. 5967–5976.
[55] F. Shaikh, Top 5 Interesting Applications of GANs for Every Machine Learning Enthusiast!, Retrieved from https://www.analyticsvidhya.com/blog/2019/04/top-5-interesting-applications-gans-deep-learning/, 2020, May 11.
[56] H. Bin, C. Wei-Hai, W. Xing-Ming, High-Quality Face Image Super-Resolution Using Conditional
Generative Adversarial Networks, (2018) ArXiv:1707.00737.
[57] S. Vasu, N.T. Madam, N.A. Rajagopalan, Analyzing Perception-Distortion Tradeoff using Enhanced
Perceptual Super-resolution Network, (2018) ArXiv:1811.00344.
[58] R.A. Yeh, C. Chen, T.Y. Lim, A.G. Schwing, M. Hasegawa-Johnson, M.N. Do, Semantic image
inpainting with deep generative models. in: 2017 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2017, pp. 1–19, https://doi.org/10.1109/cvpr.2017.728.
[59] Y. Li, S. Liu, J. Yang, M.-H. Yang, Generative face completion. in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1–9, https://doi.org/10.1109/
cvpr.2017.624.
[60] J. Hui, GAN—Some cool applications of GAN—Jonathan Hui, Retrieved from https://medium.com/@jonathan_hui/gan-some-cool-applications-of-gans-4c9ecca35900, 2020, March 10.
[61] A. Brock, J. Donahue, K. Simonyan, Large Scale GAN Training for High Fidelity Natural Image Synthesis, (2018)ArXiv: abs/1809.11096.
[62] J. Wu, C. Zhang, T. Xue, W.T. Freeman, J.B. Tenenbaum, Learning a probabilistic latent space of
object shapes via 3D generative-adversarial modeling, in: Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS’16), Curran Associates Inc, Red Hook, NY,
USA, 2016, pp. 82–90.
[63] M. Gadelha, S. Maji, R. Wang, 3D shape induction from 2D views of multiple objects. in: 2017 International Conference on 3D Vision (3DV), 2017, pp. 402–411, https://doi.org/
10.1109/3dv.2017.00053.
[64] G. Antipov, M. Baccouche, J. Dugelay, Face aging with conditional generative adversarial networks,
in: 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp. 2089–2093.
[65] Z. Zhang, Y. Song, H. Qi, Age progression/regression by conditional adversarial autoencoder, in: 2017
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4352–4360.
[66] T. Karras, T. Aila, S. Laine, J. Lehtinen, Progressive growing of GANs for improved quality, stability,
and variation, in: ICLR 2018, 2018. Retrieved from https://paperswithcode.com/paper/
progressive-growing-of-gans-for-improved.
[67] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, L.V. Gool, Pose guided person image generation,
in: NIPS, 2017.
[68] R. Huang, S. Zhang, T. Li, R. He, Beyond face rotation: global and local perception GAN for photorealistic and identity preserving frontal view synthesis, in: 2017 IEEE International Conference on
Computer Vision (ICCV), 2017, pp. 2458–2467.
[69] Y. Jin, J. Zhang, M. Li, Y. Tian, H. Zhu, Z. Fang, Towards the Automatic Anime Characters Creation
with Generative Adversarial Networks, (2017)ArXiv: abs/1708.05509.
[70] M. Mathieu, C. Couprie, Y. LeCun, Deep Multi-Scale Video Prediction Beyond Mean Square Error,
(2015)CoRR: abs/1511.05440.
[71] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, D. Metaxas, StackGAN: text to photo-realistic
image synthesis with stacked generative adversarial networks. in: 2017 IEEE International Conference
on Computer Vision (ICCV), 2017p. 1, https://doi.org/10.1109/iccv.2017.629.
[72] E. Santana, G. Hotz, Learning a Driving Simulator, (2016)ArXiv: abs/1608.01230.
CHAPTER 4
Comparative analysis of filtering
methods in fuzzy C-means: Environment
for DICOM image segmentation
D. Nagarajan (a), Kavikumar Jacob (b), Aida Mustapha (c), Udaya Mouni Boppana (c), and Najihah Chaini (b)
(a) Department of Mathematics, Hindustan Institute of Technology and Science, Chennai, India
(b) Faculty of Applied Sciences and Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia
(c) Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia
4.1 Introduction
Medical image analysis was initially carried out through the sequential application of low-level pixel processing and mathematical modeling to develop rule-based systems; during the same period, artificial intelligence was being developed with analogous systems. In the 1980s, magnetic resonance and computed tomography imaging systems were introduced that encode and decode the image output. Digital imaging and communications in medicine (DICOM) has improved the communication mechanism in the medical environment. In modalities such as CT, MR, X-ray, NM, RT, and US, DICOM is used for storing images, printing information about the patient's condition, and transmitting correct information about the radiological images. The standard comprises a file format and a protocol for communication networks and is used for exchanging images and patient data in DICOM format. DICOM has been widely adopted across medical environments, and derivations of the DICOM standard are used in other application areas; DICOM is also the basis of digital imaging and communication in nondestructive testing and in security. DICOM data consist of many attributes, including information such as the patient's name and ID as well as the image pixel data. A single DICOM object can have only one attribute containing pixel data, and the pixel data can be compressed using a variety of standards, including JPEG, JPEG Lossless, JPEG 2000, and run-length encoding.
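A brief illustration of reading such a DICOM object in Python with the pydicom library is given below; the file name and the presence of the individual attributes are assumptions for illustration, not details from this chapter.

    # Illustrative sketch: read a DICOM object and access a few of the attributes
    # described above (pydicom; "example.dcm" is a hypothetical file).
    import pydicom

    ds = pydicom.dcmread("example.dcm")
    print(ds.PatientName, ds.PatientID, ds.Modality)   # patient- and study-level attributes
    print(ds.file_meta.TransferSyntaxUID)              # how the (possibly compressed) pixel data are encoded
    pixels = ds.pixel_array                            # the single pixel-data attribute, decoded to an array
    print(pixels.shape, pixels.dtype)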
Image processing is a rapidly growing academic field with numerous techniques, especially for image segmentation and edge detection, which are important for diagnosing disease. Digital images have been used to obtain productive results and for data recovery. Spatial variations in MRI caused by the radio-frequency coil affect the tissue statistics [1]. Medical image segmentation is an essential task in clinical diagnosis. In general, most medical images consist of overlapping gray-scale intensities from various tissues, and medical image data are uncertain
due to noise, blur introduced during acquisition, and partial-volume effects from the sensor, all of which lower the quality of determination. These issues can be addressed by using fuzzy sets, which provide membership functions; hence, fuzzy clustering is a suitable method for the segmentation of medical images. Cluster analysis is a methodology for grouping a data set into groups of indistinguishable individuals, and image segmentation is the process of partitioning image pixels into similar regions; therefore, clustering algorithms are naturally suitable for image segmentation [2].
Ordinary clustering methods assign every point of the data set to a single cluster, whereas fuzzy clustering allows overlapping membership in two or more sets. Hence, fuzzy clustering has been widely used in different fields, including image segmentation. The fuzzy C-means algorithm has been widely applied in image processing, including medical image segmentation, to classify the major tissues in MRI of the human brain. Furthermore, this algorithm readily satisfies scale and shift invariance and incorporates multidimensional data [3]. Clustering data streams is an important task owing to the increasing amount of data collected over time. Dunn developed fuzzy C-means as a clustering methodology for image segmentation, and it was later improved by Bezdek [4, 5].
Due to noise and inhomogeneity, accurate segmentation of medical images is a difficult task. In the conventional approach, the color image is first transformed into a gray-scale image [6]. Training data are selected for each target class, and the image is clustered after applying filters to reduce noise; however, some clusters may contain more than one target class and must be partitioned again until no such clusters remain. Since medical images, including ultrasound, X-ray mammography, and MRI, are represented and stored digitally, the application of image processing methods has increased tremendously in recent years, and MRI in particular has been used in many studies. MRI is an influential tool for detecting unusual changes in various parts of the brain at an early stage and is well suited to acquiring brain images with a high contrast level. Its acquisition parameters can be modified to obtain different gray-scale levels for various tissues and types of neuropathology. Although segmentation of brain images is difficult, it is very important for detecting tumors, necrotic tissue, and edema in diagnostic systems. Many methods have been applied to this task, namely thresholding, statistical models, region growing, clustering, and active contour models. Because the intensity distribution in medical images is generally very complex, thresholding tends to fail, and region growing, which extends thresholding, requires seeds for all regions and faces the same difficulty in maintaining homogeneity [6].
The most popular clustering algorithms used for segmentation are expectation-maximization (EM) and fuzzy C-means. The EM method assumes an intensity distribution close to a normal distribution, which is not suitable for noisy images, whereas FCM considers only the intensity of the image and can be used directly for clustering. Clustering is an unsupervised learning method in which similar groups are formed; it is an objective-function-based method whose goal is to divide the observations into clusters that are as similar as possible. The FCM algorithm is an unsupervised fuzzy clustering algorithm in which a soft partition is obtained, with points that partially belong to multiple clusters. Such partitions need not be fuzzy partitions in general, but most algorithms generate soft, i.e., fuzzy, partitions. Soft clustering ensures that the membership degrees of each point across all clusters add up to one.
In earlier days, computer-aided detection of abnormal tissue growth was motivated by the need for the best possible accuracy. That process cannot be compared with recent technologies, which are digitalized and enable the volume and location of the unwanted tissue to be observed [7]. Since every object can hold membership in more than one cluster, fuzzy partitions are more flexible than crisp ones. FCM clustering can use a simple color feature with adequate information to cluster video frames efficiently. Clustering algorithms have been widely used in pattern recognition, data mining, computational biology, and computer vision. Clustering is an unsupervised learning method whose objective is to group elements into clusters with a high level of similarity, while elements in different clusters have a high degree of dissimilarity. Dissimilarity can be measured using distance, symmetry, curvature, or intensity, based on the information in the data set [8–10].
FCM clustering is an instrument for categorizing image blocks and provides stepwise, detailed searching. FCM is therefore a fuzzy classification model in which each data point belongs to a cluster to a degree identified by its membership value. Various modifications of FCM clustering have been applied to crisp numbers, and only very few have been extended to noncrisp numbers, since that requires complicated equations and tedious calculations. Developing algorithms that can deal with the uncertainties in a data set is an important task. Type-1 membership functions, otherwise generated from human experts and their perception, can be generated automatically by FCM, self-organizing feature maps, and robust agglomerative mixture decomposition methods. Image segmentation using type-1 fuzzy sets may give unsatisfactory results, and applying type-2 fuzzy sets can solve this issue with more desirable results. Since the secondary grades of these type-2 fuzzy sets are equal to one, such sets can handle the uncertainty of the data more efficiently than conventional methods, and using interval type-2 fuzzy sets reduces the computational complexity. The clustering procedure for data in a fuzzy environment is called fuzzy C numbers; these numbers may be normal, triangular, or trapezoidal fuzzy numbers [11–14]. Modeling membership functions based on similarity decomposition and cluster centroids is the most important task in fuzzy clustering. In fuzzy cluster analysis, the membership matrix represents the relationships between the data and gives a more comprehensive view of those relationships, raising the expressiveness of the cluster analysis. In conventional methods, when a data point is equally distant from several representatives, it is assigned to a single cluster [15].
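As a small illustration of the membership matrix described above (not taken from the chapter), the following NumPy snippet builds a soft partition in which each row sums to one and then collapses it to a crisp assignment.

    # A fuzzy membership matrix U: rows are data points, columns are clusters.
    import numpy as np

    U = np.array([
        [0.90, 0.10],   # point clearly belonging to cluster 1
        [0.50, 0.50],   # point equally distant from both representatives
        [0.15, 0.85],   # point mostly belonging to cluster 2
    ])
    assert np.allclose(U.sum(axis=1), 1.0)   # memberships of each point add up to one
    hard_labels = U.argmax(axis=1)           # collapsing the soft partition to a crisp one
    print(hard_labels)                       # -> [0 0 1]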
4.1.1 Organization of chapter
The remainder of this chapter is organized as follows. Section 4.2 reviews the literature relevant to the aim and scope of this work. Section 4.3 presents the basic concepts needed for a better understanding of the work. Section 4.4 proposes image segmentation of the DICOM image using the FCM clustering algorithm. Section 4.5 gives the results and discussion, and Section 4.6 concludes the work and outlines future directions.
4.2 Related works
Ahmed et al. [1] introduced a new algorithm for fuzzy segmentation of MRI data that estimates intensity inhomogeneities. They compensated for the inhomogeneities with a modified fuzzy C-means algorithm in which the labeling of a pixel is influenced by its immediate neighborhood, and they illustrated the efficiency of the modified algorithm on synthetic images and MRI data. Yang et al. [2] proposed an alternative FCM algorithm for MRI image segmentation to distinguish abnormal from normal tissues in ophthalmology and concluded that it outperforms the existing fuzzy C-means algorithm when detecting abnormal tissues, depending on the window selection. An extended version of the FCM clustering algorithm was introduced in Ref. [16] to overcome its sensitivity to noise. Roy et al. [3] studied intensity shading, variable cluster size, and the smoothness of the membership functions of the FCM algorithm in detail and introduced a new parameter, called compactness, to obtain additional information about the clusters. With that parameter they proposed a fuzzy C-means algorithm with variable compactness, which they used to analyze the major tissues in brain MRIs. Hore et al. [4] presented an online fuzzy clustering algorithm to partition large data sets, which may be treated as streaming data, and concluded that their algorithm produces partitions of large volumes of MRI comparable to clustering all the data at once. An automatic method based on the FCM clustering technique has been proposed to identify exudates in low-contrast digital images of retinopathy patients with nondilated pupils [5]. Balafar [6] introduced a new FCM clustering method that converts the color image to a gray-level image using user-selected training data and reduces noise with an anisotropic filter. Suri and Sardana [17] predicted the gold price using FCM clustering together with a known fuzzy membership function, weighted least squares, and the Takagi-Sugeno model. In 2011, Christ and Parvathi [7] proposed a new technique for the segmentation of medical images using the Silhouette method, spatial FCM,
and hidden Markov random field-based FCM algorithms. Havens et al. [8] analyzed large databases using three new incremental kernelized FCM algorithms, namely rse-kernelized FCM, sp-kernelized FCM, and o-kernelized FCM; they compared the performance of the three algorithms and recommended rse-kernelized FCM for computationally demanding problems. Asadi and Charkari [9] performed video summarization using FCM clustering with a new keyframe extraction system in which frames are chosen on the basis of maximum membership grade, producing static video summaries with high accuracy and a low error rate. Pimentel and Souza [10] introduced a novel multivariate approach in which the memberships are computed from the information in every feature of the data. Biswas et al. [11] achieved fast fractal image compression using FCM clustering, taking the pixel patterns along the column direction of an image block as the classification features and applying a two-level classification method for stepwise, precise classification.
Recently, Mulyana [12] identified medicinal plants using FCM clustering based on fractal features, namely the fractal dimension and fractal code, extracted from images of 20 varieties of medicinal plants with 30 samples each; the experimental results were 85.04% and 79.94% for fuzzy clustering based on the fractal dimension and the fractal code, respectively. Moreno and Lopez [13] described the development of a trajectory planning system that uses fuzzy algorithms and machine vision methods to control the movement of a tele-commanded mobile robot. Hadi et al. [14] proposed a vector form of FCM that simplifies the application of FCM clustering to fuzzy numbers. Warunsin and Chitsobhuk [18] evaluated the performance of a cyclone identification system using histograms together with support vector machine classification and FCM clustering. Fredo et al. [19] segmented subcortical brain regions such as the corpus callosum (CC) and brain stem (BS) using FCM clustering and recommended that the resulting skeleton can be used to diagnose neural disorders such as autism automatically. Doganay et al. [20] developed a fully automatic algorithm for lung tissue segmentation in which a fast FCM clustering algorithm segments the lung region in two-dimensional high-resolution computed tomography images. Liu et al. [21] presented a variant of the fuzzy local information C-means clustering algorithm that incorporates region-level spatial, spectral, and structural information together with a region-level Markov random field model to segment color texture images accurately. Recently, Vani and Anusuya [22] implemented a Kannada word recognizer using FCM and vector quantization. To improve the efficiency and speed of FCM, Stetco et al. [15] proposed the fuzzy C-means++ algorithm and evaluated it on both artificially generated and real-world data sets. Velmurugan and Naveen [23] examined the use of clustering and preprocessing methods to forecast disease in MRI brain images in the medical field. Mohammed et al. [24] introduced an improved FCM algorithm that takes less time to find
clusters and applied it to image segmentation. Kaur and Tulsi [25] proposed an FCM method that obtains impressive results for images with complex backgrounds, overcoming the failure to compute a threshold value when there is no significant change in the gray level of the pixels. Heriana [26] performed edge detection on an image using FCM and an objective function based on the distribution of the mean and standard deviation of each of the four magnitude-direction values of a pixel, calculated from that objective function. Rai [27] introduced the idea of soft metaphor detection by allowing membership values in fuzzy sets that represent varying degrees of metaphoricity. Jebari et al. [28] proposed an automatic genetic FCM algorithm that uses newly defined genetic operators, including a new mutation operator, a crossover operator, and tournament selection, to determine the number of clusters and to provide the initial centroids. Sivasaravanababu et al. [29] converted the captured RGB image into a gray-scale image and enhanced it using image enhancement techniques. Zhang et al. [30] incorporated diversity into the traditional FCM objective using a new diversity regularization, and the resulting objective was addressed by an optimization algorithm that converges to locally optimal solutions with adequate time complexity. Edge detection on DICOM images [31] and image extraction from MRI DICOM images [32] were studied using MATLAB programs in a type-2 fuzzy setting to convert the DICOM image into a 2D gray-scale image. Jinlin [33] introduced a new FCM clustering algorithm based on multiobjective optimization together with a fuzzy distance measure used to adjust the weights of the local pixel information, improving performance and computation time when segmenting images corrupted by different types of noise. Santiago [34] performed mass abnormality segmentation and categorization with a modified FCM using a histogram and a binary decision tree, highlighting the importance of preprocessing and fuzzy methods for the segmentation and classification of mammographic images.
Torra [35] studied and analyzed the effect of the parameter m, which corresponds to the degree of fuzziness of the solution obtained from the unsupervised FCM algorithm. Umoren et al. [36] refined an isolated diagnostic system using the FCM algorithm and showed that ophthalmic pathological results obtained with FCM are clustered faster and more reliably. Srivastava et al. [37] analyzed images using the FCM algorithm by carrying out an apportionment procedure in which the image is considered as an object and subdivided into classes of images in order to overcome noise sensitivity. Tolentino et al. [38] proposed a new distance measurement technique that incorporates trigonometric functions and the Manhattan distance to rectify the speed and accuracy issues of FCM. Vernanda et al. [39] focused on data about students continuing to college and introduced school clustering using FCM. Borthakur et al. [40] identified suitable metrics from heart rate variability analysis for sonification and investigated the use of an auditory display to aid the analysis of heart rate variability, leveraged by unsupervised machine learning techniques. Katircioglu et al. [41] determined the air permeability of denim fabric using the FCM algorithm; the fabric samples are analyzed with a microscope to count the bright pixel areas, and the images are improved by image processing. Gan [42] proposed safe semisupervised FCM clustering and introduced MinMax FCM to overcome issues such as wrongly labeled samples, which are handled carefully by constraining the corresponding predictions to be those yielded by unsupervised clustering.
However, image segmentation on the DICOM image of a patient's MRI has not yet been studied in the literature, so it remains open to many possibilities for innovative research, especially in the context of FCM clustering. Hence, in this chapter, we study and analyze the performance of a fuzzy C-means clustering (FCMC) algorithm combined with different image filtering methods on a digital imaging and communications in medicine (DICOM) data set. The significance of this study is a lower false positive rate together with a high detection rate. For this purpose, the DICOM color images are first converted to gray scale, and various filters are applied to reduce the noise error.
4.3 Methodology
4.3.1 Proposed algorithm
In this section, edge detection is done on the DICOM image of a magnetic resonance
imaging (MRI) patient using the fuzzy C-means clustering (FCMC) algorithm.
Algorithm 4.1: Fuzzy C-means clustering algorithm
1. Convert the CT scan files to DICOM and flip them through mri = flipdim(mri,1);
2. Import the background image and show it on the axes through bg = imread('background.png');
3. Prevent plotting over the background and turn the axis off, making sure the background is behind all the other uicontrols;
4. Convert RGB to the green-channel complement through GIm = imcomplement(green);
5. Apply contrast-limited adaptive histogram equalization;
6. Create the structuring element through se = strel('ball',8,8); and apply a morphological open through gopen = imopen(HIm,se);
7. Remove the optic disk using godisk = HIm - gopen; and apply a 2D median filter through medfilt = imguidedfilter(godisk,'DegreeOfSmoothing',1);
8. Segment using fuzzy C-means through ffcm1 = (['The 1st Cluster = ' num2str(ccc1)]); and ffcm2 = (['The 2nd Cluster = ' num2str(ccc2)]);
9. Detect the edges of the segmented image through SegmentedImage = get(LTproject.segmented_image,'Userdata').
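For readers not working in MATLAB, the following is a rough Python analogue of the preprocessing steps in Algorithm 4.1; the library choices (pydicom, scikit-image, SciPy), the file name, and the parameter values are our assumptions rather than the authors' implementation.

    import numpy as np
    import pydicom
    from skimage import exposure, morphology
    from scipy.ndimage import median_filter

    ds = pydicom.dcmread("slice.dcm")                      # hypothetical DICOM slice
    img = exposure.rescale_intensity(ds.pixel_array.astype(float), out_range=(0.0, 1.0))
    img = np.flipud(img)                                   # step 1: flipdim(mri, 1)

    # Step 4 applies only to RGB data; gray-scale CT passes through unchanged.
    if img.ndim == 3:
        img = 1.0 - img[..., 1]                            # complement of the green channel

    img = exposure.equalize_adapthist(img)                 # step 5: CLAHE
    opened = morphology.opening(img, morphology.disk(8))   # step 6: morphological open
    detail = img - opened                                  # step 7: optic-disk removal residual
    smoothed = median_filter(detail, size=3)               # step 7: 2D median filtering
    # Steps 8-9: the smoothed intensities are clustered with fuzzy C-means
    # (Eqs. 4.1-4.2) and the cluster map is passed to an edge detector.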
4.3.2 Evaluation metrics
The fuzzy C-means clustering [3] is the solution of an energy function, defined mathematically as

J_{FCM} = \sum_{i \in \mho} \sum_{j=1}^{C} u_{ij}^{p} \, \lVert y_i - v_j \rVert^{2}   (4.1)

where y_i is the intensity of the observed image at the ith pixel, C is the number of classes, v_j is the centroid of the jth class, \mho is the domain of the image, and u_{ij} is the (nonnegative) membership function of the ith pixel for the jth class, with \sum_{j=1}^{C} u_{ij} = 1 for all i \in \mho. The parameter p is the weighting exponent, where p > 1. If p = 1, then FCM becomes the hard K-means algorithm with binary values as the membership functions. The membership function and the center of each cluster are defined by

u_{ij} = \left[ \sum_{k=1}^{C} \left( \frac{d(y_i, v_j)}{d(y_i, v_k)} \right)^{2/(p-1)} \right]^{-1}   (4.2)

v_j = \frac{\sum_{i \in \mho} u_{ij}^{f} \, y_i}{\sum_{i \in \mho} u_{ij}^{f}}

where d(\cdot,\cdot) denotes the distance between a pixel intensity and a class centroid and f is the degrees of freedom.

The accuracy is given by

\mathrm{Accuracy} = \frac{N_{\mathrm{True\,Positive}} + N_{\mathrm{True\,Negative}}}{N_{\mathrm{True\,Positive}} + N_{\mathrm{True\,Negative}} + N_{\mathrm{False\,Positive}} + N_{\mathrm{False\,Negative}}}   (4.3)

Precision is expressed as

\mathrm{Precision} = \frac{N_{\mathrm{True\,Positive}}}{N_{\mathrm{True\,Positive}} + N_{\mathrm{False\,Positive}}}   (4.4)

Eq. (4.5) gives the harmonic mean between precision and sensitivity:

\mathrm{Harmonic\ mean} = \frac{2\, N_{\mathrm{True\,Positive}}}{2\, N_{\mathrm{True\,Positive}} + N_{\mathrm{False\,Positive}} + N_{\mathrm{False\,Negative}}}   (4.5)
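The updates in Eqs. (4.1) and (4.2) and the metrics in Eqs. (4.3)-(4.5) can be sketched in a few lines of NumPy. The function names, the stopping rule, and the use of absolute intensity differences as the distance d are illustrative assumptions, not the authors' code.

    import numpy as np

    def fcm(y, C=2, p=2.0, iters=100, tol=1e-5, seed=0):
        """Cluster the 1-D intensities y into C fuzzy classes (weighting exponent p)."""
        rng = np.random.default_rng(seed)
        u = rng.random((y.size, C))
        u /= u.sum(axis=1, keepdims=True)                        # sum_j u_ij = 1 for every pixel
        for _ in range(iters):
            up = u ** p
            v = (up * y[:, None]).sum(axis=0) / up.sum(axis=0)   # class centroids v_j
            d = np.abs(y[:, None] - v[None, :]) + 1e-12           # distances d(y_i, v_j)
            ratio = d[:, :, None] / d[:, None, :]                 # d_ij / d_ik
            u_new = 1.0 / (ratio ** (2.0 / (p - 1.0))).sum(axis=2)    # Eq. (4.2)
            if np.abs(u_new - u).max() < tol:
                u = u_new
                break
            u = u_new
        return u, v

    def accuracy(tp, tn, fp, fn):          # Eq. (4.3)
        return (tp + tn) / (tp + tn + fp + fn)

    def precision(tp, fp):                 # Eq. (4.4)
        return tp / (tp + fp)

    def harmonic_mean(tp, fp, fn):         # Eq. (4.5): harmonic mean of precision and sensitivity
        return 2 * tp / (2 * tp + fp + fn)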
4.3.3 Morphological operations
A median filter can remove noise from an image effectively. It is a classical preprocessing step for improving the results of later processing such as edge detection. Under some conditions, this filter preserves edges during noise reduction. Hence, it has been widely used in digital image processing [24].
4.3.3.1 2D median filter
The 2D median filter is a nonlinear digital filtering technique used to remove noise from an image. Removing noise is a preprocessing step for subsequent processing such as edge detection. Because this filter preserves edges while reducing noise, it is widely used in both image processing and signal processing.
4.3.3.2 Imguided filter
The imguidedfilter function performs edge-preserving smoothing on an image using a guidance image, i.e., the content of a second image. The guidance image can be a different version of the same image or an entirely different image. Guided image filtering is a neighborhood operation: when computing the value of an output pixel, it takes into account the statistics of the corresponding spatial neighborhood in the guidance image. If the guidance image is the same as the image to be filtered, the structures of the two are identical; if they are different, structures in the guidance image will influence the filtered image.
4.3.3.3 Imfilter
The imfilter function computes the value of each output pixel using double-precision floating-point arithmetic. Using imfilter, images can be filtered with either convolution or correlation. It handles data types using the rules of arithmetic saturation, and the output image has the same data type as the input image. If a result exceeds the range of the data type, the filter truncates it to the allowed range, and if the data type is integer, fractional values are rounded. Because of this truncation behavior, the image may need to be converted to another data type before calling imfilter. If the input image is of class double, the output can contain negative values.
4.3.3.4 Wiener 2 filtering
Wiener 2 filtering is two-dimensional adaptive noise-removal filtering with a linear filter. The filter adapts itself to the local image variance: where the variance is large, it performs little smoothing, and where the variance is small, it performs more smoothing. This adaptive filter is widely used because it preserves edges and other high-frequency parts of the image. The function also handles the preliminary computations and applies the filter to the input image.
4.3.3.5 Gaussian filter
The Gaussian filter is a linear filter. It is used to reduce noise, but on its own it blurs edges and lowers contrast. It is faster than the other filters, and imadjust is not necessary with it.
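The following short sketch applies SciPy counterparts of the filters described in Sections 4.3.3.1-4.3.3.5 (the guided filter has no SciPy equivalent; OpenCV's contrib module provides one). The kernel sizes are illustrative assumptions, not the values used in this chapter.

    import numpy as np
    from scipy.ndimage import median_filter, gaussian_filter
    from scipy.signal import wiener

    def filter_variants(img: np.ndarray) -> dict:
        """Return the image denoised by several of the filters under comparison."""
        img = img.astype(float)
        return {
            "median_2d": median_filter(img, size=3),      # nonlinear, edge-preserving denoising
            "wiener2": wiener(img, mysize=3),             # adaptive linear noise removal
            "gaussian": gaussian_filter(img, sigma=1.0),  # linear smoothing; blurs edges
        }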
4.3.4 Research design
Fig. 4.1 depicts the methodology used for image segmentation on DICOM using FCM.
Based on Fig. 4.1, DICOM image segmentation begins with the DICOM
image read.
4.4 Experimental analysis
The proposed methods are implemented in the MATLAB 2015a environment. The DICOM files are shown in the montage in Fig. 4.2. We choose the slice with the best view from Fig. 4.2 and display it in full in Fig. 4.3. The data set used in this chapter is sourced from the digital imaging and communications in medicine (DICOM) database of brain images. The color type of the image is gray scale and the modality is computed tomography; the study description is facial bone from a 50-year-old female, and the slice thickness is 4. Fig. 4.3 shows an excerpt of the DICOM data set. Medical imaging benefits from convolution models for image segmentation, but medical image segmentation data sets are limited and little annotated data is available for training. While surgery is one of the treatments for brain tumors, radiation and chemotherapy may be used to slow the growth of tumors that cannot be physically removed. Magnetic resonance imaging (MRI) furnishes detailed images of the brain and is also a common test used to diagnose brain tumors.
Fig. 4.1 DICOM image segmentation using FCM. (Flowchart blocks: Input; DICOM Image Read; Flipdim; Montage; Create Axes; Information about DICOM Image; Green Channel; Contrast; Thresholding; Region Props and Centroid; Segmentation using FCM; Edge Detection; Output.)
Fig. 4.2 Montage of the DICOM File.
Fig. 4.3 Image with the best view.
Moreover, brain tumor segmentation from MR images can have a great impact on improved diagnostics, growth rate prediction, and treatment planning. Some tumors can be segmented easily, while others are difficult to locate and diagnose: they are often diffuse, poorly contrasted, and extend tentacle-like structures that make them hard to segment. Another principal difficulty in segmenting brain tumors is that they can appear with any shape and size and anywhere in the brain. The brain is typically made up of three types of tissue: white matter, gray matter, and cerebrospinal fluid. Brain tumor segmentation aims to detect the location and extension of the tumor regions by identifying abnormal areas relative to normal tissue. The borders of abnormal tissue are often fuzzy and hard to distinguish from healthy tissue.
4.5 Performance analysis
Here, the performance of the entire process of image segmentation on the DICOM image using FCM is described with reference to Fig. 4.4. Information on the physical object is stored as 3D presentation states (3DPR), which are designated for storing all parameters and relevant information of a 3D visualization. The main purpose of 3DPR is to allow the storage and distribution of the presentation of an image: what 2D presentation states (2DPR) provide for 2D images can be applied to volume data via 3DPR. Thus, the aim of the experiment is to develop a systematic and DICOM-conformant parameterization of 3D visualization. This corresponds to parameterizing all procedures of 3D medical visualization and storing all necessary parameters and data in a 3DPR object; the 3DPR object can then be used to rerun all the procedures automatically and regenerate the 3D visualization. The procedures to be parameterized are preprocessing, segmentation, and postprocessing. Instead of storing the segmentation parameters, the segmented voxel data can be stored using lossless compression, and various compression methods are applied to diverse test cases.
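A minimal illustration of storing segmented voxel data losslessly, as described above, is given below; it is our assumption of how such storage could look, not the experiment's actual implementation.

    import zlib
    import numpy as np

    labels = np.zeros((64, 64, 64), dtype=np.uint8)   # hypothetical segmented voxel volume
    labels[20:40, 20:40, 20:40] = 1                   # a labeled region

    blob = zlib.compress(labels.tobytes(), level=9)   # lossless compression
    restored = np.frombuffer(zlib.decompress(blob), dtype=np.uint8).reshape(labels.shape)
    assert np.array_equal(labels, restored)           # exact, lossless recovery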
Clear visibility of the image is obtained using a green-channel image with high contrast, as shown in Fig. 4.4. In the denoising process, the main disadvantage of the existing methods is over-amplification in relatively homogeneous regions of the image; to overcome this, we used contrast-limited adaptive histogram equalization, as shown in Fig. 4.4. Noise can be removed effectively using the median filter, a classical preprocessing step that improves the results of later processing such as edge detection; under some conditions this filter preserves edges during noise reduction, and it is therefore widely used in digital image processing, as shown in Fig. 4.4. Next, using the morphological open and optic-disk removal steps, the image is processed with a 2D median filter, and background removal and image adjustment are performed, as shown in Fig. 4.4. Finally, edge detection is applied to the segmented image to detect the edges.
Fig. 4.4 Different filter segmentation.
4.6 Results and discussion
In the proposed system, the 2D median filter is found to be the best filter for extracting the image from the DICOM information. The classification output of the experiment reveals that the accuracy of the image extraction is 97%, with 5% sensitivity, 99% specificity, 12% PPV, and a 7% harmonic mean of precision and sensitivity. The classification outputs are shown in Table 4.1 and Fig. 4.5.
The classification output of the experiment reveals that the accuracy of the image
extraction through 2D median filter is 97%, Imguided is 94%, Imfilter is 96%, wiener2
is 96%, and Medfilters is 96%. In the proposed system, 2D median filter is one of the best
filters to extract the image from DICOM information for accuracy.
The classification output of the experiment reveals that, in Fig. 4.6, the sensitivity of
the image extraction through a 2D median filter is 4%, Imguided is 23%, Imfilter is 4%,
Table 4.1 Classification outputs.

Filters      Accuracy   Sensitivity   Specificity   FPR      PPV      Harmonic mean
2D median    0.9771     0.1418        0.9858        0.0192   0.1363   0.1734
Imguided     0.9431     0.2314        0.9568        0.0432   0.0933   0.1330
Imfilter     0.9662     0.0468        0.9848        0.0022   0.1174   0.0649
Wiener 2     0.9684     0.0996        0.9783        0.0117   0.1168   0.1236
medfilter    0.9618     0.1354        0.9794        0.0106   0.1125   0.1654
Fig. 4.5 Accuracy comparison of all filters.
Fig. 4.6 Sensitivity comparison for all filters.
wiener2 is 9%, and medfilter is 13%. In the proposed system, the 2D median filter and Imfilter have the same sensitivity percentage.
Fig. 4.7 shows that the specificity of the image extraction through the 2D median filter is 99%, Imguided is 96%, Imfilter is 98%, wiener2 is 98%, and medfilter is 98% (Table 4.1); the 2D median filter and Imfilter give nearly the same specificity.
Fig. 4.8 reveals that the false positive rate (FPR) of the image extraction through the 2D median filter is 0%, Imguided is 4%, Imfilter is 0%, wiener2 is 0%, and medfilter is 0%.
Fig. 4.7 Specificity comparison of all filters.
Fig. 4.8 FPR comparison for all filters.
Fig. 4.9 PPV comparison for all filters.
Fig. 4.9 shows the classification output of the experiment, which reveals that the PPV of the image extraction through the 2D median filter is 12%, Imguided is 9%, Imfilter is 27%, wiener2 is 16%, and medfilter is 21%.
Fig. 4.10 shows that the harmonic mean of the image extraction through the 2D median filter is 7%, Imguided is 13%, Imfilter is 6%, wiener2 is 12%, and medfilter is 16%.
The classification output of the experiment reveals that the accuracy of the image extraction is 97%, with 5% sensitivity, 99% specificity, 12% PPV, and a 7% harmonic mean
Fig. 4.10 Harmonic mean comparison for all filters.
of precision and sensitivity. In the proposed system, 2D median filter is one of the best
filters to extract the image from DICOM information.
4.7 Conclusion
Many segmentation attempts fail because of noise, uneven content, low contrast, and inhomogeneity in the image to be segmented; for these reasons, the methods described above are required to reduce error. The procedure of separating a digital image into numerous segments is called image segmentation. It aims to turn the representation of the image into something more meaningful and easier to determine or analyze. Using this method, one can locate the objects, curves, and lines in an image; each pixel is labeled so that pixels with the same label share certain characteristics. In this way, image segmentation is very useful in digital image processing. In this chapter, image segmentation has been performed on the DICOM image of a patient's MRI, and it has been observed that very little memory space is needed to save the file. In future work, the process may be extended to neutrosophic and plithogenic environments.
Acknowledgment
This research is supported by Universiti Tun Hussein Onn Malaysia, Malaysia under GPPS Vote No: H346.
References
[1] M.N. Ahmed, S.M. Yamany, N. Mohamed, A.A. Farag, T. Moriarty, A modified fuzzy C-means algorithm for bias field estimation and segmentation of MRI data, IEEE Trans. Med. Imaging 21 (3) (2002)
193–199.
[2] M.S. Yang, Y.J. Hu, K.C.R. Lin, C.C.L. Lin, Segmentation techniques for tissue differentiation in
MRI of ophthalmology using fuzzy clustering algorithms, Magn. Reson. Imaging 20 (2002) 173–179.
[3] S. Roy, H. Agarwal, A. Carass, Y. Bai, D.L. Pham, J.L. Prince, Fuzzy c-means with variable compactness, in: IEEE International Symposium on Biomedical Imaging: From Nano to Macro, 2008, pp.
452–2455, https://doi.org/10.1109/isbi.2008.4541030.
[4] P. Hore, P.O. Hall, D.B. Goldgof, W. Cheng, Online fuzzy c means, in: NAFIPS 2008-2008 Annual
Meeting of the North American Fuzzy Information Processing Society, 2008, https://doi.org/
10.1109/nafips.2008.4531233.
[5] A. Sopharak, B. Uyyanonvara, S. Barman, Automatic exudate detection from non-dilated diabetic retinopathy retinal images using fuzzy C-means clustering, Sensors 9 (2009) 2148–2161.
[6] M.A. Balafar, A.B.D.R. Ramli, M.I. Saripan, S. Mashohor, Medical image segmentation using fuzzy
C-mean (FCM) and user specified data, J. Circuits Syst. Comput. 19 (1) (2010) 1–14.
[7] M.C.J. Christ, R.M.S. Parvathi, Fuzzy c-means algorithm for medical image segmentation, in: 2011
3rd International Conference on Electronics Computer Technology, 2011, pp. 33–36, https://doi.
org/10.1109/icectech.2011.5941851.
[8] T.C. Havens, J.C. Bezdek, M. Palaniswami, Incremental kernel fuzzy c-means, in: Computational
Intelligence, Springer, 2012, pp. 3–18.
[9] E. Asadi, N.M. Charkari, Video summarization using fuzzy C-means clustering, in: 20th Iranian Conference on Electrical Engineering (ICEE2012), 2012, pp. 690–694, https://doi.org/10.1109/
IranianCEE.2012.6292442.
[10] B.A. Pimentel, R.M.C.R.D. Souza, A multivariate fuzzy c-means method, Appl. Soft Comput.
13 (2013) 1592–1607.
[11] A.K. Biswas, S. Karmakar, S. Sharma, M.K. Kowar, Fast fractal image compression by pixels pattern
using fuzzy c-means, J. Eng. Res. 1 (3) (2013) 109–121.
[12] I. Mulyana, Y. Herdiyeni, S.H. Wijaya, Identification of medical plant based on fractal by using clustering fuzzy C-means, in: The Second International Conference on Information Technology and Business Application (ICIBA2013), 2013, ISBN: 978-979-3877-16-7.
[13] R.J. Moreno, J.D. Lopez, Trajectory planning for a robotic mobile using fuzzy c-means and machine
vision, in: Symposium of Signals, Images and Artificial Vision (STSIVA2013), 2013, https://doi.org/
10.1109/stsiva.2013.6644912.
[14] M. Hadi, K. Morteza, S.Y. Hadi, Vector fuzzy C-means, J. Intell. Fuzzy Syst. 24 (2013) 363–381.
[15] A. Stetco, X.J. Zeng, J. Keane, Fuzzy C-means ++: fuzzy C-means with effective seeding initialization,
Expert Syst. Appl. 42 (21) (2015) 7541–7548.
[16] Y. Yang, Image segmentation by fuzzy C-means clustering algorithm with a novel penalty term, Comput. Inform. 26 (2007) 17–31.
[17] P.R. Suri, N. Sardana, Forecasting gold prices using fuzzy C means, J. Comput. 3 (3) (2011) 99–106.
[18] K. Warunsin, O. Chitsobhuk, Cyclone identification using fuzzy C mean clustering, in: 13th International Symposium on Communications and Information Technologies (ISCIT), 2013, pp. 369–373,
https://doi.org/10.1109/ISCIT.2013.6645884.
[19] A.R.J. Fredo, G. Kavitha, S. Ramakrishnan, Analysis of sub-cortical regions in cognitive
processing using fuzzy c-means clustering and geometrical measure in autistic MR images, in: 2014
40th Annual Northeast Bioengineering Conference (NEBEC), 2014, https://doi.org/10.1109/
NEBEC.2014.6972791.
[20] E. Doganay, S. Kara, H.K. Ozcelik, Automatic segmentation of the lungs from HRCT scans by using
fuzzy C-means, in: International Symposium on Sustainable Development (ISSD 2014), 2014, p. 77.
[21] G. Liu, P. Li, Y. Zhang, Color texture image segmentation method based on fuzzy c-means clustering
and region-level Markov random field model, Math. Probl. Eng. 2014 (2015) 1–9.
[22] H.Y. Vani, M.A. Anusuya, Isolated speech recognition using fuzzy C means technique, in: 2015 International Conference on Emerging Research in Electronics, Computer Science and Technology, 2015,
pp. 353–357, https://doi.org/10.1109/ERECT.2015.7499040.
[23] T. Velmurugan, A. Naveen, Analysing MRI brain images using fuzzy C-means algorithm, Int. J. Control Theory Appl. 9 (10) (2016) 4661–4675.
[24] H.R. Mohammed, H.H. Alnoamani, A.A. Jalil, Improved fuzzy C-mean algorithm for image segmentation, Int. J. Adv. Res. Artif. Intell. 5 (6) (2016) 7–10.
[25] B. Kaur, K.P. Tulsi, Improving the color image segmentation using fuzzy-C-means, in: 2016 International Conference on Advanced Communication Control and Computing Technologies
(ICACCCT), 2016, pp. 789–794, https://doi.org/10.1109/ICACCCT.2016.7831747.
[26] O. Heriana, A.N. Rahman, M.T. Miftahushudur, Image edge detection using objective function and fuzzy C means, in: 2017 International Conference on Radar, Antenna, Microwave, Electronics, and Telecommunications (ICRAMET), 2017, pp. 149–153, https://doi.org/10.1109/icramet.2017.8253165.
[27] S. Rai, S. Chakraverty, D.K. Tayal, Y. Kukreti, Soft metaphor detection using fuzzy c-means, Lect.
Notes Comput. Sci. (2017) 402–411, https://doi.org/10.1007/978-3-319-71928-3_38.
[28] K. Jebari, A. Elmoujahid, A. Ettouhami, Automatic genetic fuzzy c-means, J. Intell. Syst. (2018) 1–11,
https://doi.org/10.1515/jisys-2018-0063.
[29] S. Sivasaravanababu, M.R. Barasu, G.S. Siva Priya, P. Punitha, K. Shanmuga Priya, Bronchogenic carcinoma identification with X-ray image using fuzzy C means, Int. J. Pure Appl. Math. 119 (15) (2018)
727–730.
[30] L. Zhang, M. Luo, J. Liu, Z. Li, Q. Zheng, Diverse fuzzy c-means for image clustering, Pattern
Recogn. Lett. (2018), https://doi.org/10.1016/j.patrec.2018.07.004.
[31] D. Nagarajan, M. Lathamaheswari, R. Sujatha, J. Kavikumar, Edge detection on DICOM image using
triangular norms in type-2 fuzzy, Int. J. Adv. Comput. Sci. Appl. 9 (11) (2018) 462–475.
[32] D. Nagarajan, M. Lathamaheswari, J. Kavikumar, Hamzha, A type-2 fuzzy in image extraction for
DICOM image, Int. J. Adv. Comput. Sci. Appl. 9 (12) (2018) 351–362.
[33] C. Jinlin, Y. Chunzhi, X. Guangkui, L. Zing, Image segmentation method using fuzzy C mean clustering based on multi-objective optimization, J. Phys. Conf. Ser. 1004 (2018) 012035, https://doi.org/
10.1088/1742-6596/1004/1/012035.
[34] V.D. Santiago, Q.R. Martinez, L.A. Mecias, B.M.L. Romanach, Mammographic mass segmentation
using fuzzy C-means and decision trees, Lect. Notes Comput. Sci (2018) 1–10, https://doi.org/
10.1007/978-3-319-94544-6_1.
[35] V. Torra, On the selection of m for fuzzy c-means, in: 9th Conference of the European Society for
Fuzzy Logic and Technology, 2015, pp. 1571–1577, https://doi.org/10.2991/ifsa-eusflat15.2015.224.
[36] I. Umoren, G. Usua, F. Osang, Analytic medical process for ophthalmic pathologies using fuzzy
C-mean algorithm, Innov. Syst. Softw. Eng. 7 (2019) 67–84.
[37] A. Srivastava, B. Hazela, P. Khanna, D. Arora, Application of fuzzy C-means (FCM) algorithm in
image appointment, IOSR J. Eng. (2019) 4–8.
[38] J.A. Tolentino, B.D. Gerardo, P.M. Ruji, Enhanced Manhattan-based clustering using fuzzy C-means
algorithm, in: The 14th International Conference on Computing and Information Technology (IC2IT
2018), 2019, pp. 126–134, https://doi.org/10.1007/978-3-319-93692-5_13.
[39] D. Vernanda, N.N. Purnawan, T.H. Apandi, School clustering using fuzzy C means method, SinkrOn
J. Penelit. Tek. Inform. 4 (1) (2019), https://doi.org/10.33395/sinkron.v4i1.10168.
[40] D. Borthakur, V. Grace, P. Batchelor, H. Dubey, Fuzzy C-means clustering and sonification of HRV
features, in: 2019 the IEEE/ACM 4th International Conference on Connected Health: Applications,
Systems and Engineering Technologies, 2019. arXiv:1908.07107[cs.HC].
[41] G. Katircioglu, E.K. Aydogan, M. Ozmen, E. Akgul, Determination of Denim fabric’s air permeability
with image processing using fuzzy C means, in: International Conference on Intelligent and Fuzzy Systems (INFUS 2019), 2019, pp. 1208–1214, https://doi.org/10.1007/978-3-030-23756-1_142.
[42] H. Gan, Safe semi-supervised fuzzy c-means clustering, IEEE Access (2019) 1–6, https://doi.org/
10.1109/access.2019.2929307.
CHAPTER 5
A review of the techniques of images
using GAN
Rituraj Soni (a) and Tanvi Arora (b)
(a) Department of CSE, Engineering College Bikaner, Bikaner, Rajasthan, India
(b) Department of CSE, CGC College of Engineering, Landran, Mohali, Punjab, India
5.1 Introduction to GANs
Generative adversarial networks (GANs) are models constructed for image-to-image translation and are considered a powerful class of neural networks for unsupervised learning. The concept of the GAN was introduced by Ian J. Goodfellow [1] in 2014. The name can be divided into three parts:
• Generative: describes how the data are generated.
• Adversarial: the training of the model is carried out in a competitive manner.
• Networks: deep neural networks are used for the training process.
A GAN basically consists of two networks, a generator and a discriminator, as shown in Fig. 5.1. The two networks compete with each other and, in the process, train each other through multiple cycles of generation and discrimination. The generator network aims to generate new images, text, audio, etc.; these new items are fake. The discriminator checks, with the help of the training data, whether these items are fake or real, using feedback and loss functions.
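A minimal PyTorch sketch of this two-player training loop follows. The network sizes, data dimensionality, and hyperparameters are illustrative assumptions, not values taken from this chapter.

    import torch
    import torch.nn as nn

    latent_dim, data_dim = 64, 784   # e.g., flattened 28x28 images scaled to [-1, 1] (assumption)

    G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, data_dim), nn.Tanh())
    D = nn.Sequential(nn.Linear(data_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

    bce = nn.BCELoss()
    opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

    def train_step(real_batch: torch.Tensor):
        b = real_batch.size(0)
        ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

        # Discriminator: real samples should be scored as real (1), generated ones as fake (0).
        fake = G(torch.randn(b, latent_dim)).detach()
        loss_D = bce(D(real_batch), ones) + bce(D(fake), zeros)
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()

        # Generator: try to make the discriminator score its samples as real.
        fake = G(torch.randn(b, latent_dim))
        loss_G = bce(D(fake), ones)
        opt_G.zero_grad(); loss_G.backward(); opt_G.step()
        return loss_D.item(), loss_G.item()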
Figs. 5.2–5.4 display the outputs obtained from different types of GANs. Fig. 5.2 shows the transformation of one object into another, depending on the given inputs. Similarly, in Fig. 5.3, the GAN generates high-resolution images. Lastly, in Fig. 5.4, the GAN performs image-to-image translation and thus contributes to enlarging a dataset with images that are close to realistic.
Fig. 5.1 Basic structure of GAN [2].
Fig. 5.2 Example of GANs transforming zebra to horse [3].
Fig. 5.3 Example of GANs generating high-resolution images [3].
Fig. 5.4 Example of image-to-image translation [3].
5.1.1 Need for GANs
GANs have gained popularity over just 2–3 years. They have the capability to generate very realistic images and videos, which can assist in implementing the image editors or processors in our tablets and smartphones. GANs can model the data distribution and produce clearer and sharper images. They can train any type of generator network without limitation, whereas other techniques impose restrictions on the generator network and can only be used in specific cases. Moreover, GAN models do not depend on a Markov chain to generate samples.
These advantages make GAN models promising solutions for generating image datasets, which are required for training deep learning models that need large numbers of training items. The cost of physically collecting and labeling such items is quite high, whereas a GAN can help generate dataset items with minimal effort and at quite low cost. GANs can also help generate face photos, cartoon characters, emoji images, and models for advertisements, and all these activities can be done simply by feeding in a base photo; the different variants are then generated automatically.
GAN models are also useful for photo editing; they can make photos clearer and improve the resolution of images, which can be used to derive meaningful information from otherwise unclear images. They can help researchers generate a large number of realistic-looking images from input given in the form of a sketch or semantic image. Apart from that, GANs can also be used to generate images from text descriptions. Image-to-image conversion can likewise be carried out with the help of GANs. GAN models can support photo editing to such an extent that one can produce different variants of an image with changes in facial expressions, gestures, lip movements, gender, hair color, and so on.
Therefore, it can be seen that GAN models are needed for generating synthetic datasets, for image-to-image conversion, for text-to-image conversion, for editing blurred or low-resolution images, for forecasting the appearance of an individual at a certain age, and for generating 3D models.
The ultimate need for GANs is to generate data that can be used to train neural network-based models, as the accuracy of such models depends on the effectiveness of the training data. Conversely, the success of a GAN application depends on how well the GAN architecture is trained; if training is not carried out properly, the results may not be good enough for research on real-time applications.
In Section 5.2, the various architectures related to the GANs are discussed with their
underlying models and working.
5.2 GAN architectures
This section provides essential insight into the working and modeling of the different GAN architectures. Each architecture has its own working style and thus contributes to the generation of images for creating datasets in various research problems.
5.2.1 Fully connected GANs
A basic idea in GAN research is the use of deep convolutional neural networks (CNNs) for image synthesis tasks. In this traditional approach, the pooling layers and the fully connected layers [4,5] are removed or minimized from the GANs. Barua et al. [6] proposed a fully connected and convolutional net architecture for GANs (FCC-GAN), stating that using multiple fully connected layers along with the convolution layers gives better performance than the conventional architecture.
In conventional GANs, images are generated by a single deep convolutional process. In contrast, the work of Barua et al. [6] proposes a two-step process for image generation using FCC-GAN. The first step obtains high-dimensional image features from the low-dimensional input noise. The second step generates the image from these high-dimensional features. The fully connected layers help to capture the global relations between the input noise features, and thus the final image features are closer to natural images. The convolution layers cannot achieve this global mapping on their own because of their emphasis on local connectivity. The methodology given by Barua et al. [6] accomplishes the following aims:
• The combined use of fully connected and convolution layers is proposed, which generates higher-quality images on different benchmark datasets compared to existing GAN methods.
• The learning rate of FCC-GAN is higher than that of conventional GANs, and it produces high-quality realistic images within a few epochs of training.
• FCC-GAN gives better results on metrics such as the Fréchet inception distance and inception score compared to existing CNN architectures on the benchmark datasets.
• The proposed fully connected architecture is more robust and stable than existing CNN architectures.
A simple example of the FCC-GAN proposed by Barua et al. [6] is shown in Fig. 5.5, and that of a conventional GAN is shown in Fig. 5.6. These models create 32 × 32 × 3 RGB images from a random noise vector z. In Fig. 5.5, the numbers in the boxes denote the number of nodes, whereas in the conventional architecture of Fig. 5.6 the numbers in the boxes indicate the shapes of the output layers. The FCC-GAN shown in Fig. 5.5 can be adapted to images of different resolutions by changing the depth and shape of the convolution stack.
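The following is a brief sketch of an FCC-GAN-style generator corresponding to Fig. 5.5: fully connected layers first map the low-dimensional noise to high-dimensional features, which the convolution layers then upsample to an image. The layer sizes follow the figure, while the normalization and activation choices are illustrative assumptions rather than the exact configuration of Barua et al. [6].

import torch
import torch.nn as nn

class FCCGANGenerator(nn.Module):
    def __init__(self, z_dim=64):
        super().__init__()
        # FC stack: 64 -> 512 -> 4096 (the flattened 4 x 4 x 256 feature map)
        self.fc = nn.Sequential(
            nn.Linear(z_dim, 512), nn.ReLU(),
            nn.Linear(512, 4096), nn.ReLU())
        # CONV stack: 4 x 4 x 256 -> 8 x 8 x 128 -> 16 x 16 x 64 -> 32 x 32 x 3
        self.conv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh())

    def forward(self, z):
        h = self.fc(z).view(-1, 256, 4, 4)   # reshape the high-dimensional features to a spatial map
        return self.conv(h)

img = FCCGANGenerator()(torch.randn(16, 64))   # img.shape == torch.Size([16, 3, 32, 32])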
The experiments in Barua et al. [6] are carried out on four datasets: MNIST [7], CIFAR-10 [8], SVHN [9], and CelebA [10]. The experiments on these datasets show that FCC-GAN produces higher-quality images and converges faster than the traditional GAN approach. The stability of FCC-GAN has been demonstrated using different metrics, indicating its importance for image generation. The most important advantage of FCC-GAN is that it can be combined with any GAN method, and it can also be used in complex networks such as ResNet [11].
Fig. 5.5 (Top) FCC-GAN generator; (Bottom) FCC-GAN discriminator [6].
Fig. 5.6 (Left) Conventional generator; (Right) conventional discriminator [6].
5.2.2 Conditional GANs
The concept of the conditional generative adversarial network (CGAN) was first introduced by Mehdi Mirza and Simon Osindero [12]. It is an augmentation of the GAN and is used in the machine learning domain for training image-to-image generative models.
In the traditional GAN model, no conditions are applied to the generator and discriminator, so there is no control over the type of data generated; if the given framework does not require such data, the effort is wasted. In a CGAN, by contrast, a condition can be applied to both the generator and the discriminator. These conditions can be based on the class labels of the images or on some other property [13]. Therefore, an available GAN model can be converted into a CGAN by conditioning both the generator and the discriminator on additional information y. As can be seen in Fig. 5.7, along with the input z, a condition is also applied to the GAN to convert it into a CGAN.
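A minimal sketch of this conditioning is given below, assuming the condition y is a class label that is embedded and concatenated with the noise (for the generator) and with the flattened image (for the discriminator); the sizes and the concatenation strategy are illustrative assumptions, not the exact networks of Mirza and Osindero [12].

import torch
import torch.nn as nn

class CondGenerator(nn.Module):                       # models G(z | y)
    def __init__(self, z_dim=100, n_classes=10, img_dim=28 * 28):
        super().__init__()
        self.embed = nn.Embedding(n_classes, 16)
        self.net = nn.Sequential(nn.Linear(z_dim + 16, 256), nn.ReLU(),
                                 nn.Linear(256, img_dim), nn.Tanh())

    def forward(self, z, y):
        return self.net(torch.cat([z, self.embed(y)], dim=1))

class CondDiscriminator(nn.Module):                   # models D(x | y)
    def __init__(self, n_classes=10, img_dim=28 * 28):
        super().__init__()
        self.embed = nn.Embedding(n_classes, 16)
        self.net = nn.Sequential(nn.Linear(img_dim + 16, 256), nn.LeakyReLU(0.2),
                                 nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, x, y):
        return self.net(torch.cat([x, self.embed(y)], dim=1))

z, y = torch.randn(8, 100), torch.randint(0, 10, (8,))
fake = CondGenerator()(z, y)                          # images generated under condition y
score = CondDiscriminator()(fake, y)                  # realness score given the same condition

Because the same y is fed to both networks, the generator is pushed to produce images that are plausible for that particular condition, not merely plausible in general.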
Another example of the CGANs is shown in Fig. 5.8; here the condition y is added to
the generator as well as the discriminator for the desired output.
Fig. 5.7 An example of the conditional adversarial net [12].
Fig. 5.8 An example of the conditional adversarial net [14].
The factors in the construction of a CGAN are as follows:
• The first and foremost is to add features or conditions that control the output and direct the generator to produce images according to the given conditions.
• These features should be available from the images and should classify them into specific classes, for example, images of human beings if the aim is to create the faces of imaginary actors. They can include attributes such as hair color and eye type.
• The conditioning information, as well as the data to be learned, can be incorporated into the images and into the inputs.
• The discriminator is evaluated on the similarity between the fake and the real data. It also takes into account the mapping between the input features and the fake data image.
• The condition can be imposed on the inputs of both the generator and the discriminator. It can take the form of a vector of digits (the condition) and is linked to a real or fake image given to the generator or discriminator.
Fig. 5.9 depicts the generation of digits [12] on the MNIST dataset with the help of CGANs. CGANs suffer from one disadvantage: they always need labels to perform their work, so they are not completely unsupervised.
Fig. 5.9 MNIST digit generation using CGANs [12].
5.2.3 Adversarial autoencoders
The adversarial autoencoder [15] is a probabilistic autoencoder that uses a GAN to perform variational inference by matching the aggregated posterior of the autoencoder's hidden code vector with an arbitrary prior distribution. Autoencoders [16] follow an approach similar to feed-forward neural networks and use the concepts of unsupervised learning. The autoencoder's main task is to encode the input into a compact representation in the middle of the architecture and then reconstruct that information as faithfully as possible at the output. As shown in Fig. 5.10, the first part of the network encodes the information up to the middle layer and is therefore known as the encoder. The middle layer in this architecture is termed the encoded vector, or code. The part after the middle layer is termed the decoder; it assists in reconstructing the information available in the code. Thus the data received by the input layer is compressed into the autoencoder's middle layer, which is essential because it holds the data in a reduced dimension.
Fig. 5.10 A simple autoencoder [16].
Fig. 5.11 A simple adversarial autoencoder [16].
Makhzani et al. [15] propose a variation of GANs called the adversarial autoencoder (AAE), which converts an autoencoder into a generative model. The job of the autoencoder is to generate new data on the basis of the given input data. The distinguishing feature of the AAE is that it controls the encoder output with the assistance of a prior distribution. The encoded vector comprises a mean value and a standard deviation, and, in addition, a prior distribution is imposed on it. The decoder can then map the imposed prior distribution to the data distribution with the help of a deep generative model. The prior can be any distribution, for example, a normal (Gaussian) distribution or a gamma distribution. The central idea is to push the distribution of the encoded values toward the prior distribution, so that the decoder learns a mapping from the prior distribution to the data distribution.
Fig. 5.11 demonstrates a simple AAE, in which a standard autoencoder is placed in the top row and generates the output x from the latent code z. A second network, in the bottom row, discriminates whether a sample comes from the user-specified prior distribution or from the hidden code of the autoencoder.
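The following is a minimal sketch of this two-part training, assuming a standard Gaussian prior, flattened 28 × 28 inputs, and small fully connected networks; all sizes are illustrative and not taken from Makhzani et al. [15].

import torch
import torch.nn as nn

enc = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 8))                 # x -> code z
dec = nn.Sequential(nn.Linear(8, 256), nn.ReLU(), nn.Linear(256, 784), nn.Sigmoid())   # z -> reconstruction
disc = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())      # z -> P(z drawn from prior)

opt_ae = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce, mse = nn.BCELoss(), nn.MSELoss()

x = torch.rand(32, 784)                       # a stand-in batch of flattened images
z = enc(x)

# Reconstruction phase: the usual autoencoder objective.
recon_loss = mse(dec(z), x)

# Regularization phase: the discriminator separates prior samples from encoder codes ...
prior = torch.randn(32, 8)
d_loss = bce(disc(prior), torch.ones(32, 1)) + bce(disc(z.detach()), torch.zeros(32, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# ... and the encoder is updated to fool it, pushing the aggregated posterior toward the prior.
adv_loss = bce(disc(enc(x)), torch.ones(32, 1))
opt_ae.zero_grad(); (recon_loss + adv_loss).backward(); opt_ae.step()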
Makhzani et al. [15] report that the AAE attains competitive test likelihoods on the Toronto Face Dataset [17] and on real-valued MNIST. The proposed method can also be applied in semisupervised scenarios, where it obtains excellent classification performance on the SVHN and MNIST datasets. AAEs find applications in dimensionality reduction, data visualization, disentangling the content and style of images, and unsupervised clustering.
5.2.4 Deep convolution GANs
GANs, as discussed in the earlier sections, consist of two primary networks, a generator and a discriminator, that carry out different tasks. To make GANs more powerful and able to handle more complex applications, both the generator and the discriminator can be augmented with convolutional neural network layers. This structure is known as the deep convolution GAN. The concept of the deep convolution GAN (DCGAN) was introduced by Radford et al. [18] in 2015, who succeeded in bringing the ConvNet idea into GANs. Incorporating ConvNets makes the DCGAN a strong candidate for implementing unsupervised learning.
Many attempts have been made to integrate CNNs with GANs to improve performance. The approach used by Radford et al. [18] uses a family of architectures to train the model on a large number of datasets and allows training of higher-resolution and deeper networks. The DCGAN [18] was implemented through the following three approaches (a brief sketch of a generator built on these choices follows the list):
• The first step is the concept given by Springenberg et al. [19], which replaces max-pooling with strided convolutions so that the network learns its own downsampling.
• The second step is to eliminate the fully connected layers on top of the convolutional features. This idea was applied by Mordvintsev et al. [20], where global average pooling is used for image classification models; global average pooling gives ample stability to the model.
• The third and last step is to apply batch normalization [21]. It helps stabilize the learning process by normalizing the input to each unit to have zero mean and unit variance. This stabilization addresses training issues caused by poor initialization and helps gradients flow in deeper models. In this way, it helps the generator begin learning and prevents it from collapsing to a single point.
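The sketch below shows a DCGAN-style generator that follows these three design choices: fractionally strided convolutions instead of pooling, no fully connected stack on top of the convolutional features, and batch normalization after each convolution. The channel widths roughly follow Fig. 5.12; the remaining details are illustrative assumptions.

import torch
import torch.nn as nn

class DCGANGenerator(nn.Module):
    def __init__(self, z_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            # project the 100-d noise to a 4 x 4 x 1024 feature map
            nn.ConvTranspose2d(z_dim, 1024, 4, stride=1, padding=0), nn.BatchNorm2d(1024), nn.ReLU(),
            nn.ConvTranspose2d(1024, 512, 4, stride=2, padding=1), nn.BatchNorm2d(512), nn.ReLU(),   # 8 x 8
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.BatchNorm2d(256), nn.ReLU(),    # 16 x 16
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),    # 32 x 32
            nn.ConvTranspose2d(128, 3, 4, stride=2, padding=1), nn.Tanh())                           # 64 x 64 x 3

    def forward(self, z):                     # z has shape (batch, z_dim, 1, 1)
        return self.net(z)

img = DCGANGenerator()(torch.randn(16, 100, 1, 1))   # img.shape == torch.Size([16, 3, 64, 64])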
Fig. 5.12 illustrates the DCGAN generator in detail. A 100-dimensional uniform distribution Z is projected to a small spatial-extent convolutional representation with many feature maps. This high-level representation is then converted into a 64 × 64-pixel image through a series of four fractionally strided convolutions; no fully connected layers are used.
Fig. 5.12 DCGAN generator used for LSUN scene modeling [23].
In another work, Durall et al. [22] discussed a method to handle the stabilization problem that occurs in the training phase of GANs. A new framework called OC-GAN (Octave-GAN), which uses octave convolutions, is proposed in that work. It reduces the problem of mode collapse in existing GANs and generates images of higher quality. The method is tested on the CelebA dataset.
5.2.5 StackGANs
A StackGAN consists of two stacked stages, stage-1 and stage-2. The function of the stage-1 GAN is to produce low-resolution images based on the description given by the user. Such images contain only rough sketches and basic colors and give a preview of the final image. The images generated by stage-1 are then passed to stage-2, which generates high-resolution images from them that appear more realistic.
Image generation is driven by a text description, or text embedding, given as the instruction. The stage-2 network adds all the relevant details specified by the text and thus produces images that are very close to realistic images at the proper resolution. The working of the StackGAN can be compared with that of a painter. For a complex painting, a painter first draws edges, rough sketches, and lines to prepare an overview of the image. In the next stage, the painter fills in the relevant colors, adds more specific details, and shapes the artwork; it is in this second stage that the painter gives the picture a realistic appearance. Similarly, stage-1 produces low-resolution images from the given text description, and stage-2, which builds on stage-1, tries to capture the details that stage-1 missed. Stage-2 adds more information to the images generated by stage-1. The support of the model distribution generated from a roughly aligned low-resolution image has a better probability of intersecting with the support of the image distribution [24].
Fig. 5.13 depicts the architecture of the StackGAN. As discussed earlier, it is composed of two stages, each with its own generator and discriminator. At each level, the StackGAN consists of a text encoder, a conditioning augmentation network, a generator network, a discriminator network, and an embedding compressor network. As Fig. 5.13 shows, the stage-1 GAN generates low-resolution images of size 64 × 64, and the stage-2 GAN then takes these images as inputs, applies conditioning augmentation to them, and generates high-resolution images of size 256 × 256.
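The data flow through the two stages can be sketched as below. The component networks here are small hypothetical stand-ins (real StackGAN stages are much deeper and adversarially trained); only the flow from text embedding to a 64 × 64 image and then to a 256 × 256 image mirrors the architecture of Fig. 5.13.

import torch
import torch.nn as nn

class Stage1G(nn.Module):                      # text condition + noise -> coarse 64 x 64 image
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(128 + 100, 64 * 8 * 8)
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),   # 16 x 16
            nn.Upsample(scale_factor=2), nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),   # 32 x 32
            nn.Upsample(scale_factor=2), nn.Conv2d(16, 3, 3, padding=1), nn.Tanh())    # 64 x 64

    def forward(self, cond, z):
        h = self.fc(torch.cat([cond, z], dim=1)).view(-1, 64, 8, 8)
        return self.up(h)

class Stage2G(nn.Module):                      # coarse image + text condition -> refined 256 x 256 image
    def __init__(self):
        super().__init__()
        self.up = nn.Sequential(
            nn.Conv2d(3 + 128, 32, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),   # 128 x 128
            nn.Upsample(scale_factor=2), nn.Conv2d(16, 3, 3, padding=1), nn.Tanh())    # 256 x 256

    def forward(self, coarse, cond):
        cond_map = cond.view(-1, 128, 1, 1).expand(-1, 128, 64, 64)    # broadcast the text condition spatially
        return self.up(torch.cat([coarse, cond_map], dim=1))

cond = torch.randn(1, 128)                     # an encoded text description
coarse = Stage1G()(cond, torch.randn(1, 100))  # stage-1: rough shape and colors
refined = Stage2G()(coarse, cond)              # stage-2: added detail, corrected defects
print(coarse.shape, refined.shape)             # torch.Size([1, 3, 64, 64]) torch.Size([1, 3, 256, 256])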
Fig. 5.14 displays examples of images generated by StackGAN [24] from text descriptions given to the system; here StackGAN is applied to the Oxford flowers dataset [25] to generate flower images. Fig. 5.15 displays examples of images generated by StackGAN [24] from text descriptions on the COCO dataset [26].
Fig. 5.13 StackGAN architecture: stage-1 takes the given text as input and sketches a rough shape to produce low-resolution images; stage-2 then generates more refined high-resolution images by correcting the defects [24].
Fig. 5.14 Text-to-image generation using StackGAN [24].
Fig. 5.15 Results on COCO dataset [26] using StackGAN [24].
Thus, the method in [24] performs much better than the other methods in this domain and produces high-resolution images that are remarkably close to realistic images.
5.2.6 CycleGANs
CycleGAN [27] is one of the models used for image-to-image translation based on the GAN architecture. It is an extension of the GAN model that simultaneously trains two generator models and two discriminator models.
In this model, two domains of images are defined. The CycleGAN [28] is shown in simplified form in Figs. 5.16 and 5.17. The first generator takes images from the first domain and outputs images in the second domain, whereas the second generator takes images from the second domain and outputs images in the first domain. The discriminator models check how believable the images produced by both generators are, and the generator models are fine-tuned accordingly.
The model described so far can check the plausibility of the images generated for each domain, but this alone is not sufficient for translating images. Therefore, for the purpose of image-to-image translation, CycleGAN has an additional mechanism called cycle consistency. Here, the output of the first generator is fed into the second generator, and the output produced by the second generator is matched against the initial image fed to the first generator. Likewise, the reverse operation also holds: the second generator's output can serve as input to the first generator, and the result should match the input originally fed to the second generator. Cycle consistency is used as a regularization measure for the generator models and helps the image generation process in image-to-image translation.
Fig. 5.16 Flow A-B-A starts from input in domain A [28].
Fig. 5.17 Flow B-A-B starts from input in domain B [28].
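A sketch of the cycle-consistency idea is given below, assuming two small stand-in generator networks G_AB (domain A to B) and G_BA (domain B to A); the L1 penalty ties each round-trip translation back to the original image. The networks are illustrative placeholders, not the full CycleGAN architecture.

import torch
import torch.nn as nn

G_AB = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 3, 3, padding=1), nn.Tanh())
G_BA = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 3, 3, padding=1), nn.Tanh())
l1 = nn.L1Loss()

real_a = torch.rand(4, 3, 64, 64)      # e.g. winter landscape images (domain A)
real_b = torch.rand(4, 3, 64, 64)      # e.g. summer landscape images (domain B)

rec_a = G_BA(G_AB(real_a))             # A -> B -> back to A
rec_b = G_AB(G_BA(real_b))             # B -> A -> back to B

# Cycle-consistency loss: translating there and back should reproduce the input image.
cycle_loss = l1(rec_a, real_a) + l1(rec_b, real_b)

# During training, cycle_loss is added (with a weighting factor) to the usual adversarial
# losses provided by the two domain discriminators.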
The CycleGAN model can be explained with an example in which the aim is to translate images of winter landscapes into images of summer landscapes. The two seasons clearly produce different images of the same landscape, so in this case one domain contains images of winter landscapes and the other contains images of summer landscapes, as depicted in Fig. 5.18.
Fig. 5.18 Example of CycleGAN for summer to winter translation [29].
CycleGAN has an architecture of two GANs, and each GAN has a discriminator and a generator model, meaning there are four models in total. The system thus has two GAN generators: one takes images of winter landscapes and generates images of summer landscapes, while the other takes images of summer landscapes and generates images of winter landscapes. The discriminator models then check whether both generators are producing images as intended; based on the discriminators' judgments, the generators are trained further to obtain the desired translation.
CycleGANs can be used in varied domains such as style transfer, object transfiguration, season transfer, generating photographs from paintings, and photograph enhancement.
5.2.7 Wasserstein GANs
The idea of the WGAN, or Wasserstein GAN, was given by Arjovsky et al. [30]. It can be described as an augmentation of the existing GAN architecture. The main aim of the WGAN is to improve the stability of model training and to provide a loss function that reflects the quality of the images generated by the model.
The WGAN aims at a better approximation of the distribution of the data provided in the training dataset. It replaces the discriminator with a critic, which scores how real or fake a given image is rather than classifying it. The theory behind the WGAN is based on mathematical distances between distributions: the generator must minimize the distance between the distribution of the data observed in the training dataset and the distribution of the generated examples.
In the paper by Arjovsky et al. [30], several distribution distance measures are discussed, such as the Jensen-Shannon (JS) divergence [31], the Kullback-Leibler (KL) divergence [32], and the Wasserstein (Earth-Mover, EM) distance. Each distance is assessed in terms of the convergence of sequences of probability distributions, and it was shown that the WGAN can train the generator more effectively using the properties of the Wasserstein distance than with the other distribution distances.
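The following sketch shows the WGAN training signal on toy one-dimensional data, assuming a simple critic f and generator G; the critic outputs an unbounded score (no sigmoid), the losses estimate the Wasserstein distance, and weight clipping follows the original WGAN recipe of Arjovsky et al. [30]. Sizes and hyperparameters are illustrative.

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
critic = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # a score, not a probability
opt_g = torch.optim.RMSprop(G.parameters(), lr=5e-5)
opt_c = torch.optim.RMSprop(critic.parameters(), lr=5e-5)

for step in range(1000):
    for _ in range(5):                                    # train the critic several times per generator step
        real = torch.randn(64, 1) * 0.5 + 3.0
        fake = G(torch.randn(64, 8)).detach()
        c_loss = critic(fake).mean() - critic(real).mean()   # negative of the Wasserstein estimate
        opt_c.zero_grad(); c_loss.backward(); opt_c.step()
        for p in critic.parameters():                     # weight clipping keeps the critic roughly 1-Lipschitz
            p.data.clamp_(-0.01, 0.01)

    g_loss = -critic(G(torch.randn(64, 8))).mean()        # the generator tries to raise the critic's score
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()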
Fig. 5.19 A simple WGAN architecture [34].
Fig. 5.19 depicts a simple WGAN architecture. The concept of the WGAN revolves around the fact that the Wasserstein distance is continuous and differentiable, which means that the critic can be trained until it reaches an optimal value: the longer the critic is trained, the more reliable the Wasserstein gradient it provides, owing to the differentiable nature of the Wasserstein distance. In the case of the JS divergence, by contrast, the critic may become reliable, but the true gradient goes to zero because the JS divergence saturates locally and vanishing gradients are obtained. The critic in a WGAN does not saturate; it converges toward a linear function and therefore provides a clean gradient, whereas a standard discriminator may learn to separate real from fake quickly but then provides almost no reliable gradient information. The most crucial advantage of the WGAN is that it makes the training process stable and less sensitive to the choice of hyperparameter configuration. The WGAN aims to decrease the critic's loss, and achieving this leads to good quality in the generated images. WGANs simply try to lower the generator loss, whereas other GANs try to reach an equilibrium between the generative and discriminative models. Applications of WGANs include the simulation of isolated electromagnetic showers in a realistic multilayer sampling calorimeter setup [33]. Similarly, one critical step in the analysis of medical images is the structure-preserving denoising of 3D magnetic resonance imaging (MRI) images.
Ran et al. [35] presented a residual encoder-decoder Wasserstein generative adversarial network (RED-WGAN) as an MRI denoising method.
The next section discusses some of the open issues and research gaps in the domain of
the applications related to the GANs.
5.3 Discussion on research gaps
There are many open problems and research gaps where GANs can be applied to achieve better results than traditional machine learning approaches. The work by Barua et al. [6] emphasizes using fully connected GANs for unsupervised training, so the effect of using fully connected convolutional (FCC) GANs in semisupervised training can be studied. Similarly, the inclusion of CGANs could improve the results, and complex networks such as ResNet [11] can be examined with the help of FCC-GANs.
Zhao et al. [36] propose a method using adversarially regularized autoencoders for training deep latent-variable models on simple discrete structures, namely short sentences and binary digits. There is therefore scope to extend the training to model complex structures such as documents and old manuscripts.
Balabka [37] proposes a model using AAEs to recognize human activity with the help of semisupervised learning; the semisupervised setting exploits unlabeled data through AAE training. An open challenge in the domain of AAEs is to explore the many hyperparameters that can be tuned to improve the performance of the model substantially.
Ruiz-Garcia et al. [38] suggested a new method that uses a generative adversarial stacked autoencoder to map facial expressions to an illumination-invariant facial representation. An open research problem in this domain is developing a method that can handle scenarios in which multipose datasets are unlabeled.
Lu et al. [39] discuss a deep learning-based method (DA-DCGAN) for practical domain-shifting DC series arc fault detection in photovoltaic systems. The GAN serves the purpose of generating dummy arc-shifting data. However, the problem of implementing such GANs on low-cost application-specific integrated circuits while improving reliability remains open.
Padala et al. [40] proposed studying the effect of varying the input noise applied to GANs. Their findings indicate that the noise has a remarkable influence on the generation of images. The gap in this study is the lack of a theoretical analysis relating the high-dimensional data to the low-dimensional noise distribution.
Kim [41] proposes a new variation of GANs called BoolGAN and applies it to a dataset of car images, taking the model proposed by Radford et al. [18] as the baseline. The inclusion of dropout and additional convolution layers improves the efficiency of the model. The open issue in this study is to perform more experiments on adding layers, finding the optimum hyperparameters, and scheduling the learning rate; such a study may shed new light on the performance of the model.
Durall et al. [22] proposed the use of octave convolutions in GANs and state that Bayesian optimization can be explored as future work.
Cheng et al. [42] presented a novel method called SeqAttnGAN for creating images in interactive image editing. The method is implemented on two benchmark datasets, DeepFashion-Seq and Zap-Seq, whose images are paired with proper textual descriptions, and it gives excellent results compared with the baseline methods. Future work for this method includes creating human faces with an interactive image editor and exploring the generation of consistent image sequences from given attributes and other factors.
Vougioukas et al. [43] proposed a novel and innovative method to generate video driven by speech, achieved by applying temporal GANs. The performance of the method is evaluated on the GRID [44] and TCD TIMIT [45] datasets. The method can produce videos with proper facial expressions, including blinking. An open problem for this method is capturing the mood and gestures of the speaker and reflecting them in the facial expression.
Zhao et al. [46] suggested a CGAN-based idea to retrieve the lost and missing information from solar observation images. This missing information arises from overexposure of the images during the solar observation process caused by violent solar bursts. The approach uses CGANs and integrates an edge mass loss, a masked L1 loss, and an adversarial loss, and the model is trained on a new dataset of overexposed images. The work remains open for cases where the images have rich texture and large overexposed areas.
Zhu et al. [47] proposed a novel method that implements CGANs to address the issue of producing multiple outputs for image-to-image translation from a single input. The mapping ambiguity is resolved by randomly sampling a low-dimensional latent vector. The generator used in this method learns to map the input, together with the latent code, to the possible outputs; the method thus encourages bijective consistency between the output modes and the latent encoding. Future work for this method is to produce image-to-image translations controlled by different user parameters along with meaningful attributes.
Nataraj et al. [48] discussed a model for detecting fake images generated by GANs. It combines deep learning with pixel co-occurrence matrices, which are computed on the different color channels; a deep convolutional neural network is then trained to discriminate real images from the fake images generated by the GAN. Future work in this domain is to locate the manipulated pixels in GAN-generated fake images and rectify them.
Liu et al. [49] used the idea of coupled GANs to perform image-to-image translation in an unsupervised setting. The aim is to use information about images in two different domains and perform translation between them. The open issues that need to be addressed are the instability of the training caused by the saddle-point-searching problem and the unimodal limitation of the system arising from the assumption of a Gaussian latent space.
Other open problems in this domain include the need for automatic metrics for judging the performance of the different types of generative networks and the need to consider nondeterministic training losses for future prediction.
The next section discusses some of the applications that can be addressed with the assistance of GANs.
5.4 GAN applications
There are many areas in which GANs can be applied with remarkable results, including the following:
• Generation of images: Image generation is one of the prominent areas where GANs are applied, giving researchers ample datasets for carrying out different experiments. The generated images are realistic in nature. The image generation process starts from some sample images, based on which the GAN can generate a large number of new images with the help of the generator and discriminator; these new images differ from the existing sample images. Image generation is used extensively in animation, social media, marketing, entertainment, and the generation of logos for the digital world.
• Synthesis of images using text: One of the most exciting features of GANs is the synthesis of images from a text description. Such applications are used in the entertainment industry: with the help of a text (story), an animated character with its gestures can be created.
• Aging of faces: GANs, through their CGAN variant, can be used to predict the appearance of faces at targeted ages. A GAN architecture can create and predict people's faces at different ages, so such a system can be useful in companies for face verification of their employees. It works on the principle of semisupervised learning for age progression. Datasets with face images and age labels are available in the public domain for experimental purposes.
• Image-to-image translation: Images can be translated into other images with the help of the generator and discriminator. Images taken at night can be translated into day scenes; similarly, drawings and sketches can be translated into paintings, aerial images can be translated into satellite-style maps, and images of zebras can be converted into horses. CGANs can be applied for synthesizing photos from label maps, as described by Nayak [14], and for creating and coloring images from edge maps, as discussed by Wang et al. [50] and Isola et al. [51].
• Synthesis of video: Video synthesis can also be performed with the assistance of GANs. They take less time to create videos than manual or real-time production. This property can encourage animation creators to make optimum use of the technology to develop and promote videos in less time and close to the real world. GANs can also be used to predict future frames in a video sequence, as described by Villegas et al. [52].
• Generating high-quality images: GANs allow low-resolution images taken by ordinary cameras to be converted into high-resolution, high-quality images. This helps in observing minute details that cannot be seen in the low-resolution images.
• Missing part generation of images: A GAN can be used to generate the missing parts of partially degraded images and thus recover the original images.
• Generating shadow maps: Nguyen et al. [53] apply a conditional sensitivity parameter to the generator of a CGAN to parameterize the loss of the trained detector, which is more efficient than other GANs.
• Speech enhancement: Phan et al. [54] propose two architectures, ISEGAN and DSEGAN, for speech enhancement. The main motive behind speech enhancement is to remove unnecessary background and irrelevant noises that create problems for speech recognition. Speech enhancement further helps in cochlear implants, hearing aids, and communication systems. GANs therefore play an important role in speech recognition by enhancing the speech samples.
• Fault diagnosis: GANs can be implemented for the detection and diagnosis of DC arc faults that occur in photovoltaic systems, as described by Lu et al. [39]. The source and target domain data are available during operation in the field, but fault data are not; GANs can therefore be used to generate dummy fault data.
5.5 Conclusion
This chapter covers the introduction to GANs, the need for them, and the detailed architecture of various models, that is, the fully connected GAN, CGAN, AAE, deep convolution GAN, StackGAN, CycleGAN, and Wasserstein GAN. The advantages and disadvantages of the models are listed. The chapter also focuses on the various research gaps identified in the different GAN architectures; these gaps should prompt students and scholars in this domain to contribute to the development of GAN algorithms. Various applications fall into the area of GANs and are listed in the last section of the chapter. These applications, if addressed using GAN-based approaches, will provide better results than traditional machine learning approaches. The chapter also covers various examples of image-to-image translation described by researchers.
In recent years, GANs have emerged as a novel methodology for generating data from rough input information. They are considered a robust and powerful class of neural networks for unsupervised learning. With the GAN idea, large image datasets can be created that are very close to real images, satisfying researchers' need for datasets when implementing their models.
Along with the great advantage of generating huge numbers of images, GANs have limitations: they generate good results when the input data maps into the learned subspace, but unseen data that is not mapped correctly may give poor results. Another problem associated with GANs is mode collapse, in which the generator always produces output from a small set of modes. Similarly, GANs are at certain stages challenging to converge during training. The machines and resources required to implement GAN training models must have exceptionally high configurations and are expensive; GAN training and implementation require extensive use of GPUs along with CPUs, and the memory needed to handle large data is also an issue. Researchers in this domain can work on complex problems of artificial intelligence by implementing advanced versions of GANs. This will enhance the capabilities of machines and provide the human race with new solutions to existing problems in different areas of science and engineering.
References
[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, 2014, pp.
2672–2680.
[2] A. Mittal, Generative Adversarial Networks (GAN), 2020. https://codeburst.io/generativeadversarial-networks-gan-3c8978ba99a6.
[3] J. Hui, GAN some cool applications of GAN, Medium (2020). https://medium.com/@jonathan_hui/
gan-some-cool-applications-of-gans-4c9ecca35900.
[4] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A.C. Courville, Improved training of Wasserstein
GANs, in: Advances in Neural Information Processing Systems, 2017, pp. 5767–5777.
[5] T. Miyato, T. Kataoka, M. Koyama, Y. Yoshida, Spectral normalization for generative adversarial networks, in: International Conference on Learning Representations, 2018.
[6] S. Barua, S. Monazam Erfani, J. Bailey, FCC-GAN: a fully connected and convolutional net architecture for GANs, arXiv E-Prints, arXiv-1905 (2019).
[7] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (11) (1998) 2278–2324.
[8] A. Krizhevsky, Learning Multiple Layers of Features From Tiny Images (Master’s thesis), Department
of Computer Science, University of Toronto, 2009.
[9] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, A.Y. Ng, Reading digits in natural images with
unsupervised feature learning, NIPS Workshop on Deep Learning and Unsupervised Feature Learning,
2011.
[10] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in: Proceedings of the
IEEE International Conference on Computer Vision, 2015, pp. 3730–3738.
[11] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[12] M. Mirza, S. Osindero, Conditional generative adversarial nets, arXiv preprint arXiv:1411.1784
(2014).
[13] I. Goodfellow, M. Mirza, A. Courville, Y. Bengio, Multi-prediction deep Boltzmann machines, in:
Advances in Neural Information Processing Systems, 2013, pp. 548–556.
[14] M. Nayak, An introduction to conditional GANs (CGANs), Medium (2019). https://medium.com/
datadriveninvestor/an-introduction-to-conditional-gans-cgans-727d1f5bb011.
[15] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, B. Frey, Adversarial autoencoders, arXiv Preprint
arXiv:1511 (2015).
[16] C. Rubiks, Introduction to Adversarial Autoencoders, 2019. https://rubikscode.net/2019/01/14/
introduction-to-adversarial-autoencoders/.
[17] J. Susskind, A. Anderson, G.E. Hinton, The Toronto face dataset, Technical Report UTML TR 2010001, U. Toronto, 2010. tech. rep.
[18] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv preprint arXiv:1511.06434 (2015).
[19] J.T. Springenberg, A. Dosovitskiy, T. Brox, M. Riedmiller, Striving for simplicity: the all convolutional net, arXiv preprint arXiv:1412.6806 (2014).
[20] A. Mordvintsev, C. Olah, M. Tyka, Inceptionism: going deeper into neural networks, 2015. https://
research.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html.
[21] S. Ioffe, C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal
covariate shift, in: International Conference on Machine Learning, PMLR, 2015, pp. 448–456.
[22] R. Durall, F.-J. Pfreundt, J. Keuper, Stabilizing GANs with octave convolutions, arXiv preprint
arXiv:1905.12534 (2019).
[23] C. Shorten, DCGANs (Deep Convolutional Generative Adversarial Networks), Medium (2019).
https://towardsdatascience.com/dcgans-deep-convolutional-generative-adversarial-networksc7f392c2c8f8.
[24] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, D.N. Metaxas, StackGAN: text to photorealistic image synthesis with stacked generative adversarial networks, in: Proceedings of the IEEE
International Conference on Computer Vision, 2017, pp. 5907–5915.
[25] M.-E. Nilsback, A. Zisserman, Automated flower classification over a large number of classes, in: 2008
Sixth Indian Conference on Computer Vision, Graphics & Image Processing, IEEE, 2008, pp.
722–729.
[26] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft
coco: common objects in context, in: European Conference on Computer Vision, Springer, 2014, pp.
740–755.
[27] Y. Gan, J. Gong, M. Ye, Y. Qian, K. Liu, S. Zhang, GANs with multiple constraints for image translation, Complexity 2018 (2018) 1–27, https://doi.org/10.1155/2018/4613935.
[28] H. Bansal, A. Rathore, Understanding and implementing CycleGAN in tensorflow, 2017.
[29] J.-Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent
adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision,
2017, pp. 2223–2232.
[30] M. Arjovsky, S. Chintala, L. Bottou, Wasserstein generative adversarial networks, in: International
Conference on Machine Learning, PMLR, 2017, pp. 214–223.
[31] A.P. Majtey, P.W. Lamberti, D.P. Prato, Jensen-Shannon divergence as a measure of distinguishability
between mixed quantum states, Phys. Rev. A 72 (5) (2005) 052310.
[32] D.A. Klein, S. Frintrop, Center-surround divergence of feature statistics for salient object detection, in:
2011 International Conference on Computer Vision, IEEE, 2011, pp. 2214–2219.
[33] M. Erdmann, J. Glombitza, T. Quast, Precise simulation of electromagnetic calorimeter showers using
a Wasserstein Generative Adversarial Network, Comput. Software Big Sci. 3 (1) (2019) 4.
[34] V. Chandak, P. Saxena, M. Pattanaik, G. Kaushal, Semantic image completion and enhancement using
deep learning, in: 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT), IEEE, 2019, pp. 1–6.
[35] M. Ran, J. Hu, Y. Chen, H. Chen, H. Sun, J. Zhou, Y. Zhang, Denoising of 3D magnetic resonance
images using a residual encoder-decoder Wasserstein generative adversarial network, Med. Image Anal.
55 (2019) 165–180.
[36] J. Zhao, Y. Kim, K. Zhang, A. Rush, Y. LeCun, Adversarially regularized autoencoders, in: International Conference on Machine Learning, 2018, pp. 5902–5911.
[37] D. Balabka, Semi-supervised learning for human activity recognition using adversarial autoencoders, in:
Adjunct Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous
Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers,
2019, pp. 685–688.
[38] A. Ruiz-Garcia, V. Palade, M. Elshaw, M. Awad, Generative adversarial stacked autoencoders for facial
pose normalization and emotion recognition, in: 2020 International Joint Conference on Neural Networks (IJCNN), IEEE, 2020, pp. 1–8.
[39] S. Lu, T. Sirojan, B.T. Phung, D. Zhang, E. Ambikairajah, DA-DCGAN: an effective methodology
for DC series arc fault diagnosis in photovoltaic systems, IEEE Access 7 (2019) 45831–45840.
[40] M. Padala, D. Das, S. Gujar, Effect of input noise dimension in GANs, arXiv preprint
arXiv:2004.06882 (2020).
[41] D.H. Kim, Deep convolutional GANs for car image generation, arXiv preprint arXiv:2006.14380
(2020).
[42] Y. Cheng, Z. Gan, Y. Li, J. Liu, J. Gao, Sequential attention GAN for interactive image editing, in:
Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 4383–4391.
[43] K. Vougioukas, S. Petridis, M. Pantic, End-to-end speech-driven realistic facial animation with temporal GANs, in: CVPR Workshops, 2019, pp. 37–40.
[44] M. Cooke, J. Barker, S. Cunningham, X. Shao, An audio-visual corpus for speech perception and automatic speech recognition, J. Acoust. Soc. Am. 120 (5) (2006) 2421–2424.
[45] N. Harte, E. Gillen, TCD-TIMIT: an audio-visual corpus of continuous speech, IEEE Trans. Multimedia 17 (5) (2015) 603–615.
[46] D. Zhao, L. Xu, L. Chen, Y. Yan, L.-Y. Duan, Mask-Pix2Pix network for overexposure region recovery of solar image, Adv. Astron. 2019 (2019).
[47] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A.A. Efros, O. Wang, E. Shechtman, Toward multimodal
image-to-image translation, in: Advances in Neural Information Processing Systems, 2017, pp.
465–476.
[48] L. Nataraj, T.M. Mohammed, B.S. Manjunath, S. Chandrasekaran, A. Flenner, J.H. Bappy, A.K. RoyChowdhury, Detecting GAN generated fake images using co-occurrence matrices, Electron. Imaging
2019 (5) (2019). 532-1–532-7.
[49] M.-Y. Liu, T. Breuel, J. Kautz, Unsupervised image-to-image translation networks, in: Advances in
Neural Information Processing Systems, 2017, pp. 700–708.
[50] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, B. Catanzaro, High-resolution image synthesis
and semantic manipulation with conditional GANs, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8798–8807.
[51] P. Isola, J.-Y. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017,
pp. 1125–1134.
[52] R. Villegas, J. Yang, S. Hong, X. Lin, H. Lee, Decomposing motion and content for natural video
sequence prediction, in: 5th International Conference on Learning Representations, ICLR 2017,
International Conference on Learning Representations, ICLR, 2017.
[53] V. Nguyen, T.F. Yago Vicente, M. Zhao, M. Hoai, D. Samaras, Shadow detection with conditional
generative adversarial networks, in: Proceedings of the IEEE International Conference on Computer
Vision, 2017, pp. 4510–4518.
[54] H. Phan, I.V. McLoughlin, L. Pham, O.Y. Chen, P. Koch, M. De Vos, A. Mertins, Improving GANs
for speech enhancement, IEEE Signal Process. Lett. 27 (2020) 1700–1704.
CHAPTER 6
A review of techniques to detect
the GAN-generated fake images
Tanvi Arora(a) and Rituraj Soni(b)
(a) Department of CSE, CGC College of Engineering, Landran, Mohali, Punjab, India
(b) Department of CSE, Engineering College Bikaner, Bikaner, Rajasthan, India
6.1 Introduction
The generative adversarial network (GAN) is an artificial intelligence-based technique built on the deep learning modalities of the machine learning paradigm. It is an unsupervised learning technique. GANs were initially created in 2014 to generate new data points from existing data points. In a GAN, two competing neural networks are made to work against each other to improve their quality. The working principle of GANs can best be described by the example of a generator that produces some output and a tester that checks the generated output for authenticity. The tester knows what is correct, so, based on the tester's feedback, the generator keeps improving its output; the generator is like a blind man who improves his results based on the acceptance or rejection of his output.
GANs are used for generative modeling, that is, a model is used to create new instances from preexisting instances, such as the creation of new images that are quite similar to, but still different from, already existing images. GAN-based models work like a game in which both players try to trick each other and ultimately solve the puzzle. If GAN-based methods are properly trained, they can be used effectively to create new data items conforming to the specifications of the items in the training set.
The GAN is composed of two contending neural networks that work against each other in a competitive mode to investigate, capture, and duplicate the variations in the dataset. GANs are composed of three distinct components:
• Generative: It aims at learning a generative model that describes how the data is generated in terms of a probabilistic model.
• Adversarial: The adversarial-setting-based training of the model is carried out by this unit.
• Networks: The training of the model is carried out using deep learning-based artificial intelligence methods.
The working of a GAN is shown in Fig. 6.1, which has two components, the generator and the discriminator. The task of the generator component is to create self-made illustrations of the data, which may be images, audio, or videos, and the discriminator component tries to classify the input given to it as either real data or a self-made illustration. In GANs, the generator and the discriminator are both neural networks, and they compete against each other in the training phase. The training of the generator and discriminator is carried out over several iterations, and with each subsequent iteration the capabilities of both components are enhanced: the generator learns to generate better illustrative samples, and the discriminator becomes more proficient at judging the illustrations as fake samples. In short, GANs are based on a minimax game in which the discriminator tries to maximize its ability to tell real data from generated data, while the generator tries to minimize the discriminator's gains, that is, to maximize its losses.
Fig. 6.1 Simple architecture of GANs (https://www.geeksforgeeks.org/generative-adversarial-network-gan/).
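This minimax game is commonly written as the following value function (the standard GAN objective; the notation is assumed here rather than taken from this chapter):

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]

The discriminator D maximizes V by assigning high probability to real samples x and low probability to generated samples G(z), while the generator G minimizes V by producing samples that D scores as real.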
Although GANs have emerged just a few years ago, over a short span of time a large
number of variants have emerged. The most commonly used variants of the GANs are
1. Vanilla GAN: In this, the basic multilayer perceptron-based generator and discriminator neural networks are used, and it is one of the simplest implementation of the
GAN that tries to augment the mathematical functions based on the stochastic gradient descent approach.
2. Conditional GAN (CGAN): This GAN implementation is based on the setting up of
the condition-based parameter in the deep learning approach. The extra condition
parameter is augmented in the generator component of the GAN to generate the output data. The discriminator is feed with the input data that has labels associated with it
to assist it to distinguish between the actual data and the morphed fake data.
A review of techniques to detect the GAN-generated fake images
3. Deep convolutional GAN (DCGAN): This implementation of the GAN uses the convolutional neural networks in place of the multilayer perceptrons, but these CNNs
does not contain the max-pooling layer that has been substituted with the convolutional stride and the layers of the CNN are not completely connected. Over the years,
this implantation of the GANs has become the most widely used as well as the most
promising implantation of the GANs.
4. Laplacian pyramid GAN (LAPGAN): This implementation is mainly used for producing superior quality images. The image is first repeatedly downsampled and then upsampled again until it returns to its original size, with noise introduced at each level of the pyramid. The implementation uses a large number of generator and discriminator modules, one pair for each of the distinct levels of the Laplacian pyramid.
5. Super-resolution GAN (SRGAN): This implementation aims at producing high-resolution images by enhancing low-resolution inputs, while taking care that the upscaling does not introduce noise or errors into the images. It is made up of a deep neural network combined with an adversarial network.
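For illustration, a minimal DCGAN-style generator can be sketched in PyTorch as follows. The layer widths and the 64 × 64 output size are assumptions chosen for brevity, not taken from any particular paper discussed in this chapter; the point is simply to show strided transposed convolutions used in place of pooling and fully connected hidden layers.

```python
import torch
import torch.nn as nn

class DCGANGenerator(nn.Module):
    """Minimal DCGAN-style generator: strided transposed convolutions,
    batch normalization, no pooling and no fully connected hidden layers."""
    def __init__(self, latent_dim=100, feature_maps=64, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            # latent vector (latent_dim x 1 x 1) -> 4 x 4 feature maps
            nn.ConvTranspose2d(latent_dim, feature_maps * 8, 4, 1, 0, bias=False),
            nn.BatchNorm2d(feature_maps * 8),
            nn.ReLU(True),
            # 4 x 4 -> 8 x 8
            nn.ConvTranspose2d(feature_maps * 8, feature_maps * 4, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_maps * 4),
            nn.ReLU(True),
            # 8 x 8 -> 16 x 16
            nn.ConvTranspose2d(feature_maps * 4, feature_maps * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_maps * 2),
            nn.ReLU(True),
            # 16 x 16 -> 32 x 32
            nn.ConvTranspose2d(feature_maps * 2, feature_maps, 4, 2, 1, bias=False),
            nn.BatchNorm2d(feature_maps),
            nn.ReLU(True),
            # 32 x 32 -> 64 x 64 RGB image scaled to [-1, 1]
            nn.ConvTranspose2d(feature_maps, channels, 4, 2, 1, bias=False),
            nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

# Example usage: generate a batch of 64 x 64 fake images from random noise.
z = torch.randn(16, 100, 1, 1)
fake_images = DCGANGenerator()(z)  # shape: (16, 3, 64, 64)
```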
GAN-based techniques can be used in a large number of applications, such as:
• to augment image datasets by creating more synthetic images
• to create different facial expressions
• to create real-looking images
• to create cartoon images
• to create images from text descriptions
• to create emojis from photographs
• to create edited photographs
• to improve the resolution of images
• to transform the clothing in images
• to create 3D objects
• to complete incomplete images
• to synthesize videos
These are just a few of the applications of GANs. They have recently created a lot of excitement as one of the most fascinating outcomes of current AI advancements, and many more exciting applications are expected in the near future.
GANs have many advantages. The foremost is that they are unsupervised learning models: they do not require labeled data and can learn from the data itself. Moreover, GAN-based methods can produce data that is as good as real data. GANs can generate not only numeric or alphanumeric data but also multimedia data, that is, images and videos, that are practically indistinguishable from real data. Thus, GAN-generated images have diverse
applications in fields such as marketing, gaming, and mass media advertising. GANs not only learn from the data itself but can also model complex data, and they have a wide range of applications in machine learning.
GANs have received a great deal of hype in recent times, but they have their limitations as well. They depend on the training data: if the training data is incorrect or of poor quality, GAN-based methods can fail. They cannot create genuinely novel things; they can only recombine what they have seen in previous examples. The real strength of a GAN depends on the coordination of the generator and the discriminator. Both need to be fine-tuned, and the strength of the generator is of no use if the discriminator is weak, and vice versa; they must work in synchronization to produce correct results. The fabricated content generated by GANs is commonly termed DeepFakes.
6.2 DeepFake
A new term, DeepFake, has emerged in the digital world. It is derived from the two terms deep learning and fake and denotes a new kind of product created by artificial intelligence. In layman's terms, DeepFakes are false media, such as images, videos, or sounds, created using deep learning techniques. Deep learning is a branch of artificial intelligence in which networks with a large number of layers, loosely modeled on the neurons of the human brain, use sets of algorithms to make informed decisions. The power of deep learning-based algorithms has raised the fear of creating content that does not exist but convincingly mimics the real-world things it purports to depict.
Fig. 6.2 shows an example of a DeepFake image created by morphing an original image. We can thus say that DeepFakes are morphed video clips, images, sounds, or other digital representations created using sophisticated artificial intelligence algorithms that fabricate media and give the impression of being realistic. DeepFakes emerged only a couple of years ago, but the technology has been refined rapidly and poses a serious threat to public figures such as celebrities, political leaders, and technology leaders. The first DeepFake incident to catch the attention of the media occurred in 2017, when video footage featuring famous Bollywood figures went viral; the footage was not real but an amalgamation of a celebrity's face with the details of another actor, created using DeepFake technology. DeepFakes can be created from the large number of images that are available on the internet. Popular public figures are therefore the ones most likely to be targeted, because, owing to media coverage, a large number of their images and videos are readily available online.
Fig. 6.2 Fake image generated using GAN (https://spectrum.ieee.org).
To reveal how deep learning technology can be abused, scientists at the University of Washington created and posted a DeepFake-generated video of President Barack Obama on social media, demonstrating that they could make a fake video of President Obama say whatever they wanted. One can well imagine the harm this technology can do to the image of public figures, and it poses a great threat to security at large. Fake news and DeepFakes can thus combine to tarnish authentic information and create misunderstandings and miscommunications supported by fabricated facts.
Although DeepFakes emerged only a few years ago, the technology has developed and improved at a rapid rate. Scientists have devised methods that allow them to edit the transcript of a video and alter the words spoken by the person in it. In other work, researchers at Stanford University have developed methods that can not only manipulate facial expressions but also control the three-dimensional head movement of the characters in a video, make them blink, or direct their gaze, all with the help of GANs. These capabilities could help the movie industry dub films into other languages, because the results look unbelievably photorealistic. The concern of the research fraternity, however, lies with the counterpoint: what if these techniques are abused and used for illegitimate activities?
Initially, it was believed that DeepFakes could be created only for celebrities or public figures for whom large numbers of images are available in the public domain. However, recent developments by Samsung's
AI lab have produced living portraits of Salvador Dali, Marilyn Monroe, and many others; they have even created an image of a smiling Mona Lisa. All of this has been achieved using only a limited number of photographs, as illustrated with the examples in Fig. 6.3.
Fig. 6.3 Sample source and generated fake images (https://miro.medium.com).
The fact that only a limited number of photographs is required has raised concern among ordinary people, who had initially believed they were invulnerable to DeepFakes because not enough of their images were available to train the computer procedures that create them.
Having seen how realistic the images created by artificial intelligence can be, the big question now is how to control and prevent the misuse of this technology, which offers many advantages but also carries real threats. Many questions need to be pondered. Should laws be passed obliging social networking websites to detect DeepFakes and subsequently remove them? Should the intention behind creating a DeepFake be given any consideration when removing it, and can DeepFakes be differentiated on the basis of whether they were created for entertainment or for pernicious ends?
6.3 DeepFake challenges
The rapid development of artificial intelligence has posed a serious threat to the authenticity of the multimedia content being generated. Recent advances in deep learning that enabled DeepFakes have further intensified the spread of misinformation, and it is believed that the problem of misinformation supported by fake multimedia content will only grow over the years. As the technology develops, the approaches used to generate automated fake multimedia content will improve further, and it will become even more challenging to discriminate between original and fake content.
DeepFake-based multimedia content poses a great many challenges, including the following:
1. creating distress during difficult situations
2. threatening the reputation of famous personalities
3. spreading hatred and disrespect toward innocent people
4. causing loss of faith in digital content
5. providing a deceiver's dividend, whereby a deceiver can deny his own words by claiming the content is fake, even when he actually said or did it
6. creating fake pornography that causes mental distress to the affected
7. using fake images or even biometrics for financial fraud
8. creating fake news and hoaxes that may cause social distress
9. harming individuals or organizations
10. enabling exploitation
11. leading to sabotage
12. distorting democratic discourse
13. manipulating elections
14. eroding trust in institutions
15. aggravating social divisions
16. undermining public safety
17. damaging international relations
18. endangering national security
19. undermining journalism
20. leading to false allegations
These are only a few of the affected domains; in brief, DeepFakes are a looming challenge for national security, individual privacy, and democracy. There is therefore an urgent need to restrain the spread of fake digital content and to devise methods that can detect fake content and stop it before it spreads further (DeepFakes: A Looming Challenge for Privacy, Democracy, and National Security).
6.4 GAN-based techniques for generating DeepFake
There are two broad ways in which GANs can generate DeepFakes: one uses image-to-image translation and the other uses text-to-image synthesis. The GAN-based techniques for generating DeepFakes are discussed in the following sections.
6.4.1 Image-to-image translation
Image-to-image translation aims to convert one image into another; the goal is to learn how an input image can be mapped to an output image. This technique can be used in a variety of ways, such as style transfer, image super-resolution, image inpainting, object transfiguration, season transfer, and image enhancement.
Image-to-image translation is also referred to as PIX2PIX translation; a sample is illustrated in Fig. 6.4. In this approach, conditional GAN (CGAN) models are used. Image generation existed earlier as well, but a separate model had to be trained for each type of translation. The CycleGAN introduced a cycle-consistency loss that enables inverse conversion without loss of information (a standard formulation of this loss is restated after this paragraph). The cyclic technique does not require paired training images; instead, it trains GAN networks on two distinct domains, learning the features of each domain so that one can be translated into the other seamlessly. Apart from CycleGAN, other models such as BicycleGAN and StarGAN have also been developed for image-to-image translation, and they are discussed in the following sections.
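For reference, the cycle-consistency loss introduced in the CycleGAN work can be written as follows (restated here in its standard form, where G maps domain X to Y and F maps Y back to X):

```latex
\mathcal{L}_{\mathrm{cyc}}(G, F) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\lVert F(G(x)) - x \rVert_1\big]
  + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\big[\lVert G(F(y)) - y \rVert_1\big]
```

This term is added to the two adversarial losses so that an image translated to the other domain and back is forced to resemble the original.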
6.4.1.1 StarGAN: Unified generative adversarial networks for multidomain
image-to-image translation
StarGAN is a scalable technique for image-to-image translation that can serve several domains using just a single model. Its unified architecture allows it to be trained concurrently on multiple datasets of distinct domains with a single StarGAN network [1]. It produces images of superior quality and is very flexible in translating input images into distinct target domains. In this work, the authors demonstrate their results by transforming the facial expressions of the input images.
6.4.1.2 Toward multimodal image-to-image translation
This method proposed a BicycleGAN model for image-to-image translation by joining
two distinct GAN models namely Conditional Variational Autoencoder GAN and Conditional Latent Regressor GAN [2]. They have harnessed the good features of both the
approaches and the proposed model has the capabilities to implement the interconnect
among the hidden encoding and output individually each director jointly and by that it is
Fig. 6.4 Sample images based on image-to-image translation (https://miro.medium.com).
This method has been compared against several state-of-the-art encoder-based methods and gives superior results in comparison to all of them.
6.4.1.3 U-GAT-IT: Unsupervised generative attentional networks with adaptive
layer-instance normalization for image-to-image translation
This is an unsupervised image-to-image translation method that uses a novel attention module and a learnable normalization function in an end-to-end manner. The role of the attention module is to guide the method to focus on the important regions that distinguish the source and target image domains, based on an attention map produced by an auxiliary classifier [3]. The proposed method can handle geometric and shape changes between the domains. The authors also integrate an adaptive layer-instance normalization procedure that supports the attention-guided function in controlling the extent of change in shape and texture, based on parameters learned from the dataset during training.
6.4.1.4 Image-to-image translation with conditional adversarial networks
Conditional adversarial networks are the baseline method for image-to-image translation: they learn the mapping from source to target image, and they also learn a loss function for training this mapping [4]. As a result, the same generic loss formulation can be applied to a wide variety of image translation problems. The method is distributed as software named PIX2PIX, which has been widely used by many artists to experiment with the approach because of its ease of use and broad applicability.
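For reference, the objective used in the pix2pix work combines the conditional GAN loss with an L1 reconstruction term; the standard formulation from the original paper can be stated as:

```latex
G^{*} = \arg\min_G \max_D \;
  \mathcal{L}_{\mathrm{cGAN}}(G, D) + \lambda\, \mathcal{L}_{L1}(G),
\qquad
\mathcal{L}_{L1}(G) = \mathbb{E}_{x, y, z}\big[\lVert y - G(x, z) \rVert_1\big]
```

The L1 term keeps the output close to the ground-truth target, while the adversarial term pushes it toward the distribution of realistic images.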
6.4.1.5 Multichannel attention selection GAN with cascaded semantic guidance
for cross-view image translation
The multichannel attention selection GAN with cascaded semantic guidance addresses translation between images with completely distinct views, which may involve a high degree of deformation and is a quite challenging task. The proposed work performs this task with very good precision: the system can create natural scene images at arbitrary viewpoints, guided by an input image of the desired scene together with a novel semantic map [5]. The method is a two-step process. In the first step, the input image and the target semantic map are fed into a cycled semantic-guided generator to produce initial coarse results. In the second step, these coarse results are refined by the multichannel attention selection module.
6.4.1.6 Cross-view image synthesis using geometry-guided CGANs
In this work, the authors propose a cross-view image synthesis method based on geometry-guided CGANs. Pixel information is preserved between the two viewpoints so that the generated output retains the realistic appearance of the input image. To achieve this, homography is used to guide the mapping between the distinct views based on their overlapping regions, so that the details of the input image are preserved [6]. To make the result realistic, the regions missing after the geometric transformation are inpainted using GANs. Because of the geometric constraints, fine details can be added to the generated image, and the proposed approach gives much better results for cross-view image generation than simple pixel-based generation methods.
6.4.1.7 Cross-view image synthesis using CGANs
In this work, the authors propose CGANs to generate cross-view images of natural scenes, from aerial to street view and from street view to aerial [7], which is a challenging task in computer vision. The task is even harder when new images must be generated for a completely different view, because understanding and transforming image appearance and semantics across viewpoints is nontrivial. The authors use two novel architectures, Crossview Fork and Crossview Sequential, which can generate images at resolutions of 64 × 64 and 256 × 256. The Crossview Fork architecture uses one generator and one discriminator; its generator hallucinates the output image together with its semantic segmentation. The Crossview Sequential architecture uses two CGANs: the first unit creates the output image, which is fed to the second unit to generate the semantic segmentation map, and feedback from the second unit is supplied to the first to improve image quality. The proposed method works well for generating natural scene images through cross-view image-to-image translation.
6.4.1.8 WarpGAN: Automatic caricature generation
With improvements in GAN-based architectures, automatic caricature generation methods have been developed that can generate caricatures from an input face image. The WarpGAN architecture can not only produce caricatures but also transform texture styles [8]. It works by automatically learning to predict a set of control points that are used to warp the image into a caricature while preserving the identity of the original photograph. The caricatures generated by WarpGAN closely resemble hand-drawn caricatures, but with the prominent features of the face more
exaggerated. This is possible because WarpGAN uses an identity-preserving adversarial loss that helps the discriminator differentiate between the images under study, and it also allows the generated caricatures to be customized by controlling the style and the degree of exaggeration in the output image.
6.4.1.9 CariGANs: Unpaired photo-to-caricature translation
CariGAN is the first architecture used for creating a caricature from an input image. It is based on a two-step process: in the first step the geometric exaggeration is carried out, and in the second step the look-and-feel style is applied. Two distinct GAN models are used for these steps, namely CariGeoGAN and CariStyGAN [9]. CariGeoGAN carries out the geometric transformation from the input image to the target caricature, while CariStyGAN transfers the look and feel of caricatures onto the input photo without changing its geometry. Breaking the task into two steps makes cross-domain translation straightforward; the output images closely resemble hand-drawn caricatures, and the results can be controlled by tuning parameters that adjust the color and texture of the output.
6.4.1.10 Unpaired photo-to-caricature translation on faces in the wild
Unpaired photo-to-caricature translation on faces in the wild can transform an input photo into caricatures in distinct styles, and the same model can be used for other demanding image-to-image translation applications. The design uses a two-path approach to capture both the overall structure and the local features required for the translation [10]. Two discriminators are used, one coarse and one fine. The generator adds an extra perceptual loss to the adversarial and cycle-consistency losses in order to learn across the two distinct domains. The model can also learn different styles from supplementary noise given as an additional input.
6.4.2 Text-to-image synthesis
GAN-based deep learning architectures have the unique ability to generate images from text descriptors. A text phrase is given as input, and the GAN model generates an image based on the description. A sample architecture based on the Reed et al. model is shown in Fig. 6.5; this illustrative GAN-based model can successfully convert text phrases into images, and the diagram shows how the text embeddings fit into the sequential image generation model. In the generator network, the text input is processed by fully connected layers and concatenated with random noise in the form of a vector z.
Fig. 6.5 Text-to-image synthesis process (https://www.oreilly.com).
Fig. 6.6 Sample text-to-image synthesis (https://cdn-images-1.medium.com).
In the discriminator network, the text input is likewise compressed using a fully connected module, as in the generator, and is then replicated and concatenated with the image features. Fig. 6.6 shows a sample of how text phrases can be converted into actual images of flowers. GAN-based models have been refined and fine-tuned to generate photorealistic images from text descriptors alone; in the following sections, various state-of-the-art methods used to generate high-resolution images from textual information are discussed.
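The conditioning step described above can be sketched roughly as follows. This is an illustrative simplification in PyTorch, not the authors' code: the layer sizes are assumptions, and the sentence embedding is a random tensor standing in for the output of a pretrained text encoder.

```python
import torch
import torch.nn as nn

class TextConditionedGeneratorInput(nn.Module):
    """Compress a sentence embedding with a fully connected layer and
    concatenate it with a random noise vector z, in the style of
    Reed et al.-type text-to-image GANs (dimensions are illustrative)."""
    def __init__(self, text_dim=1024, compressed_dim=128):
        super().__init__()
        self.compress = nn.Sequential(
            nn.Linear(text_dim, compressed_dim),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, text_embedding, z):
        t = self.compress(text_embedding)   # (batch, compressed_dim)
        return torch.cat([z, t], dim=1)     # (batch, noise_dim + compressed_dim)

# Usage: the concatenated vector is fed to an image generator (not shown here).
text_embedding = torch.randn(8, 1024)  # placeholder for a pretrained text encoder output
z = torch.randn(8, 100)
conditioned_input = TextConditionedGeneratorInput()(text_embedding, z)
print(conditioned_input.shape)          # torch.Size([8, 228])
```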
6.4.2.1 Generative adversarial text-to-image synthesis
Recent advances in artificial neural networks have given computer systems the power to transform text into pixels, facilitating the generation of images from text descriptions [11]. This has been made possible by advances in deep convolutional GAN networks. In the proposed work, the authors generate images of birds and flowers from detailed descriptions of their structure and features. They have
spent considerable effort building an efficient GAN model and a training dataset that can create images of birds and flowers from human-written text descriptors. They used five distinct text descriptors along with the Caltech-UCSD dataset for birds and the Oxford-102 dataset for flowers.
6.4.2.2 StackGAN: Text to photo-realistic image synthesis with stacked generative
adversarial networks
This approach aims at generating images from text descriptors. A stacked generative adversarial network (StackGAN) [12] is used to produce 256 × 256 images that mimic realistic photographs. The method generates photorealistic images by refining sketches, decomposing the process of sketch refinement into subtasks. In the first stage, a GAN is used to draw the basic shapes of the objects and their colors from the given text description, yielding a basic low-resolution image. In the second stage, the results of the first stage are combined with the input text descriptors to generate more realistic images, correcting the shortcomings of the first stage and yielding high-resolution photorealistic results. To add further realism, the proposed model also adopts a conditioning augmentation method, which smooths the conditioning and improves the generated images. This approach is therefore capable of producing high-resolution images of very good quality.
6.4.2.3 MC-GAN: Multiconditional generative adversarial network for image
synthesis
The proposed method aims at generating an image from text descriptors when a background base image is already given, so that a new object can be created at a specified location. This approach extends text-to-image generation: new objects can now be added to preexisting images at specified locations, as dictated by the text descriptors. This is made possible by the multiconditional generative adversarial network (MC-GAN) [13], which can condition on the background and the desired object simultaneously. The model employs a synthesis block that disentangles the object and the background during the training phase, enabling MC-GAN to generate near-real images at a resolution of 128 × 128 by controlling the amount of background information taken from the specified base image and combining it with the foreground details given by the text descriptors. The proposed method can smoothly blend the plausible orientation and layout of the object with the background image. It gives excellent results because the MC-GAN model acts like a pixel-wise gating function that regulates how much evidence is drawn from the background image, guided by the text descriptors of the new object to be placed in the foreground.
6.4.2.4 MirrorGAN: Learning text-to-image generation by redescription
The authors developed a three-module method, named MirrorGAN [14], that generates images from text descriptors by redescription. The first module, STEM, is a semantic text embedding module that generates word-level and sentence-level embeddings. The second module, GLAM, is a cascaded architecture for generating target images from coarse to fine scales, leveraging both local word attention and global sentence attention to gradually improve the diversity and semantic consistency of the produced images. The third module, STREAM, regenerates a text description from the generated image so that the image can be aligned with the semantics of the original text.
6.4.2.5 StackGAN++: Realistic image synthesis with stacked generative adversarial
networks
This is an improvement of StackGAN: the approach takes low-resolution images generated in earlier stages together with the text descriptions and creates high-resolution images that look as good as real ones. The proposed method is based on a multistage GAN architecture and is suitable for both conditional and unconditional GAN-based image generation [15]. The architecture contains multiple generators and discriminators arranged in a tree structure, so that multiple-scale images of the same scene are generated from the distinct branches of the tree. The approach trains more stably than StackGAN because it jointly approximates multiple distributions. Conditioning augmentation is also integrated to improve the smoothness of the conditioning and the diversity of the generated images.
6.4.2.6 Conditional image generation and manipulation for user-specified content
In this approach, the authors created a dataset named CelebTD-HQ containing facial images and associated text descriptors. The method uses a two-step pipeline: in the first step, a textStyleGAN model is built and trained on text; in the second step, the pretrained weights of this textStyleGAN model are used to carry out semantic manipulation of the facial images. The approach learns semantic directions in the latent space [16]. This method can produce conditional images through semantic manipulation driven by the text descriptors.
6.4.2.7 Controllable text-to-image generation
This method generates high-quality images from natural-language text descriptors using controllable GANs. The generator uses word-level spatial attention, which allows it to generate and manipulate the subregions of the image corresponding to the relevant textual words [17]. This method also employs a
supervisory feedback mechanism based on the text descriptors, which establishes a correlation between the words and the regions of the image, creating an efficient training signal that can change particular visual attributes without disturbing the rest of the image content. The method is thus capable of generating and then manipulating synthetic images through text-based descriptors; the main focus of this work is to change the category, color, and texture of the images using text descriptors.
6.4.2.8 DM-GAN: Dynamic memory generative adversarial networks
for text-to-image synthesis
This approach tries to overcome the drawbacks of earlier text-to-image synthesis methods, which rely heavily on the quality of the initial base image and in which each word contributes differently depending on the image content. The authors use a dynamic memory-based GAN [18] to synthesize good quality images from text descriptors. The fuzzy content of the initial image is refined using a dynamic memory module with two gates, a memory writing gate and a response gate. The memory writing gate selects the textual information relevant to the base image content, which improves the quality of the images generated from the text descriptors, and the response gate combines the information retrieved from the dynamic memory with the attributes of the image. The method has been tested on the Caltech-UCSD Birds 200 and Microsoft Common Objects in Context (COCO) datasets.
6.4.2.9 Object-driven text-to-image synthesis via adversarial training
The Obj-GAN method aims to generate realistic images by efficiently capturing the object-level textual information required for creating them. The model consists of three components, namely an object-driven attentive image generator, an object-wise discriminator, and an object-driven attention mechanism [19]. The text descriptors and a pregenerated semantic layout are given as input to the image generator, which creates high-resolution synthetic images by iteratively refining coarse images into high-quality ones. At each iteration, the generator improves the regions of the image by attending to the words associated with the bounding box of each region. The role of the attention layer is to use the class label of each region to query the words relevant to that region, and the discriminator checks all the bounding boxes to validate that the generated objects are consistent with the pregenerated semantic layout of the image.
6.4.2.10 AttnGAN: Fine-grained text-to-image generation with attentional
generative adversarial networks
The proposed method aims to generate fine-grained synthetic images using attentional GANs [20], which employ an attention-driven, multistage refinement mechanism for generating photorealistic images from text descriptors. The method synthesizes different subregions of the image by attending to the words of the text descriptor that are most relevant to those subregions. It also deploys a deep attentional multimodal similarity module that measures the matching between the image and the text and uses this signal to train the generator. The method produces a more refined image after each stage and has been tested on the CUB and COCO datasets.
6.4.2.11 Cycle text-to-image GAN with BERT
In this work, the authors create images from image captions using attention-based GAN models, in which the models learn attention mappings from words to image features. To fine-tune the model, they use a cyclic design that maps the generated images back to the captions [21]. The authors also integrate the pretrained BERT model from natural language processing to encode the caption text that provides the initial features for image generation. The proposed model outperforms the plain attention GAN.
6.4.2.12 Dualattn-GAN: Text-to-image synthesis with dual attentional
generative adversarial network
A text-to-image synthesis approach has been described using the Dual Attentional Generative Adversarial Network architecture. The authors use two attention mechanisms to improve local details and the overall structure by relating the text descriptor features to the corresponding regions of the image [22]. One is textual attention and the other is visual attention: textual attention improves the interaction between the text constructs and the visual features, while visual attention models the internal representation of the image along the spatial and channel axes, helping to capture the overall structure of the image more effectively. An attention embedding module is used to fuse the features from the multiple paths. The training of the GAN is stabilized using spectral normalization, and the capability of the CNNs is improved using an inverted-residual structure.
In just a few years, GAN-based models have been created that can generate fake images either by transforming existing images or by taking text descriptors as input. These models are very helpful to researchers for generating training datasets for deep learning models, where large amounts of data are needed, but the same technology can also be used for illegitimate purposes and thus poses a serious threat. Therefore,
methods need to be developed that can distinguish between real and synthetic images. In the following sections, artificial intelligence-based methods for detecting DeepFakes are discussed.
6.5 Artificial intelligence-based methods to detect DeepFakes
It has been observed that GAN-based architectures can produce photorealistic images that are a security concern: they may be used to deceive others by spreading false news over social media and falsifying information, thus causing mental agony and revulsion. With further advances in GANs, the quality of these false images will improve substantially and may lead to even more serious issues. It therefore becomes a major challenge to devise methods that can distinguish between real and GAN-generated false images. Although GAN-generated images can fool individuals very well, they find it much harder to escape computer-based, artificial intelligence-powered detectors, which are robust and not vulnerable to the biases that humans are. In the following sections, we discuss the various state-of-the-art methods developed so far to detect DeepFake images.
6.5.1 Can forensic detectors identify GAN-generated images?
This work investigates how to distinguish between real images and GAN-generated fake images. The proposed method verifies the authenticity and originality of images using forensic detectors [23]. The authors use two approaches to detect fake images generated by GANs. The first approach is intrusive: the detector is built using the GAN architecture itself, so some functions of the GAN are reused in the detector to recognize GAN-generated images. The other approach is nonintrusive, meaning no module of the GAN is available and the detector is built on its own, without any input from the GAN that created the images. The authors evaluate three nonintrusive methods, namely inception scores, a face quality assessment method, and a trained VGG-16 network model based on learned features. The intrusive approach detects fake images quite efficiently, whereas among the nonintrusive approaches the VGG-based approach is good at detecting fake images if it has sufficient training data, but the results are poor if there is a mismatch between the training and test data sets.
6.5.2 Detection of deep network-generated images using disparities
in color components
The proposed method aims to detect fake images using disparities in the color components of the images [24]. DeepFake images are generated by deep networks in the RGB color space, with no explicit constraints on the
correlation among the color components, which makes them easier to distinguish from real images. The proposed method computes statistics of the color components and uses them to distinguish fake images from real ones. The distinction is made using a compact and effective feature set that has been validated with different binary classifiers, and the method works whether the generative models are known or unknown.
6.5.3 Detecting and simulating artifacts in GAN fake images
The task of classifying images as fake or real is challenging because a training dataset is usually unavailable, and the model used by the attacker to generate the fake images is also not readily accessible. In this approach, the authors therefore simulate the fake image generation process using an AutoGAN model, which reproduces the artifacts of the most common GAN approaches; they also locate the artifacts introduced by the upsampling operation during fake image generation [25]. In doing so, they discovered that these artifacts appear as replications of spectra in the frequency domain, so they proposed a spectrum-based classifier rather than a pixel-based classifier to distinguish fake from real images. This approach gives very good results in detecting CycleGAN-generated fake images.
6.5.4 Detecting GAN-generated fake images using cooccurrence matrices
In this work, the authors propose a combined approach based on deep learning and co-occurrence matrices to detect fake images generated by GANs. Co-occurrence matrices are computed on the color channels of the pixels, and a deep learning-based CNN model is then trained for classification [26]. The pixel co-occurrence matrices are passed directly to the deep learning model to classify real and fake images and hence detect DeepFake images generated by GAN-based models. The proposed method also gives good results when trained and tested on distinct datasets.
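A pixel co-occurrence matrix of the kind used as a CNN input in this line of work can be computed channel-wise as in the following sketch. This is an illustrative simplification (horizontally adjacent pixel pairs only, with a stand-in random image), not the exact preprocessing of Ref. [26].

```python
import numpy as np

def cooccurrence_matrix(channel, levels=256):
    """Count how often intensity value i is horizontally adjacent to value j
    in a single color channel; returns a (levels x levels) matrix."""
    channel = channel.astype(np.int64)
    left, right = channel[:, :-1].ravel(), channel[:, 1:].ravel()
    mat = np.zeros((levels, levels), dtype=np.int64)
    np.add.at(mat, (left, right), 1)
    return mat

# Example: per-channel co-occurrence matrices of an RGB image, stacked into a
# 256 x 256 x 3 tensor that a CNN classifier could take as input.
rgb = np.random.randint(0, 256, size=(128, 128, 3), dtype=np.uint8)  # stand-in image
features = np.stack([cooccurrence_matrix(rgb[..., c]) for c in range(3)], axis=-1)
print(features.shape)  # (256, 256, 3)
```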
6.5.5 Detecting GAN-generated imagery using color cues
The proposed method distinguishes between fake and real images using color- and saturation-based forensic cues. For the color-based cue, the authors observe that GAN-generated images have a higher correlation between pixels in chromaticity space than real-world images. For the saturation-based cue, the frequency of underexposed and saturated pixels is reduced because the generator component of the GAN applies a normalization step [27]. To distinguish between real images and GAN-generated images, they have
classified real and fake images using an SVM classifier. This approach achieved an AUC of approximately 70% on the NIST MFC2018 dataset.
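A minimal version of the saturation cue could look like the following sketch; the pixel thresholds and the linear SVM here are assumptions for illustration, and the published method in Ref. [27] uses richer color and chromaticity statistics.

```python
import numpy as np
from sklearn.svm import SVC

def exposure_features(rgb, low=5, high=250):
    """Fraction of under-exposed and saturated pixels per color channel;
    GAN generators tend to suppress both extremes."""
    feats = []
    for c in range(3):
        channel = rgb[..., c]
        feats.append(np.mean(channel <= low))    # under-exposed fraction
        feats.append(np.mean(channel >= high))   # saturated fraction
    return np.array(feats)

# Stand-in data: replace with real and GAN-generated images loaded as uint8 RGB arrays.
real = [np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(10)]
fake = [np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8) for _ in range(10)]
X = np.vstack([exposure_features(img) for img in real + fake])
y = np.array([0] * len(real) + [1] * len(fake))
clf = SVC(kernel="linear").fit(X, y)  # linear SVM over the exposure features
```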
6.5.6 Attributing fake images to GANs: Analyzing fingerprints
in generated images
This method classifies images as real or fake based on the fingerprints left in GAN-generated images. The authors also try to identify the GAN network that generated a fake image, since each GAN network creates a different fingerprint and even a small difference in GAN training can change the fingerprint of the generated image [28]. They use a learning-based method with an attribution network model that maps an input image to its corresponding fingerprint image. For each GAN model a model fingerprint is computed, and the image fingerprint is then matched against the model fingerprints to discriminate between real images and GAN-generated images and to attribute the fake image to its source. The proposed method achieves 99.5% accuracy on the CelebA dataset with fake images generated by distinct GAN models such as ProGAN, SNGAN, CramerGAN, and MMDGAN.
6.5.7 FakeSpotter: A simple baseline for spotting AI-synthesized
fake faces
In this work, the authors propose FakeSpotter, which detects fake images by analyzing the behavior of neurons: it has been observed that the layer-wise neuron activations differ for fake images, which makes them a very strong feature for detecting artificially generated images. The neuron behavior is captured for both real and fake images, and an SVM classifier is then trained to classify them [29]. The proposed method achieves an accuracy of 84.7% based on the FaceNet model, tested on the CelebA-HQ and FFHQ datasets.
6.5.8 Incremental learning for the detection and classification of
GAN-generated images
In this work, the authors propose a method to detect previously unseen fake images. They use a detection model based on multitask incremental learning that can both detect and classify GAN-generated fake images. The classifier is placed at different positions following the iCaRL algorithm to manage the incremental learning; the two resulting configurations are called the multitask multiclassifier and the multitask single classifier [30]. The proposed model has been tested on five distinct GAN models, namely CycleGAN, StyleGAN, StarGAN, ProGAN, and Glow, and an XceptionNet model is used for detecting the GAN-generated fake images.
6.5.9 Unmasking DeepFakes with simple features
DeepFake generation methods have progressed to the extent that they can generate an image even from a text description. However, GAN-based methods still leave artifacts in the fake images that may be missed by the human eye; artificial intelligence-based methods can catch those artifacts and easily discriminate between real and fake images. In this work, the authors capture frequency-domain features of the images and use a classifier to separate real from fake images [31]. The main strength of the method is its ability to give very good results with a limited annotated training set; moreover, it also works very well with unsupervised classifiers. The proposed method achieves 100% accuracy after being trained with only 20 annotated images.
6.5.10 DeepFake detection by analyzing convolutional traces
This approach analyzes the human faces present in DeepFake images, based on the fact that artificially generated images leave behind fingerprints that can be detected by forensic tools. The authors propose an expectation-maximization (EM) approach that extracts local image features specific to the generative model that produced the image [32]. The method has been tested with the CelebA dataset and is able both to detect the fake images and to identify the network architectures that were used to create them.
6.5.11 Face X-ray for more general face forgery detection
This work proposes a method that detects fake images by computing a gray-scale image of the input, which reveals whether the image is real or forged. If the gray-scale image can be decomposed into the blending of two separate images, the input is revealed to be fake because it contains a blending boundary; otherwise, it is categorized as real [33]. It has been observed that most image manipulation methods blend an altered region into the background of an original image. The method does not perform well on low-resolution images, because in that case the evidence of blending is less pronounced and hence harder to detect.
6.5.12 DeepFake image detection based on pairwise learning
Detecting GAN-generated fake images is still a challenge; in this approach, the authors therefore propose a deep learning method based on contrastive loss. First, state-of-the-art GAN architectures are used to create pairs of fake and real images [34]. Next, they deploy a DenseNet as a two-streamed network into which the
pairwise information is fed as input. In this way, a common fake-feature network is trained through pairwise learning to discern real and artificially created images based on these features. As the last step, a classifier is attached to the end of the fake-feature network to distinguish between artificially generated and real images.
6.6 Comparative study of artificial intelligence-based techniques
to detect the face manipulation in GAN-generated fake images
GANs have brought in a new era in which artificial images can be created by specifying textual descriptions or by manipulating existing images through transformations of their pixels. Although GANs emerged only a couple of years ago, both fake image generation and fake image detection methods have developed at a rapid rate. In the previous section, we discussed various methods that can detect GAN-generated fake images. From the literature survey, it is observed that most of the harm done by fake images is attributable to the manipulation of faces.
This section carries out a comparative study of the various methods that have been proposed for detecting fake images. For the comparative analysis, we consider four types of methods, corresponding to the following types of fake images: (1) construction of a new face, (2) swapping of the facial identity, (3) manipulation of facial features, and (4) manipulation of facial expressions. For each type, the comparison is made in terms of the features and the classifier used. Performance figures are not compared because different researchers have used different performance measures and distinct datasets, so a fair comparison is not feasible.
6.6.1 Techniques for detecting the construction of a new face
The researchers in Ref. [35] investigated the inner workings of GAN architectures to trace artifacts that can differentiate original images from synthetic ones. The system was evaluated using color-based features, and classification was carried out with a linear SVM classifier. The method achieves an AUC of nearly 70% on the NIST MFC2018 dataset [36].
Yu et al. [37] discovered that each GAN architecture leaves a unique fingerprint in its synthetic images. They formulated a learning-based approach using an attribution network model that maps an input image to its corresponding fingerprint image. The approach derives a correlation index between the image fingerprint and each model fingerprint, which is then used to classify the images. The method has been tested using the dataset named
CelebA [38], which contains real images, together with fake images synthesized by the different GANs proposed in Refs. [39–42]. The proposed method claims to achieve 99.5% accuracy. Although the system gives very good results, it fails if the images are blurred, compressed, noisy, or cropped.
The authors of Ref. [43] inferred that observing neuron behavior can help detect synthetic faces, since neuron activations across different layers form distinct patterns that capture features useful for detecting manipulated facial attributes. They implemented different deep learning-based face recognition systems [44–46] to learn about real and fake faces. Based on the learned features, an SVM classifier was trained to discriminate between real and fake images. The proposed work achieved an accuracy of 84.7% using the FaceNet model on the CelebA-HQ [39] and FFHQ [47] datasets of real images and on fake images from InterFaceGAN [48] and StyleGAN [47].
An analysis of distinct face manipulation approaches is presented in Stehouwer et al. [49], who show that novel attention mechanisms can give good results [50], as they both guide the process and enhance the feature maps of CNN architectures. The proposed method achieves 100% AUC and 0.1% EER on the real-face datasets of Refs. [38, 47, 51] and has been tested with synthetic images created using the GAN-based models of Refs. [39, 47].
An approach for detecting synthesized fake faces [52] has been proposed based on steganalysis features and the statistics of real-world natural images. It uses a combination of pixel co-occurrence matrices and CNN-based deep learning models. The approach was initially validated on images created with CycleGAN [53] and has also been validated on fake images created with different GAN architectures. The approach was applied in the work of Ref. [54], where validation was carried out using the 100K-Face database, achieving an EER of 7.2%.
Different systems for detecting synthesized fake faces were assessed in Neves et al. [54] on the basis of experimental results over different datasets. They concluded that if the experiments are performed under controlled conditions, EER values as low as 0.8% can be achieved, but if the detection experiments are performed in real-world scenarios, the performance of the systems degrades to a great extent.
To test the methods in real-world scenarios, Marra et al. [55] performed experiments aimed at detecting previously unseen fake images. They used a multitask incremental learning model and attempted to detect fake images generated by distinct GAN networks.
The comparative analysis of the different techniques for detecting the construction of a new face is given in Table 6.1. It can be inferred that most of the work in this domain has been carried out using CNN classifiers, and most researchers have used image-related features to distinguish between real and fake images.
Table 6.1 Comparison of different techniques for detecting the construction of a new face.
Work | Features | Classifier
McCloskey and Albright [35] | Color related | SVM
Yu et al. [37] | GAN related | CNN
Kim et al. [43] | CNN neuron behavior | SVM
Stehouwer et al. [49] | Image related | CNN + attention mechanism
Nataraj et al. [56] | Steganalysis | CNN
Neves et al. [54] | Image related | CNN
Marra et al. [55] | Image related | CNN + incremental learning
6.6.2 Techniques for detecting the swapping of the facial identity
The first study on detecting face swapping was proposed by Zhou et al. [57], in which the authors used a two-stream network to detect face manipulation. They fused a GoogLeNet-based CNN face classification stream [58] with an SVM-based triplet stream trained on steganalysis features, which measures a triplet loss over patches of the images under consideration in order to detect the swapping of the facial identity.
The SwapMe app was evaluated by Li et al. [59] to check the generalization capacity of previously trained models for detecting face or identity swapping. This method turned out to be one of the most robust for detecting face swapping, based on the Celeb-DF dataset.
Mesoscopic image features were the focus of two different neural network models with different numbers of layers [60]. The first model, Meso-4, is a CNN architecture comprising four convolutional layers followed by fully connected layers, while the second model modifies Meso-4 with an inception module as proposed by Szegedy et al. [58] and is named MesoInception-4. The method was initially tested on a self-created database for detecting fake images and attained an accuracy of 98.4%. It was later tested on an unseen dataset [59], and the proposed method proved robust on other datasets as well, including the FaceForensics++ dataset.
The vulnerabilities of recent face recognition approaches, namely VGG [44] and FaceNet [46], to DeepFakes based on the DeepFakeTIMIT dataset are described in Korshunov and Marcel [61]. In addition, they evaluated the challenges associated with detecting fake digital content using baseline methods. They used a principal component analysis-based approach for feature reduction and a long short-term memory RNN to discriminate between real and fake digital content, as proposed in Korshunov and Marcel [62]. They also used image quality measures [63] and raw faces as features for detecting fake
images. In total, 129 features were used, based on signal-to-noise ratio, specularity, blur, and related measures. PCA with LDA or SVM classifiers was used for classification, achieving an EER of 3.3% for low-quality (LQ) and 8.9% for high-quality (HQ) videos of the DeepFakeTIMIT dataset.
DeepFakes are generally created by merging synthetic face regions into a real image, and doing so leaves artifacts that can be traced when the 3D head pose is analyzed, as shown by Yang et al. [64]. To support their claims, they investigated the differences between the head pose estimated from the complete set of facial landmarks (68 landmarks were extracted) and that estimated from the landmarks of the central face region. The resulting features were normalized, and an SVM classifier was used for classification. The method was tested on the UADFV dataset, achieving an AUC of 89%. Li and Lyu [65] further extended this work to the detection of fake faces using face warping artifacts, employing a CNN to detect the artifacts. The system was trained using four different CNN variants, as proposed in Refs. [66, 67], and the method was tested on the UADFV and DeepFakeTIMIT datasets with very good results.
The authors of Ref. [51] analyzed face swapping approaches, evaluated them with distinct detection methods for face swapping, and validated the results using the FaceForensics++ dataset. For their evaluation, they considered a CNN-based system using steganalysis features [68], a CNN-based system with specially tuned layers that suppress the image content [69], a CNN-based system with a global pooling layer [70], the MesoInception-4 CNN [60], and a CNN based on XceptionNet [71] pretrained on the ImageNet dataset [72]. They concluded that the XceptionNet-based CNN [71] gave the best overall results.
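As an illustration of how an XceptionNet-based detector of this kind can be assembled, the following sketch fine-tunes an ImageNet-pretrained Xception backbone for binary real/fake classification in Keras; the input size, classification head, and training settings are assumptions, not the exact configuration used in Refs. [51, 71].

import tensorflow as tf
from tensorflow.keras import layers, models

# ImageNet-pretrained Xception backbone without its original classification head
base = tf.keras.applications.Xception(weights="imagenet", include_top=False,
                                      input_shape=(299, 299, 3))

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),   # real (0) vs. fake (1)
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy", metrics=["accuracy"])

# train_ds / val_ds are assumed tf.data.Dataset objects yielding (face crop, label) pairs
# model.fit(train_ds, validation_data=val_ds, epochs=10)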
A fake image detection method based on elementary features, such as the eye color and missing details of the eyes, teeth, or reflections, which are generally present in natural images, has been proposed by Matern et al. [73]. They considered logistic regression and a multilayer perceptron [74] for the classification and achieved an AUC value of 85.1%.
A fake face detection method using a CNN and an attention technique has been proposed by Stehouwer et al. [49], which aims at improving the feature maps of the classifiers being used. The attention map can be inserted into any basic neural network by the addition of a convolutional layer; with this, the method was able to achieve an AUC value of 99.43% and an EER of 3.1%.
Seeing the popularity and relevance of the topic, Facebook, which hosts a huge database of images, has launched a competition named the DeepFake Detection Challenge in collaboration with other organizations. They provided baseline results using a CNN model with six convolutional layers and a fully connected layer, and an XceptionNet model trained with face crops and with full images; these baseline models achieve a precision of 93% with a recall of 8.4%.
Table 6.2 Comparison of different techniques for detecting the swapping of the facial identity.
Work | Features | Classifier
Zhou et al. [57] | Image-related steganalysis | CNN and SVM
Afchar et al. [60] | Mesoscopic level | CNN
Korshunov and Marcel [61] | Lip image-audio speech, image related | PCA + RNN; PCA + LDA; SVM
Güera and Delp [75] | Image + temporal information | CNN + RNN
Yang et al. [64] | Head pose estimation | SVM
Li and Lyu [65] | Face warping artifacts | CNN
Rössler et al. [51] | Image-related steganalysis | CNN
Matern et al. [73] | Visual artifacts | Logistic regression, MLP
Nguyen et al. [76] | Image related | Autoencoder
Stehouwer et al. [49] | Image related | CNN + attention mechanism
Dolhansky et al. [77] | Image related | CNN
Agarwal et al. [78] | Facial expressions and pose | SVM
Sabir et al. [79] | Image + temporal information | CNN + RNN
The comparison of the different techniques for detecting the swapping of the facial identity is presented in Table 6.2. It can be inferred that most researchers have used image-related features with a CNN classifier, and in some cases a combination of a CNN and other state-of-the-art classifiers.
6.6.3 Techniques for detecting the manipulation of facial features
In the early days, the manipulation of facial attributes was studied to check the robustness of facial recognition techniques, and the manipulations were tested against cosmetic surgery, makeup, and occlusion of the face due to external factors. With the advent of DeepFakes, interest in detecting images with manipulated facial attributes has again become popular. In Bharati et al. [80], a restricted Boltzmann machine-based approach has been used to detect images that contain manipulated facial features. In this approach, the detection system is given patches of the face, so as to learn the distinct features of the face and to classify an image as authentic or as one with manipulated features. The system has been validated using synthetic datasets generated from the ND-IIITD dataset [81] and a set of images of famous celebrities. The images of the dataset were manipulated in terms of features such as the smile, the color of the eyes, the shape of the lips, and the texture of the skin. The system achieved accuracies of 96.2% and 87.1% on the celebrity dataset and the ND-IIITD dataset, respectively.
Different variants of CNN architectures have been evaluated by Tariq et al. [82] to detect the manipulation of facial attributes, using the CelebA dataset [38] for the real images; the authors adopted two distinct approaches to generate the fake images: one
approach used the ProGAN [39] architecture to generate fake images, and the other set of fake images was generated using the Adobe Photoshop software. The manipulation of the images was done by applying cosmetic makeup, adding glasses to the face, changing the hairstyle, or putting on hats. They considered images of two distinct sizes, namely 32 × 32 and 256 × 256. The GAN-generated images were detected with 99.99% AUC, whereas the Adobe Photoshop-generated images were detected with 74.9% AUC. The CNN model is thus capable of detecting machine-generated fake images with very good accuracy, whereas it gives only average results for images created with the Adobe Photoshop software.
An application named FakeSpotter has been proposed by Wang et al. [43], which is based on the principle that the behavior of the neurons changes across different layers: the activation patterns of the neurons across different layers can capture distinct features that reveal manipulated images. The authors used the face recognition systems proposed by Parkhi et al. [44], Amos et al. [45], and Schroff et al. [46] for extracting the features and then used an SVM classifier to separate the manipulated and the original images. The proposed method has been tested using the datasets described in Karras et al. [39] and Karras et al. [47] for the original images and synthetic datasets generated using the InterFaceGAN and StyleGAN approaches; the system achieved an accuracy of 84.7% with the FaceNet model.
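A hedged sketch of this neuron-behavior idea is given below: per-layer activation statistics are extracted from a pretrained recognition backbone and passed to an SVM. The generic ResNet50 backbone and the simple mean-activation threshold are stand-ins and simplifications, not the exact FakeSpotter procedure or the face recognition models of Refs. [44-46].

import numpy as np
import tensorflow as tf
from sklearn.svm import SVC

# Any pretrained recognition backbone could be plugged in here; ResNet50 is a stand-in.
backbone = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                          input_shape=(224, 224, 3))
conv_outputs = [l.output for l in backbone.layers if isinstance(l, tf.keras.layers.Conv2D)]
probe = tf.keras.Model(inputs=backbone.input, outputs=conv_outputs)

def neuron_behavior(images):
    """One feature per convolutional layer: fraction of neurons activated above the layer mean."""
    feats = []
    for act in probe.predict(images, verbose=0):
        act = act.reshape(len(images), -1)
        feats.append((act > act.mean(axis=1, keepdims=True)).mean(axis=1))
    return np.stack(feats, axis=1)            # shape: (n_images, n_conv_layers)

# X_real, X_fake: preprocessed face crops of shape (n, 224, 224, 3)
# features = np.vstack([neuron_behavior(X_real), neuron_behavior(X_fake)])
# labels = np.concatenate([np.zeros(len(X_real)), np.ones(len(X_fake))])
# svm = SVC(kernel="rbf").fit(features, labels)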
A facial feature manipulation detection system has been proposed by Jain et al. [83] using a CNN architecture with six convolutional layers and two fully connected layers, which also uses residual connections as proposed by He et al. [67]. The system is fed with nonoverlapping patches of the face in order to learn the distinct facial features. The classification is carried out using an SVM classifier, and the proposed model has been able to detect manipulated images with an accuracy of almost 100% on the dataset proposed by Bharati et al. [80] and on a StarGAN-generated [84] dataset trained using the CelebA dataset [38].
Attention mechanisms that can enhance the feature maps of different CNN architectures have been proposed by Stehouwer et al. [49]. They used FaceApp to create fake images with manipulated facial features, using 28 distinct filters that change the hairstyle or the color of the skin, add or remove a beard, etc., together with fake images synthesized using the StarGAN model with a set of 40 distinct attributes. They tested the proposed approach using the DFFD dataset and were able to achieve an AUC of 99.9%.
The authors of Ref. [85] collected real images and also created synthetic images using the Adobe Photoshop tool named Face Aware Liquify, while some manipulated images were created by professional artists by altering the facial features. They then asked humans to classify the images as real or fake, and the humans were able to do so with only about 50% accuracy; the same dataset was then given to deep recurrent networks, and the automatic system was able to detect the fake images
with an accuracy of 99.8% for the machine-generated images and 99.7% for the human-created fake images.
A steganalysis-based method [56] has been used to detect fake images with 99.4% accuracy, using StarGAN-generated [84] fake images with manipulated facial features and real images from the Liu et al. [38] dataset.
The detection of fake images has been carried out by Zhang et al. [86] in the spectrum domain. In this approach, the RGB channels of the input image are subjected to a 2D DFT, and a frequency image is generated for each of the RGB channels. The classification is carried out using AutoGAN, which can create GAN-like artifacts without the aid of any trained GAN model. They used StarGAN [84] and GauGAN [87] for the evaluation; images generated by StarGAN were detected in the frequency domain with 100% accuracy, whereas GauGAN-generated images were detected with only 50% accuracy.
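The per-channel frequency image described above can be computed in a few lines of NumPy; in this sketch the log-magnitude 2D DFT of each RGB channel is stacked into a three-channel spectrum (the log scaling and normalization are assumptions for visualization, not the exact preprocessing of Ref. [86]).

import numpy as np

def rgb_spectrum(image):
    """image: (H, W, 3) RGB array -> (H, W, 3) log-magnitude spectra, one per channel."""
    channels = []
    for c in range(3):
        f = np.fft.fftshift(np.fft.fft2(image[..., c]))   # 2D DFT with the DC term centered
        mag = np.log1p(np.abs(f))                          # log scale compresses the dynamic range
        channels.append(mag / mag.max())                   # normalize each channel to [0, 1]
    return np.stack(channels, axis=-1)

# The resulting frequency images (for real vs. GAN-generated photos) are what the classifier sees.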
Table 6.3 illustrates the comparison of different techniques that have been proposed
so far for detecting the manipulated facial features. Most of the researchers have used the
CNN-based classifiers using the image-related features to distinguish between the real
and fake images.
6.6.4 Techniques for detecting the manipulated facial expressions
Advancements in technology have enabled computer-based software to change the speech of a speaker along with the facial expressions [88]. In Stehouwer et al. [49], the researchers proposed a technique to detect the manipulation of facial features using the DFFD dataset, which achieved an AUC value of 99.4%. It has been observed by Refs. [37, 51, 54] that the results are good in controlled environments, but most of the methods fail in real scenarios; therefore, new methods need to be explored that can work in real scenarios with variations in blur, noise, and compression.
Table 6.3 Comparison of different techniques for detecting the manipulation of facial features.
Work | Features | Classifier
Bharati et al. [80] | Face patches | RBM
Tariq et al. [82] | Image related | CNN
Wang et al. [43] | CNN neuron behavior | SVM
Jain et al. [83] | Face patches | CNN + SVM
Stehouwer et al. [49] | Image related | CNN + attention mechanism
Wang et al. [85] | Image related | DRN
Nataraj et al. [56] | Steganalysis | CNN
Marra et al. [55] | Image related | CNN + incremental learning
Zhang et al. [86] | Frequency domain | GAN discriminator
Table 6.4 Comparison of different techniques for detecting the manipulated facial expressions.
Work | Features | Classifier
Zhou et al. [57] | Mesoscopic level | CNN
Rössler et al. [51] | Image-related steganalysis | CNN
Matern et al. [73] | Visual artifacts | Logistic regression, MLP
Nguyen et al. [76] | Image related | Autoencoder
Stehouwer et al. [49] | Image related | CNN + attention mechanism
Sabir et al. [79] | Image + temporal information | CNN + RNN
The second issue that needs to be addressed is the robustness of the proposed methods against unseen face manipulations: it has been observed by Refs. [55, 59] that the systems have very poor generalization capability and therefore fail to give good results in real-world scenarios. Images manipulated with GAN-based methods such as StyleGAN [47] can be detected with quite good accuracy, which is attributed to the fingerprints that are left as artifacts in the GAN-generated fake images. Further, a research group has proposed to eliminate the fingerprints generated by GAN models, using autoencoders and degradation of the image quality, so as to make GAN-generated fake images hard to detect [54]; this resulted in a loss of detection rates.
Table 6.4 illustrates the comparison of different techniques for detecting the manipulated facial expressions. The researchers have mostly used image-related features and
CNN classifiers with the benchmark datasets to check the performance of the methods
for detecting the fake images generated using different GAN architectures. Most of the
methods have been able to achieve high accuracy.
6.7 Legal and ethical considerations
With the rapid development of artificial intelligence, it is now possible to create images simply by describing their features in text, or to manipulate any existing image. The images thus generated are as good as real images and are termed DeepFakes, as they are generally created using deep learning-based modalities. DeepFake images can be quite innovative and can add significant value to the creative and education domains; on the contrary, this methodology also carries a plethora of threats, with harmful social, political, and financial implications. The main concern is that DeepFakes are very hard to discern with the human eye, as they blur the line between original and fake images. Moreover, with the proliferation of digital media and platforms, DeepFake images can spread like wildfire across social platforms.
Therefore, we need to address the legal and ethical implications attached to DeepFakes. To this end, DeepFakes can be categorized into four different categories: face swapping in order to take revenge, and defamation of
public figures, which are defined as the hard cases and can have hefty legal and ethical complications; and, on the contrary, DeepFakes created for the illustration of creativity or for reducing recapturing, which have a social benefit associated with them, belong to the lighter category and have comparatively few legal and ethical complications.
The emergence of DeepFakes has posed a serious problem, for which we need to look at the cause of the problem and correct it, instead of merely treating the symptoms associated with it.
The harms associated with the generated fake content are manifold, ranging from the spreading of misinformation and the humiliation of victims to the propagation of fake news. The most challenging task is how to prevent the propagation of false information and save society at large from the direct implications of DeepFakes.
Some countries have passed laws to control the implications of DeepFakes, under which the propagator is held responsible for posting a DeepFake on social media; this can act as a limiting factor that deters others from opting for the same path. However, the harm and humiliation, once done, are hard to reverse.
Hence, in the era of internet virality and the proliferation of social media, the spread of misinformation is beyond control, and social media platforms have developed into mediums for political and social discourse. In order to curb this menace, either laws should be framed to tackle the issues associated with DeepFakes, or the platforms through which DeepFakes spread like wildfire should be equipped with technologies that can ascertain the truthfulness of the content being posted and censor the fake content, so that it never gets a platform from which to be launched, thereby controlling the implications associated with it. Table 6.5 illustrates the legalities associated with DeepFake images.
6.8 Conclusion and future scope
Every coin has two sides: GAN-based systems have a large number of applications, but a few of these applications can serve malicious purposes as well. It has already been witnessed how deep learning-based approaches have been harnessed by fraudsters to generate artificial intelligence-based synthetic images and even videos, which can be used by criminals to carry out scams and fraudulent activities, or to create fake images and even fake news.
Along the same lines, the computational intelligence of GANs can be harnessed by fraudsters to use GAN-generated images and videos for malicious activities; they can improve their artificial intelligence-based methods by generating synthetic images of the innocent individuals whom they have chosen to victimize.
Table 6.5 Legalities associated with DeepFake.
Purpose of DeepFake | Cases | Benefits | Alarms | Affects | Legalities
Face swapping | Swapping the face of the victim with that of others in order to defame the victim | The person who swaps the face takes revenge and gets satisfaction | Mental torture and humiliation of the victim | Can cause mental torture, abuse, and financial implications for the victim | Criminal proceedings can be initiated
Defamation of public figures | Creating images of events that never happened | Freedom of expression | Defaming a public figure, distorting their reputation, and even altering election results | Destroys international relations, creates polarization, and erodes trust in organizations | Public and private lawsuits can be filed
Reducing recapturing | Dubbing the same video in multiple languages | Reducing the effort of repetitive tasks | Redundant data creation | May impact the IPR | Private lawsuits can be filed
Creativity | Creation of memes | Freedom of expression, creativity | People may feel offended | May impact the IPR | Public and private lawsuits can be filed
Efforts have been made by the research fraternity to find techniques to detect fake images. Most of the work has been carried out using CNN-based classifiers to discern between fake and real images using image-related features.
Although many innovations have taken place in the field of artificial intelligence, not much importance has been given to the security risks that artificial intelligence-based innovations have posed or may pose in the years to come. It is a well-understood fact that, in the endeavor to develop intelligent machines that mimic human-like traits and make the work of humans easier, not much importance has been given to the security, privacy, and other risks associated with these advancements.
References
[1] Y. Choi, M. Choi, M. Kim, J. Ha, S. Kim, J. Choo, StarGAN: unified generative adversarial networks
for multi-domain image-to-image translation, CoRR abs/1711.09020 (2017) 1–15. https://arxiv.org/
pdf/1711.09020.pdf.
[2] J. Zhu, R. Zhang, D. Pathak, T. Darrell, A.A. Efros, O. Wang, E. Shechtman, Toward multimodal imageto-image translation, CoRR abs/1711.11586 (2017) 1–12. https://arxiv.org/pdf/1711.11586.pdf.
[3] J. Kim, M. Kim, H. Kang, K. Lee, U-GAT-IT: unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation, CoRR abs/1907.10830 (2019) 1–19.
http://arxiv.org/abs/1907.10830.
[4] P. Isola, J. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, CoRR abs/1611.07004 (2016) 1–17. http://arxiv.org/abs/1611.07004.
[5] H. Tang, D. Xu, N. Sebe, Y. Wang, J.J. Corso, Y. Yan, Multi-channel attention selection GAN with
cascaded semantic guidance for cross-view image translation, CoRR abs/1904.06807 (2019) 1–20.
http://arxiv.org/abs/1904.06807.
[6] K. Regmi, A. Borji, Cross-view image synthesis using geometry-guided conditional GANs, CoRR
abs/1808.05469 (2018) 1–11. http://arxiv.org/abs/1808.05469.
[7] K. Regmi, A. Borji, Cross-view image synthesis using conditional GANs, CoRR abs/
1803.03396 (2018) 1–10. http://arxiv.org/abs/1803.03396.
[8] Y. Shi, D. Deb, A.K. Jain, WarpGAN: automatic caricature generation, CoRR abs/1811.10100 (2018)
1–15. http://arxiv.org/abs/1811.10100.
[9] K. Cao, J. Liao, L. Yuan, CariGANs: unpaired photo-to-caricature translation, CoRR abs/
1811.00222 (2018) 1–14. http://arxiv.org/abs/1811.00222.
[10] Z. Zheng, H. Zheng, Z. Yu, Z. Gu, B. Zheng, Photo-to-caricature translation on faces in the wild,
CoRR abs/1711.10735 (2017) 1–28. http://arxiv.org/abs/1711.10735.
[11] S.E. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, H. Lee, Generative adversarial text to image
synthesis, CoRR abs/1605.05396 (2016) 1–10. http://arxiv.org/abs/1605.05396.
[12] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, D.N. Metaxas, StackGAN: text to photorealistic image synthesis with stacked generative adversarial networks, CoRR abs/1612.03242 (2016)
1–14. http://arxiv.org/abs/1612.03242.
[13] H. Park, Y.J. Yoo, N. Kwak, MC-GAN: multi-conditional generative adversarial network for image
synthesis, CoRR abs/1805.01123 (2018) 1–13. http://arxiv.org/abs/1805.01123.
[14] T. Qiao, J. Zhang, D. Xu, D. Tao, MirrorGAN: learning text-to-image generation by redescription,
CoRR abs/1903.05854 (2019) 1–10. http://arxiv.org/abs/1903.05854.
[15] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, D.N. Metaxas, StackGAN++: realistic image
synthesis with stacked generative adversarial networks, CoRR abs/1710.10916 (2017) 1–16. http://
arxiv.org/abs/1710.10916.
[16] D. Stap, M. Bleeker, S. Ibrahimi, M. ter Hoeve, Conditional image generation and manipulation for
user-specified content, ArXiv abs/2005.04909 (2020) 1–10.
[17] B. Li, X. Qi, T. Lukasiewicz, P.H.S. Torr, Controllable text-to-image generation, in: NeurIPS2019,
pp. 1–11.
[18] M. Zhu, P. Pan, W. Chen, Y. Yang, DM-GAN: dynamic memory generative adversarial networks for
text-to-image synthesis, CoRR abs/1904.01310 (2019) 1–9. http://arxiv.org/abs/1904.01310.
[19] W. Li, P. Zhang, L. Zhang, Q. Huang, X. He, S. Lyu, J. Gao, Object-driven text-to-image synthesis
via adversarial training, CoRR abs/1902.10740 (2019) 1–23. http://arxiv.org/abs/1902.10740.
[20] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, X. He, AttnGAN: fine-grained text to
image generation with attentional generative adversarial networks, CoRR abs/1711.10485 (2017)
1–9. http://arxiv.org/abs/1711.10485.
[21] T. Tsue, S.K. Sen, J. Li, Cycle text-to-image GAN with BERT, ArXiv abs/2003.12137 (2020) 1–8.
[22] Y. Cai, X. Wang, Z. Yu, F. Li, P. Xu, Y. Li, L. Li, Dualattn-GAN: text to image synthesis with dual
attentional generative adversarial network, IEEE Access 7 (2019) 183706–183716.
[23] H. Li, H. Chen, B. Li, S. Tan, Can forensic detectors identify GAN generated images? in: 2018 AsiaPacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
2018, pp. 722–727.
[24] H. Li, B. Li, S. Tan, J. Huang, Detection of deep network generated images using disparities in color
components, CoRR abs/1808.07276 (2018) 1–26. http://arxiv.org/abs/1808.07276.
[25] X. Zhang, S. Karaman, S. Chang, Detecting and simulating artifacts in GAN fake images, in: 2019 IEEE
International Workshop on Information Forensics and Security (WIFS)2019, pp. 1–6.
[26] L. Nataraj, T.M. Mohammed, B.S. Manjunath, S. Chandrasekaran, A. Flenner, J.H. Bappy, A.K. RoyChowdhury, Detecting GAN generated fake images using co-occurrence matrices, CoRR abs/
1903.06836 (2019) 1–6. http://arxiv.org/abs/1903.06836.
[27] S. McCloskey, M. Albright, Detecting GAN-generated imagery using color cues, CoRR abs/
1812.08247 (2018) 1–7. http://arxiv.org/abs/1812.08247.
[28] N. Yu, L. Davis, M. Fritz, Attributing fake images to GANs: analyzing fingerprints in generated images,
CoRR abs/1811.08180 (2018) 1–41. http://arxiv.org/abs/1811.08180.
[29] R. Wang, L. Ma, F. Juefei-Xu, X. Xie, J. Wang, Y. Liu, FakeSpotter: a simple baseline for spotting
AI-synthesized fake faces, CoRR abs/1909.06122 (2019) 1–8. http://arxiv.org/abs/1909.06122.
[30] F. Marra, C. Saltori, G. Boato, L. Verdoliva, Incremental learning for the detection and classification of
GAN-generated images, CoRR abs/1910.01568 (2019) 1–6. http://arxiv.org/abs/1910.01568.
[31] R. Durall, M. Keuper, F. Pfreundt, J. Keuper, Unmasking DeepFakes with simple features, CoRR abs/
1911.00686 (2019) 1–8. http://arxiv.org/abs/1911.00686.
[32] L. Guarnera, O. Giudice, S. Battiato, DeepFake detection by analyzing convolutional traces, CoRR
abs/2004.10448 (2020) 1–10. https://arxiv.org/abs/2004.10448.
[33] L. Li, J. Bao, T. Zhang, H. Yang, D. Chen, F. Wen, B. Guo, Face X-ray for more general face forgery
detection, CoRR abs/1912.13458 (2019) 1–10.http://arxiv.org/abs/1912.13458.
[34] C.-C. Hsu, Y.-X. Zhuang, C.-Y. Lee, Deep fake image detection based on pairwise learning. Appl.
Sci. 10 (2020) 370, https://doi.org/10.3390/app10010370.
[35] S. McCloskey, M. Albright, Detecting GAN-generated imagery using color cues, CoRR abs/
1812.08247 (2018) 1–7. http://arxiv.org/abs/1812.08247.
[36] H. Guan, M. Kozak, E. Robertson, Y. Lee, A.N. Yates, A. Delgado, D. Zhou, T. Kheyrkhah, J. Smith,
J. Fiscus, MFC datasets: large-scale benchmark datasets for media forensic challenge evaluation, in: 2019
IEEE Winter Applications of Computer Vision Workshops (WACVW)2019, pp. 63–72.
[37] N. Yu, L. Davis, M. Fritz, Attributing fake images to GANs: analyzing fingerprints in generated images,
CoRR abs/1811.08180 (2018) 1–41. http://arxiv.org/abs/1811.08180.
[38] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, CoRR abs/
1411.7766 (2014) 1–11. http://arxiv.org/abs/1411.7766.
[39] T. Karras, T. Aila, S. Laine, J. Lehtinen, Progressive growing of GANs for improved quality, stability,
and variation, CoRR abs/1710.10196 (2017) 1–26. http://arxiv.org/abs/1710.10196.
[40] T. Miyato, T. Kataoka, M. Koyama, Y. Yoshida, Spectral normalization for generative adversarial networks, CoRR abs/1802.05957 (2018) 1–26. http://arxiv.org/abs/1802.05957.
[41] M.G. Bellemare, I. Danihelka, W. Dabney, S. Mohamed, B. Lakshminarayanan, S. Hoyer, R. Munos,
The Cramer distance as a solution to biased Wasserstein gradients, CoRR abs/1705.10743 (2017)
1–20. http://arxiv.org/abs/1705.10743.
[42] M. Binkowski, D.J. Sutherland, M. Arbel, A. Gretton, Demystifying MMD GANs, ArXiv abs/
1801.01401 (2018) 1–36.
[43] R. Wang, L. Ma, F. Juefei-Xu, X. Xie, J. Xiang Wang, Y. Liu, FakeSpotter: a simple baseline for spotting AI-synthesized fake faces, ArXiv abs/1909.06122 (2019) 1–8.
[44] O.M. Parkhi, A. Vedaldi, A. Zisserman, Deep face recognition, in: Proceedings of the British Machine Vision Conference (BMVC), 2015.
[45] B. Amos, B. Ludwiczuk, M. Satyanarayanan, OpenFace: a general-purpose face recognition library
with mobile applications, CMU-CS-16-118, CMU School of Computer Science, 2016 Tech. Rep.
[46] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: a unified embedding for face recognition and clustering, CoRR abs/1503.03832 (2015) 1–10. http://arxiv.org/abs/1503.03832.
[47] T. Karras, S. Laine, T. Aila, A style-based generator architecture for generative adversarial networks,
CoRR abs/1812.04948 (2018) 1–12. http://arxiv.org/abs/1812.04948.
[48] Y. Shen, J. Gu, X. Tang, B. Zhou, Interpreting the latent space of GANs for semantic face editing,
CoRR abs/1907.10786 (2019) 1–10. http://arxiv.org/abs/1907.10786.
[49] J. Stehouwer, H. Dang, F. Liu, X. Liu, A. Jain, On the detection of digital face manipulation, ArXiv
abs/1910.01717 (2019) 1–12.
[50] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, L. Kaiser, I. Polosukhin,
Attention is all you need, CoRR abs/1706.03762 (2017) 1–15. http://arxiv.org/abs/1706.03762.
[51] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, M. Nießner, FaceForensics++: learning to detect manipulated facial images, CoRR abs/1901.08971 (2019) 1–14. http://arxiv.org/abs/1901.08971.
[52] B. Biggio, P. Korshunov, T. Mensink, G. Patrini, D. Rao, A. Sadhu, Synthetic realities: deep learning for detecting audiovisual fakes, in: International Conference on Machine Learning, 2019. https://sites.google.com/view/audiovisualfakes-icml2019/.
[53] J. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, CoRR abs/1703.10593 (2017) 1–18. http://arxiv.org/abs/1703.10593.
[54] J. Neves, R. Tolosana, R. Vera-Rodriguez, V. Lopes, H. Proença, Real or fake? Spoofing state-of-theart face synthesis detection systems, CoRR abs/1911.05351 (2019) 1–8. http://arxiv.org/abs/1911.
05351.
[55] F. Marra, C. Saltori, G. Boato, L. Verdoliva, Incremental learning for the detection and classification of
GAN-generated images, in: 2019 IEEE International Workshop on Information Forensics and Security
(WIFS)2019, pp. 1–6.
[56] L. Nataraj, T.M. Mohammed, B.S. Manjunath, S. Chandrasekaran, A. Flenner, J.H. Bappy, A.K. RoyChowdhury, Detecting GAN generated fake images using co-occurrence matrices, CoRR abs/
1903.06836 (2019) 1–6. http://arxiv.org/abs/1903.06836.
[57] P. Zhou, X. Han, V.I. Morariu, L.S. Davis, Two-stream neural networks for tampered face detection,
CoRR abs/1803.11276 (2018) 1–9. http://arxiv.org/abs/1803.11276.
[58] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S.E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,
A. Rabinovich, Going deeper with convolutions, CoRR abs/1409.4842 (2014) 1–12. http://arxiv.
org/abs/1409.4842.
[59] Y. Li, X. Yang, P. Sun, H. Qi, S. Lyu, Celeb-DF: a large-scale challenging dataset for DeepFake forensics, arXiv CR (2020) 1–10.
[60] D. Afchar, V. Nozick, J. Yamagishi, I. Echizen, MesoNet: a compact facial video forgery detection
network, CoRR abs/1809.00888 (2018) 1–7. http://arxiv.org/abs/1809.00888.
[61] P. Korshunov, S. Marcel, DeepFakes: a new threat to face recognition? Assessment and detection,
CoRR abs/1812.08685 (2018) 1–5. http://arxiv.org/abs/1812.08685.
[62] P. Korshunov, S. Marcel, Speaker inconsistency detection in tampered video, in: 2018 26th European
Signal Processing Conference (EUSIPCO)2018, pp. 2375–2379.
[63] J. Galbally, S. Marcel, J. Fierrez, Image quality assessment for fake biometric detection: application to
iris, fingerprint, and face recognition, IEEE Trans. Image Process. 23 (2) (2014) 710–724.
[64] X. Yang, Y. Li, S. Lyu, Exposing deep fakes using inconsistent head poses, in: ICASSP 2019-2019
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)2019,
pp. 8261–8265.
[65] Y. Li, S. Lyu, Exposing DeepFake videos by detecting face warping artifacts, CoRR abs/
1811.00656 (2018) 1–7. http://arxiv.org/abs/1811.00656.
[66] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition,
arXiv cs.CV, (2015) pp. 1–14.
[67] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, CoRR abs/
1512.03385 (2015) 1–12. http://arxiv.org/abs/1512.03385.
[68] D. Cozzolino, G. Poggi, L. Verdoliva, Recasting residual-based local descriptors as convolutional neural networks: an application to image forgery detection, CoRR abs/1703.04615 (2017) 1–7. http://
arxiv.org/abs/1703.04615.
[69] B. Bayar, M.C. Stamm, A deep learning approach to universal image manipulation detection using
a new convolutional layer. in: Proceedings of the 4th ACM Workshop on Information Hiding
and Multimedia Security, Association for Computing Machinery, New York, NY, 2016, pp. 5–10,
https://doi.org/10.1145/2909827.2930786.
[70] N. Rahmouni, V. Nozick, J. Yamagishi, I. Echizen, Distinguishing computer graphics from natural
images using convolution neural networks, in: 2017 IEEE Workshop on Information Forensics and
Security (WIFS)2017, pp. 1–6.
[71] F. Chollet, Xception: deep learning with depthwise separable convolutions, CoRR abs/
1610.02357 (2016) 1–8. http://arxiv.org/abs/1610.02357.
[72] J. Deng, W. Dong, R. Socher, L. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition2009, pp. 248–255.
[73] F. Matern, C. Riess, M. Stamminger, Exploiting visual artifacts to expose Deepfakes and face manipulations, in: 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW)2019,
pp. 83–92.
[74] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016. http://www.
deeplearningbook.org.
[75] D. Güera, E.J. Delp, Deepfake video detection using recurrent neural networks, in: 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2018, pp. 1–6.
[76] H.H. Nguyen, F. Fang, J. Yamagishi, I. Echizen, Multi-task learning for detecting and segmenting
manipulated facial images and videos, CoRR abs/1906.06876 (2019) 1–8. http://arxiv.org/abs/
1906.06876.
[77] B. Dolhansky, R. Howes, B. Pflaum, N. Baram, C. Canton-Ferrer, The Deepfake detection challenge
(DFDC) preview dataset, arXiv cs.CV abs/1910.08854 (2019) 1–14.
[78] S. Agarwal, H. Farid, Y. Gu, M. He, K. Nagano, H. Li, Protecting world leaders against deep fakes,
in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops,
June2019, pp. 38–45.
[79] E. Sabir, J. Cheng, A. Jaiswal, W. AbdAlmageed, I. Masi, P. Natarajan, Recurrent convolutional strategies for face manipulation detection in videos, CoRR abs/1905.00582 (2019) 1–8. http://arxiv.org/
abs/1905.00582.
[80] A. Bharati, R. Singh, M. Vatsa, K.W. Bowyer, Detecting facial retouching using supervised deep learning, IEEE Trans. Inf. Forensics Secur. 11 (9) (2016) 1903–1913.
[81] P. Flynn, K. Bowyer, P.J. Phillips, Assessment of time dependency in face recognition: an initial study, in: Audio- and Video-Based Biometric Person Authentication, 2003, pp. 44–51. https://doi.org/10.1007/3-540-44887-X_6.
[82] S. Tariq, S. Lee, H. Kim, Y. Shin, S.S. Woo, Detecting both machine and human created fake face
images in the wild. in: Proceedings of the 2nd International Workshop on Multimedia Privacy and
Security, Association for Computing Machinery, New York, NY, 2018, pp. 81–87, https://doi.org/
10.1145/3267357.3267367.
[83] A. Jain, R. Singh, M. Vatsa, On detecting GANs and retouching based synthetic alterations, CoRR
abs/1901.09237 (2019) 1–7. http://arxiv.org/abs/1901.09237.
[84] Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, CoRR abs/
1411.7766 (2014) 1–11. http://arxiv.org/abs/1411.7766.
[85] S. Wang, O. Wang, A. Owens, R. Zhang, A.A. Efros, Detecting photoshopped faces by scripting
photoshop, CoRR abs/1906.05856 (2019) 1–16. http://arxiv.org/abs/1906.05856.
[86] X. Zhang, S. Karaman, S. Chang, Detecting and simulating artifacts in GAN fake images, CoRR abs/
1907.06515 (2019) 1–10. http://arxiv.org/abs/1907.06515.
[87] P. Isola, J. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, CoRR abs/1611.07004 (2016) 1–17. http://arxiv.org/abs/1611.07004.
[88] S. Suwajanakorn, S.M. Seitz, I. Kemelmacher-Shlizerman, Synthesizing Obama: learning lip sync from
audio. ACM Trans. Graph. 36 (4) (2017), https://doi.org/10.1145/3072959.3073640.
CHAPTER 7
Synthesis of respiratory signals using
conditional generative adversarial
networks from scalogram
representation
S. Jayalakshmy (a), Lakshmi Priya (b), and Gnanou Florence Sudha (c)
(a) IFET College of Engineering, Villupuram, India
(b) Manakula Vinayaga Institute of Technology, Pondicherry, India
(c) Pondicherry Engineering College, Pondicherry, India
7.1 Introduction
Chronic respiratory disorders are a class of long-term illnesses affecting the airways and the anatomy of the respiratory system. Lung disorders rank among the biggest contributors to the global burden of disease. Some of the respiratory disorders include chronic obstructive pulmonary disease (COPD), asthma, occupational lung illness, chronic bronchitis, and pneumonia. Diagnostic studies by the World Health Organization (WHO) and Healthy People 2020 revealed that over 25 million people in the United States (US) have currently been diagnosed with asthma and roughly 14.8 million adults have been identified with COPD [1]. In addition, the Forum of International Respiratory Societies (FIRS) has estimated that millions of people live with increased pressure in the pulmonary system and more than 50 million individuals suffer from occupation-related lung infections, indicating that more than one billion people are affected by chronic respiratory conditions [2].
The causes of COPD include a deficiency of the alpha-1-antitrypsin protein and long-term exposure to air pollution, chemicals, fumes, and dust inhalation in the workplace. This deficiency causes the lungs to deteriorate and can also affect the liver. The common monitoring and diagnostic techniques for pulmonary diseases are spirometry, CT scans, arterial blood gas tests, and auscultation of the lungs. Respiratory auscultation is the most preferred diagnostic tool for the examination of pulmonary disorders, and it provides the physiological and pathological information needed by medical experts to proceed with the therapeutic procedure [3]. The respiratory sound (RS) is one of the most significant bio-signals used to diagnose certain respiratory abnormalities. RS detected from the chest wall and mouth may be classified into normal (vesicular) and adventitious sounds. Some of the abnormal sounds include wheezing, rhonchus (low-pitched wheezes), stridor, and
crackles, which vary in frequency. The use of a conventional stethoscope to listen to lung sounds makes auscultation a simple, easy-to-use, and popular noninvasive method for diagnosis.
To further assist medical practitioners in their diagnosis, deep learning-based computer-aided diagnosis (CAD) systems have been used extensively over the past few years. However, for these deep learning algorithms to precisely differentiate even feeble abnormal breathing patterns, a huge volume of training data is required. On the other hand, only a very limited number of normal and abnormal respiratory sound recordings are available in the publicly accessible datasets. When working with such insufficient resources, deep learning models tend to suffer from restrictions such as overfitting, whereby a model performs well on the training data but not on unobserved data. This can be considered the greatest challenge for conventional deep learning-based CAD systems. Although several training schemes and architectural designs have been reported in the literature, training a model with a scarce amount of data remains a demanding task. Therefore, to acquire more samples, data augmentation is the sensible solution.
To scale up datasets, the conventional augmentation approach introduces simple variations of the given images in order to obtain different facets of the original images. A few such modifications involve rotation, translation, reflection, and reduction of the image size [4]. From the respiratory-signal perspective, applying such conventional augmentation to the 2D representation of the audio signal is not appropriate. In explicit terms, a random flip around the time axis implies that the signal is reversed in time, and a random flip along the Y axis completely changes the interpretation of the frequencies. Similarly, a random translation along X implies only a time shift, whereas a translation along Y signifies that the frequency spectrum is modified, which is no longer a true representation of the original signal. Random scaling in the X direction, to simulate slower or faster breathing while keeping the frequency characteristics unchanged, would be physically meaningful, but it in turn adds random noise to the signal representation. To resolve these issues, this study aims at enlarging the dataset artificially with the help of a generative adversarial network (GAN). It is proposed to synthesize respiratory signals using a GAN architecture by virtue of a time-frequency representation, which gives a pictorial view of the signal. The scalogram, the visual representation of the energy density of the signal obtained through the continuous wavelet transform, is found to differentiate well between the different classes of respiratory signals [5]. The fact that the continuous wavelet transform is invertible enables the use of a GAN architecture to indirectly synthesize signals.
A GAN is a class of deep learning framework that generates images from a known representation of the data called the latent space. A GAN comprises two deep neural networks, called the generator and the discriminator, competing against each other in order to learn the probability distribution of the known training-set images, hence the term adversarial. The outcome of this study will address the challenge
faced with restricted training data in deep learning-based respiratory signal classification.
Furthermore, the proposed study can be used as a data augmentation technique in the
abovementioned signal classification task.
The remainder of this chapter is structured as follows: related work on the application of GANs in diverse fields and on augmentation approaches is presented in Section 7.2; the basic GAN, the cGAN, and the proposed model are explained in Section 7.3; the dataset details, the generator and discriminator network architectures, and the classifier results are presented in Section 7.4; and finally, the research findings are summarized in Section 7.5.
7.2 Related work
Over the past few years, deep learning models and algorithms have gradually gained significant importance in addressing several issues in the field of medical imaging. Several studies have been reported in the literature using supervised learning techniques, wherein a huge amount of training data is required to build a strong model. Owing to the wide variety of images in the medical field, the collection of data samples continues to be a great challenge. Introducing minute changes in the original images limits the classification performance, as the augmentation methods induce additional details in the training samples. Furthermore, a proportion of the expanded dataset may appear quite distinct from the real-world objects, resulting in unsuitability for other databases. In order to overcome these issues, Ian Goodfellow et al. [6] proposed an alternative approach to data augmentation wherein synthetic images are generated by employing generative adversarial networks (GANs). The structured adversarial modeling of the GAN, in turn, yielded much sharper distributions than Markov chain-based models.
GANs are a kind of unsupervised learning used for mapping small-scale hidden vectors to high-dimensional data. In the literature, GANs have lately been put into practice in diverse fields, and many initiatives have been carried out on medical images employing image-to-image translation. In 2017, Costa et al. [7] explored a U-net for generating new retinal fundus images from vessel segmentations with the help of GANs. The results indicated that the original and synthetic images were visibly different even though both were derived from the same vessel tree; in addition, the produced synthetic images largely matched the quality of the real image set. Lei Bi et al. [8] proposed multichannel generative adversarial networks (M-GAN) for boosting the training data of positron emission tomography (PET) images and provided more realistic images in comparison with the conventional GAN. In 2018, Frid-Adar et al. [9] proposed classical data augmentation as the first stage to expand a dataset of CT images and synthetic data augmentation using a GAN as the second stage for the classification of liver lesions; the classic data augmentation approach yielded 78.6% sensitivity and 88.4% specificity, and with the help of synthetic image creation, the classification
performance improved to 85.7% sensitivity and 92.4% specificity. Furthermore, in 2018, Salehinejad et al. [10] also demonstrated the expansion of dataset samples by implementing a GAN to produce artificial images for the classification of lesions in chest X-ray images. The authors utilized a deep convolutional neural network (DCNN) to identify disorders across five different classes of chest X-rays, and the performance results were found to be improved.
In 2019, Bhattacharya D et al. [11] employed a deep convolutional generative adversarial network (DCGAN) on the NIH chest X-ray open database to enhance the efficiency of a CNN model with GAN-generated images, yielding 65.3% accuracy. With the aid of a structure-correcting GAN, Dai et al. [12] carried out segmentation of the lung and heart regions in chest X-ray images; in that work, the authors introduced a critic network to capture the higher-level structures and achieve realistic segmentation outcomes, and several comprehensive experiments with this method resulted in segmentation with high precision. Onishi et al. [13] explored a deep CNN (DCNN) and a GAN to create a sufficient number of images in order to differentiate malignant and benign lung nodules. In that work, the images were generated using the pixel-value distribution in the mid-portion of the pulmonary nodule. This approach of pretraining and fine-tuning the DCNN enabled the discrimination of almost 66.7% of benign nodules and 93.9% of malignant pulmonary nodules, with a classification accuracy about 20% higher than with the original images.
Apart from CT and PET images, Chaudhari et al. [14] have trialed the augmentation approach on a gene expression dataset using a modified generator GAN (MG-GAN) and compared the performance with a basic GAN and a KNN classifier. The results proved that MG-GAN improved the accuracy by 18.8% and 11.9%, and further, the loss value of the error function was reduced drastically from 0.6978 to 0.0082, making it suitable for applications with sensitive data. Luo et al. [15] explored progressively grown GAN-based augmentation of electroluminescence images for the classification of faulty photovoltaic cells and improved the performance by up to 14% using the enlarged dataset. Apart from this, Li et al. [16] focused their research on gear safety in the transmission industry for reliability classification using GAN, wherein the authors introduced a bounded-GAN to create gear data with different settings and trained the model using the ADAM optimizer. The research findings show that the proposed bounded-GAN outperforms other approaches on operational measures. GAN also finds applications in diverse fields such as magnetic resonance imaging scans, video surveillance, clinical informatics, computational biology, automotive fields, etc.
Furthermore, GAN-based methods have also shown possible gains in the audio synthesis field, with reference to the analysis, processing, and classification of signals. In spite of the latest developments in the fields of artificial intelligence and generative models, the extraction of information from natural sounds through neural nets
remains unresolved. In 2017, Shrivastava et al. [17] substantiated that such signals can be generated at a faster pace with the help of GANs. Here, the authors proposed a combination of simulated and unsupervised learning and attempted numerous changes to the basic GAN, such as self-regularization, a local loss, and updating the discriminator with a history of refined images, which resulted in a good performance improvement. Donahue et al. [18], in 2019, showed that GANs also enable raw audio signal generation; the authors introduced WaveGAN as an initial attempt at the unsupervised synthesis of raw-waveform audio and achieved promising results. Owing to the complexity of its structure, audio signal generation is a key issue, as it depends on several time scales; it is therefore advantageous to train a network on a higher-level representation rather than on samples in the temporal dimension. Time-frequency analysis is used to distinguish and handle nonstationary signals in a better way. One such example is Shen et al. [19], who demonstrated a novel architecture named Tacotron 2 for synthesizing audio signals through Mel-frequency spectrograms; these Mel spectrograms are fed as input to a network named WaveNet, resulting in a mean opinion score of 4.53.
Even though several achievements have been enabled through TF representations, the visual representation of the spectrum of frequencies varying with time (the spectrogram) is not invertible. Marafioti et al. [20] discussed the key aspects of neural architectures explored for producing TF representations, especially for speech synthesis using the STFT; the authors introduced TiFGAN, which creates audio unconditionally, but it has limitations in producing audio of substantial quality. In addition, the spectrogram type of representation offers only a constant resolution. These drawbacks can be resolved by employing the continuous wavelet transform (which is invertible), in which the TF representation, named the scalogram, is produced with variable resolution. The use of an unconditional generative model does not allow control over the modes of the generated data. Several efforts have been made to generate audio signals in an unconditional manner [21–23]; despite that, all these techniques utilize the autoregressive method, which takes noise samples as input and creates samples of the audio signal serially. By conditioning the model with some extra information, it becomes possible to supervise the process of data generation. This conditioning may rely on class labels, on tags for a portion of the data, or on the data as a whole.
Mirza et al. [24] introduced conditional adversarial nets and generated images conditioned on MNIST class labels. The authors proved the potency of conditional adversarial nets and their useful applications when tags are used as the conditioning information. Conditional GANs (cGANs) are a kind of GAN wherein information concerning the conditions is imposed on the basic GAN, and the results show superior performance compared to nonconditional GANs.
The majority of application areas, the medical field in particular, lack access to big data for analysis. Even though several data augmentation
approaches are possible with GAN for both medical images and audio [13–23], some of the technical gaps observed in the existing studies are the quality of the generated data samples, distortion, and their distribution in the dataset, which were not good enough, resulting in poor classification accuracy, as the accuracy depends significantly on both qualitative and quantitative terms. In addition, the speed of the image generation process is very slow, specifically in the field of speech synthesis; this poses limitations on high-dimensional data owing to the nature of autoregressive modeling. Furthermore, in certain instances, network models trained with artificially generated sample data fail to perform well when fed with real images. All these gaps can be addressed in the proposed study with the help of a conditional GAN. Inspired by the performance improvement achieved by conditional GANs in Ref. [24], the proposed study employs the scalogram method of TF representation combined with a cGAN for improved targeting and for synthesizing respiratory sounds in order to discriminate between normal and abnormal lung sounds.
7.3 GAN for signal synthesis
In this section, the architectures of the simple GAN and the conditional GAN, as well as the proposed system model that uses the conditional GAN to synthesize respiratory sound signals from the wavelet-based time-frequency representation, are explained in detail. The conditional GAN is utilized in this study to artificially generate a larger number of scalogram images. With the proposed model as a data augmentation technique, better prediction accuracy through computer-aided diagnosis is expected.
7.3.1 Simple GAN
A simple GAN comprises two networks, named the generator and the discriminator, which are trained concurrently. The generator learns to generate a new image mimicking the data in the latent space by estimating its underlying probability distribution, while the discriminator plays the role of a binary classifier by mapping the input image either to the real-image dataset or to the generated set of images. The generator model has to be trained such that the generated image very closely resembles the images used for training, thereby making it difficult for the discriminator to distinguish between the original and generated images. The discriminator, in turn, learns to make sure that its performance is better than that of the generator. This adversarial learning behavior of the GAN results in the generation of images that are very close to the real training-set images. Fig. 7.1 shows the architecture of the simple GAN [25].
Fig. 7.1 Architecture of simple GAN.
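As a minimal, illustrative sketch of these two competing networks (a toy fully connected pair built in Keras, not the convolutional architecture used later in this chapter), a generator and a discriminator can be set up as follows; a full adversarial training step is sketched after Algorithm 7.1.

import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 100

# Generator: maps a latent noise vector to a 64 x 64 x 3 image scaled to [-1, 1]
generator = tf.keras.Sequential([
    tf.keras.Input(shape=(latent_dim,)),
    layers.Dense(256),
    layers.LeakyReLU(0.2),
    layers.Dense(64 * 64 * 3, activation="tanh"),
    layers.Reshape((64, 64, 3)),
])

# Discriminator: binary classifier scoring how "real" an image looks (logit output)
discriminator = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),
    layers.Flatten(),
    layers.Dense(256),
    layers.LeakyReLU(0.2),
    layers.Dense(1),
])

# During training the two models are updated alternately: the discriminator learns to separate
# real from generated images, while the generator learns to make its samples indistinguishable
# from the real ones.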
7.3.2 Conditional generative adversarial networks
In contrast to the basic GAN, the cGAN follows a supervised approach in which both the generator and the discriminator networks are conditioned during the training phase with the help of certain extra information. Fig. 7.2 shows the architecture of the conditional GAN [26]. The latent-space information in the form of noise and the condition labels are fed as inputs to the generator block. The images produced by the generator, along with the condition labels and the real images, are given as input to the discriminator block, which assesses the correspondence between the given labels and images. Data augmentation is achieved by incorporating the conditional variable y into the model.
Fig. 7.2 Architecture of conditional GAN.
7.3.3 Conditional GAN for respiratory sound synthesis
The fundamental idea in the proposed study is to train the conditional GAN to generate
realistic scalogram images of various respiratory sounds.
7.3.3.1 System model
Fig. 7.3 shows the proposed framework for synthesizing respiratory sounds using the cGAN. Different types of lung sounds, such as vesicular, wheeze, crackle, and low-pitched wheeze signals, are given as input to the time-frequency transform. The continuous wavelet transform maps the time-domain respiratory signal to a time-scale representation in the form of a scalogram. These real scalograms are fed to the conditional GAN, whose generator network is used to generate additional artificial scalogram images. Finally, the inverse CWT is applied to synthesize the respiratory sound signals in the time domain.
Fig. 7.3 Proposed system model for synthesizing respiratory sounds.
7.3.3.2 Time-scale representation using CWT
The continuous wavelet transform (CWT) modifies the temporal length of the basis function in order to achieve an adjustable time-frequency localization. To capture very small changes in frequency, the CWT uses longer basis functions at the cost of confined localization in time, and it uses shorter basis functions to ascertain high localization in time [27]. This time-frequency transform yields a spectrum of time scale versus amplitude, named the scalogram. Compared to spectrograms, scalograms are useful for analyzing realistic signals at diverse scales. As the frequencies in the CWT are logarithmic in nature, the obtained scalogram plot also uses a log-scale frequency axis.
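A minimal sketch of producing such a scalogram with PyWavelets is shown below; the Morlet wavelet, the scale range, and the 4-kHz sampling rate are assumptions chosen for illustration, not the exact settings of the proposed study.

import numpy as np
import pywt

fs = 4000                                   # assumed sampling rate of the lung-sound recording (Hz)
t = np.arange(0, 2.0, 1 / fs)
signal = np.sin(2 * np.pi * 200 * t)        # stand-in for a segmented respiratory sound

scales = np.arange(1, 128)                  # smaller scales correspond to higher frequencies
coeffs, freqs = pywt.cwt(signal, scales, "morl", sampling_period=1 / fs)

scalogram = np.abs(coeffs) ** 2             # energy density over time and scale
# Rendering this array (e.g., on a log-frequency axis) gives the scalogram image fed to the cGAN;
# because the CWT is invertible, generated scalograms can later be mapped back to time-domain signals.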
7.3.3.3 Generator and discriminator network architecture of cGAN
The network architectures of the generator and the discriminator are shown in Figs. 7.4 and 7.5, and the corresponding analysis results are tabulated in Tables 7.1 and 7.2.
Noise Image input
Layer
Project and Reshape
Layer
Labels Image input
Layer
Embed and Reshape
Layer
Batch Normalization
Layer 1
Relu Layer 1
Concatenation Layer
Transposed Convolution
2D Layer 1
Transposed Convolution
2D Layer 1
Batch Normalization
Layer 1
Batch Normalization
Layer 1
Relu Layer 1
Transposed Convolution
2D Layer 2
ReLU Layer 1
Transposed Convolution
2D Layer 1
Tanh Layer
Fig. 7.4 Generator network architecture.
generator network comprises sequential transposed convolutional layers with batch normalization for scaling up arrays of different dimensions. The small-scale noise vector given to the fully connected section is transformed into 1024 high-scale image features, which are reshaped and concatenated with the label embedding to form a 4 × 4 × 1025 array for feeding to the convolution module. Several successive stages of transposed convolution layers transform the interim features to produce an output image of dimension 64 × 64 × 3. This generator network generates synthetic scalograms for each class of respiratory sounds individually to attain a wider class population.
On the other hand, the discriminator network is modeled using multiple convolutional layers with leaky ReLU to produce prediction values. From the input image given to the discriminator block, which has a dimension of 64 × 64 × 3, high-scale features with dimension 4 × 4 × 512 are extracted by the series of convolution layers. Further, these features are flattened and given as input to the fully connected network, which gradually maps the features to a low scale for classification.
Fig. 7.5 Discriminator network architecture.
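The following TensorFlow/Keras sketch mirrors the layer shapes listed in Tables 7.1 and 7.2. It is not the authors' MATLAB implementation; function names such as make_generator and make_discriminator are illustrative, and details not stated explicitly in the prose (for example, the 50-dimensional label embedding, read off the 50 × 4 embedding weights in Table 7.1) should be treated as assumptions.

```python
# Sketch of the cGAN generator and discriminator following Tables 7.1 and 7.2.
import tensorflow as tf
from tensorflow.keras import layers

NOISE_DIM, NUM_CLASSES = 100, 4

def make_generator():
    noise = tf.keras.Input(shape=(NOISE_DIM,))
    label = tf.keras.Input(shape=(1,), dtype="int32")
    # Project and reshape the noise to 4 x 4 x 1024.
    x = layers.Dense(4 * 4 * 1024)(noise)
    x = layers.Reshape((4, 4, 1024))(x)
    # Embed and reshape the class label to 4 x 4 x 1, then concatenate (-> 4 x 4 x 1025).
    y = layers.Embedding(NUM_CLASSES, 50)(label)
    y = layers.Dense(4 * 4)(y)
    y = layers.Reshape((4, 4, 1))(y)
    h = layers.Concatenate()([x, y])
    # tconv1: 5 x 5, stride 1, no cropping -> 8 x 8 x 256.
    h = layers.Conv2DTranspose(256, 5, strides=1, padding="valid")(h)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU()(h)
    # tconv2 / tconv3: stride 2, "same" cropping -> 16 x 16 x 128 -> 32 x 32 x 64.
    for filters in (128, 64):
        h = layers.Conv2DTranspose(filters, 5, strides=2, padding="same")(h)
        h = layers.BatchNormalization()(h)
        h = layers.ReLU()(h)
    # tconv4 + tanh -> 64 x 64 x 3 synthetic scalogram.
    out = layers.Conv2DTranspose(3, 5, strides=2, padding="same", activation="tanh")(h)
    return tf.keras.Model([noise, label], out, name="generator")

def make_discriminator():
    image = tf.keras.Input(shape=(64, 64, 3))
    label = tf.keras.Input(shape=(1,), dtype="int32")
    # Embed and reshape the label to a 64 x 64 x 1 map, concatenate -> 64 x 64 x 4.
    y = layers.Embedding(NUM_CLASSES, 50)(label)
    y = layers.Dense(64 * 64)(y)
    y = layers.Reshape((64, 64, 1))(y)
    h = layers.Concatenate()([layers.Dropout(0.25)(image), y])
    for i, filters in enumerate((64, 128, 256, 512)):
        h = layers.Conv2D(filters, 5, strides=2, padding="same")(h)
        if i > 0:                         # conv1 has no batch normalization in Table 7.2
            h = layers.BatchNormalization()(h)
        h = layers.LeakyReLU(0.2)(h)
    # conv5: 4 x 4, stride 1, valid padding -> single real/fake logit.
    logit = layers.Conv2D(1, 4, strides=1, padding="valid")(h)
    return tf.keras.Model([image, label], layers.Flatten()(logit), name="discriminator")
```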
7.3.3.4 Algorithm
Algorithm 7.1. Generation of scalograms using cGAN for the training process with epochs = 500, 1000, 1500, and 2000, learning rate = 0.0002, and number of classes = 4
Input: Real scalograms S0, S1, S2, …, Sn; noise input vector Y; conditional metrics C0, C1, C2, …, Cn
Formulate the bounds for G and D of the cGAN with input dimension [64 × 64 × 3]
for number of training phases do
  • Update the discriminator using S0, S1, S2, …, Sn with C0, C1, C2, …, Cn
  • Generate scalograms Z0, Z1, Z2, …, Zn using the noise input vector with C0, C1, C2, …, Cn
  • Revise the D network using Z0, Z1, Z2, …, Zn with C0, C1, C2, …, Cn
  • Revise the G network using Z0, Z1, Z2, …, Zn with C0, C1, C2, …, Cn
end for
Output: 200 observations for each class.
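A minimal training loop corresponding to Algorithm 7.1 could look as follows, reusing make_generator() and make_discriminator() from the earlier sketch. The dataset pipeline and the optimizer choice (Adam) with its parameters are assumptions of this illustration; only the learning rate of 0.0002 and the epoch counts come from the algorithm statement.

```python
# Sketch of the adversarial training loop of Algorithm 7.1 (TensorFlow).
import tensorflow as tf

generator, discriminator = make_generator(), make_discriminator()
g_opt = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)   # learning rate 0.0002 as in Algorithm 7.1
d_opt = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

@tf.function
def train_step(real_scalograms, labels):
    # labels: int32 tensor of shape (batch, 1) holding class indices 0-3.
    batch = tf.shape(real_scalograms)[0]
    noise = tf.random.normal([batch, 100])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake = generator([noise, labels], training=True)
        real_logits = discriminator([real_scalograms, labels], training=True)
        fake_logits = discriminator([fake, labels], training=True)
        # Discriminator: push real scalograms toward 1, generated ones toward 0.
        d_loss = bce(tf.ones_like(real_logits), real_logits) + \
                 bce(tf.zeros_like(fake_logits), fake_logits)
        # Generator: try to make the discriminator label generated scalograms as real.
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return d_loss, g_loss

# dataset is assumed to yield (scalogram batch scaled to [-1, 1], label batch of shape (batch, 1)).
# for epoch in range(1500):
#     for real_scalograms, labels in dataset:
#         d_loss, g_loss = train_step(real_scalograms, labels)
```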
Table 7.1 Analysis result of generator network.
Sl. no. | Name | Type | Activations | Learnables
1 | Noise | Image input (1 × 1 × 100 images) | 1 × 1 × 100 | –
2 | proj | Project and reshape layer with output size 4 × 4 × 1024 | 4 × 4 × 1024 | Weights 16,384 × 100; Bias 16,384 × 1
3 | Labels | Image input (1 × 1 × 1 images) | 1 × 1 × 1 | –
4 | emb | Embed and reshape layer with output size 4 × 4 | 4 × 4 × 1 | Embedding weights 50 × 4; Fully connect weights 16 × 50; Fully connect bias 16 × 1
5 | cat | Concatenation of 2 inputs along dimension 3 | 4 × 4 × 1025 | –
6 | tconv1 | 256 5 × 5 × 1025 transposed convolutions with stride [1 1] and cropping [0 0 0 0] | 8 × 8 × 256 | Weights 5 × 5 × 256 × 1025; Bias 1 × 1 × 256
7 | bn1 | Batch normalization with 256 channels | 8 × 8 × 256 | Offset 1 × 1 × 256; Scale 1 × 1 × 256
8 | relu1 | ReLU | 8 × 8 × 256 | –
9 | tconv2 | 128 5 × 5 × 256 transposed convolutions with stride [2 2] and cropping "same" | 16 × 16 × 128 | Weights 5 × 5 × 128 × 256; Bias 1 × 1 × 128
10 | bn2 | Batch normalization with 128 channels | 16 × 16 × 128 | Offset 1 × 1 × 128; Scale 1 × 1 × 128
11 | relu2 | ReLU | 16 × 16 × 128 | –
12 | tconv3 | 64 5 × 5 × 128 transposed convolutions with stride [2 2] and cropping "same" | 32 × 32 × 64 | Weights 5 × 5 × 64 × 128; Bias 1 × 1 × 64
13 | bn3 | Batch normalization with 64 channels | 32 × 32 × 64 | Offset 1 × 1 × 64; Scale 1 × 1 × 64
14 | relu3 | ReLU | 32 × 32 × 64 | –
15 | tconv4 | 3 5 × 5 × 64 transposed convolutions with stride [2 2] and cropping "same" | 64 × 64 × 3 | Weights 5 × 5 × 3 × 64; Bias 1 × 1 × 3
16 | tanh | Hyperbolic tangent (tanh) | 64 × 64 × 3 | –
Table 7.2 Analysis result of discriminator network.
Sl. no. | Name | Type | Activations | Learnables
1 | Images | Image input (64 × 64 × 3 images) | 64 × 64 × 3 | –
2 | dropout | Dropout (25% dropout) | 64 × 64 × 3 | –
3 | Labels | Image input (1 × 1 × 1 images) | 1 × 1 × 1 | –
4 | emb | Embed and reshape layer with output size 64 × 64 × 1 | 64 × 64 × 1 | Embedding weights 50 × 4; Fully connect weights 4096 × 50; Fully connect bias 4096 × 1
5 | cat | Concatenation of 2 inputs along dimension 3 | 64 × 64 × 4 | –
6 | conv1 | 64 5 × 5 × 4 convolutions with stride [2 2] and padding "same" | 32 × 32 × 64 | Weights 5 × 5 × 4 × 64; Bias 1 × 1 × 64
7 | lrelu1 | Leaky ReLU with scale 0.2 | 32 × 32 × 64 | –
8 | conv2 | 128 5 × 5 × 64 convolutions with stride [2 2] and padding "same" | 16 × 16 × 128 | Weights 5 × 5 × 64 × 128; Bias 1 × 1 × 128
9 | bn2 | Batch normalization with 128 channels | 16 × 16 × 128 | Weights 1 × 1 × 128; Bias 1 × 1 × 128
10 | lrelu2 | Leaky ReLU with scale 0.2 | 16 × 16 × 128 | –
11 | conv3 | 256 5 × 5 × 128 convolutions with stride [2 2] and padding "same" | 8 × 8 × 256 | Weights 5 × 5 × 128 × 256; Bias 1 × 1 × 256
12 | bn3 | Batch normalization with 256 channels | 8 × 8 × 256 | Weights 1 × 1 × 256; Bias 1 × 1 × 256
13 | lrelu3 | Leaky ReLU with scale 0.2 | 8 × 8 × 256 | –
14 | conv4 | 512 5 × 5 × 256 convolutions with stride [2 2] and padding "same" | 4 × 4 × 512 | Weights 5 × 5 × 256 × 512; Bias 1 × 1 × 512
15 | bn4 | Batch normalization with 512 channels | 4 × 4 × 512 | Weights 1 × 1 × 512; Bias 1 × 1 × 512
16 | lrelu4 | Leaky ReLU with scale 0.2 | 4 × 4 × 512 | –
17 | conv5 | 1 4 × 4 × 512 convolutions with stride [1 1] and padding [0 0 0 0] | 1 × 1 × 1 | Weights 4 × 4 × 512; Bias 1 × 1
7.3.3.5 Steps
The stages for generating realistic scalograms and synthesizing respiratory sounds are outlined as follows:
Step 1: Examine and analyze the given respiratory signals, which contain an integral number of time-varying frequencies, using the continuous wavelet transform.
Step 2: Produce Morse scalogram representations for the various lung sounds with the aid of the MATLAB wavelet toolbox.
Step 3: Input the obtained scalograms of the original lung sound signals to
Conditional GAN.
Step 4: Generate realistic scalogram images with the help of modeled generator
network.
Step 5: Synthesize the original respiratory signal by giving the generated scalograms to
inverse CWT.
Step 6: Provide the original scalogram images extracted through continuous wavelet
transform and generated scalogram images through cGAN to the pretrained models
Alexnet CNN [5], GoogLenet [28], and ResNet 50 to quantify the performance.
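Step 6 amounts to standard transfer learning. A hedged Keras sketch using ResNet 50 is given below; the directory layout and image size are placeholders, the learning-rate drop schedule of Table 7.3 is omitted for brevity, and only the momentum, base learning rate, batch size, and epoch count are taken from Table 7.3.

```python
# Sketch of fine-tuning a pretrained CNN on original plus generated scalogram images.
import tensorflow as tf

# Scalogram images are assumed to be stored in one subfolder per class
# (normal, crackle, rhonchi, wheeze) under the directories below.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "scalograms/train", image_size=(224, 224), batch_size=10)
test_ds = tf.keras.utils.image_dataset_from_directory(
    "scalograms/test", image_size=(224, 224), batch_size=10)

preprocess = tf.keras.applications.resnet50.preprocess_input
train_ds = train_ds.map(lambda x, y: (preprocess(x), y))
test_ds = test_ds.map(lambda x, y: (preprocess(x), y))

# Pretrained ResNet 50 backbone with a new 4-class output head.
base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                      input_shape=(224, 224, 3), pooling="avg")
model = tf.keras.Sequential([base, tf.keras.layers.Dense(4, activation="softmax")])

# Momentum, base learning rate, batch size, and epochs follow Table 7.3;
# the learning-rate drop schedule is omitted here.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4, momentum=0.9),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=test_ds, epochs=20)
```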
7.4 Results and discussion
To demonstrate the performance of the proposed data augmentation technique using
conditional GAN, the original scalogram images extracted through continuous wavelet
transform and generated scalogram images through cGAN are input to different pretrained models. The classification performance is compared for all the classes of respiratory sounds with and without augmentation.
7.4.1 Dataset
The dataset used in this proposed study was acquired from various sources namely RALE
(Respiration acoustics Laboratory Environment) repository [29], Think labs Lung sound
library [30], and ICBHI [31] benchmark publicly available databank. These archives
comprise gender-based normal and abnormal lung sounds of several kinds. For the training and testing phases, the entire lung sound database has been randomly split into 70% and 30%, respectively. In total, the database has 73 normal files, 281 crackle sound files, 33 rhonchi files, and 122 wheeze files.
7.4.2 Data augmentation using conditional GAN
The training phase of the data augmentation process using cGAN is explained in this section. For the process of experimentation,
(i) The number of latent inputs for the generator network is set to 100. Initially, the generator produces random RGB noise in place of scalograms.
(ii) Using its convolutional filters, the discriminator network attempts to distinguish these random-noise scalograms from the real scalogram images of respiratory sounds.
(iii) To fool the discriminator network, the generator learns through its transposed convolution filters.
(iv) The process is repeated until the discriminator is confused to the greatest possible extent.
In the proposed approach, the network is trained for four different numbers of epochs, namely 500, 1000, 1500, and 2000, and the cost function is observed in each case. The training progress, together with the scores of both networks, is shown in Fig. 7.6.
To check for the convergence of the network during the process of training, the
scores are plotted on a scale from 0 to 1. The score of the generator network is defined
Fig. 7.6 Training plots of generator and discriminator for different epochs.
as the average of the probabilities produced by the discriminator for the generated images. In case 1 of Fig. 7.6, i.e., 500 epochs, mode collapse occurs, which indicates that the generator is incapable of learning the scalogram representations corresponding to diverse inputs. Therefore, in order to increase the ability of the generator to produce more varied outputs, the number of epochs is increased. The training plots of cases 2, 3, and 4 indicate that the generator score approaches 0 and the discriminator score approaches 1, which signifies that the discriminator network is dominating the generator network and therefore classifies most of the images correctly. Since the plots are almost stable in cases 3 and 4, the training phase is stopped at 1500 epochs. Further increasing the number of iterations only increases the computational time of the network.
7.4.3 Samples of generated scalogram images for different classes
Fig. 7.7 shows the samples of scalogram images generated by the generator network for
1500 epochs.
7.4.4 Synthesis of respiratory sounds using inverse CWT
The inverse CWT is applied to the generated scalograms, and the acquired respiratory sound signals for the normal and abnormal lung sound cases are plotted using the signal analyzer app and shown in Fig. 7.8.
Fig. 7.7 Samples of generated scalogram images for different classes using 1500 epochs.
Fig. 7.8 Sample of normal synthesis from scalogram using ICWT.
7.4.5 Performance results
To evaluate the performance of data augmentation using conditional GAN, pretrained
deep learning models such as AlexNet, GoogLeNet, and Resnet 50 are used for classification. The classification is performed for all the classes of respiratory sounds without
augmentation and with augmentation, and the results are compared for all the deep learning models. For classifying the dataset without augmentation, 357 images are used for training and 153 for testing. Similarly, for classification with augmentation, 200 images are generated for each class of respiratory sounds using cGAN; of these, 500 images are used for training and 300 for testing. The
experimental settings for modeling the network are listed in Table 7.3.
Table 7.3 Parameter settings for the trained network model.
Sl. no. | Hyperparameter | Value
1 | Momentum | 0.9
2 | Initial learning rate | 0.0001
3 | Learning rate drop factor | 0.2
4 | Learning rate drop period | 5
5 | Number of epochs | 20
6 | Batch size | 10
7 | Optimizer | SGDM
Table 7.4 Classification accuracy for the various pretrained models with and without cGAN.
Classifier | Without augmentation (%) | With augmentation using cGAN, 500 epochs (%) | 1000 epochs (%) | 1500 epochs (%)
AlexNet | 68.63 | 93.13 | 95.13 | 96.38
GoogLeNet | 73.86 | 93.45 | 96.88 | 96.88
ResNet 50 | 81.37 | 95.23 | 97.82 | 98.75
ResNet 50 provides the highest accuracy compared with the other two models.
The accuracy of a pretrained network is usually estimated by calculating the testing accuracy with the help of a confusion matrix. Accuracy measures the number of correctly classified normal and abnormal sound files relative to the total number of test samples. Table 7.4 shows the classification accuracy obtained for the various deep network models without and with augmentation for different epochs. The results in Table 7.4 indicate that the pretrained CNN model ResNet 50 performs well both with and without augmentation. For classification with real images, i.e., without augmentation, the ResNet 50 model produces an accuracy of 81.37%, which is higher than that of the AlexNet and GoogLeNet classifiers. Furthermore, generating new images with cGAN and training the ResNet 50 model produces the highest classification accuracy of 98.75% among the deep network models at 1500 epochs.
The training progress and confusion matrices of ResNet 50 network model with real
images and generated images are shown in Figs. 7.9 and 7.10 and Tables 7.5 and 7.6.
In the confusion matrices of Tables 7.5 and 7.6, the columns indicate the true cases for each class and the rows indicate the cases predicted as belonging to each class. To be
more specific, in Table 7.5, the number of actual cases in the class crackle is 77, 31 in
normal, 16 in rhonchi, and 29 in wheeze. The number of cases correctly classified as
belonging to the particular class are 71 for crackle, 18 for normal, 10 for rhonchi, and
25 for wheeze. With data augmentation, in Table 7.6, the number of actual cases in
the class crackle is 72, 79 in normal, 75 in rhonchi, and 74 in wheeze, and the number
of cases correctly classified as belonging to the particular class are 72 for crackle, 75 for
normal, 75 for rhonchi, and 74 for wheeze.
From the confusion matrices, other metrics, such as precision, recall, and F1 score, are also calculated for all types of lung sounds and are tabulated in Tables 7.7 and 7.8. Precision is the fraction of positive class predictions that are actually positive. Recall is the fraction of all positive samples in the dataset that are correctly predicted, while the F1 score is the harmonic mean of precision and recall.
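In terms of the confusion-matrix counts of true positives (TP), false positives (FP), and false negatives (FN), these standard definitions can be written as

\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}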
Fig. 7.9 Training progress of ResNet 50 network for the case of real images (without augmentation): accuracy and loss versus iterations; validation accuracy = 81.37%.
From the tables, it is observed that the class-wise accuracy is high for all classes, exceeding 98% in the case of classification with augmentation, whereas it is comparatively low without augmentation. In addition, the F1 score is high for all classes in Table 7.8, which reveals that the validation accuracy is better for augmented data in all cases.
7.4.6 Analysis
The method proposed in this chapter demonstrates the training of CNNs with an alternative data augmentation approach, namely generating synthetic scalogram images using conditional generative adversarial networks. The proposed method is evaluated with three pretrained models, namely the AlexNet, GoogLeNet, and ResNet50 classifiers. The pretrained AlexNet model, with five convolutional and three fully connected layers, produces an accuracy of 68.63% for the original scalogram images, whereas the network trained for 1500 epochs using the cGAN data augmentation approach yields an accuracy of 96.38%. The same images, when trained with the 22-layer-deep GoogLeNet classifier, yield an accuracy of 73.86% without augmentation and 96.88%
with data augmentation. The third model ResNet 50 comprises 49 convolutional layers
and a fully connected layer. This network when trained, yields an accuracy of 81.37%
Fig. 7.10 Training progress of ResNet 50 network with augmentation: accuracy and loss versus iterations; validation accuracy = 98.75%.
Table 7.5 Confusion matrix of ResNet 50 network without augmentation (rows: predicted class; columns: true class).
 | Crackle | Normal | Rhonchi | Wheeze
Crackle | 71 | 9 | 1 | 3
Normal | 3 | 18 | 0 | 1
Rhonchi | 0 | 0 | 10 | 0
Wheeze | 3 | 4 | 5 | 25
Table 7.6 Confusion matrix of ResNet 50 network with augmentation (rows: predicted class; columns: true class).
 | Crackle | Normal | Rhonchi | Wheeze
Crackle | 72 | 3 | 0 | 0
Normal | 0 | 75 | 0 | 0
Rhonchi | 0 | 0 | 75 | 0
Wheeze | 0 | 1 | 0 | 74
Table 7.7 Precision and recall for ResNet 50 model without augmentation.
Class | Accuracy (%) | Precision | Recall | F1 score
Crackle | 87.58 | 0.85 | 0.92 | 0.88
Normal | 88.89 | 0.82 | 0.58 | 0.68
Rhonchi | 96.08 | 1 | 0.63 | 0.77
Wheeze | 89.54 | 0.68 | 0.86 | 0.76
Table 7.8 Precision and recall for ResNet 50 model with augmentation.
Class | Accuracy (%) | Precision | Recall | F1 score
Crackle | 99 | 0.96 | 1 | 0.98
Normal | 98.67 | 1 | 0.95 | 0.97
Rhonchi | 100 | 1 | 1 | 1
Wheeze | 99.67 | 0.99 | 1 | 0.99
without augmentation and 98.75% with augmentation. This indicates that deeper networks prove efficient both in terms of computation and the number of parameters. In addition, the model with the best accuracy both with and without augmentation, i.e., ResNet 50, is assessed with various metrics, namely accuracy, precision, recall, and F1 score. Based on the results in Table 7.8, the high values of precision, recall, and F1 score show that the validation accuracy is better for augmented data in all classes of respiratory sounds.
7.5 Conclusion and future scope
Owing to the challenges in applying conventional data augmentation techniques to time-frequency representations of the signal, a novel data augmentation approach has been experimented with for the signal under study. In this chapter, a GAN, an unsupervised learning structure, is utilized to generate synthetic images for the different classes of respiratory sounds. For better-targeted image generation, conditional information is imposed on the basic GAN. The adversarial learning behavior of the conditional GAN gives rise to the generation of scalogram images very close to the original scalogram images of respiratory sounds. It is also found that the modeled discriminator network dominates the generator network and therefore categorizes the majority of the images accurately. In addition, the performance of the data augmentation approach is evaluated with different pretrained deep learning classifiers and compared with that obtained using the original images without augmentation. The results show a significant improvement in the classification accuracy of all models with the data augmentation approach in comparison with training without cGAN. Furthermore, the ResNet 50 model achieves a testing accuracy of 98.75% with high values of precision, recall, and F1 score for all
classes of respiratory sounds resulting in better prediction. This study can be further
extended with other types of GAN such as cycle GANs and Wasserstein GANs for
the synthetic generation of images. The same setup can be compared with the generation
of images using variational convolutional autoencoder to produce a better prediction
model.
References
[1] https://www.healthypeople.gov/2020/topics-objectives/topic/respiratory-diseases (Respiratory Diseases—Accessed 10 May 2020).
[2] https://www.who.int/gard/publications/The_Global_Impact_of_Respiratory_Disease.pdf (Global Impact of Respiratory Diseases—Accessed 05 May 2020).
[3] M. Sarkar, I. Madabhavi, N. Niranjan, M. Dogra, Auscultation of the respiratory system, Ann. Thoracic Med. 10 (3) (2015) 158.
[4] A. Sharif Razavian, H. Azizpour, J. Sullivan, S. Carlsson, CNN features off-the-shelf: an astounding
baseline for recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition Workshops, 2014, pp. 806–813.
[5] S. Jayalakshmy, G.F. Sudha, Scalogram based prediction model for respiratory disorders using optimized convolutional neural networks, Artif. Intell. Med. 103 (2020) 101809.
[6] I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, et al., Generative
adversarial networks, 2014. arXiv preprint arXiv:1406.2661.
[7] P. Costa, A. Galdran, M.I. Meyer, M.D. Abràmoff, M. Niemeijer, A.M. Mendonça, A. Campilho,
Towards Adversarial Retinal Image Synthesis, 2017. arXiv preprint arXiv: 1701.08974.
[8] L. Bi, J. Kim, A. Kumar, D. Feng, M. Fulham, Synthesis of positron emission tomography (PET)
images via multi-channel generative adversarial networks (GANs), in: Molecular Imaging, Reconstruction and Analysis of Moving Body Organs, and Stroke Imaging and Treatment, Springer, Cham, 2017,
pp. 43–51.
[9] M. Frid-Adar, E. Klang, M. Amitai, J. Goldberger, H. Greenspan, Synthetic data augmentation using
GAN for improved liver lesion classification, in: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), IEEE, 2018, April, pp. 289–293.
[10] H. Salehinejad, S. Valaee, T. Dowdell, E. Colak, J. Barfett, Generalization of deep neural networks for
chest pathology classification in x-rays using generative adversarial networks, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, April, pp.
990–994.
[11] D. Bhattacharya, S. Banerjee, S. Bhattacharya, B.U. Shankar, S. Mitra, GAN-based novel approach for
data augmentation with improved disease classification, in: Advancement of Machine Intelligence in
Interactive Medical Image Analysis, Springer, Singapore, 2020, pp. 229–239.
[12] W. Dai, J. Doyle, X. Liang, H. Zhang, N. Dong, Y. Li, E.P. Xing, Scan: structure correcting adversarial
network for chest x-rays organ segmentation, arXiv (2017). arXiv preprint arXiv: 1703.08770.
[13] Y. Onishi, A. Teramoto, M. Tsujimoto, T. Tsukamoto, K. Saito, H. Toyama, H. Fujita, Automated
pulmonary nodule classification in computed tomography images using a deep convolutional neural
network trained by generative adversarial networks, BioMed Res. Int. 2019 (2019) 6051939,
https://doi.org/10.1155/2019/6051939.
[14] P. Chaudhari, H. Agrawal, K. Kotecha, Data augmentation using MG-GAN for improved cancer classification on gene expression data, Soft Comput. 24 (2019) 11381–11391.
[15] Z. Luo, S.Y. Cheng, Q.Y. Zheng, GAN-based augmentation for improving CNN performance of
classification of defective photovoltaic module cells in electroluminescence images, in: IOP Conference
Series: Earth and Environmental Science, vol. 354 (1), IOP Publishing, 2019, October, p. 012106.
[16] J. Li, H. He, L. Li, G. Chen, A novel generative model with bounded-GAN for reliability classification
of gear safety, IEEE Trans. Ind. Electr. 66 (11) (2019) 8772–8781.
[17] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, R. Webb, Learning from simulated and
unsupervised images through adversarial training, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2107–2116.
[18] C. Donahue, J. McAuley, M. Puckette, Adversarial Audio Synthesis, 2018. arXiv preprint arXiv:
1802.04208.
[19] J. Shen, R. Pang, R.J. Weiss, M. Schuster, N. Jaitly, Z. Yang, R.A. Saurous, Natural TTS synthesis by
conditioning wavenet on Mel spectrogram predictions, in: 2018 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, April, pp. 4779–4783.
[20] A. Marafioti, N. Perraudin, N. Holighaus, P. Majdak, Adversarial generation of time-frequency features with application in audio synthesis, in: International Conference on Machine Learning, 2019,
May, pp. 4352–4362.
[21] S. Dieleman, A.V.D. Oord, K. Simonyan, The challenge of realistic music generation: modelling raw
audio at scale, 2018. arXiv preprint arXiv:1806.10474.
[22] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W.Z. Teoh, J. Sotelo, et al., Melgan: generative
adversarial networks for conditional waveform synthesis, in: Advances in Neural Information Processing Systems, 2019, pp. 14910–14921.
[23] S. Vasquez, M. Lewis, Melnet: A Generative Model for Audio in the Frequency Domain, 2019. arXiv
preprint arXiv: 1906.01083.
[24] M. Mirza, S. Osindero, Conditional Generative Adversarial Nets, 2014. arXiv preprint arXiv:
1411.1784.
[25] Y. Qian, H. Hu, T. Tan, Data augmentation using generative adversarial networks for robust speech
recognition, Speech Commun. 114 (2019) 1–9.
[26] https://www.mathworks.com/help/deeplearning/ug/train-conditional-generative-adversarial-network.html (Conditional Generative Adversarial Networks—Accessed 20 April 2020).
[27] A.H. Najmi, J. Sadowsky, The continuous wavelet transform and variable resolution time-frequency
analysis, Johns Hopkins APL Techn. Digest 18 (1) (1997) 134–140.
[28] L. Balagourouchetty, J.K. Pragatheeswaran, B. Pottakkat, G. Ramkumar, GoogLeNet based ensemble
FCNet classifier for focal liver lesion diagnosis, IEEE J. Biomed. Health Inform. 24 (6) (2020)
1686–1694.
[29] The R.A.L.E. Repository. Rale.ca.N.P., 2017. Web. 28 February 2017.
[30] https://www.thinklabs.com/lung-sounds (Lung sounds Library—Accessed 24 January 2019).
[31] B.M. Rocha, D. Filos, L. Mendes, G. Serbes, S. Ulukaya, Y.P. Kahya, E. Kaimakamis, An open access
database for the evaluation of respiratory sound classification algorithms, Physiol. Measur. 40 (3) (2019),
035001.
CHAPTER 8
Visual similarity-based fashion
recommendation system
Betul Ay and Galip Aydin
Firat University Computer Engineering Department, Elazig, Turkey
8.1 Introduction
Visual similarity search systems have become one of the most popular application areas of
image retrieval systems in recent years. Content-based image retrieval (CBIR) [1] aims to
search images based on their contents such as shapes, objects, colors, local geometry, or
texture rather than the metadata associated with the image file such as file names,
descriptions, or keywords. Image retrieval systems have been used in a large array of
application areas such as search engines, personalized recommendation systems, art galleries management, retail systems, fashion design, and more commonly in e-commerce
applications [2].
For any CBIR system, there are two major steps: finding the most important features
of each image so that images can be described as feature vectors and calculating distances
between images for similarity. Therefore, the success of a CBIR system relies heavily on
the quality of the vector feature representation. The task of extracting high-quality, accurate, and efficient vector feature representation for each image is challenging due to the
fact that images might have a wide variety of different properties such as content, size,
resolution, etc. Labeling large amounts of data is another major issue. Moreover, supervised training, even with sufficient annotated data, might limit the generalizability of the feature representations to novel classes. Recently, semisupervised and unsupervised learning
approaches have gained popularity to overcome these difficulties.
One of the most interesting applications of deep learning in recent years is creating
feature representations of images as vectors. CNNs have widely been used for this purpose and show great promise [3–6]. Another interesting approach for creating vector representations is generative adversarial networks (GANs) [7]. GANs are being utilized in
semisupervised learning due to the fact that they can learn deep image representations
from unlabeled data. GANs are created using two models: a generative model G and
a discriminator model D. The generative network creates samples while the discriminative network tries to distinguish the generated samples from true data.
GANs are conceptually considered as a form of unsupervised learning because no
labeled data is needed. In recent years, GANs have been one of the most popular fields
of research due to their ability to learn high-dimensional and complex data distributions
by taking advantage of the use of unlabeled data for model training. Furthermore, GANs
can be leveraged to build powerful models in any domains including images, speech,
and text.
Although GANs have been created for unsupervised learning, they have proven to be
successful in semisupervised and reinforcement learning as well. GANs have successfully
been used in various applications such as generating visually realistic images and style
transfer; however, their effectiveness is not limited to these scenarios. Image-to-image
translation (CycleGAN [8]), creating high-resolution images from low-resolution
samples (SRGAN [9]), image generation from text (StackGAN [10]), learning to discover relationships between different domains such as fashion items (DiscoGAN [11]),
transferring facial makeup from a reference image (Beautygan [12]), and other applications reviewed in Ref. [13] are some examples.
Extracting deep features from images is another area of use for GANs. The idea is that
since GANs can generate realistic images, vector representations of these images can also
be used to describe a given image. And since the discriminator pushes the generator to
generate more realistic images with enough training, the system produces higher quality
representations in comparison to CNNs such as VGG. For instance, Hou et al. [14] utilize
GANs for extracting features from images. They utilize a pretrained CNN namely
19-layer VGGNet [15] network which was trained on ImageNet for feature extraction.
The proposed system contains generator, VGGNet, and discriminator networks. The
generator network does not feed the real and fake images directly to the discriminator,
unlike traditional GANs. Instead, the real and generated fake images are first preprocessed
using VGGNet and corresponding features are extracted. The extracted features are
subsequently fed to the discriminative network. The authors evaluated the results via
a web interface created for human evaluators and they report that the resulting face
images cannot be distinguished from real face images easily and the proposed model
generates more realistic images compared to DCGAN [16] and DFC-VAE [17].
Since GANs are able to provide us with accurate image representations as vectors,
they can be employed in similar product search for e-commerce applications. Worldwide
growth of e-commerce economy has resulted in many innovative solutions including
recommender systems to be deployed. Traditionally recommendation systems make
use of past customer interactions, clicks, and purchase history to recommend new and
related products. Content-based, collaborative, or hybrid recommender systems create
recommendations based on recorded customer behaviors and similar decisions but ignore
the product image content. A new type of recommendation system being emerged in
e-commerce sites is visual similarity recommendation which creates a list of visually
similar items to the query image.
In this chapter, we present a visual recommendation system for e-commerce sites
which utilizes GANs for image feature vector generation and a vector similarity search library for fast and accurate similarity queries. The proposed GAN is trained on a large-scale shoe image dataset of 156,896 images. We also compare the precision and time performance of the proposed GAN with existing pretrained deep learning models on a standard fashion benchmark dataset, UT-Zap50K [18]. Since the proposed system does not require an annotated image dataset, it is easy to extend to other types of fashion items other
than shoes. We also prove this feature by extending the model with handbag images and
conduct performance tests as well. The system as a whole presents a deep learning-based
similar image recommendation solution. We also provide comparisons for several different GAN architectures for shoe similarity recommendation.
The rest of the chapter is organized as follows: In Section 8.2 we briefly discuss related
literature and present background on GANs and CNNs. Section 8.3 presents the proposed fashion recommendation system architecture. Experimental results are discussed
in Section 8.4 and conclusions are presented in Section 8.5.
8.2 Related works
The major research problem in this study is finding accurate vector representations for
image features. Fast and accurate feature extraction from images allows us to build a
robust and effective visual similarity recommendation system. Traditionally machine
learning is used to recommend items for e-commerce customers [19–23]. However,
in fashion domain image retrieval is subtle and subjective due to the fact that humans tend
to have very different opinions on fashion items. Therefore, CBIR is an active area of
research for e-commerce [24] and traditional recommender systems can be extended
with visual similarity recommendations.
In Ref. [25] the authors presented an architecture for retrieving most similar images to
the query image. AlexNet [26] and VGG-16 pretrained networks are used to extract local
and deep features from the activation of the intermediate layers. Representations from
the fc6 and fc8 layers are used as feature vectors. Hamming distance is used as the similarity metric. To evaluate the efficiency of the system 2399 women’s fashion images
obtained from Pinterest are collected and manually labeled into nine categories.
Kiapour et al. [27] proposed an architecture for street-to-shop image retrieval, which aims to find clothing items in an online shop that are similar to a given real-world photograph. In Ref. [28] the
authors have proposed a solution for cross-domain fashion product retrieval by retrieving
similar clothing items from online shopping images. Shankar et al. [29] proposed a visual
search and recommendation system for e-commerce. Similarly, they use CNN for
generating image feature vectors for each fashion product. Images from the Fashionista
dataset [30] and Flipkart catalog images are labeled to create a large annotated dataset.
This section introduces a theoretical overview of GANs [13]. We define the difference between Vanilla GAN and InfoGAN architectures depicted in Fig. 8.1 and highlight the strengths of the adversarial training process chosen for our recommendation task.
Fig. 8.1 Overview of GAN and InfoGAN architectures
Lastly, we provide an intuitive overview of state-of-the-art architectures based
on CNNs.
8.2.1 Vanilla GAN
GAN architecture is comprised of two distinct networks: generator and discriminator.
These networks are trained simultaneously and perform adversarial training by playing
a two-player minimax game. The generator generates new data samples, while the discriminator decides whether each sample it receives belongs to the training dataset. For an
image generation task, the generator network receives a random vector z which is sampled from a known distribution and generates a new fake image. Fake data generated
from generator network and real data taken from the real dataset are fed into the discriminator. The discriminator takes account of all the data fed into it and returns a probability
of whether an image is real or fake. More formally, this learning process is given in the
following steps:
• The goal of the generator G is to build a mapping from a prior noise distribution, where the random noise z ∈ ℝ^Z, to a data space referred to as fake data G(z).
• The goal of the discriminator D is to estimate the probability of a sample coming from the real data x = {x_1, …, x_N} rather than the fake data G(z), i.e., being real D(x) or fake D(G(z)).
• The value function represents a two-player minimax game whose value is maximized with respect to D and minimized with respect to G, which is expressed by the following equation:
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}}[\log D(x)] + \mathbb{E}_{z \sim P_{noise}}[\log(1 - D(G(z)))]
Here, P_data and P_noise indicate the real data distribution and the noise distribution, respectively.
• While G captures the data distribution of the training set and tries to fool D by minimizing the \mathbb{E}_{z \sim P_{noise}}[\log(1 - D(G(z)))] term, the D network acts as a binary classifier by maximizing both \mathbb{E}_{x \sim P_{data}}[\log D(x)] and \mathbb{E}_{z \sim P_{noise}}[\log(1 - D(G(z)))].
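As a concrete illustration of this objective, the small TensorFlow sketch below turns the two expectations into loss functions that can be minimized with a gradient-based optimizer; the helper names and the numerical epsilon are assumptions, and d_real and d_fake are taken to be sigmoid outputs of a discriminator in (0, 1).

```python
# Minimal sketch of the minimax value function as trainable losses.
import tensorflow as tf

def discriminator_loss(d_real, d_fake):
    # -V(D, G): maximizing V with respect to D is equivalent to minimizing this quantity.
    return -tf.reduce_mean(tf.math.log(d_real + 1e-8) + tf.math.log(1.0 - d_fake + 1e-8))

def generator_loss(d_fake):
    # The generator minimizes E[log(1 - D(G(z)))].
    return tf.reduce_mean(tf.math.log(1.0 - d_fake + 1e-8))

# d_real = D(x) and d_fake = D(G(z)); the small constant guards against log(0).
```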
8.2.2 InfoGAN
Information maximizing generative adversarial networks (InfoGAN) described in Ref. [31]
controls the different attributes of the generated images, unlike the vanilla GAN architectures, which offer little or no control over the generated images.
Fig. 8.2 Performance comparison of the state-of-the-art deeper CNNs.
The InfoGAN, an
information-theoretic extension of GANs, uses information theory concepts so that the
noise term is transformed into a latent code, which provides systematic and predictable control on the output. It learns how to decompose the input noise vector into two parts: a source of incompressible noise z and a latent code c. It is trained to maximize the mutual information
between c and the output of generator (generated image) G(z, c). Fig. 8.1B depicts the architecture of InfoGAN. Information-regulated min-max objective of InfoGAN is formulated
as follows by adding a constant regularization term with a hyperparameter λ:
\min_G \max_D V_I(D, G) = V(D, G) - \lambda I(c; G(z, c))
Here, I(c; G(z, c)) is the mutual information between c and G(z, c). While Vanilla
GAN formulation (1) uses a single unstructured noise vector z, the generator of
InfoGAN takes the concatenated vector (z, c) where c represents structured semantic
features of the data distribution [31]. For a given set of structured latent variables c_1, c_2, …, c_L, the latent code c denotes the concatenation of all latent variables c_i, whose distribution is assumed to factor as \prod_{i=1}^{L} P(c_i). InfoGAN also uses a neural network Q(c | x) that shares the same network structure as the discriminator D (except for the last layer), adding only a negligible computation cost to the Vanilla GAN. To maximize the mutual information, InfoGAN uses a variational information maximization technique, known as lower bounding the mutual information, which reduces the computational complexity of the mutual information calculation. The final objective function of InfoGAN, with a variational lower bound L_I(G, Q) of the mutual information satisfying L_I(G, Q) ≤ I(c; G(z, c)), is defined as follows:
\min_{G, Q} \max_D V_{InfoGAN}(D, G, Q) = V(D, G) - \lambda L_I(G, Q)
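For a categorical latent code, the lower bound L_I(G, Q) reduces, up to the constant entropy H(c), to the negative cross-entropy between the sampled code and the prediction of Q on the generated image. The sketch below states this in TensorFlow; the tensor names are assumptions and the Q network itself is not shown.

```python
# Sketch of the variational lower bound L_I(G, Q) for a categorical latent code.
import tensorflow as tf

def mutual_information_lower_bound(code_onehot, q_logits):
    # q_logits: output of the Q network for generated images G(z, c).
    cross_entropy = tf.nn.softmax_cross_entropy_with_logits(labels=code_onehot,
                                                            logits=q_logits)
    return -tf.reduce_mean(cross_entropy)   # add H(c), a constant, for the exact bound

# Training then maximizes V(D, G) for D and minimizes V(D, G) - lambda * L_I for G and Q.
```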
8.2.3 CNN-based architectures
This section gives a review of CNN and dives deeper to explore the top CNN architectures which have proven themselves at visual tasks including image classification, object
detection, and semantic segmentation. CNN emerged in the 1990s when Yann LeCun
et al. [32] put forward new neural network architecture for classification of handwritten
digits. Although the first CNN, known as LeNet, recognized digits of zip codes effectively, it could not cope with more difficult and complex data. Nevertheless, the work
of these and many other researchers led to the development of larger and deeper CNNs.
With the ImageNet visual recognition competition, the latest known CNN architectures
have emerged. In 2012, the best invention was AlexNet architecture, developed by Alex
Krizhevsky and his colleagues [26] at the University of Toronto. The first popular use of
deep learning in computer vision began with the AlexNet architecture, which has 60 million parameters (over 1000 times higher than that of LeNet [33]), five convolutional
layers and three fully connected layers.
Deeper CNNs, which first appeared for the ImageNet classification problem, have been more efficient in solving classification problems. The accuracy comparison on ImageNet of the most popular CNN architectures, which are also used in this study for quantitative comparison, is depicted in Fig. 8.2. These architectures are summarized below:
• VGG: This architecture was developed by Simonyan et al. in 2014 [15], based on the
notion that deeper networks are stronger networks. The overall number of trainable
parameters (around 140 million) is over 2.3 times higher than that of AlexNet. On the
other hand, smaller filters have been used when compared with AlexNet. This architecture uses fixed 3 × 3 kernel filters in all convolution layers. Two versions of this architecture are available, VGG16 and VGG19, with 16 and 19 layers, respectively. Both
of the networks have been built with five blocks followed by a max-pooling layer and
the blocks contain sequential convolutional layers.
• Inception: The architecture, known as GoogLeNet, was developed by Google
researchers [34]. Although VGG architecture has better accuracy performance than
AlexNet, it needs to use too much memory due to the number of parameters. On
the other hand, Inception has achieved higher accuracy than AlexNet and
VGG16 with fewer trainable parameters (about 5 million for InceptionV1 version).
It has been built with inception blocks that contain convolutional layers with
variable-sized filters of 1 × 1, 3 × 3, and 5 × 5. The architecture, which has been continuously improved upon, also has different versions (InceptionV1, InceptionV2, InceptionV3, and so on).
• ResNet: Researchers noticed in the previous architectures that, when adding layers to deep networks, performance increased up to a point and then dropped rapidly beyond that point. This problem, known as the vanishing gradient, arose during network training with back propagation. More briefly, as the number of layers increases, gradient values decrease and approach zero. ResNet [35] solved the vanishing gradient problem by introducing a shortcut connection. The architecture was built with residual blocks consisting of convolutional layers and a shortcut connection that links the input of the first layer in a block to the output of its last layer. The shortcut connection is also named a skip connection because the network can skip some of the convolutional layers through it. Different residual networks with 18, 34, 50, 101, and 152 layers have been proposed. The common feature of all ResNet architectures is that they have a 7 × 7 convolutional layer followed by a 3 × 3 max-pooling layer before the residual blocks, and average pooling after the blocks (before the fully connected output layer).
• DenseNet: Like ResNet, DenseNet [36] also focused on solving the vanishing gradient
problem with fewer parameters. The DenseNet architecture, inspired by ResNet, was constructed of dense blocks consisting of convolutional layers. Unlike ResNet, there are direct connections from each convolutional layer to all subsequent layers. The input of any layer is the concatenation of the feature maps generated by all preceding layers. The
architecture has been built with multiple dense blocks and it uses the same layers of
ResNet before and after dense blocks. There are various versions of the network with
different number of dense blocks such as DenseNet121, DenseNet169, DenseNet201,
and DenseNet264.
• MobileNets: The main goal of this architecture [37] is to create light-weight and low-latency models, which can be used on limited-memory or resource-limited devices. The architecture (MobileNetV2) was built with inverted residual blocks and linear bottlenecks, where the input and output of the residual blocks are the bottleneck layers [38]. When compared with the previous architectures trained on the ImageNet dataset, MobileNet has a faster inference time and a smaller model size, but worse classification performance (see Fig. 8.2). This performance is acceptable given its ability to work in near real time on mobile devices. There are different versions such as MobileNet V1, V2, and V3.
The abovementioned architectures, which also have pretrained models, are still very
popular today and are often used as benchmarks to compare new proposed architectures
or make sure a new dataset is reliable.
8.3 Fashion recommendation system
The system architecture depicted in Fig. 8.3 consists of three major modules:
1. A deep neural network model for generating effective image representations or
vectors
2. Feature vector database for querying image similarity
3. Web interface for interacting with the system
Fig. 8.3 Overview of fashion recommendation system.
Arguably, the most important part of any image retrieval system is a feature extraction
module since the accuracy of the results depends on the quality of the extracted features.
Feature extraction is basically the process of dimensionality reduction in which a given
input is converted into more manageable values so that redundant information in the
input is ignored and the computational power required to process a large number of inputs is decreased. The quality and effectiveness of the extracted features lead to more successful learning and generalization steps. Extracted features from a given input are generally
represented as feature vectors in which only the important or selected portions of the
input are preserved. Almost all machine learning tasks require some form of feature
extraction to deal with large datasets and hence several algorithms and approaches have
been developed over the years. Sometimes experts use domain knowledge to extract
important features from the data which is called feature engineering while several
algorithms such as PCA (principal component analysis), Autoencoders, or LSA (latent
semantic analysis) are used to extract important features. Traditionally, in image processing edge, corner or blob detection algorithms are used to extract features.
However, deep neural networks have lately paved the way for automatic feature
extraction approaches. DNNs such as CNNs have proved to capture the significant properties of the images and create vector representations for later use. With little or no preprocessing, the images are fed into a pretrained DNN and vector representation or
embedding of the image is obtained. Several studies show that the embedding obtained
from DNNs can capture semantic and inherent features in the images and thus similar
images are represented with closer vectors.
Although deep learning models such as CNNs give successful results for the general
image similarity problem, it may not be possible to get sufficient quality results within the
fashion domain. For instance, there are only a few general shoe types (boots, sneakers, oxfords, etc.) and products within one of these categories resemble each other. Therefore,
image similarity approaches in such domains require finer grain resolutions for successful
results. In this study, we use GANs for generating vector representations of images.
Image similarity task is the process of retrieving similar images to the queried product
image. In our case, where we try to identify similar shoe images, similarity can be assessed along three major properties: color, shape, and texture. The success of the similarity query
must account for all three dimensions; hence we experimented with several GAN architectures and evaluated the results according to these three dimensions.
The second module is the vector database which stores the vector representations of
images and returns the similar image ids for similarity queries.
With the widespread use of artificial intelligence models, the need to effectively query
the vector representations of both text and image data has emerged. In the recommender
systems, for example, each product or user is represented as a vector and for each vector,
there needs to be a list of best recommendations generated. However, it can be quite
expensive to generate such lists when the amount of data is very large because traditional
databases and algorithms are not suitable for storing and querying for similar vectors
among hundreds of thousands or even millions of vectors. In recent years several similarity search libraries emerged such as FAISS, Annoy, and NMSLib. Most of these libraries generate a list of approximate nearest neighbors for a given query and employ several
different types of indexes to perform this task effectively.
FAISS (Facebook AI Similarity Search) [39] is a library developed by Facebook AI
Research group which is used for efficient searching of dense vectors or document
embeddings. It contains many useful algorithms for searching arbitrary sizes of vectors.
We employ FAISS in our architecture as a similarity search engine. FAISS implements
some of the search algorithms on GPU and is extremely scalable in comparison with the
traditional SQL database engines.
FAISS is best utilized when documents are represented as vectors and identified by an
integer id. Image embeddings or representation vectors are extracted using the trained model
and are inserted into the FAISS index. FAISS can compare the vectors using L2 (Euclidean)
distances, dot products or cosine similarity. When a query image is presented the system creates the vector embedding of the image and queries the FAISS index for similar images.
FAISS returns the lowest L2 distances (or the highest dot product) with the query vector [39].
In our system, we first build a FAISS index with XXX shoe image vectors. When a
query image is presented, the system first generates a vector of the image using the GAN
model and queries the FAISS index for closest vectors. The FAISS library returns a list of
image ids of which the L2 distances are smallest to the query vector.
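A minimal sketch of this indexing and querying step with the FAISS Python API is shown below; the embedding files, the 12,544-dimensional feature size taken from Section 8.3.1.1, and the choice of an exact IndexFlatL2 index are assumptions of the illustration.

```python
# Sketch of building and querying a FAISS index of shoe-image embeddings.
import numpy as np
import faiss

d = 12544                                              # dimension of the discriminator features
embeddings = np.load("shoe_embeddings.npy").astype("float32")   # shape: (num_images, d)

index = faiss.IndexFlatL2(d)                           # exact L2 search
index.add(embeddings)                                  # FAISS ids are the row positions

query = np.load("query_embedding.npy").astype("float32").reshape(1, d)
distances, ids = index.search(query, 10)               # ten most similar shoe images
print(ids[0], distances[0])
```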
The third and last module is the web interface which is used to interact with the
system. We demonstrate selected shoe images and similar images.
Successful deep learning models require large amounts of data for training. It is imperative to provide a sufficient amount of data so that, after enough training, the DNN can
capture the essence of the domain and visually understand what is in it and what is not.
For a query image, the image is passed into the trained model and the feature vector of the image is extracted. A similarity score is computed against all of the images stored in the feature vector DB. The most similar images, i.e., those with the lowest distance scores, are ranked for the recommendation. Our goal is to achieve the optimal model, as the inference time has to be fast.
8.3.1 Deep network architectures
We make an effort to build a recommender system based on visual similarity in this chapter. Firstly, we explore the best model, which learns the best visual features of fashion
items. The best model in this study refers to the optimum neural network, which gives
high accuracy with fast inference time. Since the various visual features such as colors,
edges, corners, and other different patterns in the model are learned during the training
process, deep learning is also referred to as feature learning. We present our feature
learning experiments of following neural networks under this section.
8.3.1.1 Proposed network
The proposed architecture of discriminator and generator networks inspired by InfoGAN [31] is defined in Table 8.1.
The discriminator model D takes an image with three color channels and 207 × 207 pixels in size and outputs a binary prediction of fake or real. Instead of the binary prediction, we use the discriminator to extract a feature vector with a size of 1 × 12,544 after saving the trained discriminator model. The generator model G input is a concatenated 108-dimensional vector consisting of a noise variable (100) and a latent code (8) representing class information. We conduct unsupervised learning, in which we use no labeled data. We assume that our data consists essentially of eight different classes (heels, sandals, sports, boots, high boots, loafers, slippers, and flats), so the latent code value is set to eight. We change the latent code value to nine when adding handbags to further extend the model. We use batch normalization [40] in the models to stabilize the training. We also apply a regularization layer (dropout [41]) after all convolutional layers to address overfitting and memorization limitations. For D (discriminator), all convolution layers use Leaky ReLU (LReLU), and an output layer with sigmoid activation is used to obtain the prediction score of the images over two classes, real (class = 1) and fake (class = 0). For G (generator), all transposed convolutional layers use ReLU, and a tanh activation function is used in the last layer, as described in Ref. [16]. Similar to the InfoGAN,
D and Q share the same network structure using convolutional layers except for the fully
connected output layers. At the output layer, Q uses tanh activation function.
The discriminator and generator loss curves are depicted in Fig. 8.4. From the patterns in the loss curves, it can be seen that both the discriminator loss and the generator loss decrease up to about the 2500th iteration. After this point, the generator loss increases rapidly and the discriminator loss drops, which means that the discriminator is becoming strong enough to distinguish the real and fake samples and the generator is not able to generate better
Table 8.1 The networks used for training the shoe and handbag dataset.
Discriminator Model D/Recognition
Network Q
Input 207x207 Color image
3x3 conv2d. 16 LReLU. stride 2.
Dropout (.5)
3x3 conv2d. 32 LReLU. stride
1. batchnorm
Dropout (.5)
3x3 conv2d. 64 LReLU. stride
2. batchnorm
Dropout (.5)
3x3 conv2d. 128 LReLU. stride
1. Batchnorm
Dropout(.5)
3x3 conv2d. 256 LReLU. stride
2. Batchnorm
Dropout(.5)
3x3 conv2d. 16 LReLU. stride
1. batchnorm
FC. 1 sigmoid for D (output layer
for D)
FC. 8 Tanh for Q (output layer
for Q)
Generator Model G
Input ℝ108
FC. 4x6x256
3x3 conv2d_transpose. 128 ReLU. stride
1. batchnorm
Dropout (.6)
3x3 conv2d_transpose. 64 ReLU. stride
1. batchnorm
Dropout (.6)
3x3 conv2d_transpose. 32 ReLU. stride
1. batchnorm
Dropout(.6)
3x3 conv2d_transpose. 16 ReLU. stride
1. batchnorm
Dropout(.6)
3x3 conv2d_transpose. 16 ReLU. stride
1. batchnorm
Dropout(.6)
3x3 conv2d_transpose. 3 Tanh. stride 1.
Dropout(.6)
Fig. 8.4 Training results of the proposed network.
samples. We consider the 2000–2500 iteration interval to contain the ideal checkpoints for our network, and the generator samples at these iterations confirm that the model has learned the shoe features well. We can observe from Fig. 8.4 that the shoe samples generated by the proposed network have basic shoe features such as color and class patterns, including heels, sandals, sports, boots, high boots, loafers, slippers, and flats. The model also overcomes the image quality and diversity barriers; GAN models often suffer from a lack of diversity in the generated images.
8.3.1.2 State-of-the-art CNNs
In this study, we retrain 12 pretrained models trained for ImageNet classification (shown in Fig. 8.2) to use as feature extractors by removing the last (dense) layer. We remove the dense layer that has 1000 labels (classes) and add dropout and a new dense layer consisting of one label that represents the shoe class. We thus learn deep features from shoe images and extract the feature vectors of these new models trained on the general shoe domain rather than on a shoe classification task. In a nutshell, we transfer the pretrained features into the shoe domain by using the power of transfer learning. Feature vectors containing the new visual features belonging to the shoe domain are used for computing a distance metric between similar shoe items. The process of extracting feature vectors for a given
input is called inference. While the light-weight models have low inference time, the
heavy models with large number of parameters are expensive for inference. The inference
time has to be fast because the high inference time leads to negative user experience.
Inference time or the response speed of the trained network is as important as the
accuracy.
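The following sketch illustrates this transfer-learning step with one of the pretrained backbones (MobileNetV2 is used here only as an example). The fine-tuning head with dropout and a single-unit dense layer mirrors the description above, while the preprocessing call and the cosine similarity used as distance metric are assumptions on our side.

# Sketch of the transfer-learning feature extractor described above
# (MobileNetV2 and cosine similarity are illustrative choices, not the
# authors' exact configuration).
import numpy as np
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    include_top=False, pooling="avg", weights="imagenet",
    input_shape=(224, 224, 3))

# Fine-tuning head for the shoe domain: dropout + a single-unit dense layer,
# as described in the text (training of this head is omitted in the sketch).
fine_tune = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

def extract_features(images):
    """Inference step: map preprocessed images to feature vectors."""
    x = tf.keras.applications.mobilenet_v2.preprocess_input(images)
    return base(x, training=False).numpy()

def cosine_similarity(a, b):
    """One possible distance/similarity measure between two feature vectors."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))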
8.4 Experiments and results
To measure the performance of the aforementioned models and of the overall architecture, we have conducted several experiments. The AI models created and employed in this study were developed using the TensorFlow framework. We use a public standard benchmark dataset for measuring the performance of the models.
8.4.1 Experimental setup
The training experiments and performance tests have been conducted on a server with a 24-core Intel Xeon E5-2628L CPU and 256 GB of RAM, running Ubuntu Server 16.04. Eight NVIDIA GTX 1080-Ti GPUs on the server have been used for training the models. The baseline framework is TensorFlow for all model experiments. The shoe dataset used for this study was collected from Turkish e-commerce sites: https://www.flo.com.tr, https://www.trendyol.com, and https://www.n11.com. We scraped the handbag data from various web sites: https://www.flo.com.tr, https://www.amazon.com, https://www.hepsiburada.com.tr, https://www.boyner.com.tr, https://www.trendyol.com, https://www.ayakkabidunyasi.com.tr, https://www.morhipo.com, and https://www.n11.com. The overall training dataset consists of 156,896 shoe images and 130,540 handbag images. We conducted the performance tests of all models on 10,000 randomly selected shoe images from the UT-Zap50K benchmark dataset.
8.4.2 Comparative results
Fig. 8.5 depicts the test results of all models on the UT-Zap50K benchmark dataset. For each network architecture, Fig. 8.5A shows the precision results and Fig. 8.5B shows the inference times. It is known that the performance of unsupervised learning is hard to evaluate; hence, no universally agreed-upon performance metrics are available for visual recommendation applications. However, the return of similarity results from irrelevant classes for a query product of a particular class indicates that the model is working poorly. For example, when the customer clicks on a product belonging to the heel class, it is expected that similar products from the heel class will be recommended. Therefore, we first compare the models with the standard precision metric, which is formulated as follows:
\text{Precision} = \frac{\#\ \text{of relevant items retrieved}}{\#\ \text{of retrieved items}}
Precision values for each model are calculated using the eight classes in the UT-Zap50K dataset. For a given query image, the Zap50K class information is used as the ground truth. If a retrieved image is of the same class as the query, the result is counted as relevant; otherwise, it is marked as irrelevant.
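The precision metric above can be computed directly from class labels, as in the following sketch (the function and variable names are ours and purely illustrative).

# Precision as used above: fraction of retrieved items whose class matches
# the query's ground-truth class (class labels come from UT-Zap50K).
def precision(query_class, retrieved_classes):
    if not retrieved_classes:
        return 0.0
    relevant = sum(1 for c in retrieved_classes if c == query_class)
    return relevant / len(retrieved_classes)

# Example: a "heels" query with 5 retrieved items, 4 of them heels -> 0.8
print(precision("heels", ["heels", "heels", "boots", "heels", "heels"]))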
The performance results shown in Fig. 8.5B indicate that inference time increases linearly with the depth of the neural network. A large number of parameters (weights) makes a network memory-inefficient and computationally expensive. The inference time for our proposed model is around 0.04 s, which is significantly shorter than that of the other pretrained models. Our model also provides higher precision rates. The proposed model performs the best in terms of precision and inference time among all models tested in this study.
Fig. 8.6 shows the results of all models tested in this study for a sample shoe image. It can be observed that all versions of the DenseNet model returned similarity results from irrelevant classes for a query image of the sneaker type.
Sample visual recommendation results with similarity scores are displayed in Fig. 8.7 (women's shoes) and Fig. 8.8 (women's handbags).
8.4.3 Web interface for visual inspection
The success of image retrieval systems is hard to measure because the concept of image similarity is highly subjective. As another performance evaluation, we have created a web interface for visual inspection of similarity results.
Fig. 8.5 Performance comparison for the models used in this study: (A) precision results and (B) inference time per model.
Fig. 8.6 Visual similarity search results for the proposed model and other pretrained models.
Fig. 8.7 Sample visual similarity results of the proposed model on unseen shoe images retrieved from
beymen.com (Query image and top-5 similar images—the results taken from 7723 indexed images).
Fig. 8.8 Sample visual similarity results of the proposed model on unseen handbag images retrieved
from beymen.com (Query image and top-5 similar images—the results taken from 3507 indexed
images).
We have asked human annotators to select the best results among the alternatives. Fig. 8.9 shows the annotation interface. The annotator clicks a shoe image on the left, and the results of the different models are shown on the right. The annotator then selects the rows that he/she thinks contain the most similar images. We use this information to determine which model performs best in terms of actual human feedback.
8.5 Conclusion and future works
In this chapter, we outlined our work on a visual similarity-based fashion recommendation system. This chapter extends our DeepML-2019 submission [42]. The system consists of a GAN-based image retrieval module and a high-performance image feature search library. We have collected a large set of shoe images from e-commerce sites to train the GAN and used the resulting model to create a CBIR system. We also created another set of shoe images from https://www.flo.com.tr, which is used in a web interface that demonstrates the results of our GAN model along with the other models we tested in this study. The experimental results show that the proposed model achieved superior performance in terms of precision and query time. The results also show that the system can be used in real-world e-commerce platforms.
Based on the findings in this study, we plan to build an end-to-end online fashion recommendation system for e-commerce sites. The system will contain recommendation support for various fashion items such as shoes, clothes, and accessories. We also plan to explore other GAN architectures for generating successful image representations for different fashion categories.
Fig. 8.9 Web interface used by human annotators to select the best model: (A) VGG19, (B) MobileNetV2, (C) ResNet152, (D) proposed model (modified InfoGAN).
References
[1] A.W.M. Smeulders, M. Worring, S. Santini, A. Gupta, R. Jain, Content-based image retrieval at the
end of the early years, IEEE Trans. Pattern Anal. Mach. Intell. 22 (12) (2000) 1349–1380.
[2] V.N. Gudivada, V.V. Raghavan, Content-based image retrieval systems, Computer 28 (9) (1995)
18–22.
[3] E. Simo-Serra, H. Ishikawa, Fashion style in 128 floats: joint ranking and classification using weak data
for feature extraction, in: Proceedings of the IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, 2016.
[4] G. Scarpa, M. Gargiulo, A. Mazza, R. Gaetano, A CNN-based fusion method for feature extraction
from sentinel data, Remote Sens. 10 (2) (2018) 236.
[5] D. Weimer, B. Scholz-Reiter, M. Shpitalni, Design of deep convolutional neural network architectures for automated feature extraction in industrial inspection, CIRP Ann. Manuf. Technol. 65 (2016)
417–420.
[6] W. Zhao, S. Du, Spectral-spatial feature extraction for hyperspectral image classification: a dimension
reduction and deep learning approach, IEEE Trans. Geosci. Remote Sens. 54 (8) (2016) 4544–4554.
[7] I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,
Y. Bengio, Generative adversarial nets, in: Proceedings of the 27th International Conference on Neural
Information Processing Systems (NIPS’14), vol. 2, MIT Press, Cambridge, MA, 2014, pp. 2672–2680.
[8] J.Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent
adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision,
2017.
[9] C. Ledig, et al., Photo-realistic single image super-resolution using a generative adversarial network,
in: Proceedings—30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017,
2017.
[10] H. Zhang, et al., StackGAN: text to photo-realistic image synthesis with stacked generative adversarial
networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017.
[11] T. Kim, M. Cha, H. Kim, J.K. Lee, J. Kim, Learning to discover cross-domain relations with generative
adversarial networks, in: 34th International Conference on Machine Learning, ICML 2017, 2017.
[12] T. Li, et al., Beautygan: instance-level facial makeup transfer with deep generative adversarial network,
in: MM 2018—Proceedings of the 2018 ACM Multimedia Conference, 2018.
[13] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, A.A. Bharath, Generative adversarial networks: an overview, IEEE Signal Process. Mag. 35 (1) (2018) 53–65.
[14] X. Hou, K. Sun, G. Qiu, Deep feature similarity for generative adversarial networks, in: Proceedings—
4th Asian Conference on Pattern Recognition, ACPR 2017, 2018.
[15] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 2014.
[16] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, in: 4th International Conference on Learning Representations, ICLR
2016—Conference Track Proceedings, 2016.
[17] X. Hou, L. Shen, K. Sun, G. Qiu, Deep feature consistent variational autoencoder, in: Proceedings—
2017 IEEE Winter Conference on Applications of Computer Vision, WACV 2017, 2017.
[18] A. Yu, K. Grauman, Fine-grained visual comparisons with local learning, in: Proceedings of the IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, 2014.
[19] J. Fu, J. Wang, Z. Li, M. Xu, H. Lu, Efficient clothing retrieval with semantic-preserving visual phrases,
in: Asian Conference on Computer Vision, Springer, Berlin, Heidelberg, pp. 420–431.
[20] Q. Liu, S. Wu, L. Wang, Deepstyle: learning user preferences for visual recommendation, in: SIGIR
2017—Proceedings of the 40th International ACM SIGIR Conference on Research and Development
in Information Retrieval, 2017.
[21] X. Wang, T. Zhang, Clothes search in consumer photos via color matching and attribute learning,
in: MM’11—Proceedings of the 2011 ACM Multimedia Conference and Co-Located Workshops,
2011.
[22] Z. Zhou, Y. Xu, J. Zhou, L. Zhang, Interactive image search for clothing recommendation, in: MM
2016—Proceedings of the 2016 ACM Multimedia Conference, 2016.
Visual similarity-based fashion recommendation system
[23] S. Liu, Z. Song, G. Liu, C. Xu, H. Lu, S. Yan, Street-to-shop: cross-scenario clothing retrieval via parts
alignment and auxiliary set, in: Proceedings of the IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, 2012.
[24] Z. Feng, Z. Yu, Y. Yang, Y. Jing, J. Jiang, M. Song, Interpretable partitioned embedding for customized multi-item fashion outfit composition, in: ICMR 2018—Proceedings of the 2018 ACM International Conference on Multimedia Retrieval, 2018.
[25] Y. Jing, et al., Visual search at pinterest, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.
[26] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, Adv. Neural Inform. Process. Syst. 25 (2012) 1097–1105.
[27] M.H. Kiapour, X. Han, S. Lazebnik, A.C. Berg, T.L. Berg, Where to buy it: matching street clothing
photos in online shops, in: Proceedings of the IEEE International Conference on Computer Vision,
2015.
[28] J. Huang, R. Feris, Q. Chen, S. Yan, Cross-domain image retrieval with a dual attribute-aware ranking
network, in: Proceedings of the IEEE International Conference on Computer Vision, 2015.
[29] D. Shankar, S. Narumanchi, H.A. Ananya, P. Kompalli, K. Chaudhury, Deep learning based large scale visual recommendation and search for e-commerce, arXiv preprint arXiv:1703.02344, 2017.
[30] K. Yamaguchi, M.H. Kiapour, L.E. Ortiz, T.L. Berg, Parsing clothing in fashion photographs,
in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2012.
[31] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, P. Abbeel, InfoGAN: interpretable representation learning by information maximizing generative adversarial nets, in: Proceedings of the 30th
International Conference on Neural Information Processing Systems (NIPS’16), Curran Associates
Inc., Red Hook, NY, 2016, pp. 2180–2188
[32] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, L.D. Jackel, Handwritten digit recognition with a back-propagation network, in: Proceedings of the 2nd International
Conference on Neural Information Processing Systems (NIPS’89), MIT Press, Cambridge, MA,
1989, pp. 396–404.
[33] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition. Proc. IEEE 86 (11) (1998) 2278–2324, https://doi.org/10.1109/5.726791.
[34] C. Szegedy, et al., Going deeper with convolutions, in: Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, 2015.
[35] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the
IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016.
[36] G. Huang, Z. Liu, L. Van Der Maaten, K.Q. Weinberger, Densely connected convolutional networks,
in: Proceedings—30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017,
2017.
[37] A.G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, MobileNets: efficient convolutional neural networks for mobile vision applications, arXiv preprint arXiv:1704.04861, 2017.
[38] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.C. Chen, MobileNetV2: inverted residuals and
linear bottlenecks, in: Proceedings of the IEEE Computer Society Conference on Computer Vision
and Pattern Recognition, 2018.
[39] J. Johnson, M. Douze, H. Jegou, Billion-scale similarity search with GPUs. IEEE Trans. Big Data
(2019), https://doi.org/10.1109/TBDATA.2019.2921572.
[40] S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal
covariate shift, in: 32nd International Conference on Machine Learning, ICML 2015, 2015.
[41] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to
prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (1) (2014) 1929–1958.
[42] B. Ay, G. Aydın, Z. Koyun, M. Demir, A visual similarity recommendation system using generative
adversarial networks, in: 2019 International Conference on Deep Learning and Machine Learning in
Emerging Applications (Deep-ML), 2019, pp. 44–48.
CHAPTER 9
Deep learning-based vegetation index estimation
Patricia L. Suárez (a), Angel D. Sappa (a,b), and Boris X. Vintimilla (a)
(a) ESPOL Polytechnic University, CIDIS-FIEC, Guayaquil, Ecuador
(b) Computer Vision Center, Edifici O, Campus UAB, Bellaterra, Barcelona, Spain
9.1 Introduction
Computer vision applications can be found in almost every domain, including topics such
as medical imaging, gaming, video surveillance, multimedia, industrial applications, and
remote sensing, just to mention a few. In most of the cases, these applications are based on
images obtained from cameras working at the visible spectrum. There are some cases, in
particular in medical imaging and remote sensing, where cross-spectral and multispectral
images are considered. The appealing factor of using images from different spectral bands
lies on the one hand on the possibility to obtain information that cannot be seen at the
visible spectrum; on the other hand, on the combined use of information that can be
considered to generate some kind of high-level reasoning; for instance, in remote sensing
the combined use of images from different spectral bands is considered to generate vegetation indexes (VIs). These VIs are used to determine the health and strength of vegetation and their definitions involve several factors, such as soil reflectance, vegetation
density, etc. All this information would help to increase the yield of crops [1, 2]. The
obtained information is used for monitoring and evaluating the Earth's vegetative cover, taking into account several factors such as soil reflectance, atmosphere, and vegetation density, with the aim of obtaining formulas that yield more reliable information about vegetation from remotely sensed values.
The usual form of a VI is a ratio of reflectance measured in two bands, or their algebraic combination. Spectral ranges (bands) to be used in VI calculation are selected
depending on the spectral properties of plants. Lately, techniques based on sensors sensitive to multiple spectra have been implemented to perform remote sensing to evaluate
the biophysical variables of vegetation in both forestry and agriculture [3, 4]. Furthermore, Panda et al. [5] proposed a method for processing high-end images in order to
determine the importance of spectral VIs in the field of agricultural crop yield prediction
using a neural network. In Ref. [6], the authors proposed to analyze the climatological
phenomena that affect the local climate. According to their theory, these phenomena
have a direct effect on crop yield.
The index could be computed using several spectral bands that are sensitive to plant
biomass and health. For instance, it is known that healthy vegetation reflects light strongly
in the near-infrared band and less strongly in the visible portion of the spectrum. Thus,
the information between the light reflected in the near-infrared and in the visible spectrum is generally used to detect areas that potentially have healthy vegetation.
Among the different indexes proposed in the literature, the Normalized Difference
Vegetation Index (NDVI) is the most widely used [7]; NDVI is often used to monitor
drought, forecast agricultural production, assist in forecasting fire zones and desert
offensive maps [8]. NDVI is preferable for global vegetation monitoring since it helps
to compensate for changes in lighting conditions, surface slope exposure, and other
external factors. In general, it is used to determine the condition, developmental stages, and biomass of cultivated plants and to forecast their yields. This index is calculated as the ratio between the difference and sum of the reflectance in NIR and red
regions:
NDVI = \frac{R_{NIR} - R_{RED}}{R_{NIR} + R_{RED}},    (9.1)

where R_{NIR} is the reflectance of NIR radiation and R_{RED} is the reflectance of visible red radiation.
This index takes values from −1.0 to 1.0 and basically represents greenness: negative values are mainly produced by clouds, water, and snow, and values close to 0 are primarily produced by rocks and bare soil. Very small values (0.1 or less) of the NDVI correspond to empty areas of rocks, sand, or snow. Moderate values (from 0.2 to 0.3) represent shrubs and meadows, while large values (from 0.6 to 0.8) indicate temperate and tropical forests [9, 10].
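As a direct illustration of Eq. (9.1), the following sketch computes the NDVI pixel-wise from co-registered NIR and red reflectance arrays; the small epsilon that guards against division by zero is our addition, not part of the original formulation.

# Eq. (9.1) applied pixel-wise to co-registered NIR and red reflectance arrays.
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    index = (nir - red) / (nir + red + eps)   # eps avoids division by zero
    return np.clip(index, -1.0, 1.0)          # valid NDVI range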
Proposals that use images of several spectra, whether cross-spectral or multispectral, depend on the use of multiple sensors. In the case of VIs such as NDVI, images of the visible spectrum and of the near-infrared spectrum of the same scene are required, acquired by different cameras at the same time. These images are needed to compute the values of Eq. (9.1). It should be noted that before computing Eq. (9.1) the images must be accurately registered, that is, the information must be referred to the same reference system. Since images of different spectra can look quite different, the challenge is to find the same reference points in the images of both spectra [11]. Recently, techniques based on convolutional networks have been proposed to solve this problem and find correspondences across spectral domains [12, 13]. With the correlated information, the images can be registered into a single reference system.
In Ref. [14], the authors proposed to use the NDVI to measure the changes in the
ecosystem in a given interval of time. The changes in the index values allow us to infer
how the climate impacts the health of the crops. With this method, the impact of climate
change can be determined and controlled planning can be managed, focusing efforts
on the most affected areas. This is valuable in determining effective and smart reforestation plans.
In this chapter, a novel approach to image-to-image translation is proposed, in which the NDVI is estimated using a synthetic NIR image. The proposed model is able to use unpaired data to estimate a synthetic NIR image just from a grayscale image using a CycleGAN. A similar technique has recently been presented in Ref. [15], where an NDVI is generated from a near-infrared (NIR) image, and also in Ref. [16], where the VI is estimated just from a single image of the visible spectrum. Although interesting results have been obtained, the weak point of these approaches lies in the need for NIR images, which are not as common as visible spectrum images. In other words, these approaches depend on paired samples for the training process. The solution proposed in the current chapter consists of a model where the index is estimated with an unpaired learning-based approach, in which a cycle generative adversarial network (CycleGAN) [17] is trained with a large data set.
In the proposed approach, an unsupervised learning model takes a set of unpaired images as input, one image from the visible spectrum and the other from the NIR spectrum; each one is fed into a CycleGAN to perform the image domain translation. Additionally, a multiterm loss function is used to better optimize the model, and a residual network (ResNet) architecture is used to go deeper without degradation in accuracy or error rate. The chapter is organized as follows. Section 9.2 presents
works related to the NDVI problem, as well as the basic concepts and notation of GAN
and CycleGAN networks. The proposed approach is detailed in Section 9.3. The experimental results with a set of real images are presented in Section 9.4. Finally, the conclusions are given in Section 9.5.
9.2 Related work
Solutions based on computer vision to tackle problems related to precision agriculture
have been widely used. This technology enables better identification, analysis, and management of temporal and spatial in-field variability. Nowadays, with NIR sensors, all the captured crop information can be archived to obtain yearly statistics for predicting the health of future plantations and improving crop productivity. Many computer vision techniques have evolved to offer solutions for these kinds of agricultural prediction problems, ranging from mathematical and statistical methods to deep learning neural networks.
This section reviews works related to VI estimation, covering classical approaches as well as convolutional neural network (CNN)-based approaches.
9.2.1 Vegetation index: Formulations and applications
In this section, agricultural approaches focused on the use of the NDVI to perform adequate control of crop production using the index information to monitor plant health at
each stage of their growth are reviewed.
In Ref. [18], the authors proposed to use SAR images to estimate missing spectral
features through data fusion and deep learning, exploiting both temporal and cross-sensor
dependencies on Sentinel-1 and Sentinel-2 time series, in order to obtain the NDVI.
Another approach is presented in Ref. [19]; the authors propose a technique to predict
the vegetation dynamics behavior using Moderate Resolution Imaging Spectroradiometer (MODIS) NDVI time series data sets and long short term memory model network, an advanced technique adapted from the artificial neural network.
Another approach, presented by Ulsig et al. [20], introduces an automated technique to detect and count individual palm trees from UAV using a combination of
spectral and spatial analyses. The proposed approach comprises a step that discriminates the vegetation from the surrounding objects by applying the normalized difference VI and another step used to detect individual palm trees using a combination
of circular Hough transform (CHT) and the morphological operators. Damian et al.
[21] propose to use the information obtained from the normalized difference VI using
satellite images to increase the productivity improving the task of delimiting management zones for annual crops. For this research three crop productivity maps, from 2009
to 2015, were used for each area of analysis, developing a descriptive and geostatistical
case study.
According to Ulsig et al. [22], long-term observations of vegetation phenology can be
used to monitor the response of terrestrial ecosystems to climate change. They propose a
method for observing phenological events by analyzing time series of VIs such as the normalized difference VI to investigate the potential of a Photochemical Reflection Index
(PRI) to improve the accuracy of MODIS-based phenological estimates in an evergreen
coniferous forest. The results suggest that PRI can serve as an effective indicator of spring
seasonal transitions, and confirm the usefulness of MODIS PRI for detecting phenology.
In addition, Li et al. present a study to evaluate the economic benefits of greening programs (e.g., planting urban trees, adding or enhancing parks, providing incentives for
green roofs) using low-cost NDVI data from satellite imagery, using the spatial lag-Tobit
models [23], which predict tree canopy cover from NDVI. In another study [24], the authors focus on temporal NDVI and land surface temperature, used together to assess the dynamics of Urban Heat Island (UHI) change under different environmental conditions, geographical locations, and demographics. The research demonstrates the correlation between temporal NDVI and surface temperature, exemplified with a case study conducted over two regions that differ geographically as well as economically. In Ref. [25], the authors present a
method to reconstruct NDVI time series datasets for monitoring long-term changes in
terrestrial vegetation. This temporal-spatial iteration (TSI) method was developed to estimate the NDVIs of contaminated pixels, based on reliable data. The TSI method will be
most applicable when large numbers of contaminated pixels exist.
Also, in Ref. [26], the authors present a method to analyze the use of the NDVI to
evaluate crop yields, using a multispectral sensor mounted on a UAV with the objective
of predicting biomass variations and grain production. In another work, presented by
Taghizadeh et al. [27], the authors propose an approach to extract the phenological
parameters based on time series of the NDVI, and these variables are used with crop rotation to predict the organic carbon content of the surface layer.
Also in Ref. [28], the authors present a high-throughput phenotyping platform to
dynamically monitor NDVI during the growing season for the contrasting wheat crops.
The high-throughput phenotyping platform captured the variation of NDVI among
crops and treatments (i.e., irrigation, nitrogen, and sowing). The high-throughput phenotyping platform can be used in agronomy, physiology, and breeding to explore the
complex interaction of genotype, environment, and management of the soil in a farmland
area. Additionally, in Ref. [29], the authors illustrate how the normalized difference VI,
leaf area index (LAI), and fractional vegetation cover are related to each other, using a
simple radiative transfer model with vegetation, soil, and atmospheric components.
Another approach [30] presents a local modeling technique to estimate regression models
with spatially varying relationships, using geographically weighted regression (GWR), to
investigate the spatially nonstationary relationships between NDVI and climatic factors at
multiple scales in northern China. The results indicate that all GWR models with appropriate bandwidth represented significant improvements in model performance over the
ordinary least-squares (OLS) models. The results revealed that the ecogeographical transition zone and the GWR model can improve the model ability to address spatial, nonstationary, and scale-dependent problems in landscape ecology.
9.2.2 Deep learning-based approaches
Deep learning models have obtained state-of-the-art results on some complex computer vision problems. Nevertheless, many challenging problems in agriculture remain to be solved, and deep learning approaches are the most likely candidates, obviating the need for the pipelines of specialized, hand-crafted methods used before. Some
researchers have proposed deep learning-based approaches for remote sensing and agricultural applications. In Ref. [31], the authors propose to use SAR images to estimate
missing spectral features through data fusion and deep learning, exploiting both temporal
and cross-sensor dependencies on Sentinel-1 and Sentinel-2 time series, in order to
obtain the normalized difference VI.
Huang et al. [32] proposed a novel method for effective and efficient topographic
shadow detection for the images obtained from Sentinel-2A multispectral imager
(MSI) by combining both the spectral and spatial information. This method uses a
CNN, operating directly on indexes input due to its remarkable classification performance, exploiting the spatial contextual information and spectral features for effective
topographic extraction. In addition, in Ref. [33], a decision-level fusion approach is proposed with a simpler architecture for the task of dense semantic labeling. This method
first obtains two initial probabilistic labelings resulting from a fully CNN and a simple
classifier, for example, logistic regression exploiting spectral channels and LiDAR data,
respectively. The conditional random field (CRF) inference will estimate the final dense
semantic labeling results. In Ref. [34], the authors present a methodology to predict the
NDVI by training a crop growth model with historical data. Although they use a very
simple soybean growth model, the methodology could be extended to other crops and
more complex models.
All the approaches presented earlier are just a selection of recent publications where
the usefulness of VIs, in particular the NDVI, can be appreciated. Unfortunately, to compute the NDVI, registered images from different spectra (i.e., visible and NIR) are needed, and registering them is sometimes a challenging task since the images may look different. So, the
problem is how to find the same set of features in both spectra (e.g., points [11]) to
be used as a reference for the registration process. Recently, some deep learning-based
approaches have been proposed to overcome this problem and to obtain correspondences
in cross-spectral domains (e.g., [13, 35]). Once correspondences are obtained, the image
registration can proceed by mapping both images to a single reference system; then VIs
can be easily computed. As mentioned in the previous section, recently some approaches
for estimating NDVI have been proposed (e.g., [15, 16]) implementing GAN networks
using NIR or RGB images; both approaches depend on the existence of accurately registered images.
Having in mind this registration drawback for estimating the NDVI, and in order to overcome it, in the current work an unsupervised learning model is proposed (a CycleGAN architecture). The model is trained with a set of unpaired images (grayscale and NDVI image) under an unsupervised scheme. To understand generative adversarial networks (GANs), a summary is given here.
GANs are powerful and flexible tools quite useful in several computer vision problems; one of their most common applications is image generation. Fig. 9.1 depicts this
architecture. In the GAN framework [36], generative models are estimated via an adversarial process, in which simultaneously two models are trained: (i) a generative model G
that captures the data distribution, and (ii) a discriminative model D that estimates the
probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. In this architecture, it
Fig. 9.1 Illustration of a generative adversarial network: the generator G(z) maps Gaussian noise z to generated data, and the discriminator D(x) decides whether a sample is synthetic or real.
is possible to apply certain conditions to improve the learning process. According to Ref.
[37], to learn the generator's distribution p_g over data x, the generator builds a mapping function from a prior noise distribution p_z(z) to a data space G(z; θ_g), and the discriminator, D(x; θ_d), outputs a single scalar representing the probability that x came from the training data rather than from p_g. G and D are trained simultaneously: the parameters of G are adjusted to minimize log(1 − D(G(z))) and those of D to maximize log D(x), with a value function V(D, G):

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))].    (9.2)
GANs can be extended to a conditional model if both the generator and discriminator
are conditioned on some extra information y (see Fig. 9.2). This information could be any
kind of auxiliary information, such as class labels or data from other modalities. We can
perform the conditioning by feeding y into both the discriminator and the generator as an additional input layer. The objective function of the two-player minimax game would then be
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x|y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z|y)))].    (9.3)

Fig. 9.2 Illustration of a conditional generative adversarial network (both the generator G(z|y) and the discriminator D(x|y) receive the label y as an additional input).
The discriminator performs a binary classification that includes the extra information fed to the network; as a result, the discriminator and the generator receive more accurate gradients. Conditional GANs enhance the stability of the model, but the conditioning affects the learning of the semantic characteristics of the image samples.
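A minimal sketch of this conditioning mechanism is given below: the label y is embedded and concatenated with the generator and discriminator inputs. The layer sizes, embedding dimension, and flattened image representation are illustrative assumptions, not details of any specific implementation in this chapter.

# Sketch of conditioning both players on a label y: the label embedding is
# concatenated with the noise z (generator) and with the image vector
# (discriminator). Sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def conditional_generator(latent_dim=100, num_classes=10, out_dim=784):
    z = tf.keras.Input(shape=(latent_dim,))
    y = tf.keras.Input(shape=(), dtype=tf.int32)
    y_emb = layers.Flatten()(layers.Embedding(num_classes, 50)(y))
    h = layers.Concatenate()([z, y_emb])          # G(z|y): condition on y
    h = layers.Dense(256, activation="relu")(h)
    out = layers.Dense(out_dim, activation="tanh")(h)
    return tf.keras.Model([z, y], out)

def conditional_discriminator(in_dim=784, num_classes=10):
    x = tf.keras.Input(shape=(in_dim,))
    y = tf.keras.Input(shape=(), dtype=tf.int32)
    y_emb = layers.Flatten()(layers.Embedding(num_classes, 50)(y))
    h = layers.Concatenate()([x, y_emb])          # D(x|y): condition on y
    h = layers.Dense(256, activation=tf.nn.leaky_relu)(h)
    out = layers.Dense(1, activation="sigmoid")(h)
    return tf.keras.Model([x, y], out)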
9.3 Proposed approach
This work proposes to estimate the NDVI using a synthetic NIR image generated just from a single image of the visible spectrum with a CycleGAN. The architecture used in this approach is based on the one presented in Ref. [17], a previous work on unpaired image-to-image translation through a CycleGAN. This type of network permits domain style transfer, which is a convenient method for image-to-image translation problems because it is not necessary to have a set of input images that capture the scene at the same time and place from different spectra. Obtaining such a set of images could be time consuming and quite difficult, depending on the domains we are trying to translate between. In Ref. [38], the authors present a general-purpose image-to-image translation model trained in a supervised manner using conditional adversarial networks; these networks not only learn the mapping from an input image to an output image but also learn a loss function to train the corresponding mapping. Before presenting the proposed approach, a brief description of CycleGAN is given.
9.3.1 Cycle generative adversarial networks
Image-to-image translation is the process of transforming an image from one domain to
another, where the goal is to learn the mapping between an input image and an output
image. This task has been generally performed by using a training set of aligned image
pairs. However, for many tasks, paired training data are not available, and preparing thousands of paired images often takes a lot of work from specialized personnel, especially for complex image translations. CycleGAN is an architecture that addresses this problem because it learns to perform image translations without explicit pairs of images; no one-to-one image pairs are required (see Fig. 9.3 for the corresponding scheme). CycleGAN learns to perform style transfer between the two sets even though the individual images have vastly different compositions. According to Zhu et al. [17], CycleGAN is an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples (see Fig. 9.4, which illustrates domain translation with paired samples on the left and unpaired samples on the right); in our case, we translate unpaired images. Thus, the goal is to learn a mapping G: X → Y such that the distribution of images from G(X) is indistinguishable from the distribution Y using an adversarial loss.
Fig. 9.3 Cycle generative adversarial network, original scheme proposed in Ref. [17] (with cycle-consistency losses in both domains).
Fig. 9.4 (Left) Supervised training (paired data). (Right) Unsupervised training (unpaired data).
Because this mapping is highly under-constrained, it is necessary to introduce an inverse mapping F: Y → X and a cycle consistency loss to enforce F(G(X)) ≈ X (and vice versa).
The model includes two mapping functions, G: X → Y and F: Y → X. In addition, it introduces two adversarial discriminators Dx and Dy, where Dx aims to distinguish between images x and translated images F(y); in the same way, Dy aims to discriminate between y and G(x). Besides, the proposed approach includes two types of loss terms: adversarial losses [36] for matching the distribution of generated images to the distribution of real images in the target domain, and a cycle consistency loss to prevent the learned mappings G and F from contradicting each other.
9.3.2 Residual learning model (ResNet)
Deep neural networks have evolved from simple to very complex architectures depending on the type of problem to be solved, whether these are classification, segmentation,
recognition, identification, etc. One of the first implementations of deep convolutional networks is presented in Ref. [39], where the authors present an approach to classify 1000 different classes from the ImageNet dataset. The model was designed to support very deep CNN training to classify 1.2 million high-resolution images into the 1000 different classes. The model has 60 million parameters and 650,000 neurons. The architecture consists of five convolutional layers, followed by max-pooling layers in some cases, and three fully connected layers with a final softmax layer of 1000 elements. The authors also implemented a very efficient convolutional operation with multiple GPUs to reduce training time and overfitting. Additionally, in the fully connected layers they employed a dropout operation to perform regularization, which proved to be very effective. Another technique that continues the work on very deep networks is the one presented in Ref. [40]; according to the authors, deep networks naturally integrate low/mid/high-level features and classifiers in an end-to-end multilayer
way, and the “levels” of features can be enriched by the number of stacked layers (depth).
When deeper networks are able to start converging, a degradation problem can appear: as the network depth increases, accuracy gets saturated and then degrades rapidly. This degradation indicates that not all models are equally easy to optimize. There exists a solution by construction to the deeper model: the added layers are
identity mapping, and the other layers are copied from the learned shallower model. The
existence of this constructed solution indicates that a deeper model should produce no
higher training error than its shallower counterpart. Ref. [40] presents a deep residual learning framework where, instead of expecting the stacked layers to fit a desired underlying mapping directly, these layers are allowed to fit a residual mapping. Denoting the desired underlying mapping by H(x), the stacked nonlinear layers fit another mapping F(x) := H(x) − x, and the original mapping is recast as F(x) + x. The authors hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. In the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping with a stack of nonlinear layers. The formulation F(x) + x can be realized by feed-forward neural networks with "shortcut connections" that perform identity mapping, and their outputs are added to the outputs of the stacked layers (see Fig. 9.5); an identity shortcut connection adds neither extra parameters nor computational complexity. The entire network can still be trained end-to-end by stochastic gradient descent (SGD) with backpropagation.
This design mitigates the vanishing gradient problem: when the gradient is backpropagated to earlier layers, repeated multiplication may make it vanishingly small, so as the network goes deeper its performance can saturate or even start degrading rapidly. To avoid these problems, we implement our generator and discriminator so that larger gradients are propagated to the initial layers, which can then learn as fast as the final layers, giving us the ability to train deeper networks. ResNet is a model designed for deep neural network architectures; it consists of convolutional layers grouped into building blocks, where a residual of the input is added to the output.
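As a concrete illustration, a minimal residual block in the spirit of Figs. 9.5 and 9.6 is sketched below; the filter count and the assumption that the input already has that many channels (so the identity shortcut can be added directly) are ours, not details of the original implementation.

# Minimal residual block: batch normalization, ReLU, and 3x3 convolution
# applied twice, with an identity shortcut so the block outputs F(x) + x.
# Assumes x already has `filters` channels.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=256):
    shortcut = x                                   # identity shortcut
    y = layers.BatchNormalization()(x)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    return layers.Add()([shortcut, y])             # F(x) + x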
9.3.3 Proposed architecture
This section presents the approach proposed for NDVI estimation using just a single image from the visible spectrum. As mentioned earlier, it uses an architecture similar to the one proposed in Ref. [17], a recent work on unpaired image-to-image translation where the use of a CycleGAN has been proposed. CycleGAN is a convenient method for image-to-image translation problems, such as style transfer, because it relies on an unconstrained input set and output set rather than on specific corresponding input/output pairs, which could be time consuming, unfeasible, or even impossible to obtain depending on which two image types one is trying to translate between.
Fig. 9.5 Residual block used on the generator network (the identity shortcut x is added to the residual mapping F(x), giving F(x) + x).
Another approach, presented in Ref. [38], has shown results synthesizing photos from label maps and reconstructing objects from edge maps, but it is still dependent on some kind of correlated labeling.
Our architecture is based on the approach presented in Ref. [17] in relation to cycle-consistent learning and loss functions; in our work, it is used to estimate synthetic NIR images. The proposed model can learn to translate images from the visible spectrum to the corresponding NIR spectrum without the need for accurately registered RGB/NIR pairs. This allows us to use the synthetic NIR images in the calculation of the NDVI and to apply them in solutions oriented toward assessing the state of the crops and their corresponding level of productivity. Another advantage of being able to count on synthetic images of the NIR spectrum is that the cost of such solutions decreases, since there is no need to buy acquisition devices sensitive to that part of the electromagnetic spectrum. Additionally, our architecture uses the ResNet [40] to perform the image transformation from one spectrum to another.
The core idea of ResNet is to introduce a so-called "identity shortcut connection" that skips one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers; identity shortcut connections add neither extra parameters nor computational complexity, and the entire network can still be trained end-to-end by SGD with backpropagation. The formulation F(x) + x is realized by feed-forward neural networks with "shortcut connections" (see Section 9.3.2). These skip connections ensure that properties extracted in previous layers are also available to later layers, so that the outputs do not deviate much from the original grayscale input; otherwise, the characteristics of the original images would not be retained in the output and the results would look very unreal.
Fig. 9.6 Cycle generative adversarial generator network detailed architecture: convolutional blocks (A1–A3), nine residual blocks (B1–B9), and deconvolutional blocks (C1–C2), with a tanh activation at the output; each residual block consists of two 3 × 3 2D convolutions, each preceded by batch normalization and a rectified linear unit.

Fig. 9.6 depicts the CycleGAN model proposed in the current work. As shown in
Fig. 9.6, the CycleGAN architecture used to generate synthetic NIR images is composed of two generators, G and F, and two discriminators, Dx and Dy. In order to generate a synthetic image, the architecture takes advantage of the combination of cycle consistency and least-squares losses [41] in addition to the usual discriminator and generator losses. The results of the experiments have shown that these loss functions force the model to maintain the textural information of the visible and NIR images and to generate uniform synthetic outputs. According to Zhu et al. [17], the objective of a CycleGAN is to learn mapping functions between two domains X and Y given training samples \{x_i\}_{i=1}^{N} \in X and \{y_i\}_{i=1}^{N} \in Y.
The generator network architecture designed to estimate the synthetic NIR images is described in Fig. 9.6. Also, Figs. 9.9 and 9.10 depict the CycleGAN scheme proposed in the current work. The model includes two mapping functions, G: X → Y and F: Y → X. In addition, it introduces two adversarial discriminators Dx and Dy, where Dx aims to distinguish between images x and translated images F(y); in the same way, Dy aims to discriminate between y and G(x). Besides, the proposed approach includes two types of loss terms: adversarial losses [36] for matching the distribution of generated synthetic NIR images to the data distribution of real NIR images in the target domain, and a cycle consistency loss to prevent the learned mappings G and F from contradicting each other.
9.3.4 Loss functions
The adversarial losses, according to Goodfellow et al. [36], are applied to both mapping
functions. For the mapping function G: X → Y and its discriminator Dy, the objective is defined as
L_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))],    (9.4)
where G tries to generate images G(x) that look similar to images from domain Y, while
Dy aims to distinguish between translated samples G(x) and real samples y.
For the mapping function F: Y → X and its discriminator Dx, the objective is defined as

L_{GAN}(F, D_X, Y, X) = \mathbb{E}_{x \sim p_{data}(x)}[\log D_X(x)] + \mathbb{E}_{y \sim p_{data}(y)}[\log(1 - D_X(F(y)))],    (9.5)
where F tries to generate images F(y) that look similar to images from domain X, while
Dx aims to distinguish between translated samples F(y) and real samples x.
Also, according to Zhu et al. [17], to reduce the space of possible mapping functions, the learned mapping functions should be cycle consistent; for each image x from domain X, the image translation cycle should be able to bring x back to the original image, that is, x → G(x) → F(G(x)) ≈ x, which is called forward cycle consistency. Likewise, for each image y from domain Y, G and F should also satisfy backward cycle consistency: y → F(y) → G(F(y)) ≈ y. This cycle consistency loss is defined as
L_{cycle}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\|G(F(y)) - y\|_1].    (9.6)
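A minimal sketch of how the adversarial terms of Eqs. (9.4)–(9.5) and the cycle consistency term of Eq. (9.6) can be computed is given below; the reduction choices and the split into separate discriminator and generator terms are common practice rather than details taken from the authors' implementation.

# Negative-log-likelihood adversarial terms (Eqs. 9.4-9.5) and the L1 cycle
# consistency term (Eq. 9.6); shapes and reductions are assumptions.
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=False)

def adversarial_losses(d_real, d_fake):
    """d_real = D_Y(y), d_fake = D_Y(G(x)); both are probabilities in [0, 1]."""
    d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
    g_loss = bce(tf.ones_like(d_fake), d_fake)   # G tries to fool D_Y
    return d_loss, g_loss

def cycle_consistency_loss(real_x, cycled_x, real_y, cycled_y):
    """Eq. (9.6): L1 penalty on F(G(x)) - x and G(F(y)) - y."""
    return (tf.reduce_mean(tf.abs(cycled_x - real_x)) +
            tf.reduce_mean(tf.abs(cycled_y - real_y)))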
9.3.5 Least-square GAN’s loss
In the current work, a least-squares loss has been implemented [41] to accelerate the training process. This loss moves the fake samples toward the decision boundary, in other words, it generates samples that are closer to the real data, in our case the synthetic NIR image. The experiments performed with this loss instead of the negative log likelihood showed better results. Eqs. (9.4), (9.5) are replaced with the least-squares losses, which are defined as
L_{LSGAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}[(D_Y(y) - 1)^2] + \mathbb{E}_{x \sim p_{data}(x)}[D_Y(G(x))^2]    (9.7)

and

L_{LSGAN}(F, D_X, Y, X) = \mathbb{E}_{x \sim p_{data}(x)}[(D_X(x) - 1)^2] + \mathbb{E}_{y \sim p_{data}(y)}[D_X(F(y))^2].    (9.8)
For this unsupervised approach, the standard CycleGAN L_{CYCLE} (cycle-consistency loss) and L_{LSGAN} (least-squares loss) have been implemented, each with its corresponding weight in the multiple-term loss function. For the first configuration, the weighted sum of the individual loss terms designed to obtain the best results is defined as

L_{FINAL-SYNNIR-CYCLEGAN} = 0.38 L_{GAN} + 0.62 L_{CYCLE}.    (9.9)

The second loss evaluated in this unsupervised approach is the LSGAN loss, where the weighted sum of the individual loss terms is defined as

L_{FINAL-SYNNIR-CYCLELSGAN} = 0.65 L_{LSGAN} + 0.35 L_{CYCLE}.    (9.10)
The combination of the weights associated with each loss function is focused on
improving the quality of the images for human perception and at the same time, they
are used as regularization terms that determine which loss function is the most significant
in the optimization of the model for the generation of the synthetic VI. An inappropriate
weight balance increases the risk that the model generates synthetic indexes with too
many artifacts and that it cannot generalize properly.
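The following sketch shows the least-squares terms of Eqs. (9.7)–(9.8), split into the discriminator and generator updates commonly used in practice, together with the weighted totals of Eqs. (9.9)–(9.10); only the 0.38/0.62 and 0.65/0.35 weights come from the text, the rest is an assumption about how the terms are combined during training.

# Least-squares adversarial terms and the weighted totals of Eqs. (9.9)-(9.10).
import tensorflow as tf

def lsgan_d_loss(d_real, d_fake):
    # Discriminator part of Eqs. (9.7)-(9.8): real samples toward 1, fakes toward 0.
    return tf.reduce_mean((d_real - 1.0) ** 2) + tf.reduce_mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    # Generator update used in practice: push fake outputs toward the real label.
    return tf.reduce_mean((d_fake - 1.0) ** 2)

def total_loss_cyclegan(l_gan, l_cycle):
    return 0.38 * l_gan + 0.62 * l_cycle         # Eq. (9.9)

def total_loss_lsgan(l_lsgan, l_cycle):
    return 0.65 * l_lsgan + 0.35 * l_cycle       # Eq. (9.10)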
Once the synthetic NIR image is estimated, the NDVI is computed by using Eq. (9.1)
together with the information from the red channel of the given image.
9.4 Results and discussions
9.4.1 Datasets for training and testing
The proposed approach has been evaluated using grayscale images and unpaired NDVI images; the architecture of the generator implemented is presented in Fig. 9.6. The model receives as input a single visible-spectrum image from the dataset of Brown and Süsstrunk [42]. From this dataset, the country, mountain, and field categories have been considered for evaluating the performance of the proposed approach; examples are presented in Fig. 9.7. The dataset consists of 477 registered images categorized into 9 groups captured in RGB (visible spectrum) and NIR (near-infrared spectrum). The country category contains 52 pairs of images (1024 × 680 pixels), while the field category contains 51 pairs of images (1024 × 680 pixels). In order to train the network to generate the VI for each of these categories, a data augmentation process has been applied to avoid overfitting or underfitting the model, so that it can converge and generalize; this process is carried out automatically by a specialized algorithm. It should be noted that during the training process the image pairs do not belong to the same scene, because the proposed CycleGAN model does not need correspondences as input.
9.4.2 Data augmentation
The proposed architecture uses as input an unpaired dataset derived from Brown and Süsstrunk [42]: the RGB images converted to grayscale and the NIR images. In order to enlarge the training dataset, we have implemented an automatic data augmentation process that creates modified versions of the grayscale and NIR images by taking random crops of a parameterized size, randomly selecting the coordinates of the region to crop before the training phase. The creation of multiple variations of the images can improve the performance and the ability of the fitted models to generalize what they have learned to new images. The data augmentation process executed for this approach (see Fig. 9.8) has provided a total of 70 different variations of size 256 × 256 for each image of every category in the dataset; 3500 pairs of images (256 × 256 pixels) have been generated, both as grayscale versions of the RGB images and as the corresponding NIR images (the NIR images are used to compute the ground-truth NDVI indexes, which are represented as images). Additionally, 1000 pairs of images per category (256 × 256 pixels) have been generated for testing and 100 pairs of images per category for validation, which are used to feed the learning network that synthesizes VIs, increasing performance and accelerating the generalization of the model.
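The random-crop step described above can be sketched as follows; the crop size (256 × 256) and the 70 crops per image follow the text, whereas the function name and the assumption that the channel dimension is statically known are ours.

# Random-crop augmentation: extract `num_crops` patches of size
# crop_size x crop_size at random coordinates from a single image.
import tensorflow as tf

def random_crops(image, crop_size=256, num_crops=70):
    channels = image.shape[-1]   # assumes a statically known channel dimension
    return [tf.image.random_crop(image, size=[crop_size, crop_size, channels])
            for _ in range(num_crops)]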
Fig. 9.7 Some examples of cross-spectral images: (first row) RGB images; (second row) unpaired NIR images; (third row) ground-truth NDVI images. (Images from M. Brown, S. Süsstrunk, Multi-spectral SIFT for scene category recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2011, pp. 177–184.)
9.4.3 Evaluation metrics
Digital images resulting from an artificial intelligence process, such as deep neural networks, are subject to a wide variety of distortions, which may result in a degradation of visual quality. Quality is a very important parameter for all objects and their functionalities, and the aim of research in objective image quality evaluation is to develop measures that can automatically predict the perceived image quality.
Fig. 9.8 Algorithm proposed for data augmentation.
In an image-based technique, image quality is a prime criterion. Commonly, for a good image quality evaluation, full-reference metrics are applied, such as the mean square error (MSE), one of the most used image quality metrics. The MSE measures the average of the squares of the errors or deviations; that is to say, large differences between actual and predicted values are punished more by MSE. However, this error does not match human visual perception. In contrast to MSE, a perceptual metric that measures image quality, the Structural Similarity (SSIM) index, has recently been developed to compare structural and feature similarity between restored and original objects on the basis of perception. For our approach, we have used the root mean squared error (RMSE) and the SSIM index as metrics, with which we were able to compute the results of the experiments and obtain consistent results. However, RMSE does not measure how well the textures of the images are represented, unlike the SSIM index, which captures the perceptual representation of the images. Additionally, from a semantic perspective, the SSIM index gives better results than the RMSE, and it also performs well for perception- and saliency-based errors. According to Wang et al. [43], the SSIM index evaluates images accounting for the fact that the human visual system is sensitive to changes in local structure; the index defines the structural information in an image as those attributes that represent the structure of the objects in the scene. The structural loss for a pixel p is defined as
$L_{SSIM} = \frac{1}{NM}\sum_{p=1}^{P}\big(1 - SSIM(p)\big)$   (9.11)
where SSIM(p) is the structural similarity index (see Ref. [43] for more details) centered at pixel p of the patch P.
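For reference, a minimal sketch of how these metrics might be computed with NumPy and scikit-image is shown below; it assumes 8-bit grayscale arrays and is an illustration of the idea, not the authors' evaluation code.

```python
import numpy as np
from skimage.metrics import structural_similarity

def rmse(pred, target):
    """Root mean squared error between predicted and ground-truth NDVI images."""
    return np.sqrt(np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2))

def ssim_index(pred, target):
    """Mean SSIM over the image, as in Wang et al. [43]."""
    return structural_similarity(pred, target, data_range=255)

def ssim_loss(pred, target):
    """Structural loss of Eq. (9.11): average of 1 - SSIM(p) over all pixels."""
    _, ssim_map = structural_similarity(pred, target, data_range=255, full=True)
    return np.mean(1.0 - ssim_map)
```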
9.4.4 Experimental results
The proposed approach (see Figs. 9.9 and 9.10) has been evaluated using NIR and RGB
images together with the corresponding NDVI obtained from Eq. (9.1), in which the
Fig. 9.9 Cycle generative adversarial model F: Y (NIR) → X (grayscale) and its discriminator Dx.
Fig. 9.10 Cycle generative adversarial model G: X (grayscale) → Y (NIR) and its discriminator Dy.
RGB red channel was used; the cross-spectral dataset used in our implementation came from Brown and Süsstrunk [42]. This dataset consists of 477 registered images categorized into 9 groups captured in the RGB (visible) and NIR (near-infrared) spectral bands. The country, mountain, and field categories have been considered for evaluating the performance of the proposed approach. The country category contains 52 pairs of images of 1024 × 680 pixels, the mountain category contains 55 pairs of images of 1024 × 680 pixels, while the field category contains 51 pairs of images of 1024 × 680 pixels. In order to enlarge the training dataset and improve the accuracy of our network in generating synthetic NIR images, a data augmentation process was performed. The data augmentation consists of applying flipping, rotation, and transposition over the original images. After the data augmentation process, 600 pairs of images from the visible and NIR spectra have been generated for each category. Additionally, for each category, 40 pairs of images for testing and 20 pairs of images for validation from the visible and NIR spectra have been used. It is important to emphasize that although the dataset images are registered, we use unpaired images for the CycleGAN training process.
On average, every training process took about 80 hours using a 3.2 GHz 8-core processor with 32 GB of memory and an NVIDIA TITAN Xp GPU. Some illustrations with the corresponding NIR results obtained with the proposed CycleGAN approach are depicted in Fig. 9.11 for qualitative evaluation.
These synthetic NIR images obtained with the CycleGAN are then used for estimating the NDVIs. Figs. 9.12–9.14 present some illustrations of NDVIs estimated for the country, field, and mountain categories using the generated synthetic NIR images. Also, Figs. 9.15–9.17 present illustrations of the NDVI generated with the proposed approach compared with results from Ref. [17], showing better qualitative results. Quantitative evaluations are presented in Table 9.1. In this table, the average root mean square error (RMSE) and SSIM index computed over the validation set are depicted when different combinations of the proposed loss functions were considered. Our experiments used the standard loss function for GANs, which is based on the negative log-likelihood, and also used the least-squares loss to obtain better quantitative results and avoid the vanishing gradient problem, where a deep feed-forward network is unable to propagate valid gradient information from the output back to the first layers of the model. We implement the least-squares loss to accelerate the training process and keep it stable. Additionally, results from Refs. [16, 17] are presented in this table. It can be appreciated that in all cases the results obtained with the least-squares loss in the proposed CycleGAN are better than those obtained with the approaches presented in Refs. [16, 17]. It should be mentioned that the least-squares loss accelerates network convergence, allowing a better optimization of the network.
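As a hedged illustration of this least-squares objective (in the spirit of LSGAN [41], not the authors' exact implementation), the discriminator and generator losses could be written as follows; `D`, `G`, `x`, and `z` are placeholders assumed to be defined elsewhere.

```python
import torch

def lsgan_d_loss(D, G, x, z):
    """Least-squares discriminator loss: push D(x) toward 1 and D(G(z)) toward 0."""
    d_real = D(x)
    d_fake = D(G(z).detach())
    return 0.5 * ((d_real - 1.0) ** 2).mean() + 0.5 * (d_fake ** 2).mean()

def lsgan_g_loss(D, G, z):
    """Least-squares generator loss: push D(G(z)) toward 1."""
    d_fake = D(G(z))
    return 0.5 * ((d_fake - 1.0) ** 2).mean()
```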
To increase the effect of the cycle loss over the network we used an L1 weight (λ). The proposed CycleGAN network has been trained using the stochastic Adam optimizer, since it is well
Fig. 9.11 Illustration of NIR images obtained by the proposed CycleGAN, which are later on used to estimate the corresponding NDVIs. (First row) RGB images. (Second row) Grayscale image used as input into the CycleGAN. (Third row) Estimated NIR images. (Fourth row) Ground-truth NIR images. (Images from M. Brown, S. Süsstrunk, Multi-spectral SIFT for scene category recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2011, pp. 177–184, country, field, and mountain categories.)
suited for problems with deep networks and large datasets and helps avoid overfitting. The image dataset was normalized to the (−1, 1) range and rescaled to 256 × 256 to avoid memory problems during the training process. The following hyperparameters were used during training: learning rate 0.0003, epsilon = 1e−08, exponential decay rate for the first-moment momentum 0.6, L1 (λ) 10.5, weight decay 1e−2, leaky ReLU slope 0.20.
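A minimal sketch of how these hyperparameters might map onto a PyTorch Adam optimizer is shown below; the placeholder network, the assumption that the momentum value corresponds to Adam's beta1, and the second beta are ours, not the authors'.

```python
import torch
import torch.nn as nn

# Hypothetical placeholder network standing in for one CycleGAN generator;
# the real architecture is not reproduced here, and the discriminators would
# receive their own optimizers configured the same way.
generator = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1), nn.LeakyReLU(0.20))

optimizer_g = torch.optim.Adam(
    generator.parameters(),
    lr=3e-4,              # learning rate 0.0003
    betas=(0.6, 0.999),   # 0.6 = exponential decay rate for the first moment
    eps=1e-8,             # epsilon
    weight_decay=1e-2,    # weight decay
)
lambda_l1 = 10.5          # cycle-consistency weight L1 (λ)
```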
Fig. 9.12 Images of NDVI VIs from the Country category obtained with the synthetic NIR generated by the proposed CycleGAN. (Left) Ground-truth NDVI VI images. (Right) Estimated NDVI VIs. (Images from M. Brown, S. Süsstrunk, Multi-spectral SIFT for scene category recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2011, pp. 177–184.)
Fig. 9.13 Images of NDVI VIs from the Field category obtained with the synthetic NIR generated by the proposed CycleGAN. (Left) Ground-truth NDVI VI images. (Right) Estimated NDVI VIs. (Images from M. Brown, S. Süsstrunk, Multi-spectral SIFT for scene category recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2011, pp. 177–184.)
Fig. 9.14 Images of NDVI VIs from the Mountain category obtained with the synthetic NIR generated by the proposed CycleGAN. (Left) Ground-truth NDVI VI images. (Right) Estimated NDVI VIs. (Images from M. Brown, S. Süsstrunk, Multi-spectral SIFT for scene category recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2011, pp. 177–184.)
Fig. 9.15 Images of NDVI VIs from country obtained with the proposed CycleGAN implemented in this chapter: (first col) NDVI estimated with Ref. [17]; (second col) NDVI estimated by the proposed CycleGAN; (third col) ground-truth NDVI VI. (Images from M. Brown, S. Süsstrunk, Multi-spectral SIFT for scene category recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2011, pp. 177–184.)
Fig. 9.16 Images of NDVI VIs from field obtained with the proposed CycleGAN implemented in this chapter: (first col) NDVI estimated with Ref. [17]; (second col) NDVI estimated by the proposed CycleGAN; (third col) ground-truth NDVI VI. (Images from M. Brown, S. Süsstrunk, Multi-spectral SIFT for scene category recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2011, pp. 177–184.)
Fig. 9.17 Images of NDVI VIs from mountain obtained with the proposed CycleGAN implemented in this chapter: (first col) NDVI estimated with Ref. [17]; (second col) NDVI estimated by the proposed CycleGAN; (third col) ground-truth NDVI VI. (Images from M. Brown, S. Süsstrunk, Multi-spectral SIFT for scene category recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2011, pp. 177–184.)
Table 9.1 Average root mean squared error (RMSE) and structural similarity (SSIM) obtained between the estimated NDVI and the real one computed from Eq. (9.1) (for SSIM, the bigger the better).

                                                        RMSE                            SSIM
Training                                                Country   Field   Mountain      Country   Field   Mountain
Supervised approach: results from Ref. [16]             3.53      3.70    –             0.94      0.91    –
Unsupervised approach: results from Ref. [17]           3.46      3.53    3.82          0.93      0.90    0.88
Proposed NDVI estimation with LFINALSYNNIRCYCLELSGAN    3.39      3.56    3.81          0.94      0.92    0.89

Notes: NDVI values are scaled up to a range of [0–255] since they are depicted as images, as shown in Figs. 9.15–9.17.
9.5 Conclusions
This chapter tackles the challenging problem of generating the NDVI VI using a synthetic NIR image and its corresponding RGB representation. NIR images are estimated by using a CycleGAN network. Results have shown that in most cases the network is able to obtain reliable synthetic NIR representations that can be used to obtain VIs. As mentioned in Section 9.4, this approach does not have the limitation of needing paired NIR-RGB images for training. As work in progress, we are considering the use of a CycleGAN architecture with continual learning through deep generative replay, feeding the generator with RGB images and their corresponding NIR images to speed up generalization. Future work will also consider other loss functions to improve the training process.
Acknowledgments
This work has been partially supported by the ESPOL project PRAIM (FIEC-09-2015); the Spanish Government under Project TIN2017-89723-P; and the “CERCA Programme/Generalitat de Catalunya.” The
authors also thank NVIDIA for GPU donations and the CYTED Network: "Ibero-American Thematic
Network on ICT Applications for Smart Cities" (REF-518RT0559).
References
[1] S.F. Di Gennaro, F. Rizza, F.W. Badeck, A. Berton, S. Delbono, B. Gioli, P. Toscano, A. Zaldei,
A. Matese, UAV-based high-throughput phenotyping to discriminate barley vigour with visible and
near-infrared vegetation indices, Int. J. Remote Sens. 39 (15–16) (2018) 5330–5344.
[2] M.F. Dreccer, G. Molero, C. Rivera-Amado, C. John-Bejai, Z. Wilson, Yielding to the image: how
phenotyping reproductive growth can assist crop improvement and production, Plant Sci. 282 (2019)
73–82.
[3] M. Wójtowicz, A. Wójtowicz, J. Piekarczyk, Application of remote sensing methods in agriculture,
Commun. Biometry Crop Sci. 11 (1) (2016) 31–50.
[4] T. Adão, J. Hruška, L. Pádua, J. Bessa, E. Peres, R. Morais, J. Sousa, Hyperspectral imaging: a review on
UAV-based sensors, data processing and applications for agriculture and forestry, Remote Sens. 9 (11)
(2017) 1110.
[5] S.S. Panda, D.P. Ames, S. Panigrahi, Application of vegetation indices for agricultural crop yield prediction using neural network techniques, Remote Sens. 2 (3) (2010) 673–696.
[6] S.S. Dahikar, S.V. Rode, Agricultural crop yield prediction using artificial neural network approach,
Int. J. Innov. Res. Electr. Electron. Instrum. Control Eng. 2 (1) (2014) 683–686.
[7] J. Rouse Jr, R.H. Haas, J.A. Schell, D.W. Deering, Monitoring vegetation systems in the great plains
with ERTS, NTRS, NASA Technical Reports Server, 1974. Tech. Rep.
[8] S. Skakun, C.O. Justice, E. Vermote, J.-C. Roger, Transitioning from MODIS to VIIRS: an analysis of
inter-consistency of NDVI data sets for agricultural monitoring, Int. J. Remote Sens. 39 (4) (2018)
971–992.
[9] T.N. Carlson, D.A. Ripley, On the relation between NDVI, fractional vegetation cover, and leaf area
index, Remote Sens. Environ. 62 (3) (1997) 241–252.
[10] A.H. Junges, D.C. Fontana, C.S. Lampugnani, Relationship between the normalized difference vegetation index and leaf area in vineyards, Bragantia 78 (2) (2019) 297–305.
[11] P. Ricaurte, C. Chilán, C.A. Aguilera-Carrasco, B.X. Vintimilla, A.D. Sappa, Feature point descriptors: infrared and visible spectra, Sensors 14 (2) (2014) 3690–3701.
[12] C.A. Aguilera, F.J. Aguilera, A.D. Sappa, C. Aguilera, R. Toledo, Learning cross-spectral similarity
measures with deep convolutional neural networks, in: The IEEE Conference on Computer Vision
and Pattern Recognition (CVPR) Workshops, IEEE, Las Vegas, USA, 2016, p. 9.
[13] P.L. Suárez, A.D. Sappa, B.X. Vintimilla, Cross-spectral image patch similarity using convolutional
neural network, in: IEEE International Workshop of Electronics, Control, Measurement, Signals
and their Application to Mechatronics (ECMSM), IEEE, 2017, pp. 1–5.
[14] N. Pettorelli, A.L.M. Chauvenet, J.P. Duffy, W.A. Cornforth, A. Meillere, J.E.M. Baillie, Tracking the
effect of climate change on ecosystem functioning using protected areas: Africa as a case study, Ecol.
Indic. 20 (2012) 269–276.
[15] P.L. Suárez, A.D. Sappa, B.X. Vintimilla, Learning image vegetation index through a conditional generative adversarial network, in: 2nd Ecuador Technical Chapters Meeting, 2017, pp. 27–35.
[16] P.L. Suárez, A.D. Sappa, B.X. Vintimilla, Vegetation index estimation from monospectral images,
in: International Conference Image Analysis and Recognition, Springer, 2018, pp. 353–362.
[17] J.-Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent
adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017,
pp. 2223–2232.
[18] A. Marra, M. Gargiulo, G. Scarpa, R. Gaetano, Estimating the NDVI from SAR by Convolutional
Neural Networks, in: IGARSS 2018—2018 IEEE International Geoscience and Remote Sensing
Symposium, IEEE, 2018, pp. 1954–1957.
[19] D.S. Reddy, P.R.C. Prasad, Prediction of vegetation dynamics using NDVI time series data and
LSTM, Model. Earth Syst. Environ. 4 (1) (2018) 409–419.
[20] S. Al Mansoori, A. Kunhu, H. Al Ahmad, Automatic palm trees detection from multispectral UAV data
using normalized difference vegetation index and circular Hough transform, in: High-Performance
Computing in Geoscience and Remote Sensing VIII, 10792 International Society for Optics and Photonics, 2018, pp. 11–19.
[21] J.M. Damian, M.R. Cherubin, A.Z. da Fonseca, E.Z. Fornari, A.L. Santi, O.H. de Castro Pias, Applying the NDVI from satellite images in delimiting management zones for annual crops., Sci. Agric. 77 (1)
(2020) 1–11.
[22] L. Ulsig, C. Nichol, K. Huemmrich, D. Landis, E. Middleton, A. Lyapustin, I. Mammarella, J. Levula,
A. Porcar-Castell, Detecting inter-annual variations in the phenology of evergreen conifers using longterm MODIS vegetation index time series, Remote Sens. 9 (1) (2017) 49.
[23] W. Li, J.-D.M. Saphores, T.W. Gillespie, A comparison of the economic benefits of urban green spaces
estimated with NDVI and with high-resolution land cover data, Landsc. Urban Plan. 133 (2015)
105–117.
[24] M. Rani, P. Kumar, P.C. Pandey, P.K. Srivastava, B.S. Chaudhary, V. Tomar, V.P. Mandal, Multitemporal NDVI and surface temperature analysis for Urban Heat Island inbuilt surrounding of sub-
humid region: a case study of two geographical regions, Remote Sens. Appl. Soc. Environ. 10 (2018)
163–172.
[25] L. Xu, B. Li, Y. Yuan, X. Gao, T. Zhang, A temporal-spatial iteration method to reconstruct NDVI
time series datasets, Remote Sens. 7 (7) (2015) 8906–8924.
[26] M.A. Hassan, M. Yang, A. Rasheed, G. Yang, M. Reynolds, X. Xia, Y. Xiao, Z. He, A rapid monitoring of NDVI across the wheat growth cycle for grain yield prediction using a multi-spectral UAV
platform, Plant Sci. 282 (2019) 95–103.
[27] R. Taghizadeh-Mehrjardi, K. Schmidt, A. Amirian-Chakan, T. Rentschler, M. Zeraatpisheh,
F. Sarmadian, R. Valavi, N. Davatgar, T. Behrens, T. Scholten, Predicting machine learning models
and rescanning covariate space, Remote Sens. 12 (7) (2020) 1095.
[28] T. Duan, S.C. Chapman, Y. Guo, B. Zheng, Dynamic monitoring of NDVI in wheat agronomy and
breeding trials using an unmanned aerial vehicle, Field Crops Res. 210 (2017) 71–80.
[29] Z. Jiang, A.R. Huete, J. Chen, Y. Chen, J. Li, G. Yan, X. Zhang, Analysis of NDVI and scaled difference vegetation index retrievals of vegetation fraction, Remote Sens. Environ. 101 (3) (2006)
366–378.
[30] Z. Zhao, J. Gao, Y. Wang, J. Liu, S. Li, Exploring spatially variable relationships between NDVI and
climatic factors in a transition zone using geographically weighted regression, Theor. Appl. Climatol.
120 (3–4) (2015) 507–519.
[31] A. Mazza, M. Gargiulo, G. Scarpa, R. Gaetano, Estimating the NDVI from SAR by convolutional
neural networks, in: IGARSS IEEE International Geoscience and Remote Sensing Symposium, IEEE,
2018, pp. 1954–1957.
[32] H. Huang, G. Sun, J. Ren, J. Rang, A. Zhang, Y. Hao, Spectral-spatial topographic shadow detection
from Sentinel-2A MSI imagery via convolutional neural networks, in: IGARSS IEEE International
Geoscience and Remote Sensing Symposium, IEEE, 2018, pp. 661–664.
[33] Y. Liu, S. Piramanayagam, S.T. Monteiro, E. Saber, Dense semantic labeling of very-high-resolution
aerial imagery and lidar with fully-convolutional neural networks and higher-order CRFs,
in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 76–85.
[34] A. Berger, G. Ettlin, C. Quincke, P. Rodrı́guez-Bocca, Predicting the normalized difference vegetation index (NDVI) by training a crop growth model with historical data, Comput. Electron. Agr.
161 (2019) 305–311.
[35] C.A. Aguilera, A.D. Sappa, C. Aguilera, R. Toledo, Cross-spectral local descriptors via quadruplet network, Sensors 17 (4) (2017) 873.
[36] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,
Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems, 2014,
pp. 2672–2680.
[37] M. Mirza, S. Osindero, Conditional Generative Adversarial Nets, ArXiv abs-1411-1784 (2014).
[38] P. Isola, J.-Y. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[39] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[40] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[41] X. Mao, Q. Li, H. Xie, R.Y.K. Lau, Z. Wang, S.P. Smolley, Least squares generative adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, 2017,
pp. 2794–2802.
[42] M. Brown, S. Süsstrunk, Multi-spectral SIFT for scene category recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2011, pp. 177–184.
[43] Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: from error visibility to
structural similarity, IEEE Trans. Image Process. 13 (4) (2004) 600–612.
CHAPTER 10
Image generation using generative
adversarial networks
Omkar Metri and H.R. Mamatha
Department of CSE, PES University, Bengaluru, India
10.1 Introduction to deep learning
Neural networks and deep learning are among the greatest innovations in the field of artificial intelligence. They are the reason that many real-world tasks, such as image and face recognition, speech recognition, object detection, and natural language processing, now have strong solutions. Likewise, most machine learning algorithms are good at learning patterns, at classification tasks such as category assignment, and at regression for interpreting numerical values based on the available data. But computers have struggled when asked to generate data. The only way to gather data to train the models has been collection from different sources or manual creation of the data. This gave rise to generative modeling. In a nutshell, generative modeling is a robust way of understanding data distributions in an unsupervised manner. These models aim to generate new data by learning the true distribution of the data. The data distribution cannot be characterized perfectly; as a consequence, with the help of neural networks and deep learning, it can be approximated as a function of the true distribution. Two elegant architectures used for generation purposes are the variational autoencoder (VAE) and the generative adversarial network (GAN).
10.1.1 Generative deep learning
A few questions may come to mind, such as the differences between generative and discriminative models and the need for generative models. This section answers these questions. A generative model basically categorizes a data sample based on how the data was generated, in other words, based on assumptions about the generation process. Discriminative models, on the other hand, categorize a data sample based on the differences between classes, ignoring the data generation details. If we imagine a speech-to-language classification task, the generative approach would learn the languages and classify based on the gained knowledge, whereas the discriminative approach would determine the linguistic differences without learning any language and predict the language of the speech. For inputs x and labels y, the generative algorithms would learn the
joint probability, i.e., p(x, y). Similarly, discriminative algorithms would learn the conditional probability, i.e., p(y | x) [1]. Generative models are being used extensively for producing realistic images of artwork, for simulation purposes, for time-series data, for reinforcement learning, as well as for generalizing features. A few prominent models are the neural autoregressive density estimator (NADE) [2], the masked autoencoder density estimator (MADE) [2], pixel recurrent neural networks (PixelRNN), pixel convolutional neural networks (PixelCNN), the variational autoencoder (VAE), Markov chains, and the generative adversarial network (GAN).
PixelRNN made an impact and became one of the promising solutions for image compression, reconstruction, generation, and other tasks [3]. The model basically loads the image and, at a given point in time, scans one row and one pixel within that row. The idea is to predict the distribution over the possible values of the next pixel. The joint distribution over pixels is basically the product of the conditional probabilities, thus making it a sequence problem. Two types of architecture are experimented with, i.e., row LSTM (long short-term memory) and diagonal BiLSTMs. The former uses a unidirectional layer of LSTM to scan the image row by row along with a one-dimensional convolution. The diagonal BiLSTM scans the image in a diagonal fashion. To increase the convergence speed and propagate the signals explicitly, residual connections were added to the architecture. Usually, realistic pictures have three channels, i.e., RGB. Hence, while predicting a pixel for the R channel, the context is the previously generated pixels to the left and above. Similarly, the context for the G channel remains the previously generated pixels along with a dependency on the R channel. Likewise, the B channel will have a dependency on the generated pixels, the R channel, and the G channel. To ensure these dependencies, masks are applied, i.e., masked convolution. Also, PixelCNN has been modeled using a CNN. The models have been tested on the MNIST, CIFAR-10, and ImageNet data. Fig. 10.1 shows generation samples for CIFAR-10 and 32 × 32 ImageNet.
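To make the masking idea concrete, below is a minimal sketch of a PixelCNN-style masked convolution in PyTorch; it handles only the spatial (raster-scan) masking, not the R/G/B channel ordering, and the class name and layer sizes are our own illustrative choices rather than the implementation of Ref. [3].

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    """Convolution whose kernel is masked so each output pixel only depends on
    pixels above it and to its left (raster-scan order), as in PixelCNN."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")  # 'A': exclude current pixel, 'B': include it
        k_h, k_w = self.kernel_size
        mask = torch.ones(k_h, k_w)
        mask[k_h // 2, k_w // 2 + (mask_type == "B"):] = 0  # zero right of center
        mask[k_h // 2 + 1:, :] = 0                          # zero rows below center
        self.register_buffer("mask", mask[None, None])       # broadcast over channels

    def forward(self, x):
        self.weight.data *= self.mask
        return super().forward(x)

# The first layer typically uses mask 'A'; subsequent layers use mask 'B'.
layer = MaskedConv2d("A", in_channels=1, out_channels=64, kernel_size=7, padding=3)
out = layer(torch.randn(1, 1, 28, 28))   # -> (1, 64, 28, 28)
```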
Sequential generation is slow, which is a major drawback of PixelRNN/CNN; also, the training of PixelCNN is faster compared to PixelRNN. Van den Oord et al. [4] present an improvement over PixelCNN termed the Gated PixelCNN. It is computationally efficient and surpasses the PixelRNN of Ref. [3]. The vanilla PixelCNN ignored content to the right of the current pixel, termed the blind spot (Fig. 1 in Ref. [3]). The gated architecture removed the blind spot with the help of two convolution stacks: one horizontal stack (the current row up to the current pixel) and one vertical stack (all the rows above the current row). Using conditional modeling of images, the conditional PixelCNN model generated realistic images of different classes. The model was also tested on human images: it performed well at generating images of the same person with different postures (Fig. 10.2) and at modeling eight classes (Fig. 10.3). It also demonstrated the use of PixelCNN as an image decoder. The generated samples were of average quality, depicting the model's capability of capturing the variations of objects. Another improvement on
Fig. 10.1 Generation samples of CIFAR-10 (left) and 32 × 32 ImageNet (right) [3].
Fig. 10.2 Source image (left) and image generation samples (right) [4].
PixelCNN with a logistic likelihood is proposed in Ref. [5]. The next section deals with a
brief introduction of autoencoders and a thorough explanation of the VAE.
10.1.2 Variational autoencoder
Autoencoders are feed-forward neural networks where the input is equivalent to the output. The autoencoder has three parts, i.e., encoder, code, and decoder. The input is fed to
the encoder which compresses and produces a code. The code is termed latent space representation. In turn, the decoder regenerates the input using the latent representation.
The output is a degraded representation of the input [6]. Autoencoders are used for
Fig. 10.3 Conditional image generation on eight classes [4].
anomaly detection purposes, information retrieval, image denoising, medical aging, popularity prediction, and others [7].
Many variations of the autoencoder have been used in a variety of applications. The prominent ones are sparse, stacked, denoising, and variational autoencoders. The standard autoencoder encodes the images as latent vectors, essentially trying to memorize the images. Therefore, generating new images is not possible, as the task of producing latent vectors depends on the input images. The VAE solves this problem efficiently by representing the data in a latent space that roughly follows a standard Gaussian distribution. Hence, feeding randomly sampled data from this distribution to the decoder will generate a new image. To measure the efficiency, two separate losses are calculated: one is the mean squared error (the generative loss) and the other is the Kullback-Leibler (KL) divergence, measuring the proximity between the latent space and the standard Gaussian distribution. The simplest method to optimize the KL divergence is to let the encoder produce a code of two vectors (means and standard deviations) and then pick a sample for generating an image [8, 9].
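A minimal sketch of these two losses and the sampling step is given below; it assumes the encoder outputs mean and log-variance vectors and is an illustration of the idea, not the implementation of Refs. [8, 9].

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps, keeping the sampling step differentiable."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def vae_loss(x, x_recon, mu, logvar):
    """Generative (reconstruction) loss plus KL divergence to N(0, I)."""
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```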
The availability of huge amounts of data has driven visual forecasting. The drawbacks of traditional approaches, such as nearest-neighbor algorithms and transferring raw trajectories, are computational expense, a high-dimensional output space, encoding difficulty due to pixel color variation across frames, and blurry predictions. These challenges are addressed by predicting dense pixel trajectories using a conditional VAE [10]. The results indicate the model's capability of learning representations with less data and commendable visual forecasting from static images (Figs. 3 and 4 in Ref. [10]). VAEs have been used in the music sector for style transfer between music genres [11]. Also, VAEs have been employed in medicine [12] and anomaly detection [13].
10.2 Introduction to GAN
GANs [14] are a remarkable AI innovation capable of creating images, audio, and video that are indistinguishable from the real thing. This section explains GANs in relation to the game-theoretic concept of Nash equilibrium, the architecture, and the training problems.
10.2.1 Nash equilibrium
Nash equilibrium is an important concept of game theory, named after its inventor, John Nash. The best end result depends on the behavior and interaction of the participants in the game. In other words, the optimal solution in a noncooperative game is one in which no player can benefit by changing their initial strategy. Basically, a player gains nothing by deviating from the initial strategy under the assumption that the other players keep their strategies unchanged. A game may include multiple equilibria or none at all [15, 16].
For example, assume two companies A and B. The companies are deciding whether they should start an advertising campaign to launch their products. If both companies choose to advertise, each company acquires 100 customers. If only one of them chooses to campaign, that company acquires 200 customers and the other acquires none. If neither of them campaigns, then no customers are acquired (Table 10.1).
Company A should advertise, as this provides a better payoff than not advertising regardless of what B does. A similar argument applies to company B. Hence, both companies opting to advertise is a Nash equilibrium. Another common example associated with Nash equilibrium is the prisoner's dilemma [16].
Table 10.1 Reward table (rows: company A's choice; columns: company B's choice; payoffs listed as A, B).

Company A, B          Advertise     Do not advertise
Advertise             100, 100      200, 0
Do not advertise      0, 200        0, 0
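As an illustrative check of this example (not part of the original text), the short sketch below verifies with NumPy that (Advertise, Advertise) is a Nash equilibrium of the payoff matrix in Table 10.1, i.e., that neither company can gain by unilaterally deviating.

```python
import numpy as np

# Payoffs from Table 10.1: rows = company A's action, cols = company B's action,
# with action 0 = Advertise and 1 = Do not advertise.
payoff_a = np.array([[100, 200],
                     [  0,   0]])
payoff_b = np.array([[100,   0],
                     [200,   0]])

def is_nash(a_action, b_action):
    """True if neither player improves its payoff by deviating unilaterally."""
    a_ok = payoff_a[a_action, b_action] >= payoff_a[:, b_action].max()
    b_ok = payoff_b[a_action, b_action] >= payoff_b[a_action, :].max()
    return bool(a_ok and b_ok)

print(is_nash(0, 0))  # True: (Advertise, Advertise) is a Nash equilibrium
print(is_nash(1, 1))  # False: either company would gain by advertising
```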
10.2.2 GAN and Nash equilibrium
A GAN setup includes two neural networks in opposition to one another: one to create fakes (the generator) and one to spot them (the discriminator). The term generative indicates the idea of creating new data depending upon the training data. The term adversarial indicates a game-like framework with two networks, i.e., the generator and the discriminator. The generator produces realistic data similar to the training data, whereas the discriminator's task is to distinguish fake data produced by the generator from real data coming from the training sample (Fig. 10.4) [18–23]. The GAN is built on the concept of a zero-sum noncooperative game, i.e., minimax: one player maximizes its objective and the other player minimizes it. Before diving into the mathematics, Eq. (10.1) introduces the variable notation, and Eqs. (10.2)–(10.4) give the log-loss, KL divergence, and Jensen-Shannon divergence (JSD) notations used in the rest of the chapter.
z → noise vector
x → original training data → x_r
G(z) → generator output → x_f
D(x) → discriminator output for x_r
D(G(z)) → discriminator output for x_f   (10.1)

$E(p \mid y) = -\frac{1}{N}\sum_{i}\big[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \big]$   (10.2)

$D_{KL}(P \parallel Q) = \int p(x) \log\frac{p(x)}{q(x)}\, dx$   (10.3)

$JSD(P \parallel Q) = \frac{1}{2} D_{KL}(P \parallel M) + \frac{1}{2} D_{KL}(Q \parallel M), \quad M = \frac{1}{2}(P + Q)$   (10.4)
The discriminator is a binary classifier with the intention of categorizing the image as
real or fake. As a consequence, the model should showcase high and low probabilities for
Fig. 10.4 GAN architecture [17].
real and fake data. Hence, the outputs D(x) and D(G(z)) lie in the range 0 to 1. The discriminator's task is to maximize the probability D(x) and minimize the probability D(G(z)), whereas the generator maximizes the probability D(G(z)). The discriminator and generator aim at achieving Eqs. (10.5), (10.6), respectively. Therefore, the value function of the minimax game is defined as Eq. (10.7). The expectation E(·) in Eq. (10.7) corresponds to the binary cross-entropy, i.e., the log-loss function of Eq. (10.2). The noise z follows a normal or uniform distribution. According to game theory, the convergence of the GAN is achieved when the discriminator and generator reach the Nash equilibrium, where the value function equals −log_e 4 (details in Section 10.2.2.1).
Discriminator: $\max_D \; E_{x \sim p(x)}[\log D(x)] + E_{z \sim p(z)}[\log(1 - D(G(z)))]$   (10.5)

Generator: $\max_G \; E_{z \sim p(z)}[\log D(G(z))]$   (10.6)

GAN: $\min_G \max_D V(D, G) = E_{x \sim p(x)}[\log D(x)] + E_{z \sim p(z)}[\log(1 - D(G(z)))]$   (10.7)
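As a hedged illustration of how these objectives are commonly optimized in practice (using the non-saturating generator loss rather than the literal minimax form), a PyTorch-style sketch is given below; `G`, `D`, `x`, and `z` are placeholders assumed to be defined elsewhere, with `D` ending in a sigmoid so its output lies in (0, 1).

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, G, x, z):
    """Maximizing Eq. (10.5) is equivalent to minimizing this binary cross-entropy."""
    d_real = D(x)
    d_fake = D(G(z).detach())   # do not backpropagate into the generator here
    loss_real = F.binary_cross_entropy(d_real, torch.ones_like(d_real))
    loss_fake = F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    return loss_real + loss_fake

def generator_loss(D, G, z):
    """Non-saturating version of Eq. (10.6): maximize log D(G(z))."""
    d_fake = D(G(z))
    return F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
```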
10.2.2.1 Nash equilibrium proof
Assume
$A = P(x_r), \quad B = P(x_f), \quad y = D(x)$   (10.8)
From the Radon-Nikodym theorem, G(z) can be approximated to x. In addition, from Eqs. (10.2), (10.8), Eq. (10.7) can be written as
$\min_G \max_D V(D, G) = \int_y \big[ A \log y + B \log(1 - y) \big]\, dy$   (10.9)
The optimal discriminator is obtained by maximizing the integrand in Eq. (10.9). Hence, the integrand is rewritten as
$f(y) = A \log y + B \log(1 - y)$   (10.10)
To find the maximum of Eq. (10.10), $f'(y) = 0$ and $f''(y) < 0$ should be satisfied:
$f'(y) = 0 \;\Rightarrow\; \frac{A}{y} - \frac{B}{1 - y} = 0 \;\Rightarrow\; y = \frac{A}{A + B}$   (10.11)
Eq. (10.11) boils down to
$D(x) = \frac{P(x_r)}{P(x_r) + P(x_f)} = \frac{1}{2} \;\;\text{if}\;\; P(x_r) = P(x_f)$   (10.12)
Eq. (10.12) indicates the discriminator's confusion when it is fed original and generated images. Hence, Eq. (10.9) becomes:
$V(D,G) = \int_y \Big[ A \log\frac{A}{A+B} + B \log\frac{B}{A+B} \Big]\, dy$
$= \int_y \Big[ A \log\frac{A}{A+B} + B \log\frac{B}{A+B} + (\log_e 2 - \log_e 2)(A+B) \Big]\, dy$
$= \int_y \Big[ A \log\frac{2A}{A+B} + B \log\frac{2B}{A+B} \Big]\, dy - \int_y \log_e 2\,(A+B)\, dy$
$= D_{KL}\Big(A \,\Big\|\, \frac{A+B}{2}\Big) + D_{KL}\Big(B \,\Big\|\, \frac{A+B}{2}\Big) - 2\log_e 2$
$= 2\, JSD(A \parallel B) - \log_e 4$   (10.13)
Eq. (10.13) is rewritten as
$V(D,G) = -\log_e 4 + 2\, JSD\big(P(x_r) \parallel P(x_f)\big)$   (10.14)
From the JSD definition, $JSD(P(x_r) \parallel P(x_f))$ is zero when $P(x_r) = P(x_f)$. Hence, Eq. (10.14) reduces to $-\log_e 4$, proving that the Nash equilibrium (global minimum) of the minimax game is $-\log_e 4$.
10.2.2.2 Training problems
Since the advent of GANs, a lot of research has been conducted, and numerous problems associated with training have been identified. A few are stated below [17, 24]:
1. Mode collapse: the generator collapses to producing the same kind of image for possibly different latent vectors. The aim of the generator is Eq. (10.6). Consider the case where the generator is trained substantially without updating the discriminator. In this case, the generated images converge to the optimal image that fools the discriminator, thereby becoming independent of z (Eq. 10.15). Hence, the gradient with respect to z turns zero and the mode collapses to a single point. One prominent affecting factor
is the learning rate. It is recommended to use a low learning rate and add noise to real
and generated images during training. Mode collapse is a challenging problem to date.
$x^{*} = \arg\max_x D(x)$   (10.15)
2. Nonconvergence and stability: due to diminished gradients, the discriminator becomes more powerful while the generator's gradient vanishes; hence, the GAN does not learn anything. The stability of the model is also a major concern: the model parameters destabilize and are therefore responsible for nonconvergence. Lipschitz regularization has shown great success in stabilizing GAN training [25], and a few important considerations during training are cited in Ref. [26].
3. Early stopping and hyperparameters: early stopping is terminating the training due to an abrupt increase or decrease in the loss. Hyperparameter selection, such as the choice of loss function, and an imbalance between the two components can result in overfitting.
10.2.3 VAE-GAN
Larsen et al. [27] presented the VAE-GAN architecture, combining a VAE and a GAN and outperforming the plain VAE. The flow is indicated in Fig. 10.5A. The disadvantage of the VAE is the generation of blurry images. Hence, the loss function of the VAE's decoder is
Fig. 10.5 VAE-GAN architecture [27]: (A) vanilla VAE-GAN and (B) VAE-GAN with auxiliary generator.
replaced with a loss metric learned through the GAN. Thus, no assumptions are made regarding the loss function, as the GAN's discriminator is used to categorize the image as real or fake. In other words, the generator (the VAE decoder) uses this information to produce less blurry images. The original VAE generates samples that differ in a Gaussian way, resulting in blurry images. VAE-GAN overcomes the problem by sending the reconstructions and the ground truth to the discriminator, assuming that one hidden layer of the discriminator will differ in a Gaussian way. Hence, one of the hidden layers of the GAN discriminator is used for the VAE loss. The reason for not choosing the final output layer is its lack of learned variation between real and fake.
$L_{GAN} = \log(Dis(x)) + \log(1 - Dis(Dec(z))) + \log(1 - Dis(Dec(Enc(x))))$   (10.16)
The architecture is refined by adding an auxiliary generator over the generator-cum-decoder (Fig. 10.5B). The discriminator is set to receive and classify images from three sources: the original images (x), samples from the normal distribution (x_p), and the VAE reconstructions (x̃). Hence, the outputs generated by the auxiliary generator and by the decoder generator are treated as fake. The objective of the GAN is Eq. (10.16), and the results obtained are better compared to the vanilla VAE-GAN (Fig. 10.6).
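A rough sketch of the three-term adversarial objective in Eq. (10.16) is shown below; `Dis`, `Dec`, and `Enc` are placeholder callables (discriminator, decoder, and encoder) and this is an illustration of the equation, not the reference implementation of Ref. [27].

```python
import torch

def vaegan_discriminator_objective(Dis, Dec, Enc, x, z, eps=1e-8):
    """Eq. (10.16): real images, samples decoded from the prior, and VAE
    reconstructions all contribute to the discriminator's objective."""
    real_term = torch.log(Dis(x) + eps)
    prior_term = torch.log(1.0 - Dis(Dec(z)) + eps)
    recon_term = torch.log(1.0 - Dis(Dec(Enc(x))) + eps)
    return (real_term + prior_term + recon_term).mean()
```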
10.3 Applications
Since their invention, GANs have been used extensively by experts and researchers; they are one of the remarkable innovations in the field of deep learning. GANs can be used for image editing by reconstructing the image. GANs can be employed for
Fig. 10.6 Reconstruction using VAE variations [27].
Table 10.2 Implementation details.

Sno   GAN flavor/link       Paper
1     Pix2Pix, CycleGAN     [19, 20]
2     Pix2Pix               [19]
3     CoGAN                 [21]
4     BiGAN                 [23]
5     StarGAN               [28]
6     StarGAN V2            [29]
7     SRGAN                 [30]
8     Art2Real              [31]
9     Monkey-Net            [32]
10    First Order Model     [33]
11    StackGAN              [34]
strengthening security by creating fake threats and training a model to identify these threats. The availability of data in the healthcare industry is limited; hence, GANs can be used for the synthetic generation of data. On a similar basis, they can be employed to generate 3D objects and to predict future video frames, thereby generating videos. In addition, they are making a huge impact in the movie-making, gaming, music, and fashion industries. This section introduces a few image translation applications using flavors of GAN. Tables 10.2 and 10.3 provide the implementation and dataset details.
10.3.1 Image-to-image translation using {c, cycle}-GAN
The goal of image-to-image translation is to map an input image to an output image with the help of aligned image pairs. Mirza and Osindero [18] extended the vanilla GAN to a conditional model. The architecture supplied additional information, i.e., class labels, to the generator and discriminator by adding an extra input layer. The experiment was conducted on the MNIST dataset. A Parzen window-based log-likelihood estimate was calculated and the comparative results are cited in Table 1 of Ref. [18]. On a similar basis, tag vectors were conditioned on the images for automatic image tagging. The objective remained the same in Ref. [19], with the discriminator's task unchanged and the generator's task being to fool the discriminator while bringing the output close to the ground truth in the L1 sense, as L1 encourages less blurring. The architecture is named pix2pix and the experiments have been performed on various datasets like Cityscapes, CMP (Centre for Machine Perception) facades, Google Maps, edges, sketches, and others. The pix2pix architecture consists of two components, i.e., a U-Net generator and a PatchGAN discriminator. The U-Net generator is an autoencoder with skip connections; because the spatial dimensions of the connected layers match, the skip connections do not require resizing or projections. The aim of the PatchGAN
Table 10.3 Dataset description.

Sno   Dataset name/link                                       Description
1     Open Image V6                                           9M images with image-level labels, bounding boxes, and segmentation masks of objects, visual relationships, and localized narratives
2     Cityscapes, Edge2handbags, Edge2shoes, Facades, Maps    Urban street scenes, sketches, buildings, and Google maps
3     CelebA                                                  202,599 face images with 40 annotations per image of 10,177 unique identities
4     CelebA-HQ                                               High-resolution images of CelebA
5     RaFD                                                    67 models with 8 different emotional expressions
6     AFHQ                                                    15,000 high-resolution images of cat, dog, and wildlife
7     Monet2Photo                                             Monet images
8     Landscape2Photos                                        Landscape images
9     Portrait2Photo                                          Portrait images
10    UvA-Nemo                                                1240 smile videos from 400 subjects (597 spontaneous and 643 posed)
11    BAIR Robo Pushing                                       59,000 examples of robot pushing motions
12    CUB Bird                                                6033 images of 200 bird species
13    VoxCeleb                                                Short clips of video extracted from YouTube videos
14    Oxford-102                                              102 flower categories with the number of images per class ranging between 40 and 258
15    COCO                                                    A large-scale object detection, segmentation, and captioning dataset
discriminator is to classify each N × N patch in an image as real or fake. In addition, it has fewer parameters and is faster compared to classifying the entire image. The advantage of the pix2pix architecture is that it is generic and learns the objective during training without making assumptions specific to the two types of images; hence, it is flexible for various situations. Human scoring and semantic segmentation are the two evaluation strategies [19, 35, 36].
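As a hedged sketch (not the official pix2pix code), the generator objective combining the adversarial term with a λ-weighted L1 term could look like this; `G`, `D`, the paired input `x` and target `y`, and the two-argument conditional discriminator are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pix2pix_generator_loss(G, D, x, y, lam=100.0):
    """Adversarial loss plus lambda * L1 distance to the ground truth,
    in the spirit of the pix2pix objective [19]. D returns raw logits."""
    fake = G(x)
    d_out = D(x, fake)   # conditional PatchGAN output on (input, translated) pair
    adv = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    l1 = F.l1_loss(fake, y)
    return adv + lam * l1
```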
Zhu et al. [20] present a notable extension of the GAN architecture with two generators and two discriminators trained simultaneously. For simplicity of understanding, let d1 and d2 be two domains. One generator takes input images from d1 and generates images in d2; similarly, the other generator takes input images from d2 and outputs images in d1. The discriminators play their usual role and the generators update accordingly. Cycle consistency is the add-on to this architecture, borrowed from the machine translation domain: it states that a phrase translated from Kannada to English should translate back from English to Kannada with the same efficiency. In the case of CycleGAN, the idea is to feed the output of one generator as input to the other generator, and the output should be the same as the original image; the reverse is also true. The loss is calculated in two parts, i.e., forward
Table 10.4 Evaluation of different models on the Cityscapes dataset.

Rank   Model      Year   Per Pixel Acc (%)   Per Class Acc (%)   Class IOU   Paper
1      Pix2Pix    2016   71                  25                  0.18        [19]
2      CycleGAN   2017   52                  17                  0.16        [20]
3      CoGAN      2016   40                  10                  0.06        [22]
4      SimGAN     2016   20                  10                  0.04        [21]
5      BiGAN      2016   19                  6                   0.02        [23]
and backward cycle consistency losses [20, 37]. Impressive applications like collection style transfer, object transfiguration, season transfer, and image generation from paintings are demonstrated as well. A few more noteworthy extensions of GAN are SimGAN, CoGAN, and BiGAN [21–23]. Table 10.4 summarizes the scores obtained using different flavors of GAN on the Cityscapes dataset. Pictorial results are showcased in Fig. 10.7. Pix2Pix (Fig. 10.8A) and CycleGAN (Fig. 10.8B) can be employed for many real applications at hand.
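A minimal sketch of the forward and backward cycle-consistency terms is given below, under our own naming (not the CycleGAN reference code): `g_xy` maps d1 to d2 and `f_yx` maps d2 back to d1.

```python
import torch.nn.functional as F

def cycle_consistency_loss(g_xy, f_yx, x_d1, y_d2):
    """Forward cycle: f_yx(g_xy(x)) ≈ x; backward cycle: g_xy(f_yx(y)) ≈ y."""
    forward = F.l1_loss(f_yx(g_xy(x_d1)), x_d1)
    backward = F.l1_loss(g_xy(f_yx(y_d2)), y_d2)
    return forward + backward
```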
10.3.2 Face generation using StarGAN
Face generation is the task of generating different variations of a face from an existing dataset. The task of generating a face from a given image relates to a particular aspect, e.g., changing the color of the hair or turning a smiling face into an angry one. An important requirement of face generation is high-quality images. Choi et al. [28] use two datasets: CelebA, which consists of images with 40 attributes such as hair color and skin tone, and RaFD, which provides 8 emotional expressions per face image. The trivial versions are inefficient and less productive for translation among multidomain datasets, as they would require k(k − 1) generators, with k set to the number of domains. StarGAN is an efficient solution for learning the differences between various domains with the help of a single generator and discriminator. The trivial versions learn a fixed translation, such as black-to-gray hair, while the StarGAN generator is fed with both the image and the domain information, thereby learning to translate images. The domain is fed as a binary or one-hot vector and the target domain is randomly generated during training, giving control and flexibility to generate images in any domain during the test phase.
Fig. 10.7 Different flavors of GANs on the Cityscapes dataset [20].
Fig. 10.8 Use of GANs in the fashion industry, paintings, and realistic image generation: (A) Edges to
handbags using Pix2Pix [19] and (B) CycleGAN image generation [20].
The approach allows training on multiple datasets by ignoring the unknown labels and concentrating on the necessary labels. Basically, the model learns features from CelebA along with features from the RaFD dataset, such as happy and fearful, and incorporates these emotional features on the CelebA dataset (Fig. 10.10A). Two variations have been experimented with, and the results depict the success of the model not only on multiple domains of a single dataset but also on multiple domains across multiple datasets.
The limitation of StarGAN is that the domain is indicated with a predetermined label: the generator receives a fixed label and therefore produces the same output image for each domain. The same set of researchers [28] came up with a scalable approach to generate images spanning multiple domains; in other words, the generated image can incorporate more than one domain. This is achieved by replacing the domain information with a domain-specific style code [29]. To achieve this, two modules are added to the architecture. One is the mapping network, which transforms the latent code into various style codes, one of which is randomly selected during training. The other module is the style encoder, whose functionality is to extract the style code from an image. The StarGAN and StarGAN V2 architectures are presented in Fig. 10.9A and B, respectively. Apart from the scalable approach and good results (Fig. 10.10B), Choi et al. [29] present an AFHQ
Fig. 10.9 StarGAN architecture: (A) StarGAN [28] and (B) StarGAN V2 [29].
Fig. 10.10 Image-to-image translation: (A) StarGAN [28] and (B) StarGAN V2 [29].
dataset consisting of high-quality animal face images having inter and intra domain
differences. The dataset is publicly available.
10.3.3 Photo-realistic images using SRGAN and Art2Real
Single image super-resolution (SR) is the task of improving and estimating a high-resolution (HR) image from a low-resolution (LR) image. Dong et al. [38] is a major
Fig. 10.11 Left to right: Bicubic, SRResNet, SRGAN, and original image [30].
breakthrough with the proposal of SRCNN. The model structure is simple and achieved SOTAa performance compared to the traditional sparse coding-based SR methods. The major unsolved problem in SR is recovering the finer texture details when super-resolving at large upscaling factors [30]. Ledig et al. [30] proposed SRGAN, combining deep learning and adversarial networks. The ultimate goal of the generator is the estimation of the HR image from the LR image, achieved with the help of a feed-forward CNN. On the other hand, the discriminator is similar to the VGG network, with the exception of dropping the max-pooling layers throughout the component. The work highlighted the perceptual loss function, defined as the weighted sum of a content loss and an adversarial loss, which makes the loss more sensitive to human perception. The content loss is defined as a difference between features of the generated image and the original image, related to measures such as PSNR (peak signal-to-noise ratio), MSE (mean square error), and SSIM (structural similarity index). The adversarial loss accounts for the probability that the generated image is real or fake. It is worth mentioning that no objective measure can match the level of human perception (Fig. 10.11). Hence, mean opinion score (MOS) testing was performed, where 26 evaluators rated the generated images on a scale of 1–5. The MOS test proved that the SRGAN results are superior, with SOTAa performance, and that measures such as SSIM and PSNR fail to estimate perceived image quality.
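The snippet below is a hedged sketch of such a perceptual objective, combining a VGG-feature content loss with an adversarial term; the layer choice, the adversarial weight, and the discriminator handle are illustrative assumptions rather than the SRGAN reference code.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Frozen VGG19 feature extractor used for the content (feature-space) loss.
vgg_features = vgg19(weights="IMAGENET1K_V1").features[:36].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def perceptual_loss(D, sr_image, hr_image, adv_weight=1e-3):
    """Weighted sum of VGG content loss and adversarial loss (sketch).
    D is a discriminator returning raw logits for the super-resolved image."""
    content = F.mse_loss(vgg_features(sr_image), vgg_features(hr_image))
    d_out = D(sr_image)
    adversarial = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    return content + adv_weight * adversarial
```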
D-SRGANb has also been used in the field of the Digital Elevation Model (DEM). A DEM is a 3D representation of a terrain's surface, such as that of the Earth, the Moon, or an asteroid (Fig. 10.12) [39]. The accuracy of DEM models depends on various factors, including the source and the horizontal and vertical precision of the elevation data. DEM data is used for prediction of soil attributes [40], product development, decision-making, mapping purposes, 3D simulations, river channel estimation, contour map creation, and so on [41].
a SOTA, State of the art.
b D-SRGAN, DEM super-resolution generative adversarial network.
Fig. 10.12 3D rendering of a DEM of Tithonium Chasma on Mars [39].
Lakshmi and Yarrakula [41] review research in DEM generation and the various techniques developed over a decade. A variety of satellite images are available at different resolutions, such as spectral, radiometric, and temporal, which reduces the time, cost, and effort required to collect DEM data. SRGAN and D-SRGAN [42] differ in network architecture, with D-SRGAN outperforming other methods used in the DEM domain and performing better on flatter terrain than on steeper terrain.
On similar lines, Tomei et al. [31] proposed a semantic-aware architecture to generate realistic images from paintings. The architecture is based on memory banks from which realistic details are recovered at the patch level (Fig. 10.13). Hence, it includes the preparation of memory banks to support generation. One memory bank (Bc) consists of patches of one semantic class c, such as Bsky or Brock. These patches are extracted using a sliding-window policy if at least 20% of their pixels belong to the semantic class c. Once the preparation of the memory banks is complete, the painting is transformed into a realistic image by pairing generated and real patches: the generated patches come from segmentation maps of the original/generated painting and the real patches come from the memory banks. The dataset consists of paintings from specific artists, artworks from Wikiart, landscape photos from Flickr, and people photos from CelebA. The evaluation metric is the Frechet Inception Distance (FID), which measures the difference between two Gaussians; the results outperform CycleGAN, Unsupervised Image-to-Image Translation (UNIT), Disentangled Representation for Image-to-Image Translation (DRIT), and style-transferred real methods. A lower FID conveys higher realism of the generated images. Fig. 10.14 depicts the qualitative results of the Art2Real framework.
Fig. 10.13 Art2Real architecture [31].
Fig. 10.14 Qualitative results of Art2Real [31].
10.3.4 Image animation and scene generation using Monkey-Net, first-order motion, and StackGAN
Animation in the field of computer graphics and vision is defined as the ability to generate moving images [43]. It is also referred to as computer-generated imagery [44], which usually comprises 3D computer graphics. For 3D animation, objects are rendered on the computer, whereas for 2D figure animation, separate objects are used and the animator moves specific parts like the eyes and mouth based on keyframes. Early works show the use of deep learning and GANs for deep motion transfer [45, 46]. Siarohin et al. [32] generate image animation of a target object given a driving sequence and a source image. The requirement of pretrained models for object detection, the availability of ground-truth data, animating any kind of object, and video translation from one domain to another are a few limitations of animation approaches. To address these challenges, a three-module framework is introduced. The first module extracts the object keypoints in an unsupervised fashion. The second module generates motion heatmaps from the keypoints to encode the motion information, and the third module combines the heatmaps and the appearance from the source image to produce a short video. The novelty of the approach is image animation of arbitrary objects and a scheme for transferring the motion information by learning the pixel movement in an unsupervised manner. The architecture follows a self-learning scheme named Monkey-Net (Fig. 10.15). The UvA-Nemo, Tai-Chi, and BAIR datasets are used for experimental purposes. The evaluation metrics are Average Keypoint Distance (AKD), Missing Keypoint Distance (MKD), Average Euclidean Distance (AED), and FID. The results outperform the X2face method (Fig. 10.16). The image animation was followed by
Fig. 10.15 (A) Image animation and (B) Monkey-Net architecture [32].
image-to-video translation. The qualitative evaluation was performed using Amazon Mechanical Turk (MTurk). MTurk makes tasks like surveys and data validation easier by assigning them to a distributed workforce who perform them virtually. Three videos were shown to users: the driving video and two videos generated using the X2face and Monkey-Net methods. The Monkey-Net generated videos were preferred more than 80% of the time over the X2face method.
The weakness of Monkey-Net is poor generation quality due to pose changes in the case of large objects [33]. To overcome this weakness, the keypoint detector models complex motions with the help of local affine transformations; to improve their estimation, an equivariance loss is used during keypoint training. If the driving video contains large motions, then the generator should infer the object parts from the context when they are not clear in the input image; hence, an occlusion-aware generator is set up. Most frameworks fail to handle HR data, whereas the experimental results on an HR dataset depict the success of the framework in image animation when compared to other SOTAa methods (Fig. 10.16). In addition, the authors have released a new Tai-Chi-HD dataset.
Deep learning in combination with GANs has been used in scene generation as well. Generating quality images from text is an interesting field of computer vision termed scene generation. Early approaches fail to capture the object and generate low-quality images. Zhang et al. [34] presented a novel approach to generate 256 × 256 quality images conditioned on text data. The idea is to decompose the problem into subproblems by extracting the primitive shape and colors of the object, thereby producing LR images with the help of the given words. In the next stage, HR realistic images are produced using the generated LR images and the text descriptions. This is achieved by stacking up two GANs (Fig. 10.17). The first step is converting the text to an embedding. Sparse text descriptions result in discontinuity in the latent data, which is an obstacle for the generator; hence, the embedding is conditioning-augmented and combined with noise to generate an image. The discriminator generates the decision score. The rough LR images along with the text embedding are fed to the second GAN, whose generator follows an encoder-decoder network with residual blocks. The CUB, Oxford-102, and MS COCO datasets have been used for experimental purposes, with the inception score as the evaluation metric. The results are more realistic images compared to the existing methods (Figs. 10.18 and 10.19).
Fig. 10.16 Image animation: reference video (first row), X2face (second row), Monkey-Net (third row), and first-order model (fourth row) [33].
Fig. 10.17 StackGAN architecture [34].
Fig. 10.18 Scene generation using different methods on the CUB dataset [34].
Fig. 10.19 Scene generation on the Oxford-102 and COCO datasets [34].
10.4 Future of GANs
Research in GANs is soaring in image and video generation to a great extent. GANs are a perfect setup to ensure that the generated samples belong to the distribution of interest with the help of the adversarial discriminator. The results have proved capable of synthesizing facial expressions, swapping a horse with a zebra, serving the fashion industry, turning paintings into realistic images, and generating animations and scenes, compared with the other SOTA methods. A statement by Facebook AI Research, "GANs are the most interesting idea of the decade," is somewhat true and still holds to the present day. A good amount of research is currently ongoing in natural language processing and GANs. In addition, possible future applications are text, audio, and music generation, drug discovery, and medical imaging. With the amount of research ongoing in various fields, GANs remain a promising and prominent solution.
References
[1] A.Y. Ng, M.I. Jordan, On discriminative vs. generative classifiers: a comparison of logistic regression
and naive Bayes, in: Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, 2001, pp. 841–848.
[2] M.M. Khapra, CS7015 (Deep Learning): Lecture 22 Autoregressive Models (NADE, MADE), 2020,
[Online] Available at: https://www.cse.iitm.ac.in/miteshk/CS7015/Slides/Handout/Lecture22.
pdf. (Accessed 6 July 2020).
[3] A.V.D. Oord, N. Kalchbrenner, K. Kavukcuoglu, Pixel Recurrent Neural Networks, 2016. arXiv
2016. arXiv preprint arXiv:1601.06759.
[4] A. Van den Oord, N. Kalchbrenner, L. Espeholt, O. Vinyals, A. Graves, Conditional image generation
with pixelcnn decoders, in: Advances in Neural Information Processing Systems, 2016, pp. 4790–4798.
[5] T. Salimans, A. Karpathy, X. Chen, D.P. Kingma, Pixelcnn++: Improving the Pixelcnn with Discretized Logistic Mixture Likelihood and Other Modifications, 2017. arXiv preprint arXiv:1701.05517.
[6] A. Dertat, Applied Deep Learning—Part 3: Autoencoders, 2020, [Online] Available at: https://
towardsdatascience.com/applied-deep-learning-part-3-autoencoders-1c083af4d798. (Accessed 6 July
2020).
[7] Wikipedia Contributors, Autoencoder—Wikipedia, the Free Encyclopedia, 2020, [Online] Available
at: https://en.wikipedia.org/w/index.php?title=Autoencoder&oldid=957588290. (Accessed 6 July
2020).
[8] D.P. Kingma, M. Welling, An Introduction to Variational Autoencoders, 2019. arXiv preprint
arXiv:1906.02691.
[9] K. Frans, Variational Autoencoders Explained, 2020, [Online] Available at: http://kvfrans.com/
variational-autoencoders-explained. (Accessed 6 July 2020).
[10] J. Walker, C. Doersch, A. Gupta, M. Hebert, An uncertain future: forecasting from static images using
variational autoencoders, in: European Conference on Computer Vision, Springer, Cham, 2016,
October, pp. 835–851.
[11] G. Brunner, A. Konrad, Y. Wang, R. Wattenhofer, MIDI-VAE: Modeling Dynamics and Instrumentation of Music with Applications to Style Transfer, 2018. arXiv preprint arXiv:1809.07600.
[12] Q. Zhao, E. Adeli, N. Honnorat, T. Leng, K.M. Pohl, Variational autoencoder for regression: Application to brain aging analysis, in: International Conference on Medical Image Computing and
Computer-Assisted Intervention, Springer, Cham, 2019, October, pp. 823–831.
[13] J. An, S. Cho, Variational Autoencoder Based Anomaly Detection Using Reconstruction Probability,
2015.
[14] I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A.C. Courville, Y.
Bengio, Generative Adversarial Networks, 2014. arXiv preprint arXiv:1406.2661.
[15] CFI Education Inc, Nash Equilibrium, 2020, [Online] Available at: https://corporatefinanceinstitute.
com/resources/knowledge/economics/nash-equilibrium-game-theory/. (Accessed 6 July 2020).
[16] J. Chen, Nash Equilibrium, 2020, [Online] Available at: https://www.investopedia.com/terms/n/
nash-equilibrium.asp#:~:text=The%20Nash%20equilibrium%20in%20this,one%20prisoner’s%
20outcome%20is%20worse. (Accessed 6 July 2020).
[17] J. Hui, Gan—Why It Is So Hard to Train Generative Adversarial Networks! 2020, [Online] Available
at:
https://medium.com/@jonathan_hui/gan-why-it-is-so-hard-to-train-generative-advisorynetworks-819a86b3750b. (Accessed 6 July 2020).
[18] M. Mirza, S. Osindero, Conditional Generative Adversarial Nets, 2014. arXiv 2014. arXiv preprint
arXiv:1411.1784.
[19] P. Isola, J.Y. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017,
pp. 1125–1134.
[20] J.Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent
adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision,
2017, pp. 2223–2232.
[21] M. Liu, O. Tuzel, Coupled Generative Adversarial Networks, 2016. arXiv 2016. arXiv preprint
arXiv:1606.07536.
[22] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, R. Webb, Learning from simulated and
unsupervised images through adversarial training, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2107–2116.
[23] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, A. Courville, Adversarially Learned Inference, 2016. arXiv preprint arXiv:1606.00704.
[24] M. Pasini, 10 Lessons I Learned Training GANs for One Year, 2020, [Online] Available at: https://
towardsdatascience.com/10-lessons-i-learned-training-generative-adversarial-networks-gans-for-ayear-c9071159628. (Accessed 6 July 2020).
[25] Y. Qin, N. Mitra, P. Wonka, How Does Lipschitz Regularization Influence GAN Training?, 2018.
arXiv preprint arXiv:1811.09567.
[26] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved Techniques for
Training Gans, 2016. arXiv 2016. arXiv preprint arXiv:1606.03498.
[27] A.B. Larsen, S.K. Sønderby, H. Larochelle, O. Winther, Autoencoding Beyond Pixels Using a Learned
Similarity Metric, 2016. arXiv preprint arXiv:1512.09300.
[28] Y. Choi, M. Choi, M. Kim, J.W. Ha, S. Kim, J. Choo, Stargan: unified generative adversarial networks
for multi-domain image-to-image translation, in: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2018, pp. 8789–8797.
[29] Y. Choi, Y. Uh, J. Yoo, J.W. Ha, Stargan v2: diverse image synthesis for multiple domains, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp.
8188–8197.
[30] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J.
Totz, Z. Wang, W. Shi, Photo-realistic single image super-resolution using a generative adversarial
network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2017, pp. 4681–4690.
[31] M. Tomei, M. Cornia, L. Baraldi, R. Cucchiara, Art2real: unfolding the reality of artworks via
semantically-aware image-to-image translation, in: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2019, pp. 5849–5859.
[32] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, N. Sebe, Animating arbitrary objects via deep motion
transfer, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019,
pp. 2377–2386.
[33] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, N. Sebe, First order motion model for image animation, in: Advances in Neural Information Processing Systems, 2019, pp. 7137–7147.
[34] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, D.N. Metaxas, Stackgan: text to photorealistic image synthesis with stacked generative adversarial networks, in: Proceedings of the IEEE
International Conference on Computer Vision, 2017, pp. 5907–5915.
[35] C. Shorten, Pix2Pix, 2020, [Online] Available at: https://towardsdatascience.com/pix2pix869c17900998. (Accessed 6 July 2020).
[36] Neurohive, Pix2pix—Image-to-Image Translation Neuralnetwork, 2020, [Online] Available at:
https://neurohive.io/en/popular-networks/pix2pix-image-to-image-translation/. (Accessed 6 July
2020).
[37] J. Brownlee, A Gentle Introduction to Cyclegan for Image Translation, 2020, [Online] Available at:
https://machinelearningmastery.com/what-is-cyclegan/. (Accessed 6 July 2020).
[38] C. Dong, C.C. Loy, K. He, X. Tang, Image super-resolution using deep convolutional networks, IEEE
Trans. Pattern Anal. Mach. Intell. 38 (2016) 295–307.
[39] Wikipedia Contributors, Digital Elevation Model—Wikipedia, the Free Encyclopedia, 2020, [Online] Available at: https://en.wikipedia.org/w/index.php?title=Digital_elevation_model&oldid=962013302. (Accessed 6 July 2020).
[40] J. Thompson, J. Bell, C. Butler, Digital elevation model resolution: effects on terrain attribute calculation and quantitative soil-landscape modeling, Geoderma 100 (2001) 67–89, https://doi.org/
10.1016/S0016-7061(00)00081-1.
[41] S. Lakshmi, K. Yarrakula, Review and critical analysis on digital elevation models, Geofizika 35 (2019)
129–157, https://doi.org/10.15233/gfz.2018.35.7.
[42] B.Z. Demiray, M. Sit, I. Demir, D-SRGAN: DEM Super-Resolution With Generative Adversarial
Network, 2020. arXiv preprint arXiv:2004.04788.
[43] Wikipedia Contributors, Computer Animation—Wikipedia, the Free Encyclopedia, 2020, [Online]
Available at: https://en.wikipedia.org/w/index.php?title=Computer_animation&oldid=966153819.
(Accessed 6 July 2020).
[44] Wikipedia Contributors, Computer-Generated Imagery—Wikipedia, the Free Encyclopedia, 2020,
[Online] Available at: https://en.wikipedia.org/w/index.php?title=Computer-generated_imagery&oldid=962950245. (Accessed 6 July 2020).
[45] O. Wiles, A. Sophia Koepke, A. Zisserman, X2face: a network for controlling face generation using
images, audio, and pose codes, in: Proceedings of the European Conference on Computer Vision
(ECCV), 2018, pp. 670–686.
[46] A. Bansal, S. Ma, D. Ramanan, Y. Sheikh, Recycle-gan: unsupervised video retargeting, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 119–135.
CHAPTER 11
Generative adversarial networks
for histopathology staining
Aashutosh Ganesh (a, b) and Koshy George (a, c, d)
a. PES Center for Intelligent Systems, PES University, Bangalore, India
b. Radboud University, Nijmegen, The Netherlands
c. Department of Electronics and Communication Engineering, PES University, Bangalore, India
d. SRM University—AP, Guntur District, Andhra Pradesh, India
11.1 Introduction
Generative adversarial networks (GANs), a type of deep learning proposed in Ref. [1],
consist of two networks, the generator and the discriminator. The former belongs to the class
of generative or forward models, which depends on unsupervised learning to determine
the distribution of the training data, and the latter belongs to the class of discriminative or
backward models that ascertains the decision boundaries via supervised learning [2].
(Generative modeling and some applications are treated in Ref. [3]. Some recent books
on GANs are Refs. [4, 5].) While generative methods model class-conditional distributions and prior probabilities, discriminative methods estimate posterior probabilities
without explicitly modeling the probability distributions. Note that the discriminator
is a classifier. In the context of GANs, it attempts to distinguish between real data and
the data created by the generator. Although it is possible to arrive at several generative-discriminative pairs, what makes GANs unique is that the generator and the discriminator
are pitted against each other in a two-player game seeking to find the Nash equilibrium [6].
Several discriminators have been proposed that successfully map a high-dimensional
input to a class label [7–9]. This has been made possible due to the back-propagation algorithm and the use of piecewise linear activation functions with well-behaved gradients
[10–12]. Dropout algorithms [13–15] have also contributed to this success. A GAN
essentially is a procedure for creating generators that mitigates the difficulties faced in creating meaningful generator-discriminator models. These obstacles include the need to approximate intractable probabilistic computations and the difficulty of leveraging piecewise linear activation functions in a generative setting.
GANs are similar to variational autoencoders (VAE) [16, 17] in that both approaches
are used to determine the distribution of data using unsupervised learning. Accordingly,
both have two networks. While the decoder is generative, the encoder is a recognition
model. Such an approach leads to an intractable distribution which then has to be
approximated by another tractable distribution, followed by variational inference. Because GANs avoid this explicit approximation, they often produce sharper samples than VAEs.
Suppose that G and D are the two feed-forward neural networks, respectively, representing the generative and discriminative networks. In a GAN, G and D simultaneously participate in the following two-player game:
\[ \min_G \max_D \;\Big\{ \mathbb{E}_x\big[\log D(x)\big] + \mathbb{E}_z\big[\log\big(1 - D(G(z))\big)\big] \Big\} \tag{11.1} \]
An input prior p(z) is first defined and then mapped to the space G(z). While the discriminator maximizes the probability of assigning the correct label to training examples (real data) and minimizes the probability assigned to the samples from G, the generator G is trained to maximize the probability assigned by the discriminator to those samples it generates. Thus, G attempts to capture the data distribution and D estimates the probability that a sample came from the training data rather than from G.
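The implementation discussed later in this chapter is written in TensorFlow [53]. As a loose illustration of how the two-player game in Eq. (11.1) is usually expressed in code, the objective can be split into a discriminator loss and a (non-saturating) generator loss; the function names below are illustrative and are not taken from the chapter's code.

```python
import tensorflow as tf

# Binary cross-entropy on raw logits; label 1 means "real", 0 means "generated".
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_logits, fake_logits):
    # D maximizes E[log D(x)] + E[log(1 - D(G(z)))]; equivalently, it minimizes
    # the cross-entropy with label 1 for real samples and 0 for generated ones.
    return (bce(tf.ones_like(real_logits), real_logits) +
            bce(tf.zeros_like(fake_logits), fake_logits))

def generator_loss(fake_logits):
    # Non-saturating form: G maximizes E[log D(G(z))], i.e., it minimizes the
    # cross-entropy of its samples against the "real" label.
    return bce(tf.ones_like(fake_logits), fake_logits)
```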
The applications of GANs have been quite varied, and include image segmentation,
text-to-image synthesis, and high-resolution image generation [18–20]. (A recent survey
of applications is available in Ref. [21].) In particular, GANs have been found useful in
medical imaging; see Refs. [22–25]. Deep learning and GANs have helped in the automation of diagnostics of diseases such as breast cancer and gastrointestinal disease, segmentation of nuclei, image reconstruction, and image translation of X-ray image to
CT scans [26–31]. This has been made possible largely due to strides in computing
power, storage capacity, and image capture techniques [32].
Histology and histopathology are the careful study of microscopic tissues. Histopathology is important for diagnosis and is considered a gold standard; for example, it is
required for cancer diagnosis, where the microscopic tissue is analyzed by a pathologist.
A fundamental step in histopathology is staining, in which chemical reactions induced in the tissue under analysis accentuate features that help in diagnosis. The stains range from the commonly used hematoxylin and eosin (H&E) stain (devised independently by Wissowzky in 1876 and Busch in 1877 [33]) to the relatively rare Grocott-Gömöri methenamine silver (GMS) stain proposed by Gömöri in 1946 [34]. Different stains affect the tissue on a slide distinctively, thereby highlighting particular features for the pathologist [35]. Whenever required, developing diversely stained histopathological slides of the same tissue sample is a parallel, laborious, and time-consuming process. Moreover, it is subject to human error. Evidently, histology staining and histological analysis are cumbersome processes, where automation can be beneficial to
diagnosticians.
With recent developments in deep learning, accelerated computing, and storage, histological image analysis has had some transformative changes. In Ref. [36], breast cancer
classification has an accuracy of over 98.4% with a recurrent patch-based convolutional
neural network (CNN). GANs have showcased their usefulness in histopathology: stain
normalization is introduced in Ref. [37], InfoGAN [38] and WGAN [39] are used for
feature extraction in Ref. [40], and synthetic histopathology image generation is discussed in Ref. [41]. The process of histology staining has also shown some scope for automation, as illustrated in Refs. [42, 43], where histopathology staining is achieved through
style transfer [44] or through residual GANs [45].
From a machine learning perspective, each stain results in a different feature space, and
a transformative network has the ability to transform one space to another. The latter is a
classic image-to-image translation problem. Thus, we can frame the problem of transforming one stained tissue to another as an image-to-image translation problem [46].
In this chapter, we consider the problem of transformation of a feature space corresponding to one stain to another posed as an image-to-image translation problem, and present a
solution based on GANs.
Specifically, the use case and limits of the image-to-image translation utilizing GANs
are demonstrated here using the images from the Automatic Nonrigid Histological Image
Registration (the ANHIR) challenge dataset [47–52]. (ANHIR challenge was part of the
IEEE International Symposium on Biomedical Imaging [ISBI] in 2019, where the call
was to register tissues across different samples for large images.) Histology staining in this
chapter is framed as a domain adaptation problem for each stain. The dataset consists of
various tissues with different types of stains per tissue. This challenge requires the tissues
to be registered for a given pair of input and target images. Registration is an important
task in medical imaging as it allows diagnosticians to extract more information from one
image than they typically do from single samples. Histology registration postalignment
allows the viewer to see the information from multiple stains on the same sample.
However, since these samples of differently stained tissues are not readily available,
there is a potential application of converting one tissue stain to another stain type. As
mentioned earlier, GANs have proved effective in various image generation tasks such
as segmentation and synthetic data generation. The ANHIR dataset is utilized here to
demonstrate this domain adaptation problem wherein a stained histology image leads
to a histology image with a different stain. In this chapter, we discuss the details of
the implementation. Specifically, the preparation of the dataset and the methodology
to solve an image-to-image translation problem are discussed. Moreover, the efficacy
of GANs when the number of available images is relatively small is showcased and we
illustrate some techniques that yield better performance. The results of this chapter
are based on an implementation written primarily in Python, specifically using TensorFlow [53]. Due to constraints on available datasets, it must be emphasized that the
suggested methodology may not be completely clinically viable as yet.
This chapter is organized as follows. GANs are presented in Section 11.2. In this
section, we present the vanilla GAN and other variations relevant in our context,
the objective functions considered for optimization, and the image-quality metrics.
The image-to-image transformation problem is discussed in Section 11.3. Histology is
outlined in Section 11.4. The networks and the dataset used in this chapter are described
in Section 11.5, and the results are presented and discussed in Section 11.6, followed by conclusions in Section 11.7.
11.2 Generative adversarial networks
The vanilla GAN introduced in Ref. [1] ensured that the generator-discriminator pair
played a two-player min-max game seeking the Nash equilibrium (a saddle point), as
described in Eq. (11.1). Both D(G(z)) and D(x) represent probabilities. A trained discriminator D is such that it maximizes the probability D(x) for an image x that belongs to the input distribution. From a prior distribution p(z), a sample z is input to G. The resulting output G(z) of an untrained generator evidently does not belong to the input distribution and is hence rightly classified as a fake image by the trained discriminator; that is, the value of D(G(z)) is nowhere near unity. The generator G is trained so that the probability D(G(z)) is maximized. Essentially, D tries to reject these images as fake while
G attempts to fool D into thinking they are real. Eventually, G learns sufficiently to generate samples that correspond to the distribution of the real data.
11.2.1 Improvements to vanilla GAN
Although the vanilla GAN is comparatively better than its contemporaries, a number of
issues can affect its performance. Some of the issues are as follows: First, it is likely that
the generator network does not improve as fast as the discriminator network causing the
former to output less than ideal images. Second, the generator produces samples from a
limited class; this issue is called mode collapse. Third, the networks are trained using the
back-propagation algorithm and hence require the computation of gradients. The problems of unstable and vanishing gradients associated with the Kullback-Leibler (KL) [54] and Jensen-Shannon (JS) [55] divergences have been well reported. Hyperparameter optimization is essential to strike a balance between the generator and discriminator, wherein one does not improve at a rate with which the other cannot keep up. Several improvements to the architecture, batch training, and techniques to avoid mode collapse have
been proposed [56]. In this chapter, we suggest the required improvements for the
image-to-image translation problem.
11.2.2 Deep convolutional GANs
Introduced by Radford et al. [57], the deep convolutional GAN (DCGAN) showcased a
marked improvement in generating natural images from multiple modalities of real-world data. The DCGAN adopted the following steps to improve the efficacy: strided
convolutions as opposed to downsampling through max pooling [58]; batch
normalization [59] to ensure zero mean and unit variance; the use of Leaky ReLU activation function for the discriminator; and the use of inception score to evaluate the efficacy. In particular, DCGAN improved the generation of images from the ImageNet,
faces and CIFAR10 datasets [56, 57, 60]. However, for the specific application in medical
imaging, there is some more room for improvement.
11.2.3 Variations in optimization functions
As implementations of GANs are susceptible to issues in the gradients and the quality
of generated images, better results may be obtained by varying the performance objective. These functions are used during the training of GANs. Some possibilities are listed
here:
1. The L2 loss function was introduced in Ref. [61] to overcome the problem of
vanishing gradients especially for those fake samples sufficiently far from real data
but classified correctly. Since the least squares loss function penalizes such samples,
the least-squares GAN (LSGAN) attempts to generate samples closer to the real data.
The suggested objective functions for the 0–1 coding scheme are as follows:
\[ J_D = \mathbb{E}_x\big[(D(x) - 1)^2\big] + \mathbb{E}_z\big[(D(G(z)))^2\big] \tag{11.2} \]
\[ J_G = \mathbb{E}_z\big[(D(G(z)) - 1)^2\big] \tag{11.3} \]
2. In order to minimize the pixel-wise distance between the generated and target
images, the mean square error (MSE) [42] is often used in the objective function.
Let x be the input image, x̂ the output image of the generator, and y the target image. Then
\[ J_D = \mathbb{E}\big[\log(1 - D(\hat{x}))\big] + \mathbb{E}\big[\log D(y)\big] \tag{11.4} \]
\[ J_G = -\mathbb{E}\big[\log D(\hat{x})\big] + \lambda\, E_{\mathrm{MSE}}(\hat{x}, y) \tag{11.5} \]
where λ is a regularization parameter and
\[ E_{\mathrm{MSE}}(A, B) = \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}(a_{ij} - b_{ij})^2 \tag{11.6} \]
Here, A and B are two images of dimensions N × M pixels, and aij and bij, respectively, are the values of the pixels in the (i, j)th position in images A and B.
3. Instead of MSE, some researchers prefer the mean absolute error (MAE) [62]. This also penalizes the pixel-wise distance between the target and output images.
\[ J_D = \mathbb{E}\big[\log(1 - D(\hat{x}))\big] + \mathbb{E}\big[\log D(y)\big] \tag{11.7} \]
\[ J_G = -\mathbb{E}\big[\log D(\hat{x})\big] + \lambda\, E_{\mathrm{MAE}}(\hat{x}, y) \tag{11.8} \]
where λ is a regularization parameter and
\[ E_{\mathrm{MAE}}(A, B) = \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}\lvert a_{ij} - b_{ij}\rvert \tag{11.9} \]
The other quantities are as defined earlier. A code sketch of these two loss variants is given after this list.
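As a rough sketch (not the chapter's exact implementation), the loss pairs of Eqs. (11.4)–(11.9) can be written as an adversarial term plus a pixel-wise penalty. The discriminator is assumed to output raw logits, and the weight lam is an illustrative value rather than one reported in this chapter.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_logits, fake_logits):
    # Eqs. (11.4)/(11.7): target images y are labeled real (1) and the
    # generator outputs x_hat are labeled fake (0).
    return (bce(tf.ones_like(real_logits), real_logits) +
            bce(tf.zeros_like(fake_logits), fake_logits))

def generator_loss(fake_logits, x_hat, y, lam=100.0, penalty="mse"):
    # Adversarial term of Eqs. (11.5)/(11.8) plus a pixel-wise penalty:
    # E_MSE of Eq. (11.6) or E_MAE of Eq. (11.9), weighted by lambda.
    adversarial = bce(tf.ones_like(fake_logits), fake_logits)
    if penalty == "mse":
        pixel = tf.reduce_mean(tf.square(x_hat - y))
    else:  # "mae"
        pixel = tf.reduce_mean(tf.abs(x_hat - y))
    return adversarial + lam * pixel
```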
11.2.4 Image-quality metrics
We digress briefly to explore image-quality metrics to measure the efficacy of GANs.
This is due to an inherent problem in generative modeling, and to a degree unsupervised
learning. In the case of an image-to-image translation problem, a reasonable method of
measuring network performance is image-quality metrics. The following image-quality
metrics are used here to evaluate the performance of GANs in our context.
1. Suppose that A and B are two images of dimensions N × M pixels. Let aij and bij, respectively, be the values of the pixels in the (i, j)th position in images A and B. Then, the pixel-wise MSE metric [63] between the two images A and B is as defined earlier, and repeated here for convenience.
\[ E_{\mathrm{MSE}}(A, B) = \frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}(a_{ij} - b_{ij})^2 \tag{11.10} \]
Evidently, the smaller the value of MSE, the closer are the two images A and B with
reference to this measure. However, it may be noted that this metric may not correlate well with subjective analysis of quality. In our context, images A and B, respectively, correspond to the target image and the output of GAN.
2. The peak signal-to-noise ratio (PSNR) is derived from the MSE metric E_MSE(A, B). Similar to MSE, PSNR also does not correlate well with human quality assessment. This metric [63] is computed as follows:
\[ E_{\mathrm{PSNR}}(A, B) = 10 \log_{10}\!\left(\frac{255^2}{E_{\mathrm{MSE}}(A, B)}\right) \tag{11.11} \]
A and B are closer with respect to PSNR if the corresponding measure is large. Clearly, the smaller the value of MSE, the larger the value of PSNR.
3. The structural similarity index (SSI) [63] primarily deals with three aspects of similarity—
luminance, contrast, and structure. While luminance is defined as the brightness of the
image, the contrast is the difference between the luminance divided by the average luminance of the image. If A and B are two images, this metric is computed as follows:
\[ L(A, B) = \frac{2\mu_A \mu_B + c_1}{\mu_A^2 + \mu_B^2 + c_1} \tag{11.12} \]
\[ C(A, B) = \frac{2\sigma_A \sigma_B + c_2}{\sigma_A^2 + \sigma_B^2 + c_2} \tag{11.13} \]
\[ S(A, B) = \frac{\rho_{AB} + c_3}{\sigma_A \sigma_B + c_3} \tag{11.14} \]
\[ E_{\mathrm{SSI}}(A, B) = L^p(A, B)\, C^q(A, B)\, S^r(A, B) \tag{11.15} \]
In these equations, μ_A and μ_B are, respectively, the average values of the pixels in A and B, σ_A² and σ_B² the corresponding variances, and ρ_AB is the correlation coefficient between A and B. The constants c1, c2, and c3 are introduced to avoid division by zero, or near-division by zero, and the constants p, q, and r represent the relative importance of the three components. The measure SSI is considered to be related to perception in the human visual system. Its value lies in the interval [0, 1]. Clearly, a value of unity indicates that the two images are the same.
In the context of image compression, a comparison of these metrics is available in Ref.
[64] where an analytical link between PSNR and SSI has been shown for common degradations in an image including additive Gaussian noise and Gaussian blur. We note that
there are full-reference, partial-reference, and nonreference methods of image comparison. In the context of medical image comparison, full-reference methods are a stronger
indicator of network performance. The aforementioned metrics belong to this class of
methods.
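For completeness, these full-reference metrics can be computed with standard TensorFlow utilities; the sketch below uses tf.image.psnr and tf.image.ssim and assumes images scaled to [0, 1] (for 8-bit images, max_val would be 255, matching the 255² term in Eq. 11.11). The library SSIM fixes the exponents p, q, and r of Eq. (11.15) to unity.

```python
import tensorflow as tf

def image_quality_metrics(target, output, max_val=1.0):
    # target, output: tensors of shape (H, W, C) with values in [0, max_val].
    mse = tf.reduce_mean(tf.square(target - output))        # Eq. (11.10)
    psnr = tf.image.psnr(target, output, max_val=max_val)   # Eq. (11.11)
    ssim = tf.image.ssim(target, output, max_val=max_val)   # Eq. (11.15), p = q = r = 1
    return float(mse), float(psnr), float(ssim)
```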
11.3 The image-to-image translational problem
The task of image-to-image translation is one of the prominent developments in deep
learning applied to image processing. This task can be described as one that transforms
the feature space of an image to another. Applications include transforming aerial photographs to maps, removal of background from images, and colorization of black and
white images. Generative networks have been found to be rather useful for this. The goal
of the network is to learn the map between the input and target images to suitably transform the former to the latter. That is, if U is the set of input images and T is the set of
target images, the trained generator network should be such that G : U → T and the distribution of images G(U) is indistinguishable from the distribution of images in T.
CNNs have showcased remarkable results in several fields including medical image
classification. An evident drawback of this class of networks is the requirement for large
quantities of data to realize the full potential of CNNs. Moreover, some problems in
medical imaging require pixel-level classification; for example, medical image
segmentation. The U-net architecture builds upon the fully convolutional network [26]. The expansive and contracting paths are somewhat symmetric, leading to a U-shaped architecture. Some of the principal differences relative to typical CNNs are that the pooling
layers are replaced with upsampling layers and successive convolutional layers have a large
number of filters. In addition, extensive data augmentation is adopted to compensate for
smaller datasets.
We note that the U-net is essentially an autoencoder (AE). In general, AEs are networks utilized to learn encodings and they are predominantly used in unsupervised learning [65]. They consist of an encoder and decoder, the former to encode the data
distribution to a latent space or bottleneck and the latter to decode this latent space.
AEs are often used for dimensionality reduction, in the spirit of principal component analysis.
Conditional GANs (CGANs) [66] are explored in Ref. [62] to deal with the
image-to-image translational problem. Here, the pair of networks learn a conditional
generative model [1] making them suitable to tackle such problems. In Ref. [62], the
generator is a U-net architecture described earlier which allows it to encode the images
into a bottleneck. The discriminator is a convolutional PatchGAN classifier, which
allows the network to perform patch-level classification. This encourages the generator
to learn patch-level features. In contrast to regular GANs wherein the generative models
learn a mapping from the random noise vector z to an output image y = G(z), CGANs learn the mapping from the input image and the noise vector to the output image: y = G(x, z). The objective function for a CGAN can therefore be described as
\[ J_{G,D} = \mathbb{E}_{x,y}\big[\log D(x, y)\big] + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big] \tag{11.16} \]
The optimal CGAN is then the following:
\[ \arg\min_G \max_D J_{G,D} \]
This can be regularized as follows:
\[ \arg\min_G \max_D J_{G,D} + \lambda\, \mathbb{E}_{x,y,z}\big[\lVert y - G(x, z)\rVert_1\big] \tag{11.17} \]
where the L1 metric is preferred over the L2 metric to mitigate the effect of blurring [67].
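A compact sketch of Eqs. (11.16), (11.17) follows. It assumes the conditional discriminator receives the input image concatenated with either the target or the generated image along the channel axis, and that the noise z enters implicitly (for example, through dropout in the generator); these choices and the L1 weight lam are illustrative assumptions, not details taken from Ref. [62].

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def cgan_losses(generator, discriminator, x, y, lam=100.0):
    y_hat = generator(x, training=True)  # y_hat = G(x, z); z implicit (e.g., dropout)
    # Condition D on the input image x in both terms of Eq. (11.16).
    real_logits = discriminator(tf.concat([x, y], axis=-1), training=True)
    fake_logits = discriminator(tf.concat([x, y_hat], axis=-1), training=True)
    d_loss = (bce(tf.ones_like(real_logits), real_logits) +
              bce(tf.zeros_like(fake_logits), fake_logits))
    # Adversarial generator term plus the L1 regularizer of Eq. (11.17).
    g_loss = (bce(tf.ones_like(fake_logits), fake_logits) +
              lam * tf.reduce_mean(tf.abs(y - y_hat)))
    return g_loss, d_loss
```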
The discriminator that we use is the convolutional PatchGAN classifier, originally proposed in Ref. [68]. The generator with the L1 distance function accurately captures the low frequencies. Accordingly, the GAN discriminator needs only to enforce correctness at the higher frequencies. To accomplish this, it is sufficient to restrict attention to the structure in local image patches, leading to the terminology PatchGAN, which only penalizes structure at the scale of patches. Thus, the discriminator classifies whether or not each P × P patch in an image is real or fake. It is run convolutionally across the image and the responses are averaged to yield the overall decision. PatchGAN is computationally efficient in that it has fewer parameters, runs faster, and can be applied to
images of arbitrary sizes. The mathematical background is that it models the image as a
Markov random field assuming that pixels separated by more than the patch dimensions
are statistically independent. This idea has been successfully used for both texture and
style.
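A rough Keras-style sketch of such a patch-level discriminator is given below: a small stack of strided convolutions ends in a one-channel map, so each spatial position scores one receptive-field patch. The kernel sizes, strides, and the six-channel (input plus target) conditioning are illustrative assumptions rather than the exact configuration of Ref. [68].

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_patchgan(input_shape=(256, 256, 6), base_filters=64):
    inputs = layers.Input(shape=input_shape)
    x = inputs
    for filters in (base_filters, base_filters * 2, base_filters * 4):
        x = layers.Conv2D(filters, 4, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
    # One logit per patch; averaging these patch responses gives the
    # overall real/fake decision for the whole image.
    patch_logits = layers.Conv2D(1, 4, padding="same")(x)
    return tf.keras.Model(inputs, patch_logits, name="patchgan_discriminator")
```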
Thus, the end-to-end pipeline for the image-to-image translation problem uses an
autoencoder (U-net) as the generator and a PatchGAN discriminator. The latter is trained to distinguish the real images (actual data) from the fake or synthesized images. The goal of the autoencoder is to transform from one domain space to another by
utilizing a learned transformation encoded in the latent space. In contrast, the objective of
the discriminator is to distinguish between real and fake images.
11.4 Histology and medical imaging
As referenced in previous sections, histology is the examination and study of microscopic
structures present on tissues and histopathology is where this examination is used for diagnosis [69]. The overall goal of histopathological analysis is to understand and establish the
relationship between the structures present in the tissue and the affliction that the subject of analysis has.
Histopathological analysis is primarily qualitative, where certified pathologists comb
through the tissue slide images directly from the microscope or the scanned whole slide
images to determine affliction. A fundamental step in histology is staining, where the stains react with the cell structures and chemical compounds present on the slide to accentuate the features used for diagnosis. The a priori information present
on the tissue is not immediately visible, requiring a pathologist to use reagents to give a
contrast. This allows them to better evaluate the tissue.
Different stains are utilized depending on the region of interest, where different tissue
sections react differently. The entire process behind histopathology, from analysis to
diagnosis, is necessary but time consuming. Analysis of histology primarily revolves
around structures present on the extracted sample. Important precursory steps before
analysis are the biopsy of the tissue and the fixing and dyeing/staining of the tissue; the latter step
is relevant in our discussions. The dyes reveal cellular structures and counterstains are used
for contrast. The most common stain used is hematoxylin and eosin (H&E), since it is
relatively quick to stain and stains a large number of cells well. However, in some cases,
it does not provide the contrast required. In such cases, different types of stains are used
depending on the afflictions. Histopathology additionally examines the extent of the
affliction, called disease grading. It is very useful in distinguishing between different subtypes of diseases, especially in cancers. The scope for automation is evident; companies
such as Leica have introduced an automatic stainer to circumvent the process through a
physical batch stainer [70, 71].
The scope for automation is quite high for histopathology image analysis as well. With
techniques such as whole slide scanning, it has become easier to devise algorithms, which
are able to comb through the image for diagnostic purposes. While most of the past medical image analysis has centered on cytology, histopathology serves as the gold standard for
diagnostics, especially for cancers. While providing us with a rich feature space, there are
challenges in automation of analysis. For this chapter, we will primarily be examining
histopathology stains as different feature spaces.
11.4.1 Histology as different feature spaces
Since stains are reagents that interact with cells present on the tissue, they produce different features in that they accentuate structures based on the underlying chemical reactions. H&E clearly dyes structures such as cytoplasm, nuclei, organelles, and extracellular components. It aids in the diagnosis of diseases based on the organization of the tissue. The H&E chemicals are basic and acidic, respectively, operating on the cell nucleus and on the cytoplasm and cell walls, respectively. Specialized stains have been developed that deal
with sections not normally dyed by H&E. For example, Masson’s trichrome stains connective tissue where the basic structures are stained blue. Alcian Blue stains heavy proteins
such as mucin blue.
With the established fact that each stain provides different information to the diagnosticians, each of them can be considered to be a different feature space. Special stains
are sometimes used in conjunction with routine stains to extract more information from
the tissue. However, there is a scope for error in fixing and staining that sample. Additionally, these special stains require expensive chemicals. Therefore, there is a use-case for
generating the newly stained tissue through image processing and machine learning-based methods.
In summary, the stains generate unique features, and these unique features are
required for diagnosis. Moreover, conventional staining is prone to human error, especially when redyeing the tissue, and is time consuming. Further, the process is expensive.
Thus, an automated system to generate a reasonable approximation of the tissue has the
potential to save resources. Accordingly, only when required, pathologists need to resort
to conventional methods.
11.5 Network architecture and dataset
The methodology adopted in this chapter to use GANs for histology staining is described
here. The principal issue is the unavailability of a large dataset of unstained tissues. Therefore, our goal is virtual staining of a histology slide, which transforms a slide from one
feature space into another. As mentioned earlier, virtual staining has been explored
through style transfer in Ref. [43] and GANs in Ref. [42]. The fundamental difference
between Ref. [42] and this study is the manner in which the tissue samples are viewed. While Ref. [42]
utilizes autofluorescent images to generate bright-field images, we attempt here to learn a
map between two bright-field images. Additionally, we examine the transformation
between two types of stains rather than stain a sample from their unstained equivalent
image. We again emphasize that unstained images are usually not available.
We use the ANHIR dataset to showcase the efficacy of GANs for histology staining.
Specifically, we demonstrate that an image of a tissue stained with one chemical is transformed to an image of the same tissue stained with a different chemical.
11.5.1 ANHIR dataset
The automatic nonrigid histological image registration (ANHIR) dataset is designed for
image registration for large-scale images. Registration is required to align multiple tissue
sections into one representation in 3D space, with the individual images stacked upon one another. This allows a pathologist to extract more information from the tissue
samples from multiple features and biomarkers. The characteristics of the dataset are
depicted in Table 11.1. This dataset has a variety of tissues from different organs and
the tissues are stained with a diverse set of histological stains. We note that the ANHIR
dataset provides more stains than indicated in Table 11.1. (For example, images of an
H&E-stained and IHC-stained lung lesion tissue are both available. In addition, the dataset provides images with CD10, CD31, and Ki67 stains, which are not considered here.)
Moreover, the dataset also provides different levels of magnification for each tissue, ranging from 10× to 40×. We opt for higher magnification in order to obtain a larger image
dataset to achieve our objective.
A brief explanation of the stains and their effects are as follows:
• CD31 is a type of immunohistochemical (IHC) stain, which mediates cell-to-cell
adhesion. These are typically used in tonsils, skin, liver, and kidneys.
• Estrogen receptor (ER) antibody stains are proteins activated by the estrogen hormone. It is extensively used in breast carcinoma.
Table 11.1 The ANHIR dataset.
Tissue                    Stains            Magnification   Average size
Lung lesion               H&E, IHC          40×             18k × 15k
Lung lobe                 H&E, CD31         10×             11k × 6k
Mammary gland             H&E, ER, PR       40×             12k × 4k
Mice-kidney               PAS, SMA, CD31    20×             37k × 30k
Colon adenocarcinoma      H&E, IHC, CD      10×             60k × 50k
Gastric adenocarcinoma    H&E, IHC          40×             60k × 75k
Breast tissue             H&E, IHC, PR      40×             65k × 60k
Human kidney              H&E, PAS, MAS     40×             18k × 55k
Table 11.2 Subset of ANHIR dataset.
Tissue        Stains       Scaling   Number of images
Lung lesion   H&E, CD31    50        1250
Kidney        H&E, MAS     25        368
Lung lobe     H&E, CD31    25        151
• Progesterone receptor (PR) antibody stains are used for breast cancer detection and are
typically used in conjunction with ERs.
• Periodic acid-Schiff (PAS) stains are used for detecting the presence of carbohydrates
in tissues such as connective tissues, mucus, etc. These are used in diagnostics in diseases such as glycogen storage disease.
• Masson’s trichrome (MAS) stain is a three-color staining procedure. It produces red for
keratin and muscle fibers, blue and green for collagen, light red or pink for cytoplasms,
and dark brown to black for cell nuclei. This stain is typically used in cardiac, kidney,
muscular, and hepatic pathologies.
• Smooth muscle actin (SMA) is a special type of stain, which is typically used in specialized cancer diagnostics.
Further details about these stains can be found in Refs. [72, 73]. As indicated in the table,
the available magnifications and the average size of the images vary. For the purpose of
this chapter, we choose a subset of the ANHIR dataset as indicated in Table 11.2. The
number of available images is also indicated.
11.5.2 Dataset preparation
The tissues and stains considered here are listed in Table 11.2. The whole slide scan of the
histological image is first divided into smaller sections of size 256 × 256. The pixel values in each image are scaled to lie within the range [−1, 1]. This increases the numerical stability during training. (Moreover, experience suggests that working in floating-point arithmetic is preferable.)
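A minimal sketch of this preparation step is shown below; the function name and the assumption of an 8-bit RGB whole-slide array are illustrative.

```python
import numpy as np

def tile_and_scale(slide, tile=256):
    """Cut an RGB whole-slide image (H, W, 3) of uint8 pixels into
    non-overlapping tile x tile patches scaled to [-1, 1]."""
    h, w = slide.shape[:2]
    patches = []
    for i in range(0, h - tile + 1, tile):
        for j in range(0, w - tile + 1, tile):
            patch = slide[i:i + tile, j:j + tile].astype(np.float32)
            patches.append(patch / 127.5 - 1.0)  # maps [0, 255] to [-1, 1]
    return np.stack(patches)
```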
11.5.3 Network architectures
The architectures for the generator and discriminator have been arrived at after several
experiments. The networks are based on U-net for the generator network and the PatchGAN architecture for the discriminator [62]. This architecture maps closely with
CGANs. The generator architecture of the network includes convolutional layers with
skip connections. The numbers of filters are 64, 128, 256, 512, and 1024. The activation function for all but the final layer is the ReLU [74], and the final layer uses the sigmoidal tanh activation function. The discriminator network features batch normalization and the leaky ReLU activation function. The other details of the architectures are provided in the “Appendix” section.
We consider here the performance objectives mean absolute error (MAE) and the
root mean square error (RMSE). The networks were trained using the Adam optimizer
[75], at learning rates of 10⁻⁵ and 10⁻⁴ for the discriminator and the generator networks, respectively. Batch sizes of 5, 10, and 15 were used in different trials. Shuffling was utilized, and the networks were trained for 201 epochs with the number of input images as indicated in Table 11.2.
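These settings translate directly into TensorFlow; the β and ε values below correspond to the Adam parameters quoted in Section 11.6, and the commented data pipeline is only indicative.

```python
import tensorflow as tf

# Learning rates from the text: 1e-5 (discriminator) and 1e-4 (generator).
d_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-5,
                                       beta_1=0.9, beta_2=0.999, epsilon=1e-7)
g_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4,
                                       beta_1=0.9, beta_2=0.999, epsilon=1e-7)

BATCH_SIZE = 10   # trials used batch sizes of 5, 10, and 15
EPOCHS = 201

# Indicative input pipeline over paired (input, target) tiles:
# dataset = (tf.data.Dataset.from_tensor_slices((input_tiles, target_tiles))
#            .shuffle(1024)
#            .batch(BATCH_SIZE))
```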
The approach to evaluate the efficacy of the proposed GAN should include both
qualitative (visual inspection) and quantitative image metrics. Both are required as the
latter alone may not suffice in that good performance with respect to quantitative metrics
may not indicate clinical viability. We emphasize here that the output images must eventually be validated by a diagnostician.
11.6 Results and discussions
As indicated earlier, the subset of ANHIR dataset considered here is listed in Table 11.2.
Sample input and target images for the three tissues are shown in Fig. 11.1A–F. Here, an
image of a lung lesion tissue stained with H&E is shown in Fig. 11.1A. This is the input
image to the proposed GAN. The target output image is shown in Fig. 11.1B, which is
the image of the same tissue but stained with CD31. Likewise, images of a kidney tissue
stained with H&E and MAS stains are, respectively, shown in Fig. 11.1C and D, and
images of a lung lobe tissue stained with H&E and CD31 stains are, respectively, shown
in Fig. 11.1E and F.
As mentioned earlier, the images belonging to a particular tissue are input to the proposed GAN architecture described in the previous section and “Appendix” section. The
performance of GAN depends on the choice of objective functions. In this chapter, we
consider two objective functions. In what follows, a GAN trained with the MSE objective function described in Eqs. (11.4), (11.5) is referred to as GANMSE, and a GAN
trained with the MAE objective function described in Eqs. (11.7), (11.8) is denoted
GANMAE. The output of the generative network should resemble that of the corresponding target images. The closeness of the images is quantified using the metrics
MSE, PSNR, and the SSI. (We note that due to a lack of resources, validation by a
pathologist was not possible.)
Sample sets of input, target, and output images in the case of lung lesion tissue are
shown in Fig. 11.2, which presents the results with the proposed GANMSE and GANMAE. We note that an Adam optimizer has been used during the training process. (The chosen parameters are β1 = 0.9, β2 = 0.999, and ε = 10⁻⁷.) From Fig. 11.2, it is clear that there
is a close approximation between the target and output images, respectively, shown in
Fig. 11.2B and C with mildly blurry results. In contrast, it is evident from Fig. 11.2E
and F that GANMAE generates deep blue images, which is less than ideal for our
Fig. 11.1 Lung lesion tissue: (A) Input image with H&E stain. (B) Target image with CD31 stain. Kidney
tissue: (C) Input image with H&E stain. (D) Target image with MAS stain. Lung lobe tissue: (E) Input
image with H&E stain. (F) Target image with CD31 stain.
application. Thus, a closer comparison between target and output images clearly indicates
that the MSE loss function provides better results.
These observations are further validated using the image-quality metrics. Indeed,
with GANMSE the respective averaged values of SSI, PSNR, and MSE are 0.8455,
27.365, and 0.0650, and the corresponding values for GANMAE are 0.5938, 9.0791,
and 0.1678. (These values are also depicted in Table 11.3.) All these measures indicate
Fig. 11.2 Lung lesion tissue. With GANMSE: (A) Input image; (B) target image; (C) output image. With
GANMAE: (D) Input image; (E) target image; (F) output image.
Table 11.3 Averaged image metric values.
Tissue        Type     SSI      PSNR      MSE
Lung lesion   GANMSE   0.8455   27.365    0.0650
              GANMAE   0.5938   9.0791    0.1678
Kidney        GANMSE   0.3757   12.6759   0.0052
              GANMAE   0.4038   12.0887   0.0739
Lung lobe     GANMSE   0.7213   21.0530   0.0046
              GANMAE   0.6931   21.1157   0.0140
that the output of the generator and the target image are reasonably close. However,
GANMSE is performing better on the test images. Evidently, if an MSE loss function
is used during training, the MSE image-quality metric yields a smaller value.
In addition to these observations, it is evident from Fig. 11.2B and C that some of the
finer details have not been well captured by the network. Equivalently, the generative
network has not completely learned the transformation from one feature space to
another. The primary reason for this is the smaller size of the dataset. It may be recalled
that only 1250 images were available.
From the observations with reference to the lung lesion tissue, it is evident that the
higher-level features of the images such as the overall structure are captured completely.
However, the generated images can be blurry and there are marked variations between the images with GANMSE and GANMAE. Therefore, our approach to histology staining, posed as an image-to-image translation problem, is promising. Moreover, the quantitative measures SSIM and PSNR may not be the ideal metrics in our context. Despite the scores, the images are slightly blurry as observed in Fig. 11.2C. Further, artefacts are introduced if the number of epochs is fewer than 100; these eventually disappear with
additional training. As will be evident, these remarks hold for both kidney and lung lobe
tissues. Some issues with these tissue images are exacerbated due to the smaller size of the
datasets.
The results with kidney tissues are shown in Fig. 11.3 for both GANMSE and
GANMAE. Similarly, the results with lung lobe tissues are shown in Fig. 11.4 for both
Fig. 11.3 Kidney tissue. With GANMSE: (A) Input image; (B) target image; (C) output image. With
GANMAE: (D) Input image; (E) target image; (F) output image.
Fig. 11.4 Lung lobe tissue. With GANMSE: (A) Input image; (B) target image; (C) output image. With
GANMAE: (D) Input image; (E) target image; (F) output image.
variants of GANs considered here. From Table 11.3, the differences between the two
GANs are less pronounced for both kidney and lung lobe tissues, with the performances
of GANMSE better than the other. (There is a negligible discrepancy in these observations
with respect to PSNR for the lung lobe tissue.) Nonetheless, minor visual differences can
be observed in the kidney tissue even with GANMSE. This is due to the relatively higher
complexity of the kidney tissue when compared with the lung lesion or lung lobe tissues.
An additional issue is the relatively smaller number of images in the dataset. At present we
have only 368 images corresponding to the kidney tissue, which is rather small considering the complexity.
In contrast, even with a much lower dataset size of 151, the correlation between the
output and target images for a lung lobe tissue is reasonably good for both GANMSE and
GANMAE with the former outperforming the latter in both quantitative measures as
depicted in Table 11.3, as well as visually. This is clearly due to the fact that these tissues
are much less complex. An interesting point to highlight is that despite the dataset being
smaller in size for the lung lobe tissue, the transformation is significantly better. This is
due to the fact that these images are far less complex when compared to the images corresponding to the kidney tissue.
Evidently, the network performance varies strongly with the complexity
of the tissue. The network showcases reasonably good performances in tissues such as
lung lesions and lung lobes. However, the performance decreases with complex tissue
samples such as kidneys. Accordingly, there is a requirement for a larger distribution
of tissue for the generated samples to be modeled with a higher degree of accuracy.
An important point to note is the imperfections present due to the image capture of
the histology and due to human error while fixing the histology. The imperfections present on the target image in Fig. 11.5A, where the tissue is ripped and stringy, are reflected in
the output image Fig. 11.5B generated with GANMAE. This emphasizes the fact that the
images used for training the GAN are to be chosen carefully, and requires domain knowledge from a pathologist.
To summarize, these results showcase the image-to-image transformation from an
input image corresponding to one stain to another image corresponding to a different
stain. When the number of samples in the dataset is large, the quality of the images is
quite satisfactory. In some cases, however, the results are slightly blurry, which may
be addressed if the dataset is larger. Variations in the objective function clearly have
an effect on the output image. The images that resulted with an MSE objective function
are better than those obtained with an MAE objective function. Due to the complexity of
the image, the output images corresponding to the kidney tissue are rather diffuse as the
intricate features have not been captured completely correctly. The finer details corresponding to the lung lobes have been well captured. As far as the metrics are concerned,
there is a strong indication that both SSI and PSNR do not indicate visual quality of the
images. Moreover, the imperfections present on the target tissue appear to be transferred
as well to the output. This can prove problematic as the imperfections may be due to
Fig. 11.5 Errors in lung lobe tissue: (A) Target with imperfections; (B) output image with distortions.
human errors. Further, artefacts are sometimes introduced at lower epochs. However,
with sufficient training, these artefacts are eventually removed.
11.7 Conclusions
GANs in the past have been used for image translation tasks. In our medical image translation task, we utilize the conditional GAN algorithm to generate differently stained
images from paired input images. The results have been demonstrated to be satisfactory.
If the size of the dataset is suitably large, the results are better as it captures both high-level
and low-level features. Unfortunately, the imperfections in the images are captured as
well by the network. Therefore, the training dataset ought to be carefully chosen. Additionally, histology staining is dependent on the presence of a particular compound/tissue
on the slide. Accordingly, the network needs to have a large distribution of histology
stains in different circumstances/stages. This is important because the network must see enough settings to learn the transformation well enough to be clinically viable.
While the PSNR and the SSIM are popular image-quality metrics, a potential quality
metric that could be utilized for future evaluation is the relative target registration error.
Here, the landmarks can give the networks a better indication of whether the networks
are performing better or worse. However, such a study requires hand labeling by experts.
We have showcased the use of CGANs for the image-to-image translation between
two stained tissues. Therefore, it has the potential for transforming images of unstained
tissues to stained ones given enough data with different settings. However, the issues
highlighted earlier are to be addressed before a more complete end-to-end system is available that can transform an unstained tissue to a stained one. Finally, the networks should
also learn a general mapping of how to transform stained tissues. Since the images are
paired from one image to another, the network might learn a map where it transforms
the input image to the target image, rather than altering the stain. This issue is a known
problem when using a one-to-one mapping between inputs and targets. Thus future
work in this space should explore unpaired translation of such images to make an effective
end-to-end staining algorithm.
Appendix: Network architectures
The generator and discriminator networks are described here. The generator network
consists of 38 layers with several sets of convolutional and max pooling layers, and is listed in Tables 11.4 and 11.5. Each layer is characterized by a triplet (n1, n2, n3), which indicates an image of dimensions n1 × n2 with n3 the number of filters. There are four skip connections. Specifically, the outputs of Layers 3, 6, 9, and 13 are also inputs to Layers 35,
30, 25, and 20, respectively. The 12 layers of the discriminator network are listed in
Table 11.6.
Table 11.4 Generator architecture: Part A.
No.   Layer           Input             Output            Remark
1     Input layer     (256, 256, 3)     (256, 256, 3)
2     Conv2D          (256, 256, 3)     (256, 256, 64)
3     Conv2D          (256, 256, 64)    (256, 256, 64)    To Layer 35
4     MaxPooling2D    (256, 256, 64)    (128, 128, 64)
5     Conv2D          (128, 128, 64)    (128, 128, 128)
6     Conv2D          (128, 128, 128)   (128, 128, 128)   To Layer 30
7     MaxPooling2D    (128, 128, 128)   (64, 64, 128)
8     Conv2D          (64, 64, 128)     (64, 64, 256)
9     Conv2D          (64, 64, 256)     (64, 64, 256)     To Layer 25
10    MaxPooling2D    (64, 64, 256)     (32, 32, 256)
11    Conv2D          (32, 32, 256)     (32, 32, 512)
12    Conv2D          (32, 32, 512)     (32, 32, 512)
13    Dropout         (32, 32, 512)     (32, 32, 512)     To Layer 20
14    MaxPooling2D    (32, 32, 512)     (16, 16, 512)
15    Conv2D          (16, 16, 512)     (16, 16, 1024)
16    Conv2D          (16, 16, 1024)    (16, 16, 1024)
17    Dropout         (16, 16, 1024)    (16, 16, 1024)
18    Upsampling2D    (16, 16, 1024)    (32, 32, 1024)
19    Conv2D          (32, 32, 1024)    (32, 32, 512)
Table 11.5 Generator architecture: Part B.
No. | Layer | Input | Output | Remark
20 | Concatenate | (32, 32, 512), (32, 32, 512) | (32, 32, 1024) | From Layer 13
21 | Conv2D | (32, 32, 1024) | (32, 32, 512) | -
22 | Conv2D | (32, 32, 512) | (32, 32, 512) | -
23 | Upsampling2D | (32, 32, 512) | (64, 64, 512) | -
24 | Conv2D | (64, 64, 512) | (64, 64, 256) | -
25 | Concatenate | (64, 64, 256), (64, 64, 256) | (64, 64, 512) | From Layer 9
26 | Conv2D | (64, 64, 512) | (64, 64, 256) | -
27 | Conv2D | (64, 64, 256) | (64, 64, 256) | -
28 | Upsampling2D | (64, 64, 256) | (128, 128, 256) | -
29 | Conv2D | (128, 128, 256) | (128, 128, 128) | -
30 | Concatenate | (128, 128, 128), (128, 128, 128) | (128, 128, 256) | From Layer 6
31 | Conv2D | (128, 128, 256) | (128, 128, 128) | -
32 | Conv2D | (128, 128, 128) | (128, 128, 128) | -
33 | Upsampling2D | (128, 128, 128) | (256, 256, 128) | -
34 | Conv2D | (256, 256, 128) | (256, 256, 64) | -
35 | Concatenate | (256, 256, 64), (256, 256, 64) | (256, 256, 128) | From Layer 3
36 | Conv2D | (256, 256, 128) | (256, 256, 64) | -
37 | Conv2D | (256, 256, 64) | (256, 256, 64) | -
38 | Conv2D | (256, 256, 64) | (256, 256, 3) | -
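For readers who wish to reproduce the layer sequence in Tables 11.4 and 11.5, a minimal tf.keras sketch of such a U-Net-style generator is given below. The kernel sizes, activations, padding, and dropout rates are not specified in the tables, so 3 × 3 convolutions with ReLU activation, 'same' padding, and a dropout rate of 0.5 are assumed here; this is an illustrative reconstruction, not the authors' code.

```python
# Minimal sketch of the generator in Tables 11.4 and 11.5 (tf.keras).
# Assumed (not given in the tables): 3x3 kernels, ReLU, 'same' padding, dropout rate 0.5.
import tensorflow as tf
from tensorflow.keras import layers

def conv(x, filters):
    # 3x3 convolution that preserves the spatial size.
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def build_generator(input_shape=(256, 256, 3)):
    inp = layers.Input(shape=input_shape)               # Layer 1
    s1 = conv(conv(inp, 64), 64)                        # Layers 2-3 (skip to Layer 35)
    x = layers.MaxPooling2D()(s1)                       # Layer 4
    s2 = conv(conv(x, 128), 128)                        # Layers 5-6 (skip to Layer 30)
    x = layers.MaxPooling2D()(s2)                       # Layer 7
    s3 = conv(conv(x, 256), 256)                        # Layers 8-9 (skip to Layer 25)
    x = layers.MaxPooling2D()(s3)                       # Layer 10
    s4 = layers.Dropout(0.5)(conv(conv(x, 512), 512))   # Layers 11-13 (skip to Layer 20)
    x = layers.MaxPooling2D()(s4)                       # Layer 14
    x = layers.Dropout(0.5)(conv(conv(x, 1024), 1024))  # Layers 15-17
    # Decoder: upsample, halve the filters, concatenate the skip, then two convolutions.
    for skip, filters in [(s4, 512), (s3, 256), (s2, 128), (s1, 64)]:
        x = conv(layers.UpSampling2D()(x), filters)     # Layers 18-19, 23-24, 28-29, 33-34
        x = layers.Concatenate()([skip, x])             # Layers 20, 25, 30, 35
        x = conv(conv(x, filters), filters)             # Layers 21-22, 26-27, 31-32, 36-37
    out = layers.Conv2D(3, 3, padding="same")(x)        # Layer 38: back to (256, 256, 3)
    return tf.keras.Model(inp, out, name="generator")

build_generator().summary()  # reproduces the shape sequence of Tables 11.4 and 11.5
```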
Table 11.6 Discriminator architecture.
No. | Layer | Input | Output
1 | Input layer | (256, 256, 3) | (256, 256, 3)
2 | Conv2D | (256, 256, 3) | (128, 128, 64)
3 | Leaky ReLU | (128, 128, 64) | (128, 128, 64)
4 | Dropout | (128, 128, 64) | (128, 128, 64)
5 | Conv2D | (128, 128, 64) | (64, 64, 128)
6 | Leaky ReLU | (64, 64, 128) | (64, 64, 128)
7 | Dropout | (64, 64, 128) | (64, 64, 128)
8 | Conv2D | (64, 64, 128) | (32, 32, 256)
9 | Leaky ReLU | (32, 32, 256) | (32, 32, 256)
10 | Dropout | (32, 32, 256) | (32, 32, 256)
11 | Flatten | (32, 32, 256) | (262144)
12 | Dense | (262144) | (1)
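A corresponding sketch of the 12-layer discriminator in Table 11.6 follows. The table implies that every Conv2D halves the spatial resolution, so strided convolutions are assumed; the 4 × 4 kernel size, LeakyReLU slope, dropout rate, and the sigmoid on the final dense unit are assumptions rather than values taken from the chapter.

```python
# Minimal sketch of the discriminator in Table 11.6 (tf.keras).
# Assumed (not given in the table): 4x4 kernels with stride 2, LeakyReLU(0.2),
# dropout rate 0.25, and a sigmoid activation on the single output unit.
import tensorflow as tf
from tensorflow.keras import layers

def build_discriminator(input_shape=(256, 256, 3)):
    inp = layers.Input(shape=input_shape)                           # Layer 1
    x = inp
    for filters in (64, 128, 256):                                  # Layers 2-10: Conv/LeakyReLU/Dropout blocks
        x = layers.Conv2D(filters, 4, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
        x = layers.Dropout(0.25)(x)
    x = layers.Flatten()(x)                                         # Layer 11: 32*32*256 = 262,144 units
    out = layers.Dense(1, activation="sigmoid")(x)                  # Layer 12: real/fake score
    return tf.keras.Model(inp, out, name="discriminator")

build_discriminator().summary()
```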
CHAPTER 12
Analysis of false data detection rate
in generative adversarial networks
using recurrent neural network
A. Sampath Kumar(a), Leta Tesfaye Jule(b,c), Krishnaraj Ramaswamy(c,d), S. Sountharrajan(e), N. Yuuvaraj(f), and Amir H. Gandomi(g)
(a) Department of Computer Science and Engineering, Dambi Dollo University, Dambi Dollo, Ethiopia
(b) Department of Physics, College of Natural and Computational Science, Dambi Dollo University, Dambi Dollo, Ethiopia
(c) Centre for Excellence in Indigenous Knowledge, Innovative Technology Transfer and Entrepreneurship, Dambi Dollo University, Dambi Dollo, Ethiopia
(d) Department of Mechanical Engineering, Dambi Dollo University, Dambi Dollo, Ethiopia
(e) School of Computing Science and Engineering, VIT Bhopal University, Bhopal, India
(f) Research and Development, ICT Academy, Chennai, India
(g) University of Technology Sydney, Ultimo, NSW, Australia
12.1 Introduction
Recently, generative adversarial networks (GANs) have emerged as a powerful class of generative models that operate as a joint optimization of two neural networks with contrasting goals. In recent years, the generative adversarial network (GAN) [1–10] has emerged as a viable solution for many applications, owing to its adversarial training ability for optimizing generative capacity. GANs are an emergent model for unsupervised and semisupervised learning, where learning is achieved by implicitly modeling a high-dimensional data distribution [6]. GANs are considered interesting because they move away from a likelihood-maximization viewpoint and instead use an adversarial game approach for training the generative models [11].
A conventional GAN has no prior training data on the data distribution, and its goal is to generate samples from that distribution [12]. Effective training of GANs is rather challenging: the generator and discriminator model capacities must be balanced for the generator to learn effectively, and the lack of an unambiguous convergence criterion compounds this problem [13]. Various attempts in existing research to scale up GANs have been unsuccessful because of their unreliability in identifying fake or false data and their unstable training. GANs encounter various difficulties while scaling up in robustness and scalability. Hence, a stable training model across a limited range of datasets with a deeper learning algorithm can be significant for scaling up the operation of a GAN in finding the false detection rate.
The problem of imbalanced learning can be defined as learning from a binary or multiclass dataset in which the number of instances of one class, called the majority class, is significantly greater than that of the remaining classes, called the minority classes [14]. On unbalanced datasets, standard learning methods work poorly because they are biased toward the majority classes. In particular, the minority classes contribute less to minimizing the objective function during training in a standard classification method [15].
Designing GANs is still difficult in practice, even though GANs have achieved great success in image generation. When a GAN is unstable, the network architecture must be carefully designed. Various GAN methods have been developed [16–18] to improve the stability of GAN learning. The instability associated with GAN learning is caused by the saturation that occurs while sampling the data in the discriminator [19]. Hence, to resolve this problem, this chapter uses optimal weight selection to avoid saturation, thereby increasing the stability of operation.
In this chapter, we develop a GAN whose operation is scaled up using a recurrent neural network (RNN). The RNN uses the neighborhood relationship between the samples to generate the target output and the resulting error. The errors are then propagated backward over the GAN to update the network weights and estimate the output. The RNN generator, on the other hand, reduces the probability of the RNN discriminator identifying the falsely generated samples and increases the probability of the discriminator correctly identifying the real samples. The objective function of the RNN is designed in such a way that its gradient operator keeps the false samples far from the decision boundary of the RNN discriminator, thereby increasing the true classification rate.
12.1.1 Contributions
The main contributions of the study are presented below:
• The authors use an RNN to scale up the operation of the GAN for stable training.
• The discriminator uses the same RNN to classify the generated and real data samples by updating the weights.
• The aim of the RNN in the GAN structure is to delimit the error rate using time-series prediction based on its past inputs. It further uses the neighborhood relationship between the samples to generate the target output and the resulting error.
• The errors are then propagated backward over the GAN to update the network weights and estimate the output. The RNN generator, on the other hand, reduces the probability of the RNN discriminator identifying the falsely generated samples and increases the probability of the discriminator correctly identifying the real samples.
• The objective function of the RNN is designed in such a way that its gradient operator keeps the false samples far from the decision boundary of the RNN discriminator, thereby increasing the true classification rate.
• Experiments carried out on real-world time-series datasets show accurate classification with a higher false-data detection rate than the benchmark GAN method.
The outline of the chapter is as follows: Section 12.2 discusses the related works. Section 12.3 presents the proposed method, and Section 12.4 details the GAN-RNN classification with the modifications made to improve the performance of the classifier. Section 12.5 evaluates the entire work, and Section 12.6 concludes the work with possible directions for future research.
12.2 Related works
Lyu et al. [20] designed a denoising GAN that removes mixed noise by combining three different GAN elements: a feature extractor network, a discriminator, and a generator. The feature extractor network additionally trains the generator-discriminator pair through a mutual game, and a direct mapping is implemented to eliminate the noise and improve the quality of the input data. Chen et al. [21] developed a denoising method with a GAN (D-GAN) to remove speckle noise. The generator is trained with ground-truth values to map the noisy regions in an image, while the discriminator computes a similarity comparison using a loss function on the reconstructed input data. The generator-discriminator pair finally eliminates the noise through stable training to achieve denoising effectiveness. Nawi et al. [22] formed a GAN embedded with a discriminative metric for the generation of real samples using deep metric learning. The feature extractor acts as a discriminator that identifies both the discriminative and the preserving loss; it further reduces the estimated distance between the real and generated samples, which preserves the input data from losses and improves the model stability with a weight-adaptation strategy. Li et al. [23] used a multitask learning-based GAN (MTL-GAN) to segment the grains or noise present in images of alloy microstructures. It detects grains at edges and segments them using rich convolutional features, and fine-tuning of the GAN is employed to find the hidden grains and extract quantitative indicators. The above methods achieve accurate noise reduction, but their accuracy tends to decrease with increasing numbers of data samples.
Zhong et al. [24] developed a deep-GAN model embedded with the output noises from a decoder-encoder model. The GAN is improved with adversarial training and Bayesian inference, and the entire model is pretrained to map the noise vector to an optimal feature that acts as the input feed for the generator. The generator's learning ability is trained with a dataset carrying intrinsic distribution information to reduce the set of errors. Cui et al. [25] used an encoder-decoder network in a GAN to remove noise from the input signal. The network classifies the original input data based on the adversarial and pixel losses, and thereby the original information is preserved. Sun et al. [26] modified the GAN model and formulated it with a t-distribution noise mixture in a latent generative space. The learning components of the t-distribution mixture improve the diversity of class identification even in the presence of the noise vector, and the classification loss embedded with the generator-discriminator losses stabilizes the entire network. The contribution of the adversarial loss, however, tends to diminish with the stability of the noise vector as the number of data samples increases.
Ak et al. [27] developed an enhanced attention GAN with improved stability using an integrated attention module for linear modulation. It is designed to avoid matching losses between features that significantly degrade the classification performance; the training stability is hence improved to resolve the collapse of the system by reducing instance noise. Xu et al. [28] developed a multideep GAN that resolves unstable training using reconstructive sampling. The multigranularity GAN is then decomposed to remove the noise present in the input data and further enhance the training. Deep neural network-based melanoma detection has been introduced by Banuselvasaraswathy et al. [29], and the deep learning-based training analysis suggested there could be used in proposing an improved deep GAN. Zhang et al. [30] developed a deep GAN whose explicit training data map the noise distribution to real-time data; the generation of fake samples further helps in balancing the datasets for stable training. These methods, however, tend to reduce the probability of sampling under increased data samples using the first part of the generator.
Zhang and Sheng [31] used Wasserstein GAN to improve the model’s generalization
ability. It detects the input seismic signal and can distinguish the noise and the original
signal especially in low signal-to-noise ratio (SNR) conditions. Hence, the stability is
improved on adversarial sample datasets. Yang et al. [32] developed a GAN architecture
with loss function to distinguish the clean signal from a parallel noisy signal. It uses supervised loss representation, i.e., high-level loss in the hidden layer of GAN to find the losses
under low SNR and in a resource-constrained environment. These two methods suffer
poorly from the unstable training and the inclusion of RNN can effectively improve the
stabilized training pattern. In Refs. [26, 33–36], various machine learning and deep learning algorithms for classification of diseases are proposed. The optimization techniques
suggested by them for diagnosing breast cancer are an effective one. The other big data
analytics techniques suggested by them are noteworthy in handling big data and this in
turn reveals the possibility of reducing the traffic in the network.
Sampathkumar et al. [37–40] proposed various optimization approaches for identifying the accuracy of the accurate gene for a particular disease. Various classifiers and statistical approaches are used to predict the feature selection gene with reduced
dimensionality, which prove that algorithms are effective one. These methods fail to discriminate between original and fake samples resulting in poor determination accuracy.
Congestion control in WSN could be improved by multiple routing paths and this could
be achieved by means of priority-based scheduling algorithms as discussed in Ref. [41]. Various researches have been undergone in the machine learning algorithm over the accuracy
prediction. In this chapter, GAN with RNN [42–45] has concentrating over the reducing
Analysis of false data detection rate in adversarial networks
false samples, which have not been carried out in the previous works. The utilization of
RNN is to effectively process the temporal data, which are unavailable in the existing
methods. The utilization of recurrent layers overcomes the challenges associated with a loss
function. The RNN further can enable unsupervised learning against the conventional
supervised learning models [46–48].
12.3 Methods
The proposed method is designed by integrating a GAN with an RNN [22,25,32]. The GAN is responsible for the optimization of the neural networks through adversarial training. The generator network is the first network; it maps a low-dimensional input from the source distribution to the high-dimensional target domain. The adversarial (discriminator) network is the second network; it produces an output indicating whether a sample comes from the real dataset, and the generator attempts to produce outputs that the adversarial network fails to distinguish from the real data. The generator and discriminator are optimized with respect to the discriminator output, which compares the generator outputs with the real dataset samples. To attain optimal results, the study uses the RNN algorithm to generate the optimal weights and achieve accurate predictive results.
The training of the regression model [49] benefits from the adversarial network signal. The adversarial network processes the predicted and real classifications with a receptive field and a weight function that quantifies the task well with global real-time prediction, reducing the errors associated with the classification task. Hence, it is necessary to design a loss function that contributes well to the adversarial training.
Fig. 12.1 shows the various components: a generator G operating on a noise vector z, which is sampled from an input distribution pz; recurrent layers that transform the input noise vector z into a sample x; and finally a discriminator D that distinguishes the input samples from the real data distribution pdata.
The algorithm, or the workflow of the entire system, is given in the following steps:
Step 1: Generate synthetic samples resembling the real data distribution using the generator network.
Step 2: Transform the noise vector drawn from its distribution into newer samples.
Fig. 12.1 Generative adversarial network.
Step 3: Run the RNN; the weights of the RNN are updated using the fitness function.
Step 4: Select the fitness value with the improved probability estimator to strengthen the classifier, with the probable weights updated based on the newer samples.
Step 5: Access the new samples and the samples from the generator using the discriminator network.
Step 6: Find the similarity between these two sets of data.
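As an illustration of Steps 1–6, the sketch below builds a recurrent generator and discriminator for multivariate time-series samples with tf.keras LSTM layers. It is a minimal sketch under stated assumptions: the chapter does not specify the recurrent cell type, layer widths, sequence length, or latent dimension, so an LSTM with 64 units, a latent dimension of 32, and sequences of 24 steps are chosen purely for illustration. The training step that applies the objective of Eq. (12.1) is sketched after that equation below.

```python
# Minimal sketch of the GAN-RNN components described in Steps 1-6 (tf.keras).
# Assumed for illustration: LSTM cells with 64 units, latent dimension 32,
# sequences of 24 time steps with n_features variables per step.
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM, SEQ_LEN = 32, 24

def build_rnn_generator(n_features):
    # Steps 1-2: transform a noise sequence z ~ p_z into a synthetic sample x.
    z = layers.Input(shape=(SEQ_LEN, LATENT_DIM))
    h = layers.LSTM(64, return_sequences=True)(z)             # recurrent layer over the noise sequence
    x = layers.TimeDistributed(layers.Dense(n_features))(h)   # one output vector per time step
    return tf.keras.Model(z, x, name="rnn_generator")

def build_rnn_discriminator(n_features):
    # Step 5: score a sample as coming from p_data (real) or from G (fake).
    x = layers.Input(shape=(SEQ_LEN, n_features))
    h = layers.LSTM(64)(x)                                     # recurrent summary of the sequence
    y = layers.Dense(1, activation="sigmoid")(h)               # D(x) in [0, 1]
    return tf.keras.Model(x, y, name="rnn_discriminator")

generator = build_rnn_generator(n_features=10)
discriminator = build_rnn_discriminator(n_features=10)
z = tf.random.normal((8, SEQ_LEN, LATENT_DIM))                 # Step 2: sample the noise vector
fake = generator(z)                                            # Step 1: synthetic samples
print(discriminator(fake).shape)                               # Steps 5-6: (8, 1) real/fake scores
```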
12.4 GAN-RNN architecture
The GAN is designed with two neural networks. First, the generator network generates convincing synthetic samples x resembling draws from the real data distribution pdata; it transforms the noise vector z, drawn from the distribution pz, into newer samples x. Second, the discriminator network accesses the samples from the real data distribution pdata as well as the samples from G and classifies between the two. The training of the GAN is carried out by solving an optimization problem in which the discriminator objective is maximized and the generator objective is minimized:
\min_G \max_D f(D, G) = \mathbb{E}_{x \sim p_{data}} [\log D(x)] + \mathbb{E}_{z \sim p_z} [\log (1 - D(G(z)))]    (12.1)
where G is the generator, D is the discriminator, and f(D, G) is the objective function.
The final layer of the discriminator network uses a sigmoid activation function, so that D(x), D(G(z)) ∈ [0, 1]. The discriminator maximizes this objective, which minimizes its prediction error with respect to the target labels on real/fake samples. On the contrary, the generator is trained to reduce the discriminator's ability to identify the fake samples. Hence, the generator loss directly depends on the discriminator performance.
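In practice, the min-max objective of Eq. (12.1) is usually implemented as two binary cross-entropy terms, one for the discriminator and a non-saturating counterpart for the generator. The sketch below is one possible realization under that assumption; it reuses the `generator`, `discriminator`, `SEQ_LEN`, and `LATENT_DIM` defined in the earlier sketch and is not the authors' training code.

```python
# One adversarial training step implementing Eq. (12.1) with tf.keras.
# Reuses generator, discriminator, SEQ_LEN, and LATENT_DIM from the earlier sketch.
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
g_opt = tf.keras.optimizers.Adam(2e-4)
d_opt = tf.keras.optimizers.Adam(2e-4)

@tf.function
def train_step(real_batch):
    batch = tf.shape(real_batch)[0]
    z = tf.random.normal((batch, SEQ_LEN, LATENT_DIM))
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_batch = generator(z, training=True)
        d_real = discriminator(real_batch, training=True)
        d_fake = discriminator(fake_batch, training=True)
        # Discriminator maximizes E[log D(x)] + E[log(1 - D(G(z)))], i.e., minimizes this BCE.
        d_loss = bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)
        # Generator minimizes -E[log D(G(z))], the common non-saturating surrogate of Eq. (12.1).
        g_loss = bce(tf.ones_like(d_fake), d_fake)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return d_loss, g_loss
```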
12.4.1 Optimization of GAN using RNN
Consider a sample (a, b), where a is the input and b is the label of class C. The probability estimate of the RNN model is defined as Y = softmax(f(a; θ)). The study uses a cross-entropy objective function that helps in updating the weights of the RNN through backward learning, i.e., backpropagation, during the process of training the classifier, as seen in Fig. 12.2. The fitness function is selected to be robust, with an improved probability estimator that increases the performance of the GAN-RNN classifier through the probable weight updates. The weights are updated optimally as in Ref. [36].
Consider L(b, Y) as the loss function; the RNN is then tasked with finding the best weights {w1, w2, …, wK} that combine the results of the loss function at each iteration. The study uses a cross-entropy loss function to validate the performance of the GAN-RNN classifier (Fig. 12.3). The error produced by validating the model lies within the range [22] (Fig. 12.4).

Fig. 12.2 GAN-RNN learning module.
Fig. 12.3 Average performance of the GAN-RNN model (mean squared error of the training, validation, and test sets versus training iterations).
Fig. 12.4 Average error and plots of the entire datasets (outputs, targets, and errors for the training, validation, and test sets over time).

The cross-entropy increases the prediction probability of the input labels, i.e., the actual labels. Reducing the average value of the loss function corresponds to maximizing the log-likelihood of the input data. The cross-entropy loss function is defined using the following expression:
CE = -\sum_{i} \left( b \log(w_i) + (1 - b) \log(1 - w_i) \right)    (12.2)

f = \sum_{c=1}^{M} CE(c, Y) \log CE(c, Y)    (12.3)

where M represents the total number of classes, log represents the natural logarithm, b represents the binary indicator of whether the class label c is the correct classification for an observed data point O, and Y represents the predicted probability that observation O belongs to class c.
With N training samples, the loss function is defined as

\min_{w} f = \sum_{i=1}^{N} f(b_i, Y_i)    (12.4)

where b_i represents the true label and Y_i represents the predicted label.
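As a concrete illustration of Eqs. (12.2)–(12.4), the binary cross-entropy can be evaluated directly in NumPy as below; `b` holds the true labels and `w` the predicted probabilities produced by the RNN, and both names are illustrative rather than taken from the chapter.

```python
# Binary cross-entropy of Eq. (12.2), summed over the N training samples as in Eq. (12.4).
import numpy as np

def cross_entropy(b, w, eps=1e-12):
    # b: true binary labels; w: predicted probabilities from the RNN (length-N arrays).
    w = np.clip(w, eps, 1.0 - eps)                     # avoid log(0)
    return -np.sum(b * np.log(w) + (1.0 - b) * np.log(1.0 - w))

b = np.array([1, 0, 1, 1])                             # true labels
w = np.array([0.9, 0.2, 0.7, 0.6])                     # RNN-predicted probabilities
print(cross_entropy(b, w))                             # total loss over the N = 4 samples
```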
12.5 Performance evaluation
This section compares the results of the proposed GAN-RNN model with those of various other GAN models, including GAN [26], Wasserstein GAN [30], deep-GAN [29], D-GAN [31], and MTL-GAN [23].
12.5.1 Dataset collection
The multivariate datasets contain a mix of integer- and real-valued attributes; the categorical attributes are removed. The following datasets are used to evaluate the GAN-RNN model: Dataset-1, breast cancer Wisconsin (original) with 699 instances; Dataset-2, breast cancer Wisconsin (diagnostic) with 569 instances; Dataset-3, heart disease with 303 instances; Dataset-4, heart failure clinical records with 299 instances; Dataset-5, diabetes 130-US hospitals with 100,000 instances; and Dataset-6, lung cancer with 32 instances [50]. The features and classes used in this chapter are summarized in Tables 12.1 and 12.2.
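One of the listed datasets, the breast cancer Wisconsin (diagnostic) data used as Dataset-2, is also distributed with scikit-learn, so a train/test split for such an experiment can be sketched as below; the 90/10 split mirrors the training-data fraction examined later in Fig. 12.5 and is otherwise an illustrative choice, not a protocol stated in the chapter.

```python
# Illustrative loading and splitting of Dataset-2 (breast cancer Wisconsin, diagnostic),
# which ships with scikit-learn; the other UCI datasets would be loaded from CSV files instead.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)              # 569 instances, 30 numeric features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0)     # 90% training data, 10% held out
print(X_train.shape, X_test.shape)
```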
12.5.2 Performance metrics
The performance of the GAN-RNN model is validated against various metrics that include accuracy, sensitivity, specificity, F-measure, geometric mean (G-mean), percentage error, and training performance. The definitions of all the performance metrics are given below.
Table 12.1 Attributes of the first three datasets for prediction.
Attributes | Dataset-1 | Dataset-2 | Dataset-3
Data set characteristics | Multivariate | Multivariate | Multivariate
Attribute characteristics (*categorical attributes are removed) | Integer | Real | Integer and real
Number of instances | 699 | 569 | 303
Number of attributes | 10 | 32 | 75
Missing values or errors present | Yes | No | Yes
Table 12.2 Attributes of the last three datasets for prediction.
Attributes | Dataset-4 | Dataset-5 | Dataset-6
Data set characteristics | Multivariate | Multivariate | Multivariate
Attribute characteristics (*categorical attributes are removed) | Integer and real | Integer | Integer
Number of instances | 299 | 100,000 | 32
Number of attributes | 13 | 55 | 56
Missing values or errors present | No | Yes | Yes
The accuracy used to validate the GAN-RNN is the proportion of correctly predicted samples, which ensures that the GAN-RNN model predicts the output correctly without noise. It is defined as the ratio of total correct predictions to total predictions:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}    (12.5)

where TP is the number of true positive cases, TN the true negative cases, FP the false positive cases, and FN the false negative cases.
The F-measure used to validate the GAN-RNN is defined as the weighted harmonic mean of the sensitivity and specificity values, and it tends to range between zero and one:

F\text{-}measure = \frac{2\,TP}{2\,TP + FP + FN}    (12.6)

The higher the F-measure value, the higher the performance of the GAN-RNN classifier.
Sensitivity is defined as the ability of the GAN-RNN model to correctly identify the true positive rate:

Sensitivity = \frac{TP}{TP + FN}    (12.7)

Specificity is defined as the ability of the GAN-RNN model to correctly identify the true negative rate:

Specificity = \frac{TN}{TN + FP}    (12.8)

The geometric mean (G-mean) used to validate the GAN-RNN aggregates the specificity and sensitivity measures and maintains the trade-off between them when the dataset is imbalanced. It is measured using the following equation:

G\text{-}mean = \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}}    (12.9)

The higher the G-mean, the higher the performance of the GAN-RNN, and vice versa.
The mean absolute percentage error (MAPE) used to validate the GAN-RNN measures the prediction accuracy by estimating the total eliminated losses during the prediction of accurate data instances from the preprocessed datasets. MAPE is based on the ratio of the difference between the actual classes (A_t) and the predicted classes (F_t) to the actual class; this value is multiplied by 100 to obtain the percentage error and divided by the number of fit points, i.e., the actual data points (n). The percentage error is hence defined as

MAPE = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right|    (12.10)
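The metrics of Eqs. (12.5)–(12.10) follow directly from the confusion-matrix counts and the actual/predicted values; the helper below computes them in NumPy (the function and variable names are illustrative).

```python
# Evaluation metrics of Eqs. (12.5)-(12.10) computed from raw counts and values.
import numpy as np

def classification_metrics(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)                          # Eq. (12.7)
    specificity = tn / (tn + fp)                          # Eq. (12.8)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),      # Eq. (12.5)
        "f_measure": 2 * tp / (2 * tp + fp + fn),         # Eq. (12.6)
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": np.sqrt(sensitivity * specificity),     # Eq. (12.9)
    }

def mape(actual, predicted):
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return 100.0 / len(actual) * np.sum(np.abs((actual - predicted) / actual))  # Eq. (12.10)

print(classification_metrics(tp=90, tn=85, fp=10, fn=15))
print(mape([10, 12, 8], [9, 13, 8]))
```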
12.5.3 Discussions
Table 12.3 shows the accuracy results of the proposed GAN-RNN (Method-6) compared with the benchmark GAN (Method-1), Wasserstein GAN (Method-2), Deep-GAN (Method-3), D-GAN (Method-4), and MTL-GAN (Method-5). The results are validated against the various datasets discussed in Section 12.5.1. For every dataset, the GAN-RNN obtains a higher classification accuracy than the other existing GAN methods. The simulation results over the large Dataset-5 show that the GAN-RNN and the other existing methods suffer from reduced accuracy because of the complexity of computing the larger set of data features; however, the GAN-RNN has proven to operate with lower complexity than the other methods.

Table 12.3 Accuracy of the various GANs on Dataset-1 to Dataset-6.
Methods | Dataset-1 | Dataset-2 | Dataset-3 | Dataset-4 | Dataset-5 | Dataset-6
GAN | 96.6507 | 97.9334 | 96.0952 | 97.6395 | 94.1348 | 97.5698
W_GAN | 97.3779 | 97.9536 | 96.1154 | 97.6395 | 94.2863 | 97.5698
Deep_GAN | 97.4082 | 97.9738 | 96.1255 | 97.7203 | 94.3671 | 97.6506
D_GAN | 97.489 | 97.9738 | 96.2164 | 97.7203 | 94.4176 | 97.6506
MTL_GAN | 97.4991 | 97.9839 | 96.2366 | 97.7405 | 94.5994 | 97.6708
GAN_RNN | 97.6304 | 97.9839 | 96.2972 | 97.791 | 94.6095 | 97.7213
Table 12.4 shows the sensitivity results of the proposed GAN-RNN compared with the existing GAN models over the various datasets. The results show that the GAN-RNN obtains a higher sensitivity than the other existing GAN methods. The sensitivity over the large Dataset-5 is lower for all GAN methods; however, the GAN-RNN operates with a higher sensitivity than the other methods.
Table 12.5 shows the specificity results of the proposed GAN-RNN compared with the existing GAN models over the various datasets. The results show that the GAN-RNN obtains a higher specificity than the other existing GAN methods. The specificity over the large Dataset-5 is lower for all GAN methods; however, the GAN-RNN operates with a higher specificity than the other methods.
Table 12.6 shows the F-measure results of the proposed GAN-RNN compared with the existing GAN models over the various datasets.
Table 12.4 Sensitivity of the various GANs on Dataset-1 to Dataset-6.
Methods | Dataset-1 | Dataset-2 | Dataset-3 | Dataset-4 | Dataset-5 | Dataset-6
GAN | 65.6131 | 86.0626 | 62.5821 | 87.6594 | 61.3297 | 87.7241
W_GAN | 67.0181 | 92.7609 | 63.3194 | 87.6594 | 61.6933 | 87.7241
Deep_GAN | 69.8168 | 93.8729 | 64.2193 | 88.4775 | 62.2185 | 88.5432
D_GAN | 72.5236 | 94.4587 | 64.5829 | 88.6088 | 63.8456 | 88.6745
MTL_GAN | 82.0721 | 94.7415 | 66.4211 | 89.3562 | 64.3506 | 89.4219
GAN_RNN | 85.7091 | 94.7516 | 67.2605 | 89.4269 | 64.6738 | 89.4936
Table 12.5 Specificity of the various GANs on Dataset-1 to Dataset-6.
Methods | Dataset-1 | Dataset-2 | Dataset-3 | Dataset-4 | Dataset-5 | Dataset-6
GAN | 93.9537 | 93.7214 | 91.8923 | 94.7304 | 92.4276 | 94.6607
W_GAN | 93.9739 | 94.5193 | 91.9529 | 94.7405 | 92.5892 | 94.6708
Deep_GAN | 93.9739 | 94.7213 | 91.9933 | 94.8314 | 92.5993 | 94.7617
D_GAN | 93.9739 | 94.8021 | 93.2366 | 94.8314 | 92.6498 | 94.7617
MTL_GAN | 93.9739 | 94.8223 | 93.6507 | 94.9223 | 92.6902 | 94.8526
GAN_RNN | 94.6203 | 94.8829 | 94.0042 | 94.9223 | 92.7407 | 94.8526
Table 12.6 F-measure of the various GANs on Dataset-1 to Dataset-6.
Methods | Dataset-1 | Dataset-2 | Dataset-3 | Dataset-4 | Dataset-5 | Dataset-6
GAN | 91.3151 | 79.383 | 60.0657 | 87.971 | 54.1239 | 88.0326
W_GAN | 92.7604 | 79.5143 | 71.6534 | 88.0922 | 61.8736 | 88.1538
Deep_GAN | 92.9119 | 80.0395 | 71.9777 | 90.0526 | 62.217 | 90.1152
D_GAN | 93.427 | 81.1313 | 74.8875 | 90.0829 | 62.6816 | 90.1455
MTL_GAN | 93.6391 | 81.8181 | 78.1508 | 91.4565 | 64.2279 | 91.5201
GAN_RNN | 94.2855 | 82.111 | 81.3838 | 91.4666 | 64.3693 | 91.5313
The results show that the GAN-RNN obtains a higher F-measure than the other existing GAN methods. The F-measure over the large Dataset-5 is lower for all GAN methods; however, the GAN-RNN operates with a higher F-measure than the other methods.
Table 12.7 shows the G-mean results of the proposed GAN-RNN compared with the existing GAN models over the various datasets. The results show that the GAN-RNN obtains a higher G-mean than the other existing GAN methods. The G-mean over the large Dataset-5 is lower for all GAN methods; however, the GAN-RNN operates with a higher G-mean than the other methods.
Table 12.8 shows the MAPE results of the proposed GAN-RNN compared with the existing GAN models over the various datasets.
Table 12.7 G-mean of the various GANs on Dataset-1 to Dataset-6.
Methods | Dataset-1 | Dataset-2 | Dataset-3 | Dataset-4 | Dataset-5 | Dataset-6
GAN | 95.4278 | 96.1954 | 81.1616 | 96.2621 | 81.4545 | 83.9301
W_GAN | 98.8628 | 96.1954 | 81.6767 | 96.2621 | 81.6969 | 84.7885
Deep_GAN | 99.4183 | 96.5691 | 82.212 | 96.6368 | 81.9696 | 86.4257
D_GAN | 99.7213 | 96.6398 | 82.4443 | 96.7075 | 82.9796 | 88.0013
MTL_GAN | 99.8627 | 96.9731 | 83.5664 | 97.0408 | 83.3028 | 93.0937
GAN_RNN | 99.8627 | 97.0135 | 84.031 | 97.0812 | 83.4957 | 94.6289
Table 12.8 MAPE of the various GANs on Dataset-1 to Dataset-6.
Methods | Dataset-1 | Dataset-2 | Dataset-3 | Dataset-4 | Dataset-5 | Dataset-6
GAN | 28.4113 | 80.7386 | 31.4423 | 32.6947 | 72.7857 | 27.7352
W_GAN | 27.0064 | 20.5224 | 30.704 | 32.3311 | 72.6039 | 27.5534
Deep_GAN | 24.2077 | 13.6918 | 29.8051 | 31.8059 | 64.6269 | 19.5814
D_GAN | 21.5108 | 2.60904 | 29.4516 | 30.1788 | 63.2826 | 18.2381
MTL_GAN | 11.9523 | 48.4075 | 27.6023 | 29.6738 | 55.851 | 10.8116
GAN_RNN | 10.2453 | 14.4503 | 26.7741 | 29.3506 | 55.1632 | 10.1248
The results show that the GAN-RNN obtains a lower MAPE than the other existing GAN methods. The MAPE over Dataset-5 is the highest of all the datasets because of its larger size; however, the GAN-RNN operates with a lower MAPE than the other methods.
Further, the proposed method is tested with a high-dimensional dataset to assess its efficacy and robustness. The proposed method is additionally tested on Dataset-5 by varying the amount of training data, and the results are given below.
From Fig. 12.5, we find that the proposed system with 90% training data performs best across the various class labels. Hence, we conclude that with an increasing number of training labels, the supervised method performs best on reduced noise labels, which is evident from the figure. We therefore choose 90% training data with 10% noise labels for further evaluation.
The results show that the proposed method in all three settings achieves higher performance and data stability than the other models, as reported in Tables 12.9–12.11. With 90% training data, the stability of the system is higher than that of the other models.
With 90% of the training data, the GAN with RNN generates optimal synthetic samples from the datasets. The classification of data by the discriminator network on the real data distribution shows an accurate transformation of the noise vector into new samples. This resolves the optimization problems, and the use of a cross-entropy objective function updates the RNN weights through backward learning, which increases the robustness of the classifier with minimal losses. However, as the training data are reduced, the performance tends to degrade because the noise vector transformation becomes less accurate, which affects the stability of the system.
12.6 Conclusions
In this chapter, the scaling of the GAN is improved with an RNN that effectively learns the network recursively for classification. The recursive learning from past records and the neighborhood relationship reduce the error rate while analyzing time-series data. The backward propagation through the GAN with updated network weights explicitly enables the output estimation. The target output is projected by the GAN-RNN model, and the most accurate predictions are thus made. The probability of identifying the false samples is therefore reduced using the RNN discriminator operated with a gradient-operator-based objective function. The simulation results show that the GAN-RNN model (97.01%) has a higher classification rate than other existing methods such as the benchmark GAN (96.67%), Wasserstein GAN (96.82%), Deep-GAN (96.87%), D-GAN (96.91%), and MTL-GAN (96.96%). Training the model with synthetic datasets has enabled higher predictive results with a reduced false classification rate. In the future, the system may adopt an optimal loss function that contributes to the adversarial training.
Fig. 12.5 Comparison of GAN-RNN accuracy with varying training data over Dataset-5 with varying training labels: (A) 60% training data, (B) 75% training data, and (C) 90% training data; each panel plots classification accuracy (%) against residuals for 10%, 20%, and 30% label noise.
Table 12.9 Comparison of other parameters of the GAN-RNN with 60% training data and 10% label noise.
Metrics | GAN | W_GAN | Deep_GAN | D_GAN | MTL_GAN | GAN_RNN
F-measure | 75.1937 | 75.4599 | 75.6402 | 75.8311 | 80.668 | 86.5453
G-mean | 75.5129 | 75.7568 | 77.5809 | 79.7666 | 82.3765 | 85.2939
MAPE | 73.5287 | 69.6674 | 62.5377 | 43.3257 | 40.2916 | 38.3286
Sensitivity | 83.5006 | 76.7431 | 77.5066 | 79.3848 | 79.5969 | 86.8634
Specificity | 75.9477 | 77.9414 | 81.1982 | 86.7468 | 88.1371 | 88.5719
Table 12.10 Comparison of the GAN-RNN with 75% training data and 10% label noise.
Metrics | GAN | W_GAN | Deep_GAN | D_GAN | MTL_GAN | GAN_RNN
F-measure | 42.0106 | 44.2377 | 56.1514 | 56.321 | 58.8355 | 90.0036
G-mean | 78.2172 | 78.4622 | 80.0529 | 80.0953 | 80.5301 | 92.0408
MAPE | 31.3378 | 28.2188 | 26.7341 | 23.997 | 23.3819 | 18.3954
Sensitivity | 66.7712 | 70.4946 | 78.8757 | 92.009 | 92.7089 | 98.3610
Specificity | 79.9575 | 80.159 | 83.8824 | 83.9036 | 85.3575 | 86.2483
Table 12.11 Comparison of the GAN-RNN with 90% training data and 10% label noise.
Metrics | GAN | W_GAN | Deep_GAN | D_GAN | MTL_GAN | GAN_RNN
F-measure | 72.0748 | 72.1278 | 73.0727 | 74.2605 | 79.5969 | 85.4954
G-mean | 47.4309 | 61.1378 | 64.3522 | 48.7045 | 81.9841 | 92.4756
MAPE | 21.8431 | 19.0105 | 18.9151 | 13.7695 | 12.3591 | 11.1395
Sensitivity | 82.1325 | 84.9757 | 85.0712 | 90.2051 | 91.6272 | 92.8468
Specificity | 79.1197 | 82.175 | 83.1718 | 86.5241 | 88.6461 | 90.8636
References
[1] M.Y. Liu, O. Tuzel, Coupled generative adversarial networks, in: Advances in Neural Information
Processing Systems, 2016, pp. 469–477.
[2] H. Zhang, I. Goodfellow, D. Metaxas, A. Odena, Self-attention generative adversarial networks, in:
International Conference on Machine Learning, 2019, May, pp. 7354–7363.
[3] C. Li, M. Wand, Precomputed real-time texture synthesis with Markovian generative adversarial networks, in: European Conference on Computer Vision, Springer, Cham, 2016, October, pp. 702–716.
[4] T. Karras, S. Laine, T. Aila, A style-based generator architecture for generative adversarial networks, in:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp.
4401–4410.
[5] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, D.N. Metaxas, Stackgan: text to photorealistic image synthesis with stacked generative adversarial networks, in: Proceedings of the IEEE
International Conference on Computer Vision, 2017, pp. 5907–5915.
[6] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, A.A. Bharath, Generative adversarial networks: an overview, IEEE Signal Process. Mag. 35 (1) (2018) 53–65.
[7] T. Schlegl, P. Seeböck, S.M. Waldstein, U. Schmidt-Erfurth, G. Langs, Unsupervised anomaly detection with generative adversarial networks to guide marker discovery, in: International Conference on
Information Processing in Medical Imaging, Springer, Cham, 2017, June, pp. 146–157.
[8] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, S. Belongie, Stacked generative adversarial networks, in:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp.
5077–5086.
[9] E.L. Denton, S. Chintala, R. Fergus, Deep generative image models using a Laplacian pyramid of
adversarial networks, in: Advances in Neural Information Processing Systems, 2015, pp. 1486–1494.
[10] J.M. Wolterink, T. Leiner, M.A. Viergever, I. Išgum, Generative adversarial networks for noise reduction in low-dose CT, IEEE Trans. Med. Imaging 36 (12) (2017) 2536–2545.
[11] K. Roth, A. Lucchi, S. Nowozin, T. Hofmann, Stabilizing training of generative adversarial networks
through regularization, in: Advances in Neural Information Processing Systems, 2017, pp. 2018–2028.
[12] G.J. Qi, Loss-sensitive generative adversarial networks on lipschitz densities, Int. J. Comput. Vis. 128
(2020) 1118–1140.
[13] D. Warde-Farley, Y. Bengio, Improving generative adversarial networks with denoising feature matching, in: International Conference on Learning Representations, Toulon, France, April 24–26, 2017,
2016.
[14] G. Douzas, F. Bacao, Effective data generation for imbalanced learning using conditional generative
adversarial networks, Expert Syst. Appl. 91 (2018) 464–471.
[15] K.M. Ting, An instance-weighting method to induce cost-sensitive trees, IEEE Trans. Knowl. Data
Eng. 14 (3) (2002) 659–665.
[16] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv (2015). preprint arXiv:1511.06434.
[17] J. Zhao, M. Mathieu, Y. LeCun, Energy-based generative adversarial network, arXiv (2016). preprint
arXiv:1609.03126.
[18] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for
training gans, in: Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
[19] X. Mao, Q. Li, H. Xie, R.Y. Lau, Z. Wang, Multi-class generative adversarial networks with the L2
loss function, arXiv 5 (2016) 1057–7149. preprint arXiv:1611.04076.
[20] Q. Lyu, M. Guo, Z. Pei, DeGAN: mixed noise removal via generative adversarial networks, Appl. Soft
Comput. 95 (2020) 106478.
[21] Z. Chen, Z. Zeng, H. Shen, X. Zheng, P. Dai, P. Ouyang, DN-GAN: denoising generative adversarial
networks for speckle noise reduction in optical coherence tomography images, Biomed. Signal Process.
Control 55 (2020) 101632.
[22] N.M. Nawi, A. Khan, M.Z. Rehman, H. Chiroma, T. Herawan, Weight optimization in recurrent
neural networks with hybrid metaheuristic cuckoo search techniques for data classification, Math.
Probl. Eng. 2015 (2015).
[23] M. Li, D. Chen, S. Liu, F. Liu, Grain boundary detection and second phase segmentation based on
multi-task learning and generative adversarial network, Measurement 162 (2020) 107857.
[24] G. Zhong, W. Gao, Y. Liu, Y. Yang, D.H. Wang, K. Huang, Generative adversarial networks with
decoder-encoder output noises, Neural Netw. 127 (2020) 19–28.
[25] Z. Cui, K. Henrickson, R. Ke, Y. Wang, Traffic graph convolutional recurrent neural network: a deep
learning framework for network-scale traffic learning and forecasting, IEEE Trans. Intell. Transp. Syst.
(2019), https://doi.org/10.1109/TITS.2019.2950416.
[26] J. Sun, G. Zhong, Y. Chen, Y. Liu, T. Li, K. Huang, Generative adversarial networks with mixture of
t-distributions noise for diverse image generation, Neural Netw. 122 (2020) 374–381.
[27] K.E. Ak, J.H. Lim, J.Y. Tham, A.A. Kassim, Semantically consistent text to fashion image synthesis
with an enhanced attentional generative adversarial network, Pattern Recogn. Lett. 135 (2020) 22–29.
[28] L. Xu, X. Zeng, W. Li, Z. Huang, Multi-granularity generative adversarial nets with reconstructive
sampling for image Inpainting, Neurocomputing 402 (2020) 220–234.
[29] B. Banuselvasaraswathy, A. Sampathkumar, P. Jayarajan, N. Sheriff, M. Ashwin, V. Sivasankaran,
A review on thermal and QoS aware routing protocols for health care applications in WBASN, in:
IEEE International Conference on Communication and Signal Processing, July 28-30, India, 2020.
311
312
Generative adversarial networks for image-to-image translation
[30] W. Zhang, X. Li, X.D. Jia, H. Ma, Z. Luo, X. Li, Machinery fault diagnosis with imbalanced data using
deep generative adversarial networks, Measurement 152 (2020) 107377.
[31] J. Zhang, G. Sheng, First arrival picking of microseismic signals based on nested U-net and Wasserstein
generative adversarial network, J. Pet. Sci. Eng. (2020), https://doi.org/10.1016/j.petrol.2020.
107527.
[32] F. Yang, Z. Wang, J. Li, R. Xia, Y. Yan, Improving generative adversarial networks for speech
enhancement through regularization of latent representations, Speech Comm. 118 (2020) 1–9.
[33] A. Sampathkumar, S. Murugan, R. Rastogi, M.K. Mishra, S. Malathy, R. Manikandan, Energy efficient ACPI and JEHDO mechanism for IoT device energy management in healthcare, in: G.
Kanagachidambaresan, R. Maheswar, V. Manikandan, K. Ramakrishnan (Eds.), Internet of Things
in Smart Technologies for Sustainable Urban Development, EAI/Springer Innovations in Communication and Computing. Springer, Cham, 2020.
[34] S. Pascual, J. Serra, A. Bonafonte, Time-domain speech enhancement using generative adversarial networks, Speech Comm. 114 (2019) 10–21.
[35] Z. Chen, C. Wang, H. Wu, K. Shang, J. Wang, DMGAN: discriminative metric-based generative
adversarial networks, Knowl.-Based Syst. 192 (2020) 105370.
[36] X. Li, Y. Makihara, C. Xu, Y. Yagi, M. Ren, Gait recognition invariant to carried objects using alpha
blending generative adversarial networks, Pattern Recogn. 105 (2020) 107376.
[37] A. Sampathkumar, J. Mulerikkal, M. Sivaram, Glowworm swarm optimization for effectual load balancing and routing strategies in wireless sensor networks, Wirel. Netw. 26 (6) (2020) 4227–4238,
https://doi.org/10.1007/s11276-020-02336-w.
[38] A. Sampathkumar, S. Murugan, A.A. Elngar, L. Garg, R. Kanmani, A.C.J. Malar, A novel scheme for
an IoT-based weather monitoring system using a wireless sensor network, in: Integration of WSN and
IoT for Smart Cities, 2020, pp. 181–191.
[39] S.R. Thennarasu, M. Selvam, K. Srihari, A new whale optimizer for workflow scheduling in cloud
computing environment, J. Ambient Intell. Human. Comput. (2020) 1–8.
[40] K. Kamaraj, C. Arvind, K. Srihari, A weight optimized artificial neural network for automated software
test oracle, Soft Comput. (2020) 1–11.
[41] M. Hibat-Allah, M. Ganahl, L.E. Hayward, R.G. Melko, J. Carrasquilla, Recurrent neural network
wave functions, Phys. Rev. Res. 2 (2) (2020), 023358.
[42] M.Z. Uddin, M.M. Hassan, A. Alsanad, C. Savaglio, A body sensor data fusion and deep recurrent
neural network-based behavior recognition approach for robust healthcare, Inf. Fusion 55 (2020)
105–115.
[43] K. Guo, Y. Hu, Z. Qian, H. Liu, K. Zhang, Y. Sun, B. Yin, Optimized graph convolution recurrent
neural network for traffic prediction, IEEE Trans. Intell. Transp. Syst. 22 (2) (2021) 1138–1149.
[44] M. Almiani, A. AbuGhazleh, A. Al-Rahayfeh, S. Atiewi, A. Razaque, Deep recurrent neural network
for IoT intrusion detection system, Simul. Model. Pract. Theory 101 (2020) 102031.
[45] X. Li, Z. Xu, S. Li, H. Wu, X. Zhou, Cooperative kinematic control for multiple redundant manipulators under partially known information using recurrent neural network, IEEE Access 8 (2020)
40029–40038.
[46] A. Sampathkumar, P. Vivekanandan, Gene selection using multiple queen colonies in large scale
machine learning, J. Electr. Eng. 9 (6) (2020) 97–111.
[47] A. Sampathkumar, P. Vivekanandan, Gene selection using PLOA method in microarray data for cancer
classification, J. Med. Imag. Health Inf. 9 (2019) 1294–1300.
[48] A. Sampathkumar, R. Rastogi, Arukonda, S.A. Shankar, S. Kautish, M. Sivaram, An efficient hybrid
methodology for detection of cancer-causing gene using CSC for micro array data, in: J. Ambient
Intell. Human. Comput., Springer, 2020, doi:10.1007/s12652-020-01731-7.
[49] A. Sherstinsky, Fundamentals of recurrent neural network (rnn) and long short-term memory (lstm)
network, Physica D 404 (2020) 132306.
[50] T.H. Wen, S. Young, Recurrent neural network language generation for spoken dialogue systems,
Comput. Speech Lang. 63 (2020) 101017.
CHAPTER 13
WGGAN: A wavelet-guided generative
adversarial network for thermal image
translation
Ran Zhang a,*, Junchi Bin a,*, Zheng Liu a, and Erik Blasch b
a University of British Columbia, Kelowna, BC, Canada
b MOVEJ Analytics, Dayton, OH, United States
* These authors contributed equally to this work.
13.1 Introduction
Thermal or infrared (IR) images are widely used in different applications including night-vision navigation and surveillance [1], face recognition [2], and remote sensing [3]. IR
images are produced by infrared cameras to record the thermal information of objects.
These images are monochrome and are usually shown in gray scale [4]. They are different
from RGB-converted gray scale images that maintain the texture information. Compared with RGB images, IR images are less affected by environmental factors such as
illumination differences, fog, and smoke. However, they do not contain color and texture information, which is critical to understanding the objects in images. RGB and IR
images have different characteristics and advantages for capturing the information of
objects. IR images tend to capture more thermal structure, while RGB images are more sensitive to colors. The visible (RGB) and IR images can be fused [1, 5, 6] to generate more useful
and comprehensive images that take advantage of both sources. However, the acquisition
of RGB images in dark conditions is difficult due to the hardware limitation, which
requires lights. RGB images are more easily understood by humans and play an important
role in machine vision applications. Therefore, IR-to-RGB translation is needed [7].
Traditional methods of image translation require manually specified colors [8], a reference image [4], or paired image datasets [9]. Manually specifying colors can produce vivid, colorful images under human guidance [8]. However, it requires a lot of labor and is relatively time-consuming compared with automatic methods. Reference-based methods colorize images automatically by establishing feature correspondences from source to target images. They can be combined with manual intervention to improve performance, but they need reference images that are selected manually. With the advances of convolutional neural networks (CNNs), image
translation can be fully automated. CNNs extract both low-level and semantic features.
The object can be localized and colorized when CNNs learn semantic information [10].
CNN-based translation methods are trained on a large number of images that can cover
various objects in different situations. During the testing process, these methods are fully
automatic and do not need reference images. Although the CNN-based translation has
high accuracy, all these methods need to have fully paired images to learn the direct mapping between IR and RGB images. The acquisition of paired images is challenging due to
the difficulty of hardware calibration in industrial applications. A generative adversarial network (GAN) can address unpaired image translation by including discriminators while training generative models. Recent state-of-the-art image translation methods are based on GANs [11–13].
However, when transferring IR images to RGB images, contemporary GANs are
unable to keep the structure of the object or produce clear texture information. In this
study, we propose a wavelet-guided generative adversarial network (WGGAN) to address this challenge. Similar to contemporary methods, the WGGAN comprises an autoencoder for image translation and a discriminator for training. To deal with the spatial distortion problem, we combine discrete wavelet transformation with a variational autoencoder to keep structural information in the early stages of the network. This produces clear synthetic RGB images, as shown in Fig. 13.1. Furthermore, both qualitative and quantitative analyses are conducted to evaluate the proposed method's performance. Compared with contemporary methods, the proposed method achieves more promising results in both analyses. To conclude, our contributions are as follows:
• A design combining discrete wavelet transformation and a variational autoencoder for IR-to-RGB image translation, which improves both qualitative and quantitative results. To the best of our knowledge, this is the first method to adopt discrete wavelet transformation in GAN-based methods of IR-to-RGB translation.
• Robust performance, as the WGGAN does not require paired IR and RGB image datasets, facilitating thermal image translation when paired images are not available.
Fig. 13.1 An example of IR-to-RGB translation via the proposed wavelet-guided generative adversarial
network (WGGAN).
The rest of the chapter is organized as follows. Section 13.2 introduces the progress in
infrared image translation and relevant GANs’ applications in image translation.
Section 13.3 introduces the proposed WGGAN, from the overall architecture to implementation details. Section 13.4 presents the experimental setup and results compared with contemporary methods. Finally, Section 13.5 concludes the experiments and the WGGAN.
13.2 Related work
13.2.1 Infrared image translation
Infrared image translation aims to transfer single-channel gray scale images to multichannel RGB images that contain color and texture information. It can be divided into
scribble-based [10, 14–16], reference-based [17–19], and fully automatic [20, 21]
methods. Scribble-based methods assume that adjacent areas have a similar color. The
scribbles can be added by human intervention or edge detection algorithm.
Reference-based methods rely on reference images that have a structure similar to the
source image. The reference images can be selected automatically by feature matching.
Then the IR images can be transferred to color images by image analogy [18]. Fully automatic methods usually utilize the CNNs [22] or GANs [22, 23] to extract features and
automatically establish the pixel-wise mapping from source images to target images. The
IR images can be transferred directly to RGB images without manual intervention or
reference images. However, they are usually supervised methods and require that the
training dataset be paired, that is, the IR images should have corresponding calibrated
RGB images. In most scenarios, it is hard to obtain paired datasets. Our proposed
WGGAN only needs unpaired IR and RGB images for training, gaining significant practical value compared with other fully automatic methods.
13.2.2 GANs in image translation
Transferring IR images to RGB images can be considered as a specific application of
image translation. Much research has been conducted in this field. Image translation
focuses on transferring the style of the image from one domain to another domain.
Depending on whether the dataset is paired or unpaired, the image translation can be
divided into paired or unpaired approaches. Paired and unpaired data can also be utilized together to train a model [24], thus obtaining the advantages of both paired and unpaired methods. Training an image translation model with paired data can lead to better performance, although paired data are not easy to collect and calibrate. Conditional GAN [25] is used for pixel-to-pixel paired image-to-image translation and showed great performance on paired datasets. Unpaired image translation methods are more
widely exploited as they have fewer limitations on the datasets. These methods usually
contain more than one autoencoder, which generates target images first and reconstructs
images from the target domain to the source domain. The reconstructed images should be
similar or consistent with the source image in this process. CycleGAN [11] introduces
cycle consistency loss to keep the reconstructed images similar to the source image.
Resembling cycle consistency loss, reconstructed loss is designed in DiscoGAN [26]
to keep the similarities between original and reconstructed images. DualGAN [27] adopts
dual learning to GANs and learns to transfer images between two domains. UNIT [28]
makes a shared-latent space assumption for transferring images. UGATIT [12] produces
attention masks and fuses the generation output with the attention mechanism to generate higher quality target images. Apart from the above unimodal image translation
problem, which is limited between two domains, multimodal image translation using
a single model in unpaired datasets is more challenging. StarGAN [29] performs
image-to-image translation for multiple domains. StarGAN v2 [30] enhances the performance by introducing the mapping network and style encoder. MUNIT [13] is another
multimodal translation model, which assumes that an image can be decomposed into a domain-invariant content code and a domain-specific style code. However, an empirical study reveals that these methods may introduce strong spatial distortion during paired thermal image translation [7]. Moreover, the quantitative results also indicate the unsatisfactory quality of images translated by both CycleGAN and UNIT. To address the problem of deformed translation, our proposed WGGAN aims to preserve structural information by adding discrete wavelet transformation to a variational autoencoder for unpaired thermal image translation.
13.3 Wavelet-guided generative adversarial network
13.3.1 Overall architecture
The proposed wavelet-guided generative adversarial network (WGGAN) is designed for
converting images from the IR domain to the RGB domain. Fig. 13.2 shows the overall architecture of the proposed method.
Fig. 13.2 The architecture of the proposed WGGAN.
The overall process consists of training and testing
stages. The training stage includes a proposed wavelet-guided variational autoencoder
(WGVA), a discriminator, and a cycle consistency loss. The proposed WGVA aims to
generate RGB images from real IR images, while a discriminator aims to recognize
the transferred RGB images from real RGB images. Moreover, the cycle consistency loss
[11] is implemented to train the WGVA due to the unaligned IR-RGB image pairs. After
the training stage, the WGVA can be launched to translate IR images as a standalone
model without the other components during the testing stage.
13.3.2 Wavelet-guided variational autoencoder
The proposed wavelet-guided variational autoencoder (WGVA) is developed based on
deep variational autoencoder and discrete wavelet transformation (DWT) [31]. As shown
in Fig. 13.3, the WGVA consists of two subnetworks: an encoder E for converting IR
image to latent space z, and a generator G to reconstruct the RGB image from z. Like
MUNIT, StarGAN, and CycleGAN [13, 29], the design of WGVA follows standard
residual autoencoder architecture with two convolutional layers, two pooling layers,
and four residual blocks at the encoder. Moreover, the architecture of the generator is
symmetrically reversed from the encoder. The details of the residual autoencoder can be found in Ref. [28]. Unlike the standard residual autoencoder, the latent space z is reparameterized to a variational distribution, which enables smooth and accurate generation [13]. In addition, inspired by DWT, wavelet pooling and wavelet unpooling layers are designed to substitute for the conventional pooling layers in the standard residual autoencoder. Skip connections then bridge the high-frequency components between the corresponding pooling and unpooling layers to improve generative resolution, as illustrated in Fig. 13.3. The details of reparameterization and wavelet pooling are introduced in the following sections.
Fig. 13.3 The illustration of wavelet-guided variational autoencoder (WGVA).
13.3.2.1 Reparameterization in latent space
The encoder-generator pair {E, G} constitutes the WGVA for IR-to-RGB translation as
shown in Fig. 13.3. The latent space z is the representation of an input IR image x.
According to the theoretical study, z should represent the variational distribution of the data to enable smooth generation [13]. To achieve this, the reparameterization is implemented as follows:

$z \sim q(z \mid x) = \mathcal{N}(z; \mu, \sigma^2 I)$  (13.1)

$z = \mu + \sigma \odot \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, \sigma^2 I)$  (13.2)

where ⊙ refers to the element-wise product; N(·) represents the normal distribution; μ is the mean of z; σ is the standard deviation of z; I is the identity matrix; and ε is the normal noise. Both σ and μ are learnable variables in the WGVA. In other words, μ and σ can be regarded as the approximated mean and standard deviation of the entire dataset after training. Through the above equations, the latent space z is standardized to the intended distribution with σ and μ. On the other hand, the normal noise ε is added to smooth z for stochastic optimization [13]. Therefore, the latent space z can be regarded as the variational distribution of the input image after reparameterization with the learned σ and μ.
13.3.2.2 Discrete wavelet transformation for pooling
According to the empirical studies [13, 31], standard residual blocks may generate distorted and blurred images. The major reason is that the generator lacks structural information from the encoder. To address this issue, we adopt the discrete wavelet transformation (DWT) to extract structural information at the pooling layers of the proposed method [31].
Discrete wavelet transformation (DWT) has four kernels, {LLᵀ, LHᵀ, HLᵀ, HHᵀ}, where the low (L) and high (H) pass filters are

$L^{\top} = \frac{1}{\sqrt{2}}\,[1,\ 1], \qquad H^{\top} = \frac{1}{\sqrt{2}}\,[-1,\ 1]$  (13.3)
Thus, the DWT can generate four types of output denoted as LL, LH, HL, HH,
respectively. Fig. 13.4 shows the examples after DWT. The output of LL has the smooth
texture of the images, while the rest of the outputs capture the vertical, horizontal, and
diagonal edges [31]. For simplicity, we denote the output of LL as low-frequency components and outputs of LH, HL, and HH as high-frequency components. The DWT
enables the proposed model to control the IR-to-RGB conversion by different components separately. Specifically, the low-frequency component can affect the overall generative texture, while the high-frequency components affect the generative structure.
Because the high-frequency components bypass the generative network without further processing, the structural information can be well maintained in these components. From this point of view, wavelet pooling and wavelet unpooling are proposed to use these components in the autoencoder for better IR-to-RGB translation, as shown in Fig. 13.5 [31].
Fig. 13.4 The illustration of discrete wavelet transformation (DWT).
Fig. 13.5 The illustration of wavelet pooling and wavelet unpooling.
The wavelet pooling applies DWT to the encoder layer to obtain low-frequency and high-frequency components. The kernels of the convolutional layer are replaced with the DWT kernels to apply the DWT within the deep neural layer, and the wavelet pooling layer is frozen (not updated) during optimization. Moreover, the stride of the layer is 2, producing downsampled features the same as conventional pooling layers [31]. The low-frequency component is further processed by the network, while the high-frequency components are skipped to the symmetrical wavelet unpooling layer in the generator. In the wavelet unpooling layer of the generator, the high-frequency and low-frequency components are concatenated. Then, the concatenated components are upsampled by transpose convolution [31].
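To make the wavelet pooling concrete, here is a minimal sketch that applies the fixed Haar filters of Eq. (13.3) as stride-2 convolution kernels and returns the low-frequency output together with the high-frequency skips. It is a sketch under assumptions (channel-wise grouped convolution, Haar filters), not the authors' implementation.

import torch
import torch.nn.functional as F

def haar_wavelet_pooling(x):
    # Build the four 2x2 kernels LL, LH, HL, HH from the 1-D Haar filters.
    low = torch.tensor([1.0, 1.0]) / 2 ** 0.5
    high = torch.tensor([-1.0, 1.0]) / 2 ** 0.5
    kernels = torch.stack([torch.outer(a, b) for a in (low, high) for b in (low, high)])
    kernels = kernels.unsqueeze(1)                      # shape (4, 1, 2, 2)
    c = x.shape[1]
    weight = kernels.repeat(c, 1, 1, 1)                 # four fixed kernels per channel
    # Frozen kernels with stride 2: downsampling like a conventional pooling layer.
    y = F.conv2d(x, weight, stride=2, groups=c)
    y = y.view(x.shape[0], c, 4, y.shape[2], y.shape[3])
    ll = y[:, :, 0]                                     # low-frequency component
    highs = (y[:, :, 1], y[:, :, 2], y[:, :, 3])        # high-frequency skips to unpooling
    return ll, highs

x = torch.randn(1, 64, 128, 128)
ll, highs = haar_wavelet_pooling(x)
print(ll.shape)  # torch.Size([1, 64, 64, 64])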
13.3.3 Objective functions in adversarial training
The full objective of the WGGAN comprises four loss functions: cycle-consistency loss,
ELBO loss, perceptual loss, and GAN loss [11, 13, 32].
Cycle-consistency loss. To train the proposed method with unpaired RGB and IR images, we adopt a cycle consistency loss similar to MUNIT and CycleGAN [11]. The basic idea of the cycle consistency loss is to include two generative networks for constraining the generated images. Two generative adversarial networks, GAN1 = {E1, G1, D1} for IR-to-RGB translation and GAN2 = {E2, G2, D2} for RGB-to-IR translation, are used in training, where E, G, and D denote encoder, generator, and discriminator, respectively. For simplicity, E(x) = z indicates that the latent space z is generated by encoder E. The theory of the loss is that the image translation cycle should be capable of bringing converted images back to the original images, i.e., x → E(x) → G1(z) → G2(G1(z)) ≈ x. The cycle-consistency loss is shown below:

$\mathcal{L}_{CC}(E_1, G_1, E_2, G_2) = \mathbb{E}_{x_1 \sim p(x_1)}\big[\| G_2(G_1(z_1)) - x_1 \|\big] + \mathbb{E}_{x_2 \sim p(x_2)}\big[\| G_1(G_2(z_2)) - x_2 \|\big]$  (13.4)

where ‖·‖ represents the ℓ1 distance.
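A minimal sketch of how Eq. (13.4) might be computed is given below. The encoder and generator objects are placeholders (PyTorch tensors are assumed), and the sketch spells out the second encoding step that the shorthand x → E(x) → G1(z) → G2(G1(z)) leaves implicit.

def cycle_consistency_loss(x_ir, x_rgb, E1, G1, E2, G2):
    # IR -> RGB -> IR reconstruction and RGB -> IR -> RGB reconstruction,
    # penalized with the L1 distance of Eq. (13.4).
    rec_ir = G2(E2(G1(E1(x_ir))))
    rec_rgb = G1(E1(G2(E2(x_rgb))))
    return (rec_ir - x_ir).abs().mean() + (rec_rgb - x_rgb).abs().mean()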
ELBO loss. The ELBO loss aims to minimize the variational upper bound over the latent space z. The objective function is

$\mathcal{L}_{E}(E, G) = \lambda_1\, \mathrm{KL}\big(q(z \mid x)\,\|\, p_\eta(z)\big) - \lambda_2\, \mathbb{E}_{z \sim q(z \mid x)}\big[\log p_G(x \mid z)\big]$  (13.5)

$\mathrm{KL}(P \,\|\, Q) = \sum_{x} P(x)\, \log \frac{P(x)}{Q(x)}$  (13.6)

where the hyperparameters λ1 and λ2 control the weights of the objective term and the KL divergence term that penalizes deviation of the latent-space distribution from the prior distribution pη(z). Here q(·) represents the reparameterization mentioned in the previous section, while pη(z) represents a zero-mean Gaussian distribution. pG(·) is the Laplacian distribution based on the generator, according to empirical studies [13].
Perceptual loss. Perceptual loss is a conventional loss function of neural style transfer, computed with the assistance of a pretrained VGG-16 [33] as shown in the following equation. The perceptual loss consists of two parts: the first term is the content loss and the second term is the style loss.

$\mathcal{L}_{P}(E, G, x_c, x_s) = \frac{1}{C_j H_j W_j} \big\| \phi_j(G(z)) - \phi_j(x_c) \big\|_2^2 + \frac{1}{C_j H_j W_j} \big\| \mathrm{Gr}\big(\phi_j(G(z))\big) - \mathrm{Gr}\big(\phi_j(x_s)\big) \big\|_2^2$  (13.7)

where φj(x) represents the feature map of the jth convolutional layer, of shape Cj × Hj × Wj, in the pretrained VGG-16; xc denotes content images while xs denotes style images; and Gr is the Gram matrix, which is used to represent image style. From this point of view, the perceptual loss aims to transfer the style of images while maintaining the image structure. The details of the perceptual loss can be found in Ref. [32].
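The style term of Eq. (13.7) relies on the Gram matrix of VGG-16 feature maps. The sketch below shows one way to compute both terms for a single feature layer; the feature extractor phi (e.g., a slice of a pretrained VGG-16) and the layer choice are assumptions for illustration, not the chapter's exact setting.

import torch

def gram_matrix(feat):
    # Gram matrix of a (N, C, H, W) feature map, normalized by C*H*W.
    n, c, h, w = feat.shape
    f = feat.view(n, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def perceptual_loss(phi, generated, content, style):
    # Content term plus Gram-matrix style term, following Eq. (13.7).
    f_gen, f_content, f_style = phi(generated), phi(content), phi(style)
    content_term = torch.mean((f_gen - f_content) ** 2)
    style_term = torch.mean((gram_matrix(f_gen) - gram_matrix(f_style)) ** 2)
    return content_term + style_term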
GAN loss. The GAN loss aims to ensure that the translated images resemble the images in the respective target domains [14]. For example, if the discriminator regards the synthetic IR images as real IR images, the synthetic IR images are successful.

$\mathcal{L}_{GAN}(E, G, D) = \mathbb{E}_{x \sim p(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z \mid x)}\big[\log\big(1 - D(G(z))\big)\big]$  (13.8)
Full loss. Finally, the complete loss function can be written as:
$\mathcal{L}_{total} = \lambda_E\, \mathcal{L}_E(E, G) + \lambda_P\, \mathcal{L}_P(E_1, G_1, x_1, x_2) + \lambda_{GAN}\, \mathcal{L}_{GAN}(E, G, D) + \lambda_{CC}\, \mathcal{L}_{CC}(E_1, G_1, E_2, G_2)$  (13.9)

where λE = 0.1, λP = 0.1, λGAN = 1, and λCC = 10 represent the weights of the ELBO loss, perceptual loss, GAN loss, and cycle consistency loss, respectively.
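Assembling the pieces, the weighted sum of Eq. (13.9) reduces to a few lines; the individual loss values are assumed to be computed by routines such as those sketched above.

# Weights reported for Eq. (13.9).
LAMBDA_E, LAMBDA_P, LAMBDA_GAN, LAMBDA_CC = 0.1, 0.1, 1.0, 10.0

def total_loss(l_elbo, l_perceptual, l_gan, l_cc):
    # Weighted combination of ELBO, perceptual, GAN, and cycle-consistency losses.
    return (LAMBDA_E * l_elbo + LAMBDA_P * l_perceptual
            + LAMBDA_GAN * l_gan + LAMBDA_CC * l_cc)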
13.4 Experiments
This section presents the details of the experiments. First, the section describes the implemented dataset and evaluation methods in this experiment. Then, the baselines and the
relevant experimental setup are also introduced. Finally, both qualitative and quantitative
analyses of translation results are presented.
13.4.1 Data description
The implemented dataset is FLIR ADAS [34], which is an open dataset for autonomous
driving. The dataset contains RGB and IR images from the same driving car. However,
the recorded RGB and IR images are unpaired due to the cameras’ different properties
[34]. For all experiments, the training and testing splits follow the dataset benchmark.
The training dataset contains 8862 IR images and 8363 RGB images, while there are 1363
IR images and 1257 RGB images in the testing dataset. The statistics of the FLIR ADAS
are presented in Table 13.1.
13.4.2 Evaluation methods
Qualitative analysis. In research on generative models, a human perceptual study is a direct way to compare the quality of translation among models. In this study, several graduate students and computer vision engineers are invited to subjectively evaluate the translated results from the proposed method and the baseline methods alongside the source IR image. They are then asked to select which output has the best quality and to provide short comments.
Table 13.1 Statistics in FLIR ADAS.

Dataset  | Image type | # of frames | Image size
Training | IR         | 8862        | 640 × 512
Training | RGB        | 8363        | 1800 × 1600
Testing  | IR         | 1363        | 640 × 512
Testing  | RGB        | 1257        | 1800 × 1600
Quantitative analysis. Numerically evaluating the proposed WGGAN and the other comparative methods is challenging since there are no paired RGB images. To measure the translation quality, we include four Inception-based metrics: 1-Nearest Neighbor classifier (1-NN), kernel maximum mean discrepancy (KMMD), Fréchet inception distance (FID), and Wasserstein distance (WD) [35]. These methods compute the distance between features of the target and generated images extracted by the Inception network [35]. If an IR image is well translated, these metrics will have small values, which indicates the generated RGB image is similar to the distribution of target RGB images. Besides, two no-reference image quality assessment (NR-IQA) methods, the blind/referenceless image spatial quality evaluator (BRISQUE) [36] and the natural image quality evaluator (NIQE) [37], are also used to evaluate the generated images independently, without any paired or unpaired reference images. Small values of these metrics indicate high quality of the translated images. Moreover, a multicriteria decision analysis method, TOPSIS [38], is included to summarize all the quantitative evaluation metrics.
13.4.3 Baselines
CycleGAN. CycleGAN consists of two standard residual autoencoders for training with
GAN loss and cycle-consistency loss [11].
MUNIT. MUNIT is similar to CycleGAN in that it also consists of two autoencoders. To generate diverse images, the encoder of MUNIT has a content branch and a style branch. Inspired by neural style transfer [13], the two branches are combined by adaptive instance normalization in the generator for image reconstruction.
StarGAN. StarGAN is a state-of-the-art generative method in facial attribute transfer
and facial expression synthesis. It includes mask vector and domain classification to
generate diverse output [29].
UGATIT. UGATIT adds an attention mechanism to the residual autoencoder with an auxiliary classifier, inspired by weakly supervised learning. Moreover, it also introduces adaptive instance normalization to the residual generator [12]. It achieves strong performance in tasks such as anime translation and style transfer.
13.4.4 Experimental setup
The adaptive moment optimization (ADAM) [12] is used as an optimizer for training the
proposed method where the learning rate is set to 0.00001, and momentums are set to 0.5
and 0.99. For improving the model’s robustness, the batch size was set as 1 with instance
normalization after each neural layer. The discriminator is adopted from PatchGAN [11].
Moreover, all the activation functions of neurons are set to the rectified linear unit
(ReLU), while the activation function of the output layer is Tanh to generate synthetic
images.
To make a fair comparison, both the WGGAN and the baseline models are trained for 27 epochs with batch size 1. All images are resized to 512 × 512 before being fed to the network. A desktop with an NVIDIA TITAN RTX, an Intel Core i7, and 64 GB of memory is used throughout the experiments.
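For reference, the reported optimizer settings translate into the following minimal PyTorch sketch; the parameter collections are hypothetical placeholders rather than the actual WGGAN modules.

import torch

# Hypothetical parameter lists standing in for the generator/encoder and discriminator.
gen_params = [torch.nn.Parameter(torch.zeros(1))]
disc_params = [torch.nn.Parameter(torch.zeros(1))]

# Learning rate 0.00001 with momentums (betas) 0.5 and 0.99, as stated above.
opt_g = torch.optim.Adam(gen_params, lr=1e-5, betas=(0.5, 0.99))
opt_d = torch.optim.Adam(disc_params, lr=1e-5, betas=(0.5, 0.99))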
13.4.5 Translation results
This subsection aims to present both qualitative and quantitative analyses of translation
results compared with baselines. In qualitative analysis, the examples of translation are
illustrated with subjective comments. On the other hand, the quantitative analysis
presents numerical results to compare the proposed WGGAN with baselines.
13.4.5.1 Qualitative analysis
Fig. 13.6 illustrates the translation results in the test set of FLIR ADAS. StarGAN has the
worst translation performance, with unsuitable colors and noisy black spots on the images. The other methods can generate clear edges of solid objects like
vehicles. However, the UGATIT is not capable of clearly translating objects such as trees
and houses. Compared with WGGAN, CycleGAN, and MUNIT, participants also point
out that the road texture is not well translated by UGATIT, as shown in Fig. 13.6. The
texture of the road is too smooth to present the details, such as the curb on the road. On
the other hand, CycleGAN can accurately translate the objects from IR images with
sharp edges and textures. Several participants also mention that there are some incorrectly mapped objects in the generated RGB images from CycleGAN. For example, trees
should not appear in the sky in Fig. 13.6B. On the contrary, both the proposed WGGAN
and MUNIT are able to translate the IR images with clear texture information of objects.
However, some parts of the images are not correctly translated, such as the sky and people, as shown in Fig. 13.6C. In the qualitative evaluation, participants indicate that the proposed WGGAN generates the best quality images, with clear texture and correctly mapped objects. Compared with other state-of-the-art methods, the generated images contain less scattered noise. To conclude, participants believe that the proposed WGGAN
has the best performance in IR-to-RGB translation.
13.4.5.2 Quantitative analysis
Table 13.2 illustrates the quantitative results of the IR-to-RGB translation. The best
result of each evaluation method is highlighted. It is difficult to identify the best method
within contemporary methods. CycleGAN achieves excellent performance in NR-IQA
evaluation, while MUNIT has better performance in 1-NN and KMMD. Unlike contemporary methods, the proposed WGGAN outperforms all the contemporary models
with the smallest values in 1-NN, KMMD, FID, and NIQE. For inception-based metrics, WGGAN has 26.1% and 53.1% improvement in KMMD and FID, which means
that the generated RGB images are similar to the target RGB domain.
Fig. 13.6 Examples of (A) source IR images, (B) proposed WGGAN, (C) CycleGAN, (D) MUNIT,
(E) StarGAN, and (F) UGATIT.
Table 13.2 Quantitative results of contemporary methods.

Models   | 1-NN  | KMMD  | FID   | WD    | BRISQUE | NIQE
CycleGAN | 0.961 | 0.318 | 0.222 | 61.60 | 15.17   | 2.730
MUNIT    | 0.927 | 0.237 | 0.157 | 67.50 | 27.99   | 2.750
StarGAN  | 0.992 | 0.397 | 0.121 | 75.73 | 36.55   | 6.392
UGATIT   | 0.959 | 0.283 | 0.098 | 65.88 | 36.81   | 3.663
WGGAN    | 0.924 | 0.175 | 0.046 | 65.97 | 28.89   | 2.477
Table 13.3 Ranking results by TOPSIS based on quantitative result.

Models   | TOPSIS
CycleGAN | 0.479
MUNIT    | 0.569
StarGAN  | 0.313
UGATIT   | 0.559
WGGAN    | 0.796
On the other hand, the WGGAN also achieves the best performance in NIQE, which indicates that its images are similar to natural images in terms of statistical regularities. To identify the best model across these evaluation metrics, we use a common multicriteria decision model, TOPSIS, to rank the image translation models. Table 13.3 shows that the WGGAN has the highest TOPSIS value. The ranking results also demonstrate that the proposed WGGAN generates higher-quality RGB images in comparison with the other contemporary image translation methods.
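As an illustration of how TOPSIS aggregates the metrics of Table 13.2, the following NumPy sketch implements the basic closeness score with equal criterion weights (an assumption; the chapter relies on the PyTOPS tool [38], so the numbers in Table 13.3 need not be reproduced exactly).

import numpy as np

def topsis(matrix, benefit):
    # Rank alternatives (rows) against criteria (columns).
    # benefit[j] is True when larger is better for criterion j.
    m = matrix / np.linalg.norm(matrix, axis=0)      # vector-normalize each criterion
    ideal = np.where(benefit, m.max(0), m.min(0))    # ideal solution
    anti = np.where(benefit, m.min(0), m.max(0))     # anti-ideal solution
    d_pos = np.linalg.norm(m - ideal, axis=1)
    d_neg = np.linalg.norm(m - anti, axis=1)
    return d_neg / (d_pos + d_neg)                   # closeness in [0, 1], higher is better

# All six metrics in Table 13.2 are "smaller is better".
rows = np.array([[0.961, 0.318, 0.222, 61.60, 15.17, 2.730],   # CycleGAN
                 [0.924, 0.175, 0.046, 65.97, 28.89, 2.477]])  # WGGAN
print(topsis(rows, benefit=np.array([False] * 6)))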
13.5 Conclusion
In this chapter, an IR-to-RGB image translation method, wavelet-guided generative
adversarial network (WGGAN), is proposed for context enhancement. A wavelet-guided variational autoencoder (WGVA) is proposed for generating smooth and clear
RGB images from the IR domain, which combines variational inference and discrete
wavelet transformation. In addition, more objective functions are introduced to improve
generative quality, such as ELBO loss and perceptual loss. Both qualitative and quantitative results demonstrate the effectiveness of the proposed WGGAN to enable better
context enhancement for IR-to-RGB translation. Many industrial applications can benefit from the proposed method, such as object detection at night for applications of semiautonomous driving, unmanned aerial vehicle (UAV) surveillance, and urban security.
Although the proposed WGGAN has promising results in thermal image translation,
there is still room for improvement. For example, the objects’ colors should be more discriminative from the background. Therefore, we first aim to add numerous IR and RGB
images for fully training the proposed WGGAN. Then, more advanced modules, such as
adaptive instance normalization, will be included to enhance the translation.
Acknowledgment
The first three authors are supported in part by grants from TerraSense Analytics Ltd. and Advanced
Research Computing, University of British Columbia.
References
[1] G. Bhatnagar, Z. Liu, A novel image fusion framework for night-vision navigation and surveillance,
Signal Image Video Process. 9 (1) (2015) 165–175.
[2] G. Hermosilla, F. Gallardo, G. Farias, C.S. Martin, Fusion of visible and thermal descriptors using
genetic algorithms for face recognition systems, Sensors 15 (8) (2015) 17944–17962.
[3] X. Chang, L. Jiao, F. Liu, F. Xin, Multicontourlet-based adaptive fusion of infrared and visible remote
sensing images, IEEE Geosci. Remote Sens. Lett. 7 (3) (2010) 549–553.
[4] T. Hamam, Y. Dordek, D. Cohen, Single-band infrared texture-based image colorization, in: 2012
IEEE 27th Convention of Electrical and Electronics Engineers in Israel, 2012, pp. 1–5.
[5] J. Ma, et al., Infrared and visible image fusion via detail preserving adversarial learning, Inf. Fusion
54 (2020) 85–98.
[6] X. Jin, et al., A survey of infrared and visual image fusion methods, Infrared Phys. Technol. 85 (2017)
478–501.
[7] S. Liu, V. John, E. Blasch, Z. Liu, Y. Huang, IR2VI: enhanced night environmental perception by
unsupervised thermal image translation, in: IEEE Computer Society Conference on Computer Vision
and Pattern Recognition Workshops, June, vol. 2018, 2018, pp. 1234–1241.
[8] H. Chang, O. Fried, Y. Liu, S. DiVerdi, A. Finkelstein, Palette-based photo recoloring, ACM Trans.
Graph. 34 (4) (2015).
[9] G. Larsson, M. Maire, G. Shakhnarovich, Learning representations for automatic colorization, in: Proceedings of European Conference on Computer Vision (ECCV), LNCS, vol. 9908, 2016, pp.
577–593.
[10] A.Y.-S. Chia, et al., Semantic colorization with internet images, ACM Trans. Graph. 30 (6) (2011) 1–8.
[11] J.Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent
adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision,
October, vol. 2017, 2017, pp. 2242–2251.
[12] J. Kim, M. Kim, H. Kang, K. Lee, U-GAT-IT: unsupervised generative attentional networks
with adaptive layer-instance normalization for image-to-image translation, in: ICLR 2020, 2020,
pp. 1–19.
[13] X. Huang, M.Y. Liu, S. Belongie, J. Kautz, Multimodal unsupervised image-to-image translation, in:
Proceedings of European Conference on Computer Vision (ECCV), LNCS, vol. 11207, 2018, pp.
179–196.
[14] I.J. Goodfellow, et al., Generative adversarial nets, Adv. Neural Inf. Process. Syst. 3 (2014) 2672–2680.
[15] A. Levin, D. Lischinski, Y. Weiss, Colorization using optimization, in: ACM SIGGRAPH 2004
Papers, 2004, pp. 689–694.
[16] Q. Luan, F. Wen, D. Cohen-Or, L. Liang, Y.-Q. Xu, H.-Y. Shum, Natural image colorization, in:
Proceedings of the 18th Eurographics Conference on Rendering Techniques, 2007, pp. 309–320.
[17] R.K. Gupta, A.Y.-S. Chia, D. Rajan, E.S. Ng, H. Zhiyong, Image colorization using similar images, in:
Proceedings of the 20th ACM international conference on Multimedia (MM’12), 2012.
[18] A. Hertzmann, C.E. Jacobs, N. Oliver, B. Curless, D.H. Salesin, Image analogies, in: Proceedings of
the 28th Annual Conference on Computer Graphics and Interactive Techniques, 2001, pp. 327–340.
[19] Y. Zheng, E. Blasch, Z. Liu, Multispectral Image Fusion and Night Vision Colorization, Society of
Photo-Optical Instrumentation Engineers, 2018.
[20] Z. Cheng, Q. Yang, B. Sheng, Deep colorization, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), vol. 2015, 2015, pp. 415–423.
[21] M. Limmer, H.P.A. Lensch, Infrared colorization using deep convolutional neural networks, in: 2016
15th IEEE International Conference on Machine Learning and Applications (ICMLA), 2016, pp.
61–68.
[22] P.L. Suárez, A.D. Sappa, B.X. Vintimilla, Infrared image colorization based on a triplet DCGAN architecture, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops
(CVPRW), 2017, pp. 212–217.
[23] P.L. Suárez, A.D. Sappa, B.X. Vintimilla, Learning to colorize infrared images, in: PAAMS, 2017.
[24] S. Tripathy, J. Kannala, E. Rahtu, Learning image-to-image translation using paired and unpaired training samples, in: Proceedings of Asian Conference on Computer Vision (ACCV), LNCS, vol. 11362,
2019, pp. 51–66.
[25] P. Isola, J.Y. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, in: Proceedings—30th IEEE Conference on Computer Vision and Pattern Recognition,
CVPR 2017, January, vol. 2017, 2017, pp. 5967–5976.
[26] T. Kim, M. Cha, H. Kim, J.K. Lee, J. Kim, Learning to discover cross-domain relations with generative
adversarial networks, in: The 34th International Conference on Machine Learning, March, vol. 4,
2017, pp. 2941–2949.
[27] Z. Yi, H. Zhang, P. Tan, M. Gong, DualGAN: unsupervised dual learning for image-to-image translation, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), October,
vol. 2017, 2017, pp. 2868–2876.
[28] M.Y. Liu, T. Breuel, J. Kautz, Unsupervised image-to-image translation networks, Adv. Neural Inf.
Process. Syst. 2017 (2017) 701–709.
[29] Y. Choi, M. Choi, M. Kim, J.W. Ha, S. Kim, J. Choo, StarGAN: unified generative adversarial networks for multi-domain image-to-image translation, in: Proceedings of the IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, 2018, pp. 8789–8797.
[30] Y. Choi, Y. Uh, J. Yoo, J.-W. Ha, StarGAN v2: diverse image synthesis for multiple domains, in:
CoRR, vol. abs/1912.0, 2019.
[31] J. Yoo, Y. Uh, S. Chun, B. Kang, J.W. Ha, Photorealistic style transfer via wavelet transforms, in: Proceedings of the IEEE International Conference on Computer Vision, October, vol. 2019, 2019, pp.
9035–9044.
[32] J. Johnson, A. Alahi, L. Fei-Fei, Perceptual losses for real-time style transfer and super-resolution,
in: Proceedings of the 14th European Conference on Computer Vision, Lecture Notes in Computer
Science (LNCS), vol. 9906, Springer, 2016, pp. 694–711.
[33] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in:
3rd International Conference on Learning Representations. ICLR 2015—Conference Track Proceedings, 2015, pp. 1–14.
[34] F.A. Group, FLIR thermal dataset for algorithm training, 2018, [Online]. Available from: https://
www.flir.in/oem/adas/adas-dataset-form.
[35] Q. Xu, et al., An empirical study on evaluation metrics of generative adversarial networks, in: CoRR,
vol. arXiv:1806, 2018, pp. 1–14.
[36] A. Mittal, A.K. Moorthy, A.C. Bovik, No-reference image quality assessment in the spatial domain,
IEEE Trans. Image Process. 21 (12) (2012) 4695–4708.
[37] A. Mittal, R. Soundararajan, A.C. Bovik, Making a ‘completely blind’ image quality analyzer, IEEE
Signal Process. Lett. 20 (3) (2013) 209–212.
[38] V. Yadav, S. Karmakar, P.P. Kalbar, A.K. Dikshit, PyTOPS: a python based tool for TOPSIS, SoftwareX 9 (2019) 217–222.
CHAPTER 14
Generative adversarial network
for video analytics
A. Sasithradevi a, S. Mohamed Mansoor Roomi b, and R. Sivaranjani c
a School of Electronics Engineering, VIT University, Chennai, India
b Department of Electronics and Communication Engineering, Thiagarajar College of Engineering, Madurai, India
c Department of Electronics and Communication Engineering, Sethu Institute of Technology, Madurai, India
14.1 Introduction
The objective of video analytics is to recognize the events in videos automatically. Video analytics can detect events such as a sudden burst of flames, suspicious movement of vehicles and pedestrians, and abnormal movement of a vehicle not obeying traffic signs. A commonly known application in the research field of video analytics is video surveillance, which started evolving 50 years ago. The principle behind video surveillance is to involve human operators to monitor the events occurring in a public area, room, or other space of interest. In general, an operator is given full responsibility for several cameras, and studies have shown that increasing the number of cameras to be monitored per operator degrades the operator's performance. Hence, video analysis software aims to provide a better trade-off between accurate event detection and the huge volume of video information [1–3]. Machine learning, and in particular its descendant, deep learning, has prompted research in the video analytics domain. The fundamental purpose of deep learning here is to identify a sophisticated model that captures the probability distributions over the different video samples that need to be analyzed.
Generative adversarial network (GAN) provides an efficient way to learn deep representations with minimal training data. GAN is an evolving technique for generating
and representing the samples using both unsupervised and semisupervised learning
methods. It is accomplished through the implicit modeling of high-dimensional data distribution. The underlying working principle of GAN is to train the pair of networks in
competition with each other. Among these networks, one acts like an imitator and the other as a skillful specialist. Formally, the generator creates fake data mimicking the real data, and the discriminator is an expert trained to distinguish the real samples from the forged ones. Both networks are trained simultaneously in competition with each other. This generic framework for a GAN is shown in Fig. 14.1. Both the generator and the discriminator are neural networks, where the former generates new instances and the latter assesses whether the instances belong to the dataset.
Fig. 14.1 Generic framework for generative adversarial networks.
For the purpose of classification, the discriminator plays the role of a classifier to distinguish the real from the fake. To build a GAN, one needs a training dataset and a clear idea about the desired output. Initially, a GAN learns from a simple distribution of 2D data; with continued training, it can mimic high-dimensional data distributions. During the training phase, both competing networks learn the attributes of the data distribution. The data samples generated by the generator, along with the real data samples, are used to train the discriminator. After sufficient training, the generator is trained against the discriminator, and thus the generator learns to map any random data sample. Consider the scenario in Fig. 14.1, where a D-dimensional noise vector obtained from the latent space is fed into the generator, which converts it into new data samples. The discriminator then processes both the real and fake samples to classify them. The main advantage of GANs lies in their randomness, which helps them create new data samples rather than exact replicas of the real data. Another crucial advantage of GANs over autoencoders [4] and Boltzmann machines [5] is that GANs do not rely on Markov chains for generating training models. GANs were designed to eliminate the high complexity associated with Markov chains. Also, the generator function is subject to fewer restrictions compared with Boltzmann machines. Owing to these advantages, GANs have attracted a variety of applications, and the desire to utilize them in numerous areas is increasing. They have been effectively used in a wide variety of tasks such as image-to-image translation, obtaining high-resolution images from low-resolution images, deciding the drugs for treating desired diseases, image retrieval, object recognition, text-to-image translation, intelligent video analysis [6], and so on. In this chapter, we present an overview of the working principle of GANs and the variants available for video analytics. We also emphasize the pros, cons, and challenges for the fruitful implementation of GANs in different video analytics problems.
The remainder of this chapter is organized as follows: Section 14.2 describes the building blocks of GANs, their driving factor, namely the objective functions, and the challenging issues of GANs. Section 14.3 highlights the variants of GANs that have emerged for the problem of video analytics in past years. Section 14.4 discusses possible future work in the area of GAN-based video analytics. Section 14.5 concludes this chapter.
14.2 Building blocks of GAN
This section describes the basic building blocks of GAN and the different objective
functions used for training the GAN architectures.
14.2.1 Training process
The training process, driven by the objective or cost function, is the basic building block of GANs. Training a GAN is a dual process that includes choosing the parameters of a generator that confuses the discriminator with fake data and of a discriminator that maximizes the accuracy for a given application. The algorithm involved in the training process is described as follows:
Algorithm 14.1
Step 1: Update the parameters of the discriminator θD:
  Input: "m" samples from real frames and "m" samples from noise data.
  Do: Compute the expected gradient ∇θD = f{JθD(θD; θG)}.
  Update: θD ← update(θD, ∇θD).
Step 2: Update the parameters of the generator θG:
  Input: "m" samples from noise data and θD.
  Do: Compute the expected gradient ∇θG = f{JθG(θG; θD)}.
  Update: θG ← update(θG, ∇θG).
The objective or cost function V(G, D) for training depends on the two competing networks. The training process includes both maximization and minimization as

$\max_{D} \min_{G} V(G, D)$  (14.1)

where $V(G, D) = \mathbb{E}_{p_{data}(x)}[\log D(x)] + \mathbb{E}_{p_g(x)}[\log(1 - D(x))]$.
As illustrated in Algorithm 14.1, one model's parameters are updated while the other's are fixed. An optimal discriminator $D^{*}(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$ is available for any fixed generator G [7]. The generator is optimal when Pg(x) = Pdata(x), which shows that the generator reaches an optimal point only when the discriminator is totally confused in discriminating the real data from the fakes. The discriminator is not trained completely until the generator reaches the optimum value, but the generator is updated simultaneously with the discriminator. An alternate cost function typically used for updating the generator is maxG log D(G(Z)) instead of minG log(1 − D(G(Z))).
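A minimal sketch of the alternating updates in Algorithm 14.1, using the non-saturating generator objective mentioned above, is shown below; the generator G and discriminator D are placeholder networks (the discriminator is assumed to output a probability of shape (m, 1)), so this is an illustration rather than a specific model from the chapter.

import torch
import torch.nn.functional as F

def train_step(G, D, opt_g, opt_d, real, noise_dim=100):
    m = real.size(0)

    # Step 1: update discriminator parameters theta_D on m real and m fake samples.
    z = torch.randn(m, noise_dim)
    fake = G(z).detach()
    d_loss = (F.binary_cross_entropy(D(real), torch.ones(m, 1))
              + F.binary_cross_entropy(D(fake), torch.zeros(m, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Step 2: update generator parameters theta_G by maximizing log D(G(z)).
    z = torch.randn(m, noise_dim)
    g_loss = F.binary_cross_entropy(D(G(z)), torch.ones(m, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()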
14.2.2 Objective functions
The main objective of generative models is to make Pg(x) equivalent to the real data
distribution Pdata(x). Hence, the underlying goal in training the generator is to reduce the dissimilarity between the two distributions [8]. In recent years, researchers have attempted to utilize various dissimilarity measures to improve the performance of GANs. This section describes how the dissimilarity is computed using various measures and objective functions.
f-Divergence: The f-divergence is a dissimilarity measure between two distributions defined through a convex function f. The f-divergence between the two distributions [8], namely Pg(x) and Pdata(x), is written as

$D_f\big(P_{data} \,\|\, P_g\big) = \int_{x_1}^{x_2} P_g(x)\, f\!\left(\frac{P_{data}(x)}{P_g(x)}\right) dx$  (14.2)
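As a concrete instance of Eq. (14.2), choosing f(t) = t log t recovers the KL divergence. The short NumPy sketch below evaluates it for discrete distributions, which replaces the integral by a sum (an illustrative simplification).

import numpy as np

def f_divergence(p_data, p_g, f):
    # D_f(P_data || P_g) for discrete distributions: sum over the support.
    return float(np.sum(p_g * f(p_data / p_g)))

p_data = np.array([0.5, 0.3, 0.2])
p_g = np.array([0.4, 0.4, 0.2])
# f(t) = t * log(t) turns the f-divergence into KL(P_data || P_g).
print(f_divergence(p_data, p_g, lambda t: t * np.log(t)))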
Integral probability metric: The integral probability metric (IPM) provides the maximal dissimilarity measure over a class of functions F [8]. Consider the data space X with the set of probability distributions on it denoted P(X). The IPM distance between the distributions Pdata, Pg ∈ P(X) is defined as

$d_{\mathcal{F}}\big(P_{data}, P_g\big) = \sup_{f \in \mathcal{F}} \; \mathbb{E}_{x \sim P_{data}}\big[f(x)\big] - \mathbb{E}_{x \sim P_g}\big[f(x)\big]$  (14.3)
Auxiliary objective functions: The auxiliary functions related to the adversarial objective functions are the reconstruction and classification objective functions.
• Reconstruction objective function: The goal of the reconstruction objective function is to minimize the difference between the output image of the neural network and the real image provided as input to the neural network [9, 10]. This type of objective function helps the generator preserve the content of the real image data and allows the use of an autoencoder architecture for the GAN's discriminator [11, 12]. The discrepancy evaluated by the reconstruction objective function mostly involves the L1 norm.
• Classification objective function: The discriminator network can also be used as a classifier [13, 14] when the cross-entropy loss is employed as an objective function in the discriminator. The cross-entropy loss is widely used in many GAN applications for semisupervised learning and domain adaptation. This objective function can also be used to train the generator and discriminator jointly for classification, as sketched below.
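The sketch below shows the two auxiliary objectives in their simplest form, assuming a discriminator that additionally outputs class logits; this is an illustrative design rather than a specific model from the chapter.

import torch.nn.functional as F

def reconstruction_loss(reconstructed, real):
    # L1 reconstruction objective between the network output and the real input.
    return F.l1_loss(reconstructed, real)

def classification_loss(class_logits, labels):
    # Cross-entropy objective when the discriminator also acts as a classifier.
    return F.cross_entropy(class_logits, labels)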
14.3 GAN variations for video analytics
In recent years, intelligent video analytics has become an emerging technology and
research field in academia and industry. The scenes in videos are recorded by cameras that aid in the invigilation of happenings in areas where human ability fails. Recently, a huge number of cameras have been utilized for useful purposes [6] like fire detection, person detection and tracking, vehicle detection, smoke detection, and unknown object and crime detection at country borders, shopping malls, airports, sports stadiums, underground stations, residential areas, academic campuses, and so on. Manually monitoring the videos is cumbersome due to obstacles like operator drowsiness, distraction due to increased responsibilities, etc. This prompts the need for semisupervised approaches for analyzing the events in videos [15, 16]. Hence, intelligent video analytics is a challenging problem in the field of computer vision where deep networks have not yet superseded classical handcrafted attributes. To date, video analytics has traveled a long journey from holistic features such as motion history image (MHI) [17], motion energy image (MEI) [18], and action banks [19] up to local feature-based approaches like HOG3D [20], the spatiotemporal histogram of radon projection (STHRP) [21], the histogram of optical flow [22], and tracking approaches. One efficient approach is to employ deep networks for learning and analyzing the videos without knowledge of class labels but with the sequential organization of frames, termed "weak supervision." This technique also requires a little supervision in the strategies for providing input to deep neural networks, such as sampling, encoding, and organizing methods. Unlike deep networks, generative models called GANs [23, 24] have been successfully implemented in the field of video analytics without human intervention in labeling the videos, for applications such as future video frame prediction [25]. Over a period of time, the architecture of GAN has been modified for various applications like video generation, video prediction, action recognition, video summarization, video understanding, and so on, as listed in Table 14.1.
14.3.1 GAN variations for video generation and prediction
Recent progress in generative models [26] has attracted researchers to examine image synthesis. In particular, GANs have been employed to synthesize images from a random data distribution, through nonlinear transformation from a prime image to a synthesized one, or to generate synthesized images from a source domain. These advances in image synthesis have given researchers the confidence to utilize GANs for generating video sequences. One of the challenging issues in using GANs for generating and predicting videos is that the output of the GAN architectures must provide meaningful video responses. This challenge adds a huge responsibility to the GAN, which includes understanding both the spatial and the temporal content of the video. One such extension of GAN is MoCoGAN [27], which is used for generating videos with no prior knowledge about a priming image.
Table 14.1 GAN variations.

S. no | GAN variation        | Application
1     | MoCoGAN              | Video generation
2     | VGGAN                | Video generation
3     | LGGAN                | Video generation
4     | TGANs                | Video generation
5     | Dynamic transfer GAN | Video generation
6     | FTGAN                | Video generation
7     | DMGAN                | Video prediction
8     | AMCGAN               | Video prediction
9     | Discrimnet           | Action recognition
10    | HiGAN                | Video recognition
11    | DCycle GAN           | Face translation between images and videos
12    | PoseGAN              | Human pose estimation
13    | Recycle GAN          | Video retargeting
14    | DTRGAN               | Video summarization
This variant of the GAN architecture partitions the input data distribution into two subspaces, namely the content and motion subspaces. The content subspace is sampled from a Gaussian distribution, whereas motion subspace sampling is accomplished by an RNN. These two subspaces give rise to two discriminators, called the content and motion discriminators. Even though MoCoGAN can generate videos of variable length, the motion discriminator was designed to handle only a limited number of frames. As shown in Fig. 14.2, the spatial content generation was performed for different instances of appearance while the motion was fixed at the same expression.
Another useful variant of GAN is dynamic transfer GAN [28]. It attempts to generate
the video sequence by transferring the dynamics of temporal motion available in the
Fig. 14.2 Example frames generated by MoCoGAN [27].
Fig. 14.3 Frames generated by dynamic GAN [28] for anger expression.
source video sequence onto a prime target image. This target image contains the spatial
content of the video data and the dynamic information is obtained from the arbitrary
motion.
An RNN is used for spatiotemporal encoding. This dynamic GAN can generate video
sequences of variable length through the competition between a generator and two discriminator networks. Of the two discriminators, one acts as a spatial discriminator
that monitors the fidelity of the generated frames, and the other acts as a dynamic discriminator that maintains the integrity of the entire video sequence. The authors provided
visualizations to demonstrate the ability of the dynamic GAN to encode the enriched
dynamics of source videos while suppressing their appearance features. Fig. 14.3 shows an
example of frames generated using dynamic GAN for anger expression.
Ohnishi et al. developed the flow and texture generative adversarial network (FTGAN)
model [29] to generate videos hierarchically from orthogonal information. FTGAN
comprises two networks, namely FlowGAN and TextureGAN. This variation of the
GAN architecture was proposed to explore the representation and to generate videos without
an enormous annotation cost. FlowGAN generates optical flow, which provides
the edge and motion information for the video to be generated. The RGB videos are
then generated from the optical flow by TextureGAN. The generic framework for FTGAN
is shown in Fig. 14.4.
Fig. 14.4 Generic framework for FTGAN [29].
TextureGAN preserves the consistency of the foreground and the scenes while accumulating texture information with the generated optical flow. This model provides a
progressive advance toward generating more realistic videos without labeled data. The prime
advantage of FTGAN is that both GANs share complementary information about the
video content. The authors used both real and computer graphics (CG) videos for training TextureGAN and FlowGAN. The real-world dataset, Penn Action, contains
2326 videos of 15 different classes, whereas the CG human video dataset, SURREAL, consists of 67,582 videos. TextureGAN and FlowGAN are trained on these
datasets for 60k iterations. The accuracies obtained on the SURREAL dataset are 44% and 54%
using TextureGAN and FlowGAN, respectively. On the Penn Action dataset, the
accuracy obtained through TextureGAN is 72% and through FlowGAN is 58%.
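The two-stage decomposition can be pictured with the following schematic PyTorch sketch: a flow generator maps a latent code to optical flow, and a texture generator maps the flow (plus noise) to RGB frames. Both modules here are deliberately trivial placeholders used only to show the data flow; the real networks in Ref. [29] are 3D convolutional GANs with their own architectures.

```python
# Schematic of FTGAN's flow-then-texture pipeline (placeholders, not the paper's networks).
import torch
import torch.nn as nn

class FlowGenerator(nn.Module):
    def __init__(self, z_dim=100, frames=16, size=64):
        super().__init__()
        self.frames, self.size = frames, size
        self.fc = nn.Linear(z_dim, frames * 2 * size * size)  # 2 flow channels (dx, dy)

    def forward(self, z):
        flow = self.fc(z).view(-1, 2, self.frames, self.size, self.size)
        return torch.tanh(flow)

class TextureGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        # Maps 2-channel flow + 1-channel noise to 3-channel RGB frames.
        self.conv = nn.Conv3d(3, 3, kernel_size=3, padding=1)

    def forward(self, flow, noise):
        return torch.tanh(self.conv(torch.cat([flow, noise], dim=1)))

z = torch.randn(2, 100)
flow = FlowGenerator()(z)                 # (2, 2, 16, 64, 64)
noise = torch.randn(2, 1, 16, 64, 64)
video = TextureGenerator()(flow, noise)   # (2, 3, 16, 64, 64)
```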
A multistage dynamic generative adversarial network (MSDGAN) was proposed for
generating time-lapse videos of high resolution. The process involved in MSDGAN [30]
is twofold: at the initial stage, realistic content is generated for each frame in the
video; the next stage refines the videos generated by the first stage using
motion dynamics, which makes the videos closer to real ones. The authors
used a large-scale time-lapse dataset to test the videos. This model generates realistic
videos of up to 128 × 128 resolution for 32 frames. They collected over 5000 time-lapse videos from YouTube, from which short clips were created manually. After that, the
short video clips are partitioned into frames and MSDGAN is used to generate clips.
A short video clip can be generated from 32 continuous frames. Fig. 14.5 shows the
frames generated by MSDGAN; the red circles indicate motion between adjacent
frames.
Fig. 14.5 Frames generated by MSDGAN [30], given the start frame 1.
A robust one-stream video generation architecture, an extension of the Wasserstein GAN architecture known as the improved video generative adversarial network
(iVGAN), is another variation of GAN. This model generates the whole video clip without separating foreground from background. Similar to the classical GAN, the iVGAN [31]
model has two networks, a generator and a critic/discriminator network. The
aim of the generator network is to create videos from a low-dimensional latent code.
The critic network discriminates real from fake data and is updated in competition with the
generator. The iVGAN architecture tackles challenging issues in video analytics such
as future frame prediction, video colorization, and inpainting. The authors used different
datasets, such as stabilized videos collected from YouTube and the Airplanes dataset. For inpainting, the model reconstructs the spatial and temporal information of videos by filling in damaged holes. Fig. 14.6 depicts example video frames generated by
iVGAN.
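Since iVGAN extends the Wasserstein GAN, its training signal can be summarized with the critic and generator objectives sketched below. Only the loss structure is shown; the network bodies, the gradient penalty used in practice, and all shapes are left out, and the function names are assumptions rather than the authors' implementation.

```python
# Hedged sketch of a Wasserstein-style objective: the generator maps a
# low-dimensional latent code to a whole clip, and the critic scores clips.
import torch

def critic_loss(critic, real_clips, latent, generator):
    fake_clips = generator(latent).detach()
    # The critic tries to assign higher scores to real clips than to fakes.
    return critic(fake_clips).mean() - critic(real_clips).mean()

def generator_loss(critic, latent, generator):
    # The generator tries to raise the critic's score on its own samples.
    return -critic(generator(latent)).mean()
```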
One useful effort toward generating videos from a given description/caption was made by Pan et al. [32]. This kind of text-conditioned video generation is attractive for real-time applications. It is attained through an efficient extension of the GAN architecture termed the temporal GAN (TGAN). TGAN consists of a
generator and three discriminator networks. The input to the generator network is
the combination of a noise vector and the encoded sentence derived from an LSTM network. The generator produces the frames of the video sequence using 3D convolution
operators. The three discriminators in TGANs serve video, frame,
and motion discrimination. Two of these networks distinguish real from fake videos or frames formed by the generator; in addition,
they discriminate semantically matched from mismatched
video/frame-text description pairs. The last discriminator network
improves the temporal coherence between the real and generated frames. The whole
TGAN architecture is trained end to end. This GAN variant is evaluated
on datasets such as SMBG, TBMG, and MSVD to generate videos from captions. The
coherence metric reflects the readability and temporal coherence of the videos; a coherence
metric of 1.86 is reported for TGANs. Table 14.2 enumerates the collection of datasets
used for evaluating the GAN variants proposed for video generation.
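The caption-conditioned generation step can be illustrated with the PyTorch sketch below: an LSTM encodes the sentence, the encoding is concatenated with a noise vector, and a small stack of 3D transposed convolutions produces a short clip. All layer sizes and the vocabulary size are assumptions, not the configuration used in Ref. [32].

```python
# Illustrative caption-to-video generator: LSTM sentence encoding + noise -> 3D deconvolutions.
import torch
import torch.nn as nn

class CaptionToVideoGenerator(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=128, text_dim=256, z_dim=100):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, text_dim, batch_first=True)
        self.decode = nn.Sequential(
            # Project (noise + sentence code) to a small spatiotemporal volume.
            nn.ConvTranspose3d(z_dim + text_dim, 256, kernel_size=(2, 4, 4)),
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(256, 3, kernel_size=4, stride=2, padding=1),
            nn.Tanh(),
        )

    def forward(self, tokens, z):
        _, (h, _) = self.lstm(self.embed(tokens))    # h: (1, batch, text_dim)
        cond = torch.cat([z, h.squeeze(0)], dim=1)   # (batch, z_dim + text_dim)
        return self.decode(cond[:, :, None, None, None])  # (batch, 3, T, H, W)

gen = CaptionToVideoGenerator()
clip = gen(torch.randint(0, 5000, (2, 12)), torch.randn(2, 100))
```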
14.3.2 GAN variations for video recognition
Video recognition is usually performed using a large number of labeled videos during the training
session. For a new test task, many videos are unlabeled and must be annotated, which
requires human effort for every video; annotating a large set of data is a tedious process.
To overcome this, Yu et al. proposed a novel approach called hierarchical generative adversarial networks (HiGAN) [3], in which fully labeled images are
utilized to recognize the unlabeled videos.
Fig. 14.6 Video frames generated using iVGAN [31].
Table 14.2 List of datasets available to validate video generation techniques.

S. no | Model                 | Dataset
1     | MoCoGAN               | MUG facial expression dataset, YouTube videos, Weizmann action dataset, and UCF101
2     | TGANs                 | SMBG, TBMG, MSVD
3     | Dynamic transfer GAN  | CASIA
4     | FTGAN                 | Penn Action, SURREAL
5     | iVGAN                 | Tiny videos, Airplane dataset
6     | MSDGAN                | YouTube videos, Beach dataset, Golf dataset
The idea behind the HiGAN model is to combine a low-level conditional GAN and a high-level conditional GAN and to exploit
the adversarial learning from both. This method also provides a domain-invariant feature
representation between labeled images and unlabeled videos. The performance is evaluated by conducting experiments on two complex video datasets, UCF101 [33] and
HMDB51 [34]. In this work, each target video is split into 16-frame clips without
any overlap, and a video clip domain is constructed by combining all the target video clips.
From each video clip, a 512-D deep feature vector is extracted from the pool5 layer of a 3D
ConvNet trained on a large-scale video dataset. HiGAN outperforms the C3D approach [35] in terms of recognition rate on both datasets, with an improvement of about 4% on
UCF101 and 10% on HMDB51.
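The clip preparation step described above can be sketched in a few lines: the target video is split into non-overlapping 16-frame clips, each of which would then be passed through a pretrained 3D ConvNet to obtain its 512-D pool5 feature. The feature extractor is left as a commented-out hypothetical call.

```python
# Split a video tensor into non-overlapping 16-frame clips for feature extraction.
import torch

def split_into_clips(video, clip_len=16):
    """video: tensor of shape (num_frames, C, H, W)."""
    num_clips = video.shape[0] // clip_len              # drop the trailing remainder
    clips = video[:num_clips * clip_len]
    return clips.view(num_clips, clip_len, *video.shape[1:])

video = torch.rand(130, 3, 112, 112)                    # e.g. 130 RGB frames
clips = split_into_clips(video)                          # (8, 16, 3, 112, 112)
# features = torch.stack([c3d_pool5(c) for c in clips])  # hypothetical 3D ConvNet call
```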
Human behavior understanding in video is still a challenging task; it requires an accurate model to handle both pixel-wise and global-level prediction. Spampinato et al.
[36] demonstrated an adversarial GAN-based framework that learns video representations
through unsupervised learning to perform both local and global prediction of human
behavior in videos. In this approach, video synthesis is first factorized into the generation of static visual content and motion; spatiotemporal coherency of object trajectories is then enforced; and finally motion estimation
and pixel-wise dense prediction are incorporated. This self-supervised way of learning provides an
effective feature set that is further used for video object segmentation and video action
tasks [37]. In addition, the proposed segmentation network can be integrated into any
other segmentation model for supervision, providing a strong model of object
motion. A wide range of experimental evaluations showed that VOS-GAN models object motion better than existing video generation methods
such as VGAN, TGAN, and MoCoGAN.
In previous research, the following approaches were used for video
retargeting [38]: the first is performed domain-wise and is not applicable to other domains; the second operates across domains but
needs manual supervision for labeling and alignment of information; and the last
is unsupervised, unpaired image translation, where learning is done mutually in different domains but provides insufficient information for processing video. Bansal et al.
[38] proposed a new unsupervised data-driven approach for effective video retargeting
that incorporates spatiotemporal information with conditional generative adversarial
networks (GANs). It combines spatial and temporal information with adversarial losses for translating content while preserving style. The publicly available Viper
dataset is used for image-to-labels and labels-to-image experiments to evaluate
the results of video retargeting. The performance measures, mean pixel accuracy
(M), average class accuracy (AC), and intersection over union (IoU), show comparatively better results for the combination of Cycle-GAN and Recycle-GAN.
Jang and Kim [39] developed the appearance and motion conditions generative adversarial network (AMC-GAN), which consists of a generator, two discriminators, and a perceptual ranking module. The two discriminators monitor the appearance and motion
features. They used a new conditioning scheme that helps the training by varying appearance and motion conditions. The perceptual ranking module enables AMC-GAN to
understand the events in the video. The AMC-GAN model is evaluated on the MUG facial
expressions and NATOPS human action datasets. The MUG dataset consists of 931 video
clips covering six basic emotions: anger, disgust, fear, happiness, sadness, and surprise. It
is preprocessed to obtain 32 frames at a resolution of 64 × 64 pixels. The NATOPS human
action dataset has 9600 videos containing 24 different actions.
In unsupervised video representation learning, future frame prediction is a challenging task.
Existing methods operate directly on pixels, which results in blurry predictions of the future
frame. Liang et al. [26] proposed a dual motion generative adversarial network (GAN) architecture that predicts future frames in a video sequence through a dual learning mechanism.
Future frame prediction and dual future flow prediction form a closed loop, achieving
better video prediction by generating informative feedback signals for each other. The
dual motion GAN has a fully differentiable network architecture for video prediction.
Extensive experiments on video frame prediction, flow prediction, and unsupervised
video representation learning demonstrate the contributions of the dual motion GAN to
motion encoding and predictive learning. Caltech and YouTube clips are taken for
future frame analysis to show the video recognition performance of the dual motion
GAN compared to other existing approaches on the KITTI dataset. Performance
evaluation metrics such as mean square error (MSE), peak signal-to-noise ratio
(PSNR), and the structural similarity index metric (SSIM) are used to evaluate the image
quality of future frame prediction; higher PSNR and SSIM are achieved with the dual motion
GAN. The implementation is based on the public Torch7 platform on a single NVIDIA GeForce GTX 1080, and the dual motion GAN takes around 300 ms to predict one future
frame.
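For reference, the frame-quality metrics mentioned above can be computed as follows; the sketch covers MSE and PSNR for 8-bit frames with NumPy, while SSIM is usually taken from an imaging library such as scikit-image rather than re-implemented. The random arrays are placeholders for predicted and ground-truth frames.

```python
# MSE and PSNR for 8-bit frames; higher PSNR means better reconstruction quality.
import numpy as np

def mse(pred, target):
    return np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)

def psnr(pred, target, max_val=255.0):
    err = mse(pred, target)
    return float("inf") if err == 0 else 10.0 * np.log10(max_val ** 2 / err)

pred = np.random.randint(0, 256, (128, 160, 3), dtype=np.uint8)
target = np.random.randint(0, 256, (128, 160, 3), dtype=np.uint8)
print(mse(pred, target), psnr(pred, target))
```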
14.3.3 GAN variations for video summarization
Owing to the huge amount of multimedia data produced by the progressive growth of video capturing devices, video summarization [12, 40–42] plays a crucial
role in video analytics. Video summarization [43] extracts representative and
useful content from a video for data analysis and is highly useful in large-scale video
analysis. One efficient approach to video summarization is to derive suitable key frames from the entire video, such that this set of key frames is enough to portray
the story of the video. To enhance the quality of summarization, some challenges need to be tackled by summarization techniques. The first challenge is to
choose a fine key frame selection strategy that takes into account the temporal relation
of the frames within the video and the importance of the key frames. The next challenge
is to devise a mechanism to assess the preciseness and completeness of the selected key
frames. To address these issues, several models have been introduced, such as
feature-based approaches [12], long short-term memory (LSTM)-based models
[1, 44], and determinantal point process (DPP)-based techniques [45]. Owing to the
memory problems that arise in LSTM and DPP models and the redundant key frame issue
in feature-based approaches, GANs have attracted researchers in this community
because of their regularization ability. One of the GAN variants proposed for video summarization, namely the dilated temporal relational generative adversarial network
(DTRGAN) [46], is shown in Fig. 14.7.
The generator contains two units, namely a dilated temporal relational (DTR) unit and a bidirectional LSTM (Bi-LSTM). The generator receives the video and the real summary of the
respective video as input; the DTR unit aims to tackle the first challenge. The inputs to
the discriminator are the real, generated, and random summary pairs, and the purpose of the
discriminator is to optimize the corresponding adversarial losses during training. A supervised generator loss term is introduced to attain the completeness and preciseness of the
key frames.
Fig. 14.7 Architecture of DTRGAN [46].
14.3.4 PoseGAN
Walker et al. [25] developed a video forecasting technique that generates future poses
using a generative adversarial network (GAN) and variational autoencoders (VAEs). In this
approach, video forecasting is attained by generating videos directly in pixel space. The
approach models the whole structure of the videos, including the scene dynamics, jointly under unconstrained environments. The authors divided the video forecasting
problem into two stages: the first stage handles the high-level structure of the video, such as
human poses, and uses a VAE to predict the future actions of a human; the second stage then renders the future frames in pixel space conditioned on the predicted poses, using an adversarial network. The authors used
the UCF101 dataset to evaluate the PoseGAN architecture in predicting the future poses
of humans.
14.4 Discussion
14.4.1 Advantages of GAN
One of the major advantages of GANs is that they do not require knowledge about the
shape of the generator's probability distribution. Hence, GANs avoid the need
for predetermined density forms when representing highly complex and high-dimensional data
distributions.
Reduced time complexity: The sampling of generated data can be parallelized in
GANs, which makes them considerably faster than PixelRNN [47], WaveNet [48], and
PixelCNN [49]. In the future frame prediction problem [39], autoregressive models
rely on the previous pixel values of a frame to predict the probability
distribution of the next pixel. Hence, the generation of a future frame is slow,
and the time consumption is even worse for high-dimensional data. GANs, in contrast, use
a simple feed-forward neural network mapping in the generator: the generator creates all the pixels of the future frame at the same time rather than following the pixel-by-pixel
approach of autoregressive models. This speed has attracted
many researchers in various fields.
Accurate results: From the study of different GAN variants, it is evident that GANs
can produce astounding results for video analytics problems. Their performance is also far
better than that of the variational autoencoder (VAE), a generative model that assumes
the probability distribution of pixels to be normal. As GANs excel at capturing the high-frequency components of the data, the generator learns to exploit these components to fool the discriminator.
Lack of assumptions: Even though the VAE attempts to maximize the likelihood through
a variational lower bound, it needs assumptions on the prior and posterior probability distributions of the data. In contrast, GANs do not need any strong assumptions about
the probability distribution.
14.4.2 Disadvantages of GAN
Trade-off between discriminator and generator: An imbalance between the
generator and the discriminator arises from nonconvergence and mode collapse. Mode collapse is a common and difficult issue in GAN models. It happens
when the generator produces images that all look similar, or when the generator is
trained extensively without the discriminator being updated. Owing to mode collapse, the generator converges to an output
that fools the discriminator the most, that is, the most realistic image from the perspective of the
discriminator. Partial mode collapse occurs more frequently in GANs than complete mode
collapse. Thus, the training process involved in GANs is heuristic in nature.
Hyperparameters and training: The need for suitable hyperparameters to optimize
the cost function is a major concern in GANs, and tuning these parameters is
a time-consuming process.
14.5 Conclusion
GANs are growing into efficient generative models through the generation of realistic data
from random latent spaces. An underlying advantage of the GAN process is that it does not require
detailed modeling of the real data samples or heavy mathematical machinery. This merit
has allowed GANs to be used extensively in various academic and engineering fields. In this
chapter, we introduced the basics and working principle of GANs and several GAN variations
available for applications such as video generation, video prediction, action recognition, and video summarization in the area of video analytics. The enormous growth of
GANs in the video analytics domain is due not only to their ability to learn deep representations and nonlinear mappings but also to their potential to exploit the enormous amount of
unlabeled video data. There are huge opportunities in the development of GAN algorithms and architectures for application domains beyond video analytics,
such as prediction, superresolution, generating new human poses, and face frontal view
generation. Future work in video recognition includes exploiting large-scale web
images, which could further improve recognition accuracy. Video
retargeting can be accomplished more precisely using spatiotemporal generative models
and can be extended to multiple-source domain adaptation; spatiotemporal neural network architectures can also be applied to video retargeting in the future. Real-world videos with complex motion interactions can be addressed for video recognition
by modeling multiagent dependencies. Alternatives can also be explored
for the loss function, evaluation metrics, RNNs, and synthetically generated videos to improve
the performance of video recognition systems. Generative adversarial networks can
be the next step in the evolution of deep learning, as they provide better results across
several application domains.
References
[1] H. Chen, G. Ding, Z. Lin, S. Zhao, J. Han, Show, observe and tell: attribute-driven attention model for
image captioning, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence,
2018, pp. 606–612.
[2] L.C. Chen, G. Papandreou, F. Schroff, H. Adam, Rethinking Atrous Convolution for Semantic Image Segmentation, (2017). arXiv preprint arXiv:1706.05587.
[3] F. Yu, X. Wu, et al., Exploiting Images for Video Recognition with Hierarchical Generative Adversarial Networks, (2018). arXiv:1805.04384v1 [cs.CV].
[4] S. Skansi, Autoencoders, in: Introduction to Deep Learning, Springer, Berlin, 2018, pp. 153–163.
[5] D.H. Ackley, G.E. Hinton, T.J. Sejnowski, A learning algorithm for Boltzmann machines. Cogn. Sci.
9 (1) (1985) 147–169, https://doi.org/10.1207/s15516709cog0901_7.
[6] Wahyono, A. Filonenko, K.-H. Jo, Designing interface and integration framework for multi-channel
intelligent surveillance system, in: IEEE Conference on Human System Interactions, 2016.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio,
Generative adversarial nets, in: Advances in Neural Information Processing Systems, 2014,
pp. 2672–2680.
[8] Y. Hong, U. Hwang, J. Yoo, S. Yoon, How Generative Adversarial Networks and Their Variants
Work: An Overview, (2017). arXiv preprint arXiv:1711.05914.
[9] T. Che, Y. Li, A.P. Jacob, Y. Bengio, W. Li, Mode regularized generative adversarial networks,
in: Proc. ICLR, 2017, 2017.
[10] J.Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to image translation using cycle-consistent
adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision,
2017, pp. 2223–2232.
[11] D. Berthelot, T. Schumm, L. Metz, Began: Boundary Equilibrium Generative Adversarial Networks,
https://arxiv.org/abs/1703.10717, 2017.
[12] B. Zhao, E.P. Xing, Quasi real-time summarization for consumer videos, in: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2014, pp. 2513–2520.
[13] A. Odena, C. Olah, J. Shlens, Conditional image synthesis with auxiliary classifier GANs, in: Proceedings
of the 34th International Conference on Machine Learning, vol. 70, 2017, pp. 2642–2651.
[14] J.T. Springenberg, Unsupervised and Semi-Supervised Learning with Categorical Generative Adversarial Networks, (2015). arXiv preprint arXiv:1511.06390.
[15] B. Fernando, H. Bilen, E. Gavves, S. Gould, Self-Supervised Video Representation Learning With
Odd-One-Out Networks, (2016). arXiv preprint arXiv:1611.06646.
[16] I. Misra, C.L. Zitnick, M. Hebert, Shuffle and learn: unsupervised learning using temporal order
verification, in: European Conference on Computer Vision, 2016, pp. 527–544.
[17] M.A.R. Ahad, J.K. Tan, H. Kim, S. Ishikawa, Motion history image: its variants and applications,
Mach. Vis. Appl. 23 (2) (2012) 255–281.
[18] A.F. Bobick, J.W. Davis, The recognition of human movement using temporal templates, IEEE Trans.
Pattern Anal. Mach. Intell. 23 (2001) 257–267.
[19] S. Sadanand, J.J. Corso, Action bank: a high-level representation of activity in video. in: IEEE
Conference on Computer Vision and Pattern Recognition, Providence, RI, 2012, pp. 1234–1241,
https://doi.org/10.1109/CVPR.2012.6247806.
[20] N. Li, X. Cheng, S. Zhang, et al., Realistic human action recognition by fast HOG3D and self-organizing feature map, Mach. Vis. Appl. 25 (2014) 1793–1812, https://doi.org/10.1007/s00138-014-0639-9.
[21] A. Sasithradevi, S.M.M. Roomi, Video classification and retrieval through spatio-temporal radon features. Pattern Recogn. 99 (2020) 107099, https://doi.org/10.1016/j.patcog.2019.107099.
[22] J. Perš, V. Sulić, M. Kristan, M. Perše, K. Polanec, S. Kovačič, Histograms of optical flow for
efficient representation of body motion, Pattern Recogn. Lett. 31 (11) (2010) 1369–1376.
[23] J. Zhao, M. Mathieu, Y. LeCun, Energy-Based Generative Adversarial Network, (2016). arXiv
preprint arXiv:1609.03126.
[24] Z. Huang, B. Kratzwald, et al., Face Translation Between Images and Videos Using Identity-Aware
CycleGAN, (2017). arXiv:1712.00971v1 [cs.CV].
Generative adversarial network for video analytics
[25] J. Walker, K. Marino, et al., The Pose Knows: Video Forecasting by Generating Pose Futures, (2017).
arXiv:1705.00053v1 [cs.CV].
[26] X. Liang, L. Lee, et al., Dual Motion GAN for Future-Flow Embedded Video Prediction, (2017).
arXiv:1708.00284v2 [cs.CV].
[27] S. Tulyakov, M.-Y. Liu, X. Yang, J. Kautz, MoCoGAN: Decomposing Motion and Content for Video
Generation, (2017). arXiv preprint arXiv:1707.04993.
[28] W.J. Baddar, G. Gu, et al., Dynamics Transfer GAN: Generating Video by Transferring Arbitrary Temporal Dynamics from a Source Video to a Single Target Image, (2017). arXiv:1712.03534v1 [cs.CV].
[29] K. Ohnishi, S. Yamamoto, et al., Hierarchical Video Generation from Orthogonal Information:
Optical Flow and Texture, (2017). arXiv:1711.09618v2 [cs.CV].
[30] W. Xiong, W. Luo, et al., Learning to Generate Time-Lapse Videos Using Multi-Stage Dynamic
Generative Adversarial Networks, (2018). arXiv:1709.07592v3 [cs.CV].
[31] B. Kratzwald, Z. Huang, et al., Improving Video Generation for Multi-Functional Applications,
(2018). arXiv:1711.11453v2 [cs.CV].
[32] Y. Pan, Z. Qiu, et al., To Create What you Tell: Generating Videos from Captions, (2018).
arXiv:1804.08264v1 [cs.CV].
[33] K. Soomro, A.R. Zamir, M. Shah, UCF101: a dataset of 101 human actions classes from videos in the
wild, in: Computer Vision and Pattern Recognition (cs.CV), 2012 arXiv:1212.0402.
[34] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, T. Serre, HMDB: a large video database for human
motion recognition, in: International Conference on Computer Vision (ICCV), IEEE, 2011,
pp. 2556–2563.
[35] D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3D
convolutional networks, in: ICCV, 2015, pp. 4489–4497.
[36] C. Spampinato, S. Palazzo, et al., Adversarial Framework for Unsupervised Learning of Motion
Dynamics in Videos, (2019). arXiv:1803.09092v2 [cs.CV].
[37] U. Ahsan, H. Sun, et al., DiscrimNet: Semi-Supervised Action Recognition from Videos Using
Generative Adversarial Networks, (2018). arXiv:1801.07230v1 [cs.CV].
[38] A. Bansal, S. Ma, D. Ramanan, Y. Sheikh, Recycle-GAN: unsupervised video retargeting. in: V. Ferrari,
M. Hebert, C. Sminchisescu, Y. Weiss (Eds.), Computer Vision—ECCV 2018 Lecture Notes in
Computer Science, vol. 11209, Springer, Cham, 2018https://doi.org/10.1007/978-3-030-01228-1_8.
[39] Y. Jang, G. Kim, et al., Video Prediction with Appearance and Motion Conditions, (2018).
arXiv:1807.02635v1 [cs.CV].
[40] M. Gygli, H. Grabner, H. Riemenschneider, L. Van Gool, Creating summaries from user videos,
in: Proceedings of European Conference on Computer Vision, 2014, pp. 505–520.
[41] G. Kim, L. Sigal, E.P. Xing, Joint summarization of large-scale collections of web images and videos for
storyline reconstruction, in: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2014, pp. 4225–4232.
[42] B. Mahasseni, M. Lam, S. Todorovic, Unsupervised video summarization with adversarial LSTM networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[43] A. Sharghi, B. Gong, M. Shah, Query-focused extractive video summarization, in: Proceedings of
European Conference on Computer Vision, 2016.
[44] K. Zhang, W.L. Chao, F. Sha, K. Grauman, Video summarization with long short-term memory,
in: Proceedings of European Conference on Computer Vision, 2016, pp. 766–782.
[45] B. Gong, W.L. Chao, K. Grauman, F. Sha, Diverse sequential subset selection for supervised video
summarization, in: Advances in Neural Information Processing Systems, 2014, pp. 2069–2077.
[46] Y. Zhang, M. Kampffmeyer, et al., Dilated Temporal Relational Adversarial Network for Generic
Video Summarization, (2019). arXiv:1804.11228v2 [cs.CV].
[47] A. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior,
K. Kavukcuoglu, Wavenet: A Generative Model for Raw Audio, (2016). arXiv preprint
arXiv:1609.03499.
[48] A. Oord, N. Kalchbrenner, K. Kavukcuoglu, Pixel Recurrent Neural Networks, (2016). arXiv preprint arXiv:1601.06759.
[49] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for
training GANs, in: Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
CHAPTER 15
Multimodal reconstruction of retinal
images over unpaired datasets using
cyclical generative adversarial networks
Álvaro S. Hervella, José Rouco, Jorge Novo, and Marcos Ortega
CITIC Research Center, University of A Coruña, A Coruña, Spain
VARPA Research Group, Biomedical Research Institute of A Coruña (INIBIC), University of A Coruña, A Coruña, Spain
15.1 Introduction
The recent rise of deep learning has revolutionized medical imaging, making a significant
impact on modern medicine [1]. Nowadays, in clinical practice, medical imaging technologies are key tools for the prevention, diagnosis, and follow-up of numerous diseases
[2]. There exists a large variety of imaging modalities that allow the visualization of the different
organs and tissues in the human body [3]. Thus, clinicians can select the most adequate
imaging modality to study the different anatomical or pathological structures in detail.
Nevertheless, the detailed analysis of the images can be a tedious and difficult task for
a clinical specialist. For instance, many diseases in their early stages are only evidenced
by very small lesions or subtle anomalies. In these scenarios, factors such as the clinicians’
expertise and workload can affect the reliability of the final analysis. Thus, the use of deep
learning algorithms allows the process to be accelerated and helps to produce a more reliable
analysis of the images. Ultimately, this will result in a better diagnosis and treatment for
the patients.
Deep neural networks (DNNs) have been demonstrated to provide a superior performance for numerous image analysis problems in comparison to more classical methods
[4]. For instance, nowadays, deep learning represents the state-of-the-art approach for
typical tasks, such as image segmentation [5] or image classification [6]. Besides the
remarkable improvements in these canonical image analysis problems, deep learning also
makes possible the emergence of novel applications. For instance, these algorithms can be
used for the transformation of images among different modalities [7], or the training of
future clinical professionals using realistic generated images [8]. These novel applications,
among others, certainly benefit from the particular advantages of generative adversarial
networks (GANs) [9]. This creative setting, consisting of different networks with opposite objectives, has been demonstrated to further exploit the capacity of
the DNNs.
Multimodal reconstruction is a novel application driven by DNNs that consists in the
translation of medical images among complementary modalities [7]. Nowadays, complementary imaging modalities, representing the same organs or tissues, are commonly available in most medical specialties [3]. The differences among modalities can be due to the
use of different capture devices, and also due to the use of contrasts that enhance certain
tissues. The clinicians choose the most adequate imaging modality according to different
factors, such as the target organs or tissues, the evidence of disease, or the risk factors of the
patient. In this sense, it is particularly important to consider the properties of the different
anatomical and pathological structures, given that some structures can be enhanced in one
modality and be completely missing in another. This significant change in appearance,
dependent on the properties of the tissues and organs, can make the translation among
modalities very challenging. However, the challenge that complicates the training of the
multimodal reconstruction is beneficial if we are interested in using the task for representation learning purposes. This is because a harder task forces the network to learn more complex representations during training. In this regard, the
multimodal reconstruction has already demonstrated a successful performance as pretraining task for transfer learning in medical imaging [10].
In this chapter, we study the use of GANs for the multimodal reconstruction between
complementary imaging modalities. In particular, the multimodal reconstruction is
addressed by using a cyclical GAN methodology, which allows training the adversarial
setting with independent sets of two different image modalities [11]. Nowadays, GANs
represent the quintessential approach for image-to-image translation tasks [12]. However, these kinds of applications are typically focused on producing realistic and aesthetically pleasing images. In contrast, in the multimodal reconstruction of medical images,
the realism and aesthetics of the generated images are not as important as producing medically accurate reconstructions. In particular, this means that the generated color patterns
and textures must be coherent with the expected visualization of the real organs or tissues
in the target modality. Additionally, this may involve the omission of certain structures,
or even the enhancement of those that are only vaguely appreciated in the original
modality. We evaluate all these aspects in order to assess the validity of the studied cyclical
GAN method for the multimodal reconstruction.
The study presented in this chapter is focused on ophthalmic imaging. In particular,
we use the retinography and the fluorescein angiography as the original and target
imaging modalities in the multimodal reconstruction. These imaging modalities, which
represent the eye fundus, are useful for the study of important ocular and systemic diseases, such as glaucoma or diabetes [2]. A representative example of retinography and
fluorescein angiography for the same eye is depicted in Fig. 15.1. The main difference
between them is that the fluorescein angiography uses a contrast dye, which is injected
to the patient, to produce the fluorescence of the blood. Thus, the fluorescein angiography depicts an enhanced representation of the retinal vasculature and related lesions.
Fig. 15.1 Example of retinography and fluorescein angiography for the same eye: (A) retinography
and (B) angiography.
In this context, the successful training of a deep neural network in the multimodal
reconstruction of the angiography from the retinography will provide a model able to
produce a contrast-free estimation of the enhanced retinal vasculature. Additionally,
due to the challenges of the transformation, which is mainly mediated by the presence
of blood flow in the different tissues, the neural networks will need to learn rich high-level representations of the data. This represents a remarkable potential for transfer learning purposes [13, 14].
The presented study includes an extensive evaluation of the cyclical GAN methodology for the multimodal reconstruction between complementary imaging modalities.
For this purpose, two different multimodal datasets containing both retinography and
fluorescein angiography images are used. Additionally, in order to further analyze the
advantages and limitations of the methodology, we present an extensive comparison with
a state-of-the-art approach for the multimodal reconstruction of these ophthalmic images
[15]. In contrast with the cyclical GAN methodology, this other approach requires
the use of multimodal paired data for training, i.e., retinography and angiography of
the same eye. Therefore, the cyclical GAN presents an important advantage, avoiding
not only the necessity of paired data but also the preprocessing required to align the different image pairs.
15.2 Related research
Generative adversarial networks (GANs) represent a relatively new deep learning framework for the estimation of generative models [16]. The original GAN setting consists of
two different networks with opposite objectives. In particular, a discriminator that learns
to distinguish between real and fake samples and a generator that learns to produce fake
samples that the discriminator misclassifies as real. Based on this original idea, several
variations were developed in posterior works, aiming at applying the novel paradigm
in different scenarios [17].
In recent years, GANs have been extensively used for addressing different vision problems and graphics tasks. The use of GANs has been especially groundbreaking for computer graphics applications due to the visually appealing results that are obtained.
Similarly, a kind of vision problem that has been revolutionized by the use of GANs is
image-to-image translation, which consists of performing a mapping between different
image domains or imaging modalities [12]. An early work addressing this problem with
GANs, known as Pix2Pix [18], relied on the availability of paired data for learning the generative model. In particular, Isola et al. [18] show that their best results are achieved by
combining a traditional pixel-wise loss and a conditional GAN framework. Given the difficulty of gathering the paired data in many application domains, posterior works have
proposed alternatives to learn the task by using unpaired training data. Among the different
proposals, the work of Zhu et al. [19], known as CycleGAN, has been especially influential.
CycleGAN compensates for the lack of paired data by learning not only the desired mapping function but also the inverse mapping. This allows introducing a cycle-consistency
loss whereby the subsequent application of both mapping functions must return the original input image. Concurrently, this same idea with different naming was also proposed in
DualGAN [20] and DiscoGAN [21]. Additionally, besides the cycle-consistency alternative, other proposals have been presented in different works [12], although
the use of these alternatives is not as widespread in subsequent applications.
In medical imaging, GANs have also been used for different applications, including the
mapping between complementary imaging modalities. In particular, GANs have been successfully applied in tasks such as image denoising [22], multimodal reconstruction [11], segmentation [23], image synthesis [24], or anomaly detection [25]. Among these different
tasks, several of them can be directly addressed as an image-to-image translation [8]. In
these cases, the adaption of those state-of-the-art approaches that already demonstrated
a good performance in natural images has been common. In particular, numerous works
in medical imaging are based on the use of Pix2Pix or CycleGAN methodologies [8].
Similarly to other application domains, the choice between one or other approach is conditioned by the availability of paired data for training. However, in medical imaging, the
paired data is typically easy to obtain, which is evidenced by the prevalence of paired
approaches in the literature [8]. With regard to the multimodal reconstruction, the
difficulty in these cases is to perform an accurate registration of the available image pairs.
An important concern regarding the use of GANs in medical imaging is the hallucination of nonexistent structures by the networks [8]. This is a concomitant risk with the
use of GANs due to the high capacity of these frameworks to model the given training
data. Cohen et al. [26] demonstrated that this risk is especially elevated when the training
data is heavily unbalanced. For instance, a GAN framework that is trained for multimodal
reconstruction with a large majority of pathological images will tend to hallucinate pathological structures when processing healthy images. This behavior can be in part mitigated by the addition of pixel-wise losses if paired data is available. Nevertheless,
regarding the multimodal reconstruction, even when the paired data is available, most
of the works still use the GAN framework together with the pixel-wise loss [8]. In this
regard, the work of Hervella et al. [15] is an example of multimodal reconstruction without GANs and using instead the Structural Similarity (SSIM) for the loss function. The
motivation for this is that, for many applications in medical imaging, it is not necessary to
generate realistic or aesthetically pleasing images. In this context, the results obtained
in Ref. [15] show that, without the use of GANs, the generated images lack realism
and can be easily identified as synthetic samples.
15.3 Multimodal reconstruction of retinal images
Multimodal reconstruction is an image translation task between complementary medical
imaging modalities [7]. The objective of this task is, given a certain medical image, to
reconstruct the underlying tissues and organs according to the characteristics of a different
complementary imaging modality. Particularly, this chapter is focused on the multimodal
reconstruction of the fluorescein angiography from the retinography. These two
complementary retinal imaging modalities represent the eye fundus, including the main
anatomical structures and possible lesions in the eye. The main difference between retinography and angiography is that the latter requires the injection of a contrast dye before
capturing the images. The injection of this contrast dye results in an enhancement of the
retinal vasculature as well as those pathological structures with blood flow. Simultaneously, those other retinal structures and tissues where there is a lack of blood flow
may be attenuated in the resulting images. Thus, there is an intricate relation between
retinography and angiography, given that the visual transformation between the
modalities depends on physical properties such as the presence of blood flow in the
different tissues. As a reference, the transformation between retinography and angiography for the main anatomical and pathological structures in the retina can be visualized in
Fig. 15.2.
Fig. 15.2 Example of retinography and fluorescein angiography for the same eye. The included images
depict the main anatomical structures as well as the two main types of lesions in the retina.
Recently, the difficulty of performing the multimodal reconstruction between retinography and angiography has been overcome by using DNNs [7]. In this regard, the
required multimodal transformation can be modeled as a mapping function GR2A : R → A that, given a certain retinography r ∈ R, returns the corresponding angiography
a = GR2A(r) ∈ A for the same eye. In this scenario, the mapping function GR2A can be
parameterized by a DNN. Thus, the function parameters can be learned by applying
an adequate training strategy. In this regard, we present two different deep learning-based
approaches for learning the mapping function GR2A, the cyclical GAN methodology [11]
and the paired SSIM methodology [15].
15.3.1 Cyclical GAN methodology
The cyclical GAN methodology is based on the use of generative adversarial networks
(GANs) for learning the mapping function from retinography to angiography [11]. In this
regard, GANs have been demonstrated to be useful tools for learning the data distribution of a
certain training set, allowing the generation of new images that resemble those contained
in the training data [16]. This means that, by using GANs and a sufficiently large training
set of unlabeled angiographies, it is possible to generate new fake angiographies that are
theoretically indistinguishable from the real ones. However, in the presented multimodal
reconstruction, the generated images do not only need to resemble real angiographies
but, also, need to represent the physical attributes given by a particular retinography.
Thus, in contrast with the original GAN approach [16], the presented methodology does
not generate new images from a random noise vector, but rather from another image
with the same spatial dimensions as the one that is being generated. In practice, this
image-to-image transformation is achieved by using an encoder-decoder network as
the generator, whereas the discriminator remains a classification network, as in the original
GAN approach. Applying this setting, the multimodal reconstruction could theoretically be trained by using two independent unlabeled sets of images, one of retinographies and the other of angiographies.
An inherent difficulty of training an image-to-image GAN is that, typically, the
generator network has enough capacity to generate a variety of plausible images while
ignoring the characteristics of the network input. In the case of the multimodal reconstruction, this would mean that the physical attributes of the retinographies are not successfully transferred to the generated angiographies. In this regard, early image-to-image
GAN approaches addressed the issue by explicitly conditioning the generated images on
the network input [18]. In particular, this is achieved by using a paired dataset instead of
two independent datasets for training. For instance, the use of retinography-angiography
pairs, instead of independent retinography and angiography samples, allows training a discriminator to distinguish between fake and real angiographies conditioned on a given real
retinography. The use of such a discriminator will force the generator to analyze and take
into account the attributes of the input retinography. Additionally, in Ref. [18], the use of
paired datasets is even further exploited by complementing the adversarial feedback to the
generator with a pixel-wise similarity metric between the generator output and the available ground truth. However, in this case, it is not only necessary to have paired data, but
also the available image pairs must be aligned.
In contrast with previous alternatives, the presented cyclical GAN methodology
addresses the issue of the generator potentially ignoring the characteristics of its input
in a different manner that does not require the use of paired datasets. In particular, the
cyclical GAN solution is based on the use of a double transformation [19]. The idea is
to simultaneously learn GR2A and its inverse mapping function GA2R : A → R that, given
a certain angiography a ∈ A, produces a retinography r = GA2R(a) ∈ R of the same eye.
Then, the subsequent application of both transformations should be equivalent to the
identity function. For instance, if a retinography is transformed into angiography and,
then, it is transformed back into retinography, the resulting image should be identical
to the original retinography that is used as input. However, if any of the two transformations ignores the characteristics of their input, the resulting retinography will differ
from the original. Therefore, it is possible to ensure that the input image characteristics
are not being ignored by enforcing the identity between the original retinography and
the one that is transformed back from angiography. This is referred to as
cycle-consistency, and it can be applied by using any similarity metric between both original and reconstructed input image. An important advantage of this solution is that it does
not require the use of paired datasets, only being necessary two independent sets of unlabeled retinographies and angiographies.
In order to obtain the best performance for the multimodal reconstruction, the presented cyclical GAN methodology involves the use of two complementary training
cycles: (1) from retinography to angiography to retinography (R2A2R) and (2) from
angiography to retinography to angiography (A2R2A). A flowchart showing the complete training procedure is depicted in Fig. 15.3. It is observed that two different generators, GR2A and GA2R, and two different discriminators, DA and DR, are used during the
training. The discriminators DA and DR are trained to distinguish between generated and
real images. Simultaneously, the generators GR2A and GA2R are trained to generate
images that the discriminators misclassify as real. This adversarial training is performed
using a least-squares loss, which has been demonstrated to produce a more stable learning process in comparison to the original loss in regular GANs [27]. Regarding the discriminator
training, the target values are 1 for the real images and 0 for the generated images. Thus,
the adversarial training losses for the discriminators are defined as
L^{adv}_{D_A} = \mathbb{E}_{r \sim R}\left[ D_A(G_{R2A}(r))^2 \right] + \mathbb{E}_{a \sim A}\left[ (D_A(a) - 1)^2 \right]   (15.1)

L^{adv}_{D_R} = \mathbb{E}_{a \sim A}\left[ D_R(G_{A2R}(a))^2 \right] + \mathbb{E}_{r \sim R}\left[ (D_R(r) - 1)^2 \right]   (15.2)
Fig. 15.3 Flowchart for the complete training procedure in the cyclical GAN methodology. This
approach involves the use of two complementary training cycles that only differ in which imaging
modality is being used as input and which one is the target. For each training cycle, the
appearance of the target modality in the generated images is enforced by the feedback of the
discriminator. Simultaneously, the cycle-consistency is used to ensure that the input image
characteristics, such as the anatomical and pathological structures, are not being ignored by the
networks.
In the case of the generator training, the objective is that the discriminator assigns a
value 1 to the generated images. Thus, the adversarial training losses for the generators are
defined as
L^{adv}_{G_{R2A}} = \mathbb{E}_{r \sim R}\left[ (D_A(G_{R2A}(r)) - 1)^2 \right]   (15.3)

L^{adv}_{G_{A2R}} = \mathbb{E}_{a \sim A}\left[ (D_R(G_{A2R}(a)) - 1)^2 \right]   (15.4)
Regarding the cycle consistency in the presented approach, the L1-norm between the
original image and its reconstructed version is used as a loss function. In particular, the
complete cycle-consistency loss, including both training cycles, is defined as
L^{cyc} = \mathbb{E}_{r \sim R}\left[ \| G_{A2R}(G_{R2A}(r)) - r \|_1 \right] + \mathbb{E}_{a \sim A}\left[ \| G_{R2A}(G_{A2R}(a)) - a \|_1 \right]   (15.5)
As it can be observed in previous equations as well as in Fig. 15.3, there is a strong
parallelism between both training cycles, R2A2R and A2R2A. In particular, the only
difference is the imaging modality that each training cycle starts with, what sets which
imaging modality is being used as input and which one is the target.
Finally, the complete loss function that is used for simultaneously training all the
networks is defined as
L = L^{adv}_{G_{R2A}} + L^{adv}_{D_A} + L^{adv}_{G_{A2R}} + L^{adv}_{D_R} + \lambda L^{cyc}   (15.6)
where λ is a parameter that controls the relative importance of the cycle-consistency loss
and the adversarial losses. For the experiments presented in this chapter, this parameter is
set to a value of λ = 10, which was also previously adopted in Ref. [19].
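The loss terms of Eqs. (15.1)-(15.6) can be written compactly as in the hedged PyTorch sketch below. Network definitions are omitted; G_R2A, G_A2R, D_A, and D_R stand for the four networks of Fig. 15.3, and in practice the generator and discriminator terms are optimized in alternating steps with separate optimizers rather than in a single combined objective.

```python
# Least-squares adversarial losses plus L1 cycle-consistency, weighted by lambda = 10.
import torch
import torch.nn.functional as F

def lsgan_d_loss(d, real, fake):
    # Discriminator: push real scores toward 1 and generated scores toward 0.
    score_fake, score_real = d(fake.detach()), d(real)
    return F.mse_loss(score_fake, torch.zeros_like(score_fake)) + \
           F.mse_loss(score_real, torch.ones_like(score_real))

def lsgan_g_loss(d, fake):
    # Generator: push the discriminator's score on generated images toward 1.
    score = d(fake)
    return F.mse_loss(score, torch.ones_like(score))

def cycle_loss(G_R2A, G_A2R, r, a):
    return F.l1_loss(G_A2R(G_R2A(r)), r) + F.l1_loss(G_R2A(G_A2R(a)), a)

def total_loss(G_R2A, G_A2R, D_A, D_R, r, a, lam=10.0):
    fake_a, fake_r = G_R2A(r), G_A2R(a)
    return (lsgan_g_loss(D_A, fake_a) + lsgan_d_loss(D_A, a, fake_a)
            + lsgan_g_loss(D_R, fake_r) + lsgan_d_loss(D_R, r, fake_r)
            + lam * cycle_loss(G_R2A, G_A2R, r, a))
```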
The optimization of the loss function during the training is performed with the Adam
algorithm [28]. Regarding the hyperparameters of Adam, the exponential decay rates for the moment estimates are β1 = 0.5
and β2 = 0.999. In comparison to the original values recommended by Kingma et al. [28],
this set of values has been demonstrated to provide a more stable learning process when training
GANs [29]. The optimization is performed with a batch size of 1 image. The learning rate
is set to an initial value of α = 2e-4 and kept constant for 200,000 iterations. Then,
following the approach previously adopted in Ref. [19], the learning rate is linearly
reduced to zero for the same number of iterations. The number of iterations before
starting to reduce the learning rate is established empirically through the analysis of both
the learning curves and the generated images in a training subset that is reserved for
validation.
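An illustrative optimizer setup matching these hyperparameters is sketched below: Adam with betas (0.5, 0.999), a learning rate of 2e-4 kept constant for 200,000 iterations and then decayed linearly to zero over the following 200,000 iterations. The `params` argument stands for the parameters of all four networks and is the caller's responsibility.

```python
# Adam with a constant-then-linear-decay learning rate schedule.
import torch

def make_optimizer_and_scheduler(params, lr=2e-4, constant_iters=200_000,
                                 decay_iters=200_000):
    optimizer = torch.optim.Adam(params, lr=lr, betas=(0.5, 0.999))

    def lr_factor(it):
        if it < constant_iters:
            return 1.0                                 # constant phase
        return max(0.0, 1.0 - (it - constant_iters) / decay_iters)  # linear decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
    return optimizer, scheduler
```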
Finally, a data augmentation strategy is applied to avoid possible overfitting to the
training set. In particular, random spatial and color augmentations are applied to the
images. The spatial augmentations consist of affine transformations, and the color augmentations are linear transformations of the image channels in HSV (Hue-Saturation-Value) color space. In the case of the angiographies, which have a single channel, a
linear transformation is directly applied over the raw intensity values. This augmentation
strategy has been previously applied for the analysis of retinal images, demonstrating a
good performance avoiding overfitting with limited training data [10, 30]. The particular
range for the transformations was validated before training in order to ensure that the
augmented images still resemble valid retinas.
15.3.2 Paired SSIM methodology
An alternative methodology for the multimodal reconstruction between retinography
and angiography was proposed in Ref. [7]. In this case, the authors avoid the use of GANs
by taking advantage of existing multimodal paired data. In particular, a set of
retinography-angiography pairs where both images correspond to the same eye. The
motivation for this lies in the fact that, in contrast to other application domains, in medical imaging the paired data is easy to obtain. Nowadays, in modern clinical practice, the
use of different imaging modalities is broadly extended across most medical services. In
this sense, although for many patients the use of a single imaging modality can be enough
for diagnostic purposes, there is still a large number of cases where the use of several imaging modalities is required. In this latter scenario, it is also common to use more complex
or invasive techniques, such as those requiring the injection of contrasts. This is the case
of retinography and angiography in retinal imaging. While retinography is a broadly
extended modality, typically used in screening programs, angiography is only used when
355
356
Generative adversarial networks for image-to-Image translation
it is clearly required. However, each time the angiography is taken for a patient, retinography is typically also available. This facilitates the gathering of these paired multimodal
datasets.
Technically, the advantage of using paired training data is that it allows directly comparing the network output with a ground truth image. In particular, during the training,
for each retinography that is fed to the network, there is also available an angiography of
the same eye. Thus, the training feedback can be obtained by computing any similarity
metric between generated and real angiography. In order to facilitate this measurement of
similarity, the retinography and angiography within each multimodal pair are registered.
The registration produces an alignment of the different retinal structures between the
retinography and the angiography. Consequently, there will also be an alignment
between the network output and the real angiography that is used as ground truth. This
allows the use of common pixel-wise metrics for the measurement of the similarity
between the network output and the target image.
In the presented methodology [15], the registration is performed following a domain-specific method that relies on the vascular structures of the retina [31]. In particular, this
registration method presents two different steps. The first step is a landmark-based registration where the landmarks are the crossings and the bifurcations of the retinal vasculature. This first registration produces a coarse alignment of the images that is later refined
by performing a subsequent intensity-based registration. This second registration is based
on the optimization of a similarity metric of the vessels between both images. The
complete registration procedure allows generating a paired and registered multimodal
dataset, which is used for directly training the generator network GR2A. The complete
methodology for training the multimodal reconstruction is depicted in Fig. 15.4. As it is
observed, an advantage of this methodology is that only a single neural network is
required.
Regarding the training of the generator, the similarity between the network output
and target angiography is evaluated by using the structural similarity (SSIM) [32]. This
metric, which was initially proposed for image quality assessment, measures the similarity
between images by independently considering the intensity, contrast, and structural information.

Fig. 15.4 Flowchart for the complete training procedure of the paired SSIM methodology. The first step is the multimodal registration of the paired retinal images, which can be performed off-line before the actual network training. Then, the training feedback is provided by the structural similarity (SSIM), which is a pixel-wise similarity metric.

The measurement is performed at a local level, considering a small neighborhood around each pixel. In particular, the SSIM map between two images (x, y) is computed from a set of local statistics as
$$ \mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)\,(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)\,(\sigma_x^2 + \sigma_y^2 + C_2)} \qquad (15.7) $$
where μx and μy are the local averages for x and y, respectively, σx and σy are the local standard deviations for x and y, respectively, and σxy is the local covariance between x and y. These local statistics are computed for each pixel by weighting its neighborhood with an isotropic two-dimensional Gaussian with σ = 1.5 pixels [32].
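As an illustration, the following minimal NumPy/SciPy sketch computes an SSIM map following Eq. (15.7) with Gaussian-weighted local statistics (σ = 1.5 pixels). The function name and the constants C1 and C2 (the usual defaults for a unit dynamic range) are assumptions for illustration, not the exact implementation used in the original works.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def ssim_map(x, y, sigma=1.5, C1=0.01 ** 2, C2=0.03 ** 2):
    """Per-pixel SSIM map of Eq. (15.7) using Gaussian-weighted local statistics.

    x, y: float arrays in [0, 1] with the same shape.
    C1, C2: stabilizing constants (assumed defaults for a unit dynamic range).
    """
    x, y = x.astype(np.float64), y.astype(np.float64)

    # Local means, weighted by an isotropic 2D Gaussian (sigma = 1.5 pixels).
    mu_x, mu_y = gaussian_filter(x, sigma), gaussian_filter(y, sigma)

    # Local variances and covariance: E[ab] - E[a]E[b] under the same weighting.
    var_x = gaussian_filter(x * x, sigma) - mu_x ** 2
    var_y = gaussian_filter(y * y, sigma) - mu_y ** 2
    cov_xy = gaussian_filter(x * y, sigma) - mu_x * mu_y

    num = (2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return num / den
```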
Then, given that SSIM is a similarity metric, the loss function for training GR2A is
defined by using the negative SSIM:
$$ \mathcal{L}_{\mathrm{SSIM}} = -\,\mathbb{E}_{r,a \sim (R,A)}\!\left[ \mathrm{SSIM}\big(G_{R2A}(r),\, a\big) \right] \qquad (15.8) $$
The optimization of the loss function during the training is performed with the Adam
algorithm [28]. Regarding the hyperparameters of Adam, the exponential decay rates for the moment estimates are set to β1 = 0.9 and β2 = 0.999, which are the default values recommended by Kingma and Ba [28]. The optimization is performed with a batch size of 1 image. The learning rate is set to an initial value of α = 2e-4 and then reduced by a factor of 10 when the validation
loss ceases to improve for 1250 iterations. Finally, the training is early stopped after 5000
iterations without improvement in the validation loss. These hyperparameters are established empirically according to the evolution of the learning curves during the training.
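The following PyTorch sketch illustrates how such a training procedure could be organized: Adam with the stated hyperparameters, a negative-SSIM loss as in Eq. (15.8), learning rate reduction on plateau, and early stopping. The names GR2A, ssim, train_loader, and val_loader are assumed placeholders, and validating after every iteration is a simplification for clarity; this is a sketch under those assumptions, not the authors' original implementation.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Assumed to exist elsewhere: GR2A (the generator), ssim (a differentiable SSIM
# returning a scalar tensor), and train_loader / val_loader yielding registered
# (retinography, angiography) pairs with a batch size of 1.


def validate(GR2A, ssim, val_loader, device):
    GR2A.eval()
    with torch.no_grad():
        losses = [-ssim(GR2A(r.to(device)), a.to(device)).item()
                  for r, a in val_loader]
    GR2A.train()
    return sum(losses) / len(losses)


def train_paired_ssim(GR2A, ssim, train_loader, val_loader, device="cuda"):
    optimizer = Adam(GR2A.parameters(), lr=2e-4, betas=(0.9, 0.999))
    # Reduce the learning rate by 10x after 1250 iterations without improvement.
    scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=1250)

    best_val, since_best = float("inf"), 0
    while since_best < 5000:                       # early stopping criterion
        for retino, angio in train_loader:
            retino, angio = retino.to(device), angio.to(device)
            loss = -ssim(GR2A(retino), angio)      # negative SSIM, Eq. (15.8)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            val_loss = validate(GR2A, ssim, val_loader, device)
            scheduler.step(val_loss)
            if val_loss < best_val:
                best_val, since_best = val_loss, 0
            else:
                since_best += 1
            if since_best >= 5000:
                break
    return GR2A
```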
Finally, a data augmentation strategy is also applied to avoid possible overfitting to the
training set. In particular, random spatial and color augmentations are applied to the
images. The spatial augmentations consist of affine transformations and the color
augmentations are linear transformations of the image channels in HSV
(Hue-Saturation-Value) color space. In this case, the color augmentations are only
applied to the retinography, which is the only imaging modality being used as input
to a neural network. In contrast, the same affine transformation is applied to the retinography and the angiography in each multimodal image pair. This is necessary to keep the
alignment between the images and make possible the measurement of the pixel-wise similarity, namely SSIM, between the network output and the target angiography. As in the
cyclical GAN methodology, the particular range for the transformations is validated
before training in order to ensure that the augmented images still resemble valid retinas.
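A possible implementation of this paired augmentation strategy is sketched below with torchvision and PIL. The transformation ranges and the multiplicative HSV jitter are illustrative assumptions; only the use of a shared affine transform and a retinography-only color augmentation follows the description above.

```python
import random

import numpy as np
from PIL import Image
from torchvision.transforms import functional as TF


def augment_pair(retino, angio, max_deg=10, max_shift=0.05,
                 scale_range=(0.95, 1.05), hsv_jitter=0.05):
    """Paired augmentation: shared affine transform, retinography-only HSV jitter.

    The ranges above are illustrative assumptions, not the validated ranges
    mentioned in the text. Both inputs are PIL images of the same size.
    """
    # Sample one set of affine parameters and apply it to BOTH images,
    # so the multimodal pair stays registered.
    angle = random.uniform(-max_deg, max_deg)
    tx = int(random.uniform(-max_shift, max_shift) * retino.width)
    ty = int(random.uniform(-max_shift, max_shift) * retino.height)
    scale = random.uniform(*scale_range)
    retino = TF.affine(retino, angle=angle, translate=(tx, ty), scale=scale, shear=0)
    angio = TF.affine(angio, angle=angle, translate=(tx, ty), scale=scale, shear=0)

    # Linear (multiplicative) jitter of the HSV channels, retinography only.
    hsv = np.asarray(retino.convert("HSV"), dtype=np.float32)
    gains = 1.0 + np.random.uniform(-hsv_jitter, hsv_jitter, size=3)
    hsv = np.clip(hsv * gains, 0, 255).astype(np.uint8)
    retino = Image.fromarray(hsv, mode="HSV").convert("RGB")

    return retino, angio
```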
15.3.3 Network architectures
Regarding the neural networks, the same network architectures are used for the two presented methodologies, cyclical GAN and paired SSIM. This eases the comparison
between the methodologies, excluding the network architecture as a factor in the possible performance differences. In particular, the experiments that are presented in this
chapter are performed with the same network architectures that were previously used
in Ref. [19]. The generator, which is used in both cyclical GAN and paired SSIM, is
a fully convolutional neural network consisting of an encoder, a decoder, and several
residual blocks in the middle of them. A diagram of the network and the details of
the different blocks are depicted in Fig. 15.5 and Table 15.1, respectively. In contrast
with other common encoder-decoder architectures, this network presents a small
encoder and decoder, which is compensated by the large number of layers that are present
in the middle residual blocks. As a consequence, there is also a small spatial reduction of
the input data through the network. In particular, the height and width of the internal
representations within the network are reduced up to a factor of 4. This relatively low
spatial reduction allows keeping an adequate level of spatial accuracy without the necessity of additional features such as skip connections [34].

Fig. 15.5 Diagram of the network architecture for the generator. Each colored block represents the output of a layer in the neural network. The width of the blocks represents the number of channels whereas the height represents the spatial dimensions. The details of the different layers are in Table 15.1.

Table 15.1 Building blocks of the generator architecture.

Block      Layers              Kernel  Stride  Out features
Encoder    Conv/IN/ReLU        7×7     1       64
           Conv/IN/ReLU        3×3     2       128
           Conv/IN/ReLU        3×3     2       256
Residual   Conv/IN/ReLU        3×3     1       256
           Conv/IN             3×3     1       256
           Residual addition   –       –       256
Decoder    ConvT/IN/ReLU       3×3     2       128
           ConvT/IN/ReLU       3×3     2       64
           Conv/IN/ReLU        7×7     1       Image channels

Conv, convolution; IN, instance normalization [33]; ConvT, convolution transpose.

Another particularity of the
network is the use of instance normalization [33] layers after each convolution, in contrast
to the more extended use of batch normalization. In this regard, instance normalization
was initially proposed for improving the performance of style-transfer applications and
has demonstrated to be also effective for cyclical GANs. Additionally, these normalization layers could be seen as an effective way of dealing with the problems of using batch
normalization with small batch sizes. In this sense, it should be noticed that both the
experiments presented in this chapter as well as the experiments in Ref. [19] are performed with a batch size of 1 image.
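A minimal PyTorch sketch of a generator following Table 15.1 is given below. The number of residual blocks and the input/output channel counts are not specified in the table and are therefore assumptions here; the layer types, kernel sizes, strides, and feature counts follow the table.

```python
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Residual block of Table 15.1: Conv/IN/ReLU -> Conv/IN -> addition."""

    def __init__(self, channels=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
            nn.InstanceNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, stride=1, padding=1),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)


class Generator(nn.Module):
    """Encoder + residual blocks + decoder following Table 15.1.

    The number of residual blocks and the channel counts of the input
    (retinography) and output (angiography) are assumptions.
    """

    def __init__(self, in_channels=3, out_channels=1, n_residual=9):
        super().__init__()

        def conv(in_c, out_c, k, s):
            return [nn.Conv2d(in_c, out_c, k, stride=s, padding=k // 2),
                    nn.InstanceNorm2d(out_c), nn.ReLU(inplace=True)]

        def deconv(in_c, out_c):
            return [nn.ConvTranspose2d(in_c, out_c, 3, stride=2,
                                       padding=1, output_padding=1),
                    nn.InstanceNorm2d(out_c), nn.ReLU(inplace=True)]

        layers = conv(in_channels, 64, 7, 1)                 # Encoder
        layers += conv(64, 128, 3, 2) + conv(128, 256, 3, 2)
        layers += [ResidualBlock(256) for _ in range(n_residual)]
        layers += deconv(256, 128) + deconv(128, 64)         # Decoder
        layers += conv(64, out_channels, 7, 1)
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)
```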
In contrast with the generator, the discriminator network is only used in the cyclical
GAN methodology. The selected architecture is the one that was also used in Ref. [19].
In particular, the discriminator is a fully convolutional neural network, which allows
working on arbitrarily sized images. This kind of discriminator architecture is typically
known as PatchGAN [18], given that the decision of the discriminator is produced at
the level of overlapping image patches. A diagram of the network and the details of
the different layers are depicted in Fig. 15.6 and Table 15.2, respectively. The characteristics of the different layers are similar to those in the generator network. The main difference is the use of Leaky ReLU instead of ReLU as an activation function, which has proven to be a useful modification for the adequate training of GANs [29].

Fig. 15.6 Diagram of the network architecture for the discriminator. Each colored block represents the output of a layer in the network. The width of the blocks represents the number of channels whereas the height represents the spatial dimensions. The details of the different layers are in Table 15.2.

Table 15.2 Layers of the discriminator architecture.

Layers               Kernel  Stride  Out features
Conv/Leaky ReLU      4×4     2       64
Conv/IN/Leaky ReLU   4×4     2       128
Conv/IN/Leaky ReLU   4×4     2       256
Conv/IN/Leaky ReLU   4×4     1       512
Conv                 4×4     1       1

Conv, convolution; IN, instance normalization.

With
regard to the discriminator output, this architecture provides a decision for overlapping
image patches of size 70×70.
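A corresponding PyTorch sketch of the PatchGAN discriminator of Table 15.2 is shown below; the Leaky ReLU slope (0.2), the padding, and the input channel count are assumptions, while the layer structure follows the table.

```python
import torch.nn as nn


class PatchDiscriminator(nn.Module):
    """Fully convolutional discriminator following Table 15.2 (70x70 PatchGAN).

    The Leaky ReLU slope (0.2), padding, and input channels are assumptions.
    """

    def __init__(self, in_channels=1):
        super().__init__()
        self.model = nn.Sequential(
            # Conv / Leaky ReLU (no normalization on the first layer)
            nn.Conv2d(in_channels, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            # Conv / IN / Leaky ReLU blocks
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.InstanceNorm2d(128), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 256, 4, stride=2, padding=1),
            nn.InstanceNorm2d(256), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 512, 4, stride=1, padding=1),
            nn.InstanceNorm2d(512), nn.LeakyReLU(0.2, inplace=True),
            # One-channel map: one real/fake decision per overlapping patch
            nn.Conv2d(512, 1, 4, stride=1, padding=1),
        )

    def forward(self, x):
        return self.model(x)
```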
15.4 Experiments and results
15.4.1 Datasets
The experiments presented in this chapter are performed on a multimodal dataset consisting of 118 retinography-angiography pairs. This multimodal dataset is created from
two different collections of images. In particular, half of the images are taken from a public multimodal dataset provided by Isfahan MISP [35] whereas the other half have been
gathered from a local hospital [15].
The Isfahan MISP collection consists of 59 retinography-angiography pairs including
both pathological and healthy cases. In particular, 30 image pairs correspond to patients
that were diagnosed with diabetic retinopathy whereas the other 29 image pairs correspond to healthy retinas. All the images in the collection present a size of 720×576 pixels.
The private collection consists of 59 additional retinography-angiography pairs. Most
of the images correspond to pathological cases, including representative samples of several
common ophthalmic diseases. Additionally, the original images presented different sizes
and, therefore, they were resized to a fixed size of 720×576 pixels. This collection of images
has been gathered from the ophthalmic services of Complexo Hospitalario Universitario
de Santiago de Compostela (CHUS) in Spain.
To perform the different experiments, the complete multimodal dataset is randomly
split into two subsets of equal size, i.e., 59 image pairs each. One of these subsets is held
out as a test set and the other is used for training the multimodal reconstruction. Additionally, the training image pairs are randomly split into a validation subset of nine image
pairs and a training subset of 50 image pairs. The purpose of this split is to control the
training progress through the validation subset, as described in Section 15.3.
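The random split described above can be reproduced with a few lines of Python; the function below is a hypothetical helper, and the fixed seed is only for illustration.

```python
import random


def split_pairs(pairs, seed=0):
    """Random 59/59 test/train split, then 9/50 validation/training split.

    `pairs` is assumed to be a list of 118 (retinography, angiography) items;
    the fixed seed is only for reproducibility of the illustration.
    """
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    test, rest = shuffled[:59], shuffled[59:]
    validation, training = rest[:9], rest[9:]
    return training, validation, test
```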
Finally, it should be noticed that, although the same subset of image pairs is used for
the training of both methodologies, the images are considered as unpaired for the cyclical
GAN approach.
15.4.2 Qualitative evaluation of the reconstruction
Firstly, the quality and coherence of the generated angiographies are evaluated through
visual analysis. To that end, Figs. 15.7 and 15.8 depict some representative examples of
generated images together with the original retinographies and angiographies. The
examples are taken from the holdout test set. In general, both methodologies were able
to learn an adequate transformation for the main anatomical structures in the retina,
namely, the vasculature, fovea, and optic disc. In particular, it is observed that the retinal
Multimodal reconstruction of retinal images
Fig. 15.7 Examples of generated angiographies together with the corresponding original
retinographies and angiographies. Some representative examples of microaneurysms (green),
microhemorrhages (blue), and bright lesions (yellow) are marked with circles.
vasculature is successfully enhanced in all the cases, which is one of the main characteristics of real angiographies. This vascular enhancement evidences a high-level understanding of the different structures in the retina, given that other dark-colored
structures in retinography, such as the fovea, are mainly kept with a dark tone in the
reconstructed angiographies. This means that the applied transformation is structure-specific and guided by the semantic information in the images instead of low-level information such as the color.

Fig. 15.8 Examples of generated angiographies together with the corresponding original retinographies and angiographies. Some representative examples of microaneurysms (green) and microhemorrhages (blue) are marked with circles.

In contrast with the vasculature, the reconstructed optic
discs are not as similar as those in the real angiographies. However, this can be explained
by the fact that the appearance of the optic disc is not as consistent among angiographies.
In this sense, both methodologies learn to reconstruct the optic disc with a slightly higher
intensity, which may indicate that this is the predominant appearance of this anatomical
structure in the training set.
With regard to the pathological structures, there are greater differences between the
presented methodologies. For instance, microaneurysms are only generated or enhanced
by the cyclical GAN methodology. Microaneurysms are tiny vascular lesions that, in contrast to other pathological structures, remain connected to the bloodstream. Therefore,
they are directly affected by the injected contrast dye in the angiography. As it is observed
in Fig. 15.7, the cyclical GAN methodology is able to enhance these small lesions. However, not all the microaneurysms in the ground truth angiography are reconstructed, nor are all the reconstructed microaneurysms present in the ground truth. This may indicate that some of these microaneurysms are artificially created by the network or that small
microhemorrhages are being misidentified as microaneurysms. Nevertheless, it must be
considered that the detection of microaneurysms is a very challenging task in the field.
Thus, despite the possible errors, the fact that these small structures were identified by the
cyclical GAN methodology is a significant outcome.
In contrast to the previous analysis about microaneurysms, the examples of Fig. 15.7
evidence that the paired SSIM methodology provides a better reconstruction for other
pathological structures. In particular, bright lesions that are present in the retinography
should not be visible in the angiography. However, the cyclical GAN approach fails to
completely remove these lesions, especially if they are large such as those in the top-left
quarter of the retina shown in Fig. 15.7B. The paired SSIM approach provides a more accurate reconstruction of these kinds of lesions, although in the previous case there still remains a shadow in the area of the lesion. Finally, regarding the microhemorrhages, these kinds of lesions are also more accurately reconstructed by the paired SSIM
approach. In particular, these lesions present a dark appearance in both retinography and
angiography. In the depicted examples, it is observed that paired SSIM reconstructs the
microhemorrhages, as expected. However, the cyclical GAN approach tends to remove
these lesions. Additionally, in some cases, the small microhemorrhages are reconstructed
with a bright tone like the microaneurysms.
Besides the anatomical and pathological structures in the retina, the main difference
that is observed between both methodologies is the general appearance of the generated
angiographies. In this regard, the images generated by the cyclical GAN present a more
realistic look and could more easily be mistaken for real angiographies. The main reason for this is the texture in the images. In particular, cyclical GAN produces a textured
retinal background that mimics the appearance of a real angiography. In contrast, the retinal background in the angiographies generated by paired SSIM is very homogeneous,
which gives away the synthetic nature of the images. The explanation for this difference
between both approaches is the use of GANs in the cyclical GAN methodology. In this
sense, the discriminator network has the capacity to learn and distinguish the main characteristics of the angiography, including the textured background. Thus, a synthetic angiography with a smooth background would be easily identified as fake by the
discriminator. Consequently, during the training, the generator will learn to generate
the textured background in order to trick the discriminator. In the case of the paired
SSIM, the presented results show that SSIM does not provide the feedback that is
required to learn this characteristic. Additionally, according to the results presented in
Ref. [15], the use of L1-norm or L2-norm in the loss function does not provide that feedback either. In this regard, it should be noticed that these are full-reference pixel-wise
metrics that directly compare the network output against a specific ground truth image.
Thus, even if an angiography-like texture is generated, this will not necessarily minimize
the loss function if the generated texture does not exactly match the one in the provided
ground truth. It could be the case that the specific texture of each angiography was
impossible to infer from the corresponding retinography. In that scenario, the generator
could never completely reduce the loss portion corresponding to the textured background. The resulting outcome could be the generation of a homogeneous background
that minimizes the loss throughout the training set. This explanation fits with what is
observed in Figs. 15.7 and 15.8.
15.4.3 Quantitative evaluation of the reconstruction
The multimodal reconstruction is quantitatively evaluated by measuring the reconstruction error between the generated and the ground truth angiographies. In particular, the
reconstruction is evaluated by means of SSIM, mean absolute error (MAE), and mean
squared error (MSE), which are common evaluation metrics for image reconstruction
and image quality assessment. The presented evaluation is performed on the paired data
of the holdout test set.
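For reference, the following sketch shows how these three metrics could be computed for a single image pair with scikit-image and NumPy; the Gaussian SSIM window (σ = 1.5) mirrors the configuration described in Section 15.3.2, and it is assumed that both images are float arrays in [0, 1].

```python
import numpy as np
from skimage.metrics import mean_squared_error, structural_similarity


def evaluate_reconstruction(generated, target):
    """SSIM / MAE / MSE between a generated and a ground-truth angiography.

    Both images are assumed to be float arrays in [0, 1] with the same shape.
    The Gaussian window (sigma = 1.5) mirrors the SSIM setup of Section 15.3.2.
    """
    return {
        "SSIM": structural_similarity(target, generated, data_range=1.0,
                                      gaussian_weights=True, sigma=1.5),
        "MAE": float(np.mean(np.abs(target - generated))),
        "MSE": mean_squared_error(target, generated),
    }
```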
When comparing the two presented methodologies, it must be considered that the
paired SSIM relies on the availability of paired data for training. The paired data represent
a richer source of information in comparison to the unpaired counterpart and, therefore,
it is expected that the paired SSIM provides better performance than cyclical GAN for
the same number of training samples. Additionally, it should be also considered that the
paired data, despite being commonly available in medical imaging, is inherently harder to
collect than the unpaired counterpart. For these reasons, the presented evaluation not
only compares the performance of both methodologies when using the complete training
set but, also, it compares the performance when there are more unpaired than paired
images available for training. This is an expected scenario in practical applications.
The results of the quantitative evaluation are depicted in Fig. 15.9. In the case of
paired SSIM, the presented results correspond to several experiments with a varying
number of training samples, ranging from 10 to 50 image pairs. In the case of cyclical
GAN, the presented results are obtained after training with the complete training
subset, i.e., 50 image pairs. Firstly, it is observed that the paired SSIM always provides
better results than the cyclical GAN considering SSIM although that is not the case
for MAE and MSE.

Fig. 15.9 Comparison of cyclical GAN and paired SSIM with a varying number of training samples for paired SSIM. The evaluation is performed by means of (A) SSIM, (B) MAE, and (C) MSE.

Considering these two metrics, the paired SSIM obtains similar or worse results depending on the number of training samples. In general, it is clear that, up to 30 image pairs, the paired SSIM experiences a positive evolution with the addition
of more training data. Then, between 30 and 50 image pairs, the evolution stagnates and
there is no improvement with the addition of more images. In the case of MAE and MSE,
the final results to which the paired SSIM converges are approximately the same as those
obtained by the cyclical GAN. This may indicate the existence of an upper bound in the performance of the multimodal reconstruction with this experimental setting. Regarding the
comparison by means of SSIM, there is an important difference between both methodologies independently of the number of training images for paired SSIM. On the one
hand, this may be explained by the fact that the generator of the paired SSIM has been
explicitly trained to maximize SSIM. Thus, this network excels when it is evaluated by
means of this metric. On the other hand, however, it must be considered that SSIM is a
more complex metric in comparison to MAE or MSE. In particular, SSIM does not
directly measure the difference between pixels but, instead, it measures local similarities
that include high-level information such as structural coherence. Thus, it could be possible that subtle structural errors, which are not evidenced by MAE or MSE, contribute to
the worse performance of cyclical GAN considering SSIM.
15.4.4 Ablation analysis of the generated images
In order to better understand the obtained results, we present a more detailed quantitative
analysis in this section. In particular, the presented analysis considers the possible differences in error distribution among different retinal regions. As it was shown in
Section 15.4.2, both methodologies seem to provide a similar enhancement of the retinal
vasculature. However, there are important differences in the reconstructed retinal background and certain pathological structures. Therefore, it is interesting to study how the
reconstruction error is distributed between the vasculature and the background, and
whether this distribution is different between both methodologies. To that end, the
reconstruction errors are recalculated using a binary vascular mask to separate between
vasculature and background regions. Given that only a broad approximation of the vasculature is necessary, the vascular mask is computed by applying some common image
processing techniques. First, the multiscale Laplacian operator proposed in Ref. [31] is
applied to the original angiography. This operation further enhances the retinal vasculature, resulting in an image with much greater contrast between vasculature and background [36]. Then, the vascular region is dilated to ensure that the resulting mask not
only includes the vessels but also their surrounding pixels. This way, the reconstruction
error in the vasculature will also include the error due to inappropriate vessel edges.
Finally, the vascular mask is binarized by applying Otsu’s thresholding method [37].
An example of the produced binary vascular mask together with the original angiography
is depicted in Fig. 15.10.
Fig. 15.10 Example of vascular mask used for evaluation: (A) angiography and (B) resulting vessel
mask for (A).
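A rough approximation of this mask computation is sketched below, where the multiscale Laplacian of Ref. [31] is replaced by a multiscale Laplacian-of-Gaussian response; the scales and dilation size are illustrative assumptions, not the values used in the original work.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace, grey_dilation
from skimage.filters import threshold_otsu


def vessel_mask(angiography, scales=(1, 2, 4), dilation_size=5):
    """Approximate binary vessel mask from an angiography.

    Multiscale Laplacian-of-Gaussian enhancement stands in for the operator of
    Ref. [31]; the scales and dilation size are illustrative assumptions.
    """
    img = angiography.astype(np.float64)
    # Scale-normalized LoG response: bright, thin structures give large values.
    enhanced = np.max([-(s ** 2) * gaussian_laplace(img, s) for s in scales], axis=0)
    # Grayscale dilation so the mask also covers the pixels around the vessels.
    enhanced = grey_dilation(enhanced, size=(dilation_size, dilation_size))
    # Binarize with Otsu's threshold [37].
    return enhanced > threshold_otsu(enhanced)
```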
The results of the quantitative evaluation using the computed vascular masks are
depicted in Fig. 15.11. Firstly, it is observed that, in all the cases, the reconstruction error
is greater in the vessels than in the background. This may indicate that the reconstruction
of the retinal background is an easier task in comparison to the retinal vasculature. In this
regard, it must be noticed that the retinal vasculature is an intricate network with numerous intersections and bifurcations, which increases the difficulty of the reconstruction.
The background also includes some pathological structures, which can be a source of
errors as seen in Section 15.4.2. However, these pathological structures neither are present in all the images nor occupy a significantly large area of the background. Moreover,
the bright lesions in the angiography, i.e., the microaneurysms, are included within the
vascular mask, as can be seen in Fig. 15.10. This balances the contribution of the pathological structures between both regions. Regarding the comparison between cyclical
GAN and paired SSIM, the analysis is the same as in the previous evaluation. This happens independently of the retinal region that is analyzed, vasculature or background. In
particular, the performance of paired SSIM experiences the same evolution with the
increase in the number of training images. Considering MAE and MSE, paired SSIM
converges again to the same results that are achieved by cyclical GAN, resulting in a similar performance. In contrast, there is still an important difference between the methodologies when considering SSIM.
Finally, it is interesting to observe that the error distribution between regions is the same
for paired SSIM and cyclical GAN, even when there is a clear visual difference in the reconstructed background between both methodologies (see Fig. 15.7). This shows that the more
realistic look provided by the textured background does not necessarily lead to a better
reconstruction in terms of full-reference pixel-wise metrics. In particular, the same reconstruction error can be achieved by producing a homogeneous background with an adequate
tone, as paired SSIM does. This explains why the use of these metrics as a loss function does
not encourage the generator to produce a textured background. Moreover, in the case of
SSIM, which is the metric used by paired SSIM during training, the reconstruction error
for the textured background is even greater than that of the homogeneous version.
Fig. 15.11 Comparison of cyclical GAN and paired SSIM with a varying number of training samples for
paired SSIM. The evaluation is conducted independently for vessels and the background of the images.
The evaluation is performed by means of (A) SSIM, (B) MAE, and (C) MSE.
15.4.5 Structural coherence of the generated images
An observation that remains to be explained after the previous analyses is the different
results obtained depending on whether the evaluation is performed by means of SSIM or MAE/
MSE. In particular, both methodologies achieve similar results in MAE and MSE,
although paired SSIM always performs better in terms of SSIM. Given that SSIM is characterized by including higher level information such as the structural coherence between
images, the generated images are visually inspected to find possible structural differences.
Fig. 15.12 depicts some composite images using a checkerboard pattern that is used to
perform the visual analysis. In particular, the depicted images show the generated angiography together with the original retinography (Fig. 15.12A and C) as well as the generated angiography together with the ground truth angiography (Fig. 15.12B and D). At
a glance, it seems that both angiographies, from paired SSIM and cyclical GAN, are perfectly reconstructed. However, on closer examination, it is observed that in the angiographies generated by cyclical GAN there are small displacements with respect to the
originals. Examples of these displacements are shown in detail in Fig. 15.12. As it is
observed, the displacement occurs, at least, in the retinal vasculature. Moreover, it can
be observed that the displacement is consistent among the zoomed patches even when
they are distant in the images. This indicates that the observed displacement could be the
result of an affine transformation.
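Such a checkerboard composite can be produced with a short NumPy helper like the one below; the tile size is an arbitrary choice.

```python
import numpy as np


def checkerboard_composite(image_a, image_b, tile=64):
    """Interleave two aligned images in a checkerboard pattern (as in Fig. 15.12).

    Structural displacements between the images become visible as broken
    vessels at the tile borders. The tile size is an arbitrary choice.
    """
    assert image_a.shape == image_b.shape
    h, w = image_a.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    mask = ((yy // tile + xx // tile) % 2).astype(bool)
    composite = image_a.copy()
    composite[mask] = image_b[mask]
    return composite
```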
With regard to the cause of the displacement, an initial hypothesis is based on the fact
that cyclical GAN does not put any hard constraint on the structure of the generated angiography. The only requirements are that the image must look like a real angiography and
that it must be possible to reconstruct the original retinography from it. Thus, although
the more straightforward way to reconstruct the original retinography seems to be to
keep the original structure as it is, nothing enforces the networks to do so. Nevertheless,
it must be considered that if GR2A applies any spatial transformation to the generated
angiographies, then GA2R must learn to apply the inverse transformation when
reconstructing the original retinography. This synergy between the networks is necessary
to still minimize the cycle-consistency loss in the cyclical GAN methodology. Although
not straightforward, this situation seems plausible given that the observed displacement is
very subtle. The presented situation may initiate if the first network, GR2A, starts to
reconstruct the vessels of the angiography over the vessel edges of the input retinography.
This is likely to happen given how easily a neural network can detect edges in an image.
Moreover, the vessel edges are easier to detect than the vessel centerlines. To verify this
hypothesis, the angiographies generated during the first stages of the training have been
revised. A representative example of these images is depicted in Fig. 15.13. As it can be
observed, there are some bright lines that seem to be drawn over the edges of the subtle
dark vessels. This evidences the origin of the issue, although the ultimate cause is the
underconstrained training setting of cyclical GAN.
Fig. 15.12 Comparison of generated angiographies against (A, C) the corresponding original
retinographies and (B, D) the corresponding ground truth angiographies. (A, B) Angiography
generated using paired SSIM. (C, D) Angiography generated using cyclical GAN. Additionally,
cropped regions are depicted in detail for each case.
Fig. 15.13 Representative example of generated angiography on the first stages of training for cyclical
GAN: (A) original retinography and (B) generated angiography.
15.5 Discussion and conclusions
In this chapter, we have presented a cyclical GAN methodology for the multimodal
reconstruction of retinal images [11]. This multimodal reconstruction is a novel task that
consists of the translation of medical images between complementary modalities [7]. This
allows the estimation of either more invasive or less affordable imaging modalities from a
readily available alternative. For instance, this chapter addresses the estimation of fluorescein angiography from retinography, where the former requires the injection of a contrast dye to the patients. Despite the recent technical advances in the field, the direct use
of generated images in clinical practice is still only a future potential application. However, there are several other possible applications where this multimodal reconstruction
can be taken advantage of. For instance, multimodal reconstruction has already demonstrated to be a successful pretraining task for transfer learning in medical image analysis
[13, 14]. This is an important application that reduces the necessity of large collections of
expert-annotated data in medical imaging [10].
In order to provide a comprehensive analysis of the cyclical GAN methodology, we
have also presented an exhaustive comparison against a state-of-the-art approach where
no GANs were used [15]. This way, it is possible to study the particular advantages and
disadvantages of using GANs for multimodal reconstruction. The provided comparison is
performed under the fairest conditions, by using the same dataset, network architectures,
and training strategies. In this regard, the only differences are those intrinsically due to the
methodologies themselves. Regarding the presented results, it is seen that both
approaches are able to produce an adequate estimation of the angiography from retinography. However, there are important differences in several aspects of the generated angiographies. Moreover, the requirements for training each one of both approaches must
also be considered in the comparison.
Regarding the requirements for the training of both approaches, the main difference
is the use of unpaired data in cyclical GAN and paired data in paired SSIM. In broad
domain applications, i.e., those performed on natural images, this would represent an insurmountable obstacle for the paired SSIM methodology. However, in medical imaging,
the paired data can be relatively easy to obtain due to the common use of complementary
imaging modalities in clinical practice. In this case, however, the disadvantage of paired
SSIM is the necessity of registered image pairs where the different anatomical and pathological structures must be aligned. The multimodal registration method that is applied in
paired SSIM has proven to be reliable for the alignment of retinography-angiography pairs [31]. Moreover, it has been successfully applied for the registration
of the multimodal dataset that is used in the experiments herein described. However,
the results presented in Ref. [31] also show that, quantitatively, the registration performance is lower for the most complex cases, which can be due to, e.g., low-quality images
or severe pathologies. This could potentially limit the variety of images in an extended
version of the dataset including more challenging scenarios. Additionally, the registration
method in paired SSIM is domain-specific and, therefore, cannot be directly applied to
other types of multimodal image pairs. This means that the use of paired SSIM in other
medical specialties would require the availability of adequate registration methods.
Although image registration is a common task in medical imaging, the availability of such
multimodal registration algorithms cannot be taken for granted. In contrast, cyclical
GAN can be directly applied to any kind of multimodal setting without the need for registered or paired data.
Another important difference between the presented approaches is the complexity
of the training procedure. In this sense, cyclical GAN represents a more complex
approach including four different neural networks and two training cycles, as described
in Section 15.3.1. In comparison, once the multimodal image registration is performed,
paired SSIM only requires the training of a single neural network. The use of four different networks in cyclical GAN means that, computationally, more memory is
required for training. In a situation of limited resources, which is the common practical
scenario, this will negatively affect the size and number of images that can be included in
each batch during the training. Moreover, in practice, cyclical GAN also requires longer training times than paired SSIM, which further increases the computational costs.
This is in part due to the use of a single network in paired SSIM but also to the use of a
full-reference pixel-wise metric for the loss functions. The feedback provided by this
more classical alternative results in a faster convergence in comparison to the adversarial
training.
Regarding the performance of the multimodal reconstruction, the examples depicted
in Figs. 15.7 and 15.8 show that both methodologies are able to successfully recognize the
main anatomical structures in the retina. In that sense, despite the evident aesthetic differences, the transformations applied to the anatomical structures are adequate in both
cases. Thus, both approaches show a similar potential for transfer learning regarding
the analysis of the retinal anatomy. However, when considering the pathological structures, there are important differences between both methodologies. In this case, none of
the methodologies perfectly reconstructs all the lesions. In particular, the examples
depicted in Fig. 15.7 indicate that each methodology gives preference to different types
of lesions in the generated images. Thus, it is not clear which alternative would be a better
option toward the pathological analysis of the retinal images. In this regard, given the
mixed results that are obtained, future works could explore the development of hybrid
methods for the multimodal reconstruction of retinal images. The objective, in this case,
would be to combine the good properties of cyclical GAN and paired SSIM.
One of the main differences between cyclical GAN and paired SSIM is the appearance
of the generated angiographies. Due to the use of a GAN framework in cyclical GAN, the
generated angiographies look realistic and aesthetically pleasing. In contrast, the angiographies generated by paired SSIM present a more synthetic appearance. The importance
of this difference in the appearance of the generated angiographies depends on the specific
application. On the one hand, for representation learning purposes, the priority is the
proper recognition of the different retinal structures. Additionally, even for the potential
clinical interpretation of the images, realism is not as important as the accurate reconstruction of the different structures. On the other hand, there exist potential applications
such as data augmentation or clinical simulations where the realism of the images is of
great importance.
Finally, a relevant observation presented in this chapter is the fact that cyclical GAN
does not necessarily keep the exact same structure of the input image. This is a known possible issue, given the underconstrained training setting in cyclical GANs. Nevertheless, in
this chapter, we have presented empirical evidence of this issue in the form of small displacements for the reconstructed blood vessels. According to the evidence presented in
Section 15.4.5, it is not possible to predict whether these displacements will happen or
how they will exactly be. In this sense, the particular structural displacements produced
by the networks is affected by the stochasticity of the training procedure. Moreover,
although we have only noticed these structural incoherence in the blood vessels, it would
be possible to note the existence of similar subtle structural transformations for other elements in the images. In line with prior observations in the presented comparison, the
importance of these structural errors depends on the specific application for which the multimodal reconstruction is applied. For instance, this kind of small structural variations
should not significantly affect the quality of the internal representations learned by the network. However, they would impede the use of cyclical GAN as a tool for accurate multimodal image registration. The development of hybrid methodologies, as previously
discussed, could also be a solution to this structural issue while keeping the good properties
of GANs. For instance, according to the results presented in Section 15.4.3, the addition of
a small number of paired training samples could be sufficient for improving the structural
coherence of the cyclical GAN approach. Additionally, a hybrid approach of this kind
could still incorporate those more challenging paired images that may not be successfully
registered.
To conclude, the presented cyclical GAN approach has been demonstrated to be a
valid alternative for the multimodal reconstruction of retinal images. In particular, the
provided comparison shows that cyclical GAN has both advantages and disadvantages
with respect to the state-of-the-art approach paired SSIM. In this regard, these two
approaches are complementary to each other when considering their strengths and weaknesses. This motivates the future development of hybrid methods aiming at taking advantage of the strengths of both alternatives.
Acknowledgments
This work was supported by Instituto de Salud Carlos III, Government of Spain, and the European Regional
Development Fund (ERDF) of the European Union (EU) through the DTS18/00136 research project, and
by Ministerio de Ciencia, Innovación y Universidades, Government of Spain, through the RTI2018-095894-B-I00 research project. The authors of this work also receive financial support from the ERDF
and European Social Fund (ESF) of the EU and Xunta de Galicia through Centro de Investigación de Galicia,
ref. ED431G 2019/01, and the predoctoral grant contract ref. ED481A-2017/328.
Conflict of interest
The authors declare no conflicts of interest.
References
[1] G. Litjens, T. Kooi, B.E. Bejnordi, A.A.A. Setio, F. Ciompi, M. Ghafoorian, J.A. van der Laak, B. van
Ginneken, C.I. Sánchez, A survey on deep learning in medical image analysis, Med. Image Anal.
42 (2017) 60–88, https://doi.org/10.1016/j.media.2017.07.005.
[2] E.D. Cole, E.A. Novais, R.N. Louzada, N.K. Waheed, Contemporary retinal imaging techniques in
diabetic retinopathy: a review, Clin. Exp. Ophthalmol. 44 (4) (2016) 289–299, https://doi.org/
10.1111/ceo.12711.
[3] T. Farncombe, K. Iniewski, Medical Imaging: Technology and Applications, CRC Press, 2017.
[4] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, M.S. Lew, Deep learning for visual understanding: a
review, Neurocomputing 187 (2016) 27–48, https://doi.org/10.1016/j.neucom.2015.09.116.
[5] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, P. Martinez-Gonzalez, J. GarciaRodriguez, A survey on deep learning techniques for image and video semantic segmentation, Appl.
Soft Comput. 70 (2018) 41–65. ISSN 15684946 https://doi.org/10.1016/j.asoc.2018.05.018.
[6] W. Rawat, Z. Wang, Deep convolutional neural networks for image classification: a comprehensive
review, Neural Comput. 29 (9) (2017) 2352–2449, https://doi.org/10.1162/neco_a_00990.
[7] A.S. Hervella, J. Rouco, J. Novo, M. Ortega, Retinal image understanding emerges from self-supervised multimodal reconstruction, in: Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2018, https://doi.org/10.1007/978-3-030-00928-1_37.
[8] S. Engelhardt, L. Sharan, M. Karck, R.D. Simone, I. Wolf, Cross-domain conditional generative
adversarial networks for stereoscopic hyperrealism in surgical training, in: Medical Image Computing
and Computer-Assisted Intervention (MICCAI), 2019, https://doi.org/10.1007/978-3-030-32254-0_18.
[9] X. Yi, E. Walia, P. Babyn, Generative adversarial network in medical imaging: a review, Med. Image
Anal. 58 (2019) 101552. ISSN 1361-8415 https://doi.org/10.1016/j.media.2019.101552.
[10] Á.S. Hervella, J. Rouco, J. Novo, M. Ortega, Learning the retinal anatomy from scarce annotated data
using self-supervised multimodal reconstruction, Appl. Soft Comput. 91 (2020) 106210, https://doi.
org/10.1016/j.asoc.2020.106210.
[11] A.S. Hervella, J. Rouco, J. Novo, M. Ortega, Deep multimodal reconstruction of retinal images using
paired or unpaired data, in: International Joint Conference on Neural Networks (IJCNN), 2019,
https://doi.org/10.1109/IJCNN.2019.8852082.
[12] L. Wang, W. Chen, W. Yang, F. Bi, F.R. Yu, A state-of-the-art review on image synthesis with generative adversarial networks, IEEE Access 8 (2020) 63514–63537, https://doi.org/10.1109/
ACCESS.2020.2982224.
[13] A.S. Hervella, L. Ramos, J. Rouco, J. Novo, M. Ortega, Multi-modal self-supervised pre-training for
joint optic disc and cup segmentation in eye fundus images, in: 2020 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 2020, https://doi.org/10.1109/
ICASSP40776.2020.9053551.
[14] J. Morano, A.S. Hervella, N. Barreira, J. Novo, J. Rouco, Multimodal transfer learning-based
approaches for retinal vascular segmentation, in: 24th European Conference on Artificial Intelligence
(ECAI), 2020.
[15] Á.S. Hervella, J. Rouco, J. Novo, M. Ortega, Self-supervised multimodal reconstruction of retinal
images over paired datasets, Expert Syst. Appl. (2020) 113674, https://doi.org/10.1016/j.
eswa.2020.113674.
[16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in Neural Information Processing Systems (NIPS), 27,
2014, pp. 2672–2680.
[17] Z. Pan, W. Yu, X. Yi, A. Khan, F. Yuan, Y. Zheng, Recent progress on generative adversarial networks (GANs): a survey, IEEE Access 7 (2019) 36322–36333, https://doi.org/10.1109/
ACCESS.2019.2905015.
[18] P. Isola, J.-Y. Zhu, T. Zhou, A.A. Efros, Image-to-image translation with conditional adversarial networks, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017,
https://doi.org/10.1109/CVPR.2017.632.
[19] J.-Y. Zhu, T. Park, P. Isola, A.A. Efros, Unpaired image-to-image translation using cycle-consistent
adversarial networks, in: 2017 IEEE International Conference on Computer Vision (ICCV), 2017,
https://doi.org/10.1109/ICCV.2017.244.
[20] Z. Yi, H. Zhang, P. Tan, M. Gong, DualGAN: unsupervised dual learning for image-to-image translation, in: The IEEE International Conference on Computer Vision (ICCV), 2017, https://doi.org/
10.1109/ICCV.2017.310.
[21] T. Kim, M. Cha, H. Kim, J.K. Lee, J. Kim, Learning to discover cross-domain relations with generative
adversarial networks, in: Proceedings of the 34th International Conference on Machine Learning,
vol. 70, 2017, pp. 1857–1865.
[22] J.M. Wolterink, T. Leiner, M.A. Viergever, I. Išgum, Generative adversarial networks for noise reduction in low-dose CT, IEEE Trans. Med. Imaging 36 (12) (2017) 2536–2545, https://doi.org/10.1109/
TMI.2017.2708987.
[23] Y. Xue, T. Xu, H. Zhang, L. Long, X. Huang, SegAN: adversarial network with multi-scale L1 loss for
medical image segmentation, Neuroinformatics 16 (3–4) (2018) 383–392. ISSN 1539-2791 https://
doi.org/10.1007/s12021-018-9377-x.
[24] M. Frid-Adar, I. Diamant, E. Klang, M. Amitai, J. Goldberger, H. Greenspan, GAN-based synthetic
medical image augmentation for increased CNN performance in liver lesion classification, Neurocomputing 321 (2018) 321–331. ISSN 0925-2312 https://doi.org/10.1016/j.neucom.2018.09.013.
[25] T. Schlegl, P. Seeböck, S.M. Waldstein, G. Langs, U. Schmidt-Erfurth, F-AnoGAN: fast unsupervised
anomaly detection with generative adversarial networks, Med. Image Anal. 54 (2019) 30–44. ISSN
1361-8415 https://doi.org/10.1016/j.media.2019.01.010.
[26] J. Cohen, M. Luck, S. Honari, Distribution matching losses can hallucinate features in medical image
translation, in: Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2018,
https://doi.org/10.1007/978-3-030-00928-1_60.
[27] X. Mao, Q. Li, H. Xie, R.Y. Lau, Z. Wang, S. Paul Smolley, Least squares generative adversarial networks, in: The IEEE International Conference on Computer Vision (ICCV), 2017, https://doi.org/
10.1109/ICCV.2017.304.
[28] D.P. Kingma, J. Ba, Adam: a method for stochastic optimization, in: International Conference on
Learning Representations (ICLR), 2015.
[29] A. Radford, L. Metz, S. Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, in: International Conference on Learning Representations (ICLR), 2016.
[30] Á.S. Hervella, J. Rouco, J. Novo, M.G. Penedo, M. Ortega, Deep multi-instance heatmap regression for the detection of retinal vessel crossings and bifurcations in eye fundus images, Comput.
Methods Prog. Biomed. 186 (2020) 105201. ISSN 0169-2607 https://doi.org/10.1016/j.cmpb.
2019.105201.
[31] A.S. Hervella, J. Rouco, J. Novo, M. Ortega, Multimodal registration of retinal images using domain-specific landmarks and vessel enhancement, in: International Conference on Knowledge-Based and
Intelligent Information and Engineering Systems (KES), 2018, https://doi.org/10.1016/j.
procs.2018.07.213.
[32] Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: from error visibility to
structural similarity, IEEE Trans. Image Process. 13 (4) (2004) 600–612, https://doi.org/10.1109/
TIP.2003.819861.
[33] D. Ulyanov, A. Vedaldi, V.S. Lempitsky, Improved texture networks: maximizing quality and diversity
in feed-forward stylization and texture synthesis, in: 2017 IEEE Conference on Computer Vision and
Pattern Recognition, CVPR, 2017, pp. 4105–4113, https://doi.org/10.1109/CVPR.2017.437.
[34] O. Ronneberger, P. Fischer, T. Brox, U-Net: convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2015, https://
doi.org/10.1007/978-3-319-24574-4_28.
[35] S.H.M. Alipour, H. Rabbani, M.R. Akhlaghi, Diabetic Retinopathy Grading by Digital Curvelet
Transform, Computational and Mathematical Methods in Medicine, 2012, https://doi.org/
10.1155/2012/761901.
[36] A.S. Hervella, J. Rouco, J. Novo, M. Ortega, Self-supervised deep learning for retinal vessel segmentation using automatically generated labels from multimodal data, in: International Joint Conference on
Neural Networks (IJCNN), 2019, https://doi.org/10.1109/IJCNN.2019.8851844.
[37] N. Otsu, A threshold selection method from gray-level histograms, IEEE Trans. Syst. Man Cybern. 9
(1) (1979) 62–66, https://doi.org/10.1109/TSMC.1979.4310076.
CHAPTER 16
Generative adversarial network for video anomaly detection

Thittaporn Ganokratanaa (a) and Supavadee Aramvith (b)

(a) Department of Electrical Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand
(b) Multimedia Data Analytics and Processing Research Unit, Department of Electrical Engineering, Faculty of Engineering, Chulalongkorn University, Bangkok, Thailand
16.1 Introduction
Video anomaly detection (VAD) has gained increasing recognition in surveillance systems for ensuring security. VAD is a challenging task due to the complex appearance structure of the images combined with the motion between frames. This research area has drawn interest from researchers in the computer vision community. The traditional approaches, including the social
force model (SF) [1], mixture of probabilistic principal component analyzers (MPPCA)
[2], Gaussian mixture of dynamic texture (MDT) and the combination of SF + MPPCA
[3, 4], sparse reconstruction [5–7], one class learning machine [8], K-nearest neighbor [9],
and tracklet analysis [10, 11] are proposed to challenge the anomaly detection problem
due to their performance in detecting multiple objects. However, the traditional
approaches do not perform well on the anomaly detection problem, since it is a complex task that mostly arises in crowded scenes, making it difficult for such approaches to generalize. Thus, deep learning approaches are
employed to achieve a higher anomaly detection rate, such as deep Gaussian mixture
model (GMM) [12, 13], autoencoders [14, 15], and deep pretrained convolutional neural networks (CNN) [16–19]. Even with these deep learning approaches, the problem remains open when dealing with all issues of anomaly detection. Specifically, the major challenges in the anomaly detection task fall into three categories: complex scenes, small numbers of anomaly samples, and object localization at the pixel level. A complex scene may contain multiple moving objects with clutter and occlusions that make detecting and localizing objects difficult. This issue also arises in the crowded scene, which is more challenging than the uncrowded scene. The second challenge is the small number of samples with abnormal ground truth in the available anomaly datasets, which hampers the training of data-hungry deep learning models. In practice, it is impossible to train on all anomalous events, as they occur randomly. Therefore, the anomaly detection task is treated in an unsupervised learning manner, since there is no requirement for data labeling of the positive rare class. Another important issue concerns
pixel localization of the objects in the scene. Previous works [1, 6, 15, 20] struggled with
this challenging task, achieving high accuracy only at frame-level anomaly detection. On the other hand, the accuracy of pixel-level anomaly localization is significantly poorer. In recent works [21–23], researchers have tried to improve the performance to cover all evaluation criteria, but they achieve good performance at either the frame or the pixel level only in some complex scenes. This happens as a consequence of insufficient input features
for training the model such as appearance and motion patterns of the objects. The features
of foreground objects should be extracted sufficiently and efficiently during training to
make the model understand all characteristics.
To deal with these challenges, the unsupervised deep learning-based approach is the
most suitable technique for the anomaly detection problem since it does not require any
labeled data on abnormalities. Unsupervised learning is a key domain of deep generative
models, such as adversarially trained autoencoders (AAE) [24], variational autoencoder
(VAE) [25], and generative adversarial networks (GAN) [26, 27]. Generative models
for anomaly detection aim to model only normal events in training as it is the majority
of the patterns. The abnormal events can be distinguished by evaluating the distance from
the learned normal events. Early generative works are mostly based on handcrafted features [1, 3, 5, 11, 28] or CNN [17, 18] to extract and learn the important features. However, the performances of anomaly detection and localization are still needed to be
improved due to the difficulties in approximating many probabilistic computations
and utilizing the piecewise linear units as in the generative models [29–31]. Hence, recent
trends for video anomaly detection focus more on GAN [20–22] which is an effective
approach that achieves high performance in image generation and synthesis, affords data
augmentation, and overcomes classification problems in the complex scenarios.
16.1.1 Anomaly detection for surveillance videos
Video surveillance has gained increasing popularity since it has been widely used to
ensure security. Closed-circuit television (CCTV) cameras are used to monitor the scene,
record certain situations, and provide evidence. They generally serve as a post-video forensic process that allows human operators to manually investigate abnormalities in previously recorded events [32]. This manual process is difficult for the operators, since abnormalities can occur in any situation, crowded or uncrowded, in indoor and
outdoor scenes. Additionally, it may cause serious problems, including a terrorist attack, a
robbery, and an area invasion, leading to personal injury or death and property damage
[33]. Thus, to enhance the performance of video surveillance, it is crucial to build an
intelligent system for anomaly detection and localization.
The anomaly is defined as “a person or thing that is different from what is usual, or not
in agreement with something else and therefore not satisfactory” [34]. Multiple terms
stand for the anomaly, including anomalous events, abnormal events, unusual events, abnormality, irregularity, and suspicious activity. In VAD, the abnormal event can be seen as a distinctive pattern or motion that differs from neighboring areas or from the majority of the activities in the scene. Specifically, normal events are the frequently occurring objects and common moving patterns that represent the majority of the patterns, while abnormal events are varied and rare, describing infrequent events that may include unseen objects and have a significantly lower probability than that of normal events. Examples of different abnormal events are shown in Fig. 16.1.

Fig. 16.1 Examples of abnormal events in crowds from the UCSD pedestrian [4], UMN [1], and CUHK Avenue [6] datasets.
The anomaly detection for surveillance videos is challenging because of complex patterns of the real scene (e.g., moving foreground objects with large amounts of occlusion
and clutter in crowds) captured by the static CCTV cameras. The VAD relies on fixed
CCTV cameras, which take only moving foreground objects into account while disregarding the static background. The goal of VAD is to accurately identify all possible
anomalous events from the regular normal patterns in crowded and complex scenes from
the video sequences. To design the effective anomaly detection for surveillance videos, it
is considered to learn all information of the objects from both of their appearance (spatial)
and motion (temporal) features under the unsupervised learning or semisupervised learning manner. In the model training, with the unsupervised learning task, only the frames of
normal events are trained, meaning that there is no data labeling on abnormalities. This
benefits the use of VAD in real-world environments where any type of abnormal events
can unpredictably occur. Then, all videos are fed into the model during testing. Any pattern deviated from the trained normal samples is identified as abnormal events that can be
detected by evaluating the anomaly score known as the error of the predictive model in a
vector space or the posterior probability of the test samples.
16.1.2 A broader view of generative adversarial network for anomaly
detection in videos
A GAN has been studied for years. The success of GAN comes from the effectiveness
of its structure in improving image generation and classification tasks with a pair of
networks. GAN presents an end-to-end deep learning framework in modeling the
likelihood of normal events in videos and provides flexibility in model training since
it does not require annotated abnormal samples. Its learning is achieved through
deriving backpropagation to compute the error of each parameter in both generator
and discriminator networks. The goals of GAN are to produce the synthetic output
that is not able to be identified as different from the real data and to automatically
learn a loss function to achieve the indistinguishable output goal. The loss of GAN
attempts to classify whether the synthetic output is fake or real, while it is trained to
be minimized in the generative model at the same time. This loss makes it more beneficial to apply GAN in various applications since it can adapt to the data without
requiring different loss functions, unlike the loss functions of the traditional CNNs
approach.
Specifically, in video anomaly detection, the GAN is framed as a two-player minimax game between a generator G and a discriminator D, providing high-accuracy output. G attempts to fool D by generating synthetic images that are similar to the real data, whereas D strives to discriminate whether its input comes from the real data or from G. This minimax game benefits data augmentation and implicit data management: D assists G in reducing the distance between its samples and the training data distribution and in training on small benchmarks without the need to define an explicit parametric likelihood function or additional classifiers. Therefore, the GAN is one of the most distinctive approaches for complex anomaly detection tasks since it achieves good results in reconstructing, translating, and classifying images. Following the unsupervised GAN for image-to-image translation [35], it can extract significant features of the objects of interest (e.g., moving foreground objects) and efficiently translate them from spatial to temporal representations without any prior knowledge of anomalies or direct information on anomaly types. In this way, the GAN can provide comprehensive information concerning appearance and motion features. Hence, we focus on reviewing GANs for the anomaly detection task in videos and also introduce our proposed method, the deep spatiotemporal translation network (DSTN), a novel unsupervised GAN approach to detect and localize anomalies in crowded scenes [26].
This chapter is organized as follows. In Section 16.1, anomaly detection for surveillance videos and a broader view of the GAN are reviewed. Section 16.2 presents a literature review, including the basic structure of the GAN along with the literature on GAN-based anomaly detection in videos. We elaborate on GAN training in Section 16.3, which covers image-to-image translation and our proposed DSTN. The performance of the DSTN is discussed in Section 16.4 along with related details, including the publicly available anomaly benchmarks, the evaluation criteria, a comparison of the GAN with an autoencoder, and the advantages and limitations of GANs for anomaly detection in videos. Finally, Section 16.5 provides a conclusion for this chapter.
16.2 Literature review
We introduce the basic structure of the GAN and review the related works on anomaly
detection for surveillance videos based on GAN. The details of GAN architecture and its
state-of-the-art methods in video anomaly detection are described as follows.
16.2.1 The basic structure of generative adversarial network
Although generative models have been studied in machine learning for many years, they gained wide recognition when Goodfellow et al. [27] introduced a novel adversarial process named the GAN. The basic structure of a GAN consists of two networks working simultaneously against each other, the generator G and the discriminator D, as shown in Fig. 16.2. In general, G produces a synthetic image n from input noise z, whereas D attempts to differentiate between n and a real image r. The goals of G are to generate synthesized examples of objects that look like the real ones and to fool D into wrongly deciding that the synthesized data generated by G are real. On the other hand, D is trained on a dataset with image labels. D tries its best to discriminate whether its input data are fake or real by comparing them with the real training data. In other words, G is a counterfeiter producing fake checks, while D is an officer trying to catch G. Specifically, G becomes good at creating synthesized images because it updates its parameters using gradients propagated through D, making it more challenging for D to differentiate its input data. The training of this minimax game makes both networks better until, at some point (given enough capacity and training time), the probability distribution of G matches that of the real data, so that G and D can no longer improve. At that point, D is unable to differentiate between the two distributions.
Fig. 16.2 Generative adversarial network architecture.
From the perspective of GAN training, G takes input noise z drawn from a probability distribution p_z(z), generates fake data, and feeds it into D as D(G(z)). D(x) denotes the probability that x comes from the real data distribution p_data rather than from the generator distribution p_g. The discriminator D thus receives two kinds of input, G(z) and samples from p_data. D is trained to maximize the probability of assigning the correct label to both the real and the synthesized examples. Specifically, the goal of D is to accurately classify its input samples by giving a label of 1 to real samples and a label of 0 to synthetic ones; D solves a binary classification problem using a neural network with a sigmoid output in the range [0,1]. G is simultaneously trained to minimize log(1 − D(G(z))). These two adversarial networks, G and D, are coupled through the value function V(D, G) as follows:

\min_G \max_D V(D, G)    (16.1)

V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]    (16.2)
where E[log D(x)] is the discriminator's term, representing the expected log-probability of the real data distribution p_data passing through D; D tries to maximize it. Note that the discriminator's objective is maximized when real and synthetic samples are accurately classified as 1 and 0, respectively. The generator's term is E[log(1 − D(G(z)))]: random noise samples z pass through G to generate synthetic (fake) samples, which then pass through D, and G is trained to minimize this term. The goal of the generator's objective is to fool D into a wrong classification, that is, to push D toward identifying the synthetic samples as real ones (labeling them as 1). In other words, G attempts to minimize the likelihood that D classifies these samples correctly as fake. Thus, log(1 − D(G(z))) decreases when the synthesized samples are wrongly labeled as 1. In practice, however, G is poor at generating synthetic samples in the early training stage, which makes it too easy for D to separate the synthesized samples from the real samples because of their large difference, and log(1 − D(G(z))) saturates. To solve this problem, the generator's objective can be changed from minimizing log(1 − D(G(z))) to maximizing log D(G(z)), which provides a sufficient gradient for G. This alternative objective gives a much stronger gradient in the early stage of the generator's training.
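To make the difference between the two generator objectives concrete, the following minimal sketch (written in TensorFlow/Keras for illustration only, not taken from any cited implementation; all function names are our own) expresses both the original saturating loss and the alternative non-saturating loss as quantities to be minimized:

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_logits, fake_logits):
    # D maximizes log D(x) + log(1 - D(G(z))): equivalently, minimize the
    # binary cross-entropy with label 1 for real and label 0 for fake samples.
    return bce(tf.ones_like(real_logits), real_logits) + \
           bce(tf.zeros_like(fake_logits), fake_logits)

def generator_loss_saturating(fake_logits):
    # Original minimax term: minimize log(1 - D(G(z))).
    # bce(0, fake) = -log(1 - D(G(z))), so the term is its negative; its
    # gradient vanishes when D easily rejects the early, poor fakes.
    return -bce(tf.zeros_like(fake_logits), fake_logits)

def generator_loss_nonsaturating(fake_logits):
    # Alternative objective: maximize log D(G(z)), i.e., minimize -log D(G(z)),
    # which keeps a strong gradient in the early stage of training.
    return bce(tf.ones_like(fake_logits), fake_logits)

In both variants the discriminator loss is unchanged; only the direction and strength of the generator's gradient signal differ.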
As the two objective functions are distinct, the two networks are trained together by alternating the gradient updates, following the standard gradient rule with a momentum parameter. There are two main procedures for training the G and D networks so that their gradients are updated alternately. The first step is to freeze G and train only D. This alternating gradient update is motivated by the fact that the discriminator needs to learn from the outputs of the generator in order to separate the real data from the fake data; the generator is therefore kept frozen. The discriminator network is updated as shown in the following equation:

\nabla_{\theta_d} \frac{1}{m} \sum_{i=1}^{m} \left[ \log D\left(x^{(i)}\right) + \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right) \right]    (16.3)
Specifically, the gradient updates differ between the two networks: D uses stochastic gradient ascent, while G uses stochastic gradient descent. A hyperparameter k specifies how many update steps D performs for each step of G. To update D, stochastic gradient ascent is applied k times to increase the likelihood that D accurately labels both kinds of samples (fake and real data). These updates are achieved using backpropagation on an equal number of fake and real examples (a combined batch of 2m samples). Let the m noise samples {z^{(1)}, z^{(2)}, …, z^{(m)}} be drawn from the generator's noise prior p_g(z) and the m real examples {x^{(1)}, x^{(2)}, …, x^{(m)}} from the real data distribution p_data. Once D is updated, only G is trained to update its gradient, as shown in the following equation:

\nabla_{\theta_g} \frac{1}{m} \sum_{i=1}^{m} \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right)    (16.4)
Concerning the update of the generator network, m noise samples are input into G only once to generate m synthesized examples. G uses stochastic gradient descent to minimize the likelihood that D labels the synthesized samples correctly. The generator's objective aims to minimize log(1 − D(G(z))), boosting the likelihood that synthetic examples are classified as real. This process computes gradients through both networks during backpropagation but updates only the parameters of G; D is kept constant while G is trained to prevent the possibility that G never converges.
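The alternating schedule described above can be sketched as follows (TensorFlow/Keras, eager mode; `generator` and `discriminator` are assumed to be ordinary Keras models, and the loss helpers from the previous sketch are reused). This is an illustration of the k-step D / 1-step G procedure under our own assumptions, not a reference implementation of any cited work:

import tensorflow as tf

d_opt = tf.keras.optimizers.Adam(2e-4)
g_opt = tf.keras.optimizers.Adam(2e-4)

def train_step(generator, discriminator, real_batch, noise_dim, k=1):
    m = real_batch.shape[0]                      # m real and m fake samples
    # Step 1: freeze G and update D for k steps (only D's variables change).
    for _ in range(k):
        z = tf.random.normal((m, noise_dim))
        with tf.GradientTape() as tape:
            fake = generator(z, training=True)
            d_loss = discriminator_loss(discriminator(real_batch, training=True),
                                        discriminator(fake, training=True))
        grads = tape.gradient(d_loss, discriminator.trainable_variables)
        d_opt.apply_gradients(zip(grads, discriminator.trainable_variables))
    # Step 2: freeze D and update G once; gradients flow through D, but only
    # G's parameters are changed, which keeps D constant while G is trained.
    z = tf.random.normal((m, noise_dim))
    with tf.GradientTape() as tape:
        fake = generator(z, training=True)
        g_loss = generator_loss_nonsaturating(discriminator(fake, training=True))
    grads = tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(grads, generator.trainable_variables))
    return d_loss, g_loss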
16.2.2 The literature of video anomaly detection based on generative
adversarial network
Here we review recent works that use GANs for anomaly detection in crowds. Three notable video anomaly detection works are described below, ordered by publication year.
16.2.2.1 Cross-channel generative adversarial networks
Starting with the work proposed by Ravanbakhsh et al. [20]: this work applies conditional GANs (cGANs), in which both the generator G and the discriminator D are conditioned on the real data, and relies on the idea of image-to-image translation [35]. Following the characteristics of cGANs, the input image x is fed to G to produce a generated image p that looks realistic. G attempts to deceive D into believing that p is real, while D strives to tell the real image apart from p. The paper argues that a U-Net structure [36] in the generative network and a patch discriminator (Markovian discriminator) favor the transformation of images between different representations (e.g., from spatial to temporal representations). The authors adopt this concept to translate the appearance of a frame into the motion of optical flow, aiming to learn only the normal patterns. To detect an abnormality, they compare the generated image with the real image using a simple pixel-by-pixel difference together with a network pretrained on ImageNet [37]. The framework of anomaly detection in videos using cGANs during testing is shown in Fig. 16.3.
More specifically, the authors train two networks: N^{F→O}, which uses frames F to generate optical flow O, and N^{O→F}, which uses optical flow O to generate frames F. Assume that F_t is the frame of the training video sequence with RGB channels at time t and O_t is the optical flow containing three channels (horizontal, vertical, and magnitude). O_t is obtained from two consecutive frames, F_t and F_{t+1}, following the computation in Ref. [38]. As both the generative and discriminative models are conditional networks, G generates output from its two inputs, an image x and a noise vector z, providing a synthetic output image p = G(x, z). In the case of N^{F→O}, x is assigned as the current frame, x = F_t; the target for its corresponding optical flow (the synthetic output image p) is then y = O_t. D takes two inputs, either (x, y) or (x, p), and yields the probability of the class to which the pair belongs. The loss functions comprise a reconstruction loss L_{L1} and a conditional GAN loss L_{cGAN}, as shown in Eqs. (16.5) and (16.6), respectively.
For N^{F→O}, L_{L1} is determined with the training set X = {(F_t, O_t)} as

L_{L1}(x, y) = \| y - G(x, z) \|_1    (16.5)

whereas L_{cGAN} is assigned as

L_{cGAN}(D, G) = \mathbb{E}_{(x,y)\sim X}[\log D(x, y)] + \mathbb{E}_{x\sim\{F_t\},\, z\sim Z}[\log(1 - D(x, G(x, z)))]    (16.6)
Fig. 16.3 A framework of video anomaly detection using conditional generative adversarial nets
(cGANs) during testing in Ref. [20]. There are two generator networks: (i) producing a
corresponding optical flow image from its input frames and (ii) reconstructing an appearance from
a real optical flow image.
In contrast, the training set of N^{O→F} is X = {(O_t, F_t)}_{t=1}^{N}. Once training is finished, the only model used during testing is G, consisting of the G^{F→O} and G^{O→F} networks. Neither network is able to reconstruct abnormalities since both have been trained only on normal events. The abnormality can then be found by pixel-wise subtraction of O and p_O, that is, ΔO = O − p_O, where p_O = G^{F→O}(F) is the optical flow reconstruction obtained from F. The other network, G^{O→F}(O), produces the appearance reconstruction p_F; however, ΔO carries more information than the difference between F and p_F. For the appearance channel, the authors therefore add an additional network to measure the difference from a semantic perspective, ΔS, using AlexNet [39] with its fifth convolutional layer h: ΔS = h(F) − h(p_F). The two differences ΔO and ΔS are normalized to [0,1] and combined into an abnormality map. Finally, the abnormality heatmap score is obtained by summing the normalized semantic difference map N_S and the normalized optical flow difference map N_O, A = N_S + λN_O with λ = 2.
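As a rough sketch of how this final heatmap is assembled from the two difference maps, the following NumPy fragment normalizes ΔO and ΔS and fuses them with λ = 2; the exact normalization and upsampling used in Ref. [20] may differ, and the array names are ours:

import numpy as np

def abnormality_heatmap(delta_o, delta_s, lam=2.0):
    """Fuse optical-flow and semantic differences into one heatmap (after Ref. [20]).

    delta_o: pixel-wise difference between O and its reconstruction p_O.
    delta_s: semantic difference h(F) - h(p_F), resized to the frame size.
    """
    def normalize(x):
        x = np.abs(x).astype(np.float32)
        return (x - x.min()) / (x.max() - x.min() + 1e-8)   # map into [0, 1]

    n_o = normalize(delta_o)   # N_O
    n_s = normalize(delta_s)   # N_S
    return n_s + lam * n_o     # A = N_S + lambda * N_O, lambda = 2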
16.2.2.2 Future frame prediction based on generative adversarial network
Apart from the above work, there is an approach based on GAN for predicting future video frames to detect abnormalities, proposed by Liu et al. [21]. This work is motivated by the observation that most anomaly detection methods focus on minimizing the reconstruction error on the training data. Instead, the authors propose unsupervised feature learning for video prediction and leverage the difference between the predicted frame and the real data for anomaly detection. The framework of video future-frame prediction for detecting anomalies is shown in Fig. 16.4. In the training stage, only normal events are learned, since they are considered predictable patterns, using both appearance and motion constraints. During testing, all frames are input and compared with the predicted frame: if the input frame agrees with the predicted frame, it is a normal event; otherwise, it is an anomalous event. A good predictor is the key to this work; the U-Net network [36] is therefore chosen because of its performance in translating images within the GAN model.
Fig. 16.4 Video future prediction framework for anomaly detection [21]. The U-Net structure and a pretrained Flownet are used to predict a target frame and to obtain optical flow, respectively. Adversarial training is used to distinguish whether a predicted frame is real or fake.
In mathematical terms, consider a video sequence containing t frames I_1, I_2, …, I_t. In this work, the future frame is denoted I_{t+1} and the predicted future frame \hat{I}_{t+1}. The goal is to make \hat{I}_{t+1} close to I_{t+1}, so that it can be determined whether \hat{I}_{t+1} corresponds to an abnormal or a normal event, by minimizing their distance in terms of intensity and gradient. In addition, optical flow is used to represent the temporal features between frames I_{t+1} and I_t, and between \hat{I}_{t+1} and I_t. We first look at the generator objective function L_G, consisting of appearance (intensity L_{int} and gradient L_{gd}), motion L_{op}, and adversarial training L^{G}_{adv} terms, in Eq. (16.7):

L_G = \lambda_{int} L_{int}(\hat{I}_{t+1}, I_{t+1}) + \lambda_{gd} L_{gd}(\hat{I}_{t+1}, I_{t+1}) + \lambda_{op} L_{op}(\hat{I}_{t+1}, I_{t+1}, I_t) + \lambda_{adv} L^{G}_{adv}(\hat{I}_{t+1})    (16.7)
The discriminator objective function L_D is defined in Eq. (16.8):

L_D = L^{D}_{adv}(\hat{I}_{t+1}, I_{t+1})    (16.8)
The authors follow Ref. [40] in using intensity and gradient differences. Specifically, the intensity and gradient penalties ensure the similarity of all pixels and the sharpness of the generated images, respectively. Denote \hat{I}_{t+1} by \hat{I} and I_{t+1} by I. The \ell_2 distance between \hat{I} and I is minimized in intensity to guarantee similarity in the RGB space, as shown in the following equation:

L_{int}(\hat{I}, I) = \| \hat{I} - I \|_2^2    (16.9)

The gradient loss is then defined as follows [40] in Eq. (16.10):

L_{gd}(\hat{I}, I) = \sum_{i,j} \left( \left\| \left|\hat{I}_{i,j} - \hat{I}_{i-1,j}\right| - \left|I_{i,j} - I_{i-1,j}\right| \right\|_1 + \left\| \left|\hat{I}_{i,j} - \hat{I}_{i,j-1}\right| - \left|I_{i,j} - I_{i,j-1}\right| \right\|_1 \right)    (16.10)

where i and j index the pixels of the frame.
where i and j are the frame index.
Then the optical flow estimation is applied by using a pretrained network, Flownet
[41], denoted as f. The temporal loss is defined in the following equation:
Lop I^t + 1 , It + 1 , It ¼ f I^t + 1 , I1 f ðIt + 1 , I1 Þ1 :
(16.11)
In the manner of the adversarial network, the training is an alternating update. The U-Net is used as the generator, while a patch discriminator is used as the discriminator, following Ref. [35]. To train the discriminator D, a label of 0 is assigned to a fake image and a label of 1 to a real image; the goal of D is to classify the real future frame I_{t+1} into class 1 and the predicted future frame \hat{I}_{t+1} into class 0. During training of the discriminator D, the weights of G are fixed and a mean square error (MSE) loss, denoted L_{MSE}, is used. Hence, the adversarial loss of D can be defined in the following equation:

L^{D}_{adv}(\hat{I}, I) = \sum_{i,j} \tfrac{1}{2} L_{MSE}\left(D(I)_{i,j}, 1\right) + \sum_{i,j} \tfrac{1}{2} L_{MSE}\left(D(\hat{I})_{i,j}, 0\right)    (16.12)

where i and j are the patch indices. The MSE loss function L_{MSE} is defined in the following equation:

L_{MSE}(\hat{Y}, Y) = (\hat{Y} - Y)^2    (16.13)

where the values of Y and \hat{Y} lie in [0, 1].
In contrast, the objective of the generator G is to reconstruct images that fool D into labeling them as 1. The weights of D are fixed while G is trained. Thus, the adversarial loss of G is defined as shown in the following equation:

L^{G}_{adv}(\hat{I}) = \sum_{i,j} \tfrac{1}{2} L_{MSE}\left(D(\hat{I})_{i,j}, 1\right)    (16.14)

To conclude, the appearance, motion, and adversarial training constraints ensure that normal events are generated well; events with a large difference between the prediction and the real data are classified as abnormalities.
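Putting the pieces of Eq. (16.7) together, a compact TensorFlow sketch of the generator objective might look like the following; the λ weights are placeholders rather than the values tuned in Ref. [21], and `flow_pred`/`flow_true` stand for the Flownet outputs f(Î_{t+1}, I_t) and f(I_{t+1}, I_t):

import tensorflow as tf

def intensity_loss(pred, target):
    # L_int, Eq. (16.9): l2 distance in RGB space.
    return tf.reduce_mean(tf.square(pred - target))

def gradient_loss(pred, target):
    # L_gd, Eq. (16.10): l1 distance between horizontal/vertical image gradients.
    def grads(x):                      # x: [batch, height, width, channels]
        gx = tf.abs(x[:, :, 1:, :] - x[:, :, :-1, :])
        gy = tf.abs(x[:, 1:, :, :] - x[:, :-1, :, :])
        return gx, gy
    pgx, pgy = grads(pred)
    tgx, tgy = grads(target)
    return tf.reduce_mean(tf.abs(pgx - tgx)) + tf.reduce_mean(tf.abs(pgy - tgy))

def generator_objective(pred, target, flow_pred, flow_true, adv_term,
                        lam_int=1.0, lam_gd=1.0, lam_op=1.0, lam_adv=0.05):
    # Eq. (16.7): weighted sum of appearance, motion, and adversarial terms.
    l_op = tf.reduce_mean(tf.abs(flow_pred - flow_true))   # L_op, Eq. (16.11)
    return (lam_int * intensity_loss(pred, target)
            + lam_gd * gradient_loss(pred, target)
            + lam_op * l_op
            + lam_adv * adv_term)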
16.2.2.3 Cross-channel adversarial discriminators
Following Ref. [20], the same authors proposed another GAN-based approach [22] for abnormality detection in crowd behavior. The training procedure is the same as in Ref. [20]: only the frames of normal events are used to train cross-channel networks based on conditional GANs, engaging G to translate the raw pixel image to the optical flow, inspired by Isola et al. [35]. This paper takes advantage of the U-Net framework [36] for translating one image into another and for taking multichannel data, i.e., spatial and temporal representations, into account, similarly to Refs. [20] and [21]. The novel part is in the testing phase, where the authors propose an end-to-end framework without additional classifiers by using the learned discriminator as the classifier of abnormalities. The framework of the cross-channel adversarial discriminators is shown in Fig. 16.5. Briefly, G and D are trained simultaneously only on the frames of normal events. G generates synthetic images from the learned normal events, while D learns to differentiate whether its input belongs to the normal data distribution, so abnormal events are defined as outliers in this sense. During testing, only D is used, directly classifying anomalies in the scene. In this way, there is no need to reconstruct images at testing time, unlike the common GAN-based models [20, 21] that use G in testing.
Fig. 16.5 Cross-channel adversarial discriminators flow diagram with additional detail on parameters
following [22]. Two generator networks are used during training: (i) generating a corresponding optical
flow image and (ii) reconstructing an appearance. At testing time, only discriminative networks are
used and represented as a learned decision boundary to detect anomalies.
Specifically, two networks are used for training: N^{F→O} and N^{O→F}. Suppose that F_t is a frame (at time t) of a video sequence and O_t is the optical flow acquired from two consecutive frames, F_t and F_{t+1}, following the optical flow computation based on the theory for warping [38]. In this work, G and D are conditioned on each other. G takes an image x and a noise vector z and outputs a synthetic optical flow r = G(x, z). For N^{F→O}, let x be the current frame, x = F_t; the target for its corresponding optical flow r is then y = O_t. Conversely, D takes two inputs, either (x, y) or (x, r), and yields the probability of the class to which the pair belongs. The reconstruction loss L_{L1} and the conditional GAN loss L_{cGAN} can be obtained as follows.
In the case of N^{F→O}, L_{L1} is determined with the training set X = {(F_t, O_t)}, as shown in the following equation:

L_{L1}(x, y) = \| y - G(x, z) \|_1    (16.15)

whereas L_{cGAN} is represented in the following equation:

L_{cGAN}(D, G) = \mathbb{E}_{(x,y)\sim X}[\log D(x, y)] + \mathbb{E}_{x\sim\{F_t\},\, z\sim Z}[\log(1 - D(x, G(x, z)))]    (16.16)
In contrast, the training set of N^{O→F} is X = {(O_t, F_t)}_{t=1}^{N}. Note that the training procedure is the same as in Ref. [20]. G acts as implicit supervision for D. Both the G^{F→O} and G^{O→F} networks lack the ability to reconstruct abnormal events because they observe only normal events during training, while D^{F→O} and D^{O→F} have learned the patterns needed to distinguish real data from artifacts.
The discriminator is considered a learned decision boundary that separates the densest area (i.e., the normal events x3) from the rest (i.e., abnormal events x1 and generated images x2). Since the goal is to detect the abnormal events x1, samples falling outside the decision boundary are judged as outliers by D. During testing, the authors use only the discriminative networks for the two channel-transformation tasks. The patch-based discriminators D̂^{F→O} and D̂^{O→F} are applied to the test frame F and its corresponding optical flow O on the same 30 × 30 grid, resulting in two 30 × 30 score maps, S_O for D̂^{F→O} and S_F for D̂^{O→F}. In detail, a patch p_F on F and a patch p_O on O are input to D̂^{F→O}. Any abnormal event occurring in these patches (p_F and/or p_O) is considered an outlier according to the distribution learned by D̂^{F→O}, resulting in a low probability score D̂^{F→O}(p_F, p_O). To finalize the anomaly maps, the normalized channel score maps are fused with equal weights, S = S_O + S_F, in the range [0,1], and a range of thresholds is then applied to compute the ROC curves.
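A small sketch of this test-time fusion (our own NumPy illustration, with hypothetical array names): the two 30 × 30 discriminator score maps are summed with equal weights, normalized to [0, 1], and thresholded, where low scores mark patches judged as outliers:

import numpy as np

def fused_score_map(s_o, s_f):
    # S = S_O + S_F, then rescale to [0, 1] so thresholds are comparable.
    s = s_o + s_f
    return (s - s.min()) / (s.max() - s.min() + 1e-8)

def outlier_patches(score_map, threshold):
    # D assigns low probability to patches outside the learned normal
    # distribution, so patches below the threshold are flagged as abnormal;
    # sweeping the threshold over [0, 1] yields the ROC curve.
    return score_map < threshold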
We note some interesting points in this work: (i) the authors state that D^{F→O} provides higher performance than D^{O→F}, since the input of D^{F→O} is the real frame, which contains more information than the optical flow frame; (ii) their proposed end-to-end framework is simpler and faster in testing than Ref. [20], since it requires neither the generative models during testing nor any additional classifier on top of the model, such as a pretrained AlexNet [39]. The observations from these earlier works inspired us to propose our method [26], which we discuss in Section 16.3.2.
16.3 Training a generative adversarial network
16.3.1 Using generative adversarial network based on the
image-to-image translation
Anomalous object observation using an unsupervised learning approach can be cast as a structural problem of the reconstruction model, known as a per-pixel classification or regression problem. The common framework used to solve this type of problem, explored by the state-of-the-art works in the anomaly detection task [20–22], is the generative image-to-image translation network constructed by Isola et al. [35]. In general, this network learns an optimal mapping from the input to the output image based on the GAN objective function. Their experimental results show that the network is good at generating synthesized images, such as colorization, object reconstruction from edges, and generation from label maps.
From an overall perspective, in the original GAN [27] the generator maps a noise vector z to an output image y, represented as G: z → y. Conditional GANs, in contrast, learn a mapping from two inputs, x and z, to y, represented as G: {x, z} → y. However, z is not strictly necessary for the network: G can learn the mapping without z, and especially in early training G learns to ignore z. Thus, the authors decided to apply z in the form of dropout in both the training and testing processes. Considering the objective function, since the architecture of the image-to-image translation network is based on conditional GANs, its objective functions are the same as those explained for Refs. [20] (see Eqs. 16.5 and 16.6) and [22] (see Eqs. 16.15 and 16.16), where the discriminator requires two inputs, D(x, y), and an L1 loss is used to help output sharper images. The future frame prediction work [21], on the other hand, applied an unconditional GAN that uses only one input for the discriminator, D(y) (see Eq. 16.12), and relied on traditional L2 regression (an MSE loss) to tie the output to the input. This forced condition results in lower performance (i.e., blurrier images) on frame-level anomaly detection compared to Refs. [20] and [22].
16.3.2 Unsupervised learning of generative adversarial network for video
anomaly detection
In this section, we introduce our proposed method, named DSTN [26]. We take advantage of the image-to-image translation architecture with the U-Net network [36] to translate the spatial domain to the temporal domain. In this way, we obtain comprehensive information on the objects from both appearance and motion (optical flow). The proposed DSTN differs from the previous works [20–22] in that we focus on a single deep spatiotemporal translation network to enhance anomaly detection performance at the frame level and the more challenging anomaly localization at the pixel level, with regard to both accuracy and computational time. Specifically, we include preprocessing and postprocessing stages to assist the learning of the GAN without using any pretrained network to help with classification, making the DSTN faster and more flexible. We also differ from Ref. [35] in that our target output is the motion information of the object corresponding to its appearance, not realistic images. There are two main procedures each for training and testing. For training, a feature collection and a spatiotemporal translation play the essential roles of collecting sufficient information and effectively learning the model, respectively. A differentiation and an edge wrapping are then utilized at testing time. We explain the main components of our proposed method in detail, including the system overview for both training and testing, as follows.
16.3.2.1 System overview
We first start with the system overview of our DSTN. The DSTN is based on a GAN and is augmented with preprocessing and postprocessing procedures to improve the performance in learning normalities and localizing anomalies. Overall, the main components of the DSTN are fourfold: a feature collection, a spatiotemporal translation, a differentiation, and an edge wrapping. The feature collection is the key initial process for extracting the appearance of objects; these features are fed into the model to learn the normal patterns. In our case, the generator G is used at both training and testing time, while the discriminator D is used only at training time. During training, G learns the normal patterns from the training videos; hence it understands, and has knowledge of, only what normal patterns look like. The reason we feed only frames of normal patterns is that we need the model to be flexible and able to handle all possible anomalous events in real-world environments without labels of anomalies. In testing, all videos, including normal and abnormal events, are input into the model, where G tries to reconstruct the appearance and motion representations from the learned normal events. Since G has not learned any abnormal samples, it is unable to reconstruct the abnormal areas properly. We exploit this inability to correctly reconstruct anomalous events in order to detect the anomalies in the scene. The anomalies can be exposed by subtracting pixels in the local area between the synthesized image and the real image and then applying edge wrapping at the final stage to obtain precise edges of the abnormal objects.
Specifically, during training, only normal events of original frames f are input with
background removal frames fBR into the generative network G, which contains encoder
En and decoder De, to generate dense optical flow (DIS) frames OFgen representing the
motion of the normal objects. To attain good optical flow, the real DIS optical flow frame
OFdis and fBR are fused to eliminate noise that frequently occurs in OFdis, giving Fused
Dense Optical Flow frames OFfus. The patches of f and fBR are concatenated and fed into
G to produce the patches of OFgen, while D has two alternate inputs, the patches of OFfus
(real optical flow image) and OFgen (synthetic optical flow image), and tries to discriminate whether OFgen is fake or real. The training framework of DSTN is shown in Fig.
16.6.
After training, the DSTN model has learned a mapping from the appearance representation of normal events to its corresponding dense optical flow (motion representation). All parameters used in training are also used in testing. During testing, the unknown events from the test videos are reconstructed by G. However, for unknown events the reconstruction of G yields unstructured blobs based on its knowledge of the learned normal patterns, and these unstructured blobs are considered anomalies. To capture the anomalies, the differentiation is computed by subtracting the patches of OFgen from the patches of OFfus. Note that not only anomaly detection but also anomaly localization is essential for real-world use. Therefore, edge wrapping (EW) is proposed to obtain the final output by retaining only the actual edges corresponding to the real abnormal objects and suppressing the rest. The DSTN framework at testing time is shown in Fig. 16.7.
16.3.2.2 Feature collection
We explain our proposed DSTN separately for training and testing time. During training, even though the GAN is good at data augmentation and image generation on small datasets, it still requires sufficient features (e.g., appearance and motion features of objects) from the data examples to feed the data-hungry characteristics of a deep learning-based model. The importance of feature extraction is therefore recognized and represented as the preprocessing procedure carried out before learning the model. There are several procedures in the feature collection, including (i) background removal, (ii) fusion, (iii) patch extraction, and (iv) concatenated spatiotemporal features, as described below.
Fig. 16.6 A training framework of the proposed DSTN.
Fig. 16.7 A testing framework of the proposed DSTN [26].
In (i) background removal, we take only the moving foreground objects into account because we focus on the realistic situation of CCTV cameras; the static background is ignored in this sense. This step helps to extract the object features and remove irrelevant background pixels so that only the important appearance information is retained. Let f_t be the current frame of the video at time t and f_{t−1} be the previous frame. The background removal f_BR is computed using the frame absolute difference, as shown in the following equation:

f_{BR} = |f_t - f_{t-1}|    (16.17)

After computing the frame absolute difference, the background removal output is binarized and then concatenated with the original frame f to acquire more information on the appearance, assisting the learning of the generator; simply put, more input features lead to better generator performance. The significance of concatenating the f_BR and f frames is that f_BR delivers extra features on the appearance of foreground objects, while f contains all of the information that f_BR may lose during the subtraction process.
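A minimal OpenCV sketch of the background-removal step and the spatial concatenation follows; the binarization threshold below is illustrative (the chapter does not fix its value here), and grayscale uint8 frames are assumed:

import cv2
import numpy as np

def background_removal(frame_t, frame_prev, bin_thresh=25):
    # Eq. (16.17): f_BR = |f_t - f_{t-1}|, followed by binarization.
    diff = cv2.absdiff(frame_t, frame_prev)
    _, f_br = cv2.threshold(diff, bin_thresh, 255, cv2.THRESH_BINARY)
    return f_br

def concatenated_appearance(frame_t, f_br):
    # Stack f and f_BR along the channel axis: the 2-channel generator input.
    return np.stack([frame_t, f_br], axis=-1)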
For (ii) fusion: according to the literature on GANs for video anomaly detection [20–22], all of these works apply the theory for warping [38] to represent the temporal features. However, it has problems in capturing all information on the objects and also has a high time complexity. Since video anomaly detection requires reliable performance in terms of both accuracy and running time, the theory for warping is not suitable for this task. To achieve the best performance on motion representation, we use dense inverse search (DIS) [42] to represent the motion features of foreground objects in surveillance videos, owing to its high accuracy and low time complexity in detecting and tracking objects. The DIS optical flow OF_dis is obtained from two consecutive frames f_t and f_{t−1}, as shown in Fig. 16.8, where the resolution of the f_t and f_{t−1} frames is 238 × 158 pixels, following the UCSD Ped1 dataset [4]. The number of channels c_p for the f_t and f_{t−1} frames is 1 (c_p = 1), while c_p of OF_dis is 3 (c_p = 3).
Fig. 16.8 Dense inverse search optical flow framework.
However, OF_dis contains noise dispersed in the background as well as over the objects, as shown in Fig. 16.9. Thus, we propose a novel fusion between OF_dis and f_BR that uses the clean foreground objects from f_BR together with OF_dis to acquire both appearance and motion information while reducing the noise in OF_dis. The fusion provides a clean background and explicit foreground objects; Fig. 16.9 clearly shows that the fusion effectively removes noise from OF_dis. Specifically, the noise reduction is implemented efficiently by observing whether the f_BR values equal 0 or 255 and then masking OF_dis with f_BR to change its values. Let ζ be a constant value. The new output, a fused version of OF_dis denoted OF_fus, is defined in Eq. (16.18):

OF_{fus} = OF_{dis} \cdot \lfloor f_{BR} / (f_{BR} + \zeta) \rfloor    (16.18)

Fig. 16.9 A fusion between background subtraction and real optical flow.
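In code, the fusion of Eq. (16.18) amounts to masking the DIS flow with the (binary) background-removal frame. The NumPy sketch below is our interpretation: the floor operator is dropped and f_BR/(f_BR + ζ) is treated directly as a near-binary mask that is approximately 1 on foreground pixels (f_BR = 255) and 0 on background pixels (f_BR = 0); the chapter's exact implementation may differ:

import numpy as np

def fuse_optical_flow(of_dis, f_br, zeta=1e-3):
    # f_BR / (f_BR + zeta) is ~1 where f_BR = 255 (moving foreground) and
    # 0 where f_BR = 0 (static background), suppressing background noise
    # in OF_dis while keeping the flow on moving objects.
    mask = f_br.astype(np.float32) / (f_br.astype(np.float32) + zeta)
    if of_dis.ndim == 3 and mask.ndim == 2:
        mask = mask[..., None]          # broadcast over the 3 flow channels
    return of_dis.astype(np.float32) * mask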
Apart from (i) background removal and (ii) fusion, (iii) patch extraction plays a part in the feature collection process and helps to acquire more spatial and temporal features of the moving foreground objects in local pixels. In this way, it captures better information than directly extracting features from the full image. Patch extraction is implemented using the full-size appearance of the moving foreground object in the current frame f along with its direction, motion, and magnitude from the frame-by-frame dense optical flow image. We normalize all patch elements to the range [−1, 1]. The patch size is defined as (w/a) × h × c_p, where w is the frame width, h is the frame height, a is a scale value, and c_p is the number of channels. A sliding-window method with a stride d is applied to the input frames of the generator G (i.e., f and f_BR) and the discriminator D (i.e., OF_fus). Fig. 16.10 shows examples of patch extraction. We extract the patches with the scale value a = 4 and d = w/a to obtain more local information from the spatial and temporal representations. The extracted patch image is then scaled up to a 256 × 256 full image to gain more appearance information from the semantic content and is input into the model for further processing.
Fig. 16.10 Examples of patch extraction on a spatial frame.
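The sliding-window patch extraction can be sketched as follows (OpenCV/NumPy; the [−1, 1] normalization is left to the caller, and a single-channel frame is assumed for brevity):

import cv2
import numpy as np

def extract_patches(frame, a=4, out_size=256):
    # Patch width is w / a and the stride equals the patch width, so a = 4
    # yields four full-height vertical strips per frame; each patch is then
    # rescaled to out_size x out_size before being fed to the model.
    h, w = frame.shape[:2]
    d = w // a
    patches = []
    for x in range(0, w - d + 1, d):
        patch = frame[:, x:x + d]
        patches.append(cv2.resize(patch, (out_size, out_size)))
    return np.stack(patches)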
The final process of the feature collection is (iv) the concatenation of spatiotemporal features for data preparation. We input the appearance information into the generative model to output the motion information at both training and testing time, as shown in Fig. 16.11. Since providing sufficient input features to G is important for producing good corresponding optical flow images, the patches of f and f_BR are concatenated to cover all possible low-level appearance information of the normal patterns so that G can understand and learn them extensively. More specifically, f_BR provides precise foreground object contours, whereas f provides inclusive knowledge of the whole scene. The sizes of the input and target images are fixed to the 256 × 256 full image as the default value in our proposed framework. The concatenated frames have two channels (c_p = 2) and the temporal target output has three channels (c_p = 3). Finally, this concatenation process underpins the ability of the spatiotemporal translation model to learn its desired temporal target output.
Fig. 16.11 Overview of our data preparation showing spatiotemporal input features (concatenated patches) and the output feature (generated dense optical flow patch).
16.3.2.3 Spatiotemporal translation model
In this section, we describe our training structure, a GAN-based U-Net architecture [36], for translating the spatial inputs (f and f_BR) to the temporal output (OF_gen), and present how the interplay between the generative and discriminative networks works during training. The details of the proposed spatiotemporal translation model are explained as follows.
The generative network G performs an image transformation from the concatenated f and f_BR appearances to the OF_gen motion representations. Generally, G has two inputs, an image x and noise z, and generates an output image e with the same size as x but different channels, e = G(x, z) [27, 43, 44]. In our case, however, the additional Gaussian noise z is not prominent for G, since G can learn to ignore z in the early stage of training; moreover, z is not particularly effective for transforming the spatial input representation to the temporal representation. Therefore, dropout [35] is applied in the decoder together with batch normalization [45] instead of z, resulting in e = G(x).
On closer inspection, the full generator network, consisting of the encoder En and decoder De architectures, is constructed with skip connections, or residual connections [35], as shown in Fig. 16.12. The idea of skip connections is to link layers of the encoder straight to the decoder, making the network easier to optimize and providing higher quality and lower complexity for image translation than traditional CNN architectures, e.g., AlexNet [39] and VGG nets [46, 47]. More specifically, let t be the total number of layers in the generative network. A skip connection is introduced between each layer i of En and layer t − i of De. Data can be transferred from the first to the final layer by concatenating the channels of layer i with those of layer t − i. The architectures of En and De are illustrated in Fig. 16.13. En compresses the spatial representation of the data into a higher-level representation, while De performs the reverse process to generate OF_gen. En uses the Leaky-ReLU (L-ReLU) activation function, whereas De uses the ReLU activation function, which helps accelerate the learning of the model toward saturated color distributions [36]. To achieve an accurate OF_gen, the objective function is optimized by the Adam optimization algorithm [48] during training.
Fig. 16.12 Generator architecture consisting of an encoder and a decoder with skip connections [26].
The encoder module acts as a data compression from a high-dimensional space into a low-dimensional latent space representation that is passed to the decoder module. The first layer in the encoder is a convolution, using CNNs as learnable feature extraction instead of the handcrafted-features approach, which is more brittle for deriving obscure data structures than the deep learning approach. The convolution is a linear operation implemented by sliding a k × k window w over an n × n input image I. The output of the convolution on cell c of the image I is defined as shown in the following equation:

y_c = \sum_{i=1}^{k \times k} (w_i I_{i,c}) + b_c    (16.19)

where y_c is the output after the convolution and b_c is the bias. Let p be the padding and s be the stride. The output size of the convolution, O, is calculated as shown in the following equation:

O = (n - k + 2p)/s + 1    (16.20)

Fig. 16.13 Encoder and decoder architectures [26].
Fig. 16.14 An example of a convolution operation on an image cell.
For better understanding, the convolution operation on an image cell with b = 0, n = 8, k = 3, p = 0, and s = 1 is shown in Fig. 16.14.
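Eq. (16.20) and the example of Fig. 16.14 can be checked with a small helper (plain Python, integer arithmetic assumed):

def conv_output_size(n, k, p=0, s=1):
    # Eq. (16.20): O = (n - k + 2p) / s + 1
    return (n - k + 2 * p) // s + 1

# Fig. 16.14: an 8x8 input, 3x3 kernel, no padding, stride 1 -> 6x6 output.
assert conv_output_size(8, 3, p=0, s=1) == 6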
Once the convolution operation is completed, batch normalization is applied by normalizing the convolution output following the normal distribution, as in Ref. [45], to reduce the training time and avoid the vanishing gradient problem. Suppose y denotes the convolution output values over a mini-batch B = {y_1, y_2, …, y_m}, γ and β are learnable parameters, and ε is a constant that avoids division by zero variance. The normalized output S is obtained by scaling and shifting as defined in the following equation:

S_i = \gamma \hat{y}_i + \beta \equiv BN_{\gamma,\beta}(y_i)    (16.21)

where
• Normalize: \hat{y}_i = (y_i - \mu_B) / \sqrt{\sigma_B^2 + \varepsilon},
• Mini-batch mean: \mu_B = \frac{1}{m} \sum_{i=1}^{m} y_i,
• Mini-batch variance: \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (y_i - \mu_B)^2.
The final layer of En applies an activation function to introduce a nonlinear mapping from the input to the output (response variable). The nonlinear mapping transforms values from one scale to another and decides whether a neuron's signal passes through. This nonlinearity makes the network more expressive, resulting in a stronger model for learning complex input data. The L-ReLU is used in the proposed DSTN to avoid the vanishing gradient problem. The property of the L-ReLU is that it allows negative values to pass through the neuron by mapping them to small negative values of the response variable, improving the flow of gradients through the model. The L-ReLU function is defined as shown in Eq. (16.22):

f(s) = \begin{cases} s, & \text{if } s \geq 0 \\ a s, & \text{otherwise} \end{cases}    (16.22)

where s is the input, a is a coefficient (a = 0.2) that allows the negative value to pass through the neuron, and f(s) is the response variable. Fig. 16.15 shows the graph mapping the input data s to the response variable.
Fig. 16.15 Leaky-ReLU activation function.
Regarding the decoder module De, it is the inverse of the encoding part, with residual connections passing from the encoder to the corresponding layers of the decoder. Dropout is used in De to represent the noise vector z: it eliminates neuron connections with a default probability, helping to prevent overfitting during training and improving the performance of the GAN. Let h ∈ {1, 2, …, H} index the hidden layers of the network, z^{(h+1)} be the output of layer h + 1, and r^{(h)}_j be a random variable following the Bernoulli distribution with probability p [49]. The feed-forward operation can be described by the following equation:

r^{(h)} \sim \mathrm{Bernoulli}(p),
\tilde{f}(s)^{(h)} = r^{(h)} * f(s)^{(h)},
z_i^{(h+1)} = w_i^{(h+1)} \tilde{f}(s)^{(h)} + b_i^{(h+1)}    (16.23)
Apart from the generator, we now discuss the discriminative network D used during the training procedure. D distinguishes the real patch OF_fus (y = OF_fus) from the synthetic patch OF_gen (OF_gen = e). As a result, D delivers a scalar output indicating the probability that its inputs come from the real data. In the discriminative architecture, a PatchGAN is constructed and applied to each partial image to help accelerate the GAN training time, resulting in better performance than using a full-image discriminator network at a resolution of 256 × 256 pixels. The discriminator D is implemented by subsampling the 256 × 256 OF_fus image to 64 × 64 pixels, giving 16 patches of OF_fus that pass through the PatchGAN model to classify whether OF_gen is real or fake, as shown in Fig. 16.16. The reason we use the 64 × 64 PatchGAN is that it provides good pixel accuracy and good intensity of the appearance, making the synthetic image more recognizable. Experimental results on the impact of using the 64 × 64 PatchGAN can be found in Ref. [26].
Fig. 16.16 Discriminator architecture with PatchGAN model [26].
To define the objective function and optimization, we first discuss the two objective functions used during training: a GAN loss, L_GAN, and an L1 loss (or generator loss), L_L1. Note that our proposed DSTN comprises only one translation network from the spatial (appearance) to the temporal (motion) image representation. The motion representation is computed from the dense optical flow using arrays of horizontal and vertical components together with the magnitude. Let y be the output image OF_fus, x be the input image for G (the concatenated f and f_BR image), and z be the additional Gaussian noise vector. Since dropout is adopted in place of z, G can be represented as G(x). The objective functions are given in Eqs. (16.24) and (16.25):
• GAN loss:

L_{GAN}(G, D) = \mathbb{E}_y[\log D(y)] + \mathbb{E}_x[\log(1 - D(G(x)))]    (16.24)

• L1 loss:

L_{L1}(G) = \mathbb{E}_{x,y}\left[\| y - G(x) \|_1\right]    (16.25)

The optimization of G can then be defined as in the following equation:

G^* = \arg \min_G \max_D L_{GAN}(G, D) + \lambda L_{L1}(G)    (16.26)
The advantage of using one spatiotemporal translation network is that it has less complexity while providing sufficient important features of objects for the learning of GAN.
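A condensed TensorFlow sketch of the training objectives in Eqs. (16.24)–(16.26) follows; the weight λ of the L1 term is a placeholder (a pix2pix-style value), not necessarily the one used in the DSTN experiments, and `d_real_logits`/`d_fake_logits` denote hypothetical PatchGAN outputs for OF_fus and OF_gen:

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def dstn_losses(d_real_logits, d_fake_logits, of_fus, of_gen, lam=100.0):
    # GAN loss, Eq. (16.24): D scores real OF_fus patches against generated
    # OF_gen patches (returned here as a loss to be minimized by D).
    d_loss = bce(tf.ones_like(d_real_logits), d_real_logits) + \
             bce(tf.zeros_like(d_fake_logits), d_fake_logits)
    g_adv = bce(tf.ones_like(d_fake_logits), d_fake_logits)
    # L1 loss, Eq. (16.25): ||y - G(x)||_1 pushes the generated flow toward
    # the fused target and keeps it sharp.
    l1 = tf.reduce_mean(tf.abs(of_fus - of_gen))
    # Eq. (16.26): G minimizes its adversarial term plus lambda * L1.
    return d_loss, g_adv + lam * l1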
16.3.3 Anomaly detection
After training, the spatiotemporal translation network has learned the transformation from the concatenated f and f_BR appearance to the OF_fus motion representation. All parameters from training are applied in testing. To detect anomalies, we input two consecutive frames (f_t and f_{t−1}) from the test videos to the model. During testing, G is used to reconstruct OF_gen following its trained knowledge. However, since G has been trained only on the normal patterns, it is unable to regenerate unknown events in the same way as normal ones. We exploit the generator's inability to correctly reconstruct abnormal events in order to detect all possible anomalies that occur in the scene. The anomalies can be exposed by subtracting the patches of OF_fus and OF_gen to locate differences in local pixels. For more accurate object localization, edge wrapping is proposed to highlight the actual local pixels of the anomalies.
For anomaly detection, differentiation is a simple and effective method for obtaining abnormalities. The pixels of a patch of OF_fus (real image) and a patch of OF_gen (fake image) are subtracted to determine whether there are anomalous events in the scene. This differentiation is defined directly in the following equation:

\Delta_{OF} = OF_{fus} - OF_{gen} > 0    (16.27)

where Δ_OF is the differentiation output, whose values are greater than 0 (Δ_OF > 0). The reason Δ_OF can successfully indicate the abnormal events in the scene is that the difference between OF_fus and OF_gen is large in the anomalous areas, where G is unable to reconstruct the abnormal events in OF_gen in the same way as they appear in OF_fus (the real abnormal object from the test video sequence). In other words, G tries to reconstruct OF_gen to match OF_fus, but it can only produce unstructured blobs based on its knowledge of the learned normal events, making the abnormal events of OF_gen different from those of OF_fus. Δ_OF provides a score indicating the probability that each pixel belongs to a normality or an abnormality. The range of pixel values for each Δ_OF from the test videos is between 0 and 1, where the highest pixel values are considered anomalies. To normalize the probability score from Δ_OF, the maximum value M_OF over all components is computed over the range of pixel values for each test video. From this process, we can gradually vary the threshold on the probability scores of anomalies to define the best decision boundary for obtaining ROC curves. Suppose the position of a pixel in the image is (i, j). The normalization of Δ_OF, denoted N_OF, is given by the following equation:

N_{OF}(i, j) = \frac{1}{M_{OF}} \Delta_{OF}(i, j)    (16.28)
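The differentiation and normalization of Eqs. (16.27)–(16.28) reduce to a few NumPy lines; in this sketch M_OF is taken as the maximum over the current test video's difference maps, which is one way to read the description above:

import numpy as np

def differentiation(of_fus_patches, of_gen_patches):
    # Eq. (16.27): pixel-wise difference between the real fused flow and the
    # generated flow; large values mark regions G failed to reconstruct.
    return np.abs(of_fus_patches.astype(np.float32)
                  - of_gen_patches.astype(np.float32))

def normalize_scores(delta_of_video):
    # Eq. (16.28): divide by the maximum value M_OF over the test video so
    # the anomaly scores lie in [0, 1] and can be swept with thresholds.
    m_of = delta_of_video.max() + 1e-8
    return delta_of_video / m_of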
However, even when we obtain a good normalized differentiation N_OF showing anomalies in the scene, some problems remain in the experimental results, such as false positive detections on normal events (i.e., a normal event detected as abnormal) and overdetection of pixels around the actual abnormal object (i.e., the detected abnormal region is too large). This is because the object localization is not yet effective enough. Therefore, we propose the edge wrapping (EW) method to overcome these problems and specifically enhance the pixel-level anomaly localization performance. Our EW uses [50] to preserve only the edges of the actual abnormal object and to suppress the rest (e.g., noise and insignificant edges that do not belong to the abnormal object), providing precise abnormal event detection and localization. EW is a multistage process comprising three phases: noise reduction, intensity gradient computation, and nonmaximum suppression.
To eliminate background noise and irrelevant pixels around abnormal objects, a Gaussian filter of size w_e × h_e × c_e is applied to blur the normalized differentiation N_OF, where w_e and h_e are the width and height of the filter and c_e is the number of channels (e.g., a grayscale image has c_e = 1 and a color image has c_e = 3). Our differentiation output is a grayscale image, c_e = 1. For the intensity gradient, an edge gradient G_e is obtained by filtering the image with a gradient operator in the horizontal direction (G_x) and the vertical direction (G_y), giving the gradient magnitude perpendicular to the edge direction at each pixel. The derivative filter has the same size as the Gaussian filter. The first derivative is computed as shown in Eqs. (16.29) and (16.30):

G_e = \sqrt{G_x^2 + G_y^2}    (16.29)

\theta = \tan^{-1}(G_y / G_x)    (16.30)
Then, a threshold is defined to preserve only the significant edges; this process is known as nonmaximum suppression. In this phase, the gradient magnitude at each pixel is checked against a threshold T, for which we use a value of 50, as it gives the best results, as discussed in Ref. [26]. If the magnitude is greater than T, the pixel is an edge point corresponding to a local maximum over all possible neighborhoods. Hence, we preserve the local maxima and suppress the rest to 0 to acquire the edges corresponding to the actual anomalies. In addition, the Gaussian filter is applied once more with a kernel size of w_e × h_e × c_e to suppress noise in the image, giving the output EW used for the final anomaly localization O_L, where ζ is a constant value. The anomaly localization O_L is computed as shown in the following equation:

O_L = \Delta_{OF} \cdot \lfloor EW / (EW + \zeta) \rfloor    (16.31)
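A simplified OpenCV sketch of the edge wrapping stage is given below; the kernel size, the rescaling of N_OF to [0, 255], and the use of Sobel filters are our assumptions, and the full gradient-direction nonmaximum suppression of [50] is reduced here to a plain magnitude threshold (T = 50):

import cv2
import numpy as np

def edge_wrapping(n_of, ksize=(5, 5), grad_thresh=50.0):
    # Phase 1: noise reduction with a Gaussian blur (N_OF rescaled to 0-255).
    blurred = cv2.GaussianBlur(n_of.astype(np.float32) * 255.0, ksize, 0)
    # Phase 2: intensity gradient, Eq. (16.29): Ge = sqrt(Gx^2 + Gy^2).
    gx = cv2.Sobel(blurred, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(blurred, cv2.CV_32F, 0, 1, ksize=3)
    ge = np.sqrt(gx ** 2 + gy ** 2)
    # Phase 3: keep only strong edges (threshold T = 50), then smooth again.
    ew = np.where(ge > grad_thresh, ge, 0.0)
    return cv2.GaussianBlur(ew, ksize, 0)

def localize_anomalies(delta_of, ew, zeta=1e-3):
    # Eq. (16.31): mask the difference map with the wrapped edges so that
    # only pixels on the abnormal object's edges survive.
    mask = ew / (ew + zeta)
    return delta_of * mask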
16.4 Experimental results
The performance of the DSTN is evaluated on the publicly available standard benchmarks used in the video anomaly detection task: UCSD pedestrian [4], UMN [1], and CUHK Avenue [6]. These datasets are recorded in crowds and contain indoor and outdoor scenes. Our experimental results are compared with various competing methods with respect to accuracy at both the frame and pixel levels and to computational time. Additionally, we examine the impact of the GAN-based U-Net network with residual connections compared with another popular architecture, the autoencoder, and address the advantages and disadvantages of GANs for anomaly detection. Each subtopic is explained in detail as follows.
16.4.1 Dataset
16.4.1.1 UCSD dataset
The UCSD pedestrian dataset [4] consists of crowded walking pedestrians in two outdoor scenes with various anomalies, e.g., cycling, skateboarding, driving vehicles, and rolling wheelchairs. It is a well-known video benchmark for the anomaly detection task because of its complex real-environment scenes and low-resolution images. There are two subsets, Ped1 and Ped2, where Ped stands for pedestrian. UCSD Ped1 contains 5500 normal frames in 34 training video sequences and 3400 anomalous frames in 16 testing video sequences. The image resolution of UCSD Ped1 is 238 × 158 pixels and that of UCSD Ped2 is 360 × 240 pixels for all frames. UCSD Ped2 contains 346 frames of normal events in 16 training video sequences and 1652 frames of anomalous events in 12 testing video sequences. Ped2 is characterized by crowded pedestrians walking horizontally relative to the camera plane. Examples from UCSD are shown in Fig. 16.17, where (A) is Ped1 and (B) is Ped2.
16.4.1.2 UMN dataset
The UMN dataset [1] is one of the publicly available benchmarks for the video anomaly detection task, designed for identifying anomalies in crowds. It contains 11 videos with 7700 frames recorded in various indoor and outdoor scenarios. All frames have a resolution of 320 × 240 pixels. In both the indoor and outdoor scenes, walking pedestrians constitute the normal event and running pedestrians the abnormal event, as shown in Fig. 16.18. All video sequences start with walking patterns and end with running patterns.
Fig. 16.18 UMN dataset.
16.4.1.3 CUHK Avenue dataset
The CUHK Avenue dataset [6] consists of crowded scenes on a campus. The total number of frames is 30,652, consisting of 15,328 frames in 16 training videos and 15,324 frames in 21 test videos. Each video sequence is 1–2 min long, at 25 frames per second (fps). This dataset is challenging because of its various moving objects in crowds and its types of anomaly patterns related to human actions, including object-related actions (throwing, grabbing, and leaving objects), running, jumping, and loitering. In contrast, the normal pattern is crowds walking parallel to the image plane. Examples from the CUHK Avenue dataset are shown in Fig. 16.19.
16.4.2 Implementation details
We implement the proposed DSTN framework using the Keras [51] machine learning platform with a TensorFlow [52] backend, together with Matlab. An NVIDIA GeForce GTX 1080 Ti GPU with 3584 CUDA cores and 484 GB/s memory bandwidth is used during the training procedure. The testing measurements are carried out on an Intel Core i9-7960 CPU with a 2.80 GHz base frequency. The model performs transformation learning from spatial to temporal representations with the help of the Adam optimizer. The learning rate is set to 0.0002, while the exponential decay rates β1 and β2 are set to 0.9 and 0.999, respectively, with epsilon 10^{-8}.
Fig. 16.19 CUHK Avenue dataset.
16.4.3 Evaluation criteria
16.4.3.1 Receiver operating characteristic (ROC)
The receiver operating characteristic (ROC) curve is a standard method for evaluating the performance of an anomaly detection system. It is a plot of the true positive rate (TPR) against the false positive rate (FPR) at various threshold criteria [53] and supports the analysis of the decision-making process.
In the anomaly detection setting, the fraction of abnormal events that are correctly determined as positive detections (abnormal events) out of all positive ground truth data is the TPR, also known as the probability of detection; the higher the TPR curve, the better the detection accuracy of abnormal events. The fraction of normal events (negative data) that are incorrectly determined as positive detections out of all negative ground truth data is the FPR; a higher FPR means a higher rate of misclassification of normal events. There are four types of binary predictions used in the TPR and FPR computation, as described below.
True positive (TP) is the correct positive detection of an abnormal event when the
prediction outcome and the ground truth data are positive (abnormal event).
False positive (FP) is the false positive detection when the outcome is predicted as
positive (abnormal event), but the ground truth data is negative (normal event), meaning
that the normal event is incorrectly detected as an abnormal event. This problem often
occurs in the video anomaly detection task (e.g., a walking person is detected as an
anomaly).
True negative (TN) is the correct detection of a normal event when the outcome is
predicted as negative (normal event) and the ground truth data is also negative.
False negative (FN) is the incorrect detection when the outcome is predicted as negative (normal event) and the ground truth data is positive (abnormal event).
Hence, TPR and FPR can be computed as shown in Eqs. (16.32) and (16.33), respectively:

TPR = TP / (TP + FN)    (16.32)
FPR = FP / (FP + TN)    (16.33)
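For clarity, a small sketch (ours, assuming NumPy arrays of per-frame anomaly scores and binary ground truth labels) that evaluates Eqs. (16.32) and (16.33) at a single threshold:

import numpy as np

def tpr_fpr(scores, labels, threshold):
    # labels: 1 = abnormal (positive), 0 = normal (negative)
    pred = scores >= threshold          # positive prediction = abnormal
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    tn = np.sum(~pred & (labels == 0))
    tpr = tp / (tp + fn)                # Eq. (16.32)
    fpr = fp / (fp + tn)                # Eq. (16.33)
    return tpr, fpr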
16.4.3.2 Area under curve (AUC)
The area under the curve (AUC) is used in classification analysis to identify the best prediction model. It is the total area under the ROC curve, where the TPR is plotted against the FPR. A higher AUC indicates superior model performance. Ideally, the model is a perfect classifier when all positive data are ranked above all negative data (AUC = 1). In practice, most useful AUC results lie in the range between 0.5 and 1.0 (AUC ∈ [0.5, 1]), meaning that a random positive sample is ranked higher than a random negative sample more than 50% of the time. The worst case occurs when all negative data are ranked above all positive data, leading to an AUC of 0. Hence, AUC classifiers are defined over AUC ∈ [0, 1], where values greater than 0.5 are required for real-world use and values below 0.5 are not acceptable for the model [53]. To conclude, higher AUC values are preferred over lower ones.
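As an illustrative sketch (assuming scikit-learn is available; this is not part of the DSTN implementation described in this chapter), the frame-level AUC can be obtained by sweeping the threshold over the ROC curve:

from sklearn.metrics import roc_curve, auc

def frame_level_auc(scores, labels):
    # scores: per-frame anomaly scores; labels: 1 = abnormal, 0 = normal
    fpr, tpr, _ = roc_curve(labels, scores)
    return auc(fpr, tpr)  # area under the TPR-vs-FPR curve; 1.0 = perfect classifier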
16.4.3.3 Equal error rate (EER)
Apart from the AUC, the performance of the model can be quantified with the receiver operating characteristic equal error rate (ROC-EER). The EER is the operating point at which the misclassification rates of positive and negative data are equal. Specifically, the EER is obtained at the intersection of the ROC curve with the diagonal EER line, found by varying the threshold until the FPR equals the miss rate (1 − TPR). Lower EER values indicate better model performance.
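A simple way to approximate the EER from discrete ROC points (our sketch, not the chapter's code) is to locate the operating point where the FPR is closest to the miss rate 1 − TPR:

import numpy as np

def equal_error_rate(fpr, tpr):
    # fpr, tpr: arrays produced by an ROC sweep over thresholds
    miss_rate = 1.0 - tpr
    idx = np.argmin(np.abs(fpr - miss_rate))     # intersection with the EER line
    return (fpr[idx] + miss_rate[idx]) / 2.0     # lower is better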
16.4.3.4 Frame-level and pixel-level evaluations for anomaly detection
In general, the quantitative evaluation of anomaly detection uses two criteria: frame-level and pixel-level evaluations. The frame-level evaluation focuses on the detection rate of anomalous events in the scene. If one or more anomalous pixels are detected, the frame is labeled as abnormal, regardless of the size and location of the abnormal objects. In this case, the detected frame is counted as a TP if the actual frame is also abnormal; conversely, if the actual frame is normal, the detected frame is counted as an FP. The pixel-level evaluation determines whether the anomalous objects are localized correctly in the scene. It is a challenging criterion in anomaly detection and localization research since it focuses on local pixels, and it is considerably more demanding and stricter than the frame-level evaluation because of the complexity of localizing anomalies, which in turn improves the reliability of frame-level anomaly detection. For a frame to be counted as a true positive (TP) at the pixel level, the detected abnormal area must overlap the ground truth by more than 40% [3]. In addition, a frame is counted as a false positive (FP) if even a single pixel is incorrectly detected as abnormal.
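A sketch of one common reading of this pixel-level criterion (ours; implementation details may differ): a frame counts as a true positive only if the detected abnormal region covers more than 40% of the ground-truth abnormal pixels.

import numpy as np

def pixel_level_tp(pred_mask, gt_mask, min_overlap=0.40):
    # pred_mask, gt_mask: boolean maps of detected / ground-truth abnormal pixels
    gt_pixels = np.count_nonzero(gt_mask)
    if gt_pixels == 0:
        return False                              # no anomaly in this frame
    overlap = np.count_nonzero(np.logical_and(pred_mask, gt_mask))
    return overlap / gt_pixels > min_overlap      # more than 40% overlap required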
16.4.3.5 Pixel accuracy
In a standard semantic segmentation evaluation, the pixel accuracy metric [54] measures the proportion of pixels assigned to the correct semantic class. In the proposed DSTN, two semantic classes are defined: a foreground class and a background class. The pixel accuracy is defined as Σi nii / Σi ti, where nii is the number of correctly classified pixels of class i and ti is the total number of pixels of class i.
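For two classes this reduces to the fraction of correctly labeled pixels; a minimal sketch (ours):

import numpy as np

def pixel_accuracy(pred, gt):
    # pred, gt: integer label maps (0 = background, 1 = foreground)
    return np.mean(pred == gt)   # sum_i n_ii / sum_i t_i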
16.4.3.6 Structural similarity index (SSIM)
The SSIM index is a perceptual metric that measures the quality of a predicted image with respect to its original (reference) image [55]. Under this metric, the model is more effective when the predicted image is more similar to the target image. In our case, we use SSIM to analyze the similarity between the dense optical flow generated by the generator and the real dense optical flow obtained from two consecutive video frames.
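A brief sketch of this measurement (assuming scikit-image; single-channel flow visualizations for simplicity):

import numpy as np
from skimage.metrics import structural_similarity

def flow_ssim(generated_flow, real_flow):
    # generated_flow, real_flow: 2-D arrays of the same shape
    gen = generated_flow.astype(np.float64)
    real = real_flow.astype(np.float64)
    return structural_similarity(gen, real,
                                 data_range=real.max() - real.min())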
16.4.4 Performance of DSTN
We evaluate the proposed DSTN in terms of accuracy and time complexity. The ROC curve is used to illustrate anomaly detection performance at the frame and pixel levels and to compare the experimental results with other state-of-the-art works. Additionally, the AUC and EER are used as criteria for assessing the results.
The performance of DSTN is first evaluated on the UCSD dataset, which provides pixel-level ground truth for 10 UCSD Ped1 videos and 12 UCSD Ped2 videos, using both the frame-level and pixel-level protocols. In the first stage of DSTN, patch extraction provides the appearance features of the foreground objects and their motion in terms of the vector changes in each patch. The patches are extracted independently from each original image, of size 238 × 158 pixels (UCSD Ped1) and 360 × 240 pixels (UCSD Ped2), using a patch size of (w/4) × h × cp. As a result, we obtain 22k patches from the UCSD Ped1 and 13.6k patches from the UCSD Ped2. Then, before feeding them into the spatiotemporal translation model, we resize all patches to the 256 × 256 default size at both training and testing time.
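As a rough illustration of the stated patch size (our reading of (w/4) × h × cp as quarter-width, full-height strips; this is not the authors' extraction code and assumes OpenCV is available), a frame can be split and resized to the default resolution as follows:

import cv2
import numpy as np

def extract_patches(frame, num_strips=4, out_size=256):
    # Split a frame into vertical strips of size (w / num_strips) x h
    # and resize each strip to the 256 x 256 default resolution.
    h, w = frame.shape[:2]
    strip_w = w // num_strips
    patches = []
    for i in range(num_strips):
        patch = frame[:, i * strip_w:(i + 1) * strip_w]
        patches.append(cv2.resize(patch, (out_size, out_size)))
    return np.stack(patches)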
During training, the input of G (the concatenation of the f and fBR patches) and the target data (the generated dense optical flow OFgen) are set to the same default resolution of 256 × 256 pixels. The encoding and decoding modules in G are implemented differently. In the encoder network, the image resolution is encoded from 256 → 128 → 64 → 32 → 16 → 8 → 4 → 2 → 1 to obtain the latent space representing the spatial image in a one-dimensional data space. The CNN performs this downscaling with 3 × 3 kernels and stride s = 2. The number of feature channels in En follows 6 → 64 → 128 → 256 → 512 → 512 → 512 → 512 → 512 across the layers. In contrast, De decodes the latent space back to the target data (the temporal representation OFgen) with a size of 256 × 256 pixels, using the same structure as En. Dropout is employed in De as the noise z, removing neuron connections with probability p = 0.5 and thereby preventing overfitting on the training samples. Since D needs to challenge G by correctly classifying real and fake images at training time, PatchGAN is applied by inputting patches of 64 × 64 pixels and outputting the probability of the class label for each patch. The PatchGAN architecture is constructed as 64 → 32 → 16 → 8 → 4 → 2 → 1, which is then flattened to 512 neurons and connected to fully connected (FC) and softmax layers. The use of PatchGAN benefits the model in terms of time complexity, probably because there are fewer parameters to learn on the partial image, which makes the model less complex and yields a good running time for the training process. For testing, G is employed to reconstruct OFgen in order to analyze the real motion information OFfus. The image resolution for testing and training is set to the same value for all datasets.
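To make the layer progression above concrete, the following sketch (our illustrative reconstruction, assuming tf.keras; the activation choices and the discriminator's filter counts are assumptions, not taken from the chapter) builds the 256 → 1 encoder and a PatchGAN-style discriminator for 64 × 64 patches:

from tensorflow.keras import layers, models

def build_encoder():
    # Downsampling: 256 -> 128 -> 64 -> 32 -> 16 -> 8 -> 4 -> 2 -> 1,
    # with channels 6 -> 64 -> 128 -> 256 -> 512 -> 512 -> 512 -> 512 -> 512.
    inp = layers.Input(shape=(256, 256, 6))
    x = inp
    for filters in [64, 128, 256, 512, 512, 512, 512, 512]:
        x = layers.Conv2D(filters, kernel_size=3, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
    return models.Model(inp, x, name="encoder")

def build_patch_discriminator():
    # PatchGAN-style discriminator on 64 x 64 patches: 64 -> 32 -> 16 -> 8 -> 4 -> 2 -> 1,
    # flattened to 512 neurons and followed by FC and softmax layers.
    inp = layers.Input(shape=(64, 64, 3))
    x = inp
    for filters in [64, 128, 256, 512, 512, 512]:
        x = layers.Conv2D(filters, kernel_size=3, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
    x = layers.Flatten()(x)                          # 1 x 1 x 512 -> 512 neurons
    x = layers.Dense(512)(x)
    out = layers.Dense(2, activation="softmax")(x)   # real vs. fake class probabilities
    return models.Model(inp, out, name="patch_discriminator")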
The quantitative performance of DSTN is presented in Table 16.1, where we compare DSTN with various state-of-the-art works, e.g., AMDN [15], GMM-FCN [12], convolutional AE [14], and future frame prediction [21]. From Table 16.1 it can be observed that DSTN outperforms most of the methods under both the frame-level and pixel-level criteria, since it achieves higher AUC and lower EER on the UCSD dataset. Moreover, we show the qualitative performance of DSTN using the standard evaluation for anomaly detection research, the ROC curve, where we vary a threshold from 0 to 1 to plot the curve of TPR against FPR. The qualitative performance of DSTN is compared with other approaches under both the frame-level evaluation (Fig. 16.20A) and the pixel-level evaluation (Fig. 16.20B) on the UCSD Ped1, and under the frame-level evaluation on the UCSD Ped2, as presented in Figs. 16.20 and 16.21, respectively. Following Figs. 16.20 and 16.21, the DSTN (circle) shows the strongest growth of the TPR curve and outperforms all competing methods at both the frame and pixel levels. This means that DSTN is a reliable and effective method that can detect and localize anomalies with high precision.
Examples of the experimental results of DSTN on the UCSD Ped1 and Ped2 datasets are illustrated in Fig. 16.22 to further present its performance in detecting and localizing anomalies in the scene. According to Fig. 16.22, the proposed DSTN is able to detect and locate various types of abnormality effectively for individual objects, e.g., (a) a wheelchair, (b) a vehicle, (c) a skateboard, and (d) a bicycle, or even more than one anomaly in the same scene, e.g., (e) bicycles, (f) a vehicle and a bicycle, and (g) a bicycle and a skateboard. However, we face a false positive problem in Fig. 16.22H (a bicycle and a skateboard), where a walking person (normal event) is detected as an anomaly. Even though the bicycle and the skateboard are correctly detected as anomalies in Fig. 16.22H, the false detection of the walking person still makes this frame incorrect. This false positive is probably caused by the walking speed being similar to the cycling speed in the scene.
For the UMN dataset, the performance of DSTN is evaluated using the same training parameters and network configuration as on the UCSD pedestrian dataset. Table 16.2 presents the AUC comparison of DSTN with various competing works such as GANs [20], adversarial discriminator [22], AnomalyNet [23], and so on. Table 16.2 shows that the proposed DSTN achieves the best AUC result, equal to that of Ref. [23], and outperforms all other methods. Noticeably, most of the competing methods achieve a high AUC on the UMN dataset. This is because the UMN dataset has less complex abnormal patterns than the UCSD pedestrian and the Avenue datasets.
Table 16.1 EER and AUC comparison of DSTN with other methods on UCSD Dataset [26].

                             Ped1 (frame level)    Ped1 (pixel level)    Ped2 (frame level)    Ped2 (pixel level)
Method                       EER      AUC          EER      AUC          EER      AUC          EER      AUC
MPPCA                        40%      59.0%        81%      20.5%        30%      69.3%        –        –
Social Force (SF)            31%      67.5%        79%      19.7%        42%      55.6%        80%      –
SF + MPPCA                   32%      68.8%        71%      21.3%        36%      61.3%        72%      –
Sparse Reconstruction        19%      –            54%      45.3%        –        –            –        –
MDT                          25%      81.8%        58%      44.1%        25%      82.9%        54%      –
Detection at 150 fps         15%      91.8%        43%      63.8%        –        –            –        –
SR + VAE                     16%      90.2%        41.6%    64.1%        18%      89.1%        –        –
AMDN (double fusion)         16%      92.1%        40.1%    67.2%        17%      90.8%        –        –
GMM                          15.1%    92.5%        35.1%    69.9%        –        –            –        –
Plug-and-Play CNN            8%       95.7%        40.8%    64.5%        18%      88.4%        –        –
GANs                         8%       97.4%        35%      70.3%        14%      93.5%        –        –
GMM-FCN                      11.3%    94.9%        36.3%    71.4%        12.6%    92.2%        19.2%    78.2%
Convolutional AE             27.9%    81%          –        –            21.7%    90%          –        –
Liu et al.                   23.5%    83.1%        –        33.4%        12%      95.4%        –        40.6%
Adversarial discriminator    7%       96.8%        34%      70.8%        11%      95.5%        –        –
AnomalyNet                   25.2%    83.5%        –        45.2%        10.3%    94.9%        –        52.8%
DSTN (proposed method)       5.2%     98.5%        27.3%    77.4%        9.4%     95.5%        21.8%    83.1%
Fig. 16.20 ROC Comparison of DSTN with other methods on UCSD Ped1 dataset: (A) frame-level
evaluation and (B) pixel-level evaluation [26].
Fig. 16.23 shows the performance of DSTN in detecting and localizing anomalies in different scenarios on the UMN dataset, including an indoor scene in (c) and outdoor scenes in (a), (b), and (d), where most of the individual objects in the crowded scenes are detected.
Apart from evaluating DSTN on the UCSD and the UMN datasets, we also assess its performance on the challenging CUHK Avenue dataset with the same parameter and configuration settings as for the UCSD and the UMN datasets.
Fig. 16.21 ROC Comparison of DSTN with other methods on UCSD Ped2 dataset at frame-level
evaluation [26].
Fig. 16.22 Examples of DSTN performance in detecting and localizing anomalies on UCSD Ped1 and Ped2 dataset: (A) a wheelchair, (B) a vehicle, (C) a skateboard, (D) a bicycle, (E) bicycles, (F) a vehicle and a bicycle, (G) a bicycle and a skateboard, and (H) a bicycle and a skateboard [26].
Table 16.3 presents the performance comparison, in terms of EER and AUC, of DSTN with other competing works [6, 12, 14, 21, 23], in which the proposed DSTN surpasses all state-of-the-art works under both protocols. We show examples of DSTN detecting and localizing various types of anomalies on the CUHK Avenue dataset in Fig. 16.24, e.g., (a) jumping, (b) throwing papers, (c) falling papers, and (d) grabbing a bag. The DSTN can effectively detect and localize anomalies in this dataset, even in Fig. 16.24D, which contains only small movements for the abnormal event (only the human head and the fallen bag are slightly moving).
Table 16.2 AUC comparison of DSTN with other methods on UMN dataset [26].

Method                       AUC
Optical-flow                 0.84
SFM                          0.96
Sparse reconstruction        0.976
Commotion                    0.988
Plug-and-play CNN            0.988
GANs                         0.99
Adversarial discriminator    0.99
AnomalyNet                   0.996
DSTN (proposed method)       0.996
Fig. 16.23 Examples of DSTN performance in detecting and localizing anomalies on UMN dataset, where (A), (B), and (D) contain running activity outdoors while (C) is an indoor scene [26].
Table 16.3 EER and AUC comparison of DSTN with other methods on CUHK Avenue dataset [26].

Method                       EER      AUC
Convolutional AE             25.1%    70.2%
Detection at 150 fps         –        80.9%
GMM-FCN                      22.7%    83.4%
Liu et al.                   –        85.1%
AnomalyNet                   22%      86.1%
DSTN (proposed method)       20.2%    87.9%
To indicate the significance of our performance for real-time use, we then compare the running time of DSTN during testing, in seconds per frame, with other competing methods [3–6, 15], as shown in Table 16.4, following the environments and the computational times reported in Ref. [15]. Regarding Table 16.4, we achieve a lower running time than most of the competing methods, except for Ref. [6].
Fig. 16.24 Examples of DSTN performance in detecting and localizing anomalies on CUHK Avenue dataset: (A) jumping, (B) throwing papers, (C) falling papers, and (D) grabbing a bag [26].
Table 16.4 Running time comparison on testing measurement (seconds per frame).

Method                       CPU (GHz)   GPU                    Memory (GB)   Ped1     Ped2     UMN      Avenue
Sparse Reconstruction        2.6         –                      2.0           3.8      –        0.8      –
Detection at 150 fps         3.4         –                      8.0           0.007    –        –        0.007
MDT                          3.9         –                      2.0           17       23       –        –
Li et al.                    2.8         –                      2.0           0.65     0.80     –        –
AMDN (double fusion)         2.1         Nvidia Quadro K4000    32            5.2      –        –        –
DSTN (proposed method)       2.8         –                      24            0.315    0.319    0.318    0.334
This is because the architecture of DSTN relies on a deep learning framework with multiple convolutional neural network layers, which is more complex than Ref. [6], which learns a sparse dictionary and involves fewer connections. However, according to the experimental results in Tables 16.1 and 16.3, our proposed DSTN provides significantly higher AUC and lower EER at both the frame and the pixel level on the CUHK Avenue and the UCSD pedestrian datasets than Ref. [6]. Regarding the running time, the proposed method runs at 3.17 fps on the UCSD Ped1 dataset, 3.15 fps on the UCSD Ped2 dataset, 3.15 fps on the UMN dataset, and 3 fps on the CUHK Avenue dataset. Finally, we compare the proposed DSTN with other competing works [3–6, 15] with respect to both the frame-level AUC and the running time in seconds per frame on the UCSD Ped1 and Ped2 datasets, as presented in Figs. 16.25 and 16.26, respectively.
Considering Figs. 16.25 and 16.26, our proposed method achieves the best results in terms of both AUC and running time. We can therefore conclude that DSTN surpasses other state-of-the-art approaches, since it reaches the highest AUC values for frame-level anomaly detection and pixel-level localization while providing a computational time suitable for real-world applications.
Fig. 16.25 Frame-level AUC comparison and running time on UCSD Ped1 dataset.
Fig. 16.26 Frame-level AUC comparison and running time on UCSD Ped2 dataset.
16.4.5 The comparison of generative adversarial network with an
autoencoder
The GAN-based U-Net architecture is a practical approach for shortcutting low-level information across the network, and the skip connections in the generator play a significant role in our proposed framework. We highlight their significance with experiments on the UCSD Ped2 dataset, comparing the U-Net generator with an autoencoder, which can be constructed by removing the skip connections from the U-Net architecture. All training videos are learned with both the skip connections and the autoencoder for 40 epochs to observe the performance in minimizing the L1 loss, as shown in Fig. 16.27. Fig. 16.27 demonstrates that the loss curve of the skip connections reaches a lower error over the training time than that of the autoencoder, showing the superior performance of the skip connections over the autoencoder.
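The following simplified sketch (ours, assuming tf.keras; the depth and filter counts are reduced from the full architecture described in Section 16.4.4) shows how the same encoder-decoder serves as either the U-Net generator or the autoencoder baseline, depending on whether the skip connections are kept:

from tensorflow.keras import layers, models

def build_generator(use_skips=True, size=256, in_ch=6, out_ch=3):
    # With use_skips=False, the same encoder-decoder becomes the plain
    # autoencoder used for comparison (U-Net minus its skip connections).
    inp = layers.Input(shape=(size, size, in_ch))
    x, skips = inp, []
    for filters in [64, 128, 256]:                 # encoder: 256 -> 128 -> 64 -> 32
        x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
        skips.append(x)
    x = layers.Conv2D(512, 3, strides=2, padding="same", activation="relu")(x)  # bottleneck: 16
    for filters, skip in zip([256, 128, 64], reversed(skips)):   # decoder: 16 -> 32 -> 64 -> 128
        x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same", activation="relu")(x)
        if use_skips:
            x = layers.Concatenate()([x, skip])    # U-Net skip connection
    x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)  # back to 256
    return models.Model(inp, layers.Conv2D(out_ch, 3, padding="same", activation="tanh")(x))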
Besides, we observe the ability of the skip connections and the autoencoder to generate temporal information (the generated dense optical flow) using the test videos from the UCSD Ped2, and compare it to the dense optical flow ground truth, as displayed in Fig. 16.28. The autoencoder is unable to recover the motion information, as shown in Fig. 16.28C. In contrast, the skip connections in Fig. 16.28B produce motion information of the dense optical flow that correctly corresponds to its ground truth in Fig. 16.28A, giving good synthesized image quality.
Fig. 16.27 Performance comparison on UCSD Ped2 dataset between GAN based U-Net architecture
(the residual connection) and autoencoder [26].
Fig. 16.28 The qualitative results in generating (A) dense optical flow on UCSD Ped2 dataset between
(B) residual connection and (C) autoencoder [26].
Table 16.5 FCN-score and SSIM comparison on UCSD Ped2 dataset between residual connection and autoencoder.

Network architecture      Pixel accuracy   SSIM
Autoencoder               0.83             0.82
Residual connection       0.9              0.96
To quantify the performance of the skip connections and the autoencoder, the structural similarity index (SSIM) [55] and the FCN-score [54] are evaluated for each architecture on the UCSD Ped2 dataset, as presented in Table 16.5. A higher value means better performance for both evaluation criteria.
Table 16.5 shows that the GAN-based U-Net architecture with skip connections is better suited to preserving low-level information, since it achieves superior results compared with the autoencoder on both evaluation metrics, especially the SSIM.
16.4.6 Advantages and limitations of generative adversarial network
for video anomaly detection
Generative adversarial networks for anomaly detection have certain advantages over traditional CNNs. The GAN framework does not require any labeled data or an inference step during the learning procedure. In addition, a GAN can generate example data without using different entries in a sequential sample and does not need the Markov Chain Monte Carlo (MCMC) method to train the model, unlike the adversarially trained AAE [24] and the VAE [25]; instead, it computes only backpropagation to obtain the gradients. As regards the statistical advantage, the GAN model can capture the density distribution of the example data through the generator network, which is trained and updated with the gradients flowing through the discriminator rather than directly with the example data. In this way, for GANs in the video anomaly detection task, the objective function of the generator is strengthened so that it can generate synthetic output that looks real from the input image, since the parameters of the generator do not directly observe the components of the target image. Apart from the advantages mentioned above, the generator network provides a very sharp synthetic image, while the visual output of the VAE network based on the MCMC method presents a blurry image owing to mixing of the modes in chains. Regarding the limitations of GANs, their training is unstable compared with VAEs, which makes it difficult to predict the value of each pixel for the whole image and causes artifact noise in the synthetic image. The major limitation of anomaly detection using GANs in current research is that only the static camera scenario is implemented to obtain the appearance and motion features of the moving foreground objects. Besides, GANs also have problems in learning and generating small objects (the full appearance of the objects) in crowded scenes, making it challenging to enhance the accuracy of the model, especially at the pixel level.
16.5 Summary
In this chapter, we extensively explain the architecture of GANs and explore their applications in video anomaly detection research. DSTN, a novel unsupervised anomaly detection and localization method, is introduced to extend the knowledge of GANs and to improve system performance with respect to the accuracy of frame-level anomaly detection, pixel-level localization, and computational time. DSTN is designed to learn the mapping from spatial to temporal representations comprehensively by employing a novel fusion between the background removal and the real dense optical flow. The concatenation of patches is presented to assist the learning of the generative network. The proposed method is unsupervised since only normal events are used in training to obtain the corresponding generated dense optical flow, without labeling abnormal data. Since all videos are input into the model during testing, unrecognized patterns are classified as abnormalities because the model has no prior knowledge of any abnormal events. The abnormalities can simply be detected from the difference in local pixels between the real and the generated dense optical flow images. To the best of our knowledge, the proposed DSTN is the first attempt to boost pixel-level anomaly localization with the edge wrapping method as the postprocessing step of the GAN framework. We evaluate the method on three publicly available benchmarks: the UCSD pedestrian, UMN, and CUHK Avenue datasets. The performance of DSTN is compared with various methods and analyzed against the autoencoder to show the significance of using the skip connections of the GAN. From the experimental results, the proposed DSTN outperforms other state-of-the-art works in anomaly detection, localization, and time consumption. The advantages and limitations of GANs are addressed in the final section to deliver a comprehensive view of the use of GANs for the video anomaly detection task.
References
[1] R. Mehran, A. Oyama, M. Shah, Abnormal crowd behavior detection using social force model, in:
2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 935–942.
[2] J. Kim, K. Grauman, Observe locally, infer globally: a space-time MRF for detecting abnormal activities with incremental updates, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2009, pp. 2921–2928.
[3] W. Li, V. Mahadevan, N. Vasconcelos, Anomaly detection and localization in crowded scenes, IEEE
Trans. Pattern Anal. Mach. Intell. 36 (2013) 18–32.
[4] V. Mahadevan, W. Li, V. Bhalodia, N. Vasconcelos, Anomaly detection in crowded scenes, in: 2010
IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE, 2010,
pp. 1975–1981.
[5] Y. Cong, J. Yuan, J. Liu, Sparse reconstruction cost for abnormal event detection, in: CVPR 2011,
IEEE, 2011, pp. 3449–3456.
[6] C. Lu, J. Shi, J. Jia, Abnormal event detection at 150 fps in matlab, in: Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2720–2727.
[7] Y. Yuan, Y. Feng, X. Lu, Structured dictionary learning for abnormal event detection in crowded
scenes, Pattern Recogn. 73 (2018) 99–110.
[8] S. Wang, E. Zhu, J. Yin, F. Porikli, Video anomaly detection and localization by local motion based
joint video representation and OCELM, Neurocomputing 277 (2018) 161–175.
[9] X. Zhang, S. Yang, X. Zhang, W. Zhang, J. Zhang, Anomaly Detection and Localization in Crowded
Scenes by Motion-Field Shape Description and Similarity-Based Statistical Learning, 2018 (arXiv preprint arXiv:1805.10620).
[10] H. Mousavi, S. Mohammadi, A. Perina, R. Chellali, V. Murino, Analyzing tracklets for the detection of
abnormal crowd behavior, in: 2015 IEEE Winter Conference on Applications of Computer Vision,
IEEE, 2015, pp. 148–155.
[11] H. Mousavi, M. Nabi, H. Kiani, A. Perina, V. Murino, Crowd motion monitoring using tracklet-based
commotion measure, in: 2015 IEEE International Conference on Image Processing (ICIP), IEEE,
2015, pp. 2354–2358.
[12] Y. Fan, G. Wen, D. Li, S. Qiu, M.D. Levine, F. Xiao, Video anomaly detection and localization via
Gaussian mixture fully convolutional variational autoencoder, Comput. Vis. Image Underst. 195
(2020) 102920.
[13] Y. Feng, Y. Yuan, X. Lu, Learning deep event models for crowd anomaly detection, Neurocomputing
219 (2017) 548–556.
[14] M. Hasan, J. Choi, J. Neumann, A.K. Roy-Chowdhury, L.S. Davis, Learning temporal regularity in
video sequences, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 733–742.
[15] D. Xu, Y. Yan, E. Ricci, N. Sebe, Detecting anomalous events in videos by learning deep representations of appearance and motion, Comput. Vis. Image Underst. 156 (2017) 117–127.
[16] S. Bouindour, M.M. Hittawe, S. Mahfouz, H. Snoussi, Abnormal Event Detection Using Convolutional Neural Networks and 1-Class SVM Classifier, IET Digital Library, 2017.
[17] M. Ravanbakhsh, M. Nabi, H. Mousavi, E. Sangineto, N. Sebe, Plug-and-play cnn for crowd motion
analysis: An application in abnormal event detection, in: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE, 2018, pp. 1689–1698.
[18] M. Sabokrou, M. Fayyaz, M. Fathy, Z. Moayed, R. Klette, Deep-anomaly: fully convolutional neural
network for fast anomaly detection in crowded scenes, Comput. Vis. Image Underst. 172 (2018)
88–97.
[19] H. Wei, Y. Xiao, R. Li, X. Liu, Crowd abnormal detection using two-stream fully convolutional neural networks, in: 2018 10th International Conference on Measuring Technology and Mechatronics
Automation (ICMTMA), IEEE, 2018, pp. 332–336.
[20] M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. Regazzoni, N. Sebe, Abnormal event
detection in videos using generative adversarial nets, in: 2017 IEEE International Conference on Image
Processing (ICIP), IEEE, 2017, pp. 1577–1581.
[21] W. Liu, W. Luo, D. Lian, S. Gao, Future frame prediction for anomaly detection—a new baseline, in:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018,
pp. 6536–6545.
[22] M. Ravanbakhsh, E. Sangineto, M. Nabi, N. Sebe, Training adversarial discriminators for crosschannel abnormal event detection in crowds, in: 2019 IEEE Winter Conference on Applications of
Computer Vision (WACV), IEEE, 2019, pp. 1896–1904.
[23] J.T. Zhou, J. Du, H. Zhu, X. Peng, Y. Liu, R.S.M. Goh, Anomalynet: an anomaly detection network
for video surveillance, IEEE Trans. Inf. Forensics Secur. 14 (2019) 2537–2550.
[24] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, B. Frey, Adversarial Autoencoders, 2015 (arXiv preprint arXiv:1511.05644).
[25] J. An, S. Cho, Variational autoencoder based anomaly detection using reconstruction probability, in:
Special Lecture on IE, vol. 2, 2015, pp. 1–18.
[26] T. Ganokratanaa, S. Aramvith, N. Sebe, Unsupervised anomaly detection and localization based on
deep spatiotemporal translation network, IEEE Access 8 (2020) 50312–50329.
[27] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio,
Generative adversarial nets, Adv. Neural Inf. Proces. Syst. 2 (2014) 2672–2680.
[28] J. Sun, X. Wang, N. Xiong, J. Shao, Learning sparse representation with variational auto-encoder for
anomaly detection, IEEE Access 6 (2018) 33353–33361.
[29] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315–323.
[30] I.J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, Y. Bengio, Maxout Networks, 2013 (arXiv
preprint arXiv:1302.4389).
[31] K. Jarrett, K. Kavukcuoglu, M.A. Ranzato, Y. Lecun, What is the best multi-stage architecture for
object recognition? in: 2009 IEEE 12th International Conference on Computer Vision, IEEE,
2009, pp. 2146–2153.
[32] Y. Mashalla, Impact of computer technology on health: computer vision syndrome (CVS), Med. Pract.
Rev. 5 (2014) 20–30.
[33] K. Gates, Professionalizing police media work: surveillance video and the forensic sensibility, in:
Images, Ethics, Technology, Routledge, 2015.
[34] C. Dictionary, Cambridge Advanced Learner’s Dictionary, PONS-Worterbucher, Klett Ernst Verlag,
2008.
[35] P. Isola, J.-Y. Zhu, T. Zhou, A. Efros, Image-to-image translation with conditional adversarial networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2017, pp. 1125–1134.
[36] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical image computing and computer-assisted intervention,
Springer, 2015, pp. 234–241.
[37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla,
M. Bernstein, Imagenet large scale visual recognition challenge, Int. J. Comput. Vis. 115 (2015)
211–252.
[38] T. Brox, A. Bruhn, N. Papenberg, J. Weickert, High accuracy optical flow estimation based on a theory
for warping, in: European Conference on Computer Vision, Springer, 2004, pp. 25–36.
[39] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[40] M. Mathieu, C. Couprie, Y. Lecun, Deep Multi-Scale Video Prediction beyond Mean Square Error,
2015 (arXiv preprint arXiv:1511.05440).
[41] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt,
D. Cremers, T. Brox, Flownet: learning optical flow with convolutional networks, in: Proceedings
of the IEEE International Conference on Computer Vision, 2015, pp. 2758–2766.
[42] T. Kroeger, R. Timofte, D. Dai, L. Van Gool, Fast optical flow using dense inverse search, in: European Conference on Computer Vision, Springer, 2016, pp. 471–488.
[43] A. Radford, L. Metz, S. Chintala, Unsupervised Representation Learning With Deep Convolutional
Generative Adversarial Networks, 2015 (arXiv preprint arXiv:1511.06434).
[44] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for
training gans, in: Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
[45] S. Ioffe, C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal
Covariate Shift, 2015 (arXiv preprint arXiv:1502.03167).
[46] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[47] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition,
2014 (arXiv preprint arXiv:1409.1556).
[48] D.P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, 2014 (arXiv preprint
arXiv:1412.6980).
[49] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to
prevent neural networks from overfitting, J. Mach. Learn. Res. 15 (2014) 1929–1958.
[50] J. Canny, A computational approach to edge detection, IEEE Trans. Pattern Anal. Mach. Intell. 8
(1986) 679–698.
[51] F. Chollet, Keras document, Keras, GitHub, 2015.
[52] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M.
Isard, Tensorflow: a system for large-scale machine learning, in: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
[53] T. Fawcett, An introduction to ROC analysis, Pattern Recogn. Lett. 27 (2006) 861–874.
[54] J. Long, E. Shelhamer, T. Darrell, Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[55] Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, Image quality assessment: from error visibility to
structural similarity, IEEE Trans. Image Process. 13 (2004) 600–612.
Index
Note: Page numbers followed by f indicate figures, t indicate tables, and b indicate boxes.
A
Ablation analysis, 366–368
Accuracy, 50, 88
of filters, 94f
AC GAN, 30–31, 30f
loss functions and distance metrics, 32–33t
pros and cons of, 34–35t
Adam algorithm, 355, 357
Adaptive moment optimization (ADAM), 322
Adversarial autoencoders (AAEs), 107–108, 117
Adversarial loss, 9, 213–214, 218
Adversarial network, 293, 293f
Adversarial preparation, 61
Adversarial training, 140
Age-cGAN, 61t, 75
Aging of face, 119
AlexNet, 65, 178–181, 190
Alternative FCM algorithm, 84–85
Amazon, 43t
Animation, 254–259
AOI, 49t
Appearance and motion conditions GAN
(AMC-GAN), 334t, 340
Area under curve (AUC), 405–406
Artificial intelligence-based methods, 142–146
Art2Real, 250–253, 253–254f
Attentional generative adversarial networks
(AttnGAN), 141
attRNN, 40
Autoencoder, 414–416
Automatic caricature generation, 135–136
Automatic nonrigid histological image registration
(ANHIR) dataset, 265, 273–275
Auxiliary automatic driving, 76
Auxiliary object functions, 332–333
B
Background removal, 391–395, 417
Backward forward GAN (BFGAN), 40, 40f
BicycleGAN model, 132–134
Bidirectional GAN (BiGAN), 29–30, 29f
loss functions and distance metrics, 32–33t
pros and cons of, 34–35t
Bidirectional LSTM (Bi-LSTM), 341, 341f
Bilingual evaluation understudy (BELU) score, 52
Bool GANs, 117
Boundary equilibrium GAN (BEGAN), 23–24, 24f
loss functions and distance metrics, 32–33t
pros and cons of, 34–35t
C
Caltech 256, 49t
Caption, 43t
CariGANs, 136
Cartoon character generation, 76
Cascaded super-resolution GAN (CSRGAN), 6, 10
CD31 stain, 273
CelebA dataset, 49t, 247–248
CelebA-HQ, 49t
Chinese poem dataset, 43t
CHUK Avenue dataset, 404, 404f
CIFAR 10/100, 49t
Classification objective function, 332
Closed-circuit television (CCTV) cameras, 378
Cluster analysis, 81–82
CNN-based architectures, 185, 190–191, 196
CNN/Daily Mail dataset, 43t
COCO dataset, 43t
COIL-20, 49t
Compactness, 84–85
Computer-aided diagnosis (CAD) systems, 162
Conditional adversarial networks, 134, 383–384
Conditional generative adversarial networks
(CGANs), 28–29, 28f, 105–106, 126, 135,
139, 270
architecture, 166–167, 167f
loss functions and distance metrics, 32–33t
pros and cons of, 34–35t
respiratory sound synthesis
algorithm, 170–172
analysis, 179–181
data augmentation, 174–175
dataset, 174–181
discriminator network architecture, 169, 170f
generator network architecture, 168–169, 169f
inverse CWT, 176, 177f
performance results, 177–179
Conditional generative adversarial networks
(CGANs) (Continued)
scalograms, 170b, 176, 176f
steps, 173–174
system model, 167–168, 168f
time-scale representation, 168
trained network model, 177, 177t
Content-based image retrieval (CBIR), 185, 187,
200
Context Encoder, 61t
Continuous wavelet transform (CWT), 168
Controllable GANs, 139–140
Conventional generative adversarial networks
(cGANs), 289
Convolutional neural network (CNN), 18, 65,
269–270, 397, 407–408, 416–417
architecture, 66f
Convolutional traces, 145
Convolution operation, 398, 398f
Cooccurrence matrices, 143
Critic network discriminates, 337
Cross-channel adversarial discriminators, 387–389
Cross-channel generative adversarial networks,
383–385
Cross entropy loss, 332
Crossview Fork, 135
Cross-view image synthesis, 135
Crossview Sequential, 135
cSeq-GAN, 40–41, 41f
CUB, 49t
Cycle-consistency loss, 319–320
Cycle generative adversarial networks (CGAN),
25–26, 26f, 61t, 113–114, 115f, 132,
212–213, 322–325, 324f, 348, 350, 352–355
image-to-image translation, 245–247, 247–248f
loss functions and distance metrics, 32–33t
model, 223f
normalized difference vegetation index (NDVI),
213–216
architecture, 217–218, 217f
pros and cons of, 34–35t
qualitative evaluation, 225f
Cycle text-to-image GAN, 141
D
Data augmentation, 220, 222f, 355, 357
Datasets
image, 49, 49t
for video generation techniques, 338t
Decision-level fusion approach, 210
Deep belief network (DBN), 67
architecture, 67f
Deep convolutional GAN (DCGAN), 64, 108–110,
127, 164, 266–267
DeepFake
artificial intelligence-based methods, 142–146
challenges, 131
definition, 128
face swapping, 148–150
facial expressions, manipulation of, 152–153
facial features, manipulation of, 150–152
GAN-based techniques
image-to-image translation, 132–136
text-to-image synthesis, 136–142
legal and ethical considerations, 153–154
new face construction, 146–147
sample source and generated fake images,
128–130, 129–130f
Deep generative adversarial networks (GANs)
model, 291–292
Deep learning (DL), 65–68, 209–212, 347, 349,
352, 377–378, 391–393, 397
end-to-end, 379–380
generative, 235–237, 238f
overview, 235–239
unsupervised approach, 378
variational autoencoder (VAE), 237–239
Deep learning-based (DA-DCGAN), 117
Deep network architectures, 194–196
Deep neural networks (DNNs), 347–348, 352
Deep spatiotemporal translation network (DSTN),
390
dataset
CHUK Avenue, 404, 404f
UCSD, 403, 403f
UMN, 403–404
feature collection, 391–396
implementation, 404
overview, 390–391, 392f
performance, 407–413, 414f
spatiotemporal translation model, 396–400
testing framework, 391, 392f
training framework, 391, 392f
Denoising-based generative adversarial networks
(D-GAN), 291
Dense inverse search (DIS), 393
DenseNet, 22–23, 191
Digital elevation model (DEM), 251–252, 252f
Digital imaging and communications in medicine
(DICOM), 81
experimental analysis, 90–92
image segmentation, 90, 90f
montage of, 90, 91f
performance analysis, 92, 93f
Dilated temporal relational GAN (DTRGAN),
340–341, 341f
Discrete wavelet transformation, 318–319
Discriminator, 240–241
Discriminator model (DM), 63–64, 63f, 210–211,
211f
Discriminator network, 293, 293f
DNNs. See Deep neural networks (DNNs)
DSTN. See Deep spatiotemporal translation
network (DSTN)
Dual attentional generative adversarial network
(Dualattn-GAN), 141–142
Dual motion GAN (DMGAN), 334t, 340
Dynamic memory generative adversarial networks
(DM-GAN), 140
Dynamic transfer GAN, 334–335, 334t, 335f
E
Earth mover distance, 21
e-commerce, 185–187, 196–197, 200
Edge-enhanced GAN (EE-GAN), 9
for remote sensing image, 10
Edge-enhanced super-resolution network (EESR),
5
Edge wrapping (EW), 401–402
ELBO loss, 320
Encoder-based GAN, 47, 47f
Encoder-decoder network, 291–292
Enhanced super-resolution GAN (ESRGAN), 9–10
Ensemble learning GANs, 42, 44f
Equal error rate (EER), 406
Errors lung lobe tissue, 280, 280f
Estrogen receptor (ER) antibody stains, 273
Expectation-maximization (EM), 82–83
F
Face aging, 75
Facebook AI Similarity Search (FAISS), 193
Face conditional GAN (FCGAN), 61t, 73
Face frontal view generation, 75
Face generation, 75, 247–250
Face swapping, 148–150
Facial expressions, manipulation of, 152–153
Facial features, manipulation of, 150–152
FakeSpotter, 144
False data detection rate. See Recurrent neural
network (RNN), generative adversarial
networks (GANs)
False positive rate (FPR), 405
Fashion recommendation system, 191–196, 192f
Fault diagnosis, 120
f-Divergence, 332
Feature collection, 391–396
FG-SRGAN, 4–5
Filters
accuracy comparison, 94f
classification outputs, 93, 94t
FPR comparison, 94, 95f
harmonic mean, 95, 96f
PPV comparison, 95, 95f
sensitivity comparison, 93–94, 94f
specificity comparison, 94, 95f
Fingerprints, 144
Flow and texture generative adversarial network
(FTGAN), 334t, 335, 335f
FlowGAN, 335
Fluorescein angiography, 348–349, 349f, 351, 351f,
371
Forum of International Respiratory Societies
(FIRS), 161
Frame-level anomaly detection, 377–378, 406
Frechet inception distance (FID), 50–51
F1 score, 50
Fully connected convoultional GANs
(FCC-GANs), 103–104, 117
Fully connected GANs, 103–104
Fusion, 391–395, 394f
Fuzzy C-means, 82–83
Fuzzy C-means clustering (FCMC), 83–84, 87–88
G
GANs. See Generative adversarial networks (GANs)
Gaussian filter, 89
Generative adversarial networks (GANs), 1, 2f, 18,
59–64, 185–186, 379–380
advantages, 127–128, 329–330, 342, 416–417
applications, 73–76, 119–120, 127
architectures, 19f, 102–116, 125–126, 126f, 381f
vs. autoencoder, 379–380
based on image-toimage translation, 389–390
Generative adversarial networks (GANs) (Continued)
basic structure, 99, 100f
building blocks of, 331–332
components, 125
cross-channel, 383–385
cross-channel adversarial discriminators, 387–389
cyclical, 348, 352–355
design of, 60f
disadvantages, 343
fake images (see DeepFake)
future frame prediction, 385–387
generic framework, 329, 330f
image-to-image, 352–353
issues and challenges, 11–12
limitations, 128, 416–417
loss functions and distance metrics, 32–33t
model, 125
need for, 99–102
objective functions, 332
parts, 99
pros and cons of, 34–35t
research gaps, 117–119
structure of, 381–383
training process, 331–332
variants, 126
for video generation and prediction, 333–337
for video recognition, 337–340
for video summarization, 340–341
working principle, 125
Generative adversarial text-to-image synthesis,
137–138
Generator model (GM), 62, 62f, 210–211, 211f
Generator network, 293
Geographically weighted regression (GWR), 209
Geometry-guided CGANs, 135
GoogLeNet, 65, 178–181, 190
Gradient penalty, 22
Grocott-Gomori methenamine silver (GMS) stain,
264
Guided image filtering, 89
H
Harmonic mean, 88
of filters, 95, 96f
Hematoxylin and eosin (H&E) stain, 264, 271–273,
273t, 275, 276f
Hierarchical generative adversarial networks
(HiGAN), 334t, 337–339
High-quality images, 101f, 120
High-resolution picture generation, 74
Histopathology staining, GANs for, 266
applications, 264
automatic nonrigid histological image registration
(ANHIR) dataset, 273–275
conditional GANs (CGANs), 270
dataset, 272–275
deep convolutional GAN (DCGAN), 266–267
discriminator, 263, 283t
errors lung lobe tissue, 280, 280f
generator, 263, 282t
histology, 264, 271–272
histopathological analysis, 271
histopathology, 264–265
image-quality metrics, 268–269
image-to-image translation, 265, 269–271
kidney tissue, 278–279, 278f
lung lesion tissue, 275–276, 276–277f
lung lobe tissue, 278–279, 279f
machine learning, 265
medical imaging, 271–272
network architectures, 272–275, 281–282
optimization functions, 267–268
vanilla, 266
I
Identity shortcut connection. See ResNet
Image datasets, 49, 49t
Image generation, 73, 119
applications, 244–259
face generation, 247–250
image animation, 254–259, 256f
image-to-image translation, 245–247
photo-realistic images, 250–253
scene generation, 254–259, 258–259f
generative adversarial network (GAN)
architecture, 240f
Art2Real, 250–253, 253–254f
cycleGAN, 245–247, 247–248f
dataset, 246t
first-order motion, 254–259
implementation, 245t
monkey net, 254–255, 255–256f
Nash equilibrium, 239–243
stackGAN, 254–259, 257f
starGAN, 247–250
superresolution (SR), 250–253
variational autoencoder (VAE), 243–244, 244f
ImageNet, 191, 196
Image segmentation, 81–82
Image super-resolution, 6
Image synthesis, 119
Image-to-image translation, 73, 101f, 119, 132–136,
213–214, 389–390
histopathology staining, 265, 269–271
using cycle-GAN, 245–247, 247–248f
Imfilter, 89
Imguided filter, 89
Imitation game, 1
Improved GAN (IGAN), 48, 48f
Improved video generative adversarial network
(iVGAN), 337, 338t, 338f
Inception score (IS), 51
Incremental learning, 144
Info GAN, 22–23, 23f, 188–190, 188f
loss functions and distance metrics, 32–33t
pros and cons of, 34–35t
Infrared image translation, 315
generative adversarial network in, 315–316
Integral probability metric, 332
Intersection over union (IoU), 51
Inverse CWT, 176, 177f
IR-to-RGB translation, 313–314, 314f, 318–319,
319f, 323–325
J
Jaccard index, 51
Jensen-Shannon (JS) divergence, 266
Jensen-Shanon divergence (JSD), 240, 242
Julia, 54
K
Kernelized FCM, 84–85
Kernel maximum mean discrepancy (KMMD),
323–325
Kidney tissue, 278–279, 278f
k-Lipchitz constant, 21
Kullback-Leibler divergence (KLD), 17–18, 238,
266
L
LabelMe, 49t
Laplacian pyramid GAN (LAPGAN), 69–70, 127
Least-square loss, 219
Legal case reports, 43t
LeNet, 190
Long short-term memory (LSTM), 37, 67–68
architecture, 68f
Loss function-based conditional progressive
growing GAN (LC-PGGAN), 46, 46f
Lung lesion tissue, 275–276, 276–277f
Lung lobe tissue, 278–279, 279f
M
Machine learning, 381–382, 404
Magnetic resonance images (MRI), 45
Markov Chain Monte Carlo (MCMC) method,
416–417
Masson’s trichome (MAS) stain, 274
MatLab, 53–54
Mean absolute error (MAE), 275
Mean absolute percentage error (MAPE), 298
Mean average error (MAE), 364–366
Mean squared error (MSE), 221–222, 364–366
Medical imaging, 347–348, 350–351, 355–356, 364,
371–372
Microaneurysms, 361–362f, 363, 367
MinMax FCM, 86–87
MirrorGAN, 139
Missing part generation, 120
MNIST digit generation, 106, 106f
MobileNet, 191
MoCoGAN, 333–334, 334t, 334f
Mode collapse, 54
Modified generator GAN (MG-GAN), 164
Monkey net, 254–259
Motion energy image (MEI), 333
Motion history image (MHI), 333
MRI brain tumor, 49t
Multichannel attention selection GAN, 134
Multichannel residual conditional GAN
(MCRCGAN), 45, 45f
Multiconditional generative adversarial network
(MC-GAN), 138
Multidomain image-to-image translation, 132
Multimodal image-to-image translation, 132–134
Multimodal reconstruction, 348
of retinal image, 351–360
ablation analysis, 366–368
cyclical generative adversarial networks
(GANs), 348–349, 352–355
datasets, 360
network architectures, 357–360
qualitative evaluation, 360–364
Multimodal reconstruction (Continued)
quantitative evaluation, 364–366
SSIM methodology, 355–357
structural coherence, 369–370, 370f
Multiscale dense block generative adversarial
network (MSDB-GAN), 48–49, 48f
Multistage dynamic generative adversarial network
(MSDGAN), 336, 336f, 338t
Multitask learning (MTL), 291
MUNIT, 315–316, 322–325, 324f
Mutual information (MI), 22–23
N
Nash equilibrium, 54, 239–243
Natural language processing
datasets, 42, 43t
GAN application in, 33–41
NDVI. See Normalized difference vegetation
index (NDVI)
Near-infrared (NIR) images, 207, 220
Near-infrared (NIR) spectrum, 216–217
New face construction, 146–147
News summarization dataset, 43t
Noisy speech, 43t
Normalized difference vegetation index (NDVI)
applications, 208–209
cycle generative adversarial networks, 212–216
architecture, 217–218, 217f
model, 223f
qualitative evaluation, 225f
data augmentation, 220, 222f
datasets, 217f
deep learning-based approaches, 209–212
estimation
country category, 226f, 229f
field category, 227f, 230f
mountain category, 228f, 231f
evaluation metrics, 221–222
formulations, 208–209
least-square loss, 219
loss functions, 218–219
overview, 205–207
residual learning model (ResNet), 214–215
O
Object-driven generative adversarial networks
(ObJGAN), 140
Octave GANs, 109–110, 117
o-Kernelized FCM, 84–85
Open Images, 49t
OpenStreetMap, 49t
Open subtitles dataset, 43t
OpinRank, 43t
Oxford 102, 49t
P
Pairwise learning, 145–146
Parallel GAN, 25–26, 25f
loss functions and distance metrics, 32–33t
pros and cons of, 34–35t
Parameterized ReLu (PrelU), 7
Patch extraction, 391–395, 395f, 407
PatchGAN, 270–271, 322, 359–360, 399, 400f,
407–408
PCA-GAN, 5
Peak signal-to-noise ratio (PSNR), 6, 268
Perceptual loss, 7–9, 320
Periodic acid-Schiff (PAS) stains, 274
Person reidentification (REID), 10
PG-GAN, 46, 46f
Photo inpainting, 74
Photo-realistic images, 250–253
Pixel accuracy, 406
Pixel convolution neural networks (PixelCNN),
235–237
Pixel-level anomaly localization, 377–378, 406
Pixel recurrent neural networks (PixelRNN),
235–236
Pix2pix, 61t
PoseGAN, 342
Precision, 50, 88
Progestrone receptor (PR) antibody stains, 274
Python, 52–53
Q
Quality-aware GAN, 47, 47f
Quasi recurrent neural network (QRNN), 39–40
QuGAN, 39–40, 39f
R
Radon-Nikodym theorem, 241
RaFD dataset, 247–248
RankGAN, 37–38, 37f
Realistic photograph generation, 75
Recall, 50
Recall-oriented understudy for gisting evaluation
(ROUGE) score, 52
Receiver operating characteristic (ROC), 405
Reconstruction objective function, 332
Rectified linear unit (ReLU), 322
Recurrent neural network (RNN), 18, 66, 290
architecture, 66f
generative adversarial networks (GANs)
accuracy, 298–300, 299–300t
adversarial/discriminator network, 293, 293f
architecture, 294–296
deep-GAN model, 291–292
denoising method, 291
encoder-decoder network, 291–292
enhanced attention, 292
F-measure, 298, 300–305, 303–304t
generator network, 293
geometric-mean (G-mean), 298, 305,
305–306t
learning module, 295f
mean absolute percentage error (MAPE), 298,
305–308, 306–307t
multideep, 292
multitask learning (MTL), 291
optimization, 294–296
performance, 295f, 297–308
sensitivity, 298, 300, 301–302t
specificity, 298, 300, 302–303t
Wasserstein, 292
Reinforce GAN, 38, 38f
Residual blocks, 7
ResNet, 6, 65, 190–191, 214–215, 216f, 217–218
Resnet 50 network model, 178–182, 179–180f,
180–181t
ResNeXt, 5
Respiratory sound synthesis, conditional GAN
algorithm, 170–172
analysis, 179–181
data augmentation, 174–175
dataset, 174–181
discriminator network architecture, 169, 170f
generator network architecture, 168–169, 169f
inverse CWT, 176, 177f
performance, 177–179
scalograms, 170b, 176, 176f
steps, 173–174
system model, 167–168, 168f
time-scale representation, 168
trained network model, 177, 177t
Restricted Boltzmann machines (RBM), 67
Res_WGAN, 5
Retinal image, multimodal reconstruction of, 348,
351–360
ablation analysis, 366–368
cyclical generative adversarial networks (GANs),
348–349, 352–355
datasets, 360
network architectures, 357–360
qualitative evaluation, 360–364
quantitative evaluation, 364–366
SSIM methodology, 355–357
structural coherence, 369–370
RNN. See Recurrent neural network (RNN)
Root mean squared error (RMSE), 224, 232t,
275
R programming, 53
S
Scale-adaptive low-resolution person reidentification (SALR-REID), 10
Scalograms, 170b, 176, 176f
Scene generation, 254–259, 258–259f
Seismic images, SRGAN-based model, 7, 8f
Semantic similarity discriminator, 36, 37f
Semi GAN, 27–28, 27f
loss functions and distance metrics, 32–33t
pros and cons of, 34–35t
Semisupervised learning, 19f, 27
Semi GAN, 27–28, 27f
Sensitivity, 51
SeqAttnGAN, 118
SeqGAN, 36, 36f
Sequential GAN
supervised, 31, 31f
unsupervised, 24–25, 25f
SGAN, 42–44, 44f
Shadow maps, 120
Silhouette method, 84–85
Simple generative adversarial networks, 166
Smooth muscle actin (SMA) stain, 274
Spatial FCM, 84–85
Spatiotemporal translation model, 396–400
Specificity, 51
Speech enhancement, 120
sp-Kernelized FCM, 84–85
SRResNet, 5
Stacked generative adversarial networks
(StackGAN), 61t, 110–113, 138, 254–259
Stacked generative adversarial networks
(StackGAN++), 139
StarGAN, 73, 132, 247–250, 322–323, 324f
Stochastic Adam optimazer, 224–225
Structural similarity index, 221–222, 224, 232t,
268
video anomaly detection (VAD), 406–407
Superresolution (SR), 74, 250–253
Super-resolution GAN (SR-GAN), 5, 61t, 71–72,
127
architecture of, 6–7, 7–8f
image quality and, 11
network architecture, 7
perceptual loss, 7–9
adversarial loss, 9
content loss, 8–9
video surveillance and forensic application, 10
Supervised learning, 18–19, 28
ACGAN, 30–31, 30f
bidirectional GAN (BiGAN), 29–30, 29f
conditional GAN (CGAN), 28–29, 28f
Supervised sequential GAN, 31, 31f
loss functions and distance metrics, 32–33t
pros and cons of, 34–35t
T
Temporal GAN (TGAN), 61t, 69, 334t, 337,
338t
Text-to-image GAN, 45–46, 46f
Text-to-image synthesis, 76, 136–142
TextureGAN, 335
Thermal image translation. See Wavelet-guided
generative adversarial network (WGGAN)
TH-GAN, 41, 41f
2D median filter, 89
3D object generation, 75
3D presentation states (3DPR), 92
Threshold, 402
Tithonium Chasma, 252f
Trained model
of discriminator, 3, 3f
of generator, 3, 4f
True positive rate (TPR), 405
Turing Test, 1
Two-stage general adversarial network (TsGAN),
44–45, 45f
U
UCSD dataset, 403, 403f, 407
UGATIT, 322–323, 324f
UMN dataset, 403–404
U-Net, 269–271, 385–387, 414–415
U-Net generator, 217f, 220
Unified GAN (UGAN), 39, 39f, 132
Unpaired photo-to-caricature translation, 136
Unsupervised generative attentional networks
(U-GAT-IT), 134
Unsupervised learning, 18–19, 19f, 210, 377–379,
389
BEGAN, 23–24, 24f
cycle GAN, 25–26, 26f
of generative adversarial network, 390–400
InfoGAN, 22–23, 23f
parallel GAN, 25–26, 25f
sequential GAN, 24–25, 25f
vanilla GAN, 19–20, 20f
Wasserstein GAN, 20f, 21–22
WGAN-GP, 20f, 22
UT-Zap50K benchmark dataset, 186–187, 196–197
V
VAD. See Video anomaly detection (VAD)
VAE. See Variational autoencoder (VAE)
Vanilla GAN, 19–20, 19–20f, 126, 188, 266
loss functions and distance metrics, 32–33t
pros and cons of, 34–35t
Vanishing gradients, 54
Variational autoencoder (VAE), 17–18, 237–239,
243–244, 244f, 314–315
wavelet-guided, 317–319
Vari GAN, 61t, 68–69
Vegetation indexes (VIs), 205
normalized difference vegetation index (NDVI)
applications, 208–209
formulations, 208–209
VGGNet, 65, 186
Video analytics, 329
generative adversarial network (GAN), 334t, 338t
for video generation and prediction, 333–337
for video recognition, 337–340
for video summarization, 340–341
Video anomaly detection (VAD), 377–378
deep spatiotemporal translation network
(see Deep spatiotemporal translation
network (DSTN))
evaluation
area under curve (AUC), 405–406
equal error rate (EER), 406
frame-level evaluations, 406
pixel accuracy, 406
pixel-level evaluations, 406
receiver operating characteristic (ROC), 405
structural similarity index (SSIM), 406–407
generative adversarial network (GAN), 379–380
advantages, 416–417
based on image-toimage translation, 389–390
cross-channel, 383–385
cross-channel adversarial discriminators,
387–389
limitations, 416–417
prediction based on, 385–387
structure of, 381–383
for surveillance videos, 378–379
Video frame prediction, 76
Video GAN (VGAN), 61t, 71
Video retargeting, 339
Video surveillance, 378–379
Video synthesis, 76, 120
Video understanding, 333
Visual Genome, 49t
Visual similarity search systems
fashion recommendation system (see Fashion
recommendation system)
test results, 197, 198–200f
web interface, 197–200, 201f
W
WarpGAN, 135–136
Wasserstein distance, 21
Wasserstein GAN (WGAN), 20f, 21–22, 114–116,
292
loss functions and distance metrics, 32–33t
pros and cons of, 34–35t
Wavelet-guided generative adversarial network
(WGGAN)
adaptive moment optimization (ADAM), 322
architecture, 316–317
cycleGAN, 322–325, 324f
FLIR ADAS dataset, 321, 321t, 323
MUNIT, 315–316, 322–325, 324f
qualitative analysis, 321
translation results, 323, 324f
quantitative analysis, 322
translation results, 323–325, 325t
StarGAN., 322–323, 324f
UGATIT, 322–323, 324f
wavelet-guided variational autoencoder
(WGVA), 317–319
cycle-consistency loss, 319–320
discrete wavelet transformation, 318–319
ELBO loss, 320
full loss, 321
GAN loss, 320–321
perceptual loss, 320
reparameterization, 318
Wavelet-guided variational autoencoder (WGVA),
317–319, 325
cycle-consistency loss, 319–320
discrete wavelet transformation, 318–319
ELBO loss, 320
full loss, 321
GAN loss, 320–321
perceptual loss, 320
reparameterization, 318
Weak supervision, 333
WGAN-GP, 20f, 22
loss functions and distance metrics, 32–33t
pros and cons of, 34–35t
WGGAN. See Wavelet-guided generative
adversarial network (WGGAN)
WGVA. See Wavelet-guided variational
autoencoder (WGVA)
Wiener 2 filtering, 89
Y
YELP dataset, 43t
Z
ZFNet, 65