Soft Computing (2020) 24:18173–18184
https://doi.org/10.1007/s00500-020-05073-6
METHODOLOGIES AND APPLICATION
Multimodal image-to-image translation between domains with high
internal variability
Jian Wang · Jiancheng Lv · Xue Yang · Chenwei Tang · Xi Peng
College of Computer Science, Sichuan University, Chengdu 610065, People’s Republic of China
Corresponding author: Jiancheng Lv (lvjiancheng@scu.edu.cn)
Communicated by V. Loia.
Published online: 12 June 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
Abstract
Multimodal image-to-image translation based on generative adversarial networks (GANs) shows suboptimal performance in
the visual domains with high internal variability, e.g., translation from multiple breeds of cats to multiple breeds of dogs. To
alleviate this problem, we recast the training procedure as modeling distinct distributions which are observed sequentially,
for example, when different classes are encountered over time. As a result, the discriminator may forget about the previous
target distributions, known as catastrophic forgetting, leading to non-/slow convergence. Through experimental observation, we
found that the discriminator does not always forget the previously learned distributions during training. Therefore, we propose
a novel generator regulating GAN (GR-GAN). The proposed method encourages the discriminator to teach the generator
more effectively when it remembers more of the previously learned distributions, while discouraging the discriminator from guiding the generator when catastrophic forgetting happens on the discriminator. Both qualitative and quantitative results show
that the proposed method is significantly superior to the state-of-the-art methods in handling image data with high internal variability.
Keywords GANs · Image translation · High internal variability · Catastrophic forgetting · Generator regulating
1 Introduction
Image-to-image translation (in short, image translation)
aims at learning the mappings between two different visual
domains. Based on generative adversarial networks (GANs)
(Goodfellow et al. 2014), remarkable progress has been
recently achieved and this task has attracted considerable
attention in the computer vision community because a wide
range of problems in computer vision can be posed as image
translation problems, for example super-resolution (Dong
et al. 2014), colorization (Larsson et al. 2016; Zhang et al.
2016), inpainting (Pathak et al. 2016), and style transfer
(Gatys et al. 2016). A widely accepted view is that the mappings between two visual domains are inherently multimodal
(Zhu et al. 2017b; Huang et al. 2018), i.e., a single input
may correspond to multiple plausible outputs. To achieve
multimodal image translation, recent works proposed that
the data representation of both source and target domains
should be disentangled into two parts: content and style
(Huang et al. 2018; Yu et al. 2018; Ma et al. 2019; Gonzalez-Garcia 2018; Lee et al. 2018). More specifically, the content
is domain invariant and the style captures domain-specific
properties. If the content and style are well decoupled, multimodal translation can be successfully achieved by
recombining input’s content vector with a series of random
style vectors in the target style space.
In numerous practical applications, it is likely that domains with high internal variability will need to be translated into
each other. For example, on the cat ↔ dog task, the data we
collected tend to include numerous cat and dog breeds, like
the Oxford-IIIT pet dataset (Parkhi et al. 2012). However,
we experimentally found that existing models cannot perform disentanglement well when these models involve visual
domains with high internal variability (Fig. 1). Specifically,
the disentangled style contains only the pixel-level statistics
(e.g., color or texture) rather than the high-level concepts
(e.g., geometric structure). As a result, these models can only
be capable of changing the low-level features, but fail to perform large changes on high-level semantic information of
input samples. The possible reason is that they suffer from the non-/slow convergence problem (Salimans et al. 2016).

Fig. 1 Generator regulating GAN (GR-GAN). When involving domains with high internal variability, existing methods (left) can only transfer low-level information, and there are no large semantic changes with the change of style. (Right) We propose a novel GAN objective, which can be applied to arbitrary existing multimodal image translation models to improve the performance
In the high internal variability (multiple classes) setting, the
training regime for existing GAN-based translation models
still assumes that the training data are drawn i.i.d. from the
distribution of interest, i.e., data representing every class are
used concurrently during training. In real-world scenarios,
however, the data may sequentially arrive and only a small
portion can be obtained at a time.¹ Therefore, we propose
to view the training of these models as modeling different
distributions that are observed sequentially. Updating the discriminator to capture new observed distributions causes the
discriminator to forget previously learned distributions. This
so-called catastrophic forgetting issue (French 1999) will lead to the non-convergence of GAN frameworks (Thanh-Tung et al. 2018). Many attempts solve this issue by reusing old samples or by utilizing continual learning techniques (Wu et al. 2018; Seff et al. 2017; Thanh-Tung et al. 2018). However, such methods are usually computationally intensive.
During the actual training process, we did observe the catastrophic forgetting issue by monitoring the discriminator over a period of time. More importantly, we also observe that
the discriminator does not always forget. The discriminator
learns some meaningful (useful) feature representations at
certain time steps so that it has a correct assessment of the
samples that have been seen before (Fig. 3).
Based on the above observations, we propose a novel
generator regulating GAN (GR-GAN) to improve the performance of multimodal image translation under the setting
of “domains with high internal variability.” By adaptively
adjusting the learning dynamics of the generator network,
our method enables the generator to be trained more effectively with much stronger gradient under the guidance of the
discriminator that remembers more previous target distributions. On the contrary, when catastrophic forgetting happens
on the discriminator, the proposed method tries to prevent the discriminator from affecting the generator by providing the generator with a small gradient for training. The proposed GR-GAN incurs marginal computational overhead and can easily be applied to various image translation models without changing the original network structure.

¹ Considering the memory consumption, the batch size of image translation models is usually very small, e.g., 1.

The main contributions of this work can be summarized as follows:
– To the best of our knowledge, this could be the first work
to focus on the study of multimodal translation between
visual domains with high internal variability, and we
show the suboptimal performance of existing methods on such tasks.
– In a high internal variability setting, we recast the training regime as modeling different distributions that are
observed sequentially and found the major contributor to
non-convergence: catastrophic forgetting.
– We propose a novel adversarial objective to mitigate the
non-convergence problem in multimodal image translation tasks under the “domains with high internal variability” setting.
– Extensive experiments show that our approach is superior
to the state-of-the-art methods. Images synthesized by
our method are more realistic and diverse.
The rest of this paper is organized as follows: Sect. 2 discusses some work related to our own. In Sect. 3, we provide a
brief background on the works on which our method is based, and the
details of our method are described in Sect. 4. In Sect. 5, several experiments are conducted and analyzed. Conclusions
are discussed in Sect. 6.
2 Related work
In this section, we discuss several previous works related to
our own.
2.1 Image-to-image translation
The first framework to use conditional GANs (CGANs)
(Mirza and Osindero 2014) for the image translation task is that of Isola
et al. (2017), which takes images in the source domain as
conditional information to generate corresponding images
in the target domain. The idea is later used on massive
tasks, such as synthesizing images from sketches (Sangkloy et al. 2017; Tang et al. 2019) or semantic maps (Isola
et al. 2017). However, since the method of Isola et al. (2017) requires paired training data, its application is greatly limited. Zhu et al.
(2017a), Yi et al. (2017), and Kim et al. (2017) extend image
translation to the unpaired setting by introducing a cycle consistency constraint. The cycle consistency constraint forces
that if an image is translated from the source domain to the
target domain and back, it should still be the original image.
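As a brief illustration only (with hypothetical generators `g_ab` and `g_ba`, not any specific model's API), this constraint is typically implemented as an L1 reconstruction term:

```python
import torch.nn.functional as F

def cycle_consistency_loss(g_ab, g_ba, x_a):
    """Cycle consistency: translating A -> B and back to A should recover x_a."""
    return F.l1_loss(g_ba(g_ab(x_a)), x_a)
```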
Besides, Liu et al. (2017) assume the corresponding samples
of two domains share the same latent space. Nevertheless,
image translation is multimodal in nature. Zhu et al. (2017b) show that their model can perform multimodal image translation, but it
still needs paired data for training. Numerous recent efforts
(Huang et al. 2018; Lee et al. 2018; Yu et al. 2018) suggest that the image representations of both source and target
domains should be disentangled into two representations:
content and style. The content is domain invariant, and the
style captures domain-specific properties. These multimodal
methods achieve great success in various image translation
tasks. However, translation under the “domains with high internal variability” setting is rarely studied.
2.2 Catastrophic forgetting in GAN frameworks
Recently, generative adversarial networks (GANs) (Goodfellow et al. 2014) have been widely used in various tasks, such
as image generation (Radford et al. 2016), image editing (Zhu
et al. 2016), and text generation (Liu et al. 2018). The theoretical understanding of GANs has been a research hot spot.
When different classes are encountered over time, Seff et al.
(2017) regard the training regime of GAN frameworks as a
continual learning problem. They show that catastrophic forgetting (French 1999) happens in GANs. To further extend
the theory, Thanh-Tung et al. (2018) provide a theoretical
analysis of the problem and prove that catastrophic forgetting can make the training of GANs non-convergent. Unlike
previous methods that apply continual learning techniques
to GANs, Wu et al. (2018) alleviate the catastrophic forgetting issue by replaying memories of previous tasks. In
the high internal variability setting, we also view the training of
GAN-based image translation models as a continual learning
problem.
3 Preliminaries

In this section, we provide some background knowledge of GANs and multimodal image translation.

3.1 GANs

The GAN framework consists of two parts: a generator g_φ(z) : z → x that maps a latent variable drawn i.i.d. from a Gaussian or uniform distribution P_z(z) to the target data space, and a discriminator f_θ(x) : x → R that maps a data sample (real or generated) to a real number associated with its likelihood. The discriminator is trained to distinguish between real samples and samples synthesized by the generator (fake samples), and it, in turn, guides the generator to synthesize samples that are realistic enough to confuse the discriminator. The game between the generator g_φ and
the discriminator f_θ is implemented by alternately optimizing the following two loss functions:

$$L_d = -\mathbb{E}_{x \sim P_r(x)}\left[\log\left(\sigma(f_\theta(x))\right)\right] - \mathbb{E}_{z \sim P_z(z)}\left[\log\left(1-\sigma(f_\theta(g_{\phi^*}(z)))\right)\right], \quad (1)$$

$$L_g = \mathbb{E}_{z \sim P_z(z)}\left[\log\left(1-\sigma(f_{\theta^*}(g_\phi(z)))\right)\right], \quad (2)$$
where x is a sample from the real data distribution P_r, and σ(·) is the sigmoid function. σ(f_θ(x)) represents the probability that x comes from P_r rather than P_g (the distribution of generated samples), g_{φ*} means the generator is fixed, and f_{θ*} means the discriminator is fixed. Note that f_θ(·) is the non-transformed discriminator output. It can be interpreted as how realistic the input data are: a negative number means that the input data look fake, while a positive number means that the input data look real, i.e.,

$$f_\theta(\cdot) \begin{cases} > 0 & \text{input data look real} \\ < 0 & \text{input data look fake.} \end{cases}$$
Loss function L_g is called the saturating loss function. However, in practice, another variation called the non-saturating loss function is commonly used:

$$L_g^{ns} = -\mathbb{E}_{z \sim P_z(z)}\left[\log\left(\sigma(f_{\theta^*}(g_\phi(z)))\right)\right]. \quad (3)$$
We will discuss these two types of loss functions in detail in
Sect. 4.
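For concreteness, the following is a minimal PyTorch sketch of the losses in Eqs. 1-3 (not the authors' code); `f` and `g` stand for a discriminator returning the raw score f_θ(·) and a generator, both of which are assumptions here.

```python
import torch.nn.functional as F

def d_loss(f, g, x_real, z):
    """Discriminator loss L_d (Eq. 1)."""
    real_score = f(x_real)                 # f_theta(x): raw, non-transformed score
    fake_score = f(g(z).detach())          # generator g_phi* treated as fixed
    # -E[log sigma(f(x))] - E[log(1 - sigma(f(g(z))))]
    return F.softplus(-real_score).mean() + F.softplus(fake_score).mean()

def g_loss_saturating(f, g, z):
    """Saturating generator loss L_g (Eq. 2): E[log(1 - sigma(f(g(z))))]."""
    return -F.softplus(f(g(z))).mean()     # log(1 - sigma(u)) = -softplus(u)

def g_loss_non_saturating(f, g, z):
    """Non-saturating generator loss L_g^ns (Eq. 3): -E[log sigma(f(g(z)))]."""
    return F.softplus(-f(g(z))).mean()     # -log sigma(u) = softplus(-u)
```

In a real training loop, the discriminator parameters would be frozen (or simply not stepped) while either generator loss is minimized.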
3.2 Multimodal image translation
3 Preliminaries
In this section, we provide some previous knowledge of
GANs and multimodal image translation.
3.1 GANs
The GAN framework consists of two parts: a generator
gφ (z) : z → x that maps a latent variable drawn i.i.d.
from a Gaussian or uniform distribution Pz (z) to the target data space and a discriminator f θ (x) : x → R that
maps a data sample (real and generated) to a real number
associated with likelihood. The discriminator is trained to
distinguish between real samples and samples synthesized
by the generator (fake samples), and it, in turn, guides the
generator to synthesize samples that are enough to confuse
the discriminator. The game between the generator gφ and
Numerous recent contributions (Huang et al. 2018; Yu et al.
2018; Ma et al. 2019; Gonzalez-Garcia 2018; Lee et al. 2018)
convincingly showed that the multimodal mappings can be
learned between two domains that are each specified merely
by a set of unlabeled samples. For example, given a set of
unlabeled images of cats, MUNIT (Huang et al. 2018) synthesizes a variety of new images of dogs and vice versa.
These recent approaches hold the assumption that the data
representation can be decomposed into a content code that
is domain invariant, and a style code that captures domain-specific properties. In the case of good decoupling of content
and style, multimodal image translation can be successfully
achieved by recombining object’s content vector with a series
of random style vectors in the target style space. The information flow is described in Fig. 2.

Fig. 2 The information flow of multimodal image translation. (a) Sample x_A (x_B) in domain A (B) is disentangled into two representations: content code c_A (c_B) and style code s_A (s_B). (b) By recombining c_A with s_B, the cross-domain mapping from A to B is achieved, and the output x_AB should belong to domain B. (c) Several efforts like MUNIT (Huang et al. 2018) force the style code to be Gaussian, so multimodal translation can be achieved by randomly sampling the style code from a Gaussian
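As a rough illustration of this recombination step, the sketch below uses hypothetical modules `enc_content_A` and `dec_B` (they are assumptions, not components of any specific model):

```python
import torch

def translate_multimodal(x_a, enc_content_A, dec_B, style_dim=8, n_outputs=3):
    """Multimodal translation A -> B by recombining one content code
    with several random style codes drawn from the target style space."""
    c_a = enc_content_A(x_a)                       # domain-invariant content code
    outputs = []
    for _ in range(n_outputs):
        s_b = torch.randn(x_a.size(0), style_dim)  # random style code of domain B
        outputs.append(dec_B(c_a, s_b))            # x_AB: same content, new style
    return outputs                                 # several plausible translations
```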
To establish such mappings between two domains, three
types of constraints are employed: (1) when mapping from
source domain to target domain, the output has to be indistinguishable from the samples of the target domain; (2) each
sample is translated into the target domain and then translated
back, and the final result should be as similar as possible to
the original sample; (3) the two representations (content and
style) encoded from input data can still be decoded back.
All of these three types of constraints are elegant and do not
require additional supervision.
Of the above three cues, the most dominant cue is the
distribution constraint. This constraint is performed using
GANs and is applied at the distribution level. As we mentioned earlier, the non-saturating form is commonly used in
practice. Thus, the loss functions² of the discriminator L_d and generator L_g^{ns} are:

$$L_d = -\mathbb{E}_{x_t \sim P_t(x_t)}\left[\log\left(\sigma(f_\theta(x_t))\right)\right] - \mathbb{E}_{\substack{z \sim P_z(z) \\ x_s \sim P_s(x_s)}}\left[\log\left(1-\sigma(f_\theta(g_{\phi^*}(x_s, z)))\right)\right], \quad (4)$$

$$L_g^{ns} = -\mathbb{E}_{\substack{z \sim P_z(z) \\ x_s \sim P_s(x_s)}}\left[\log\left(\sigma(f_{\theta^*}(g_\phi(x_s, z)))\right)\right], \quad (5)$$
where x s is a sample from the source data distribution Ps , x t
is a sample from the target data distribution Pt , z is a latent
variable from a Gaussian or uniform distribution Pz (z), gφ ∗
means the generator is fixed and f θ ∗ means the discriminator
is fixed.
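A minimal PyTorch sketch of Eqs. 4-5, assuming `f` is a target-domain discriminator returning raw scores and `g(x_s, z)` is a translation generator (both are placeholders, not the authors' implementation):

```python
import torch.nn.functional as F

def d_loss_translation(f, g, x_t, x_s, z):
    """Discriminator loss L_d of Eq. 4 on the target domain."""
    real_score = f(x_t)                    # real target samples
    fake_score = f(g(x_s, z).detach())     # translated source samples, generator fixed
    return F.softplus(-real_score).mean() + F.softplus(fake_score).mean()

def g_loss_translation_ns(f, g, x_s, z):
    """Non-saturating generator loss L_g^ns of Eq. 5 (discriminator fixed)."""
    return F.softplus(-f(g(x_s, z))).mean()
```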
4 Method

In this section, we describe our basic idea, methodology, and implementation of the proposed generator regulating GAN (GR-GAN).

4.1 Motivation

Domains with high internal variability make multimodal
image translation tougher. In such a setting, we regard the
training procedure of GAN-based image translation models
as a continual learning problem, because in real-world scenarios the training data arrive sequentially and only a small
portion can be obtained at a time (e.g., small batch size). We
can think that these models are used to learn a set of target
distributions P_{t_0}, P_{t_1}, ..., P_{t_M}. The discriminator at task m does not have access to distributions P_{t_0}, P_{t_1}, ..., P_{t_{m-1}}. The
discriminator, thus, forgets about the previously learned target distributions (Thanh-Tung et al. 2018). This problem is
known as catastrophic forgetting (French 1999) and can hurt
the performance of models (leading to the non-convergence).
This issue can be alleviated using samples from previous tasks (memory replay mechanism) (Wu et al. 2018)
or different types of regularization that result in penalizing
large changes in parameters or activations (Seff et al. 2017;
Thanh-Tung et al. 2018). However, these techniques require
considerable memory consumption or computational overhead.
Indeed, we observed the catastrophic forgetting issue during the training process. But more importantly, we also
observe that the discriminator does not always forget. The
discriminator learns some meaningful feature representations at certain time steps, so that it has a pretty good
assessment of the samples that have been seen before. The
possible reason is that the training procedure is not a standard continual learning problem, i.e., the data distribution
observed by the model is not always completely different in
practice, as illustrated in Fig. 3.
The discriminator shows no forgetting at a certain time
step, indicating that it captures meaningful feature representations. Based on the above observation, instead of mitigating
the discriminator forgetting, we propose to encourage the
discriminator that remembers more previously learned distributions to teach the generator more effectively and to reduce
the impact of the poor discriminator on the generator. Hence,
we need to find an adaptive way to regulate the generator (Fig.
4).
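A simple way to obtain observations like those in Fig. 3 is to periodically score a fixed batch of real target-domain images with the current (frozen) discriminator; the sketch below assumes `f` returns raw scores and is only an illustration of such a probe:

```python
import torch

@torch.no_grad()
def forgetting_probe(f, real_target_batch):
    """Average probability that the discriminator judges a fixed set of
    real target-domain samples to be real (the y-axis of Fig. 3).
    A drop at some training step suggests catastrophic forgetting."""
    scores = f(real_target_batch)               # raw scores f_theta(x_t)
    return torch.sigmoid(scores).mean().item()  # mean judged-real probability
```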
4.2 Learning algorithm
Unlike the existing GAN-based translation methods, which
mainly adopt the non-saturating loss function, in this work,
we introduce a regularization term into the saturating loss
function of vanilla GAN to achieve the ability to adaptively
adjust the generator. The saturating loss function of the generator is given by:
$$L_g = \mathbb{E}_{\substack{z \sim P_z(z) \\ x_s \sim P_s(x_s)}}\left[\log\left(1-\sigma(f_{\theta^*}(g_\phi(x_s, z)))\right)\right]. \quad (6)$$

² Some models use the LSGAN (Mao et al. 2017) objective.
Fig. 3 Illustration of motivation. We show a segment of the MUNIT
(Huang et al. 2018) discriminator’s evaluation of a set of real target
domain samples during training. The score of the evaluation (y-axis)
represents the average probability that the discriminator judges a set
of real samples to be real. In high internal variability scenarios (cat
→ dog task, blue line), the discriminator at some steps cannot make
a good overall assessment on samples due to catastrophic forgetting.
Meanwhile, it is easy to see that the discriminators at different steps
differ greatly in the degree of forgetting about previously learned target
distributions. In low internal variability scenarios (summer → winter
task, red line), the phenomenon of catastrophic forgetting is not particularly prominent, which is why previous methods work well on this
task (color figure online)
Fig. 4 An overview of the proposed GR-GAN. Sample x_s is translated from the source domain to the target domain to generate sample x_st. To train the generator g_φ, (a) existing translation methods treat the trained discriminator f_{θ*} in each step equally, regardless of whether it is a good teacher to the generator. (b) We propose to adaptively regulate the generator’s learning dynamics by evaluating a set of real samples {x_t^1, x_t^2, ..., x_t^N} of the target domain, making the generator more sensitive to the discriminator that remembers more previous target distributions
Let h(·) = log(1 − σ(·)). From the chain rule of backpropagation (LeCun et al. 1989), the gradient of the generator is:

$$\frac{\partial h}{\partial g_\phi} = \frac{\partial h}{\partial f_{\theta^*}}\,\frac{\partial f_{\theta^*}}{\partial g_\phi} = -\frac{1}{1+e^{-f_{\theta^*}(g_\phi(x_s))}}\,\frac{\partial f_{\theta^*}}{\partial g_\phi}. \quad (7)$$

In practice, Eq. 6 may not provide sufficient gradient for the generator to learn well (Arjovsky et al. 2017). The discriminator can reject the synthesized samples with high confidence, since the generator struggles to generate sufficiently indistinguishable samples; thus, f_{θ*}(g_φ(x_s)) gets a small (negative) value. In this case, h(·) saturates and ∂h/∂g_φ → 0 (see the red line in Fig. 5). Therefore, vanilla GAN introduces the non-saturating loss to provide a stronger gradient for the generator (see the blue line in Fig. 5).

Fig. 5 The saturating loss function and non-saturating loss function of vanilla GAN
In our case, we do not want L_g to always provide a strong gradient to the generator. As mentioned earlier, our goal
is to make the good discriminator (one that remembers more about
the previously learned target distributions) more effective in
teaching the generator and to reduce the impact of the poor
discriminator on the generator. To achieve that, we simply need to make L_g provide the generator with a stronger gradient when the discriminator remembers more of the target distributions learned earlier (i.e., captures more meaningful feature representations), and leave L_g saturated when catastrophic forgetting happens on the discriminator.
Theoretically, the less the discriminator forgets about the target distributions, the better it can assess the real samples of the target domain (the more real samples it can judge to be real), i.e., E_{x_t∼P_t(x_t)}[f_{θ*}(x_t)] > 0 gets a larger value. Suppose the discriminator completely forgets the learned target distributions; then E_{x_t∼P_t(x_t)}[f_{θ*}(x_t)] → 0, since a poor discriminator cannot provide the right judgment on samples. We introduce E_{x_t∼P_t(x_t)}[f_{θ*}(x_t)] as a regularization term into the saturating loss function (Eq. 6) and obtain the generator’s loss function of our GR-GAN:
$$L_g^{gr} = \mathbb{E}_{\substack{z \sim P_z(z) \\ x_s \sim P_s(x_s)}}\left[\log\left(1-\sigma(f_{\theta^*}(g_\phi(x_s, z)) + \zeta)\right)\right], \quad (8)$$

where

$$\zeta = \mathbb{E}_{x_t \sim P_t(x_t)}\left[f_{\theta^*}(x_t)\right]. \quad (9)$$
The discriminator’s loss function of GR-GAN is the same
as Eq. 4. As before, let h(·) = log(1 − σ (·)) and x =
f_{θ*}(g_φ(x_s)) + ζ. Thus, the gradient of the generator is:

$$\frac{\partial h}{\partial g_\phi} = \frac{\partial h}{\partial x}\,\frac{\partial x}{\partial g_\phi} = -\frac{1}{1+e^{-x}}\,\frac{\partial x}{\partial g_\phi} = -\frac{1}{1+e^{-(f_{\theta^*}(g_\phi(x_s))+\zeta)}}\,\frac{\partial f_{\theta^*}}{\partial g_\phi}. \quad (10)$$
It is easy to see that when the discriminator remembers more of
the previously learned distributions, the regularization term ζ
gets a larger value. In this case, h(·) is not easy to saturate and
∂h/∂g_φ gets a higher value. Therefore, L_g provides a stronger gradient for the generator and encourages the generator to
“work harder.” When catastrophic forgetting happens on the discriminator, ζ → 0 and the loss function reduces to Eq. 6.
4.3 Implementation
However, at each step, calculating the value of the regularization term requires access to all the real data in the training set, which is computationally intensive. In practice, we only sample a small amount of data from the dataset to estimate this value:

$$\zeta = \frac{1}{N}\sum_{j=1}^{N} f_{\theta^*}(x_t^j),$$

where N denotes the number of real samples randomly sampled from the target domain. In
this work, we set N = 8 in all the experiments. The overview
of our method is shown in Fig. 4.
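Putting Eqs. 8-9 and the sampled estimate of ζ together, a hedged PyTorch sketch of the GR-GAN generator loss could look as follows (`f`, `g`, and the data arguments are placeholders, not the authors' code):

```python
import torch
import torch.nn.functional as F

def gr_generator_loss(f, g, x_s, z, real_target_samples):
    """GR-GAN generator loss (Eqs. 8-9), with zeta estimated from a small
    batch of N real target samples (N = 8 in the paper)."""
    with torch.no_grad():                         # zeta only reads the frozen discriminator
        zeta = f(real_target_samples).mean()      # (1/N) * sum_j f_theta*(x_t^j)
    fake_score = f(g(x_s, z))                     # f_theta*(g_phi(x_s, z))
    # log(1 - sigma(u)) = -softplus(u); a larger zeta keeps this term away from
    # saturation, so a less forgetful discriminator yields a stronger gradient.
    return -F.softplus(fake_score + zeta).mean()
```

At each generator step, `real_target_samples` would simply be a fresh random batch of N real images from the target training set.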
5 Experiments
In this section, we demonstrate the effectiveness of our
approach. This section is structured as follows. First, we will
compare our method with other methods from the perspective of visual realism to prove that our method can produce
more realistic results in the case of high diversity within the
domains. Next, we will show that our method can synthesize
more diverse results. We will then use the domain adaptation
task to show that our approach can make the generated results
more consistent with the target domain. Moreover, we will
compare the performance of the models with different GAN
objectives. Finally, we compare the time complexity of our
method with other algorithms.
5.1 Datasets
To evaluate the realism of the results and compare the performance of different types of GAN, we conduct experiments on
the Oxford-IIIT pet dataset³ (cat ↔ dog) (Parkhi et al. 2012),
which contains 7349 images of 12 cat breeds and 25 dog breeds. To
minimize the impact of the background, we use the bounding
box (or RoI) information provided in the dataset to remove
the superfluous parts of the image and retain only the head
area. Due to the limited number of images with ROI information, the final processed dataset contains only 3,686 images,
and there are only about 100 images for each type of cat and
dog. What’s worse, the cats vary enormously from breed to breed, and so do the dogs. We show some example images of this dataset in Fig. 6a.

³ This dataset is available at http://www.robots.ox.ac.uk/~vgg//data/pets.

Fig. 6 Several examples. We conduct the experiments on (a) the Oxford-IIIT pet dataset (cat ↔ dog) and (b) ImageNet (motor ↔ bicycle). Both tasks involve domains of great internal diversity, and translation between them requires large semantic changes
Meanwhile, we test the diversity of generated results on
the cat ↔ dog task. As a supplement, we also test the diversity
of results on motor ↔ bicycle and Yosemite (summer ↔ winter) tasks (Zhu et al. 2017a). The images for motor ↔ bicycle
are downloaded from ImageNet (Deng et al. 2009), consisting of 1,205 images for motorcycles and 1,209 for bicycles.
To reduce the impact of the image background, we cropped the images centered on the object (motorcycle or bicycle). Several example images from this dataset are shown in
Fig. 6b.
Furthermore, we perform domain adaptation on the classification task with MNIST (LeCun et al. 1998) ↔ MNIST-M
(Ganin et al. 2016) and Synthetic Cropped LineMod ↔
Cropped LineMod (Hinterstoisser et al. 2012).
5.1.1 Baselines
In this paper, we chose MUNIT (Huang et al. 2018), CBNIT
(Yu et al. 2018), and DRIT (Lee et al. 2018) as the baselines,
since the mappings between two domains are inherently multimodal. Note that MUNIT and CBNIT are based on least
squares GAN (LSGAN) (Mao et al. 2017), and DRIT is
based on vanilla GAN (Goodfellow et al. 2014). To make
a fair comparison, we only replace the GAN objective and retain the original network structure and parameter settings of these models. To facilitate the distinction, the suffix “-gr” is added to a model’s name after its objective is replaced with our proposed GAN objective.
5.2 Perceptual realism
5.2.1 Qualitative evaluation
We first show the visual realism of the results synthesized
by the different models on the cat ↔ dog task in Fig. 10. It
is easy to observe that MUNIT-gr, CBNIT-gr, and DRIT-gr
tend to synthesize features that are closer to the ground truth,
such as the cat’s ears standing up and the dog’s ears sagging.
To accomplish this, translation requires that the model be
able to make significant high-level semantic changes. We
can also observe that the baseline methods often only change
very low-level features (color and texture), and we can even
consider that they are simply copying the input.
5.2.2 Quantitative evaluation
Evaluating the quality of generated images is an open and
challenging problem. As proposed in Zhang et al. (2016), we
employ human judgments to evaluate the perceptual realism
of our results. The real and generated images are randomly arranged into paired test sequences, which are sequentially presented to the tester⁴ for 1 s. The tester needs to determine which one of the paired images looks more like a real image.

We test the performance of these models on the cat ↔ dog task, and Fig. 7 shows our test results. For a fair comparison, we compare which of the paired models (baseline model and baseline model with the proposed GAN objective) produces more realistic results. The numbers in Fig. 7 indicate the preference for each comparison pair, also called the “fooling rate.” Through comparison, it can be found that the proposed new GAN objective can mislead the testers into thinking that the results are more real. In other words, our method can significantly improve the perceptual realism of the results of image translation models.

⁴ All testers are independent of the authors’ research group.

Fig. 7 Quantitative results of perceptual realism. We compare which of the paired models (baseline and baseline with the GR-GAN objective) produces more realistic results by human judgment. The numbers show the preference for each comparison pair, indicating that our translation results are more realistic

5.3 Diversity
5.3.1 Qualitative evaluation
The diversity of results is also an important evaluation index.
In Fig. 8, we show the diversity comparison of results generated by MUNIT and MUNIT-gr (ours) on the cat ↔ dog task.
Because the original GAN objective leaves MUNIT weak at modeling the distribution of the target domain, the disentangled style code does not contain high-level concepts related to the features of the target; thus, the diversity is only reflected in the overall tone difference of the image. Unlike
MUNIT, the diversity of our MUNIT-gr is more reflected in
the high-level semantics, such as ear shape, nose shape, and
hair length. This means our method can generalize within the distribution, rather than simply memorizing the input or making
only small changes to the input.
5.3.2 Quantitative evaluation
To quantitatively evaluate the diversity of results, the LPIPS
metric (Zhang et al. 2018) is employed. The LPIPS measures
the average feature distances between generated samples. As
reported in Zhu et al. (2017b), for each model, we calculate
the average distance between 2,000 pairs of images randomly
generated from 100 input images. For comparison, we calculate the distance between the real images in the target domain
by sampling random pairs. To make the results more reliable,
we also conduct experiments on motor ↔ bicycle (Deng et al.
2009) and summer ↔ winter (Zhu et al. 2017a) tasks. As
shown in Table 1, the greater the distance, the higher the
diversity. In general, our method can improve the diversity
of generated results, especially in the datasets with high internal diversity like the Oxford-IIIT pet dataset (cat ↔ dog) and
the ImageNet (motor ↔ bicycle). On Yosemite (summer ↔
winter), the diversity of results is not improved much, since the previous method has been able to perform cross-domain translation well on this task.
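For reference, diversity scores of this kind can be computed with the publicly released `lpips` Python package (Zhang et al. 2018); the sketch below is a simplified illustration of the averaging step, not the exact evaluation script used in the paper:

```python
import itertools
import torch
import lpips  # pip install lpips

@torch.no_grad()
def average_lpips(images):
    """Mean pairwise LPIPS distance over a list of generated images.
    Higher values indicate more diverse outputs. Each image is a
    (1, 3, H, W) tensor scaled to [-1, 1], as the lpips package expects."""
    metric = lpips.LPIPS(net='alex')
    dists = [metric(a, b).item() for a, b in itertools.combinations(images, 2)]
    return sum(dists) / len(dists)
```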
Table 1 Quantitative evaluation of diversity

Method       C→D              M→B              S→W              D→C              B→M              W→S
Real images  0.391 ± 0.001    0.390 ± 0.001    0.446 ± 0.001    0.382 ± 0.001    0.394 ± 0.001    0.444 ± 0.001
MUNIT        0.341 ± 0.002    0.286 ± 0.002    0.287 ± 0.002    0.342 ± 0.002    0.319 ± 0.002    0.303 ± 0.002
MUNIT-gr     0.374 ± 0.002    0.308 ± 0.002    0.303 ± 0.001    0.372 ± 0.002    0.333 ± 0.002    0.300 ± 0.002
CBNIT        0.306 ± 0.002    0.178 ± 0.001    0.240 ± 0.002    0.257 ± 0.002    0.124 ± 0.001    0.230 ± 0.003
CBNIT-gr     0.327 ± 0.002    0.218 ± 0.001    0.258 ± 0.002    0.326 ± 0.001    0.230 ± 0.001    0.241 ± 0.002
DRIT         0.285 ± 0.001    0.211 ± 0.001    0.195 ± 0.001    0.278 ± 0.002    0.209 ± 0.001    0.202 ± 0.001
DRIT-gr      0.324 ± 0.002    0.243 ± 0.001    0.221 ± 0.001    0.290 ± 0.001    0.226 ± 0.001    0.204 ± 0.001

The LPIPS metric (Zhang et al. 2018) is employed to evaluate the diversity of results. In particular, on the Yosemite summer ↔ winter task, our method does not perform much better than the baseline, since the baseline already works well on this task
C, cat; D, dog; M, motor; B, bicycle; S, summer; W, winter
Fig. 8 Diversity comparison. Translation results with a series of random style vectors sampled from a Gaussian. Clearly, our approach synthesizes more diverse and realistic results than the baselines
Fig. 9 Domain adaptation experiments. Several results of the Synthetic
Cropped LineMod ↔ Cropped LineMod task
5.4 Domain adaptation
Domain adaptation technology aims to solve the domain-shift problem between the source domain and the target domain. We believe that if the results of image translation can be used to alleviate the domain-shift problem, this reflects the good performance of the image translation model. Following
PixelDA (Bousmalis et al. 2017), we conduct the classification experiments on tasks MNIST (LeCun et al. 1998)
↔ MNIST-M (Ganin et al. 2016) and Synthetic Cropped
LineMod ↔ Cropped LineMod (Hinterstoisser et al. 2012).
We first use these image translation models to translate the
labeled source domain images into the target domain to generate labeled target domain images. We use these generated labeled images as training data to train a classifier for target domain sample classification. For a fair comparison, the structure of the classifier network remains the same as in PixelDA. In addition to the baselines we mentioned, we also use the state-of-the-art domain adaptation algorithm PixelDA for comparison.

Table 2 Domain adaptation results

Model        Classification accuracy
             MNIST-M    LineMod
Source-only  0.598      0.500
MUNIT        0.886      0.896
MUNIT-gr     0.893      0.912
PixelDA      0.937      0.981
Target-only  0.957      0.994

We show the classification accuracy on MNIST → MNIST-M and Synthetic Cropped LineMod → Cropped LineMod. “Source-only” and “Target-only” indicate that the model uses images only from the source or the target domain, respectively, during training
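The evaluation protocol described above can be summarized by the following hedged sketch, where `translator`, `classifier`, and the data loaders are placeholders rather than the actual PixelDA-style setup:

```python
import torch

def domain_adaptation_eval(translator, classifier, train_src, test_tgt, epochs=10):
    """Translate labeled source images into the target domain, train a
    classifier on the translated images, then test it on real target data."""
    opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
    ce = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x_src, y in train_src:                 # labeled source-domain batches
            with torch.no_grad():
                x_fake_tgt = translator(x_src)     # labels are preserved by translation
            opt.zero_grad()
            loss = ce(classifier(x_fake_tgt), y)
            loss.backward()
            opt.step()
    correct = total = 0
    with torch.no_grad():
        for x_tgt, y in test_tgt:                  # real target-domain test set
            correct += (classifier(x_tgt).argmax(1) == y).sum().item()
            total += y.numel()
    return correct / total                         # reported classification accuracy
```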
As shown in Fig. 9, we first qualitatively compare the
visual quality of MUNIT and MUNIT-gr (ours) on the Synthetic Cropped LineMod ↔ Cropped LineMod task. It is
easy to see that the results of our method are of higher quality, especially on the Cropped LineMod → Synthetic Cropped LineMod side.

Fig. 10 Qualitative results of perceptual realism. Several results synthesized by baselines and our methods. Top: the task of cat → dog. Bottom: the task of dog → cat

Fig. 11 Models with different GANs. We show the results of MUNIT with four types of GAN: vanilla GAN, WGAN, LSGAN, and our GR-GAN. Top: the task of cat → dog. Bottom: the task of dog → cat
In terms of quantitative classification accuracy, as shown in Table 2, all the cross-domain translation
models did not exceed PixelDA, because these models did
not utilize label information during training as PixelDA did.
Compared with baselines, our approach can better improve
the performance of domain adaptation (Fig. 10).
5.5 Performance on different GANs
To demonstrate the effectiveness of our method, we compare the performance of MUNIT (Huang et al. 2018) with
different GAN objectives on the cat ↔ dog task. We employ
four types of GAN: vanilla GAN, WGAN, LSGAN, and our
GR-GAN. To make a fair comparison, we merely replace
the GAN objective and retain the original network structure
and parameter setting of the model. As shown in Fig. 11, it
is easy to observe that serious mode collapse occurs in both vanilla GAN and WGAN in the case of dog → cat, and in WGAN
in the case of cat → dog. In contrast, our GR-GAN produces
more realistic-looking images.
5.6 Comparison of computational complexity
One of the major advantages of our method over those algorithms that prevent network catastrophic forgetting is that the
computational cost is relatively small. Methods to prevent
network forgetting are mainly divided into memory replay
mechanism (Wu et al. 2018) and elastic weight consolidation (EWC) (Kirkpatrick et al. 2017). The memory replay
mechanism is an expensive way to mitigate catastrophic forgetting. For the tth training, it must reuse the previous t − 1
training data for the discriminator training. Learning from a
sequence of t time steps would then have a time complexity of O(t 2 ). EWC prevents parameters that are important
to the previously learned distributions from deviating too far
from their optimal values. For the t-th time step, a regularization term of the following form is added to the current loss
function:
$$\lambda \sum_{i=1}^{t-1} \left\| \theta - \theta^{(i)*} \right\|_{F_i}^{2}, \quad (11)$$
where F_i is the Fisher information matrix calculated at the end of step i, θ^{(i)*} is the optimal parameters of previous step i, θ is the current parameters, and λ controls the relative importance
of step i to the current step. Therefore, the time complexity
of the EWC algorithm is also O(t²). Our method is to randomly
select a small number of real samples from the training data
at each time step and dynamically adjust the learning dynamics of the generator according to the evaluation value of the
discriminator on these samples. Thus, the time complexity of our method is O(t). Table 3 shows the computational complexity comparison of the three methods.

Table 3 Computational complexity comparison

Algorithm                  Complexity
Memory replay mechanism    O(t²)
EWC                        O(t²)
GR-GAN                     O(t)
6 Conclusion
In this work, we discuss the multimodal image translation
between visual domains with high internal variability and
propose a novel GR-GAN to solve the non-convergence issue
in such a scenario. Our method incurs little computational overhead and can be easily applied to various image translation
models without modifying the original network structure.
Both qualitative and quantitative results show that the proposed method can significantly improve the performance of
image translation.
Acknowledgements This work was supported by the National Key
R&D Program of China under Contract No. 2017YFB1002201, the
National Natural Science Fund for Distinguished Young Scholar (Grant
No. 61625204) and partially supported by the State Key Program
of National Science Foundation of China (Grant Nos. 61836006 and
61432014).
Compliance with ethical standards
Conflict of interest The authors declare that they have no conflict of
interest.
Ethical approval This article does not contain any studies with human
participants or animals performed by any of the authors.
References
Arjovsky M, Chintala S, Bottou L (2017) Wasserstein generative adversarial networks. In: International conference on machine learning,
pp 214–223
Bousmalis K, Silberman N, Dohan D, Erhan D, Krishnan D (2017)
Unsupervised pixel-level domain adaptation with generative
adversarial networks. In: Proceedings of the IEEE conference on
computer vision and pattern recognition, pp 3722–3731
Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet:
a large-scale hierarchical image database. In: Proceedings of the
IEEE conference on computer vision and pattern recognition, pp
248–255
Dong C, Loy CC, He K, Tang X (2014) Learning a deep convolutional
network for image super-resolution. In: European conference on
computer vision, pp 184–199
French RM (1999) Catastrophic forgetting in connectionist networks.
Trends Cognit Sci 3(4):128–135
Ganin Y, Ustinova E, Ajakan H, Germain P, Larochelle H, Laviolette
F, Marchand M, Lempitsky V (2016) Domain-adversarial training
of neural networks. J Mach Learn Res 17(1):59.1–59.35
Gatys LA, Ecker AS, Bethge M (2016) Image style transfer using convolutional neural networks. In: Proceedings of the IEEE conference
on computer vision and pattern recognition, pp 2414–2423
Gonzalez-Garcia A, van de Weijer J, Bengio Y (2018) Image-to-image
translation for cross-domain disentanglement. In: Advances in
neural information processing systems 31: Annual conference on
neural information processing Systems 2018, pp 1287–1298
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair
S, Courville A, Bengio Y (2014) Generative adversarial nets. In:
Advances in neural information processing systems 27: Annual
conference on neural information processing systems 2014, pp
2672–2680
Hinterstoisser S, Lepetit V, Ilic S, Holzer S, Bradski G, Konolige K,
Navab N (2012) Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In:
Asian conference on computer vision, pp 548–562
Huang X, Liu MY, Belongie S, Kautz J (2018) Multimodal unsupervised image-to-image translation. In: Proceedings of the European
conference on computer vision (ECCV), pp 172–189
Isola P, Zhu JY, Zhou T, Efros AA (2017) Image-to-image translation
with conditional adversarial networks. In: Proceedings of the IEEE
conference on computer vision and pattern recognition, pp 1125–
1134
Kim T, Cha M, Kim H, Lee JK, Kim J (2017) Learning to discover
cross-domain relations with generative adversarial networks. In:
Proceedings of the 34th international conference on machine learning, vol 70, pp 1857–1865
Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu
AA, Milan K, Quan J, Ramalho T, Grabska-Barwinska A et al
(2017) Overcoming catastrophic forgetting in neural networks.
Proc Natl Acad Sci 114(13):3521–3526
Larsson G, Maire M, Shakhnarovich G (2016) Learning representations
for automatic colorization. In: European conference on computer
vision, pp 577–593
LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard
W, Jackel LD (1989) Backpropagation applied to handwritten zip
code recognition. Neural Comput 1(4):541–551
LeCun Y, Bottou L, Bengio Y, Haffner P et al (1998) Gradient-based learning applied to document recognition. Proc IEEE
86(11):2278–2324
Lee HY, Tseng HY, Huang JB, Singh M, Yang MH (2018) Diverse
image-to-image translation via disentangled representations. In:
Proceedings of the European conference on computer vision
(ECCV), pp 35–51
Liu MY, Breuel T, Kautz J (2017) Unsupervised image-to-image translation networks. In: Advances in neural information processing
systems 30: Annual conference on neural information processing
systems 2017, pp 700–708
Liu D, Fu J, Qu Q, Lv J (2018) BFGAN: Backward and forward
generative adversarial networks for lexically constrained sentence generation. IEEE ACM Trans Audio Speech Lang Process
27(12):2350–2361
Ma L, Jia X, Georgoulis S, Tuytelaars T, Van Gool L (2019) Exemplar
guided unsupervised image-to-image translation with semantic
consistency. In: International conference on learning representations
Mao X, Li Q, Xie H, Lau RY, Wang Z, Paul Smolley S (2017) Least
squares generative adversarial networks. In: Proceedings of the
IEEE international conference on computer vision, pp 2794–2802
Mirza M, Osindero S (2014) Conditional generative adversarial nets.
arXiv preprint arXiv:1411.1784
Parkhi OM, Vedaldi A, Zisserman A, Jawahar CV (2012) Cats and dogs.
In: Proceedings of the IEEE conference on computer vision and
pattern recognition
Pathak D, Krahenbuhl P, Donahue J, Darrell T, Efros AA (2016) Context
encoders: feature learning by inpainting. In: Proceedings of the
IEEE conference on computer vision and pattern recognition, pp
2536–2544
Radford A, Metz L, Chintala S (2016) Unsupervised representation
learning with deep convolutional generative adversarial networks.
In: International conference on learning representations
Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen
X (2016) Improved techniques for training gans. In: Advances in
neural information processing systems, pp 2234–2242
Sangkloy P, Lu J, Fang C, Yu F, Hays J (2017) Scribbler: controlling
deep image synthesis with sketch and color. In: Proceedings of the
IEEE conference on computer vision and pattern recognition, pp
5400–5409
Seff A, Beatson A, Suo D, Liu H (2017) Continual learning in generative
adversarial nets. arXiv preprint arXiv:1705.08395
Tang C, Xu K, He Z, Lv J (2019) Exaggerated portrait caricatures synthesis. Inf Sci 502:363–375
Thanh-Tung H, Tran T, Venkatesh S (2018) On catastrophic forgetting and mode collapse in generative adversarial networks. arXiv
preprint arXiv:1807.04015
Wu C, Herranz L, Liu X, Wang Y, van de Weijer J, Raducanu B (2018)
Memory replay gans: learning to generate images from new categories without forgetting. In: Conference on neural information
processing systems (NIPS)
Yi Z, Zhang H, Tan P, Gong M (2017) Dualgan: unsupervised dual
learning for image-to-image translation. In: Proceedings of the
IEEE international conference on computer vision, pp 2849–2857
Yu X, Ying Z, Li G, Gao W (2018) Multi-mapping image-to-image
translation with central biasing normalization. arXiv preprint
arXiv:1806.10050
Zhang R, Isola P, Efros AA (2016) Colorful image colorization. In:
European conference on computer vision, pp 649–666
Zhang R, Isola P, Efros AA, Shechtman E, Wang O (2018) The unreasonable effectiveness of deep features as a perceptual metric. In:
Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 586–595
Zhu JY, Krähenbühl P, Shechtman E, Efros AA (2016) Generative visual
manipulation on the natural image manifold. In: European conference on computer vision, pp. 597–613
Zhu JY, Park T, Isola P, Efros AA (2017a) Unpaired image-to-image
translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision,
pp 2223–2232
Zhu JY, Zhang R, Pathak D, Darrell T, Efros AA, Wang O, Shechtman E (2017b) Toward multimodal image-to-image translation.
In: Advances in neural information processing systems 30: Annual
conference on neural information processing systems 2017, pp
465–476
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.