Soft Computing (2020) 24:18173–18184
https://doi.org/10.1007/s00500-020-05073-6

METHODOLOGIES AND APPLICATION

Multimodal image-to-image translation between domains with high internal variability

Jian Wang1 · Jiancheng Lv1 · Xue Yang1 · Chenwei Tang1 · Xi Peng1

1 College of Computer Science, Sichuan University, Chengdu 610065, People's Republic of China
Corresponding author: Jiancheng Lv, lvjiancheng@scu.edu.cn
Communicated by V. Loia.

Published online: 12 June 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract
Multimodal image-to-image translation based on generative adversarial networks (GANs) shows suboptimal performance in visual domains with high internal variability, e.g., translation from multiple breeds of cats to multiple breeds of dogs. To alleviate this problem, we recast the training procedure as modeling distinct distributions that are observed sequentially, for example, when different classes are encountered over time. As a result, the discriminator may forget about the previous target distributions, a phenomenon known as catastrophic forgetting, leading to non-/slow convergence. Through experimental observation, we found that the discriminator does not always forget the previously learned distributions during training. Therefore, we propose a novel generator regulating GAN (GR-GAN). The proposed method encourages the discriminator to teach the generator more effectively when it remembers more of the previously learned distributions, while discouraging the discriminator from guiding the generator when catastrophic forgetting happens on the discriminator. Both qualitative and quantitative results show that the proposed method is significantly superior to the state-of-the-art methods in handling image data with high internal variability.

Keywords GANs · Image translation · High internal variability · Catastrophic forgetting · Generator regulating

1 Introduction

Image-to-image translation (in short, image translation) aims at learning the mappings between two different visual domains. Based on generative adversarial networks (GANs) (Goodfellow et al. 2014), remarkable progress has recently been achieved, and the task has attracted considerable attention in the computer vision community because a wide range of computer vision problems can be posed as image translation, for example, super-resolution (Dong et al. 2014), colorization (Larsson et al. 2016; Zhang et al. 2016), inpainting (Pathak et al. 2016), and style transfer (Gatys et al. 2016). A widely accepted view is that the mappings between two visual domains are inherently multimodal (Zhu et al. 2017b; Huang et al. 2018), i.e., a single input may correspond to multiple plausible outputs. To achieve multimodal image translation, recent works propose that the data representation of both source and target domains should be disentangled into two parts: content and style (Huang et al. 2018; Yu et al. 2018; Ma et al. 2019; Gonzalez-Garcia 2018; Lee et al. 2018). More specifically, the content is domain invariant and the style captures domain-specific properties. If the content and style are well decoupled, multimodal translation can be achieved by recombining the input's content vector with a series of random style vectors from the target style space. In many practical applications, the domains to be translated into each other exhibit high internal variability. For example, on the cat ↔ dog task, the data we collect tend to include numerous cat and dog breeds, as in the Oxford-IIIT pet dataset (Parkhi et al. 2012).
However, we experimentally found that existing models cannot perform disentanglement well when the visual domains involved have high internal variability (Fig. 1). Specifically, the disentangled style contains only pixel-level statistics (e.g., color or texture) rather than high-level concepts (e.g., geometric structure). As a result, these models are only capable of changing low-level features and fail to make large changes to the high-level semantic information of the input samples. A possible reason is that they suffer from the non-/slow convergence problem (Salimans et al. 2016).

Fig. 1 Generator regulating GAN (GR-GAN). When domains with high internal variability are involved, (left) existing methods can only transfer low-level information, and there are no large semantic changes as the style changes. (right) We propose a novel GAN objective, which can be applied to arbitrary existing multimodal image translation models to improve their performance

In the high internal variability (multiple classes) setting, the training regime for existing GAN-based translation models still assumes that the training data are drawn i.i.d. from the distribution of interest, i.e., data representing every class are used concurrently during training. In real-world scenarios, however, the data may arrive sequentially and only a small portion can be obtained at a time (considering memory consumption, the batch size of image translation models is usually very small, e.g., 1). Therefore, we propose to view the training of these models as modeling different distributions that are observed sequentially. Updating the discriminator to capture newly observed distributions causes the discriminator to forget previously learned distributions. This so-called catastrophic forgetting issue (French 1999) leads to the non-convergence of GAN frameworks (Thanh-Tung et al. 2018). Many attempts have been made to solve this issue by reusing old samples or by utilizing continual learning techniques (Wu et al. 2018; Seff et al. 2017; Thanh-Tung et al. 2018). However, such methods are usually computationally intensive.

During the actual training process, we did observe the catastrophic forgetting issue by monitoring the discriminator over a period of time. More importantly, we also observed that the discriminator does not always forget: at certain time steps it learns meaningful (useful) feature representations, so that it correctly assesses samples that have been seen before (Fig. 3). Based on these observations, we propose a novel generator regulating GAN (GR-GAN) to improve the performance of multimodal image translation under the setting of "domains with high internal variability." By adaptively adjusting the learning dynamics of the generator network, our method enables the generator to be trained more effectively with a much stronger gradient under the guidance of a discriminator that remembers more previous target distributions.
On the contrary, when catastrophic forgetting happens on the discriminator, the proposed method tries to prevent the discriminator from affecting the generator by providing the generator with only a small gradient for training. The proposed GR-GAN incurs marginal computational overhead and can easily be applied to various image translation models without changing the original network structure. The main contributions of this work can be summarized as follows:

– To the best of our knowledge, this could be the first work to focus on the study of multimodal translation between visual domains with high internal variability, and we show the suboptimal performance of existing methods on such tasks.
– In the high internal variability setting, we recast the training regime as modeling different distributions that are observed sequentially and identify the major contributor to non-convergence: catastrophic forgetting.
– We propose a novel adversarial objective to mitigate the non-convergence problem in multimodal image translation tasks under the "domains with high internal variability" setting.
– Extensive experiments show that our approach is superior to the state-of-the-art methods. Images synthesized by our method are more realistic and diverse.

The rest of this paper is organized as follows: Sect. 2 discusses work related to our own. In Sect. 3, we provide a brief background on the works our method builds on, and the details of our method are described in Sect. 4. In Sect. 5, several experiments are conducted and analyzed. Conclusions are discussed in Sect. 6.

2 Related work

In this section, we discuss several previous works related to our own.

2.1 Image-to-image translation

The first framework to use conditional GANs (CGANs) (Mirza and Osindero 2014) for the image translation task is that of Isola et al. (2017), which takes images in the source domain as conditional information to generate corresponding images in the target domain. The idea has later been applied to numerous tasks, such as synthesizing images from sketches (Sangkloy et al. 2017; Tang et al. 2019) or semantic maps (Isola et al. 2017). However, as Isola et al. (2017) requires paired training data, its application is greatly limited. Zhu et al. (2017a), Yi et al. (2017), and Kim et al. (2017) extend image translation to the unpaired setting by introducing a cycle consistency constraint, which enforces that an image translated from the source domain to the target domain and back should remain the original image. Besides, Liu et al. (2017) assume that the corresponding samples of the two domains share the same latent space. Nevertheless, image translation is multimodal in nature. Zhu et al. (2017b) show that multimodal image translation can be performed, but their method still needs paired data for training. Numerous recent efforts (Huang et al. 2018; Lee et al. 2018; Yu et al. 2018) suggest that the image representations of both source and target domains should be disentangled into two representations: content and style. The content is domain invariant, and the style captures domain-specific properties. These multimodal methods achieve great success in various image translation tasks. However, translation under the "domains with high internal variability" setting is rarely studied.

2.2 Catastrophic forgetting in GAN frameworks

Recently, generative adversarial networks (GANs) (Goodfellow et al. 2014) have been widely used in various tasks, such as image generation (Radford et al. 2016), image editing (Zhu et al. 2016), and text generation (Liu et al. 2018). The theoretical understanding of GANs has become a research hotspot.
When different classes are encountered over time, Seff et al. (2017) regard the training regime of GAN frameworks as a continual learning problem. They show that catastrophic forgetting (French 1999) happens in GANs. To further extend the theory, Thanh-Tung et al. (2018) provide a theoretical analysis of the problem and prove that catastrophic forgetting can make the training of GANs non-convergent. Unlike previous methods that apply continual learning techniques to GANs, Wu et al. (2018) alleviate the catastrophic forgetting issue by replaying memories of previous tasks. In the high internal variability setting, we also view the training of GAN-based image translation models as a continual learning problem.

3 Preliminaries

In this section, we provide some background knowledge on GANs and multimodal image translation.

3.1 GANs

The GAN framework consists of two parts: a generator g_φ(z): z → x that maps a latent variable drawn i.i.d. from a Gaussian or uniform distribution P_z(z) to the target data space, and a discriminator f_θ(x): x → R that maps a data sample (real or generated) to a real number associated with its likelihood. The discriminator is trained to distinguish between real samples and samples synthesized by the generator (fake samples), and it, in turn, guides the generator to synthesize samples that are realistic enough to confuse the discriminator. The game between the generator g_φ and the discriminator f_θ is implemented by alternately optimizing the following two loss functions:

L_d = -\mathbb{E}_{x \sim P_r(x)}[\log(\sigma(f_\theta(x)))] - \mathbb{E}_{z \sim P_z(z)}[\log(1 - \sigma(f_\theta(g_{\phi^*}(z))))],   (1)

L_g = \mathbb{E}_{z \sim P_z(z)}[\log(1 - \sigma(f_{\theta^*}(g_\phi(z))))],   (2)

where x is a sample from the real data distribution P_r and σ(·) is the sigmoid function. σ(f_θ(x)) represents the probability that x comes from P_r rather than P_g (the distribution of generated samples); g_{φ*} means the generator is fixed and f_{θ*} means the discriminator is fixed. Note that f_θ(·) is the non-transformed discriminator output. It can be interpreted as how realistic the input data are: a negative value means that the input data look fake, while a positive value means that the input data look real, i.e.,

f_\theta(\cdot) \begin{cases} > 0, & \text{input data look real} \\ < 0, & \text{input data look fake.} \end{cases}

The loss function L_g is called the saturating loss function. In practice, however, another variant called the non-saturating loss function is commonly used:

L_g^{ns} = -\mathbb{E}_{z \sim P_z(z)}[\log(\sigma(f_{\theta^*}(g_\phi(z))))].   (3)

We will discuss these two types of loss functions in detail in Sect. 4.
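For concreteness, the two alternating objectives above can be written in a few lines of code. The following is a minimal PyTorch-style sketch of Eqs. (1)–(3), assuming f_theta returns the raw (non-transformed) score f_θ(·) and g_phi maps latent codes to samples; the function names are ours for illustration, not code from the paper. It uses the identities −log σ(s) = softplus(−s) and log(1 − σ(s)) = −softplus(s).

    import torch.nn.functional as F

    def d_loss(f_theta, g_phi, x_real, z):
        """Discriminator loss, Eq. (1): -E[log sigma(f(x))] - E[log(1 - sigma(f(g(z))))]."""
        fake = g_phi(z).detach()                      # g_{phi*}: generator held fixed
        return F.softplus(-f_theta(x_real)).mean() + F.softplus(f_theta(fake)).mean()

    def g_loss_saturating(f_theta, g_phi, z):
        """Saturating generator loss, Eq. (2): E[log(1 - sigma(f(g(z))))]."""
        return -F.softplus(f_theta(g_phi(z))).mean()  # f_{theta*}: discriminator not updated on this loss

    def g_loss_non_saturating(f_theta, g_phi, z):
        """Non-saturating generator loss, Eq. (3): -E[log(sigma(f(g(z))))]."""
        return F.softplus(-f_theta(g_phi(z))).mean()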
3.2 Multimodal image translation

Numerous recent contributions (Huang et al. 2018; Yu et al. 2018; Ma et al. 2019; Gonzalez-Garcia 2018; Lee et al. 2018) convincingly showed that multimodal mappings can be learned between two domains that are each specified merely by a set of unlabeled samples. For example, given a set of unlabeled images of cats, MUNIT (Huang et al. 2018) synthesizes a variety of new images of dogs, and vice versa. These approaches hold the assumption that the data representation can be decomposed into a content code that is domain invariant and a style code that captures domain-specific properties. In the case of good decoupling of content and style, multimodal image translation can be achieved by recombining an object's content vector with a series of random style vectors from the target style space. The information flow is described in Fig. 2.

Fig. 2 The information flow of multimodal image translation. a Sample x_A (x_B) in domain A (B) is disentangled into two representations: content code c_A (c_B) and style code s_A (s_B). b By recombining c_A with s_B, the cross-domain mapping from A to B is achieved, and the output x_AB should belong to domain B. c Several efforts like MUNIT (Huang et al. 2018) force the style code to be Gaussian, so multimodal translation can be achieved by randomly sampling the style code from a Gaussian

To establish such mappings between two domains, three types of constraints are employed: (1) when mapping from the source domain to the target domain, the output has to be indistinguishable from the samples of the target domain; (2) each sample translated into the target domain and then translated back should be as similar as possible to the original sample; (3) the two representations (content and style) encoded from the input data can still be decoded back. All three types of constraints are elegant and do not require additional supervision. Of the above three cues, the most dominant is the distribution constraint. This constraint is enforced using GANs and is applied at the distribution level. As mentioned earlier, the non-saturating form is commonly used in practice. Thus, the loss functions of the discriminator L_d and the generator L_g^{ns} are (some models use the LSGAN (Mao et al. 2017) objective instead):

L_d = -\mathbb{E}_{x_t \sim P_t(x_t)}[\log(\sigma(f_\theta(x_t)))] - \mathbb{E}_{z \sim P_z(z),\, x_s \sim P_s(x_s)}[\log(1 - \sigma(f_\theta(g_{\phi^*}(x_s, z))))],   (4)

L_g^{ns} = -\mathbb{E}_{z \sim P_z(z),\, x_s \sim P_s(x_s)}[\log(\sigma(f_{\theta^*}(g_\phi(x_s, z))))],   (5)

where x_s is a sample from the source data distribution P_s, x_t is a sample from the target data distribution P_t, z is a latent variable from a Gaussian or uniform distribution P_z(z), g_{φ*} means the generator is fixed, and f_{θ*} means the discriminator is fixed.
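The information flow of Fig. 2 can be sketched as follows. This is an illustrative fragment in the spirit of MUNIT-style models rather than any specific implementation: enc_content_A and dec_B stand in for an assumed content encoder of domain A and decoder of domain B, and the style dimension of 8 is only a placeholder.

    import torch

    def translate_A_to_B(x_A, enc_content_A, dec_B, num_styles=3, style_dim=8):
        """Multimodal A -> B translation (Fig. 2): one content code, several random styles."""
        c_A = enc_content_A(x_A)                     # domain-invariant content code c_A
        outputs = []
        for _ in range(num_styles):
            # The style prior is assumed Gaussian, as in MUNIT-style models (Fig. 2c).
            s_B = torch.randn(x_A.size(0), style_dim, device=x_A.device)
            outputs.append(dec_B(c_A, s_B))          # recombine c_A with a random target style
        return outputs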
4 Method

In this section, we describe the basic idea, methodology, and implementation of our proposed generator regulating GAN (GR-GAN).

4.1 Motivation

Domains with high internal variability make multimodal image translation tougher. In such a setting, we regard the training procedure of GAN-based image translation models as a continual learning problem: in real-world scenarios, the training data arrive sequentially and only a small portion can be obtained at a time (e.g., a small batch size). These models can thus be thought of as learning a set of target distributions P_{t_0}, P_{t_1}, …, P_{t_M}. The discriminator at task m does not have access to the distributions P_{t_0}, P_{t_1}, …, P_{t_{m-1}} and therefore forgets about the previously learned target distributions (Thanh-Tung et al. 2018). This problem is known as catastrophic forgetting (French 1999) and can hurt the performance of the models (leading to non-convergence). The issue can be alleviated by using samples from previous tasks (memory replay mechanism) (Wu et al. 2018) or by different types of regularization that penalize large changes in parameters or activations (Seff et al. 2017; Thanh-Tung et al. 2018). However, these techniques require considerable memory consumption or computational overhead.

Indeed, we observed the catastrophic forgetting issue during the training process. But more importantly, we also observed that the discriminator does not always forget. The discriminator learns some meaningful feature representations at certain time steps, so that it has a pretty good assessment of the samples that have been seen before. The possible reason is that the training procedure is not a standard continual learning problem, i.e., the data distribution observed by the model is not always completely different in practice, as illustrated in Fig. 3. When the discriminator shows no forgetting at a certain time step, this indicates that it captures meaningful feature representations. Based on the above observation, instead of mitigating the discriminator's forgetting, we propose to encourage a discriminator that remembers more previously learned distributions to teach the generator more effectively and to reduce the impact of a poor discriminator on the generator. Hence, we need an adaptive way to regulate the generator (Fig. 4).

Fig. 3 Illustration of motivation. We show a segment of the MUNIT (Huang et al. 2018) discriminator's evaluation of a set of real target domain samples during training. The evaluation score (y-axis) represents the average probability that the discriminator judges a set of real samples to be real. In high internal variability scenarios (cat → dog task, blue line), the discriminator at some steps cannot make a good overall assessment of the samples due to catastrophic forgetting. Meanwhile, it is easy to see that the discriminators at different steps differ greatly in the degree of forgetting about previously learned target distributions. In low internal variability scenarios (summer → winter task, red line), the phenomenon of catastrophic forgetting is not particularly prominent, which is why previous methods work well on this task (color figure online)

Fig. 4 An overview of the proposed GR-GAN. Sample x_s is translated from the source domain to the target domain to generate sample x_st. To train the generator g_φ, a Existing translation methods treat the trained discriminator f_{θ*} at each step equally, regardless of whether it is a good teacher for the generator. b We propose to adaptively regulate the generator's learning dynamics by evaluating a set of real samples {x_t^1, x_t^2, …, x_t^N} of the target domain, making the generator more sensitive to a discriminator that remembers more previous target distributions

4.2 Learning algorithm

Unlike existing GAN-based translation methods, which mainly adopt the non-saturating loss function, in this work we introduce a regularization term into the saturating loss function of vanilla GAN to obtain the ability to adaptively adjust the generator. The saturating loss function of the generator is given by:

L_g = \mathbb{E}_{z \sim P_z(z),\, x_s \sim P_s(x_s)}[\log(1 - \sigma(f_{\theta^*}(g_\phi(x_s, z))))].   (6)

Let h(·) = log(1 − σ(·)). From the chain rule of backpropagation (LeCun et al. 1989), the gradient of the generator is:

\frac{\partial h}{\partial g_\phi} = \frac{\partial h}{\partial f_{\theta^*}} \frac{\partial f_{\theta^*}}{\partial g_\phi} = \frac{1}{1 + e^{-f_{\theta^*}(g_\phi(x_s))}} \frac{\partial f_{\theta^*}}{\partial g_\phi}.   (7)

In practice, Eq. 6 may not provide sufficient gradient for the generator to learn well (Arjovsky et al. 2017). The discriminator can reject the synthesized samples with high confidence, since the generator finds it difficult to generate sufficiently indistinguishable samples; thus, f_{θ*}(g_φ(x_s)) takes a small (negative) value. In this case, h(·) saturates and ∂h/∂g_φ → 0 (see the red line in Fig. 5). Therefore, vanilla GAN introduces the non-saturating loss to provide a stronger gradient for the generator (see the blue line in Fig. 5).

Fig. 5 The saturating loss function and non-saturating loss function of vanilla GAN

In our case, we do not want L_g to always provide a strong gradient to the generator.
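The saturation effect in Fig. 5 is easy to verify numerically. The snippet below (ours, for illustration only) evaluates the gradient of the saturating loss, d/dx log(1 − σ(x)) = −σ(x), and of the non-saturating loss, d/dx (−log σ(x)) = σ(x) − 1, at a few discriminator outputs x = f_θ(g_φ(x_s)):

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    for x in (-6.0, -2.0, 0.0, 2.0):
        g_sat = -sigmoid(x)        # gradient of log(1 - sigma(x)): saturating loss, Eqs. (2)/(6)
        g_ns = sigmoid(x) - 1.0    # gradient of -log(sigma(x)):    non-saturating loss, Eq. (3)
        print(f"x = {x:+.1f}: saturating {g_sat:+.4f}, non-saturating {g_ns:+.4f}")

For a confidently rejected fake (x = −6), the saturating gradient is about −0.0025 while the non-saturating gradient is about −0.9975, which is exactly the contrast plotted in Fig. 5.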
As mentioned earlier, our goal is to make a good discriminator (one that remembers more about the previously learned target distributions) more effective in teaching the generator, and to reduce the impact of a poor discriminator on the generator. To achieve this, we simply need L_g to provide the generator with sufficient gradient when the discriminator remembers more of the target distributions learned earlier (i.e., captures more meaningful feature representations), and to leave L_g saturated when catastrophic forgetting happens on the discriminator. Theoretically, the less the discriminator forgets about the target distributions, the better it can assess the real samples of the target domain (the more real samples it judges to be real), i.e., E_{x_t∼P_t(x_t)}[f_{θ*}(x_t)] takes a larger positive value. Suppose the discriminator completely forgets the learned target distributions; then E_{x_t∼P_t(x_t)}[f_{θ*}(x_t)] → 0, since a poor discriminator cannot provide the right judgment on the samples. We introduce E_{x_t∼P_t(x_t)}[f_{θ*}(x_t)] as a regularization term into the saturating loss function (Eq. 6) and obtain the generator's loss function of our GR-GAN:

L_g^{gr} = \mathbb{E}_{z \sim P_z(z),\, x_s \sim P_s(x_s)}[\log(1 - \sigma(f_{\theta^*}(g_\phi(x_s, z)) + \zeta))],   (8)

where

\zeta = \mathbb{E}_{x_t \sim P_t(x_t)}[f_{\theta^*}(x_t)].   (9)

The discriminator's loss function of GR-GAN is the same as Eq. 4. As before, let h(·) = log(1 − σ(·)) and x = f_{θ*}(g_φ(x_s)) + ζ. Thus, the gradient of the generator is:

\frac{\partial h}{\partial g_\phi} = \frac{\partial h}{\partial x} \frac{\partial x}{\partial g_\phi} = \frac{1}{1 + e^{-x}} \frac{\partial x}{\partial g_\phi} = \frac{1}{1 + e^{-(f_{\theta^*}(g_\phi(x_s)) + \zeta)}} \frac{\partial f_{\theta^*}}{\partial g_\phi}.   (10)

It is easy to see that when the discriminator remembers more of the previously learned distributions, the regularization term ζ takes a larger value. In this case, h(·) does not saturate easily and ∂h/∂g_φ takes a higher value. Therefore, L_g provides a stronger gradient for the generator and encourages the generator to "work harder." When catastrophic forgetting happens on the discriminator, ζ → 0 and the loss function becomes equal to Eq. 6.

4.3 Implementation

At each step, however, calculating the exact value of the regularization term requires access to all the real data in the training set, which is very computationally intensive. In practice, we only sample a small amount of data from the dataset to estimate this value: ζ = (1/N) Σ_{j=1}^{N} f_{θ*}(x_t^{(j)}), where N denotes the number of real samples randomly sampled from the target domain. In this work, we set N = 8 in all experiments. The overview of our method is shown in Fig. 4.
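Putting Eqs. (8)–(9) and the sampled estimate of ζ together, one GR-GAN generator update can be sketched as below. This is our minimal PyTorch-style reading of the objective, using the identity log(1 − σ(a)) = −softplus(a); the function and variable names are ours, not the authors' released code.

    import torch
    import torch.nn.functional as F

    def gr_generator_loss(f_theta, g_phi, x_s, z, x_t_batch):
        """GR-GAN generator loss, Eqs. (8)-(9).

        x_t_batch: N real target-domain samples (N = 8 in this work) used to
        estimate the regularization term zeta at the current step.
        """
        with torch.no_grad():
            # zeta = (1/N) * sum_j f_theta*(x_t^(j)): how well the (fixed) discriminator
            # still scores previously seen real target samples.
            zeta = f_theta(x_t_batch).mean()

        fake_score = f_theta(g_phi(x_s, z))        # f_theta*(g_phi(x_s, z))
        # log(1 - sigmoid(a)) = -softplus(a); a large zeta keeps the loss out of the
        # saturating regime, so the generator receives a stronger gradient.
        return -F.softplus(fake_score + zeta).mean()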
5 Experiments

In this section, we demonstrate the effectiveness of our approach. The section is structured as follows. First, we will compare our method with other methods from the perspective of visual realism to show that our method can produce more realistic results in the case of high diversity within the domains. Next, we will show that our method can synthesize more diverse results. We will then use the domain adaptation task to show that our approach can make the generated results more consistent with the target domain. Moreover, we will compare the performance of the models with different GAN objectives. Finally, we compare the time complexity of our method with that of other algorithms.

5.1 Datasets

To evaluate the realism of the results and compare the performance of different types of GAN, we conduct experiments on the Oxford-IIIT pet dataset (cat ↔ dog) (Parkhi et al. 2012), which contains 7349 images of 12 cat and 25 dog breeds (the dataset is available at http://www.robots.ox.ac.uk/~vgg/data/pets). To minimize the impact of the background, we use the bounding box (or RoI) information provided with the dataset to remove the superfluous part of each image and retain only the head area. Due to the limited number of images with RoI information, the final processed dataset contains only 3,686 images, with only about 100 images for each breed of cat and dog. What is worse, the cats vary enormously from breed to breed, and so do the dogs. We show some example images of this dataset in Fig. 6a.

Meanwhile, we test the diversity of generated results on the cat ↔ dog task. As a supplement, we also test the diversity of results on the motor ↔ bicycle and Yosemite (summer ↔ winter) tasks (Zhu et al. 2017a). The images for motor ↔ bicycle are downloaded from ImageNet (Deng et al. 2009), consisting of 1,205 images of motorcycles and 1,209 of bicycles. To limit the impact of the image background, we cropped the images centered on the object (motorcycle or bicycle). Several example images from this dataset are shown in Fig. 6b. Furthermore, we perform domain adaptation on the classification task with MNIST (LeCun et al. 1998) ↔ MNIST-M (Ganin et al. 2016) and Synthetic Cropped LineMod ↔ Cropped LineMod (Hinterstoisser et al. 2012).

Fig. 6 Several examples. We conduct the experiments on a the Oxford-IIIT pet dataset (cat ↔ dog) and b ImageNet (motor ↔ bicycle). Both tasks involve domains of great internal diversity, and translation between domains requires large semantic changes
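The head-region preprocessing described above is a simple crop-and-resize step. A hedged sketch is given below, assuming the head RoI has already been parsed from the dataset's annotation files into an (xmin, ymin, xmax, ymax) tuple; the parsing itself and the output resolution are not specified by the paper.

    from PIL import Image

    def crop_head(image_path, roi, out_size=256):
        """Crop a pet image to its annotated head RoI and resize it.

        roi: (xmin, ymin, xmax, ymax) head bounding box in pixel coordinates.
        """
        img = Image.open(image_path).convert("RGB")
        xmin, ymin, xmax, ymax = roi
        head = img.crop((xmin, ymin, xmax, ymax))   # keep only the head area
        return head.resize((out_size, out_size))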
5.1.1 Baselines

In this paper, we choose MUNIT (Huang et al. 2018), CBNIT (Yu et al. 2018), and DRIT (Lee et al. 2018) as the baselines, since the mappings between two domains are inherently multimodal. Note that MUNIT and CBNIT are based on the least squares GAN (LSGAN) (Mao et al. 2017), and DRIT is based on vanilla GAN (Goodfellow et al. 2014). To make a fair comparison, we only replace the GAN objective and retain the original network structure and parameter settings of these models. To facilitate the distinction, the suffix "-gr" is added to a model's name after its objective is replaced with the new GAN objective we propose.

5.2 Perceptual realism

5.2.1 Qualitative evaluation

We first show the visual realism of the results synthesized by the different models on the cat ↔ dog task in Fig. 10. It is easy to observe that MUNIT-gr, CBNIT-gr, and DRIT-gr tend to synthesize features that are closer to the ground truth, such as a cat's ears standing up and a dog's ears sagging. To accomplish this, the translation requires that the model be able to make significant high-level semantic changes. We can also observe that the baseline methods often change only very low-level features (color and texture); one could even say that they are simply copying the input.

Fig. 10 Qualitative results of perceptual realism. Several results synthesized by the baselines and our methods. Above: the task of cat → dog. Below: the task of dog → cat

5.2.2 Quantitative evaluation

Evaluating the quality of generated images is an open and challenging problem. As proposed in Zhang et al. (2016), we employ human judgments to evaluate the perceptual realism of our results. The real and generated images are randomly arranged into paired test sequences, which are sequentially presented to the testers (all testers are independent of the authors' research group) for 1 s. The tester needs to determine which of the paired images looks more like a real image. We test the performance of these models on the cat ↔ dog task, and Fig. 7 shows the results. For a fair comparison, we compare which of the paired models (the baseline model and the baseline model with the proposed GAN objective) produces more realistic results. The numbers in Fig. 7 indicate the preference for each comparison pair, also called the "fooling rate." The comparison shows that the proposed GAN objective can mislead the testers into thinking that the results are more real. In other words, our method can significantly improve the perceptual realism of the results of image translation models.

Fig. 7 Quantitative results of perceptual realism. We compare which of the paired models (baseline and baseline with the GR-GAN objective) produces more realistic results by human judgment. The numbers show the preference for each comparison pair, indicating that our translation results are more realistic

5.3 Diversity

5.3.1 Qualitative evaluation

The diversity of the results is also an important evaluation criterion. In Fig. 8, we show a diversity comparison of results generated by MUNIT and MUNIT-gr (ours) on the cat ↔ dog task. Because the original GAN objective plays a weak role in MUNIT in modeling the distribution of the target domain, the disentangled style code does not contain high-level concepts related to the features of the target; thus, diversity is only reflected in the overall tone of the image. Unlike MUNIT, the diversity of our MUNIT-gr is reflected more in high-level semantics, such as ear shape, nose shape, and hair length. This means our method can generalize within the distribution, rather than simply memorizing the input or making only small changes to it.

Fig. 8 Diversity comparison. Translation results with a series of random style vectors sampled from a Gaussian. Apparently, our approach synthesizes more diverse and realistic results than the baselines

5.3.2 Quantitative evaluation

To quantitatively evaluate the diversity of the results, the LPIPS metric (Zhang et al. 2018) is employed. LPIPS measures the average feature distance between generated samples. As reported in Zhu et al. (2017b), for each model we calculate the average distance between 2,000 pairs of images randomly generated from 100 input images. For comparison, we calculate the distance between random pairs of real images in the target domain. To make the results more reliable, we also conduct experiments on the motor ↔ bicycle (Deng et al. 2009) and summer ↔ winter (Zhu et al. 2017a) tasks. As shown in Table 1, the greater the distance, the higher the diversity. In general, our method improves the diversity of the generated results, especially on datasets with high internal diversity such as the Oxford-IIIT pet dataset (cat ↔ dog) and ImageNet (motor ↔ bicycle). On Yosemite (summer ↔ winter), the diversity of the results is not improved much, since the previous methods have already been able to perform cross-domain translation well on this task.

Table 1 Quantitative evaluation of diversity

Method       C→D            M→B            S→W            D→C            B→M            W→S
Real images  0.391 ± 0.001  0.390 ± 0.001  0.446 ± 0.001  0.382 ± 0.001  0.394 ± 0.001  0.444 ± 0.001
MUNIT        0.341 ± 0.002  0.286 ± 0.002  0.287 ± 0.002  0.342 ± 0.002  0.319 ± 0.002  0.303 ± 0.002
MUNIT-gr     0.374 ± 0.002  0.308 ± 0.002  0.303 ± 0.001  0.372 ± 0.002  0.333 ± 0.002  0.300 ± 0.002
CBNIT        0.306 ± 0.002  0.178 ± 0.001  0.240 ± 0.002  0.257 ± 0.002  0.124 ± 0.001  0.230 ± 0.003
CBNIT-gr     0.327 ± 0.002  0.218 ± 0.001  0.258 ± 0.002  0.326 ± 0.001  0.230 ± 0.001  0.241 ± 0.002
DRIT         0.285 ± 0.001  0.211 ± 0.001  0.195 ± 0.001  0.278 ± 0.002  0.209 ± 0.001  0.202 ± 0.001
DRIT-gr      0.324 ± 0.002  0.243 ± 0.001  0.221 ± 0.001  0.290 ± 0.001  0.226 ± 0.001  0.204 ± 0.001

The LPIPS metric (Zhang et al. 2018) is employed to evaluate the diversity of the results. In particular, on the Yosemite summer ↔ winter task, our method does not perform much better than the baselines, since the baselines already work well on this task. C, cat; D, dog; M, motor; B, bicycle; S, summer; W, winter
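The diversity protocol above can be sketched with the publicly available lpips package, which implements the metric of Zhang et al. (2018). The helper below (ours) averages the distance over random pairs of outputs generated for one input image; the images are assumed to be tensors of shape (1, 3, H, W) scaled to [−1, 1], and the number of pairs per input is left as a parameter.

    import itertools
    import torch
    import lpips                              # pip install lpips (Zhang et al. 2018)

    loss_fn = lpips.LPIPS(net='alex')         # AlexNet backbone, the default variant

    def average_pairwise_lpips(outputs, num_pairs=20):
        """Mean LPIPS distance over random pairs drawn from `outputs`."""
        dists = []
        for a, b in itertools.islice(itertools.combinations(outputs, 2), num_pairs):
            with torch.no_grad():
                dists.append(loss_fn(a, b).item())
        return sum(dists) / len(dists)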
5.4 Domain adaptation

Domain adaptation aims to solve the domain-shift problem between a source domain and a target domain. We believe that if the results of image translation can be used to mitigate the domain-shift problem, this reflects the good performance of the image translation model. Following PixelDA (Bousmalis et al. 2017), we conduct classification experiments on the MNIST (LeCun et al. 1998) ↔ MNIST-M (Ganin et al. 2016) and Synthetic Cropped LineMod ↔ Cropped LineMod (Hinterstoisser et al. 2012) tasks. We first use the image translation models to translate the labeled source domain images into the target domain, generating labeled target domain images. We then use these generated labeled images as training data to train a classifier for target domain sample classification. For a fair comparison, the structure of the classifier network remains the same as in PixelDA. In addition to the baselines mentioned above, we also use the state-of-the-art domain adaptation algorithm PixelDA for comparison.

As shown in Fig. 9, we first qualitatively compare the visual quality of MUNIT and MUNIT-gr (ours) on the Synthetic Cropped LineMod ↔ Cropped LineMod task. It is easy to see that the results of our method are of higher quality, especially on the Cropped LineMod → Synthetic Cropped LineMod side. In terms of quantitative classification accuracy, as shown in Table 2, none of the cross-domain translation models exceed PixelDA, because these models do not utilize label information during training as PixelDA does. Compared with the baselines, our approach better improves the performance of domain adaptation.

Fig. 9 Domain adaptation experiments. Several results of the Synthetic Cropped LineMod ↔ Cropped LineMod task

Table 2 Domain adaptation results

Model        Classification accuracy
             MNIST-M   LineMod
Source-only  0.598     0.500
MUNIT        0.886     0.896
MUNIT-gr     0.893     0.912
PixelDA      0.937     0.981
Target-only  0.957     0.994

We show the classification accuracy on MNIST → MNIST-M and Synthetic Cropped LineMod → Cropped LineMod. "Source-only" and "Target-only" indicate that the model uses images only from the source or the target domain during training
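The evaluation protocol of this subsection, training a target-domain classifier on translated, label-preserving source images, can be sketched as follows. The translator, classifier, optimizer, and data loader are placeholders, the style dimension is assumed, and the classifier architecture (kept identical to PixelDA in our experiments) is not reproduced here.

    import torch
    import torch.nn.functional as F

    def train_on_translated(translator, classifier, source_loader, optimizer,
                            style_dim=8, epochs=10, device="cuda"):
        """Train a target-domain classifier on translated labeled source images."""
        classifier.train()
        for _ in range(epochs):
            for x_s, y in source_loader:              # labeled source-domain batch
                x_s, y = x_s.to(device), y.to(device)
                with torch.no_grad():
                    z = torch.randn(x_s.size(0), style_dim, device=device)
                    x_st = translator(x_s, z)         # pseudo-labeled target-domain image
                loss = F.cross_entropy(classifier(x_st), y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return classifier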
Fig. 11 Models with different GANs. We show the results of MUNIT with four types of GAN: vanilla GAN, WGAN, LSGAN, and our GR-GAN. Above: the task of cat → dog. Below: the task of dog → cat

5.5 Performance on different GANs

To demonstrate the effectiveness of our method, we compare the performance of MUNIT (Huang et al. 2018) with different GAN objectives on the cat ↔ dog task. We employ four types of GAN: vanilla GAN, WGAN, LSGAN, and our GR-GAN. To make a fair comparison, we merely replace the GAN objective and retain the original network structure and parameter settings of the model. As shown in Fig. 11, it is easy to observe that serious mode collapse occurs in both vanilla GAN and WGAN in the case of dog → cat, and in WGAN in the case of cat → dog. Indeed, our GR-GAN produces more realistic-looking images.

5.6 Comparison of computational complexity

One of the major advantages of our method over algorithms that prevent catastrophic forgetting in the network is that its computational cost is relatively small. Methods for preventing forgetting are mainly divided into the memory replay mechanism (Wu et al. 2018) and elastic weight consolidation (EWC) (Kirkpatrick et al. 2017). The memory replay mechanism is an expensive way to mitigate catastrophic forgetting: for the t-th training step, it must reuse the training data of the previous t − 1 steps for discriminator training. Learning from a sequence of t time steps therefore has a time complexity of O(t²). EWC prevents parameters that are important to the previously learned distributions from deviating too far from their optimal values. For the t-th time step, a regularization term of the following form is added to the current loss function:

\lambda \sum_{i=1}^{t-1} \lVert \theta - \theta^{(i)*} \rVert^2_{F_i},   (11)

where F_i is the Fisher information matrix calculated at the end of step i, θ^{(i)*} is the optimal parameters of the previous step i, θ is the current parameters, and λ controls the relative importance of step i to the current step. Therefore, the time complexity of the EWC algorithm is also O(t²). Our method, in contrast, randomly selects a small number of real samples from the training data at each time step and adjusts the learning dynamics of the generator according to the discriminator's evaluation of these samples. Thus, the time complexity of our method is O(t). Table 3 shows the computational complexity comparison of the three methods.

Table 3 Computational complexity comparison

Algorithm                 Complexity
Memory replay mechanism   O(t²)
EWC                       O(t²)
GR-GAN (ours)             O(t)
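For reference, the EWC penalty of Eq. (11) with the usual diagonal Fisher approximation can be written as below. This is a sketch of the baseline technique we compare against, not part of GR-GAN; the per-step snapshot bookkeeping it requires is what makes the total cost O(t²) over t steps.

    def ewc_penalty(model, snapshots, lam):
        """Diagonal-Fisher EWC penalty, Eq. (11).

        model: a torch.nn.Module holding the current parameters theta.
        snapshots: list of (fisher, old_params) pairs, one per previous step i,
        each a dict keyed by parameter name; the list grows with the step count t.
        """
        penalty = 0.0
        for fisher, old_params in snapshots:
            for name, p in model.named_parameters():
                penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
        return lam * penalty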
6 Conclusion

In this work, we discuss multimodal image translation between visual domains with high internal variability and propose a novel GR-GAN to solve the non-convergence issue in such a scenario. Our method incurs little computational effort and can easily be applied to various image translation models without modifying the original network structure. Both qualitative and quantitative results show that the proposed method can significantly improve the performance of image translation.

Acknowledgements This work was supported by the National Key R&D Program of China under Contract No. 2017YFB1002201, the National Natural Science Fund for Distinguished Young Scholar (Grant No. 61625204), and partially supported by the State Key Program of National Science Foundation of China (Grant Nos. 61836006 and 61432014).

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.

Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.

References

Arjovsky M, Chintala S, Bottou L (2017) Wasserstein generative adversarial networks. In: International conference on machine learning, pp 214–223
Bousmalis K, Silberman N, Dohan D, Erhan D, Krishnan D (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3722–3731
Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 248–255
Dong C, Loy CC, He K, Tang X (2014) Learning a deep convolutional network for image super-resolution. In: European conference on computer vision, pp 184–199
French RM (1999) Catastrophic forgetting in connectionist networks. Trends Cognit Sci 3(4):128–135
Ganin Y, Ustinova E, Ajakan H, Germain P, Larochelle H, Laviolette F, Marchand M, Lempitsky V (2016) Domain-adversarial training of neural networks. J Mach Learn Res 17(1):59.1–59.35
Gatys LA, Ecker AS, Bethge M (2016) Image style transfer using convolutional neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2414–2423
Gonzalez-Garcia A, van de Weijer J, Bengio Y (2018) Image-to-image translation for cross-domain disentanglement. In: Advances in neural information processing systems 31: annual conference on neural information processing systems 2018, pp 1287–1298
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Advances in neural information processing systems 27: annual conference on neural information processing systems 2014, pp 2672–2680
Hinterstoisser S, Lepetit V, Ilic S, Holzer S, Bradski G, Konolige K, Navab N (2012) Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In: Asian conference on computer vision, pp 548–562
Huang X, Liu MY, Belongie S, Kautz J (2018) Multimodal unsupervised image-to-image translation. In: Proceedings of the European conference on computer vision (ECCV), pp 172–189
Isola P, Zhu JY, Zhou T, Efros AA (2017) Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1125–1134
Kim T, Cha M, Kim H, Lee JK, Kim J (2017) Learning to discover cross-domain relations with generative adversarial networks. In: Proceedings of the 34th international conference on machine learning, vol 70, pp 1857–1865
Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu AA, Milan K, Quan J, Ramalho T, Grabska-Barwinska A et al (2017) Overcoming catastrophic forgetting in neural networks. Proc Natl Acad Sci 114(13):3521–3526
Larsson G, Maire M, Shakhnarovich G (2016) Learning representations for automatic colorization. In: European conference on computer vision, pp 577–593
LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD (1989) Backpropagation applied to handwritten zip code recognition. Neural Comput 1(4):541–551
LeCun Y, Bottou L, Bengio Y, Haffner P et al (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324
Lee HY, Tseng HY, Huang JB, Singh M, Yang MH (2018) Diverse image-to-image translation via disentangled representations. In: Proceedings of the European conference on computer vision (ECCV), pp 35–51
Liu MY, Breuel T, Kautz J (2017) Unsupervised image-to-image translation networks. In: Advances in neural information processing systems 30: annual conference on neural information processing systems 2017, pp 700–708
Liu D, Fu J, Qu Q, Lv J (2018) BFGAN: backward and forward generative adversarial networks for lexically constrained sentence generation. IEEE ACM Trans Audio Speech Lang Process 27(12):2350–2361
Ma L, Jia X, Georgoulis S, Tuytelaars T, Van Gool L (2019) Exemplar guided unsupervised image-to-image translation with semantic consistency. In: International conference on learning representations
Mao X, Li Q, Xie H, Lau RY, Wang Z, Paul Smolley S (2017) Least squares generative adversarial networks. In: Proceedings of the IEEE international conference on computer vision, pp 2794–2802
Mirza M, Osindero S (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784
Parkhi OM, Vedaldi A, Zisserman A, Jawahar CV (2012) Cats and dogs. In: Proceedings of the IEEE conference on computer vision and pattern recognition
Pathak D, Krahenbuhl P, Donahue J, Darrell T, Efros AA (2016) Context encoders: feature learning by inpainting. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2536–2544
Radford A, Metz L, Chintala S (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. In: International conference on learning representations
Salimans T, Goodfellow I, Zaremba W, Cheung V, Radford A, Chen X (2016) Improved techniques for training GANs. In: Advances in neural information processing systems, pp 2234–2242
Sangkloy P, Lu J, Fang C, Yu F, Hays J (2017) Scribbler: controlling deep image synthesis with sketch and color. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5400–5409
Seff A, Beatson A, Suo D, Liu H (2017) Continual learning in generative adversarial nets. arXiv preprint arXiv:1705.08395
Tang C, Xu K, He Z, Lv J (2019) Exaggerated portrait caricatures synthesis. Inf Sci 502:363–375
Thanh-Tung H, Tran T, Venkatesh S (2018) On catastrophic forgetting and mode collapse in generative adversarial networks. arXiv preprint arXiv:1807.04015
Wu C, Herranz L, Liu X, Wang Y, van de Weijer J, Raducanu B (2018) Memory replay GANs: learning to generate images from new categories without forgetting. In: Conference on neural information processing systems (NIPS)
Yi Z, Zhang H, Tan P, Gong M (2017) DualGAN: unsupervised dual learning for image-to-image translation. In: Proceedings of the IEEE international conference on computer vision, pp 2849–2857
Yu X, Ying Z, Li G, Gao W (2018) Multi-mapping image-to-image translation with central biasing normalization. arXiv preprint arXiv:1806.10050
Zhang R, Isola P, Efros AA (2016) Colorful image colorization. In: European conference on computer vision, pp 649–666
Zhang R, Isola P, Efros AA, Shechtman E, Wang O (2018) The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 586–595
Zhu JY, Krähenbühl P, Shechtman E, Efros AA (2016) Generative visual manipulation on the natural image manifold. In: European conference on computer vision, pp 597–613
Zhu JY, Park T, Isola P, Efros AA (2017a) Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision, pp 2223–2232
Zhu JY, Zhang R, Pathak D, Darrell T, Efros AA, Wang O, Shechtman E (2017b) Toward multimodal image-to-image translation. In: Advances in neural information processing systems 30: annual conference on neural information processing systems 2017, pp 465–476

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.