Generative Adversarial Networks

Ian Goodfellow, Jean Pouget-Abadie, Bing Xu, David Warde-Farley, Aaron Courville, Mehdi Mirza, Sherjil Ozair, Yoshua Bengio

2014 NIPS Workshop on Perturbations, Optimization, and Statistics

Discriminative deep learning

• Recipe for success:
  - Gradient backpropagation
  - Dropout
  - Activation functions:
    • rectified linear
    • maxout
• Example: Google's winning entry into the ImageNet 1K competition (with extra data).
[Figure: "GoogLeNet network with all the bells and whistles"]

Generative modeling

• Have training examples x ~ p_data(x).
• Want a model that can draw samples: x ~ p_model(x),
• where p_model ≈ p_data.

Why generative models?

• Conditional generative models:
  - Speech synthesis: text → speech
  - Machine translation: French → English
    French: "Si mon tonton tond ton tonton, ton tonton sera tondu."
    English: "If my uncle shaves your uncle, your uncle will be shaved."
  - Image → image segmentation
• Environment simulator:
  - Reinforcement learning
  - Planning
• Leverage unlabeled data.

Maximum likelihood: the dominant approach

• ML objective function:

  \theta^* = \arg\max_\theta \frac{1}{m} \sum_{i=1}^{m} \log p\big(x^{(i)}; \theta\big)

Undirected graphical models

• State-of-the-art general-purpose undirected graphical model: the deep Boltzmann machine.
• Several "hidden layers" h^{(1)}, h^{(2)}, h^{(3)} above the visible layer x:

  p(h, x) = \frac{1}{Z} \tilde{p}(h, x), \qquad \tilde{p}(h, x) = \exp(-E(h, x)), \qquad Z = \sum_{h, x} \tilde{p}(h, x)

Undirected graphical models: disadvantage

• ML learning requires that we draw samples:

  \frac{d}{d\theta_i} \log p(x) = \frac{d}{d\theta_i} \Big[ \log \sum_h \tilde{p}(h, x) - \log Z(\theta) \Big], \qquad \frac{d}{d\theta_i} \log Z(\theta) = \frac{\frac{d}{d\theta_i} Z(\theta)}{Z(\theta)}

• The common way to do this is via MCMC (Gibbs sampling).

Boltzmann machines: disadvantage

• The model is badly parameterized for learning high-quality samples.
• Why?
  - Learning leads to large values of the model parameters.
    ‣ Large-valued parameters = a peaky distribution.
  - Large-valued parameters mean slow mixing of the sampler.
  - Slow mixing means the gradient updates are correlated, which leads to divergence of learning.
• Why poor mixing? Coordinated flipping of low-level features.
[Figure: MNIST dataset alongside 1st-layer features learned by an RBM]

Directed graphical models

  p(x, h) = p(x \mid h^{(1)}) \, p(h^{(1)} \mid h^{(2)}) \cdots p(h^{(L-1)} \mid h^{(L)}) \, p(h^{(L)})

  \frac{d}{d\theta_i} \log p(x) = \frac{1}{p(x)} \frac{d}{d\theta_i} p(x), \qquad p(x) = \sum_h p(x \mid h) \, p(h)

• Two problems (problem 1 is made concrete in the sketch below):
  1. Summation over exponentially many states in h.
  2. Posterior inference, i.e. calculating p(h | x), is intractable.
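To see why problem 1 bites, here is a minimal Python sketch of the exact marginal for binary latent variables. The callables log_p_h and log_p_x_given_h are hypothetical placeholders for a model's log prior and log conditional, not code from the talk:

```python
import itertools
import numpy as np

def log_px_brute_force(x, log_p_h, log_p_x_given_h, n_hidden):
    """Exact log p(x) = log sum_h p(x | h) p(h) for binary h.

    The sum runs over 2**n_hidden hidden states, so this is
    hopeless beyond a few dozen hidden units."""
    log_terms = [
        log_p_h(h) + log_p_x_given_h(x, h)
        for h in (np.array(bits)
                  for bits in itertools.product([0, 1], repeat=n_hidden))
    ]
    # Stable log-sum-exp over all 2**n_hidden terms.
    return np.logaddexp.reduce(log_terms)
```

Even at n_hidden = 40 the loop would have to visit roughly 10^12 states, which is why deep directed models resort to approximate inference instead.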
Directed graphical models: new approaches

• The variational autoencoder (VAE) model:
  - Kingma and Welling, "Auto-Encoding Variational Bayes," International Conference on Learning Representations (ICLR), 2014.
  - Rezende, Mohamed, and Wierstra, "Stochastic Back-propagation and Variational Inference in Deep Latent Gaussian Models," arXiv.
  - Uses a reparametrization that allows very efficient training with gradient backpropagation.

Generative stochastic networks

• General strategy: do not write a formula for p(x); just learn to sample incrementally.
• Main issue: subject to some of the same constraints on mixing as undirected graphical models.

Generative adversarial networks

• Don't write a formula for p(x); just learn to sample directly.
• No summation over all states.
• How? By playing a game.

Two-player zero-sum game

• Your winnings + your opponent's winnings = 0.
• Minimax theorem: a rational strategy exists for all such finite games.

Two-player zero-sum game

• Strategy: a specification of which moves you make in which circumstances.
• Equilibrium: each player's strategy is the best possible given their opponent's strategy.
• Example: rock-paper-scissors.
  - Choose your action at random.
  - Mixed-strategy equilibrium.

  Your winnings (rows: your move; columns: your opponent's move):

                Rock   Paper   Scissors
    Rock          0     -1        1
    Paper         1      0       -1
    Scissors     -1      1        0

Generative modeling with game theory?

• Can we design a game with a mixed-strategy equilibrium that forces one player to learn to generate from the data distribution?

Adversarial nets framework

• A game between two players:
  1. Discriminator D
  2. Generator G
• D tries to discriminate between:
  - a sample from the data distribution, and
  - a sample from the generator G.
• G tries to "trick" D by generating samples that are hard for D to distinguish from data.

[Diagram: x sampled from data feeds the differentiable function D, which tries to output 1; input noise z feeds the differentiable function G, whose output (x sampled from the model) feeds D, which tries to output 0.]

Zero-sum game

• Minimax objective function: D and G play the following two-player minimax game with value function V(D, G) (estimated numerically in the sketch below):

  \min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_\text{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
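To keep the objective concrete, here is a minimal NumPy sketch of a Monte Carlo estimate of V(D, G) from sampled batches. The function and the toy D and G are our illustration, not code from the talk:

```python
import numpy as np

def value_fn(D, G, x_data, z):
    """Monte Carlo estimate of
    V(D, G) = E_{x~p_data}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))].

    D maps a batch of x to probabilities in (0, 1); G maps z to x."""
    return np.mean(np.log(D(x_data))) + np.mean(np.log(1.0 - D(G(z))))

# Toy check with a "blind" discriminator that outputs 1/2 everywhere:
D = lambda x: np.full(len(x), 0.5)
G = lambda z: z  # identity stand-in for a generator
x_data = np.random.randn(1000, 2)
z = np.random.randn(1000, 2)
print(value_fn(D, G, x_data, z))  # log(1/2) + log(1/2) = -log 4 ≈ -1.386
```

The value -log 4 is exactly the value of the game at its optimum, where p_model = p_data and the best D can do is output 1/2 everywhere.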
Zero-sum game (continued)

• In practice, to estimate G we use:

  \max_G \mathbb{E}_{z \sim p_z(z)}[\log D(G(z))]

  Why? Stronger gradient for G when D is very good.
• From the paper: optimizing D to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting. Instead, we alternate between k steps of optimizing D and one step of optimizing G. This results in D being maintained near its optimal solution, so long as G changes slowly enough. The strategy is analogous to the way SML/PCD training maintains samples from a Markov chain from one learning step to the next, in order to avoid burning in a Markov chain as part of the inner loop of learning. The procedure is formally presented in Algorithm 1 of the paper.

Discriminator strategy

• The optimal strategy for any p_model(x) is always

  D(x) = \frac{p_\text{data}(x)}{p_\text{data}(x) + p_\text{model}(x)}

Learning process

• In practice, the minimax objective may not provide sufficient gradient for G to learn well. Early in learning, when G is poor, D can reject samples with high confidence because they are clearly different from the training data. In this case, log(1 - D(G(z))) saturates. Rather than training G to minimize log(1 - D(G(z))), we can train G to maximize log D(G(z)). This objective results in the same fixed point of the dynamics of G and D but provides much stronger gradients early in learning.
• The slides step through the four panels of Figure 1 from the paper: (a) a poorly fit model; (b) after updating D; (c) after updating G; (d) the mixed-strategy equilibrium. Two sketches follow the caption below: the alternating training loop, and a numeric check of the optimal discriminator.

Figure 1 caption: Generative adversarial nets are trained by simultaneously updating the discriminative distribution (D, blue, dashed line) so that it discriminates between samples from the data-generating distribution (black, dotted line) p_x and those of the generative distribution p_g (G) (green, solid line). The lower horizontal line is the domain from which z is sampled, in this case uniformly. The horizontal line above is part of the domain of x. The upward arrows show how the mapping x = G(z) imposes the non-uniform distribution p_g on transformed samples. G contracts in regions of high density and expands in regions of low density of p_g. (a) Consider an adversarial pair near convergence: p_g is similar to p_data and D is a partially accurate classifier. (b) In the inner loop of the algorithm, D is trained to discriminate samples from data, converging to D*(x) = p_data(x) / (p_data(x) + p_g(x)). (c) After an update to G, the gradient of D has guided G(z) to flow to regions that are more likely to be classified as data. (d) After several steps of training, if G and D have enough capacity, they reach a point at which neither can improve because p_g = p_data; the discriminator is unable to differentiate between the two distributions, i.e. D(x) = 1/2.
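The alternating procedure above, with the non-saturating generator objective, fits in a few lines. This is a minimal PyTorch sketch under our own assumptions; the network sizes, SGD-with-momentum optimizer, and learning rates are placeholder choices, not settings from the talk:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions (e.g. MNIST-sized x, 100-D noise).
z_dim, x_dim, hidden = 100, 784, 256
eps = 1e-8  # numerical guard inside the logs

G = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                  nn.Linear(hidden, x_dim), nn.Sigmoid())
D = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                  nn.Linear(hidden, 1), nn.Sigmoid())

opt_D = torch.optim.SGD(D.parameters(), lr=1e-3, momentum=0.5)
opt_G = torch.optim.SGD(G.parameters(), lr=1e-3, momentum=0.5)

def train_step(x_real, k=1):
    """One outer iteration: k discriminator steps, one generator step."""
    batch = x_real.shape[0]
    for _ in range(k):
        # Ascend V(D, G) in D; detach G's output so only D updates.
        z = torch.randn(batch, z_dim)
        loss_D = -(torch.log(D(x_real) + eps).mean()
                   + torch.log(1 - D(G(z).detach()) + eps).mean())
        opt_D.zero_grad()
        loss_D.backward()
        opt_D.step()
    # Non-saturating heuristic: maximize log D(G(z)) instead of
    # minimizing log(1 - D(G(z))), for stronger early gradients.
    z = torch.randn(batch, z_dim)
    loss_G = -torch.log(D(G(z)) + eps).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```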
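Figure 1's story can also be checked numerically. In this small sketch, two hypothetical 1-D Gaussians stand in for p_data and p_g; the densities and their parameters are our own illustration:

```python
import numpy as np
from scipy.stats import norm

# Stand-ins for p_data (black dotted curve in Figure 1) and
# p_g (green solid curve), evaluated on a 1-D grid.
xs = np.linspace(-5.0, 5.0, 200)
p_data = norm.pdf(xs, loc=0.0, scale=1.0)
p_g = norm.pdf(xs, loc=1.5, scale=1.2)

# Optimal discriminator for a fixed generator, as on the
# "Discriminator strategy" slide: D*(x) = p_data / (p_data + p_g).
d_star = p_data / (p_data + p_g)

# At equilibrium p_g = p_data, so D* collapses to 1/2 everywhere,
# matching panel (d) of Figure 1.
print(np.allclose(p_data / (p_data + p_data), 0.5))  # True
```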
Theoretical properties

• Theoretical properties (assuming infinite data, infinite model capacity, and direct updating of the generator's distribution):
  - There is a unique global optimum.
  - The optimum corresponds to the data distribution.
  - Convergence to the optimum is guaranteed.

Quantitative likelihood results

• Parzen-window-based log-likelihood estimates (a sketch of the estimator follows below):
  - Density estimate with Gaussian kernels centered on the samples drawn from the model.

    Model              MNIST        TFD
    DBN [3]            138 ± 2      1909 ± 66
    Stacked CAE [3]    121 ± 1.6    2110 ± 50
    Deep GSN [6]       214 ± 1.1    1890 ± 29
    Adversarial nets   225 ± 2      2057 ± 26

  Table: Parzen-window-based log-likelihood estimates. The reported numbers on MNIST are the mean log-likelihood of samples on the test set, with the standard error of the mean computed across examples. On TFD, we computed the standard error across folds of the dataset, with a different σ chosen using the validation set of each fold.
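For reference, here is a minimal NumPy sketch of the Parzen-window estimator: a mixture of isotropic Gaussians centered on model samples, with the bandwidth σ chosen on a validation set as the table caption describes. The implementation is our own, not the paper's code:

```python
import numpy as np
from scipy.special import logsumexp

def parzen_mean_log_likelihood(samples, test_x, sigma):
    """Mean log-likelihood of test_x under a Parzen-window density:
    a mixture of n isotropic Gaussians (bandwidth sigma) centered on
    model samples. samples: (n, d) array; test_x: (m, d) array."""
    n, d = samples.shape
    # Scaled differences, shape (m, n, d); batch this for large m, n.
    z = (test_x[:, None, :] - samples[None, :, :]) / sigma
    log_kernels = -0.5 * np.sum(z ** 2, axis=2)          # (m, n)
    log_norm = d * np.log(sigma * np.sqrt(2.0 * np.pi))  # Gaussian constant
    log_probs = logsumexp(log_kernels, axis=1) - np.log(n) - log_norm
    return log_probs.mean()
```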
Visualization of model samples

• MNIST
• TFD
• CIFAR-10 (fully connected model)
• CIFAR-10 (convolutional model)

Learned 2-D manifold of MNIST

[Figure: learned 2-D manifold of MNIST samples]

Visualizing trajectories

1. Draw sample A.
2. Draw sample B.
3. Simulate samples along the path between A and B.
4. Repeat steps 1-3 as desired.

Visualization of model trajectories

• MNIST digit dataset
• Toronto Face Dataset (TFD)
• CIFAR-10 (convolutional)

Extensions

• Conditional model:
  - Learn p(x | y).
  - The discriminator is trained on (x, y) pairs.
  - The generator net gets y and z as input.
  - Useful for translation, speech synthesis, and image segmentation.

Extensions

• Inference net:
  - Learn a network to model p(z | x).
  - The generator supplies an effectively infinite training set of (z, x) pairs!

Extensions

• Take advantage of large amounts of unlabeled data using the generator:
  - Train G on a large, unlabeled dataset.
  - Train G' to learn p(z | x) on an infinite training set.
  - Add a layer on top of G' and train it on a small labeled training set.

Extensions

• Take advantage of unlabeled data using the discriminator:
  - Train G and D on a large amount of unlabeled data.
  - Replace the last layer of D.
  - Continue training D on a small amount of labeled data.

Thank you. Questions?