Generative Adversarial Networks

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio
Discriminative deep learning

• Recipe for success:
  - Gradient backpropagation.
  - Dropout.
  - Activation functions:
    • rectified linear
    • maxout

Google's winning entry into the ImageNet 1K competition (with extra data).
[Figure: GoogLeNet network "with all the bells and whistles"]
Generative modeling

• Have training examples x ~ p_data(x)
• Want a model that can draw samples: x ~ p_model(x)
• Where p_model ≈ p_data

[Figure: training examples x ~ p_data(x) on the left, model samples x ~ p_model(x) on the right]
Why generative models?

• Conditional generative models
  - Speech synthesis: Text → Speech
  - Machine translation: French → English
    • French: Si mon tonton tond ton tonton, ton tonton sera tondu.
    • English: If my uncle shaves your uncle, your uncle will be shaved.
  - Image → Image segmentation
• Environment simulator
  - Reinforcement learning
  - Planning
• Leverage unlabeled data
Maximum likelihood: the dominant approach

• ML objective function:

$$\theta^\star = \arg\max_\theta \frac{1}{m} \sum_{i=1}^{m} \log p\big(x^{(i)}; \theta\big)$$
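As a concrete sketch of this objective, the following fits a univariate Gaussian by gradient ascent on the average log-likelihood. The model choice, learning rate, and parameterization are illustrative assumptions, not from the slides.

```python
# Minimal sketch: maximum likelihood for a univariate Gaussian, fit by
# gradient ascent on (1/m) * sum_i log p(x_i; theta). Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.5, size=1000)   # x ~ p_data

mu, log_sigma = 0.0, 0.0                           # theta = (mu, log_sigma)
lr = 0.1
for _ in range(500):
    sigma = np.exp(log_sigma)
    # Gradients of the average Gaussian log-likelihood w.r.t. theta:
    grad_mu = np.mean(data - mu) / sigma**2
    grad_log_sigma = np.mean(((data - mu) / sigma) ** 2) - 1.0
    mu += lr * grad_mu
    log_sigma += lr * grad_log_sigma

print(mu, np.exp(log_sigma))   # recovers roughly (2.0, 1.5)
```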
Undirected graphical models

• State-of-the-art general purpose undirected graphical model: Deep Boltzmann machines
• Several "hidden layers" h:

$$p(h, x) = \frac{1}{Z}\,\tilde{p}(h, x), \qquad \tilde{p}(h, x) = \exp\big(-E(h, x)\big), \qquad Z = \sum_{h,x} \tilde{p}(h, x)$$

[Figure: layered model with hidden layers h^(3), h^(2), h^(1) above the visible layer x]
Undirected graphical models: disadvantage

• ML learning requires that we draw samples:

$$\frac{d}{d\theta_i} \log p(x) = \frac{d}{d\theta_i} \left[ \log \sum_h \tilde{p}(h, x) - \log Z(\theta) \right], \qquad \frac{d}{d\theta_i} \log Z(\theta) = \frac{\frac{d}{d\theta_i} Z(\theta)}{Z(\theta)}$$

[Figure: layered model with hidden layers h^(3), h^(2), h^(1) above the visible layer x]

• Common way to do this is via MCMC (Gibbs sampling).
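For concreteness, here is a minimal sketch of the kind of Gibbs sampling involved, in the two-layer special case (a binary RBM); the sizes and parameters W, b, c are assumptions for illustration.

```python
# Minimal sketch: block Gibbs sampling in a binary RBM, the MCMC used to
# approximate model samples for the negative phase of the log Z gradient.
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 784, 500
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b = np.zeros(n_visible)   # visible biases
c = np.zeros(n_hidden)    # hidden biases

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gibbs_step(x):
    """One block Gibbs sweep: sample h | x, then x | h."""
    h = rng.random(n_hidden) < sigmoid(x @ W + c)
    x = rng.random(n_visible) < sigmoid(h @ W.T + b)
    return x.astype(float), h.astype(float)

# Burn in a chain to draw an (approximate) model sample.
x = rng.integers(0, 2, n_visible).astype(float)
for _ in range(1000):
    x, h = gibbs_step(x)
```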
Boltzmann Machines: disadvantage

• Model is badly parameterized for learning high quality samples.
• Why?
  - Learning leads to large values of the model parameters.
    ‣ Large-valued parameters = peaky distribution.
  - Large-valued parameters mean slow mixing of the sampler.
  - Slow mixing means that the gradient updates are correlated, which leads to divergence of learning.
Boltzmann Machines: disadvantage

• Model is badly parameterized for learning high quality samples.
• Why poor mixing? Moving between modes requires coordinated flipping of low-level features.

[Figure: MNIST dataset examples alongside 1st-layer features learned by an RBM]
Directed graphical models

$$p(x, h) = p(x \mid h^{(1)})\, p(h^{(1)} \mid h^{(2)}) \cdots p(h^{(L-1)} \mid h^{(L)})\, p(h^{(L)})$$

$$\frac{d}{d\theta_i} \log p(x) = \frac{1}{p(x)} \frac{d}{d\theta_i} p(x), \qquad p(x) = \sum_h p(x \mid h)\, p(h)$$

[Figure: layered directed model with hidden layers h^(3), h^(2), h^(1) above the visible layer x]

• Two problems:
  1. Summation over exponentially many states in h.
  2. Posterior inference, i.e. calculating p(h | x), is intractable.
Directed graphical models: New approaches

• The Variational Autoencoder model:
  - Kingma and Welling, Auto-Encoding Variational Bayes, International Conference on Learning Representations (ICLR) 2014.
  - Rezende, Mohamed and Wierstra, Stochastic Back-propagation and Variational Inference in Deep Latent Gaussian Models, arXiv, 2014.
  - Use a reparametrization that allows them to train very efficiently with gradient backpropagation.
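A minimal sketch of that reparametrization, for the common diagonal-Gaussian case (the function names here are illustrative, not from either paper):

```python
# Reparametrization trick: instead of sampling z ~ N(mu, sigma^2) directly
# (which blocks backpropagation through the sampling step), write
# z = mu + sigma * eps with eps ~ N(0, I), so gradients flow to mu, sigma.
import numpy as np

rng = np.random.default_rng(0)

def sample_z(mu, log_var):
    eps = rng.standard_normal(mu.shape)      # noise, independent of parameters
    return mu + np.exp(0.5 * log_var) * eps  # differentiable in mu and log_var

z = sample_z(np.zeros(20), np.zeros(20))     # one draw from N(0, I)
```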
Generative stochastic networks

• General strategy: do not write a formula for p(x), just learn to sample incrementally.

[Figure: Markov chain of incremental sampling steps]

• Main issue: subject to some of the same constraints on mixing as undirected graphical models.
Generative adversarial networks

• Don't write a formula for p(x), just learn to sample directly.
• No summation over all states.
• How? By playing a game.
Two-player zero-sum game

• Your winnings + your opponent's winnings = 0.
• Minimax theorem: a rational strategy exists for all such finite games.
Two-player zero-sum game

• Strategy: specification of which moves you make in which circumstances.
• Equilibrium: each player's strategy is the best possible response to their opponent's strategy.
• Example: Rock-paper-scissors:
  - Choose your action at random.
  - Mixed strategy equilibrium.

Payoff to you:

                  Your opponent
                  Rock   Paper   Scissors
  You  Rock         0     -1        1
       Paper        1      0       -1
       Scissors    -1      1        0
Generative modeling with game theory?

• Can we design a game with a mixed-strategy equilibrium that forces one player to learn to generate from the data distribution?
Adversarial nets framework

• A game between two players:
  1. Discriminator D
  2. Generator G
• D tries to discriminate between:
  - A sample from the data distribution.
  - A sample from the generator G.
• G tries to "trick" D by generating samples that are hard for D to distinguish from data.
Adversarial nets framework

[Figure: two inputs to the discriminator. Left path: x sampled from data is fed to the differentiable function D, which tries to output 1. Right path: input noise z is fed to the differentiable function G, yielding x sampled from the model; this x is fed to D, which tries to output 0.]
Zero-sum game

• Minimax objective function:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

• In practice, to estimate G we use:

$$\max_G \mathbb{E}_{z \sim p_z(z)}[\log D(G(z))]$$

  Why? Stronger gradient for G when D is very good.

From the paper: "In practice, we must implement the game using an iterative, numerical approach. Optimizing D to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting. Instead, we alternate between k steps of optimizing D and one step of optimizing G. This results in D being maintained near its optimal solution, so long as G changes slowly enough. This strategy is analogous to the way that SML/PCD training maintains samples from a Markov chain from one learning step to the next in order to avoid burning in a Markov chain as part of the inner loop of learning. The procedure is formally presented in Algorithm 1."
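As a concrete illustration of this alternating scheme (k steps of D, one step of G), here is a minimal sketch in modern PyTorch. The architectures, optimizer settings, and k = 1 are illustrative assumptions, not the original Theano-era implementation.

```python
# Minimal GAN training sketch (PyTorch for concreteness; all sizes and
# hyperparameters are illustrative assumptions).
import torch
import torch.nn as nn

z_dim, x_dim, k = 100, 784, 1
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                  nn.Linear(256, x_dim), nn.Sigmoid())
D = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(),
                  nn.Linear(256, 1), nn.Sigmoid())
opt_D = torch.optim.SGD(D.parameters(), lr=0.01)
opt_G = torch.optim.SGD(G.parameters(), lr=0.01)

def train_step(x_real):
    m = x_real.size(0)
    # k steps of optimizing D: ascend E[log D(x)] + E[log(1 - D(G(z)))]
    for _ in range(k):
        z = torch.randn(m, z_dim)
        d_loss = -(torch.log(D(x_real)).mean()
                   + torch.log(1 - D(G(z).detach())).mean())
        opt_D.zero_grad(); d_loss.backward(); opt_D.step()
    # one step of optimizing G: descend E[log(1 - D(G(z)))]
    z = torch.randn(m, z_dim)
    g_loss = torch.log(1 - D(G(z))).mean()
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
```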
Discriminator strategy

• Optimal strategy for any p_model(x) is always

$$D(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{\text{model}}(x)}$$
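A short justification, following the paper's fixed-G analysis (a sketch; existence of the integrals is assumed):

```latex
% For fixed G, the discriminator maximizes
\[
  V(D) = \int_x \Big[ p_{\text{data}}(x)\,\log D(x)
       + p_{\text{model}}(x)\,\log\big(1 - D(x)\big) \Big]\, dx .
\]
% Pointwise, for a, b >= 0, the function t -> a log t + b log(1 - t)
% attains its maximum on (0, 1) at t = a / (a + b), hence
\[
  D^\star(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_{\text{model}}(x)} .
\]
```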
Learning process

From the paper: "In practice, equation 1 may not provide sufficient gradient for G to learn well. Early in learning, when G is poor, D can reject samples with high confidence because they are clearly different from the training data. In this case, log(1 − D(G(z))) saturates. Rather than training G to minimize log(1 − D(G(z))) we can train G to maximize log D(G(z)). This objective function results in the same fixed point of the dynamics of G and D but provides much stronger gradients early in learning."

[Figure 1 from the paper, shown in four stages: (a) poorly fit model, (b) after updating D, (c) after updating G, (d) mixed strategy equilibrium; the panels track the data distribution, the model distribution, and D. Caption: "Generative adversarial nets are trained by simultaneously updating the discriminative distribution (D, blue, dashed line) so that it discriminates between samples from the data generating distribution (black, dotted line) p_x from those of the generative distribution p_g (G) (green, solid line). The lower horizontal line is the domain from which z is sampled, in this case uniformly. The horizontal line above is part of the domain of x. The upward arrows show how the mapping x = G(z) imposes the non-uniform distribution p_g on transformed samples. G contracts in regions of high density and expands in regions of low density of p_g. (a) Consider an adversarial pair near convergence: p_g is similar to p_data and D is a partially accurate classifier. (b) In the inner loop of the algorithm D is trained to discriminate samples from data, converging to D*(x) = p_data(x) / (p_data(x) + p_g(x))."]
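In code, this heuristic is a one-line change to the generator step of the training sketch shown earlier (D, G, and z as defined there; still an illustrative sketch):

```python
# Saturating (minimax) generator loss: near-zero gradient when D
# confidently rejects samples, i.e. D(G(z)) ~ 0 early in learning.
g_loss = torch.log(1 - D(G(z))).mean()

# Non-saturating heuristic from the excerpt above: same fixed point of
# the (G, D) dynamics, but much stronger gradients early in learning.
g_loss = -torch.log(D(G(z))).mean()
```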
Theoretical properties

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

• Theoretical properties (assuming infinite data, infinite model capacity, direct updating of the generator's distribution):
  - Unique global optimum.
  - Optimum corresponds to the data distribution.
  - Convergence to the optimum is guaranteed.
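For reference, the shape of the argument behind these bullets, from the paper's analysis: substituting the optimal discriminator into the value function gives

```latex
\[
  C(G) \;=\; \max_D V(D, G)
       \;=\; -\log 4 \;+\; 2\,\mathrm{JSD}\!\left(p_{\text{data}} \,\middle\|\, p_{\text{model}}\right),
\]
% and since the Jensen-Shannon divergence is nonnegative and zero only
% when the two distributions coincide, the unique global optimum is
% p_model = p_data, with value -log 4.
```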
Quantitative likelihood results

• Parzen window-based log-likelihood estimates.
  - Density estimate with Gaussian kernels centered on the samples drawn from the model.

  Model            | MNIST      | TFD
  -----------------|------------|-----------
  DBN [3]          | 138 ± 2    | 1909 ± 66
  Stacked CAE [3]  | 121 ± 1.6  | 2110 ± 50
  Deep GSN [6]     | 214 ± 1.1  | 1890 ± 29
  Adversarial nets | 225 ± 2    | 2057 ± 26

Table caption: "Parzen window-based log-likelihood estimates. The reported numbers on MNIST are the mean log-likelihood of samples on the test set, with the standard error of the mean computed across examples. On TFD, we computed the standard error across folds of the dataset, with a different σ chosen using the validation set of each fold."
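A minimal sketch of such a Parzen window estimator, assuming an isotropic Gaussian kernel; the function name and the SciPy dependency are my additions, not the evaluation code used for the table.

```python
# Parzen window (Gaussian KDE) log-likelihood: place a Gaussian of
# bandwidth sigma on each model sample and score held-out test points.
# In practice sigma would be chosen on a validation set.
import numpy as np
from scipy.special import logsumexp

def parzen_log_likelihood(samples, test, sigma):
    # samples: (n, d) draws from the model; test: (m, d) held-out points
    n, d = samples.shape
    diffs = test[:, None, :] - samples[None, :, :]           # (m, n, d)
    log_k = -0.5 * np.sum(diffs**2, axis=2) / sigma**2       # Gaussian exponents
    log_norm = np.log(n) + 0.5 * d * np.log(2 * np.pi * sigma**2)
    return np.mean(logsumexp(log_k, axis=1) - log_norm)      # mean log p(test)
```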
Visualization of model samples

[Figure: samples from adversarial nets trained on MNIST, TFD, CIFAR-10 (fully connected model), and CIFAR-10 (convolutional model)]
Learned 2-D manifold of MNIST

[Figure: grid of generated digits obtained by sweeping the 2-D latent space]
Visualizing trajectories

1. Draw sample (A).
2. Draw sample (B).
3. Simulate samples along the path between A and B.
4. Repeat steps 1-3 as desired.

[Figure: samples generated along the path from point A to point B in latent space]
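A minimal sketch of step 3: linearly interpolate between two latent points and decode each intermediate point with the generator (G here refers to the torch model from the earlier training sketch; the helper name is mine).

```python
# Decode evenly spaced points on the straight line from z_a to z_b.
import torch

def interpolate_samples(G, z_a, z_b, steps=10):
    alphas = torch.linspace(0.0, 1.0, steps).unsqueeze(1)   # (steps, 1)
    z_path = (1 - alphas) * z_a + alphas * z_b              # line in z-space
    with torch.no_grad():
        return G(z_path)                                    # one sample per step
```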
Visualization of model trajectories

[Figure: interpolation trajectories for the MNIST digit dataset and the Toronto Face Dataset (TFD)]
Visualization of model trajectories

[Figure: interpolation trajectories for CIFAR-10 (convolutional model)]
Extensions

• Conditional model:
  - Learn p(x | y).
  - Discriminator is trained on (x, y) pairs.
  - Generator net gets y and z as input.
  - Useful for: translation, speech synthesis, image segmentation.
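A minimal sketch of the conditional wiring: both nets receive the conditioning variable y (here a one-hot class label) concatenated to their usual input. Sizes and architectures are illustrative assumptions.

```python
import torch
import torch.nn as nn

z_dim, y_dim, x_dim = 100, 10, 784
G = nn.Sequential(nn.Linear(z_dim + y_dim, 256), nn.ReLU(),
                  nn.Linear(256, x_dim), nn.Sigmoid())
D = nn.Sequential(nn.Linear(x_dim + y_dim, 256), nn.ReLU(),
                  nn.Linear(256, 1), nn.Sigmoid())

z = torch.randn(32, z_dim)
y = torch.eye(y_dim)[torch.randint(0, y_dim, (32,))]   # one-hot labels
x_fake = G(torch.cat([z, y], dim=1))       # generator gets y and z as input
score = D(torch.cat([x_fake, y], dim=1))   # discriminator sees (x, y) pairs
```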
Extensions
• Inference net:
- Learn a network to model p(z | x)
- Infinite training set!
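A minimal sketch of this idea: since we can draw as many (z, G(z)) pairs as we like, a net E can be regressed from x = G(z) back to z on effectively unlimited data (G as in the earlier training sketch; E and its architecture are illustrative assumptions).

```python
import torch
import torch.nn as nn

z_dim, x_dim = 100, 784
E = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, z_dim))
opt = torch.optim.Adam(E.parameters(), lr=1e-3)

def inference_step(G):
    z = torch.randn(128, z_dim)          # fresh training data every step
    with torch.no_grad():
        x = G(z)                         # generator held fixed
    loss = ((E(x) - z) ** 2).mean()      # regress x back to its code z
    opt.zero_grad(); loss.backward(); opt.step()
```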
Extensions

• Take advantage of large amounts of unlabeled data using the generator:
  - Train G on a large, unlabeled dataset.
  - Train G' to learn p(z | x) on an infinite training set.
  - Add a layer on top of G', train on a small labeled training set.
Extensions

• Take advantage of unlabeled data using the discriminator:
  - Train G and D on a large amount of unlabeled data.
  - Replace the last layer of D.
  - Continue training D on a small amount of labeled data.
Thank You.
Questions?