Semi-Supervised Learning and Ladder Networks
Tapani Raiko
Aalto University
4 November 2015
Motivation: Ladder network
Yearly progress in permutation-invariant MNIST.
A. Rasmus, H. Valpola, M. Honkala, M. Berglund, and T. Raiko.
Semi-Supervised Learning with Ladder Network. To appear in NIPS 2015.
Table of Contents
Semisupervised learning
Ladder network
Exercises
Semisupervised learning
How can unlabeled data help in classification?
Example: Only two data points with labels.
Semisupervised learning
How would you label this point?
Semisupervised learning
What if you see all the unlabeled data?
Semisupervised learning
Labels are homogeneous in densely populated space.
Semisupervised learning
Labeled data: {(xt, yt)}, 1 ≤ t ≤ N.
Unlabeled data: {xt}, N+1 ≤ t ≤ M.
Often labeled data is scarce, unlabeled data is plentiful: N ≪ M.
Early works (McLachlan, 1975; Titterington et al., 1985) modelled P(x|y) as clusters.
Unlabeled data affects the shape and size of the clusters.
Use Bayes' theorem P(y|x) ∝ P(x|y)P(y) to classify.
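To make the cluster-based approach concrete, here is a minimal sketch (Python/NumPy) that fits one spherical Gaussian per class with EM, lets unlabeled points contribute soft responsibilities, and classifies with Bayes' theorem. It only illustrates the idea, not the exact procedure of McLachlan (1975); the function name, the spherical-covariance assumption, and the fixed variance are mine.

    import numpy as np

    def semisup_gaussian_em(X_lab, y_lab, X_unl, n_classes, n_iter=50, var=1.0):
        """Semi-supervised classification with one spherical Gaussian per class.

        Labeled points keep fixed (one-hot) responsibilities; unlabeled points get
        soft responsibilities re-estimated by EM. Returns class means, priors, and
        a predict function based on P(y|x) ∝ P(x|y) P(y).
        """
        X = np.vstack([X_lab, X_unl])
        R = np.zeros((len(X), n_classes))              # responsibilities
        R[np.arange(len(X_lab)), y_lab] = 1.0          # labels are fixed
        R[len(X_lab):] = 1.0 / n_classes               # uniform initial guess

        for _ in range(n_iter):
            # M-step: class priors and means from current responsibilities
            Nk = R.sum(axis=0)
            prior = Nk / Nk.sum()
            mu = (R.T @ X) / Nk[:, None]
            # E-step: update responsibilities of the unlabeled points only
            d2 = ((X[len(X_lab):, None, :] - mu[None, :, :]) ** 2).sum(-1)
            logp = np.log(prior)[None, :] - d2 / (2 * var)
            logp -= logp.max(axis=1, keepdims=True)
            p = np.exp(logp)
            R[len(X_lab):] = p / p.sum(axis=1, keepdims=True)

        def predict(Xq):
            d2 = ((Xq[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
            return np.argmax(np.log(prior)[None, :] - d2 / (2 * var), axis=1)

        return mu, prior, predict

Because the unlabeled points reshape the class means and priors, the decision boundary moves into regions of low density even with only a couple of labels.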
How about P(y |x) directly?
Modelling P(x|y) is inefficient when the real task is P(y|x).
Idea: Assign probabilistic labels q(yt) = P(yt|xt) to the unlabeled inputs xt, and train P(y|x) with them.
However, there is no effect as the gradient vanishes:
E_{q(y)}[ ∂/∂θ log P(y|x) ] = ∫ q(y) · (∂/∂θ P(y|x)) / P(y|x) dy
                            = ∫ ∂/∂θ P(y|x) dy          (since q(y) = P(y|x))
                            = ∂/∂θ ∫ P(y|x) dy = ∂/∂θ 1 = 0.
There are ways to adjust the assigned labels q(yt) to make them count.
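A quick numerical check of the vanishing-gradient argument for a two-class logistic model P(y=1|x) = sigmoid(w·x): the gradient of log P(y|x) with respect to w is (y − sigmoid(w·x))·x, and averaging it over y ~ P(y|x) gives exactly zero. The toy setup below is mine and only confirms the identity.

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=5)          # current model parameters
    x = rng.normal(size=5)          # an unlabeled input

    p1 = 1.0 / (1.0 + np.exp(-w @ x))      # P(y=1 | x)

    # Gradient of log P(y|x) w.r.t. w for each possible label y
    grad_y1 = (1 - p1) * x
    grad_y0 = (0 - p1) * x

    # Expected gradient under q(y) = P(y|x)
    expected_grad = p1 * grad_y1 + (1 - p1) * grad_y0
    print(np.allclose(expected_grad, 0.0))   # True: the assigned labels have no effect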
Adjusting assigned labels q(yt) (1/2)
Label propagation (Szummer and Jaakkola, 2003)
- Nearest neighbours tend to have the same label.
- Propagate labels to their neighbours and iterate.
Pseudo-labels (Lee, 2013)
- Round probabilistic labels q(yt) towards 0/1 gradually during training.
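A minimal sketch of the pseudo-label idea (Lee, 2013): take the classifier's own predicted probabilities on unlabeled data, sharpen them towards 0/1 as training proceeds, and weight the resulting loss term by a coefficient that ramps up over epochs. The sharpening-by-temperature and the linear ramp below are illustrative choices, not Lee's exact recipe.

    import numpy as np

    def pseudo_label_targets(probs, epoch, hard_from=20, temperature=0.5):
        """Turn predicted class probabilities on unlabeled data into training targets.

        Early in training the soft predictions are kept; later they are sharpened
        (here by a temperature, eventually by argmax) so targets approach 0/1.
        """
        if epoch >= hard_from:
            # Hard pseudo-labels: one-hot of the most likely class
            hard = np.zeros_like(probs)
            hard[np.arange(len(probs)), probs.argmax(axis=1)] = 1.0
            return hard
        # Soft sharpening: raise to a power and renormalize
        sharp = probs ** (1.0 / temperature)
        return sharp / sharp.sum(axis=1, keepdims=True)

    def pseudo_label_weight(epoch, ramp_epochs=30, max_weight=1.0):
        """Ramp the unlabeled-loss weight up from 0 so wrong early guesses matter less."""
        return max_weight * min(1.0, epoch / ramp_epochs)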
Adjusting assigned labels q(yt) (2/2)
Co-training (Blum and Mitchell, 1998)
- Assumes multiple views on x, say x = (x(1), x(2)).
- Train a separate classifier P(y | x(j)) for each view.
- For unlabeled data, the true label is the same for each view.
- Combine the individual q(j)(yt) into a joint q(yt) and feed it as a target to each classifier (one simple combination rule is sketched below).
Part of the Ladder network (Rasmus et al., 2015)
- Corrupt the input xt with noise to get x̃t.
- Train P(y|x̃) with a target from the clean q(yt) = P(yt|xt).
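One simple way to realize the co-training step "combine the individual q(j)(yt) into a joint q(yt)" is to multiply the per-view predictions and renormalize (a product-of-experts style rule). Blum and Mitchell's original algorithm instead adds confidently self-labeled examples to the other view's training set, so this is only one illustrative choice.

    import numpy as np

    def combine_views(q_views):
        """Combine per-view class probabilities q^(j)(y) into a joint target q(y).

        q_views: list of arrays, each of shape (n_unlabeled, n_classes).
        Multiplying and renormalizing trusts a class only if every view supports it.
        """
        joint = np.ones_like(q_views[0])
        for q in q_views:
            joint *= q
        joint /= joint.sum(axis=1, keepdims=True)
        return joint

    # The joint q(y) is then fed back as the (soft) target for each view's classifier.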
Table of Contents
Semisupervised learning
Ladder network
Exercises
On details and invariance
What is the average of images in the category Cat?
What is the average of Dog?
On details and invariance
Answer: both are just blurry blobs.
An autoencoder tries to learn a representation from which it can reconstruct the observations.
It cannot discard details: position, pose, lighting, ...
⇒ It does not combine well with supervised learning.
Ladder network, main ideas
- Shortcut connections in an autoencoder network allow it to discard details.
- Learning in deep networks can be made efficient by spreading unsupervised learning targets all over the network.
Combining DSS+DAE
(Figure: three architectures. Denoising Source Separation (Särelä and Valpola, 2005) denoises a representation h with a cost C on h. The Denoising Autoencoder (Vincent et al., 2008) maps a corrupted x through an encoder f and a decoder g with a cost C on the reconstruction of x. The Ladder Network (Valpola, 2015; Rasmus et al., 2015) combines the two: encoder f, decoders g0 and g1, and denoising costs C0 and C1 at both the input and the hidden layer.)
Same encoder f(·) used for corrupted and clean paths.
Supervised learning: Backprop from output ỹ.
Unsupervised learning:
Several denoising autoencoders simultaneously.
Unsupervised learning:
Produce robust representations (DSS aspect).
Read test output from the clean path. (Not used in training.)
Training criterion
Only one phase of training: Minimize criterion C.
C = −log P(ỹ = yt | xt) + Σ_{l=0}^{L} λl ‖z(l) − ẑBN(l)‖²
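A sketch of the combined criterion in NumPy, assuming the forward passes have already produced the clean-path activations z(l), the batch-normalized reconstructions ẑBN(l), and the corrupted-path class probabilities; the helper name and the mean-instead-of-sum normalization of the squared error are my choices.

    import numpy as np

    def ladder_cost(p_corrupted, labels, z_clean, z_hat_bn, lambdas):
        """Combined Ladder cost for one labeled minibatch.

        p_corrupted : (batch, classes) class probabilities from the corrupted path
        labels      : (batch,) integer labels y_t
        z_clean     : list of clean-path activations z(l), one array per layer l
        z_hat_bn    : list of batch-normalized reconstructions ẑ_BN(l)
        lambdas     : list of layer weights λ_l
        """
        # Supervised term: −log P(ỹ = y_t | x_t), averaged over the minibatch
        ce = -np.mean(np.log(p_corrupted[np.arange(len(labels)), labels] + 1e-12))
        # Unsupervised term: squared reconstruction error at every layer
        denoise = sum(lam * np.mean((z - z_hat) ** 2)
                      for lam, z, z_hat in zip(lambdas, z_clean, z_hat_bn))
        return ce + denoise

For unlabeled minibatches, only the denoising sum is used.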
Scaling issues
Issue 1: Doubling W(1) and halving W(2) decreases noise.
Issue 2: Collapsing z(1) = ẑ(1) = 0 eliminates the cost C(1).
Solution: Batch normalization (Ioffe and Szegedy, 2015)
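For reference, a minimal sketch of the normalization step that removes both scaling freedoms: each unit is standardized to zero mean and unit variance over the minibatch. The learned scale and shift of full batch normalization (Ioffe and Szegedy, 2015) are omitted here.

    import numpy as np

    def batch_normalize(z, eps=1e-6):
        """Standardize each unit (column) over the minibatch (rows)."""
        mean = z.mean(axis=0, keepdims=True)
        std = z.std(axis=0, keepdims=True)
        return (z - mean) / (std + eps)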
Some model details
(Figure: one layer of the Ladder. In the encoder, h is mapped by Wh and normalized to give z; noise is added on the corrupted path; scaling, a bias, and a nonlinearity give the next h. In the decoder, the signal from the layer above is mapped by V, normalized, and gives u, which is combined with the corrupted z̃ by the denoising function g(z, u).)
g(z̃, u) is applied componentwise: gi(z̃i, ui).
Functional form of lateral connections?
Gaussian model: P(z) = N(µ, σp²)
Gaussian noise: P(z̃|z) = N(z, σn²)
Optimal denoising: ẑ = σn²/(σp² + σn²) · µ + σp²/(σp² + σn²) · z̃
Top-down signal u corresponds to P(z).
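A small numerical sanity check of the optimal denoising formula: draw z from N(µ, σp²), corrupt it with N(0, σn²) noise, and compare the squared error of the posterior-mean estimate against using z̃ directly. The simulation parameters are arbitrary choices of mine.

    import numpy as np

    rng = np.random.default_rng(0)
    mu, var_p, var_n = 2.0, 1.0, 4.0          # prior mean/variance, noise variance

    z = mu + np.sqrt(var_p) * rng.standard_normal(100_000)
    z_tilde = z + np.sqrt(var_n) * rng.standard_normal(100_000)

    # Posterior-mean denoising: weighted combination of prior mean and observation
    z_hat = var_n / (var_p + var_n) * mu + var_p / (var_p + var_n) * z_tilde

    print("MSE using z_tilde:", np.mean((z_tilde - z) ** 2))   # ≈ var_n = 4.0
    print("MSE using z_hat:  ", np.mean((z_hat - z) ** 2))     # ≈ var_p*var_n/(var_p+var_n) = 0.8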
Functional form of lateral connections?
Recall ẑ = σn²/(σp² + σn²) · µ + σp²/(σp² + σn²) · z̃ from the previous slide.
Additive: ẑadd = g1(z̃) + g2(u) corresponds to modelling the mean µ with u.
Modulated: ẑmod = g3(z̃, u)(z̃ + bias) corresponds to modelling the variance σp² with u.
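A sketch contrasting the two forms, with g1, g2, and g3 taken as simple affine and sigmoid functions of their inputs; the slide does not specify the exact parameterizations, so these are illustrative placeholders.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def denoise_additive(z_tilde, u, w):
        """ẑ_add = g1(z̃) + g2(u): the top-down signal u shifts the estimate (models the mean µ)."""
        return w["a1"] * z_tilde + w["a2"] * u + w["a3"]

    def denoise_modulated(z_tilde, u, w):
        """ẑ_mod = g3(z̃, u) (z̃ + bias): u gates how much of z̃ passes through (models σp²)."""
        gate = sigmoid(w["m1"] * u + w["m2"])
        return gate * (z_tilde + w["bias"])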
Functional form of lateral connections?
How to interpret the top-down signal u as a modulator:
Does this detail in z̃ fit the big picture?
If yes, trust it and let it through to the reconstruction ẑ.
If not, filter it away as noise.
Analysis: Unsupervised learning (Rasmus et al., 2015b)
(Figure: a deep denoising autoencoder and a Ladder network, both with encoder layers f(1), f(2) and decoder layers g(0), g(1), g(2) mapping the corrupted x̃ to a reconstruction x̂; the Ladder adds lateral connections from each encoder layer to the corresponding decoder layer.)
We compare a deep denoising autoencoder and the Ladder with additive or modulated lateral connections.
The data consists of small natural image patches.
Analysis: Unsupervised learning
Denoising performance
(Figure: denoising cost as a function of the layer-size parameter α for the modulated (Mod), additive (Add), and no-lateral (No-lat) variants.)
All models have 1 million parameters; the sizes of the layers are varied.
Result: modulated connections are best.
The Ladder needs fewer units in h(2).
Analysis: Unsupervised learning
Translation invariance measure γi(2) of the units h(2) as a function of unit significance.
(Figure: curves for layer sizes Mod 256-1622-50, Add 256-839-336, and No-lat 256-590-589.)
With modulated connections, all units become invariant.
Analysis: Unsupervised learning
Learned pooling functions
Each h(1) unit belongs to several pooling groups.
Units h(2) specialize to colour, orientation, location, ...
Small network for ẑi = g(z̃i, ui)
Each unit i has its own mini network with 9 parameters.
Few parameters compared to weight matrices.
Product ui z̃i for modulating (variance modelling).
Nonlinearity for multimodal distributions.
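One plausible form of such a per-unit mini network, combining the ingredients named on the slide (a product ui·z̃i for modulation and a sigmoid nonlinearity) into 9 parameters per unit. The exact parameterization used in the paper may differ, so treat this as an illustration of the idea rather than the actual function.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def g_unit(z_tilde, u, w):
        """Per-unit denoising ẑ_i = g(z̃_i, u_i) with 9 parameters w[0..8] (one set per unit).

        The linear part covers additive combination and the modulating product u*z̃;
        the sigmoid term lets the function express multimodal shapes.
        """
        linear = w[0] + w[1] * u + w[2] * z_tilde + w[3] * u * z_tilde
        gate = sigmoid(w[5] + w[6] * u + w[7] * z_tilde + w[8] * u * z_tilde)
        return linear + w[4] * gate

    # Example: each of the layer's n units has its own 9 parameters.
    n = 4
    rng = np.random.default_rng(0)
    W = rng.normal(size=(n, 9))
    z_tilde = rng.normal(size=n)
    u = rng.normal(size=n)
    z_hat = np.array([g_unit(z_tilde[i], u[i], W[i]) for i in range(n)])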
Example of a multimodal distribution
Signal z0(L) for the digit 0 just before the softmax.
MNIST results
Gamma (Γ) model
Simplified model: Only auxiliary cost just before softmax.
Table of Contents
Semisupervised learning
Ladder network
Exercises
Exercises
- Read Section 14.3 (Semi-supervised learning) and Section 5 (Related work) of the Ladder paper: http://arxiv.org/abs/1507.02672
- Assume you have only 1000 labels in MNIST (100 labeled examples of each class), and train a traditional classifier with noise regularization. Then implement the squared difference of the noisy and clean signals just before the softmax as an auxiliary criterion for all 60000 examples (= the simplified Γ-ladder; a starting-point sketch follows below). Can you improve the test performance by using the combined criterion?
- If interested, you can read the rest of the Ladder paper and the code at github.com/arasmus/ladder
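A starting-point sketch for the Γ-ladder exercise, written with PyTorch (an assumption; any framework with automatic differentiation will do) and assuming the MNIST images are already available as flat 784-dimensional tensors. The supervised cross-entropy uses only the 1000 labeled examples, while the squared difference between the noisy and clean pre-softmax signals is computed for every example; treating the clean signal as a fixed (detached) target is one common design choice, not the only one.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NoisyMLP(nn.Module):
        """Simple MLP classifier; returns pre-softmax logits, optionally with input/hidden noise."""
        def __init__(self, noise_std=0.3):
            super().__init__()
            self.fc1 = nn.Linear(784, 500)
            self.fc2 = nn.Linear(500, 10)
            self.noise_std = noise_std

        def forward(self, x, noisy=False):
            if noisy:
                x = x + self.noise_std * torch.randn_like(x)
            h = torch.relu(self.fc1(x))
            if noisy:
                h = h + self.noise_std * torch.randn_like(h)
            return self.fc2(h)          # pre-softmax signal z(L)

    model = NoisyMLP()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    lam = 1.0                           # weight of the auxiliary (denoising) criterion

    def training_step(x_lab, y_lab, x_all):
        """x_lab, y_lab: labeled minibatch (from the 1000 labeled examples);
        x_all: minibatch drawn from all 60000 examples (labels not needed)."""
        opt.zero_grad()
        # Supervised criterion: classify the labeled examples through the noisy path
        sup = F.cross_entropy(model(x_lab, noisy=True), y_lab)
        # Auxiliary criterion: squared difference of noisy and clean pre-softmax signals
        clean = model(x_all, noisy=False).detach()   # clean signal as a fixed target
        noisy = model(x_all, noisy=True)
        aux = ((noisy - clean) ** 2).mean()
        loss = sup + lam * aux
        loss.backward()
        opt.step()
        return loss.item()

Comparing test error with lam = 0 against lam > 0 answers the exercise question.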
References
- A. Rasmus, H. Valpola, M. Honkala, M. Berglund, and T. Raiko. Semi-Supervised Learning with Ladder Network. To appear in NIPS 2015.
- M. Seeger. A Taxonomy for Semi-Supervised Learning Methods. Chapter 2 in Semi-Supervised Learning, MIT Press, 2006.
- McLachlan, G. (1975). Iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis. J. American Statistical Association, 70, 365–369.
- Titterington, D., Smith, A., and Makov, U. (1985). Statistical analysis of finite mixture distributions. Wiley Series in Probability and Mathematical Statistics.
- Szummer, M. and Jaakkola, T. (2003). Partially labeled classification with Markov random walks. Advances in Neural Information Processing Systems 15 (NIPS 2002), 945–952.
- Lee, D.-H. (2013). Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML 2013.
- Blum, A. and Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proc. of the eleventh annual conference on Computational learning theory (COLT 98), pages 92–100.
- Särelä, J. and Valpola, H. (2005). Denoising source separation. JMLR, 6, 233–272.
- H. Valpola. From neural PCA to deep unsupervised learning. arXiv preprint arXiv:1411.7783 (2014).
- A. Rasmus, T. Raiko, and H. Valpola. Denoising autoencoder with modulated lateral connections learns invariant representations of natural images. In the workshop track of the International Conference on Learning Representations (ICLR 2015), San Diego, California, May 2015.
- Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11, 3371–3408.
- Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.