Semi-Supervised Learning and Ladder Networks
Tapani Raiko
Aalto University
4 November 2015

Motivation: Ladder network
[Figure: yearly progress in permutation-invariant MNIST.]
A. Rasmus, H. Valpola, M. Honkala, M. Berglund, and T. Raiko. Semi-Supervised Learning with Ladder Network. To appear in NIPS 2015.

Table of Contents
- Semisupervised learning
- Ladder network
- Exercises

Semisupervised learning
How can unlabeled data help in classification?
Example: only two data points have labels. How would you label a new point?
What if you see all the unlabeled data?
Labels are homogeneous in densely populated regions of the space.

Labeled data: {x_t, y_t}, 1 ≤ t ≤ N. Unlabeled data: {x_t}, N+1 ≤ t ≤ M.
Often labeled data is scarce and unlabeled data is plentiful: N ≪ M.
Early works (McLachlan, 1975; Titterington et al., 1985) modelled P(x|y) as clusters. Unlabeled data affects the shape and size of the clusters.
Use Bayes' theorem P(y|x) ∝ P(x|y)P(y) to classify.

How about P(y|x) directly?
Modelling P(x|y) is inefficient when the real task is P(y|x).
Idea: assign probabilistic labels q(y_t) = P(y_t|x_t) to the unlabeled inputs x_t and train P(y|x) with them.
However, this has no effect, because the gradient vanishes when q(y) = P(y|x):

  E_{q(y)}\left[ \frac{\partial}{\partial\theta} \log P(y \mid x) \right]
  = \int q(y) \, \frac{\partial P(y \mid x) / \partial\theta}{P(y \mid x)} \, dy
  = \frac{\partial}{\partial\theta} \int P(y \mid x) \, dy
  = \frac{\partial}{\partial\theta} 1 = 0.

There are ways to adjust the assigned labels q(y_t) to make them count.

Adjusting the assigned labels q(y_t) (1/2)
Label propagation (Szummer and Jaakkola, 2003)
- Nearest neighbours tend to have the same label.
- Propagate labels to their neighbours and iterate.
Pseudo-labels (Lee, 2013)
- Round the probabilistic labels q(y_t) towards 0/1 gradually during training (a minimal sketch follows below).
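To make the pseudo-label idea concrete, here is a minimal runnable sketch: a plain linear softmax classifier on toy two-class Gaussian data, trained with hard argmax pseudo-labels whose weight is ramped up linearly. The dataset, the schedule, and all sizes are illustrative assumptions, not taken from the slides. Because the rounded pseudo-labels differ from the model's own soft predictions P(y|x), their gradient does not vanish the way it does for the self-assigned q(y) = P(y|x) above.

```python
# Pseudo-label sketch (Lee, 2013): a linear softmax classifier on toy data.
# Everything here (data, sizes, learning rate, ramp-up) is illustrative.
import numpy as np

rng = np.random.default_rng(0)

def softmax(a):
    a = a - a.max(axis=1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

# Toy data: two Gaussian blobs; a few labeled points, many unlabeled ones.
n_lab, n_unlab, d, k = 10, 500, 2, 2
means = np.array([[-2.0, 0.0], [2.0, 0.0]])
y_lab = rng.integers(0, k, n_lab)
x_lab = means[y_lab] + rng.normal(size=(n_lab, d))
x_unlab = means[rng.integers(0, k, n_unlab)] + rng.normal(size=(n_unlab, d))

W = np.zeros((d, k))
lr = 0.1
for step in range(200):
    alpha = min(1.0, step / 100.0)        # weight of the pseudo-label loss, ramped up
    # Supervised cross-entropy gradient on the labeled points.
    p_lab = softmax(x_lab @ W)
    grad = x_lab.T @ (p_lab - np.eye(k)[y_lab]) / n_lab
    # Hard pseudo-labels: round the current predictions towards 0/1.
    p_unlab = softmax(x_unlab @ W)
    pseudo = np.eye(k)[p_unlab.argmax(axis=1)]
    grad += alpha * x_unlab.T @ (p_unlab - pseudo) / n_unlab
    W -= lr * grad

print("labeled-set accuracy:", (softmax(x_lab @ W).argmax(axis=1) == y_lab).mean())
```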
Adjusting the assigned labels q(y_t) (2/2)
Co-training (Blum and Mitchell, 1998)
- Assumes multiple views on x, say x = (x(1), x(2)).
- Train a separate classifier P(y | x(j)) for each view.
- For unlabeled data, the true label is the same for each view.
- Combine the individual q(j)(y_t) into a joint q(y_t) and feed it as a target to each classifier.
Part of the Ladder network (Rasmus et al., 2015)
- Corrupt the input x_t with noise to get x̃_t.
- Train P(y|x̃) with a target from the clean q(y_t) = P(y_t|x_t).

Table of Contents
- Semisupervised learning
- Ladder network
- Exercises

On details and invariance
What is the average of images in the category Cat? What is the average of Dog?
Answer: both are just blurry blobs.
An autoencoder tries to learn a representation from which it can reconstruct the observations. It cannot discard details: position, pose, lighting, ...
⇒ Not well compatible with supervised learning.

Ladder network, main ideas
- Shortcut connections in an autoencoder network allow it to discard details.
- Learning in deep networks can be made efficient by spreading unsupervised learning targets all over the network.

Combining DSS + DAE
[Diagrams: Denoising Source Separation (Särelä and Valpola, 2005), the Denoising Autoencoder (Vincent et al., 2008), and the Ladder Network (Valpola, 2015; Rasmus et al., 2015), which combines the two.]

The same encoder f(·) is used for the corrupted and clean paths.
Supervised learning: backpropagate from the output ỹ.
Unsupervised learning: several denoising autoencoders are trained simultaneously.
Unsupervised learning: produce robust representations (the DSS aspect).
Read the test output from the clean path. (Not used in training.)

Training criterion
Only one phase of training: minimize the criterion C.

  C = -\log P(\tilde{y} = y_t \mid x_t) + \sum_{l=0}^{L} \lambda_l \, \| z^{(l)} - \hat{z}^{(l)}_{BN} \|^2

Scaling issues
Issue 1: Doubling W(1) and halving W(2) decreases the relative effect of the noise.
Issue 2: Collapsing to z(1) = ẑ(1) = 0 eliminates the cost C(1).
Solution: batch normalization (Ioffe and Szegedy, 2015).

Some model details
[Diagram: one encoder layer (linear map W h, normalization, added noise, then a nonlinearity with scaling and bias) and the corresponding decoder step (linear map V z, normalization, and the combinator g(z̃, u)).]
The denoising g(z̃, u) is done componentwise: ẑ_i = g_i(z̃_i, u_i).
A toy sketch that puts the corrupted path, the clean path, and the criterion C together is given below.
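The following forward-only sketch computes the criterion C above for a tiny two-layer network on random data. It is deliberately simplified and makes assumptions not stated in the slides: batch normalization without learned scale and bias, an additive combination in place of the componentwise g(z̃, u), denoising targets taken from the post-nonlinearity activations, and no training loop or backpropagation.

```python
# Forward-only sketch of the Ladder cost C = cross-entropy + sum of weighted
# layer-wise denoising costs. Simplified: additive combinator, no learned
# batch-norm parameters, random data; illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def normalize(a):                       # batch normalization without scale/bias
    return (a - a.mean(axis=0)) / (a.std(axis=0) + 1e-6)

d, h, k, n, sigma = 20, 10, 3, 8, 0.3   # sizes and noise level (illustrative)
W1, W2 = rng.normal(0, 0.1, (d, h)), rng.normal(0, 0.1, (h, k))   # encoder
V2, V1 = rng.normal(0, 0.1, (k, h)), rng.normal(0, 0.1, (h, d))   # decoder
x = rng.normal(size=(n, d))
y = rng.integers(0, k, n)
lambdas = [1.0, 0.1, 0.1]               # denoising-cost weights per layer

# Clean encoder path: provides the denoising targets z(l).
z0 = x
z1 = relu(normalize(z0 @ W1))
z2 = softmax(normalize(z1 @ W2))

# Corrupted encoder path: noise added at every layer; used for the supervised cost.
z0_t = z0 + sigma * rng.normal(size=z0.shape)
z1_t = relu(normalize(z0_t @ W1) + sigma * rng.normal(size=(n, h)))
z2_t = softmax(normalize(z1_t @ W2) + sigma * rng.normal(size=(n, k)))

# Decoder: reconstruct each corrupted layer with help from the layer above
# (a plain additive combination here instead of the parametrized g).
z2_hat = z2_t
z1_hat = z1_t + z2_hat @ V2
z0_hat = z0_t + z1_hat @ V1

supervised = -np.log(z2_t[np.arange(n), y] + 1e-12).mean()
denoising = sum(lam * ((zh - z) ** 2).sum(axis=1).mean()
                for lam, zh, z in zip(lambdas, [z0_hat, z1_hat, z2_hat], [z0, z1, z2]))
C = supervised + denoising
print("C =", C)
```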
Functional form of the lateral connections?
Gaussian model: P(z) = N(μ, σ_p²). Gaussian noise: P(z̃|z) = N(z, σ_n²).
Optimal denoising:

  \hat{z} = \frac{\sigma_n^2}{\sigma_p^2 + \sigma_n^2} \, \mu + \frac{\sigma_p^2}{\sigma_p^2 + \sigma_n^2} \, \tilde{z}

The top-down signal u corresponds to P(z).
Additive: ẑ_add = g1(z̃) + g2(u) corresponds to modelling the mean μ with u.
Modulated: ẑ_mod = g3(z̃, u)(z̃ + bias) corresponds to modelling the variance σ_p² with u.
How to interpret the modulation by the top-down signal u: does this detail in z̃ fit in the big picture?
If yes, trust it and let it through to the reconstruction ẑ. If not, filter it away as noise.

Analysis: Unsupervised learning (Rasmus et al., 2015b)
[Diagram: a deep denoising autoencoder compared with the Ladder network, which adds lateral connections from each corrupted encoder layer to the corresponding decoder layer.]
We compare a deep denoising autoencoder and the Ladder with additive or modulated lateral connections. The data is small natural image patches.

Denoising performance
[Plot: denoising cost for modulated (Mod), additive (Add), and no lateral connections (No-lat) at corruption levels α = 0.25 and α = 4.0.]
1 million parameters; the sizes of the layers are varied.
Result: modulated connections are best. The Ladder needs fewer units on h(2).

Translation invariance
[Plot: translation-invariance measure of the h(2) units as a function of unit significance γ_i(2), for the architectures Mod 256-1622-50, Add 256-839-336, and No-lat 256-590-589.]
With modulated connections, all units become invariant.

Learned pooling functions
Each h(1) unit belongs to several pooling groups.
The h(2) units specialize to colour, orientation, location, ...

Small network for ẑ_i = g(z̃_i, u_i)
Each unit i has its own mini network with 9 parameters.
That is few parameters compared to the weight matrices.
The product u_i z̃_i is used for modulation (variance modelling).
A nonlinearity is used for multimodal distributions.

Example of a multimodal distribution
The signal z_0^(L) for digit 0 just before the softmax.

MNIST results
[Results table/figure omitted.]

Gamma (Γ) model
A simplified model: only the auxiliary cost just before the softmax (see the sketch below).
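Below is a minimal sketch of this simplified auxiliary cost in the form used in the exercise that follows: the squared difference between the noisy and clean signals just before the softmax, computed for all examples and added to the supervised cost on the labeled subset. The toy one-hidden-layer MLP, the noise level, the 0.1 weight, and the random stand-in data are illustrative assumptions; the Γ model proper also applies a denoising function g to the noisy top-layer signal.

```python
# Simplified Γ-model auxiliary cost: squared difference of the noisy and clean
# signals just before the softmax, computed for all examples; the supervised
# cross-entropy uses only the labeled subset. Illustrative, forward-only sketch.
import numpy as np

rng = np.random.default_rng(0)

def pre_softmax(x, W1, W2, noise_std=0.0):
    """Toy 1-hidden-layer MLP; returns the activations just before the softmax."""
    h = np.tanh(x @ W1 + noise_std * rng.normal(size=(x.shape[0], W1.shape[1])))
    return h @ W2 + noise_std * rng.normal(size=(x.shape[0], W2.shape[1]))

def softmax(a):
    e = np.exp(a - a.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

d, h, k, n = 784, 100, 10, 32
W1, W2 = rng.normal(0, 0.05, (d, h)), rng.normal(0, 0.05, (h, k))
x = rng.normal(size=(n, d))              # stand-in for an MNIST minibatch
y = rng.integers(0, k, n)
labeled = np.arange(n) < 8               # pretend only the first 8 examples are labeled

z_clean = pre_softmax(x, W1, W2)                   # clean path (targets)
z_noisy = pre_softmax(x, W1, W2, noise_std=0.3)    # corrupted path

# Supervised cost from the corrupted path, labeled examples only.
p = softmax(z_noisy[labeled])
supervised = -np.log(p[np.arange(labeled.sum()), y[labeled]] + 1e-12).mean()

# Γ auxiliary cost on all examples, labeled and unlabeled.
aux = ((z_noisy - z_clean) ** 2).sum(axis=1).mean()

C = supervised + 0.1 * aux               # 0.1 is an illustrative weight
print("C =", C)
```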
Table of Contents
- Semisupervised learning
- Ladder network
- Exercises

Exercises
- Read Section 14.3 (Semi-supervised learning) and Section 5 (Related work) of the Ladder paper: http://arxiv.org/abs/1507.02672
- Assume you have only 1000 labels in MNIST (100 labeled examples of each class) and train a traditional classifier with noise regularization. Then implement the squared difference of the noisy and clean signals just before the softmax as an auxiliary criterion for all 60000 examples (= the simplified Γ-ladder). Can you improve the test performance by using the combined criterion?
- If interested, you can read the rest of the Ladder paper and the code at github.com/arasmus/ladder

References
- A. Rasmus, H. Valpola, M. Honkala, M. Berglund, and T. Raiko. Semi-Supervised Learning with Ladder Network. To appear in NIPS 2015.
- M. Seeger. A Taxonomy for Semi-Supervised Learning Methods. Chapter 2 in Semi-Supervised Learning, MIT Press, 2006.
- McLachlan, G. (1975). Iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis. Journal of the American Statistical Association, 70, 365–369.
- Titterington, D., Smith, A., and Makov, U. (1985). Statistical Analysis of Finite Mixture Distributions. Wiley Series in Probability and Mathematical Statistics.
- Szummer, M. and Jaakkola, T. (2003). Partially labeled classification with Markov random walks. Advances in Neural Information Processing Systems 15 (NIPS 2002), pages 945–952.
- Lee, D.-H. (2013). Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML 2013.
- Blum, A. and Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory (COLT 98), pages 92–100.
- Särelä, J. and Valpola, H. (2005). Denoising source separation. JMLR, 6, 233–272.
- Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11, 3371–3408.
- H. Valpola. From neural PCA to deep unsupervised learning. arXiv preprint arXiv:1411.7783, 2014.
- A. Rasmus, T. Raiko, and H. Valpola. Denoising autoencoder with modulated lateral connections learns invariant representations of natural images. In the workshop track of the International Conference on Learning Representations (ICLR 2015), San Diego, California, May 2015.
- Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.