Uploaded by Ankit Sharma

(2) What I learned from Deep Learning Summer School 2016 LinkedIn

Hamid Palangi
What I learned from Deep Learning
Summer School 2016
Machine Learning / Deep Learning at Microsoft Research
Published on August 20, 2016
Two weeks ago I attended the deep learning summer school in Montreal, organized by
Yoshua Bengio and Aaron Courville. Below is a summary of what I learned. It starts from
basic concepts and continues with more advanced topics.
1. Essence of regularization
Two popular regularizers in machine learning / deep learning are L2 (keeps the
L2 norm of the weights bounded and yields a non-sparse set of weights, i.e., the weights of
irrelevant features are small but NOT zero) and L1 (yields a sparse set of weights, but is
computationally more expensive than L2). Both help to adjust the hypothesis complexity,
e.g., if the hypothesis has high variance (overfitting), they can help to alleviate the problem.
From a Bayesian point of view, L2 regularization is equivalent to a circular (isotropic)
Gaussian prior on the weights, and L1 regularization is equivalent to a double exponential
(Laplace) prior. Note that regularization is applied only to the weights, NOT the biases.
Other popular regularization techniques that improve generalization are dropout
[Srivastava et al, JMLR 2014] and unsupervised training to initialize supervised training,
e.g., using RBMs to initialize an autoencoder's weights as explained in [Hinton &
Salakhutdinov, Science 2006]. In practice, a large model with regularization (e.g., injecting
noise) usually works better than a small fully parametric model without regularization.
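The sparsity difference between L1 and L2 described above can be seen on a single weight. The sketch below is only an illustration under stated assumptions: the function names, the learning rate and the penalty strength are hypothetical, and the L1 update uses a proximal (soft-thresholding) step, which is one common way to optimize an L1 penalty and is not taken from the talks.

```python
def l2_step(w, grad, lr, lam):
    """One gradient step on loss + (lam / 2) * w**2: the weight shrinks multiplicatively."""
    return w - lr * (grad + lam * w)


def l1_step(w, grad, lr, lam):
    """One proximal step on loss + lam * |w|: gradient step, then soft-thresholding."""
    z = w - lr * grad
    t = lr * lam
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


# A weight on an irrelevant feature: the data gradient is zero, only the penalty acts.
w_l2 = w_l1 = 0.12
for _ in range(20):
    w_l2 = l2_step(w_l2, grad=0.0, lr=0.1, lam=0.5)
    w_l1 = l1_step(w_l1, grad=0.0, lr=0.1, lam=0.5)

print(w_l2)  # small but non-zero
print(w_l1)  # exactly 0.0
```

With L2 the weight decays geometrically and never reaches zero, while the L1 step clips it to exactly zero once it falls below the threshold.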
2. Why do we need more than one neuron?
A single neuron can only solve a linearly separable problem, e.g., the AND operation. It
cannot solve a problem that is not linearly separable, e.g., the XOR operation. Nevertheless,
if we use a better representation of the input data, it can solve the XOR operation. For
example, with a non-linear transformation of the inputs x1 and x2, XOR(x1, x2) can be
computed by a single neuron acting on y1 = AND(NOT(x1), x2) and y2 = AND(x1, NOT(x2)),
because XOR(x1, x2) = OR(y1, y2) and OR is linearly separable.
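The XOR construction above can be checked in a few lines. This is a minimal sketch: the threshold neuron, its weights (1, 1) and bias -0.5 are illustrative choices, not values from the talk.

```python
def neuron(inputs, weights, bias):
    """A single threshold neuron: outputs 1 when the weighted sum exceeds 0."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if s > 0 else 0


def xor_via_features(x1, x2):
    # Non-linear re-representation of the input, as described in the text.
    y1 = int((not x1) and x2)   # AND(NOT(x1), x2)
    y2 = int(x1 and (not x2))   # AND(x1, NOT(x2))
    # In (y1, y2) space XOR is just OR, which a single neuron can separate.
    return neuron((y1, y2), weights=(1.0, 1.0), bias=-0.5)


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_via_features(x1, x2))
```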
3. What non-linearity to choose for neurons?
The rule of thumb for selecting a non-linearity is to always start with ReLU (Rectified
Linear Unit). It makes backpropagation computationally cheaper and usually results in
sparse activations. The non-differentiable point at 0 in ReLU is not a problem
(sub-gradients address it). Question: Is it a good idea to use different non-linearities in
different layers? No success yet, except when we want to put some structure in the
output, e.g., the attention mechanism.
4. Practical tips to train a neural network
Initialization: To break symmetry we use random initialization, for example see [Glorot
& Bengio, 2010].
Hyper-parameter selection: (a): Grid search, i.e., trying all possible configurations
of hyper-parameters. This is computationally expensive. (b): Random search
[Bergstra & Bengio, 2012], i.e., specifying a distribution over the values of each
hyper-parameter and then sampling from each of them independently. (c): Bayesian
optimization [Snoek et al, NIPS 2012], which requires fewer guesses.
Early stopping: Since it has zero cost, it is better to always do it.
Validation set choice: This can become very important. The validation set size should be
large enough so that the model does not overfit on the validation set. This type of
overfitting also depends on how many validation tests we run on the validation set.
Normalization: For real valued data, normalization speeds up the training.
Learning rate: Start with a large learning rate and then decay it, or use methods
with adaptive learning rates like Adagrad, RMSprop or Adam.
Gradient check: Very helpful for debugging the implementation of backprop. We simply
compare the gradient with a finite difference approximation of it. Question: Can the
finite difference approximation of the gradient replace backprop? No, because it is less
numerically stable.
Always make sure the model overfits on a small dataset.
What to do if training is hard?: First, make sure the backpropagation implementation is
not buggy and the learning rate is not too large. Then, if the model is underfitting, use
better optimization methods, larger models, etc. If it is overfitting, use better
regularization, e.g., unsupervised initialization, dropout, etc.
Batch Normalization [Ioffe & Szegedy, JMLR 2015]: A very helpful technique, which
shows that normalization at higher layers further improves the performance. It can be
described in 4 steps: (a): Normalize each hidden layer before applying the non-linearity.
(b): During training, the mean and standard deviation are computed for each
minibatch. (c): During backpropagation, we should take into account the normalization
done in the forward pass, i.e., gradients must be propagated through the normalization and
through the scale and shift operation that follows it. The scale and shift parameters
should also be learned, because the derivatives with respect to the hidden layers depend
on them. (d): At test time, the global mean and standard deviation are used, NOT the ones
calculated for each minibatch.
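The forward pass of the batch normalization steps above can be sketched for a single hidden unit as follows. This is an assumption-laden illustration: the function names, `eps`, and the moving-average tracker for the global statistics are hypothetical choices, and the backward pass is deliberately left out.

```python
import math


def batch_norm_train(batch, gamma, beta, eps=1e-5):
    """Normalize one unit's pre-activations over a minibatch, then scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    out = [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]
    return out, mean, var


def batch_norm_test(x, gamma, beta, global_mean, global_var, eps=1e-5):
    """At test time use global statistics, not those of the current minibatch."""
    return gamma * (x - global_mean) / math.sqrt(global_var + eps) + beta


def update_running(running, new, momentum=0.9):
    """Exponential moving average: one common way to track global statistics."""
    return momentum * running + (1.0 - momentum) * new


pre_activations = [1.0, 2.0, 3.0, 4.0]
out, mean, var = batch_norm_train(pre_activations, gamma=1.0, beta=0.0)
print(out, mean, var)
```

The non-linearity would then be applied to `out`, and `gamma` and `beta` would be updated by gradient descent together with the other parameters.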
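The gradient check mentioned in the practical tips above can be sketched as below. The toy loss `f` and its hand-derived gradient are hypothetical stand-ins for a real model and its backprop implementation; the finite-difference routine uses central differences.

```python
def f(w):
    """Toy scalar loss: f(w) = sum_i w_i^2 + w_0 * w_1."""
    return sum(x * x for x in w) + w[0] * w[1]


def analytic_grad(w):
    """Hand-derived gradient of f, standing in for a backprop implementation."""
    g = [2.0 * x for x in w]
    g[0] += w[1]
    g[1] += w[0]
    return g


def numeric_grad(func, w, h=1e-5):
    """Central finite differences: two function evaluations per parameter."""
    g = []
    for i in range(len(w)):
        plus = list(w)
        minus = list(w)
        plus[i] += h
        minus[i] -= h
        g.append((func(plus) - func(minus)) / (2.0 * h))
    return g


w = [0.3, -1.2, 2.0]
max_err = max(abs(a - n) for a, n in zip(analytic_grad(w), numeric_grad(f, w)))
print(max_err)  # should be tiny if the analytic gradient is correct
```

Besides the numerical stability issue noted above, the sketch also makes visible that finite differences need two loss evaluations per parameter, which is far more expensive than one backward pass for a large model.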
5. How important is depth?
Nicely explained by Rob Fergus. We can investigate the importance of depth by inspecting
different parts of Krizhevsky's Convolutional Neural Network (CNN) which has 8 layers
and is trained on ImageNet. The architecture of Krizhevsky's CNN [Krizhevsky et al, NIPS
2012] along with the results of applying SVM on different layers are shown below [picture
from Rob Fergus presentation]:
Another important observation is that if we remove layers 3, 4 (convolutional layers) and 6,
7 (fully connected layers), the performance drops 33.5%.
It is important to note that simply adding many more layers does not always improve the
performance. For example, results of simply using 20 layers and 56 layers on CIFAR-10 are
shown below [picture from He et al, CVPR 2016]:
A similar phenomenon has been observed on ImageNet, which means that learning better
models is not always equivalent to adding more layers. Note that the above problem is NOT
caused by overfitting, as is obvious from the training error curves above. One reason might
be that with deeper networks the error signal during backpropagation is not significant
enough when it arrives at the lower layers. To resolve this problem, the residual network is
proposed in [He et al, CVPR 2016], which simply adds skip connections to the CNN
architecture. One example is shown below [picture from He et al, CVPR 2016]:
Note that the skip connection is applied before the non-linear activation function.
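The skip connection described above can be sketched as y = relu(F(x) + x). The example below is a simplification under stated assumptions: it uses small dense layers in place of the convolutional layers of He et al, and all function and parameter names are hypothetical.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(v, weights, bias):
    """Dense layer; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def residual_block(x, w1, b1, w2, b2):
    """y = relu(F(x) + x), with F = linear -> relu -> linear."""
    f = linear(relu(linear(x, w1, b1)), w2, b2)
    # The identity shortcut is added before the final non-linearity.
    return relu([fi + xi for fi, xi in zip(f, x)])


x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# With F(x) = 0 the block simply passes its (non-negative) input through.
print(residual_block(x, zeros, [0.0, 0.0], zeros, [0.0, 0.0]))
```

Because the shortcut carries x unchanged, the layers inside the block only have to learn a correction to the identity, and the error signal has a direct path back to lower layers.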
6. Which one is more important, designing a better feature extractor below, or,
designing a better classifier on the top?
Using a powerful feature extractor (e.g., a CNN or deep residual network for vision tasks) is
far more important than designing the classifier on the top.
7. Evolution of image databases to big data
Below is a summary of image databases from 1970 till now [picture from Antonio Torralba's
presentation]:
8. Convolutional Generative Adversarial Networks
Assume that we want to find a generative model that can generate data similar to the
samples that we have in our dataset. For example, we want to build a generative model that
can generate images similar to those in MNIST or CIFAR dataset. Generally, this is a very
difficult task because of many intractable probabilistic computations involved in maximum
likelihood or other related methods for this task. One elegant idea for this task is Generative
Adversarial Networks (GANs) proposed by [Goodfellow et al, NIPS 2014]. In GANs, two
models are simultaneously trained, a generative model (G) and a discriminative model (D).
G generates an image, and D is a binary classifier that classifies the given image to be a
sample from dataset (true data), or a sample generated by G (artificially generated data). G is
trained to maximize the probability that D makes a mistake (min-max two player game). As
a result, after training, G estimates the distribution of the data. Some sample images
generated by G for MNIST and CIFAR-10 from [Goodfellow et al, NIPS 2014] are
represented below (picture from [Goodfellow et al, NIPS 2014]):
In [Radford et al, ICLR 2016] a form of Convolutional Network is proposed which is more
stable with adversarial training than other methods. Other related references for GANs
are "Adversarial examples in the physical world
(http://arxiv.org/abs/1607.02533)", "Improved techniques for training GANs
(http://arxiv.org/abs/1606.03498)", "Virtual adversarial training for semi-supervised text
classification (http://arxiv.org/abs/1605.07725)". They have even been used to generate new
Pokemon GO species! (https://www.youtube.com/watch?v=rs3aI7bACGc).
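For reference, the min-max two player game between G and D described in this section is written in [Goodfellow et al, NIPS 2014] as the following value function, where p_data is the data distribution and p_z is the noise prior fed to G:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

D is trained to assign high probability to true data and low probability to generated samples, while G is trained to make D(G(z)) large.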
9. Which deep learning toolkit to use?
There is no silver bullet! It depends on the target task and application. Below is a
comparison from Alex Wiltschko presentation:
There is also a great comparison among Caffe, CNTK, TensorFlow, Theano and Torch with
much more details in this post by Kenneth Tran.
10. What are the new advances in recurrent neural networks research?
Recurrent neural networks (mainly LSTMs and GRUs) have been very successful
recently. They are mainly used for converting sequence to vector (e.g., Sentence Embedding
[Palangi et al, 2015]), sequence to sequence (e.g., Machine Translation [Sutskever et al,
2014], [Bahdanau et al, 2014]) and vector to sequence (e.g., Image Captioning [Vinyals et
al, 2014]). Vanilla RNNs have not been as successful at capturing long term dependencies
due to vanishing/exploding gradient problems. Nevertheless, in the limit of infinite training
time (which is not practical), a vanilla RNN will eventually learn long term dependencies.
Below is a list of recent works related to RNNs which got my attention during Yoshua
Bengio's presentation about RNNs:
(a): Assume that we want to train a neural language model using LSTM. The basic task is to
predict the next word given previous words for which we minimize the perplexity as cost
function. During training, we give all "true" previous words to the model and use them to
predict the next word. But during inference, we give all "predicted" previous words to the
model and use them to predict the next word. To resolve this incompatibility between
training and inference, a method is proposed in [Bengio et al, 2015] where during training, a
weak supervision from previously generated words by the model is also used. This results in
significant performance improvement.
(b): Multiplicative integration with RNNs is proposed in [Wu et al, 2016]. The main idea is
to replace the summation with a Hadamard product in RNNs. This simple modification
results in the significant performance improvement presented in the above reference.
(c): How to understand and measure the architectural complexity of a given RNN model? In
[Zhang et al, 2016], three measures are proposed which are: (c.1): recurrent depth (length of
longest path divided by sequence length), (c.2): feedforward depth (length of longest path
from input to nearest output) and (c.3): skip coefficient (length of shortest path divided by
sequence length).
(d): Pixel RNNs (ICML 2016 best paper award) [Oord et al, 2016]: This work proposes a
method to model the probability distribution of a natural image. The main idea is to factorize
the probability distribution of the input image into the product of conditional probabilities.
To do this, a Diagonal BiLSTM unit is proposed that efficiently captures the entire available
context (all the pixels above the current pixel) of the image (see Fig. 2 of the paper).
Residual skip connections are also used in the architecture. It has resulted in state-of-the-art
performance in terms of log-likelihood. Below are a number of natural images generated by
the model trained on ImageNet [picture from Oord et al, 2016]:
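The factorization behind Pixel RNNs mentioned in (d) can be written out explicitly. For an n x n image whose pixels x_1, ..., x_{n^2} are ordered row by row, the joint distribution is the product of per-pixel conditionals:

```latex
p(\mathbf{x}) = \prod_{i=1}^{n^2} p\left(x_i \mid x_1, \ldots, x_{i-1}\right)
```

Each factor conditions only on previously generated pixels, which is why images are sampled one pixel at a time.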
11. Can all problems be mapped to y=f(x)?
No! Example tasks for which the simple y=f(x) fails are: (a): cloze style QA where the task
is to read and comprehend a text (e.g., book, etc) and then answer questions about it.
(b): Given a text, the task is to fill in the blanks. (c): ChatBot.
As explained nicely in Sumit Chopra's presentation, the model needs to: (a): Remember the
external context. (b): Given an input, the model needs to know where to look for in the
context. (c): What to look for in the context. (d): How to reason, using this external context.
(e): The model should also handle a changing external context.
Therefore, introducing a notion of memory to capture external context is important. One
proposal is to use hidden states of RNNs as memory. For example, running an RNN on the
context (book, text, etc) to get its representation, then, using this representation to map a
question to an answer. There are two problems with this approach: (a): It does not scale.
(b): The idea that the hidden states of an RNN are both the memory and the controller of
the memory is not appropriate. We should separate these two.
The main idea of a memory network [Weston et al, 2015] is to separate the controller of the
memory from the memory itself. In other words, it combines a large memory with a learning
component that can read and write to the memory.
Memory networks perform better than LSTMs on QA tasks, but the two perform similarly
on language modelling tasks. One reason might be that language modelling does not need
very long term dependencies compared to QA and dialogue related tasks. One
shortcoming of current memory networks is that there is no memory compression. If the
memory is full, they simply recycle.
12. Large scale deep learning with TensorFlow presented by Jeff Dean
Generally, the important features that are desirable in a machine learning system are (from
Jeff Dean's presentation): (a): Ease of expression: for many machine learning algorithms.
(b): Scalability: to be able to run experiments quickly. (c): Portability: so that we can run
experiments on various platforms. (d): Reproducibility: which helps to share and reproduce
research. (e): Production readiness: from research to real products.
TensorFlow (TF) has been designed with careful consideration of the above features. Other
notes about TF are: (a): The core of TF is C++, which results in very low overhead. (b): The
TF system automatically decides which operations should be run on CPU or GPU. This
usually helps to significantly reduce the running time of experiments. (c): The first version
of the scalable deep learning system at Google, i.e., DistBelief [Dean et al, NIPS 2012], is
not as flexible as TF for research purposes. DistBelief has separate parameter servers, i.e.,
separate code for parameter servers vs. the rest of the system, which results in a
non-uniform and more complicated system. (d): The TF session interface provides
"extend", which can be used to add nodes to the computation graph, and "run", which in
addition to running the full computation graph can also be used to run an arbitrary
subgraph of it. (e): Question: How does TF make distributed training easy? It uses model
parallelism (partitioning the model across machines) and data parallelism. It is easy to
express both types of parallelism in TF with minimal changes to single device model code.
(f): TF can take care of device / graph placement. In other words, given a computation
graph and a set of devices, TF allows the user to decide which device executes each node.
13. History of Statistical Language Modelling?
Statistical language modelling is all about how probable a sentence is. We generally
maximize the log probabilities of sentences in the corpora. This, however, was not
obvious to everyone in the 90s (review of the Brown et al, 1990 paper) [from Kyunghyun
Cho's presentation]
which reads: "The validity of statistical (information theoretic) approach to MT has
indeed been recognized ... as early as 1949. And was universally recognized as
mistaken [sic] by 1950 ... The crude force of computers is not science."
14. What are the issues with non-parametric language modelling (e.g., n-grams)?
In n-gram language modelling, we basically collect n-gram statistics from a large corpus
(i.e., counting). Some issues with this approach are: (a): False conditional independence
assumption: in an n-gram language model we assume that each word is only
conditioned on the previous n-1 words. (b): Data sparsity: if a co-occurrence of some
words has never been observed in the training set, it will be assigned zero probability,
which makes the probability of the whole sentence zero. Conventional
solutions for this problem are smoothing and backoff. (c): Lack of generalization across
As an example, an n-gram language model might fail on the sentence "The dogs chasing
the cat bark". The tri-gram probability P(bark | the, cat) is very low (not observed in a
natural language corpus by the model, because the cat never barks and the plural verb
"bark" has appeared after the singular noun "cat"), but the whole sentence makes perfect
sense.
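The data sparsity issue and the smoothing fix discussed in this section can be reproduced with a tiny count-based trigram model. This is a minimal sketch: the toy corpus, the function names and the use of add-alpha smoothing (one simple smoothing scheme among many) are illustrative assumptions.

```python
from collections import Counter


def train_trigrams(tokens):
    """Count trigrams and the bigram contexts that precede a third word."""
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    ctx = Counter(zip(tokens, tokens[1:-1]))
    return tri, ctx


def prob(tri, ctx, w1, w2, w3, vocab_size, alpha=0.0):
    """P(w3 | w1, w2) from counts; alpha > 0 gives add-alpha smoothing."""
    num = tri[(w1, w2, w3)] + alpha
    den = ctx[(w1, w2)] + alpha * vocab_size
    return num / den if den > 0 else 0.0


# Hypothetical toy corpus in which "the cat" is never followed by "bark".
corpus = "the dogs bark . the cat sleeps . the dogs chase the cat .".split()
tri, ctx = train_trigrams(corpus)
vocab_size = len(set(corpus))

unsmoothed = prob(tri, ctx, "the", "cat", "bark", vocab_size)
smoothed = prob(tri, ctx, "the", "cat", "bark", vocab_size, alpha=1.0)
print(unsmoothed)  # 0.0, which would zero out the whole sentence probability
print(smoothed)    # small but non-zero
```

Smoothing removes the hard zero, but the model still has no way to know that "bark" agrees with "dogs" several words earlier, which is the independence problem in (a).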
15. Parametric and Neural Language Modelling
The basic idea of a neural language model is to create continuous space word
representations and use them for language modelling. For example, in [Bengio et al 2003], a
feedforward neural network with a softmax layer on top is used for language
modelling, as represented below (picture from Kyunghyun Cho's presentation):
Better choices for neural language modelling are RNNs (LSTMs, GRUs, ...) or Memory
Networks, which have resulted in state-of-the-art performance in terms of perplexity. For
example, see the paper "Exploring the Limits of Language Modelling" by Jozefowicz et al,
2016. A simple example of an unfolded vanilla RNN language model is represented below,
where the model reads the input word, updates the hidden state representations and
predicts the next word (picture from Kyunghyun Cho's presentation):
16. Character-Level Neural Machine Translation
The task in machine translation is to generate a sentence in the target language, given a
sentence in the source language. In Neural Machine Translation (NMT), an RNN (LSTM, GRU,
etc) is used to encode the source sentence into a vector, and another RNN is used to decode
the vector from the encoder into a sequence of words in the target language (sequence to
sequence learning). This is shown in the following diagram (picture from Kyunghyun Cho's
presentation):
The above model can be improved if we use an attention based decoder [Bahdanau et al,
ICLR 2015]. The idea is to compute a set of attention weights and use a weighted sum of
the encoder's annotation vectors in the decoder. This approach allows the decoder to
automatically focus on just the parts of the source sentence that are relevant for
predicting each target word. It is shown in the following diagram (picture from Kyunghyun
Cho's presentation):
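The attention computation described above (weights over source positions, then a weighted sum of annotation vectors) can be sketched as follows. Note the assumption: for brevity the scores here are plain dot products, whereas Bahdanau et al compute them with a small feedforward network; all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, annotations):
    """Return attention weights and the weighted sum of encoder annotation vectors."""
    # Assumption: dot-product scoring, used here only to keep the sketch short.
    scores = [sum(d * a for d, a in zip(decoder_state, ann)) for ann in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    context = [sum(w * ann[k] for w, ann in zip(weights, annotations))
               for k in range(dim)]
    return weights, context


annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per source position
weights, context = attend([2.0, 0.0], annotations)
print(weights, context)
```

The weights sum to one, and source positions whose annotation vectors match the decoder state receive most of the mass; the decoder recomputes them for every target word.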
The main issue with the above models is that they use words as basic units of language. For
example, "run", "runs", "ran" and "running" are from one lexeme "run", but the above
models assign them four independent vectors. It is also not always easy to segment a
sentence into words. The question is, can we use character level NMT to address the above
issues? In [Chung et al, 2016], it is shown that character level NMT works surprisingly
well. It is also interesting to note that an RNN implicitly segments a character sequence
automatically. For example, see the demonstration below (from Kyunghyun Cho's
presentation):
17. Why Generative Models?
As nicely explained in Shakir Mohamed's presentation, we need generative models for
moving beyond associating inputs to outputs, semi-supervised classification, data
manipulation, filling in the blank, inpainting, denoising, one-shot generalization [Rezende
et al, ICML 2016] and many more applications. Progress in generative models is presented
in the following diagram (note that the vertical axis should be negative log-likelihood)
[from Shakir Mohamed's presentation]:
18. What are different types of generative models?
Generative models can be classified into three groups:
(a): Fully Observed Models: The model directly observes data without introducing any new
unobserved local variable. These types of models can directly encode the relationship among
observed points. For directed graphical models, it is easy to scale up to large models and the
parameter learning is simple because the log-likelihood can be computed directly (no need
for approximation). For undirected models, the parameter learning is difficult as we need to
compute normalization constants. Generation in fully observed models can be slow. The
diagram below shows different fully observed generative models [from Shakir Mohamed's
presentation]:
(b): Transformation Models: The model transforms an unobserved noise source using a
parameterised function. It is easy to (1): sample from these models and (2): compute
expectations without knowing the final distribution. They can be used with large scale
classifiers and convolutional neural networks. Nevertheless, it is difficult to maintain
invertibility and to extend to generic data types using these models. The diagram below
shows different transformation generative models [from Shakir Mohamed's presentation]:
(c): Latent Variable Models: In these models, an unobserved local random variable is
introduced that represents hidden causes. It is easy to sample from these models and to
include hierarchy and depth. It is also possible to do scoring and model selection using the
marginalized likelihood. Nevertheless, it is difficult to determine the latent variables
corresponding to an input. The diagram below shows different latent variable generative
models [from Shakir Mohamed's presentation]: