Deep Learning Tutorial
Mitesh M. Khapra
IBM Research India
(Ideas and material borrowed from
Richard Socher’s tutorial @ ML Summer School 2014
Yoshua Bengio’s tutorial @ ML Summer School 2014
& Hugo Larochelle’s lecture videos & slides)
Roadmap
• What are Deep Neural Networks?
• Why should I be interested in Deep Learning?
• How do I train a Deep Neural Network?
• Where can I find additional material?
A typical machine learning example

Feature extraction: number of positive words, number of negative words, length of review, author name, bag of words, etc.

Data (feature vectors) and labels:
x_1 = (1, 0, 0, 1, 0, 1), y_1 = 1
x_2 = (0, 0, 1, 1, 0, 1), y_2 = 0
x_3 = (1, 0, 1, 1, 0, 1), y_3 = 1
x_4 = (0, 0, 1, 0, 1, 1), y_4 = 0

A model, for example f_w(x) = sigmoid(w^T x): learn w such that f_w(x_i) is as close to y_i as possible.
So, where does deep learning fit in?
• Machine Learning
– hand crafted features
– optimize weights to improve prediction
• Representation Learning
– automatically learn features
• Deep Learning
– automatically learn multiple levels of features
From Richard Socher’s tutorial @ ML Summer School, Lisbon
The basic building block: the single artificial neuron

a(x) = b + \sum_i w_i x_i
h(x) = sigmoid(b + \sum_i w_i x_i)

[Figure: a single artificial neuron with inputs x_1, x_2, x_3, weights w_1, w_2, w_3 and bias b]

w_i = weight, b = bias, a(x) = pre-activation, h(x) = activation

Goal: given n (x, y) pairs, find w, b such that h(x_i) is as close to y_i as possible.
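Below is a minimal NumPy sketch of this computation; the input and parameter values are made up for illustration and are not from the tutorial.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def neuron(x, w, b):
    """Single artificial neuron: pre-activation a(x) = b + sum_i w_i x_i,
    activation h(x) = sigmoid(a(x))."""
    a = b + np.dot(w, x)
    return sigmoid(a)

# Illustrative values (three inputs, as in the diagram).
x = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, -1.0, 2.0])
b = -0.5
print(neuron(x, w, b))  # h(x), a value in (0, 1); here sigmoid(2.0) is about 0.88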
Okay, so what can I use it for?

• For binary classification problems, by treating h(x) as p(y = 1 | x)
• Works when the data is linearly separable
(image from Hugo Larochelle’s slides)

Example: x = features from a given review, y = positive/negative; if h(x) > 0.5, predict positive.
Example: x = features from a given review, y = male author/female author; if h(x) > 0.5, predict male author.
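As a rough sketch of the review example (the feature vectors are the toy ones from the earlier slide; the weights are hand-picked for illustration, not taken from the tutorial), the activation is read as p(y = 1 | x) and thresholded at 0.5:

import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Toy review feature vectors and labels (1 = positive, 0 = negative).
X = np.array([[1, 0, 0, 1, 0, 1],
              [0, 0, 1, 1, 0, 1],
              [1, 0, 1, 1, 0, 1],
              [0, 0, 1, 0, 1, 1]], dtype=float)
y = np.array([1, 0, 1, 0])

# Hand-picked parameters that happen to separate this toy data.
w = np.array([2.0, 0.0, -0.5, 0.5, -2.0, 0.0])
b = -0.5

p = sigmoid(X @ w + b)           # interpreted as p(y = 1 | x) per review
print(np.round(p, 2))            # [0.88 0.38 0.82 0.05]
print((p > 0.5).astype(int), y)  # predictions match the labels here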
What are its limitations?

• Fails when the data is not linearly separable…
(images from Hugo Larochelle’s slides)
• …unless the input is suitably transformed

For XOR: x = (x_1, x_2) is not linearly separable, but x' = (AND(x_1, NOT x_2), AND(NOT x_1, x_2)) is.
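A quick numerical check of the XOR case (the AND-based transform follows the standard construction; the exact overbar placement is hard to read from the extracted slide text):

import numpy as np

# The four XOR inputs and their labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR(x1, x2): not linearly separable in (x1, x2)

# Transformed input: x' = (AND(x1, NOT x2), AND(NOT x1, x2)).
X_prime = np.stack([X[:, 0] & (1 - X[:, 1]),
                    (1 - X[:, 0]) & X[:, 1]], axis=1)
print(X_prime)  # [[0 0] [0 1] [1 0] [0 0]]

# In the transformed space a single linear threshold now works:
pred = (X_prime.sum(axis=1) > 0.5).astype(int)
print(pred, y)  # predictions equal the XOR labels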
A neural network for XOR

Wait… are you telling me that I will always have to meditate on the data and then decide the transformation/network?

No, definitely not. The XOR example is only meant to give the intuition. The key takeaway is that by adding more layers you can make the data separable.

[Figure: a multi-layered neural network over inputs x_1, x_2, with hidden units computing AND(x_1, NOT x_2) and AND(NOT x_1, x_2)]

Let’s spend some more time understanding this…
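One concrete (and entirely illustrative) instance of such a network: a two-layer net whose sigmoid units approximate AND and OR gates. The weights below are hand-picked and are not the ones drawn in the lecture figures.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hidden layer: h1 approximates AND(x1, NOT x2), h2 approximates AND(NOT x1, x2).
W1 = np.array([[ 20.0, -20.0],
               [-20.0,  20.0]])
b1 = np.array([-10.0, -10.0])

# Output layer: z approximates OR(h1, h2).
w2 = np.array([20.0, 20.0])
b2 = -10.0

def xor_net(x):
    h = sigmoid(W1 @ x + b1)      # hidden layer activations
    return sigmoid(w2 @ h + b2)   # output activation

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, round(float(xor_net(np.array(x, dtype=float))), 3))
# Prints values close to 0, 1, 1, 0 (the XOR truth table).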
Capacity of a multi-layer network
(graphs from Pascal Vincent’s slides)

[Figure: a two-layer network with inputs x_1, x_2, hidden units y_1, y_2 (weights W^(1)) and output z_1 (weights W^(2)); the figure shows specific edge weights such as 0.5, 0.7, −0.4, −1.5, −1 and bias inputs of 1]
Capacity of a multi-layer network
(image from Pascal Vincent’s slides)
Capacity of a multi-layer network

In particular, we can find a separator for the XOR problem.
(images from Pascal Vincent’s slides)

Universal Approximation Theorem (Hornik, 1991):
• “a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units”
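The theorem is an existence statement, but a small experiment gives the flavour. The sketch below (illustrative, not from the tutorial) uses random sigmoid hidden units and fits only the linear output layer by least squares; with enough hidden units the approximation error on a smooth target becomes small.

import numpy as np

rng = np.random.default_rng(0)

def target(t):
    return np.sin(3 * t) + 0.5 * t            # a continuous function on [-2, 2]

x = np.linspace(-2, 2, 200)[:, None]           # 200 one-dimensional inputs
n_hidden = 100

W = rng.normal(scale=3.0, size=(1, n_hidden))  # random input-to-hidden weights
b = rng.normal(scale=3.0, size=n_hidden)       # random hidden biases
H = 1.0 / (1.0 + np.exp(-(x @ W + b)))         # sigmoid hidden activations

v, *_ = np.linalg.lstsq(H, target(x[:, 0]), rcond=None)  # linear output unit
approx = H @ v

print("max abs error:", float(np.abs(approx - target(x[:, 0])).max()))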
Let’s take a minute here…

If “a single hidden layer neural network” is enough, then why go deeper?

[Figure: hand-crafted feature representations, e.g. x' = (AND(x_1, NOT x_2), AND(NOT x_1, x_2)), contrasted with feature representations learned automatically by the hidden layers of a network over inputs x_1, x_2]
Multiple layers = multiple levels of features

But why would I be interested in learning multiple levels of representations?

Let’s see where the motivation comes from…

[Figure: a deep network over inputs x_1, x_2, x_3 with stacked layers W^(1), W^(2), W^(3), W^(4) producing output y]
The brain analogy
(idea from Hugo Larochelle’s slides)

[Figure: successive layer representations, with parts such as nose, mouth, and eyes at one level combining into a face at a higher level (Layer 1, Layer 2, Layer 3 representations)]
YAWN!!! Enough with the brain tampering. Just tell me: why should I be interested in deep learning?
(“Show me the money”)
Used in a wide variety of applications
(from Y. Bengio’s MLSS 2014 slides)
Industrial-Scale Success Stories

• Speech Recognition
• Object Recognition
• Face Recognition
• Cross-Language Learning
• Machine Translation
• Text Analytics

Dramatic improvements reported in some cases.
Disclaimer: Some nodes and edges may be missing due to limited public knowledge.
Some more success stories
(from Y. Bengio’s MLSS 2014 slides)
Let me see if I understand this correctly…

• Speech Recognition, Machine Translation, etc. are more than 50 years old
• Single artificial neurons have been around for more than 50 years

[Figure: a single artificial neuron and a deep neural network over inputs x_1, x_2, x_3, annotated with the question “50+ years?”]

No, even deep neural networks have been around for many, many years, but prior to 2006 training deep nets was unsuccessful.
So what has changed since 2006?
(from Y. Bengio’s MLSS 2014 slides)

• New methods for unsupervised pre-training have been developed
• More efficient parameter estimation methods
• Better understanding of model regularization
• Faster machines and more data help DL more than other algorithms
Recap: the single artificial neuron

a(x) = b + \sum_i w_i x_i
h(x) = sigmoid(b + \sum_i w_i x_i)

w_i = weight, b = bias, a(x) = pre-activation, h(x) = activation

Goal: given n (x, y) pairs, find w, b such that h(x_i) is as close to y_i as possible.
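As a bridge to the training lecture that follows, here is a minimal sketch (illustrative, not from the slides) of fitting w and b by gradient descent on the logistic (cross-entropy) loss for the toy review data used earlier:

import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Toy review data from the earlier slides.
X = np.array([[1, 0, 0, 1, 0, 1],
              [0, 0, 1, 1, 0, 1],
              [1, 0, 1, 1, 0, 1],
              [0, 0, 1, 0, 1, 1]], dtype=float)
y = np.array([1.0, 0.0, 1.0, 0.0])

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.5

for step in range(500):
    h = sigmoid(X @ w + b)          # h(x_i) for every example
    grad = h - y                    # derivative of cross-entropy w.r.t. pre-activation
    w -= lr * X.T @ grad / len(y)   # gradient step on the weights
    b -= lr * grad.mean()           # gradient step on the bias

print(np.round(sigmoid(X @ w + b), 2))  # close to the labels [1, 0, 1, 0]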
Switching to slides corresponding to lecture 2 from Hugo Larochelle’s course http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html
Some pointers to additional material
• http://deeplearning.net/
• http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html