

Deep Learning Tutorial

Mitesh M. Khapra

IBM Research India

(Ideas and material borrowed from

Richard Socher's tutorial @ ML Summer School 2014

Yoshua Bengio's tutorial @ ML Summer School 2014

& Hugo Larochelle's lecture videos & slides)


Roadmap

• What are Deep Neural Networks?

• Why should I be interested in Deep Learning?

• How do I train a Deep Neural Network?

• Where can I find additional material?

7

the what?

8

A typical machine learning example

feature extraction: number of positive words, number of negative words, length of review, author name, bag of words, etc.

data (feature vector) and label:

$x_1 = (1, 0, 0, 1, 0, 1),\; y_1 = 1$
$x_2 = (0, 0, 1, 1, 0, 1),\; y_2 = 0$
$x_3 = (1, 0, 1, 1, 0, 1),\; y_3 = 1$
$x_4 = (0, 0, 1, 0, 1, 1),\; y_4 = 0$

9


A typical machine learning example

data and label:

$x_1 = (1, 0, 0, 1, 0, 1),\; y_1 = 1$
$x_2 = (0, 0, 1, 1, 0, 1),\; y_2 = 0$
$x_3 = (1, 0, 1, 1, 0, 1),\; y_3 = 1$
$x_4 = (0, 0, 1, 0, 1, 1),\; y_4 = 0$

Assume $y = f_w(x)$, for example $f_w(x) = \mathrm{sigmoid}\big(\sum_i w_i x_i\big)$

Learn $w$ such that $f_w(x_i)$ is as close to $y_i$ as possible

10
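To make this concrete, here is a minimal sketch in Python/NumPy (my own illustration, not code from the tutorial) that fits $f_w(x) = \mathrm{sigmoid}(\sum_i w_i x_i)$ to the four toy examples above by gradient descent; the slide does not specify a loss, so squared error is used here, and the learning rate and iteration count are arbitrary choices.

```python
import numpy as np

# Toy data from the slide: four reviews, six binary features each, with labels.
X = np.array([[1, 0, 0, 1, 0, 1],
              [0, 0, 1, 1, 0, 1],
              [1, 0, 1, 1, 0, 1],
              [0, 0, 1, 0, 1, 1]], dtype=float)
y = np.array([1, 0, 1, 0], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(X.shape[1])     # one weight per feature (the slide's model has no bias term)
lr = 1.0                     # learning rate: an arbitrary choice

for _ in range(1000):
    pred = sigmoid(X @ w)                           # f_w(x) for every example
    grad = X.T @ ((pred - y) * pred * (1 - pred))   # gradient of the squared error
    w -= lr * grad

print(np.round(sigmoid(X @ w), 2))   # predictions move toward [1, 0, 1, 0]
```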

So, where does deep learning fit in?

• Machine Learning

– hand crafted features

– optimize weights to improve prediction

• Representation Learning

– automatically learn features

• Deep Learning

– automatically learn multiple levels of features

From Richard Socher's tutorial @ ML Summer School, Lisbon

11


The basic building block: a single artificial neuron

$a(x) = b + \sum_i w_i x_i$
$h(x) = \mathrm{sigmoid}\big(b + \sum_i w_i x_i\big)$

[Diagram: inputs $x_1, x_2, x_3$ with weights $w_1, w_2, w_3$ and a bias $b$ feeding a single neuron that outputs $a(x)$]

$w_i$ = weight, $b$ = bias, $a(x)$ = pre-activation, $h(x)$ = activation

Goal: given $n$ $(x, y)$ pairs, learn $w, b$ such that $h(x_j)$ is as close to $y_j$ as possible

12
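As an illustration, here is a minimal sketch of this neuron in Python/NumPy; the function names `pre_activation` and `activation` and the example parameter values are mine, not from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pre_activation(x, w, b):
    # a(x) = b + sum_i w_i x_i
    return b + np.dot(w, x)

def activation(x, w, b):
    # h(x) = sigmoid(a(x))
    return sigmoid(pre_activation(x, w, b))

x = np.array([1.0, 0.0, 1.0])      # three inputs
w = np.array([0.5, -0.3, 0.8])     # one weight per input
b = -0.2                           # bias
print(pre_activation(x, w, b))     # 1.1
print(activation(x, w, b))         # about 0.75
```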

Okay, so what can I use it for?

• For binary classification problems, by treating $h(x)$ as $p(y = 1 \mid x)$: predict $y = 1$ if $h(x) > 0.5$, else predict $y = 0$ (see the sketch below)

• Works when the data is linearly separable

[Diagrams: two single-neuron classifiers over inputs $x_1, x_2, x_3$ with weights $w_1, w_2, w_3$ and bias $b$; in one, $x$ = features from movie reviews and $y$ = positive/negative review; in the other, $x$ = features from movie reviews and $y$ = male author/female author]

(image from Hugo Larochelle's slides)

13
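Continuing in the same spirit, the decision rule is just a threshold on $h(x)$. This is a hypothetical helper of my own, not code from the tutorial; the feature vector, weights, and bias are illustrative.

```python
import numpy as np

def h(x, w, b):
    # h(x) = sigmoid(b + sum_i w_i x_i), read as p(y = 1 | x)
    return 1.0 / (1.0 + np.exp(-(b + np.dot(w, x))))

def classify(x, w, b):
    # Predict 1 (e.g. "positive review") if h(x) > 0.5, else 0.
    return 1 if h(x, w, b) > 0.5 else 0

x = np.array([1.0, 0.0, 1.0])      # illustrative feature vector
w = np.array([0.5, -0.3, 0.8])     # illustrative weights
b = -0.2
print(classify(x, w, b))           # 1, since h(x) is about 0.75
```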

What are its limitations?

• Fails when the data is not linearly separable…

(images from Hugo Larochelle's slides)

• …unless the input is suitably transformed: $x = (x_1, x_2)$ becomes $x' = \big(\mathrm{AND}(\overline{x_1}, x_2),\ \mathrm{AND}(x_1, \overline{x_2})\big)$

14

A neural network for XOR

Wait…, are you telling me that I will always have to meditate on the data and then decide the transformation/network?

No, definitely not. The XOR example is only meant to give the intuition.

The key takeaway is that by adding more layers you can make the data separable.

[Diagram: a multi-layered neural network with inputs $x_1, x_2$, a hidden layer computing $\mathrm{AND}(\overline{x_1}, x_2)$ and $\mathrm{AND}(x_1, \overline{x_2})$, and an output unit]

Let's spend some more time understanding this…

A multi-layered neural network

15
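To make the takeaway concrete, here is a small sketch (my own, not from the slides) of a two-layer network that computes XOR with hand-picked weights: the two hidden units approximate $\mathrm{AND}(\overline{x_1}, x_2)$ and $\mathrm{AND}(x_1, \overline{x_2})$, and the output unit ORs them, so the problem becomes linearly separable in the hidden representation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-picked weights (one of many choices that work).
W1 = np.array([[-20.0,  20.0],    # hidden unit 1 ~ AND(NOT x1, x2)
               [ 20.0, -20.0]])   # hidden unit 2 ~ AND(x1, NOT x2)
b1 = np.array([-10.0, -10.0])
W2 = np.array([20.0, 20.0])       # output unit ~ OR of the two hidden units
b2 = -10.0

def xor_net(x):
    h = sigmoid(W1 @ x + b1)      # hidden layer: the transformed input x'
    return sigmoid(W2 @ h + b2)   # output layer: x' is linearly separable

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    out = xor_net(np.array([x1, x2], dtype=float))
    print((x1, x2), round(float(out)))   # prints 0, 1, 1, 0
```

Of course, in practice these weights would be learned from data rather than set by hand; the point is only that one extra layer suffices to separate XOR.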

Capacity of a multi-layer network

[Diagram: a network with inputs $x_1, x_2$, hidden units $y_1, y_2$ (weights $W^{(1)}$), and an output unit $z_1$ (weights $W^{(2)}$), annotated with example weight values, together with plots of the functions each unit computes]

(graphs from Pascal Vincent's slides)

16

Capacity of a multi-layer network

(image from Pascal Vincent's slides)

17

Capacity of a multi-layer network

In particular, we can find a separator for the XOR problem

(images from Pascal Vincent's slides)

Universal Approximation Theorem (Hornik, 1991):

• "a single hidden layer neural network with a linear output unit can approximate any continuous function arbitrarily well, given enough hidden units"

18
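As a toy numerical illustration of this idea (my own sketch, not from the slides, and not a proof: the hidden-layer size, learning rate, and number of steps are arbitrary choices), a single hidden layer of sigmoid units with a linear output can be trained by gradient descent to fit a nonlinear function such as $\sin(x)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function on [-3, 3].
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(X)

H = 50                                   # number of hidden units (arbitrary)
W1 = rng.normal(0.0, 1.0, (1, H))
b1 = rng.normal(0.0, 1.0, H)
W2 = rng.normal(0.0, 0.1, (H, 1))
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.05
for _ in range(5000):
    h = sigmoid(X @ W1 + b1)             # single hidden layer
    pred = h @ W2 + b2                   # linear output unit
    err = pred - y
    # Backpropagated gradients of (half) the mean squared error.
    dW2 = h.T @ err / len(X)
    db2 = err.mean(axis=0)
    dh = (err @ W2.T) * h * (1 - h)
    dW1 = X.T @ dh / len(X)
    db1 = dh.mean(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

final_mse = np.mean((sigmoid(X @ W1 + b1) @ W2 + b2 - y) ** 2)
print(final_mse)   # should end up well below the ~0.5 you get from always predicting 0
```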

Let's take a minute here…

If "a single hidden layer neural network" is enough, then why go deeper?

[Diagram contrasting the two pipelines:
• Hand-crafted feature representations: $x = (x_1, x_2)$ is mapped by hand to $\big(\mathrm{AND}(\overline{x_1}, x_2),\ \mathrm{AND}(x_1, \overline{x_2})\big)$
• Automatically learned feature representations: inputs $x_1, x_2$ feed a hidden layer with weights $W^{(1)}$, whose outputs feed the next layer with weights $W^{(2)}$, and so on]

19

Multiple layers = multiple levels of features

But why would I be interested in learning multiple levels of representations?

Let's see where the motivation comes from…

[Diagram: a deep network with inputs $x_1, x_2, x_3$ and stacked layers with weights $W^{(1)}, W^{(2)}, W^{(3)}, W^{(4)}$ producing an output $y$]

20

(idea from Hugo Larochelle's slides)

The brain analogy

[Diagram: Layer 1, Layer 2, and Layer 3 representations of a face image, building up from parts (nose, mouth, eyes) to whole faces]

21

YAWN!!!! Enough with the brain tampering.

Just tell me: why should I be interested in deep learning?

("Show Me the Money")

22

the why?

23

(from Y. Bengio's MLSS 2014 slides)

Used in a wide variety of applications

24

Industrial Scale Success Stories

Speech Recognition

Object Recognition

Face Recognition

Cross-Language Learning

Machine Translation

Text Analytics

Dramatic improvements reported in some cases

Disclaimer: Some nodes and edges may be missing due to limited public knowledge

25

(from Y. Bengio's MLSS 2014 slides)

Some more success stories

26

Let me see if I understand this correctly…

• Speech Recognition, Machine Translation, etc. are more than 50 years old

• Single artificial neurons have been around for more than 50 years

[Diagram: a single artificial neuron (inputs $x_1, x_2, x_3$, weights $w_1, w_2, w_3$, bias $b$, output $y$) labelled "50+ years?", next to a deep network with layers $W^{(1)}, W^{(2)}, W^{(3)}, W^{(4)}$]

No, even deep neural networks have been around for many, many years, but prior to 2006 training deep nets was unsuccessful

27

(from Y. Bengio's MLSS 2014 slides)

So what has changed since 2006?

• New methods for unsupervised pre-training have been developed

• More efficient parameter estimation methods

• Better understanding of model regularization

• Faster machines and more data help DL more than other algorithms

28

the how?

29

recap: a single artificial neuron

$a(x) = b + \sum_i w_i x_i$
$h(x) = \mathrm{sigmoid}\big(b + \sum_i w_i x_i\big)$

[Diagram: inputs $x_1, x_2, x_3$ with weights $w_1, w_2, w_3$ and a bias $b$ feeding a single neuron that outputs $a(x)$]

$w_i$ = weight, $b$ = bias, $a(x)$ = pre-activation, $h(x)$ = activation

Goal: given $n$ $(x, y)$ pairs, learn $w, b$ such that $h(x_j)$ is as close to $y_j$ as possible

30

Switching to slides corresponding to lecture 2 from Hugo Larochelle's course: http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html

31

the where?

32

Some pointers to additional material

• http://deeplearning.net/

• http://info.usherbrooke.ca/hlarochelle/neural_networks/content.html

33
