Deep Learning: Back To The Future
Hinton NIPS 2012 Talk Slide (More Or Less)

What was hot in 1987
Neural networks

What happened in ML since 1987
Computers got faster
Larger data sets became available

What is hot 25 years later
Neural networks

… but they are informed by graphical models!
Brief History Of Machine Learning






1960s Perceptrons
1969 Minsky & Papert book
1985-1995 Neural Nets and Back Propagation
1995- Support-Vector Machines
2000- Bayesian Models
2013- Deep Networks
What My Lecture Looked Like In 1987
The Limitations Of Two Layer Networks


Many problems can’t be learned without a layer of intermediate or hidden units.
Problem
Where does training signal come from?
Teacher specifies target outputs, not target hidden unit activities.

If you could learn input->hidden and hidden->output connections, you could learn new representations!
But how do hidden units get an error signal?
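The chain rule of backpropagation is the standard answer. A minimal NumPy sketch (not from the slides; all sizes and weights are arbitrary) of how the output error reaches the hidden units even though the teacher only specifies target outputs:

```python
# Sketch: the hidden units' error signal is the output error passed back
# through the hidden->output weights (chain rule), so both weight matrices
# can be trained from target outputs alone.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x = rng.random(4)                          # input
t = np.array([1.0])                        # target output (no hidden targets given)
W_ih = rng.normal(0, 0.5, (4, 3))          # input -> hidden weights
W_ho = rng.normal(0, 0.5, (3, 1))          # hidden -> output weights

h = sigmoid(x @ W_ih)                      # hidden activities
y = sigmoid(h @ W_ho)                      # output activity

delta_out = (y - t) * y * (1 - y)          # error signal at the output unit
delta_hidden = (delta_out @ W_ho.T) * h * (1 - h)   # error signal reaching hidden units
print("hidden-unit error signal:", delta_hidden)
```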
Why Stop At One Hidden Layer?

E.g., vision hierarchy for recognizing handprinted text
Word (output layer)
Character (hidden layer 3)
Stroke (hidden layer 2)
Edge (hidden layer 1)
Pixel (input layer)
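A minimal sketch of this layered mapping as a forward pass; the layer sizes (784, 256, ...) are arbitrary placeholders, not values from the slide:

```python
# Pixels pass through several hidden layers (edge-, stroke-, character-like
# features) to a word-level output layer.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

layer_sizes = [784, 256, 128, 64, 10]     # pixel -> edge -> stroke -> character -> word
weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

x = rng.random(layer_sizes[0])            # a "pixel" input vector
for W in weights:
    x = sigmoid(x @ W)                    # one layer of the hierarchy
print("output layer activity:", x.round(2))
```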
Demos

Yann LeCun’s LeNet5
http://yann.lecun.com/exdb/lenet/index.html
Why Deeply Layered Networks Fail

Credit assignment problem
How is a neuron in layer 2 supposed to know what it should output until all the neurons above it do something sensible?
How is a neuron in layer 4 supposed to know what it should output until all the neurons below it do something sensible?

Mathematical manifestation
Error gradients get squashed as they are passed back through a deep network
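A small numeric sketch of the squashing effect (the depth, width, and weight scale are arbitrary assumptions): the logistic derivative is at most 0.25, so the backpropagated signal shrinks multiplicatively as it passes down through sigmoid layers.

```python
# Watch the mean gradient magnitude collapse as the error signal is passed
# back through many sigmoid layers.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_layers, width = 10, 50
grad = np.ones(width)                                 # error signal at the output
for layer in range(n_layers):
    W = rng.normal(0, 0.1, size=(width, width))       # small random weights
    a = sigmoid(rng.normal(size=width))               # some unit activations
    grad = (W.T @ grad) * a * (1 - a)                 # chain rule through one sigmoid layer
    print(f"layer {n_layers - layer}: mean |grad| = {np.abs(grad).mean():.2e}")
```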
Solution

Traditional method of training
Random initial weights

Alternative
Do unsupervised learning layer by layer to get weights in a sensible configuration for the statistics of the input.
Then when the net is trained in a supervised fashion, credit assignment will be easier.
Autoencoder Networks




Self-supervised training procedure
Given a set of input vectors (no target outputs)
Map input back to itself via a hidden layer bottleneck
How to achieve bottleneck?
 Fewer neurons
 Sparsity constraint
 Information transmission constraint (e.g., add noise to unit, or shut off randomly, a.k.a. dropout)
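A minimal sketch of a bottleneck autoencoder along these lines; the random data and all hyperparameters are placeholder assumptions. Corrupting the input (noise or random shut-off) in the forward pass would give the denoising/dropout variants mentioned above.

```python
# Map the input back to itself through a small hidden code: encoder -> code -> decoder.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.random((200, 20))                  # 200 input vectors, no target outputs
n_in, n_hidden = X.shape[1], 5             # bottleneck: fewer hidden neurons
W1 = rng.normal(0, 0.1, (n_in, n_hidden))  # encoder weights
W2 = rng.normal(0, 0.1, (n_hidden, n_in))  # decoder weights
lr = 0.1

for step in range(2000):
    h = sigmoid(X @ W1)                    # encode through the bottleneck
    X_hat = h @ W2                         # decode (linear output units)
    err = X_hat - X                        # reconstruction error
    # Backprop of the mean squared reconstruction error
    dW2 = h.T @ err / len(X)
    dh = err @ W2.T * h * (1 - h)
    dW1 = X.T @ dh / len(X)
    W1 -= lr * dW1
    W2 -= lr * dW2

print("reconstruction MSE:", np.mean((sigmoid(X @ W1) @ W2 - X) ** 2))
```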
Autoencoder Combines An Encoder And A Decoder

[Diagram: input -> Encoder -> hidden code -> Decoder -> reconstruction]
Stacked Autoencoders

[Diagram: autoencoders trained layer by layer; their encoders are copied to form a deep network]

Note that decoders can be stacked to produce a generative model of the domain
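A minimal, self-contained sketch of the stacking idea, under the same placeholder assumptions as the autoencoder sketch above: the second autoencoder is trained on the first one's codes, and the two decoders are then chained to map a top-level code back to the input space.

```python
# Greedy layer-by-layer training of two tiny autoencoders, then stacked decoding.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, steps=2000, lr=0.1):
    """Train one bottleneck autoencoder; return (encoder, decoder) weights."""
    W1 = rng.normal(0, 0.1, (X.shape[1], n_hidden))
    W2 = rng.normal(0, 0.1, (n_hidden, X.shape[1]))
    for _ in range(steps):
        h = sigmoid(X @ W1)
        err = h @ W2 - X
        dh = err @ W2.T * h * (1 - h)
        W2 -= lr * h.T @ err / len(X)
        W1 -= lr * X.T @ dh / len(X)
    return W1, W2

X = rng.random((200, 20))
enc1, dec1 = train_autoencoder(X, n_hidden=10)        # layer 1 trained on raw input
codes1 = sigmoid(X @ enc1)
enc2, dec2 = train_autoencoder(codes1, n_hidden=5)    # layer 2 trained on layer-1 codes

# Stacked decoders: push a top-level code back down to the input space.
top_code = sigmoid(codes1 @ enc2)[:1]      # code for the first example
codes1_hat = top_code @ dec2               # decoder 2: top code -> layer-1 code
x_hat = codes1_hat @ dec1                  # decoder 1: layer-1 code -> input space
print("reconstruction shape:", x_hat.shape)
```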
Neural Net Can Be Viewed As A Graphical Model
[Diagram: inputs x1, x2, x3, x4 feeding a single output unit y]

Deterministic neuron
$P(y \mid x_1, x_2, x_3, x_4) = \begin{cases} 1 & \text{if } y = \left(1 + \exp\left(-\sum_i w_i x_i\right)\right)^{-1} \\ 0 & \text{otherwise} \end{cases}$

Stochastic neuron
$P(y \mid x_1, x_2, x_3, x_4) = \begin{cases} \left(1 + \exp\left(-\sum_i w_i x_i\right)\right)^{-1} & \text{if } y = 1 \\ 1 - \left(1 + \exp\left(-\sum_i w_i x_i\right)\right)^{-1} & \text{if } y = 0 \end{cases}$
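A small sketch of the two unit types (the particular weights and inputs are arbitrary): both compute the logistic of a weighted sum; the deterministic unit outputs that value, while the stochastic unit emits a 0/1 sample with that value as its firing probability.

```python
import numpy as np

rng = np.random.default_rng(0)

def deterministic_neuron(x, w):
    return 1.0 / (1.0 + np.exp(-w @ x))          # real-valued output in (0, 1)

def stochastic_neuron(x, w):
    p = 1.0 / (1.0 + np.exp(-w @ x))             # firing probability
    return int(rng.random() < p)                 # binary sample: 1 with prob. p, else 0

x = np.array([1.0, 0.0, 1.0, 1.0])               # inputs x1..x4
w = np.array([0.5, -1.0, 0.25, 0.1])             # weights w1..w4
print(deterministic_neuron(x, w), [stochastic_neuron(x, w) for _ in range(5)])
```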
Boltzmann Machine
(Hinton & Sejnowski, circa 1985)

Undirected graphical model
 Each node is a stochastic neuron
 Potential function defined on each pair of neurons


Algorithms were developed for doing inference for special cases of the architecture.
E.g., Restricted Boltzmann Machine
 2 layers
 Completely interconnected between layers
 No connections within layer
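A sketch of why the restriction matters: with no within-layer connections, all hidden units are conditionally independent given the visible layer (and vice versa), so the conditionals factorize and an entire layer can be sampled in one matrix product. Sizes and weights here are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_visible, n_hidden = 6, 4
W = rng.normal(0, 0.1, (n_visible, n_hidden))   # pairwise potentials between the two layers
b_v = np.zeros(n_visible)                       # visible biases
b_h = np.zeros(n_hidden)                        # hidden biases

v = rng.integers(0, 2, n_visible).astype(float) # a binary visible vector

p_h = sigmoid(v @ W + b_h)                      # P(h_j = 1 | v), all j at once
h = (rng.random(n_hidden) < p_h).astype(float)  # sample every hidden unit in parallel
p_v = sigmoid(h @ W.T + b_v)                    # P(v_i = 1 | h), all i at once
print(p_h, p_v)
```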
Punch Line

Deep network can be implemented as a multilayer restricted Boltzmann machine
Sequential layer-to-layer training procedure
Training requires probabilistic inference
Update rule: ‘contrastive divergence’

Different research groups prefer different neural substrates, but it doesn’t really matter whether you use a deterministic neural net or an RBM
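A minimal sketch of a single contrastive-divergence (CD-1) weight update, with biases omitted and all sizes arbitrary; this is a simplified reading of the rule, not Hinton's exact procedure.

```python
# Compare the data-driven <v h> statistics with those after one step of Gibbs
# sampling, and nudge the weights toward the data.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_visible, n_hidden, lr = 6, 4, 0.1
W = rng.normal(0, 0.1, (n_visible, n_hidden))

V = rng.integers(0, 2, (20, n_visible)).astype(float)   # a batch of binary data vectors

# Positive phase: hidden probabilities driven by the data
p_h0 = sigmoid(V @ W)
h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

# Negative phase: one step of Gibbs sampling (the "reconstruction")
p_v1 = sigmoid(h0 @ W.T)
v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
p_h1 = sigmoid(v1 @ W)

# CD-1 update: data statistics minus reconstruction statistics
W += lr * (V.T @ p_h0 - v1.T @ p_h1) / len(V)
print("weight update applied, mean |W| =", np.abs(W).mean())
```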
Different Levels Of Abstraction
Hierarchical Learning

[Slide body truncated in extraction; from Ng’s group]
Sutskever, Martens, Hinton (2011)
Generating Text From A Deep Belief Net

Wikipedia
The meaning of life is the tradition of the ancient human reproduction: it is less favorable
to the good boy for when to remove her bigger. In the show’s agreement unanimously
resurfaced. The wild pasteured with consistent street forests were incorporated by the
15th century BE. In 1996 the primary rapford undergoes an effort that the reserve
conditioning, written into Jewish cities, sleepers to incorporate the .St Eurasia that
activates the population. Mar??a Nationale, Kelli, Zedlat-Dukastoe, Florendon, Ptu’s
thought is. To adapt in most parts of North America, the dynamic fairy Dan please
believes, the free speech are much related to the

NYT
while he was giving attention to the second advantage of school building a 2-for-2 stool
killed by the Cultures saddled with a half- suit defending the Bharatiya Fernall ’s office .
Ms . Claire Parters will also have a history temple for him to raise jobs until naked
Prodiena to paint baseball partners , provided people to ride both of Manhattan in 1978 ,
but what was largely directed to China in 1946 , focusing on the trademark period is the
sailboat yesterday and comments on whom they obtain overheard within the 120th
anniversary , where many civil rights defined , officials said early that forms , ” said
Bernard J. Marco Jr. of Pennsylvania , was monitoring New York
2013 News



No need to use unsupervised training or probabilistic models if…
You use clever tricks of the neural net trade, i.e.,
Back propagation with
 deep networks
 rectified linear units
 dropout
 weight maxima
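A small sketch of two of these tricks as they appear in a single forward pass, plus one reading of “weight maxima” as a max-norm constraint on incoming weights; the layer sizes, drop rate, and norm cap are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)                     # rectified linear unit

def forward(x, W, p_drop=0.5, training=True):
    h = relu(W @ x)                               # ReLU hidden layer
    if training:
        mask = rng.random(h.shape) >= p_drop      # randomly shut units off
        h = h * mask / (1.0 - p_drop)             # inverted dropout scaling
    return h

x = rng.random(8)
W = rng.normal(0, 0.1, (16, 8))
print(forward(x, W).shape)

# Max-norm constraint (one reading of "weight maxima"): after each update,
# rescale any row of W whose norm exceeds a cap (here 3.0).
norms = np.linalg.norm(W, axis=1, keepdims=True)
W = W * np.minimum(1.0, 3.0 / np.maximum(norms, 1e-12))
```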
Krizhevsky, Sutskever, & Hinton

ImageNet competition
 15M images in 22k categories
 For contest, 1.2M images in 1k categories
 Classification: can you name object in 5 guesses?

2012 Results
2013: Down to 11% error