
Deep Learning Introduction
Eder Santana
Neural Networks for Signal Processing
EEL 6504 - Spring 2016
Introduction
This introduction is divided into three parts:
1. On the need of deep architectures
2. Problems training deep architectures
3. Recent advances
Thus, we will cover the pros and cons of deep learning. Each topic contains links
to reference papers in the speaker notes.
On the need of deep architectures
The success of machine learning applications (regression, classification, etc) is
tied to the quality of the data representation.
- Remember the Iris dataset: some of its dimensions were easier to classify with than others.
For complex problems where no single dimension of the raw data is discriminative
enough, domain knowledge for feature representation is fundamental.
- For example, frequency-domain representations for speech recognition.
On the need of deep architectures
But how much can we do with generic priors? Can we design a single algorithm
that can be successfully applied to different data domains with minor
modifications?
Deep Learning is tightly coupled with the need to learn better
representations of data.
On the need of deep architectures
To illustrate the problem of learning representations, let us recall what each layer in an MLP does. Each line in the figure represents a processing element; the topmost layer is the classifier.
Figure: MLP with 3 layers, showing the input, first hidden, and output layers.
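To make the picture concrete, here is a minimal sketch of such an MLP classifier in Keras (one of the libraries listed later in these slides). The layer sizes and activations below are illustrative assumptions, not read off the figure:

    # Minimal MLP classifier sketch (tensorflow.keras assumed; sizes are illustrative).
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(4,)),               # e.g. the 4 Iris features
        layers.Dense(16, activation="tanh"),    # first hidden layer: a representation of the input
        layers.Dense(16, activation="tanh"),    # second hidden layer: a representation of the layer below
        layers.Dense(3, activation="softmax"),  # topmost layer: the classifier
    ])
    model.compile(optimizer="sgd", loss="categorical_crossentropy", metrics=["accuracy"])

Each Dense layer plays the role of one set of lines in the figure: a learned feature representation of the layer below it.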
On the need of deep architectures
Each layer learns a feature representation of the layer below. The final goal of
those representations is to optimize the criterion being backpropagated.
Assume a classification problem: the practical observation is that the number of neurons (connections, weights) required by a single-hidden-layer MLP is larger than the number required by a deep MLP.
“Using tools from measure theory and matrix algebra, we prove
that besides a negligible set, all functions that can be
implemented by a deep network of polynomial size, require
exponential size in order to be realized (or even
approximated) by a shallow network.” - Cohen et al.
On the need of deep architectures
Learning image multi-scale invariances
In image classification, we want to recognize objects across different viewing angles and scales.
Images from the Microsoft ASIRRA dataset. Dogs vs Cats dataset on
Kaggle: https://www.kaggle.com/c/dogs-vs-cats
On the need of deep architectures
Learning image multi-scale invariances
What does the human brain's image-processing pipeline look like?
Yamins and DiCarlo, 2016
On the need of deep architectures
Learning image multi-scale invariances
What do we learn on each layer?
Example: an unsupervised convolutional autoencoder applied to faces.
Lee et al., ICML 2009
On the need of deep architectures
Learning image multi-scale invariances
What do we learn on each layer of AlexNet (Zeiler variation)?
Krizhevsky et al., 2012
On the need of deep architectures
Learning image multi-scale invariances
What do we learn on each layer of AlexNet (Zeiler variation)? Showing preferred stimuli and the images that drive the highest activations.
Zeiler and Fergus, 2013
On the need of deep architectures
Learning image multi-scale invariances
Visualizing t-SNE embedding: http://cs.stanford.edu/people/karpathy/cnnembed/
On the need of deep architectures
Learning invariances in music and speech data
How do you do that using what we already know?
On the need of deep architectures
Learning invariances in music and speech data
“A spectrogram is an image and we know convnets.” But what about the time component?
On the need of deep architectures
Learning invariances in music and speech data
Speech recognition (before):
Get speech “windows”: phonemes
Represent each phoneme with a Gaussian Mixture Model or a Deep Belief Network (deep autoencoder)
Connect the windows: model phoneme transitions with a Hidden Markov Model
Yan Zhang
On the need of deep architectures
Learning invariances in music and speech data
Speech recognition (after, end-to-end deep learning):
No windowing
First layers are convolutional
Middle layers are bidirectional recurrent neural networks to represent time context (including future frames)
Cost function is Connectionist Temporal Classification (CTC), which allows sequence inputs and outputs
Hannun et al., 2014
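A rough sketch of this kind of end-to-end acoustic model, assuming tensorflow.keras; the layer sizes, the GRU choice, and the feature/character counts below are illustrative assumptions, not the Deep Speech configuration:

    # Convolutional front end + bidirectional RNN + per-frame softmax (tensorflow.keras assumed).
    from tensorflow.keras import layers, models

    n_features = 161   # spectrogram bins per frame (assumed)
    n_classes = 29     # characters plus the CTC "blank" symbol (assumed)

    model = models.Sequential([
        layers.Input(shape=(None, n_features)),                  # (time, features), variable length
        layers.Conv1D(256, kernel_size=11, strides=2,
                      padding="same", activation="relu"),        # convolutional first layers
        layers.Bidirectional(layers.GRU(256, return_sequences=True)),  # time context, past and future
        layers.TimeDistributed(layers.Dense(n_classes, activation="softmax")),  # per-frame posteriors
    ])
    # Training would attach a CTC loss (e.g. tf.nn.ctc_loss) to these per-frame outputs.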
On the need of deep architectures
Learning invariances in music and speech data
Content-based music recommendation
Sander Dieleman, 2014 http://benanne.github.io/2014/08/05/spotify-cnns.html
On the need of deep architectures
Learning representations for text
How should a computer represent a character or a word?
Signal processing is built on algebra, but how can we compute something like “bread” + “butter” or “king” - “crown”?
We need to represent words as vectors: vector algebra is well defined.
On the need of deep architectures
Learning representations for text
Process:
Represent each word by a unique id (one-hot encoding, sparse): x
Initialize a word-to-vector embedding (dense): B
Get the word representation in the new space: y = Bx
Each y is just a column of B (the column selected by the one-hot x)
Adapt B
How?
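A minimal numpy sketch of the lookup step above (vocabulary size, embedding dimension, and values are made up for illustration; adapting B is what the models on the next slides do):

    # One-hot word id x and dense embedding matrix B: y = Bx is a column lookup.
    import numpy as np

    vocab_size, embed_dim = 5, 3
    rng = np.random.default_rng(0)
    B = rng.normal(size=(embed_dim, vocab_size))   # word-to-vector embedding, randomly initialized

    word_id = 2
    x = np.zeros(vocab_size)
    x[word_id] = 1.0                               # one-hot (sparse) representation

    y = B @ x                                      # dense representation of the word
    assert np.allclose(y, B[:, word_id])           # y is exactly the selected column of B
    # B is then adapted by backpropagating a prediction loss (e.g. word2vec's skip-gram objective).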
On the need of deep architectures
Learning representations for text
Mikolov et al., 2013
On the need of deep architectures
Learning representations for text
Mikolov et al., 2013
On the need of deep architectures
Learning representations for text
It also works at the character level
Andrej Karpathy, 2015
On the need of deep architectures
Learning representations for text
It also works at the character level: Paul Graham generator (one character at a time!)
The surprised in investors weren't going to raise money. I'm not the
company with the time there are all interesting quickly, don't have to
get off the same programmers. There's a super-angel round
fundraising, why do you can do. If you have a different physical
investment are become in people who reduced in a startup with the
way to argument the acquirer could see them just that you're also the
founders will part of users' affords that and an alternation to the idea.
[2] Don't work at first member to see the way kids will seem in
advance of a bad successful startup. And if you have to act the big
company too."
Andrej Karpathy, 2015
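A minimal sketch of a character-level language model of this kind, assuming tensorflow.keras; this is not Karpathy's char-rnn code, and the vocabulary/sequence sizes are assumptions:

    # Character-level RNN language model sketch (tensorflow.keras; toy sizes).
    from tensorflow.keras import layers, models

    n_chars = 96    # size of the character vocabulary (assumed)
    seq_len = 100   # training sequence length (assumed)

    model = models.Sequential([
        layers.Input(shape=(seq_len,)),
        layers.Embedding(n_chars, 64),                        # learned character embeddings
        layers.LSTM(256, return_sequences=True),              # recurrent state carries the text context
        layers.TimeDistributed(layers.Dense(n_chars, activation="softmax")),  # next-character distribution
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    # Sampling one character at a time from the softmax produces text like the excerpt above.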
Problems training deep architectures
Is this a free lunch?
Note: Some of the following problems have been addressed to some extent. In parentheses we list possible ways to address each problem.
Feature design -> neural network architecture design (experience)
How many neurons per layer? Convolutions? Max-pooling? Recurrent?
Computational cost (GPUs, ASICs)
Many matrix multiplications and convolutions, and not all of them are embarrassingly parallel.
Curse of dimensionality (big data)
The number of trainable weights adds up to millions.
Problems training deep architectures
Is this a free lunch?
Non-convex problem (problem?)
Local minima problems seem to be mostly due to bad initialization and poor activation-function choices.
Vanishing gradients in backpropagation (improved SGD, LSTM)
Problems training deep architectures
Is this a free lunch?
Need different learning rates for different layers (improved SGD)
Output layer needs a stable representation to draw a separation surface (remember our
example above)
Recent advances (or why now?)
Since deep learning is “just” neural networks, why is it happening only now?
Community acceptance (neural networks were labeled brute force and unreliable due to non-convex optimization; data descriptors + kernel machines were the preferred approach)
https://plus.google.com/+YannLeCunPhD/posts/gurGyczzsJ7
Faster computers with more memory
Larger datasets
Also, there were A LOT of incremental contributions
To the beginner this may look like black magic, superstition or any other term for “not science”.
Recent advances
Weight initialization techniques
Activation functions (relu, maxout, etc)
Regularization (dropout, batch normalization, multi-task learning)
Data augmentation
Optimization algorithms (momentum, adagrad, rmsprop, adam, etc)
Architectures (recurrent, convolutional, inception, residual, etc)
Deep learning software (CUDA, Caffe, Theano, Tensorflow, Keras, Torch, etc)
Deep learning hardware (CPU clusters, GPUs, ASIC, phones, etc)
Recent advances
Weight initialization
● Too large: neurons get stuck in the low-gradient (saturated) range
● Too small: neurons are too correlated, which slows down learning
● Glorot uniform initialization:
Wij ~ U[-√(6 / (n_in + n_out)), +√(6 / (n_in + n_out))]
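A minimal numpy sketch of this initializer, assuming the usual fan-in/fan-out form of the bound (n_in and n_out are the layer's input and output sizes):

    # Glorot (Xavier) uniform initialization in numpy.
    import numpy as np

    def glorot_uniform(n_in, n_out, rng=None):
        if rng is None:
            rng = np.random.default_rng()
        limit = np.sqrt(6.0 / (n_in + n_out))   # keeps activation/gradient variance roughly constant across layers
        return rng.uniform(-limit, limit, size=(n_in, n_out))

    W = glorot_uniform(784, 256)   # e.g. the first weight matrix of an MNIST MLP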
Recent advances
Activation functions
Figure: Softplus and ReLU activation curves.
ReLU: f(x) = max(0, x). Softplus: f(x) = log(1 + e^x), a smooth approximation of ReLU.
Their gradients do not vanish for x > 0, unlike saturating sigmoid/tanh units.
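A small numpy sketch of the two activations and the gradient claim (illustrative, not library code):

    # ReLU and softplus with their derivatives (numpy).
    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)        # derivative is 1 for x > 0 and 0 for x < 0

    def softplus(x):
        return np.log1p(np.exp(x))       # smooth approximation of ReLU; derivative is the sigmoid 1/(1+e^-x)

    x = np.linspace(-5.0, 5.0, 11)
    print(relu(x))
    print(softplus(x))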
Recent advances
Regularization (Dropout)
“If you’re not overfitting, your network isn’t big enough” - attributed to Geoff Hinton
How to design a Neural Network:
Before
Start with a small network
Increase the number of neurons (and, rarely, the number of layers) until you start overfitting
Now
Start with a network big enough to overfit
Increase regularization and data augmentation until you stop overfitting
Recent advances
Regularization (Dropout)
Dropout: randomly multiply the outputs of the network by zero during training. At test time, rescale the learned weights down.
Layer output f(w^T x):              [0.5, 0, 0, 0.7, 0.01, 0.2]
Dropout mask with probability 0.5:  [0,   1, 0, 1,   0,    1  ]
New layer output after masking:     [0,   0, 0, 0.7, 0,    0.2]
Recent advances
Regularization (Dropout)
Dropout: randomly multiply the outputs of the network by zero during training. At test time, rescale the learned weights down.
Layer output f(w^T x):                          [0.5, 0,   0, 0.7,  0.01, 0.2]
Dropout mask with probability 0.5 and rescale:  [0,   0.5, 0, 0.5,  0,    0.5]
New layer output after masking and rescale:     [0,   0,   0, 0.35, 0,    0.1]
Note that there can be different dropout probabilities! We just sample and rescale accordingly.
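A minimal numpy sketch of the masking and rescaling illustrated above (the keep probability and the rescaling convention follow the slide; this is a simplification, not a full layer implementation):

    # Dropout sketch (numpy): mask activations during training, rescale at test time.
    import numpy as np

    rng = np.random.default_rng(0)

    def dropout_train(h, keep_prob=0.5):
        mask = rng.random(h.shape) < keep_prob   # keep each unit with probability keep_prob
        return h * mask

    def dropout_test(h, keep_prob=0.5):
        return h * keep_prob                     # rescale down (equivalently, scale the learned weights)

    h = np.array([0.5, 0.0, 0.0, 0.7, 0.01, 0.2])   # layer output f(w^T x) from the example above
    print(dropout_train(h))   # a randomly masked version of h
    print(dropout_test(h))    # [0.25, 0., 0., 0.35, 0.005, 0.1]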
Recent advances
Regularization (Dropout)
Dropout prevents co-adaptation of features. Intuition: during training we add noise so the weights don't learn to rely on each other as much. We train them to work alone and to bring in their friends at test time.
Figure: a unit during training vs. at test time.
Hinton et al., 2013
Recent advances
Regularization (Batch Normalization)
Let us remember the MSE error surface and how correlated features impair learning:
J = E[(d - w^T x)^2] = E[d^2] + w^T C w - 2 w^T P
where C is the input autocorrelation matrix and P is the cross-correlation vector between the input and the desired response. This cost is quadratic in w, and its curvature is determined by C.
Let us focus on the quadratic term for the 2D case:
w^T C w = (w1 C11 + w2 C21) w1 + (w1 C12 + w2 C22) w2 = w1^2 C11 + w2^2 C22 + w1 w2 (C12 + C21)
Recent advances
Regularization (Batch Normalization)
w^T C w = (w1 C11 + w2 C21) w1 + (w1 C12 + w2 C22) w2 = w1^2 C11 + w2^2 C22 + w1 w2 (C12 + C21)
Let us visualize the contour (level) plots of the learning surface:
- C12 C21 = 0, same variance for x1 and x2: no correlation on the input, circular contours, and we learn equally fast in all directions.
- C12 C21 = 0, but x2 has higher variance than x1: axis-aligned elliptical contours.
- C12 C21 > 0: correlated inputs create tilted, narrow “valleys” in the learning surface.
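A quick numpy check of the point above: correlated inputs inflate the eigenvalue spread of C, which is what produces the narrow valleys (the data below is synthetic and only illustrative):

    # Eigenvalue spread of the input autocorrelation matrix: uncorrelated vs correlated inputs.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    x1 = rng.normal(size=n)
    noise = rng.normal(size=n)

    X_uncorr = np.stack([x1, noise], axis=1)                   # independent inputs
    X_corr = np.stack([x1, 0.95 * x1 + 0.3 * noise], axis=1)   # strongly correlated inputs

    for name, X in [("uncorrelated", X_uncorr), ("correlated", X_corr)]:
        C = X.T @ X / n                       # autocorrelation matrix
        eig = np.linalg.eigvalsh(C)
        print(name, "eigenvalue spread:", eig.max() / eig.min())   # large spread = narrow valleys, slow descent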
Recent advances
Regularization (Batch Normalization)
Also, the “eigenvalue spread” of C determines the maximum stable learning rate, the misadjustment, the convergence time, etc. Learning and correlated inputs are not friends! Batch normalization learns to “remove the mean and divide by the std” after every layer.
There is a version for convolutional layers too. In both cases, we keep a moving-average approximation of the mean and variance that we use at test time. (It is not fair to compute statistics on the test batch in your experiments. Don't do that in your homework or publications.)
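A simplified numpy sketch of the batch-normalization computation for a fully connected layer, with the moving averages mentioned above (gamma and beta are the learned scale and shift; the momentum value is an assumption):

    # Batch normalization sketch for a dense layer (numpy; simplified, no backward pass).
    import numpy as np

    def batch_norm_train(x, gamma, beta, running, momentum=0.99, eps=1e-5):
        mean = x.mean(axis=0)                      # per-feature mean over the mini-batch
        var = x.var(axis=0)                        # per-feature variance over the mini-batch
        x_hat = (x - mean) / np.sqrt(var + eps)    # remove the mean, divide by the std
        # keep moving averages for test time (never use test-batch statistics)
        running["mean"] = momentum * running["mean"] + (1 - momentum) * mean
        running["var"] = momentum * running["var"] + (1 - momentum) * var
        return gamma * x_hat + beta                # learned scale and shift

    def batch_norm_test(x, gamma, beta, running, eps=1e-5):
        x_hat = (x - running["mean"]) / np.sqrt(running["var"] + eps)
        return gamma * x_hat + beta

    d = 3
    gamma, beta = np.ones(d), np.zeros(d)
    running = {"mean": np.zeros(d), "var": np.ones(d)}
    x_batch = np.random.default_rng(0).normal(loc=2.0, scale=3.0, size=(32, d))
    y = batch_norm_train(x_batch, gamma, beta, running)   # ~zero mean, ~unit variance per feature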
Recent advances
Data augmentation (noise)
For deep learning, the more data the better! We already saw one way to increase the virtual size of our dataset: adding noise!
Dropout in the input
Adding Gaussian noise (works better for speech data)
Recent advances
Data augmentation (data distortion)
There are also small transformations of the input that help neural networks generalize better:
- Vertical-axis flip (mirror the image left-right)
- Multiple crops (changes scale and position)
- Rotation (changes viewing angle)
And they can all be combined for even more augmentation, as in the sketch below.
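A small numpy sketch of these distortions on an image array (shapes, crop size, and function names are illustrative assumptions; in practice a library routine would also handle interpolation for arbitrary rotation angles):

    # Simple image augmentations (numpy); img is an H x W x C array.
    import numpy as np

    rng = np.random.default_rng(0)

    def flip_vertical_axis(img):
        return img[:, ::-1, :]                    # mirror the image left-right

    def random_crop(img, size):
        h, w, _ = img.shape
        top = rng.integers(0, h - size + 1)
        left = rng.integers(0, w - size + 1)
        return img[top:top + size, left:left + size, :]   # changes scale and position

    def rotate90(img, k=1):
        return np.rot90(img, k=k)                 # coarse rotation; small angles need interpolation

    img = rng.random((64, 64, 3))
    augmented = rotate90(flip_vertical_axis(random_crop(img, 48)))   # transforms can be combined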