Tony Jebara, Columbia University
Machine Learning
4771
Instructor: Tony Jebara
Topic 4
• Tutorial: Matlab
• Perceptron, Online & Stochastic Gradient Descent
• Convergence Guarantee
• Perceptron vs. Linear Regression
• Multi-Layer Neural Networks
• Back-Propagation
• Demo: LeNet
• Deep Learning
Tutorial: Matlab
• Matlab is the most popular language for machine learning
• See www.cs.columbia.edu->computing->Software->Matlab
• Online info to get started is available at:
http://www.cs.columbia.edu/~jebara/tutorials.html
• Matlab tutorials
• List of Matlab function calls
• Example code: homework #1 will use polyreg.m
• General: help, lookfor, 1:N, rand, zeros, A’, reshape, size
• Math: max, min, cov, mean, norm, inv, pinv, det, sort, eye
• Control: if, for, while, end, %, function, return, clear
• Display: figure, clf, axis, close, plot, subplot, hold on, fprintf
• Input/Output: load, save, ginput, print
• BBS and TA’s are also helpful
Perceptron (another Neuron)
• Classification scenario once again, but now with labels in {−1, +1}:
  $X = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}, \quad x \in \mathbb{R}^D, \quad y \in \{-1, +1\}$
• A better choice of squashing function for classification is
  $g(z) = \begin{cases} -1 & \text{when } z < 0 \\ +1 & \text{when } z \ge 0 \end{cases}$
• And a better choice of loss is classification loss:
  $L\big(y, f(x;\theta)\big) = \mathrm{step}\big(-y\, f(x;\theta)\big)$
• Actually, with the above g(z) any loss acts like classification loss:
  $R(\theta) = \frac{1}{4N} \sum_{i=1}^N \big(y_i - g(\theta^T x_i)\big)^2 \equiv \frac{1}{N} \sum_{i=1}^N \mathrm{step}\big(-y_i\, \theta^T x_i\big)$
• What does this R(θ) function look like?
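As a small illustration of this risk (a sketch on assumed synthetic data, not from the lecture), the code below evaluates R(θ) under the step loss. Note that scaling θ never changes R, since only the sign of θᵀx matters:

```python
import numpy as np

# Toy illustration: the classification-loss risk R(theta) only depends on the
# signs of theta^T x_i, so it is piecewise constant in theta.
rng = np.random.default_rng(0)
N = 100
X = rng.normal(size=(N, 2))                           # x in R^2
y = np.where(X @ np.array([1.0, -2.0]) >= 0, 1, -1)   # labels in {-1, +1}

def risk(theta, X, y):
    """R(theta) = (1/N) sum_i step(-y_i * theta^T x_i): the fraction of mistakes."""
    margins = y * (X @ theta)
    return np.mean(margins <= 0)                      # step(-m) = 1 when m <= 0

theta = np.array([1.0, -2.0])                         # the separating direction
print(risk(theta, X, y))                              # 0.0: separates the data
print(risk(0.9 * theta, X, y))                        # rescaling theta leaves R unchanged
```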
Perceptron & Classification Loss
• Classification loss in the Risk leads to a hard minimization
• What does this R(θ) function look like?
  $R(\theta) = \frac{1}{N} \sum_{i=1}^N \mathrm{step}\big(-y_i\, \theta^T x_i\big)$
• Q*bert-like: we can’t do gradient descent, since the gradient
  is zero everywhere except at the edges, where a label flips
Perceptron & Perceptron Loss
• Instead of classification loss
  $R(\theta) = \frac{1}{N} \sum_{i=1}^N \mathrm{step}\big(-y_i\, \theta^T x_i\big)$
• consider Perceptron loss:
  $R_{per}(\theta) = -\frac{1}{N} \sum_{i \in \text{misclassified}} y_i\, \theta^T x_i$
  (plots: loss L versus the margin y f(x;θ))
• Instead of a staircase-shaped R we get a smooth, piecewise-linear R
• Get reasonable gradients for gradient descent:
  $\nabla_\theta R_{per}(\theta) = -\frac{1}{N} \sum_{i \in \text{misclassified}} y_i x_i$
  $\theta^{t+1} = \theta^t - \eta\, \nabla_\theta R_{per}\big|_{\theta^t} = \theta^t + \eta\, \frac{1}{N} \sum_{i \in \text{misclassified}} y_i x_i$
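The perceptron risk and its gradient can be sketched as follows (toy data and the function name are illustrative assumptions, not the lecture's code):

```python
import numpy as np

def perceptron_risk_and_grad(theta, X, y):
    """Perceptron loss R_per = -(1/N) sum_{i in mis} y_i theta^T x_i
    and its gradient -(1/N) sum_{i in mis} y_i x_i."""
    N = X.shape[0]
    margins = y * (X @ theta)
    mis = margins <= 0                                # misclassified points
    risk = -np.sum(margins[mis]) / N
    grad = -(y[mis, None] * X[mis]).sum(axis=0) / N
    return risk, grad

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = np.where(X[:, 0] + X[:, 1] >= 0, 1, -1)

theta = np.zeros(2)
_, grad = perceptron_risk_and_grad(theta, X, y)
theta = theta - 1.0 * grad                            # one gradient step, eta = 1
mistakes = int(np.sum(y * (X @ theta) <= 0))
print(mistakes)                                       # fewer than the initial 50
```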
Perceptron vs. Linear Regression
• Linear regression gets close but doesn’t classify perfectly:
  classification error = 2, squared error = 0.139
• Perceptron gets zero error:
  classification error = 0, perceptron error = 0
Stochastic Gradient Descent
• Gradient Descent vs. Stochastic Gradient Descent
• Instead of computing the average gradient over all points
  and then taking a step,
  $\nabla_\theta R_{per}(\theta) = -\frac{1}{N} \sum_{i \in \text{misclassified}} y_i x_i$
• update the gradient for each misclassified point by itself:
  $\nabla_\theta R_{per}(\theta) = -y_i x_i \quad \text{if } i \text{ misclassified}$
• Also, set η to 1 without loss of generality:
  $\theta^{t+1} = \theta^t - \eta\, \nabla_\theta R_{per}\big|_{\theta^t} = \theta^t + y_i x_i \quad \text{if } i \text{ misclassified}$
Online Perceptron
• Apply stochastic gradient descent to a perceptron
• Get the “online perceptron” algorithm:

    initialize t = 0 and θ⁰ = 0
    while not converged {
        pick i ∈ {1,…,N}
        if y_i x_iᵀ θᵗ ≤ 0 {
            θᵗ⁺¹ = θᵗ + y_i x_i
            t = t + 1
        }
    }
• Either pick i randomly or use a “for i=1 to N” loop
• If the algorithm stops, we have a theta that separates data
• The total number of mistakes we made along the way is t
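The algorithm above can be implemented directly (the toy data below, built with an explicit margin so convergence is quick, is an assumption for this sketch):

```python
import numpy as np

def online_perceptron(X, y, max_epochs=1000):
    """Online perceptron: sweep the data, update theta on each mistake.
    Returns (theta, t), where t is the total number of mistakes made."""
    theta = np.zeros(X.shape[1])
    t = 0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ theta) <= 0:                # misclassified (or on boundary)
                theta = theta + yi * xi
                t += 1
                mistakes += 1
        if mistakes == 0:                             # stopped: theta separates the data
            break
    return theta, t

rng = np.random.default_rng(0)
w = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))
m = X @ w
X, y = X[np.abs(m) > 0.5], np.where(m[np.abs(m) > 0.5] >= 0, 1, -1)

theta, t = online_perceptron(X, y)
print(t, bool(np.all(y * (X @ theta) > 0)))           # finite t, zero training error
```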
Online Perceptron Theorem
Theorem: the online perceptron algorithm converges to zero
error in finite t if we assume
  1) all data lie inside a sphere of radius r: $\|x_i\| \le r \;\; \forall i$
  2) the data are separable with margin γ: $y_i (\theta^*)^T x_i \ge \gamma \;\; \forall i$
Proof:
• Part 1) Look at the inner product of the current θᵗ with θ*;
  assume we just updated a mistake on point i:
  $(\theta^*)^T \theta^t = (\theta^*)^T \theta^{t-1} + y_i (\theta^*)^T x_i \ge (\theta^*)^T \theta^{t-1} + \gamma$
  After applying t such updates, we must get:
  $(\theta^*)^T \theta^t \ge t\gamma$
Online Perceptron Proof
• Part 1) $(\theta^*)^T \theta^t \ge t\gamma$
• Part 2) $\|\theta^t\|^2 = \|\theta^{t-1} + y_i x_i\|^2 = \|\theta^{t-1}\|^2 + 2 y_i (\theta^{t-1})^T x_i + \|x_i\|^2$
  Since we only update on mistakes, the middle term is negative:
  $\|\theta^t\|^2 \le \|\theta^{t-1}\|^2 + \|x_i\|^2 \le \|\theta^{t-1}\|^2 + r^2 \le t r^2$
• Part 3) Angle between the optimal & current solution:
  $\cos(\theta^*, \theta^t) = \frac{(\theta^*)^T \theta^t}{\|\theta^t\|\, \|\theta^*\|} \ge \frac{t\gamma}{\|\theta^t\|\, \|\theta^*\|} \ge \frac{t\gamma}{\sqrt{t r^2}\, \|\theta^*\|}$  (apply part 1, then part 2)
• Since $\cos \le 1$:
  $\frac{t\gamma}{\sqrt{t r^2}\, \|\theta^*\|} \le 1 \;\Rightarrow\; t \le \frac{r^2}{\gamma^2} \|\theta^*\|^2$ …so t is finite!
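The mistake bound can be checked numerically (the hand-chosen separator θ* and toy data below are assumptions for this demo):

```python
import numpy as np

# Sanity check of the bound t <= (r^2 / gamma^2) * ||theta*||^2 on toy data,
# using a hand-chosen separator theta* (an assumption for this demo).
rng = np.random.default_rng(0)
theta_star = np.array([2.0, -1.0])
X = rng.normal(size=(300, 2))
m = X @ theta_star
X, y = X[np.abs(m) > 0.5], np.where(m[np.abs(m) > 0.5] >= 0, 1, -1)

r = np.max(np.linalg.norm(X, axis=1))                 # radius of the data sphere
gamma = np.min(y * (X @ theta_star))                  # margin of theta*

theta, t = np.zeros(2), 0
for _ in range(1000):                                 # online perceptron sweeps
    done = True
    for xi, yi in zip(X, y):
        if yi * (xi @ theta) <= 0:
            theta, t, done = theta + yi * xi, t + 1, False
    if done:
        break

bound = (r ** 2 / gamma ** 2) * np.dot(theta_star, theta_star)
print(t, bound, t <= bound)                           # mistakes never exceed the bound
```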
Multi-Layer Neural Networks
• What if we consider cascading multiple layers of network?
• Each output layer is input to the next layer
• Each layer has its own weight parameters
• Eg: each layer has linear nodes (not perceptron/logistic)
• Above Neural Net has 2 layers. What does this give?
Multi-Layer Neural Networks
• Need to introduce non-linearities between layers
• Avoids previous redundant linear layer problem
• Neural network can adjust the basis functions themselves…
  $f(x) = g\!\left(\sum_{i=1}^P \theta_i\, g\big(\theta_i^T x\big)\right)$
Multi-Layer Neural Networks
• Multi-Layer Network can handle more complex decisions
• 1-layer: purely linear, can’t handle XOR
• Each layer adds more flexibility (but more parameters!)
• Each node splits its input space with linear hyperplane
• 2-layer: if last layer is AND operation, get convex hull
• 2-layer: can do almost anything multi-layer can
by fanning out the inputs at 2nd layer
• Note: Without loss of generality, we can omit the 1 and θ0
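The XOR point above can be made concrete: with hand-picked (not learned, purely illustrative) weights, a 2-layer network with step units computes XOR, which no single linear layer can:

```python
import numpy as np

# A 2-layer network representing XOR. Weights are hand-chosen for illustration:
# the hidden units carve out OR-like and AND-like half-spaces, and the output
# unit combines them as "OR and not AND".
def g(z):                           # step squashing function with +/-1 outputs
    return np.where(z >= 0, 1.0, -1.0)

def two_layer(x):
    h1 = g(x[0] + x[1] - 0.5)       # fires when x0 OR x1
    h2 = g(x[0] + x[1] - 1.5)       # fires when x0 AND x1
    return g(h1 - h2 - 0.5)         # output: h1 AND (not h2)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, two_layer(np.array(x, dtype=float)))
```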
Back-Propagation
• Gradient descent on squared loss is done layer by layer
• Layers: input, hidden, output. Parameters: $\theta = \{w_{ij}, w_{jk}, w_{kl}\}$
  $a_j = \sum_i w_{ij} z_i, \quad z_j = g(a_j)$
  $a_k = \sum_j w_{jk} z_j, \quad z_k = g(a_k)$
  $a_l = \sum_k w_{kl} z_k, \quad z_l = g(a_l)$
• Each input xn for n=1..N generates its own a’s and z’s
• Back-Propagation: Splits layer into its inputs & outputs
• Get gradient on output…back-track chain rule until input
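The forward pass in the slide's notation can be sketched as follows (tanh as the squashing function g is an assumption; the slides keep g generic):

```python
import numpy as np

def g(a):                           # assumed smooth squashing function (tanh);
    return np.tanh(a)               # the slides leave g generic

def forward(x, W_ij, W_jk, W_kl):
    """Forward pass matching the slide notation: a = pre-activations, z = outputs.
    z_i is the input itself; each layer computes a = W^T z, then z = g(a)."""
    z_i = x
    a_j = W_ij.T @ z_i; z_j = g(a_j)
    a_k = W_jk.T @ z_j; z_k = g(a_k)
    a_l = W_kl.T @ z_k; z_l = g(a_l)
    return (a_j, z_j, a_k, z_k, a_l, z_l)

rng = np.random.default_rng(0)
W_ij = rng.normal(size=(4, 3))      # input -> hidden 1
W_jk = rng.normal(size=(3, 3))      # hidden 1 -> hidden 2
W_kl = rng.normal(size=(3, 1))      # hidden 2 -> output
out = forward(rng.normal(size=4), W_ij, W_jk, W_kl)
print(out[-1].shape)                # a single squashed output z_l
```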
Back-Propagation
• Cost function:
  $R(\theta) = \frac{1}{N} \sum_{n=1}^N L^n = \frac{1}{N} \sum_{n=1}^N \tfrac{1}{2}\left(y^n - g\!\left(\sum_k w_{kl}\, g\!\left(\sum_j w_{jk}\, g\!\left(\sum_i w_{ij} x_i^n\right)\right)\right)\right)^2$
  where we define $L^n := \tfrac{1}{2}\big(y^n - f(x^n)\big)^2$
• First compute the output-layer derivative (Chain Rule):
  $\frac{\partial R}{\partial w_{kl}} = \frac{1}{N} \sum_n \left[\frac{\partial L^n}{\partial a_l^n}\right]\!\left(\frac{\partial a_l^n}{\partial w_{kl}}\right) = \frac{1}{N} \sum_n \left[\frac{\partial}{\partial a_l^n} \tfrac{1}{2}\big(y^n - g(a_l^n)\big)^2\right]\!\left(\frac{\partial a_l^n}{\partial w_{kl}}\right) = \frac{1}{N} \sum_n \left[-\big(y^n - z_l^n\big)\, g'(a_l^n)\right] z_k^n = \frac{1}{N} \sum_n \delta_l^n z_k^n$
  where we define $\delta_l^n := -\big(y^n - z_l^n\big)\, g'(a_l^n)$
• Next, the hidden-layer derivative (Multivariate Chain Rule):
  $\frac{\partial R}{\partial w_{jk}} = \frac{1}{N} \sum_n \left[\frac{\partial L^n}{\partial a_k^n}\right]\!\left(\frac{\partial a_k^n}{\partial w_{jk}}\right) = \frac{1}{N} \sum_n \left[\sum_l \frac{\partial L^n}{\partial a_l^n} \frac{\partial a_l^n}{\partial a_k^n}\right]\!\left(\frac{\partial a_k^n}{\partial w_{jk}}\right) = \frac{1}{N} \sum_n \left[\sum_l \delta_l^n \frac{\partial a_l^n}{\partial a_k^n}\right] z_j^n$
  Recall $a_l = \sum_k w_{kl}\, g(a_k)$, so $\frac{\partial a_l^n}{\partial a_k^n} = w_{kl}\, g'(a_k^n)$, giving
  $\frac{\partial R}{\partial w_{jk}} = \frac{1}{N} \sum_n \left[\sum_l \delta_l^n w_{kl}\, g'(a_k^n)\right] z_j^n = \frac{1}{N} \sum_n \delta_k^n z_j^n$
  where we define $\delta_k^n := \left(\sum_l \delta_l^n w_{kl}\right) g'(a_k^n)$
• Any previous (input) layer derivative: repeat the formula!
  $\frac{\partial R}{\partial w_{ij}} = \frac{1}{N} \sum_n \left[\frac{\partial L^n}{\partial a_j^n}\right]\!\left(\frac{\partial a_j^n}{\partial w_{ij}}\right) = \frac{1}{N} \sum_n \left[\sum_k \frac{\partial L^n}{\partial a_k^n} \frac{\partial a_k^n}{\partial a_j^n}\right]\!\left(\frac{\partial a_j^n}{\partial w_{ij}}\right) = \frac{1}{N} \sum_n \left[\sum_k \delta_k^n w_{jk}\, g'(a_j^n)\right] z_i^n = \frac{1}{N} \sum_n \delta_j^n z_i^n$
• What is this last z? It is the input itself: $z_i^n = x_i^n$
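The whole derivation above can be sketched as code and verified against a finite-difference gradient (g = tanh is an assumed choice; the slides keep g generic):

```python
import numpy as np

# Back-propagation for the 3-weight-layer net of the slides with squared loss.
g = np.tanh
gprime = lambda a: 1.0 - np.tanh(a) ** 2

def risk_and_grads(X, Y, W_ij, W_jk, W_kl):
    N = X.shape[0]
    dW_ij, dW_jk, dW_kl, R = 0.0, 0.0, 0.0, 0.0
    for x, y in zip(X, Y):
        a_j = W_ij.T @ x;   z_j = g(a_j)              # forward pass
        a_k = W_jk.T @ z_j; z_k = g(a_k)
        a_l = W_kl.T @ z_k; z_l = g(a_l)
        R += 0.5 * np.sum((y - z_l) ** 2) / N
        d_l = -(y - z_l) * gprime(a_l)                # output-layer delta
        d_k = (W_kl @ d_l) * gprime(a_k)              # back-propagate the deltas
        d_j = (W_jk @ d_k) * gprime(a_j)
        dW_kl += np.outer(z_k, d_l) / N               # dR/dw = (1/N) sum delta * z
        dW_jk += np.outer(z_j, d_k) / N
        dW_ij += np.outer(x, d_j) / N
    return R, dW_ij, dW_jk, dW_kl

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(20, 4)), rng.normal(size=(20, 1))
Ws = [rng.normal(size=s) for s in [(4, 3), (3, 3), (3, 1)]]
R, dW_ij, dW_jk, dW_kl = risk_and_grads(X, Y, *Ws)

eps = 1e-6                                            # finite-difference check
Ws[0][0, 0] += eps
R2, *_ = risk_and_grads(X, Y, *Ws)
print(abs((R2 - R) / eps - dW_ij[0, 0]) < 1e-4)       # True: gradients agree
```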
Back-Propagation
• Again, take a small step in the direction opposite to the gradient:
  $w_{ij}^{t+1} = w_{ij}^t - \eta \frac{\partial R}{\partial w_{ij}}, \quad w_{jk}^{t+1} = w_{jk}^t - \eta \frac{\partial R}{\partial w_{jk}}, \quad w_{kl}^{t+1} = w_{kl}^t - \eta \frac{\partial R}{\partial w_{kl}}$
• Early-stop when the error bottoms out on a validation set
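The early-stopping idea can be sketched as a generic training loop (the helper names, patience scheme, and toy quadratic objective below are assumptions, not from the slides):

```python
import numpy as np

def train_early_stop(grad_fn, val_risk_fn, theta, eta=0.1, patience=10, max_iters=1000):
    """Gradient descent that remembers the parameters with the best validation
    risk and stops once validation risk stops improving for `patience` steps."""
    best_theta, best_val, since_best = theta.copy(), np.inf, 0
    for _ in range(max_iters):
        theta = theta - eta * grad_fn(theta)          # training-set gradient step
        val = val_risk_fn(theta)                      # held-out risk
        if val < best_val:
            best_theta, best_val, since_best = theta.copy(), val, 0
        else:
            since_best += 1
            if since_best >= patience:                # validation error bottomed out
                break
    return best_theta, best_val

# Toy quadratic objective, minimized at theta = [1, 2]
target = np.array([1.0, 2.0])
grad = lambda th: 2 * (th - target)
vrisk = lambda th: float(np.sum((th - target) ** 2))
theta, val = train_early_stop(grad, vrisk, np.zeros(2))
print(np.round(theta, 2), val)
```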
Neural Networks Demo
• Digits Demo: LeNet… http://yann.lecun.com
• One problem with back-prop is that the MLP over-fits
• Other problems: hard to interpret, black-box
• What are the hidden inner layers doing?
Auto-Encoders
• Make the neural net reconstruct the input vector
• Set the target y vector to be the x vector again
• But, it gets narrow in the middle!
• So, there is some information “compression”
• This leads to better generalization
• This is unsupervised learning since we only use the x data
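A minimal auto-encoder sketch, assuming an 8→3→8 architecture with a tanh encoder and linear decoder (illustrative choices, not the lecture's exact setup); the only supervision signal is the input itself:

```python
import numpy as np

# Auto-encoder: set the target y to the input x and train through a narrow
# middle layer, forcing a 3-dimensional compression of 8-dimensional inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8)) @ (0.5 * rng.normal(size=(8, 8)))   # correlated inputs
W1 = 0.1 * rng.normal(size=(8, 3))                 # encoder: 8 -> 3 (bottleneck)
W2 = 0.1 * rng.normal(size=(3, 8))                 # decoder: 3 -> 8

def reconstruct(X, W1, W2):
    return np.tanh(X @ W1) @ W2

mse0 = np.mean((reconstruct(X, W1, W2) - X) ** 2)
eta = 0.01
for _ in range(500):                               # gradient descent on squared error
    H = np.tanh(X @ W1)
    E = H @ W2 - X                                 # reconstruction error (y := x)
    dW2 = H.T @ E / len(X)
    dW1 = X.T @ ((E @ W2.T) * (1 - H ** 2)) / len(X)
    W1, W2 = W1 - eta * dW1, W2 - eta * dW2
mse1 = np.mean((reconstruct(X, W1, W2) - X) ** 2)
print(mse0, mse1)                                  # error drops as the bottleneck learns
```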
Deep Learning
• We can stack several independently trained auto-encoders
• Using back-propagation, we do pre-training
• Train Net 1 to go from 2000 inputs to 1000 to 2000 inputs
• Train Net 2 to go from 1000 hidden values to 500 to 1000
• Train Net 3 to go from 500 hidden to 30 to 500
• Then, do unrolling - link up the learned networks as above
Deep Learning
• Then do fine-tuning of the overall neural network
by running back-propagation on the whole thing to
reconstruct x from x
Deep Learning
• Does good reconstruction!
• Beats PCA on images of unaligned faces.
• PCA is better when face images are aligned…
• We will cover PCA in a few lectures…
Minimum Training Error?
• Is minimizing Empirical Risk the right thing?
• Are Perceptrons and Neural Networks
giving the best classifier?
• We are getting:
minimum training error
not minimum testing error
• Perceptrons give a bunch of possible solutions…
  …a solution with guarantees ⇒ SVMs