Machine Learning 4771
Instructor: Tony Jebara, Columbia University
Topic 4
• Tutorial: Matlab
• Perceptron, Online & Stochastic Gradient Descent
• Convergence Guarantee
• Perceptron vs. Linear Regression
• Multi-Layer Neural Networks
• Back-Propagation
• Demo: LeNet
• Deep Learning
Tutorial: Matlab
• Matlab is the most popular language for machine learning
• See www.cs.columbia.edu->computing->Software->Matlab
• Online info to get started is available at:
http://www.cs.columbia.edu/~jebara/tutorials.html
• Matlab tutorials
• List of Matlab function calls
• Example code: homework #1 will use polyreg.m
• General: help, lookfor, 1:N, rand, zeros, A’, reshape, size
• Math: max, min, cov, mean, norm, inv, pinv, det, sort, eye
• Control: if, for, while, end, %, function, return, clear
• Display: figure, clf, axis, close, plot, subplot, hold on, fprintf
• Input/Output: load, save, ginput, print
• The BBS and the TAs are also helpful
Perceptron (another Neuron)
• Classification scenario once again, but consider +1/−1 labels:
  X = {(x_1,y_1), (x_2,y_2), …, (x_N,y_N)},  x ∈ R^D,  y ∈ {−1, +1}
• A better choice of squashing function for classification is
  g(z) = −1 when z < 0,  +1 when z ≥ 0
• And a better choice of loss is the classification loss:
  L(y, f(x;θ)) = step(−y f(x;θ))
• In fact, with the above g(z), any loss acts like the classification loss:
  R(θ) = (1/4N) ∑_{i=1}^N (y_i − g(θᵀx_i))² ≡ (1/N) ∑_{i=1}^N step(−y_i θᵀx_i)
• What does this R(θ) function look like?
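The equivalence of the scaled squared error and the step loss can be checked numerically; a quick sketch (in Python rather than Matlab, assuming step(u) = 1 for u > 0, so the tie at z = 0 is ignored):

```python
def g(z):
    return -1.0 if z < 0 else 1.0     # sign-style squashing function

def step(u):
    return 1.0 if u > 0 else 0.0      # step loss (tie at 0 ignored here)

# (1/4)(y - g(z))^2 equals step(-y z) for labels y in {-1,+1} and z != 0
for y in (-1.0, 1.0):
    for z in (-2.5, -0.1, 0.1, 3.0):  # nonzero activations theta' x
        assert 0.25 * (y - g(z)) ** 2 == step(-y * z)
```

Both sides are 0 when the label agrees with the sign of z and 1 when it does not, which is exactly the 0/1 classification loss.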
Perceptron & Classification Loss
• Classification loss for the Risk leads to a hard minimization
• What does this R(θ) function look like?
  R(θ) = (1/N) ∑_{i=1}^N step(−y_i θᵀx_i)
• Q*bert-like: can’t do gradient descent since the gradient is zero
  everywhere except at edges where a label flips
Perceptron & Perceptron Loss
• Instead of the classification loss
  R(θ) = (1/N) ∑_{i=1}^N step(−y_i θᵀx_i)
• consider the perceptron loss:
  R_per(θ) = −(1/N) ∑_{i∈misclassified} y_i θᵀx_i
  (figure: loss L plotted against y f(x;θ) for both losses)
• Instead of a staircase-shaped R we get a smooth piece-wise linear R
• Get reasonable gradients for gradient descent:
  ∇_θ R_per(θ) = −(1/N) ∑_{i∈misclassified} y_i x_i
  θ^{t+1} = θ^t − η ∇_θ R_per(θ^t) = θ^t + η (1/N) ∑_{i∈misclassified} y_i x_i
Perceptron vs. Linear Regression
• Linear regression gets close but doesn’t classify perfectly:
  classification error = 2, squared error = 0.139
• The perceptron gets zero error:
  classification error = 0, perceptron error = 0
Stochastic Gradient Descent
• Gradient Descent vs. Stochastic Gradient Descent
• Instead of computing the average gradient over all points and then taking a step,
  ∇_θ R_per(θ) = −(1/N) ∑_{i∈misclassified} y_i x_i
• update the gradient for each misclassified point by itself:
  ∇_θ R_per(θ) = −y_i x_i  if i is misclassified
• Also, set η to 1 without loss of generality:
  θ^{t+1} = θ^t − η ∇_θ R_per(θ^t) = θ^t + y_i x_i  if i is misclassified
Online Perceptron
• Apply stochastic gradient descent to a perceptron
• Get the “online perceptron” algorithm:

  initialize t = 0 and θ⁰ = 0
  while not converged {
    pick i ∈ {1,…,N}
    if y_i x_iᵀ θ^t ≤ 0 {
      θ^{t+1} = θ^t + y_i x_i
      t = t + 1
    }
  }

• Either pick i randomly or use a “for i = 1 to N” loop
• If the algorithm stops, we have a θ that separates the data
• The total number of mistakes we made along the way is t
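The algorithm above can be sketched directly in code; a minimal version (in Python/numpy rather than the course’s Matlab, with a hypothetical toy dataset that includes a constant-1 feature as the bias):

```python
import numpy as np

def online_perceptron(X, y, max_updates=1000):
    """Online perceptron: sweep through the points, update on mistakes.

    X: (N, D) data matrix, y: (N,) labels in {-1, +1}.
    Returns (theta, t) where t counts the mistakes made along the way.
    """
    N, D = X.shape
    theta = np.zeros(D)               # initialize theta = 0
    t = 0
    converged = False
    while not converged and t < max_updates:
        converged = True
        for i in range(N):            # the "for i = 1 to N" loop variant
            if y[i] * X[i] @ theta <= 0:      # mistake (<= also catches theta = 0)
                theta = theta + y[i] * X[i]   # theta_{t+1} = theta_t + y_i x_i
                t += 1
                converged = False
    return theta, t

# Hypothetical separable toy data; last column is a constant-1 bias feature
X = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 1.0],
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
theta, t = online_perceptron(X, y)
assert np.all(np.sign(X @ theta) == y)   # the returned theta separates the data
```

When the loop exits normally, the final θ makes no mistakes on any point, which is exactly the stopping condition described above.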
Online Perceptron Theorem
Theorem: the online perceptron algorithm converges to zero
error in finite t if we assume
1) all data lie inside a sphere of radius r: ‖x_i‖ ≤ r ∀i
2) the data are separable with margin γ: y_i (θ*)ᵀx_i ≥ γ ∀i
Proof:
• Part 1) Look at the inner product of the current θ^t with θ*;
  assume we just updated on a mistake at point i:
  (θ*)ᵀθ^t = (θ*)ᵀθ^{t−1} + y_i (θ*)ᵀx_i ≥ (θ*)ᵀθ^{t−1} + γ
  after applying t such updates, we must get:
  (θ*)ᵀθ^t ≥ tγ
Online Perceptron Proof
• Part 1) (θ*)ᵀθ^t ≥ tγ
• Part 2) ‖θ^t‖² = ‖θ^{t−1} + y_i x_i‖² = ‖θ^{t−1}‖² + 2 y_i (θ^{t−1})ᵀx_i + ‖x_i‖²
  ≤ ‖θ^{t−1}‖² + ‖x_i‖²  (since we only update on mistakes, the middle term is negative)
  ≤ ‖θ^{t−1}‖² + r²
  ≤ t r²
• Part 3) Angle between the optimal & current solutions:
  cos(θ*, θ^t) = (θ*)ᵀθ^t / (‖θ^t‖ ‖θ*‖) ≥ tγ / (‖θ^t‖ ‖θ*‖) ≥ tγ / (√(t r²) ‖θ*‖)
  (apply part 1, then part 2)
• Since cos ≤ 1 ⇒ tγ / (√(t r²) ‖θ*‖) ≤ 1 ⇒ t ≤ (r²/γ²) ‖θ*‖²
  …so t is finite!
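The bound can be sanity-checked numerically; a sketch on a hypothetical separable toy set (Python; θ* is chosen by hand, not learned, and the data values are illustrative):

```python
import numpy as np

# Hypothetical separable toy data; last column is a constant-1 bias feature
X = np.array([[1.0, 2.0, 1.0], [2.0, 1.0, 1.0],
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

theta_star = np.array([1.0, 1.0, 0.0])    # a separator picked by hand
gamma = np.min(y * (X @ theta_star))      # margin: min_i y_i theta*' x_i
r = np.max(np.linalg.norm(X, axis=1))     # radius of the data sphere

# Run the online perceptron and count the mistakes t
theta, t = np.zeros(3), 0
converged = False
while not converged:
    converged = True
    for i in range(len(y)):
        if y[i] * (X[i] @ theta) <= 0:    # mistake
            theta = theta + y[i] * X[i]
            t += 1
            converged = False

bound = (r ** 2 / gamma ** 2) * np.linalg.norm(theta_star) ** 2
assert t <= bound   # theorem: t <= (r^2 / gamma^2) ||theta*||^2
```

The theorem guarantees the assertion holds for any separating θ*, since the proof never assumed θ* was the maximum-margin separator.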
Multi-Layer Neural Networks
• What if we cascade multiple layers of network?
• Each layer’s output is the input to the next layer
• Each layer has its own weight parameters
• E.g.: each layer has linear nodes (not perceptron/logistic)
• The above Neural Net has 2 layers. What does this give?
Multi-Layer Neural Networks
• Need to introduce non-linearities between layers
• Avoids the previous redundant-linear-layer problem
• The neural network can adjust the basis functions themselves…
  f(x) = g(∑_{i=1}^P θ_i g(θ̃_iᵀ x))
Multi-Layer Neural Networks
• A multi-layer network can handle more complex decisions
• 1-layer: is linear, can’t handle XOR
• Each layer adds more flexibility (but more parameters!)
• Each node splits its input space with a linear hyperplane
• 2-layer: if the last layer is an AND operation, get a convex hull
• 2-layer: can do almost anything a multi-layer network can,
  by fanning out the inputs at the 2nd layer
• Note: without loss of generality, we can omit the 1 and θ0
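For instance, XOR becomes expressible once one hidden layer is added; a sketch with hand-picked weights (the thresholds are hypothetical choices, using 0/1 step units):

```python
def step(u):
    return 1 if u >= 0 else 0       # 0/1 threshold unit

def xor_net(x1, x2):
    # Hidden layer: each node splits the input space with a hyperplane
    h1 = step(x1 + x2 - 0.5)        # fires when at least one input is on
    h2 = step(x1 + x2 - 1.5)        # fires when both inputs are on
    # Output layer: h1 AND (NOT h2), a linear split of the hidden space
    return step(h1 - 2 * h2 - 0.5)

# XOR truth table, which no single linear split can produce
assert [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```

Each hidden node is itself just a perceptron; the second layer combines the two half-space indicators into a non-linearly-separable decision.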
Back-Propagation
• Gradient descent on squared loss is done layer by layer
• Layers: input, hidden, output. Parameters: θ = {w_ij, w_jk, w_kl}
  a_j = ∑_i w_ij z_i,  z_j = g(a_j)
  a_k = ∑_j w_jk z_j,  z_k = g(a_k)
  a_l = ∑_k w_kl z_k,  z_l = g(a_l)
• Each input x^n for n = 1…N generates its own a’s and z’s
• Back-Propagation: splits each layer into its inputs & outputs
• Get the gradient at the output… back-track via the chain rule until the input
Back-Propagation
• Cost function:
  R(θ) = (1/N) ∑_{n=1}^N L^n(y^n − f(x^n))
       = (1/N) ∑_{n=1}^N ½ (y^n − g(∑_k w_kl g(∑_j w_jk g(∑_i w_ij x_i^n))))²
  where we define L^n := ½ (y^n − f(x^n))²
• First compute the output-layer derivative (Chain Rule):
  ∂R/∂w_kl = (1/N) ∑_n [∂L^n/∂a_l^n] (∂a_l^n/∂w_kl)
           = (1/N) ∑_n [∂(½ (y^n − g(a_l^n))²)/∂a_l^n] (∂a_l^n/∂w_kl)
           = (1/N) ∑_n [−(y^n − z_l^n) g′(a_l^n)] z_k^n
           = (1/N) ∑_n δ_l^n z_k^n,  defining δ_l^n := −(y^n − z_l^n) g′(a_l^n)
• Next, the hidden-layer derivative (Multivariate Chain Rule):
  ∂R/∂w_jk = (1/N) ∑_n [∂L^n/∂a_k^n] (∂a_k^n/∂w_jk)
           = (1/N) ∑_n [∑_l (∂L^n/∂a_l^n)(∂a_l^n/∂a_k^n)] (∂a_k^n/∂w_jk)
           = (1/N) ∑_n [∑_l δ_l^n (∂a_l^n/∂a_k^n)] z_j^n
  Recall a_l = ∑_k w_kl g(a_k), so ∂a_l^n/∂a_k^n = w_kl g′(a_k^n):
           = (1/N) ∑_n [∑_l δ_l^n w_kl g′(a_k^n)] z_j^n
           = (1/N) ∑_n δ_k^n z_j^n,  defining δ_k^n := ∑_l δ_l^n w_kl g′(a_k^n)
• Any previous (input) layer derivative: repeat the formula!
  ∂R/∂w_ij = (1/N) ∑_n [∂L^n/∂a_j^n] (∂a_j^n/∂w_ij)
           = (1/N) ∑_n [∑_k (∂L^n/∂a_k^n)(∂a_k^n/∂a_j^n)] (∂a_j^n/∂w_ij)
           = (1/N) ∑_n [∑_k δ_k^n w_jk g′(a_j^n)] z_i^n = (1/N) ∑_n δ_j^n z_i^n
• What is this last z? It is just the input itself: z_i^n = x_i^n
Back-Propagation
• Again, take a small step in the direction opposite to the gradient:
  w_ij^{t+1} = w_ij^t − η ∂R/∂w_ij
  w_jk^{t+1} = w_jk^t − η ∂R/∂w_jk
  w_kl^{t+1} = w_kl^t − η ∂R/∂w_kl
• Early stopping: stop when the error bottoms out on a validation set
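The full forward/backward/update cycle can be sketched with the δ recursion derived above; a minimal version (Python/numpy rather than Matlab; the layer sizes, tanh squashing, toy data, and learning rate are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
g = np.tanh
gprime = lambda a: 1.0 - np.tanh(a) ** 2   # g'(a) for g = tanh

# XOR-style targets with a constant-1 input feature as bias (hypothetical data)
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
y = np.array([[-1.], [1.], [1.], [-1.]])
N = len(X)

# Parameters theta = {w_ij, w_jk, w_kl}: input -> hidden -> hidden -> output
w_ij = rng.normal(scale=0.5, size=(3, 4))
w_jk = rng.normal(scale=0.5, size=(4, 4))
w_kl = rng.normal(scale=0.5, size=(4, 1))

eta = 0.1
losses = []
for epoch in range(2000):
    # Forward pass: a_j = sum_i w_ij z_i, z_j = g(a_j), and so on up the layers
    a_j = X @ w_ij;   z_j = g(a_j)
    a_k = z_j @ w_jk; z_k = g(a_k)
    a_l = z_k @ w_kl; z_l = g(a_l)
    losses.append(np.mean(0.5 * (y - z_l) ** 2))
    # Backward pass: delta_l = -(y - z_l) g'(a_l), then back-propagate
    d_l = -(y - z_l) * gprime(a_l)
    d_k = (d_l @ w_kl.T) * gprime(a_k)   # delta_k = sum_l delta_l w_kl g'(a_k)
    d_j = (d_k @ w_jk.T) * gprime(a_j)   # delta_j = sum_k delta_k w_jk g'(a_j)
    # Gradient step: dR/dw_kl = (1/N) sum_n delta_l^n z_k^n, etc.
    w_kl -= eta * z_k.T @ d_l / N
    w_jk -= eta * z_j.T @ d_k / N
    w_ij -= eta * X.T @ d_j / N

assert losses[-1] < losses[0]            # squared error went down
```

Each matrix product in the backward pass is the vectorized form of one δ summation from the derivation; in practice one would also monitor a held-out validation error for early stopping.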
Neural Networks Demo
• Digits Demo: LeNet… http://yann.lecun.com
• One problem with back-prop is that the MLP over-fits…
• Other problems: hard to interpret, a black-box
• What are the hidden inner layers doing?
Auto-Encoders
• Make the neural net reconstruct the input vector
• Set the target y vector to be the x vector again
• But, it gets narrow in the middle!
• So, there is some information “compression”
• This leads to better generalization
• This is unsupervised learning since we only use the x data
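A minimal sketch of such a bottleneck network (Python/numpy rather than Matlab; the sizes, random data, and tanh encoder are illustrative assumptions) — the target is the input itself:

```python
import numpy as np

rng = np.random.default_rng(1)
g = np.tanh
gprime = lambda a: 1.0 - np.tanh(a) ** 2

X = rng.normal(size=(50, 8))             # unlabeled data: x is also the target
W1 = rng.normal(scale=0.2, size=(8, 3))  # encoder: 8 -> 3 (narrow middle)
W2 = rng.normal(scale=0.2, size=(3, 8))  # decoder: 3 -> 8

eta, N = 0.05, len(X)
errs = []
for epoch in range(500):
    H = g(X @ W1)                        # compressed 3-dimensional code
    Xhat = H @ W2                        # linear reconstruction of x
    errs.append(np.mean((X - Xhat) ** 2))
    d2 = -(X - Xhat)                     # output delta (target is x itself)
    d1 = (d2 @ W2.T) * gprime(X @ W1)    # back-propagated hidden delta
    W2 -= eta * H.T @ d2 / N
    W1 -= eta * X.T @ d1 / N

assert errs[-1] < errs[0]                # reconstruction error decreased
```

Because the 3-unit middle layer cannot copy all 8 inputs through verbatim, training forces it to find a compressed code, which is the “information compression” described above.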
Deep Learning
• We can stack several independently trained auto-encoders
• Using back-propagation, we do pre-training
• Train Net 1 to go from 2000 inputs to 1000 to 2000 inputs
• Train Net 2 to go from 1000 hidden values to 500 to 1000
• Train Net 3 to go from 500 hidden to 30 to 500
• Then do unrolling: link up the learned networks as above
Deep Learning
• Then do fine-tuning of the overall neural network
by running back-propagation on the whole thing to
reconstruct x from x
Deep Learning
• Does good reconstruction!
• Beats PCA on images of unaligned faces.
• PCA is better when face images are aligned…
• We will cover PCA in a few lectures…
Minimum Training Error?
• Is minimizing Empirical Risk the right thing?
• Are Perceptrons and Neural Networks giving the best classifier?
• We are getting: minimum training error, not minimum testing error
• Perceptrons give a bunch of solutions;
  …a solution with guarantees ⇒ SVMs