Tony Jebara, Columbia University
Machine Learning 4771 — Instructor: Tony Jebara

Topic 4
• Tutorial: Matlab
• Perceptron, Online & Stochastic Gradient Descent
• Convergence Guarantee
• Perceptron vs. Linear Regression
• Multi-Layer Neural Networks
• Back-Propagation
• Demo: LeNet
• Deep Learning

Tutorial: Matlab
• Matlab is the most popular language for machine learning
• See www.cs.columbia.edu -> Computing -> Software -> Matlab
• Online info to get started is available at http://www.cs.columbia.edu/~jebara/tutorials.html (Matlab tutorials, list of Matlab function calls)
• Example code: homework #1 will use polyreg.m
• General: help, lookfor, 1:N, rand, zeros, A', reshape, size
• Math: max, min, cov, mean, norm, inv, pinv, det, sort, eye
• Control: if, for, while, end, %, function, return, clear
• Display: figure, clf, axis, close, plot, subplot, hold on, fprintf
• Input/Output: load, save, ginput, print
• The BBS and TAs are also helpful

Perceptron (another Neuron)
• Classification scenario once again, but now with ±1 labels:
  X = {(x1,y1), (x2,y2), …, (xN,yN)},  xi ∈ R^D,  yi ∈ {−1, +1}
• A better choice of squashing function for classification is the sign function:
  g(z) = −1 when z < 0,  +1 when z ≥ 0
• A better choice of loss is the classification loss:
  L(y, f(x;θ)) = step(−y f(x;θ))
• In fact, with the above g(z), squared loss becomes equivalent to classification loss (the squared difference is 4 on a mistake and 0 otherwise):
  R(θ) = 1/(4N) ∑ⁱ (yi − g(θᵀxi))²  ≡  1/N ∑ⁱ step(−yi θᵀxi)
• What does this R(θ) function look like?

Perceptron & Classification Loss
• Classification loss in the risk leads to a hard minimization problem:
  R(θ) = 1/N ∑ⁱ step(−yi θᵀxi)
• R(θ) is Q*bert-like (a staircase of flat plateaus): we can't do gradient descent, since the gradient is zero everywhere except at the edges where a label flips

Perceptron & Perceptron Loss
• Instead of the classification loss
  R(θ) = 1/N ∑ⁱ step(−yi θᵀxi)
  consider the perceptron loss:
  R_per(θ) = −1/N ∑_{i ∈ misclassified} yi θᵀxi
  (plots show each loss L as a function of y f(x;θ))
• Instead of a staircase-shaped R we get a smooth, piece-wise linear R
• We get reasonable gradients for gradient descent:
  ∇θ R_per(θ) = −1/N ∑_{i ∈ misclassified} yi xi
  θ(t+1) = θ(t) − η ∇θ R_per(θ(t)) = θ(t) + η · 1/N ∑_{i ∈ misclassified} yi xi

Perceptron vs. Linear Regression
• Linear regression gets close but does not classify perfectly: classification error = 2, squared error = 0.139
• The perceptron gets zero error: classification error = 0, perceptron error = 0

Stochastic Gradient Descent
• Gradient descent vs. stochastic gradient descent:
• Instead of computing the average gradient over all points and then taking a step,
  ∇θ R_per(θ) = −1/N ∑_{i ∈ misclassified} yi xi
  update with the gradient of each mis-classified point by itself:
  ∇θ R_per(θ) = −yi xi  if i is mis-classified
• Also, set η = 1 without loss of generality:
  θ(t+1) = θ(t) − η ∇θ R_per(θ(t)) = θ(t) + yi xi  if i is mis-classified

Online Perceptron
• Apply stochastic gradient descent to the perceptron to get the "online perceptron" algorithm:
  initialize t = 0 and θ0 = 0
  while not converged {
    pick i ∈ {1,…,N}
    if yi xiᵀθt ≤ 0 {
      θ(t+1) = θ(t) + yi xi
      t = t + 1
    }
  }
• Either pick i randomly or use a "for i = 1 to N" loop
• If the algorithm stops, we have a θ that separates the data
• The total number of mistakes made along the way is t

Online Perceptron Theorem
Theorem: the online perceptron algorithm converges to zero error in finite t if we assume
  1) all data lie inside a sphere of radius r: ‖xi‖ ≤ r ∀i
  2) the data are separable with margin γ: yi (θ*)ᵀxi ≥ γ ∀i
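Before stepping through the proof, the update loop above is short enough to sanity-check empirically. A minimal sketch in Python/NumPy (the course uses Matlab; this version is illustrative only, not the course's code):

```python
import numpy as np

def online_perceptron(X, y, max_epochs=100):
    """Online perceptron: cycle through the points, update on mistakes.

    X: (N, D) data matrix; y: (N,) labels in {-1, +1}.
    Returns (theta, mistakes). Stops only if the data are separable.
    """
    N, D = X.shape
    theta = np.zeros(D)          # theta_0 = 0
    mistakes = 0                 # this is t, the total number of updates
    for _ in range(max_epochs):
        clean_pass = True
        for i in range(N):
            # mistake (or exactly on the boundary): y_i * theta^T x_i <= 0
            if y[i] * X[i].dot(theta) <= 0:
                theta = theta + y[i] * X[i]   # theta_{t+1} = theta_t + y_i x_i
                mistakes += 1
                clean_pass = False
        if clean_pass:           # a full pass with no mistakes: data separated
            break
    return theta, mistakes

# Toy separable data (hypothetical, for illustration)
X = np.array([[2.0, 1.0], [1.5, -0.5], [-1.0, 0.5], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
theta, t = online_perceptron(X, y)
assert (y * X.dot(theta) > 0).all()   # zero training error after convergence
```

The convergence theorem below guarantees that, under the radius and margin assumptions, the number of updates t in this loop is finite.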
Online Perceptron Proof
Proof:
• Part 1) Look at the inner product of the current θt with θ*. Suppose we just updated on a mistake at point i:
  (θ*)ᵀθt = (θ*)ᵀθ(t−1) + yi (θ*)ᵀxi ≥ (θ*)ᵀθ(t−1) + γ
  After applying t such updates, we must get:
  (θ*)ᵀθt ≥ t γ
• Part 2) Bound the norm of the current solution:
  ‖θt‖² = ‖θ(t−1) + yi xi‖² = ‖θ(t−1)‖² + 2 yi θ(t−1)ᵀxi + ‖xi‖²
        ≤ ‖θ(t−1)‖² + ‖xi‖²    (since we only update on mistakes, the middle term is non-positive)
        ≤ ‖θ(t−1)‖² + r²
  so after t updates:  ‖θt‖² ≤ t r²
• Part 3) The angle between the optimal and current solutions:
  cos(θ*, θt) = (θ*)ᵀθt / (‖θt‖ ‖θ*‖) ≥ t γ / (‖θt‖ ‖θ*‖) ≥ t γ / (√(t r²) ‖θ*‖)    (apply Part 1, then Part 2)
  Since cos ≤ 1:  t γ / (√(t r²) ‖θ*‖) ≤ 1  ⇒  t ≤ r² ‖θ*‖² / γ²
  …so t is finite!

Multi-Layer Neural Networks
• What if we cascade multiple layers of network?
• Each layer's output is the input to the next layer
• Each layer has its own weight parameters
• E.g., suppose each layer has linear nodes (not perceptron/logistic) and the neural net has 2 layers. What does this give?

Multi-Layer Neural Networks
• We need to introduce non-linearities between layers
• This avoids the redundant linear-layer problem above (a composition of linear maps is still linear)
• The neural network can then adjust the basis functions themselves:
  f(x) = g(∑_{i=1}^P θi g(θiᵀx))

Multi-Layer Neural Networks
• A multi-layer network can handle more complex decisions
• 1 layer: linear — can't handle XOR
• Each layer adds more flexibility (but more parameters!)
• Each node splits its input space with a linear hyperplane
• 2 layers: if the last layer is an AND operation, we get a convex hull
• 2 layers: can do almost anything a deeper network can by fanning out the inputs at the 2nd layer
• Note: without loss of generality, we can omit the constant 1 input and θ0
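To make the XOR point concrete, here is a small sketch of a 2-layer network of threshold units that computes XOR. The weights are hand-picked for illustration (they are not from the lecture): each hidden unit splits the plane with one hyperplane (one acts like OR, the other like NAND), and the output unit ANDs them:

```python
import numpy as np

def step(z):
    # threshold unit, 0/1 variant of the sign squashing function
    return (z >= 0).astype(float)

# Hidden layer: two hyperplanes. h1 ~ OR(x1, x2), h2 ~ NAND(x1, x2)
W1 = np.array([[ 1.0,  1.0],    # weights into h1
               [-1.0, -1.0]])   # weights into h2
b1 = np.array([-0.5, 1.5])

# Output layer: AND of the two hidden units
w2 = np.array([1.0, 1.0])
b2 = -1.5

def two_layer_net(x):
    h = step(W1.dot(x) + b1)      # each hidden unit = one linear split
    return step(w2.dot(h) + b2)   # AND of the splits = intersection region

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, int(two_layer_net(np.array(x, dtype=float))))
# prints 0, 1, 1, 0 — XOR, which no single hyperplane can produce
```

No single-layer (linear) network can realize this truth table, which is exactly the limitation the extra layer removes.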
Back-Propagation
• Gradient descent on the squared loss is done layer by layer
• Layers: input, hidden, output. Parameters: θ = {wij, wjk, wkl}
• Forward pass (pre-activations a and outputs z at each layer):
  aj = ∑i wij zi,  zj = g(aj)
  ak = ∑j wjk zj,  zk = g(ak)
  al = ∑k wkl zk,  zl = g(al)
• Each input xⁿ for n = 1..N generates its own a's and z's
• Back-propagation splits each layer into its inputs & outputs
• Get the gradient at the output… then back-track with the chain rule toward the input

Back-Propagation
• Cost function:
  R(θ) = 1/N ∑ⁿ Lⁿ,  where Lⁿ := ½ (yⁿ − f(xⁿ))² = ½ (yⁿ − g(∑k wkl g(∑j wjk g(∑i wij xiⁿ))))²
• First compute the output-layer derivative (chain rule):
  ∂R/∂wkl = 1/N ∑ⁿ [∂Lⁿ/∂alⁿ] (∂alⁿ/∂wkl)
          = 1/N ∑ⁿ [∂ ½(yⁿ − g(alⁿ))² / ∂alⁿ] (∂alⁿ/∂wkl)
          = 1/N ∑ⁿ [−(yⁿ − zlⁿ) g′(alⁿ)] zkⁿ
          = 1/N ∑ⁿ δlⁿ zkⁿ,   defining δlⁿ := −(yⁿ − zlⁿ) g′(alⁿ)
• Next, the hidden-layer derivative (multivariate chain rule):
  ∂R/∂wjk = 1/N ∑ⁿ [∂Lⁿ/∂akⁿ] (∂akⁿ/∂wjk)
          = 1/N ∑ⁿ [∑l (∂Lⁿ/∂alⁿ)(∂alⁿ/∂akⁿ)] (∂akⁿ/∂wjk)
          = 1/N ∑ⁿ [∑l δlⁿ (∂alⁿ/∂akⁿ)] zjⁿ
          = 1/N ∑ⁿ [∑l δlⁿ wkl g′(akⁿ)] zjⁿ      (recall al = ∑k wkl g(ak))
          = 1/N ∑ⁿ δkⁿ zjⁿ,   defining δkⁿ := [∑l δlⁿ wkl] g′(akⁿ)
• Any previous (input) layer derivative: repeat the formula!
  ∂R/∂wij = 1/N ∑ⁿ [∂Lⁿ/∂ajⁿ] (∂ajⁿ/∂wij)
          = 1/N ∑ⁿ [∑k (∂Lⁿ/∂akⁿ)(∂akⁿ/∂ajⁿ)] (∂ajⁿ/∂wij)
          = 1/N ∑ⁿ [∑k δkⁿ wjk g′(ajⁿ)] ziⁿ = 1/N ∑ⁿ δjⁿ ziⁿ
• What is this last z? At the input layer, ziⁿ is just the input itself, xiⁿ.

Back-Propagation
• Again, take a small step in the direction opposite to the gradient:
  wij ← wij − η ∂R/∂wij,  wjk ← wjk − η ∂R/∂wjk,  wkl ← wkl − η ∂R/∂wkl
• Early stopping: stop before convergence, when the error bottoms out on a validation set

Neural Networks Demo
• Digits demo: LeNet… http://yann.lecun.com
• A problem with back-prop is that the MLP over-fits
• Other problems: hard to interpret, a black box — what are the hidden inner layers doing?

Auto-Encoders
• Make the neural net reconstruct the input vector
• Set the target y vector to be the x vector again
• But, it gets narrow in the middle!
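The bottleneck idea can be sketched with a toy auto-encoder trained by the back-propagation deltas derived above. This is a hypothetical minimal example (a 4–2–4 network with sigmoid hidden units and linear outputs; the sizes, learning rate, and data are made up, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Data with 2-dimensional structure embedded in 4 dimensions,
# so a 2-unit "narrow middle" can capture it.
B = rng.standard_normal((2, 4))
X = rng.standard_normal((20, 2)) @ B          # 20 points in R^4, rank 2

D, H = 4, 2                                   # 4 -> 2 (bottleneck) -> 4
W1 = 0.1 * rng.standard_normal((H, D)); b1 = np.zeros(H)
W2 = 0.1 * rng.standard_normal((D, H)); b2 = np.zeros(D)
eta = 0.05

def forward(x):
    a1 = W1 @ x + b1
    z1 = sigmoid(a1)                          # hidden: z_k = g(a_k)
    out = W2 @ z1 + b2                        # linear output: z_l = a_l
    return z1, out

def loss(X):
    return np.mean([0.5 * np.sum((forward(x)[1] - x) ** 2) for x in X])

loss0 = loss(X)
for epoch in range(500):                      # stochastic gradient descent
    for x in X:                               # target y = x (reconstruction)
        z1, out = forward(x)
        delta_out = out - x                   # delta_l = -(y - z_l), g'(a_l) = 1
        delta_hid = (W2.T @ delta_out) * z1 * (1 - z1)   # delta_k = [sum_l delta_l w_kl] g'(a_k)
        W2 -= eta * np.outer(delta_out, z1);  b2 -= eta * delta_out
        W1 -= eta * np.outer(delta_hid, x);   b1 -= eta * delta_hid
loss1 = loss(X)
print(f"reconstruction loss: {loss0:.3f} -> {loss1:.3f}")
```

The squeeze through 2 hidden units forces the network to find a compact code for the 4-dimensional inputs, which is the point made next.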
• So, there is some information "compression" in the middle
• This leads to better generalization
• This is unsupervised learning, since we only use the x data

Deep Learning
• We can stack several independently trained auto-encoders
• Using back-propagation, we do pre-training:
• Train Net 1 to go from 2000 inputs to 1000 to 2000 outputs
• Train Net 2 to go from the 1000 hidden values to 500 to 1000
• Train Net 3 to go from the 500 hidden values to 30 to 500
• Then, do unrolling — link up the learned networks as above

Deep Learning
• Then do fine-tuning of the overall neural network by running back-propagation on the whole thing to reconstruct x from x

Deep Learning
• This does good reconstruction!
• It beats PCA on images of unaligned faces
• PCA is better when the face images are aligned…
• We will cover PCA in a few lectures

Minimum Training Error?
• Is minimizing empirical risk the right thing to do?
• Are perceptrons and neural networks giving the best classifier?
• We are getting minimum training error, not minimum testing error
• Perceptrons give a whole bunch of zero-training-error solutions
• ⇒ SVMs … a solution with guarantees
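The last point — that the perceptron returns just one of many zero-training-error solutions — is easy to demonstrate: running the online perceptron on the same separable data in two different presentation orders yields two different separating hyperplanes. A small sketch (hypothetical data, for illustration only):

```python
import numpy as np

def perceptron(X, y, order, max_epochs=100):
    # Online perceptron, visiting points in the given order until a clean pass.
    theta = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for i in order:
            if y[i] * X[i].dot(theta) <= 0:
                theta = theta + y[i] * X[i]
                mistakes += 1
        if mistakes == 0:
            break
    return theta

X = np.array([[3., 1.], [2., 2.], [-1., -2.], [-2., -1.]])
y = np.array([1., 1., -1., -1.])

t1 = perceptron(X, y, [0, 1, 2, 3])   # forward order
t2 = perceptron(X, y, [3, 2, 1, 0])   # reversed order

# Both achieve zero training error...
assert (y * X.dot(t1) > 0).all() and (y * X.dot(t2) > 0).all()
# ...but they point in different directions: different separating hyperplanes.
print(t1 / np.linalg.norm(t1), t2 / np.linalg.norm(t2))
```

Among all these equally good training solutions, the SVM (next) picks the one with the largest margin — a choice that comes with generalization guarantees.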