Artificial Neural Networks
[Read Ch. 4]
[Recommended exercises 4.1, 4.2, 4.5, 4.9, 4.11]
(lecture slides for textbook Machine Learning, T. Mitchell, McGraw Hill, 1997)

- Threshold units
- Gradient descent
- Multilayer networks
- Backpropagation
- Hidden layer representations
- Example: Face Recognition
- Advanced topics

Connectionist Models

Consider humans:
- Neuron switching time ~ 0.001 second
- Number of neurons ~ 10^10
- Connections per neuron ~ 10^(4-5)
- Scene recognition time ~ 0.1 second
- 100 inference steps doesn't seem like enough -> much parallel computation

Properties of artificial neural nets (ANNs):
- Many neuron-like threshold switching units
- Many weighted interconnections among units
- Highly parallel, distributed process
- Emphasis on tuning weights automatically

When to Consider Neural Networks

- Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
- Output is discrete or real-valued
- Output is a vector of values
- Possibly noisy data
- Form of target function is unknown
- Human readability of result is unimportant

Examples:
- Speech phoneme recognition [Waibel]
- Image classification [Kanade, Baluja, Rowley]
- Financial prediction

ALVINN drives 70 mph on highways

[Figure: the ALVINN network -- a 30x32 sensor input retina feeds 4 hidden units, which feed 30 output units ranging from Sharp Left through Straight Ahead to Sharp Right.]

Perceptron

[Figure: perceptron unit -- inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$ plus constant input $x_0 = 1$ with weight $w_0$; the weighted sum $\sum_{i=0}^{n} w_i x_i$ is thresholded to produce the output $o$.]

$o(x_1, \dots, x_n) = \begin{cases} 1 & \text{if } w_0 + w_1 x_1 + \cdots + w_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$

Sometimes we'll use simpler vector notation:

$o(\vec{x}) = \begin{cases} 1 & \text{if } \vec{w} \cdot \vec{x} > 0 \\ -1 & \text{otherwise} \end{cases}$

Decision Surface of a Perceptron

[Figure: (a) a linearly separable set of + and - points in the $x_1, x_2$ plane, split by a perceptron's decision line; (b) a configuration of + and - points (like XOR) that no single line can separate.]

- Represents some useful functions
  - What weights represent $g(x_1, x_2) = \mathrm{AND}(x_1, x_2)$?
- But some functions are not representable
  - e.g., those that are not linearly separable
  - Therefore, we'll want networks of these...

Perceptron training rule

$w_i \leftarrow w_i + \Delta w_i$

where

$\Delta w_i = \eta (t - o) x_i$

Where:
- $t = c(\vec{x})$ is target value
- $o$ is perceptron output
- $\eta$ is a small constant (e.g., 0.1) called the learning rate

Perceptron training rule

Can prove it will converge
- if training data is linearly separable
- and $\eta$ is sufficiently small

Gradient Descent

To understand, consider a simpler linear unit, where

$o = w_0 + w_1 x_1 + \cdots + w_n x_n$

Let's learn $w_i$'s that minimize the squared error

$E[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$

where $D$ is the set of training examples.

Gradient Descent

[Figure: the error surface $E[\vec{w}]$ plotted over $(w_0, w_1)$ -- a parabolic bowl with a single global minimum.]

Gradient

$\nabla E[\vec{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \dots, \frac{\partial E}{\partial w_n} \right]$

Training rule:

$\Delta \vec{w} = -\eta \nabla E[\vec{w}]$

i.e.,

$\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$

Gradient Descent

$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_d (t_d - o_d)^2$
$= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2$
$= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)$
$= \sum_d (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \vec{w} \cdot \vec{x}_d)$
$\frac{\partial E}{\partial w_i} = \sum_d (t_d - o_d)(-x_{i,d})$
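Setting $\Delta w_i = -\eta \, \partial E / \partial w_i$ with the gradient just derived gives the batch update $\Delta w_i = \eta \sum_d (t_d - o_d) x_{i,d}$. Below is a minimal NumPy sketch of this update for a single linear unit; the names (`train_linear_unit`, `X`, `t`, `eta`) and the hyperparameter values are illustrative assumptions, not from the slides, and `X` is assumed to carry the constant input $x_0 = 1$ in its first column.

```python
import numpy as np

def train_linear_unit(X, t, eta=0.05, epochs=1000):
    """Batch gradient descent on E[w] = 1/2 * sum_d (t_d - o_d)^2 for a linear unit.

    X: (num_examples, num_weights) array whose first column is the constant input x0 = 1.
    t: (num_examples,) array of target values.
    """
    w = np.random.uniform(-0.05, 0.05, X.shape[1])  # small random initial weights
    for _ in range(epochs):
        o = X @ w                    # linear unit outputs o_d = w . x_d
        grad = -(t - o) @ X          # dE/dw_i = sum_d (t_d - o_d)(-x_{i,d})
        w = w - eta * grad           # w_i <- w_i - eta * dE/dw_i
    return w
```

For instance, `train_linear_unit(np.column_stack([np.ones(4), np.arange(4)]), 2 * np.arange(4) + 1)` recovers weights close to (1, 2), the minimum-squared-error hypothesis for the targets $t = 1 + 2x$.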
Gradient-Descent(training_examples, $\eta$)

Each training example is a pair of the form $\langle \vec{x}, t \rangle$, where $\vec{x}$ is the vector of input values, and $t$ is the target output value. $\eta$ is the learning rate (e.g., 0.05).

- Initialize each $w_i$ to some small random value
- Until the termination condition is met, Do
  - Initialize each $\Delta w_i$ to zero
  - For each $\langle \vec{x}, t \rangle$ in training_examples, Do
    - Input the instance $\vec{x}$ to the unit and compute the output $o$
    - For each linear unit weight $w_i$, Do
      $\Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i$
  - For each linear unit weight $w_i$, Do
    $w_i \leftarrow w_i + \Delta w_i$

Summary

Perceptron training rule guaranteed to succeed if
- Training examples are linearly separable
- Sufficiently small learning rate $\eta$

Linear unit training rule uses gradient descent
- Guaranteed to converge to the hypothesis with minimum squared error
- Given sufficiently small learning rate $\eta$
- Even when training data contains noise
- Even when training data is not separable by $H$

Incremental (Stochastic) Gradient Descent

Batch mode Gradient Descent:
Do until satisfied
1. Compute the gradient $\nabla E_D[\vec{w}]$
2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_D[\vec{w}]$

Incremental mode Gradient Descent:
Do until satisfied
- For each training example $d$ in $D$
  1. Compute the gradient $\nabla E_d[\vec{w}]$
  2. $\vec{w} \leftarrow \vec{w} - \eta \nabla E_d[\vec{w}]$

$E_D[\vec{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$

$E_d[\vec{w}] \equiv \frac{1}{2} (t_d - o_d)^2$

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if $\eta$ is made small enough.

Multilayer Networks of Sigmoid Units

[Figure: a multilayer sigmoid network trained to distinguish vowel sounds such as "head", "hid", "who'd", "hood" from the formant inputs F1 and F2, with the network's nonlinear decision regions plotted in the F1-F2 plane.]

Sigmoid Unit

[Figure: sigmoid unit -- inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$ plus constant input $x_0 = 1$ with weight $w_0$; $net = \sum_{i=0}^{n} w_i x_i$ and $o = \sigma(net) = \frac{1}{1 + e^{-net}}$.]

$\sigma(x)$ is the sigmoid function

$\sigma(x) = \frac{1}{1 + e^{-x}}$

Nice property: $\frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))$

We can derive gradient descent rules to train
- One sigmoid unit
- Multilayer networks of sigmoid units -> Backpropagation

Error Gradient for a Sigmoid Unit

$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
$= \frac{1}{2} \sum_d \frac{\partial}{\partial w_i} (t_d - o_d)^2$
$= \frac{1}{2} \sum_d 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)$
$= \sum_d (t_d - o_d) \left( -\frac{\partial o_d}{\partial w_i} \right)$
$= -\sum_d (t_d - o_d) \frac{\partial o_d}{\partial net_d} \frac{\partial net_d}{\partial w_i}$

But we know:

$\frac{\partial o_d}{\partial net_d} = \frac{\partial \sigma(net_d)}{\partial net_d} = o_d (1 - o_d)$

$\frac{\partial net_d}{\partial w_i} = \frac{\partial (\vec{w} \cdot \vec{x}_d)}{\partial w_i} = x_{i,d}$

So:

$\frac{\partial E}{\partial w_i} = -\sum_{d \in D} (t_d - o_d) \, o_d (1 - o_d) \, x_{i,d}$

Backpropagation Algorithm

Initialize all weights to small random numbers.
Until satisfied, Do
- For each training example, Do
  1. Input the training example to the network and compute the network outputs
  2. For each output unit $k$
     $\delta_k \leftarrow o_k (1 - o_k)(t_k - o_k)$
  3. For each hidden unit $h$
     $\delta_h \leftarrow o_h (1 - o_h) \sum_{k \in outputs} w_{h,k} \delta_k$
  4. Update each network weight $w_{i,j}$
     $w_{i,j} \leftarrow w_{i,j} + \Delta w_{i,j}$
     where $\Delta w_{i,j} = \eta \, \delta_j \, x_{i,j}$

More on Backpropagation

- Gradient descent over the entire network weight vector
- Easily generalized to arbitrary directed graphs
- Will find a local, not necessarily global, error minimum
  - In practice, often works well (can run multiple times)
- Often include weight momentum $\alpha$:
  $\Delta w_{i,j}(n) = \eta \, \delta_j \, x_{i,j} + \alpha \, \Delta w_{i,j}(n-1)$
- Minimizes error over training examples
  - Will it generalize well to subsequent examples?
- Training can take thousands of iterations -> slow!
- Using the network after training is very fast
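The update rules of the Backpropagation Algorithm, together with the momentum term above, can be written out for a network with one hidden layer as in the sketch below. This is only an illustration under assumed conventions (weight matrices `W_hidden` and `W_output` with one row per unit, bias handled as a constant first input); the names are not from the slides.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def backprop_step(x, target, W_hidden, W_output, eta=0.1, alpha=0.0, prev=None):
    """One stochastic backpropagation update for a network with one hidden layer.

    x:        (n_in,) input vector, assumed to include the constant input x0 = 1.
    target:   (n_out,) target output vector.
    W_hidden: (n_hidden, n_in) weights into the hidden units (updated in place).
    W_output: (n_out, n_hidden) weights into the output units (updated in place).
    prev:     previous (dW_output, dW_hidden) pair, used for the momentum term.
    """
    # 1. Forward pass: compute the network outputs.
    o_h = sigmoid(W_hidden @ x)                      # hidden unit outputs o_h
    o_k = sigmoid(W_output @ o_h)                    # output unit outputs o_k

    # 2. delta_k = o_k (1 - o_k)(t_k - o_k) for each output unit k.
    delta_k = o_k * (1 - o_k) * (target - o_k)

    # 3. delta_h = o_h (1 - o_h) * sum_k w_{h,k} delta_k for each hidden unit h.
    delta_h = o_h * (1 - o_h) * (W_output.T @ delta_k)

    # 4. Delta w_{i,j} = eta * delta_j * x_{i,j}, plus momentum alpha * previous Delta.
    dW_output = eta * np.outer(delta_k, o_h)
    dW_hidden = eta * np.outer(delta_h, x)
    if prev is not None:
        dW_output += alpha * prev[0]
        dW_hidden += alpha * prev[1]

    W_output += dW_output
    W_hidden += dW_hidden
    return dW_output, dW_hidden                      # pass back in as prev on the next call
```

A training loop simply iterates over the examples repeatedly, feeding the returned weight changes back in as `prev`, which reproduces $\Delta w_{i,j}(n) = \eta \, \delta_j x_{i,j} + \alpha \, \Delta w_{i,j}(n-1)$.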
Learning Hidden Layer Representations

A target function:

Input      ->  Output
10000000   ->  10000000
01000000   ->  01000000
00100000   ->  00100000
00010000   ->  00010000
00001000   ->  00001000
00000100   ->  00000100
00000010   ->  00000010
00000001   ->  00000001

Can this be learned??

Learning Hidden Layer Representations

A network: [Figure: an 8x3x8 network -- eight input units, three hidden units, eight output units.]

Learned hidden layer representation:

Input      ->  Hidden Values   ->  Output
10000000   ->  .89 .04 .08     ->  10000000
01000000   ->  .01 .11 .88     ->  01000000
00100000   ->  .01 .97 .27     ->  00100000
00010000   ->  .99 .97 .71     ->  00010000
00001000   ->  .03 .05 .02     ->  00001000
00000100   ->  .22 .99 .99     ->  00000100
00000010   ->  .80 .01 .98     ->  00000010
00000001   ->  .60 .94 .01     ->  00000001

Training

[Figure: sum of squared errors for each output unit, plotted over roughly 2500 training epochs.]

Training

[Figure: hidden unit encoding for input 01000000, plotted over roughly 2500 training epochs.]

Training

[Figure: weights from the inputs to one hidden unit, plotted over roughly 2500 training epochs.]

Convergence of Backpropagation

Gradient descent to some local minimum
- Perhaps not the global minimum...
- Add momentum
- Stochastic gradient descent
- Train multiple nets with different initial weights

Nature of convergence
- Initialize weights near zero
- Therefore, initial networks are near-linear
- Increasingly non-linear functions become possible as training progresses

Expressive Capabilities of ANNs

Boolean functions:
- Every boolean function can be represented by a network with a single hidden layer
- but it might require exponentially many (in the number of inputs) hidden units

Continuous functions:
- Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
- Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

Overfitting in ANNs

[Figure: two runs of "Error versus weight updates" -- in each, training-set error keeps decreasing while validation-set error eventually flattens or begins to rise.]

(A minimal early-stopping sketch follows the face-recognition slides below.)

Neural Nets for Face Recognition

[Figure: a network with 30x32 image inputs, a small hidden layer, and four outputs (left, strt, rght, up), shown alongside typical input images.]

90% accurate at learning head pose, and at recognizing 1-of-20 faces

Learned Hidden Unit Weights

[Figure: the same 30x32-input network with each hidden unit's learned weights displayed as an image, alongside typical input images.]

http://www.cs.cmu.edu/tom/faces.html
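As the overfitting curves above suggest, a common remedy is to monitor error on a separate validation set and keep the weights that minimize it. The sketch below assumes hypothetical `train_step` and `error` callables (placeholders, not anything defined in the slides); it simply remembers the best weights seen so far.

```python
import copy

def train_with_early_stopping(weights, train_step, error, train_set, validation_set,
                              max_updates=20000, check_every=100):
    """Run weight updates but return the weights with the lowest validation-set error.

    train_step(weights, train_set): performs one weight update in place (e.g., backprop).
    error(weights, data):           returns the summed squared error on data.
    """
    best_weights = copy.deepcopy(weights)
    best_val_error = error(weights, validation_set)
    for update in range(1, max_updates + 1):
        train_step(weights, train_set)                  # keep training...
        if update % check_every == 0:
            val_error = error(weights, validation_set)  # ...but watch the validation error
            if val_error < best_val_error:              # remember the best weights so far
                best_val_error = val_error
                best_weights = copy.deepcopy(weights)
    return best_weights
```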
Alternative Error Functions

Penalize large weights:

$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2 + \gamma \sum_{i,j} w_{ji}^2$

Train on target slopes as well as values:

$E(\vec{w}) \equiv \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} \left[ (t_{kd} - o_{kd})^2 + \mu \sum_{j \in inputs} \left( \frac{\partial t_{kd}}{\partial x_d^j} - \frac{\partial o_{kd}}{\partial x_d^j} \right)^2 \right]$

Tie together weights:
- e.g., in a phoneme recognition network

Recurrent Networks

[Figure: (a) a feedforward network mapping $x(t)$ to $y(t+1)$; (b) a recurrent network whose hidden activations are copied into context units $c(t)$ and fed back as extra inputs; (c) the same recurrent network unfolded in time, with copies of the network for $x(t)$, $x(t-1)$, $x(t-2)$ sharing weights.]
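To make panels (b) and (c) concrete, here is a minimal sketch of the forward pass of such a recurrent network (an Elman-style architecture): the hidden activations become the context units $c(t)$ that are fed back at the next step, and "unfolding in time" just means repeating this step with shared weights. The matrix names and the use of sigmoid units are assumptions for illustration, not details given on the slide.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def run_recurrent(xs, W_in, W_context, W_out):
    """Forward pass of a simple recurrent network over an input sequence xs.

    W_in:      (n_hidden, n_in) weights from x(t) to the hidden units.
    W_context: (n_hidden, n_hidden) weights from the context units c(t).
    W_out:     (n_out, n_hidden) weights from the hidden units to y(t+1).
    """
    c = np.zeros(W_context.shape[1])              # context units c(0), initially zero
    ys = []
    for x in xs:                                  # x(t) for t = 0, 1, 2, ...
        h = sigmoid(W_in @ x + W_context @ c)     # hidden units see both x(t) and c(t)
        ys.append(sigmoid(W_out @ h))             # y(t + 1)
        c = h                                     # c(t + 1) is a copy of the hidden units
    return ys
```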