Part 7: Neural Networks & Learning
Lecture 28 (12/3/07)

Gradient Ascent Process

$$\dot P = \eta\, \nabla F(P)$$

Change in fitness:

$$\dot F = \frac{dF}{dt} = \sum_{k=1}^{m} \frac{\partial F}{\partial P_k} \frac{d P_k}{dt} = \sum_{k=1}^{m} (\nabla F)_k\, \dot P_k = \nabla F \cdot \dot P$$

$$\dot F = \nabla F \cdot \eta\, \nabla F = \eta\, \|\nabla F\|^2 \ge 0$$

Therefore gradient ascent increases fitness (until it reaches a zero gradient).

General Ascent in Fitness

Note that any adaptive process P(t) will increase fitness provided:

$$0 < \dot F = \nabla F \cdot \dot P = \|\nabla F\|\, \|\dot P\| \cos\varphi,$$

where φ is the angle between ∇F and Ṗ. Hence we need cos φ > 0, i.e., φ < 90°.

[Figure: General Ascent on Fitness Surface — a trajectory climbing a fitness surface, with ∇F shown at each point]

Fitness as Minimum Error

Suppose for Q different inputs we have target outputs t¹, …, t^Q.
Suppose for parameters P the corresponding actual outputs are y¹, …, y^Q.
Suppose D(t, y) ∈ [0, ∞) measures the difference between target and actual outputs.
Let E^q = D(t^q, y^q) be the error on the qth sample, and let

$$F(P) = -\sum_{q=1}^{Q} E^q(P) = -\sum_{q=1}^{Q} D[t^q, y^q(P)].$$

Gradient of Fitness

$$\nabla F = \nabla\Big({-\sum_q E^q}\Big) = -\sum_q \nabla E^q$$

$$\frac{\partial E^q}{\partial P_k} = \frac{\partial}{\partial P_k} D(t^q, y^q) = \sum_j \frac{\partial D(t^q, y^q)}{\partial y_j^q} \frac{\partial y_j^q}{\partial P_k} = \frac{d D(t^q, y^q)}{d y^q} \cdot \frac{\partial y^q}{\partial P_k}$$

Jacobian Matrix

Define the Jacobian matrix

$$J^q = \begin{pmatrix} \dfrac{\partial y_1^q}{\partial P_1} & \cdots & \dfrac{\partial y_1^q}{\partial P_m} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_n^q}{\partial P_1} & \cdots & \dfrac{\partial y_n^q}{\partial P_m} \end{pmatrix}$$

Note that J^q ∈ ℝ^{n×m} and ∇D(t^q, y^q) ∈ ℝ^{n×1}. Since

$$(\nabla E^q)_k = \frac{\partial E^q}{\partial P_k} = \sum_j \frac{\partial D(t^q, y^q)}{\partial y_j^q} \frac{\partial y_j^q}{\partial P_k},$$

we have

$$\nabla E^q = (J^q)^{\mathsf T}\, \nabla D(t^q, y^q).$$

Derivative of Squared Euclidean Distance

Suppose D(t, y) = ‖t − y‖² = Σ_i (t_i − y_i)². Then

$$\frac{\partial D(t, y)}{\partial y_j} = \frac{\partial}{\partial y_j} \sum_i (t_i - y_i)^2 = \frac{d (t_j - y_j)^2}{d y_j} = -2 (t_j - y_j),$$

so

$$\frac{d D(t, y)}{d y} = 2 (y - t).$$

Gradient of Error on qth Input

$$\frac{\partial E^q}{\partial P_k} = \frac{d D(t^q, y^q)}{d y^q} \cdot \frac{\partial y^q}{\partial P_k} = 2 (y^q - t^q) \cdot \frac{\partial y^q}{\partial P_k} = 2 \sum_j (y_j^q - t_j^q) \frac{\partial y_j^q}{\partial P_k}$$

$$\nabla E^q = 2\, (J^q)^{\mathsf T} (y^q - t^q)$$

Recap

$$\dot P = \eta \sum_q (J^q)^{\mathsf T} (t^q - y^q)$$

(the factor of 2 is absorbed into η). To know how to decrease the differences between actual and desired outputs,
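As a numeric sanity check on the formulas above, here is a short sketch. The linear model y = M x, with the parameter vector P holding the flattened entries of M, is an illustrative assumption, not from the slides; for it the Jacobian J^q is constant, so we can verify ∇E^q = 2(J^q)ᵀ(y^q − t^q) against finite differences and then take a few gradient-ascent steps on F = −E^q, which should increase fitness (decrease error).

```python
import numpy as np

# Illustrative model (an assumption, not from the lecture): y = M x, where the
# parameter vector P holds the flattened entries of M.

def outputs(P, x, n_out, n_in):
    return P.reshape(n_out, n_in) @ x               # actual output y^q

def error(P, x, t, n_out, n_in):
    y = outputs(P, x, n_out, n_in)
    return np.sum((t - y) ** 2)                     # E^q = D(t, y) = ||t - y||^2

rng = np.random.default_rng(0)
n_out, n_in = 3, 4                                  # m = 12 parameters
P = rng.normal(size=n_out * n_in)
x = rng.normal(size=n_in)
t = rng.normal(size=n_out)

# Jacobian of y w.r.t. P: dy_i / dP_{i*n_in + k} = x_k (block structure)
J = np.zeros((n_out, n_out * n_in))
for i in range(n_out):
    J[i, i * n_in:(i + 1) * n_in] = x

# grad E^q = 2 (J^q)^T (y^q - t^q), checked against central differences
y = outputs(P, x, n_out, n_in)
grad = 2 * J.T @ (y - t)
eps = 1e-6
fd = np.array([(error(P + eps * e, x, t, n_out, n_in)
                - error(P - eps * e, x, t, n_out, n_in)) / (2 * eps)
               for e in np.eye(P.size)])
gap = np.max(np.abs(grad - fd))                     # tiny (round-off only)

# Gradient ascent on F = -E^q: P_dot = eta * grad F = 2 eta J^T (t - y)
eta = 0.01
E0 = error(P, x, t, n_out, n_in)
for _ in range(20):
    y = outputs(P, x, n_out, n_in)
    P = P + eta * 2 * J.T @ (t - y)
E1 = error(P, x, t, n_out, n_in)
print(gap, E0, "->", E1)                            # error falls, fitness rises
```

Because E^q is quadratic in P here, the analytic and finite-difference gradients agree to round-off, and each small ascent step on F strictly reduces the error.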
we need to know the elements of the Jacobian, ∂y_j^q/∂P_k, which tell us how the jth output varies with the kth parameter (given the qth input). The Jacobian depends on the specific form of the system: in this case, a feedforward neural network.

Multilayer Notation

[Figure: x^q = s¹ → (W¹) → s² → (W²) → ⋯ → (W^{L−1}) → s^L = y^q]

• L layers of neurons, labeled 1, …, L
• N_l neurons in layer l
• s^l = vector of outputs from the neurons in layer l
• Input layer: s¹ = x^q (the input pattern)
• Output layer: s^L = y^q (the actual output)
• W^l = weights between layers l and l + 1

Problem: find how the outputs y_i^q vary with the weights W_{jk}^l (l = 1, …, L − 1).

Typical Neuron

[Figure: inputs s_1^{l−1}, …, s_N^{l−1}, weighted by W_{i1}^{l−1}, …, W_{iN}^{l−1}, are summed into the net input h_i^l, which passes through σ to give the output s_i^l]

Error Back-Propagation

We will compute ∂E^q/∂W_{ij}^l starting with the last layer (l = L − 1) and working back to the earlier layers (l = L − 2, …, 1).

Delta Values

It is convenient to break the derivatives up by the chain rule:

$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \frac{\partial E^q}{\partial h_i^l} \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$$

Let

$$\delta_i^l = \frac{\partial E^q}{\partial h_i^l},$$

so that

$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\, \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}.$$

Output-Layer Neuron

[Figure: output neuron i, with net input h_i^L and output s_i^L = y_i^q, compared against the target t_i^q to give the error E^q]

Output-Layer Derivatives (1)

$$\delta_i^L = \frac{\partial E^q}{\partial h_i^L} = \frac{\partial}{\partial h_i^L} \sum_k (s_k^L - t_k^q)^2 = \frac{d (s_i^L - t_i^q)^2}{d h_i^L} = 2 (s_i^L - t_i^q)\, \frac{d s_i^L}{d h_i^L} = 2 (s_i^L - t_i^q)\, \sigma'(h_i^L)$$

Output-Layer Derivatives (2)

$$\frac{\partial h_i^L}{\partial W_{ij}^{L-1}} = \frac{\partial}{\partial W_{ij}^{L-1}} \sum_k W_{ik}^{L-1} s_k^{L-1} = s_j^{L-1}$$

$$\therefore\ \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L\, s_j^{L-1}, \quad \text{where } \delta_i^L = 2 (s_i^L - t_i^q)\, \sigma'(h_i^L)$$

Hidden-Layer Neuron

[Figure: hidden neuron i receives s_1^{l−1}, …, s_N^{l−1} through the weights W_{ij}^{l−1} and feeds every neuron k of layer l + 1 through the weights W_{ki}^l; E^q depends on h_i^l only through s_1^{l+1}, …, s_N^{l+1}]

Hidden-Layer Derivatives (1)

Recall

$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\, \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}.$$

Since h_k^{l+1} = Σ_m W_{km}^l s_m^l,

$$\frac{\partial h_k^{l+1}}{\partial h_i^l} = \frac{\partial \sum_m W_{km}^l s_m^l}{\partial h_i^l} = W_{ki}^l\, \frac{d \sigma(h_i^l)}{d h_i^l} = W_{ki}^l\, \sigma'(h_i^l).$$
By the chain rule,

$$\delta_i^l = \frac{\partial E^q}{\partial h_i^l} = \sum_k \frac{\partial E^q}{\partial h_k^{l+1}} \frac{\partial h_k^{l+1}}{\partial h_i^l} = \sum_k \delta_k^{l+1}\, \frac{\partial h_k^{l+1}}{\partial h_i^l},$$

and therefore

$$\delta_i^l = \sum_k \delta_k^{l+1} W_{ki}^l\, \sigma'(h_i^l) = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l.$$

Hidden-Layer Derivatives (2)

$$\frac{\partial h_i^l}{\partial W_{ij}^{l-1}} = \frac{\partial}{\partial W_{ij}^{l-1}} \sum_k W_{ik}^{l-1} s_k^{l-1} = s_j^{l-1}$$

$$\therefore\ \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\, s_j^{l-1}, \quad \text{where } \delta_i^l = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$$

Derivative of Sigmoid

Suppose s = σ(h) = 1 / (1 + exp(−αh)) (logistic sigmoid). Then

$$D_h s = D_h [1 + \exp(-\alpha h)]^{-1} = -[1 + \exp(-\alpha h)]^{-2}\, D_h (1 + e^{-\alpha h}) = -(1 + e^{-\alpha h})^{-2} (-\alpha e^{-\alpha h})$$

$$= \alpha\, \frac{e^{-\alpha h}}{(1 + e^{-\alpha h})^2} = \alpha\, \frac{1}{1 + e^{-\alpha h}}\, \frac{e^{-\alpha h}}{1 + e^{-\alpha h}} = \alpha s \left( \frac{1 + e^{-\alpha h}}{1 + e^{-\alpha h}} - \frac{1}{1 + e^{-\alpha h}} \right) = \alpha s (1 - s)$$

Summary of Back-Propagation Algorithm

• Output layer:

$$\delta_i^L = 2 \alpha\, s_i^L (1 - s_i^L)(s_i^L - t_i^q), \qquad \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L\, s_j^{L-1}$$

• Hidden layers:

$$\delta_i^l = \alpha\, s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l, \qquad \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\, s_j^{l-1}$$

Output-Layer Computation

[Figure: output-layer weight-update circuit, with learning rate η and gain α]

$$\Delta W_{ij}^{L-1} = \eta\, \delta_i^L\, s_j^{L-1}, \qquad \delta_i^L = 2 \alpha\, s_i^L (1 - s_i^L)(t_i^q - s_i^L)$$

(Note the sign: here δ_i^L is written with t_i^q − s_i^L, so the update ΔW = ηδs moves down the error gradient.)

Hidden-Layer Computation

[Figure: hidden-layer weight-update circuit, combining the back-propagated deltas δ_1^{l+1}, …, δ_N^{l+1}]

$$\Delta W_{ij}^{l-1} = \eta\, \delta_i^l\, s_j^{l-1}, \qquad \delta_i^l = \alpha\, s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$$

Training Procedures

• Batch learning
  – on each epoch (one pass through all the training pairs), the weight changes for all patterns are accumulated
  – the weight matrices are updated at the end of the epoch
  – accurate computation of the gradient
• Online learning
  – the weights are updated after back-propagation of each training pair
  – usually randomize the order for each epoch
  – approximation of the gradient
• In practice it doesn't make much difference

Summation of Error Surfaces

[Figure: the total error surface E as the sum of the per-pattern surfaces E1 and E2]

Gradient Computation in Batch Learning

[Figure: a batch step follows the gradient of the summed surface E]

Gradient Computation in Online Learning

[Figure: online steps follow the gradients of E1 and E2 alternately]

The Golden Rule of Neural Nets

"Neural networks are the second-best way to do everything!"
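The back-propagation summary above can be sketched in code. This is a minimal illustration on the XOR problem, not the lecture's own program: the layer sizes, the learning rate η, the gain α, the number of epochs, and the bias trick (a constant-1 unit appended to each layer, which the slides do not show) are all illustrative assumptions. It uses the online procedure with a randomized pattern order per epoch.

```python
import numpy as np

# Online back-propagation for a 2-3-1 logistic-sigmoid network (plus bias
# units, an added assumption), following the summary formulas:
#   output layer: delta^L = 2 alpha s^L (1 - s^L) (t - s^L)
#   hidden layer: delta^l = alpha s^l (1 - s^l) sum_k delta_k^{l+1} W_ki^l
#   update:       Delta W^{l-1} = eta * outer(delta^l, s^{l-1})
alpha, eta = 1.0, 0.5

def sigma(h):
    return 1.0 / (1.0 + np.exp(-alpha * h))         # logistic sigmoid

def forward(W1, W2, x):
    s1 = np.append(x, 1.0)                          # s^1 = x^q (+ bias unit)
    s2 = np.append(sigma(W1 @ s1), 1.0)             # hidden layer (+ bias unit)
    return s1, s2, sigma(W2 @ s2)                   # s^L = y^q

rng = np.random.default_rng(2)
W1 = rng.normal(size=(3, 3))                        # 2 inputs + bias -> 3 hidden
W2 = rng.normal(size=(1, 4))                        # 3 hidden + bias -> 1 output

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])              # XOR targets

def total_error():
    return sum(np.sum((forward(W1, W2, x)[2] - t) ** 2) for x, t in zip(X, T))

E_start = total_error()
for epoch in range(2000):
    for q in rng.permutation(len(X)):               # online: randomized order
        s1, s2, s3 = forward(W1, W2, X[q])
        d3 = 2 * alpha * s3 * (1 - s3) * (T[q] - s3)        # delta^L
        back = (W2.T @ d3)[:-1]                     # sum_k delta_k W_ki, no bias
        d2 = alpha * s2[:-1] * (1 - s2[:-1]) * back         # delta^l
        W2 += eta * np.outer(d3, s2)                # Delta W = eta delta s
        W1 += eta * np.outer(d2, s1)
E_end = total_error()
print(E_start, "->", E_end)                         # error should drop markedly
```

On most random initializations this drives the summed squared error close to zero; at minimum the error after training should be well below its starting value, which is what the online gradient approximation guarantees for small η.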
VIII. Review of Key Concepts

Complex Systems

• Many interacting elements
• Local vs. global order: entropy
• Scale (space, time)
• Phase space
• Difficult to understand
• Open systems

Many Interacting Elements

• Massively parallel
• Distributed information storage & processing
• Diversity
  – avoids premature convergence
  – avoids inflexibility

Complementary Interactions

• Positive feedback / negative feedback
• Amplification / stabilization
• Activation / inhibition
• Cooperation / competition
• Positive / negative correlation

Emergence & Self-Organization

• Microdecisions lead to macrobehavior
• Circular causality (macro / micro feedback)
• Coevolution
  – predator/prey, Red Queen effect
  – gene/culture, niche construction, Baldwin effect

Pattern Formation

• Excitable media
• Amplification of random fluctuations
• Symmetry breaking
• Specific difference vs. generic identity
• Automatically adaptive

Emergent Control

• Stigmergy
  – continuous (quantitative)
  – discrete (qualitative)
• Entrainment (distributed synchronization)
• Coordinated movement
  – through attraction, repulsion, local alignment
  – in concrete or abstract space
• Coordinated algorithm
  – non-conflicting
  – sequentially linked
• Cooperative strategies
  – nice & forgiving, but reciprocal
  – evolutionarily stable strategy

Attractors

• Classes
  – point attractor
  – cyclic attractor
  – chaotic attractor
• Basin of attraction
• Imprinted patterns as attractors
  – pattern restoration, completion, generalization, association

Wolfram's Classes

• Class I: point
• Class II: cyclic
• Class III: chaotic
• Class IV: complex (edge of chaos)
  – persistent state maintenance
  – bounded cyclic activity
  – global coordination of control & information
  – order for free

Energy / Fitness Surface

• Descent on energy surface / ascent on fitness surface
• Lyapunov theorem to prove asymptotic stability / convergence
• Soft constraint satisfaction / relaxation
• Gradient (steepest) ascent / descent
• Adaptation & credit assignment

Biased Randomness

• Exploration vs. exploitation
• Blind variation & selective retention
• Innovation vs. incremental improvement
• Pseudo-temperature
• Diffusion
• Mixed strategies

Natural Computation

• Tolerance to noise, error, faults, damage
• Generality of response
• Flexible response to novelty
• Adaptability
• Real-time response
• Optimality is secondary

Student Course Evaluation!
(We will do it in class this time)