WK5 – Dynamic Networks: Time Delayed & Recurrent Networks
Dr. Stathis Kasderidis, Dept. of Computer Science, University of Crete
Spring Semester, 2009
CS 476: Networks of Neural Computation, CSD, UOC, 2009

Contents
•Sequence Learning
•Time Delayed Networks I: Implicit Representation
•Time Delayed Networks II: Explicit Representation
•Recurrent Networks I: Elman and Jordan Networks
•Recurrent Networks II: Back-Propagation Through Time
•Conclusions

Sequence Learning
•MLP and RBF networks are static networks: they learn a mapping from a single input signal to a single output response, for an arbitrarily large number of pairs.
•Dynamic networks learn a mapping from a single input signal to a sequence of response signals, for an arbitrary number of (signal, sequence) pairs.
•Typically the input to a dynamic network is an element of the sequence, and the network produces as its response the rest of the sequence.
•To learn sequences we need to include some form of (short-term) memory in the network.
•We can introduce memory effects in two principal ways:
  •Implicit: e.g. a time-lagged signal fed as input to a static network, or recurrent connections.
  •Explicit: e.g. the Temporal Back-Propagation method.
•In the implicit form we assume that the environment from which we collect the (input signal, output sequence) examples is stationary. In the explicit form the environment may be nonstationary, i.e. the network can track changes in the structure of the signal.

Time Delayed Networks I
•The time-delayed approach includes two basic types of networks:
  •Implicit representation of time: we combine a memory structure in the input layer with a static network model.
  •Explicit representation of time: we explicitly allow the network to code time by generalising the network weights from scalars to vectors, as in TBP (Temporal Back-Propagation).
•Typical forms of memory are the Tapped Delay Line and the Gamma Memory family.
•The Tapped Delay Line form of memory is shown below for an input signal x(n): [figure: tapped delay line]
•The Gamma form of memory is defined by:

  g_p(n) = \binom{n-1}{p-1} \mu^p (1-\mu)^{n-p},  n \ge p

•The Gamma Memory is shown below: [figure: gamma memory]
•In the implicit representation approach we combine a static network (e.g. MLP or RBF) with a memory structure (e.g. a tapped delay line); a minimal sketch of this idea is given below.
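To make the implicit approach concrete, here is a small sketch (not part of the original lecture; the toy signal, the memory depth and the linear least-squares readout standing in for a trained MLP/RBF are assumptions): a tapped delay line turns the series x(n) into fixed-length windows that a static network can map to the next sample.

```python
# Sketch of the implicit approach: a tapped delay line feeding a static model.
import numpy as np

def tapped_delay_line(x, p):
    """Rows [x(n), x(n-1), ..., x(n-p)] for n = p, ..., len(x)-1."""
    return np.array([x[n - p:n + 1][::-1] for n in range(p, len(x))])

x = np.sin(0.1 * np.arange(200))        # toy stationary signal
p = 5                                   # memory depth of the delay line
X = tapped_delay_line(x, p)[:-1]        # inputs:  [x(n), ..., x(n-p)]
d = x[p + 1:]                           # targets: x(n+1), one step ahead
w, *_ = np.linalg.lstsq(X, d, rcond=None)   # linear "static network" readout
print("one-step prediction MSE:", np.mean((X @ w - d) ** 2))
```

In practice the rows of X would simply be presented as input vectors to an MLP or RBF network trained with the usual static learning rules.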
•An example is shown below (the NETtalk network): [figure: NETtalk architecture]
•We present the data in a sliding window. For example, in NETtalk the middle group of input neurons presents the letter in focus; the remaining input groups, three before and three after, provide context.
•The purpose is to predict, for example, the next symbol in the sequence.
•The NETtalk model (Sejnowski & Rosenberg, 1987) has:
  •203 input nodes
  •80 hidden neurons
  •26 output neurons
  •18,629 weights
  •It was trained with the BP method.

Time Delayed Networks II
•In the explicit time representation, neurons have a spatio-temporal structure: a synapse arriving at a neuron is not a scalar number but a vector of weights, which is used to convolve the time-delayed output signal of the previous neuron with the synapse.
•A schematic representation of such a neuron is given below: [figure: FIR neuron]
•The output of the neuron in this case is given by:

  y_j(n) = \varphi\left( \sum_{l=0}^{p} w_j(l) x(n-l) + b_j \right)

•In the case of a whole network, for example assuming a single output node and a linear output layer, the response is given by:

  y(n) = \sum_{j=1}^{m_1} w_j y_j(n) = \sum_{j=1}^{m_1} w_j \varphi\left( \sum_{l=0}^{p} w_j(l) x(n-l) + b_j \right) + b_0

where p is the depth of the memory and b_0 is the bias of the output neuron.
•In the more general case, where we have multiple neurons in the output layer, we have for neuron j of any layer:

  y_j(n) = \varphi\left( \sum_{i=1}^{m_0} \sum_{l=0}^{p} w_{ji}(l) x_i(n-l) + b_j \right)

•The output of any synapse is given by the convolution sum:

  s_{ji}(n) = w_{ji}^T x_i(n) = \sum_{l=0}^{p} w_{ji}(l) x_i(n-l)

where the state vector x_i(n) and the weight vector w_{ji} of synapse i are defined as:
  •x_i(n) = [x_i(n), x_i(n-1), ..., x_i(n-p)]^T
  •w_{ji} = [w_{ji}(0), w_{ji}(1), ..., w_{ji}(p)]^T
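As a small illustration of the convolution sum above (a sketch only; the tanh activation, the random weights and the sizes are assumptions, not the lecture's example): each synapse acts as an FIR filter, and the neuron sums the synapse outputs before applying its activation.

```python
# FIR ("spatio-temporal") neuron: each synapse i carries a weight vector
# w_ji = [w_ji(0), ..., w_ji(p)] convolved with the delayed state vector
# x_i(n) = [x_i(n), ..., x_i(n-p)].
import numpy as np

rng = np.random.default_rng(0)
p, m0 = 3, 2                        # memory depth, number of input synapses
W = rng.normal(size=(m0, p + 1))    # W[i, l] = w_ji(l)
b = 0.1                             # bias of neuron j

def fir_neuron_output(X_state, W, b):
    """y_j(n) = phi( sum_i sum_l w_ji(l) x_i(n-l) + b_j ), with phi = tanh."""
    s = np.sum(W * X_state)         # sum over i of the convolution sums s_ji(n)
    return np.tanh(s + b)

X_state = rng.normal(size=(m0, p + 1))   # X_state[i, l] holds x_i(n-l) (toy values)
print("y_j(n) =", fir_neuron_output(X_state, W, b))
```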
Time Delayed Networks II: Learning Law
•To train such a network we use the Temporal Back-Propagation algorithm, presented below.
•Assume that neuron j lies in the output layer; its response at time n is denoted by y_j(n) and its desired response by d_j(n).
•We can define an instantaneous value for the sum of squared errors produced by the network:

  E(n) = \frac{1}{2} \sum_j e_j^2(n)

•The error signal at the output layer is defined by:

  e_j(n) = d_j(n) - y_j(n)

•The idea is to minimise an overall cost function calculated over all time:

  E_{total} = \sum_n E(n)

•We could proceed as usual by calculating the gradient of the cost function with respect to the weights. This implies that we need to calculate the instantaneous gradients:

  \frac{\partial E_{total}}{\partial w_{ji}} = \sum_n \frac{\partial E(n)}{\partial w_{ji}}

•However, for this approach to work we need to unfold the network in time (i.e. convert it to an equivalent static network and then calculate the gradient). This option has a number of disadvantages:
  •A loss of symmetry between the forward and backward passes in the calculation of the instantaneous gradient;
  •No nice recursive formula for the propagation of error terms;
  •The need for global bookkeeping to keep track of which static weights in the equivalent network are actually the same weight.
•For these reasons we prefer to calculate the gradient of the cost function as follows:

  \frac{\partial E_{total}}{\partial w_{ji}} = \sum_n \frac{\partial E_{total}}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}}

•Note that in general:

  \frac{\partial E_{total}}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}} \ne \frac{\partial E(n)}{\partial w_{ji}}

The equality is correct only when we take the sum over all time.
•To calculate the weight update we use the steepest-descent method:

  w_{ji}(n+1) = w_{ji}(n) - \eta \frac{\partial E_{total}}{\partial v_j(n)} \frac{\partial v_j(n)}{\partial w_{ji}(n)}

where \eta is the learning rate.
•We calculate the terms in the above relation as follows:

  \frac{\partial v_j(n)}{\partial w_{ji}(n)} = x_i(n)

which follows from the definition of the induced local field, v_j(n) = \sum_i w_{ji}^T(n) x_i(n) + b_j.
•We define the local gradient as:

  \delta_j(n) = - \frac{\partial E_{total}}{\partial v_j(n)}

•Thus we can write the weight update equation in the familiar form:

  w_{ji}(n+1) = w_{ji}(n) + \eta \delta_j(n) x_i(n)

•We need to calculate \delta_j(n) for the cases of the output and the hidden layers.
•For the output layer the local gradient is given by:

  \delta_j(n) = - \frac{\partial E_{total}}{\partial v_j(n)} = - \frac{\partial E(n)}{\partial v_j(n)} = e_j(n) \varphi'(v_j(n))

•For a hidden layer we assume that neuron j is connected to a set A of neurons in the next layer (hidden or output). Then we have:

  \delta_j(n) = - \frac{\partial E_{total}}{\partial v_j(n)} = - \sum_{r \in A} \sum_k \frac{\partial E_{total}}{\partial v_r(k)} \frac{\partial v_r(k)}{\partial v_j(n)}

•By rewriting we get:

  \delta_j(n) = \sum_{r \in A} \sum_k \delta_r(k) \frac{\partial v_r(k)}{\partial v_j(n)} = \sum_{r \in A} \sum_k \delta_r(k) \frac{\partial v_r(k)}{\partial y_j(n)} \frac{\partial y_j(n)}{\partial v_j(n)}

•Finally, putting it all together we get:

  \delta_j(n) = \varphi'(v_j(n)) \sum_{r \in A} \sum_{k=n}^{n+p} \delta_r(k) w_{rj}(k-n) = \varphi'(v_j(n)) \sum_{r \in A} \sum_{l=0}^{p} \delta_r(n+l) w_{rj}(l)

where the inner sum runs over the p+1 taps of the synapse connecting neuron j to neuron r; a small sketch of the resulting update is given below.
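As an illustration of the update equations above (a sketch only, not the lecture's code; the downstream deltas, the tap count and the learning rate are made-up values, and the notation follows the slides rather than any library API):

```python
# One temporal-back-propagation step for a hidden neuron j:
#   delta_j(n) = phi'(v_j(n)) * sum_r sum_l delta_r(n+l) * w_rj(l)
#   w_ji(n+1)  = w_ji(n) + eta * delta_j(n) * x_i(n)
import numpy as np

def dtanh(v):                                # phi'(v) for phi = tanh
    return 1.0 - np.tanh(v) ** 2

p, eta = 3, 0.05                             # taps per synapse, learning rate
rng = np.random.default_rng(1)

v_j = 0.4                                    # induced local field of neuron j at time n
w_rj = rng.normal(size=(2, p + 1))           # taps of the synapses from j to the neurons r in A
delta_r_future = rng.normal(size=(2, p + 1)) # delta_r(n), ..., delta_r(n+p) for each r in A

delta_j = dtanh(v_j) * np.sum(delta_r_future * w_rj)

w_ji = rng.normal(size=p + 1)                # weight vector of the synapse from neuron i to j
x_i = rng.normal(size=p + 1)                 # state vector [x_i(n), ..., x_i(n-p)]
w_ji = w_ji + eta * delta_j * x_i            # familiar delta-rule form of the update
```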
Recurrent Networks I
•A network is called recurrent when there are connections that feed back to previous layers or neurons, including self-connections. An example is shown next: [figure: recurrent network]
•Successful early models of recurrent networks are:
  •The Jordan Network
  •The Elman Network
•The Jordan Network has the structure of an MLP plus additional context units. The output neurons feed back to the context neurons in a 1-1 fashion, and the context units also feed back to themselves.
•The network is trained using the Backpropagation algorithm.
•A schematic is shown in the next figure: [figure: Jordan network]
•The Elman Network also has the structure of an MLP plus additional context units. The hidden neurons feed back to the context neurons in a 1-1 fashion; the connections from the hidden neurons to the context units are constant and equal to 1. It is also called the Simple Recurrent Network (SRN).
•The network is trained using the Backpropagation algorithm.
•A schematic is shown in the next figure: [figure: Elman network]

Recurrent Networks II
•More complex forms of recurrent networks are possible. We can start by extending an MLP as a basic building block.
•Typical paradigms of complex recurrent models are:
  •The Nonlinear Autoregressive with Exogenous Inputs Network (NARX)
  •The State Space Model
  •The Recurrent Multilayer Perceptron (RMLP)
•Schematic representations of these networks are given in the next slides.
•The structure of the NARX model includes:
  •A static MLP network;
  •The current input u(n) and its delayed versions up to a lag q;
  •A time-delayed version of the current output y(n), which feeds back to the input layer. The memory of the delayed output vector is in general p.
•The output is calculated as:

  y(n+1) = F(y(n), ..., y(n-p+1), u(n), ..., u(n-q+1))

•A schematic of the NARX model is as follows: [figure: NARX network]
•The structure of the State Space model includes:
  •An MLP network with a single hidden layer;
  •The hidden neurons define the state of the network;
  •A linear output layer;
  •A feedback of the hidden layer to the input layer, assuming a memory of q lags.
•The output is determined by the coupled equations:

  x(n+1) = f(x(n), u(n))
  y(n+1) = C x(n+1)

where f is a suitable nonlinear function characterising the hidden layer, x is the state vector produced by the hidden layer (with q components), y is the output vector (with p components), and the input vector u has m components. A short sketch of this recursion is given after the figure.
•A schematic representation of the network is given below: [figure: state space model]
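A minimal sketch of the state-space forward recursion (the weights, the sizes and the use of a tanh hidden layer are illustrative assumptions, not the lecture's example):

```python
# State-space recurrent model:  x(n+1) = f(x(n), u(n)),  y(n+1) = C x(n+1)
import numpy as np

rng = np.random.default_rng(2)
q, m, p_out = 4, 3, 2                          # state, input and output dimensions
Wa = rng.normal(scale=0.5, size=(q, q))        # hidden weights acting on the state x(n)
Wb = rng.normal(scale=0.5, size=(q, m))        # hidden weights acting on the input u(n)
C  = rng.normal(scale=0.5, size=(p_out, q))    # linear output layer

def step(x, u):
    x_next = np.tanh(Wa @ x + Wb @ u)          # f(x(n), u(n)): one nonlinear hidden layer
    return x_next, C @ x_next                  # y(n+1) = C x(n+1)

x = np.zeros(q)
for n in range(5):                             # run the coupled equations for a few steps
    u = rng.normal(size=m)                     # toy exogenous input u(n)
    x, y = step(x, u)
    print(f"n={n}  y(n+1)={y}")
```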
•The structure of the RMLP (Recurrent Multilayer Perceptron) includes:
  •One or more hidden layers;
  •Feedback around each layer;
  •Otherwise, the general structure of a static MLP network.
•The output is calculated as follows (assuming x_I, x_II and x_O are the outputs of the first, second and output layers):

  x_I(n+1) = \varphi_I(x_I(n), u(n))
  x_{II}(n+1) = \varphi_{II}(x_{II}(n), x_I(n+1))
  x_O(n+1) = \varphi_O(x_O(n), x_{II}(n+1))

where the functions \varphi_I(•), \varphi_{II}(•) and \varphi_O(•) denote the activation functions of the corresponding layers. A small sketch of this layer-by-layer recursion is given after the figure.
•A schematic representation is given below: [figure: RMLP]
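A brief sketch of the RMLP recursion above (the layer sizes, the random weights and the interpretation of each \varphi as tanh applied to the concatenation of a layer's previous output and its new input are assumptions made for illustration):

```python
# RMLP forward step: every layer feeds back its own previous output.
#   x_I(n+1)  = phi_I(x_I(n),  u(n))
#   x_II(n+1) = phi_II(x_II(n), x_I(n+1))
#   x_O(n+1)  = phi_O(x_O(n),  x_II(n+1))
import numpy as np

rng = np.random.default_rng(3)

def layer(n_out, n_self, n_in):
    """One recurrent layer: phi(W [own previous output; new input])."""
    W = rng.normal(scale=0.4, size=(n_out, n_self + n_in))
    return lambda x_prev, inp: np.tanh(W @ np.concatenate([x_prev, inp]))

m, h1, h2, p_out = 3, 5, 4, 2
phi_I, phi_II, phi_O = layer(h1, h1, m), layer(h2, h2, h1), layer(p_out, p_out, h2)

x_I, x_II, x_O = np.zeros(h1), np.zeros(h2), np.zeros(p_out)
for n in range(5):
    u = rng.normal(size=m)
    x_I = phi_I(x_I, u)           # first hidden layer with self-feedback
    x_II = phi_II(x_II, x_I)      # second hidden layer with self-feedback
    x_O = phi_O(x_O, x_II)        # output layer with self-feedback
```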
•Some theorems on the computational power of recurrent networks:
  •Theorem 1: All Turing machines may be simulated by fully connected recurrent networks built on neurons with sigmoid activation functions.
  •Theorem 2: NARX networks with one layer of hidden neurons with bounded, one-sided saturated (BOSS) activation functions and a linear output neuron can simulate fully connected recurrent networks with bounded, one-sided saturated activation functions, except for a linear slowdown.
  •Corollary: NARX networks with one hidden layer of neurons with BOSS activation functions and a linear output neuron are Turing equivalent.
•The training of recurrent networks can be done with two methods:
  •Back-Propagation Through Time (BPTT)
  •Real-Time Recurrent Learning (RTRL)
•We can train a recurrent network in either epoch-based or continuous operation. However, an epoch in a recurrent network does not mean the presentation of all learning patterns; rather, it denotes the length of a single sequence that we use for training. So an epoch in a recurrent network corresponds to presenting only one pattern to the network.
•At the end of an epoch the network stabilises.
•Some useful heuristics for training are given below:
  •The lexigraphic order of training samples should be followed, with the shortest strings of symbols being presented to the network first;
  •Training should begin with a small training sample, whose size should then be incrementally increased as training proceeds;
  •The synaptic weights of the network should be updated only if the absolute error on the training sample currently being processed is greater than some prescribed criterion;
  •The use of weight decay during training is recommended; weight decay was discussed in WK3.
•The Back-Propagation Through Time algorithm proceeds by unfolding a network in time. To be more specific:
  •Assume that we have a recurrent network N which is required to learn a temporal task starting from time n0 and going all the way to time n.
  •Let N* denote the feedforward network that results from unfolding the temporal operation of the recurrent network N.
•The network N* is related to the original network N as follows:
  •For each time step in the interval (n0, n], the network N* has a layer containing K neurons, where K is the number of neurons contained in the network N;
  •In every layer of the network N* there is a copy of each neuron in the network N;
  •For every time step l ∈ [n0, n], the synaptic connection from neuron i in layer l to neuron j in layer l+1 of the network N* is a copy of the synaptic connection from neuron i to neuron j in the network N.
•The following example illustrates the idea of unfolding: we assume a network with two neurons, unfolded for a number of steps n: [figure: unfolded two-neuron network]
•We now present the method of Epochwise Back-Propagation Through Time.
•Let the dataset used for training the network be partitioned into independent epochs, with each epoch representing a temporal pattern of interest. Let n0 denote the start time of an epoch and n1 its end time.
•We can define the following cost function:

  E_{total}(n_0, n_1) = \frac{1}{2} \sum_{n=n_0}^{n_1} \sum_{j \in A} e_j^2(n)

where A is the set of indices j pertaining to those neurons in the network for which desired responses are specified, and e_j(n) is the error signal at the output of such a neuron, measured with respect to some desired response.
•The algorithm proceeds as follows:
1. For a given epoch, the recurrent network starts running from some initial state until it reaches a new state, at which point the training is stopped and the network is reset to an initial state for the next epoch. The initial state does not have to be the same for each epoch of training; rather, what is important is that the initial state for the new epoch is different from the state reached by the network at the end of the previous epoch.
2. A single forward pass of the data through the network is performed for the interval (n0, n1). The complete record of input data, network state (i.e. synaptic weights) and desired responses over this interval is saved.
3. A single backward pass over this past record is performed to compute the values of the local gradients

  \delta_j(n) = - \frac{\partial E_{total}(n_0, n_1)}{\partial v_j(n)}

for all j ∈ A and n0 < n ≤ n1. This computation is performed with the formula:

  \delta_j(n) = \varphi'(v_j(n)) \, e_j(n),  for n = n_1
  \delta_j(n) = \varphi'(v_j(n)) \left[ e_j(n) + \sum_{k \in A} w_{kj} \, \delta_k(n+1) \right],  for n_0 < n < n_1

where \varphi'(•) is the derivative of the activation function with respect to its argument, v_j(n) is the induced local field of neuron j, and w_kj is the weight of the connection from neuron j to neuron k. The use of the above formula is repeated, starting from time n1 and working back, step by step, to time n0; the number of steps involved equals the number of time steps contained in the epoch.
4. Once the back-propagation computation has been performed back to time n0+1, the following adjustment is applied to the synaptic weight w_ji of neuron j:

  \Delta w_{ji} = - \eta \frac{\partial E_{total}}{\partial w_{ji}} = \eta \sum_{n=n_0+1}^{n_1} \delta_j(n) \, x_i(n-1)

where \eta is the learning-rate parameter and x_i(n-1) is the input applied to the i-th synapse of neuron j at time n-1. A compact sketch of steps 2 to 4 is given below.
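The following sketch illustrates steps 2 to 4 for a single fully connected recurrent layer (the architecture, sizes and data are illustrative assumptions, not the lecture's example; all neurons are assumed to have desired responses, so A covers the whole layer):

```python
# Epochwise BPTT for one fully connected recurrent layer of K neurons:
#   forward:  v(n) = W y(n-1) + W_in u(n-1),  y(n) = tanh(v(n))
#   backward: delta(n1) = phi'(v(n1)) e(n1)
#             delta(n)  = phi'(v(n)) (e(n) + W^T delta(n+1)),  n0 < n < n1
#   update:   dW = eta * sum_n delta(n) y(n-1)^T
import numpy as np

rng = np.random.default_rng(4)
K, m, T, eta = 3, 2, 10, 0.01              # neurons, inputs, epoch length, learning rate
W = rng.normal(scale=0.3, size=(K, K))     # recurrent weights W[k, j] = w_kj
W_in = rng.normal(scale=0.3, size=(K, m))  # input weights
U = rng.normal(size=(T, m))                # toy inputs u(n0), ..., u(n1-1)
D = rng.normal(size=(T, K))                # toy desired responses d(n0+1), ..., d(n1)

# Step 2: forward pass over the epoch, saving the complete record.
y = np.zeros((T + 1, K)); v = np.zeros((T + 1, K))
for n in range(1, T + 1):
    v[n] = W @ y[n - 1] + W_in @ U[n - 1]
    y[n] = np.tanh(v[n])
e = D - y[1:]                              # e_j(n) = d_j(n) - y_j(n)

# Step 3: backward pass computing the local gradients delta_j(n).
delta = np.zeros((T + 1, K))
for n in range(T, 0, -1):
    back = np.zeros(K) if n == T else W.T @ delta[n + 1]
    delta[n] = (1 - np.tanh(v[n]) ** 2) * (e[n - 1] + back)

# Step 4: accumulate the weight adjustments over the whole epoch.
W += eta * sum(np.outer(delta[n], y[n - 1]) for n in range(1, T + 1))
W_in += eta * sum(np.outer(delta[n], U[n - 1]) for n in range(1, T + 1))
```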
•There is a potential problem with this method, called the Vanishing Gradients Problem: the corrections calculated for the weights become too small when methods based on steepest descent are used.
•This is currently a research problem, and one has to consult the literature for details.

Conclusions
•Dynamic networks learn sequences, in contrast to the static mappings of MLP and RBF networks.
•Time is represented either explicitly or implicitly.
•The implicit form uses time-delayed versions of the input vector followed by a static network model, or the use of recurrent networks.
•The explicit form uses a generalisation of the MLP model in which a synapse is modelled as a weight vector rather than a single number. The synapse activation is no longer the product of the synapse's weight with the output of the previous neuron, but the inner product of the synapse's weight vector with the time-delayed state vector of the previous neuron.
•The extended MLP networks with explicit temporal structure are trained with the Temporal Back-Propagation algorithm.
•The recurrent networks include a number of simple and complex architectures. In the simpler cases we train the networks using the standard Backpropagation algorithm.
•In the more complex cases we first unfold the network in time and then train it using the Back-Propagation Through Time algorithm.