November 16, 2010 Neural Networks Lecture 17: Self-Organizing Maps

About Assignment #3

There are two approaches to backpropagation learning:

1. "Per-pattern" learning: update the weights after every exemplar presentation.
2. "Per-epoch" (batch-mode) learning: update the weights after every epoch. During the epoch, accumulate the required change for each weight across all exemplars; after the epoch, update each weight using its accumulated sum.

Per-pattern learning often approaches a near-optimal network error quickly, but may then take longer to reach the error minimum. During per-pattern learning, it is important to present the exemplars in random order. Reducing the learning rate between epochs usually leads to better results.

Per-epoch learning involves less frequent weight updates, which makes the initial approach to the error minimum rather slow. However, per-epoch learning computes the actual network error and its gradient for each weight, so the network can make more informed weight updates. Two of the most effective algorithms that exploit this information are Quickprop and Rprop.

The Quickprop Learning Algorithm

The assumption underlying Quickprop is that the network error, as a function of each individual weight, can be approximated by a parabola. Based on this assumption, whenever we find that the gradient for a given weight switched its sign between successive epochs, we fit a parabola through these two data points and use its minimum as the next weight value.
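Before moving on: the two backpropagation update schedules described above can be made concrete with a minimal Python sketch for a single linear unit with squared error. The function names and the toy data are ours, not part of the assignment.

```python
import random

def grad(w, x, t):
    """Gradient of the squared error 0.5*(w*x - t)**2 for one exemplar (x, t)."""
    return (w * x - t) * x

def per_pattern_epoch(w, data, eta):
    """Per-pattern learning: update the weight after every exemplar, in random order."""
    random.shuffle(data)
    for x, t in data:
        w -= eta * grad(w, x, t)
    return w

def per_epoch(w, data, eta):
    """Per-epoch learning: sum the required changes over all exemplars, update once."""
    total = sum(grad(w, x, t) for x, t in data)
    return w - eta * total

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # all exemplars consistent with w = 2
w = 0.0
for _ in range(100):
    w = per_epoch(w, data, eta=0.05)
print(round(w, 3))  # converges to the error minimum at w = 2.0
```

Swapping `per_epoch` for `per_pattern_epoch` reaches the same minimum; with a fixed learning rate the per-pattern version fluctuates around it, which is why the notes recommend reducing the learning rate between epochs.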
Illustration (figure omitted): the assumed parabolic error function E over a weight w, showing the values E(t−1) and E(t) and the slopes E′(t−1) and E′(t) at the weights w(t−1) and w(t); the parabola's minimum lies at w(t+1).

Fitting the parabola amounts to a form of Newton's method:

E(w) = a·w² + b·w + c

E′(t) = 2a·w(t) + b
E′(t−1) = 2a·w(t−1) + b

Subtracting the second equation from the first and solving:

2a = (E′(t) − E′(t−1)) / (w(t) − w(t−1))
b = E′(t) − 2a·w(t)

For the minimum of E we must have:

E′(t+1) = 2a·w(t+1) + b = 0

Therefore:

w(t+1) = −b / (2a) = w(t) + E′(t)·(w(t) − w(t−1)) / (E′(t−1) − E′(t))

Notice that this method cannot be applied if the error gradient has neither decreased in magnitude nor changed its sign at the preceding time step. In that case, we would ascend the error function or make an infinitely large weight update. In most cases, Quickprop converges several times faster than standard backpropagation learning.

Resilient Backpropagation (Rprop)

The Rprop algorithm takes a very different approach to improving backpropagation than Quickprop does. Instead of making more use of gradient information for better weight updates, Rprop uses only the sign of the gradient, because its magnitude can be a poor and noisy estimator of the required weight updates. Furthermore, Rprop assumes that different weights need different step sizes for their updates, and that these step sizes vary throughout the learning process.
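The resulting update formula can be checked numerically. Here is a sketch of a single Quickprop step (the function name is ours); on an exactly parabolic error it reaches the minimum in one step, as the derivation predicts.

```python
def quickprop_step(w_t, w_prev, g_t, g_prev):
    """One Quickprop update: w(t+1) = w(t) + E'(t)*(w(t)-w(t-1)) / (E'(t-1)-E'(t)).

    Only valid when the gradient shrank in magnitude or changed sign;
    real implementations fall back to a plain gradient step otherwise,
    which this sketch omits."""
    return w_t + g_t * (w_t - w_prev) / (g_prev - g_t)

# Sanity check on an exact parabola E(w) = (w - 3)**2, so E'(w) = 2*(w - 3):
# from any two distinct weight values, one step lands on the minimum at w = 3.
w_prev, w_t = 0.0, 1.0
g_prev, g_t = 2 * (w_prev - 3), 2 * (w_t - 3)
print(quickprop_step(w_t, w_prev, g_t, g_prev))  # -> 3.0
```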
The basic idea is this: if the error gradient for a given weight wij kept the same sign over two consecutive epochs, we increase its step size Δij, because the weight's optimal value may still be far away. If, on the other hand, the sign switched, we decrease the step size. Weights are always changed by adding or subtracting the current step size, regardless of the absolute value of the gradient. This way we do not "get stuck" with extreme weights that are hard to change because of the shallow slope of the sigmoid function.

Formally, the step-size update rules are:

Δij(t) = η⁺·Δij(t−1)   if (∂E/∂wij)(t−1) · (∂E/∂wij)(t) > 0
Δij(t) = η⁻·Δij(t−1)   if (∂E/∂wij)(t−1) · (∂E/∂wij)(t) < 0
Δij(t) = Δij(t−1)      otherwise

Empirically, the best results were obtained with initial step sizes Δij(0) = 0.1, η⁺ = 1.2, η⁻ = 0.5, Δmax = 50, and Δmin = 10⁻⁶.

Weight updates are then performed as follows:

Δwij(t) = −Δij(t)   if (∂E/∂wij)(t) > 0
Δwij(t) = +Δij(t)   if (∂E/∂wij)(t) < 0
Δwij(t) = 0         otherwise

It is important to remember that, as in Quickprop, the gradient in Rprop must be computed across all samples (per-epoch learning).

The performance of Rprop is comparable to that of Quickprop; it, too, considerably accelerates backpropagation learning. Compared to both standard backpropagation and Quickprop, Rprop has one advantage: it does not require the user to estimate or empirically determine a step-size parameter and its change over time. Rprop determines appropriate step sizes by itself and can thus be applied "as is" to a variety of problems without significant loss of efficiency.
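These rules translate almost line by line into code. A minimal Python sketch for a single weight follows (the function name is ours; a full implementation would also apply the refinement from the original Rprop paper of zeroing the stored gradient after a sign flip):

```python
def rprop_update(step, g_prev, g, eta_plus=1.2, eta_minus=0.5,
                 step_max=50.0, step_min=1e-6):
    """One Rprop update for one weight; returns (new_step, weight_change)."""
    if g_prev * g > 0:                        # same sign: optimum may be far away
        step = min(step * eta_plus, step_max)
    elif g_prev * g < 0:                      # sign flip: we overshot the minimum
        step = max(step * eta_minus, step_min)
    if g > 0:                                 # always move against the gradient sign
        return step, -step
    elif g < 0:
        return step, step
    return step, 0.0

step, dw = rprop_update(0.1, g_prev=-0.3, g=-0.05)
print(round(step, 3), round(dw, 3))  # -> 0.12 0.12 (same sign, so the step grew)
```

Note that only the signs of `g_prev` and `g` matter; their magnitudes never enter the weight change, exactly as described above.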
The Counterpropagation Network

Let us look at the CPN structure again. How can this network determine its hidden-layer winner unit?

Illustration (figure omitted): a CPN with an input layer (X1, X2), a hidden layer (H1, H2, H3), and an output layer (Y1, Y2), fully connected between layers by weights wH and wO; additional lateral connections within the hidden layer are what allows the winner to be determined.

The Solution: Maxnet

A maxnet is a recurrent, one-layer network that uses competition to determine which of its nodes has the greatest initial input value. Every pair of nodes is connected by inhibitory connections of the same weight −ε, where typically ε ≤ 1/(number of nodes). In addition, each node has a self-excitatory connection to itself, whose weight θ is typically 1. The nodes update their net input and their output by the following equations:

net_i = θ·x_i − ε·Σ_{j≠i} x_j
x_i := f(net_i) = max(0, net_i)

All nodes update their output simultaneously. With each iteration, the neurons' activations decrease until only one neuron remains active: the "winner" neuron that had the greatest initial input. A maxnet is a biologically plausible implementation of a maximum-finding function. In parallel hardware, it can be more efficient than a corresponding serial function. We can add maxnet connections to the hidden layer of a CPN to find the winner neuron.

Maxnet Example

Example of a maxnet with five neurons, θ = 1, and ε = 0.2 (activations per iteration):

Iteration 0:  0.5   0.9   0.9   0.9   1.0
Iteration 1:  0     0.24  0.24  0.24  0.36
Iteration 2:  0     0.07  0.07  0.07  0.22
Iteration 3:  0     0     0     0     0.17   (neuron 5 is the winner)

Self-Organizing Maps (Kohonen Maps)

As you may remember, the counterpropagation network employs a combination of supervised and unsupervised learning. We will now study Self-Organizing Maps (SOMs) as an example of completely unsupervised learning (Kohonen, 1982).
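The five-neuron worked example can be reproduced with a short Python sketch of the maxnet iteration (the function and the small zero threshold for float round-off are ours):

```python
def maxnet(x, eps=0.2, max_iters=100):
    """Iterate the maxnet update x_i := max(0, x_i - eps * sum of the others)
    until at most one activation remains; returns (winner_index, iterations)."""
    for it in range(1, max_iters + 1):
        total = sum(x)
        # self-excitation weight theta = 1, mutual inhibition -eps, clipped at zero
        x = [max(0.0, xi - eps * (total - xi)) for xi in x]
        if sum(v > 1e-9 for v in x) <= 1:    # tolerance for float round-off
            return x.index(max(x)), it
    raise RuntimeError("maxnet did not converge")

# The example above: theta = 1, eps = 0.2, initial inputs (0.5, 0.9, 0.9, 0.9, 1.0).
winner, iters = maxnet([0.5, 0.9, 0.9, 0.9, 1.0])
print(winner + 1, iters)  # -> 5 3: neuron 5 wins after 3 iterations
```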
This type of artificial neural network is particularly similar to biological systems (as far as we understand them).

In the human cortex, multi-dimensional sensory input spaces (e.g., visual input, tactile input) are represented by two-dimensional maps. The projection from the sensory inputs onto such maps is topology-conserving: neighboring areas in these maps represent neighboring areas in the sensory input space. For example, neighboring areas of the sensory cortex are responsible for neighboring body regions, such as the arm and the hand.

Such a topology-conserving mapping can be achieved by SOMs:

• Two layers: an input layer and an output (map) layer.
• The input and output layers are completely connected.
• Output neurons are interconnected within a defined neighborhood.
• A topology (neighborhood relation) is defined on the output layer.

Illustration (figure omitted): the network structure, with an input vector x = (x1, ..., xn) fully connected to an output vector o = (O1, ..., Om).

Illustration (figure omitted): common output-layer structures: a one-dimensional chain (completely interconnected) and a two-dimensional grid (connections omitted, only the neighborhood relations shown), each with the neighborhood of a neuron i highlighted.

A neighborhood function φ(i, k) indicates how closely neurons i and k in the output layer are connected to each other.
Usually, a Gaussian function of the distance between the positions p_i and p_k of the two neurons in the layer is used, for example:

φ(i, k) = exp(−‖p_i − p_k‖² / (2σ²))

where σ controls the width of the neighborhood.

Unsupervised Learning in SOMs

For an n-dimensional input space and m output neurons:

(1) Choose a random weight vector wi for each neuron i, i = 1, ..., m.
(2) Choose a random input x.
(3) Determine the winner neuron k: ‖wk − x‖ = min_i ‖wi − x‖ (Euclidean distance).
(4) Update the weight vectors of all neurons i in the neighborhood of neuron k:
    wi := wi + η·φ(i, k)·(x − wi)    (wi is shifted towards x)
(5) If the convergence criterion is met, STOP. Otherwise, narrow the neighborhood function, decrease the learning rate η, and go to (2).
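The complete loop can be sketched in a few lines of Python. This is our own minimal illustration of steps (1) through (5) for a one-dimensional chain of output neurons with a Gaussian neighborhood; the shrinking schedule (a shared exponential decay for η and σ) is one simple choice among many.

```python
import math
import random

def som_train(data, m, dim, epochs=200, eta=0.5, sigma=2.0):
    """Train a 1-D SOM with m output neurons on dim-dimensional inputs."""
    # step (1): random initial weight vectors
    weights = [[random.random() for _ in range(dim)] for _ in range(m)]
    for t in range(epochs):
        x = random.choice(data)                              # step (2)
        dists = [sum((wd - xd) ** 2 for wd, xd in zip(w, x))
                 for w in weights]
        k = dists.index(min(dists))                          # step (3): winner
        decay = math.exp(-t / epochs)                        # step (5): narrowing
        for i, w in enumerate(weights):                      # step (4)
            h = math.exp(-((i - k) ** 2) / (2 * (sigma * decay) ** 2))
            for d in range(dim):
                w[d] += eta * decay * h * (x[d] - w[d])      # shift wi towards x
    return weights

random.seed(0)
# Toy 1-D inputs spread over [0, 1]; after training, the five weight values
# typically end up ordered along the chain, reflecting topology conservation.
w = som_train([[i / 9] for i in range(10)], m=5, dim=1)
```

Since every update moves a weight by a convex step towards an input in [0, 1], the trained weights also stay within [0, 1].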