Bayesian Neural Networks
Petteri Nurmi

1. Introduction

The structure of the brain differs greatly from the way computers are built, but at the same time brains can perform certain tasks, such as pattern recognition and classification, more effectively than any computer. As neurological research (the study of the nervous system) advanced, researchers were able to build a model of the structure of the brain. In computer science the AI community welcomed this research and designed its own artificial model of the brain, which is a simplification of the actual structure. The key components are neurons and synapses. Neurons are the basic computational units of the brain: they receive some input and produce an output based on it. The synapses form the connections between the neurons, and together these building blocks can be used to build large artificial (neural) networks. A neuron can either be activated or deactivated, or it can have a continuous activation level. This value is based on the input (plus possibly other parameters, such as a bias term), and the function that calculates the activation value from the input is called an activation function. The most widely used function is the sigmoid, which is shown in Figure 1.

Figure 1. The sigmoid function. Here α is the slope parameter and w is the vector of synaptic weights.

2. Multilayer Perceptron Networks (MLP) [Hay99]

The most widely used neural network architecture is the so-called multilayer feedforward network, in which the neurons are organized into layers in a hierarchical manner. The first layer of neurons forms the input layer of the network and the last layer forms the output layer. Between the input and output layers we may have an arbitrary number of hidden layers, through which the signals are propagated. Figure 2 gives an architectural overview of a multilayer feedforward (perceptron) network with two hidden layers.

Figure 2. A multilayer perceptron network with two hidden layers.

Learning in MLP networks is typically done with the backpropagation algorithm (backprop). The network is given a set of training samples of the form {x, t}, where x is an input vector (values for each neuron in the input layer) and t is the corresponding vector of target values (what the activation levels of the neurons in the output layer should be). First the inputs are propagated through the network, which gives us some output values y. These are compared to the target values and the errors of the output nodes are stored. Together the error values define the error energy of the network. The goal is to minimize this error energy. The standard approach to error minimization is to use gradient-based optimization techniques. The theoretical background stems from multivariable calculus: a necessary condition for a minimum is that the gradient at that point is zero. Thus, by moving in the direction of the negative gradient of the error, we can eventually converge to a minimizing point. Whether this point is the best estimate or not is impossible to say, but often even this simple framework produces good results. The name backprop comes from the fact that we must pass the error terms backwards from the output nodes to the nodes in the last hidden layer, and so on; otherwise it would be impossible to estimate the gradient of the error locally. A standard approach in gradient optimization methods is to use small step sizes. This is achieved by using a learning rate parameter η to weight the gradient values.
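As an illustration of the forward pass and the backprop update described above, the following minimal Python sketch implements a network with a single hidden layer. It is only illustrative and not taken from the literature discussed here; the slope parameter of the sigmoid is fixed to α = 1, the loss is the squared error, and the function names are our own.

import numpy as np

def sigmoid(v, alpha=1.0):
    # Logistic sigmoid with slope parameter alpha (cf. Figure 1).
    return 1.0 / (1.0 + np.exp(-alpha * v))

def forward(x, W1, W2):
    # Forward pass through a network with one hidden layer.
    h = sigmoid(W1 @ x)                         # hidden-layer activations
    y = sigmoid(W2 @ h)                         # output-layer activations
    return h, y

def backprop_step(x, t, W1, W2, eta=0.1):
    # One gradient-descent update on the squared error 0.5 * ||t - y||^2.
    h, y = forward(x, W1, W2)
    # Error terms are propagated backwards from the output layer.
    delta_out = (y - t) * y * (1.0 - y)         # output-layer deltas
    delta_hid = (W2.T @ delta_out) * h * (1.0 - h)  # hidden-layer deltas
    # Small step in the direction of the negative gradient, weighted by eta.
    W2 -= eta * np.outer(delta_out, h)
    W1 -= eta * np.outer(delta_hid, x)
    return W1, W2

Repeated calls to backprop_step over the training samples {x, t} then implement the (unregularized) backprop learning scheme, with η controlling the step size.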
To try to avoid local minima, the learning rate parameter can be defined as a function of time, so that its value decreases as more and more learning iterations are computed.

The backprop framework suffers from a major drawback: if the data samples are noisy, the network will be overfitted to the noise and will not generalize well to new cases. The simplest way to counter this is to introduce a penalty term on the weights and try to keep the weight values as small as possible. Now the error energy is defined in terms of a) the errors between the activation levels of the nodes in the output layer and the target values, and b) the penalizing weight term. This complete framework, which also forms the basis for Bayesian neural networks, is illustrated in Figure 3. The equation shown in Figure 3 uses the mean squared error as the error "score" between the target values and the output values. For the weights a squared penalty term is used. The goal of learning is now to find the weights that minimize this error.

Figure 3. The penalized error term. The first sum goes over the samples and the second goes over the synapses in the network. The terms α and β are weighting terms.

3. Bayesian Neural Networks

The Bayesian approach was designed to "fill in gaps in the backpropagation framework" [Mac91]. There are still some problematic issues even in the (weighted) penalized backprop learning scheme. The first problem relates to the weighting parameters α and β: how should one select values for these parameters? This issue was discussed by MacKay [Mac91], who suggested using a hierarchical model in which the weighting parameters are assigned Gaussian hyperparameter priors from which values for the parameters can be sampled. Neal [Nea96] forgoes the weighting parameters, and we will focus mainly on Neal's approach.

MLP networks are typically used with a predefined network architecture. As MacKay formulated the Bayesian view of the network, he also made it possible to evaluate different network architectures. Similarly, it is possible to define probability distributions over the possible choices of penalty functions and so on. These problems can be solved with the same kinds of methods that are used in the general model-selection setting. Commonly used methods nowadays include MDL, variational methods and information-theoretic scoring criteria such as BIC (the Bayesian information criterion). For information about model learning methods, the interested reader is referred to [MP01] or [Hec99].

To be able to use Bayesian methods for learning-related tasks, one must formulate the problem in a probabilistic manner. Thus we have to derive a correspondence between the input values and the target values. In data analysis this is typically done using regression (see e.g. [GCS+04]), in which the values of one variable are modelled as a function of other variables. To be precise, the actual model used here is a Gaussian regression model. Figure 4 gives the exact formula for calculating the probability of seeing a particular target value given an input, the network parameters (the weights), a network architecture and a noise model for the data.

Figure 4. Gaussian regression for neural networks.
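The equations of Figures 3 and 4 are not reproduced here. A sketch that is consistent with the captions above and with the usual notation (where f(x, w) denotes the network output for input x and weights w, and A denotes the network architecture) would be roughly the following; the exact constants and factors of 1/2 vary between presentations.

E(w) = β ∑_{i=1}^{N} (t_i − y_i)² + α ∑_{j=1}^{m} w_j²

p(t | x, w, A) = sqrt(β/(2π)) · exp(−(β/2) (t − f(x, w))²)

Here the first sum runs over the N training samples, the second sum runs over the m synaptic weights, and β plays the role of the noise precision in the Gaussian noise model.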
The regression term defines the likelihood of the data in terms of the target values. What is still needed is a definition of a posterior distribution over the possible weight values. But before we can formulate a posterior distribution, we need to define a suitable prior distribution. The first approach, due to MacKay, was to use Gaussian priors, which allows the resulting posterior distribution to be approximated with a Gaussian. This Gaussian approximation of the posterior makes the process computationally somewhat feasible. However, Neal [Nea96] argues that this kind of approach breaks down as the number of hidden units in the network grows. Neal [Nea94] discusses priors for infinite networks, in which the number of hidden units approaches infinity; to avoid overfitting, the priors are scaled according to the number of hidden units in the network. However, for our purposes it suffices to look at the Gaussian priors introduced by MacKay. Figure 5 first gives the prior distribution of the weights; the lower equation gives the resulting posterior distribution.

Figure 5. Gaussian priors along with the resulting posterior distribution.

Note that so far we have only defined the posterior distribution in terms of an individual sample, and thus we still need a way to evaluate the learning. The solution is to use the mean of the posterior predictive distribution as an evaluation metric. Thus we calculate the average probability that is attached to an unseen (and unclassified) sample. This approach is illustrated in Figure 6. Another possibility is to divide the training data into subsets and use some cross-validation technique (leave-one-out, k-fold cross-validation, Monte Carlo cross-validation, etc.).

Figure 6. Mean of the posterior predictive distribution under the regression model.

4. Hybrid MCMC Learning

Although many Bayesian learning methods have been proposed for neural networks (Gaussian approximation, Metropolis-Hastings, stochastic dynamics, Langevin Monte Carlo, generalized hybrid Monte Carlo and mean field estimation), we will concentrate on the (generic) Hybrid Markov chain Monte Carlo (MCMC) method, which is used by Neal [Nea96].

The first step in describing the algorithm is to describe the learning task as a (closed) physical system. Assume that our dynamical system contains n particles. We will use a vector q to describe the location of each of the particles at a given instant of time; thus q is a vector with n components. Assume first that the particles are not moving. Then the energy of the system is defined in terms of the potential energy of the particles, and we define the probability density for the variable q as P(q) ∝ exp(-E(q)), where E(q) is the "potential energy" function. Typically a pseudo-temperature variable T is also used (for annealing purposes), but in this case it is unnecessary.

Next we allow the particles to move around the system and define a momentum term p, which describes the movement of the particles. We define the kinetic energy of the system to be K(p) and select K(p) = ∑_{i=1}^{n} p_i² / (2m_i), where p_i is the ith component of the momentum term and m_i is the mass of the ith particle. Together the potential and kinetic energy distributions allow us to define a joint distribution of the variables q and p, which defines the phase-space distribution of the system. This probability is expressed in terms of a Hamiltonian term H(q, p) = E(q) + K(p), which gives the total energy of the system. The probability distribution is again defined as the exponential of the negative energy; thus we have P(q, p) ∝ exp(-H(q, p)).
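To make these quantities concrete, the following minimal Python sketch evaluates the kinetic energy, the Hamiltonian and the unnormalized phase-space density. The potential energy function is passed in (for a network it would be, for example, the penalized error of Figure 3), all masses m_i are set to a common value for simplicity, and the function names are our own.

import numpy as np

def kinetic_energy(p, m=1.0):
    # K(p) = sum_i p_i^2 / (2 m_i); here all masses are taken to be m.
    return np.sum(p ** 2) / (2.0 * m)

def hamiltonian(q, p, potential_energy, m=1.0):
    # Total energy of the system: H(q, p) = E(q) + K(p).
    return potential_energy(q) + kinetic_energy(p, m)

def phase_space_density(q, p, potential_energy, m=1.0):
    # Unnormalized joint density P(q, p) proportional to exp(-H(q, p)).
    return np.exp(-hamiltonian(q, p, potential_energy, m))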
Next we describe the evolution of (pseudo-)time in terms of Hamiltonian dynamics. The evolution of the parameters is illustrated in Figure 7.

Figure 7. The Hamiltonian evolution of the phase-space dynamics.

The Hamiltonian dynamics are especially useful because a) as the parameters q and p vary according to the dynamical equations, the total energy H stays the same, and b) the distribution P(q, p) is invariant with respect to transitions that consist of following a trajectory in phase space evolving according to Hamiltonian dynamics. If we could evaluate the Hamiltonian dynamics exactly, then the following sampling scheme could be used:

1) Sample new values for q and p while keeping H fixed.
2) Sample a new value of the total energy H.
3) Repeat.

The above framework is called the stochastic dynamics method, and if the simulations can be done exactly, the method will (eventually) explore every trajectory. However, because the dynamics cannot be evaluated exactly, one must resort to approximation methods. One such method is the leapfrog approximation, in which discrete steps are used to estimate the evolution of the state. The equations for this method are given in Figure 8.

Figure 8. Leapfrog state estimation.

The leapfrog estimation actually changes the value of H slightly, and thus the method suffers from a systematic error. If the step size is reduced to zero (in the limit), no error is made, but evaluating a single trajectory in phase space then requires an infinite number of steps. To overcome this problem, it is possible to combine the stochastic dynamics method with the Metropolis algorithm and thus avoid the error. The idea is to perform the leapfrog simulation for the parameters q and p and then observe the change in the energy H. The new state is considered a candidate for the Metropolis algorithm and is accepted with a probability determined by the change in energy. The exact algorithm is given below:

Algorithm I: Hybrid MCMC
1. Let the current state be (q, p). Perform L leapfrog steps to arrive at the state (q', p').
2. Perform Gibbs sampling on the momentum variable p'.
3. Accept the new state with probability min(1, exp(-(H(q', p') - H(q, p)))).

What is still missing from this framework is the connection to learning in neural networks. The idea is to let the position variables q describe the parameters of the neural network; thus q consists of the biases and weights of the network. The momentum variables p correspond to the hyperparameters, which describe the (hierarchical) distribution of the network parameters. The algorithm thus first "anneals" the state of the weights and then performs Gibbs sampling on the hyperparameters of the network. The resulting state is considered a candidate state for the Metropolis step and is either accepted or rejected. Eventually the algorithm will have visited all regions of the phase space and the weight distribution will have converged to the true distribution. The computations performed in this framework are very intensive, but on the other hand the goal is to devise optimal learning methods for very large neural networks.
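The following minimal Python sketch shows one Hybrid MCMC transition built from a leapfrog trajectory and a Metropolis acceptance test. It is illustrative rather than Neal's implementation: the momenta are resampled at the start of each transition (over repeated iterations this is equivalent to the ordering of Algorithm I), unit masses are assumed, and E and grad_E are user-supplied functions giving the potential energy (for example the penalized network error) and its gradient.

import numpy as np

def leapfrog(q, p, grad_E, eps, L):
    # Simulate L discrete leapfrog steps of size eps (cf. Figure 8).
    q, p = q.copy(), p.copy()
    p -= 0.5 * eps * grad_E(q)          # initial half step for the momentum
    for _ in range(L):
        q += eps * p                    # full step for the position
        p -= eps * grad_E(q)            # full step for the momentum
    p += 0.5 * eps * grad_E(q)          # correction so the final momentum update is a half step
    return q, p

def hybrid_mcmc_step(q, E, grad_E, eps=0.3, L=100, rng=None):
    # One transition: resample the momenta, follow the dynamics, then accept or reject.
    if rng is None:
        rng = np.random.default_rng()
    p = rng.standard_normal(q.shape)    # Gibbs sampling of the momentum variables
    H_old = E(q) + 0.5 * np.sum(p ** 2)
    q_new, p_new = leapfrog(q, p, grad_E, eps, L)
    H_new = E(q_new) + 0.5 * np.sum(p_new ** 2)
    # Metropolis test: accept with probability min(1, exp(-(H_new - H_old))).
    if rng.random() < np.exp(-(H_new - H_old)):
        return q_new
    return q

In the neural network setting q would hold the weights and biases of the network, and repeated calls to hybrid_mcmc_step produce samples from the posterior weight distribution.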
5. Results

Neal tests the Hybrid Markov chain Monte Carlo method against the Metropolis algorithm on the robot-arm problem [Nea92]. In addition, he tests how simulated annealing affects the convergence performance of the Hybrid MCMC method. The extension to simulated annealing is fairly straightforward: the pseudo-temperature variable is introduced into the canonical form of the joint distribution, and this temperature is gradually cooled. When the temperature is one, the method is equivalent to the Hybrid MCMC method.

In the robot-arm problem the goal is to predict the position of the robot arm at the next time step, given the current position. The sensor data is modelled as noisy, and thus a Gaussian noise model is introduced. The prediction task can be described as a pair of linear equations (two-dimensional data), which describe the network function. The Hybrid MCMC method used trajectories of 1000 steps and was run for 200 iterations, which results in a total of 200,000 steps. Similarly, the standard Metropolis algorithm was evaluated using 200,000 iterations. The computations took around 20 minutes (on a machine with approximately 25 MIPS).

Figure 9 illustrates how the Hybrid MCMC and the simulated annealing version of Hybrid MCMC perform. Both methods were run 9 times, and if the resulting prediction error was larger than 0.0075, the run was rated a failure. The size of each square represents the number of failures.

Figure 9. Hybrid MCMC with and without annealing [Nea94].

From Figure 9 we can clearly see that simulated annealing improves performance. The constant parameters ε0 and ν are used to set the step size in the leapfrog estimation according to the formula ε = ε0 exp(νC), where C is sampled from a Cauchy distribution. To further illustrate the advantage of using simulated annealing, Figure 10 gives the prediction errors as a function of the iterations.

Figure 10. Convergence curves with and without annealing [Nea94].

Unfortunately, no study was available in which the different methods were extensively tested against each other. In this toy example the Hybrid MCMC method works well, but so did the Gaussian approximation method used by MacKay. On the other hand, Neal was able to motivate the use of Hybrid MCMC over the Metropolis algorithm, as the performance of Hybrid MCMC was superior: the best result obtained with the Metropolis algorithm exceeded the threshold value by a factor of approximately two. This clearly illustrates that 200,000 iterations were not enough for the Metropolis algorithm to reach convergence.

6. Future Directions

After the first few publications, research in this field has notably decreased. The main contributions to the area have been in the form of new algorithmic solutions, of which especially the mean field method is a very promising technique. Research on optimizations for MCMC methods has also continued, but both of these research directions relate more generally to probabilistic modelling rather than purely to Bayesian neural networks. The applications of Bayesian neural networks are not very numerous, and most of the applications that have been done are in very specific areas. For example, Vehtari, Lampinen et al. [VHL+98] use Bayesian neural networks to classify forest scene data.

7. References

[GCS+04] A. Gelman, J. Carlin, H. Stern, D. Rubin, Bayesian Data Analysis, 2nd ed., CRC Press, 2004.
[Hay99] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, 1999.
[Hec99] D. Heckerman, A Tutorial on Learning with Bayesian Networks, in Learning in Graphical Models, ed. M.I. Jordan, MIT Press, 1999.
[Mac91] D. MacKay, Bayesian Methods for Adaptive Models, Ph.D. thesis, California Institute of Technology, 1991.
[MP01] G. McLachlan, D. Peel, Finite Mixture Models, John Wiley & Sons, 2001.
[Nea92] R. Neal, Bayesian Training of Backpropagation Networks by the Hybrid Monte Carlo Method, Technical Report CRG-TR-92-1, Dept. of Computer Science, University of Toronto, 1992.
[Nea94] R. Neal, Priors for Infinite Networks, Technical Report CRG-TR-94-1, Dept. of Computer Science, University of Toronto, 1994.
[Nea96] R. Neal, Bayesian Learning for Neural Networks, Lecture Notes in Statistics No. 118, Springer-Verlag, 1996.
[VHL+98] A. Vehtari, J. Heikkonen, J. Lampinen, J. Juujärvi, Using Bayesian Neural Networks to Classify Forest Scenes, in Intelligent Robots and Computer Vision XVII: Algorithms, Techniques and Active Vision, pp. 66-73, 1998.