advertisement

CSM10: Bayesian neural networks Summary, we: Looked at radial basis function networks. Compared them with multi layer perceptrons and statistical techniques Aims: To examine the development of neural networks. To critically investigate different types of neural network and to identify the key parameters which determine network architectures suitable for specific applications Learning outcomes: Demonstrate an understanding of the key principles involved in neural network application development. Analyse practical problems from a neural computing perspective and be able to select a suitable neural network architecture for a given problem. Differentiate between supervised and unsupervised methods of training neural networks, and understand which of these methods should be used for different classes of problems. Demonstrate an understanding of the data selection and pre-processing techniques used in constructing neural network applications. An example of Bayesian statistics: “The probability of it raining tomorrow is 0.3” Suppose we want to reason with information that contains probabilities such as: ''There is a 70\% chance that the patient has a bacterial infection''. Bayes theories rest on the belief that for everything there is a prior probability that it could be true. Given a prior probability about some hypothesis (e.g. does the patient have influenza?) there must be some evidence we can call on to adjust our views (beliefs) on the matter. Given relevant evidence we can modify this prior probability to produce a posterior probability of the same hypothesis given new evidence. If the following terms are used: p(X) means prior probability of X p(X|Y) means probability of X given that we have observed evidence Y p(Y) is the probability of the evidence Y occurring on its own. p(Y|X) is the probability of the evidence Y occurring given the hypothesis X is true (the likelihood). We make of Bayes Theorem: p( X | Y ) p(Y | X ) p( X ) p(Y ) In other words: posterior likelihood prior evidence We know what p(X) is - the prior probability of patients in general having influenza. Assuming that we find that the patient has a fever, we would like to find P(X:Y) the probability of this particular patient having influenza given that we can see that they have a fever (Y). If we don't actually know this we can ask the opposite question, i.e. if a patient has influenza, what is the probability that they have a fever? Fever is probably certain in this case, we'll assume that it is 1. The term p(Y) is the probability of the evidence occurring on it's own, i.e. what is the probability of anyone having a fever (whether they have influenza or not? p(Y) can be calculated from: p(Y) p(X | Y)p(X) p(Y | notX)p(not Y) This states that the probability of a fever occurring in anyone is the probability of a fever occurring in an influenza patient times the probability of anyone having influenza plus the probability of fever occurring in a non-influenza patient times the probability of this person being a non-influenza case. From the original prior probability of p(X)held in our knowledge base we can calculate p(X|Y) after having asked about the patients fever, we can now forget about the original p(X) and instead use the new p(X|Y) as a new p(X). So the whole process can be repeated time and time again as new evidence comes in from the keyboard (i.e. the user enters answers). Each time an answer is given the probability of the illness being present is shifted up or down a bit using the Bayesian equation, each time a different prior probability being used which has been derived from the last posterior probability. An example using Bayes Theorem: Suppose that the hypothesis X is that ‘X is a man’ and notX is that ‘X is a woman’, and we want to calculate which is the most likely given the available evidence. Suppose that we have evidence that the prior probability of X, p(X) is 0.7, so that p(not X) = 0.3. Suppose we have evidence Y that X has long hair, and suppose that p(Y|X) is 0.1 {i.e. most men don’t have long hair} and p(Y) is 0.4 {i.e. quite a few people have long hair}. Our new estimate of P(X|Y) i.e. that X is a man given that we now know that X has long hair is: p(X|Y) = p(Y|X)P(X)/P(Y) = (0.1*(0.7))/0.4 = 0.175 So our probability of ‘X is a man’ has moved from 0.7 to 0.175, given the evidence of long hair. In this way new P(X|Y) are calculated from old probabilities given new evidence. Eventually, having gathered all the evidence concerning all of the hypotheses, we, or the system, can come to a final conclusion about the patient. What most systems using this form of inference do is set an upper and lower threshold. If the probability exceeds the upper threshold that hypothesis is accepted as a likely conclusion to make. If it falls below the lower threshold then it is rejected as unlikely. Problems with Bayesian inference Computationally expensive The Prior probabilities are not always available and are often subjective Often the Bayesian formulae don’t correspond with the expert’s degrees of belief. For Bayesian systems to work correctly, an expert should tell us that ‘The presence of evidence Y enhances the probability of the hypothesis X, and the absence of evidence Y decreases the probability of X’, but in fact many experts will say that ‘The presence of Y enhances the probability of X, but the absence of Y has no significance’, which is not true in a strict Bayesian framework. Assumes independent evidence Bayesian methods are often used in both statistics and Artificial Intelligence based around expert systems. However, they can also be used with neural networks. Conventional training methods for multilayer perceptrons (such as backpropagation) can be interpreted in statistical terms as variations on maximum likelihood estimation. The idea is to find a single set of weights for the network that maximize the fit to the training data, perhaps modified by some sort of weight penalty to prevent overfitting. The Bayesian school of statistics is based on a different view of what it means to learn from data, in which probability is used to represent uncertainty about the relationship being learned (a use that is shunned in Conventional frequentist statistics). Before we have seen any data, our prior opinions about what the true relationship might be can be expressed in a probability distribution over the network weights that define this relationship. After we look at the data (or after our program looks at the data), our revised opinions are captured by a posterior distribution over network weights. Network weights that seemed plausible before, but which don't match the data very well, will now be seen as being much less likely, while the probability for values of the weights that do fit the data well will have increased. Typically, the purpose of training is to make predictions for future cases where only the inputs to the network are known. The result of conventional network training is a single set of weights that can be used to make such predictions. In contrast, the result of Bayesian training is a posterior distribution over network weights. If the inputs of the network re set to the values for some new case, the posterior distribution over network weights will give rise to a distribution over the outputs of the network, which is known as the predictive distribution for this new case. If a single-valued prediction is needed, one might use the mean of the predictive distribution, but the full predictive distribution also tells you how uncertain this prediction is. Why bother with all this? The hope is that Bayesian methods will provide solutions to such fundamental problems as: How to judge the uncertainty of predictions. This can be solved by looking at the predictive distribution, as described above. How to choose an appropriate network architecture (e.g., the number hidden layers, the number of hidden units in each layer). How to adapt to the characteristics of the data (e.g., the smoothness of the function, the degree to which different inputs are relevant). Good solutions to these problems, especially the last two, depend on using the right prior distribution, one that properly represents the uncertainty that you probably have about which inputs are relevant, how smooth the function you are modelling is, how much noise there is in the observations, etc. Such carefully vague prior distributions are usually defined in a hierarchical fashion, using hyperparameters, some of which are analogous to the weight decay constants of more conventional training procedures. An "Automatic Relevance Determination" scheme can be used to allow many possibly-relevant inputs to be included without damaging effects. Selection of an appropriate network architecture is another place where prior knowledge plays a role. One approach is to use a very general architecture, with lots of hidden units, maybe in several layers or groups, controlled using hyperparameters. Implementing all this is one of the biggest problems with Bayesian methods. Dealing with a distribution over weights (and perhaps hyperparameters) is not as simple as finding a single "best" value for the weights. Exact analytical methods for models as complex as neural networks are out of the question. Two approaches have been tried: Find the weights/hyperparameters that are most probable, using methods similar to conventional training (with regularization), and then approximate the distribution over weights using information available at this maximum. Use a Monte Carlo method to sample from the distribution over weights. The most efficient implementations of this use dynamical Monte Carlo methods whose operation resembles that of backprop with momentum. Monte Carlo methods for Bayesian neural networks have been developed by Neal. In this approach, the posterior distribution is represented by a sample of perhaps a few dozen sets of network weights Pros of Bayesian neural networks Bayesian methods are a superset of conventional neural network methods. Network complexity (such as number of hidden units) can be chosen as part of the training process, without using cross-validation. Better when data is in short supply as you can (usually) use the validation data to train the network. For classification problems the tendency of conventional approached to make overconfident predictions in regions of sparse training data can be avoided. Regularisation is a way of controlling the complexity of a model by adding a penalty term to the error function (such as weight decay). Regularization is a natural consequence of using Bayesian methods, which allow us to set regularisation coefficients automatically (without cross-validation). Large numbers of regularisation coefficients can be used, which would be computationally prohibitive if their values had to be optimised using cross-validation. Confidence intervals and error bars can be obtained and assigned to the network outputs when the network is used for regression problems. Allows straightforward comparison of different neural network models (such as MLPs with different numbers of hidden units or MLPs and RBFs) using only the training data. Guidance is provided on where in the input space to seek new data (active learning allows us to determine where to sample the training data next). Relative importance of inputs can be investigated (Automatic Relevance Detection) Very successful in certain domains Theoretically the most powerful method Cons of Bayesian neural networks Requires to choose prior distributions, mostly based on analytical convenience rather than real knowledge about the problem Computationally intractable Model comparison Due to probability measure, models can be compared: Complex models have lower probability density over large range of data sets Simple models have high probability density over a small range of data sets Thus, there should be a compromise in terms of complexity and confidence in the model Ockham’s razor More complex networks (e.g. more hidden units) give better fit to the training data. However, the best generalisation requires that the model be neither too complex nor too simple. The conventional solution is to use cross-validation which is time consuming and wasteful of data. Bayesian methods have an automatic Ockham’s razor to select the appropriate level of model complexity. Consider three neural network models H1, H2, H3 of successively greater complexity (e.g. neural networks with increasing numbers of hidden units. Then evaluate the probabilities of the models given the data D, using Bayes theorem: p ( H i | D) p( D | H i ) p( H i ) p ( D) Assuming equal priors p(Hi) then models are ranked according to p(D| Hi) A particular data set D0 might favour the model H2 which has intermediate complexity: So to extend Bayesian learning to neural networks we: Assume probability distribution over network weights and some prior Need to interpret network outputs probabilistically (use appropriate output activation function) Then we apply Bayesian learning procedure: Start with prior distribution p(w) and choose appropriate parameters (guess, usually broad distribution to reflect uncertainty) Observe data, calculate posterior of parameters by Bayes rule. Continue updating if more data comes in, replacing the prior with the posterior In order to make prediction, the expectation given the posterior distribution has to be found (might be very complex) Summary In practice, Bayesian networks often outperform standard networks (such as MLPs trained with backpropagation). However there are several unresolved issues (such as how best to choose the priors) and more research is needed. Bayesian networks are computationally intensive and therefore take a long time to train.