Bayesian Optimization (BO)
Javad Azimi
Fall 2010
http://web.engr.oregonstate.edu/~azimi/

Outline
• Formal Definition
• Application
• Bayesian Optimization Steps
  – Surrogate Function (Gaussian Process)
  – Acquisition Function
    • PMAX
    • IEMAX
    • MPI
    • MEI
    • UCB
    • GP-Hedge

Formal Definition
• Input: an expensive black-box function f over an input space, known only through the samples we choose to evaluate.
• Goal: find the maximizer x* = argmax_x f(x) using as few function evaluations (experiments) as possible.

Fuel Cell Application
[Figure: schematic of how an MFC (microbial fuel cell) works — bacteria at the anode oxidize fuel (organic matter) to CO2 and release electrons, while O2 is reduced to H2O at the cathode.]
• The nano-structure of the anode significantly impacts the electricity production.
[Figure: SEM image of bacteria sp. on Ni nanoparticle enhanced carbon fibers.]
• We want to optimize the anode nano-structure to maximize power by selecting a set of experiments.

Big Picture
• Since running an experiment is very expensive, we use BO.
• Select one experiment to run at a time, based on the results of the previous experiments.
[Figure: the BO loop — current experiments → our current model → select a single experiment → run the experiment → update the current experiments.]

BO Main Steps
• Surrogate Function (Response Surface, Model)
  – Builds a posterior over unobserved points based on the prior.
  – Its parameters might be based on the prior. Remember, it is a BAYESIAN approach.
• Acquisition Criterion (Function)
  – Which sample should be selected next?

Surrogate Function
• Simulates the distribution of the unknown function based on the prior.
  – Deterministic (classical linear regression, …)
    • There is a deterministic prediction for each point x in the input space.
  – Stochastic (Bayesian regression, Gaussian Process, …)
    • There is a distribution over the prediction for each point x in the input space (e.g., a normal distribution).
  – Example:
    • Deterministic: f(x1) = y1, f(x2) = y2
    • Stochastic: f(x1) ~ N(y1, 2), f(x2) ~ N(y2, 5)

Gaussian Process (GP)
• A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution.
  – Consistency requirement, or marginalization property.
• Marginalization property: if (y1, y2) is jointly Gaussian with mean (mu1, mu2) and covariance [[S11, S12], [S21, S22]], then y1 alone is Gaussian with mean mu1 and covariance S11.

Gaussian Process (GP)
• Formal prediction: for an unobserved point x*, the posterior is Gaussian with
  – mean(x*) = k*' K^-1 y
  – var(x*) = k(x*, x*) − k*' K^-1 k*
  where y is the vector of observed values, K is the covariance matrix of the observed inputs, and k* is the vector of covariances between x* and the observed inputs.
• Interesting points:
  – The squared exponential covariance function corresponds to Bayesian linear regression with an infinite number of basis functions.
  – The variance is independent of the observed values y.
  – The mean is a linear combination of the observed values.
  – If the covariance function specifies the entries of the covariance matrix, marginalization is satisfied!

Gaussian Process (GP)
• A Gaussian Process is:
  – An exact interpolating regression method.
    • It predicts the training data perfectly (not true in classical regression).
  – A natural generalization of linear regression.
    • A nonlinear regression approach!
  – A simple example of a GP can be obtained from Bayesian regression.
    • Identical results.
  – A specification of a distribution over functions.

Gaussian Process: distribution over functions
[Figure: three functions sampled from the GP, with the 95% confidence interval for each point x.]

Gaussian Process vs. Bayesian regression
• Bayesian regression:
  – A distribution over weights.
  – The prior is defined over the weights.
• Gaussian Process:
  – A distribution over functions.
  – The prior is defined over the function space.
• These are the same model viewed from different perspectives.

Short Summary
• Given any unobserved point z, we can define a normal distribution over its predicted value such that:
  – Its mean is a linear combination of the observed values.
  – Its variance is related to its distance from the observed values (closer to observed data, less variance).
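Code sketch: GP posterior prediction
To make the surrogate step concrete, here is a minimal sketch (not from the original slides) of GP posterior prediction with a squared-exponential covariance in plain NumPy. The length-scale, the jitter level, and the tiny 1-D data set are illustrative assumptions; the formulas are the standard GP mean and variance described above.

```python
import numpy as np

def sq_exp_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance k(a, b) = exp(-||a - b||^2 / (2 l^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * length_scale ** 2))

def gp_posterior(X_obs, y_obs, X_new, jitter=1e-8, length_scale=1.0):
    """GP posterior mean and variance at X_new given observations (X_obs, y_obs).

    mean(x*) = k*' K^-1 y              (a linear combination of the observed y's)
    var(x*)  = k(x*, x*) - k*' K^-1 k* (does not depend on the observed y's)
    The jitter term is only for numerical stability; the GP interpolates exactly.
    """
    K = sq_exp_kernel(X_obs, X_obs, length_scale) + jitter * np.eye(len(X_obs))
    K_s = sq_exp_kernel(X_obs, X_new, length_scale)   # k* for each candidate point
    K_ss = sq_exp_kernel(X_new, X_new, length_scale)
    alpha = np.linalg.solve(K, y_obs)                 # K^-1 y
    v = np.linalg.solve(K, K_s)                       # K^-1 k*
    mean = K_s.T @ alpha
    var = np.diag(K_ss) - np.sum(K_s * v, axis=0)
    return mean, var

# Tiny 1-D example: posterior over a grid given three observed experiments.
X_obs = np.array([[0.0], [1.0], [2.5]])
y_obs = np.array([0.2, 0.9, 0.1])
X_new = np.linspace(0.0, 3.0, 7)[:, None]
mu, var = gp_posterior(X_obs, y_obs, X_new)
print(np.round(mu, 3), np.round(var, 3))
```

Note that, as stated in the slides, the predictive variance above depends only on where the inputs lie, not on the observed outputs.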
BO Main Steps
• Surrogate Function (Response Surface, Model)
  – Builds a posterior over unobserved points based on the prior.
  – Its parameters might be based on the prior. Remember, it is a BAYESIAN approach.
• Acquisition Criterion (Function)
  – Which sample should be selected next?

Bayesian Optimization: Acquisition Criterion
• Remember: we are looking for the maximizer x* = argmax_x f(x).
• Input:
  – The set of observed data.
  – A set of candidate points with their corresponding means and variances.
• Goal: decide which point should be selected next to reach the maximizer of the function faster.
• There are different acquisition criteria (acquisition functions, or policies).

Policies
• Maximum Mean (MM).
• Maximum Upper Interval (MUI).
• Maximum Probability of Improvement (MPI).
• Maximum Expected Improvement (MEI).

Policies: Maximum Mean (MM)
• Returns the point with the highest expected value.
• Advantage:
  – If the model is stable and has been learned very well, it performs very well.
• Disadvantage:
  – There is a high chance of getting stuck in a local optimum (it only exploits).
• Can it eventually converge to the global optimum?
  – No.

Policies: Maximum Upper Interval (MUI)
• Returns the point with the highest 95% upper confidence bound.
• Advantage:
  – A combination of mean and variance (exploitation and exploration).
• Disadvantage:
  – Dominated by the variance; it mainly explores the input space.
• Can it eventually converge to the global optimum?
  – Yes.
  – But it needs an almost infinite number of samples.

Policies: Maximum Probability of Improvement (MPI)
• Selects the sample with the highest probability of improving on the current best observation (ymax) by some margin m:
  – MPI(x) = P(f(x) > ymax + m) = Phi((mean(x) − ymax − m) / std(x)), where Phi is the standard normal CDF.

Policies: Maximum Probability of Improvement (MPI)
• Advantage:
  – Considers the mean, the variance, and ymax in the policy (smarter than MUI).
• Disadvantage:
  – The ad-hoc parameter m.
  – Large values of m?
    • Exploration.
  – Small values of m?
    • Exploitation.

Policies: Maximum Expected Improvement (MEI)
• Maximizes the expected improvement over the current best observation.
• Question: expectation over which variable?
  – Over the margin m: MEI integrates the improvement over all margins instead of fixing a single one.

Policies: Upper Confidence Bounds
• Select based on the variance and mean of each point: UCB(x) = mean(x) + k · std(x).
  – The selection of k is left to the user.
  – Recently, a principled approach for selecting this parameter has been proposed.

Summary
• We introduced several approaches, each with its own advantages and disadvantages (a small code sketch of MPI, MEI, and UCB appears at the end of these slides):
  – MM
  – MUI
  – MPI
  – MEI
  – GP-UCB
• Which one should be selected for an unknown model?

GP-Hedge
• GP-Hedge (2010).
• It selects one of the baseline policies based on theoretical results for the multi-armed bandit problem, although the objective is a bit different!
• The authors show that it can perform better than (or as well as) the best baseline policy in some frameworks.

Future Work
• Method selection that is smarter than GP-Hedge, with theoretical analysis.
• Batch Bayesian optimization.
• Scheduling Bayesian optimization.
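Code sketch: acquisition policies (MPI, MEI, UCB)
As a closing illustration of the acquisition policies discussed above, here is a minimal sketch (not part of the original slides) of the MPI, MEI, and UCB scores computed from a GP posterior mean and standard deviation. The margin m, the trade-off parameter k, and the toy candidate values are illustrative assumptions; the MPI and MEI expressions are the standard closed forms under a Gaussian posterior.

```python
import numpy as np
from scipy.stats import norm

def mpi(mean, std, y_max, m=0.01):
    """Maximum Probability of Improvement: P(f(x) > y_max + m)."""
    z = (mean - y_max - m) / np.maximum(std, 1e-12)
    return norm.cdf(z)

def mei(mean, std, y_max):
    """Maximum Expected Improvement: E[max(f(x) - y_max, 0)] under the GP posterior."""
    std = np.maximum(std, 1e-12)
    z = (mean - y_max) / std
    return (mean - y_max) * norm.cdf(z) + std * norm.pdf(z)

def ucb(mean, std, k=2.0):
    """Upper Confidence Bound: mean + k * std; k trades off exploration vs. exploitation."""
    return mean + k * std

# Pick the next experiment by maximizing one policy over a candidate set.
mean = np.array([0.4, 0.8, 0.6])   # GP posterior means at candidate points
std = np.array([0.5, 0.1, 0.3])    # GP posterior standard deviations
y_max = 0.9                        # best observation so far
next_idx = int(np.argmax(mei(mean, std, y_max)))
print("next candidate index:", next_idx)
```

In a full BO loop, one of these scores would be maximized over the candidate set at each iteration to choose the next experiment to run, and the GP posterior would then be updated with the new observation.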