Myopic Policies for Budgeted Optimization with Constrained Experiments
(Project Report)

Wei Lin

December 1, 2008

Abstract

Motivated by a real-world problem, we study a novel setting for budgeted optimization where the goal is to optimize an unknown function f(x) given a budget. In our setting, it is not practical to request samples of f(x) at precise input values due to the formidable cost of experimental setup at precise values. Rather, we may request constrained experiments, which give the experimenter constraints on x for which they must return f(x). Importantly, as the constraints become looser, the experimental cost decreases, but the uncertainty about the location of the next observation increases. Our problem is to manage this trade-off by selecting a sequence of constrained experiments to best optimize f within the budget. We propose a number of myopic policies for selecting constrained experiments using both model-free and model-based approaches, inspired by policies for unconstrained settings. Experiments on synthetic and real-world functions indicate that our policies outperform random selection, that the model-based policies are superior to model-free ones, and give insights into which policies are preferable overall.

Contents

1 Introduction
2 Problem Setup
3 Myopic Policies for Constrained Experiments
  3.1 Heuristics for Constrained Experiments
    3.1.1 Model-Free Heuristics
    3.1.2 Model-based Heuristics
  3.2 Cost sensitive policies
  3.3 Implementation details
    3.3.1 Conditional posterior distribution
    3.3.2 Heuristics computation
    3.3.3 Regret computation
4 Experimental Results
  4.1 Test Functions
  4.2 Modeling Assumptions
  4.3 Evaluation
    4.3.1 Traditional Optimization
    4.3.2 Constrained Experiments
A Gradient
  A.1 MM
  A.2 MUI
  A.3 MPI
  A.4 MEI
B Mathematical Derivation
  B.1 MEI Heuristic
  B.2 Weighted Combination of Distributions
    B.2.1 Expectation
    B.2.2 Variance

Chapter 1
Introduction

This work is motivated by the experimental design problem of optimizing the power output of nano-enhanced microbial fuel cells. Microbial fuel cells (MFCs) [2, 5] use microorganisms to break down organic materials and generate electricity (see Figure 1.1).
For a particular MFC application domain it is critical to optimize the biological energetics and the microbial/electrode interface of the system, which research has shown to depend strongly on the surface properties of the anode [13, 15]. This motivates the design of nano-enhanced anodes, where various types of nano-structures, e.g. carbon nano-wires, are grown directly on the anode surface. Unfortunately, there is little understanding of the interaction between the variety of possible nano-enhancements and MFC capabilities for different micro-organisms. Thus, optimizing anode design for a particular application is largely guesswork for the scientists. Our goal is to develop algorithms to aid them in this process.

Figure 1.1: Schematic of a single-chamber MFC, showing the anode, where bacteria form a biofilm on the surface, and an air cathode.

Traditional experimental design approaches such as response surface methods [12] commonly assume that the experimental inputs can be specified precisely and attempt to optimize a design by requesting specific experiments, for example, requesting that an anode be tested with nano-wires of specific length and density. However, these parameters are unlike regular experimental control parameters (such as temperature) that can be easily set. Rather, manufacturing nano-structures is an art and it is very difficult to achieve a precise parameter setting. Instead, it is more practical to request constrained experiments, which place constraints on these parameters, for example, specifying intervals on length and density. Given such a request, nano-materials that satisfy the given set of constraints can be produced at some time and resource cost, where the cost will typically increase as the constraints become tighter.

Prior work on experimental design, stochastic optimization, and active learning assumes the ability to request precise experiments and hence does not directly apply to our setting of constrained experiments. In this project, we study the associated budgeted optimization problem where we are given an experimental budget and must request a sequence of constrained experiments to optimize the unknown function. Solving this problem involves reasoning about a variety of trade-offs, for example, the trade-off between using weaker constraints to lower the cost of an experiment, while increasing the uncertainty about the location of the next observation. For this purpose we extend existing model-free and model-based myopic policies for traditional optimization problems [10, 11, 6, 9] to our setting of constrained experiments. Our experimental results on both synthetic functions and a function generated from real-world data indicate that our policies have considerable advantages compared to random policies and that the model-based policies can significantly outperform the model-free policies.

We make the following contributions. First, motivated by a real-world application, we identify and formulate the problem of budgeted optimization with constrained experiments (Chapter 2). We believe that this problem class is not rare and, for example, is the common case in nano-structure optimization and any other design domain where it is difficult to construct precise experimental setups. Second, we propose a number of myopic policies for selecting constrained experiments (Chapter 3). Finally, we present an empirical evaluation of the policies and make recommendations (Chapter 4).
Chapter 2
Problem Setup

Let X ⊆ Rn be an n-dimensional input space, where we will often refer to elements of X as experiments. We assume an unknown real-valued function f : X → R, which will be thought of as the expected value of the dependent variable after running an experiment. In our motivating domain, f(x) is the expected power output produced by a particular nano-structure x. Conducting an experiment x produces a noisy outcome y = f(x) + ε, where ε is a noise term. In traditional optimization settings, the goal is to find an x that approximately optimizes f by requesting a sequence of experiments and observing their outcomes.

As described in the previous chapter, we are interested in applications where it is not practical to request specific experiments. Rather, we will request constrained experiments, which each specify constraints that define a subset of experiments in X. In this paper, each constrained experiment will be a hyper-rectangle r = ⟨r1, . . . , rn⟩, where the ri are non-empty intervals along each input dimension. Given a constrained experiment request r, the experimenter will return a tuple (x, y, c), where x is an experiment in r, y is a noisy outcome of x, and c is the cost of creating and running the experiment x. This tuple indicates that the experimenter accumulated a cost of c in order to construct and run an experiment x that satisfied r, resulting in an observed value of y. The units of cost will vary across applications; in our motivating application, costs are dominated by the time required to construct an experiment satisfying r.

The input to our problem is an initial budget B, a conditional density pc(c|r) over the cost c of fulfilling a constrained experiment r, and a conditional density px(x|r) over the specific experiment x generated for a constrained experiment r. Given this input, we must request a sequence of constrained experiments until the total cost of those experiments goes over the budget. This results in a sequence of m experiment tuples (x1, y1, c1), . . . , (xm, ym, cm) corresponding to those requests that were filled before going over the budget on request m + 1. After the sequence of requests we must then output an x̂ ∈ {x1, . . . , xm}, which is our prediction of which observed experiment has the best value of f. This formulation matches the objective of our motivating application of using the given budget to produce a good nano-structure x̂, rather than just make a prediction of what nano-structure might be good.

We have assumed that pc and px are part of the problem inputs and use simple forms for these in this paper. In general, these functions could be more complicated and learned from previous experience and/or elicited from scientists.
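To make the interaction protocol concrete, the following sketch simulates one run of the setting defined above. The callables `policy` and `experimenter` stand in for the policies of Chapter 3 and for the densities px and pc; all names here are illustrative placeholders, not part of the report.

```python
import numpy as np

def run_budgeted_optimization(policy, experimenter, budget, init_data):
    """Simulate the protocol of Chapter 2: request constrained experiments
    (hyper-rectangles) until the budget is exhausted, then output the
    observed experiment predicted to be best.

    policy(D, B) returns a hyper-rectangle r = [(lo_1, hi_1), ..., (lo_n, hi_n)].
    experimenter(r) returns a tuple (x, y, c): an experiment x satisfying r,
    a noisy outcome y, and the incurred cost c.
    """
    D = list(init_data)          # observed tuples (x, y)
    B = budget                   # remaining budget
    while True:
        r = policy(D, B)                 # requested constrained experiment
        x, y, c = experimenter(r)        # experimenter fills the request
        B -= c
        if B < 0:                        # request m+1 went over budget: stop
            break
        D.append((x, y))
    # output the observed experiment with the best (noisy) outcome;
    # model-based policies instead use the posterior mean (Section 3.3.3)
    x_hat = max(D, key=lambda t: t[1])[0]
    return x_hat, D
```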
Chapter 3
Myopic Policies for Constrained Experiments

In this chapter, we introduce a number of myopic policies for selecting constrained experiments. These policies are based on extending existing policies for more traditional optimization problems ([10, 11, 6, 9]) to our setting of constrained experiments. Here we focus on the general definition of these policies and will describe our particular instantiations and implementations in Section 3.3.

Many existing policies for traditional optimization can be viewed as defining a greedy heuristic function which assigns a score to each experiment x given the current experimental state (D, B), where D is the data set of previously observed experiments and B is the remaining budget. These heuristics provide different mechanisms for managing the exploration-exploitation trade-off, which involves balancing exploratory experiments, which probe relatively unexplored portions of the experimental space, and exploitive experiments, which probe areas that are predicted to be promising based on the current information in D.

There are two primary reasons that these existing heuristics cannot be directly used for our problem. First, they assign values to individual experiments rather than constrained experiments, as we require. Second, most such heuristics do not account for experimental cost, but rather assume that each experiment has uniform cost. Indeed, all of the experimental evaluations in ([10, 11, 6, 9]), to name just a few, involve uniform cost experiments. Our proposed policies for selecting constrained experiments, described below, can be viewed as extending existing heuristic approaches to address each of these issues. First, in the next sub-section we extend existing heuristics to constrained experiments, without taking cost into account. Second, in the following sub-section, we describe how to create policies based on these heuristics that take costs into account.

3.1 Heuristics for Constrained Experiments

Each of our heuristics is simply a function H(r|(D, B)) that assigns a value to a constrained experiment r given the current experimental state (D, B). It is important to note that these values will not take into account the cost of r, but rather evaluate constrained experiments as though they all had uniform cost. In this sense, the heuristics are aimed at just evaluating some combination of the exploratory and exploitive value of each r.

3.1.1 Model-Free Heuristics

Our first two heuristics, Round Robin and Biased Round Robin, do not utilize the predictive posterior distribution pθ(y|x, D) and are hence called model-free approaches. They are based on heuristics used in previous work on budgeted multi-armed bandit problems ([9]) and budgeted learning of naive Bayes classifiers ([8]) and are intended as simple baselines.

Round Robin (RR). In the context of individual experiments, the RR heuristic simply loops through the set of possible experiments until the budget is expended. In the multi-armed bandit case, where experiments correspond to arm pulls, the RR heuristic attempts to keep the number of pulls of each arm as uniform as possible. In the context of constrained experiments we define the RR heuristic RR(r|(D, B)) to be the number of experiments in D that satisfy the constraints of the constrained experiment r. Thus, RR will prefer constrained experiments that overlap as few previous experiments as possible. Note that typically there will be a large, even infinite, number of constrained experiments that do not overlap any experiments (e.g. those with very tight constraints) and thus achieve the same heuristic value. Our approach for taking costs into account, described later, will add an additional cost-based preference.

Biased Round Robin (BRR). The BRR heuristic behaves identically to RR, except that it repeats the previously selected experiment as long as it was successful. In the context of binary-outcome multi-armed bandit problems ([9]) this corresponded to pulling an arm as long as it resulted in a positive outcome and otherwise moving on to the next arm. This heuristic thus provides additional exploitive behavior over the RR heuristic. For our problem, we will define the execution of a constrained experiment r to be successful if it resulted in an outcome that improved on the outcome of the previous best observed experiment in D. Thus, BRR will repeatedly try a constrained experiment as long as it improves the experimental objective and otherwise selects according to the heuristic BRR(r|(D, B)).
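A minimal sketch of these two model-free heuristics, under our own naming assumptions (the report gives no code for them), is as follows: RR simply counts the previously observed experiments that fall inside a candidate hyper-rectangle, and BRR repeats the previous request while it keeps improving.

```python
import numpy as np

def rr_heuristic(r, D):
    """RR(r | (D, B)): number of previously observed experiments in D that fall
    inside the hyper-rectangle r (fewer overlaps is preferred).
    r is an (n, 2) array of [low, high] intervals; D is a list of (x, y) tuples."""
    lo, hi = np.asarray(r)[:, 0], np.asarray(r)[:, 1]
    return sum(np.all((x >= lo) & (x <= hi)) for x, _ in D)

def brr_select(r_prev, last_improved, candidates, D):
    """Biased Round Robin: repeat the previous constrained experiment while it
    keeps improving on the best observed outcome; otherwise fall back to the
    RR preference (fewest overlapping observations). Names are illustrative."""
    if r_prev is not None and last_improved:
        return r_prev
    return min(candidates, key=lambda r: rr_heuristic(r, D))
```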
3.1.2 Model-based Heuristics

Our final four heuristics are model-based approaches that make use of the conditional posterior distribution pθ(y|x, D). The intention is to utilize this posterior to more intelligently select constrained experiments, with the hope of outperforming model-free approaches, such as RR and BRR, when the budget constraints are significant. Existing model-based heuristics for individual experiments are typically based on computing certain statistics of the posterior distribution for the possible experiments (e.g. see [7])¹. These can be extended to our setting by computing those same statistics for constrained experiments. To do this, note that we can translate the above posterior pθ, which is conditioned on individual experiments, to a posterior p̄θ that is conditioned on constrained experiments via the relation²

p̄θ(y|r, D) = E[pθ(y|X, D)]    (3.1)

where X is a random variable denoting an experiment distributed according to px(·|r). Intuitively, this definition corresponds to the process of first drawing an experiment according to the distribution specified by r and then drawing an outcome for that experiment from the current posterior pθ. We now describe our four model-based heuristics, which are well established in traditional optimization, as described, for example, in [7].

¹ The heuristics in [7] are intended to find the minimum of the surface, while our extended constrained heuristics are for maximizing.
² Note that the outcome y on the left hand side of Equation (3.1) is over the constrained experiment r, while y on the right hand side is of an individual experiment x ∈ X. To avoid confusion, yr is used in some places of this report.

Maximum Mean (MM). In the context of individual experiments, the MM heuristic simply selects the experiment that has the largest expected outcome according to the current posterior. This heuristic has often been called PMAX in the literature ([10, 11, 6]), which we have renamed to avoid confusion with our later heuristics. For constrained experiments, this translates to the heuristic

MM(r|(D, B)) = E[y|r, D]    (3.2)

where y is a random variable distributed according to p̄θ(·|r, D). MM is a purely exploitive heuristic and has the weakness that it can often be too greedy and get stuck at a poor local maximum before exploring the rest of the experimental space.

Maximum Upper Interval (MUI). MUI, which has gone by the name IEMAX in previous work ([10, 11, 6]), is an attempt to overcome the greedy nature of MM by taking into account the uncertainty in our estimated model. Intuitively, MUI will explore areas where there is a non-trivial probability of achieving a good result. In particular, the MUI heuristic MUI(r|(D, B)) is equal to the upper bound of the 95% confidence interval according to the current posterior p̄θ(y|r, D) (see Equation 3.3, where y ∼ p̄θ(·|r, D)).

MUI(r|(D, B)) = E[y|r, D] + 1.96 √(Var[y|r, D])    (3.3)

Intuitively, MUI will aggressively explore untouched regions of the experimental space since the outcomes in such regions will have high posterior variance.
However, as experimentation continues and uncertainty decreases, MUI will focus on the most promising areas and behave similarly to MM.

Maximum Probability of Improvement (MPI). In the context of individual experiments, the MPI heuristic corresponds to selecting the experiment that has the highest probability of generating an outcome that improves on the best outcome currently in D. One problem with the pure MPI strategy is that it has a tendency to behave similarly to PMAX and focus on areas that currently look promising, rather than explore unknown areas. The reason for this behavior is that pure MPI does not take into consideration the amount of improvement over the current best outcome. In particular, it is typical for the posterior to assign small variances to the outcomes in well explored regions, meaning that while there might be a good probability of observing a small improvement, the probability of a substantial improvement is small. Thus, it is common to consider the use of a margin m when using MPI, which we will refer to as MPI(m). If we let y* represent the current best outcome that was observed in D, then MPIm(r|(D, B)) is equal to the probability that the outcome of constrained experiment r will be greater than y* + m|y*|. Assuming p̄θ(y|r, D) is normally distributed³, the MPI heuristic is given by

MPIm(r|(D, B)) = Φ( (E[y|r, D] − (y* + m|y*|)) / √(Var[y|r, D]) )    (3.4)

where y ∼ p̄θ(·|r, D) and Φ(·) is the normal cumulative distribution function. The larger the value of m, the more exploratory the policy will be, since it will focus on relatively unexplored regions where there are large variances and hence more probability mass associated with outcomes greater than y* + m|y*|. For intermediate values of m the MPI heuristic can be expected to provide some balance between exploration and exploitation, focusing on explored regions when there is a reasonable chance of significant improvement and otherwise focusing on unexplored regions.

³ p̄θ(y|r, D) might not be normally distributed even if pθ(y|X, D) is (see Appendix B.2).

Maximum Expected Improvement (MEI). The MEI heuristic is an attempt to improve on the pure MPI heuristic without requiring the introduction of a parameter. Rather than focus on the probability of improvement, it considers the expected value of improvement according to the current posterior. In particular, let I(y, y*) be a variable equal to zero if y ≤ y* and equal to y − y* otherwise. Assuming p̄θ(y|r, D) is normally distributed, we define the MEI heuristic to be

MEI(r|(D, B)) = E[I(y, y*)|r] = √(Var[y|r, D]) (−uΦ(−u) + φ(u))    (3.5)

where

u = (y* − E[y|r, D]) / √(Var[y|r, D])

and φ(·) is the normal density function.

3.2 Cost sensitive policies

Below we describe our policies for selecting a constrained experiment, given a current experimental state (D, B), based on the above extended heuristics and taking cost into consideration.

Constrained Minimum Cost Policies (CMC). Given a specific heuristic, the highest scoring constrained experiment will typically be as tightly constrained as possible, centered around individual experiments of maximum heuristic value. However, these constrained experiments will also typically be the most costly, leading to a trade-off between cost and heuristic value. The class of CMC policies attempts to manage this trade-off by choosing the least-cost constrained experiment that achieves a heuristic value within a specified fraction α of the value of the highest scoring constrained experiment.
In particular, for a given heuristic H, the CMC policy is defined as

CMC_H(D, B) = arg min_{r ∈ R} { c(r) | H(r|D, B) ≥ α h* }    (3.6)

where h* = max_{r ∈ R} H(r|D, B) is the best heuristic value and c(r) is the expected cost of r. When α is large, CMC tends to choose more tightly constrained and costly constrained experiments, which has the effect of more reliably obtaining an experiment with a high heuristic value. Conversely, when α is small, CMC will choose lower cost constrained experiments with lower heuristic value, which has the effect of increasing the uncertainty about the individual experiment that will be observed, providing some amount of constrained exploration.

Cost Normalized MEI (CN-MEI). CMC policies require selecting α, which may strongly impact the performance of the policy. In particular, when α is set too high, the CMC policy may spend or waste a large portion of the budget in a local optimum region, whereas if α is set too small, the policy may act randomly and fail to utilize the heuristic information at all. Here we investigate an alternative parameter-free policy, which selects the constrained experiment that achieves the highest expected improvement per unit cost, or rate of improvement. In particular, the policy is defined as

CN-MEI(D, B) = arg max_{r} MEI(r|D, B) / c(r).    (3.7)

Note that in concept we could define cost-normalized policies based on heuristics other than MEI. To limit our scope we have chosen to consider only MEI due to its nice interpretation as a rate of improvement.

3.3 Implementation details

3.3.1 Conditional posterior distribution

We use Gaussian process regression to compute the conditional posterior pθ(y|x, D) required by our model-based heuristics. A Gaussian Process (GP) provides a distribution over the space of functions and has a convenient closed form for computing the posterior mean and variance of the outcome y for any experiment x given the prior and the observations in D [14]. Here we first briefly describe how the conditional posterior is computed with Gaussian process regression (GPR). Suppose f(x) is the function that we would like to optimize, and assume a simple regression model, i.e., y = f(x) + ε, where ε is Gaussian noise of zero mean and variance σε². We observe D = (X, y), where X is a matrix representation of the xi's, with each row corresponding to one xi, and y is a vector representation of the yi's of D. The marginal posterior distribution of the experiment outcome for a query point xq is a Gaussian distribution with the mean and variance defined in Equations (3.8) and (3.9) respectively.

E[yq|xq, D] = K(xq, X)ᵀ (K(X, X) + σε²I)⁻¹ y    (3.8)

Var[yq|xq, D] = K(xq, xq) − K(xq, X)ᵀ (K(X, X) + σε²I)⁻¹ K(xq, X)    (3.9)

Note that K(·) is the covariance function, which specifies the covariance between pairs of random variables. In all of our experiments we use a GP with a zero-mean prior and a prior covariance specified by the Gaussian kernel

K(x, x′) = σ exp(−|x − x′|² / (2l)),

with kernel width l = 0.01 and signal variance σ = y²max, where ymax is an upper bound on the outcome value. The value of ymax is typically easy to elicit from scientists and serves to set our prior uncertainty so that there is non-trivial probability assigned to the expected range of the output. We did not optimize these parameter settings, but did empirically verify that the GPs behaved reasonably for the set of functions in our experiments.
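For concreteness, the following sketch computes the posterior mean and variance of Equations (3.8) and (3.9) with the Gaussian kernel above. The report's implementation is in Matlab; this NumPy version, with our own function and parameter names, is only illustrative.

```python
import numpy as np

def gp_posterior(X, y, Xq, l=0.01, sigma_f=1.0, sigma_eps=0.1):
    """Posterior mean and variance of the GP at query points Xq
    (Equations 3.8 and 3.9), using K(x, x') = sigma_f * exp(-|x - x'|^2 / (2 l)).
    X: (m, d) observed inputs, y: (m,) observed outcomes, Xq: (q, d) queries.
    A zero-mean prior is assumed; sigma_f would be set to ymax^2 in practice."""
    def kernel(A, B):
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return sigma_f * np.exp(-sq / (2 * l))

    K = kernel(X, X) + sigma_eps**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)                 # Cholesky factorization, as in Appendix A
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Kq = kernel(X, Xq)                        # (m, q) cross-covariances
    mean = Kq.T @ alpha                       # Equation (3.8)
    v = np.linalg.solve(L, Kq)
    var = sigma_f - np.sum(v**2, axis=0)      # Equation (3.9), diagonal only
    return mean, var
```

Evaluating this on a dense grid of query points gives the per-point posterior means and variances that the next sub-section aggregates over constrained experiments.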
3.3.2 Heuristics computation

Given our GP posterior distribution, we need to compute the various heuristics defined in the previous section. Unfortunately, those heuristics were defined in terms of expectations and variances, which require integration over the region of a constrained experiment, and these integrals do not have convenient closed forms. To deal with this issue, we approximate the integrals by discretizing the space of experiments and using summation in place of integration. Given a constrained experiment r, we discretize it by uniformly sampling a set of experiments {x1, . . . , xn} within r. Then we can rewrite Equation (3.1) as Equation (3.10).

p̄θ(yr|r, D) = (1/n) Σ_{i=1}^{n} pθ(yi|xi, D)    (3.10)

Given Equation (3.10), the discretized versions of the expectation (E[yr|r, D]) and variance (Var[yr|r, D]) of a constrained experiment r are provided by Equation (3.11) and Equation (3.12) respectively, where yr ∼ p̄θ(·|r, D) and yi ∼ pθ(·|xi, D) (see Appendix B for the derivation).

E[yr|r, D] = (1/n) Σ_{i=1}^{n} E[pθ(yi|xi, D)]    (3.11)

Var[yr|r, D] = (1/n) Σ_{i=1}^{n} ( Var[pθ(yi|xi, D)] + (E[pθ(yi|xi, D)])² ) − (E[yr|r, D])²    (3.12)

Furthermore, to help improve efficiency and to make the maximization over constrained experiments tractable, we restrict our attention to hyper-cube constrained experiments such that |ri| = l(r)/d, where l(r) is proportional to the perimeter of the constrained experiment and d is the number of dimensions considered in the experimental space. All constrained experiments are centered at the discretized points and have a finite set of edge lengths specified by the discretization level w of the space; in our experiments we used a discretization level of 0.01. To do this, we only considered constrained experiments

{ r | mod(|ri|/w, 2) = 0 and r ∈ R }.

For CMC policies, h* = max_{r ∈ R*} H(r|D, B), where R* is the set of the tightest constrained experiments given the initial budget B. R* = {r | |ri| = 2w} if B ≥ c(r*) and |ri*| = 2w; otherwise R* = {r | |ri| = |ri*|}, where

r* = arg max_{r} { c(r) | mod(|ri|/w, 2) = 0, c(r) < B and r ∈ R }.

This restriction allows us to efficiently compute the heuristic values for all of the constrained experiments using convolution algorithms⁴ (see Figure 3.1 for the expectation matrix computation of constrained experiments). The first term of Equation (3.12), used for the variance of a constrained experiment, can also be computed similarly by applying the convolution algorithms to the matrix M, where Mi,j = Var[pθ(yi,j|xi,j, D)] + (E[pθ(yi,j|xi,j, D)])².

Figure 3.1: Computation of the expectation matrix for constrained experiments {r | |ri| = 4w} using convolution algorithms on the expectation matrix M, where Mi,j = E[pθ(yi,j|xi,j, D)].

⁴ We use the Matlab function conv2 for 2D convolution and convn for N-D convolution, with shape parameter 'valid'.

This approach is most appropriate for low-dimensional optimization problems, which is the situation we often find in practical experimental design problems with tight budgets. This is because with a tight budget the number of experiments is quite limited, which makes optimizing over many dimensions impractical, leading the scientists to carefully select the key dimensions to consider. For example, in our microbial fuel cell application, the scientist will generally be interested in optimizing 2 to 3 properties of a nano-structure at most, such as the length and density of a carbon nano-tube surface. Thus, the most relevant scenarios for us and many other problems involve a small number of experiments and small dimensionality.
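To make this computation concrete, the sketch below illustrates, for the 2-D case, how the region moments of Equations (3.11) and (3.12) can be obtained with a box convolution (the report's implementation uses the Matlab functions conv2/convn, as noted in the footnote), how MEI follows from those moments via Equation (3.5), and how the CMC and CN-MEI policies of Section 3.2 select among the resulting regions. This is a NumPy/SciPy sketch under our own naming and data-layout assumptions, not the report's code; in particular, the treatment of h* is simplified relative to the R* definition above.

```python
import numpy as np
from scipy.signal import convolve2d
from scipy.stats import norm

def region_moments(mu, var, k):
    """Mean and variance of p̄(y|r,D) for every k-by-k hyper-cube region of a 2-D
    grid of per-point posterior means `mu` and variances `var` (Eqs. 3.11, 3.12),
    computed with a 'valid' box convolution."""
    box = np.ones((k, k)) / (k * k)
    m_r = convolve2d(mu, box, mode='valid')                 # E[y_r | r, D]
    second = convolve2d(var + mu**2, box, mode='valid')     # mean of Var_i + E_i^2
    v_r = second - m_r**2                                   # Equation (3.12)
    return m_r, v_r

def mei(m_r, v_r, y_best):
    """MEI heuristic from region moments (Equation 3.5)."""
    s = np.sqrt(np.maximum(v_r, 1e-12))
    u = (y_best - m_r) / s
    return s * (-u * norm.cdf(-u) + norm.pdf(u))

def cmc_select(heur, cost, alpha):
    """CMC policy: cheapest region whose heuristic value is within a fraction
    alpha of the best value (Equation 3.6). `heur` and `cost` map a region size k
    to same-shaped arrays of heuristic values and expected costs."""
    h_star = heur[min(heur)].max()   # report takes h* over the tightest regions R*
    best, best_cost = None, np.inf
    for k in heur:
        ok = heur[k] >= alpha * h_star
        if ok.any():
            c = np.where(ok, cost[k], np.inf)
            idx = np.unravel_index(np.argmin(c), c.shape)
            if c[idx] < best_cost:
                best, best_cost = (k, idx), c[idx]
    return best   # (region size, grid index of the selected region)

def cn_mei_select(heur, cost):
    """CN-MEI policy: region maximizing MEI per unit of expected cost (Eq. 3.7)."""
    best, best_rate = None, -np.inf
    for k in heur:
        rate = heur[k] / cost[k]
        idx = np.unravel_index(np.argmax(rate), rate.shape)
        if rate[idx] > best_rate:
            best, best_rate = (k, idx), rate[idx]
    return best
```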
3.3.3 Regret computation

Each policy will generate a sequence of m experiment tuples (x1, y1, c1), . . . , (xm, ym, cm) (see Chapter 2). We must then output an x̂ ∈ {x1, . . . , xm}, which is our prediction of which observed experiment has the best value of f. For model-free policies x̂ = xi, where

i = arg max_i { yi | i = 1, . . . , m }.

For model-based policies

x̂ = arg max_{xi} { E[yi|xi, D] | i = 1, . . . , m }

where y ∼ pθ(·|x, D). Regret for both model-free and model-based policies is calculated as⁵

regret(f, x̂) = ymax − f(x̂).

⁵ We only need to compute regrets during evaluation, and we assume that f(x) can be easily calculated.

Chapter 4
Experimental Results

4.1 Test Functions

We evaluate our policies using four 2-dimensional functions over [0, 1]² with different characteristics. The first three functions, Cosines, Rosenbrock, and Discontinuous, are benchmarks that have been widely used in previous work on stochastic optimization (e.g. [1, 3]). The mathematical expressions of the functions are listed in Table 4.1, and Figures 4.1 to 4.4 show their contour plots. In these experiments we use a zero-mean Gaussian noise model with variance equal to 1% of the function range. Exploratory experiments with larger noise levels indicate similar qualitative performance, though the absolute optimization performance of the methods degrades somewhat, as expected.

Table 4.1: Benchmark Functions.

Function      Mathematical representation                               Global optimum
Cosines       1 − (u² + v² − 0.3cos(3πu) − 0.3cos(3πv)),                (0.31, 0.31), ymax ≈ 0.9
              u = 1.6x − 0.5, v = 1.6y − 0.5
Rosenbrock    10 − 100(y − x²)² − (1 − x)²                              (1, 1), ymax = 10
Discont       1 − 2((x − 0.5)² + (y − 0.5)²) if x < 0.5, 0 otherwise    (0.5, 0.5), ymax = 1

Unfortunately, we have not yet been able to perform comparative experiments in our motivating nano-design problem due to the high cost of obtaining data, which is precisely why our methods are needed in the first place. Rather, for real-world data, we utilized data collected as part of a study on biosolar hydrogen production [4], where the goal is to maximize the hydrogen production of the cyanobacteria Synechocystis sp. PCC 6803 by optimizing the pH and Nitrogen levels of the growth media. The data set contains 66 samples uniformly distributed over the 2D input space. This data is used to simulate the true function by fitting a Gaussian process regression model to it, picking kernel parameters via standard validation techniques. With this model we then simulated the experimental design process by sampling from the posterior of this GP to obtain noisy outcomes for any given requested experiment. See Figure 4.4 for the contour plot.

Figure 4.1: Cosines.
Figure 4.2: Discont.
Figure 4.3: Rosenbrock.
Figure 4.4: Real.

4.2 Modeling Assumptions

In our experiments, we use a density px(x|r) over experiments given a request r which is uniform over r. We further assume that the density pc(c|r) over the cost given a request is a conditional Gaussian distribution. In particular, we use a standard deviation of 0.01 and a mean equal to

1 + (w / |ri|) × slope.

In this formulation, each experiment requires at least a unit cost in expectation, with the expected additional cost being inversely proportional to the size of the constrained experiment r. The value of the slope parameter dictates how quickly the cost increases as the size of a constrained experiment decreases (see Figure 4.5). This cost model is inspired by the fact that all experimental setups require a minimal amount of work, here unit cost, but that the amount of additional work grows with increased constraints given to the experimenter. All of our policies easily extend to other cost models.

Figure 4.5: Cost vs. constraint size |ri| for slope = 10, 15, 30, with w = 0.01.
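The following sketch simulates the experimenter under these modeling assumptions: x is drawn uniformly from r and the cost from the conditional Gaussian above. It is an illustrative NumPy implementation; the function name, the noise argument, and the default values are our own choices rather than the report's.

```python
import numpy as np

rng = np.random.default_rng(0)

def fill_request(r, f, w=0.01, slope=10.0, noise_sd=None, cost_sd=0.01):
    """Simulate the experimenter for a constrained experiment r (Section 4.2):
    x ~ uniform over r, cost ~ N(1 + (w/|r_i|)*slope, cost_sd^2), y = f(x) + noise.
    r is an (n, 2) array of [low, high] intervals; f is the test function."""
    r = np.asarray(r, dtype=float)
    x = rng.uniform(r[:, 0], r[:, 1])                   # p_x(x | r): uniform over r
    side = r[0, 1] - r[0, 0]                            # |r_i| (hyper-cube regions)
    c = rng.normal(1.0 + (w / side) * slope, cost_sd)   # p_c(c | r)
    y = f(x) + (rng.normal(0.0, noise_sd) if noise_sd else 0.0)
    return x, y, c
```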
4.3 Evaluation

In this section, we evaluate the performance of the policies in both the traditional optimization setting and the constrained experiment setting.

4.3.1 Traditional Optimization

In the traditional optimization setting, heuristics assign values to individual experiments rather than constrained experiments, and they do not account for experimental cost, but rather assume that each experiment has uniform cost. We define these heuristics in the traditional optimization setting as follows¹, where y ∼ pθ(·|x, D).

¹ Unfortunately, both RR and BRR are undefined in continuous experimental space for the traditional optimization setting, since they assign values to individual experiments. However, RR and BRR are definable in a discretized experimental space based on some selection orderings. We did not experiment with RR and BRR for the traditional optimization setting, since all the other heuristics evaluated in this section are defined over a continuous experimental space.

Maximum Mean (MM).

MM(x|D) = E[y|x, D]    (4.1)

Maximum Upper Interval (MUI).

MUI(x|D) = E[y|x, D] + 1.96 √(Var[y|x, D]).    (4.2)

Maximum Probability of Improvement (MPI).

MPIm(x|D) = Φ( (E[y|x, D] − (y* + m|y*|)) / √(Var[y|x, D]) ).    (4.3)

Maximum Expected Improvement (MEI).

MEI(x|D) = (E[y|x, D] − y*) Φ( (E[y|x, D] − y*) / √(Var[y|x, D]) ) + √(Var[y|x, D]) φ( (y* − E[y|x, D]) / √(Var[y|x, D]) ).    (4.4)

The policy for selecting an experiment, given a current experimental state D, based on one of the above heuristics H in the context of individual experiments (and without taking cost into consideration) is defined as

MAX_H(D) = arg max_{x ∈ X} H(x|D).

We used hillclimbing to implement this policy (see the gradients in Appendix A). Hillclimbing used 50 random-restart points for each run; for each random-restart point, a gradient step of 0.01 is used and 100 steps are taken. Performance is averaged over 500 runs. Each run starts with five randomly selected initial points and policies are then used to select experiments for 50 steps. The random policy is considered as the baseline. To compare the regret values across different functions we report both absolute regret values and normalized regret values, which are computed by dividing the regret of each policy by the mean regret achieved by the random policy. A normalized regret less than one indicates that an approach achieves regret better than random, while a normalized regret greater than one indicates that an approach is doing worse than random.
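Before turning to the results, a minimal sketch of the MAX_H selection step just described (50 random restarts, 100 gradient steps of size 0.01) is given below. `heuristic` and `grad` stand for any of the heuristics H(x|D) in Equations (4.1)-(4.4) and its gradient from Appendix A; the function and parameter names are ours, not the report's Matlab code.

```python
import numpy as np

def max_h_policy(heuristic, grad, dim=2, restarts=50, steps=100, step_size=0.01):
    """MAX_H policy for the traditional setting: maximize the heuristic H(x|D)
    by random-restart hillclimbing over [0, 1]^dim."""
    best_x, best_h = None, -np.inf
    rng = np.random.default_rng()
    for _ in range(restarts):
        x = rng.uniform(0.0, 1.0, size=dim)             # random restart point
        for _ in range(steps):
            x = np.clip(x + step_size * grad(x), 0.0, 1.0)
        h = heuristic(x)
        if h > best_h:
            best_x, best_h = x, h
    return best_x
```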
Our results for the individual functions are shown in Figures 4.7 to 4.10, and Figure 4.6 gives the normalized regrets for each policy averaged across our four functions.

Figure 4.6: Normalized Overall Regrets.

We summarize our findings as follows.

• The MEI heuristic consistently achieves the best performance among all the heuristics.

• In Figure 4.6, MAX-MM (PMAX) outperforms random when the number of steps is less than 17, while it is consistently worse than the random policy when more than 17 steps are taken. The main reason for this is that when few steps are allowed, the chance of random touching the globally optimal area is low, so it can be a good idea to focus on the currently promising area given a highly limited number of steps; as the number of steps increases, that chance grows. However, PMAX is consistently worse than random on the Discont function, which is due to the large proportion of optimal area. MAX-MUI (IEMAX) overcomes the greedy nature of PMAX and is consistently better than random.

• The behavior of the pure MPI strategy (MAX-MPI(0)) is almost the opposite of PMAX. Given the fact that Φ(·) is monotonic, we can rewrite Equation (4.3) for H = MPI(0) as

MAX_H(D) = arg max_{x ∈ X} (E[y|x, D] − y*) / √(Var[y|x, D]).

When few experimental points have been generated, MAX-MPI(0) typically selects from points {x | E[y|x, D] ≥ y*}, which are usually around local optima. Typically, under noisy conditions, as more and more points are generated, it eventually comes to a stage where E[y|x, D] < y* for all x. Then we are driven to search elsewhere where the standard error is higher, which is the reason why MAX-MPI(0) eventually escapes from local optima and hence outperforms random when more steps are taken.

• MAX-MPI(m) achieves performance comparable with MAX-MEI. With some margin m > 0, we usually have E[y|x, D] < y* + m|y*| for all x (otherwise it behaves like pure MPI), so the probability of improvement given by Equation (4.3) will be small. Eventually, the probability of improvement around the current best point becomes so small that we are driven to search elsewhere where the variance is higher. This is the main reason that IEMAX outperforms MAX-MPI(m) when the number of steps is small. We also observed that MAX-MPI(0.2) outperforms MAX-MPI(0.1) when the variance over the experimental space is high, while MAX-MPI(0.1) achieves better performance once the variance is reduced as more experimental points are generated. This indicates that MPI with a high margin is preferred at the initial stage with few experiments generated, while MPI with a small margin is preferred when the experimental space is well explored.

Figure 4.7: Regrets (left) and Normalized Regrets (right) for Cosines.
Figure 4.8: Regrets (left) and Normalized Regrets (right) for Discont.
Figure 4.9: Regrets (left) and Normalized Regrets (right) for Rosenbrock.
Figure 4.10: Regrets (left) and Normalized Regrets (right) for Real.

4.3.2 Constrained Experiments

For each function and choice of budget and cost structure (i.e. choice of slope), we evaluate each policy by averaging the regret it achieves over 500 runs. Each run starts with five randomly selected initial points and then policies are used to select constrained experiments until the budget runs out, at which point the final regret is measured. To compare the regret values across different functions we report normalized regret values, which are computed by dividing the regret of each policy by the mean regret achieved by a random policy (i.e., a policy that always selects the entire input space as the constrained experiment). A normalized regret less than one indicates that an approach achieves regret better than random, while a normalized regret greater than one indicates that an approach is doing worse than random.

We ran experiments with a budget of 15 units and slope values of 10, 15, and 30. In experiments not shown we also considered other budget amounts, both smaller and larger, and found similar qualitative results. For the CMC policies we considered two values of α, 0.8 and 0.9. For the MPI heuristic we used a margin of m = 0.2, which, based on a small number of trials, performed well compared to other values. Our results for the individual functions are shown in Tables 4.3 to 4.6, giving both the average normalized regrets over 500 trials and the standard deviations (Figures 4.19 to 4.22 are with a slope value of 15 and α = 0.9). In order to provide an assessment of overall performance, Table 4.2 gives the normalized regrets for each policy averaged across our four functions. Below we review our main empirical findings.

Table 4.2: Normalized Overall Regrets.

                  slope = 0.1          slope = 0.15         slope = 0.30
                 α = 0.8   α = 0.9    α = 0.8   α = 0.9    α = 0.8   α = 0.9
CN-MEI                0.468                 0.558                 0.760
CMC-MEI          0.487     0.455      0.562     0.535      0.691     0.681
CMC-MPI(0.2)     0.505     0.622      0.611     0.766      0.726     0.875
CMC-MUI          0.945     0.733      1.000     0.787      0.989     0.795
CMC-MM           1.147     1.260      1.138     1.375      1.304     1.489
CMC-RR                0.862                 0.894                 0.918
CMC-BRR               0.842                 0.878                 0.907

Model-free Policies

From Table 4.2 we can see that our model-free policies CMC-RR and CMC-BRR achieve an improvement over random of approximately 10% across the different slopes. This shows that the heuristic of trying to cover the space as evenly as possible pays off with respect to random. We also see that in general CMC-BRR performs slightly better than CMC-RR, which gives some indication that the additional exploitive behavior of BRR pays off overall. Looking at the individual results, however, we can see that for the real function (Table 4.6 and Figure 4.22) both BRR and RR perform worse than random.
Further investigation reveals that the reason for this poor performance is that CMC-RR/BRR have a bias toward constrained experiments near the center of the input space, and that the sampling distribution becomes less biased as the budget increases. The approximate sampling distributions for slope=10 with an initial budget of 10 units are shown in Figure 4.11. Under the same setting, CMC-RR/BRR outperform random at a budget of 45 units, which is shown in Figure 4.12. This bias is a result of the fact that constrained experiments are required to fall completely within the experimental space of [0, 1]² and the CMC policy attempts to select the least-cost constrained experiment (see Equation 3.6); there are fewer such constrained experiments near the edges and corners. The real function has its optimal point in a corner of the space, explaining the poor performance of CMC-RR/BRR.

Figure 4.11: Approximate sampling distribution for 10 units budget, slope=10.
Figure 4.12: Approximate sampling distribution for 45 units budget, slope=10.

Model-based Policies

We first observe from the overall and individual results that CMC-MM is consistently worse than the random policy. The primary reason for this is that it typically selects high cost experiments near locally optimal points, causing it to quickly expend its budget and to not explore adequately.

Figure 4.13: Plot of the MUI heuristic.

Next, we observe that CMC-MUI is typically better than the model-free policies for α = 0.9 but is quite poor for α = 0.8. Further investigation reveals that for α = 0.8 CMC-MUI typically selects constrained experiments with large sizes (low cost), which leads to close to random behavior. The reason for this is that the upper interval of the GP posterior tends to be quite flat across the space and optimistic (see Figure 4.13). This allows large constrained experiments to maintain good MUI values. For α = 0.9 this effect is lessened due to the stricter requirement on heuristic value.

Next, from the overall table we see that the MEI-based methods and CMC-MPI are the top contenders among all methods, and they are in general superior to model-free methods. Looking at the individual results we can see that this holds for all functions except for Discontinuous, where the CMC approaches are slightly worse than random. Further investigation shows that for this function the CMC approaches selected higher cost constrained experiments relative to the other functions, resulting in fewer overall experiments and worse performance than random. This behavior is due to the discontinuity of the function near the optimal region (center), where the function suddenly becomes very small. The result is that the heuristic value of constrained experiments near the discontinuity decreases rapidly as they grow in size, causing CMC to select small, high cost regions. This suggests that CMC is not a good choice for discontinuous functions with high variability in outcomes.

Figure 4.14: GP posterior (left), MPI(0.1) (center) and MEI (right).
Comparing CMC-MPI and CMC-MEI, CMC-MEI appears to have a slight edge overall, which is consistent with what we observed in the traditional setting. In general, we observed that MPI tends to select slightly fewer experiments than MEI, which we conjecture is due to the observation that the MEI heuristic tends to be smoother than MPI over the experimental space. In the traditional setting, we noticed that MAX-MPI(m) typically tries, at the beginning, to select from points {x | E[y|x, D] = y* + δ, δ → 0+}, which have low variance, causing them to have high MPI(m) values, while the MPI(m) values of points {x | E[y|x, D] = y* − δ, δ → 0+} are close to 0. That is the main reason why there is a sharp gap in the area of optimal MPI(m) value. However, the second term of Equation (4.4) gives MEI a smoother curve between the maximum and minimum MEI values (see Figure 4.14) and also a relatively higher heuristic value in unexplored areas compared with MPI(m). CMC-MEI is also preferable in that it does not require the selection of a margin parameter.

Comparing CMC-MEI to CN-MEI, we see that CN-MEI is comparable overall, with slightly worse performance for slope=0.3, which is the model where costs grow most rapidly with decreasing sizes of constrained experiments. For slope=0.3 we found, as expected, that CN-MEI, which normalizes by the cost of a constrained experiment, tends to select lower-cost experiments than CMC-MEI. In some cases, this tendency appears to be too conservative, since more aggressive spending by the CMC approaches appears to be beneficial. However, we see that for the discontinuous function CN-MEI significantly outperformed CMC-MEI, since it was not forced to select overly high cost constrained experiments by the discontinuity. Overall, an encouraging observation for CN-MEI is that it is the only model-based method that consistently outperforms the model-free methods across all functions and cost models, indicating a degree of robustness.

Finally, we observe that CMC-MEI and CMC-MPI can be quite sensitive to the value of α. For MPI, it appears that the smaller α value tends to be better. However, for CMC-MEI, there is not a consistent trend across functions and cost models for which value of α is better. This presents a practical difficulty when applying CMC-MEI in a new domain. In contrast, CN-MEI is parameter free and has reasonably robust performance. This makes CN-MEI a recommended method when encountering a new problem.

Figure 4.15: Cosines with slope=15, MEI (left), MPI(0.2) (right).
Figure 4.16: Discont with slope=15, MEI (left), MPI(0.2) (right).
Figure 4.17: Rosenbrock with slope=15, MEI (left), MPI(0.2) (right).
Figure 4.18: Real-world with slope=15, MEI (left), MPI(0.2) (right).
Figure 4.19: Cosines with slope=15 and α = 0.9.
Figure 4.20: Discont with slope=15 and α = 0.9.
Figure 4.21: Rosenbrock with slope=15 and α = 0.9.
Figure 4.22: Real-world with slope=15 and α = 0.9.

Table 4.3: Normalized Regrets for Cosines Function

                  slope = 0.1                  slope = 0.15                 slope = 0.30
                 α = 0.8       α = 0.9        α = 0.8       α = 0.9        α = 0.8       α = 0.9
CN-MEI                0.569 (0.34)                 0.714 (0.41)                 0.826 (0.42)
CMC-MEI          0.528 (0.38)  0.445 (0.35)   0.589 (0.41)  0.494 (0.37)   0.712 (0.43)  0.690 (0.46)
CMC-MPI(0.2)     0.480 (0.37)  0.471 (0.36)   0.562 (0.43)  0.582 (0.43)   0.655 (0.46)  0.670 (0.45)
CMC-MUI          0.929 (0.43)  0.788 (0.40)   0.947 (0.43)  0.831 (0.43)   0.944 (0.44)  0.828 (0.42)
CMC-MM           0.948 (0.69)  0.966 (0.72)   0.971 (0.69)  1.008 (0.71)   1.057 (0.69)  1.067 (0.67)
CMC-RR                0.842 (0.40)                 0.866 (0.40)                 0.897 (0.41)
CMC-BRR               0.833 (0.40)                 0.862 (0.41)                 0.889 (0.41)

Table 4.4: Normalized Regrets for Discont Function

                  slope = 0.1                  slope = 0.15                 slope = 0.30
                 α = 0.8       α = 0.9        α = 0.8       α = 0.9        α = 0.8       α = 0.9
CN-MEI                0.527 (0.44)                 0.497 (0.44)                 0.626 (0.60)
CMC-MEI          0.614 (0.53)  0.670 (0.55)   0.636 (0.55)  0.824 (0.68)   0.771 (0.64)  1.041 (0.75)
CMC-MPI(0.2)     0.655 (0.56)  0.962 (0.84)   0.867 (0.71)  1.232 (1.17)   1.018 (0.80)  1.439 (1.23)
CMC-MUI          0.950 (0.90)  0.556 (0.57)   1.023 (0.94)  0.617 (0.58)   1.032 (1.01)  0.645 (0.64)
CMC-MM           1.681 (2.26)  2.053 (2.69)   1.574 (2.01)  2.283 (2.70)   1.993 (2.45)  2.377 (2.49)
CMC-RR                0.602 (0.51)                 0.617 (0.53)                 0.634 (0.56)
CMC-BRR               0.587 (0.50)                 0.602 (0.53)                 0.634 (0.57)

Table 4.5: Normalized Regrets for Rosenbrock Function

                  slope = 0.10                 slope = 0.15                 slope = 0.30
                 α = 0.8       α = 0.9        α = 0.8       α = 0.9        α = 0.8       α = 0.9
CN-MEI                0.602 (0.40)                 0.665 (0.49)                 0.736 (0.60)
CMC-MEI          0.584 (0.39)  0.618 (0.42)   0.657 (0.48)  0.655 (0.45)   0.696 (0.50)  0.676 (0.53)
CMC-MPI(0.2)     0.611 (0.43)  0.627 (0.45)   0.661 (0.47)  0.676 (0.57)   0.670 (0.52)  0.714 (0.56)
CMC-MUI          1.004 (1.10)  0.999 (1.12)   1.079 (1.34)  0.996 (1.11)   1.030 (1.19)  0.945 (1.10)
CMC-MM           0.828 (0.77)  0.843 (1.06)   0.808 (0.82)  0.960 (1.73)   0.957 (1.21)  1.260 (2.03)
CMC-RR                0.897 (0.85)                 0.930 (0.87)                 0.967 (1.00)
CMC-BRR               0.881 (0.83)                 0.927 (0.90)                 0.955 (0.98)

Table 4.6: Normalized Regrets for Real-world Function

                  slope = 0.10                 slope = 0.15                 slope = 0.30
                 α = 0.8       α = 0.9        α = 0.8       α = 0.9        α = 0.8       α = 0.9
CN-MEI                0.176 (0.26)                 0.354 (0.46)                 0.852 (0.66)
CMC-MEI          0.221 (0.41)  0.089 (0.18)   0.365 (0.57)  0.169 (0.32)   0.586 (0.65)  0.318 (0.47)
CMC-MPI(0.2)     0.274 (0.56)  0.426 (0.66)   0.355 (0.65)  0.574 (0.74)   0.560 (0.74)  0.677 (0.75)
CMC-MUI          0.897 (0.68)  0.588 (0.55)   0.952 (0.69)  0.706 (0.61)   0.949 (0.67)  0.761 (0.60)
CMC-MM           1.131 (1.04)  1.178 (1.07)   1.199 (1.04)  1.248 (1.08)   1.208 (0.93)  1.251 (0.91)
CMC-RR                1.107 (0.68)                 1.161 (0.69)                 1.172 (0.67)
CMC-BRR               1.065 (0.68)                 1.123 (0.70)                 1.148 (0.67)
Appendix A
Gradient

A.1 MM

In the context of individual experiments, the MM heuristic is defined as

MM(x|D) = E[y|x, D]    (A.1)

where y ∼ pθ(·|x, D), D = (X, y) and X = {x1, . . . , xn}. In our work, we use Gaussian process regression (GPR) to compute the expectation of the conditional posterior pθ(y|x, D) required by our model-based heuristics (see Equation 3.8). Writing α for the vector (K(X, X) + σε²I)⁻¹ y in Equation 3.8, the expectation can be rewritten as¹

E[y|x, D] = K(x, X)ᵀ α = Σ_{i=1}^{n} ki αi    (A.2)

where K(x, X)ᵀ = (k1, . . . , kn) and ki = σ exp(−|x − xi|² / (2l)). The derivative or gradient of MM(x|D) with respect to an individual experiment x = (x¹, . . . , x^d) is computed component by component, i.e.,

∂MM(x|D)/∂x = ( ∂MM(x|D)/∂x¹, . . . , ∂MM(x|D)/∂x^d )ᵀ.    (A.3)

Noting that |x − xi|² = (x¹ − xi¹)² + · · · + (x^d − xi^d)², each entry of (A.3) is given by

∂MM(x|D)/∂x^j = Σ_{i=1}^{n} αi ∂ki/∂x^j = −(1/l) Σ_{i=1}^{n} (x^j − xi^j) ki αi.    (A.4)

¹ We use Cholesky factorization to compute α = Lᵀ \ (L \ y), where L = cholesky(K + σε²I) [14].

A.2 MUI

In the context of individual experiments, the MUI heuristic is defined as

MUI(x|D) = E[y|x, D] + 1.96 √(Var[y|x, D]).    (A.5)

The derivative or gradient of MUI(x|D) with respect to an individual experiment x is then given by

∂MUI(x|D)/∂x = ∂E[y|x, D]/∂x + 1.96 × (1/2) Var[y|x, D]^(−1/2) ∂Var[y|x, D]/∂x.    (A.6)

The first term of Equation (A.6) is the same as the gradient for the MM heuristic. Given the Gaussian kernel

K(x, x′) = σ exp(−|x − x′|² / (2l))

and Equation (3.9),

Var[y|x, D] = σ − K(x, X)ᵀ (K(X, X) + σε²I)⁻¹ K(x, X),

the gradient of the variance in the second term of Equation (A.6) can be written as

∂Var[y|x, D]/∂x = ( ∂Var[y|x, D]/∂x¹, . . . , ∂Var[y|x, D]/∂x^d )ᵀ    (A.7)

with each entry

∂Var[y|x, D]/∂x^j = −(k1, . . . , kn) K⁻¹ (∂k1/∂x^j, . . . , ∂kn/∂x^j)ᵀ − (∂k1/∂x^j, . . . , ∂kn/∂x^j) K⁻¹ (k1, . . . , kn)ᵀ

where K = K(X, X) + σε²I and

∂ki/∂x^j = −(1/l)(x^j − xi^j) ki.

A.3 MPI

In the context of individual experiments, the MPI heuristic with margin m is defined as

MPIm(x|D) = Φ( (E[y|x, D] − (y* + m|y*|)) / √(Var[y|x, D]) ).    (A.8)

The derivative or gradient of MPIm(x|D) with respect to an individual experiment x is given by Equation (A.9), where E(x) = E[y|x, D], S(x) = √(Var[y|x, D]) and their gradients can be calculated from Equations (A.3) and (A.7) respectively.

∂MPIm(x|D)/∂x = φ( (E(x) − (y* + m|y*|)) / S(x) ) × ( S(x) ∂E(x)/∂x − ∂S(x)/∂x (E(x) − (y* + m|y*|)) ) / S(x)²    (A.9)

A.4 MEI

In the context of individual experiments, the MEI heuristic is defined as

MEI(x|D) = (E(x) − y*) Φ( (E(x) − y*) / S(x) ) + S(x) φ( (y* − E(x)) / S(x) ).    (A.10)

The derivative or gradient of MEI(x|D) with respect to an individual experiment x is then given by

∂MEI(x|D)/∂x = Φ( (E(x) − y*) / S(x) ) ∂E(x)/∂x + (E(x) − y*) φ( (E(x) − y*) / S(x) ) V1 + φ( (y* − E(x)) / S(x) ) ∂S(x)/∂x + S(x) V2    (A.11)

where

V1 = ( S(x) ∂E(x)/∂x − ∂S(x)/∂x (E(x) − y*) ) / S(x)²

V2 = φ( (y* − E(x)) / S(x) ) ( (E(x) − y*) / S(x) ) (−V1).
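As an illustration of how these gradients feed the hillclimbing procedure of Section 4.3.1, the following NumPy sketch computes the MM gradient of Equation (A.4), assuming α has already been obtained via the Cholesky factorization above. The function name and arguments are our own; this is not the report's Matlab implementation.

```python
import numpy as np

def mm_gradient(x, X, alpha, l=0.01, sigma_f=1.0):
    """Gradient of MM(x|D) = E[y|x,D] = k(x)^T alpha with respect to x (Eq. A.4).
    x: (d,) query point, X: (m, d) observed inputs,
    alpha: (m,) precomputed (K + sigma_eps^2 I)^{-1} y."""
    diff = x[None, :] - X                                      # (m, d): x - x_i
    k = sigma_f * np.exp(-np.sum(diff**2, axis=1) / (2 * l))   # k_i = K(x, x_i)
    # dMM/dx_j = -(1/l) * sum_i (x_j - x_{i,j}) * k_i * alpha_i
    return -(diff * (k * alpha)[:, None]).sum(axis=0) / l
```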
Appendix B
Mathematical Derivation

B.1 MEI Heuristic

The heuristic formula of MEI is extended from [7], which is intended to find the minimum of the surface. Let Y(x) be a random variable describing our uncertainty about the function's value at a point x, and assume that Y(x) is normally distributed with expectation E(x) and variance S(x)². If the current best observation is fmax, then we will achieve an improvement of I if Y(x) = fmax + I. The likelihood of achieving this improvement is given by the normal density function

(1 / (√(2π) S(x))) exp( −(fmax + I − E(x))² / (2S(x)²) ).    (B.1)

The expected improvement is simply the expected value of the improvement found by integrating over this density:

E(I) = ∫₀^∞ I (1 / (√(2π) S(x))) exp( −(fmax + I − E(x))² / (2S(x)²) ) dI
     = (1 / (√(2π) S(x))) exp( −(fmax − E(x))² / (2S(x)²) ) ∫₀^∞ I exp( −(2I(fmax − E(x)) + I²) / (2S(x)²) ) dI.    (B.2)

Let

T = exp( −(2I(fmax − E(x)) + I²) / (2S(x)²) ).    (B.3)

Then the derivative of T with respect to I is

∂T/∂I = −(1/S(x)²) ( I T + (fmax − E(x)) T ),

from which we get

I T = −(fmax − E(x)) T − S(x)² ∂T/∂I.

Equation (B.2) can then be rewritten as

E(I) = (1 / (√(2π) S(x))) exp( −(fmax − E(x))² / (2S(x)²) ) ∫₀^∞ I T dI
     = −(fmax − E(x)) ∫₀^∞ (1 / (√(2π) S(x))) exp( −(I − (E(x) − fmax))² / (2S(x)²) ) dI + S(x) φ( (fmax − E(x)) / S(x) )
     = −(fmax − E(x)) ∫_{−∞}^{(E(x) − fmax)/S(x)} (1/√(2π)) exp( −(I*)² / 2 ) dI* + S(x) φ( (fmax − E(x)) / S(x) )    (B.4)

where

I* = (I − (E(x) − fmax)) / S(x)

and φ(·) is the normal density function. Equation (B.4) then gives

E(I) = S(x) ( −uΦ(−u) + φ(u) )    (B.5)

where

u = (fmax − E(x)) / S(x)

and Φ(·) is the normal cumulative distribution function.

B.2 Weighted Combination of Distributions

B.2.1 Expectation

Let Z be a random variable distributed according to a mixture of the distributions Y = {Y1, . . . , Yn} with mixture weights W = {w1, . . . , wn}, where each Yi has expectation E(Yi) and variance S(Yi)². Then we have

E(Z) = Σ_{i=1}^{n} wi E(Yi).    (B.6)

B.2.2 Variance

Var(Z) = E(Z²) − (E(Z))²
       = Σ_{i=1}^{n} wi E(Yi²) − ( Σ_{i=1}^{n} wi E(Yi) )²
       = Σ_{i=1}^{n} wi ( S(Yi)² + E(Yi)² ) − ( Σ_{i=1}^{n} wi E(Yi) )².    (B.7)
Bibliography

[1] B. S. Anderson, A. W. Moore, and D. Cohn. A nonparametric approach to noisy and costly optimization. In ICML, 2000.

[2] D. Bond and D. Lovley. Electricity production by Geobacter sulfurreducens attached to electrodes. Applied and Environmental Microbiology, 69:1548–1555, 2003.

[3] M. Brunato, R. Battiti, and S. Pasupuleti. A memory-based RASH optimizer. In AAAI-06 Workshop on Heuristic Search, Memory Based Heuristics and Their Applications, 2006.

[4] E. H. Burrows, W.-K. Wong, X. Fern, F. W. Chaplen, and R. L. Ely. Optimization of pH and nitrogen for enhanced hydrogen production by Synechocystis sp. PCC 6803 via statistical and machine learning methods. Biotechnology Progress, in submission.

[5] Y. Fan, H. Hu, and H. Liu. Enhanced coulombic efficiency and power density of air-cathode microbial fuel cells with an improved cell configuration. Journal of Power Sources, in press, 2007.

[6] J. Schneider and A. Moore. Active learning in discrete input spaces. In Interface Symposium, 2002.

[7] D. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21:345–383, 2001.

[8] D. Lizotte, O. Madani, and R. Greiner. Budgeted learning of naive-Bayes classifiers. In UAI, 2003.

[9] O. Madani, D. Lizotte, and R. Greiner. Active model selection. In UAI, 2004.

[10] A. Moore and J. Schneider. Memory-based stochastic optimization. In NIPS, 1995.

[11] A. Moore, J. Schneider, J. Boyan, and M. S. Lee. Q2: Memory-based active learning for optimizing noisy continuous functions. In ICML, pages 386–394, 1998.

[12] R. Myers and D. Montgomery. Response Surface Methodology: Process and Product Optimization Using Designed Experiments. Wiley, 1995.

[13] D. Park and J. Zeikus. Improved fuel cell and electrode designs for producing electricity from microbial degradation. Biotechnology and Bioengineering, 81(3):348–355, 2003.

[14] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[15] G. Reguera et al. Extracellular electron transfer via microbial nanowires. Nature, 435:1098–1101, 2005.