Myopic Policies for Budgeted Optimization with
Constrained Experiments
(Project Report)
Wei Lin
December 1, 2008
Abstract
Motivated by a real-world problem, we study a novel setting for budgeted optimization
where the goal is to optimize an unknown function f(x) given a budget. In our setting,
it is not practical to request samples of f(x) at precise input values due to the formidable
cost of precise experimental setups. Rather, we may request constrained experiments,
which give the experimenter constraints on x for which they must return f(x). Importantly,
as the constraints become looser, the experimental cost decreases, but the uncertainty
about the location of the next observation increases. Our problem is to manage this
trade-off by selecting a sequence of constrained experiments to best optimize f within
the budget. We propose a number of myopic policies for selecting constrained experiments
using both model-free and model-based approaches, inspired by policies for unconstrained
settings. Experiments on synthetic and real-world functions indicate that our policies
outperform random selection and that the model-based policies are superior to model-free
ones, and they give insights into which policies are preferable overall.
Contents

1 Introduction
2 Problem Setup
3 Myopic Policies for Constrained Experiments
  3.1 Heuristics for Constrained Experiments
    3.1.1 Model-Free Heuristics
    3.1.2 Model-based Heuristics
  3.2 Cost sensitive policies
  3.3 Implementation details
    3.3.1 Conditional posterior distribution
    3.3.2 Heuristics computation
    3.3.3 Regret computation
4 Experimental Results
  4.1 Test Functions
  4.2 Modeling Assumptions
  4.3 Evaluation
    4.3.1 Traditional Optimization
    4.3.2 Constrained Experiments
A Gradient
  A.1 MM
  A.2 MUI
  A.3 MPI
  A.4 MEI
B Mathematical Derivation
  B.1 MEI Heuristic
  B.2 Weighted Combination of Distributions
    B.2.1 Expectation
    B.2.2 Variance
Chapter 1
Introduction
This work is motivated by the experimental design problem of optimizing the power
output of nano-enhanced microbial fuel cells. Microbial fuel cells (MFCs) [2, 5] use microorganisms to break down organic materials and generate electricity (see Figure 1.1). For
a particular MFC application domain it is critical to optimize the biological energetics
and the microbial/electrode interface of the system, which research has shown to depend
strongly on the surface properties of the anode [13, 15]. This motivates the design of
nano-enhanced anodes, where various types of nano-structures, e.g. carbon nano-wire,
are grown directly on the anode surface. Unfortunately, there is little understanding of
the interaction between the variety of possible nano-enhancements and MFC capabilities
for different micro-organisms. Thus, optimizing anode design for a particular application
is largely guesswork for the scientists. Our goal is to develop algorithms to aid them in
this process.
Traditional experimental design approaches such as response surface methods [12]
commonly assume that the experimental inputs can be specified precisely and attempt
to optimize a design by requesting specific experiments. For example, requesting that an
anode be tested with nano-wire of specific length and density. However, these parameters are unlike regular experimental control parameters (such as temperature) that can
be easily set. Rather, manufacturing nano-structures is an art and it is very difficult to
achieve a precise parameter setting. Instead, it is more practical to request constrained
experiments, which place constraints on these parameters. For example, specifying intervals on length and density. Given such a request, nano-materials that satisfy the given
set of constraints can be produced at some time and resource cost, where the cost will
typically increase as the constraints become tighter.
Figure 1.1: Schematic of single chamber MFC, showing the anode, where bacteria form a biofilm on the surface, and an air cathode.

Prior work on experimental design, stochastic optimization, and active learning assumes the ability to request precise experiments and hence does not directly apply to our
setting of constrained experiments. In this project, we study the associated budgeted
optimization problem where we are given an experimental budget and must request a
sequence of constrained experiments to optimize the unknown function. Solving this
problem involves reasoning about a variety of trade-offs, for example, the trade-off between using weaker constraints to lower the cost of an experiment and the resulting increase in the
uncertainty about the location of the next observation. For this purpose we extend existing model-free and model-based myopic policies for traditional optimization problems
[10, 11, 6, 9] to our setting of constrained experiments. Our experimental results on both
synthetic functions and a function generated from real-world data indicate that our policies have considerable advantages compared to random policies and that the model-based
policies can significantly outperform the model-free policies.
We make the following contributions. First, motivated by a real-world application,
we identify and formulate the problem of budgeted optimization with constrained experiments (Chapter 2). We believe that this problem class is not rare and for example is
the common case in nano-structure optimization and any other design domain where it
is difficult to construct precise experimental setups. Second, we propose a number of
myopic policies for selecting constrained experiments (Chapter 3). Finally, we empirically evaluate the policies and make recommendations (Chapter 4).
Chapter 2
Problem Setup
Let X ⊆ R^n be an n-dimensional input space, where we will often refer to elements of X
as experiments. We assume an unknown real-valued function f : X → R, which will be
thought of as the expected value of the dependent variable after running an experiment.
In our motivating domain, f(x) is the expected power output produced by a particular
nano-structure x. Conducting an experiment x produces a noisy outcome y = f(x) + ε,
where ε is a noise term.
In traditional optimization settings, the goal is to find an x that approximately optimizes f by requesting a sequence of experiments and observing their outcomes. As
described in the previous chapter, we are interested in applications where it is not practical
to request specific experiments. Rather, we will request constrained experiments, which
each specify constraints that define a subset of experiments in X . In this paper, each
constrained experiment will be a hyper-rectangle r = ⟨r1, . . . , rn⟩, where the ri are non-
empty intervals along each input dimension. Given a constrained experiment request r,
the experimenter will return a tuple (x, y, c), where x is an experiment in r, y is a noisy
outcome of x, and c is the cost of creating and running the experiment x. This tuple
indicates that the experimenter accumulated a cost of c in order to construct and run an
experiment x that satisfied r, resulting in an observed value of y. The units of cost will
vary across applications; in our motivating application, costs are dominated by the time
required to construct an experiment satisfying r.
The input to our problem is an initial budget B, a conditional density pc (c|r) over
the cost c of fulfilling a constrained experiment r, and a conditional density px (x|r) over
the specific experiment x generated for a constrained experiment r. Given this input,
we then must request a sequence of constrained experiments until the total cost of those
experiments goes over the budget. This results in a sequence of m experiment tuples
(x1 , y1 , c1 ), . . . , (xm , ym , cm ) corresponding to those requests that were filled before going
over the budget on request m + 1. After the sequence of requests we must then output
an x̂ ∈ {x1 , . . . , xm }, which is our prediction of which observed experiment has the best
value of f . This formulation matches the objective of our motivating application of using
the given budget to produce a good nano-structure x̂, rather than just make a prediction
of what nano-structure might be good. We have assumed that pc and px are part of the
problem inputs and use simple forms for these in this paper. In general, these functions
could be more complicated and learned from previous experience and/or elicited from
scientists.
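To make the interaction protocol concrete, the following Python sketch simulates the budgeted loop described above. The `policy` and `experimenter` objects are hypothetical interfaces introduced only for illustration; they are not specified in this report.

    def run_budgeted_optimization(policy, experimenter, budget):
        """Request constrained experiments until the total cost of fulfilled
        requests would exceed the budget, then return the observed tuples."""
        history = []          # list of (x, y, c) tuples, i.e. the observed data
        spent = 0.0
        while True:
            r = policy.select(history, budget - spent)   # constrained experiment (hyper-rectangle)
            x, y, c = experimenter.fulfil(r)             # experiment in r, noisy outcome, cost
            if spent + c > budget:                       # request m+1 goes over budget: stop
                break
            history.append((x, y, c))
            spent += c
        # Output x_hat: the observed experiment predicted to have the best value of f
        # (here the model-free choice of the best observed outcome).
        x_hat, _, _ = max(history, key=lambda t: t[1])
        return history, x_hat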
Chapter 3
Myopic Policies for Constrained
Experiments
In this chapter, we introduce a number of myopic policies for selecting constrained experiments. These policies are based on extending existing policies for more traditional
optimization problems ([10, 11, 6, 9]) to our setting of constrained experiments. Here we
focus on the general definition of these policies and will describe our particular instantiations and implementations in Section 3.3.
Many existing policies for traditional optimization can be viewed as defining a greedy
heuristic function which assigns a score to each experiment x given the current experimental state (D, B), where D is the data set of previously observed experiments and
B is the remaining budget. These heuristics provide different mechanisms for managing
the exploration-exploitation trade-off, which involves balancing exploratory experiments
which probe relatively unexplored portions of the experimental space, and exploitive experiments which probe areas that are predicted to be promising based on the current
information in D.
There are two primary reasons that these existing heuristics cannot be directly used
for our problem. First, they assign values to individual experiments rather than constrained experiments, as we require. Second, most such heuristics do not account for
experimental cost, but rather assume that each experiment has uniform cost. Indeed all
of the experimental evaluations in ([10, 11, 6, 9]), to name just a few, involve uniform
cost experiments. Our proposed policies, described below, for selecting constrained experiments can be viewed as extending existing heuristic approaches to address each of
these issues. First, in the next sub-section we extend existing heuristics to constrained
experiments, without taking cost into account. Second, in the following sub-section, we
describe how to create policies based on these heuristics that take costs into account.
3.1
Heuristics for Constrained Experiments
Each of our heuristics is simply a function H(r|(D, B)) that assigns a value to a constrained experiment r given the current experimental state (D, B). It is important to
note that these values will not take into account the cost of r, but rather evaluate constrained experiments as though they all had uniform cost. In this sense, the heuristics
are aimed at just evaluating some combination of the exploration and exploitation value
of each r.
3.1.1
Model-Free Heuristics
Our first two heuristics, Round Robin and Biased Round Robin, do not utilize the predictive posterior distribution pθ(y|x, D) and are hence called model-free approaches. They are based on heuristics used in previous work on budgeted multi-armed bandit problems ([9]) and budgeted learning of naive Bayes classifiers ([8]) and are intended as simple baselines.
Round Robin (RR). In the context of individual experiments, the RR heuristic
simply loops through the set of possible experiments until the budget is expended.
In the multi-armed bandit case, where experiments correspond to arm pulls, the RR
heuristic attempts to keep the number of pulls of each arm as uniform as possible. In
the context of constrained experiments we define the RR heuristic RR(r|(D, B)) to be the
number of experiments in D that satisfy the constraints of the constrained experiment r.
Thus, RR will prefer constrained experiments that overlap as few previous experiments
as possible. Note that typically there will be a large, even infinite, number of constrained
experiments that do not overlap any experiments (e.g. those with very tight constraints)
and thus achieve the same heuristic value. Our approach for taking costs into account
described later will add an additional cost-based preference.
Biased Round Robin (BRR). The BRR heuristic behaves identically to RR, except
that it repeats the previously selected experiment as long as it was successful. In the
context of binary outcome multi-arm bandit problems ([9]) this corresponded to pulling an
arm as long as it resulted in a positive outcome and otherwise moving on to the next arm.
This heuristic thus provides additional exploitive behavior over the RR heuristic. For
our problem, we will define the execution of a constrained experiment r to be successful
if it resulted in an outcome that improved on the outcome of the previous best observed
experiment in D. Thus, BRR will repeatedly try a constrained experiment as long as
it improves the experimental objective and otherwise selects according to the heuristic
BRR(r|(D, B)).
3.1.2
Model-based Heuristics
Our final four heuristics are model-based approaches that make use of the conditional
posterior distribution pθ (y|x, D). The intention is to utilize this posterior to more intelligently select constrained experiments, with the hope of outperforming model-free
approaches, such as RR and BRR, when the budget constraints are significant. Existing model-based heuristics for individual experiments are typically based on computing
certain statistics of the posterior distribution for the possible experiments (e.g. see [7])1 .
These can be extended to our setting by computing those same statistics for constrained
experiments. To do this, note that we can translate the above posterior pθ , which is conditioned on individual experiments, to a posterior p̄θ that is conditioned on constrained
experiments via the relation
p̄θ(y|r, D) = E[pθ(y|X, D)]                    (3.1)
where X is a random variable denoting an experiment distributed according to px (·|r).
Intuitively, this definition corresponds to the process of first drawing an experiment according to the distribution specified by r and then drawing an outcome for that experiment from the current posterior pθ. We now describe our four model-based heuristics,
which are well established in traditional optimization, as described, for example, in [7].
Maximum Mean (MM). In the context of individual experiments, the MM heuristic
simply selects the experiment that has the largest expected outcome according to the
current posterior. This heuristic has often been called PMAX in the literature ([10, 11, 6]),
which we have renamed to avoid confusion with our later heuristics. For constrained
experiments, this translates to the heuristic
MM(r|(D, B)) = E[y|r, D]                    (3.2)
where y is a random variable distributed according to p̄θ(·|r, D). MM is purely an exploitive heuristic and has the weakness that it can often be too greedy and get stuck at a poor local maximum before exploring the rest of the experimental space.
¹ The heuristics in [7] are intended to find the minimum of the surface, while our extended constrained heuristics are for maximizing.
² Note that the outcome y on the left hand side of Equation (3.1) is over the constrained experiment r, while y on the right hand side is over an individual experiment x ∈ X. To avoid confusion, yr is used in some places of this report.
Maximum Upper Interval (MUI). MUI, which has gone by the name IEMAX
in previous work ([10, 11, 6]), is an attempt to overcome the greedy nature of MM by
taking into account the uncertainty in our estimated model. Intuitively, MUI will explore
areas where there is a non-trivial probability of achieving a good result. In particular, the
MUI heuristic MUI(r|(D, B)) is equal to the upper bound of the 95% confidence interval
according to the current posterior p̄θ (y|r, D) (see Equation 3.3, where y ∼ p̄θ (·|r, D)).
MUI(r|(D, B)) = E[y|r, D] + 1.96 · sqrt(Var[y|r, D])                    (3.3)
Intuitively, MUI will aggressively explore untouched regions of the experimental space since
the outcomes in such regions will have high posterior variance. However, as experimentation continues and uncertainty decreases, MUI will focus on the most promising areas
and behave similarly to MM.
Maximum Probability of Improvement (MPI). In the context of individual
experiments, the MPI heuristic corresponds to selecting the experiment that has the
highest probability of generating an outcome that improves on the best outcome
currently in D. One problem with the pure MPI strategy is that it has a tendency
to behave similarly to MM and focus on areas that currently look promising, rather
than explore unknown areas. The reason for this behavior is that pure MPI does not
take into consideration the amount of improvement over the current best outcome. In
particular, it is typical for the posterior to assign small variances to the outcomes in well
explored regions, meaning that while there might be a good probability of observing a
small improvement, the probability of a substantial improvement is small. Thus, it is
common to consider the use of a margin m when using MPI, which we will refer to as
MPI(m). If we let y* represent the current best outcome that was observed in D, then
MPI_m(r|(D, B)) is equal to the probability that the outcome of constrained experiment
r will be greater than y* + m|y*|. Assuming p̄θ(y|r, D) is normally distributed³, the MPI
heuristic is given by

MPI_m(r|(D, B)) = Φ( (E[y|r, D] − (y* + m|y*|)) / sqrt(Var[y|r, D]) )                    (3.4)
where y ∼ p̄θ(·|r, D) and Φ(·) is the normal cumulative distribution function. The larger
the value of m, the more exploratory the policy will be, since it will focus on relatively
unexplored regions where there are large variances and hence more probability mass associated with outcomes greater than y* + m|y*|. For intermediate values of m the MPI heuristic
can be expected to provide some balance between exploration and exploitation, focusing
³ p̄θ(y|r, D) might not be normally distributed even if pθ(y|X, D) is (see Appendix B.2).
on explored regions when there is a reasonable chance of significant improvement and
otherwise focusing on unexplored regions.
Maximum Expected Improvement (MEI). The MEI heuristic is an attempt to
improve on the pure MPI heuristic without requiring the introduction of a parameter.
Rather than focus on the probability of improvement it considers the expected value of
improvement according to the current posterior. In particular, let I(y, y*) be a variable
equal to zero if y ≤ y* and equal to y − y* otherwise. Assuming p̄θ(y|r, D) is normally
distributed, we define the MEI heuristic to be

MEI(r|(D, B)) = E[I(y, y*)|r] = sqrt(Var[y|r, D]) · ( −uΦ(−u) + φ(u) )                    (3.5)

where

u = (y* − E[y|r, D]) / sqrt(Var[y|r, D])
and φ(·) is the normal density function.
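For concreteness, the four model-based heuristics above have simple closed forms given the posterior mean and variance of y under p̄θ(·|r, D). The following Python sketch (an illustration, not code from this report) implements Equations (3.2)-(3.5); the normal CDF and PDF are taken from SciPy, which is an assumed dependency.

    from math import sqrt
    from scipy.stats import norm

    def mm(mean, var):
        """Maximum Mean, Eq. (3.2)."""
        return mean

    def mui(mean, var):
        """Maximum Upper Interval, Eq. (3.3): upper bound of the 95% interval."""
        return mean + 1.96 * sqrt(var)

    def mpi(mean, var, y_star, m=0.2):
        """Maximum Probability of Improvement with margin m, Eq. (3.4)."""
        target = y_star + m * abs(y_star)
        return norm.cdf((mean - target) / sqrt(var))

    def mei(mean, var, y_star):
        """Maximum Expected Improvement, Eq. (3.5)."""
        s = sqrt(var)
        u = (y_star - mean) / s
        return s * (-u * norm.cdf(-u) + norm.pdf(u))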
3.2
Cost sensitive policies
Below we describe our policies for selecting a constrained experiment given a current
experimental state (D, B); these policies are based on the above extended heuristics and
take cost into consideration.
Constrained Minimum Cost Policies (CMC). Given a specific heuristic, the
highest scoring constrained experiment will typically be as highly constrained as possible
centered around individual experiments of maximum heuristic value. However, these
constrained experiments will also typically be the most costly, leading to a trade-off
between cost and heuristic value. The class of CMC policies attempts to manage the
trade-off by choosing the least-cost constrained experiment that achieves a heuristic value
within a specified fraction α of the value of the highest scoring constrained experiment.
In particular, for a given heuristic H, the CMC policy is defined as
CMC_H(D, B) = arg min_r { c(r) | r ∈ R, H(r|D, B) ≥ αh* }                    (3.6)
where h* = max_{r∈R} H(r|D, B) is the best heuristic value and c(r) is the expected cost of
r. When α is large, CMC tends to choose more tightly constrained and costly experiments,
which has the effect of more reliably obtaining an experiment with a high heuristic value.
Conversely, when α is small, CMC will choose lower cost constrained experiments with lower heuristic value, which has the effect of increasing the uncertainty about
the individual experiment that will be observed, providing some amount of constrained
exploration.
Cost Normalized MEI (CN-MEI). CMC policies require selecting α, which may
strongly impact the performance of the policy. In particular, when α is set too high, the
CMC policy may spend or waste a large portion of the budget in a local optimum region,
whereas if α is set too small, the policy may act randomly and fail to utilize the heuristic
information at all. Here we investigate an alternative parameter free policy, which selects
the constrained experiment that achieves the highest expected improvement per unit cost,
or rate of improvement. In particular, the policy is defined as
CN-MEI(D, B) = arg max_r  MEI(r|D, B) / c(r)                    (3.7)
Note that in concept we could define cost-normalized policies based on heuristics other
than MEI. To limit our scope we have chosen to consider only MEI due to its nice interpretation as a rate of improvement.
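As an illustration, the sketch below selects a constrained experiment under the CMC rule (Equation 3.6) and the CN-MEI rule (Equation 3.7) from a finite candidate set. The candidate list, heuristic function, and expected-cost function are assumed inputs supplied by the caller rather than anything specified in this report.

    def cmc_select(candidates, heuristic, expected_cost, alpha):
        """CMC policy (Eq. 3.6): cheapest candidate whose heuristic value is
        within a fraction alpha of the best heuristic value h*."""
        h_star = max(heuristic(r) for r in candidates)
        feasible = [r for r in candidates if heuristic(r) >= alpha * h_star]
        return min(feasible, key=expected_cost)

    def cn_mei_select(candidates, mei, expected_cost):
        """CN-MEI policy (Eq. 3.7): maximize expected improvement per unit cost."""
        return max(candidates, key=lambda r: mei(r) / expected_cost(r))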
3.3 Implementation details

3.3.1 Conditional posterior distribution
We use Gaussian process regression to compute the conditional posterior pθ (y|x, D) required by our model-based heuristics. A Gaussian Process (GP) provides a distribution
over the space of functions and has a convenient closed form for computing the posterior
mean and variance of the outcome y for any experiment x given the prior and the observations in D [14]. Here we first briefly describe how the conditional posterior is computed
with Gaussian process regression (GPR).
Suppose f(x) is the function that we would like to optimize, and we assume a simple
regression model, i.e., y = f(x) + ε, where ε is Gaussian noise with zero mean and variance
σε². After observing D = (X, y), where X is a matrix representation of the xi's (each row
corresponding to one xi) and y is a vector of the yi's in D, the marginal posterior
distribution of the experiment outcome for a query point xq is a Gaussian distribution with
mean and variance defined in Equations (3.8) and (3.9) respectively.

E[yq|xq, D] = K(xq, X)^T (K(X, X) + σε²I)^{-1} y                    (3.8)

Var[yq|xq, D] = K(xq, xq) − K(xq, X)^T (K(X, X) + σε²I)^{-1} K(xq, X)                    (3.9)
Note that K(·, ·) is the covariance function, which specifies the covariance between pairs of
random variables. In all of our experiments we use a GP with a zero mean prior and a
prior covariance specified by a Gaussian kernel

K(x, x′) = σ exp( −(1/(2l)) |x − x′|² ),

with kernel width l = 0.01 and signal variance σ = ymax², where ymax is an upper bound
on the outcome value. The value of ymax is typically easy to elicit from scientists and
serves to set our prior uncertainty so that there is non-trivial probability assigned to
the expected range of the output. We did not optimize these parameter settings, but did
empirically verify that the GPs behaved reasonably for the set of functions in our experiments.
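The GP posterior of Equations (3.8) and (3.9) with the Gaussian kernel above can be computed in a few lines of NumPy. The sketch below is a minimal illustration under those equations; it uses a direct linear solve rather than the Cholesky factorization mentioned in Appendix A, and the parameter defaults are placeholders.

    import numpy as np

    def gp_posterior(X, y, Xq, l=0.01, sigma=1.0, noise_var=0.01):
        """Posterior mean and variance of the outcome at query points Xq given
        observations (X, y), following Equations (3.8) and (3.9).
        X: (n, d) observed inputs, y: (n,) outcomes, Xq: (m, d) query inputs.
        sigma should be set to ymax**2 as described in the text."""
        def kernel(A, B):
            # Gaussian kernel K(x, x') = sigma * exp(-|x - x'|^2 / (2 l))
            sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return sigma * np.exp(-sq / (2.0 * l))

        K = kernel(X, X) + noise_var * np.eye(len(X))
        Kq = kernel(X, Xq)                           # (n, m)
        alpha = np.linalg.solve(K, y)
        mean = Kq.T @ alpha                          # Eq. (3.8)
        v = np.linalg.solve(K, Kq)
        var = sigma - np.einsum('ij,ij->j', Kq, v)   # Eq. (3.9), diagonal only
        return mean, var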
3.3.2
Heuristics computation
Given our GP posterior distribution, we need to compute the various heuristics defined in
the previous section. Unfortunately, those heuristics were defined in terms of expectations
and variances, which require integration over the region of a constrained experiment,
and these integrals do not have convenient closed forms. To deal with this
issue, we approximate the integrals by discretizing the space of experiments and using
summation in place of integration. Given a constrained experiment r, we discretize it
by uniformly sampling a set of experiments {x1 , . . . , xn } within r. Then we can rewrite
Equation (3.1) as Equation (3.10).
p̄θ(yr|r, D) = (1/n) Σ_{i=1}^{n} pθ(yi|xi, D)                    (3.10)
Given Equation (3.10), the discretized versions of the expectation E[yr|r, D] and variance Var[yr|r, D]
of a constrained experiment r are provided by Equation (3.11) and Equation (3.12) respectively, where yr ∼ p̄θ(·|r, D) and yi ∼ pθ(·|xi, D) (see Appendix B for the derivation).

E[yr|r, D] = (1/n) Σ_{i=1}^{n} E[yi|xi, D]                    (3.11)

Var[yr|r, D] = (1/n) Σ_{i=1}^{n} ( Var[yi|xi, D] + (E[yi|xi, D])² ) − (E[yr|r, D])²                    (3.12)
Furthermore, to help improve efficiency and to make the maximization over constrained experiments tractable, we also restrict our attention to hyper-cube constrained
experiments such that |ri| = l(r)/d, where l(r) is proportional to the perimeter of the
constrained experiment and d is the number of dimensions of the experimental
space. All constrained experiments are centered at the discretized points and have edge
lengths drawn from a finite set specified by the discretization level w of the space. In our
experiments we used a discretization level of 0.01. To do this, we only considered constrained
experiments

{ r | mod(|ri|/w, 2) = 0 and r ∈ R }.

For CMC policies, h* = max_{r∈R*} H(r|D, B), where R* is the set of the tightest constrained
experiments given the initial budget B. R* = {r | |ri| = 2w} if B ≥ c(r*) and |ri*| = 2w;
otherwise R* = {r | |ri| = |ri*|}, where

r* = arg max_r { c(r) | mod(|ri|/w, 2) = 0, c(r) < B and r ∈ R }.
This restriction allows us to efficiently compute the heuristic value for all of the constrained experiments using convolution algorithms⁴ (see Figure 3.1 for the expectation matrix computation for constrained experiments). The first term of Equation (3.12), for the variance of a constrained experiment, can also be computed similarly by applying convolution algorithms to a matrix M, where Mi,j = Var[yi,j|xi,j, D] + (E[yi,j|xi,j, D])².
Figure 3.1: Computation of the expectation matrix for constrained experiments {r | |ri| = 4w} using convolution algorithms on the expectation matrix M, where Mi,j = E[yi,j|xi,j, D].
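A sketch of this convolution trick for 2D problems is shown below, using SciPy's convolve2d as a stand-in for the Matlab conv2 call mentioned in the footnote (an assumed substitution). Averaging the per-point expectation grid and the grid of second moments with a uniform box kernel yields E[yr|r, D] and Var[yr|r, D] (Equations 3.11-3.12) for every constrained experiment of a given edge length at once.

    import numpy as np
    from scipy.signal import convolve2d

    def constrained_moments(mu, var, k):
        """mu, var: 2D grids of per-point posterior means and variances on the
        discretized experimental space; k: edge length of the hyper-cube
        constrained experiments in grid cells. Returns grids of E[y_r|r,D] and
        Var[y_r|r,D] for all constrained experiments of that size ('valid' part)."""
        box = np.ones((k, k)) / (k * k)                 # uniform averaging kernel
        mean_r = convolve2d(mu, box, mode='valid')      # Eq. (3.11)
        second_moment = convolve2d(var + mu ** 2, box, mode='valid')
        var_r = second_moment - mean_r ** 2             # Eq. (3.12)
        return mean_r, var_r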
This approach is most appropriate for low dimensional optimization problems, which
is the situation we often find in practical experimental design problems with tight budgets.
⁴ We use the Matlab function conv2 for 2D convolution and convn for N-D convolution, with shape parameter 'valid'.
This is because with a tight budget the number of experiments is quite limited, which
makes optimizing over many dimensions impractical, leading the scientists to carefully
select the key dimensions to consider. For example, in our microbial fuel cell application,
the scientist will generally be interested in optimizing 2 to 3 properties of a nano-structure
at most, such as the length and density of a carbon nano-tube surface. Thus, the most
relevant scenarios for us and many other problems involve a small number of experiments and low dimensionality.
3.3.3
Regret computation
Each policy will generate a sequence of m experiment tuples (x1 , y1 , c1 ), . . . , (xm , ym , cm )
(see Chapter 2). We must then output an x̂ ∈ {x1 , . . . , xm }, which is our prediction of
which observed experiment has the best value of f . For model-free policies
x̂ = x_i,  where  i = arg max_i { yi | i = 1, . . . , m }.

For model-based policies

x̂ = arg max_{xi} { E[yi|xi, D] | i = 1, . . . , m },

where yi ∼ pθ(·|xi, D). Regret for both model-free and model-based policies is calculated as⁵

regret(f, x̂) = ymax − f(x̂).

⁵ We only need to compute regrets during evaluation and we assume that f(x) can be easily calculated.
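A brief sketch of this final recommendation and regret computation follows; the `posterior_mean` argument is an assumed helper returning E[yi|xi, D] for a model-based policy.

    def recommend_and_regret(history, f, y_max, posterior_mean=None):
        """history: list of (x, y, c) tuples; f: the true function (used only
        for evaluation); y_max: the global optimum value. If posterior_mean is
        given, use the model-based recommendation, otherwise the model-free one."""
        if posterior_mean is None:
            x_hat = max(history, key=lambda t: t[1])[0]                 # best observed y
        else:
            x_hat = max(history, key=lambda t: posterior_mean(t[0]))[0]
        return y_max - f(x_hat)                                         # regret(f, x_hat)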
Chapter 4
Experimental Results
4.1
Test Functions
We evaluate our policies using four 2-dimensional functions over [0, 1]² with different
characteristics. The first three functions, Cosines, Rosenbrock, and Discontinuous, are
benchmarks that have been widely used in previous work on stochastic optimization (e.g., [1, 3]).
The mathematical expressions of the functions are listed in Table 4.1 and Figures 4.1 to 4.4 show their contour plots. In these experiments we use a zero-mean
Gaussian noise model with variance equal to 1% of the function range. Exploratory experiments with larger noise levels indicate similar qualitative performance, though absolute
optimization performance of the methods degrades somewhat, as expected.
Unfortunately, we have not yet been able to perform comparative experiments in our
motivating nano-design problem due to the high cost of obtaining data, which is precisely
why our methods are needed in the first place. Rather, for real-world data, we utilized
data collected as part of a study on biosolar hydrogen production [4], where the goal is
to maximize the hydrogen production of the cyanobacterium Synechocystis sp. PCC 6803
by optimizing the pH and nitrogen levels of the growth media.
Table 4.1: Benchmark Functions.

Function     Mathematical representation                               Global optimum
Cosines      1 − (u² + v² − 0.3cos(3πu) − 0.3cos(3πv)),                (0.31, 0.31), ymax ≈ 0.9
             u = 1.6x − 0.5, v = 1.6y − 0.5
Rosenbrock   10 − 100(y − x²)² − (1 − x)²                              (1, 1), ymax = 10
Discont      1 − 2((x − 0.5)² + (y − 0.5)²) if x < 0.5, 0 otherwise    (0.5, 0.5), ymax = 1
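For reference, the three benchmark functions of Table 4.1 can be transcribed directly into Python over the unit square (a straightforward transcription of the formulas, provided here only for illustration).

    from math import cos, pi

    def cosines(x, y):
        u, v = 1.6 * x - 0.5, 1.6 * y - 0.5
        return 1 - (u ** 2 + v ** 2 - 0.3 * cos(3 * pi * u) - 0.3 * cos(3 * pi * v))

    def rosenbrock(x, y):
        return 10 - 100 * (y - x ** 2) ** 2 - (1 - x) ** 2

    def discont(x, y):
        if x < 0.5:
            return 1 - 2 * ((x - 0.5) ** 2 + (y - 0.5) ** 2)
        return 0.0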
Figure 4.1: Cosines.
Figure 4.2: Discont.
The data set contains 66 samples uniformly distributed over the 2D input space. This data is used to simulate
the true function by fitting a Gaussian process regression model to it, picking kernel
parameters via standard validation techniques. With this model we then simulated the
experimental design process by sampling from the posterior of this GP to obtain noisy
outcomes for any given requested experiment. See Figure 4.4 for the contour plot.
Figure 4.3: Rosenbrock.
Figure 4.4: Real.

4.2 Modeling Assumptions
In our experiments, we will use a density px(x|r) over experiments given a request r,
which is uniform over r. We further assume that the density pc(c|r) over the cost given a
request is a conditional Gaussian distribution. In particular, we use a standard deviation
of 0.01 and a mean equal to

1 + (w × slope) / |ri| .
In this formulation, each experiment requires at least a unit cost in expectation, with
the expected additional cost being inversely proportional to the size of the constrained
experiment r. The value of the slope parameter dictates how quickly the cost increases
as the size of a constrained experiment decreases (see Figure 4.5). This cost model is
inspired by the fact that all experimental setups require a minimal amount of work, here
unit cost, but that the amount of additional work grows with increased constraints given
to the experimenter. All of our policies easily extend to other cost models.
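A minimal sketch of this cost model, assuming a NumPy random generator; the constraint size |ri| and the slope value are the quantities described above.

    import numpy as np

    def sample_cost(r_size, slope, w=0.01, rng=None):
        """Draw a cost for a constrained experiment with edge length r_size = |ri|:
        Gaussian with mean 1 + w*slope/|ri| and standard deviation 0.01."""
        rng = rng or np.random.default_rng()
        mean = 1.0 + (w * slope) / r_size
        return rng.normal(mean, 0.01)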
Figure 4.5: Costs vs. constraints (|ri|), w = 0.01; curves shown for slope = 10, 15, and 30.
4.3
Evaluation
In this section, we evaluate the performance of the policies in both the traditional optimization setting and the constrained experiment setting.
4.3.1
Traditional Optimization
In the traditional optimization setting, heuristics assign values to individual experiments
rather than constrained experiments, and they do not account for experimental cost,
but rather assume that each experiment has uniform cost. We define these heuristics for the
traditional optimization setting as follows¹, where y ∼ pθ(·|x, D).

¹ Unfortunately, both RR and BRR are undefined in a continuous experimental space for the traditional optimization setting, since they assign values to individual experiments. However, RR and BRR are definable in a discretized experimental space based on some selection ordering. We did not experiment with RR and BRR in the traditional optimization setting, since all the other heuristics evaluated in this section are defined in a continuous experimental space.
Maximum Mean (MM).

MM(x|D) = E[y|x, D]                    (4.1)

Maximum Upper Interval (MUI).

MUI(x|D) = E[y|x, D] + 1.96 · sqrt(Var[y|x, D]).                    (4.2)

Maximum Probability of Improvement (MPI).

MPI_m(x|D) = Φ( (E[y|x, D] − (y* + m|y*|)) / sqrt(Var[y|x, D]) ).                    (4.3)

Maximum Expected Improvement (MEI).

MEI(x|D) = (E[y|x, D] − y*) · Φ( (E[y|x, D] − y*) / sqrt(Var[y|x, D]) ) + sqrt(Var[y|x, D]) · φ( (y* − E[y|x, D]) / sqrt(Var[y|x, D]) ).                    (4.4)

The policy for selecting an experiment, given a current experimental state D, based on one of the above heuristics H in the context of individual experiments, and which does not take cost into consideration, is defined as

MAX_H(D) = arg max_x { H(x|D) | x ∈ R }.
We used hillclimbing to implement this policy (see the gradients in Appendix A). Hillclimbing used 50 random-restart points for each run; for each random-restart point, a gradient step of 0.01 is used and 100 steps are taken. Performances are averaged over 500 runs. Each run starts with five randomly selected initial points and policies are then used to select experiments for 50 steps. The random policy is considered as the baseline. To compare the regret values across different functions we report both absolute regret values and normalized regret values, which are computed by dividing the regret of each policy by the mean regret achieved by the random policy. A normalized regret less than one indicates that an approach achieves regret better than random, while a normalized regret greater than one indicates that an approach is doing worse than random.
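The sketch below reproduces the hillclimbing procedure just described in Python with the stated settings (50 random restarts, step size 0.01, 100 steps). For self-containedness it uses a finite-difference gradient rather than the analytic gradients of Appendix A, which is an assumption of this sketch rather than the report's implementation.

    import numpy as np

    def hillclimb_max(H, dim=2, restarts=50, step=0.01, n_steps=100, rng=None):
        """Approximately maximize a heuristic H(x) over [0, 1]^dim by gradient
        ascent with random restarts. A finite-difference gradient is used here."""
        rng = rng or np.random.default_rng()

        def grad(x, eps=1e-4):
            g = np.zeros(dim)
            for j in range(dim):
                e = np.zeros(dim); e[j] = eps
                g[j] = (H(x + e) - H(x - e)) / (2 * eps)
            return g

        best_x, best_val = None, -np.inf
        for _ in range(restarts):
            x = rng.uniform(0.0, 1.0, size=dim)
            for _ in range(n_steps):
                x = np.clip(x + step * grad(x), 0.0, 1.0)
            if H(x) > best_val:
                best_x, best_val = x, H(x)
        return best_x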
Our results for individual functions are shown in Figures 4.7 to 4.10. Figure 4.6
gives the normalized regrets for each policy averaged across our four functions. We summarize our findings as follows.
Figure 4.6: Normalized Overall Regrets.
• The MEI heuristic consistently achieves the best performance among all the heuristics.

• In Figure 4.6, MAX-MM (PMAX) outperforms random when the number of steps is less than 17, while it is consistently worse than the random policy when more than 17 steps are taken. The main reason for this is that when few steps are allowed, the chance that random touches the globally optimal area is low, so it can pay to focus on the currently promising area given a highly limited number of steps. As the number of steps increases, that chance grows. However, PMAX is consistently worse than random on the Discont function, which is due to the large proportion of the space occupied by the optimal area. MAX-MUI (IEMAX) overcomes the greedy nature of PMAX and is consistently better than random.
• The behavior of the pure MPI strategy (MAX-MPI(0)) is almost the opposite of PMAX. Since Φ(·) is monotonic, the policy MAX_H(D) for H = MPI(0) can be rewritten as

  MAX_H(D) = arg max_x { (E[y|x, D] − y*) / sqrt(Var[y|x, D]) | x ∈ R }.

  When few experimental points have been generated, MAX-MPI(0) typically selects from points {x | E[y|x, D] ≥ y*}, which are usually around local optima. Under noisy conditions, as more and more points are generated, we eventually reach a stage where ∀x ∈ R, E[y|x, D] < y*. We are then driven to search elsewhere, where the standard error is higher, which is the reason why MAX-MPI(0) eventually escapes from local optima and hence outperforms random when more steps are taken.
Figure 4.7: Regrets (left) and Normalized Regrets (right) for Cosines.
• MAX-MPI(m) achieves performance comparable to MAX-MEI. With some margin m > 0, we usually have ∀x ∈ R, E[y|x, D] < (y* + m|y*|) (otherwise it behaves like pure MPI), so the probability of improvement given by Equation (4.3) will be small. Eventually, the probability of improvement around the current best point becomes so small that we are driven to search elsewhere, where the variance is higher. This is the main reason that IEMAX outperforms MAX-MPI(m) when the number of steps is small. We also observed that MAX-MPI(0.2) outperforms MAX-MPI(0.1) when the variance over the experimental space is high, while MAX-MPI(0.1) achieves better performance once the variance is reduced as more experimental points are generated. This indicates that MPI with a high margin is preferred at the initial stage with few experiments generated, while MPI with a small margin is preferred when the experimental space is well explored.
Figure 4.8: Regrets (left) and Normalized Regrets (right) for Discont.
Figure 4.9: Regrets (left) and Normalized Regrets (right) for Rosenbrock.
Figure 4.10: Regrets (left) and Normalized Regrets (right) for Real.
Table 4.2: Normalized Overall Regrets.

                 slope = 0.1         slope = 0.15        slope = 0.30
                 α = 0.8   α = 0.9   α = 0.8   α = 0.9   α = 0.8   α = 0.9
CN-MEI               0.468               0.558               0.760
CMC-MEI          0.487     0.455     0.562     0.535     0.691     0.681
CMC-MPI(0.2)     0.505     0.622     0.611     0.766     0.726     0.875
CMC-MUI          0.945     0.733     1.000     0.787     0.989     0.795
CMC-MM           1.147     1.260     1.138     1.375     1.304     1.489
CMC-RR               0.862               0.894               0.918
CMC-BRR              0.842               0.878               0.907

4.3.2 Constrained Experiments
For each function and choice of budget and cost structure (i.e. choice of slope), we
evaluate each policy by averaging the regret it achieves over 500 runs. Each run starts
with five randomly selected initial points and then policies are used to select constrained
experiments until the budget runs out, at which point the final regret is measured. To
compare the regret values across different functions we report normalized regret values,
which are computed by dividing the regret of each policy by the mean regret achieved by a
random policy (i.e., a policy that always selects the entire input space as the constrained
experiment). A normalized regret less than one indicates that an approach achieves regret
better than random, while a normalized regret greater than one indicates that an approach
is doing worse than random.
We ran experiments with a budget of 15 units with slope values of 10, 15, and 30.
In experiments not shown we also considered other budget amounts, both smaller and
larger, and found similar qualitative results. For the CMC policies we considered two
values of α, 0.8 and 0.9. For the MPI heuristic we used a margin of m = 0.2, which,
based on a small number of trials, performed well compared to other values. Our results
for individual functions are shown in Tables 4.3 to 4.6, giving both the average
normalized regrets over 500 trials and the standard deviations (Figures 4.19 to 4.22
use a slope value of 15 and α = 0.9). In order to provide an assessment of overall
performance, Table 4.2 gives the normalized regrets for each policy averaged across our
four functions. Below we review our main empirical findings.
Model-free Policies
From Table 4.2 we can see that our model-free policies CMC-RR and CMC-BRR achieve
an improvement over random by approximately 10% across the different slopes. This
shows that the heuristic of trying to cover the space as evenly as possible pays off with
respect to random. We also see that in general CMC-BRR performs slightly better than
CMC-RR, which gives some indication that the additional exploitive behavior of BRR
pays off overall. Looking at the individual results, however, we can see that for the real
function (Table 4.6 and Figure 4.22) both BRR and RR perform worse than random.
Further investigation reveals that the reason for this poor performance is that CMC-RR/BRR have a bias toward constrained experiments near the center of the input space,
and the sampling distribution gets less biased as the budget increases. The approximate
sampling distributions for slope=10 with an initial budget of 10 units are shown in Figure 4.11. Under
the same setting, CMC-RR/BRR outperform random at a budget of 45, as shown in
Figure 4.12. This bias is a result of the fact that constrained experiments are required
to fall completely within the experimental space of [0, 1]² and the CMC policy attempts to
select the least-cost constrained experiment (see Equation 3.6). There are fewer such
constrained experiments near the edges and corners. The real function has its optimal
point in a corner of the space, explaining the poor performance of CMC-RR/BRR.
Figure 4.11: Approximate sampling distribution for a 10-unit budget, slope=10.
Figure 4.12: Approximate sampling distribution for a 45-unit budget, slope=10.
Model-based Policies
We first observe from the overall and individual results that CMC-MM is consistently
worse than the random policy. The primary reason for this is that it typically selects
high cost experiments near locally optimal points, causing it to quickly expend its budget and to not explore adequately.
Figure 4.13: Plot of MUI heuristic.
Next, we observe that CMC-MUI is typically better than
the model-free policies for α = 0.9 but is quite poor for α = 0.8. Further investigation
reveals that for α = 0.8 CMC-MUI typically selects constrained experiments with large
sizes (low cost). This leads to close to random behavior. The reason for this is that the
upper interval of the GP posterior tends to be quite flat across the space and optimistic
(see Figure 4.13). This allows for large constrained experiments to maintain good MUI
values. For α = 0.9 this effect is lessened due to the more strict requirement on heuristic
value.
Next, from the overall table we see that the MEI-based methods and CMC-MPI are
the top contenders among all methods, and they are in general superior to model-free
methods. Looking at the individual results we can see that this holds for all functions
except for Discontinuous, where the CMC approaches are slightly worse than random.
Further investigation shows that for this function the CMC approach selected higher cost
constrained experiments relative to the other functions, resulting in fewer overall experiments and worse performance than random. This behavior is due to the discontinuity of
the function near the optimal region (center), where the function suddenly becomes very
small. The result is that the heuristic value of constrained experiments near the discontinuity decreases rapidly as they grow in size, causing CMC to select small, high-cost
regions. This suggests that CMC is not a good choice for discontinuous functions with
high variability in outcomes.
Figure 4.14: GP posterior (left), MPI(0.1) (center) and MEI (right).
Comparing CMC-MPI and CMC-MEI, CMC-MEI appears to have a slight edge overall, which is consistent with what we observed in the traditional setting. In general, we observed that MPI tends to select slightly fewer experiments than MEI, which we conjecture is due to the observation that the MEI heuristic tends to be smoother than MPI over the experimental space. In the traditional setting, we have noticed that MAX-MPI(m) typically tries to select from points {x | E[y|x, D] = y* + δ, δ → 0+} at the beginning, which have low variance, causing them to have high MPI(m) values, while the MPI(m) values of points {x | E[y|x, D] = y* − δ, δ → 0+} are close to 0. That is the main reason why there is a sharp gap around the area of optimal MPI(m) values. However, the second term of Equation (4.4) gives MEI a smoother curve between the maximum and minimum MEI values (see Figure 4.14) and also a relatively higher heuristic value in unexplored areas compared with MPI(m). CMC-MEI is also preferable in that it does not require the selection of a
margin parameter.
Comparing CMC-MEI to CN-MEI we see CN-MEI is comparable overall with slightly
worse performance for slope=0.3, which is the model where costs grow most rapidly
with decreasing sizes of constrained experiments. For slope=0.3 we found, as expected,
that CN-MEI, which normalizes by the cost of a constrained experiment, tends to select lower-cost experiments than CMC-MEI. In some cases, this tendency appears to be
too conservative, since more aggressive spending by the CMC approaches appears to be
beneficial. However, we see that for the discontinuous function CN-MEI significantly
outperformed CMC-MEI since it was not forced to select overly high cost constrained experiments by the discontinuity. Overall, an encouraging observation for CN-MEI is that it
is the only model-based method that consistently outperforms model-free methods across
all functions and cost models, indicating a degree of robustness.
Finally, we observe that CMC-MEI and CMC-MPI can be quite sensitive to the value of α. For MPI, it appears that the smaller α value tends to be better. However, for CMC-MEI, there is not a consistent trend across functions and cost models as to which value of α is better. This presents a practical difficulty when applying CMC-MEI in a new domain. In contrast, CN-MEI is parameter free and has reasonably robust performance. This makes CN-MEI a recommended method when encountering a new problem.
Figure 4.15: Cosines with slope=15, MEI (left), MPI(0.2) (right).
Figure 4.16: Discont with slope=15, MEI (left), MPI(0.2) (right).
Figure 4.17: Rosenbrock with slope=15, MEI (left), MPI(0.2) (right).
Figure 4.18: Real-world with slope=15, MEI (left), MPI(0.2) (right).
Figure 4.19: Cosines with slope=15 and α = 0.9.
30
35
40
45
50
0.25
0.2
Random
CMC−MM
CMC−MUI
CMC−RR
CMC−BRR
0.2
CN−MEI
Random
CMC−MEI
CMC−MPI(0.2)
0.18
0.16
0.15
Avg−regret
Avg−regret
0.14
0.1
0.12
0.1
0.08
0.06
0.05
0.04
0.02
0
0
5
10
15
20
25
Avg−cost
30
35
40
45
0
50
0
5
10
15
20
25
Avg−cost
30
35
40
45
50
Figure 4.20: Discont with slop=15 and α = 0.9.
Figure 4.21: Rosenbrock with slope=15 and α = 0.9.
Figure 4.22: Real-world with slope=15 and α = 0.9.
Table 4.3: Normalized Regrets for Cosines Function

                 slope = 0.1                   slope = 0.15                  slope = 0.30
                 α = 0.8        α = 0.9        α = 0.8        α = 0.9        α = 0.8        α = 0.9
CN-MEI                 0.569 (0.34)                  0.714 (0.41)                  0.826 (0.42)
CMC-MEI          0.528 (0.38)   0.445 (0.35)   0.589 (0.41)   0.494 (0.37)   0.712 (0.43)   0.690 (0.46)
CMC-MPI(0.2)     0.480 (0.37)   0.471 (0.36)   0.562 (0.43)   0.582 (0.43)   0.655 (0.46)   0.670 (0.45)
CMC-MUI          0.929 (0.43)   0.788 (0.40)   0.947 (0.43)   0.831 (0.43)   0.944 (0.44)   0.828 (0.42)
CMC-MM           0.948 (0.69)   0.966 (0.72)   0.971 (0.69)   1.008 (0.71)   1.057 (0.69)   1.067 (0.67)
CMC-RR                 0.842 (0.40)                  0.866 (0.40)                  0.897 (0.41)
CMC-BRR                0.833 (0.40)                  0.862 (0.41)                  0.889 (0.41)
Table 4.4: Normalized Regrets for Discont Function

                 slope = 0.1                   slope = 0.15                  slope = 0.30
                 α = 0.8        α = 0.9        α = 0.8        α = 0.9        α = 0.8        α = 0.9
CN-MEI                 0.527 (0.44)                  0.497 (0.44)                  0.626 (0.60)
CMC-MEI          0.614 (0.53)   0.670 (0.55)   0.636 (0.55)   0.824 (0.68)   0.771 (0.64)   1.041 (0.75)
CMC-MPI(0.2)     0.655 (0.56)   0.962 (0.84)   0.867 (0.71)   1.232 (1.17)   1.018 (0.80)   1.439 (1.23)
CMC-MUI          0.950 (0.90)   0.556 (0.57)   1.023 (0.94)   0.617 (0.58)   1.032 (1.01)   0.645 (0.64)
CMC-MM           1.681 (2.26)   2.053 (2.69)   1.574 (2.01)   2.283 (2.70)   1.993 (2.45)   2.377 (2.49)
CMC-RR                 0.602 (0.51)                  0.617 (0.53)                  0.634 (0.56)
CMC-BRR                0.587 (0.50)                  0.602 (0.53)                  0.634 (0.57)
Table 4.5: Normalized Regrets for Rosenbrock Function

                 slope = 0.10                  slope = 0.15                  slope = 0.30
                 α = 0.8        α = 0.9        α = 0.8        α = 0.9        α = 0.8        α = 0.9
CN-MEI                 0.602 (0.40)                  0.665 (0.49)                  0.736 (0.60)
CMC-MEI          0.584 (0.39)   0.618 (0.42)   0.657 (0.48)   0.655 (0.45)   0.696 (0.50)   0.676 (0.53)
CMC-MPI(0.2)     0.611 (0.43)   0.627 (0.45)   0.661 (0.47)   0.676 (0.57)   0.670 (0.52)   0.714 (0.56)
CMC-MUI          1.004 (1.10)   0.999 (1.12)   1.079 (1.34)   0.996 (1.11)   1.030 (1.19)   0.945 (1.10)
CMC-MM           0.828 (0.77)   0.843 (1.06)   0.808 (0.82)   0.960 (1.73)   0.957 (1.21)   1.260 (2.03)
CMC-RR                 0.897 (0.85)                  0.930 (0.87)                  0.967 (1.00)
CMC-BRR                0.881 (0.83)                  0.927 (0.90)                  0.955 (0.98)
Table 4.6: Normalized Regrets for Real-world Function

                 slope = 0.10                  slope = 0.15                  slope = 0.30
                 α = 0.8        α = 0.9        α = 0.8        α = 0.9        α = 0.8        α = 0.9
CN-MEI                 0.176 (0.26)                  0.354 (0.46)                  0.852 (0.66)
CMC-MEI          0.221 (0.41)   0.089 (0.18)   0.365 (0.57)   0.169 (0.32)   0.586 (0.65)   0.318 (0.47)
CMC-MPI(0.2)     0.274 (0.56)   0.426 (0.66)   0.355 (0.65)   0.574 (0.74)   0.560 (0.74)   0.677 (0.75)
CMC-MUI          0.897 (0.68)   0.588 (0.55)   0.952 (0.69)   0.706 (0.61)   0.949 (0.67)   0.761 (0.60)
CMC-MM           1.131 (1.04)   1.178 (1.07)   1.199 (1.04)   1.248 (1.08)   1.208 (0.93)   1.251 (0.91)
CMC-RR                 1.107 (0.68)                  1.161 (0.69)                  1.172 (0.67)
CMC-BRR                1.065 (0.68)                  1.123 (0.70)                  1.148 (0.67)
Appendix A
Gradient
A.1
MM
In the context of individual experiments, the MM heuristic is defined as

MM(x|D) = E[y|x, D]                    (A.1)

where y ∼ pθ(·|x, D), D = (X, y) and X = {x1, . . . , xn}. In our work, we use Gaussian process regression (GPR) to compute the expectation of the conditional posterior pθ(y|x, D) required by our model-based heuristics (see Equation 3.8). We write a vector α¹ to denote (K(X, X) + σε²I)^{-1} y in Equation 3.8, which can be rewritten as

E[y|x, D] = K(x, X)^T α = Σ_{i=1}^{n} ki αi                    (A.2)

where K(x, X)^T = (k1, . . . , kn) and ki = σ exp( −(1/(2l)) |x − xi|² ).

Then the derivative or gradient of MM(x|D) with respect to an individual experiment x = (x¹, . . . , x^d) is computed component by component, i.e.,

∂MM(x|D)/∂x = ( ∂MM(x|D)/∂x¹, . . . , ∂MM(x|D)/∂x^d )^T.                    (A.3)

Noting that |x − xi|² = (x¹ − x_i¹)² + · · · + (x^d − x_i^d)², each entry of (A.3) is given by

∂MM(x|D)/∂x^j = Σ_{i=1}^{n} (∂ki/∂x^j) αi = −(1/l) Σ_{i=1}^{n} (x^j − x_i^j) ki αi.                    (A.4)

¹ We use Cholesky factorization to compute α = L^T \ (L \ y), where L = cholesky(K(X, X) + σε²I) [14].
A.2
MUI
In the context of individual experiments, the MUI heuristic is defined as

MUI(x|D) = E[y|x, D] + 1.96 · sqrt(Var[y|x, D]).                    (A.5)

Then the derivative or gradient of MUI(x|D) with respect to an individual experiment x is given by

∂MUI(x|D)/∂x = ∂E[y|x, D]/∂x + 1.96 × (1/2) · Var[y|x, D]^{-1/2} · ∂Var[y|x, D]/∂x.                    (A.6)

The first term of Equation (A.6) is the same as the gradient for the MM heuristic. Given the Gaussian kernel

K(x, x′) = σ exp( −(1/(2l)) |x − x′|² ),

Equation (3.9) gives

Var[y|x, D] = σ − K(x, X)^T (K(X, X) + σε²I)^{-1} K(x, X).

Then the gradient of the variance in the second term of Equation (A.6) can be written as

∂Var[y|x, D]/∂x = ( ∂Var[y|x, D]/∂x¹, . . . , ∂Var[y|x, D]/∂x^d )^T                    (A.7)

with each entry

∂Var[y|x, D]/∂x^j = −(k1, . . . , kn) K^{-1} (∂k1/∂x^j, . . . , ∂kn/∂x^j)^T − (∂k1/∂x^j, . . . , ∂kn/∂x^j) K^{-1} (k1, . . . , kn)^T

where K = K(X, X) + σε²I and

∂ki/∂x^j = −(1/l) (x^j − x_i^j) ki.
A.3
MPI
In the context of individual experiments, the MPI heuristic with margin m is defined as

MPI_m(x|D) = Φ( (E[y|x, D] − (y* + m|y*|)) / sqrt(Var[y|x, D]) ).                    (A.8)

Then the derivative or gradient of MPI_m(x|D) with respect to an individual experiment x is given by Equation (A.9), where E(x) = E[y|x, D], S(x) = sqrt(Var[y|x, D]) and their gradients can be calculated by Equations (A.3) and (A.7) respectively.

∂MPI_m(x|D)/∂x = φ( (E(x) − (y* + m|y*|)) / S(x) ) · [ S(x)·∂E(x)/∂x − (∂S(x)/∂x)·(E(x) − (y* + m|y*|)) ] / S(x)²                    (A.9)

A.4
MEI
In the context of individual experiments, the MEI heuristic is defined as

MEI(x|D) = (E(x) − y*) · Φ( (E(x) − y*) / S(x) ) + S(x) · φ( (y* − E(x)) / S(x) ).                    (A.10)

Then the derivative or gradient of MEI(x|D) with respect to an individual experiment x is given by

∂MEI(x|D)/∂x = (∂E(x)/∂x) · Φ( (E(x) − y*) / S(x) ) + (E(x) − y*) · φ( (E(x) − y*) / S(x) ) · V1 + (∂S(x)/∂x) · φ( (y* − E(x)) / S(x) ) + S(x) · V2                    (A.11)

where

V1 = [ S(x)·∂E(x)/∂x − (∂S(x)/∂x)·(E(x) − y*) ] / S(x)²

V2 = φ( (y* − E(x)) / S(x) ) · ( (E(x) − y*) / S(x) ) · (−V1).
Appendix B
Mathematical Derivation
B.1
MEI Heuristic
The heuristic formula of MEI is extended from [7], which is intended to find the
minimum of the surface. Let Y(x) be a random variable describing our uncertainty
about the function's value at a point x and assume that Y(x) is normally distributed with
expectation E(x) and variance S(x)². If the current best observation is fmax, then we
will achieve an improvement of I if Y(x) = fmax + I. The likelihood of achieving this
improvement is given by the normal density function
(1/(sqrt(2π) S(x))) · exp( −(fmax + I − E(x))² / (2S(x)²) ).                    (B.1)
The expected improvement is simply the expected value of the improvement found by
integrating over this density:
E(I) = ∫₀^∞ I · (1/(sqrt(2π) S(x))) · exp( −(fmax + I − E(x))² / (2S(x)²) ) dI
     = (1/(sqrt(2π) S(x))) · exp( −(fmax − E(x))² / (2S(x)²) ) · ∫₀^∞ I · exp( −(2I(fmax − E(x)) + I²) / (2S(x)²) ) dI                    (B.2)

Let

T = exp( −(2I(fmax − E(x)) + I²) / (2S(x)²) ).                    (B.3)
Then the derivative of T with respect to I is

∂T/∂I = −(1/S(x)²) · ( I·T + (fmax − E(x))·T ),

from which we get

I·T = −(fmax − E(x))·T − S(x)²·∂T/∂I.
Substituting this into Equation (B.2) gives

E(I) = (1/(sqrt(2π) S(x))) · exp( −(fmax − E(x))² / (2S(x)²) ) · ∫₀^∞ I·T dI
     = −(fmax − E(x)) · ∫₀^∞ (1/(sqrt(2π) S(x))) · exp( −(I − (E(x) − fmax))² / (2S(x)²) ) dI + S(x)·φ( (fmax − E(x)) / S(x) )
     = −(fmax − E(x)) · ∫_{−∞}^{(E(x) − fmax)/S(x)} (1/sqrt(2π)) · exp( −(I*)²/2 ) dI* + S(x)·φ( (fmax − E(x)) / S(x) )                    (B.4)

where

I* = ( I − (E(x) − fmax) ) / S(x)

and φ(·) is the normal density function. Equation (B.4) implies

E(I) = S(x) · ( −uΦ(−u) + φ(u) )                    (B.5)

where

u = (fmax − E(x)) / S(x)

and Φ(·) is the normal cumulative distribution function.
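As a quick sanity check of Equation (B.5), the closed form can be compared against a Monte Carlo estimate of the expected improvement. This snippet is an addition for illustration only (the parameter values are arbitrary) and assumes NumPy and SciPy.

    import numpy as np
    from scipy.stats import norm

    E, S, f_max = 0.3, 0.5, 0.6          # arbitrary posterior mean, std, current best
    u = (f_max - E) / S
    closed_form = S * (-u * norm.cdf(-u) + norm.pdf(u))     # Eq. (B.5)

    samples = np.random.default_rng(0).normal(E, S, size=1_000_000)
    monte_carlo = np.maximum(samples - f_max, 0.0).mean()   # E[max(Y - f_max, 0)]

    print(closed_form, monte_carlo)      # the two values should agree closely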
B.2 Weighted Combination of Distributions

B.2.1 Expectation

Let Z be a random variable distributed according to a mixture of the distributions Y = {Y1, . . . , Yn} with mixture weights W = {w1, . . . , wn}, where Yi has expectation E(Yi) and variance S(Yi)². Then we have

E(Z) = Σ_{i=1}^{n} wi E(Yi).                    (B.6)
B.2.2 Variance

Var(Z) = E(Z²) − (E(Z))²
       = Σ_{i=1}^{n} wi E(Yi²) − ( Σ_{i=1}^{n} wi E(Yi) )²
       = Σ_{i=1}^{n} wi ( S(Yi)² + E(Yi)² ) − ( Σ_{i=1}^{n} wi E(Yi) )²                    (B.7)
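Likewise, Equations (B.6) and (B.7) can be verified numerically by sampling from the mixture; this small check is an addition for illustration only, with arbitrary weights and component parameters.

    import numpy as np

    rng = np.random.default_rng(0)
    w = np.array([0.2, 0.5, 0.3])            # mixture weights
    mu = np.array([-1.0, 0.5, 2.0])          # component means E(Y_i)
    s = np.array([0.3, 1.0, 0.5])            # component std deviations S(Y_i)

    # Closed forms, Eqs. (B.6) and (B.7).
    mean = np.sum(w * mu)
    var = np.sum(w * (s ** 2 + mu ** 2)) - mean ** 2

    # Monte Carlo: pick a component with probability w_i, then draw from it.
    idx = rng.choice(len(w), size=1_000_000, p=w)
    z = rng.normal(mu[idx], s[idx])

    print(mean, z.mean())    # should agree closely
    print(var, z.var())      # should agree closely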
Bibliography

[1] B. S. Anderson, A. W. Moore, and D. Cohn. A nonparametric approach to noisy and costly optimization. In ICML, 2000.

[2] D. Bond and D. Lovley. Electricity production by Geobacter sulfurreducens attached to electrodes. Applied and Environmental Microbiology, 69:1548–1555, 2003.

[3] M. Brunato, R. Battiti, and S. Pasupuleti. A memory-based RASH optimizer. In AAAI-06 Workshop on Heuristic Search, Memory Based Heuristics and Their Applications, 2006.

[4] E. H. Burrows, W.-K. Wong, X. Fern, F. W. Chaplen, and R. L. Ely. Optimization of pH and nitrogen for enhanced hydrogen production by Synechocystis sp. PCC 6803 via statistical and machine learning methods. Biotechnology Progress, in submission.

[5] Y. Fan, H. Hu, and H. Liu. Enhanced coulombic efficiency and power density of air-cathode microbial fuel cells with an improved cell configuration. Journal of Power Sources, in press, 2007.

[6] A. Moore and J. Schneider. Active learning in discrete input spaces. In Interface Symposium, 2002.

[7] D. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21:345–383, 2001.

[8] D. Lizotte, O. Madani, and R. Greiner. Budgeted learning of naive-Bayes classifiers. In UAI, 2003.

[9] O. Madani, D. Lizotte, and R. Greiner. Active model selection. In UAI, 2004.

[10] A. Moore and J. Schneider. Memory-based stochastic optimization. In NIPS, 1995.

[11] A. Moore, J. Schneider, J. Boyan, and M. S. Lee. Q2: Memory-based active learning for optimizing noisy continuous functions. In ICML, pages 386–394, 1998.

[12] R. Myers and D. Montgomery. Response Surface Methodology: Process and Product Optimization Using Designed Experiments. Wiley, 1995.

[13] D. Park and J. Zeikus. Improved fuel cell and electrode designs for producing electricity from microbial degradation. Biotechnol. Bioeng., 81(3):348–355, 2003.

[14] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[15] G. Reguera et al. Extracellular electron transfer via microbial nanowires. Nature, 435:1098–1101, 2005.