LECTURE ON ORDINAL OPTIMIZATION (HANDOUT OF HO)

LECTURE NOTES #7&8
VERSION: 3.1
DATE: 2003-8-12
FROM ES 205 LECTURE #9 & #13 ORDINAL OPTIMIZATION
On Ordinal Optimization
by
Yu-Chi Ho
Harvard University
• The most useful lecture I will give in this course. The most useful research I have
done in 30 years.
I. Introduction and Rationale
It can be argued that OPTIMIZATION in a general sense is the principal driver behind all of
prescriptive scientific and engineering endeavor, be it operations research, control theory,
or engineering design. It is also true that the real world is full of complex decision and
optimization problems that we cannot solve. While the literature on optimization and
decision-making is huge, many of the concrete analytical results are associated with what
may be called Real Variable Based methods. The idea of successive approximation to
an optimum (say, minimum) by sequential improvements based on local information is
often captured by the metaphor of “skiing downhill in a fog.” The concepts of gradient
(slope), curvature (valley), and trajectories of steepest descent (fall line) all require the
notion of derivatives and are based on the existence of a more or less smooth response
surface. There exist various first and second order algorithms of feasible directions for the
iterative determination of the optimum (minimum) of an arbitrary multi-dimensional
response or performance surface. Considerable numbers of major success stories exist in
this genre including the Nobel Prize winning work on linear programming. It is not
necessary to repeat or even reference these here.
On the other hand, we submit that the reason many real world optimization problems
remain unsolved is partly due to the changing nature of the problem domain, which makes
calculus or real variable based methods less applicable. For example, a large number of
human-made system problems, such as manufacturing automation, communication
networks, computer performance, and/or general resource allocation problems, involve
combinatorics rather than real analysis, symbols rather than variables, discrete instead of
continuous choices, and synthesizing a configuration rather than proportioning the design
parameters. Optimization for such problems seems to call for a general search of the
performance terrain or response surface as opposed to the “skiing downhill in a fog”
metaphor of real variable based performance optimization.[1] Arguments for change can
also be made on the technological front. Sequential algorithms were often dictated as a
result of the limited memory and centralized control of earlier generations of computers.
With the advent of modern massively parallel machines and essentially unlimited
virtual memory, distributed and parallel procedures can work hand-in-glove with the
Search Based method of performance evaluation. It is one of the theses of this paper to
argue for such a complementary approach to optimization.
If we accept the need for search based methods as a complement to the more established
analytical techniques, then we can next argue that it is more important to quickly narrow
the search for optimum performance to a “good enough” subset of the design universe
than to estimate accurately the values of the system performance during the process of
optimization. (I.e., having a “good enough” subset is more important than accurate
estimation of the system performance.) We should compare order first and estimate value
second, i.e., ordinal optimization comes before cardinal optimization. Furthermore, we
shall argue that our preoccupation with the “best” may be only an ideal that will always
be unattainable or not cost effective. Real world solutions to real world problems will
involve compromising for the “good enough”.[2] The purpose of this paper is to establish the
distinct advantages of the softer ordinal approach for search based types of
problems, to analyze some of its general properties, and to show the many orders of
magnitude improvement in computational efficiency that is possible under this mind set.
• Cost of getting best for sure

II. Certainty vs. Probability and Best vs. Good Enough
The argument for asking a softer vs. a hard question rests on two tenets: trading “certainty”
for “high probability” and relaxing “best” to “good enough”. We can easily demonstrate
the price one pays for insisting on certainty vs. being nearly certain via a simple but
generic example. Let us agree that by BEST we mean getting to within 0.001% of the
optimum (e.g., 0.001% of the optimum means finding one of the ten best among one
million) and GOOD ENOUGH means getting to within 1% of the best. Let us also say
that if we can ascertain the value of the BEST to a confidence interval of ±0.001% we are
sure, and a confidence interval of ±1% is high probability. Then it is well known that
under independent sampling the variance decreases as 1/n, so the confidence interval
shrinks only as 1/√n. Thus each order of magnitude tightening of the confidence interval
requires two orders of magnitude increase in sampling cost.
[1] We hasten to add that we fully realize the distinction we make here is not absolutely
black and white. A continuum of problem types exists. Similarly, there is a spectrum in
the nature of optimization variables or search spaces, ranging from continuous to integer
to discrete to combinatorial to symbolic.
[2] Of course, here one is mindful of the business dictum, used by Mrs. Fields for her
enormously successful cookie shops, which says “Good enough never is”. However, this
can be reconciled via the frame of reference of the optimization problem.
To go from Confidence Interval (CFI) =±1% to CFI=±0.001% implies a 1,000,000 fold
increase in sampling cost. Similarly, to insist on the “best” vs. to be satisfied with being
“good enough” is another thousand fold increase in sampling cost. Furthermore, the
efficiency gains are multiplicative rather than additive. Thus, if we can be satisfied with
a CFI of ±1% in estimating the value of the best and in getting to within 1% of the
best, then a 10^9 fold increase in efficiency can be achieved compared to getting the
best for sure. (Because getting the best for sure incurs so much additional cost, it may
be more efficient overall to settle for something that is “good enough” and spare the
cost of certainty.) We submit that in dealing with complex problems
in real life we often consciously or unconsciously settle for such a trade-off. Of course,
we need to interpret such a trade-off with care. The number 10^9 is more an indication of
the approximate order of magnitude than a precise calculation. Also, being in the
top 1% may not be satisfactory if the performance values of the top 1% of solution
candidates are [1, 2, . . . , 98, 99, . . . , 10000000]. (In other words, the difference between
the performances of two top-1% solution candidates can be very large.) Such a
needle-in-the-haystack optimization problem is inherently hard. Short of exhaustive search, no method
will be good enough. Furthermore, it is important to remember that the efficiency
increase is gained at the expense of precision. There are no free lunches. But why insist
on an expensive gourmet meal when a quick lunch is perfectly adequate and may in fact
be “optimal” when the cost-benefit ratio is considered?
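
To make the arithmetic concrete, here is a minimal Python sketch of the sample-size scaling argument above; the function name and the specific numbers (±1% vs. ±0.001%, plus a further factor of 1000 for insisting on the best rather than the top 1%) are simply the illustrative values used in the text, under the stated assumption of i.i.d. sampling.

def samples_needed(ci_target, ci_reference=1.0, n_reference=1):
    """Samples needed to shrink a +/-ci_reference confidence interval down to
    +/-ci_target, given that the interval half-width scales as 1/sqrt(n)."""
    return n_reference * (ci_reference / ci_target) ** 2

# Tightening the interval from +/-1% to +/-0.001% (a factor of 1000)
# costs a factor of 1000**2 = 1,000,000 in samples.
print(samples_needed(ci_target=0.001, ci_reference=1.0))          # 1e6

# Multiplying in a further ~1000-fold saving from settling for the top 1%
# instead of the best recovers the ~10^9 figure quoted above.
print(samples_needed(ci_target=0.001, ci_reference=1.0) * 1000)   # 1e9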
• Concepts: OPC, PDF, sampling population N, subsets G and S, alignment probability

III. Ordered Performance Curves and Their Robustness
Consider a design universe Θ = {θ_1, . . . , θ_N} of size N. Corresponding to each design
alternative we have the performance J(θ_i) ≡ J_i for i = 1, 2, . . . , N, where θ is the design
parameter(s). Without loss of generality we assume the designs are ordered, i.e., J_1 < J_2
< . . . < J_N, and we are minimizing. If we plot the performance value versus design, the
resultant curve must be monotonically increasing by definition. In practice, a limited
computing budget or time prevents us from evaluating all the J_i's exactly;[3] we can only
estimate these performance values under noise. The observed values Ĵ_i = J_i + w_i,[4] where
w_i represents the estimation error, do not in general obey Ĵ_1 < Ĵ_2 < . . . < Ĵ_N. The problem
of ordinal optimization has to do with the extent to which the true order is perturbed by the
estimation errors. Symmetrically, if we order the observed performances, we are
interested in to what extent they align with the true order. For example, a reasonable
question is: “Is the observed order the maximum likelihood estimate of the true order?”[5]

Previously, we have argued [Ho-Sreenivas-Vakili, 1992 and Ho, 1993] that there
are only three types of monotonically ordered performance curves: Steep, Flat, and
Neutral. The flat curve captures the case where there is relatively little true performance
difference among the alternatives. Estimation errors can then greatly perturb the
observed order from the true one. Any kind of selection rule for good alternatives
based on the observed performances amounts to a more or less blind choice. In this case,
we cannot base a “good” designation on observed performance. The probability of picking
a good subset can be explicitly calculated [Ho-Sreenivas-Vakili, 1992]. We will not
repeat the details. On the other hand, a steeply ordered performance curve implies large
performance differences among the alternatives, in which case we have an easy problem,
since estimation error will have little perturbing effect on the ordering. (Great differences
among solution candidates mean estimation error will have little effect on the order.) From
this viewpoint, the only interesting generic case can be represented by J_i = i with w_i
characterized by some i.i.d. distribution, say, U[0, W] or exponential with parameter λ.
This is particularly true if we are primarily interested in, say, the top 5 or 10% of the
performances. In that region, we shall consider J_i = i to be the canonical problem for
ordinal optimization. Generic results for the probability of picking “good enough” subsets
based on the observed order for this case, for various levels of perturbing estimation error,[6]
have been previously derived and shown to be very useful for increasing the efficiency of
simulation [Ho-Sreenivas-Vakili, 1992 and Patsis-Chen-Larson 1993]. (We assume the
performances (based on the parameter θ) are ordered J_1(θ) < J_2(θ) < … and ask, due to
randomness, how likely is it that an observed order is the true order? Is the observed order
the most likely estimate of the true one?) The effect of correlations among the estimation
errors w_i has also been studied [Deng-Ho-Hu 1992]. Orders of magnitude improvements in
computational efficiency in the sense of §II have been observed.

[3] For example, J(θ) may represent the expected value of some sample performance
criterion which can only be estimated via Monte Carlo simulation.
[4] The additive noise model is not necessary. We could pursue an alternative
multiplicative or more general model.
[5] We shall relate the model discussed here with that of the ranking literature in statistics
separately later.
[6] For example, the parameter W or λ just mentioned above.
• (i) Blind Choice [Ho-Sreenivas-Vakili, 1992]: Suppose we blindly pick g
out of N choices; what is the probability that at least k of the top-g choices are actually
contained in the picked set?
g N  g
g
g  
i
g  i 
P  k , g, N    P i  , g, N      
(1)
N 
i k
i k
g
 
(Note: there are a total of C(N, g) ways to pick g alternatives out of N possibilities; there are
C(g, i) ways of choosing which i of the true top-g choices appear among the g picked slots, and
C(N−g, g−i) ways of filling the remaining g−i slots from the N−g alternatives that are not in
the true top-g.)

(I.e., P(at least k good enough designs are contained in the g randomly picked designs, when
there are N total designs) = Σ_{i=k}^{g} P(exactly i good enough designs are contained in the
g randomly picked designs) = Σ_{i=k}^{g} (# of ways to select i designs from the g good enough
designs) × (# of ways to select the remaining g−i designs from the non-good-enough designs to
complete the selected g designs) / (# of ways to select g designs from all N designs).)

You can calculate P for other values of N, g, and k easily; one example, for N = 1000,
is illustrated in Figure 1 below.
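
Before looking at the figure, here is a short Python sketch of equation (1) for readers who want to recompute the curves themselves; the function name blind_choice_ap is my own, and the values of N, g, and k below are merely illustrative.

from math import comb

def blind_choice_ap(N, g, k):
    """P(at least k of the true top-g designs fall in a blind pick of g out of N),
    i.e., equation (1) above."""
    return sum(comb(g, i) * comb(N - g, g - i) for i in range(k, g + 1)) / comb(N, g)

# A few points in the spirit of Fig. 1 (N = 1000), for the k = 1 and k = 10 curves:
for g in (10, 50, 100, 150, 200):
    print(g, round(blind_choice_ap(1000, g, 1), 4), round(blind_choice_ap(1000, g, 10), 6))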
Fig. 1. Alignment probability P vs. g, parametrized by k = 1, 2, …, 10, for N = 1000.
• The concept of G ≡ {the good enough subset}, e.g., the top-12 choices.
• The concept of S ≡ {the selected subset}, e.g., the observed top-12 choices.
• The concept of Alignment ≡ |G ∩ S| = ?, i.e., how has noise or modeling error
  distorted the (soft) order?
  In other words, alignment asks: what is the probability that the true “good enough”
  subset overlaps with the observed/apparent “good enough” subset? (E.g., the leftmost
  curve in Fig. 1 indicates P(at least 1 of the true top-10 designs is contained in the 10
  randomly picked designs).)
• Example: |S| = 1 means we are selecting champions; |G| = 1 means we are
  narrowing the search for the true champion. (In other words, |S| = 1 means we
  are selecting the observed best one, and |G| = 1 means we are looking for the
  true best one.)
• (ii) Horse race selection rule

On the other hand, if we don't pick blindly, i.e., if the variance of w is not infinite and
estimated good designs have a better chance of being actually good designs, then we should
definitely do better. Suppose now we use the horse race (HR) selection rule, i.e., pick
according to the estimated performances, however approximate they may be.[7] It turns out
we can no longer determine P in closed form. But P can easily be calculated
experimentally via a simple Monte Carlo experiment as a function of the noise variance σ², N,
g, and k, i.e., the response surface P(|G∩S| ≥ k; σ², N). We accomplish this by assuming a
particular form for J(θ), e.g., J(θ_i) = i, in which case the best design is θ_1 (we are
minimizing) and the worst design is θ_N, and the Ordered Performance Curve (OPC) is
linearly increasing. For any finite i.i.d. noise w, we can implement Ĵ(θ) = J(θ) + w and
directly check the alignment between the observed and the true order. A spreadsheet
implementation as in Fig. 2 can easily be carried out and the alignment probability P(|G∩
S| ≥ k; N, σ²) determined.
Fig. 2. Spreadsheet implementation of the generic experiment.
Column 1 models the N (= 200) alternatives and the true order 1 through N. Column 2
shows the linearly increasing OPC from 1 through N (= 200). The rate of OPC increase
relative to the noise variance σ² essentially determines the estimation or
approximation error of Ĵ(θ). This is shown by the random noise generated in column 3,
which in this case has a large range, U(0, 100). Column 4 displays the corrupted (or
estimated) performance. When we sort on column 4 we can directly observe the
alignment in column 1, i.e., how many of the numbers 1 through g are in the top-g rows. Try
this and you will be surprised! It takes less than two minutes to set up in Excel or Lotus
1-2-3.
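
A minimal Python version of the same generic experiment may be easier to rerun than the spreadsheet. The settings (N = 200 designs, J_i = i, additive U(0, 100) noise) mirror the description above; the function names and the choice g = 10 are my own illustrative assumptions.

import random

def alignment_count(N=200, noise_range=100.0, g=10, rng=random):
    """Return |G ∩ S|: how many of the true top-g designs appear in the observed
    top-g after sorting the noise-corrupted performances (smaller is better)."""
    true_perf = list(range(1, N + 1))                                 # column 2: J_i = i
    observed = [j + rng.uniform(0, noise_range) for j in true_perf]   # columns 3-4
    observed_order = sorted(range(N), key=lambda i: observed[i])      # sort on column 4
    top_g_observed = set(observed_order[:g])
    top_g_true = set(range(g))            # indices 0..g-1 hold the truly best designs
    return len(top_g_true & top_g_observed)

def alignment_probability(k=1, trials=10_000, **kwargs):
    """Monte Carlo estimate of P(|G ∩ S| >= k) by repeating the experiment."""
    return sum(alignment_count(**kwargs) >= k for _ in range(trials)) / trials

print(alignment_probability(k=1))   # usually high even with this large noise
print(alignment_probability(k=5))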
Note that the ordered performance curve of ANY system must be monotonically increasing and
can assume only a limited number of shapes near the optimum, e.g., steep, moderate, and flat
slopes. As far as the alignment probability is concerned, these correspond to large, moderate,
and small estimation errors of the system performance in the calculation of P(|G∩S| ≥ k; σ², N). In
other words, we submit that a parameterized UNIVERSAL ALIGNMENT PROBABILITY
curve can be constructed and be used to predict how various selection rules will perform
in isolating good from bad designs. This will only require VERY APPROXIMATE evaluation
of a number of designs. The implication of this assertion for simulation is obvious. Given
a complex system, we can quickly narrow down the search for good designs using very
approximate models and simulate them in parallel over a parameter domain for very short
durations. This is possible because we are only using the simulation to separate the good
from the bad (an ordinal idea) and not to estimate accurately the performance values (a
cardinal notion).
LECTURE #9 OO CONTINUED
REVIEW: the concepts of G, S, |G ∩ S|, the Thurstone model J_est = J_true + noise with
parameter σ², and the alignment probability as a function of N, g, k, and σ². Universality of P
(experimentally verified).
It is also clear that such an idea can be incorporated into a general optimization
environment. Consider the following conceptual flow chart:
• The Soft Optimization Shell (SOS)

Fig. 3. Flow chart for SOS.
We call this a Soft Optimization Shell or SOS. It is soft optimization because we do not
insist on getting the “best” but only the “good enough”. In fact it is this tradeoff that enables
us to get high values for P. It is a shell because we can plug in whatever computer model of
the system under investigation we like, ranging from back-of-the-envelope formulas to complex
simulation programs. In the case of a simulation program we can vary the length of the
simulation experiment to control the parameter σ², which provides added flexibility. It is also
clear how a parallel simulation language that can run a set of simulation experiments over the
N designs simultaneously, on either sequential or parallel computers, can be very useful in this
SOS environment. The point is that we can have a hierarchy of models ranging from the
simplest to the most complex. At each level of the hierarchy we can quantify the degree and
confidence of the approximation (in terms of the alignment probability described above). We
can also creatively allocate our computing resources over different levels of this hierarchy,
between “quickly surveying the landscape” and “lavishing attention on particular details”. The
possibilities in Fig. 3 for exploiting the synergistic interaction of human and machine, each
doing what it does best, are endless. We have only scratched the surface.
• Encouraging results from generic experiments ==> we can often solve a surrogate
  problem instead of the original hard problem:

      ORIGINAL                     SURROGATE
      steady-state DES             transient DES
      rare-event probability       not-so-rare event probability
      complex model                simpler model

• More importantly, the speed-ups from these devices MULTIPLY.
• For discrete event simulation and optimization, the ordinal idea can be combined with
  the standard clock to achieve speed improvements of several orders of magnitude.
• What about heuristics? We have seen what blind choice can do; J_i = i is one way
  to represent the favorable bias of heuristics.
• The concept of picking probability: In the above figure, on the one hand, the blind
  choice rule means sampling randomly from a number of designs, which may or may not
  be good designs. This does not allow us to get a good idea of what the “top x% of
  designs” should look like, so in turn we cannot accurately estimate whether a given
  design is good or bad. On the other hand, an improved choice rule lets us sample
  preferentially from good designs. This gives us a better idea of what top designs should
  look like. The better our selection rule is, the fewer samples are needed to say whether
  a selected design is good or not.
• Use some form of selection rule to pick the good enough subset, e.g., from the
  tournament literature: Baseball, Tennis, Random, and Stupid champion selection
  rules. (An illustrative sketch of noisy champion selection follows this list.)
• The problem of large search spaces arising from (i) combinatorial explosion and (ii)
  quantifying a search space of high dimension.
• The idea of sampling to reduce the search space while still guaranteeing a sufficient
  number of "good enough" solutions in the reduced search space.
• You can solve a generic problem experimentally, with the result that large
  approximation errors can be tolerated so long as we ask a softer question.
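
As promised in the selection-rule bullet above, here is an illustrative Python sketch, entirely my own construction and not the specific Baseball/Tennis/Random/Stupid rules of the tournament literature, comparing a blind random pick of a champion with a single-elimination knockout tournament decided by noisy pairwise comparisons; all parameter values are assumptions chosen only for illustration.

import random

def noisy_winner(i, j, sigma, rng):
    """Pick the apparently better of designs i and j (true performance = index, minimizing)."""
    return i if i + rng.gauss(0, sigma) < j + rng.gauss(0, sigma) else j

def knockout_champion(entrants, sigma, rng):
    """Single-elimination tournament: winners of noisy pairwise matches advance."""
    players = list(entrants)
    rng.shuffle(players)
    while len(players) > 1:
        next_round = [noisy_winner(players[t], players[t + 1], sigma, rng)
                      for t in range(0, len(players) - 1, 2)]
        if len(players) % 2:                      # odd count: last player gets a bye
            next_round.append(players[-1])
        players = next_round
    return players[0]

def champion_in_top_g(rule, N=256, g=10, sigma=20.0, trials=2000, seed=0):
    """Estimate P(the selected champion is among the true top-g designs)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        champ = rng.randrange(N) if rule == "blind" else knockout_champion(range(N), sigma, rng)
        hits += champ < g
    return hits / trials

print("blind pick:", champion_in_top_g("blind"))      # about g/N = 0.04
print("knockout  :", champion_in_top_g("knockout"))   # noticeably higher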
On Applying Ordinal Optimization
ES 205 Class Notes #13
Prof. Y.C. Ho
Spring 2001
Let us say you have a very computationally complex system/problem. Evaluating the
performance of a design θ (alternative words are candidate solution, choice,
parameter set, etc.) is time consuming. Furthermore, the search space of designs is large,
say at least one billion designs or more. You know very little about the structure of the search
space Θ and certainly do not have time to exhaustively or blindly search it. Here is what
OO recommends:
1. Uniformly pick 1000 designs out of Θ.
2. Say you are interested in getting within the top 1% of designs. Then, with high
   probability, these 1000 designs will contain about 10 such samples (why? See the
   numerical sketch after this list.). Your job is to find at least one of these 10 without
   busting your computational budget on evaluating these 1000 designs.
3. If you actually knew the performance values of these 1000 designs, denoted
   J_1, . . . , J_1000, then without loss of generality we may assume they are in fact ordered
   as such. If you plot these values against their order, you get a monotonically
   increasing curve (we are minimizing and J_1 is the best). This is the so-called
   Ordered Performance Curve, or OPC. Qualitatively, there are only five different
   kinds of OPCs.
4. In the real world, of course, you can only know estimates of the performance of
   these 1000 designs. Denote these as J~_1, . . . , J~_1000, in observed order. The true
   J_i can be considered a random variable centered at J~_i with a variance σ_i². For
   simplicity let us say all the σ_i are equal and we use only one value σ.
5. OO theory says that if we take the top n of these 1000 designs based on the
   observed values J~_i, then the alignment probability AP of having at least k (k < n) of
   the true top-n among the observed top-n can be easily determined. AP is a function of:
   - the range of the J_i values,
   - σ,
   - the parameters n and k,
   - the type of OPC you think you have.
   Of course, for high values of AP we want a large range and a large n, and a small σ
   and k. Which type of OPC we have also makes a difference. Without specific
   knowledge, we have good reason to assume a linearly increasing OPC.
6. AP has been extensively tabulated as a function of these arguments in Ho-Lau
   1997.
7. For a given or estimated range of values for J_i and an approximate value of σ, we
   can certainly check what it takes to get AP = 0.9 for n = 10 and k = 1, say. (The
   sketch following this list includes a Monte Carlo version of such a check.)
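
A numerical companion to steps 2, 5, and 7 above, under stated assumptions: the first part treats 1000 draws from a huge design space as effectively independent Bernoulli trials with success probability 0.01; the second part is a Monte Carlo stand-in (my own, not the Ho-Lau 1997 tables) for the alignment probability of a linearly increasing OPC with Gaussian estimation noise. All numerical values are merely illustrative.

import random

# Step 2: probability that 1000 uniform picks from a huge space contain at least
# one true top-1% design, and the expected number of such designs.
p_none = (1 - 0.01) ** 1000
print("P(at least one top-1% design in 1000 picks):", 1 - p_none)     # about 0.99996
print("expected number of top-1% designs in 1000 picks:", 1000 * 0.01)

# Steps 5 and 7: Monte Carlo estimate of AP = P(at least k of the true top-n are
# among the observed top-n) for a linear OPC J_i = i with N(0, sigma^2) noise.
def alignment_probability(n=10, k=1, N=1000, sigma=200.0, trials=2000, seed=0):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        observed = [i + rng.gauss(0, sigma) for i in range(N)]
        observed_top_n = set(sorted(range(N), key=lambda i: observed[i])[:n])
        hits += len(observed_top_n & set(range(n))) >= k
    return hits / trials

print("AP(n=10, k=1) for this illustrative noise level:", alignment_probability())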