LECTURE NOTES #7 & #8  VERSION: 3.1  DATE: 2003-8-12  FROM ES 205 LECTURES #9 & #13

ORDINAL OPTIMIZATION

On Ordinal Optimization
by Yu-Chi Ho, Harvard University

• The most useful lecture I will give in this course. The most useful research I have done in 30 years.

I. Introduction and Rationale

It can be argued that OPTIMIZATION, in a general sense, is the principal driver behind all prescriptive scientific and engineering endeavor, be it operations research, control theory, or engineering design. It is also true that the real world is full of complex decision and optimization problems that we cannot solve. While the literature on optimization and decision-making is huge, many of the concrete analytical results are associated with what may be called Real Variable Based methods. The idea of successive approximation to an optimum (say, a minimum) by sequential improvements based on local information is often captured by the metaphor of "skiing downhill in a fog." The concepts of gradient (slope), curvature (valley), and trajectories of steepest descent (fall line) all require the notion of derivatives and are based on the existence of a more or less smooth response surface. There exist various first- and second-order algorithms of feasible directions for the iterative determination of the optimum (minimum) of an arbitrary multi-dimensional response or performance surface. Many major success stories exist in this genre, including the Nobel Prize winning work on linear programming; it is not necessary to repeat or even reference them here. On the other hand, we submit that the reason many real-world optimization problems remain unsolved is partly the changing nature of the problem domain, which makes calculus or real variable based methods less applicable.
For example, a large number of human-made system problems, such as manufacturing automation, communication networks, computer performance, and general resource allocation, involve combinatorics rather than real analysis, symbols rather than variables, discrete instead of continuous choices, and synthesizing a configuration rather than proportioning the design parameters. Optimization for such problems seems to call for a general search of the performance terrain or response surface, as opposed to the "skiing downhill in a fog" metaphor of real variable based performance optimization (see footnote 1).

Copyright © by Yu-Chi Ho

Arguments for change can also be made on the technological front. Sequential algorithms were often dictated by the limited memory and centralized control of earlier generations of computers. With the advent of modern massively parallel machines and essentially unlimited virtual memory, distributed and parallel procedures can work hand-in-glove with the Search Based method of performance evaluation. It is one of the theses of this paper to argue for such a complementary approach to optimization.

If we accept the need for search based methods as a complement to the more established analytical techniques, then we can next argue that it is more important to quickly narrow the search for optimum performance to a "good enough" subset of the design universe than to estimate accurately the values of the system performance during the process of optimization. (I.e., having a "good enough" subset is more important than accurate estimation of the system performance.) We should compare order first and estimate value second, i.e., ordinal optimization comes before cardinal optimization. Furthermore, we shall argue that our preoccupation with the "best" may be only an ideal that will always be unattainable or not cost effective. Real world solutions to real world problems will involve compromise for the "good enough" (see footnote 2).
The purpose of this paper is to establish the distinct advantages of the softer, ordinal approach for search based problems, to analyze some of its general properties, and to show the many orders of magnitude improvement in computational efficiency that is possible under this mind set.

• Cost of getting the best for sure

II. Certainty vs. Probability and Best vs. Good Enough

The argument for asking a softer vs. a hard question rests on two tenets: trading "certainty" for "high probability" and relaxing "best" to "good enough". We can easily demonstrate the price one pays for insisting on certainty vs. being nearly certain via a simple but generic example. Let us agree that by BEST we mean getting to within the top 0.001% of the design universe (e.g., the top 0.001% means finding one of the ten best among one million), and by GOOD ENOUGH we mean getting to within the top 1%. Let us also say that if we can ascertain the value of the BEST to a confidence interval of ±0.001% we are sure, while a confidence interval of ±1% is high probability. It is well known that under independent sampling the variance of an estimate decreases as 1/n, so the confidence interval shrinks as 1/√n. Thus each order of magnitude tightening of the confidence interval requires two orders of magnitude increase in sampling cost.

Footnote 1. We hasten to add that we fully realize the distinction we make here is not absolutely black and white. A continuum of problem types exists. Similarly, there is a spectrum in the nature of optimization variables or search spaces, ranging from continuous to integer to discrete to combinatorial to symbolic.

Footnote 2. Of course, here one is mindful of the business dictum, used by Mrs. Fields for her enormously successful cookie shops, which says "Good enough never is". However, this can be reconciled via the frame of reference of the optimization problem.

To go from a confidence interval (CFI) of ±1% to CFI = ±0.001% therefore implies a 1,000,000-fold increase in sampling cost. Similarly, to insist on the "best" vs.
to be satisfied with being "good enough" is another thousand-fold increase in sampling cost (the top 0.001% is a thousand times smaller target than the top 1%). Furthermore, these efficiency gains are multiplicative rather than additive. Thus, if we can be satisfied with a CFI of ±1% in estimating the value of the best and with getting to within 1% of the best, then a 10^9-fold increase in efficiency can be achieved compared to getting the best for sure. (Because getting the best for sure incurs so much additional cost, it may be more efficient overall to settle for something "good enough" and spare ourselves the cost of certainty.) We submit that in dealing with complex problems in real life we often, consciously or unconsciously, settle for such a trade-off.

Of course, we need to interpret such a trade-off with care. The number 10^9 is an indication of the approximate order of magnitude rather than a precise calculation. Also, being in the top 1% may not be satisfactory if the performance values of the universe are [1, 2, . . . , 98, 99, . . . , 10000000]. (In other words, the difference between the performances of two top-1% solution candidates can be very large.) Such a needle-in-the-haystack optimization problem is inherently hard; short of exhaustive search, no method will be good enough. Furthermore, it is important to remember that the efficiency increase is gained at the expense of precision. There are no free lunches. But why insist on an expensive gourmet meal when a quick lunch is perfectly adequate, and may in fact be "optimal" when the cost-benefit ratio is considered?

• Concepts: OPC, PDF, sampling population N, subsets G and S, alignment probability

III. Ordered Performance Curves and Their Robustness

Consider a design universe Θ = {θ1, . . . , θN} of size N. Corresponding to each design alternative θi we have the performance J(θi) ≡ Ji for i = 1, 2, . . . , N, where θ is the design parameter(s). Without loss of generality we assume the designs are ordered, i.e., J1 < J2 < . . . < JN, and that we are minimizing.
If we plot the performance value versus the (ordered) design index, the resultant curve must be monotonically increasing by definition. In practice, a limited computing budget or time prevents us from evaluating all the Ji's exactly (see footnote 3); we can only estimate these performance values under noise. The observed values Ĵi = Ji + wi (see footnote 4), where wi represents the estimation error, do not in general obey Ĵ1 < Ĵ2 < . . . < ĴN. The problem of ordinal optimization has to do with the extent to which the true order is perturbed by the estimation errors. Symmetrically, if we order the observed performances, we are interested in the extent to which they align with the true order. For example, a reasonable question is: "Is the observed order the maximum likelihood estimate of the true order?" (see footnote 5)

Footnote 3. For example, J(θ) may represent the expected value of some sample performance criterion which can only be estimated via Monte Carlo simulation.

Footnote 4. The additive noise model is not necessary. We could pursue an alternative multiplicative or more general model.

Previously, we have argued [Ho-Sreenivas-Vakili, 1992 and Ho, 1993] that there are only three types of monotonically ordered performance curves: Steep, Flat, and Neutral. The flat curve captures the case where there is relatively little true performance difference among the alternatives. Estimation errors can then greatly perturb the observed order from the true one. Any selection rule for good alternatives based on the observed performances then amounts to a more or less blind choice; in this case, we cannot base the "good" designation on observed performance. The probability of picking a good subset can be explicitly calculated [Ho-Sreenivas-Vakili, 1992]; we will not repeat the details. On the other hand, a steeply ordered performance curve implies large performance differences among the alternatives, in which case we have an easy problem, since estimation error will have little perturbing effect on the ordering.
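The flat-vs-steep contrast above can be made concrete with a short simulation. This is a sketch, not from the notes: the slopes, noise range, and problem sizes are illustrative choices, and `mean_alignment` is a hypothetical helper name.

```python
import random

def mean_alignment(slope, N=100, g=10, noise_w=50.0, trials=1000, seed=3):
    """Average |G ∩ S| over many trials when the true OPC is J_i = slope * i
    and the observation noise is w_i ~ U(0, noise_w). G = true top-g designs,
    S = observed top-g designs ranked by corrupted performance."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # rank the designs by their noise-corrupted performance
        observed = sorted(range(1, N + 1),
                          key=lambda i: slope * i + rng.uniform(0.0, noise_w))
        total += sum(1 for i in observed[:g] if i <= g)
    return total / trials

print("steep OPC :", mean_alignment(slope=50.0))   # order survives the noise
print("flat OPC  :", mean_alignment(slope=0.05))   # close to blind choice
```

With a steep OPC the per-design performance gap dominates the noise and the observed order tracks the true order; with a flat OPC the same noise reduces selection to near-blind choice.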
(Great differences among solution candidates mean estimation error will have little effect on order.) From this viewpoint, the only interesting generic case can be represented by Ji = i, with wi characterized by some i.i.d. distribution, say, U[0, W] or exponential with parameter λ. This is particularly true if we are primarily interested in, say, the top 5 or 10% of the performances. In that region, we shall consider Ji = i to be the canonical problem for ordinal optimization. Generic results for the probability of picking "good enough" subsets based on the observed order, for various levels of perturbing estimation error (see footnote 6), have previously been derived and shown to be very useful for increasing the efficiency of simulation [Ho-Sreenivas-Vakili, 1992 and Patsis-Chen-Larson, 1993]. (We assume the performance, based on parameter θ, is ordered J1(θ) < J2(θ) < . . . and ask: given the randomness, how likely is it that an observed order is the true order? Is the observed order the most likely estimate of the true one?) The effect of correlations among the estimation errors wi has also been studied [Deng-Ho-Hu, 1992]. Orders of magnitude improvement in computational efficiency in the sense of § II has been observed.

• (i) Blind Choice [Ho-Sreenivas-Vakili, 1992]: Suppose we blindly pick g out of N choices. What is the probability that at least k of the true top-g choices are actually contained in the picked set?

    P(k, g, N) = Σ_{i=k}^{g} P(i, g, N) = Σ_{i=k}^{g} [ C(g, i) · C(N−g, g−i) ] / C(N, g)      (1)

where C(n, m) denotes the binomial coefficient "n choose m". (Note: there are C(N, g) ways in total to pick g alternatives out of N possibilities; there are C(g, i) ways of choosing which i of the true top-g choices appear among the g picked slots, and C(N−g, g−i) ways of filling the remaining g−i picked slots from the N−g non-top-g choices.) (I.e., P(at least k good enough designs are contained in g randomly picked designs, when there are N total designs)
= Σ_{i=k}^{g} P(there are exactly i good enough designs contained in the g randomly picked designs, when there are N total designs)
= Σ_{i=k}^{g} (# of ways to select i designs from the g good enough designs) × (# of ways to select the remaining g−i designs from the non-good-enough designs to complete the selected g designs) / (# of ways to select g designs from N designs in total).)

Footnote 5. We shall relate the model discussed here with that of the ranking literature in statistics separately later.

Footnote 6. For example, the parameter W or λ just mentioned above.

You can calculate P for other values of N, g, and k easily; one example, for N = 1000, is illustrated in Figure 1 below.

[Figure 1: Alignment probability P vs. g, parametrized by k = 1, 2, . . . , 10, for N = 1000; curves run from k = 1 (leftmost) to k = 10.]

• The concept of G ≡ {the good enough subset}, e.g., the top-12 choices.
  The concept of S ≡ {the selected subset}, e.g., the observed top-12 choices.
  The concept of Alignment = |G ∩ S| = ?, i.e., how has noise or modeling error distorted the (soft) order? In other words, alignment asks: what is the probability that the "good enough" subset overlaps with the observed/apparent "good enough" subset? (E.g., the leftmost curve in Fig. 1 indicates P(at least 1 of the true top-10 designs is contained in the 10 randomly picked designs).)

Example: |S| = 1 means we are selecting champions; |G| = 1 means we are narrowing the search for the true champion. (In other words, |S| = 1 means we are selecting the observed best one, and |G| = 1 means we are looking for the true best one.)

• (ii) Horse race selection rule

On the other hand, if we don't pick blindly, i.e., the variance of w is not infinite and estimated good designs have a better chance of being actually good designs, then we should definitely do better. Suppose now we use the horse race (HR) selection rule, i.e., pick according to the estimated performances, however approximate they may be (footnote 7). It turns out we can no longer determine P in closed form.
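Before turning to the Monte Carlo approach, note that the blind-choice baseline (1) can be evaluated directly, since it is a hypergeometric tail sum. A minimal sketch (the function name is ours; the formula and the N = 1000 setting are from the notes):

```python
from math import comb

def blind_pick_ap(k, g, N):
    """Eq. (1): probability that blindly picking g designs out of N captures
    at least k of the true top-g designs (a hypergeometric tail sum)."""
    return sum(comb(g, i) * comb(N - g, g - i)
               for i in range(k, g + 1)) / comb(N, g)

# a few points along the k = 1 curve of Fig. 1 (N = 1000)
for g in (10, 50, 100, 200):
    print(f"g = {g:3d}: P(|G ∩ S| >= 1) = {blind_pick_ap(1, g, 1000):.3f}")
```

Even blindly picking 10 out of 1000 gives a non-trivial chance of capturing a true top-10 design, and the probability climbs quickly as g grows.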
But P can easily be calculated experimentally via a simple Monte Carlo experiment, as a function of the noise variance σ², N, g, and k, i.e., the response surface P(|G ∩ S| ≥ k; σ², N, g). We accomplish this by assuming a particular form for J(θ), e.g., J(θi) = i, in which case the best design is θ1 (we are minimizing), the worst design is θN, and the Ordered Performance Curve (OPC) is linearly increasing. For any finite i.i.d. noise w, we can implement Ĵ = J + w and directly check the alignment between the observed and true order. A spreadsheet implementation as in Fig. 2 can easily be carried out and the alignment probability P(|G ∩ S| ≥ k; N, σ²) determined.

Fig. 2. Spreadsheet Implementation of the Generic Experiment

Column 1 models the N (= 200) alternatives and the true order 1 through N. Column 2 shows the linearly increasing OPC from 1 through N (= 200). The rate of OPC increase relative to the noise variance σ² essentially determines the estimation or approximation error of Ĵ. This is shown by the random noise generated in column 3, which in this case has a large range, U(0, 100). Column 4 displays the corrupted (or estimated) performance. When we sort on column 4 we can directly observe the alignment in column 1, i.e., how many of the numbers 1 through g are in the top-g rows. Try this and you will be surprised! It takes less than two minutes to set up in Excel or Lotus 1-2-3.

Note that the ordered performance curve of ANY system must be monotonically increasing and can assume only a limited number of shapes near the optimum, e.g., steep, moderate, and flat slopes. As far as the alignment probability is concerned, these correspond to large, moderate, and small estimation errors of system performance in the calculation of P(|G ∩ S| ≥ k; σ², N). In other words, we submit that a parameterized UNIVERSAL ALIGNMENT PROBABILITY curve can be constructed and used to predict how various selection rules will perform in isolating good designs from bad.
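The spreadsheet experiment of Fig. 2 takes only a few lines in code as well. A sketch under the Fig. 2 settings (N = 200, OPC J_i = i, noise U(0, 100)); the function name and seed are ours:

```python
import random

def spreadsheet_experiment(N=200, g=10, noise_range=100.0, seed=0):
    """One run of the four-column experiment of Fig. 2: column 1 is the true
    rank i, column 2 the OPC value J_i = i, column 3 the noise w_i ~ U(0, W),
    column 4 the corrupted performance J_i + w_i. Sort on column 4 and
    return the alignment |G ∩ S| of the top-g rows with the true top-g."""
    rng = random.Random(seed)
    column4 = [(i + rng.uniform(0.0, noise_range), i) for i in range(1, N + 1)]
    column4.sort()                                   # "sort on column 4"
    observed_top_g = [rank for _, rank in column4[:g]]
    return sum(1 for rank in observed_top_g if rank <= g)

# one run; repeat with different seeds to estimate P(|G ∩ S| >= k)
print("alignment |G ∩ S| =", spreadsheet_experiment())
```

Despite noise whose range (100) is half the entire span of the true performances, several of the true top-10 designs typically land in the observed top-10, which is exactly the surprise the notes promise.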
This will require only VERY APPROXIMATE evaluation of a number of designs. The implication of this assertion for simulation is obvious. Given a complex system, we can quickly narrow down the search for good designs using very approximate models, simulating them in parallel over a parameter domain for very short durations. This is possible because we are only using the simulation to separate the good from the bad (an ordinal idea) and not to estimate the performance values accurately (a cardinal notion).

LECTURE #9 — OO CONTINUED

REVIEW: the concepts of G, S, |G ∩ S|; the Thurstone model Jest = Jtrue + noise with parameter σ²; the alignment probability as a function of N, g, k, and σ²; the universality of P (experimentally verified).

It is also clear that such an idea can be incorporated into a general optimization environment. Consider the following conceptual flow chart.

• The Soft Optimization Shell (SOS)

Fig. 3. Flow Chart for SOS

We call this a Soft Optimization Shell or SOS. It is soft optimization because we do not insist on getting the "best" but only the "good enough"; in fact it is this tradeoff that enables us to get high values of P. It is a shell because we can plug in whatever computer model of the system under investigation we like, ranging from back-of-the-envelope formulas to complex simulation programs. In the case of a simulation program we can vary the length of the simulation experiment to control the parameter σ², which provides added flexibility. It is also clear how a parallel simulation language that can run a set of simulation experiments over the N designs simultaneously, on either sequential or parallel computers, can be very useful in this SOS environment. The point is that we can have a hierarchy of models ranging from the simplest to the most complex. At each level of the hierarchy we can quantify the degree and confidence of the approximation (in terms of the alignment probability described above).
We can also creatively allocate our computing resources over the different levels of this hierarchy, between "quickly surveying the landscape" and "lavishing attention on particular details". The possibilities for exploiting the synergistic interaction of human and machine, each doing what it does best, are endless in Fig. 3. We have only scratched the surface.

• Encouraging results from generic experiments ==> we can often solve a surrogate problem instead of the original hard problem:

    ORIGINAL                    SURROGATE
    steady-state DES            transient DES
    rare-event probability      not-so-rare event probability
    complex model               simpler model

• More importantly, the speed-ups of these devices MULTIPLY.

• For discrete event simulation and optimization, the ordinal idea can be combined with the standard clock to achieve speed improvements of several orders of magnitude.

• What about heuristics? We have seen what blind choice can do; Ji = i is one way to represent the favorable bias of heuristics.

• The concept of picking probability. In the above figure, on the one hand, the blind choice rule means sampling randomly from a number of designs, which may or may not be good. This does not allow us to get a good idea of what the "top x% of designs" should look like, so in turn we cannot accurately estimate whether a given design is good or bad. On the other hand, we can improve the choice rule so that we sample preferentially from good designs. This gives us a better idea of what top designs should look like. The better our selection rule is, the fewer samples are needed to say whether a selected design is good or not.

• Use some form of selection rule to pick the good enough subset, e.g., from the tournament literature: the Baseball, Tennis, Random, and Stupid champion selection rules.
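The picking-probability point above, that any informative selection rule should beat blind choice, can be checked with a short simulation. A sketch with illustrative parameters (the function name, N = 200, g = 10, and the U(0, 100) noise are our choices in the spirit of the Fig. 2 experiment):

```python
import random

def picking_prob(rule, N=200, g=10, noise_range=100.0, trials=2000, seed=7):
    """Estimate P(|G ∩ S| >= 1), the probability that the selected subset S
    contains at least one true top-g design, for two selection rules:
    'blind' picks g designs uniformly at random; 'hr' (horse race) picks
    the observed top-g under J_i = i plus U(0, noise_range) noise."""
    rng = random.Random(seed)
    good = set(range(1, g + 1))          # true top-g designs (minimizing)
    hits = 0
    for _ in range(trials):
        if rule == "blind":
            selected = set(rng.sample(range(1, N + 1), g))
        else:
            order = sorted(range(1, N + 1),
                           key=lambda i: i + rng.uniform(0.0, noise_range))
            selected = set(order[:g])
        hits += bool(selected & good)
    return hits / trials

print("blind choice:", picking_prob("blind"))
print("horse race  :", picking_prob("hr"))
```

Even with very large noise, ranking by the corrupted performances captures a true top-g design far more reliably than blind sampling does.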
• The problem of a large search space arises from (i) combinatorial explosion and (ii) quantifying a search space of high dimension.

• The idea of sampling to reduce the search space while guaranteeing a sufficient number of "good enough" solutions in the reduced search space.

• You can solve a generic problem experimentally, with the result that large approximation errors can be tolerated so long as we ask a softer question.

On Applying Ordinal Optimization
ES 205 Class Notes #13
Prof. Y. C. Ho, Spring 2001

Let us say you have a very computationally complex system/problem. Evaluating the performance of a design θ (alternative words are candidate solution, choice, parameter set, etc.) is time consuming. Furthermore, the search space Θ for designs is large, say at least one billion designs or more. You know very little about the structure of the search space, and certainly do not have time to exhaustively or blindly search it. Here is what OO recommends:

1. Uniformly pick 1000 designs out of Θ.

2. Say you are interested in getting within the top 1% of designs. Then these 1000 designs will highly likely contain about 10 such samples (why?). Your job is to find at least one of these 10 without busting your computational budget on evaluating these 1000 designs.

3. If you actually knew the performance values of these 1000 designs, denoted J1, . . . , J1000, then without loss of generality we may assume they are in fact ordered as such. Plotting these values against the order, you get a monotonically increasing curve (we are minimizing, and J1 is the best). This is the so-called Ordered Performance Curve or OPC. There are only five qualitatively different kinds of OPCs.

4. In the real world, of course, you can only obtain estimates of the performance of these 1000 designs. Denote these as J̃1, . . . , J̃1000, in the observed order. The true Ji can be considered a random variable centered at J̃i with a variance σi².
For simplicity let us say all the σi are equal and we use only one value σ.

5. OO theory says that if we take the top-n of these 1000 designs based on the observed values J̃i, then the alignment probability AP of having at least k < n of the true top-n among the observed top-n can be easily determined. AP is a function of:
   - the range of the Ji values,
   - the noise level σ,
   - the parameters n and k,
   - the type of OPC you think you have.
   Of course, for high values of AP we want a large range and large n, and small σ and k. Which type of OPC we have also makes a difference. Without specific knowledge, we have good reason to assume a linearly increasing OPC.

6. AP has been extensively tabulated as a function of these arguments in Ho-Lau 1997.

7. For a given or estimated range of values of the Ji and an approximate value of σ, we can then check what it takes to get AP = 0.9 for, say, n = 10 and k = 1.
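The "why?" in step 2 is a one-line binomial calculation. A quick check (the counts 1000 and top-1% come from the notes; the reasoning is standard):

```python
# Each uniform sample independently lands in the top 1% with probability
# 0.01 (drawing 1000 out of a billion, with or without replacement is
# essentially the same), so the number of top-1% designs among the 1000
# samples is Binomial(1000, 0.01).
p_none = 0.99 ** 1000                # P(no top-1% design among the 1000)
expected = 1000 * 0.01               # expected number of top-1% designs
print(f"P(at least one top-1% design) = {1 - p_none:.6f}")
print(f"expected number of top-1% designs = {expected:.0f}")
```

The probability of missing the top 1% entirely is on the order of 10^-5, i.e., the 1000 samples contain about 10 top-1% designs with near certainty.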