Ordinal Optimization and Simulation Yu-Chi Ho Harvard University, Cambridge, MA 02138 Christos G. Cassandras Boston University, Boston, MA 02215 Chun-Hung Chen University of Pennsylvania, Philadelphia, PA 19104-6315 Liyi Dai Washington University, St. Louis, MO 63130 Abstract Simulation plays a vital role in analyzing stochastic systems, particularly, in comparing alternative system designs with a view to optimize system performance. Using simulation to analyze complex systems, however, can be both prohibitively expensive and time consuming. Efficiency is a key concern for the use of simulation to optimization problems. Ordinal optimization has emerged as an effective approach to significantly improve efficiency of simulation and optimization. One can achieve an exponential convergence rate when using ordinal optimization for simulation problems. There are already several success stories of ordinal optimization. In this paper we will introduce the idea of ordinal optimization and report some recent advances in this research. Details of an extension of ordinal optimization to a class of resource application problems are given as well. *** Keywords: Discrete-event simulation, ordinal optimization, discrete-event dynamic systems, resource allocation problem. *** Corresponding author: Professor Chun-Hung Chen, Dept. of Systems Engineering, University of Pennsylvania, Philadelphia, PA 19104-6315, USA. Fax: 215-573-2065, Tel: 215-898-3967, Email: chchen@seas.upenn.edu. 1 1. Introduction It can be argued that OPTIMIZATION in the general sense of making things BETTER is the principal driving force behind any prescriptive scientific and engineering endeavor, be it operations research, control theory, or engineering design. It is also true that the real world is full of complex decision and optimization problems that we cannot solve. While the literature on optimization and decision making is huge, much of the concrete analytical results and success stories are associated with what may be called Real-Variable-Based methods. The idea of successive approximation to an optimum (say, minimum) by sequential improvements based on local information is often captured by the metaphor "skiing downhill in a fog". The concepts of gradient (slope), curvature (valley), and trajectories of steepest descent (fall line) all require the notion of derivatives/gradients and are based on the existence of a more or less smooth response surface. There exist various first and second order algorithms of feasible directions for the iterative determination of the optimum (minimum) of an arbitrary multi-dimensional response or performance surface. A considerable number of major success stories exist in this genre including the Nobel prize winning work on linear programming. It is not necessary to repeat or even reference these here. On the other hand, we submit that the reason many real-world optimization problems remain unsolved is partly due to the changing nature of the problem domain in recent years which makes calculus or real-variable-based method less applicable. For example, a large number of human-made system problems, such as manufacturing automation, communication networks, computer performances, and/or general resource allocation problems, involve combinatorics rather than real analysis, symbols rather than variables, discrete instead of continuous choices, and synthesizing a configuration rather than proportioning the design parameters. Such problems are often characterized by the lack of structure, presence of large uncertainties, and enormously large search spaces. Performance evaluation, optimization, and validation of such systems problems are daunting, if not impossible, tasks. Most analytical tools such as queueing network theory or Markov Chains fall short due to modeling or computational limitations. Simulation becomes the general tool of choice or last resort. Furthermore, optimization for such problems seems to call for a general search of the performance terrain or response surface as opposed to the “skiing downhill in a fog” metaphor of real variable based performance optimization1. Arguments for change can also be made on the technological front. Sequential algorithms were often dictated as a result of the limited memory and centralized control of earlier generations of computers. With the advent of modern general purpose parallel computing and the essentially unlimited size of virtual memory, distributed and parallel procedures on a network of machines can work hand-in-glove with the Search Based method of performance evaluation. It is one of the theses of this paper to argue for such a complementary approach to optimization via simulation models. 1 We hasten to add that we fully realize the distinction we make here is not absolutely black and white. A continuum of problem types exist. Similarly, there is a spectrum of the nature of optimization variables or search space ranging from continuous to integer to discrete to combinatorial to symbolic. 2 If we accept the need for a search-based method as a complement to analytical techniques, then we can argue that to quickly narrow the search for optimum performance to a “good enough” subset in the design universe is more important than to estimate accurately the values of the system performance during the initial stages of the process of optimization. We should compare order first and estimate value second, i.e., ordinal optimization comes before cardinal optimization. Colloquial expressions, such as “ballpark estimate”, “80/20 solution” and “forest vs. trees”, state the same sentiment. Furthermore, we shall argue that our preoccupation with the “best” may be only an ideal that will always be unattainable or not cost effective. Real world solutions to real world problems will involve compromise for the “good enough”2 . The purpose of this paper is to establish the distinct advantages of the softer approach of ordinal optimization for the search-based type of problems, to analyze some of its general properties, and to show the many orders of magnitude improvement in computational efficiency that are possible under this mindset; finally to also advocate the use of this mindset as an additional tool in the arsenal of soft computing and a complement to the more traditional cardinal approach to optimization and engineering design. In the next section, we formulate the general simulation and optimization problems and point out the major limitation of traditional simulation approaches. Ordinal optimization is introduced in Section 3. In Section 4, we present the theoretical analysis that ordinal optimization can achieve an exponential convergence rate. Section 5 gives a budget allocation technique which can future improve efficient of exponential convergence of ordinal optimization. In Section 6, we will present several successful stories of the application of ordinal optimization. In particular, details of an extension of ordinal optimization to a class of resource application problem and its application to network control problem will be given. Section 7 concludes this paper. 2. Problem Statement Suppose we have a complex stochastic system. The performance of this system is conceptually given by J() where J is the expected performance measure and are the system design parameters which may be continuous, discrete, combinatorial or even symbolic. A general problem of stochastic optimization can be defined as min J() E[L( )] (2.1) where the search space is an arbitrary, huge, structureless but finite set; is the system design parameter vector; J, the performance criterion which is the expectation of L, the sample performance, as a functional of and , a random vector that represents uncertain factors in the systems. However, there are some basic limitations related to search-based methods which require the use of simulation to estimate performances J() as in Eq.(2.1). There are two aspects to this problem: the "stochastic" part and the "optimization" part. 2 Of course, here one is mindful of the business dictum, used by Mrs. Fields for her enormously successful cookie shops, which says ”Good enough never is”. However, this can be reconciled via the frame of reference of the optimization problem. 3 The "stochastic" aspect has to do with the problem of performing numerical expectation since the functional L( ) is available only in the form of a complex calculation via simulation. The standard approach is to estimate E[L( )] by sampling, i.e., 1 t E[L( )] Jˆ ( ) L( , i ) t i 1 (2.2) where i represents the i-th sample of . The ultimate accuracy (confidence interval) of this estimate cannot be improved upon faster than 1/t1/2, the result of averaging i.i.d. noise. This rate can be quite adequate or even very good for many problems but not good enough for the class of complex simulation-based performance evaluation problems mentioned above. Here each sample of L(i) requires one simulation run which for the type of complex stochastic systems we are interested in here can take seconds or minutes even on the fastest computer not to mention the setup cost of programming the simulation. Millions or billions of required samples of L(i) for different values of may become an infeasible burden. Yet there is no getting away from this fundamental limit of 1/t1/2 , i.e. one order of magnitude increase in accuracy requires two orders of magnitude increase in sampling cost. The second limitation has to do with the "optimization" part. Traditionally, analysis again plays an important part in numerical optimization whether it is convergence, gradient computation, or general hill climbing. Again these tools, such as infinitesimal perturbation analysis (IPA), take advantage of the structure and real-variable nature of the problem to devise effective algorithms for optimization. However, when we exhaust such advantages and the remaining problem becomes structureless and is totally arbitrary as above, we encounter the limitation introduced by combinatorial explosion, or the so called NP-hard Limitation3. In such cases, a more or less blind search is the only alternative. While it is often easy to prove probabilistic convergence of random search, it is very hard to improve the search efficiency. An example will illustrate this: Suppose is of size 1010,which is small by combinatorial standards. We are allowed to take 10,000 samples of J() uniformly in A natural question is what the probability that at least one of these samples will belong in the top 50, top 500, or top 5000 of the designs of is, i.e., P{at least one of the 10,000 sample is in the top-g} = 1 - (1- g 10,000 ) 1010 (2.3) which turns out to be 0.00004999, 0.0004998, and 0.004988 respectively. This means that for a simulation based optimization problem of any complexity, a simple random search is not an effective approach — too much computation for too little return4. 3 For example, in the case of a single input single output static decision rule, the number of possibilities is the size of the output possibilities raised to the power of the size of the input possibilities. This is an extremely large number. 4 To make matters worse, the No-Free-Lunch Theorem of Wolpert and MaCready (IEEE Journal on Evolutionary Computation, Vol. 1, 1997) says that without structural information no algorithm can do better on the average than blind search. 4 These two difficulties, the 1/t1/2 and the NP-Hard Limitation, are basic for general stochastic optimization problems of the type we are interested in. Each one can induce an insurmountable computational burden. To make matters worse, the effect of these limitations is multiplicative rather than additive. No amount of theoretical analysis can help once we are faced with these limitations. This is where ordinal optimization comes in. 3. Ordinal Optimization The idea of OO is to effect a strategic change of goals5. It is based on two tenets: (a) "Order" converges exponentially fast while "value" converges at rate 1/t1/2. In other words, it is much easier to determine whether or not alternative A is better than B than to determine "A-B=?". This is intuitively reasonable if we think of the problem of holding identically-looking packages in each hand and trying to determine which package weighs more vs. to estimate the difference in weight between the two packages. More quantitatively, suppose we observe A and B through i.i.d. zero mean additive normal noises. The chance of mis-ordering A and B is roughly proportional to the area under the overlapping tail of the two density functions centered at A and B as illustrated below. A plus perturbing noise B plus perturbing noise This notion can be made more precise. If A and B plus noise are the performance value estimates of two simulation experiments, then under the best conditions the estimates cannot converge to the true values faster than 1/t1/2 where “t” is the length or number of replications of the simulation experiments. But the relative order of A vs. B converges exponentially fast (Dai 1996, Xie 1998). Furthermore, the value or cardinal estimation may not always be necessary. In many cases, we simply try to determine which is the better alternative without necessarily knowing the detailed consequence of what each alternative entails, i.e., we emphasize the choice (order) rather than estimating the utility (value) of the choices. (b) Goal softening eases the computational burden of finding the optimum. Traditional optimization determines a single alternative in and hopes that it coincides with the true optimum. The metaphor of hitting one speeding bullet with another comes to mind. Instead if we 5 This is a euphemism for retreat. 5 are willing to settle for the set of "good enough" alternatives with high probabilities, e.g., any of the top-n% of the choices for 95% of the time, then this probability improves dramatically with the size of these sets (i.e., n). It is like trying to hit a truck with a shotgun. Let us formalize this notion a bit. Suppose we define a subset of the design space as the “good enough” subset, G, which could be the top-g designs or top-n(g/N)% of the designs of the sampled set of N designs. We denote the size of the subset G as |G| g. We now pick out (blindly or by some rule) another subset, called the selected subset, S with the same cardinality, i.e., |S|=|G|=g6 . We now ask the question what is the probability that among the set S we have at least k (≤ g) of the members of G, i.e., P{|GS|≥k}. This question is relevant since our rule for picking the set S clearly affects how many good designs we get in S. In the case of blind picking (i.e., a purely random selection of S), we expect the overlap to be smaller than the result of selecting based on some estimated order of the alternatives. We define P{|GS|k} as the alignment probability which is a measure of the goodness of our selection rules. Fig.3.1 below illustrates the concept. For perspective, we have also indicated the traditional optimization view where the subsets G and S are both singletons. Figure 3.1. Generalized concept for Optimization. For the case of blind picking, which can be considered as observing the relative orders through noise having infinite variance (i.e., any design can be considered as the good enough design and vice versa), this alignment probability is given by Ho et al. (1992). g N g i g i P{|GS|≥k} = N i k g g (3.1) For example, if N=200, g=12, and k=1, then P P{|GS|≥k} > 0.5. When N=200, g=32, and k=1 P is almost 1. We can calculate P for other values N, g, and k easily, one example of which for N=1000 is illustrated in Figure 3.2 below. 6 Note |S| and |G| need not be equal in which case a slightly more general from of Eq.(4) to follow results. 6 Alignment Probability 1.0 k=1 k=10 0.5 0.0000 10 50 100 150 g 200 Figure 3.2. P vs. g parameterized by k=1,2, . . ., 10 for N=1000 Once again this shows that by blindly picking some 50-130 choices out of 1000 we can get quite high values for P. Remember these curves are for blind picks. On the other hand, if we don’t pick blindly, i.e., the variance of the estimation error is not infinite and estimated good designs have a better chance at being actual good designs, then we should definitely do better. Suppose now we no longer pick designs blindly, but pick according to their estimated performance, however approximate they may be, e.g., the selected subset S can be the estimated top-g or top-n(g/N)% of the designs7 . It turns out we no longer can determine P in closed form. But P can be easily and experimentally calculated via a simple Monte Carlo experiment as a function of the noise variance 2, N, g, and k, i.e., the response surface P{|GS|≥k; 2, N}. For example, we can assume a particular form for J(), e.g., J(i) = i in which case the best design is 1 (we are minimizing) and the worst design is N and the ordered performances are linearly increasing. For any finite i.i.d. estimation noise w, we can implement Jˆ ( ) = J() + w and directly check the alignment between the observed and true order. More generally, this alignment probability can be made precise and quantitative (Lau and Ho 1997, Lee et al. 1998) and shown to be exponentially increasing with respect to the size of G and S. Consequently, by combining the two basic ideas (a) and (b) of OO in this section, several orders of magnitude of easing of the computational burden can be achieved. In fact, some special cases of the alignment probability is also called the probability of correct selection or P{CS} in the parlance of the simulation literature. There exists a large literature on assessing P{CS} based on classical statistical models. Goldsman and Nelson (1994) provide an excellent survey on available approaches to ranking, selection, and multiple 7 This can be characterized as the Horse Race selection rule where we estimate the performance of ALL designs and pick the estimated top-g designs. 7 comparisons (e.g., Goldsman, Nelson, and Schmeiser 1991, Gupta and Panchapakesan 1979). In addition, Bechhofer et al. (1995) provide a systematic and more detailed discussion on this issue. However, most of these approaches are only suitable for problems with a small number of designs (e.g., Goldsman and Nelson (1994) suggest 2 to 20 designs). The question "Isn't OO just reinventing the wheel of ranking and selection literature in statistics?" may occur to the reader at this point. First, there is a scale difference, since in OO we are dealing with design space of size billions times billions while R&S typically deal with tens of choices. Secondly, OO does not address questions such as the distance between the best and the rest, a cardinal notion; or whether or not the observed order is the maximum likelihood estimate of the actual order, an elegant concept in Ranking and Selection literature but not very useful in the OO context. Also, the notion of softening which is a main tenet for OO is not present in R&S literature. Lastly, OO is complementary to other tools of computational intelligence such as fuzzy logic and genetic algorithms. In genetic algorithm, one must evaluate the fitness of a population of candidate designs. However, it is implicitly assumed in almost all GA literature that this evaluation is simple to carry out. For complex simulation problems we have in mind here, this is not the case. Thus taking advantage of the ability of OO to quickly separate the good from the bad one can extend the applicability of GA to problems heretofore not feasible computationally. Similarly, in OO we have defined concepts such as "good enough" or high alignment probability as crisp notions. It is straight forward to fuzzify these definitions and have OO apply to fuzzy systems problems. In other words, "softening" and "fuzzification" are distinct and complementary concepts. Much remains to be done. In the next section, we will show the fast convergence of ordinal optimization theoretically. 4. Convergence of Ordinal Comparison In this section, we provide a theoretical investigation of the efficiency of ordinal comparison. Without loss of generality, we say that (design) i is better than j if J(i) < J(j). For notational simplicity, in this section, we assume that that the N designs are indexed in such as way that < J(1) > J(2) > J(3) > .. > J(N) > -. We are interested in finding a design in G, one of the M , 1 M N , true best designs in . Let H = - G. We use Jˆ ( , t ) to denote an estimate of J() based on a simulation of length t. In order to characterize the rate of convergence for ordinal comparison, we define the following indicator process 1 if min G Jˆ , t min H Jˆ , t I t 0 otherwise. Then I(t) is equal to 1 if the observed best design is a good design in G and equal to 0 otherwise. I(t) is a function of t indicating when the observed best design is one of the good designs. We next examine the convergence of P{I(t) = 1} and P{I(t) = 0} as a function of simulation length of simulation samples t. The convergence rate of I(t) will be taken as that of P{I(t)=1} or P{I(t)=0}. 8 A lower bound In many cases such as finite-state Markov chains and averaging i.i.d. random variables, the variance of an estimate Jˆ ( , t ) decays to zero at rate of O(1/t). Such delay rate of the variance can be used to derive a lower bound on the rate of convergence for the indicator process, which is precisely stated in the following theorem. Theorem 4.1 (Dai 1996) Assume that (A4.1) The stochastic system under simulation is ergodic in the sense that lim L( , , t ) = J(), a.s. t for any , and that the variance of the estimate Jˆ ( , t ) decays to zero at rate O(1/t) for all , i.e., Var( Jˆ ( , t ) ) = O(1/t). Then P{I(t) = 1} = 1 - O(1/t), P{I(t) = 0} = O(1/t), in other words, the indicator process I(t) converges to 1 at the rate of at least O(1/t). # Assumption (A4.1) is very mild and satisfied by most stochastic systems. This assumption guarantees that J() can be estimated with arbitrary accuracy by increasing the simulation duration of any one sample trajectory. Exponential convergence Theorem 4.1 gives a very conservative bound for general cases. In many situations, we know much more about the structure and distribution of X(, t). Quite often, the rate of convergence can be proven faster than that in Theorem 4.1. Especially, we consider the case of simulating general regenerative systems. Definition 4.2. (Kalashnikov 1994, p.19) An {X(, t)}is said to be regenerative (in a classic sense) if there is an increasing sequence of nonnegative finite random times { i(), i 0} such that for each. (i) {X(, i()+t), k() - i(), t 0, k i } are identically distributed. (ii) {X(, i()+t), k() - i(), t 0, k i } does not depend on {X(, t), j(), ti, 0ji}. # The sequence {i(), i 1} is a sequence of regenerative points. We consider two types of performance measures K t , i 1t Jˆ ( , t ) = ( X ( , , u )) du and Jˆ ( , t ) = t0 ( X ( , , u )) du i 1 i 1 K t , Ti i 1 9 (4.1) where (x ) R, Ti() = i() - i-1(), i 1, is the length of the ith regeneration cycle, and K(t,) = max {i: i() t, i 0} i is the number of regeneration cycles on [0, t]. K (t , ) = K(t,) + 1 is the first passage time process. For convenience, we define T0() = 0(). Performance measures of the form (4.1) frequently appear in simulation. They are essentially identical except for the tail terms. However, these tail terms typically converge at the rate of O(1/t), slower than the rate of ordinal comparison as we shall show. Therefore, such tail terms have to be taken into account of consideration. For every , we assume that (A4.1) holds and (A4.2) The cycle time T1() has finite, continuous moment generating function E[e sT1 ] in a neighborhood of s = 0, s 0; (A4.3) The cycle time T1() is not degenerate in the sense of limt0 P{T1() t} < 1. (A4.4) The function (x ) R is bounded and | (x ) | B, 0 < B < . Assumptions (A4.2) and (A4.4) are technical and satisfied in many cases (Dai 1996). Assumption (A4.3) is very mild. Note that (A4.2) and (A4.3) imply 0 < E[T1()] < . Regenerative systems constitute a large class of systems. For regenerative systems, the rate of convergence of ordinal comparison in exponential as stated in the following theorem. Theorem 4.3 (Dai 1996). Suppose (A4.1)-(A4.4) holds. Then there exists a positive number > 0 such that P{I(t) = 1} = 1 - O(e-t), P{I(t) = 0} = O(e-t), for both of the performance measures Jˆ ( 1 , t ) and Jˆ ( 2 , t ) , in other words, the indicator process converges at an exponential rate. # Relaxation of assumptions (A4.1)-(A4.4) in terms of large deviation principles is provided in Dai and Chen (1997). The constant in the rate of convergence is given by the following expression. = inf lim log E e st min s[ 0 , ) t 1 t G Jˆ , t min H Jˆ , t . 5. Computing Budget Allocation for Further Efficiency Improvement While ordinal optimization could significantly reduce the computational cost for design selection, there is potential to further improve its performance by intelligently determining the number of simulation samples or replications among different designs. Intuitively, to ensure a high alignment probability, a larger portion of the computing budget should be allocated to those designs which are potentially good designs. In other words, a larger number of simulations must be conducted with those critical designs in order to reduce estimator variance. On the other hand, 10 limited computational effort should be expanded on non-critical designs that have little effect on identifying the good designs even if they have large variances. In doing so, less computational effort is spent on simulating non-critical designs and more computational effort is spent on simulating critical designs; hence, the overall simulation efficiency is improved. Ideally, we want to optimally choose the number of simulation samples for all designs to maximize simulation efficiency with a given computing budget. In this section, we present a technique called optimal computing budget allocation (OCBA) which makes use of this idea. Let t be the number of simulation samples of design . If simulation is performed on a sequential computer and the difference of computation costs of simulating different designs is negligible, the total computation cost can be approximated by t . The goal is to choose t for all such that the total computation cost is minimized, subject to the restriction that the confidence level defined by alignment probability is greater than some satisfactory level. min t t s.t. alignment probability P*. (5.1.a) where P* is a user-defined confidence level requirement. Or one may intend to choose t such that the alignment probability is maximized, subject to a limited computing budget T. max [alignment probability] t s.t. t = T. (5.1.b) Three major issues for solving (5.1) are the following: There is no closed-form expression for the alignment probability. The alignment probability at t, , can be computed only after all t simulation samples for all designs are performed. t, , are integers and the number of combinations for different t is large even for moderate size of . To solve the budget allocation problem in (5.1), the alignment probability must be estimated easily. An extra Monte Carlo simulation for estimating the alignment probability would be too expensive under this setting. On the other hand, most of existing ranking and selection approaches are only suitable for problems with a small number of designs (e.g., Goldsman and Nelson (1994) suggest 2 to 20 designs). Using a Bayesian model, Chen (1996) gives an estimation technique to quantify the confidence level for ordinal comparison when the number of designs is large. In addition to the confidence probability, this approach also provides sensitivity information for each algorithm, which is useful when solving the allocation problem in (5.1). It is convenient to let (1), (2), … , (N) denote the indices (order statistics) such that 1 t( 1) 1 t( 2 ) 1 t( N ) L( (1) , i ) t L( ( 2 ) , i ) … t L( ( N ) , i ) . t ( 1) i1 ( 2 ) i 1 ( N ) i 1 11 (5.2) Under this notation, design (1) has the smallest sample mean and is the observed best design. We have the following estimator of P{CS} from Chen (1996). ~ Theorem 5.1. Let J denote the random variable whose probability distribution is the posterior ~ distribution of the expected performance for design under a Bayesian model. Assume J i and ~ J j , i j, are independent (i.e., the simulations for different designs are independent). For a minimization problem, P{|GS|k} N j G S k 1 ~ ~ P{J (1) J ( j ) } Approximate Probability of Correct Selection (APCS). # ~ Under the Bayesian model, the posterior distribution p( J ) consists of information from both prior distribution and the simulation results {L(, i), i = 1, 2, ..., t}. Furthermore, with mild normal assumption, if the variance 2 is known (Bernardo and Smith 1994), 1 t ~ 2 J ~ N( L( , i ) , ). t t i1 ~ If the variance is unknown, 2 can be replaced by the sample variance and J becomes t-distributed (Chick 1997). For ease of explanation, we will concentrate on the case that |S| = |G| = k = 1 in the rest of this section. In which case, 1 t( j ) 1 t (1) L ( , ) L ( , ) ( j) i (1) i N N t i 1 t(1) i 1 ~ ~ APCS = P{J (1) J ( j ) } = ( j ) 2 j 2 j 2 ( j ) (21) t t ( j ) ( 1 ) (5.3) where is the standard normal cumulative distribution. The APCS in (5.3) gives an estimate of the probability that the selected observed best design is indeed the true best design. Note that the computation of APCS is simply a product of pairwise comparison probabilities, which is easy to compute. Another critical issue in solving (5.1) is a cost-effective method to estimate the sensitivity information, or more specifically, the estimation of the new P{CS} if additional simulation samples are performed on design s. Definition 5.2. Define EPCS(t1, t2, .., t-1, t+, t+1, .., tN) as an estimated P{CS} if additional simulation samples are performed on design . EPCS is computed using the statistical information after t1, t2, .., tN simulations are completed for designs 1, .., N, respectively. ~ 1 t 2 ~ Let J ~ N( L ( , i ) , ). If the observed sample means are ranked based on t i1 t (5.2), Chen, et al. (1996) give a simple and effective way to estimate the EPCS : 12 ~ ~ ~ If (1), EPCS(t1, t2, .., t-1, t+, t+1, .., tN) P{J (1) J s } N ~ ~ P{J (1) J ( j ) } , and j 2,( j ) N ~ ~ ~ If = (1), EPCS(t1, t2, .., t-1, t+, t+1, .., tN) P{J (1) J ( j ) } . j 2 Design (1) is the observed best design which has the smallest sample mean. Note that the expression of EPCS is similar to that of APCS in Eq. (5.3) and is easy to compute. EPCS provides sensitivity information about how APCS will change if additional simulation samples are performed on design . Since the estimation of the alignment probability and its sensitivity information is feasible, we are able to deal with the budget allocation problem in (5.1). Our approach builds on the idea that the computing budget should be allocated in a way that best improves the overall confidence level. This approach for the problem in (5.1.a) is outlined as follows: Step 0. Perform simulations and obtain n0 samples for all designs; l0; t 1l t l2 t lN = t0. Step 1. If APCS( t 1l , t 2l ,, t lN ) P*, stop. Step 2. Increase the computing budget (i.e., number of additional simulations) by and allocate this budget across the candidate designs by solving the subproblem: EPCS (t1l l1 , t 2l l2 , , t kl lk ) , s.t. ik1 li = and li 0 for all i.} { max l l 1 ,, k Step 3.Perform simulations and obtain additional li samples for design i, i = 1, ..., k. t il 1 = t il + li , for i = 1, ..., k; l l + 1. Go to Step 1. For the maximization problem in (5.1.b), the above algorithm still applies except the stopping criterion in Step 1 must be substituted by "If t T, stop." A last issue remained for solving the budget allocation problem in (5.1) is the development of an efficient and effective procedure for solving the subproblem in Step 2. We have explored two major approaches: a) solving the nonlinear program directly, and b) replacing the objective function with an approximation and then solved it analytically. The two approaches are briefly described as follows: Nonlinear Programming Approach Here we assume the variables are continuous and apply nonlinear programming techniques (Luenberger 1984, Nash and Sofer 1996) to obtain an approximate solution to the subproblem in Step 2. The gradient of APCS with respect to t can be estimated by EPCS (t1 , t 2 , , t , t 1 , , t N ) APCS (t1 , t 2 , , t , t 1 , , t N ) APCS = t Chen et al. (1997) develop an effective OCBA technique by integrating a simple steepest-descent method. Numerical testing show that the OCBA can significantly reduced simulation time as 13 compared with the well-known computing budget allocation techniques such as the two-stage procedures given in Rinott (1978) and Dudewicz and Dalal (1975). Objective Function Approximation Approach The nonlinear programming approach needs an extra computation cost to iteratively search the solution to the subproblem in Step 2. Another approach which we have exploited is to replace the objective function with an approximation and then find an analytical solution to the approximation problem. The advantage is that there is no need to iteratively apply a nonlinear programming technique in Step 2 for an optimal solution. Chen et al .(1998c) approximate EPCS using the Chernoff bounds. An asymptotically solution can be found and is summarized in the following theorem. Theorem 5.3. Given total number of simulation budget T to be allocated to N competing designs, the Approximate Probability of Correct Selection (APCS) can be asymptotically maximized when (a) t(1) t( 2 ) N (1) ( 2) j 1 (21), ( 2 ) 2 (1), ( j ) j (1) 1/ 2 2 (1), t for = 1, ..., N and (1) (2), (b) t ( 2) ( 2 ) (1),( 2 ) where i,j = 1 tj 1 ti L(i , s ) - L( j , s ) , for any i, j . t i s 1 t j s1 # The budget allocation problems for some special cases have already been studied by other researchers. It is worthwhile to compare with our results in Theorem in 5.3 with some wellknown solutions. Remarks 5.4 and 5.5 show that our approach achieves the same results as those earlier approaches under the required special conditions. Note that our approach is also applicable to much more general problems. Remark 5.4. In the case of N = 2, Theorem 5.3 yields 2 t1 1 1, 2 t 2 2 12, 2 1/ 2 1 . 2 # This evaluated result is identical to the well-known optimal allocation solution for N = 2. Remark 5.5. Assume that (a) (1), = , for = 1, 2, , N, and (1), (b) 12 22 N2 . # In this special case, Theorem 5.3 obtains t(1) = tg(N-1)1/2 for = 1, 2, , N, and (1), which is the same as the results in Bechhofer (1969) and Dunnett (1955). 14 6. Applications and Extensions of Ordinal optimization As shown in section 4, the convergence rate for ordinal comparison can be exponential for many stochastic problems. We can have good estimates on the ordinal comparison while the value of an estimate is still very poor. If our goal is to find the good designs rather than to find an accurate estimate of the best performance value, which is true in many practical situations, it is advantageous to use ordinal comparison for selecting a good design. This idea has been applied on a 10-node network (Patsis et al. 1997), where it is shown that a good design can be identified with high probability with relatively short simulations. From a computational perspective, they demonstrate many orders of magnitude of speedup. The idea has been successfully applied to several other problems. For example, a heuristic algorithm selection problem is formatted as a stochastic optimization problem in Chen et al. (1998b) and only a much smaller amount of simulation time is needed to identify a good heuristic algorithm. Chen et al. (1998a) show that ordinal optimization can be applied to the robot motion planning problem for identifying a good enough motion plan with a very short computation time. In addition to the direct use of ordinal optimization, the ordinal comparison idea can also be integrated into some procedures and achieve a much faster convergence rate. For example, the Stochastic Comparison Algorithm integrates ordinal optimization into a recursive selection procedure and obtains a fast convergence rate (Gong et al 1999). In this paper, details of an extension of ordinal optimization to a class of resource application problem and its application to network control problem is given below. As shown in section 4, the convergence rate for ordinal comparison can be exponential for many stochastic problems. We can have good estimate on the ordinal comparison while the value estimate is still very poor. If our goal is to find the good designs rather than to find an accurate estimate of the best performance value, which is true in many practical situations, it is advantageous to use ordinal comparison for selecting the good designs. This idea has been applied on a 10-node network (Patsis et al. 1997), where it is shown that a good design can be identified with high probability with relatively short simulations. From a computational perspective, they demonstrate many orders of magnitude of speedup. The idea has been successfully applied to several other problems. For example, a heuristic algorithm selection problem is formatted as a stochastic optimization problem in Chen et al. (1998b) and only a much smaller amount of simulation time is needed to identify a good heuristic algorithm. Chen et al. (1998a) show that ordinal optimization can be applied to the robot motion planning problem for identifying a good enough motion plan with a very short computation time. Ho and Larson (1995) showed how to apply OO to rare event simulations. In Deng and Ho (1999), ordinal optimization was used to search for the global optimum in a well known open non-convex optimal control problem of 30 year standing. In addition to the direct use of ordinal optimization, the ordinal comparison idea can also be integrated into some procedures and achieve a much faster convergence rate. For example, Stochastic Comparison Algorithm integrates ordinal optimization into a recursive selection procedure and obtain a fast convergence rate (Gong et al 1999). Ordinal optimization can also be combined with massively parallel computer to achieve speed up in search for optimum (Ho 1994). Additional applications and an interactive online demo can be found at the website: hrl.harvard.edu/~ho/DEDS. In this paper, details of an 15 extension of ordinal optimization to a class of resource application problem and its application to network control problem is given below. In the resource allocation problems studied in Cassandras et al. (1998), there are r identical resources to be allocated over N user classes so as to optimize some system performance measure (objective function). Let s = [n1,…,nN] denote an allocation and S = {[n1,…,nN] : iN1 ni = r, ni {1, …, r}} be the finite set of feasible allocations, where ni is the number of resources allocated to user class i. Let Ji(s) be the class i cost associated with the allocation vector s. Assuming that Ji(s) depends only on the number of resources assigned to class i, we can write Ji(s) = Ji(ni). Then, the basic resource allocation problem is N (RA) min sS J n i i 1 i i where, without loss of generality for the purpose of our analysis, we may set i = 1 for all i. In a stochastic setting, the cost function Ji(s) is usually an expectation whose exact value is generally difficult to obtain and one resorts to estimates of Ji(s) which may be obtained ~ through simulation or through direct on-line observation of a system. We denote by Lit s an estimate of Ji(s) based on observing a sample path for a time period of length t. Such stochastic discrete resource allocation problems are hard due to the combinatorial explosion of the feasible allocation search space, as well as the absence of closed-form expressions for the cost function of interest. The stochastic resource allocation algorithm considered here is represented as a Markov ~ process {( ~sk ,Ck )} where the vector ~sk [n~1,k , n~2,k ,..., n~N ,k ] is the allocation after the kth step, ~ Ck is a subset of user indices after the kth step updated as shown below, and the " ~. " notation is ~ used to indicate that all quantities involved are based on estimates Lit s of Ji(s), obtained from a sample path of length t; in particular, the length of the sample path at the kth iteration of the ~ process is denoted by f(k) and the corresponding cost estimates by Lif k . With C0 = {1, …, N}, after proper initialization, we have: ~ n~i , k 1 if i ~ ik * and k ~ ~ n~i , k 1 n~i , k 1 if i jk* and k n~ otherwise i,k where ~i , ~j 0 ~i , ~j 0 * k * k * k * k (6.1) ~ f k ~ ~* ni,k , ik arg max ~ Li (6.2) (6.3) iCk ~ f k ~ ~* ni,k , jk arg min ~ Li iCk ~ ~ ~ ~ ~ k ik * , jk* L~if k n~~i ,k L~jf k n~~j ,k 1 * k * k * k * k 16 (6.4) ~ C~k ~j k* if ~k ~ ik * , j k* 0 ~ ~ (6.5) C k 1 C 0 if C k 1 ~ otherwise C k ~ ~ ~ where Lif k ni Lif k ni Lit ni 1 . It is clear that equations (6.1) - (6.5) define a Markov ~ ~ ~ process ~sk ,Ck , whose state transition probability matrix is determined by ik * , j k* , and ~ ~ ~ ~ ~ k ( ik * , j k* ). Briefly, ik * , and j k* estimate the users ``most sensitive'' and "least sensitive" to the ~ removal of a resource among those users in the set Ck . Then, (6.1) forces the exchange of a resource from the most to the least sensitive user at the kth step of this process, provided the ~ ~ ~ quantity k ( ik * , j k* ) is strictly positive; otherwise, the allocation is unaffected, but the user with ~ ~ index jk * is removed from the set Ck through (6.5). When this set is reduced to a single user, it ~ ~ ~ is reset to C 0 and the procedure repeats. The quantity k ( ik * , j k* ) is the estimated "potential cost reduction" when an actual change in allocation takes place. We will make the following three assumptions: (A6.1) For all i = 1, …, N, Ji(ni) is such that Ji(ni+1) > Ji(ni), where Ji(ni) = Ji(ni) - Ji(ni-1), ni = 1, .., r, with Ji(0) - and Ji(N+1) . ~ (A6.2) For every i, and every ni, the estimate Lit ni is ergodic as the sample path length ~ increases in the sense that lim Lit ni . = Ji(ni), a.s. t ~ ~ ~ (A6.3) Let k i, j J i n~i ,k L j n~ j ,k 1 . For every k ( ik * , j k* ) = 0 , there is a constant p0 ~ ~ ~ ~ ~ ~ such that P{ k ik * , jk* 0 k ik * , jk* 0, ~sk , Ck s, C } p0 0 , for any k and any pair ~ ~s , C k k . Note that (A6.3) is a technical condition guaranteeing that an estimate does not always give one-sided-biased incorrect information. Also note that the results below do not require the technical condition that the optimal allocation be unique. If several allocations exhibit optimal performance, the proposed scheme will converge to a set of allocations and will oscillate among the members of the set. Under (A6.1)-(A6.3), it was proven that ~sk converges in probability to the optimal solution as long as s* as long as f(k) , as k . Cassandras et al. (1998) give detailed discussion and interpretation of the above process, and the proof of the following two theorems. Theorem 6.1. Suppose that Assumptions (A6.1)-(A6.3) hold and that the optimum s* is unique. Then the process described by equations (6.1) - (6.5) converges in probability to the optimal allocation s*. # 17 Theorem 6.2. Suppose (A6.1)-(A6.3) and the assumptions of Lower bound of the convergence rate of comparison from other parets hold. If f(k) k1+c for some constant c > 0, then the process described by equations (6.1) - (6.5) converges almost surely to the global optimal allocation. # Theorems 6.1 and 6.2 establish the convergence of the algorithm. For regenerative systems, stronger results regarding the rate of convergence have been obtained as stated in the following theorem. Theorem 6.3 (Dai et al. 1998). Suppose that Assumptions (A6.1)-(A6.3) hold and for every pair ~ ~ i, j, i j, Lit ni Ltj n j is a regenerative process with regenerative cycle times satisfying (A4.1)-(A4.4). Then, there exists a c > 0 large enough such that if f(k) cln(k) for sufficiently large k, the process described by equations (6.1) - (6.5) converges almost surely to the global optimal allocation. Furthermore, P{ ~s = s*} = 1 - O(ln(k) e-f(k)) k for some > 0. # Note that ordinal comparison is used in (6.1). Since ordinal comparison allows one to make a correct decision based on short simulations, the algorithm can quickly get to the neighborhood of the optimal solution. The fast convergence has been confirmed experimentally as shown in Cassandras et al. (1998). Compared with conventional stochastic approximation methods, the algorithm also eliminating the difficult task of selecting appropriate "step-size". A numerical example n1 p1 1 n2 p2 2 nN pN N 1 2 N Figure 6.1. Queueing system with N parallel servers. As an application, we consider a buffer allocation problem for a queueing system as shown in Figure 6.1, where each server represents a user and each buffer slot represents a resource that is to be allocated to a user. Jobs arrive at this system at a rate λ and are routed to one of the N users with some probability pi, i = 1, …, N. Each user is processing jobs at a rate μi, i = 1, …, N, and if a job is routed to a user with a full queue, the job is lost. Let Ji(ni) be the 18 individual job loss probability of the ith server (ni is the number of buffers allocated to the ith server). Our goal is to allocate all K available buffer slots to the users in order to minimize the objective function N J i ni i 1 N s.t. n i K i 1 It can be easily verified that assumptions A1-A3 are satisfied for this example (Cassandras et al. 1998). For this problem, one can directly apply the algorithm corresponding to the process (6.1) -(6.5). Standard concurrent estimation methods (e.g., see Cassandras and ~ Panayiotou 1996) may be used to obtain all Lif k n~i ,k required by the algorithm from observation of a sample path under a given allocation n~ , n~ ,..., n~ , . For simplicity, in our 1,k 2 ,k N ,k simulation of this system we have assumed that the arrival process is Poisson with rate λ = 0.5, and all service times are exponential with rates i = 1.0, i = 1, …, 6. Furthermore, all routing probabilities were chosen to be equal, i.e., pi = 1/6, i = 1, …, 6. Figure 6.2 shows typical traces of the evolution of the algorithm for this system when N = 6 users and K = 24 buffer slots (resources). In this problem, it is obvious that the optimal allocation is [4,4,4,4,4,4] due to symmetry. This was chosen only to facilitate the determination of the optimal allocation; the speed of convergence, however, is not dependent on the assumptions made in this example. The two traces of Figure 6.2 (on page 22) correspond to two different initial allocations. In this case, the simulation length is increased linearly in steps of 3,000 events per iteration and one can easily see that even if we start at one of the worst possible allocations (e.g. [19, 1, 1, 1, 1, 1,]), the cost quickly converges to the neighborhood of the optimal point. 7. Conclusions The development of OO is still in its infancy. By redefining what is desired in a simulation (e.g., order rather than value estimates, good enough rather than optimal choices) we present a general method for easing the computational burden associated with complex stochastic optimization problems via simulation. The methodology is easy to understand and as a rule does not require any additional computation for its use. Similar to the role of order statistics in sampling, it is totally complementary to any other tools one may be using. References 1. Bechhofer, R. E. "Optimal Allocation Of Observations When Comparing Several Treatments With A Control," Multivariate Analysis II, P. R. krishnaiah, ed., Academic Press, Inc. 463473, 1969. 2. Bechhofer, R. E., T. J. Santner, and D. M. Goldsman. Design and Analysis of Experiments for Statistical Selection, Screening, and Multiple Comparisons, John Wiley & Sons, Inc., 1995. 3. Bernardo, J. M., and A. F. M. Smith. Bayesian Theory. Wiley, 1994. 19 4. Cassandras, C.G., L. Dai, and C. G. Panayiotou, "Ordinal Optimization for Deterministic and Stochastic Discrete Resource Allocation," IEEE Trans. on Automatic Control, Vol. 43, No. 7, 881-900, 1998. 5. Cassandras, C. G., and C. G. Panayiotou, "Concurrent Sample Path Analysis of Discrete Event Systems", Proc. of 35th IEEE Conf. on Decision and Control, 3332-3337, 1996. 6. Chen, C. H. "A Lower Bound for the Correct Subset-Selection Probability and Its Application to Discrete Event System Simulations," IEEE Transactions on Automatic Control, Vol. 41, No. 8, 1227-1231, August, 1996. 7. Chen, H. C., C. H. Chen, L. Dai, and E. Yucesan, "New Development of Optimal Computing Budget Allocation For Discrete Event Simulation," Proceedings of the 1997 Winter Simulation Conference, 334-341, December, 1997. 8. Chen, C. H., H. C. Chen, and L. Dai, "A Gradient Approach of Smartly Allocating Computing Budget for Discrete Event Simulation," Proceedings of the 1996 Winter Simulation Conference, 398-405, December, 1996. 9. Chen, C. H., V. Kumar, and Y. C. Luo, "Motion Planning of Walking Robots Using Ordinal Optimization," IEEE Robotics and Automation Magazine, 22-32, June, 1998a. 10. Chen, C. H., S. D. Wu, and L. Dai, "Ordinal Comparison of Heuristic Algorithms Using Stochastic Optimization," to appear in IEEE Trans. on Robotics and Automation, 1998b. 11. Chen, H. C., C. H., Chen, and E. Yucesan, "Optimal Computing Budget Allocation in Selecting The Best System Design via Discrete Event Simulation," submitted to IEEE Transactions on Control Systems Technology, 1998c. 12. Chick, S. E. "Bayesian Analysis for Simulation Input and Output," Proceedings of the 1997 Winter Simulation Conference, 253-260, 1997. 13. Dai, L., "Convergence Properties of Ordinal Comparison in the Simulation of Discrete Event Dynamic Systems," J. of Opt. Theory and Applications, Vol. 91, No.2, 363-388, 1996. 14. Dai, L., C. G. Cassandras, and C. Panayiotou, "On The Convergence Rate of Ordinal Optimization for A Class of Stochastic Discrete Resource Allocation Problems,'' to appear in IEEE Transactions on Automatic Control, 1998. 15. Dai, L. and C. H. Chen, "Rate of Convergence for Ordinal Comparison of Dependent Simulations in Discrete Event Dynamic Systems," Journal of Optimization Theory and Applications, Vol. 94, No. 1, 29-54, July, 1997. 16. Dudewicz, E. J. and S. R. Dalal, Allocation of Observations in Ranking and Selection with Unequal Variances. Sankhya, B37:28-78, 1975. 17. Dunnett, C. W. "A Multiple Comparison Procedure For Comparing Several Treatments With A Control," Journal of American Statistics Association. 50, 1096-1121, 1955. 18. Goldsman, G., and B. L. Nelson. "Ranking, Selection, and Multiple Comparison in Computer Simulation," Proceedings of the 1994 Winter Simulation Conference, 192-199, 1994. 19. Goldsman, G., B. L. Nelson, and B. Schmeiser. "Methods for Selecting the Best System," Proceedings of the 1991 Winter Simulation Conference, 177-186, 1991. 20 20. Gong, W. B., Y. C. Ho and W. Zhai, "Stochastic Comparison Algorithm for Discrete Optimization with Estimation," to appear in SIAM Journal on Optimization, April 1999. 21. Gupta, S. S. and S. Panchapakesan, Multiple Decision Procedures: Theory and Methodolgy of Selecting and Ranking Populations, John Wiley, New York 1979. 22. Ho, Y. C., R. S. Sreenivas, and P. Vakili, "Ordinal Optimization of DEDS," Journal of Discrete Event Dynamic Systems, Vol. 2, No. 2, 61-88, 1992. 23. Kalashnikov, V. V. Topics on Regenerative Processes, CRC Press, 1994. 24. Lau, T.W.E. and Y. C. Ho, "Universal Alignment Probabilities and Subset Selection for Ordinal Optimization" JOTA , Vol. 39, No. 3, 455-490, June 1997. 25. Lee, L. H, T.W.E. Lau, Y. C. Ho, "Explanation of Goal Softening in Ordinal Optimization" IEEE Transactions on Automatic Control, July 1998. 26. Luenberger, D. G., Linear and Nonlinear Programming. Addison-Wesley, 1984. 27. Nash, S. G., and A. Sofer, Linear and Nonlinear Programming, McGraw-Hill, Inc., 1996. 28. Patsis, N. T., C. H. Chen, and M. E. Larson, "SIMD Parallel Discrete Event Dynamic System Simulation," IEEE Trans. on Control Systems Technology, Vol. 5, No. 3, 30-41, Jan. 1997. 29. Rinott, Y., On Two-stage Selection Procedures and Related Probability Inequalities. Communications in Statistics A7, 799-811, 1978. 30. Xie, X. L., "Dynamics and Convergence Rate of Ordinal Comparison of Stochastic Discrete Event Systems," IEEE Transactions on Automatic Control, 1998. 21 Figure 6.2. Typical trace for different initial allocations. 22