On bayesian design in finite source queues∗ M. Eugenia Castellanos1 , Javier Morales2 , Asunción M. Mayoral2 , Roland Fried3 , and Carmen Armero4 1 2 3 4 Universidad Rey Juan Carlos, Madrid, Spain maria.castellanos@urjc.es Universidad Miguel Hernández, Elche, Spain j.morales@umh.es, asun.mayoral@umh.es Universidad Carlos III, Madrid, Spain rfried@est-econ.uc3m.es Universitat de València, Spain Carmen.Armero@uv.es Summary. We develop a Bayesian analysis of queueing systems in applications of the machine interference problem, like job-shop type systems, telecommunication traffic, semiconductor manufacturing or transport. Bayesian guarantees of system performance can be given using the predictive or the posterior distribution. While these distributions can be obtained by standard Monte Carlo integration in case of an M/M/c//r queueing system with exponential operational and service times, more refined Monte Carlo sampling (MCMC) strategies are needed in general. Key words: machine interference problem, queuing design, Bayesian performance criteria, MCMC methods 1 Introduction Queueing models allow to analyze problems which involve congestion. Once the characteristics of the system have been identified, we can derive conclusions about its temporal evolution and steady-state behavior, sometimes analytically, mostly by simulation [K75], [K76]; [GH98]; [LK00]; [M03]. Statistics becomes important in the presence of uncertainty about the stochastic model. Bayesian inference is well suited for analyzing queues because we can evaluate posterior and predictive distributions for system performance measures taking different sources of uncertainties into account. Moreover, the Bayesian approach provides an optimal decision framework for design purposes since incorporation of losses and costs is straightforward [AB99]. In an M/M/c//r queueing system, we analyze a set of r machines, which are subject to failures and maintained by c repair crews. The Bayesian framework allows us to derive performance criteria concerning the availability of the machines, without needing to specify values for unknown model parameters. The queuing system can thus be designed choosing a suitable number of repair crews such that a certain acceptable system performance is achieved, with guarantees optionally in terms of ∗ Research under grants MTM2004-03290; GV05/018; TSI2004-06801-C04-01; MTM2004-02934. 1382 M. Eugenia Castellanos et al. the predictive or the posterior distribution. We discuss such criteria and illustrate them on simulated data. Section 2 presents the basic model and criteria measuring system performance. Section 3 develops Bayesian design strategies guaranteeing reliable performance and exemplifies our proposals. Section 4 presents some extensions of the basic model needing more refined computational techniques. 2 Problem formulation 2.1 M/M/c//r queueing system We start with a very simplistic model for illustration. Assume that a company uses r identical machines for production. The machines break down at random time points and must be repaired (maintained). The number c of repair crews needs to be chosen sufficiently large to guarantee a satisfactory availability of the machines. A high availability can be achieved by a huge choice of c, but for the prize of large costs. We assume a unique type of maintenance service, discarding different kinds of damages, inventory limits and transportation times. Section 4 presents some extensions of this basic setting. In the language of queueing systems, the machines act as users and the crews as servers. When an user arrives at the system and finds an idle server, (s)he is attended immediately. Otherwise the user needs to wait in a queue until a server becomes available. Simple queueing systems are defined by the arrival pattern, the service mechanism including the number of servers, the distribution of the service times, the capacity of the waiting room, the size of the customer population and the discipline, that is the order in which the users in the queue are selected for service. The problem as stated before can be modelled as a queueing system with the following characteristics: • Arrivals to the system: Machines (users) get periodically into the system and are declared “non-operative” when stopping for being served. • Service mechanism: If any of the c servers is idle, an incoming user is attended immediately; otherwise, (s)he needs to wait in the queue until a server gets idle. • Finite population: The number r of users is finite. • Discipline: The order of the queue is FIFO, meaning the user “first in the system” (stopped to be maintained) is “first out” (starts maintenance). • Steady state: We assume the system to work under stationarity. If we assume that the operational times To and the maintenance times Tm vary independently according to exponential distributions with means 1/λ and 1/µ respectively, the problem becomes an M/M/c//r queueing system. When assuming To or Tm to follow another distribution not being an exponential, e.g. a gamma, the first (second) of the M’s is replaced by a G. A steady-state solution of the queueing system always exists as the population of users is finite and the size of the queue cannot increase to infinity. The steady state solution of the number of non-operating users in the system, Nno , in case of the M/M/c//r queuing system depends on the parameters λ and µ and the number of crews c [GH98]: On bayesian design in finite source queues 8 n > < (r −r!n)!n! µλ p0 , n = 0, 1, . . . , c − 1 n P (Nno = n|λ, µ) = λ r! > p0 , n = c, c + 1, . . . , r. : n−c µ (r − n)!c P (Nno = 0|λ, µ) = n=0 (1) c! where p0 denotes P (Nno = 0|λ, µ), evaluated from: " c−1 X 1383 r! (r − n)!n! n λ µ + r X n=c r! (r − n)!cn−c c! n #−1 λ µ The distribution of the time Tq a machine needs to wait for maintenance is also known in this model, namely [GH98]: 8 c−1 X (r − n) P (Nno = n | λ, µ) > > > , if t = 0 > > > r − E(Nno | λ, µ) < n=0 r−1 P (Tq ≤ t | λ, µ) = X (r − n) P (Nno = n | λ, µ) > Fga (t | n − c + 1, cµ) > > > r − E(Nno | λ, µ) > n=c > : + P (T = 0 | λ, µ) if t > 0 (2) q Rt and 0 if t < 0, where Fga (t | a, b) = 0 Ga(s | a, b)ds, t > 0 stands for the distribution function of a Gamma with expectation a/b and variance a/b2 . 2.2 Reliable maintenance The number Nno of non-operating users in the system and the time Tq that a machine has to wait for maintenance are classical congestion measures in queueing systems. A popular summary measure of system performance in machine repair models is the operational availability: Ao = E(To ) , E(To ) + E(Tno ) where Tno = Tm + Tq is the non-operational time and To is again the operational time between consecutive maintenances [KGE98], [RKK00]. It can be seen as a first order approximation of the expected availability E(A), where A= To . To + Tno A queuing system can be designed by determining the minimal c meeting some demand on the expected availability, or by finding the value of c from which on the increase of the availability levels out. However, these criteria cannot be applied directly since the distributions of To and Tno depend on the unknown failure and service rates λ and µ. Additional statistical reasoning is needed therefore, analyzing some data to get information on λ and µ and include it into the design. We propose Bayesian modelling for finding feasible values of c, guaranteeing satisfactory availability with high security, and avoiding a compromise in form of assuming values for λ and µ. A related design problem in terms of the operational capacity, i.e. the number of machines working at any time point, has been addressed in [MCMFA05]. For this measure it is also reasonable to consider the probability of fulfillment in addition to the average fulfillment considered here. 1384 M. Eugenia Castellanos et al. 3 Bayesian maintenance design In a Bayesian formulation of a queueing model with parameter uncertainty, the background knowledge is subsumed into a prior distribution for the unknown λ and µ. Updating this prior by combination with the likelihood obtained from the measured data results in the posterior distribution. A predictive analysis integrates all information available on λ and µ contained in the posterior. 3.1 Bayesian model specification A Bayesian model for a queueing system is given by two levels: 1. Data level. The data consists of no life (or failure) times {to1 , . . . , tono } and nm maintenance (or service) times {tm1 , . . . , tmnm }, regarded as independent realizations of random variables To and Tm with distributions Fo and Fm , respectively. In an M/M/c//r queuing system, Fo and Fm are exponential distributions characterized by their means 1/λ and 1/µ. 2. Prior level. In the absence of prior information, a non-informative Jeffrey’s prior is assumed with independency between λ and µ, p(λ, µ) = p(λ) · p(µ) ∝ (1/λ) · (1/µ). Otherwise, the natural choice for the prior is a product of Gamma distributions Ga(·|ko , so ) and Ga(·|km , sm ), with parameters (ko , so ) and (km , sm ) expressing the information available on λ and µ. Jeffrey’s prior corresponds to degenerate Gamma distributions. The posterior distribution of λ and µ, obtained from combining the prior with the likelihood, in case of an M/M/c//r queuing system is given by: p(λ, µ|data) = Ga(λ|no , tno ) · Ga(µ|nm , tnm ), Pno Pnm (3) with tno = i=1 toi and tnm = j=1 tmj . In the general case of a G/G/c//r queuing system, there is not necessarily an analytic expression of the posterior distribution, so that MCMC techniques are needed for approximation. All system performance measures like the availability A inherit a posterior distribution from (4). This means that we do not just have an estimate of, say, P (A ≥ a|λ, µ) for each a ∈ [0, 1], but a full posterior probability distribution for it, p(P (A ≥ a | λ, µ) | data), specifying the full information and uncertainty in the model. Estimations, as well as the confidence in them, can be expressed via percentiles, standard deviations, or probabilities of ranges. The predictive steady state distribution of A is obtained from the expectations of the parameter-dependent steady state solutions P (A ≥ a|λ, µ) with respect to (w.r.t.) the posterior (4), Z P (A ≥ a|data) = P (A ≥ a|λ, µ) p(λ, µ|data) d(λ, µ) , for a ∈ [0, 1]. (4) Although there will not always exist an analytic formula of the posterior distribution, Bayesian inference can be performed using stochastic simulation. We generate M values {λi , µi }M i=1 from the posterior (4), and then generate values from To |λi and Tno |λi , µi . In view of the definition of Tno , we need to simulate from Tm and from Tq using (2). With the simulated values {(to,i , tno,i )}M i=1 we get draws from the predicted availability, ai = to,i /(to,i + tno,i ), and can approximate the predicted average availability. On bayesian design in finite source queues 1385 3.2 Bayesian performance criteria for reliable maintenance In a similar line as [MCMFA05], we can reformulate the demand on the availability A from Section 2.2 in terms of the mean of the predictive distribution P (A ≥ a|data), for a given level of availability a ∈ [0, 1], or the posterior distribution of the mean E(A|λ, µ), p(E(A|λ, µ)|data), as follows: 1. Achieving a predicted mean availability larger than a: E(A|data) ≥ a. (5) The predicted mean availability can easily be approximated averaging values from the distribution of A given the data, using the above algorithm. 2. Achieving a sufficiently large posterior probability β ∈ [0, 1] that the mean availability P post [E(A|λ, µ) ≥ a] ≥ β. (6) This posterior probability can be approximated by Monte Carlo as follows: P [E(A|λ, µ) ≥ a|data] ≈ #{i, E(A|λi , µi ) ≥ a} , M where in this expression we additionally need an estimate of the expectation in the numerator, which can be obtained P from a sample {aki }K k=1 for each λi , µi , k approximating the expectation by ai = K k=1 ai /K. To determine suitable choices of c we select one of these goals and evaluate the corresponding measure for different values of c. In the predictive goal (5), we average over all possible λ and µ weighted by their posterior distribution, while in the posterior goal (6) we require that the values of λ and µ for which we achieve reliable performance have high probability. Predictive criteria are easily comprehensive as they are based on expected system performance; they are perfectly suitable e.g. if we design many queuing systems and average performance guarantees are sufficient. Posterior criteria imply that the selected goal is achieved in most of the systems designed. 3.3 An example [DVV04] consider a radar consisting of r = 64 identical pieces, which are subject to break-down and need to be maintained for the radar to work. Following their assumptions we use λ = 0.00008 (failures/h) and µ = 0.006 (repairs/h), but these values are usually not known exactly. We generate 50 exponential draws for both To and Tm using these values of λ and µ. The goal is a mean availability of 90%. Figure 1 depicts the predicted mean availability, E(A|data), and the posterior probability of the mean availability being larger than 0.8 or 0.9, P (E(A) ≥ 0.8|data) and P (E(A) ≥ 0.9|data), for different c = 1, 2, . . . , 10. Monte Carlo integration was performed generating M = 10.000 values {λi , µi }M i=1 from the posterior p(λ, µ|data) = Ga(λ|50, 598252.6)· Ga(µ|50, 16062.42). We see that the predictive goal is satisfied for c ≥ 3, and for c ≥ 4 there is almost no further increase of the predicted mean availability. The posterior goal is more demanding than the predictive goal if we insist on a high posterior probability. 1386 M. Eugenia Castellanos et al. Predicted mean availability Posterior mean availability 1 1 0.9 0.9 0.8 0.8 0.7 0.7 0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 Larger than 0.8 Larger than 0.9 0.1 0.1 0 0 1 2 3 4 5 6 7 repair crews 8 9 10 1 2 3 4 5 6 7 repair crews Fig. 1. Predicted mean availability (left) and posterior probability of the mean availability being greater than 0.8 or 0.9 (right), for several values of c. We need c ≥ 4 to achieve the posterior goal of an expectation greater than 0.9 with probability also at least 0.9. When requiring only an expectation greater than 0.8, c ≥ 3 crews satisfy this goal with probability more than 0.9. A reasonable choice would thus be c = 4 or c = 3. We note that the situation in [DVV04] is more complicated. The radar still works if a few pieces are out of order, and we omit spares. The next section reports further extensions of the basic model illustrated here. 4 Extensions All but the simplest modifications of the M/M/c//r queueing system lead to a severe increase of statistical and computational demands. We briefly outline three extensions along with possible solutions: the first one maintains the model structure and the extension only consists in the distribution of the life and the service time. The second one maintains the same basic queueing model, but the complexity increases by considering many queuing systems working independently. The third extension deals with a queueing network with many connected queues. There are many more 8 9 10 On bayesian design in finite source queues 1387 interesting extensions of queuing systems: the distribution of the life times can vary due to aging, there can be patrolling services, spare machines, ancillary operators, etc. 4.1 G/G/c//r queuing systems A natural question is to examine the sensitivity of the results w.r.t. the assumption of exponential life and service times by studying other, more flexible distributions. Introducing an additional source of uncertainty, we can consider the Erlang distribution, Er(a, b), a popular and flexible distribution in queues which is a particular Gamma distribution (the first parameter a is an integer). If this model is assumed for the operational and maintenance times, To ∼ Er(ao , bo ) and Tm ∼ Er(am , bm ), not only Formulas (1) and (2) describing the steady-state solution and the distribution of the time in queue are no longer valid, but also we need a prior distribution for the parameters (ao , bo , am , bm ). The resulting posterior distributions can usually only be handled by MCMC techniques. Consequently, computing the predictive and posterior criteria, E(A | data) and P [E(A | λ, µ) ≥ a | data] with more general models needs more sophisticated techniques and computational efforts. 4.2 Independent but quasi-identical queueing systems This situation is made up by q independent and quasi-identical M/M/ci //ri queueing systems, i = 1, . . . , q. As a typical example of this situation, we consider a multinational company with several factories at different sites. The factories work independently and the machines as well as the repair crews have the same characteristics at all sites. The company wants to set up a new factory and uses information about the reliability of the machines and the quality of the repair teams in the existing factories to design the new one. Consequently, we assume that the failure rates λ1 , . . . , λq and the maintenance rates µ1 , . . . , µq at the different sites are random samples from common populations with distributions indexed by some hyperparameters. Bayesian hierarchical models are particularly well suited for dealing with this multilevel scenario and MCMC methods are required to deduce the system performance. 4.3 Queueing networks We consider a system of machines that may break down because of different failures, requiring different types of repairs. These different repairs share common steps and follow a fixed protocol that, depending on the type of failure, sends a broken down machine to a sequence of specialized services. Consequently, the machine waits in queue for a first service, and after having been attended it waits in another queue for the next service, and so on. The complexity of such models is due to the fact that there is an arrival and a service parameter for each queue (node), and the probabilistic results for the total system are expressed as multivariate distributions. Sophisticated MCMC techniques and statistical tools are needed for complex queueing networks. 1388 M. Eugenia Castellanos et al. References [AB99] Armero, C., Bayarri, M.J.: Dealing with uncertainties in queues and networks of queues: A Bayesian approach. In: Ghosh, S. (ed.) Multivariate, Design and Sampling. Marcel Dekker, New York, pp. 57-608 (1999) [DVV04] De Smidt-Destombes, K.S., van der Heijden, M.C., van Harten, A.: On the availability of a k-out-of-N system given limited spares and repair capacity under a condition based maintenance strategy. Reliability Engineering and System Safety, 83, pp. 287-300 (2004) [GH98] Gross, D., Harris, C.M.: Fundamentals of Queueing Theory. Third Edition. Wiley, New York (1998) [KGE98] Kang, K., Gue, K.R., Eaton, D.R.: Cycle time reduction for naval aviation depots. In: Medeiros, D.J., Watson, E.F., Carson, J.S., Manivannan, M.S. (eds) Proceedings of the 1998 Winter Simulation Conference, pp. 907-912 (1998). [K75] Kleinrock, L.: Queueing Systems. Volume I: Theory. Wiley, New York (1975) [K76] Kleinrock, L.: Queueing Systems. Volume II: Computer Applications. Wiley, New York (1976) [LK00] Law, A.M., Kelton, W.D.: Simulation Modelling and Analysis. Third Edition. McGraw-Hill Education (2000) [M03] Medhi, J.: Stochastic Models in Queueing Theory. Second Edition. Academic Press (2003) [MCMFA05] Morales, J., Castellanos, M.E., Mayoral, M.A., Fried, R., Armero, C.: Bayesian design in queues: An application to aeronautic maintenance. Working paper, Universidad Miguel Hernández, Elche, Spain (2005) [RKK00] Rodrigues, M.B., Karpowicz, M., Kang, K.: A readiness analysis for the argentine air force and the brazilian navy A-4 fleet via consolidated logistics support. In: Joines, J.A., Barton, R.R., Kang, K., Fishwick, P.A. (eds) Proceedings of the 1998 Winter Simulation Conference, pp. 10681074 (2000).