An Investigation into the Application of Different Performance Prediction Methods to Distributed Enterprise Applications

DAVID A. BACIGALUPO† (daveb@dcs.warwick.ac.uk)
STEPHEN A. JARVIS† (saj@dcs.warwick.ac.uk)
LIGANG HE† (liganghe@dcs.warwick.ac.uk)
DANIEL P. SPOONER† (dps@dcs.warwick.ac.uk)
DONNA N. DILLENBERGER* (engd@us.ibm.com)
GRAHAM R. NUDD† (grn@dcs.warwick.ac.uk)

† High Performance Systems Group, University of Warwick, Coventry CV4 7AL, UK
* IBM T.J. Watson Research Centre, Yorktown Heights, New York 10598, USA

Abstract. Response time predictions for workload on new server architectures can enhance Service Level Agreement-based resource management. This paper evaluates three performance prediction methods using a distributed enterprise application benchmark. The historical method makes predictions by extrapolating from previously gathered performance data, while the layered queuing method makes predictions by solving layered queuing networks. The hybrid method combines these two approaches, using a layered queuing model to generate the data for a historical model. The methods are evaluated in terms of: the systems that can be modelled; the metrics that can be predicted; the ease with which the models can be created and the level of expertise required; the overheads of recalibrating a model; and the delay when evaluating a prediction. The paper also investigates how a prediction-enhanced resource management algorithm can be tuned so as to compensate for predictive inaccuracy and balance the costs of SLA violations and server usage.

Keywords: Performance Prediction, Distributed Enterprise Application, Layered Queuing Modelling, Historical Performance Data, Resource Management, Service Level Agreement

1. Introduction

It has been shown that response time predictions can enhance the workload and resource management of distributed enterprise applications [1,11]. Two common approaches used in the literature for making these response time predictions are extrapolating from historical performance data and solving queuing network models. Examples of the first approach include the use of both coarse [8] and fine [1] grained historical performance data. The former involves recording workload information and operating system/database load metrics, and the latter involves recording the historical usage of each machine's CPU, memory and IO resources by different classes of workload. Another example of this approach is being developed in the High Performance Systems Group at the University of Warwick [5]. This historical method has been implemented as a tool called HYDRA which has been applied to both distributed enterprise [5] and business-to-business [16] applications. It is differentiated from other historical modelling work by its focus on simplifying the process of analysing any historical data so as to extract the small number of trends that will be most useful to a resource manager. Other historical modelling work focuses on predicting future workload resource demands [14] and is complementary to this research, in which the emphasis is on being able to make predictions for new server architectures. Examples of the queuing modelling approach include [6,11,13] and the layered queuing method, as implemented in the layered queuing network solver (LQNS) [17]. The layered queuing method is of particular interest and will be examined further in this paper as: it explicitly models the tiers of servers found in this class of application, and it has been applied to a range of distributed systems (e.g.
[15]), including the distributed enterprise application used in this paper [10]. A third approach is to overcome some of the limitations of the historical and queuing approaches by combining them, albeit at the cost of a more complex model. This can be done by using historical models to calibrate queuing model processing times, which can be expensive to measure directly in distributed enterprise systems. For example, in [12] queuing network processing times are inferred from coarse-grained historical performance data. This paper complements this work by examining a combined approach in which a layered queuing model is used to generate historical performance data and so calibrate a historical model. This 'hybrid' method has the advantage of rapid historical predictions without having to collect real historical performance data. It is important to compare the effectiveness of different approaches for modelling distributed enterprise applications so practitioners can make an informed choice when designing prediction-enhanced workload and resource management algorithms. However, although there have been comparisons of different performance prediction approaches using distributed enterprise applications, there have been few quantitative comparisons of the three approaches on a single distributed enterprise application. For example, in [15] a layered queuing model of a distributed database system is created and compared to a Markov chain-based queuing model of the system. In [17] the layered queuing method is compared more generally to other performance modelling methods. Another recognised queuing method which has been applied to similar applications is described in [13] and compared with the layered queuing method. However, none of these papers includes a comparison with a historical model of the same application. The historical prediction method described in [8] is applied to a web-based stock trading application (Microsoft FMStocks) and compared to a queuing modelling approach. However, a queuing network model of the application is not created. This work investigates how the performance of a distributed enterprise application benchmark can be predicted using the HYDRA historical method, the layered queuing method and the hybrid method. The methods are then evaluated in terms of: the systems that can be modelled; the metrics that can be predicted; the ease with which the models can be created and the level of expertise required; the overheads of recalibrating a model; and the delay when evaluating a prediction. The IBM Websphere commercial e-Business middleware [7] is selected as the platform on which the benchmark will be run, as it is a common choice for distributed enterprise applications. The IBM Performance Benchmark Sample 'Trade' [9] is selected as it is the main distributed enterprise application benchmark for the Websphere platform. To the best of our knowledge, this is the only quantitative comparison of these three classes of prediction method on this benchmark. The comparison described in this paper involves: defining a system model and case study representative of current distributed enterprise applications (see sections 2 and 3); creating the three performance models of the case study and investigating the predictive accuracy that can be obtained (see sections 4-6); considering how the models can/cannot be extended (see section 7); and evaluating the strengths and weaknesses of the three methods (see section 8).
The paper also includes an analysis of the tuning of a prediction-enhanced resource management algorithm (see section 9).

2. System Model

Based on the Oceano resource manager [4], the system model (see figure 1) consists of a service provider which hosts a number of applications and also contains a resource manager that controls the transfer of application servers between those applications. An application server can only process the workload from one application at a time, in order to isolate the applications (which may be hosted for competing organisations). Based on other established work (e.g. [10,18,2]), each application is modelled as a tier of application servers accessing a single database server. Application servers may have heterogeneous server architectures. Based on the queuing network in the Websphere e-Business platform: a single first in first out (FIFO) waiting queue is used by each application server; the database server has one FIFO queue per application server; and both servers can process multiple requests concurrently via time-sharing.

Figure 1. The proposed system model

The workload manager tier of the application model involves the workload being divided into 'service classes', each of which is associated with a response time requirement (e.g. in an SLA). The role of the workload manager is to route the incoming requests to the available servers whilst meeting these goals. In such a system it is important to be able to make response time predictions on alternative application server architectures so as to allow servers to be allocated to applications, workload to be allocated to servers and upgrades to be planned in an informed fashion. To allow these predictions to be made it is useful for the system to provide two supporting services. The first involves allowing performance models to be recalibrated on established servers in order to capture modelling variables that change infrequently (such as monitoring and logging policies), or variables that are hard to measure (such as the complexity of processing data to serve a service class). An example of the latter from the Trade benchmark is the average size of the clients' 'portfolio' of stock. The second service involves allowing application-specific benchmarks to be run on new server architectures so as to calibrate their request processing speeds.

3. Case Study

This section describes the workload and server configuration which, along with the Trade benchmark, will provide an example of the system model that is representative of commercial distributed enterprise applications. The case study will then be modelled using each of the three performance prediction methods in sections 4-6.

3.1. Workload

The workload in a service class is divided into clients, each of which sends requests to the application. Each request calls one of the operations on the application-tier interface (e.g. buy, sell, quote). A service class is created for 'browse' users, with the next operation called by a client being randomly selected, with probabilities defined as part of the Trade benchmark as being representative of real clients. A service class is created for 'buy' users using the 'register new user and login', 'buy' and 'logoff' operations. On average buy clients make 10 sequential buy requests before sending a 'logoff' request. This creates a buy service class with a mean portfolio size of 5.5. For simplicity, the typical workload is defined as all browse clients. 'No. of clients and the mean client think-time' is used as the primary measure of the workload from a service class. The total number of clients across all service classes and the percentage of the different service classes are used to represent the system load. Using number of clients (as opposed to a static arrival rate definition) to represent the amount of workload is common when modelling distributed enterprise applications (e.g. [6,10]). This is because it explicitly models the fact that the time a request from a client arrives is not independent of the response times of previous requests, so as the load increases the rate at which clients send requests decreases. In this context 'client' refers to a request generator (e.g. a web browser window) that requires the result of the previous request to send the next request. Users that start several such conversations can be represented as multiple clients. Think-times are exponentially distributed with a mean of 7 seconds for all service classes, as recommended by IBM as being representative of Trade clients [2], although heterogeneous think-times are supported by all three methods.
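To make the closed workload model above concrete, the following Python sketch simulates a single client under these assumptions; send_request is a hypothetical placeholder for issuing a Trade operation and returning its response time, and is not part of the benchmark or of JMeter.

import random

def simulate_client(send_request, duration_s, mean_think_time_s=7.0):
    # Simulate one closed-workload client for duration_s simulated seconds.
    # send_request() is assumed to block until the response arrives and to
    # return the response time in seconds.
    clock = 0.0
    requests_sent = 0
    while clock < duration_s:
        response_time_s = send_request()                      # next request waits for this response
        requests_sent += 1
        clock += response_time_s
        clock += random.expovariate(1.0 / mean_think_time_s)  # exponentially distributed think time
    return requests_sent

Because the next request is only generated once the previous response has returned, rising response times automatically lower the request rate, which is the behaviour the number-of-clients workload measure is intended to capture.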
3.2. Servers

The system contains 3 application servers. Under the typical workload the max throughputs of a new 'slow' server AppServS (P3 450MHz, 128MB heap), an established 'fast' server AppServF (P4 1.8GHz, 256MB heap) and an established 'very fast' server AppServVF (P4 2.66GHz, 256MB heap) are found to be 86, 186 and 320 requests/second respectively. AppServS has a smaller heap size due to limited memory, but this is sufficient to store the workload in main memory. The database server (Athlon 1.4GHz, 512MB RAM) uses DB2 7.2 as the database; all servers run on Windows 2000 Advanced Server and 250 clients are simulated by each workload generator (P4 1.8GHz, 512MB RAM) using Apache JMeter [3].

4. Historical Method

The historical modelling method involves sampling performance metrics (e.g. response times and throughputs) and associating these measurements with variables representing the state of the machine (primarily the workload being processed) and the machine's architecture. Additional variables record static performance benchmarks for the different architectures. The modelling process involves determining the relationships (e.g. linear/exponential equations) between the variables and the performance metrics. This is facilitated by defining one or more typical workloads and server architectures, and then determining the relationships between the variables relative to their typical values. This approach removes the need to model variables that remain constant throughout the normal operating range of the system. The historical method is implemented as part of a tool known as HYDRA that allows the accuracy of relationships to be tested on variable quantities of historical data. In the following case study, predictions are required for service class response times at different amounts of workload, different application server architectures and different percentages of the service classes in the workload. These are modelled in the historical method using three corresponding relationships, details of which are provided in the following sections.
4.1. Relationship 1: Number of Typical Workload Clients-Response Time

It has been found that this relationship is best approximated using separate 'lower' and 'upper' equations for before and after max throughput:

mrt = c_L × e^(λ_L × no_of_clients)    (1)

mrt = λ_U × no_of_clients + c_U    (2)

where mrt is the mean response time and c_L, c_U, λ_L and λ_U are parameters that must be calibrated from historical data, as is described in the next section under relationship 2. It is also found that using a further breakdown of the possible system loads, so as to define a 'transition' relationship for phasing from the lower to the upper equation, can increase predictive accuracy as discussed in [5]. However the accuracy of such a relationship is not considered further here. The correct choice of the lower or the upper equation can be made by calculating the number of clients at max throughput using the relationship between the number of clients and the server's throughput. This is a linear relationship until the max throughput for the server under that particular workload is reached. The gradient, m, of this relationship is a parameter that must be calibrated from historical data. After max throughput is reached the throughput is assumed to be roughly constant. This relationship can be used to generate predicted throughput scalability graphs for servers with heterogeneous CPU speeds, since the value of m depends on and can be predicted from the mean client think-time, but does not vary due to different server CPU speeds. m is 0.14 for all servers in the experimental setup, which gives throughput predictions that are accurate to within 1.3% across the three servers.

4.2. Relationship 2: Effect of Application Server Max Throughput on Relationship 1

The following functions approximate this relationship in the experimental setup:

c_L = Δ(c_L) × mx_throughput + C(c_L)    (3)

λ_L = C(λ_L) × mx_throughput^Δ(λ_L)    (4)

where Δ(c_L), C(c_L), C(λ_L) and Δ(λ_L) are parameters that must be calibrated from historical data (see below). Parameters for the upper (linear) equations can also be calculated as follows. Given an increase/decrease in server max throughput of z%, λ_U is found to increase/decrease by roughly 1/z%, and c_U is found to be roughly constant. The parameters in relationships 1 and 2 are calibrated by fitting trend-lines (using a least squares fit) to historical data from the established AppServF and AppServVF servers. The historical data consists of the max throughputs of each server and nudp/nldp data points for the upper/lower equation of relationship 1 respectively. Each data point records the mean response time (as averaged across ns samples) of the typical workload at a given number of clients. In our experimental setup, samples are recorded using one benchmarking client per server. The overall predictive accuracy is defined as the mean of the lower equation accuracy and the upper equation accuracy.

Table 1. Historical method relationship parameters
Server   c_L (ms)   λ_L
S        138.9      4E-06
F        84.1       0.0001
VF       10.7       0.0009

Figure 2. Mean response time predictions for the typical workload on new and established server architectures

It is found that accurate predictions can be made even when nudp and nldp are both reduced to 2 and ns is reduced to 50. The resulting parameters are shown in table 1. Figure 2 illustrates the mean response time predictions made using this calibration (including a transition exponential relationship for phasing between equations 1 and 2). A minimum of 100 samples per 'measured' data point are recorded.
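As an illustration of how relationships 1 and 2 might be applied, the following sketch selects between the lower and upper equations using the clients-versus-throughput gradient m. The function names are ours, not part of HYDRA, and since c_U and λ_U are not reported above, the upper-equation values in the example call are placeholders.

import math

def clients_at_max_throughput(max_throughput, m=0.14):
    # The clients-vs-throughput relationship is linear with gradient m until max
    # throughput is reached, so max throughput is reached at roughly this many clients.
    return max_throughput / m

def predict_mean_response_time(no_of_clients, max_throughput,
                               c_lower, lam_lower, c_upper, lam_upper, m=0.14):
    # Relationship 1: exponential 'lower' equation before max throughput,
    # linear 'upper' equation after it.
    if no_of_clients < clients_at_max_throughput(max_throughput, m):
        return c_lower * math.exp(lam_lower * no_of_clients)   # equation (1)
    return lam_upper * no_of_clients + c_upper                 # equation (2)

# e.g. AppServF (max throughput 186 requests/second) at 500 clients; c_upper and
# lam_upper are illustrative placeholders rather than calibrated values:
mrt_ms = predict_mean_response_time(500, 186, c_lower=84.1, lam_lower=0.0001,
                                    c_upper=100.0, lam_upper=1.0)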
A good level of accuracy of 89.1% for the established servers and 83% for the new server is achieved. For these predictions the samples were made sequentially (after a 1 minute warm-up period). Even so, the longest time taken to record 50 samples was 4.5 seconds before max throughput and 2.2 minutes after. When recording these two data points on an established server, a workload manager might have to transfer clients onto or off the server to get a second data point. The effect on the predictive accuracy of the number of clients between the two data points (i.e. the number of clients that are transferred) will therefore be investigated. Experiments are conducted for the lower and upper equations. We have found it to be effective to use a transition exponential relationship to phase between the lower and upper equations between 66% and 110% of the max throughput load in our experimental setup. The supporting experimentation for the lower equation therefore examines the effect of the number of clients between a data point below 66% of the max throughput load and a data point fixed at 66% of the max throughput load. The supporting experimentation for the upper equation examines the effect of the number of clients between a data point fixed at 110% of the max throughput load and a data point at a higher load. LQNS is used to generate these data points, and is also used to generate data points for the new server architecture so as to test the accuracy of predictions. Figure 3 shows the accuracy of the predictions on the new server architecture as the mean number of clients between the two data points, x, is increased. The actual value of x used for a particular server is scaled according to the machine's speed so the % of the max throughput load between the two data points is constant across all established servers. As with all the predictions in this paper, the accuracy of the more complex lower exponential equation is generally lower than that of the upper linear equation.

Figure 3. The predictive accuracy as the number of clients between historical data points is increased
Figure 4. Heterogeneous workload mean response time predictions for the new server architecture

As x increases there is a roughly linear increase in the lower equation's predictive accuracy, so the more clients the workload manager transfers before taking a second data point, the greater the accuracy is likely to be. However the upper equation's increase in accuracy slowly levels off, making it increasingly less useful for a workload manager to transfer that many clients. It can also be seen that there are more fluctuations in the lower equation line; a workload manager might take this into account by transferring enough clients to guarantee a particular accuracy level despite any fluctuations in the predictive accuracy. It is noted that it has been found to be difficult to obtain results for values of x below 30, as the predicted response time for the data point with the larger number of clients can be less than the predicted response time for the data point with the smaller number of clients. This is due to the 20ms LQNS convergence criterion and could be improved by decreasing it, but at the cost of slower predictions.

4.3. Relationship 3: Buy Request %-Server Max Throughput

There is found to be a linear relationship between the percentage of buy requests, b, on an established server and its max throughput, which is used to extrapolate the max throughput at any buy percentage, mx_throughput_E(b).
The max throughput on a new server at a particular percentage of buy requests is then calculated as follows, where a percentage of buy requests of 0 represents the typical (homogeneous) workload:

mx_throughput_N(b) = mx_throughput_E(b) × (mx_throughput_N(0) / mx_throughput_E(0))    (5)

These relationships are tested using LQNS predictions for historical data; specifically the max throughput of AppServF at 0% and 25% buy requests (189 and 158 requests/second respectively). Figure 4 shows that there is a good prediction for the shapes of the mean workload response time graphs (due to the λ_L parameters being small, the scalability lines appear almost linear before max throughput is reached). A similar procedure can also be used to extrapolate the deviation of service class specific response times from the mean workload response time due to differences in the number and complexity of database requests made.

5. Layered Queuing Method

A layered queuing performance model explicitly defines an application's queuing network. A wide range of applications can be modelled due to the number of features supported by the language, which include: open, closed and mixed queuing networks; FIFO and priority queuing disciplines; synchronous calls, asynchronous forks and joins, and the forwarding of requests onto another queue; and service with a second phase [17]. An approximate solution to the layered queuing model can be generated automatically using the layered queuing network solver (LQNS), making the method relatively easy to use; all that is required when creating the model is specifying the system queuing network configuration. Performance metrics generated include response times, throughputs and utilisation information for each service class at each processor. The application model specified in section 2 is defined as a layered queuing model. The database server disk is modelled as a processor that can only process one request at a time. Processing times are assumed to be exponentially distributed. Requests in the workload are broken down into 'request types' that are expected to exhibit similar performance characteristics due to the operations being called and the amount of data associated with the request. The parameters to the model are:
- Queuing network configuration: the maximum number of requests each processor can process at the same time via time-sharing;
- Service class specific: amount of workload, the workload mix (the expected percentage of the different request types received each second);
- Request-type specific: mean processing times on each server, average number of database requests per application server request.
The per-request type parameters can be calibrated by taking an established server offline and sending a workload consisting only of that request type; the parameters are calculated from the resulting throughput (in requests/second) and the CPU usage of each server. The model can then be evaluated for any heterogeneous workload. The request processing speeds of new servers can be rapidly benchmarked using a 'typical' workload, using either a max throughput or mean request processing time metric. Calculating a new server's mean request type processing times then involves multiplying the mean processing times on an established server by the established/new server request processing speed ratio.
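The calibration and scaling steps just described can be summarised with the following sketch. The utilisation-law estimate (processing time ≈ CPU utilisation divided by throughput) is one standard way of deriving service demands from the measurements mentioned above rather than a procedure stated in the text, and the example scaling uses the max throughputs from section 3.2 as the speed metric.

def processing_time_ms(cpu_utilisation, throughput_req_per_s):
    # Utilisation law: mean processing time per request, from a calibration run that
    # sends a single request type and measures CPU utilisation (0..1) and throughput.
    return 1000.0 * cpu_utilisation / throughput_req_per_s

def scale_to_new_server(established_time_ms, established_speed, new_speed):
    # New server processing time = established time x (established speed / new speed),
    # where the speeds come from a quick benchmark such as max throughput under a
    # typical workload.
    return established_time_ms * established_speed / new_speed

# e.g. scaling the browse processing time calibrated on AppServF (table 2) to AppServS,
# using their max throughputs (186 and 86 requests/second) as the speed metric:
browse_on_appservs_ms = scale_to_new_server(4.505, established_speed=186, new_speed=86)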
5.1. Results

Each service class is calibrated on AppServF as detailed in table 2. Buy requests make 2 database requests, and browse requests make 1.14 database requests on average. The application and database servers can process 50 and 20 requests at the same time via time-sharing, respectively. LQNS produced solutions after a maximum of 3 seconds under a convergence criterion of 20ms, on an Athlon 1.4GHz.

Table 2. Layered queuing method processing time parameters as calibrated on AppServF
Request type   App. Server (ms)   DB Server (ms)
Browse         4.505              0.8294
Buy            8.761              1.613

Figure 2 shows mean response time predictions at different numbers of clients under the typical workload. The mean accuracy of the predictions for the established servers is 97.8% for throughput, and 68.8% for mean response time; the mean accuracy for the new server is 97.1% for throughput, and 73.4% for mean response time. Predictions can also be made for heterogeneous workloads, an example of which is illustrated in figure 4. Although the historical-based predictions are more accurate, it is likely that the layered queuing accuracies could be increased by better modelling of delays such as communication overhead.

6. Hybrid Method

The hybrid method involves using a historical model, but with 'pseudo' historical data generated using a layered queuing model. In a basic hybrid model, historical data points are generated to calibrate the relationships in the model before the server architectures for which predictions are required are known. The predictive accuracy of this approach can be increased by using an 'advanced' model in which layered queuing is used to generate historical data for the server architectures for which predictions are required; this allows the historical model to represent these proposed architectures as 'established' servers. However, layered queuing predictions are slower than historical predictions due to the iterative numerical solution technique employed. This results in hybrid predictions incurring a 'start-up' delay the first time a prediction is made for a new server architecture. This adds to the existing start-up delay of benchmarking the new server architecture's request processing speed; after this initial start-up delay the more responsive historical predictions can be used. The hybrid method is evaluated using an advanced hybrid model created from the historical and layered queuing models. The first time a prediction is required the layered queuing model is calibrated (see section 5), after which this model is used to generate historical data to calibrate relationships 1 and 3 of the historical model (see section 4). Relationship 2 is not used as the layered queuing model generates historical data for specific server architectures. The historical model is calibrated by using the layered queuing model to generate a maximum of 4 historical data points for the lower and upper relationship 1 equations for each of the three servers. This resulted in a mean start-up delay of 11 seconds on an Athlon 1.4GHz machine. The accuracy of the hybrid predictions is found to be similar to that of the layered queuing model alone; the mean response time predictions for the established servers are 67.1% accurate and for the new server 74.9% accurate.
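The hybrid calibration step can be sketched as follows: a layered queuing solve (represented here by the hypothetical lq_predict_mrt callback, which is not LQNS code) generates a few pseudo historical data points, which are then fitted to the lower equation of relationship 1 by a least squares fit on the logarithm of the response time. This is an illustrative reconstruction rather than the HYDRA implementation.

import math

def fit_lower_equation(client_counts, lq_predict_mrt):
    # Generate pseudo historical data points with the layered queuing model, then fit
    # mrt = c_L * exp(lambda_L * n) by ordinary least squares on log(mrt).
    xs = list(client_counts)
    ys = [math.log(lq_predict_mrt(n)) for n in xs]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    lam = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
           / sum((x - mean_x) ** 2 for x in xs))
    c = math.exp(mean_y - lam * mean_x)
    return c, lam   # calibrated c_L and lambda_L for the historical model's lower equation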
7. Extending the Case Study

The previous three sections show that predictions can be made with a good level of accuracy using all three prediction methods. However there are two common practices in distributed enterprise application systems that are not covered in this case study. The first involves SLAs that are specified in terms of distribution-based as well as mean-based metrics, and the second involves systems in which the application servers' main memories are used as caches. These are considered in the following two sections respectively. This will allow the three performance prediction methods to be evaluated more thoroughly in terms of the metrics that can be predicted and the systems that can be modelled.

7.1. Response Time Distribution Predictions

After max throughput (i.e. 100% application server CPU utilisation) is reached, the most significant component of the response time is the application server queuing time (as opposed to the database server disk access time). This results in two different types of probability distribution function for the response time of requests for before and after max throughput is reached. In the case study these two functions are found to be constant (relative to the predicted mean response time) across server architectures with heterogeneous processing speeds. Distribution predictions can therefore be extrapolated from the mean response time prediction using these functions. The response time distribution of requests in the case study is approximated by the exponential/double exponential distribution for before/after 100% CPU utilisation. The probability distribution functions are shown in equations 6 and 7, respectively.

P(X ≤ x) = 1 − e^(−(1/r_p)x)    (6)

P(X ≤ x) = 1 − (1/2)e^(−(x−a)/b) for x ≥ r_p;   P(X ≤ x) = (1/2)e^((x−a)/b) for x < r_p    (7)

where a, the location parameter of the double exponential distribution, is set to r_p (the predicted mean response time) and b, the scale parameter, is found to be constant across servers with heterogeneous processing speeds and is calibrated at 204.1. SLAs are often specified in terms of a percentile metric specifying a percentage of requests p that must have response times less than a maximum response time r_max. Using the distribution equations, the predicted response times in figure 2 are converted to a percentile metric (with p=90%). All three methods give a good level of predictive accuracy; the historical model predictions are 80%/88% accurate and the layered queuing predictions are 77%/69% accurate for new/established servers. The hybrid predictions are similar to the layered queuing predictions at 77%/70% accurate for new/established servers. The percentile predictions are at most 4.6% less accurate than the corresponding mean response time predictions. It is noted that percentile metrics can also be predicted directly using the historical method (but not the layered queuing method or the hybrid method) to avoid this small decrease in accuracy.
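The conversion from a mean response time prediction to a percentile metric follows directly from inverting equations 6 and 7; the sketch below does this for p ≥ 0.5, using the calibrated scale parameter b = 204.1 quoted above. The function name is illustrative.

import math

def percentile_response_time(r_p, p=0.9, past_max_throughput=False, b=204.1):
    # Response time x with P(X <= x) = p, by inverting equation (6) before max
    # throughput and equation (7) (for p >= 0.5) after it; a is set to r_p.
    if not past_max_throughput:
        return -r_p * math.log(1.0 - p)           # invert equation (6)
    return r_p - b * math.log(2.0 * (1.0 - p))    # invert equation (7), x >= r_p branch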
7.2. Modelling Caching

In distributed enterprise systems it is important for application data to be stored in the database between client requests so clients can continue to use the application if the application server to which they are connected fails. (Database servers are typically hosted on machines with more fault tolerance hardware, such as RAID disk arrays, and better backup facilities.) The Trade application (which is also an example of best practice design) stores most of its data directly in the database as opposed to in the application server's memory, so as to simplify recovering from application server failures. The alternative to this is an indirect approach in which more data is stored in the application server's main memory. This data can then be 'persisted' to the database after the response has been returned to the client. This results in the application server's memory acting as a cache to the database, which can increase performance at the risk of data inconsistencies (if the application server crashes whilst data is being persisted). The effect of an architecture's cache (i.e. main memory) size can be modelled using the historical method by recording this as a variable and determining how this variable affects the other variables/relationships as before. However it is found to be difficult to predict the effect of a new server architecture's main memory size using the layered queuing method (and hence the hybrid method) when caching is used and when requests from each client are not independent of the response time of previous requests from that client (as is typically the case in distributed enterprise applications, see section 3.1). This is explained as follows. In the case study, when the workload does not fit in main memory, the main memory will act as a cache (using a least recently used replacement scheme) for the per-client 'session' data in the database. When a request misses the cache an extra call to the database is incurred to read the session associated with the request. Although the layered queuing model can be extended to include this extra database call, it is difficult to calculate the average number of such database calls that will be made for each service class. This is because this value depends on the probability of a cache miss for a client c in that service class. This in turn depends on the probability that the number of bytes replaced in the cache during time T_c is greater than the cache size minus the session data size for client c, where T_c is the time between requests for client c. This probability in turn depends on the arrival rate and session data size distributions for all the service classes. When requests from a client are not independent, these arrival rate distributions are variable and must be predicted using the model. So the number of database calls for each service class in the model depends on: i) the solution to the model; and ii) the ability to extrapolate arrival rate distributions from the mean values predicted. However the layered queuing method does not support parameters specified in terms of metrics that the model predicts, and it is non-trivial to extend the layered queuing numerical solution technique to include this.

8. Evaluation

8.1. Systems that can be Modelled

It has been shown that all three performance prediction methods can be used to make mean response time predictions for the distributed enterprise application case study with a good level of predictive accuracy. All three methods are also sufficiently powerful to model variations on this system model. Examples include: some or all clients sending requests at a constant rate; priority queuing disciplines; and application components communicating using asynchronous calls. However it has also been shown that it is non-trivial to extend the layered queuing (and hence hybrid) models to predict the effect of caching, whereas this is possible using the historical method. It is also noted that all three methods can model systems containing queues that are not explicitly defined, including, for example, bottlenecks. A bottleneck could be caused by application server threads requiring simultaneous access to a critical code section. However the layered queuing method and the hybrid method require additional profiling to model the extra queues created.
8.2. Metrics that can be Predicted

It has been shown that response time predictions can be made for different workload levels. However, resource managers typically require predictions for the maximum number of clients an SLA-constrained server can support. This can be predicted using the historical and hybrid methods by rewriting equations 1 and 2 in terms of the mean response time. However, in the current layered queuing solver the number of clients can only be an input, so it is necessary to search for the number of clients that results in response times just below the SLA response time goal. A limitation of the layered queuing (and hence hybrid) methods is that they can only make mean value response time predictions, whereas SLAs are often specified using percentile response time metrics. However it has been found to be possible in the case study to measure and extrapolate distributions from the mean value predictions. Using this technique it has been shown that percentile response time metrics can be predicted with a good level of accuracy. Another limitation of the layered queuing and hybrid methods is that they can only make steady state predictions. The historical method does not suffer from these restrictions as it can record (as variables) both percentile metrics and the time the server has been stabilising toward the steady state. In fact the historical method can extrapolate and predict a range of metrics, whereas the metrics that the layered queuing (and hence hybrid) methods can predict are fixed using the current solver.

8.3. Ease with which a Model can be Created and Level of Expertise Required

It has been shown that the layered queuing method is more restrictive in the systems that it can model and the metrics it can predict. However, layered queuing models have also been found to be easy to create with a minimum level of performance modelling expertise, as a model specifies just the system's queuing network configuration. In contrast, creating a historical model involves specifying and validating how predictions will be made. As a result, despite the HYDRA tools simplifying the model creation process, it is still harder to create a historical model than a layered queuing model and the process requires more performance modelling expertise. Layered queuing models also have the advantage that they can be calibrated using a small workload, whereas it has been shown that historical models require calibrating at both small and large workloads (i.e. to calibrate both lower and upper equations in relationship 1). Creating a hybrid model requires that the performance analyst is capable of creating two types of performance model and so requires the most performance modelling expertise. However the hybrid method also simplifies calibrating and validating the historical component of the hybrid model. This is because historical data can be generated using the layered queuing model, as opposed to having to record historical data under a range of workloads and server architectures. As a result it has been found to be easier to create a hybrid model than a historical model.

8.4. Overhead of Dynamic Model Recalibration

It has been shown that accurate historical predictions can be made even with a very limited amount of historical data. As a result the historical method can rapidly yet accurately re-calibrate relationship parameters, and when using the hybrid method the time to generate new historical data using the layered queuing method can be kept low.
In the layered queuing method (and hence hybrid method), re-calibrations require dedicated access to a server and information on system configuration parameters. However the layered queuing (and hence hybrid) methods do have the advantage that only one application server (as opposed to two or more for the historical method) is required, which may be helpful in small systems.

8.5. Delay when Evaluating a Prediction

The layered queuing method can require significant CPU time to make each prediction (e.g. up to 3 seconds on an Athlon 1.4GHz); this may be a major limitation as the resource management algorithm may need to make many predictions. This is made worse by the fact that multiple predictions must be made when searching for the maximum number of clients a server can support whilst still being in SLA compliance. The historical method has the advantage that predictions can be made almost instantaneously. Hybrid method predictions incur a 'start-up' delay the first time a prediction is made for a new server architecture whilst historical data is generated; it has been shown that this can be as short as an 11 second delay on an Athlon 1.4GHz. After the start-up delay the predictions are almost instantaneous. It is noted that a resource manager may need to evaluate many predictions for each server architecture, for example to evaluate the effect of allocating different amounts and types of workload. If this is the case, the hybrid method's total prediction evaluation delay may be less than that of the layered queuing method.

9. Tuning a Prediction-Enhanced Resource Manager

SLA-based service providers (as defined in section 2) incur two main types of cost. The first type of cost involves paying penalties for SLA failures (i.e. missing SLA response time goals); the second is the cost of using the servers in the system (e.g. buying or renting the hardware). This section investigates how a prediction-enhanced resource manager can balance these costs whilst compensating for predictive inaccuracy. This will be investigated using a resource management algorithm which determines the application servers to use to process a workload that is to be transferred to the service provider. The algorithm also provides an initial division of the workload across the servers obtained (which could then be modified by a workload manager). The algorithm (see algorithm 1) takes as input a list of the service classes in the workload and a list of available application servers. The service classes are sorted and hence processed in order of priority, so if there are insufficient servers the lower priority service classes are rejected from the system first. Since there is no priority queuing or processing in the system model, the ideal application server selection algorithm (on line 6) would minimise the amount of workload with different SLA response time goals on the same server. However, to facilitate the tuning analysis a short algorithm with a fast evaluation time is considered more appropriate than an algorithm with near optimal efficiency; because of this a greedy approach to server selection is used. This involves selecting the server which the performance model predicts can be allocated the most clients from the current service class. An exception to this rule occurs when selecting the last server that will be required by a service class; the algorithm takes the server that can be allocated the smallest number of clients, given that it can still take all the clients remaining to be allocated in the service class.
1. sort the service classes in order of increasing response time goal
2. current_service_class = first service class in list
3. do
4.   if (all clients in current_service_class allocated to an application server)
5.     current_service_class = next service class in list
6.   app_server = application_server_selection_algorithm()
7.   allocate clients from current_service_class to app_server until: maximum capacity is reached on app_server OR all clients in current_service_class are allocated to an application server
8. while (application servers with available capacity exist and unallocated clients exist)

Algorithm 1. Resource management algorithm.

Each service class consists of a number of clients, each of which is initially 'unallocated'. Application servers are considered to have available capacity unless the performance model predicts that adding an extra client from the current service class would result in some clients missing SLA response time goals. When predictions are inaccurate some service classes may have insufficient servers at runtime. To deal with this the system model in section 2 is extended so application servers reject clients at runtime if response times are within a threshold of missing SLA goals. This prevents all the existing clients on a server from also missing their SLA goals. In practice, it is likely that the rejected workload would be handled by a second set of servers that accept all workload. A generic strategy to compensate for predictive inaccuracy and balance the service provider's costs involves multiplying the number of clients in each service class by a number which we refer to as the 'slack'. The resource manager then allocates application servers to service classes based on this modified workload.
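A possible Python rendering of algorithm 1, including the slack multiplier, is sketched below. predict_max_clients stands in for a query to one of the performance models ("how many clients with this response time goal can the server support?"), and treating the predicted capacities as additive across service classes on a server is a simplification of what the models actually compute.

def allocate(service_classes, app_servers, predict_max_clients, slack=1.0):
    # service_classes: list of (response_time_goal_ms, no_of_clients) tuples.
    # Returns a mapping {server: {class_index: allocated_clients}}.
    allocation = {s: {} for s in app_servers}
    # line 1: process service classes in order of increasing response time goal
    for i in sorted(range(len(service_classes)), key=lambda k: service_classes[k][0]):
        goal, clients = service_classes[i]
        remaining = int(round(clients * slack))            # slack inflates the workload (section 9)
        while remaining > 0:
            capacity = {s: predict_max_clients(s, goal) - sum(allocation[s].values())
                        for s in app_servers}
            capacity = {s: c for s, c in capacity.items() if c > 0}
            if not capacity:                               # no capacity left: the rest of this class
                break                                      # (and lower priority classes) is rejected
            # line 6 (greedy): pick the server predicted to take the most clients, unless some
            # server can take all remaining clients, in which case pick the smallest such server
            finishers = {s: c for s, c in capacity.items() if c >= remaining}
            server = (min(finishers, key=finishers.get) if finishers
                      else max(capacity, key=capacity.get))
            taken = min(capacity[server], remaining)
            allocation[server][i] = allocation[server].get(i, 0) + taken
            remaining -= taken
    return allocation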
9.1. Results

The algorithm is evaluated using two cost metrics, the first being the percentage of clients rejected from the servers ('% SLA failures'). The second metric represents the amount of server processing power allocated to the application, where an application server's processing power is defined as its max throughput under the typical workload. For convenience this metric is recorded as a percentage of the total processing power of the list of application servers and will be referred to as '% server usage'. We investigate the effect of the resource management slack parameter for balancing these cost metrics whilst compensating for predictive inaccuracy. The case study in section 3 is extended to include a total of 16 application servers. Eight of the servers have a new architecture (AppServS) and eight have the same architectures as existing servers (4×AppServF and 4×AppServVF). Three service classes are created by dividing the browse service class into two service classes with different SLA response time goals. The workload that is to be allocated to the new servers is defined as: 10% buy clients (RT goal: 150ms), 45% high priority browse clients (RT goal: 300ms), and 45% low priority browse clients (RT goal: 600ms). The percentages are selected based on the Trade application, which defines 10% of the standard workload to be purchase requests. The response time goals are selected based on the response time of the fastest application server at max throughput (~600ms). The investigation begins by analysing how predictive inaccuracy can be compensated for in order to reduce the % SLA failures to 0. This has been found to be straightforward when the predictive error is uniform. Define y as the predictive accuracy, where multiplying the actual number of clients by y gives the prediction. Experiments have confirmed that setting the slack to y results in 0% SLA failures below 100% server usage and a constant % server usage at any predictive accuracy.

Figure 5. % SLA failures when using the resource management algorithm at different loads
Figure 6. % Server usage when using the resource management algorithm at different loads

To examine the more interesting case of non-uniform predictive accuracy, the more accurate historical model is used to represent the real system response times, and the hybrid model is used to provide the less accurate predictions. The layered queuing model is not used due to the limitations discussed in section 8. Figures 5 and 6 show the resulting performance of the resource management algorithm at different loads and slack levels, in terms of the % SLA failure and % server usage performance metrics respectively. Each line was generated in under one second. The average predictive accuracy of the non-uniform predictions (weighted by the number of servers in the server pool) is 92.5% (i.e. y=1.075). However the minimum slack that results in 0% SLA failures before 100% server usage is 1.1. The difference is due to some predictions being used more by the resource management algorithm than others. For example the predictive accuracy of AppServF is the highest of the three servers at 97.04%, but due to the design of the algorithm the middle servers tend to be used less frequently. It is noted that the irregular shape of the lines on the resource management performance graphs is because runtime optimisations allow the resource manager to use any available capacity the algorithm leaves on a server. So once the total workload crosses a threshold and a small number of clients are allocated to an additional server, the resource manager's performance will temporarily improve (as can be seen at 9000 clients). The second part of the investigation involves looking at how we can balance the % SLA failure and % server usage costs. As the slack level is reduced below 1.1 the % SLA failures will increase away from 0 and the % server usage will decrease. A new metric, '% server usage saving', is defined as SUmax − % server usage, where SUmax is the % server usage at the minimum slack level that results in 0% SLA failures (SUmax=62.7% at a slack of 1.1 in this set of experiments). 'Average % server usage saving' and 'average % SLA failure' metrics are also defined as the average % server usage saving and average % SLA failure values across all loads prior to 100% server usage.

Figure 7. Algorithm cost metrics as the slack is reduced from 1.1 to 0
Figure 8. SLA failures/server usage relationship as slack is reduced from 1.1 to 0.9

Figure 7 shows the effect on the average % SLA failures and average % server usage saving metrics as slack is reduced from 1.1 to 0. During the first 0.1 reduction in slack, the increase in average % SLA failures is smaller than the increase in the average % server usage saving (as also shown in figure 8). This is because it requires a significant amount of server processing power to guarantee that there will be no SLA failures at any load. This is in part due to the runtime optimisations and the spikes on the % SLA failure graph that they cause (see figure 5). Then, between a slack of 1.0 and 0.9 the rate of increase of the two metrics is almost identical. As the slack is reduced further the average % SLA failures increases at a faster rate than the average % server usage saving, until 100% SLA failures and SUmax=62.7% server usage saving are reached at 0 slack (i.e. no clients allocated). Current work is investigating cost functions and how they can map SLA failure and server usage metrics to their associated costs. Given such functions, the y-axis of figure 7 could become a single cost axis by subtracting the cost saving due to the server usage saving from the cost due to the SLA failures. Slack setting(s) with the lowest cost could then be determined.
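To illustrate how such cost functions could be used, the sketch below collapses the two metrics of figure 7 into a single cost for each slack setting and selects the cheapest. The linear cost functions and the results dictionary are hypothetical and are not taken from the experiments above.

def total_cost(avg_sla_failure_pct, avg_usage_saving_pct,
               cost_per_sla_failure_pct, saving_per_usage_pct):
    # Single cost value: the cost attributed to SLA failures minus the saving
    # attributed to the reduction in server usage (both assumed linear here).
    return (avg_sla_failure_pct * cost_per_sla_failure_pct
            - avg_usage_saving_pct * saving_per_usage_pct)

def best_slack(results, cost_per_sla_failure_pct, saving_per_usage_pct):
    # results maps a slack setting to (average % SLA failures, average % server
    # usage saving), i.e. the two curves of figure 7; returns the cheapest slack.
    return min(results, key=lambda s: total_cost(results[s][0], results[s][1],
                                                 cost_per_sla_failure_pct,
                                                 saving_per_usage_pct))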
10. Conclusion

This paper reports on a comparative evaluation of three methods for predicting mean response times of heterogeneous workloads on new server architectures for an industrial strength distributed enterprise application benchmark. To the best of our knowledge this is the only comparison of the layered queuing, historical and hybrid prediction methods using this benchmark. Results are presented showing that all three methods can be used to make predictions for new server architectures with a good level of accuracy. It is also shown that the historical method can make accurate predictions when only a very limited amount of historical data is available. This paper also considers how two extensions to the case study could be modelled using each method. This has involved showing that response time distributions can be predicted with a good level of accuracy given a mean response time prediction, and that it is difficult to predict the effect of caching using the layered queuing method. The methods were evaluated (with a focus on how they could be used to enhance a resource management algorithm) in terms of: the systems that can be modelled; the metrics that can be predicted; the ease with which the models can be created and the level of expertise required; the overheads of recalibrating a model; and the delay incurred when evaluating a prediction. The paper also investigates how a prediction-enhanced resource management algorithm can be tuned so as to compensate for predictive inaccuracy and balance the costs of SLA failures and server usage. Future work includes evaluating the strengths and weaknesses identified with each method on different types of prediction-enhanced resource management algorithm.

Acknowledgments

The authors would like to thank Robert Berry, Beth Hutchison, Te-Kai Liu and Nigel Thomas for their contributions towards this research. The work is sponsored in part by the EPSRC (contract no. GR/S03058/01 and GR/R47424/01), the NASA AMES Research Center administered by USARDSG (contract no. N68171-01-C-9012) and IBM UK Ltd.

References

1. J. Aman, C. Eilert, D. Emmes, P. Yocom, D. Dillenberger, Adaptive Algorithms for Managing a Distributed Data Processing Workload, IBM Systems Journal, 36(2):242-283, 1997
2. Y. An, T. Kin, T. Lau, P. Shum, A Scalability Study for WebSphere Application Server and DB2 Universal Database, IBM White Paper, 2002. Available at: http://www.ibm.com/developerworks/
3. Apache JMeter User Manual. Available at: http://jakarta.apache.org/jmeter/index.html
4. K. Appleby, S. Fakhouri, L. Fong, G. Goldszmidt, M. Kalantar, S. Krishnakumar, D.P. Pazel, J. Pershing, B. Rochwerger, Oceano - SLA Based Management of a Computing Utility, 7th IFIP/IEEE International Symposium on Integrated Network Management, New York, May 2001
5. D. Bacigalupo, S.A. Jarvis, L. He, G.R. Nudd, An Investigation into the Application of Different Performance Prediction Techniques to e-Commerce Applications, Workshop on Performance Modelling, Evaluation and Optimization of Parallel and Distributed Systems, 18th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2004), New Mexico, USA, April 2004
6. Y. Diao, J. Hellerstein, S. Parekh, Stochastic Modeling of Lotus Notes with a Queueing Model, Computer Measurement Group International Conference (CMG 2001), California, USA, December 2001
7. M. Endrel, IBM WebSphere V4.0 Advanced Edition Handbook, IBM International Technical Support Organisation Pub., 2002. Available at: http://www.redbooks.ibm.com/
8. M. Goldszmidt, D. Palma, B. Sabata, On the Quantification of e-Business Capacity, ACM Conference on Electronic Commerce (EC 2001), Florida, USA, October 2001
9. IBM Websphere Performance Sample: Trade. Available at: http://www.ibm.com/software/info/websphere/
10. T. Liu, S. Kumaran, J. Chung, Performance Modeling of EJBs, 7th World Multiconference on Systemics, Cybernetics and Informatics (SCI 2003), Florida, USA, 2003
11. Z. Liu, M.S. Squillante, J. Wolf, On Maximizing Service-Level-Agreement Profits, ACM Conference on Electronic Commerce (EC 2001), Florida, USA, October 2001
12. Z. Liu, C.H. Xia, P. Momcilovic, L. Zhang, AMBIENCE: Automatic Model Building using InferENCE, IBM Research Report RC22961, November 2003. Available at: www.research.ibm.com
13. D. Menasce, Two-Level Iterative Queuing Modeling of Software Contention, 10th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunications Systems (MASCOTS 2002), Texas, USA, October 2002
14. J. Rolia, X. Zhu, M. Arlitt, A. Andrzejak, Statistical Service Assurances for Applications in Utility Grid Environments, 10th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunications Systems (MASCOTS 2002), Texas, USA, October 2002
15. F. Sheikh, M. Woodside, Layered Analytic Performance Modelling of a Distributed Database System, International Conference on Distributed Computing Systems (ICDCS'97), Maryland, USA, May 1997
16. J.D. Turner, D.A. Bacigalupo, S.A. Jarvis, D.N. Dillenberger, G.R. Nudd, Application Response Measurement of Distributed Web Services, International Journal of Computer Resource Measurement, 108:45-55, 2002
17. C.M. Woodside, J.E. Neilson, D.C. Petriu, S. Majumdar, The Stochastic Rendezvous Network Model for Performance of Synchronous Client-Server-like Distributed Software, IEEE Transactions on Computers, 44(1):20-34, 1995
18. L. Zhang, C. Xia, M. Squillante, W. Nathaniel Mills III, Workload Service Requirements Analysis: A Queueing Network Optimization Approach, 10th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunications Systems (MASCOTS 2002), Texas, USA, October 2002