An Investigation into the Application of Different
Performance Prediction Methods to Distributed
Enterprise Applications
DAVID A. BACIGALUPO†  daveb@dcs.warwick.ac.uk
STEPHEN A. JARVIS†  saj@dcs.warwick.ac.uk
LIGANG HE†  liganghe@dcs.warwick.ac.uk
DANIEL P. SPOONER†  dps@dcs.warwick.ac.uk
DONNA N. DILLENBERGER*  engd@us.ibm.com
GRAHAM R. NUDD†  grn@dcs.warwick.ac.uk
† High Performance Systems Group, University of Warwick, Coventry CV4 7AL, UK
* IBM T.J. Watson Research Centre, Yorktown Heights, New York 10598, USA
Abstract. Response time predictions for workload on new server architectures can enhance Service Level
Agreement–based resource management. This paper evaluates three performance prediction methods using a
distributed enterprise application benchmark. The historical method makes predictions by extrapolating from
previously gathered performance data, while the layered queuing method makes predictions by solving layered
queuing networks. The hybrid method combines these two approaches, using a layered queuing model to
generate the data for a historical model. The methods are evaluated in terms of: the systems that can be
modelled; the metrics that can be predicted; the ease with which the models can be created and the level of
expertise required; the overheads of recalibrating a model; and the delay when evaluating a prediction. The
paper also investigates how a prediction-enhanced resource management algorithm can be tuned so as to
compensate for predictive inaccuracy and balance the costs of SLA violations and server usage.
Keywords: Performance Prediction, Distributed Enterprise Application, Layered Queuing Modelling,
Historical Performance Data, Resource Management, Service Level Agreement
1. Introduction
It has been shown that response time predictions can enhance the workload and
resource management of distributed enterprise applications [1,11]. Two common
approaches used in the literature for making these response time predictions are
extrapolating from historical performance data and solving queuing network models.
Examples of the first approach include the use of both coarse [8] and fine [1] grained
historical performance data. The former involves recording workload information and
operating system/database load metrics, while the latter involves recording the historical
usage of each machine’s CPU, memory and IO resources by different classes of
workload. Another example of this approach is being developed in the High Performance
Systems Group at the University of Warwick [5]. This historical method has been
implemented as a tool called HYDRA which has been applied to both distributed
enterprise [5] and business to business [16] applications. It is differentiated from other
historical modelling work by its focus on simplifying the process of analysing any
historical data so as to extract the small number of trends that will be most useful to a
resource manager. Other historical modelling work focuses on predicting future workload
resource demands [14] and is complementary to this research in which the emphasis is on
being able to predict new server architectures.
Examples of the queuing modelling approach include [6,11,13] and the layered
queuing method, as implemented in the layered queuing network solver (LQNS) [17].
The layered queuing method is of particular interest and will be examined further in this
paper as: it explicitly models the tiers of servers found in this class of application, and it
has been applied to a range of distributed systems (e.g. [15]), including the distributed
enterprise application used in this paper [10].
A third approach is to overcome some of the limitations of the historical and queuing
approaches by combining them, albeit at the cost of a more complex model. This can be
done by using historical models to calibrate queuing model processing times, which can
be expensive to measure directly in distributed enterprise systems. For example in [12]
queuing network processing times are inferred from coarse-grained historical
performance data. This paper complements this work by examining a combined approach
in which a layered queuing model is used to generate historical performance data and so
calibrate a historical model. This ‘hybrid’ method has the advantage of rapid historical
predictions without having to collect real historical performance data.
It is important to compare the effectiveness of different approaches for modelling
distributed enterprise applications so practitioners can make an informed choice when
designing prediction-enhanced workload and resource management algorithms. However,
although there have been comparisons of different performance prediction approaches
using distributed enterprise applications, there have been few quantitative comparisons of
the three approaches on a single distributed enterprise application. For example in [15] a
layered queuing model of a distributed database system is created and compared to a
Markov chain-based queuing model of the system. In [17] the layered queuing method is
compared more generally to other performance modelling methods. Another recognised
queuing method which has been applied to similar applications is described in [13] and
compared with the layered queuing method. However none of these papers include a
comparison with a historical model of the same application. The historical prediction
method described in [8] is applied to a web-based stock trading application (Microsoft
FMStocks) and compared to a queuing modelling approach. However, a queuing network
model of the application itself is not created.
This work investigates how the performance of a distributed enterprise application
benchmark can be predicted using the HYDRA historical method, the layered queuing
method and the hybrid method. The methods are then evaluated in terms of: the systems
that can be modelled; the metrics that can be predicted; the ease with which the models
can be created and the level of expertise required; the overheads of recalibrating a model;
and the delay when evaluating a prediction. The IBM WebSphere commercial e-Business
middleware [7] is selected as the platform on which the benchmark will be run as it is a
common choice for distributed enterprise applications. The IBM Performance Benchmark
Sample ‘Trade’ [9] is selected as it is the main distributed enterprise application
benchmark for the WebSphere platform. To the best of our knowledge this is the only
quantitative comparison of these three classes of prediction method on this benchmark.
The comparison described in this paper involves: defining a system model and case
study representative of current distributed enterprise applications (see sections 2 and 3);
creating the three performance models of the case study and investigating the predictive
accuracy that can be obtained (see sections 4-6); considering how the models can/cannot
be extended (see section 7) and evaluating the strengths and weaknesses of the three
methods (see section 8). The paper also includes an analysis of the tuning of a prediction-enhanced resource management algorithm (see section 9).
2. System Model
Based on the Oceano resource manager [4] the system model (see figure 1) consists of
a service provider which hosts a number of applications and also contains a resource
manager that controls the transfer of application servers between those applications. An
application server can only process the workload from one application at a time, so as to
isolate the applications (which may be hosted for competing organisations). Based on
other established work (e.g. [10,18,2]), each application is modelled as a tier of application
servers accessing a single database server. Application servers may have heterogeneous
server architectures. Based on the queuing network in the WebSphere e-Business
platform: a single first in first out (FIFO) waiting queue is used by each application
server; the database server has one FIFO queue per application server; and both servers
can process multiple requests concurrently via time-sharing.
Figure 1. The proposed system model
The workload manager tier of the application model involves the workload being
divided into ‘service classes’, each of which is associated with a response time
requirement (e.g. in an SLA). The role of the workload manager is to route the incoming
requests to the available servers whilst meeting these goals.
In such a system it is important to be able to make response time predictions on
alternative application server architectures so as to allow servers to be allocated to
applications, workload to be allocated to servers and upgrades to be planned in an
informed fashion. To allow these predictions to be made it is useful for the system to
provide two supporting services. The first involves allowing performance models to be
recalibrated on established servers in order to save modelling variables that change
infrequently (such as monitoring and logging policies), or variables that are hard to
measure (such as the complexity of processing data to serve a service class). An example
of the latter from the Trade benchmark is the average size of the clients’ ‘portfolio’ of
stock. The second service involves allowing application-specific benchmarks to be run on
new server architectures so as to calibrate their request processing speeds.
3. Case Study
This section describes the workload and server configuration which, along with the
Trade benchmark, will provide an example of the system model that is representative of
commercial distributed enterprise applications. The case study will then be modelled
using each of the three performance prediction methods in sections 4-6.
3.1. Workload
The workload in a service class is divided into clients each of which sends requests to
the application. Each request calls one of the operations on the application-tier interface
(e.g. buy/sell/quote). A service class is created for ‘browse’ users, with the next
operation called by a client being randomly selected, with probabilities defined as part of
the Trade benchmark as being representative of real clients. A service class is created for
‘buy’ users using the ‘register new user and login’, ‘buy’ and ‘logoff’ operations. On
average buy clients make 10 sequential buy requests before sending a ‘logoff’ request.
This creates a buy service class with a mean portfolio size of 5.5. For simplicity, the
typical workload is defined as all browse clients.
‘No. of clients and the mean client think-time’ is used as the primary measure of the
workload from a service class. The total number of clients across all service classes and
the percentage of the different service classes are used to represent the system load.
Using number of clients (as opposed to a static arrival rate definition) to represent the
amount of workload is common when modelling distributed enterprise applications (e.g.
[6,10]). This is because it explicitly models the fact that the time a request from a client
arrives is not independent of the response times of previous requests, so as the load
increases the rate at which clients send requests decreases. In this context ‘client’ refers
to a request generator (e.g. a web browser window) that requires the result of the previous
request to send the next request. Users that start several such conversations can be
represented as multiple clients. Think-times are exponentially distributed with a mean of
7 seconds for all service classes as recommended by IBM as being representative of
Trade clients [2], although heterogeneous think-times are supported by all three methods.
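For illustration, a workload generator might draw these think-times as follows (a minimal sketch; the 7 second mean is from the text, while the constant and function names are ours):

import random

MEAN_THINK_TIME_MS = 7000.0  # 7 second mean think-time, as recommended for Trade clients [2]

def next_think_time_ms():
    # Exponentially distributed think-time; expovariate takes the rate
    # parameter, i.e. 1/mean, so returned values have the required mean.
    return random.expovariate(1.0 / MEAN_THINK_TIME_MS)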
3.2. Servers
The system contains 3 application servers. Under the typical workload the max
throughputs of a new ‘slow’ server AppServS (P3 450MHz, 128MB heap), an established
‘fast’ server AppServF (P4 1.8GHz, 256MB heap) and an established ‘very fast’ server
AppServVF (P4 2.66GHz, 256MB heap) are found to be 86, 186 and 320 requests/second
respectively. AppServS has a smaller heap size due to limited memory, but this is
sufficient to store the workload in main memory. The database server (Athlon 1.4GHz,
512MB RAM) uses DB2 7.2 as the database; all servers run on Windows 2000 Advanced
Server and 250 clients are simulated by each workload generator (P4 1.8GHz, 512MB
RAM) using Apache JMeter [3].
4. Historical Method
The historical modelling method involves sampling performance metrics (e.g. response
times and throughputs) and associating these measurements with variables representing
the state of the machine (primarily the workload being processed) and the machine’s
architecture. Additional variables record static performance benchmarks for the different
architectures. The modelling process involves determining the relationships (e.g.
linear/exponential equations) between the variables and the performance metrics. This is
facilitated by defining one or more typical workloads and server architectures, and then
determining the relationships between the variables relative to their typical values. This
approach removes the need to model variables that remain constant throughout the
normal operating range of the system. The historical method is implemented as part of a
tool known as HYDRA that allows the accuracy of relationships to be tested on variable
quantities of historical data.
In the following case study, predictions are required for service class response times at
different amounts of workload, different application server architectures and different
percentages of the service classes in the workload. These are modelled in the historical
method using three corresponding relationships, details of which are provided in the
following sections.
4.1. Relationship 1: Number of Typical Workload Clients-Response Time
It has been found that this relationship is best approximated using separate ‘lower’ and
‘upper’ equations for before and after max throughput:
mrt = c_L e^(λ_L × no_of_clients)    (1)

mrt = λ_U × no_of_clients + c_U    (2)
where mrt is the mean response time and c_L, c_U, λ_L and λ_U are parameters that must be
calibrated from historical data, as is described in the next section under relationship 2. It
is also found that using a further breakdown of the possible system loads, so as to define
a ‘transition’ relationship for phasing from the lower to the upper equation, can increase
predictive accuracy as discussed in [5]. However the accuracy of such a relationship is
not considered further here.
The correct choice of the lower or the upper equation can be made by calculating the
number of clients at max throughput using the relationship between the number of clients
and the server’s throughput. This is a linear relationship until the max throughput for the
server under that particular workload is reached. The gradient, m, of this relationship is a
parameter that must be calibrated from historical data. After max throughput is reached
the throughput is assumed to be roughly constant. This relationship can be used to
generate predicted throughput scalability graphs for servers with heterogeneous CPU
speeds, since the value of m depends on, and can be predicted from, the mean client
think-time, but does not vary with server CPU speed. m is 0.14 for all servers in the
experimental setup, giving predictions that are accurate to within 1.3% across the three servers.
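For illustration, relationship 1 can be evaluated as follows (a minimal Python sketch under the definitions above; the function names are ours and the parameter values would come from a calibration such as table 1):

import math

M = 0.14  # gradient of the linear clients-throughput relationship (all servers)

def clients_at_max_throughput(mx_throughput, m=M):
    # Throughput grows linearly with the number of clients (gradient m) until
    # max throughput, so saturation occurs at roughly mx_throughput / m clients.
    return mx_throughput / m

def predict_mrt(no_of_clients, c_l, lam_l, lam_u, c_u, mx_throughput, m=M):
    # Mean response time: equation (1) below saturation, equation (2) above it.
    if no_of_clients < clients_at_max_throughput(mx_throughput, m):
        return c_l * math.exp(lam_l * no_of_clients)  # lower equation (1)
    return lam_u * no_of_clients + c_u                # upper equation (2)

For example, with the AppServF parameters from table 1 (c_L = 84.1, λ_L = 0.0001) and a max throughput of 186 requests/second, the lower equation applies up to roughly 1330 clients.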
4.2. Relationship 2: Effect of Application Server Max Throughput on Relationship 1
The following functions approximate this relationship in the experimental setup:
c_L = Δ(c_L) × mx_throughput + C(c_L)    (3)

λ_L = C(λ_L) × mx_throughput^Δ(λ_L)    (4)
where Δ(c_L), C(c_L), C(λ_L) and Δ(λ_L) are parameters that must be calibrated from
historical data (see below). Parameters for the upper (linear) equations can also be
calculated as follows. Given an increase/decrease in server max throughput of z%, λ_U is
found to increase/decrease by roughly 1/z%, and c_U is found to be roughly constant.
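As an illustration, extrapolating relationship 1 parameters to a new server from its benchmarked max throughput might be sketched as follows; note that treating the z% rule for λ_U as inverse scaling with max throughput is our reading of the text, and the function names are ours:

def extrapolate_lower_params(mx_throughput, d_cl, c_cl, c_laml, d_laml):
    # Equations (3) and (4): c_L is linear in max throughput and lambda_L
    # follows a power law in max throughput.
    c_l = d_cl * mx_throughput + c_cl
    lam_l = c_laml * mx_throughput ** d_laml
    return c_l, lam_l

def extrapolate_upper_params(lam_u_est, c_u_est, mx_est, mx_new):
    # c_U is roughly constant across servers, while lambda_U is taken to
    # scale inversely with max throughput (our reading of the z% rule).
    return lam_u_est * (mx_est / mx_new), c_u_est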
The parameters in relationships 1 and 2 are calibrated by fitting trend-lines (using a
least squares fit) to historical data from the established AppServF and AppServVF servers.
The historical data consists of the max throughputs of each server and n_udp/n_ldp data points
for the upper/lower equations of relationship 1 respectively. Each data point records the
mean response time (averaged across n_s samples) of the typical workload at a given number
of clients. In our experimental setup, samples are recorded using one benchmarking client
per server. The overall predictive accuracy is defined as the mean of the lower equation
accuracy and the upper equation accuracy.
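For illustration, the least squares calibration might look as follows (a minimal sketch using NumPy; fitting the exponential lower equation log-linearly is our assumption about how the trend-lines are obtained):

import numpy as np

def fit_lower(clients, mrt):
    # Fit mrt = c_L * exp(lambda_L * clients) by least squares on log(mrt),
    # which is linear in the number of clients.
    lam_l, log_cl = np.polyfit(clients, np.log(mrt), 1)
    return np.exp(log_cl), lam_l

def fit_upper(clients, mrt):
    # Fit the linear upper equation mrt = lambda_U * clients + c_U.
    lam_u, c_u = np.polyfit(clients, mrt, 1)
    return lam_u, c_u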
Server    c_L (ms)    λ_L
S         138.9       4E-06
F         84.1        0.0001
VF        10.7        0.0009

Table 1. Historical method relationship parameters
Figure 2. Mean response time predictions for the typical workload on new and established server architectures
It is found that accurate predictions can be made even when n_udp and n_ldp are both
reduced to 2 and n_s is reduced to 50. The resulting parameters are shown in table 1.
Figure 2 illustrates the mean response time predictions made using this calibration
(including a transition exponential relationship for phasing between equations 1 and 2). A
minimum of 100 samples per ‘measured’ data point are recorded. A good level of
accuracy of 89.1% for the established servers and 83% for the new server is achieved.
For these predictions the samples were recorded sequentially (after a 1 minute warm-up
period). Even so, the longest time taken to record 50 samples was 4.5 seconds before max
throughput and 2.2 minutes after.
When recording these two data points on an established server, a workload manager
might have to transfer clients onto or off the server to get a second data point. The effect
on the predictive accuracy of the number of clients between the two data points (i.e. the
number of clients that are transferred) will therefore be investigated. Experiments are
conducted for the lower and upper equations. We have found it to be effective to use a
transition exponential relationship to phase between the lower and upper equations
between 66% and 110% of the max throughput load in our experimental setup. The
supporting experimentation for the lower equation therefore examines the effect of the
number of clients between a data point below 66% of the max throughput load and a data
point fixed at 66% of the max throughput load. The supporting experimentation for the
upper equation examines the effect of the number of clients between a data point fixed at
110% of the max throughput load and a data point at a higher load. LQNS is used to
generate these data points, and is also used to generate data points for the new server
architecture so as to test the accuracy of predictions.
Figure 3 shows the accuracy of the predictions on the new server architecture as the
mean number of clients between the two data points, x, is increased. The actual value of x
used for a particular server is scaled according to the machine’s speed so the % of the
max throughput load between the two data points is constant across all established
servers. As with all the predictions in this paper, the accuracy of the more complex lower
exponential equation is generally lower than that of the upper linear equation.
Figure 3. The predictive accuracy as the number of clients between historical data points is increased
Figure 4. Heterogeneous workload mean response time predictions for the new server architecture
As x increases there is a roughly linear increase in the lower equation’s predictive
accuracy so the more clients the workload manager transfers before taking a second data
point the greater the accuracy is likely to be. However the upper equation’s increase in
accuracy slowly levels off, making it increasingly less useful for a workload manager to
transfer that many clients. It can also be seen that there are more fluctuations in the lower
equation line; a workload manager might take this into account by transferring enough
clients to guarantee a particular accuracy level despite any fluctuations in the predictive
accuracy. It is noted that it has been found to be difficult to obtain results for values of x
below 30 as the predicted response time for the data point with the larger number of
clients can be less than the predicted response time for the data point with the smaller
number of clients. This is due to the 20ms LQNS convergence criterion and could be
improved by decreasing it, but at the cost of slower predictions.
4.3. Relationship 3: Buy Request %-Server Max Throughput
There is found to be a linear relationship between the percentage of buy requests, b, on
an established server and its max throughput which is used to extrapolate the max
throughput at any buy percentage, mx_throughputE(b). The max throughput on a new
server at a particular percentage of buy requests is then calculated as follows, where a
percentage of buy requests of 0 represents the typical (homogeneous) workload:
mx_throughput_N(b) = (mx_throughput_E(b) / mx_throughput_E(0)) × mx_throughput_N(0)    (5)
These relationships are tested using LQNS predictions for historical data; specifically
the max throughput of AppServF at 0% and 25% buy requests (189 and 158
requests/second respectively). Figure 4 shows that there is a good prediction for the
shapes of the mean workload response time graphs (due to the λ_L parameters being small,
the scalability lines appear almost linear before max throughput is reached). A similar
procedure can also be used to extrapolate the deviation of service class specific response
times from the mean workload response time due to differences in the number and
complexity of database requests made.
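For illustration, relationship 3 and equation (5) might be implemented as follows (a sketch; the function names are ours, and the example calibration points are the AppServF values quoted above):

import numpy as np

def fit_buy_throughput(buy_pcts, mx_throughputs):
    # Linear relationship between buy request % and an established server's
    # max throughput, e.g. fit_buy_throughput([0, 25], [189, 158]) for AppServF.
    slope, intercept = np.polyfit(buy_pcts, mx_throughputs, 1)
    return lambda b: slope * b + intercept

def mx_throughput_new(b, mx_throughput_e, mx_throughput_n0):
    # Equation (5): extrapolate the new server's max throughput at buy
    # percentage b from the established server's curve and the new server's
    # typical workload (b = 0) max throughput.
    return (mx_throughput_e(b) / mx_throughput_e(0.0)) * mx_throughput_n0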
5. Layered Queuing Method
A layered queuing performance model explicitly defines an application’s queuing
network. A wide range of applications can be modelled due to the number of features
supported by the language, which include: open, closed and mixed queuing networks;
FIFO and priority queuing disciplines; synchronous calls, asynchronous forks and joins,
and the forwarding of requests onto another queue; and service with a second phase [17].
An approximate solution to the layered queuing model can be generated automatically
using the layered queuing network solver (LQNS) making the method relatively easy to
use; all that is required when creating the model is specifying the system queuing
network configuration. Performance metrics generated include response times,
throughputs and utilisation information for each service class at each processor.
The application model specified in section 2 is defined as a layered queuing model.
The database server disk is modelled as a processor that can only process one request at a
time. Processing times are assumed to be exponentially distributed. Requests in the
workload are broken down into ‘request types’ that are expected to exhibit similar
performance characteristics due to the operations being called and the amount of data
associated with the request. The parameters to the model are:
− Queuing network configuration: the maximum number of requests each processor
can process at the same time via time-sharing;
− Service class specific: amount of workload, the workload mix (the expected
percentage of the different request types received each second);
− Request-type specific: mean processing times on each server, average number of
database requests per application server request.
The per-request type parameters can be calibrated by taking an established server offline and sending a workload consisting only of that request type; the parameters are
calculated from the resulting throughput (in requests/second) and the CPU usage of each
server. The model can then be evaluated for any heterogeneous workload. The request
processing speeds of new servers can be rapidly benchmarked using a ‘typical’ workload,
using either a max throughput or mean request processing time metric. Calculating a new
server’s mean request type processing times then involves multiplying the mean
processing times on an established server by the established/new server request
processing speed ratio.
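For illustration, the calibration arithmetic might be sketched as follows (the use of the service demand law to turn measured throughput and CPU usage into a mean processing time is our interpretation of the description above, and the function names are ours):

def mean_processing_time_ms(throughput_per_sec, cpu_utilisation):
    # Service demand law: if a dedicated server serves only one request type
    # at the measured throughput and CPU utilisation (0..1), the mean CPU
    # demand per request is utilisation / throughput.
    return 1000.0 * cpu_utilisation / throughput_per_sec

def new_server_processing_time_ms(established_time_ms, speed_ratio):
    # speed_ratio is the established/new request processing speed ratio, so
    # processing times on a faster new server shrink accordingly (section 5).
    return established_time_ms * speed_ratio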
5.1. Results
Each service class is calibrated on AppServF as detailed in table 2. Buy requests make
2 database requests, and browse requests make 1.14 database requests on average. The
application and database servers can process 50 and 20 requests at the same time via
time-sharing, respectively. LQNS produced solutions after a maximum of 3 seconds
under a convergence criterion of 20ms, on an Athlon 1.4GHz.
Processor      Browse (ms)    Buy (ms)
App. Server    4.505          8.761
DB Server      0.8294         1.613

Table 2. Layered queuing method processing time parameters as calibrated on AppServF
Figure 2 shows mean response time predictions at different numbers of clients under
the typical workload. The mean accuracy of the predictions for the established servers is
97.8% for throughput, and 68.8% for mean response time; the mean accuracy for the new
server is 97.1% for throughput, and 73.4% for mean response time. Predictions can also
be made for heterogeneous workloads, an example of which is illustrated in figure 4.
Although the historical predictions are more accurate, it is likely that the layered
queuing accuracies could be increased by better modelling of delays such as
communication overhead.
6. Hybrid Method
The hybrid method involves using a historical model, but with ‘pseudo’ historical data
generated using a layered queuing model. Basic hybrid models involve generating
historical data points to calibrate the relationships in the model before the server
architectures for which predictions are required are known. The predictive accuracy
of this approach can be increased by using an ‘advanced’ model in which layered queuing
is used to generate historical data for the server architectures for which predictions are
required – this allows the historical model to represent these proposed architectures as
‘established’ servers. However, layered queuing predictions are slower than historical
predictions due to the iterative numerical solution technique employed. This results in
hybrid predictions incurring a ‘start-up’ delay the first time a prediction is made for a
new server architecture. This adds to the existing start-up delay of benchmarking the new
server architecture’s request processing speed; after this initial start-up delay the more
responsive historical predictions can be used.
The hybrid method is evaluated using an advanced hybrid model created from the
historical and layered queuing models. The first time a prediction is required the layered
queuing model is calibrated (see section 5), after which this model is used to generate
historical data to calibrate relationships 1 and 3 of the historical model (see section 4).
Relationship 2 is not used as the layered queuing model generates historical data for
specific server architectures.
The historical model is calibrated by using the layered queuing model to generate a
maximum of 4 historical data points for the lower and upper relationship 1 equations for
each of the three servers. This resulted in a mean start-up delay of 11 seconds on an
Athlon 1.4GHz machine. The accuracy of the hybrid predictions is found to be similar to
those made using the layered queuing model only; the mean response time predictions for
the established servers are 67.1% accurate and for the new server 74.9% accurate.
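For illustration, the advanced hybrid calibration might be sketched as follows (lqn_predict_mrt is a stand-in for an LQNS invocation, not a real API; the log-linear and linear fits mirror the historical calibration of section 4.2):

import numpy as np

def calibrate_hybrid(lqn_predict_mrt, lower_clients, upper_clients):
    # Generate 'pseudo' historical data points from the layered queuing model
    # and calibrate the relationship 1 equations for one server architecture.
    lam_l, log_cl = np.polyfit(lower_clients,
                               np.log([lqn_predict_mrt(n) for n in lower_clients]), 1)
    lam_u, c_u = np.polyfit(upper_clients,
                            [lqn_predict_mrt(n) for n in upper_clients], 1)
    return np.exp(log_cl), lam_l, lam_u, c_u

After this one-off calibration (the start-up delay discussed above), predictions are served from the fitted equations rather than the solver.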
7. Extending the Case Study
The previous three sections show that predictions can be made with a good level of
accuracy using all three prediction methods. However there are two common practices in
distributed enterprise application systems that are not covered in this case study. The first
involves SLAs that are specified in terms of distribution-based as well as mean-based
metrics, and the second involves systems in which the application servers’ main memories
are used as caches. These are considered in the following two sections respectively. This will
allow the three performance prediction methods to be evaluated more thoroughly in terms
of the metrics that can be predicted and the systems that can be modelled.
7.1. Response Time Distribution Predictions
After max throughput (i.e. 100% application server CPU utilisation) is reached, the
most significant component of the response time is the application server queuing time
(as opposed to the database server disk access time). This results in two different
probability distribution functions for the response time of requests before and after
max throughput is reached. In the case study these two functions are found to be constant
(relative to the predicted mean response time) across server architectures with
heterogeneous processing speeds. Distribution predictions can therefore be extrapolated
from the mean response time prediction using these functions.
The response time distribution of requests in the case study is approximated by the
exponential/double exponential distribution for before/after 100% CPU utilisation. The
probability distribution functions are shown in equations 6 and 7, respectively.
P(X ≤ x) = 1 − e^(−(1/r_p) x)    (6)

P(X ≤ x) = 1 − (1/2) e^(−(x−a)/b)  for x ≥ r_p
P(X ≤ x) = (1/2) e^((x−a)/b)       for x < r_p    (7)
where a, the location parameter of the double exponential distribution, is set to r_p, and b,
the scale parameter, is found to be constant across servers with heterogeneous processing
speeds and is calibrated at 204.1.
SLAs are often specified in terms of a percentile metric specifying a percentage of
requests p whose response times must be less than a maximum response time r_max. Using the distribution
equations, the predicted response times in figure 2 are converted to a percentile metric
(with p=90%). All three methods give a good level of predictive accuracy; the historical
model predictions are 80%/88% accurate and the layered queuing predictions are
77%/69% accurate for new/established servers. The hybrid predictions are similar to the
layered queuing predictions at 77%/70% accurate for new/established servers. The
predictive accuracies are at most 4.6% lower than those of the corresponding mean
response time predictions. It is noted that percentile metrics can also be predicted directly
using the historical method (but not the layered queuing method or the hybrid method) to
avoid this small decrease in accuracy.
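For illustration, a mean response time prediction r_p can be converted to a p-th percentile by inverting equations (6) and (7); a minimal sketch (the function names are ours):

import math

B = 204.1  # calibrated double exponential scale parameter (see above)

def percentile_before_max(r_p, p=0.9):
    # Invert equation (6), an exponential distribution with mean r_p.
    return -r_p * math.log(1.0 - p)

def percentile_after_max(r_p, p=0.9, b=B):
    # Invert the upper branch of equation (7); valid for p >= 0.5, where the
    # percentile lies at or above the location parameter a = r_p.
    return r_p - b * math.log(2.0 * (1.0 - p))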
7.2. Modelling Caching
In distributed enterprise systems it is important for application data to be stored in the
database between client requests so clients can continue to use the application if the
application server to which they are connected fails. (Database servers are typically
hosted on machines with more fault tolerance hardware such as RAID disk arrays and
better backup facilities.) The Trade application (which is also an example of best practice
design) stores most of its data directly in the database as opposed to in the application
server’s memory so as to simplify recovering from application server failures. The
alternative to this is an indirect approach in which more data is stored in the application
server’s main memory; this data is then ‘persisted’ to the database after the response has
been returned to the client. This results in the application server’s memory acting as a
cache to the database, which can increase performance at the risk of data inconsistencies
(if the application server crashes whilst data is being persisted).
The effect of an architecture’s cache (i.e. main memory) size can be modelled using
the historical method by recording this as a variable and determining how this variable
affects the other variables/relationships as before. However it is found to be difficult to
predict the effect of a new server architecture’s main memory size using the layered
queuing method (and hence the hybrid method) when caching is used and when requests
from each client are not independent of the response time of previous requests from that
client (as is typically the case in distributed enterprise applications, see section 3.1). This
is explained as follows. In the case study when the workload does not fit in main
memory, the main memory will act as a cache (using a least recently used replacement
scheme) for the per-client ‘session’ data in the database. When a request misses the cache
an extra call to the database is incurred to read the session associated with the request.
Although the layered queuing model can be extended to include this extra database
call, it is difficult to calculate the average number of extra database calls that will be made
for each service class. This is because this value depends on the probability of a cache
miss for a client c in that service class. This in turn depends on the probability that the
number of bytes replaced in the cache during time Tc is greater than the cache size minus
the session data size for client c, where Tc is the time between requests for client c. This
probability in turn depends on the arrival rate and session data size distributions for all
the service classes. When requests from a client are not independent, these arrival rate
distributions are variable and must be predicted using the model. So the number of
database calls for each service class in the model depends on: i) the solution to the model;
and ii) the ability to extrapolate arrival rate distributions from the mean values predicted.
However the layered queuing method does not support parameters specified in terms of
metrics that the model predicts, and it is non-trivial to extend the layered queuing
numerical solution technique to include this.
8. Evaluation
8.1. Systems that can be Modelled
It has been shown that all three performance prediction methods can be used to make
mean response time predictions for the distributed enterprise application case study with
a good level of predictive accuracy. All three methods are also sufficiently powerful to
model variations on this system model. Examples include: some or all clients sending
requests at a constant rate; priority queuing disciplines; and application components
communicating using asynchronous calls. However it has also been shown that it is
non-trivial to extend the layered queuing (and hence hybrid) models to predict the effect of
caching, whereas this is possible using the historical method.
It is also noted that all three methods can model systems containing queues that are not
explicitly defined, including bottlenecks for example. A bottleneck could be caused by
application server threads requiring simultaneous access to a critical code section.
However the layered queuing method and the hybrid method require additional profiling
to model the extra queues created.
8.2. Metrics that can be Predicted
It has been shown that response time predictions can be made for different workload
levels. However, resource managers typically require predictions for the maximum
number of clients an SLA-constrained server can support. This can be predicted using the
historical and hybrid methods by rewriting equations 1 and 2 in terms of the mean
response time. However in the current layered queuing solver the number of clients can
only be an input so it is necessary to search for a number of clients that results in
response times just below SLA compliance.
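For illustration, that search might be a simple binary search (a sketch; predict_mrt stands in for a layered queuing solver evaluation and is assumed non-decreasing in the number of clients):

def max_clients_under_sla(predict_mrt, rt_goal_ms, lo=1, hi=20000):
    # Largest number of clients whose predicted mean response time still
    # meets the SLA response time goal.
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if predict_mrt(mid) <= rt_goal_ms:
            lo = mid
        else:
            hi = mid - 1
    return lo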
A limitation of the layered queuing (and hence hybrid) methods is that they can only
make mean value response time predictions whereas SLAs are often specified using
percentile response time metrics. However it has been found to be possible in the case
study to measure and extrapolate distributions from the mean value predictions. Using
this technique it has been shown that percentile response time metrics can be predicted
with a good level of accuracy. Another limitation of the layered queuing and hybrid
methods is that they can only make steady state predictions. The historical method does
not suffer from these restrictions as it can record (as variables) both percentile metrics
and the time the server has been stabilising toward the steady state. In fact the historical
method can extrapolate and predict a range of metrics whereas the metrics that the
layered queuing (and hence hybrid) methods can predict are fixed using the current
solver.
8.3. Ease with which a Model can be Created and Level of Expertise Required
It has been shown that the layered queuing method is more restrictive in the systems
that it can model and the metrics it can predict. However, layered queuing models have
also been found to be easy to create with a minimum level of performance modelling
expertise, as a model specifies just the system’s queuing network configuration. In
contrast creating a historical model involves specifying and validating how predictions
will be made. As a result, despite the HYDRA tool simplifying the model creation
process, it is still harder to create a historical model than a layered queuing model and the
process requires more performance modelling expertise. Layered queuing models also
have the advantage that they can be calibrated using a small workload, whereas it has
been shown that historical models require calibrating at both small and large workloads
(i.e. to calibrate both lower and upper equations in relationship 1). Creating a hybrid
model requires that the performance analyst is capable of creating two types of
performance model and so requires the most performance modelling expertise. However
the hybrid method also simplifies calibrating and validating the historical component of
the hybrid model. This is because historical data can be generated using the layered
queuing model as opposed to having to record historical data under a range of workloads
and server architectures. As a result it has been found to be easier to create a hybrid
model than a historical model.
8.4. Overhead of Dynamic Model Recalibration
It has been shown that accurate historical predictions can be made even with a very
limited amount of historical data. As a result the historical method can rapidly yet
accurately re-calibrate relationship parameters, and when using the hybrid method the
time to generate new historical data using the layered queuing method can be kept low. In
the layered queuing method (and hence hybrid method) re-calibrations require dedicated
access to a server and information on system configuration parameters. However the
layered queuing (and hence hybrid) methods do have the advantage that only one
application server (as opposed to two or more for the historical method) is required,
which may be helpful in small systems.
8.5. Delay when Evaluating a Prediction
The layered queuing method can require significant CPU time to make each prediction
(i.e. up to 3 seconds on an Athlon 1.4GHz); this may be a major limitation as the resource
management algorithm may need to make many predictions. This is made worse by the
fact that multiple predictions must be made when searching for the maximum number of
clients a server can support whilst still being in SLA compliance. The historical method
has the advantage that predictions can be made almost instantaneously. Hybrid method
predictions incur a ‘start-up’ delay the first time a prediction is made for a new server
architecture whilst historical data is generated; it has been shown that this can be as short
as an 11 second delay on an Athlon 1.4GHz. After the start-up delay the predictions are
almost instantaneous. It is noted that a resource manager may need to evaluate many
predictions for each server architecture, for example to evaluate the effect of allocating
different amounts and types of workload. If this is the case the total prediction evaluation
delay may be less than that for the layered queuing method.
9. Tuning a Prediction-Enhanced Resource Manager
SLA-based service providers (as defined in section 2) incur two main types of cost.
The first type of cost involves paying penalties for SLA failures (i.e. missing SLA
response time goals), and the second is the cost of using the servers in the system (e.g.
buying or renting the hardware). This section investigates how a prediction-enhanced
resource manager can balance these costs whilst compensating for predictive inaccuracy.
This will be investigated using a resource management algorithm which determines the
application servers to use to process a workload that is to be transferred to the service
provider. The algorithm also provides an initial division of the workload across the
servers obtained (which could then be modified by a workload manager).
The algorithm (see algorithm 1) takes as input a list of the service classes in the
workload and a list of available application servers. The service classes are sorted and
hence processed in order of priority, so if there are insufficient servers the lower priority
service classes are rejected from the system first. Since there is no priority queuing or
processing in the system model, the ideal application server selection algorithm (on line
6) would minimise the amount of workload with different SLA response time goals on
the same server. However, to facilitate the tuning analysis a short algorithm with a fast
evaluation time is considered more appropriate than an algorithm with near-optimal
efficiency; because of this a greedy approach to server selection is used. This involves
selecting the server which the performance model predicts can be allocated the most
clients from the current service class. An exception to this rule occurs when selecting the
last server that will be required by a service class; the algorithm takes the server that can
be allocated the smallest number of clients, given that it can still take all the clients
remaining to be allocated in the service class.
1. sort the service classes in order of increasing response time goal
2. current_service_class = first service class in list
3. do
4.    if (all clients in current_service_class allocated to an application server)
5.       current_service_class = next service class in list
6.    app_server = application_server_selection_algorithm()
7.    allocate clients from current_service_class to app_server until:
         maximum capacity is reached on app_server
         OR all clients in current_service_class are allocated to an application server
8. while (application servers with available capacity exist and unallocated clients exist)

Algorithm 1. Resource management algorithm. Each service class consists of a number of clients, each of which is initially ‘unallocated’. Application servers are considered to have available capacity unless the performance model predicts that adding an extra client from the current service class would result in some clients missing SLA response time goals.
When predictions are inaccurate some service classes may have insufficient servers at
runtime. To deal with this the system model in section 2 is extended so application
servers reject clients at runtime if response times are within a threshold of missing SLA
goals. This prevents all the existing clients on a server from also missing their SLA goals.
In practice, it is likely that the rejected workload would be handled by a second set of
servers that accept all workload. A generic strategy to compensate for predictive
inaccuracy and balance the service provider’s costs involves multiplying the number of
clients in each service class by a number which we refer to as the ‘slack’. The resource
manager then allocates application servers to service classes based on this modified
workload.
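For illustration, the greedy selection and the slack strategy might be sketched together as follows (a simplified sketch of Algorithm 1; predict_capacity stands in for the performance model, and a server is set aside once it has been selected):

def allocate_servers(service_classes, servers, predict_capacity, slack=1.0):
    # service_classes: list of (rt_goal_ms, no_of_clients) tuples.
    # predict_capacity(server, rt_goal) returns how many clients of a service
    # class the server can take without predicted SLA violations.
    allocation, free = [], list(servers)
    for rt_goal, clients in sorted(service_classes):  # increasing RT goal
        remaining = int(round(clients * slack))       # inflate by the slack
        while remaining > 0 and free:
            caps = {s: predict_capacity(s, rt_goal) for s in free}
            fitting = [s for s in free if caps[s] >= remaining]
            if fitting:
                # Last server needed: the smallest one that still fits (the
                # exception to the greedy rule described above).
                chosen = min(fitting, key=lambda s: caps[s])
            else:
                # Otherwise the server predicted to take the most clients.
                chosen = max(free, key=lambda s: caps[s])
            take = min(caps[chosen], remaining)
            if take > 0:
                allocation.append((chosen, rt_goal, take))
                remaining -= take
            free.remove(chosen)
        # Clients still unallocated here belong to the lowest priority
        # classes and are rejected first, as in the text.
    return allocation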
9.1. Results
The algorithm is evaluated using two cost metrics, the first being the percentage of
clients rejected from the servers (‘% SLA failures’). The second metric represents the
amount of server processing power allocated to the application, where an application
server’s processing power is defined as its max throughput under the typical workload.
For convenience this metric is recorded as a percentage of the total processing power of
the list of application servers and will be referred to as ‘% server usage’. We investigate
the effect of the resource management slack parameter for balancing these cost metrics
whilst compensating for predictive inaccuracy.
The case study in section 3 is extended to include a total of 16 application servers.
Eight of the servers have a new architecture (AppServS) and eight have the same
architectures as existing servers (4×AppServF and 4×AppServVF). Three service classes
are created by dividing the browse service class into two service classes with different
SLA response time goals. The workload that is to be allocated to the new servers is
defined as: 10% buy clients (RT goal: 150ms), 45% high priority browse clients (RT
goal: 300ms), and 45% low priority browse clients (RT goal: 600ms). The percentages
are selected based on the Trade application, which defines 10% of the standard workload
to be purchase requests. The response time goals are selected based on the response time
of the fastest application server at max throughput (~600ms).
The investigation begins by analysing how predictive inaccuracy can be compensated
for in order to reduce the % SLA failures to 0. This has been found to be straightforward
when the predictive error is uniform. Define y as the predictive accuracy, where
multiplying the actual number of clients by y gives the prediction. Experiments have
confirmed that setting the slack to y results in 0% SLA failures below 100% server usage
and a constant % server usage at any predictive accuracy.
Figure 5. % SLA failures when using the resource management algorithm at different loads
Figure 6. % Server usage when using the resource management algorithm at different loads
To examine the more interesting case of non-uniform predictive accuracy the more
accurate historical model is used to represent the real system response times, and the
hybrid model is used as the less accurate predictions. The layered queuing model is not
used due to the limitations discussed in section 8. Figures 5 and 6 show the resulting
performance of the resource management algorithm at different loads and slack levels, in
terms of the % SLA failure and % server usage performance metrics respectively. Each
line was generated in under one second. The average predictive accuracy of the non-uniform predictions (weighted by the number of servers in the server pool) is 92.5% (i.e.
y=1.075). However the minimum slack that results in 0% SLA failures before 100%
server usage is 1.1. The difference is due to some predictions being used more by the
resource management algorithm than others. For example the predictive accuracy of
AppServF is the highest of the three servers at 97.04%, but due to the design of the
algorithm the middle servers tend to be used less frequently.
It is noted that the irregular shape of the lines on the resource management
performance graphs is because runtime optimisations allow the resource manager to use
any available capacity the algorithm leaves on a server. So once the total workload
crosses a threshold and a small number of clients are allocated to an additional server, the
resource manager’s performance will temporarily improve (as can be seen at 9000
clients).
The second part of the investigation involves looking at how we can balance the %
SLA failure and % server usage costs. As the slack level is reduced below 1.1 the % SLA
failures will increase away from 0 and the % server usage will decrease. A new metric,
‘% server usage saving’ is defined as SU_max − % server usage, where SU_max is the % server
usage at the minimum slack level that results in 0% SLA failures (SU_max = 62.7% at a slack
of 1.1 in this set of experiments). ‘average % server usage saving’ and ‘average % SLA
failure’ metrics are also defined as the average % server usage saving and average %
SLA failure values across all loads prior to 100% server usage.
Figure 7. Algorithm cost metrics as the slack is reduced from 1.1 to 0
Figure 8. SLA failures/server usage relationship as slack is reduced from 1.1 to 0.9
Figure 7 shows the effect on the average % SLA failures and average % server usage
saving metrics, as slack is reduced from 1.1 to 0. During the first 0.1 reduction in slack,
the increase in average % SLA failures is smaller than the increase in the average %
server usage saving (as also shown in figure 8). This is because it requires a significant
amount of server processing power to guarantee that there will be no SLA failures at any
load. This is in part due to the runtime optimisations and the spikes on the % SLA failure
graph that they cause (see figure 5). Then, between a slack of 1.0 and 0.9 the rate of
increase of the two metrics is almost identical. As the slack is reduced further the average
% SLA failures goes up at a faster rate than the average % server usage saving until
100% SLA failures and SU_max = 62.7% server usage saving are reached at 0 slack (i.e. no
clients allocated).
Current work is investigating cost functions and how they can map SLA failure and
server usage metrics to their associated costs. Given such functions the y-axis of figure 7
could become a single cost axis by subtracting the cost saving due to the server usage
saving from the cost due to the SLA failures. Slack setting(s) with the lowest cost could
then be determined.
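For illustration, given linear cost functions (an assumption; the paper leaves the cost functions to current work) the two metrics could be collapsed into a single figure:

def total_cost(avg_sla_failure_pct, avg_usage_saving_pct,
               cost_per_sla_failure_pct, cost_per_usage_pct):
    # SLA penalty cost minus the saving from reduced server usage; the slack
    # setting minimising this quantity would then be selected.
    return (cost_per_sla_failure_pct * avg_sla_failure_pct
            - cost_per_usage_pct * avg_usage_saving_pct)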
10. Conclusion
This paper reports on a comparative evaluation of three methods for predicting mean
response times of heterogeneous workloads on new server architectures for an industrial
strength distributed enterprise application benchmark. To the best of our knowledge this
is the only comparison of the layered queuing, historical and hybrid prediction methods
using this benchmark. Results are presented showing that all three methods can be used
to make predictions for new server architectures with a good level of accuracy. It is also
shown that the historical method can make accurate predictions when only a very limited
amount of historical data is available. This paper also considers how two extensions to
the case study could be modelled using each method. This has involved showing that
response time distributions can be predicted with a good level of accuracy given a mean
response time prediction, and that it is difficult to predict the effect of caching using the
layered queuing method. The methods were evaluated (with a focus on how they could be
used to enhance a resource management algorithm) in terms of: the systems that can be
modelled; the metrics that can be predicted; the ease with which the models can be
created and the level of expertise required; the overheads of recalibrating a model; and
the delay incurred when evaluating a prediction. The paper also investigates how a
prediction-enhanced resource management algorithm can be tuned so as to compensate
for predictive inaccuracy and balance the costs of SLA failures and server usage. Future
work includes evaluating the strengths and weaknesses identified with each method on
different types of prediction-enhanced resource management algorithm.
Acknowledgments
The authors would like to thank Robert Berry, Beth Hutchison, Te-Kai Liu and Nigel
Thomas for their contributions towards this research. The work is sponsored in part by
the EPSRC (contract no. GR/S03058/01 and GR/R47424/01), the NASA AMES
Research Center administered by USARDSG (contract no. N68171-01-C-9012) and IBM
UK Ltd.
References
1. J. Aman, C. Eilert, D. Emmes, P. Yocom, D. Dillenberger, Adaptive Algorithms for Managing a Distributed Data Processing Workload, IBM Systems Journal, 36(2):242-283, 1997
2. Y. An, T. Kin, T. Lau, P. Shum, A Scalability Study for WebSphere Application Server and DB2 Universal Database, IBM White Paper, 2002. Available at: http://www.ibm.com/developerworks/
3. Apache JMeter User Manual. Available at: http://jakarta.apache.org/jmeter/index.html
4. K. Appleby, S. Fakhouri, L. Fong, G. Goldszmidt, M. Kalantar, S. Krishnakumar, D.P. Pazel, J. Pershing, B. Rochwerger, Oceano - SLA Based Management of a Computing Utility, 7th IFIP/IEEE International Symposium on Integrated Network Management, New York, May 2001
5. D. Bacigalupo, S.A. Jarvis, L. He, G.R. Nudd, An Investigation into the Application of Different Performance Prediction Techniques to e-Commerce Applications, Workshop on Performance Modelling, Evaluation and Optimization of Parallel and Distributed Systems, 18th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2004), New Mexico, USA, April 2004
6. Y. Diao, J. Hellerstein, S. Parekh, Stochastic Modeling of Lotus Notes with a Queueing Model, Computer Measurement Group International Conference (CMG 2001), California, USA, December 2001
7. M. Endrel, IBM WebSphere V4.0 Advanced Edition Handbook, IBM International Technical Support Organisation Pub., 2002. Available at: http://www.redbooks.ibm.com/
8. M. Goldszmidt, D. Palma, B. Sabata, On the Quantification of e-Business Capacity, ACM Conference on Electronic Commerce (EC 2001), Florida, USA, October 2001
9. IBM WebSphere Performance Sample: Trade. Available at: http://www.ibm.com/software/info/websphere/
10. T. Liu, S. Kumaran, J. Chung, Performance Modeling of EJBs, 7th World Multiconference on Systemics, Cybernetics and Informatics (SCI 2003), Florida, USA, 2003
11. Z. Liu, M.S. Squillante, J. Wolf, On Maximizing Service-Level-Agreement Profits, ACM Conference on Electronic Commerce (EC 2001), Florida, USA, October 2001
12. Z. Liu, C.H. Xia, P. Momcilovic, L. Zhang, AMBIENCE: Automatic Model Building using InferENCE, IBM Research Report RC22961, November 2003. Available at: http://www.research.ibm.com
13. D. Menasce, Two-Level Iterative Queuing Modeling of Software Contention, 10th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunications Systems (MASCOTS 2002), Texas, USA, October 2002
14. J. Rolia, X. Zhu, M. Arlitt, A. Andrzejak, Statistical Service Assurances for Applications in Utility Grid Environments, 10th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunications Systems (MASCOTS 2002), Texas, USA, October 2002
15. F. Sheikh, M. Woodside, Layered Analytic Performance Modelling of a Distributed Database System, International Conference on Distributed Computing Systems (ICDCS '97), Maryland, USA, May 1997
16. J.D. Turner, D.A. Bacigalupo, S.A. Jarvis, D.N. Dillenberger, G.R. Nudd, Application Response Measurement of Distributed Web Services, International Journal of Computer Resource Measurement, 108:45-55, 2002
17. C.M. Woodside, J.E. Neilson, D.C. Petriu, S. Majumdar, The Stochastic Rendezvous Network Model for Performance of Synchronous Client-Server-like Distributed Software, IEEE Transactions on Computers, 44(1):20-34, 1995
18. L. Zhang, C. Xia, M. Squillante, W. Nathaniel Mills III, Workload Service Requirements Analysis: A Queueing Network Optimization Approach, 10th IEEE International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunications Systems (MASCOTS 2002), Texas, USA, October 2002