Towards a Performance Model for Distributed Computations [with

Utilising Located Functions to Model and Optimise Distributed
Computations
Abstract
Reasoning about distributed systems is
never easy and developments in GRID
computing and Web based data storage are
making the task of orchestrating computations
even more difficult. When using these systems,
identifying which of the available computation
resources and large and duplicated datasets to
use quickly becomes non-trivial. In addition to
reasoning about the problem itself, it is
necessary to consider the costs of moving data
(and also functions), to satisfy efficiency targets
for the computation.
An appropriate abstraction to assist with
this reasoning in terms of resource location is
needed. This paper presents a conceptual
notation and performance model that enables e-researchers to reason about these computations
and their optimisations to make choices which
will lead to best use of available resources.
1. Introduction
[Traditional distributed systems modelling and
problems, but modelling always good; complexity
theory stipulates traditional engineering
approaches not sufficient for designing current
systems?; real distributed system engineering
efforts (OMII-UK) lack appropriate modelling
approaches and abstraction to
understand/disseminate system knowledge; require
appropriate level of abstraction; importance of
considering location in distributed systems for
data and functions – bandwidth is the bottleneck]
2. Located Functions
Located functions are an abstraction that helps
describe operations performed on distributed
systems. Consider a task in which data obtained
from queries on two databases is combined and
formatted for display to a user. This might be
represented by a diagram such as Figure 2, in
which the results of a query to each of two
databases are processed to produce a single
output (such as a diagram).
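[Diagram: a query is sent to the database service
in front of Database 1 and another to the database
service in front of Database 2; both sets of
results feed a "Process Data & Visualise" step,
which produces the result.]
Figure 2: An operation on a distributed system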
It is generally accepted, as exemplified by the
Web Services philosophy that there is no need to
worry about location, that the locations of the
various necessary resources are among the details
that should be abstracted away when specifying a
computation to be executed on a distributed
system. Adopting this view, the operation in
Figure 2 could be reduced to the expression in
Figure 3.
f(g(D1, D2), h(D1, D3))
Figure 3: An example expression
2.1. Located Data
However, in the details of executing
computations, locations are important; obviously
data and the functions which are to act upon
them need to be co-located which implies
movement of one or both. At this level, the
question that needs to be addressed is how to
orchestrate the necessary encounters efficiently.
A located function is a notation which permits
new ways to reason about function execution in
distributed systems.
Alongside the high-level consideration of what
needs to be evaluated, thought needs to be given
to the practical issue of how it is to be achieved.
In the located functions notation, we “decorate”
elements of the expression with location
information. This has been done for the data
required for the sample expression in Figure 4.
This revised version of the expression uses
the located function “x:” notation, in which the
location precedes the colon and the located item
follows it. It indicates that D1 is available at
location 1, while D2 is at location 2 and D3 is at
location 3.
f(g(1:D1, 2:D2), h(1:D1, 3:D3))
Figure 4: Including Data Locations
Assuming f,g,h are common (or utility)
functions which are readily available throughout
the system and can be executed anywhere,
Figure 4 contains all the information
necessary to make rational decisions about how
to evaluate the expression. It is immediately
apparent that one of D1, D2 has to be moved in
order to evaluate g. Similarly, one of D1, D3 has
to be moved in order to evaluate h and moving
both D2 and D3 to location 1 seems an obvious
choice since this permits f to be executed at 1
without further movement of data.
2.2. Locating functions
In section 2.1 above, it was assumed that
functions f,g,h are all widely available but this
isn’t always the case. It is normal in distributed
systems for some functions (or computations) to
only be available at particular locations.
Therefore it is necessary to add location
information for the functions to an expression as
well as for the data, which we achieve using the
same notation. See Figure 5, in which it is
stated that function f is available at locations 1
and 2, g is available at location 2 and h is
available at locations 1 and 3.
1/2:f(2:g(1:D1, 2:D2), 1/3:h(1:D1, 3:D3))
Figure 5: Expression with function
locations
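One way to make the notation concrete is to hold a located expression as a small tree. The following is a minimal sketch, not part of the original model: the class names `Data` and `Func` and the helper `render` are hypothetical, and the example simply rebuilds the expression with function locations described above.

```python
from dataclasses import dataclass, field
from typing import List, Set, Union


@dataclass
class Data:
    """A dataset together with the locations that hold a copy of it."""
    name: str
    locations: Set[int]


@dataclass
class Func:
    """A function, the locations where it can run, and its arguments."""
    name: str
    locations: Set[int]
    args: List[Union["Func", Data]] = field(default_factory=list)


def render(node: Union[Func, Data]) -> str:
    """Render a node in the 'locations:name(args)' located-function notation."""
    where = "/".join(str(loc) for loc in sorted(node.locations))
    if isinstance(node, Data):
        return f"{where}:{node.name}"
    inner = ", ".join(render(arg) for arg in node.args)
    return f"{where}:{node.name}({inner})"


# f is available at 1 and 2, g at 2, h at 1 and 3; D1, D2, D3 sit at 1, 2, 3.
d1, d2, d3 = Data("D1", {1}), Data("D2", {2}), Data("D3", {3})
expr = Func("f", {1, 2}, [Func("g", {2}, [d1, d2]), Func("h", {1, 3}, [d1, d3])])
print(render(expr))
```

Keeping the candidate locations as a set on every node means the same structure can later be walked to enumerate placements.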
From this further elaborated expression, it is
clear that the results of one of g or h have to be
moved for f to be executed. There is also a
choice for the execution of h; it is available at
locations 1 and 3 and the input it needs is divided
between locations 1 and 3. If f is run at location
1, it is necessary to move the result of g from
location 2 but, if h were run at location 1, there
would be no need to move its result. Similarly,
if f is run at location 2, then the need to move the
result of g is eliminated but the result of h has to
be moved (to 2 from either 1 or 3).
In practical situations, it is likely that some
functions will be universally available and some
will be located. Also, in today’s connected
world, it is likely that data will be available
from more than one location. Identifying which of
the possible locations should be used for the
various portions of such a function can then
become quite complex: two candidate locations for
each of f, D1 and D2 and (at least) three for g
give rise to a minimum of 24 potential ways to
distribute the function amongst the locations.
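The count of 24 is simply the product of the number of choices for each element. A minimal sketch of the enumeration follows; the location numbers are placeholders chosen only to give two options each for f, D1 and D2 and three for g.

```python
from itertools import product

# Hypothetical candidate locations: two each for f, D1 and D2, three for g.
candidates = {
    "f": [1, 2],
    "D1": [1, 2],
    "D2": [1, 2],
    "g": [1, 2, 3],
}

# Every placement picks exactly one location per element.
names = list(candidates)
placements = [dict(zip(names, combo)) for combo in product(*candidates.values())]

print(len(placements))  # 2 * 2 * 2 * 3 = 24
```

Each entry in `placements` is one complete assignment that a cost calculation could then score.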
We suggest an appropriate approach is to
base a decision on an estimate of the relative
execution time for each of the possibilities using
a time-like cost calculation, but to compute these
figures we need more information: we need to
know the sizes of the datasets involved and the
bandwidth available for the relocation of data
between the various locations. Using this data,
we can arrive at an estimate for the time cost of
moving a dataset between two locations and use
this cost to inform decisions about how to
compute a function.
2.3. Adding in Function Costs
In order to make a rational decision, we
need a measure of the implications of these
various decisions. For the movement of data, the
likely cost in time is determined by the size of
the dataset to be moved and the (available)
bandwidth between the source and destination.
In the case of functions, the time cost depends on
the amount of data to be processed and
processing power available at the location. We
propose a (time-like) cost unit to assist with
making these decisions, called the DEC
(Distance Estimate Cost). In the case of a data
movement, the cost is estimated as the size of the
data to be moved divided by the available
bandwidth (i.e., DEC = size(D) / bandwidth(A→B)).
For functions, it is less clear how to estimate
the cost. Clearly the available processing power
is a factor, but so too is the volume of data
which has to be processed: many Grid operations
work on very large datasets [2-4]. We propose a
simple measure based on the total size of the
parameters to a function divided by a measure of
the power of the location, expressed as a rate at
which it is able to process data (i.e., DEC = (sum
of sizes of parameters) / data_throughput).
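The two DEC estimates can be written down directly. This is a minimal sketch of the definitions as stated; the function names `transfer_dec` and `compute_dec` are our own, and the numbers in the usage lines are purely illustrative.

```python
def transfer_dec(size: float, bandwidth: float) -> float:
    """DEC of moving a dataset: its size divided by the available bandwidth."""
    return size / bandwidth


def compute_dec(param_sizes, throughput: float) -> float:
    """DEC of running a function: total parameter size over data throughput."""
    return sum(param_sizes) / throughput


# Illustrative numbers only: a dataset of size 10 over a link of bandwidth 10,
# then a function over parameters of size 10 and 90 at a throughput of 20.
move = transfer_dec(10, 10)
run = compute_dec([10, 90], 20)
print(move, run, move + run)
```

Both estimates share the same time-like unit, which is what allows transfer and processing costs to be added together.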
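Table 1: Bandwidth between locations

| Location | 1 | 2 | 3 |
|---|---|---|---|
| 1 | X | 10 | 10 |
| 2 | 10 | X | 50 |
| 3 | 10 | 50 | X |

Table 2: Size of Datasets

| Dataset | Size |
|---|---|
| 1 | 10 |
| 2 | 90 |
| 3 | 100 |
| Result of g | 100 |
| Result of h | 100 |

Table 3: Processing capability

| Location | Data Throughput |
|---|---|
| 1 | 5 |
| 2 | 20 |
| 3 | 1000 |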
If the (relative) bandwidths available
between the various locations are as shown in
Table 1 and the (relative) sizes of the datasets are
as shown in Table 2 and the processing power
available at the locations is as shown in Table 3
then the cost of executing g is given by the cost
of moving D1 to location 2 (from 1) plus the cost
of processing g at location 2. There is also
potentially the cost of moving D2, but this is
already in the right place, so this cost is zero.
Hence the cost of running g at location 2 is given
by:
(10/10 + 0) + (10 + 90)/ 20 = 6
The cost for running h depends on where it
is executed but is one of:
(0 + (100/10) ) + (10 + 100)/5 = 32
(executed at 1)
( (10/10) + 0 ) + (10 + 100)/1000 = 1.11
(executed at 3)
The cost of executing f is calculated in a
similar manner. The total cost for the whole of
the evaluation of f is the sum of the costs of
evaluating g and h (wherever the computation is
carried out) plus the cost of any movement of the
outputs from g,h (which are assumed to be
available at the location where the calculation is
carried out), and the processing cost of f itself.
The final results are shown in Table 4, from
which it is evident that, despite necessitating an
extra dataset movement, easily the best option
for this particular computation is to run function
f at location 2 and h at location 3.
Table 4: Total cost of expression

| Location of f | Location of h | Total cost |
|---|---|---|
| 1 | 1 | 6 + 32 + ((100/10 + 0) + (100+100)/5) = 88 |
| 1 | 3 | 6 + 1.11 + ((100/10 + 100/10) + (100+100)/5) = 67.11 |
| 2 | 1 | 6 + 32 + ((0 + 100/10) + (100+100)/20) = 58 |
| 2 | 3 | 6 + 1.11 + ((0 + 100/50) + (100+100)/20) = 19.11 |
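The four totals can be checked mechanically. The sketch below is an illustration under the assumptions of this section (g runs at location 2, bandwidth is symmetric, and a result is produced where its function runs); the helper names `move`, `run` and `total` are hypothetical.

```python
# Relative figures transcribed from Tables 1-3.
BANDWIDTH = {(1, 2): 10, (1, 3): 10, (2, 3): 50}
SIZE = {"D1": 10, "D2": 90, "D3": 100, "g": 100, "h": 100}
THROUGHPUT = {1: 5, 2: 20, 3: 1000}
DATA_AT = {"D1": 1, "D2": 2, "D3": 3}


def move(name, src, dst):
    """DEC of moving a dataset or result between two locations."""
    if src == dst:
        return 0.0
    return SIZE[name] / BANDWIDTH[tuple(sorted((src, dst)))]


def run(inputs, loc):
    """DEC of a function at loc: move each input there, then process them."""
    transfer = sum(move(name, src, loc) for name, src in inputs)
    processing = sum(SIZE[name] for name, _ in inputs) / THROUGHPUT[loc]
    return transfer + processing


def total(f_loc, h_loc, g_loc=2):
    """Total DEC of f(g(D1, D2), h(D1, D3)) for one choice of locations."""
    g_cost = run([("D1", DATA_AT["D1"]), ("D2", DATA_AT["D2"])], g_loc)
    h_cost = run([("D1", DATA_AT["D1"]), ("D3", DATA_AT["D3"])], h_loc)
    f_cost = run([("g", g_loc), ("h", h_loc)], f_loc)
    return g_cost + h_cost + f_cost


for f_loc in (1, 2):
    for h_loc in (1, 3):
        print(f"f at {f_loc}, h at {h_loc}: {total(f_loc, h_loc):.2f}")
```

Run as written, this prints 88.00, 67.11, 58.00 and 19.11 for the four placements, so f at location 2 with h at location 3 is the cheapest.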
This example is restricted to making choices
about where to execute functions, but in today’s
connected world it is likely that data will also
be available from more than one provider, so that
decisions need to be made about where to locate
data as well as processing. In a connected (Grid)
environment with data and processing offered by
many providers, deciding how best to evaluate a
desired result can be difficult.
2.4. Mobile Functions
In GRID computing mobility needn’t be
limited to data; functions can be mobile too.
However, there are risks associated with
executing imported code so there are generally
constraints which limit the mobility of functions.
Provided the size of the executable code is
modest in comparison with the datasets, mobile
functions can be regarded as functions available
at a choice of any of the locations to which they
can be relocated (in addition to their actual
location). Where the size of the code is
significant, a calculation of an execution cost has
to be elaborated further to include the cost of the
movement using the same technique as shown
above for estimating the cost of moving data.
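When the code is large enough to matter, the execution estimate gains one more transfer term. A minimal sketch follows, with a hypothetical function name and made-up figures; the code movement is costed exactly like a data movement.

```python
def mobile_function_dec(code_size, code_bandwidth, data_transfer_dec,
                        param_sizes, throughput):
    """DEC of executing a relocated function.

    code_size / code_bandwidth is the cost of shipping the executable,
    estimated in the same way as a data movement.
    """
    code_transfer = code_size / code_bandwidth
    processing = sum(param_sizes) / throughput
    return code_transfer + data_transfer_dec + processing


# Hypothetical figures: code of size 20 shipped over bandwidth 10 to a site
# that already holds both parameters (sizes 10 and 100) and has throughput 1000.
cost = mobile_function_dec(20, 10, 0.0, [10, 100], 1000)
print(round(cost, 2))
```

Setting `code_size` to zero recovers the case where the function is treated as already available at the chosen location.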
3. Located Functions in a Real Grid
Deployment
The SEE-GEO (SEcurE access to
GEOspatial services) [7] project addresses an
interoperability scenario that involves executing
a query across two disparate data sets and
rendering the result in a graphical format.
The two deployed data resources involved in
this interoperability experiment are, firstly,
census statistics, which contain regional
statistical information (e.g., cost of various
products), and secondly, borders data, which
contains geographical data on regions represented
as polygons. Each of these services is made
available over the web using domain-specific
web service interfaces. For the SEE-GEO
project, the OGSA-DAI Grid middleware [1, 8]
was chosen to host this capability.
OGSA-DAI enables multiple data resources,
such as relational or XML databases, or files, to
be exposed and accessible via a centralised web
service. This OGSA-DAI web service is able to
accept a query that may involve many of these
connected data resources, and orchestrate it
across federated resources to provide the result.
The basic unit of work in OGSA-DAI is
called an activity; examples include an SQL data
query, an XSL data transform (perhaps on the
result of a query), and data delivery (for
example, delivering the result of an XSL
transform to another location). An activity is an
arbitrary function hosted by an OGSA-DAI web
service. Essentially, OGSA-DAI provides the
ability to move data to and from locations, host
and execute functions over that data, and
organise tasks.
In the SEE-GEO project, OGSA-DAI was
selected to enable cross-data resource query and
graphical visualisation capability. This is
represented in Figure 5.
[Diagram: the Portal (1) sends a query to the
geoLink service hosted in OGSA-DAI (4). Within
OGSA-DAI, getData requests attributes from the
Census DB (2) through GDAS and getFeature requests
features (polygons) from the Borders DB (3)
through WFS. The attributes and polygons are
passed in an image request to the Feature
Portrayal Service (5), and the image is delivered
to the Map Server.]
Figure 5: SEE-GEO geo-linking service
constructed within OGSA-DAI
A query, generated by the portal, is received
by the OGSA-DAI-enabled geoLink service,
which obtains the appropriate data from the two
data resources using domain-specific data
resource interfaces (GDAS and WFS) and
retrieval functions (getData and getFeature), and
executes a join across the received data. It then
utilises the Feature Portrayal Service to render
the data in a graphical format for delivery to a
Map Server which the client can access to obtain
the result. This deployment utilises a number of
OGSA-DAI’s capabilities which are relevant to
this paper:
• Hosting application-specific functions
• Consuming and delivering to different types of
data resource
• Dynamic selection of different data resources
3.1. Applying Located Functions to the
Geo-Linking Scenario
When modelling a complex real system,
achieving the correct level of abstraction is
important [Peter ref?]. We could choose to
model every implementation aspect of this
scenario, but this would detract from the issues
we wish to examine. The feature portrayal,
getData and getFeature functions, which simply
invoke their respective services, and the portal
query being passed to the OGSA-DAI service,
are examples of such distractions. They are
necessary implementation detail, but modelling
them does not offer any benefit. Therefore, we
abstract away this unnecessary detail from this
scenario for conciseness.
The format of the located functions example
given above can be used to model this scenario,
with some modifications and elaboration:
5:fp(4:gl(2:reqA(4:Q, 2:census), 3:reqF(4:Q, 3:borders)))
Figure 6: Basic SEE-GEO scenario in
located function notation
The numerics correspond to locations
depicted in Figure 5. The alphabetic
abbreviations correspond to:
• gl: geoLink function
• fp: feature portrayal request service function
• reqA: census request service function
• reqF: borders request service function
• census: the census database
• borders: the borders database
For simplicity, we omit the Map Server from
the model at this stage. We can apply located
functions to analyse this scenario. Let us assume
that the census database is available at locations
2 and 6 and the feature portrayal service resides
at locations 5 and 7. We then arrive at the
following expression in located function
notation:
5/7:fp(4:gl(2/6:reqA(4:Q, 2/6:census), 3:reqF(4:Q, 3:borders)))
Figure 7: The extended SEE-GEO
scenario represented in located function
notation
For this example, let us consider the
bandwidth, dataset size and processing power
given in Tables 5, 6 and 7 respectively.
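Table 5: Available bandwidth between the
SEE-GEO scenario locations

| Location | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|
| 2 | X | - | 10 | - | 5 | - |
| 4 | 10 | 10 | X | 15 | 20 | 20 |
| 6 | 5 | - | 20 | - | X | - |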
We can simplify the bandwidth table by
considering only the possible data movements:
firstly, location 4, being the core orchestrating
component of this scenario, needs to communicate
with all other locations; secondly, data may move
between locations 2 and 6 (which hold the census
database), as implied by the notation in
Figure 7.
Table 6: Dataset size within the SEE-GEO
scenario

| Dataset | Size |
|---|---|
| Q | 0.1 |
| census | 18000 |
| borders | 1000 |
| Result of reqA | 50 |
| Result of reqF | 50 |
| Result of gl | 150 |
Table 7: Processing power available at
the SEE-GEO locations

| Location | Processing Power |
|---|---|
| 1, 2, 3, 4 | 50 |
| 5 | 35 |
| 6 | 60 |
| 7 | 90 |

We concentrate on the differences in
processing power at locations 5 and 7, assuming
that the fp function performed at these locations
is compute-intensive.
As previously mentioned, the expression in
Figure 7 implies that the reqA function executed
at location 2 or 6 could require the census
database to be moved from 6 to 2 or vice versa.
However, these possibilities can be discounted
early in the calculation given the cost of moving
the census database. This leaves either
location 2 being selected for both the reqA
function and the census database, or
location 6 being selected likewise, with no cost
associated with moving the census database
since it resides at both locations.
We give the calculations for the remaining
possibilities in Table 8. The calculation column
follows an overall data transfer cost + overall
computation cost format.

Table 8: DEC calculations for the SEE-GEO scenario

| #Expr | Expression segment | Calculation | Cumulative DEC |
|---|---|---|---|
| #1 | 3:reqF(4:Q,3:borders) | ((0.1/10)+0) + ((0.1+1000)/50) = 0.01 + 20.002 | 20.01 |
| #2 | 2:reqA(4:Q,2:census) | ((0.1/10)+0) + ((0.1+18000)/50) = 0.01 + 360.002 | 360.01 |
| #3 | 6:reqA(4:Q,6:census) | ((0.1/20)+0) + ((0.1+18000)/60) = 0.005 + 300.002 | 300.01 |
| #4 | 4:gl(#2, #1) | ((50/10)+(50/10)) + ((50+50)/50) = 10 + 2 | 12 + #2 + #1 = 392.02 |
| #5 | 4:gl(#3, #1) | ((50/20)+(50/10)) + ((50+50)/50) = 7.5 + 2 | 9.5 + #3 + #1 = 329.52 |
| #6 | 5:fp(#4), i.e. 5:fp(4:gl(2:reqA(4:Q,2:census),3:reqF(4:Q,3:borders))) | (150/15) + (150/35) = 10 + 4.286 | 14.286 + #4 = 406.31 |
| #7 | 5:fp(#5), i.e. 5:fp(4:gl(6:reqA(4:Q,6:census),3:reqF(4:Q,3:borders))) | (150/15) + (150/35) = 10 + 4.286 | 14.286 + #5 = 343.81 |
| #8 | 7:fp(#4), i.e. 7:fp(4:gl(2:reqA(4:Q,2:census),3:reqF(4:Q,3:borders))) | (150/20) + (150/90) = 7.5 + 1.667 | 9.167 + #4 = 401.19 |
| #9 | 7:fp(#5), i.e. 7:fp(4:gl(6:reqA(4:Q,6:census),3:reqF(4:Q,3:borders))) | (150/20) + (150/90) = 7.5 + 1.667 | 9.167 + #5 = 338.69 |
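The same comparison can be sketched in a few lines of code. This is an illustration only: it assumes symmetric bandwidth on every link to location 4, keeps the census database where reqA runs, and evaluates each term directly from Tables 5-7 (for example, 150/20 = 7.5 for the transfer of the gl result to location 7). The helper names `link`, `request` and `total` are hypothetical.

```python
from itertools import product

# Figures transcribed from Tables 5-7; only links to location 4 are needed.
BANDWIDTH_TO_4 = {2: 10, 3: 10, 5: 15, 6: 20, 7: 20}
SIZE = {"Q": 0.1, "census": 18000, "borders": 1000,
        "reqA": 50, "reqF": 50, "gl": 150}
POWER = {2: 50, 3: 50, 4: 50, 5: 35, 6: 60, 7: 90}


def link(loc):
    """Bandwidth between location 4 and loc (assumed symmetric)."""
    return BANDWIDTH_TO_4[loc]


def request(db, loc):
    """Send Q from 4 to loc, then query the database already held there."""
    return SIZE["Q"] / link(loc) + (SIZE["Q"] + SIZE[db]) / POWER[loc]


def total(census_loc, fp_loc):
    """Cumulative DEC of fp(gl(reqA(Q, census), reqF(Q, borders)))."""
    req_a = request("census", census_loc)
    req_f = request("borders", 3)
    gl = (SIZE["reqA"] / link(census_loc) + SIZE["reqF"] / link(3)
          + (SIZE["reqA"] + SIZE["reqF"]) / POWER[4])
    fp = SIZE["gl"] / link(fp_loc) + SIZE["gl"] / POWER[fp_loc]
    return req_a + req_f + gl + fp


costs = {(c, p): total(c, p) for c, p in product((2, 6), (5, 7))}
best = min(costs, key=costs.get)
for placement, cost in sorted(costs.items()):
    print(placement, round(cost, 2))
print("cheapest:", best)
```

The cheapest placement found is the census database and reqA at location 6 with fp at location 7. Because the code carries full precision through every step, individual totals can differ in the last digit from figures rounded by hand at each stage.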
From these calculations, we can observe
from expression #9 that the
7:fp(4:gl(6:reqA(4:Q,6:census),3:reqF(4:Q,3:borders)))
possibility is the optimum choice (due to
the extra bandwidth and processing power
afforded by the selected locations).
We have not included the storage of the
resultant image at the Map Server in this model,
for clarity. However, we can
include this aspect, assuming a Map Server at a
new location 8, by encapsulating the notation
given in Figure 7 with the identity function, i.e.,
8:I( … ), which has no computation cost, to
reflect the result of the fp function being passed
to the Map Server. This simply involves a data
movement and does not lead to any further
possibilities; location 8 would be the only
location where a Map Server resides.
3.2. Issues in Real Grid Deployments
When applying this technique in real
systems, there are a number of factors we can
also choose to consider.
Firstly, security on the Grid is a complex
issue [5], but one that can be included in our
model.
Despite the sophistication and
complexity of the various security mechanisms
available on the Grid, the issue is essentially
whether a client is authenticated and authorised
to access functions or data provided by a
particular service. Where this is not the case, we
can include this in our model by assuming a
bandwidth of zero for those client/server
relationships, regardless of the underlying
networking infrastructure.
We can also consider asymmetric bandwidth
between a client and server, where upload and
download speed may not necessarily equate. In
real server deployments, for example, this can be
due to bandwidth throttling to obtain a level of
fairness between clients [6]. Where this is the
case, since our approach inherently takes into
account the direction of data movement, we can
simply include the bidirectional bandwidth
figures in DEC calculations.
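Both refinements fit naturally into a bandwidth table keyed by direction. The following is a minimal sketch with hypothetical locations A, B and C and made-up figures; treating a missing entry as zero bandwidth is our own assumption.

```python
import math

# Hypothetical directed bandwidths keyed by (source, destination).
# A value of 0 models a client that is not authorised to use that route.
BANDWIDTH = {
    ("A", "B"): 10,   # upload direction
    ("B", "A"): 50,   # download direction
    ("A", "C"): 0,    # access denied
}


def transfer_dec(size, src, dst):
    """DEC of a directed data movement; infinite when the route is unusable."""
    bandwidth = BANDWIDTH.get((src, dst), 0)  # assumption: unknown link = 0
    if bandwidth == 0:
        return math.inf
    return size / bandwidth


print(transfer_dec(100, "A", "B"))
print(transfer_dec(100, "B", "A"))
print(transfer_dec(100, "A", "C"))
```

An infinite cost means any placement that depends on the forbidden route is never chosen when the cheapest option is selected.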
For simplicity, we have assumed static
bandwidth configurations. In reality, of course,
static figures cannot be guaranteed. The approach
detailed in this paper, although optimistic, still
offers a greater probability of optimised
resource usage, and enables real decisions to be
made. We could enhance this
probability by utilising dynamic, empirically
observed information on bandwidth and
processing throughput from a third party
resource monitoring system. Ganglia [9] is an
example of such a system, and provides access to
information concerning computing resources
within a Grid, including processing power.
3.3. Conclusions
Future work – application of this approach
to a computation grid (e.g. GridSAM). Obvious
similarities.
[1] M. Antonioletti, M. P. Atkinson, R. Baxter, A. Borley, N. P. Chue Hong, B. Collins, N. Hardman, A. Hume, A. Knox, M. Jackson, A. Krause, S. Laws, J. Magowan, N. W. Paton, D. Pearson, T. Sugden, P. Watson, and M. Westhead, "The Design and Implementation of Grid Database Services in OGSA-DAI," Concurrency and Computation: Practice and Experience, vol. 17, pp. 357-376, February 2005.
[2] F. Berman, A. J. G. Hey, and G. C. Fox, Grid Computing: Making the Global Infrastructure a Reality: John Wiley and Sons Ltd, 2003.
[3] J. Bradley, C. Brown, B. Carpenter, V. Chang, J. Crisp, S. Crouch, D. de Roure, S. Newhouse, G. Li, J. Papay, C. Walker, and A. Wookey, "The OMII Software Distribution," in UK e-Science All Hands Meeting 2006 (NeSC 2006), Nottingham, UK, pp. 748-753.
[4] I. Foster, C. Kesselman, and S. Tuecke, "The Anatomy of the Grid: Enabling Scaleable Virtual Organization," International Journal of Supercomputer Applications and High Performance Computing, vol. 15, pp. 200-222, 2001.
[5] I. Foster, C. Kesselman, G. Tsudik, and S. Tuecke, "A security architecture for computational grids," in 5th ACM conference on Computer and communications security, 1998.
[6] A. Hagin, N. Hagin, and V. Voinov, "Providing Quality of Service on the Web Using Bandwidth Throttling," in 5th Workshop of the OpenView University Association OVUA'98, Rennes, France, 1998.
[7] C. Higgins and G. Hobona, "Grid OGC Collision - the SEE-SAW projects," in 20th Open Grid Forum (OGF), Manchester, 2007.
[8] K. Karasavvas, M. Antonioletti, M. P. Atkinson, N. P. Chue Hong, T. Sugden, A. C. Hume, M. Jackson, A. Krause, and C. Palansuriya, "Introduction to OGSA-DAI Services," Lecture Notes in Computer Science, vol. 3458, pp. 1-12, May 2005.
[9] M. L. Massie, B. N. Chun, and D. E. Culler, "The Ganglia Distributed Monitoring System: Design, Implementation and Experience," Parallel Computing, vol. 30, July 2004.