Utilising Located Functions to Model and Optimise Distributed
Computations
Abstract
Reasoning about distributed systems is never easy, and developments in GRID computing and Web-based
data storage are making the task of orchestrating computations even more difficult. When using
these systems, identifying which of the available computation resources and large and duplicated datasets
to use quickly becomes non-trivial. In addition to reasoning about the problem itself, it is necessary to
consider the costs of moving data (and also functions), to satisfy efficiency targets for the computation.
An appropriate abstraction to assist with this reasoning in terms of resource location is needed. This
paper presents a conceptual notation and performance model that enables e-researchers to reason about
these computations and their optimisations, and so to make choices which will lead to the best use of
available resources.
1. Introduction
[Traditional distributed systems modelling and problems, but modelling always good; complexity
theory stipulates traditional engineering approaches not sufficient for designing current systems?; real
distributed system engineering efforts (OMII-UK) lack appropriate modelling approaches and abstraction
to understand/disseminate system knowledge; require appropriate level of abstraction; importance of
considering location in distributed systems for data and functions – bandwidth is the bottleneck]
2. Located Functions
Located functions are an abstraction which helps with the description of operations performed using
distributed systems. Consider a task in which data obtained from queries performed on two databases is
combined and formatted for display to a user. This might be represented as a diagram such as that shown
in Figure 2, in which a desired result is obtained by processing the results of queries to two databases to
produce a single output (such as a diagram).
[Figure: queries are sent via database services to Database 1 and Database 2; the results from both are
passed to a "Process Data & Visualise" step, which produces the result.]
Figure 2: An operation on a distributed system
As exemplified by the Web Services philosophy that there is no need to worry about location, it is
accepted that the locations of the various necessary resources are amongst the details that should be
abstracted away when specifying a computation to be executed on a distributed system. Adopting this
view, the operation in Figure 2 could be reduced to the expression in Figure 3.
f g D1 , D2 , hD1 , D3 
Figure 3: An example expression
2.1. Located Data
However, in the details of executing computations, locations are important: data and the functions
which are to act upon them need to be co-located, which implies movement of one or both. At this level,
the question that needs to be addressed is how to orchestrate the necessary encounters efficiently.
A located function is a notation which permits new ways to reason about function execution in distributed
systems.
Alongside the high-level consideration of what needs to be evaluated, thought needs to be given to the
practical issue of how it is to be achieved. In the located functions notation, we “decorate” elements of the
expression with location information. This has been done for the data required for the sample expression in
Figure 4. This revised version of the expression uses the located function “x:” notation, in which the
location precedes the colon and the data follows it, and indicates that D1 is available at location 1, while
D2 is in location 2 and D3 is in location 3.
f g 1 : D1,2 : D2 , h1 : D1,3 : D3 
Figure 4: Including Data Locations
Assuming f, g and h are common (or utility) functions which are readily available throughout the system and
can be executed anywhere, Figure 4 contains all the information necessary to make rational decisions about
how to evaluate the expression. It is immediately apparent that one of D1, D2 has to be moved in order to
evaluate g. Similarly, one of D1, D3 has to be moved in order to evaluate h, and moving both D2 and D3 to
location 1 seems an obvious choice since this permits f to be executed at 1 without further movement of data.
2.2. Locating functions
In section 2.1 above, it was assumed that functions f, g and h are all widely available, but this is not
always the case. It is normal in distributed systems for some functions (or computations) to be available
only at particular locations. Therefore it is necessary to add location information for the functions to an
expression as well as for the data, which we achieve using the same notation as for data. See Figure 5, in
which it is stated that function f is available at locations 1 and 2, g is available at location 2 and h is
available at locations 1 and 3.
1/ 2 : f 2 : g 1 : D1 ,2 : D2 ,1/ 3 : h1 : D1 ,3 : D3  Figure 5: Expression with function locations
From this further elaborated expression, it is clear that the results of one of g or h have to be moved for
f to be executed. There is also a choice for the execution of h; it is available at locations 1 and 3 and the
input it needs is divided between locations 1 and 3. If f is run at location 1, it is necessary to move the
result of g from location 2 but, if h were also run at location 1, there would be no need to move its result.
Similarly, if f is run at location 2, then the need to move the result of g is eliminated but the result of h
has to be moved (to 2 from either 1 or 3).
In practical situations, it is likely that some functions will be universally available and some will be
located. Also, in today’s connected world, it is likely that data will be available from more than one
location. Identifying which of the possible locations should be used for the various portions of such a
function can be quite complex; two locations each for f, D1 and D2 and (at least) three for g give rise to
a minimum of 2 × 2 × 2 × 3 = 24 potential ways to distribute the function amongst the locations.
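To make the count concrete, the following minimal sketch enumerates the placements. The location numbers assigned to each element are illustrative assumptions (the text gives only how many choices each element has), and the names `candidates` and `placements` are ours.

```python
from itertools import product

# Candidate locations per element. The specific location numbers are
# hypothetical; only the number of choices per element comes from the text.
candidates = {
    "f": [1, 2],
    "D1": [1, 2],
    "D2": [1, 2],
    "g": [1, 2, 3],
}

names = list(candidates)
# One dictionary per combination of choices, e.g. {"f": 1, "D1": 1, ...}.
placements = [dict(zip(names, combo)) for combo in product(*candidates.values())]

print(len(placements))  # 2 * 2 * 2 * 3 = 24
```

Each further element with more than one candidate location multiplies this count, which is why an explicit cost estimate is needed to choose between placements.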
We suggest an appropriate approach is to base a decision on an estimate of the relative execution time
for each of the possibilities using a time-like cost calculation, but to compute these figures we need more
information: we need to know the sizes of the datasets involved and the bandwidth available for the
relocation of data between the various locations. Using this data, we can arrive at an estimate for the time
cost of moving a dataset between two locations and use this cost to inform decisions about how to compute
a function.
2.3. Adding in Function Costs
In order to make a rational decision, we need a measure of the implications of these various decisions.
For the movement of data, the likely cost in time is determined by the size of the dataset to be moved and
the (available) bandwidth between the source and destination. In the case of functions, the time cost
depends on the amount of data to be processed and the processing power available at the location. We
propose a (time-like) cost unit to assist with making these decisions, called the DEC (Distance Estimate
Cost). In the case of a data movement, the cost is estimated as the size of the data to be moved divided by
the available bandwidth (i.e., DEC = size(D) / bandwidth(A->B)). For functions, it is less clear how to
estimate the cost. Clearly available processing power is a factor, but so too is the volume of data which has
to be processed: many GRID operations work on very large datasets. We propose a simple measure based on
the total size of the parameters to a function divided by a measure of the power of the location, expressed
as a rate at which it is able to process data (i.e., DEC = (sum of sizes of parameters) / data_throughput).
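The two DEC estimates can be written down directly. The sketch below is only an illustration of the formulas above; the function names `move_dec` and `compute_dec` and the sample numbers are our own assumptions.

```python
def move_dec(size, bandwidth):
    """DEC for a data movement: size(D) / bandwidth(A->B)."""
    return size / bandwidth


def compute_dec(param_sizes, data_throughput):
    """DEC for a function: (sum of sizes of parameters) / data_throughput."""
    return sum(param_sizes) / data_throughput


# Made-up numbers, purely to show the units cancelling to a time-like cost.
print(move_dec(200, 50))          # 4.0
print(compute_dec([30, 70], 25))  # 4.0
```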
Table 1: Bandwidth between locations
Location   1     2     3
1          X     10    10
2          10    X     50
3          10    50    X

Table 2: Size of Datasets
Dataset        Size
D1             10
D2             90
D3             100
Result of g    100
Result of h    100

Table 3: Processing capability
Location   Data Throughput
1          5
2          20
3          1000
If the (relative) bandwidths available between the various locations are as shown in Table 1 and the
(relative) sizes of the datasets are as shown in Table 2 and the processing power available at the locations is
as shown in Table 3 then the cost of executing g is given by the cost of moving D1 to location 2 (from 1)
plus the cost of processing g at location 2. There is also potentially the cost of moving D2, but this is
already in the right place, so this cost is zero. Hence the cost of running g at location 2 is given by:
(10/10 + 0) + (10 + 90)/ 20 = 6
The cost for running h depends on where it is executed but is one of:
(0 + (100/10) ) + (10 + 100)/5 = 32
(executed at 1)
( (10/10) + 0 ) + (10 + 100)/1000 = 1.11
(executed at 3)
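The three figures above can be reproduced with a short script. This is a sketch under our own naming (`BANDWIDTH`, `SIZE`, `HOME`, `THROUGHPUT`, `run_dec` are not from the paper); the numeric values are those of Tables 1 to 3.

```python
# Values from Tables 1-3; the names are illustrative.
BANDWIDTH = {frozenset({1, 2}): 10, frozenset({1, 3}): 10, frozenset({2, 3}): 50}
SIZE = {"D1": 10, "D2": 90, "D3": 100}
HOME = {"D1": 1, "D2": 2, "D3": 3}       # where each dataset currently resides
THROUGHPUT = {1: 5, 2: 20, 3: 1000}


def move_dec(size, src, dst):
    """DEC for moving data from src to dst; zero when it is already in place."""
    if src == dst:
        return 0.0
    return size / BANDWIDTH[frozenset({src, dst})]


def run_dec(inputs, at):
    """Transfer cost of bringing the inputs to `at`, plus the processing cost there."""
    transfer = sum(move_dec(SIZE[d], HOME[d], at) for d in inputs)
    compute = sum(SIZE[d] for d in inputs) / THROUGHPUT[at]
    return transfer + compute


g_at_2 = run_dec(["D1", "D2"], 2)   # 1 + 5
h_at_1 = run_dec(["D1", "D3"], 1)   # 10 + 22
h_at_3 = run_dec(["D1", "D3"], 3)   # 1 + 0.11

print(g_at_2, h_at_1, round(h_at_3, 2))
```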
The cost of executing f is calculated in a similar manner. The total cost for the whole of the evaluation
of f is the sum of the costs of evaluating g and h (wherever the computation is carried out), plus the cost of
any movement of the outputs from g and h (which are assumed to be available at the location where the
calculation is carried out), plus the processing cost of f itself. The final results are shown in Table 4,
from which it is evident that, despite necessitating an extra dataset movement, easily the best option for
this particular computation is to run function f at location 2 and h at location 3.
Table 4: Total cost of expression
Location of f   Location of h   Total cost
1               1               6 + 32 + ((100/10 + 0) + (100+100)/5) = 88
1               3               6 + 1.11 + ((100/10 + 100/10) + (100+100)/5) = 67.11
2               1               6 + 32 + ((0 + 100/10) + (100+100)/20) = 58
2               3               6 + 1.11 + ((0 + 100/50) + (100+100)/20) = 19.11
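The four totals can be checked by enumerating the placements of f and h. The sketch below uses our own names (`total_dec`, `totals`, `best`) and takes the costs of g and h from the worked calculations above; it is an illustration of the method rather than a definitive implementation.

```python
from itertools import product

BANDWIDTH = {frozenset({1, 2}): 10, frozenset({1, 3}): 10, frozenset({2, 3}): 50}  # Table 1
THROUGHPUT = {1: 5, 2: 20, 3: 1000}   # Table 3
RESULT_SIZE = 100                     # size of the results of g and of h (Table 2)
G_AT, G_DEC = 2, 6.0                  # g is only available at location 2
H_DEC = {1: 32.0, 3: 1.11}            # cost of h at each of its locations


def move_dec(size, src, dst):
    """DEC for moving data from src to dst; zero when it is already in place."""
    if src == dst:
        return 0.0
    return size / BANDWIDTH[frozenset({src, dst})]


def total_dec(f_at, h_at):
    """Cost of g and h, plus moving their results to f, plus running f."""
    transfer = move_dec(RESULT_SIZE, G_AT, f_at) + move_dec(RESULT_SIZE, h_at, f_at)
    compute = (RESULT_SIZE + RESULT_SIZE) / THROUGHPUT[f_at]
    return G_DEC + H_DEC[h_at] + transfer + compute


totals = {(f_at, h_at): total_dec(f_at, h_at)
          for f_at, h_at in product([1, 2], [1, 3])}
best = min(totals, key=totals.get)

for (f_at, h_at), dec in sorted(totals.items()):
    print(f"f at {f_at}, h at {h_at}: {dec:.2f}")
print("cheapest placement (f, h):", best)
```

Note that when f and h both run at location 1 the result of h does not move, so only the result of g contributes a transfer cost.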
This example is restricted to making choices about where to execute functions but, in today’s connected
world, it is likely that data will also be available from more than one provider, so that decisions need to
be made about where to locate data as well as processing. In a connected (GRID) environment with data and
processing offered by many providers, deciding how best to evaluate a desired result can be difficult.
2.4. Mobile Functions
In GRID computing mobility needn’t be limited to data; functions can be mobile too. However, there
are risks associated with executing imported code so there are generally constraints which limit the
mobility of functions. Provided the size of the executable code is modest in comparison with the datasets,
mobile functions can be regarded as functions available at a choice of any of the locations to which they
can be relocated (in addition to their actual location). Where the size of the code is significant, a
calculation of an execution cost has to be elaborated further to include the cost of the movement using the
same technique as shown above for estimating the cost of moving data.
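As a rough illustration of this elaboration, the sketch below adds a code-movement term to the execution cost. The code size of g is a hypothetical value (the text does not give one), and `mobile_run_dec` is our own name; the remaining numbers are those of Tables 1 to 3.

```python
def mobile_run_dec(code_size, code_bandwidth, data_moves, param_sizes, throughput):
    """DEC for shipping a function's code, moving its inputs, then running it.

    data_moves is a list of (size, bandwidth) pairs for the inputs that must move.
    """
    code_dec = code_size / code_bandwidth
    data_dec = sum(size / bandwidth for size, bandwidth in data_moves)
    compute_dec = sum(param_sizes) / throughput
    return code_dec + data_dec + compute_dec


CODE_SIZE_G = 5  # hypothetical: the size of g's executable is not given in the text

# Ship g from location 2 to location 1 (bandwidth 10), move D2 (size 90) to
# location 1 as well, then run g there on D1 and D2 (throughput 5).
g_moved_to_1 = mobile_run_dec(CODE_SIZE_G, 10, [(90, 10)], [10, 90], 5)
g_in_place_at_2 = 6.0  # cost of running g at location 2, from the worked example

print(g_moved_to_1, g_in_place_at_2)
```

Under these assumed numbers, relocating g is more expensive than running it in place, because location 1 has both a larger dataset to receive and a lower throughput; mobility widens the set of options but does not guarantee a cheaper one.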
[Data Grid instance: OGSA-DAI; very light intro to real Grid security, its importance and issues it
presents; apply security to bandwidth table]
3. Application of Located Functions to Model a Real Grid Deployment
The SEE-GEO (SEcurE access to GEOspatial services) project addresses an interoperability scenario
that involves executing a query across two disparate data sets and rendering the result in a graphical format.
The two deployed data resources involved in this interoperability experiment are firstly census
statistics, which contains regional statistical information (e.g. cost of various products), and secondly
borders data, which contains geographical data on regions represented as polygons. Each of these services
is presented over the web using domain-specific web service interfaces. For the SEE-GEO project, the
OGSA-DAI Grid middleware was chosen to host this capability.
OGSA-DAI enables multiple data resources, such as relational or XML databases, or files, to be
exposed and accessible via a centralised web service. This OGSA-DAI web service is able to accept data
access queries that may involve many of these connected data resources, and orchestrate that query across
those federated resources to provide a result. Additionally, OGSA-DAI supports the execution of
workflows that give a fine-grained specification of more complex interconnected activities. An activity
is a basic OGSA-DAI unit of work, examples of which include an SQL data query, an XSL data transform
(perhaps on the result of a query), and data delivery (for example, delivering the result of an XSL transform
to another location). In particular, an activity can be an arbitrary function hosted by an OGSA-DAI web
service. Essentially, it provides the ability to move data to and from locations, host and execute functions
over that data, and organise tasks that involve these capabilities.
In the SEE-GEO project, OGSA-DAI was employed to enable this cross-data resource query and
graphical visualisation capability. This is represented in Figure 5.
[Figure: the Portal sends a query to the geoLink service, which uses getData (via GDAS) to request
attributes from the Census DB and getFeature (via WFS) to request features (polygons) from the Borders DB,
and then requests an image from the Feature Portrayal Service for delivery to the Map Server.]
Figure 5: SEE-GEO geo-linking service constructed within OGSA-DAI
A query, generated by the portal, is received by the OGSA-DAI-enabled geoLink service which obtains
the appropriate data from the two data resources using domain-specific data resource interfaces (GDAS and
WFS) and retrieval functions (getData and getFeature), executes a join across the received data, and then
utilises the Feature Portrayal Service to render the data in a graphical format for delivery to a Map Server
which the client can access to obtain the result. Hence, this deployment utilises a number of OGSA-DAI’s
capabilities which are relevant to this paper:
• Workflow functionality to orchestrate the overall computation
• Hosting application-specific functions
• Consuming and delivering to different types of data resource
• Dynamic selection of different data resources
• Utilisation of additional levels of security enforced by the various data and service resources. This will
be discussed later in section 5.3.
3.1. Applying Located Functions to the Geo-Linking Scenario
When modelling a complex real system, achieving the correct level of abstraction is crucial [Peter
ref?]. We could choose to model every implementational aspect of this scenario, but this would detract
from the issues we wish to examine. The feature portrayal, getData and getFeature functions, which simply
invoke their respective services, and the portal query being passed to the OGSA-DAI service, are examples
of such detractions. They are necessary implementation detail, but modelling them does not offer any
benefit. Therefore, we abstract away this unnecessary detail from this scenario for conciseness.
The format of the located functions example given above can be used to model this scenario, with
some modifications and elaboration:
5:fp(4:gl(2:reqA(4:Q, 2:census), 3:reqF(4:Q, 3:borders)))
Figure 6. Basic SEE-GEO scenario in located function notation
The numerics correspond to locations depicted in Figure 5. The alphabetic abbreviations correspond
to:
• gl: geoLink function
• fp: feature portrayal request service function
• reqA: census request service function
• reqF: borders request service function
• census: the census database
• borders: the borders database
For simplicity, we omit the Map Server from this model at this stage. By taking into account the
possibility that the data resources and functions above may reside in multiple locations, we can apply
located functions to analyse this scenario. Let us assume for an example that the census database is also
available at a location 6 and the feature portrayal service also resides at another location 7. We then arrive
at the following expression in located function notation:
5/7:fp(4:gl(2/6:reqA(4:Q, 2/6:census), 3:reqF(4:Q, 3:borders)))
Figure 7: The extended SEE-GEO scenario represented in located function notation
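One possible way to hold such an expression in a program is as a small tree of located nodes. The sketch below is an assumption about representation, not something prescribed by the notation; the class names `Data` and `Func` and the helper `render` are ours.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Data:
    name: str
    locations: List[int]          # every location where the data is available


@dataclass
class Func:
    name: str
    locations: List[int]          # every location where the function can run
    args: List[Union["Func", Data]] = field(default_factory=list)


def render(node):
    """Write a node back out in located function notation."""
    prefix = "/".join(str(loc) for loc in node.locations)
    if isinstance(node, Data):
        return f"{prefix}:{node.name}"
    inner = ", ".join(render(arg) for arg in node.args)
    return f"{prefix}:{node.name}({inner})"


extended = Func("fp", [5, 7], [
    Func("gl", [4], [
        Func("reqA", [2, 6], [Data("Q", [4]), Data("census", [2, 6])]),
        Func("reqF", [3], [Data("Q", [4]), Data("borders", [3])]),
    ]),
])

print(render(extended))
# 5/7:fp(4:gl(2/6:reqA(4:Q, 2/6:census), 3:reqF(4:Q, 3:borders)))
```

Keeping the alternatives as a list on each node means the candidate placements can later be enumerated by choosing one location per node.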
For this example, we can consider the bandwidth, dataset size and processing power
given in Table 5, Table 6 and Table 7 respectively.
Table 5: Available bandwidth between the SEE-GEO scenario locations
Location   2     3     4     5     6     7
2          X     -     10    -     5     -
4          10    10    X     15    20    20
6          5     -     20    -     X     -
We can simplify the bandwidth table by only considering the possible data movements: firstly that
location 4, being the core orchestrating component of this scenario, needs to communicate with all other
locations, and secondly the possibility of data movement between locations 2 and 6 (with the census
database) implied by the 2/6 notation in Figure 7.
Table 6: Dataset size within the SEE-GEO scenario
Dataset           Size
Q                 0.1
census            18000
borders           1000
Result of reqA    50
Result of reqF    50
Result of gl      150
Table 7: Processing power available at the SEE-GEO locations
Location      Processing Power
1, 2, 3, 4    50
5             35
6             60
7             90
For simplicity of example, we concentrate on the differences of processing power at locations 5 and 7,
assuming that the fp function performed at these locations is compute-intensive.
As previously mentioned, the expression in Figure 7 implies that the reqA function executed on
locations 2 or 6 could require the census database to be moved from 6 to 2 or vice versa. However, these
possibilities can be discounted early during calculation since the cost of moving the census database would
be far too great given the size of the database and the bandwidth available. This results in either location 2
being selected for both the reqA function and the census database, or location 6 being selected
likewise, with no cost associated with moving the census database since it resides at both locations.
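A quick calculation shows why this pruning is safe. The sketch uses the census size, query size, bandwidths and processing powers from the tables for this scenario; the variable names are ours.

```python
CENSUS_SIZE = 18000    # dataset size of the census database
QUERY_SIZE = 0.1       # dataset size of the query Q
BANDWIDTH_2_6 = 5      # bandwidth between locations 2 and 6

# Cost of relocating the census database between locations 2 and 6.
move_census = CENSUS_SIZE / BANDWIDTH_2_6

# Cost of running reqA next to a copy of the database that is already in place:
# send Q from location 4, then process Q and the census data locally.
req_a_at_2 = QUERY_SIZE / 10 + (QUERY_SIZE + CENSUS_SIZE) / 50
req_a_at_6 = QUERY_SIZE / 20 + (QUERY_SIZE + CENSUS_SIZE) / 60

print(move_census, round(req_a_at_2, 3), round(req_a_at_6, 3))
```

The movement alone costs 3600 DEC, roughly ten times the cost of evaluating reqA in place at either location, so any placement that involves it cannot be competitive.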
We will examine the calculations for the remaining possibilities in Table 8. The calculation column
follows an "overall data transfer cost + overall computation cost" format.
Table 8: DEC calculations for the SEE-GEO scenario
#Expr  Expression segment      Calculation                                          Cumulative DEC
#1     3:reqF(4:Q,3:borders)   ((0.1/10)+0) + ((0.1+1000)/50) = 0.01 + 20.002       20.012
#2     2:reqA(4:Q,2:census)    ((0.1/10)+0) + ((0.1+18000)/50) = 0.01 + 360.002     360.012
#3     6:reqA(4:Q,6:census)    ((0.1/20)+0) + ((0.1+18000)/60) = 0.005 + 300.002    300.007
#4     4:gl( #2, #1 )          ((50/10)+(50/10)) + ((50+50)/50) = 10 + 2            12 + #2 + #1 = 392.02
#5     4:gl( #3, #1 )          ((50/20)+(50/10)) + ((50+50)/50) = 7.5 + 2           9.5 + #3 + #1 = 329.52
#6     5:fp( #4 )              (150/15) + (150/35) = 10 + 4.286                     14.286 + #4 = 406.31
       i.e. 5:fp(4:gl(2:reqA(4:Q,2:census),3:reqF(4:Q,3:borders)))
#7     5:fp( #5 )              (150/15) + (150/35) = 10 + 4.286                     14.286 + #5 = 343.81
       i.e. 5:fp(4:gl(6:reqA(4:Q,6:census),3:reqF(4:Q,3:borders)))
#8     7:fp( #4 )              (150/20) + (150/90) = 7.5 + 1.667                    9.167 + #4 = 401.19
       i.e. 7:fp(4:gl(2:reqA(4:Q,2:census),3:reqF(4:Q,3:borders)))
#9     7:fp( #5 )              (150/20) + (150/90) = 7.5 + 1.667                    9.167 + #5 = 338.69
       i.e. 7:fp(4:gl(6:reqA(4:Q,6:census),3:reqF(4:Q,3:borders)))
From these calculations, we can observe from expression #9 that the
7:fp(4:gl(6:reqA(4:Q,6:census),3:reqF(4:Q,3:borders))) possibility remains the optimum choice.
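The comparison can also be carried out programmatically. The sketch below is a minimal illustration under our own naming (`request_dec`, `total_dec`, `totals`, `best`); it evaluates each formula exactly as written from the bandwidth, dataset size and processing power values for this scenario (for example, 150/20 for the transfer of the gl result to location 7), without rounding intermediate results.

```python
from itertools import product

BANDWIDTH_TO_4 = {2: 10, 3: 10, 5: 15, 6: 20, 7: 20}   # links with location 4
POWER = {2: 50, 3: 50, 4: 50, 5: 35, 6: 60, 7: 90}     # processing power
SIZE = {"Q": 0.1, "census": 18000, "borders": 1000,
        "reqA": 50, "reqF": 50, "gl": 150}             # dataset and result sizes


def request_dec(database, at):
    """Send Q from location 4, then run the request function beside its database."""
    return SIZE["Q"] / BANDWIDTH_TO_4[at] + (SIZE["Q"] + SIZE[database]) / POWER[at]


def total_dec(req_a_at, fp_at):
    """Cumulative DEC for one choice of reqA location and fp location."""
    req_f = request_dec("borders", 3)
    req_a = request_dec("census", req_a_at)
    # Both request results travel to location 4, where gl joins them.
    gl = (SIZE["reqA"] / BANDWIDTH_TO_4[req_a_at]
          + SIZE["reqF"] / BANDWIDTH_TO_4[3]
          + (SIZE["reqA"] + SIZE["reqF"]) / POWER[4])
    # The gl result travels from location 4 to wherever fp runs.
    fp = SIZE["gl"] / BANDWIDTH_TO_4[fp_at] + SIZE["gl"] / POWER[fp_at]
    return req_f + req_a + gl + fp


totals = {(req_a_at, fp_at): total_dec(req_a_at, fp_at)
          for req_a_at, fp_at in product([2, 6], [5, 7])}
best = min(totals, key=totals.get)

for (req_a_at, fp_at), dec in sorted(totals.items(), key=lambda item: item[1]):
    print(f"reqA at {req_a_at}, fp at {fp_at}: {dec:.2f}")
print("cheapest placement (reqA, fp):", best)
```

The cheapest placement runs reqA at location 6 and fp at location 7. The census processing term dominates every total, so the choice of reqA location matters far more here than the choice of fp location.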
We have not included the storage of the resultant image at the Map Server in this model, for clarity.
However, we can include this aspect, assuming a Map Server at a new location 8, by encapsulating the
located function expression in the identity function, i.e. 8:I( … ), which has no computation cost, to
reflect the result of the fp function being passed to the Map Server. This would simply involve a single
data movement which does not lead to any further possibilities; location 8 would be the only location
where a Map Server resides.
3.2. Issues in Real Grid Deployments
Security (where no security credentials exist between A and B, we can consider the bandwidth as being
zero, regardless of the core networking infrastructure), movement of functions (e.g. remote hot web
service deployment – the BESC Service Cloud, GridSAM – ability to move functions as data for execution
elsewhere, although this is computation grid).
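One way to fold the security constraint into the cost model is sketched below. Treating a zero bandwidth as an infinite movement cost is our own reading of the note above, and the candidate routes and their figures are hypothetical.

```python
import math


def move_dec(size, bandwidth):
    """DEC for a data movement; a zero bandwidth (no credentials) is unusable."""
    if bandwidth == 0:
        return math.inf   # assumption: a forbidden link is modelled as infinite cost
    return size / bandwidth


# Hypothetical candidates: the same result of size 50 sent over a slow permitted
# link, or over a fast link for which no security credentials exist.
candidates = {
    "permitted, bandwidth 5": move_dec(50, 5),
    "no credentials": move_dec(50, 0),
}
best = min(candidates, key=candidates.get)

print(candidates)
print("selected:", best)
```

Because any total that includes an infinite term is itself infinite, a placement relying on a forbidden link is never selected by a minimum-cost search, with no special-case logic needed elsewhere.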
3.3. Conclusions
Future work – application of this approach to a computation grid (e.g. GridSAM). Obvious similarities.