Utilising Located Functions to Model and Optimise Distributed Computations

Abstract

Reasoning about distributed systems is never easy, and developments in GRID computing and Web-based data storage are making the task of orchestrating computations even more difficult. When using these systems, identifying which of the available computation resources and which of the large, duplicated datasets to use quickly becomes non-trivial. In addition to reasoning about the problem itself, it is necessary to consider the costs of moving data (and also functions) in order to satisfy efficiency targets for the computation. An appropriate abstraction to assist with this reasoning in terms of resource location is needed. This paper presents a conceptual notation and performance model that enable e-researchers to reason about these computations and their optimisations, and so make choices that lead to the best use of available resources.

1. Introduction

[Traditional distributed systems modelling and problems, but modelling always good; complexity theory stipulates traditional engineering approaches not sufficient for designing current systems?; real distributed system engineering efforts (OMII-UK) lack appropriate modelling approaches and abstraction to understand/disseminate system knowledge; require appropriate level of abstraction; importance of considering location in distributed systems for data and functions – bandwidth is the bottleneck]

2. Located Functions

Located functions are an abstraction which helps with the description of operations performed using distributed systems. Consider a task in which data obtained from queries performed on two databases is combined and formatted for display to a user. This might be represented as a diagram such as that shown in Figure 1, in which a desired result (such as a diagram) is obtained by processing the results of queries to two databases into a single output.
[Figure: Database 1 and Database 2 are each accessed through a database service (query in, results out); the results feed a "Process Data & Visualise" step that produces the result.]

Figure 2: An operation on a distributed system

As exemplified by the Web Services philosophy that there is no need to worry about location, it is widely accepted that the locations of the various necessary resources are among the details that should be abstracted away when specifying a computation to be executed on a distributed system. Adopting this view, the operation in Figure 1 could be reduced to the expression in Figure 2.

f(g(D1, D2), h(D1, D3))

Figure 3: An example expression

2.1. Located Data

However, in the details of executing computations, locations are important: data and the functions which are to act upon them need to be co-located, which implies movement of one or both. At this level, the question that needs to be addressed is how to orchestrate the necessary encounters efficiently. A located function is a notation which permits new ways to reason about function execution in distributed systems. Alongside the high-level consideration of what needs to be evaluated, thought needs to be given to the practical issue of how it is to be achieved.

In the located functions notation, we "decorate" elements of the expression with location information. This has been done for the data required for the sample expression in Figure 4. This revised version of the expression uses the located function "x:" notation, in which the location x precedes the colon and the data follows it. It indicates that D1 is available at location 1, D2 at location 2 and D3 at location 3.

f(g(1:D1, 2:D2), h(1:D1, 3:D3))

Figure 4: Including Data Locations

Assuming f, g and h are common (or utility) functions which are readily available throughout the system and can be executed anywhere, Figure 4 contains all the information necessary to make rational decisions about how to evaluate the expression.
It is immediately apparent that one of D1, D2 has to be moved in order to evaluate g. Similarly, one of D1, D3 has to be moved in order to evaluate h. Moving both D2 and D3 to location 1 seems an obvious choice, since this permits f to be executed at location 1 without further movement of data.

2.2. Locating functions

In section 2.1 above, it was assumed that functions f, g and h are all widely available, but this is not always the case. It is normal in distributed systems for some functions (or computations) to be available only at particular locations. It is therefore necessary to add location information for the functions to an expression, which we achieve using the same notation as for data. See Figure 5, in which it is stated that function f is available at locations 1 and 2, g is available at location 2 and h is available at locations 1 and 3.

1/2:f(2:g(1:D1, 2:D2), 1/3:h(1:D1, 3:D3))

Figure 5: Expression with function locations

From this further elaborated expression, it is clear that the result of at least one of g or h has to be moved for f to be executed. There is also a choice for the execution of h: it is available at locations 1 and 3, and the input it needs is divided between locations 1 and 3. If f is run at location 1, it is necessary to move the result of g from location 2 but, if h were also run at location 1, there would be no need to move its result. Similarly, if f is run at location 2, the need to move the result of g is eliminated, but the result of h has to be moved (to 2 from either 1 or 3).

In practical situations, it is likely that some functions will be universally available and some will be located. In today's connected world, it is also likely that data will be available from more than one location.
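The choices described above can be enumerated mechanically. The following is a minimal Python sketch, not part of the notation itself: the dictionary layout and the helper names `placements` and `movements` are illustrative assumptions, while the locations are those given in Figures 4 and 5.

```python
from itertools import product

# Locations at which each function is available (from Figure 5).
FUNCTION_LOCATIONS = {"f": [1, 2], "g": [2], "h": [1, 3]}
# Location of each dataset (from Figure 4).
DATA_LOCATIONS = {"D1": 1, "D2": 2, "D3": 3}
# Inputs of each function: datasets or the results of other functions.
INPUTS = {"g": ["D1", "D2"], "h": ["D1", "D3"], "f": ["g", "h"]}


def placements():
    """Yield every assignment of the functions to their available locations."""
    names = list(FUNCTION_LOCATIONS)
    for combo in product(*(FUNCTION_LOCATIONS[n] for n in names)):
        yield dict(zip(names, combo))


def movements(placement):
    """List the (item, source, destination) moves that a placement requires."""
    moves = []
    for func, inputs in INPUTS.items():
        dest = placement[func]
        for item in inputs:
            # A function result sits where that function ran;
            # a dataset sits at its home location.
            src = placement[item] if item in placement else DATA_LOCATIONS[item]
            if src != dest:
                moves.append((item, src, dest))
    return moves


for p in placements():
    print(p, movements(p))
```

Each printed line pairs one placement with the data and result movements it implies; for this expression there are four placements, since g has only one possible location.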
Clearly, identifying which of the possible locations should be used for the various portions of such a function can be quite complex: two possible locations for each of f, D1 and D2, and (at least) three for g, give rise to a minimum of 24 potential ways to distribute the function amongst the locations. We suggest that an appropriate approach is to base the decision on an estimate of the relative execution time for each of the possibilities, using a time-like cost calculation. To compute these figures we need more information: the sizes of the datasets involved and the bandwidth available for the relocation of data between the various locations. Using this data, we can arrive at an estimate for the time cost of moving a dataset between two locations and use this cost to inform decisions about how to compute a function.

2.3. Adding in Function Costs

In order to make a rational decision, we need a measure of the implications of these various decisions. For the movement of data, the likely cost in time is determined by the size of the dataset to be moved and the (available) bandwidth between the source and destination. In the case of functions, the time cost depends on the amount of data to be processed and the processing power available at the location. We propose a (time-like) cost unit, the DEC (Distance Estimate Cost), to assist with making these decisions. In the case of a data movement, the cost is estimated as the size of the data to be moved divided by the available bandwidth, i.e. DEC = size(D) / bandwidth(A->B). For functions, it is less clear how to estimate the cost. Clearly the available processing power is a factor, but so too is the volume of data which has to be processed: many GRID operations work on very large datasets.
We propose a simple measure based on the total size of the parameters to a function divided by a measure of the power of the location, expressed as a rate at which it is able to process data, i.e. DEC = (sum of sizes of parameters) / data_throughput.

Table 1: Bandwidth between locations

Location   1    2    3
1          X    10   10
2          10   X    50
3          10   50   X

Table 2: Size of Datasets

Dataset       Size
D1            10
D2            90
D3            100
Result of g   100
Result of h   100

Table 3: Processing capability

Location   Data Throughput
1          5
2          20
3          1000

Suppose the (relative) bandwidths available between the various locations are as shown in Table 1, the (relative) sizes of the datasets are as shown in Table 2, and the processing power available at the locations is as shown in Table 3. The cost of executing g is then given by the cost of moving D1 to location 2 (from 1) plus the cost of processing g at location 2. There is also potentially the cost of moving D2, but D2 is already in the right place, so this cost is zero. Hence the cost of running g at location 2 is given by:

(10/10 + 0) + (10 + 90)/20 = 6

The cost of running h depends on where it is executed, and is one of:

(0 + 100/10) + (10 + 100)/5 = 32 (executed at 1)
(10/10 + 0) + (10 + 100)/1000 = 1.11 (executed at 3)

The cost of executing f is calculated in a similar manner. The total cost for the whole of the evaluation of f is the sum of the costs of evaluating g and h (wherever the computation is carried out), plus the cost of any movement of the outputs from g and h (which are assumed to be available at the location where each calculation is carried out), plus the processing cost of f itself. The final results are shown in Table 4, from which it is evident that, despite necessitating an extra dataset movement, easily the best option for this particular computation is to run function f at location 2 and h at location 3.
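The DEC calculation described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names (`move_cost`, `run_cost`, `total_cost`) and the dictionary layout are assumptions made for the example, bandwidth is assumed to be symmetric between two locations, and the figures are the relative values from Tables 1 to 3.

```python
# Relative figures from Tables 1-3 (bandwidth assumed symmetric).
BANDWIDTH = {(1, 2): 10, (1, 3): 10, (2, 3): 50}
SIZE = {"D1": 10, "D2": 90, "D3": 100, "g": 100, "h": 100}
THROUGHPUT = {1: 5, 2: 20, 3: 1000}
DATA_AT = {"D1": 1, "D2": 2, "D3": 3}


def move_cost(item, src, dst):
    """DEC of moving an item: size / bandwidth, or zero if already in place."""
    if src == dst:
        return 0.0
    return SIZE[item] / BANDWIDTH[tuple(sorted((src, dst)))]


def run_cost(inputs, sources, at):
    """DEC of running a function at `at`: input transfers plus processing."""
    transfer = sum(move_cost(i, s, at) for i, s in zip(inputs, sources))
    processing = sum(SIZE[i] for i in inputs) / THROUGHPUT[at]
    return transfer + processing


def total_cost(f_at, h_at, g_at=2):
    """Total DEC of f(g(D1, D2), h(D1, D3)) for one choice of locations."""
    g = run_cost(["D1", "D2"], [DATA_AT["D1"], DATA_AT["D2"]], g_at)
    h = run_cost(["D1", "D3"], [DATA_AT["D1"], DATA_AT["D3"]], h_at)
    # The results of g and h are located where those functions ran.
    f = run_cost(["g", "h"], [g_at, h_at], f_at)
    return g + h + f


# Compare every allowed placement of f (1 or 2) and h (1 or 3).
best = min((total_cost(f, h), f, h) for f in (1, 2) for h in (1, 3))
print(best)
```

Under these figures the minimum is reached with f at location 2 and h at location 3, in agreement with the conclusion stated in the text.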
Table 4: Total cost of expression

Location of f   Location of h   Total cost
1               1               6 + 32 + ((100/10 + 0) + (100+100)/5) = 88
1               3               6 + 1.11 + ((100/10 + 100/10) + (100+100)/5) = 67.11
2               1               6 + 32 + ((0 + 100/10) + (100+100)/20) = 58
2               3               6 + 1.11 + ((0 + 100/50) + (100+100)/20) = 19.11

This example is restricted to making choices about where to execute functions but, in today's connected world, it is likely that data will also be available from more than one provider, so that decisions need to be made about where to locate data as well as processing. In a connected (GRID) environment with data and processing offered by many providers, deciding how best to evaluate a desired result can be difficult.

2.4. Mobile Functions

In GRID computing, mobility need not be limited to data; functions can be mobile too. However, there are risks associated with executing imported code, so there are generally constraints which limit the mobility of functions. Provided the size of the executable code is modest in comparison with the datasets, mobile functions can be regarded as functions available at any of the locations to which they can be relocated (in addition to their actual location). Where the size of the code is significant, the calculation of an execution cost has to be elaborated further to include the cost of the movement, using the same technique as shown above for estimating the cost of moving data.

[Data Grid instance: OGSA-DAI; very light intro to real Grid security, its importance and issues it presents; apply security to bandwidth table]

3. Application of Located Functions to Model a Real Grid Deployment

The SEE-GEO (SEcurE access to GEOspatial services) project addresses an interoperability scenario that involves executing a query across two disparate data sets and rendering the result in a graphical format. The two deployed data resources involved in this interoperability experiment are firstly census statistics, which contains regional statistical information (e.g.
cost of various products), and secondly borders data, which contains geographical data on regions represented as polygons. Each of these services is presented over the web using domain-specific web service interfaces. For the SEE-GEO project, the OGSA-DAI Grid middleware was chosen to host this capability. OGSA-DAI enables multiple data resources, such as relational or XML databases, or files, to be exposed and accessible via a centralised web service. This OGSA-DAI web service is able to accept data access queries that may involve many of these connected data resources, and orchestrate that query across those federated resources to provide a result. Additionally, OGSA-DAI supports the execution of workflows that describe finely-grained specification of more complex interconnected activities. An activity is a basic OGSA-DAI unit of work, examples of which include an SQL data query, an XSL data transform (perhaps on the result of a query), and data delivery (for example, delivering the result of an XSL transform to another location). In particular, an activity can be an arbitrary function hosted by an OGSA-DAI web service. Essentially, it provides the ability to move data to and from locations, host and execute functions over that data, and organise tasks that involve these capabilities. In the SEE-GEO project, OGSA-DAI was employed to enable this cross-data resource query and graphical visualisation capability. 
This is represented in Figure 5.

[Figure: Portal, geoLink, Census DB (accessed via GDAS and getData: request attributes, attributes), Borders DB (accessed via WFS and getFeature: request features, polygons), Feature Portrayal Service (feature portrayal, request image, image) and Map Server.]

Figure 5: SEE-GEO geo-linking service constructed within OGSA-DAI

A query, generated by the portal, is received by the OGSA-DAI-enabled geoLink service, which obtains the appropriate data from the two data resources using domain-specific data resource interfaces (GDAS and WFS) and retrieval functions (getData and getFeature), executes a join across the received data, and then utilises the Feature Portrayal Service to render the data in a graphical format for delivery to a Map Server, which the client can access to obtain the result. Hence, this deployment utilises a number of OGSA-DAI's capabilities which are relevant to this paper: workflow functionality to orchestrate the overall computation; hosting of application-specific functions; consuming from and delivering to different types of data resource; dynamic selection of different data resources; and utilisation of additional levels of security enforced by the various data and service resources. The last of these will be discussed later in section 5.3.

3.1. Applying Located Functions to the Geo-Linking Scenario

When modelling a complex real system, achieving the correct level of abstraction is crucial [Peter ref?]. We could choose to model every implementation aspect of this scenario, but this would detract from the issues we wish to examine. The feature portrayal, getData and getFeature functions, which simply invoke their respective services, and the portal query being passed to the OGSA-DAI service, are examples of such distractions. They are necessary implementation detail, but modelling them does not offer any benefit. Therefore, for conciseness, we abstract this unnecessary detail away from the scenario.
The format of the located functions example given above can be used to model this scenario, with some modifications and elaboration:

5:fp(4:gl(2:reqA(4:Q, 2:census), 3:reqF(4:Q, 3:borders)))

Figure 6: Basic SEE-GEO scenario in located function notation

The numerics correspond to locations depicted in Figure 5. The alphabetic abbreviations correspond to:

gl: geoLink function
fp: feature portrayal request service function
reqA: census request service function
reqF: borders request service function
census: the census database
borders: the borders database

For simplicity, we omit the Map Server from this model at this stage. By taking into account the possibility that the data resources and functions above may reside in multiple locations, we can apply located functions to analyse this scenario. Let us assume, for example, that the census database is also available at a location 6 and that the feature portrayal service also resides at another location 7. We then arrive at the following expression in located function notation:

5/7:fp(4:gl(2/6:reqA(4:Q, 2/6:census), 3:reqF(4:Q, 3:borders)))

Figure 7: The extended SEE-GEO scenario represented in located function notation

For this example, we can consider the bandwidth, dataset size and processing power given in Table 5, Table 6 and Table 7 respectively.

Table 5: Available bandwidth between the SEE-GEO scenario locations

Location   2    3    4    5    6    7
2          X    -    10   -    5    -
4          10   10   X    15   20   20
6          5    -    20   -    X    -

(A dash marks a pair of locations between which no data movement is considered.)

We can simplify the bandwidth table by only considering the possible data movements: firstly, location 4, being the core orchestrating component of this scenario, needs to communicate with all other locations; secondly, there is the possibility of data movement between locations 2 and 6 (with the census database) implied by the notation rendering in Figure 7.
Table 6: Dataset size within the SEE-GEO scenario

Dataset           Size
Q                 0.1
census            18000
borders           1000
Result of reqA    50
Result of reqF    50
Result of gl      150

Table 7: Processing power available at the SEE-GEO locations

Location     Processing Power
1, 2, 3, 4   50
5            35
6            60
7            90

For simplicity of example, we concentrate on the differences in processing power at locations 5 and 7, assuming that the fp function performed at these locations is compute-intensive. As previously mentioned, the expression in Figure 7 implies that the reqA function executed at location 2 or 6 could require the census database to be moved from 6 to 2 or vice versa. However, these possibilities can be discounted early in the calculation, since the cost of moving the census database would be far too great given the size of the database and the bandwidth available. This leaves two options: location 2 is selected for both the reqA function and the census database, or location 6 is selected for both. Neither incurs a cost for moving the census database, since it resides at both locations. We will examine the calculations for the remaining possibilities in Table 8. The calculation column follows an "overall data transfer cost + overall computation cost" format.
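The "data transfer cost plus computation cost" scheme can be sketched in Python for the four remaining possibilities. This is a hedged illustration rather than the project's tooling: the names `move`, `step` and `scenario` are hypothetical, bandwidth is assumed symmetric, and the figures are those of Tables 5 to 7.

```python
# Relative figures from Tables 5-7 (bandwidth assumed symmetric).
BANDWIDTH = {(2, 4): 10, (3, 4): 10, (4, 5): 15, (4, 6): 20, (4, 7): 20}
SIZE = {"Q": 0.1, "census": 18000, "borders": 1000,
        "reqA": 50, "reqF": 50, "gl": 150}
POWER = {2: 50, 3: 50, 4: 50, 5: 35, 6: 60, 7: 90}


def move(item, src, dst):
    """DEC of moving an item between two locations (zero if co-located)."""
    if src == dst:
        return 0.0
    return SIZE[item] / BANDWIDTH[tuple(sorted((src, dst)))]


def step(inputs, at):
    """DEC of one function call: transfer of its inputs plus computation.

    `inputs` is a list of (item, current location) pairs.
    """
    transfer = sum(move(item, loc, at) for item, loc in inputs)
    compute = sum(SIZE[item] for item, _ in inputs) / POWER[at]
    return transfer + compute


def scenario(census_at, fp_at):
    """Cumulative DEC with reqA and census at `census_at` and fp at `fp_at`."""
    req_f = step([("Q", 4), ("borders", 3)], 3)
    req_a = step([("Q", 4), ("census", census_at)], census_at)
    gl = step([("reqA", census_at), ("reqF", 3)], 4)
    fp = step([("gl", 4)], fp_at)
    return req_f + req_a + gl + fp


options = {(c, p): scenario(c, p) for c in (2, 6) for p in (5, 7)}
best = min(options, key=options.get)
print(best, round(options[best], 2))
```

With these figures, `best` evaluates to `(6, 7)`: reqA and the census database at location 6, and fp at location 7.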
Table 8: DEC calculations for the SEE-GEO scenario

#Expr   Expression segment      Calculation                                           Cumulative DEC
#1      3:reqF(4:Q,3:borders)   ((0.1/10)+0) + ((0.1+1000)/50) = 0.01 + 20.002        20.012
#2      2:reqA(4:Q,2:census)    ((0.1/10)+0) + ((0.1+18000)/50) = 0.01 + 360.002      360.012
#3      6:reqA(4:Q,6:census)    ((0.1/20)+0) + ((0.1+18000)/60) = 0.005 + 300.002     300.007
#4      4:gl(#2, #1)            ((50/10)+(50/10)) + ((50+50)/50) = 10 + 2             12 + #2 + #1 = 392.02
#5      4:gl(#3, #1)            ((50/20)+(50/10)) + ((50+50)/50) = 7.5 + 2            9.5 + #3 + #1 = 329.52
#6      5:fp(#4)                (150/15) + (150/35) = 10 + 4.286                      14.286 + #4 = 406.31
#7      5:fp(#5)                (150/15) + (150/35) = 10 + 4.286                      14.286 + #5 = 343.81
#8      7:fp(#4)                (150/20) + (150/90) = 7.5 + 1.667                     9.167 + #4 = 401.19
#9      7:fp(#5)                (150/20) + (150/90) = 7.5 + 1.667                     9.167 + #5 = 338.69

The full expressions for the final four rows are:

#6: 5:fp(4:gl(2:reqA(4:Q,2:census),3:reqF(4:Q,3:borders)))
#7: 5:fp(4:gl(6:reqA(4:Q,6:census),3:reqF(4:Q,3:borders)))
#8: 7:fp(4:gl(2:reqA(4:Q,2:census),3:reqF(4:Q,3:borders)))
#9: 7:fp(4:gl(6:reqA(4:Q,6:census),3:reqF(4:Q,3:borders)))

From these calculations, we can observe from expression #9 that the 7:fp(4:gl(6:reqA(4:Q,6:census),3:reqF(4:Q,3:borders))) possibility is the optimum choice.

For clarity, we have not included the storage of the resultant image at the Map Server in this model. However, we can include this aspect, assuming a Map Server at a new location 8, by encapsulating the notation given in Figure 7 with the identity function, i.e. 8:I( … ), which has no computation cost, to reflect the result of the fp function being passed to the Map Server. This would simply involve a single data movement which does not lead to any further possibilities; location 8 would be the only location where a Map Server resides.

3.2. Issues in Real Grid Deployments

Security (where no security credentials exist between A and B, we can consider the bandwidth as being zero, regardless of the core networking infrastructure), movement of functions (e.g.
remote hot web service deployment – the BESC Service Cloud, GridSAM – ability to move functions as data for execution elsewhere, although this is computation grid). 3.3. Conclusions Future work – application of this approach to a computation grid (e.g. GridSAM). Obvious similarities.