Utilising Located Functions to Model and Optimise Distributed Computations

Abstract

Reasoning about distributed systems is never easy, and developments in Grid computing and Web-based data storage are making the task of orchestrating computations even more difficult. When using these systems, identifying which of the available computation resources and which of the large, duplicated datasets to use quickly becomes non-trivial. In addition to reasoning about the problem itself, it is necessary to consider the costs of moving data (and also functions) in order to satisfy efficiency targets for the computation. An appropriate abstraction to assist with this reasoning in terms of resource location is needed. This paper presents a conceptual notation and performance model that enable e-researchers to reason about these computations and their optimisations, and so to make choices which lead to the best use of available resources.

1. Introduction

[Traditional distributed systems modelling and problems, but modelling always good; complexity theory stipulates traditional engineering approaches not sufficient for designing current systems?; real distributed system engineering efforts (OMII-UK) lack appropriate modelling approaches and abstraction to understand/disseminate system knowledge; require appropriate level of abstraction; importance of considering location in distributed systems for data and functions – bandwidth is the bottleneck]

2. Located Functions

Located functions are an abstraction which helps with the description of operations performed using distributed systems. Consider a task in which data obtained from queries performed on two databases is combined and formatted for display to a user. This might be represented as a diagram such as that shown in Figure 1, in which a desired result is obtained by processing the results of queries to two databases to produce a single output (such as a diagram).
[Diagram: queries to Database 1 and Database 2 are answered by their database services, and the two sets of results are passed to a "Process Data & Visualise" step which produces the result.]

Figure 1: An operation on a distributed system

As exemplified by the Web Services philosophy that there is no need to worry about location, it is accepted that the locations of the various necessary resources are amongst the details that should be abstracted away when specifying a computation to be executed on a distributed system. Adopting this view, the operation in Figure 1 could be reduced to the expression in Figure 2.

f(g(D1, D2), h(D1, D3))

Figure 2: An example expression

2.1. Located Data

However, in the details of executing computations, locations are important: data and the functions which are to act upon them need to be co-located, which implies movement of one or both. At this level, the question that needs to be addressed is how to orchestrate the necessary encounters efficiently.

A located function is a notation which permits new ways to reason about function execution in distributed systems. Alongside the high-level consideration of what needs to be evaluated, thought needs to be given to the practical issue of how it is to be achieved. In the located functions notation, we “decorate” elements of the expression with location information. This has been done for the data required for the sample expression in Figure 4. This revised version of the expression uses the located function “x:” notation, in which the location precedes the colon and the located item follows it, and indicates that D1 is available at location 1, while D2 is at location 2 and D3 is at location 3.

f(g(1:D1, 2:D2), h(1:D1, 3:D3))

Figure 4: Including Data Locations

Assuming f, g and h are common (or utility) functions which are readily available throughout the system and can be executed anywhere, Figure 4 contains all the information necessary to make rational decisions about how to evaluate the expression.
It is immediately apparent that one of D1, D2 has to be moved in order to evaluate g. Similarly, one of D1, D3 has to be moved in order to evaluate h, and moving both D2 and D3 to location 1 seems an obvious choice since this permits f to be executed at 1 without further movement of data.

2.2. Locating functions

In section 2.1 above, it was assumed that functions f, g and h are all widely available, but this isn’t always the case. It is normal in distributed systems for some functions (or computations) to be available only at particular locations. It is therefore necessary to add location information for the functions to an expression as well as for the data, which we achieve using the same notation. See Figure 5, in which it is stated that function f is available at locations 1 and 2, g is available at location 2 and h is available at locations 1 and 3.

1/2:f(2:g(1:D1, 2:D2), 1/3:h(1:D1, 3:D3))

Figure 5: Expression with function locations

From this further elaborated expression, it is clear that the result of at least one of g or h may have to be moved for f to be executed. There is also a choice for the execution of h; it is available at locations 1 and 3, and the input it needs is divided between locations 1 and 3. If f is run at location 1, it is necessary to move the result of g from location 2 but, if h were also run at location 1, there would be no need to move its result. Similarly, if f is run at location 2, the need to move the result of g is eliminated but the result of h has to be moved (to 2 from either 1 or 3).

In practical situations, it is likely that some functions will be universally available and some will be located. In today’s connected world, it is also likely that data will be available from more than one location.
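The choices described above can be enumerated mechanically. The following Python sketch is illustrative only: the dictionary layout and the name `enumerate_placements` are assumptions made for this example and are not part of the located functions notation. It lists every way of assigning one of the permitted locations from Figure 5 to each function.

```python
from itertools import product

# Permitted locations for each function, read off Figure 5:
# 1/2:f, 2:g and 1/3:h. The dictionary layout is an assumption
# made for this sketch.
FUNCTION_LOCATIONS = {"f": [1, 2], "g": [2], "h": [1, 3]}


def enumerate_placements(function_locations):
    """Return every assignment of one permitted location to each function."""
    names = sorted(function_locations)
    choices = [function_locations[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*choices)]


placements = enumerate_placements(FUNCTION_LOCATIONS)
for placement in placements:
    print(placement)
```

For the expression in Figure 5 this yields four candidate placements (two choices for f, one for g, two for h). The number of candidates is the product of the number of choices for each element, so it grows quickly as more functions and datasets become available at several locations.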
Clearly, identifying which of the possible locations should be used for the various portions of such a function can be quite complex; two locations for each of f, D1 and D2, and (at least) three for g, give rise to a minimum of 24 potential ways to distribute the function amongst the locations. We suggest that an appropriate approach is to base the decision on an estimate of the relative execution time for each of the possibilities using a time-like cost calculation. To compute these figures we need more information: the sizes of the datasets involved and the bandwidth available for the relocation of data between the various locations. Using this data, we can arrive at an estimate for the time cost of moving a dataset between two locations and use this cost to inform decisions about how to compute a function.

2.3. Adding in Function Costs

In order to make a rational decision, we need a measure of the implications of these various decisions. For the movement of data, the likely cost in time is determined by the size of the dataset to be moved and the (available) bandwidth between the source and destination. In the case of functions, the time cost depends on the amount of data to be processed and the processing power available at the location.

To assist with making these decisions, we propose a (time-like) cost unit called the DEC (Distance Estimate Cost). In the case of a data movement from location A to location B, the cost is estimated as the size of the data to be moved divided by the available bandwidth (i.e., DEC = size(D) / bandwidth(A>B)). For functions, it is less clear how to estimate the cost. Clearly the available processing power is a factor, but so too is the volume of data which has to be processed: many Grid operations work on very large datasets [2-4].
We propose a simple measure based on the total size of the parameters to a function divided by a measure of the power of the location, expressed as the rate at which it is able to process data (i.e., DEC = (sum of sizes of parameters) / data_throughput).

Table 1: Bandwidth between locations

      1    2    3
1     X   10   10
2    10    X   50
3    10   50    X

Table 2: Size of Datasets

Dataset        Size
1                10
2                90
3               100
Result of g     100
Result of h     100

Table 3: Processing capability

Location   Data Throughput
1                        5
2                       20
3                     1000

Suppose the (relative) bandwidths available between the various locations are as shown in Table 1, the (relative) sizes of the datasets are as shown in Table 2, and the processing power available at the locations is as shown in Table 3. Then the cost of executing g is given by the cost of moving D1 to location 2 (from 1) plus the cost of processing g at location 2. There is also potentially the cost of moving D2, but this is already in the right place, so this cost is zero. Hence the cost of running g at location 2 is given by:

(10/10 + 0) + (10 + 90)/20 = 6

The cost of running h depends on where it is executed, and is one of:

(0 + 100/10) + (10 + 100)/5 = 32 (executed at 1)
(10/10 + 0) + (10 + 100)/1000 = 1.11 (executed at 3)

The cost of executing f is calculated in a similar manner. The total cost for the whole of the evaluation of f is the sum of the costs of evaluating g and h (wherever the computation is carried out), plus the cost of any movement of the outputs from g and h (which are assumed to be available at the location where the calculation is carried out), plus the processing cost of f itself. The final results are shown in Table 4, from which it is evident that, despite necessitating an extra dataset movement, easily the best option for this particular computation is to run function f at location 2 and h at location 3.
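The calculation above can also be carried out programmatically. The following Python sketch is a minimal illustration using the figures from Tables 1 to 3. The helper names (`move_cost`, `run_cost`, `total_cost`) and the data layout are assumptions made for this example, not an implementation supplied with the paper.

```python
# Relative figures taken from Tables 1-3. Bandwidth is assumed symmetric,
# so each link is keyed by the unordered pair of locations.
BANDWIDTH = {frozenset({1, 2}): 10, frozenset({1, 3}): 10, frozenset({2, 3}): 50}
SIZE = {"D1": 10, "D2": 90, "D3": 100, "g": 100, "h": 100}
THROUGHPUT = {1: 5, 2: 20, 3: 1000}


def move_cost(size, src, dst):
    """DEC of moving `size` units of data from src to dst (zero if co-located)."""
    if src == dst:
        return 0.0
    return size / BANDWIDTH[frozenset({src, dst})]


def run_cost(inputs, location):
    """DEC of running a function at `location`.

    `inputs` is a list of (size, current_location) pairs: the cost is the
    movement of every input to `location` plus the processing of all inputs.
    """
    transfer = sum(move_cost(size, src, location) for size, src in inputs)
    processing = sum(size for size, _ in inputs) / THROUGHPUT[location]
    return transfer + processing


def total_cost(f_loc, h_loc, g_loc=2):
    """Total DEC of f(g(D1, D2), h(D1, D3)) for one choice of locations."""
    g = run_cost([(SIZE["D1"], 1), (SIZE["D2"], 2)], g_loc)
    h = run_cost([(SIZE["D1"], 1), (SIZE["D3"], 3)], h_loc)
    f = run_cost([(SIZE["g"], g_loc), (SIZE["h"], h_loc)], f_loc)
    return g + h + f


totals = {(f_loc, h_loc): total_cost(f_loc, h_loc)
          for f_loc in (1, 2) for h_loc in (1, 3)}
best = min(totals, key=totals.get)

for (f_loc, h_loc), cost in sorted(totals.items()):
    print(f"f at {f_loc}, h at {h_loc}: {round(cost, 2)}")
print("cheapest (f, h) placement:", best)
```

Under these assumptions the sketch gives total costs of 88, 67.11, 58 and 19.11 for the four placements, and selects f at location 2 with h at location 3.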
Table 4: Total cost of expression

Location of f   Location of h   Total cost
1               1               6 + 32 + ((100/10 + 0) + (100+100)/5) = 88
1               3               6 + 1.11 + ((100/10 + 100/10) + (100+100)/5) = 67.11
2               1               6 + 32 + ((0 + 100/10) + (100+100)/20) = 58
2               3               6 + 1.11 + ((0 + 100/50) + (100+100)/20) = 19.11

This example is restricted to making choices about where to execute functions, but in today’s connected world it is likely that data will also be available from more than one provider, so that decisions need to be made about where to locate data as well as processing. In a connected (Grid) environment with data and processing offered by many providers, deciding how best to evaluate a desired result can be difficult.

2.4. Mobile Functions

In Grid computing, mobility needn’t be limited to data; functions can be mobile too. However, there are risks associated with executing imported code, so there are generally constraints which limit the mobility of functions. Provided the size of the executable code is modest in comparison with the datasets, mobile functions can be regarded as functions available at any of the locations to which they can be relocated (in addition to their actual location). Where the size of the code is significant, the calculation of an execution cost has to be elaborated further to include the cost of the movement, using the same technique as shown above for estimating the cost of moving data.

3. Located Functions in a Real Grid Deployment

The SEE-GEO (SEcurE access to GEOspatial services) [7] project addresses an interoperability scenario that involves executing a query across two disparate data sets and rendering the result in a graphical format. The two deployed data resources involved in this interoperability experiment are, firstly, census statistics, which contain regional statistical information (e.g., cost of various products), and secondly, borders data, which contain geographical data on regions represented as polygons.
Each of these services is made available over the web using domain-specific web service interfaces. For the SEE-GEO project, the OGSA-DAI Grid middleware [1, 8] was chosen to host this capability. OGSA-DAI enables multiple data resources, such as relational or XML databases, or files, to be exposed and accessed via a centralised web service. This OGSA-DAI web service is able to accept a query that may involve many of these connected data resources, and orchestrate it across federated resources to provide the result. The basic unit of work in OGSA-DAI is called an activity; examples include an SQL data query, an XSL data transform (perhaps on the result of a query), and data delivery (for example, delivering the result of an XSL transform to another location). An activity is an arbitrary function hosted by an OGSA-DAI web service. Essentially, OGSA-DAI provides the ability to move data to and from locations, host and execute functions over that data, and organise tasks.

In the SEE-GEO project, OGSA-DAI was selected to enable cross-data resource query and graphical visualisation capability. This is represented in Figure 5.

[Diagram: the Portal (1) passes a query to the geoLink service hosted within OGSA-DAI (4); getData requests attributes from the Census DB (2) through its GDAS interface; getFeature requests features (polygons) from the Borders DB (3) through its WFS interface; a Feature Portrayal request is sent to the Feature Portrayal Service (5) to obtain an image, which is delivered to the Map Server.]

Figure 5: SEE-GEO geo-linking service constructed within OGSA-DAI

A query, generated by the portal, is received by the OGSA-DAI-enabled geoLink service, which obtains the appropriate data from the two data resources using domain-specific data resource interfaces (GDAS and WFS) and retrieval functions (getData and getFeature), and executes a join across the received data. It then utilises the Feature Portrayal Service to render the data in a graphical format for delivery to a Map Server, which the client can access to obtain the result. This deployment utilises a number of OGSA-DAI’s capabilities which are relevant to this paper:

- Hosting application-specific functions
- Consuming and delivering to different types of data resource
- Dynamic selection of different data resources

3.1. Applying Located Functions to the Geo-Linking Scenario

When modelling a complex real system, achieving the correct level of abstraction is important [Peter ref?]. We could choose to model every implementational aspect of this scenario, but this would detract from the issues we wish to examine. The feature portrayal, getData and getFeature functions, which simply invoke their respective services, and the portal query being passed to the OGSA-DAI service, are examples of such detractions. They are necessary implementation detail, but modelling them does not offer any benefit. Therefore, we abstract away this unnecessary detail from this scenario for conciseness.

The format of the located functions example given above can be used to model this scenario, with some modifications and elaboration:

5:fp(4:gl(2:reqA(4:Q, 2:census), 3:reqF(4:Q, 3:borders)))

Figure 6. Basic SEE-GEO scenario in located function notation

The numerics correspond to locations depicted in Figure 5. The alphabetic abbreviations correspond to:

gl: geoLink function
fp: feature portrayal request service function
reqA: census request service function
reqF: borders request service function
census: the census database
borders: the borders database

For simplicity, we omit the Map Server from the model at this stage.

We can apply located functions to analyse this scenario. Let us assume that the census database is available at locations 2 and 6 and the feature portrayal service resides at locations 5 and 7.
We then arrive at the following expression in located function notation:

5/7:fp(4:gl(2/6:reqA(4:Q, 2/6:census), 3:reqF(4:Q, 3:borders)))

Figure 7: The extended SEE-GEO scenario represented in located function notation

For this example, let us consider the bandwidth, dataset size and processing power given in Tables 5, 6 and 7 respectively.

Table 5: Available bandwidth between the SEE-GEO scenario locations

      2    3    4    5    6    7
2     X    -   10    -    5    -
4    10   10    X   15   20   20
6     5    -   20    -    X    -

We can simplify the bandwidth table by only considering the possible data movements: firstly, location 4, being the core orchestrating component of this scenario, needs to communicate with all other locations; secondly, there is the possibility of data movement between locations 2 and 6 (with the census database) implied by the notation in Figure 7.

Table 6: Dataset size within the SEE-GEO scenario

Dataset           Size
Q                  0.1
census           18000
borders           1000
Result of reqA      50
Result of reqF      50
Result of gl       150

Table 7: Processing power available at the SEE-GEO locations

Location      Processing Power
1, 2, 3, 4                  50
5                           35
6                           60
7                           90

We concentrate on the differences in processing power at locations 5 and 7, assuming that the fp function performed at these locations is compute-intensive.

As previously mentioned, the expression in Figure 7 implies that executing the reqA function at location 2 or 6 could require the census database to be moved from 6 to 2 or vice versa. However, these possibilities can be discounted early in the calculation given the cost of moving the census database. This results in either location 2 being selected both for the reqA function and as the location of the census database, or location 6 being selected likewise, with no cost associated with moving the census database since it resides at both locations. We give the calculations for the remaining possibilities in Table 8, in which the expressions are numbered #1 to #9 in the order listed.
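The remaining possibilities can also be evaluated mechanically. The following Python sketch is a hedged illustration using the figures from Tables 5 to 7. The function names (`dec`, `scenario_cost`) and the directional `(source, destination)` bandwidth keys are assumptions made for this example, and only the links used by the scenario are listed.

```python
# Figures from Tables 5-7. Bandwidth is keyed by (source, destination);
# only the links this scenario actually uses are included.
BANDWIDTH = {(4, 2): 10, (2, 4): 10, (4, 3): 10, (3, 4): 10,
             (4, 6): 20, (6, 4): 20, (4, 5): 15, (4, 7): 20}
POWER = {2: 50, 3: 50, 4: 50, 5: 35, 6: 60, 7: 90}
SIZE = {"Q": 0.1, "census": 18000, "borders": 1000,
        "reqA": 50, "reqF": 50, "gl": 150}


def dec(inputs, location):
    """DEC of running a function at `location`.

    `inputs` is a list of (size, source_location) pairs. Inputs already at
    `location` incur no transfer cost.
    """
    transfer = sum(size / BANDWIDTH[(src, location)]
                   for size, src in inputs if src != location)
    processing = sum(size for size, _ in inputs) / POWER[location]
    return transfer + processing


def scenario_cost(census_loc, fp_loc):
    """Total DEC when reqA runs beside the census copy at `census_loc`."""
    req_f = dec([(SIZE["Q"], 4), (SIZE["borders"], 3)], 3)
    req_a = dec([(SIZE["Q"], 4), (SIZE["census"], census_loc)], census_loc)
    gl = dec([(SIZE["reqA"], census_loc), (SIZE["reqF"], 3)], 4)
    fp = dec([(SIZE["gl"], 4)], fp_loc)
    return req_f + req_a + gl + fp


costs = {(census_loc, fp_loc): scenario_cost(census_loc, fp_loc)
         for census_loc in (2, 6) for fp_loc in (5, 7)}
best = min(costs, key=costs.get)
print("cheapest (census/reqA location, fp location):", best)
```

Under these assumptions the cheapest combination uses the census database and reqA at location 6 and the feature portrayal service at location 7. Keying bandwidth by direction also allows asymmetric links to be modelled without changing the cost function.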
The calculation column follows an "overall data transfer cost + overall computation cost" format.

Table 8: DEC calculations for the SEE-GEO scenario

#Expr  Expression segment      Calculation                                          Cumulative DEC
#1     3:reqF(4:Q,3:borders)   ((0.1/10)+0) + ((0.1+1000)/50) = 0.01 + 20.002       20.012
#2     2:reqA(4:Q,2:census)    ((0.1/10)+0) + ((0.1+18000)/50) = 0.01 + 360.002     360.012
#3     6:reqA(4:Q,6:census)    ((0.1/20)+0) + ((0.1+18000)/60) = 0.005 + 300.002    300.007
#4     4:gl(#2, #1)            ((50/10)+(50/10)) + ((50+50)/50) = 10 + 2            12 + #2 + #1 = 392.024
#5     4:gl(#3, #1)            ((50/20)+(50/10)) + ((50+50)/50) = 7.5 + 2           9.5 + #3 + #1 = 329.519
#6     5:fp(#4)                (150/15) + (150/35) = 10 + 4.286                     14.286 + #4 = 406.31
#7     5:fp(#5)                (150/15) + (150/35) = 10 + 4.286                     14.286 + #5 = 343.80
#8     7:fp(#4)                (150/20) + (150/90) = 7.5 + 1.667                    9.167 + #4 = 401.19
#9     7:fp(#5)                (150/20) + (150/90) = 7.5 + 1.667                    9.167 + #5 = 338.69

In full, expressions #6 to #9 are:

#6: 5:fp(4:gl(2:reqA(4:Q,2:census),3:reqF(4:Q,3:borders)))
#7: 5:fp(4:gl(6:reqA(4:Q,6:census),3:reqF(4:Q,3:borders)))
#8: 7:fp(4:gl(2:reqA(4:Q,2:census),3:reqF(4:Q,3:borders)))
#9: 7:fp(4:gl(6:reqA(4:Q,6:census),3:reqF(4:Q,3:borders)))

From these calculations, we can observe from expression #9 that the 7:fp(4:gl(6:reqA(4:Q,6:census),3:reqF(4:Q,3:borders))) possibility is the optimum choice (due to the extra bandwidth and processing power offered by the selected locations).

For clarity, we have not included the storage of the resultant image at the Map Server in this model. However, we can include this aspect, assuming a Map Server at a new location 8, by encapsulating the notation given in Figure 7 with the identity function, i.e., 8:I( … ), which has no computation cost, to reflect the result of the fp function being passed to the Map Server. This simply involves a data movement and does not lead to any further possibilities; location 8 would be the only location where a Map Server resides.

3.2. Issues in Real Grid Deployments

When applying this technique in real systems, there are a number of further factors we can choose to consider.
Firstly, security on the Grid is a complex issue [5], but one that can be included in our model. Despite the sophistication and complexity of the various security mechanisms available on the Grid, the issue is essentially whether a client is authenticated and authorised to access functions or data provided by a particular service. Where this is not the case, we can include this in our model by assuming a bandwidth of zero for those client/server relationships, regardless of the underlying networking infrastructure.

We can also consider asymmetric bandwidth between a client and server, where upload and download speeds are not necessarily equal. In real server deployments, for example, this can be due to bandwidth throttling to obtain a level of fairness between clients [6]. Where this is the case, since our approach inherently takes into account the direction of data movement, we can simply include the bandwidth figures for each direction in DEC calculations.

For simplicity, we have assumed static bandwidth configurations. In reality, of course, these cannot be guaranteed. The approach detailed in this paper, although optimistic, still offers a greater probability of optimised resource usage, and enables real decisions to be made. We could enhance this probability by utilising dynamic, empirically observed information on bandwidth and processing throughput from a third-party resource monitoring system. Ganglia [9] is an example of such a system, and provides access to information concerning computing resources within a Grid, including processing power.

3.3. Conclusions

Future work: application of this approach to a computation grid (e.g. GridSAM). Obvious similarities.

[1] M. Antonioletti, M. P. Atkinson, R. Baxter, A. Borley, N. P. Chue Hong, B. Collins, N. Hardman, A. Hume, A. Knox, M. Jackson, A. Krause, S. Laws, J. Magowan, N. W. Paton, D. Pearson, T. Sugden, P. Watson, and M. Westhead, "The Design and Implementation of Grid Database Services in OGSA-DAI," Concurrency and Computation: Practice and Experience, vol. 17, pp. 357-376, February 2005.
[2] F. Berman, A. J. G. Hey, and G. C. Fox, Grid Computing: Making the Global Infrastructure a Reality. John Wiley and Sons Ltd, 2003.
[3] J. Bradley, C. Brown, B. Carpenter, V. Chang, J. Crisp, S. Crouch, D. de Roure, S. Newhouse, G. Li, J. Papay, C. Walker, and A. Wookey, "The OMII Software Distribution," in UK e-Science All Hands Meeting 2006 (NeSC 2006), Nottingham, UK, pp. 748-753.
[4] I. Foster, C. Kesselman, and S. Tuecke, "The Anatomy of the Grid: Enabling Scalable Virtual Organizations," International Journal of Supercomputer Applications and High Performance Computing, vol. 15, pp. 200-222, 2001.
[5] I. Foster, C. Kesselman, G. Tsudik, and S. Tuecke, "A security architecture for computational grids," in 5th ACM Conference on Computer and Communications Security, 1998.
[6] A. Hagin, N. Hagin, and V. Voinov, "Providing Quality of Service on the Web Using Bandwidth Throttling," in 5th Workshop of the OpenView University Association (OVUA'98), Rennes, France, 1998.
[7] C. Higgins and G. Hobona, "Grid OGC Collision - the SEE-SAW projects," in 20th Open Grid Forum (OGF), Manchester, 2007.
[8] K. Karasavvas, M. Antonioletti, M. P. Atkinson, N. P. Chue Hong, T. Sugden, A. C. Hume, M. Jackson, A. Krause, and C. Palansuriya, "Introduction to OGSA-DAI Services," Lecture Notes in Computer Science, vol. 3458, pp. 1-12, May 2005.
[9] M. L. Massie, B. N. Chun, and D. E. Culler, "The Ganglia Distributed Monitoring System: Design, Implementation and Experience," Parallel Computing, vol. 30, July 2004.