Active Middleware for multicast and reduction operations in distributed cluster environments

by Nitin Bahadur and Nadathur Gokul
{bnitin, gokul}@cs.wisc.edu

Abstract

This paper describes a scalable, dynamic middleware for use in cluster environments. The middleware provides a base for building applications that scale with the number of nodes. It provides data communication facilities to applications built on top of it through an event-driven model and a typical send-receive paradigm. New functionality is added dynamically to nodes in the cluster by using dynamic code execution techniques. Scalability is achieved by logically partitioning the nodes into a binomial tree, in which intermediate nodes perform reduction on the results received from their child nodes. Our model handles node failures and reconfigures the tree automatically. We have built two applications, a system-monitoring tool and a file transfer/program spawning tool, to demonstrate the use of our middleware. Performance results show that our middleware is scalable.

1. Introduction

One of the key issues in high-performance distributed computing is scalability. Large distributed applications require tools for program monitoring and debugging that not only operate in a distributed manner but also scale with the number of nodes. The underlying framework should allow tools to operate efficiently, oblivious of the underlying communication mechanism. The goal of this project is to provide a middleware for performing distributed reduction [footnote 1] applications in a scalable and dynamic manner. The middleware is targeted at cluster environments but can be easily adapted to a wide-area network. This section presents the motivation for the project and a summary of our objectives.

Footnote 1: A reduction operation can be as simple as the sum or the concatenation of the responses received.

1.1 Motivation and Objectives

1. Scalability
Scalability is an important issue in distributed applications.
In a master-client model, the master can become the bottleneck as the number of clients increases. We aim to improve scalability by reducing the number of messages processed by each node.

2. Communication primitives
The middleware should provide primitives for reliable communication between two nodes. It should also provide an efficient mechanism for sending a message to all nodes.

3. Dynamism
A problem with a static master-client paradigm is that we cannot adapt the reduction function (the function that interprets the results) to current needs or observed conditions. Every time we change the master, we must also change the client application so that the new reduction function is used everywhere. It would be better if changing the functionality at one (master) node were enough to affect the entire computation. This would also reduce the maintenance and upgrade cost of the clients. We provide the ability to add new features to clients without interrupting their normal execution.

4. Continuous operation
It is important that clients continue to function in the event of any kind of failure. It must also be possible to restart a client after it crashes. To this end, we provide crash detection and handling, and node restart.

5. Dynamic processes
It should be possible to add new clients to an existing set without having to bring all of them down. Our support for dynamic processes allows clients to be added dynamically.

We aim to achieve the above objectives through a middleware that provides the following features:

- Point-to-point message delivery
- Scalable point-to-multipoint message delivery using multicast
- Dynamic addition of new functionality in the clients through the master
- Detection of node failures and a facility to restart a node
- Dynamic addition of new clients to the existing set

We do not make any assumptions about the underlying hardware, and our code can be easily ported to other platforms.
The outline of the rest of the paper is as follows. Section 2 presents the design of our library. Section 3 is dedicated to implementation issues. We discuss some applications we have developed in Section 4. A performance evaluation of our library is presented in Section 5. Finally, we discuss related work and other possible approaches in Section 6 and conclude in Section 7.

2. Design

In this section we present the design of the Active Reduction Tree Library (ARTL). The general architecture is shown in Figure 1.

[Figure 1 diagram. Labels: Master App on ARTL-M; Client Apps on ARTL-I (intermediate nodes) and ARTL-F (leaf nodes); links marked "Response" and "Reduction" and numbered 1 and 2.
ART library at the master: 1. Sends queries. 2. Reduces results. 3. Hands back results to the application.
ART library at an intermediate non-leaf node: 1. Executes responses to queries. 2. Reduces incoming results. 3. Sends reduced results to the master.
ART library at leaf nodes: 1. Executes responses to queries. 2. Sends back results to the master.]

Figure 1: The overall ARTL network. The number on each link represents the discrete time unit after which a node will receive the query from the master. At intermediate nodes, reductions occur in parallel.

Consider a cluster of nodes connected by a high-speed network. The application on each node that uses the ART library starts the ART runtime. The network is partitioned logically by ARTL into a tree. Figure 1 shows the logical connectivity and not the actual underlying network connectivity. The implication of this design is discussed in Section 2.1. The ART runtime, which now resides at every node, handles all inter-node communication for the application. It provides functionality to send unicast and multicast messages. The runtime also provides threads of execution (pthreads) for the application to:

1. Specify its responses to queries.
2. Perform reduction operations on responses that pass through it on the way to the master.
Depending on whether the node is the master, a leaf or an intermediate node, the ART library provides the appropriate functionality transparently. The ART library operates at user level and requires no special privileges.

2.1 Binomial Tree

To handle messages that are to be sent to all clients efficiently, the ART library partitions the nodes in the cluster in the form of a binomial tree. Note that this is a logical, application-level partition of the nodes and does not involve any change to the network-level connectivity. The assumption behind this scheme is that in cluster environments the network latency between two nodes is very low (compared to WAN environments), and hence if a single node is bombarded with many messages, its CPU becomes the bottleneck.

We now list the properties of binomial trees [Cormen 90]. A binomial tree B_k is an ordered tree defined recursively. B_0 consists of a single node. The binomial tree B_k consists of two binomial trees B_{k-1} that are linked together in the following way: the root of one tree is made the leftmost child of the root of the other tree. The building process is shown in Figure 2.

The binomial tree gives a distribution that takes advantage of the parallelism in the non-leaf nodes for reducing messages flowing upstream towards the master. If there are N nodes, then the greatest number of hops from the root to any leaf is log2 N. This increases the number of hops, but as mentioned before, the main aim is to alleviate the processing bottleneck at the master. Towards this end, the binomial tree is an optimal configuration.

[Figure 2 diagram: the trees B0, B1, B2 and B3 with numbered nodes.]

Figure 2: Building a binomial tree of degree 2 (B0, B1, B2), and a binomial tree of degree 3 (B3).

Advantages of using a binomial tree:

1. The binomial tree allows more than one node to take part in computing (reducing) the messages rather than straining a single node.
For example, in Figure 1, the label "reduction" shows that the corresponding node reduces the responses before passing them to the master. The operation has no dependence on peers and hence can be performed in parallel.

2. The other motivation for using the binomial tree is that it minimizes (compared to a binary tree, a BTree and any k-ary tree in general [Cormen 90]) the time required to propagate a message to all the nodes. This result can be explained intuitively as follows. For any tree-like configuration, it is the root that waits the longest to get all the responses. During this wait time the root could issue more queries or messages to other nodes. A tree configuration that has the greatest out-degree (number of children of a node) at the root, with the out-degree decreasing as we move downstream, is therefore well suited. The binomial tree has this characteristic. A binomial tree B_k (Figure 2 shows B_3) has a depth of k and a maximum out-degree of k, and this maximum out-degree occurs at the root of the tree.

2.2 Tree Setup

When a node starts, it builds a binomial tree of the network based on an initial configuration file, which lists all the nodes on which the application should run. It then connects to its child nodes and its parent. This process takes place at all the nodes, and when the full tree is connected, the master node sends a configuration message down the tree. The configuration message transfers information about the reduction functions to be executed at intermediate nodes. The client-side ARTL then loads the functions into the client application process. The mechanisms for these are explained in Sections 2.4 and 3.2.

2.3 Communication Subsystem

By default, ARTL provides an event-driven model for communication. A node can send a message to another node without the receiver expecting that message. The received message can be handled by a default handler registered by the application.
If the sender and receiver are synchronized, the receiver can post a receive for a message it expects. Thus we allow the application to communicate using an event-driven approach as well as the traditional send-recv paradigm. An example of an event-driven message is a message from the master informing all nodes that some node has just gone down. The running nodes can then take some appropriate action based on this message, or just ignore it.

2.4 Framework for processing messages

Once a message has been received, it has to be processed in a way that is specific to the application that sent it. This means that the ART library must provide a framework into which the client side of the application can plug its response to a received message. Note that functions for reducing messages (reduction functions) and functions that are invoked in response to a message from the master (response functions) are specific to a service (message identifier). The response functions are callbacks [Paul 97] registered by the client application with the ART library (Figure 3).

[Figure 3 diagram. Labels: Application; RESPONSE callback registered by the application with the ARTL event handler; Mapping of services to functions (reduction operations); Query message; Thread Pool; Event Handler; Responses from downstream nodes; ARTL Communication Layer; Network; Incoming Packet.]

Figure 3: The design of the Event handler. The mapping is a table that binds services to objects containing the reduction operations with which responses are reduced. "RESPONSE" is a callback registered by the client application.

The functions for reducing responses are active in the sense that they are sent [footnote 2] over the wire during the start-up phase, and the client has no knowledge of these functions before that. The idea is similar to the concept of active networks [ANTS 98].
Active reduction functions have the advantage that if existing data has to be interpreted or reduced in a new way, all that is needed is a function that implements the reduction operation and a specification that binds the function to a particular service (interpretation of data). Using the specification, a mapping is created between the function and the service for which it is used. Note that this scheme is not restricted to reduction operations; it can also be used for loading new responses on the fly. As shown in Figure 3, a mapping binds the reduction and response functions to the services for which they should be invoked.

Footnote 2: Functions are sent by reference; see Section 5.

Once callbacks have been specified and the appropriate services have been bound, a mechanism is needed to provide an environment in which the response and reduction operations execute. The environment should handle both kinds of operation transparently. When a message is received, the appropriate operation registered by the application is scheduled for execution by the scheduler. These outstanding services are then serviced by a pool of threads (Figure 7, Section 3.3). The scheduler consists of a static pool of threads embedded with the scheduling logic. The main motivation for maintaining a static pool of threads is that thread creation and destruction are expensive [Firefly 90]. The scheduler assigns idle threads to process incoming messages. The result of the operation is then sent to the parent node along the binomial tree. The thread frees all resources associated with the current task and waits for the next task.

We chose to implement a multithreaded message processor for ART because queries for multiple services can arrive at the same time and hence can be processed concurrently. This is especially useful in cluster environments, where a node typically houses more than one processor.
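
A static pool of this kind can be sketched as follows. This is an illustration only, not the ARTL scheduler: it uses C++11 std::thread rather than the pthreads that ARTL uses, and the class name StaticThreadPool and its submit method are hypothetical. The workers are created once, sleep while the queue is empty, and are joined only when the pool is destroyed, so no thread is created or destroyed per message.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch of a fixed-size worker pool (not the ARTL code).
class StaticThreadPool {
public:
    explicit StaticThreadPool(std::size_t worker_count) {
        assert(worker_count > 0);
        // All threads are created up front, once.
        for (std::size_t i = 0; i < worker_count; ++i)
            workers_.emplace_back([this] { run(); });
    }

    ~StaticThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (std::thread& worker : workers_) worker.join();
    }

    StaticThreadPool(const StaticThreadPool&) = delete;
    StaticThreadPool& operator=(const StaticThreadPool&) = delete;

    // Queue a response or reduction operation for an idle worker.
    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wake_.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stopping and queue drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock so workers execute concurrently
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> tasks_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};
```

In this sketch the destructor drains the queue before the workers exit; whether ARTL drains or discards pending work on shutdown is not stated in the paper.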
2.5 Crash Detection, node restart and dynamic processes

The communication layer provides crash detection. When a node goes down [footnote 3], all nodes are informed of this event. The master node recomputes the binomial tree, and the nodes are sent a reconfiguration message. Reconfiguration is not required when a leaf node goes down. Similarly, when a previously down node restarts, it sends a join message to the master, which repeats the process above so that the restarted node becomes part of the tree. Figure 4 shows that the failure of a leaf node (node 5) is simple to handle and causes no change in the tree structure. Figure 5 shows the case in which intermediate node 2 goes down.

Footnote 3: A node going down may be the client process going down or the physical computer running the client going down.

[Figure 4 diagram: the tree before and after the failure of leaf node 5.]

Figure 4: Change in tree when a leaf node goes down

[Figure 5 diagram: the tree before and after the failure of intermediate node 2.]

Figure 5: Change in tree when a non-leaf node goes down

We allow the user to add new nodes to the existing set. The master inserts the new node in the tree, and all other nodes are informed of its existence. Thus we can dynamically increase and decrease the set of running nodes. Node failure, node restart and node addition are events for which the application can register callback functions and take appropriate action.

3. Implementation

The implementation of our design is described in this section. The design was implemented on a cluster of 64 nodes running Linux version 2.2.12. Each node has dual Pentium III Xeon processors running at 550 MHz and 2 GB of RAM. The middleware has been implemented in C++.

[Figure 6 diagram: successive stages of connecting nodes 1 to 4.]

Figure 6: Setting up of the binomial tree

3.1 Communication Subsystem

The communication subsystem is responsible for the setup of the binomial tree and for message delivery to all nodes.

3.1.1 Setup

Communication takes place using TCP. During startup, each node builds TCP connections with its children and its parent (except for the master, which has no parent).
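
To make the connection step concrete, the following sketch shows one way a node could derive its parent and children in a binomial tree. It is a minimal illustration under stated assumptions: nodes are numbered 0 to n-1 with the master at rank 0, which is not the labelling used in Figure 2, and the function names binomial_parent and binomial_children are hypothetical. The paper does not specify how ARTL numbers nodes internally.

```cpp
#include <cassert>
#include <vector>

// Assumed numbering: ranks 0..n-1, master at rank 0 (not the paper's labels).
// The parent of a rank is obtained by clearing its highest set bit.
int binomial_parent(int rank) {
    assert(rank >= 0);
    if (rank == 0) return -1;            // the master has no parent
    int high = 1;
    while (high * 2 <= rank) high *= 2;  // highest power of two <= rank
    return rank - high;
}

// The children of a rank are rank + 2^j for every power of two 2^j that is
// greater than rank, as long as the result is still a valid rank.
std::vector<int> binomial_children(int rank, int n) {
    assert(rank >= 0 && rank < n);
    std::vector<int> children;
    for (int step = 1; rank + step < n; step *= 2) {
        if (step > rank) children.push_back(rank + step);
    }
    return children;
}
```

With 8 nodes, rank 0 gets children 1, 2 and 4, rank 1 gets 3 and 5, rank 2 gets 6 and rank 3 gets 7, which is the shape of B_3. In this numbering a node with a message forwards it to its children in increasing order, one per time unit, so the whole tree is covered in log2 N steps when N is a power of two.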
The TCP connections are kept persistent. TCP is used to ensure reliable data transfer, and by using persistent connections we avoid the TCP open and close handshakes for each message. Each node sets up connections with its children first and then with its parent, as shown in Figure 6. This way, when the master is connected with all its children, it can be sure that all nodes that are up have been connected to the tree (since its children would first have connected with their children, and so on). The master then sends the reduction functions (see Section 3.2 for details) to all the nodes in the tree. On receipt of these, the nodes load them into their address space.

3.1.2 Messaging

Every message sent has an ARTL header attached to it. Using this header, ARTL at the receiver node determines the message type and the amount of data the sender is sending. ARTL supports two kinds of messages: event-driven and traditional send-recv. Event-driven messages are those for which the application has not posted an explicit receive. For such messages the communication subsystem allocates memory on behalf of the application, and the message is given to the Event handler, which then calls the appropriate callback function in a separate thread. For send-recv messages, the application explicitly posts a receive; the communication subsystem receives the message on behalf of the application and returns it to the application. We do not support asynchronous messages as in MPI [MPI 1.1]. Supporting them would require storing some state information; our framework can be extended to do so.

3.2 Implementing active reduction

The functions that implement the reduction operations have been implemented as shared objects. These are compiled at the master end and sent to the client nodes. Once ARTL receives the shared objects and the specification, it loads the objects into the address space of the client application using dlopen().
The individual functions are loaded using dlsym(). The binding between a reduction function and a particular service is specified in the message that carries that service.

3.3 Event scheduler

The event scheduler consists of three main resource components (Figure 7):

1. The static pool of threads that provides the resource for computation.
2. An Event table that provides memory to maintain state for the various services to be processed. Each entry in the Event table also provides a stack to store the responses for that entry.
3. A queue that interfaces the Event table and the thread pool. The queue maintains the set of outstanding tasks that have to be processed by the threads, together with the index into the Event table for the corresponding state information.

[Figure 7 diagram. Labels: Network; Responses for the shaded entry from downstream nodes; Reduced response sent upstream; Table containing query id and callback information for currently registered queries; Thread Pool; Run Queue of reduction operations; Event Handler; RESPONSE; Application.]

Figure 7: Implementation of the Event handler

An incoming query is registered in the Event table. After the setup process, the behavior of a node depends on its position in the multicast tree. If a node is a non-leaf node, the query message is passed on downstream and the appropriate response function is scheduled. Once the response function has executed, the state is not freed, since this response has to be reduced with the other responses passing through this node towards the master. When all the downstream responses have arrived, the reduction function is invoked. The responses may arrive in any order. Since responses are bound to a service (Section 2.4), the order does not matter: they are stacked in the appropriate entry according to the service to which they respond. The responses are then reduced and sent upstream towards the master.
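
The stacking behaviour at a non-leaf node can be sketched as follows. This is a simplified, single-threaded illustration and not the ARTL data structure: the names EventTable, EventEntry, add_entry and on_response are hypothetical, responses are modelled as plain long values, and the expected count is assumed to be the number of children plus the node's own response. The real Event table is accessed by pool threads and would need locking.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical types for illustration; a response is modelled as one long.
using ReductionFn = std::function<long(const std::vector<long>&)>;

struct EventEntry {
    std::size_t expected;        // own response + one per child (assumption)
    ReductionFn reduce;          // reduction bound to this service
    std::vector<long> stacked;   // responses received so far, in any order
};

class EventTable {
public:
    // Register a query for a service when it first arrives.
    void add_entry(unsigned short service_id, std::size_t expected,
                   ReductionFn reduce) {
        assert(expected > 0);
        entries_[service_id] = EventEntry{expected, std::move(reduce), {}};
    }

    // Stack one response. Returns true, and writes the reduced value that
    // would be sent upstream, once every expected response has arrived.
    bool on_response(unsigned short service_id, long response, long& reduced) {
        auto it = entries_.find(service_id);
        if (it == entries_.end()) return false;  // no such registered query
        EventEntry& entry = it->second;
        entry.stacked.push_back(response);
        if (entry.stacked.size() < entry.expected) return false;
        reduced = entry.reduce(entry.stacked);
        entries_.erase(it);                      // state is freed only now
        return true;
    }

private:
    std::map<unsigned short, EventEntry> entries_;
};

// Example reduction: the sum of all stacked responses.
long sum_reduce(const std::vector<long>& responses) {
    return std::accumulate(responses.begin(), responses.end(), 0L);
}
```

Because each entry is keyed by service, responses for different services can interleave freely without affecting one another, which is the property the text relies on when it says the arrival order does not matter.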
If the node is a leaf node, the response to the incoming query is executed and the result is sent to the parent node. If the node is the master node, then after the responses are reduced, an application-specific master-response function is called to return the results of the final reduction to the application.

3.4 Crash Detection and Node restart

We have implemented two methods for detecting node failures: socket signals and periodic refresh timers. Since we have TCP connections between child and parent, whenever a node goes down the corresponding TCP connection breaks and the other end receives a signal on the associated socket. Using this we can determine which node has gone down and take appropriate action. Whenever a child node goes down, it is the responsibility of the parent to inform the master. In cluster environments the signal for a broken TCP connection is received almost instantaneously (in most cases), which results in immediate reconfiguration of the binomial tree.

The other technique we implemented uses timeouts. Child nodes send periodic refresh messages to the parent using UDP. If the parent does not receive refresh messages for a particular interval, it assumes that the child has gone down. It then informs the master, which takes appropriate action. The timeout scheme is somewhat slower in detecting node failures but might be useful in cases where a node crashes and comes up again immediately. It would then avoid the overhead of two binomial tree reconstructions, which may be significant if a node failure causes a major change in the tree configuration. The timeout scheme might also be useful in WAN environments, where detecting the closure of a TCP connection might take longer even with the SO_KEEPALIVE option set for the TCP socket.

A node failure implies either the application going down or the physical node running the application going down.
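
The parent-side bookkeeping for the timeout scheme can be sketched as follows. This is an assumption-laden illustration: the class RefreshMonitor and its methods are hypothetical, the UDP receive path is left out, and timestamps are passed in explicitly so the logic can be exercised without a network. The paper does not give the refresh interval or the timeout value.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch of refresh-timer failure detection at a parent node.
class RefreshMonitor {
public:
    using Clock = std::chrono::steady_clock;

    explicit RefreshMonitor(Clock::duration timeout) : timeout_(timeout) {
        assert(timeout > Clock::duration::zero());
    }

    // Called whenever a refresh message from `child` is received.
    void on_refresh(const std::string& child, Clock::time_point now) {
        last_seen_[child] = now;
    }

    // Called periodically. Returns the children whose last refresh is older
    // than the timeout and stops tracking them; the caller would then
    // inform the master about each one.
    std::vector<std::string> collect_failed(Clock::time_point now) {
        std::vector<std::string> failed;
        for (auto it = last_seen_.begin(); it != last_seen_.end();) {
            if (now - it->second > timeout_) {
                failed.push_back(it->first);
                it = last_seen_.erase(it);
            } else {
                ++it;
            }
        }
        return failed;
    }

private:
    Clock::duration timeout_;
    std::map<std::string, Clock::time_point> last_seen_;
};
```

A child that crashes and returns within the timeout simply resumes sending refreshes and is never reported, which is how this scheme avoids the two tree reconstructions mentioned above.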
There are a number of scenarios for node failure, and the way we handle them is discussed below:

a) If a leaf node goes down, no reconfiguration is required.
b) If an intermediate node goes down, the left child of that node takes the place of its parent, and all other children of the failed node become its children.
c) If a TCP connection breaks but both the nodes involved are running, then before informing the master of a problem, the parent tries to connect to the child again.
d) If the master goes down, the remaining tree stays connected as it is. To handle failure of the master, the master regularly checkpoints the state of the network to disk. When it comes up after a crash, it therefore knows the state of the network and can form connections accordingly.

When a node is restarted after a crash, it first contacts the master node to get the shared object(s) and the current state of the binomial tree. The master then informs all nodes of the restart of this node. The restarted node is plugged into the tree as a leaf so that the overhead is minimal. Support for new client processes involves adding a new node to the current set of nodes; as in node restart, the new node is integrated into the tree as a leaf node.

4. Applications Developed

We have developed two applications that make use of our middleware. These are described in this section.

SysMon – A System Monitoring Tool

Using the ART library, a simple system monitoring tool was built. A master node monitors the load average on the different nodes of the cluster. The client part of the application does no computation other than responding to the master's queries for the load average of the node. Three reduction operations are performed: minimum load, maximum load and sum of loads. The master node queries the clients every second (this interval can be varied), and each client responds with its load average.
The intermediate nodes filter out the maximum and minimum load and also calculate the sum of the loads. This data is then sent further upstream, where it is reduced further. The master performs the final reduction and displays the information graphically. In addition, when a node crashes, the master application raises an alert message about the node that has failed.

[Figure 8: screenshot of the system monitor GUI.]

Figure 8: GUI for system monitor

If periodic data needs to be collected from all nodes, it is not necessary to send a query periodically to receive a response; the response function at the client nodes can be written to send replies periodically. This saves the downstream traffic of periodic queries. The interval can still be controlled by the master. Such applications are useful to system administrators for managing a large cluster of nodes and for detecting failures.

File transfer and program spawning

The second application is a file transfer/program spawning application in which the master node transfers a file to all nodes. This file can then be executed at the nodes if desired. We provide an option to execute the file either after it has been transferred to all nodes or as soon as it reaches a node. The former might be useful if there is some sort of resource sharing or process synchronization among the processes to be executed. Thus, this application provides a simple, scalable and reliable multicasting facility with the option of code execution.

5. Performance

We measured several aspects of the performance of our middleware. ARTL setup time is composed of three parts: the local initializations, the tree setup and the loading of the shared objects. Each measurement was taken by repeating the same experiment 5 times, on a different set of nodes every time to avoid caching effects. In some experiments, we changed and recompiled our files so that the file would be loaded again and the cached copy would not be used. We measured time using gettimeofday, which has a resolution of 1 microsec. Table 1 gives the time taken for local initializations with respect to the number of nodes.

Number of nodes     2      4      8      16      32      64
Time in microsec    2412   4200   9467   21736   37375   65133

Table 1: Time for local initializations

It can be seen that the time roughly doubles as the number of nodes doubles. This is because the configuration file contains one entry per node and the time for processing the entries increases linearly. However, even with 256 nodes this time would be low, about 260 millisec.

The other part of setup is the transfer of the shared object information and the loading of the shared object(s), which is done at all nodes. This time depends on the number of nodes and the number of functions per shared object. We measured it by varying the number of nodes and the number of functions. Table 2 gives the loading time of the shared objects as a function of the number of functions in the shared object.

Number of functions in shared object   2      5      8
Loading time in microsec               7963   8052   8136

Table 2: Shared object loading time for different numbers of functions in the shared object

We can see from Table 2 that the loading time is nearly constant as the number of functions increases. We tested this with small functions of about 10-15 lines each. It would be interesting to see how the load time changes with more complex functions and larger shared objects.

Table 3 gives the time to transfer information about the shared object(s) to all nodes. This is not the time for the actual transfer of the shared object, as we assume a shared file system and just pass information about the shared object(s) (names, number of functions, function names) to all the nodes. We can see that the time does not increase logarithmically, as one would have expected in a tree structure. This is because the amount of data transferred is very small (~500-800 bytes), and the data transfer time is overshadowed by the time taken to traverse the TCP stack up and down on the intermediate nodes.
Still, this time is small, and for 256 nodes it can be projected to be about 800 millisec.

Number of nodes                                        2      4       8       16
Shared object information transfer time in microsec    7683   12523   25732   48798

Table 3: Time to transfer information about the shared object(s)

[Figure 9 plot: "Total Startup Time". X axis: number of nodes (2, 4, 8, 16). Y axis: time in seconds (0 to 16). Curves: last node start time, total startup time.]

Figure 9: Total startup time for different numbers of nodes

Number of nodes   2      4     8     16
Time in sec       .008   .1    .2    .26

Table 4: Time difference between start of last node and total setup completion

Figure 9 shows the total startup time for different numbers of nodes. The nodes were started by a script using ssh. The total startup time is the time from when a process is started on the first node until all the nodes have been connected in the tree and all local and global initializations have been done. The graph also shows the instant at which all nodes have been started, so the last-node-start-time curve includes the ssh startup time for all nodes. Only after the last node has started can the tree connections be completed. After the last node has started, it has to establish TCP connections with its parent and children, and the connections have to propagate up the tree. In the case of 2 nodes the time is low, since there is no connection propagation and only one connection needs to be established. The time increases with the number of nodes and connections. Since tree building takes place in parallel, the majority of the connections are already established by the time the last node starts up. The last node is the master node, and Table 4 shows that the master spends very little time establishing TCP connections. Thus the ssh time dominates the total startup time.

File Transfer

We transferred files of 4 MB and 40 MB using multicast and compared our total transfer time with the time taken to transfer the files one by one to each client directly.
Figures 10 and 11 show the timings for unicast, for our multicast, and the expected theoretical time for multicast based on the number of nodes in the tree.

[Figure 10 plot: "File transfer time for 4 MB file". X axis: total number of nodes (2, 4, 8, 16). Y axis: time in seconds (0 to 25). Curves: unicast file transfer time, multicast file transfer time, expected multicast file transfer time.]

Figure 10: File transfer time for a 4 MB file using unicast and multicast

[Figure 11 plot: "File Transfer Time for 40 MB file". X axis: total number of nodes (2, 4, 8, 16). Y axis: time in seconds (0 to 180). Curves: unicast file transfer time, multicast file transfer time, expected multicast file transfer time.]

Figure 11: File transfer time for a 40 MB file using unicast and multicast

We also measured the performance for 32 nodes and obtained a speedup between 2 and 3, but we cannot rely on these results because of cluster problems. Figures 10 and 11 show that our multicast scheme is much better than a unicast scheme. We also see that our implementation achieves a speedup close to the expected theoretical value. The small deviation is due to two factors. First, during data transfer the data has to traverse the TCP stack multiple times (at intermediate nodes), which causes some delay. Second, as the extent of parallelism increases, the number of collisions at the link layer (Ethernet) increases, resulting in retransmissions and a net increase in total data transmission time.

6. Related Work and Other possible approaches

The Lilith project [Lilith 99] at Sandia National Laboratories developed a general-purpose framework for creating tools for the use and administration of very large clusters. Their framework, written in Java (ours is written in C++), aimed at scalability across a distributed, heterogeneous computing platform. They used a binary tree to achieve scalability, whereas we use a binomial tree (a binomial tree minimizes the number of messages sent [Cormen 90]).
The Lilith project also involved code distribution (similar to our shared object distribution) and execution of reduction functions at intermediate nodes.

We could have built the ART library using MPI instead of building our own communication subsystem using sockets. This would have simplified the code. But in MPI, if one process fails or crashes, the entire computation falls apart. We are not aware of mechanisms by which an MPI program can continue after one or more processes have crashed. This prevents us from purposely bringing down a process for maintenance. Besides, in MPICH the communication model sets up one TCP connection with every other process, so if there are n processes, each process has n-1 open TCP connections. This does not scale and increases state overhead as the number of processes grows (as also noted by Paul Barford [Barford 00] in his presentation at the Univ. of Wisconsin). MPI 2 allows computation to continue in case of process failures and allows the addition of dynamic processes [MPI-2 98]. However, MPI 2 implementations are few in number and uncommon as of now. It would be interesting to compare the performance and features of our implementation with MPI 2.

7. Conclusions and Future Extensions

The ARTL middleware provides a scalable framework for developing tools for use in large cluster environments. The middleware manages the communication among nodes and the distribution of code to be executed at various nodes. Our experiments show that our framework is scalable and that system-monitoring applications can be easily developed on top of it.

Our framework can also be extended across WAN environments for WAN monitoring. It is possible to provide differential scheduling in the thread scheduler for QoS and urgent processing. The response functions can be made dynamic; this corresponds to active services [ANTS 98]. Security can be added to the middleware by using MD5 message digests or X.509 certificates.
The use of IP multicast can be explored. Currently we do not handle node failures that occur before the complete tree is set up (during initialization), nor failures that occur while the master is down. Extensions to support these cases would be useful.

Acknowledgements

We are grateful to Steve Huss-Lederman of Univ. of Wisconsin-Madison for his insights on MPI. We are also thankful to the reviewers for their comments and to Rob Iverson for helping with the GUI.

References

[Cormen 90] Cormen, Leiserson and Rivest, Introduction to Algorithms, Prentice Hall, 1990.
[Tomlinson 99] G. Tomlinson, D. Major, R. Lee, High-Capacity Internet Middleware: Internet Caching System Architectural Overview, SIGMETRICS WISP99, May 1999.
[MPI 1.1] Message Passing Interface Forum, MPI: A message passing standard, http://www.mpi-forum.org/docs/mpi-11html/mpi-report.html, June 1995.
[Paul 97] Paul Jakubik, Callbacks in C++, www.primenet.com/~jakubik/callback.html, May 1997.
[Lilith 99] Sandia National Laboratories, Lilith: Scalable Tools for Distributed Computing, http://dancer.ca.sandia.gov/Lilith/
[Firefly 90] M. D. Schroeder and M. Burrows, "Performance of the Firefly RPC", ACM Trans. on Computer Systems, 8(1), February 1990.
[ANTS 98] D. Wetherall, J. Guttag, D. Tennenhouse, ANTS: A Toolkit for Building and Dynamically Deploying Network Protocols, IEEE OpenArch, April 1998.
[Barford 00] Paul Barford, Understanding End-to-End Performance of Wide Area Service, Special CS Colloquium, Univ. of Wisconsin, April 2000.
[MPI-2 98] William C.
Saphir, SuperComputing ’98 Tutorial, MPI-2, SC ’98, Orlando.

Appendix 1: ARTL API

Public Functions of ARTL class

A) Initialization and cleanup

    // init ARTL
    char art_init(char *config_file, char *shared_ob_info_file,
                  unsigned short port_number, char ismaster, char restart);

    // stop using ARTL
    char art_destroy(void);

B) Callback Function Routines

    // register default function handler
    char art_register_default(response_func new_func);

    // return default function handler
    response_func art_get_default(void);

    // register new handler
    char art_register_fn(art_func new_func, char *name,
                         int wait_for = -1, char *dll_name = NULL);

    // register new handler with a dll name and no wait_for
    char art_register_fn(art_func new_func, char *name, char *dll_name);

    // remove handler entry
    char art_remove(art_func old_func);

    // remove handler entry for function by this name
    char art_remove(char *name);

    // remove all handlers
    char art_removeall(void);

    // check if handler with this name is present
    art_func art_ispresent(char *name);

    // check if this handler function is present
    char* art_ispresent(art_func old_func);

C) Data communication routines

    // send data to someone, specified by IP_address
    // IP_address == BROADCAST implies a broadcast
    char art_send_data(char type, u_short service_id, Address IP_address,
                       u_long length, char *pkt, char *func_name = NULL,
                       char socket_type = TCP_SOCKET);

    // send data to someone, specified by hostname
    char art_send_data(char type, u_short service_id, char *hostname,
                       u_long length, char *pkt, char *func_name = NULL,
                       char socket_type = TCP_SOCKET);

    // broadcast data to all nodes, used by master
    char art_broadcast(Address IP_address, u_long length, char *pkt,
                       char socket_type = TCP_SOCKET);

    // receive data from specified IP address in user provided buffer
    char art_user_recv(char *data, Address & IP_address, long & length,
                       int sockfd = 0, char socket_type = TCP_SOCKET);

    // add an event table entry for a query
    char art_add_event_entry(u_short service_id, char* reduction_func);
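
The broadcast routines above rely on the binomial tree into which the nodes are logically partitioned. The following C++ sketch illustrates how such a tree can be derived from node ranks alone. It is not taken from the ARTL sources: the function names binomial_children and broadcast_rounds, the numbering of nodes as ranks 0..n-1 with the master at rank 0, and the assumption that every node forwards to one child per round are all hypothetical choices made for illustration.

```cpp
#include <cassert>
#include <vector>

// Illustrative only: NOT part of the ARTL API.
// Returns the ranks that node `rank` forwards a message to in a binomial
// broadcast tree over ranks 0..n-1 rooted at rank 0 (assumed to be the master).
std::vector<int> binomial_children(int rank, int n) {
    assert(n >= 1 && rank >= 0 && rank < n);
    std::vector<int> children;
    // Smallest power of two strictly greater than `rank`. A node receives the
    // message from rank - (highest set bit of rank), so it may only forward
    // along larger powers of two.
    int mask = 1;
    while (mask <= rank) mask <<= 1;
    // One child per remaining round, as long as the child rank exists.
    for (; rank + mask < n; mask <<= 1) {
        children.push_back(rank + mask);
    }
    return children;
}

// Number of forwarding rounds needed until all n nodes hold the message,
// assuming the set of informed nodes doubles in every round.
int broadcast_rounds(int n) {
    assert(n >= 1);
    int rounds = 0;
    for (int reached = 1; reached < n; reached <<= 1) {
        ++rounds;
    }
    return rounds;
}
```

Under these assumptions, with 16 nodes rank 0 forwards to ranks 1, 2, 4 and 8, and all nodes are reached after 4 rounds, whereas a unicast scheme needs 15 sequential sends from the master. This is only an idealized count of rounds; it ignores the TCP and Ethernet effects discussed in the evaluation.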