Active Middleware for multicast and reduction operations in distributed
cluster environments
by
Nitin Bahadur and Nadathur Gokul
{ bnitin, gokul } @cs.wisc.edu
Abstract
This paper describes a scalable, dynamic middleware for use in cluster environments. The
middleware provides a base for building applications that scale with the number of nodes. It
provides data communication facilities to applications built on top of it through an event-driven
model and a typical send-receive paradigm. New functionality is added dynamically to nodes in
the cluster using dynamic code execution techniques. Scalability is achieved by logically partitioning the nodes into a binomial tree; intermediate nodes of the tree perform reduction on results received from their child nodes. Our model handles node failures and performs automatic tree reconfiguration. We have built two applications, a system-monitoring tool and a file-transfer/program-spawning tool, to demonstrate the use of our middleware. Performance results show that our middleware is scalable.
1. Introduction
One of the key issues in high performance distributed computing is scalability. For large distributed
applications, we require tools for program monitoring and debugging that not only operate in a distributed manner but also scale with the number of nodes. The underlying framework should allow tools to operate efficiently, oblivious of the underlying communication mechanism. The goal of this project is to provide a middleware for performing distributed reduction1 applications in a scalable and dynamic manner. The middleware is targeted at cluster environments but can be easily adapted to a wide-area
network. This section presents the motivation for the project and a summary of our objectives.
1 A reduction operation can be as simple as the sum/concatenation of the responses received.
1.1 Motivation and Objectives
1. Scalability
Scalability is an important issue in distributed applications. In a master-client model, the master can
become the bottleneck as the number of clients increases. We aim to improve scalability by reducing the
number of messages processed by each node.
2. Communication primitives
The middleware should provide primitives for reliable communication between 2 nodes. It should also
provide an efficient mechanism for sending a message to all nodes.
3. Dynamism
A problem with a static master-client paradigm is that we cannot adapt the reduction function (function
that interprets the results) based on current needs or observed conditions. Instead, we would need to change the client application every time we change the master so that the new reduction function is used everywhere. It would be better if, just by changing the functionality at one (master) node, we could affect the entire computation favorably. This also reduces the associated maintenance and upgrade cost of the
clients. We provide the ability to add new features to clients without interrupting their normal execution.
4. Continuous operation
It is important that clients continue to function in the event of any kind of failure. Also, it must be possible
to restart a client after it crashes. To this end, we provide crash detection and handling, and node restart.
5. Dynamic processes
It should be possible to add new clients to an existing set without having to bring down all of them. Our
support for dynamic processes allows one to add clients dynamically.
We aim to achieve the above objectives through a middleware, which provides the following features:
- Point-to-point message delivery
- Scalable point-to-multipoint message delivery using multicast
- Dynamic addition of new functionality in the clients through the master
- Detection of node failures and facility to restart a node
- Dynamic addition of new clients to the existing set
We do not make any assumptions about the underlying hardware and our code can be easily ported to
other platforms. The outline of the rest of the paper is as follows: Section 2 presents the design of our
library. Section 3 is dedicated to implementation issues. We discuss some applications we have
developed in Section 4. Performance evaluation of our library is presented in Section 5. Finally, we
discuss some related work and other possible approaches in Section 6 and conclude in Section 7.
2. Design
In this section we present the design of the Active Reduction Tree Library (ARTL). The general
architecture is as shown in Figure 1.
Figure 1: The overall ARTL network. The ART library at the master (ARTL-M) sends queries, reduces results, and hands the results back to the application. The ART library at an intermediate, non-leaf node (ARTL-I) executes responses to queries, reduces incoming results, and sends the reduced results toward the master. The ART library at a leaf node (ARTL-F) executes responses to queries and sends the results back toward the master. The number on each link represents the discrete time unit after which a node receives the query from the master; reductions at intermediate nodes occur in parallel.
Consider a cluster of nodes connected by a high-speed network. The application on each node that uses
the ART library starts the ART runtime. The network is partitioned logically by ARTL into a tree. Figure 1 shows the logical connectivity and not the actual underlying network connectivity. The implication of this design is discussed in Section 2.2. The ART runtime, which now resides at every node, handles all inter-node communication for the application. It provides functionality to send unicast/multicast messages. The runtime also provides threads of execution (pthreads) for the application to:
1. Specify its responses for queries.
2. Perform reduction operation on responses that pass through it on the way to the master.
Depending on whether the node is the master, a leaf or an intermediate node the ARTL library provides
appropriate functionality transparently. The ART library operates at user level and requires no special
privileges.
2.1 Binomial Tree
To handle messages that are to be sent to all clients in an efficient manner the ART library partitions the
nodes in the cluster in the form of a binomial tree. It is to be noted that this is a logical or application level
partition of the nodes and does not involve any changes to the network level connectivity. The
assumption behind this scheme is that, in cluster environments, the network latency between two nodes is very low (compared to WAN environments), and hence if a single node is bombarded with many messages, the CPU becomes the bottleneck. We now list the properties of binomial trees [Cormen 90].
A binomial tree Bk is an ordered tree defined recursively. B0 consists of a single node. The binomial tree Bk consists of two binomial trees Bk-1 that are linked together in the following way: the root of one tree is made the leftmost child of the root of the other tree. The building process is as shown in Figure 2.
The binomial tree gives a distribution that takes advantage of the parallelism in the non-leaf nodes for
reducing messages flowing upstream towards the master. If there are N nodes, then the greatest number of hops from the root to any leaf is log2 N. This increases the number of hops, but as we have mentioned before, the main aim is to alleviate the processing bottleneck at the master. Towards this end, the binomial
tree is an optimal configuration.
Figure 2: Building a binomial tree of degree 2, and a binomial tree of degree 3 (B0 through B3 are shown).
Advantages of using a binomial tree:
1. The binomial tree allows more than one node to take part in computing (reducing) the messages rather than straining a single node. For example, in Figure 1, the label "reduction" shows that the corresponding node reduces the responses before passing them to the master. The operation has no dependence on peers and hence can be performed in parallel.
2. The other motivation for using the binomial tree is that it minimizes (as compared to a binary tree, B-tree, and any k-ary tree in general [Cormen 90]) the time required to propagate a message to all the nodes. This result can be explained intuitively as follows: for any tree-like configuration, it is the root that waits the longest time to get all the responses. In this wait time the root could issue more queries or messages to other nodes. This means a tree configuration that has the greatest out-degree (number of children of a node) at the root and decreases it as we move downstream is well suited. The binomial tree has this desired characteristic. A binomial tree Bk (Figure 2 shows B3) has a depth of k and a maximum out-degree of k; this maximum out-degree is at the root of the tree.
2.2 Tree Setup
When a node starts, it builds a binomial tree of the network based on an initial configuration file
(the configuration file lists all the nodes on which the application should run). It then connects to its child
nodes and parent. This process takes place at all the nodes and when the full tree is connected, the
master node sends a configuration message down the tree. The configuration message transfers
information about the reduction functions to be executed at intermediate nodes. The client side ARTL
then loads the functions into the client application process. The mechanisms for these are explained in
Sections 2.5 and 3.2.
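For concreteness, the following sketch shows one common way of deriving a node's parent and children from its rank (its position in the configuration file, with rank 0 acting as the master). This is an illustrative numbering under our own assumptions and not necessarily the exact layout that ARTL computes.

// Sketch: one common binomial-tree numbering, with rank 0 as the root (master).
// The rank is assumed to be the node's index in the configuration file; ARTL's
// actual layout and configuration-file format may differ.
#include <cstdio>
#include <vector>

int parent_of(int rank) {
    if (rank == 0) return -1;           // the root has no parent
    return rank & (rank - 1);           // clear the lowest set bit
}

std::vector<int> children_of(int rank, int num_nodes) {
    std::vector<int> children;
    // A node adopts children at bit positions below its lowest set bit;
    // the root (rank 0) adopts children at every bit position.
    for (int bit = 1; rank + bit < num_nodes; bit <<= 1) {
        if (rank != 0 && (rank & bit))  // reached the lowest set bit: stop
            break;
        children.push_back(rank + bit);
    }
    return children;
}

int main() {
    const int num_nodes = 8;            // e.g., 8 entries in the configuration file
    for (int r = 0; r < num_nodes; ++r) {
        std::printf("rank %d: parent %d, children:", r, parent_of(r));
        for (int c : children_of(r, num_nodes)) std::printf(" %d", c);
        std::printf("\n");
    }
    return 0;
}

With 8 nodes this numbering gives the root children 1, 2 and 4 (out-degree log2 8 = 3) and a maximum depth of 3, matching the binomial-tree properties described in Section 2.1.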
2.3 Communication Subsystem
By default, ARTL provides an event-driven model for communication. A node can send a message to
another node without the receiver node expecting that message. The received message can be handled
by some default handler registered by the application. If the sender and receiver are synchronized then
the receiver can post a receive for a message it expects to receive. Thus we allow the application to
communicate using an event-driven approach as well as the traditional send-recv paradigm. An example
of an event-driven message would be a message from the master informing all nodes that some node
has just gone down. The running nodes could then take some appropriate action based on this message, or
just ignore it.
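As a rough illustration of the two styles, a client might register a default handler for unexpected event-driven messages and post an explicit receive when it knows a message is coming. The sketch below uses calls listed in Appendix 1; the callback signature, the Address type, the TCP_SOCKET constant and the exact calling sequence are our assumptions, made only so the sketch is self-contained.

// Sketch only: event-driven vs. send-recv communication using the Appendix 1 calls.
// The stand-in declarations below are assumptions; the real ARTL header differs.
typedef unsigned short u_short;                 // as written in the Appendix prototypes
typedef unsigned long  u_long;
typedef void (*response_func)(u_short service_id, char *data, u_long length);  // assumed signature
struct Address { unsigned long ip; };           // stand-in type
const char TCP_SOCKET = 0;                      // stand-in constant

class ARTL {                                    // "Public Functions of ARTL class" (Appendix 1)
public:
    char art_register_default(response_func new_func);
    char art_user_recv(char *data, Address &IP_address, long &length,
                       int sockfd = 0, char socket_type = TCP_SOCKET);
};

// Event-driven path: called by the ARTL event handler for any message the
// application has not posted an explicit receive for.
static void on_unexpected(u_short service_id, char *data, u_long length) {
    // take some action, or simply ignore the message
}

void client_example(ARTL &artl) {
    artl.art_register_default(on_unexpected);   // register the default handler once

    // Send-recv path: when sender and receiver are synchronized, post an
    // explicit receive; ARTL fills the buffer when the message arrives.
    char buffer[4096];
    Address from;
    long length = 0;
    artl.art_user_recv(buffer, from, length);
}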
2.4 Framework for processing messages
Once a message has been received, it has to be processed in a way that is unique to the application that
sent the message. This means that the ART library must provide a framework into which the client side of
the application can plug in its response to a message that has been received. It should be noted that
functions for reducing messages (reduction functions) and functions that are invoked as a response to a
message from the master (response functions) are specific to a service (message identifier). The response functions are callbacks [Paul 97] registered by the client application with the ART library (Figure 3).
Figure 3: The design of the Event Handler. An incoming packet passes through the ARTL communication layer to the event handler; a mapping table binds services to the objects containing the reduction operations used to reduce responses from downstream nodes; a thread pool executes the scheduled operations; the "RESPONSE" is a callback registered by the client application.
The functions for reducing responses are active in the sense that they are sent2 over the wire
during the start up phase and the client has no knowledge about these functions before that. The idea is
similar to the concept of active networks [ANTS 98].
Active reduction functions have the advantage that if the existing data have to be interpreted or
reduced in a new way, all that has to be done is to create a function that implements the reduction
operation and a specification that binds the function to a particular service (interpretation of data). Using
the specification a mapping is created between the function and the service for which it is used. It is to be
noted that this scheme is not restricted to just reduction operations but can be used for loading new
responses on the fly.
As shown in Figure 3, a mapping binds the reduction and response functions to services for
which they should be invoked. Once callbacks have been specified and appropriate services have been
bound, a mechanism is needed for providing an environment to execute the response and reduction
operations. The environment should handle both operations transparently. When a message is received, the appropriate operation registered by the application is scheduled for execution by the scheduler. These outstanding services are then serviced by a pool of threads (Figure 7, Section 3.3).
2 Functions are sent by reference; see Section 5.
The scheduler consists of a static pool of threads embedded with the scheduling logic. The main
motivation for maintaining a static pool of threads is that thread creation and destruction are expensive
[Firefly 90]. The scheduler assigns idle threads for processing incoming messages. The result of the
operation is then sent to the parent node along the binomial tree. The thread frees all resources
associated with the current task and waits for the next task. We chose to implement a multithreaded
message processor for ART because queries for multiple services can arrive at the same time and hence
can be processed concurrently. This is especially useful in cluster environments where a node typically houses more than one processor.
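The following is a minimal sketch of such a static thread pool fed by a task queue. It uses C++11 threads for brevity, whereas ARTL itself is built on pthreads, and it omits the per-service state that the ARTL scheduler also tracks.

// Minimal sketch of a static pool of worker threads with a task queue, in the
// spirit of the scheduler described above (C++11 threads used for brevity).
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(size_t nthreads) {
        for (size_t i = 0; i < nthreads; ++i)
            workers_.emplace_back([this] { run(); });
    }
    ~ThreadPool() {
        { std::lock_guard<std::mutex> lk(mu_); stop_ = true; }
        cv_.notify_all();
        for (auto &t : workers_) t.join();
    }
    // Schedule an outstanding task (a response or reduction operation).
    void schedule(std::function<void()> task) {
        { std::lock_guard<std::mutex> lk(mu_); tasks_.push(std::move(task)); }
        cv_.notify_one();
    }
private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(mu_);
                cv_.wait(lk, [this] { return stop_ || !tasks_.empty(); });
                if (stop_ && tasks_.empty()) return;
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();    // execute the response/reduction callback on an idle thread
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex mu_;
    std::condition_variable cv_;
    bool stop_ = false;
};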
2.5 Crash Detection, node restart and dynamic processes
The communication layer provides crash detection. When a node goes down3, all nodes are informed of
this event. The master node recomputes the binomial tree and nodes are sent a reconfiguration message.
Reconfiguration is not required when a leaf node goes down. Similarly, when a previously down node
restarts, it sends a join message to the master, which again does the aforementioned process and the
restarted node becomes part of the tree. Figure 4 shows that a leaf node link failure (node 5) is simple to
handle and there is no change in tree structure. Figure 5 shows the case when intermediate node 2 goes
down.
3 A node going down may mean the client process going down or the physical computer running the client going down.
Figure 4: Change in the tree when a leaf node goes down.

Figure 5: Change in the tree when a non-leaf node goes down.
We allow the user to add new nodes to the existing set. The master inserts the new node in the tree and
all other nodes are informed of its existence. Thus we can dynamically increase and decrease the set of
running nodes. Node failure, node restart and new node arrival are events for which the application can register
callback functions and take appropriate action.
3. Implementation
The implementation of our design is described in this section. The design was implemented on a cluster of 64 nodes running Linux version 2.2.12. Each node has dual Pentium III Xeon processors running at 550 MHz and 2 GB of RAM. The middleware has been implemented in C++.
Figure 6: Setting up of the binomial tree.
3.1 Communication Subsystem
The communication subsystem is responsible for the setup of the binomial tree and message delivery to
all nodes.
3.1.1 Setup
Communication takes place using TCP. During startup, each node builds TCP connections with its children and its parent (except for the master, which has no parent). The TCP connections are kept persistent. TCP is used to ensure reliable data transfer, and by using persistent connections we avoid the TCP handshake of an open and close for each message. Each node sets up connections with its children first, followed by its parent, as shown in Figure 6. This way, when the master is connected to all its children, it is sure that all nodes (which are up) have been connected to the tree (since its children would have first connected with their children, and so on). Then the master sends the reduction functions (see Section 3.2 for details) to all nodes in the tree. On receipt of these, the nodes load them into their address spaces.
3.1.2 Messaging
Every message sent has an ARTL header attached to it. Using this header, ARTL at the receiver node determines
the message type and amount of data the sender is sending. ARTL supports two kinds of messages:
event-driven and traditional send-recv. Event-driven messages are those for which the application has not
posted an explicit receive. For such messages the communication subsystem allocates memory on behalf
of the application and the message is given to the Event handler which will then call the appropriate
callback function in a separate thread. For send-recv messages, the application explicitly posts a receive; the communication subsystem receives the message on behalf of the application and returns it to the application. We do not support asynchronous messages as in MPI [MPI 1.1]. Supporting them would require storing some state information; our framework can be extended to implement
asynchronous messages.
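The exact layout of the ARTL header is not spelled out here; the sketch below shows one plausible shape (field names and sizes are our assumptions) carrying the information mentioned above, namely the message type, the service identifier and the length of the payload that follows.

// Hypothetical sketch of an ARTL message header; the real layout is not
// specified in this paper, so all fields below are assumptions.
#include <cstdint>
#include <sys/socket.h>

struct ArtlHeader {
    uint8_t  msg_type;     // e.g., query, response, reconfiguration, join
    uint16_t service_id;   // service this message belongs to
    uint32_t length;       // number of payload bytes following the header
};

// Send a header followed by its payload over an established TCP connection.
// Partial writes and error handling are omitted for brevity.
inline bool artl_send(int sockfd, const ArtlHeader &hdr, const char *payload) {
    if (send(sockfd, &hdr, sizeof(hdr), 0) != (ssize_t)sizeof(hdr))
        return false;
    return send(sockfd, payload, hdr.length, 0) == (ssize_t)hdr.length;
}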
3.2 Implementing active reduction
The functions that implement the reduction operations have been implemented as shared objects. These
are compiled at the master end and sent to the client nodes. Once ARTL receives the shared objects and
the specification, it loads the objects into the address space of the client application using dlopen().
The individual functions are loaded using dlsym(). The binding between a reduction function and a
particular service is specified in the message that carries that service.
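A rough sketch of this loading step is shown below; the path, the symbol name and the reduction-function signature are placeholders, since ARTL derives them from the specification sent by the master.

// Sketch: loading a reduction function from a shared object received from the
// master. Path, symbol name and function signature are illustrative only.
#include <dlfcn.h>
#include <cstdio>

typedef void (*reduction_fn)(char **responses, int nresponses, char *reduced_out);  // assumed

reduction_fn load_reduction(const char *so_path, const char *symbol) {
    void *handle = dlopen(so_path, RTLD_NOW);   // map the object into our address space
    if (!handle) {
        std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return NULL;
    }
    void *sym = dlsym(handle, symbol);          // resolve the named reduction function
    if (!sym) {
        std::fprintf(stderr, "dlsym failed: %s\n", dlerror());
        return NULL;
    }
    return reinterpret_cast<reduction_fn>(sym);
}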
3.3 Event scheduler
The event scheduler consists of three main resource components (Figure 7).
Figure 7: Implementation of the Event handler. A table holds the query id and callback information for currently registered queries; responses from downstream nodes are collected for the corresponding entry; a run queue of reduction operations feeds the thread pool; the reduced response is sent upstream, and the "RESPONSE" callback delivers results to the application.
1. The static pool of threads that provide the resource for computation.
2. An Event table that provides memory to maintain state for the various services to be processed. Each entry in the Event table also provides a stack to store the responses for that entry.
3. A queue that interfaces the Event table and the thread pool.
The queue maintains the set of outstanding tasks that have to be processed by the threads and the index
into the Event table for the corresponding state information. An incoming query is registered in the Event
table. After the setup process, the behavior of nodes is dependent on their position in the multicast tree. If
a node is a non-leaf node, then the query message is passed on downstream and the appropriate response function is scheduled. Once the response function has executed, the state is not freed, since this response has to be reduced with other responses going through this node towards the master. When all the downstream responses have arrived, the reduction function is invoked. The responses may arrive in any order; since responses are bound to a service (Section 2.5), the order does not matter. They get stacked in the appropriate entry according to the service to which they are responding. The responses are then reduced and sent upstream towards the master. If the node is a leaf node, the response for the incoming query is executed and the result is sent to the parent node. If the node is the master node, then after the responses are reduced, an application-specific master response function is called to return the results of the final reduction to the application.
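The sketch below illustrates the per-query state kept at a non-leaf node: responses are stacked in the corresponding Event-table entry and the reduction function is invoked once all expected responses have arrived. The names and types are our assumptions, not ARTL's actual data structures.

// Sketch of a per-query Event-table entry at a non-leaf node (names and types
// are assumptions). Responses are collected in any order and reduced once the
// node's own response plus one response per child subtree have arrived.
#include <string>
#include <vector>

typedef std::string (*reduction_fn)(const std::vector<std::string> &responses);  // assumed

struct EventEntry {
    unsigned short service_id;            // the service this query belongs to
    reduction_fn reduce;                  // bound via the service-to-function mapping
    int expected_responses;               // own response + one per child subtree
    std::vector<std::string> responses;   // stacked responses; order does not matter

    // Called for the local response and for each response from a child.
    // Returns true and fills `reduced` when the entry is complete.
    bool add_response(const std::string &resp, std::string &reduced) {
        responses.push_back(resp);
        if ((int)responses.size() < expected_responses)
            return false;                 // still waiting on downstream responses
        reduced = reduce(responses);      // invoke the active reduction function
        return true;                      // caller sends `reduced` to its parent
    }
};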
3.4 Crash Detection and Node restart
We have implemented two methods for detecting node failures: using socket signals and periodic refresh
timers. Since we have TCP connections between child and parent, whenever a node goes down, the
corresponding TCP connection breaks and the other end receives a signal on the associated socket.
Using this we can determine which node has gone down and then take appropriate action. Whenever a
child node goes down, it is the responsibility of the parent to inform the master of this. The receipt of a signal on the break of a TCP connection is instantaneous (in most cases) in cluster environments and results
in immediate reconfiguration of the binomial tree.
The other technique we implemented is using timeouts. Child nodes send periodic refresh messages to
the parent using UDP. If the parent does not receive refresh messages for a particular interval, it assumes that the child has gone down and informs the master, which takes appropriate action. The timeout scheme is a bit slower in detecting node failures but might be useful in cases where a node crashes and comes up immediately; this would avoid the overhead of two binomial tree reconstructions. This overhead may be significant if a node failure causes a major change in tree configuration. Also, the timeout scheme might be useful in WAN environments, where detection of a TCP connection closure might take a longer time even with the SO_KEEPALIVE option set for the TCP socket.
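A minimal sketch of the parent side of the refresh-timer scheme is shown below; the timeout value and the bookkeeping are illustrative choices of ours, not ARTL's actual parameters.

// Sketch: parent-side timeout detection based on periodic UDP refresh messages.
// The timeout value and data structures are illustrative assumptions.
#include <ctime>
#include <map>

const time_t REFRESH_TIMEOUT_SEC = 5;    // assumed: declare a child dead after 5 s of silence

struct ChildLiveness {
    std::map<int, time_t> last_refresh;  // child rank -> time of its last refresh message

    // Called whenever a UDP refresh datagram from child_rank is received.
    void on_refresh(int child_rank) { last_refresh[child_rank] = std::time(NULL); }

    // Called periodically; returns the rank of a child presumed down, or -1.
    int check_timeouts() const {
        time_t now = std::time(NULL);
        for (std::map<int, time_t>::const_iterator it = last_refresh.begin();
             it != last_refresh.end(); ++it)
            if (now - it->second > REFRESH_TIMEOUT_SEC)
                return it->first;        // the parent would now inform the master
        return -1;
    }
};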
A node failure implies either the application going down or the physical node running the application going
down. There are a number of scenarios for node failure, and the way we handle each is discussed below:
a) If a leaf node goes down, no reconfiguration is required.
b) If an intermediate node goes down, then the left child of the intermediate node takes the place of its parent and all other children of the node that has gone down become its children.
c) If a TCP connection breaks but both the involved nodes are running, then before informing the master
of a problem, the parent tries to connect to the child again.
d) If the master goes down, then the remaining tree remains connected as it is. To handle failure of the
master, the master checkpoints the state of the network regularly to disk. So when it comes up after a
crash, it knows the state of the network and can form connections accordingly.
When a node is restarted after a crash, it contacts the master node first to get the shared object(s) and
the current state of the binomial tree. The master then informs all nodes of the restart of this node. The
restarted node is plugged into the tree as a leaf so that there is minimum overhead.
Support for new client processes involves adding a new node to the current set of nodes and as in node
restart, the new node is integrated into the tree as a leaf node.
4. Applications Developed
We have developed two applications that make use of our middleware. These are described in this
section.
SysMon – A System Monitoring Tool
Using the ART library a simple system monitoring tool was built. A master node monitors the load
average on different nodes of the cluster. The client part of the application does not do any computation
other than responding to master queries for the load average of the node. Three reduction operations are
performed: minimum load, maximum load and sum of loads. The master node queries the clients every second (this interval can be varied), and each client responds with its load average. The intermediate nodes filter out the maximum and minimum load and also calculate the sum of the loads. This data is then sent further upstream, where it is further reduced. The master performs the final reduction and displays the information graphically. Further, when a node crashes, the master application raises an alert message about the node that has failed.
Figure 8: GUI for the system monitor.
If periodic data needs to be collected from all nodes, then it is not necessary to send a query periodically to receive a response; the response function at the client nodes can be written to send replies periodically. This would save downstream traffic for periodic queries. The interval of periodicity can, however, be
controlled by the master. Such applications are useful to system administrators for managing a large
cluster of nodes and for detecting failures.
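As an illustration, a reduction function for this application could look roughly like the following; the struct layout and function signature are our assumptions, since ARTL's actual interface to reduction functions is defined by the specification sent from the master.

// Hypothetical SysMon reduction: combine load-average summaries from this node
// and its subtree by keeping the minimum, the maximum and the sum of loads.
// The types and signature are illustrative assumptions.
#include <algorithm>
#include <vector>

struct LoadSummary {
    double min_load;
    double max_load;
    double sum_load;   // total load over the subtree; the master can derive averages
};

// Reduce the partial summaries received from downstream plus this node's own.
// Assumes `partials` contains at least one element.
LoadSummary reduce_load(const std::vector<LoadSummary> &partials) {
    LoadSummary out = partials.front();
    for (size_t i = 1; i < partials.size(); ++i) {
        out.min_load = std::min(out.min_load, partials[i].min_load);
        out.max_load = std::max(out.max_load, partials[i].max_load);
        out.sum_load += partials[i].sum_load;
    }
    return out;
}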
The second application is a file transfer/program spawning application wherein the master node transfers
a file to all nodes. This file can then be executed at the nodes if desired. We provide an option to execute
the file after it has been transferred to all nodes or as soon as it reaches a node. The former might be
useful if there is some sort of resource sharing or process synchronization among the processes to be
executed. Thus, this application provides a simple, scalable and reliable multicasting facility with the
option of code execution.
5. Performance
We measured various aspects of the performance of our middleware.
ARTL setup time is composed of three parts: The local initializations, tree setup and loading of the shared
objects. The measurements were taken by repeating the same experiment 5 times on a different set of
nodes every time to avoid caching effects. In some experiments, we changed and recompiled our files so
that the file would be loaded again and the cached copy would not be used. We measured the time using
gettimeofday, which has an accuracy of 1 microsec. Table 1 gives the time taken for local initializations with respect to the number of nodes.
Number of nodes:        2      4      8      16      32      64
Time in microsec:     2412   4200   9467   21736   37375   65133

Table 1: Time for local initializations
It can be seen that the time doubles as the number of nodes doubles. This is because the configuration file contains one entry per node and the time for processing the entries increases linearly. However, even with 256 nodes, this time would be low, about 260 millisec.
The other part during setup is the transfer of the shared object(s) information and their loading (done at all
nodes). This time is dependent on the number of nodes and the number of functions per shared object. We
measured the time for this setup by varying the number of nodes and number of functions.
Table 2 gives the loading time of the shared objects as a function of number of functions in the shared
object.
Number of functions in shared object:      2      5      8
Loading time in microsec:                7963   8052   8136

Table 2: Shared object loading time for different numbers of functions in the shared object
We can see from Table 2 that the loading time is nearly constant as the number of functions increases. We tested this with small functions of about 10-15 lines each. It would be interesting to see the load time as the complexity of the functions and the size of the shared object increase.
Table 3 gives the time to transfer information about the shared object(s) to all nodes. The time below is
not for the actual transfer of the shared object, as we assume a shared file system and just pass information about the shared object(s) (names, number of functions, function names) to all the nodes. We can see that the time does not increase logarithmically as one would have expected in a tree structure. This is because the amount of data transferred is very small (~500-800 bytes) and the data transfer time is overshadowed by the time taken to traverse the TCP stack up and down on the intermediate nodes. Still, this time is small, and for 256 nodes it can be projected to be about 800 millisec.
Number of nodes:                                          2       4       8      16
Shared object information transfer time in microsec:   7683   12523   25732   48798

Table 3: Time to transfer information about the shared object(s)
Total Startup Time

Number of nodes:    2     4     8     16
Time in sec:      .008   .1    .2    .26

Table 4: Time difference between the start of the last node and total setup completion

Figure 9: Total startup time for different numbers of nodes. The graph plots the last node start time and the total startup time (in seconds) against the number of nodes (2, 4, 8 and 16).
Figure 9 shows the total startup time for different numbers of nodes. The nodes were started by a script using ssh. The total startup time is the time from when a process is started on the first node until all the nodes have been connected in the tree and all local and global initializations have been done. The graph also shows the time instant at which all nodes have been started, so the last node start time curve includes the ssh startup time for all nodes. Only after the last node has started can the tree connections be complete. After the last node has started, it has to establish TCP connections with its parent and children, and the connections have to propagate up the tree. In the case of 2 nodes, since there is no connection propagation and only one connection needs to be established, the time is low. The time increases with the number of nodes and connections. Since tree building takes place in parallel, the majority of the connections are already established by the time the last node starts up. The last node is the master node, and Table 4 shows that very little time is spent by the master in establishing TCP connections. Thus the ssh time dominates the total startup time.
File Transfer
We transferred files of 4MB and 40MB using multicast and compared our total transfer time with the time
taken to transfer the files one by one to each client directly. Figures 10 and 11 show the timings for
unicast, our multicast, and the expected theoretical time for multicast based on the number of nodes in the tree.
Figure 10: File transfer time for a 4 MB file using unicast and multicast. The graph plots unicast, multicast and expected multicast transfer times (in seconds) against the total number of nodes (2, 4, 8 and 16).
Figure 11: File transfer time for a 40 MB file using unicast and multicast. The graph plots unicast, multicast and expected multicast transfer times (in seconds) against the total number of nodes (2, 4, 8 and 16).
We also measured the performance for 32 nodes and obtained a speedup between 2 and 3, but we cannot rely on these results due to cluster problems. Figures 10 and 11 show that our multicast scheme is much better than a unicast scheme. We also see that our implementation achieves a speedup close to the expected theoretical optimum. The small deviation is due to two factors. During data transfer the data has to traverse the TCP stack multiple times (at intermediate nodes), which causes some delay. And as the extent of parallelism increases, the number of collisions taking place at the link layer (Ethernet) increases, resulting in retransmissions and a net increase in total data transmission time.
6. Related Work and Other possible approaches
The Lilith project [Lilith 99] at Sandia National Laboratories consisted of a general-purpose framework for the creation of tools for the use and administration of very large clusters. Their framework, written in Java (as compared to ours, written in C++), aimed at scalability across a distributed, heterogeneous computing
platform. They used a binary tree to achieve scalability whereas we are using a binomial tree (a binomial
tree minimizes the number of messages sent [Cormen 90]). The Lilith project also involved code
distribution (similar to our shared object distribution) and execution of reduction functions at intermediate
nodes.
We could have built the ARTL library using MPI instead of building our own communication subsystem
using sockets. This would have simplified the code. But in MPI, if one process fails or crashes, the entire scheme of things falls apart. We are not aware of mechanisms by which one can continue an MPI program after one or more processes have crashed. This also prevents us from purposely bringing down a process for maintenance. Besides, in MPICH the communication model of MPI sets up one TCP connection with each other process, so if there are n processes, each process will have n-1 open TCP connections, which becomes unscalable and increases state overhead as the number of processes increases (as mentioned by Paul Barford [Barford 00] in his presentation at Univ. of Wisconsin).
MPI 2 allows computation to continue in case of process failures and allows addition of dynamic
processes [MPI-2 98]. However, MPI 2 implementations are few in number and uncommon as of now. It
would be interesting to compare performance and features of our implementation with MPI 2.
7. Conclusions and Future Extensions
The ARTL middleware provides a scalable framework for developing tools for use in large cluster
environments. The middleware manages the communication among nodes and distribution of code to be
executed at various nodes. Our experiments show that our framework is scalable and system-monitoring
applications can be easily developed on top of it. Our framework can also be extended across WAN
environments for WAN monitoring.
It is possible to provide differential scheduling in the thread scheduler for QoS and urgent processing. The
response functions can be made dynamic. This corresponds to active services [ANTS 98]. Security can
be added to the middleware by using MD5 message digests or X.509 certificates. The use of IP multicast can be explored. Currently we do not handle node failures in which a node goes down before the complete tree is set up (during initialization), or in which a node goes down while the master is down.
Extensions to support these would be useful.
Acknowledgements
We are grateful to Steve Huss-Lederman of Univ. of Wisconsin-Madison for his insights on MPI. We are
also thankful to the reviewers for their comments and to Rob Iverson for helping with the GUI.
References
[Cormen 90] Cormen, Leiserson and Rivest, Introduction to Algorithms, Prentice Hall, 1990.
[Tomlinson 99] G. Tomlinson, D. Major, R. Lee, High-Capacity Internet Middleware: Internet Caching System Architectural Overview, SIGMETRICS WISP99, May 1999.
[MPI 1.1] MPI: A Message Passing Interface Standard, http://www.mpi-forum.org/docs/mpi-11html/mpi-report.html, Message Passing Interface Forum, June 1995.
[Paul 97] Paul Jakubik, Callbacks in C++, www.primenet.com/~jakubik/callback.html, May 1997.
[Lilith 99] Sandia National Laboratories, Lilith: Scalable Tools for Distributed Computing, http://dancer.ca.sandia.gov/Lilith/
[Firefly 90] M. D. Schroeder and M. Burrows, "Performance of the Firefly RPC", ACM Transactions on Computer Systems, 8(1), February 1990.
[ANTS 98] D. Wetherall, J. Guttag, D. Tennenhouse, ANTS: A Toolkit for Building and Dynamically Deploying Network Protocols, IEEE OpenArch, April 1998.
[Barford 00] Paul Barford, Understanding End-to-End Performance of Wide Area Service, Special CS Colloquium, Univ. of Wisconsin, April 2000.
[MPI-2 98] William C. Saphir, MPI-2 Tutorial, SuperComputing '98 (SC '98), Orlando, 1998.
Appendix 1: ARTL API
Public Functions of ARTL class
A) Initialization and cleanup
// init ARTL
char art_init(char *config_file, char *shared_ob_info_file, unsigned short port_number, char ismaster, char restart);

// stop using ARTL
char art_destroy(void);
B) Callback Function Routines
// register default function handler
char art_register_default(response_func new_func);

// return default function handler
response_func art_get_default(void);

// register new handler
char art_register_fn(art_func new_func, char *name, int wait_for = -1, char *dll_name = NULL);

// register handler with a dll name and no wait_for
char art_register_fn(art_func new_func, char *name, char *dll_name);

// remove handler entry
char art_remove(art_func old_func);

// remove handler entry for function with this name
char art_remove(char *name);

// remove all handlers
char art_removeall(void);

// check if a handler with this name is present
art_func art_ispresent(char *name);

// check if a handler for this function is present; returns its name
char *art_ispresent(art_func old_func);
C) Data communication routines
// send data to someone, specified by IP_address
// IP_address == BROADCAST implies a broadcast
char art_send_data(char type, u_short service_id, Address IP_address, u_long length, char *pkt, char *func_name = NULL, char socket_type = TCP_SOCKET);

// send data to someone, specified by hostname
char art_send_data(char type, u_short service_id, char *hostname, u_long length, char *pkt, char *func_name = NULL, char socket_type = TCP_SOCKET);

// broadcast data to all nodes, used by master
char art_broadcast(Address IP_address, u_long length, char *pkt, char socket_type = TCP_SOCKET);

// receive data from specified IP address in user-provided buffer
char art_user_recv(char *data, Address &IP_address, long &length, int sockfd = 0, char socket_type = TCP_SOCKET);

// add an event table entry for a query
char art_add_event_entry(u_short service_id, char *reduction_func);
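For orientation, a master-side usage of these routines might look roughly like the following; the argument values, the use of the BROADCAST constant and the calling sequence are guesses for illustration only.

// Hypothetical master-side usage of the routines above; values and calling
// sequence are illustrative assumptions (assumes the ARTL header is included).
void master_example() {
    ARTL artl;                                   // the ARTL class of Appendix 1

    // Initialize as master: the config file lists the nodes, the shared-object
    // info file describes the reduction functions to distribute.
    artl.art_init((char *)"nodes.cfg", (char *)"reduction_info.txt",
                  /*port_number=*/5000, /*ismaster=*/1, /*restart=*/0);

    // Bind a reduction function (by name) to a service id before querying.
    artl.art_add_event_entry(/*service_id=*/1, (char *)"reduce_load");

    // Send a query to all nodes (IP_address == BROADCAST implies a broadcast).
    char query[] = "load_average";
    artl.art_send_data(/*type=*/0, /*service_id=*/1, BROADCAST,
                       sizeof(query), query);

    artl.art_destroy();
}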