Copyright 1994 IEEE. Published in the Proceedings of MPCS '94, May 1994 at Ischia, ITALY. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 732-562-3966.

Dynamic Load Distribution in Massively Parallel Architectures: the Parallel Objects Example

Antonio Corradi, Letizia Leonardi, Franco Zambonelli
Dipartimento di Elettronica Informatica e Sistemistica - Università di Bologna
Viale Risorgimento 2, 40136 Bologna - Ph.: +39-51-6443001 - Fax: +39-51-6443073
E-mail: {antonio, letizia, franco}@deis33.cineca.it

This work has been partially supported by the "Progetto Finalizzato Sistemi Informatici e Calcolo Parallelo" of the Italian National Research Council (grant No. 89.00012.69) and by the "MURST" 40% Funding.

The paper presents the mechanisms for dynamic load distribution implemented within the support for the Parallel Objects (PO for short) programming environment. PO applications evolve depending on their dynamic need for resources, and this evolution is exploited to enhance application performance. The goal is to show how dynamic load distribution can be successfully applied on a massively parallel architecture.

1. Introduction

Building complete high-level programming environments is one of the challenges that general-purpose parallel computing must face. Such environments would attract a large community of users if they helped in solving the complexities introduced by parallel computing. We claim that the presence of transparent allocation tools is essential: programmers should not have to worry about the mapping of their applications onto the machine, unless they want to. These characteristics are requirements of the Parallel Objects programming environment [1]. In addition, in massively parallel architectures, load balancing and locality of communications are also goals to be met for the sake of efficiency [2].

Load balancing and locality can be obtained via a dynamic distribution of load during execution, which, in the PO environment, is achieved in two ways. On the one hand, PO uses remote creation of newly needed objects [3]. On the other hand, PO provides migration of already allocated objects [4]. Even if migration can be more expensive than remote creation, it can be more effective both in achieving load balancing and in allowing efficient access to remote resources and communication partners.

Dynamic load distribution consists of different phases: a monitoring phase identifies the evolution of the application, a policy phase decides the needed action and a mechanism phase applies the decisions taken. The paper focuses on dynamic load distribution mechanisms rather than policies and is organised as follows. Section 2 describes the PO model and its allocation issues. The implementation of the support for the PO environment is presented in section 3. The migration mechanisms are described in section 4 and their effectiveness is evaluated, via several application examples, in section 5.

2. The PO model

The PO environment is based on the active object model [5], which provides independent execution facilities internal to each object: at least one execution thread is associated with each object. In PO, computation results from message passing between objects. When one object requires an external service, it sends a message to another object, specifying the service it needs. Asynchronous modes of communication lead to inter-object parallelism, decoupling sending and receiving objects.
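To make the active object model concrete, the following is a minimal sketch in Python, not the PO implementation. The names `ActiveObject`, `send` and `Counter` are hypothetical and introduced only for illustration. Each object owns one thread and a mailbox; a sender enqueues a request and continues at once, which is the decoupling that gives inter-object parallelism.

```python
import queue
import threading


class ActiveObject:
    """Hypothetical active object: one thread serving a mailbox of requests."""

    def __init__(self, name):
        self.name = name
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._serve, daemon=True)
        self._thread.start()

    def send(self, service, *args):
        """Asynchronous request: enqueue it and return a reply slot at once."""
        reply = queue.Queue(maxsize=1)
        self._mailbox.put((service, args, reply))
        return reply

    def _serve(self):
        # Execution thread owned by the object: serves requests in arrival order.
        while True:
            service, args, reply = self._mailbox.get()
            if service is None:  # stop marker, an assumption of this sketch
                break
            reply.put(getattr(self, service)(*args))

    def stop(self):
        self._mailbox.put((None, (), None))
        self._thread.join()


class Counter(ActiveObject):
    def __init__(self, name):
        self.value = 0  # object state, set before the thread starts
        super().__init__(name)

    def add(self, amount):
        self.value += amount
        return self.value


counter = Counter("counter")
pending = [counter.send("add", n) for n in (1, 2, 3)]  # the sender is not blocked
results = [reply.get() for reply in pending]           # replies collected later
counter.stop()
print(results)  # [1, 3, 6]
```

A single serving thread keeps the state consistent trivially; the multiple internal activities of a PO object would need the scheduling policy discussed in the text.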
Intra-object parallelism is given by the presence of multiple execution threads within the same PO object. A parallel object can receive several requests at once: each is served by an internal activity. The consistency of the object state, concurrently accessed by these activities, is guaranteed by a scheduling policy that can be user-defined and is subject to inheritance.

Each PO object is composed of two logically separate parts: the parallel part and the non-parallel one. The structure of the non-parallel part is similar to that of a Smalltalk-80 object [6]: it contains the state and the operations (also called methods). The parallel part consists of management components, which interface the object with the outside and co-ordinate the other internal components.

Abstract PO objects are always instances of a given class. PO classes describe both the non-parallel and the parallel part of an object: the interface, all the operations that can be requested of objects, the state variables of an object and the synchronisation scheduling. For distributed implementation, PO adopts a solution with multiple copies of classes, one for each node where an instance of the class is executing.

Dynamic allocation issues, in PO, arise for object creation (remote creation) and for the dynamic movement of already existing objects (migration).

Any PO application, as it starts, is composed of a few objects, called the starter objects, in charge of booting the application. Every object except the starters is created at runtime: objects can explicitly request the creation of other objects and can subsequently request their services. Since objects can be created at any time by a creator, depending on the application execution, a static policy cannot decide their allocation a priori. A remote creation policy is capable of dynamically finding a node where the objects to be created can be usefully allocated [7].
Such a policy has to take several issues into account: the presence of the class on the node is a necessary condition, but a new object should preferably be allocated on an under-loaded node and close to its communication partners. The user is not aware of remote creation: the new object is automatically placed on a node chosen by the remote creation policy. However, the user can influence the policy by giving allocation hints: for example, a user can specify that the objects of a given class have to be created locally, or created remotely on a neighbour node.

Migration is the movement of an entity from one node (the sender) to another one (the receiver) [4]. The effectiveness of migrating processes has been doubted [8], on the grounds that the overhead due to the transfer is rarely compensated by performance improvements. We claim, instead, that migration of active objects is usually effective in PO because:
− many long-lived objects exist during the whole execution of an application. For those objects, the remote creation policy is not sufficient to balance the load. Moreover, the cost of a migration is negligible with respect to the object lifetime.
− the migration mechanisms implemented in the PO support (see section 4) are simpler than process migration mechanisms, even if they involve several active entities.

Migration occurs transparently to the user, but objects can be qualified as "fixed" (i.e. not migratable) by the user with, again, an allocation hint. That avoids, for example, useless migration of objects with limited lifetime.

3. PO support implementation

The PO support is currently implemented on a transputer-based architecture, the Meiko Computing Surface, but its layered and modular structure (the support is based on encapsulated primitives) hides every characteristic of the physical architecture from the user. The support consists of two levels: a support for the object functions, present in each object, and a support to handle object dynamicity.
The object support is composed of a set of threads that realise the execution management for each PO object: the Object Manager (OM for short) and one or more State Managers (SM). The OM is the receiver of the service requests arriving at the object and it is the only component entitled to access and modify the object execution state (represented by the status of the pending requests and of the executing activities). The OM shares no memory with the other object components, neither the activities nor the SMs: it interacts with them by message exchange and it can be allocated independently of them.

The state of any PO object can be split into partitions: one SM is associated with each partition. When an object is distributed over different nodes (as discussed later, this occurs during migration), the SMs allow activities to access remotely the part of the state they manage. An access to a non-local state partition is transparently translated into a message for the corresponding SM. However, an activity that is co-resident with the state partition it needs bypasses the SM service and directly accesses the local memory holding that part of the state.

The PO support for object dynamicity is composed of a set of entities, present on each node and called system managers: they implement the dynamic management of an application at the architecture level.

The monitoring manager periodically measures the application behaviour to detect its evolution.

The allocation manager is in charge of the allocation policies. It chooses the allocation of new entities and, when needed, it decides on migration. The allocation manager takes its decisions on the basis of the load information provided by the monitoring manager of its own node and of the load information of other nodes in the system. The migration and remote creation policies we have experimented with are mostly based on locality, for the sake of scalability [9].
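The kind of decision an allocation manager takes can be sketched as follows. This is only an illustration under stated assumptions: the paper does not give the policies of [9], so the scoring formula, the neighbourhood radius `max_distance`, the weight `distance_weight`, the `threshold`, and the names `NodeInfo`, `choose_node` and `should_migrate` are all hypothetical. The sketch only reflects the criteria named in the text: the class must be present on the node, under-loaded and nearby nodes are preferred, and only local information is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeInfo:
    node_id: int
    load: float      # load reported by that node's monitoring manager
    has_class: bool  # a copy of the object's class is present on the node
    distance: int    # hops from the creator node (0 is the creator itself)


def choose_node(candidates, max_distance=1, distance_weight=0.1):
    """Pick a node for a new object among nearby nodes that hold its class.

    The score mixes load and distance; the weight and the neighbourhood
    radius are illustrative assumptions, not values from the paper.
    """
    eligible = [n for n in candidates
                if n.has_class and n.distance <= max_distance]
    if not eligible:
        return None
    return min(eligible, key=lambda n: n.load + distance_weight * n.distance)


def should_migrate(local_load, neighbour_loads, threshold=0.25):
    """Suggest a migration when the local load exceeds the neighbourhood
    average by more than the (assumed) relative threshold."""
    if not neighbour_loads:
        return False
    average = sum(neighbour_loads) / len(neighbour_loads)
    return local_load > average * (1 + threshold)


nodes = [
    NodeInfo(0, 0.9, True, 0),   # creator node, heavily loaded
    NodeInfo(1, 0.2, False, 1),  # lightly loaded, but the class is missing
    NodeInfo(2, 0.4, True, 1),   # neighbour holding the class
    NodeInfo(3, 0.1, True, 3),   # least loaded, but outside the neighbourhood
]
chosen = choose_node(nodes)
print(chosen.node_id)                    # 2
print(should_migrate(0.9, [0.2, 0.4]))   # True
print(should_migrate(0.3, [0.3, 0.3]))   # False
```

Restricting the candidates to a small neighbourhood keeps the amount of load information exchanged independent of the system size, which is the scalability argument made in the text.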
The creation manager implements the decisions taken by the allocation manager: it receives the orders to create entities on its node. A creation can involve whole objects (either a "new" object or a migrated one, whose internal information comes from a previous execution), activities and SMs. The creation manager needs access to the class data structures to handle creations correctly.

The router manager is in charge of delivering all messages that flow between application objects (both inter-object and intra-object communications) and between the managers of nodes that are not physically connected. Routers manage these communications transparently and independently of the allocation. In the current implementation, the router is static and provided by the CSTools programming environment of the Meiko Computing Surface.

4. Migration mechanisms in the PO support

The migration mechanisms implemented in the PO support are simpler than those of process migration [4]. The PO support deals with object migration by using two separate mechanisms: the migration of the management part of an object (the OM) and the migration of its state (the SMs). The OM and the SMs can migrate from one node to another because they are cyclic threads without any internal execution status. Therefore, a manager can be blocked at any time.

The migration of Object and State Managers, even though it involves active entities, is similar to data migration. The data structures managed by the OM and by the SMs (i.e., the execution state and the object state) are transferred from one node to another. The manager of that data is not really moved: it simply asks another (new) manager on the receiver node to take charge of the management.

The PO support rules out activity migration, avoiding the costs of process migration. When an object migrates, its activities can go on executing until their completion on the node where they were created.
Only new activities are allocated on the new site. The choice of not moving the activities executing inside an object can be criticised; we claim it is effective. Let us consider, for example, the case of an object that migrates from an over-loaded sender node to an under-loaded receiver one:
− its OM and its SMs are migrated with minimal effort;
− already executing activities do not move from the sender node (the object is temporarily distributed). However, activities created from then on load the receiver node;
− the load of the sender node gradually diminishes as the activities belonging to the migrated object complete.

SMs follow the OM in its migration (as in the above example) or move later, in the case where the accesses to the state on the sender node outnumber those on the receiver node. The general scenario is one where OMs follow the load evolution and attract activities and SMs.

Any object migration raises the problem of requalifying object references [10]. In the PO support, this problem is currently solved without leaving any dependency on the sender node: every object is identified and referred to by a unique name for its whole life. This name is associated with the current OM address (the receiver of any message directed to the object), which changes when the allocation of the OM changes. When an object has migrated, a communication directed to its old address fails: when this happens, the support automatically finds the new name-address association before retrying the communication.

The following sub-sections describe the mechanisms for the migration of the OM and of the SMs, with their costs. Since any external access to the object structures should be viewed as a deviation from the object-oriented paradigm, migration is managed by the object itself on reception of a request (an OM or SM migration message).

4.1. Object Manager migration

OM migration means the relocation of the OM from a sender node to a receiver one.
The OM migration of an object O is composed of the following steps (figure 1):
1. a migration message is sent to the object O by the allocation manager;
2. the OM of O receives the migration command and then:
2a. it sends an object creation message to the creation manager of the receiver node. Once the new object (O1) is created on the receiver node, its OM detects that it is a migrated object. O1 must then receive the data structures of the object O before starting its normal execution cycle;
2b. O stops receiving external requests;
2c. O packs its execution state in a message and sends it to the new object O1 on the receiver node;
3. the OM of O de-allocates itself.

Figure 1. The Object Manager migration (message flow among the allocation manager, the migrating object on the sender node, and the creation manager and the new object on the receiver node: migration message, object creation message, object creation, stopping of receiving, packing and sending of the execution state, restoration of the execution state, reactivation, deallocation)

OM migration does not require the object activities to be informed about it: activities always refer to their OM by sending it a message (e.g. the termination message), and never by direct access to its data. When the OM has migrated, the communication from an activity to its OM fails: the new OM address is found and the communication is retried. Moreover, it is important that OM migration does not suspend the activities executing inside the object: the only effect of this migration is a delay in the service of waiting or incoming requests. Therefore, the cost of this migration can be measured by the time during which an object is unable to serve external requests. This cost has been kept very low (from 25 to 30 ms) by exploiting parallelism: by the time the object stops serving requests, the new object has already been created in parallel on the receiver node and just waits for the execution state before restarting its service [4]. We emphasise that this time is the worst-case delay that a request to the migrating object can suffer because of a migration.

4.2. State Manager migration

The migration mechanism of the SM follows the same guidelines as above. It consists of creating a new SM on the receiver node and sending the state to it; finally, the old SM is de-allocated. However, one relevant difference from OM migration stands out. Recall that activities co-resident with the state can access it directly, without any SM mediation. While the SM is under migration, and even afterwards, an activity could access a part of memory that is no longer part of the object state: the executing activities that need to access the state under migration must therefore be suspended for the sake of correctness.

Thus, the SM migration mechanism for the object O requires a suspension message to be sent to the activities. The "state migrating" message is sent to all activities, but an activity receives it only when it tries to access the state. The activity then suspends its execution until a "state migration completed" message is received. Activities that do not need to access the state during its migration are not involved in the migration at all.

The cost of a state migration can be measured by the period for which an activity could be suspended while waiting to access the state. This cost varies from 16 ms up to 20 ms. The dominant factor is the number of executing activities in the object: the number of messages to be sent to the activities increases the cost. Because of the high bandwidth (20 Mbit/s) of the interconnection network, the cost of migration is not influenced by the size of the state, as long as it is less than 10 Kbyte. The above cost evaluation considers the worst case: activities are suspended for the whole time only if they try to access the state at the beginning of the migration; in all other cases, the suspension is shorter.

5. Application examples

Several applications have been developed as testbeds for the mechanisms described above: these applications aim to test the use of the migration mechanisms to achieve load balancing and locality of communications.

5.1. Load balancing: the Mandelbrot fractal

The computation of the Mandelbrot fractal figure [11] shows how the migration mechanisms can be applied to achieve load balancing. This application is equally partitioned into several objects of the same class: each PO object is devoted to the computation of a strip of the figure, with no interaction among these objects. The same number of objects is assigned to each node: when the application starts, the load is balanced. The computation finishes earlier for some objects than for others, which unbalances the load of the system. When that occurs, object migrations shift load from overloaded nodes to underloaded ones: decisions are taken by following a local load balancing policy [9].

Averaging the data over several experiments and different numbers of nodes, the speedup is over 30%. In particular, we notice improvements of over 60% for very unbalanced experiments, while very little overhead (0.3%) was imposed on the application by the load balancing mechanisms when the load remained balanced for the whole computation.

5.2. Locality of communications

Another application aims to test the effect of migration on the locality of inter-object communications. The application consists of a multitude of objects belonging to a Client class and a Database class. Objects of the Client class request services (such as insertion, extraction and query) from objects of the Database class. We measure the average cost of a service as a function of the distance between the nodes where the Client and Database objects reside.
This cost can be lowered by an allocation policy that migrates the Client objects requesting services closer to their Database server. Even if, in that case, the migration cost also has to be considered, it can be counterbalanced by the number of services the Client requests. When the system is well balanced in load, a large number of service requests is needed to outweigh the cost of migration. If the Client and the Database are initially at distance 2, over 1000 service requests are needed to counterbalance a migration that brings the objects to distance 1. If the Client and the Database objects are initially at distance 10, 280 service requests are needed.

If the system load is not balanced and the Client and the Database are, respectively, in an overloaded part of the system and in an underloaded one, a migration that moves the Client toward its Database produces better results. Table 1 shows the average number of service requests needed to outweigh migration, depending on the system load unbalance and for different initial distances between client and server. This number drops dramatically as soon as the system becomes unbalanced. For an unbalance (measured by the standard deviation σ) of 50%, the Client can effectively migrate to its Database at distance 10 even if it requests only 20 services.

Unbalance (σ)   Distance:    2     6    10    14    20   ...    40
0%                        1100   496   280   173   130   ...    98
25%                        160   100    60    35    24   ...    17
50%                         51    31    20    14    12   ...    10
75%                         15     9     6     5     5   ...     4
100%                         7     4     3     3     3   ...     3

Table 1. Communication locality: number of services to outweigh migration

These results show how load balancing and locality of communication are strictly connected. This can also apply to one-to-many or many-to-many relationships. However, the implementation of policies for managing those communication patterns is not trivial: which client must be moved close to which server?

6. Conclusions and future work

The paper presents the support for the PO object-oriented programming environment.
It provides a transparent and dynamic distribution of the system load via remote creation and migration of objects. Its effectiveness relies on the assumption of the object-oriented framework: cases of profitable migration are frequent in an object scenario and the degree of intrusion of the mechanisms is low. The implemented mechanisms have been evaluated on several testbed applications: they have both achieved load balancing and improved performance.

The paper does not focus on policies, but shows the necessity of working in that direction, in particular to take communication costs into account. To solve the complexities introduced by communication, our perspective is that allocation policies cannot be completely automated, but also need to be directed by the user. In the current implementation, the user can specify only a few allocation hints, such as "object X is fixed" or "object Y is close to creator" (see section 2). More flexibility would come from expressing, in these hints, the execution and communication needs of the whole object and of its components.

References

1. M. Boari et al., "A Programming Environment Based on Parallel Objects for Transputer Architectures", in "Models and Tools for Massively Parallel Architectures", CNR, Napoli, June 1993.
2. W. C. Hsieh, P. Wang, W. E. Weihl, "Computation Migration: Enhancing Locality for Distributed Memory Parallel System", ACM SIGPLAN Notices, Vol. 28, No. 7, July 1993.
3. D. L. Eager, E. D. Lazowska, J. Zahorjan, "Adaptive Load Sharing in Homogeneous Distributed Systems", IEEE Transactions on Software Engineering, May 1986.
4. J. M. Smith, "A Survey of Process Migration Mechanisms", Operating Systems Review, ACM, July 1988.
5. R. S. Chin, S. T. Chanson, "Distributed Object-Based Programming Systems", ACM Computing Surveys, Vol. 23, No. 1, March 1991.
6. A. Goldberg, D. Robson, "Smalltalk-80: the Language and its Implementation", Addison-Wesley, 1983.
7. N. G. Shivaratri, P. Krueger, M. Singhal, "Load Distributing for Locally Distributed System", IEEE Computer, Vol. 25, No. 12, Dec. 1992.
8. D. L. Eager, E. D. Lazowska, J. Zahorjan, "The Limited Performance Benefits of Migrating Active Processes for Load Sharing", ACM SIGMETRICS Conf. on Modelling of Computer System, 1988.
9. A. Corradi, L. Leonardi, F. Zambonelli, "Load Balancing Strategies for Massively Parallel Architectures", Parallel Processing Letters, Vol. 2, No. 2 & 3, Sept. 1992.
10. Y. Artsy, R. Finkel, "Designing a Process Migration Facility: the Charlotte Experience", IEEE Computer, Vol. 22, No. 9, Sept. 1989.
11. B. B. Mandelbrot, "The Fractal Geometry of Nature", W. H. Freeman, San Francisco, 1982.