Distributed Systems: Message Passing, Clusters, and Implementation of Clusters in Representative Operating Systems

CS-550 (M.Soneru): Distributed Systems – Message Passing, Clusters, and Comparative OSs [Sta’01]

Distributed message passing
• Communication and synchronization mechanisms in distributed systems
  – Distributed message passing
  – Remote procedure call
• An implementation approach for message passing
  – Use the services of a message-passing module
  – Service is requested in the form of primitives and parameters
• Send primitive
  – Parameters
    • Destination process identifier
    • The message contents
  – Operation
    • Sending process uses ‘Send’ primitive (destination, message contents)
    • Message-passing module constructs data unit with destination and contents
    • Data unit is sent to the destination machine using communication facility (e.g., TCP/IP)
    • Data unit is received by the destination machine and is routed by the communication facility to the message-passing module
    • The message-passing module stores the message in the buffer for the destination process
• Receive primitive
  – Operation
    • Destination process assigns buffer area for messages and uses ‘Receive’ primitive to the message-passing module
    • Alternatively, message-passing module signals destination process with ‘Receive’ signal and places message in shared buffer
• Design issues:
  – Reliability vs. unreliability
  – Blocking vs. non-blocking
• Reliability vs. unreliability
  – Reliable message passing
    • Guarantees delivery if possible
    • Uses a reliable transport protocol
    • Performs error checking, acknowledgment, retransmission, and reordering of messages if delivered out of sequence
    • Acknowledgment to the sending process that delivery was either successful or it failed (e.g., network failure)
  – Unreliable message passing
    • Message-passing facility sends the message without reporting success or failure
    • Message-passing facility has a simple design and low overhead
    • Applications may use ‘Request’ and ‘Reply’ to acknowledge delivery
• Blocking vs. non-blocking
  – Blocking or synchronous primitives
    • Blocking ‘Send’ does not return control to the sending process (process suspended) until
      – Message has been transmitted (unreliable service), or
      – Message has been sent and an acknowledgment received (reliable service)
    • Blocking ‘Receive’ does not return control to the receiving process until
      – Message has been placed in the allocated buffer
  – Non-blocking or asynchronous primitives
    • ‘Send’ primitive does not suspend process
      – Control returned to the process as soon as the message has been queued for transmission or a copy has been made
      – After the message has been transmitted or copied to a safe place for later transmission, sending process is interrupted to be informed that the message buffer is available
    • ‘Receive’ primitive does not suspend process
      – Process is sent an interrupt upon message arrival or process can poll periodically for messages
    • Advantages/disadvantages
      – Efficient use of message-passing mechanism
      – Difficult to test and debug: time-dependent sequences can lead to obscure bugs

Remote procedure calls
• Provides access to remote services by providing simple procedure call/return semantics, similar to those used for local services
• Advantages
  – The procedure call is used extensively
  – Remote interfaces can be specified and clearly documented as a set of named operations with designated types
  – The interface is standardized
    • The communication code for an application can be generated automatically
    • Client/server modules can be easily ported between different OSs and target systems
• Example of procedure call for the calling program: CALL P (X, Y), where
  – P = procedure name
  – X = passed arguments
  – Y = returned values
• Dummy or stub procedure on the local machine
  – Included in the caller’s address space or dynamically linked at call time
  – Creates message identifying remote procedure and includes parameters
  – Sends message to remote system and waits for reply
  – When reply arrives, it returns to the calling program providing the returned values
• Dummy or stub procedure on the remote machine
  – Upon receiving the message, generates a local CALL P (X, Y)
  – Returns reply
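
The stub mechanism above can be sketched in a few lines. This is a minimal illustration, not a real RPC package: it assumes a pair of in-process queues stands in for the communication facility and JSON for the message format. The names `client_stub`, `server_stub`, `PROCEDURES`, `remote_add`, and `demo` are hypothetical and do not come from the slides.

```python
import json
import queue
import threading

# Assumption: two in-process queues play the role of the communication
# facility (a real system would use a transport such as TCP/IP).
requests = queue.Queue()
replies = queue.Queue()


def add(x, y):
    """The actual procedure P, which lives on the 'remote' machine."""
    return x + y


# Hypothetical registry mapping procedure names to local procedures.
PROCEDURES = {"add": add}


def server_stub():
    """Remote-side stub: receive one message, make the local call, reply."""
    message = json.loads(requests.get())            # blocking Receive
    procedure = PROCEDURES[message["proc"]]
    result = procedure(*message["args"])            # local CALL P (X, Y)
    replies.put(json.dumps({"result": result}))     # Send the reply


def client_stub(proc, *args):
    """Local-side stub: pack the call into a message and wait for the reply."""
    requests.put(json.dumps({"proc": proc, "args": args}))  # Send
    return json.loads(replies.get())["result"]              # blocking Receive


def remote_add(x, y):
    """Looks like a local procedure to the calling program."""
    return client_stub("add", x, y)


def demo():
    server = threading.Thread(target=server_stub)
    server.start()
    value = remote_add(2, 3)
    server.join()
    return value
```

The caller only ever sees `remote_add`; the message construction, the send, and the wait for the reply are hidden inside the stub, which is what gives RPC its ordinary call/return appearance.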
• Design issues
  – Parameter passing
    • Call by value (parameters passed as values)
      – Parameters copied into the message and sent to remote system
      – Easy to implement for RPCs
    • Call by reference (pointers to a location that contains the value)
      – More difficult to implement for RPCs
  – Parameters and results representation
    • No problem if the calling and called programs use the same language and run on the same type of OSs and machines
    • If there are differences, the remote procedure call mechanism must provide the conversion: standardized format for common objects (e.g., integers, characters)
  – Client/server binding
    • A client/server binding is established after the two applications have made a logical connection and are ready to exchange commands and data
    • Non-persistent binding: Logical connection between the two processes established at the time of RPC and disconnected after the values are returned
    • Persistent binding: Connection set up for RPC remains up after return
  – Synchronous vs. asynchronous
    • Synchronous RPC
      – Calling process waits for the returned values
      – Traditional, functions like a subroutine call
      – Easy to understand and test but leads to lower performance
    • Asynchronous RPC
      – Calling process is not blocked
      – Methods for synchronizing the client and the server
        » Higher-layer applications in both client and server initiate the exchange and then verify that all actions have been completed
        » Client uses a series of asynchronous RPCs followed by a synchronous RPC
  – Object-oriented mechanisms
    • Operation
      – Client sends request to an object request broker
      – Broker acts as a directory of all remote services on the network.
        Broker calls the appropriate remote object and passes data.
      – Remote object services request, replies to broker, which returns response to client
    • Competing approaches:
      – Common Object Request Broker Architecture (CORBA) from the Object Management Group, backed by IBM, Apple, Sun
      – Component Object Model (COM), the basis for Object Linking and Embedding (OLE) from Microsoft

Clusters
• Cluster: group of interconnected computers (nodes) working together as a unified computing resource and creating the illusion of being one machine
• Advantages of clusters:
  – Absolute scalability
    • Clusters can consist of hundreds of machines, each being a multiprocessor
  – Incremental scalability
    • A cluster can grow in small increments with minimum service disruption
  – High availability
    • Fault-tolerant operation in software
  – Superior price/performance
    • Off-the-shelf building blocks
• Cluster configurations
  – Passive standby
    • Active system processes the entire load; the standby takes over in case of failure of the primary
    • Active sends ‘heartbeat’ messages to standby to indicate continued operation
    • High cost: no task sharing
    • Easy to implement
  – Active secondary
    • Secondary server is also used for processing tasks
    • Reduced cost due to task sharing
    • Increased complexity
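
The heartbeat exchange in the passive-standby configuration can be sketched as a timeout check on the standby side. This is a minimal sketch under stated assumptions: the class name `HeartbeatMonitor`, the `timeout` parameter, and the caller-supplied timestamps are illustrative choices, not part of any particular cluster product.

```python
class HeartbeatMonitor:
    """Runs on the standby and decides when to take over from the active system.

    Assumption: the caller passes in the current time (in seconds), which keeps
    the sketch deterministic; a real monitor would read a monotonic clock.
    """

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout      # longest silence tolerated, in seconds
        self.last_seen = now        # time of the most recent heartbeat
        self.taken_over = False     # True once the standby has become active

    def on_heartbeat(self, now):
        """Record a heartbeat message received from the active system."""
        self.last_seen = now

    def check(self, now):
        """Return True if the standby has taken over (now or earlier)."""
        if not self.taken_over and now - self.last_seen > self.timeout:
            self.taken_over = True  # heartbeats stopped: assume primary failed
        return self.taken_over
```

The choice of `timeout` is the main design trade-off: a short value detects failures quickly but risks a false takeover when a heartbeat is merely delayed.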
  – Separate servers
    • Each server has its own disk, no disks shared
    • Data copied between servers periodically
    • Scheduling assigns client requests to servers to balance the load
    • High availability
    • High server and network overhead due to data copying
  – Shared disks, non-shared volumes (shared nothing)
    • Common disks are partitioned into volumes, each volume owned by only one computer
    • On computer failure, cluster is reconfigured to assign volumes to remaining computers
  – Shared disks, shared volumes
    • Each computer has access to all volumes on all disks
    • Locking mechanism used to ensure that data is accessed by one computer at a time
• OS design issues
  – Failure management
    • Highly available clusters
      – High probability that all resources will be in service
      – In case of failure, the queries in progress are lost
      – If retried, the query will be serviced by another computer in the cluster
    • Fault-tolerant clusters
      – Redundant shared disks and fault-tolerant operations
      – Fail-over: switching an application from a failed system to an alternative
      – Fail-back: the restoration of applications and data resources to the failed system after recovery
  – Load balancing
    • Load must be balanced among available computers
    • When a new computer is added to the cluster, the load needs to be rebalanced to include the new computer
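
Fail-over and rebalancing can be illustrated with a toy task table. This is a rough sketch only: it assumes every task costs the same, that load is simply the number of tasks on a node, and that at least one node survives a failure. The function names `least_loaded`, `fail_over`, and `rebalance` are hypothetical.

```python
def least_loaded(assignment):
    """Return the node that currently holds the fewest tasks."""
    return min(assignment, key=lambda node: len(assignment[node]))


def fail_over(assignment, failed):
    """Move every task of a failed node onto the surviving nodes.

    Assumption: at least one other node remains in `assignment`.
    """
    orphaned = assignment.pop(failed)
    for task in orphaned:
        assignment[least_loaded(assignment)].append(task)
    return assignment


def rebalance(assignment):
    """Shift tasks from the busiest node to the idlest one until the task
    counts of any two nodes differ by at most one."""
    while True:
        busiest = max(assignment, key=lambda node: len(assignment[node]))
        idlest = least_loaded(assignment)
        if len(assignment[busiest]) - len(assignment[idlest]) <= 1:
            return assignment
        assignment[idlest].append(assignment[busiest].pop())
```

A typical sequence is `fail_over(cluster, "c")` when node `c` dies, then, once a replacement joins, `cluster["d"] = []` followed by `rebalance(cluster)` so the new computer receives its share of the load.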
  – Parallelizing computation: executing software from a single application in parallel
    • Parallelizing compiler
      – It is determined, at compile time, which parts of the application can be run in parallel
      – The parallel parts are assigned to different computers in the cluster
    • Parallelized application
      – The application is designed to run on the cluster and uses message passing for communication
      – Most powerful approach to exploit clusters
    • Parametric computing
      – Useful for programs that must be executed a large number of times, each time with a different set of parameters (e.g., a simulation model)
      – Parametric processing tools are needed to organize, run, and manage the jobs
• Cluster computer architecture
  – All computers are interconnected by a high-speed LAN or switch
  – Each computer is capable of operating independently
  – A middleware layer of software runs on each computer to implement the cluster functionality
    • Provides a unified system image to the user, called a single-system image
    • Is responsible for providing load balancing and high availability
• Middleware services and functions
  – Single entry point: A user logs into the cluster, not on a specific computer
  – Single file hierarchy: The user sees only a single file hierarchy, under one root directory
  – Single control point: A default workstation is used for cluster management and control
  – Single virtual networking: There is a single virtual network connecting the cluster computers, even if it consists of multiple interconnected networks
  – Single memory space: A distributed shared memory is used to share variables
  – Single job-management system: The cluster has a job scheduler and jobs are submitted to the cluster and not to individual computers
  – Single user interface: A common graphic interface is used for all users, regardless of the workstation they use to enter the cluster
  – Single I/O space: Any node can access any I/O device
  – Single process space: A process on any node can create or communicate with any other process in the cluster
  – Check-pointing: Process states and intermediate results are saved periodically, permitting rollback recovery after failures
  – Process migration: Processes can move inside the cluster to provide load balancing
• Clusters compared with SMPs
  – SMPs
    • Easier to manage and configure than clusters
    • Much closer to the original uniprocessor model
    • Major difference from the uniprocessor is the scheduler function
    • Use less physical space and require less energy than a comparable cluster
    • SMP products are well established and stable
  – Clusters
    • Far superior to SMPs in terms of absolute and incremental scalability
    • Far superior in terms of availability
  – Clusters are likely to dominate the high-performance server market

Windows 2000 Cluster Server
• Windows 2000 Cluster Server (initially code-named Wolfpack) is configured as a shared-nothing cluster, where each volume and other resource is owned by a single system at a time
• Main concepts
  – Cluster Service:
    • The software on each node responsible for cluster-specific activities
  – Resource:
    • These are the items managed by the cluster service
    • They are objects representing either physical hardware devices (e.g., disk drives, network cards) or logical items (e.g., disk volumes, IP addresses, applications, databases)
    • Resources are implemented as dynamically linked libraries (DLLs) and managed by a resource monitor
  – Online: A resource is online at a node if it provides a service at that node
  – Group:
    • A collection of resources that are managed as a single entity
    • Consists of all elements needed to run a specific application and to allow the client systems to connect to the service provided by that application
    • Operations can be performed on the entire group (e.g., transfer to another node)
• The W2K Cluster Server components and their relationship in a single node of a cluster
  – Node manager
    • Responsible for maintaining this node’s membership in the cluster
    • It sends periodic heartbeat messages to the node managers of the other nodes in the same cluster
    • If it detects the loss of heartbeat messages from another node
      – It broadcasts a message to the entire cluster
      – All members exchange messages to verify their view of current cluster membership
      – If a node manager does not reply, its node is removed from the cluster and its active groups are transferred to one or more of the other nodes in the cluster
  – Configuration database manager
    • Responsible for the cluster configuration database
    • The database has information about all cluster resources, groups, and node ownership of groups
    • Database managers on all nodes communicate with each other to maintain a consistent view of configuration information in the cluster
    • The integrity of the database is maintained by using fault-resistant software for all changes to cluster configuration
  – Resource manager / fail-over manager
    • Responsible for management of resource groups
    • Initiates actions such as startup, reset, and fail-over
    • In case of fail-over, the fail-over managers on the active nodes negotiate the redistribution of resource groups from the failed node to the remaining active ones
    • When the node that failed has recovered, the fail-over managers may decide to move some groups back to it
  – Event processor
    • Connects all the components of the cluster service
    • Handles common operations
    • Controls cluster service initialization
  – Communications manager
    • Provides the facilities for message exchange with other nodes in the cluster
  – Global update manager
    • Provides an update service for other components

Sun cluster
• Solaris UNIX has been extended to make the Sun Cluster distributed operating system
• It appears to users and applications as a single computer running the Solaris OS
• Components:
  – Object and communications support
  – Process management
  – Networking
  – Global distributed file system
• Object and communications support
  – Object oriented: uses the CORBA object model to define objects and the remote procedure call (RPC) mechanism
• Global process management
  – The location of a process is transparent to the user
  – Each process has a unique identifier within the cluster
  – Process migration is possible: a process can move from node to node to achieve load balancing and for fail-over (caveat: the threads of a single process must be on the same node)
• Networking
  – Strategy:
    • A packet filter is used to route packets to the proper node
    • Cluster appears externally as a single server with a single IP address
  – Operation
    • Incoming packets are received on the node that has the network adapter, filtered, and delivered to the correct target node for protocol processing over cluster interconnect
    • For outgoing packets, originating node performs protocol processing, transfers packet over cluster interconnect to the node that has external network physical connection
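
The incoming-packet path can be sketched as a lookup performed on the node that owns the external adapter. The slides do not say how the filter picks the target node, so this sketch assumes a simple table keyed by destination port; `Packet`, `PacketFilter`, `port_to_node`, and `default_node` are hypothetical names used for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    dst_port: int      # port on the cluster's single external IP address
    payload: bytes


class PacketFilter:
    """Runs on the node with the external network adapter."""

    def __init__(self, port_to_node, default_node):
        # Assumption: services are pinned to nodes by destination port.
        self.port_to_node = dict(port_to_node)
        self.default_node = default_node
        # Stand-in for the cluster interconnect: node name -> forwarded packets.
        self.interconnect = defaultdict(list)

    def route(self, packet):
        """Pick the node that should do protocol processing for this packet."""
        return self.port_to_node.get(packet.dst_port, self.default_node)

    def receive(self, packet):
        """Filter an incoming packet and hand it to its target node."""
        target = self.route(packet)
        self.interconnect[target].append(packet)
        return target
```

Because clients only ever address the one external IP, the table can be changed (for example after a fail-over) without any client noticing that a service has moved to a different node.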
• Global file system
  – Like the standard Solaris, the Sun Cluster is based on the concepts of virtual node (vnode) and the virtual file system (vfs)
  – Standard Solaris
    • Vnode
      – The vnode structure is used to provide a general-purpose interface to all types of file systems
      – A vnode provides mapping to an object in any file system type (by contrast, an inode in UNIX can provide mapping to UNIX files only)
      – The vnode interface accepts general-purpose file manipulation commands (e.g., read, write) and translates them into the actions appropriate for the respective file system
    • Vfs
      – Vfs structures are used to describe entire file systems
      – The vfs interface accepts general-purpose commands that operate on entire file systems and translates them into actions appropriate for a particular file system
  – Global file access
    • The global file system provides a uniform interface to files distributed over the cluster
    • Processes on all nodes use the same pathname to locate a file and can open any file
  – Implementation
    • A proxy file system was built on top of the existing Solaris file system at the vnode interface
    • Vfs/vnode operations are converted by the proxy layer into object invocations
    • The invoked object may reside on any node in the cluster; it performs a local vnode/vfs operation on the underlying file system
    • Caching is used for file contents, directory information, and file attributes

Beowulf and Linux clusters
• Beowulf
  – Beowulf project
    • Initiated under the NASA High Performance Computing and Communications (HPCC) project
    • Goal: expand the capabilities of clustered PCs for performing important computational tasks
    • Widely implemented; perhaps the most important cluster technology available
  – Beowulf features
    • Use of off-the-shelf components, no custom components, available from many vendors
    • Dedicated processors
    • Dedicated private network (LAN or WAN or inter-networked combination)
    • Scalable I/O
    • Free software base and distributed computing tools
    • Return of the design and improvements to the community
• Most Beowulf implementations use a cluster of Linux workstations or PCs
• A representative Linux implementation of Beowulf contains
  – A number of workstations (not necessarily the same platform) all running Linux
  – Secondary storage at each workstation can be available for distributed access (e.g., distributed file sharing)
  – The Linux nodes are interconnected with an off-the-shelf network (e.g., Ethernet switch or an interconnected set of Ethernet switches)
• Beowulf software
  – Open-source Beowulf software
  – Beowulf tools and utilities
  – Linux kernel, modified to allow the individual nodes to participate in a number of global namespaces
• Examples of Beowulf system software
  – Beowulf distributed process space (BPROC)
    • Allows a process to span multiple nodes in a cluster environment
    • Provides a mechanism for starting a process on another node without logging in to that node
    • Makes all remote processes visible in the process table of the cluster’s front-end node
  – Beowulf Ethernet channel bonding
    • Mechanism that joins multiple networks into a single logical network with higher bandwidth
    • Distributes packets over the available device transmit queues
    • Provides load balancing over multiple Ethernets connected to Linux workstations
  – PVMSYNC
    • Provides a synchronization mechanism and shared data objects within a cluster
  – EnFusion
    • Set of tools for parametric computing, i.e., execution of a program as a large number of jobs, each with different parameters
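
Parametric computing can be illustrated with a small parameter sweep. This sketch does not use or imitate the EnFusion interface; it assumes a thread pool on one machine stands in for the cluster's nodes, and `simulate`, `run_sweep`, and the parameter names `rate` and `steps` are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def simulate(rate, steps):
    """Stand-in simulation model: compound growth over a number of steps."""
    value = 1.0
    for _ in range(steps):
        value *= 1.0 + rate
    return value


def run_sweep(model, grid, workers=4):
    """Run `model` once for every combination of parameter values in `grid`.

    `grid` maps each parameter name to the list of values to try. Returns a
    list of (parameters, result) pairs in the order the jobs were generated.
    """
    names = list(grid)
    jobs = [dict(zip(names, combo)) for combo in product(*grid.values())]
    # Assumption: worker threads play the role of cluster nodes here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda params: model(**params), jobs))
    return list(zip(jobs, results))
```

For example, `run_sweep(simulate, {"rate": [0.0, 1.0], "steps": [1, 2]})` generates four independent jobs; because the jobs share no state, they can be distributed across nodes without any communication between them.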