Distributed System Cluster Comp

Distributed Systems:
Message Passing, Clusters, and
Implementation of Clusters in Representative
Operating Systems
Distributed message passing
• Communication and synchronization mechanisms in distributed systems
– Distributed message passing
– Remote procedure call
• An implementation approach for message passing
– Use the services of a message-passing module
– Service is requested in the form of primitives and parameters
Distributed message passing (cont.)
• Send primitive
– Parameters
• Destination process identifier
• The message contents
– Operation
• Sending process uses ‘Send’ primitive (destination, message contents)
• Message-passing module constructs data unit with destination and contents
• Data unit is sent to the destination machine using communication facility (e.g., TCP/IP)
• Data unit is received by the destination machine and is routed by the communication
facility to the message-passing module
• The message-passing module stores the message in the buffer for the destination process
• Receive primitive
– Operation
• Destination process assigns buffer area for messages and uses ‘Receive’ primitive to the
message passing module
• Alternatively, message-passing module signals destination process with ‘Receive’ signal
and places message in shared buffer
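The Send/Receive flow above can be sketched in a few lines of Python. This is a minimal illustration, not a real implementation: the class `MessagePassingModule`, its method names, and the use of a direct method call in place of TCP/IP are all assumptions made for the example.

```python
import queue
from dataclasses import dataclass


@dataclass
class DataUnit:
    destination: str   # destination process identifier
    contents: bytes    # the message contents


class MessagePassingModule:
    """Hypothetical per-machine module; the 'network' is a direct method call."""

    def __init__(self):
        self.buffers = {}  # one buffer per local destination process

    def register(self, pid):
        self.buffers[pid] = queue.Queue()

    def send(self, remote, destination, contents):
        unit = DataUnit(destination, contents)  # construct the data unit
        remote.deliver(unit)                    # stand-in for TCP/IP transfer

    def deliver(self, unit):
        # Called when the communication facility hands over a data unit:
        # store the message in the buffer of the destination process.
        self.buffers[unit.destination].put(unit.contents)

    def receive(self, pid):
        return self.buffers[pid].get()  # blocks until a message is buffered


machine_a = MessagePassingModule()
machine_b = MessagePassingModule()
machine_b.register("p2")
machine_a.send(machine_b, "p2", b"hello")
received = machine_b.receive("p2")
```

Here `received` holds the bytes that process "p2" takes out of its buffer on the destination machine.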
Distributed message passing (cont.)
• Design issues:
– Reliability vs. unreliability
– Blocking vs. non-blocking
• Reliability vs. unreliability
– Reliable message passing
• Guarantees delivery if possible
• Uses a reliable transport protocol
• Performs error checking, acknowledgment, retransmission, and reordering of
messages if delivered out of sequence
• Acknowledgment to the sending process that delivery either succeeded or
failed (e.g., network failure)
– Unreliable message passing
• Message-passing facility sends the message without reporting success or failure
• Message-passing facility has a simple design and low overhead
• Applications may use ‘Request’ and ’Reply’ to acknowledge delivery
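As a rough sketch of the last point, an application can build its own acknowledgment on top of an unreliable facility by treating the ‘Reply’ as confirmation of the ‘Request’ and retransmitting when no reply arrives. The lossy channel, the function names, and the retry limit below are assumptions for illustration only.

```python
import random


def unreliable_send(message, loss_rate, rng):
    """Hypothetical lossy channel: returns the message, or None if it was dropped."""
    return None if rng.random() < loss_rate else message


def server(request):
    # Echo the request id back; the reply doubles as the acknowledgment.
    return ("Reply", request[1])


def send_with_retry(request_id, payload, loss_rate=0.5, max_tries=20, seed=1):
    """Return the number of attempts needed to get a matching reply."""
    rng = random.Random(seed)
    for attempt in range(1, max_tries + 1):
        delivered = unreliable_send(("Request", request_id, payload), loss_rate, rng)
        if delivered is None:
            continue  # request lost: retransmit
        reply = unreliable_send(server(delivered), loss_rate, rng)
        if reply is not None and reply[1] == request_id:
            return attempt
    raise TimeoutError("no reply after %d tries" % max_tries)


attempts = send_with_retry(request_id=7, payload=b"data")
```

The facility itself stays simple; the cost of reliability is paid only by applications that need it.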
Distributed message passing (cont.)
• Blocking vs. non-blocking
– Blocking or synchronous primitives
• Blocking ‘Send’ does not return control to the sending process until
– Message has been transmitted (unreliable service), or
– Message has been sent and an acknowledgment received (reliable service)
• Blocking ‘Receive’ does not return control to the receiving process until
– Message has been placed in the allocated buffer
Distributed message passing (cont.)
• Blocking vs. non-blocking
– Non-blocking or asynchronous primitives
• ‘Send’ primitive does not suspend process
– Control returned to the process as soon as the message has been queued
for transmission or a copy has been made
– After the message has been transmitted or copied to a safe place for later
transmission, sending process is interrupted to be informed that the
message buffer is available
• ‘Receive’ primitive does not suspend process
– Process is sent an interrupt upon message arrival or process can poll
periodically for messages
• Advantages/disadvantages
– Efficient use of message passing mechanism
– Difficult to test and debug: time-dependent sequences can lead to obscure problems
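The difference between the two styles of ‘Receive’ can be illustrated with Python's standard `queue` module, used here as a stand-in for the message buffer. The helper names are hypothetical; only `Queue.get`, `Queue.get_nowait`, and `queue.Empty` come from the standard library.

```python
import queue

mailbox = queue.Queue()  # buffer allocated for the receiving process


def nonblocking_receive(box):
    """Poll once; return None immediately if nothing has arrived."""
    try:
        return box.get_nowait()
    except queue.Empty:
        return None


def blocking_receive(box, timeout=None):
    """Suspend the caller until a message is in the buffer."""
    return box.get(timeout=timeout)


first_poll = nonblocking_receive(mailbox)   # nothing yet, returns at once
mailbox.put("m1")                           # sender queues the message and continues
second_poll = nonblocking_receive(mailbox)  # polling now finds the message
mailbox.put("m2")
blocked = blocking_receive(mailbox, timeout=1.0)
```

The polling version never suspends the process, but the caller must decide when and how often to poll.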
Remote procedure calls
• Provides access to remote services by providing simple procedure
call/return semantics, similar to those used for local services
• Advantages
– The procedure call is used extensively
– Remote interfaces can be specified and clearly documented as a set of
named operations with designated types
– The interface is standardized
• The communication code for an application can be generated automatically
• Client/server modules can be easily ported between different OSs and target machines
• Example of procedure call for the calling program
CALL P (X, Y)
P = procedure name
X = passed arguments
Y = returned values
Remote procedure calls (cont.)
• Dummy or stub procedure on the local machine
– Included in the caller’s address space or dynamically linked at call time
– Creates message identifying remote procedure and includes parameters
– Sends message to remote system and waits for reply
– When reply arrives, it returns to the calling program providing the returned values
• Dummy or stub procedure on the remote machine
– Upon receiving the message, generates a local CALL P (X, Y)
– Returns reply
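A minimal sketch of the two stubs, assuming JSON as the message format and a direct function call in place of the network. The procedure `add`, the `PROCEDURES` table, and the function names are hypothetical.

```python
import json


def add(x, y):
    """The remote procedure P (hypothetical)."""
    return x + y


PROCEDURES = {"add": add}


def server_stub(request_bytes):
    """Unpack the message, make the local call, and pack the reply."""
    request = json.loads(request_bytes)
    result = PROCEDURES[request["proc"]](*request["args"])
    return json.dumps({"result": result}).encode()


def transport(request_bytes):
    # Stand-in for sending the message to the remote system and waiting.
    return server_stub(request_bytes)


def client_stub(proc, *args):
    """Looks like a local call; builds the message and waits for the reply."""
    message = json.dumps({"proc": proc, "args": list(args)}).encode()
    reply = transport(message)
    return json.loads(reply)["result"]


y = client_stub("add", 2, 3)
```

The calling program sees only `client_stub("add", 2, 3)`; the message construction and reply handling stay hidden inside the stubs.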
Remote procedure calls (cont.)
• Design issues
– Parameter passing
• Call by value (parameters passed as values)
– Parameters copied into the message and sent to remote system
– Easy to implement for RPCs
• Call by reference (pointers to a location that contains the value)
– More difficult to implement for RPCs
– Parameters and results representation
• No problem if the calling and called programs use the same language and run
on the same type of OSs and machines
• If there are differences, the remote procedure call mechanism must provide the
conversion: standardized format for common objects (e.g., integers, characters)
– Client/server binding
• A client/server binding is established after the two applications have made a
logical connection and are ready to exchange commands and data
• Non-persistent binding: Logical connection between the two processes
established at the time of RPC and disconnected after the values are returned
• Persistent binding: Connection set up for RPC remains up after return
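For the representation issue above, one common approach is to marshal call-by-value parameters into a fixed, machine-independent layout. The sketch below uses Python's `struct` module with network byte order; the particular layout (a 32-bit integer followed by a length-prefixed UTF-8 string) is an assumption chosen for the example.

```python
import struct

HEADER = "!iH"  # "!" = network byte order; i = 32-bit int; H = 16-bit length


def marshal(count, label):
    """Encode an int and a string in a machine-independent layout."""
    text = label.encode("utf-8")
    return struct.pack(HEADER, count, len(text)) + text


def unmarshal(data):
    """Decode the layout produced by marshal()."""
    count, length = struct.unpack_from(HEADER, data, 0)
    offset = struct.calcsize(HEADER)
    return count, data[offset:offset + length].decode("utf-8")


wire = marshal(258, "ok")
decoded = unmarshal(wire)
```

Because both sides agree on the wire format, neither needs to know the other's native byte order or string representation.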
Remote procedure calls (cont.)
• Design issues (cont.)
– Synchronous vs. asynchronous
• Synchronous RPC
– Calling process waits for the returned values
– Traditional, functions like a subroutine call
– Easy to understand and test but leads to lower performance
• Asynchronous RPC
– Calling process is not blocked
– Methods for synchronizing the client and the server
» Higher-layer applications in both client and server initiate the
exchange and then verify that all actions have been completed
» Client uses a series of asynchronous RPCs followed by a
synchronous RPC
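A rough local analogue of asynchronous calls followed by a synchronization point, using `concurrent.futures` from the standard library. The function `remote_square` is a stand-in for a remote procedure; no real RPC mechanism is involved.

```python
from concurrent.futures import ThreadPoolExecutor


def remote_square(x):
    return x * x  # stand-in for a remote procedure


with ThreadPoolExecutor(max_workers=4) as pool:
    # Asynchronous calls: submit() returns at once with a future,
    # so the caller is not blocked while the work proceeds.
    pending = [pool.submit(remote_square, n) for n in range(5)]
    # Synchronization point: result() blocks until each reply is in.
    results = [future.result() for future in pending]
```

The caller can overlap several outstanding calls and only waits when it actually needs the returned values.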
Remote procedure calls (cont.)
• Design issues (cont.)
– Object-oriented mechanisms
• Operation
– Client sends request to an object request broker
– Broker acts as a directory of all remote services on the network. Broker calls
appropriate remote object and passes data.
– Remote object services request, replies to broker, which returns response to client
• Competing approaches:
– Common Object Request Broker Architecture (CORBA) from the Object
Management Group, backed by IBM, Apple, Sun
– Component Object Model (COM), the basis for Object Linking and Embedding
(OLE) from Microsoft
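The broker operation described above can be sketched as a simple directory lookup. This is an in-process illustration only; the `Broker` and `Calculator` classes are hypothetical and do not reflect the CORBA or COM APIs.

```python
class Broker:
    """Hypothetical object request broker acting as a directory of remote objects."""

    def __init__(self):
        self.directory = {}

    def register(self, name, obj):
        self.directory[name] = obj

    def request(self, name, method, *args):
        target = self.directory[name]            # look up the remote object
        return getattr(target, method)(*args)    # call it and relay the response


class Calculator:
    def double(self, x):
        return 2 * x


broker = Broker()
broker.register("calc", Calculator())
response = broker.request("calc", "double", 21)
```

The client names the service and the operation; it never holds a direct reference to the object that does the work.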
Clusters
• Cluster: group of interconnected computers (nodes) working
together as a unified computing resource and creating the illusion
of being one machine
• Advantages of clusters:
– Absolute scalability
• Clusters can consist of hundreds of machines, each being a multiprocessor
– Incremental scalability
• A cluster can grow in small increments with minimum service disruption
– High availability
• Fault-tolerant operation in software
– Superior price/performance
• Off-the-shelf building blocks
Clusters (cont.)
• Cluster configurations
– Passive standby
• Active system processes the entire load, the standby takes over in case
of failure of primary
• Active sends ‘heartbeat’ messages to standby to indicate continued operation
• High cost: no task sharing
• Easy to implement
– Active secondary
• Secondary server is also used for processing tasks
• Reduced cost due to task sharing
• Increased complexity
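The heartbeat mechanism of the passive standby configuration can be sketched as a timeout check. The `Standby` class, its timeout value, and the explicit `now` timestamps are assumptions made so the example runs deterministically.

```python
class Standby:
    """Hypothetical standby node that watches the active node's heartbeats."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heartbeat = 0.0
        self.active = False  # True once the standby has taken over

    def on_heartbeat(self, now):
        self.last_heartbeat = now

    def check(self, now):
        # Take over if no heartbeat arrived within the timeout window.
        if not self.active and now - self.last_heartbeat > self.timeout:
            self.active = True
        return self.active


standby = Standby(timeout=3.0)
standby.on_heartbeat(now=1.0)
alive_check = standby.check(now=2.0)    # heartbeat is recent: stay passive
failed_check = standby.check(now=10.0)  # primary has gone silent: take over
```

Choosing the timeout is a trade-off: too short and a slow network triggers a false takeover, too long and the outage lasts longer than necessary.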
Clusters (cont.)
• Cluster configurations (cont.)
– Separate servers
• Each server has its own disk, no disks shared
• Data copied between servers periodically
• Scheduling assigns client requests to servers to balance the load
• High availability
• High server and network overhead due to data copying
– Shared disks, non-shared volumes (shared nothing)
• Common disks are partitioned into volumes, each volume owned by
only one computer
• On computer failure, cluster is reconfigured to assign volumes to
remaining computers
– Shared disks, shared volumes
• Each computer has access to all volumes on all disks
• Locking mechanism used to ensure that data is accessed by one
computer at a time
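For the shared-disk configuration with non-shared volumes, reconfiguration after a failure amounts to handing the failed computer's volumes to the survivors. The function below is a sketch under the assumption that volumes go to whichever surviving computer currently owns the fewest; the source does not specify a policy.

```python
def reassign_volumes(owners, failed):
    """Return a new volume -> computer map with the failed computer's volumes moved."""
    survivors = sorted({c for c in owners.values() if c != failed})
    if not survivors:
        raise RuntimeError("no computers left in the cluster")
    load = {c: 0 for c in survivors}
    for computer in owners.values():
        if computer != failed:
            load[computer] += 1
    new_owners = dict(owners)
    for volume in sorted(owners):
        if owners[volume] == failed:
            # Assumed policy: give the volume to the least-loaded survivor.
            target = min(survivors, key=lambda c: (load[c], c))
            new_owners[volume] = target
            load[target] += 1
    return new_owners


owners = {"vol1": "A", "vol2": "A", "vol3": "B", "vol4": "C"}
after = reassign_volumes(owners, failed="A")
```

Each volume still has exactly one owner after the reconfiguration, which is the property that distinguishes this configuration from shared volumes.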
Clusters (cont.)
• OS design issues
– Failure management
• Highly available clusters
– High probability that all resources will be in service
– In case of failure, the queries in progress are lost
– If retried, the query will be serviced by another computer in the cluster
• Fault-tolerant clusters
– Redundant shared disks and fault-tolerant operations
– Fail-over: switching an application from a failed system to an alternative
– Fail-back: the restoration of applications and data resources to the failed
system after recovery
– Load balancing
• Load must be balanced among available computers
• When a new computer is added to the cluster, the load needs to be
rebalanced to include the new computer
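A small sketch of rebalancing when a computer joins the cluster. The greedy policy (largest job first, always to the least-loaded computer) and the job costs are assumptions for illustration; real cluster schedulers use richer information.

```python
def rebalance(jobs, computers):
    """Greedy assignment: largest job first, always to the least-loaded computer."""
    load = {c: 0 for c in computers}
    placement = {}
    for job, cost in sorted(jobs.items(), key=lambda kv: (-kv[1], kv[0])):
        target = min(computers, key=lambda c: (load[c], c))
        placement[job] = target
        load[target] += cost
    return placement, load


jobs = {"j1": 4, "j2": 3, "j3": 3, "j4": 2}
_, load_two = rebalance(jobs, ["n1", "n2"])
_, load_three = rebalance(jobs, ["n1", "n2", "n3"])  # n3 has just joined
```

With the third computer included, the heaviest node carries less work than before, which is the point of rebalancing after growth.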
Clusters (cont.)
• OS design issues (cont.)
– Parallelizing computation: executing software from a single
application in parallel
• Parallelizing compiler
– It is determined, at compile time, which parts of the application can be run
in parallel
– The parallel parts are assigned to different computers in the cluster
• Parallelized application
– The application is designed to run on the cluster and uses message passing
for communication
– Most powerful approach to exploit clusters
• Parametric computing
– Useful for programs that must be executed a large number of times, each
time with a different set of parameters (e.g., a simulation model)
– Parametric processing tools are needed to organize, run, and manage the jobs
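Parametric computing can be illustrated as a parameter sweep: the same program is run once per combination of parameter values. The sketch below runs the jobs on local threads as a stand-in for cluster nodes; `simulate` and the parameter lists are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def simulate(rate, size):
    return rate * size  # stand-in for one run of a simulation model


rates = [1, 2]
sizes = [10, 100]
parameter_sets = list(product(rates, sizes))  # every combination is one job

with ThreadPoolExecutor(max_workers=4) as pool:
    outcomes = list(pool.map(lambda params: simulate(*params), parameter_sets))

results_by_params = dict(zip(parameter_sets, outcomes))
```

Because the runs are independent, they need no communication with each other, which is what makes this style of workload easy to spread across a cluster.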
Clusters (cont.)
• Cluster computer architecture
– All computers are interconnected by a high-speed LAN or switch
– Each computer is capable of operating independently
– A middleware layer of software runs on each computer to implement the
cluster functionality
• Provides a unified system image to the user, called a single-system image
• Is responsible for providing load balancing and high availability
• Middleware services and functions
– Single entry point: A user logs into the cluster, not on a specific computer
– Single file hierarchy: The user sees only a single file hierarchy, under one
root directory
– Single control point: A default workstation is used for cluster management
and control
– Single virtual networking: There is a single virtual network connecting the
cluster computers, even if it consists of multiple interconnected networks
Clusters (cont.)
• Middleware services and functions (cont.)
– Single memory space: A distributed shared memory is used to share variables
– Single job-management system: The cluster has a job scheduler and jobs
are submitted to the cluster and not to individual computers
– Single user interface: A common graphic interface is used for all users,
regardless of the workstation they use to enter the cluster
– Single I/O space: Any node can access any I/O device
– Single process space: A process on any node can create or communicate
with any other process in the cluster
– Check-pointing: Process states and intermediate results are saved
periodically, permitting rollback recovery after failures
– Process migration: Processes can move inside the cluster to provide load balancing
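The check-pointing service listed above can be sketched as periodic state saving with rollback. The `CheckpointedTask` class, its summing workload, and the in-memory checkpoint are assumptions; a real service would write state to stable storage.

```python
import copy


class CheckpointedTask:
    """Hypothetical task that sums a list and checkpoints every few steps."""

    def __init__(self, data, interval):
        self.data = data
        self.interval = interval
        self.state = {"index": 0, "total": 0}
        self.saved = copy.deepcopy(self.state)

    def step(self):
        self.state["total"] += self.data[self.state["index"]]
        self.state["index"] += 1
        if self.state["index"] % self.interval == 0:
            self.saved = copy.deepcopy(self.state)  # periodic checkpoint

    def rollback(self):
        self.state = copy.deepcopy(self.saved)      # recover after a failure

    def run(self):
        while self.state["index"] < len(self.data):
            self.step()
        return self.state["total"]


task = CheckpointedTask(data=[1, 2, 3, 4, 5], interval=2)
for _ in range(3):
    task.step()          # progress to index 3; last checkpoint was at index 2
task.rollback()          # simulated failure: resume from the checkpoint
resumed_from = task.state["index"]
total = task.run()
```

Only the work done since the last checkpoint is repeated, and the final result is the same as in a failure-free run.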
Clusters (cont.)
• Clusters compared with SMPs
– SMPs
• Easier to manage and configure than clusters
• Much closer to the original uniprocessor model
• Major difference from the uniprocessor is the scheduler function
• Uses less physical space and requires less energy than a comparable cluster
• SMP products are well established and stable
– Clusters
• Far superior to SMPs in terms of absolute and incremental scalability
• Far superior in terms of availability
– Clusters are likely to dominate the high-performance server market
Windows 2000 Cluster Server
• The configuration is a shared-nothing cluster, where each volume and other
resources are owned by a single system at a time (initially code-named
• Main concepts
– Cluster Service:
• The software on each node responsible for cluster-specific activities
– Resource:
• These are the resources managed by the cluster service
• They are objects representing either physical hardware devices (e.g., disk
drives, network cards) or logical items (e.g., disk volumes, IP addresses,
applications, databases)
• Resources are implemented as dynamically linked libraries (DLLs) and
managed by a resource monitor
– Online: A resource is online at a node if it provides a service at that node
– Group:
• A collection of resources that are managed as a single entity
• Consists of all elements needed to run a specific application and to allow the
client systems to connect to the service provided by that application
• Operations can be performed on the entire group (e.g., transfer to another node)
Windows 2000 Cluster Server (cont.)
• The W2K Cluster Server components and their relationship in a
single node of a cluster
– Node manager
• Responsible for maintaining this node’s membership in the cluster
• It sends periodic heartbeat messages to the node managers of the other nodes in
the same cluster
• If it detects the loss of heartbeat messages from another node
– It broadcasts a message to the entire cluster
– All members exchange messages to verify their view of current cluster membership
– If a node manager does not reply, it is removed from cluster and its active groups
are transferred to one or more of the other nodes in the cluster
– Configuration database manager
• Responsible for the cluster configuration database
• The database has information about all cluster resources, groups, and node
ownership of groups
• Database managers on all nodes communicate with each other to maintain a
consistent view of configuration information in the cluster
• The integrity of the database is maintained by using fault-resistant software for
all changes to cluster configuration
Windows 2000 Cluster Server (cont.)
• The W2K Cluster Server components and their relationship in a single node of
a cluster (cont.)
– Resource manager / fail-over manager
• Responsible for management of resource groups
• Initiates actions such as startup, reset, and fail-over
• In case of fail-over, the fail-over managers on the active nodes negotiate the
redistribution of resource groups from the failed node to the remaining active ones
• When the node that failed has recovered, the fail-over managers may decide to move
back some groups
– Event processor
• Connects all the components of the cluster service
• Handles common operations
• Controls cluster service initialization
– Communications manager
• Provides the facilities for message exchange with other nodes in the cluster
– Global update manager
• Provides an update service for other components
Sun cluster
• Solaris UNIX has been extended to make the Sun Cluster distributed operating system
• It appears to users and applications as a single computer running the Solaris OS
• Components:
– Object and communications support
– Process management
– Global distributed file system
Sun cluster (cont.)
• Object and communications support
– Object oriented: uses the CORBA object model to define objects and the remote
procedure call (RPC) mechanism
• Global process management
– The location of a process is transparent to the user
– Each process has a unique identifier within the cluster
– Process migration is possible: a process can move from node to node to achieve load
balancing and for fail-over (caveat: the threads of a single process must be on the same node)
• Networking
– Strategy:
• A packet filter is used to route packets to the proper node
• Cluster appears externally as a single server with a single IP address
– Operation
• Incoming packets are received on the node that has the network adapter, filtered, and
delivered to the correct target node for protocol processing over cluster interconnect
• For outgoing packets, originating node performs protocol processing, transfers packet over
cluster interconnect to the node that has external network physical connection
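The incoming-packet path can be sketched as a lookup that maps each packet to the node that should process it. Routing by destination port is an assumption made for this example; the source says only that a packet filter selects the proper node. The node names and the address are placeholders.

```python
def make_packet_filter(service_table, default_node):
    """Return a function mapping a packet to the node that should process it."""
    def route(packet):
        return service_table.get(packet["dst_port"], default_node)
    return route


# Hypothetical table: which cluster node handles which service port.
route = make_packet_filter({80: "node2", 22: "node3"}, default_node="node1")

# All packets carry the cluster's single external IP address.
web_target = route({"dst_ip": "192.0.2.10", "dst_port": 80})
other_target = route({"dst_ip": "192.0.2.10", "dst_port": 5000})
```

From outside, every packet is addressed to the same IP; the choice of node is made entirely inside the cluster.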
Sun cluster (cont.)
• Global file system
– Like the standard Solaris, the Sun Cluster is based on the concepts of
virtual node (vnode) and the virtual file system (vfs)
– Standard Solaris
• Vnode
– The vnode structure is used to provide a general-purpose interface to all types
of file systems
– A vnode provides mapping to an object in any file system type (by contrast, an
inode in UNIX can provide mapping to UNIX files only)
– The vnode interface accepts general-purpose file manipulation commands
(e.g., read, write) and translates them into the actions appropriate for the
respective file system
• Vfs
– Vfs structures are used to describe entire file systems
– The Vfs interface accepts general-purpose commands that operate on entire
files and translates them into actions appropriate for a particular file system
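The idea behind the vnode interface, one general-purpose set of operations translated into file-system-specific actions, resembles an abstract base class with one subclass per file system type. The classes below are hypothetical and in-memory; they are not the Solaris structures.

```python
from abc import ABC, abstractmethod


class Vnode(ABC):
    """General-purpose interface; each file system type supplies its own actions."""

    @abstractmethod
    def read(self):
        ...

    @abstractmethod
    def write(self, data):
        ...


class MemoryVnode(Vnode):
    """Hypothetical in-memory file system object."""

    def __init__(self):
        self.data = b""

    def read(self):
        return self.data

    def write(self, data):
        self.data = data


class UppercaseVnode(MemoryVnode):
    """Hypothetical second file system type with different write behavior."""

    def write(self, data):
        self.data = data.upper()


def copy_file(src, dst):
    dst.write(src.read())  # caller never needs to know the file system type


a, b = MemoryVnode(), UppercaseVnode()
a.write(b"abc")
copy_file(a, b)
```

`copy_file` works across both types because it uses only the general-purpose operations.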
Sun cluster (cont.)
• Global file system (cont.)
– Global file access
• The global file system provides a uniform interface to files distributed over the cluster
• Processes on all nodes use the same pathname to locate a file and can open any file
– Implementation
• A proxy file system was built on top of the existing Solaris file system at the vnode interface
• Vfs/vnode operations are converted by the proxy layer into object invocations
• The invoked object may reside on any node in the cluster; it performs a local vnode/vfs
operation on the underlying file system
• Caching is used for file contents, directory information, and file attributes
Beowulf and Linux clusters
• Beowulf
– Beowulf project
• Initiated under the NASA High Performance Computing and
Communications (HPCC) project
• Goal: expand the capabilities of clustered PCs for performing
important computational tasks
• Widely implemented; often regarded as one of the most important cluster technologies
– Beowulf features
• Use of off-the-shelf components, no custom components, available
from many vendors
• Dedicated processors
• Dedicated private network (LAN or WAN or inter-networked combination)
• Scalable I/O
• Free software base and distributed computing tools
• Return of the design and improvements to the community
Beowulf and Linux clusters (cont.)
• Most Beowulf implementations use a cluster of Linux
workstations or PCs
• A representative Linux implementation of Beowulf contains
– A number of workstations (not necessarily the same platform) all running Linux
– Secondary storage at each workstation can be available for distributed
access (e.g., distributed file sharing)
– The Linux nodes are interconnected with an off-the-shelf network (e.g.,
Ethernet switch or an interconnected set of Ethernet switches)
• Beowulf software
– Open-source Beowulf software
– Beowulf tools and utilities
– Linux kernel, modified to allow the individual nodes to participate in a
number of global namespaces
Beowulf and Linux clusters (cont.)
• Examples of Beowulf system software
– Beowulf distributed process space (BPROC)
• Allows a process to span multiple nodes in a cluster environment
• Provides a mechanism for starting a process on another node without logging
in to that node
• Makes all remote processes visible in the process table of the cluster’s front
end node
– Beowulf Ethernet channel bonding
• Mechanism that joins multiple networks into a single logical network with higher bandwidth
• Distributes packets over the available device transmit queues
• Provides load balancing over multiple Ethernets connected to Linux workstations
• Provides a synchronization mechanism and shared data objects within a cluster
– EnFusion
• Set of tools for parametric computing, i.e., execution of a program as a large
number of jobs, each with different parameters
