Distributed Systems
CS 422
Prof.Dr. Hesham El-Deeb
2022/ 2023
Contents
Chapter 1 ........................................................................................................................................ 1
Introduction ................................................................................................................................... 1
1.1 Processing Types...................................................................................................................... 3
Pipelining and parallelism .................................................................................................... 5
Analogy Simple Rules............................................................................................................ 7
Example #2 Parallel Processing............................................................................................ 7
Static Connection #1 (2-d hypercube).................................................................................. 8
Static Connection #2 (3-d hypercube).................................................................................. 9
Parallel Efficiency.................................................................................................................. 9
1.2 Distributed System Fundamentals ....................................................................................... 10
Distributed Systems Layout................................................................................................ 10
Network Performance ......................................................................................................... 11
Middleware .......................................................................................................................... 11
Distributed Systems Main features .................................................................................... 12
Distributed Systems Advantages ........................................................................................ 13
Distributed system disadvantages ...................................................................................... 13
1.3 Goals of Distributed systems ................................................................................................ 18
1.3.1 Heterogeneity .............................................................................................................. 18
1.3.2 Openness...................................................................................................................... 19
1.3.3 Security ........................................................................................................................ 19
1.3.4 Scalability .................................................................................................................... 19
1.3.5 Failure handling.......................................................................................................... 21
1.3.6 Concurrency................................................................................................................ 21
1.3.7 Transparency (allowable) ......................................................................................... 22
1.4 Distributed System Model .................................................................................................. 23
1.5 Hardware & Software Concepts in DS ................................................................................ 25
1.5.1 Hardware Concepts .................................................................................................... 26
1.5.2 Software Concepts ...................................................................................................... 28
1.6 Distributed Systems Types ................................................................................................. 29
1.6.1 Distributed Computing Systems (DCS) ................................................................... 29
1.6.2 Distributed Information Systems (DIS) .................................................................... 34
1.6.3 Distributed Pervasive (spreading) Systems (DPS) ................................................... 36
Revision Sheet # 1 ........................................................................................................................ 40
PROBLEMS................................................................................................................................. 40
Assignment # 1 ..................................................................................................................... 41
Assignment # 2 ..................................................................................................................... 41
Assignment # 3 ..................................................................................................................... 41
Assignment # 4 ..................................................................................................................... 41
Chapter 2 ...................................................................................................................................... 42
Distributed Systems Architectures ............................................................................................ 42
2.1 Basic Definitions .................................................................................................................. 44
2.2 Distributed Systems Architectures .................................................................................... 45
2.3 Multiprocessor Architectures ............................................................................................... 45
2.4 Client-server Architectures .................................................................................................. 48
2.4.1 Two-tier (layer) thin and fat clients .................................................................................. 57
2.4.2 Three-tier architectures ..................................................................................................... 61
2.5 Distributed Object Architectures....................................................................................... 64
2.6 CORBA Architecture .......................................................................................................... 68
2.6.1 CORBA Goal ....................................................................................................................... 69
2.6.2 CORBA Architecture.......................................................................................................... 70
2.6.3 CORBA Application Structure .......................................................................................... 71
2.6.4 CORBA Standards .............................................................................................................. 72
2.6.5 CORBA Objects .................................................................................................................. 72
2.6.6 CORBA Services ................................................................................................................. 73
2.6.7 CORBA Products ................................................................................................................ 73
2.6.8 CORBA Servant Class ....................................................................................................... 76
2.6.9 CORBA Server Class – 1 ................................................................................................... 77
2.6.10 CORBA Client .................................................................................................................. 78
Revision Sheet # 2 ........................................................................................................................ 79
Assignment # 5 ..................................................................................................................... 79
Chapter 3 ...................................................................................................................................... 80
Synchronization ........................................................................................................................... 80
3.1 Clock Synchronization ........................................................................................................ 80
3.1.1 Measuring the Time ............................................................................................................ 83
3.1.2 GPS and UTC ...................................................................................................................... 86
3.2 Logical vs. physical clocks .................................................................................................. 91
3.2.1 Logical Clock ....................................................................................................................... 92
3.2.2 Physical Clock...................................................................................................................... 96
3.3 Clock Synchronization Algorithms.................................................................................... 99
3.3.1 The Cristian's Algorithm (1989) .............................................................................. 99
3.3.2 The Berkeley Algorithm (1989) .............................................................................. 102
3.4 Parallel and Distributed Processing Problems ............................................................... 103
3.4.1 Mutual Exclusion Algorithm Outline .................................................................... 103
3.4.2 Distributed Termination Problem ......................................................................... 105
3.4.3 The Byzantine Generals Problem (BGP)............................................................... 106
3.5 Election Algorithms........................................................................................................... 108
3.5.1 Leader Election Problem ........................................................................................ 109
3.5.2 Leader Election in Synchronous Rings.................................................................. 109
3.5.3 Synchronous Message-Passing Model ................................................................... 110
3.5.4 Simple Leader Election Algorithm ........................................................................ 110
Revision Sheet # 3 ...................................................................................................................... 113
Assignment # 6 ................................................................................................................... 114
Assignment # 7 ................................................................................................................... 114
Assignment # 8 ................................................................................................................... 114
Assignment # 9 ................................................................................................................... 115
Assignment # 10 ................................................................................................................. 115
Assignment # 11 ................................................................................................................. 115
Assignment # 12 ................................................................................................................. 115
Chapter 4 .................................................................................................................................... 116
The Distributed Algorithms...................................................................................................... 116
4.1 Introduction ......................................................................................................................... 116
4.2 Variations of PRAM Model .............................................................................................. 117
4.2.1 PRAM Model for Parallel Computations .............................................................. 117
4.2.2 READ & WRITE in PRAM ................................................................................... 118
4.2.3 PRAM Subclasses .................................................................................................... 118
4.3 Simulating Multiple Accesses on an EREW PRAM......................................................... 120
4.3.1 Algorithm Broadcast _EREW ................................................................................. 121
4.4 Computing Sum and All Partial Sums .............................................................................. 122
4.4.1 Sum of an Array of Numbers on the EREW Model ............................................. 123
4.4.2 All Partial Sums of an Array .................................................................................. 125
4.5 The Sorting Algorithm ........................................................................................................ 127
4.5.1 n2 Sorting Algorithm ............................................................................................. 127
4.5.2 Algorithm Sort _ CRCW( Ascending) ................................................................... 127
4.6 Message – Passing Models and Algorithms .................................................................... 129
4.6.1 Message-passing Computing Models ..................................................................... 129
Revision Sheet # 4 ...................................................................................................................... 132
Assignment # 13 ................................................................................................................. 132
Chapter 5 .................................................................................................................................... 133
Naming In Distributed Systems ............................................................................................... 133
5.1 Naming Basic Concept ...................................................................................................... 134
5.2 Naming Types (identification, description, location) ..................................................... 135
5.2.1 Uniform Resource Identifier (URI) (URL/URN/URL+URN) ............................ 136
5.3 Naming Implementation Approaches.............................................................................. 145
5.3.1 Flat Naming Approach............................................................................................ 146
5.3.2 Structured Naming .................................................................................................. 152
5.3.3 Attribute-Based Naming ......................................................................................... 156
Revision Sheet # 5 ...................................................................................................................... 158
Chapter 6 .................................................................................................................................... 159
Fault Tolerance .......................................................................................................................... 159
6.1 Introduction to Fault Tolerance....................................................................................... 159
6.1.1 Basic Concepts .......................................................................................................... 159
6.1.2 Failure Models .......................................................................................................... 162
6.2 Process Flexibility ................................................................................................................ 166
6.2.1 Design Issues ............................................................................................................. 166
6.2.2 Failure Masking and Replication ............................................................................ 168
6.2.3 Agreement in Faulty Systems .................................................................................. 170
6.2.4 Failure Detection ...................................................................................................... 172
6.3 Reliable Client-Server Communication .......................................................................... 174
6.3.1 Point-to-Point Communication ............................................................................... 174
6.3.2 RPC Semantics in the Presence of Failures............................................................ 174
6.4 Recovery ............................................................................................................................. 180
6.4.1 Introduction .............................................................................................................. 180
6.4.2 Checkpointing ........................................................................................................... 181
6.4.3 Message Logging....................................................................................................... 182
Revision Sheet # 6 ...................................................................................................................... 185
Chapter 1
Introduction
Computer systems are undergoing a revolution. From 1945, when the modern computer era began, until about 1985, computers were large and expensive. Even
minicomputers cost at least tens of thousands of dollars each. As a result, most
organizations had only a handful of computers, and for lack of a way to connect them,
these operated independently from one another. Starting around the mid-1980s,
however, two advances in technology began to change that situation.
The first was the development of powerful microprocessors. Initially, these were
8-bit machines, but soon 16-, 32-, and 64-bit CPUs became common. Many of these
had the computing power of a mainframe (i.e., large) computer, but for a fraction of the
price. The amount of improvement that has occurred in computer technology in the past half century is truly staggering and totally unprecedented in other industries. We have gone from a machine that cost 10 million dollars and executed 1 instruction per second to machines that cost 1,000 dollars and can execute 1 billion instructions per second, a price/performance gain of 10^13. If cars had improved at this rate in the same time period, a Rolls Royce would now cost 1 dollar and get a billion miles per gallon. (Unfortunately, it would probably also have a 200-page manual telling how to open the door.)
The second development was the invention of high-speed computer networks.
Local-area networks or LANs allow hundreds of machines within a building to be
connected in such a way that small amounts of information can be transferred between
machines in a few microseconds or so. Larger amounts of data can be moved between
machines at rates of 100 million to 10 billion bits/sec.
Wide-area networks or WANs allow millions of machines all over the earth to
be connected at speeds varying from 64 Kbps (kilobits per second) to gigabits per
second. The result of these technologies is that it is now not only feasible, but easy, to
put together computing systems composed of large numbers of computers connected
by a high-speed network. They are usually called computer networks or distributed systems, in contrast to the previous centralized systems (or single-processor systems)
consisting of a single computer, its peripherals, and perhaps some remote terminals.
In a distributed or parallel program, we try to take advantage of parallelism by dividing the (sequential) program into as many tasks as the program's correctness will allow and then running one or more of these tasks, some of which can run simultaneously on more than one processor. If the distributed system uses all its resources to run tasks from only one program at a time, we call it Parallel Processing. If the distributed system shares its resources with tasks from many independent programs, we call it Distributed Processing.
Parallel processing is the use of concurrency in the operation of a computer system to increase throughput, increase fault tolerance, or reduce the time needed to solve particular problems. Parallel processing is the only route to reach the highest
levels of computer performance. Physical laws and manufacturing capabilities limit the
switching times and integration densities of current semi-conductor-based devices,
putting a ceiling on the speed at which any single device can operate. For this reason,
all modern computers rely upon parallelism to some extent. The fastest computers
exhibit parallelism at many levels.
We begin by describing pipelining and parallelism, the two traditional methods used to increase concurrency in a computer system. We survey low-level and high-level
parallel processing mechanisms that appear in hardware, and we examine some of the
most popular processor interconnection topologies. The final sections discuss
parallelism in software. We describe the generation and coordination of software
processes and the problem of scheduling the execution of these processes on actual
parallel hardware.
1.1 Processing Types
• Sequential (one processor): the steps of a program are executed one after another, in a fixed serial order, on a single processor; each step must complete before the next one begins.

• Pipelining (one processor with a shifting mechanism): instruction pipelining is a technique for implementing instruction-level parallelism within a single processor. Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming instructions into a series of sequential steps (the eponymous "pipeline") performed by different processor units, with different parts of instructions processed in parallel.

• Parallel (more than one processor with one partitioned program): a method in computing of running two or more processors (CPUs) to handle separate parts of an overall task. Breaking up different parts of a task among multiple processors helps reduce the amount of time needed to run a program.
• Distributed (more than one processor with more than one partitioned program): a phrase used to refer to a variety of computer systems that use more than one computer (or processor) to run an application. This includes parallel processing, in which a single computer uses more than one CPU to execute programs.
• Grid computing is the use of widely distributed computer resources to reach a common goal. A computing grid can be thought of as a distributed system with non-interactive workloads that involve many files. Grid computing is distinguished from conventional high-performance computing systems such as cluster computing in that grid computers have each node set to perform a different task/application. Grid computers also tend to be more heterogeneous and geographically dispersed (thus not physically coupled) than cluster computers.

Although a single grid can be dedicated to a particular application, commonly a grid is used for a variety of purposes. Grids are often constructed with general-purpose grid middleware software libraries. Grid sizes can be quite large.

Grids are a form of distributed computing whereby a "super virtual computer" is composed of many networked, loosely coupled computers acting together to perform large tasks. For certain applications, distributed or grid computing can be seen as a special type of parallel computing that relies on complete computers (with onboard CPUs, storage, power supplies, network interfaces, etc.) connected to a computer network (private or public) by a conventional network interface, such as Ethernet. This is in contrast to the traditional notion of a supercomputer, which has many processors connected by a local high-speed computer bus.
Processing performance can be expressed as a function of three factors:

Processing Performance = F(X, Y, Z)

where:
X → number of processors
Y → data dependency
Z → cost and overhead of the scheduling policy
Pipelining and parallelism
To reduce the time needed for a mechanism to perform a task, we must either
increase the speed of the mechanism or introduce concurrency. Two traditional methods
have been used to increase concurrency: pipelining and parallelism. If an operation can
be divided into a number of stages, pipelining allows different tasks to be in different
stages of completion. An automobile assembly line is an example of pipelining.
Parallelism is the use of multiple resources to increase concurrency. A group of
combines working together to harvest a wheat field is an example of parallelism.
To illustrate and contrast these two fundamental methods for increasing
concurrency, we present the following pizza-baking example. Suppose a pizza requires
10 minutes to bake. An oven that holds a single pizza can yield 6 baked pizzas an hour.
To increase the number of pizzas baked per hour, either the baking time must be reduced
or a way must be found to have more than one pizza baking at a time. (Assume that
quality control constraints prevent us from raising the oven's temperature in order to
reduce the baking time.)
One way to increase production is through use of parallelism. If 5 ovens are used,
the ovens yield 5 pizzas every 10 minutes and 30 pizzas an hour. Note that the 5 ovens
are used most efficiently if the number of pizzas needed is a multiple of 5. For example, the ovens require the same amount of time (20 minutes) to produce 6, 7, 8, 9, or 10 pizzas.
Another way to increase production is through the use of pipelining. Imagine a
conveyer belt running through a long pizza oven. A pizza placed at one end of the
conveyer belt spends 10 minutes in the oven before it reaches the other end. If the
conveyer belt has room for 5 pizzas, a cook can place an unbaked pizza at one end of
the belt every 2 minutes. Ten minutes after the first pizza has been put into one end of
the oven, it appears as a baked pizza at the other end. From that time on, another baked
pizza will appear every two minutes, and the production of the oven will be 30 pizzas
an hour. The pizza-baking speeds of the single-oven, parallel-oven, and pipelined-oven methods are compared in Table 1.
The speedup achieved is the ratio between the time needed for the single pizza oven to produce some number of pizzas and the time needed to produce the same number of pizzas using pipelining and/or parallelism. The table below is an example of sequential, parallel, and pipelined processing. It contrasts the pizza-baking times of a single oven, five ovens, and a conveyer-belt oven.
Table 1. Pizza-baking times of a single oven, five ovens, and a conveyer-belt oven.

Pizzas Baked   Single Oven     Five Ovens    Conveyer Oven
               (Sequential)    (Parallel)    (Pipelining)
     1           10 min.         10 min.       10 min.
     2           20               10            12
     3           30               10            14
     4           40               10            16
     5           50               10            18
     6           60               20            20
     7           70               20            22
     8           80               20            24
     9           90               20            26
    10          100               20            28
    11          110               30            30
    12          120               30            32
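As a quick check of Table 1, a minimal Python sketch can recompute the three baking times and the resulting speedups, assuming the same parameters as the example above: a 10-minute bake, 5 parallel ovens, and a conveyer belt with room for 5 pizzas (one pizza emerging every 2 minutes).

import math

def single_oven_time(n, bake=10):
    # One pizza at a time: n pizzas take n * bake minutes.
    return n * bake

def parallel_ovens_time(n, ovens=5, bake=10):
    # Each batch of up to `ovens` pizzas bakes together for `bake` minutes.
    return math.ceil(n / ovens) * bake

def pipelined_oven_time(n, slots=5, bake=10):
    # The first pizza appears after `bake` minutes; after that, one finished
    # pizza emerges every bake/slots minutes (2 minutes here).
    return bake + (n - 1) * (bake // slots)

for n in (1, 5, 6, 12):
    seq, par, pipe = single_oven_time(n), parallel_ovens_time(n), pipelined_oven_time(n)
    print(f"{n:2d} pizzas: sequential {seq:3d} min, parallel {par:3d} min "
          f"(speedup {seq / par:.1f}), pipelined {pipe:3d} min (speedup {seq / pipe:.1f})")

For 12 pizzas this reproduces the last row of the table: 120, 30, and 32 minutes.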
Analogy Simple Rules
Example #2 Parallel Processing
The diagram below shows the program graph.
Static Connection #1 (2-d hypercube)
Static Connection #2 (3-d hypercube)
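The hypercube figures themselves are not reproduced here, but the topology is easy to express in code: a d-dimensional hypercube has 2^d nodes, each labeled with a d-bit number, and two nodes are connected exactly when their labels differ in one bit. A minimal Python sketch:

def hypercube_neighbors(node, d):
    """Return the neighbors of `node` in a d-dimensional hypercube.

    Nodes are numbered 0 .. 2**d - 1; two nodes are adjacent iff their
    binary labels differ in exactly one bit position.
    """
    return [node ^ (1 << k) for k in range(d)]

# 2-d hypercube (a square): node 0 connects to nodes 1 and 2.
print(hypercube_neighbors(0, 2))   # [1, 2]

# 3-d hypercube (a cube): node 5 (binary 101) connects to nodes 4, 7, and 1.
print(hypercube_neighbors(5, 3))   # [4, 7, 1]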
Parallel Efficiency
• Speedup and utilization are not the only parallel performance metrics.
• Parallel efficiency is one of the most important parallel performance metrics.
• Parallel efficiency = sequential execution time / (parallel execution time × number of processors)
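As a small illustration, the following Python sketch computes both speedup and parallel efficiency; the timing numbers used in the example call are made up.

def speedup(sequential_time, parallel_time):
    # Speedup = sequential execution time / parallel execution time
    return sequential_time / parallel_time

def parallel_efficiency(sequential_time, parallel_time, processors):
    # Efficiency = sequential execution time / (parallel execution time * number of processors)
    return sequential_time / (parallel_time * processors)

# Hypothetical job: 100 s sequentially, 30 s on 4 processors.
print(speedup(100, 30))                 # ~3.33
print(parallel_efficiency(100, 30, 4))  # ~0.83, i.e. roughly 83% efficiency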
1.2 Distributed System Fundamentals
A distributed system is a collection of independent computers that appears to its
users as a single coherent system. A distributed system is defined as one in which
components at networked computers communicate and coordinate their actions only by
passing messages.
This definition allows for: concurrent execution of programs, but prevents the
possibility of a global clock and means that components can fail independently of one
another (fault tolerance concept).
Shared resources are managed by server processes, which provide client
processes with access to those resources via a well-defined set of operations. In a
distributed system written in an object-oriented language, resources may be
encapsulated as objects whose methods are invoked by client objects.
Distributed Systems Layout
Network Performance
where:
WPAN → Wireless Personal Area Network
WMAN → Wireless Metropolitan Area Network; a Metropolitan Area Network (MAN) is a network that interconnects users with computer resources in a geographic area or region larger than a large Local Area Network (LAN) but smaller than a Wide Area Network (WAN).
The diagram below shows a distributed system organized as middleware. Note that the middleware layer extends over multiple machines.
Middleware
It is an aspect of distributed computing, defined as computer software that connects software components or applications. This software consists of a set of enabling services that allow multiple processes running on one or more machines to interact across a network; i.e., it connects parts of an application and enables requests and data to pass between them. It includes web servers, application servers, and similar tools that support application development and delivery.
Usage of middleware
Middleware services provide a more functional set of application programming interfaces that allow an application to:
1. Interact with another service or application
2. Be independent of network services
3. Be more reliable and available than the underlying operating system and network services
Middleware Types
Hurwitz's classification system organizes the types of middleware based on
scalability and recoverability:
1. Remote Procedure Call — Client makes calls to procedures running on remote
systems.
2. Message-Oriented Middleware — Messages sent to the client are collected and stored until they are acted upon, while the client continues with other processing (a minimal sketch of this idea follows this list).
3. Object Request Broker — makes it possible for applications to send objects and
request services in an object-oriented system.
4. SQL-oriented Data Access — middleware between applications and database
servers.
5. Embedded Middleware — communication services and integration interface
software/firmware that operates between embedded applications and the real time
operating system.
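To make the message-oriented style (type 2 above) concrete, here is a minimal single-machine Python sketch in which a queue plays the role of the middleware: the sender continues with other work while messages wait in the queue until the receiver processes them. A real message-oriented middleware product would move these messages across a network; this is only an illustration of the decoupling.

import queue
import threading
import time

# A tiny stand-in for message-oriented middleware: the queue stores messages
# until the receiver is ready, so sender and receiver are decoupled in time.
message_queue = queue.Queue()

def client():
    for i in range(3):
        message_queue.put(f"request {i}")   # fire-and-forget send
        print(f"client: sent request {i}, continuing with other work")

def server():
    while True:
        msg = message_queue.get()            # blocks until a message arrives
        time.sleep(0.1)                      # pretend to do some processing
        print(f"server: processed {msg}")
        message_queue.task_done()

threading.Thread(target=server, daemon=True).start()
client()
message_queue.join()                         # wait until all messages are handled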
Distributed Systems Main features
• Geographical distribution of autonomous computers
• Communication through cable/fiber/wireless/... connections
Distributed Systems Advantages
• interaction
• co-operation
• sharing of resources
• reduced costs
• improved availability and performance
• scalability
• fault tolerance
Distributed system disadvantages
• Complexity – Typically, distributed systems are more complex than centralized systems
• Security – More susceptible to external attack
• Manageability – More effort required for system management
• Unpredictability – Unpredictable responses depending on the system organization and network load
Urgency of Distributed Computing
Distributed (computer) systems are critical for the functioning of many organizations.
A distributed application is a set of processes that are distributed across a network of machines and work together as an ensemble to solve a common problem.
• Internet: a global network of interconnected computers which communicate through IP protocols.
• Intranet: a separately administered network with a boundary that allows local security policies to be enforced.
• Mobile and ubiquitous (everywhere) computing:
  o laptops
  o PDAs (a personal digital assistant is a handheld computer, also known as a palmtop computer)
  o mobile phones
  o printers
  o home devices
• World-Wide Web: a system for publishing and accessing resources and services across the Internet.
Example 3 illustrates a typical portion of the Internet
Overview of Internet Information Appliances
Characteristics of Internet
• very large and heterogeneous
• enables email, file transfer, multimedia communications, WWW
• open-ended
• connects intranets (via backbones) with home users (via modems, ISPs)
Example 4 illustrates a typical Intranet
(Figure: a typical intranet, showing desktop computers, email servers, a Web server, print, file, and other servers on local area networks, connected to the rest of the Internet through a router/firewall.)
Characteristics of Intranets
• Several LANs linked by backbones
• Enables information flow within an organization: electronic data, documents, ...
• Provides various services – email, file, and print servers
• Often connected to the Internet via a router
• In/out communications are protected by a firewall
• Example: French University in Egypt http://portal.ufe.edu.eg/spip/?lang=ar
(Figure: portable and handheld devices connecting to an intranet with desktop computers, email, Web, print, and file servers, and a router/firewall to the rest of the Internet.)
Now more than ever, it’s never been more important to be connected at work. With the
diverse range of communication and increase in collaboration comes the need to work
in a more understanding, social, and unrestricted way. A successful intranet is essentially
the foundation of a connected, engaged, and productive workplace.
However, it is crucial to develop a clear, structured, and compatible intranet plan that
allows you to manifest your organization’s efforts. When a successful intranet is
deployed and managed correctly, it seeks to unite your people, drive productivity, create
a positive culture, and deliver significant results for all stakeholders.
Extranet
Example 5 Mobile & ubiquitous computing
• Wireless LANs (WLANs)
  o connectivity for portable devices (laptops, PDAs, mobile phones, video/digital cameras)
  o use WAP (Wireless Application Protocol)
• Home intranet (= home network)
  o devices embedded in home appliances (hi-fi, washing machines, ...)
  o a universal 'remote control' plus communication; a future environment for applying embedded systems
  o ubiquitous computing
High-fidelity (Hi-fi)
High fidelity or hi-fi reproduction is a term used by home stereo listeners and
home audio enthusiasts (audiophiles) to refer to high-quality reproduction of sound or
images that are very faithful to the original performance.
Ideally, high-fidelity equipment has minimal amounts of noise and distortion
and an accurate frequency response as set out in 1973 by the German Deutsches Institut
für Normung (DIN) standard DIN 45500
Example 6 WWW
World-wide resource sharing over Internet or Intranet based on the following technologies:
• HTML (Hypertext Markup Language)
• URL (Uniform Resource Locator)
• Client-server architecture
• Open-ended: can be extended, re-implemented, ...
1.3 Goals of Distributed systems
Due to its special characteristics:
1. Complexity
2. Size
3. Changing technologies
4. Society’s dependence
Goals are:
1. Heterogeneity
2. Openness
3. Security
4. Scalability
5. Failure handling
6. Concurrency
7. Transparency
1.3.1 Heterogeneity
To achieve this goal, we have to overcome:
1. varying software and hardware
• operating systems, networks, computer hardware, programming languages, and implementations by different developers
• hence the need for standards: protocols, middleware
2. Heterogeneity and mobile code support
• the virtual machine approach
1.3.2 Openness
To achieve this goal, we must have:
1. independence from vendors
2. published key interfaces, e.g., CORBA (Common Object Request Broker Architecture)
3. published communication mechanisms, e.g., Java RMI (Remote Method Invocation)
1.3.3 Security
To achieve this goal, we must provide:
1. confidentiality (protection against leaks), e.g., medical records
2. integrity (protection against alteration and interference), e.g., financial data
We also need encryption and knowledge of identity to handle:
1. denial-of-service attacks, via intrusion detection algorithms
2. security of mobile code
1.3.4 Scalability
Design of scalable distributed systems assures:
1. Controlling the cost of physical resources (O(n), where n is the number of users)
2. Controlling the performance loss (O(log n), where n is the size of the set of data)
3. Preventing software resources from running out
4. Avoiding performance bottlenecks (e.g., by using DNS)
There are scalability limitations, such as:

Concept                  Example
Centralized services     A single server for all users
Centralized data         A single on-line telephone book
Centralized algorithms   Doing routing based on complete information
Scalability Techniques (1) (Server is busy)
The difference between letting a server or a client do the job (checking a form) in order to keep the performance level high: a server or a client checks forms as they are being filled in.
Scalability Techniques (2) (distribution)
An example is dividing the DNS name space into zones to assure distribution across several machines, thus avoiding that a single server has to deal with all requests for name resolution.
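A toy Python sketch of this idea, with a few made-up zones and addresses: each zone is served by its own table (standing in for a separate name server), and a name is resolved by the most specific zone that covers it, so no single server handles every lookup.

# All zone contents and addresses below are made-up examples.
zones = {
    "ufe.edu.eg.":  {"portal": "193.227.0.10"},   # hypothetical address
    "edu.eg.":      {"mail": "193.227.0.20"},
    "example.com.": {"www": "93.184.216.34"},
}

def resolve(name):
    """Find the most specific zone responsible for `name` and look it up there."""
    labels = name.rstrip(".").split(".")
    for i in range(1, len(labels)):            # longest (most specific) suffix first
        zone = ".".join(labels[i:]) + "."
        if zone in zones:
            return zones[zone].get(".".join(labels[:i]))
    return None

print(resolve("portal.ufe.edu.eg."))  # answered by the ufe.edu.eg. server
print(resolve("www.example.com."))    # answered by the example.com. server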
1.3.5 Failure handling
It is the ability to continue computation in the presence of failures. So, we have to implement:
• Detecting failures
• Masking failures (hiding failures)
• Tolerating failures
• Recovering from failures
• Redundancy as a solution to tolerate failures (see the sketch below)
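A minimal Python sketch of the last point, masking failures through redundancy: the same request is retried on several replicas and the first successful answer is used. The flaky_replica function only simulates servers that sometimes fail; the replica names are invented for the example.

import random

def flaky_replica(name, request):
    if random.random() < 0.3:                   # simulate a crash/omission failure
        raise ConnectionError(f"{name} did not answer")
    return f"{name} handled {request!r}"

def reliable_call(request, replicas=("replica-1", "replica-2", "replica-3")):
    last_error = None
    for name in replicas:                        # detect the failure, then mask it
        try:
            return flaky_replica(name, request)
        except ConnectionError as err:
            last_error = err                     # tolerate it and try the next replica
    raise RuntimeError("all replicas failed") from last_error

try:
    print(reliable_call("read record 42"))
except RuntimeError as err:
    print("request failed:", err)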
1.3.6 Concurrency
It means that processes execute simultaneously (at the same time) and share resources. This might need some sort of:
• synchronization using a clock
• inter-process communication (IPC)
1.3.7 Transparency (allowable)
It is the hiding of the separated nature of the system from the user/programmer, so that the system is perceived as a whole rather than as a collection of independent components. Transparencies are defined by the ANSA Reference Manual and the ISO Reference Model for Open Distributed Processing (RM-ODP), and are achieved through:
• Access transparency: enables local and remote resources to be accessed using
identical operations.
• Location transparency: enables resources to be accessed without knowledge of their
location.
• Concurrency transparency: enables several processes to operate concurrently using
shared resources without interference between them.
• Replication transparency: enables multiple instances of resources to be used to
increase reliability and performance without knowledge of the replicas by users or
application programmers.
• Failure transparency: enables the concealment (hiding) of faults, allowing users and application programs to complete their tasks despite the failure of hardware or software components.
• Mobility transparency (migration transparency): allows the movement of resources and clients within a system without affecting the operation of users or programs.
• Performance transparency: allows the system to be reconfigured to improve performance as loads vary.
• Scaling transparency: allows the system and applications to expand in scale without change to the system structure or the application algorithms.
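As a small illustration of access and location transparency, the following hedged Python sketch lets a client call read() on a resource without knowing whether the data is local or remote; the RemoteFile class and the server name are invented for the example, and a real system would of course issue actual network requests.

class LocalFile:
    def __init__(self, path):
        self.path = path
    def read(self):
        with open(self.path) as f:
            return f.read()

class RemoteFile:
    def __init__(self, host, path):
        self.host, self.path = host, path
    def read(self):
        # In a real system this would fetch the data over the network from `host`.
        return f"<contents of {self.path} fetched from {self.host}>"

def open_resource(name):
    """Return an object with a read() method; the caller never learns where
    the data actually lives (location transparency) and uses the same
    operation either way (access transparency)."""
    if name.startswith("remote:"):
        return RemoteFile("fileserver.example.org", name[len("remote:"):])
    return LocalFile(name)

doc = open_resource("remote:/reports/q1.txt")
print(doc.read())        # identical call whether the file is local or remote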
The table below summarizes what each type of transparency hides (i.e., what is not allowed to be visible):

Transparency            Description
Access                  Hide differences in data representation and how a resource is accessed
Location                Hide where a resource is located
Migration               Hide that a resource may move to another location
Relocation (mobility)   Hide that a resource may be moved to another location while in use
Replication             Hide that a resource is replicated
Concurrency             Hide that a resource may be shared by several competitive users simultaneously
Failure                 Hide the failure and recovery of a resource
Persistence             Hide whether a (software) resource is in memory or on disk
1.4 Distributed System Model
A distributed system consists of:
• a collection of autonomous computers, linked by
• a computer network, and equipped with
• distributed system software.
This software enables computers to coordinate their activities and to share the resources of the system: hardware, software, and data. Users of a distributed system should perceive a single, integrated computing facility even though it may be implemented by many computers in different locations and networks.
The object-oriented model for a distributed system is based on the model supported by
object-oriented programming languages (because it is more suitable than structured programming).
Distributed object systems generally provide:
• Remote procedure calls (RPC), used in client-server communication, which are replaced by remote method invocation (RMI) in distributed object systems.
• Remote method invocation (RMI) in an object-oriented programming language, together with operating-system support for object sharing and persistence.
Execution of Remote Procedure Calls (RPC)
Daemon: A program or process that sits idly in the background until it is invoked
to perform its task.
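A minimal sketch of this execution model, using Python's standard xmlrpc module: a daemon-like server process registers a procedure and waits for requests, while the client-side proxy (stub) makes the remote call look like a local one. The add procedure and the port number are arbitrary choices for illustration, not part of the original text.

import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Server side: a daemon-like process that sits waiting for remote calls.
def add(a, b):
    return a + b

server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False, allow_none=True)
server.register_function(add, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the proxy (stub) makes the remote procedure call look local.
proxy = ServerProxy("http://localhost:8000")
print(proxy.add(2, 3))        # 5 -- actually executed by the server process
server.shutdown()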
The state of an object consists of the values of its instance variables. In the
object-oriented paradigm, the state of a program is partitioned into separate parts, each
of which is associated with an object.
Since object-based programs are logically partitioned, the physical distribution
of objects into different processes or computers in a distributed system is a natural
extension.
The Object Management Group's Common Object Request Broker Architecture (CORBA) is a widely used standard for distributed object systems. It is a vendor-independent architecture and infrastructure that computer applications use to work together over networks.
Other object management systems include:
• the Open Software Foundation's Distributed Computing Environment (DCE)
• Microsoft's Distributed Component Object Model (DCOM).
1.5 Hardware & Software Concepts in DS
The hardware concepts are:
• Multiprocessors
• Homogeneous multicomputer systems
The software concepts are:
• Uni-processor operating systems
• Multiprocessor operating systems
• Multicomputer operating systems
• Distributed shared memory systems
• Network operating systems
Note: search for a list of operating systems and determine which ones are suitable for distributed systems applications.
1.5.1 Hardware Concepts
The figure below demonstrates Different basic organizations and memories in
distributed computer systems:
Multiprocessors (1)
A bus-based multiprocessor
Multiprocessors (2)
a) A crossbar switch
b) An omega switching network
Homogeneous Multicomputer Systems
a) Grid (mesh)
b) Hypercube
1.5.2 Software Concepts
System: Distributed Operating Systems (DOS)
Description: Tightly-coupled operating system for multi-processors and homogeneous multicomputers
Main goal: Hide and manage hardware resources

System: Network Operating Systems (NOS)
Description: Loosely-coupled operating system for heterogeneous multicomputers (LAN and WAN)
Main goal: Offer local services to remote clients

System: Middleware-based OS
Description: Additional layer on top of NOS implementing general-purpose services
Main goal: Provide distribution transparency
Comparison between Systems
A comparison between multiprocessor operating systems, multicomputer operating systems, network operating systems, and middleware-based distributed systems is given below. The comparison includes degree of transparency, same OS on all nodes, number of copies of the OS, basis for communication, resource management, scalability, and openness.
Item                     Distributed OS       Distributed OS        Network OS     Middleware-based OS
                         (Multiprocessors)    (Multicomputer)
Degree of transparency   Very High            High                  Low            High
Same OS on all nodes     Yes                  Yes                   No             No
Number of copies of OS   1                    N                     N              N
Basis for communication  Shared memory        Messages              Files          Model specific
Resource management      Global, central      Global, distributed   Per node       Per node
Scalability              No                   Moderately            Yes            Varies
Openness                 Closed               Closed                Open           Open
1.6 Distributed Systems Types
1.6.1 Distributed Computing Systems (DCS)
An important class of distributed systems is the one used for high-performance
computing tasks. Roughly speaking, one can make a distinction between two
subgroups. In cluster computing the underlying hardware consists of a collection of
similar workstations or PCs, closely connected by means of a high speed local-area
network. In addition, each node runs the same operating system.
The situation becomes quite different in the case of grid computing. This subgroup
consists of distributed systems that are often constructed as a federation of computer
systems, where each system may fall under a different administrative domain, and may
be very different when it comes to hardware, software, and deployed network
technology.
Cluster Computing Systems
Cluster computing systems became popular when the price/performance ratio of personal
computers and workstations improved. At a certain point, it became financially and
technically attractive to build a supercomputer using off-the-shelf technology by simply
hooking up a collection of relatively simple computers in a high-speed network. In virtually
all cases, cluster computing is used for parallel programming in which a single (compute
intensive) program is run in parallel on multiple machines.
One well-known example of a cluster computer is formed by Linux-based Beowulf
clusters. Each cluster consists of a collection of compute nodes that are controlled and
accessed by means of a single master node. The master typically handles the allocation of
nodes to a particular parallel program, maintains a batch queue of submitted jobs, and
provides an interface for the users of the system. As such, the master actually runs the
middleware needed for the execution of programs and management of the cluster, while
the compute nodes often need nothing else but a standard operating system.
An important part of this middleware is formed by the libraries for executing parallel
programs. Many of these libraries effectively provide only advanced message-based
communication facilities, but are not capable of handling faulty processes, security, etc. As
an alternative to this hierarchical organization, a symmetric approach is followed in the
MOSIX system (Amar et at, 2004). MOSIX attempts to provide a single-system image of
a cluster, meaning that to a process a cluster computer offers the ultimate distribution
transparency by appearing to be a single computer.
As we mentioned, providing such an image under all circumstances is impossible. In the
case of MOSIX, the high degree of transparency is provided by allowing processes to
dynamically and preemptively migrate between the nodes that make up the cluster.
Process migration allows a user to start an application on any node (referred to as the
home node), after which it can transparently move to other nodes, for example, to make
efficient use of resources.
Grid Computing Systems
A characteristic feature of cluster computing is its homogeneity. In most cases, the
computers in a cluster are largely the same, they all have the same operating system,
and are all connected through the same network. In contrast, grid computing systems
have a high degree of heterogeneity: no assumptions are made concerning hardware,
operating systems, networks, administrative domains, security policies, etc.
A key issue in a grid computing system is that resources from different organizations
are brought together to allow the collaboration of a group of people or institutions. Such
a collaboration is realized in the form of a virtual organization.
The people belonging to the same virtual organization have access rights to the
resources that are provided to that organization. Typically, resources consist of compute
servers (including supercomputers, possibly implemented as cluster computers),
storage facilities, and databases. In addition, special networked devices such as
telescopes, sensors, etc., can be provided as well.
Given its nature, much of the software for realizing grid computing revolves around
providing access to resources from different administrative domains, and to only those
users and applications that belong to a specific virtual organization. For this reason,
focus is often on architectural issues.
A layered architecture for grid computing systems
The architecture consists of four layers. The lowest fabric layer provides interfaces to
local resources at a specific site. Note that these interfaces are tailored to allow sharing
of resources within a virtual organization. Typically, they will provide functions for
querying the state and capabilities of a resource, along with functions for actual
resource management (e.g., locking resources).
The connectivity layer consists of communication protocols for supporting grid
transactions that span the usage of multiple resources. For example, protocols are
needed to transfer data between resources, or to simply access a resource from a remote
location. In addition, the connectivity layer will contain security protocols to
authenticate users and resources. Note that in many cases human users are not
authenticated; instead, programs acting on behalf of the users are authenticated.
In this sense, delegating rights from a user to programs is an important function that
needs to be supported in the connectivity layer. We return extensively to delegation
when discussing security in distributed systems.
The resource layer is responsible for managing a single resource. It uses the functions
provided by the connectivity layer and calls directly the interfaces made available by
the fabric layer. For example, this layer will offer functions for obtaining configuration
information on a specific resource, or, in general, to perform specific operations such
as creating a process or reading data. The resource layer is thus seen to be responsible
for access control, and hence will rely on the authentication performed as part of the
connectivity layer.
The next layer in the hierarchy is the collective layer. It deals with handling access to
multiple resources and typically consists of services for resource discovery, allocation
and scheduling of tasks onto multiple resources, data replication, and so on. Unlike the
connectivity and resource layer, which consist of a relatively small, standard collection
of protocols, the collective layer may consist of many different protocols for many
different purposes, reflecting the broad spectrum of services it may offer to a virtual
organization.
Finally, the application layer consists of the applications that operate within a virtual
organization and which make use of the grid computing environment.
Typically the collective, connectivity, and resource layer form the heart of what could
be called a grid middleware layer. These layers jointly provide access to and
management of resources that are potentially dispersed across multiple sites. An
important observation from a middleware perspective is that with grid computing the
notion of a site (or administrative unit) is common. This prevalence is emphasized by
the gradual shift toward a service-oriented architecture in which sites offer access to the
various layers through a collection of services. This, by now, has led to the definition
of an alternative architecture known as the Open Grid Services Architecture (OGSA).
This architecture consists of various layers and many components, making it rather
complex. Complexity seems to be the fate of any standardization process.
Observation (DCS): many distributed systems are configured for high-performance computing.
Cluster computing: essentially a group of high-end systems connected through a LAN:
• Homogeneous: same OS, near-identical hardware
• Single managing node
Grid Computing: the next step: lots of nodes from everywhere:
• Heterogeneous
• Dispersed across several organizations
• Can easily span a wide-area network
Note: To allow for collaborations, grids generally use virtual organizations. In essence,
this is a grouping of users (or better: their IDs) that will allow for authorization on
resource allocation.
1.6.2 Distributed Information Systems (DIS)
To clarify our discussion, let us concentrate on database applications. In practice,
operations on a database are usually carried out in the form of transactions.
Programming using transactions requires special primitives that must either be supplied
by the underlying distributed system or by the language runtime system.
Observation: the vast majority of distributed systems in use today are forms of traditional information systems that now integrate legacy systems. Example: transaction processing systems.
Essential: all read and write operations are executed, i.e., their effects are made permanent, at the execution of END_TRANSACTION.
Observation: transactions form an atomic operation.
Distributed Information Systems: Transactions
Another important class of distributed systems is found in organizations that were
confronted with a wealth of networked applications, but for which interoperability
turned out to be a painful experience. Many of the existing middleware solutions are
the result of working with an infrastructure in which it was easier to integrate
applications into an enterprise-wide information system.
We can distinguish several levels at which integration took place. In many cases, a
networked application simply consisted of a server running that application (often
including a database) and making it available to remote programs, called clients. Such
clients could send a request to the server for executing a specific operation, after which
a response would be sent back. Integration at the lowest level would allow clients to
wrap a number of requests, possibly for different servers, into a single larger request
and have it executed as a distributed transaction.
The key idea was that all, or none of the requests would be executed. As applications
became more sophisticated and were gradually separated into independent components
(notably distinguishing database components from processing components), it became
clear that integration should also take place by letting applications communicate
directly with each other. This has now led to a huge industry that concentrates on
enterprise application integration (EAI).
In the following, we concentrate on these two forms of distributed systems.
Model: a transaction is a collection of operations on the state of an object (database, object composition, etc.) that satisfies the following properties (ACID):
• Atomicity: all operations either succeed, or all of them fail. When the transaction fails, the state of the object will remain unaffected by the transaction.
• Consistency: a transaction establishes a valid state transition. This does not exclude the possibility of invalid, intermediate states during the transaction's execution.
• Isolation: concurrent transactions do not interfere with each other. It appears to each transaction T that other transactions occur either before T, or after T, but never both.
• Durability: after the execution of a transaction, its effects are made permanent: changes to the state survive failures.
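A small sketch of atomicity and durability using Python's built-in sqlite3 module: either both updates of a money transfer are made permanent at commit (the END_TRANSACTION point), or the rollback leaves the state unaffected. The account names and balances are made up for the example.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO accounts VALUES (?, ?)",
                [("alice", 100), ("bob", 50)])
con.commit()

def transfer(con, src, dst, amount):
    try:
        con.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                    (amount, src))
        con.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                    (amount, dst))
        con.commit()          # END_TRANSACTION: both effects become permanent
    except Exception:
        con.rollback()        # ABORT: the state is left unaffected
        raise

transfer(con, "alice", "bob", 30)
print(con.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 70), ('bob', 80)]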
1.6.3 Distributed Pervasive (spreading) Systems (DPS)
The distributed systems we have been discussing so far are largely characterized by
their stability: nodes are fixed and have a more or less permanent and high-quality
connection to a network. To a certain extent, this stability has been realized through the
various techniques that are discussed in this book and which aim at achieving
distribution transparency.
For example, the wealth of techniques for masking failures and recovery will give the
impression that only occasionally things may go wrong. Likewise, we have been able
to hide aspects related to the actual network location of a node, effectively allowing
users and applications to believe that nodes stay put. However, matters have become
very different with the introduction of mobile and embedded computing devices. We
are now confronted with distributed systems in which instability is the default behavior.
The devices in these, what we refer to as distributed pervasive systems, are often
characterized by being small, battery-powered, mobile, and having only a wireless
connection, although not all these characteristics apply to all devices. Moreover, these
characteristics need not necessarily be interpreted as restrictive, as is illustrated by the
possibilities of modern smart phones (Roussos et al., 2005).
As its name suggests, a distributed pervasive system is part of our surroundings (and
as such, is generally inherently distributed). An important feature is the general lack of
human administrative control. At best, devices can be configured by their owners, but
otherwise they need to automatically discover their environment and "nestle in" as best
as possible. This nestling in has been made more precise by Grimm et al. (2004) by
formulating the following three requirements for pervasive applications:
1. Embrace contextual changes.
2. Encourage ad hoc composition.
3. Recognize sharing as the default.
Observation: there is a next generation of distributed systems emerging in which the
nodes are small, mobile, and often embedded as part of a larger system. Some
requirements:
Contextual change: the system is part of an environment in which changes should be
immediately accounted for.
Ad hoc composition: each node may be used in very different ways by different users.
Requires ease-of-configuration.
Sharing is the default: nodes come and go, providing sharable services and information.
Calls again for simplicity.
Observation: pervasiveness and distribution transparency may not always form a good
match.
Example 7: Distributed Pervasive Systems
Electronic health care system
Home systems: should be completely self-organizing:
• There should be no system administrator
• Provide a personal space for each of its users
• Simplest solution: a centralized home box?
Electronic health systems: devices are physically close to a person:
• Where and how should monitored data be stored?
• How can we prevent loss of crucial data?
• What is needed to generate and propagate alerts?
• How can security be enforced?
• How can physicians provide online feedback?
Example 8: Sensor Networks
Characteristics: the nodes to which sensors are attached are:
• Many (10s–1000s)
• Simple (i.e., hardly any memory, CPU power, or communication facilities)
• Often battery-powered (or even battery-less)
Sensor networks as distributed systems: consider them from a database perspective:
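A sketch of that perspective is given below; the interfaces are hypothetical and only illustrate the idea that each node pre-aggregates its own readings, so that only small partial results, rather than all raw samples, travel to the sink.

import java.util.List;

// Hypothetical interfaces: a real sensor-network database (e.g., a TinyDB-style system)
// would push such aggregation into the network itself.
interface SensorNode {
    double partialSum();    // sum of this node's samples, computed at the node
    long sampleCount();     // number of samples held by this node
}

class Sink {
    /** Average temperature over the whole network: each node contributes only
        two numbers, not its raw readings. */
    static double averageTemperature(List<SensorNode> nodes) {
        double sum = 0;
        long count = 0;
        for (SensorNode node : nodes) {
            sum += node.partialSum();       // only the aggregate crosses the network
            count += node.sampleCount();
        }
        return count == 0 ? Double.NaN : sum / count;
    }
}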
Revision Sheet # 1
PROBLEMS
1. What is the role of middleware in a distributed system?
2. Explain what is meant by (distribution) transparency, and give examples of different types of transparency.
3. Why is it sometimes so hard to hide the occurrence and recovery from failures in a distributed system?
4. Why is it not always a good idea to aim at implementing the highest degree of transparency possible?
5. What is an open distributed system and what benefits does openness provide?
6. Describe precisely what is meant by a scalable system.
7. Scalability can be achieved by applying different techniques. What are these techniques?
8. Explain what is meant by a virtual organization and give a hint on how such organizations could be implemented.
9. When a transaction is aborted, we have said that the world is restored to its previous state, as though the transaction had never happened. We lied. Give an example where resetting the world is impossible.
10. Executing nested transactions requires some form of coordination. Explain what a coordinator should actually do.
11. We argued that distribution transparency may not be in place for pervasive systems. This statement is not true for all types of transparencies. Give an example.
12. We already gave some examples of distributed pervasive systems: home systems, electronic health-care systems and sensor networks. Extend this list with more examples.
13. (Lab assignment) Sketch a design for a home system consisting of a separate media server that will allow for the attachment of a wireless client. The latter is connected to (analog) audio/video equipment and transforms the digital media streams to analog output. The server runs on a separate machine, possibly connected to the Internet, but has no keyboard and/or monitor connected.
Assignment # 1
Repeat Example #1, having 6 ovens (processors) rather than 5 ovens and a baking
time of 20 minutes rather than 10 minutes. How many pizzas (processes) are
completed at time = 60 minutes for the sequential, parallel and pipelined cases?
Give your comments.
Assignment # 2
Repeat Example #2, having 16 processors rather than 8 processors, and compute
the speedup and the utilization. What are your comments about speedup and
utilization, based on the results for the 4/8/16-processor cases?
Assignment # 3
a) Compute the parallel efficiency for the 4 processors case.
b) Compute the parallel efficiency for the 8 processors case.
c) Compute the parallel efficiency for the 16 processors case.
d) What are your comments on the parallel efficiency, based on the results in (a),
(b) and (c)?
Assignment # 4
Compare CORBA, DCE and DCOM.
Your comparison should include:
• Basic theory of operation with Architecture
• Applicability
• Trend of updating.
Chapter 2
Distributed Systems Architectures
Centralized System vs. Distributed System
Criteria        Centralized system    Distributed system
Economics       Low                   High
Availability    Low                   High
Complexity      Low                   High
Consistency     Simple                High
Scalability     Poor                  Good
Technology      Homogeneous           Heterogeneous
Security        High                  Low
• In distributed architecture, components are presented on different platforms and several components can cooperate with one another over a communication network in order to achieve a specific objective or goal.
• In this architecture, information processing is not confined to a single machine; rather, it is distributed over several independent computers.
• A distributed system can be demonstrated by the client-server architecture, which forms the base for multi-tier architectures; alternatives are the broker architecture, such as CORBA, and the Service-Oriented Architecture (SOA).
• There are several technology frameworks to support distributed architectures, including .NET, J2EE, CORBA, .NET Web services, AXIS Java Web services, and Globus Grid services.
• Middleware is an infrastructure that appropriately supports the development and execution of distributed applications. It provides a buffer between the applications and the network.
• It sits in the middle of the system and manages or supports the different components of a distributed system. Examples are transaction processing monitors, data convertors and communication controllers.
Distributed systems are often complex pieces of software of which the components are
by definition dispersed across multiple machines. To master their complexity, it is
crucial that these systems are properly organized. There are different ways to view the
organization of a distributed system, but an obvious one is to make a distinction between,
on the one hand, the logical organization of the collection of software components and,
on the other hand, the actual physical realization.
The organization of distributed systems is mostly about the software components that
constitute the system. These software architectures tell us how the various software
components are to be organized and how they should interact. In this chapter we will
first pay attention to some commonly applied approaches toward organizing
(distributed) computer systems. The actual realization of a distributed system requires
that we instantiate and place software components on real machines.
There are many different choices that can be made in doing so. The final instantiation
of a software architecture is also referred to as a system architecture. In this chapter we
will look into traditional centralized architectures in which a single server implements
most of the software components (and thus functionality), while remote clients can
access that server using simple communication means. In addition, we consider
decentralized architectures in which machines more or less play equal roles, as well as
hybrid organizations.
As we explained in Chap. 1, an important goal of distributed systems is to separate
applications from underlying platforms by providing a middleware layer. Adopting such
a layer is an important architectural decision, and its main purpose is to provide
distribution transparency. However, trade-offs need to be made to achieve transparency,
which has led to various techniques to make middleware adaptive. We discuss some of
the more commonly applied ones in this chapter, as they affect the organization of the
middleware itself. Adaptability in distributed systems can also be achieved by having
the system monitor its own behavior and taking appropriate measures when needed.
This insight has led to a class of what are now referred to as autonomic systems. These
distributed systems are frequently organized in the form of feedback control loops,
which form an important architectural element during a system's design. In this chapter,
we devote a section to autonomic distributed systems.
2.1 Basic Definitions
• A program is the code you write
• A process is what you get when you run it
• A message is used to communicate between processes
• A packet is a fragment of a message that might travel on a wire
• A protocol is a formal description of message formats and the rules that two processes
must follow in order to exchange those messages
• A network is the infrastructure that links computers, workstations, terminals, servers,
etc. It consists of routers which are connected by communication links.
• A component can be a process or any piece of hardware required to run a process,
support communications between processes, store data.
• A distributed system is an application that executes a collection of protocols to
coordinate the actions of multiple processes on a network, such that all components
cooperate together to perform a single or small set of related tasks (Detailed definition
of DS).
2.2 Distributed Systems Architectures
1. Multiprocessor Architectures
2. Client-server Architectures
• Distributed services which are called on by clients.
• Servers that provide services are treated differently from clients that use
services
3. Distributed Object Architectures
• No distinction between clients and servers (object world).
• Any object on the system may provide and use services (functions) from other
objects
2.3 Multiprocessor Architectures
• Advantage: Simplest distributed system model.
• System composed of multiple processes which may execute on different
processors.
• Architectural model of many large real-time systems.
• Distribution of processes to processors may be pre-ordered or may be under the
control of a dispatcher of the operating system.
A multiprocessor Traffic Control System (Example 1)
(Figure: a sensor processor runs the sensor control process, a traffic flow processor runs the display process, and a traffic light control processor runs the light control process; the system connects traffic-flow sensors and cameras, traffic lights, and operator consoles.)
Multiprocessor Traffic Control System (SCOOT systems)
SCOOT systems are designed around a central processor hosting the SCOOT
kernel, integrated with the company-specific UTC software that controls
communications to the on-street equipment and provides the operator interface. This
processor and its associated networked terminals may be installed in a control room.
The figure below shows an enhancement with a wireless digital radio to achieve
city-wide lighting and intelligent traffic control. Modernizing the city's road-light
control not only adds luster to the city but can also improve the rational use of urban
resources.
Wireless digital radio provides the city with street surveillance, reliable real-time data
transmission, lower operation and maintenance costs, and transmission speeds high
enough to view images.
2.4 Client-server Architectures
In the basic client-server model, processes in a distributed system are divided into two
(possibly overlapping) groups. A server is a process implementing a specific service,
for example, a file system service or a database service. A client is a process that
requests a service from a server by sending it a request and subsequently waiting for
the server's reply. This client-server interaction is also known as request-reply communication.
Communication between a client and a server can be implemented by means of a simple
connectionless protocol when the underlying network is fairly reliable, as in many
local-area networks. In these cases, when a client requests a service, it simply packages a
message for the server, identifying the service it wants, along with the necessary input
data.
The message is then sent to the server. The latter, in turn, will always wait for an
incoming request, subsequently process it, and package the results in a reply message
that is then sent to the client.
Using a connectionless protocol has the obvious advantage of being efficient. As long
as messages do not get lost or corrupted, the request/reply protocol just sketched works
fine. Unfortunately, making the protocol resistant to occasional transmission failures is
not trivial.
The only thing we can do is possibly let the client resend the request when no reply
message comes in. The problem, however, is that the client cannot detect whether the
original request message was lost, or that transmission of the reply failed.
If the reply was lost, then resending a request may result in performing the operation
twice.
If the operation was something like "transfer $10,000 from my bank account," then
clearly, it would have been better that we simply reported an error instead. On the other
hand, if the operation was "tell me how much money I have left," it would be perfectly
acceptable to resend the request.
When an operation can be repeated multiple times without harm, it is said to be
idempotent. Since some requests are idempotent and others are not it should be clear
that there is no single solution for dealing with lost messages.
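A minimal sketch of such a connectionless request-reply exchange is shown below, using Java's datagram sockets; the server address, port and message format are assumptions made for this example. Note that the client simply resends the request on a timeout, which is only safe when the requested operation is idempotent.

import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.net.SocketTimeoutException;
import java.nio.charset.StandardCharsets;

public class RequestReplyClient {
    public static String request(String server, int port, String operation) throws Exception {
        byte[] req = operation.getBytes(StandardCharsets.UTF_8);
        try (DatagramSocket socket = new DatagramSocket()) {
            socket.setSoTimeout(1000);                       // wait at most 1 s for the reply
            DatagramPacket out = new DatagramPacket(
                    req, req.length, InetAddress.getByName(server), port);
            byte[] buf = new byte[1024];
            DatagramPacket in = new DatagramPacket(buf, buf.length);
            for (int attempt = 0; attempt < 3; attempt++) {
                socket.send(out);                            // send the request
                try {
                    socket.receive(in);                      // wait for the reply
                    return new String(in.getData(), 0, in.getLength(), StandardCharsets.UTF_8);
                } catch (SocketTimeoutException e) {
                    // Request or reply was lost; resending is safe only for idempotent
                    // operations such as "tell me how much money I have left".
                }
            }
            throw new Exception("no reply after 3 attempts");
        }
    }
}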
In conclusion, we could summarize the client-server architecture milestones as follows:
• The application is modelled as a set of services that are provided by servers and a set of clients that use these services.
• Clients know of servers, but servers need not know of clients.
• Clients and servers are logical processes.
• The mapping of processors to processes (scheduling/allocation) is not necessarily 1:1.
Client versus Servers
Clients are PCs or workstations on which users run applications. Clients rely on
servers for resources, such as files, devices, and even processing power.
Servers, on the other hand, are powerful computers or processes dedicated to managing disk
drives (file servers), printers (print servers), or network traffic (network servers).
Comparison with peer-to-peer (P2P) architecture
In the client-server model, the server is a centralized system. The more
simultaneous clients a server has, the more resources it needs.
In a peer-to-peer network, two or more computers (called peers) pool their
resources and communicate in a decentralized system. Peers are coequal nodes in a
nonhierarchical network.
A comparison between C/S and peer-to-peer structure is presented as follows:
1. In a client-server network, clients and the server are differentiated: specific servers and clients are present. In a peer-to-peer network, clients and servers are not differentiated.
2. A client-server network focuses on information sharing, while a peer-to-peer network focuses on connectivity.
3. In a client-server network, a centralized server is used to store the data, while in a peer-to-peer network each peer has its own data.
4. In a client-server network, the server responds to the services requested by clients, while in a peer-to-peer network each and every node can both request and respond to services.
5. Client-server networks are costlier than peer-to-peer networks.
6. Client-server networks are more stable than peer-to-peer networks, which become less stable as the number of peers increases.
7. A client-server network is used for both small and large networks, while a peer-to-peer network is generally suited for small networks with fewer than 10 computers.
In a structured peer-to-peer architecture, the overlay network is constructed
using a deterministic procedure. By far the most-used procedure is to organize the
processes through a distributed hash table (DHT).
In a DHT-based system, data items are assigned a random key from a large
identifier space, such as a 128-bit or 160-bit identifier. Likewise, nodes in the system
are also assigned a random number from the same identifier space.
The crux of every DHT-based system is then to implement an efficient and
deterministic scheme that uniquely maps the key of a data item to the identifier of a
node, based on some distance metric.
Most importantly, when looking up a data item, the network address of the node
responsible for that data item is returned.
Effectively, this is accomplished by routing a request for a data item to the
responsible node.
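The essence of such a deterministic mapping can be sketched as follows; this is a simplification in the spirit of consistent hashing (the 160-bit identifier space and the SHA-1 hash are assumptions, and real systems such as Chord route a lookup hop by hop rather than keeping a complete node list). A data item is stored at the first node whose identifier follows the item's key on the identifier circle.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.TreeSet;

public class SimpleDht {
    private static final BigInteger SPACE = BigInteger.ONE.shiftLeft(160); // 160-bit identifier space
    private final TreeSet<BigInteger> nodeIds = new TreeSet<>();

    static BigInteger id(String name) throws Exception {
        byte[] hash = MessageDigest.getInstance("SHA-1").digest(name.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, hash).mod(SPACE);
    }

    void addNode(String nodeName) throws Exception {
        nodeIds.add(id(nodeName));
    }

    /** The node responsible for a key is the key's successor on the identifier circle. */
    BigInteger lookup(String key) throws Exception {
        BigInteger k = id(key);
        BigInteger successor = nodeIds.ceiling(k);
        return successor != null ? successor : nodeIds.first();   // wrap around the circle
    }
}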
Unstructured peer-to-peer systems largely rely on randomized algorithms for
constructing an overlay network. The main idea is that each node maintains a list of
neighbors, but that this list is constructed in a more or less random way. Likewise, data
items are assumed to be randomly placed on nodes.
As a consequence, when a node needs to locate a specific data item, the only
thing it can effectively do is flood the network with a search query.
One of the goals of many unstructured peer-to-peer systems is to construct an
overlay network that resembles a random graph. The basic model is that each node
maintains a list of c neighbors, where, ideally, each of these neighbors represents a
randomly chosen live node from the current set of nodes.
The list of neighbors is also referred to as a partial view. There are many ways
to construct such a partial view. Jelasity et al. (2004, 2005a) have developed a
framework that captures many different algorithms for overlay construction to allow
for evaluations and comparison. In this framework, it is assumed that nodes regularly
exchange entries from their partial view.
Each entry identifies another node in the network, and has an associated age that
indicates how old the reference to that node is.
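A rough sketch of this style of partial-view maintenance is given below; the entry structure, view size and selection policy are assumptions chosen only to illustrate the idea that nodes periodically swap entries and keep the freshest references.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ViewEntry {
    final String nodeAddress;   // reference to another node in the overlay
    int age;                    // how old the reference is, counted in gossip rounds
    ViewEntry(String nodeAddress, int age) { this.nodeAddress = nodeAddress; this.age = age; }
}

class PartialView {
    static final int C = 8;                          // the "c neighbors" kept by each node
    final List<ViewEntry> entries = new ArrayList<>();

    /** Merge entries received from a peer and keep only the C youngest references. */
    void exchangeWith(List<ViewEntry> received) {
        for (ViewEntry e : entries) e.age++;         // everything we already hold gets one round older
        entries.addAll(received);
        entries.sort(Comparator.comparingInt((ViewEntry e) -> e.age));
        while (entries.size() > C) entries.remove(entries.size() - 1);
    }
}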
Clients and Servers
General interaction between a client and a server
An Example Client and Server (1)
The header.h file used by the client and server
An Example Client and Server (2)
A sample server
An Example Client and Server (3)
A client using the server to copy a file
Typical client server system
(Figure: client processes c1–c12 issue requests to server processes s1–s4.)
Physical Computers in a C/S network
(Figure: the client processes are mapped onto client computers CC1–CC6 and the server processes onto server computers SC1 and SC2, all connected through a network.)
C/S architecture from Layered Application Point of View
The Presentation layer is concerned with presenting the results of a computation
to system users and with collecting user inputs.
The Application processing layer is concerned with providing application-specific
functionality (e.g., in a banking system, banking functions such as open account and
close account), while the Data management layer is concerned with managing the
system databases.
Application layers in C/S model
The user-interface level contains all that is necessary to directly interface with the
user, such as display management. The processing level typically contains the
applications. The data level manages the actual data that is being acted on. Clients
typically implement the user-interface level. This level consists of the programs that
allow end users to interact with applications.
There is a considerable difference in how sophisticated user-interface programs are.
The simplest user-interface program is nothing more than a character-based screen.
Such an interface has been typically used in mainframe environments. In those cases
where the mainframe controls all interaction, including the keyboard and monitor, one
can hardly speak of a client-server environment.
However, in many cases, the user's terminal does some local processing such as echoing
typed keystrokes, or supporting form-like interfaces in which a complete entry is to be
edited before sending it to the main computer. Nowadays, even in mainframe
environments, we see more advanced user interfaces.
Typically, the client machine offers at least a graphical display in which pop-up or
pull-down menus are used, and of which many of the screen controls are handled through a
mouse instead of the keyboard. Typical examples of such interfaces include the
X-Windows interfaces as used in many UNIX environments, and earlier interfaces
developed for MS-DOS PCs and Apple Macintoshes.
Modern user interfaces offer considerably more functionality by allowing applications
to share a single graphical window, and to use that window to exchange data through
user actions. For example, to delete a file, it is usually possible to move the icon
representing that file to an icon representing a trash can. Likewise, many word
processors allow a user to move text in a document to another position by using only
the mouse.
As a first example, consider an Internet search engine. Ignoring all the animated
banners, images, and other fancy window dressing, the user interface of a search engine
is very simple: a user types in a string of keywords and is subsequently presented with
a list of titles of Web pages. The back end is formed by a huge database of Web pages
that have been prefetched and indexed.
The core of the search engine is a program that transforms the user's string of keywords
into one or more database queries. It subsequently ranks the results into a list, and
transforms that list into a series of HTML pages. Within the client-server model, this
information retrieval part is typically placed at the processing level.
The data level in the client-server model contains the programs that maintain the actual
data on which the applications operate. An important property of this level is that data
are often persistent, that is, even if no application is running, data will be stored
somewhere for next use.
In its simplest form, the data level consists of a file system, but it is more common to
use a full-fledged database.
In the client-server model, the data level is typically implemented at the server side.
Besides merely storing data, the data level is generally also responsible for keeping data
consistent across different applications.
When databases are being used, maintaining consistency means that metadata such as
table descriptions, entry constraints and application-specific metadata are also stored at
this level. For example, in the case of a bank, we may want to generate a notification
when a customer's credit card debt reaches a certain value. This type of information can
be maintained through a database trigger that activates a handler for that trigger at the
appropriate moment. In most business-oriented environments, the data level is
organized as a relational database.
Data independence is crucial here. The data are organized independent of the
applications in such a way that changes in that organization do not affect applications,
and neither do the applications affect the data organization.
Using relational databases in the client-server model helps separate the processing level
from the data level, as processing and data are considered independent. However,
relational databases are not always the ideal choice.
A characteristic feature of many applications is that they operate on complex data types
that are more easily modeled in terms of objects than in terms of relations. Examples of
such data types range from simple polygons and circles to representations of aircraft
designs, as is the case with computer-aided design (CAD) systems.
In those cases where data operations are more easily expressed in terms of object
manipulations, it makes sense to implement the data level by means of an object-oriented
or object-relational database. Notably, the latter type has gained popularity as
these databases build upon the widely dispersed relational data model, while offering
the advantages that object-orientation gives.
2.4.1 Two-tier (layer) thin and fat clients
Thin-client model (i.e. fat server)
In a thin-client model, all of the application processing and data management is
carried out on the server. The client is simply responsible for running the presentation
software (so client is thin):
• Used when legacy systems are migrated to client-server architectures, in which the legacy system acts as a server in its own right with a graphical interface implemented on a client.
• A major disadvantage is that it places a heavy processing load on both the server and the network.
Fat-client model (i.e. thin server)
In this model, the server is only responsible for data management. The
software on the client implements the application logic and the interactions with the
system user (so client is fat):
• Most appropriate for new C/S systems where the capabilities of the client system are known in advance.
• More complex than a thin-client model, especially for management: new versions of the application have to be installed on all clients.
Advantages
• Separation of responsibilities such as user interface presentation and business logic processing.
• Reusability of server components and potential for concurrency.
• Simplifies the design and the development of distributed applications.
• Makes it easy to migrate or integrate existing applications into a distributed environment.
• Makes effective use of resources when a large number of clients are accessing a high-performance server.
Disadvantages
• Lack of a heterogeneous infrastructure to deal with requirement changes.
• Security complications.
• Limited server availability and reliability.
• Limited testability and scalability.
• Fat clients combine presentation and business logic together.
Legacy Systems Lifetime
Companies spend a lot of money on software systems and, to get a return on that
investment, the software must be useable for a number of years. The lifetime of software
systems is very variable, but many large systems remain in use for more than 10 years.
Some organizations still rely on software systems that are more than 20 years
old. Many of these old systems are still business-critical. That is, the business relies on
the services provided by the software and any failure of these services would have a
serious effect on the day-to-day running of the business. These old systems have been
given the name legacy systems.
Legacy Systems NASA Example
NASA's now retired Space Shuttle program used a large amount of 1970s-era
technology. Replacement was considered unaffordable because of:
• the expensive requirement for flight certification;
• the expensive integration work already completed with the legacy hardware.
But any new equipment would have had to go through that entire process –
requiring extensive tests of the new components in their new configurations – before a
single unit could be used in the Space Shuttle program.
Legacy Systems Structure
Fat Client Model Applicability
More processing is delegated to the client as the application processing is locally
executed. It is most suitable for new C/S systems where the capabilities of the client
system are known in advance (e.g. ATM). Its disadvantage is that it is more complex
than a thin client model especially for management, since new versions of the
application have to be installed on all clients (maintenance overhead).
A client-server ATM system (Fat client Example 2)
(Figure: several ATMs act as fat clients, connected through a teleprocessing monitor to an account server that manages the customer account database.)
2.4.2 Three-tier architectures
In a three-tier architecture, each of the application architecture layers may
execute on a separate processor. Its advantages are that it allows for better performance
than a thin-client approach and it’s simpler to manage than a fat-client approach. It is
also a more scalable architecture - as demands increase, extra servers can be added.
A 3-tier C/S architecture
The diagram below shows 3 separate processors (two servers and one client)
A typical example of where a three-tiered architecture is used is in transaction
processing. As we discussed in Chap. 1, a separate process, called the transaction
processing monitor, coordinates all transactions across possibly different data servers.
Another, but very different example where we often see a three-tiered architecture is in
the organization of Web sites. In this case, a Web server acts as an entry point to a site,
passing requests to an application server where the actual processing takes place. This
application server, in turn, interacts with a database server. For example, an application
server may be responsible for running the code to inspect the available inventory of
some goods as offered by an electronic bookstore. To do so, it may need to interact with
a database containing the raw inventory data.
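A skeletal view of this three-tier division is sketched below; all class and method names are hypothetical. The web tier only formats requests and responses, the application tier holds the business logic, and only the data tier talks to the database, so each tier could be placed on its own server machine.

// Hypothetical three-tier sketch: each class could run on a separate processor.
class DataTier {
    int stockLevel(String isbn) {
        // In a real system this would be an SQL query against the inventory database.
        return 42;
    }
}

class ApplicationTier {
    private final DataTier data = new DataTier();
    String checkAvailability(String isbn) {            // business logic lives in this tier
        return data.stockLevel(isbn) > 0 ? "in stock" : "out of stock";
    }
}

class WebTier {
    private final ApplicationTier app = new ApplicationTier();
    String handleHttpRequest(String isbn) {             // entry point for client browsers
        return "<html><body>" + app.checkAvailability(isbn) + "</body></html>";
    }
}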
Advantages
• Better performance than a thin-client approach and simpler to manage than a thick-client approach.
• Enhances reusability and scalability − as demands increase, extra servers can be added.
• Provides multi-threading support and also reduces network traffic.
• Provides maintainability and flexibility.
Disadvantages
• Unsatisfactory testability due to lack of testing tools.
• More critical server reliability and availability.
An internet banking system (Example 3)
(Figure: multiple clients interact over HTTP with a web server that provides account services; the web server issues SQL queries to a database server managing the customer account database.)
Internet Banking System
When a bank customer accesses online banking services with a web browser (the
client), the client initiates a request to the bank's web server. The customer's login
credentials may be stored in a database, and the web server accesses the database server
as a client.
An application server interprets the returned data by applying the bank's business
logic and provides the output to the web server. Finally, the web server returns the result
to the client web browser for display.
In each step of this sequence of client–server message exchanges, a computer
processes a request and returns data. This is the request-response messaging pattern.
When all the requests are met, the sequence is complete, and the web browser presents
the data to the customer.
Use of C/S architectures
Two-tier C/S architecture with thin clients:
• Legacy system applications where separating application processing and data management is impractical.
• Computationally-intensive applications such as compilers with little or no data management.
• Data-intensive applications (browsing and querying) with little or no application processing.
Two-tier C/S architecture with fat clients:
• Applications where application processing is provided by off-the-shelf software (e.g. Microsoft Excel) on the client.
• Applications where computationally-intensive processing of data (e.g. data visualization) is required.
• Applications with relatively stable end-user functionality used in an environment with well-established system management.
Three-tier or multi-tier C/S architecture:
• Large-scale applications with hundreds or thousands of clients.
• Applications where both the data and the application are volatile.
• Applications where data from multiple sources are integrated.
The picture below is a visualization of how a car deforms in an asymmetrical crash
using finite element analysis.
2.5 Distributed Object Architectures
There is no distinction in a distributed object architecture between clients and
servers. Each distributable entity is an object that provides services to other objects and
receives services from other objects (objects world). Object communication is through
a middleware system called an object request broker (ORB). Its disadvantage is that
distributed object architectures are more complex to design than client-server systems.
Layout of Distributed Object Architecture
(Figure: objects o1–o6, each offering a service interface S(o1)–S(o6), communicate through an object request broker.)
Advantages of distributed object architecture
It is a very open system architecture that allows new resources to be added to it as
required. The system is flexible and scalable. It allows the system designer to delay
decisions on where and how services should be provided. It is possible to reconfigure
the system dynamically with objects migrating across the network as required.
Usage of distributed object architecture
As a logical model that allows you to structure and organize the system, you think about
how to provide application functionality solely in terms of services and combinations
of services (world of services).
As a flexible approach to the implementation of client-server systems, the logical model
of the system is a client-server model but both clients and servers are realized as
distributed objects communicating through a common communication framework.
The logical model of the system is not one of service provision where there are
distinguished data management services.
It has the following advantages:
• It allows the number of databases that are accessed to be increased without disrupting the system (scalability).
• It allows new types of relationship to be mined by adding new integrator objects (flexibility).
Disadvantages
• Complexity − more complex than centralized systems.
• Security − more susceptible to external attack.
• Manageability − more effort required for system management.
• Unpredictability − unpredictable responses depending on the system organization and network load.
A data mining system (Example 4)
(Figure: three databases feed two integrator objects; their results are passed to a report generator and a visualiser for display.)
Broker Architectural Style
Broker Architectural Style is a middleware architecture used in distributed computing
to coordinate and enable the communication between registered servers and clients.
Here, object communication takes place through a middleware system called an object
request broker (software bus).
• The client and the server do not interact with each other directly; each has a direct connection to its proxy, which communicates with the mediator (the broker).
• A server provides services by registering and publishing its interfaces with the broker, and clients can request the services from the broker statically or dynamically by look-up.
• CORBA (Common Object Request Broker Architecture) is a good implementation example of the broker architecture.
Components of Broker Architectural Style
The components of the broker architectural style are discussed under the following headings:
Broker
The broker is responsible for coordinating communication, such as forwarding and
dispatching the results and exceptions. It can be either an invocation-oriented service
or a document- or message-oriented broker to which clients send a message.
• It is responsible for brokering the service requests, locating a proper server, transmitting requests, and sending responses back to clients.
• It retains the servers’ registration information, including their functionality and services as well as location information.
• It provides APIs for clients to request, servers to respond, registering or unregistering server components, transferring messages, and locating servers.
Stub
Stubs are generated at static compilation time and then deployed to the client side,
where they are used as a proxy for the client. The client-side proxy acts as a mediator
between the client and the broker and provides additional transparency: to the client, a
remote object appears like a local one.
The proxy hides the IPC (inter-process communication) at protocol level and performs
marshaling of parameter values and un-marshaling of results from the server.
Skeleton
The skeleton is generated by compiling the service interface and is then deployed to the
server side, where it is used as a proxy for the server. The server-side proxy encapsulates
low-level system-specific networking functions and provides high-level APIs to
mediate between the server and the broker.
It receives the requests, unpacks the requests, unmarshals the method arguments, calls the
suitable service, and also marshals the result before sending it back to the client.
Bridge
A bridge can connect two different networks based on different communication
protocols. It mediates different brokers including DCOM, .NET remote, and Java
CORBA brokers.
Bridges are optional components; they hide the implementation details when two
brokers interoperate, taking requests and parameters in one format and translating them
to another format.
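To make the division of roles concrete, here is a toy, in-process sketch of the broker idea; all names are hypothetical and there is no real networking, marshalling or proxy generation. Servers register an implementation under a service name, and clients obtain a reference by looking the name up at the broker instead of contacting the server directly.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

interface Service {                       // the published service interface
    String invoke(String request);
}

class Broker {
    private final Map<String, Service> registry = new ConcurrentHashMap<>();

    void register(String name, Service impl) {        // server side: publish an interface
        registry.put(name, impl);
    }

    Service lookup(String name) {                      // client side: look the service up
        Service s = registry.get(name);
        if (s == null) throw new IllegalArgumentException("unknown service: " + name);
        return s;
    }
}

class BrokerDemo {
    public static void main(String[] args) {
        Broker broker = new Broker();
        broker.register("echo", request -> "echo: " + request);   // the server registers
        Service proxy = broker.lookup("echo");                    // the client looks up
        System.out.println(proxy.invoke("hello"));                // and invokes the service
    }
}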
2.6 CORBA Architecture
CORBA is an acronym for Common Object Request Broker Architecture.
CORBA (1991) is an international standard for an Object Request Broker - middleware
to manage communications between distributed objects. Object Management Group
(OMG) is responsible for defining CORBA. The OMG comprises all the major vendors
and developers of distributed object technology including:
• platform, database, and application vendors
• software tool and corporate developers
Middleware for distributed computing is required at two levels:
• At the logical communication level, the middleware allows objects on different computers to exchange data and control information.
• At the component level, the middleware provides a basis for developing compatible components. CORBA component standards have been defined. Hint: visit the OMG site for the CORBA FAQ and releases of CORBA.
CORBA specifies a system that provides interoperability among objects in a
heterogeneous, distributed environment in a way that is transparent to the programmer.
This model defines common object semantics for specifying the externally visible
characteristics of objects in a standard and implementation-independent way. In this
model, clients request services from objects (which will also be called servers) through
a well-defined interface specified in Object Management Group Interface Definition
Language (IDL).
The request is an event, and it carries information including:
• an operation
• the object reference of the service provider, which is a name that defines an object consistently
• actual parameters (if any)
The central component (core) of CORBA is the object request broker (ORB). It
includes the entire communication infrastructure necessary to:
• identify and locate objects
• handle connection management
• deliver data
In general, the object request broker is not required to be a single component; it
is simply defined by its interfaces. A software broker is an agent that does a job on
behalf of others (using IIOP, the Internet Inter-ORB Protocol, to pass method invocation
requests to the correct objects and return the results to the caller).
2.6.1 CORBA Goal
The OMG’s goal was to adopt distributed object systems that utilize object-oriented programming for distributed systems:
• Systems to be built on heterogeneous hardware, networks, operating systems and
programming languages.
• The distributed objects would be implemented in various programming languages and
still be able to communicate with each other.
2.6.2 CORBA Architecture
The simplified architecture of the CORBA is as presented in the next figure followed
by a detailed architecture.
Portable Object Adapter (POA)
There are different types of CORBA object adapters, such as real-time object
adapters (in TAO) and portable object adapters. The Portable Object Adapter (POA) is
a particular type of object adapter that is defined by the CORBA standard specification.
A POA object adapter allows an object implementation to function with other different
ORBs.
Internet Inter-ORB Protocol (IIOP)
A protocol which will be mandatory for all CORBA 2.0 (1996) compliant
platforms. The initial phase of the CORBA 2.0 project is to build an infrastructure
consisting of:
1. an IIOP to HTTP gateway which allows CORBA clients to access WWW
resources;
2. an HTTP to IIOP gateway to let WWW clients access CORBA resources;
3. a web server which makes resources available by both IIOP and HTTP;
4. web browsers which can use IIOP as their native protocol (for navigation on the
internet)
2.6.3 CORBA Application Structure
1. The application objects themselves
2. Standard objects defined by the OMG, for a specific domain (e.g. Insurance,
Trading ...etc.)
3. Fundamental CORBA services such as directories and security management
4. Horizontal facilities (i.e. cutting across applications) such as user interface
facilities
(Figure: application objects, domain facilities and horizontal CORBA facilities all communicate through the object request broker, which in turn builds on the CORBA services.)
2.6.4 CORBA Standards
1. An object model for application objects. A CORBA object is an encapsulation
of state with a well-defined, language-neutral interface defined in an IDL
(interface definition language)
2. An object request broker that manages requests for object services
3. A set of general object services of use to many distributed applications
4. A set of common components built on top of these services
2.6.5 CORBA Objects
CORBA objects are comparable, in principle, to objects in C++ and Java:
• They MUST have a separate interface definition that is expressed using a common language (IDL) similar to C++.
• There is a mapping from this IDL to programming languages (C++, Java).
• Therefore, objects written in different languages can communicate with each other.
2.6.6 CORBA Services
1. Object life cycle: Defines how CORBA objects are created, removed, moved, and
copied
2. Naming: Defines how CORBA objects can have friendly symbolic names
3. Events: Decouples the communication between distributed objects
4. Relationships: Provides arbitrary typed n-ary relationships between CORBA objects
5. Externalization: Coordinates the transformation of CORBA objects to and from
external media.
6. Transactions: Coordinates atomic access to CORBA objects (complete success or
complete failure for a group of operations, i.e. partial success or failure is not
permissible)
7. Property: Supports the association of name-value pairs with CORBA objects
8. Trader: Supports the finding of CORBA objects based on properties describing the
service offered by the object
9. Query: Supports queries on objects
2.6.7 CORBA Products
1. The Java 2 ORB: it comes with Sun's Java 2 SDK (Software Development Kit). It
is missing several features.
2. VisiBroker for Java: A popular Java ORB from Inprise Corporation (new name of
Borland after 1999). VisiBroker is also embedded in other products. For example,
it is the ORB that is embedded in the Netscape Communicator browser.
3. OrbixWeb: A popular Java ORB from Iona Technologies.
4. WebSphere: A popular application server with an ORB from IBM.
5. Netscape Communicator: Netscape browsers have a version of VisiBroker
embedded in them. Applets (programs designed to be executed from within another
application; unlike ordinary applications, applets cannot be executed directly from the
operating system) can issue requests on CORBA objects without downloading ORB
classes into the browser: they are already there.
CORBA Example 5 (The Stock Application)
The stock trading application is a distributed application that illustrates the Java
programming language and CORBA. In this introductory module only a small simple
subset of the application is used.
The stock application allows multiple users to watch the activity of stocks. The
user is presented with a list of available stocks identified by their stock symbols. The
user can select a stock and then press the "view" button.
Object request broker (ORB)
The basic functionality provided by the object request broker consists of:
1. Passing the requests from clients to the object implementations on which they
are invoked. In order to make a request, the client can communicate with the
ORB core through the Interface Definition Language stub or through the
dynamic invocation interface (DII).
2. The stub represents the mapping between the language of implementation of the
client and the ORB core. Thus the client can be written in any language as long
as the implementation of the object request broker supports this mapping.
3. The ORB core then transfers the request to the object implementation which
receives the request as an up-call through:
 an Interface Definition Language (IDL) skeleton (which represents the object
interface at the server side and works with the client stub) or
 A dynamic skeleton interface (DSI) (a skeleton with multiple interfaces).
Detailed ORB Architecture
ORB-based object communications layout
(Figure: object o1 invokes object o2 through an IDL stub on the client side and an IDL skeleton on the server side, with the Object Request Broker in between.)
Inter-ORB communications
ORBs are not usually separate programs but are a set of objects in a library that
are linked with an application when it is developed. ORBs handle communications
between objects executing on the same machine. Several ORBs may be available, and
each computer in a distributed system will have its own ORB (obligatory). Inter-ORB
communications are used for distributed object calls.
Advantages of ORB
The ORB implements programming language independence for the request. The
client issuing the request can be written in a different programming language from the
implementation of the CORBA object. The ORB does the necessary translation
between programming languages. Language bindings are defined for all popular
programming languages (C, C++, Java, Ada, COBOL, Smalltalk, Objective C, and Lisp
programming languages).
2.6.8 CORBA Servant Class
public class HelloServant extends HelloPOA {
    /* HelloPOA is the Portable-Object-Adapter skeleton class generated from the
       IDL interface; the servant inherits from it. */
    private ORB orb;

    public HelloServant(ORB orb) {
        this.orb = orb;
    }

    public String sayHello() {
        return "Hello From CORBA Server...";
    }

    public void shutdown() {
        orb.shutdown(false);
    }
}
2.6.9 CORBA Server Class – 1
• Create and initialize an ORB instance:
ORB orb = ORB.init(args, null);
• Get a reference to the root POA and activate the POAManager:
POA rootpoa = POAHelper.narrow(orb.resolve_initial_references("RootPOA"));
rootpoa.the_POAManager().activate();
• Create a servant instance and obtain an object reference for it:
HelloServant helloImpl = new HelloServant(orb);
org.omg.CORBA.Object ref = rootpoa.servant_to_reference(helloImpl);
Hello href = HelloHelper.narrow(ref);
• Obtain the initial object reference to the naming service:
org.omg.CORBA.Object objRef = orb.resolve_initial_references("NameService"); // persistent NS
// orb.resolve_initial_references("TNameService"); // transient NS (Java IDL Transient Naming Service)
• Narrow the naming context:
NamingContextExt ncRef = NamingContextExtHelper.narrow(objRef);
• Register the new object in the naming context:
NameComponent path[] = ncRef.to_name("swe622");
ncRef.rebind(path, href);
• Wait for invocations of the object from a client:
orb.run();
2.6.10 CORBA Client
• Create an ORB object:
ORB orb = ORB.init(args, null);
• Obtain the initial object reference to the naming service:
org.omg.CORBA.Object objRef = orb.resolve_initial_references("NameService");
• Narrow the object reference to get the naming context:
NamingContextExt ncRef = NamingContextExtHelper.narrow(objRef);
• Resolve the object reference in the naming context:
helloImpl = HelloHelper.narrow(ncRef.resolve_str("swe622"));
• Invoke the method:
helloImpl.sayHello();
Revision Sheet # 2
PROBLEMS
1. If a client and a server are placed far apart, we may see network latency
dominating overall performance. How can we tackle this problem?
2. What is a three-tiered client-server architecture?
3. Explain a practical example of thin client architecture.
4. Explain a practical example of fat client architecture.
5. What is the difference between a vertical distribution and a horizontal
distribution?
6. To what extent are interceptors dependent on the middleware where they are
deployed?
Assignment # 5
Download "the stock application" file from the following link:
https://www.slideshare.net/SenthilKanth/stock-applicationusing-corba
Read it carefully. Prepare a pseudo-code algorithm for performing this CORBA-Java
application. Note: there is no need to dig into the Java code.
Chapter 3
Synchronization
In this chapter, we mainly concentrate on how processes can synchronize. For example,
it is important that multiple processes do not simultaneously access a shared resource,
such as a printer, but instead cooperate in granting each other temporary exclusive access.
Another example is that multiple processes may sometimes need to agree on the
ordering of events, such as whether message m1 from process P was sent before or after
message m2 from process Q. As it turns out, synchronization in distributed systems is
often much more difficult compared to synchronization in uniprocessor or
multiprocessor systems.
The problems and solutions that are discussed in this chapter are, by their nature, rather
general, and occur in many different situations in distributed systems. We start with a
discussion of the issue of synchronization based on actual time, followed by
synchronization in which only relative ordering matters rather than ordering in absolute
time. In many cases, it is important that a group of processes can appoint one process
as a coordinator, which can be done by means of election algorithms. We discuss
various election algorithms in a separate section.
3.1 Clock Synchronization
It is a problem that deals with the idea that internal clocks of several computers
may differ. Even when initially set accurately, real clocks will differ after some amount
of time due to clock drift, caused by clocks counting time at slightly different rates.
Such "clock synchronization" is used for synchronization in telecommunications and
automatic baud rate detection.
In a centralized system, time is unambiguous. When a process wants to know the
time, it makes a system call and the kernel tells it. If process A asks for the time, and
then a little later process B asks for the time, the value that B gets will be higher than
(or possibly equal to) the value A got. It will certainly not be lower. In a distributed
system, achieving agreement on time is not trivial.
Related Problems
Besides the incorrectness of the time itself, there are problems associated with
clock skew that take on more complexity in a distributed system in which several
computers will need to realize the same global time:
• For instance, in Unix systems the make command is used to compile new or modified code without the need to recompile unchanged code (it compiles modified code only).
• The make command uses the clock of the machine it runs on to determine which source files need to be recompiled (latest version based on clock information). If the sources reside on a separate file server (containing the compiler) and the two machines have unsynchronized clocks, the make program might not produce the correct results.
Clock Synchronization Consequences
The way make normally works is simple. When the programmer has finished
changing all the source files, he runs make, which examines the times at which all the
source and object files were last modified. If the source file input.c has time 2151 and
the corresponding object file input.o has time 2150, make knows that input.c has been
changed since input.o was created, and thus input.c must be recompiled.
On the other hand, if output.c has time 2144 and output.o has time 2145, no
compilation is needed.
Thus make goes through all the source files to find out which ones need to be
recompiled and calls the compiler to recompile them. Now imagine what could happen
in a distributed system in which there were no global agreement on time. Suppose that
output.o has time 2144 as above, and shortly thereafter output.c is modified but is
assigned time 2143 because the clock on its machine is slightly behind, as shown in the
following figure. Make will not call the compiler.
The resulting executable binary program will then contain a mixture of object files
from the old sources and the new sources. It will probably crash and the programmer
will go crazy trying to understand what is wrong with the code. There are many more
examples where an accurate account of time is needed.
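The decision make takes can be sketched in a few lines; the file names below are hypothetical. With unsynchronized clocks, a source file modified on a machine whose clock lags can carry an earlier timestamp than its object file, so the test below wrongly reports that no recompilation is needed.

import java.io.File;

public class MakeCheck {
    /** Recompile only if the source was modified after the object file was produced. */
    static boolean needsRecompile(File source, File object) {
        // lastModified() returns the file's timestamp in milliseconds since the epoch.
        // If the source and object files were written on machines with unsynchronized
        // clocks, a newer source can still report an OLDER timestamp and be skipped.
        return !object.exists() || source.lastModified() > object.lastModified();
    }

    public static void main(String[] args) {
        System.out.println(needsRecompile(new File("output.c"), new File("output.o")));
    }
}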
The example above can easily be reformulated to file timestamps in general. In
addition, think of application domains such as financial brokerage, security auditing,
and collaborative sensing, and it will become clear that accurate timing is important.
Since time is so basic to the way people think and the effect of not having all the clocks
synchronized can be so dramatic, it is fitting that we begin our study of synchronization
with the simple question: Is it possible to synchronize all the clocks in a distributed
system? The answer is surprisingly complicated.
When each machine has its own clock, an event 2 that occurred after another event
1 may be assigned an earlier time (the make program may not call the compiler, and the
resulting executable binary program may crash).
Traditional Solutions
In a centralized system the solution is trivial; the centralized server will dictate the
system time. There are some solutions to the clock synchronization problem in a
centralized server environment:
• Cristian's algorithm
• The Berkeley algorithm
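A rough sketch of Cristian's algorithm is shown below; how the time server is actually contacted is left abstract, so the TimeServer interface here is only an assumption for illustration. The client measures the round-trip time of its request and assumes the server's answer describes the server clock half-way through that interval.

public class CristianClient {
    /** Stand-in for a request to a remote time server; in practice this would be
        a network call (for example over UDP). */
    interface TimeServer {
        long currentTimeMillis();
    }

    static long estimateServerTime(TimeServer server) {
        long t0 = System.currentTimeMillis();          // request sent
        long serverTime = server.currentTimeMillis();  // server's clock reading
        long t1 = System.currentTimeMillis();          // reply received
        // Assume the reply refers to the server clock half-way through the round trip,
        // so add half the measured round-trip time to the reported value.
        return serverTime + (t1 - t0) / 2;
    }
}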
Internet Clock Synchronization Solution
In a distributed system the problem takes on more complexity because a global time
is not easily known. The most widely used clock synchronization solution on the Internet
is the Network Time Protocol (NTP), a layered client-server architecture based on
User Datagram Protocol (UDP) message passing, which belongs to the set of network
protocols used for the Internet.
With UDP, computer applications can send messages, in this case referred to as
datagrams, to other hosts on an Internet Protocol (IP) network without requiring prior
communications to set up special transmission channels or data paths (open
transmission).
3.1.1 Measuring the Time
From the astronomers' point of view, a solar second is 1/(24 × 60 × 60) = 1/86,400 of a solar day.
From the physicist's point of view, the most accurate timekeeping devices are
atomic clocks (1948), which are accurate to about one second in many millions of years
and are used to calibrate other clocks and timekeeping instruments.
Atomic clocks use the spin property of atoms as their basis, and since 1967, the
International System of Measurements bases its unit of time, the second, on the
properties of cesium atoms.
Atomic clocks
An atomic clock ensemble is maintained at the U.S. Naval Observatory, which may be
accessed by telephone (202-762-1401) or via Internet Network Time Protocol (NTP) servers.
Caesium is used in atomic clocks, which use the resonant vibration frequency of
caesium-133 atoms as a reference:
• Precise caesium clocks today measure frequency with an accuracy of 2 to 3 parts in 10^14, which corresponds to a time measurement accuracy of about 2 nanoseconds per day, or one second in 1.4 million years.
 The latest versions in the United States and France are accurate to 1.7 parts in 10¹⁵,
or 1 second in 17 million years, which has been regarded as "the most accurate
realization of a unit that mankind has yet achieved."
The picture below is the first atomic clock, constructed in 1949 by the US National
Bureau of Standards
The International System of Units (abbreviated SI from the French le Système
international d'unités) is the modern form of the metric system. SI defines the second
as the duration of 9,192,631,770 cycles of the radiation corresponding to the transition
between two electron spin energy levels of the ground state of the ¹³³Cs atom.
An SI prefix (also known as a metric prefix) is a name or associated symbol that
precedes a basic unit of measure (or its symbol) to form a decimal multiple or
submultiple. SI prefixes are used to reduce the number of zeros shown in numerical
quantities.
For example, one billionth of an ampere (a small electrical current) can be written
as 0.000 000 001 ampere. In symbol form, this is written as 1 nanoampere or 1 nA.
Today, the Global Positioning System in coordination with the Network Time
Protocol can be used to synchronize timekeeping systems across the globe. As of 2006,
the smallest unit of time that has been directly measured is on the attosecond (10⁻¹⁸ s)
time scale (the prefix atto- comes from the Danish word for eighteen, atten).
World Time
The basis for scientific time is a continuous count of seconds based on atomic clocks
around the world, known as the International Atomic Time (TAI). Coordinated
Universal Time (UTC) is the basis for modern civil time.
Since January 1, 1972, it has been defined to follow TAI with an exact offset of an
integer number of seconds, changing only when a leap second is added to keep clock
time synchronized with the rotation of the Earth. In TAI and UTC systems, the duration
of a second is constant, as it is defined by the unchanging transition period of the cesium
atom.
Greenwich Mean Time (GMT) is an older standard, adopted starting with British
railroads in 1847. Greenwich is the site of the original Royal Observatory, through which
passes the prime meridian, or longitude 0° (an imaginary great circle on the earth's
surface passing through the North and South geographic poles; all points on the same
meridian have the same longitude).
Using telescopes instead of atomic clocks, GMT was calibrated to the mean solar
time at the Royal Observatory, Greenwich in the UK. Universal Time (UT) is the
modern term for the international telescope-based system, adopted to replace
"Greenwich Mean Time" in 1928 by the International Astronomical Union.
Observations at the Greenwich Observatory itself stopped in 1954, though the
location is still used as the basis for the coordinate system
Prime meridian
3.1.2 GPS and UTC
As a step toward actual clock synchronization problems, we first consider a related
problem, namely determining one's geographical position anywhere on Earth. This
positioning problem is by itself solved through a highly specific, dedicated distributed
system, namely GPS, which is an acronym for Global Positioning System. GPS is a
satellite-based distributed system that was launched in 1978.
Although it has been used mainly for military applications, in recent years it has
found its way to many civilian applications, notably traffic navigation. However,
many more application domains exist. For example, GPS phones now let
callers track each other's position, a feature which may prove extremely handy
when you are lost or in trouble. This principle can easily be applied to tracking other
things as well, including pets, children, cars, boats, and so on. An excellent overview
of GPS is provided by Zogg (2002).
The Global Positioning System (GPS) is a USA military distributed system
consisting of a group of satellites, each circulating in an orbit at a height of
approximately 20,000 km; each satellite carries 4 atomic clocks and broadcasts:
 a very precise time signal worldwide,
 instructions for converting GPS time to UTC.
To determine the longitude, latitude, and altitude of a receiver on earth at a specific
time, we need only 4 of the GPS satellites.
Other Similar Systems
GLONASS, acronym for "Global Navigation Satellite System", is a space-based
satellite navigation system operated by the Russian Aerospace Defense Forces. By
2010, GLONASS had achieved 100% coverage of Russia's territory and in October
2011, the full orbital constellation of 24 satellites was restored, enabling full global
coverage.
There is also the planned European Union Galileo positioning system, India's Indian
Regional Navigation Satellite System, and the Chinese Beidou Navigation Satellite
System.
Dynamics of GPS
A visual example of the GPS constellation in motion with the Earth rotating. Notice
how the number of satellites in view from a given point on the Earth's surface, in this
example at 45°N, changes with time.
GPS Layout
The Global Positioning System (GPS) is a U.S.-owned utility that provides users
with positioning, navigation, and timing (PNT) services. This system consists of three
segments:
 the space segment,
 the control segment, and
 the user segment.
The U.S. Air Force develops, maintains, and operates the space and control segments.
Space Segment
The GPS space segment consists of a constellation of satellites transmitting radio
signals to users. The United States is committed to maintaining the availability of at
least 24 operational GPS satellites, 95% of the time. To ensure this commitment, the
Air Force has been flying 31 operational GPS satellites for the past few years.
Control Segment
The GPS control segment consists of a global network of ground facilities that track
the GPS satellites, monitor their transmissions, perform analyses, and send commands
and data to the constellation.
User segment
The user segment consists of the GPS receiver equipment, which receives the
signals from the GPS satellites and uses the transmitted information to calculate the
user’s three-dimensional position and time.
GPS Math Model Basics
The unknown position of the ship can be expressed as a point (x, y, z), which
can later be translated into a latitude and longitude on a map. Let us mark off the three
axes in units equal to the radius of the earth. Thus, a point at sea level will have
x² + y² + z² = 1 in this system. Also, we will measure time in units of milliseconds.
The GPS system finds distances by knowing how long it takes a radio signal to
get from one point to another. For this we need to know the speed of light,
approximately equal to .047 (in units of earth radii per millisecond).
Let (x, y, z) be the ship's position as a receiver and t the time when the signals arrive.
Our goal is to determine the values of these four unknowns. Using
the data from the first satellite, we can compute the distance from the ship as follows:
The signal was sent from the satellite at time 19.9 and arrived at the ship at time t.
Traveling at a speed of .047, that makes the distance:
d = .047(t − 19.9)
This same distance can be expressed in terms of (x, y, z) and the satellite’s position
(1, 2, 0):
d = √((x − 1)² + (y − 2)² + (z − 0)²)
Combining these results leads to the equation
(x − 1)² + (y − 2)² + z² = .047²(t − 19.9)²
Similar equations can be derived for the three other satellites. Writing all four
equations together gives:
2x + 4y + 0z − 2(.047²)(19.9)t = 1² + 2² + 0² − .047²(19.9)² + x² + y² + z² − .047²t²
4x + 0y + 4z − 2(.047²)(2.4)t = 2² + 0² + 2² − .047²(2.4)² + x² + y² + z² − .047²t²
2x + 2y + 2z − 2(.047²)(32.6)t = 1² + 1² + 1² − .047²(32.6)² + x² + y² + z² − .047²t²
4x + 2y + 0z − 2(.047²)(19.9)t = 2² + 1² + 0² − .047²(19.9)² + x² + y² + z² − .047²t²
Solving these 4 equations in 4 unknowns, we get t = 43.1 and t = 49.91.
If we select the first solution, t = 43.1, then (x, y, z) = (1.317, 1.317, 0.790), which has
a length of about 2. We are using units of earth radii, so this point is around 4000 miles
above the surface of the earth (an airplane rather than a ship, i.e., a rejected answer).
The second value, t = 49.91, leads to (x, y, z) = (0.667, 0.667, 0.332), with length
0.9997. That places the point on the surface of the earth (to four decimal places) and
gives us the reasonable location of the ship.
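As a rough numerical check of this example, the four quadratic equations can be solved with a few lines of Python and NumPy. The sketch below is only an illustration: the satellite positions and broadcast times are read off the equations above, and Newton's method is seeded near the Earth's surface so that it converges to the ship solution rather than the airborne one.

import numpy as np

# Satellite positions (in earth radii) and broadcast times (ms), taken from the
# four equations above; C is the speed of light in earth radii per millisecond.
SATS = np.array([[1.0, 2.0, 0.0],
                 [2.0, 0.0, 2.0],
                 [1.0, 1.0, 1.0],
                 [2.0, 1.0, 0.0]])
T_SENT = np.array([19.9, 2.4, 32.6, 19.9])
C = 0.047

def residuals(v):
    x, y, z, t = v
    p = np.array([x, y, z])
    # f_i = |p - s_i|^2 - c^2 (t - t_i)^2, which must vanish at the true position
    return np.array([np.sum((p - s) ** 2) - (C * (t - ti)) ** 2
                     for s, ti in zip(SATS, T_SENT)])

def jacobian(v):
    x, y, z, t = v
    p = np.array([x, y, z])
    J = np.zeros((4, 4))
    for i, (s, ti) in enumerate(zip(SATS, T_SENT)):
        J[i, :3] = 2 * (p - s)
        J[i, 3] = -2 * C ** 2 * (t - ti)
    return J

# Newton's method, seeded near the Earth's surface so that we converge to the
# physically meaningful root (the ship, not the airborne solution).
v = np.array([0.7, 0.7, 0.3, 50.0])
for _ in range(20):
    v = v - np.linalg.solve(jacobian(v), residuals(v))

x, y, z, t = v
print(f"position ≈ ({x:.3f}, {y:.3f}, {z:.3f}), |p| ≈ {np.linalg.norm(v[:3]):.4f}, t ≈ {t:.2f}")

Running this prints a point of length approximately 1 earth radius and t ≈ 49.9, matching the hand calculation above.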
So far, we have assumed that measurements are perfectly accurate. Of course, they are
not. For one thing, GPS does not take leap seconds into account. In other words, there
is a systematic deviation from UTC, which by January 1, 2006 was 14 seconds. Such an
error can be easily compensated for in software. However, there are many other sources
of errors, starting with the fact that the atomic clocks in the satellites are not always in
perfect synch, the position of a satellite is not known precisely, the receiver's clock has
a finite accuracy, the signal propagation speed is not constant (as signals slow down
when entering, e.g., the ionosphere), and so on. Moreover, we all know that the earth is
not a perfect sphere, leading to further corrections.
By and large, computing an accurate position is far from a trivial undertaking and
requires going down into many gory details. Nevertheless, even with relatively cheap
GPS receivers, positioning can be precise within a range of 1-5 meters. Moreover,
professional receivers (which can easily be hooked up in a computer network) have a
claimed error of less than 20-35 nanoseconds. Again, we refer to the excellent overview
by Zogg (2002) as a first step toward getting acquainted with the details.
UTC Availability
To provide Coordinated Universal Time (UTC) to people who need precise time,
the National Institute of Standards and Technology (NIST) operates a shortwave radio station with
call letters WWV from Fort Collins, Colorado.
WWV broadcasts a short pulse at the start of each UTC second. The accuracy of
WWV is:
± 1 msec
Several earth satellites also offer a UTC service with accuracy ± 0.5 msec
NIST radio station WWV broadcasts time and frequency information 24 hours
per day, 7 days per week to millions of listeners worldwide. WWV is located in Fort
Collins, Colorado, about 100 kilometers north of Denver. The broadcast information
includes time announcements, standard time intervals, standard frequencies, UT1 time
corrections, a BCD time code, geophysical alerts, marine storm warnings, and Global
Positioning System (GPS) status reports.
3.2 Logical vs. Physical Clocks
A logical clock keeps track of event ordering:
 among related events; the agreed time does not need to match real time (e.g., in the
make example, all machines must agree that the time is 10:00 even if it is actually 10:02),
 Lamport timestamps and vector clocks are concepts of logical clocks in
distributed systems.
Physical clocks keep the time of day, are consistent across systems, and must not
deviate from real time by more than a certain amount.
Instead, a distributed system really has an approximation of the Physical Time
across all its machines. Logical Clocks refer to implementing a protocol on all
machines within your distributed system, so that the machines are able to maintain
consistent ordering of events within some virtual timespan.
3.2.1 Logical Clock
So far, we have assumed that clock synchronization is naturally related to real time.
However, we have also seen that it may be sufficient that every node agrees on a
current time, without that time necessarily being the same as the real time. We can go
one step further. For running make, for example, it is adequate that two nodes agree
that input.o is outdated by a new version of input.c. In this case, keeping track of each
other's events (such as producing a new version of input.c) is what matters.
For these algorithms, it is conventional to speak of the clocks as logical clocks. In a
classic paper, Lamport (1978) showed that although clock synchronization is possible,
it need not be absolute.
If two processes do not interact, it is not necessary that their clocks be synchronized
because the lack of synchronization would not be observable and thus could not cause
problems. Furthermore, he pointed out that what usually matters is not that all
processes agree on exactly what time it is, but rather that they agree on the order in
which events occur. In the make example, what counts is whether input.c is older or
newer than input.o, not their absolute creation times.
As a conclusion, logical clocks are clocks used in asynchronous distributed
systems for ordering events since there are no global physical clocks available. Logical
clock algorithms can be categorized into:
 Lamport timestamps, which are monotonically increasing
 Vector clocks, which capture causality and allow for a partial (causal) ordering of events in a distributed system.
 Matrix clocks, an extension of vector clocks that also contains information about
other processes' views of the system.
Logical Clock in Distributed System
A logical clock is a mechanism for capturing chronological and causal relationships in
a distributed system. Distributed systems may have no physically synchronous global
clock, so a logical clock allows global ordering on events from different processes in
such systems.
Example
When we go out, we make a full plan of which place to visit first, which second, and so
on. We do not go to the second place first and then to the first place; we always follow
the procedure or organization that was planned beforehand. In a similar way, we should
do the operations on our PCs one by one in an organized way.
Suppose we have more than 10 PCs in a distributed system and every PC is doing its
own work; how do we then make them work together? The solution to this is the
LOGICAL CLOCK.
Method-1
One approach to ordering events across processes is to try to synchronize all clocks.
This means that if one PC has the time 2:00 pm then every PC should have the same
time, which is hardly possible: not every clock can be kept in sync at all times. So we
cannot follow this method.
Method-2
Another approach is to assign Timestamps to events. Taking the example into
consideration, this means if we assign the first place as 1, second place as 2, and third
place as 3 and so on. Then we always know that the first place will always come first
and then so on.
Similarly, if we give each PC its own number, then the work can be organized so that
the 1st PC completes its process first, then the second, and so on.
BUT, Timestamps will only work as long as they obey causality.
What is causality?
Causality is fully based on the HAPPENED-BEFORE RELATIONSHIP.
• Taking a single PC only: if 2 events A and B occur one after the other, then
TS(A) < TS(B). If A has a timestamp of 1, then B should have a timestamp greater
than 1; only then does the happened-before relationship hold.
• Taking 2 PCs, with event A in P1 (PC 1) and event B in P2 (PC 2), the condition is
again TS(A) < TS(B). For example, suppose you send a message to someone at
2:00:00 pm and the other person receives it at 2:00:02 pm. Then it is obvious that
TS(sender) < TS(receiver).
Properties Derived from the Happened-Before Relationship:
• Transitive Relation – if TS(A) < TS(B) and TS(B) < TS(C), then TS(A) < TS(C).
• Causally Ordered Relation – a → b means that a occurs before b, and any change
in a will surely be reflected in b.
• Concurrent Events – not every pair of events is ordered one after the other; some
events happen simultaneously, i.e., A || B.
Lamport Timestamps Algorithms
Lamport's logical clocks lead to a situation where all events in a distributed
system are totally ordered with the property that if event a happened before event b,
then a will also be positioned in that ordering before b, that is, C (a) < C (b). However,
with Lamport clocks, nothing can be said about the relationship between two events a
and b by merely comparing their time values C(a) and C(b), respectively.
In other words, if C(a) < C(b), then this does not necessarily imply that a indeed
happened before b. Something more is needed for that.
To explain, consider the messages sent by three processes. Denote by
Tsnd(mi) the logical time at which message mi was sent, and likewise by Trcv(mi) the
time of its receipt. By construction, we know that for each message Tsnd(mi) <
Trcv(mi). But what can we conclude in general from Trcv(mi) < Tsnd(mj)?
A Lamport clock may be used to create a partial causal ordering of events
between processes. Given a logical clock following these rules, the following relation
is true:
 if a → b then C(a) < C(b), where → means happened-before.
This relation only goes one way, and is called the clock consistency condition (CCC):
 if one event comes before another, then that event's logical clock comes before the
other's (weak CCC).
The strong clock consistency condition, which goes both ways,
 a → b if and only if C(a) < C(b), can be obtained by other techniques such as vector clocks.
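A minimal sketch of these rules in Python is given below (the class and method names are our own, not a standard API). Each process keeps one counter, increments it on every local and send event, and on receipt jumps ahead of the timestamp carried by the message; this is what guarantees that C(a) < C(b) whenever a happened before b.

class LamportClock:
    """Minimal sketch of a Lamport logical clock for one process (illustrative only)."""
    def __init__(self):
        self.time = 0

    def internal_event(self):
        # Rule 1: increment before every local event
        self.time += 1
        return self.time

    def send_event(self):
        # Sending is an event: increment, then attach the timestamp to the message
        self.time += 1
        return self.time

    def receive_event(self, msg_time):
        # Rule 2: on receipt, jump ahead of the message timestamp, then count the receive event
        self.time = max(self.time, msg_time) + 1
        return self.time

# Two processes exchanging one message: the receive is guaranteed to be
# timestamped after the send, which is exactly the weak clock condition.
p1, p2 = LamportClock(), LamportClock()
ts_send = p1.send_event()          # p1 sends message m with timestamp 1
p2.internal_event()                # unrelated local work on p2
ts_recv = p2.receive_event(ts_send)
print(ts_send, ts_recv)            # 1 3  ->  C(send) < C(receive)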
Vector Clocks Algorithm
Initially all clocks are zero. Each time a process experiences an internal event, it
increments its own logical clock in the vector by one. Each time a process prepares to
send a message, it increments its own logical clock in the vector by one and then sends
its entire vector along with the message being sent (increment then send).
Each time a process receives a message, it increments its own logical clock in the vector
by one and updates each element in its vector by taking the maximum of:
 the value in its own vector clock and
 the value in the vector in the received message
The problem is that Lamport clocks do not capture causality. Causality can be
captured by means of vector clocks.
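The following Python sketch (again with made-up names) follows the increment-then-maximum rules just described and adds a small helper that tests the happened-before relation by element-wise comparison; two events whose vectors are incomparable are concurrent.

class VectorClock:
    """Sketch of a vector clock for process `pid` out of `n` processes (illustrative only)."""
    def __init__(self, pid, n):
        self.pid = pid
        self.clock = [0] * n

    def internal_event(self):
        self.clock[self.pid] += 1

    def send(self):
        # Increment own entry, then ship a copy of the whole vector with the message
        self.clock[self.pid] += 1
        return list(self.clock)

    def receive(self, msg_clock):
        # Increment own entry, then take the element-wise maximum with the received vector
        self.clock[self.pid] += 1
        self.clock = [max(a, b) for a, b in zip(self.clock, msg_clock)]

def happened_before(a, b):
    """a -> b iff a <= b element-wise and a != b; neither direction means concurrency."""
    return all(x <= y for x, y in zip(a, b)) and a != b

p0, p1 = VectorClock(0, 2), VectorClock(1, 2)
m = p0.send()            # p0: [1, 0]
p1.internal_event()      # p1: [0, 1]  -- concurrent with the send
p1.receive(m)            # p1: [1, 2]
print(happened_before(m, p1.clock))                             # True: send -> receive
print(happened_before([0, 1], m), happened_before(m, [0, 1]))   # False False: concurrent events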
Vector Clocks Algorithm Example
3.2.2 Physical Clock
Nearly all computers have a circuit for keeping track of time. Despite the
widespread use of the word "clock" to refer to these devices, they are not actually clocks
in the usual sense. Timer is perhaps a better word. A computer timer is usually a
precisely machined quartz crystal. When kept under tension, quartz crystals oscillate at
a well-defined frequency that depends on the kind of crystal, how it is cut, and the
amount of tension.
Associated with each crystal are two registers, a counter and a holding register.
Each oscillation of the crystal decrements the counter by one. When the counter gets to
zero, an interrupt is generated and the counter is reloaded from the holding register. In
this way, it is possible to program a timer to generate an interrupt 60 times a second, or
at any other desired frequency.
Each interrupt is called one clock tick. When the system is booted, it usually
asks the user to enter the date and time, which is then converted to the number of ticks
after some known starting date and stored in memory. Most computers have a special
battery-backed up CMOS RAM so that the date and time need not be entered on
subsequent boots. At every clock tick, the interrupt service procedure adds one to the
time stored in memory. In this way, the (software) clock is kept up to date. With a single
computer and a single clock, it does not matter much if this clock is off by a small
amount.
Since all processes on the machine use the same clock, they will still be
internally consistent. For example, if the file input.c has time 2151 and file input.o has
time 2150, make will recompile the source file, even if the clock is off by 2 and the true
times are 2153 and 2152, respectively. All that really matters are the relative times. As
soon as multiple CPUs are introduced, each with its own clock, the situation changes
radically. Although the frequency at which a crystal oscillator runs is usually fairly
stable, it is impossible to guarantee that the crystals in different computers all run at
exactly the same frequency.
In practice, when a system has n computers, all n crystals will run at slightly
different rates, causing the (software) clocks gradually to get out of synch and give
different values when read out. This difference in time values is called clock skew. As
a consequence of this clock skew, programs that expect the time associated with a file,
object, process, or message to be correct and independent of the machine on which it
was generated (i.e., which clock it used) can fail, as we saw in the make example above.
In some systems (e.g., real-time systems), the actual clock time is important. Under
these circumstances, external physical clocks are needed. For reasons of efficiency and
redundancy, multiple physical clocks are generally considered desirable, which yields
two problems:
(1) How do we synchronize them with real world clocks?
(2) How do we synchronize the clocks with each other?
As a conclusion, all computers have a circuit for keeping track of time. This
circuit is referred to as the "clock". The computer timer circuit is a precisely machined
quartz crystal, which oscillates at a well-defined frequency. Associated with each crystal are two registers:
 Counter register
 Holding register
Each oscillation of the crystal will decrement the counter by one. When the
counter gets to zero, an interrupt is generated, and the counter is loaded from the holding
register.
In this way, it is possible to program a timer to generate an interrupt 60 times a
second, or at any desired frequency. Each interrupt is called one clock tick (a computer
pumping heart!)
In distributed systems, having n CPUs with n clocks running at slightly different
rates, the software clocks gradually get out of synch and give different values when read
out. This difference in time values is called clock skew (deviation).
In real time systems, the actual clock time is important; so external physical
clocks are needed. There are two important questions:
 How do we synchronize these multiple physical clocks with real world clocks?
 How do we synchronize the clocks with each other?
3.3 Clock Synchronization Algorithms
If one machine has a WWV receiver and hence knows UTC, the goal becomes keeping all
the other machines synchronized to it (Cristian's algorithm in 3.3.1).
If no machine has a WWV receiver, each machine keeps track of its own time and
the goal is to keep all the machines together as well as possible (the Berkeley algorithm
in 3.3.2).
These algorithms try to accurately resynchronize the clocks if they are drifting
from the UTC in opposite directions.
3.3.1 The Cristian's Algorithm (1989)
A common approach in many protocols and originally proposed by Cristian
(1989) is to let clients contact a time server. The latter can accurately provide the current
time, for example, because it is equipped with a WWV receiver or an accurate clock.
The problem, of course, is that when contacting the server, message delays will have
outdated the reported time. The trick is to find a good estimation for these delays.
Cristian's algorithm relies on the existence of a time server. The time server
maintains its clock by using a radio clock or other accurate time source, then all other
computers in the system stay synchronized with it.
A time client will maintain its clock by making a procedure call to the time
server. Variations of this algorithm make more precise time calculations by factoring
in network propagation time.
So, it is a method for clock synchronization which can be used in many fields of
distributed computer science but is primarily used in low-latency intranets. Cristian
observed that this simple algorithm is probabilistic, in that it only achieves
synchronization if the round-trip time (RTT) of the request is short compared to the
required accuracy.
It also suffers in implementations using a single server, making it unsuitable for
many distributed applications where redundancy may be critical. It works between a
process P, and a time server S — connected to a source of UTC (Coordinated Universal
Time):
1. P requests the time from S.
2. After receiving the request from P, S prepares a response and appends the time T
from its own clock at the last possible moment before dispatch.
3. P records the round-trip time (RTT) of the request it made to S.
4. P then sets its time to be T + RTT/2.
This method assumes that the RTT is split equally between both request and
response, which may not always be the case but is a reasonable assumption on a LAN
connection. It is important to note that the time is attached at the last possible moment
before being returned to P. This is to eliminate inaccuracies caused by network delay.
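A toy simulation of one round of the algorithm is sketched below in Python. The server offset and the randomly chosen network delays are invented values, used only to show how the RTT/2 compensation is applied.

import time, random

def server_time():
    """Stand-in for the time server S; assume it returns accurate UTC seconds."""
    return time.time() + 4.2          # hypothetical offset: server is 4.2 s ahead of us

def cristian_sync():
    t0 = time.monotonic()             # request sent
    # network: request travels to S, S reads its clock, reply travels back
    time.sleep(random.uniform(0.01, 0.05))   # simulated one-way delay
    T = server_time()                        # S attaches T at the last possible moment
    time.sleep(random.uniform(0.01, 0.05))
    t1 = time.monotonic()             # reply received
    rtt = t1 - t0
    # Assume the delay is symmetric: the server's reading T happened about RTT/2 ago
    return T + rtt / 2.0

print("estimated server time:", cristian_sync())
print("our local clock:      ", time.time())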
Enhancement of the Cristian's Algorithm
Further accuracy can be gained by making multiple requests to S and using the
response with the shortest RTT. If min, the minimum one-way message transit time, is
known, the accuracy of the result can be estimated as follows.
The earliest point at which S could have placed the time T was min after P sent its
request. Therefore, the time at S when the reply is received by P is in the range:
(T + min) to (T + RTT − min)
The width of this range is (RTT − 2·min).
This gives an accuracy of ±(RTT/2 − min).
Cristian's algorithm modelling
Compensate for delays:
 Note the times:
o request sent: T0
o reply received: T1
 Assume network delays are symmetric.
Client sets its time to:
Tnew = Tserver + (T1 − T0)/2
Error bounds: if the minimum message transit time Tmin is known, the server's reading
lies in the range [Tserver + Tmin, Tserver + (T1 − T0) − Tmin], so the error is at most
±((T1 − T0)/2 − Tmin).
3.3.2 The Berkeley Algorithm (1989)
In many algorithms such as NTP, the time server is passive. Other machines
periodically ask it for the time. All it does is respond to their queries. In Berkeley UNIX,
exactly the opposite approach is taken (Gusella and Zatti, 1989). Here the time server
(actually, a time daemon) is active, polling every machine from time to time to ask what
time it is there. Based on the answers, it computes an average time and tells all the other
machines to advance their clocks to the new time or slow their clocks down until some
specified reduction has been achieved. This method is suitable for a system in which
no machine has a WWV receiver. The time daemon's time must be set manually by the
operator periodically.
It is a method of clock synchronization in distributed computing which assumes
no machine has an accurate time source. It was developed by Gusella and Zatti at the
University of California, Berkeley in 1989 and, like Cristian's algorithm, is intended for
use within intranets.
This algorithm is more suitable for systems where a radio clock is not present;
such a system has no way of establishing the actual time other than by maintaining a
global average time as the global time, as follows:
1. A time server periodically fetches the time from all the time clients,
2. averages the results, and
3. reports back to the clients the adjustment that needs to be made to their local
clocks to achieve the average.
Enhancement of the Berkeley Algorithm
This algorithm highlights the fact that internal clocks may vary not only in the time
they contain but also in the clock rate. Often, any client whose clock differs by a value
outside of a given tolerance is disregarded when averaging the results. This prevents
the overall system time from being drastically skewed due to one wrong clock.
Berkeley Algorithm Details
Unlike Cristian's algorithm, the server process in the Berkeley algorithm, called the
master, periodically polls the other slave processes:
a. A master is chosen via an election process (section 3.5)
b. The master polls the slaves who reply with their time in a similar way to Cristian's
algorithm.
c. The master observes the round-trip time (RTT) of the messages and estimates the
time of each slave and its own.
d. The master then averages the clock times, ignoring any values it receives far outside
the values of the others.
e. Instead of sending the updated current time back to the other process, the master
then sends out the amount (positive or negative) that each slave must adjust its clock.
This avoids further uncertainty due to RTT at the slave processes.
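The sketch below shows one polling round in Python (the function name, the tolerance value, and the sample readings are invented, and RTT compensation is omitted for brevity). As in step e above, the function returns per-machine adjustments rather than an absolute time.

def berkeley_round(master_time, slave_times, tolerance=10.0):
    """One polling round of the Berkeley algorithm (sketch; RTT compensation omitted).

    master_time -- the daemon's own clock reading
    slave_times -- clock readings reported by the polled machines
    tolerance   -- readings further than this from the master are ignored
    """
    readings = [master_time] + list(slave_times)
    # Discard obviously faulty clocks before averaging
    usable = [t for t in readings if abs(t - master_time) <= tolerance]
    average = sum(usable) / len(usable)
    # Send each machine the adjustment it must apply, not the absolute time
    adjustments = {"master": average - master_time}
    for i, t in enumerate(slave_times):
        adjustments[f"slave{i}"] = average - t
    return adjustments

# Master reads 180.0, slaves report 185.0 and 170.0; the average is about 178.33,
# so the printed adjustments are roughly -1.67, -6.67 and +8.33.
print(berkeley_round(180.0, [185.0, 170.0], tolerance=20.0))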
3.4 Parallel and Distributed Processing Problems
Distributed and parallel processing seek the highest level of computer
performance. Achieving such a goal faces many practical problems. Some of these
problems are the distributed mutual exclusion problem, the distributed termination
problem, and the Byzantine generals problem.
3.4.1 Mutual Exclusion Algorithm Outline
Fundamental to distributed systems is the concurrency and collaboration among
multiple processes. In many cases, this also means that processes will need to
simultaneously access the same resources. To prevent that such concurrent accesses
corrupt the resource, or make it inconsistent, solutions are needed to grant mutual
exclusive access by processes. In this section, we take a look at some of the more
important distributed algorithms that have been proposed. A recent survey of
distributed algorithms for mutual exclusion is provided by Saxena and Rai (2003).
Older, but still relevant is Velazquez (1993).
Distributed mutual exclusion algorithms can be classified into two different
categories. In token-based solutions mutual exclusion is achieved by passing a special
message between the processes, known as a token. There is only one token available
and whoever has that token is allowed to access the shared resource. When finished,
the token is passed on to a next process. If a process having the token is not interested
in accessing the resource, it simply passes it on. Token-based solutions have a few
important properties. First, depending on how the processes are organized, they can
fairly easily ensure that every process will get a chance at accessing the resource.
In other words, they avoid starvation. Second, deadlocks by which several
processes are waiting for each other to proceed, can easily be avoided, contributing to
their simplicity. Unfortunately, the main drawback of token-based solutions is a rather
serious one: when the token is lost (e.g., because the process holding it crashed), an
intricate distributed procedure needs to be started to ensure that a new token is created,
but above all, that it is also the only token. As an alternative, many distributed mutual
exclusion algorithms follow a permission-based approach. In this case, a process
wanting to access the resource first requires the permission of other processes. There
are many different ways of granting such permission, and in the sections that
follow we will consider a few of them.
The algorithm is based on sequence numbers, like the bakery algorithm. However,
since the nodes cannot directly read the internal variables of the other nodes, the comparison of sequence
numbers and the decision to enter the critical section must be made by sending and
receiving messages.
The basic idea is the same: a node chooses a number, broadcasts its choice to the
other nodes (a Request message) and then waits until it has received confirmation (a
Reply) from each other node that the number chosen is now the lowest outstanding
sequence number (i.e. highest priority).
Ties on the chosen sequence number are resolved arbitrarily in favor of the node
with the lowest identification number.
task body Main_Process_Type is
begin
   loop
      Non_Critical_Section;
      -- Pre-protocol
      Choose_Sequence_Number;
      Send_Request_to_Nodes;
      Wait_for_Reply;
      Critical_Section;
      -- Post-protocol
      Reply_to_Deferred_Nodes;
   end loop;
end Main_Process_Type;
3.4.2 Distributed Termination Problem
It is used for detecting the termination of a distributed computation. The Dijkstra–Scholten
algorithm is a tree-based algorithm which can be described by the following:
 The initiator of a computation is the root of the tree.
 Upon receiving a computational message:
o If the receiving process is currently not in the computation: the process joins
the tree by becoming a child of the sender of the message. (No
acknowledgement message is sent at this point.)
o If the receiving process is already in the computation: the process immediately
sends an acknowledgement message to the sender of the message.
When a process has no more children and has become idle, the process separates
itself from the tree by sending an acknowledgement message to its tree parent (I am out
or Pass). Termination occurs when the initiator has no children and has become idle
because it has no parent.
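The following Python sketch models only the tree bookkeeping of the algorithm in a single thread (class and method names are our own; a real implementation would exchange messages). Each node remembers its tree parent and how many children have not yet acknowledged; a node detaches only when it is idle and childless, and termination is detected when the initiator itself becomes idle and childless.

class Node:
    """One process in the Dijkstra-Scholten termination-detection tree (sketch)."""
    def __init__(self, name, is_root=False):
        self.name, self.is_root = name, is_root
        self.parent = None          # tree parent, set by the first computational message
        self.children = 0           # children that have not yet sent their acknowledgement
        self.idle = not is_root     # the root (initiator) starts out busy

    def send_work(self, other):
        """Send a computational message from self to other."""
        if other.parent is None and not other.is_root:
            other.parent = self      # other joins the tree as our child; no ack yet
            self.children += 1
        # otherwise `other` would acknowledge immediately; nothing changes in the tree
        other.idle = False

    def done(self):
        """Local work finished; detach from the tree if no children are left."""
        self.idle = True
        self._maybe_detach()

    def _maybe_detach(self):
        if self.idle and self.children == 0:
            if self.is_root:
                print("termination detected at the initiator")
            elif self.parent is not None:
                p, self.parent = self.parent, None
                p.children -= 1          # the acknowledgement ("I am out") to the parent
                p._maybe_detach()

# Initiator A sends work to B, B to C; they finish bottom-up and A detects termination.
a, b, c = Node("A", is_root=True), Node("B"), Node("C")
a.send_work(b); b.send_work(c)
c.done(); b.done(); a.done()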
3.4.3 The Byzantine Generals Problem (BGP)
Byzantine refers to the Byzantine Generals' Problem, an agreement problem
(first proposed by Marshall Pease, Robert Shostak, and Leslie Lamport in 1980) in
which generals of the Byzantine Empire's army must decide unanimously whether to
attack some enemy army. The Byzantine Army was chosen as an example for the
problem as the Byzantine state experienced frequent duplicity among the high levels of
its administration.
The problem is complicated by:
 The geographic separation of the generals, who must communicate by sending
messengers to each other,
 The presence of traitors amongst the generals.
These traitors can act arbitrarily in order to achieve the following aims:
 Trick some generals into attacking; force a decision that is not consistent with the
generals' desires, e.g. forcing an attack when no general wished to attack; or
 Confuse some generals to the point that they are unable to make up their minds.
If the traitors succeed in any of these goals, any resulting attack is hopeless, as
only an intensive effort can result in victory. Byzantine fault tolerance can be achieved,
if the loyal (nonfaulty) generals have a unanimous agreement on their strategy. Note
that if the source general is correct, all loyal generals must agree upon that value.
Otherwise, the choice of strategy agreed upon is inappropriate.
The Classic Problem:
 Each division of the Byzantine army is directed by its own general.
o Generals, some of whom are traitors, communicate with each other by messengers.
 Requirements:
o All loyal generals decide upon the same plan of action.
o A small number of traitors cannot cause the loyal generals to adopt a bad plan
(i.e., their effect is minimized).
The problem can be restated as:
 All loyal generals receive the same information upon which they will somehow
get to the same decision (unification of received information).
 The information sent by a loyal general should be used by all the other loyal
generals (unification of decision).
Reliability by Majority Voting
One way to achieve reliability is to have multiple replicas of the system (or
component) and take a majority vote among them. In order for the majority voting
to yield a reliable system, the following two conditions should be satisfied:
1. All non-faulty components must use the same input value
2. If the input unit is non-faulty, then all non-faulty components use the value it
provides as input (confidential source )
Impossibility Results
No solution exists if 2/3 or fewer of the generals are loyal (i.e., more than 2/3 loyal,
non-faulty components are needed to assure a solution).
Practical Use Case of BGP
 Distributed file systems: many small, latency-sensitive requests (tampering with
files, lost updates)
 Overlay multicast: transfers large volumes of data (tampering with content,
freeloading)
 P2P email: complex, large, decentralized mail (denial of service by misrouting)
Not only agreement but also identifying faulty nodes is important!
3.5 Election Algorithms
Many distributed algorithms require one process to act as coordinator, initiator,
or otherwise perform some special role. In general, it does not matter which process
takes on this special responsibility, but one of them has to do it. In this section we will
look at algorithms for electing a coordinator (using this as a generic name for the special
process).
If all processes are exactly the same, with no distinguishing characteristics, there
is no way to select one of them to be special. Consequently, we will assume that each
process has a unique number, for example, its network address (for simplicity, we will
assume one process per machine). In general, election algorithms attempt to locate the
process with the highest process number and designate it as coordinator. The algorithms
differ in the way they do the location.
Furthermore, we also assume that every process knows the process number of every
other process. What the processes do not know is which ones are currently up and
which ones are currently down. The goal of an election
algorithm is to ensure that when an election starts, it concludes with all processes
agreeing on who the new coordinator is to be. There are many algorithms and variations,
of which several important ones are discussed in the textbooks by Lynch (1996) and
Tel (2000), respectively.
We often need one process to act as a coordinator. It may not matter which
process does this, but there should be a group agreement on only one. An assumption
in election algorithms is that, apart from their identifiers, all processes are exactly the
same, with no distinguishing characteristics. Each process can obtain a unique identifier
(for example, a machine address and process ID), and each process knows of every other
process but does not know which is up and which is down.
3.5.1 Leader Election Problem
A leader among n processors is the processor that is recognized by all other
processors as distinguished to perform a special task. The leader election problem
occurs when the processors of a distributed system must choose one of them as a leader.
Each processor should eventually decide whether or not it is a leader, given that each
processor is only aware of its identification and not aware of any other processes.
The problem of electing a leader in a distributed environment is most important
in situations in which coordination among processors becomes necessary to recover
from a failure or topological change. A leader in such situations is needed, for example,
to coordinate the reestablishment of allocation and routing functions.
Consider, for example, a token-ring network, in which a token moves around the
network, giving its current owner the right to initiate communication. If the token is
lost, a leader is needed in this case to coordinate the regeneration of the lost token.
3.5.2 Leader Election in Synchronous Rings
Now, let us explore the basic idea behind the different leader election algorithms:
 Suppose that the communication graph is an arbitrary graph, G = (V, E).
 The following two steps summarize our first attempt to solve the problem:
o Each node in the graph would broadcast its unique identifier to all other nodes.
o After receiving the identifiers of all nodes, the node with the highest identifier
declares itself as the leader.
To determine what happens if the algorithm is modified to work under the synchronous
model, let us suppose that the communication graph is a complete graph. The following
two steps summarize the algorithm:
1. In the first round, each node sends its unique identifier to all other nodes.
2. At the end of the first round, every node has the identifiers of all nodes; the node
with the highest identifier declares itself as the leader.
3.5.3 Synchronous Message-Passing Model
A synchronous system can be modeled as a state machine with the following
components:
 M, a fixed-message alphabet
 A process i can be modeled as:
o Qi, a (possibly infinite) set of states. The system state can be represented using a
set of variables.
o q0,i, the initial state in the state set Qi. The state variables have initial values in the
initial state.
o GenMsgi, a message-generation function. It is applied to the current system state
to generate messages to the outgoing neighbors from elements in M.
o Transi, a state transition function that maps the current state and the incoming
messages into a new state.
3.5.4 Simple Leader Election Algorithm
The idea of this simple algorithm is that each process sends its identifier all the
way around the ring.
The process that receives its identifier back is declared as a leader. This
algorithm was presented by Chang and Roberts, Le Lann, and Lynch. We assume the
following:
 Communication is unidirectional (clock wise).
 The size of the ring is unknown.
 The identification of each processor is unique.
The algorithm can be summarized as follows:
1. Each process sends its identifier to its outgoing neighbor.
2. When a process receives an identifier from its incoming neighbor, the process
sends null to its outgoing neighbor if the received identifier is less than its own
identifier (block action).
3. The process sends the received identifier to its outgoing neighbor if the received
identifier is greater than its own identifier (bypass action).
4. The process declares itself as the leader, if the received identifier is equal to its own
identifier.
Assuming that the message alphabet M is the set of identifiers, the Algorithm
S_Elect_Leader_Simple is described as follows:

Algorithm S_Elect_Leader_Simple
Qi:
    u, some ID
    buff, some ID or null
    status, a value in {unknown, leader}
q0,i:
    u ← IDi
    buff ← IDi
    status ← unknown
GenMsgi:
    Send the current value of buff to the clockwise neighbor
Transi:
    buff ← null
    if the incoming message is v and is not null, then case:
        v < u : do nothing
        v = u : status ← leader
        v > u : buff ← v
    endcase
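To see the block and bypass rules in action, the following Python sketch simulates the algorithm round by round on a small ring. It is a sequential simulation, not a distributed implementation, and the helper name and the sample identifiers are chosen only for illustration.

def ring_leader_election(ids):
    """Round-based simulation of the simple election on a unidirectional synchronous
    ring; `ids` are the unique process identifiers listed in clockwise order.
    Returns the elected leader (the maximum identifier)."""
    n = len(ids)
    buff = list(ids)                       # each process starts by sending its own id
    status = ["unknown"] * n
    for _ in range(n):                     # after n rounds the maximum has gone all the way around
        incoming = [buff[(j - 1) % n] for j in range(n)]   # message from the counter-clockwise neighbor
        for j in range(n):
            v, u = incoming[j], ids[j]
            if v is None or v < u:
                buff[j] = None             # block: swallow smaller identifiers
            elif v > u:
                buff[j] = v                # bypass: forward larger identifiers
            else:
                status[j] = "leader"       # our own id came back: we are the leader
                buff[j] = None
    return ids[status.index("leader")]

print(ring_leader_election([3, 1, 4, 2]))   # -> 4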
Revision Sheet # 3
PROBLEMS
1. One way to handle parameter conversion in RPC systems is to have each
machine send parameters in its native representation, with the other one doing
the translation, if need be. The native system could be indicated by a code in
the first byte. However, since locating the first byte in the first word is
precisely the problem, can this work?
2. Assume a client calls an asynchronous RPC to a server, and subsequently waits
until the server returns a result using another asynchronous RPC. Is this
approach the same as letting the client execute a normal RPC? What if we
replace the asynchronous RPCs with synchronous RPCs?
3. Instead of letting a server register itself with a daemon as in DCE, we could
also choose to always assign it the same end point. That end point can then be
used in references to objects in the server's address space. What is the main
drawback of this scheme?
4. Would it be useful also to make a distinction between static and dynamic
RPCs?
5. Describe how connectionless communication between a client and a server
proceeds when using sockets.
6. In the text we stated that in order to automatically start a process to fetch
messages from an input queue, a daemon is often used that monitors the input
queue. Give an alternative implementation that does not make use of a daemon.
7. Routing tables in IBM WebSphere, and in many other message-queuing
systems, are configured manually. Describe a simple way to do this
automatically.
8. Suppose that in a sensor network measured temperatures are not timestamped
by the sensor, but are immediately sent to the operator. Would it be enough to
guarantee only a maximum end-to-end delay?
9. How could you guarantee a maximum end-to-end delay when a collection of
computers is organized in a (logical or physical) ring?
10. How could you guarantee a minimum end-to-end delay when a collection of
computers is organized in a (logical or physical) ring?
Assignment # 6
GPS Math Model: For the following case, compute xr, yr, zr for a ship,
which is our receiver, where t is the time when the signal arrives from each satellite.
GPS Practical Case Study
Our ship is at an unknown position and has no clock. It receives simultaneous signals
from four satellites, giving their positions and times shown in the table below.
Assignment # 7
Trace the Vector Clocks Algorithm to describe the creation of the 7 events in the figure.
Assignment # 8
Given
 Send request at 5:08:15.100 (T0)
 Receive response at 5:08:15.900 (T1)
 Response contains 5:08:15.300 (Tserver)
Compute Tnew
Assignment # 9
Compute the error bounds for assignment 8 using the formula
Assignment # 10
Using the Berkeley Algorithm, Show how
a. The time daemon asks all the other machines for their clock values
b. The machines answer
c. The time daemon tells everyone how to adjust their clock
Using the Berkeley Algorithm, compute the offset for each client.
Assignment # 11
If minimum message transit time (Tmin) is known: Place bounds on accuracy of
results in Assignment # 10.
Assignment # 12
Suppose that we have four processes running on four processors connected via
a synchronous ring. The process (processors) have the IDs 1, 2, 3, and 4. Message
passing is performed in a unidirectional manner. The ring is oriented such that process
i sends messages to its clockwise neighbor. Draw the state of each process after each of
the four rounds using the previous algorithm
Chapter 4
The Distributed Algorithms
4.1 Introduction
Distributed systems can be categorized as shared-memory or message-passing
systems. In a shared-memory system, processing elements communicate with each
other via shared variables in the global memory. While in message-passing systems,
each processing element has its own local memory, and communication is performed
via message passing. We will discuss a shared-memory abstract model, which can be
used to theoretically study parallel algorithms and evaluate their complexities.
PRAM
It is a theoretical model of shared memory systems called Parallel Random Access
Machine (PRAM). The PRAM model was introduced by Fortune and Wyllie in 1978
for modeling idealized parallel computers in which communication cost and
synchronization overhead are negligible.
At first glance, the PRAM model may appear inappropriate in real-world situations
due to its idealistic nature. However, the PRAM model has been very useful in:
 studying parallel algorithms
 Evaluating their anticipated performance independent of the real machines.
Clearly, if the performance of an algorithm is not satisfactory on a PRAM, it is
meaningless to implement it on a real system. Although it does not consider some
practical considerations in real distributed systems, it does focus on the computational
aspects of the algorithmic complexity, which makes it less difficult to find performance
bounds and complexity estimates.
The PRAM model has played an important role in the introduction of parallel
programming paradigms and design techniques used in real parallel systems. Since
PRAM is conceptually easy to work with when developing parallel algorithms, much
effort has been spent in finding efficient ways to simulate PRAM computation on other
models that do not necessarily follow PRAM assumptions.
This way, parallel algorithms can be designed using PRAM and then translated
into real machines. A large number of PRAM algorithms for solving many fundamental
problems have been introduced and efficiently implemented on real systems.
4.2 Variations of PRAM Model
The purpose of the theoretical models for parallel computation is to give
frameworks by which we can describe and analyze algorithms.
These ideal models are used to obtain performance bounds and complexity
estimates. One of the models that have been used extensively is the PRAM model.
A PRAM consists of a control unit, a global memory shared by p processors, each
of which has a unique index as follows: P1, P2, ..., Pp. In addition to the global
memory, via which the processors can communicate, each processor has its own private
memory. The next figure illustrates the components of the PRAM model:
4.2.1 PRAM Model for Parallel Computations
Figure: The PRAM model, consisting of a control unit and p processors P1, P2, ..., Pp,
each with its own private memory, all connected to a shared global memory.
The p processors operate on a synchronized read, compute, and write cycle. During
a computational step, an active processor may read a data value from a memory
location, perform a single operation, and finally write back the result into a memory
location.
Active processors must execute the same instruction, generally, on different data.
Hence, this model is sometimes called the shared memory, single instruction, and
multiple data (SM SIMD) machine. Algorithms are assumed to run without interference
as long as only one memory access is permitted at a time. We say that PRAM guarantees
atomic access to data located in shared memory.
An operation is considered to be atomic if it is completed in its entirety or it is not
performed at all (all or nothing).
4.2.2 READ & WRITE in PRAM
There are different modes for read and write operations in a PRAM. These different
modes are summarized as follows:
 Exclusive read (ER) Only one processor can read from any memory location at a
time.
 Exclusive write (EW) Only one processor can write to any memory location at a
time.
 Concurrent read (CR) multiple processors can read from the same memory location
simultaneously.
 Concurrent write (CW) multiple processors can write to the same memory location
simultaneously.
4.2.3 PRAM Subclasses
The PRAM can be further divided into the following four subclasses:
 EREW PRAM: Access to any memory cell is exclusive. This is the most restrictive
PRAM model.
 ERCW PRAM: This allows concurrent writes to the same memory location by
multiple processors, but read accesses remain exclusive.
 CREW PRAM: Concurrent read accesses are allowed, but write accesses are
exclusive.
 CRCW PRAM: Both concurrent read and write accesses are allowed.
Analysis of Parallel Algorithms
 The complexity of a sequential algorithm is generally determined by its time and
space complexity. The time complexity of an algorithm refers to its execution time
as a function of the problem’s size. Similarly, the space complexity refers to the
amount of memory required by the algorithm as a function of the size of the
problem. The time complexity has been known to be the most important measure of
the performance of algorithms. An algorithm whose time complexity is bounded by
a polynomial is called a polynomial–time algorithm. An algorithm is considered to
be efficient if it runs in polynomial time. Inefficient algorithms are those that require
a search of the whole enumerated space and have an exponential time complexity.
 For parallel algorithms, the time complexity remains an important measure of
performance. Additionally, the number of processors plays a major role in
determining the complexity of a parallel algorithm. In general, we say that the
performance of a parallel algorithm is expressed in terms of how fast it is, and how
many resources it uses when it runs. These criteria can be measured quantitatively
as follows:
1. Run time, which is defined as the time spent during the execution of the
algorithm.
2. Number of processors the algorithm uses to solve a problem.
3. The cost of the parallel algorithm, which is the product of the run time and the
number of processors.
 The run time of a parallel algorithm is the length of the time period between the time
the first processor to begin execution starts and the time the last processor to finish
execution terminates. However, since the analysis of algorithms is normally
conducted before the algorithm is even implemented on an actual computer, the run
time is usually obtained by counting the number of steps in the algorithm. The cost
of a parallel algorithm is basically the total number of steps executed collectively
by all processors. If the cost of an algorithm is C, the algorithm can be converted
into a sequential one that runs in O(C) time on one processor. A parallel algorithm
is said to be cost optimal if its cost matches the lower bound on the number of
sequential operations to solve a given problem within a constant factor. It follows
that a parallel algorithm is not cost optimal if there exists a sequential algorithm
whose run time is smaller than the cost of the parallel algorithm.
 It may be possible to speed up the execution of a cost-optimal PRAM algorithm by
increasing the number of processors. However, we should be careful because using
more processors may increase the cost of the parallel algorithm. Similarly, a PRAM
algorithm may use fewer processors in order to reduce the cost. In this case the
execution may be slowed down and offset the decrease in the number of processors.
Therefore, using fewer processors requires that we make them work more
efficiently. Further details on the relationship between the run time, number of
processors, and optimal cost can be found in Brent (1974).
 In order to design efficient parallel algorithms, one must consider the following
general rules. The number of processors must be bounded by the size of the problem.
The parallel run time must be significantly smaller than the execution time of the
best sequential algorithm. The cost of the algorithm is optimal.
4.3 Simulating Multiple Accesses on an EREW PRAM
Suppose that a memory location, x, is needed by all processors at a given time in a
PRAM. Concurrent read by all processors can be performed in the CREW and CRCW
cases in constant time. In the EREW case, the following broadcasting mechanism can
be followed:
 P1 reads x and makes it known to P2.
 P1 and P2 make x known to P3 and P4, respectively, in parallel.
 P1 , P2 , P3 and P4 make x known to P5, P6, P7, and P8, respectively, in parallel.
 These eight processors will make x known to another eight processors, and so on.
In order to represent this algorithm in PRAM, an array, L, of size p is used as a working
space in the shared memory to distribute the contents of x to all processors.
Initially, P1 will read x into its private memory and write it into L[1]. Processor P2 will
then read x from L[1] into its private memory and write it into L[2].
Simultaneously, P3 and P4 read x from L[1] and L[2], respectively, then write it
into L[3] and L[4], respectively.
Processors P5, P6, P7, and P8 will then simultaneously read L[1], L[2], L[3], and
L[4], respectively, in parallel and write the value into L[5], L[6], L[7], and L[8],
respectively. This process will continue until eventually all the processors have read x.
4.3.1 Algorithm Broadcast_EREW
Processor P1:
    y (in P1's private memory) ← x
    L[1] ← y
for i = 0 to log p − 1 do
    for all Pj, where 2^i + 1 ≤ j ≤ 2^(i+1), do in parallel
        y (in Pj's private memory) ← L[j − 2^i]
        L[j] ← y
    endfor
endfor
Verify the time complexity, O(log p)
Verify the space complexity, O(p)
Since the number of processors having read x doubles in each iteration, the
procedure terminates in O (log p) time. The array L is the price paid in terms of memory,
which is O(p).
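The parallel rounds can be checked with a small sequential simulation. The Python sketch below mirrors Algorithm Broadcast_EREW (1-based array, p assumed to be a power of two); in each round every active processor reads one distinct cell of L and writes another, so no concurrent access ever occurs.

import math

def broadcast_erew(x, p):
    """Sequential simulation of Algorithm Broadcast_EREW: distribute the value x
    to p processors through the working array L, doubling the number of informed
    processors each round (p is assumed to be a power of two)."""
    L = [None] * (p + 1)          # 1-based working array in "shared memory"
    private = [None] * (p + 1)    # private[j] models Pj's private copy of x
    private[1] = x                # P1 reads x ...
    L[1] = private[1]             # ... and writes it into L[1]
    for i in range(int(math.log2(p))):                    # log p rounds
        for j in range(2 ** i + 1, 2 ** (i + 1) + 1):     # processors active this round (in parallel)
            private[j] = L[j - 2 ** i]    # each reads a distinct cell: exclusive read
            L[j] = private[j]             # and writes a distinct cell: exclusive write
    return private[1:]

print(broadcast_erew(42, 8))   # all eight processors now hold 42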
4.4 Computing Sum and All Partial Sums
We design a PRAM algorithm to compute all partial sums of an array of numbers.
Given n numbers stored in array A[1..n], we want to compute the partial sums A[1],
A[1] + A[2], A[1] + A[2] + A[3], ..., A[1] + A[2] + ... + A[n].
At first glance, one might think that accumulating sums is an inherently serial
process, because one must add up the first k elements before adding in element k + 1.
To make it easy to understand the algorithm, we start by developing a similar algorithm
for the simpler problem of computing the simple sum of an array of n values. Then we
extend the algorithm to compute all partial sums using what is learned from the simple
summation problem.
4.4.1 Sum of an Array of Numbers on the EREW Model
In this section, we discuss an algorithm to compute the sum of n numbers. Summation
can be done in time O(log n) organizing the numbers as the leaves of a binary tree and
performing the sums at each level of the tree in parallel.
We present this algorithm on an EREW PRAM with n/2 processors, because we won't
need to perform any multiple read or write operations on the same memory location.
Recall that in an EREW PRAM, read and write conflicts are not allowed. We assume
that the array A[1..n] is stored in the global memory. The summation will end up in the
last location, A[n].
For simplicity, we assume that n is an integral power of 2. The algorithm will complete
the work in log n iterations as follows. In the first iteration, all the processors are active.
In the second iteration, only half of the processors will be active, and so on. The details
are described in Algorithm Sum_EREW given below.
Algorithm Sum_EREW
for i = 1 to log n do
    for all Pj, where 1 ≤ j ≤ n/2, do in parallel
        if (2j mod 2^i) = 0 then
            A[2j] ← A[2j] + A[2j − 2^(i−1)]
        endif
    endfor
endfor
Run time, T(n) = O(log n)
Number of processors, P(n) = n/2
Cost, C(n) = O(n log n)
Complexity Analysis
Notice that the for loop is executed log n times, and each iteration has constant
time complexity. Hence, the run time of the algorithm is O(log n). Since the number of
processors used is n/2, the cost is obviously O(n log n). The complexity measures of
Algorithm Sum_EREW are summarized as follows:
1. Run time, T(n) = O(log n)
2. Number of processors, P(n) = n/2
3. Cost, C(n) = O(n log n)
Since a good sequential algorithm can sum the list of n elements in O(n), this
algorithm is not cost optimal.
Algorithm Procedures
In order to sum eight elements, three iterations are needed. In the first iteration,
processors P1, P2, P3, and P4 add the values stored at locations 1, 3, 5, and 7 to the
numbers stored at locations 2, 4, 6, and 8, respectively.
In the second iteration, processors P2 and P4 add the values stored at locations 2 and 6
to the numbers stored at locations 4 and 8, respectively.
Finally, in the third iteration, processor P4 adds the value stored at location 4 to the
value stored at location 8. Thus, location 8 will eventually contain the sum of all
numbers in the array.
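This trace can be reproduced with a small sequential simulation (my own sketch, assuming n is a power of 2 and using 0-based Python indexing for the 1-based array of the algorithm):

import math

def sum_erew(A):
    A = list(A)                          # copy of the shared array
    n = len(A)
    for i in range(1, int(math.log2(n)) + 1):
        # processors Pj, 1 <= j <= n/2, in parallel
        for j in range(1, n // 2 + 1):
            if (2 * j) % (2 ** i) == 0:
                A[2 * j - 1] += A[2 * j - 1 - 2 ** (i - 1)]
    return A[n - 1]                      # the total accumulates in the last location A[n]

print(sum_erew([1, 2, 3, 4, 5, 6, 7, 8]))   # 36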
Sum Example
4.4.2 All Partial Sums of an Array
Take a closer look at Algorithm Sum_EREW and notice that most of the processors
are idle most of the time. However, by exploiting the idle processors, we should be able
to compute all partial sums of the array in the same amount of time it takes to compute
the single sum.
We present Algorithm All_Partial_Sums_EREW to calculate all partial sums of an array on an
EREW PRAM with n − 1 processors (P2, P3, ..., Pn). Again, the elements of the array
A[1..n] are assumed to be in the global shared memory. The partial sum algorithm
replaces each A[k] by the sum of all elements preceding and including A[k].
In Algorithm Sum_EREW presented earlier, during iteration i, only n/2^i processors
are active, while in the algorithm we present here, nearly all processors are in use.
Algorithm All_Partial_Sums_EREW
for i = 1 to log n do
    for all Pj, where 2^(i−1) + 1 ≤ j ≤ n, do in parallel
        A[j] ← A[j] + A[j − 2^(i−1)]
    endfor
endfor
Run time, T(n) = O(log n)
Number of processors, P(n) = n − 1
Cost, C(n) = O(n log n)
Visit: Algorithm ALL_Partial_Sums .docx
Computing partial sums of an array of eight elements
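A sequential simulation of Algorithm All_Partial_Sums_EREW (my own sketch; the snapshot copy stands in for the PRAM convention that all reads of a round happen before any writes):

import math

def all_partial_sums_erew(A):
    A = list(A)                           # shared array (0-indexed copy)
    n = len(A)
    for i in range(1, int(math.log2(n)) + 1):
        old = list(A)                     # values as they were at the start of the round
        # processors Pj with 2^(i-1) + 1 <= j <= n, in parallel
        for j in range(2 ** (i - 1) + 1, n + 1):
            A[j - 1] = old[j - 1] + old[j - 1 - 2 ** (i - 1)]
    return A

print(all_partial_sums_erew([1, 2, 3, 4, 5, 6, 7, 8]))
# [1, 3, 6, 10, 15, 21, 28, 36]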
4.5 The Sorting Algorithm
The sorting algorithm we present here is based on the enumeration (list) idea.
Given an unsorted list of n elements a1, a2, ..., ai, ..., an, an enumeration sort determines
the position of each element ai in the sorted list by computing the number of elements
smaller than it.
If ci elements are smaller than ai, then it is the (ci + 1)th element in the sorted list.
If two or more elements have the same value, the element with the largest index in the
unsorted list will be considered as the largest in the sorted list. For example, suppose
that ai = aj; then ai will be considered the larger of the two if i > j; otherwise, aj is the
larger.
4.5.1 n² Sorting Algorithm
Consider the n² processors as being arranged into n rows of n elements each. The
processors are numbered as follows: Pi,j is the processor located in row i and column j
in the grid of processors. We assume that the unsorted list is stored in the global memory
in an array A[1..n]. Another array C[1..n] will be used to store the number of
elements smaller than every element in A. The algorithm consists of two steps:
1. Each row of processors i computes C[i], the number of elements smaller than A[i].
Each processor Pi,j compares A[i] with A[j] and then updates C[i] appropriately.
2. The first processor in each row, Pi,1, places A[i] in its proper position in the sorted
list (position C[i] + 1).
4.5.2 Algorithm Sort_CRCW (Ascending)
/* Step 1 */
for all Pi,j, where 1 ≤ i, j ≤ n, do in parallel
    if (A[i] > A[j]) or (A[i] = A[j] and i > j)
        then C[i] ← 1
        else C[i] ← 0
    endif
endfor
/* Step 2 */
for all Pi,1, where 1 ≤ i ≤ n, do in parallel
    A[C[i] + 1] ← A[i]
endfor
Complexity Analysis
The complexity measures of the enumeration sort on a CRCW PRAM are
summarized as follows:
Run time, T(n) = O(1)
Number of processors, P(n) = n²
Cost, C(n) = O(n²)
The run time of this algorithm is constant, because each of the two steps of the
algorithm consumes a constant amount of time. Since the number of processors used is
n², the cost is obviously O(n²).
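The following sequential sketch (my own illustration) mimics the enumeration sort; it assumes a combining (summing) CRCW write model, so that the simultaneous writes of Step 1 leave in C[i] the count of elements smaller than A[i]:

def sort_crcw(A):
    n = len(A)
    C = [0] * n
    # Step 1: every processor pair (i, j) contributes 0 or 1, and the concurrent
    # writes to C[i] are combined by summation
    for i in range(n):
        for j in range(n):
            if A[i] > A[j] or (A[i] == A[j] and i > j):
                C[i] += 1
    # Step 2: processor P(i,1) places A[i] at position C[i] + 1
    result = [None] * n
    for i in range(n):
        result[C[i]] = A[i]
    return result

print(sort_crcw([5, 2, 4, 2]))   # [2, 2, 4, 5]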
CRCW algorithms versus EREW algorithms
The debate about whether or not concurrent memory accesses should be provided by
the hardware of a parallel computer is a messy one. Some argue that hardware
mechanisms to support CRCW algorithms are too expensive and used too infrequently
to be justified. Others complain that EREW PRAMs provide too restrictive a
programming model. The answer to this debate probably lies somewhere in the middle,
and various compromise models have been proposed. Nevertheless, it is instructive to
examine what algorithmic advantage is provided by concurrent accesses to memory.
In this section, we shall show that there are problems on which a CRCW algorithm
outperforms the best possible EREW algorithm. For the problem of finding the
identities of the roots of trees in a forest, concurrent reads allow for a faster algorithm.
For the problem of finding the maximum element in an array, concurrent writes permit
a faster algorithm.
4.6 Message – Passing Models and Algorithms
Message-passing distributed algorithms are designed to run on the processing units of
a distributed system, which may be connected in a variety of ways: ranging from
geographically dispersed networks to architecture-specific interconnection structures.
A processing unit in such systems is an autonomous computer, which may be engaged
in its own private activities while at the same time cooperating with other units in the
context of some computational task.
4.6.1 Message-passing Computing Models
An algorithm designed for a message-passing system consists of a collection of local
programs running concurrently on the different processing units in a distributed system.
Each local program performs a sequence of computation and message-passing
operations. Message passing in distributed systems can be modeled using a
communication graph. The nodes of the graph represent the processors (or the processes
running on them), and the edges represent communication links between processors.
A message-passing distributed system may operate in synchronous, asynchronous, or
partially synchronous modes.
In the synchronous extreme, the execution is completely lockstep and local programs
proceed in synchronous rounds, for example in one round each local program sends
messages to its outgoing neighbors, waits for the arrival of messages from its incoming
neighbors, and performs some computation upon the receipt of the messages.
In the other extreme, in asynchronous mode the local programs execute in arbitrary
order at an arbitrary rate. The partially synchronous systems work at an intermediate
degree of synchrony, where there are restrictions on the relative timing events.
The processes share information by sending/receiving (or dispatching/collecting) data
to/from each other. The processes most likely run the same programs, and the whole
system should work correctly regardless of the messaging relations among the
processes or the structure of the network.
A popular standard message-passing system is the Message Passing Interface
(MPI). Such models themselves do not impose particular restrictions on the mechanism
for messaging, and thus give programmers much flexibility in algorithm/system
designs. However, this also means that programmers need to deal with actual
sending/receiving of messages, failure recovery, managing running processes, etc.
Synchronous Message-Passing Model
Thus, a synchronous system can be modeled as a state machine with the following
components:
• M, a fixed message alphabet.
• A process i can be modeled as:
  o Qi, a (possibly infinite) set of states. The system state can be represented using a
set of variables.
  o q0,i, the initial state in the state set Qi. The state variables have initial values in
the initial state.
  o GenMsgi, a message-generation function. It is applied to the current system state
to generate messages to the outgoing neighbors from elements in M.
  o Transi, a state transition function that maps the current state and the incoming
messages into a new state.
Algorithm S_Sum_Hypercube
Qi:
    buff, an integer
    dim, a value in {1, 2, ..., log n}
q0,i:
    buff ← xi
    dim ← log n
GenMsgi:
    If the current value of dim = 0, do nothing. Otherwise, send the current
    value of buff to the neighbor along the dimension dim.
Transi:
    If the incoming message is v and dim > 0, then
        buff ← buff + v
        dim ← dim − 1
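A round-based Python simulation of Algorithm S_Sum_Hypercube (my own sketch; node i's neighbor along dimension dim is obtained by flipping bit dim − 1 of its binary address):

import math

def s_sum_hypercube(values):
    n = len(values)                        # number of nodes, a power of two
    d = int(math.log2(n))
    buff = list(values)                    # buff_i <- x_i
    dim = [d] * n                          # dim_i  <- log n
    for _ in range(d):                     # d synchronous rounds
        outgoing = list(buff)                            # GenMsg: everyone sends buff
        for i in range(n):                               # Trans: receive and add
            if dim[i] > 0:
                neighbour = i ^ (1 << (dim[i] - 1))      # flip bit number "dim"
                buff[i] += outgoing[neighbour]
                dim[i] -= 1
    return buff[0]                         # after log n rounds every node holds the sum

print(s_sum_hypercube([1, 2, 3, 4, 5, 6, 7, 8]))   # 36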
Revision Sheet # 4
Assignment # 13
Trace the following algorithms using your own examples and clarify their space
complexity and time complexity:
 Simulating Multiple Accesses on an EREW PRAM (Broadcast Algorithm)
 Computing Sum Algorithm
 All Partial Sums Algorithm
 Sorting Algorithm
 Message Passing Models Sum Algorithm
Chapter 5
Naming In Distributed Systems
Naming Mechanism in Traditional Network
Names play a very important role in all computer systems. They are used to share
resources, to uniquely identify entities, to refer to locations, and more. An important
issue with naming is that a name can be resolved to the entity it refers to. Name
resolution thus allows a process to access the named entity. To resolve names, it is
necessary to implement a naming system. The difference between naming in distributed
systems and non-distributed systems lies in the way naming systems are implemented.
In a distributed system, the implementation of a naming system is itself often distributed
across multiple machines.
How this distribution is done plays a key role in the efficiency and scalability of the
naming system. In this chapter, we concentrate on three different, important ways that
names are used in distributed systems. First, after discussing some general issues with
respect to naming, we take a closer look at the organization and implementation of
human-friendly names. Typical examples of such names include those for file systems
and the World Wide Web. Building worldwide, scalable naming systems is a primary
concern for these types of names.
Second, names are used to locate entities in a way that is independent of their current
location. As it turns out, naming systems for human-friendly names are not particularly
suited for supporting this type of tracking down entities. Most names do not even hint
at the entity's location. Alternative organizations are needed, such as those being used
for mobile telephony where names are location independent identifiers, and those for
distributed hash tables. Finally, humans often prefer to describe entities by means of
various characteristics, leading to a situation in which we need to resolve a description
by means of attributes to an entity adhering to that description. This type of name
resolution is notoriously difficult, and we will pay separate attention to it.
The practice of using a name as a simpler, more memorable abstraction of a host's
numerical address on a network dates back to the ARPANET era. Before the DNS was
invented in 1982, each computer on the network retrieved a file called HOSTS.TXT
from a computer at SRI International (formerly the Stanford Research Institute). The HOSTS.TXT
file mapped names to numerical addresses.
A hosts file still exists on most modern operating systems by default and generally
contains a mapping of "localhost" to the IP address 127.0.0.1. The rapid growth of the
network made a centrally maintained, hand-crafted HOSTS.TXT file unaffordable. It
became necessary to implement a more scalable system capable of automatically
publishing the requisite information.
5.1 Naming Basic Concept
• Name: a string of bits that refers to an entity (e.g., your name).
• Address: a string of bits that has location semantics
  o e.g., your home address, your phone #
• Also, a name is an identifier that:
  o Identifies a resource
    - Uniquely
    - Describes the resource
  o Enables us to locate that resource
    - Directly
    - With help?
Key issues in Naming
 How is name used?
o Disambiguate only
o Access resource given the name
o Build a name to find a resource
 Do humans need to use name?
o Construct
o Recall
 Is resource static?
o Never moves?
o Change in location should change name
o Resource may move
o Resource is mobile
 Performance requirements
5.2 Naming Types (identification, description, location)
1. Address, which is an access point associated with an entity
2. Globally unique identifier (e.g. TCP/IP)
 Ethernet
 Solves identification, but not description or location
3. Hierarchically assigned globally unique names in shape of character string (e.g.
URL)
• Telephone number, IP address
 Solves identification, location
 Cannot help with description
4. Registries and name spaces (e.g. Uniform Resource Name (URN))
 Solves identification and location
 Helps with description
 Registry can describe in detail
 Complicated!
Naming & Application Developers
 DNS is widely accepted standard
o Only names machines
o Doesn’t handle mobility
 URL / URN will become standard
o Can be descriptive
o Globally unique
o Determined, but expensive to create
 Mix of URL and DNS
5.2.1 Uniform Resource Identifier (URI) (URL/URN/URL+URN)
Computer scientists may classify a URI as a locator (URL), or a name (URN), or both.
It is an Internet Engineering Task Force (IETF) meta-standard. It defines naming
schemes/protocols. Each naming scheme has its own mechanism. Examples of absolute
URIs:
 http://example.org/absolute/URI/with/absolute/path/to/resource.txt
 ftp://example.org/resource.txt
 urn: issn:1535-3613
Uniform Resource Locator (URL)
Instead of being divided into the route to the server, separated by dots, and the file
path, separated by slashes, Tim Berners-Lee would have liked it to be one coherent
hierarchical path.
• e.g. http://www.serverroute.com/path/to/file.html →
http://com/serverroute/www/path/to/file.html
URL uses DNS to map to host (Mix of URL with DNS).
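As a small aside (my own example, not from the notes), Python's standard urllib.parse shows the pieces a URL is divided into: the scheme selects the Web service, the host part is resolved through DNS, and the path names the resource on that host:

from urllib.parse import urlparse

u = urlparse("http://www.serverroute.com/path/to/file.html")
print(u.scheme)    # 'http'                -> which Web service (protocol) to use
print(u.netloc)    # 'www.serverroute.com' -> resolved to a host via DNS
print(u.path)      # '/path/to/file.html'  -> the resource on that host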
Uniform Resource Name (URN) (Permanent URL)
The Functional Requirements for Uniform Resource Names are described in RFC
1737. The URNs are part of a larger Internet information architecture which is
composed of:
 URNs,
 Uniform Resource Characteristics (URCs)
 Uniform Resource Locators (URLs).
Each plays a specific role:
 URNs are used for identification
 URCs for including meta-information
 URLs for locating or finding resources
The Internet Protocol Suite
Digital subscriber line is a family of technologies that provide Internet access by
transmitting digital data over the wires of a local telephone network. In
telecommunications marketing, the term DSL is widely understood to mean
asymmetric digital subscriber line (ADSL)
Application Layer: BGP · DHCP · DNS · FTP · GTP · HTTP · IMAP (email receiving) ·
IRC · Megaco · MGCP · NNTP · NTP · POP (email receiving) · RIP · RPC · RTP · RTSP ·
SDP · SIP · SMTP (email sending) · SNMP · SOAP · SSH · Telnet · TLS/SSL · XMPP
Transport Layer: TCP · UDP · DCCP · SCTP · RSVP · ECN · (more)
Internet Layer: IP (IPv4, IPv6) · ICMP · ICMPv6 · IGMP · IPsec
Link Layer: ARP · RARP · NDP · OSPF · Tunnels (L2TP) · PPP · Media Access Control
(Ethernet, MPLS, DSL, ISDN, FDDI) · Device Drivers
Domain Name System (DNS)
It is a hierarchical naming system for computers, services, or any resource participating
in the Internet. It associates various information with domain names assigned to such
participants. It translates domain names meaningful to humans into the numerical
(binary) identifiers associated with networking equipment for the purpose of locating
and addressing these devices world-wide. It serves as the "phone book" for the Internet
by translating human-friendly computer hostnames into IP addresses. For example,
www.example.com translates to 208.77.188.166.
World-Wide Web (WWW) hyperlinks and Internet contact information can remain
consistent and constant even if the current Internet routing arrangements change or the
participant uses a mobile device.
Internet domain names are easier to remember than IP addresses such as
208.77.188.166 (IPv4) or 2001:db8:1f70::999:de8:7648:6e8 (IPv6). People take
advantage of this when they recite meaningful URLs and e-mail addresses without
having to know how the machine will actually locate them.
DNS / URL System (Mix)
http://maedhbh.maths.tcd.ie/project
Using DNS, the machine name blood.cs.tcd.ie would indicate that the machine
blood was located in the Computer Science (cs) department of Trinity College Dublin
(tcd) which is located in Ireland (ie). Thus, the name space has been broken up into a
hierarchical structure of domains within domains.
The Uniform Resource Locator (URL) is an extension of the naming convention
used by DNS. It adds a prefix, which is used to specify which of the Web services you
request of that machine. Such services are HTTP & FTP.
Also, other information can be appended to the address. One such use is to specify
the particular Web page you request, e.g.
http://maedhbh.maths.tcd.ie/project
The advantage of the DNS/URL system is that the database storing information
about the machines is distributed in a fashion where the owner of the machines is
responsible for that data.
Domain Name System (DNS) Hierarchy
DNS Hierarchy Features
1. Unique domain suffix is assigned by the Internet Authority
2. The domain administrators have complete control over the domain
3. No limit on the number of sub-domains or number of levels
4. Name space is not related with the physical interconnection
5. Geographical hierarchy is allowed (e.g., cnri.reston.va.us)
6. A name could be a domain or an individual object
DNS Top Level Domains
Domain Name     Assignment
com             Commercial
edu             Educational
gov             Government
mil             Military
net             Network
org             Other organizations
country code    au, uk, ca, ...
The DNS Name Space
The most important types of resource records forming the contents of nodes in the
DNS name space:
Type of record   Associated entity   Description
SOA              Zone                Holds information on the represented zone
A                Host                Contains an IP address of the host this node represents
MX               Domain              Refers to a mail server to handle mail addressed to this node
SRV              Domain              Refers to a server handling a specific service
NS               Zone                Refers to a name server that implements the represented zone
CNAME            Node                Symbolic link with the primary name of the represented node
Domain Name Space
Name Space Implementation
Name spaces always map names to something. DNS associates information
with domain names. It can be divided into three layers:
1. Global layer: Doesn’t change very often.
2. Administrational layer: Single organization
3. Managerial layer: Change regularly
Name Space Distribution Example
An example partitioning of the DNS name space, including Internet-accessible
files, into three layers.
Name Space Layers Characteristics
A comparison between name servers for implementing nodes from a large-scale
name space partitioned into a global layer, an administrational layer, and a managerial
layer.
Item                              Global      Administrational   Managerial
Geographical scale of network     Worldwide   Organization       Department
Total number of nodes             Few         Many               Vast numbers
Responsiveness to lookups         Seconds     Milliseconds       Immediate
Update propagation                Lazy        Immediate          Immediate
Number of replicas                Many        None or few        None
Is client-side caching applied?   Yes         Yes                Sometimes
DNS Name Servers (Decentralized)
Centralizing DNS is avoided for the following reasons:
 To avoid single point of failure
 To avoid traffic volume
 Access to distant centralized database
 Distribute maintenance overhead
 Centralized means: doesn’t scale!
DNS Name Server Hierarchy Concepts
Servers are organized in hierarchies. Each server has authority over a portion of the
hierarchy:
 A single node in the name hierarchy cannot be split
 A server maintains only a subset of all names
 It needs to know other servers that are responsible for the other
portions of the hierarchy.
Regarding its authority, each server has the name-to-address translation table (ATT)
for all names in the name space it controls. Every server knows the root, and the root
server knows about all top-level domains.
DNS Name Servers Types
No server has all name-to-IP address mappings, i.e., the knowledge is distributed as
follows:
• Local name servers
  o Each ISP (company) has a local (default) name server
  o Host DNS queries first go to the local name server
• Root name servers, acting as a mediator agent
• Authoritative name servers
  o For a host: stores that host’s (name, IP address)
  o Can perform name/address translation for that host’s name
Root Name Servers (Mediator Agent)
Contacted by local name server that cannot resolve name. The Root name server:
1. Contacts authoritative name server if name mapping not known
2. Gets mapping
3. Returns mapping to local name server
Dozen root name servers worldwide
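For a concrete feel of such a lookup (my own example; it needs network access and a working resolver), Python's socket module asks the local default name server, which in turn contacts root and authoritative servers as described above:

import socket

for family, _, _, _, sockaddr in socket.getaddrinfo("www.berkeley.edu", 80):
    print(family, sockaddr)   # one or more (IP address, port) pairs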
Simple DNS Example (1st Alternative)
Host whistler.cs.cmu.edu wants the IP address of www.berkeley.edu
1. Contacts its local DNS server, mango.srv.cs.cmu.edu
2. mango.srv.cs.cmu.edu contacts root name server, if necessary
3. Root name server contacts authoritative name server,
ns1.berkeley.edu, if necessary
 May not know authoritative name server
 May know intermediate name server: who to contact to find
authoritative name server?
DNS: Iterated Queries (3rd Alternative)
Recursive query:
 Puts load of name resolution on contacted name server
 Heavy load?
Iterated query:
 Contacted server replies with name of server to contact
 “I don’t know this name, but ask this server”
5.3 Naming Implementation Approaches
1. Flat Naming
 Simple solutions (broadcasting and Forwarding pointers)
 Home-based approaches
 Distributed Hash Tables (structured P2P)
 Hierarchical location service
2. Structured Naming, for example:
 Phone numbers
 Credit card numbers
 DNS
 Human names in the US
 Files in UNIX, Windows
 URLs
3. Attribute-based Naming
 using a collection of (attribute, value) pairs
5.3.1 Flat Naming Approach
Problem:
 Given an essentially unstructured name (e.g., an identifier), how can
we locate its associated access point?
Solution:
1. Simple solutions (broadcasting and Forwarding pointers )
2. Home-based approaches
3. Distributed Hash Tables (structured P2P)
4. Hierarchical location service
5.3.1.1 Simple Solutions
Broadcasting: (Question is ID, answer is address)
Simply broadcast the ID, requesting the entity to return its current address. Its
disadvantages are:
• It can never scale beyond local-area networks (LANs)
• It requires all processes to listen to incoming location requests
Forwarding pointers: (VIP hot line contact)
Each time an entity moves, it leaves behind a pointer telling where it has gone
to. Dereferencing can be made entirely transparent to clients by simply following the
chain of pointers
 Update a client’s reference as soon as present location has been found
 Geographical scalability problems:
o Long chains are not fault tolerant
o Increased network latency at dereferencing. It is essential to have
separate chain reduction mechanisms.
5.3.1.2 Home-Based Approaches
Single-tiered scheme (continuous tracking):
Let a home keep track of where the entity is:
 An entity’s home address is registered at a naming service
 The home registers the foreign address of the entity
 Clients always contact the home first, and then continues with the foreign
location
Two-tiered scheme:
Keep track of visiting entities:
 Check local visitor register first
 Fall back to home location if local lookup fails
 Problems with home-based approaches:
1. The home address has to be supported as long as the entity lives.
2. The home address is fixed, which means an unnecessary load when
the entity permanently moves to another location
3. Poor geographical scalability
Question: How can we solve the “permanent move” problem?
Answer: register the home at a traditional naming service and to let a client first look
up the location of the home.
5.3.1.3 Distributed Hash Tables (DHTs)
Example: Consider the organization of many nodes into a logical ring (Chord
Protocol):
• Each node is assigned a random m-bit identifier.
• Every entity is assigned a unique m-bit key.
• An entity with key k falls under the authority of the node with the smallest id ≥ k
(called its successor).
Solution: Let node id keep track of succ(id) and start a linear search along the ring.
Chord: Overview
Chord is a scalable, distributed “lookup service”. A lookup service is a service that
maps keys to nodes. The key technology used is consistent hashing. The major benefits
of Chord over other lookup services are provable correctness and provable
“performance”. It uses the Secure Hash Algorithm (SHA-1), which is one of a number
of cryptographic hash functions: http://aruljohn.com/hash.php
Chord: Primary Motivation
In P2P network, one node(N5) as a client is trying to locate a specific file (LetItBe)
located on another node(N1) as a server on the network.
Chord Identifiers
An m-bit identifier space is used for both keys and nodes:
• Key identifier: SHA-1(key)
• Node identifier: SHA-1(IP address)
After hashing the key value of the file name, the client node will search for the
closest node equal or higher in value on the network to acquire the needed file.
Algorithmic Requirements
● Every node can find the answer.
● Keys are load balanced among nodes.
Note: We're not talking about popularity of keys, which may be wildly different.
Addressing this is a further challenge...
● Routing tables must adapt to node failures and arrivals.
● How many hops must lookups take? – Trade off possible between state/maintenance,
traffic and number lookups.
DHTs: Finger Tables
• Each node p maintains a finger table FTp[] with at most m entries:
FTp[i] = succ(p + 2^(i−1))
Note: FTp[i] points to the first node succeeding p by at least 2^(i−1).
• To look up a key k, node p forwards the request to the node with index j
satisfying
q = FTp[j] ≤ k < FTp[j + 1]
In other words: if a node does not hold the looked-up key itself, it forwards the request to the next appropriate node, and so on, until the key is found or the search stops.
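A simplified Chord sketch (my own illustration, assuming a small static set of nodes on an m-bit identifier ring): node and key identifiers come from SHA-1, succ() returns the first node at or after an identifier, and finger entry FTp[i] is succ(p + 2^(i−1)):

import hashlib

m = 6                                       # identifier space: 0 .. 2^m - 1

def chord_id(name):
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** m)

def succ(x, nodes):
    # first node id >= x on the ring, wrapping around if necessary
    return min((n for n in nodes if n >= x), default=min(nodes))

nodes = sorted({chord_id(f"node-{i}") for i in range(8)})     # hypothetical nodes
finger = {p: [succ((p + 2 ** (i - 1)) % 2 ** m, nodes) for i in range(1, m + 1)]
          for p in nodes}

key = chord_id("LetItBe")                                     # the file from the example
print("key", key, "is stored at node", succ(key, nodes))
print("finger table of node", nodes[0], ":", finger[nodes[0]])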
5.3.1.4 Hierarchical Location Services (HLS)
The basic idea is to build a large-scale search tree for which the underlying network is
divided into hierarchical domains. Each domain is represented by a separate directory node.
HLS: Tree Organization
The address of an entity is stored in a leaf node, or in an intermediate node.
Intermediate nodes contain a pointer to a child if and only if the sub-tree rooted at the
child stores an address of the entity. The root knows about all entities.
HLS: Lookup Operation
The basic principles are:
• Start the lookup at the local leaf node
• If the node knows about the entity, follow the downward pointer; otherwise go
one level up (if the entity is known there, go down; otherwise continue up)
• An upward lookup always stops at the root
5.3.2 Structured Naming
The name space is the way that names in a particular system are organized. This also
defines the set of all possible names. Some examples are:
 Phone numbers
 Credit card numbers
 DNS
 Human names in the US
 Files in UNIX, Windows
 URLs
Names are organized into what is commonly referred to as a name space. A name space
can be represented as a labeled, directed graph with two types of nodes.
Name Space
Essence: a graph in which a leaf node represents a (named) entity. A directory node is
an entity that refers to other nodes.
Note: A directory node contains a directory table of (edge label, node identifier) pairs.
Observation: We can easily store all kinds of attributes in a node, describing aspects of
the entity the node represents:
1. Type of the entity
2. An identifier for that entity
3. Address of the entity’s location
4. Nicknames
Directory nodes can also have attributes, besides just storing a directory table with
(edge label, node identifier) pairs.
Name Resolution
Looking up a name (finding the “value”) is called name resolution. But the problem is
to resolve a name, we need a directory node. We first need to find that “initial” node.
Closure mechanism (or where to start) is the mechanism to select the implicit context
from which to start name resolution. Such Examples are: file systems, ZIP code, DNS
 www.cs.vu.nl: start at a DNS name server
 0031204447784: dial a phone number
 130.37.24.8: route to the VU’s Web server
Observation: A closure mechanism may also determine how name resolution should
proceed.
Name Linking
Hard link: What we have described so far as a path name: a name that is resolved by
following a specific path in a naming graph from one node to another.
Soft link: Allow a node O to contain a name of another node:
 First resolve O’s name (leading to O)
 Read the content of O, yielding name
 Name resolution continues with name
Observations: The name resolution process determines that we read the content of a
node, in particular, the name in the other node that we need to go to. One way or the
other, we know where and how to start name resolution given a name. In the example graph, node n5 has
only one name (i.e., no name linking), while both n1 and n4 link to node n6.
Iterative Name Resolution
 resolve(dir,[name1,...,nameK]) is sent to Server0 responsible for dir
 Server0 resolves resolve(dir,name1) → dir1, returning the identification
(address) of Server1, which stores dir1.
 Client sends resolve(dir1,[name2,...,nameK]) to Server1, etc.
Recursive Name Resolution
 resolve(dir,[name1,...,nameK]) is sent to Server0 responsible for dir
 Server0 resolves resolve(dir,name1) →dir1, and sends
resolve(dir1,[name2,...,nameK]) to Server1, which stores dir1.
 Server0 waits for the result from Server1, and returns it to the client.
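A toy contrast of the two strategies (my own sketch with a hypothetical name space and address; real directory servers would of course live on different machines):

root = {"nl": {"vu": {"cs": {"ftp": "130.37.24.11"}}}}   # hypothetical name space

def resolve_iterative(server, path):
    # the *client* walks the chain, contacting one server per label
    for label in path:
        server = server[label]
    return server

def resolve_recursive(server, path):
    # each *server* resolves the first label and forwards the rest
    if not path:
        return server
    return resolve_recursive(server[path[0]], path[1:])

print(resolve_iterative(root, ["nl", "vu", "cs", "ftp"]))   # 130.37.24.11
print(resolve_recursive(root, ["nl", "vu", "cs", "ftp"]))   # 130.37.24.11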
Scalability Issues
Size scalability: We need to ensure that servers can handle a large number of requests
per time unit ⇒ high-level servers are in big trouble.
Solution: Assume (at least at global and administrational level) that content of nodes
hardly ever changes. In that case, we can apply extensive replication by mapping nodes
to multiple servers and start name resolution at the nearest server.
Observation: An important attribute of many nodes is the address where the
represented entity can be contacted. Replicating nodes makes large-scale traditional
name servers unsuitable for locating mobile entities.
Geographical scalability: We need to ensure that the name resolution process scales
across large geographical distances.
Problem: By mapping nodes to servers that may, in principle, be located anywhere, we
introduce an implicit location dependency in our naming scheme.
5.3.3 Attribute-Based Naming
Observation: In many cases, it is much more convenient to name, and look up entities
by means of their attributes ⇒ traditional directory services (Also Known As (a.k.a.),
yellow pages).
Problem: Lookup operations can be extremely expensive, as they require matching
requested attribute values against actual attribute values ⇒ inspect all entities.
Solution: Implement basic directory service as database and combine with traditional
structured naming system.
Directory Service
A directory service is a database that contains information about all objects on the
network. Directory services contain data and metadata. Metadata is information about
data. For example: A user account is data. Metadata specifies what information is
included in every user account object.
Early Directory Services
The first directory service was developed at PARC and was called Grapevine. X.500
was developed as a directory service standard by the ISO and CCITT. Although X.500
was developed as a comprehensive standard, as with the OSI model, it was not widely
deployed on real-world LANs. X.500 formed the basis of a standard that is widely
deployed known as LDAP. Some X.500 conventions are used in Active Directory and
eDirectory.
X.500 Directory Service
It provides a directory service based on a description of properties instead of a full name
(e.g., the yellow pages in a telephone book). An X.500 directory entry is comparable to a
resource record in DNS. Each
record is made up of a collection of (attribute, value) pairs:
 Collection of all entries is a Directory Information Base (DIB)
 Each naming attribute is a Relative Distinguished Name (RDN)
 RDNs, in sequence, can be used to form a Directory Information Tree (DIT)
LDAP
It stands for Lightweight Directory Access Protocol. LDAP is a scaled-down
implementation of the X.500 standard. Active Directory and eDirectory are based on
LDAP. Netscape’s Directory Server was the first wide implementation of LDAP.
Most LDAP directories use a single master method of replication. Changes are made to
the master databases and then propagated out to subordinate databases. The
disadvantage of this scheme is that it has a single point of failure.
Objects within an LDAP directory are referenced using the object’s DN (Distinguished
Name). The DN consists of the RDN (Relative Distinguished Name) appended with
the names of ancestor entries.
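For example (a hypothetical entry of my own, not taken from the notes), a DN is simply the entry's RDN followed by the RDNs of its ancestors:

rdn = "cn=John Doe"
ancestors = ["ou=People", "dc=example", "dc=com"]
dn = ",".join([rdn] + ancestors)
print(dn)   # cn=John Doe,ou=People,dc=example,dc=com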
LDAP Example
Revision Sheet # 5
PROBLEMS
1. Give an example of where an address of an entity E needs to be further resolved
into another address to actually access E.
2. Would you consider a URL such as http://www.acme.org/index.html to be location
independent? What about http://www.acme.nl/index.html?
3. Give some examples of true identifiers.
4. Is an identifier allowed to contain information on the entity it refers to?
5. Outline an efficient implementation of globally unique identifiers.
6. Consider a Chord DHT-based system for which k bits of an m-bit identifier space
have been reserved for assigning to superpeers. If identifiers are randomly
assigned, how many superpeers can one expect to have in an N-node system?
7. If we insert a node into a Chord system, do we need to instantly update all the finger
tables? What is a major drawback of recursive lookups when resolving a key in a
DHT-based system?
8. Considering that a two-tiered home-based approach is a specialization of a
hierarchical location service, where is the root?
9. The root node in hierarchical location services may become a potential bottleneck.
How can this problem be effectively circumvented?
10. Give an example of how the closure mechanism for a URL could work.
11. Explain the difference between a hard link and a soft link in UNIX systems. Are
there things that can be done with a hard link that cannot be done with a soft link
or vice versa?
12. High-level name servers in DNS, that is, name servers implementing nodes in the
DNS name space that are close to the root, generally do not support recursive name
resolution. Can we expect much performance improvement if they did?
13. Explain how DNS can be used to implement a home-based approach to locating
mobile hosts. How is a mounting point looked up in most UNIX systems?
Chapter 6
Fault Tolerance
A characteristic feature of distributed systems that distinguishes them from single-machine systems is the notion of partial failure. A partial failure may happen when one
component in a distributed system fails. This failure may affect the proper operation of
other components, while at the same time leaving yet other components totally
unaffected. In contrast, a failure in non-distributed systems is often total in the sense
that it affects all components, and may easily bring down the entire system.
An important goal in distributed systems design is to construct the system in such a way
that it can automatically recover from partial failures without seriously affecting the
overall performance. In particular, whenever a failure occurs, the distributed system
should continue to operate in an acceptable way while repairs are being made that is, it
should tolerate faults and continue to operate to some extent even in their presence.
6.1 Introduction to Fault Tolerance
Fault tolerance has been subject to much research in computer science. In this section,
we start with presenting the basic concepts related to processing failures, followed by
a discussion of failure models. The key technique for handling failures is redundancy,
which is also discussed.
6.1.1 Basic Concepts
To understand the role of fault tolerance in distributed systems we first need to take a
closer look at what it actually means for a distributed system to tolerate faults. Being
fault tolerant is strongly related to what are called dependable systems. Dependability
is a term that covers a number of useful requirements for distributed systems including
the following:
1. Availability
2. Reliability
3. Safety
4. Maintainability
Availability is defined as the property that a system is ready to be used immediately.
In general, it refers to the probability that the system is operating correctly at any given
moment and is available to perform its functions on behalf of its users. In other words,
a highly available system is one that will most likely be working at a given instant in
time.
Reliability refers to the property that a system can run continuously without failure. In
contrast to availability, reliability is defined in terms of a time interval instead of an
instant in time. A highly-reliable system is one that will most likely continue to work
without interruption during a relatively long period of time. This is a subtle but
important difference when compared to availability.
If a system goes down for one millisecond every hour, it has an availability of over
99.9999 percent, but is still highly unreliable. Similarly, a system that never crashes but
is shut down for two weeks every August has high reliability but only 96 percent
availability. The two are not the same.
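Both figures can be checked with a line of arithmetic (my own calculation, matching the numbers quoted above):

down_per_hour = 0.001                        # one millisecond every hour, in seconds
avail_1 = 1 - down_per_hour / 3600
print(f"{avail_1:.7%}")                      # ~99.99997%: above 99.9999%, yet unreliable

down_per_year = 14 * 24 * 3600               # two weeks every August, in seconds
avail_2 = 1 - down_per_year / (365 * 24 * 3600)
print(f"{avail_2:.2%}")                      # ~96.16% availability, but high reliability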
Safety refers to the situation that when a system temporarily fails to operate correctly,
nothing catastrophic happens. For example, many process control systems, such as
those used for controlling nuclear power plants or sending people into space, are
required to provide a high degree of safety. If such control systems temporarily fail for
only a very brief moment, the effects could be disastrous.
Many examples from the past (and probably many more yet to come) show how hard
it is to build safe systems.
Finally, maintainability refers to how easily a failed system can be repaired. A highly
maintainable system may also show a high degree of availability, especially if failures
can be detected and repaired automatically.
Often, dependable systems are also required to provide a high degree of security, especially
when it comes to issues such as integrity. A system is said to fail when it cannot meet its
promises. In particular, if a distributed system is designed to provide its users with a
number of services, the system has failed when one or more of those services cannot be
(completely) provided.
An error is a part of a system's state that may lead to a failure. For example, when
transmitting packets across a network, it is to be expected that some packets have been
damaged when they arrive at the receiver. Damaged in this context means that the receiver
may incorrectly sense a bit value (e.g., reading a 1 instead of a 0), or may even be unable
to detect that something has arrived.
The cause of an error is called a fault. Clearly, finding out what caused an error is important.
For example, a wrong or bad transmission medium may easily cause packets to be
damaged. In this case, it is relatively easy to remove the fault.
However, transmission errors may also be caused by bad weather conditions such as in
wireless networks. Changing the weather to reduce or prevent errors is a bit complex.
Building dependable systems closely relates to controlling faults. A distinction can be made
between preventing, removing, and forecasting faults. For our purposes, the most important
issue is fault tolerance, meaning that a system can provide its services even in the presence
of faults. In other words, the system can tolerate faults and continue to operate normally.
Faults are generally classified as transient, intermittent, or permanent:
 Transient faults occur once and then disappear. If the operation is repeated, the fault
goes away. A bird flying through the beam of a microwave transmitter may cause lost
bits on some network (not to mention a roasted bird). If the transmission times out and
is retried, it will probably work the second time.
 An intermittent fault occurs, then vanishes of its own accord, then reappears, and so on.
A loose contact on a connector will often cause an intermittent fault. Intermittent faults
cause a great deal of aggravation because they are difficult to diagnose. Typically, when
the fault doctor shows up, the system works fine.
 A permanent fault is one that continues to exist until the faulty component is
replaced. Burnt-out chips, software bugs, and disk head crashes are examples of
permanent faults.
6.1.2 Failure Models
A system that fails is not adequately providing the services it was designed for. If we
consider a distributed system as a collection of servers that communicate with one
another and with their clients, not adequately providing services means that servers,
communication channels, or possibly both, are not doing what they are supposed to do.
However, a malfunctioning server itself may not always be the fault we are looking for.
If such a server depends on other servers to adequately provide its services, the cause
of an error may need to be searched for somewhere else.
Such dependency relations appear in abundance in distributed systems. A failing disk
may make life difficult for a file server that is designed to provide a highly available
file system. If such a file server is part of a distributed database, the proper working of
the entire database may be at stake, as only part of its data may be accessible.
To get a better grasp on how serious a failure actually is, several classification schemes
have been developed. One such scheme is shown in the following figure:
 A crash failure occurs when a server prematurely halts, but was working
correctly until it stopped. An important aspect of crash failures is that once the
server has halted, nothing is heard from it anymore. A typical example of a crash
failure is an operating system that comes to a grinding halt, and for which there
is only one solution: reboot it. Many personal computer systems suffer from
crash failures so often that people have come to expect them to be normal.
Consequently, moving the reset button from the back of a cabinet to the front
was done for good reason. Perhaps one day it can be moved to the back again,
or even removed altogether.
 An omission failure occurs when a server fails to respond to a request.
Several things might go wrong. In the case of a receive omission failure, possibly
the server never got the request in the first place. Note that it may well be the
case that the connection between a client and a server has been correctly
established, but that there was no thread listening to incoming requests. Also, a
receive omission failure will generally not affect the current state of the server,
as the server is unaware of any message sent to it.
Likewise, a send omission failure happens when the server has done its work,
but somehow fails in sending a response. Such a failure may happen, for
example, when a send buffer overflows while the server was not prepared for
such a situation.
Note that, in contrast to a receive omission failure, the server may now be in a
state reflecting that it has just completed a service for the client. As a
consequence, if the sending of its response fails, the server has to be prepared
for the client to reissue its previous request.
Other types of omission failures not related to communication may be caused by
software errors such as infinite loops or improper memory management by
which the server is said to "hang."
 Another class of failures is related to timing. Timing failures occur when the
response lies outside a specified real-time interval. Isochronous data streams
providing data too soon may easily cause trouble for a recipient if there is not
enough buffer space to hold all the incoming data. More common, however, is
that a server responds too late, in which case a performance failure is said to
occur.
 A serious type of failure is a response failure, by which the server's response is
simply incorrect. Two kinds of response failures may happen. In the case of a
value failure, a server simply provides the wrong reply to a request. For example,
a search engine that systematically returns Web pages not related to any of the
search terms used. The other type of response failure is known as a state
transition failure.
This kind of failure happens when the server reacts unexpectedly to an incoming
request. For example, if a server receives a message it cannot recognize, a state
transition failure happens if no measures have been taken to handle such
messages.
In particular, a faulty server may incorrectly take default actions it should never
have initiated.
 The most serious are arbitrary failures, also known as Byzantine failures. In
effect, when arbitrary failures occur, clients should be prepared for the worst. In
particular, it may happen that a server is producing output it should never have
produced, but which cannot be detected as being incorrect. Worse yet, a faulty
server may even be maliciously working together with other servers to produce
intentionally wrong answers.
This situation illustrates why security is also considered an important
requirement when talking about dependable systems.
Arbitrary failures are closely related to crash failures. The definition of crash
failures as presented above is the most benign way for a server to halt. They are
also referred to as fail-stop failures. In effect, a fail-stop server will simply stop
producing output in such a way that its halting can be detected by other
processes. In the best case, the server may have been so friendly to announce it
is about to crash; otherwise it simply stops.
Finally, there are also occasions in which the server is producing random output,
but this output can be recognized by other processes as plain junk. The server is
then exhibiting arbitrary failures, but in a benign way. These faults are also
referred to as being fail-safe. If a system is to be fault tolerant, the best it can do
is to try to hide the occurrence of failures from other processes.
The key technique for masking faults is to use redundancy. Three kinds are possible:
information redundancy, time redundancy, and physical redundancy:
 With information redundancy, extra bits are added to allow recovery from
garbled bits. For example, a Hamming code can be added to transmitted data to
recover from noise on the transmission line.
• With time redundancy, an action is performed and then, if need be, it is
performed again. Transactions use this approach. If a transaction aborts, it can
be redone with no harm. Time redundancy is especially helpful when the faults
are transient or intermittent.
 With physical redundancy, extra equipment or processes are added to make it
possible for the system as a whole to tolerate the loss or malfunctioning of some
components. Physical redundancy can thus be done either in hardware or in
software. For example, extra processes can be added to the system so that if a
small number of them crash, the system can still function correctly. Physical
redundancy is a well-known technique for providing fault tolerance.
It is used in biology (mammals have two eyes, two ears, two lungs, etc.), aircraft
(747s have four engines but can fly on three), and sports (multiple referees in
case one misses an event). It has also been used for fault tolerance in electronic
circuits for years;
6.2 Process Resilience
Let us concentrate on how fault tolerance can actually be achieved in distributed
systems. The first topic we discuss is protection against process failures, which is
achieved by replicating processes into groups. In the following pages, we consider the
general design issues of process groups, and discuss what a fault-tolerant group actually
is. Also, we look at how to reach agreement within a process group when one or more
of its members cannot be trusted to give correct answers.
6.2.1 Design Issues
The key approach to tolerating a faulty process is to organize several identical processes
into a group. The key property that all groups have is that when a message is sent to the
group itself, all members of the group receive it. In this way, if one process in a group
fails, hopefully some other process can take over for it.
Process groups may be dynamic. New groups can be created and old groups can be
destroyed. A process can join a group or leave one during system operation.
A process can be a member of several groups at the same time. Consequently,
mechanisms are needed for managing groups and group membership.
The purpose of introducing groups is to allow processes to deal with collections of
processes as a single abstraction. Thus a process can send a message to a group of
servers without having to know who they are or how many there are or where they are,
which may change from one call to the next.
Flat Groups versus Hierarchical Groups
An important distinction between different groups has to do with their internal structure.
In some groups, all the processes are equal. No one is boss and all decisions are made
collectively. In other groups, some kind of hierarchy exists. For example, one process
is the coordinator and all the others are workers. In this model, when a request for work
is generated, either by an external client or by one of the workers, it is sent to the
coordinator. The coordinator then decides which worker is best suited to carry it out,
and forwards it there. More complex hierarchies are also possible, of course. These
communication patterns are illustrated in the following figure:
Figure (a) Communication in a flat group
Figure (b) Communication in a simple hierarchical group
Each of these organizations has its own advantages and disadvantages. The flat group
is symmetrical and has no single point of failure. If one of the processes crashes, the
group simply becomes smaller, but can otherwise continue. A disadvantage is that
decision making is more complicated. For example, to decide anything, a vote often has
to be taken, incurring some delay and overhead.
The hierarchical group has the opposite properties. Loss of the coordinator brings the
entire group to a grinding halt, but as long as it is running, it can make decisions without
bothering everyone else.
Group Membership
When group communication is present, some method is needed for creating and
deleting groups, as well as for allowing processes to join and leave groups.
One possible approach is to have a group server to which all these requests can be sent.
The group server can then maintain a complete database of all the groups and their
exact membership. This method is straightforward, efficient, and fairly easy to
implement. Unfortunately, it shares a major disadvantage with all centralized
techniques: a single point of failure. If the group server crashes, group management
ceases to exist. Probably most or all groups will have to be reconstructed from scratch,
possibly terminating whatever work was going on.
The opposite approach is to manage group membership in a distributed way.
For example, if (reliable) multicasting is available, an outsider can send a message to
all group members announcing its wish to join the group.
Ideally, to leave a group, a member just sends a goodbye message to everyone.
In the context of fault tolerance, assuming fail-stop semantics is generally not
appropriate. The trouble is, there is no polite announcement that a process crashes as
there is when a process leaves voluntarily. The other members have to discover this
experimentally by noticing that the crashed member no longer responds to anything.
Once it is certain that the crashed member is really down (and not just slow), it can be
removed from the group.
Another tricky issue is that leaving and joining have to be synchronized with data
messages being sent. In other words, starting at the instant that a process has joined a
group, it must receive all messages sent to that group. Similarly, as soon as a process
has left a group, it must not receive any more messages from the group, and the other
members must not receive any more messages from it. One way of making sure that a
join or leave is integrated into the message stream at the right place is to convert this
operation into a sequence of messages sent to the whole group.
One final issue relating to group membership is what to do if so many machines go
down that the group can no longer function at all. Some protocol is needed to rebuild
the group. Invariably, some process will have to take the initiative to start the ball
rolling, but what happens if two or three try at the same time?
The protocol must be able to withstand this.
6.2.2 Failure Masking and Replication
Process groups are part of the solution for building fault-tolerant systems. In particular,
having a group of identical processes allows us to mask one or more faulty processes
in that group. In other words, we can replicate processes and organize them into a group
to replace a single (vulnerable) process with a (fault tolerant) group. As discussed in
the previous chapter, there are two ways to approach such replication: by means of
primary-based protocols, or through replicated-write protocols.
Primary-based replication in the case of fault tolerance generally appears in the form of
a primary-backup protocol. In this case, a group of processes is organized in a
hierarchical fashion in which a primary coordinates all write operations.
In practice, the primary is fixed, although its role can be taken over by one of the
backups if need be. In effect, when the primary crashes, the backups execute some
election algorithm to choose a new primary.
Replicated-write protocols are used in the form of active replication, as well as by
means of quorum-based protocols.
These solutions correspond to organizing a collection of identical processes into a flat
group. The main advantage is that such groups have no single point of failure, at the
cost of distributed coordination.
An important issue with using process groups to tolerate faults is how much replication
is needed. To simplify our discussion, let us consider only replicated-write systems. A
system is said to be k fault tolerant if it can survive faults in k components and still meet
its specifications. If the components, say processes, fail silently, then having k + 1 of
them is enough to provide k fault tolerance. If k of them simply stop, then the answer
from the other one can be used.
On the other hand, if processes exhibit Byzantine failures, continuing to run when sick
and sending out erroneous or random replies, a minimum of 2k + 1 processors are
needed to achieve k fault tolerance. In the worst case, the k failing processes could
accidentally (or even intentionally) generate the same reply.
However, the remaining k + 1 will also produce the same answer, so the client or voter
can just believe the majority.
Of course, in theory it is fine to say that a system is k fault tolerant and just let the k + 1
identical replies outvote the k identical replies, but in practice it is hard to imagine
circumstances in which one can say with certainty that k processes can fail but k + 1
processes cannot fail. Thus even in a fault-tolerant system some kind of statistical
analysis may be needed.
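To make the arithmetic above concrete, a minimal sketch of a client-side voter follows. With fail-silent replicas a single reply suffices, whereas with Byzantine replicas the client needs at least k + 1 identical replies out of the 2k + 1 it collects. The function name majority_vote is illustrative only.

```python
from collections import Counter

def majority_vote(replies, k):
    """Return the answer backed by at least k + 1 identical replies.

    With 2k + 1 replicas of which at most k are Byzantine, the k + 1
    correct replicas always form such a majority; raise otherwise.
    """
    answer, votes = Counter(replies).most_common(1)[0]
    if votes >= k + 1:
        return answer
    raise RuntimeError("no majority: more than k replicas may be faulty")

# Example: k = 1, so 2k + 1 = 3 replicas are used; one of them lies.
print(majority_vote([42, 42, 7], k=1))   # -> 42
```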
An implicit precondition for this model to be relevant is that all requests arrive at all
servers in the same order, also called the atomic multicast problem.
Actually, this condition can be relaxed slightly, since reads do not matter and some
writes may commute, but the general problem remains.
6.2.3 Agreement in Faulty Systems
Organizing replicated processes into a group helps to increase fault tolerance.
As we mentioned, if a client can base its decisions through a voting mechanism, we can
even tolerate that k out of 2k + 1 processes are lying about their result.
The assumption we are making, however, is that processes do not team up to produce a
wrong result. In general, matters become more intricate if we demand that a process
group reaches an agreement, which is needed in many cases. Some examples are:
electing a coordinator, deciding whether or not to commit a transaction, dividing up
tasks among workers, and synchronization, among numerous other possibilities.
When the communication and processes are all perfect, reaching such agreement is
often straightforward, but when they are not, problems arise.
The general goal of distributed agreement algorithms is to have all the non-faulty
processes reach consensus on some issue, and to establish that consensus within a finite
number of steps. The problem is complicated by the fact that different assumptions
about the underlying system require different solutions, assuming solutions even exist.
We can distinguish the following cases:
1. Synchronous versus asynchronous systems. A system is synchronous if and only if
the processes are known to operate in a lock-step mode. Formally, this means that there
should be some constant c ≥ 1, such that if any process has taken c + 1 steps, every
other process has taken at least 1 step. A system that is not synchronous is said to be
asynchronous.
2. Communication delay is bounded or not. Delay is bounded if and only if we know
that every message is delivered within a globally known and predetermined maximum time.
3. Message delivery is ordered or not. In other words, we distinguish the situation where
messages from the same sender are delivered in the order that they were sent, from the
situation in which we do not have such guarantees.
4. Message transmission is done through unicasting or multicasting.
As it turns out, reaching agreement is only possible for the situations shown in the
following figure. In all other cases, it can be shown that no solution exists. Note that most
distributed systems in practice assume that processes behave asynchronously, message
transmission is unicast, and communication delays are unbounded. As a consequence, we
need to make use of ordered (reliable) message delivery, such as provided by TCP. The
following figure illustrates the nontrivial nature of distributed agreement when processes
may fail.
The problem was originally studied by Lamport et al. and is also known as the
Byzantine agreement problem, referring to the numerous wars in which several
armies needed to reach agreement on, for example, troop strengths while being faced
with traitorous generals, conniving lieutenants, and so on.
6.2.4 Failure Detection
It may have become clear from our discussions so far that in order to properly mask
failures, we generally need to detect them as well. Failure detection is one of the
cornerstones of fault tolerance in distributed systems. What it all boils down to is that
for a group of processes, non-faulty members should be able to decide who is still a
member, and who is not. In other words, we need to be able to detect when a member
has failed.
When it comes to detecting process failures, there are essentially only two mechanisms.
Either processes actively send "are you alive?" messages to each other (for which they
obviously expect an answer), or passively wait until messages come in from different
processes. The latter approach makes sense only when it can be guaranteed that there
is enough communication between processes. In practice, actively pinging processes is
usually followed.
There has been a huge body of theoretical work on failure detectors. What it all boils
down to is that a timeout mechanism is used to check whether a process has failed. In
real settings, there are two major problems with this approach.
 First, due to unreliable networks, simply stating that a process has failed because
it does not return an answer to a ping message may be wrong. In other words, it
is quite easy to generate false positives. If a false positive has the effect that a
perfectly healthy process is removed from a membership list, then clearly we are
doing something wrong.
 Another serious problem is that timeouts are just plain crude. As noticed by
Birman, there is hardly any work on building proper failure detection subsystems
that take more into account than only the lack of a reply to a single message.
This statement is even more evident when looking at industry-deployed
distributed systems.
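A minimal sketch of such a timeout-based detector is given below, assuming a hypothetical primitive ping(member, timeout) that returns True when the member answers in time. Because timeouts are crude, the sketch only marks a process as suspected after several consecutive misses instead of removing it from the membership list outright.

```python
def suspecting_detector(members, ping, timeout=1.0, retries=3):
    """Ping every member; suspect those that miss `retries` pings in a row.

    `ping(member, timeout)` is an assumed primitive returning True if the
    member answered within `timeout` seconds.  Because timeouts generate
    false positives easily, a member is only *suspected*, not declared
    failed, after repeated misses.
    """
    suspected = set()
    for m in members:
        misses = 0
        for _ in range(retries):
            if ping(m, timeout):
                break
            misses += 1
        if misses == retries:
            suspected.add(m)
    return suspected

# Example with a fake ping: p3 never answers.
fake_ping = lambda m, t: m != "p3"
print(suspecting_detector(["p1", "p2", "p3"], fake_ping))   # -> {'p3'}
```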
There are various issues that need to be taken into account when designing a failure
detection subsystem. For example, failure detection can take place through gossiping
in which each node regularly announces to its neighbors that it is still up and running.
As we mentioned, an alternative is to let nodes actively probe each other.
Failure detection can also be done as a side-effect of regularly exchanging information
with neighbors, as is the case with gossip-based information dissemination.
This approach is essentially adopted in systems where processes periodically gossip their service
availability. This information is gradually disseminated through the network by
gossiping. Eventually, every process will know about every other process, but more
importantly, will have enough information locally available to decide whether a process
has failed or not. A member for which the availability information is old will
presumably have failed.
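The following sketch illustrates this gossip-style approach under simple assumptions: each process keeps, for every other process, the most recent (local) time at which news about it arrived, merges incoming gossip, and suspects any process whose information has grown too old. The class name GossipView and the threshold are illustrative.

```python
import time

class GossipView:
    """Sketch of gossip-style failure detection by aging availability info."""

    def __init__(self, fail_after=10.0):
        self.last_heard = {}          # process id -> local time of freshest news
        self.fail_after = fail_after  # seconds of silence before suspecting

    def merge(self, gossip):
        # `gossip` maps process ids to the (local) time they were last heard of.
        for p, t in gossip.items():
            self.last_heard[p] = max(self.last_heard.get(p, 0.0), t)

    def suspected(self, now=None):
        now = time.time() if now is None else now
        return {p for p, t in self.last_heard.items()
                if now - t > self.fail_after}

# Example: q was last heard of 30 seconds ago, r just now.
view = GossipView(fail_after=10.0)
now = time.time()
view.merge({"q": now - 30.0, "r": now})
print(view.suspected(now))   # -> {'q'}
```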
Another important issue is that a failure detection subsystem should ideally be able to
distinguish network failures from node failures. One way of dealing with this problem
is not to let a single node decide whether one of its neighbors has crashed. Instead, when
noticing a timeout on a ping message, a node requests other neighbors to see whether
they can reach the presumed failing node. Of course, positive information can also be
shared: if a node is still alive, that information can be forwarded to other interested
parties (who may be detecting a link failure to the suspected node).
This brings us to another key issue: when a member failure is detected, how should
other non-faulty processes be informed? One simple, and somewhat radical approach
is the one followed in FUSE. In FUSE, processes can be joined in a group that spans a
wide-area network. The group members create a spanning tree that is used for
monitoring member failures. Members send ping messages to their neighbors. When a
neighbor does not respond, the pinging node immediately switches to a state in which
it will also no longer respond to pings from other nodes. By recursion, it is seen that a
single node failure is rapidly promoted to a group failure notification. FUSE does not
suffer a lot from link failures for the simple reason that it relies on point-to-point TCP
connections between group members.
6.3 Reliable Client-Server Communication
In many cases, fault tolerance in distributed systems concentrates on faulty processes.
However, we also need to consider communication failures. Most of the failure models
discussed previously apply equally well to communication channels. In particular, a
communication channel may exhibit crash, omission, timing, and arbitrary failures. In
practice, when building reliable communication channels, the focus is on masking crash
and omission failures. Arbitrary failures may occur in the form of duplicate messages,
resulting from the fact that in a computer network messages may be buffered for a
relatively long time, and are reinjected into the network after the original sender has
already issued a retransmission.
6.3.1 Point-to-Point Communication
In many distributed systems, reliable point-to-point communication is established by
making use of a reliable transport protocol, such as TCP. TCP masks omission failures,
which occur in the form of lost messages, by using acknowledgments and
retransmissions. Such failures are completely hidden from a TCP client.
However, crash failures of connections are not masked. A crash failure may occur when
(for whatever reason) a TCP connection is abruptly broken so that no more messages
can be transmitted through the channel. In most cases, the client is informed that the
channel has crashed by raising an exception. The only way to mask such failures is to
let the distributed system attempt to automatically set up a new connection, by simply
resending a connection request. The underlying assumption is that the other side is still,
or again, responsive to such requests.
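A minimal sketch of this masking strategy, using Python's standard socket module: on a broken connection the client simply backs off and sets up a new connection, resending its request. The host name and port in the commented-out usage line are placeholders.

```python
import socket
import time

def send_reliably(host, port, payload, attempts=5, delay=1.0):
    """Mask a crashed TCP connection by simply setting up a new one.

    This works only under the assumption that the other side is still,
    or again, willing to accept connection requests.
    """
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=2.0) as s:
                s.sendall(payload)
                return True
        except OSError:
            time.sleep(delay)     # back off, then resend the connection request
    return False

# Hypothetical usage; 'example.local' and port 9000 are placeholders.
# send_reliably("example.local", 9000, b"hello")
```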
6.3.2 RPC Semantics in the Presence of Failures
Let us now take a closer look at client-server communication when using high-level
facilities such as Remote Procedure Calls (RPCs). The goal of RPC is to hide
communication by making remote procedure calls look just like local ones. With a few
exceptions, so far we have come fairly close. Indeed, as long as both client and server
are functioning perfectly, RPC does its job well.
The problem comes about when errors occur. It is then that the differences between
local and remote calls are not always easy to mask.
To structure our discussion, let us distinguish between five different classes of failures
that can occur in RPC systems, as follows:
1. The client is unable to locate the server.
2. The request message from the client to the server is lost.
3. The server crashes after receiving a request.
4. The reply message from the server to the client is lost.
5. The client crashes after sending a request.
Each of these categories poses different problems and requires different solutions.
Client Cannot Locate the Server
To start with, it can happen that the client cannot locate a suitable server. All servers
might be down, for example. Alternatively, suppose that the client is compiled using a
particular version of the client stub, and the binary is not used for a considerable period
of time. In the meantime, the server evolves and a new version of the interface is
installed; new stubs are generated and put into use. When the client is eventually run,
the binder will be unable to match it up with a server and will report failure. While this
mechanism is used to protect the client from accidentally trying to talk to a server that
may not agree with it in terms of what parameters are required or what it is supposed to
do, the problem remains of how this failure should be dealt with. One possible solution
is to have the error raise an exception; in some languages, the client can then catch the
exception and handle the failure explicitly.
Lost Request Messages
The second item on the list is dealing with lost request messages. This is the easiest one
to deal with: just have the operating system or client stub start a timer when sending the
request. If the timer expires before a reply or acknowledgment comes back, the message
is sent again. If the message was truly lost, the server will not be able to tell the
difference between the retransmission and the original, and everything will work fine.
Unless, of course, so many request messages are lost that the client gives up and falsely
concludes that the server is down, in which case we are back to "Cannot locate server."
If the request was not lost, the only thing we need to do is let the server be able to detect
it is dealing with a retransmission. Unfortunately, doing so is not so simple, as we
explain when discussing lost replies.
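The sketch below illustrates both halves of this mechanism under simple assumptions: the client stub starts a timer per attempt and resends on expiry, while the server remembers request identifiers so that it can recognize a retransmission and return the cached reply instead of redoing the work. The transport primitive and all names are illustrative.

```python
import uuid

def call_with_retries(transport, request, timeout=1.0, retries=3):
    """Client stub: start a timer per attempt and resend on expiry.

    `transport(request_id, request, timeout)` is an assumed primitive
    that returns the reply or raises TimeoutError.
    """
    request_id = str(uuid.uuid4())        # lets the server spot retransmissions
    for _ in range(retries):
        try:
            return transport(request_id, request, timeout)
        except TimeoutError:
            continue                      # timer expired: send the request again
    raise RuntimeError("server presumed down (cannot locate server)")

class Server:
    def __init__(self):
        self.seen = {}                    # request id -> cached reply

    def handle(self, request_id, request):
        if request_id in self.seen:       # retransmission: do not redo the work
            return self.seen[request_id]
        reply = f"done({request})"
        self.seen[request_id] = reply
        return reply

# Example wiring with an in-process "transport" (illustrative only).
server = Server()
print(call_with_retries(lambda rid, req, t: server.handle(rid, req), "print(job)"))
```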
Server Crashes
The next failure on the list is a server crash. The normal sequence of events at a server
is shown in the following figure: A request arrives, is carried out, and a reply is sent.
Now consider the following figure: A request arrives and is carried out, just as before,
but the server crashes before it can send the reply.
Finally, look at the following figure: Again a request arrives, but this time the server
crashes before it can even be carried out. And, of course, no reply is sent back.
The annoying part of the previous three figures is that the correct treatment differs for
each case.
In the second figure, the system has to report failure back to the client (e.g., raise an
exception), whereas in the third figure, it can just retransmit the request. The problem
is that the client's operating system cannot tell which is which. All it knows is that its
timer has expired.
To make the example concrete, suppose the client has asked the server to print some text,
and the server sends a completion message back when it is done. There are four strategies
the client can follow:
First, the client can decide to never reissue a request, at the risk that the text will not be
printed.
Second, it can decide to always reissue a request, but this may lead to its text being
printed twice.
Third, it can decide to reissue a request only if it did not yet receive an acknowledgment
that its print request had been delivered to the server. In that case, the client is counting
on the fact that the server crashed before the print request could be delivered.
The fourth and last strategy is to reissue a request only if it has received an
acknowledgment for the print request.
With two strategies for the server, and four for the client, there are a total of eight
combinations to consider. The following figure illustrates the different combinations of
client and server strategies in the presence of server crashes.
To explain, note that there are three events that can happen at the server:
 send the completion message (M),
 print the text (P), and
 crash (C).
These events can occur in six different orderings:
1. M → P → C: A crash occurs after sending the completion message and printing the text.
2. M → C (→ P): A crash happens after sending the completion message, but before the
text could be printed.
3. P → M → C: A crash occurs after printing the text and sending the completion message.
4. P → C (→ M): The text is printed, after which a crash occurs before the completion
message could be sent.
5. C (→ P → M): A crash happens before the server could do anything.
6. C (→ M → P): A crash happens before the server could do anything.
The parentheses indicate an event that can no longer happen because the server already
crashed. As can be readily verified, there is no combination of client strategy and server
strategy that will work correctly under all possible event sequences. The bottom line is
that the client can never know whether the server crashed just before or after having the
text printed.
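To make the impossibility tangible, the following small, purely illustrative Python enumeration tabulates, for both server orderings (send M then print P, or print P then send M) and every possible crash point, how many times the text ends up printed under each of the four client strategies, assuming a reissued request is served normally the second time. Every strategy yields zero or two prints for at least one event sequence, which is exactly the point made above.

```python
CLIENT_STRATEGIES = {
    "never reissue":        lambda acked: False,
    "always reissue":       lambda acked: True,
    "reissue if not ACKed": lambda acked: not acked,
    "reissue if ACKed":     lambda acked: acked,
}

def outcomes(server_order):
    """Yield (events done before the crash, text printed?, ack received?)."""
    for crash_after in range(3):            # 0, 1 or 2 server events completed
        done = server_order[:crash_after]
        yield done, ("P" in done), ("M" in done)

print(f"{'server order':<14}{'done':<8}{'client strategy':<24}prints")
for order in (("M", "P"), ("P", "M")):
    for done, printed, acked in outcomes(order):
        for name, reissue in CLIENT_STRATEGIES.items():
            # Assume the reissued request is served normally the second time.
            prints = int(printed) + int(reissue(acked))
            print(f"{'->'.join(order):<14}{' '.join(done) or '-':<8}"
                  f"{name:<24}{prints}")
```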
Client Crashes
The final item on the list of failures is the client crash. What happens if a client sends a
request to a server to do some work and crashes before the server replies? At this point
a computation is active and no parent is waiting for the result.
Such an unwanted computation is called an orphan.
Orphans can cause a variety of problems that can interfere with normal operation of the
system. As a bare minimum, they waste CPU cycles. They can also lock files or
otherwise tie up valuable resources. Finally, if the client reboots and does the RPC
again, but the reply from the orphan comes back immediately afterward, confusion can
result.
What can be done about orphans? Nelson (1981) proposed four solutions.
 In solution 1, before a client stub sends an RPC message, it makes a log entry
telling what it is about to do. The log is kept on disk or some other medium that
survives crashes. After a reboot, the log is checked and the orphan is explicitly
killed off. This solution is called orphan extermination.
 In solution 2, called reincarnation (see the sketch after this list), all these problems can be solved without the
need to write disk records. The way it works is to divide time up into sequentially
numbered epochs. When a client reboots, it broadcasts a message to all machines
declaring the start of a new epoch. When such a broadcast comes in, all remote
computations on behalf of that client are killed. Of course, if the network is
partitioned, some orphans may survive. Fortunately, however, when they report
back, their replies will contain an obsolete epoch number, making them easy to
detect.
 Solution 3 is a variant on this idea, but somewhat less draconian. It is called
gentle reincarnation. When an epoch broadcast comes in, each machine checks
to see if it has any remote computations running locally, and if so, tries its best
to locate their owners. Only if the owners cannot be located anywhere is the
computation killed.
 Finally, we have solution 4, expiration, in which each RPC is given a standard
amount of time, T, to do the job. If it cannot finish, it must explicitly ask for
another quantum, which is a nuisance. On the other hand, if after a crash the
client waits a time T before rebooting, all orphans are sure to be gone. The
problem to be solved here is choosing a reasonable value of T in the face of RPCs
with wildly differing requirements.
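The sketch below illustrates the epoch idea behind reincarnation (solution 2) under simple assumptions: every remote computation is tagged with the epoch in which it was started, and an epoch broadcast from a rebooted client causes work from older epochs to be killed; an orphan surviving a network partition would report back with an obsolete epoch number and be easy to discard. All names are illustrative.

```python
class Worker:
    """One machine's view of remote computations under epoch-based reincarnation.

    Illustrative only: epoch numbers let a rebooted client invalidate the
    orphans it left behind without keeping any on-disk log.
    """

    def __init__(self):
        self.current_epoch = 0
        self.computations = {}     # task id -> epoch in which it was started

    def start(self, task_id):
        self.computations[task_id] = self.current_epoch

    def new_epoch(self, epoch):
        # Broadcast from a rebooting client: kill all work from older epochs.
        self.current_epoch = max(self.current_epoch, epoch)
        self.computations = {t: e for t, e in self.computations.items()
                             if e >= self.current_epoch}

    def reply(self, task_id):
        # A surviving orphan would reply with an obsolete epoch number;
        # here the orphan has already been killed, so -1 is returned.
        return task_id, self.computations.get(task_id, -1)

w = Worker()
w.start("rpc-1")                 # started on behalf of a client in epoch 0
w.new_epoch(1)                   # the client rebooted and declared epoch 1
print(w.reply("rpc-1"))          # -> ('rpc-1', -1): the orphan was killed
```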
6.4 Recovery
So far, we have mainly concentrated on algorithms that allow us to tolerate faults.
However, once a failure has occurred, it is essential that the process where the failure
happened can recover to a correct state. In what follows, we first concentrate on what it
actually means to recover to a correct state, and subsequently discuss when and how the
state of a distributed system can be recorded and recovered, by means of checkpointing
and message logging.
6.4.1 Introduction
Fundamental to fault tolerance is the recovery from an error. Recall that an error is that
part of a system that may lead to a failure. The whole idea of error recovery is to replace
an erroneous state with an error-free state. There are essentially two forms of error
recovery:
 In backward recovery, the main issue is to bring the system from its present
erroneous state back into a previously correct state. To do so, it will be necessary
to record the system's state from time to time, and to restore such a recorded state
when things go wrong. Each time (part of) the system's present state is recorded,
a checkpoint is said to be made.
 Another form of error recovery is forward recovery. In this case, when the
system has entered an erroneous state, instead of moving back to a previous,
checkpointed state, an attempt is made to bring the system in a correct new state
from which it can continue to execute. The main problem with forward error
recovery mechanisms is that it has to be known in advance which errors may
occur. Only in that case is it possible to correct those errors and move to a new
state.
The distinction between backward and forward error recovery is easily explained when
considering the implementation of reliable communication.
The common approach to recover from a lost packet is to let the sender retransmit that
packet. In effect, packet retransmission establishes that we attempt to go back to a
previous, correct state, namely the one in which the packet that was lost is being sent.
Reliable communication through packet retransmission is therefore an example of
applying backward error recovery techniques.
An alternative approach is to use a method known as erasure correction. In this
approach, a missing packet is constructed from other, successfully delivered packets.
For example, in an (n,k) block erasure code, a set of k source packets is encoded into a
set of n encoded packets, such that any set of k encoded packets is enough to reconstruct
the original k source packets. Typical values are k = 16 or
k = 32, and k ≤ n ≤ 2k. If not enough packets have yet been delivered, the sender will have
to continue transmitting packets until a previously lost packet can be constructed.
Erasure correction is a typical example of a forward error recovery approach.
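As a toy illustration of forward recovery, the sketch below implements a (k + 1, k) parity code rather than a general (n, k) block code: one XOR parity packet is appended, and any k of the k + 1 packets suffice to rebuild a single lost packet without retransmission.

```python
def encode(packets):
    """Toy (k+1, k) erasure code: append a single XOR parity packet.

    Any k of the k + 1 encoded packets suffice to rebuild the originals.
    Real systems use general (n, k) codes, e.g. Reed-Solomon, but the
    forward-recovery idea is the same.
    """
    parity = 0
    for p in packets:
        parity ^= p
    return packets + [parity]

def reconstruct(received_k_packets):
    """Rebuild the one missing packet from any k received packets."""
    missing = 0
    for p in received_k_packets:
        missing ^= p
    return missing

data = [3, 7, 11, 200]                   # k = 4 source packets (as small ints)
encoded = encode(data)                   # n = 5 packets go on the wire
received = encoded[:2] + encoded[3:]     # the packet with value 11 was lost
print(reconstruct(received))             # -> 11, recovered without retransmission
```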
By and large, backward error recovery techniques are widely applied as a general
mechanism for recovering from failures in distributed systems. The major benefit of
backward error recovery is that it is a generally applicable method independent of any
specific system or process.
However, backward error recovery also introduces some problems:
 First, restoring a system or process to a previous state is generally a relatively
costly operation in terms of performance.
 Second, because backward error recovery mechanisms are independent of the
distributed application for which they are actually used, no guarantees can be
given that, once recovery has taken place, the same or a similar failure will not
occur again.
 Finally, although backward error recovery requires checkpointing, some states
can simply never be rolled back to.
6.4.2 Checkpointing
In a fault-tolerant distributed system, backward error recovery requires that the system
regularly saves its state onto stable storage. In particular, we need to record a consistent
global state, also called a distributed snapshot. In a distributed snapshot, if a process P
has recorded the receipt of a message, then there should also be a process Q that has
recorded the sending of that message. After all, it must have come from somewhere.
The following figure illustrates the recovery line: In backward error recovery schemes,
each process saves its state from time to time to a locally-available stable storage. To
recover after a process or system failure requires that we construct a consistent global
state from these local states.
In particular, it is best to recover to the most recent distributed snapshot, also referred
to as a recovery line. In other words, a recovery line corresponds to the most recent
consistent collection of checkpoints, as shown in the coming figure.
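The following sketch makes the consistency condition explicit under simplified assumptions: each local checkpoint records the sets of message identifiers sent and received so far, a cut is consistent if every recorded receipt has a matching recorded send, and the recovery line is found here by brute force over all combinations (feasible only for tiny examples). All names are illustrative.

```python
from itertools import product

def consistent(cut, sent, received):
    """A set of local checkpoints forms a consistent global state iff every
    message recorded as received is also recorded as sent.

    `cut[p]` is the chosen checkpoint index of process p; `sent[p][i]` and
    `received[p][i]` are the message-id sets recorded up to checkpoint i.
    """
    all_sent = set().union(*(sent[p][cut[p]] for p in cut))
    all_recv = set().union(*(received[p][cut[p]] for p in cut))
    return all_recv <= all_sent

def recovery_line(sent, received):
    """Brute-force the most recent consistent cut (tiny examples only)."""
    procs = list(sent)
    best = None
    for idxs in product(*(range(len(sent[p])) for p in procs)):
        cut = dict(zip(procs, idxs))
        if consistent(cut, sent, received):
            if best is None or sum(idxs) > sum(best.values()):
                best = cut
    return best

# P logged sending m1 only in its 2nd checkpoint; Q logged receiving m1 in its
# 2nd checkpoint, so the recovery line must include both of those checkpoints.
sent     = {"P": [set(), {"m1"}], "Q": [set(), set()]}
received = {"P": [set(), set()],  "Q": [set(), {"m1"}]}
print(recovery_line(sent, received))     # -> {'P': 1, 'Q': 1}
```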
6.4.3 Message Logging
Considering that checkpointing is an expensive operation, especially concerning the
operations involved in writing state to stable storage, techniques have been sought to
reduce the number of checkpoints, but still enable recovery. An important technique in
distributed systems is logging messages. The basic idea underlying message logging is
that if the transmission of messages can be replayed, we can still reach a globally
consistent state but without having to restore that state from stable storage. Instead, a
checkpointed state is taken as a starting point, and all messages that have been sent
since are simply retransmitted and handled accordingly.
This approach works fine under the assumption of what is called a piecewise
deterministic model. In such a model, the execution of each process is assumed to take
place as a series of intervals in which events take place.
For example, an event may be the execution of an instruction, the sending of a message,
and so on. Each interval in the piecewise deterministic model is assumed to start with
a nondeterministic event, such as the receipt of a message. However, from that moment
on, the execution of the process is completely deterministic. An interval ends with the
last event before a nondeterministic event occurs.
In effect, an interval can be replayed with a known result, that is, in a completely
deterministic way, provided it is replayed starting with the same nondeterministic event
as before. Consequently, if we record all nondeterministic events in such a model, it
becomes possible to completely replay the entire execution of a process in a
deterministic way.
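The sketch below illustrates this replay idea under the piecewise deterministic model, with strong simplifications: the only nondeterministic events are message deliveries, each of which is logged, and processing a message updates the state deterministically, so replaying the log from a checkpoint reproduces the pre-crash state.

```python
class LoggedProcess:
    """Sketch of replay under the piecewise deterministic model.

    Each received message is a nondeterministic event; if all of them are
    logged, execution after the last checkpoint can be replayed exactly.
    """

    def __init__(self):
        self.state = 0
        self.log = []                     # logged nondeterministic events

    def checkpoint(self):
        return self.state                 # pretend this goes to stable storage

    def deliver(self, message, replay=False):
        if not replay:
            self.log.append(message)      # log the event before handling it
        self.state += message             # deterministic handling of the event

    def recover(self, checkpoint):
        self.state = checkpoint
        for message in self.log:          # re-handle events in original order
            self.deliver(message, replay=True)

p = LoggedProcess()
saved = p.checkpoint()                    # state = 0 saved to stable storage
p.deliver(5); p.deliver(7)                # state = 12, both messages logged
p.recover(saved)                          # crash; replay from the checkpoint
print(p.state)                            # -> 12, the same state as before
```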
Considering that message logs are necessary to recover from a process crash so that a
globally consistent state is restored, it becomes important to know precisely when
messages are to be logged. It turns out that many existing message-logging schemes
can be easily characterized, if we concentrate on how they deal with orphan processes.
An orphan process is a process that survives the crash of another process, but whose
state is inconsistent with the crashed process after its recovery. As an example, consider
the situation shown in the following figure. Process Q receives messages m1 and m2 from
processes P and R, respectively, and subsequently sends a message m3 to R. However, in
contrast to all other messages, message m2 is not logged. If process Q crashes and later
recovers again, only the logged messages required for the recovery of Q are replayed, in
this example m1. Because m2 was not logged, its transmission will not be replayed,
meaning that the transmission of m3 also may not take place, as shown in the next figure.
Incorrect replay of messages after recovery, leading to an orphan process.
However, the situation after the recovery of Q is inconsistent with that before its
recovery. In particular, R holds a message (m 3) that was sent before the crash, but
whose receipt and delivery do not take place when replaying what had happened before
the crash. Such inconsistencies should obviously be avoided.
Revision Sheet # 6
PROBLEMS
1. Dependable systems are often required to provide a high degree of security.
Why?
2. What makes the fail-stop model in the case of crash failures so difficult to
implement?
3. For each of the following applications, do you think at-least-once semantics or
at-most-once semantics is best? Discuss.
(a) Reading and writing files from a file server.
(b) Compiling a program.
(c) Remote banking.
4. To what extent is scalability of atomic multicasting important?
5. Virtual synchrony is analogous to weak consistency in distributed data stores,
with group view changes acting as synchronization points. In this context, what
would be the analog of strong consistency?
6. Explain how the write-ahead log in distributed transactions can be used to
recover from failures.
7. Does a stateless server need to take checkpoints?