Distributed Systems - CS 422
Prof. Dr. Hesham El-Deeb
2022/2023

Contents

Chapter 1  Introduction .... 1
  1.1 Processing Types .... 3
      Pipelining and parallelism .... 5
      Analogy Simple Rules .... 7
      Example #2 Parallel Processing .... 7
      Static Connection #1 (2-d hypercube) .... 8
      Static Connection #2 (3-d hypercube) .... 9
      Parallel Efficiency .... 9
  1.2 Distributed System Fundamentals .... 10
      Distributed Systems Layout .... 10
      Network Performance .... 11
      Middleware .... 11
      Distributed Systems Main Features .... 12
      Distributed Systems Advantages .... 13
      Distributed Systems Disadvantages .... 13
  1.3 Goals of Distributed Systems .... 18
      1.3.1 Heterogeneity .... 18
      1.3.2 Openness .... 19
      1.3.3 Security .... 19
      1.3.4 Scalability .... 19
      1.3.5 Failure Handling .... 21
      1.3.6 Concurrency .... 21
      1.3.7 Transparency (allowable) .... 22
  1.4 Distributed System Model .... 23
  1.5 Hardware & Software Concepts in DS .... 25
      1.5.1 Hardware Concepts .... 26
      1.5.2 Software Concepts .... 28
  1.6 Distributed Systems Types .... 29
      1.6.1 Distributed Computing Systems (DCS) .... 29
      1.6.2 Distributed Information Systems (DIS) .... 34
      1.6.3 Distributed Pervasive (spreading) Systems (DPS) .... 36
  Revision Sheet # 1 (Problems) .... 40
  Assignments # 1 to # 4 .... 41
Chapter 2  Distributed Systems Architectures .... 42
  2.1 Basic Definitions .... 44
  2.2 Distributed Systems Architectures .... 45
  2.3 Multiprocessor Architectures .... 45
  2.4 Client-Server Architectures .... 48
      2.4.1 Two-Tier (Layer) Thin and Fat Clients .... 57
      2.4.2 Three-Tier Architectures .... 61
  2.5 Distributed Object Architectures .... 64
  2.6 CORBA Architecture .... 68
      2.6.1 CORBA Goal .... 69
      2.6.2 CORBA Architecture .... 70
      2.6.3 CORBA Application Structure .... 71
      2.6.4 CORBA Standards .... 72
      2.6.5 CORBA Objects .... 72
      2.6.6 CORBA Services .... 73
      2.6.7 CORBA Products .... 73
      2.6.8 CORBA Servant Class .... 76
      2.6.9 CORBA Server Class - 1 .... 77
      2.6.10 CORBA Client .... 78
  Revision Sheet # 2 (Assignment # 5) .... 79
Chapter 3  Synchronization .... 80
  3.1 Clock Synchronization .... 80
      3.1.1 Measuring the Time .... 83
      3.1.2 GPS and UTC .... 86
  3.2 Logical vs. Physical Clocks .... 91
      3.2.1 Logical Clock .... 92
      3.2.2 Physical Clock .... 96
  3.3 Clock Synchronization Algorithms .... 99
      3.3.1 Cristian's Algorithm (1989) .... 99
      3.3.2 The Berkeley Algorithm (1989) .... 102
  3.4 Parallel and Distributed Processing Problems .... 103
      3.4.1 Mutual Exclusion Algorithm Outline .... 103
      3.4.2 Distributed Termination Problem .... 105
      3.4.3 The Byzantine Generals Problem (BGP) .... 106
  3.5 Election Algorithms .... 108
      3.5.1 Leader Election Problem .... 109
      3.5.2 Leader Election in Synchronous Rings .... 109
      3.5.3 Synchronous Message-Passing Model .... 110
      3.5.4 Simple Leader Election Algorithm .... 110
  Revision Sheet # 3 .... 113
  Assignments # 6 to # 12 .... 114-115
Chapter 4  The Distributed Algorithms .... 116
  4.1 Introduction .... 116
  4.2 Variations of the PRAM Model .... 117
      4.2.1 PRAM Model for Parallel Computations .... 117
      4.2.2 READ & WRITE in PRAM .... 118
      4.2.3 PRAM Subclasses .... 118
  4.3 Simulating Multiple Accesses on an EREW PRAM .... 120
      4.3.1 Algorithm Broadcast_EREW .... 121
  4.4 Computing Sum and All Partial Sums .... 122
      4.4.1 Sum of an Array of Numbers on the EREW Model .... 123
      4.4.2 All Partial Sums of an Array .... 125
  4.5 The Sorting Algorithm .... 127
      4.5.1 n^2 Sorting Algorithm .... 127
      4.5.2 Algorithm Sort_CRCW (Ascending) .... 127
  4.6 Message-Passing Models and Algorithms .... 129
      4.6.1 Message-Passing Computing Models .... 129
  Revision Sheet # 4 (Assignment # 13) .... 132
Chapter 5  Naming in Distributed Systems .... 133
  5.1 Naming Basic Concept .... 134
  5.2 Naming Types (identification, description, location) .... 135
      5.2.1 Uniform Resource Identifier (URI) (URL/URN/URL+URN) .... 136
  5.3 Naming Implementation Approaches .... 145
      5.3.1 Flat Naming Approach .... 146
      5.3.2 Structured Naming .... 152
      5.3.3 Attribute-Based Naming .... 156
  Revision Sheet # 5 .... 158
Chapter 6  Fault Tolerance .... 159
  6.1 Introduction to Fault Tolerance .... 159
      6.1.1 Basic Concepts .... 159
      6.1.2 Failure Models .... 162
  6.2 Process Flexibility .... 166
      6.2.1 Design Issues .... 166
      6.2.2 Failure Masking and Replication .... 168
      6.2.3 Agreement in Faulty Systems .... 170
      6.2.4 Failure Detection .... 172
  6.3 Reliable Client-Server Communication .... 174
      6.3.1 Point-to-Point Communication .... 174
      6.3.2 RPC Semantics in the Presence of Failures .... 174
  6.4 Recovery .... 180
      6.4.1 Introduction .... 180
      6.4.2 Checkpointing .... 181
      6.4.3 Message Logging .... 182
  Revision Sheet # 6 .... 185

Chapter 1
Introduction

Computer systems are undergoing a revolution. From 1945, when the modern computer era began, until about 1985, computers were large and expensive. Even minicomputers cost at least tens of thousands of dollars each. As a result, most organizations had only a handful of computers, and for lack of a way to connect them, these operated independently from one another. Starting around the mid-1980s, however, two advances in technology began to change that situation.

The first was the development of powerful microprocessors. Initially, these were 8-bit machines, but soon 16-, 32-, and 64-bit CPUs became common. Many of these had the computing power of a mainframe (i.e., large) computer, but for a fraction of the price. The amount of improvement that has occurred in computer technology in the past half century is truly staggering and totally unprecedented in other industries. From a machine that cost 10 million dollars and executed 1 instruction per second, we have come to machines that cost 1,000 dollars and are able to execute 1 billion instructions per second, a price/performance gain of 10^13. If cars had improved at this rate in the same time period, a Rolls Royce would now cost 1 dollar and get a billion miles per gallon. (Unfortunately, it would probably also have a 200-page manual telling how to open the door.)

The second development was the invention of high-speed computer networks. Local-area networks or LANs allow hundreds of machines within a building to be connected in such a way that small amounts of information can be transferred between machines in a few microseconds or so. Larger amounts of data can be moved between machines at rates of 100 million to 10 billion bits/sec. Wide-area networks or WANs allow millions of machines all over the earth to be connected at speeds varying from 64 Kbps (kilobits per second) to gigabits per second.

The result of these technologies is that it is now not only feasible, but easy, to put together computing systems composed of large numbers of computers connected by a high-speed network. They are usually called computer networks or distributed systems, in contrast to the previous centralized systems (or single-processor systems) consisting of a single computer, its peripherals, and perhaps some remote terminals.

In a distributed or parallel program, we try to take advantage of parallelism by dividing the (sequential) program into as many tasks as the program's correctness will allow and then running one or more of these tasks, some of which can run simultaneously, on more than one processor. If the distributed system uses all its resources to run tasks from only one program at a time, we call it parallel processing. If the distributed system shares its resources with tasks from many independent programs, we call it distributed processing. Parallel processing is the use of concurrency in the operation of a computer system to increase throughput, increase fault tolerance, or reduce the time needed to solve particular problems. Parallel processing is the only route to reach the highest levels of computer performance.
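To make the distinction concrete, here is a minimal sketch (not from the course text; the task function and workload sizes are invented for illustration) of running the same set of independent tasks first sequentially on one processor and then in parallel across several worker processes, using Python's standard multiprocessing module.

```python
import time
from multiprocessing import Pool

def task(n):
    """Stand-in for one unit of work (a hypothetical busy loop)."""
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    jobs = [2_000_000] * 8                 # eight independent tasks

    start = time.perf_counter()
    seq = [task(n) for n in jobs]          # sequential: one processor, one task at a time
    t_seq = time.perf_counter() - start

    start = time.perf_counter()
    with Pool(processes=4) as pool:        # parallel: the same tasks spread over 4 workers
        par = pool.map(task, jobs)
    t_par = time.perf_counter() - start

    assert seq == par                      # same results, obtained faster
    print(f"sequential: {t_seq:.2f}s  parallel: {t_par:.2f}s  speedup: {t_seq / t_par:.2f}")
```

On a machine with at least four cores the parallel run should finish in roughly a quarter of the sequential time, the difference being the cost of creating and scheduling the worker processes.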
Physical laws and manufacturing capabilities limit the switching times and integration densities of current semiconductor-based devices, putting a ceiling on the speed at which any single device can operate. For this reason, all modern computers rely upon parallelism to some extent. The fastest computers exhibit parallelism at many levels. We begin by describing pipelining and parallelism, the two traditional methods used to increase concurrency in computer systems. We survey low-level and high-level parallel processing mechanisms that appear in hardware, and we examine some of the most popular processor interconnection topologies. The final sections discuss parallelism in software. We describe the generation and coordination of software processes and the problem of scheduling the execution of these processes on actual parallel hardware.

1.1 Processing Types

Sequential (one processor) refers to integrating and processing work in a particular, serial order: both the intake of inputs in sequence and the subsequent production of results in a specific arrangement fall under successive (sequential) processing.

Pipelining (one processor with a shifting mechanism): instruction pipelining is a technique for implementing instruction-level parallelism within a single processor. Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming instructions into a series of sequential steps (the eponymous "pipeline") performed by different processor units, with different parts of instructions processed in parallel.

Parallel (more than one processor, one partitioned program) is a method in computing of running two or more processors (CPUs) to handle separate parts of an overall task. Breaking up different parts of a task among multiple processors helps reduce the amount of time needed to run a program.

Distributed (more than one processor, more than one partitioned program) is a phrase used to refer to a variety of computer systems that use more than one computer (or processor) to run an application. This includes parallel processing, in which a single computer uses more than one CPU to execute programs.

Grid computing is the use of widely distributed computer resources to reach a common goal. A computing grid can be thought of as a distributed system with non-interactive workloads that involve many files. Grid computing is distinguished from conventional high-performance computing systems such as cluster computing in that grid computers have each node set to perform a different task/application. Grid computers also tend to be more heterogeneous and geographically dispersed (thus not physically coupled) than cluster computers. Although a single grid can be dedicated to a particular application, commonly a grid is used for a variety of purposes. Grids are often constructed with general-purpose grid middleware software libraries. Grid sizes can be quite large.

Grids are a form of distributed computing whereby a "super virtual computer" is composed of many networked, loosely coupled computers acting together to perform large tasks. For certain applications, distributed or grid computing can be seen as a special type of parallel computing that relies on complete computers (with onboard CPUs, storage, power supplies, network interfaces, etc.) connected to a computer network (private or public) by a conventional network interface, such as Ethernet.
This is in contrast to the traditional notion of a supercomputer, which has many processors connected by a local high-speed computer bus.

The processing performance is identified by the following formula:

Processing Performance = F(X, Y, Z)

where:
  X = number of processors
  Y = data dependency
  Z = cost and overhead of the scheduling policy

Pipelining and parallelism

To reduce the time needed for a mechanism to perform a task, we must either increase the speed of the mechanism or introduce concurrency. Two traditional methods have been used to increase concurrency: pipelining and parallelism. If an operation can be divided into a number of stages, pipelining allows different tasks to be in different stages of completion. An automobile assembly line is an example of pipelining. Parallelism is the use of multiple resources to increase concurrency. A group of combines working together to harvest a wheat field is an example of parallelism. To illustrate and contrast these two fundamental methods for increasing concurrency, we present the following pizza-baking example.

Suppose a pizza requires 10 minutes to bake. An oven that holds a single pizza can yield 6 baked pizzas an hour. To increase the number of pizzas baked per hour, either the baking time must be reduced or a way must be found to have more than one pizza baking at a time. (Assume that quality-control constraints prevent us from raising the oven's temperature in order to reduce the baking time.)

One way to increase production is through the use of parallelism. If 5 ovens are used, the ovens yield 5 pizzas every 10 minutes and 30 pizzas an hour. Note that the 5 ovens are used most efficiently if the number of pizzas needed is a multiple of 5. For example, the ovens require the same amount of time (20 minutes) to produce 6, 7, 8, 9, or 10 pizzas.

Another way to increase production is through the use of pipelining. Imagine a conveyor belt running through a long pizza oven. A pizza placed at one end of the conveyor belt spends 10 minutes in the oven before it reaches the other end. If the conveyor belt has room for 5 pizzas, a cook can place an unbaked pizza at one end of the belt every 2 minutes. Ten minutes after the first pizza has been put into one end of the oven, it appears as a baked pizza at the other end. From that time on, another baked pizza will appear every two minutes, and the production of the oven will be 30 pizzas an hour.

The pizza-baking speeds of the single-oven, parallel-oven, and pipelined-oven methods are compared in Table 1. The speedup achieved is the ratio between the time needed for the single pizza oven to produce some number of pizzas and the time needed to produce the same number of pizzas using pipelining and/or parallelism. The table contrasts the pizza-baking times (in minutes) of a single oven (sequential), five ovens (parallel), and a conveyor-belt oven (pipelining).

Table 1. Pizza-baking times for a single oven, five ovens, and a conveyor-belt oven.

Pizzas baked   Single oven (sequential)   Five ovens (parallel)   Conveyor oven (pipelining)
     1                  10                        10                        10
     2                  20                        10                        12
     3                  30                        10                        14
     4                  40                        10                        16
     5                  50                        10                        18
     6                  60                        20                        20
     7                  70                        20                        22
     8                  80                        20                        24
     9                  90                        20                        26
    10                 100                        20                        28
    11                 110                        30                        30
    12                 120                        30                        32
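The timings in Table 1 can be checked with a few lines of code. The sketch below is only an illustration of the arithmetic behind the table (the oven count, bake time, and conveyor slot time are taken from the example above); it is not part of the original text.

```python
import math

BAKE = 10            # minutes to bake one pizza
OVENS = 5            # number of parallel ovens
SLOT = BAKE / OVENS  # a conveyor-belt oven admits a new pizza every 2 minutes

def single_oven(n):
    """Sequential: one pizza at a time."""
    return n * BAKE

def five_ovens(n):
    """Parallel: batches of up to 5 pizzas bake together."""
    return math.ceil(n / OVENS) * BAKE

def conveyor(n):
    """Pipelining: 10 minutes for the first pizza, then one every 2 minutes."""
    return BAKE + (n - 1) * SLOT

for n in (1, 2, 5, 6, 10, 12):
    t_seq, t_par, t_pipe = single_oven(n), five_ovens(n), conveyor(n)
    print(f"{n:2d} pizzas: single {t_seq:3.0f}  five ovens {t_par:3.0f}  "
          f"conveyor {t_pipe:3.0f}  pipeline speedup {t_seq / t_pipe:.2f}")
```

For 12 pizzas, for example, the speedup over the single oven is 120/30 = 4 with five ovens and 120/32 = 3.75 with the conveyor oven, exactly as in Table 1.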
Analogy Simple Rules

Example #2: Parallel Processing
The diagram below shows the program graph.
[Figure: program graph for Example #2]

Static Connection #1 (2-d hypercube)
[Figure: 2-d hypercube interconnection]

Static Connection #2 (3-d hypercube)
[Figure: 3-d hypercube interconnection]

Parallel Efficiency
Speedup and utilization are not the only parallel performance metrics. Parallel efficiency is one of the most important parallel performance metrics:

Parallel efficiency = sequential execution time / (parallel execution time x number of processors)

1.2 Distributed System Fundamentals

A distributed system is a collection of independent computers that appears to its users as a single coherent system. A distributed system is defined as one in which components at networked computers communicate and coordinate their actions only by passing messages. This definition allows for the concurrent execution of programs, but it excludes the possibility of a global clock and means that components can fail independently of one another (the fault tolerance concept). Shared resources are managed by server processes, which provide client processes with access to those resources via a well-defined set of operations. In a distributed system written in an object-oriented language, resources may be encapsulated as objects whose methods are invoked by client objects.

Distributed Systems Layout
[Figure: distributed systems layout]

Network Performance
[Figure: typical performance of different network types]
where:
• WPAN: wireless personal area network
• WMAN: wireless metropolitan area network. A Metropolitan Area Network (MAN) interconnects users with computer resources in a geographic area or region larger than a large Local Area Network (LAN) but smaller than a Wide Area Network (WAN).

The diagram below shows a distributed system organized as middleware. Note that the middleware layer extends over multiple machines.
[Figure: a distributed system organized as middleware, extending over multiple machines]

Middleware
Middleware is an aspect of distributed computing, defined as computer software that connects software components or applications. This software consists of a set of enabling services that allow multiple processes running on one or more machines to interact across a network, i.e., it connects parts of an application and enables requests and data to pass between them. It includes web servers, application servers, and similar tools that support application development and delivery.

Usage of middleware
Middleware services provide a more functional set of application programming interfaces that allow an application to:
1. interact with another service or application
2. be independent of network services
3. be reliable and always available, compared with the bare operating system and network services

Middleware Types
Hurwitz's classification system organizes the types of middleware based on scalability and recoverability:
1. Remote Procedure Call: the client makes calls to procedures running on remote systems.
2. Message-Oriented Middleware: messages sent to the client are collected and stored until they are acted upon, while the client continues with other processing.
3. Object Request Broker: makes it possible for applications to send objects and request services in an object-oriented system.
4. SQL-oriented Data Access: middleware between applications and database servers.
5. Embedded Middleware: communication services and integration interface software/firmware that operates between embedded applications and the real-time operating system.
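As a concrete illustration of the first category above (Remote Procedure Call middleware), the sketch below uses Python's standard xmlrpc modules. The port number and the add procedure are invented for the example; real RPC middleware adds naming, security, and fault handling on top of this basic pattern.

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def add(x, y):
    """The remote procedure offered by the server."""
    return x + y

# Server side: register the procedure and serve requests in the background.
server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False, allow_none=True)
server.register_function(add, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the call looks local; the middleware marshals it into a request
# message, sends it over the network, and unmarshals the reply.
client = ServerProxy("http://localhost:8000")
print(client.add(2, 3))   # -> 5
server.shutdown()
```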
Distributed Systems Main Features
• geographical distribution of autonomous computers
• communication through cable/fiber/wireless/... connections

Distributed Systems Advantages
• interaction and co-operation
• sharing of resources
• reduced costs
• improved availability and performance
• scalability
• fault tolerance

Distributed System Disadvantages
• Complexity: typically, distributed systems are more complex than centralized systems
• Security: more susceptible to external attack
• Manageability: more effort required for system management
• Unpredictability: unpredictable responses depending on the system organization and network load

Urgency of Distributed Computing
Distributed (computer) systems are critical for the functioning of many organizations. A distributed application is a set of processes that are distributed across a network of machines and work together as an ensemble to solve a common problem.
• Internet: a global network of interconnected computers which communicate through IP protocols
• Intranet: a separately administered network with a boundary that allows local security policies to be enforced
• Mobile and ubiquitous (everywhere) computing:
  o laptops
  o PDAs (a personal digital assistant is a handheld computer, also known as a palmtop computer)
  o mobile phones
  o printers
  o home devices
• World-Wide Web: a system for publishing and accessing resources and services across the Internet

Example 3 illustrates a typical portion of the Internet.
[Figure: overview of Internet information appliances]

Characteristics of the Internet
• very large and heterogeneous
• enables email, file transfer, multimedia communications, WWW
• open-ended
• connects intranets (via backbones) with home users (via modems, ISPs)

Example 4 illustrates a typical intranet.
[Figure: a typical intranet: desktop computers and email, print, file, Web, and other servers on a local area network, connected to the rest of the Internet through a router/firewall]

Characteristics of Intranets
• several LANs linked by backbones
• enables information flow within an organization: electronic data, documents
• provides various services: email, file, and print servers; often connected to the Internet via a router
• in/out communications protected by a firewall
• Example: French University in Egypt, http://portal.ufe.edu.eg/spip/?lang=ar
[Figure: an intranet with portable and handheld devices in addition to desktop computers and servers]

Now more than ever, it is important to be connected at work. With the diverse range of communication and the increase in collaboration comes the need to work in a more open, social, and unrestricted way. A successful intranet is essentially the foundation of a connected, engaged, and productive workplace. However, it is crucial to develop a clear, structured, and compatible intranet plan that reflects your organization's efforts. When an intranet is deployed and managed correctly, it unites people, drives productivity, creates a positive culture, and delivers significant results for all stakeholders.
Extranet
[Figure: extranet]

Example 5: Mobile & ubiquitous computing
• Wireless LANs (WLANs)
  o connectivity for portable devices (laptops, PDAs, mobile phones, video/digital cameras)
  o use WAP (Wireless Application Protocol)
• Home intranet (= home network)
  o devices embedded in home appliances (hi-fi, washing machines, ...)
  o universal 'remote control' + communication
• Future: an environment for applying embedded systems and ubiquitous computing

High-fidelity (hi-fi)
High fidelity or hi-fi reproduction is a term used by home stereo listeners and home audio enthusiasts (audiophiles) to refer to high-quality reproduction of sound or images that is very faithful to the original performance. Ideally, high-fidelity equipment has minimal amounts of noise and distortion and an accurate frequency response, as set out in 1973 by the German Deutsches Institut für Normung (DIN) standard DIN 45500.

Example 6: WWW
World-wide resource sharing over the Internet or an intranet, based on the following technologies:
• HTML (Hypertext Markup Language)
• URL (Uniform Resource Locator)
• client-server architecture
It is open-ended: it can be extended, re-implemented, ...

1.3 Goals of Distributed Systems

Due to their special characteristics:
1. complexity
2. size
3. changing technologies
4. society's dependence
the goals are:
1. Heterogeneity
2. Openness
3. Security
4. Scalability
5. Failure handling
6. Concurrency
7. Transparency

1.3.1 Heterogeneity
To achieve this goal, we have to overcome:
1. Varying software and hardware: OSs, networks, computer hardware, programming languages, and implementations by different developers; hence the need for standard protocols and middleware.
2. Heterogeneity and mobile code: supported by the virtual machine approach.

1.3.2 Openness
To achieve this goal, we must have:
1. independence of vendors
2. publishable key interfaces: CORBA (Common Object Request Broker Architecture)
3. publishable communication mechanisms: Java RMI (Remote Method Invocation)

1.3.3 Security
To achieve this goal, we must provide:
1. confidentiality (protection against leaks), e.g. medical records
2. integrity (protection against alteration and interference), e.g. financial data
This requires encryption and knowledge of identity, and must also address:
1. denial-of-service attacks, via intrusion detection algorithms
2. security of mobile code

1.3.4 Scalability
The design of scalable distributed systems assures:
1. controlling the cost of physical resources (O(n), where n is the number of users)
2. controlling the performance loss (O(log n), where n is the size of the set of data)
3. preventing software resources from running out
4. avoiding performance bottlenecks (e.g., using DNS)

There are scalability limitations, such as:

Concept                  Example
Centralized services     A single server for all users
Centralized data         A single on-line telephone book
Centralized algorithms   Doing routing based on complete information

Scalability Techniques (1): the server is busy. The difference lies in whether servers or clients do a job (checking a form) so as to keep a high performance level, i.e., either a server or a client checks forms as they are being filled in.
[Figure: form checking done by the server vs. by the client]

Scalability Techniques (2): distribution. An example is dividing the DNS name space into zones to assure distribution across several machines, thus avoiding that a single server has to deal with all requests for name resolution.
[Figure: dividing the DNS name space into zones]

1.3.5 Failure handling
Failure handling is the ability to continue computation in the presence of failures.
So, we have to implement:
• detecting failures
• masking failures (hiding failures)
• tolerating failures
• recovering from failures
• redundancy as a solution for tolerating failures

1.3.6 Concurrency
It means that processes execute simultaneously (at the same time) and share resources. This might need some sort of:
• synchronization using a clock
• inter-process communication (IPC)

1.3.7 Transparency (allowable)
Transparency is the hiding of the separated nature of the system from users and programmers, so that the system is perceived as a single coherent whole rather than a collection of independent components. Transparencies, as defined by the ANSA Reference Manual and the ISO Reference Model for Open Distributed Processing (RM-ODP), are achieved through:
• Access transparency: enables local and remote resources to be accessed using identical operations.
• Location transparency: enables resources to be accessed without knowledge of their location.
• Concurrency transparency: enables several processes to operate concurrently using shared resources without interference between them.
• Replication transparency: enables multiple instances of resources to be used to increase reliability and performance without knowledge of the replicas by users or application programmers.
• Failure transparency: enables the concealment (hiding) of faults, allowing users and application programs to complete their tasks despite the failure of hardware or software components.
• Mobility (migration) transparency: allows the movement of resources and clients within a system without affecting the operation of users or programs.
• Performance transparency: allows the system to be reconfigured to improve performance as loads vary.
• Scaling transparency: allows the system and applications to expand in scale without change to the system structure or the application algorithms.

The table below demonstrates what to hide (not allowable):

Transparency            Description
Access                  Hide differences in data representation and how a resource is accessed
Location                Hide where a resource is located
Migration               Hide that a resource may move to another location
Relocation (mobility)   Hide that a resource may be moved to another location while in use
Replication             Hide that a resource is replicated
Concurrency             Hide that a resource may be shared by several competitive users simultaneously
Failure                 Hide the failure and recovery of a resource
Persistence             Hide whether a (software) resource is in memory or on disk

1.4 Distributed System Model

A distributed system consists of:
• a collection of autonomous computers, linked by
• a computer network, and equipped with
• distributed system software.
This software enables computers to coordinate their activities and to share the resources of the system: hardware, software, and data. Users of a distributed system should perceive a single, integrated computing facility even though it may be implemented by many computers in different locations and networks.

The object-oriented model for a distributed system is based on the model supported by object-oriented programming languages (because it is more suitable than structured programming). Distributed object systems generally provide:
• remote procedure calls (RPC), which are used in client-server communication and are replaced by remote method invocation (RMI) in distributed object systems;
• remote method invocation (RMI) in an object-oriented programming language, together with operating system support for object sharing and persistence.

Execution of Remote Procedure Calls (RPC)
[Figure: execution of a remote procedure call]
Daemon: a program or process that sits idly in the background until it is invoked to perform its task.

The state of an object consists of the values of its instance variables. In the object-oriented paradigm, the state of a program is partitioned into separate parts, each of which is associated with an object. Since object-based programs are logically partitioned, the physical distribution of objects into different processes or computers in a distributed system is a natural extension.

The Object Management Group's Common Object Request Broker Architecture (CORBA) is a widely used standard for distributed object systems. It is a vendor-independent architecture and infrastructure that computer applications use to work together over networks. Other object management systems include:
• the Open Software Foundation's Distributed Computing Environment (DCE)
• Microsoft's Distributed Component Object Model (DCOM).

1.5 Hardware & Software Concepts in DS

The hardware concepts are:
• multiprocessors
• homogeneous multicomputer systems
The software concepts are:
• uni-processor operating systems
• multiprocessor operating systems
• multicomputer operating systems
• distributed shared memory systems
• network operating systems
Note: search for a list of operating systems and determine which ones are suitable for distributed systems applications.

1.5.1 Hardware Concepts
The figure below demonstrates different basic organizations of processors and memories in distributed computer systems.
[Figure: basic organizations of processors and memories in distributed computer systems]

Multiprocessors (1)
[Figure: a bus-based multiprocessor]

Multiprocessors (2)
[Figure: (a) a crossbar switch; (b) an omega switching network]

Homogeneous Multicomputer Systems
[Figure: (a) grid (mesh); (b) hypercube]

1.5.2 Software Concepts

System: Distributed Operating Systems (DOS)
  Description: tightly-coupled operating system for multiprocessors and homogeneous multicomputers
  Main goal: hide and manage hardware resources
System: Network Operating Systems (NOS)
  Description: loosely-coupled operating system for heterogeneous multicomputers (LAN and WAN)
  Main goal: offer local services to remote clients
System: Middleware
  Description: additional layer on top of a NOS implementing general-purpose services
  Main goal: provide distribution transparency

Comparison between Systems
A comparison between multiprocessor operating systems, multicomputer operating systems, network operating systems, and middleware-based distributed systems. The comparison includes the degree of transparency, same OS on all nodes, number of copies of the OS, basis for communication, resource management, scalability, and openness.

Item                      Distributed OS        Distributed OS         Network OS   Middleware-based OS
                          (multiprocessors)     (multicomputers)
Degree of transparency    Very high             High                   Low          High
Same OS on all nodes      Yes                   Yes                    No           No
Number of copies of OS    1                     N                      N            N
Basis for communication   Shared memory         Messages               Files        Model specific
Resource management       Global, central       Global, distributed    Per node     Per node
Scalability               No                    Moderately             Yes          Varies
Openness                  Closed                Closed                 Open         Open

1.6 Distributed Systems Types

1.6.1 Distributed Computing Systems (DCS)
An important class of distributed systems is the one used for high-performance computing tasks.
Roughly speaking, one can make a distinction between two subgroups. In cluster computing the underlying hardware consists of a collection of similar workstations or PCs, closely connected by means of a high-speed local-area network. In addition, each node runs the same operating system.

The situation becomes quite different in the case of grid computing. This subgroup consists of distributed systems that are often constructed as a federation of computer systems, where each system may fall under a different administrative domain, and may be very different when it comes to hardware, software, and deployed network technology.

Cluster Computing Systems
Cluster computing systems became popular when the price/performance ratio of personal computers and workstations improved. At a certain point, it became financially and technically attractive to build a supercomputer using off-the-shelf technology by simply hooking up a collection of relatively simple computers in a high-speed network. In virtually all cases, cluster computing is used for parallel programming in which a single (compute-intensive) program is run in parallel on multiple machines.

One well-known example of a cluster computer is formed by Linux-based Beowulf clusters. Each cluster consists of a collection of compute nodes that are controlled and accessed by means of a single master node. The master typically handles the allocation of nodes to a particular parallel program, maintains a batch queue of submitted jobs, and provides an interface for the users of the system. As such, the master actually runs the middleware needed for the execution of programs and management of the cluster, while the compute nodes often need nothing else but a standard operating system. An important part of this middleware is formed by the libraries for executing parallel programs. Many of these libraries effectively provide only advanced message-based communication facilities, but are not capable of handling faulty processes, security, etc.

As an alternative to this hierarchical organization, a symmetric approach is followed in the MOSIX system (Amar et al., 2004). MOSIX attempts to provide a single-system image of a cluster, meaning that to a process a cluster computer offers the ultimate distribution transparency by appearing to be a single computer. As we mentioned, providing such an image under all circumstances is impossible. In the case of MOSIX, the high degree of transparency is provided by allowing processes to dynamically and preemptively migrate between the nodes that make up the cluster. Process migration allows a user to start an application on any node (referred to as the home node), after which it can transparently move to other nodes, for example, to make efficient use of resources.

Grid Computing Systems
A characteristic feature of cluster computing is its homogeneity. In most cases, the computers in a cluster are largely the same; they all have the same operating system, and are all connected through the same network. In contrast, grid computing systems have a high degree of heterogeneity: no assumptions are made concerning hardware, operating systems, networks, administrative domains, security policies, etc.

A key issue in a grid computing system is that resources from different organizations are brought together to allow the collaboration of a group of people or institutions. Such a collaboration is realized in the form of a virtual organization.
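A virtual organization essentially groups users and grants them rights on a pool of resources, as the next paragraph explains. The sketch below is purely illustrative (all names are made up); real grid middleware such as Globus handles membership and authorization with certificates and delegation rather than a lookup table.

```python
# Hypothetical mapping from virtual organizations to their members and
# the resources granted to that organization.
virtual_orgs = {
    "climate-sim": {"members": {"alice", "bob"},
                    "resources": {"cluster-A", "storage-7"}},
    "genome-lab":  {"members": {"carol"},
                    "resources": {"cluster-B"}},
}

def may_use(user: str, resource: str) -> bool:
    """A user may use a resource if some virtual organization grants it."""
    return any(user in vo["members"] and resource in vo["resources"]
               for vo in virtual_orgs.values())

print(may_use("alice", "cluster-A"))   # True: alice belongs to climate-sim
print(may_use("carol", "storage-7"))   # False: genome-lab has no such grant
```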
The people belonging to the same virtual organization have access rights to the resources that are provided to that organization. Typically, resources consist of compute servers (including supercomputers, possibly implemented as cluster computers), storage facilities, and databases. In addition, special networked devices such as telescopes, sensors, etc., can be provided as well.

Given its nature, much of the software for realizing grid computing revolves around providing access to resources from different administrative domains, and to only those users and applications that belong to a specific virtual organization. For this reason, the focus is often on architectural issues.
[Figure: a layered architecture for grid computing systems]

The architecture consists of four layers. The lowest fabric layer provides interfaces to local resources at a specific site. Note that these interfaces are tailored to allow sharing of resources within a virtual organization. Typically, they will provide functions for querying the state and capabilities of a resource, along with functions for actual resource management (e.g., locking resources).

The connectivity layer consists of communication protocols for supporting grid transactions that span the usage of multiple resources. For example, protocols are needed to transfer data between resources, or to simply access a resource from a remote location. In addition, the connectivity layer will contain security protocols to authenticate users and resources. Note that in many cases human users are not authenticated; instead, programs acting on behalf of the users are authenticated. In this sense, delegating rights from a user to programs is an important function that needs to be supported in the connectivity layer. We return extensively to delegation when discussing security in distributed systems.

The resource layer is responsible for managing a single resource. It uses the functions provided by the connectivity layer and calls directly the interfaces made available by the fabric layer. For example, this layer will offer functions for obtaining configuration information on a specific resource, or, in general, to perform specific operations such as creating a process or reading data. The resource layer is thus seen to be responsible for access control, and hence will rely on the authentication performed as part of the connectivity layer.

The next layer in the hierarchy is the collective layer. It deals with handling access to multiple resources and typically consists of services for resource discovery, allocation and scheduling of tasks onto multiple resources, data replication, and so on. Unlike the connectivity and resource layers, which consist of a relatively small, standard collection of protocols, the collective layer may consist of many different protocols for many different purposes, reflecting the broad spectrum of services it may offer to a virtual organization.
This prevalence is emphasized by the gradual shift toward a service-oriented architecture in which sites offer access to the various layers through a collection of services. This, by now, has led to the definition of an alternative architecture known as the Open Grid Services Architecture (OGSA). This architecture consists of various layers and many components, making it rather complex. Complexity seems to be the fate of any standardization process. DIS Observation: many distributed systems are configured for high-performance computing. Cluster computing: essentially a group of high=end systems connected through a LAN: • Homogenous: same OS, near-identical hardware • Single managing node 33 Distributed Systems - CS 422 Grid Computing: the next step: lots of nodes from everywhere: • Heterogeneous • Dispersed across several organizations • Can easily span a wide-area network Note: To allow for collaborations, grids generally use virtual organizations. In essence, this is a grouping of users (or better: their IDs) that will allow for authorization on resource allocation. 1.6.2 Distributed Information Systems (DIS) To clarify our discussion, let us concentrate on database applications. In practice, operations on a database are usually carried out in the form of transactions. Programming using transactions requires special primitives that must either be supplied by the underlying distributed system or by the language runtime system. Observation: the vast amount of distributed systems in use today are forms of traditional information systems that now integrate legacy systems. Example: transaction processing systems. Essential: all read and write operations are executed, i.e. their effects are made permanent at the execution of END_TRANSACTION. Observation: transactions form an atomic operation. 34 Distributed Systems - CS 422 Distributed Information Systems: Transactions Another important class of distributed systems is found in organizations that were confronted with a wealth of networked applications, but for which interoperability turned out to be a painful experience. Many of the existing middleware solutions are the result of working with an infrastructure in which it was easier to integrate applications into an enterprise-wide information system. We can distinguish several levels at which integration took place. In many cases, a networked application simply consisted of a server running that application (often including a database) and making it available to remote programs, called clients. Such clients could send a request to the server for executing a specific operation, after which a response would be sent back. Integration at the lowest level would allow clients to wrap a number of requests, possibly for different servers, into a single larger request and have it executed as a distributed transaction. The key idea was that all, or none of the requests would be executed. As applications became more sophisticated and were gradually separated into independent components (notably distinguishing database components from processing components), it became clear that integration should also take place by letting applications communicate directly with each other. This has now led to a huge industry that concentrates on enterprise application integration (EAl). In the following, we concentrate on these two forms of distributed systems. Model: a transaction is a collection of operations of the state of an object (database, object composition, etc.) 
that satisfies the following properties (ACID):
• Atomicity: all operations either succeed, or all of them fail. When the transaction fails, the state of the object remains unaffected by the transaction.
• Consistency: a transaction establishes a valid state transition. This does not exclude the possibility of invalid, intermediate states during the transaction's execution.
• Isolation: concurrent transactions do not interfere with each other. It appears to each transaction T that other transactions occur either before T or after T, but never both.
• Durability: after the execution of a transaction, its effects are made permanent; changes to the state survive failures.

1.6.3 Distributed Pervasive (spreading) Systems (DPS)

The distributed systems we have been discussing so far are largely characterized by their stability: nodes are fixed and have a more or less permanent and high-quality connection to a network. To a certain extent, this stability has been realized through the various techniques that are discussed in this book and which aim at achieving distribution transparency. For example, the wealth of techniques for masking failures and recovery will give the impression that only occasionally things may go wrong. Likewise, we have been able to hide aspects related to the actual network location of a node, effectively allowing users and applications to believe that nodes stay put.

However, matters have become very different with the introduction of mobile and embedded computing devices. We are now confronted with distributed systems in which instability is the default behavior. The devices in these systems, which we refer to as distributed pervasive systems, are often characterized by being small, battery-powered, mobile, and having only a wireless connection, although not all these characteristics apply to all devices. Moreover, these characteristics need not necessarily be interpreted as restrictive, as is illustrated by the possibilities of modern smart phones (Roussos et al., 2005).

As its name suggests, a distributed pervasive system is part of our surroundings (and as such, is generally inherently distributed). An important feature is the general lack of human administrative control. At best, devices can be configured by their owners, but otherwise they need to automatically discover their environment and "nestle in" as best as possible. This nestling in has been made more precise by Grimm et al. (2004), who formulated the following three requirements for pervasive applications:
1. Embrace contextual changes.
2. Encourage ad hoc composition.
3. Recognize sharing as the default.

Observation: there is a next generation of distributed systems emerging in which the nodes are small, mobile, and often embedded as part of a larger system. Some requirements:
• Contextual change: the system is part of an environment in which changes should be immediately accounted for.
• Ad hoc composition: each node may be used in very different ways by different users; this requires ease of configuration.
• Sharing is the default: nodes come and go, providing sharable services and information; this again calls for simplicity.
Observation: pervasiveness and distribution transparency may not always form a good match.

Example 7: Distributed Pervasive Systems
[Figure: electronic health care system]
Home systems should be completely self-organizing:
• there should be no system administrator
• provide a personal space for each of its users
• simplest solution: a centralized home box?
Electronic health systems: devices are physically close to a person:
• Where and how should monitored data be stored?
• How can we prevent loss of crucial data?
• What is needed to generate and propagate alerts?
• How can security be enforced?
• How can physicians provide online feedback?

Example 8: Sensor Networks
Characteristics: the nodes to which sensors are attached are:
• many (tens to thousands)
• simple (i.e., hardly any memory, CPU power, or communication facilities)
• often battery-powered (or even battery-less)
Sensor networks as distributed systems: consider them from a database perspective.
[Figure: organizing a sensor network as a distributed database]

Revision Sheet # 1

PROBLEMS
1. What is the role of middleware in a distributed system?
2. Explain what is meant by (distribution) transparency, and give examples of different types of transparency.
3. Why is it sometimes so hard to hide the occurrence of, and recovery from, failures in a distributed system?
4. Why is it not always a good idea to aim at implementing the highest degree of transparency possible?
5. What is an open distributed system and what benefits does openness provide?
6. Describe precisely what is meant by a scalable system.
7. Scalability can be achieved by applying different techniques. What are these techniques?
8. Explain what is meant by a virtual organization and give a hint on how such organizations could be implemented.
9. When a transaction is aborted, we have said that the world is restored to its previous state, as though the transaction had never happened. We lied. Give an example where resetting the world is impossible.
10. Executing nested transactions requires some form of coordination. Explain what a coordinator should actually do.
11. We argued that distribution transparency may not be in place for pervasive systems. This statement is not true for all types of transparencies. Give an example.
12. We already gave some examples of distributed pervasive systems: home systems, electronic health-care systems, and sensor networks. Extend this list with more examples.
13. (Lab assignment) Sketch a design for a home system consisting of a separate media server that will allow for the attachment of a wireless client. The latter is connected to (analog) audio/video equipment and transforms the digital media streams to analog output. The server runs on a separate machine, possibly connected to the Internet, but has no keyboard and/or monitor connected.

Assignment # 1
Repeat Example # 1 with 6 ovens (processors) rather than 5 ovens and a baking time of 20 min rather than 10 min. How many pizzas (processes) are completed at time = 60 minutes for sequential, parallel, and pipelined processing? Give your comments.

Assignment # 2
Repeat Example # 2 with 16 processors rather than 8 processors and compute the speedup and the utilization. What are your comments about speedup and utilization, based on the results of the 4/8/16-processor cases?

Assignment # 3
a) Compute the parallel efficiency for the 4-processor case.
b) Compute the parallel efficiency for the 8-processor case.
c) Compute the parallel efficiency for the 16-processor case.
d) What are your comments on the parallel efficiency, based on the results in (a), (b) and (c)?

Assignment # 4
Compare between: CORBA, DCE and DCOM. Your comparison should include:
• Basic theory of operation with architecture
• Applicability
• Trend of updating.

Chapter 2
Distributed Systems Architectures

Centralized System vs. Distributed System

Criteria        Centralized system    Distributed system
Economics       Low                   High
Availability    Low                   High
Complexity      Low                   High
Consistency     Simple                High
Scalability     Poor                  Good
Technology      Homogeneous           Heterogeneous
Security        High                  Low

In a distributed architecture, components are presented on different platforms and several components can cooperate with one another over a communication network in order to achieve a specific objective or goal. In this architecture, information processing is not confined to a single machine; rather, it is distributed over several independent computers.

A distributed system can be demonstrated by the client-server architecture, which forms the base for multi-tier architectures; alternatives are the broker architecture, such as CORBA, and the Service-Oriented Architecture (SOA). There are several technology frameworks to support distributed architectures, including .NET, J2EE, CORBA, .NET Web services, AXIS Java Web services, and Globus Grid services.

Middleware is an infrastructure that appropriately supports the development and execution of distributed applications. It provides a buffer between the applications and the network. It sits in the middle of the system and manages or supports the different components of a distributed system. Examples are transaction processing monitors, data converters, communication controllers, etc.

Distributed systems are often complex pieces of software of which the components are by definition dispersed across multiple machines. To master their complexity, it is crucial that these systems are properly organized. There are different ways to view the organization of a distributed system, but an obvious one is to make a distinction between, on the one hand, the logical organization of the collection of software components and, on the other hand, the actual physical realization.

The organization of distributed systems is mostly about the software components that constitute the system. These software architectures tell us how the various software components are to be organized and how they should interact. In this chapter we will first pay attention to some commonly applied approaches toward organizing (distributed) computer systems.

The actual realization of a distributed system requires that we instantiate and place software components on real machines. There are many different choices that can be made in doing so. The final instantiation of a software architecture is also referred to as a system architecture. In this chapter we will look into traditional centralized architectures in which a single server implements most of the software components (and thus functionality), while remote clients can access that server using simple communication means. In addition, we consider decentralized architectures in which machines more or less play equal roles, as well as hybrid organizations.
As we explained in Chap.1, an important goal of distributed systems is to separate applications from underlying platforms by providing a middleware layer adopting such a layer is an important architectural decision, and its main purpose is to provide distribution transparency. However, trade-offs need to be made to achieve transparency, which has led to various techniques to make middleware adaptive. We discuss some of the more commonly applied ones in this chapter, as they affect the organization of the middleware itself. Adaptability in distributed systems can also be achieved by having the system monitor its own behavior and taking appropriate measures when needed. This 'insight has led to a class of what are now referred to as autonomic systems. These distributed systems are frequently organized in the form of feedback control loops, which form an important architectural element during a system's design. In this chapter, we devote a section to autonomic distributed systems. 2.1 Basic Definitions • A program is the code you write • A process is what you get when you run it • A message is used to communicate between processes • A packet is a fragment of a message that might travel on a wire • A protocol is a formal description of message formats and the rules that two processes must follow in order to exchange those messages • A network is the infrastructure that links computers, workstations, terminals, servers, etc. It consists of routers which are connected by communication links. 44 Distributed Systems - CS 422 • A component can be a process or any piece of hardware required to run a process, support communications between processes, store data. • A distributed system is an application that executes a collection of protocols to coordinate the actions of multiple processes on a network, such that all components cooperate together to perform a single or small set of related tasks (Detailed definition of DS). 2.2 Distributed Systems Architectures 1. Multiprocessor Architectures 2. Client-server Architectures • Distributed services which are called on by clients. • Servers that provide services are treated differently from clients that use services 3. Distributed Object Architectures • No distinction between clients and servers (object world). • Any object on the system may provide and use services(functions) from other objects 2.3 Multiprocessor Architectures • Advantage: Simplest distributed system model. • System composed of multiple processes which may execute on different processors. • Architectural model of many large real-time systems. • Distribution of process to processor may be pre-ordered or may be under the control of a dispatcher of the operating system. 45 Distributed Systems - CS 422 A multiprocessor Traffic Control System (Example 1) Sensor processor Sensor control process Traffic flow processor Traffic light control processor Display process Light control process Traffic lights Trafficflow sensors and cameras Operator consoles Multiprocessor Traffic Control System (SCOOT systems) SCOOT systems are designed to be a central processor hosting the SCOOT Kernel integrated with the company specific UTC software that controls communications to the on-street equipment and provides the operator interface. This processor and associated networked terminals may be installed in a control room. 46 Distributed Systems - CS 422 The figure below shows Enhancement with a Wireless Digital Radio to Achieve the City Lights and Intelligent Traffic Control. 
Modernizing the city's road-light control not only adds luster to the city but also improves the rational use of urban resources. Wireless digital radio links the street-surveillance equipment to the control centre, giving reliable real-time data transmission, lower operation and maintenance costs, and a transmission speed high enough to view images.

2.4 Client-server Architectures

In the basic client-server model, processes in a distributed system are divided into two (possibly overlapping) groups. A server is a process implementing a specific service, for example, a file system service or a database service. A client is a process that requests a service from a server by sending it a request and subsequently waiting for the server's reply. This client-server interaction is also known as request-reply.

Communication between a client and a server can be implemented by means of a simple connectionless protocol when the underlying network is fairly reliable, as in many local-area networks. In these cases, when a client requests a service, it simply packages a message for the server, identifying the service it wants, along with the necessary input data. The message is then sent to the server. The latter, in turn, will always wait for an incoming request, subsequently process it, and package the results in a reply message that is then sent to the client. Using a connectionless protocol has the obvious advantage of being efficient.

As long as messages do not get lost or corrupted, the request/reply protocol just sketched works fine. Unfortunately, making the protocol resistant to occasional transmission failures is not trivial. The only thing we can do is possibly let the client resend the request when no reply message comes in. The problem, however, is that the client cannot detect whether the original request message was lost, or that transmission of the reply failed. If the reply was lost, then resending a request may result in performing the operation twice. If the operation was something like "transfer $10,000 from my bank account," then clearly, it would have been better that we simply reported an error instead. On the other hand, if the operation was "tell me how much money I have left," it would be perfectly acceptable to resend the request. When an operation can be repeated multiple times without harm, it is said to be idempotent. Since some requests are idempotent and others are not, it should be clear that there is no single solution for dealing with lost messages (a small code sketch of this retry issue is given below).

In conclusion, we could summarize the client-server architecture milestones as follows:
• The application is modelled as a set of services that are provided by servers and a set of clients that use these services.
• Clients know of servers, but servers need not know of clients.
• Clients and servers are logical processes.
• The mapping of processors to processes (scheduling/allocation) is not necessarily 1 : 1.

Client versus Servers
Clients are PCs or workstations on which users run applications. Clients rely on servers for resources, such as files, devices, and even processing power. Servers are powerful computers or processes dedicated to managing disk drives (file servers), printers (print servers), or network traffic (network servers).

Comparison with peer-to-peer (P2P) architecture
In the client-server model, the server is a centralized system. The more simultaneous clients a server has, the more resources it needs.
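Before contrasting this with peer-to-peer organizations, the retry problem sketched above can be made concrete with a minimal client-side wrapper. This is only an illustrative sketch; all names are hypothetical and it is not part of the course code.

    import java.net.SocketTimeoutException;
    import java.util.concurrent.Callable;

    // Client-side retry wrapper: a request is resent after a timeout only when the
    // operation is known to be idempotent; otherwise an error is reported instead.
    public class RequestReplyClient {

        public static <T> T call(Callable<T> request, boolean idempotent, int maxAttempts)
                throws Exception {
            for (int attempt = 1; ; attempt++) {
                try {
                    return request.call();                    // one request-reply round trip
                } catch (SocketTimeoutException lostReply) {  // no reply came in
                    if (!idempotent || attempt >= maxAttempts) {
                        // Blindly repeating "transfer $10,000" could execute it twice,
                        // so give up and report the failure to the caller.
                        throw lostReply;
                    }
                    // "Tell me how much money I have left" is harmless to repeat.
                }
            }
        }
    }

A caller might wrap a balance query as call(() -> bank.getBalance(account), true, 3), while a money transfer would be passed with idempotent = false so that it is never silently repeated.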
In a peer-to-peer network, two or more computers (called peers) pool their resources and communicate in a decentralized system. Peers are coequal nodes in a non-hierarchical network.

A comparison between the C/S and peer-to-peer structures is presented as follows:
1. In a client-server network, clients and server are differentiated; specific servers and clients are present. In a peer-to-peer network, clients and server are not differentiated.
2. A client-server network focuses on information sharing, while a peer-to-peer network focuses on connectivity.
3. In a client-server network, a centralized server is used to store the data, while in a peer-to-peer network each peer has its own data.
4. In a client-server network, the server responds to the services requested by the client, while in a peer-to-peer network each and every node can both request and respond to services.
5. Client-server networks are costlier than peer-to-peer networks.
6. Client-server networks are more stable than peer-to-peer networks, which become less stable as the number of peers increases.
7. A client-server network is used for both small and large networks, while a peer-to-peer network is generally suited for small networks with fewer than 10 computers.

In a structured peer-to-peer architecture, the overlay network is constructed using a deterministic procedure. By far the most-used procedure is to organize the processes through a distributed hash table (DHT). In a DHT-based system, data items are assigned a random key from a large identifier space, such as a 128-bit or 160-bit identifier. Likewise, nodes in the system are also assigned a random number from the same identifier space. The crux of every DHT-based system is then to implement an efficient and deterministic scheme that uniquely maps the key of a data item to the identifier of a node based on some distance metric. Most importantly, when looking up a data item, the network address of the node responsible for that data item is returned. Effectively, this is accomplished by routing a request for a data item to the responsible node.

Unstructured peer-to-peer systems largely rely on randomized algorithms for constructing an overlay network. The main idea is that each node maintains a list of neighbors, but that this list is constructed in a more or less random way. Likewise, data items are assumed to be randomly placed on nodes. As a consequence, when a node needs to locate a specific data item, the only thing it can effectively do is flood the network with a search query.

One of the goals of many unstructured peer-to-peer systems is to construct an overlay network that resembles a random graph. The basic model is that each node maintains a list of c neighbors, where, ideally, each of these neighbors represents a randomly chosen live node from the current set of nodes. The list of neighbors is also referred to as a partial view. There are many ways to construct such a partial view. Jelasity et al. (2004, 2005a) have developed a framework that captures many different algorithms for overlay construction to allow for evaluations and comparison. In this framework, it is assumed that nodes regularly exchange entries from their partial view.
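As a rough illustration of such a periodic exchange, the following simplified sketch keeps a small partial view and swaps a few entries with a chosen peer. It is not the framework of Jelasity et al.; all type and method names are hypothetical.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Random;

    // A node in an unstructured overlay that maintains a partial view of at most
    // c neighbour entries and periodically exchanges some of them with a peer.
    class PartialViewNode {
        // One partial-view entry: a neighbour's address plus an age counter.
        static class Entry {
            final String address;   // neighbour's network address
            final int age;          // how old the reference to that node is
            Entry(String address, int age) { this.address = address; this.age = age; }
        }

        private final List<Entry> view = new ArrayList<>();
        private final int c = 8;                 // target partial-view size
        private final Random rnd = new Random();

        // Called periodically: send some of our entries to a peer and merge its reply.
        void gossipOnce(PartialViewNode peer) {
            merge(peer.exchange(sample(c / 2)));
        }

        // Handle an incoming exchange: reply with a sample, then merge what was received.
        List<Entry> exchange(List<Entry> incoming) {
            List<Entry> reply = sample(c / 2);
            merge(incoming);
            return reply;
        }

        private void merge(List<Entry> entries) {
            view.addAll(entries);
            view.sort(Comparator.comparingInt((Entry e) -> e.age));  // prefer fresher entries
            while (view.size() > c) view.remove(view.size() - 1);
        }

        private List<Entry> sample(int k) {
            List<Entry> copy = new ArrayList<>(view);
            Collections.shuffle(copy, rnd);
            return new ArrayList<>(copy.subList(0, Math.min(k, copy.size())));
        }
    }

Real protocols add network I/O, ageing of entries, and policies for which entries to keep or discard; the sketch only shows the shape of the periodic partial-view exchange.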
Each entry identifies another node in the network, and has an associated age that indicates how old the reference to that node is.

Clients and Servers
[Figure: General interaction between a client and a server]
[Figure: An example client and server (1) – the header.h file used by the client and server]
[Figure: An example client and server (2) – a sample server]
[Figure: An example client and server (3) – a client using the server to copy a file]
[Figure: Typical client-server system – client processes c1–c12 interacting with server processes s1–s4]
[Figure: Physical computers in a C/S network – client computers CC1–CC6 and server computers SC1–SC2 connected by a network]

C/S architecture from Layered Application Point of View
The Presentation layer is concerned with presenting the results of a computation to system users and with collecting user inputs. The Application processing layer is concerned with providing application-specific functionality (e.g., in a banking system, banking functions such as open account, close account). While the Data management layer is concerned with managing the system databases.

Application layers in C/S model
The user-interface level contains all that is necessary to directly interface with the user, such as display management. The processing level typically contains the applications. The data level manages the actual data that is being acted on.

Clients typically implement the user-interface level. This level consists of the programs that allow end users to interact with applications. There is a considerable difference in how sophisticated user-interface programs are. The simplest user-interface program is nothing more than a character-based screen. Such an interface has been typically used in mainframe environments. In those cases where the mainframe controls all interaction, including the keyboard and monitor, one can hardly speak of a client-server environment. However, in many cases, the user's terminal does some local processing such as echoing typed keystrokes, or supporting form-like interfaces in which a complete entry is to be edited before sending it to the main computer.

Nowadays, even in mainframe environments, we see more advanced user interfaces. Typically, the client machine offers at least a graphical display in which pop-up or pull-down menus are used, and of which many of the screen controls are handled through a mouse instead of the keyboard. Typical examples of such interfaces include the X Windows interfaces as used in many UNIX environments, and earlier interfaces developed for MS-DOS PCs and Apple Macintoshes.

Modern user interfaces offer considerably more functionality by allowing applications to share a single graphical window, and to use that window to exchange data through user actions. For example, to delete a file, it is usually possible to move the icon representing that file to an icon representing a trash can. Likewise, many word processors allow a user to move text in a document to another position by using only the mouse.

As a first example, consider an Internet search engine. Ignoring all the animated banners, images, and other fancy window dressing, the user interface of a search engine is very simple: a user types in a string of keywords and is subsequently presented with a list of titles of Web pages.
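Before looking at the back end in more detail, a minimal sketch of how the three levels could divide the work for this search-engine example is given below; all class and method names are hypothetical, and the code is only meant to show the layering.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Data level: the indexed page database (reduced here to a single query method).
    interface PageDatabase {
        List<String> execute(String query);           // returns matching page titles
    }

    // Processing level: turns the user's keywords into a query and ranks the results.
    class SearchEngine {
        private final PageDatabase db;
        SearchEngine(PageDatabase db) { this.db = db; }

        List<String> search(String keywords) {
            // A real system would build one or more parameterized database queries here.
            String query = "SELECT title FROM pages WHERE body CONTAINS '" + keywords + "'";
            List<String> titles = new ArrayList<>(db.execute(query));
            titles.sort(Comparator.naturalOrder());   // stand-in for a real ranking step
            return titles;
        }
    }

    // User-interface level: collects the keywords and displays the resulting titles.
    class SearchPage {
        void onSearchButton(String keywords, SearchEngine engine) {
            engine.search(keywords).forEach(System.out::println);
        }
    }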
The back end is formed by a huge database of Web pages that have been perfected and indexed. The core of the search engine is a program that transforms the user's string of keywords into one or more database queries. It subsequently ranks the results into a list, and transforms that list into a series of HTML pages. Within the client-server model, this information retrieval part is typically placed at the processing level. The data level in the client-server model contains the programs that maintain the actual data on which the applications operate. An important property of this level is that data are often persistent, that is, even if no application is running, data will be stored somewhere for next use. In its simplest form, the data level consists of a file system, but it is more common to use a full-fledged database. In the client-server model, the data level is typically implemented at the server side. Besides merely storing data, the data level is generally also responsible for keeping data consistent across different applications. 56 Distributed Systems - CS 422 When databases are being used, maintaining consistency means that metadata such as table descriptions, entry constraints and application-specific metadata are also stored at this level. For example, in the case of a bank, we may want to generate a notification when a customer's credit card debt reaches a certain value. This type of information can be maintained through a database trigger that activates a handler for that trigger at the appropriate moment. In most business-oriented environments, the data level is organized as a relational database. Data independence is crucial here. The data are organized independent of the applications in such a way that changes in that organization do not affect applications, and neither do the applications affect the data organization. Using relational databases in the client-server model helps separate the processing level from the data level, as processing and data are considered independent. However, relational databases are not always the ideal choice. A characteristic feature of many applications is that they operate on complex data types that are more easily modeled in terms of objects than in terms of relations. Examples of such data types range from simple polygons and circles to representations of aircraft designs, as is the case with computer-aided design (CAD) systems. In those cases where data operations are more easily expressed in terms of object manipulations, it makes sense to implement the data level by means of an objectoriented or object-relational database. Notably the latter type has gained popularity as these databases build upon the widely dispersed relational data model, while offering the advantages that object-orientation gives. 2.4.1 Two-tier (layer) thin and fat clients Thin-client model (i.e. fat server) In a thin-client model, all of the application processing and data management is carried out on the server. The client is simply responsible for running the presentation software (so client is thin): 57 Distributed Systems - CS 422 Used when legacy systems are migrated to client server architectures in which legacy system acts as a server in its own right with a graphical interface implemented on a client A major disadvantage is that it places a heavy processing load on both the server and the network. Fat-client model (i.e. thin server) In this model, the server is only responsible for data management. 
The software on the client implements the application logic and the interactions with the system user (so the client is fat):
• Most appropriate for new C/S systems where the capabilities of the client system are known in advance.
• More complex than a thin-client model, especially for management: new versions of the application have to be installed on all clients.

Advantages
• Separation of responsibilities such as user-interface presentation and business-logic processing.
• Reusability of server components and potential for concurrency.
• Simplifies the design and the development of distributed applications.
• It makes it easy to migrate or integrate existing applications into a distributed environment.
• It also makes effective use of resources when a large number of clients are accessing a high-performance server.

Disadvantages
• Lack of heterogeneous infrastructure to deal with requirement changes.
• Security complications.
• Limited server availability and reliability.
• Limited testability and scalability.
• Fat clients with presentation and business logic together.

Legacy Systems Lifetime
Companies spend a lot of money on software systems and, to get a return on that investment, the software must be usable for a number of years. The lifetime of software systems is very variable, but many large systems remain in use for more than 10 years. Some organizations still rely on software systems that are more than 20 years old. Many of these old systems are still business-critical. That is, the business relies on the services provided by the software, and any failure of these services would have a serious effect on the day-to-day running of the business. These old systems have been given the name legacy systems.

Legacy Systems NASA Example
NASA's now retired Space Shuttle program used a large amount of 1970s-era technology. Replacement was unaffordable because of:
• the expensive requirement for flight certification;
• the fact that the legacy hardware in use had already completed that expensive integration and certification for flight.
Any new equipment would have had to go through that entire process – requiring extensive tests of the new components in their new configurations – before a single unit could be used in the Space Shuttle program.

[Figure: Legacy systems structure]

Fat Client Model Applicability
More processing is delegated to the client, as the application processing is locally executed. It is most suitable for new C/S systems where the capabilities of the client system are known in advance (e.g. an ATM). Its disadvantage is that it is more complex than a thin-client model, especially for management, since new versions of the application have to be installed on all clients (maintenance overhead).

[Figure: A client-server ATM system (fat-client Example 2) – ATMs connected through a teleprocessing monitor to an account server and its customer account database]

2.4.2 Three-tier architectures
In a three-tier architecture, each of the application architecture layers may execute on a separate processor. Its advantages are that it allows for better performance than a thin-client approach and is simpler to manage than a fat-client approach. It is also a more scalable architecture – as demands increase, extra servers can be added.

[Figure: A 3-tier C/S architecture – the diagram shows 3 separate processors (two servers and one client)]

A typical example of where a three-tiered architecture is used is in transaction processing. As we discussed in Chap.
1, a separate process, called the transaction processing monitor, coordinates all transactions across possibly different data servers.

Another, but very different, example where we often see a three-tiered architecture is in the organization of Web sites. In this case, a Web server acts as an entry point to a site, passing requests to an application server where the actual processing takes place. This application server, in turn, interacts with a database server. For example, an application server may be responsible for running the code to inspect the available inventory of some goods as offered by an electronic bookstore. To do so, it may need to interact with a database containing the raw inventory data.

Advantages
• Better performance than a thin-client approach and simpler to manage than a thick-client approach.
• Enhances reusability and scalability – as demands increase, extra servers can be added.
• Provides multi-threading support and also reduces network traffic.
• Provides maintainability and flexibility.

Disadvantages
• Unsatisfactory testability due to lack of testing tools.
• More critical server reliability and availability.

An internet banking system (Example 3)
[Figure: An internet banking system – clients interact with the web server over HTTP; the web server issues SQL queries for account service provision against the customer account database on the database server]

Internet Banking System
When a bank customer accesses online banking services with a web browser (the client), the client initiates a request to the bank's web server. The customer's login credentials may be stored in a database, and the web server accesses the database server as a client. An application server interprets the returned data by applying the bank's business logic and provides the output to the web server. Finally, the web server returns the result to the client web browser for display. In each step of this sequence of client–server message exchanges, a computer processes a request and returns data. This is the request-response messaging pattern. When all the requests are met, the sequence is complete, and the web browser presents the data to the customer.

Use of C/S architectures
• One-tier C/S architecture with fat clients: legacy system applications where separating application processing and data management is impractical; computationally-intensive applications such as compilers with little or no data management; data-intensive applications (browsing and querying) with little or no application processing.
• Two-tier C/S architecture with fat clients: applications where application processing is provided by off-the-shelf software (e.g. Microsoft Excel) on the client; applications where computationally-intensive processing of data (e.g. data visualization) is required; applications with relatively stable end-user functionality used in an environment with well-established system management.
• Three-tier or multi-tier C/S architecture: large-scale applications with hundreds or thousands of clients; applications where both the data and the application are volatile; applications where data from multiple sources are integrated.

The picture below is a visualization of how a car deforms in an asymmetrical crash using finite element analysis.

2.5 Distributed Object Architectures
There is no distinction in a distributed object architecture between clients and servers.
Each distributable entity is an object that provides services to other objects and receives services from other objects (an objects world). Object communication is through a middleware system called an object request broker (ORB). Its disadvantage is that distributed object architectures are more complex to design than client-server systems.

Layout of Distributed Object Architecture
[Figure: Layout of a distributed object architecture – objects o1–o6, each offering services S(o1)–S(o6), connected through an object request broker]

Advantages of distributed object architecture
It is a very open system architecture that allows new resources to be added to it as required. The system is flexible and scalable. It allows the system designer to delay decisions on where and how services should be provided. It is possible to reconfigure the system dynamically, with objects migrating across the network as required.

Usage of distributed object architecture
As a logical model that allows you to structure and organize the system, you think about how to provide application functionality solely in terms of services and combinations of services (a world of services). As a flexible approach to the implementation of client-server systems, the logical model of the system is a client-server model, but both clients and servers are realized as distributed objects communicating through a common communication framework.

The logical model of the system is not one of service provision where there are distinguished data management services. It has the following advantages:
• It allows the number of databases that are accessed to be increased without disrupting the system (scalability).
• It allows new types of relationship to be mined by adding new integrator objects (flexibility).

Disadvantages
• Complexity − more complex than centralized systems.
• Security − more susceptible to external attack.
• Manageability − more effort required for system management.
• Unpredictability − unpredictable responses depending on the system organization and network load.

A data mining system (Example 4)
[Figure: A data mining system – three databases feeding two integrator objects, with a report generator, a visualiser and a display communicating through the broker]

Broker Architectural Style
The broker architectural style is a middleware architecture used in distributed computing to coordinate and enable the communication between registered servers and clients. Here, object communication takes place through a middleware system called an object request broker (a software bus). The client and the server do not interact with each other directly; each has a direct connection to its proxy, which communicates with the mediating broker. A server provides services by registering and publishing its interfaces with the broker, and clients can request the services from the broker statically or dynamically by look-up. CORBA (Common Object Request Broker Architecture) is a good implementation example of the broker architecture.

Components of Broker Architectural Style
The components of the broker architectural style are discussed under the following heads:

Broker
The broker is responsible for coordinating communication, such as forwarding and dispatching the results and exceptions. It can be either an invocation-oriented service, or a document- or message-oriented broker to which clients send a message. It is responsible for brokering the service requests, locating a proper server, transmitting requests, and sending responses back to clients.
It retains the servers’ registration information including their functionality and services as well as location information. It provides APIs for clients to request, servers to respond, registering or unregistering server components, transferring messages, and locating servers. Stub Stubs are generated at the static compilation time and then deployed to the client side which is used as a proxy for the client. Client-side proxy acts as a mediator between the client and the broker and provides additional transparency between them and the client; a remote object appears like a local one. The proxy hides the IPC (inter-process communication) at protocol level and performs marshaling of parameter values and un-marshaling of results from the server. Skeleton Skeleton is generated by the service interface compilation and then deployed to the server side, which is used as a proxy for the server. Server-side proxy encapsulates low-level system-specific networking functions and provides high-level APIs to mediate between the server and the broker. 67 Distributed Systems - CS 422 It receives the requests, unpacks the requests, unmarshals the method arguments, calls the suitable service, and also marshals the result before sending it back to the client. Bridge A bridge can connect two different networks based on different communication protocols. It mediates different brokers including DCOM, .NET remote, and Java CORBA brokers. Bridges are optional component, which hides the implementation details when two brokers interoperate and take requests and parameters in one format and translate them to another format. 2.6 CORBA Architecture CORBA is an acronym for Common Object Request Broker Architecture). CORBA (1991) is an international standard for an Object Request Broker - middleware to manage communications between distributed objects. Object Management Group (OMG) is responsible for defining CORBA. The OMG comprises all the major vendors and developers of distributed object technology including: • platform, database, and application vendors • software tool and corporate developers Middleware for distributed computing is required at 2 levels: • At the logical communication level: the middleware allows objects on different computers to exchange data and control information; • At the component level: the middleware provides a basis for developing compatible components. CORBA component standards have been defined. Hint: visit OMG for CORBA FAQ and Releases of CORBA. CORBA specifies a system that provides interoperability among objects in a heterogeneous, distributed environment in a way that is transparent to the programmer. This model defines common object semantics for specifying the externally visible characteristics of objects in a standard and implementation-independent way. In this 68 Distributed Systems - CS 422 model, clients request services from objects (which will also be called servers) through a well-defined interface specified in Object Management Group Interface Definition Language (IDL). The request is an event, and it carries information including: • an operation • the object reference of the service provider, which is a name that defines an object consistently • actual parameters (if any) The central component (core) of CORBA is the object request broker (ORB). 
It includes the entire communication infrastructure necessary to: • identify and locate objects • handle connection management • deliver data In general, the object request broker is not required to be a single component; it is simply defined by its interfaces. Software broker is an agent do job for the others (Using IIOP, Internet Inter-ORB Protocol; to pass method invocation requests to the correct objects and return the results to the caller). 2.6.1 CORBA Goal The OMG’s goal was to adopt distributed object systems that utilize objectoriented programming for distributed systems: • Systems to be built on heterogeneous hardware, networks, operating systems and programming languages. • The distributed objects would be implemented in various programming languages and still be able to communicate with each other. 69 Distributed Systems - CS 422 2.6.2 CORBA Architecture The simplified architecture of the CORBA is as presented in the next figure followed by a detailed architecture. 70 Distributed Systems - CS 422 Portable Object Adapter (POA) There are different types of CORBA object adapters, such as real-time object adapters (in TAO) and portable object adapters. The Portable Object Adapter (POA) is a particular type of object adapter that is defined by the CORBA standard specification. A POA object adapter allows an object implementation to function with other different ORBs. Internet Inter-ORB Protocol (IIOP) A protocol which will be mandatory for all CORBA 2.0 (1996) compliant platforms. The initial phase of the CORBA 2.0 project is to build an infrastructure consisting of: 1. an IIOP to HTTP gateway which allows CORBA clients to access WWW resources; 2. an HTTP to IIOP gateway to let WWW clients access CORBA resources; 3. a web server which makes resources available by both IIOP and HTTP; 4. web browsers which can use IIOP as their native protocol (for navigation on the internet) 2.6.3 CORBA Application Structure 1. Application objects itself 2. Standard objects defined by the OMG, for a specific domain (e.g. Insurance, Trading ...etc.) 3. Fundamental CORBA services such as directories and security management 4. Horizontal facilities (i.e. cutting across applications) such as user interface facilities 71 Distributed Systems - CS 422 Application objects Domain facilities Horizontal C ORBA facilities Object request broker CORBA services 2.6.4 CORBA Standards 1. An object model for application objects. A CORBA object is an encapsulation of state with a well-defined, language-neutral interface defined in an IDL (interface definition language) 2. An object request broker that manages requests for object services 3. A set of general object services of use to many distributed applications 4. A set of common components built on top of these services 2.6.5 CORBA Objects CORBA objects are comparable, in principle, to objects in C++ and Java: They MUST have a separate interface definition that is expressed using a common language (IDL) similar to C++ There is a mapping from this IDL to programming languages (C++, Java) Therefore, objects written in different languages can communicate with each other 72 Distributed Systems - CS 422 2.6.6 CORBA Services 1. Object life cycle: Defines how CORBA objects are created, removed, moved, and copied 2. Naming: Defines how CORBA objects can have friendly symbolic names 3. Events: Decouples the communication between distributed objects 4. Relationships: Provides arbitrary typed n-ary relationships between CORBA objects 5. 
Externalization: Coordinates the transformation of CORBA objects to and from external media. 6. Transactions: Coordinates atomic access to CORBA objects (complete success or complete failures for group of operations i.e. partial success or failure is not permissible) 7. Property: Supports the association of name-value pairs with CORBA objects 8. Trader: Supports the finding of CORBA objects based on properties describing the service offered by the object 9. Query: Supports queries on objects 2.6.7 CORBA Products 1. The Java 2 ORB: it comes with Sun's Java 2 SDK (Software Development Kit). It is missing several features. 2. VisiBroker for Java: A popular Java ORB from Inprise Corporation (new name of Borland after 1999). VisiBroker is also embedded in other products. For example, it is the ORB that is embedded in the Netscape Communicator browser. 3. OrbixWeb: A popular Java ORB from Iona Technologies. 4. WebSphere: A popular application server with an ORB from IBM. 5. Netscape Communicator: Netscape browsers have a version of VisiBroker embedded in them. Applets (A program designed to be executed from within another application. Unlike an application, applets cannot be executed directly 73 Distributed Systems - CS 422 from the operating system) can issue request on CORBA objects without downloading ORB classes into the browser. They are already there. CORBA Example 5 (The Stock Application) The stock trading application is a distributed application that illustrates the Java programming language and CORBA. In this introductory module only a small simple subset of the application is used. The stock application allows multiple users to watch the activity of stocks. The user is presented with a list of available stocks identified by their stock symbols. The user can select a stock and then press the "view" button. Object request broker (ORB) The basic functionality provided by the object request broker consists of: 1. Passing the requests from clients to the object implementations on which they are invoked. In order to make a request, the client can communicate with the ORB core through the Interface Definition Language stub or through the dynamic invocation interface (DII). 2. The stub represents the mapping between the language of implementation of the client and the ORB core. Thus the client can be written in any language as long as the implementation of the object request broker supports this mapping. 3. The ORB core then transfers the request to the object implementation which receives the request as an up-call through: an Interface Definition Language (IDL) skeleton (which represents the object interface at the server side and works with the client stub) or A dynamic skeleton interface (DSI) (a skeleton with multiple interfaces). 74 Distributed Systems - CS 422 Detailed ORB Architecture ORB-based object communications layout o1 o2 S (o1) S (o2) IDL stub IDL skeleton Object Request Broker 75 Distributed Systems - CS 422 Inter-ORB communications ORBs are not usually separate programs but are a set of objects in a library that are linked with an application when it is developed. ORBs handle communications between objects executing on the same machine. Several ORBS may be available and each computer in a distributed system will have its own ORB (obligatory). Inter-ORB communications are used for distributed object calls. Advantages of ORB The ORB implements programming language independence for the request. 
The client issuing the request can be written in a different programming language from the implementation of the CORBA object. The ORB does the necessary translation between programming languages. Language bindings are defined for all popular programming languages (C, C++, Java, Ada, COBOL, Smalltalk, Objective-C, and Lisp).

2.6.8 CORBA Servant Class

    public class HelloServant extends HelloPOA {
        /* inherits from the IDL-generated superclass HelloPOA
           (POA = Portable Object Adapter) */
        private ORB orb;

        public HelloServant(ORB orb) {
            this.orb = orb;
        }

        public String sayHello() {
            return "Hello From CORBA Server...";
        }

        public void shutdown() {
            orb.shutdown(false);
        }
    }

2.6.9 CORBA Server Class – 1

Create and initialize an ORB instance:
    ORB orb = ORB.init(args, null);

Get a reference to the root POA and activate the POAManager:
    POA rootpoa = POAHelper.narrow(orb.resolve_initial_references("RootPOA"));
    rootpoa.the_POAManager().activate();

Create a servant instance and obtain a CORBA object reference for it:
    HelloServant helloImpl = new HelloServant(orb);
    org.omg.CORBA.Object ref = rootpoa.servant_to_reference(helloImpl);
    Hello href = HelloHelper.narrow(ref);

Obtain the initial object reference of the naming service:
    org.omg.CORBA.Object objRef = orb.resolve_initial_references("NameService");  // persistent NS
    // orb.resolve_initial_references("TNameService");  // transient NS (Java IDL Transient Naming Service)

Narrow the naming context:
    NamingContextExt ncRef = NamingContextExtHelper.narrow(objRef);

Register the new object in the naming context:
    NameComponent path[] = ncRef.to_name("swe622");
    ncRef.rebind(path, href);

Wait for invocations of the object from a client:
    orb.run();

2.6.10 CORBA Client

Create an ORB object:
    ORB orb = ORB.init(args, null);

Obtain the initial object reference:
    org.omg.CORBA.Object objRef = orb.resolve_initial_references("NameService");

Narrow the object reference to get the naming context:
    NamingContextExt ncRef = NamingContextExtHelper.narrow(objRef);

Resolve the object reference in the naming service:
    Hello helloImpl = HelloHelper.narrow(ncRef.resolve_str("swe622"));

Invoke the method:
    helloImpl.sayHello();

Revision Sheet # 2
PROBLEMS
1. If a client and a server are placed far apart, we may see network latency dominating overall performance. How can we tackle this problem?
2. What is a three-tiered client-server architecture?
3. Explain a practical example of a thin-client architecture.
4. Explain a practical example of a fat-client architecture.
5. What is the difference between a vertical distribution and a horizontal distribution?
6. To what extent are interceptors dependent on the middleware where they are deployed?

Assignment # 5
Download "the stock application" file from the following link:
https://www.slideshare.net/SenthilKanth/stock-applicationusing-corba
Read it carefully. Prepare a pseudo-code algorithm for performing this CORBA/Java application. Note: there is no need to dig into the Java code.

Chapter 3
Synchronization

In this chapter, we mainly concentrate on how processes can synchronize. For example, it is important that multiple processes do not simultaneously access a shared resource, such as a printer, but instead cooperate in granting each other temporary exclusive access. Another example is that multiple processes may sometimes need to agree on the ordering of events, such as whether message m1 from process P was sent before or after message m2 from process Q.
As it turns out, synchronization in distributed systems is often much more difficult compared to synchronization in uniprocessor or multiprocessor systems. The problems and solutions that are discussed in this chapter are, by their nature, rather general, and occur in many different situations in distributed systems. We start with a discussion of the issue of synchronization based on actual time, followed by synchronization in which only relative ordering matters rather than ordering in absolute time. In many cases, it is important that a group of processes can appoint one process as a coordinator, which can be done by means of election algorithms. We discuss various election algorithms in a separate section. 3.1 Clock Synchronization It is a problem that deals with the idea that internal clocks of several computers may differ. Even when initially set accurately, real clocks will differ after some amount of time due to clock drift, caused by clocks counting time at slightly different rates. Such "clock synchronization" is used for synchronization in telecommunications and automatic baud rate detection. In a centralized system, time is unambiguous. When a process wants to know the time, it makes a system call and the kernel tells it. If process A asks for the time. And 80 Distributed Systems - CS 422 then a little later process B asks for the time, the value that B gets will be higher than (or possibly equal to) the value A got. It will certainly not be lower. In a distributed system, achieving agreement on time is not trivial. Related Problems Besides the incorrectness of the time itself, there are problems associated with clock skew that take on more complexity in a distributed system in which several computers will need to realize the same global time: For instance, in Unix systems; the make command is used to compile new or modified code without the need to recompile unchanged code (compile modified code only). The make command uses the clock of the machine it runs on to determine which source files need to be recompiled (latest version based on clock information). If the sources reside on a separate file server (containing compiler) and the two machines have unsynchronized clocks, the make program (editor) might not produce the correct results. Clock Synchronization Consequences The way make normally works is simple. When the programmer has finished changing all the source files, he runs make, which examines the times at which all the source and object files were last modified. If the source file input. c has time 2151 and the corresponding object file input.o has time 2150, make knows that input.c has been changed since input.o was created, and thus input.c must be recompiled. On the other hand, if output.c has time 2144 and output.o has time 2145, no compilation is needed. Thus make goes through all the source files to find out which ones need to be recompiled and calls the compiler to recompile them. Now imagine what could happen in a distributed system in which there were no global agreement on time. Suppose that output.o has time 2144 as above, and shortly thereafter output.c is modified but is 81 Distributed Systems - CS 422 assigned time 2143 because the clock on its machine is slightly behind, as shown in the following figure. Make will not call the compiler. The resulting executable binary program will then contain a mixture of object files from the old sources and the new sources. 
It will probably crash and the programmer will go crazy trying to understand what is wrong with the code. There are many more examples where an accurate account of time is needed. The example above can easily be reformulated to file timestamps in general. In addition, think of application domains such as financial brokerage, security auditing, and collaborative sensing, and it will become clear that accurate timing is important.

Since time is so basic to the way people think, and the effect of not having all the clocks synchronized can be so dramatic, it is fitting that we begin our study of synchronization with the simple question: Is it possible to synchronize all the clocks in a distributed system? The answer is surprisingly complicated. When each machine has its own clock, an event 2 that occurred after another event 1 may be assigned an earlier time (the make program may not call the compiler, and the resulting executable binary program may crash).

Traditional Solutions
In a centralized system the solution is trivial; the centralized server will dictate the system time. There are some solutions to the clock synchronization problem in a centralized-server environment:
• Cristian's algorithm
• The Berkeley algorithm

Internet Clock Synchronization Solution
In a distributed system the problem takes on more complexity because a global time is not easily known. The most widely used clock synchronization solution on the Internet is the Network Time Protocol (NTP), a layered client-server architecture based on User Datagram Protocol (UDP) message passing, part of the set of network protocols used for the Internet. With UDP, computer applications can send messages, in this case referred to as datagrams, to other hosts on an Internet Protocol (IP) network without requiring prior communications to set up special transmission channels or data paths (open transmission).

3.1.1 Measuring the Time
From the astronomers' point of view, a solar second is 1/(24×60×60) = 1/86,400 of a solar day. From the physicist's point of view, the most accurate timekeeping devices are atomic clocks (1948), which are accurate to within seconds over many millions of years and are used to calibrate other clocks and timekeeping instruments. Atomic clocks use the spin property of atoms as their basis, and since 1967, the International System of Measurements bases its unit of time, the second, on the properties of cesium atoms.

Atomic clocks
The atomic clock ensemble at the U.S. Naval Observatory may be accessed by telephone (202-762-1401) or via Internet Network Time Protocol (NTP) servers. Caesium is also used in atomic clocks, which use the resonant vibration frequency of caesium-133 atoms as a reference. Precise caesium clocks today measure frequency with an accuracy of 2 to 3 parts in 10^14, which corresponds to a time measurement accuracy of 2 nanoseconds per day, or one second in 1.4 million years. The latest versions in the United States and France are accurate to 1.7 parts in 10^15, or 1 second in 17 million years, which has been regarded as "the most accurate realization of a unit that mankind has yet achieved."

The picture below is the first atomic clock, constructed in 1949 by the US National Bureau of Standards.

The International System of Units (abbreviated SI from the French le Système international d'unités) is the modern form of the metric system.
SI defines the second as 9,192,631,770 cycles of that radiation which corresponds to the transition between two electron spin energy levels of the ground state of the 133Cs atom. An SI prefix (also known as a metric prefix) is a name or associated symbol that precedes a basic unit of measure (or its symbol) to form a decimal multiple or submultiple. SI prefixes are used to reduce the number of zeros shown in numerical quantities. For example, one billionth of an ampere (a small electrical current) can be written as 0.000 000 001 ampere. In symbol form, this is written as 1 Nano ampere or 1 nA. Today, the Global Positioning System in coordination with the Network Time Protocol can be used to synchronize timekeeping systems across the globe. As of 2006, the smallest unit of time that has been directly measured is on the attosecond (10−18 s) time scale (Atto- was made from the Danish word for eighteen (atten)). 84 Distributed Systems - CS 422 World Time The basis for scientific time is a continuous count of seconds based on atomic clocks around the world, known as the International Atomic Time (TAI). Coordinated Universal Time (UTC) is the basis for modern civil time. Since January 1, 1972, it has been defined to follow TAI with an exact offset of an integer number of seconds, changing only when a leap second is added to keep clock time synchronized with the rotation of the Earth. In TAI and UTC systems, the duration of a second is constant, as it is defined by the unchanging transition period of the cesium atom. Greenwich Mean Time (GMT) is an older standard, adopted starting with British railroads in 1847(It is the site of the original Royal Observatory, through which passes the prime meridian (An imaginary great circle on the earth's surface passing through the North and South geographic poles. All points on the same meridian have the same longitude,) or longitude 0°)). Using telescopes instead of atomic clocks, GMT was calibrated to the mean solar time at the Royal Observatory, Greenwich in the UK. Universal Time (UT) is the modern term for the international telescope-based system, adopted to replace "Greenwich Mean Time" in 1928 by the International Astronomical Union. Observations at the Greenwich Observatory itself stopped in 1954, though the location is still used as the basis for the coordinate system Prime meridian 85 Distributed Systems - CS 422 3.1.2 GPS and UTC As a step toward actual clock synchronization problems, we first consider a related problem, namely determining one's geographical position anywhere on Earth. This positioning problem is by itself solved through a highly specific. Dedicated distributed system, namely GPS, which is an acronym for global positioning system. GPS is a satellite-based distributed system that was launched in 1978. Although it has been used mainly for military applications, in recent years it has found its way too many civilian applications, notably for traffic navigation. However, many more application domains exist. For example, GPS phones now allow to let callers track each other's position, a feature which may show to be extremely handy when you are lost or in trouble. This principle can easily be applied to tracking other things as well, including pets, children, cars, boats, and so on. An excellent overview of GPS is provided by Zogg (2002). 
The Global Positioning System (GPS); is a USA military distributed system consisting of a group of satellites each circulating in an orbit at a height of approximately 20,000 Km and each one has 4 atomic clocks to broadcasts: a very precise time signal worldwide, Instructions for converting GPS time to UTC. To determine the longitude, latitude, and altitude of a receiver on earth at specific time; we need only 4 satellites from the GPS satellites Other Similar Systems GLONASS, acronym for "Global Navigation Satellite System", is a space-based satellite navigation system operated by the Russian Aerospace Defense Forces. By 2010, GLONASS had achieved 100% coverage of Russia's territory and in October 2011, the full orbital constellation of 24 satellites was restored, enabling full global coverage. 86 Distributed Systems - CS 422 There is also the planned European Union Galileo positioning system, India's Indian Regional Navigation Satellite System, and the Chinese Beidou Navigation Satellite System. Dynamics of GPS A visual example of the GPS constellation in motion with the Earth rotating. Notice how the number of satellites in view from a given point on the Earth's surface, in this example at 45°N; changes with time. GPS Layout The Global Positioning System (GPS) is a U.S.-owned utility that provides users with positioning, navigation, and timing that provides users with positioning, navigation, and timing (PNT) services. This system consists of three segments: the space segment, the control segment, and the user segment. The U.S. Air Force develops, maintains, and operates the space and control segments. Space Segment The GPS space segment consists of a constellation of satellites transmitting radio signals to users. The United States is committed to maintaining the availability of at least 24 operational GPS satellites, 95% of the time. To ensure this commitment, the Air Force has been flying 31 operational GPS satellites for the past few years. 87 Distributed Systems - CS 422 The GPS control segment consists of a global network of ground facilities that track the GPS satellites, monitor their transmissions, perform analyses, and send commands and data to the constellation. User segment The user segment consists of the GPS receiver equipment, which receives the signals from the GPS satellites and uses the transmitted information to calculate the user’s three-dimensional position and time. GPS Math Model Basics The unknown position of the ship can be expressed as a point (x, y, z), which can later be translated into a latitude and longitude on a map. Let us mark off the three 88 Distributed Systems - CS 422 axes in units equal to the radius of the earth. Thus, a point at sea level will have x2+ y2+ z2= 1 in this system. Also, we will measure time in units of milliseconds. The GPS system finds distances by knowing how long it takes a radio signal to get from one point to another. For this we need to know the speed of light, approximately equal to .047 (in units of earth radii per millisecond). Let (x, y, z) be the ship’s position as a receiver and at the time when the signals arrive. Our goal is to determine the values of these variables at certain dedicated time. Using the data from the first satellite, we can compute the distance from the ship as follows: The signal was sent from satellite at time 19.9 and arrived to the ship at time t. 
Traveling at a speed of 0.047, that makes the distance:

d = 0.047(t − 19.9)

This same distance can be expressed in terms of (x, y, z) and the satellite's position (1, 2, 0):

d = √((x − 1)² + (y − 2)² + (z − 0)²)

Combining these results leads to the equation:

(x − 1)² + (y − 2)² + z² = 0.047²(t − 19.9)²

Similar equations can be derived for the three other satellites. Expanding the squares and writing all four equations together gives:

2x + 4y + 0z − 2(0.047²)(19.9)t = 1² + 2² + 0² − 0.047²(19.9)² + x² + y² + z² − 0.047²t²
4x + 0y + 4z − 2(0.047²)(2.4)t = 2² + 0² + 2² − 0.047²(2.4)² + x² + y² + z² − 0.047²t²
2x + 2y + 2z − 2(0.047²)(32.6)t = 1² + 1² + 1² − 0.047²(32.6)² + x² + y² + z² − 0.047²t²
4x + 2y + 0z − 2(0.047²)(19.9)t = 2² + 1² + 0² − 0.047²(19.9)² + x² + y² + z² − 0.047²t²

Solving these 4 equations in 4 unknowns, we get two solutions: t = 43.1 and t = 49.91.

If we select the first solution, t = 43.1, then (x, y, z) = (1.317, 1.317, 0.790), which has a length of about 2. We are using units of earth radii, so this point is around 4000 miles above the surface of the earth (an airplane, not a ship, i.e. a rejected answer). The second value, t = 49.91, leads to (x, y, z) = (0.667, 0.667, 0.332), with length 0.9997. That places the point on the surface of the earth (to four decimal places) and gives us the reasonable location of the ship.

So far, we have assumed that measurements are perfectly accurate. Of course, they are not. For one thing, GPS does not take leap seconds into account. In other words, there is a systematic deviation from UTC, which by January 1, 2006 was 14 seconds. Such an error can be easily compensated for in software. However, there are many other sources of errors, starting with the fact that the atomic clocks in the satellites are not always in perfect synch, the position of a satellite is not known precisely, the receiver's clock has a finite accuracy, the signal propagation speed is not constant (signals slow down when entering, e.g., the ionosphere), and so on. Moreover, the earth is not a perfect sphere, leading to further corrections. By and large, computing an accurate position is far from a trivial undertaking and requires going down into many gory details. Nevertheless, even with relatively cheap GPS receivers, positioning can be precise within a range of 1-5 meters. Moreover, professional receivers (which can easily be hooked up to a computer network) have a claimed error of less than 20-35 nanoseconds. Again, we refer to the excellent overview by Zogg (2002) as a first step toward getting acquainted with the details.

UTC Availability
To provide Coordinated Universal Time (UTC) to people who need precise time, the National Institute of Standards and Technology (NIST) operates a shortwave radio station with call letters WWV from Fort Collins, Colorado. WWV broadcasts a short pulse at the start of each UTC second, with an accuracy of ±1 msec. Several earth satellites also offer a UTC service with accuracy ±0.5 msec. NIST radio station WWV broadcasts time and frequency information 24 hours per day, 7 days per week to millions of listeners worldwide. WWV is located in Fort Collins, Colorado, about 100 kilometers north of Denver. The broadcast information includes time announcements, standard time intervals, standard frequencies, UT1 time corrections, a BCD time code, geophysical alerts, marine storm warnings, and Global Positioning System (GPS) status reports.

3.2 Logical vs.
physical clocks Logical clock keeps track of event ordering among related events not necessary to be agreed with real time (e.g. in make example, all m/c must agree that time is 10:00) even it is 10:02 Lamport timestamps and Vector clocks are concepts of the logical clocks in distributed systems. Physical clocks keep time of day and are consistent across systems and must not be deviated from real time by more than certain amount. Instead, a distributed system really has an approximation of the Physical Time across all its machines. Logical Clocks refer to implementing a protocol on all machines within your distributed system, so that the machines are able to maintain consistent ordering of events within some virtual timespan. 91 Distributed Systems - CS 422 3.2.1 Logical Clock So far, we have assumed that clock synchronization is naturally related to real time. However, we have also seen that it may be sufficient that every node agrees on a current time, without that time necessarily being the same as the real time. We can go one step further. For running make, for example, it is adequate that two nodes agree that input.o is outdated by a new version of input.c. In this case, keeping track of each other's events (such as a producing a new version of input.c) is what matters. For these algorithms, it is conventional to speak of the clocks as logical clocks. In a classic paper, Lamport (1978) showed that although clock synchronization is possible, it need not be absolute. If two processes do not interact, it is not necessary that their clocks be synchronized because the lack of synchronization would not be observable and thus could not cause problems. Furthermore, he pointed out that what usually matters is not that all processes agree on exactly what time it is, but rather that they agree on the order in which events occur. In the make example, what counts is whether input.c is older or newer than input.o, not their absolute creation times. As a conclusion, logical clocks are clocks used in asynchronous distributed systems for ordering events since there are no global physical clocks available. Logical clock algorithms can be categorized into: Lamport timestamps, which are monotonically increasing Vector clocks, that allow for total ordering of events in a distributed system. Matrix clocks, an extension of vector clocks that also contains information about other processes' views of the system. 92 Distributed Systems - CS 422 Logical Clock in Distributed System Logical Clocks refer to implementing a protocol on all machines within your distributed system, so that the machines are able to maintain consistent ordering of events within some virtual timespan. A logical clock is a mechanism for capturing chronological and causal relationships in a distributed system. Distributed systems may have no physically synchronous global clock, so a logical clock allows global ordering on events from different processes in such systems. Example If we go outside then we have made a full plan that at which place we have to go first, second and so on. We don’t go to second place at first and then the first place. We always maintain the procedure or an organization that is planned before. In a similar way, we should do the operations on our PCs one by one in an organized way. Suppose, we have more than 10 PCs in a distributed system and every PC is doing its own work but then how we make them work together. There comes a solution to this i.e. LOGICAL CLOCK. 
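To make the idea concrete before the detailed rules below, here is a minimal Python sketch of such a logical clock (the process and event names are illustrative assumptions, not part of the course example): each machine keeps a simple counter, increments it on every local event, and pushes it past the sender's value whenever a message is received.

    class LamportClock:
        """A minimal Lamport-style logical clock."""
        def __init__(self):
            self.time = 0

        def local_event(self):
            # any internal event advances the counter
            self.time += 1
            return self.time

        def send_event(self):
            # increment, then attach the value to the outgoing message
            self.time += 1
            return self.time

        def receive_event(self, msg_time):
            # jump past the sender's timestamp, then count the receive event
            self.time = max(self.time, msg_time) + 1
            return self.time

    A, B = LamportClock(), LamportClock()
    t_send = A.send_event()           # event a on machine A -> timestamp 1
    t_recv = B.receive_event(t_send)  # event b on machine B -> timestamp 2
    print(t_send < t_recv)            # True: a happened-before b implies C(a) < C(b)

The formal rules behind these counters are exactly the Lamport timestamp and vector clock algorithms discussed next.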
Method-1
One approach to ordering events across processes is to try to synchronize the clocks. This means that if one PC reads 2:00 pm then every PC should read the same time, which is practically impossible: not every clock can be kept in perfect sync, so we cannot rely on this method.

Method-2
Another approach is to assign timestamps to events. Taking the earlier example into consideration, this means assigning the first place the number 1, the second place 2, the third place 3, and so on. Then we always know that the first place comes first, and so on. Similarly, if we give each PC its own number, the work can be organized so that the first PC completes its process first, then the second, and so on. BUT timestamps will only work as long as they obey causality.

What is causality? Causality is based on the HAPPENED-BEFORE relationship.
• Taking a single PC: if two events A and B occur one after the other, then TS(A) < TS(B). If A has a timestamp of 1, then B should have a timestamp greater than 1; only then does the happened-before relationship hold.
• Taking two PCs, with event A the sending of a message on P1 (PC 1) and event B its receipt on P2 (PC 2), the condition must again be TS(A) < TS(B). For example, suppose you send a message to someone at 2:00:00 pm and the other person receives it at 2:00:02 pm. Then obviously TS(sender) < TS(receiver).

Properties derived from the happened-before relationship:
• Transitive relation – if TS(A) < TS(B) and TS(B) < TS(C), then TS(A) < TS(C).
• Causally ordered relation – a → b means that a occurs before b, and any change in a can affect b.
• Concurrent events – not every pair of events is ordered; some events happen independently of each other, written A || B.

Lamport Timestamps Algorithm
Lamport's logical clocks lead to a situation where all events in a distributed system are totally ordered, with the property that if event a happened before event b, then a will also be positioned in that ordering before b, that is, C(a) < C(b). However, with Lamport clocks, nothing can be said about the causal relationship between two events a and b by merely comparing their time values C(a) and C(b). In other words, if C(a) < C(b), this does not necessarily imply that a indeed happened before b. Something more is needed for that. To explain, consider the messages sent by three processes. Denote by Tsnd(mi) the logical time at which message mi was sent, and likewise by Trcv(mi) the time of its receipt. By construction, we know that for each message Tsnd(mi) < Trcv(mi). But what can we conclude in general from Trcv(mi) < Tsnd(mj)?

A Lamport clock may be used to create a partial causal ordering of events between processes. Given a logical clock following these rules, the following relation is true: if a → b then C(a) < C(b), where → means happened-before. This relation only goes one way and is called the clock consistency condition (CCC): if one event comes before another, then that event's logical clock comes before the other's (weak CCC). The strong clock consistency condition, which is two-way (C(a) < C(b) if and only if a → b), can be obtained by other techniques such as vector clocks.

Vector Clocks Algorithm
Initially all clocks are zero.
Each time a process experiences an internal event, it increments its own logical clock in the vector by one. Each time a process prepares to 95 Distributed Systems - CS 422 send a message, it increments its own logical clock in the vector by one and then sends its entire vector along with the message being sent (increment then send). Each time a process receives a message, it increments its own logical clock in the vector by one and updates each element in its vector by taking the maximum of: the value in its own vector clock and the value in the vector in the received message The problem is that Lamport clocks do not capture causality. Causality can be captured by means of vector clocks. Vector Clocks Algorithm Example 3.2.2 Physical Clock Nearly all computers have a circuit for keeping track of time. Despite the widespread use of the word "clock" to refer to these devices, they are not actually clocks in the usual sense. Timer is perhaps a better word. A computer timer is usually a precisely machined quartz crystal. When kept under tension, quartz crystals oscillate at 96 Distributed Systems - CS 422 a well-defined frequency that depends on the kind of crystal, how it is cut, and the amount of tension. Associated with each crystal are two registers, a counter and a holding register. Each oscillation of the crystal decrements the counter by one. When the counter gets to zero, an interrupt is generated and the counter is reloaded from the holding register. In this way, it is possible to program a timer to generate an interrupt 60 times a second, or at any other desired frequency. Each interrupt is called one clock tick. When the system is booted, it usually asks the user to enter the date and time, which is then converted to the number of ticks after some known starting date and stored in memory. Most computers have a special battery-backed up CMOS RAM so that the date and time need not be entered on subsequent boots. At every clock tick, the interrupt service procedure adds one to the time stored in memory. In this way, the (software) clock is kept up to date. With a single computer and a single clock, it does not matter much if this clock is off by a small amount. Since all processes on the machine use the same. Clock, they will still be internally consistent. For example, if the file input.c has time 2151 and file input.o has time 2150, make will recompile the source file, even if the clock is off by 2 and the true times are 2153 and 2152, respectively. All that really matters are the relative times. As soon as multiple CPUs are introduced, each with its own clock, the situation changes radically. Although the frequency at which a crystal oscillator runs is usually fairly stable, it is impossible to guarantee that the crystals in different computers all run at exactly the same frequency. In practice, when a system has n computers, all n crystals will run at slightly different rates, causing the (software) clocks gradually to get out of synch and give different values when read out. This difference in time values is called clock skew. As a consequence of this clock skew, programs that expect the time associated with a file, object, process, or message to be correct and independent of the machine on which it 97 Distributed Systems - CS 422 was generated (i.e., which clock it used) can fail, as we saw in the make example above. In some systems (e.g., real-time systems), the actual clock time is important. Under these circumstances, external physical clocks are needed. 
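To get a feel for the numbers (the drift rate and tolerance below are assumed purely for illustration): if each timer is guaranteed to drift from real time at a rate of at most ρ, two clocks drifting in opposite directions can be up to 2ρΔt apart after Δt seconds, so keeping the skew below a tolerance δ requires resynchronizing at least every δ/2ρ seconds:

\[
\rho = 10^{-5},\ \delta = 1\ \mathrm{ms}
\;\Rightarrow\;
\frac{\delta}{2\rho}=\frac{10^{-3}}{2\times10^{-5}}=50\ \mathrm{s}\ \text{between resynchronizations.}
\]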
For reasons of efficiency and redundancy, multiple physical clocks are generally considered desirable, which yields two problems: (1) How do we synchronize them with real world clocks? (2) How do we synchronize the clocks with each other? As a conclusion, all computers have a circuit for keeping track of time. This circuit is referred as “clock”. These computer timer circuit is m/c quartz crystal, which oscillate at a well-defined frequency. Associated with each crystal, two registers: Counter register Holding register Each oscillation of the crystal will decrement the counter by one. When the counter gets to zero, an interrupt is generated, and the counter is loaded from the holding register. In this way, it is possible to program a timer to generate an interrupt 60 times a second, or at any desired frequency. Each interrupt is called one clock tick (a computer pumping heart!) In distributed systems, having multiple n CPU with n multiple clocks runs at slightly different rates; software clocks gradually to get out of synch and give different values when read out. This difference in time values is called clock skew (deviation). In real time systems, the actual clock time is important; so external physical clocks are needed. There are two important questions: How do we synchronize these multiple physical clocks with real world clocks? How do we synchronize the clocks with each other? 98 Distributed Systems - CS 422 3.3 Clock Synchronization Algorithms If one m/c has a WWV receiver, knowing UTC; the goal becomes keeping all the other machines synchronized to it. (The Cristian's Algorithm in 3.3.1) If no machines have a WWV receiver, each m/c keeps track of its own time and the goal is to keep all the machines together as well as possible (The Berkeley algorithm in 3.3.2). These algorithms try to accurately resynchronize the clocks if they are drifting from the UTC in opposite directions. 3.3.1 The Cristian's Algorithm (1989) A common approach in many protocols and originally proposed by Cristian (1989) is to let clients contact a time server. The latter can accurately provide the current time, for example, because it is equipped with a WWV receiver or an accurate clock. The problem, of course, is that when contacting the server, message delays will have outdated the reported time. The trick is to find a good estimation for these delays. Cristian's algorithm relies on the existence of a time server. The time server maintains its clock by using a radio clock or other accurate time source, then all other computers in the system stay synchronized with it. A time client will maintain its clock by making a procedure call to the time server. Variations of this algorithm make more precise time calculations by factoring in network propagation time. So, it is a method for clock synchronization which can be used in many fields of distributive computer science but is primarily used in low-latency intranets. Cristian observed that this simple algorithm is probabilistic, in that it only achieves synchronization if the round-trip time (RTT) of the request is short compared to required accuracy. It also suffers in implementations using a single server, making it unsuitable for many distributive applications where redundancy may be critical. It works between a 99 Distributed Systems - CS 422 process P, and a time server S — connected to a source of UTC (Coordinated Universal Time): 1. P requests the time from S 2. 
After receiving the request from P, S prepares a response and appends the time T, from its own clock at the last possible moment before dispatch. 3. P then sets its time to be T + RTT/2 4. P needs to record the Round Trip Time (RTT) of the request it made to S so that it can set its clock to: T + RTT/2. This method assumes that the RTT is split equally between both request and response, which may not always be the case but is a reasonable assumption on a LAN connection. It is important to note that the time is attached at the last possible moment before being returned to P. This is to eliminate inaccuracies caused by network delay. Enhancement of the Cristian's Algorithm Further accuracy can be gained by making multiple requests to S and using the response with the shortest RTT. We can estimate the accuracy of the system by taking RTT/2 from the fastest response as a value we call min. The earliest point at which S could have placed the time T was min after P sent its request. Therefore, the time at S when the message is received by P is in the range: (T + min) to (T + RTT - min) The width of this range is (RTT - 2*min). This gives an accuracy of (RTT/2 - min). Cristian’s algorithm modelling Compensate for delays Note times: o request sent: T0 o reply received: T1 Assume network delays are symmetric 100 Distributed Systems - CS 422 Client sets time to: Error bounds 101 Distributed Systems - CS 422 3.3.2 The Berkeley Algorithm (1989) In many algorithms such as NTP, the time server is passive. Other machines periodically ask it for the time. All it does is respond to their queries. In Berkeley UNIX, exactly the opposite approach is taken (Gusella and Zatti, 1989). Here the time server (actually, a time daemon) is active, polling every machine from time to time to ask what time it is there. Based on the answers, it computes an average time and tells all the other machines to advance their clocks to the new time or slow their clocks down until some specified reduction has been achieved. This method is suitable for a system in which no machine has a WWV receiver. The time daemon's time must be set manually by the operator periodically. It is a method of clock synchronization in distributed computing which assumes no machine has an accurate time source, which assumes no machine has an accurate time source. It was developed by Gusella and Zatti at the University of California, Berkeley in 1989 and like Cristian's algorithm is intended for use within intranets. This algorithm is more suitable for systems where a radio clock is not present, this system has no way of making sure of the actual time other than by maintaining a global average time as the global time as follows: 1. A time server will periodically fetch the time from all the time clients 2. Average the results 3. Report back to the clients the adjustment that needs be made to their local clocks to achieve the average. Enhancement of the Berkeley Algorithm This algorithm highlights the fact that internal clocks may vary not only in the time they contain but also in the clock rate. Often, any client whose clock differs by a value outside of a given tolerance is disregarded when averaging the results. This prevents the overall system time from being drastically skewed due to one wrong clock. 102 Distributed Systems - CS 422 Berkeley Algorithm Details Unlike Cristian's algorithm the server process in Berkeley algorithm, called the master periodically polls other slave process: a. A master is chosen via an election process (section 3.5) b. 
The master polls the slaves, who reply with their time in a similar way to Cristian's algorithm. c. The master observes the round-trip time (RTT) of the messages and estimates the time of each slave and its own. d. The master then averages the clock times, ignoring any values it receives that are far outside the values of the others. e. Instead of sending the updated current time back to the other processes, the master sends out the amount (positive or negative) by which each slave must adjust its clock. This avoids further uncertainty due to RTT at the slave processes.

3.4 Parallel and Distributed Processing Problems
Distributed and parallel processing aims at the highest level of computer performance. Achieving such a goal faces many practical problems. Some of these problems are: the Distributed Mutual Exclusion Problem, the Distributed Termination Problem, and the Byzantine Generals Problem.

3.4.1 Mutual Exclusion Algorithm Outline
Fundamental to distributed systems is the concurrency and collaboration among multiple processes. In many cases, this also means that processes will need to simultaneously access the same resources. To prevent such concurrent accesses from corrupting the resource, or making it inconsistent, solutions are needed to grant mutually exclusive access by processes. In this section, we take a look at some of the more important distributed algorithms that have been proposed. A recent survey of distributed algorithms for mutual exclusion is provided by Saxena and Rai (2003). Older, but still relevant, is Velazquez (1993).

Distributed mutual exclusion algorithms can be classified into two different categories. In token-based solutions, mutual exclusion is achieved by passing a special message between the processes, known as a token. There is only one token available, and whoever holds that token is allowed to access the shared resource. When finished, the token is passed on to a next process. If a process holding the token is not interested in accessing the resource, it simply passes it on. Token-based solutions have a few important properties. First, depending on how the processes are organized, they can fairly easily ensure that every process will get a chance at accessing the resource. In other words, they avoid starvation. Second, deadlocks, by which several processes are waiting for each other to proceed, can easily be avoided, contributing to their simplicity. Unfortunately, the main drawback of token-based solutions is a rather serious one: when the token is lost (e.g., because the process holding it crashed), an intricate distributed procedure needs to be started to ensure that a new token is created, but above all, that it is also the only token.

As an alternative, many distributed mutual exclusion algorithms follow a permission-based approach. In this case, a process wanting to access the resource first requires the permission of other processes. There are many different ways of granting such permission, and in the sections that follow we will consider a few of them. The algorithm presented here is based on sequence numbers, like the bakery algorithm. However, since the nodes cannot directly read the internal variables of the other nodes, the comparison of sequence numbers and the decision to enter the critical section must be made by sending and receiving messages.
The basic idea is the same: a node chooses a number, broadcasts its choice to the other nodes (a Request message), and then waits until it has received confirmation (a Reply) from each other node that the number chosen is now the lowest outstanding sequence number (i.e. highest priority). Ties on the chosen sequence number are resolved arbitrarily in favor of the node with the lowest identification number.

    task body Main_Process_Type is
    begin
      loop
        Non_Critical_Section;
        -- Pre-protocol
        Choose_Sequence_Number;
        Send_Request_to_Nodes;
        Wait_for_Reply;
        Critical_Section;
        -- Post-protocol
        Reply_to_Deferred_Nodes;
      end loop;
    end Main_Process_Type;

3.4.2 Distributed Termination Problem
This problem concerns detecting when a distributed computation has terminated. The Dijkstra–Scholten algorithm is a tree-based algorithm which can be described as follows:
The initiator of a computation is the root of the tree.
Upon receiving a computational message:
o If the receiving process is currently not in the computation: the process joins the tree by becoming a child of the sender of the message. (No acknowledgement message is sent at this point.)
o If the receiving process is already in the computation: the process immediately sends an acknowledgement message to the sender of the message.
When a process has no more children and has become idle, the process separates itself from the tree by sending an acknowledgement message to its tree parent ("I am out" or "Pass").
Termination occurs when the initiator has no children and has become idle because it has no parent.

3.4.3 The Byzantine Generals Problem (BGP)
Byzantine refers to the Byzantine Generals' Problem, an agreement problem (first proposed by Marshall Pease, Robert Shostak, and Leslie Lamport in 1980) in which generals of the Byzantine Empire's army must decide unanimously whether to attack some enemy army. The Byzantine army was chosen as an example for the problem because the Byzantine state experienced frequent duplicity among the high levels of its administration. The problem is complicated by:
The geographic separation of the generals, who must communicate by sending messengers to each other,
The presence of traitors amongst the generals.
These traitors can act arbitrarily in order to achieve the following aims: trick some generals into attacking; force a decision that is not consistent with the generals' desires, e.g. forcing an attack when no general wished to attack; or confuse some generals to the point that they are unable to make up their minds. If the traitors succeed in any of these goals, any resulting attack is hopeless, as only a concerted effort can result in victory.
Byzantine fault tolerance can be achieved if the loyal (non-faulty) generals reach a unanimous agreement on their strategy. Note that if the source general is correct, all loyal generals must agree upon that value; otherwise, the choice of strategy agreed upon is inappropriate.
The Classic Problem:
Each division of the Byzantine army is commanded by its own general.
o Generals, some of whom are traitors, communicate with each other by messengers.
Requirements:
o All loyal generals decide upon the same plan of action.
o A small number of traitors cannot cause the loyal generals to adopt a bad plan (i.e. their effect is minimized).
The problem can be restated as:
All loyal generals receive the same information, upon which they will somehow reach the same decision (unification of received information).
The information sent by a loyal general should be used by all the other loyal generals (unification of decision).

Reliability by Majority Voting
One way to achieve reliability is to have multiple replicas of the system (or component) and take a majority vote among them. In order for the majority voting to yield a reliable system, the following two conditions should be satisfied:
1. All non-faulty components must use the same input value.
2. If the input unit is non-faulty, then all non-faulty components use the value it provides as input (a trusted source).

Impossibility Results
No solution exists if 2/3 or fewer of the generals are loyal (i.e. more than 2/3 loyal or non-faulty components assures a solution).

Practical Use Cases of BGP
Distributed file systems: many small, latency-sensitive requests (tampering with files, lost updates).
Overlay multicast: transfers large volumes of data (tampering with content, freeloading).
P2P email: complex, large, decentralized mail (denial of service by misrouting).
Not only agreement but also identifying faulty nodes is important!

3.5 Election Algorithms
Many distributed algorithms require one process to act as coordinator, initiator, or otherwise perform some special role. In general, it does not matter which process takes on this special responsibility, but one of them has to do it. In this section we will look at algorithms for electing a coordinator (using this as a generic name for the special process). If all processes are exactly the same, with no distinguishing characteristics, there is no way to select one of them to be special. Consequently, we will assume that each process has a unique number, for example, its network address (for simplicity, we will assume one process per machine). In general, election algorithms attempt to locate the process with the highest process number and designate it as coordinator. The algorithms differ in the way they do the location.

Furthermore, we also assume that every process knows the process number of every other process. What the processes do not know is which ones are currently up and which ones are currently down. The goal of an election algorithm is to ensure that when an election starts, it concludes with all processes agreeing on who the new coordinator is to be. There are many algorithms and variations, of which several important ones are discussed in the textbooks by Lynch (1996) and Tel (2000), respectively.

We often need one process to act as a coordinator. It may not matter which process does this, but there should be group agreement on only one. Election algorithms assume that the processes are otherwise identical, with no distinguishing characteristics, apart from the following:
Each process can obtain a unique identifier (for example, a machine address and process ID) and each process knows of every other process but does not know which is up and which is down. 3.5.1 Leader Election Problem A leader among n processors is the processor that is recognized by all other processors as distinguished to perform a special task. The leader election problem occurs when the processors of a distributed system must choose one of them as a leader. Each processor should eventually decide whether or not it is a leader, given that each processor is only aware of its identification and not aware of any other processes. The problem of electing a leader in a distributed environment is most important in situations in which coordination among processors becomes necessary to recover from a failure or topological change. A leader in such situations is needed, for example, to coordinate the reestablishment of allocation and routing functions. Consider, for example, a token-ring network, in which a token moves around the network, giving its current owner the right to initiate communication. If the token is lost, a leader is needed in this case to coordinate the regeneration of the lost token. 3.5.2 Leader Election in Synchronous Rings Now, let us explore the basic idea behind the different leader election algorithms: Suppose that the communication graph is an arbitrary graph, G = (V, E). 109 Distributed Systems - CS 422 The following two steps summarize our first attempt to solve the problem: o Each node in the graph would broadcast its unique identifier to all other nodes. o After receiving the identifiers of all nodes, the node with the highest identifier declares itself as the leader. To determine what happens if the algorithm is modified to work under the synchronous model, let us suppose that the communication graph is a complete graph. The following two steps summarize the algorithm: 1. At the first round, each node sends its unique identifier to all other nodes. 2. At the end of the first round, every node has the identifiers of all nodes; the node with the highest identifier declares itself as the leader 3.5.3 Synchronous Message-Passing Model A synchronous system can be modeled as a state machine with the following components: M, a fixed-message alphabet A process i can be modeled as: o Qi, a (possibly infinite) set of states. The system state can be represented using a set of variables. o q0,i , the initial state in the state set Qi. The state variables have initial values in the initial state. o GenMsgi, a message-generation function. It is applied to the current system state to generate messages to the outgoing neighbors from elements in M. o Transi, a state transition function that maps the current state and the incoming messages into a new state. 3.5.4 Simple Leader Election Algorithm The idea of this simple algorithm is that each process sends its identifier all the way around the ring. 110 Distributed Systems - CS 422 The process that receives its identifier back is declared as a leader. This algorithm was presented by Chang and Roberts, Le Lann , Lynch. We assume the following: Communication is unidirectional (clock wise). The size of the ring is unknown. The identification of each processor is unique. The algorithm can be summarized as follows: 1. Each process sends its identifier to its outgoing neighbor. 2. When a process receives an identifier from its incoming neighbor, then. 
The process sends null to its outgoing neighbor, if the received identifier is less than its own identifier (block action).
3. The process sends the received identifier to its outgoing neighbor, if the received identifier is greater than its own identifier (bypass action).
4. The process declares itself the leader, if the received identifier is equal to its own identifier.

Assuming that the message alphabet M is the set of identifiers, Algorithm S_Elect_Leader_Simple is described as follows:

Algorithm S_Elect_Leader_Simple
  Qi
    u, some ID
    buff, some ID or null
    status, a value in {unknown, leader}
  q0,i
    u ← IDi
    buff ← IDi
    status ← unknown
  GenMsgi
    Send the current value of buff to the clockwise neighbor.
  Transi
    buff ← null
    if the incoming message is v and is not null, then
      case
        v < u : do nothing
        v = u : status ← leader
        v > u : buff ← v
      endcase

Revision Sheet # 3
PROBLEMS
1. One way to handle parameter conversion in RPC systems is to have each machine send parameters in its native representation, with the other one doing the translation, if need be. The native system could be indicated by a code in the first byte. However, since locating the first byte in the first word is precisely the problem, can this work?
2. Assume a client calls an asynchronous RPC to a server, and subsequently waits until the server returns a result using another asynchronous RPC. Is this approach the same as letting the client execute a normal RPC? What if we replace the asynchronous RPCs with synchronous RPCs?
3. Instead of letting a server register itself with a daemon as in DCE, we could also choose to always assign it the same end point. That end point can then be used in references to objects in the server's address space. What is the main drawback of this scheme?
4. Would it be useful also to make a distinction between static and dynamic RPCs?
5. Describe how connectionless communication between a client and a server proceeds when using sockets.
6. In the text we stated that in order to automatically start a process to fetch messages from an input queue, a daemon is often used that monitors the input queue. Give an alternative implementation that does not make use of a daemon.
7. Routing tables in IBM WebSphere, and in many other message-queuing systems, are configured manually. Describe a simple way to do this automatically.
8. Suppose that in a sensor network measured temperatures are not timestamped by the sensor, but are immediately sent to the operator. Would it be enough to guarantee only a maximum end-to-end delay?
9. How could you guarantee a maximum end-to-end delay when a collection of computers is organized in a (logical or physical) ring?
10. How could you guarantee a minimum end-to-end delay when a collection of computers is organized in a (logical or physical) ring?

Assignment # 6
GPS Math Model: For the following case, compute xr, yr, zr for a ship, which is our receiver, where t is the time when the signal from each satellite arrives.
GPS Practical Case Study: Our ship is at an unknown position and has no clock. It receives simultaneous signals from four satellites, giving their positions and times shown in the table below.

Assignment # 7
Trace the Vector Clocks Algorithm to describe the creation of the 7 events in the figure.
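For Assignment #7, a helper sketch such as the following can be used to check a hand trace of the Vector Clocks Algorithm; it follows the increment-and-merge rules from Section 3.2.1, and the process names and the single message shown are illustrative assumptions, not the events of the figure.

    class VectorClock:
        """Vector clock for one process, following the rules of Section 3.2.1."""
        def __init__(self, name, processes):
            self.name = name
            self.clock = {p: 0 for p in processes}

        def internal_event(self):
            self.clock[self.name] += 1          # internal event: increment own entry
            return dict(self.clock)

        def send_event(self):
            self.clock[self.name] += 1          # increment, then ship the whole vector
            return dict(self.clock)

        def receive_event(self, msg_clock):
            self.clock[self.name] += 1          # increment own entry...
            for p, t in msg_clock.items():      # ...then take the element-wise maximum
                self.clock[p] = max(self.clock[p], t)
            return dict(self.clock)

    procs = ["P1", "P2", "P3"]
    P1, P2, P3 = (VectorClock(p, procs) for p in procs)
    msg = P1.send_event()             # P1: {'P1': 1, 'P2': 0, 'P3': 0}
    print(P2.receive_event(msg))      # P2: {'P1': 1, 'P2': 1, 'P3': 0}
    print(P3.internal_event())        # P3: {'P1': 0, 'P2': 0, 'P3': 1}  (concurrent with both)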
Assignment # 8 Given Send request at 5:08:15.100 (T0) Receive response at 5:08:15.900 (T1) Response contains 5:08:15.300 (Tserver) Compute Tnew 114 Distributed Systems - CS 422 Assignment # 9 Compute the error bounds for assignment 8 using the formula Assignment # 10 Using the Berkeley Algorithm, Show how a. The time daemon asks all the other machines for their clock values b. The machines answer c. The time daemon tells everyone how to adjust their clock Using the Berkeley Algorithm, compute the offset for each client. Assignment # 11 If minimum message transit time (Tmin) is known: Place bounds on accuracy of results in Assignment # 10. Assignment # 12 Suppose that we have four processes running on four processors connected via a synchronous ring. The process (processors) have the IDs 1, 2, 3, and 4. Message passing is performed in a unidirectional manner. The ring is oriented such that process i sends messages to its clockwise neighbor. Draw the state of each process after each of the four rounds using the previous algorithm 115 Distributed Systems - CS 422 Chapter 4 The Distributed Algorithms 4.1 Introduction Distributed systems can be categorized as shared-memory or message-passing systems. In a shared-memory system, processing elements communicate with each other via shared variables in the global memory. While in message-passing systems, each processing element has its own local memory, and communication is performed via message passing. We will discuss a shared-memory abstract model, which can be used to theoretically study parallel algorithms and evaluate their complexities. PRAM It is a theoretical model of shared memory systems called Parallel Random Access Machine (PRAM). The PRAM model was introduced by Fortune and Wyllie in 1978 for modeling idealized parallel computers in which communication cost and synchronization overhead are negligible. At first glance, the PRAM model may appear inappropriate in real-world situations due to its idealistic nature. However, the PRAM model has been very useful in: studying parallel algorithms Evaluating their anticipated performance independent of the real machines. Clearly, if the performance of an algorithm is not satisfactory on a PRAM, it is meaningless to implement it on a real system. Although it does not consider some practical considerations in real distributed systems, it does focus on the computational aspects of the algorithmic complexity, which makes it less difficult to find performance bounds and complexity estimates. The PRAM model has played an important role in the introduction of parallel programming paradigms and design techniques used in real parallel systems. Since PRAM is conceptually easy to work with when developing parallel algorithms, much 116 Distributed Systems - CS 422 effort has been spent in finding efficient ways to simulate PRAM computation on other models that do not necessarily follow PRAM assumptions. This way, parallel algorithms can be designed using PRAM and then translated into real machines. A large number of PRAM algorithms for solving many fundamental problems have been introduced and efficiently implemented on real systems. 4.2 Variations of PRAM Model The purpose of the theoretical models for parallel computation is to give frameworks by which we can describe and analyze algorithms. These ideal models are used to obtain performance bounds and complexity estimates. One of the models that have been used extensively is the PRAM mo del. 
A PRAM consists of a control unit, a global memory shared by p processors, each of which has a unique index as follows: P1 , P2,…….., PP. in addition to the global memory, via which the processors can communicate, each processor has its own private memory. Next figure illustrates the components of the PRAM model: 4.2.1 PRAM Model for Parallel Computations Control Private memory P1 Private memory P2 Private memory Pp 117 Global memory Distributed Systems - CS 422 The p processors operate on a synchronized read, compute, and write cycle. During a computational step, an active processor may read a data value from a memory location, perform a single operation, and finally write back the result into a memory location. Active processors must execute the same instruction, generally, on different data. Hence, this model is sometimes called the shared memory, single instruction, and multiple data (SM SIMD) machine. Algorithms are assumed to run without interference as long as only one memory access is permitted at a time. We say that PRAM guarantees atomic access to data located in shared memory. An operation is considered to be atomic if it is completed in its entirety or it is not performed at all (all or nothing). 4.2.2 READ & WRITE in PRAM There are different modes for read and write operations in a PRAM. These different modes are summarized as follows: Exclusive read (ER) Only one processor can read from any memory location at a time. Exclusive write (EW) Only one processor can write to any memory location at a time. Concurrent read (CR) multiple processors can read from the same memory location simultaneously. Concurrent write (CW) multiple processors can write to the same memory location simultaneously. 4.2.3 PRAM Subclasses The PRAM can be further divided into the following four subclasses: EREW PRAM: Access to any memory cell is exclusive. This is the most restrictive PRAM model. 118 Distributed Systems - CS 422 ERCW PRAM: This allows concurrent writes to the same memory location by multiple processors, but read accesses remain exclusive. CREWPRAM: Concurrent read accesses are allowed, but write accesses are exclusive. CRCWPRAM: Both concurrent read and write accesses are allowed. Analysis of Parallel Algorithms The complexity of a sequential algorithm is generally determined by its time and space complexity. The time complexity of an algorithm refers to its execution time as a function of the problem’s size. Similarly, the space complexity refers to the amount of memory required by the algorithm as a function of the size of the problem. The time complexity has been known to be the most important measure of the performance of algorithms. An algorithm whose time complexity is bounded by a polynomial is called a polynomial–time algorithm. An algorithm is considered to be efficient if it runs in polynomial time. Inefficient algorithms are those that require a search of the whole enumerated space and have an exponential time complexity. For parallel algorithms, the time complexity remains an important measure of performance. Additionally, the number of processors plays a major role in determining the complexity of a parallel algorithm. In general, we say that the performance of a parallel algorithm is expressed in terms of how fast it is, and how many resources it uses when it runs. These criteria can be measured quantitatively as follows: 1. Run time, which is defined as the time spent during the execution of the algorithm. 2. Number of processors the algorithm uses to solve a problem. 3. 
The cost of the parallel algorithm, which is the product of the run time and the number of processors. 119 Distributed Systems - CS 422 The run time of a parallel algorithm is the length of the time period between the time the first processor to begin execution starts and the time the last processor to finish execution terminates. However, since the analysis of algorithms is normally conducted before the algorithm is even implemented on an actual computer, the run time is usually obtained by counting the number of steps in the algorithm. The cost of a parallel algorithm is basically the total number of steps executed collectively by all processors. If the cost of an algorithm is C, the algorithm can be converted into a sequential one that runs in O(C) time on one processor. A parallel algorithm is said to be cost optimal if its cost matches the lower bound on the number of sequential operations to solve a given problem within a constant factor. It follows that a parallel algorithm is not cost optimal if there exists a sequential algorithm whose run time is smaller than the cost of the parallel algorithm. It may be possible to speed up the execution of a cost-optimal PRAM algorithm by increasing the number of processors. However, we should be careful because using more processors may increase the cost of the parallel algorithm. Similarly, a PRAM algorithm may use fewer processors in order to reduce the cost. In this case the execution may be slowed down and offset the decrease in the number of processors. Therefore, using fewer processors requires that we make them work more efficiently. Further details on the relationship between the run time, number of processors, and optimal cost can be found in Brent (1974). In order to design efficient parallel algorithms, one must consider the following general rules. The number of processors must be bounded by the size of the problem. The parallel run time must be significantly smaller than the execution time of the best sequential algorithm. The cost of the algorithm is optimal. 4.3 Simulating Multiple Accesses on an EREW PRAM Suppose that a memory location, x, is needed by all processors at a given time in a PRAM. Concurrent read by all processors can be performed in the CREW and CRCW 120 Distributed Systems - CS 422 cases in constant time. In the EREW case, the following broadcasting mechanism can be followed: P1 reads x and makes it known to P2. P1 and P2 make x known to P3 and P4, respectively, in parallel. P1 , P2 , P3 and P4 make x known to P5, P6, P7, and P8, respectively, in parallel. These eight processors will make x known to another eight processors, In order to represent this algorithm in PRAM, an array, L, of size p is used as a working space in the shared memory to distribute the contents of x to all processors. Initially P1 will read x in its private memory and write it into L[1] Processor P2 will read x from L[1] into its private memory and write it into L[2]. Simultaneously, P3 and P4 read x from L [1] and L[2], respectively, then write them into L[3] and L[4], respectively. Processors P5 , P6, P7, and P8 will then simultaneously read L [1], L [2], L [3], and L [4], respectively, in parallel and write them into L [5 L [6] L [7], and L [8], respectively. This process will continue until eventually all the processors have read x. 
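Before the formal PRAM pseudocode in the next subsection, the doubling scheme can be sanity-checked with a short sequential simulation in Python (a sketch; p = 8 and the broadcast value are arbitrary assumptions):

    import math

    def broadcast_rounds(p, x):
        """Simulate the EREW broadcast: in each round, every processor that already
        holds x serves exactly one new processor through its own cell of L."""
        L = [None] * p            # shared working array L[1..p] (0-indexed here)
        L[0] = x                  # P1 reads x and writes it into L[1]
        informed, rounds = 1, 0
        while informed < p:
            new = min(informed, p - informed)
            for j in range(new):            # these copies run in parallel on the PRAM
                L[informed + j] = L[j]      # exclusive read of L[j], exclusive write
            informed += new
            rounds += 1
        return rounds

    print(broadcast_rounds(8, 42), math.ceil(math.log2(8)))   # 3 3: log p rounds suffice

The number of informed processors doubles every round, which is exactly the O(log p) time bound claimed for the algorithm below.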
4.3.1 Algorithm Broadcast_EREW

  Processor P1
    y (in P1's private memory) ← x
    L[1] ← y
  for i = 0 to log p − 1 do
    for all Pj, where 2^i + 1 ≤ j ≤ 2^(i+1), do in parallel
      y (in Pj's private memory) ← L[j − 2^i]
      L[j] ← y
    endfor
  endfor

Verify the time complexity, O(log p).
Verify the space complexity, O(p).
Since the number of processors having read x doubles in each iteration, the procedure terminates in O(log p) time. The array L is the price paid in terms of memory, which is O(p).

4.4 Computing Sum and All Partial Sums
We design a PRAM algorithm to compute all partial sums of an array of numbers. Given n numbers, stored in array A[1..n], we want to compute the partial sums A[1], A[1] + A[2], A[1] + A[2] + A[3], ..., A[1] + A[2] + ... + A[n]. At first glance, one might think that accumulating sums is an inherently serial process, because one must add up the first k elements before adding in element k + 1. To make the algorithm easy to understand, we start by developing a similar algorithm for the simpler problem of computing the simple sum of an array of n values. Then we extend the algorithm to compute all partial sums using what is learned from the simple summation problem.

4.4.1 Sum of an Array of Numbers on the EREW Model
In this section, we discuss an algorithm to compute the sum of n numbers. Summation can be done in time O(log n) by organizing the numbers as the leaves of a binary tree and performing the sums at each level of the tree in parallel. We present this algorithm on an EREW PRAM with n/2 processors, because we won't need to perform any multiple read or write operations on the same memory location. Recall that in an EREW PRAM, read and write conflicts are not allowed. We assume that the array A[1..n] is stored in the global memory. The summation will end up in the last location, A[n]. For simplicity, we assume that n is an integral power of 2. The algorithm completes the work in log n iterations as follows. In the first iteration, all the processors are active. In the second iteration, only half of the processors will be active, and so on. The details are described in Algorithm Sum_EREW given below.

Algorithm Sum_EREW
  for i = 1 to log n do
    for all Pj, where 1 ≤ j ≤ n/2, do in parallel
      if (2j mod 2^i) = 0 then
        A[2j] ← A[2j] + A[2j − 2^(i−1)]
      endif
    endfor
  endfor

Run time, T(n) = O(log n)
Number of processors, P(n) = n/2
Cost, C(n) = O(n log n)

Complexity Analysis
Notice that the for loop is executed log n times, and each iteration has constant time complexity. Hence, the run time of the algorithm is O(log n). Since the number of processors used is n/2, the cost is obviously O(n log n). The complexity measures of Algorithm Sum_EREW are summarized as follows:
1. Run time, T(n) = O(log n)
2. Number of processors, P(n) = n/2
3. Cost, C(n) = O(n log n)
Since a good sequential algorithm can sum the list of n elements in O(n), this algorithm is not cost optimal.

Algorithm Procedure
In order to sum eight elements, three iterations are needed. In the first iteration, processors P1, P2, P3, and P4 add the values stored at locations 1, 3, 5, and 7 to the numbers stored at locations 2, 4, 6, and 8, respectively.
In the second iteration, processors P2 and P4 add the values stored at locations 2 and 6 to the numbers stored at locations 4 and 8, respectively. Finally, in the third iteration, processor P4 adds the value stored at location 4 to the value stored at location 8. Thus, location 8 will eventually contain the sum of all numbers in the array.

Sum Example

4.4.2 All Partial Sums of an Array
Take a closer look at Algorithm Sum_EREW and notice that most of the processors are idle most of the time. However, by exploiting the idle processors, we should be able to compute all partial sums of the array in the same amount of time it takes to compute the single sum. We present Algorithm All_Partial_Sums_EREW to calculate all partial sums of an array on an EREW PRAM with n − 1 processors (P2, P3, ..., Pn). Again, the elements of the array A[1..n] are assumed to be in the global shared memory. The partial sum algorithm replaces each A[k] by the sum of all elements preceding and including A[k]. In Algorithm Sum_EREW presented earlier, during iteration i only n/2^i processors are active, while in the algorithm we present here, nearly all processors are in use.

Algorithm All_Partial_Sums_EREW
  for i = 1 to log n do
    for all Pj, where 2^(i−1) + 1 ≤ j ≤ n, do in parallel
      A[j] ← A[j] + A[j − 2^(i−1)]
    endfor
  endfor

Run time, T(n) = O(log n)
Number of processors, P(n) = n − 1
Cost, C(n) = O(n log n)
Visit: Algorithm ALL_Partial_Sums.docx
Computing partial sums of an array of eight elements

4.5 The Sorting Algorithm
The sorting algorithm we present here is based on the enumeration (list) idea. Given an unsorted list of n elements a1, a2, ..., ai, ..., an, an enumeration sort determines the position of each element ai in the sorted list by computing the number of elements smaller than it. If ci elements are smaller than ai, then it is the (ci + 1)th element in the sorted list. If two or more elements have the same value, the element with the largest index in the unsorted list will be considered the larger in the sorted list. For example, suppose that ai = aj; then ai will be considered the larger of the two if i > j; otherwise, aj is the larger.

4.5.1 n² Sorting Algorithm
Consider the n² processors as being arranged into n rows of n elements each. The processors are numbered as follows: Pi,j is the processor located in row i and column j in the grid of processors. We assume that the unsorted list is stored in the global memory in an array A[1..n]. Another array C[1..n] will be used to store the number of elements smaller than every element in A. The algorithm consists of two steps:
1. Each row of processors i computes C[i], the number of elements smaller than A[i]. Each processor Pi,j compares A[i] and A[j], and then updates C[i] appropriately.
2. The first processor in each row, Pi,1, places A[i] in its proper position in the sorted list (position C[i] + 1).
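These two steps can be checked with a small sequential simulation before looking at the CRCW pseudocode below (a sketch; the sample list is an arbitrary assumption):

    def enumeration_sort(A):
        """Sequential simulation of the n^2-processor enumeration sort:
        C[i] counts the elements that must precede A[i]; ties are broken by index."""
        n = len(A)
        C = [0] * n
        for i in range(n):               # on the PRAM, each pair (i, j) has its own
            for j in range(n):           # processor P(i,j) doing this one comparison
                if A[i] > A[j] or (A[i] == A[j] and i > j):
                    C[i] += 1
        result = [None] * n
        for i in range(n):               # step 2: P(i,1) writes A[i] into slot C[i] + 1
            result[C[i]] = A[i]
        return result

    print(enumeration_sort([5, 2, 9, 2]))   # [2, 2, 5, 9]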
4.5 The Sorting Algorithm
The sorting algorithm we present here is based on the enumeration (list) idea. Given an unsorted list of n elements a1, a2, ..., ai, ..., an, an enumeration sort determines the position of each element ai in the sorted list by computing the number of elements smaller than it. If ci elements are smaller than ai, then it is the (ci + 1)th element in the sorted list. If two or more elements have the same value, the element with the largest index in the unsorted list will be considered the largest in the sorted list. For example, suppose that ai = aj; then ai will be considered the larger of the two if i > j; otherwise, aj is the larger.
4.5.1 The n² Sorting Algorithm
Consider the n² processors as being arranged into n rows of n elements each. The processors are numbered as follows: Pi,j is the processor located in row i and column j in the grid of processors. We assume that the unsorted list is stored in the global memory in an array A[1..n]. Another array C[1..n] will be used to store the number of elements smaller than every element in A. The algorithm consists of two steps:
1. Each row of processors i computes C[i], the number of elements smaller than A[i]. Each processor Pi,j compares A[i] with A[j] and then updates C[i] appropriately.
2. The first processor in each row, Pi,1, places A[i] in its proper position in the sorted list (position C[i] + 1).
4.5.2 Algorithm Sort_CRCW (Ascending)
/* Step 1 */
for all Pi,j, where 1 ≤ i, j ≤ n, do in parallel
  if (A[i] > A[j]) or (A[i] = A[j] and i > j) then
    C[i] ← 1
  else
    C[i] ← 0
  endif
endfor
/* Step 2 */
for all Pi,1, where 1 ≤ i ≤ n, do in parallel
  A[C[i] + 1] ← A[i]
endfor
Complexity Analysis
The complexity measures of the enumeration sort on the CRCW PRAM are summarized as follows:
Run time, T(n) = O(1)
Number of processors, P(n) = n²
Cost, C(n) = O(n²)
The run time of this algorithm is constant, because each of the two steps of the algorithm consumes a constant amount of time. Since the number of processors used is n², the cost is obviously O(n²).
CRCW Algorithms versus EREW Algorithms
The debate about whether or not concurrent memory accesses should be provided by the hardware of a parallel computer is a messy one. Some argue that hardware mechanisms to support CRCW algorithms are too expensive and used too infrequently to be justified. Others complain that EREW PRAMs provide too restrictive a programming model. The answer to this debate probably lies somewhere in the middle, and various compromise models have been proposed. Nevertheless, it is instructive to examine what algorithmic advantage is provided by concurrent accesses to memory. In this section, we show that there are problems on which a CRCW algorithm outperforms the best possible EREW algorithm. For the problem of finding the identities of the roots of trees in a forest, concurrent reads allow for a faster algorithm. For the problem of finding the maximum element in an array, concurrent writes permit a faster algorithm.
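Returning to Algorithm Sort_CRCW, a small sequential simulation (a sketch only) makes the two steps concrete. The n² processors become a double loop, and the concurrent contributions to C[i] are combined here by counting, which is how the simultaneous writes are assumed to be resolved for this algorithm:

def enumeration_sort(A):
    n = len(A)
    C = [0] * n
    # Step 1: processor P(i,j) contributes 1 to C[i] if A[i] must come after A[j]
    for i in range(n):
        for j in range(n):
            if A[i] > A[j] or (A[i] == A[j] and i > j):
                C[i] += 1
    # Step 2: processor P(i,1) writes A[i] into position C[i] + 1 (1-indexed)
    result = [None] * n
    for i in range(n):
        result[C[i]] = A[i]
    return result

print(enumeration_sort([3, 1, 2, 3]))   # prints [1, 2, 3, 3]; ties follow the index rule above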
4.6 Message-Passing Models and Algorithms
Message-passing distributed algorithms are designed to run on the processing units of a distributed system, which may be connected in a variety of ways, ranging from geographically dispersed networks to architecture-specific interconnection structures. A processing unit in such systems is an autonomous computer, which may be engaged in its own private activities while at the same time cooperating with other units in the context of some computational task.
4.6.1 Message-Passing Computing Models
An algorithm designed for a message-passing system consists of a collection of local programs running concurrently on the different processing units in a distributed system. Each local program performs a sequence of computation and message-passing operations. Message passing in distributed systems can be modeled using a communication graph. The nodes of the graph represent the processors (or the processes running on them), and the edges represent communication links between processors. A message-passing distributed system may operate in synchronous, asynchronous, or partially synchronous modes. In the synchronous extreme, the execution is completely lockstep and local programs proceed in synchronous rounds; for example, in one round each local program sends messages to its outgoing neighbors, waits for the arrival of messages from its incoming neighbors, and performs some computation upon the receipt of those messages. In the other extreme, in asynchronous mode the local programs execute in arbitrary order at an arbitrary rate. Partially synchronous systems work at an intermediate degree of synchrony, where there are restrictions on the relative timing of events. The processes share information by sending/receiving (or dispatching/collecting) data to/from each other. The processes most likely run the same programs, and the whole system should work correctly regardless of the messaging relations among the processes or the structure of the network. A popular, standard message-passing system is the Message Passing Interface (MPI). Such models themselves do not impose particular restrictions on the mechanism for messaging, and thus give programmers much flexibility in algorithm/system design. However, this also means that programmers need to deal with the actual sending/receiving of messages, failure recovery, managing running processes, etc.
Synchronous Message-Passing Model
Thus, a synchronous system can be modeled as a state machine with the following components:
M, a fixed message alphabet.
A process i can be modeled as:
Qi, a (possibly infinite) set of states. The system state can be represented using a set of variables.
q0,i, the initial state in the state set Qi. The state variables have initial values in the initial state.
GenMsgi, a message-generation function. It is applied to the current system state to generate messages, drawn from M, to the outgoing neighbors.
Transi, a state-transition function that maps the current state and the incoming messages into a new state.
Algorithm S_Sum_Hypercube
Qi: buff, an integer; dim, a value in {0, 1, 2, ..., log n}
q0,i: buff ← xi; dim ← log n
GenMsgi: if the current value of dim = 0, do nothing; otherwise, send the current value of buff to the neighbor along dimension dim.
Transi: if the incoming message is v and dim > 0, then buff ← buff + v and dim ← dim − 1.
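A round-by-round simulation of Algorithm S_Sum_Hypercube may help when tracing it (a sketch; the node numbering and the convention that dimension d flips address bit d − 1 are illustrative assumptions). Every node ends up holding the global sum:

import math

def s_sum_hypercube(x):
    n = len(x)                                   # number of nodes, a power of 2
    buff = x[:]                                  # q0: buff_i <- x_i
    for dim in range(int(math.log2(n)), 0, -1):  # dim = log n, ..., 1
        sent = buff[:]                           # GenMsg: messages built from the current state
        for i in range(n):
            neighbour = i ^ (1 << (dim - 1))     # neighbour along dimension dim
            buff[i] += sent[neighbour]           # Trans: buff <- buff + v
    return buff

print(s_sum_hypercube([1, 2, 3, 4, 5, 6, 7, 8]))   # prints [36, 36, 36, 36, 36, 36, 36, 36]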
Revision Sheet # 4
Assignment # 13
Trace the following algorithms using your own examples and clarify the space complexity and time complexity of each:
Simulating Multiple Accesses on an EREW PRAM (Broadcast Algorithm)
Computing Sum Algorithm
All Partial Sums Algorithm
Sorting Algorithm
Message-Passing Models Sum Algorithm
Chapter 5
Naming in Distributed Systems
Naming Mechanism in Traditional Networks
Names play a very important role in all computer systems. They are used to share resources, to uniquely identify entities, to refer to locations, and more. An important issue with naming is that a name can be resolved to the entity it refers to. Name resolution thus allows a process to access the named entity. To resolve names, it is necessary to implement a naming system. The difference between naming in distributed systems and non-distributed systems lies in the way naming systems are implemented. In a distributed system, the implementation of a naming system is itself often distributed across multiple machines. How this distribution is done plays a key role in the efficiency and scalability of the naming system. In this chapter, we concentrate on three different, important ways that names are used in distributed systems. First, after discussing some general issues with respect to naming, we take a closer look at the organization and implementation of human-friendly names. Typical examples of such names include those for file systems and the World Wide Web. Building worldwide, scalable naming systems is a primary concern for these types of names. Second, names are used to locate entities in a way that is independent of their current location. As it turns out, naming systems for human-friendly names are not particularly suited for this kind of tracking down of entities. Most names do not even hint at the entity's location. Alternative organizations are needed, such as those used for mobile telephony, where names are location-independent identifiers, and those for distributed hash tables. Finally, humans often prefer to describe entities by means of various characteristics, leading to a situation in which we need to resolve a description by means of attributes to an entity adhering to that description. This type of name resolution is notoriously difficult, and we will pay separate attention to it.
The practice of using a name as a simpler, more memorable abstraction of a host's numerical address on a network dates back to the ARPANET era. Before the DNS was invented in 1983, each computer on the network retrieved a file called HOSTS.TXT from a computer at SRI International (formerly the Stanford Research Institute). The HOSTS.TXT file mapped names to numerical addresses. A hosts file still exists on most modern operating systems by default and generally contains a mapping of "localhost" to the IP address 127.0.0.1. The rapid growth of the network made a centrally maintained, hand-crafted HOSTS.TXT file unaffordable. It became necessary to implement a more scalable system capable of automatically publishing the requisite information.
5.1 Naming Basic Concepts
Name: a string of bits that refers to an entity (e.g., your name).
Address: a string of bits that has location semantics (e.g., your home address, your phone number).
Also, a name is an identifier that:
identifies a resource (uniquely, or by describing the resource);
enables us to locate that resource (directly, or with help).
Key issues in Naming
How is the name used?
To disambiguate only
To access a resource given the name
To build a name to find a resource
Do humans need to use the name?
To construct it
To recall it
Is the resource static?
It never moves
A change in location should change the name
The resource may move
The resource is mobile
Performance requirements
5.2 Naming Types (identification, description, location)
1. Address, which is an access point associated with an entity.
2. Globally unique identifier (e.g., a TCP/IP or Ethernet address): solves identification, but not description or location.
3. Hierarchically assigned, globally unique names in the shape of a character string (e.g., a URL, a telephone number, an IP address): solves identification and location; cannot help with description.
4. Registries and name spaces (e.g., the Uniform Resource Name (URN)): solve identification and location; help with description, since the registry can describe the resource in detail; complicated!
Naming & Application Developers
DNS is a widely accepted standard, but it only names machines and does not handle mobility. URL/URN will become the standard: it can be descriptive and globally unique, and is determined, but expensive to create. In practice, a mix of URL and DNS is used.
5.2.1 Uniform Resource Identifier (URI) (URL/URN/URL+URN)
Computer scientists may classify a URI as a locator (URL), or a name (URN), or both. It is an Internet Engineering Task Force (IETF) meta-standard. It defines naming schemes/protocols, and each naming scheme has its own mechanism. Examples of absolute URIs:
http://example.org/absolute/URI/with/absolute/path/to/resource.txt
ftp://example.org/resource.txt
urn:issn:1535-3613
Uniform Resource Locator (URL)
Instead of being divided into the route to the server, separated by dots, and the file path, separated by slashes, Tim Berners-Lee would have liked it to be one coherent hierarchical path, e.g.
http://www.serverroute.com/path/to/file.html → http://com/serverroute/www/path/to/file.html
A URL uses DNS to map to a host (a mix of URL with DNS).
Uniform Resource Name (URN) (Permanent URL)
The functional requirements for Uniform Resource Names are described in RFC 1737. URNs are part of a larger Internet information architecture which is composed of URNs, Uniform Resource Characteristics (URCs), and Uniform Resource Locators (URLs). Each plays a specific role:
URNs are used for identification,
URCs for including meta-information,
URLs for locating or finding resources.
The Internet Protocol Suite
Digital subscriber line (DSL) is a family of technologies that provide Internet access by transmitting digital data over the wires of a local telephone network. In telecommunications marketing, the term DSL is widely understood to mean asymmetric digital subscriber line (ADSL).
Application Layer: BGP · DHCP · DNS · FTP · GTP · HTTP · IMAP (email receiving) · IRC · Megaco · MGCP · NNTP · NTP · POP (email receiving) · RIP · RPC · RTP · RTSP · SDP · SIP · SMTP (email sending) · SNMP · SOAP · SSH · Telnet · TLS/SSL · XMPP
Transport Layer: TCP · UDP · DCCP · SCTP · RSVP · ECN · (more)
Internet Layer: IP (IPv4, IPv6) · ICMP · ICMPv6 · IGMP · IPsec
Link Layer: ARP · RARP · NDP · OSPF · Tunnels (L2TP) · PPP · Media Access Control (Ethernet, MPLS, DSL, ISDN, FDDI) · Device Drivers
Domain Name System (DNS)
It is a hierarchical naming system for computers, services, or any resource participating in the Internet. It associates various information with domain names assigned to such participants. It translates domain names meaningful to humans into the numerical (binary) identifiers associated with networking equipment for the purpose of locating and addressing these devices world-wide. It serves as the "phone book" for the Internet by translating human-friendly computer hostnames into IP addresses. For example, www.example.com translates to 208.77.188.166. World-Wide Web (WWW) hyperlinks and Internet contact information can remain consistent and constant even if the current Internet routing arrangements change or the participant uses a mobile device. Internet domain names are easier to remember than IP addresses such as 208.77.188.166 (IPv4) or 2001:db8:1f70::999:de8:7648:6e8 (IPv6). People take advantage of this when they recite meaningful URLs and e-mail addresses without having to know how the machine will actually locate them.
DNS / URL System (Mix)
http://maedhbh.maths.ted.ie/project
Using DNS, the machine name blood.cs.tcd.ie would indicate that the machine blood was located in the Computer Science (cs) department of Trinity College Dublin (tcd), which is located in Ireland (ie). Thus, the name space has been broken up into a hierarchical structure of domains within domains. The Uniform Resource Locator (URL) is an extension of the naming convention used by DNS. It adds a prefix, which is used to specify which of the Web services you request of that machine; such services are HTTP and FTP. Also, other information can be appended to the address. One such use is to specify the particular Web page you request, e.g., http://maedhbh.maths.ted.ie/project
The advantage of the DNS/URL system is that the databases storing information about the machines are distributed in such a fashion that the owner of the machines is responsible for that data.
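The phone-book role of DNS can be seen directly from the standard library (a small illustration; the addresses printed depend on the resolver that answers, so they will not necessarily match the ones quoted above):

import socket

print(socket.gethostbyname("www.example.com"))            # one IPv4 address for the host
for family, _, _, _, sockaddr in socket.getaddrinfo("www.example.com", 80):
    print(family, sockaddr)                               # may list both IPv4 and IPv6 records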
Domain Name System (DNS) Hierarchy
DNS Hierarchy Features
1. A unique domain suffix is assigned by the Internet Authority.
2. The domain administrators have complete control over the domain.
3. There is no limit on the number of sub-domains or the number of levels.
4. The name space is not related to the physical interconnection.
5. Geographical hierarchy is allowed (e.g., cnri.reston.va.us).
6. A name could be a domain or an individual object.
DNS Top-Level Domain Assignments
com – commercial
edu – educational
gov – government
mil – military
net – network
org – other organizations
country codes – au, uk, ca, ...
The DNS Name Space
The most important types of resource records forming the contents of nodes in the DNS name space (type of record – associated entity – description):
SOA – zone – holds information on the represented zone
A – host – contains an IP address of the host this node represents
MX – domain – refers to a mail server to handle mail addressed to this node
SRV – domain – refers to a server handling a specific service
NS – zone – refers to a name server that implements the represented zone
CNAME – node – symbolic link with the primary name of the represented node
Domain Name Space (figure)
Name Space Implementation
Name spaces always map names to something. DNS associates information with domain names. It can be divided into three layers:
1. Global layer: does not change very often.
2. Administrational layer: a single organization.
3. Managerial layer: changes regularly.
Name Space Distribution Example
An example partitioning of the DNS name space, including Internet-accessible files, into three layers (figure).
Name Space Layers Characteristics
A comparison between name servers for implementing nodes from a large-scale name space partitioned into a global layer, an administrational layer, and a managerial layer (item – global – administrational – managerial):
Geographical scale of network: worldwide – organization – department
Total number of nodes: few – many – vast numbers
Responsiveness to lookups: seconds – milliseconds – immediate
Update propagation: lazy – immediate – immediate
Number of replicas: many – none or few – none
Is client-side caching applied? yes – yes – sometimes
DNS Name Servers (Decentralized)
Centralizing DNS is avoided for the following reasons:
To avoid a single point of failure
To avoid traffic volume
To avoid access to a distant centralized database
To distribute the maintenance overhead
Centralized means: it does not scale!
DNS Name Server Hierarchy Concepts
Servers are organized in hierarchies. Each server has authority over a portion of the hierarchy:
A single node in the name hierarchy cannot be split.
A server maintains only a subset of all names.
It needs to know the other servers that are responsible for the other portions of the hierarchy.
Regarding its authority, each server has the name-to-address translation table (ATT) for all names in the name space it controls. Every server knows the root, and the root server knows about all top-level domains.
DNS Name Server Types
No server has all name-to-IP-address mappings; i.e., the knowledge is distributed as follows:
Local name servers: each ISP (company) has a local (default) name server; host DNS queries first go to the local name server.
Root name servers, acting as mediator agents.
Authoritative name servers: for a host, an authoritative name server stores that host's (name, IP address) pair and can perform name/address translation for that host's name.
Root Name Servers (Mediator Agent)
Contacted by a local name server that cannot resolve a name.
The Root name server: 1. Contacts authoritative name server if name mapping not known 2. Gets mapping 3. Returns mapping to local name server Dozen root name servers worldwide 143 Distributed Systems - CS 422 Simple DNS Example (1st Alternative) Host whsitler.cs.cmu.edu wants IP address of www.berkeley.edu 1. Contacts its local DNS server, mango.srv.cs.cmu.edu 2. mango.srv.cs.cmu.edu contacts root name server, if necessary 3. Root name server contacts authoritative name server, ns1.berkeley.edu, if necessary May not know authoritative name server May know intermediate name server: who to contact to find authoritative name server? 144 Distributed Systems - CS 422 DNS: Iterated Queries (3rd Alternative) Recursive query: Puts load of name resolution on contacted name server Heavy load? Iterated query: Contacted server replies with name of server to contact “I don’t know this name, but ask this server” 5.3 Naming Implementation Approaches 1. Flat Naming Simple solutions (broadcasting and Forwarding pointers) Home-based approaches Distributed Hash Tables (structured P2P) Hierarchical location service 2. Structured Naming, for example: Phone numbers Credit card numbers DNS Human names in the US 145 Distributed Systems - CS 422 Files in UNIX, Windows URLs 3. Attribute-based Naming using a collection of (attribute, value) pairs 5.3.1 Flat Naming Approach Problem: Given an essentially unstructured name (e.g., an identifier), how can we locate its associated access point? Solution: 1. Simple solutions (broadcasting and Forwarding pointers ) 2. Home-based approaches 3. Distributed Hash Tables (structured P2P) 4. Hierarchical location service 5.3.1.1 Simple Solutions Broadcasting: (Question is ID, answer is address) Simply broadcast the ID, requesting the entity to return its current address. Its disadvantages are: Can never scale beyond local-area networks (LAN) Requires all processes to listen to incoming location Forwarding pointers: (VIP hot line contact) Each time an entity moves, it leaves behind a pointer telling where it has gone to. Dereferencing can be made entirely transparent to clients by simply following the chain of pointers Update a client’s reference as soon as present location has been found Geographical scalability problems: o Long chains are not fault tolerant o Increased network latency at dereferencing. It is essential to have separate chain reduction mechanisms. 146 Distributed Systems - CS 422 5.3.1.2 Home-Based Approaches Single-tiered scheme: (continues tracking) Let a home keep track of where the entity is: An entity’s home address is registered at a naming service The home registers the foreign address of the entity Clients always contact the home first, and then continues with the foreign location Two-tiered scheme: Keep track of visiting entities: Check local visitor register first Fall back to home location if local lookup fails Problems with home-based approaches: 1. The home address has to be supported as long as the entity lives. 2. The home address is fixed, which means an unnecessary load when the entity permanently moves to another location 3. Poor geographical scalability Question: How can we solve the “permanent move” problem? 147 Distributed Systems - CS 422 Answer: register the home at a traditional naming service and to let a client first look up the location of the home. 5.3.1.3 Distributed Hash Tables (DHTs) Example: Consider the organization of many nodes into a logical ring (Chord Protocol): Each node is assigned a random m-bit identifier. 
Every entity is assigned a unique m-bit key. The entity with key k falls under the authority of the node with the smallest identifier ≥ k (called its successor, succ(k)).
Solution: let each node id keep track of succ(id) and start a linear search along the ring.
Chord: Overview
Chord is a scalable, distributed "lookup service". A lookup service is a service that maps keys to nodes. The key technology used is consistent hashing. The major benefits of Chord over other lookup services are provable correctness and provable "performance". It uses the Secure Hash Algorithm (SHA-1), which is one of a number of cryptographic hash functions. http://aruljohn.com/hash.php
Chord: Primary Motivation
In a P2P network, one node (N5), acting as a client, is trying to locate a specific file (LetItBe) stored on another node (N1), acting as a server, on the network.
Chord Identifiers
m is the size, in bits, of the identifier space for both keys and nodes.
Key identifier: SHA-1(key)
Node identifier: SHA-1(IP address)
After hashing the key value of the file name, the client node will search for the closest node equal or higher in value on the network to acquire the needed file.
Algorithmic Requirements
● Every node can find the answer.
● Keys are load balanced among nodes. Note: we are not talking about popularity of keys, which may be wildly different; addressing this is a further challenge.
● Routing tables must adapt to node failures and arrivals.
● How many hops must lookups take? A trade-off is possible between state/maintenance traffic and the number of lookups.
DHTs: Finger Tables
Each node p maintains a finger table FTp[] with at most m entries:
FTp[i] = succ(p + 2^(i−1))
Note: FTp[i] points to the first node succeeding p by at least 2^(i−1). To look up a key k, node p forwards the request to the node q = FTp[j], where j satisfies
FTp[j] ≤ k < FTp[j + 1]
It means: "I do not have the looked-up file; search for it at the next station, until it is found or the search stops."
5.3.1.4 Hierarchical Location Services (HLS)
The basic idea is to build a large-scale search tree for which the underlying network is divided into hierarchical domains. Each domain is represented by a separate directory node.
HLS: Tree Organization
The address of an entity is stored in a leaf node, or in an intermediate node. Intermediate nodes contain a pointer to a child if and only if the subtree rooted at the child stores an address of the entity. The root knows about all entities.
HLS: Lookup Operation
The basic principles are:
Start the lookup at the local leaf node.
If the node knows about the entity, follow the downward pointer; otherwise, go one level up (if a pointer exists, go down; otherwise continue up).
An upward lookup always stops at the root.
5.3.2 Structured Naming
The name space is the way that names in a particular system are organized. This also defines the set of all possible names. Some examples are:
Phone numbers
Credit card numbers
DNS
Human names in the US
Files in UNIX, Windows
URLs
Names are organized into what is commonly referred to as a name space. A name space can be represented as a labeled, directed graph with two types of nodes.
Name Space
Essence: a graph in which a leaf node represents a (named) entity. A directory node is an entity that refers to other nodes.
Note: a directory node contains a directory table of (edge label, node identifier) pairs.
Observation: we can easily store all kinds of attributes in a node, describing aspects of the entity the node represents:
1. Type of the entity
2. An identifier for that entity
3.
Address of the entity’s location 4. Nicknames 152 Distributed Systems - CS 422 Directory nodes can also have attributes, besides just storing a directory table with (edge label, node identifier) pairs. Name Resolution Looking up a name (finding the “value”) is called name resolution. But the problem is to resolve a name, we need a directory node. We first need to find that “initial” node. Closure mechanism (or where to start) is the mechanism to select the implicit context from which to start name resolution. Such Examples are: file systems, ZIP code, DNS www.cs.vu.nl: start at a DNS name server 0031204447784: dial a phone number 130.37.24.8: route to the VU’s Web server Observation: A closure mechanism may also determine how name resolution should proceed. Name Linking Hard link: What we have described so far as a path name: a name that is resolved by following a specific path in a naming graph from one node to another. Soft link: Allow a node O to contain a name of another node: First resolve O’s name (leading to O) Read the content of O, yielding name Name resolution continues with name Observations: The name resolution process determines that we read the content of a node, in particular, the name in the other node that we need to go to. One way or the other, we know where and how to start name resolution given name. The Node n5 has only one name (i.e. no name linking but n1,n4 links to node n6). 153 Distributed Systems - CS 422 Iterative Name Resolution resolve(dir,[name1,...,nameK]) is sent to Server0 responsible for dir Server0 resolves resolve(dir,name1) → dir1, returning the identification (address) of Server1, which stores dir1. Client sends resolve(dir1,[name2,...,nameK]) to Server1, etc. Recursive Name Resolution resolve(dir,[name1,...,nameK]) is sent to Server0 responsible for dir Server0 resolves resolve(dir,name1) →dir1, and sends resolve(dir1,[name2,...,nameK]) to Server1, which stores dir1. Server0 waits for the result from Server1, and returns it to the client. 154 Distributed Systems - CS 422 Scalability Issues Size scalability: We need to ensure that servers can handle a large number of requests per time unit ⇒ high-level servers are in big trouble. Solution: Assume (at least at global and administrational level) that content of nodes hardly ever changes. In that case, we can apply extensive replication by mapping nodes to multiple servers and start name resolution at the nearest server. Observation: An important attribute of many nodes is the address where the represented entity can be contacted. Replicating nodes makes large-scale traditional name servers unsuitable for locating mobile entities. Geographical scalability: We need to ensure that the name resolution process scales across large geographical distances. 155 Distributed Systems - CS 422 Problem: By mapping nodes to servers that may, in principle, be located anywhere, we introduce an implicit location dependency in our naming scheme. 5.3.3 Attribute-Based Naming Observation: In many cases, it is much more convenient to name, and look up entities by means of their attributes ⇒ traditional directory services (Also Known As (a.k.a.), yellow pages). Problem: Lookup operations can be extremely expensive, as they require to match requested attribute values, against actual attribute values ⇒ inspect all entities. Solution: Implement basic directory service as database and combine with traditional structured naming system. 
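Before turning to directory services, the difference between iterative and recursive resolution described above can be illustrated with a toy name space (a sketch; the server names and table contents are invented for the example). In the iterative version the client walks the path itself; in the recursive version each server forwards the remainder of the path:

SERVERS = {
    "root":      {"nl": "nl-server"},
    "nl-server": {"vu": "vu-server"},
    "vu-server": {"cs": "address-of-cs-node"},
}

def resolve_iterative(path):
    node = "root"
    for label in path:                          # the client contacts every server in turn
        node = SERVERS[node][label]
    return node

def resolve_recursive(path, node="root"):
    if not path:
        return node
    next_node = SERVERS[node][path[0]]          # this server resolves one label...
    return resolve_recursive(path[1:], next_node)   # ...and passes the rest on

print(resolve_iterative(["nl", "vu", "cs"]))    # address-of-cs-node
print(resolve_recursive(["nl", "vu", "cs"]))    # same result, different division of work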
Directory Service A directory service is a database that contains information about all objects on the network. Directory services contain data and metadata. Metadata is information about data. For example: A user account is data. Metadata specifies what information is included in every user account object. Early Directory Services The first directory service was developed at PARC and was called Grapevine. X.500 was developed as a directory service standard by the ISO and CCITT. Although X.500 was developed as a comprehensive standard, as with the OSI model, it was not widely deployed on real-world LANs. X.500 formed the basis of a standard that is widely deployed known as LDAP. Some X.500 conventions are used in Active Directory and eDirectory. X.500 Directory Service It provides directory service based on a description of properties instead of a full name (e.g., yellow pages in properties instead of a full name (e.g., yellow pages in telephone book). An X.500 directory entry is comparable to a resource record in DNS. Each record is made up of a collection of (attribute, value) pairs: 156 Distributed Systems - CS 422 Collection of all entries is a Directory Information Base (DIB) Each naming attribute is a Relative Distinguished Name (RDN) RDNs, in sequence, can be used to form a Directory Information Tree (DIT) LDAP It stands for Lightweight Directory Access Protocol. LDAP is a scaled-down implementation of the X.500 standard. Active Directory and eDirectory are based on LDAP. Netscape’s Directory Server was the first wide implementation of LDAP. Most LDAP directories use a single master method of replication. Changes are made to the master databases and then propagated out to subordinate databases. The disadvantage of this scheme is that it has a single point of failure. Objects within an LDAP directory are referenced using the object’s DN (Distinguished Name). The DN consists of the RDN (Relative Distinguished Name) appended with the names of ancestor entries. LDAP Example 157 Distributed Systems - CS 422 Revision Sheet # 5 PROBLEMS 1. Give an example of where an address of an entity E needs to be further resolved into another address to actually access E. 2. Would you consider a URL such as http://www.acme.org/index.html to be location independent? What about http://www.acme.nllindex.html? 3. Give some examples of true identifiers. 4. Is an identifier allowed to contain information on the entity it refers to? 5. Outline an efficient implementation of globally unique identifiers. 6. Consider a Chord DHT-based system for which k bits of an m-bit identifier space have been reserved for assigning to superpeers. If identifiers are randomly assigned, how many superpeers can one expect to have in an N-node system? 7. If we insert a node into a Chord system, do we need to instantly update all the finger tables? What is a major drawback of recursive lookups when resolving a key in a DHT-based system? 8. Considering that a two-tiered home-based approach is a specialization of a hierarchical location service, where is the root? 9. The root node in hierarchical location services may become a potential bottleneck. How can this problem be effectively circumvented? 10. Give an example of how the closure mechanism for a URL could work. 11. Explain the difference between a hard link and a soft link in UNIX systems. Are there things that can be done with a hard link that cannot be done with a soft link or vice versa? 12. 
High-level name servers in DNS, that is, name servers implementing nodes in the DNS name space that are close to the root, generally do not support recursive name resolution. Can we expect much performance improvement if they did? 13. Explain how DNS can be used to implement a home-based approach to locating mobile hosts. How is a mounting point looked up in most UNIX systems?
Chapter 6
Fault Tolerance
A characteristic feature of distributed systems that distinguishes them from single-machine systems is the notion of partial failure. A partial failure may happen when one component in a distributed system fails. This failure may affect the proper operation of other components, while at the same time leaving yet other components totally unaffected. In contrast, a failure in non-distributed systems is often total in the sense that it affects all components, and may easily bring down the entire system. An important goal in distributed systems design is to construct the system in such a way that it can automatically recover from partial failures without seriously affecting the overall performance. In particular, whenever a failure occurs, the distributed system should continue to operate in an acceptable way while repairs are being made; that is, it should tolerate faults and continue to operate to some extent even in their presence.
6.1 Introduction to Fault Tolerance
Fault tolerance has been subject to much research in computer science. In this section, we start with presenting the basic concepts related to processing failures, followed by a discussion of failure models. The key technique for handling failures is redundancy, which is also discussed.
6.1.1 Basic Concepts
To understand the role of fault tolerance in distributed systems we first need to take a closer look at what it actually means for a distributed system to tolerate faults. Being fault tolerant is strongly related to what are called dependable systems. Dependability is a term that covers a number of useful requirements for distributed systems, including the following:
1. Availability
2. Reliability
3. Safety
4. Maintainability
Availability is defined as the property that a system is ready to be used immediately. In general, it refers to the probability that the system is operating correctly at any given moment and is available to perform its functions on behalf of its users. In other words, a highly available system is one that will most likely be working at a given instant in time.
Reliability refers to the property that a system can run continuously without failure. In contrast to availability, reliability is defined in terms of a time interval instead of an instant in time. A highly reliable system is one that will most likely continue to work without interruption during a relatively long period of time. This is a subtle but important difference when compared to availability. If a system goes down for one millisecond every hour, it has an availability of over 99.9999 percent, but is still highly unreliable. Similarly, a system that never crashes but is shut down for two weeks every August has high reliability but only 96 percent availability. The two are not the same.
Safety refers to the situation that when a system temporarily fails to operate correctly, nothing catastrophic happens.
For example, many process control systems, such as those used for controlling nuclear power plants or sending people into space, are required to provide a high degree of safety. If such control systems temporarily fail for only a very brief moment, the effects could be disastrous. Many examples from the past (and probably many more yet to come) show how hard it is to build safe systems. Finally, maintainability refers to how easy a failed system can be repaired. A -highly maintainable system may also show a high degree of availability, especially if failures can be detected and repaired automatically. 160 Distributed Systems - CS 422 Often, dependable systems are also required to provide a high degree of security, especially when it comes to issues such as integrity. A system is said to fail when it cannot meet its promises. In particular, if a distributed system is designed to provide its users with a number of services, the system has failed when one or more of those services cannot be (completely) provided. An error is a part of a system's state that may lead to a failure. For example, when transmitting packets across a network, it is to be expected that some packets have been damaged when they arrive at the receiver. Damaged in this context means that the receiver may incorrectly sense a bit value (e.g., reading a 1instead of a 0), or may even be unable to detect that something has arrived. The cause of an error is called a fault. Clearly, finding out what caused an error is important. For example, a wrong or bad transmission medium may easily cause packets to be damaged. In this case, it is relatively easy to remove the fault. However, transmission errors may also be caused by bad weather conditions such as in wireless networks. Changing the weather to reduce or prevent errors is a bit complex. Building dependable systems closely relates to controlling faults. A distinction can be made between preventing, removing, and forecasting faults. For our purposes, the most important issue is fault tolerance, meaning that a system can provide its services even in the presence of faults. In other words, the system can tolerate faults and continue to operate normally. Faults are generally classified as transient, intermittent, or permanent: Transient faults occur once and then disappear. If the operation is repeated, the fault goes away. A bird flying through the beam of a microwave transmitter may cause lost bits on some network (not to mention a roasted bird). If the transmission times out and is retried, it will probably work the second time. An intermittent fault occurs, then vanishes of its own accord, then reappears, and so on. A loose contact on a connector will often cause an intermittent fault. Intermittent faults cause a great deal of aggravation because they are difficult to diagnose. Typically, when the fault doctor shows up, the system works fine. 161 Distributed Systems - CS 422 A permanent fault is one that continues to exist until the faulty component is replaced. Burnt-out chips, software bugs, and disk head crashes are examples of permanent faults. 6.1.2 Failure Models A system that fails is not adequately providing the services it was designed for. If we consider a distributed system as a collection of servers that communicate with one another and with their clients, not adequately providing services means that servers, communication channels, or possibly both, are not doing what they are supposed to do. 
However, a malfunctioning server itself may not always be the fault we are looking for. If such a server depends on other servers to adequately provide its services, the cause of an error may need to be searched for somewhere else. Such dependency relations appear in abundance in distributed systems. A failing disk may make life difficult for a file server that is designed to provide a highly available file system. If such a file server is part of a distributed database, the proper working of the entire database may be at stake, as only part of its data may be accessible. To get a better grasp on how serious a failure actually is, several classification schemes have been developed. One such scheme is shown in the following figure: 162 Distributed Systems - CS 422 A crash failure occurs when a server prematurely halts, but was working correctly until it stopped. An important aspect of crash failures is that once the server has halted, nothing is heard from it anymore. A typical example of a crash failure is an operating system that comes to a grinding halt, and for which there is only one solution: reboot it. Many personal computer systems suffer from crash failures so often that people have come to expect them to be normal. Consequently, moving the reset button from the back of a cabinet to the front was done for good reason. Perhaps one day it can be moved to the back again, or even removed altogether. An omission failure occurs when a server fails to respond to a request. Several things might go wrong. In the case of a receive omission failure, possibly the server never got the request in the first place. Note that it may well be the case that the connection between a client and a server has been correctly established, but that there was no thread listening to incoming requests. Also, a receive omission failure will generally not affect the current state of the server, as the server is unaware of any message sent to it. Likewise, a send omission failure happens when the server has done its work, but somehow fails in sending a response. Such a failure may happen, for example, when a send buffer overflows while the server was not prepared for such a situation. Note that, in contrast to a receive omission failure, the server may now be in a state reflecting that it has just completed a service for the client. As a consequence, if the sending of its response fails, the server has to be prepared for the client to reissue its previous request. Other types of omission failures not related to communication may be caused by software errors such as infinite loops or improper memory management by which the server is said to "hang." 163 Distributed Systems - CS 422 Another class of failures is related to timing. Timing failures occur when the response lies outside a specified real-time interval. Isochronous data streams providing data too soon may easily cause trouble for a recipient if there is not enough buffer space to hold all the incoming data. More common, however, is that a server responds too late, in which case a performance failure is said to occur. A serious type of failure is a response failure, by which the server's response is simply incorrect. Two kinds of response failures may happen. In the case of a value failure, a server simply provides the wrong reply to a request. For example, a search engine that systematically returns Web pages not related to any of the search terms used. The other type of response failure is known as a state transition failure. 
This kind of failure happens when the server reacts unexpectedly to an incoming request. For example, if a server receives a message it cannot recognize, a state transition failure happens if no measures have been taken to handle such messages. In particular, a faulty server may incorrectly take default actions it should never have initiated. The most serious are arbitrary failures, also known as Byzantine failures. In effect, when arbitrary failures occur, clients should be prepared for the worst.In particular, it may happen that a server is producing output it should never have produced, but which cannot be detected as being incorrect Worse yet a faulty server may even be maliciously working together with other servers to produce intentionally wrong answers. This situation illustrates why security is also considered an important requirement when talking about dependable systems. Arbitrary failures are closely related to crash failures. The definition of crash failures as presented above is the most benign way for a server to halt. They are also referred to as fail-stop failures. In effect, a fail-stop server will simply stop 164 Distributed Systems - CS 422 producing output in such a way that its halting can be detected by other processes. In the best case, the server may have been so friendly to announce it is about to crash; otherwise it simply stops. Finally, there are also occasions in which the server is producing random output, but this output can be recognized by other processes as plain junk. The server is then exhibiting arbitrary failures, but in a benign way. These faults are also referred to as being fail-safe. If a system is to be fault tolerant, the best it can do is to try to hide the occurrence of failures from other processes. The key technique for masking faults is to use redundancy. Three kinds are possible: information redundancy, time redundancy, and physical redundancy: With information redundancy, extra bits are added to allow recovery from garbled bits. For example, a Hamming code can be added to transmitted data to recover from noise on the transmission line. With time redundancy, an action is performed, and then. If need be, it is performed again. Transactions use this approach. If a transaction aborts, it can be redone with no harm. Time redundancy is especially helpful when the faults are transient or intermittent. With physical redundancy, extra equipment or processes are added to make it possible for the system as a whole to tolerate the loss or malfunctioning of some components. Physical redundancy can thus be done either in hardware or in software. For example, extra processes can be added to the system so that if a small number of them crash, the system can still function correctly. Physical redundancy is a well-known technique for providing fault tolerance. It is used in biology (mammals have two eyes, two ears, two lungs, etc.), aircraft (747s have four engines but can fly on three), and sports (multiple referees in case one misses an event). It has also been used for fault tolerance in electronic circuits for years; 165 Distributed Systems - CS 422 6.2 Process Flexibility Let us concentrate on how fault tolerance can actually be achieved in distributed systems. The first topic we discuss is protection against process failures, which is achieved by replicating processes into groups. In the following pages, we consider the general design issues of process groups, and discuss what a fault-tolerant group actually is. 
Also, we look at how to reach agreement within a process group when one or more of its members cannot be trusted to give correct answers. 6.2.1 Design Issues The key approach to tolerating a faulty process is to organize several identical processes into a group. The key property that all groups have is that when a message is sent to the group itself, all members of the group receive it. In this way, if one process in a group fails, hopefully some other process can take over for it. Process groups may be dynamic. New groups can be created and old groups can be destroyed. A process can join a group or leave one during system operation. A process can be a member of several groups at the same time. Consequently, mechanism s are needed for managing groups and group membership. The purpose of introducing groups is to allow processes to deal with collections of processes as a single abstraction. Thus a process can send a message to a group of servers without having to know who they are or how many there are or where they are, which may change from one call to the next. Flat Groups versus Hierarchical Groups An important distinction between different groups has to do with their internal structure. In some groups, all the processes are equal. No one is boss and all decisions are made collectively. In other groups, some kind of hierarchy exists. For example, one process is the coordinator and all the others are workers. In this model, when a request for work is generated, either by an external client or by one of the workers, it is sent to the coordinator. The coordinator then decides which worker is best suited to carry it out, 166 Distributed Systems - CS 422 and forwards it there. More complex hierarchies are also possible, of course. These communication patterns are illustrated in the following figure: Figure (a) Communication in a flat group Figure (b) Communication in a simple hierarchical group Each of these organizations has its own advantages and disadvantages. The flat group is symmetrical and has no single point of failure. If one of the processes crashes, the group simply becomes smaller, but can otherwise continue. A disadvantage is that decision making is more complicated. For example, to decide anything, a vote often has to be taken, incurring some delay and overhead. The hierarchical group has the opposite properties. Loss of the coordinator brings the entire group to a grinding halt, but as long as it is running, it can make decisions without bothering everyone else. Group Membership When group communication is present, some method is needed for creating and deleting groups, as well as for allowing processes to join and leave groups. One possible approach is to have a group server to which all these requests can be sent. The group server can then maintain a complete data base of all the groups and their exact membership. This method is straightforward, efficient, and fairly easy to implement. Unfortunately, it shares a major disadvantage with all centralized 167 Distributed Systems - CS 422 techniques: a single point of failure. If the group server crashes, group management ceases to exist. Probably most or all groups will have to be reconstructed from scratch, possibly terminating whatever work was going on. The opposite approach is to manage group membership in a distributed way. For example, if (reliable) multicasting is available, an outsider can send a message to all group members announcing its wish to join the group. 
Ideally, to leave a group, a member just sends a goodbye message to everyone. In the context of fault tolerance, assuming fail-stop semantics is generally not appropriate. The trouble is, there is no polite announcement that a process crashes as there is when a process leaves voluntarily. The other members have to discover this experimentally by noticing that the crashed member no longer responds to anything. Once it is certain that the crashed member is really down (and not just slow), it can be removed from the group. Another tricky issue is that leaving and joining have to be synchronous wit data messages being sent. In other words, starting at the instant that a process has joined a group, it must receive all messages sent to that group. Similarly, as soon as a process has left a group, it must not receive any more messages from the group, and the other members must not receive any more messages from it. One way of making sure that a join or leave is integrated into the message stream at the right place is to convert this operation into a sequence of messages sent to the whole group. One final issue relating to group membership is what to do if so many machines go down that the group can no longer function-at all. Some protocol is needed to rebuild the group. Invariably, some process will have to take the initiative to start the ball rolling, but what happens if two or three try at the same time? The protocol must to be able to withstand this. 6.2.2 Failure Masking and Replication Process groups are part of the solution for building fault-tolerant systems. In particular, having a group of identical processes allows us to mask one or more faulty processes 168 Distributed Systems - CS 422 in that group. In other words, we can replicate processes and organize them into a group to replace a single (vulnerable) process with a (fault tolerant) group. As discussed in the previous chapter, there are two ways to approach such replication: by means of primary-based protocols, or through replicated-write protocols. Primary-based replication in the case of fault tolerance generally appears in the form of a primary-backup protocol. In this case, a group of processes is organized in a hierarchical fashion in which a primary coordinates all write operations. In practice, the primary is fixed, although its role can be taken over by one of the backups. If need be. In effect, when the primary crashes, the backups execute some election algorithm to choose a new primary. Replicated-write protocols are used in the form of active replication, as well as by means of quorum-based protocols. These solutions correspond to organizing a collection of identical processes into a flat group. The main advantage is that such groups have no single point of failure, at the cost of distributed coordination. An important issue with using process groups to tolerate faults is how much replication is needed. To simplify our discussion, let us consider only replicated-write systems. A system is said to be k fault tolerant if it can survive faults in k components and still meet its specifications. If the components, say processes, fail silently, then having k + 1 of them is enough to provide k fault tolerance. If k of them simply stop, then the answer from the other one can be used. On the other hand, if processes exhibit Byzantine failures, continuing to run when sick and sending out erroneous or random replies, a minimum of 2k + 1 processors are needed to achieve k fault tolerance. 
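As a small illustration of the 2k + 1 bound (not taken from the text), a client can simply take a majority vote over the replies it receives; the k + 1 correct replicas always form that majority:

from collections import Counter

def majority_reply(replies):
    # replies: the 2k + 1 answers returned by the replicas, at most k of them wrong
    value, count = Counter(replies).most_common(1)[0]
    return value if count > len(replies) // 2 else None   # None: no majority found

print(majority_reply([42, 42, 7, 42, 7]))   # k = 2 faulty replies, still prints 42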
In the worst case, the k failing processes could accidentally (or even intentionally) generate the same reply. However, the remaining k + 1 will also produce the same answer, so the client or voter can just believe the majority. Of course, in theory it is fine to say that a system is k fault tolerant and just let the k + 1 identical replies outvote the k identical replies, but in practice it is hard to imagine circumstances in which one can say with certainty that k processes can fail but k + 1 processes cannot fail. Thus even in a fault-tolerant system some kind of statistical analysis may be needed.
An implicit precondition for this model to be relevant is that all requests arrive at all servers in the same order, also called the atomic multicast problem. Actually, this condition can be relaxed slightly, since reads do not matter and some writes may commute, but the general problem remains.
6.2.3 Agreement in Faulty Systems
Organizing replicated processes into a group helps to increase fault tolerance. As we mentioned, if a client can base its decisions on a voting mechanism, we can even tolerate that k out of 2k + 1 processes are lying about their result. The assumption we are making, however, is that processes do not team up to produce a wrong result. In general, matters become more intricate if we demand that a process group reaches an agreement, which is needed in many cases. Some examples are: electing a coordinator, deciding whether or not to commit a transaction, dividing up tasks among workers, and synchronization, among numerous other possibilities. When the communication and processes are all perfect, reaching such agreement is often straightforward, but when they are not, problems arise. The general goal of distributed agreement algorithms is to have all the non-faulty processes reach consensus on some issue, and to establish that consensus within a finite number of steps. The problem is complicated by the fact that different assumptions about the underlying system require different solutions, assuming solutions even exist. We can distinguish the following cases:
1. Synchronous versus asynchronous systems. A system is synchronous if and only if the processes are known to operate in a lock-step mode. Formally, this means that there should be some constant c ≥ 1, such that if any processor has taken c + 1 steps, every other process has taken at least 1 step. A system that is not synchronous is said to be asynchronous.
2. Communication delay is bounded or not. Delay is bounded if and only if we know that every message is delivered within a globally known, predetermined maximum time.
3. Message delivery is ordered or not. In other words, we distinguish the situation where messages from the same sender are delivered in the order that they were sent from the situation in which we do not have such guarantees.
4. Message transmission is done through unicasting or multicasting.
As it turns out, reaching agreement is only possible for the situations shown in the following figure. In all other cases, it can be shown that no solution exists. Note that most distributed systems in practice assume that processes behave asynchronously, message transmission is unicast, and communication delays are unbounded. As a consequence, we need to make use of ordered (reliable) message delivery, such as provided by TCP. The following figure illustrates the nontrivial nature of distributed agreement when processes may fail.
The problem was originally studied by Lamport et al. and is also known as the Byzantine agreement problem, referring to the numerous wars in which several armies needed to reach agreement on, for example, troop strengths while being faced with traitorous generals, conniving lieutenants, and so on. 171 Distributed Systems - CS 422 6.2.4 Failure Detection It may have become clear from our discussions so far that in order to properly mask failures, we generally need to detect them as well. Failure detection is one of the cornerstones of fault tolerance in distributed systems. What it all boils down to is that for a group of processes, non faulty members should be able to decide who is still a member, and who is not. In other words, we need to be able to detect when a member has failed. When it comes to detecting process failures, there are essentially only two mechanisms. Either processes actively send "are you alive?" messages to each other (for which they obviously expect an answer), or passively wait until messages come in from different processes. The latter approach makes sense only when it can be guaranteed that there is enough communication between processes. In practice, actively pinging processes is usually followed. There has been a huge body of theoretical work on failure detectors. What it all boils down to is that a timeout mechanism is used to check whether a process has failed. In real settings, there are two major problems with this approach. First, due to unreliable networks, simply stating that a process has failed because it does not return an answer to a ping message may be wrong. In other words, it is quite easy to generate false positives. If a false positive has the effect that a perfectly healthy process is removed from a membership list, then clearly we are doing something wrong. Another serious problem is that timeouts are just plain crude. As noticed by Birman, there is hardly any work on building proper failure detection subsystems that take more into account than only the lack of a reply to a single message. This statement is even more evident when looking at industry-deployed distributed systems. There are various issues that need to be taken into account when designing a failure detection subsystem. For example, failure detection can take place through gossiping 172 Distributed Systems - CS 422 in which each node regularly announces to its neighbors that it is still up and running. As we mentioned, an alternative is to let nodes actively probe each other. Failure detection can also be done as a side-effect of regularly exchanging information with neighbors, as is the case with gossip-based information dissemination. This approach is essentially also adopted in: processes periodically gossip their service availability. This information is gradually disseminated through the network by gossiping. Eventually, every process will know about every other process, but more importantly, will have enough information locally available to decide whether a process has failed or not. A member for which the availability information is old, will presumably have failed. Another important issue is that a failure detection subsystem should ideally be able to distinguish network failures from node failures. One way of dealing with this problem is not to let a single node decide whether one of its neighbors has crashed. Instead, when noticing a timeout on a ping message, a node requests other neighbors to see whether they can reach the presumed failing node. 
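A minimal timeout-based detector in the spirit of the "are you alive?" probing described above (a sketch; as the text warns, a missing answer may just as well indicate a slow or broken network, so a real subsystem would combine this with gossiping or with confirmation from other neighbors):

import socket

def is_alive(host, port, timeout=2.0):
    # try to open a TCP connection within `timeout` seconds
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False          # no answer: only *suspect* a crash

def confirmed_failed(host, port, witness_probes):
    # ask other neighbors (witness_probes: callables that run their own probe)
    # before declaring the node dead, to separate node failures from link failures
    if is_alive(host, port):
        return False
    return all(not probe(host, port) for probe in witness_probes)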
Of course, positive information can also be shared: if a node is still alive, that information can be forwarded to other interested parties (who may be detecting a link failure to the suspected node).
This brings us to another key issue: when a member failure is detected, how should the other non-faulty processes be informed? One simple, and somewhat radical, approach is the one followed in FUSE. In FUSE, processes can be joined in a group that spans a wide-area network. The group members create a spanning tree that is used for monitoring member failures. Members send ping messages to their neighbors. When a neighbor does not respond, the pinging node immediately switches to a state in which it will also no longer respond to pings from other nodes. By recursion, a single node failure is rapidly promoted to a group failure notification. FUSE does not suffer much from link failures, for the simple reason that it relies on point-to-point TCP connections between group members.

6.3 Reliable Client-Server Communication
In many cases, fault tolerance in distributed systems concentrates on faulty processes. However, we also need to consider communication failures. Most of the failure models discussed previously apply equally well to communication channels. In particular, a communication channel may exhibit crash, omission, timing, and arbitrary failures. In practice, when building reliable communication channels, the focus is on masking crash and omission failures. Arbitrary failures may occur in the form of duplicate messages, resulting from the fact that in a computer network messages may be buffered for a relatively long time and are reinjected into the network after the original sender has already issued a retransmission.

6.3.1 Point-to-Point Communication
In many distributed systems, reliable point-to-point communication is established by making use of a reliable transport protocol, such as TCP. TCP masks omission failures, which occur in the form of lost messages, by using acknowledgments and retransmissions. Such failures are completely hidden from a TCP client. However, crash failures of connections are not masked. A crash failure may occur when (for whatever reason) a TCP connection is abruptly broken so that no more messages can be transmitted through the channel. In most cases, the client is informed that the channel has crashed by raising an exception. The only way to mask such failures is to let the distributed system attempt to automatically set up a new connection, by simply resending a connection request. The underlying assumption is that the other side is still, or again, responsive to such requests.

6.3.2 RPC Semantics in the Presence of Failures
Let us now take a closer look at client-server communication when using high-level facilities such as Remote Procedure Calls (RPCs). The goal of RPC is to hide communication by making remote procedure calls look just like local ones. With a few exceptions, so far we have come fairly close. Indeed, as long as both client and server are functioning perfectly, RPC does its job well. The problem comes about when errors occur. It is then that the differences between local and remote calls are not always easy to mask.
To structure our discussion, let us distinguish between five different classes of failures that can occur in RPC systems:
1. The client is unable to locate the server.
2. The request message from the client to the server is lost.
3. The server crashes after receiving a request.
4. The reply message from the server to the client is lost.
5. The client crashes after sending a request.
Each of these categories poses different problems and requires different solutions.

Client Cannot Locate the Server
To start with, it can happen that the client cannot locate a suitable server. All servers might be down, for example. Alternatively, suppose that the client is compiled using a particular version of the client stub, and the binary is not used for a considerable period of time. In the meantime, the server evolves, a new version of the interface is installed, and new stubs are generated and put into use. When the client is eventually run, the binder will be unable to match it up with a server and will report failure. While this mechanism protects the client from accidentally trying to talk to a server that may not agree with it on what parameters are required or what it is supposed to do, the problem remains of how this failure should be dealt with. One possible solution is to have the error raise an exception, which the client program can then catch and handle.

Lost Request Messages
The second item on the list is dealing with lost request messages. This is the easiest failure to deal with: just have the operating system or client stub start a timer when sending the request. If the timer expires before a reply or acknowledgment comes back, the message is sent again. If the message was truly lost, the server will not be able to tell the difference between the retransmission and the original, and everything will work fine, unless, of course, so many request messages are lost that the client gives up and falsely concludes that the server is down, in which case we are back to "cannot locate the server." If the request was not lost, the only thing we need to do is let the server detect that it is dealing with a retransmission. Unfortunately, doing so is not simple, as we explain when discussing lost replies.

Server Crashes
The next failure on the list is a server crash. The normal sequence of events at a server is shown in the first of the following figures: a request arrives, is carried out, and a reply is sent. In the second figure, a request arrives and is carried out, just as before, but the server crashes before it can send the reply. In the third figure, a request again arrives, but this time the server crashes before it can even be carried out, and of course no reply is sent back.
The annoying part of these three cases is that the correct treatment differs for each one. In the second case, the system has to report failure back to the client (e.g., raise an exception), whereas in the third case it can simply retransmit the request. The problem is that the client's operating system cannot tell which is which; all it knows is that its timer has expired.
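The timer-based retransmission just described can be sketched as follows (an illustrative Python fragment, not the stub of any particular RPC system; the server address in the commented usage line is a placeholder). Note how a request identifier is attached so that a server could, in principle, recognize a retransmission, which is exactly what is needed to move from at-least-once towards at-most-once behavior.

import socket, uuid

def call_with_retry(server_addr, payload, timeout=1.0, max_tries=5):
    """Send a request over UDP and retransmit it when no reply arrives in time.

    The request id lets the server detect that a retransmission duplicates an
    earlier request; without it, we only obtain at-least-once behavior.
    """
    request_id = str(uuid.uuid4()).encode()
    message = request_id + b"|" + payload
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        for attempt in range(max_tries):
            sock.sendto(message, server_addr)
            try:
                reply, _ = sock.recvfrom(4096)
                return reply
            except socket.timeout:
                continue                  # timer expired: reissue the request
        # Too many losses: indistinguishable from "cannot locate the server".
        raise ConnectionError("server presumed down")
    finally:
        sock.close()

# Usage (assumes some UDP request/reply server is listening on the given port):
# print(call_with_retry(("127.0.0.1", 9999), b"print this text"))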
Suppose, for concreteness, that the client has asked the server to print some text, and that the server sends back a completion message in addition to doing the printing. There are four strategies the client can follow. First, the client can decide to never reissue a request, at the risk that the text will not be printed. Second, it can decide to always reissue a request, but this may lead to the text being printed twice. Third, it can decide to reissue a request only if it did not yet receive an acknowledgment that its print request had been delivered to the server; in that case, the client is counting on the fact that the server crashed before the print request could be delivered. The fourth and last strategy is to reissue a request only if it has received an acknowledgment for the print request.
With two strategies for the server (send the completion message before or after printing) and four for the client, there are a total of eight combinations to consider. To explain, note that there are three events that can happen at the server: send the completion message (M), print the text (P), and crash (C). These events can occur in six different orderings, where parentheses indicate an event that can no longer happen because the server has already crashed:
1. M → P → C: A crash occurs after sending the completion message and printing the text.
2. M → C (→ P): A crash happens after sending the completion message, but before the text could be printed.
3. P → M → C: A crash occurs after printing the text and sending the completion message.
4. P → C (→ M): The text is printed, after which a crash occurs before the completion message could be sent.
5. C (→ P → M): A crash happens before the server could do anything.
6. C (→ M → P): A crash happens before the server could do anything.
The following figure illustrates the different combinations of client and server strategies in the presence of server crashes. As can be readily verified, there is no combination of client strategy and server strategy that works correctly under all possible event sequences. The bottom line is that the client can never know whether the server crashed just before or just after having the text printed.

Client Crashes
The final item on the list of failures is the client crash. What happens if a client sends a request to a server to do some work and crashes before the server replies? At that point a computation is active and no parent is waiting for the result. Such an unwanted computation is called an orphan. Orphans can cause a variety of problems that interfere with the normal operation of the system. As a bare minimum, they waste CPU cycles. They can also lock files or otherwise tie up valuable resources. Finally, if the client reboots and does the RPC again, but the reply from the orphan comes back immediately afterward, confusion can result.
What can be done about orphans? Nelson (1981) proposed four solutions. In solution 1, before a client stub sends an RPC message, it makes a log entry telling what it is about to do. The log is kept on disk or some other medium that survives crashes. After a reboot, the log is checked and the orphan is explicitly killed off. This solution is called orphan extermination.
In solution 2, called reincarnation, all these problems can be solved without the need to write disk records. The way it works is to divide time up into sequentially numbered epochs. When a client reboots, it broadcasts a message to all machines declaring the start of a new epoch. When such a broadcast comes in, all remote computations on behalf of that client are killed. Of course, if the network is partitioned, some orphans may survive; fortunately, however, when they report back, their replies will contain an obsolete epoch number, making them easy to detect.
Solution 3 is a variant on this idea, but somewhat less draconian. It is called gentle reincarnation. When an epoch broadcast comes in, each machine checks to see whether it has any remote computations running locally, and if so, tries its best to locate their owners. Only if the owners cannot be located anywhere is the computation killed.
Finally, we have solution 4, expiration, in which each RPC is given a standard amount of time, T, to do its job. If it cannot finish, it must explicitly ask for another quantum, which is a nuisance. On the other hand, if after a crash the client waits a time T before rebooting, all orphans are sure to be gone. The problem to be solved here is choosing a reasonable value of T in the face of RPCs with wildly differing requirements.

6.4 Recovery
So far, we have mainly concentrated on algorithms that allow us to tolerate faults. However, once a failure has occurred, it is essential that the process where the failure happened can recover to a correct state. In what follows, we first concentrate on what it actually means to recover to a correct state, and subsequently on when and how the state of a distributed system can be recorded and recovered, by means of checkpointing and message logging.

6.4.1 Introduction
Fundamental to fault tolerance is recovery from an error. Recall that an error is that part of a system's state that may lead to a failure. The whole idea of error recovery is to replace an erroneous state with an error-free state. There are essentially two forms of error recovery.
In backward recovery, the main issue is to bring the system from its present erroneous state back into a previously correct state. To do so, it is necessary to record the system's state from time to time, and to restore such a recorded state when things go wrong. Each time (part of) the system's present state is recorded, a checkpoint is said to be made.
The other form of error recovery is forward recovery. In this case, when the system has entered an erroneous state, instead of moving back to a previous, checkpointed state, an attempt is made to bring the system into a correct new state from which it can continue to execute. The main problem with forward error recovery mechanisms is that it has to be known in advance which errors may occur. Only in that case is it possible to correct those errors and move to a new state.
The distinction between backward and forward error recovery is easily explained when considering the implementation of reliable communication. The common approach to recovering from a lost packet is to let the sender retransmit that packet. In effect, packet retransmission means that we attempt to go back to a previous, correct state, namely the one in which the packet that was lost is being sent. Reliable communication through packet retransmission is therefore an example of applying backward error recovery techniques.
An alternative approach is to use a method known as erasure correction. In this approach, a missing packet is constructed from other, successfully delivered packets. For example, in an (n, k) block erasure code, a set of k source packets is encoded into a set of n encoded packets, such that any set of k encoded packets is enough to reconstruct the original k source packets. Typical values are k = 16 or k = 32, with k < n ≤ 2k. If not enough packets have yet been delivered, the sender will have to continue transmitting packets until a previously lost packet can be constructed. Erasure correction is a typical example of a forward error recovery approach.
By and large, backward error recovery techniques are widely applied as a general mechanism for recovering from failures in distributed systems.
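To make the idea of forward recovery by erasure correction concrete, the following toy Python example uses the simplest possible code: a single XOR parity packet over k source packets (so n = k + 1), from which any one missing packet can be reconstructed without a retransmission. Practical systems use stronger (n, k) codes such as Reed-Solomon; this sketch only shows the principle.

from functools import reduce

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(packets):
    """Append one parity packet: the XOR of all k equal-sized source packets."""
    parity = reduce(xor_bytes, packets)
    return packets + [parity]

def recover(received):
    """Reconstruct the single missing packet by XOR-ing all received packets."""
    return reduce(xor_bytes, received)

# Example: 3 source packets are encoded into 4; the second packet is lost.
src = [b"AAAA", b"BBBB", b"CCCC"]
encoded = encode(src)
received = [p for i, p in enumerate(encoded) if i != 1]   # packet 1 lost
print(recover(received) == src[1])                        # -> True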
The major benefit of backward error recovery is that it is a generally applicable method, independent of any specific system or process. However, backward error recovery also introduces some problems. First, restoring a system or process to a previous state is generally a relatively costly operation in terms of performance. Second, because backward error recovery mechanisms are independent of the distributed application for which they are used, no guarantees can be given that, once recovery has taken place, the same or a similar failure will not happen again. Finally, although backward error recovery requires checkpointing, some states can simply never be rolled back to.

6.4.2 Checkpointing
In a fault-tolerant distributed system, backward error recovery requires that the system regularly saves its state onto stable storage. In particular, we need to record a consistent global state, also called a distributed snapshot. In a distributed snapshot, if a process P has recorded the receipt of a message, then there should also be a process Q that has recorded the sending of that message; after all, it must have come from somewhere.
In backward error recovery schemes, each process saves its state from time to time to locally available stable storage. To recover after a process or system failure, we must construct a consistent global state from these local states. In particular, it is best to recover to the most recent distributed snapshot, also referred to as a recovery line. In other words, a recovery line corresponds to the most recent consistent collection of checkpoints, as illustrated in the following figure.

6.4.3 Message Logging
Considering that checkpointing is an expensive operation, especially concerning the operations involved in writing state to stable storage, techniques have been sought to reduce the number of checkpoints while still enabling recovery. An important technique in distributed systems is logging messages.
The basic idea underlying message logging is that if the transmission of messages can be replayed, we can still reach a globally consistent state without having to restore that state entirely from stable storage. Instead, a checkpointed state is taken as a starting point, and all messages that have been sent since are simply retransmitted and handled accordingly.
This approach works fine under the assumption of what is called a piecewise deterministic model. In such a model, the execution of each process is assumed to take place as a series of intervals in which events take place. For example, an event may be the execution of an instruction, the sending of a message, and so on. Each interval in the piecewise deterministic model is assumed to start with a nondeterministic event, such as the receipt of a message. From that moment on, however, the execution of the process is completely deterministic. An interval ends with the last event before the next nondeterministic event occurs. In effect, an interval can be replayed with a known result, that is, in a completely deterministic way, provided it is replayed starting with the same nondeterministic event as before. Consequently, if we record all nondeterministic events in such a model, it becomes possible to completely replay the entire execution of a process in a deterministic way.
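The following toy Python class sketches this combination of checkpointing and message logging under the piecewise deterministic assumption (all names are illustrative, and stable storage is simulated by ordinary attributes): the only nondeterministic events are message deliveries, each delivery is logged, and after a crash the process restores its last checkpoint and deterministically replays the logged messages to reach its pre-crash state.

class LoggedProcess:
    """Process whose state changes only when a message is delivered.

    Because handling a message is deterministic, restoring the last
    checkpoint and replaying the logged messages reproduces the state."""
    def __init__(self):
        self.state = 0
        self.checkpoint = 0       # last saved state (stable storage in reality)
        self.log = []             # messages delivered since the checkpoint

    def deliver(self, msg):
        self.log.append(msg)      # log the nondeterministic event first
        self.state += msg         # deterministic handling of the message

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log = []             # older messages are no longer needed

    def recover(self):
        self.state = self.checkpoint
        for msg in self.log:      # deterministic replay of logged messages
            self.state += msg

p = LoggedProcess()
p.deliver(5); p.take_checkpoint(); p.deliver(3); p.deliver(4)
before_crash = p.state
p.recover()                       # simulate a crash followed by recovery
print(p.state == before_crash)    # -> True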
Considering that message logs are needed to recover from a process crash so that a globally consistent state is restored, it becomes important to know precisely when messages are to be logged. It turns out that many existing message-logging schemes can be easily characterized if we concentrate on how they deal with orphan processes. In this context, an orphan process is a process that survives the crash of another process, but whose state is inconsistent with the crashed process after its recovery.
As an example, consider the situation shown in the following figure. Process Q receives messages m1 and m2 from processes P and R, respectively, and subsequently sends a message m3 to R. However, in contrast to all other messages, message m2 is not logged. If process Q crashes and later recovers, only the logged messages required for the recovery of Q are replayed, in this example only m1. Because m2 was not logged, its transmission will not be replayed, which means that the transmission of m3 may also not take place. The figure thus shows an incorrect replay of messages after recovery, leading to an orphan process: the situation after the recovery of Q is inconsistent with that before its recovery. In particular, R holds a message (m3) that was sent before the crash, but whose receipt and delivery do not take place when replaying what had happened before the crash. Such inconsistencies should obviously be avoided.

Revision Sheet # 6
PROBLEMS
1. Dependable systems are often required to provide a high degree of security. Why?
2. What makes the fail-stop model in the case of crash failures so difficult to implement?
3. For each of the following applications, do you think at-least-once semantics or at-most-once semantics is best? Discuss.
(a) Reading and writing files from a file server.
(b) Compiling a program.
(c) Remote banking.
4. To what extent is scalability of atomic multicasting important?
5. Virtual synchrony is analogous to weak consistency in distributed data stores, with group view changes acting as synchronization points. In this context, what would be the analog of strong consistency?
6. Explain how the write-ahead log in distributed transactions can be used to recover from failures.
7. Does a stateless server need to take checkpoints?