Uploaded by Nathan Gistelinck

Summary PDS DT

Parallel and distributed software systems, Filip De Turck
Jarne Verhaeghe
January 9, 2019
Contents
1 Introduction to Distributed Software
1.1 Definitions and Terminology
1.2 Developing distributed applications
1.2.1 System models
1.3 Architecture
1.3.1 Logical architecture
1.3.2 System architecture
1.3.3 P2P
1.4 Middleware and services
1.5 Classes of distributed systems
1.6 Important architectures and platforms
1.7 Scalability and high availability
1.8 The Java family
1.9 The .NET family
2 Middleware
2.1 Situating middleware
2.2 Definitions and terminology
2.3 Communication between distributed objects
2.3.1 Request-Reply protocol
2.3.2 Marshalling
2.4 Remote Procedure Call (RPC)
2.4.1 RPC architecture
2.4.2 SUN RPC details
2.5 JAVA RMI
2.6 Corba RMI
2.7 Middleware services
3 Enterprise Applications
3.1 Overview of Java Enterprise Edition (JEE) architecture
4 Global State and Time
4.1 Physical clock synchronization
4.1.1 Hardware clocks
4.1.2 Skew and drift
4.1.3 Time standards
4.1.4 Clock synchronization
4.2 Logical clocks
4.2.1 Events and temporal ordering
4.2.2 Lamport clock
4.2.3 Vector clock
4.3 Performance metrics
4.3.1 Response time
4.3.2 Throughput
5 Coordination
5.1 Failure detection
5.1.1 Problem statement
5.1.2 Failure detection algorithm
5.2 Distributed mutual exclusion
5.2.1 Problem statement
5.2.2 Evaluation metrics
5.2.3 Centralized approach
5.2.4 Ring approach
5.2.5 Multicast-based algorithm: Ricart-Agrawala algorithm
5.2.6 Multicast-based algorithm: Maekawa voting
5.2.7 Suzuki-Kasami Algorithm
5.2.8 Raymond's Algorithm
5.2.9 Group Mutual Exclusion
5.3 Election
5.3.1 Problem statement
5.3.2 Evaluation metrics
5.3.3 Ring algorithm: Chang-Roberts
5.3.4 Multicast algorithm: Garcia-Molina (or the bully algorithm)
5.3.5 Franklin's Algorithm
5.3.6 Peterson's Algorithm
5.3.7 Dolev-Klawe-Rodeh Algorithm
5.3.8 Tree Election Algorithm
5.3.9 Echo Algorithm with Extinction
5.3.10 Minimum Spanning Trees
5.3.11 Election in Arbitrary Networks
5.4 Ordered multicast
5.4.1 Terminology and operations
5.4.2 Basic multicast
5.4.3 Message order
6 Distributed Consensus with Failures
6.1 Consensus problem
6.1.1 Distributed consensus problem
6.1.2 1-crash consensus problem
6.2 Algorithms
6.2.1 Bracha-Toueg crash consensus algorithm
6.2.2 Consensus with failure detection
6.2.3 Chandra-Toueg consensus algorithm
6.2.4 Byzantine process
6.2.5 Chandra-Toueg Byzantine consensus algorithm
6.2.6 Clock synchronization with Byzantine processes
6.2.7 More algorithms
7 Anonymous Networks
7.1 Definition
7.2 Probabilistic algorithms
7.3 Itai-Rodeh election algorithm
7.3.1 Algorithm outline
7.3.2 Algorithm evaluation
7.4 Echo algorithm with extinction
7.4.1 Algorithm outline
7.4.2 Algorithm details
7.5 Itai-Rodeh ring size algorithm
7.5.1 Algorithm outline
7.5.2 Algorithm details
8 Peer-to-Peer systems
8.1 Introduction
8.1.1 Why P2P-systems?
8.1.2 P2P generations
8.2 Overlays
8.3 Distributed Hash Tables
8.3.1 Introduction
8.3.2 Circular Routing (1D)
8.3.3 Prefix Routing: Pastry
8.3.4 Skiplist Routing: Chord
8.3.5 Multi-dimensional Routing: CAN (Content Addressable Network)
9 Cloud Computing
9.1 Definition
9.1.1 Characteristics
9.1.2 Service models
9.1.3 Deployment models
9.1.4 Payment models
9.1.5 Advantages
9.1.6 Obstacles
9.2 Cloud Platforms
9.2.1 Amazon Web Services (AWS)
9.2.2 Microsoft Windows Azure
9.2.3 Google App Engine
9.3 Building blocks of an IaaS Cloud
9.3.1 Provisioning resources
9.3.2 Virtualization
9.3.3 Virtual Images
9.3.4 Virtual Appliances
9.4 Enterprise Applications
9.4.1 Live Migration
10 Resource Allocation in Distributed Systems
10.1 Definition
10.2 Resource allocation algorithms
10.2.1 Different types
10.2.2 Algorithms
10.3 Autonomic resource allocation
10.3.1 Autonomic systems: goal
10.3.2 Characteristics of autonomic systems
10.3.3 Control loops
10.3.4 Architecture for distributed autonomic systems
Chapter 1
Introduction to Distributed Software
1.1 Definitions and Terminology
Definition 1. Distributed system = A system where 1) hardware and software components are
located at networked computers and 2) components communicate and coordinate their actions
ONLY by passing messages
• Network is the computer
• examples:
– grid computing = A computational grid is a hardware and software infrastructure
that provides dependable, consistent, pervasive, and inexpensive access to high-end
computational capabilities (geographically spread resources, no single entity)
– P2P / file sharing
– IoT
– Online gaming
– Cloud computing = Grid computing + virtualization + elasticity + business model
• Consequences:
– no limit on spatial extent
– no global time notion
– concurrent execution
– failures likely to happen
Client-server:
Definition 2. Server = running process on networked computer accepting requests to perform
a service and responding appropriately
Definition 3. Client = running process on networked computer sending service requests to
servers
Definition 4. Remote invocation = is the complete interaction between a client and a server,
needed to process a single request from the client
Why distributed systems?
Figure 1.1:
• Pros:
– Resource sharing, (information + hardware)
– scalability
– fault tolerance
• Cons:
– no limit on spatial extent, difficult to manage
– no global time notion
– almost always concurrent execution
1.2 Developing distributed applications
• Requirements analysis phase
– required functions
– non-functional constraints
– Middleware services
• Architectural phase
– Formalised requirements
– UML-diagrams
• Design phases
– High level architecture subsystems and interfaces
– detailed design
– Middleware Platform
• Implementation phase
• Integration phase
Three architectures:
• Logical: how will the subsystems interact
• System: client-server or P2P
• Functional: focusing on functions to be provided by the application (functional requirements)
1.2.1 System models
• Interaction model : communication link
Definition 5. Synchronous systems = 1) time to execute each process step has a known
lower and upper bound 2) each message is received within a known bounded time 3) each
process controlled by local clock with known bound for drift rate
• Failure model
– Omission failures: fails to perform an action
– Byzantine failures: arbitrary behaviour (process giving the wrong result)
– Timing failures: a synchronous system violates one or more of its timing constraints
• Security model
1.3 Architecture
1.3.1 Logical architecture
= architecture capturing how subsystems will interact (also called architectural style)
• Layered architecture
– layer interacts with neighbour
– PRO: reduced number of interfaces, dependencies, easy replacement of a layer
– CON: possible duplication of functionality
• Interacting objects
– no predefined interaction patterns
– PRO: highly flexible
– CON: complex management and maintenance
• Event-based interaction
– publish-subscribe style
– PRO: loose coupling of components
• Data centric architecture
– only interaction through a shared database
– PRO: loose coupling of components
– CON: possibly slow
Figure 1.2:
Figure 1.3: Logical architectures, a) layered b) interacting c) event based d) data centric
1.3.2 System architecture
Page 13 figure!
• Client server
• Client-multiserver (e.g. DNS-based load balancing)
• Client-multiserver (implicit server lookup)
• proxy server
1.3.3 P2P
• Processes host both requesting and replying functionalities = servants
• A peer cannot contact every other node directly, so it keeps direct links to only a subset of the peers (these direct links and the interacting subset form the P2P overlay)
• Flavors:
– Unstructured : links have no relation to the service offered at the peers, looking up a
suitable servant in principle involves a full search, not scalable but simple
– Structured : links are organized such that there is an explicit relationship between the
logical links and the services offered at each peer, use Distributed Hash tables (DHTs)
– Address scalability: no bottlenecks
– Robustness: many interacting processes so less vulnerable to a process failure, if one
fails, only a small fraction is missing
– Grow: growth is easier to manage
• File sharing:
– Unstructured:
∗ Mediated P2P: (Napster...), different functions needed, indexing and lookup functions: client-server architecture, file download: peer-to-peer architecture (so global
index at server, so looking up a file requires querying this server)
Figure 1.4:
∗ Pure P2P: inject search message in overlay network to look for a file, reply is sent
when a peer has it, download via dedicated data connection not using overlay
∗ Hybrid P2P: peers are organized in groups with a coordinating superpeer that is
responsible for the whole group (has usually more resources), supernodes interact
in P2P, intra-group via client-server
– Structured: mapping between file and location, typically by mapping a content item to
a key (e.g. a hash function of the name) and mapping a key to a host (each host is
responsible for a given key range)
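The structured mapping above (content item to key, key to host) can be sketched as follows. This is a minimal illustration under stated assumptions: a 16-bit key space, SHA-1 as the hash function, and a hypothetical class KeyRangeLookup with made-up host names; it is not the API of any particular DHT.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: every host is responsible for the key range that ends at its own key.
class KeyRangeLookup {
    static final int KEY_SPACE = 1 << 16; // assumed 16-bit key space

    // Upper bound of a key range -> host responsible for that range.
    private final TreeMap<Integer, String> hosts = new TreeMap<>();

    void addHost(int rangeEnd, String host) {
        hosts.put(rangeEnd, host);
    }

    // Step 1: map a content item to a key by hashing its name.
    static int keyFor(String name) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                    .digest(name.getBytes(StandardCharsets.UTF_8));
            // Use the first two digest bytes as a 16-bit key.
            return (((digest[0] & 0xFF) << 8) | (digest[1] & 0xFF)) % KEY_SPACE;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Step 2: map a key to the host owning its range (wraps around past the last host).
    // Assumes at least one host has been added.
    String hostFor(int key) {
        Map.Entry<Integer, String> entry = hosts.ceilingEntry(key);
        return (entry != null ? entry : hosts.firstEntry()).getValue();
    }
}
```

Looking up a file then amounts to hostFor(keyFor(fileName)); no global index server is involved, in contrast to mediated P2P.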
1.4 Middleware and services
Role of middleware
• Abstract hardware/software platform: OS, implementation language
• Realize transparency
• Provide generic services (figure 1.4)
1.5 Classes of distributed systems
See figure 1.5
1.6 Important architectures and platforms
Thin versus thick clients, seen from the thin client:
• PROS: cheaper, many devices can be involved, cheaper to update (server-controlled)
• CONS: some functions are more natural to support on the client, a permanent network connection
is needed
• examples: web browser, remote desktop clients (e.g. Athena)
Figure 1.5: All different classes of distributed systems
Grids and clusters:
• Cluster middleware
– job scheduling
– cluster monitoring
– user administration
– matchmaking
– job submission
• Grid middleware:
– meta-scheduling
– abstraction of heterogeneity
– information service (Globus, gLite, MPI)
Transactional systems, figure 1.6
1.7 Scalability and high availability
Definition 6. Scalability = the system's ability to gracefully handle a growing number of requests
• Horizontal scalability
– = Adding more computational nodes to distributed system
– master/slave techniques, partitioning of data on the available nodes (DHT), NoSQL
based systems in particular allow for horizontal scaling
Figure 1.6: transactional architecture
• Vertical scalability
– = adding more resources to a single computational node, improve existing code to
handle more requests
– Techniques: adding memory or disks, more powerful CPU, improving IO and concurrency models
High availability
• The property of a distributed system to handle hardware and software failures
• Techniques:
– Redundancy: software components available on several locations
– Failover: when a software component or resource fails, automatic failover to the redundant components
– Replicas: multiple copies of the same data available
• Very often source of performance problems
Definition 7. CAP theorem = A distributed computer system can NOT simultaneously provide
all three of the following guarantees:
• Consistency: all read/write operations must result in global consistent state
• Availability: all requests on non-failed components must get a response
• Partition Tolerance: the system continues to operate when nodes are not able to communicate
with each other (is in practice often neglected, whilst the other 2 are maintained)
Figure 1.7: Java technologies: Web-client versus Application client
1.8 The Java family
see figure 1.7
1.9 The .NET family
• Very similar functions as in JEE:
• Web tier:
– Web service (SOAP/REST) heavily used
– ASP is counterpart for JSP
– .NET framework for web tier development: ASP.NET
– COM+ component model ported to .NET
– .NET enterprise services: resource pooling, eventing, security, transaction support,
synchronization
• Business tier:
Chapter 2
Middleware
2.1 Situating middleware
A middleware framework takes care of invoking a procedure or method on a remote
component and provides multithreading. It acts as an intermediary between client and server,
so the programmer does not have to implement these tasks (almost all distributed applications
are implemented through a middleware framework). Middleware takes over the following
tasks, which would otherwise have to be programmed by hand at the client and the server:
• Client and server: create sockets for TCP (and the TCP/IP connection) or UDP
• Client: converting list of argument and method name to bit streams to be transported
(serializing)
• Server: interpreting the bit stream, local execution
• Server: converting return values to bit stream, transport back
• Client interpreting return bit stream, local handling of results
• Both: fault tolerance measures (retransmissions..) and closing sockets (and TCP/IP connection)
2.2 Definitions and terminology
Distributed Objects
• Objects = logical units of the system
• Distributed OO: physically distributed, only remote object accessible through methods
• Replicated server objects = objects with exactly the same interface and functionality, but
located on different computers (for fault tolerance, high availability...)
Local invocation
• Object reference is used
• Object on local computer
Remote invocation
• Only methods in remote interface are available and must be used.
Figure 2.1: Middleware framework
Figure 2.2: Remote and local invocation example, Squares = instantiated objects, Circles = processes
• Limited access
• Every remote object has remote interface (Java: Remote, Corba using Interface Definition
Language (IDL))
• Approach:
– Same invocation syntax as for local invocation, but a RemoteException needs to be
caught when invoking remote objects
– Object interface reflects remoteness (e.g. extends from interface Remote)
Remote Object References
• Reference purpose = unique ID for objects in distributed systems
• Properties:
– Uniqueness over time: must differ from the reference of every other object at any given moment in time
– Uniqueness over space: must differ from the reference of every other object at any given location
• Passable as arguments (unlike local references) and returnable as result
• Used to invoke methods on remote objects
• Consists:
– host IP address
– Process port number + process creation time
– Object number (to distinguish between objects created at the same time, due to inaccuracy of time measurement)
– Object interface name
– In this order (bits) (32-32-32-32-?)
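The layout above can be sketched as a small value class. This is only an illustration: the class name RemoteObjectRef appears later in these notes as a message field type, but the field names, the UTF-8 encoding of the interface name and the marshal/unmarshal helpers are assumptions made here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of a remote object reference: four 32-bit fields plus the interface name.
final class RemoteObjectRef {
    final int hostIp;           // IPv4 address packed into 32 bits
    final int port;             // port number of the hosting process
    final int creationTime;     // process creation time
    final int objectNumber;     // distinguishes objects created at the same time
    final String interfaceName; // variable-length part (the "?" in the layout)

    RemoteObjectRef(int hostIp, int port, int creationTime, int objectNumber, String interfaceName) {
        this.hostIp = hostIp;
        this.port = port;
        this.creationTime = creationTime;
        this.objectNumber = objectNumber;
        this.interfaceName = interfaceName;
    }

    // Write the fields in the order 32-32-32-32-name (big-endian, the ByteBuffer default).
    byte[] marshal() {
        byte[] name = interfaceName.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buffer = ByteBuffer.allocate(16 + name.length);
        buffer.putInt(hostIp).putInt(port).putInt(creationTime).putInt(objectNumber).put(name);
        return buffer.array();
    }

    // Inverse of marshal(): the bytes after the four integers form the interface name.
    static RemoteObjectRef unmarshal(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        int ip = buffer.getInt();
        int port = buffer.getInt();
        int time = buffer.getInt();
        int number = buffer.getInt();
        byte[] name = new byte[buffer.remaining()];
        buffer.get(name);
        return new RemoteObjectRef(ip, port, time, number, new String(name, StandardCharsets.UTF_8));
    }
}
```

Because the reference is plain data, it can be passed as an argument or returned as a result, which a local reference cannot.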
Interface Definition Language (IDL)
• IDL interface specifies: interface name, set of methods and attributes, that clients can request
• keywords
– Interface
– Attribute
– In (type arguments and results)
– Out
– Inout
– Any: Specify that any type can be passed as argument (comparable to void*)
– Readonly: used with attribute type (similar to public class), only a get is available
when readonly is specified
– Module: used to group interfaces and IDL types in logical units (own namespace and
naming scope)
– Raises: for raising exceptions (instead of throw)
• Object supertype, concept of exceptions
• Based on C++ preprocessing so #ifndef,#define,#endif also used in the interface definition
• Types: short,long,unsigned short, unsigned long, float, double, char, boolean, octet, any
• page 38 example
2.3 Communication between distributed objects
Invocation semantics
• Not easy to guarantee that exactly one invocation is taking place (communication channel
can fail unexpectedly)
• Techniques to guarantee:
– Retry-request message: request is retransmitted until either the remote server sends
back a reply or the server can be assumed to have crashed
– Duplicate request filtering: At the server, the requests are inspected to make sure to
filter out duplicate requests
– Retransmission of results: When result can not be delivered, re-execute the method and
send again, or keep history of result messages, such that results can be retransmitted
without re-execution
• Types of invocation semantics:
– Maybe semantics: when no request retransmission is used, this implies calls are executed
WITHOUT guarantee
– At-least-once semantics: request retransmission is used, but no duplicate filtering, so
the call is re-executed for every request that arrives. Failures of the request or reply
messages are masked, but the call can be executed multiple times
– At-most-once semantics: all three techniques are used (so no call re-execution); the invoker
receives either a result or an exception stating that no result was received (first case:
exactly one execution, second case: invoked once or not at all)
Retransmit request | Duplicate filtering | Re-execute call | Retransmit reply | Invocation semantics
NO                 | NA                  | NA              | NA               | maybe
YES                | NO                  | YES             | NO               | at-most-once is NOT reached: at-least-once
YES                | YES                 | NO              | YES              | at-most-once
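The server side of at-most-once semantics can be sketched as below: duplicates are filtered on the request ID and the stored reply is retransmitted instead of re-executing the call. The class AtMostOnceServer and its methods are hypothetical names for illustration, and the remote method is reduced to an int-to-int function to keep the sketch short.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntUnaryOperator;

// Hypothetical sketch of duplicate filtering plus a reply history (at-most-once).
class AtMostOnceServer {
    private final Map<Integer, Integer> replyHistory = new HashMap<>(); // requestId -> result
    private final IntUnaryOperator method; // stands in for the remote method
    private int executions = 0;

    AtMostOnceServer(IntUnaryOperator method) {
        this.method = method;
    }

    // Called for every request that arrives, including client retransmissions.
    int handle(int requestId, int argument) {
        Integer stored = replyHistory.get(requestId);
        if (stored != null) {
            return stored; // duplicate request: retransmit the stored reply, do not re-execute
        }
        int result = method.applyAsInt(argument);
        executions++;
        replyHistory.put(requestId, result); // keep the reply for possible retransmission
        return result;
    }

    int executions() {
        return executions;
    }
}
```

Dropping the replyHistory lookup turns this into at-least-once behaviour: every retransmitted request would execute the method again. A real server would also have to bound the history, for example by discarding entries once the client acknowledges the reply; that part is left out here.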
RMI architecture
Figure 2.3
• Proxy: one proxy per remote object, is the local representative of a remote object and
implements all methods of the remote interface, handles (un)marshalling
• Dispatcher : one per class, identifies appropriate method to invoke in skeleton, passes the
request message to it
• Skeleton: one per class, takes care of the (un)marshalling of arguments and results, invokes
the requested method on the remote object and sends a reply message to the proxy’s method
• Communication module: runs request-reply protocol + desired invocation semantics
• Remote reference module: manages remote object references and translates between local
and remote references, contains a remote object reference table and a table of local proxies
• Source code for the proxy, dispatcher and skeleton is generated by the interface compiler provided
by the middleware framework; the communication module and remote reference module are
provided by the middleware library
Figure 2.3: Schematic representation of the involved objects and modules for remote method invocation
(client object a invokes server object b).
2.3.1 Request-Reply protocol
• executed in communication module
• Purpose:
– deliver request messages to the server object
– send back the return values or exception info to the requester
– enforce appropriate invocation semantics
• example message (fields sent one after the other):
1. messageType [int]: request = 0, reply = 1
2. requestID [int]: allows to identify the matching request
3. objectReference [RemoteObjectRef]: marshalled remote object reference
4. methodID [int or Method]: method to invoke (e.g. numbering convention between communication modules)
5. arguments [byte[]]: marshalled method arguments
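A minimal sketch of how the communication module could put these five fields on the wire. The class RequestReplyMessage is hypothetical, and the length prefixes in front of the two byte arrays are an assumption added here so the receiver knows where each array ends; the notes do not specify the encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch of the request-reply message with the five fields in order.
final class RequestReplyMessage {
    static final int REQUEST = 0;
    static final int REPLY = 1;

    final int messageType;        // 1. request = 0, reply = 1
    final int requestId;          // 2. matches a reply to its request
    final byte[] objectReference; // 3. already marshalled remote object reference
    final int methodId;           // 4. agreed numbering of the methods
    final byte[] arguments;       // 5. already marshalled arguments

    RequestReplyMessage(int messageType, int requestId, byte[] objectReference,
                        int methodId, byte[] arguments) {
        this.messageType = messageType;
        this.requestId = requestId;
        this.objectReference = objectReference;
        this.methodId = methodId;
        this.arguments = arguments;
    }

    byte[] toBytes() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(messageType);
            out.writeInt(requestId);
            out.writeInt(objectReference.length); // assumed length prefix
            out.write(objectReference);
            out.writeInt(methodId);
            out.writeInt(arguments.length);       // assumed length prefix
            out.write(arguments);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static RequestReplyMessage fromBytes(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int type = in.readInt();
            int id = in.readInt();
            byte[] reference = new byte[in.readInt()];
            in.readFully(reference);
            int method = in.readInt();
            byte[] args = new byte[in.readInt()];
            in.readFully(args);
            return new RequestReplyMessage(type, id, reference, method, args);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```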
2.3.2 Marshalling
• = Process of transforming an object to a bit stream (serializing) to send over the network
• Two options:
– Using an external standard: XDR (External Data Representation), Corba's Common Data Representation (CDR) or Java Serialization
– Using proprietary sender standard, importance here is to send this standard along with
the request, so it can be interpreted
• Inverse = unmarshalling
Java serialization:
• Class implements Serializable interface
• Writing:
– create ObjectOutputStream (out)
– call out.writeObject(<object>)
• Reading:
– create ObjectInputStream (in)
– call in.readObject()
– cast to specific class (result is of type Object)
– BUT: class file must be reachable by virtual machine
• Details: page 42
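The writing and reading steps above can be sketched with the standard java.io classes. The SensorReading class and the SerializationDemo helper are hypothetical names used only for this illustration; an in-memory byte array stands in for the network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical example class: implementing Serializable is all that is needed.
class SensorReading implements Serializable {
    private static final long serialVersionUID = 1L;
    final String sensor;
    final double value;

    SensorReading(String sensor, double value) {
        this.sensor = sensor;
        this.value = value;
    }
}

class SerializationDemo {
    // Writing: create an ObjectOutputStream and call writeObject.
    static byte[] marshal(Object object) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(object);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    // Reading: create an ObjectInputStream and call readObject; the result has type Object.
    static Object unmarshal(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            // ClassNotFoundException: the class file is not reachable by this virtual machine.
            throw new IllegalStateException(e);
        }
    }
}
```

The caller casts the result back, for example (SensorReading) SerializationDemo.unmarshal(bytes).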
RMI binding service:
• Cons of using remote object references directly:
– no logical meaning
– direct reference to a physical location: the server might crash (no automatic failover to another
server), the object might migrate, and load balancing through replication is not possible
• Binding service = registry that maps textual names to remote object references
• Server registers remote objects
• client performs lookup
2.4 Remote Procedure Call (RPC)
2.4.1 RPC architecture
• RPC = non-OO version of RMI with C as programming language
• Procedure calls instead of method calls
• Similar to RMI architecture
• Differences:
– No remote reference module, since procedure calls do not deal with objects and object
references
– A stub procedure at the client side for each procedure at the server side (similar to a
proxy)
– At the server, a dispatcher for selecting the appropriate server stub procedure
– Server stub procedure (like a skeleton)
Figure 2.4: RPC remote invocation architecture
2.4.2 SUN RPC details
• marshalling: external data representation (XDR, RFC 1832)
• Interface compiler: rpcgen
• Binding service: portmapper, provided by SUN RPC, running on every computer of the
distributed system, records port in use by each service running locally
• Numbering of procedures starts at 1 (0 is the null procedure)
• No main() routine in server, return variable needs to be static
• Steps for developing:
1. Interface definition (XDR, .x extension, used for storing the interface description): procedures have only 1 argument, use C-structs and additional keywords (program = logical
grouping of procedures, version = version control)
2. Compile the XDR file with rpcgen: generates the client stub procedure, server main procedure, dispatcher, server proc stubs, XDR marshalling and unmarshalling procedures, and a
header with the C version of the interface definition
3. Implement application logic (client and server proc's)
4. Perform RPC in client code (clnt_create(hostname, program, version, protocol) returns a client
handle, clnt_destroy(handle) is the matching closing procedure)
5. Compile + run
Example: data service
Page 44-46 (syntax based)
2.5 JAVA RMI
• Java-only alternative to Corba
• Remote invocations identical to local ones but: client knows invocation is remote (MUST
catch RemoteException) and server object MUST implement Remote interface
• Terminology:
– client-side RMI-component (proxy): stub
– server-side RMI-component: skeleton
– No dispatcher per class (single generic dispatcher)
– Binder: RMIregistry (1 instance on every server computer, maintains table mapping
the textual names to references of the remote objects hosted on that computer, accessed
by methods of Naming class)
– Stubs and skeletons automatically generated by RMI-compiler rmic
• Parameter passing:
– Remote objects passed by reference (remote object reference)
– Non-remote objects passed by value (copy is made and sent, must be serializable)
• Code download:
– Allows download of class files from one virtual machine to another
– If the client or server does not yet have the class of an object passed by value, the code is
downloaded automatically
– Likewise, if a remote object reference is received and the class of its proxy is not available,
the proxy code is downloaded
• Example page 47+48
• Remote interface:
– extends java.rmi.Remote
– all methods throw java.rmi.RemoteException
– has public visibility
• Remote objects:
– extend either: java.rmi.server.UnicastRemoteObject OR java.rmi.server.Activatable
– Implement desired remote interface
– Constructor must throw RemoteException
• Registry:
– Must run on every server
– maps String to remote object reference (String: //computerName:port/objectname)
– Accessed through java.rmi.Naming
– Server methods (static methods of java.rmi.Naming):
∗ public static void bind(String name, Remote obj)
∗ public static void unbind(String name)
∗ public static void rebind(String name, Remote obj)
– Client methods:
∗ public static Remote lookup(String name)
∗ public static String[] list(String name) (returns all names bound in the registry that name points to)
• Client recipe:
– Create and install RMISecurityManager
– Lookup remote object(specify host, port number, textual representation of server object)
– Perform remote invocations, catching remote exceptions
• Server recipe:
– Remote interface (see above)
– Remote Server Object (see above)
– Remote Server Program
∗ Create and install RMISecurityManager
∗ Create server object
∗ Register at least one server object
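The client and server recipes above can be sketched in a single process as follows. The names Greeter, GreeterImpl and RmiDemo are hypothetical, and the sketch makes several assumptions: it talks to the registry through java.rmi.registry.LocateRegistry rather than the URL-based Naming class, it picks a free port with a throwaway ServerSocket, it pins the RMI host name to 127.0.0.1, and it skips installing an RMISecurityManager (that class is deprecated since Java 8). The interface is package-private only so that the sketch fits in one file; the recipe above asks for public visibility.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.rmi.NotBoundException;
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Remote interface: extends Remote, every method throws RemoteException.
interface Greeter extends Remote {
    String greet(String name) throws RemoteException;
}

// Remote object: extends UnicastRemoteObject, constructor throws RemoteException.
class GreeterImpl extends UnicastRemoteObject implements Greeter {
    private static final long serialVersionUID = 1L;

    GreeterImpl() throws RemoteException {
        super(); // exports the object on an anonymous port
    }

    @Override
    public String greet(String name) throws RemoteException {
        return "Hello, " + name;
    }
}

class RmiDemo {
    static String callGreeter(String name) {
        Registry registry = null;
        GreeterImpl server = null;
        try {
            System.setProperty("java.rmi.server.hostname", "127.0.0.1"); // assumption: loopback only
            int port;
            try (ServerSocket probe = new ServerSocket(0)) {
                port = probe.getLocalPort(); // assumption: port stays free after the probe closes
            }
            // Server program: start a registry, create the server object, register it.
            registry = LocateRegistry.createRegistry(port);
            server = new GreeterImpl();
            registry.rebind("greeter", server);

            // Client: look up the remote object by name and invoke it through the stub.
            Registry clientView = LocateRegistry.getRegistry("127.0.0.1", port);
            Greeter stub = (Greeter) clientView.lookup("greeter");
            return stub.greet(name);
        } catch (IOException | NotBoundException e) { // RemoteException is an IOException
            throw new IllegalStateException(e);
        } finally {
            try { // unexport so that the RMI threads do not keep the JVM alive
                if (server != null) UnicastRemoteObject.unexportObject(server, true);
                if (registry != null) UnicastRemoteObject.unexportObject(registry, true);
            } catch (RemoteException ignored) {
                // already unexported
            }
        }
    }
}
```

In a real deployment the server program and the client run in separate virtual machines, and the client only needs the Greeter interface on its class path.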
2.6 Corba RMI
Situating corba
• Mature technology, often used in industry because it supports a lot of middleware
services
• Basic idea: Need for middleware that allows applications to communicate irrespective of their
programming language and that offers all required services
• Uses Object Request Broker (ORB), software component which helps a client to invoke a
method on an object
Framework components
• IDL (Interface Definition Language): for defining the interfaces of Corba objects
• Corba Architecture: similar to RMI, main components: ORB Core, Object Adapter, Skeletons, Client stubs/proxies, Implementation repository, Interface Repository, Dynamic Invocation Interface, Dynamic Skeleton Interface
• GIOP = General Inter-Orb protocol, request-reply protocol between ORBs, using Common
Data Representation (CDR) for marshalling
• Object references are referred to as IORs (Interoperable Object References)
• Corba services: Naming, Event, Notification, Security, Transaction, Concurrency, Trading
Corba architecture:
• ORB Core: comparable functionality to Communication Module, provides application interface to start/stop ORB, convert between remote objects and strings, provide argument lists
for requests using dynamic invocation
• Object adapter: Remote Reference and Dispatcher module, creates remote object references
for Corba objects + dispatches each invocation to appropriate servant
• Servant: object implementing the business logic (= service procedure in RPC)
• Skeletons: generated by IDL compiler, makes sure remote method invocations are dispatched
via appropriate skeleton to particular servant, unmarshals the arguments and marshals exceptions and results
• Client stubs/proxies: marshals arguments, unmarshals exceptions and results, generated by
IDL
Difference between Java and Corba
• Corba objects are not necessarily objects, but components that:
– Implement an IDL interface
– Have a remote object reference
– Are able to respond to invocations of methods on their IDL interface
• No class objects in Corba: data structures of various types and complexity can be passed as arguments, but objects are passed only by reference, not by value
Figure 2.5: Corba Remote method invocation architecture
• Pseudo objects:
– Cannot be passed as arguments
– Implement IDL interface
– Implemented as libraries
– Orb interface = example, allows:
∗ init method (initialize ORB)
∗ connect method: register Corba object with ORB
∗ shutdown method: stop Corba objects
∗ conversion between remote object references and strings
Implementation
• Specification of an IDL interface for each object involved
• Compilation of IDL interfaces, generates stub code, skeleton code, header files, interface files
• Implementation of servant class by either
– Extending corresponding skeleton class (BOA, Basic Object Adapter, approach): servant class implements interface methods and uses exactly the method signatures defined
in equivalent interface
– Using Portable Object Adapter (POA): allows single servant to support multiple object
identities simultaneously + construct object implementations that are portable between
different ORB products
• Implementation of server program: containing main method, depending on servant class
choice:
– BOA: ORB needs to be created + initialized, instance of Servant class is created +
registered to ORB (connect method) and waits for incoming client requests
– POA: rootPOA object is created first, next POAManager is created and activated and
servants are also activated, object references are created from the POA
• Client program: contains a main method that creates and initializes the ORB, invokes the narrow method to cast an Object to the particular required type, invokes the remote methods and catches Corba system exceptions
• Running:
– start orbd
– start Echo server
– Run client application
• Example page 53-55
2.7 Middleware services
• Naming service
– The remote object references are registered and bound to a textual name, sometimes
also called binding service
– Registration done by server process
– Clients perform lookup
– Solves: object references have no logical meaning, and after a server crash all clients would otherwise need to be informed of the changed reference, etc.
– Different services:
∗ Java RMI: registry (already covered)
∗ JNDI (Java Naming and Directory Interface): distributed version of the RMI registry, with methods for performing standard directory operations
∗ Corba Naming Service: bind, rebind, resolve (for clients to look up the remote object
references by name, e.g. Object=resolve(path);), all belonging to NamingContext
interface.
∗ (CORBA) In server (class with main method) after instance creation of the servant,
a reference to the Naming Service has to be obtained and the server has to be
registered to the Naming Service.
∗ (CORBA) The client does the same, but also invokes a method on this reference (resolve, to look up the object); when running, the host name and port of the naming server should also be provided
– Corba Trading service:
∗ comparable to Naming Service,
∗ allows objects to be located by attribute: directory service
∗ Database: service types and attributes → remote object references
∗ Clients specify: constraints on values of attributes and preferences for the order in which to receive matching offers
• Event and Notification Service
– Publish-Subscribe paradigm
∗ Objects that generate events publish them
∗ Objects which want to receive events subscribe/register to the types
∗ Sometimes events are called notifications
∗ Event has: name/identifier of generating object, operation generating the event, parameters of the event, timestamp of generation, sequence number for ordering
– Architecture for Distributed Event Notification
∗ Figure 2.6 shows 3 options, third one is most flexible = observer pattern
∗ Observer decouples object of interest from its subscribers
∗ Tasks: forwarding of events to subscribers, filtering of events based on specified
criteria, detection of patterns in the events, notification mailbox functionality (store
events until they can be delivered)
– Corba Event Service
∗ Based on publish-subscribe
∗ Defines IDL interface for suppliers (objects of interest) and consumers (subscribers)
∗ Methods: push (invoked by suppliers on PushConsumer interface, consumers register their object references with suppliers), pull (invoked by consumers on PullSupplier interface, in this model suppliers register their object references with consumers)
∗ An event channel always pushes to consumers and pulls from suppliers; conversely, consumers pull from it and suppliers push to it. The passive part is always the client, the active part is the server
– Corba Notification Service
∗ Is advanced Event Service
∗ Besides event service, also allow use of filters
∗ Notifications have a datatype
∗ Event Consumers can use the filters to specify the events they are interested in
∗ Proxies forward the notifications to consumers according to constraints in filters
∗ Event Suppliers can discover events the consumers are interested in and consumers can discover the event types
∗ Event channel also is configurable for: reliability, priority of events, required ordering, policy for discarding stored events
– Java RMI provides no event service, programmers have to write callback methods to implement the event service functionality
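The observer tasks listed above (forwarding, filtering, mailbox storage) can be illustrated with a minimal in-memory sketch. It is not the Corba Event or Notification Service API; the class `EventChannel` and its method names are hypothetical and only mirror the publish-subscribe roles from the notes.

```python
from collections import defaultdict, deque


class EventChannel:
    """Hypothetical observer sitting between objects of interest and subscribers."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # event type -> [(subscriber, filter)]
        self.mailboxes = defaultdict(deque)   # subscriber -> events not yet delivered
        self.sequence = 0                     # sequence number used for ordering

    def subscribe(self, name, event_type, event_filter=lambda event: True):
        self.subscribers[event_type].append((name, event_filter))

    def publish(self, source, event_type, **parameters):
        self.sequence += 1
        event = {"source": source, "type": event_type,
                 "parameters": parameters, "seq": self.sequence}
        # Forward to every subscriber whose filter accepts the event.
        for name, event_filter in self.subscribers[event_type]:
            if event_filter(event):
                self.mailboxes[name].append(event)

    def deliver(self, name):
        """Mailbox functionality: hand over and clear the stored events."""
        events = list(self.mailboxes[name])
        self.mailboxes[name].clear()
        return events


channel = EventChannel()
channel.subscribe("alice", "price")
channel.subscribe("bob", "price", lambda e: e["parameters"]["value"] > 100)
channel.publish("stock-1", "price", value=90)
channel.publish("stock-1", "price", value=120)
```

Here "alice" receives both events, while the filter of "bob" only lets the second one through.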
• Messaging Service
• Persistence Service
– Allows for automatically storing and retrieving the state of objects
– No manual SQL statements needed
– Next chapter
• Transaction Service
– Transaction = set of operations that moves data from one consistent state to another; if one operation fails, the entire set is undone (rollback), and on success a commit is done
– Corba uses the Corba Object Transaction Service (OTS)
– Page 63 for figure
• Activation Service
– Makes it easy for the distributed system administrator to have an object started the moment a client request is sent
– Also for resource efficiency reasons, only activated when needed
– responsible for:
∗ Activating objects when requests arrive
∗ Passivating objects when they are idle for some time
– Java RMI Activatable Server Objects
∗ Extend from java.rmi.activation.Activatable class
∗ ActivationID and MarshalledObject in constructor as arguments
∗ In main they are registered, stored in a table of passive objects and location
∗ server processes started on demand by Activator (one per server computer)
Figure 2.6: Illustration of the event observer design pattern
– Corba Implementation Repository
∗ Responsible for: activating registered servers on demand, locating servers that are
currently running
∗ Corba Object Adapter name is used for registration and activation
∗ Is a database containing for each registered Object Adapter Name, the pathname
(of binary with object implementation) and host name + port number of the computer
• Loadbalancing Service
– Automatic distribution of load between different servers
– Often broker is used: same interface as servers, forwards load to ”most appropriate”
server
– Often in combination with Naming service
• Dynamic Invocation Service
– Allowing the invocation of objects, whose interface is not known at compile time, but
discovered at run time
– JAVA: Reflection API
– CORBA: Interface Repository (page 64)
• Security Service
• Session Tracking
Figure 2.7: An example of chaining two Corba event channels
Chapter 3
Enterprise Applications
An enterprise application consists of three layers:
• Database layer
• Business layer
• Presentation layer
3.1 Overview of Java Enterprise Edition (JEE) architecture
Three layers in JEE:
• Business layer: runs in an EJB container (hosting Enterprise Java Beans)
• Presentation layer: run in a web container (hosting servlets and Java Server Pages (JSP)
components)
Features provided by JEE (non-functional, do not influence the functionality of an application,
but ease the effort for the programmer + improve performance):
• Component life cycle management: starting components only when needed and shutting
them down otherwise
• Transaction processing: a set of operations are executed within the context of a transaction
• Security Handling: specific methods calls can only be executed when the caller has the right
permission
• Persistence: Automatically storing and retrieving data from the database, without the need
of writing queries
• Remotability: Components can be accessed from remote locations (no need of writing transfer method calls)
• Timer: for scheduling operations
• State management: maintaining values of vars when components are not active
• Resource pooling: sharing resources over multiple requests, make sure the resources (Memory
and CPU) are used efficiently
• Messaging: providing the means to publish data and allowing clients to subscribe to data they are interested in
Figure 3.1: Schematic overview of JEE applications
Figure 3.1 shows the schematic overview:
• Entity = POJO (Plain Old Java Object) (thus not a bean (EJB))
• Run within container of persistence provider
• Remote data is made accessible by an RDBMS (Relational Database Management System)
• Clients connect through a browser or a standalone application
Benefits:
1. Simplifies the development of large, distributed applications
• EJB container provides system-level services to enterprise beans
• Bean developer can concentrate on solving business problems
• EJB container (not bean developer) is responsible for system-level services, such as
transaction management and security authorization
2. Client developer can focus on the presentation of the client:
• Beans contain the application’s business logic
• Therefore the clients are thinner (important for clients that run on small devices)
3. Enterprise beans are portable/reusable components
• Application assembler can build new applications from existing beans
• These can run on any compliant JEE server
4. JEE is a Web centric architecture, targeted clients are thin clients
Chapter 4
Global State and Time
Due to the distributed character of the system, there is no notion of global state. This is caused by the multiple processors, which are connected by communication links with non-negligible delay, and by the inherent impossibility of distinguishing between lost messages and crashed or faulty processes. Two flavours of time:
• Physical time: real time, related to solar time, important when connection to ”real” time is
needed
• Logical time: only important for event ordering, easier to realize (no specific hardware
needed)
4.1 Physical clock synchronization
4.1.1 Hardware clocks
• Quartz crystal that oscillates at a relatively stable frequency
• Each oscillation decrements a register (counter); when it reaches zero, an interrupt generates a clock tick that increments a variable in memory
• H = interrupt rate, e.g. 60/s
• Clock resolution = the smallest measurable time interval using one specific clock
• From the incremented variable (which is the number of ms) the software clock C(t) is calculated, where t is the real time (objective: C(t) = t)
Figure 4.1: A schematic overview of a hardware clock
• Computation from H: C(t) = αHt + β, β = offset value
• Changing clock values is dangerous but a software clock S(t) can be defined: S(t) = AC(t)+
B, choosing A and B such that S(t) is continuous and represents the correct time after some
delay
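As a worked illustration of choosing A and B, the sketch below assumes that S currently equals the hardware-derived reading C, that the clock is off by a known amount `error`, and that S should be correct after C has advanced by `catch_up`. The function names and this particular parametrization are assumptions made for the example, not taken from the course text.

```python
def slew_parameters(c_now, error, catch_up):
    """Return (A, B) for S(t) = A*C(t) + B.

    c_now    : current reading C(t) (S equals C at this moment)
    error    : correct time minus C(t), assumed constant during the catch-up
    catch_up : amount of clock time after which S must show the correct time
    """
    a = 1 + error / catch_up
    b = -c_now * error / catch_up
    return a, b


def software_clock(c, a, b):
    """Evaluate S for a hardware-derived reading c."""
    return a * c + b


a, b = slew_parameters(c_now=1000.0, error=2.0, catch_up=100.0)
start = software_clock(1000.0, a, b)   # continuity: S still equals C here
later = software_clock(1100.0, a, b)   # after the catch-up S equals C + error
```

Because S keeps the same value at the moment of adjustment and only runs slightly faster (or slower for a negative error), no jump in the clock reading occurs.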
4.1.2 Skew and drift
• Quartz crystals are not identical so oscillation frequencies differ resulting in certain error
∆H/H
• Drift rate ρ: time systems are specified by $1 - \rho \le \frac{dC(t)}{dt} \le 1 + \rho$; the drift rate indicates how fast a clock will get out of sync with the real time
• Clock skew = difference between two clock readings at a given moment in time: $\mathrm{skew}_{C_1,C_2}(t) = |C_1(t) - C_2(t)|$
• Faulty clock = not accurate or stops ticking
• Correctness definitions:
1. Correctness based on known drift upper bound
– Drift upper bound $\rho \ge 0$
– Clock drift rate $\rho_C$: $-\rho \le \rho_C \le \rho$
– $C(t) = (1 + \rho_C)t + \beta$
– Example: for a time interval $(t' - t) \ge 0$: $C(t') - C(t) = (1 + \rho_C)t' - (1 + \rho_C)t = (1 + \rho_C)(t' - t)$
– And thus (minimum when $\rho_C = -\rho$, maximum when $\rho_C = \rho$): $(1 - \rho)(t' - t) \le C(t') - C(t) \le (1 + \rho)(t' - t)$
– Thus a correct clock implies no jumps in clock readings: as $t' - t$ gets arbitrarily small, so must $C(t') - C(t)$
2. Correctness based on monotonicity
– Weaker than the second condition from 1.
– $t < t' \rightarrow C(t) < C(t')$
3. Hybrid correctness conditions
– Require monotonicity, while allowing forward jumps at synchronization points
4.1.3 Time standards
• History: astronomers job
• Now: atomic clock based on Cesium 133 atom
• Earth rotation slows down, so astronomic time slows down
• Time standard with: fixed duration seconds (unlike astronomical time) AND keeping somehow in sync with astronomical time (unlike atomic)
• UTC (Coordinated Universal Time): based on atomic time, but with leap seconds inserted whenever the difference with astronomical time reaches 800ms
4.1.4 Clock synchronization
Internal- External synchronization
• Set of executing processes Π
• External synchronization: all clocks $C_p(t)$ have a bounded skew $\Delta$ with respect to an external clock $C_{ext}(t)$: $|C_p(t) - C_{ext}(t)| < \Delta, \forall p \in \Pi$
• Internal synchronization: the clock skew between any two clocks in the system has an upper bound: $|C_p(t) - C_q(t)| < \Delta, \forall p, q \in \Pi$
• External synchronization means internal synchronization with bound 2∆ (page 90 for proof,
very simple)
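The referenced proof is not reproduced in these notes; presumably it is the usual triangle-inequality argument, which can be written out as follows for any two processes p and q that are both externally synchronized with bound $\Delta$:

```latex
\begin{align*}
|C_p(t) - C_q(t)| &= |(C_p(t) - C_{ext}(t)) + (C_{ext}(t) - C_q(t))| \\
                  &\le |C_p(t) - C_{ext}(t)| + |C_{ext}(t) - C_q(t)| \\
                  &< \Delta + \Delta = 2\Delta, \qquad \forall p, q \in \Pi
\end{align*}
```

The converse does not hold: clocks can stay close to each other while drifting away from the external reference together.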
Client-server algorithms
• One process (the clock server) is assumed to have a correct view of the real time and hence serves as the source for setting the time of the others (the clock clients), with the aim of minimizing the clock skew
• Synchronous system:
– Upper bound on transmission (maximum one-way) delay for communication between
p and q = δmax and lower bound δmin
– p sends a message m to q at time tp (clock Cp = tp ) containing the value tp
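– Setting $C_q$ based on the delay $\delta$ is hard because the delay is unknown at q, but it took at least $\delta_{min}$ and at most $\delta_{max}$, so on reception: $t_p + \delta_{min} \le C_p \le t_p + \delta_{max}$
– $E[C_q] = t_p + \frac{\delta_{min} + \delta_{max}}{2}$ (if the delay is uniform)
– Maximal clock skew between p and q: $\max \mathrm{skew}_{p,q} = \max |C_p - C_q| = \frac{\delta_{max} - \delta_{min}}{2}$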
• Asynchronous system: Cristian’s algorithm
– Asynchronous means no upper bound but assume lower bound δmin
∗ q requests via message r the time from p (which is accurate), q timestamps Tr
∗ p replies by sending its current clock reading tp to q in message a
∗ q receives message a and timestamps with local clock Ta
– Total interaction time = $T_a - T_r$: the time to send r from q to p plus a from p to q
– Upper bound on the delay of a: $\delta \le T_a - T_r - \delta_{min}$
– So: $C_q = t_p + \frac{T_a - T_r}{2}$
– $\max \mathrm{skew}_{p,q} = \frac{T_a - T_r}{2} - \delta_{min}$
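The two client-server estimates above can be put side by side in a short sketch. The function and parameter names are chosen for this example only; the formulas are the ones from the bullets (midpoint of the delay bounds in the synchronous case, half the round trip in Cristian's algorithm).

```python
def synchronous_estimate(t_p, delta_min, delta_max):
    """Synchronous system: known lower and upper bound on the one-way delay."""
    estimate = t_p + (delta_min + delta_max) / 2
    max_skew = (delta_max - delta_min) / 2
    return estimate, max_skew


def cristian_estimate(t_r, t_a, t_p, delta_min=0.0):
    """Cristian's algorithm: only a lower bound on the one-way delay is assumed.

    t_r : local timestamp at q when request r was sent
    t_a : local timestamp at q when answer a was received
    t_p : clock reading of server p carried in the answer
    """
    round_trip = t_a - t_r
    estimate = t_p + round_trip / 2
    max_skew = round_trip / 2 - delta_min
    return estimate, max_skew


sync_result = synchronous_estimate(t_p=20.0, delta_min=0.25, delta_max=0.75)
cristian_result = cristian_estimate(t_r=10.0, t_a=10.5, t_p=20.0, delta_min=0.125)
```

A shorter measured round trip directly tightens the skew bound, which is why a client may repeat the request and keep the fastest exchange.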
Peering algorithms
• Can be used in client-server fashion but also peer-to-peer
• Network time protocol (NTP) (figure 4.2)
– Sequence:
∗ q requests via message r the time from p (which is accurate), q timestamps $T_{i-3}$
∗ p receives message r, timestamps this with $T_{i-2}$
∗ p replies by sending its current clock reading $T_{i-1}$ to q in message a
∗ q receives message a and timestamps with local clock $T_i$
– Equations page 93+94
– This results in $\langle \tilde{o}, \delta \rangle$ pairs (estimated offset and delay), the pair with the lowest $\delta$ being the most accurate version
Figure 4.2: Clock synchronization in the network time protocol
– Stratum = a level in the hierarchy of time servers of NTP
– Servers lower in the hierarchy (higher stratum number) get their time from servers higher up (lower stratum number); root servers (stratum 1) get it from UTC
– Passive server, waits until a request comes
– Modes of operation:
∗ Multicast: Time is multicasted to different clients, only useful with low delay (LAN
environment)
∗ Procedure call: time stamps are sent from server to one client, using the same approach as Cristian's method
∗ Symmetric mode: Mentioned algorithm, eight pairs are computed between two
servers after which the optimal adjustment is decided based on a minimal δ-value
– Accuracies: 1-50ms
• Berkeley algorithm
– Server takes an active role, by polling computers for which the time should be synchronized
– Master coordinating server is elected from a set of peering time servers which are called
the slaves
– Master polls the slaves, uses Cristian's method to estimate their actual times (based on round trip delay) and computes an average value of current time
– Time error minimization by:
∗ Setting a max on two-way delay between master and slave (which causes the inaccurate time values), time values with delays exceeding this max are discarded
∗ Instead of new time value, the correction is sent (removes inaccuracy caused by the
delay of message exchange)
∗ Outlying time values are discarded: slave times exceeding a certain fault-tolerance level are discarded
∗ If the master fails, a new master is selected
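The two peering algorithms above can be sketched as follows. The notes only point to the textbook for the NTP equations; the offset and delay formulas used here are the standard NTP ones, written with the four timestamps from the sequence above. The Berkeley part makes two simplifying assumptions that are not in the notes: outliers are judged against the master's own time, and slaves whose round trip was too long receive no correction at all. All function names are hypothetical.

```python
def ntp_pair(t_i3, t_i2, t_i1, t_i):
    """Return (offset of p relative to q, round-trip delay) for one exchange."""
    offset = ((t_i2 - t_i3) + (t_i1 - t_i)) / 2
    delay = (t_i - t_i3) - (t_i1 - t_i2)
    return offset, delay


def best_pair(pairs):
    """Pick the pair with the lowest delay, considered the most accurate."""
    return min(pairs, key=lambda pair: pair[1])


def berkeley_corrections(master_time, slave_readings, max_round_trip, tolerance):
    """slave_readings maps slave name -> (estimated slave time, round trip).

    Returns (correction for the master, {slave name: correction}).
    """
    # Discard readings whose round trip makes the estimate too inaccurate.
    reachable = {name: est for name, (est, rtt) in slave_readings.items()
                 if rtt <= max_round_trip}
    # Outlying clocks do not take part in the average.
    trusted = [est for est in reachable.values()
               if abs(est - master_time) <= tolerance]
    times = [master_time] + trusted
    average = sum(times) / len(times)
    # Corrections (not absolute times) are sent back.
    return average - master_time, {name: average - est
                                   for name, est in reachable.items()}


# q is 5 time units behind p, one-way delay 1, processing time at p 0.5.
pair = ntp_pair(10.0, 16.0, 16.5, 12.5)
master_fix, slave_fixes = berkeley_corrections(
    100.0,
    {"a": (102.0, 0.01), "b": (98.0, 0.01), "c": (150.0, 0.01), "d": (101.0, 5.0)},
    max_round_trip=1.0, tolerance=10.0)
```

In the Berkeley example slave "c" is an outlier: it does not influence the average but still gets a correction, while "d" is skipped because its round trip exceeds the maximum.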
4.2 Logical clocks
4.2.1 Events and temporal ordering
• Actual time is not needed, only to establish order of events, logical clocks = monotonically
increasing integer counters
• Event is either:
– A change of local process state (change of variable e.g.)
– Sending a message to a process
– Receiving a message from a process
• Local ordering is trivial
• Global ordering (across processes) requires a global version of the ordering; this happened-before relation (→) obeys:
– If two events belong to the same process, the global version is identical to the local
version
– If s represents sending a message and r represents the event of receiving the same
message (typically in different processes), then s → r
– The relation is transitive
• Not always possible to find a chain of events interconnecting an arbitrary couple of events
• No causality: e → f does not imply that e is the cause of f, it only indicates the possibility; it just expresses that e happened before f. Conversely, we KNOW that f is not the cause of e. This is why the relation is also called the causal ordering relation
4.2.2 Lamport clock
• Using same relations as above
• A scalar value in each process Li
• Each time a new event gets timestamped the counter is incremented
• Exchanging messages:
– p = sending message, q = receiving message
– Sending side there is no problem, send event gets timestamp on local software clock Lp
– Receiving side (e = event, s = send):
∗ Lq [r] > Lq [e]
∗ Lq [r] > Lp [s]
∗ Natural choice = max of both = Lq [r] =max(Lp [s], Lq [e]) + 1
• Implementing Lamport clock is:
1. Initializing all Lp to 0
2. Incrementing Lp just before each event is handled in process p
3. When message is sent from p, the send event is time stamped using 2. , and this time
stamp Lp [s] is sent along with the message
4. When a message is received at process q:
– New local clock is computed: Lq =max(Lp [s], Lq ) + 1
– Receive event is time stamped using this clock value
• Lamport clocks do not guarantee potential causality: if L[e] < L[f], the clock indicates there might be a chain of events interconnecting the two, but it does not guarantee that such a chain exists = serious shortcoming
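The four implementation steps above can be sketched as a small class. The class and method names are illustrative only; message transport is left out, so the timestamp is simply passed from one object to the other.

```python
class LamportClock:
    """Scalar logical clock of one process."""

    def __init__(self):
        self.time = 0                     # step 1: initialize to 0

    def local_event(self):
        self.time += 1                    # step 2: increment before each event
        return self.time

    def send(self):
        # Step 3: the send event is time stamped and the stamp travels along.
        return self.local_event()

    def receive(self, sent_timestamp):
        # Step 4: new clock = max(Lp[s], Lq) + 1, used to stamp the receive event.
        self.time = max(sent_timestamp, self.time) + 1
        return self.time


p, q = LamportClock(), LamportClock()
p.local_event()            # event at p
s = p.send()               # send event at p
q.local_event()            # unrelated event at q
r = q.receive(s)           # receive event at q
```

The receive event always gets a larger timestamp than the matching send event, which is exactly the s → r rule of the happened-before relation.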
Figure 4.3: Lamport clock: sending a message
Figure 4.4: Vector clocks in presence of interprocess communication
4.2.3 Vector clock
• a Lamport clock only counts the total number of events a process has seen, regardless of the
origin of these events
• A vector clock accumulates the number of events a process has seen, on a process basis
• Each element in the vector represents the total number of events seen from another process
• Vector: Vp , N elements (N = # processes)
• Implementation
1. Initialize Vp (all elements 0)
2. Increment Vp [p] just before an event is time stamped in process p
3. When p sends a message, p sends its complete vector clock along
4. When process q receives a message:
– Vq [i]=max(Vq [i], Vp [i]) for every element i
– Increment Vq according to 2.
– Timestamp receive event with Vq
• Drawbacks:
– Amount of data exchanged grows linearly with number of processes
– In case the number of processes is not constant, a more complex scheme is needed,
where the vector size must be adapted dynamically
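A minimal sketch of the implementation steps above, for a fixed number of processes. The names are illustrative; the comparison function `happened_before` is the standard element-wise test for vector timestamps, which the notes do not spell out.

```python
class VectorClock:
    """Vector clock of process `process` in a system of n processes."""

    def __init__(self, process, n):
        self.process = process
        self.v = [0] * n                              # step 1

    def tick(self):
        self.v[self.process] += 1                     # step 2
        return tuple(self.v)

    def send(self):
        return self.tick()                            # step 3: whole vector is sent

    def receive(self, sent):
        # Step 4: element-wise maximum, then increment the own entry.
        self.v = [max(a, b) for a, b in zip(self.v, sent)]
        return self.tick()


def happened_before(a, b):
    """True if timestamp a is element-wise <= b and differs from b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


p0, p1, p2 = VectorClock(0, 3), VectorClock(1, 3), VectorClock(2, 3)
s = p0.send()          # send event at p0
r = p1.receive(s)      # receive event at p1
e = p2.tick()          # independent event at p2
```

Unlike Lamport timestamps, the two timestamps `r` and `e` are incomparable, which reveals that the events are concurrent.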
Figure 4.5: Response time schematic
4.3 Performance metrics
4.3.1 Response time
= As a function of the number of simultaneous requests
• Submission time (+marshalling)
• Forward transmission time (together with the backward transmission time): the time required for sending all packets or frames to the destination host; the packet/frame delay is the sum of the delays on each subnetwork link traversed by the packet/frame
• Time spent in queue before a serving thread can be created
• Dispatch request + create thread from thread pool + demarshalling (proc,invoc)
• Server processing time
• return processing time (+marshalling) (proc,ret)
• Backward transmission time
• Reception time (+demarshalling)
Link delay (between nodes):
• Processing delay
• Queueing delay
• Transmission delay: delay between time that the first and last bits of the packet/frame are
transmitted
• Propagation delay: delay between the time the last bit is transmitted at the head node of
the link and the time the last bit is received at the tail node
4.3.2 Throughput
= Max number of simultaneous requests per second
Chapter 5
Coordination
• No global time so clocks from previous chapter are used
• 2 Problems:
– Failure detection
– Distributed mutual exclusion
5.1 Failure detection
5.1.1 Problem statement
• Goal: decide for a process p whether it is still alive or not
• Unreliable failure detector, only hints at the status:
– Suspected: process p is probably crashed
– Unsuspected: process p is probably alive
• Reliable:
– Suspected: process p is definitely crashed
– Unsuspected : process p is probably still alive
5.1.2 Failure detection algorithm
• Most based on active polling: server periodically checks if the process is still alive by sending
a message (or heartbeat signals)
• Heartbeat:
– Every process p periodically sends a message to the process q that wishes to ascertain its liveness
– periodicity T , synchronous system, message delivery time delay δ : 0 <= δ <= ∆
– Server measures interval ∆m between messages received from same process
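– o = clock offset between p and q; $T_1^p, T_2^p$ = send times of consecutive messages; $T_1^q, T_2^q$ = the corresponding receive timestamps: $T_1^q = T_1^p + \delta_1 + o$ (5.1)
– Assume periodic sending, $T_2^p = T_1^p + T$: then $\Delta_m = T_2^q - T_1^q = T + \delta_2 - \delta_1$, and because of the bound $0 \le \delta \le \Delta$: $\Delta_m \le T + \Delta$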
• If q measures an interarrival time exceeding the above bound, process p has definitely crashed; otherwise we cannot be sure it is alive
• For a synchronous system this gives a reliable failure detector
• In an asynchronous system (no upper bound on communication delay is available), q can only assume that p is in trouble if ∆m becomes exceedingly large; as soon as q receives a message from a suspected p, p is considered unsuspected again and the threshold is applied again
• Problem is setting ∆, mainly: relation to network delay (e.g. 20%)
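The heartbeat scheme above can be sketched from the point of view of the monitoring process q. The class name is hypothetical, time is passed in explicitly instead of being read from a real clock, and treating a process as unsuspected before its first heartbeat is an assumption of this sketch.

```python
class HeartbeatDetector:
    """Monitors one process p that sends heartbeats with period T."""

    def __init__(self, period, max_delay):
        self.threshold = period + max_delay   # bound T + Delta on the interarrival time
        self.last_arrival = None

    def heartbeat(self, now):
        """Called when a heartbeat message from p arrives at local time `now`."""
        self.last_arrival = now

    def suspected(self, now):
        """True if the time since the last heartbeat exceeds the threshold."""
        if self.last_arrival is None:
            return False                      # nothing received yet (assumption)
        return now - self.last_arrival > self.threshold


detector = HeartbeatDetector(period=1.0, max_delay=0.5)
detector.heartbeat(10.0)
within_bound = detector.suspected(11.4)    # 1.4 <= 1.5
beyond_bound = detector.suspected(11.6)    # 1.6 > 1.5
detector.heartbeat(12.0)                   # a late heartbeat clears the suspicion
after_recovery = detector.suspected(12.1)
```

In a synchronous system `max_delay` is the real bound and a suspicion is definitive; in an asynchronous system it is only a tuning parameter and suspicions can be revoked, as in the last two lines.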
5.2 Distributed mutual exclusion
5.2.1 Problem statement
• Critical section and mutual exclusion (=mutex)
• N processes that:
– have no shared variables
– only communicate by message passing
– want to access a common resource in a critical section
– Can communicate with each other
• Failure model assumes:
– reliable channel between communicating processes
– Processes do not crash
– They eventually leave the critical section (well behaving)
• Algorithm:
– Safety requirement: only ONE process is active in critical section
– Liveness requirement: requests to leave and enter the critical section eventually succeed
– Fairness bonus: access to the critical section is granted using the ”happened-before”
ordering
• Primitives: enter() (request to enter, blocks until access is given), leave()
5.2.2 Evaluation metrics
• Bandwidth usage: measured as a number of messages needed to enter/leave the critical
section
• Client delay: how much time does it take for a client to enter/leave the critical section (if
free), expressed in one-way network delay δ
• Synchronization delay: how much time the algorithm wastes between subsequent accesses (one client leaving and another entering), expressed as a multiple of δ; it determines an upper bound on the number of clients that can be served per unit of time
5.2.3 Centralized approach
• One process takes control
• Messages:
– Request: client to server, to enter
– Grant: server to client, granted
– Leave: client to server
• Maintains:
– A token (var) indicating which process has access, = -1 if not taken by anyone
– Queue Q stores incoming messages
• Client: send request, wait for grant, when ready send leave
• Safety is met because there is a single token; liveness as well (processes were assumed to leave eventually, after which the oldest message is dequeued); fairness: physical clocks give an advantage to closer processes, so vector clocks or logical timestamps are used to order requests according to the happened-before relation
• Server: receive request, check if token is -1 (free), if so send grant, else put the request in
the queue Q, receive leave, dequeue oldest in Q, token becomes the sender of that dequeued
message and a grant is sent to that process
Performance:
• Bandwidth:
– Entering is 2 messages (Grant and Request)
– Leaving is 1 message
• Client delay: Unloaded system, wait for Grant so delay is 2δ, leaving is no delay
• Synchronization delay: Loaded system, when some process left the critical section, leave is
sent (delay: δ), server receives and send Grant to the corresponding process (same delay):
total synchronization delay: 2δ
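The server behaviour described above can be sketched as a small state machine. Networking is left out: each handler returns the messages the server would send. The class name and the return convention are choices made for this example.

```python
from collections import deque

FREE = -1   # token value when nobody is in the critical section


class MutexServer:
    """Central coordinator handling Request and Leave messages."""

    def __init__(self):
        self.token = FREE
        self.queue = deque()          # pending requests, oldest first

    def on_request(self, client):
        """Return the list of (message, destination) pairs to send."""
        if self.token == FREE:
            self.token = client
            return [("Grant", client)]
        self.queue.append(client)     # section is taken: remember the request
        return []

    def on_leave(self, client):
        if client != self.token:
            raise ValueError("Leave from a process that does not hold the token")
        if self.queue:
            self.token = self.queue.popleft()
            return [("Grant", self.token)]
        self.token = FREE
        return []


server = MutexServer()
log = [server.on_request(1), server.on_request(2), server.on_request(3),
       server.on_leave(1), server.on_leave(2), server.on_leave(3)]
```

Entering costs a Request plus a Grant and leaving costs one Leave message, matching the bandwidth figures above. This sketch serves requests in arrival order; ordering the queue by logical timestamp would be needed for the fairness property.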
5.2.4 Ring approach
• Processes are organized in a logical ring and communicate with only one neighbour, so process $p_i$ communicates with $p_{(i+1) \bmod N}$
• Token granting access now travels along the ring, and only one message: Token
• Every process has the same behaviour
• A process can only send the Token after receiving it, which implies the algorithm is safe
• Liveness is also met because the token keeps travelling along the ring
• Fairness not guaranteed, not based on a happened-before ordering of the access requests, if
a process is close to the process having the token it has more chance of getting the critical
section
• Algorithm: when a process receives the token and wants access, it executes its critical section and sends the token on when done; otherwise it immediately sends the token to its neighbour
Performance:
• Bandwidth: consumes bandwidth even in absence of using the critical section
• Client delay: 2 extreme cases
– Best: Token arrives just when the process requests access, no delay
– Worst: the process requests access just after forwarding the token, so the delay is Nδ
– Average: N δ/2
• Synchronization delay: this is subject to the positioning of the leaving and requesting process
– Best case: neighbours, delay δ
– Worst: when the leaving and requesting process are the same, Nδ
– Average: (N + 1)δ/2
5.2.5 Multicast-based algorithm: Ricart-Agrawala algorithm
(figure 5.1 example)
• Multicast to limit number of messages
• Each process p:
– logical clock (fairness)
– maintains a message queue Q to store pending requests
– State variable: Released: doesn’t want access, Wanted: send request, Held: in
critical section
• Messages:
– Request(pi , Ti ) (p process, T logical clock value at p)
– Reply: Can only enter critical section when it has received a confirmation from all
other processes
∗ Case 1: only p is requesting then all other are in Released state and will immediately reply to p, p will collect N-1 Reply messages and enter
∗ Case 2: p and q want to enter, others are in Released, so p and q will both receive N-2 Reply messages. Both p and q compare their own request timestamp with the one in the request received from the other: the process with the lower timestamp gets the critical section, and the other one sends it a Reply message. If p wins, p has an enqueued request from q and will send q a Reply after leaving the critical section (page 113 pseudo figure)
procedure INITIALIZE:
    state = Released
procedure ENTER:
    state = Wanted
    multicast Request(p_i, T_i) to all (including itself)
    T = T_i
    wait until N−1 Reply received
    state = Held
procedure LEAVE:
    state = Released
    send Reply to all pending Requests in Q
procedure RECEIVEREQUEST(p_j, T_j):
    if state == Held or (state == Wanted and (T, p_i) < (T_j, p_j)) then:
        enqueue Request in Q
    else
        send Reply to p_j
    end if
Figure 5.1: Ricart-Agrawala algorithm at work to realize distributed mutual exclusion.
• Safety: suppose two processes are both active in the critical section. Both received N-1 replies, which implies each granted the other access to the critical section. This would mean that they did not hold the section or, if both were in the Wanted state, that each logical timestamp is higher than the other, which is nonsense (page 114 logical proof)
• Liveness: each Request is either replied to immediately or stored in the queue and replied to after leaving
• Fairness is guaranteed because requests are ordered by logical clock, which respects the happened-before relation
Performance:
• Bandwidth: entering means sending a Request to each process (N messages) and waiting for the Replies (N-1), so total = 2N-1; with multicast support: N+1; leaving is implicit, no extra messages
• Client delay: in an unloaded system the Request is immediately answered with a Reply, so the delay is 2δ
• Synchronization delay: when leaving, a process sends a Reply; as soon as this is received the waiting process enters the critical section, so the delay is δ
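The state handling of a Ricart-Agrawala process can be sketched as below. This is a simplified, single-threaded illustration: messages are delivered by direct method calls, a process sends its Request only to the other N-1 processes (rather than also to itself), and the class and method names are made up for the example.

```python
RELEASED, WANTED, HELD = "Released", "Wanted", "Held"


class RicartAgrawalaProcess:
    def __init__(self, pid, n):
        self.pid, self.n = pid, n
        self.state = RELEASED
        self.clock = 0               # Lamport clock
        self.request_time = None     # timestamp T of the own pending request
        self.replies = 0
        self.queue = []              # senders of deferred requests

    def enter(self):
        """Switch to Wanted and return the Request (pid, T) to multicast."""
        self.state = WANTED
        self.clock += 1
        self.request_time = self.clock
        self.replies = 0
        return self.pid, self.request_time

    def on_request(self, sender, timestamp):
        """Return True if a Reply is sent now, False if the Request is queued."""
        self.clock = max(self.clock, timestamp) + 1
        own_first = (self.request_time, self.pid) < (timestamp, sender) \
            if self.state == WANTED else False
        if self.state == HELD or own_first:
            self.queue.append(sender)
            return False
        return True

    def on_reply(self):
        self.replies += 1
        if self.state == WANTED and self.replies == self.n - 1:
            self.state = HELD

    def leave(self):
        """Switch to Released and return the processes that now get a Reply."""
        self.state = RELEASED
        pending, self.queue = self.queue, []
        return pending


processes = [RicartAgrawalaProcess(pid, 3) for pid in range(3)]
p0, p1, p2 = processes
req0, req1 = p0.enter(), p1.enter()          # both want access at the same time
for other in (p1, p2):                       # deliver the Request of p0
    if other.on_request(*req0):
        p0.on_reply()
for other in (p0, p2):                       # deliver the Request of p1
    if other.on_request(*req1):
        p1.on_reply()
holders_before_leave = [p.pid for p in processes if p.state == HELD]
for pid in p0.leave():                       # p0 answers the deferred Request
    processes[pid].on_reply()
holders_after_leave = [p.pid for p in processes if p.state == HELD]
```

Both requests carry timestamp 1, so the tie is broken by process id: p0 enters first and p1 follows as soon as p0 leaves.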
5.2.6 Multicast-based algorithm: Maekawa voting
(figure 5.2 example)
• Problem with previous algorithm (Ricart-Agrawala) is the need to contact ALL processes
(= scalability issues)
• Basic idea:
– Only small number of processes are responsible for guaranteeing mutual exclusive access
– If p wants access, contact only these, this set is dependent on the processes
– Each process has associated voting set, with the intersection between any two subsets
NOT empty (to ensure safety)
– For fairness: voting sets are equal in size and every process is included in the same number of sets; optimal: $K \approx \sqrt{N}$ to $2\sqrt{N}$ and M = K, where K = size of a voting set and M = number of voting sets (that each process is included in)
• Implementation:
– If $N = S^2$, arrange the processes in an S x S matrix
– If process k is at row i and column j, then the voting set for process k is the union of all processes of row i and column j
– The process itself is in its set, the intersection of any 2 voting sets is not empty, K=2S-1, M=K
– Safety: a process votes for only one process at a time to enter the section, so an extra boolean state variable (voted: true or false) is needed
– Differences with Ricart-Agrawala algorithm:
∗ Additional state per process
∗ multicast Request messages to voting set only (only expecting K replies to enter)
∗ Explicit notification of processes releasing the critical section needed (to let voting processes know they can now vote for another process)
procedure INITIALIZE:
    state = Released
    voted = False
procedure ENTER:
    state = Wanted
    multicast Request(p_i, T_i) to voting set V_i
    wait until K Reply received
    state = Held
procedure LEAVE:
    state = Released
    send Release to voting set V_i
procedure RECEIVEREQUEST(p_j, T_j):
    if state == Held or voted == True then:
        enqueue Request in Q
    else
        send Reply to p_j
        voted = True
    end if
procedure RECEIVERELEASE(p_j):
    if Q != empty then:
        dequeue pending Request m from Q
        send Reply to sender(m)
        voted = True
    else
        voted = False
    end if
Performance
• Bandwidth: same as Ricart-Algrawala with difference, requests are now size K and only K
replies are sent back, so 2K, Leaving message is needed to K are needed here, enter: 2K,
leave: K
41
Figure 5.2: Maekawa voting: example for N=4
• Client delay: same as Ricart-Agrawala, 2δ
• Synchronization delay: the Release message followed by the Reply results in a 2δ delay
5.2.7 Suzuki-Kasami Algorithm
Completely connected network of processes.
Algorithm overview
• Token passed between processes
• request(i,num) is sent to the other processes (i = process id, num = sequence number)
• Every process maintains an array: req[0, ..., N − 1] where req[j]=sequence number of latest
request from process j
• Process that holds the token can grant it
• Request can become outdated, update is necessary
• Passed with token:
– Array last[0, ..., N − 1], last[k] = sequence number of process k during last visit
– Queue Q with ids of process with pending requests
Algorithm Details
• When receive request(k,num), calculate req[k] = max(req[k], num)
• If receive token:
– Copies its own sequence number num to last[i]
– Keeps only processes k with (1 + last[k] = req[k]) in Q
– Visits Critical Section
– If Q non-empty, forward token to first process in Q
– Delete entry of selected process in Q
Performance:
• Enter: N-1 requests (all except self)
• Receive: 1 message with token
• N total
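The bookkeeping around req, last and Q can be sketched as follows. This follows the usual textbook formulation (the token holder updates last[i] and rebuilds the queue when it leaves the critical section), which is an assumption about ordering that the notes above do not spell out; the function names are illustrative.

```python
from collections import deque


def record_request(req, k, num):
    """On request(k, num): keep only the latest sequence number of process k."""
    req[k] = max(req[k], num)


def release_token(i, req, last, queue):
    """Called by token holder i after leaving the critical section.

    Returns the id of the next token holder, or None if nobody is waiting.
    """
    last[i] = req[i]  # request number req[i] of process i has now been served
    for k in range(len(req)):
        # k has an outstanding (non-outdated) request when req[k] == last[k] + 1
        if k != i and k not in queue and req[k] == last[k] + 1:
            queue.append(k)
    return queue.popleft() if queue else None


req, last, queue = [1, 0, 0], [0, 0, 0], deque()
record_request(req, 2, 1)          # process 2 asks for the token
record_request(req, 2, 0)          # an outdated request is ignored
print(release_token(0, req, last, queue))
```

Comparing req[k] with last[k] + 1 is what filters out outdated requests: a request that was already served has req[k] == last[k].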
5.2.8 Raymond’s Algorithm
Token passing for tree-based networks
Algorithm Overview
• Node that holds the token = root of tree, child nodes maintain pointer to root node
• Variables of node:
– Parent pointer (empty in root)
– Local queue Q to store pending requests
Algorithm Details:
• Request: sent via the parent; at each visited node the parent pointers are followed towards the root
• Only first request in Q is forwarded to parent
• Token moves:
– Root changes, swap parent variables between the processes
– Node with token becomes new root
Algorithm Performance:
• 1 token, so safety is guaranteed
• Deadlock impossible (only acyclic graphs): i can only wait for j if i is lower than j
• Fairness by queue Q
• Average distance between two nodes: O(log n) = number of required messages
5.2.9 Group Mutual Exclusion
Multiple processes want to be in the critical section simultaneously; such a group of processes is called a
forum
• M fora, (M<N)
• Centralized Approach:
– Every forum elects a leader
– Mutex algorithm between leaders
– When the leader enters it notifies the forum; if it leaves, the whole forum leaves
• Distributed Approach:
– Shared memory algorithm
– Every forum elects a leader
– The leader guides the processes into the critical section
– Once the leader leaves, the others are also denied access
5.3 Election
5.3.1 Problem statement
• N processes with no shared variables and message-passing communication
• ONE process has to play a special role (coordinator); every process should have the same coordinator,
and if the election fails a new election round is started
• Each process has a unique ID and the problem is to select the process with the highest ID
• Two state variables:
– Elected : refers to the elected process (or unknown)
– Participant: indicates whether a process is currently participating in an election
• A process detecting a failing coordinator can start the election procedure (so several instances
of this election can be running)
5.3.2 Evaluation metrics
• Safety: (required) at any time, for each participating process, elected == unknown OR elected
== P; this ensures that when the election completes, all live processes have elected
the same process P
• Liveness: (required) each process should participate and should eventually either set its elected variable to
a value different from unknown or crash
• Metrics:
– Bandwidth
– Turnaround time: time needed for the election (measured from the moment the election
process is started until all processes have agreed on a common coordinator)
5.3.3 Ring algorithm: Chang-Roberts
• Assumes a logical ring structure
• The process with the largest ID becomes coordinator; assumes reliable communication channels between live processes
(no assumptions on channel or processing delays, so it supports asynchronous systems)
• Messages:
– Election(i,ID), with i the rank of the process that initiated this instance of the
election algorithm and ID the current max ID
– Elected(i), with i the rank of the process that has been elected
• Algorithm is on page 123; situations when p_j receives Election(i,ID):
– ID > ID_j: p_i is a better candidate than p_j, so p_j forwards the message and, to
indicate it has done its job, sets participant to true
– ID <= ID_j and i ≠ j: so ID < ID_j and p_j is a better candidate; p_j sends its own candidacy, but if it has already
sent one there is no need to send again (and no need to risk a non-terminating situation)
– i == j: the election message has travelled through the full ring without being modified, so
p_j can conclude that ID_j is the max ID and announces that it is elected
Performance
• Safety: assume p_i and p_j were both elected; this would mean that both Elected(p_i) and Elected(p_j)
have been sent, so Election(i,ID_i) and Election(j,ID_j) were received by p_i and p_j
after travelling through the full ring unmodified, which is not possible because IDs are unique so one
MUST be larger than the other
• Liveness: assume no failures, so messages are allowed to circulate, the circulation will stop
and participant variable will guarantee no message can travel more than one round
• Bandwidth: 3 phases:
– Election-message phase, where ID grows: worst N-1 messages, best 0 (when the calling process
becomes the coordinator)
– Complete round with constant ID: N messages
– Elected round: N messages
– Average: (5N-1)/2 messages (between 2N in the best case and 3N-1 in the worst case)
• Turnaround time: analogous to bandwidth usage : (5N − 1)δ/2
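The message counts of the three phases can be reproduced with a small simulation. This is a simplified sketch that assumes a single initiator, unique IDs and no failures; chang_roberts is an illustrative name, not the algorithm listing from the course text.

```python
def chang_roberts(ids, initiator):
    """Simulate one election on a unidirectional ring.

    ids[k] is the ID of the process at ring position k.
    Returns (position of the leader, total number of messages).
    """
    n = len(ids)
    messages = 0
    origin, best = initiator, ids[initiator]  # current Election(origin, best)
    pos = initiator
    while True:
        pos = (pos + 1) % n
        messages += 1                 # one Election message per hop
        if pos == origin:
            break                     # came back unmodified: origin has max ID
        if ids[pos] > best:
            origin, best = pos, ids[pos]  # better candidate replaces the message
    messages += n                     # Elected(origin) travels the full ring
    return origin, messages


print(chang_roberts([3, 1, 2], 0))   # initiator already holds the largest ID
print(chang_roberts([1, 2, 3], 0))   # largest ID is N-1 hops away
```

With N = 3 the two calls give 2N = 6 and 3N-1 = 8 messages, the best and worst case listed above.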
5.3.4 Multicast algorithm: Garcia-Molina (or the bully algorithm)
• Allows for process crashes during execution
• Detected by time outs (assumption of synchronous system model)
• Assumes every process knows every other process, has set L(arger) with IDs exceeding p,
set S(maller) with IDs smaller than p
• Three messages:
– Election(i): with i the rank of the initiating process, used to announce an election
round
– Answer: used to reply to an election message
– Coordinator(i): i the rank of the elected coordinator process, is used to announce the
coordinator process to all processes
• Algorithm on page 127
– The process calling the election first checks whether it has the highest ID by checking if L is empty; if so, it
announces itself as coordinator
– Else it checks whether all processes in L have crashed by sending an Election message; if there is no answer
it assumes the coordinator role itself; if there is an answer, it can be sure a process with a higher ID is alive
– It then waits for a Coordinator message to arrive; if none arrives, it restarts the procedure. When a process
receives an Election message, it sends an Answer (to prevent time-outs)
Performance
• Safety: will be able to locate highest ID if no new processes arrive while electing, if so
two coordinators may be communicated, to avoid this a new appearing process will call the
election (as it has no coordinator) and will find out it has the highest ID and will act as a
boss (hence the Bully mechanism)
• Liveness: reliable message delivery, so suitable coordinator is found
• Bandwidth and Turnaround time (same) Depend on number of crashes and size of L
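The outcome of the bully algorithm can be sketched in a few lines. The sketch abstracts time-outs into a set of live processes and assumes no further crashes during the election; bully_election is an illustrative name.

```python
def bully_election(caller, ids, alive):
    """Return the ID of the process that ends up announcing Coordinator.

    ids: all known process IDs; alive: set of IDs that still answer.
    """
    larger = [q for q in ids if q > caller]          # set L of the caller
    responders = [q for q in larger if q in alive]   # crashed ones time out
    if not responders:
        return caller                                # nobody larger answered
    # Every responder sends Answer and calls the election itself; following
    # any one of them (here the smallest) leads to the same outcome.
    return bully_election(min(responders), ids, alive)


# p4 has crashed and p1 detects it.
print(bully_election(1, [1, 2, 3, 4], {1, 2, 3}))
```

The example uses the same situation as the figure below (p4 crashes, p1 calls the election).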
Figure 5.3: Garcia-Molina algorithm at work. The coordinating process p_4 crashes. This event is first
detected by p_1, which calls the election. Eventually p_3 takes the role of coordinator (process IDs
are here identical to the process rank, i.e. for process p_i, we have ID_i = i).
5.3.5 Franklin’s Algorithm
Variant of Chang-Roberts, based on bidirectional communication in a ring; nodes are either passive or active
Algorithm outline:
• Multiple rounds
• Actions:
– send token with process ID to both neighbours
– Examines the tokens from own neighbours, if one ID > process ID, become passive
(only forwards)
– When only one active (receive own token), become leader
Algorithm Performance:
• With N active processes, at least N/2 turn passive after one round
• After at most log2 N rounds only one active process is left; one additional round is needed to detect the leader
• So: at most 1 + log2 N rounds are needed
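The round structure can be illustrated with a short simulation. It is a sketch under the assumptions of unique IDs and no failures; passive nodes only forward, so each active node effectively compares its ID with its nearest active neighbours. The name franklin is illustrative.

```python
def franklin(ids):
    """ids: unique process IDs in ring order. Returns (leader ID, rounds)."""
    active = list(ids)                 # active processes, still in ring order
    rounds = 0
    while True:
        rounds += 1
        n = len(active)
        if n == 1:
            return active[0], rounds   # receives its own token: leader
        # A process stays active only if it beats both active neighbours.
        active = [
            active[i] for i in range(n)
            if active[i] > active[i - 1] and active[i] > active[(i + 1) % n]
        ]


print(franklin([3, 7, 1, 8, 2, 6, 4, 5]))
```

Only local maxima survive a round, which is why at least half of the active processes turn passive each time.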
5.3.6 Peterson’s Algorithm
Variant of Chang-Roberts, unidirectional communication in a ring
• Difference with Franklin: processes communicate with an alias (which can change from round to
round); the unique leader is the node with the largest alias ID
• Nodes: active, passive (=forward)
• Each node has variable alias(i) to keep track of its alias
Algorithm Outline
• Sends alias to successor
• Receives alias from predecessor in ring: alias(PD)
• If alias = alias(PD): node i is leader
• Else:
– Sends alias(PD) to successor
– Receives alias from the predecessor of its predecessor: alias(PD2)
– If alias(PD) > max(alias, alias(PD2)): alias = alias(PD), node i remains active and
moves to the next round
– Else: node i turns passive
Performance:
• Same reasoning as Franklin and same outcome
5.3.7 Dolev-Klawe-Rodeh Algorithm
• Same as Peterson
• Use of aliases
• A special leader message can be sent by the last active node, containing its alias (largest ID),
to inform the node with this ID that it is the leader (= hand-over)
5.3.8 Tree Election Algorithm
Tree topologies or acyclic graphs
• List with links to child nodes, link to parent
• Both can be empty
• Algorithm (each node):
– Collects IDs from children
– Computes max(IDs of children, own ID)
– Passes max value to parent node
– Receives leader ID back from parent
– Passes leader ID to its children
• Actions started from leaf nodes, to wake up: flooding algorithm (broadcasts info to neighbours, and so on)
• Local algorithm at process p:
– Wait until messages are received from all neighbours except one, which becomes parent
– Computes max (received IDs, own ID)
– Passes max to parent node
– If p receives a message back from its parent with max_parent, it calculates max'_p = max(max_p, max_parent)
– Passes max'_p to its neighbours except the parent
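The two passes of the tree election (maxima travel up, the leader ID travels down) can be sketched recursively. This is a centralized illustration of the message flow on a rooted tree, not a distributed implementation; the names and the example tree are made up.

```python
def collect_max(node, children, ids):
    """Upward pass: maximum ID found in the subtree rooted at node."""
    best = ids[node]
    for child in children.get(node, []):
        best = max(best, collect_max(child, children, ids))
    return best


def announce(node, children, leader, elected):
    """Downward pass: every node learns the leader ID from its parent."""
    elected[node] = leader
    for child in children.get(node, []):
        announce(child, children, leader, elected)


children = {"a": ["b", "c"], "b": ["d"]}   # a is the root
ids = {"a": 4, "b": 9, "c": 2, "d": 7}
leader = collect_max("a", children, ids)
elected = {}
announce("a", children, leader, elected)
print(leader, elected)
```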
5.3.9 Echo Algorithm with Extinction
• Any topology, wave algorithm, nodes initiate wave and pass the wave messages to their
neighbours
• Algorithm outline:
– Each initiator starts a run of echo
– Wave started by initiator is tagged with its ID
– Only wave with largest ID completes, the initiator of that wave becomes leader
– Waves with lower ID are always dropped
– Non-initiators join the first wave that passes by
• Local algorithm:
– Node p participates in wave q and receives a message from wave r
– q < r: sender becomes parent of p, changes to wave tagged with r (abandon q)
– q > r: node continues with wave q
– q = r: node sends a message to its parent once messages from all neighbours have
arrived
• Performance: at most N waves, each wave sends 2E messages, N = nodes, E = edges, at
most O(N E) messages
5.3.10 Minimum Spanning Trees
Definition: an MST (minimum spanning tree) = tree spanning all nodes with minimal sum of
edge weights, not necessarily unique
Kruskal’s Algorithm:
• Greedy algorithm for determining MST
• Forest T (set of trees) is created: each node is a separate tree, T consists of N different trees
• set E containing all the edges in the graph is created
• While E nonempty, T not yet spanning tree
– Remove the edge with min. weight from E
– If connects two different trees: add to forest, combining two trees into a single one
– Else discard edge
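The (centralized) Kruskal loop can be sketched as follows. The union-find structure used to test whether an edge connects two different trees is an implementation choice not mentioned in the notes; the graph in the usage example is made up.

```python
def kruskal(num_nodes, edges):
    """edges: iterable of (weight, u, v). Returns (total weight, MST edges)."""
    parent = list(range(num_nodes))    # forest T: every node is its own tree

    def find(x):
        """Return the representative of the tree containing x."""
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    mst, total = [], 0
    for weight, u, v in sorted(edges):      # always take the min-weight edge
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                # connects two different trees
            parent[root_u] = root_v         # combine them into a single tree
            mst.append((u, v, weight))
            total += weight
            if len(mst) == num_nodes - 1:   # T is a spanning tree
                break
        # else: the edge would close a cycle and is discarded
    return total, mst


edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (5, 2, 3), (3, 1, 3)]
print(kruskal(4, edges))
```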
Distributed Kruskal’s Algorithm:
• Challenge: nodes in the same tree must determine whether an edge is an outgoing edge of the tree
• Nodes need to work together to find least weight outgoing edge
• Once found: join request to two involved trees, joined dynamically
Gallager-Humblet-Spira algorithm
• = distributed Kruskal’s Algorithm
• Each tree: name (real value), a level (integer value, = max number of joins by any node in
the tree)
• Init both = 0
• Edges status: basic (undecided), branch (confirmed part of MST), rejected (confirmed NOT
part of MST)
• Messages exchanged to:
– determine core edges interconnecting two trees
– join the trees
– update status of edges
• Core nodes = end points of the core edge; the core node with the largest ID becomes leader
MST-based election algorithm
• Node can send election message over MST
• Gallager-Humblet-Spira message amount: O(E + N log N); O(N) extra messages for election
are needed
5.3.11 Election in Arbitrary Networks
• Any topology, not known in advance
• Effective:
– Echo algorithm with extinction
– MST based algorithms
– Flooding algorithms: multiple rounds, send info to all neighbours, size of network needs
to be known
5.4 Ordered multicast
NO NEED TO STUDY THIS FOR THE EXAM
5.4.1 Terminology and operations
• group g
• Message m
• sender(m)
• group(m): identifying the group of processes the message should be delivered to
• payload(m): actual message
• multicast(g,m): actual multicast operation which sends the message m to every process p
in g
• deliver(m): deliver a message m to a process
• Group can be closed (only members can multicast) or open (everyone can)
5.4.2 Basic multicast
• Built on reliable unicast primitives
• send(p,m): reliable unicast sending of message m to process p
• receive(m): put message m in the queue of the process
procedure MULTICAST(g, m)
    for all p in g do
        SEND(p, m)
    end for
procedure RECEIVE(m)
    DELIVER(m)
• Typical problems:
– Ack-implosion: with reliable unicast, ack’s are typically used. As g grows, source has
to cope with increasing amount of ack’s and can get overloaded, solution is distribute
the multicast over multiple nodes (improves scalability)
– Implementing multicast over unicast will send the same information multiple times over the same links;
use IP multicast if available
5.4.3 Message order
• For the receive buffer
• FIFO: ordering at receive side is the same as the order the messages were sent by the same
source
• Causal ordering: the happened-before relation is used to order messages issued by possibly different sources; it guarantees that if one multicast message happened before another, EACH process that
delivers the latter will have delivered the former first
• Total ordering: Order of message delivery is the same in each receiving process, if correct
process delivers m before m’, every correct process will deliver m before m’. Total doesn’t
imply FIFO or causal, only causal implies FIFO
NO NEED TO STUDY THIS FOR THE EXAM
Chapter 6
Distributed Consensus with Failures
Introduction
• Processes may crash
• Rest of the system needs to cope with the crash
• Severe type of failures = Byzantine failures (where processes show behaviour that is not in
line with the specification of the distributed algorithm)
• Alleviated by:
– Including redundancy and replication
– Letting processes negotiate before an action is taken
6.1 Consensus problem
6.1.1 Distributed consensus problem
• Terminology:
– Consensus problem = involved processes need to agree on a single value, even if some
have crashed
– Binary consensus = uniform decision for 0 or 1 (by processes that are not byzantine or
crashed)
– Crash consensus algorithm: processes should randomly choose once between 0-1
• Important properties:
– Termination: every correct process decides eventually for 0 or 1
– Agreement: All correct processes need to decide for the same value
– Validity: If all processes start with the same initial value, then the result is that value
• K-crash consensus algorithm
– k>0
– Can cope with up to k crashing processes
– Assumed: network topology is complete, even with fails
6.1.2 1-crash consensus problem
• If processes cannot observe crashes, no always-terminating algorithm exists, even for 1-crash consensus
• Probabilistic approach: with N = number of processes, if k >= N/2 no probabilistic algorithm exists either,
since the network can be divided in two disjoint parts
• Works only if k < N/2
6.2 Algorithms
6.2.1 Bracha-Toueg crash consensus algorithm
• Outline:
– assumes k < N/2
– Each process has 2 variables: b = decision value of the process (0-1), w = weight of the
process, approximates the number of processes that voted b in previous round
– Round n=0, each process randomly chooses a value 0 or 1 with weight = 1, message to
each process
– Other rounds: each correct and undecided process p sends a message (n, b_p, w_p) to all
processes and acts on the first N-k messages (n, b, w) received at p:
– Messages from earlier or future rounds are dropped
– If w > N/2 for an incoming message: b_p = b
– Otherwise: if most messages contain b = 1: b_p = 1, else b_p = 0
– If w > N/2 for more than k messages: p decides for b and sends messages (n+1,b,N-k)
and (n+2,b,N-k)
• Algorithm evaluation:
– When process decides, other correct processes are guaranteed to decide within two
rounds
– If waiting for more than N-k messages: deadlock
– Since p waits for N-k messages only, it can decide only if more than k of them have a weight
> N/2, which requires N − k > k
– Example of a Las Vegas algorithm
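The per-round update of b and w can be written as a pure function. This is a sketch of the rules listed above only; the step that sets the new weight to the number of received votes for the new value follows the description of w in the outline and is otherwise an assumption. The function name is illustrative.

```python
def bracha_toueg_round(n_procs, k, messages):
    """Process the first N-k (b, w) pairs received for the current round.

    Returns (new_b, new_w, decided).
    """
    assert len(messages) == n_procs - k
    heavy = [b for b, w in messages if w > n_procs / 2]
    if heavy:
        new_b = heavy[0]                  # adopt a value backed by a majority
    else:
        ones = sum(b for b, _ in messages)
        new_b = 1 if ones > len(messages) - ones else 0
    # Assumption: the new weight counts the received votes for new_b.
    new_w = sum(1 for b, _ in messages if b == new_b)
    decided = len(heavy) > k              # more than k heavy messages
    return new_b, new_w, decided


print(bracha_toueg_round(5, 2, [(1, 1), (1, 1), (0, 1)]))
print(bracha_toueg_round(5, 2, [(1, 3), (1, 3), (1, 3)]))
```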
6.2.2 Consensus with failure detection
• k-crash consensus with k < N
• Outline:
– Each process randomly chooses a value 0 or 1
– Max N rounds
– In round n, process p_n acts as coordinator
– If not crashed, it broadcasts its value
– Each process waits for an incoming message from p_n; if it arrives, the process adopts the value
of p_n, ELSE after time-out T it suspects that p_n has crashed
• Evaluation
– If after round i, all correct processes have value b, then all processes keep this value in
the next rounds
– After round N all correct processes decide for b
– Is a rotating coordination algorithm
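The rotating coordinator scheme can be simulated in a few lines. The sketch makes simplifying assumptions that are not in the notes: crashes happen before the first round and the time-out based failure detector never suspects a live coordinator. The function name is illustrative.

```python
def rotating_coordinator(values, crashed):
    """values[i]: initial binary value of p_i; crashed: set of crashed indices.

    Returns the value held by every correct process after N rounds.
    """
    values = list(values)
    n = len(values)
    for rnd in range(n):                 # in round rnd, p_rnd is coordinator
        if rnd in crashed:
            continue                     # time-out: everybody keeps its value
        for p in range(n):
            if p not in crashed:
                values[p] = values[rnd]  # adopt the coordinator's broadcast
    return {p: values[p] for p in range(n) if p not in crashed}


print(rotating_coordinator([0, 1, 1, 0], crashed={0}))
```

After the first round with a live coordinator all correct processes hold the same value, and later rounds cannot change it.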
6.2.3 Chandra-Toueg consensus algorithm
• Assumption: failure detector is less accurate than in the previous subsection
• Updated version of the previous algorithm: works with acknowledgement messages
• Last-update variable per process holds the round of the last update
• Works correctly for k < N/2; it can be proven to always be correct
6.2.4 Byzantine process
• Is a process that may start to show strange behaviour
• Assumption: processes are either correct or Byzantine from the start, network topology is
complete at all times
• Byzantine consensus algorithm: algorithm for which processes that are not Byzantine agree
on a value 0 or 1
• k-Byzantine consensus algorithm can cope with up to k Byzantine processes
6.2.5 Chandra-Toueg Byzantine consensus algorithm
• k-Byzantine consensus algorithm with k < N/3
• Only collect messages from N-k processes, amongst these N-k, k can be Byzantine so: N-2k
votes from correct processes should be higher than k votes from Byzantine processes
• Similar to above algorithm, verification phase of votes
• Assumed that votes cannot be trusted and are echoed, only accepted if more than (N+k)/2
processes confirm
6.2.6 Clock synchronization with Byzantine processes
• Byzantine processes can report wrong clock values
• Mahaney-Schneider algorithm:
– Synchronization rounds
– Tasks:
∗ Collect clock values from all processes
∗ Discard values τ for which fewer than N-k processes report a value in [τ − δ, τ + δ]
∗ Replace all discarded and non-received values by an acceptable value
∗ Take the average of these N values as the new clock
– Works if k < N/3 as well
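One synchronization round can be sketched as a function over the collected clock values. Which acceptable value replaces the discarded ones is not specified above; using the first accepted value is an assumption made for this sketch, as are the names.

```python
def sync_round(clock_values, k, delta):
    """clock_values: one entry per process, None if nothing was received.

    Returns the new clock value for this process.
    """
    n = len(clock_values)
    received = [v for v in clock_values if v is not None]

    def acceptable(tau):
        """At least N-k reported values lie in [tau - delta, tau + delta]."""
        close = sum(1 for v in received if abs(v - tau) <= delta)
        return close >= n - k

    accepted = [v for v in received if acceptable(v)]
    if not accepted:
        raise ValueError("no acceptable clock value in this round")
    fallback = accepted[0]   # assumption: any accepted value may be used
    repaired = [v if v is not None and acceptable(v) else fallback
                for v in clock_values]
    return sum(repaired) / n


# Four processes, one of which (k = 1) reports a wildly wrong clock.
print(sync_round([10.0, 10.1, 9.9, 100.0], k=1, delta=0.5))
```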
6.2.7 More algorithms
• General-lieutenants problem: assumes that the lieutenants know who the general is and that
some of the lieutenants can be Byzantine; Byzantine broadcast algorithms are used, well-known: Lamport-Shostak-Pease broadcast algorithm
• Algorithms with authentication: a public and private key per process, together with the
assumption that Byzantine processes cannot lie about their received value
• K-agreement problem: Instead of binary, possible values 0..k-1
• Approximate agreement algorithms: agreement within a tolerance value (example: agreement for clock synchronization)
• Commit algorithms: Valuable in distributed databases, where processes need to agree on
commit, rollback or abort examples: two-phase commit, three-phase commit
Chapter 7
Anonymous Networks
7.1 Definition
• Assumption that processes have unique IDs is not always valid
• In heterogeneous settings with multiple devices, no unique ID can be guaranteed
• Processes do not reveal their ID for security reasons
• To keep footprint low, the decision is sometimes made to not transmit nor store the IDs
• Even in this case we still need election and coordination algorithms
7.2 Probabilistic algorithms
• Election algorithms from chapter 5 cannot be used (they rely on unique IDs), so probabilistic algorithms are needed, of 2 types:
• Las Vegas algorithms: these may not terminate, but outcome is always correct, they start
with random values and try to reach consensus by exchanging messages
• Monte Carlo algorithms: these always terminate, but can be incorrect, they start with
estimates, which are updated in multiple rounds
7.3 Itai-Rodeh election algorithm
7.3.1 Algorithm outline
• Adapted version of the Chang-Roberts election algorithm; it is a Las Vegas algorithm for ring
topologies
• A random ID is generated by each node and passed around the ring; if there is exactly ONE node with the largest ID,
it becomes the leader, else a next round is held among only the nodes with the largest ID
• Keeps track of round numbers and hop counts of messages: messages from earlier
rounds are ignored, and the hop count is used to check whether a message has passed all nodes
• Node p picks a random ID_p from 1..N and sends message (n,i,h,b) to the next node (election round, ID
of source, hop count, boolean that is true if another node with the same ID as the source was found)
• Options when p, in round n with ID_p, receives (n',i,h,b):
– n' > n, or n' = n and i > ID_p: message from a future round, or from the current round with a higher ID; p becomes
passive and sends (n',i,h+1,b)
– n' < n, or n' = n and i < ID_p: message from an earlier round, or from the current round with a smaller ID;
the message is dropped
– n' = n and i = ID_p and h < N: p is not the source (because h < N) and sends
(n,ID_p,h+1,true)
– n' = n and i = ID_p and h = N and b = true: p receives its own message back; there is
another process with the same ID because b = true, so the next round is started
– n' = n and i = ID_p and h = N and b = false: p becomes the leader
7.3.2 Algorithm evaluation
• Size of network N needs to be known, O(N log N) messages are exchanged
• Always correct and is indeed a Las Vegas, proven to terminate with probability one
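The case analysis for an incoming message (n, i, h, b) maps directly onto a handler function. The sketch below returns an action instead of sending anything; the state dictionary, the action names and the rule that passive nodes simply forward are assumptions made for illustration.

```python
def on_message(state, msg, ring_size):
    """state: dict with keys 'round', 'id', 'active'; msg: (n, i, h, b).

    Returns ('forward', new_msg), ('drop',), ('next_round',) or ('leader',).
    """
    n, i, h, b = msg
    if not state["active"]:
        return ("forward", (n, i, h + 1, b))    # passive nodes only forward
    mine = (state["round"], state["id"])
    if (n, i) > mine:            # later round, or same round with higher ID
        state["active"] = False
        return ("forward", (n, i, h + 1, b))
    if (n, i) < mine:            # earlier round, or same round with lower ID
        return ("drop",)
    if h < ring_size:            # same round and ID, but p is not the source
        return ("forward", (n, i, h + 1, True))
    return ("next_round",) if b else ("leader",)


state = {"round": 1, "id": 3, "active": True}
print(on_message(state, (1, 3, 2, False), ring_size=4))
print(on_message(state, (1, 3, 4, False), ring_size=4))
```

The tuple comparison (n, i) > (round, id) is a compact way of writing "n' > n, or n' = n and i > ID_p".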
7.4 Echo algorithm with extinction
7.4.1 Algorithm outline
• Rounds
• All initiators active at the start, non-initiators are passive
• Every round every process randomly selects an ID
• Algorithm is started and round numbers for detecting messages of earlier rounds
• Process becomes passive when it receives a message with either a higher round number or
same round number but with a higher ID
• The number of nodes needs to be known: each node reports the size of its subtree in a wave
message to its parent; the initiator checks whether the reported number of nodes is correct, and if not a next
round is started
7.4.2 Algorithm details
• At start of round, select random ID, start a wave, tagged with round number n, ID of process
p, and size of subtree s
• Initially n = 1; s = 0 means the message is not intended for the parent, while s different
from 0 reports the size of the subtree to the parent
• Process p, in round n with ID_p, waits for wave messages tagged with round number n' and ID_j of process j; options:
– n' > n, or n' = n and ID_j > ID_p: p marks the sender as parent and changes to round n' and
wave j
– n' < n, or n' = n and ID_j < ID_p: the message is dropped
– n' = n and ID_j = ID_p: p sends a message to its parent once messages from all
neighbours have arrived
• When a wave completes, the initiator checks the reported network size, if correct, this
initiator becomes the leader otherwise it starts the next round
• Las Vegas algorithm: starts with random values and tries to converge to a common elected
leader
7.5 Itai-Rodeh ring size algorithm
7.5.1 Algorithm outline
• Computing the size of an anonymous ring with certainty is impossible, so no Las Vegas algorithm for
computing the ring size exists
• Itai-Rodeh = Monte Carlo algorithm for computing the size of an anonymous ring; the probability of an incorrect outcome can be made arbitrarily close to zero
• Each process maintains an estimate est_p of the ring size, initially 2
• If a process finds that its estimate is too conservative (too low), it starts a next round
• Each round, randomly select an ID from 1..R and send message (est_p,ID_p,h) (h = hop count)
7.5.2
Algorithm details
• Process p waits for a message (est,ID,h), options:
• est < est_p: message is dropped
• est > est_p: p increases est_p, two options:
– h < est: p sends (est,ID,h+1) and est_p = est
– h = est: est_p = est + 1
• est = est_p, two options:
– h < est: p sends (est,ID,h+1)
– h = est: two options:
∗ ID ≠ ID_p: est_p = est + 1
∗ ID = ID_p: message dropped
• At termination: est_p <= N for all processes; the estimate is only increased when a process is
certain that the current estimate is too conservative
• The outcome can be incorrect (too low) when too many random IDs happen to be the same
Chapter 8
Peer-to-Peer systems
8.1 Introduction
8.1.1 Why P2P-systems?
• Client-server architectures are not as scalable, scaling up the server leads to higher operational costs and more investments (not linear)
• P2P: using edge resources (CPU, storage,...) to build a useful and compelling application,
each user of the system donates a part of the resources he owns to the system
• Complex
• Price:
– Nodes are under control of end users (and can be switched off without prior notice =
churn); churn can be large
– Network interconnecting edge devices is owned by a third party, and can be slow and
unreliable
– No central infrastructures, so distributed algorithms are needed for all management
functions (more complex, can fail)
• Properties: decentralized control, some form of self-organization (for failing), identical responsibilities and free-riding users
8.1.2
P2P generations
• Content sharing origin
• First: locate the proper location, then download from that location; poor scalability and single
point of failure
• Second: Fully decentralized and overlay network for connectivity
• Variants:
– pure P2P, all nodes are identical
– hybrid systems, where a number of privileged nodes implement more system logic than
others, to cope with the scalability issues of pure P2P (search commands are flooded, which
increases traffic); in a hybrid system a supernode maintains a central index of the content of all nodes
• Third: basic building blocks are implemented as middleware services, include:
Figure 8.1: Overlay and IP-networks are shown as different planes. Routing tables are maintained at
the IP and the overlay level. All P2P-nodes are edge nodes in the IP-network (and run an
overlay networking stack in addition to an IP-stack).
– Data placement
– Data lookup
– Automatic replication/caching
– Authentication/security
• Page 152, table 8.1 comparisons
8.2 Overlays
Ideally, each node should know every other node, but this is not scalable.
Each node therefore knows only a few other nodes; to reach all resources, requests that cannot be handled by
this subset are forwarded to neighbours of neighbours
• Routing done on application level, network = overlay network, logical routing but done over
IP
• Routing protocols in overlay network are not optimal for routing in IP-layer (and vice versa)
because they behave differently
• Overlay paths always take more hops than IP paths (page 154 table 8.2 comparison)
• Overlay header = ID of source and destination overlay nodes
• Each time an IP-packet arrives in an overlay node this node will either:
– handle (if node was destination)
– forward (to next using own routing table)
• In content based, each item gets a Global Unique ID (GUID used for routing), usually a
hash on the content or the metadata itself
• Overlays use hash values to route requests = Distributed Hash Tables (DHTs)
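Deriving a GUID from content can be sketched with a standard hash function. SHA-1 is used here because it is the function mentioned for Chord later in this chapter; truncating the digest to a smaller number of bits is an assumption made only to keep example values small, and the function name is illustrative.

```python
import hashlib


def guid(data: bytes, bits: int = 160) -> int:
    """Map content (or metadata) to a GUID of the given number of bits."""
    digest = hashlib.sha1(data).digest()          # 160-bit SHA-1 digest
    return int.from_bytes(digest, "big") >> (160 - bits)


print(guid(b"song.mp3 metadata", bits=16))
print(guid(b"192.0.2.17", bits=16))               # a node GUID from its address
```

Content items and overlay nodes end up in the same identifier space, which is what lets an overlay route a request for an item towards the node responsible for it.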
8.3 Distributed Hash Tables
8.3.1 Introduction
Responsibilities and API
• Main responsibilities:
– managing objects: add/remove operations
– managing nodes: add/remove overlay nodes
– find and retrieve digital content
• Key concept: content items AND overlay nodes (hashing the overlay node’s public key or
IP address) are identified using a GUID
• Each node is responsible for a set of GUIDs (objects identified by certain GUIDs can be
replicated at multiple nodes)
• basic DHT API: like normal API, capable of storing < key, value > pairs, DHT looks after
proper locations and replication
– publish(GUID)
– remove(GUID)
– value=get(GUID)
• Distributed Object Location and Routing API (DOLR), more powerful, application itself is
in control where each data item is stored (responsible for retrieving data, without being able
to select optimal location)
– publish(GUID)
– unpublish(GUID)
– sendToObj(message,GUID,[#]): message sending to object with GUID, can be a
get-message, optional argument is sending it to only 1 or a number of replicas
Bootstrapping
• = Finding an active P2P node to request joining the network
• Solutions:
– Static: Pre-configuring addresses of stable nodes in the client code, easy but requires
that these stable nodes are always on (and fixed IP address)
– Dynamic: active nodes are found through a DNS service, which maps a symbolic domain name to the IP of a
currently active node. The DNS service requires a view on the currently active nodes and selects the best
node (round-robin, ...)
• Bootstrapping server can provide:
– Initial routing table
– GUID space the node is responsible for
– Protocol specific information
8.3.2 Circular Routing (1D)
• N bits to code GUIDs, so 2^N objects
• draw on a ring, neighbouring objects have adjacent GUIDs
• Some are overlay nodes (or called active nodes), each responsible for a segment of the ring
• Circular routing:
– each active node: list of IP addresses to closest 2l active neighbours = leafset
– Mechanism:
∗ node Source wants to locate an item at node Destination (k)
∗ GUID exceeds max of reachable, sent to furthest accessible node
∗ repeats until a node detects that destination node comes within reach of direct
communication
∗ direct to that node, terminates the forwarding process
– Worst case: destination on the opposite side of the ring; with N nodes this is a distance of N/2 nodes,
and with jumps of size l at most N/(2l) hops are needed
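The worst-case bound can be checked with a small calculation. The sketch assumes that the N active nodes are numbered 0..N-1 around the ring and that every node knows its l nearest neighbours on each side; circular_hops is an illustrative name.

```python
import math


def circular_hops(num_nodes, l, source, dest):
    """Hops needed when every hop covers at most l positions on the ring."""
    clockwise = (dest - source) % num_nodes
    distance = min(clockwise, num_nodes - clockwise)   # take the short way
    return math.ceil(distance / l)


print(circular_hops(64, 4, 0, 32))   # destination on the opposite side
print(circular_hops(64, 4, 0, 3))    # destination inside the leaf set
```

For N = 64 and l = 4 the opposite side of the ring is reached in 64/(2*4) = 8 hops, which matches the N/(2l) bound.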
8.3.3 Prefix Routing: Pastry
Routing
• Optimizing circular routing is simple = increase l (= more memory space)
• Besides the leaf set, each node tracks a number of nodes further away (increasing jump size) and uses
longest prefix matches to compute the optimal next destination
• Routing table: log_{2^b} N = n/b rows (b configurable), each having 2^b − 1 entries; each group of b
bits is treated as a base-2^b digit
• Row i of the routing table of node X references nodes that have a prefix of size i in common with X
• Entry: [GUID,IP] (page 157)
• Algorithm (R = routing matrix, L = leaf set, message m, current node C, source S, destination D): if
D is sufficiently close to C's GUID, D is in the leaf set; else use the longest prefix match p; if there is no matching entry in R, find
a node in L or R that is closer to D than C:
procedure RECEIVE(m, S, D)
    if L[-l] <= D <= L[l] then
        forward m to GUID_i found in leaf set or to current node C
    else
        find longest prefix match p of D and C
        c = symbol (p+1) of D
        if R[p][c] != null then
            forward m to R[p][c]
        else
            find GUID_i in L or R with |GUID_i - D| < |C - D|
            forward m to GUID_i
        end if
    end if
end procedure
• If R is empty, Pastry becomes circular routing
Node dynamics: joining and leaving
• If a node wants to join the DHT, it must construct its own leaf set and routing table, and other nodes need to
update theirs
• A discovery algorithm finds a physically close node B; the new node sends a join to this B with its own GUID,
which is routed (via Pastry) to the active node responsible for that node's GUID
• All nodes on the path send info to the new node with (parts of) their routing table and (parts of) their leaf set, from which it builds its own
set:
– X sends join(X, GUID_X) to B
– DHT routes the join the usual way to node F, minimizing |GUID_F − GUID_X|
– All nodes on the routing path send info to X to construct R_X and L_X
• Initializing L_X:
– F is very close to the new node in GUID space
– leaf set of F is a good first choice: L_X = L_F
• Initializing R_X:
– bootstrap node B is very close (physically)
– (if no additional info) no common prefix, routing is done as by node B
– New node copies first row of B as its own
– Node B_i (numbered from 0 along the route to F) has i common symbols with the new
node's ID: copy row i of B_i into the new routing table
– B_0 = B, B_{N−1} = F: R_X[i] = R_{B_i}[i]
• Both are further optimized based on feedback from routing itself
• Leaving: only leaf sets are repaired, system will discover faulty routing table entries dynamically and introduce new entries automatically
• If node X fails: this is detected by neighbours (missing heartbeat); such a neighbour will locate a
node Y close to X in its own leaf set, request the leaf set of Y and update its own accordingly,
and send it to its neighbours, which change theirs accordingly
8.3.4 Skiplist Routing: Chord
Introduction
• Only one method in the Chord API: lookup(key), which maps a key to an IP
• Applications get notified when a change to the key set has happened so they can react
• Samples: Cooperative mirroring (multiple servers cooperate to store content), time shared
storage (ensuring data availability), distributed indexes, large-scale combinatorial search
(code breaking)
• SHA-1 hash function, mapped to a circle
• Important: the successor of a key = the node with the smallest GUID not smaller than the
GUID associated with the key = node that will take care of the key: minx|nx >= key
• If X leaves, keys from X will be handed over to successor(X), when Y joins keys from
successor(Y) will be reassigned to Y
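The successor rule and the key hand-over on a join can be sketched as follows. This is a minimal illustration, not Chord's actual implementation: the 16-bit identifier space, the helper names (`chord_id`, `successor`, `assign`) and the example node and key ids are assumptions chosen for readability.

```python
import hashlib
from bisect import bisect_left

M = 16  # identifier bits: an assumption to keep the ring small (SHA-1 yields 160)


def chord_id(name, m=M):
    """Map a name (key or node address) onto the identifier circle [0, 2^m)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)


def successor(node_ids, key):
    """Smallest node id >= key; wraps around to the smallest id on the ring."""
    ring = sorted(node_ids)
    return ring[bisect_left(ring, key) % len(ring)]


def assign(node_ids, keys):
    """Map every key to the node that stores it."""
    return {key: successor(node_ids, key) for key in keys}


# Hypothetical ring: 8 active nodes storing 5 keys.
nodes = [1, 8, 14, 21, 32, 38, 42, 48]
keys = [10, 24, 30, 38, 54]
before = assign(nodes, keys)
after = assign(nodes + [26], keys)  # node 26 joins the ring
moved = {k for k in keys if before[k] != after[k]}
```

In this sketch only the keys between the new node's predecessor and the new node itself change owner, which is why joining and leaving are cheap operations.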
Figure 8.2: Chord ring with 8 active nodes, storing 5 keys, each key is stored by its successor, with
joining and leaving
Routing
• called skiplist routing
• Routing table = nodes at non-equidistant intervals, called the finger table
• N overlay nodes in the ring; each node tracks log2 N other nodes at distances N/2, N/4, ... in its
finger table, with their GUID and IP
• The node at row i is at a distance of at least 2^(i−1): finger[i] = successor(n + 2^(i−1)) (i >= 1)
• Greedy routing (longest jump possible) without exceeding the destination GUID
• Algorithm: ((a, b] is the circle segment clockwise from a, excluding a and including b)
procedure FINDSUCCESSOR(n, id)          # current node is n
    if id in (n, n.successor] then
        return n.successor
    else
        n' = CLOSESTPRECEDINGNODE(n, id)
        return FINDSUCCESSOR(n', id)
    end if
end procedure
procedure CLOSESTPRECEDINGNODE(n, id)
    for i = m downto 1 do               # m is the size of the finger table
        if finger[i] in (n, id) then    # find the closest preceding node of id in the finger table
            return finger[i]
        end if
    end for
    return n
end procedure
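The routing procedure above can be tried out on a small static ring. The sketch below assumes a 6-bit identifier space and builds all finger tables up front from global knowledge, which a real Chord node cannot do; the class and helper names are illustrative only.

```python
M = 6  # identifier bits, ring size 2^6 = 64 (assumption for the example)


def in_interval(x, a, b, inclusive_right=False):
    """True if x lies on the clockwise arc (a, b), or (a, b] if requested."""
    size = 2 ** M
    if a % size == b % size:           # the arc spans the whole ring
        return inclusive_right or x % size != a % size
    dx, db = (x - a) % size, (b - a) % size
    return 0 < dx <= db if inclusive_right else 0 < dx < db


class Node:
    def __init__(self, ident):
        self.id = ident
        self.finger = []               # finger[i-1] = successor(id + 2^(i-1))

    @property
    def successor(self):
        return self.finger[0]


def build_ring(ids):
    """Create nodes with correct finger tables (uses global knowledge)."""
    ring = sorted(ids)
    nodes = {i: Node(i) for i in ring}

    def succ(key):
        key %= 2 ** M
        return nodes[next((i for i in ring if i >= key), ring[0])]

    for node in nodes.values():
        node.finger = [succ(node.id + 2 ** (i - 1)) for i in range(1, M + 1)]
    return nodes


def closest_preceding_node(n, key):
    for f in reversed(n.finger):       # try the longest jump first
        if in_interval(f.id, n.id, key):
            return f
    return n


def find_successor(n, key, path):
    path.append(n.id)                  # record the hops for inspection
    if in_interval(key, n.id, n.successor.id, inclusive_right=True):
        return n.successor
    nxt = closest_preceding_node(n, key)
    if nxt is n:                       # defensive; unreachable with consistent tables
        return n.successor
    return find_successor(nxt, key, path)


ring = build_ring([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
hops = []
owner = find_successor(ring[8], 54, hops)
```

Each hop roughly halves the remaining distance to the key, which is where the logarithmic hop count of skiplist routing comes from.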
Node dynamics
State variables of each node
• Finger table
• successor
• predecessor
To cope with node dynamics, first concern is to update the successor variable as fast as possible
Joining
• Creating a new Chord ring is trivial: it contains one node, whose successor is itself
• For a node n to join, it suffices to set the successor of n correctly; n must be able to find the
successor of the GUID associated with n
• Algorithm:
procedure CREATE(n)
    n.predecessor = null
    n.successor = n
end procedure
procedure JOIN(n', n)
    n'.predecessor = null
    n'.successor = FINDSUCCESSOR(n, n')
end procedure
• These only update successor of each node, setting predecessors and finger tables is done by
stabilization algorithm
Stabilizing
• Runs periodically, responsible for making the ring state consistent, fixing finger table, predecessors
• Algorithms:
procedure STABILIZE(n)
    x = (n.successor).predecessor
    if x in (n, n.successor) then
        n.successor = x
    end if
    NOTIFY(n.successor, n)
end procedure
procedure NOTIFY(n', n)
    if (n'.predecessor == null) or (n in (n'.predecessor, n')) then
        n'.predecessor = n
    end if
end procedure
• Stabilize updates the successor of n and the predecessor of successor(n): it first queries the current
successor for its predecessor x; if x lies between n and that successor, x is a better choice of successor
• It then notifies the (possibly new) successor so that it can update its predecessor
• Notify(n’,n): node n thinks it is the predecessor of n’ and requests n’ to update this
Figure 8.3: node dynamics example
• The finger tables are also kept up to date, by invoking the findSuccessor() method for
the GUID at the desired minimum distance; each node also periodically checks its
predecessor and, if it has failed, sets it to null
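A small simulation shows how repeated stabilization turns the successor-only state left by JOIN into a consistent ring. It is a sketch under simplifying assumptions: a 64-id ring, no finger tables (so `find_successor` walks successors linearly), no failures, and stabilization rounds executed sequentially rather than concurrently.

```python
RING = 64  # size of the identifier space (assumption for the example)


def between(x, a, b):
    """True if x lies on the open clockwise arc (a, b)."""
    if a == b:                         # the arc spans the whole ring except a
        return x != a
    return 0 < (x - a) % RING < (b - a) % RING


class Node:
    def __init__(self, ident):         # corresponds to CREATE
        self.id = ident
        self.successor = self
        self.predecessor = None


def find_successor(n, ident):
    """Linear walk along successor pointers (no finger table in this sketch)."""
    while not (between(ident, n.id, n.successor.id) or ident == n.successor.id):
        n = n.successor
    return n.successor


def join(new, known):
    new.predecessor = None
    new.successor = find_successor(known, new.id)


def notify(target, n):
    if target.predecessor is None or between(n.id, target.predecessor.id, target.id):
        target.predecessor = n


def stabilize(n):
    x = n.successor.predecessor
    if x is not None and between(x.id, n.id, n.successor.id):
        n.successor = x                # x sits between n and its old successor
    notify(n.successor, n)


a, b, c = Node(10), Node(30), Node(20)
join(b, a)
join(c, a)
for _ in range(5):                     # a few periodic rounds
    for node in (a, b, c):
        stabilize(node)
```

Right after the two joins both new nodes point at node 10; the periodic rounds then repair successors and predecessors until the ring 10, 20, 30 is consistent.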
Failures
• Suppose the request lookup(y) is launched at node X, and all finger table entries with GUID < y
have failed
• To cope with this, each node keeps track of r successors in the ring; if the finger table is invalid, a
node from this list is chosen (with a GUID closer to the destination's GUID)
• It then takes r successive node failures before the ring gets corrupted (routing defaults to circular routing
if the finger table fails)
• Algorithm:
procedure FIXFINGERS(n)                 # n is the current node
    next++
    if next > m then
        next = 1
    end if
    finger[next] = FINDSUCCESSOR(n, n + 2^(next−1))
end procedure
procedure CHECKPREDECESSOR(n)
    if n.predecessor failed then
        n.predecessor = null
    end if
end procedure
• To combine the finger table mechanism with the successor list, the routing function has to be modified:
the closestPrecedingNode() method also checks the successor list
• If a node n has been identified as failed, it is removed from the successor list and the finger
table; findSuccessor uses a time-out to cope with failing nodes
procedure FINDSUCCESSOR(n, id)
    if id in (n, successor] then
        return successor
    else
        try
            n' = CLOSESTPRECEDINGNODE(n, id)
            return FINDSUCCESSOR(n', id)
        catch (TimeoutException)
            invalidate n' from finger and successor table
            return FINDSUCCESSOR(n, id)
        end try
    end if
end procedure
procedure CLOSESTPRECEDINGNODE(n, id)
    for i = m downto 1 do
        if finger[i] in (n, id) then
            return finger[i]
        end if
    end for
    for i = r downto 1 do               # r = size of the successor table
        if n.successor[i] in (n, id) then
            return n.successor[i]
        end if
    end for
    return n
end procedure
8.3.5 Multi-dimensional Routing: CAN (Content Addressable Network)
Introduction
• In multidimensional routing GUIDs are put in a d-dimensional Cartesian space with GUIDs
as points
• Mapping: P = hash(K), the (key, value) pair is mapped to the Cartesian coordinate P
• To retrieve the key K, compute P = hash(K); if P is owned by the requesting node or its neighbours,
the location where the key is stored is returned immediately, else the request is forwarded to a node closer to P than
the requester.
• Routing = self-organizing: each node learns and stores a set of neighbours (= nodes sharing
a hyperplane of dimension (d-1))
• With equally partitioned zones, each node has 2d neighbours; average path length: (d/4) · N^(1/d)
Routing
• As in 1D: find the neighbour of the current node with a GUID closer to the destination.
• Each node knows its neighbours and the coordinate zones they are responsible for
• page 169 routing example figure
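Greedy CAN routing and the average path length (d/4) · N^(1/d) can be checked on an idealised layout. The sketch assumes the coordinate space is a d-dimensional torus split into k equal zones per dimension (so N = k^d), with one node per zone identified by its integer zone coordinates; real CAN zones are generally not this regular.

```python
from itertools import product


def torus_distance(a, b, k):
    """Hop distance between two zones when every dimension wraps around."""
    return sum(min((x - y) % k, (y - x) % k) for x, y in zip(a, b))


def neighbours(zone, k):
    """Zones that differ by exactly one step in exactly one dimension."""
    result = []
    for dim in range(len(zone)):
        for step in (-1, 1):
            nb = list(zone)
            nb[dim] = (nb[dim] + step) % k
            result.append(tuple(nb))
    return result


def route(src, dst, k):
    """Greedy routing: always forward to the neighbour closest to dst."""
    path = [src]
    while path[-1] != dst:
        nxt = min(neighbours(path[-1], k), key=lambda nb: torus_distance(nb, dst, k))
        path.append(nxt)
    return path


def average_path_length(d, k):
    """Average hop count from one zone to every zone (symmetric on a torus)."""
    zones = list(product(range(k), repeat=d))
    return sum(len(route(zones[0], dst, k)) - 1 for dst in zones) / len(zones)


d, k = 2, 8                            # 64 nodes in a 2-dimensional space
measured = average_path_length(d, k)
predicted = d / 4 * (k ** d) ** (1 / d)
```

For this regular grid the measured average matches the formula; increasing d for the same N shortens paths at the cost of more neighbours per node.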
Node dynamics
• Joining
– Steps:
1. Find an active node in the CAN: bootstrapping
2. Find a zone to care for
3. Publish the new node to the neighbours
(a) Updating the neighbour sets: the new neighbour set of X is contained in the union
of N and the old neighbour set of N (and vice versa, with X and N swapped)
(b) Node N sends periodic refreshes containing ID(N)+zone(N) and, for each neighbour x of
N, ID(x)+zone(x)
– Each CAN has a DNS name for bootstrapping; the bootstrap node keeps a list of probably active
nodes
– The bootstrap node replies to a join by randomly selecting a node from its active node list
– The new node X selects a random point P and sends join(P), which is routed normally and arrives at the node N
responsible for P; N splits its zone in two and replies to X with the zone X should take care of,
together with the key-value pairs stored in it
– X and N update their neighbour sets and distribute this info, together with the zones they care for
– Each neighbour is then able to update its information; joining is a local operation, only
direct neighbours are immediately affected
• Leaving
– Friendly exit: the node issues a notification before leaving, in which it informs the neighbour
that currently has the smallest zone responsibility
– That neighbour takes over and tries to join the two zones into one; if this is not possible, it becomes responsible
for both (an extra process defragments the GUID space)
– Unexpected departure: detected by the absence of periodic updates; an immediate take-over algorithm is started
(multiple nodes can start this simultaneously, so this must be accounted for):
procedure FAIL(N, X)
    timeout = C × zoneSize(N)
    STARTTIME(timeout)
    for all y in neighbour(X) do
        TAKEOVER(y, N, zoneSize(N))
    end for
    update zone(N) info
end procedure
procedure TAKEOVER(Y, N, size)
    if zoneSize(Y) <= zoneSize(N) then
        cancel timer @Y
        update neighbour info
    else
        for all y in neighbour(X) do
            TAKEOVER(y, Y, zoneSize(Y))
        end for
    end if
end procedure
– If node detects more than half of its neighbours fail, perform a neighbour discovery first
(expanding ring search)
Advanced mechanisms
• Increase d: reduces hop count, as well as latency, state maintenance increases
• Use multiple realities: separate instances of CAN are running, in each reality, a node is
responsible for a different zone, request: check EACH reality for best neighbour, improved
data availability, robustness, reduced latency
• RTT based routing metrics: distances could be weighted by actual delays, low latency
paths will be favoured
• Overloading coordinate zones: multiple nodes, responsible for same zone (called peers),
requires synchronization but enhances robustness
• Multiple hash functions: k different hash functions map the same value to k nodes; queries are
sent to k nodes (improved latency, improved data availability), at the cost of larger node state and increased
traffic (to reduce traffic, start with a query to the closest node)
• Organize coordinate space based on physical network layout
• Uniform coordinate partitioning: prior to splitting zone, check whether neighbour could
be split (has larger zone), better load balancing and more uniform zone sizes
• Caching and replication to manage hot spots: cache recently accessed data in node,
check cache before forwarding, replicate frequently accessed data in neighbouring nodes
Chapter 9
Cloud Computing
9.1 Definition
• Cloud provider provides on-demand computing resources to an application provider, which
can be used to create scalable web applications for the end user
• utility computing=offering computing resources on demand
• Definition (according to the National Institute of Standards and Technology (NIST)) = Cloud
computing is a model for enabling ubiquitous, convenient, on-demand network access to a
shared pool of configurable computing resources that can be rapidly provisioned and released
with minimal management effort or service provider interaction
9.1.1 Characteristics
• On-demand self-service: a consumer can unilaterally provision computing capabilities (server time,
network storage) as needed, automatically, without requiring human interaction with each
service’s provider
• Broad network access: Capabilities are available over the network and accessed through
standard mechanisms that promote use by heterogeneous thin or thick client platforms
• Resource pooling: The provider’s computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically
assigned and reassigned according to consumer demand (multi-tenant applications: multiple
end users can make use of the same application instances thus increasing scalability and
lowering cost per user)
• Rapid elasticity: Capabilities can be rapidly and elastically provisioned, in some cases
automatically, to quickly scale out, and rapidly released to quickly scale in (to the consumer, capabilities
appear unlimited; any quantity can be purchased)
• Measured Service: Resource usage can be monitored, controlled and reported providing
transparency for both the provider and consumer of the utilized service
9.1.2 Service models
• Cloud can offer different services (software, middleware, infrastructure...)
• Cloud Software as a Service (SaaS):
– Use provider’s applications running on a cloud infrastructure accessed via web browser
– Consumer does not manage underlying infrastructure
– e.g. Google Docs, Gmail.... (Collaboration, Business Processes, CRM/ERP/HR, Industry Applications...)
• Cloud Platform as a Service (PaaS):
– Deploy onto the cloud infrastructure consumer-created or acquired applications developed using programming languages and tools supported by the provider
– Consumer does not manage underlying infrastructure
– Does control deployed applications and hosting
– Examples: Windows Azure, Google App Engine.. (is a Web 2.0 application runtime,
middleware, database, Java Runtime...)
• Cloud Infrastructure as a Service (IaaS):
– Provision processing, storage, networks and other fundamental computing resources
where the consumer is able to deploy and run arbitrary software (as OS, applications...)
– The consumer does not manage the cloud infrastructure BUT has control over operating systems, storage,
deployed applications and limited control of select networking components (host firewalls..); often virtualized with multiple VMs on a single physical machine
– Example: Amazon, Rackspace (Servers, Data Center Fabric, Networking, Storage)
• Each offers a level of abstraction and management load
• Business Process as a Service (BPaaS): entire business processes are outsourced to a
cloud environment, it is important to make sure users get regular status updates, are able
to interact when needed and have access to dashboards as if the business processes were run
on own premises
9.1.3 Deployment models
Access to the infrastructure
• Private cloud: Operated solely for an organization, may be managed by the organization
or a third party and may exist on premise or off premise, e.g. datacenter
• Community cloud: shared by several organizations + supports a specific community that
has shared concerns (compliance, security...), managed by organizations or a third party, on
and off premise, e.g. group of hospitals
• Public cloud: Available to the general public or a large industry group, owned by an
organization selling cloud services, e.g. Amazon, Google, Microsoft
• Hybrid cloud: Composition of two or more clouds (private, community, public) that remain
unique entities but are bound together by technology that enables data and application portability (e.g. cloud
bursting for load-balancing between clouds), e.g. Microsoft Windows Azure; typically IaaS
connected with a private cloud
9.1.4 Payment models
Cloud computing follows a pay-as-you-go approach
• Per-instance billing: common, pay for every hour a VM or instance is used, even if they
are idle
Figure 9.1: BPaaS architecture
• Reserved usage: some cases, clients know they will be needing it for longer periods
(months, years), up-front payment and reserve for that time period, will always be available
(lower hourly rates)
• Bidding: the customer provides a maximum instance price; the actual price varies with the load of the cloud
(more expensive when the utilization degree of the cloud is higher); if the price becomes higher than
the maximum provided by the customer, the instance is stopped
• Actual usages: some PaaS clouds, determines the cost based on the actual CPU cycles,
only actually used resources are paid
• First three: IaaS, combinations also exist
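The four payment models can be compared with a few lines of arithmetic. All prices below are hypothetical (expressed in cents to keep the arithmetic exact) and the function names are illustrative; no real provider's tariff is implied.

```python
def per_instance_cost(hours_allocated, hourly_rate):
    """Every allocated hour is billed, whether the instance is busy or idle."""
    return hours_allocated * hourly_rate


def reserved_cost(hours_allocated, upfront, reduced_hourly_rate):
    """Up-front payment plus a lower hourly rate."""
    return upfront + hours_allocated * reduced_hourly_rate


def bidding_run(price_per_hour, max_bid):
    """Instance runs only in hours where the current price is within the bid."""
    paid = [price for price in price_per_hour if price <= max_bid]
    return len(paid), sum(paid)


def actual_usage_cost(cpu_seconds_used, rate_per_cpu_second):
    """Only consumed CPU time is billed."""
    return cpu_seconds_used * rate_per_cpu_second


# Hypothetical month: 720 allocated hours at 10 cents per hour.
on_demand = per_instance_cost(720, 10)
reserved = reserved_cost(720, upfront=2000, reduced_hourly_rate=4)
# Reserved usage pays off once upfront / (rate difference) hours are exceeded.
break_even_hours = 2000 / (10 - 4)
hours_run, spot_cost = bidding_run([3, 5, 8, 4], max_bid=5)
```

With these assumed numbers the reservation is cheaper for a machine that runs all month, while the bid-based instance is interrupted in the one hour where the price exceeds the bid.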
9.1.5 Advantages
• Users do not need to own and configure machines: management of the infrastructure is left to cloud
providers, users only worry about what to do with the machine/resource
• Request resources when needed and only pay when actually using them
• Simple web-based interface to request, monitor and manage resources
• Extreme scaling: can scale the footprint from 1 server to 1000+ servers in a matter of a few
minutes or less
• Economic model: renting vs leasing; the management cost is often higher than the resource cost
Useful scenarios:
• Load varying with time: load varies based on time of the day, week, year; provisioning
resources for peak capacity means a large portion of the infrastructure will be idle = overprovisioning; if fewer resources are available than peak capacity, less is wasted but the service is underprovisioned
• Demand unknown in advance, so correctly provisioning the service is difficult
• Batch analytics that can benefit from a huge number of resources for a short time duration; if the
batch job can be parallelized, it can be deployed on multiple servers
9.1.6 Obstacles
• Availability/Business Continuity: use multiple cloud providers so you have fewer points
of failure
• Data Lock-in: lack of data standards, difficult to liberate data stored in the cloud, vast
amounts of data and connection speed limitations, solution: Standardize API’s, Compatible
SW to enable Surge or Hybrid Cloud Computing
• Data Confidentiality and Auditability: trusting a third party with confidential information, deploy Encryption VLANs and Firewalls
• Data Transfer Bottlenecks: large amounts of data in cloud, difficult and slow and expensive to move data to another location (FedExing Disks, physically moving them, Higher
BW Switches)
• Performance Unpredictability: Multiple VMs can share CPU and main memory, but
I/O sharing is problematic for using network and disks, solve: Improved VM Support, Flash
Memory, Gang Schedule VMs
• Scalable Storage: less obvious to apply cloud concepts on persistent storage: Invent Scalable Store
• Bugs in Large Distributed Systems: linked to availability, many of the bigger outages
of cloud infrastructures are caused by bugs in cloud management services, solution: Invent Debugger
that relies on Distributed VMs
• Scaling Quickly: scaling is not always fast enough, solution: Invent Auto-Scaler that relies on ML, Snapshots
for Conservation
• Reputation Fate Sharing: If an ill-intended application runs on a server, all other
applications running on the same physical server are affected by the same reputation (e.g.
IP blacklisting...), solution: Offer reputation-guarding services like those for email
• Software Licensing: Pay-for-use licences
9.2 Cloud Platforms
9.2.1 Amazon Web Services (AWS)
• Collection of different services
• e.g. Amazon Elastic Compute cloud (EC2), IaaS solution, can request a machine with
specific requirements as CPU, memory, disk space, pre-installed OS and middleware
• Uses a portal to request, no default support for automatic scaling, failover: extra services
are needed
Figure 9.2: The Windows Azure platform
9.2.2 Microsoft Windows Azure
• Focuses more on offering a platform by adding more services and lowering management
overhead
• Forces developers to use specific development approaches (using .NET libraries )
• More flexible than Google AppEngine
• Can provide automated scaling between application framework and hardware virtual machines
• Three roles:
– Web Roles: web application programming, preconfigured with Internet Information
Services (IIS)
– Worker Roles: Windows servers, for doing processing or computing work
– Virtual Machines:
– Load balancer to manage internet traffic, stateless paradigm, between different roles
not load-balanced
– SQL server
– Queues: store messages that can be processed later
– Tables: cheaply store large amounts of data, do not require a predefined schema and do not
need to be queried using SQL
– Blob: Binary Large Objects, accessible by HTTP commands
9.2.3 Google App Engine
• PaaS model
• Provides a platform to host web applications
• App Engine SDK for programming (Python and Java support)
• Set of primitives (datastore (no SQL), URL fetch, memcache, JavaMail, Images, Authentication...)
• User focuses on developing the application in this framework
• Once deployed, scaling and availability are handled by the Google App Engine platform (its main
focus, so less freedom); one of the few considered platforms with automatic scalability
9.3 Building blocks of an IaaS Cloud
9.3.1 Provisioning resources
Resources are called VMs; provisioning is done by creating a disk image, instantiating it and then running it
• Image: a durable object stored in the cloud; it contains a disk image with
an operating system and applications; an instance can be captured as an image
• Instance: can be instantiated and placed on a host server. To achieve this, the image is bound
to a specific runtime environment (e.g. IP); stopping/suspending or snapshotting a VM results in
storing its state in the instance; the instance uses storage resources of the server and is like the disk
drive of a computer
• Virtual Machine: a logical computer running on a physical machine, that uses an
instance disk to store its state and consumes server and network resources
• Provisioning the VM is done in multiple steps:
1. User requests are forwarded to a Provisioning Component
2. The Resource Manager locates a hardware resource that has available capacity
3. The Provisioning Component copies the image for the virtual machine from the Image
Library to the target hardware resource
4. It configures and creates the virtual machine and personalizes a generic pre-built image,
called a virtual appliance, to user settings
5. Finally, the user is notified upon completion of the request
• Resource manager manages hardware and network resources in the system, these are organized in groups and placed in different resource pools
• Research challenge: efficient packing of VMs and lowering storage requirements. Research
into efficient allocation strategies, lowering the amount of hardware needed and thus power
consumption of clouds
9.3.2 Virtualization
• VMs are isolated from each other for security
• applications-OSs-VMs-One Hypervisor-Physical machine, otherwise: applications-OS-Physical
machine
• Hypervisor (or Virtual Machine Manager, VMM) is a software layer responsible for managing
VMs, it separates the VM from the physical hardware, schedules the requests of the different
VMs on the physical hardware
– Type 1: bare metal hypervisors, directly installed onto the physical hardware (for
servers)
– Type 2: hosted hypervisors, on top of an operating system adding an additional OS
layer between physical machine and hypervisor
9.3.3 Virtual Images
• At VM creation time, it boots from this stack
• Disk image, contains VM’s hard disk drive, but also the information, contents of a VMware
image: .vmdk (contents of VM’s hard disk drive), .nvram (state of VM’s BIOS), .vmx (setting
of virtual machine), .vmxf (supplemental configuration information), .vswp (Swap file)
• Must be stored in cloud to be accessible for instantiation
• Research: de-duplication of files over multiple images, ensuring files occurring in two different
images are only stored once, further decreasing the required storage and caching of images
to improve VM construction time
9.3.4 Virtual Appliances
• = image of a pre-built software stack that can be instantiated (can be OS, OS and application
middleware (as webserver) or full software stack with running application)
• A composite appliance has two parts: definition (= description of the images, set of parameters providing additional details such as CPU and memory, and relationships between
individual images) and images (each contains a virtual disk image with OS and software, config
files, and custom configuration scripts)
• Page 190 for figures
9.4 Enterprise Applications
9.4.1 Live Migration
• Cloud providers need to shut down physical servers for maintenance, service of applications
running must not be interrupted, solution: live migration = a running VM is moved from
one host to the other
• Maintenance, vertical scalability (moving several VMs on servers to one), energy efficiency
• Within a LAN (Local Area Network) well supported: migration of memory, CPU registers,
config files and network connections from one hypervisor to another while the VM is running;
storage is not transferred, as it is shared (Storage Area Network, accessible on the local network)
• Migration phases:
0. Pre-migration: active VM on host A; an alternative physical host may be selected in advance
for migration (speedup purposes); block devices mirrored and free resources maintained
1. Reservation: initialize a container on the target host B, initially confirm that necessary
resources are available on B and reserve a VM container of that size (in case of failure:
VM keeps running on A)
2. Iterative Pre-Copy: (overhead in performance due to copying) Enable shadow paging
(all memory pages are copied, and when an update happens it is not written to physical
memory but in a newly allocated memory page), copy dirty pages (adjusted pages) in
successive rounds (VM can still change memory pages)
3. Stop-and-copy: (VM out of service) Suspend VM on host A, Redirect traffic to Host B,
Synchronize all remaining VM state to Host B (copy on A is still primary, and resumed
in case of failure)
4. Commitment: VM state on Host A is released if B indicates to A that it has successfully
received a consistent OS image
5. Activation: (VM running normally on host B) VM starts on Host B, connects to local
devices and resumes normal operation
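The iterative pre-copy and stop-and-copy phases can be illustrated with a toy model. The assumptions are loud ones: memory is counted in whole pages, and while a round of pages is being copied the running VM dirties a fixed percentage of that amount (`dirty_percent`, a hypothetical parameter); real hypervisors track dirty pages individually.

```python
def precopy_migration(total_pages, dirty_percent, stop_threshold, max_rounds=30):
    """Return (pages sent per pre-copy round, pages left for stop-and-copy)."""
    rounds = []
    to_send = total_pages              # first round copies all memory pages
    while to_send > stop_threshold and len(rounds) < max_rounds:
        rounds.append(to_send)
        # Pages dirtied by the still-running VM while this round was copied.
        to_send = min(total_pages, to_send * dirty_percent // 100)
    return rounds, to_send


# VM with 1000 pages that dirties 20 pages per 100 pages copied.
rounds, final_copy = precopy_migration(1000, dirty_percent=20, stop_threshold=10)

# A VM that dirties memory as fast as it is copied never converges,
# so the round limit forces the stop-and-copy phase.
busy_rounds, busy_final = precopy_migration(1000, 100, 10, max_rounds=5)
```

The size of the final copy determines how long the VM is out of service, which is why the pre-copy rounds continue until the dirty set is small or stops shrinking.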
• Networking issues
– VM keeps its old IP address: an interior routing protocol is used within the cloud to ensure
packets are sent to the correct host machine (Mobile IP to ensure a permanent IP can
be used)
– IP address of VM changes: A tunnel between the old host and new host, DNS is also
updated
Chapter 10
Resource Allocation in Distributed Systems
10.1 Definition
• Allocation of resources: crucial in distributed systems, to make sure applications run properly, feedback to applications required, limit energy consumption
• Different types: CPU, memory, bandwidth, storage
• Availability can vary dynamically; resources are distributed (heterogeneously) over various devices
• Objectives (choices have to be made for this):
– minimize response times of applications
– maximize the quality of experience for the end users
– minimize costs
– maximize revenues
– Minimize energy consumption (switch off idle devices)
10.2 Resource allocation algorithms
10.2.1 Different types
• Single request versus batch: single-request algorithms handle one request at a time in the order of arrival;
batch algorithms collect incoming requests and then process them simultaneously (less efficient but simple vs more efficient but computationally intensive)
• Optimal versus heuristic: heuristic algorithms provide a non-optimal solution, but execution times
and required computation resources are lower
• Online versus offline: online algorithms generate output in near real-time and require fast algorithms;
offline algorithms are executed once every hour (or less often; used e.g. for migration; can use more resources because
of this timescale and be more efficient)
• Static versus dynamic: static assume that the available resources and requested resources
do not change over time
10.2.2 Algorithms
• Resource allocation algorithm is based on 1: available resources, 2: requested resources
• Three types:
– Integer Linear Programming (ILP)
maximize c^T x
subject to Ax <= b, x >= 0 and x ∈ Z^n
∗ x contains the decision variables (e.g. binary: whether or not a resource should be assigned)
∗ Constraints are expressed by the matrix A and the vector b
∗ All these elements are calculated before the algorithm is started
∗ Advantage: an optimal solution is found, using the Simplex algorithm combined with the Branch and Bound algorithm
∗ Drawback: for large A-matrices and b-vectors, algorithm execution is very time-consuming
and can take hours or even days
∗ An ILP solver is used, for which stop criteria are specified (max time spent, solution within x% of
the optimum (typically 95, 98, 99))
∗ Important property: at each iteration an indication of optimality (how far the current
solution is from the optimum) is obtained; this does not hold for the other algorithm types
– Metaheuristics
∗ heuristical algorithm that can be applied to wide range of problems (=generic
solution algorithms)
∗ Regularly used: Tabu search, Simulated annealing, Genetic algorithm/evolutionary
algorithms, Ant-based algorithms or particle swarm algorithms
∗ Can serve as a benchmark for exact solutions and greedy algorithms
∗ Typically offline and batch request for resource allocation
– Greedy algorithms
∗ Either bin packing or knapsack filling
∗ Bin packing: different sized objects must be packed in a finite number of bins,
containers with predetermined capacity with a goal to minimize the number of bins
used (NP-hard).
∗ First fit algorithm: fast but often non-optimal algorithm, place each item into first
bin it will fit (more efficiently if first sorted = first-fit decreasing algorithm)
∗ Other is Best fit algorithm: places new objects in the fullest bin that still has
room
∗ Knapsack algorithms: special case of bin packing where the number of bins is restricted
to one and each item is characterised by a value and a volume; the algorithm aims
to maximize the total value of the items in the sack
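The greedy strategies listed above can be sketched in a few lines. This is a minimal illustration with hypothetical item sizes; the knapsack routine uses the standard 0/1 dynamic programme (an exact method for integer volumes, swapped in here because it is short, not a greedy heuristic).

```python
def first_fit(items, capacity):
    """Place each item into the first bin in which it fits."""
    bins = []
    for item in items:
        for b in bins:
            if sum(b) + item <= capacity:
                b.append(item)
                break
        else:                          # no existing bin had room
            bins.append([item])
    return bins


def first_fit_decreasing(items, capacity):
    """First fit after sorting the items from large to small."""
    return first_fit(sorted(items, reverse=True), capacity)


def best_fit(items, capacity):
    """Place each item into the fullest bin that still has room."""
    bins = []
    for item in items:
        candidates = [b for b in bins if sum(b) + item <= capacity]
        if candidates:
            max(candidates, key=sum).append(item)
        else:
            bins.append([item])
    return bins


def knapsack(items, capacity):
    """Maximum total value of (value, volume) items within one sack."""
    best = [0] * (capacity + 1)        # best[c] = max value using volume <= c
    for value, volume in items:
        for c in range(capacity, volume - 1, -1):
            best[c] = max(best[c], best[c - volume] + value)
    return best[capacity]


requests = [5, 6, 4, 5]                # e.g. CPU units requested by four VMs
ff = first_fit(requests, 10)
ffd = first_fit_decreasing(requests, 10)
bf = best_fit(requests, 10)
value = knapsack([(60, 1), (100, 2), (120, 3)], 5)
```

On this input plain first fit needs three bins, while sorting first (or choosing the fullest bin) needs only two, illustrating why the order of processing matters for these heuristics.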
10.3 Autonomic resource allocation
10.3.1 Autonomic systems: goal
• Most systems still need manual intervention by a technician if a device fails, e.g. in the case of internet
gaming
• Autonomic systems offer a solution; the name refers to the autonomic nervous system that controls all the
vital body activities
• An autonomic system is able to govern itself, self-governance
Figure 10.1: Overview of a basic control loop, used in control theory
10.3.2 Characteristics of autonomic systems
The self-CHOP characteristics (Configuration, Healing, Optimization, Protection); these are not necessary conditions for a system to be autonomic
• Self-configuration:
– A system that is able to dynamically adapt to changing environments
– (often) by policies provided by the operator
– deployment, removal of software components
– increased responsiveness
• Self-healing:
– discover, diagnose and react to disruptions
– Increased resilience
– Day-to-day operations less likely to fail
• Self-optimization:
– Tune itself to meet end-user or business needs
– The external stimulus is not a disruption (unlike self-healing)
– Actions: reallocating resources to improve overall utilization
– Improved operation efficiency
• Self-protection
– Anticipate, detect, identify and protect against external and internal threats
– Threats include unauthorized access, viruses and Denial of Service attacks
– Improved security
10.3.3 Control loops
• Control loop = fundamental concept of control theory and represents a continuous flow of
steps that are required for the management of the system
• It has a reference value it wants the system to adhere to (= the control loop’s goal)
• Based on the measured output, the controller enforces changes, resulting in an updated system output that
hopefully matches the reference value
MAPE Control loop
Figure 10.2: The MAPE Control Loop, originally presented by IBM
• Autonomic managers, a group of cooperating components that each manage their own elements
• 4 steps:
– Monitor step: aggregates, filters and manages knowledge collected by its sensors, added
to knowledge module
– Analysis step: diagnosis of monitoring results, detects disruptions in system
– Planning step: Determines set of elementary actions based on the diagnosis
– Execute step: orchestrates the executions of these planned actions
• use or generate knowledge that can be used in other steps
• Simple
• MAPE control loop manages a single element
• Not an actual autonomic system: does not specify how learning can be achieved (correct
past mistakes) nor how an operator can tune the system through high-level objectives
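The four MAPE steps can be sketched for a single managed element. Everything concrete here is an assumption made for illustration: the managed element is a server cluster, the thresholds are arbitrary, and the only actions are adding or removing one server; this is not IBM's reference design.

```python
from dataclasses import dataclass, field


@dataclass
class ManagedCluster:
    servers: int
    load: float                        # total load in "server units" (hypothetical)

    def utilisation(self):
        return self.load / self.servers


@dataclass
class MapeLoop:
    high: float = 0.8                  # assumed upper utilisation threshold
    low: float = 0.3                   # assumed lower utilisation threshold
    knowledge: list = field(default_factory=list)

    def monitor(self, cluster):
        sample = cluster.utilisation()
        self.knowledge.append(sample)  # shared knowledge module
        return sample

    def analyse(self, sample):
        if sample > self.high:
            return "overloaded"
        if sample < self.low:
            return "underloaded"
        return "ok"

    def plan(self, diagnosis, cluster):
        if diagnosis == "overloaded":
            return [("add_server", 1)]
        if diagnosis == "underloaded" and cluster.servers > 1:
            return [("remove_server", 1)]
        return []

    def execute(self, actions, cluster):
        for name, amount in actions:
            cluster.servers += amount if name == "add_server" else -amount

    def step(self, cluster):
        actions = self.plan(self.analyse(self.monitor(cluster)), cluster)
        self.execute(actions, cluster)
        return actions


cluster = ManagedCluster(servers=2, load=3.0)
loop = MapeLoop()
history = [loop.step(cluster) for _ in range(5)]
```

The loop converges to four servers and then stays idle; as noted above, nothing in it learns from past decisions, which is the gap the later control loops try to close.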
OODA Control loop
• Observe Orient Decide and Act
• Resembles human thinking (origin is military)
• Roughly same as MAPE but more dynamic
Figure 10.3: The OODA Control Loop, presented by John Boyd
• loop = set of interacting control loops
• Knowledge is filtered by orient step, then shaped by decision step and then executed by
action step
• Each step to provide feedback to correct past mistakes
• Parallel process: observing does not have to stop while analysis is ongoing
• Downside: oversimplifies the way the human brain works, which the control loop aims to mimic, e.g.
it has no shortcuts in decision making
FOCALE Control Loops
• Foundation Observation Comparison Action and Learning Environment
• Based on OODA, tries to overcome its difficulties
• Connects all components through a bus (communication substrate, allows publishing without
requiring details of the receiver)
• Observe: Knowledge retrieved from managed resources
• Normalize: fed to a model-based translation process, facilitate translation of device specific
info into normalized form
• Compare: This normalized data is analysed to determine the current state of the system
and compared to the desired state of the system
• Allows shortcutting if needed based on three dynamic control loops (resemble human brain
actions):
– Reactive loops: immediate responses based on external stimuli, allow shortcuts in
order to perform high-priority and urgent tasks, run at the highest frequency and circumvent
the decide, reason and learn components
– Deliberative loops: receive data from and can send commands to the reactive
processes, use long + short term memory to create more elaborate plans of action, run at
lower frequencies, circumvent the learning component
– Reflective loops: supervise the deliberative processes, study decisions made in the past
+ analyse them, conclusions are used to prevent sub-optimal actions from being taken
again in the future, run at the lowest frequency
Figure 10.4: FOCALE, featuring three distinct control loops
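The different frequencies of the three loop types can be sketched with a simple tick counter. This is a minimal sketch under assumptions: the periods (5 and 20 ticks), the load limits and the class name are hypothetical and only illustrate that reactive loops run most often and reflective loops least often.

```python
class FocaleLoops:
    """Toy scheduler for the three loop types; periods and rules are assumptions."""

    def __init__(self, deliberative_period=5, reflective_period=20):
        self.deliberative_period = deliberative_period
        self.reflective_period = reflective_period
        self.limit = 0.9      # reactive trigger, tuned by the deliberative loop
        self.history = []     # memory of observed loads
        self.decisions = []   # past deliberative decisions, studied by reflection
        self.log = []

    def reactive(self, load):
        # Highest frequency: immediate response, no decide/reason/learn step.
        if load > self.limit:
            self.log.append("reactive: shed load")

    def deliberative(self):
        # Lower frequency: uses memory to plan and sends a command (a new
        # limit) to the reactive loop.
        recent = self.history[-self.deliberative_period:]
        avg = sum(recent) / len(recent)
        self.limit = 0.8 if avg > 0.7 else 0.9
        self.decisions.append(self.limit)
        self.log.append(f"deliberative: limit={self.limit}")

    def reflective(self):
        # Lowest frequency: reviews past decisions of the deliberative loop.
        changes = sum(a != b for a, b in zip(self.decisions, self.decisions[1:]))
        self.log.append(f"reflective: {changes} limit changes reviewed")

    def tick(self, t, load):
        self.history.append(load)
        self.reactive(load)
        if t % self.deliberative_period == 0:
            self.deliberative()
        if t % self.reflective_period == 0:
            self.reflective()


loops = FocaleLoops()
for t in range(1, 21):
    loops.tick(t, load=0.95 if t % 7 == 0 else 0.5)
print(loops.log)
```

The reactive loop never waits for the slower loops, which is the shortcut the notes describe; the slower loops only adjust the rule it applies.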
Control loop implementations
• Figure 10.5
• MAPE control loop mapped to real-time architecture for autonomic computing systems, use
case: CPU load of different virtual servers on same hardware
• Control loop monitors the performance of an Application Server Cluster and, if performance
degrades, provisions the application with more servers (through an execution engine)
• By controlling the CPU ratio allocated to each server, the load can be managed
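One iteration of such a MAPE loop could look as follows. This is a hedged sketch, not the architecture of Figure 10.5: the function names, the 0.8 load threshold, the `vmN` server names and the equal redistribution of load over the servers are all assumptions made for illustration.

```python
def monitor(cluster):
    # Monitor: read the CPU load (0.0-1.0) of every virtual server.
    return dict(cluster)


def analyse(loads, high=0.8):
    # Analyse: performance counts as degraded when the mean load is high.
    mean = sum(loads.values()) / len(loads)
    return mean > high


def plan(degraded):
    # Plan: request one extra server when degraded, otherwise none.
    return {"add_servers": 1 if degraded else 0}


def execute(cluster, actions):
    # Execute: stand-in for the execution engine. Assumes the total load is
    # spread equally over old and new servers.
    n_new = actions["add_servers"]
    if n_new == 0:
        return cluster
    total = sum(cluster.values())
    names = list(cluster) + [f"vm{len(cluster) + i + 1}" for i in range(n_new)]
    return {name: total / len(names) for name in names}


def mape_iteration(cluster):
    loads = monitor(cluster)
    return execute(cluster, plan(analyse(loads)))


cluster = {"vm1": 0.9, "vm2": 0.9}
cluster = mape_iteration(cluster)
print(cluster)
```

In a real deployment the loop would run periodically and the execute step would call the provisioning system instead of returning a new dictionary.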
10.3.4 Architecture for distributed autonomic systems
Architectural Overview
• Architecture of distributed autonomic system (management domain cooperation, e.g. Telenet and Belgacom ISP)
• Cooperate without sharing all knowledge
• Semi-hierarchical organization
• Autonomic Manager (= Autonomic Element, AE)
• AEs grouped together in cooperating communities/clusters
• Composed of a set of Managed Entities (= managed resources or AEs; child AEs are guided
by their parent)
• Parent-child relationship inside a management domain, results in a logical tree
• Root AE (centralized) governs the entire network; bottom-layer AEs (distributed) each manage
a set of managed resources
• Parent AEs of each management domain maintain a bidirectional relationship with each other
according to an unambiguous interaction agreement
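The semi-hierarchical organization can be sketched as a tree of AEs per management domain, with a peer link between the top-level AEs. The class, the `agree` helper and the resource names below are hypothetical; the sketch only shows that each domain keeps its own tree and that peers do not see each other's resources.

```python
class AutonomicElement:
    """Toy AE node; the structure is an illustrative assumption."""

    def __init__(self, name, resources=None):
        self.name = name
        self.resources = list(resources or [])  # managed resources (bottom layer)
        self.children = []                      # child AEs guided by this AE
        self.peers = []                         # top-level AEs of other domains

    def add_child(self, child):
        self.children.append(child)
        return child

    def all_resources(self):
        # Walk the logical tree below this AE.
        found = list(self.resources)
        for child in self.children:
            found.extend(child.all_resources())
        return found


def agree(a, b):
    # Bidirectional relationship between the top-level AEs of two domains.
    a.peers.append(b)
    b.peers.append(a)


root_a = AutonomicElement("domainA-root")
root_a.add_child(AutonomicElement("A-access", resources=["router1", "router2"]))
root_a.add_child(AutonomicElement("A-core", resources=["switch1"]))
root_b = AutonomicElement("domainB-root", resources=["router9"])
agree(root_a, root_b)

print(root_a.all_resources())
print([p.name for p in root_a.peers])
```

Note that `all_resources()` never follows the `peers` link, mirroring the idea that domains cooperate without sharing all knowledge.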
Figure 10.5: A control loop implementation in a cloud computing environment
Autonomic Element overview
• Functionality roughly like MAPE control loop
• Consists of Management components
• Components are loosely coupled, so the structure is less strict than the MAPE loop
• Components:
– Monitoring Probe: the monitor step of MAPE, interacts with the managed resources through
the Resource Interface, stores gained knowledge in the Information and Data Model
– Context Manager: responsible for interacting with other AEs (publishing to and sending
requests to remote AEs)
– Information and Data Model: local database that stores all available knowledge
– State Comparator and Planning Agent: the Analysis and Planning steps of MAPE
– Resource Endpoint: simple interface to communicate with managed resources, e.g. to push
newly deduced configurations
– Policy Framework: overall behaviour of the AE, separates business logic from the actual
system; policies relate to security concerns, performance concerns (Connection B can only
tolerate up to Y%) or others...
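How the components of an AE could work together is sketched below. All method names, the dictionary-based resource and the `max_utilisation` policy key are assumptions chosen for illustration; the notes do not prescribe any interface.

```python
class AutonomicElementSketch:
    """Toy AE wiring the listed components together (illustrative only)."""

    def __init__(self, resource, policy):
        self.resource = resource  # stand-in for the managed resource (a dict)
        self.policy = policy      # Policy Framework: high-level limits
        self.model = {}           # Information and Data Model (local knowledge)
        self.published = []       # messages the Context Manager shared with other AEs

    def monitoring_probe(self):
        # Monitor step: read the resource and store the knowledge locally.
        self.model["utilisation"] = self.resource["utilisation"]

    def state_comparator_and_planner(self):
        # Analysis + planning: compare the current state with the policy limit.
        over = self.model["utilisation"] - self.policy["max_utilisation"]
        return {"reduce_by": over} if over > 0 else None

    def resource_endpoint(self, plan):
        # Push the new configuration to the managed resource.
        self.resource["utilisation"] -= plan["reduce_by"]

    def context_manager(self, message):
        # Publish knowledge for other AEs (here: only recorded in a list).
        self.published.append(message)

    def run_once(self):
        self.monitoring_probe()
        plan = self.state_comparator_and_planner()
        if plan:
            self.resource_endpoint(plan)
            self.context_manager({"event": "reconfigured", **plan})
        return plan


link = {"utilisation": 95}  # percent, hypothetical connection
ae = AutonomicElementSketch(link, policy={"max_utilisation": 80})
print(ae.run_once())
print(link)
```

Keeping the policy as plain data, separate from the planning code, reflects the role of the Policy Framework: the limit can be changed without touching the other components.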