Distributed System Lecture 4: Distributed Real-time Systems

Lecture 4: Distributed Real-time Systems
Mikael Asplund
Real-Time Systems Laboratory
Department of Computer and Information Science
Linköpings universitet
Sweden

Distributed System
• Definition:
... a system of multiple autonomous processing elements, cooperating in a common purpose or to achieve a common goal.

Advantages
• Performance
• Distribution
• Reliability
• Scalability
• Sharing of resources
• Communication

Challenges
• Transparency
• Communication
• Performance
• Heterogeneity
• Openness
• Reliability
• Security

Time and Ordering

Definition
The second is the duration of 9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the caesium 133 atom at a temperature of 0 Kelvin.

Modeling time
• A set of instants T that is isomorphic to the set of real numbers, that is:
– a field
– ordered
– Dedekind complete
• Alternatives:
– discrete time
– dense time (rational numbers)
• Ignore relativity!

Terminology
• A duration is the interval between two instants
• An event is an action at an instant of time
– events are thus partially ordered!

Standards
• International atomic time (TAI)
– proper time on the Earth's geoid
– Chronoscopic (no discontinuities)
– Starts from January 1st 1958
• Coordinated Universal Time (UTC)
– Astronomical time
– Based on TAI (with leap seconds)

Clocks
• A device for measuring time
• The granularity of the clock is the smallest duration that can be measured
• A timestamp of an event e is the state of the clock at the instant of the event, denoted clock(e).

Global vs. local clocks

Synchronous vs. asynchronous
• A distributed system S is a set of sequential processes
– p1, p2, …, pn
• System is synchronous if relative speeds and delays are bounded: whenever pi makes one step, pj makes at most n steps (for some fixed n ≥ 1), and there is a bound on message delays
• System is asynchronous if no such bounds exist

Ordering of events
• Events:
– Local execution steps
– Send message
– Receive message
[Figure: space-time diagram with events p0 to p5 and q0 to q4]

Happened-before ordering
• a → b, if a and b are events at the same process and a happened before b
• a → b, if a is the event of sending a message m and b is the event of the same message being received
• Transitive, if a → b and b → c then a → c
[Figure: space-time diagram with events p0 to p5 and q0 to q4]

Happened-before ordering
• Usually called causal order
• Cannot be realised using physical clocks
– Why?
• Solution: use logical clocks
– Each process has a logical clock C
– Requirement: if a → b, then C(a) < C(b)

Lamport's logical clocks
• Let each process keep a counter C
• For every internal event: increment C
• For each message that is sent: increment C and piggyback a timestamp t=C
• For every message that is received:
Let C = max(t,C) + 1
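
The logical clock rules can be sketched in a few lines of Python. This is only an illustrative sketch: the class and method names are hypothetical, and the "+ 1" on send and on receive is the usual Lamport convention, assumed here so that a receive always gets a strictly larger timestamp than the matching send.

```python
class LamportClock:
    """Logical clock of a single process (illustrative sketch)."""

    def __init__(self):
        self.c = 0  # the counter C

    def internal_event(self):
        # Every internal event increments C.
        self.c += 1
        return self.c

    def send(self):
        # Sending counts as an event: increment C, then piggyback t = C.
        self.c += 1
        return self.c

    def receive(self, t):
        # Move past both the local counter and the piggybacked timestamp.
        self.c = max(t, self.c) + 1
        return self.c


p, q = LamportClock(), LamportClock()
a = p.internal_event()  # internal event at p
t = p.send()            # p sends a message carrying timestamp t
c = q.receive(t)        # q receives that message
d = q.internal_event()  # internal event at q
print(a, t, c, d)       # timestamps increase along the causal chain
```

Since a → send → receive → d is a causal chain, the requirement C(a) < C(b) for a → b shows up as strictly increasing timestamps. The converse does not hold: a smaller timestamp does not imply that one event happened before the other.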

Fault-tolerant Distributed Systems

Dependability & Distribution
• Making systems fault-tolerant typically uses redundancy
– Redundancy in space leads to distribution
– But distributed systems are not necessarily fault-tolerant!

Achieving availability
• Active replication
– Group membership
• Passive replication
– Primary – backup
• What do I need to implement it?
– Message ordering
– Agreement among replicas

Fault models revisited
• Node failures
– Crash
– Omission
– Byzantine (arbitrary)
• Channel failures
– Crash (and potential partitions)
– Message loss
– Erroneous/arbitrary messages

”Chicken and egg” problem
• Replication is useful in presence of failures if there is a consistent common state among replicas
– What happens when a replica fails?
• To get consistency, processes need to communicate their state via broadcast
• But broadcast algorithms are distributed algorithms that run on every node
– also affected by failures…

A useful broadcast
• Reliable broadcast
– Agreement: All non-crashed processes agree on messages delivered
• I.e. for any message m, if a correct process delivers m, then every correct process delivers m.
– Integrity: No spurious messages
• I.e. no erroneous, duplicated or created messages
– Validity: All messages broadcast by non-crashed processes are delivered

How to implement?
• The first step is to separate the underlying network (transport) and the broadcast mechanism
• Distinguish between receipt and delivery of a message

Common channel assumptions
• Communication channel assumptions
– No link failures lead to partition
– Send does not duplicate or change messages
– Receive does not ”invent” messages

Reliable broadcast
• Within every process p
– Execute broadcast(m) of message m by:
• adding sender(m) and a unique ID as a header to the message m
• send(m) to all neighbours including itself
– When receive(m):
• if previously not executed deliver(m) then
  • if sender(m) ≠ p then send(m) to all neighbours
  • deliver(m)

Failures
• What happens if p fails
– Directly after a receipt
– While relaying
– Before sending the message
– After sending to some, but not all neighbours
• Prove correctness of algorithm by proving the necessary properties in:
– Validity
– Integrity
– Agreement
– Order
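
The relay-then-deliver rule of the reliable broadcast algorithm can be tried out in a small single-threaded simulation. This is a sketch under simplifying assumptions that are not in the slides: every process is a neighbour of every other, inboxes are in-memory queues, and no process crashes. The names Process, send_all and run are hypothetical.

```python
from collections import deque
from itertools import count


class Process:
    def __init__(self, pid, network):
        self.pid = pid
        self.network = network  # dict pid -> Process; all are neighbours here
        self.inbox = deque()    # received but not yet handled messages
        self.delivered = []     # payloads, in delivery order
        self.seen = set()       # (sender, msg_id) pairs already delivered
        self._ids = count()     # source of unique IDs for own messages

    def send_all(self, msg):
        # send(m) to all neighbours including itself
        for proc in self.network.values():
            proc.inbox.append(msg)

    def broadcast(self, payload):
        # Header: sender(m) plus a unique ID.
        self.send_all((self.pid, next(self._ids), payload))

    def receive(self, msg):
        sender, msg_id, payload = msg
        if (sender, msg_id) in self.seen:
            return              # deliver(m) was already executed
        self.seen.add((sender, msg_id))
        if sender != self.pid:
            self.send_all(msg)  # relay before delivering
        self.delivered.append(payload)


def run(network):
    # Handle queued messages until every inbox is empty.
    progress = True
    while progress:
        progress = False
        for proc in network.values():
            while proc.inbox:
                proc.receive(proc.inbox.popleft())
                progress = True


network = {}
for pid in ("p", "q", "r"):
    network[pid] = Process(pid, network)
network["p"].broadcast("hello")
run(network)
print({pid: proc.delivered for pid, proc in network.items()})
```

Each process delivers the message exactly once even though relaying creates several copies. The failure cases (for example a crash after sending to some, but not all neighbours) are not modelled here; the relay step is what lets the remaining correct processes still reach each other in those cases.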
Brake by wire
The consensus problem
• Processes p1,…, pn take part in a decision
– Each pi proposes a value vi
– All correct processes decide on a common value v
that is equal to one of the proposed values
• Desired properties
– Termination: Every correct process eventually
decides
– Agreement: No two correct processes decide
differently
– Validity: If a process decides v then the value v was
proposed by some process
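
The three properties can be written down as a check over one finished run. This is a hedged sketch, not part of any consensus algorithm: the function name and the dictionary layout are assumptions, and termination can only be approximated as "every correct process had decided by the end of the observed run".

```python
def check_consensus(proposals, decisions, correct):
    """Return the names of the consensus properties violated in one run.

    proposals: dict pid -> value proposed by that process
    decisions: dict pid -> value decided (only processes that decided)
    correct:   set of pids that did not fail during the run
    """
    violated = set()
    # Termination: every correct process has decided.
    if any(pid not in decisions for pid in correct):
        violated.add("termination")
    # Agreement: no two correct processes decide differently.
    decided = {decisions[pid] for pid in correct if pid in decisions}
    if len(decided) > 1:
        violated.add("agreement")
    # Validity: every decided value was proposed by some process.
    if not set(decisions.values()) <= set(proposals.values()):
        violated.add("validity")
    return violated


proposals = {"p1": 0, "p2": 1, "p3": 1}
everyone = {"p1", "p2", "p3"}
ok = check_consensus(proposals, {"p1": 1, "p2": 1, "p3": 1}, everyone)
split = check_consensus(proposals, {"p1": 0, "p2": 1}, everyone)
invented = check_consensus(proposals, {"p1": 2, "p2": 2, "p3": 2}, everyone)
print(ok, split, invented)
```

The second run violates both agreement and termination, the third violates validity because the value 2 was never proposed.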

Basic impossibility result
[Fischer, Lynch and Paterson 1985]
• There is no deterministic algorithm solving the consensus problem in an asynchronous distributed system with a single crash failure.
• Why?

Assume synchrony
• If a node does not respond within time t, it will not respond at time t+d
• Partial synchrony
– Bounds exist but are not known
• Powerful abstraction:
– Unreliable failure detectors
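
One common way to approximate an unreliable failure detector is a timeout on heartbeats. The sketch below assumes heartbeat messages and an explicit `now` argument instead of a real clock; both are illustrative choices, as are the class and method names.

```python
class TimeoutFailureDetector:
    """Suspects a process when no heartbeat arrived within `timeout`."""

    def __init__(self, processes, timeout):
        self.timeout = timeout
        # Time of the latest heartbeat seen from each process.
        self.last_heard = {p: 0.0 for p in processes}

    def heartbeat(self, process, now):
        self.last_heard[process] = now

    def suspects(self, now):
        # A suspicion is only a guess: the process may just be slow.
        return {p for p, t in self.last_heard.items()
                if now - t > self.timeout}


fd = TimeoutFailureDetector(["p1", "p2"], timeout=2.0)
fd.heartbeat("p1", now=1.0)
fd.heartbeat("p2", now=1.0)
fd.heartbeat("p1", now=3.0)
early = fd.suspects(now=3.5)  # p2 has been silent for 2.5 time units
fd.heartbeat("p2", now=4.0)   # p2 was only slow, not crashed
late = fd.suspects(now=4.5)
print(early, late)
```

In the example p2 is suspected and later trusted again, which is what makes the detector unreliable: in a system without known bounds, a timeout cannot distinguish a crashed process from a slow one.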

Total order broadcast
• Total order broadcast
– Reliable broadcast + total order property
• Total order property
– Let m and m’ be any two messages.
– Let p be a (correct) process that delivers m without having delivered m’
– Then no (correct) process delivers m’ before m
• Total order property by consensus in
– A synchronous network, with
– Reliable broadcast

Byzantine generals
• Theorem: There is an upper bound t for the number of Byzantine failures compared to the size of the network N
– N ≥ 3t+1
• Gives a t+1 round algorithm for solving consensus in a synchronous network

Scenario 1
• G and L1 are correct, L2 is faulty
Scenario 2
• G and L2 are correct, L1 is faulty
Scenario 3
• L1 and L2 are correct, G is faulty

2-round algorithm
• … does not work with t=1, N=3!
• Seen from L1, scenario 1 and 3 are identical, so if L1 decides 1 in scenario 1 it will decide 1 in scenario 3
• Similarly for L2, if it decides 0 in scenario 2 it decides 0 in scenario 3
• L1 and L2 do not agree in scenario 3!

Clock synchronization

Clock drift
• Assume the existence of a reference clock r
• Clock drift for clock c:
$\text{drift} = \dfrac{c(e_1) - c(e_2)}{r(e_1) - r(e_2)}$
• Drift rate:
$\rho = |\text{drift} - 1|$

Precision
• Offset between clocks c_i and c_j:
$\text{offset}_{ij} = |c_i(e) - c_j(e)|$
• Precision of a group of clocks
$P = \max_{i,j \in [1,n]} \text{offset}_{ij}$
• Bounding precision -> internal synchronization
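
The drift and precision definitions translate directly into code. The sketch below assumes that clock readings are plain floating-point seconds and that the drift rate is written ρ; the function names and the sample numbers are made up for illustration.

```python
from itertools import combinations


def drift(c_e1, c_e2, r_e1, r_e2):
    # Interval measured by clock c divided by the same interval on the
    # reference clock r.
    return (c_e1 - c_e2) / (r_e1 - r_e2)


def drift_rate(d):
    # rho = |drift - 1|; a perfect clock has drift 1 and drift rate 0.
    return abs(d - 1)


def precision(readings):
    # Largest pairwise offset |c_i(e) - c_j(e)| for the same event e.
    # Needs at least two readings.
    return max(abs(a - b) for a, b in combinations(readings, 2))


# Clock c counts 100.001 s while the reference clock counts 100 s.
d = drift(c_e1=100.0, c_e2=200.001, r_e1=100.0, r_e2=200.0)
rho = drift_rate(d)

# Three clocks read at the same event e.
P = precision([10.000, 10.002, 9.999])
print(d, rho, P)
```

Here the drift rate comes out at roughly 10^-5 and the precision of the group at roughly 3 ms.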

Accuracy
• The accuracy of a clock is the offset to the reference clock
• Bounding accuracy -> external synchronization
• External synchronization implies internal synchronization

Faults
• Excessive drift
• Clock reading errors
• Byzantine faults (dual-faced clocks)

How often
• ρ: Drift rate
• R: Time between synchronizations
• P: Precision after synchronization
• Choose R so that $P + R \cdot 2\rho \le P_{\text{required}}$

Time server
[Figure: process p and the time server S exchange the messages mr and mt]
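
As a worked example of choosing the resynchronization interval, assume the condition reads P + R·2ρ ≤ P_required, that is, two clocks can drift apart at up to 2ρ between synchronizations. Solving for R gives R ≤ (P_required − P) / (2ρ). The function name and the numbers below are illustrative assumptions.

```python
def max_resync_interval(rho, p_after_sync, p_required):
    """Largest R that satisfies P + R * 2 * rho <= P_required."""
    if rho <= 0:
        raise ValueError("drift rate must be positive")
    if p_required <= p_after_sync:
        raise ValueError("required precision is not reachable")
    return (p_required - p_after_sync) / (2 * rho)


# Drift rate 1e-5, 1 ms precision right after synchronization,
# 5 ms precision required at all times.
R = max_resync_interval(rho=1e-5, p_after_sync=0.001, p_required=0.005)
print(R)  # about 200 seconds between synchronizations
```

A smaller drift rate or a looser precision requirement allows a longer interval; a synchronization algorithm that leaves a larger P shortens it.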

Network Time Protocol
[Figure: hierarchy of time servers labelled with strata 1, 2 and 3]
Note: Arrows denote synchronization control, numbers denote strata.

Averaging
• Mean of clocks excluding t fastest and t slowest
[Figure: clock readings and the resulting new clock value]
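
The averaging rule can be sketched as a trimmed mean. The sketch assumes that "fastest" and "slowest" mean the t largest and t smallest readings, and that more than 2t readings are available; the function name is hypothetical.

```python
def fault_tolerant_average(readings, t):
    """Mean of the readings after dropping the t largest and t smallest."""
    if t < 0 or len(readings) <= 2 * t:
        raise ValueError("need more than 2t readings")
    kept = sorted(readings)[t:len(readings) - t]
    return sum(kept) / len(kept)


# Five clocks, one of them (50.0) faulty; tolerate t = 1 fault.
new_value = fault_tolerant_average([10.0, 10.2, 9.9, 50.0, 10.1], t=1)
print(new_value)  # close to 10.1, the faulty reading has no influence
```

Dropping t readings at each end means that up to t arbitrarily wrong clocks cannot pull the new clock value outside the range of the correct ones.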