TDDB47 Real-time Systems
Lecture 9: Distributed Systems
Simin Nadjm-Tehrani
Real-time Systems Laboratory
Department of Computer and Information Science
Linköping University
Undergraduate course on Real-time Systems
Linköping
33 pages
Autumn 2004

Reading material
• Chapter 14 of Burns & Wellings
• Chapter 8 of Hermann Kopetz's book, Real-Time Systems: Design Principles for
Distributed Embedded Applications
• [Article by Ramamritham, Stankovic,
and Zhao, IEEE Transactions on
Computers, Volume 38(8), August
1989.]

Distributed Systems
• Overview of some general problems and how
real-time systems are affected by them
• Looking back at the distributed scheduling
problem with offloading of tasks that we
studied earlier
Reasons for distribution
• Locality, organisation
– Engine control, brake system,
gearbox control, airbag,…
– An extension of modularisation, and
means for fault containment
• Load sharing
– Web services, search, parallelisation
of heavy duty computations
Local control
Simplistic view:
• Although data is distributed, each local
controller can perform its computations
locally if data is properly organised
• Design modules with high cohesion and
low interaction
• But what are the requirements on data
transmission for data/commands that
have to be shared?

Sharing the load
Simplistic view:
• Guarantee that a node can deal with
what it accepts
• Spread the load so that tasks are
(globally) serviced in a best effort
manner
• But what are the fundamental issues for
guaranteeing (global) behaviour in
distributed systems?
Dependability & Distribution
• Making systems fault-tolerant typically
uses redundancy
• Redundancy in space leads to
distribution
• But distributed systems are not
necessarily fault-tolerant!

Brake-by-wire
Justifying safety
• Redundancy: Having distributed sensors
and actuators makes brake control more
fault-tolerant
Justifying availability
• Primary backup
• Active replication
Local decision or
distributed decision?
• What if one node is acting differently
(distributed decision) or getting the
signal incorrectly (local decision)?
Distributed systems & FT
– Introduce new complications
• no global clock
• richer failure models
+ Replication and group mechanisms
• transparency in treatment of faults
Recall from earlier lecture
• Node failures
– Crash
– Omission
– Byzantine (arbitrary)
• Channel failures
– Crash (and potential partitions)
– Message loss
– Erroneous/arbitrary messages
“Chicken and egg” problem
• Replication is useful in the presence of
failures if there is a consistent common
state among replicas
• To get consistency, processes need to
communicate their state via broadcast
• But broadcast algorithms are
distributed algorithms that run on every
node
• They are also affected by failures...
A useful (weak) broadcast
• Reliable broadcast
– all non-crashed processes agree on
messages delivered (agreement)
– no spurious messages (integrity)
– all messages broadcast by
non-crashed processes delivered
(validity)
All or none!
How to implement?
• The first step is to separate the
underlying network (transport) and the
broadcast mechanism
• Distinguish between receipt and
delivery of a message
[Figure: protocol layers, top to bottom: Application layer, Broadcast
mechanism, Transport. Interface labels: Broadcast, Send, Deliver, Receive, Send]
Reliable broadcast
Within every process p
• Execute broadcast(m) by:
– adding sender(m) and a unique ID as a
header to the message m (building m)
– send(m) to all neighbours including itself
• When receive(m):
– if previously not executed deliver(m) then
• if sender(m) /= p then send(m) to all
neighbours
• deliver(m)

What if p fails?
• Directly after a receipt?
• While relaying?
• After sending to some but
not all neighbours?
This is where failure models
come in...
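
The relay rule above can be tried out in a small simulation. The following is a minimal sketch, not the course's reference implementation: the `Network` class, its FIFO queue and the line topology are assumptions made for illustration, and the transport is assumed never to lose, duplicate or invent messages.

```python
from collections import deque


class Process:
    """One process p running the relay-based broadcast described above."""

    def __init__(self, pid, network):
        self.pid = pid
        self.network = network      # hypothetical in-memory transport
        self.delivered = []         # payloads handed to the application
        self.seen = set()           # (sender, id) headers already delivered
        self.next_id = 0
        self.crashed = False

    def broadcast(self, payload):
        # Build m: header (sender, unique id) plus the payload.
        m = (self.pid, self.next_id, payload)
        self.next_id += 1
        self.send_to_neighbours(m, include_self=True)

    def send_to_neighbours(self, m, include_self):
        targets = list(self.network.neighbours[self.pid])
        if include_self:
            targets.append(self.pid)
        for q in targets:
            self.network.send(q, m)

    def receive(self, m):
        sender, msg_id, payload = m
        if (sender, msg_id) in self.seen:
            return                  # deliver(m) was already executed
        self.seen.add((sender, msg_id))
        if sender != self.pid:
            self.send_to_neighbours(m, include_self=False)  # relay first
        self.delivered.append(payload)                      # then deliver


class Network:
    """Assumed transport: FIFO, no loss, no duplication, no invention."""

    def __init__(self, neighbours):
        self.neighbours = neighbours
        self.queue = deque()
        self.procs = {pid: Process(pid, self) for pid in neighbours}

    def send(self, dest, m):
        self.queue.append((dest, m))

    def run(self):
        while self.queue:
            dest, m = self.queue.popleft()
            if not self.procs[dest].crashed:
                self.procs[dest].receive(m)


# Line topology 0 - 1 - 2: process 2 only hears from 0 via the relay at 1.
net = Network({0: [1], 1: [0, 2], 2: [1]})
net.procs[0].broadcast("brake")
net.run()
delivered = {pid: p.delivered for pid, p in net.procs.items()}
print(delivered)
```

Setting `crashed = True` on process 1 before `net.run()` is one way to explore the "What if p fails?" questions: in this topology process 2 then never receives the message.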
Algorithms for broadcast
• Correctness: prove validity,
integrity, agreement, order
Do not forget the failure model!
Typical assumptions:
– no link failures leading to partition
– send does not duplicate or change
messages
– receive does not “invent” messages

Relating states
• Needs a notion of time (precedence) in
distributed systems
• A distributed system S is a set of
sequential processes p1, p2, …, pn
– S is Synchronous: whenever pi makes one
step, pj makes n (n ≥ 1) steps, and there is a
bound on message delays
– S is Asynchronous if no such bounds exist
The consensus problem
• Processes p1,…,pn take part in a decision
• Each pi proposes a value vi
• All correct processes decide on a
common value v that is equal to one of
the proposed values
Desired properties
• Every correct process eventually decides
(Termination)
• No two correct processes decide
differently (Agreement)
• If a process decides v then the value v
was proposed by some process
(Validity)
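
Agreement and validity are safety properties, so they can be checked on the outcome of a single finished run. The sketch below is only an illustration; the function name `check_consensus` and the dictionary format are assumptions. Termination is a liveness property and is represented here only by every correct process having an entry in `decisions`.

```python
def check_consensus(proposals, decisions):
    """Check agreement and validity for one finished run.

    proposals: dict process -> proposed value (all processes)
    decisions: dict process -> decided value (correct processes only)
    Returns the pair (agreement, validity).
    """
    decided = set(decisions.values())
    agreement = len(decided) <= 1                    # no two decide differently
    validity = decided <= set(proposals.values())    # decided values were proposed
    return agreement, validity


proposals = {"p1": 0, "p2": 1, "p3": 1}
ok_run = check_consensus(proposals, {"p1": 1, "p2": 1, "p3": 1})
split_run = check_consensus(proposals, {"p1": 0, "p2": 1, "p3": 1})
invented = check_consensus(proposals, {"p1": 7, "p2": 7, "p3": 7})
print(ok_run, split_run, invented)
```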
Basic impossibility result
[Fischer, Lynch and Paterson 1985]
There is no deterministic algorithm
solving the consensus problem in an
asynchronous distributed system with a
single crash failure

A way around it
Assume Synchrony:
• Distributed computations proceed in
rounds initiated by pulses
• Pulses implemented using local physical
clocks, synchronised assuming bounded
message delays
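
To see how rounds help, here is a minimal sketch of a round-based consensus algorithm for crash failures. It is not taken from the slides: it follows the well-known flooding scheme (often called FloodSet), in which every process forwards all values it knows for f+1 rounds and then decides the minimum. The `crashes` fault-injection format is a hypothetical convenience for this example.

```python
def floodset(proposals, f, crashes=None):
    """Simulate f+1 synchronous rounds of value flooding.

    proposals: dict pid -> proposed value
    f: maximum number of crash failures to tolerate
    crashes: dict pid -> (round, recipients); the process crashes in that
             round after sending only to `recipients` (assumed format)
    Returns dict pid -> decision for the processes that never crashed.
    """
    crashes = crashes or {}
    known = {p: {v} for p, v in proposals.items()}
    alive = set(proposals)
    for rnd in range(1, f + 2):                  # rounds 1 .. f+1
        inbox = {p: set() for p in proposals}
        for p in sorted(alive):
            if p in crashes and crashes[p][0] == rnd:
                recipients = crashes[p][1]       # partial send, then crash
            else:
                recipients = proposals.keys()
            for q in recipients:
                inbox[q] |= known[p]
        alive -= {p for p, (r, _) in crashes.items() if r == rnd}
        for p in alive:                          # end of round: merge inbox
            known[p] |= inbox[p]
    return {p: min(known[p]) for p in alive}


# Process 1 crashes in round 1 after reaching only process 2.
props = {1: 0, 2: 1, 3: 1}
decisions = floodset(props, f=1, crashes={1: (1, [2])})
too_short = floodset(props, f=0, crashes={1: (1, [2])})
print(decisions, too_short)
```

With two rounds the survivors agree; with a single round process 2 has seen the value 0 but process 3 has not, so they decide differently.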
Architectural support
• Time-triggered architecture
[Kopetz et al.]

Communication protocol
• Time division multiple access (TDMA)
• Message Descriptor List (MEDL): allocates a
pre-defined slot within which each node
can send its (pre-defined) message
[Figure: a TDMA round, with one slot each for Node 1, …, Node n-1, Node n,
repeated in the next round]
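
A static slot table of this kind can be sketched in a few lines. This is only an illustration under simplifying assumptions (equal slot lengths, a perfectly synchronised global time in microseconds); the function names are hypothetical and do not come from TTP.

```python
def tdma_schedule(nodes, slot_length_us):
    """Static table: node -> (slot start, slot end) offsets within one
    TDMA round, in microseconds. Equal-length slots are an assumption."""
    return {node: (i * slot_length_us, (i + 1) * slot_length_us)
            for i, node in enumerate(nodes)}


def owner_at(schedule, t_us):
    """Return the node that may transmit at global time t_us."""
    round_length = max(end for _, end in schedule.values())
    offset = t_us % round_length        # position inside the current round
    for node, (start, end) in schedule.items():
        if start <= offset < end:
            return node


schedule = tdma_schedule(["Node 1", "Node 2", "Node 3"], slot_length_us=200)
print(owner_at(schedule, 0), owner_at(schedule, 450), owner_at(schedule, 650))
```

Because the table is fixed before run time, every node can tell from the clock alone who is allowed to send, which is what lets a receiver treat a silent slot as a sign of failure.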
Temporal firewall
• Provides temporally accurate state
information (via clock synchronisation)
• When the data is no longer valid, it can
no longer be exchanged

Replication & failure detection
• Separates dealing with channel faults
from dealing with node faults
[Figure: node architecture showing Host, CNI, CC and BG components]
BG: Bus Guardian
CNI: Communication Network Interface
Byzantine agreement
• A difficult problem solved in 1980 by
Pease, Shostak and Lamport
• Each process may fail in an arbitrary
way (may be malicious)
• Theorem: There is an upper bound t for
the number of byzantine failures
compared to the size of the network N,
N ≥ 3t+1
• Gives a t+1 round algorithm for solving
consensus in a synchronous network

Scenario 1
• G and L1 are correct, L2 is faulty
[Figure: G sends 1 to L1 and L2; L1 relays “G said 1” to L2;
L2 relays “G said 0” to L1]

Scenario 2
• G and L2 are correct, L1 is faulty
[Figure: G sends 0 to L1 and L2; L2 relays “G said 0” to L1;
L1 relays “G said 1” to L2]

Scenario 3
• The general is faulty!
[Figure: G sends 1 to L1 and 0 to L2; L1 relays “G said 1” to L2;
L2 relays “G said 0” to L1]

2-round algorithm
… does not work with t=1, N=3!
• Seen from L1, scenario 1 and 3 are
identical, so if L1 decides 1 in scenario 1
it will decide 1 in scenario 3
• Similarly for L2, if it decides 0 in
scenario 2 it decides 0 in scenario 3
• L1 and L2 do not agree in scenario 3!
• With TTP we can even tolerate arbitrary
node failures!
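
The indistinguishability argument for N=3 can be replayed mechanically. The sketch below is an illustration, not part of the original slides. `views` records what each lieutenant sees after two rounds, and `trust_general` is one hypothetical decision rule that satisfies validity when G is correct. The second half, `om1`, sketches the one-relay-round majority vote (the oral-messages algorithm OM(1) of Lamport, Shostak and Pease) for N=4, t=1, under the simplifying assumption that a faulty lieutenant tells the same lie to everyone.

```python
from collections import Counter


def views(g_to_l1, g_to_l2, l1_relay, l2_relay):
    """What each lieutenant sees after two rounds, as
    (value received from G, value the other lieutenant claims G said)."""
    return {"L1": (g_to_l1, l2_relay), "L2": (g_to_l2, l1_relay)}


def trust_general(view):
    """Hypothetical rule: decide what G sent directly."""
    return view[0]


# Scenario 1: G correct (sends 1), L2 faulty and relays "G said 0".
s1 = views(g_to_l1=1, g_to_l2=1, l1_relay=1, l2_relay=0)
# Scenario 2: G correct (sends 0), L1 faulty and relays "G said 1".
s2 = views(g_to_l1=0, g_to_l2=0, l1_relay=1, l2_relay=0)
# Scenario 3: G faulty, sends 1 to L1 and 0 to L2; both relay honestly.
s3 = views(g_to_l1=1, g_to_l2=0, l1_relay=1, l2_relay=0)

same_for_l1 = s1["L1"] == s3["L1"]      # L1 cannot tell 1 and 3 apart
same_for_l2 = s2["L2"] == s3["L2"]      # L2 cannot tell 2 and 3 apart
d3 = {name: trust_general(v) for name, v in s3.items()}


def om1(g_values, lie=None):
    """One relay round plus majority vote among the lieutenants.

    g_values: value G sends to each lieutenant (a faulty G may differ)
    lie: optional (index, value): that lieutenant is faulty and relays
         `value` to everybody (assumed simplification)
    Returns the decisions of the correct lieutenants.
    """
    n = len(g_values)
    decisions = {}
    for i in range(n):
        if lie and lie[0] == i:
            continue                     # no requirement on a faulty node
        votes = [g_values[i]]
        for j in range(n):
            if j != i:
                votes.append(lie[1] if lie and lie[0] == j else g_values[j])
        decisions[i] = Counter(votes).most_common(1)[0][0]
    return decisions


faulty_general = om1([1, 0, 1])             # N=4: G plus three lieutenants
faulty_lieutenant = om1([1, 1, 1], lie=(2, 0))
print(d3, faulty_general, faulty_lieutenant)
```

With three lieutenants every correct one votes over three values, so a single liar is always outvoted; with two lieutenants there is no majority to appeal to, which is the N ≥ 3t+1 bound at t=1.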
Now back to load sharing!
• Task T arrives at node Ni
• If Ni cannot guarantee T meeting its
deadline, it will ask some nodes to bid
for running T
[Figure: Ni sends T to Nk, the node with the successful bid, chosen among
the nodes with sufficient surplus according to Ni's knowledge]

Global guarantees
• Not possible!
• Making detailed timing assumptions,
perhaps…
• Requires verifying network timing and
throughput constraints
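
The bidding step for load sharing can be sketched as follows. This is a deliberately simplified illustration and not the algorithm from the Ramamritham, Stankovic and Zhao article: the `surplus` metric, the `min_surplus` threshold and the "largest remaining slack wins" rule are assumptions, and Ni's knowledge of remote surplus is taken to be exact.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    computation: int    # worst-case execution time
    deadline: int       # relative deadline, same time unit


@dataclass
class Node:
    name: str
    surplus: int        # free processor time (assumed metric)

    def can_guarantee(self, task):
        return task.computation <= min(self.surplus, task.deadline)

    def bid(self, task):
        # Bid with the slack left after accepting the task; None = no bid.
        if self.can_guarantee(task):
            return self.surplus - task.computation
        return None


def place(task, local, remote_nodes, min_surplus):
    """Return the name of the node that will run the task, or None."""
    if local.can_guarantee(task):
        return local.name
    # Ask only nodes believed to have sufficient surplus.
    asked = [n for n in remote_nodes if n.surplus >= min_surplus]
    bids = {n.name: n.bid(task) for n in asked}
    bids = {name: b for name, b in bids.items() if b is not None}
    return max(bids, key=bids.get) if bids else None


t = Task("T", computation=5, deadline=10)
ni = Node("Ni", surplus=2)
others = [Node("Nj", surplus=4), Node("Nk", surplus=9), Node("Nm", surplus=6)]
winner = place(t, ni, others, min_surplus=5)
print(winner)
```

The sketch ignores the time spent requesting and collecting bids and transferring the task, which is exactly where the network timing and throughput constraints enter.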