Network algorithms

Presenter: Kurchi Subhra Hazra
Agenda
• Basic Algorithms such as Leader Election
• Consensus in Distributed Systems
• Replication and Fault Tolerance in Distributed Systems
• GFS as an example of a Distributed System
Network Algorithms
• A distributed system is a collection of entities that
- are each autonomous, asynchronous and failure-prone
- communicate through unreliable channels
- cooperate to perform some common function
• Network algorithms enable such distributed systems to perform these “common functions” effectively
Global State in Distributed Systems
• We want to estimate a “consistent” state of a distributed system
• Required for detecting deadlock and termination, and for debugging
• Two approaches:
1. Centralized: all processes and channels report to a central process
2. Distributed: the Chandy-Lamport algorithm
Chandy-Lamport Algorithm
Based on Marker messages M
On receiving M over channel c:
If own state is not yet recorded:
a) Record own state
b) Record the state of c as empty and start recording messages on all other incoming channels
c) Send a Marker message on every outgoing channel
Else
a) Record the state of c as the messages received on c since recording started
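The marker rule above can be sketched in Python. This is only an illustration under assumptions that are not on the slides: the application state is a single integer counter, application messages are integers added to it, and the names `Process`, `start_snapshot` and `receive` are hypothetical.

```python
from dataclasses import dataclass, field

MARKER = "M"  # marker message, named M as on the slide


@dataclass
class Process:
    name: str
    incoming: list              # names of processes with a channel into this one
    outgoing: list              # names of processes this one can send to
    state: int = 0              # assumed application state: a simple counter
    recorded_state: object = None
    recording: dict = field(default_factory=dict)      # channel -> messages logged so far
    channel_state: dict = field(default_factory=dict)  # channel -> final recorded contents

    def start_snapshot(self):
        """Record own state, start recording all incoming channels, return markers to send."""
        self.recorded_state = self.state
        self.recording = {src: [] for src in self.incoming}
        return [(dst, MARKER) for dst in self.outgoing]

    def receive(self, src, msg):
        """Handle one message arriving on channel src -> self; return messages to send."""
        if msg == MARKER:
            to_send = []
            if self.recorded_state is None:
                # First marker seen: record own state and send markers onward.
                to_send = self.start_snapshot()
            # The marker closes channel src; its state is whatever was logged so far
            # (empty if this was the first marker this process received).
            self.channel_state[src] = self.recording.pop(src)
            return to_send
        # Application message: apply it, and log it if the channel is being recorded.
        self.state += msg
        if src in self.recording:
            self.recording[src].append(msg)
        return []
```

With two processes, a message that P2 sent before it saw the marker ends up in the recorded state of the channel from P2 to P1, which is the situation of message a in the worked example.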
Chandy Lamport Algorithm
[Figure: space-time diagram of processes P1, P2 and P3 with their events, the messages a and b, and the Marker messages M]
1- P1 initiates snapshot: records its state (S1); sends Markers to P2 & P3;
turns on recording for channels Ch21 and Ch31
2- P2 receives Marker over Ch12, records its state (S2), sets state(Ch12) = {}
sends Marker to P1 & P3; turns on recording for channel Ch32
3- P1 receives Marker over Ch21, sets state(Ch21) = {a}
4- P3 receives Marker over Ch13, records its state (S3), sets state(Ch13) = {}
sends Marker to P1 & P2; turns on recording for channel Ch23
5- P2 receives Marker over Ch32, sets state(Ch32) = {b}
6- P3 receives Marker over Ch23, sets state(Ch23) = {}
7- P1 receives Marker over Ch31, sets state(Ch31) = {}
Taken from CS 425/UIUC/Fall 2009
Leader Election
• Suppose you want to
-elect a master server out of n servers
-elect a co-ordinator among different mobile systems
Common Leader Election Algorithms
-Ring Election
-Bully Election
Two requirements
- Safety (Process with best attribute is elected)
- Liveness (Election terminates)
Ring Election
• Processes are organized in a ring
• The initiator sends an election message, carrying its id and its own attribute value, clockwise to the next process in the ring
• The next process compares its own attribute value with the one in the election message:
a) If its own attribute value is greater, it replaces the id and attribute value in the message with its own and passes the message on.
b) If its own attribute value is less, it simply passes on the message.
c) If the attribute values are equal (the message carries its own id), it declares itself the leader and passes on an “elected” message.
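The forwarding rule can be simulated in a few lines of Python. This is a minimal sketch, assuming distinct attribute values, a single initiator and no failures; the function name `ring_election` and the message counting are illustrative and not taken from the slides.

```python
def ring_election(attrs, starter):
    """Simulate one election on a ring.

    attrs[i] is the attribute value of process i; process i forwards to
    process (i + 1) % n. Attribute values are assumed to be distinct.
    Returns (leader_index, messages_sent).
    """
    n = len(attrs)
    best = starter      # id currently carried by the election message
    current = starter
    messages = 0
    while True:
        current = (current + 1) % n
        messages += 1   # the election message travels one hop clockwise
        if current == best:
            # The message came back carrying this process's own id: it is the leader.
            break
        if attrs[current] > attrs[best]:
            best = current  # overwrite the message with own id and attribute
        # otherwise the message is passed on unchanged
    # The "elected" message then goes once around the ring.
    messages += n
    return best, messages
```

For example, `ring_election([3, 9, 4, 1], 0)` returns `(1, 9)`: process 1 holds the highest attribute, and the election message needs five hops to return to it before the four-hop “elected” round.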
What happens when a node fails?
Ring Election - Example
[Figures: two ring election example slides, taken from CS 425/UIUC/Fall 2009]
Bully Algorithm
Best case and worst case scenarios
Taken from CS 425/UIUC/Fall 2009
Consensus
• A set of n processes/systems attempt to “agree” on some information
• Each process Pi begins in the undecided state and proposes a value vi ∈ D
• The Pi communicate by exchanging values
• Each Pi then sets its decision value di and enters the decided state
• Requirements:
1. Termination: eventually all correct processes decide, i.e., each correct process sets its decision variable
2. Agreement: the decision value of all correct processes is the same
3. Integrity: if all correct processes proposed v, then any correct decided process has di = v
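The three requirements can be checked mechanically on the outcome of a run. The sketch below is only an illustration: the function name `check_consensus` and the dictionary representation (process name to proposed or decided value, with `None` meaning undecided) are assumptions, and only correct processes are expected in the inputs.

```python
def check_consensus(proposed, decided):
    """Check termination, agreement and integrity for one finished run.

    proposed maps each correct process to its proposed value v_i.
    decided maps each correct process to its decision d_i (None = undecided).
    """
    decisions = list(decided.values())
    termination = all(d is not None for d in decisions)
    made = {d for d in decisions if d is not None}
    agreement = len(made) <= 1
    # Integrity only constrains runs in which every correct process proposed the same v.
    unanimous = len(set(proposed.values())) == 1
    integrity = (not unanimous) or made <= set(proposed.values())
    return {"termination": termination, "agreement": agreement, "integrity": integrity}
```

A run in which both processes propose 1 but decide 0 satisfies termination and agreement and violates integrity.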
2 Phase Commit Protocol
• Useful in distributed transactions to perform an atomic commit
• Atomic commit: a set of distinct changes applied as a single operation
• Suppose A transfers $300 from A’s account to B’s bank account:
• A = A - 300
• B = B + 300
Either both operations are applied or neither is; otherwise the accounts become inconsistent.
2 Phase Commit Protocol
What happens if the coordinator and a participant fail after doCommit?
Issue with 2PC
1. The coordinator sends CanCommit? to participants A and B
2. A and B both reply Yes
3. The coordinator sends doCommit; B commits, but A and the coordinator crash
A new coordinator cannot know whether A had committed.
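For reference, the failure-free path of 2PC can be sketched as follows. The message names CanCommit? and doCommit come from the slides; the class `Participant`, the `do_abort` step and the state names are assumptions made for illustration, and the coordinator in this sketch never crashes.

```python
class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit   # how this participant will vote
        self.state = "init"

    def can_commit(self):
        # Phase 1: vote. A "yes" vote is a promise to commit if asked later.
        self.state = "ready" if self.will_commit else "aborted"
        return self.will_commit

    def do_commit(self):
        self.state = "committed"

    def do_abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Run both phases with a coordinator that never crashes; return the outcome."""
    votes = [p.can_commit() for p in participants]   # phase 1: CanCommit?
    if all(votes):
        for p in participants:                       # phase 2: doCommit
            p.do_commit()
        return "committed"
    for p in participants:                           # phase 2: abort everywhere
        p.do_abort()
    return "aborted"
```

The problem described above arises between the two loops: a participant in the "ready" state that stops hearing from the coordinator cannot tell from its own state which outcome was chosen.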
3 Phase Commit Protocol (3PC)
• Uses an additional stage
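3PC Cont…
[Figure: message sequence between the coordinator and Cohorts 1, 2 and 3]
1. The coordinator sends canCommit; the cohorts ack
2. The coordinator sends preCommit; the cohorts ack
3. The coordinator sends commit; each cohort commits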
3PC Cont…
• Why is this better?
• 2PC: execute transaction when everyone is willing to COMMIT it
• 3PC: execute transaction when everyone knows it will COMMIT
(http://www.coralcdn.org/07wi-cs244b/notes/l4d.txt)
• But 3PC is expensive
• Timeouts can be triggered by machines that are slow rather than failed
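The difference between “willing to commit” and “knows it will commit” shows up in what a cohort may do on a timeout. The sketch below follows the usual textbook description of 3PC and assumes crash failures only, with no network partitions; the function name `cohort_on_timeout` and the state names are hypothetical.

```python
def cohort_on_timeout(state):
    """Decision a cohort takes when it stops hearing from the coordinator.

    state is the cohort's last recorded phase:
    "init", "voted_yes" or "precommitted" (assumed names).
    """
    if state == "precommitted":
        # preCommit is only sent after every cohort acknowledged canCommit,
        # so every cohort is known to be willing: the cohort commits.
        return "commit"
    # Without a preCommit, this cohort has not acked one, so the coordinator
    # cannot have sent commit to anyone yet: the cohort aborts.
    return "abort"
```

A slow but live coordinator can trigger the same timeouts, which is the cost noted above.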
Paxos Protocol
• A consensus algorithm
• Important Safety Conditions:
• Only one value is chosen
• Only a proposed value is chosen
• Important Liveness Conditions:
• Some proposed value is eventually chosen
• Once a value is chosen, a process can eventually learn it
• Nodes act as Proposers, Acceptors and Learners
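Paxos Protocol – Phase 1
[Figure: a proposer sending Prepare messages to four acceptors, three of which send back an Acknowledgement]
• The proposer selects a number n for its proposal of value v and sends a Prepare message to the acceptors
• Acceptors respond with an Acknowledgement that carries the highest-numbered proposal they have accepted so far, if any
• What about an acceptor that does not acknowledge? A majority of acceptors is enough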
Paxos Protocol – Phase 2
[Figure: the proposer sending Accept messages for proposal n to the acceptors]
• Once a majority of acceptors agree on proposal n with value v, the proposer sends an Accept message for n
• Acceptors accept the proposal
• What if v is null?
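Both phases can be sketched for a single decision as follows. This is an illustration of the standard single-decree protocol, not code from the slides: the names `Acceptor`, `on_prepare`, `on_accept` and `propose` are hypothetical, messages are modelled as direct method calls, and nothing is lost or delayed.

```python
class Acceptor:
    def __init__(self):
        self.promised_n = -1     # highest n seen in a Prepare
        self.accepted_n = -1     # number of the proposal accepted so far, if any
        self.accepted_v = None

    def on_prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            return ("ack", self.accepted_n, self.accepted_v)
        return ("nack", self.promised_n, None)

    def on_accept(self, n, v):
        """Phase 2: accept unless a higher-numbered Prepare was promised."""
        if n >= self.promised_n:
            self.promised_n = self.accepted_n = n
            self.accepted_v = v
            return True
        return False


def propose(acceptors, n, v):
    """Run both phases for proposal n; return the chosen value, or None on failure."""
    majority = len(acceptors) // 2 + 1
    replies = [a.on_prepare(n) for a in acceptors]
    acks = [r for r in replies if r[0] == "ack"]
    if len(acks) < majority:
        return None
    # If any acceptor already accepted a value, adopt the highest-numbered one.
    prev_n, prev_v = max(((r[1], r[2]) for r in acks), key=lambda t: t[0])
    if prev_n >= 0:
        v = prev_v
    accepted = sum(a.on_accept(n, v) for a in acceptors)
    return v if accepted >= majority else None
```

In this sketch, a proposer whose acknowledgements carry no previously accepted value is free to use its own v; once some value has been chosen, later proposals with higher numbers end up re-proposing that same value.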
Paxos Protocol Cont…
• What if an arbitrary number of proposers are allowed?
[Figure: over Rounds 1 to 4, proposers P and Q alternately send proposals numbered n1, n2, n3 and n4 to the same acceptor]
• To ensure progress, use a distinguished proposer
Paxos Protocol Contd…
• Some issues:
a) How to choose the proposer?
b) How do we ensure unique n?
c) Expensive protocol
d) No primary if distinguished proposer used
Originally used by Paxons to run their part-time parliament
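One common answer to the question of unique proposal numbers is to interleave a per-proposer round counter with a fixed proposer id. The sketch below assumes a known, fixed number of proposers with ids 0 to num_proposers - 1; the function name `proposal_number` is hypothetical, and the slides do not say which scheme is intended.

```python
def proposal_number(round_no, proposer_id, num_proposers):
    """Return a proposal number that no other proposer can generate.

    Numbers from proposer i are i, i + k, i + 2k, ... for k = num_proposers,
    so two different proposers never collide and a later round always
    yields a larger number than any earlier round.
    """
    return round_no * num_proposers + proposer_id
```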
Replication
• Replication is important for
1. Fault Tolerance
2. Load Balancing
3. Increased Availability
Requirements:
1. Transparency
2. Consistency
Failure in Distributed Systems
• An important consideration in every design decision
• Fault detectors should be:
a) Complete: every fault is detected when it occurs
b) Accurate: no false positives are raised
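The tension between the two properties can be seen in a timeout-based detector. The sketch below assumes nodes send periodic heartbeats and that the caller supplies timestamps; heartbeats are not mentioned on the slides, and the class `HeartbeatDetector` is hypothetical.

```python
class HeartbeatDetector:
    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence before a node is suspected
        self.last_seen = {}      # node -> time of its latest heartbeat

    def heartbeat(self, node, now):
        """Record that a heartbeat from node arrived at time now."""
        self.last_seen[node] = now

    def suspected(self, now):
        """Return the nodes that have been silent for longer than the timeout."""
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)
```

A crashed node stops sending heartbeats and is eventually suspected, which gives completeness. A node that is merely slow can also exceed the timeout, so accuracy is not guaranteed in an asynchronous system.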
Byzantine Faults
• Arbitrary messages and transitions
• Cause: e.g., software bugs, malicious attacks
• Byzantine Agreement Problem: “Can a set of concurrent processes
achieve coordination in spite of the faulty behavior of some of them?”
• Concurrent processes could be replicas in distributed systems
Practical Byzantine Fault Tolerance (PBFT)
• Replication algorithm that is able to tolerate Byzantine faults
• Useful for software faults
• Why “Practical”?
-> It can be used in an asynchronous environment like the internet
• Important Assumptions:
1. At most $\lfloor (n-1)/3 \rfloor$ of the n nodes can be faulty
2. All replicas start in the same state
3. Failures are independent – Practical?
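The first assumption is simple arithmetic relating the number of replicas n to the number of tolerated faults f. The helper names below are hypothetical.

```python
def max_faulty(n):
    """Largest f such that n replicas satisfy n >= 3f + 1."""
    return (n - 1) // 3


def min_replicas(f):
    """Smallest number of replicas needed to tolerate f faulty ones."""
    return 3 * f + 1
```

With the four replicas R1 to R4 used in the message-flow slide, `max_faulty(4)` is 1. Adding a fifth or sixth replica does not raise f; the next step is at n = 7.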
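PBFT Cont..
[Figure: message flow between client C and replicas R1 to R4 through the phases request, pre-prepare, prepare, commit and reply]
C: Client
R1: Primary replica
• A replica moves on to the commit phase after accepting 2f prepares
• The client blocks and waits for f+1 replies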
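The client-side rule of waiting for f+1 replies can be sketched as below. It assumes the f+1 replies must carry the same result and come from distinct replicas, and it leaves out message authentication; the function name `accept_result` is hypothetical.

```python
from collections import Counter


def accept_result(replies, f):
    """Return a result once f + 1 distinct replicas report it, else None.

    replies is an iterable of (replica_id, result) pairs in arrival order.
    """
    votes = Counter()
    seen = set()
    for replica, result in replies:
        if replica in seen:
            continue             # count each replica at most once
        seen.add(replica)
        votes[result] += 1
        if votes[result] >= f + 1:
            return result
    return None
```

Since at most f replicas are faulty, any result reported by f+1 distinct replicas was reported by at least one correct replica.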
PBFT Cont…
• The algorithm provides
-> Safety: by guaranteeing linearizability. Pre-prepare and prepare ensure a total order on messages
-> Liveness: by providing for a view change when the primary replica fails. Here, synchrony is assumed.
• How do we know a priori the value of f?
Google File System
• Revisited traditional file system design
1. Component failures are the norm
2. Multi-GB files are common
3. Files are mostly mutated by appending new data
4. Relaxed consistency model
GFS Architecture
[Figure: GFS architecture diagram, annotated with “Leader Election/Replication” and “Maintains metadata, namespace, chunk metadata etc”]
GFS – Relaxed Consistency
GFS – Design Issues
Single Master
Rationale: keep things simple
Problems:
1. Increasing volume of underlying storage -> increase in metadata
2. Clients not as fast as master server -> master server became a bottleneck
Current: multiple masters per data center
Ref: http://queue.acm.org/detail.cfm?id=1594206
GFS Design Issues
• Replication of chunks
a) Replication across racks; the default number of replicas is 3
b) Allowing concurrent changes to the same file -> in retrospect, they would rather have a single writer
c) Primary replica serializes mutations to chunks; no consensus protocol is run before applying mutations to the chunks
Ref: http://queue.acm.org/detail.cfm?id=1594206
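The idea that the primary replica serializes mutations can be illustrated with a toy model. This is not GFS code: the classes `Replica` and `Primary`, the in-memory `bytearray` chunk and the synchronous forwarding are all assumptions made for the sketch, and failures are ignored.

```python
class Replica:
    def __init__(self):
        self.chunk = bytearray()
        self.applied = 0   # serial number of the last mutation applied

    def apply(self, serial, data):
        # Every replica applies mutations strictly in serial-number order.
        assert serial == self.applied + 1, "mutations must be applied in serial order"
        self.chunk += data
        self.applied = serial


class Primary(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries

    def mutate(self, data):
        """Pick the next serial number, apply locally, then forward in that order."""
        serial = self.applied + 1
        self.apply(serial, data)
        for s in self.secondaries:
            s.apply(serial, data)
        return serial
```

The order is decided by one node, the primary, so no agreement round among the replicas is needed; the price is that the primary is a single point of ordering for its chunk.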
THANK YOU