Networks and Distributed Snapshot
Sukumar Ghosh
Department of Computer Science
University of Iowa

Contents
Part 1. The evolution of network topologies
Part 2. Distributed snapshot
Part 3. Tolerating failures

Random Graphs
How a connected topology evolves in the real world
• Erdős-Rényi graphs (ER graphs)
• Power-law graphs
• Small-world graphs

Random graphs: Erdős-Rényi model
The ER model is one of several models of random graphs. It presents a theory of how social webs are formed.
Start with a set of isolated nodes V = {0, 1, 2, ..., n-1}.
Connect each pair of nodes with a probability p (0 ≤ p ≤ 1).
The resulting graph is known as G(n, p).

Erdős-Rényi model
The ER model is different from the G(n, m) model. The G(n, m) model randomly selects one from the entire family of graphs with n nodes and m edges.

Properties of ER graphs
Property 1. The expected number of edges is $\frac{n(n-1)}{2} p$.
Property 2. The expected degree per node is $(n-1)p$.
Property 3. The expected diameter of G(n, p) is $\log_{\deg} n = \frac{\log n}{\log \deg} = \frac{\log n}{\log((n-1)p)}$ [deg = expected degree of a node].

Diameter of a network
Let d(i, j) denote the length of the shortest path between a pair of nodes i and j. Over all such pairs of nodes, the largest value of d(i, j) is known as the diameter of the network.

Degree distribution in random graphs
The probability that a node v connects with a given set of k nodes (and not with the remaining (n-1-k) nodes) is $p^k (1-p)^{n-1-k}$.
One can choose k out of the remaining (n-1) nodes in $\binom{n-1}{k}$ ways.
So the probability distribution is $P(k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}$ (binomial distribution).

Degree distribution in random graphs
[Figure: N(k) = number of nodes with degree k]

Properties of ER graphs
-- When $p \ll \frac{1}{n}$, an ER graph is a collection of disjoint trees.
-- When $p = \frac{c}{n}$ (c > 1), suddenly one giant (connected) component emerges.
Other components have a much smaller size, O(log n). [Phase change]

Properties of ER graphs
When $p = \frac{c \log n}{n}$ (c > 1), the graph is almost always connected.
These give "ideas" about how a network can evolve. But not all random topologies are ER graphs! For example, social networks are often "clustered", but ER graphs have a poor (i.e. very low) clustering coefficient. (What is a clustering coefficient?)

Clustering coefficient
For a given node, its local clustering coefficient (CC) measures what fraction of the pairs of its neighbors are neighbors of each other.
B's neighbors are {A, C, D, E}. Of the six pairs of neighbors, only (A,D), (D,E) and (E,C) are connected, so CC(B) = 3/6 = 1/2.
CC(D) = 2/3 = CC(E).
The CC of a graph is the mean of the CCs of its nodes.

The connectors
Malcolm Gladwell, a staff writer at The New Yorker magazine, describes in his book The Tipping Point a simple experiment to measure how social a person is.
He started with a list of 248 last names. A person scores a point if he or she knows someone with a last name from this list. If he or she knows three persons with the same last name, then he or she scores 3 points.

The connectors (Outcome of the Tipping Point experiment)
Altogether 400 persons from different groups were tested. The scores found were
(min) 9, (max) 118 {from a random sample}
(min) 16, (max) 108 {from a highly homogeneous group}
(min) 2, (max) 95 {from a college class}
[Conclusion: Some people are very social, even in small or homogeneous samples. They are connectors.]

Connectors
Barabási observed that connectors are not unique to human society: they appear in many complex networks ranging from biology to computer science, where some nodes have an anomalously large number of links. This is not expected in ER graphs.
The world wide web, the ultimate forum of democracy, is not a random network, as Barabási's web-mapping project revealed.

Anatomy of the World Wide Web
Barabási experimented with the Univ. of Notre Dame's web. Of its 325,000 pages, 270,000 pages (i.e.
82%) had three or fewer links; 42 had 1000+ incoming links each.
The entire WWW exhibited even more disparity: 90% of the pages had ≤ 10 links, whereas a few (4-5) like Yahoo were referenced by close to a million pages! These are the hubs of the web. They help create short paths between nodes (mean distance = 19 for the WWW, obtained via extrapolation; some dispute this figure now).

Power law graph
The degree distribution of the web pages in the World Wide Web follows a power law.
In a power-law graph, the number of nodes N(k) with degree k satisfies the condition $N(k) = C \cdot \frac{1}{k^r}$. Such a graph is also known as a scale-free graph.
Other examples are
-- Income and number of people with that income
-- Magnitude and number of earthquakes of that magnitude
-- Population and number of cities with that population

Random vs. Power-law Graphs
[Figure caption: The degree distribution of the web pages in the World Wide Web follows a power law.]

Random vs. Power-Law networks
Example: Airline Routes
Think of how new routes are added to an existing network.

Preferential attachment
A new node connects with an existing node with a probability proportional to its degree. This is also known as the "rich gets richer" policy.
[Figure: a new node joining an existing network in which the sum of the node degrees = 8]

Preferential attachment continued
Barabási and Albert showed that when large networks are formed via preferential attachment, the resulting graph exhibits a power-law distribution of the node degrees.

Other properties of power law graphs
Graphs following a power-law distribution $N(k) \sim k^{-r}$ (2 < r < 3) have a small diameter $d \sim \ln \ln n$ (n = number of nodes).
The clustering coefficient decreases as the node degree increases (power law).
Graphs following a power-law distribution tend to be highly resilient to random edge removal, but quite vulnerable to targeted attacks on the hubs.
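
The preferential attachment rule described earlier can be tried out with a small simulation. The sketch below is only an illustration under simplifying assumptions that are not in the slides: the network starts from two connected nodes, each new node adds exactly one link, and the function names preferential_attachment and degree_histogram are hypothetical.

```python
import random
from collections import Counter


def preferential_attachment(n, seed=0):
    """Grow a tree of n nodes. Each new node links to one existing node
    chosen with probability proportional to that node's current degree.
    (Assumption: one link per new node, starting from a single edge.)"""
    if n < 2:
        raise ValueError("need at least two nodes")
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Every node appears in `endpoints` once per incident edge, so a uniform
    # draw from this list is a degree-proportional draw over the nodes.
    endpoints = [0, 1]
    for new in range(2, n):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend((new, target))
    return edges


def degree_histogram(edges):
    """Return N(k): a Counter mapping degree k to the number of nodes
    that have degree k."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


if __name__ == "__main__":
    hist = degree_histogram(preferential_attachment(10000))
    # Print N(k) for the smallest degrees and the largest degree observed.
    for k in sorted(hist)[:8]:
        print(k, hist[k])
    print("max degree:", max(hist))
```

Printing N(k) for a large n typically shows many nodes of degree 1 or 2 and a few hubs of much higher degree, which is the qualitative "rich gets richer" effect. The sketch makes no claim about the exact exponent r.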
The small-world model
Due to Watts and Strogatz (1998).
They followed up on Milgram's work (on six degrees of separation) and reasoned about why there is a small degree of separation between individuals in a social network.
The research was originally inspired by Watts's efforts to understand the synchronization of cricket chirps, which show a high degree of coordination over long ranges, as though the insects are being guided by an invisible conductor.
Disease spreads faster over a small-world network.

Questions not answered by Milgram
Milgram's experiment tried to validate the theory of six degrees of separation between any two individuals on the planet.
Why six degrees of separation? Any scientific reason? What properties do these social graphs have? Are there other situations in which this model is applicable?
Time to reverse engineer this.

What are small-world graphs
[Figure: three graphs, from completely regular, through small-world graphs (n >> k > ln(n) > 1), to completely random. n = number of nodes, k = number of neighbors of each node]

Completely regular
A ring lattice. If k = 4, then the clustering coefficient is CC = 3/6 = 1/2 and the diameter is $L = \frac{n}{2k}$.
The diameter is too large!

Completely random
The diameter is small, but the clustering coefficient is small too!

Small-world graphs
Start with the regular graph, and with probability p rewire each link to a randomly selected node. This results in a graph that has a high clustering coefficient but a low diameter.

Small-world graphs
[Figure: the region in which the small-world properties hold]

Limitation of Watts-Strogatz model
Jon Kleinberg argued that the Watts-Strogatz small-world model illustrates the existence of short paths between pairs of nodes, but it does not give any clue about how those short paths will be discovered. A greedy search for the destination will not lead to the discovery of these short paths.

Kleinberg's Small-World Model
Consider an (n × n) grid. Each node has a link to every node within lattice distance p (short-range neighbors) and has q long-range links.
Choose each long-range link to a node at lattice distance d with a probability proportional to $d^{-r}$.
[Figure: an n × n grid with p = 1, q = 2, r = 2]

Results
Theorem 1. There is a constant $\alpha_0$ (depending on p and q but independent of n), such that when r = 0, the expected delivery time of any decentralized algorithm is at least $\alpha_0 n^{2/3}$.

More results
Theorem 2. There is a decentralized algorithm A and a constant $\alpha_2$ (dependent on p and q but independent of n), so that when r = 2 and p = q = 1, the expected delivery time of A is at most $\alpha_2 \log^2 n$.

Variation of search time with r
[Figure: log T plotted against the exponent r]

Distributed Snapshot

Think about these
How many messages are in transit on the internet?
What is the total cash reserve in the Bank of America?
How many cars are on the streets of Kolkata now?
How much pollutant is there in the air (or water) now?
What are most people in the US thinking about the election?
How do we compute these?

[Figure: UAV surveillance of traffic]

Importance of snapshots
Major uses in
- data collection
- surveillance
- deadlock detection
- termination detection
- rollback recovery
- global predicate computation

Importance of snapshots
A snapshot may consist of the internal states of the recording processes, or it may consist of the state of external shared objects updated by an updater process.

Distributed Snapshot: First Case
Assume that the snapshot consists of the internal states of the recording processes. The main issue is synchronization. An ad hoc combination of the local snapshots will not lead to a meaningful distributed snapshot.

One-dollar bank
Let a $1 coin circulate in a network of a million banks. How can someone count the total $ in circulation? If it is not counted "properly," then one may think the total $ in circulation to be one million.

Review Causal Ordering
Causality helps identify sequential and concurrent events in distributed systems, since clocks are not always reliable.
1. Local ordering: a ≺ b ≺ c (based on the local clock)
2. Message sent ≺ message received [Thus joke ≺ Re: joke]
3.
If a ≺ b and b ≺ c then a ≺ c (≺ denotes the causally-ordered-before, or happened-before, relation)

Consistent cut
A cut is a set of events. If a cut C is consistent, then $(a \in C) \wedge (b \prec a) \Rightarrow b \in C$.
If this is not true, then the cut C is inconsistent.
[Figure: cuts drawn across process timelines; the axis is time]

Consistent snapshot
The set of states immediately following the events (actions) in a consistent cut forms a consistent snapshot of a distributed system.
• A snapshot that is of practical interest is the most recent one. Let C1 and C2 be two consistent cuts with C1 ⊂ C2. Then C2 is more recent than C1.
• Analyze why certain cuts in the one-dollar bank are inconsistent.

Consistent snapshot
How to record a consistent snapshot? Note that
1. The recording must be non-invasive.
2. The recording must be done on-the-fly. You cannot stop the system.

Chandy-Lamport Algorithm
Works on (1) a strongly connected graph in which (2) each channel is FIFO.
An initiator initiates the algorithm by sending out a marker.

White and red processes
Initially every process is white. When a process receives a marker, it turns red and remains red.
Every action by a process, and every message sent by a process, gets the color of that process. So,
white action = action by a white process
red action = action by a red process
white message = message sent by a white process
red message = message sent by a red process

Two steps
Step 1. In one atomic action, the initiator
(a) turns red
(b) records its own state
(c) sends a marker along all outgoing channels
Step 2. Every other process, upon receiving a marker for the first time (and before doing anything else)
(a) turns red
(b) records its own state
(c) sends markers along all outgoing channels
The algorithm terminates when (1) every process has turned red, and (2) every process has received a marker through each incoming channel.

Why does it work?
Lemma 1. No red message is received in a white action.

Why does it work?
[Figure: the computation divided into an all-white part and an all-red part, with the ideal snapshot state SSS between them. This gives an easy conceptualization of the snapshot state.]
Theorem.
The global state recorded by the Chandy-Lamport algorithm is equivalent to the ideal snapshot state SSS.
Hint. A pair of actions (a, b) can be scheduled in any order if there is no causal order between them, so (a; b) is equivalent to (b; a).

Why does it work?
Let an observer observe the following actions (w[i] = a white action by process i, r[i] = a red action by process i):
  w[i] w[k] r[k] w[j] r[i] w[l] r[j] r[l] …
≡ w[i] w[k] w[j] r[k] r[i] w[l] r[j] r[l] … [Lemma 1]
≡ w[i] w[k] w[j] r[k] w[l] r[i] r[j] r[l] … [Lemma 1]
≡ w[i] w[k] w[j] w[l] r[k] r[i] r[j] r[l] … [done!]
[Recorded state: the point between the last white action and the first red action in the final sequence]

Example 1: Count the tokens
Let us verify that the Chandy-Lamport snapshot algorithm correctly counts the tokens circulating in the system.
[Figure: a token circulating among processes A, B and C; two candidate cuts, 1 and 2, each label the processes "token" or "no token"]
Are these consistent cuts?
How to account for the channel states? Compute this using the sent and received variables for each process.

Example 2: Communicating State Machines

Something unusual
Let machine i start the Chandy-Lamport snapshot before it has sent M along ch1. Also, let machine j receive the marker after it sends out M' along ch2. Observe that the snapshot state is
SSS = down ∅ up M'
Doesn't this appear strange? This state was never reached during the computation!

Understanding snapshot
The observed state is a feasible state that is reachable from the initial configuration. It may not actually be visited during a specific execution.
The final state of the original computation is always reachable from the observed state.

Discussions
What good is a snapshot if that state has never been visited by the system?
- It is relevant for the detection of stable predicates.
- It is useful for checkpointing.

Discussions
What if the channels are not FIFO? Study how the Lai-Yang algorithm works. It does not use any marker.
LY1. The initiator records its own state. When it needs to send a message m to another process, it sends a message (m, red).
LY2.
When a process receives a message (m, red), it records its state if it has not already done so, and then accepts the message m.
Question 1. Why will it work?
Question 2. Are there any limitations of this approach?

Food for thought
Distributed snapshot = distributed read.
Distributed reset = distributed write.
How difficult is distributed reset?

Distributed debugging (Marzullo and Neiger, 1991)
[Figure: the processes of a distributed system report each event e with its vector clock VC(e) to an observer]

Distributed debugging
Uses vector clocks. Sij is the global state after the ith action by process 0 and the jth action by process 1.

Distributed debugging
Possibly ϕ: At least one consistent global state S is reachable from the initial global state, such that ϕ(S) = true.
Definitely ϕ: All computations pass through some consistent global state S such that ϕ(S) = true.
Never ϕ: No computation passes through any consistent global state S such that ϕ(S) = true.
Definitely ϕ ⇒ Possibly ϕ

Examples
ϕ = (x + y = 12) (true at S21): Possibly ϕ
ϕ = (x + y > 15) (true at S31): Definitely ϕ
ϕ = (x = y = 5) (true at S40 and S22): Never ϕ
*Neither S40 nor S22 is a consistent state*

Distributed Snapshot: Second case
The snapshot consists of the external observations of the recording processes, i.e. distributed snapshots of shared external objects.
1. How many cars are on the streets now?
2. How many trees have been downed by the storm?

Distributed snapshot of shared objects
The first algorithm
[Figure: shared locations 0, 1, 2, ..., i]

Algorithm double collect
function read
  while true
    X[0..n-1] := collect;
    Y[0..n-1] := collect;
    if ∀i ∈ {0,..,n-1} location i was not changed between the two collects
      then return Y;
end
function update (i, v)
  M[i] := v;
end

Limitations of double collect
Read may never terminate! Why?
We need a better algorithm that guarantees termination.

Coordinated snapshot
Engage multiple readers and ask them to record snapshots at the same time. It will work if the writer is sluggish and the clocks are accurately synchronized.

Faulty recorder
Assume that there are n recorders.
Each records a snapshot and shares it with the others, so that each can form a complete snapshot. This is easy when all recorders record correctly and transmit the information reliably. But what if one or more recorders are faulty or the communication is error prone?

Distributed Consensus
Consensus is very important for taking coordinated action. How can the recorders reach consensus in the presence of communication failures? It reduces to the classic Byzantine Generals Problem.

Byzantine Generals Problem
Describes and solves the consensus problem on the synchronous model of communication.
The network topology is a completely connected graph.
Processes undergo byzantine failures, the worst possible kind of failure.
Shows the power of the adversary.

Byzantine Generals Problem
• n generals {0, 1, 2, ..., n-1} decide whether to "attack" or to "retreat" during a particular phase of a war. The goal is to agree upon the same plan of action.
• Some generals may be "traitors" and therefore either send no input, or send conflicting inputs to prevent the "loyal" generals from reaching an agreement.
• Devise a strategy by which every loyal general eventually agrees upon the same plan, regardless of the action of the traitors.

Byzantine Generals
[Figure: four generals. Generals 0 and 1 propose Attack = 1; generals 2 and 3 propose Retreat = 0. The collected input vectors are {1, 1, 0, 0}, {1, 1, 0, 1}, {1, 1, 0, 0} and {1, 1, 0, 0}: the traitor may send conflicting input values.]
Every general will broadcast his/her judgment to everyone else. These are inputs to the consensus protocol.

Byzantine Generals
We need to devise a protocol so that every peer (call it a lieutenant) receives the same value from any given general (call it a commander). Clearly, the lieutenants will have to use secondary information.
Note that the roles of the commander and the lieutenants will rotate among the generals.

Interactive consistency specifications
IC1. Every loyal lieutenant receives the same order from the commander.
IC2.
If the commander is loyal, then every loyal lieutenant receives the order that the commander sends.
[Figure: a commander and its lieutenants]

The Communication Model
Oral Messages
1. Messages are not corrupted in transit. (Why? If a message gets altered, then blame the sender.)
2. Messages can be lost, but the absence of a message can be detected.
3. When a message is received (or its absence is detected), the receiver knows the identity of the sender (or the defaulter).
OM(m) represents an interactive consistency protocol in the presence of at most m traitors.

An Impossibility Result
Using oral messages, no solution to the Byzantine Generals problem exists with three or fewer generals and one traitor. Consider the two cases:
[Figure: two scenarios, (a) and (b), each with commander 0 and lieutenants 1 and 2 exchanging the values 0 and 1]
In (a), to satisfy IC2, lieutenant 1 must trust the commander, but in (b), the same idea leads to the violation of IC1.

Impossibility result (continued)
Using oral messages, no solution to the Byzantine Generals problem exists with 3m or fewer generals and m traitors (m > 0).
The proof is by contradiction. Assume that such a solution exists. Now, divide the 3m generals into three groups of m generals each, such that all the traitors belong to one group. Let one general simulate each of these three groups. This scenario is equivalent to the case of three generals and one traitor. We already know that such a solution does not exist.

The OM(m) algorithm
Recursive algorithm: OM(m) invokes OM(m-1), which invokes OM(m-2), and so on down to OM(0).
OM(m) = consensus algorithm with oral messages in the presence of up to m traitors
OM(0) = direct broadcast

The OM(m) algorithm
1. Commander i sends out a value v (0 or 1).
2. If m > 0, then every lieutenant j ≠ i, after receiving v, acts as a commander and initiates OM(m-1) with everyone except i.
3. Every lieutenant collects (n-1) values: (n-2) values received from the lieutenants using OM(m-1), and one direct value from the commander.
Then he picks the majority of these values as the order from i.

Example of OM(1)
[Figure: two runs of OM(1), labeled (a) and (b), with commander 0 and lieutenants 1, 2 and 3]

Example of OM(2)
[Figure: message tree for OM(2) with commander 0 and lieutenants 1 to 6; the commander sends v in OM(2), and the value is relayed through OM(1) and then OM(0)]

Proof of OM(m)
Lemma. Let the commander be loyal, and n > 2m + k, where m = maximum number of traitors. Then OM(k) satisfies IC2.
[Figure: a loyal commander above n-m-1 loyal lieutenants and m traitors; values received via OM(r)]

Proof of OM(m)
Proof. If k = 0, then the result trivially holds.
Let it hold for k = r (r ≥ 0), i.e. OM(r) satisfies IC2. We have to show that it holds for k = r + 1 too.
By definition n > 2m + r + 1, so n - 1 > 2m + r. So OM(r) holds for the lieutenants in the bottom row. Each loyal lieutenant collects n-m-1 identical good values and up to m bad values. So the bad values are voted out (n-m-1 > m + r implies n-m-1 > m).
"OM(r) holds" means each loyal lieutenant receives identical values from every loyal commander.

The final theorem
Theorem. If n > 3m, where m is the maximum number of traitors, then OM(m) satisfies both IC1 and IC2.
Proof. Consider two cases:
Case 1. The commander is loyal. The theorem follows from the previous lemma (substitute k = m).
Case 2. The commander is a traitor. We prove it by induction.
Base case. m = 0: trivial.
(Induction hypothesis) Let the theorem hold for m = r.
(Inductive step) We have to show that it holds for m = r + 1 too.

Proof (continued)
There are n > 3(r + 1) generals and r + 1 traitors. Excluding the commander, there are > 3r + 2 generals, of which r are traitors. So > 2r + 2 lieutenants are loyal.
Since 3r + 2 > 3r, OM(r) satisfies IC1 and IC2 (by the induction hypothesis).
[Figure: > 2r + 2 loyal lieutenants and r traitors]

Proof (continued)
In OM(r+1), a loyal lieutenant chooses the majority from
(1) > 2r + 1 values obtained from the loyal lieutenants via OM(r),
(2) the r values from the traitors, and
(3) the value directly from the commander.
The set of values collected in parts (1) and (3) is the same for all loyal lieutenants: it is the same set of values that these lieutenants received from the commander.
Also, by the induction hypothesis, in part (2) each loyal lieutenant receives identical values from each traitor.
So every loyal lieutenant eventually collects the same set of values.

Conclusion
1. A distributed snapshot of shared objects can be tricky when the writer does not cooperate.
2. Approximate snapshots are useful for a rough view.
3. Failures add a new twist to the recording of snapshots.
4. Much work remains to be done for the upper layers of snapshot integration. (What can you make out from a trail of Twitter data with not much correlation?)
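
Appendix: a sketch of OM(m)
As a companion to the three steps of the OM(m) algorithm, the recursion can be sketched in Python. This is a minimal simulation under stated assumptions, not a definitive implementation: the names om, send, majority and DEFAULT are hypothetical, the traitor behavior (always sending receiver % 2) is just one possible adversary, and ties in the majority vote are broken by an assumed default order of 0 (retreat).

```python
from collections import Counter

DEFAULT = 0  # assumed tie-break order ("retreat"); not specified in the slides


def majority(values):
    """Return the most common value, or DEFAULT when the top two counts tie."""
    ranked = Counter(values).most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return DEFAULT
    return ranked[0][0]


def send(sender, receiver, value, traitors):
    """Hypothetical adversary: a traitor ignores `value` and sends
    receiver % 2, so different receivers may see conflicting orders."""
    return receiver % 2 if sender in traitors else value


def om(m, commander, lieutenants, value, traitors):
    """Run OM(m). Return {lieutenant: order it attributes to the commander}."""
    # Step 1: the commander sends its value to every lieutenant.
    received = {j: send(commander, j, value, traitors) for j in lieutenants}
    if m == 0:
        return received  # OM(0) is a direct broadcast
    # Step 2: each lieutenant j acts as commander in OM(m-1) and relays
    # the value it received to all the other lieutenants.
    relayed = {
        j: om(m - 1, j, [k for k in lieutenants if k != j], received[j], traitors)
        for j in lieutenants
    }
    # Step 3: each lieutenant takes the majority of its direct value and
    # the values relayed to it by the other lieutenants.
    return {
        k: majority([received[k]] + [relayed[j][k] for j in lieutenants if j != k])
        for k in lieutenants
    }


if __name__ == "__main__":
    # Four generals, one traitor (n > 3m holds): a traitorous lieutenant,
    # then a traitorous commander.
    print(om(1, 0, [1, 2, 3], 1, traitors={3}))
    print(om(1, 0, [1, 2, 3], 1, traitors={0}))
```

In both runs the loyal lieutenants end up with the same order, and with a loyal commander that order is the one the commander sent. The entries printed for a traitor's own decision carry no meaning.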