Comp 260: Advanced Algorithms                         Tufts University, Spring 2011
Prof. Lenore Cowen                                    Scribe: Eli Brown

Lecture 3: Introduction to Probabilistic Algorithms (MIS) [1]

[1] These notes were partially based on past lectures scribed by Adam Lewis and Jeremy Freeman.

1 Review of Basic Probability

Definition 1.0.1 (Random Variable) A random variable X is a real number that is the outcome of a random event. For example, X = the number of spots showing when a six-sided die is rolled.

Definition 1.0.2 (Expected Value) The expectation of X, denoted E[X], is

    E[X] = Σ_i i · Pr[X = i]

That is, if someone promises to pay you a dollar per spot that comes up on a die, the expected value is the amount you expect to be paid on average. With the example above of the random variable X,

    E[X] = Σ_{i=1}^{6} i · Pr[X = i] = 7/2

Note that E[X] is not necessarily a possible value of X: Pr[X = 7/2] = 0. Also note that expected value is not the only way to judge a probabilistic decision. Increasing the payout on an incredibly unlikely event will raise the expected value, but will not necessarily make, say, the lottery a good bet.

Theorem 1.0.3 (Linearity of Expectation) If X and Y are any two random variables, then E[X + Y] = E[X] + E[Y]. More generally, E[Σ_i X_i] = Σ_i E[X_i]. If c is a real number, then E[c · X] = c · E[X]. However, E[XY] ≠ E[X] E[Y] unless X and Y are independent.

Definition 1.0.4 (Indicator Random Variable) Let w be some event (e.g., a six-sided die was rolled and turned up a 6). The indicator random variable for w is

    I_w = 1 if w happens, 0 otherwise

The probability of an event is the expectation of its indicator random variable.

Theorem 1.0.5 (Markov's Inequality)

    Pr[|X| ≥ a] ≤ E[|X|] / a

This bounds the probability that a variable takes a value far from its expected value.

Proof 1.0.6 Let

    I_{|X| ≥ a} = 1 if |X| ≥ a, 0 if |X| < a

It is always the case that |X|/a ≥ I_{|X| ≥ a}. Therefore, by Theorem 1.0.3,

    E[|X|]/a = E[|X|/a] ≥ E[I_{|X| ≥ a}] = Pr[|X| ≥ a]

Definition 1.0.7 (Variance)

    Var[X] = E[(X − E[X])²] = E[X² − 2X·E[X] + (E[X])²] = E[X²] − (E[X])²

Theorem 1.0.8 (Chebyshev's Inequality)

    Pr[|X − E[X]| ≥ a] ≤ Var[X] / a²

Chebyshev's inequality can be shown from Markov's inequality and the definition of variance.

2 Probabilistic Algorithms

2.1 A Toy Example

Suppose you are given two containers, one containing N blue balls and the other containing N/2 blue balls and N/2 yellow balls. The goal is to determine which container is which.

2.2 Algorithm 1

1. Pick a container at random
2. Draw 20 balls
3. If a yellow ball is found, stop and report "found the mixed container"
4. Else stop and report "found the blue container"

This algorithm is fast, but not always correct. However, it has a high probability of being correct.

2.3 Algorithm 2

1. Pick a container at random and draw a single ball
2. If a yellow ball is found, stop and report "found the mixed container"
3. Else go to step 1

This algorithm is always correct, but not always fast. However, it has a high probability of being fast.

2.4 Algorithm 3

1. Pick a container at random and start drawing balls
2. If a yellow ball is found, stop and report "found the mixed container"
3. Else continue until the container is empty and report "found the blue container"

This algorithm is always correct, but with probability 1/2 it is slow.

2.5 Types of Probabilistic Algorithms

There are two classes of probabilistic algorithms we will discuss. Monte Carlo algorithms are always fast though not always correct, but correct with high probability.
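Algorithm 1 above is of the Monte Carlo kind, and Algorithm 2 is of the Las Vegas kind described next. To make the contrast concrete, here is a minimal Python simulation of the toy example (a sketch only: the function names are ours, we assume N ≥ 20, and Algorithm 1 is simulated by drawing without replacement, so it errs only when the mixed container is picked and all 20 draws come up blue, which happens with probability at most (1/2)^20).

    import random

    def make_containers(n):
        # One container holds n blue balls; the other holds n/2 blue, n/2 yellow.
        all_blue = ["blue"] * n
        mixed = ["blue"] * (n // 2) + ["yellow"] * (n // 2)
        random.shuffle(mixed)
        return [all_blue, mixed]

    def algorithm_1(container):
        # Monte Carlo: always fast (exactly 20 draws), wrong with tiny probability.
        if "yellow" in random.sample(container, 20):
            return "mixed"
        return "blue"

    def algorithm_2(containers):
        # Las Vegas: always correct; each attempt finds a yellow ball with
        # probability 1/4, so the expected number of draws is 4.
        while True:
            c = random.randrange(2)
            if random.choice(containers[c]) == "yellow":
                return c  # index of the mixed container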
Las Vegas algorithms are always correct though not always fast, but fast with high probability. Formal definitions follow.

Definition 2.5.1 (Monte Carlo Algorithm) A Monte Carlo algorithm is a probabilistic algorithm M for which ∃ a polynomial P such that ∀x, M terminates within P(|x|) steps on input x. Furthermore, Pr[M(x) is correct] > 2/3, where the probability is taken over all coin tosses in algorithm M.

Note that it is easy to transform any such Monte Carlo algorithm into a Monte Carlo algorithm that is correct with probability 1 − ε by simply running M multiple times and taking a majority vote. In fact, this strategy works whenever the probability that the algorithm is correct is greater than 1/2 by at least a fixed constant.

Definition 2.5.2 (Las Vegas Algorithm) A Las Vegas algorithm is a probabilistic algorithm M for which ∃ a polynomial P such that ∀x,

    E[running time] = Σ_{t=1}^{∞} t · Pr[M(x) takes exactly t steps] < P(|x|)

Furthermore, the output of M is always correct.

You can think of randomized algorithms as deterministic via the trick of considering an algorithm that tries each possible sequence of random choices, assuming discrete random variables.

Early primality-checking algorithms were often Monte Carlo: if a given number was composite, the algorithm would probably say so; otherwise the number was reported probably prime. There have since been Las Vegas and, eventually, deterministic algorithms. It is still not known whether having a Monte Carlo algorithm implies there is a Las Vegas algorithm, but the converse is true. It is also still an open question whether or not there exist problems whose only polynomial-time algorithms require randomness.

3 The Max-Cut Problem

Given a graph G = (V, E), we wish to partition V into two sets, A and B, such that the number of edges crossing the cut is maximized. This problem is NP-hard.

3.1 The Erdős and Spencer Probabilistic Method

We now use the probabilistic method to show that for any graph, there is guaranteed to exist a cut that contains at least half the edges of the graph. Notice that such a cut is always a 1/2-approximation to max-cut, since no cut can contain more than all the edges of the graph. It turns out this probabilistic method underlies a deterministic algorithm for 1/2-approximation of max-cut.

To perform the probabilistic algorithm, for each vertex i, flip a coin and let X_i be a random variable:

    X_i = −1 if coin i flipped tails, +1 if coin i flipped heads

Then if X_i = −1, put i in A; if X_i = +1, put i in B. Note E[X_i] = 0. Define a variable that represents whether or not an edge crosses the cut (1 or 0, respectively):

    E_ij = (1 − X_i X_j) / 2   for each edge (i, j) ∈ E

Let S be the number of edges that cross the cut. Then

    E[S] = E[ Σ_{(i,j)∈E} E_ij ]
         = Σ_{(i,j)∈E} E[E_ij]                        by linearity of expectation
         = Σ_{(i,j)∈E} E[(1 − X_i X_j)/2]
         = |E|/2 − (1/2) Σ_{(i,j)∈E} E[X_i X_j]
         = |E|/2 − (1/2) Σ_{(i,j)∈E} E[X_i] E[X_j]    because the coin flips were independent
         = |E|/2                                      because E[X_i] = 0 ∀i

This probabilistic argument says that since E[S] = |E|/2, there must exist some assignment of vertices to the sets A and B that is at least this good; but note that this just gives us the expectation. What we really want is that the probability that we get a cut at least as good as the expectation is at least 1/c for some constant c. You can prove that with Chebyshev's inequality. Also, a proof that this algorithm gives ≥ |E|/4 crossing edges with probability > 1/2 is left as an exercise to the reader.

From here, we will create a deterministic algorithm to do the same thing.
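First, here is a minimal Python sketch of the randomized procedure just analyzed (the edge-list representation over vertices 0, ..., n − 1 and the function name are our own conventions). Averaging the crossing count over many runs should come out near |E|/2, matching the expectation computed above.

    import random

    def random_cut(n, edges):
        # One fair coin per vertex: x[i] = -1 puts i in A, x[i] = +1 puts i in B.
        x = [random.choice([-1, 1]) for _ in range(n)]
        # Edge (i, j) crosses the cut exactly when x[i] != x[j],
        # i.e. when (1 - x[i] * x[j]) / 2 equals 1.
        crossing = sum((1 - x[i] * x[j]) // 2 for (i, j) in edges)
        return x, crossing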
The simplest version of such a deterministic algorithm would be to enumerate all possible configurations of the flipped coins and check them until you find a configuration that achieves the expectation. The Erdős-Spencer method is just that: having shown with a randomized algorithm that E[S] ≥ |E|/2, there must exist some configuration of coins with at least |E|/2 edges that cross the cut.

A Digression on Pairwise Independence

    x_i   x_j   x_k = x_i ⊕ x_j
     1     1     0
     1     0     1
     0     1     1
     0     0     0

In the table above (with x_i and x_j independent fair coin flips), any one variable (column) can be removed, leaving the remaining two pairwise independent. Notice that in the proof above, we only needed pairwise independence. In the context of a randomized algorithm, we just want a way to reduce the number of possible random strings of coin flips while maintaining pairwise independence, that is:

    Pr[X_s = a | X_t = b] = Pr[X_s = a]   where s ≠ t, s, t ∈ {i, j, k}

We can take advantage of that notion of pairwise independence in constructing a deterministic algorithm for the above problem that iterates over a sample space of coin flips that is polynomial (rather than exponential) in size, and thus runs in polynomial time.

Suppose we have N variables (one per vertex, indexed from 1 so that bin(i) ≠ 0), and assume N = 2^k for some natural number k (if not, just pad the set). We want a sequence of N values from {−1, 1}. Let W = w_1 w_2 w_3 ... w_{log N} be a random (log N)-bit sequence of 0's and 1's. Let X_i = (−1)^{bin(i) ⊙ W}, where bin(i) is the binary expansion of i and x ⊙ y is the count of bits set in the bitwise AND of x and y. Each X_i ∈ {−1, 1}, and any X_i, X_j with i ≠ j are pairwise independent. Notice X_i will be 1 if bin(i) ⊙ W is even and −1 if it is odd. As an example, 11010010 ⊙ 00011011 = 2. Vertex 36 would be written as log N bits, say as 00100100, and then combined bitwise with the bits of W. If two vertices have different indices, they differ in some bit, and the random bit of W in such a position is what makes the pair independent. Since W is only log N bits long, there are just 2^{log N} = N strings to try.

The algorithm is to enumerate all the (log N)-bit strings W, compute the X_i they induce on the vertices, and check each resulting partition to pick the best. Clearly, with only N strings to try, this is a polynomial-time algorithm, and since the analysis above used only pairwise independence, the expectation of S over this smaller sample space is also |E|/2, so the partition of the best string will give us the cut that we need in polynomial time.

3.2 A Much Simpler Deterministic Algorithm

We now show that there is, in fact, a much simpler deterministic procedure that yields the same bound. A vertex can be labeled as "happy" or "unhappy". A vertex is unhappy if it has more neighbors on its own side of the cut than on the other side, and happy otherwise. Any unhappy vertex can be made happy by flipping it to the other side of the cut.

The greedy algorithm is to simply flip unhappy vertices: while there exists an unhappy vertex, pick an unhappy vertex and flip it across the cut. While a flip may cause some neighbors to become unhappy, each flip is guaranteed to increase the number of edges that cross the cut, so forward progress is made with each step and the algorithm will terminate. In other words, at each step the number of edges with both endpoints on the same side of the cut decreases.
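In code, the greedy procedure is just the following loop (a sketch under our own conventions, with the graph given as an adjacency list):

    def greedy_cut(adj):
        # adj: dict mapping each vertex to the list of its neighbors.
        side = {v: 0 for v in adj}  # start with every vertex on side 0

        def unhappy(v):
            same = sum(1 for u in adj[v] if side[u] == side[v])
            return same > len(adj[v]) - same  # more neighbors on own side

        flipped = True
        while flipped:
            flipped = False
            for v in adj:
                if unhappy(v):
                    side[v] ^= 1    # flip v across the cut; this adds
                    flipped = True  # at least one crossing edge
        return side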
Furthermore, when the algorithm terminates, all vertices are happy: locally, each vertex has at least as many of its edges crossing the cut as staying on its own side. Summing over all vertices, the number of edges that cross the cut globally is at least the number of edges that stay on the same side of the cut, and thus the algorithm terminates with at least half the edges crossing the cut.

That solution makes at most O(|E|) flips, and it is still an open question whether or not there is an algorithm with O(|V|) running time. Both of these methods give a 1/2-optimal solution, but later in the class we will see the Goemans-Williamson algorithm, which gives a 0.878-optimal solution to the max-cut problem.

4 Maximal Independent Set (MIS)

Definition 4.0.1 A Maximal Independent Set (MIS) of a graph G = (V, E) is a set of vertices I ⊆ V such that

- Independent: x ∈ I ⇒ y ∉ I ∀y such that (x, y) ∈ E
- Maximal: ∀x, either x ∈ I or y ∈ I for some y such that (x, y) ∈ E

[Figure 1: Examples of Maximal Independent Sets]

Figure 1 shows two examples of maximal independent sets. Note that maximal ≠ maximum. The only restriction for maximality is that you cannot have a vertex outside the set with none of its neighbors in the set. Note that Maximum Independent Set is NP-hard, but MIS is easy: for example, greedily pick any remaining vertex, add it to the set, and delete it and its neighbors from the graph, repeating until no vertices remain. So deterministically finding a polynomial-time algorithm to compute an MIS is easy. However, finding a good parallel algorithm to compute an MIS is harder, and that is the problem we consider next.

4.1 Parallel Computing Application

To solve MIS with each node represented by its own processor, we approach the problem in rounds. We begin with an empty set of vertices I and a graph G = (V, E), and then in each round each of the vertices in V − I flips a coin that goes toward determining whether it can enter I. Let D be the maximum degree in G. Call a vertex "unsatisfied" if neither it nor any of its neighbors has been put in the MIS.

1. If vertex j is unsatisfied, it flips a coin to assign X_j = 1 with probability p = 1/(4D), and X_j = 0 otherwise
2. If X_j = 1 and X_k ≠ 1 for all neighbors k of j, then vertex j enters the MIS
3. Update the list of which vertices are satisfied. If not all vertices are satisfied, go to the next round

The calculation of how long this is likely to take is a bit more complicated. First define an indicator variable

    Y_i = X_i · Π_{(i,j)∈E} (1 − X_j)

The random variable Y_i shows whether or not vertex i enters the MIS this round. It is zero unless X_i is 1 and X_j is 0 for all j, the neighbors of i. To represent whether or not vertex i is satisfied, define:

    Z_i = 1 if Y_i + Σ_{(i,j)∈E} Y_j ≥ 1, and 0 otherwise

Next we can sort the vertices into groups by degree. We can make log D buckets:

    1-2, 2-4, 4-8, 8-16, ..., D/2-D

The last group is called the "big degree nodes," and we put in it all vertices i that satisfy

    2^{⌈log D⌉−1} ≤ degree(i) ≤ 2^{⌈log D⌉}

Now we claim that in each round, we satisfy a constant fraction of the big degree nodes. If that is true, we have shown O(log² n) rounds will be enough to have an MIS.

[Figure 2: Vertex i becomes satisfied by j flipping a 1 and all the other neighbors not flipping 1]

Proof 4.1.1 Let T be the number of big degree nodes that get satisfied in the round:

    T = Σ_{j ∈ {big degree nodes}} Z_j

We want to show that E[T] ≥ |big degree nodes| / c for some constant c, but we will actually bound based on something smaller. Consider the probability that i becomes satisfied by j flipping a 1 and none of the other neighbors flipping a 1 (i.e., the picture is Figure 2).
Then we define

    R_ij = X_j · Π_{(j,k)∈E} (1 − X_k) · Π_{(i,l)∈E, l≠j} (1 − X_l)

These R_ij are disjoint over the neighbors j of i, since they keep each other from happening: no two R_ij can be 1 at the same time. So, with D the maximum degree in G,

    Z_i ≥ Σ_{(i,j)∈E} R_ij

and therefore, since each big degree node has at least D/2 neighbors and each inner sum below has at most D terms,

    E[T] = Σ_{i∈big} E[Z_i]
         ≥ Σ_{i∈big} Σ_{(i,j)∈E} E[R_ij]
         ≥ Σ_{i∈big} Σ_{(i,j)∈E} ( p − Σ_{(j,k)∈E} p² − Σ_{(i,l)∈E, l≠j} p² )
         ≥ |big| · (D/2) · (p − Dp² − Dp²)

With p = 1/(4D), this last quantity is |big| · (D/2) · (1/(8D)) = |big|/16, so a constant fraction of the big degree nodes is satisfied in each round, as claimed.

For further study, read Mike Luby's paper on parallel, randomized MIS. For a generalization to the setting where there is not a clock maintained across all processors that allows synchronized rounds, see the paper of Awerbuch, Cowen, and Smith in the 1994 STOC conference.

5 Digression - Vertex Coloring

Definition 5.0.2 Let G = (V, E). A proper vertex coloring is an assignment C : V → {1, ..., k} of k colors such that ∀(u, v) ∈ E, C(u) ≠ C(v).

Any planar graph can be 4-colored. More generally, let Δ be the max degree of G. Then the vertices of any G can be colored with Δ + 1 colors. The proof is by induction on the number of vertices.

Proof 5.0.3 Assume the theorem is true for any graph with k vertices. Then for a graph with k + 1 vertices, take G − v for some vertex v, and what remains is a k-vertex graph. Color it with the Δ + 1 colors, and then color v with a color not used by any of its at most Δ neighbors.

Fact: any planar graph has average degree less than 6. Corollary: in any planar graph, there is always some vertex of degree 5 or less.

Theorem 5.0.4 Any planar graph can be colored with 6 colors.

Proof 5.0.5 Using the above facts: take a vertex of degree 5 or less and remove it. Since removing a vertex maintains planarity, the fact applies at every step, so recursively color the remaining graph with 6 colors; then add the removed vertex back in and give it a color not used by any of its at most 5 neighbors.

Distributed Δ + 1 coloring can be done by applying MIS iteratively, assigning a color to each MIS:

1. Each uncolored vertex chooses a color at random from its δ_i + 1 allowed colors, where δ_i is the degree of vertex i
2. For every edge whose endpoints chose the same color, arbitrarily uncolor one endpoint
3. Update the graph by removing all colored vertices and updating the list of allowed colors for the remaining vertices; repeat until all vertices are colored
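To make the round structure concrete, here is a small sequential simulation of this coloring procedure in Python (our sketch; in the truly distributed setting, each vertex would execute its step on its own processor). A vertex of degree d starts with d + 1 allowed colors and loses at most one per colored neighbor, so its palette is never empty and the procedure ends with a proper Δ + 1 coloring.

    import random

    def distributed_coloring(adj):
        # adj: dict mapping each vertex to the set of its neighbors.
        color = {}
        palette = {v: set(range(len(adj[v]) + 1)) for v in adj}  # degree + 1 colors
        uncolored = set(adj)
        while uncolored:
            # Step 1: each uncolored vertex picks an allowed color at random.
            tentative = {v: random.choice(sorted(palette[v])) for v in uncolored}
            # Step 2: on each conflicting edge, keep the endpoint processed
            # first and (arbitrarily) uncolor the other one.
            kept = {}
            for v, c in tentative.items():
                if all(kept.get(u) != c for u in adj[v]):
                    kept[v] = c
            # Step 3: commit the kept colors and update the remaining palettes.
            for v, c in kept.items():
                color[v] = c
                uncolored.remove(v)
                for u in adj[v]:
                    palette[u].discard(c)
        return color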