MAT 585: Exact Recovery of the Semidefinite Relaxation for Stochastic Block Model Afonso S. Bandeira April 7, 2015 Today we consider a semidefinite programming relaxation algorithm for SBM and derive conditions for exact recovery. The main ingredient for the proof will be duality theory. 1 Quick Recap • n nodes, two clusters of equal size cluster and inter-cluster edges. n 2, p and q probabilities of intra- • MLE approach: max X Aij xi xj i,j s.t. xi = ±1, ∀i X xj = 0, (1) j where A is the adjacency matrix of the graph G 1 if (i, j) ∈ E(G) Aij = 0 otherwise. (2) • MLE recovery condition α+β p − αβ > 1 2 1 (3) 2 The algorithm P Note that if we remove the constraint that j xj = 0 in (1) then the optimal solution becomes x = 1. Let us define B = 2A − (11T − I), meaning that 0 1 Bij = −1 if i = j if (i, j) ∈ E(G) otherwise (4) It is clear that the problem max X Bij xi xj i,j s.t. xi = ±1, ∀i X xj = 0 (5) j has the same solution as (1). However, when the constraint is dropped, max X Bij xi xj i,j s.t. xi = ±1, ∀i , (6) x = 1 is no longer an optimal solution. Intuitively, there is enough “−1” contribution to discourage unbalanced partitions. In fact, (6) is the problem we’ll set ourselves to solve. Unfortunately (6) is in general NP-hard (one can encode, for example, Max-Cut by picking an appropriate B). We will relax it to an easier problem by the same technique used to approximate the Max-Cut problem in the previous section (this technique is often known as matrix lifting). If we write X = xxT then we can formulate the objective of (6) as X Bij xi xj = xT Bx = Tr(xT Bx) = Tr(BxxT ) = Tr(BX) i,j Also, the condition xi = ±1 implies Xii = x2i = 1. This means that (6) is equivalent to 2 max Tr(BX) Xii = 1, ∀i s.t. X= xxT for some x ∈ (7) Rn . The fact that X = xxT for some x ∈ Rn is equivalent to rank(X) = 1 and X 0.This means that (6) is equivalent to max Tr(BX) s.t. Xii = 1, ∀i (8) X0 rank(X) = 1. We now relax the problem and, as before, remove the non-convex rank constraint max Tr(BX) s.t. Xii = 1, ∀i (9) X 0. This is an SDP that can be solved (up to arbitrary precision) in polynomial time [2]. Since we removed the rank constraint, the solution to (9) is no longer guaranteed to be of rank-1. We will take a different approach from the one we used before to obtain an approximation ratio for Max-Cut, which was a worst-case approximation ratio guarantee. What we will show is that, for some values of α and β, with high probability, the solution to (9) not only satisfies the rank constraint but it coincides with X = gg T where g corresponds to the true partition. After X is computed, g is simply obtained as its leading eigenvector. 3 The analysis Without loss of generality, we can assume that g = (1, . . . , 1, −1, . . . , −1)T , meaning that the true partition corresponds to the first n2 nodes on one side and the other n2 on the other. 3 3.1 Some preliminary definitions Recall that the degree matrix D of a graph G is a diagonal matrix where each diagonal coefficient Dii corresponds to the number of neighbours of vertex i and that λ2 (M ) is the second smallest eigenvalue of a symmetric matrix M . Definition 1 Let G+ (resp. G− ) be the subgraph of G that includes the edges that link two nodes in the same community (resp. in different communities) and A the adjacency matrix of G. We denote by DG+ (resp. DG− ) the degree matrix of G+ (resp. G− ) and define the Stochastic Block Model Laplacian to be LSBM = DG+ − DG− − A 3.2 Convex Duality A standard technique to show that a candidate solution is the optimal one for a convex problem is to use convex duality. We will describe duality with a game theoretical intuition in mind. The idea will be to rewrite (9) without imposing constraints on X but rather have the constraints be implicitly enforced. Consider the following optimization problem. max X min Z, Q Z is diagonal Q0 Tr(BX) + Tr(QX) + Tr (Z (In×n − X)) (10) Let us give it a game theoretical interpretation. Suppose there is a primal player (picking X) whose objective is to maximize the objective and a dual player, picking Z and Q after seeing X, trying to make the objective as small as possible. If the primal player does not pick X satistying the constraints of (9) then we claim that the dual player is capable of driving the objective to −∞. Indeed, if there is an i for which Xii 6= 1 then the dual player can 1 simply pick Zii = −c 1−X and make the objective as small as desired by ii taking large enough c. Similarly, if X is not positive semidefinite, then the dual player can take Q = cvv T where v is such that v T Xv < 0. If, on the other hand, X satisfies the constraints of (9) then Tr(BX) ≤ min Z, Q Z is diagonal Q0 Tr(BX) + Tr(QX) + Tr (Z (In×n − X)) , 4 since equality can be achieve if, for example, the dual player picks Q = 0n×n , then it is clear that the values of (9) and (10) are the same: max Tr(BX) = max X, Xii ∀i X0 X min Z, Q Z is diagonal Q0 Tr(BX) + Tr(QX) + Tr (Z (In×n − X)) With this game theoretical intuition in mind, it is clear that if we change the “rules of the game” and have the dual player decide her variables before the primal player (meaning that the primal player can pick X knowing the values of Z and Q) then it is clear that the objective can only increase, which means that: max Tr(BX) ≤ X, Xii ∀i X0 min Z, Q Z is diagonal Q0 max Tr(BX) + Tr(QX) + Tr (Z (In×n − X)) . X Note that we can rewrite Tr(BX) + Tr(QX) + Tr (Z (In×n − X)) = Tr ((B + Q − Z) X) + Tr(Z). When playing: min Z, Q Z is diagonal Q0 max Tr ((B + Q − Z) X) + Tr(Z), X if the dual player does not set B + Q − Z = 0n×n then the primal player can drive the objective value to +∞, this means that the dual player is forced to chose Q = Z − B and so we can write min Z, Q Z is diagonal Q0 max Tr ((B + Q − Z) X) + Tr(Z) = X min Z, Z is diagonal Z−B0 max Tr(Z), X which clearly does not depend on the choices of the primal player. This means that max Tr(BX) ≤ min Tr(Z). X, Xii ∀i X0 Z, Z is diagonal Z−B0 This is known as weak duality (strong duality says that, under some conditions the two optimal values actually match, see, for example, [2]). Also, the problem 5 min Tr(Z) s.t. Z is diagonal (11) Z −B 0 is called the dual problem of (9). The derivation above explains why the objective value of the dual is always greater or equal to the primal. Nevertheless, there is a much simpler proof (although not as enlightening): let X, Z be respectively a feasible point of (9) and (11). Since Z is diagonal and Xii = 1 then Tr(ZX) = Tr(Z). Also, Z − B 0 and X 0, therefore Tr[(Z − B)X] ≥ 0. Altogether, Tr(Z) − Tr(BX) = Tr[(Z − B)X] ≥ 0, as stated. Recall that we want to show that gg T is the optimal solution of (9). Then, if we find Z diagonal, such that Z − B 0 and Tr[(Z−B)gg T ] = 0, (this condition is known as complementary slackness) then X = gg T must be an optimal solution of (9). To ensure that gg T is the unique solution we just have to ensure that the nullspace of Z − B only has dimension 1 (which corresponds to multiples of g). Essentially, if this is the case, then for any other possible solution X one could not satisfy complementary slackness. This means that if we can find Z with the following properties: 1. Z is diagonal 2. Tr[(Z − B)gg T ] = 0 3. Z − B 0 4. λ2 (Z − B) > 0, then gg T is the unique optima of (9) and so recovery of the true partition is possible (with an efficient algorithm). Z is known as the dual certificate, or dual witness. 6 3.3 Building the dual certificate The idea to build Z is to construct it to satisfy properties (1) and (2) and try to show that it satisfies (3) and (4) using concentration. If indeed Z − B 0 then (2) becomes equivalent to (Z − B)g = 0. This means that we need to construct Z such that Zii = g1i B[i, :]g. Since B = 2A − (11T − I) we have Zii = 1 1 (2A − (11T − I))[i, :]g = 2 (Ag)i + 1, gi gi meaning that Z = 2(DG+ − DG− ) + I is our guess for the dual witness. As a result Z − B = 2(DG+ − DG− ) − I − 2A − (11T − I) = 2LSBM + 11T It trivially follows (by construction) that (Z − B)g = 0. Therefore Lemma 2 If λ2 (2LSBM + 11T ) > 0, (12) then the relaxation recovers the true partition. Note that 2LSBM + 11T is a random matrix and so this boils down to “an exercise” in random matrix theory. 3.4 Matrix Concentration Clearly, E 2LSBM + 11T = 2ELSBM + 11T = 2EDG+ − 2EDG− − 2EA + 11T , and EDG+ = n 2 × n 2 n α log(n) I, 2 n EDG− = n β log(n) I, 2 n and EA is a matrix such with 4 α log(n) and the off-diagonal n β log(n) 1 α log(n) =2 + 11T + n n blocks where the diagonal blocks have blocks have β log(n) . We can write this as EA n β log(n) 1 α log(n) − gg T 2 n n 7 This means that E 2LSBM + 11 T log n log n T = ((α − β) log n) I+ 1 − (α + β) 11T −(α−β) gg . n n Since 2LSBM g = 0 we can ignore what happens in the span of g and it is not hard to see that log n T log n 11T − (α − β) gg = (α−β) log n. λ2 ((α − β) log n) I + 1 − (α + β) n n This means that it is enough to show that kLSBM − E [LSBM ]k < α−β log n, 2 (13) which is a large deviations inequality. (k · k denotes operator norm) We will not go into the details here but the idea is to write LSBM − E [LSBM ] as a sum of independent random matrices and use Matrix Bernstein inequality (as in [1]). In fact, using this strategy one can show that, with high probability, 13 holds as long as 8 (α − β)2 > 8(α + β) + (α − β). 3 3.5 (14) Comparison with phase transition To compare (14) with (3) we note that the latter can be rewritten as (α − β)2 > 4(α + β) − 4 and α + β > 2 and so the relaxation achieves exact recovery almost at the threshold, essentially only suboptimal by a factor of 2. References [1] J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–424, 2012. [2] L. Vanderberghe and S. Boyd. Semidefinite programming. SIAM Review, 38:49–95, 1996. 8