On the robustness of PhaseLift to arbitrary errors in the noiseless case

Paul Hand
Rice University, Department of Computational and Applied Mathematics, MS-134, 6100 Main Street, Houston, TX 77005

Abstract—The phase retrieval problem is the task of recovering an unknown n-vector from phaseless linear measurements. Through the technique of lifting, this problem may be convexified into a semidefinite rank-one matrix recovery problem known as PhaseLift. Under a linear number of exact Gaussian measurements, PhaseLift recovers the unknown vector exactly with high probability. Under noisy measurements, the solution to a variant of PhaseLift has error proportional to the ℓ1 norm of the noise. In the present paper, we prove that this variant of PhaseLift can tolerate a small fraction of arbitrary errors under O(n) real-valued measurements. The phase retrieval problem under arbitrary errors can be viewed as a special case of robust Principal Component Analysis (PCA) for a rank-one matrix with Gaussian rank-one measurements. In this case, the PhaseLift program is simpler than the standard robust PCA optimization program because it has no free parameters.

I. INTRODUCTION

This paper studies an algorithm for recovering a vector x_0 ∈ R^n from phaseless linear measurements that may contain arbitrary errors. That is, for fixed measurement vectors a_i ∈ R^n, i = 1, . . . , m, our task is to find x_0 satisfying

b_i = |⟨x_0, a_i⟩|^2 + ε_i    (1)

for known b_i ∈ R, known a_i, and unknown ε_i. This problem is known as phase retrieval. In the present paper, we analyze it in the case that the error vector ε is sparse yet arbitrary. This setup is noiseless in the sense that most measurements are exact.

Measurements of the form (1) arise in several applications, including X-ray crystallography [15], [10], [1]. In such applications, extremely large errors in some measurements may be due to sensor failure, occlusions, or other effects. Ideally, recovery algorithms could provably tolerate a small number of such errors.

Recently, researchers have introduced algorithms for the phase retrieval problem that have provable recovery guarantees [1], [4]. The insight of these methods is that the phase retrieval problem can be convexified by lifting it to the space of matrices. That is, instead of searching for the vector x_0, one can search for the lifted matrix x_0 x_0^t. The quadratic measurements (1) then become linear measurements on this lifted matrix. As the desired matrix is semidefinite and low rank, one can write a semidefinite rank minimization problem, which has a convex relaxation known as PhaseLift. In the noiseless case, PhaseLift is the program

min tr(X)  subject to  X ≽ 0,  a_i^t X a_i = b_i, i = 1, . . . , m.    (2)

Here, tr(X) is a convex proxy for the rank of X. An estimate of the underlying signal x_0 can be computed from the leading eigenvector of the optimizer of (2). When the measurements a_i are Gaussian, [7] and [2] have shown that (2) can be simplified to the semidefinite feasibility problem

find X ≽ 0  such that  a_i^t X a_i = b_i, i = 1, . . . , m.    (3)

In [2], the authors showed that this program recovers x_0 x_0^t exactly with high probability when m ≥ cn for a sufficiently large c. This result is quite surprising because there are only O(n) measurements for an O(n^2)-dimensional object. The semidefinite constraint apparently is sufficiently 'pointy' that the high-dimensional affine space of data-consistent matrices intersects the semidefinite cone at exactly one point.
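To make the lifted formulation concrete, the following is a minimal numerical sketch of the noiseless program (2) with Gaussian measurements. It is not the paper's code: the problem sizes, random seed, solver choice (SCS via CVXPY), and variable names are illustrative assumptions.

```python
# Minimal sketch of the noiseless PhaseLift program (2), assuming Gaussian
# measurements. Not the paper's code: problem sizes, the random seed, and the
# SCS solver are illustrative choices.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, m = 10, 60                          # a linear number of measurements, m = 6n
x0 = rng.standard_normal(n)            # unknown signal
A = rng.standard_normal((m, n))        # rows are the measurement vectors a_i
b = (A @ x0) ** 2                      # b_i = |<x0, a_i>|^2, no errors

X = cp.Variable((n, n), PSD=True)      # lifted variable, ideally X = x0 x0^t
constraints = [cp.sum(cp.multiply(np.outer(a, a), X)) == bi   # a_i^t X a_i = b_i
               for a, bi in zip(A, b)]
cp.Problem(cp.Minimize(cp.trace(X)), constraints).solve(solver=cp.SCS)

# Estimate x0 (up to global sign) from the leading eigenvector of the optimizer.
w, V = np.linalg.eigh(X.value)
x_hat = np.sqrt(max(w[-1], 0.0)) * V[:, -1]
print(min(np.linalg.norm(x_hat - x0), np.linalg.norm(x_hat + x0)))
```

When m is a sufficiently large multiple of n, the printed recovery error should be near zero up to solver tolerance, consistent with the exact recovery result quoted above.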
In the noisy case where ε may be dense, [2] showed that the PhaseLift variant

min Σ_i |a_i^t X a_i − b_i|  subject to  X ≽ 0    (4)

successfully recovers a matrix near x_0 x_0^t with high probability. Specifically, they prove that the solution X̂ to (4) satisfies ∥X̂ − x_0 x_0^t∥_F ≤ C_0 ∥ε∥_1/m with high probability. This error estimate is most interesting in the case where the error is small in an ℓ1 sense.

The contribution of the present paper is to show that the program (4) is additionally robust against a constant fraction of arbitrary errors.

Theorem 1. There exist positive numbers f_min, c, γ such that for any x_0 ∈ R^n and any ε ∈ R^m the following holds. If m ≥ cn and ∥ε∥_0/m ≤ f_min, then x_0 x_0^t is the unique solution to (4) with probability at least 1 − e^{−γm}.

That is to say, x_0 x_0^t is exactly recovered with high probability, even in the presence of a linear number of arbitrary errors among a linear number of measurements. In the case that there are no gross errors, this theorem simplifies to the recovery results of [2] and the rank-one case of [12].

Theorem 1 prompts several important questions. Is the semidefinite program (4) robust against noise? Can the theorem be strengthened to permit universal recovery for any x_0 and ε under a fixed realization of the measurement vectors a_i? Also, can the theorem be extended to the complex-valued case? For a brief comment on the possible complex extension, see Section II-C. We leave all these questions to future research.

A. Relation to Robust PCA

Robust Principal Component Analysis (PCA) is the task of matrix completion under gross measurement errors. This framework typically involves measuring some of the entries of a low-rank n × n matrix X_0 and assuming that some fraction of those measurements are arbitrarily corrupted, giving the data matrix A. The matrix X_0 can then be recovered under certain conditions by a sparse-plus-low-rank convex program [3], [5], [11], [19], [8], [6], [13]:

min_{X,E} λ∥X∥_* + ∥E∥_1  subject to  P(X + E) = P(A),    (5)

where ∥X∥_* is the nuclear norm of X, λ is a constant, ∥E∥_1 is the ℓ1 norm of the vectorization of E, and P is the projection of a matrix onto the observed entries.

Results from this formulation have been quite surprising. For example, under an appropriate choice of λ and under an incoherence assumption, the sparse-plus-low-rank decomposition succeeds for sufficiently low-rank X_0 when O(n^2) entries are measured and a small fraction of them have arbitrary errors [3]. Subsequent results only require m ≳ rn polylog(n) measurements, where r is the rank of X_0 [6], [13]. These results are information-theoretically optimal except for the polylogarithmic factor.

The present paper can be viewed as a rank-one semidefinite robust PCA problem under Gaussian rank-one measurements. From this perspective, we would naturally formulate the problem as

min λ tr(X) + Σ_i |a_i^t X a_i − b_i|  subject to  X ≽ 0.    (6)

The present paper shows that direct rank penalization through the trace term is not fundamental for exact recovery in the presence of arbitrary errors. That is, (6) can be simplified by taking λ = 0. The resulting program has no free parameters that need to be explicitly tuned. As in [7], [2], the positive semidefinite cone provides enough of a constraint to enforce low-rankness.

The present paper differs from the robust PCA literature in that the measurements are generic and are not direct samples of the entries of the unknown matrix. As in the matrix completion literature, this difference in measurement structure affords proofs of information-theoretically optimal data scalings with no additional logarithmic factors.

B. Numerical simulation

We now explore the empirical performance of (4) by numerical simulation. Let the signal length n vary from 5 to 50, and let the number of measurements m vary from 10 to 250. Let x_0 = e_1, and let a_i ∼ N(0, I_n). For each (n, m), we consider measurements such that

b_i ∼ Uniform([0, 10^4])  if 1 ≤ i ≤ ⌈0.05m⌉,
b_i = |⟨x_0, a_i⟩|^2  otherwise.

We estimate x_0 x_0^t by solving (4) using the SDPT3 solver [16], [17] and YALMIP [14]. For a given optimizer X̂, define the capped relative error as min(∥X̂ − x_0 x_0^t∥_F / ∥x_0 x_0^t∥_F, 1).
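One (n, m) cell of this simulation can be reproduced with the short sketch below. It follows the setup in the text but is our own illustration: CVXPY with the SCS solver stands in for SDPT3 [16], [17] and YALMIP [14], and the particular n, m, and seed are arbitrary choices.

```python
# One (n, m) cell of the Section I-B experiment, as a sketch. The setup
# (x0 = e1, Gaussian a_i, the first 5% of measurements replaced by draws from
# Uniform[0, 1e4]) follows the text; CVXPY/SCS stands in for SDPT3/YALMIP.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(1)
n, m = 15, 150
x0 = np.zeros(n); x0[0] = 1.0              # x0 = e1
A = rng.standard_normal((m, n))            # a_i ~ N(0, I_n)

b = (A @ x0) ** 2
k = int(np.ceil(0.05 * m))                 # 5% of measurements grossly corrupted
b[:k] = rng.uniform(0.0, 1e4, size=k)

X = cp.Variable((n, n), PSD=True)
residuals = cp.hstack([cp.sum(cp.multiply(np.outer(a, a), X)) - bi
                       for a, bi in zip(A, b)])
cp.Problem(cp.Minimize(cp.sum(cp.abs(residuals)))).solve(solver=cp.SCS)   # program (4)

X0 = np.outer(x0, x0)
print(min(np.linalg.norm(X.value - X0) / np.linalg.norm(X0), 1.0))  # capped rel. error
```

With m this large relative to n, the printed capped relative error should be near zero, matching the white region of Fig. 1 below, although the exact value depends on solver tolerance.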
Figure 1 plots the average capped relative error over 10 independent trials. It provides empirical evidence that the problem (4) succeeds under a linear number of measurements, even when a constant fraction of them contain very large errors.

Fig. 1. Recovery error for the PhaseLift matrix recovery problem (4) as a function of n and m, when 5% of measurements contain large errors (horizontal axis: number of measurements m; vertical axis: signal length n). Black represents an average recovery error of 100% over 10 independent trials. White represents an average recovery error of zero. The solid curve m = n(n+1)/2 depicts when the number of measurements equals the number of degrees of freedom in a symmetric n × n matrix. The number of measurements required for successful recovery appears to be linear in n in the presence of large errors.

II. PROOFS

Let A : S^n → R^m be the mapping X ↦ (a_i^t X a_i)_{i=1,...,m}, where S^n is the space of symmetric real-valued n × n matrices. Note that A^*λ = Σ_i λ_i a_i a_i^t. Let e_i be the i-th standard basis vector. Without loss of generality, we take x_0 = e_1. Let X_0 = x_0 x_0^t. We can write the measurements (1) as b = A X_0 + ε. Similarly, the optimization program (4) can be written

min ∥A X − b∥_1  such that  X ≽ 0.
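Since the dual certificates used below all take the form Y = A^*λ, it may help to see the operator A and its adjoint concretely. The following NumPy sketch is ours, not the paper's; it simply checks the adjoint identity ⟨A(X), λ⟩ = ⟨X, A^*λ⟩ on random inputs, with arbitrary sizes.

```python
# The lifted operator A(X) = (a_i^t X a_i)_i and its adjoint
# A*(lam) = sum_i lam_i a_i a_i^t, with a check of the adjoint identity
# <A(X), lam> = <X, A*(lam)>. A small illustrative sketch; sizes are arbitrary.
import numpy as np

rng = np.random.default_rng(2)
n, m = 8, 40
A_rows = rng.standard_normal((m, n))       # measurement vectors a_i as rows

def forward(X):
    """A(X): the m quadratic measurements a_i^t X a_i."""
    return np.einsum('ij,jk,ik->i', A_rows, X, A_rows)

def adjoint(lam):
    """A*(lam) = sum_i lam_i a_i a_i^t."""
    return A_rows.T @ (lam[:, None] * A_rows)

M = rng.standard_normal((n, n))
X = (M + M.T) / 2                          # random symmetric test matrix
lam = rng.standard_normal(m)
lhs = forward(X) @ lam                     # <A(X), lam>
rhs = np.sum(X * adjoint(lam))             # <X, A*(lam)>, Hilbert-Schmidt
print(abs(lhs - rhs))                      # should be ~1e-12
```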
We introduce the following notation. Let ∥X∥_* be the nuclear norm of the matrix X. When X ≽ 0, ∥X∥_* = tr(X). Denote the Frobenius and spectral norms of X by ∥X∥_F and ∥X∥, respectively. Let ⟨X, Y⟩ = tr(Y^t X) be the Hilbert-Schmidt inner product. Let T be the space of symmetric matrices supported on their first row and column. The orthogonal complement T⊥ is then the space of matrices supported in the lower-right (n−1) × (n−1) block. Let I be the identity matrix, and let 1(E) be the indicator function of the event E.

A. Recovery by dual certificates

The proof of Theorem 1 is based on dual certificates, as in [9], [4], [7], [2]. A dual certificate is an optimal variable for the dual problem to (4). Its existence certifies the correctness of a minimizer to (4). The first-order optimality conditions at X_0 for (4) are

Ỹ = A^*λ̃,    (7)
λ̃ ∈ ∂∥·∥_1(−ε),    (8)
Ỹ ≽ 0,    (9)
⟨Ỹ, X_0⟩ = 0,    (10)

where ∂∥·∥_1(−ε) is the subdifferential of the ℓ1 norm evaluated at −ε. Note that (9) and (10) imply Ỹ_T = 0. Such a Ỹ would be a dual certificate for (4). Unfortunately, constructing Ỹ is difficult.

As in [9], [4], [7], [2], we seek an inexact dual certificate, which approximately satisfies these conditions. Specifically, we build a dual certificate Y = A^*λ that satisfies

Y_{T⊥} ≽ I_{T⊥},    (11)
∥Y_T∥_F ≤ 1/2,    (12)
λ_i = −α sgn(ε_i) if ε_i ≠ 0,  and  |λ_i| ≤ α if ε_i = 0,    (13)

for some positive α. To prove that the existence of such a Y guarantees successful recovery of x_0 x_0^t with high probability, we rely on two technical lemmas. The first lemma provides ℓ1-isometry bounds on A, proven in [4].

Lemma 1 ([4]). There exist constants c_0, γ_0 such that if m ≥ c_0 n, then with probability at least 1 − e^{−γ_0 m},

(1/m) ∥A(X)∥_1 ≤ (1 + 1/16) ∥X∥_*  for all X,    (14)
(1/m) ∥A(X)∥_1 ≥ 0.94 (1 − 1/16) ∥X∥  for all symmetric, rank-2 X.    (15)

The second technical lemma establishes that any optimal solution X_0 + H to (4) lies on the cone ∥H_{T⊥}∥_* ≥ 0.56 ∥H_T∥_F with high probability.

Lemma 2. Fix x_0 ∈ R^n and ε ∈ R^m. There exists a constant γ̃ such that the following holds. If m ≥ 100 c_0 n and ∥ε∥_0/m ≤ 0.01, then with probability at least 1 − e^{−γ̃ m}, any feasible X_0 + H such that ∥A(X_0 + H) − b∥_1 ≤ ∥A(X_0) − b∥_1 satisfies ∥H_{T⊥}∥_* ≥ 0.56 ∥H_T∥_F.

Proof: By assumption, ∥AH − ε∥_1 ≤ ∥ε∥_1. Let S be a set of cardinality ⌈0.01m⌉ containing the support of ε, and let A_S denote the restriction of A to the measurements indexed by S. We have

∥A_{S^c} H∥_1 + ∥A_S H − ε∥_1 ≤ ∥ε∥_1.    (16)

By the triangle inequality,

∥A_{S^c} H∥_1 + ∥ε∥_1 − ∥A_S H∥_1 ≤ ∥ε∥_1,    (17)

which implies

∥A_{S^c} H∥_1 ≤ ∥A_S H∥_1.    (18)

Breaking (18) into its components on T and T⊥ and applying the triangle inequality, we have

∥A_{S^c} H_T∥_1 − ∥A_{S^c} H_{T⊥}∥_1 ≤ ∥A_S H_T∥_1 + ∥A_S H_{T⊥}∥_1
⇒ ∥A_{S^c} H_T∥_1 ≤ ∥A_S H_T∥_1 + ∥A H_{T⊥}∥_1.    (19)

We now apply the ℓ1-isometry bounds from Lemma 1 to each term of (19). As |S^c| ≥ c_0 n, Lemma 1 gives that with probability at least 1 − e^{−γ_1 m},

∥A_{S^c} H_T∥_1 ≥ 0.94 (1 − 1/16) |S^c| ∥H_T∥    (20)
≥ 0.94 (1 − 1/16) (|S^c|/√2) ∥H_T∥_F,    (21)

where the second inequality follows because H_T has rank at most 2. As |S| ≥ c_0 n and |S| = ⌈0.01m⌉, Lemma 1 gives that with probability at least 1 − e^{−γ_2 m},

∥A_S H_T∥_1 ≤ |S| (1 + 1/16) ∥H_T∥_*    (22)
≤ |S| (1 + 1/16) √2 ∥H_T∥_F.    (23)

As m ≥ c_0 n, Lemma 1 also gives that with probability at least 1 − e^{−γ_0 m},

∥A H_{T⊥}∥_1 ≤ m (1 + 1/16) ∥H_{T⊥}∥_*.    (24)

Combining (19)–(24), we have

( 0.94 (1 − 1/16) / (√2 (1 + 1/16)) · |S^c|/m − √2 |S|/m ) ∥H_T∥_F ≤ ∥H_{T⊥}∥_*.

Thus, 0.56 ∥H_T∥_F ≤ ∥H_{T⊥}∥_* with probability at least 1 − 3e^{−γ̃ m}.

We now prove that the existence of an inexact dual certificate satisfying (11)–(13) guarantees successful recovery of X_0 = x_0 x_0^t with high probability.

Lemma 3. Fix x_0 ∈ R^n. Let m ≥ 100 c_0 n. Fix ε ∈ R^m such that ∥ε∥_0/m ≤ 0.01. Suppose that there exists Y = A^*λ satisfying (11)–(13). Then, with probability at least 1 − e^{−γ̃ m}, X_0 is the unique solution to (4).

Proof: Let X_0 + H be feasible for (4) and such that ∥A(X_0 + H) − b∥_1 ≤ ∥A(X_0) − b∥_1. That is, ∥AH − ε∥_1 ≤ ∥ε∥_1. By condition (13),

λ/α ∈ ∂∥·∥_1(−ε).    (25)

Hence, since the subgradient inequality gives ∥AH − ε∥_1 ≥ ∥−ε∥_1 + ⟨λ/α, AH⟩, we have

∥−ε∥_1 + ⟨λ/α, AH⟩ ≤ ∥ε∥_1    (26)
⇒ ⟨λ, AH⟩ ≤ 0    (27)
⇒ ⟨Y, H⟩ ≤ 0,    (28)

where the last implication uses Y = A^*λ. Decomposing (28) into T and T⊥, we have

⟨Y_T, H_T⟩ ≤ −⟨Y_{T⊥}, H_{T⊥}⟩.    (29)

As Y_{T⊥} ≽ 0 and H_{T⊥} ≽ 0 (the latter because H_{T⊥} is the lower-right block of the feasible matrix X_0 + H ≽ 0), we have

⟨Y_{T⊥}, H_{T⊥}⟩ ≤ |⟨Y_T, H_T⟩|.    (30)

By conditions (11)–(12),

∥H_{T⊥}∥_* ≤ |⟨Y_{T⊥}, H_{T⊥}⟩| ≤ |⟨Y_T, H_T⟩| ≤ (1/2) ∥H_T∥_F.

By Lemma 2, ∥H_{T⊥}∥_* ≥ 0.56 ∥H_T∥_F with probability at least 1 − e^{−γ̃ m}. We conclude that H_T = 0 and H_{T⊥} = 0, with probability at least 1 − e^{−γ̃ m}. Thus, X_0 is the unique solution to (4).
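As an aside, the ℓ1-isometry bounds (14)–(15) of Lemma 1, which drive the proof of Lemma 2, are easy to probe numerically. The sketch below is ours, not the paper's; it only spot-checks the two inequalities on a few randomly drawn symmetric rank-2 matrices (the lemma asserts them uniformly over X with high probability), and the sizes n, m and the seed are arbitrary.

```python
# Spot-check of the l1-isometry bounds (14)-(15) of Lemma 1 on a few random
# symmetric rank-2 matrices. Illustrative only: the lemma asserts the bounds
# uniformly over X with high probability, and n, m below are arbitrary.
import numpy as np

rng = np.random.default_rng(3)
n, m = 30, 3000
A_rows = rng.standard_normal((m, n))

def meas_l1(X):
    """(1/m) ||A(X)||_1 for the lifted operator A."""
    return np.abs(np.einsum('ij,jk,ik->i', A_rows, X, A_rows)).sum() / m

for _ in range(5):
    u, v = rng.standard_normal(n), rng.standard_normal(n)
    X = np.outer(u, u) - np.outer(v, v)        # symmetric, rank 2
    eig = np.linalg.eigvalsh(X)
    nuc, spec = np.abs(eig).sum(), np.abs(eig).max()
    val = meas_l1(X)
    print(val <= (1 + 1 / 16) * nuc,           # upper bound (14)
          val >= 0.94 * (1 - 1 / 16) * spec)   # lower bound (15)
```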
B. Construction of the dual certificate

We now construct the dual certificate for x_0 = e_1. Let S^+ and S^− be disjoint supersets of the indices over which ε is positive or negative, respectively. Let S = S^+ ∪ S^−. For pedagogical purposes, S^+ and S^− should be thought of as exactly the indices over which ε is positive or negative. For technical reasons, we let them be supersets with cardinality linear in n, in order to use standard concentration estimates. For a fixed choice of S^+ and S^−, let the inexact dual certificate Y be defined by

Y = (1/m) ( Σ_{i∈S^+} (−7) a_i a_i^t + Σ_{i∈S^−} 7 a_i a_i^t + Σ_{i∈S^c} [β_0 − |⟨a_i, e_1⟩|^2 · 1(|⟨a_i, e_1⟩| ≤ 3)] a_i a_i^t ),    (31)

where β_0 = E[z^4 · 1(|z| ≤ 3)] ≈ 2.6728 and z is a standard normal random variable. We write (31) as Y = Y_+ + Y_− + Y_0.

The form of Y_0 is due to [2], and the intuition behind it is as follows. Note that E(a_i a_i^t) = I_n and E(|⟨a_i, e_1⟩|^2 a_i a_i^t) = diag(β̃_0, I_{n−1}), where β̃_0 = E z^4 for a standard normal z. The construction (1/m) Σ_{i=1}^m [β̃_0 − |⟨a_i, e_1⟩|^2] a_i a_i^t thus has expected value (β̃_0 − 1) I_{T⊥}, which would provide the exact dual certificate conditions (9)–(10). As shown in [2], a satisfactory inexact dual certificate can be built with m = O(n) and coefficients that are truncated to be no larger than 7/m. In the present formulation, the terms Y_+ and Y_− are then set to have coefficients ∓7/m in order to satisfy (13).

Lemma 4. Let x_0 = e_1 and fix ε ∈ R^m. There exist constants c, γ^* such that the following holds. Let m ≥ cn. Suppose ∥ε∥_0/m ≤ 0.001. Let S^+ and S^− be disjoint sets of cardinality ⌈0.001m⌉ containing the indices over which ε is positive and negative, respectively. Then the dual certificate Y from (31) satisfies ∥Y_T∥_F ≤ 1/4 and ∥Y_{T⊥} − (17/10) I_{T⊥}∥ ≤ 3/10 with probability at least 1 − e^{−γ^* m}.

Proof: It suffices to show that with high probability

∥Y_{0,T⊥} − (17/10) I_{T⊥}∥ ≤ 0.15,    (32)
∥Y_{0,T}∥_F ≤ 3/20,    (33)
∥Y_{±,T⊥}∥ ≤ 0.015,    (34)
∥Y_{±,T}∥_F ≤ 0.035.    (35)

First, we establish (32)–(33). By Lemma 2.3 in [2], there exist c̃, γ̃ such that if |S^c| ≥ c̃ n, then with probability at least 1 − e^{−γ̃ |S^c|},

∥(m/|S^c|) Y_{0,T⊥} − (17/10) I_{T⊥}∥ ≤ 1/10, and    (36)
∥(m/|S^c|) Y_{0,T}∥_F ≤ 3/20.    (37)

Thus,

∥Y_{0,T⊥} − (17/10) I_{T⊥}∥ ≤ ∥Y_{0,T⊥} − (|S^c|/m)(17/10) I_{T⊥}∥ + (1 − |S^c|/m)(17/10) ≤ 0.15,    (38)

which establishes (32). By (37), we get (33) immediately.

Next, we establish (34). Let a_i′ be the vector formed by the last n − 1 components of a_i. Observe that Σ_{i∈S^+} a_i′ a_i′^t is a Wishart random matrix, for which standard estimates, such as Corollary 5.35 in [18], apply. If |S^+| ≥ c̃_0 n and |S^+| = ⌈0.001m⌉, then with probability at least 1 − e^{−γ̃_1 m} for some γ̃_1,

∥(1/|S^+|) Σ_{i∈S^+} a_i′ a_i′^t − I_{T⊥}∥ ≤ 1/2.

Hence,

∥(1/|S^+|) Σ_{i∈S^+} a_i′ a_i′^t∥ ≤ 3/2.

Thus, ∥Y_{+,T⊥}∥ ≤ (3/2) · 7|S^+|/m, and we arrive at (34). The bound for the Y_{−,T⊥} term is identical.

Next, we establish (35). Note that Y_+ = −7 (|S^+|/m) · (1/|S^+|) Σ_{i∈S^+} a_i a_i^t. As per Lemma 5, if |S^+| ≥ c̃_1 n and |S^+| = ⌈0.001m⌉, then ∥Y_{+,T}∥_F ≤ 7 (|S^+|/m) · 5 ≤ 7 · 0.001 · 5, with probability at least 1 − e^{−γ̃_1 m}.

Thus (32)–(35) hold simultaneously with probability at least 1 − e^{−γ^* m} for some γ^*, provided c ≥ max(1000 c_0, 2 c̃, 1000 c̃_0, 1000 c̃_1).

The behavior of Y_{±,T} relies on the following lemma.

Lemma 5. Let A = (1/m) Σ_{i=1}^m a_i a_i^t. There exist c̃_1, γ̃_1 such that if m ≥ c̃_1 n, then ∥A_T∥_F ≤ 5 with probability at least 1 − e^{−γ̃_1 m}.

Proof: Let y be the (1,1) entry of A, and let y′ be the rest of the first column. Then ∥A_T∥_F^2 = y^2 + 2∥y′∥_2^2. Now, y = (1/m) Σ_{i=1}^m a_i(1)^2, and hence my ∼ χ²_m. Standard results on the concentration of chi-squared variables give P(my ≥ 4m) ≤ e^{−γ̃ m}. Hence P(y^2 ≥ 16) ≤ e^{−γ̃ m}. It remains to bound ∥y′∥_2^2. We can write y′ = (1/m) Z′ c, where Z′ = [a′_1, . . . , a′_m], c_i = a_i(1), and Z′ and c are independent. Note that ∥c∥_2^2 is a chi-squared random variable with m degrees of freedom. Hence, with probability at least 1 − e^{−m}, ∥c∥_2^2 ≤ 3m. For fixed ∥x∥_2 = 1, ∥Z′ x∥_2^2 ∼ χ²_{n−1}, and hence, with probability at least 1 − e^{−γ m}, ∥Z′ x∥_2^2 ≤ m/100 when m ≥ c̃_1 n. Applying this with x = c/∥c∥_2, which is independent of Z′, gives m^2 ∥y′∥_2^2 = ∥Z′ c∥_2^2 ≤ 3m · (m/100) with probability at least 1 − 2e^{−γ m}. Thus ∥y′∥_2 ≤ √3/10 with probability at least 1 − 2e^{−γ m}. On these events, ∥A_T∥_F^2 ≤ 25, and hence ∥A_T∥_F ≤ 5.
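The construction (31) is easy to instantiate numerically. The following sketch is our illustration, not the paper's code: it builds Y for x_0 = e_1 and a sparse error vector and prints the two quantities bounded in Lemma 4. The sizes n, m, the error magnitudes, and the seed are arbitrary, and S^+ and S^− are taken to be exactly the positive and negative supports of ε.

```python
# Numerical instantiation of the dual certificate (31) for x0 = e1 and a sparse
# error vector, printing the two quantities bounded in Lemma 4. Illustrative
# parameters; S+ and S- are exactly the positive/negative supports of eps.
import numpy as np

rng = np.random.default_rng(4)
n, m = 20, 20000
beta0 = 2.6728                                 # E[z^4 1(|z| <= 3)], z ~ N(0, 1)
A_rows = rng.standard_normal((m, n))

k = int(np.ceil(0.001 * m))                    # |S+| = |S-| = ceil(0.001 m)
eps = np.zeros(m)
eps[:k] = rng.uniform(1.0, 1e4, size=k)        # S+: indices 0..k-1
eps[k:2 * k] = -rng.uniform(1.0, 1e4, size=k)  # S-: indices k..2k-1

first = A_rows[:, 0]                           # <a_i, e1>
lam = np.empty(m)
lam[:2 * k] = -(7.0 / m) * np.sign(eps[:2 * k])          # -alpha*sgn(eps_i), alpha = 7/m
lam[2 * k:] = (beta0 - first[2 * k:] ** 2 * (np.abs(first[2 * k:]) <= 3)) / m

Y = A_rows.T @ (lam[:, None] * A_rows)         # Y = A*(lam) = sum_i lam_i a_i a_i^t

Y_T = Y.copy(); Y_T[1:, 1:] = 0.0              # part supported on first row/column
Y_Tperp = Y[1:, 1:]                            # lower-right (n-1) x (n-1) block

print(np.linalg.norm(Y_T))                                    # compare to 1/4 in Lemma 4
print(np.linalg.norm(Y_Tperp - 1.7 * np.eye(n - 1), ord=2))   # compare to 3/10 in Lemma 4
```

With m a large multiple of n, as here, the printed values should fall below the 1/4 and 3/10 thresholds of Lemma 4; for small m/n they typically will not, consistent with the requirement m ≥ cn.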
The proof of Theorem 1 follows from the direct combination of Lemmas 3 and 4. Indeed, the conclusions of Lemma 4 imply conditions (11) and (12), and the construction (31) satisfies (13) with α = 7/m.

C. Comment on the complex-valued case

Extending Theorem 1 to the complex-valued case may be possible. In the proofs above, the most central use of the real-valued assumption is in the concentration estimates for the dual certificate. The certificate is broken into pieces according to where ε is positive, negative, and zero. In the complex case, there are infinitely many directions that the values of ε can take. Hence, the complex case will require a more involved analysis.

REFERENCES

[1] E. Candès, Y. Eldar, T. Strohmer, and V. Voroninski. Phase retrieval via matrix completion. SIAM J. Imaging Sci., 6(1):199–225, 2013.
[2] E. J. Candès and X. Li. Solving quadratic equations via PhaseLift when there are about as many equations as unknowns. Found. Comput. Math., pages 1–10, 2012.
[3] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(3):11, 2011.
[4] E. J. Candès, T. Strohmer, and V. Voroninski. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Comm. Pure Appl. Math., 66(8):1241–1274, 2013.
[5] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM J. Optim., 21(2):572–596, 2011.
[6] Y. Chen, A. Jalali, S. Sanghavi, and C. Caramanis. Low-rank matrix recovery from errors and erasures. IEEE Trans. Inf. Theory, 59(7):4324–4337, 2013.
[7] L. Demanet and P. Hand. Stable optimizationless recovery from phaseless linear measurements. J. Fourier Anal. Appl., 20(1):199–221, 2014.
[8] A. Ganesh, J. Wright, X. Li, E. J. Candès, and Y. Ma. Dense error correction for low-rank matrices via principal component pursuit. In Proc. IEEE Int. Symp. Information Theory (ISIT), pages 1513–1517, 2010.
[9] D. Gross. Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inf. Theory, 57(3):1548–1566, 2011.
[10] R. W. Harrison. Phase problem in crystallography. J. Opt. Soc. Am. A, 10(5):1046–1055, 1993.
[11] D. Hsu, S. M. Kakade, and T. Zhang. Robust matrix decomposition with sparse corruptions. IEEE Trans. Inf. Theory, 57(11):7221–7234, 2011.
[12] R. Kueng, H. Rauhut, and U. Terstiege. Low rank matrix recovery from rank one measurements. CoRR, abs/1410.6913, 2014.
[13] X. Li. Compressed sensing and matrix completion with constant proportion of corruptions. Constr. Approx., 37(1):73–99, 2013.
[14] J. Löfberg. YALMIP: A toolbox for modeling and optimization in MATLAB. In Proc. CACSD Conference, Taipei, Taiwan, 2004.
[15] R. P. Millane. Phase retrieval in crystallography and optics. J. Opt. Soc. Am. A, 7(3):394–411, 1990.
[16] K. C. Toh, M. J. Todd, and R. H. Tutuncu. SDPT3: a MATLAB software package for semidefinite programming. Optim. Methods Softw., 11:545–581, 1998.
[17] R. H. Tutuncu, K. C. Toh, and M. J. Todd. Solving semidefinite-quadratic-linear programs using SDPT3. Math. Program. Ser. B, 95:189–217, 2003.
[18] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. C. Eldar and G. Kutyniok, editors, Compressed Sensing: Theory and Applications. Cambridge University Press, 2012.
[19] J. Wright, A. Ganesh, S. Rao, Y. Peng, and Y. Ma. Robust principal component analysis: Exact recovery of corrupted low-rank matrices via convex optimization. In Advances in Neural Information Processing Systems, pages 2080–2088, 2009.