Randomized algorithms for optimization: Statistical and computational guarantees

Martin Wainwright, UC Berkeley (Statistics and EECS)
Based on joint work with Mert Pilanci (UC Berkeley) and Yun Yang (UC Berkeley)
July 2015


Sketching via random projections

Massive data sets require fast algorithms, but with rigorous guarantees.

A general-purpose tool:
- Choose a random subspace of "low" dimension m.
- Project the data into the subspace, and solve the reduced-dimension problem.

[Figure: a random projection maps the high-dimensional space to a lower-dimensional space.]

The basic underlying idea is now widely used in practice:
- Johnson & Lindenstrauss (1984): for Hilbert spaces
- various surveys and books: Vempala, 2004; Mahoney et al., 2011; Cormode et al., 2012


Classical sketching for constrained least-squares

Original problem: data $(y, A) \in \mathbb{R}^n \times \mathbb{R}^{n \times d}$ and a convex constraint set $C \subseteq \mathbb{R}^d$:
    $x_{LS} = \arg\min_{x \in C} \|Ax - y\|_2^2$

Sketched problem: data $(Sy, SA) \in \mathbb{R}^m \times \mathbb{R}^{m \times d}$:
    $\hat{x} = \arg\min_{x \in C} \|SAx - Sy\|_2^2$

Some history:
- random projections and Johnson-Lindenstrauss: 1980s onwards
- sketching for unconstrained least-squares: Sarlos, 2006
- leverage scores, core sets: Drineas et al., 2010, 2011
- overview paper: Mahoney et al., 2011


Sketches based on randomized orthonormal systems

Step 1: Choose some fixed orthonormal matrix $H \in \mathbb{R}^{n \times n}$. Example: Hadamard matrices
    $H_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}$, and $H_{2^t} = \underbrace{H_2 \otimes H_2 \otimes \cdots \otimes H_2}_{\text{Kronecker product, } t \text{ times}}$.

Step 2:
(A) Multiply the data vector $y$ by a diagonal matrix $D$ of random signs $\{-1, +1\}$.
(B) Choose $m$ rows of $H$ to form the sub-sampled matrix $\tilde{H} \in \mathbb{R}^{m \times n}$.
(C) Computing the sketched vector $Sy = \tilde{H} D y$ then requires only $O(n \log m)$ time.

(E.g., Ailon & Liberty, 2010)
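To make the classical sketch concrete, here is a minimal numerical sketch of the unconstrained case ($C = \mathbb{R}^d$) in Python: it solves a least-squares problem exactly and from a Gaussian sketch, then compares the two. The problem sizes, the Gaussian choice of $S$, and the use of numpy's `lstsq` are illustrative assumptions, not anything prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative problem sizes (not from the slides): n >> d, sketch dimension m << n.
n, d, m = 4096, 50, 400

A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
y = A @ x_star + rng.standard_normal(n)          # y = A x* + w

# Original (unconstrained) least-squares solution x_LS.
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)

# Classical sketch: draw S in R^{m x n}, solve the reduced m x d problem.
S = rng.standard_normal((m, n)) / np.sqrt(m)     # Gaussian sketch
x_hat, *_ = np.linalg.lstsq(S @ A, S @ y, rcond=None)

def f(x):
    return np.linalg.norm(A @ x - y) ** 2

print("cost ratio f(x_hat)/f(x_LS):", f(x_hat) / f(x_ls))
print("prediction error (1/sqrt(n))||A(x_hat - x_LS)||:",
      np.linalg.norm(A @ (x_hat - x_ls)) / np.sqrt(n))
```

Even with $m \ll n$ the cost ratio typically lands close to one; the prediction error printed last is the quantity the later slides denote $\|\hat{x} - x_{LS}\|_A$, and it behaves very differently.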
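The randomized orthonormal system sketch can be written down just as directly when $n$ is a power of two. The helper below forms $Sy = \tilde{H} D y$ with an explicit Hadamard matrix from scipy; the explicit matrix multiply is $O(n^2)$ and merely stands in for the fast transform behind the $O(n \log m)$ claim above, and the $\sqrt{n/m}$ rescaling is a common normalization convention assumed here rather than taken from the slides.

```python
import numpy as np
from scipy.linalg import hadamard

def ros_sketch(y, m, rng):
    """Randomized orthonormal system sketch of a vector y (n must be a power of 2).

    Forms S y = H_tilde D y: D is a diagonal matrix of random signs, and H_tilde
    consists of m randomly chosen rows of the orthonormal Hadamard matrix H.
    The explicit construction here is O(n^2) and purely illustrative; a fast
    Walsh-Hadamard transform is what gives the stated O(n log m) running time.
    """
    n = y.shape[0]
    H = hadamard(n) / np.sqrt(n)                 # orthonormal Hadamard matrix
    signs = rng.choice([-1.0, 1.0], size=n)      # diagonal of D
    rows = rng.choice(n, size=m, replace=False)  # sub-sample m rows of H
    # sqrt(n/m) rescaling keeps E||Sy||_2^2 = ||y||_2^2 (assumed normalization).
    return np.sqrt(n / m) * H[rows] @ (signs * y)

rng = np.random.default_rng(1)
y = rng.standard_normal(1024)
print(ros_sketch(y, 64, rng).shape)              # (64,)
```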
Different notions of approximation

Given a convex set $C \subseteq \mathbb{R}^d$:
- Original least-squares problem: $x_{LS} = \arg\min_{x \in C} \underbrace{\|Ax - y\|_2^2}_{f(x)}$
- Sketched solution: $\hat{x} = \arg\min_{x \in C} \|SAx - Sy\|_2^2$

Question: When is the sketched solution $\hat{x}$ a "good" approximation to $x_{LS}$?

Cost approximation: the sketched solution $\hat{x} \in C$ is a $\delta$-accurate cost approximation if
    $f(x_{LS}) \le f(\hat{x}) \le (1 + \delta)^2 f(x_{LS})$.

Solution approximation: the sketched solution $\hat{x} \in C$ is a $\delta$-accurate solution approximation if
    $\|\hat{x} - x_{LS}\|_A \le \delta$, where $\|v\|_A := \frac{1}{\sqrt{n}} \|Av\|_2$.


Cost approximation for unconstrained LS

[Figure: approximation ratio $f(\hat{x})/f(x_{LS})$ versus the control parameter $\alpha$, with sketch size $m = 4\alpha \,\mathrm{rank}(A)$, for unconstrained least squares with $d = 500$ and $n \in \{1024, 2048, 4096\}$, using Gaussian, Rademacher, and randomized Hadamard sketches.]


What if solution approximation is our goal?

- Often the least-squares solution $x_{LS}$ itself is of primary interest.
- Unfortunately, a $\delta$-accurate cost approximation does not ensure high solution accuracy.

Thought experiment: consider random ensembles of linear regression problems
    $y = Ax^* + w$, where $x^* \in \mathbb{R}^d$ and $w \sim N(0, \sigma^2 I_n)$.
The least-squares solution $x_{LS}$ has mean-squared error at most
    $\mathbb{E}\|x_{LS} - x^*\|_A^2 \lesssim \underbrace{\frac{\sigma^2 \,\mathrm{rank}(A)}{n}}_{\text{nominal } \delta}$.


Unconstrained LS: solution approximation

[Figure: mean-squared prediction error versus row dimension $n$ for the least-squares solution (LS), the iterative Hessian sketch (IHS), and the naive classical sketch, with sketch size $m \gtrsim \mathrm{rank}(A) \log n$.]


Fundamental cause of poor performance?

Recall the planted ensembles of problems:
    $y = Ax^* + w$, where $x^* \in C$ and $w \sim N(0, \sigma^2 I_n)$.
Consider any random sketching matrix $S \in \mathbb{R}^{m \times n}$ such that
    $\big\| \mathbb{E}\big[ S^T (S S^T)^{-1} S \big] \big\|_{op} \lesssim \frac{m}{n}$.

Theorem (Pilanci & W., 2014). Any possible estimator $(Sy, SA) \mapsto \tilde{x}$ has error lower bounded as
    $\sup_{x^* \in C} \; \mathbb{E}_{S,w} \|\tilde{x} - x_{LS}\|_A^2 \;\gtrsim\; \sigma^2 \, \frac{\log P_{1/2}(C)}{\min\{n, m\}}$,
where $P_{1/2}(C)$ is the $1/2$-packing number of $C \cap B_2(1)$ in the norm $\|\cdot\|_A$.

Concretely, for unconstrained least-squares we have
    $\sup_{x^* \in C} \; \mathbb{E}_{S,w} \|\tilde{x} - x_{LS}\|_A^2 \;\gtrsim\; \sigma^2 \, \frac{\mathrm{rank}(A)}{\min\{n, m\}}$.
Consequently, we need $m \ge n$ to match the least-squares performance in estimating $x^*$.
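A small simulation is consistent with this picture (the sizes $n$, $d$, $m$, the Gaussian sketch, and the number of trials are all hypothetical choices, not from the slides): the classical sketch's solution error scales roughly like $\sigma^2 d / m$, far above the nominal $\sigma^2 d / n$ achieved by least squares when $m \ll n$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m, sigma = 8192, 32, 256, 1.0

A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)

def sq_pred_err(x, x_ref):
    # ||x - x_ref||_A^2 = (1/n) ||A (x - x_ref)||_2^2
    return np.linalg.norm(A @ (x - x_ref)) ** 2 / n

ls_err, sk_err = [], []
for _ in range(20):                               # 20 independent trials
    y = A @ x_star + sigma * rng.standard_normal(n)
    x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
    S = rng.standard_normal((m, n)) / np.sqrt(m)  # classical Gaussian sketch
    x_hat, *_ = np.linalg.lstsq(S @ A, S @ y, rcond=None)
    ls_err.append(sq_pred_err(x_ls, x_star))      # ~ sigma^2 d / n
    sk_err.append(sq_pred_err(x_hat, x_ls))       # ~ sigma^2 d / m  >>  sigma^2 d / n

print("avg ||x_LS - x*||_A^2   :", np.mean(ls_err), " (nominal sigma^2 d/n =", sigma**2 * d / n, ")")
print("avg ||x_hat - x_LS||_A^2:", np.mean(sk_err), " (order   sigma^2 d/m =", sigma**2 * d / m, ")")
```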
A slightly different approach: the Hessian sketch

Observe that
    $x_{LS} = \arg\min_{x \in C} \|Ax - y\|_2^2 = \arg\min_{x \in C} \Big\{ \tfrac{1}{2} x^T A^T A x - \langle A^T y, x \rangle \Big\}$.
Consider sketching only the quadratic component:
    $\tilde{x} := \arg\min_{x \in C} \Big\{ \tfrac{1}{2} \|SAx\|_2^2 - \langle A^T y, x \rangle \Big\}$.
For a broad class of sketches, as long as the sketch dimension satisfies $m \gtrsim (1/\delta^2) \, W^2(AK)$ (with the Gaussian width $W$ and cone $K$ defined below), one can prove that
    $\|\tilde{x} - x_{LS}\|_A \lesssim \delta \, \|x_{LS}\|_A$.

Key point: this one-step method is also provably sub-optimal, but the construction can be iterated to obtain an optimal method.


An optimal method: the iterative Hessian sketch

Given an iteration number $T \ge 1$:
(1) Initialize at $x^0 = 0$.
(2) For iterations $t = 0, 1, 2, \ldots, T-1$, generate an independent sketch matrix $S^{t+1} \in \mathbb{R}^{m \times n}$ and perform the update
    $x^{t+1} = \arg\min_{x \in C} \Big\{ \tfrac{1}{2} \|S^{t+1} A (x - x^t)\|_2^2 - \langle A^T (y - A x^t), x \rangle \Big\}$.
(3) Return the estimate $\hat{x} = x^T$.

Intuition:
- Step 1 returns the plain Hessian sketch $\tilde{x} = x^1$.
- Step $t$ sketches a problem for which $x^t - x_{LS}$ is the optimal solution.
- The error is thus successively "localized" (a minimal unconstrained implementation is sketched after the theorem below).


Theory for unconstrained least-squares

Theorem (Pilanci & W., 2014). Given a sketch dimension $m \gtrsim \mathrm{rank}(A)$, the error decays geometrically:
    $\|x^{t+1} - x_{LS}\|_A \le \big(\tfrac{1}{2}\big)^{t} \|x_{LS}\|_A$  for all $t = 0, 1, \ldots, T-1$,
with probability at least $1 - c_1 T e^{-c_2 m}$.

- Applies to any sub-Gaussian sketch; the same result holds for fast JL sketches with additional logarithmic factors.
- The total number of random projections scales as $T m$.
- For any $\epsilon > 0$, taking $T = \log\!\big( \tfrac{2 \|x_{LS}\|_A}{\epsilon} \big)$ iterations yields an $\epsilon$-accurate solution.
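For the unconstrained case ($C = \mathbb{R}^d$), each IHS update has a closed form, $x^{t+1} = x^t + \big((S^{t+1}A)^T S^{t+1}A\big)^{-1} A^T (y - A x^t)$, which makes the algorithm a few lines of Python. Gaussian sketches and the problem sizes below are illustrative assumptions; a constrained version would replace the linear solve with a constrained quadratic program.

```python
import numpy as np

def iterative_hessian_sketch(A, y, m, T, rng):
    """Unconstrained iterative Hessian sketch (IHS) with Gaussian sketches.

    Each iteration draws a fresh S^{t+1} in R^{m x n} and minimizes
        0.5 ||S^{t+1} A (x - x^t)||^2 - <A^T (y - A x^t), x>,
    which for C = R^d reduces to the closed-form update below.
    """
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(T):
        S = rng.standard_normal((m, n)) / np.sqrt(m)   # fresh sketch each iteration
        SA = S @ A
        grad = A.T @ (y - A @ x)                       # unsketched linear term
        x = x + np.linalg.solve(SA.T @ SA, grad)       # sketched-Hessian step
    return x

# Illustrative example (sizes are arbitrary choices, not from the slides).
rng = np.random.default_rng(3)
n, d = 4096, 50
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + rng.standard_normal(n)
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
x_ihs = iterative_hessian_sketch(A, y, m=4 * d, T=6, rng=rng)
print("||x_IHS - x_LS||_A =", np.linalg.norm(A @ (x_ihs - x_ls)) / np.sqrt(n))
```

With $m = 4d$ and half a dozen iterations, the geometric contraction typically drives $\|x^{T} - x_{LS}\|_A$ well below the statistical error of $x_{LS}$ itself.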
Geometric convergence for unconstrained LS

[Figure: log error to the least-squares solution versus iteration number, for sketch-size parameters $\gamma \in \{4, 6, 8\}$.]


Experiments for planted ensembles

Linear regression problems with $A \in \mathbb{R}^{n \times d}$ and $n > d$:
    $y = Ax^* + w$, where $x^* \in C$ and $w \sim N(0, \sigma^2 I_n)$.
The least-squares solution has error
    $\mathbb{E}\|x_{LS} - x^*\|_A \lesssim \sqrt{\tfrac{\sigma^2 d}{n}}$.

Scaling behavior:
- Fix $\sigma^2 = 1$ and sample size $n = 100 d$, and vary $d \in \{16, 32, 64, 128, 256\}$.
- Run IHS with sketch size $m = 4d$ for $T = 4$ iterations.
- Compare to the classical sketch with sketch size $16 d$.

[Figure: least-squares error versus dimension $d \in \{16, \ldots, 256\}$, comparing IHS to the classical sketch.]


Extensions to constrained problems

Constrained problem:
    $x_{LS} = \arg\min_{x \in C} \|Ax - y\|_2^2$, where $C \subseteq \mathbb{R}^d$ is a convex set.

Tangent cone $K$ at $x_{LS}$: the set of feasible directions at the optimum $x_{LS}$,
    $K = \{ \Delta \in \mathbb{R}^d \mid \Delta = t (x - x_{LS}) \text{ for some } t \ge 0 \text{ and some } x \in C \}$.

[Figures: constraint sets $C$ illustrating unfavorable versus favorable dependence on the optimum $x^*$, via the tangent cone $K$ at $x_{LS}$.]


Illustration: binary classification with SVM

Observe labeled samples $(b_i, L_i) \in \mathbb{R}^D \times \{-1, +1\}$.
Goal: find a linear classifier $b \mapsto \mathrm{sign}(\langle w, b \rangle)$ with low classification error.
The support vector machine produces a classifier that depends only on the samples lying on the margin; the number of support vectors $k$ is typically $\ll$ the total number of samples $n$.


Sketching the dual of the SVM

Primal form of the SVM:
    $\hat{w} = \arg\min_{w \in \mathbb{R}^D} \Big\{ \tfrac{1}{2\gamma} \sum_{i=1}^n \max\{0, \, 1 - L_i \langle w, b_i \rangle\} + \tfrac{1}{2} \|w\|_2^2 \Big\}$.
Dual form of the SVM:
    $x_{LS} := \arg\min_{x \in P^n} \|\mathrm{diag}(L) B x\|_2^2$, where $P^n := \{ x \in \mathbb{R}^n \mid x \ge 0 \text{ and } \textstyle\sum_{i=1}^n x_i = \gamma \}$.
Sketched dual SVM:
    $\hat{x} := \arg\min_{x \in P^n} \|S \, \mathrm{diag}(L) B x\|_2^2$.


Gaussian width of the transformed tangent cone

Gaussian width of the set $AK \cap S^{n-1} = \{ A\Delta \mid \Delta \in K, \; \|A\Delta\|_2 = 1 \}$:
    $W(AK) := \mathbb{E}\Big[ \sup_{z \in AK \cap S^{n-1}} \langle g, z \rangle \Big]$, where $g \sim N(0, I_{n \times n})$.
Gaussian widths are used in many areas:
- Banach space theory: Pisier, 1986
- empirical process theory: Ledoux & Talagrand, 1991; Bartlett et al., 2002
- compressed sensing: Mendelson et al., 2008; Chandrasekaran et al., 2012
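To make the width $W(AK)$ concrete, it can be estimated by Monte Carlo whenever the supremum over $AK \cap S^{n-1}$ is computable. In the unconstrained case $K = \mathbb{R}^d$, that supremum is simply the norm of the projection of $g$ onto the column span of $A$, so $W(AK) \approx \sqrt{\mathrm{rank}(A)}$; the snippet below checks this numerically (matrix sizes and sample count are arbitrary choices).

```python
import numpy as np

def gaussian_width_unconstrained(A, n_samples, rng):
    """Monte Carlo estimate of W(AK) for the unconstrained cone K = R^d.

    Here AK is the column span of A, so sup_{z in AK, ||z||_2 = 1} <g, z>
    equals the norm of the orthogonal projection of g onto range(A).
    """
    n, _ = A.shape
    Q, _ = np.linalg.qr(A)                      # orthonormal basis for range(A)
    vals = []
    for _ in range(n_samples):
        g = rng.standard_normal(n)
        vals.append(np.linalg.norm(Q.T @ g))    # ||P_range(A) g||_2
    return np.mean(vals)

rng = np.random.default_rng(4)
A = rng.standard_normal((2048, 64))
print("estimated W(AK):", gaussian_width_unconstrained(A, 200, rng))
print("sqrt(rank(A))  :", np.sqrt(64))          # roughly equal, matching m ~ rank(A)
```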
A general guarantee

Tangent cone at $x_{LS}$: $K = \{ \Delta \in \mathbb{R}^d \mid \Delta = t(x - x_{LS}) \text{ for some } t \ge 0 \text{ and some } x \in C \}$.
Width of the transformed cone $AK \cap S^{n-1}$:
    $W(AK) = \mathbb{E}\big[ \sup_{z \in AK \cap S^{n-1}} \langle g, z \rangle \big]$, where $g \sim N(0, I_{n \times n})$.

Theorem (Pilanci & W., 2014). Given a sketch dimension $m \gtrsim W^2(AK)$, the error decays geometrically:
    $\|x^{t+1} - x_{LS}\|_A \le \big(\tfrac{1}{2}\big)^{t} \|x_{LS}\|_A$  for all $t = 0, 1, \ldots, T-1$,
with probability at least $1 - c_1 T e^{-c_2 m}$.


Illustration: width calculation for the dual SVM

The relevant constraint set is the simplex in $\mathbb{R}^n$:
    $P^n := \{ x \in \mathbb{R}^n \mid x \ge 0 \text{ and } \textstyle\sum_{i=1}^n x_i = \gamma \}$.
- In practice, the SVM dual solution $\hat{x}_{dual}$ is often sparse, with relatively few non-zeros.
- Under mild conditions on $A$, it can be shown that
    $\mathbb{E}\Big[ \sup_{x \in P^n, \; \|x\|_0 \le k, \; \|Ax\|_2 \le 1} \langle g, Ax \rangle \Big] \lesssim \sqrt{k \log n}$.

Conclusion: for an SVM solution with $k$ support vectors, a sketch dimension $m \gtrsim k \log n$ is sufficient to ensure geometric convergence.

[Figure: log error versus iteration number for the sketched dual SVM, with sketch sizes $m \in \{2 k \log n, \; 5 k \log n, \; 25 k \log n\}$.]

[Figure: sparse-classifier error versus dimension $d \in \{16, \ldots, 256\}$, comparing IHS to the classical sketch.]


A more general story: the Newton sketch

Convex program over a set $C \subseteq \mathbb{R}^d$:
    $x^{opt} = \arg\min_{x \in C} f(x)$, where $f : \mathbb{R}^d \to \mathbb{R}$ is twice-differentiable.
Ordinary Newton steps:
    $x^{t+1} = \arg\min_{x \in C} \Big\{ \tfrac{1}{2} \|\nabla^2 f(x^t)^{1/2} (x - x^t)\|_2^2 + \langle \nabla f(x^t), x - x^t \rangle \Big\}$,
where $\nabla^2 f(x^t)^{1/2}$ is a matrix square root of the Hessian at $x^t$.
Sketched Newton steps:
    $\tilde{x}^{t+1} = \arg\min_{x \in C} \Big\{ \tfrac{1}{2} \|S^t \nabla^2 f(\tilde{x}^t)^{1/2} (x - \tilde{x}^t)\|_2^2 + \langle \nabla f(\tilde{x}^t), x - \tilde{x}^t \rangle \Big\}$.

Question: what is the minimal sketch dimension required to ensure that $\{\tilde{x}^t\}_{t=0}^T$ stays uniformly close to $\{x^t\}_{t=0}^T$?


Sketching the central path

[Figures: the exact Newton central path versus three trials of the Newton sketch, for sketch dimensions $m = d$, $m = 4d$, and $m = 16d$.]
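As a concrete unconstrained instance of the sketched Newton step, consider logistic regression, $f(x) = \sum_i \log(1 + e^{-L_i \langle a_i, x \rangle})$, whose Hessian is $A^T \mathrm{diag}(w) A$, so that $\mathrm{diag}(\sqrt{w})\,A$ serves as a matrix square root. The sketch below applies a fresh Gaussian sketch to that square root at each step; it omits the line search and stopping rules of the full Newton sketch algorithm, and the data-generating model and sizes are hypothetical.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def newton_sketch_logistic(A, L, m, T, rng):
    """Bare-bones Newton sketch for unconstrained logistic regression.

    f(x) = sum_i log(1 + exp(-L_i <a_i, x>)); with u = L * (A x), a Hessian
    square root is diag(sqrt(w)) A, where w_i = sigmoid(u_i) * sigmoid(-u_i).
    Each step sketches that square root with a fresh Gaussian S^t
    (no line search or stopping rule -- purely illustrative).
    """
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(T):
        u = L * (A @ x)                              # margins
        grad = -A.T @ (L * sigmoid(-u))              # exact gradient
        w = sigmoid(u) * sigmoid(-u)                 # Hessian weights
        B = np.sqrt(w)[:, None] * A                  # Hessian square root
        S = rng.standard_normal((m, n)) / np.sqrt(m)
        SB = S @ B                                   # sketched square root
        x = x - np.linalg.solve(SB.T @ SB, grad)     # sketched Newton step
    return x

# Illustrative data (sizes and noise level are arbitrary choices).
rng = np.random.default_rng(5)
n, d = 4096, 32
A = rng.standard_normal((n, d))
L = np.sign(A @ rng.standard_normal(d) + 3.0 * rng.standard_normal(n))
x_ns = newton_sketch_logistic(A, L, m=16 * d, T=10, rng=rng)
print("training error:", np.mean(np.sign(A @ x_ns) != L))
```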
Running time comparisons

[Figure: optimality gap versus iteration count and versus wall-clock time (seconds), comparing exact Newton, gradient descent, accelerated gradient descent, SGD, BFGS, and the Newton sketch.]


Summary

- Important distinction: cost versus solution approximation.
- The classical least-squares sketch is provably sub-optimal for solution approximation.
- Iterative Hessian sketch: fast geometric convergence, with guarantees for both cost and solution approximation.
- Sharp dependence of the sketch dimension on the geometry of the solution and the constraint set.
- A more general perspective: sketched forms of Newton's method.

Papers/pre-prints:
- Pilanci & W. (2014a): Randomized sketches of convex programs with sharp guarantees. To appear in IEEE Trans. Info. Theory.
- Pilanci & W. (2014b): Iterative Hessian Sketch: Fast and accurate solution approximation for constrained least-squares. arXiv pre-print.
- Yang, Pilanci & W. (2015): Randomized sketches for kernels: Fast and optimal non-parametric regression. arXiv pre-print.
- Pilanci & W. (2015): Newton Sketch: A linear-time optimization algorithm with linear-quadratic convergence. arXiv pre-print.