Randomized algorithms for optimization:
Statistical and computational guarantees
Martin Wainwright
UC Berkeley
Statistics and EECS
Based on joint work with:
Mert Pilanci (UC Berkeley)
Yun Yang (UC Berkeley)
July 2015
Sketching via random projections
Massive data sets require fast algorithms but with rigorous guarantees.
A general purpose tool:
Choose a random subspace of “low” dimension m.
Project data into subspace, and solve reduced dimension problem.
[Figure: data in a high-dimensional space is mapped by a random projection into a lower-dimensional space.]
Basic underlying idea now widely used in practice:
Johnson & Lindenstrauss (1984): for Hilbert spaces
various surveys and books: Vempala, 2004; Mahoney et al., 2011
Cormode et al., 2012
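As a rough illustration of the projection idea (not code from the talk), the following Python snippet uses a Gaussian random projection and checks that the distance between two high-dimensional vectors is approximately preserved in the lower-dimensional space; the dimensions n and m are arbitrary choices.

```python
# Minimal sketch of the random-projection idea (Gaussian projection assumed).
import numpy as np

rng = np.random.default_rng(0)
n, m = 4096, 256                                   # high and low dimensions
u, v = rng.standard_normal(n), rng.standard_normal(n)

S = rng.standard_normal((m, n)) / np.sqrt(m)       # random projection matrix
ratio = np.linalg.norm(S @ (u - v)) / np.linalg.norm(u - v)
print(f"distance distortion: {ratio:.3f}")         # close to 1 for moderate m
```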
Classical sketching for constrained least-squares
[Figure: the sketch matrix S ∈ R^{m×n} compresses the data (y, A) to (Sy, SA).]
Original problem: data (y, A) ∈ R^n × R^{n×d}, and a convex constraint set C ⊆ R^d:
x^{LS} = \arg\min_{x \in C} \|Ax - y\|_2^2
Sketched problem: data (Sy, SA) ∈ R^m × R^{m×d}:
\hat{x} = \arg\min_{x \in C} \|SAx - Sy\|_2^2
Some history:
random projections and Johnson-Lindenstrauss: 1980s onwards
sketching for unconstrained least-squares: Sarlos, 2006
leverage scores, core sets: Drineas et al., 2010, 2011
overview paper: Mahoney et al., 2011
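A minimal Python sketch of the classical least-squares sketch in the unconstrained case (C = R^d), using a Gaussian sketch matrix for concreteness; the problem sizes are illustrative, not from the talk.

```python
# Classical sketch for unconstrained least-squares: solve with (Sy, SA).
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 2000, 50, 400
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + rng.standard_normal(n)

x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)              # original solution
S = rng.standard_normal((m, n)) / np.sqrt(m)              # Gaussian sketch
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ y, rcond=None)  # sketched solution

f = lambda x: np.linalg.norm(A @ x - y) ** 2
print("cost ratio f(x_sketch) / f(x_ls):", f(x_sketch) / f(x_ls))
```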
Sketches based on randomized orthonormal systems
Step 1: Choose some fixed orthonormal matrix H ∈ Rn×n .
Example: Hadamard matrices
H_2 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \qquad H_{2^t} = \underbrace{H_2 \otimes H_2 \otimes \cdots \otimes H_2}_{\text{Kronecker product, } t \text{ times}}
[Figure: the sketched vector Sy = H̃ D y.]
Step 2:
(A) Multiply the data vector y by a diagonal matrix D of random signs {−1, +1}.
(B) Choose m rows of H to form the sub-sampled matrix H̃ ∈ R^{m×n}.
(C) Requires only O(n log m) time to compute the sketched vector Sy = H̃ D y.
(E.g., Ailon & Liberty, 2010)
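A dense Python illustration of the randomized orthonormal system sketch above, using scipy's Hadamard matrix; a practical implementation would use a fast Walsh–Hadamard transform to obtain the O(n log m) cost, and the sqrt(n/m) rescaling is one common normalization assumed here rather than taken from the slide.

```python
# Randomized orthonormal system sketch Sy = H~ D y (dense, for illustration).
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(2)
n, m = 1024, 64                                # n must be a power of 2 here
y = rng.standard_normal(n)

H = hadamard(n) / np.sqrt(n)                   # orthonormal Hadamard matrix
D = rng.choice([-1.0, 1.0], size=n)            # step (A): random signs
rows = rng.choice(n, size=m, replace=False)    # step (B): sub-sample m rows
Sy = np.sqrt(n / m) * H[rows] @ (D * y)        # step (C): sketched vector
print(Sy.shape)                                # (m,)
```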
Different notions of approximation
Given a convex set C ⊆ R^d:
Original least-squares problem:
x^{LS} = \arg\min_{x \in C} \underbrace{\|Ax - y\|_2^2}_{f(x)}
Sketched solution:
\hat{x} = \arg\min_{x \in C} \|SAx - Sy\|_2^2
Question: When is the sketched solution \hat{x} a "good" approximation to x^{LS}?
Cost approximation
The sketched solution \hat{x} ∈ C is a δ-accurate cost approximation if
f(x^{LS}) ≤ f(\hat{x}) ≤ (1 + δ)^2 f(x^{LS}).
Solution approximation
The sketched solution \hat{x} ∈ C is a δ-accurate solution approximation if
\|\hat{x} - x^{LS}\|_A ≤ δ,  where \|v\|_A := \frac{1}{\sqrt{n}} \|Av\|_2.
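The two criteria are easy to evaluate numerically. The Python check below (Gaussian sketch, unconstrained problem, arbitrary sizes) computes both the cost factor f(x̂)/f(x^{LS}) and the solution error ||x̂ − x^{LS}||_A for one draw of the sketch.

```python
# Evaluate cost approximation and solution approximation for one sketch draw.
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 4000, 100, 600
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + rng.standard_normal(n)

x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
S = rng.standard_normal((m, n)) / np.sqrt(m)
x_hat, *_ = np.linalg.lstsq(S @ A, S @ y, rcond=None)

f = lambda x: np.linalg.norm(A @ x - y) ** 2
a_norm = lambda v: np.linalg.norm(A @ v) / np.sqrt(n)   # ||v||_A
print("cost factor   :", f(x_hat) / f(x_ls))
print("solution error:", a_norm(x_hat - x_ls))
```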
Cost approx. for unconstrained LS
[Figure: approximation ratio f(\hat{x})/f(x^{LS}) versus the control parameter α for unconstrained least squares with d = 500 and n ∈ {1024, 2048, 4096}, comparing randomized Hadamard, Gaussian, and Rademacher sketches; sketch size m = 4α rank(A).]
What if solution approximation is our goal?
often the least-squares solution x^{LS} itself is of primary interest
unfortunately, δ-accurate cost approximation does not ensure high solution accuracy
Thought experiment: Consider random ensembles of linear regression problems:
y = Ax^* + w,  where x^* ∈ R^d, and w ~ N(0, σ^2 I_n).
The least-squares solution x^{LS} has mean-squared error at most
E\|x^{LS} - x^*\|_A^2 \lesssim \underbrace{\frac{\sigma^2 \, \mathrm{rank}(A)}{n}}_{\text{nominal } \delta}
Unconstrained LS: Solution approximation
[Figure: mean-squared prediction error versus row dimension n, comparing the exact least-squares solution (LS), the iterative Hessian sketch (IHS), and a naive sketch (Naive), with sketch size m ≳ rank(A) log n.]
Fundamental cause of poor performance?
Recall the planted ensembles of problems
y = Ax^* + w,  where x^* ∈ C, and w ~ N(0, σ^2 I_n),
and consider any random sketching matrix S ∈ R^{m×n} such that
|||\, E[ S^T (S S^T)^{-1} S ] \,|||_{op} \lesssim \frac{m}{n}.
Theorem (Pilanci & W., 2014)
Any possible estimator (Sy, SA) ↦ \tilde{x} has error lower bounded as
\sup_{x^* \in C} E_{S,w} \|\tilde{x} - x^{LS}\|_A^2 \gtrsim \sigma^2 \, \frac{\log P_{1/2}(C)}{\min\{n, m\}},
where P_{1/2}(C) is the 1/2-packing number of C ∩ B_2(1) in the norm \|\cdot\|_A.
Concretely: for unconstrained least-squares, we have
\sup_{x^* \in C} E_{S,w} \|\tilde{x} - x^{LS}\|_A^2 \gtrsim \sigma^2 \, \frac{\mathrm{rank}(A)}{\min\{n, m\}}.
Consequently, we need m ≥ n to match least-squares performance in estimating x^*.
A slightly different approach: Hessian sketch
Observe that
x^{LS} = \arg\min_{x \in C} \|Ax - y\|_2^2 = \arg\min_{x \in C} \Big\{ \tfrac{1}{2} x^T A^T A x - \langle A^T y, x \rangle \Big\}.
Consider sketching only the quadratic component:
\tilde{x} := \arg\min_{x \in C} \Big\{ \tfrac{1}{2} \|SAx\|_2^2 - \langle A^T y, x \rangle \Big\}.
For a broad class of sketches, as long as the sketch dimension satisfies m \gtrsim (1/δ^2) \, W^2(AK), one can prove that
\|\tilde{x} - x^{LS}\|_A \lesssim δ \, \|x^{LS}\|_A.
Key point:
This one-step method is also provably sub-optimal, but...
...the construction can be iterated to obtain an optimal method.
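In the unconstrained case (C = R^d) the Hessian sketch has a closed form, since the minimizer of (1/2)||SAx||² − ⟨A^T y, x⟩ solves (SA)^T(SA) x = A^T y. The Python snippet below (Gaussian sketch, illustrative sizes) computes this one-step estimate.

```python
# One-step Hessian sketch, unconstrained case: only the quadratic term is sketched.
import numpy as np

rng = np.random.default_rng(4)
n, d, m = 4000, 100, 600
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + rng.standard_normal(n)

S = rng.standard_normal((m, n)) / np.sqrt(m)
SA = S @ A
x_tilde = np.linalg.solve(SA.T @ SA, A.T @ y)   # minimizer of (1/2)||SAx||^2 - <A^T y, x>

x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
print("relative A-norm error:", np.linalg.norm(A @ (x_tilde - x_ls)) / np.linalg.norm(A @ x_ls))
```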
An optimal method: Iterative Hessian sketch
Given an iteration number T ≥ 1:
(1) Initialize at x^0 = 0.
(2) For iterations t = 0, 1, 2, ..., T − 1, generate an independent sketch matrix S^{t+1} ∈ R^{m×n}, and perform the update
x^{t+1} = \arg\min_{x \in C} \Big\{ \tfrac{1}{2} \|S^{t+1} A (x - x^t)\|_2^2 - \langle A^T (y - A x^t), x \rangle \Big\}.
(3) Return the estimate \hat{x} = x^T.
Intuition
Step 1 returns the plain Hessian sketch \tilde{x} = x^1.
Step t is sketching a problem for which x^t − x^{LS} is the optimal solution.
The error is thus successively "localized".
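A Python sketch of the iterative Hessian sketch in the unconstrained case, where each update reduces to a linear solve; the Gaussian sketches and problem sizes are illustrative assumptions. The constrained version would replace the solve by the constrained quadratic program in step (2).

```python
# Iterative Hessian sketch (IHS), unconstrained case.
import numpy as np

def iterative_hessian_sketch(A, y, m, T, rng):
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(T):
        S = rng.standard_normal((m, n)) / np.sqrt(m)    # fresh sketch each iteration
        SA = S @ A
        # unconstrained minimizer of (1/2)||S A (x' - x)||^2 - <A^T (y - A x), x'>
        x = x + np.linalg.solve(SA.T @ SA, A.T @ (y - A @ x))
    return x

rng = np.random.default_rng(5)
n, d = 10000, 100
A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + rng.standard_normal(n)

x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
x_ihs = iterative_hessian_sketch(A, y, m=4 * d, T=5, rng=rng)
print("||x_ihs - x_ls||_2 =", np.linalg.norm(x_ihs - x_ls))
```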
Theory for unconstrained least-squares
Theorem (Pilanci & W., 2014)
Given a sketch dimension m \gtrsim \mathrm{rank}(A), the error decays geometrically:
\|x^{t+1} - x^{LS}\|_A ≤ \big(\tfrac{1}{2}\big)^t \|x^{LS}\|_A  for all t = 0, 1, ..., T − 1,
with probability at least 1 − c_1 T e^{-c_2 m}.
applies to any sub-Gaussian sketch; same result for fast JL sketches with additional logarithmic factors
total number of random projections scales as T m
for any ε > 0, taking T = \log\big( 2\|x^{LS}\|_A / ε \big) iterations yields an ε-accurate solution
Geometric convergence for unconstrained LS
[Figure: log error to the least-squares solution versus iteration number, for three sketch sizes (γ = 4, 6, 8).]
Experiments for planted ensembles
Linear regression problems with A ∈ R^{n×d} and n > d:
y = Ax^* + w,  where x^* ∈ C, and w ~ N(0, σ^2 I_n).
The least-squares solution has error
E\|x^{LS} - x^*\|_A \lesssim \sqrt{\frac{\sigma^2 d}{n}}.
Scaling behavior:
Fix σ^2 = 1 and sample size n = 100d, and vary d ∈ {16, 32, 64, 128, 256}.
Run IHS with sketch size m = 4d for T = 4 iterations.
Compare to the classical sketch with sketch size 16d.
Sketched accuracy: IHS versus classical sketch
[Figure: least-squares estimation error versus dimension d ∈ {16, 32, 64, 128, 256}, comparing IHS to the classical sketch.]
Extensions to constrained problems
Constrained problem:
x^{LS} = \arg\min_{x \in C} \|Ax - y\|_2^2,
where C ⊆ R^d is a convex set.
[Figure: constraint set C, optimum x^{LS}, and tangent cone K.]
Tangent cone K at x^{LS}
Set of feasible directions at the optimum x^{LS}:
K = \{ ∆ ∈ R^d \mid ∆ = t (x − x^{LS}) for some t ≥ 0 and some x ∈ C \}.
Illustration: Binary classification with SVM
Observe labeled samples (bi , Li ) ∈ RD × {−1, +1}.
Goal: Find linear classifier b 7→ sign(hw, bi) with low classification error.
Support vector machine: produces classifier that depends only on samples
lying on the margin
Number of support vectors k typically ≪ total number of samples n
Sketching the dual of the SVM
Primal form of the SVM:
\hat{w} = \arg\min_{w \in R^d} \Big\{ \frac{1}{2γ} \sum_{i=1}^n \max\big(0, 1 - L_i \langle w, b_i \rangle\big) + \frac{1}{2} \|w\|_2^2 \Big\}.
Dual form of the SVM:
x^{LS} := \arg\min_{x \in P^n} \|\mathrm{diag}(L) B x\|_2^2,
where P^n := \{ x ∈ R^n \mid x ≥ 0 and \sum_{i=1}^n x_i = γ \}.
Sketched dual SVM:
\hat{x} := \arg\min_{x \in P^n} \|S \, \mathrm{diag}(L) B x\|_2^2.
Unfavorable dependence on optimum x^*
[Figure: constraint set C with the tangent cone K and error direction ∆ at the optimum x^{LS}.]
Favorable dependence on optimum x^*
[Figure: constraint set C with the tangent cone K and error direction ∆ at the optimum x^{LS}.]
Gaussian width of transformed tangent cone
Gaussian width of the set AK ∩ S^{n−1} = \{ A∆ \mid ∆ ∈ K, \|A∆\|_2 = 1 \}:
W(AK) := E\Big[ \sup_{z \in AK ∩ S^{n−1}} \langle g, z \rangle \Big],  where g ~ N(0, I_{n×n}).
Gaussian widths used in many areas:
Banach space theory: Pisier, 1986
Empirical process theory: Ledoux & Talagrand, 1991; Bartlett et al., 2002
Compressed sensing: Mendelson et al., 2008; Chandrasekaran et al., 2012
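For intuition, the width is easy to estimate by Monte Carlo in the unconstrained case, where K = R^d and AK ∩ S^{n−1} is the unit sphere of range(A): the supremum of ⟨g, z⟩ equals the norm of the projection of g onto range(A), so W(AK) ≈ sqrt(rank(A)). A small Python check (assumed setup, not from the talk):

```python
# Monte Carlo estimate of the Gaussian width W(AK) for the unconstrained case.
import numpy as np

rng = np.random.default_rng(6)
n, d = 2000, 50
A = rng.standard_normal((n, d))
Q, _ = np.linalg.qr(A)                     # orthonormal basis of range(A)

G = rng.standard_normal((200, n))          # 200 Gaussian draws g ~ N(0, I_n)
widths = np.linalg.norm(G @ Q, axis=1)     # sup_z <g, z> = ||proj_range(A) g||_2
print(np.mean(widths), np.sqrt(d))         # both approximately sqrt(rank(A))
```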
A general guarantee
Tangent cone at x^{LS}:
K = \{ ∆ ∈ R^d \mid ∆ = t(x − x^{LS}) for some t ≥ 0 and some x ∈ C \}.
Width of the transformed cone AK ∩ S^{n−1}:
W(AK) = E\Big[ \sup_{z \in AK ∩ S^{n−1}} \langle g, z \rangle \Big],  where g ~ N(0, I_{n×n}).
Theorem (Pilanci & W., 2014)
Given a sketch dimension m \gtrsim W^2(AK), the error decays geometrically:
\|x^{t+1} - x^{LS}\|_A ≤ \big(\tfrac{1}{2}\big)^t \|x^{LS}\|_A  for all t = 0, 1, ..., T − 1,
with probability at least 1 − c_1 T e^{-c_2 m}.
Illustration: Width calculation for dual SVM
Relevant constraint set is the simplex in R^n:
P^n := \{ x ∈ R^n \mid x ≥ 0 and \sum_{i=1}^n x_i = γ \}.
in practice, the SVM dual solution \hat{x}_{dual} is often sparse, with relatively few non-zeros
under mild conditions on A, it can be shown that
E\Big[ \sup_{x \in P^n, \, \|x\|_0 ≤ k, \, \|Ax\|_2 ≤ 1} \langle g, Ax \rangle \Big] \lesssim \sqrt{k \log n}.
Conclusion
For an SVM solution with k support vectors, a sketch dimension m \gtrsim k \log n is sufficient to ensure geometric convergence.
Geometric convergence for SVM
[Figure: log error versus iteration number for sketch sizes m = 2 k log n, m = 5 k log n, and m = 25 k log n.]
Sketched accuracy: IHS versus classical sketch
[Figure: error of the sparse classifier versus dimension d ∈ {16, 32, 64, 128, 256}, comparing IHS to the classical sketch.]
A more general story: Newton Sketch
Convex program over a set C ⊆ R^d:
x^{opt} = \arg\min_{x \in C} f(x),
where f : R^d → R is twice-differentiable.
Ordinary Newton steps:
x^{t+1} = \arg\min_{x \in C} \Big\{ \tfrac{1}{2} \|\nabla^2 f(x^t)^{1/2} (x - x^t)\|_2^2 + \langle \nabla f(x^t), x - x^t \rangle \Big\},
where \nabla^2 f(x^t)^{1/2} is a matrix square root of the Hessian at x^t.
Sketched Newton steps:
\tilde{x}^{t+1} = \arg\min_{x \in C} \Big\{ \tfrac{1}{2} \|S^t \nabla^2 f(\tilde{x}^t)^{1/2} (x - \tilde{x}^t)\|_2^2 + \langle \nabla f(\tilde{x}^t), x - \tilde{x}^t \rangle \Big\}.
Question:
What is the minimal sketch dimension required to ensure that \{\tilde{x}^t\}_{t=0}^T stays uniformly close to \{x^t\}_{t=0}^T?
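As a toy instance (not the authors' implementation), the Python loop below runs sketched Newton steps for unconstrained logistic regression: the Hessian square root is R = diag(sqrt(p(1−p)))B, only R is sketched, and the gradient stays exact. A Gaussian sketch, unit step size, and a tiny ridge term for numerical safety are assumptions; the full Newton sketch also uses a line search and handles constraints.

```python
# Sketched Newton steps for unconstrained logistic regression (toy example).
import numpy as np
from scipy.special import expit   # numerically stable sigmoid

rng = np.random.default_rng(7)
n, d, m = 5000, 50, 500
B = rng.standard_normal((n, d))
labels = (rng.random(n) < expit(B @ rng.standard_normal(d) / np.sqrt(d))).astype(float)

x = np.zeros(d)
for t in range(10):
    p = expit(B @ x)
    grad = B.T @ (p - labels)                  # exact gradient of the logistic loss
    R = np.sqrt(p * (1.0 - p))[:, None] * B    # Hessian square root: H = R^T R
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    SR = S @ R                                 # sketch only the Hessian square root
    x -= np.linalg.solve(SR.T @ SR + 1e-8 * np.eye(d), grad)  # sketched Newton step

print("gradient norm after 10 steps:", np.linalg.norm(B.T @ (expit(B @ x) - labels)))
```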
Sketching the central path: m = d, m = 4d, and m = 16d
[Figures: central path of the exact Newton method versus three trials of the Newton sketch, for sketch sizes m = d, m = 4d, and m = 16d.]
Running time comparisons
[Figures: optimality gap versus iteration count and versus wall-clock time (seconds), comparing exact Newton, gradient descent (GD), accelerated gradient descent (Acc. GD), SGD, BFGS, and the Newton sketch.]
Summary
important distinction: cost versus solution approximation
classical least-squares sketch is provably sub-optimal for solution
approximation
iterative Hessian sketch: fast geometric convergence with guarantees in
both cost/solution approximation
sharp dependence of sketch dimension on geometry of solution and
constraint set
a more general perspective: sketched forms of Newton’s method
Papers/pre-prints:
Pilanci & W. (2014a): Randomized sketches of convex programs with sharp guarantees. To appear, IEEE Trans. Info. Theory.
Pilanci & W. (2014b): Iterative Hessian sketch: Fast and accurate solution approximation for constrained least-squares. arXiv pre-print.
Yang, Pilanci & W. (2015): Randomized sketches for kernels: Fast and optimal non-parametric regression. arXiv pre-print.
Pilanci & W. (2015): Newton sketch: A linear-time optimization algorithm with linear-quadratic convergence. arXiv pre-print.