
Nachdiplom Lecture: Statistics meets Optimization
Fall 2015
Lecture 3 – Tuesday, October 6
Lecturer: Martin Wainwright
Scribe: Yuting Wei
Today, we analyze an application of random projection to compute approximate solutions of constrained
least-squares problems. This method is often referred to as sketched least-squares.
3.1 Problem setup and motivation
Suppose that we are given an observation vector $y \in \mathbb{R}^n$ and a matrix $A \in \mathbb{R}^{n \times d}$, and that for some convex set $\mathcal{C} \subset \mathbb{R}^d$, we would like to compute the constrained least-squares solution
$$x_{\mathrm{LS}} := \arg\min_{x \in \mathcal{C}} \underbrace{\tfrac{1}{2}\|y - Ax\|_2^2}_{:= f(x)}. \qquad (3.1)$$
In general, this solution may not be unique, but we assume throughout this lecture that uniqueness holds
(so that n ≥ d necessarily).
Different versions of the constrained least-squares problem arise in many applications:
• In the simplest case of an unconstrained problem ($\mathcal{C} = \mathbb{R}^d$), it corresponds to the usual least-squares estimator, which has been widely studied. Most past work on sketching least-squares has focused on this case.
• When $\mathcal{C}$ is a scaled form of the $\ell_1$-ball, that is, $\mathcal{C} = \{x \in \mathbb{R}^d \mid \|x\|_1 \le R\}$, then the constrained problem is known as the Lasso. It is widely used for estimating sparse regression vectors.
• The support vector machine for classification, when solved in its dual form, leads to a least-squares
problem over a polytope C.
Problems of the form (3.1) can also arise as intermediate steps when using Newton's method to solve a constrained optimization problem.
The original problem can be difficult to solve when the row dimension $n$ is very large. Thus, in order to reduce both storage and computation requirements, a natural idea is to randomly project the original data to a lower-dimensional space. In particular, given a random sketch matrix $S \in \mathbb{R}^{m \times n}$, consider the sketched least-squares problem
$$\hat{x} := \arg\min_{x \in \mathcal{C}} \underbrace{\tfrac{1}{2}\|S(y - Ax)\|_2^2}_{:= g(x)}. \qquad (3.2)$$
The row dimension of the sketched problem is now $m$ as opposed to $n$, which leads to savings whenever we can choose $m \ll n$. This idea (in the unconstrained case $\mathcal{C} = \mathbb{R}^d$) was first proposed and analyzed by Sarlos [9]. The more general analysis to be described here is due to Pilanci and Wainwright [8].
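To make the recipe concrete, the following is a minimal numerical sketch (our own Python/NumPy illustration, not part of the original notes) of the unconstrained case $\mathcal{C} = \mathbb{R}^d$: draw a Gaussian sketch matrix $S$, form the compressed data $(SA, Sy)$, and solve the smaller least-squares problem. The dimensions and the $1/\sqrt{m}$ scaling of $S$ are illustrative choices.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20000, 50, 400              # tall problem; modest sketch size (illustrative)

A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + rng.standard_normal(n)

# Gaussian sketch with i.i.d. N(0, 1/m) entries, so that E[S^T S] = I_n.
S = rng.standard_normal((m, n)) / np.sqrt(m)

# Solve the original and the sketched (unconstrained) least-squares problems.
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
x_hat, *_ = np.linalg.lstsq(S @ A, S @ y, rcond=None)

f = lambda x: 0.5 * np.sum((y - A @ x) ** 2)
print("cost ratio f(x_hat)/f(x_ls):", f(x_hat) / f(x_ls))
\end{verbatim}

The printed ratio is exactly the quantity controlled by the theory developed below.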
There are different ways to measure the quality of the sketched solution. For a given tolerance parameter $\delta \in (0, 1)$, we say that $\hat{x}$ is a δ-accurate cost approximation if
$$f(x_{\mathrm{LS}}) \le f(\hat{x}) \le (1 + \delta)^2 f(x_{\mathrm{LS}}). \qquad (3.3)$$
On the other hand, we say that it is a δ-accurate solution approximation if
$$\frac{1}{\sqrt{n}}\,\big\|A(x_{\mathrm{LS}} - \hat{x})\big\|_2 \le \delta. \qquad (3.4)$$
In this lecture, we will focus on the cost notion of performance (3.3).
In particular, our main goal is to give a precise answer to the following question: how large must the projection dimension $m$ be to guarantee that the sketched solution $\hat{x}$ provides a δ-accurate cost approximation?
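Both accuracy notions translate directly into quantities one can compute for a candidate pair $(x_{\mathrm{LS}}, \hat{x})$. The helper below is a small sketch of such a check (the function name and interface are ours, not from the notes).

\begin{verbatim}
import numpy as np

def accuracy_metrics(A, y, x_ls, x_hat):
    """Cost ratio from (3.3) and solution-error metric from (3.4)."""
    f = lambda x: 0.5 * np.sum((y - A @ x) ** 2)
    cost_ratio = f(x_hat) / f(x_ls)       # delta-accurate cost approx.: <= (1 + delta)^2
    sol_err = np.linalg.norm(A @ (x_ls - x_hat)) / np.sqrt(A.shape[0])  # want <= delta
    return cost_ratio, sol_err
\end{verbatim}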
[Figure 3.1: Illustration of the tangent cone at $x_{\mathrm{LS}}$; panels (a) and (b).]
The answer depends on a natural geometric object associated with the convex program (3.1), namely the
tangent cone of C at xLS . In particular, it is the set
$$\mathcal{K}(x_{\mathrm{LS}}) = \big\{\Delta \in \mathbb{R}^d \mid \Delta = t(x - x_{\mathrm{LS}}) \ \text{for some } x \in \mathcal{C} \text{ and } t \ge 0\big\}. \qquad (3.5)$$
Our theory shows that the projection dimension m required depends on the “size” of (a transformed version
of) this tangent cone. Note that the structure of $\mathcal{K}(x_{\mathrm{LS}})$ can vary drastically as a function of $x_{\mathrm{LS}}$.
Figure 3.1(a) shows a favorable case, where xLS lies at a vertex of a polyhedral constraint set; panel (b)
shows a less favorable case, where xLS lies on a higher-dimensional face. The worst case for our theory is
when xLS is an interior point of C, so that K(xLS ) = Rd . Note that this worst case always occurs for an
unconstrained problem.
3.2 Projection theorem for the constrained least-squares problem
Consider the set
$$A\mathcal{K}(x_{\mathrm{LS}}) \cap \mathcal{S}^{n-1} = \{v \in \mathbb{R}^n \mid \exists\, \Delta \in \mathcal{K}(x_{\mathrm{LS}}) \ \text{with } v = A\Delta,\ \|v\|_2 = 1\}, \qquad (3.6)$$
corresponding to the intersection of the transformed tangent cone $A\mathcal{K}(x_{\mathrm{LS}})$ with the Euclidean unit sphere. We define its Gaussian width in the usual way:
$$\mathcal{W}\big(A\mathcal{K}(x_{\mathrm{LS}}) \cap \mathcal{S}^{n-1}\big) := \mathbb{E}\Big[\sup_{u \in A\mathcal{K}(x_{\mathrm{LS}}) \cap \mathcal{S}^{n-1}} \langle u, g\rangle\Big], \quad \text{where } g \sim N(0, I_{n \times n}). \qquad (3.7)$$
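The width (3.7) is easy to estimate by Monte Carlo. In the unconstrained case, $A\mathcal{K}$ is the range of $A$, and the supremum of $\langle u, g\rangle$ over unit vectors $u$ in a subspace equals the norm of the orthogonal projection of $g$ onto that subspace, so the width is roughly $\sqrt{\mathrm{rank}(A)}$. The short check below is our own illustration with arbitrary dimensions.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 30
A = rng.standard_normal((n, d))

# Orthonormal basis Q of range(A); the sup of <u, g> over unit vectors in
# range(A) equals ||Q^T g||_2.
Q, _ = np.linalg.qr(A)

trials = 2000
widths = [np.linalg.norm(Q.T @ rng.standard_normal(n)) for _ in range(trials)]
print("Monte Carlo estimate of the width:", np.mean(widths))
print("sqrt(rank(A))                    :", np.sqrt(np.linalg.matrix_rank(A)))
\end{verbatim}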
Theorem 3.2.1. Consider the sketched least-squares solution (3.2) based on a Gaussian random matrix $S \in \mathbb{R}^{m \times n}$. Given a projection dimension $m \ge \frac{C}{\delta^2}\,\mathcal{W}^2\big(A\mathcal{K}(x_{\mathrm{LS}}) \cap \mathcal{S}^{n-1}\big)$, it is a δ-accurate cost approximation with probability at least $1 - c_0 \exp\{-c_1 m \delta^2\}$. Here $(C, c_0, c_1)$ are universal constants independent of all problem parameters.
Proof. In order to simplify notation, we omit the dependence of $\mathcal{K}(x_{\mathrm{LS}})$ on $x_{\mathrm{LS}}$, and write $\mathcal{K}$ instead. We define the following two quantities:
$$Z_1(A\mathcal{K}) = \inf_{v \in A\mathcal{K} \cap \mathcal{S}^{n-1}} \frac{\|Sv\|_2^2}{m} \qquad (3.8a)$$
and
$$Z_2(A\mathcal{K}, u) = \sup_{v \in A\mathcal{K} \cap \mathcal{S}^{n-1}} \Big\langle u, \Big(\frac{S^T S}{m} - I\Big) v \Big\rangle, \quad \text{where } u \in \mathcal{S}^{n-1} \text{ is some fixed vector.} \qquad (3.8b)$$
For a random sketch, note that $Z_1$ and $Z_2$ are also random variables, but parts of our theory hold deterministically for any choice of $S$. The following lemma shows that the ratio $Z_2/Z_1$ controls the quality of $\hat{x}$ as a cost approximation for $f(x) = \frac{1}{2}\|Ax - y\|_2^2$.
Lemma 3.2.1. For any sketch matrix $S \in \mathbb{R}^{m \times n}$, we have
$$f(\hat{x}) \le \Big(1 + \frac{Z_2(A\mathcal{K}, u_{\mathrm{LS}})}{Z_1(A\mathcal{K})}\Big)^2 f(x_{\mathrm{LS}}), \quad \text{where } u_{\mathrm{LS}} := \frac{y - A x_{\mathrm{LS}}}{\|y - A x_{\mathrm{LS}}\|_2}. \qquad (3.9)$$
Proof. Define the error vector $\hat{e} = \hat{x} - x_{\mathrm{LS}}$. By the triangle inequality, we have
$$\|A\hat{x} - y\|_2 \le \|Ax_{\mathrm{LS}} - y\|_2 + \|A\hat{e}\|_2 = \|Ax_{\mathrm{LS}} - y\|_2 \Big(1 + \frac{\|A\hat{e}\|_2}{\|Ax_{\mathrm{LS}} - y\|_2}\Big).$$
By definition, we have $\|A\hat{x} - y\|_2 = \sqrt{2 f(\hat{x})}$ and $\|Ax_{\mathrm{LS}} - y\|_2 = \sqrt{2 f(x_{\mathrm{LS}})}$, and hence
$$f(\hat{x}) \le f(x_{\mathrm{LS}}) \Big(1 + \frac{\|A\hat{e}\|_2}{\|Ax_{\mathrm{LS}} - y\|_2}\Big)^2. \qquad (3.10)$$
Consequently, in order to prove the lemma, it suffices to upper bound the ratio $\frac{\|A\hat{e}\|_2}{\|Ax_{\mathrm{LS}} - y\|_2}$ by $Z_2/Z_1$.
Note that $\hat{x}$ is optimal and $x_{\mathrm{LS}}$ is feasible for the sketched problem (3.2). Thus, the first-order conditions for optimality guarantee that
$$\langle \nabla g(\hat{x}), -\hat{e}\,\rangle \ge 0 \;\Longrightarrow\; \langle S(A\hat{x} - y), -SA\hat{e}\,\rangle \ge 0$$
$$\Longrightarrow\; \frac{1}{m}\|SA\hat{e}\|_2^2 \le \frac{1}{m}\big\langle S(Ax_{\mathrm{LS}} - y), -SA\hat{e}\,\big\rangle = (Ax_{\mathrm{LS}} - y)^T\Big(-\frac{S^TS}{m} + I\Big)A\hat{e} - \big\langle Ax_{\mathrm{LS}} - y, A\hat{e}\,\big\rangle. \qquad (3.11)$$
On the other hand, by the optimality of $x_{\mathrm{LS}}$ and feasibility of $\hat{x}$ for the original problem (3.1), we have
$$\langle \nabla f(x_{\mathrm{LS}}), \hat{e}\,\rangle \ge 0 \;\Longrightarrow\; \langle Ax_{\mathrm{LS}} - y, A\hat{e}\,\rangle \ge 0. \qquad (3.12)$$
Combining inequalities (3.11) and (3.12) yields
$$\frac{1}{m}\|SA\hat{e}\|_2^2 \le (Ax_{\mathrm{LS}} - y)^T\Big(-\frac{S^TS}{m} + I\Big)A\hat{e} \,\underbrace{-\,\langle Ax_{\mathrm{LS}} - y, A\hat{e}\,\rangle}_{\le 0} \;\le\; (Ax_{\mathrm{LS}} - y)^T\Big(-\frac{S^TS}{m} + I\Big)A\hat{e}.$$
Dividing through by Euclidean norms to normalize then yields
$$\frac{\|SA\hat{e}\|_2^2}{m\,\|A\hat{e}\|_2^2}\,\|A\hat{e}\|_2^2 \;\le\; \frac{(Ax_{\mathrm{LS}} - y)^T}{\|Ax_{\mathrm{LS}} - y\|_2}\Big(\frac{S^TS}{m} - I\Big)\frac{A\hat{e}}{\|A\hat{e}\|_2}\;\|Ax_{\mathrm{LS}} - y\|_2\,\|A\hat{e}\|_2.$$
Recalling the definitions of $Z_1$ and $Z_2$, we have
$$\frac{\|SA\hat{e}\|_2^2}{m\,\|A\hat{e}\|_2^2}\,\|A\hat{e}\|_2^2 \;\ge\; Z_1\,\|A\hat{e}\|_2^2, \quad\text{and}\quad \frac{(Ax_{\mathrm{LS}} - y)^T}{\|Ax_{\mathrm{LS}} - y\|_2}\Big(\frac{S^TS}{m} - I\Big)\frac{A\hat{e}}{\|A\hat{e}\|_2}\;\|Ax_{\mathrm{LS}} - y\|_2\,\|A\hat{e}\|_2 \;\le\; Z_2\,\|Ax_{\mathrm{LS}} - y\|_2\,\|A\hat{e}\|_2.$$
Putting together the pieces yields the desired bound
$$\frac{\|A\hat{e}\|_2}{\|Ax_{\mathrm{LS}} - y\|_2} \;\le\; \frac{Z_2(A\mathcal{K}, u_{\mathrm{LS}})}{Z_1(A\mathcal{K})}.$$
For a given $\delta \in (0, 1/2)$, define the event $\mathcal{E}(\delta) := \big\{Z_1 \ge 1 - \delta,\; Z_2 \le \delta/2\big\}$. Conditioned on this event, we have
$$\frac{f(\hat{x})}{f(x_{\mathrm{LS}})} \;\le\; \Big(1 + \frac{Z_2(A\mathcal{K}, u_{\mathrm{LS}})}{Z_1(A\mathcal{K})}\Big)^2 \;\le\; \Big(1 + \frac{\delta/2}{1 - \delta}\Big)^2 \;\le\; (1 + \delta)^2,$$
which is our desired form of δ-accurate cost approximation. Consequently, in order to complete the proof, we need to state a lower bound on the projection dimension that ensures that $\mathbb{P}[\mathcal{E}(\delta)] \ge 1 - c_1 e^{-c_2 m\delta^2}$.
Lemma 3.2.2. For a Gaussian sketch matrix, there is a universal constant $C$ such that $m \ge \frac{C}{\delta^2}\,\mathcal{W}^2(A\mathcal{K} \cap \mathcal{S}^{n-1})$ implies that $\mathbb{P}[\mathcal{E}(\delta)] \ge 1 - c_1 e^{-c_2 m\delta^2}$.
We only sketch the proof of this lemma here, since it follows essentially from our main result in Lecture #2.
Define the random variable
$$Z(A\mathcal{K}) = \sup_{v \in A\mathcal{K} \cap \mathcal{S}^{n-1}} \Big| v^T\Big(\frac{S^T S}{m} - I_n\Big) v \Big|. \qquad (3.13)$$
From Lecture #2, given our stated lower bound on the projection dimension, we have $Z(A\mathcal{K}) \le \delta/C'$ with probability at least $1 - c_1 e^{-c_2 m\delta^2}$. Here the constant $C' > 0$ can be made as large as we please by choosing the constant $C$ in Lemma 3.2.2 larger.
Thus, let us suppose that the bound $Z(A\mathcal{K}) \le \delta/C'$ holds. As long as $C' > 1$, we are guaranteed that $Z_1 \ge 1 - \delta$. On the other hand, showing that $Z_2 \le \delta/2$ requires a little more work, involving splitting into two cases, depending on whether the inner product $\langle u_{\mathrm{LS}}, v\rangle$ is positive or negative. We refer to the paper [8] for details.
3.3 Some illustrative examples

Let us consider some examples that illustrate Theorem 3.2.1.
Unconstrained least squares: When $\mathcal{K} = \mathbb{R}^d$, the transformed tangent cone $A\mathcal{K}$ is the range space of the matrix $A$. Using our calculations of Gaussian widths, we see that a projection dimension $m \gtrsim \mathrm{rank}(A)/\delta^2$ is sufficient to give a δ-accurate cost approximation. This result (modulo some superfluous logarithmic factors) was first shown by Sarlos [9].
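As an informal check of this scaling (our own experiment, not from the notes), one can fix $\delta$, take $m$ proportional to $\mathrm{rank}(A)/\delta^2$ with an ad-hoc constant, and compare the sketched cost against the $(1+\delta)^2$ guarantee.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
n, d, delta = 10000, 40, 0.5
m = int(4 * d / delta ** 2)      # m proportional to rank(A)/delta^2; the constant 4 is ad hoc

A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + rng.standard_normal(n)
f = lambda x: 0.5 * np.sum((y - A @ x) ** 2)

x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
S = rng.standard_normal((m, n)) / np.sqrt(m)
x_hat, *_ = np.linalg.lstsq(S @ A, S @ y, rcond=None)

print("f(x_hat)/f(x_ls) =", f(x_hat) / f(x_ls), "  vs  (1 + delta)^2 =", (1 + delta) ** 2)
\end{verbatim}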
Noiseless compressed sensing: Suppose that our goal is to recover a sparse vector $x^* \in \mathbb{R}^d$ based on observing the random projection $z = Sx^*$. We can recover known bounds for exact recovery as a special case of our general result, namely the case $A = I_d$ (and hence $n = d$).
Consider the $\ell_1$-constrained least-squares problem $\min_{\|x\|_1 \le R} \frac{1}{2}\|x^* - x\|_2^2$, say with the radius chosen as $R = \|x^*\|_1$. By construction, we have $x_{\mathrm{LS}} = x^*$ for this problem. The sketched analog is given by
$$\min_{\|x\|_1 \le R} \frac{1}{2}\|S(x^* - x)\|_2^2 \;=\; \min_{\|x\|_1 \le R} \frac{1}{2}\|z - Sx\|_2^2, \quad \text{where } z = Sx^* \text{ is observed.} \qquad (3.14)$$
Suppose that $\|x^*\|_0 = k$, meaning that $x^*$ is a $k$-sparse vector. It turns out that (with high probability) the sketched problem (3.14) has $\hat{x} = x^*$ as its unique solution whenever $m \gtrsim k \log(ed/k)$. This is a standard result in compressed sensing [5, 4], and can be derived as a corollary of our general theorem.
We begin by noting that if $\hat{x}$ provides a δ-accurate cost approximation, then we are guaranteed that
$$\|x^* - \hat{x}\|_2^2 \le (1 + \delta)^2\,\|x^* - x_{\mathrm{LS}}\|_2^2 = 0,$$
which implies that $\hat{x} = x^*$. So for this special noiseless problem, a cost approximation of any quality automatically yields exact recovery.
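The following is a small end-to-end illustration of this phenomenon (our own code, not from the notes). It solves the sketched problem (3.14) by projected gradient descent, using the standard sort-based Euclidean projection onto the $\ell_1$-ball; the constant 8 in the choice of $m$ and the iteration count are ad hoc.

\begin{verbatim}
import numpy as np

def project_l1_ball(v, R):
    """Euclidean projection of v onto the l1-ball of radius R (sort-based method)."""
    if np.sum(np.abs(v)) <= R:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (css - R))[0][-1]
    theta = (css[rho] - R) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

rng = np.random.default_rng(3)
d, k = 1000, 10
m = int(8 * k * np.log(np.e * d / k))      # m of order k log(ed/k); the constant 8 is ad hoc

x_star = np.zeros(d)
x_star[rng.choice(d, size=k, replace=False)] = rng.standard_normal(k)

S = rng.standard_normal((m, d)) / np.sqrt(m)
z = S @ x_star                              # noiseless sketched observations
R = np.sum(np.abs(x_star))                  # l1 radius chosen as ||x*||_1

# Projected gradient descent on the sketched objective g(x) = 0.5 * ||z - S x||_2^2.
x = np.zeros(d)
step = 1.0 / np.linalg.norm(S, 2) ** 2
for _ in range(2000):
    x = project_l1_ball(x - step * (S.T @ (S @ x - z)), R)

print("recovery error ||x_hat - x*||_2 =", np.linalg.norm(x - x_star))
\end{verbatim}

With $m$ of this order, the sketched problem has $x^*$ as its unique solution with high probability, so the printed error should be close to zero (up to optimization tolerance).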
How large a projection dimension $m$ is needed for a cost approximation (and hence exact recovery)? We need to bound the Gaussian width $\mathcal{W}(A\mathcal{K} \cap \mathcal{S}^{n-1})$. The following lemma provides an outer bound on this set. We use the notation $\mathbb{B}_p(r) = \{x \in \mathbb{R}^d \mid \|x\|_p \le r\}$ for the $\ell_p$-norm ball of radius $r$, and $\mathrm{clconv}$ to denote the closed convex hull.
Lemma 3.3.1. For any vector $x^*$ that is $k$-sparse, we have
$$\mathcal{K}(x^*) \cap \mathcal{S}^{d-1} \;\overset{(a)}{\subseteq}\; \mathbb{B}_1(2\sqrt{k}) \cap \mathbb{B}_2(1) \;\overset{(b)}{\subseteq}\; 2\,\mathrm{clconv}\big\{\mathbb{B}_0(4k) \cap \mathbb{B}_2(1)\big\}. \qquad (3.15)$$
Let us use this lemma to bound the Gaussian width. From inclusion (a) in equation (3.15), we have
$$\mathcal{W}\big(\mathcal{K}(x^*) \cap \mathcal{S}^{d-1}\big) \;\le\; \mathbb{E}\Big[\sup_{\|x\|_1 \le 2\sqrt{k},\ \|x\|_2 = 1} \langle x, g\rangle\Big] \;\overset{(i)}{\le}\; 2\sqrt{k}\;\mathbb{E}\|g\|_\infty \;\overset{(ii)}{\le}\; 2\sqrt{2k\log d},$$
a very crude bound! Here inequality (i) follows from Hölder's inequality, and step (ii) follows from our bound on the expected maximum of $d$ independent Gaussians (shown in Lecture #2). This bound shows that it suffices to take $m \gtrsim k\log d$.
Now let's use inclusion (b) in equation (3.15), along with a more careful argument, to obtain a sharper bound. We have
$$\mathcal{W}\big(\mathcal{K}(x^*) \cap \mathcal{S}^{d-1}\big) \;\le\; 2\,\mathbb{E}\Big[\max_{|S|=4k}\,\sup_{\|x_S\|_2 \le 1} \langle x_S, g_S\rangle\Big] \;\le\; 2\,\mathbb{E}\Big[\max_{|S|=4k} \|g_S\|_2\Big], \quad \text{where } g \sim N(0, I_d).$$
Each variable $\|g_S\|_2$ has mean at most $\sqrt{|S|} = \sqrt{4k}$, and is concentrated around its mean with sub-Gaussian tails (Lecture #2, concentration of Lipschitz functions). Consequently, we have
$$\mathbb{E}\Big[\max_{|S|=4k}\,\sup_{\|x_S\|_2 \le 1} \langle x_S, g_S\rangle\Big] \;\le\; \sqrt{4k} + c\,\sqrt{\log \binom{d}{4k}} \;=\; c'\,\sqrt{k\log(ed/k)}.$$
This calculation shows that it is sufficient to take $m \gtrsim k\log(ed/k)$.
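For a rough sense of the gap between the two arguments, one can compare the squared widths (which is what enters the requirement on $m$) for an illustrative pair $(k, d)$; the absolute constant in the sharper bound is ignored here.

\begin{verbatim}
import numpy as np

k, d = 10, 100000
crude = (2 * np.sqrt(2 * k * np.log(d))) ** 2   # from inclusion (a): 8 k log d
sharp = k * np.log(np.e * d / k)                # from inclusion (b): k log(ed/k), constant dropped
print("crude squared width bound:", crude)      # roughly 920
print("sharp squared width bound:", sharp)      # roughly 100
\end{verbatim}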
It remains to prove Lemma 3.3.1.
Proof. We prove the first inclusion here. For any $x \in \mathcal{C}$, the difference vector $\Delta = x - x^*$ (and scaled copies of it) belongs to the tangent cone $\mathcal{K}(x^*)$. Let us write the support set of $x^*$ as $S$ and its complement as $S^c$. Since $x \in \mathcal{C}$, we have
$$\|x\|_1 = \|x^* + \Delta\|_1 = \|x^*_S + \Delta_S\|_1 + \|\Delta_{S^c}\|_1 \le \|x^*_S\|_1.$$
By the triangle inequality,
$$\|x^*_S\|_1 - \|\Delta_S\|_1 + \|\Delta_{S^c}\|_1 \le \|x^*_S + \Delta_S\|_1 + \|\Delta_{S^c}\|_1 \le \|x^*_S\|_1.$$
Combining the two bounds yields $\|\Delta_{S^c}\|_1 \le \|\Delta_S\|_1$, and hence
$$\|\Delta\|_1 = \|\Delta_S\|_1 + \|\Delta_{S^c}\|_1 \le 2\|\Delta_S\|_1 \le 2\sqrt{k}\,\|\Delta\|_2,$$
as claimed. For the proof of inclusion (b) in equation (3.15), we refer the reader to Loh and Wainwright [7].
3.4 Other random projection methods

Up until now, we have shown that Gaussian sketching reduces the problem dimension to an intrinsic dimension characterized by a geometric object: the Gaussian width of the tangent cone at $x_{\mathrm{LS}}$. If we only care about data storage, this projection already helps us. However, from a computational point of view, the cost of forming the Gaussian sketches $Sy$ and $SA$, where $A \in \mathbb{R}^{n \times d}$, is of order $O(mnd)$. This can exceed the cost of solving the original least-squares problem directly. Why, then, bother with the sketching technique?
The key is that our theory holds for many other types of random sketches apart from Gaussian. In fact, our theory gives bounds for any matrix $S$ for which we can control the quantity
$$Z\big(A\mathcal{K}(x_{\mathrm{LS}}) \cap \mathcal{S}^{n-1}\big) \;=\; \sup_{v \in A\mathcal{K}(x_{\mathrm{LS}}) \cap \mathcal{S}^{n-1}} \Big|\frac{1}{m}\|Sv\|_2^2 - 1\Big|.$$
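For the unconstrained case, where $A\mathcal{K}$ is the range space of $A$, this quantity can be computed exactly: writing $v = Qw$ for an orthonormal basis $Q$ of $\mathrm{range}(A)$ and a unit vector $w$, it equals the spectral norm of $(SQ)^T(SQ)/m - I$. The check below (our own illustration) makes the "good sketch" condition concrete for a Gaussian $S$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
n, d, m = 5000, 30, 600

A = rng.standard_normal((n, d))
Q, _ = np.linalg.qr(A)                 # orthonormal basis of range(A)

S = rng.standard_normal((m, n))        # raw Gaussian sketch; the 1/m factor appears below

# Z = sup over unit w of |w^T ((SQ)^T (SQ)/m - I) w| = spectral norm of that matrix.
M = (S @ Q).T @ (S @ Q) / m - np.eye(d)
print("Z for a Gaussian sketch:", np.linalg.norm(M, 2))   # roughly of order sqrt(d/m)
\end{verbatim}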
What types of matrices can be used to speed up the projection step?
• Fast JL transforms: These methods are based on structured forms of random matrices for which
the projection can be computed quickly [1, 2].
• Sparse JL transforms: These methods are based on constructing projection matrices S with a very
large number of zero entries, so that the matrix multiplication can be sped up by using routines for
sparse matrices [6, 3].
One class of fast JL transforms based on structured matrices takes the following form (a minimal numerical sketch follows the list below):
1. Start with an orthonormal matrix $H \in \mathbb{R}^{n \times n}$ for which fast matrix-vector multiplication is possible. Examples include the discrete Fourier matrix (multiplications carried out by the fast Fourier transform) and the Hadamard matrix (multiplications carried out with the fast Hadamard transform).
2. Form the projection matrix $S = H_m D$, where
• the matrix $H_m \in \mathbb{R}^{m \times n}$ is obtained by sampling $m$ rows from $H$ uniformly at random (without replacement), and
• the matrix $D \in \mathbb{R}^{n \times n}$ is diagonal, with entries drawn i.i.d. and uniformly from $\{-1, +1\}$.
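A minimal implementation of this construction (our own illustration, not from the notes) is given below. It uses a naive fast Walsh-Hadamard transform, so the sketch $S = H_m D$ is applied in roughly $O(n \log n)$ time per column rather than $O(mn)$; the overall scaling of $S$ is left unnormalized, which does not affect the minimizer of the sketched problem.

\begin{verbatim}
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform along axis 0 (length must be a power of 2)."""
    x = np.array(x, dtype=float, copy=True)
    h = 1
    while h < x.shape[0]:
        for i in range(0, x.shape[0], 2 * h):
            a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
            x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return x

rng = np.random.default_rng(5)
n, d, m = 4096, 20, 256                # n must be a power of 2 for the Hadamard transform

A = rng.standard_normal((n, d))
y = A @ rng.standard_normal(d) + rng.standard_normal(n)

# S = H_m D: random signs (D), fast Hadamard transform (H), then subsample m rows (H_m).
signs = rng.choice([-1.0, 1.0], size=n)
rows = rng.choice(n, size=m, replace=False)

def sketch(M):
    return fwht(signs[:, None] * M if M.ndim == 2 else signs * M)[rows]

x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
x_hat, *_ = np.linalg.lstsq(sketch(A), sketch(y), rcond=None)

f = lambda x: 0.5 * np.sum((y - A @ x) ** 2)
print("cost ratio with the randomized Hadamard sketch:", f(x_hat) / f(x_ls))
\end{verbatim}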
Bibliography
[1] N. Ailon and B. Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform.
In Proceedings of the 38th Annual ACM Symposium on Theory of Computing, pages 557–563. ACM,
2006.
[2] N. Ailon and E. Liberty. Fast dimension reduction using Rademacher series on dual BCH codes. Discrete Comput. Geom., 42(4):615–630, 2009.
[3] J. Bourgain, S. Dirksen, and J. Nelson. Toward a unified theory of sparse dimensionality reduction in Euclidean space. Geometric and Functional Analysis, 25(4), 2015.
[4] E. J. Candès and T. Tao. Decoding by linear programming. IEEE Trans. Info Theory, 51(12):4203–4215,
December 2005.
[5] D. L. Donoho. Compressed sensing. IEEE Trans. Info. Theory, 52(4):1289–1306, April 2006.
[6] D. M. Kane and J. Nelson. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM, 61(1), 2014.
[7] P. Loh and M. J. Wainwright. High-dimensional regression with noisy and missing data: Provable
guarantees with non-convexity. Annals of Statistics, 40(3):1637–1664, September 2012.
[8] M. Pilanci and M. J. Wainwright. Randomized sketches of convex programs with sharp guarantees. IEEE
Trans. Info. Theory, To appear. Presented in part at ISIT 2014.
[9] T. Sarlos. Improved approximation algorithms for large matrices via random projections. In 47th Annual
IEEE Symposium on Foundations of Computer Science, pages 143–152. IEEE, 2006.