
Electron. Commun. Probab. 17 (2012), no. 52, 1–6.
DOI: 10.1214/ECP.v17-2079
ISSN: 1083-589X
A tail inequality for quadratic forms
of subgaussian random vectors
Daniel Hsu∗
Sham M. Kakade†
Tong Zhang‡
Abstract
This article proves an exponential probability tail inequality for positive semidefinite
quadratic forms in a subgaussian random vector. The bound is analogous to one that
holds when the vector has independent Gaussian entries.
Keywords: Tail inequality; quadratic form; subgaussian random vectors; subgaussian chaos.
AMS MSC 2010: 60F10.
Submitted to ECP on June 11, 2012, final version accepted on October 29, 2012.
Supersedes arXiv:1110.2842.
1 Introduction
Suppose that $x = (x_1, \dots, x_n)$ is a random vector. Let $A \in \mathbb{R}^{n \times n}$ be a fixed matrix. A natural quantity that arises in many settings is the quadratic form $\|Ax\|^2 = x^\top (A^\top A) x$. Throughout, $\|v\|$ denotes the Euclidean norm of a vector $v$, and $\|M\|$ denotes the spectral (operator) norm of a matrix $M$. We are interested in how close $\|Ax\|^2$ is to its expectation.

Consider the special case where $x_1, \dots, x_n$ are independent standard Gaussian random variables. The following proposition provides an (upper) tail bound for $\|Ax\|^2$.
Proposition 1.1. Let $A \in \mathbb{R}^{n \times n}$ be a matrix, and let $\Sigma := A^\top A$. Let $x = (x_1, \dots, x_n)$ be an isotropic multivariate Gaussian random vector with mean zero. For all $t > 0$,
\[
\Pr\!\left[ \|Ax\|^2 > \operatorname{tr}(\Sigma) + 2\sqrt{\operatorname{tr}(\Sigma^2)\, t} + 2\|\Sigma\| t \right] \le e^{-t}.
\]
The proof, given in Appendix A.2, is straightforward given the rotational invariance of the multivariate Gaussian distribution, together with a tail bound for linear combinations of $\chi^2$ random variables from [2]. We note that a slightly weaker form of Proposition 1.1 can be proved directly using Gaussian concentration [3].
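As a quick illustration, Proposition 1.1 can be checked by Monte Carlo simulation; the following is a minimal sketch in which the matrix $A$, the number of trials, and the value of $t$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 5, 200_000, 3.0

A = rng.standard_normal((n, n))                  # arbitrary fixed matrix
Sigma = A.T @ A
threshold = (np.trace(Sigma)
             + 2 * np.sqrt(np.trace(Sigma @ Sigma) * t)
             + 2 * np.linalg.norm(Sigma, 2) * t)

x = rng.standard_normal((trials, n))             # isotropic Gaussian vectors
Ax = x @ A.T                                     # rows are (A x)^T
vals = np.einsum('ij,ij->i', Ax, Ax)             # ||Ax||^2 for each trial

print("empirical tail probability:", np.mean(vals > threshold))
print("bound exp(-t)             :", np.exp(-t))
```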
In this note, we consider the case where $x = (x_1, \dots, x_n)$ is a subgaussian random vector. By this, we mean that there exists a $\sigma \ge 0$ such that for all $\alpha \in \mathbb{R}^n$,
\[
\mathbb{E}\big[\exp(\alpha^\top x)\big] \le \exp\big(\|\alpha\|^2 \sigma^2 / 2\big).
\]
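For example, a vector $x$ of independent Rademacher entries satisfies this condition with $\sigma = 1$, since coordinatewise $\mathbb{E}[\exp(\alpha_i x_i)] = \cosh(\alpha_i) \le \exp(\alpha_i^2/2)$, so that
\[
\mathbb{E}\big[\exp(\alpha^\top x)\big] = \prod_{i=1}^n \mathbb{E}\big[\exp(\alpha_i x_i)\big] \le \prod_{i=1}^n \exp\big(\alpha_i^2/2\big) = \exp\big(\|\alpha\|^2/2\big).
\]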
We provide a sharp upper tail bound for this case analogous to one that holds in the
Gaussian case (indeed, the same as Proposition 1.1 when σ = 1).
∗ Microsoft Research New England, USA. E-mail: dahsu@microsoft.com
† Microsoft Research New England, USA. E-mail: skakade@microsoft.com
‡ Department of Statistics, Rutgers University, USA. E-mail: tzhang@stat.rutgers.edu
Tail inequalities for sums of random vectors
One motivation for our main result comes from the following observations about sums of random vectors. Let $a_1, \dots, a_n$ be vectors in a Euclidean space, and let $A = [a_1 | \cdots | a_n]$ be the matrix with $a_i$ as its $i$th column. Consider the squared norm of the random sum
\[
\|Ax\|^2 = \Big\| \sum_{i=1}^n a_i x_i \Big\|^2
\tag{1.1}
\]
where $x := (x_1, \dots, x_n)$ is a martingale difference sequence with $\mathbb{E}[x_i \mid x_1, \dots, x_{i-1}] = 0$ and $\mathbb{E}[x_i^2 \mid x_1, \dots, x_{i-1}] = \sigma^2$. Under mild boundedness assumptions on the $x_i$, the probability that the squared norm in (1.1) is much larger than its expectation
\[
\mathbb{E}\big[\|Ax\|^2\big] = \sigma^2 \sum_{i=1}^n \|a_i\|^2 = \sigma^2 \operatorname{tr}(A^\top A)
\]
falls off exponentially fast. This can be shown, for instance, using the following proposition by taking $u_i = a_i x_i$ (see Appendix A.1).
Proposition 1.2. Let $u_1, \dots, u_n$ be a martingale difference vector sequence, i.e.,
\[
\mathbb{E}[u_i \mid u_1, \dots, u_{i-1}] = 0 \quad \text{for all } i = 1, \dots, n,
\]
such that
\[
\sum_{i=1}^n \mathbb{E}\big[\|u_i\|^2 \mid u_1, \dots, u_{i-1}\big] \le v \quad \text{and} \quad \|u_i\| \le b
\]
for all $i = 1, \dots, n$, almost surely. For all $t > 0$,
\[
\Pr\!\left[ \Big\| \sum_{i=1}^n u_i \Big\| > \sqrt{v} + \sqrt{8vt} + (4/3)\, bt \right] \le e^{-t}.
\]
After squaring the quantities in the stated probabilistic event, Proposition 1.2 gives the bound
\[
\|Ax\|^2 \le \sigma^2 \cdot \operatorname{tr}(A^\top A) + \sigma^2 \cdot O\!\big( \operatorname{tr}(A^\top A)\,(\sqrt{t} + t) \big) + O\!\Big( \sqrt{\operatorname{tr}(A^\top A)}\, \max_i \|a_i\|\, \big(t + t^{3/2}\big) + \max_i \|a_i\|^2\, t^2 \Big)
\]
with probability at least $1 - e^{-t}$ when the $x_i$ are almost surely bounded by $1$ (or any constant).
Unfortunately, this bound obtained from Proposition 1.2 can be suboptimal when the $x_i$ are subgaussian. For instance, if the $x_i$ are Rademacher random variables, so $\Pr[x_i = +1] = \Pr[x_i = -1] = 1/2$, then it is known that
\[
\|Ax\|^2 \le \operatorname{tr}(A^\top A) + O\!\left( \sqrt{\operatorname{tr}\big((A^\top A)^2\big)\, t} + \|A\|^2 t \right)
\tag{1.2}
\]
with probability at least $1 - e^{-t}$. A similar result holds for any subgaussian distribution on the $x_i$ [1]. This is an improvement over the previous bound because the deviation terms (i.e., those involving $t$) can be significantly smaller, especially for large $t$.
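The improvement is easy to see numerically; the sketch below compares the deviation terms of the two bounds for a concrete matrix, where the choice of $A$ and the values of $t$ are arbitrary and absolute constants are dropped.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 100))              # arbitrary concrete matrix
S = A.T @ A

tr_S, tr_S2 = np.trace(S), np.trace(S @ S)
op_norm = np.linalg.norm(S, 2)
max_col = np.linalg.norm(A, axis=0).max()        # max_i ||a_i||

for t in (1.0, 10.0, 100.0):
    # deviation terms from the Proposition 1.2 bound (constants dropped)
    dev_mds = (tr_S * (np.sqrt(t) + t)
               + np.sqrt(tr_S) * max_col * (t + t**1.5)
               + max_col**2 * t**2)
    # deviation terms from (1.2) (constants dropped)
    dev_12 = np.sqrt(tr_S2 * t) + op_norm * t
    print(f"t = {t:6.1f}   Prop. 1.2 terms: {dev_mds:11.3e}   (1.2) terms: {dev_12:11.3e}")
```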
In this work, we give a simple proof of (1.2) with explicit constants that match the analogous bound when the $x_i$ are independent standard Gaussian random variables.
2 Positive semidefinite quadratic forms
Our main theorem, given below, is a generalization of (1.2).
Theorem 2.1. Let $A \in \mathbb{R}^{n \times n}$ be a matrix, and let $\Sigma := A^\top A$. Suppose that $x = (x_1, \dots, x_n)$ is a random vector such that, for some $\mu \in \mathbb{R}^n$ and $\sigma \ge 0$,
\[
\mathbb{E}\big[\exp\big(\alpha^\top (x - \mu)\big)\big] \le \exp\big(\|\alpha\|^2 \sigma^2 / 2\big)
\tag{2.1}
\]
for all $\alpha \in \mathbb{R}^n$. For all $t > 0$,
\[
\Pr\!\left[ \|Ax\|^2 > \sigma^2 \Big( \operatorname{tr}(\Sigma) + 2\sqrt{\operatorname{tr}(\Sigma^2)\, t} + 2\|\Sigma\| t \Big) + \operatorname{tr}(\Sigma \mu\mu^\top) \left( 1 + 2 \left( \frac{\|\Sigma\|^2}{\operatorname{tr}(\Sigma^2)}\, t \right)^{1/2} \right) \right] \le e^{-t}.
\]
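In applications one typically just evaluates the threshold on the right-hand side. A small helper of the following form makes the dependence on $\Sigma$, $\mu$, $\sigma$, and $t$ explicit (a sketch; the function name is our own choice).

```python
import numpy as np

def quadratic_form_tail_threshold(A, mu, sigma, t):
    """Right-hand side of Theorem 2.1: Pr[||Ax||^2 > threshold] <= exp(-t)."""
    Sigma = A.T @ A
    tr_S = np.trace(Sigma)
    tr_S2 = np.trace(Sigma @ Sigma)
    op_norm = np.linalg.norm(Sigma, 2)           # spectral norm ||Sigma||
    gaussian_part = sigma**2 * (tr_S + 2 * np.sqrt(tr_S2 * t) + 2 * op_norm * t)
    mean_part = (mu @ Sigma @ mu) * (1 + 2 * np.sqrt(op_norm**2 * t / tr_S2))
    return gaussian_part + mean_part
```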
Remark 2.2. If $\mu = 0$, then the assumption (2.1) implies $\mathbb{E}[x] = 0$ and $\operatorname{cov}(x) \preceq \sigma^2 I$. In this case,
\[
\mathbb{E}\big[\|Ax\|^2\big] = \operatorname{tr}\big(\Sigma \operatorname{cov}(x)\big) \le \sigma^2 \operatorname{tr}(\Sigma),
\qquad
\operatorname{var}\big(\|Ax\|^2\big) = O\big(\sigma^4 \operatorname{tr}(\Sigma^2)\big),
\]
so the probability inequality may be interpreted as a Bernstein inequality. If $\mu = 0$ and $\sigma = 1$, then the probability inequality reads
\[
\Pr\!\left[ \|Ax\|^2 > \operatorname{tr}(\Sigma) + 2\sqrt{\operatorname{tr}(\Sigma^2)\, t} + 2\|\Sigma\| t \right] \le e^{-t},
\]
which is the same as Proposition 1.1.
Remark 2.3. Our proof (via (2.2), (2.4), and (2.5)) actually establishes the following upper bounds on the moment generating function of $\|Ax\|^2$ for $0 \le \eta < 1/(2\sigma^2 \|\Sigma\|)$:
\[
\mathbb{E}\big[\exp\big(\eta \|Ax\|^2\big)\big]
\le \mathbb{E}\Big[\exp\Big(\sigma^2 \|A^\top z\|^2 \eta + \mu^\top A^\top z \sqrt{2\eta}\Big)\Big]
\le \exp\!\left( \sigma^2 \operatorname{tr}(\Sigma)\, \eta + \frac{\sigma^4 \operatorname{tr}(\Sigma^2)\, \eta^2 + \|A\mu\|^2 \eta}{1 - 2\sigma^2 \|\Sigma\| \eta} \right)
\]
where $z$ is a vector of $n$ independent standard Gaussian random variables.
Proof of Theorem 2.1. Let $z$ be a vector of $n$ independent standard Gaussian random variables (sampled independently of $x$). For any $\alpha \in \mathbb{R}^n$,
\[
\mathbb{E}\big[\exp(z^\top \alpha)\big] = \exp\big(\|\alpha\|^2 / 2\big).
\tag{2.2}
\]
Thus, for any $\lambda \in \mathbb{R}$ and $\varepsilon \ge 0$, we have the following decoupling (which holds, in fact, for any random vector $x$):
\[
\mathbb{E}\big[\exp(\lambda z^\top A x)\big]
\ge \mathbb{E}\big[\exp(\lambda z^\top A x) \,\big|\, \|Ax\|^2 > \varepsilon\big] \cdot \Pr\big[\|Ax\|^2 > \varepsilon\big]
\ge \exp\!\left( \frac{\lambda^2 \varepsilon}{2} \right) \cdot \Pr\big[\|Ax\|^2 > \varepsilon\big].
\tag{2.3}
\]
Moreover, using (2.1),
\[
\mathbb{E}\big[\exp(\lambda z^\top A x)\big]
= \mathbb{E}\Big[ \mathbb{E}\big[\exp\big(\lambda z^\top A (x - \mu)\big) \,\big|\, z\big] \exp\big(\lambda z^\top A \mu\big) \Big]
\le \mathbb{E}\left[ \exp\!\left( \frac{\lambda^2 \sigma^2}{2} \|A^\top z\|^2 + \lambda \mu^\top A^\top z \right) \right].
\tag{2.4}
\]
Let $U S V^\top$ be a singular value decomposition of $A$, where $U$ and $V$ are, respectively, matrices of orthonormal left and right singular vectors, and $S = \operatorname{diag}(\sqrt{\rho_1}, \dots, \sqrt{\rho_n})$ is the diagonal matrix of corresponding singular values. Note that
\[
\|\rho\|_1 = \sum_{i=1}^n \rho_i = \operatorname{tr}(\Sigma),
\qquad
\|\rho\|_2^2 = \sum_{i=1}^n \rho_i^2 = \operatorname{tr}(\Sigma^2),
\qquad \text{and} \qquad
\|\rho\|_\infty = \max_i \rho_i = \|\Sigma\|.
\]
By rotational invariance, $y := U^\top z$ is an isotropic multivariate Gaussian random vector with mean zero. Therefore $\|A^\top z\|^2 = z^\top U S^2 U^\top z = \rho_1 y_1^2 + \cdots + \rho_n y_n^2$ and $\mu^\top A^\top z = \nu^\top y = \nu_1 y_1 + \cdots + \nu_n y_n$, where $\nu := S V^\top \mu$ (note that $\|\nu\|^2 = \|S V^\top \mu\|^2 = \|A\mu\|^2$). Let $\gamma := \lambda^2 \sigma^2 / 2$. By Lemma 2.4,
\[
\mathbb{E}\left[ \exp\!\left( \gamma \sum_{i=1}^n \rho_i y_i^2 + \frac{\sqrt{2\gamma}}{\sigma} \sum_{i=1}^n \nu_i y_i \right) \right]
\le \exp\!\left( \|\rho\|_1 \gamma + \frac{\|\rho\|_2^2 \gamma^2 + \|\nu\|^2 \gamma / \sigma^2}{1 - 2\|\rho\|_\infty \gamma} \right)
\tag{2.5}
\]
for $0 \le \gamma < 1/(2\|\rho\|_\infty)$. Combining (2.3), (2.4), and (2.5) gives
\[
\Pr\big[\|Ax\|^2 > \varepsilon\big]
\le \exp\!\left( -\varepsilon \gamma / \sigma^2 + \|\rho\|_1 \gamma + \frac{\|\rho\|_2^2 \gamma^2 + \|\nu\|^2 \gamma / \sigma^2}{1 - 2\|\rho\|_\infty \gamma} \right)
\]
for $0 \le \gamma < 1/(2\|\rho\|_\infty)$ and $\varepsilon \ge 0$. Choosing
\[
\varepsilon := \sigma^2 (\|\rho\|_1 + \tau) + \|\nu\|^2 \left( 1 + \sqrt{\frac{2\|\rho\|_\infty \tau}{\|\rho\|_2^2}} \right)
\qquad \text{and} \qquad
\gamma := \frac{1}{2\|\rho\|_\infty} \left( 1 - \sqrt{\frac{\|\rho\|_2^2}{\|\rho\|_2^2 + 2\|\rho\|_\infty \tau}} \right),
\]
we have
\[
\Pr\!\left[ \|Ax\|^2 > \sigma^2 (\|\rho\|_1 + \tau) + \|\nu\|^2 \left( 1 + \sqrt{\frac{2\|\rho\|_\infty \tau}{\|\rho\|_2^2}} \right) \right]
\le \exp\!\left( -\frac{\|\rho\|_2^2}{2\|\rho\|_\infty^2} \left( 1 + \frac{\|\rho\|_\infty \tau}{\|\rho\|_2^2} - \sqrt{1 + \frac{2\|\rho\|_\infty \tau}{\|\rho\|_2^2}} \right) \right)
= \exp\!\left( -\frac{\|\rho\|_2^2}{2\|\rho\|_\infty^2}\, h_1\!\left( \frac{\|\rho\|_\infty \tau}{\|\rho\|_2^2} \right) \right)
\]
where $h_1(a) := 1 + a - \sqrt{1 + 2a}$, which has the inverse function $h_1^{-1}(b) = \sqrt{2b} + b$. The result follows by setting $\tau := 2\sqrt{\|\rho\|_2^2\, t} + 2\|\rho\|_\infty t = 2\sqrt{\operatorname{tr}(\Sigma^2)\, t} + 2\|\Sigma\| t$.
The following lemma is a standard estimate of the logarithmic moment generating
function of a quadratic form in standard Gaussian random variables, proved much along
the lines of the estimate from [2].
Lemma 2.4. Let $z$ be a vector of $n$ independent standard Gaussian random variables. Fix any non-negative vector $\alpha \in \mathbb{R}^n_+$ and any vector $\beta \in \mathbb{R}^n$. If $0 \le \lambda < 1/(2\|\alpha\|_\infty)$, then
\[
\log \mathbb{E}\left[ \exp\!\left( \lambda \sum_{i=1}^n \alpha_i z_i^2 + \sum_{i=1}^n \beta_i z_i \right) \right]
\le \|\alpha\|_1 \lambda + \frac{\|\alpha\|_2^2 \lambda^2 + \|\beta\|_2^2 / 2}{1 - 2\|\alpha\|_\infty \lambda}.
\]
Proof. Fix $\lambda \in \mathbb{R}$ such that $0 \le \lambda < 1/(2\|\alpha\|_\infty)$, and let $\eta_i := 1/\sqrt{1 - 2\alpha_i \lambda} > 0$ for $i = 1, \dots, n$. We have
\[
\mathbb{E}\big[\exp\big(\lambda \alpha_i z_i^2 + \beta_i z_i\big)\big]
= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\big(-z_i^2/2\big) \exp\big(\lambda \alpha_i z_i^2 + \beta_i z_i\big)\, dz_i
= \eta_i \exp\!\left( \frac{\beta_i^2 \eta_i^2}{2} \right) \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi \eta_i^2}} \exp\!\left( -\frac{1}{2\eta_i^2} \big(z_i - \beta_i \eta_i^2\big)^2 \right) dz_i
\]
so
\[
\log \mathbb{E}\left[ \exp\!\left( \lambda \sum_{i=1}^n \alpha_i z_i^2 + \sum_{i=1}^n \beta_i z_i \right) \right]
= \frac{1}{2} \sum_{i=1}^n \beta_i^2 \eta_i^2 + \frac{1}{2} \sum_{i=1}^n \log \eta_i^2.
\]
The right-hand side can be bounded using the inequalities
\[
\frac{1}{2} \sum_{i=1}^n \log \eta_i^2
= -\frac{1}{2} \sum_{i=1}^n \log(1 - 2\alpha_i \lambda)
= \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^{\infty} \frac{(2\alpha_i \lambda)^j}{j}
\le \|\alpha\|_1 \lambda + \frac{\|\alpha\|_2^2 \lambda^2}{1 - 2\|\alpha\|_\infty \lambda}
\]
and
\[
\frac{1}{2} \sum_{i=1}^n \beta_i^2 \eta_i^2 \le \frac{\|\beta\|_2^2 / 2}{1 - 2\|\alpha\|_\infty \lambda}.
\]
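Lemma 2.4 can likewise be sanity-checked by simulation; the sketch below compares an empirical estimate of the left-hand side with the bound, where the vectors $\alpha$, $\beta$ and the value of $\lambda$ are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = np.array([0.5, 0.2, 0.1, 0.05])          # non-negative entries
beta = np.array([1.0, -0.5, 0.3, 0.0])
lam = 0.4                                        # requires lam < 1/(2*max(alpha)) = 1.0

z = rng.standard_normal((1_000_000, len(alpha)))
lhs = np.log(np.mean(np.exp(lam * (z**2 @ alpha) + z @ beta)))

rhs = (alpha.sum() * lam
       + ((alpha**2).sum() * lam**2 + (beta**2).sum() / 2) / (1 - 2 * alpha.max() * lam))

print("empirical log-MGF:", lhs)
print("upper bound      :", rhs)
```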
3 Example: fixed-design regression with subgaussian noise
We give a simple application of Theorem 2.1 to fixed-design linear regression with the
ordinary least squares estimator.
Let $x_1, \dots, x_n$ be fixed design vectors in $\mathbb{R}^d$. Let the responses $y_1, \dots, y_n$ be random variables for which there exists $\sigma > 0$ such that
\[
\mathbb{E}\left[ \exp\!\left( \sum_{i=1}^n \alpha_i \big(y_i - \mathbb{E}[y_i]\big) \right) \right] \le \exp\!\left( \frac{\sigma^2}{2} \sum_{i=1}^n \alpha_i^2 \right)
\]
for any $\alpha_1, \dots, \alpha_n \in \mathbb{R}$. This condition is satisfied, for instance, if
\[
y_i = \mathbb{E}[y_i] + \varepsilon_i
\]
for independent subgaussian zero-mean noise variables $\varepsilon_1, \dots, \varepsilon_n$. Let $\Sigma := \sum_{i=1}^n x_i x_i^\top / n$, which we assume is invertible without loss of generality. Let
\[
\beta := \Sigma^{-1} \left( \frac{1}{n} \sum_{i=1}^n x_i\, \mathbb{E}[y_i] \right)
\]
be the coefficient vector of minimum expected squared error, i.e., the minimizer of $\mathbb{E}\big[ n^{-1} \sum_{i=1}^n (x_i^\top \beta - y_i)^2 \big]$ over $\beta$. The ordinary least squares estimator is given by
\[
\hat\beta := \Sigma^{-1} \left( \frac{1}{n} \sum_{i=1}^n x_i y_i \right).
\]
The excess loss $R(\hat\beta)$ of $\hat\beta$ is the difference between the expected squared error of $\hat\beta$ and that of $\beta$:
\[
R(\hat\beta) := \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n \big(x_i^\top \hat\beta - y_i\big)^2 \right] - \mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^n \big(x_i^\top \beta - y_i\big)^2 \right].
\]
It is easy to see that
\[
R(\hat\beta) = \big\| \Sigma^{1/2} (\hat\beta - \beta) \big\|^2 = \left\| \Sigma^{-1/2}\, \frac{1}{n} \sum_{i=1}^n x_i \big(y_i - \mathbb{E}[y_i]\big) \right\|^2.
\]
By Theorem 2.1,
\[
\Pr\!\left[ R(\hat\beta) > \frac{\sigma^2 \big( d + 2\sqrt{dt} + 2t \big)}{n} \right] \le e^{-t}.
\]
Note that if $\mathbb{E}[(y_i - \mathbb{E}[y_i])^2] = \sigma^2$ for each $i$, then
\[
\mathbb{E}\big[R(\hat\beta)\big] = \frac{\sigma^2 d}{n};
\]
so the tail inequality above is essentially tight when the $y_i$ are independent Gaussian random variables.
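A short simulation in the same spirit illustrates the excess-loss bound; the design, the noise distribution (Rademacher, hence subgaussian with $\sigma = 1$), and the value of $t$ below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma, t, trials = 500, 10, 1.0, 3.0, 20_000

X = rng.standard_normal((n, d))                  # fixed design vectors x_1, ..., x_n (rows)
Sigma = X.T @ X / n
beta = rng.standard_normal(d)                    # E[y_i] = x_i' beta

threshold = sigma**2 * (d + 2 * np.sqrt(d * t) + 2 * t) / n

excess = np.empty(trials)
for k in range(trials):
    eps = sigma * rng.choice([-1.0, 1.0], size=n)        # Rademacher noise
    y = X @ beta + eps
    beta_hat = np.linalg.solve(Sigma, X.T @ y / n)       # ordinary least squares
    diff = beta_hat - beta
    excess[k] = diff @ Sigma @ diff                      # R(beta_hat)

print("empirical tail probability:", np.mean(excess > threshold))
print("bound exp(-t)             :", np.exp(-t))
```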
A Standard tail inequalities

A.1 Martingale tail inequalities
The following is a standard form of Bernstein's inequality stated for martingale difference sequences.

Lemma A.1 (Bernstein's inequality for martingales). Let $d_1, \dots, d_n$ be a martingale difference sequence with respect to random variables $x_1, \dots, x_n$ (i.e., $\mathbb{E}[d_i \mid x_1, \dots, x_{i-1}] = 0$ for all $i = 1, \dots, n$) such that $|d_i| \le b$ and $\sum_{i=1}^n \mathbb{E}[d_i^2 \mid x_1, \dots, x_{i-1}] \le v$. For all $t > 0$,
\[
\Pr\!\left[ \sum_{i=1}^n d_i > \sqrt{2vt} + (2/3)\, bt \right] \le e^{-t}.
\]
Proposition 1.2 is an immediate consequence of the following folklore results, together with Jensen's inequality. Lemma A.2 is a straightforward application of Bernstein's inequality to a Doob martingale, and Lemma A.3 is proved by a simple induction argument.

Lemma A.2. Let $u_1, \dots, u_n$ be random vectors such that $\sum_{i=1}^n \mathbb{E}[\|u_i\|^2 \mid u_1, \dots, u_{i-1}] \le v$ and $\|u_i\| \le b$ for all $i = 1, \dots, n$, almost surely. For all $t > 0$,
\[
\Pr\!\left[ \Big\| \sum_{i=1}^n u_i \Big\| - \mathbb{E}\Big\| \sum_{i=1}^n u_i \Big\| > \sqrt{8vt} + (4/3)\, bt \right] \le e^{-t}.
\]

Lemma A.3. If $u_1, \dots, u_n$ is a martingale difference vector sequence (cf. Proposition 1.2), then $\mathbb{E}\big[\big\| \sum_{i=1}^n u_i \big\|^2\big] = \sum_{i=1}^n \mathbb{E}\big[\|u_i\|^2\big]$.
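The key step behind Lemma A.3 is that the cross terms vanish: for $j < i$, $\mathbb{E}\langle u_i, u_j \rangle = \mathbb{E}\big\langle \mathbb{E}[u_i \mid u_1, \dots, u_{i-1}],\, u_j \big\rangle = 0$, so expanding the square gives
\[
\mathbb{E}\Big\| \sum_{i=1}^n u_i \Big\|^2 = \sum_{i=1}^n \mathbb{E}\big[\|u_i\|^2\big] + 2 \sum_{j < i} \mathbb{E}\langle u_i, u_j \rangle = \sum_{i=1}^n \mathbb{E}\big[\|u_i\|^2\big].
\]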
A.2 Gaussian quadratic forms and $\chi^2$ tail inequalities
It is well-known that if $z \sim N(0, 1)$ is a standard Gaussian random variable, then $z^2$ follows a $\chi^2$ distribution with one degree of freedom. The following inequality from [2] gives a bound on linear combinations of $\chi^2$ random variables.

Lemma A.4 ($\chi^2$ tail inequality; [2]). Let $q_1, \dots, q_n$ be independent $\chi^2$ random variables, each with one degree of freedom. For any vector $\gamma = (\gamma_1, \dots, \gamma_n) \in \mathbb{R}^n_+$ with non-negative entries, and any $t > 0$,
\[
\Pr\!\left[ \sum_{i=1}^n \gamma_i q_i > \|\gamma\|_1 + 2\sqrt{\|\gamma\|_2^2\, t} + 2\|\gamma\|_\infty t \right] \le e^{-t}.
\]
Proof of Proposition 1.1. Let $V \Lambda V^\top$ be an eigendecomposition of $A^\top A$, where $V$ is a matrix of orthonormal eigenvectors, and $\Lambda := \operatorname{diag}(\rho_1, \dots, \rho_n)$ is the diagonal matrix of corresponding eigenvalues $\rho_1, \dots, \rho_n$. By the rotational invariance of the distribution, $z := V^\top x$ is an isotropic multivariate Gaussian random vector with mean zero. Thus, $\|Ax\|^2 = z^\top \Lambda z = \rho_1 z_1^2 + \cdots + \rho_n z_n^2$, and the $z_i^2$ are independent $\chi^2$ random variables, each with one degree of freedom. The claim now follows from a tail bound for $\chi^2$ random variables (Lemma A.4).
References

[1] D. L. Hanson and F. T. Wright, A bound on tail probabilities for quadratic forms in independent random variables, The Annals of Mathematical Statistics 42 (1971), no. 3, 1079–1083. MR-0279864

[2] B. Laurent and P. Massart, Adaptive estimation of a quadratic functional by model selection, The Annals of Statistics 28 (2000), no. 5, 1302–1338. MR-1805785

[3] G. Pisier, The volume of convex bodies and Banach space geometry, Cambridge University Press, 1989. MR-1036275
Acknowledgments. We thank the anonymous reviewers for their helpful comments.