Biometrika (2011), pp. 1–18
© 2011 Biometrika Trust
Printed in Great Britain
Advance Access publication on ???
Square-root Lasso: Pivotal Recovery of Sparse Signals via Conic Programming

BY A. BELLONI
Duke University, Fuqua School of Business, Decision Science Group, 100 Fuqua Street, Durham, North Carolina 27708
abn5@duke.edu

V. CHERNOZHUKOV
MIT, Department of Economics and Operations Research Center, 52 Memorial Drive, Cambridge, Massachusetts 02142
vchern@mit.edu

AND L. WANG
MIT, Department of Mathematics, 77 Massachusetts Avenue, Cambridge, Massachusetts 02139
liewang@math.mit.edu
SUMMARY
We propose a pivotal method for estimating high-dimensional sparse linear regression models, where the overall number of regressors p is large, possibly much larger than n, but only s regressors are significant. The method is a modification of Lasso, called the square-root Lasso. The method neither relies on knowledge of the standard deviation σ of the regression errors nor does it need to pre-estimate σ. Despite not knowing σ, the square-root Lasso achieves near-oracle performance, attaining the prediction-norm convergence rate $\sigma\sqrt{(s/n)\log p}$, and thus matching the performance of the Lasso that knows σ. Moreover, we show that these results are valid for both Gaussian and non-Gaussian errors, under some mild moment restrictions, using moderate deviation theory. Finally, we formulate the square-root Lasso as a solution to a convex conic programming problem. This formulation allows us to implement the estimator using efficient algorithmic methods, such as interior point and first order methods specialized to conic programming problems of a very large size.

Some key words: conic programming; high-dimensional sparse model; unknown sigma.
1. INTRODUCTION
We consider the following classical linear regression model for outcome y_i given fixed p-dimensional regressors x_i:
$$y_i = x_i'\beta_0 + \sigma\epsilon_i, \quad i = 1, \ldots, n, \qquad (1.1)$$
with independent and identically distributed noise ε_i, i = 1, ..., n, having law F_0 such that
$$E_{F_0}[\epsilon_i] = 0 \quad\text{and}\quad E_{F_0}[\epsilon_i^2] = 1. \qquad (1.2)$$
The vector β_0 ∈ R^p is the unknown true parameter value, and σ > 0 is the unknown standard deviation. The regressors x_i are p-dimensional, x_i = (x_{ij}, j = 1, ..., p), where the dimension p is possibly much larger than the sample size n. Accordingly, the true parameter value β_0 lies in a very high-dimensional space R^p. However, the key assumption that makes the estimation possible is the sparsity of β_0, which says that only s < n regressors are significant, namely
$$T = \mathrm{supp}(\beta_0) \ \text{has}\ s\ \text{elements}. \qquad (1.3)$$
The identity T of the significant regressors is unknown. Throughout, without loss of generality, we normalize
$$\frac{1}{n}\sum_{i=1}^n x_{ij}^2 = 1 \quad\text{for all } j = 1, \ldots, p. \qquad (1.4)$$
In making asymptotic statements below we allow for s → ∞ and p → ∞ as n → ∞.
Clearly, the ordinary least squares estimator is not consistent for estimating β0 in the setting
with p > n. The Lasso estimator considered in Tibshirani (1996) can achieve consistency under
mild conditions by penalizing the sum of absolute parameter values:
$$\bar\beta \in \arg\min_{\beta\in\mathbb{R}^p} \hat Q(\beta) + \frac{\lambda}{n}\|\beta\|_1, \qquad (1.5)$$
where $\hat Q(\beta) := n^{-1}\sum_{i=1}^n (y_i - x_i'\beta)^2$ and $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$. The Lasso estimator is computationally attractive because it minimizes a structured convex function. Moreover, when errors are normal, F_0 = N(0, 1), and suitable design conditions hold, if one uses the penalty level
$$\lambda = \sigma\cdot c\cdot 2\sqrt n\cdot\Phi^{-1}(1-\alpha/2p) \qquad (1.6)$$
for some numeric constant c > 1, this estimator achieves the near-oracle performance, namely
$$\|\bar\beta - \beta_0\|_2 \lesssim \sigma\sqrt{\frac{s\log(2p/\alpha)}{n}}, \qquad (1.7)$$
with probability at least 1 − α. Remarkably, in (1.7) the overall number of regressors p shows up only through a logarithmic factor, so that if p is polynomial in n, the oracle rate is achieved up to a factor of $\sqrt{\log n}$. Recall that the oracle knows the identity T of the significant regressors, and so it can achieve the rate of $\sigma\sqrt{s/n}$. The result above was demonstrated by Bickel et al. (2009), and closely related results were given in Meinshausen & Yu (2009), Zhang & Huang (2008), Candes & Tao (2007), and van de Geer (2008). Bunea et al. (2007), Zhao & Yu (2006), Huang et al. (2008), Lounici et al. (2010), Wainwright (2009), and Zhang (2009) contain other fundamental results obtained for related problems; please see Bickel et al. (2009) for further references.
Despite these attractive features, the Lasso construction (1.5)-(1.6) relies on knowledge of the standard deviation of the noise σ. Estimation of σ is a non-trivial task when p is large, particularly when p ≫ n, and remains an outstanding practical and theoretical problem. The estimator we propose in this paper, the square-root Lasso, is a modification of Lasso which completely eliminates the need to know or pre-estimate σ. In addition, by using moderate deviation theory, we can dispense with the normality assumption F_0 = Φ under some conditions.
The square-root Lasso estimator of β_0 is defined as the solution to the following optimization problem:
$$\hat\beta \in \arg\min_{\beta\in\mathbb{R}^p} \sqrt{\hat Q(\beta)} + \frac{\lambda}{n}\|\beta\|_1, \qquad (1.8)$$
with the penalty level
$$\lambda = c\cdot\sqrt n\cdot\Phi^{-1}(1-\alpha/2p), \qquad (1.9)$$
for some numeric constant c > 1. We stress that the penalty level in (1.9) is independent of σ, in contrast to (1.6). Furthermore, under reasonable conditions, the proposed penalty level (1.9) will also be valid asymptotically without imposing normality F_0 = Φ, by virtue of moderate deviation theory (this side result is new and is of independent interest for other problems with ℓ1-penalization).
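As a minimal illustration of the definition (1.8) with the pivotal penalty (1.9) only (the dedicated conic-programming implementation is discussed in Section 4), the estimator can be written directly in a generic convex-optimization modeling package; the sketch below assumes the cvxpy and scipy packages are available.

```python
# Minimal sketch of the square-root Lasso (1.8) with the pivotal penalty (1.9).
# Illustration of the definition only; Section 4 gives the conic formulation
# used by dedicated interior point and first order solvers.
import numpy as np
import cvxpy as cp
from scipy.stats import norm

def sqrt_lasso(X, y, alpha=0.05, c=1.1):
    n, p = X.shape
    lam = c * np.sqrt(n) * norm.ppf(1 - alpha / (2 * p))      # penalty level (1.9)
    beta = cp.Variable(p)
    # sqrt(Q_hat(beta)) = ||y - X beta||_2 / sqrt(n)
    objective = cp.norm(y - X @ beta, 2) / np.sqrt(n) + (lam / n) * cp.norm1(beta)
    cp.Problem(cp.Minimize(objective)).solve()
    return beta.value
```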
We will show that the square-root Lasso estimator achieves the near-oracle rates of convergence under suitable design conditions and suitable conditions on F_0 that extend significantly beyond normality:
$$\|\hat\beta - \beta_0\|_2 \lesssim \sigma\sqrt{\frac{s\log(2p/\alpha)}{n}}, \qquad (1.10)$$
with probability approaching 1 − α. Thus, this estimator matches the near-oracle performance of Lasso, even though it does not know the noise level σ. This is the main result of this paper. It is important to emphasize here that this result is not a direct consequence of the analogous result for Lasso. Indeed, for a given value of the penalty level, the statistical structure of the square-root Lasso is different from that of Lasso, and so our proofs are also different.
Importantly, despite taking the square root of the least squares criterion function, the problem (1.8) retains global convexity, making the estimator computationally attractive. In fact, the
second main result of this paper is to formulate the square-root Lasso as a solution to a conic programming problem. Conic programming problems can be seen as linear programming problems with conic constraints, so they generalize canonical linear programming with non-negative orthant constraints. Conic programming inherits a rich set of theoretical properties and algorithmic methods from linear programming. In our case, the constraints take the form of a second-order cone, leading to a particular, highly tractable form of conic programming. In turn, this
formulation allows us to implement the estimator using efficient algorithmic methods, such as
interior point methods, which provide polynomial-time bounds on computational time (Nesterov
& Nemirovskii, 1993; Renegar, 2001), and modern first order methods that have been recently
extended to handle conic programming problems of a very large size (Nesterov, 2005, 2007; Lan
et al., 2011; Beck & Teboulle, 2009; Becker et al., 2010a).
NOTATION. In what follows, all true parameter values, such as β_0, σ, F_0, are implicitly indexed by the sample size n, but we omit the index in our notation whenever this does not cause confusion. The regressors x_i, i = 1, ..., n, are taken to be fixed throughout. This includes random design as a special case, where we condition on the realized values of the regressors. In making asymptotic statements, we assume that n → ∞ and p = p_n → ∞, and we also allow for s = s_n → ∞. The notation o(·) is defined with respect to n → ∞. We use the notation (a)_+ = max{a, 0}, a ∨ b = max{a, b} and a ∧ b = min{a, b}. The ℓ2-norm is denoted by ‖·‖_2, and the ℓ∞-norm by ‖·‖_∞. Given a vector δ ∈ R^p and a set of indices T ⊂ {1, ..., p}, we denote by δ_T the vector in which δ_{Tj} = δ_j if j ∈ T and δ_{Tj} = 0 if j ∉ T. We also use $\mathbb{E}_n[f] := \mathbb{E}_n[f(z)] := \sum_{i=1}^n f(z_i)/n$. We use a ≲ b to denote a ≤ cb for some constant c > 0 that does not depend on n.
2. THE CHOICE OF PENALTY LEVEL
2·1. The General Principle and Heuristics
The key quantity determining the choice of the penalty level for the square-root Lasso is the score, namely the gradient of $\sqrt{\hat Q}$ evaluated at the true parameter value β = β_0:
$$\tilde S := \nabla\sqrt{\hat Q(\beta_0)} = \frac{\nabla\hat Q(\beta_0)}{2\sqrt{\hat Q(\beta_0)}} = \frac{\mathbb{E}_n[x\sigma\epsilon]}{\sqrt{\mathbb{E}_n[\sigma^2\epsilon^2]}} = \frac{\mathbb{E}_n[x\epsilon]}{\sqrt{\mathbb{E}_n[\epsilon^2]}}.$$
The score S̃ does not depend on the unknown standard deviation σ or the unknown true parameter value β_0, and therefore the score is pivotal with respect to these parameters. Under the classical normality assumption, namely F_0 = Φ, the score is in fact completely pivotal, conditional on X. This means that in principle we know the distribution of S̃ in this case, or at least we can compute it by simulation.
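To illustrate the pivotality numerically, the following sketch (an illustration only, with an assumed normalized design) computes the maximal score from data generated with the same errors but different values of σ; the value is identical across σ because the scaling cancels between the numerator and the denominator.

```python
# Numerical illustration that Lambda = n ||S~||_inf, with
# S~ = E_n[x (y - x'beta_0)] / {E_n[(y - x'beta_0)^2]}^{1/2}, does not depend on sigma.
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 100, 500, 5
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))                  # normalization (1.4)
beta0 = np.r_[np.ones(s), np.zeros(p - s)]
eps = rng.standard_normal(n)

def max_score(X, y, beta0):
    r = y - X @ beta0                                # residuals at beta_0, equal to sigma * eps
    score = (X.T @ r / len(r)) / np.sqrt((r ** 2).mean())
    return len(r) * np.abs(score).max()

for sigma in (0.25, 1.0, 3.0):                       # same eps, different sigma
    y = X @ beta0 + sigma * eps
    print(sigma, max_score(X, y, beta0))             # the maximal score is unchanged
```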
The score S̃ summarizes the estimation noise in our problem, and we may set the penalty level λ/n to dominate the noise. For efficiency reasons, we set λ/n at the smallest level that dominates the estimation noise, namely we choose the smallest λ such that
$$\lambda \ge c\Lambda, \quad\text{for } \Lambda := n\|\tilde S\|_\infty, \qquad (2.11)$$
with a high probability, say 1 − α, where Λ is the maximal score scaled by n, and c > 1 is a theoretical constant of Bickel et al. (2009) to be stated later. We note that the principle of setting λ to dominate the score of the criterion function is motivated by Bickel et al. (2009)'s choice of penalty level for Lasso. In fact, this is a general principle that carries over to other convex problems, including ours, and that leads to the optimal, near-oracle, performance of other ℓ1-penalized estimators.
In the case of the square-root Lasso the maximal score is pivotal, so the penalty level in (2.11) must be pivotal too. In fact, we used the square-root transformation in the square-root Lasso formulation (1.8) precisely to guarantee this pivotality. In contrast, for Lasso, the score S = ∇Q̂(β_0) = 2σ·E_n[xε] is obviously non-pivotal, since it depends on σ. Thus, the penalty level for Lasso must be non-pivotal. These theoretical differences translate into obvious practical differences: in Lasso, we need to guess a conservative upper bound σ̄ on σ, or we need to use preliminary estimation of σ using a pilot Lasso, which itself uses a conservative upper bound σ̄ on σ. In the square-root Lasso, none of the above is needed. Finally, we note that the use of the pivotality principle for constructing the penalty level is also fruitful in other problems with pivotal scores, for example, median regression (Belloni & Chernozhukov, 2010).
The rule (2.11) is not practical yet, since we do not observe Λ directly. However, we can proceed as follows.
1. When we know the distribution of the errors exactly, e.g. F_0 = Φ, we propose to set λ as c times the (1 − α)-quantile of Λ given X and F_0. This choice of the penalty level precisely implements (2.11), and we can easily compute it by simulation.
2. When we do not know F_0 exactly, but instead know that F_0 is an element of some family F, we can rely on either finite-sample or asymptotic upper bounds on the quantiles of Λ given X. For example, as mentioned in the introduction, under some mild conditions on F, we can use $c\sqrt n\,\Phi^{-1}(1-\alpha/2p)$ as a valid asymptotic choice.
What follows below elaborates these approaches more formally.
Finally, before proceeding to formal constructions, it is useful to mention some simple heuristics for the principle (2.11). These heuristics arise from considering the simplest case, where
none of the regressors are significant, so that β_0 = 0. We want our estimator to perform at a near-oracle level in all cases, including this case, but here the oracle estimator β̃ knows that none of the regressors are significant, and so it sets β̃ = β_0 = 0. We also want β̂ = β_0 = 0 in this case, at least with a high probability 1 − α. From the subgradient optimality conditions of (1.8), in order for this to be true we must have
$$-\tilde S_j + \lambda/n \ge 0 \quad\text{and}\quad \tilde S_j + \lambda/n \ge 0, \quad\text{for all } 1 \le j \le p.$$
We can only guarantee this by setting the penalty level λ/n such that λ ≥ n max_{1≤j≤p}|S̃_j| = n‖S̃‖_∞ with probability at least 1 − α. This is precisely the rule (2.11) appearing above and, as it turns out, this rule delivers near-oracle performance more generally, when β_0 ≠ 0.
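A one-line justification of the displayed conditions: when β_0 = 0, the first term of the objective in (1.8) is differentiable at β = 0 with gradient S̃ = ∇√{Q̂(0)}, while the subdifferential of |β_j| at 0 is the interval [−1, 1]; hence 0 is a subgradient of the objective at β = 0 if and only if
$$0 \in \tilde S_j + \frac{\lambda}{n}[-1, 1] \ \text{ for all } j \quad\Longleftrightarrow\quad \|\tilde S\|_\infty \le \frac{\lambda}{n},$$
which is exactly the event λ ≥ n‖S̃‖_∞ used in (2.11).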
2·2. The Formal Choice of Penalty Level and Its Properties
In order to describe our choice of λ formally, define for 0 < α < 1
$$\Lambda(1-\alpha\mid X, F) := (1-\alpha)\text{-quantile of } \Lambda\mid X,\ F_0 = F, \qquad (2.12)$$
$$\Lambda(1-\alpha) := \sqrt n\,\Phi^{-1}(1-\alpha/2p) \le \sqrt{2n\log(2p/\alpha)}. \qquad (2.13)$$
We can compute the first quantity by simulation.
When F_0 = Φ, we can set λ according to either of the following options:
$$\text{exact:}\quad \lambda = c\Lambda(1-\alpha\mid X, \Phi); \qquad \text{asymptotic:}\quad \lambda = c\Lambda(1-\alpha) := c\sqrt n\,\Phi^{-1}(1-\alpha/2p). \qquad (2.14)$$
The parameter 1 − α is a confidence level which guarantees near-oracle performance with probability 1 − α; we recommend 1 − α = 95%. The constant c > 1 is a theoretical constant of Bickel et al. (2009), which is needed to guarantee a regularization event introduced in the next section; we recommend c = 1.1. The options in (2.14) are valid either in finite or large samples under the conditions stated below. They are also supported by the finite-sample experiments reported in Section 5. Moreover, we recommend using the exact option over the asymptotic option, because by construction the former is better tailored to the given sample size n and design matrix X. Nonetheless, the asymptotic option is easier to compute. In the next section, our theoretical results show that the options in (2.14) lead to near-oracle rates of convergence.
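As a computational sketch of the two options in (2.14) (an illustration only, assuming Gaussian errors and the numpy/scipy packages), the exact option simulates the (1 − α)-quantile of Λ given X, while the asymptotic option requires only the Gaussian quantile function:

```python
# Sketch of the penalty options in (2.14) for F_0 = Phi.
import numpy as np
from scipy.stats import norm

def penalty_exact(X, alpha=0.05, c=1.1, n_sim=10000, rng=None):
    # exact option: c times the simulated (1 - alpha)-quantile of Lambda | X, Phi
    rng = rng or np.random.default_rng(0)
    n = X.shape[0]
    draws = np.empty(n_sim)
    for b in range(n_sim):
        eps = rng.standard_normal(n)
        score = (X.T @ eps / n) / np.sqrt((eps ** 2).mean())
        draws[b] = n * np.abs(score).max()
    return c * np.quantile(draws, 1 - alpha)

def penalty_asymptotic(n, p, alpha=0.05, c=1.1):
    # asymptotic option: c * sqrt(n) * Phi^{-1}(1 - alpha / 2p)
    return c * np.sqrt(n) * norm.ppf(1 - alpha / (2 * p))
```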
For the asymptotic results, we shall impose the following growth condition:
CONDITION G. We have that log²(p/α) log(1/α) = o(n) and p/α → ∞ as n → ∞.
The following lemma shows that the exact and asymptotic options in (2.14) implement the regularization event λ ≥ cΛ in the Gaussian case with the exact or asymptotic probability 1 − α, respectively. The lemma also bounds the magnitude of the penalty level for the exact option, which will be useful for stating bounds on the estimation error later. We assume throughout the paper that 0 < α < 1 is bounded away from 1, but we allow α to approach 0 as n grows.
LEMMA 1 (PROPERTIES OF THE PENALTY LEVEL IN THE GAUSSIAN CASE). Suppose that F_0 = Φ. (i) The exact option in (2.14) implements λ ≥ cΛ with probability at least 1 − α.
(ii) Assume that p/α ≥ 8. For any 1 < ℓ < √{n/log(1/α)}, the asymptotic option in (2.14) implements λ ≥ cΛ with probability at least
$$1 - \alpha\,\tau, \quad\text{with } \tau = \Big(1 + \frac{1}{\log(p/\alpha)}\Big)\,\frac{\exp\{2\log(2p/\alpha)\,\ell\,(\log(1/\alpha)/n)^{1/2}\}}{1 - \ell\,(\log(1/\alpha)/n)^{1/2}} + \alpha^{\ell^2/4-1},$$
where under Condition G we have τ = 1 + o(1).
(iii) Assume that p/α ≥ 8 and n ≥ 4 log(2/α). Then we have
$$\Lambda(1-\alpha\mid X, \Phi) \le \nu\cdot\Lambda(1-\alpha) \le \nu\sqrt{2n\log(2p/\alpha)}, \quad\text{with } \nu = \frac{\{1 + 2/\log(2p/\alpha)\}^{1/2}}{1 - \{2\log(2/\alpha)/n\}^{1/2}},$$
where under Condition G we have ν = 1 + o(1).
In the non-normal case, we can set λ according to one of the following options:
$$\text{exact:}\ \lambda = c\Lambda(1-\alpha\mid X, F); \quad \text{semi-exact:}\ \lambda = c\max_{F\in\mathcal F}\Lambda(1-\alpha\mid X, F); \quad \text{asymptotic:}\ \lambda = c\Lambda(1-\alpha) := c\sqrt n\,\Phi^{-1}(1-\alpha/2p). \qquad (2.15)$$
We set the confidence level 1 − α and the constant c > 1 as before. The exact option is applicable when we know that F_0 = F, as for example in the previous normal case. The semi-exact option is applicable when we know that F_0 is a member of some family F, or that the family F will give a more conservative penalty level. We also assume that F in (2.15) is either finite or, more generally, that the maximum in (2.15) is well defined. For example, in applications where the regression errors ε_i are thought to have a potentially wide range of tail behavior, it is useful to set F = {t(4), t(8), t(∞)}, where t(k) denotes the Student distribution with k degrees of freedom. As stated previously, we can compute the quantiles Λ(1 − α | X, F) by simulation. Therefore, we can implement the exact option easily, and if F is not too large, we can also implement the semi-exact option easily. Finally, the asymptotic option is applicable when F_0 and the design X satisfy some moment conditions stated below. We recommend using either the exact or the semi-exact option, when applicable, since they are better tailored to the given design X and sample size n. However, the asymptotic option also has attractive features, namely it is trivial to compute, and it often provides a very similar, albeit slightly more conservative, penalty level as the exact or semi-exact options do; indeed, Fig. 1 provides a numerical illustration of this phenomenon.
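As a sketch of the semi-exact option for the family F = {t(4), t(8), t(∞)} (an illustration only; each member is standardized so that E[ε²] = 1, consistent with (1.2)), one simulates the quantile under each law and takes the maximum:

```python
# Sketch of the semi-exact option in (2.15) with F = {t(4), t(8), t(inf)},
# each standardized so that E[eps^2] = 1; t(inf) is the standard normal.
import numpy as np

def simulated_quantile(X, draw_eps, alpha=0.05, n_sim=10000, rng=None):
    rng = rng or np.random.default_rng(0)
    n = X.shape[0]
    draws = np.empty(n_sim)
    for b in range(n_sim):
        eps = draw_eps(rng, n)
        score = (X.T @ eps / n) / np.sqrt((eps ** 2).mean())
        draws[b] = n * np.abs(score).max()
    return np.quantile(draws, 1 - alpha)             # Lambda(1 - alpha | X, F)

family = {
    "t(4)":   lambda rng, n: rng.standard_t(4, n) / np.sqrt(4.0 / 2.0),
    "t(8)":   lambda rng, n: rng.standard_t(8, n) / np.sqrt(8.0 / 6.0),
    "t(inf)": lambda rng, n: rng.standard_normal(n),
}

def penalty_semi_exact(X, alpha=0.05, c=1.1):
    return c * max(simulated_quantile(X, f, alpha) for f in family.values())
```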
Fig. 1. The asymptotic bound Λ(0.95) and realized values of Λ(0.95 | X, F) sorted in increasing order, shown for 100 realizations of X = [x_1, ..., x_n]' and for three cases of error distribution, F = t(4), t(8), and t(∞). For each realization, we have generated X = [x_1, ..., x_n]' as independent and identically distributed draws of x ∼ N(0, Σ) with Σ_{jk} = (1/2)^{|j−k|}.
For the asymptotic results in the non-normal case, we shall impose the following moment
conditions.
CONDITION M. There exist finite constants q > 2 and n_0 > 0 such that the following holds: the law F_0 is an element of the family F such that sup_{n≥n_0, F∈F} E_F[|ε|^q] < ∞; the design X obeys sup_{n≥n_0, 1≤j≤p} E_n[|x_j|^q] < ∞.
We also have to restrict the growth of p relative to n, and we also assume that α is either bounded away from zero or approaches zero not too rapidly.
CONDITION R. We have that p ≤ α n^{η(q−2)/2}/2 for some constant 0 < η < 1, and α^{-1} = o(n^{[(q/2−1)∧(q/4)]∨(q/2−2)}/(log n)^{q/2}) as n → ∞, where q > 2 is given in Condition M.
The following lemma shows that the options in (2.15) implement the regularization event λ ≥ cΛ in the non-Gaussian case with exact or asymptotic probability 1 − α. In particular, Conditions R and M, through relations (A12) and (A14), imply that for any fixed c > 0,
$$pr(|\mathbb{E}_n[\epsilon^2] - 1| > c\mid F_0 = F) = o(\alpha). \qquad (2.16)$$
The lemma also bounds the magnitude of the penalty level λ for the exact and semi-exact options, which is useful for stating bounds on the estimation error later.
LEMMA 2 (PROPERTIES OF THE PENALTY LEVEL IN THE NON-GAUSSIAN CASE). (i) The exact option in (2.15) implements λ ≥ cΛ with probability at least 1 − α, if F_0 = F. (ii) The semi-exact option in (2.15) implements λ ≥ cΛ with probability at least 1 − α, if either F_0 ∈ F or Λ(1 − α | X, F) ≥ Λ(1 − α | X, F_0) for some F ∈ F.
Suppose further that Conditions M and R hold. Then, (iii) the asymptotic option in (2.15) implements λ ≥ cΛ with probability at least 1 − α + o(α), and (iv) the magnitude of the penalty level of the exact and semi-exact options in (2.15) satisfies the inequality
$$\max_{F\in\mathcal F}\Lambda(1-\alpha\mid X, F) \le \Lambda(1-\alpha)\{1+o(1)\} \le \sqrt{2n\log(2p/\alpha)}\,\{1+o(1)\}.$$
The lemma says that all of the asymptotic conclusions reached in Lemma 1 about the penalty
level in the Gaussian case continue to hold in the non-Gaussian case, albeit under more restricted
conditions on the growth of p relative to n. The growth condition depends on the number of
bounded moments q of regressors and the error terms – the higher q is, the more rapidly p can
grow with n. We emphasize that the condition given above is only one possible set of sufficient
conditions that guarantees the Gaussian-like conclusions of Lemma 2; in a companion work, we
are providing a more detailed, more exhaustive set of sufficient conditions.
3. FINITE-SAMPLE AND ASYMPTOTIC BOUNDS ON THE ESTIMATION ERROR OF SQUARE-ROOT LASSO
3·1. Conditions on the Gram Matrix
We shall state convergence rates for $\hat\delta := \hat\beta - \beta_0$ in the Euclidean norm $\|\delta\|_2 = \sqrt{\delta'\delta}$ and also in the prediction norm
$$\|\delta\|_{2,n} := \sqrt{\mathbb{E}_n[(x'\delta)^2]} = \sqrt{\delta'\,\mathbb{E}_n[xx']\,\delta}.$$
The latter norm directly depends on the Gram matrix E_n[xx']. The choice of penalty level described in Section 2 ensures the regularization event λ ≥ cΛ, with probability 1 − α or with
probability approaching 1 − α. This event will in turn imply another regularization event, namely that δ̂ belongs to the restricted set Δ_c̄, where
$$\Delta_{\bar c} = \{\delta\in\mathbb{R}^p : \|\delta_{T^c}\|_1 \le \bar c\,\|\delta_T\|_1,\ \delta\ne 0\}, \quad\text{for } \bar c := \frac{c+1}{c-1}.$$
Accordingly, we will state the bounds on the estimation errors ‖δ̂‖_{2,n} and ‖δ̂‖_2 in terms of the following restricted eigenvalues of the Gram matrix E_n[xx']:
$$\kappa_{\bar c} := \min_{\delta\in\Delta_{\bar c}}\frac{\sqrt s\,\|\delta\|_{2,n}}{\|\delta_T\|_1} \quad\text{and}\quad \tilde\kappa_{\bar c} := \min_{\delta\in\Delta_{\bar c}}\frac{\|\delta\|_{2,n}}{\|\delta\|_2}. \qquad (3.17)$$
These restricted eigenvalues can depend on n and T , but we suppress the dependence in our
notations.
In making simplified asymptotic statements, such as the ones appearing in the introduction,
we will often invoke the following condition on the restricted eigenvalues:
CONDITION RE. There exist finite constants n_0 > 0 and κ > 0 such that the restricted eigenvalues obey κ_c̄ ≥ κ and κ̃_c̄ ≥ κ for all n ≥ n_0.
The restricted eigenvalues (3.17) are simply variants of the restricted eigenvalues introduced in Bickel et al. (2009). Even though the minimal eigenvalue of the Gram matrix E_n[xx'] is zero whenever p > n, Bickel et al. (2009) show that its restricted eigenvalues can in fact be bounded away from zero. Bickel et al. (2009) and others provide a detailed set of sufficient primitive conditions for this, which are sufficiently general to cover many fixed and random designs of interest, and which allow for reasonably general (though not arbitrary) forms of correlation between regressors. This makes conditions on restricted eigenvalues useful for many applications. Consequently, we take the restricted eigenvalues as primitive quantities and Condition RE as a primitive condition. Note also that the restricted eigenvalues are tightly tailored to the ℓ1-penalized estimation problem. Indeed, κ_c̄ is the modulus of continuity between the estimation norm and the penalty-related term computed over the restricted set, containing the deviation of the estimator from the true value; and κ̃_c̄ is the modulus of continuity between the estimation norm and the Euclidean norm over this set.
It is useful to recall at least one simple sufficient condition for bounded restricted eigenvalues. If, for m = s log n, the m-sparse eigenvalues of the Gram matrix E_n[xx'] are bounded away from zero and from above for all n ≥ n_0, that is,
$$0 < k \le \min_{\|\delta_{T^c}\|_0\le m,\ \delta\ne 0}\frac{\|\delta\|_{2,n}^2}{\|\delta\|_2^2} \le \max_{\|\delta_{T^c}\|_0\le m,\ \delta\ne 0}\frac{\|\delta\|_{2,n}^2}{\|\delta\|_2^2} \le k' < \infty, \qquad (3.18)$$
for some positive finite constants k, k', and n_0, then Condition RE holds once n is sufficiently large. In words, (3.18) only requires the eigenvalues of certain "small" m × m submatrices of the large p × p Gram matrix to be bounded from above and below. The sufficiency of (3.18) for Condition RE follows from Bickel et al. (2009), and many sufficient conditions for (3.18) are provided by Bickel et al. (2009), Zhang & Huang (2008), and Meinshausen & Yu (2009).
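Computing the restricted eigenvalues exactly is in general a hard non-convex problem; the following sketch (a crude numerical diagnostic only, not part of the theory) samples directions from Δ_c̄ and therefore only produces an upper approximation of κ_c̄ for a given design and support.

```python
# Crude Monte Carlo diagnostic for kappa_cbar in (3.17): minimizing over randomly
# sampled directions in the restricted set gives an UPPER bound on the true minimum,
# so this is a sanity check rather than a certified lower bound.
import numpy as np

def kappa_upper_bound(X, T, cbar, n_draws=20000, rng=None):
    rng = rng or np.random.default_rng(0)
    n, p = X.shape
    T = np.asarray(T)
    Tc = np.setdiff1d(np.arange(p), T)
    best = np.inf
    for _ in range(n_draws):
        delta = np.zeros(p)
        delta[T] = rng.standard_normal(len(T))
        # place delta_{T^c} inside the cone ||delta_{T^c}||_1 <= cbar * ||delta_T||_1
        u = rng.standard_normal(len(Tc))
        delta[Tc] = u / np.abs(u).sum() * cbar * np.abs(delta[T]).sum() * rng.uniform()
        pred_norm = np.sqrt(((X @ delta) ** 2).mean())       # ||delta||_{2,n}
        best = min(best, np.sqrt(len(T)) * pred_norm / np.abs(delta[T]).sum())
    return best
```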
3·2. Finite-Sample and Asymptotic Bounds on Estimation Error
We now present the main result of this paper. Recall that we are not assuming that the noise is
(sub) Gaussian, nor that σ is known.
THEOREM 1 (FINITE SAMPLE BOUNDS ON ESTIMATION ERROR). Consider the model described in (1.1)-(1.4). Let c > 1, c̄ = (c + 1)/(c − 1), and suppose that λ obeys the growth
restriction λ√s ≤ nκ_c̄ ρ for some ρ < 1. If λ ≥ cΛ, then δ̂ = β̂ − β_0 obeys
$$\|\hat\delta\|_{2,n} \le \frac{2(1+1/c)}{1-\rho^2}\,\sigma\sqrt{\mathbb{E}_n[\epsilon^2]}\,\frac{\lambda\sqrt s}{n\kappa_{\bar c}}.$$
In particular, if λ ≥ cΛ with probability at least 1 − α, and E_n[ε²] ≤ ω² with probability at least 1 − γ, then with probability at least 1 − α − γ, δ̂ = β̂ − β_0 obeys
$$\|\hat\delta\|_{2,n} \le \frac{2(1+1/c)}{1-\rho^2}\,\sigma\omega\,\frac{\lambda\sqrt s}{n\kappa_{\bar c}} \quad\text{and}\quad \|\hat\delta\|_2 \le \frac{2(1+1/c)}{1-\rho^2}\,\sigma\omega\,\frac{\lambda\sqrt s}{n\kappa_{\bar c}\,\tilde\kappa_{\bar c}}.$$
This result provides a finite-sample bound for δ̂ that is similar to that for the Lasso estimator that knows σ, and it leads to the same rates of convergence as in the case of Lasso. It is important to note some differences. First, for a given value of the penalty level, the statistical structure of the square-root Lasso is different from that of Lasso, and so our proof of Theorem 1 is also different. Second, in the proof we have to invoke the additional growth restriction λ√s < nκ_c̄, which is not present in the Lasso analysis that treats σ as known. We may think of this restriction as the price of not knowing σ in our framework. However, this additional condition is very mild and holds under typical conditions if (s/n) log(p/α) → 0, as the corollaries below indicate, and this condition is independent of the noise level σ. In comparison, for the Lasso estimator, if we treat σ as unknown and attempt to estimate σ consistently using a pilot Lasso, which uses an upper bound σ̄ ≥ σ instead of σ, a similar growth condition σ̄(s/n) log(p/α) → 0 would have to be imposed, but this condition depends on the noise level σ via the inequality σ̄ ≥ σ. It is also clear that this growth condition is more restrictive than our growth condition (s/n) log(p/α) → 0 when σ̄ is large.
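To see why this restriction is mild under the penalty levels of Section 2, note that by (2.13) and Lemmas 1-2 the exact and asymptotic options satisfy λ ≤ cν√{2n log(2p/α)} with ν = 1 + o(1), so that
$$\frac{\lambda\sqrt s}{n\kappa_{\bar c}} \le \frac{c\,\nu}{\kappa_{\bar c}}\sqrt{\frac{2s\log(2p/\alpha)}{n}} \to 0 \quad\text{whenever}\quad \frac sn\log(p/\alpha)\to 0$$
and κ_c̄ is bounded away from zero; hence the growth restriction λ√s ≤ nκ_c̄ ρ holds for any fixed ρ ∈ (0, 1) once n is large enough.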
Theorem 1 immediately implies various finite-sample bounds by combining the finite-sample bounds on λ specified in the previous section with the finite-sample bounds on the quantiles of the sample average E_n[ε²]. The resulting corollaries also imply various asymptotic bounds when combined with Lemma 1, Lemma 2, and the application of a law of large numbers for arrays.
We proceed to note the following immediate corollary of the result.
COROLLARY 1 (FINITE SAMPLE BOUNDS IN THE GAUSSIAN CASE). Consider the model described in (1.1)-(1.4). Suppose further that F_0 = Φ, λ is chosen according to the exact option in (2.14), p/α ≥ 8 and n ≥ 4 log(2/α). Let c > 1, c̄ = (c + 1)/(c − 1), ν = {1 + 2/log(2p/α)}^{1/2}/[1 − {2 log(2/α)/n}^{1/2}], and, for any ℓ such that 1 < ℓ < √{n/log(1/α)}, set ω² = 1 + ℓ{log(1/α)/n}^{1/2} + ℓ² log(1/α)/(2n) and γ = α^{ℓ²/4}. Then, if cν√{2s log(2p/α)} ≤ √n κ_c̄ ρ for some ρ < 1, with probability at least 1 − α − γ we have that δ̂ = β̂ − β_0 obeys
$$\|\hat\delta\|_{2,n} \le \frac{2(1+c)\,\nu\omega}{1-\rho^2}\,\frac{\sigma\sqrt{2s\log(2p/\alpha)}}{\sqrt n\,\kappa_{\bar c}} \quad\text{and}\quad \|\hat\delta\|_2 \le \frac{2(1+c)\,\nu\omega}{1-\rho^2}\,\frac{\sigma\sqrt{2s\log(2p/\alpha)}}{\sqrt n\,\kappa_{\bar c}\,\tilde\kappa_{\bar c}}.$$
The corollary above derives finite-sample bounds for the classical Gaussian case. It is instructive to state an asymptotic version of this corollary.
COROLLARY 2 (ASYMPTOTIC BOUNDS IN THE GAUSSIAN CASE). Consider the model described in (1.1)-(1.4) and suppose that F_0 = Φ, Conditions RE and G hold, and (s/n) log(p/α) → 0. Let λ be specified according to either the exact or the asymptotic option in (2.14). Fix any µ > 1. Then, for n large enough, δ̂ = β̂ − β_0 obeys
$$\|\hat\delta\|_{2,n} \le 2(1+c)\mu\,\frac{\sigma}{\kappa}\sqrt{\frac{2s\log(2p/\alpha)}{n}} \quad\text{and}\quad \|\hat\delta\|_2 \le 2(1+c)\mu\,\frac{\sigma}{\kappa^2}\sqrt{\frac{2s\log(2p/\alpha)}{n}},$$
with probability 1 − α(1 + o(1)).
A similar corollary applies in the non-Gaussian case.
COROLLARY 3 (ASYMPTOTIC BOUNDS IN THE NON-GAUSSIAN CASE). Consider the model described in (1.1)-(1.4). Suppose that Conditions RE, M, and R hold, and (s/n) log(p/α) → 0. Let λ be specified according to the asymptotic, exact, or semi-exact option in (2.15). Fix any µ > 1. Then, for n large enough, δ̂ = β̂ − β_0 obeys
$$\|\hat\delta\|_{2,n} \le 2(1+c)\mu\,\frac{\sigma}{\kappa}\sqrt{\frac{2s\log(2p/\alpha)}{n}} \quad\text{and}\quad \|\hat\delta\|_2 \le 2(1+c)\mu\,\frac{\sigma}{\kappa^2}\sqrt{\frac{2s\log(2p/\alpha)}{n}},$$
with probability at least 1 − α(1 + o(1)).
As in Lemma 2, in order to achieve Gaussian-like asymptotic conclusions in the non-Gaussian
case, we impose stronger conditions on the growth of p relative to n. The result exploits that
these conditions imply that the average square of the noise concentrates around its expectation
as stated in (2.16).
4. COMPUTATIONAL PROPERTIES OF SQUARE-ROOT LASSO
The second main result of this paper is to formulate the square-root Lasso as a conic programming problem, with constraints given by a second-order cone (also informally known as the ice-cream cone). This formulation allows us to implement the estimator using efficient algorithmic methods, such as interior point methods, which provide polynomial-time bounds on computational time (Nesterov & Nemirovskii, 1993; Renegar, 2001), and modern first order methods that have been recently extended to handle conic programming problems of a very large size (Nesterov, 2005, 2007; Lan et al., 2011; Beck & Teboulle, 2009; Becker et al., 2010a). Before describing the details, it is useful to recall that a conic programming problem takes the form min c'u subject to Au = b and u ∈ C, where C is a cone. This formulation naturally nests classical linear programming as the case in which the cone is simply the non-negative orthant. Conic programming has a tractable dual form max b'w subject to A'w + s = c and s ∈ C*, where C* := {s : s'u ≥ 0 for all u ∈ C} is the dual cone of C. A particularly important, highly tractable class of problems arises when C is the ice-cream cone, C = Q^{n+1} = {(v, t) ∈ R^n × R : t ≥ ‖v‖}, which is self-dual, C = C*.
The square-root Lasso is precisely a conic programming problem with second-order conic constraints. Indeed, we can reformulate (1.8) as follows:
$$\min_{t, v, \beta^+, \beta^-} \frac{t}{\sqrt n} + \frac{\lambda}{n}\sum_{j=1}^p(\beta_j^+ + \beta_j^-) \quad\text{subject to}\quad v_i = y_i - x_i'\beta^+ + x_i'\beta^-,\ i = 1, \ldots, n;\quad (v, t)\in Q^{n+1},\ \beta^+\in\mathbb{R}^p_+,\ \beta^-\in\mathbb{R}^p_+. \qquad (4.19)$$
Furthermore, we can show that this problem admits the following dual problem:
$$\max_{a\in\mathbb{R}^n} \frac{1}{n}\sum_{i=1}^n y_i a_i \quad\text{subject to}\quad \Big|\sum_{i=1}^n x_{ij}a_i/n\Big| \le \lambda/n,\ j = 1, \ldots, p;\quad \|a\| \le \sqrt n. \qquad (4.20)$$
From a statistical perspective, this dual problem maximizes the sample correlation of the score variable a_i with the outcome variable y_i subject to the constraint that the score a_i is approximately uncorrelated with the covariates x_{ij}. The optimal scores â_i are in fact equal to the
residuals y_i − x_i'β̂, i = 1, ..., n, up to a renormalization factor. The optimal score plays a key role in deriving sparsity bounds on β̂.
We say that strong duality holds between a primal and its dual problem if their optimal values
are the same (i.e. no duality gap). This is typically an assumption for interior point methods and
first order methods to work. We formalize the preceding discussion in the following theorem.
THEOREM 2. The square-root Lasso problem (1.8) is equivalent to the conic programming problem (4.19), which admits the dual problem (4.20), for which strong duality holds. The solution β̂ to the problem (1.8), the solution (β̂^+, β̂^-, v̂ = (v̂_1, ..., v̂_n)) to (4.19), and the solution â to (4.20) are related via β̂ = β̂^+ − β̂^-, v̂_i = y_i − x_i'β̂, i = 1, ..., n, and â = √n v̂/‖v̂‖.
The strong duality demonstrated in the theorem is in fact required for the use of both the interior point and the first order methods adapted to conic programs. We have implemented these methods, as well as a coordinatewise method, for the square-root Lasso and made the code available through the authors' webpages. Interestingly, our implementations of the square-root Lasso run at least as fast as the corresponding implementations of these methods for Lasso, for instance, the Sdpt3 implementation of the interior point method for Lasso (Toh et al., 1999, 2010) and the Tfocs implementation of first order methods (Becker et al., 2010b). The running times are recorded in an extended working version of this paper.
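As a minimal illustration of the conic formulation (4.19) only (not the interior point, first order, or coordinatewise implementations referenced above), the problem can be expressed with an explicit second-order cone constraint in a generic modeling package such as cvxpy:

```python
# Sketch of the conic program (4.19): objective t/sqrt(n) + (lam/n) sum(beta+ + beta-),
# with (v, t) in the second-order cone Q^{n+1} and beta+, beta- >= 0.
import numpy as np
import cvxpy as cp

def sqrt_lasso_conic(X, y, lam):
    n, p = X.shape
    t = cp.Variable()
    v = cp.Variable(n)
    beta_plus = cp.Variable(p, nonneg=True)
    beta_minus = cp.Variable(p, nonneg=True)
    constraints = [
        v == y - X @ beta_plus + X @ beta_minus,     # v_i = y_i - x_i' beta
        cp.SOC(t, v),                                # (v, t) in Q^{n+1}, i.e. t >= ||v||_2
    ]
    objective = t / np.sqrt(n) + (lam / n) * cp.sum(beta_plus + beta_minus)
    cp.Problem(cp.Minimize(objective), constraints).solve()
    beta_hat = beta_plus.value - beta_minus.value
    # by Theorem 2, the optimal dual scores are a_hat = sqrt(n) v_hat / ||v_hat||
    a_hat = np.sqrt(n) * v.value / np.linalg.norm(v.value)
    return beta_hat, a_hat
```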
5. EMPIRICAL PERFORMANCE OF SQUARE-ROOT LASSO RELATIVE TO LASSO
In this section we use Monte Carlo experiments to assess the finite-sample performance of the following estimators: (a) the Infeasible Lasso, which knows σ, a quantity that would be unknown outside the experiments; (b) the post Infeasible Lasso, which applies ordinary least squares to the model selected by the Infeasible Lasso; (c) the square-root Lasso, which does not know σ; and (d) the post square-root Lasso, which applies ordinary least squares to the model selected by the square-root Lasso.
We set the penalty level for the Infeasible Lasso and the square-root Lasso according to the asymptotic options (1.6) and (1.9), respectively, with 1 − α = 0.95 and c = 1.1. We have also performed experiments where we set the penalty levels according to the exact option. The results are similar to the case with the penalty level set according to the asymptotic option, so we only report the results for the latter.
We use the linear regression model stated in the introduction as a data-generating process, with either standard normal or t(4) errors:
$$\text{(a)}\ \epsilon_i \sim N(0, 1) \quad\text{or}\quad \text{(b)}\ \epsilon_i \sim t(4)/\sqrt 2,$$
so that E[ε_i²] = 1 in either case. We set the true parameter value as
$$\beta_0 = (1, 1, 1, 1, 1, 0, \ldots, 0)', \qquad (5.21)$$
and we vary the standard deviation σ between 0.25 and 3. The number of regressors is p = 500, the sample size is n = 100, and we used 1000 simulations for each design. We generate the regressors as x_i ∼ N(0, Σ) with the Toeplitz correlation matrix Σ_{jk} = (1/2)^{|j−k|}. We use as a benchmark the performance of the oracle estimator, which knows the true support of β_0, a quantity that is unknown outside the experiment.
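For concreteness, the data-generating process of this section can be sketched as follows (an illustration of the design only, assuming numpy):

```python
# Sketch of the Monte Carlo design: n = 100, p = 500, Toeplitz Sigma_jk = 0.5^{|j-k|},
# beta_0 as in (5.21), and errors N(0,1) or t(4)/sqrt(2), both with unit variance.
import numpy as np

def simulate_design(n=100, p=500, s=5, sigma=1.0, errors="normal", rng=None):
    rng = rng or np.random.default_rng(0)
    Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta0 = np.r_[np.ones(s), np.zeros(p - s)]               # (5.21)
    if errors == "normal":
        eps = rng.standard_normal(n)
    else:                                                    # t(4), scaled to unit variance
        eps = rng.standard_t(4, n) / np.sqrt(2.0)
    y = X @ beta0 + sigma * eps
    return X, y, beta0
```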
Fig. 2. The average relative empirical risk of the estimators with respect to the oracle estimator, which knows the true support, as a function of the standard deviation of the noise σ. Left panel: ε_i ∼ N(0, 1); right panel: ε_i ∼ t(4)/√2. Estimators shown: Infeasible Lasso, square-root Lasso, post Infeasible Lasso, and post square-root Lasso.
Fig. 3. Let T = supp(β_0) denote the true unknown support and T̂ = supp(β̂) the support selected by the square-root Lasso estimator or the Infeasible Lasso. As a function of the noise level σ, we display the average number of regressors missed from the true support, E|T \ T̂|, and the average number of regressors selected outside the true support, E|T̂ \ T|.
We present the results of the computational experiments for designs (a) and (b) in Figs. 2 and 3. For each model, the figures show the following quantities as a function of the standard deviation of the noise σ for each estimator β̃:
1. Figure 2: the relative average empirical risk with respect to the oracle estimator β*,
$$E\Big[\big\{\mathbb{E}_n[(x_i'(\tilde\beta-\beta_0))^2]\big\}^{1/2}\Big] \Big/ E\Big[\big\{\mathbb{E}_n[(x_i'(\beta^*-\beta_0))^2]\big\}^{1/2}\Big];$$
2. Figure 3: the average number of regressors missed from the true model and the average number of regressors selected outside the true model, respectively,
$$E[|\mathrm{supp}(\beta_0)\setminus\mathrm{supp}(\tilde\beta)|] \quad\text{and}\quad E[|\mathrm{supp}(\tilde\beta)\setminus\mathrm{supp}(\beta_0)|].$$
Figure 2, left panel, shows the empirical risk for the normal case. We see that, for a wide range of values of the standard deviation of the noise σ, the Infeasible Lasso and the square-root Lasso perform similarly in terms of empirical risk, although the Infeasible Lasso somewhat outperforms the square-root Lasso. At the same time, the post square-root Lasso somewhat outperforms the post Infeasible Lasso, and in fact outperforms all of the estimators considered. Overall, we are pleased with the performance of the square-root Lasso and the post square-root Lasso relative to the Lasso and the post-Lasso. Despite not knowing σ, the square-root Lasso performs comparably to the standard Lasso that knows σ. These results are in close agreement with our theoretical results, which state that the upper bounds on empirical risk for the square-root Lasso asymptotically approach the analogous bounds for the Infeasible Lasso.
Figure 3 provides additional insight into the performance of the estimators. We note that the finite-sample differences in empirical risk between the Infeasible Lasso and the square-root Lasso arise primarily because the square-root Lasso has a larger bias than the Infeasible Lasso. This bias arises because the square-root Lasso uses an effectively heavier penalty, induced by Q̂(β̂) in place of σ²; in these experiments, as we changed the standard deviation of the noise, the average values of {Q̂(β̂)}^{1/2}/σ were between 1.18 and 1.22. Figure 3 shows that this heavier penalty translates into the square-root Lasso achieving better sparsity than the Infeasible Lasso, which in turn translates into the post square-root Lasso outperforming the post Infeasible Lasso in terms of empirical risk, as we saw in Fig. 2.
Finally, Figure 2, right panel, shows the empirical risk for the t(4) case. We see that the results
for the Gaussian case carry over to the t(4) case, with nearly undetectable changes. In fact, the
performance of Infeasible Lasso and Square-root Lasso under t(4) errors nearly coincides with
their performance under Gaussian errors. This is exactly what is predicted by our theoretical
results, using moderate deviation theory.
ACKNOWLEDGEMENT
We would like to thank Isaiah Andrews, Denis Chetverikov, JB Doyle, Ilze Kalnina, and James
MacKinnon for very helpful comments.
APPENDIX 1
Proofs of Theorems 1 and 2
Proof of Theorem 1. Step 1. In this step we show that δ̂ = β̂ − β_0 ∈ Δ_c̄ under the prescribed penalty level. By the definition of β̂,
$$\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \le \frac{\lambda}{n}\|\beta_0\|_1 - \frac{\lambda}{n}\|\hat\beta\|_1 \le \frac{\lambda}{n}(\|\hat\delta_T\|_1 - \|\hat\delta_{T^c}\|_1), \qquad (A1)$$
where the last inequality holds because
$$\|\beta_0\|_1 - \|\hat\beta\|_1 = \|\beta_{0T}\|_1 - \|\hat\beta_T\|_1 - \|\hat\beta_{T^c}\|_1 \le \|\hat\delta_T\|_1 - \|\hat\delta_{T^c}\|_1. \qquad (A2)$$
Also, if λ ≥ cn‖S̃‖_∞, then
$$\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)} \ge \tilde S'\hat\delta \ge -\|\tilde S\|_\infty\|\hat\delta\|_1 \ge -\frac{\lambda}{cn}(\|\hat\delta_T\|_1 + \|\hat\delta_{T^c}\|_1), \qquad (A3)$$
where the first inequality holds by convexity of $\sqrt{\hat Q}$. Combining (A1) with (A3) we obtain
$$-\frac{\lambda}{cn}(\|\hat\delta_T\|_1 + \|\hat\delta_{T^c}\|_1) \le \frac{\lambda}{n}(\|\hat\delta_T\|_1 - \|\hat\delta_{T^c}\|_1), \qquad (A4)$$
that is,
$$\|\hat\delta_{T^c}\|_1 \le \frac{c+1}{c-1}\,\|\hat\delta_T\|_1 = \bar c\,\|\hat\delta_T\|_1, \quad\text{or}\quad \hat\delta\in\Delta_{\bar c}. \qquad (A5)$$
Step 2. In this step we derive bounds on the estimation error. We shall use the following relations:
$$\hat Q(\hat\beta) - \hat Q(\beta_0) = \|\hat\delta\|_{2,n}^2 - 2\mathbb{E}_n[\sigma\epsilon x'\hat\delta], \qquad (A6)$$
$$\hat Q(\hat\beta) - \hat Q(\beta_0) = \Big(\sqrt{\hat Q(\hat\beta)} - \sqrt{\hat Q(\beta_0)}\Big)\Big(\sqrt{\hat Q(\hat\beta)} + \sqrt{\hat Q(\beta_0)}\Big), \qquad (A7)$$
$$2|\mathbb{E}_n[\sigma\epsilon x'\hat\delta]| \le 2\sqrt{\hat Q(\beta_0)}\,\|\tilde S\|_\infty\|\hat\delta\|_1, \qquad (A8)$$
$$\|\hat\delta_T\|_1 \le \frac{\sqrt s\,\|\hat\delta\|_{2,n}}{\kappa_{\bar c}} \quad\text{for } \hat\delta\in\Delta_{\bar c}, \qquad (A9)$$
where (A8) holds by the Hölder inequality and the last inequality holds by the definition of κ_c̄.
Using (A1) and (A6)-(A9) we obtain
$$\|\hat\delta\|_{2,n}^2 \le 2\sqrt{\hat Q(\beta_0)}\,\|\tilde S\|_\infty\|\hat\delta\|_1 + \Big(\sqrt{\hat Q(\hat\beta)} + \sqrt{\hat Q(\beta_0)}\Big)\frac{\lambda}{n}\Big(\frac{\sqrt s\,\|\hat\delta\|_{2,n}}{\kappa_{\bar c}} - \|\hat\delta_{T^c}\|_1\Big). \qquad (A10)$$
Also using (A1) and (A9) we obtain
$$\sqrt{\hat Q(\hat\beta)} \le \sqrt{\hat Q(\beta_0)} + \frac{\lambda}{n}\Big(\frac{\sqrt s\,\|\hat\delta\|_{2,n}}{\kappa_{\bar c}} - \|\hat\delta_{T^c}\|_1\Big). \qquad (A11)$$
Combining inequalities (A11) and (A10), we obtain
$$\|\hat\delta\|_{2,n}^2 \le 2\sqrt{\hat Q(\beta_0)}\,\|\tilde S\|_\infty\|\hat\delta\|_1 + 2\sqrt{\hat Q(\beta_0)}\,\frac{\lambda\sqrt s}{n\kappa_{\bar c}}\|\hat\delta\|_{2,n} + \Big(\frac{\lambda\sqrt s}{n\kappa_{\bar c}}\Big)^2\|\hat\delta\|_{2,n}^2 - 2\sqrt{\hat Q(\beta_0)}\,\frac{\lambda}{n}\|\hat\delta_{T^c}\|_1.$$
Since λ ≥ cn‖S̃‖_∞, we obtain
$$\|\hat\delta\|_{2,n}^2 \le 2\sqrt{\hat Q(\beta_0)}\,\|\tilde S\|_\infty\|\hat\delta_T\|_1 + 2\sqrt{\hat Q(\beta_0)}\,\frac{\lambda\sqrt s}{n\kappa_{\bar c}}\|\hat\delta\|_{2,n} + \Big(\frac{\lambda\sqrt s}{n\kappa_{\bar c}}\Big)^2\|\hat\delta\|_{2,n}^2,$$
and then using (A9) we obtain
$$\Big[1 - \Big(\frac{\lambda\sqrt s}{n\kappa_{\bar c}}\Big)^2\Big]\|\hat\delta\|_{2,n}^2 \le 2\Big(1 + \frac1c\Big)\sqrt{\hat Q(\beta_0)}\,\frac{\lambda\sqrt s}{n\kappa_{\bar c}}\|\hat\delta\|_{2,n}.$$
Provided that λ√s/(nκ_c̄) ≤ ρ < 1, solving the inequality above we obtain the bound stated in the theorem. □
Proof of Theorem 2. The equivalence of the square-root Lasso problem (1.8) and the conic programming problem (4.19) follows immediately from the definitions. To establish the duality, for e = (1, ..., 1)', we can write (4.19) in matrix form as
$$\min_{t, v, \beta^+, \beta^-} \frac{t}{\sqrt n} + \frac{\lambda}{n}e'\beta^+ + \frac{\lambda}{n}e'\beta^- \quad\text{subject to}\quad v + X\beta^+ - X\beta^- = Y, \quad (v, t)\in Q^{n+1},\ \beta^+\in\mathbb{R}^p_+,\ \beta^-\in\mathbb{R}^p_+.$$
By the conic duality theorem, this has the dual
$$\max_{a, s_t, s_v, s^+, s^-} Y'a \quad\text{subject to}\quad s_t = 1/\sqrt n,\ a + s_v = 0,\ X'a + s^+ = \lambda e/n,\ -X'a + s^- = \lambda e/n,\ (s_v, s_t)\in Q^{n+1},\ s^+\in\mathbb{R}^p_+,\ s^-\in\mathbb{R}^p_+.$$
The constraints X'a + s^+ = λe/n and −X'a + s^- = λe/n lead to ‖X'a‖_∞ ≤ λ/n. The conic constraint (s_v, s_t) ∈ Q^{n+1} leads to 1/√n = s_t ≥ ‖s_v‖ = ‖a‖. By rescaling the variable a by n we obtain the stated dual problem.
Since the primal problem is strongly feasible, strong duality holds by Theorem 3.2.6 of Renegar (2001). Thus, by strong duality, we have
$$\frac1n\sum_{i=1}^n y_i\hat a_i = \frac{\|Y - X\hat\beta\|}{\sqrt n} + \frac{\lambda}{n}\sum_{j=1}^p|\hat\beta_j|.$$
Since $\frac1n\sum_{i=1}^n x_{ij}\hat a_i\hat\beta_j = \lambda|\hat\beta_j|/n$ for every j = 1, ..., p, we have
$$\frac1n\sum_{i=1}^n y_i\hat a_i = \frac{\|Y - X\hat\beta\|}{\sqrt n} + \frac1n\sum_{j=1}^p\sum_{i=1}^n x_{ij}\hat a_i\hat\beta_j = \frac{\|Y - X\hat\beta\|}{\sqrt n} + \frac1n\sum_{i=1}^n \hat a_i\sum_{j=1}^p x_{ij}\hat\beta_j.$$
Rearranging the terms, we have $\frac1n\sum_{i=1}^n (y_i - x_i'\hat\beta)\hat a_i = \|Y - X\hat\beta\|/\sqrt n$. Since ‖â‖ ≤ √n, the equality can only hold for $\hat a = \sqrt n(Y - X\hat\beta)/\|Y - X\hat\beta\| = (Y - X\hat\beta)/\sqrt{\hat Q(\hat\beta)}$. □
APPENDIX 2
Proofs of Lemmas 1 and 2
Proof of Lemma 1. Statement (i) holds by definition. To show statement (ii), we define t_n = Φ^{-1}(1 − α/2p) and 0 < r_n = ℓ√{log(1/α)/n} < 1. It is known that log(p/α) < t_n² < 2 log(2p/α) when p/α ≥ 8. Then
$$pr(c\Lambda > c\sqrt n\,t_n\mid X, F_0=\Phi) \le p\max_{1\le j\le p} pr\{|n^{1/2}\mathbb{E}_n[x_j\epsilon]| > t_n(1-r_n)\mid F_0=\Phi\} + pr\{(\mathbb{E}_n[\epsilon^2])^{1/2} < 1-r_n\mid F_0=\Phi\} \le 2p\,\bar\Phi\{t_n(1-r_n)\} + pr\{(\mathbb{E}_n[\epsilon^2])^{1/2} < 1-r_n\mid F_0=\Phi\}.$$
By the Chernoff tail bound for χ²(n), Lemma 1 in Laurent & Massart (2000), we know that
$$pr\{(\mathbb{E}_n[\epsilon^2])^{1/2} < 1-r_n\mid F_0=\Phi\} \le \exp(-nr_n^2/4),$$
and
$$2p\,\bar\Phi\{t_n(1-r_n)\} \le 2p\,\frac{\phi\{t_n(1-r_n)\}}{t_n(1-r_n)} = 2p\,\frac{\phi(t_n)}{t_n}\cdot\frac{\exp(t_n^2 r_n - \tfrac12 t_n^2 r_n^2)}{1-r_n} \le 2p\,\bar\Phi(t_n)\Big(1+\frac{1}{t_n^2}\Big)\frac{\exp(t_n^2 r_n - \tfrac12 t_n^2 r_n^2)}{1-r_n} \le \alpha\Big(1+\frac{1}{t_n^2}\Big)\frac{\exp(t_n^2 r_n)}{1-r_n} \le \alpha\Big(1+\frac{1}{\log(p/\alpha)}\Big)\frac{\exp\{2\log(2p/\alpha)r_n\}}{1-r_n}.$$
For statement (iii), it is sufficient to show that pr(cΛ > cν√n t_n | X, F_0 = Φ) ≤ α. It can be seen that there exists a ν′ such that ν′ ≥ √{1 + 2/log(2p/α)} and 1 − ν′/ν ≥ √{2 log(2/α)/n}, so that
$$pr(c\Lambda > c\nu\sqrt n\,t_n\mid X, F_0=\Phi) \le p\,pr\{|n^{1/2}\mathbb{E}_n[x_j\epsilon]| > \nu' t_n\mid F_0=\Phi\} + pr\{(\mathbb{E}_n[\epsilon^2])^{1/2} < \nu'/\nu\mid F_0=\Phi\} = 2p\,\bar\Phi(\nu' t_n) + pr\{(\mathbb{E}_n[\epsilon^2])^{1/2} < \nu'/\nu\mid F_0=\Phi\}.$$
Similarly, by the Chernoff tail bound for χ²(n), Lemma 1 in Laurent & Massart (2000),
$$pr\{(\mathbb{E}_n[\epsilon^2])^{1/2} < \nu'/\nu\mid F_0=\Phi\} \le \exp\{-n(1-\nu'/\nu)^2/4\} \le \alpha/2.$$
And
$$2p\,\bar\Phi(\nu' t_n) \le 2p\,\frac{\phi(\nu' t_n)}{\nu' t_n} = 2p\,\frac{\phi(t_n)}{t_n}\cdot\frac{\exp\{-\tfrac12 t_n^2(\nu'^2-1)\}}{\nu'} \le 2p\,\bar\Phi(t_n)\Big(1+\frac{1}{t_n^2}\Big)\frac{\exp\{-\tfrac12 t_n^2(\nu'^2-1)\}}{\nu'} = \alpha\Big(1+\frac{1}{t_n^2}\Big)\frac{\exp\{-\tfrac12 t_n^2(\nu'^2-1)\}}{\nu'} \le \alpha\Big(1+\frac{1}{\log(p/\alpha)}\Big)\frac{\exp\{-\log(2p/\alpha)(\nu'^2-1)\}}{\nu'} \le 2\alpha\exp\{-\log(2p/\alpha)(\nu'^2-1)\} < \alpha/2.$$
Putting them together, we know that pr(cΛ > cν√n t_n | X, F_0 = Φ) ≤ α.
Finally, the asymptotic result follows directly from the finite-sample bounds, noting that p/α → ∞ and that under the growth condition we can choose ℓ → ∞ so that ℓ log(p/α){log(1/α)}^{1/2} = o(n^{1/2}). □
Proof of Lemma 2. Statements (i) and (ii) hold by definition. To show statement (iii), consider first the case of 2 < q ≤ 8; we define t_n = Φ^{-1}(1 − α/2p) and r_n = α^{-2/q} n^{-[(1-2/q)∧1/2]} ℓ_n, for some ℓ_n which grows to infinity, but so slowly that the condition stated below is satisfied. Then for any F = F_n and X = X_n that obey Condition M:
$$\begin{aligned} pr(c\Lambda > c\sqrt n\,t_n\mid X, F_0=F) &\overset{(1)}{\le} p\max_{1\le j\le p} pr\{|n^{1/2}\mathbb{E}_n[x_j\epsilon]| > t_n(1-r_n)\mid F_0=F\} + pr\{(\mathbb{E}_n[\epsilon^2])^{1/2} < 1-r_n\mid F_0=F\}\\ &\overset{(2)}{\le} p\max_{1\le j\le p} pr\{|n^{1/2}\mathbb{E}_n[x_j\epsilon]| > t_n(1-r_n)\mid F_0=F\} + o(\alpha)\\ &\overset{(3)}{=} 2p\,\bar\Phi\{t_n(1-r_n)\}\{1+o(1)\} + o(\alpha) \overset{(4)}{=} 2p\,\frac{\phi\{t_n(1-r_n)\}}{t_n(1-r_n)}\{1+o(1)\} + o(\alpha)\\ &= 2p\,\frac{\phi(t_n)}{t_n}\,\frac{\exp(t_n^2 r_n - t_n^2 r_n^2/2)}{1-r_n}\{1+o(1)\} + o(\alpha)\\ &\overset{(5)}{=} 2p\,\frac{\phi(t_n)}{t_n}\{1+o(1)\} + o(\alpha) \overset{(6)}{=} 2p\,\bar\Phi(t_n)\{1+o(1)\} + o(\alpha) = \alpha\{1+o(1)\},\end{aligned}$$
where (1) holds by the union bound; (2) holds by the application of either Rosenthal's inequality (Rosenthal, 1970) for the case of q > 4 or the von Bahr-Esseen inequality (von Bahr & Esseen, 1965) for the case of 2 < q ≤ 4, which give
$$pr\{(\mathbb{E}_n[\epsilon^2])^{1/2} < 1-r_n\mid F_0=F\} \le pr(|\mathbb{E}_n[\epsilon^2]-1| > r_n\mid F_0=F) \lesssim \alpha\,\ell_n^{-q/2} = o(\alpha); \qquad (A12)$$
(4) and (6) hold by φ(t)/t ∼ Φ̄(t) as t → ∞; and (5) holds by t_n² r_n = o(1), which holds if log(p/α) α^{-2/q} n^{-[(1-2/q)∧1/2]} ℓ_n = o(1). Under our condition log(p/α) = O(log n), this condition is satisfied for some slowly growing ℓ_n if
$$\alpha^{-1} = o\{n^{(q/2-1)\wedge q/4}/(\log n)^{q/2}\}. \qquad (A13)$$
To verify relation (3), note that by Condition M and Slastnikov's theorem on moderate deviations, Slastnikov (1982) and Rubin & Sethuraman (1965), we have that, uniformly in 0 ≤ |t| ≤ k√{log n} for some k² < q − 2, uniformly in 1 ≤ j ≤ p, and for any F = F_n ∈ F,
$$\frac{pr(n^{1/2}|\mathbb{E}_n[x_j\epsilon]| > t\mid F_0=F)}{2\bar\Phi(t)} \to 1,$$
so relation (3) holds for t = t_n(1 − r_n) ≤ √{2 log(2p/α)} ≤ √{η(q-2)\log n} for η < 1 by assumption. We apply Slastnikov's theorem to $n^{-1/2}|\sum_{i=1}^n z_{i,n}|$ for z_{i,n} = x_{ij}ε_i, where we allow the design X, the law F, and the index j to be (implicitly) indexed by n. Slastnikov's theorem then applies provided sup_{n, j≤p} E_n E_{F_n}|z_n|^q = sup_{n, j≤p} E_n[|x_j|^q] E_{F_n}[|ε|^q] < ∞, which is implied by our Condition M, and where we used the condition that the design is fixed, so that ε_i are independent of x_{ij}. Thus, we
obtained the moderate deviation result uniformly in 1 ≤ j ≤ p and for any sequence of distributions F = F_n and designs X = X_n that obey our Condition M.
Next suppose that q > 8. Then the same argument applies, except that now relation (2) could also be established by using Slastnikov's theorem on moderate deviations. In this case redefine r_n = k√{log n/n}; then, for some constant k² < (q/2) − 2, we have
$$pr\{(\mathbb{E}_n[\epsilon^2])^{1/2} < 1-r_n\mid F_0=F\} \le pr(|\mathbb{E}_n[\epsilon^2]-1| > r_n\mid F_0=F) \lesssim n^{-k^2}, \qquad (A14)$$
so relation (2) holds if
$$1/\alpha = o(n^{k^2}). \qquad (A15)$$
This applies whenever q > 4, and it results in weaker requirements on α if q > 8. Relation (5) then follows if t_n² r_n = o(1), which is easily satisfied for the new r_n, and the result follows.
Combining the conditions in (A13) and (A15) to give the weakest restrictions on the growth of α^{-1}, we obtain the growth conditions stated in the lemma.
To show statement (iv) of the lemma, we note that it suffices to show that, for any ν′ > 1, pr(cΛ > cν′√n t_n | X, F_0 = F) = o(α), which follows analogously to the proof of statement (iii); we omit the details for brevity. □
REFERENCES
BECK, A. & TEBOULLE, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences 2, 183–202.
BECKER, S., CANDÈS, E. & GRANT, M. (2010a). Templates for convex cone problems with applications to sparse signal recovery. ArXiv.
BECKER, S., CANDÈS, E. & GRANT, M. (2010b). TFOCS v1.0 user guide (beta).
BELLONI, A. & CHERNOZHUKOV, V. (2010). ℓ1-penalized quantile regression for high dimensional sparse models. Accepted at the Annals of Statistics.
BICKEL, P. J., RITOV, Y. & TSYBAKOV, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics 37, 1705–1732.
BUNEA, F., TSYBAKOV, A. B. & WEGKAMP, M. H. (2007). Aggregation for Gaussian regression. The Annals of Statistics 35, 1674–1697.
CANDES, E. & TAO, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist. 35, 2313–2351.
HUANG, J., HOROWITZ, J. L. & MA, S. (2008). Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics 36, 587–613.
LAN, G., LU, Z. & MONTEIRO, R. D. C. (2011). Primal-dual first-order methods with O(1/ε) iteration-complexity for cone programming. Mathematical Programming, 1–29.
LAURENT, B. & MASSART, P. (2000). Adaptive estimation of a quadratic functional by model selection. Ann. Statist. 28, 1302–1338.
LOUNICI, K., PONTIL, M., TSYBAKOV, A. B. & VAN DE GEER, S. (2010). Taking advantage of sparsity in multi-task learning. arXiv:0903.1468v1 [stat.ML].
MEINSHAUSEN, N. & YU, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics 37, 2246–2270.
NESTEROV, Y. (2005). Smooth minimization of non-smooth functions. Mathematical Programming 103, 127–152.
NESTEROV, Y. (2007). Dual extrapolation and its applications to solving variational inequalities and related problems. Mathematical Programming 109, 319–344.
NESTEROV, Y. & NEMIROVSKII, A. (1993). Interior-Point Polynomial Algorithms in Convex Programming. Philadelphia: Society for Industrial and Applied Mathematics (SIAM).
RENEGAR, J. (2001). A Mathematical View of Interior-Point Methods in Convex Optimization. Philadelphia: Society for Industrial and Applied Mathematics (SIAM).
ROSENTHAL, H. P. (1970). On the subspaces of L_p (p > 2) spanned by sequences of independent random variables. Israel J. Math. 8, 273–303.
RUBIN, H. & SETHURAMAN, J. (1965). Probabilities of moderate deviations. Sankhyā Ser. A 27, 325–346.
SLASTNIKOV, A. D. (1982). Large deviations for sums of nonidentically distributed random variables. Teor. Veroyatnost. i Primenen. 27, 36–46.
TIBSHIRANI, R. (1996). Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B 58, 267–288.
TOH, K. C., TODD, M. J. & TUTUNCU, R. H. (1999). SDPT3 — a Matlab software package for semidefinite programming. Optimization Methods and Software 11, 545–581.
TOH, K. C., TODD, M. J. & TUTUNCU, R. H. (2010). On the implementation and usage of SDPT3 — a Matlab software package for semidefinite-quadratic-linear programming, version 4.0. In Handbook of Semidefinite, Cone and Polynomial Optimization: Theory, Algorithms, Software and Applications, edited by M. Anjos and J. B. Lasserre.
VAN DE GEER, S. A. (2008). High-dimensional generalized linear models and the lasso. Annals of Statistics 36, 614–645.
VON BAHR, B. & ESSEEN, C.-G. (1965). Inequalities for the rth absolute moment of a sum of random variables, 1 ≤ r ≤ 2. Ann. Math. Statist. 36, 299–303.
WAINWRIGHT, M. (2009). Sharp thresholds for noisy and high-dimensional recovery of sparsity using ℓ1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory 55, 2183–2202.
ZHANG, C.-H. & HUANG, J. (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. Ann. Statist. 36, 1567–1594.
ZHANG, T. (2009). On the consistency of feature selection using greedy least squares regression. Journal of Machine Learning Research 10, 555–568.
ZHAO, P. & YU, B. (2006). On model selection consistency of lasso. J. Machine Learning Research 7, 2541–2567.
[Received September 2010. Revised January 2011]