The Risk of Machine Learning
(preliminary and incomplete)
Alberto Abadie, Harvard University
Maximilian Kasy, Harvard University
April 26, 2016
Abstract
Applied economists often simultaneously estimate a large number of parameters
of interest, for instance treatment effects for many treatment values (teachers,
locations...), treatment effects for many treated groups, and prediction models
with many regressors. To avoid overfitting in such settings, machine learning estimators (such as Ridge, Lasso, Pretesting) combine (i) regularized estimation,
and (ii) data-driven choice of regularization parameters.
We aim to provide guidance for applied researchers to choose between such estimators, assuming they are interested in precise estimates of many parameters.
We first characterize the risk (mean squared error) of regularized estimators
and analytically compare their performance. We then show that data-driven
choices of regularization parameters (using Stein’s unbiased risk estimate or
cross-validation) yield estimators with risk uniformly close to the optimal choice.
We apply our results to data from the literature: on the causal effect of locations on intergenerational mobility, on illegal trading by arms companies with
conflict countries under an embargo, and on series regression estimates of the
effect of education on earnings. In these applications the relative performance
of alternative estimators corresponds to that suggested by our results.
Alberto Abadie, John F. Kennedy School of Government, Harvard University, alberto_abadie@harvard.edu.
Maximilian Kasy, Department of Economics, Harvard University, maximiliankasy@fas.harvard.edu. We
thank Gary Chamberlain and seminar participants at Harvard for helpful comments and discussions.
1 Introduction
Applied economists often face the problem of estimating high-dimensional parameters. Examples include (i) the estimation of causal (or predictive) effects for a large number of units
such as neighborhoods or cities, teachers, workers and firms, or judges, (ii) estimation of
the causal effect of a given treatment for a large number of subgroups defined by observable covariates, and (iii) prediction problems with a large number of predictive covariates
or transformations of covariates. The machine learning literature provides a host of estimation methods for such high-dimensional problems, including Ridge, Lasso, Pretest (or
post-Lasso), and many others. The applied researcher faces the question of which of these
procedures to pick in a given situation. We aim to provide some guidance, based on a
characterization and comparison of the risk-properties (in particular, mean squared error)
of such procedures.
A general concern that motivates machine learning procedures is to avoid “generalization error,” or over-fitting. The choice of estimation procedures involves a variance-bias
trade-off, which depends on the data generating process. More flexible procedures tend to
have a lower bias, but at the price of a larger variance. In order to address this concern,
most machine learning procedures for “supervised learning” (known to economists as regression analysis) involve two key features, (i) regularized estimation, and (ii) data-driven
choice of regularization parameters. These features are also central to more well-established
methods of non-parametric estimation.
In this paper, we consider the canonical problem of estimating many means, which,
after some transformations, covers all the applications mentioned above. We consider
componentwise estimators of the form $\hat\mu_i = m(X_i, \lambda)$, where $\lambda$ is a regularization parameter and $X_i$ is observed. Our article is structured according to the mentioned two features of
machine learning procedures. We first focus on feature (i) and study the risk properties
of regularized estimators with fixed or oracle-optimal regularization parameters. We show
that there is an (infeasible) optimal regularized estimator, for any given data generating
process. This estimator has the form of a posterior mean, using a prior distribution for
the parameters of interest which is equal to the (unknown) empirical distribution across
components. The optimal regularized estimator is useful to characterize the risk properties
of machine learning estimators. It turns out that, in our setting, the risk function of
any regularized estimator can be expressed as a function of the distance between that
regularized estimator and the optimal one.
We then turn to a family of parametric models for the distribution of parameters of
interest. These models assume a certain fraction of true zeros, corresponding to the notion
of sparsity, while the remaining parameters are normally distributed around some grand
mean. For these parametric models we provide analytic risk functions. These analytic
risk functions allow for an intuitive discussion of the relative performance of alternative
estimators: (1) When there is no point-mass of true zeros, or only a small fraction of zeros,
Ridge tends to perform better than either Lasso or the Pretest estimator. (2) When there
is a sizable share of true zeros, the ranking of estimators depends on the distribution of
the remaining parameters. (a) If the non-zero parameters are smoothly distributed in a
vicinity of zero, Ridge still performs best. (b) If the non-zero parameters are very dispersed,
Lasso tends to do comparatively well. (c) If the non-zero parameters are well-separated
from zero, Pretest estimation tends to do well. We see these conclusions confirmed in the
empirical applications discussed below.
The next part of the article turns to feature (ii) of machine learning estimators and
studies the data-driven choice of regularization parameters. Such estimators choose regularization parameters by minimizing a criterion function which estimates risk. Ideally, such
estimators would have a risk function that is uniformly close to the risk function of the
infeasible estimator using an oracle-optimal regularization parameter. We show this can be
achieved under fairly mild conditions whenever the dimension of the problem under consideration is large. The key condition is that the criterion function is uniformly consistent
for the true risk function. We further provide fairly weak conditions for Stein’s unbiased
risk estimate (SURE) and for cross-validation (CV) to satisfy this requirement.
Our results are then illustrated in the context of three applications taken from the
empirical economics literature. The first application uses data from Chetty and Hendren
(2015) to study the effects of locations on intergenerational earnings mobility of children.
The second application uses data from the event-study analysis in DellaVigna and La Ferrara (2010), who study whether the stock prices of weapon-producing companies react to changes
in the intensity of conflicts in countries under an arms trade embargo. The third application
considers series regressions of earnings on education, using CPS data.
The presence of many neighborhoods in the first application, many weapon producing
companies in the second one, and many series regression terms in the third one makes
these estimation problems high-dimensional. As suggested by intuition, we find that Ridge
performs very well for the location effects application, reflecting the smooth and approximately normal distribution of true location effects. Pretest estimation performs well for
the arms trade application, reflecting the fact that a significant fraction of arms firms do
not violate arms embargoes.
The rest of this article is structured as follows. Section 2 introduces our setup, the
canonical problem of estimating a vector of means under quadratic loss. Section 2.1 discusses a series of examples from empirical economics which are covered by our setup.
Having introduced the setup, Section 2.2 discusses it in the context of the machine learning
literature, and of the older literature on estimation of normal means. Section 3 provides
characterizations of the risk function of regularized estimators in our setting. We derive a
general characterization in Section 3.1 and 3.2, and analytic formulas in a spike-and-normal
model in Section 3.3. These characterizations allow for a comparison of the mean squared
error of alternative procedures, and yield recommendations for the choice of an estimator.
Section 4 turns to data-driven choices of regularization parameters. We show uniform risk
consistency results for Stein’s unbiased risk estimate and for cross-validation. Section 5
discusses Monte Carlo simulations and several empirical applications. Section 6 concludes.
The appendix contains all proofs.
2 Setup
Throughout this paper, we consider the following setting. We observe a realization of an
n-vector of random variables, X = (X1 , . . . , Xn )0 , where the components of X are mutually
independent with finite mean $\mu_i$ and finite variance $\sigma_i^2$, for $i = 1, \ldots, n$. Our goal is to
estimate µ1 , . . . , µn .
In many applications, the Xi arise as preliminary least squares estimates of the coefficients of interest µi . Consider for instance a randomized control trial where randomization
of treatment assignment is carried out separately for n non-overlapping subgroups. Within
each subgroup, the difference in the sample averages between treated and control units,
Xi , has mean equal to the average treatment effect for that group in the population, µi .
Further examples are discussed in Section 2.1 below.
We restrict our attention to component-wise estimators of $\mu_i$,
\[
\hat\mu_i = m(X_i, \lambda), \tag{1}
\]
where $m: \mathbb{R} \times [0,\infty] \mapsto \mathbb{R}$ defines an estimator of $\mu_i$ as a function of $X_i$ and a non-negative regularization parameter, $\lambda$. The parameter $\lambda$ is common across the components $i$ but might depend on the vector $X$; we study data-driven choices $\hat\lambda$ in Section 4 below, focusing in particular on Stein's unbiased risk estimate (SURE) and cross-validation (CV).
Popular estimators of this component-wise form are Ridge, Lasso, and Pretest estimators. These are defined as follows:
\[
m_R(x, \lambda) = \operatorname*{argmin}_{m \in \mathbb{R}} \big[ (x - m)^2 + \lambda m^2 \big] = \frac{1}{1+\lambda}\, x, \tag{Ridge}
\]
\[
m_L(x, \lambda) = \operatorname*{argmin}_{m \in \mathbb{R}} \big[ (x - m)^2 + 2\lambda |m| \big] = \mathbf{1}(x < -\lambda)(x + \lambda) + \mathbf{1}(x > \lambda)(x - \lambda), \tag{Lasso}
\]
\[
m_{PT}(x, \lambda) = \operatorname*{argmin}_{m \in \mathbb{R}} \big[ (x - m)^2 + \lambda^2\, \mathbf{1}(m \neq 0) \big] = \mathbf{1}(|x| > \lambda)\, x, \tag{Pretest}
\]
where $\mathbf{1}(A)$ denotes the indicator function, which equals 1 if $A$ holds and 0 otherwise.
Figure 1 plots mR (x, λ), mL (x, λ), and mP T (x, λ) as functions of x. As we discuss in detail
below, depending on the data generating process (distribution of the means µi ), one or
another of these estimating functions m might be able to better approximate the optimal
estimating function m∗P .
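For concreteness, these closed forms translate directly into code. The following sketch (in Python, with vectorized inputs) is illustrative only; the function names are ours.

```python
import numpy as np

def m_ridge(x, lam):
    # Ridge: proportional shrinkage of x toward zero.
    return x / (1.0 + lam)

def m_lasso(x, lam):
    # Lasso: soft thresholding at lambda.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def m_pretest(x, lam):
    # Pretest: hard thresholding at lambda.
    return np.where(np.abs(x) > lam, x, 0.0)
```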
Let $\mu = (\mu_1, \ldots, \mu_n)'$ and $\hat\mu = (\hat\mu_1, \ldots, \hat\mu_n)'$, where for simplicity we leave the dependence of $\hat\mu$ on $\lambda$ implicit in our notation. Let $P_1, \ldots, P_n$ be the distributions of $X_1, \ldots, X_n$, and let $P = (P_1, \ldots, P_n)$. We evaluate estimates based on the squared error loss function, or compound loss,
\[
L_n(X, m(\cdot, \lambda), P) = \frac{1}{n} \sum_{i=1}^n \big( m(X_i, \lambda) - \mu_i \big)^2, \tag{2}
\]
where Ln depends on P via µ. We consider an estimator to be good if it has small expected
loss. There are different ways of taking this expectation, however, resulting in different risk
functions, and the distinction between them is conceptually quite important.
There is, first, the componentwise risk function, which fixes $P_i$ and considers the expected squared error of $\hat\mu_i$ as an estimator of $\mu_i$,
\[
R(m(\cdot, \lambda), P_i) = E\big[(m(X_i, \lambda) - \mu_i)^2 \,\big|\, P_i\big]. \tag{3}
\]
There is, second, the risk function of the compound decision problem. Compound risk $R_n$ averages componentwise risk over the empirical distribution of $P_i$ across the components $i = 1, \ldots, n$. Compound risk is given by the expectation of compound loss $L_n$ given $P$,
\[
R_n(m(\cdot, \lambda), P) = E\big[L_n(X, m(\cdot, \lambda), P) \,\big|\, P\big] = \frac{1}{n} \sum_{i=1}^n R(m(\cdot, \lambda), P_i). \tag{4}
\]
There is, third, the empirical Bayes risk. Empirical Bayes risk considers the $P_i$ to be themselves draws from some population distribution $\pi$, and averages componentwise risk over this population distribution,
\[
\bar R(m(\cdot, \lambda), \pi) = E_\pi\big[R(m(\cdot, \lambda), P_i)\big] = \int R(m(\cdot, \lambda), P_i)\, d\pi(P_i). \tag{5}
\]
Notice the similarity between compound risk and empirical Bayes risk; they differ only by
replacing an empirical (sample) distribution by a population distribution. For large n, the
difference between the two vanishes, as we will explore in Section 4.
Throughout we use $R_n(m(\cdot, \lambda), P)$ to denote the risk function of the estimator $m(\cdot, \lambda)$ with fixed (non-random) $\lambda$, and similarly for $\bar R(m(\cdot, \lambda), \pi)$. In contrast, $R_n(m(\cdot, \hat\lambda), P)$ is the risk function taking into account the randomness of $\hat\lambda$, where the latter is chosen in a data-dependent manner, and similarly for $\bar R(m(\cdot, \hat\lambda), \pi)$.
For a given $P$, that is, for a given empirical distribution of the $P_i$, we define the "oracle" selector of the regularization parameter as the value of $\lambda$ that minimizes compound risk,
\[
\lambda^*(P) = \operatorname*{argmin}_{\lambda \in [0,\infty]} R_n(m(\cdot, \lambda), P). \tag{6}
\]
We use $\lambda^*_R(P)$, $\lambda^*_L(P)$, and $\lambda^*_{PT}(P)$ to denote the oracle selectors for Ridge, Lasso, and Pretest, respectively. Analogously, for a given $\pi$, that is, for a given population distribution of $P_i$, we define the empirical Bayes oracle selector as
\[
\bar\lambda^*(\pi) = \operatorname*{argmin}_{\lambda \in [0,\infty]} \bar R(m(\cdot, \lambda), \pi), \tag{7}
\]
with $\bar\lambda^*_R(\pi)$, $\bar\lambda^*_L(\pi)$, and $\bar\lambda^*_{PT}(\pi)$ for Ridge, Lasso, and Pretest, respectively. In Section 3 we characterize compound and empirical Bayes risk for fixed $\lambda$ and for the oracle-optimal $\lambda$. In Section 4 we then show that data-driven choices $\hat\lambda$ can be, under certain conditions, almost as good as the oracle-optimal choice, in a sense to be made precise.
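As an illustration of the oracle selector in equation (6), the following minimal sketch (Python; the names are ours) approximates compound risk by compound loss on simulated data and minimizes over a grid of regularization parameters. It is infeasible in practice because it uses the true means.

```python
import numpy as np

def compound_loss(mu, x, m, lam):
    # Compound loss L_n of eq. (2): average squared error across components.
    return np.mean((m(x, lam) - mu) ** 2)

def oracle_lambda(mu, x, m, grid):
    # Grid approximation to the oracle selector of eq. (6); here loss stands in
    # for risk, which would require averaging over repeated draws of X.
    losses = [compound_loss(mu, x, m, lam) for lam in grid]
    return grid[int(np.argmin(losses))]

# Example with Ridge, assuming unit-variance noise:
rng = np.random.default_rng(0)
mu = rng.normal(0.0, 2.0, size=1000)
x = mu + rng.standard_normal(1000)
grid = np.linspace(0.0, 5.0, 101)
print(oracle_lambda(mu, x, lambda z, lam: z / (1 + lam), grid))
```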
2.1 Examples
There are many examples in the empirical economics literature to which our setup applies.
In most applications in economics the Xi arise as preliminary least squares estimates for the
µi , and are assumed to be unbiased but noisy. As discussed in the introduction, examples
include (i) studies estimating causal or predictive effects for a large number of units such as
neighborhoods, cities, teachers, workers, firms, or judges, (ii) studies estimating the causal
effect of a given treatment for a large number of subgroups defined by some observable
covariate, and (iii) prediction problems with a large number of predictive covariates or
transformations of covariates.
The first category includes Chetty and Hendren (2015) who estimate the effect of geographic locations on intergenerational mobility for a large number of locations. They use
differences between the outcomes of siblings whose parents move during their childhood in
order to identify these effects. Estimation of many causal effects also arises in the teacher
value-added literature, when the objects of interest are individual teachers' effects; see for instance Chetty, Friedman, and Rockoff (2014). In empirical finance and related fields,
event studies are often used to estimate reactions of stock market prices to newly available
information. DellaVigna and La Ferrara (2010), for instance, are interested in violation of
arms embargoes by weapons manufacturers. They estimate the effects of changes in the
intensity of armed conflicts in countries under arms trade embargoes on the stock market prices of arms-manufacturing companies. In labor economics, estimation of firm and
worker effects in studies of wage inequality has been considered since Abowd, Kramarz, and
Margolis (1999); a recent application of this approach can be found in Card, Heining, and
Kline (2012). A last example of the first category is provided by Abrams, Bertrand, and
Mullainathan (2012), who estimate the causal effect of judges on racial bias in sentencing.
An example of the second category, estimation of the causal effect of a binary treatment for many sub-populations, is the heterogeneity of the causal effect of class size on student outcomes across demographic subgroups, based for instance on the Project STAR data. Project STAR involved experimental assignment of students to classes of different sizes in 79 schools; these data were analyzed by Krueger (1999). Causal effects for many subgroups are also of interest in medical contexts or for active labor market programs, where doctors or policy makers have to decide on treatment assignment based on observables.
The third category is prediction with many regressors. For this category to fit into our
setting, consider regressors which are normalized to be orthogonal. Prediction with many
regressors arises in particular in macroeconomic forecasting. Stock and Watson (2012),
in an analysis complementing the present article, evaluate various procedures in terms
of their forecast performance for a number of macroeconomic time series for the United
States. Regression with many predictors also arises in series regression, where the predictors
are transformations of a smaller set of observables. Series regression and its asymptotic
properties have been studied extensively in theoretical econometrics; see for instance Newey
(1997). A discussion of the equivalence of our normal means model to nonparametric
regression can be found in sections 7.2-7.3 of Wasserman (2006), where the Xi and µi
correspond to the estimated and true coefficients on an appropriate orthogonal basis of
functions. Application of Lasso and Pretesting to series regression is discussed for instance
in Belloni and Chernozhukov (2011).
In Section 5 we return to three of these applications, revisiting the estimation of location effects on intergenerational mobility, as in Chetty and Hendren (2015), the effect of
changes in the intensity of conflicts in arms-embargo countries on the stock prices of arms
manufacturers, as in DellaVigna and La Ferrara (2010), and series regression of earnings
on education as in Belloni and Chernozhukov (2011).
2.2 Literature
The field of machine learning is growing fast and is increasingly influential in economics;
see for instance Athey and Imbens (2015) and Kleinberg, Ludwig, Mullainathan, and Obermeyer (2015). A large set of algorithms and estimation procedures have been developed.
Friedman, Hastie, and Tibshirani (2001) provide an introduction to this literature. An
estimator which is becoming increasingly popular in economics is the Lasso, which was
first introduced by Tibshirani (1996). A review of the properties of the Lasso can be found
in Belloni and Chernozhukov (2011).
Machine learning research tends to focus primarily on algorithms and computational
issues, and less prominently on formal properties of the proposed estimation procedures.
This contrasts with the older literature in mathematical statistics and statistical decision
theory on the estimation of normal means. This literature has produced many deep results
which turn out to be relevant for understanding the behavior of estimation procedures in
non-parametric statistics and machine learning. A foundational article for this literature
is James and Stein (1961), who study a special case of our setting, where $X_i \sim N(\mu_i, 1)$. They show that the estimator $\hat\mu = X$ is inadmissible whenever $n$ is at least 3; there exists a shrinkage estimator that has mean squared error smaller than the mean squared error of $\hat\mu = X$ for all values of $\mu$. Brown (1971) provided more general characterizations
of admissibility, and showed that this dependence on dimension is deeply connected to
the recurrence or transience of Brownian motion. Stein et al. (1981) characterized the risk function of arbitrary estimators $\hat\mu$ and, based on this characterization, proposed an unbiased estimator of the mean squared error of a given estimator, labeled "Stein's unbiased risk
estimator” or SURE. We return to SURE in Section 4 as a method for data-driven choices
of regularization parameters.
A general approach for the construction of regularized estimators such as the one proposed by James and Stein (1961) is provided by the empirical Bayes framework, first
proposed in Robbins (1956) and Robbins (1964). A key insight of the empirical Bayes
framework, and the closely related compound decision problem framework, is that trying
to minimize squared error in higher dimensions involves a trade-off across components of
the estimand. The data are informative about which estimators and regularization parameters perform well in terms of squared error, and thus allow one to construct regularized estimators that dominate the unregularized $\hat\mu = X$. This intuition is elaborated on in
Stigler (1990). The empirical Bayes framework was developed further by Efron and Morris
(1973) and Morris (1983), among others. Good reviews and introductions can be found in
Zhang (2003) and Efron (2010).
In Section 4 we consider data driven choices of regularization parameters, and emphasize uniform validity of asymptotic approximations to the risk function of the resulting
estimators. Lack of uniform validity of standard asymptotic characterizations of risk (as
well as of test size) in the context of Pretesting and other model-selection based estimators
has been emphasized by Leeb and Pötscher (2005).
3 The risk function
We now turn to our first set of formal results, characterizing the mean squared error of
regularized estimators. Our goal is to guide researchers’ choice of estimation approach by
providing intuition for the conditions under which alternative approaches work well.
We first derive a general characterization of the mean squared error in our setting. This
characterization is based on the geometry of estimating functions m as depicted in Figure
1. It is a priori not obvious which of these functions is well suited for estimation. We
show that for any given data generating process there is an optimal function m∗P which
minimizes the mean squared error. The mean squared error for an arbitrary m is equal,
up to a constant, to the L2 distance between m and m∗P . A function m thus yields a good
estimator if it is able to approximate the shape of m∗P well.
In Section 3.2, we next provide analytic expressions for the component-wise risk of
Ridge, Lasso, and Pretest estimation, imposing the additional assumption of normality.
Summing or integrating componentwise risk over some distribution for (µi , σi ) delivers
expressions for compound and empirical Bayes risk.
In Section 3.3, we finally turn to a specific parametric family of data generating processes. For this parametric family, we provide analytic risk functions and visual comparisons
of the relative performance of alternative estimators, shown in Figures 2, 3, and 4. This allows us to highlight key features of the data generating process which favor one or the other
estimator. The family of data generating processes we consider assumes that there is some
share p of true zeros among the µi , reflecting the notion of sparsity, while the remaining µi
are drawn from a normal distribution with some grand mean µ0 and variance σ02 .
3.1 General characterization
Recall the setup introduced in Section 2, where we observe n random variables Xi , Xi has
mean µi , and the Xi are independent across i. We are interested in the mean squared error
for the compound problem of estimating all of the µi simultaneously. In this formulation,
the µi are not random but rather fixed, unknown parameters. We can, however, consider
the empirical distribution of the µi across i, and the corresponding conditional distributions
of the Xi , to construct a joint distribution of Xi and µi . For this construction, we can define
a conditional expectation m∗P of µi given Xi – which turns out to be the optimal estimating
function given P .
Let us now formalize this idea. Let I be a random variable with a uniform distribution
over the set {1, 2, . . . , n}, and consider the random component (XI , µI ) of (X, µ). This
construction induces a mixture distribution for $(X_I, \mu_I)$ (conditional on $P$),
\[
(X_I, \mu_I)\,\big|\, P \;\sim\; \frac{1}{n} \sum_{i=1}^n P_i\, \delta_{\mu_i},
\]
where $\delta_{\mu_1}, \ldots, \delta_{\mu_n}$ are Dirac measures at $\mu_1, \ldots, \mu_n$. Based on this mixture distribution, define the conditional expectation
\[
m^*_P(x) = E[\mu_I \mid X_I = x, P] \tag{8}
\]
and the average conditional variance
\[
v^*_P = E\big[\operatorname{Var}(\mu_I \mid X_I, P) \,\big|\, P\big].
\]
Theorem 1 (Characterization of risk functions)
Under the assumptions of Section 2, and using the notation introduced above, the compound risk function $R_n$ of $\hat\mu_i = m(X_i, \lambda)$ for fixed $\lambda$ can be written as
\[
R_n(m(\cdot, \lambda), P) = v^*_P + E\big[(m(X_I, \lambda) - m^*_P(X_I))^2 \,\big|\, P\big], \tag{9}
\]
which implies
\[
\lambda^*(P) = \operatorname*{argmin}_{\lambda \in [0,\infty]} E\big[(m(X_I, \lambda) - m^*_P(X_I))^2 \,\big|\, P\big].
\]
The proof of this theorem and all further results can be found in the appendix. The
statement of this theorem implies that the risk of componentwise estimators is equal to an
irreducible part vP∗ , plus the L2 distance of the estimating function m(., λ) to the infeasible
optimal estimating function m∗P . A given data generating process P maps into an optimal
estimating function m∗P , and the relative performance of alternative estimators m depends
on how well they approximate m∗P .
We can easily write m∗P explicitly, since the conditional expectation defining m∗P is a
weighted average of the values taken by µi . Suppose, for example, that Xi ∼ N (µi , 1) for
$i = 1, \ldots, n$. Let $\varphi$ be the standard normal probability density function. Then,
\[
m^*_P(x) = \frac{\sum_{i=1}^n \mu_i\, \varphi(x - \mu_i)}{\sum_{i=1}^n \varphi(x - \mu_i)}.
\]
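As a small illustration of this formula, the sketch below (Python, assuming unit-variance normal noise; the function name is ours) evaluates $m^*_P$ for a given vector of true means.

```python
import numpy as np
from scipy.stats import norm

def m_star(x, mu):
    # Optimal estimating function m*_P(x) when X_i ~ N(mu_i, 1):
    # a weighted average of the true means, with weights phi(x - mu_i).
    w = norm.pdf(x - mu)
    return np.sum(w * mu) / np.sum(w)

# Example: shrinkage of x = 1 toward the cloud of true means.
mu = np.array([0.0, 0.0, 2.0, 5.0])
print(m_star(1.0, mu))
```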
3.2 Componentwise risk assuming normality
The characterization of the previous section relied only on the existence of second moments,
as well as on the fact that we are considering componentwise estimators. If slightly more
structure is imposed, we can obtain more explicit expressions for compound risk and empirical Bayes risk. We shall now assume in particular that the Xi are normally distributed,
$X_i \sim N(\mu_i, \sigma_i^2)$. We shall furthermore focus on the leading examples of componentwise estimators introduced in Section 2, Ridge, Lasso, and Pretesting, whose estimating functions m were
plotted in Figure 1. Recall that compound risk is given by an average of componentwise
risk across components i,
\[
R_n(m(\cdot, \lambda), P) = \frac{1}{n} \sum_{i=1}^n R(m(\cdot, \lambda), P_i),
\]
and similarly for empirical Bayes risk,
\[
\bar R(m(\cdot, \lambda), \pi) = \int R(m(\cdot, \lambda), P_i)\, d\pi(P_i).
\]
The following lemma provides explicit expressions for componentwise risk.
Lemma 1 (Componentwise risk)
Consider the setup of Section 2. Then, for $i = 1, \ldots, n$, the componentwise risk of Ridge is
\[
R(m_R(\cdot, \lambda), P_i) = \left(\frac{1}{1+\lambda}\right)^2 \sigma_i^2 + \left(1 - \frac{1}{1+\lambda}\right)^2 \mu_i^2.
\]
Assume, in addition, that $X_i$ has a normal distribution. Then, the componentwise risk of Lasso is
\[
\begin{aligned}
R(m_L(\cdot, \lambda), P_i) ={}& \left(1 + \Phi\!\left(\frac{-\lambda - \mu_i}{\sigma_i}\right) - \Phi\!\left(\frac{\lambda - \mu_i}{\sigma_i}\right)\right)(\sigma_i^2 + \lambda^2) \\
&+ \left(\frac{-\lambda - \mu_i}{\sigma_i}\,\varphi\!\left(\frac{\lambda - \mu_i}{\sigma_i}\right) + \frac{-\lambda + \mu_i}{\sigma_i}\,\varphi\!\left(\frac{-\lambda - \mu_i}{\sigma_i}\right)\right)\sigma_i^2 \\
&+ \left(\Phi\!\left(\frac{\lambda - \mu_i}{\sigma_i}\right) - \Phi\!\left(\frac{-\lambda - \mu_i}{\sigma_i}\right)\right)\mu_i^2.
\end{aligned}
\]
Under the same conditions, the componentwise risk of Pretest is
\[
\begin{aligned}
R(m_{PT}(\cdot, \lambda), P_i) ={}& \left(1 + \Phi\!\left(\frac{-\lambda - \mu_i}{\sigma_i}\right) - \Phi\!\left(\frac{\lambda - \mu_i}{\sigma_i}\right)\right)\sigma_i^2 \\
&+ \left(\frac{\lambda - \mu_i}{\sigma_i}\,\varphi\!\left(\frac{\lambda - \mu_i}{\sigma_i}\right) - \frac{-\lambda - \mu_i}{\sigma_i}\,\varphi\!\left(\frac{-\lambda - \mu_i}{\sigma_i}\right)\right)\sigma_i^2 \\
&+ \left(\Phi\!\left(\frac{\lambda - \mu_i}{\sigma_i}\right) - \Phi\!\left(\frac{-\lambda - \mu_i}{\sigma_i}\right)\right)\mu_i^2.
\end{aligned}
\]
The componentwise risk functions whose expressions are given in Lemma 1 are plotted
in Figure 2. As this figure reveals, componentwise risk is large for Ridge when µi is large
relative to λ; similarly for Lasso except that risk remains bounded in the latter case. Risk
is large for Pretest estimation when µi is close to λ.
Note however that these functions are plotted for a fixed value of the regularization
parameter λ, varying the mean µi . If we are interested in compound risk and assume that
λ is chosen optimally (equal to $\lambda^*(P)$), then compound risk always remains below the risk of the estimator $\hat\mu = X$, which equals $E_n[\sigma_i^2]$ under the assumptions of Lemma 1.
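The analytic expressions in Lemma 1 are easy to check numerically. The sketch below (Python; the helper names are ours) compares the Ridge formula against a Monte Carlo estimate of componentwise risk.

```python
import numpy as np

def ridge_risk(mu, sigma, lam):
    # Componentwise risk of Ridge from Lemma 1.
    a = 1.0 / (1.0 + lam)
    return a**2 * sigma**2 + (1.0 - a)**2 * mu**2

def ridge_risk_mc(mu, sigma, lam, reps=1_000_000, seed=0):
    # Monte Carlo approximation of E[(m_R(X, lam) - mu)^2] with X ~ N(mu, sigma^2).
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, size=reps)
    return np.mean((x / (1.0 + lam) - mu) ** 2)

print(ridge_risk(2.0, 1.0, 0.5), ridge_risk_mc(2.0, 1.0, 0.5))
```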
3.3 Spike and normal data generating process
If we take the expressions for componentwise risk derived in Lemma 1 and average them over
some population distribution of (µi , σi2 ), we obtain empirical Bayes risk. For parametric
families of distributions of (µi , σi2 ), this might be done analytically.
We shall do so now, considering a family of distributions that is rich enough to cover
common intuitions about data generating processes, but simple enough to allow for analytic
expressions. Based on these expressions we will be able to give rules of thumb when either
of the estimators considered is preferable.
The family of distributions considered assumes homoskedasticity, σi2 = σ 2 , and allows
for a fraction p of true zeros in the distribution of µi , while assuming that the remaining
µi are drawn from some normal distribution. The following proposition derives the optimal estimating function m∗π as well as empirical Bayes risk functions for this family of
distributions.
Proposition 1 (Spike and normal data generating process)
Assume $\pi$ is such that $\mu_1, \ldots, \mu_n$ are drawn independently from a distribution with probability mass $p$ at zero, and normal with mean $\mu_0$ and variance $\sigma_0^2$ elsewhere, and that, conditional on $\mu_i$, $X_i$ follows a normal distribution with mean $\mu_i$ and variance $\sigma^2$. Let
\[
\bar m^*_\pi(x) = E_\pi[\mu_I \mid X_I = x] = E_\pi[\mu_i \mid X_i = x],
\]
for $i = 1, \ldots, n$. Then,
\[
\bar m^*_\pi(x) =
\frac{(1-p)\, \dfrac{1}{\sqrt{\sigma_0^2+\sigma^2}}\, \varphi\!\left(\dfrac{x-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right) \dfrac{\mu_0 \sigma^2 + x \sigma_0^2}{\sigma_0^2+\sigma^2}}
{p\, \dfrac{1}{\sigma}\, \varphi\!\left(\dfrac{x}{\sigma}\right) + (1-p)\, \dfrac{1}{\sqrt{\sigma_0^2+\sigma^2}}\, \varphi\!\left(\dfrac{x-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)}.
\]
The integrated risk of Ridge is
\[
\bar R(m_R(\cdot,\lambda), \pi) = \left(\frac{1}{1+\lambda}\right)^2 \sigma^2 + (1-p)\left(\frac{\lambda}{1+\lambda}\right)^2 (\mu_0^2 + \sigma_0^2),
\]
with
\[
\bar\lambda^*_R(\pi) = \frac{\sigma^2}{(1-p)(\mu_0^2 + \sigma_0^2)}.
\]
The integrated risk of Lasso is given by
\[
\bar R(m_L(\cdot,\lambda), \pi) = p\, \bar R_0(m_L(\cdot,\lambda), \pi) + (1-p)\, \bar R_1(m_L(\cdot,\lambda), \pi),
\]
where
\[
\bar R_0(m_L(\cdot,\lambda), \pi) = 2\Phi\!\left(\frac{-\lambda}{\sigma}\right)(\sigma^2 + \lambda^2) - 2\,\frac{\lambda}{\sigma}\,\varphi\!\left(\frac{\lambda}{\sigma}\right)\sigma^2,
\]
and
\[
\begin{aligned}
\bar R_1(m_L(\cdot,\lambda), \pi) ={}&
\left(1 + \Phi\!\left(\frac{-\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right) - \Phi\!\left(\frac{\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)\right)(\sigma^2 + \lambda^2) \\
&+ \left(\Phi\!\left(\frac{\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right) - \Phi\!\left(\frac{-\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)\right)(\mu_0^2 + \sigma_0^2) \\
&- \frac{1}{\sqrt{\sigma_0^2+\sigma^2}}\, \varphi\!\left(\frac{\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)(\lambda + \mu_0)(\sigma_0^2 + \sigma^2) \\
&- \frac{1}{\sqrt{\sigma_0^2+\sigma^2}}\, \varphi\!\left(\frac{-\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)(\lambda - \mu_0)(\sigma_0^2 + \sigma^2).
\end{aligned}
\]
Finally, the integrated risk of Pretest is given by
\[
\bar R(m_{PT}(\cdot,\lambda), \pi) = p\, \bar R_0(m_{PT}(\cdot,\lambda), \pi) + (1-p)\, \bar R_1(m_{PT}(\cdot,\lambda), \pi),
\]
where
\[
\bar R_0(m_{PT}(\cdot,\lambda), \pi) = 2\Phi\!\left(\frac{-\lambda}{\sigma}\right)\sigma^2 + 2\,\frac{\lambda}{\sigma}\,\varphi\!\left(\frac{\lambda}{\sigma}\right)\sigma^2,
\]
and
\[
\begin{aligned}
\bar R_1(m_{PT}(\cdot,\lambda), \pi) ={}&
\left(1 + \Phi\!\left(\frac{-\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right) - \Phi\!\left(\frac{\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)\right)\sigma^2 \\
&+ \left(\Phi\!\left(\frac{\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right) - \Phi\!\left(\frac{-\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)\right)(\mu_0^2 + \sigma_0^2) \\
&- \frac{1}{\sqrt{\sigma_0^2+\sigma^2}}\, \varphi\!\left(\frac{\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)\big(\lambda(\sigma_0^2 - \sigma^2) + \mu_0(\sigma_0^2 + \sigma^2)\big) \\
&- \frac{1}{\sqrt{\sigma_0^2+\sigma^2}}\, \varphi\!\left(\frac{-\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)\big(\lambda(\sigma_0^2 - \sigma^2) - \mu_0(\sigma_0^2 + \sigma^2)\big).
\end{aligned}
\]
While the expressions provided by Proposition 1 look somewhat cumbersome, they are
easily plotted. Figure 3 does so, plotting the empirical Bayes risk function of different
estimators. Each of these plots is based on a fixed value of p, varying µ0 and σ02 along the
two axes. Each panel is for a different estimator. Figure 4 then takes these risk functions,
for the same range of parameter values, and plots which of the three estimators, Ridge,
Lasso, or Pretest estimation, performs best for a given data generating process.
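Comparisons of the kind shown in Figures 3 and 4 can also be reproduced numerically. The sketch below (Python; the function names are ours) evaluates the analytic Ridge risk from Proposition 1 at the oracle $\bar\lambda^*_R$ and checks it against a Monte Carlo approximation of empirical Bayes risk under the spike-and-normal model, assuming $\sigma = 1$ and illustrative values of $(p, \mu_0, \sigma_0)$.

```python
import numpy as np

def eb_risk_mc(m, lam, p, mu0, sig0, sig=1.0, reps=200_000, seed=0):
    # Monte Carlo empirical Bayes risk: mu = 0 with prob. p,
    # mu ~ N(mu0, sig0^2) otherwise; X | mu ~ N(mu, sig^2).
    rng = np.random.default_rng(seed)
    zero = rng.random(reps) < p
    mu = np.where(zero, 0.0, rng.normal(mu0, sig0, size=reps))
    x = rng.normal(mu, sig)
    return np.mean((m(x, lam) - mu) ** 2)

def ridge_eb_risk(lam, p, mu0, sig0, sig=1.0):
    # Analytic Ridge risk from Proposition 1.
    return (sig / (1 + lam)) ** 2 + (1 - p) * (lam / (1 + lam)) ** 2 * (mu0**2 + sig0**2)

def m_ridge(x, lam):
    return x / (1.0 + lam)

# Oracle Ridge lambda for p = 0.5, mu0 = 2, sig0 = 1, sigma = 1:
lam_star = 1.0 / ((1 - 0.5) * (2.0**2 + 1.0**2))
print(ridge_eb_risk(lam_star, 0.5, 2.0, 1.0),
      eb_risk_mc(m_ridge, lam_star, 0.5, 2.0, 1.0))
```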
We can take away several intuitions from these plots: For small shares of true zeros,
Ridge dominates the other estimators, no matter what the grand mean and population
dispersion are. As the share of true zeros increases, the other estimators start to look
better. Lasso does comparatively well when the non-zero parameters are very dispersed
but do have some mass around zero. Intuitively, Ridge cannot reap the gains from local shrinkage around zero without suffering from exploding bias in the tails. Pretest has a hard time because of the presence of intermediate values which are
not easily classified into zero or non-zero true mean. Pretest does well when the non-zeros
are well separated from the zeros, that is when the grand mean is large relative to the
population dispersion. When the number of true zeros gets large, we see that Ridge still
dominates if the non-zero values are not too far from zero, while Pretest dominates if the
non-zero values are very far from zero. Lasso does best in the intermediate range between
these extremes.
In the context of applications such as the ones we consider below, we suggest that
researchers start by contemplating which of these scenarios seems most plausible as a
description of their setting. In many settings, there will be no prior reason to presume the
presence of many true zeros, suggesting the use of Ridge. If there are reasons to assume
there are many true zeros, such as in our arms trade application, then the question arises
how well separated and dispersed these are. In the well-separated case Pretest might be justified; in the non-separated but dispersed case Lasso might be a good idea.
4 Data-driven choice of regularization parameters
In Section 3 we studied the risk properties of alternative regularized estimators. Our calculations in Section 3.3 presumed that we knew the oracle optimal regularization parameter
λ̄∗ (π). This is evidently infeasible, given that the distribution π of true means µi is unknown.
Are the resulting risk comparisons relevant nonetheless? The answer is yes, to the extent that we can obtain consistent estimates of $\bar\lambda^*(\pi)$ from the observed data $X_i$. In this section we make this idea precise. We consider estimates $\hat\lambda$ of $\bar\lambda^*(\pi)$ based on Stein's unbiased risk estimate and based on cross-validation. The resulting estimators $m(X_i, \hat\lambda)$ have risk functions which are uniformly close to those of the infeasible estimators $m(X_i, \bar\lambda^*(\pi))$.
The uniformity part of this statement is important and not obvious. It contrasts
markedly with other oracle approximations to risk, most notably approximations which
assume that the true zeros, that is the components i for which µi = 0, are known. Asymptotic approximations of this latter form are often invoked when justifying the use of Lasso
or Pretesting estimators. Such approximations are in general not uniformly valid, as emphasized by Leeb and Pötscher (2005) and others. This implies that there are relevant
ranges of parameter values for which Lasso or Pretesting perform significantly worse than
the corresponding oracle-selection-of-zeros estimators. We will briefly revisit this point
below.
4.1 Uniform loss and risk consistency
For the remainder of the paper we adopt the following shorthand notation:
\[
L_n(\lambda) = L_n(X, m(\cdot, \lambda), P), \qquad
R_n(\lambda) = R_n(m(\cdot, \lambda), P), \qquad
\bar R_\pi(\lambda) = \bar R(m(\cdot, \lambda), \pi).
\]
We shall now consider estimators $\hat\lambda$ of the oracle optimal choice of $\lambda$ which are based on minimizing some empirical estimate $r_n$ of the risk function $\bar R_\pi$. The estimate $\hat\lambda$ is then used to calculate regularized estimates of the form $\hat\mu_i = m(X_i, \hat\lambda)$. Our goal is to show that these estimates are almost as good as estimates based on optimal choices of $\lambda$, where "almost as good" is understood as having risk (loss) that is uniformly close to that of the oracle optimal estimator.
Our results are based on taking limits as n goes to ∞, and assuming that the Pi are
i.i.d. draws from some distribution π, as in the empirical Bayes setting. In the large n limit
the difference between Ln , Rn , and R̄π vanishes as we shall see, so that loss optimality and
compound / empirical Bayes risk optimality of λ become equivalent.
The following theorem establishes our key result for this section. It provides sufficient conditions for uniform loss consistency, namely that (i) the difference between loss $L_n(\lambda)$ and empirical Bayes risk $\bar R_\pi(\lambda)$ vanishes uniformly in $\lambda$ and $\pi$, and (ii) that $\hat\lambda$ is chosen to minimize an estimate $r_n(\lambda)$ of risk $\bar R_\pi(\lambda)$ which is also uniformly consistent, possibly up to a constant $\bar v_\pi$. Under these conditions, the difference between loss $L_n(\hat\lambda_n)$ and the infeasible minimal loss $\inf_{\lambda \in [0,\infty]} L_n(\lambda)$ vanishes uniformly.
Theorem 2 (Uniform loss consistency)
Assume
\[
\sup_{\pi \in \Pi} P_\pi\!\left( \sup_{\lambda \in [0,\infty]} \big| L_n(\lambda) - \bar R_\pi(\lambda) \big| > \epsilon \right) \to 0, \qquad \forall \epsilon > 0. \tag{10}
\]
Assume also that there are functions $\bar r_\pi(\lambda)$, $\bar v_\pi$, and $r_n(\lambda)$ (of $(\pi, \lambda)$, $\pi$, and $(\{X_i\}_{i=1}^n, \lambda)$, respectively) such that $\bar R_\pi(\lambda) = \bar r_\pi(\lambda) + \bar v_\pi$, and
\[
\sup_{\pi \in \Pi} P_\pi\!\left( \sup_{\lambda \in [0,\infty]} \big| r_n(\lambda) - \bar r_\pi(\lambda) \big| > \epsilon \right) \to 0, \qquad \forall \epsilon > 0. \tag{11}
\]
Then,
\[
\sup_{\pi \in \Pi} P_\pi\!\left( L_n(\hat\lambda_n) - \inf_{\lambda \in [0,\infty]} L_n(\lambda) > \epsilon \right) \to 0, \qquad \forall \epsilon > 0, \tag{12}
\]
where $\hat\lambda_n = \operatorname*{argmin}_{\lambda \in [0,\infty]} r_n(\lambda)$.
The sufficient conditions given by this theorem, as stated in equations (10) and (11),
are rather high-level. We shall now give more primitive conditions for these requirements
to hold. In sections 4.2 and 4.3 below, we propose suitable choices of rn (λ) based on Stein’s
unbiased risk estimator (SURE) and cross-validation (CV), and show that equation (11)
holds for these choices of rn (λ).
In the following Theorem 3, we show that equation (10) holds under fairly general
conditions, so that the difference between compound loss and empirical Bayes risk vanishes
uniformly. Lemma 2 then shows that these conditions apply in particular to Ridge, Lasso,
and Pretest estimation.
Theorem 3 (Uniform $L_2$-convergence)
Suppose that
1. $m(x, \lambda)$ is monotonic in $\lambda$ for all $x$ in $\mathbb{R}$,
2. $m(x, 0) = x$ and $\lim_{\lambda \to \infty} m(x, \lambda) = 0$ for all $x$ in $\mathbb{R}$,
3. $\sup_{\pi \in \Pi} E_\pi[X^4] < \infty$ and $\sup_{\pi \in \Pi} E_\pi[\mu^4] < \infty$,
4. for any $\epsilon > 0$ there exists a set of regularization parameters $0 = \lambda_0 < \ldots < \lambda_k = \infty$, which may depend on $\epsilon$, such that $E_\pi[(|X - \mu| + |\mu|)\,|m(X, \lambda_j) - m(X, \lambda_{j-1})|] \le \epsilon$ for all $j = 1, \ldots, k$.
Then,
\[
\sup_{\pi \in \Pi} E_\pi\!\left[ \sup_{\lambda \in [0,\infty]} \big( L_n(\lambda) - \bar R_\pi(\lambda) \big)^2 \right] \to 0. \tag{13}
\]
Lemma 2
If $\sup_{\pi \in \Pi} E_\pi[X^4] < \infty$ and $\sup_{\pi \in \Pi} E_\pi[\mu^4] < \infty$, then equation (13) holds for Ridge and Lasso. If, in addition, $X$ is continuously distributed with a bounded density, then equation (13) holds for Pretest.
Theorem 2 provides sufficient conditions for uniform loss consistency. The following corollary shows that under the same conditions we obtain uniform risk consistency, that is, the empirical Bayes risk of the estimator based on the data-driven choice $\hat\lambda$ becomes uniformly close to the risk of the oracle-optimal $\bar\lambda^*(\pi)$. For the statement of this corollary, recall that $\bar R(m(\cdot, \hat\lambda_n), \pi)$ is the empirical Bayes risk of the estimator $m(\cdot, \hat\lambda_n)$ using the stochastic (data-dependent) $\hat\lambda_n$.
Corollary 1 (Uniform risk consistency)
Under the assumptions of Theorem 3,
\[
\sup_{\pi \in \Pi} \Big( \bar R(m(\cdot, \hat\lambda_n), \pi) - \inf_{\lambda \in [0,\infty]} \bar R_\pi(\lambda) \Big) \to 0. \tag{14}
\]
4.2 Stein's unbiased risk estimate
Theorem 2 provides sufficient conditions for uniform loss consistency using a general estimator $r_n$ of risk. We shall now establish that our conditions apply to a particular choice of $r_n$, known as Stein's unbiased risk estimate (SURE), which was first proposed by Stein
et al. (1981). SURE leverages the assumption of normality to obtain an elegant expression
of risk as an expected sum of squared residuals plus a penalization term.
SURE as originally proposed requires that m be piecewise differentiable as a function
of x, which excludes discontinuous estimators such as the Pretest estimator mP T (x, λ). We
provide a minor generalization in Lemma 3 which allows for such discontinuities. This
lemma is stated in terms of empirical Bayes risk; with the appropriate modifications the
same result holds verbatim for compound risk.
Lemma 3 (SURE for piecewise differentiable estimators)
Suppose
\[
X \mid \mu \sim N(\mu, 1)
\]
and
\[
\mu \sim_{iid} \pi.
\]
Let $f_\pi = \pi * \varphi$ be the marginal density of $X$. Suppose further that $m(x)$ is differentiable everywhere in $\mathbb{R} \setminus \{x_1, \ldots, x_J\}$, but might be discontinuous at $\{x_1, \ldots, x_J\}$. Let $\nabla m$ be the derivative of $m$ (defined arbitrarily at $\{x_1, \ldots, x_J\}$), and let $\Delta m_j = \lim_{x \downarrow x_j} m(x) - \lim_{x \uparrow x_j} m(x)$. Assume that $E_\pi[(m(X) - X)^2] < \infty$ and $(m(x) - x)\varphi(x - \mu) \to 0$ as $|x| \to \infty$, $\pi$-a.s. Then,
\[
\bar R(m(\cdot), \pi) = E_\pi[(m(X) - X)^2] + 2\left( E_\pi[\nabla m(X)] + \sum_{j=1}^J \Delta m_j\, f_\pi(x_j) \right) - 1.
\]
The result of this lemma yields an objective function for the choice of $\lambda$ of the general form we considered in Section 4.1, where $\bar v_\pi = -1$ and
\[
\bar r_\pi(\lambda) = E_\pi[(m(X, \lambda) - X)^2] + 2\left( E_\pi[\nabla_x m(X, \lambda)] + \sum_{j=1}^J \Delta m_j(\lambda)\, f_\pi(x_j) \right), \tag{15}
\]
where $\nabla_x m$ is the derivative of $m$ with respect to its first argument, and $\{x_1, \ldots, x_J\}$ may depend on $\lambda$.
The expression given in equation (15) suggests an obvious empirical counterpart, to be used as an estimator of $\bar r_\pi(\lambda)$,
\[
r_n(\lambda) = \frac{1}{n} \sum_{i=1}^n (m(X_i, \lambda) - X_i)^2 + 2\left( \frac{1}{n} \sum_{i=1}^n \nabla_x m(X_i, \lambda) + \sum_{j=1}^J \Delta m_j(\lambda)\, \hat f(x_j) \right),
\]
where $\hat f(x)$ is an estimator of $f_\pi(x)$. This expression can be thought of as a penalized least squares objective. We can write the penalty explicitly for our leading examples:
\[
\text{Ridge:} \quad \frac{2}{1+\lambda}, \qquad
\text{Lasso:} \quad 2 P_n(|X| > \lambda), \qquad
\text{Pretest:} \quad 2 P_n(|X| > \lambda) + 2\lambda \big( \hat f(-\lambda) + \hat f(\lambda) \big).
\]
For Ridge and Lasso, a smaller $\lambda$ implies a smaller sum of squared residuals, but also a larger penalty.
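A minimal sketch of a data-driven choice of $\lambda$ based on these criteria is given below (Python; it assumes unit variances, omits the constant $-1$, which does not affect the minimizer, and the function names are ours).

```python
import numpy as np

def sure_ridge(x, lam):
    # SURE-type criterion for Ridge: average squared residual plus penalty 2/(1+lambda).
    return np.mean((x / (1 + lam) - x) ** 2) + 2 / (1 + lam)

def sure_lasso(x, lam):
    # SURE-type criterion for Lasso (soft thresholding), assuming X_i ~ N(mu_i, 1).
    m = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
    return np.mean((m - x) ** 2) + 2 * np.mean(np.abs(x) > lam)

def lambda_sure(x, criterion, grid):
    # Data-driven lambda: minimize the criterion over a grid.
    values = [criterion(x, lam) for lam in grid]
    return grid[int(np.argmin(values))]

# Example with simulated data (half true zeros, half means equal to 4):
rng = np.random.default_rng(1)
mu = np.concatenate([np.zeros(500), np.full(500, 4.0)])
x = mu + rng.standard_normal(mu.size)
grid = np.linspace(0.0, 10.0, 201)
print(lambda_sure(x, sure_lasso, grid))
```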
For our general result on uniform risk consistency, Theorem 2, to apply, we have to
show that equation (11) holds. That is, we have to show that rn (λ) is uniformly consistent
as an estimator of r̄π (λ). The following lemma provides the desired result.
Lemma 4
Assume the conditions of Theorem 3. Then, equation (11) holds for $m(\cdot, \lambda)$ equal to $m_R(\cdot, \lambda)$ or $m_L(\cdot, \lambda)$. If, in addition,
\[
\sup_{\pi \in \Pi} P_\pi\!\left( \sup_{x \in \mathbb{R}} \Big| |x| \hat f(x) - |x| f_\pi(x) \Big| > \epsilon \right) \to 0 \qquad \forall \epsilon > 0,
\]
then equation (11) holds for $m(\cdot, \lambda)$ equal to $m_{PT}(\cdot, \lambda)$.
Lemma 3 implies in particular that the optimal regularization parameter λ̄∗ is identified.
In fact, under the same conditions the stronger result holds that m∗ as defined in Section
3.1 is identified as well. This result is stated in Brown (1971), for instance. We will use this
result in order to provide estimates of m̄∗π (x) in the context of the empirical applications
considered below, and to visually compare the shape of this estimated m̄∗π (x) to the shape
of Ridge, Lasso, and Pretest estimating functions.
Lemma 5
Under the conditions of Lemma 3, the optimal shrinkage function is given by
\[
\bar m^*_\pi(x) = x + \nabla_x \log\big(f_\pi(x)\big). \tag{16}
\]
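Equation (16) suggests a simple plug-in estimate of the optimal shrinkage function: replace $f_\pi$ by a density estimate based on the observed $X_i$. The sketch below (Python) is one such implementation; the Gaussian kernel density estimate and the numerical derivative are our own illustrative choices.

```python
import numpy as np
from scipy.stats import gaussian_kde

def m_star_hat(x_eval, x_obs, eps=1e-3):
    # Plug-in estimate of the optimal shrinkage function of eq. (16):
    # m*(x) = x + d/dx log f(x), with f replaced by a Gaussian KDE.
    kde = gaussian_kde(x_obs)
    log_f = lambda t: np.log(kde(t))
    grad = (log_f(x_eval + eps) - log_f(x_eval - eps)) / (2 * eps)
    return x_eval + grad

rng = np.random.default_rng(2)
x_obs = np.concatenate([rng.normal(0, 1, 500), rng.normal(4, 1, 500)])
print(m_star_hat(np.array([0.0, 2.0, 4.0]), x_obs))
```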
4.3 Cross-validation
COMING SOON.
5 Applications
In this section we consider three applications taken from the literature, in order to illustrate
our results. The first application, based on Chetty and Hendren (2015), estimates the
effect of living in a given commuting zone during childhood on intergenerational income
mobility. The second application, based on DellaVigna and La Ferrara (2010), uses changes
in the stock prices of arms manufacturers following changes in the intensity of conflicts in
countries under an arms trade embargo, in order to detect potential violations of these
embargoes. The third application uses data from the 2000 census of the US to estimate a
series regression of log wages on education and potential experience.
5.1 Background
Before presenting our results and relating them to our theoretical discussion, let us briefly
review the background for each of these applications. Given this background, we can formulate some intuitions about likely features of the data generating process. Using our
theoretical results, these intuitions then suggest which of the proposed estimators might
work best in a given application.
Chetty and Hendren (2015) aim to identify and estimate the causal effect βi of locations
i (commuting zones, more specifically) on intergenerational mobility in terms of income.1
They identify this effect by comparing the outcomes of differently aged children of the
same parents who move at some point during their childhood. The older children are then
exposed to the old location for a longer time relative to the younger children, and to
the new location for a shorter time. A regression interacting parental income with location
indicators, while controlling for family fixed effects, identifies the effects of interest (under
some assumptions, and up to a constant). We take their estimates $\hat\beta_i$ of the causal impact
of location i on mobility for parents at the 25th percentile of the income distribution as
our point of departure.
1 The data for this application are available at http://www.equality-of-opportunity.org/images/nbhds_online_data_table3.xlsx.
In this application, the point 0 has no special significance. We would expect the distribution π of true location effects βi to be smooth, possibly normal, and by construction
centered at 0. This puts us in a scenario where the results of section 3.3 suggest that Ridge
is likely to perform well.
DellaVigna and La Ferrara (2010) aim to detect whether weapons manufacturers violate
UN arms trade embargos for various conflicts and civil wars.2 They use changes in stock
prices of arms manufacturing companies at the time of large changes in the intensity of
conflicts in countries under arms-trade embargoes to detect illegal arms trade. Based on
their partitioning of the data, we focus on the 214 event study estimates $\hat\beta_i$ for events which
increase the intensity of conflicts in arms embargo areas, and for arms manufacturers in
high-corruption countries.
In this application, we would in fact expect many of the true effects βi to equal 0 (or
close to 0). This would be the case if the (fixed) cost of violating an arms embargo is too
large for most weapons manufacturers, so that they don’t engage in trade with civil war
parties. There might be a subset of firms, however, who consider it worthwhile to violate
the embargo, and their profits might be significantly increased by doing so. We would thus
expect there to be many true zeros, while the non-zero effects might be well separated.
This puts us in a scenario where the results of section 3.3 suggest that Pretest is likely to
perform well.
In our third application we use data from the 2000 US Census, in order to estimate a nonparametric regression of log wages on years of education and potential experience, similar
to the example considered in Belloni and Chernozhukov (2011).3 We construct a set of
regressors by taking a saturated basis of linear splines in education, fully interacted with the
terms of a 6th order polynomial in potential experience. We orthogonalize these regressors,
2 The data for this application are available at http://eml.berkeley.edu/~sdellavi/wp/AEJDataPostingZip.zip.
3 The data for this application are available at http://economics.mit.edu/files/384.
and take the coefficients $\hat\beta_i$ of an OLS regression of log wages on these orthogonalized
regressors as our point of departure.
In this application, economics provides less intuition as to what distribution of coefficients to expect. Statistical theory involving arguments from functional analysis, however,
suggests that the distribution of coefficients might be such that Lasso performs well; cf.
Belloni and Chernozhukov (2011).
5.2 Results
Let us now turn to the actual results. As we shall see, all of the intuitions just discussed are borne out in our empirical analysis. For each of the applications we proceed as follows. We first assemble the least squares coefficient estimates $\hat\beta_i$ and their estimated standard errors $\hat\sigma_i$. We then define $X_i = \hat\beta_i / \hat\sigma_i$. Standard asymptotics for least squares estimators suggest that approximately $X_i \sim N(\mu_i, 1)$, where $\mu_i = \beta_i / \sigma_i$.
We consider componentwise shrinkage estimators of the form $\hat\mu_i = m(X_i, \lambda)$. For each of the estimators Ridge, Lasso, and Pretest we use SURE in order to pick their regularization parameter, that is, $\hat\lambda^*$ is chosen to minimize SURE. We furthermore compare these three estimators in terms of their estimated risk $\hat R$.
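A schematic version of this pipeline, reusing a SURE-type criterion like the ones sketched in Section 4.2, might look as follows (Python; `beta_hat` and `se_hat` stand for the assembled estimates and standard errors, the rescaling back to the $\beta$ scale is one natural convention of ours, and the function names are ours).

```python
import numpy as np

def shrinkage_pipeline(beta_hat, se_hat, criterion, m, grid):
    # Normalize to X_i = beta_hat_i / se_hat_i (so approximately X_i ~ N(mu_i, 1)),
    # pick lambda by minimizing the criterion over a grid, and return the shrunken
    # estimates mapped back to the beta scale.
    x = beta_hat / se_hat
    values = [criterion(x, lam) for lam in grid]
    lam_hat = grid[int(np.argmin(values))]
    risk_hat = min(values)  # criterion value at the chosen lambda (risk up to a constant)
    return se_hat * m(x, lam_hat), lam_hat, risk_hat
```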
Our results are summarized in Table 1, which shows estimated optimal tuning parameters and estimated minimal risk for each application and estimator. As suggested by the above discussion, we indeed find that Ridge performs best (has the lowest estimated risk) for
the location effect application, Pretest estimation performs best for the arms trade application, and Lasso performs best for the series regression application. These results drive
home a point already suggested by our analytical results on the relative risk of different
estimators: none of these estimators uniformly dominates the others. Which estimator performs best depends on the data generating process, and in realistic applications any of them might be preferable.
Figures 5 through 10 provide more detail on the results summarized in Table 1. For each of the applications and estimators, Figures 5, 7, and 9 plot SURE as a function of the regularization parameter. The optimal $\hat\lambda^*$ reported in Table 1 are the minimizers of these functions, and the reported $\hat R$ are the minimizing values. The top panels of Figures 6, 8, and 10 then plot the estimating functions $m(\cdot, \hat\lambda^*)$, as well as the estimated optimal $\hat m^*(\cdot)$, cf. Lemma 5. The bottom panels of these same figures plot kernel density estimates of the marginal density of $X_i$ in each application.
We conjecture that many applications in economics involving fixed effects estimation
(teacher effects, worker effects, judge effects...) as well as applications estimating treatment
effects for many subgroups will be similar to the location effects setting, in that effects are
smoothly distributed and Ridge is the preferred estimator. There is a more special set
of applications where economic agents engaging in non-zero behavior incur a fixed cost.
Such applications might be similar to the arms trade application with a true mass of zero
effects, where Pretest or Lasso are preferable. In series regression contexts such as our log
wage example, finally, Lasso appears to perform consistently well.
6 Conclusion
The literature on machine learning is growing rapidly. Two common features of machine
learning algorithms are regularization and data driven choice of regularization parameters.
We study the properties of such procedures. We consider in particular the problem of
estimating many means µi based on observations Xi . This problem arises often in economic
applications. In such applications the “observations” Xi are usually equal to preliminary
least squares coefficient estimates, for instance of fixed effects.
Our goal in this paper is to provide some guidance for applied researchers: Which
estimation method should one choose in a given application? And how should one pick
regularization parameters? To the extent that researchers care about the squared error of
their estimates, procedures are preferable if they have lower mean squared error than the
competitors.
Based on our results, Ridge appears to dominate the alternatives considered when the
true effects µi are smoothly distributed, and there is no point mass of true zeros. This
appears to be the case in applications that estimate the effect of many units such as locations
or teachers, and applications that estimate effects for many subgroups. Pretest appears to
dominate if there are many true zeros, while the non-zero effects are well separated from
zero. This happens in some economic applications when there are fixed costs for agents who
engage in non-zero behavior. Lasso finally dominates for intermediate cases, and appears
to do well in particular for series regression.
Regarding the choice of regularization parameters, we prove a series of results which
show that data driven choices are almost optimal (in a uniform sense) for large-dimensional
problems. This is the case in particular for choices which minimize Stein’s Unbiased Risk
Estimate (SURE), when observations are normally distributed, and for Cross Validation
(CV), when repeated observations for a given effect are available.
There are of course some limitations to our analysis, which point toward interesting
future research. First, we focus on a restricted class of estimators, those which can be written in the component-wise shrinkage form $\hat\mu_i = m(X_i, \hat\lambda)$. This covers many estimators
of interest for economists, most notably Ridge, Lasso, and Pretest estimation. Many other
estimators in the machine learning literature do not have this tractable form, however, such
as for instance random forests or support vector machines. It would be desirable to gain a
better understanding of the risk properties of such estimators, as well.
Second, we focus on squared error loss for the joint estimation problem, $\frac{1}{n}\sum_i (\hat\mu_i - \mu_i)^2$.
This loss function is analytically quite convenient, and amenable to tractable results. Other
loss functions might be of practical interest, however, and might be studied using numerical
methods. In this context it is also worth emphasizing again that we were focusing on point
estimation, where all coefficients µi are simultaneously of interest. This is relevant for many
practical applications such as those discussed above. In other cases, however, one might
instead be interested in the estimates $\hat\mu_i$ solely as input for a lower-dimensional decision
problem, or in (frequentist) inference on the coefficients µi . Our analysis of mean squared
error does not directly speak to such questions.
Appendix
A.1 Proofs
Proof of Theorem 1:
\[
\begin{aligned}
R_n(m(\cdot, \lambda), P) &= \frac{1}{n} \sum_{i=1}^n E[(m(X_i, \lambda) - \mu_i)^2 \mid P_i] \\
&= E\big[(m(X_I, \lambda) - \mu_I)^2 \mid P\big] \\
&= E\big[ E[(m^*_P(X_I) - \mu_I)^2 \mid X_I, P] \mid P \big] + E\big[(m(X_I, \lambda) - m^*_P(X_I))^2 \mid P\big] \\
&= v^*_P + E\big[(m(X_I, \lambda) - m^*_P(X_I))^2 \mid P\big].
\end{aligned}
\]
Proof of Lemma 1: Notice that
\[
m_R(x, \lambda) - \mu_i = \frac{1}{1+\lambda}(x - \mu_i) - \frac{\lambda}{1+\lambda}\mu_i.
\]
The result for Ridge equals the second moment of this expression. For Pretest, notice that
\[
m_{PT}(x, \lambda) - \mu_i = \mathbf{1}(|x| > \lambda)(x - \mu_i) - \mathbf{1}(|x| \le \lambda)\mu_i.
\]
Therefore,
\[
R(m_{PT}(\cdot, \lambda), P_i) = E\big[(X_i - \mu_i)^2 \mathbf{1}(|X_i| > \lambda)\big] + \mu_i^2 \Pr\big(|X_i| \le \lambda\big). \tag{A.1}
\]
Using the fact that $\varphi'(v) = -v\varphi(v)$ and integrating by parts, we obtain
\[
\int_a^b v^2 \varphi(v)\, dv = \int_a^b \varphi(v)\, dv - \big[b\varphi(b) - a\varphi(a)\big] = \big[\Phi(b) - \Phi(a)\big] - \big[b\varphi(b) - a\varphi(a)\big].
\]
Now,
\[
\begin{aligned}
E\big[(X_i - \mu_i)^2 \mathbf{1}(|X_i| > \lambda)\big]
&= \sigma_i^2\, E\!\left[\left(\frac{X_i - \mu_i}{\sigma_i}\right)^2 \mathbf{1}(|X_i| > \lambda)\right] \\
&= \left(1 + \Phi\!\left(\frac{-\lambda - \mu_i}{\sigma_i}\right) - \Phi\!\left(\frac{\lambda - \mu_i}{\sigma_i}\right)\right)\sigma_i^2 \\
&\quad + \left(\frac{\lambda - \mu_i}{\sigma_i}\,\varphi\!\left(\frac{\lambda - \mu_i}{\sigma_i}\right) - \frac{-\lambda - \mu_i}{\sigma_i}\,\varphi\!\left(\frac{-\lambda - \mu_i}{\sigma_i}\right)\right)\sigma_i^2.
\end{aligned}
\tag{A.2}
\]
The result for the Pretest estimator now follows easily from equations (A.1) and (A.2). For Lasso, notice that
\[
\begin{aligned}
m_L(x, \lambda) - \mu_i &= \mathbf{1}(x < -\lambda)(x + \lambda - \mu_i) + \mathbf{1}(x > \lambda)(x - \lambda - \mu_i) - \mathbf{1}(|x| \le \lambda)\mu_i \\
&= \mathbf{1}(|x| > \lambda)(x - \mu_i) + \big(\mathbf{1}(x < -\lambda) - \mathbf{1}(x > \lambda)\big)\lambda - \mathbf{1}(|x| \le \lambda)\mu_i.
\end{aligned}
\]
Therefore,
\[
\begin{aligned}
R(m_L(\cdot, \lambda), P_i) &= E\big[(X_i - \mu_i)^2 \mathbf{1}(|X_i| > \lambda)\big] + \lambda^2 E[\mathbf{1}(|X_i| > \lambda)] + \mu_i^2 E[\mathbf{1}(|X_i| \le \lambda)] \\
&\quad + 2\lambda \Big( E\big[(X_i - \mu_i)\mathbf{1}(X_i < -\lambda)\big] - E\big[(X_i - \mu_i)\mathbf{1}(X_i > \lambda)\big] \Big) \\
&= R(m_{PT}(\cdot, \lambda), P_i) + \lambda^2 E[\mathbf{1}(|X_i| > \lambda)] \\
&\quad + 2\lambda \Big( E\big[(X_i - \mu_i)\mathbf{1}(X_i < -\lambda)\big] - E\big[(X_i - \mu_i)\mathbf{1}(X_i > \lambda)\big] \Big).
\end{aligned}
\tag{A.3}
\]
Notice that
\[
\int_a^b v \varphi(v)\, dv = \varphi(a) - \varphi(b).
\]
As a result,
\[
E\big[(X_i - \mu_i)\mathbf{1}(X_i < -\lambda)\big] - E\big[(X_i - \mu_i)\mathbf{1}(X_i > \lambda)\big] = -\sigma_i \left( \varphi\!\left(\frac{\lambda - \mu_i}{\sigma_i}\right) + \varphi\!\left(\frac{-\lambda - \mu_i}{\sigma_i}\right) \right). \tag{A.4}
\]
Now, the result for Lasso follows from equations (A.3) and (A.4).
Proof of Proposition 1: The results for Ridge are trivial. For Lasso, first notice that the integrated risk at zero is
\[
\bar R_0(m_L(\cdot, \lambda), \pi) = 2\Phi\!\left(\frac{-\lambda}{\sigma}\right)(\sigma^2 + \lambda^2) - 2\,\frac{\lambda}{\sigma}\,\varphi\!\left(\frac{\lambda}{\sigma}\right)\sigma^2.
\]
Next, notice that
\[
\int \Phi\!\left(\frac{-\lambda - \mu}{\sigma}\right) \frac{1}{\sigma_0}\varphi\!\left(\frac{\mu_0 - \mu}{\sigma_0}\right) d\mu = \Phi\!\left(\frac{-\lambda - \mu_0}{\sqrt{\sigma_0^2 + \sigma^2}}\right),
\]
\[
\int \Phi\!\left(\frac{\lambda - \mu}{\sigma}\right) \frac{1}{\sigma_0}\varphi\!\left(\frac{\mu_0 - \mu}{\sigma_0}\right) d\mu = \Phi\!\left(\frac{\lambda - \mu_0}{\sqrt{\sigma_0^2 + \sigma^2}}\right),
\]
\[
\int \frac{-\lambda - \mu}{\sigma}\, \varphi\!\left(\frac{\lambda - \mu}{\sigma}\right) \frac{1}{\sigma_0}\varphi\!\left(\frac{\mu_0 - \mu}{\sigma_0}\right) d\mu
= -\frac{1}{\sqrt{\sigma_0^2 + \sigma^2}}\, \varphi\!\left(\frac{\lambda - \mu_0}{\sqrt{\sigma_0^2 + \sigma^2}}\right) \left( \lambda + \frac{\mu_0\sigma^2 + \lambda\sigma_0^2}{\sigma_0^2 + \sigma^2} \right),
\]
\[
\int \frac{-\lambda + \mu}{\sigma}\, \varphi\!\left(\frac{-\lambda - \mu}{\sigma}\right) \frac{1}{\sigma_0}\varphi\!\left(\frac{\mu_0 - \mu}{\sigma_0}\right) d\mu
= -\frac{1}{\sqrt{\sigma_0^2 + \sigma^2}}\, \varphi\!\left(\frac{-\lambda - \mu_0}{\sqrt{\sigma_0^2 + \sigma^2}}\right) \left( \lambda - \frac{\mu_0\sigma^2 - \lambda\sigma_0^2}{\sigma_0^2 + \sigma^2} \right).
\]
The integrals involving $\mu^2$ are more involved. Let $v$ be a standard normal variable independent of $\mu$. Notice that
\[
\int \mu^2\, \Phi\!\left(\frac{\lambda - \mu}{\sigma}\right) \frac{1}{\sigma_0}\varphi\!\left(\frac{\mu - \mu_0}{\sigma_0}\right) d\mu
= \int \mu^2 \left( \int \mathbf{1}[v \le (\lambda - \mu)/\sigma]\, \varphi(v)\, dv \right) \frac{1}{\sigma_0}\varphi\!\left(\frac{\mu - \mu_0}{\sigma_0}\right) d\mu
= \int\!\!\int \mu^2\, \mathbf{1}[\mu \le \lambda - \sigma v]\, \frac{1}{\sigma_0}\varphi\!\left(\frac{\mu - \mu_0}{\sigma_0}\right) d\mu\, \varphi(v)\, dv.
\]
Using the change of variable $u = (\mu - \mu_0)/\sigma_0$, we obtain
\[
\begin{aligned}
\int \mu^2\, \mathbf{1}[\mu \le \lambda - \sigma v]\, \frac{1}{\sigma_0}\varphi\!\left(\frac{\mu - \mu_0}{\sigma_0}\right) d\mu
&= \int (\mu_0 + \sigma_0 u)^2\, \mathbf{1}[u \le (\lambda - \mu_0 - \sigma v)/\sigma_0]\, \varphi(u)\, du \\
&= \Phi\!\left(\frac{\lambda - \mu_0 - \sigma v}{\sigma_0}\right) \mu_0^2 - 2\varphi\!\left(\frac{\lambda - \mu_0 - \sigma v}{\sigma_0}\right)\sigma_0\mu_0 \\
&\quad + \left( \Phi\!\left(\frac{\lambda - \mu_0 - \sigma v}{\sigma_0}\right) - \frac{\lambda - \mu_0 - \sigma v}{\sigma_0}\,\varphi\!\left(\frac{\lambda - \mu_0 - \sigma v}{\sigma_0}\right) \right)\sigma_0^2 \\
&= \Phi\!\left(\frac{\lambda - \mu_0 - \sigma v}{\sigma_0}\right)(\mu_0^2 + \sigma_0^2) - \varphi\!\left(\frac{\lambda - \mu_0 - \sigma v}{\sigma_0}\right)\sigma_0(\lambda + \mu_0 - \sigma v).
\end{aligned}
\]
Therefore,
\[
\begin{aligned}
\int \mu^2\, \Phi\!\left(\frac{\lambda - \mu}{\sigma}\right) \frac{1}{\sigma_0}\varphi\!\left(\frac{\mu - \mu_0}{\sigma_0}\right) d\mu
&= \Phi\!\left(\frac{\lambda - \mu_0}{\sqrt{\sigma_0^2 + \sigma^2}}\right)(\mu_0^2 + \sigma_0^2)
- \frac{1}{\sqrt{\sigma_0^2 + \sigma^2}}\, \varphi\!\left(\frac{\lambda - \mu_0}{\sqrt{\sigma_0^2 + \sigma^2}}\right)(\lambda + \mu_0)\sigma_0^2 \\
&\quad + \sigma_0^2\sigma^2\, \frac{1}{\sigma_0^2 + \sigma^2}\, \frac{\lambda - \mu_0}{\sqrt{\sigma_0^2 + \sigma^2}}\, \varphi\!\left(\frac{\lambda - \mu_0}{\sqrt{\sigma_0^2 + \sigma^2}}\right).
\end{aligned}
\]
Similarly,
\[
\begin{aligned}
\int \mu^2\, \Phi\!\left(\frac{-\lambda - \mu}{\sigma}\right) \frac{1}{\sigma_0}\varphi\!\left(\frac{\mu - \mu_0}{\sigma_0}\right) d\mu
&= \Phi\!\left(\frac{-\lambda - \mu_0}{\sqrt{\sigma_0^2 + \sigma^2}}\right)(\mu_0^2 + \sigma_0^2)
- \frac{1}{\sqrt{\sigma_0^2 + \sigma^2}}\, \varphi\!\left(\frac{-\lambda - \mu_0}{\sqrt{\sigma_0^2 + \sigma^2}}\right)(-\lambda + \mu_0)\sigma_0^2 \\
&\quad + \sigma_0^2\sigma^2\, \frac{1}{\sigma_0^2 + \sigma^2}\, \frac{-\lambda - \mu_0}{\sqrt{\sigma_0^2 + \sigma^2}}\, \varphi\!\left(\frac{-\lambda - \mu_0}{\sqrt{\sigma_0^2 + \sigma^2}}\right).
\end{aligned}
\]
The integrated risk conditional on $\mu \neq 0$ is
\[
\begin{aligned}
\bar R_1(m_L(\cdot, \lambda), \pi) ={}&
\left(1 + \Phi\!\left(\frac{-\lambda - \mu_0}{\sqrt{\sigma_0^2 + \sigma^2}}\right) - \Phi\!\left(\frac{\lambda - \mu_0}{\sqrt{\sigma_0^2 + \sigma^2}}\right)\right)(\sigma^2 + \lambda^2) \\
&+ \left(\Phi\!\left(\frac{\lambda - \mu_0}{\sqrt{\sigma_0^2 + \sigma^2}}\right) - \Phi\!\left(\frac{-\lambda - \mu_0}{\sqrt{\sigma_0^2 + \sigma^2}}\right)\right)(\mu_0^2 + \sigma_0^2) \\
&- \frac{1}{\sqrt{\sigma_0^2 + \sigma^2}}\, \varphi\!\left(\frac{\lambda - \mu_0}{\sqrt{\sigma_0^2 + \sigma^2}}\right)(\lambda + \mu_0)(\sigma_0^2 + \sigma^2) \\
&- \frac{1}{\sqrt{\sigma_0^2 + \sigma^2}}\, \varphi\!\left(\frac{-\lambda - \mu_0}{\sqrt{\sigma_0^2 + \sigma^2}}\right)(\lambda - \mu_0)(\sigma_0^2 + \sigma^2).
\end{aligned}
\]
The results for Pretest follow from similar calculations.
The next lemma is used in the proof of Theorem 2.
Lemma A.1
For any two real-valued functions $f$ and $g$,
\[
\big| \inf f - \inf g \big| \le \sup |f - g|.
\]
Proof: The result of the lemma follows directly from
\[
\inf f \ge \inf g - \sup|f - g| \qquad \text{and} \qquad \inf g \ge \inf f - \sup|f - g|.
\]
Proof of Theorem 2: Because $\bar v_\pi$ does not depend on $\lambda$, we obtain
\[
L_n(\lambda) - L_n(\hat\lambda_n) - \big( r_n(\lambda) - r_n(\hat\lambda_n) \big)
= \big( L_n(\lambda) - \bar R_\pi(\lambda) \big) - \big( L_n(\hat\lambda_n) - \bar R_\pi(\hat\lambda_n) \big)
+ \big( \bar r_\pi(\lambda) - r_n(\lambda) \big) - \big( \bar r_\pi(\hat\lambda_n) - r_n(\hat\lambda_n) \big).
\]
Applying Lemma A.1, we obtain
\[
\Big| \inf_{\lambda \in [0,\infty]} \big( L_n(\lambda) - L_n(\hat\lambda_n) \big) - \inf_{\lambda \in [0,\infty]} \big( r_n(\lambda) - r_n(\hat\lambda_n) \big) \Big|
\le 2 \sup_{\lambda \in [0,\infty]} \big| L_n(\lambda) - \bar R_\pi(\lambda) \big| + 2 \sup_{\lambda \in [0,\infty]} \big| \bar r_\pi(\lambda) - r_n(\lambda) \big|.
\]
Given that $\hat\lambda_n$ is the value of $\lambda$ at which $r_n(\lambda)$ attains its minimum, the result of the theorem follows.
The following preliminary lemma will be used in the proof of Theorem 3.
Lemma A.2
For any finite set of regularization parameters $0 = \lambda_0 < \ldots < \lambda_k = \infty$, let
\[
u_j = \sup_{\lambda \in [\lambda_{j-1}, \lambda_j]} L(\lambda), \qquad
l_j = \inf_{\lambda \in [\lambda_{j-1}, \lambda_j]} L(\lambda),
\]
where $L(\lambda) = (\mu - m(X, \lambda))^2$. Suppose that for any $\epsilon > 0$ there is a finite set of regularization parameters $0 = \lambda_0 < \ldots < \lambda_k = \infty$ (where $k$ may depend on $\epsilon$) such that
\[
\sup_{\pi \in \Pi} \max_{1 \le j \le k} E_\pi[u_j - l_j] \le \epsilon \tag{A.5}
\]
and
\[
\sup_{\pi \in \Pi} \max_{1 \le j \le k} \max\{\operatorname{var}_\pi(l_j), \operatorname{var}_\pi(u_j)\} < \infty. \tag{A.6}
\]
Then, equation (13) holds.
Proof: We will use En to indicate averages over (µ1, X1), . . . , (µn, Xn). Let λ ∈ [λj−1, λj]. By construction,

    En[L(λ)] − Eπ[L(λ)] ≤ En[uj] − Eπ[lj] ≤ En[uj] − Eπ[uj] + Eπ[uj − lj],
    En[L(λ)] − Eπ[L(λ)] ≥ En[lj] − Eπ[uj] ≥ En[lj] − Eπ[lj] − Eπ[uj − lj],

and thus

    sup_{λ∈[0,∞]} ( En[L(λ)] − Eπ[L(λ)] )²
        ≤ max_{1≤j≤k} max{ (En[uj] − Eπ[uj])², (En[lj] − Eπ[lj])² } + ( max_{1≤j≤k} Eπ[uj − lj] )²
            + 2 max_{1≤j≤k} max{ |En[uj] − Eπ[uj]|, |En[lj] − Eπ[lj]| } max_{1≤j≤k} Eπ[uj − lj]
        ≤ Σ_{j=1}^k ( (En[uj] − Eπ[uj])² + (En[lj] − Eπ[lj])² ) + ε²
            + 2ε Σ_{j=1}^k ( |En[uj] − Eπ[uj]| + |En[lj] − Eπ[lj]| ).

Therefore,

    Eπ[ sup_{λ∈[0,∞]} ( En[L(λ)] − Eπ[L(λ)] )² ]
        ≤ Σ_{j=1}^k ( Eπ[(En[uj] − Eπ[uj])²] + Eπ[(En[lj] − Eπ[lj])²] ) + ε²
            + 2ε Σ_{j=1}^k Eπ[ |En[uj] − Eπ[uj]| + |En[lj] − Eπ[lj]| ]
        ≤ Σ_{j=1}^k ( varπ(uj)/n + varπ(lj)/n ) + ε² + 2ε Σ_{j=1}^k ( √(varπ(uj)/n) + √(varπ(lj)/n) ).
Now, the result of the lemma follows from the assumption of uniformly bounded variances.
Proof of Theorem 3: We will show that the conditions of the theorem imply equations (A.5) and (A.6)
and, therefore, the uniform convergence result in equation (13). Using conditions 1 and 2, along with
the convexity of 4-th powers we immediately get bounded variances. Because the maximum of a convex
function is achieved at the boundary,
    varπ(uj) ≤ Eπ[uj²] ≤ Eπ[max{(X − µ)⁴, µ⁴}] ≤ Eπ[(X − µ)⁴] + Eπ[µ⁴].
Notice also that
    varπ(lj) ≤ Eπ[lj²] ≤ Eπ[uj²].
Now, condition 3 implies equation (A.6) in Lemma A.2.
It remains to find a set of regularization parameters such that Eπ[uj − lj] < ε for all j. Using again
monotonicity of m(X, λ) in λ and convexity of the square function, we have that the supremum defining
uj is achieved at the boundary,
uj = max{L(λj−1 ), L(λj )},
while
lj = min{L(λj−1 ), L(λj )}
if µ ∉ [m(X, λj−1), m(X, λj)], and lj = 0 otherwise. In the former case,
    uj − lj = |L(λj) − L(λj−1)|,
in the latter case uj − lj = max{L(λj−1), L(λj)}. Consider first the case µ ∉ [m(X, λj−1), m(X, λj)].
Using the formula a² − b² = (a + b)(a − b) and the shorthand mj = m(X, λj), we obtain
    uj − lj = |(mj − µ)² − (mj−1 − µ)²| = |(mj − µ) + (mj−1 − µ)| |mj − mj−1| ≤ (|mj − µ| + |mj−1 − µ|) |mj − mj−1|.
To check that the same bound applies to the case µ ∈ [m(X, λj−1 ), m(X, λj )] notice that
max {|mj − µ|, |mj−1 − µ|} ≤ |mj − µ| + |mj−1 − µ|
and, because µ ∈ [m(X, λj−1 ), m(X, λj )],
max {|mj − µ|, |mj−1 − µ|} ≤ |mj − mj−1 |.
Monotonicity, boundary conditions, and the convexity of absolute values allow us to bound further,
uj − lj ≤ 2(|X − µ| + |µ|)|mj − mj−1 |.
Now, condition 4 in Theorem 3 implies equation (A.5) in Lemma A.2 and, therefore, the result of the
theorem.
Proof of Lemma 2: Conditions 1 and 2 of Theorem 3 are easily verified to hold for ridge, lasso, and the
pretest estimator. Let us thus discuss condition 4.
Let ∆mj = m(X, λj ) − m(X, λj−1 ) and ∆λj = λj − λj−1 . For ridge, ∆mj is given by
    ∆mj = ( 1/(1 + λj) − 1/(1 + λj−1) ) X,
so that the requirement follows from finite variances if we choose a finite set of regularization parameters such that
    sup_{π∈Π} Eπ[(|X − µ| + |µ|) |X|] | 1/(1 + λj) − 1/(1 + λj−1) | < ε
for all j = 1, . . . , k, which is possible by the uniformly bounded moments condition.
For lasso, notice that |∆mk| = (|X| − λk−1) 1(|X| > λk−1) ≤ |X| 1(|X| > λk−1), and |∆mj| ≤ ∆λj for j = 1, . . . , k − 1. We will first verify that for any ε > 0 there is a finite λk−1 such that condition 4 of the lemma holds for j = k. Notice that for any pair of non-negative random variables (ξ, ζ) such that E[ξζ] < ∞, and for any positive constant c, we have that
    E[ξζ] ≥ E[ξζ 1(ζ > c)] ≥ c E[ξ 1(ζ > c)]
and, therefore,
    E[ξ 1(ζ > c)] ≤ E[ξζ] / c.
As a consequence of this inequality, and because sup_{π∈Π} Eπ[(|X − µ| + |µ|)|X|²] < ∞ (implied by condition 3), for any ε > 0 there exists a finite positive constant λk−1 such that condition 4 of the lemma holds for j = k. Given that λk−1 is finite, sup_{π∈Π} Eπ[|X − µ| + |µ|] < ∞ and |∆mj| ≤ ∆λj imply condition 4 for j = 1, . . . , k − 1.
For pretest,
    |∆mj| = |X| 1(|X| ∈ (λj−1, λj]),
so that we require that for any ε > 0 we can find a finite number of regularization parameters, 0 = λ0 < λ1 < . . . < λk−1 < λk = ∞, such that
    Eπ[(|X − µ| + |µ|) |X| 1(|X| ∈ (λj−1, λj])] < ε
for j = 1, . . . , k. Applying the Cauchy-Schwarz inequality and uniform boundedness of fourth moments, this condition is satisfied if we can make Pπ(|X| ∈ (λj−1, λj]) uniformly small, which is possible under the assumption that X is continuously distributed with a (version of the) density that is uniformly bounded.
Proof of Corollary 1: From Theorem 2 and Lemma A.1, it follows immediately that

    sup_{π∈Π} Pπ( Ln(λ̂n) − inf_{λ∈[0,∞]} R̄π(λ) > ε ) → 0    for all ε > 0.

By definition,

    R̄(m(·, λ̂n), π) = Eπ[Ln(λ̂n)].

Equation (14) thus follows if we can strengthen uniform convergence in probability to uniform L1 convergence. To do so, we need to show uniform integrability of Ln(λ̂n), as per Theorem 2.20 in van der Vaart (2000). Monotonicity, convexity of the loss, and boundary conditions imply

    Ln(λ̂n) ≤ (1/n) Σ_{i=1}^n ( µi² + (Xi − µi)² ).

Uniform integrability along arbitrary sequences πn, and thus L1 convergence, follows from the assumed bounds on moments.
Proof of Lemma 3: Recall the definition R̄(m(·), π) = Eπ[(m(X) − µ)²]. Expanding the square yields

    Eπ[(m(X) − µ)²] = Eπ[(m(X) − X + X − µ)²]
        = Eπ[(X − µ)²] + Eπ[(m(X) − X)²] + 2 Eπ[(X − µ)(m(X) − X)].

By the form of the standard normal density,

    ∇x ϕ(x − µ) = −(x − µ) ϕ(x − µ).

Partial integration over the intervals ]xj, xj+1[ (where we let x0 = −∞ and xJ+1 = ∞) yields

    Eπ[(X − µ)(m(X) − X)] = ∫∫ (x − µ)(m(x) − x) ϕ(x − µ) dx dπ(µ)
        = − Σ_{j=0}^J ∫ ( ∫_{xj}^{xj+1} (m(x) − x) ∇x ϕ(x − µ) dx ) dπ(µ)
        = Σ_{j=0}^J ∫ ( ∫_{xj}^{xj+1} (∇m(x) − 1) ϕ(x − µ) dx
              + lim_{x↓xj} (m(x) − x) ϕ(x − µ) − lim_{x↑xj+1} (m(x) − x) ϕ(x − µ) ) dπ(µ)
        = Eπ[∇m(X)] − 1 + Σ_{j=1}^J ∆mj f(xj).

Substituting this expression into the expansion of Eπ[(m(X) − µ)²] above yields the result.
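To illustrate the role of the jump terms (a sketch in our own notation, with σ normalized to one and a single fixed µ), take the pretest estimator m(x) = x 1(|x| > λ): it has ∇m(x) = 1(|x| > λ) away from ±λ and jumps of size λ at x = −λ and at x = λ, so the display predicts E[(X − µ)(m(X) − X)] = P(|X| > λ) − 1 + λ(f(−λ) + f(λ)), with f the density of X.

```python
import numpy as np
from scipy.stats import norm

mu, lam, n = 0.8, 1.5, 4_000_000
rng = np.random.default_rng(0)
x = rng.normal(mu, 1.0, n)                 # X ~ N(mu, 1); sigma normalized to one
m = x * (np.abs(x) > lam)                  # pretest (hard-threshold) estimator

lhs = np.mean((x - mu) * (m - x))          # Monte Carlo estimate of E[(X - mu)(m(X) - X)]

f = lambda t: norm.pdf(t - mu)             # density of X at the jump points
rhs = (1 - norm.cdf(lam - mu) + norm.cdf(-lam - mu)) - 1 + lam * (f(-lam) + f(lam))

print(lhs, rhs)   # the two values should agree up to Monte Carlo error
```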
Proof of Lemma 4: Uniform convergence of the first term follows by the exact same arguments we used
to show uniform convergence of Ln (λ) to R̄π (λ) in Theorem 3. We thus focus on the second term, and
discuss its convergence on a case-by-case basis for our leading examples.
For Ridge, this second term is equal to the constant
    2 ∇x mR(x, λ) = 2/(1 + λ),
and uniform convergence holds trivially.
For Lasso, the second term is equal to
2 En [∇x mL (X, λ)] = 2 Pn (|X| > λ).
To prove uniform convergence of this term we slightly modify the proof of the Glivenko-Cantelli Theorem
(e.g., van der Vaart (2000), Theorem 19.1). Let Fn be the cumulative distribution function of X1 , . . . , Xn ,
and let Fπ be its population counterpart. It is enough to prove uniform convergence of Fn (λ),
    sup_{π∈Π} Pπ( sup_{λ∈[0,∞]} |Fn(λ) − Fπ(λ)| > ε ) → 0    for all ε > 0.

Using Chebyshev's inequality and sup_{π∈Π} varπ(1(X ≤ λ)) ≤ 1/4 for every λ ∈ [0, ∞], we obtain

    sup_{π∈Π} Pπ( |Fn(λ) − Fπ(λ)| > ε ) → 0

for every λ ∈ [0, ∞] and every ε > 0. Next, we will establish that for any ε > 0, it is possible to find a finite set of regularization parameters 0 = λ0 < λ1 < · · · < λk = ∞ such that

    sup_{π∈Π} max_{1≤j≤k} { Fπ(λj) − Fπ(λj−1) } < ε.
This assertion follows from the fact that fπ (x) is uniformly bounded by ϕ(0). The rest of the proof proceeds
as in the proof of Theorem 19.1 in van der Vaart (2000).
Let us finally turn to pre-testing. The objective function for pre-testing is equal to the one for Lasso, plus
additional terms for the jumps at ±λ; the penalty term equals

    2 Pn(|X| > λ) + 2λ ( f̂(−λ) + f̂(λ) ).

Uniform convergence of the SURE criterion for pre-testing thus holds if (i) the conditions for Lasso are satisfied, and (ii) we have a uniformly consistent estimator of |x| f(x).
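For concreteness, the following sketch (our own code, with σi = 1 so that Eπ[(X − µ)²] = 1 and the µi chosen arbitrarily) evaluates a SURE-type criterion for lasso, (1/n) Σi (mL(Xi, λ) − Xi)² + 2 Pn(|X| > λ) − 1, and compares its average across simulated samples with the average realized loss; up to simulation noise the two curves should track each other.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_rep = 500, 200
# assumed parameter vector: half zeros, half drawn away from zero
mu = np.concatenate([np.zeros(n // 2), rng.normal(2.0, 1.0, n - n // 2)])
lams = np.linspace(0.0, 3.0, 31)

sure_avg = np.zeros_like(lams)
loss_avg = np.zeros_like(lams)
for _ in range(n_rep):
    x = mu + rng.normal(0.0, 1.0, n)                        # X_i ~ N(mu_i, 1)
    for k, lam in enumerate(lams):
        m = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)   # lasso / soft threshold
        sure_avg[k] += np.mean((m - x) ** 2) + 2 * np.mean(np.abs(x) > lam) - 1
        loss_avg[k] += np.mean((m - mu) ** 2)
sure_avg /= n_rep
loss_avg /= n_rep

for lam, s, l in zip(lams[::10], sure_avg[::10], loss_avg[::10]):
    print(f"lambda={lam:.1f}  mean SURE={s:.3f}  mean loss={l:.3f}")
```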
A.2 Tables and figures
Table 1: Optimal regularization parameter and minimal value of the SURE criterion

                    location effects      arms trade          returns to education
                    (n = 595)             (n = 214)           (n = 65)
                    R̂        λ̂*          R̂        λ̂*        R̂        λ̂*
    Ridge           0.29      2.44         0.50      0.98       1.00      0.01
    Lasso           0.32      1.34         0.06      1.50       0.84      0.59
    Pre-test        0.41      5.00        −0.02      2.38       0.93      1.14
For each of the applications considered and for each of the estimators on which we focus, this table shows the value λ̂* that minimizes the risk as estimated by SURE, together with the corresponding minimal value R̂ of the SURE criterion. This allows us to compare which of the estimators appears to work best in each application.
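A minimal sketch of how numbers of this form can be computed (our own code, not the authors' implementation; it assumes a vector x of estimates standardized to unit variance, a grid of candidate values of λ, and a Gaussian kernel density estimate for the pretest correction term):

```python
import numpy as np
from scipy.stats import gaussian_kde

def sure_table(x, lams):
    """Return (minimal SURE, minimizing lambda) for ridge, lasso, and pretest."""
    fhat = gaussian_kde(x)                  # density estimate entering the pretest penalty
    out = {}
    for name in ("ridge", "lasso", "pretest"):
        crit = []
        for lam in lams:
            if name == "ridge":
                m = x / (1 + lam)
                penalty = 2.0 / (1 + lam)
            elif name == "lasso":
                m = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
                penalty = 2.0 * np.mean(np.abs(x) > lam)
            else:  # pretest
                m = x * (np.abs(x) > lam)
                penalty = (2.0 * np.mean(np.abs(x) > lam)
                           + 2.0 * lam * (fhat(-lam)[0] + fhat(lam)[0]))
            crit.append(np.mean((m - x) ** 2) + penalty - 1.0)
        crit = np.asarray(crit)
        out[name] = (crit.min(), lams[crit.argmin()])
    return out

# simulated data standing in for the standardized estimates of an application
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 400), rng.normal(3, 1.5, 195)])
print(sure_table(x, np.linspace(0.0, 5.0, 101)))
```

With real data, x would be replaced by the standardized estimates from the application, and the reported pairs correspond to the entries R̂ and λ̂* above.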
Figure 1: Estimators
[Figure: the componentwise estimators m plotted against X, for X between −8 and 8; legend: Ridge, Pretest, Lasso.]
This graph plots mR (x, λ), mL (x, λ), and mP T (x, λ) as functions of x. The regularization parameters
chosen are λ = 1 for Ridge, λ = 2 for the Lasso estimator, and λ = 4 for the Pretest estimator.
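The three estimators plotted here are simple componentwise functions of x; a minimal sketch (our own code, using the same regularization parameters as in the figure):

```python
import numpy as np

def m_ridge(x, lam):
    return x / (1.0 + lam)

def m_lasso(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def m_pretest(x, lam):
    return np.where(np.abs(x) > lam, x, 0.0)

x = np.linspace(-8, 8, 9)
print(np.round(m_ridge(x, 1.0), 2))    # lambda = 1 for Ridge, as in Figure 1
print(np.round(m_lasso(x, 2.0), 2))    # lambda = 2 for Lasso
print(np.round(m_pretest(x, 4.0), 2))  # lambda = 4 for Pretest
```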
Figure 2: Componentwise risk functions
[Figure: componentwise risk R plotted against µ, for µ between −6 and 6; legend: Ridge, Pretest, Lasso, m(x) = x.]
These graphs show componentwise risk functions Ri as a function of µ, assuming that σi² = 1, for the componentwise estimators displayed in Figure 1.
Figure 3: Risk for estimators in spike and normal setting
[Figure: integrated risk plotted as a surface over µ0 and σ0 (each ranging from 0 to 4), for p = 0.00, 0.25, 0.50, and 0.75; within each value of p, separate panels for ridge, LASSO, pretest, and the optimal shrinkage function m*.]
Figure 4: Best estimator in spike and normal setting
[Figure: regions of the (µ0, σ0) parameter space (each axis ranging from 0 to 5), one panel for each of p = 0.00, 0.25, 0.50, and 0.75.]
This figure compares integrated risk values attained by Ridge, Lasso, and Pretest for different parameter values of the spike and normal specification in Section 3.2.2. Blue circles are placed at parameter values for which Ridge minimizes integrated risk, green crosses at values for which Lasso minimizes integrated risk, and red dots at values for which Pretest minimizes integrated risk.
Figure 5: Location effects on intergenerational mobility: SURE Estimates
[Figure: SURE(λ) plotted against λ between 0 and 5; legend: ridge, lasso, pretest.]
This figure plots the SURE criterion as a function of λ for Ridge, Lasso, and Pretest. The estimated λ̂* for each of these is chosen as the value minimizing SURE. In this application, Ridge has the lowest estimated risk, which is attained at λ = 2.44.
Figure 6: Location effects on intergenerational mobility: Shrinkage Estimators
[Figure: top panel, shrinkage estimators m̂(x) plotted against x between −3 and 3; bottom panel, kernel estimate f̂(x) of the density of X.]
The first panel shows the optimal shrinkage estimator (solid line) along with the Ridge, Lasso, and Pretest
estimators (dashed lines) evaluated at SURE minimizing values of the regularization parameters. The
Ridge estimator is linear, with positive slope equal to 0.29. Lasso is piecewise linear, with kinks at the
positive and negative versions of the SURE minimizing value of the regularization parameter, λ = 1.34.
Pretest is flat at zero, because SURE is minimized for values of λ higher than the maximum absolute value
of X1 , . . . , Xn . The second panel shows a kernel estimate of the distribution of X. In this application,
Ridge has the lowest estimated risk.
Figure 7: Arms Event Study: SURE Estimates
[Figure: SURE(λ) plotted against λ between 0 and 5; legend: ridge, lasso, pretest.]
This figure plots the SURE criterion as a function of λ for Ridge, Lasso, and Pretest. The estimated λ̂* for each of these is chosen as the value minimizing SURE. In this application, Pretest has the lowest estimated risk, which is attained at λ = 2.38.
Figure 8: Arms Event Study: Shrinkage Estimators
[Figure: top panel, shrinkage estimators m̂(x) plotted against x between −3 and 3; bottom panel, kernel estimate f̂(x) of the density of X.]
The first panel shows the optimal shrinkage estimator (solid line) along with the Ridge, Lasso, and Pretest estimators (dashed lines) evaluated at SURE minimizing values of the regularization parameters. The Ridge estimator is linear, with positive slope equal to 0.50. Lasso is piecewise linear, with kinks at the positive and negative versions of the SURE minimizing value of the regularization parameter, λ̂L,n = 1.50. Pretest is discontinuous at λ̂PT,n = ±2.39. In this application, Pretest has the lowest estimated risk.
Figure 9: Mincer series regression: SURE Estimates
[Figure: SURE(λ) plotted against λ between 0 and 5; legend: ridge, lasso, pretest.]
This figure plots the SURE criterion as a function of λ for Ridge, Lasso, and Pretest. The estimated λ̂* for each of these is chosen as the value minimizing SURE. In this application, Lasso has the lowest estimated risk, which is attained at λ = 0.59.
Figure 10: Mincer series regression: Shrinkage Estimators
[Figure: top panel, shrinkage estimators m̂(x) plotted against x between −3 and 3; bottom panel, kernel estimate f̂(x) of the density of X.]
The first panel shows the optimal shrinkage estimator (solid line) along with the Ridge, Lasso, and Pretest estimators (dashed lines) evaluated at SURE minimizing values of the regularization parameters. The Ridge estimator is linear, with positive slope equal to 0.99. Lasso is piecewise linear, with kinks at the positive and negative versions of the SURE minimizing value of the regularization parameter, λ̂L,n = 0.59. Pretest is discontinuous at λ̂PT,n = ±1.14. In this application, Lasso has the lowest estimated risk.
Figure 11: Spike and normal: True and Estimated Risk
[Figure: estimated risk (SURE) and true risk plotted against λ between 0 and 3, with the curves for ridge, lasso, and pretest labeled.]
Figure 12: Spike and normal: Shrinkage Estimators
[Figure: left column, shrinkage estimators m̂(x) and a kernel estimate f̂(x) of the density of X; right column, the optimal shrinkage function m*(x) and the density f(x) of X; x ranges from −4 to 4.]
References
Abowd, J. M., F. Kramarz, and D. N. Margolis (1999). High wage workers and high wage firms. Econometrica 67 (2), 251–333.
Abrams, D., M. Bertrand, and S. Mullainathan (2012). Do judges vary in their treatment of race? Journal
of Legal Studies 41 (2), 347–383.
Athey, S. and G. Imbens (2015, July). Summer institute 2015 methods lectures.
Belloni, A. and V. Chernozhukov (2011). High dimensional sparse econometric models: An introduction.
Springer.
Brown, L. D. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary value problems.
The Annals of Mathematical Statistics, 855–903.
Card, D., J. Heining, and P. Kline (2012). Workplace heterogeneity and the rise of West German wage
inequality. NBER working paper (w18522).
Chetty, R., J. N. Friedman, and J. E. Rockoff (2014). Measuring the impacts of teachers II: Teacher
value-added and student outcomes in adulthood. The American Economic Review 104 (9), 2633–2679.
Chetty, R. and N. Hendren (2015). The impacts of neighborhoods on intergenerational mobility: Childhood
exposure effects and county-level estimates.
DellaVigna, S. and E. La Ferrara (2010). Detecting illegal arms trade. American Economic Journal:
Economic Policy 2 (4), 26–57.
Efron, B. (2010). Large-scale inference: empirical Bayes methods for estimation, testing, and prediction,
Volume 1. Cambridge University Press.
Efron, B. and C. Morris (1973). Stein’s estimation rule and its competitors—an empirical Bayes approach.
Journal of the American Statistical Association 68 (341), 117–130.
Friedman, J., T. Hastie, and R. Tibshirani (2001). The elements of statistical learning, Volume 1. Springer Series in Statistics. Springer, Berlin.
James, W. and C. Stein (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, pp. 361–379.
Kleinberg, J., J. Ludwig, S. Mullainathan, and Z. Obermeyer (2015). Prediction policy problems. American
Economic Review 105 (5), 491–95.
Krueger, A. (1999). Experimental estimates of education production functions. The Quarterly Journal of
Economics 114 (2), 497–532.
Leeb, H. and B. M. Pötscher (2005). Model selection and inference: Facts and fiction. Econometric
Theory 21 (1), 21–59.
Morris, C. N. (1983). Parametric empirical Bayes inference: Theory and applications. Journal of the American Statistical Association 78 (381), 47–55.
Newey, W. K. (1997). Convergence rates and asymptotic normality for series estimators. Journal of
Econometrics 79 (1), 147–168.
Robbins, H. (1956). An empirical Bayes approach to statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics.
The Regents of the University of California.
Robbins, H. (1964). The empirical Bayes approach to statistical decision problems. The Annals of Mathematical Statistics, 1–20.
Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of
Statistics 9 (6), 1135–1151.
Stigler, S. M. (1990). The 1988 Neyman memorial lecture: a Galtonian perspective on shrinkage estimators.
Statistical Science, 147–155.
Stock, J. H. and M. W. Watson (2012). Generalized shrinkage methods for forecasting using many predictors. Journal of Business & Economic Statistics 30 (4), 481–493.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical
Society. Series B (Methodological), 267–288.
van der Vaart, A. (2000). Asymptotic statistics. Cambridge University Press.
Wasserman, L. (2006). All of nonparametric statistics. Springer Science & Business Media.
Zhang, C.-H. (2003). Compound decision theory and empirical Bayes methods: invited paper. The Annals
of Statistics 31 (2), 379–390.