The Risk of Machine Learning
(preliminary and incomplete)

Alberto Abadie (Harvard University)    Maximilian Kasy (Harvard University)

April 26, 2016

Abstract

Applied economists often simultaneously estimate a large number of parameters of interest, for instance treatment effects for many treatment values (teachers, locations...), treatment effects for many treated groups, and prediction models with many regressors. To avoid overfitting in such settings, machine learning estimators (such as Ridge, Lasso, and Pretesting) combine (i) regularized estimation and (ii) data-driven choice of regularization parameters. We aim to provide guidance for applied researchers choosing between such estimators, assuming they are interested in precise estimates of many parameters. We first characterize the risk (mean squared error) of regularized estimators and analytically compare their performance. We then show that data-driven choices of regularization parameters (using Stein's unbiased risk estimate or cross-validation) yield estimators with risk uniformly close to that of the optimal choice. We apply our results using data from the literature on the causal effect of locations on intergenerational mobility, on illegal trading by arms companies with conflict countries under an embargo, and on series regression estimates of the effect of education on earnings. In these applications, the relative performance of alternative estimators corresponds to that suggested by our results.

Alberto Abadie, John F. Kennedy School of Government, Harvard University, alberto_abadie@harvard.edu. Maximilian Kasy, Department of Economics, Harvard University, maximiliankasy@fas.harvard.edu. We thank Gary Chamberlain and seminar participants at Harvard for helpful comments and discussions.

1 Introduction

Applied economists often face the problem of estimating high-dimensional parameters. Examples include (i) the estimation of causal (or predictive) effects for a large number of units such as neighborhoods or cities, teachers, workers and firms, or judges, (ii) estimation of the causal effect of a given treatment for a large number of subgroups defined by observable covariates, and (iii) prediction problems with a large number of predictive covariates or transformations of covariates. The machine learning literature provides a host of estimation methods for such high-dimensional problems, including Ridge, Lasso, Pretest (or post-Lasso), and many others. The applied researcher faces the question of which of these procedures to pick in a given situation. We aim to provide some guidance, based on a characterization and comparison of the risk properties (in particular, mean squared error) of such procedures.

A general concern that motivates machine learning procedures is the need to avoid "generalization error," or over-fitting. The choice of estimation procedure involves a variance-bias trade-off, which depends on the data generating process: more flexible procedures tend to have lower bias, but at the price of a larger variance. In order to address this concern, most machine learning procedures for "supervised learning" (known to economists as regression analysis) involve two key features, (i) regularized estimation and (ii) data-driven choice of regularization parameters. These features are also central to well-established methods of non-parametric estimation. In this paper, we consider the canonical problem of estimating many means, which, after some transformations, covers all the applications mentioned above.
We consider componentwise estimators of the form µ bi = m(Xi , λ), where λ is a regularization parameter and Xi is observed. Our article is structured according to the mentioned two features of machine learning procedures. We first focus on feature (i) and study the risk properties of regularized estimators with fixed or oracle-optimal regularization parameters. We show that there is an (infeasible) optimal regularized estimator, for any given data generating process. This estimator has the form of a posterior mean, using a prior distribution for 1 the parameters of interest which is equal to the (unknown) empirical distribution across components. The optimal regularized estimator is useful to characterize the risk properties of machine learning estimators. It turns out that, in our setting, the risk function of any regularized estimator can be expressed as a function of the distance between that regularized estimator and the optimal one. We then turn to a family of parametric models for the distribution of parameters of interest. These models assume a certain fraction of true zeros, corresponding to the notion of sparsity, while the remaining parameters are normally distributed around some grand mean. For these parametric models we provide analytic risk functions. These analytic risk functions allow for an intuitive discussion of the relative performance of alternative estimators: (1) When there is no point-mass of true zeros, or only a small fraction of zeros, Ridge tends to perform better than either Lasso or the Pretest estimator. (2) When there is a sizable share of true zeros, the ranking of estimators depends on the distribution of the remaining parameters. (a) If the non-zero parameters are smoothly distributed in a vicinity of zero, Ridge still performs best. (b) If the non-zero parameters are very dispersed, Lasso tends to do comparatively well. (c) If the non-zero parameters are well-separated from zero, Pretest estimation tends to do well. We see these conclusions confirmed in the empirical applications discussed below. The next part of the article turns to feature (ii) of machine learning estimators and studies the data-driven choice of regularization parameters. Such estimators choose regularization parameters by minimizing a criterion function which estimates risk. Ideally, such estimators would have a risk function that is uniformly close to the risk function of the infeasible estimator using an oracle-optimal regularization parameter. We show this can be achieved under fairly mild conditions whenever the dimension of the problem under consideration is large. The key condition is that the criterion function is uniformly consistent for the true risk function. We further provide fairly weak conditions for Stein’s unbiased risk estimate (SURE) and for cross-validation (CV) to satisfy this requirement. Our results are then illustrated in the context of three applications taken from the 2 empirical economics literature. The first application uses data from Chetty and Hendren (2015) to study the effects of locations on intergenerational earnings mobility of children. The second application uses data from the event-study analysis in DellaVigna and La Ferrara (2010) who study if the stock prices of weapon-producing companies react to changes in the intensity of conflicts in countries under an arms trade embargo. The third application considers series regressions of earnings on education, using CPS data. 
The presence of many neighborhoods in the first application, many weapon producing companies in the second one, and many series regression terms in the third one makes these estimation problems high-dimensional. As suggested by intuition, we find that Ridge performs very well for the location effects application, reflecting the smooth and approximately normal distribution of true location effects. Pretest estimation performs well for the arms trade application, reflecting the fact that a significant fraction of arms firms do not violate arms embargoes. The rest of this article is structured as follows. Section 2 introduces our setup, the canonical problem of estimating a vector of means under quadratic loss. Section 2.1 discusses a series of examples from empirical economics which are covered by our setup. Having introduced the setup, Section 2.2 discusses it in the context of the machine learning literature, and of the older literature on estimation of normal means. Section 3 provides characterizations of the risk function of regularized estimators in our setting. We derive a general characterization in Section 3.1 and 3.2, and analytic formulas in a spike-and-normal model in Section 3.3. These characterizations allow for a comparison of the mean squared error of alternative procedures, and yield recommendations for the choice of an estimator. Section 4 turns to data-driven choices of regularization parameters. We show uniform risk consistency results for Stein’s unbiased risk estimate and for cross-validation. Section 5 discusses Monte Carlo simulations and several empirical applications. Section 6 concludes. The appendix contains all proofs. 3 2 Setup Throughout this paper, we consider the following setting. We observe a realization of an n-vector of random variables, X = (X1 , . . . , Xn )0 , where the components of X are mutually independent with finite mean µi and finite variance σi2 , for i = 1, . . . , n. Our goal is to estimate µ1 , . . . , µn . In many applications, the Xi arise as preliminary least squares estimates of the coefficients of interest µi . Consider for instance a randomized control trial where randomization of treatment assignment is carried out separately for n non-overlapping subgroups. Within each subgroup, the difference in the sample averages between treated and control units, Xi , has mean equal to the average treatment effect for that group in the population, µi . Further examples are discussed in Section 2.1 below. We restrict our attention to component-wise estimators of µi , µ bi = m(Xi , λ), (1) where m : R × [0, ∞] 7→ R defines an estimator of µi as a function of Xi and a non-negative regularization parameter, λ. The parameter λ is common across the components i but b in Section 4 below, focusing might depend on the vector X; we study data-driven choices λ in particular on Stein’s unbiased risk estimate (SURE) and cross-validation (CV). Popular estimators of this component-wise form are Ridge, Lasso, and Pretest estimators. These are defined as follows: mR (x, λ) = argmin (x − m)2 + λm2 (Ridge) m∈R = 1 x, 1+λ mL (x, λ) = argmin (x − m)2 + 2λ|m| (Lasso) m∈R = 1(x < −λ)(x + λ) + 1(x > λ)(x − λ), mP T (x, λ) = argmin (x − m)2 + λ2 1(m 6= 0) m∈R = 1(|x| > λ)x, 4 (Pretest) where 1(A) denotes the indicator function which equals 1 if A holds, and equals 0 else. Figure 1 plots mR (x, λ), mL (x, λ), and mP T (x, λ) as functions of x. 
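For concreteness, the closed-form solutions above can be implemented directly. The following Python sketch (an illustration, not code from the paper) evaluates the three estimating functions; the example values λ = 1, 2, and 4 match those used for Ridge, Lasso, and Pretest in Figure 1.

```python
import numpy as np

def m_ridge(x, lam):
    # argmin over m of (x - m)^2 + lam * m^2: proportional shrinkage toward zero
    return np.asarray(x, dtype=float) / (1.0 + lam)

def m_lasso(x, lam):
    # argmin over m of (x - m)^2 + 2 * lam * |m|: soft thresholding at lam
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def m_pretest(x, lam):
    # argmin over m of (x - m)^2 + lam^2 * 1(m != 0): hard thresholding at lam
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) > lam, x, 0.0)

x = np.linspace(-8.0, 8.0, 9)
print(m_ridge(x, 1.0))     # lambda values as in Figure 1
print(m_lasso(x, 2.0))
print(m_pretest(x, 4.0))
```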
As we discuss in detail below, depending on the data generating process (distribution of the means µi ), one or another of these estimating functions m might be able to better approximate the optimal estimating function m∗P . b = (b Let µ = (µ1 , . . . , µn )0 and µ µ1 , . . . , µ bn )0 , where for simplicity we leave the dependence of µ b on λ implicit in our notation. Let P1 , . . . , Pn be the distributions of X1 , . . . , Xn , and let P = (P1 , . . . , Pn ). We evaluate estimates based on the squared error loss function, or compound loss, n 2 1X Ln (X, m(·, λ), P ) = m(Xi , λ) − µi , n i=1 (2) where Ln depends on P via µ. We consider an estimator to be good if it has small expected loss. There are different ways of taking this expectation, however, resulting in different risk functions, and the distinction between them is conceptually quite important. There is, first, the componentwise risk function, which fixes Pi and considers the expected squared error of µ bi as an estimator of µi , R(m(·, λ), Pi ) = E[(m(Xi , λ) − µi )2 |Pi ]. (3) There is, second, the risk function of the compound decision problem. Compound risk Rn averages componentwise risk over the empirical distribution of Pi across the components i = i, . . . , n. Compound risk is given by the expectation of compound loss Ln given P . Rn (m(·, λ), P ) = E[Ln (X, m(·, λ), P )|P ] n 1X = R(m(·, λ), Pi ). n i=1 (4) There is, third, the empirical Bayes risk. Empirical Bayes risk considers the Pi to be themselves draws from some population distribution π, and averages componentwise risk over this population distribution, R̄(m(·, λ), π) = Eπ [R(m(·, λ), Pi )] 5 Z = R(m(·, λ), Pi )dπ(Pi ). (5) Notice the similarity between compound risk and empirical Bayes risk; they differ only by replacing an empirical (sample) distribution by a population distribution. For large n, the difference between the two vanishes, as we will explore in Section 4. Throughout we use Rn (m(·, λ), P ) to denote the risk function of the estimator m(·, λ) b P ) is with fixed (non-random) λ, and similarly for R̄(m(·, λ), π). In contrast, Rn (m(·, λ), b where the latter is chosen in a the risk function taking into account the randomness of λ, b π). data-dependent manner, and similarly for R̄(m(·, λ), For a given P , that is for a given empirical distribution of the Pi , we define the “oracle” selector of the regularization parameter as the value of λ that minimizes compound risk, λ∗ (P ) = argmin Rn (m(·, λ), P ). (6) λ∈[0,∞] We use λ∗R (P ), λ∗L (P ), and λ∗P T (P ) to denote the oracle selectors for Ridge, Lasso, and Pretest, respectively. Analogously, for a given π, that is for a given population distribution of Pi , we define the empirical Bayes oracle selector as λ̄∗ (π) = argmin R̄(m(·, λ), π), (7) λ∈[0,∞] with λ̄∗R (π), λ̄∗L (π), and λ̄∗P T (π) for Ridge, Lasso, and Pretest, respectively. In Section 3, we characterize compound and empirical Bayes risk for fixed λ and for the oracle optimal b can be, under certain conditions, λ. In Section 4 we then show that data-driven choices λ almost as good as the oracle optimal choice, in a sense to be made precise. 2.1 Examples There are many examples in the empirical economics literature to which our setup applies. In most applications in economics the Xi arise as preliminary least squares estimates for the µi , and are assumed to be unbiased but noisy. 
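To make the compound loss, compound risk, and oracle selector concrete, the following sketch simulates the canonical setting just described, with the X_i drawn as unbiased noisy estimates of fixed µ_i, and approximates the oracle λ∗(P) of equation (6) for Ridge by a grid search over Monte Carlo estimates of compound risk. The particular distribution of µ, the grid, and the number of replications are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def m_ridge(x, lam):
    return x / (1.0 + lam)

def compound_loss(x, mu, lam):
    # L_n of equation (2): average squared error across the n components
    return np.mean((m_ridge(x, lam) - mu) ** 2)

# a fixed (hypothetical) parameter vector mu and unit noise variance
n = 1_000
mu = rng.normal(0.0, 2.0, size=n)

# compound risk R_n(lambda) approximated by averaging loss over repeated draws of X given mu
lam_grid = np.linspace(0.0, 5.0, 101)
risk = np.zeros_like(lam_grid)
for _ in range(200):
    x = mu + rng.normal(size=n)
    risk += np.array([compound_loss(x, mu, lam) for lam in lam_grid])
risk /= 200

print("approximate oracle lambda for Ridge:", lam_grid[np.argmin(risk)])
```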
As discussed in the introduction, examples include (i) studies estimating causal or predictive effects for a large number of units such as neighborhoods, cities, teachers, workers, firms, or judges, (ii) studies estimating the causal 6 effect of a given treatment for a large number of subgroups defined by some observable covariate, and (iii) prediction problems with a large number of predictive covariates or transformations of covariates. The first category includes Chetty and Hendren (2015) who estimate the effect of geographic locations on intergenerational mobility for a large number of locations. They use differences between the outcomes of siblings whose parents move during their childhood in order to identify these effects. Estimation of many causal effects also arises in the teacher value added literature, when the object of interest are individual teachers’ effects, see for instance Chetty, Friedman, and Rockoff (2014). In empirical finance and related fields, event studies are often used to estimate reactions of stock market prices to newly available information. DellaVigna and La Ferrara (2010), for instance, are interested in violation of arms embargoes by weapons manufacturers. They estimate the effects of changes in the intensity of armed conflicts in countries under arms trade embargoes on the stock market prices of arms-manufacturing companies. In labor economics, estimation of firm and worker effects in studies of wage inequality has been considered since Abowd, Kramarz, and Margolis (1999), a recent application of this approach can be found in Card, Heining, and Kline (2012). A last example of the first category is provided by Abrams, Bertrand, and Mullainathan (2012), who estimate the causal effect of judges on racial bias in sentencing. An example for the second category, estimation of the causal effect of a binary treatment for many sub-populations, would be heterogeneity of the causal effect of class size on student outcomes across demographic subgroups, based for instance on the project STAR data. The project STAR involved experimental assignment of students to classes of different size in 79 schools; these data were analyzed by Krueger (1999). Causal effects for many subgroups are also of interest in medical contexts or for active labor market programs, where doctors / policy makers have to decide on treatment assignment based on observables. The third category is prediction with many regressors. For this category to fit into our setting, consider regressors which are normalized to be orthogonal. Prediction with many regressors arises in particular in macroeconomic forecasting. Stock and Watson (2012), 7 in an analysis complementing the present article, evaluate various procedures in terms of their forecast performance for a number of macroeconomic time series for the United States. Regression with many predictors also arises in series regression, where the predictors are transformations of a smaller set of observables. Series regression and its asymptotic properties have been studied a lot in theoretical econometrics, see for instance Newey (1997). A discussion of the equivalence of our normal means model to nonparametric regression can be found in sections 7.2-7.3 of Wasserman (2006), where the Xi and µi correspond to the estimated and true coefficients on an appropriate orthogonal basis of functions. Application of Lasso and Pretesting to series regression is discussed for instance in Belloni and Chernozhukov (2011). 
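To illustrate how a series regression maps into the many-means setting just described, the following sketch orthonormalizes the design matrix so that, under homoskedastic and approximately normal errors, the rotated OLS coefficients are roughly independent with a common variance and can be treated as the X_i of our setup. This is a minimal illustration under those assumptions; the QR-based orthonormalization and the variance estimate are choices of this sketch rather than details from the paper.

```python
import numpy as np

def series_to_means(W, y):
    """Map a series regression of y on basis terms W into the many-means setting.

    After orthonormalizing the columns of W, the rotated OLS coefficients b = Q'y
    satisfy Var(b) = sigma^2 * I under homoskedastic errors, so each standardized
    coefficient X_i = b_i / sigma_hat is approximately a N(mu_i, 1) 'observation'.
    """
    Q, _ = np.linalg.qr(W)            # Q has orthonormal columns spanning col(W)
    b = Q.T @ y                       # OLS coefficients in the orthonormal basis
    resid = y - Q @ b                 # same fitted values as OLS of y on W
    sigma2 = resid @ resid / (len(y) - W.shape[1])
    return b / np.sqrt(sigma2), np.sqrt(sigma2)

# illustration with simulated data (a hypothetical design, not the CPS application)
rng = np.random.default_rng(1)
W = rng.normal(size=(500, 20))
y = W[:, 0] - 0.5 * W[:, 1] + rng.normal(size=500)
X, sigma_hat = series_to_means(W, y)
print(X[:5], sigma_hat)
```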
In Section 5 we return to three of these applications, revisiting the estimation of location effects on intergenerational mobility, as in Chetty and Hendren (2015), the effect of changes in the intensity of conflicts in arms-embargo countries on the stock prices of arms manufacturers, as in DellaVigna and La Ferrara (2010), and series regression of earnings on education as in Belloni and Chernozhukov (2011). 2.2 Literature The field of machine learning is growing fast and is increasingly influential in economics; see for instance Athey and Imbens (2015) and Kleinberg, Ludwig, Mullainathan, and Obermeyer (2015). A large set of algorithms and estimation procedures have been developed. Friedman, Hastie, and Tibshirani (2001) provide an introduction to this literature. An estimator which is becoming increasingly popular in economics is the Lasso, which was first introduced by Tibshirani (1996). A review of the properties of the Lasso can be found in Belloni and Chernozhukov (2011). Machine learning research tends to focus primarily on algorithms and computational issues, and less prominently on formal properties of the proposed estimation procedures. This contrasts with the older literature in mathematical statistics and statistical decision theory on the estimation of normal means. This literature has produced many deep results which turn out to be relevant for understanding the behavior of estimation procedures in 8 non-parametric statistics and machine learning. A foundational article for this literature is James and Stein (1961), who study a special case of our setting, where Xi ∼ N (µi , 1). b = X is inadmissible whenever n is at least 3; there They show that the estimator µ exists a shrinkage estimator that has mean square error smaller than the mean squared b = X for all values of µ. Brown (1971) provided more general characterizations error of µ of admissibility, and showed that this dependence on dimension is deeply connected to the recurrence or transience of Brownian motion. Stein et al. (1981) characterized the risk b , and based on this characterization proposed an unbiased function of arbitrary estimators µ estimator of the mean squared error of a given estimator, labeled “Stein’s unbiased risk estimator” or SURE. We return to SURE in Section 4 as a method for data-driven choices of regularization parameters. A general approach for the construction of regularized estimators such as the one proposed by James and Stein (1961) is provided by the empirical Bayes framework, first proposed in Robbins (1956) and Robbins (1964). A key insight of the empirical Bayes framework, and the closely related compound decision problem framework, is that trying to minimize squared error in higher dimensions involves a trade-off across components of the estimand. The data are informative about which estimators and regularization parameters perform well in terms of squared error, and thus allow to construct regularized b = X. This intuition is elaborated on in estimators that dominate the unregularized µ Stigler (1990). The empirical Bayes framework was developed further by Efron and Morris (1973) and Morris (1983), among others. Good reviews and introductions can be found in Zhang (2003) and Efron (2010). In Section 4 we consider data driven choices of regularization parameters, and emphasize uniform validity of asymptotic approximations to the risk function of the resulting estimators. 
Lack of uniform validity of standard asymptotic characterizations of risk (as well as of test size) in the context of Pretesting and other model-selection based estimators has been emphasized by Leeb and Pötscher (2005). 9 3 The risk function We now turn to our first set of formal results, characterizing the mean squared error of regularized estimators. Our goal is to guide researchers’ choice of estimation approach by providing intuition for the conditions under which alternative approaches work well. We first derive a general characterization of the mean squared error in our setting. This characterization is based on the geometry of estimating functions m as depicted in Figure 1. It is a-priori not obvious which of these functions is well suited for estimation. We show that for any given data generating process there is an optimal function m∗P which minimizes the mean squared error. The mean squared error for an arbitrary m is equal, up to a constant, to the L2 distance between m and m∗P . A function m thus yields a good estimator if it is able to approximate the shape of m∗P well. In Section 3.2, we next provide analytic expressions for the component-wise risk of Ridge, Lasso, and Pretest estimation, imposing the additional assumption of normality. Summing or integrating componentwise risk over some distribution for (µi , σi ) delivers expressions for compound and empirical Bayes risk. In Section 3.3, we finally turn to a specific parametric family of data generating processes. For this parametric family, we provide analytic risk functions and visual comparisons of the relative performance of alternative estimators, shown in figures 2, 3, and 4. This allows to highlight key features of the data generating process which favor one or the other estimator. The family of data generating processes we consider assumes that there is some share p of true zeros among the µi , reflecting the notion of sparsity, while the remaining µi are drawn from a normal distribution with some grand mean µ0 and variance σ02 . 3.1 General characterization Recall the setup introduced in Section 2, where we observe n random variables Xi , Xi has mean µi , and the Xi are independent across i. We are interested in the mean squared error for the compound problem of estimating all of the µi simultaneously. In this formulation, the µi are not random but rather fixed, unknown parameters. We can, however, consider 10 the empirical distribution of the µi across i, and the corresponding conditional distributions of the Xi , to construct a joint distribution of Xi and µi . For this construction, we can define a conditional expectation m∗P of µi given Xi – which turns out to be the optimal estimating function given P . Let us now formalize this idea. Let I be a random variable with a uniform distribution over the set {1, 2, . . . , n}, and consider the random component (XI , µI ) of (X, µ). This construction induces a mixture distribution for (XI , µI ) (conditional on P ), n 1X (XI , µI )|P ∼ Pi δµi , n i=1 where δµ1 , . . . , δµn are Dirac measures at µ1 , . . . , µn . Based on this mixture distribution, define the conditional expectation m∗P (x) = E[µI |XI = x, P ] (8) and the average conditional variance vP∗ = E Var(µI |XI , P )|P . 
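For the normal case, the conditional expectation defining m∗_P can be computed directly as a likelihood-weighted average of the µ_i (the explicit expression given after Theorem 1 below). The following sketch implements this infeasible benchmark for a hypothetical parameter vector; it is an illustration, not code from the paper.

```python
import numpy as np
from scipy.stats import norm

def m_star(x, mu, sigma=1.0):
    """Infeasible optimal estimating function m*_P for normal noise.

    m*_P(x) = E[mu_I | X_I = x, P] is a weighted average of the mu_i, with
    weights proportional to the normal likelihood phi((x - mu_i) / sigma);
    this is the explicit expression given after Theorem 1 (for sigma = 1).
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mu = np.asarray(mu, dtype=float)
    w = norm.pdf((x[:, None] - mu[None, :]) / sigma)   # likelihood weights
    return (w @ mu) / w.sum(axis=1)

# a hypothetical parameter vector: half the means at zero, half at 4
mu = np.concatenate([np.zeros(50), np.full(50, 4.0)])
print(m_star([0.0, 2.0, 4.0], mu))
```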
Theorem 1 (Characterization of risk functions) Under the assumptions of Section 2, and using the notation introduced above, the compound risk function Rn of µ bi = m(Xi , λ) for fixed λ can be written as Rn (m(·, λ), P ) = vP∗ + E (m(XI , λ) − m∗P (XI ))2 |P , (9) which implies λ∗ (P ) = argmin E (m(XI , λ) − m∗P (XI ))2 |P . λ∈[0,∞] The proof of this theorem and all further results can be found in the appendix. The statement of this theorem implies that the risk of componentwise estimators is equal to an irreducible part vP∗ , plus the L2 distance of the estimating function m(., λ) to the infeasible optimal estimating function m∗P . A given data generating process P maps into an optimal estimating function m∗P , and the relative performance of alternative estimators m depends on how well they approximate m∗P . 11 We can easily write m∗P explicitly, since the conditional expectation defining m∗P is a weighted average of the values taken by µi . Suppose, for example, that Xi ∼ N (µi , 1) for i = 1 . . . n. Let ϕ be the standard normal probability density function. Then, n X m∗P (x) = µi ϕ(x − µi ) i=1 n X . ϕ(x − µi ) i=1 3.2 Componentwise risk assuming normality The characterization of the previous section relied only on the existence of second moments, as well as on the fact that we are considering componentwise estimators. If slightly more structure is imposed, we can obtain more explicit expressions for compound risk and empirical Bayes risk. We shall now assume in particular that the Xi are normally distributed, Xi ∼ N (µi , σi2 ). We shall furthermore focus on the leading examples of componentwise estimators introduced in Section 2, Ridge, Lasso, and Pretesting , whose estimating functions m were plotted in Figure 1. Recall that compound risk is given by an average of componentwise risk across components i, n 1X Rn (m(·, λ), P ) = R(m(·, λ), Pi ), n i=1 similarly for empirical Bayes risk, Z R̄(m(·, λ), π) = R(m(·, λ), Pi )dπ(Pi ). The following lemma provides explicit expressions for componentwise risk. Lemma 1 (Componentwise risk) Consider the setup of Section 2. Then, for i = 1, . . . , N , the componentwise risk of Ridge is: R(mR (·, λ), Pi ) = 1 1+λ 2 12 σi2 + 1− 1 1+λ 2 µ2i . Assume, in addition, that Xi has a normal distribution. Then, the componentwise risk of Lasso is R(mL (·, λ), Pi ) = 1+Φ −λ − µ i σi ! λ − µ i −Φ (σi2 + λ2 ) σi ! −λ − µ λ − µ −λ + µ −λ − µ i i i i + ϕ + ϕ σi2 σi σi σi σi −λ − µ λ − µi i + Φ −Φ µ2i . σi σi Under the same conditions, the componentwise risk of Pretest is ! −λ − µ λ − µ i i R(mP T (·, λ), Pi ) = 1 + Φ −Φ σi2 σi σi ! λ − µ λ − µ −λ − µ −λ − µ i i i i + ϕ − ϕ σi2 σi σi σi σi −λ − µ λ − µi i + Φ −Φ µ2i . σi σi The componentwise risk functions whose expressions are given in Lemma 1 are plotted in Figure 2. As this figure reveals, componentwise risk is large for Ridge when µi is large relative to λ; similarly for Lasso except that risk remains bounded in the latter case. Risk is large for Pretest estimation when µi is close to λ. Note however that these functions are plotted for a fixed value of the regularization parameter λ, varying the mean µi . If we are interested in compound risk and assume that λ is chosen optimally (equal to λ∗ (P )), then compound risk has to always remain below b = X, which equals En [σi2 ] under the assumptions of Lemma 1. 
the risk of the estimator µ 3.3 Spike and normal data generating process If we take the expressions for componentwise risk derived in Lemma 1 and average them over some population distribution of (µi , σi2 ), we obtain empirical Bayes risk. For parametric families of distributions of (µi , σi2 ), this might be done analytically. We shall do so now, considering a family of distributions that is rich enough to cover common intuitions about data generating processes, but simple enough to allow for analytic 13 expressions. Based on these expressions we will be able to give rules of thumb when either of the estimators considered is preferable. The family of distributions considered assumes homoskedasticity, σi2 = σ 2 , and allows for a fraction p of true zeros in the distribution of µi , while assuming that the remaining µi are drawn from some normal distribution. The following proposition derives the optimal estimating function m∗π as well as empirical Bayes risk functions for this family of distributions. Proposition 1 (Spike and normal data generating process) Assume π is such that µ1 , . . . , µn are drawn independently from a distribution with probability mass p at zero, and normal with mean µ0 and variance σ02 elsewhere and that, conditional on µi , Xi follows a normal distribution with mean µi and variance σ 2 . Let m̄∗π (x) = Eπ [µI |XI = x] = Eπ [µi |Xi = x], for i = 1, . . . , n. Then, x − µ0 p σ02 + σ 2 1 ! (1 − p) p 2 ϕ σ0 + σ 2 ∗ m̄π (x) = 1 x 1 ϕ p ϕ + (1 − p) p 2 σ σ σ0 + σ 2 µ0 σ 2 + xσ02 σ02 + σ 2 !. x − µ0 p σ02 + σ 2 The integrated risk of Ridge is R̄(mR (·, λ), π) = 1 1+λ !2 λ σ 2 + (1 − p) 1+λ !2 (µ20 + σ02 ), with λ̄∗R (π) = σ2 . (1 − p)(µ20 + σ02 ) The integrated risk of Lasso is given by R̄(mL (·, λ), π) = pR̄0 (mL (·, λ), π) + (1 − p)R̄1 (mL (·, λ), π), 14 where R̄0 (mL (·, λ), π) = 2Φ −λ σ λ λ (σ 2 + λ2 ) − 2 ϕ σ2, σ σ and R̄1 (mL (·, λ), π) = ! !! λ − µ0 −Φ p 2 (σ 2 + λ2 ) 2 σ0 + σ ! !! λ − µ0 −λ − µ0 −Φ p 2 (µ20 + σ02 ) + Φ p 2 2 2 σ0 + σ σ0 + σ ! 1 λ − µ0 −p 2 ϕ p 2 (λ + µ0 )(σ02 + σ 2 ) 2 2 σ0 + σ σ0 + σ ! 1 −λ − µ0 −p 2 ϕ p 2 (λ − µ0 )(σ02 + σ 2 ). 2 2 σ0 + σ σ0 + σ −λ − µ0 1+Φ p 2 σ0 + σ 2 Finally, the integrated risk of Pretest is given by R̄(mP T (·, λ), π) = pR̄0 (mP T (·, λ), π) + (1 − p)R̄1 (mP T (·, λ), π), where R̄0 (mP T (·, λ), π) = 2Φ −λ σ σ2 + 2 λ λ ϕ σ2. σ σ and R̄1 (mP T (·, λ), π) = ! !! λ − µ0 −Φ p 2 σ2 2 σ0 + σ ! !! λ − µ0 −λ − µ0 + Φ p 2 −Φ p 2 (µ20 + σ02 ) 2 2 σ0 + σ σ0 + σ ! 1 λ − µ0 −p 2 ϕ p 2 λ(σ02 − σ 2 ) + µ0 (σ02 + σ 2 ) σ0 + σ 2 σ0 + σ 2 ! 1 −λ − µ0 −p 2 λ(σ02 − σ 2 ) − µ0 (σ02 + σ 2 ) . ϕ p 2 σ0 + σ 2 σ0 + σ 2 −λ − µ0 1+Φ p 2 σ0 + σ 2 While the expressions provided by Proposition 1 look somewhat cumbersome, they are easily plotted. Figure 3 does so, plotting the empirical Bayes risk function of different estimators. Each of these plots is based on a fixed value of p, varying µ0 and σ02 along the two axes. Each panel is for a different estimator. Figure 4 then takes these risk functions, 15 for the same range of parameter values, and plots which of the three estimators, Ridge, Lasso, or Pretest estimation, performs best for a given data generating process. We can take away several intuitions from these plots: For small shares of true zeros, Ridge dominates the other estimators, no matter what the grand mean and population dispersion are. As the share of true zeros increases, the other estimators start to look better. Lasso does comparatively well when the non-zero parameters are very dispersed but do have some mass around zero. 
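The Ridge expressions in Proposition 1 are simple enough to evaluate directly, and the empirical Bayes risk of any componentwise estimator under the spike-and-normal model can be approximated by Monte Carlo. The following sketch does both; the particular values of p, µ0, and σ0 are illustrative assumptions, not parameter values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def ridge_eb_risk(lam, p, mu0, s0, sigma=1.0):
    # closed-form empirical Bayes risk of Ridge under the spike-and-normal model
    second_moment = (1.0 - p) * (mu0 ** 2 + s0 ** 2)     # E_pi[mu^2]
    return (sigma / (1.0 + lam)) ** 2 + (lam / (1.0 + lam)) ** 2 * second_moment

def mc_eb_risk(m, lam, p, mu0, s0, sigma=1.0, n=200_000):
    # Monte Carlo empirical Bayes risk of an arbitrary componentwise estimator
    spike = rng.random(n) < p
    mu = np.where(spike, 0.0, rng.normal(mu0, s0, size=n))
    x = mu + sigma * rng.normal(size=n)
    return np.mean((m(x, lam) - mu) ** 2)

def m_lasso(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

p, mu0, s0 = 0.5, 4.0, 1.0                                # hypothetical prior parameters
lam_R = 1.0 / ((1.0 - p) * (mu0 ** 2 + s0 ** 2))          # oracle Ridge lambda (sigma = 1)
print("Ridge risk at its oracle lambda:", ridge_eb_risk(lam_R, p, mu0, s0))
print("Lasso risk at lambda = 1 (Monte Carlo):", mc_eb_risk(m_lasso, 1.0, p, mu0, s0))
```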
Intuitively, Ridge has a hard time reconciling gains from local shrinkage around zero, without suffering from exploding bias as we reach the tails. Pretest has a hard time because of the presence of intermediate values which are not easily classified into zero or non-zero true mean. Pretest does well when the non-zeros are well separated from the zeros, that is when the grand mean is large relative to the population dispersion. When the number of true zeros gets large, we see that Ridge still dominates if the non-zero values are not too far from zero, while Pretest dominates if the non-zero values are very far from zero. Lasso does best in the intermediate range between these extremes. In the context of applications such as the ones we consider below, we suggest that researchers start by contemplating which of these scenarios seems most plausible as a description of their setting. In many settings, there will be no prior reason to presume the presence of many true zeros, suggesting the use of Ridge. If there are reasons to assume there are many true zeros, such as in our arms trade application, then the question arises how well separated and dispersed these are. In the well separated case Pretesting might be justified, in the non-separated but dispersed case Lasso might be a good idea. 4 Data driven choice of regularization parameters In Section 3 we studied the risk properties of alternative regularized estimators. Our calculations in Section 3.3 presumed that we knew the oracle optimal regularization parameter λ̄∗ (π). This is evidently infeasible, given that the distribution π of true means µi is unknown. Are the resulting risk comparisons relevant nonetheless? The answer is yes, to the extent 16 that we can obtain consistent estimates of λ̄∗ (π) from the observed data Xi . In this section b of λ̄∗ (π) based on Stein’s unbiased we make this idea precise. We consider estimates λ b have risk risk estimate and based on cross validation. The resulting estimators m(Xi , λ) functions which are uniformly close to those of the infeasible estimators m(Xi , λ̄∗ (π)). The uniformity part of this statement is important and not obvious. It contrasts markedly with other oracle approximations to risk, most notably approximations which assume that the true zeros, that is the components i for which µi = 0, are known. Asymptotic approximations of this latter form are often invoked when justifying the use of Lasso or Pretesting estimators. Such approximations are in general not uniformly valid, as emphasized by Leeb and Pötscher (2005) and others. This implies that there are relevant ranges of parameter values for which Lasso or Pretesting perform significantly worse than the corresponding oracle-selection-of-zeros estimators. We will briefly revisit this point below. 4.1 Uniform loss and risk consistency For the remainder of the paper we adopt the following short-hand notation: Ln (λ) = Ln (X, m(·, λ), P ), Rn (λ) = Rn (m(·, λ), P ), and R̄π (λ) = R̄(m(·, λ), π). b of the oracle optimal choice of λ which are based on We shall now consider estimators λ b is then used to calculate minimizing some empirical estimate rn of the risk function R̄π . λ b Our goal is to show that these estimates regularized estimates of the form µ bi = m(Xi , λ). are almost as good as estimates based on optimal choices of λ, where “almost as good” is understood as having risk (loss) that’s uniformly close to the oracle optimal estimator. Our results are based on taking limits as n goes to ∞, and assuming that the Pi are i.i.d. 
draws from some distribution π, as in the empirical Bayes setting. In the large n limit the difference between Ln , Rn , and R̄π vanishes as we shall see, so that loss optimality and compound / empirical Bayes risk optimality of λ become equivalent. 17 The following theorem establishes our key result for this section. It provides sufficient conditions for uniform loss consistency, namely that (i) the difference between loss Ln (λ) b is chosen and empirical Bayes risk R̄π (λ) vanishes uniformly in λ and π, and (ii) that λ to minimize an estimate rn (λ) of risk R̄π (λ) which is also uniformly consistent, possibly bn ) and the up to a constant vπ . Under these conditions, the difference between loss Ln (λ infeasible minimal loss inf λ∈[0,∞] Ln (λ) vanishes uniformly. Theorem 2 (Uniform loss consistency) Assume sup Pπ π∈Π ! sup Ln (λ) − R̄π (λ) > → 0, ∀ > 0. (10) λ∈[0,∞] Assume also that there are functions, r̄π (λ), v̄π , and rn (λ) (of (π, λ), π, and ({Xi }ni=1 , λ), respectively) such that R̄π (λ) = r̄π (λ) + v̄π , and sup Pπ π∈Π ! sup rn (λ) − r̄π (λ) > → 0, ∀ > 0. (11) λ∈[0,∞] Then, b sup Pπ Ln (λn ) − inf Ln (λ) > → 0, λ∈[0,∞] π∈Π ∀ > 0, (12) bn = argmin λ∈[0,∞] rn (λ). where λ The sufficient conditions given by this theorem, as stated in equations (10) and (11), are rather high-level. We shall now give more primitive conditions for these requirements to hold. In sections 4.2 and 4.3 below, we propose suitable choices of rn (λ) based on Stein’s unbiased risk estimator (SURE) and cross-validation (CV), and show that equation (11) holds for these choices of rn (λ). In the the following Theorem 3, we show that equation (10) holds under fairly general conditions, so that the difference between compound loss and empirical Bayes risk vanishes uniformly. Lemma 2 then shows that these conditions apply in particular to Ridge, Lasso, and Pretest estimation. Theorem 3 (Uniform L2 -convergence) Suppose that 18 1. m(x, λ) is monotonic in λ for all x in R, 2. m(x, 0) = x and limλ→∞ m(x, λ) = 0 for all x in R, 3. supπ∈Π Eπ [X 4 ] < ∞, supπ∈Π Eπ [µ4 ] < ∞. 4. For any > 0 there exists a set of regularization parameters 0 = λ0 < . . . < λk = ∞, which may depend on , such that Eπ [(|X − µ| + |µ|)|m(X, λj ) − m(X, λj−1 )|] ≤ for all j = 1, . . . , k. Then, " sup Eπ π∈Π sup # 2 Ln (λ) − R̄π (λ) → 0. (13) λ∈[0,∞] Lemma 2 If supπ∈Π Eπ [X 4 ] < ∞ and supπ∈Π Eπ [µ4 ] < ∞, then equation (13) holds for ridge, and lasso. If, in addition, X is continuously distributed with a bounded density, then equation (13) holds for pretest. Theorem 2 provides sufficient conditions for uniform loss consistency. The following corollary shows that under the same conditions we obtain uniform risk consistency, that b becomes is the empirical Bayes risk of the estimator based on the data-driven choice λ uniformly close to the risk of the oracle-optimal λ̄∗ (π). For the statement of this Corollary, bn ), π) is the empirical Bayes risk of the estimator m(., λ bn ) using the recall that R̄(m(., λ bn . stochastic (data-dependent) λ Corollary 1 (Uniform risk consistency) Under the assumptions of Theorem 3, bn ), π) − inf R̄π (λ) → 0. sup R̄(m(., λ λ∈[0,∞] π∈Π 19 (14) 4.2 Stein’s unbiased risk estimate Theorem 2 provides sufficient conditions for uniform loss consistency using a general estimator rn of risk. We shall now establish that our conditions apply to a particular estimator of rn , known as Stein’s unbiased risk estimate (SURE), which was first proposed by Stein et al. (1981). 
SURE leverages the assumption of normality to obtain an elegant expression of risk as an expected sum of squared residuals plus a penalization term. SURE as originally proposed requires that m be piecewise differentiable as a function of x, which excludes discontinuous estimators such as the Pretest estimator mP T (x, λ). We provide a minor generalization in Lemma 3 which allows for such discontinuities. This lemma is stated in terms of empirical Bayes risk; with the appropriate modifications the same result holds verbatim for compound risk. Lemma 3 (SURE for piecewise differentiable estimators) Suppose X|µ ∼ N (µ, 1). and µ ∼iid π. Let fπ = π ∗ ϕ be the marginal density of X. Suppose further that m(x) is differentiable everywhere in R\{x1 , . . . , xJ }, but might be discontinuous at {x1 , . . . , xJ }. Let ∇m be the derivative of m (defined arbitrarily at {x1 , . . . , xJ }), and let ∆mj = limx↓xj m(x) − limx↑xj m(x). Assume that Eπ [(m(X) − X)2 ] < ∞ and (m(x) − x)ϕ(x − µ) → 0 as |x| → ∞ π-a.s. Then, R̄(m(.), π) = Eπ [(m(X) − X)2 ] + 2 Eπ [∇m(X)] + J X ! ∆mj fπ (xj ) − 1. j=1 The result of this lemma yields an objective function for the choice of λ of the general form we considered in Section 4.1, where v̄π = −1 and 2 r̄π (λ) = Eπ [(m(X, λ) − X) ] + 2 Eπ [∇x m(X, λ)] + J X j=1 20 ! ∆mj (λ)fπ (xj ) , (15) where ∇x m is the derivative of m with respect to its first argument, and {x1 , . . . , xJ } may depend on λ. The expression given in equation (15) suggests an obvious empirical counterpart, to be used as an estimator of r̄π (λ), n 1X rn (λ) = (m(Xi , λ) − Xi )2 + 2 n i=1 ! n J X X 1 ∇x m(Xi , λ) + ∆mj (λ)fb(xj ) , n i=1 j=1 where fb(x) is an estimator of fπ (x). This expression could be thought of as a penalized least squares objectives. We can write the penalty explicitly for our leading examples: 2 1+λ Lasso: 2Pn (|X| > λ) Ridge: Prestest: 2Pn (|X| > λ) + 2λ · (fb(−λ) + fb(λ)). For Ridge and Lasso, a smaller λ implies a smaller sum of squared residuals, but also a larger penalty. For our general result on uniform risk consistency, Theorem 2, to apply, we have to show that equation (11) holds. That is, we have to show that rn (λ) is uniformly consistent as an estimator of r̄π (λ). The following lemma provides the desired result. Lemma 4 Assume the conditions of Theorem 3. Then, equation (11) holds for m(·, λ) equal to mR (·, λ), mL (·, λ). If, in addition, b sup Pπ sup |x|f (x) − |x|fπ (x) > → 0 ∀ > 0, π∈Π x∈R then equation (11) holds for m(·, λ) equal to mP T (·, λ). Lemma 3 implies in particular that the optimal regularization parameter λ̄∗ is identified. In fact, under the same conditions the stronger result holds that m∗ as defined in Section 3.1 is identified as well. This result is stated in Brown (1971), for instance. We will use this result in order to provide estimates of m̄∗π (x) in the context of the empirical applications 21 considered below, and to visually compare the shape of this estimated m̄∗π (x) to the shape of Ridge, Lasso, and Pretest estimating functions. Lemma 5 Under the conditions of Lemma 3, the optimal shrinkage function is given by m̄∗π (x) = x + ∇x log(fπ (x)). 4.3 Cross-validation COMING SOON. 22 (16) 5 Applications In this section we consider three applications taken from the literature, in order to illustrate our results. The first application, based on Chetty and Hendren (2015), estimates the effect of living in a given commuting zone during childhood on intergenerational income mobility. 
The second application, based on DellaVigna and La Ferrara (2010), uses changes in the stock prices of arms manufacturers following changes in the intensity of conflicts in countries under an arms trade embargo, in order to detect potential violations of these embargoes. The third application uses data from the 2000 census of the US to estimate a series regression of log wages on education and potential experience. 5.1 Background Before presenting our results and relating them to our theoretical discussion, let us briefly review the background for each of these applications. Given this background, we can formulate some intuitions about likely features of the data generating process. Using our theoretical results, these intuitions then suggest which of the proposed estimators might work best in a given application. Chetty and Hendren (2015) aim to identify and estimate the causal effect βi of locations i (commuting zones, more specifically) on intergenerational mobility in terms of income.1 They identify this effect by comparing the outcomes of differently aged children of the same parents who move at some point during their childhood. The older children are then exposed to the old location for a longer time relative to the younger children, and to to the new location for a shorter time. A regression interacting parental income with location indicators, while controlling for family fixed effects, identifies the effects of interest (under some assumptions, and up to a constant). We take their estimates βbi of the causal impact of location i on mobility for parents at the 25th percentile of the income distribution as our point of departure. 1 The data for this application are available at http://www.equality-of-opportunity.org/images/ nbhds_online_data_table3.xlsx. 23 In this application, the point 0 has no special significance. We would expect the distribution π of true location effects βi to be smooth, possibly normal, and by construction centered at 0. This puts us in a scenario where the results of section 3.3 suggest that Ridge is likely to perform well. DellaVigna and La Ferrara (2010) aim to detect whether weapons manufacturers violate UN arms trade embargos for various conflicts and civil wars.2 They use changes in stock prices of arms manufacturing companies at the time of large changes in the intensity of conflicts in countries under arms-trade embargoes to detect illegal arms trade. Based on their partitioning of the data, we focus on the 214 event study estimates βbi for events which increase the intensity of conflicts in arms embargo areas, and for arms manufacturers in high-corruption countries. In this application, we would in fact expect many of the true effects βi to equal 0 (or close to 0). This would be the case if the (fixed) cost of violating an arms embargo is too large for most weapons manufacturers, so that they don’t engage in trade with civil war parties. There might be a subset of firms, however, who consider it worthwhile to violate the embargo, and their profits might be significantly increased by doing so. We would thus expect there to be many true zeros, while the non-zero effects might be well separated. This puts us in a scenario where the results of section 3.3 suggest that Pretest is likely to perform well. 
In our third application we use data from the 2000 US Census, in order to estimate a nonparametric regression of log wages on years of education and potential experience, similar to the example considered in Belloni and Chernozhukov (2011).3 We construct a set of regressors by taking a saturated basis of linear splines in education, fully interacted with the terms of a 6th order polynomial in potential experience. We orthogonalize these regressors, and take the coefficients β̂_i of an OLS regression of log wages on these orthogonalized regressors as our point of departure.

2 The data for this application are available at http://eml.berkeley.edu/~sdellavi/wp/AEJDataPostingZip.zip.
3 The data for this application are available at http://economics.mit.edu/files/384.

In this application, economics provides less intuition as to what distribution of coefficients to expect. Statistical theory involving arguments from functional analysis, however, suggests that the distribution of coefficients might be such that Lasso performs well; cf. Belloni and Chernozhukov (2011).

5.2 Results

Let us now turn to the actual results. As we shall see, all of the intuitions just discussed are borne out in our empirical analysis. For each of the applications we proceed as follows. We first assemble the least squares coefficient estimates β̂_i and their estimated standard errors σ̂_i. We then define X_i = β̂_i / σ̂_i. Standard asymptotics for least squares estimators suggest that approximately X_i ∼ N(µ_i, 1), where µ_i = β_i / σ_i. We consider componentwise shrinkage estimators of the form µ̂_i = m(X_i, λ). For each of the estimators Ridge, Lasso, and Pretest we use SURE in order to pick the regularization parameter, that is, λ̂∗ is chosen to minimize SURE. We furthermore compare these three estimators in terms of their estimated risk R̂.

Our results are summarized in Table 1, which shows estimated optimal tuning parameters and estimated minimal risk for each application and estimator. As suggested by the above discussion, we indeed find that Ridge performs best (has the lowest estimated risk) for the location effects application, Pretest estimation performs best for the arms trade application, and Lasso performs best for the series regression application. These results drive home a point already suggested by our analytical results on the relative risk of different estimators: none of these estimators uniformly dominates the others. Which estimator performs best depends on the data generating process, and in realistic applications any of these estimators might be preferable.

Figures 5 through 10 provide more detail on the results summarized in Table 1. For each of the applications and estimators, Figures 5, 7, and 9 plot SURE as a function of the regularization parameter. The optimal λ̂∗ reported in Table 1 are the minimizers of these functions, and the reported R̂ are the corresponding minimized values. The top panels of Figures 6, 8, and 10 then plot the estimating functions m(·, λ̂∗), as well as the estimated optimal m̂∗(·), cf. Lemma 5. The bottom panels of these same figures plot kernel density estimates of the marginal density of X_i in each application.

We conjecture that many applications in economics involving fixed effects estimation (teacher effects, worker effects, judge effects...) as well as applications estimating treatment effects for many subgroups will be similar to the location effects setting, in that effects are smoothly distributed and Ridge is the preferred estimator.
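The SURE-based selection step described in this section (standardize β̂_i by σ̂_i, evaluate the criterion of Section 4.2 over a grid of λ, and take the minimizer for each estimator) can be sketched as follows. This is an illustrative implementation under the X_i ∼ N(µ_i, 1) approximation, not the authors' code; the λ grid and the kernel density estimator used for the Pretest jump terms are choices of this sketch.

```python
import numpy as np
from scipy.stats import gaussian_kde

def sure(x, lam, kind, fhat=None):
    """SURE criterion r_n(lambda) - 1 for Ridge, Lasso, and Pretest.

    x holds the standardized estimates X_i = beta_hat_i / se_i, treated as N(mu_i, 1).
    The penalty terms follow Section 4.2; the Pretest jump terms use a kernel
    density estimate fhat of the marginal density of X.
    """
    if kind == "ridge":
        fit = x / (1.0 + lam)
        penalty = 2.0 / (1.0 + lam)
    elif kind == "lasso":
        fit = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
        penalty = 2.0 * np.mean(np.abs(x) > lam)
    elif kind == "pretest":
        fit = np.where(np.abs(x) > lam, x, 0.0)
        penalty = (2.0 * np.mean(np.abs(x) > lam)
                   + 2.0 * lam * (fhat(-lam)[0] + fhat(lam)[0]))
    return np.mean((fit - x) ** 2) + penalty - 1.0

def choose_lambda(x, kind, lam_grid):
    fhat = gaussian_kde(x) if kind == "pretest" else None
    crit = np.array([sure(x, lam, kind, fhat) for lam in lam_grid])
    best = int(np.argmin(crit))
    return lam_grid[best], crit[best]

# usage with hypothetical inputs: beta_hat and se from a first-stage regression
# x = beta_hat / se
# for kind in ("ridge", "lasso", "pretest"):
#     print(kind, choose_lambda(x, kind, np.linspace(0.01, 10.0, 200)))
```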
There is a more special set of applications where economic agents engaging in non-zero behavior incur a fixed cost. Such applications might be similar to the arms trade application with a true mass of zero effects, where Pretest or Lasso are preferable. In series regression contexts such as our log wage example, finally, Lasso appears to perform consistently well. 6 Conclusion The literature on machine learning is growing rapidly. Two common features of machine learning algorithms are regularization and data driven choice of regularization parameters. We study the properties of such procedures. We consider in particular the problem of estimating many means µi based on observations Xi . This problem arises often in economic applications. In such applications the “observations” Xi are usually equal to preliminary least squares coefficient estimates, for instance of fixed effects. Our goal in this paper is to provide some guidance for applied researchers: Which estimation method should one choose in a given application? And how should one pick regularization parameters? To the extent that researchers care about the squared error of their estimates, procedures are preferable if they have lower mean squared error than the competitors. Based on our results, Ridge appears to dominate the alternatives considered when the true effects µi are smoothly distributed, and there is no point mass of true zeros. This appears to be the case in applications that estimate the effect of many units such as locations 26 or teachers, and applications that estimate effects for many subgroups. Pretest appears to dominate if there are many true zeros, while the non-zero effects are well separated from zero. This happens in some economic applications when there are fixed costs for agents who engage in non-zero behavior. Lasso finally dominates for intermediate cases, and appears to do well in particular for series regression. Regarding the choice of regularization parameters, we prove a series of results which show that data driven choices are almost optimal (in a uniform sense) for large-dimensional problems. This is the case in particular for choices which minimize Stein’s Unbiased Risk Estimate (SURE), when observations are normally distributed, and for Cross Validation (CV), when repeated observations for a given effect are available. There are of course some limitations to our analysis, which point toward interesting future research. First, we focus on a restricted class of estimators, those which can be bi ). This covers many estimators written in the component-wise shrinkage form µ bi = m(Xi , λ of interest for economists, most notably Ridge, Lasso, and Pretest estimation. Many other estimators in the machine learning literature do not have this tractable form, however, such as for instance random forests or support vector machines. It would be desirable to gain a better understanding of the risk properties of such estimators, as well. Second, we focus on squared error loss for the joint estimation problem, 1 n µi − µi )2 . i (b P This loss function is analytically quite convenient, and amenable to tractable results. Other loss functions might be of practical interest, however, and might be studied using numerical methods. In this context it is also worth emphasizing again that we were focusing on point estimation, where all coefficients µi are simultaneously of interest. This is relevant for many practical applications such as those discussed above. 
In other cases, however, one might instead be interested in the estimates µ bi solely as input for a lower-dimensional decision problem, or in (frequentist) inference on the coefficients µi . Our analysis of mean squared error does not directly speak to such questions. 27 Appendix A.1 Proofs Proof of Theorem 1: n 1X E[(m(Xi , λ) − µi )2 |Pi ] n i=1 = E (m(XI , λ) − µI )2 |P = E E[(m∗P (XI ) − µI )2 |XI , P ]|P + E (m(XI , λ) − m∗P (XI ))2 |P ∗ = vP + E (m(XI , λ) − m∗P (XI ))2 |P . Rn (m(., λ), P ) = Proof of Lemma 1: Notice that mR (x, λ) − µi = 1 1+λ (x − µi ) − λ 1+λ µi . The result for ridge equals the second moment of this expression. For pretest, notice that mP T (x, λ) − µi = 1(|x| > λ)(x − µi ) − 1(|x| ≤ λ)µi . Therefore, R(mP T (·, λ), Pi ) = E (Xi − µi )2 1(|Xi | > λ) + µ2i Pr |Xi | ≤ λ . (A.1) 0 Using the fact that ϕ (v) = −vϕ(v) and integrating by parts, we obtain Z b Z b h i ϕ(v) dv − bϕ(b) − aϕ(a) ha i h i = Φ(b) − Φ(a) − bϕ(b) − aϕ(a) . 2 v ϕ(v) dv = a Now, " E (Xi − µi )2 1(|Xi | > λ) = σi2 E Xi − µi σi !2 # 1(|Xi | > λ) −λ − µ i i σi2 = 1+Φ + ! λ − µ λ − µ −λ − µ −λ − µ i i i i ϕ − ϕ σi2 . σi σi σi σi σi −Φ ! λ − µ σi (A.2) The result for the pretest estimator follows now easily from equations (A.1) and (A.2). For lasso, notice that mL (x, λ) − µi = 1(x < −λ)(x + λ − µi ) + 1(x > λ)(x − λ − µi ) − 1(|x| ≤ λ)µi = 1(|x| > λ)(x − µi ) + (1(x < −λ) − 1(x > λ))λ − 1(|x| ≤ λ)µi . Therefore, R(mL (·, λ), Pi ) = E (Xi − µi )2 1(|Xi | > λ) + λ2 E[1(|Xi | > λ)] + µ2i E[1(|Xi | ≤ λ)] 28 + 2λ E (Xi − µi )1(Xi < −λ) − E (Xi − µi )1(Xi > λ) = R(mP T (·, λ), Pi ) + λ2 E[1(|Xi | > λ)] + 2λ E (Xi − µi )1(Xi < −λ) − E (Xi − µi )1(Xi > λ) . Notice that Z (A.3) b vϕ(v)dv = ϕ(a) − ϕ(b). a As a result, λ − µ −λ − µi i E (Xi − µi )1(Xi < −λ) − E (Xi − µi )1(Xi > λ) = −σi ϕ +ϕ . σi σi (A.4) Now, the result for lasso follows from equations (A.3) and (A.4). Proof of Proposition 1: The results for ridge are trivial. For lasso, first notice that the integrated risk at zero is: −λ λ λ R0 (mL (·, λ), π) = 2Φ (σ 2 + λ2 ) − 2 ϕ σ2 . σ σ σ Next, notice that ! Z −λ − µ0 −λ − µ 1 µ0 − µ ϕ dµ = Φ p 2 , Φ σ σ0 σ0 σ0 + σ 2 ! Z λ − µ 1 µ0 − µ λ − µ0 Φ ϕ dµ = Φ p 2 , σ σ0 σ0 σ0 + σ 2 ! Z λ−µ 1 −λ − µ λ − µ 1 µ0 − µ µ0 σ 2 + λσ02 0 ϕ ϕ dµ = − p 2 ϕ p 2 λ+ σ σ σ0 σ0 σ02 + σ 2 σ0 + σ 2 σ0 + σ 2 ! Z −λ − µ µ0 σ 2 − λσ02 −λ + µ −λ − µ 1 µ0 − µ 1 0 λ− ϕ p 2 ϕ ϕ dµ = − p 2 σ σ σ0 σ0 σ02 + σ 2 σ0 + σ 2 σ0 + σ 2 The integrals involving µ2 are more involved. Let v be a Standard Normal variable independent of µ. Notice that, Z Z Z 1 µ − µ λ − µ 1 µ − µ 0 0 2 2 µ Φ ϕ dµ = µ I[v≤(λ−µ)/σ] ϕ(v)dv ϕ dµ σ σ0 σ0 σ0 σ0 Z Z µ − µ0 1 = dµ ϕ(v)dv. µ2 I[µ≤λ−σv] ϕ σ0 σ0 Using the change of variable u = (µ − µ0 )/σ0 , we obtain, Z Z 1 µ − µ0 2 µ I[µ≤λ−σv] ϕ dµ = (µ0 + σ0 u)2 I[u≤(λ−µ0 −σv)/σ0 ] ϕ(u)du σ0 σ0 λ − µ − σv λ − µ − σv 0 0 =Φ µ20 − 2ϕ σ0 µ0 σ0 σ0 ! λ − µ − σv λ − µ − σv λ − µ − σv 0 0 0 + Φ − ϕ σ02 σ0 σ0 σ0 λ − µ − σv λ − µ − σv 0 0 =Φ (µ20 + σ02 ) − ϕ σ0 (λ + µ0 − σv). σ0 σ0 29 Therefore, Z ! λ − µ 1 µ − µ λ − µ0 0 µ Φ dµ = Φ p 2 ϕ (µ20 + σ02 ) σ σ0 σ0 σ0 + σ 2 ! λ − µ0 1 ϕ p 2 (λ + µ0 )σ02 −p 2 σ0 + σ 2 σ0 + σ 2 ! ! 1 λ − µ0 λ − µ0 2 2 + σ0 σ p 2 ϕ p 2 . σ02 + σ 2 σ0 + σ 2 σ0 + σ 2 2 Similarly, Z ! −λ − µ 1 µ − µ −λ − µ0 0 µ Φ dµ = Φ p 2 (µ20 + σ02 ) ϕ σ σ0 σ0 σ0 + σ 2 ! 1 −λ − µ0 −p 2 ϕ p 2 (−λ + µ0 )σ02 2 2 σ0 + σ σ0 + σ ! ! −λ − µ0 −λ − µ0 1 2 2 ϕ p 2 . + σ0 σ p 2 σ02 + σ 2 σ0 + σ 2 σ0 + σ 2 2 The integrated risk conditional on µ 6= 0 is R1 (mL (·, λ), π) = !! λ − µ0 −Φ p 2 (σ 2 + λ2 ) σ0 + σ 2 ! !! 
−λ − µ0 λ − µ0 −Φ p 2 (µ20 + σ02 ) + Φ p 2 σ0 + σ 2 σ0 + σ 2 ! 1 λ − µ0 −p 2 ϕ p 2 (λ + µ0 )(σ02 + σ 2 ) σ0 + σ 2 σ0 + σ 2 ! 1 −λ − µ0 −p 2 ϕ p 2 (λ − µ0 )(σ02 + σ 2 ). σ0 + σ 2 σ0 + σ 2 −λ − µ0 1+Φ p 2 σ0 + σ 2 ! The results for pretest follow from similar calculations. Next lemma is used in the proof of Theorem 2. Lemma A.1 For any two real-valued functions, f and g, inf f − inf g ≤ sup |f − g|. Proof: The result of the lemma follows directly from inf f ≥ inf g − sup |f − g|, and inf g ≥ inf f − sup |f − g|. 30 Proof of Theorem 2: Because v̄π does not depend on λ, we obtain bn ) − rn (λ) − rn (λ bn ) = Ln (λ) − R̄π (λ) − Ln (λ bn ) − R̄π (λ bn ) Ln (λ) − Ln (λ bn ) − rn (λ bn ) . + r̄π (λ) − rn (λ) − r̄π (λ Applying Lemma A.1 we obtain bn ) − inf Ln (λ) − Ln (λ λ∈[0,∞] inf λ∈[0,∞] bn ) ≤ 2 sup Ln (λ) − R̄π (λ) rn (λ) − rn (λ λ∈[0,∞] + 2 sup r̄π (λ) − rn (λ). λ∈[0,∞] bn is the value of λ at which rn (λ) attains its minimum, the result of the theorem follows. Given that λ The following preliminary lemma will be used in the proof of Theorem 3. Lemma A.2 For any finite set of regularization parameters, 0 = λ0 < . . . < λk = ∞, let uj = sup L(λ) λ∈[λj−1 ,λj ] lj = inf λ∈[λj−1 ,λj ] L(λ), where L(λ) = (µ − m(X, λ))2 . Suppose that for any > 0 there is a finite set of regularization parameters, 0 = λ0 < . . . < λk = ∞ (where k may depend on ) such that sup max Eπ [uj − lj ] ≤ (A.5) sup max max{varπ (lj ), varπ (uj )} < ∞. (A.6) π∈Π 1≤j≤k and π∈Π 1≤j≤k Then, equation (13) holds. Proof: We will use En to indicate averages over (µ1 , X1 ), . . . , (µn , Xn ). Let λ ∈ [λj−1 , λj ]. By construction En [L(λ)] − Eπ [L(λ)] ≤ En [uj ] − Eπ [lj ] ≤ En [uj ] − Eπ [uj ] + Eπ [uj − lj ] En [L(λ)] − Eπ [L(λ)] ≥ En [lj ] − Eπ [uj ] ≥ En [lj ] − Eπ [lj ] − Eπ [uj − lj ] and thus sup (En [L(λ)]−Eπ [L(λ)])2 λ∈[0,∞] ≤ max max{(En [uj ] − Eπ [uj ])2 , (En [lj ] − Eπ [lj ])2 } + 1≤j≤k max Eπ [uj − lj ] 1≤j≤k + 2 max max{|En [uj ] − Eπ [uj ]|, |En [lj ] − Eπ [lj ]|} max Eπ [uj − lj ] 1≤j≤k ≤ k X 1≤j≤k (En [uj ] − Eπ [uj ])2 + (En [lj ] − Eπ [lj ])2 + 2 j=1 + 2 k X |En [uj ] − Eπ [uj ]| + |En [lj ] − Eπ [lj ]| . j=1 31 2 Therefore, Eπ h sup (En [L(λ)] − Eπ [L(λ)])2 i λ∈[0,∞] ≤ k X Eπ [(En [uj ] − Eπ [uj ])2 ] + Eπ [En [lj ] − Eπ [lj ])2 ] + 2 j=1 + 2 k X Eπ [|En [uj ] − Eπ [uj ]| + |En [lj ] − Eπ [lj ]|] j=1 ≤ k X varπ (uj )/n + varπ (lj )/n + 2 j=1 + 2 k q X varπ (uj )/n + q varπ (lj )/n . j=1 Now, the result of the lemma follows from the assumption of uniformly bounded variances. Proof of Theorem 3: We will show that the conditions of the theorem imply equations (A.5) and (A.6) and, therefore, the uniform convergence result in equation (13). Using conditions 1 and 2, along with the convexity of 4-th powers we immediately get bounded variances. Because the maximum of a convex function is achieved at the boundary, varπ (uj ) ≤ Eπ [u2j ] ≤ Eπ [max{(X − µ)4 , µ4 }] ≤ Eπ [(X − µ)4 ] + Eπ [µ4 ], Notice also that varπ (lj ) ≤ Eπ [lj2 ] ≤ Eπ [u2j ]. Now, condition 3 implies equation (A.6) in Lemma A.2. It remains to find a set of regularization parameters such that Eπ [uj − lj ] < for all j. Using again monotonicity of m(X, λ) in λ and convexity of the square function, we have that the supremum defining uj is achieved at the boundary, uj = max{L(λj−1 ), L(λj )}, while lj = min{L(λj−1 ), L(λj )} if µ ∈ / [m(X, λj−1 ), m(X, λj )], and lj = 0 otherwise. In the former case, uj − lj = |L(λj ) − L(λj−1 )|, in the latter case uj − lj = max{L(λj−1 ), L(λj )}. Consider first the case of µ ∈ / [m(X, λj−1 ), m(X, λj )]. 
Using the formula $a^2-b^2=(a+b)(a-b)$ and the shorthand $m_j=m(X,\lambda_j)$, we obtain
$$
u_j-l_j = \big|(m_j-\mu)^2-(m_{j-1}-\mu)^2\big|
= \big|\big[(m_j-\mu)+(m_{j-1}-\mu)\big]\big[m_j-m_{j-1}\big]\big|
\le \big(|m_j-\mu|+|m_{j-1}-\mu|\big)\,|m_j-m_{j-1}|.
$$
To check that the same bound applies to the case $\mu\in[m(X,\lambda_{j-1}),m(X,\lambda_j)]$, notice that
$$
\max\big\{|m_j-\mu|,\,|m_{j-1}-\mu|\big\} \le |m_j-\mu|+|m_{j-1}-\mu|
$$
and, because $\mu\in[m(X,\lambda_{j-1}),m(X,\lambda_j)]$,
$$
\max\big\{|m_j-\mu|,\,|m_{j-1}-\mu|\big\} \le |m_j-m_{j-1}|.
$$
Monotonicity, the boundary conditions, and the convexity of absolute values allow us to bound further,
$$
u_j-l_j \le 2\,(|X-\mu|+|\mu|)\,|m_j-m_{j-1}|.
$$
Now, condition 4 in Theorem 3 implies equation (A.5) in Lemma A.2 and, therefore, the result of the theorem.

Proof of Lemma 2: Conditions 1 and 2 of Theorem 3 are easily verified to hold for ridge, lasso, and the pretest estimator. Let us thus discuss condition 4. Let $\Delta m_j=m(X,\lambda_j)-m(X,\lambda_{j-1})$ and $\Delta\lambda_j=\lambda_j-\lambda_{j-1}$. For ridge, $\Delta m_j$ is given by
$$
\Delta m_j = \left(\frac{1}{1+\lambda_j}-\frac{1}{1+\lambda_{j-1}}\right)X,
$$
so that the requirement follows from finite variances if we choose a finite set of regularization parameters such that
$$
\left|\frac{1}{1+\lambda_j}-\frac{1}{1+\lambda_{j-1}}\right|\,\sup_{\pi\in\Pi}E_\pi\big[(|X-\mu|+|\mu|)\,|X|\big] < \epsilon
$$
for all $j=1,\ldots,k$, which is possible by the uniformly bounded moments condition.

For lasso, notice that
$$
|\Delta m_k| = (|X|-\lambda_{k-1})\,1(|X|>\lambda_{k-1}) \le |X|\,1(|X|>\lambda_{k-1}),
$$
and $|\Delta m_j|\le\Delta\lambda_j$ for $j=1,\ldots,k-1$. We will first verify that for any $\epsilon>0$ there is a finite $\lambda_{k-1}$ such that condition 4 of the lemma holds for $j=k$. Notice that for any pair of non-negative random variables $(\xi,\zeta)$ such that $E[\xi\zeta]<\infty$, and for any positive constant $c$, we have
$$
E[\xi\zeta] \ge E[\xi\zeta\,1(\zeta>c)] \ge c\,E[\xi\,1(\zeta>c)]
$$
and, therefore,
$$
E[\xi\,1(\zeta>c)] \le \frac{E[\xi\zeta]}{c}.
$$
As a consequence of this inequality, and because $\sup_{\pi\in\Pi}E_\pi\big[(|X-\mu|+|\mu|)\,|X|^2\big]<\infty$ (implied by condition 3), for any $\epsilon>0$ there exists a finite positive constant $\lambda_{k-1}$ such that condition 4 of the lemma holds for $j=k$. Given that $\lambda_{k-1}$ is finite, $\sup_{\pi\in\Pi}E_\pi\big[|X-\mu|+|\mu|\big]<\infty$ and $|\Delta m_j|\le\Delta\lambda_j$ imply condition 4 for $j=1,\ldots,k-1$.

For pretest, $|\Delta m_j|=|X|\,1(|X|\in(\lambda_{j-1},\lambda_j])$, so we require that for any $\epsilon>0$ we can find a finite set of regularization parameters $0=\lambda_0<\lambda_1<\ldots<\lambda_{k-1}<\lambda_k=\infty$ such that
$$
E_\pi\big[(|X-\mu|+|\mu|)\,|X|\,1(|X|\in(\lambda_{j-1},\lambda_j])\big] < \epsilon, \qquad j=1,\ldots,k.
$$
Applying the Cauchy-Schwarz inequality and the uniform boundedness of fourth moments, this condition is satisfied if we can choose the grid so that $P_\pi(|X|\in(\lambda_{j-1},\lambda_j])$ is uniformly small, which is possible under the assumption that $X$ is continuously distributed with a (version of the) density that is uniformly bounded.

Proof of Corollary 1: From Theorem 2 and Lemma A.1, it follows immediately that
$$
\sup_{\pi\in\Pi}P_\pi\left(\left|L_n(\hat\lambda_n)-\inf_{\lambda\in[0,\infty]}\bar R_\pi(\lambda)\right|>\epsilon\right)\to 0.
$$
By definition,
$$
\bar R(m(\cdot,\hat\lambda_n),\pi) = E_\pi\big[L_n(\hat\lambda_n)\big].
$$
Equation (14) thus follows if we can strengthen uniform convergence in probability to uniform $L_1$ convergence. To do so, we need to show uniform integrability of $L_n(\hat\lambda_n)$, as per Theorem 2.20 in van der Vaart (2000). Monotonicity, convexity of the loss, and the boundary conditions imply
$$
L_n(\hat\lambda_n) \le \frac{1}{n}\sum_{i=1}^{n}\big[\mu_i^2+(X_i-\mu_i)^2\big].
$$
Uniform integrability along arbitrary sequences $\pi_n$, and thus $L_1$ convergence, follows from the assumed bounds on moments.

Proof of Lemma 3: Recall the definition $\bar R(m(\cdot),\pi)=E_\pi[(m(X)-\mu)^2]$. Expanding the square yields
$$
E_\pi[(m(X)-\mu)^2] = E_\pi[(m(X)-X+X-\mu)^2]
= E_\pi[(X-\mu)^2] + E_\pi[(m(X)-X)^2] + 2\,E_\pi[(X-\mu)(m(X)-X)].
$$
By the form of the standard normal density, $\nabla_x\varphi(x-\mu)=-(x-\mu)\,\varphi(x-\mu)$.
Partial integration over the intervals $]x_j,x_{j+1}[$ (where we let $x_0=-\infty$ and $x_{J+1}=\infty$) yields
$$
E_\pi[(X-\mu)(m(X)-X)] = \int_{\mathbb R}\int_{\mathbb R}(x-\mu)\,(m(x)-x)\,\varphi(x-\mu)\,dx\,d\pi(\mu)
$$
$$
= -\sum_{j=0}^{J}\int_{\mathbb R}\int_{x_j}^{x_{j+1}}(m(x)-x)\,\nabla_x\varphi(x-\mu)\,dx\,d\pi(\mu)
$$
$$
= \sum_{j=0}^{J}\int_{\mathbb R}\left[\int_{x_j}^{x_{j+1}}(\nabla m(x)-1)\,\varphi(x-\mu)\,dx
+ \lim_{x\downarrow x_j}(m(x)-x)\,\varphi(x-\mu) - \lim_{x\uparrow x_{j+1}}(m(x)-x)\,\varphi(x-\mu)\right]d\pi(\mu)
$$
$$
= E_\pi[\nabla m(X)] - 1 + \sum_{j=1}^{J}\Delta m_j\, f(x_j).
$$

Proof of Lemma 4: Uniform convergence of the first term follows by the same arguments we used to show uniform convergence of $L_n(\lambda)$ to $\bar R_\pi(\lambda)$ in Theorem 3. We thus focus on the second term, and discuss its convergence on a case-by-case basis for our leading examples. For ridge, this second term is equal to the constant
$$
2\,\nabla_x m_R(x,\lambda) = \frac{2}{1+\lambda},
$$
and uniform convergence holds trivially. For lasso, the second term is equal to
$$
2\,E_n[\nabla_x m_L(X,\lambda)] = 2\,P_n(|X|>\lambda).
$$
To prove uniform convergence of this term, we slightly modify the proof of the Glivenko-Cantelli theorem (e.g., van der Vaart (2000), Theorem 19.1). Let $F_n$ be the empirical cumulative distribution function of $X_1,\ldots,X_n$, and let $F_\pi$ be its population counterpart. It is enough to prove uniform convergence of $F_n(\lambda)$,
$$
\sup_{\pi\in\Pi}P_\pi\left(\sup_{\lambda\in[0,\infty]}\big|F_n(\lambda)-F_\pi(\lambda)\big|>\epsilon\right)\to 0 \qquad \forall\,\epsilon>0.
$$
Using Chebyshev's inequality and $\sup_{\pi\in\Pi}\operatorname{var}_\pi(1(X\le\lambda))\le 1/4$ for every $\lambda\in[0,\infty]$, we obtain
$$
\sup_{\pi\in\Pi}\big|F_n(\lambda)-F_\pi(\lambda)\big| \stackrel{p}{\longrightarrow} 0
$$
for every $\lambda\in[0,\infty]$. Next, we establish that for any $\epsilon>0$ it is possible to find a finite set of regularization parameters $0=\lambda_0<\lambda_1<\cdots<\lambda_k=\infty$ such that
$$
\sup_{\pi\in\Pi}\,\max_{1\le j\le k}\big\{F_\pi(\lambda_j)-F_\pi(\lambda_{j-1})\big\} < \epsilon.
$$
This assertion follows from the fact that $f_\pi(x)$ is uniformly bounded by $\varphi(0)$. The rest of the proof proceeds as in the proof of Theorem 19.1 in van der Vaart (2000).

Let us finally turn to pretest. The objective function for pretest is equal to the one for lasso, plus additional terms for the jumps at $\pm\lambda$; the penalty term equals
$$
2\,P_n(|X|>\lambda) + 2\lambda\big(\hat f(-\lambda)+\hat f(\lambda)\big).
$$
Uniform convergence of the SURE criterion for pretest thus holds if (i) the conditions for lasso are satisfied, and (ii) $|x|\hat f(x)$ is a uniformly consistent estimator of $|x|f(x)$.

A.2 Tables and figures

Table 1: Optimal regularization parameter and minimal value of the SURE criterion

             Location effects (n = 595)    Arms trade (n = 214)    Returns to education (n = 65)
               $\hat R$    $\hat\lambda^*$     $\hat R$    $\hat\lambda^*$       $\hat R$    $\hat\lambda^*$
Ridge            0.29         2.44             0.50         0.98              1.00         0.01
Lasso            0.32         1.34             0.06         1.50              0.84         0.59
Pretest          0.41         5.00            -0.02         2.38              0.93         1.14

For each of the applications considered and for each of the estimators on which we focus, this table reports the value $\hat\lambda^*$ that minimizes risk as estimated by SURE, as well as the corresponding minimal value $\hat R$ of the SURE criterion. These values let us compare which of the estimators appears to work best in a given application.

Figure 1: Estimators

[Figure: the componentwise estimators plotted as functions of $x$.] This graph plots $m_R(x,\lambda)$, $m_L(x,\lambda)$, and $m_{PT}(x,\lambda)$ as functions of $x$. The regularization parameters chosen are $\lambda=1$ for Ridge, $\lambda=2$ for the Lasso estimator, and $\lambda=4$ for the Pretest estimator.

Figure 2: Componentwise risk functions

[Figure: componentwise risk as a function of $\mu$ for Ridge, Pretest, Lasso, and the unregularized estimator $m(x)=x$.] These graphs show the componentwise risk functions as a function of $\mu_i$, assuming that $\sigma_i^2=1$, for the componentwise estimators displayed in Figure 1.
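As a numerical companion to the componentwise risk formulas derived in the proof of Lemma 1 (and plotted in Figure 2), the following Python sketch evaluates the analytic expressions for ridge, lasso, and pretest and checks them against a Monte Carlo estimate of the risk. It is not part of the original analysis; the function names (`risk_ridge`, `mc_risk`, and so on) and the chosen values of $\mu_i$, $\lambda$, and $\sigma_i$ are purely illustrative.

```python
import numpy as np
from scipy.stats import norm

def risk_ridge(mu, lam, sigma=1.0):
    # E[(m_R(X, lam) - mu)^2] with X ~ N(mu, sigma^2) and m_R(x, lam) = x / (1 + lam)
    return (sigma**2 + lam**2 * mu**2) / (1.0 + lam)**2

def risk_pretest(mu, lam, sigma=1.0):
    # Equations (A.1)-(A.2): truncated second moment plus mu^2 * P(|X| <= lam)
    a, b = (-lam - mu) / sigma, (lam - mu) / sigma
    trunc = sigma**2 * (1 + norm.cdf(a) - norm.cdf(b) + b * norm.pdf(b) - a * norm.pdf(a))
    return trunc + mu**2 * (norm.cdf(b) - norm.cdf(a))

def risk_lasso(mu, lam, sigma=1.0):
    # Equations (A.3)-(A.4): pretest risk plus lam^2 * P(|X| > lam) minus the cross term
    a, b = (-lam - mu) / sigma, (lam - mu) / sigma
    return (risk_pretest(mu, lam, sigma)
            + lam**2 * (1 + norm.cdf(a) - norm.cdf(b))
            - 2 * lam * sigma * (norm.pdf(a) + norm.pdf(b)))

def mc_risk(m, mu, lam, sigma=1.0, reps=1_000_000, seed=0):
    # Monte Carlo estimate of the componentwise risk of the estimator m
    x = mu + sigma * np.random.default_rng(seed).standard_normal(reps)
    return np.mean((m(x, lam) - mu)**2)

m_ridge = lambda x, lam: x / (1.0 + lam)
m_lasso = lambda x, lam: np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
m_pretest = lambda x, lam: x * (np.abs(x) > lam)

if __name__ == "__main__":
    lam = 1.5
    for mu in (0.0, 1.0, 3.0):
        print(f"mu = {mu:3.1f}:"
              f"  ridge {risk_ridge(mu, lam):.4f} / {mc_risk(m_ridge, mu, lam):.4f}"
              f"  lasso {risk_lasso(mu, lam):.4f} / {mc_risk(m_lasso, mu, lam):.4f}"
              f"  pretest {risk_pretest(mu, lam):.4f} / {mc_risk(m_pretest, mu, lam):.4f}")
```

At $\mu_i=0$ the lasso expression reduces to the formula for $R_0(m_L(\cdot,\lambda),\pi)$ used in the proof of Proposition 1, which provides an additional consistency check.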
Figure 3: Risk for estimators in spike and normal setting

[Figure: integrated risk of Ridge, Lasso, Pretest, and the optimal shrinkage estimator $m^*$, plotted as a function of $\mu_0$ and $\sigma_0$, with one set of panels for each of $p=0.00$, $p=0.25$, $p=0.50$, and $p=0.75$.]

Figure 4: Best estimator in spike and normal setting

[Figure: four panels, one for each of $p=0.00$, $0.25$, $0.50$, and $0.75$, with $\mu_0$ on the horizontal axis and $\sigma_0$ on the vertical axis.] This figure compares the integrated risk values attained by Ridge, Lasso, and Pretest for different parameter values of the spike and normal specification in Section 3.2.2. Blue circles are placed at parameter values for which Ridge minimizes integrated risk, green crosses at values for which Lasso minimizes integrated risk, and red dots at parameter values for which Pretest minimizes integrated risk.

Figure 5: Location effects on intergenerational mobility: SURE Estimates

[Figure: SURE as a function of $\lambda$ for Ridge, Lasso, and Pretest.] This figure plots the SURE criterion as a function of $\lambda$ for Ridge, Lasso, and Pretest. The estimate $\hat\lambda^*$ for each of these is chosen as the value minimizing SURE. In this application, Ridge has the lowest estimated risk, which is attained at $\lambda=2.44$.

Figure 6: Location effects on intergenerational mobility: Shrinkage Estimators

[Figure: the estimated shrinkage functions $\hat m(x)$ and a kernel estimate $\hat f(x)$ of the density of $X$.] The first panel shows the optimal shrinkage estimator (solid line) along with the Ridge, Lasso, and Pretest estimators (dashed lines), evaluated at the SURE-minimizing values of the regularization parameters. The Ridge estimator is linear, with positive slope equal to 0.29. Lasso is piecewise linear, with kinks at the positive and negative versions of the SURE-minimizing value of the regularization parameter, $\lambda=1.34$. Pretest is flat at zero, because SURE is minimized at values of $\lambda$ larger than the maximum absolute value of $X_1,\ldots,X_n$. The second panel shows a kernel estimate of the density of $X$. In this application, Ridge has the lowest estimated risk.

Figure 7: Arms Event Study: SURE Estimates

[Figure: SURE as a function of $\lambda$ for Ridge, Lasso, and Pretest.] This figure plots the SURE criterion as a function of $\lambda$ for Ridge, Lasso, and Pretest. The estimate $\hat\lambda^*$ for each of these is chosen as the value minimizing SURE. In this application, Pretest has the lowest estimated risk, which is attained at $\lambda=2.38$.
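To make the data-driven choice of $\lambda$ behind Table 1 and Figures 5, 7, and 9 concrete, the sketch below implements the SURE criterion of Lemma 3 for ridge, lasso, and pretest and minimizes each criterion over a grid. It is a minimal illustration rather than the paper's code: it assumes the standardized case $\sigma_i^2=1$, the function names (`sure_ridge`, `minimize_sure`, and so on) are hypothetical, the simulated spike-and-normal data are made up for the example, and the Gaussian kernel bandwidth used for $\hat f$ in the pretest penalty $2\lambda(\hat f(-\lambda)+\hat f(\lambda))$ is an arbitrary choice.

```python
import numpy as np
from scipy.stats import norm

def sure_ridge(lam, x):
    # SURE of m_R(x, lam) = x / (1 + lam) with sigma^2 = 1:
    # 1 + E_n[(m(X) - X)^2] + 2 E_n[grad_x m(X)] - 2
    return np.mean((lam / (1.0 + lam))**2 * x**2) + 2.0 / (1.0 + lam) - 1.0

def sure_lasso(lam, x):
    # Soft thresholding: (m(X) - X)^2 = min(X^2, lam^2) and grad_x m(X) = 1(|X| > lam)
    return np.mean(np.minimum(x**2, lam**2)) + 2.0 * np.mean(np.abs(x) > lam) - 1.0

def sure_pretest(lam, x, bandwidth=0.5):
    # Hard thresholding: as for lasso, plus the jump terms 2 * lam * (fhat(-lam) + fhat(lam)),
    # where fhat is a Gaussian kernel density estimate of the marginal density of X
    fhat = lambda t: np.mean(norm.pdf((t - x) / bandwidth)) / bandwidth
    return (np.mean(x**2 * (np.abs(x) <= lam)) + 2.0 * np.mean(np.abs(x) > lam) - 1.0
            + 2.0 * lam * (fhat(-lam) + fhat(lam)))

def minimize_sure(criterion, x, grid=np.linspace(0.0, 5.0, 501)):
    # Grid search over the regularization parameter
    values = np.array([criterion(lam, x) for lam in grid])
    j = int(np.argmin(values))
    return grid[j], values[j]

if __name__ == "__main__":
    # Illustrative spike-and-normal data: half true zeros, half N(2, 1), observed with N(0, 1) noise
    rng = np.random.default_rng(1)
    n = 500
    mu = np.where(rng.uniform(size=n) < 0.5, 0.0, rng.normal(2.0, 1.0, size=n))
    x = mu + rng.standard_normal(n)
    for name, crit in [("ridge", sure_ridge), ("lasso", sure_lasso), ("pretest", sure_pretest)]:
        lam_star, r_hat = minimize_sure(crit, x)
        print(f"{name:8s} lambda* = {lam_star:.2f}   estimated risk = {r_hat:.3f}")
```

A grid search of this kind is one simple way of producing estimates analogous to the $\hat\lambda^*$ and $\hat R$ values reported in Table 1.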
Figure 8: Arms Event Study: Shrinkage Estimators

[Figure: the estimated shrinkage functions $\hat m(x)$ and a kernel estimate $\hat f(x)$ of the density of $X$.] The first panel shows the optimal shrinkage estimator (solid line) along with the Ridge, Lasso, and Pretest estimators (dashed lines), evaluated at the SURE-minimizing values of the regularization parameters. The Ridge estimator is linear, with positive slope equal to 0.50. Lasso is piecewise linear, with kinks at the positive and negative versions of the SURE-minimizing value of the regularization parameter, $\hat\lambda_{L,n}=1.50$. Pretest is discontinuous at $\pm\hat\lambda_{PT,n}=\pm 2.39$. In this application, Pretest has the lowest estimated risk.

Figure 9: Mincer series regression: SURE Estimates

[Figure: SURE as a function of $\lambda$ for Ridge, Lasso, and Pretest.] This figure plots the SURE criterion as a function of $\lambda$ for Ridge, Lasso, and Pretest. The estimate $\hat\lambda^*$ for each of these is chosen as the value minimizing SURE. In this application, Lasso has the lowest estimated risk, which is attained at $\lambda=0.59$.

Figure 10: Mincer series regression: Shrinkage Estimators

[Figure: the estimated shrinkage functions $\hat m(x)$ and a kernel estimate $\hat f(x)$ of the density of $X$.] The first panel shows the optimal shrinkage estimator (solid line) along with the Ridge, Lasso, and Pretest estimators (dashed lines), evaluated at the SURE-minimizing values of the regularization parameters. The Ridge estimator is linear, with positive slope equal to 0.99. Lasso is piecewise linear, with kinks at the positive and negative versions of the SURE-minimizing value of the regularization parameter, $\hat\lambda_{L,n}=0.59$. Pretest is discontinuous at $\pm\hat\lambda_{PT,n}$. In this application, Lasso has the lowest estimated risk.

Figure 11: Spike and normal: True and Estimated Risk

[Figure: SURE (estimated risk) and true risk plotted as functions of $\lambda$ for Ridge, Lasso, and Pretest in the spike and normal setting.]

Figure 12: Spike and normal: Shrinkage Estimators

[Figure: the optimal shrinkage function and the estimated shrinkage functions $\hat m(x)$, together with the density $f(x)$ of $X$ and its kernel estimate $\hat f(x)$.]

References

Abowd, J. M., F. Kramarz, and D. N. Margolis (1999). High wage workers and high wage firms. Econometrica 67(2), 251–333.

Abrams, D., M. Bertrand, and S. Mullainathan (2012). Do judges vary in their treatment of race? Journal of Legal Studies 41(2), 347–383.

Athey, S. and G. Imbens (2015, July). Summer Institute 2015 methods lectures.

Belloni, A. and V. Chernozhukov (2011). High dimensional sparse econometric models: An introduction. Springer.

Brown, L. D. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary value problems. The Annals of Mathematical Statistics, 855–903.

Card, D., J. Heining, and P. Kline (2012). Workplace heterogeneity and the rise of West German wage inequality. NBER Working Paper No. 18522.

Chetty, R., J. N. Friedman, and J. E. Rockoff (2014). Measuring the impacts of teachers II: Teacher value-added and student outcomes in adulthood. The American Economic Review 104(9), 2633–2679.

Chetty, R. and N. Hendren (2015). The impacts of neighborhoods on intergenerational mobility: Childhood exposure effects and county-level estimates.

DellaVigna, S. and E. La Ferrara (2010). Detecting illegal arms trade. American Economic Journal: Economic Policy 2(4), 26–57.

Efron, B. (2010). Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction, Volume 1. Cambridge University Press.
Efron, B. and C. Morris (1973). Stein's estimation rule and its competitors: An empirical Bayes approach. Journal of the American Statistical Association 68(341), 117–130.

Friedman, J., T. Hastie, and R. Tibshirani (2001). The Elements of Statistical Learning, Volume 1. Springer Series in Statistics. Berlin: Springer.

James, W. and C. Stein (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, pp. 361–379.

Kleinberg, J., J. Ludwig, S. Mullainathan, and Z. Obermeyer (2015). Prediction policy problems. American Economic Review 105(5), 491–495.

Krueger, A. (1999). Experimental estimates of education production functions. The Quarterly Journal of Economics 114(2), 497–532.

Leeb, H. and B. M. Pötscher (2005). Model selection and inference: Facts and fiction. Econometric Theory 21(1), 21–59.

Morris, C. N. (1983). Parametric empirical Bayes inference: Theory and applications. Journal of the American Statistical Association 78(381), 47–55.

Newey, W. K. (1997). Convergence rates and asymptotic normality for series estimators. Journal of Econometrics 79(1), 147–168.

Robbins, H. (1956). An empirical Bayes approach to statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California.

Robbins, H. (1964). The empirical Bayes approach to statistical decision problems. The Annals of Mathematical Statistics, 1–20.

Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics 9(6), 1135–1151.

Stigler, S. M. (1990). The 1988 Neyman memorial lecture: A Galtonian perspective on shrinkage estimators. Statistical Science, 147–155.

Stock, J. H. and M. W. Watson (2012). Generalized shrinkage methods for forecasting using many predictors. Journal of Business & Economic Statistics 30(4), 481–493.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 267–288.

van der Vaart, A. (2000). Asymptotic Statistics. Cambridge University Press.

Wasserman, L. (2006). All of Nonparametric Statistics. Springer Science & Business Media.

Zhang, C.-H. (2003). Compound decision theory and empirical Bayes methods: Invited paper. The Annals of Statistics 31(2), 379–390.