Nearly Weighted Risk Minimal Unbiased Estimation

Ulrich K. Müller and Yulong Wang
Economics Department, Princeton University

July 2015

Abstract

We study non-standard parametric estimation problems, such as the estimation of the AR(1) coefficient close to the unit root. We develop a numerical algorithm that determines an estimator that is nearly (mean or median) unbiased and, among all such estimators, comes close to minimizing a weighted average risk criterion. We demonstrate the usefulness of our generic approach by also applying it to estimation in a predictive regression, estimation of the degree of time variation, and long-range quantile point forecasts for an AR(1) process with coefficient close to unity.

JEL classification: C13, C22

Keywords: Mean bias, median bias, local-to-unity, nearly non-invertible MA(1)

The authors thank seminar participants at Harvard/MIT, Princeton, the University of Kansas and the University of Pennsylvania for helpful comments and suggestions. Müller gratefully acknowledges financial support by the NSF via grant SES-1226464.

1 Introduction

Much recent work in econometrics concerns inference in non-standard problems, that is, problems that do not reduce to a Gaussian shift experiment, even asymptotically. Prominent examples include inference under weak identification (cf. Andrews, Moreira and Stock (2006, 2008)), in models involving local-to-unity time series (Elliott, Rothenberg, and Stock (1996), Jansson and Moreira (2006), Mikusheva (2007)), and in moment inequality models. The focus of this recent work is the derivation of hypothesis tests and confidence sets, and in many instances, optimal or nearly optimal procedures have been identified. In comparison, the systematic derivation of attractive point estimators has received less attention.

In this paper we seek to make progress by deriving estimators that come close to minimizing a weighted average risk (WAR) criterion, under the constraint of having uniformly low bias. Our general framework allows for a wide range of loss functions and bias constraints, such as mean or median unbiasedness.

The basic approach is to finitely discretize the bias constraint. For instance, one might impose zero or small bias only under m distinct parameter values. Under this discretization, the derivation of a WAR minimizing unbiased estimator reduces to a Lagrangian problem with m non-negative Lagrange multipliers. The Lagrangian can be written as an integral over the data, since both the objective and the constraints can be written as expectations. Thus, for given multipliers, the best estimator simply minimizes the integrand for all realizations of the data, and so is usually straightforward to determine. Furthermore, it follows from standard duality theory that the value of the Lagrangian, evaluated at arbitrary non-negative multipliers, provides a lower bound on the WAR of any estimator that satisfies the constraints. This lower bound holds a fortiori in the non-discretized version of the problem, since the discretization amounts to a relaxation of the uniform unbiasedness constraint.

We use numerical techniques to obtain approximately optimal Lagrange multipliers, and use them for two conceptually distinct purposes. On the one hand, close to optimal Lagrange multipliers imply a large, and thus particularly informative, lower bound on the WAR of any nearly unbiased estimator. On the other hand, close to optimal Lagrange multipliers yield an estimator that is nearly unbiased in the discrete problem.
Thus, with a fine enough discretization and some smoothness of the problem, this estimator also comes close to having uniformly small bias. Combining these two usages then allows us to conclude that we have in fact identified a nearly WAR minimizing estimator among all estimators that have uniformly small bias.

It is not possible to finely discretize a problem involving an unbounded parameter space. At the same time, many non-standard problems become approximately standard in the unbounded part of the parameter space. For instance, a large concentration parameter in a weak instrument problem, or a large local-to-unity parameter in a time series problem, induces nearly standard behavior of test statistics and estimators. We thus suggest "switching" to a suitable nearly unbiased estimator in this standard part of the parameter space in a way that ensures overall attractive properties.

These elements (discretization of the original problem, analytical lower bound on risk, numerical approximation, switching) are analogous to Elliott, Müller and Watson's (2015) approach to the determination of nearly weighted average power maximizing tests in the presence of nuisance parameters. Our contribution here is the transfer of the same ideas to estimation problems.

There are two noteworthy special cases that have no counterpart in hypothesis testing. First, under squared loss and a mean bias constraint, the Lagrangian minimizing estimator is linear in the Lagrange multipliers. WAR then becomes a positive definite quadratic form in the multipliers, and the constraints are a linear function of the multipliers. The numerical determination of the multipliers thus reduces to a positive definite quadratic programming problem, which is readily solved even for large m by well-known and widely implemented algorithms.¹ Second, under a median unbiasedness constraint and in the absence of any nuisance parameters, it is usually straightforward to numerically determine an exactly median unbiased estimator by inverting the median function of a suitable statistic, even in a non-standard problem. This approach was prominently applied in econometrics in Andrews (1993) and Stock and Watson (1998), for instance. However, the median inversion of different statistics typically yields different median unbiased estimators. Our approach here can be used to determine the right statistic to invert under a given WAR criterion: median inverting a nearly WAR minimizing nearly median unbiased estimator typically yields an exactly median unbiased nearly WAR minimizing estimator, as the median function of the nearly median unbiased estimator is close to the identity function.

¹ In this special case, and with a weighting function that puts all mass at one parameter value, the Lagrangian bound on WAR reduces to a Barankin (1946)-type bound on the MSE of an unbiased or biased estimator, as discussed by McAulay and Hofstetter (1971), Glave (1972) and Albuquerque (1973).

In addition, invariance considerations are more involved and potentially more powerful in an estimation problem compared to a hypothesis testing problem, since it sometimes makes sense to impose invariance (or equivariance) with respect to the parameter of interest. We suitably extend the theory of invariant estimators in Chapter 3 of Lehmann and Casella (1998) in Section 4 below.

We demonstrate the usefulness of our approach in four time series estimation problems. In all of them, we consider an asymptotic version of the problem.
First, we consider mean and median unbiased estimation of the local-to-unity AR(1) parameter with known and unknown mean, and with different assumptions on the initial condition. The analysis of the bias in the AR(1) model has spawned a very large literature (see, among others, Hurwicz (1950), Marriott and Pope (1954), Kendall (1954), White (1961), Phillips (1977, 1978), Sawa (1978), Tanaka (1983, 1984), Shaman and Stine (1988), Yu (2012)), with numerous suggestions of alternative estimators with less bias (see, for instance, Quenouille (1956), Orcutt and Winokur (1969), Andrews (1993), Andrews and Chen (1994), Park and Fuller (1995), MacKinnon and Smith (1998), Cheang and Reinsel (2000), Roy and Fuller (2001), Crump (2008), Phillips and Han (2008), Han, Phillips, and Sul (2011) and Phillips (2012)). With the exception of the conditional optimality of Crump's (2008) median unbiased estimator in a model with asymptotically negligible initial condition, none of these papers makes an optimality claim, and we are not aware of a previously suggested estimator with uniformly nearly vanishing mean bias. As such, our approach comes close to solving this classic time series problem.

Second, we consider a nearly non-invertible MA(1) model. The parameter estimator here suffers from the so-called pile-up problem, that is, the MA(1) root is estimated to be exactly unity with positive probability in the non-invertible case. Stock and Watson (1998) derive an exactly median unbiased estimator by median inverting the Nyblom (1989) statistic in the context of estimating the degree of parameter time variation. We derive an alternative, nearly WAR minimizing median unbiased estimator under absolute value loss and find it to have very substantially lower risk under moderate and large amounts of time variation.

Third, we derive nearly WAR minimizing mean and median unbiased estimators of the regression coefficient in the so-called predictive regression problem, that is, with a weakly exogenous local-to-unity regressor. This contributes to previous analyses and estimator suggestions by Mankiw and Shapiro (1986), Stambaugh (1986, 1999), Cavanagh, Elliott, and Stock (1995), Jansson and Moreira (2006), Kothari and Shanken (1997), Amihud and Hurvich (2004), Eliasz (2005) and Chen and Deo (2009).

Finally, we use our framework to derive nearly quantile unbiased long-range forecasts from a local-to-unity AR(1) process with unknown mean. The unbiased property here means that, in repeated applications of the forecast rule for a given quantile, the future value realizes to be smaller than the forecast with the corresponding probability, under all parameter values. See Phillips (1979), Stock (1996), Kemp (1999), Gospodinov (2002), Turner (2004) and Elliott (2006) for related studies of forecasts in a local-to-unity AR(1).

The remainder of the paper is organized as follows. Section 2 sets up the problem, derives the lower bound on WAR, and discusses the numerical implementation. Section 3 considers the two special cases of mean unbiased estimation under squared loss, and of median unbiased estimation without nuisance parameters. Section 4 extends the framework to invariant estimators. Throughout Sections 2-4, we use the problem of estimating the AR(1) coefficient under local-to-unity asymptotics as our running example. In Section 5, we provide details on the three additional examples mentioned above. Section 6 concludes.

2 Estimation, Bias and Risk

2.1 Set-Up and Notation

We observe the random element X in the sample space $\mathcal{X}$.
The density of X relative to some $\sigma$-finite measure $\nu$ is $f_\theta$, where $\theta \in \Theta$ and $\Theta$ is the parameter space. We are interested in estimating $h(\theta) \in H$ with estimators $\delta : \mathcal{X} \mapsto H$. For scalar estimation problems, $H \subset \mathbb{R}$, but our set-up allows for more general estimands. Estimation errors lead to losses as measured by the function $\ell : H \times \Theta \mapsto [0, \infty)$ (so that $\ell(\delta(x), \theta)$ is the loss incurred by the estimate $\delta(x)$ if the true parameter is $\theta$). The risk of the estimator $\delta$ is given by its expected loss,
$$r(\delta, \theta) = E_\theta[\ell(\delta(X), \theta)] = \int \ell(\delta(x), \theta) f_\theta(x)\, d\nu(x).$$

Beyond the estimator's risk, we are also concerned about its bias. For some function $c : H \times \Theta \mapsto H$, the bias of $\delta$ is defined as $b(\delta, \theta) = E_\theta[c(\delta(X), \theta)]$. For instance, for the mean bias of a scalar parameter of interest, $c(\delta, \theta) = \delta - h(\theta)$, so that $b(\delta, \theta) = E_\theta[\delta(X)] - h(\theta)$, and for the median bias, $c(\delta, \theta) = \mathbf{1}[\delta > h(\theta)] - \tfrac{1}{2}$, so that $b(\delta, \theta) = P_\theta[\delta(X) > h(\theta)] - \tfrac{1}{2}$.

We are interested in deriving estimators $\delta$ that minimize risk subject to an unbiasedness constraint. In many problems of interest, a uniformly risk minimizing $\delta$ might not exist, even under the bias constraint. To make further progress, we thus measure the quality of estimators by their weighted average risk $R(\delta, F) = \int r(\delta, \theta)\, dF(\theta)$ for some given non-negative finite measure $F$ with support in $\Theta$.

In this notation, the weighted risk minimal unbiased estimator solves the program
$$\min_\delta R(\delta, F) \qquad (1)$$
$$\text{s.t. } b(\delta, \theta) = 0 \text{ for all } \theta \in \Theta. \qquad (2)$$
More generally, one might also be interested in deriving estimators that are only approximately unbiased, that is, solutions to (1) subject to
$$-\varepsilon \le b(\delta, \theta) \le \varepsilon \text{ for all } \theta \in \Theta \qquad (3)$$
for some $\varepsilon \ge 0$. Allowing the bounds on $b(\delta, \theta)$ to depend on $\theta$ does not yield greater generality, as they can be subsumed in the definition of the function $c$. For instance, a restriction on the relative mean bias to be no more than 5% is achieved by setting $c(\delta, \theta) = (\delta - h(\theta))/h(\theta)$ and $\varepsilon = 0.05$.

An estimator $\delta$ is called risk unbiased if $E_{\theta_0}[\ell(\delta(X), \theta_0)] \le E_{\theta_0}[\ell(\delta(X), \theta)]$ for any $\theta_0, \theta \in \Theta$. As discussed in Chapter 3 of Lehmann and Casella (1998), risk unbiasedness is a potentially attractive property, as it ensures that under the true value $\theta_0$, the estimate $\delta(x)$ is at least as close to $\theta_0$, in expectation as measured by $\ell$, as it is to any other value of $\theta$. It is straightforward to show that under squared loss $\ell(\delta(x), \theta) = (\delta(x) - h(\theta))^2$ a risk unbiased estimator is necessarily mean unbiased, and under absolute value loss $\ell(\delta(x), \theta) = |\delta(x) - h(\theta)|$, it is necessarily median unbiased. From this perspective, squared loss and mean unbiasedness constraints, and absolute value loss and median unbiasedness constraints, form natural pairs, and we report results for these pairings in our examples. The following development, however, does not depend on any assumptions about the relationship between loss function and bias constraint, and other pairings of loss function and constraints might be more attractive in specific applications.

In many examples, the risk of good estimators is far from constant, as the information in X about $h(\theta)$ varies with $\theta$. This makes risk comparisons and the weighting function $F$ more difficult to interpret. Similarly, the mean bias of an estimator is naturally gauged relative to its sampling variability. To address this issue, we introduce a normalization function $n(\theta)$ that roughly corresponds to the root mean squared error of a good estimator at $\theta$. The normalized risk $r_n(\delta, \theta)$ is then given by $r_n(\delta, \theta) = E_\theta[(\delta(X) - h(\theta))^2]/n(\theta)^2$ and $r_n(\delta, \theta) = E_\theta[|\delta(X) - h(\theta)|]/n(\theta)$ under squared and absolute value loss, respectively, and the normalized mean bias is $b_n(\delta, \theta) = (E_\theta[\delta(X)] - h(\theta))/n(\theta)$.
The weighted average R normalized risk with weighting function Fn , rn ( ; )dFn ( ), then reduces to R( ; F ) above with dF ( ) = dFn ( )=n( )2 and dF ( ) = dFn ( )=n( ) in the squared loss and absolute value loss case, respectively, and bn ( ; ) = b( ; ) with c( ; ) = ( h( ))=n( ). The median bias, of course, is readily interpretable without any normalization. Running Example: Consider an autoregressive process of order 1, yt = yt 1; : : : ; T , where "t iidN (0; 1) and y0 = 0. Under local-to-unity asymptotics, 1 + "t , t = = T = =T , this model converges to the limit experiment (in the sense of LeCam) of observing 1 X, where X is an Ornstein-Uhlenbeck process dX(s) = X(s)ds + dW (s) with X(0) = 0 and W a standard Wiener process. The random element X takes on values in the set of cadlag functions on the unit interval X , and the Radon-Nikodym derivative of X relative to Wiener measure is f (x) = exp[ The aim is to estimate 1 2 (x(1) 2 1) 1 2 2 Z 1 x(s)2 ds]: (4) 0 2 R in a way that (nearly) minimizes weighted risk under a mean or median bias constraint. More generally, suppose that yt = yt and = 1 1 + ut with y0 = op (T 1=2 ), T 1=2 2 P[ T ] t=1 ut ) W ( ), =T . Then for any consistent estimator ^ of the long run variance fut g, XT ( ) = T 1=2 2 of ^ 1 y[ T ] ) X( ). Thus, for any -almost surely continuous estimator : X ! R in the limiting problem, (T 1=2 ^ 1 y[ T ] ) has the same asymptotic distribution as (X) by the continuous mapping theorem. Therefore, if (X) is (nearly) median unbiased for , then (XT ) is asymptotically (nearly) median unbiased. For a (nearly) mean unbiased estimator (X), one could put constraints on fut g that ensure uniform integrability of (XT ), so that again E [ (XT )] ! E [ (X)] for all . Alternatively, one might de…ne “asymptotic mean bias”and “asymptotic risk”as the sequential limits lim lim E [1[j (XT )j < K] (XT ) K!1 T !1 h( )] lim lim E [1[`( (XT ); ) < K]`( (XT ); )] K!1 T !1 respectively. With these de…nitions, the bias and (weighted average) risk of (XT ) converges to the bias and weighted average risk of by construction, as long as continuous with …nite mean and risk relative to (4). 6 is -almost surely Figure 1: Normalized Mean Bias and Mean Squared Error of OLS Estimator in Local-toUnity AR(1) The usual OLS estimator for (which is also equal to the MLE) is OLS (x) = 12 (x(1)2 R1 1)= 0 x(s)2 ds. As demonstrated by Phillips (1987) and Chan and Wei (1987), for large , 1=2 a ) ) N (0; 1) as ! 1, yielding the approximation OLS (X) N ( ; 2 ). p We thus use the normalization function n( ) = 2 + 10 (the o¤set of 10 ensures that (2 ) ( OLS (X) n( ) > 0 even as = 0). We initially focus on estimating under squared loss and a mean bias constraint for 0; we return to the problem of median unbiased estimation in Section 3.2 below. Figure 1 depicts the normalized mean bias and the normalized mean squared error of function of . As is well known, the bias of strong mean reversion of, say, OLS OLS as a is quite substantial: even under fairly = 60, the bias is approximately on …fth of the root mean squared error of the estimator. N 2.2 A Lower Bound on Weighted Risk of Unbiased Estimators In general, it will be di¢ cult to analytically solve (1) subject to (3). Both X and H are typically uncountable, so we are faced with an optimization problem in a function space. Moreover, it is usually di¢ cult to obtain analytical closed-form expressions for the integrals de…ning r( ; ) and b( ; ), so one must rely on approximation techniques, such as Monte Carlo simulation. 
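For concreteness, the bias and risk curves of Figure 1 can be approximated by exactly this kind of Monte Carlo exercise. The sketch below (in Python, purely illustrative rather than the posted Matlab code) assumes the parametrization $dX(s) = -\gamma X(s)ds + dW(s)$ with $\gamma \ge 0$, under which the OLS/MLE statistic is $\delta_{OLS}(x) = -(x(1)^2 - 1)/(2\int_0^1 x(s)^2 ds)$ and the normalization is $n(\gamma) = \sqrt{2\gamma + 10}$; the Euler step size, number of draws and the $\gamma$ values are illustrative choices.

```python
import numpy as np

N_STEPS, N_DRAWS = 1000, 5000   # Euler grid and Monte Carlo size (illustrative)

def simulate_ou_paths(gamma, n_draws, rng):
    """Paths of dX = -gamma*X ds + dW on [0,1] with X(0) = 0, Euler discretized."""
    ds = 1.0 / N_STEPS
    dw = rng.normal(scale=np.sqrt(ds), size=(n_draws, N_STEPS))
    x = np.zeros((n_draws, N_STEPS + 1))
    for t in range(N_STEPS):
        x[:, t + 1] = x[:, t] * (1.0 - gamma * ds) + dw[:, t]
    return x

def delta_ols(x):
    """delta_OLS(x) = -(x(1)^2 - 1) / (2 * int_0^1 x(s)^2 ds), applied row-wise."""
    return -(x[:, -1] ** 2 - 1.0) / (2.0 * np.mean(x[:, :-1] ** 2, axis=1))

def n_fn(gamma):
    """Normalization n(gamma) = sqrt(2*gamma + 10) from the running example."""
    return np.sqrt(2.0 * gamma + 10.0)

rng = np.random.default_rng(0)
for gamma in [0.0, 10.0, 30.0, 60.0]:
    est = delta_ols(simulate_ou_paths(gamma, N_DRAWS, rng))
    bias = (est.mean() - gamma) / n_fn(gamma)
    mse = np.mean((est - gamma) ** 2) / n_fn(gamma) ** 2
    print(f"gamma = {gamma:5.1f}   normalized bias = {bias:+.3f}   normalized MSE = {mse:.3f}")
```

With such simulated draws in hand, the same routine evaluates the bias and risk of any candidate estimator, which is all that the numerical approach developed below requires.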
For these reasons, it seems natural to resort to numerical techniques to obtain an approximate solution. 7 There are potentially many ways of approaching this numerical problem. For instance, one might posit some sieve-type space for , and numerically determine the (now …nitedimensional) parameter that provides the relatively best approximate solution. But to make this operational, one must choose some dimension of the sieve space, and it unclear how much better of a solution one might have been able to …nd in a di¤erent or more highly-dimensional sieve space. It would therefore be useful to have a lower bound on the weighted risk R( ; F ) that holds for all estimators that satisfy (3). If an approximate solution ^ is found that also satis…es (3) and whose weighted risk R(^; F ) is close to the bound, then we know that we have found the nearly best solution overall. To derive such a bound, we relax the constraint (3) by replacing it by a …nite number of constraints: Let Gi ; i = 1; : : : ; m be probability distributions on ; and de…ne the weighted R average bias B( ; G) of the estimator as B( ; G) = b( ; )dG( ). Then any estimator that satis…es (3) clearly also satis…es " " for all i = 1; : : : ; m: B( ; Gi ) A special case of (5) has Gi equal to a point mass at at the …nite number of parameter values i, (5) so that (5) amounts to imposing (3) 1; : : : ; m. Now consider the Lagrangian for the problem (1) subject to (5), L( ; ) = R( ; F ) + m X u i (B( ; Gi ) ") + i=1 where = ( 1; : : : ; m) and i = ( li ; m X l i( B( ; Gi ) (6) ") i=1 u i ). By writing R( ; F ) and B( ; Gi ) in terms of their de…ning integrals and by assuming that we can change the order of integration, we obtain L( ; ) = Z Z f (x)`( (x); )dF ( ) + + m X i=1 m X Z u i( l i( i=1 Let be the estimator such that for a given , f (x)c( (x); )dGi ( ) Z f (x)c( (x); )dGi ( ) (7) ") ! ") d (x): (x) minimizes the integrand on the right hand side of (7) for each x. Since minimizing the integrand at each point is su¢ cient to minimize the integral, necessarily minimizes L over . 8 This argument requires that the change of the order of integration in (7) is justi…ed. By Fubini’s Theorem, this is always the case for R( ; F ) (since ` is non-negative), and also for B( ; Gi ) if c is bounded (as is always the case for the median bias, and for the mean bias if H is bounded). For unbounded H and mean bias constraints, it su¢ ces that there exists some estimator with uniformly bounded second moment. Standard Duality Theory for optimization implies the following Lemma. Lemma 1 Suppose ~ satis…es (5), and for arbitrary 0 (that is, each element in non-negative), minimizes L( ; ) over . Then R(~; F ) L( ; ). is L( ; ) by de…nition of , so in particular, L(~; ) L(~; ) since 0 and ~ satis…es (5). Combining these Proof. For any , L( ; ) L( ; ). Furthermore, R(~; F ) inequalities yields the result. Note that the bound on R(~; F ) in Lemma 1 does not require an exact solution to the discretized problem (1) subject to (5). Any some choices for 0 implies a valid bound L( ; ), although yield better (i.e., larger) bounds than others. Since (5) is implied by (3), these bounds hold a fortiori for any estimator satisfying the uniform unbiasedness constraint (3). Running Example: Suppose m = 1, Fn puts unit mass at mass at 1 > 0. 
0 = 0 and G1 puts point )2 , normalized mean bias constraint With square loss `( ; ) = ( c( ; ) = ( )=n( ), " = 0 and u1 = 0, (x) minimizes Z Z 2 ( (x) 0) l f (x)`( (x); )dF ( ) 1 f (x)c( (x); )dGi ( ) = f 0 (x) n( 0 )2 This is solved by (x) = 0 + 1 2 2 l n( 0 ) f 1 (x)=(f 0 (x)n( 1 )). For numerical calculation yields R( ; F ) = rn ( ; 0) 1 l 1f 1 (x) = 3 and = 0:48 and B( ; G1 ) = bn ( ; l 1 (x) 1 : n( 1 ) = 1:5, a 1) = 0:06; so that L( ; ) = 0:39. We can conclude from Lemma 1 that no estimator that is unbiased for all 2 R can have risk R( ; F ) = rn ( ; for all estimators 0) that are unbiased only at smaller than 0:39, and the same holds also 1. Moreover, this lower risk bound also holds for all estimators whose normalized bias is smaller than 0.06 uniformly in 2 R, and more generally for all estimators whose normalized mean bias is smaller than 0.06 in absolute value only at If u i 1. N 0 is such that (B( ; Gi ) ") = 0 and l i satis…es (5) and the complementarity slackness conditions ( B( ; Gi ) ") = 0 hold, then (6) implies L( so that by an application of Lemma 1, L( ; 9 ) is the best lower bound. ; ) = R( ; F ), 2.3 Numerical Approach Solving the program (1) subject to (5) thus yields the largest, and thus most informative bound on weighted risk R( ; F ). In addition, solutions to this program also plausibly satisfy a slightly more relaxed version of original non-discretized constraint (3). The reason is that if the bias b( dense in ; ) does not vary greatly in , and the distributions Gi are su¢ ciently , then any estimator that satis…es (5) cannot violate (3) by very much. This suggests the following strategy to obtain a nearly weighted risk minimizing nearly unbiased estimator, that is, for given "B > 0 and "R > 0, an estimator ^ that (i) satis…es (3) with " = "B ; (ii) has weighted risk R(^; F ) (1 + "R )R( ; F ) for any estimator satisfying (3) with " = "B . 1. Discretize by point masses or other distributions Gi , i = 1; : : : ; m. y 2. Obtain approximately optimal Lagrange multipliers ^ for the problem (1) subject to y (5) for " = "B , and associated value R = L( y ; ^ ). ^ 3. Obtain an approximate solution (^" ; ^ ) to the problem min " s.t. R( ; F ) and (8) (1 + "R )R " 0; " B( ; Gi ) ", i = 1; : : : ; m (9) and check whether ^ satis…es (3). If it doesn’t, go back to Step 1 and use a …ner discretization. If it does, ^ = ^ has the desired properties by an application of Lemma 1. nor ^ have to be exact solutions to their respective programs to be able to conclude that ^ is indeed a nearly weighted risk minimizing nearly unbiased Importantly, neither ^y estimator as de…ned above. The solution ^ to the problem in Step 3 has the same form as a weighted average of the integrand in (7), but , that is ^ minimizes now is the ratio of the Lagrange multipli- ers corresponding to the constraints (9) and the Lagrange multiplier corresponding to the constraint (8). These constraints are always feasible, since = ^y satis…es (9) with " = "B y (at least if ^ is the exact solution), and R( ^y ; F ) = R < (1 + "R )R. The additional slack provided by "R is used to tighten the constraints on B(^ ; Gi ) to potentially obtain a ^ satisfying (3). A …ner discretization implies additional constraints in (5) and thus (weakly) 10 increases the value of R. At the same time, a …ner discretization also adds additional constraints on the bias function of ^ , making it more plausible that it satis…es the uniform constraint (3). 
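As a concrete instance of what Step 2 delivers, the m = 1 bound from the running example above can be reproduced by simulation. The sketch below evaluates the risk, the G1-averaged bias and the resulting Lagrangian lower bound for fixed multipliers, assuming the parametrization $dX(s) = -\gamma X(s)ds + dW(s)$, the density $f_\gamma(x) = \exp[-\tfrac{1}{2}\gamma(x(1)^2 - 1) - \tfrac{1}{2}\gamma^2 \int_0^1 x(s)^2 ds]$ of (4), and the closed-form pointwise minimizer of the quadratic integrand; simulation sizes are illustrative, and the printed values will differ somewhat from the exact figures quoted above because of discretization and Monte Carlo error.

```python
import numpy as np

N_STEPS, N_DRAWS = 1000, 20000
GAMMA1, LAM_L, LAM_U, EPS = 3.0, 1.5, 0.0, 0.0   # constraint point and multipliers as in the example

def n_fn(g):
    return np.sqrt(2.0 * g + 10.0)

def simulate_ou_paths(gamma, n_draws, rng):
    ds = 1.0 / N_STEPS
    dw = rng.normal(scale=np.sqrt(ds), size=(n_draws, N_STEPS))
    x = np.zeros((n_draws, N_STEPS + 1))
    for t in range(N_STEPS):
        x[:, t + 1] = x[:, t] * (1.0 - gamma * ds) + dw[:, t]
    return x

def density_ratio(x, gamma):
    """f_gamma(x) / f_0(x); relative to Wiener measure, f_0 is identically 1."""
    a = x[:, -1] ** 2 - 1.0
    b = np.mean(x[:, :-1] ** 2, axis=1)
    return np.exp(-0.5 * gamma * a - 0.5 * gamma ** 2 * b)

def delta_star(x):
    """Pointwise minimizer of the Lagrangian integrand (quadratic in delta), F a point mass at 0."""
    return 0.5 * (LAM_L - LAM_U) * n_fn(0.0) ** 2 / n_fn(GAMMA1) * density_ratio(x, GAMMA1)

rng = np.random.default_rng(0)
x_under_0 = simulate_ou_paths(0.0, N_DRAWS, rng)       # draws for the risk at gamma = 0
x_under_1 = simulate_ou_paths(GAMMA1, N_DRAWS, rng)    # draws for the bias at gamma = gamma1

risk = np.mean(delta_star(x_under_0) ** 2) / n_fn(0.0) ** 2
bias = (np.mean(delta_star(x_under_1)) - GAMMA1) / n_fn(GAMMA1)
bound = risk + LAM_U * (bias - EPS) + LAM_L * (-bias - EPS)   # Lagrangian (6); a lower bound by Lemma 1
print(f"R = {risk:.3f},  B = {bias:.3f},  lower bound L = {bound:.3f}")
```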
We suggest using simple …xed point iterations to obtain approximate solutions in Steps 2 and 3, similar to EMW’s approach to numerically approximate an approximately least favorable distribution. See Appendix B for details. Once the Lagrange multipliers underlying ^ are determined, ^ (x) for given data x is simply the minimizer of the integrand on the right hand side of (7). We provide corresponding tables and matlab code for all examples in this paper under www.princeton.edu/ yulong. An important class of non-standard estimation problems are limit experiments that arise from the localization of an underlying parameter 2 around problematic parameter values , such as the local-to-unity asymptotics for the AR(1) estimation problem in the N running example. Other examples include weak instrument asymptotics (where the limit is taken for values of the concentration parameter local-to-zero), or local-to-binding moment conditions in moment inequality models. In these models, standard estimators, such as, say, (bias corrected) maximum likelihood estimators, are typically asymptotically unbiased and e¢ cient for values of 2 = N. Moreover, this good behavior of standard estimators typically also arises, at least approximately, under local asymptotics with localization parameter &( ) ! 1 for some appropriate function & : as 7! R. For instance, as already mentioned in Section 2.1, Phillips (1987) and Chan and Wei (1987) show that the OLS estimator for the AR(1) coe¢ cient =T as &( ) = 1 = is approximately normal under local-to-unity asymptotics = T = ! 1. This can be made precise by limit of experiment arguments, as discussed in Section 4.1 of EMW. Intuitively, the boundary of the localized parameter space typically continuously connects to the unproblematic part of the parameter space of the underlying parameter . n N It then makes sense to focus the numerical strategy on the truly problematic part of the local parameter space, where the standard estimator S is substantially biased or ine¢ cient. As in EMW, consider a partition of the parameter space For large enough where S, S = f : &( ) > Sg into two regions = N [ S. is the (typically unbounded) standard region n S is the problematic nonN = ^ standard region. Correspondingly, consider estimators that are of the following switching S is approximately unbiased and e¢ cient, and form ^(x) = (x) S (x) + (1 11 (x)) N (x) (10) where (x) = 1[^& (x) > cut-o¤ S, > ] for some estimator ^& (X) of &( ). By choosing a large enough we can ensure that S is only applied to realizations of X that stem from the relatively unproblematic part of the parameter space S with high probability. The numerical problem then becomes the determination of the estimator N : X ! 7 H that leads to overall appealing bias and e¢ ciency properties of ^ in (10) for any 2 . As discussed in EMW, an additional advantage of this switching approach is that the weighting function F can then focus on and the “seam”part of S where P ( (X)) is not close to zero or one. The good e¢ ciency properties of ^ on the remaining part of S are simply inherited N by the good properties of S. Incorporating this switching approach in the numerical strategy above is straightforward: For given , the problem in Step 3 now has constrained to be of the form (10), that is for all x that lead to (x) = 1, ^ (x) = S (x), and the numerical challenge is only to determine a suitable N = ^ . 
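To fix ideas on how Step 2 can be attacked numerically, the following toy sketch updates the multipliers of the running example by projected subgradient ascent on the dual: for candidate multipliers the Lagrangian minimizer is available in closed form under quadratic loss with F a point mass at $\gamma = 0$, the Gi-averaged biases are estimated by simulation, and each multiplier is moved in the direction of its constraint violation. This is not the fixed-point scheme of Appendix B (which is not reproduced here); the constraint points, step size, iteration count and simulation sizes are all illustrative assumptions.

```python
import numpy as np

N_STEPS, N_DRAWS = 500, 5000
CONSTRAINT_POINTS = np.array([1.0, 3.0, 6.0])    # toy discretization of the bias constraint (Step 1)
EPS, STEP, N_ITER = 0.01, 0.2, 300               # tolerance, dual step size, iterations (illustrative)

def n_fn(g):
    return np.sqrt(2.0 * g + 10.0)

def simulate_ou_paths(gamma, n_draws, rng):
    ds = 1.0 / N_STEPS
    dw = rng.normal(scale=np.sqrt(ds), size=(n_draws, N_STEPS))
    x = np.zeros((n_draws, N_STEPS + 1))
    for t in range(N_STEPS):
        x[:, t + 1] = x[:, t] * (1.0 - gamma * ds) + dw[:, t]
    return x

def density_ratio(x, gamma):
    """f_gamma(x)/f_0(x) from (4); the reference density under Wiener measure is 1."""
    a = x[:, -1] ** 2 - 1.0
    b = np.mean(x[:, :-1] ** 2, axis=1)
    return np.exp(-0.5 * gamma * a - 0.5 * gamma ** 2 * b)

rng = np.random.default_rng(0)
sampling_points = [0.0] + list(CONSTRAINT_POINTS)    # expectations are needed under each of these
draws = {g: simulate_ou_paths(g, N_DRAWS, rng) for g in sampling_points}
ratios = {g: np.array([density_ratio(draws[g], gi) for gi in CONSTRAINT_POINTS]) for g in sampling_points}

def delta_lambda(g, lam_l, lam_u):
    """Closed-form Lagrangian minimizer (quadratic loss, F = point mass at 0), on draws from gamma = g."""
    weights = 0.5 * n_fn(0.0) ** 2 * (lam_l - lam_u) / n_fn(CONSTRAINT_POINTS)
    return weights @ ratios[g]

lam_l = np.zeros(CONSTRAINT_POINTS.size)
lam_u = np.zeros(CONSTRAINT_POINTS.size)
for _ in range(N_ITER):
    bias = np.array([(delta_lambda(gi, lam_l, lam_u).mean() - gi) / n_fn(gi) for gi in CONSTRAINT_POINTS])
    lam_u = np.maximum(0.0, lam_u + STEP * (bias - EPS))    # dual ascent on the upper constraints
    lam_l = np.maximum(0.0, lam_l + STEP * (-bias - EPS))   # dual ascent on the lower constraints

bias = np.array([(delta_lambda(gi, lam_l, lam_u).mean() - gi) / n_fn(gi) for gi in CONSTRAINT_POINTS])
risk = np.mean(delta_lambda(0.0, lam_l, lam_u) ** 2) / n_fn(0.0) ** 2
bound = risk + lam_u @ (bias - EPS) + lam_l @ (-bias - EPS)  # Lemma 1 lower bound at the final multipliers
print("lower multipliers:", np.round(lam_l, 2), " upper multipliers:", np.round(lam_u, 2))
print("normalized risk at 0:", round(float(risk), 3), "  dual lower bound:", round(float(bound), 3))
```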
By not imposing the form (10) in Step 2, we still obtain a bound on N weighted risk that is valid for any estimator satisfying "B "B for all b( ; ) (that is, without imposing the functional form (10)). Of course, for a given switching rule , it might be that no nearly minimal weighted risk unbiased estimator of the form (10) exists, because S is not unbiased or e¢ cient enough relative to the weighting F , so one might try instead with larger values for S and . Running Example: Taking the local-to-unity limit of Marriott and Pope’s (1954) mean bias expansion E[^OLS ] = model without intercept suggests 2 =T + o(T S (x) = 1 ) for the AR(1) OLS estimator ^OLS in a OLS (x) 2 as a good candidate for indeed, it can be inferred from Figure 1 that the normalized mean bias of 0:01 for P (^& (X) > > S = 20. Furthermore, setting ^& (x) = ) < 0:1% for all S, OLS (x) and S S. And is smaller than = 40, we …nd that so draws from the problematic part of the parameter space are only very rarely subject to the switching. It is now an empirical matter whether imposing the switching to S in this fashion is costly in terms of risk, and this can be evaluated with the algorithm above by setting the upper bound of the support of F to, say, 80, so that the weighted average risk criterion incorporates parameter values where switching to for S occurs with high probability (P (^& (X) > > 70). N 12 ) > 99% 3 Two Special Cases 3.1 Mean Unbiased Estimation under Quadratic Loss Consider the special case of quadratic loss `( ; ) = ( constraint c( ; ) = ( h( ))2 and normalized mean bias h( ))=n( ) (we focus on a scalar estimand for expositional ease, but the following discussion straightforwardly extends to vector valued ). Then the integrand in (7) becomes a quadratic function of (x), and the minimizing value (x) = a linear function of ~ i = 21 ( R u i Pm ~ R f (x)h( )dF ( ) i=1 i R f (x)dF ( ) l i ). f (x) dGi ( n( ) ) (x) is ; (11) Plugging this back into the objective and constraints yields R( ; F ) = ~ 0 ~ + !R, B( ; Gi ) = ! i 0~ i , where Z !R = and i ij !i = Z = Z Z R linear combination of R R ) R f (x) dGj ( n( ) ) d (x), f (x)dF ( ) ! R f (x) R Z dG ( ) f (x)h( )dF ( ) i f (x) n( ) R h( )dGi ( ) d (x) n( ) f (x)dF ( ) h( )2 f (x)dF ( ) is the ith column of the m semi-de…nite. In fact, f (x) dGi ( n( ) d (x); . Note that is symmetric and positive R is positive de…nite as long as fn((x)) dGi ( ) cannot be written as a f (x) dGj ( n( ) m matrix R [ f (x)h( )dF ( )]2 R f (x)h( )2 dF ( ) ), j 6= i almost surely. Thus, minimizing R( ; F ) subject to (5) becomes a (semi)-de…nite quadratic programming problem, which is readily solved 1 by well-known algorithms. In the special case of " = 0, the solution is ~ = !, where ! = (! 1 ; : : : ; ! m )0 . Step 3 of the algorithm generally becomes a quadratically constrained quadratic program, but since " is scalar, it can easily be solved by conditioning on ", with a line search as outer-loop. This remains true also if (10), where N is restricted to be of the switching form is of the form (11), and R( ; F ) and B( ; Gi ) contain additional constants that arise from expectations over draws with (x) = 1. Either way, it is computationally trivial to implement the strategy described in Section 2.3 under mean square risk and a mean bias constraint, even for a very …ne discretization m. 
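Both ingredients of the switching rule in the running example above are easy to check by simulation: the corrected estimator $\delta_S(x) = \delta_{OLS}(x) - 2$ should be nearly mean unbiased once $\gamma$ is moderately large, and the switching event $\hat{\gamma}(X) = \delta_{OLS}(X) > \psi = 40$ should be rare for $\gamma \le 20$. A minimal sketch, under the same assumed parametrization and illustrative simulation sizes as in the earlier sketches, with $\delta_{OLS}$ standing in for the numerically determined $\delta_N$ when composing form (10):

```python
import numpy as np

N_STEPS, N_DRAWS, PSI = 1000, 5000, 40.0

def simulate_ou_paths(gamma, n_draws, rng):
    ds = 1.0 / N_STEPS
    dw = rng.normal(scale=np.sqrt(ds), size=(n_draws, N_STEPS))
    x = np.zeros((n_draws, N_STEPS + 1))
    for t in range(N_STEPS):
        x[:, t + 1] = x[:, t] * (1.0 - gamma * ds) + dw[:, t]
    return x

def delta_ols(x):
    return -(x[:, -1] ** 2 - 1.0) / (2.0 * np.mean(x[:, :-1] ** 2, axis=1))

def switching_estimator(x, delta_n):
    """Form (10): chi(x)*delta_S(x) + (1 - chi(x))*delta_N(x), with chi(x) = 1[delta_OLS(x) > psi]."""
    ols = delta_ols(x)
    return np.where(ols > PSI, ols - 2.0, delta_n(x))

rng = np.random.default_rng(1)
for gamma in [0.0, 10.0, 20.0, 40.0, 70.0]:
    est = delta_ols(simulate_ou_paths(gamma, N_DRAWS, rng))
    switch_prob = np.mean(est > PSI)                                   # how often the rule defers to delta_S
    bias_s = (np.mean(est - 2.0) - gamma) / np.sqrt(2 * gamma + 10)    # normalized mean bias of delta_S
    print(f"gamma = {gamma:5.1f}   P(switch) = {switch_prob:.4f}   normalized bias of delta_S = {bias_s:+.3f}")

# Composing form (10) with a placeholder delta_N (here simply delta_OLS, standing in for the
# numerically determined estimator):
x = simulate_ou_paths(50.0, N_DRAWS, rng)
print("mean of switching estimator at gamma = 50:", round(float(np.mean(switching_estimator(x, delta_ols))), 2))
```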
13 Figure 2: Mean Bias and MSE in Local-to-Unity AR(1) with Known Mean Running Example: We set Fn uniform on [0; 80], "B = "R = 0:01, and restrict ^ to be of the switching form (10) with ^& = OLS and = 40 as described at the end of the last section. With Gi equal to point masses at f0:0; 0:22 ; : : : ; 102 g, the algorithm successfully identi…es a nearly weighted risk minimizing estimator ^ that is uniformly nearly unbiased for 0. The remaining bias in ^ is very small: with the normalized mean squared error of ^ close to one, "B = 0:01 implies that about 10,000 Monte Carlo draws are necessary for the largest bias of ^ to be approximately of the same magnitude as the Monte Carlo standard error of its estimate. Figure 2 plots the normalized bias and risk of the nearly weighted risk minimizing unbiased estimator ^ . As a point of reference, we also report the normalized bias and risk of the OLS estimator OLS (replicated from Figure 1), and of the WAR minimizing estimator without any constraints. By construction, the unconstrained WAR minimizing estimator (which equals the posterior mean of possible average normalized risk on under a prior proportional to F ) has the smallest 2 [0; 80]. As can be seen from Figure 2, however, this comes at the cost of a large mean bias. The thick line plots a lower bound on the risk envelope for nearly unbiased estimators: For each , we set the weighting function equal to a point mass at that , and then report the lower bound on risk at among all estimator whose Gi averaged normalized bias is no larger than "B = 0:01 in absolute value for all i (that is, for each , we perform Step 2 of the algorithm in Section 2.3, with F equal to a point mass at ). The risk of ^ is seen 14 to be approximately 10% larger than this lower bound on the envelope. This implies that normalized weightings other than the uniform weighting on 2 [0; 80] cannot decrease the risk very much below the risk of ^ , and to the extent that 10% is considered small, one could call ^ approximately uniformly minimum variance unbiased. N 3.2 Median Unbiased Estimation Without Nuisance Parameters Suppose h( ) = , that is there are no nuisance parameters, and R. Let an estimator taking on values in R, and let m B ( ) be its median function, P ( m B ( )) = 1=2 for all , then the estimator 2 U (X) . If m B ( ) is one-to-one on 1 = m B( B (X)) estimators m B ( B (X)), B be B (X) : R 7! is exactly median unbiased by construction (cf. Lehmann (1959)). In general, di¤erent estimators 1 with inverse m 1 B which raises the question how B yield di¤erent median unbiased B should be chosen for U to have low risk. Running Example: Stock (1991) and Andrews (1993) construct median unbiased estimators for the largest autoregressive root based on the OLS estimator. Their results allow for the possibility that there exists another median unbiased estimator with much smaller risk. N A good candidate for B is a nearly weighted risk minimizing nearly median unbiased estimator ^. A small median bias of ^ typically also yields monotonicity of m^ , so if P ( y = ^ 1 1=2 at the boundary points of (if there are any), then m^ has an inverse m^ , and ^U (x) = m 1 (^(x)) is exactly median unbiased. Furthermore, if ^ is already nearly median ^ unbiased, then m 1 : R 7! is close to the identity transformation, so that R(^U ; F ) is close ) ^ to the nearly minimal risk R(^; F ). 
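The basic median-inversion construction just described, as used by Stock (1991) and Andrews (1993) with the OLS estimator, is straightforward to carry out numerically: estimate the median function of the chosen statistic on a grid of $\gamma$ values by simulation, then invert it by interpolation. A minimal sketch using $\delta_{OLS}$ as the statistic; the grid, draw counts and linear interpolation are illustrative choices.

```python
import numpy as np

N_STEPS, N_DRAWS = 1000, 5000
GAMMA_GRID = np.linspace(0.0, 100.0, 51)     # illustrative grid for the median function

def simulate_ou_paths(gamma, n_draws, rng):
    ds = 1.0 / N_STEPS
    dw = rng.normal(scale=np.sqrt(ds), size=(n_draws, N_STEPS))
    x = np.zeros((n_draws, N_STEPS + 1))
    for t in range(N_STEPS):
        x[:, t + 1] = x[:, t] * (1.0 - gamma * ds) + dw[:, t]
    return x

def delta_ols(x):
    return -(x[:, -1] ** 2 - 1.0) / (2.0 * np.mean(x[:, :-1] ** 2, axis=1))

rng = np.random.default_rng(2)
# Step 1: median function m(gamma) of the statistic, estimated on the grid by Monte Carlo.
median_fn = np.array([np.median(delta_ols(simulate_ou_paths(g, N_DRAWS, rng))) for g in GAMMA_GRID])
median_fn = np.maximum.accumulate(median_fn)     # guard monotonicity against simulation noise

# Step 2: invert the median function; statistic values below m(0) are mapped to gamma = 0.
def median_unbiased(stat):
    return np.interp(stat, median_fn, GAMMA_GRID)

# Example: median-inverted estimate for one path drawn with gamma = 15.
stat = float(delta_ols(simulate_ou_paths(15.0, 1, rng))[0])
print(f"OLS statistic = {stat:.2f},  median unbiased estimate = {median_unbiased(stat):.2f}")
```

Replacing $\delta_{OLS}$ by a nearly WAR minimizing nearly median unbiased estimator in this routine is exactly the modification described next.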
This suggests the following modi…ed strategy to numerically identify an exactly median unbiased estimator ^U whose weighted risk R(^U ; F ) is within (1 + "R ) of the risk of any other exactly median unbiased estimator. 1. Discretize by the distributions Gi , i = 1; : : : ; m. y 2. Obtain approximately optimal Lagrange multipliers ^ for the problem (1) subject to y y (5) for " = 0, and associated value R = L( y ; ^ ). Choose Gi and ^ to ensure that ^ P( ^y = ) < 1=2 for any boundary points of . If m go back to Step 1 and use a …ner discretization. 15 ^y still does not have an inverse, y y 3. Compute R(^U ; F ) for the exactly median unbiased estimator ^U (X) = m^ 1 ( ^ y (X)). ^y y If R(^U ; F ) > (1 + "R )R, go back to Step 1 and use a …ner discretization. Otherwise, ^U = ^y has the desired properties. U For the same reasons as discussed in Section 2.3, it can make sense to restrict the estimator of Step 2 of the switching form (10). For unbounded compute m ^y , it is not possible to numerically exactly, but if the problem becomes well approximated by a standard problem with standard unbiased estimator S as &( ) ! 1, then setting the inverse median function to be the identity for very large ^& (X) leads to numerically negligible median biases. Running Example: As discussed in Section 2.1, we combine the median unbiased constraint on 0 with absolute value loss. We make largely similar choices as in the derivation of nearly mean unbiased estimators: the normalized weighting function Fn is uniform on [0; 80], with risk now normalized as rn ( ; ) = E [j (X) j]=n( ), where p n( ) = 2 + 10. Phillips (1974, 1975) provides Edgeworth expansions of ^OLS , which implies that ^OLS =(1 as T 1 ) has a median bias of order o(T ! 1 suggests that the median bias of S (x) = OLS (x) 1 ) for j j < 1. Taking the limit 1 should be small for moderate and large , and a numerical analysis shows that this is indeed the case. We set "R = 0:01, and impose switching to S when OLS (x) > = 40. Under a median bias constraint and absolute value loss, using point masses for Gi leads to a discontinuous integrand in the Lagrangian (7) for given x. In order to ensure a smooth estimator , it makes sense to instead choose the distributions Gi in a way that any mixture of the Gi ’s has a continuous density. To this end we set the Lebesgue density of Gi , i = 1; : : : ; m proportional to the ith third-order basis spline on the m = 51 knots f0; 0:22 ; : : : ; 102 g (with “not-a-knot” end conditions). We only compute m^ 1 for y 100 and set it equal to the identity function ^y otherwise; the median bias of ^U for 100 is numerically indistinguishable from zero with a Monte Carlo standard error of the median bias estimate of approximately 0.2%. y Figure 3 reports the median bias and normalized risk of ^ , the OLS estimator U OLS-based median unbiased estimator U;OLS (x) =m 1 OLS ( OLS (x)), OLS , the and the estimator that minimizes weighted risk relative to F without any median unbiased constraints. The risk of ^y is indistinguishable from the risk of U;OLS . Consequently, this analysis reveals U;OLS to U be (also) nearly weighted risk minimal unbiased in this problem. N 16 Figure 3: Median and Absolute Loss in Local-to-Unity AR(1) with Known Mean 4 Invariance In this section, we consider estimation problems that have some natural invariance (or equivariance) structure, so that it makes sense to impose the corresponding invariance also on estimators. 
We show that the problem of identifying a (nearly) minimal risk unbiased invariant estimator becomes equivalent to the problem of identifying an unrestricted (nearly) minimal risk unbiased estimator in a related problem with a sample and parameter space generated by maximal invariants. Imposing invariance then reduces the dimension of the e¤ective parameter space, which facilitates numerical solutions. Consider a group of transformations on the sample space g : X denotes a group action. We write a2 A 7! X , where a 2 A a1 2 A for the composite action g(g(x; a1 ); a2 ) = g(x; a2 a1 ) for all a1 ; a2 2 A, and we denote the inverse of action a by a , that is g(x; a a) = x for all a 2 A and x 2 X . Assume that the elements in a are distinct in the sense that g(x; a1 ) = g(x; a2 ) for some x 2 X implies a1 = a2 . Now suppose the problem is invariant in the sense that there exists a corresponding group g : is Pg( ;a) , A 7! for all on the parameter space, and the distribution of g(X; a) under X P and a 2 A (cf. De…nition 2.1 of Chapter 3 in Lehmann and Casella (1998)). Let M (x) and M ( ) for M : X 7! X and M : 7! be maximal invariants of these two groups. To simplify notation, assume that M and M select speci…c orbits, that is M (x) = g(x; O(x) ) for all x 2 X and M ( ) = g( ; O( ) ) for all 17 2 for some functions O : X 7! A and O : 7! A. Then by de…nition of a maximal invariant, M (M (x)) = M (x), M (M ( )) = M ( ), and we have the decomposition x = g(M (x); O(x)) (12) = g(M ( ); O( )): (13) By Theorem 6.3.2 of Lehmann and Romano (2005), the distribution of M (X) only depends on M ( ). The following Lemma provides a slight generalization, which we require below. The proof is in Appendix A. Lemma 2 The joint distribution of (M (X); g( ; O(X) )) under only depends on through M ( ). Suppose further that the estimand h( ) is compatible with the invariance structure in the sense that h( 1 ) = h( 2 ) implies h(g( 1 ; a)) = h(g( 2 ; a)) for all 1; 2 2 and a 2 A. As discussed in Chapter 3 of Lehmann and Casella (1998), this induces a group g^ : H satisfying h(g( ; a)) = g^(h( ); a) for all 2 A 7! H and a 2 A, and it is natural for the loss function and constraints to correspondingly satisfy `( ; ) = `(^ g ( ; a); g( ; a)) for all 2 H, 2 and a 2 A (14) c( ; ) = c(^ g ( ; a); g( ; a)) for all 2 H, 2 and a 2 A: (15) Any loss function `( ; ) that depends on i only through the parameter of interest, `( ; ) = i ` ( ; h( )) for some function ` : H 7! H, such as quadratic loss or absolute value loss, automatically satis…es (14). Similarly, constraints of the form c( ; ) = ci ( ; h( )); such as those arising from mean or median unbiased constraints, satisfy (15). With these notions of invariance in place, it makes sense to impose that estimators conform to this structure and satisfy (g(x; a)) = g^( (x); a) for all x 2 X and a 2 A. (16) Equations (12) and (16) imply that any invariant estimator satis…es (x) = g^( (M (x)); O(x)) for all x 2 X. (17) It is useful to think about the right-hand side of (17) as inducing the invariance property: Any function a : M (X ) ! H de…nes an invariant estimator (x) via (x) = 18 g^( a (M (x)); O(x)). Given that any invariant estimator satis…es (17), the set of all invariant estimators can therefore be generated by considering all (unconstrained) a, and setting (x) = g^( a (M (x)); O(x)). Running Example: Consider estimation of the AR(1) coe¢ cient after observing yt = my + ut , and ut = ut asymptotics 1 + "t , where "t =T for = 1 iidN (0; 1) and u0 = 0. 
Under local-to-unity 0 and my = my;T = T 1=2 for 2 R, the limiting problem involves the observation X on the space of continuous functions on the unit interval, Rs where X(s) = + J(s) and J(s) = 0 e (s r) dW (r) is an Ornstein-Uhlenbeck process with J(0) = 0. This observation is indexed by = ( ; ), and the parameter of interest is = h( ). The problem is invariant to transformations g(x; a) = x + a with a 2 A = R, and corresponding group g(( ; ); a) = ( ; + a). One choice for maximal invariants is M (x) = g(x; x(0)) and M (( ; )) = g(( ; ); ) = ( ; 0). The Lemma asserts that M (X) = X X(0) and g( ; O(X) ) = ( ; X(0)) have a joint distribution that only depends on through . The induced group g^ is given by g^( ; a) = for all a 2 A. Under (14), the loss must not depend on the location parameter . Invariant estimators are numerically invariant to translation shifts of X, and can all be written in the form (x) = g^( a (x x(0)); x(0)) = a (x x(0)) for some function a that maps paths y on the unit interval with y(0) = 0 to H. N Now under these assumptions, we can write the risk of any invariant estimator as r( ; ) = E [`( (X); )] = E [` (M (X)); g( ; O(X) ) ] = EM ( ) [` (by (12) and (14) with a = O(X) ) (M (X)); g(M ( ); O(X) ) ] = EM ( ) [EM ( ) [` (by Lemma 1) (M (X)); g(M ( ); O(X) ) jM (X)]] and similarly b( ; ) = E [c( (X); )] = EM ( ) [EM ( ) [c Now set and = M( ) 2 (M (X)); g(M ( ); O(X) ) jM (X)]]: 2 H = h( = M ( ); h( ) = ), x = M (x) 2 X = M (X ) ` (x ; ) = E [` (X ); g( ; O(X) ) jX = x ] (18) c ( (x ); ) = E [c (X ); g( ; O(X) ) jX = x ]. (19) 19 Then the starred problem has exactly the same structure as the problem considered in Sections 2 and 3, and the same solution techniques can be applied to identify a nearly weighted average risk minimal estimator ^ : X 7! H (with the weighting function a nonnegative measure on ). This solution is then extended to the domain of the original sample space X via ^(x) = g^(^ (M (x)); O(x)) from (17), and the near optimality of ^ implies the corresponding near optimality of ^ in the class of all invariant estimators. Running Example: Since `( (X ); = h( ), and g( ; a) does not a¤ect h( ), `( (X ); g( ; O(X) )) = ) and c( (X ); g( ; O(X) )) = c( (X ); to estimating ). Thus the starred problem amounts from the observation X = M (X) = X X(0) (whose distribution does not depend on ). Since X is equal to the Ornstein-Uhlenbeck process J with J(0) = 0, the starred problem is exactly what was previously solved in the running example. So in y particular, the solutions ^ and ^ , analyzed in Figures 2 and 3, now applied to X , are also U (nearly) weighted risk minimizing unbiased translation invariant estimators in the problem of observing X = + J. A di¤erent problem arises, however, in the empirical more relevant case with the initial condition drawn from the stationary distribution. With yt = my + ut ; ut = ut iidN (0; 1) and u0 N (0; (1 limiting problem under of the process X(s) = independent of J for 2 ) 1 ) independent of f"t g for + "t , "t < 1 (and zero otherwise), the 1=2 asymptotics is the observation p + J S (s), where J S (s) = Ze s = 2 + J(s) and Z N (0; 1) is =1 =T and my = my;T = T 1 > 0 (and J S = J otherwise). For is then given by (cf. 
Elliott (1999)) " r 2 1 f (x ) = exp (x (1)2 2 2+ 1) 1 2 2 Z 0 0, the density of X = X X(0) 1 2 x (s) ds + 1 2 ( R1 0 # x (s)ds + x (1))2 : 2+ (20) Analogous to the discussion below (4), this asymptotic problem can be motivated more generally under suitable assumptions about the underlying processes yt : The algorithms discussed in Section 3 can now be applied to determine nearly WAR minimizing invariant unbiased estimators in the problem of observing X with density (20). p We use the normalization function n( ) = 2 + 20, and make the same choices for Fn and Gi on as in the problem with zero initial condition. Taking the local-to-unity limit of the approximately mean unbiased Orcutt and Winokur (1969) estimator of in a regression that includes a constant suggests OLS 4 to be nearly mean unbiased for large ; where now R1 R1 1 2 2 (x (1) 1)= x (s) ds with x (s) = x (s) x (l)dl. Similarly, Phillips’ OLS (x ) = 2 0 0 20 Figure 4: Mean Bias and MSE in Local-to-Unity AR(1) with Unknown Mean (1975) results suggest OLS (x )> OLS 3 to be nearly median unbiased. We switch to these estimators if = 60, set "B = 0:01 and "R = 0:01 for the nearly median unbiased estimator, and "R = 0:02 for the mean unbiased estimator. Figures 4 and 5 show the normalized bias and risk of the resulting nearly weighted risk minimizing unbiased invariant estimators. We also plot the performance of the analogous set of comparisons as in Figures 2 and 3. The mean and median bias of the OLS estimator is now even larger, and the (nearly) unbiased estimators have substantially lower normalized risk. The nearly WAR minimizing mean unbiased estimator still has risk only about 10% above the (lower bound) on the envelope. In contrast to the case considered in Figure 3, the WAR minimizing median unbiased estimator now has perceptibly lower risk than the OLS based median unbiased estimator for small , but the gains are still fairly moderate. Figure 6 compares the performance of the nearly mean unbiased estimator with some previously suggested alternative estimators: The local-to-unity limit OLS 4 of the analytically bias corrected estimator by Orcutt and Winokur (1969), the weighted symmetric estimator analyzed by Pantula, Gonzalez-Farias, and Fuller (1994) and Park and Fuller (1995), and the MLE f (x ) based on the maximal invariant likelihood (20) (called “restricted” MLE by Cheang and Reinsel (2000)). In contrast to ^ , all of these previously M LE (x ) = arg max 0 suggested estimators have non-negligible mean biases for values of smallest mean bias at close to zero, with the = 0 of nearly a third of the root mean squared error. N 21 Figure 5: Median and Absolute Loss in Local-to-Unity AR(1) with Unknown Mean Figure 6: Mean Bias and MSE in Local-to-Unity AR(1) with Unknown Mean 22 5 Further Applications 5.1 Degree of Parameter Time Variation Consider the canonical “local level”model in the sense of Harvey (1989), t X yt = m + "s + ut , ("t ; ut ) (21) iidN (0; I2 ) s=1 with only yt observed. The parameter of interest is , the degree of time variation of the Pt 1=2 ‘local mean’ m + and = T 1 asymptotics, we have the s=1 "s : Under m = T functional convergence of the partial sums XT (s) = T 1=2 bsT c X t=1 yt ) X(s) = s + Z s (22) W" (l)dl + Wu (s) 0 with W" and Wu two independent standard Wiener processes. 
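The convergence (22) is easy to visualize by simulating the local level model (21) directly and forming the scaled partial sums. A minimal sketch, assuming the nesting $m = \mu/\sqrt{T}$ and $\sigma = \gamma/T$ of the stated asymptotics, and reporting $\int_0^1 X^*(s)^2 ds$ for the translation-invariant version $X^*(s) = X_T(s) - sX_T(1)$ (the statistic on which the Stock and Watson (1998) estimator discussed below is based); $T$, $\mu$, the $\gamma$ values and the number of replications are illustrative.

```python
import numpy as np

T, N_SIMS, MU = 500, 2000, 1.0

def scaled_partial_sums(gamma, rng):
    """X_T(t/T) = T^{-1/2} * sum_{s<=t} y_s from model (21) with m = MU/sqrt(T), sigma = gamma/T."""
    eps = rng.normal(size=(N_SIMS, T))
    u = rng.normal(size=(N_SIMS, T))
    y = MU / np.sqrt(T) + (gamma / T) * np.cumsum(eps, axis=1) + u
    return np.cumsum(y, axis=1) / np.sqrt(T)

rng = np.random.default_rng(3)
s = np.arange(1, T + 1) / T
for gamma in [0.0, 7.0, 15.0]:
    xt = scaled_partial_sums(gamma, rng)
    x_star = xt - s * xt[:, -1:]                 # invariant version X_T(s) - s * X_T(1)
    nyblom = np.mean(x_star ** 2, axis=1)        # Riemann approximation of int_0^1 X*(s)^2 ds
    print(f"gamma = {gamma:5.1f}   E[int X*(s)^2 ds] ~= {nyblom.mean():.3f}")
```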
The limit problem is the estimation of from the observation X, whose distribution depends on = ( ; ) 2 [0; 1) R: This problem is invariant to transformations g(x; a) = x + a s with a 2 A = R, and corresponding group g(( ; ); a) = ( ; + a). One choice for maximal invariants is M (x) = g(x; x(1)) and M (( ; )) = g(( ; ); ) = ( ; 0). The results of Section 4 imply that all invariant estimators can be written as functions of X = M (X), so that Z s Z 1 X (s) = W" (l)dl + Wu (s) s W" (l)dl + Wu (1) 0 (23) 0 which does not depend on any nuisance parameters. The distribution (23) arises more generally as the limit of the suitably scaled partial sum process of the scores, evaluated at the maximum likelihood estimator, of well-behaved time series models with a time varying parameter; see Stock and Watson (1998), Elliott and Müller (2006a) and Müller and Petalas (2010). Lemma 2 and Theorem 3 of Elliott and Müller (2006a) imply that the density of X with respect to the measure of a Brownian Bridge W (s) f (x ) = r 2 e 1 e where x~ (s) = x (s) 2 x~ (1)2 exp Rs 0 e (s l) 2 R1 0 x~ (s)2 ds 2 1 e sW (1) is given by 2 e x~ (1) + + x~ (1) + x (l)dl. 23 R1 0 e R1 0 s 2 x~ (s)ds 2 x~ (s)ds As discussed by Stock (1994), the local level model (21) is intimately linked to the MA(1) model yt = "t + ut . Under small time variation asymptotics is nearly noninvertible with local-to-unity MA(1) coe¢ cient XT (s) = XT (s) =1 yt gTt=2 sXT (1) and the …rst di¤erences f = =T , =T + o(T 1 yt ). Both form small sample maximal invariants to translations of yt , so that they contain the same information. This suggests that X is also the limit problem (in the sense of LeCam) of a stationary Gaussian local-tounity MA(1) model without a constant, and it follows from the development in Elliott and Müller (2006a) that this is indeed the case. It has long been recognized that the maximum likelihood estimator of the MA(1) coe¢ cient exhibits non-Gaussian limit behavior under non-invertibility; see Stock (1994) for a historical account and references. In particular, the MLE ^ su¤ers from the so-called pile-up problem P (^ = 1j = 1) > 0; and Sargan and Bhargava (1983) derive the limiting probability to be 0.657. The full limiting distribution of the MLE does not seem to have been previously derived. The above noted equivalence leads us to conjecture that under =T asymptotics, =1 T (1 ^) ) M LE (X (24) ) = argmax f (X ): A numerical calculation based on (23) con…rms that under = 0, P ( M LE (X ) = 0) = 0:657, but we leave a formal proof of the convergence in (24) to future research. In any event, with P ( M LE (X median unbiased estimator of ) = 0) > 1=2 under 2 [0; 1) on M LE (X = 0, it is not possible to base a ), as the median function m M LE is not one-to-one for small values of . Stock and Watson (1998) base their exactly median R1 unbiased estimator on the Nyblom (1989) statistic 0 X (s)2 ds: We employ the algorithm discussed in Section 3.2 to derive an alternative median unbi- ased estimator that comes close to minimizing weighted average risk under absolute value loss. Taking the local-to-unity limit of Tanaka’s (1984) expansions suggests S (x ) = ) + 1 to be approximately median unbiased and normal with variance 2 for large . p + 6 (the o¤set 6 ensures that normalized We thus consider risk normalized by n( ) = M LE (x risk is well de…ned also for = 0). 
We impose switching to S whenever M LE (x )> = 10, set Fn to be uniform on [0; 30], choose the Lebesgue density of Gi to be proportional to the ith basis spline on the knots f0; 0:22 ; : : : ; 82 g, and set the inverse median function m^ 1 to the identity for arguments larger than 50 (the median bias of S for ^y > 50 is indistinguishable from zero with a Monte Carlo standard error of approximately 0.2%). With "R = 0:01, the 24 Figure 7: Median Bias and Absolute Loss Performance in Local-to-Noninvertible MA(1) y algorithm then successfully delivers a nearly weighted risk minimizing estimator ^U that is exactly median unbiased for 50, and very nearly median unbiased for > 50. Figure 7 displays its median bias and normalized risk, along with Stock and Watson’s (1998) median unbiased estimator and the weighted risk minimizing estimator without any y bias constraints. Our estimator ^U is seen to have very substantially lower risk than the previously suggested median unbiased estimator by Stock and Watson (1998) for all but very small values of . For instance, under = 15 (a degree of time variation that is detected approximately 75% of the time by a 5% level Nyblom (1989) test of parameter constancy), the mean absolute loss of the Stock and Watson (1998) estimator is about twice as large. 5.2 Predictive Regression A stylized small sample version of the predictive regression problem with a weakly exogenous and persistent regressor zt is given by yt = my + bzt 1 + r"zt + p 1 r2 "yt zt = mz + ut ut = ut 1 + "zt u0 25 N (0; (1 2 ) 1) with ("zt ; "yt )0 iidN (0; I2 ) independent of u0 , 1 < r < 1 known, and only fyt ; zt 1 gTt=1 are observed. The parameter of interest is the regression coe¢ cient b. Other formulations of the predictive regression problem di¤er by not allowing an unknown mean for the regressor, or by di¤erent assumptions about the distribution of u0 . p p =T , b = =T , my = y = T and mz = z T , we obtain ! X1 ( ) ) X( ) = T 1=2 z[ T ] X2 ( ) ! ! p Rs 2 W (s) s + X (l)dl + rW (s) + 1 r X1 (s) 2 z y y 0 Rs = (s l) s pZ dWz (l) + 0e X2 (s) z +e 2 Under asymptotics with P Tc ! T 1=2 bt=1 yt with Z =1 N (0; 1) independent of the standard Wiener processes Wy and Wz . The distri- bution of the limit process X depends on the unknown and = ( y; z; ; ) 2 R3 (0; 1), is the parameter of interest. The problem is seen to be invariant to the group of transformations indexed by a = (ay ; az ; a ) 2 R3 X1 ( ) + ay +a g(X; a) = g( ; a) = X2 (s)ds 0 X2 ( ) + az ! + (ay ; az ; a ; 0) and one choice for maximal invariants is ! X1 ( ) = M (X) = X = X2 ( ) M( ) = R X1 ( ) X1 (1) X2 ( ) = (0; 0; 0; ) ! R ^OLS (X) 0 X2 (s)ds R1 X2 (s)ds 0 R1 R1 where ^OLS (X) = 0 X2 (s)dX1 (s)= 0 X2 (s)2 ds. With g^( ; a) = + a and O(X) = R1 (X1 (1); 0 X2 (s)ds; ^OLS (X); 0), (17) implies that all invariant estimators can be written in the form ^OLS (x) + a (x ), (25) that is, they are equal to the OLS estimator ^OLS (x), plus a “correction term” that is a function of the maximal invariant X . The implications of imposing invariance are quite powerful in this example: the problem of determining the nearly weighted average risk minimizing unbiased estimator relative to (18) and (19) is only indexed by , that is by the mean reversion parameter . Invariance 26 thus also e¤ectively eliminates the parameter of interest from the problem. 
Solving for ` and c in (18) and (19) requires the computation of the conditional expectation of the loss function and constraint evaluated at g(M ( ); O(X) ) given X under depending on through , this requires derivation of the conditional distribution of ^OLS (X) given X . A calculation shows that under equal to . With ` and c only (that is, with the true value of =( y; z; ; ) = (0; 0; 0; )) R1 X (s)dX2 (s) 1 r2 r 0R 1 2 +r ; R1 X2 (s)2 ds X2 (s)2 ds 0 0 N ^OLS (X)jX ! : (26) X2 (0) follows the same distribution as the process considered in Furthermore, since X2 ( ) the running example at the end of Section 4, the density of X2 follows again readily from Elliott (1998). Finally, the conditional density of X1 given X2 (relative to the measure of a Brownian Bridge) does not depend on minimizer , so it plays no role in the determination of the (x ) of the integrand in the Lagrangian (7). Appendix C.3 provides further details on these derivations. Equation (26) implies that under to , the unconditional expectation of ^OLS (X) is equal R1 R1 r times the bias of the OLS estimator ^ OLS = X2 (s)dX2 (s)= 0 X2 (s)2 ds of the 0 mean reversion parameter . This tight relationship between the bias in the OLS coe¢ cient estimator, and the OLS bias in estimating the autoregressive parameter, was already noted by Stambaugh (1986). What is more, let ~ be an alternative estimator of the mean reverting parameter based on X2 , and set bias of the resulting estimator of in (25) equal to r(^ OLS a ~ ). Then the unconditional is equal to r times the mean bias of ~ , and its mean R1 squared error is equal to r times the mean squared error of ~ plus E[(1 r2 )= 0 X2 (s)2 ds]. 2 Thus, setting a (x ) equal to r(^ OLS ^ ) for ^ the nearly WAR minimizing mean un- biased estimator derived at the end of Section 4 yields a nearly WAR minimizing mean unbiased estimator of in the predictive regression problem. Correspondingly, the Orcutt and Winokur (1969)-based bias correction in the estimation of suggested by Stambaugh (1986) and the restricted maximum likelihood estimator suggested by Chen and Deo (2009) have mean biases that are equal to r times the mean biases reported in Figure 6. We omit a corresponding plot of bias and risk for brevity, but note that just as in Figure 4, a comparison with the (lower bound on the) risk envelope shows that our estimator comes quite close to being uniformly minimum variance among all (nearly) unbiased estimators. The median bias of ^OLS + r(^ OLS ~ ), however, is not simply equal to 27 r times the Figure 8: Median Bias and Absolute Loss Performance in Predictive Regression median bias of ~ . One also cannot simply invert the median function of an estimator of due to the presence of the nuisance parameter . We thus apply our general algorithm to obtain p a nearly WAR minimizing nearly median unbiased estimator. We use n( ) = 2 + 20, "B = "R = 0:01, set Fn uniform on [0; 80] and let Gi be point masses on 2 f0; 0:22 ; : : : ; 102 g (the Gaussian noise in (26) induces smooth solutions even for point masses, in contrast to direct median unbiased estimation of the mean reverting parameter). For ^ OLS > we switch to the estimator a = 3r, which is nearly median unbiased for large to 1. Figure 8 reports the results for r = = 60 and jrj close 0:95, along with the same analogous set of compar- 2 isons already reported in Figure 3. 
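The bias relationship exploited above is also easy to verify directly in the stylized small-sample model of this section, where it takes the classic Stambaugh (1986) form: the mean bias of the OLS slope equals r times the mean bias of the OLS autoregressive coefficient, so any less biased estimator of the autoregressive parameter translates into a bias-corrected slope via $\hat{\beta}_{OLS} - r(\hat{\rho}_{OLS} - \tilde{\rho})$. The sketch below uses the simple first-order correction $\tilde{\rho} = \hat{\rho}_{OLS} + (1 + 3\hat{\rho}_{OLS})/T$ merely as a stand-in for the nearly unbiased estimator derived in Section 4; the sample size, parameter values and that stand-in are illustrative assumptions.

```python
import numpy as np

T, RHO, BETA, R, N_SIMS = 100, 0.95, 0.0, -0.95, 10000

def simulate(rng):
    """One draw from the stylized predictive regression system (m_y = m_z = 0 for simplicity)."""
    eps_z = rng.normal(size=T)
    eps_y = rng.normal(size=T)
    u = np.empty(T + 1)
    u[0] = rng.normal(scale=1.0 / np.sqrt(1.0 - RHO ** 2))     # stationary initial condition
    for t in range(T):
        u[t + 1] = RHO * u[t] + eps_z[t]
    z = u                                                       # regressor z_t = u_t
    y = BETA * z[:-1] + R * eps_z + np.sqrt(1.0 - R ** 2) * eps_y
    return y, z

def slope(dep, reg):
    """OLS slope in a regression with intercept."""
    xmat = np.column_stack([np.ones(len(dep)), reg])
    return np.linalg.lstsq(xmat, dep, rcond=None)[0][1]

rng = np.random.default_rng(4)
beta_hat, beta_adj, rho_hat = np.empty(N_SIMS), np.empty(N_SIMS), np.empty(N_SIMS)
for i in range(N_SIMS):
    y, z = simulate(rng)
    b = slope(y, z[:-1])
    r_ols = slope(z[1:], z[:-1])
    r_tilde = r_ols + (1.0 + 3.0 * r_ols) / T          # stand-in bias-corrected AR(1) estimate
    beta_hat[i], rho_hat[i] = b, r_ols
    beta_adj[i] = b - R * (r_ols - r_tilde)            # Stambaugh-type correction of the slope

print("bias of beta_OLS:        ", round(beta_hat.mean() - BETA, 4))
print("r * bias of rho_OLS:     ", round(R * (rho_hat.mean() - RHO), 4))
print("bias of corrected beta:  ", round(beta_adj.mean() - BETA, 4))
```

The first two printed numbers should agree up to simulation error, and the third should be much closer to zero, which is the small-sample analogue of the mean bias comparison reported for the limit problem.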
5.3 Quantile Long-range Forecasts from a Local-To-Unity Autoregressive Process

The final example again involves a local-to-unity AR(1) process, but now we are interested in constructing long-run quantile forecasts. To be specific, assume y_t = m + u_t, u_t = ρ u_{t-1} + ε_t, ε_t ~ iid N(0, 1) and u_0 ~ N(0, (1 - ρ²)^{-1}) independent of ε_t, for some 0 < ρ < 1. We observe {y_t}_{t=1}^T and, given α and the forecast horizon, seek to estimate the conditional α quantile of the observation at that horizon. Under asymptotics with m = T^{1/2} μ, ρ = 1 - γ/T and a horizon equal to ⌊rT⌋, the limiting problem involves the observation of the stationary Ornstein-Uhlenbeck process X on the unit interval,

X(s) = μ + ∫_0^s exp(-γ(s - l)) dW(l) + Z e^{-γs}/√(2γ)

with Z ~ N(0, 1) independent of W. The problem is indexed by the unknown parameters (μ, γ) ∈ R × (0, ∞), and the object of interest is the α quantile of X(1 + r) conditional on X. It is potentially attractive to construct estimators δ that are quantile unbiased in the sense that

P_{(μ,γ)}(X(1 + r) < δ(X)) = α  for all (μ, γ) ∈ R × (0, ∞).   (27)

This ensures that in repeated applications, X(1 + r) indeed realizes to be smaller than the estimator δ(X) of the α quantile with probability α, irrespective of the true value of the parameters governing X.

A straightforward calculation shows that X(1 + r) | X ~ N(μ_{X(1+r)|X}, σ²_{X(1+r)|X}) with

μ_{X(1+r)|X} = μ + (X(1) - μ) e^{-γr}  and  σ²_{X(1+r)|X} = (1 - e^{-2γr})/(2γ),

so that the conditional α quantile is equal to μ_{X(1+r)|X} + σ_{X(1+r)|X} z_α, where P(Z < z_α) = α. It is not possible to use this expression as an estimator, however, since μ_{X(1+r)|X} and σ_{X(1+r)|X} depend on the unknown parameters (μ, γ). A simple plug-in estimator is given by

δ_PI(x) = μ̂_{X(1+r)|X} + σ̂_{X(1+r)|X} z_α,

which replaces (μ, γ) by (μ̂, γ̂_OLS) = (∫_0^1 X(s) ds, -½(X̄(1)² - 1)/∫_0^1 X̄(s)² ds), where X̄(s) = X(s) - μ̂.

In order to cast this problem in the framework of this paper, let ξ_1 = (X(1) - μ)√(2γ) and ξ_2 = (X(1 + r) - μ_{X(1+r)|X})/σ_{X(1+r)|X}, so that ξ = (ξ_1, ξ_2)' ~ N(0, I_2). We treat ξ as fixed and part of the parameter, θ = (μ, γ, ξ). In order to recover the stochastic properties of ξ, we integrate over ξ ~ N(0, I_2) in the weighting functions F and G_i. In this way, weighted average bias constraints over ξ ~ N(0, I_2) amount to a bias constraint in the original model with stochastic ξ distributed N(0, I_2). Now in this notation, X(1 + r) becomes non-stochastic and equals

h(θ) = μ + (ξ_1 e^{-γr} + ξ_2 √(1 - e^{-2γr}))/√(2γ),

and the density f_θ(x) depends on ξ_1 (but not ξ_2). Furthermore, the quantile unbiased constraint (27) is equivalent to a weighted average bias constraint over ξ ~ N(0, I_2) with c(δ, θ) equal to c(δ, θ) = 1[h(θ) < δ] - α. We use the usual quantile check function to measure the loss of estimation errors, ℓ(δ, θ) = |h(θ) - δ| · |α - 1[h(θ) < δ]|.

With these choices, the problem is invariant to the group of translations g(x, a) = x + a and g(θ, a) = (μ + a, γ, ξ) for a ∈ R, with ĝ(δ, a) = δ + a. One set of maximal invariants is given by X* = M(X) = X(·) - X(1) and θ* = M(θ) = (0, γ, ξ), and the density of X* relative to the measure of W(·) - W(1) is given by (cf.
equation (25) in Elliott and Müller (2006b))

f_γ(x*) = (2/(2 + γ))^{1/2} exp[ -½γ(x*(0)² - 1) - ½γ² ∫_0^1 x*(s)² ds + γ(x*(0) + γ ∫_0^1 x*(s) ds)²/(2(2 + γ)) ].   (28)

A calculation shows that with γ fixed and ξ ~ N(0, I_2), the distribution of h(θ) conditional on X* = x* is N(μ_{h|X*}, σ²_{h|X*}) with

μ_{h|X*} = (γ ∫_0^1 x*(s) ds + x*(0))(1 - e^{-γr})/(2 + γ)  and  σ²_{h|X*} = (1 - e^{-2γr} + 2(1 - e^{-γr})²/(2 + γ))/(2γ).

The weighted averages of ℓ*(δ, γ) and c*(δ, γ) in (18) and (19) with weighting function ξ ~ N(0, I_2) then become

∫ c*(δ, γ) f_γ(x*) φ_{I_2}(ξ) dξ / ∫ f_γ(x*) φ_{I_2}(ξ) dξ = Φ((δ - μ_{h|X*})/σ_{h|X*}) - α
∫ ℓ*(δ, γ) f_γ(x*) φ_{I_2}(ξ) dξ / ∫ f_γ(x*) φ_{I_2}(ξ) dξ = (δ - μ_{h|X*})(Φ((δ - μ_{h|X*})/σ_{h|X*}) - α) + σ_{h|X*} φ((δ - μ_{h|X*})/σ_{h|X*}),

where Φ and φ are the cdf and pdf of a scalar standard normal, and φ_{I_2} is the pdf of a bivariate standard normal.

With these expressions at hand, it becomes straightforward to implement the algorithm of Section 2.3. We normalize risk by n(θ) = (1/10)(2 + γ)^{-1/2}, set ε_B = 0.002 and ε_R = 0.01, and, in addition to the integration over ξ ~ N(0, I_2), choose F_n uniform on [0, 80] and G_i equal to point masses on γ ∈ {0, 0.2², ..., 10²}. For γ̂_OLS > 60, we switch to an estimator of the unconditional quantile, δ_S(x*) = μ̂ + z̃_α/√(2γ̂), with μ̂ = ∫_0^1 x*(s) ds, γ̂ = γ̂_OLS - 3 and z̃_α = z_α √((2γ̂ + 4)/(2γ̂ - z_α²)) (the use of the approximately median unbiased estimator γ̂ and the replacement of z_α by z̃_α help correct for the estimation error in (μ̂, γ̂); see Appendix C.4 for details).

Figure 9: 5% Quantile Bias and Normalized Expected 5% Quantile Check Function Loss of Forecasts in Local-to-Unity AR(1) for 20% Horizon

Figure 9 plots the results for r = 0.2 and α = 0.05, along with the plug-in estimator δ_PI defined above and the WAR minimizing estimator without any quantile bias constraints. The plug-in estimator is seen to very severely violate the quantile unbiasedness constraint (27) for small γ: instead of the nominal 5%, X(1 + r) takes on a value smaller than the quantile estimator about 16% of the time.

6 Conclusion

The literature contains numerous suggestions for estimators in non-standard problems, and their evaluation typically considers bias and risk. For a recent example in the context of estimating a local-to-unity AR(1) coefficient, see, for instance, Phillips (2012). We argue here that the problem of identifying estimators with low bias and risk may be fruitfully approached in a systematic manner. We suggest a numerical approximation technique for the resulting programming problem, and show that it delivers very nearly unbiased estimators with demonstrably close to minimal weighted average risk in a number of classic time series estimation problems. The general approach might well be useful also for cross-sectional non-standard problems, such as those arising under weak instrument asymptotics. In ongoing work, we consider an estimator of a very small quantile that has the property that, in repeated applications, an independent draw from the same underlying distribution realizes to be smaller than the estimator with probability α. This is qualitatively similar to the forecasting problem considered in Section 5.3, but concerns a limiting problem of observing a finite-dimensional random vector with a joint extreme value distribution.

In all the examples we considered in this paper, it turned out that nearly unbiased estimators with relatively low risk exist. This may come as something of a surprise, and may well fail to be the case in other non-standard problems.
For instance, as recently highlighted by Hirano and Porter (2012), mean unbiased estimation is not possible for the minimum of the means in a bivariate normal problem. Unreported results show that an application of our numerical approach to this problem yields very large lower bounds on the weighted average risk of any estimator with moderately small mean bias. The conclusion then is that even insisting on small (but not necessarily zero) mean bias is not a fruitful constraint in that particular problem.

3 Andrews and Armstrong (2015) derive a mean unbiased estimator of the structural parameter under a known sign of the first stage in this problem.

References

Albuquerque, J. (1973): "The Barankin Bound: A Geometric Interpretation," IEEE Transactions on Information Theory, 19, 559–561.
Amihud, Y., and C. Hurvich (2004): "Predictive Regressions: A Reduced-Bias Estimation Method," Journal of Financial and Quantitative Analysis, 39, 813–841.
Andrews, D. W. K. (1993): "Exactly Median-Unbiased Estimation of First Order Autoregressive/Unit Root Models," Econometrica, 61, 139–165.
Andrews, D. W. K., and H. Chen (1994): "Approximately Median-Unbiased Estimation of Autoregressive Models," Journal of Business and Economic Statistics, 12, 187–204.
Andrews, D. W. K., M. J. Moreira, and J. H. Stock (2006): "Optimal Two-Sided Invariant Similar Tests for Instrumental Variables Regression," Econometrica, 74, 715–752.
(2008): "Efficient Two-Sided Nonsimilar Invariant Tests in IV Regression with Weak Instruments," Journal of Econometrics, 146, 241–254.
Andrews, I., and T. B. Armstrong (2015): "Unbiased Instrumental Variables Estimation Under Known First-Stage Sign," Working Paper, Harvard University.
Barankin, E. W. (1946): "Locally Best Unbiased Estimates," Annals of Mathematical Statistics, 20, 477–501.
Cavanagh, C. L., G. Elliott, and J. H. Stock (1995): "Inference in Models with Nearly Integrated Regressors," Econometric Theory, 11, 1131–1147.
Chan, N. H., and C. Z. Wei (1987): "Asymptotic Inference for Nearly Nonstationary AR(1) Processes," The Annals of Statistics, 15, 1050–1063.
Cheang, W.-K., and G. C. Reinsel (2000): "Bias Reduction of Autoregressive Estimates in Time Series Regression Model through Restricted Maximum Likelihood," Journal of the American Statistical Association, 95, 1173–1184.
Chen, W. W., and R. S. Deo (2009): "Bias Reduction and Likelihood-Based Almost Exactly Sized Hypothesis Testing in Predictive Regressions Using the Restricted Likelihood," Econometric Theory, 25, 1143–1179.
Crump, R. K. (2008): "Optimal Conditional Inference in Nearly-Integrated Autoregressive Processes," Working Paper, Federal Reserve Bank of New York.
Eliasz, P. (2005): "Optimal Median Unbiased Estimation of Coefficients on Highly Persistent Regressors," Ph.D. Thesis, Princeton University.
Elliott, G. (1998): "The Robustness of Cointegration Methods When Regressors Almost Have Unit Roots," Econometrica, 66, 149–158.
(1999): "Efficient Tests for a Unit Root When the Initial Observation is Drawn From its Unconditional Distribution," International Economic Review, 40, 767–783.
(2006): "Forecasting with Trending Data," in Handbook of Economic Forecasting, Volume 1, ed. by G. Elliott, C. W. J. Granger, and A. Timmermann. North-Holland.
Elliott, G., and U. K. Müller (2006a): "Efficient Tests for General Persistent Time Variation in Regression Coefficients," Review of Economic Studies, 73, 907–940.
(2006b): "Minimizing the Impact of the Initial Condition on Testing for Unit Roots," Journal of Econometrics, 135, 285–310.
Elliott, G., U. K. Müller, and M. W. Watson (2015): "Nearly Optimal Tests When a Nuisance Parameter is Present Under the Null Hypothesis," Econometrica, 83, 771–811.
Elliott, G., T. J. Rothenberg, and J. H. Stock (1996): "Efficient Tests for an Autoregressive Unit Root," Econometrica, 64, 813–836.
Glave, F. E. (1972): "A New Look at the Barankin Lower Bound," IEEE Transactions on Information Theory, 18, 349–356.
Gospodinov, N. (2002): "Median Unbiased Forecasts for Highly Persistent Autoregressive Processes," Journal of Econometrics, 111, 85–101.
Han, C., P. Phillips, and D. Sul (2011): "Uniform Asymptotic Normality in Stationary and Unit Root Autoregression," Econometric Theory, 27, 1117–1151.
Harvey, A. C. (1989): Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press.
Hirano, K., and J. Porter (2012): "Impossibility Results for Nondifferentiable Functionals," Econometrica, 80, 1769–1790.
Hurwicz, L. (1950): "Least Squares Bias in Time Series," in Statistical Inference in Dynamic Economic Models, ed. by T. Koopmans, pp. 365–383. Wiley, New York.
Jansson, M., and M. J. Moreira (2006): "Optimal Inference in Regression Models with Nearly Integrated Regressors," Econometrica, 74, 681–714.
Kemp, G. C. R. (1999): "The Behavior of Forecast Errors from a Nearly Integrated AR(1) Model as Both Sample Size and Forecast Horizon Become Large," Econometric Theory, 15, 238–256.
Kendall, M. G. (1954): "Note on the Bias in the Estimation of Autocorrelation," Biometrika, 41, 403–404.
Kothari, S. P., and J. Shanken (1997): "Book-to-Market, Dividend Yield, and Expected Market Returns: A Time-Series Analysis," Journal of Financial Economics, 18, 169–203.
Lehmann, E. L., and G. Casella (1998): Theory of Point Estimation. Springer, New York, 2nd edn.
Lehmann, E. L., and J. P. Romano (2005): Testing Statistical Hypotheses. Springer, New York.
MacKinnon, J. G., and A. A. Smith (1998): "Approximate Bias Correction in Econometrics," Journal of Econometrics, 85, 205–230.
Mankiw, N. G., and M. D. Shapiro (1986): "Do We Reject Too Often? Small Sample Properties of Tests of Rational Expectations Models," Economics Letters, 20, 139–145.
Marriott, F. H. C., and J. A. Pope (1954): "Bias in the Estimation of Autocorrelations," Biometrika, 41, 393–402.
McAulay, R., and E. Hofstetter (1971): "Barankin Bounds on Parameter Estimation," IEEE Transactions on Information Theory, 17, 669–676.
Mikusheva, A. (2007): "Uniform Inference in Autoregressive Models," Econometrica, 75, 1411–1452.
Müller, U. K., and P. Petalas (2010): "Efficient Estimation of the Parameter Path in Unstable Time Series Models," Review of Economic Studies, 77, 1508–1539.
Nyblom, J. (1989): "Testing for the Constancy of Parameters Over Time," Journal of the American Statistical Association, 84, 223–230.
Orcutt, G. H., and H. S. Winokur (1969): "First Order Autoregression: Inference, Estimation, and Prediction," Econometrica, 37, 1–14.
Pantula, S. G., G. Gonzalez-Farias, and W. A. Fuller (1994): "A Comparison of Unit-Root Test Criteria," Journal of Business & Economic Statistics, 12, 449–459.
Park, H. J., and W. A. Fuller (1995): "Alternative Estimators and Unit Root Tests for the Autoregressive Process," Journal of Time Series Analysis, 16, 415–429.
Phillips, P. (1977): "Approximations to Some Finite Sample Distributions Associated with a First Order Stochastic Difference Equation," Econometrica, 45, 463–486.
(1978): "Edgeworth and Saddlepoint Approximations in the First-Order Noncircular Autoregression," Biometrika, 65, 91–108.
(1979): "The Sampling Distribution of Forecasts from a First Order Autoregression," Journal of Econometrics, 9, 241–261.
(2012): "Folklore Theorems, Implicit Maps, and Indirect Inference," Econometrica, 80, 425–454.
Phillips, P. C. B. (1987): "Towards a Unified Asymptotic Theory for Autoregression," Biometrika, 74, 535–547.
Phillips, P. C. B., and C. Han (2008): "Gaussian Inference in AR(1) Time Series with or Without a Unit Root," Econometric Theory, 24, 631–650.
Quenouille, M. H. (1956): "Notes on Bias in Estimation," Biometrika, 43, 353–360.
Roy, A., and W. A. Fuller (2001): "Estimation for Autoregressive Time Series with a Root Near 1," Journal of Business and Economic Statistics, 19, 482–493.
Sargan, J. D., and A. Bhargava (1983): "Maximum Likelihood Estimation of Regression Models with First Order Moving Average Errors when the Root Lies on the Unit Circle," Econometrica, 51, 799–820.
Sawa, T. (1978): "The Exact Moments of the Least Squares Estimator for the Autoregressive Model," Journal of Econometrics, 8, 159–172.
Shaman, P., and R. A. Stine (1988): "The Bias of Autoregressive Coefficient Estimators," Journal of the American Statistical Association, 83, 842–848.
Stambaugh, R. F. (1986): "Bias in Regressions with Lagged Stochastic Regressors," Working Paper, University of Chicago.
(1999): "Predictive Regressions," Journal of Financial Economics, 54, 375–421.
Stock, J. H. (1991): "Confidence Intervals for the Largest Autoregressive Root in U.S. Macroeconomic Time Series," Journal of Monetary Economics, 28, 435–459.
(1994): "Unit Roots, Structural Breaks and Trends," in Handbook of Econometrics, ed. by R. F. Engle and D. McFadden, vol. 4, pp. 2740–2841. North Holland, New York.
(1996): "VAR, Error Correction and Pretest Forecasts at Long Horizons," Oxford Bulletin of Economics and Statistics, 58, 685–701.
Stock, J. H., and M. W. Watson (1998): "Median Unbiased Estimation of Coefficient Variance in a Time-Varying Parameter Model," Journal of the American Statistical Association, 93, 349–358.
Tanaka, K. (1983): "Asymptotic Expansions Associated with the AR(1) Model with Unknown Mean," Econometrica, 51, 1221–1231.
(1984): "An Asymptotic Expansion Associated with the Maximum Likelihood Estimators in ARMA Models," Journal of the Royal Statistical Society B, 46, 58–67.
Turner, J. (2004): "Local to Unity, Long-Horizon Forecasting Thresholds for Model Selection in the AR(1)," Journal of Forecasting, 23, 513–539.
White, J. S. (1961): "Asymptotic Expansions for the Mean and Variance of the Serial Correlation Coefficient," Biometrika, 48, 85–94.
Yu, J. (2012): "Bias in the Estimation of the Mean Reversion Parameter in Continuous Time Models," Journal of Econometrics, 169, 114–122.

A Proofs

The proof of Lemma 2 uses the following lemma.

Lemma 3 For all a ∈ A and x ∈ X, O(g(x, a))^{-1} = O(x)^{-1} ∘ a^{-1}.

Proof: Replacing x by g(x, a) in (12) yields

g(x, a) = g(M(g(x, a)), O(g(x, a))) = g(M(x), O(g(x, a))).

Alternatively, applying a to both sides of (12) yields

g(x, a) = g(g(M(x), O(x)), a) = g(M(x), a ∘ O(x)).

Thus g(M(x), a ∘ O(x)) = g(M(x), O(g(x, a))). By the assumption that the group actions a are distinct, this implies that O(g(x, a)) = a ∘ O(x). The result now follows from (a_2 ∘ a_1)^{-1} = a_1^{-1} ∘ a_2^{-1}.

Proof of Lemma 2: Suppose θ_1 and θ_2 are such that M(θ_1) = M(θ_2). Then there exists a ∈ A such that θ_2 = g(θ_1, a).
Thus, for an arbitrary measurable subset B,

P_{θ_2}((M(X), g(θ_2, O(X)^{-1})) ∈ B) = P_{g(θ_1,a)}((M(X), g(θ_2, O(X)^{-1})) ∈ B)
= P_{θ_1}((M(g(X, a)), g(θ_2, O(g(X, a))^{-1})) ∈ B)  (invariance of problem)
= P_{θ_1}((M(X), g(θ_2, O(X)^{-1} ∘ a^{-1})) ∈ B)  (property of M and Lemma 3)
= P_{θ_1}((M(X), g(g(θ_1, a), O(X)^{-1} ∘ a^{-1})) ∈ B)  (by θ_2 = g(θ_1, a))
= P_{θ_1}((M(X), g(θ_1, O(X)^{-1})) ∈ B).

B Details on the Algorithm of Section 2.3

The basic idea of the algorithm is to start with some guess for the Lagrange multipliers λ, compute the biases B(δ_λ, G_i), and adjust (λ_i^l, λ_i^u) iteratively as a function of B(δ_λ, G_i), i = 1, ..., m. To facilitate the repeated computation of B(δ_λ, G_i), i = 1, ..., m, it is useful to employ an importance sampling estimator. We use the proposal density f_p, where f_p is the mixture density

f_p(x) = (24 + m_p - 2)^{-1} (12 f_{θ_{p,1}}(x) + 12 f_{θ_{p,m_p}}(x) + Σ_{i=2}^{m_p-1} f_{θ_{p,i}}(x))

for some application-specific grid of m_p parameter values {θ_{p,i}}_{i=1}^{m_p} (that is, under f_p, X is generated by first drawing an index J* uniformly from {-10, -9, ..., m_p + 11}, and then drawing X from f_{θ_{p,J}}, where J = min(max(J*, 1), m_p)). The overweighting of the boundary values θ_{p,1} and θ_{p,m_p} by a factor of 12 counteracts the lack of additional importance sampling points to one side, leading to approximately constant importance sampling Monte Carlo standard errors in problems where θ is one-dimensional, as is the case in all our applications once invariance is imposed.

Let X_l, l = 1, ..., N be i.i.d. draws from f_p. For a given estimator δ, we approximate B(δ, G_i) by

B(δ, G_i) = E_{f_p}[ ∫ c(δ(X), θ) (f_θ(X)/f_p(X)) dG_i(θ) ] ≈ N^{-1} Σ_{l=1}^N ∫ c(δ(X_l), θ) (f_θ(X_l)/f_p(X_l)) dG_i(θ).

We further approximate ∫ c(δ, θ) f_θ(X_l) dG_i(θ) for arbitrary δ by quadratic interpolation based on the closest three points in the grid H_l = {δ_{l,1}, ..., δ_{l,M}}, δ_{l,i} < δ_{l,j} for i < j (we use the same grid for all i = 1, ..., m). The grid is chosen large enough so that δ(X_l) is never artificially constrained for any value of λ considered by the algorithm. Similarly, R(δ, F) is approximated by

R(δ, F) ≈ N^{-1} Σ_{l=1}^N ∫ ℓ(δ(X_l), θ) f_θ(X_l) dF(θ) / f_p(X_l),

and ∫ ℓ(δ, θ) f_θ(X_l) dF(θ) for arbitrary δ is approximated with the analogous quadratic interpolation scheme. Furthermore, for given λ, the minimizer δ_λ(X_l) of the function L_l: R → R,

L_l(δ) = ∫ ℓ(δ, θ) f_θ(X_l) dF(θ) + Σ_{i=1}^m (λ_i^u - λ_i^l) ∫ c(δ, θ) f_θ(X_l) dG_i(θ),

is approximated by first obtaining the global minimum over δ ∈ H_l, followed by a quadratic approximation of L_l(δ) around the minimizing δ_{l,j} ∈ H_l, based on the three values of L_l(δ) for δ ∈ {δ_{l,j-1}, δ_{l,j}, δ_{l,j+1}} ⊂ H_l.

For given ε, the approximate solution to (1) subject to (5) is now determined as follows:

1. Generate i.i.d. draws X_l, l = 1, ..., N with density f_p.
2. Compute and store ∫ c(δ, θ) f_θ(X_l) dG_i(θ) and ∫ ℓ(δ, θ) f_θ(X_l) dF(θ) for δ ∈ H_l, i = 1, ..., m, l = 1, ..., N.
3. Initialize λ_i^{u,(0)} = λ_i^{l,(0)} = 0.0001 and ω_i^{u,(0)} = ω_i^{l,(0)} = 0.05, i = 1, ..., m.
4. For k = 0, ..., K - 1:
(a) Compute δ^{(k)}(X_l) = δ_{λ^{(k)}}(X_l) as described above, l = 1, ..., N.
(b) Compute B(δ^{(k)}, G_i) as described above, i = 1, ..., m.
(c) Compute λ^{(k+1)} from λ^{(k)} via λ_i^{u,(k+1)} = λ_i^{u,(k)} exp(ω_i^{u,(k)}(B(δ^{(k)}, G_i) - ε)) and λ_i^{l,(k+1)} = λ_i^{l,(k)} exp(ω_i^{l,(k)}(-B(δ^{(k)}, G_i) - ε)), i = 1, ..., m.
(d) Compute ω_i^{u,(k+1)} = max(0.01, 0.5 ω_i^{u,(k)}) if (B(δ^{(k+1)}, G_i) - ε)(B(δ^{(k)}, G_i) - ε) < 0, and ω_i^{u,(k+1)} = min(100, 1.03 ω_i^{u,(k)}) otherwise; similarly, ω_i^{l,(k+1)} = max(0.01, 0.5 ω_i^{l,(k)}) if (B(δ^{(k+1)}, G_i) + ε)(B(δ^{(k)}, G_i) + ε) < 0, and ω_i^{l,(k+1)} = min(100, 1.03 ω_i^{l,(k)}) otherwise.
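The following Python sketch mirrors the multiplier updating in Step 4 in schematic form. The callable bias_of, which must return the vector of biases B(δ^{(k)}, G_i) of the current Lagrangian-minimizing estimator (Steps 4.a and 4.b), is left abstract because it is application specific; the comparison of consecutive bias signs is a slight simplification of the indexing in Step 4.d, and all names are illustrative assumptions of this sketch.

```python
import numpy as np

def update_multipliers(bias_of, m, eps=0.01, K=300, lam0=1e-4, omega0=0.05):
    """Adaptive exponential updating of the upper/lower Lagrange multipliers
    so that the biases B(delta^(k), G_i) end up within [-eps, eps].
    bias_of(lam_u, lam_l) must return the length-m vector of biases of the
    Lagrangian-minimizing estimator for the given multipliers."""
    lam_u = np.full(m, lam0)
    lam_l = np.full(m, lam0)
    om_u = np.full(m, omega0)
    om_l = np.full(m, omega0)
    bias_prev = None
    for _ in range(K):
        bias = bias_of(lam_u, lam_l)                    # Steps 4.a-4.b
        lam_u *= np.exp(om_u * (bias - eps))            # Step 4.c
        lam_l *= np.exp(om_l * (-bias - eps))
        if bias_prev is not None:                       # Step 4.d: adapt speed
            flip_u = (bias - eps) * (bias_prev - eps) < 0
            om_u = np.where(flip_u, np.maximum(0.01, 0.5 * om_u),
                                    np.minimum(100.0, 1.03 * om_u))
            flip_l = (bias + eps) * (bias_prev + eps) < 0
            om_l = np.where(flip_l, np.maximum(0.01, 0.5 * om_l),
                                    np.minimum(100.0, 1.03 * om_l))
        bias_prev = bias
    return lam_u, lam_l

# Usage (illustrative): lam_u, lam_l = update_multipliers(my_bias_of, m=21)
```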
The idea of Step 4.d is to slowly increase the speed of the change in the Lagrange multipliers as long as the sign of the violation remains the same in consecutive iterations, and to decrease it otherwise.

In the context of Step 2 of the algorithm in Section 2.3, we set λ̂ = λ^{(K)} for K = 300. For Step 3 of the algorithm in Section 2.3, we first initialize and apply the above iterations 300 times for ε = ε_B. We then continue to iterate the above algorithm for another 400 iterations, but every 100 iterations increase or decrease ε via a simple bisection method, based on whether or not R(δ^{(k)}, F) < (1 + ε_R) R̄, where R̄ denotes the lower bound on the weighted average risk. After these 4 bisections for ε, we continue to iterate another 300 times, for a total of K = 1000 iterations. The check of whether the resulting λ̂ = λ^{(K)} satisfies the uniform bias constraint (3) is performed by directly computing its bias b(δ_λ̂, θ) via the importance sampling approximation

b(δ, θ) ≈ N^{-1} Σ_{l=1}^N c(δ(X_l), θ) f_θ(X_l)/f_p(X_l)

over a fine but discrete grid θ ∈ Θ_g. If the estimator is constrained to be of the switching form (10), the above algorithm is unchanged, except that δ^{(k)}(X_l) in Step 4.a is predetermined to equal δ^{(k)}(X_l) = δ_S(X_l) for all draws X_l for which the switching condition in (10) applies. We set the number of draws to N = 250,000 in all applications. Computations take no more than a few minutes on a modern PC, with a substantial fraction of the time spent in Step 2.

C Additional Application-Specific Details

C.1 AR(1) Coefficient

The process X is approximated by a discrete time Gaussian AR(1) with T = 2500 observations. We use θ_{p,i} = ((i - 1)/5)², i = 1, ..., 61, Θ_g = {(12j/500)²}_{j=0}^{500}, and H_l = {(j/4)(i/5)² + ((4 - j)/4)((i + 1)/5)², j = 0, ..., 3, i = 0, ..., 49} ∪ {144}. Integration over G_i and F is performed using a Gaussian quadrature rule with 7 points separately on each of the intervals determined by the sequence of knots that underlie G_i.

C.2 Degree of Parameter Time Variation

The process X is approximated by partial sums of a discrete time MA(1) with T = 800 observations. We set H_l = {(j/4)(i/5)² + ((4 - j)/4)((i + 1)/5)², j = 0, ..., 3, i = 0, ..., 39} ∪ {64}. After invariance, the effective parameter is the degree of time variation, and for it we use the grid {((i - 1)/5)²}_{i=1}^{41} as proposal and the grid {(j/50)²}_{j=0}^{500} to check the uniform bias property. Integration over G_i and F is performed using a Gaussian quadrature rule with 7 points separately on each of the intervals determined by the sequence of knots that underlie G_i.

C.3 Predictive Regression

The process X_2 is approximated by a discrete time Gaussian AR(1) with T = 2500 observations. We set H_l = {(i/10)(∫_0^1 X̄_{2,l}(s)² ds)^{-1/2}}_{i=-80}^{80} (corresponding to a range of the "correction term" in (25) of up to 8 OLS standard deviations of β̂_OLS). After invariance and integration over the N(0, I_2) weighting, the effective parameter is γ, and for γ we use the grids {(i/5)²}_{i=0}^{60} as proposal and {(12j/500)²}_{j=0}^{500} to check the uniform bias property.

To derive (26), note that under θ = (0, 0, 0, γ), X_1(s) = r W_z(s) + (1 - r²)^{1/2} W_y(s) and X_2 is independent of W_y. Further, dX_2(s) = -γ X_2(s) ds + dW_z(s), so that W_z(s) = X_2(s) - X_2(0) + γ ∫_0^s X_2(l) dl.
Thus,

E[X_1(s)|X_2] = r(X_2(s) - X_2(0)) + rγ ∫_0^s X_2(l) dl,
E[X_1(1)|X_2] = r(X_2(1) - X_2(0)) + rγ ∫_0^1 X_2(l) dl,
E[β̂_OLS(X)|X_2] = r ∫_0^1 X̄_2(l) dX_2(l)/∫_0^1 X̄_2(l)² dl + rγ,

so that, from X_1*(s) = X_1(s) - sX_1(1) - β̂_OLS(X) ∫_0^s X̄_2(l) dl,

E[X_1*(s)|X_2] = r(X_2(s) - X_2(0)) - rs(X_2(1) - X_2(0)) - r (∫_0^1 X̄_2(l) dX_2(l)/∫_0^1 X̄_2(l)² dl) ∫_0^s X̄_2(l) dl.

Furthermore, the conditional covariance between X_1*(s) and β̂_OLS(X) given X_2 is

E[(β̂_OLS(X) - E[β̂_OLS(X)|X_2])(X_1*(s) - E[X_1*(s)|X_2])|X_2]
= (1 - r²) [ (∫_0^s X̄_2(l) dl - s ∫_0^1 X̄_2(l) dl)/∫_0^1 X̄_2(l)² dl - ∫_0^1 X̄_2(l)² dl ∫_0^s X̄_2(l) dl/(∫_0^1 X̄_2(l)² dl)² ] = 0,

so that, by Gaussianity, β̂_OLS(X) is independent of X_1* conditional on X_2. Thus, E[β̂_OLS(X)|X*] = E[β̂_OLS(X)|X_2], and (26) follows.

We now show that the conditional density of X_1* given X_2 does not depend on γ. Clearly, X_1* is Gaussian conditional on X_2, so the conditional distribution of X_1* given X_2 is fully described by the conditional mean and the conditional covariance kernel. For 0 ≤ s_1 ≤ s_2 ≤ 1, we obtain

E[(X_1*(s_1) - E[X_1*(s_1)|X_2])(X_1*(s_2) - E[X_1*(s_2)|X_2])|X_2] = (1 - r²) [ s_1 - s_1 s_2 - ∫_0^{s_1} X̄_2(l) dl ∫_0^{s_2} X̄_2(l) dl/∫_0^1 X̄_2(l)² dl ].

Neither this expression, nor E[X_1*(s)|X_2] computed above, depends on γ, which proves the claim.

C.4 Quantile Long-range Forecasts

The process X is approximated by a discrete time Gaussian AR(1) with T = 2500 observations. We set H_l = {μ̂_{X(1+r)|X,l} + (z_α + i/10) σ̂_{X(1+r)|X,l}}_{i=-80}^{80}, where μ̂_{X(1+r)|X,l} and σ̂_{X(1+r)|X,l} are estimators of μ_{X(1+r)|X} and σ_{X(1+r)|X} that estimate (μ, γ) via (∫_0^1 X_l(s) ds, max(-½(X̄_l(1)² - 1)/∫_0^1 X̄_l(s)² ds, 1)), X̄_l(s) = X_l(s) - ∫_0^1 X_l(s) ds. For γ, we use the grids {(i/5)²}_{i=0}^{60} as proposal and {(12j/500)²}_{j=0}^{500} to check the uniform bias property.

To further motivate the form of δ_S, observe that for large γ, the unconditional distribution of a mean-zero Ornstein-Uhlenbeck process with mean reversion parameter γ is N(0, (2γ)^{-1}), suggesting μ̂ + z_α(2γ̂)^{-1/2} as an estimator of the unconditional α quantile. Recall from the discussion in Section 2.1 that for large γ, γ̂ is approximately distributed N(γ, 2γ). Further, μ̂ is approximately distributed N(μ, γ^{-2}), independent of γ̂. Thus, from a first order Taylor expansion,

P(μ̂ + z̃_α(2γ̂)^{-1/2} < μ + (2γ)^{-1/2} Z) ≈ P(z̃_α(2γ)^{-1/2} < ((2γ)^{-1} + γ^{-2} + z̃_α²(2γ)^{-2})^{1/2} Z),

which, when set equal to 1 - α as required by (27), is solved by z̃_α as given in the main text.
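As a complement to these derivations, the following Monte Carlo sketch illustrates the violation of the quantile unbiasedness constraint (27) by the plug-in forecast δ_PI of Section 5.3. The discretization with 500 grid points, the illustrative parameter values, the hard-coded 5% normal quantile, and the lower bound imposed on γ̂ (to keep the plug-in variance well defined) are assumptions of this sketch, not choices from the paper.

```python
import numpy as np

Z_ALPHA = -1.6449  # approx. 5% quantile of the standard normal

def simulate_ou(gamma, mu=0.0, n=500, rng=None):
    """Stationary Ornstein-Uhlenbeck path on [0,1] on an n-step grid,
    dX = -gamma*(X-mu)*ds + dW, X(0) ~ N(mu, 1/(2*gamma)), using the exact
    AR(1) transition of the OU process."""
    rng = np.random.default_rng(rng)
    dt = 1.0 / n
    a = np.exp(-gamma * dt)
    s = np.sqrt((1.0 - a**2) / (2.0 * gamma))
    x = np.empty(n + 1)
    x[0] = mu + rng.standard_normal() / np.sqrt(2.0 * gamma)
    for t in range(n):
        x[t + 1] = mu + a * (x[t] - mu) + s * rng.standard_normal()
    return x

def plugin_quantile(x, r=0.2, z_alpha=Z_ALPHA):
    """Plug-in quantile forecast: exact conditional quantile with (mu, gamma)
    replaced by the sample mean and an OLS-type estimate (floored at 0.01
    as a practical guard, an assumption of this sketch)."""
    mu_hat = x.mean()
    xc = x - mu_hat
    gamma_hat = max(-0.5 * (xc[-1]**2 - 1.0) / np.mean(xc**2), 0.01)
    m = mu_hat + (x[-1] - mu_hat) * np.exp(-gamma_hat * r)
    s = np.sqrt((1.0 - np.exp(-2.0 * gamma_hat * r)) / (2.0 * gamma_hat))
    return m + s * z_alpha

def coverage(gamma=1.0, mu=0.0, r=0.2, n_rep=2000, seed=0):
    """Frequency with which X(1+r) falls below the plug-in forecast;
    quantile unbiasedness (27) would require this to equal 0.05."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_rep):
        x = simulate_ou(gamma, mu, rng=rng)
        q = plugin_quantile(x, r)
        m = mu + (x[-1] - mu) * np.exp(-gamma * r)       # exact draw of X(1+r)
        s = np.sqrt((1.0 - np.exp(-2.0 * gamma * r)) / (2.0 * gamma))
        hits += (m + s * rng.standard_normal()) < q
    return hits / n_rep

print(coverage(gamma=1.0))  # tends to exceed the nominal 0.05, most severely for small gamma
```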