Efficient Two-Step Estimation via Targeting

David T. Frazier* and Eric Renault†

April 24, 2015

Abstract

The standard description of two-step extremum estimation amounts to plugging in a first-step estimator of nuisance parameters in order to simplify the optimization problem, and then deducing a user-friendly estimator for the parameters of interest. This two-step procedure often induces an efficiency loss with respect to estimation of the parameters of interest. In this paper, we consider a more general setting where we do not necessarily have such a thing as nuisance parameters but rather awkward occurrences of the parameters of interest. By awkward, we mean that within the estimating equations for a vector of unknown parameters of interest $\theta$, some occurrences of $\theta$, encapsulated by a vector $\nu(\theta)$, may be computationally tricky. Then, it is still the case that prior knowledge of the unknown auxiliary parameters $\nu = \nu(\theta)$ would make inference on $\theta$ much simpler, and it is this fact that motivates the two-step approach developed in this paper. The efficiency problem is more difficult than in the case of standard nuisance parameters: even the (infeasible) approach of plugging in the true unknown value of $\nu = \nu(\theta)$ may not deliver efficiency, since it overlooks the information about $\theta$ contained in the awkward occurrences $\nu(\theta)$. Moreover, we stress that standard ways to restore efficiency for two-step procedures may not work due to a consistency issue: when setting the focus on a first-step estimator for only some of the occurrences $\nu = \nu(\theta)$ of the unknown parameters $\theta$, global identification may be lost. To alleviate this issue, we develop a targeting strategy that enforces consistency and achieves efficiency. Such difficult occurrences $\nu(\theta)$ of the parameters, which are a nuisance when it comes to solving estimating equations, are present in many financial econometrics applications, often handled by indirect inference.
Leading examples are asset pricing models with latent variables (whose observation would make estimation much simpler), models where it is simpler to first set the focus of inference on marginal distributions (multivariate GARCH, copulas), models with highly nonlinear objective functions, etc. Based on targeting and penalization of the auxiliary parameters, we propose a new two-step estimation procedure that leads to stable and user-friendly computations. Moreover, estimators delivered in the second step of the estimation procedure are asymptotically efficient. We compare this new method with existing iterative methods in the framework of copula models and asset pricing models. Simulation results illustrate that this new method performs better than existing iterative procedures and is (nearly) computationally equivalent.

Keywords: Targeting, Penalization, Multivariate Time Series Models, Asset Pricing, Implied States.

* Department of Econometrics and Business Statistics, Monash University. Email: david.frazier@monash.edu
† Department of Economics, Brown University. Email: eric_renault@brown.edu

1 Introduction

The standard treatment of two-stage estimation (see e.g. Pagan, 1986 or Newey and McFadden, 1994, section 6) is generally motivated by the following sequence of arguments, as coined by Pagan (1986):

(i) Econometricians are often faced with the troublesome problem that "in order to estimate the parameters they are ultimately interested in, it becomes necessary to quantify a number of nuisance parameters (...) it is the presence of these parameters which converts a relatively simple computational problem into a very complex one".

(ii) "Because estimation would generally be easy if the nuisance parameter were known, a very common strategy for dealing with them has emerged: they are replaced by a nominated value which is estimated from the data".
Then, the key issue for asymptotic theory is to assess the effect of first-step estimators on second-step standard errors (see Newey and McFadden, 1994, subsection 6.2). The most favorable situation is when ignoring the first step would be valid: the asymptotic distribution of the second-step estimator for the parameters of interest does not depend on the first-step estimator for the nuisance parameters and would have been the same had the nuisance parameters been known upfront.

Our focus of interest in this paper is germane to the above one but more general. The main difference is that we do not necessarily have such a thing as nuisance parameters but rather awkward occurrences of the parameters of interest. By awkward, we mean that within the estimating equations for a vector of unknown parameters of interest $\theta$, some occurrences of $\theta$ may be computationally tricky, either due to the complexity of the relationship, or numerical instability, or both. In order to disentangle these unpleasant occurrences from user-friendly ones, we denote the sample-based estimating functions as $q_T[\theta, \nu(\theta)]$, where $\nu(\theta)$ encapsulates all the occurrences of $\theta$ considered as somewhat awkward, while $T$ stands for the sample size. Generally speaking, our estimator of interest is $\hat\theta_T$, defined as a zero of the vector function $f_T(\theta) = q_T[\theta, \nu(\theta)]$.

Note that this general framework obviously encompasses the standard nuisance parameter setting described above. If, within the vector $\theta$ of unknown parameters, we distinguish some parameters of interest, denoted by $\theta_1$, and some nuisance parameters, denoted by $\theta_2$, such that $\theta = (\theta_1', \theta_2')'$ and $\nu(\theta) = \theta_2$, we are back to the standard case as far as efficient estimation of $\theta_1$ is concerned. Note that, up to a slight change of notation, our setup nests the case where the function $\nu(\theta)$ would be a sample-dependent one, $\nu_T(\theta)$, for instance because $\nu(\theta)$ shows up after some nuisance parameters have been profiled out.
Up to a specific discussion on how to accommodate this case (see the Appendix), the simpler notation $\nu(\theta)$ will be kept throughout.

Our leading example will be the case of an extremum estimator
\[
\hat\theta_T = \arg\max_{\theta} Q_T[\theta, \nu(\theta)], \tag{1}
\]
so that the estimating equations correspond to first order conditions:
\[
q_T[\theta, \nu(\theta)] = \frac{\partial Q_T[\theta, \nu(\theta)]}{\partial \theta} + \frac{\partial \nu'(\theta)}{\partial \theta}\,\frac{\partial Q_T[\theta, \nu(\theta)]}{\partial \nu}. \tag{2}
\]
$\hat\theta_T$ may be the MLE if the function $Q_T[\theta, \nu(\theta)]$ is a well-specified (log)likelihood function. More generally, we will see $\hat\theta_T$ throughout as our benchmark estimator for the purpose of asymptotic efficiency.

We highlight two important classes of examples in this paper. First, in Section 4, we consider a class of additively separable log-likelihood functions that are usually encountered in the so-called "estimation from likelihood of margins" (see e.g. Joe, 1997). In this setting, the unknown parameters, components of $\theta$, can be split into two parts $\theta = (\theta_1', \theta_2')'$, where $\theta_1$ characterizes the likelihood of the margins and $\theta_2$ characterizes the dependence between components, let's say the "cross-dependence", through some link functions (typically linear correlations or copulas). However, the link function describing the cross-dependence applies to data components that have been first standardized using the knowledge of $\theta_1$. In other words, the part of the likelihood capturing cross-dependence also involves the parameters $\theta_1$ that describe the marginal distributions. Such occurrences of $\theta_1$ are an example of the awkward occurrences mentioned earlier, in that these situations can be difficult to deal with in practice; i.e., $\nu(\theta) = \theta_1$ corresponds to the occurrences of $\theta_1$ in the cross-dependence portion of the log-likelihood. Fortunately, a consistent user-friendly estimator of $\theta_1$ is available from the likelihood of the margins and can be plugged into the cross-dependence portion in order to estimate $\theta_2$.
This approach is actually popular for the estimation of nonlinear multivariate time series models like multivariate GARCH or copula models. However, as explained below, the simplicity entails an efficiency loss since the information in the cross-dependence model about the margin parameters $\theta_1$ has been overlooked.

In Section 5, we consider nonlinear models in which observable variables are viewed as functions of some latent variables. Typically, the latent model, which is characterized by a vector of unknown parameters $\theta$, specifies a Markov process for the state variables and defines their (possibly nonlinear) transition equation. Such an approach becomes difficult when the measurement equation of this nonlinear state space model, which is the function that relates observable variables to latent ones, also depends on the same unknown parameters through a vector $\nu(\theta)$. While it would have been relatively easy to estimate $\theta$ from observations on the latent variables, inference using available observations is complicated by the additional awkward occurrence of $\theta$, namely $\nu(\theta)$, in the transformation from latent to observable variables. It is worth noting that the issue we have in mind is not really about filtering latent variables, because we actually consider a case where the latent-observable relationship is one-to-one. Hence, backing out the latent variables from observations would have been easy if not polluted by the additional occurrence of unknown parameters in the measurement equation. This kind of situation is common in modern arbitrage-based asset pricing models with hedging of various sources of risk defined by an underlying model of state variables. Latent state variables are common factors, the dynamics of which characterize the dynamics of observed yields or derivative asset prices.
Since the measurement equation, typically an arbitrage-based asset pricing formula, is one-to-one, we can, following Pan (2002), dub "Implied States" the value of latent variables that can be backed out from observations for a given value of parameters $\nu(\theta)$. From this general intuition, Pan (2002) has extended the approach put forward by Renault and Touzi (1996) (and later revisited by Pastorello, Patilea and Renault (2003)) to devise the so-called "Implied States GMM" estimator. Again, the simplicity of this strategy comes at the cost of an efficiency loss, since the information content about $\theta$ brought by its awkward occurrence $\nu(\theta)$ is overlooked in this procedure. As Pan (2002) put it, "the efficiency of this "optimal instrument" scheme is limited in that (...) we sacrifice efficiency by ignoring the dependence of $\sigma(t)$ on $\theta$," spot volatility $\sigma(t)$ backed out from option price being for her the implied state.

We are then generally faced with the following trade-off between asymptotic efficiency and computational cost (both in terms of computational complexity and stability). On the one hand, we still contemplate that estimation would be easy if the awkward part $\nu(\theta)$ were known. Therefore, there is still some rationale to estimate it in a first stage, that is, if $\theta^0$ stands for the true unknown value of $\theta$, to replace $\nu(\theta^0)$ by a consistent sample counterpart $\tilde\nu_T$. On the other hand, it is well known (see Newey and McFadden, 1994 for a discussion) that the two-step estimator obtained by plugging in the first-step consistent estimator $\tilde\nu_T$ of the nuisance parameters would be inefficient in general. However, we want to stress that in our more general case, where $\nu$ is not necessarily a nuisance parameter but may be a known function $\nu(\theta)$ of parameters of interest, there is not even a reason to believe that we would get a more accurate estimator by computing the infeasible estimator $\breve\theta_T$, the solution of
\[
q_T[\breve\theta_T, \nu(\theta^0)] = 0. \tag{3}
\]
On the contrary, there are many circumstances (see Pastorello et al., 2003 and references therein) in which the infeasible estimator $\breve\theta_T$ is actually less accurate than $\hat\theta_T$. The efficiency loss is due to the fact that the computation of $\breve\theta_T$ disregards the information about $\theta$ contained in the function $\nu(\theta)$ (see also Crepon et al., 1997 for a similar remark in a GMM context). More precisely, for the efficient estimator, $\hat\theta_T - \theta^0$ is asymptotically equivalent to
\[
-\left\{ \frac{\partial q_T}{\partial \theta'}[\theta^0, \nu(\theta^0)] + \frac{\partial q_T}{\partial \nu'}[\theta^0, \nu(\theta^0)]\,\frac{\partial \nu}{\partial \theta'}(\theta^0) \right\}^{-1} q_T[\theta^0, \nu(\theta^0)],
\]
while for the infeasible estimator, $\breve\theta_T - \theta^0$ is asymptotically equivalent to
\[
-\left\{ \frac{\partial q_T}{\partial \theta'}[\theta^0, \nu(\theta^0)] \right\}^{-1} q_T[\theta^0, \nu(\theta^0)].
\]

Two standard strategies are available in the literature to address this efficiency issue. A first possibility, as recently developed by Fan, Pastorello and Renault (2015) (hereafter, FPR), is to devise a sequence of estimators $\hat\theta_T^{(k)}$, $k = 1, 2, \ldots$, from a feasible counterpart of (3):
\[
q_T[\hat\theta_T^{(k+1)}, \nu(\hat\theta_T^{(k)})] = 0, \tag{4}
\]
with, for instance, the aforementioned consistent first-step estimator $\tilde\theta_T$ as the initial value ($\hat\theta_T^{(1)} = \tilde\theta_T$). This follows a seminal paper by Song, Fan and Kalbfleisch (2005) (hereafter, SFK), who proposed a simplified version of this strategy in the particular case of separable log-likelihood functions (see Section 4 below), the algorithm (4) being dubbed "Maximization by Parts" (MBP hereafter). The nice thing with (4) is that each step of the iteration to compute $\hat\theta_T^{(k+1)}$ from $\hat\theta_T^{(k)}$ is no more computationally demanding than the solution of (3). Moreover, by contrast with (3), this iterative procedure may allow us to reach efficiency since, when the iterative procedure (4) has a limit $\hat\theta_T^{(\infty)}$, this limit must coincide with the efficient estimator $\hat\theta_T$. However, it is worth realizing that the contraction mapping property required to secure convergence of (4) is not in general fulfilled in finite samples.
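To make the role of the contraction property concrete, the following is a minimal sketch of iteration (4) on a hypothetical scalar toy problem, not one of the models studied in this paper. The estimating function `q(theta, nu) = ybar - theta - beta * nu`, the awkward occurrence `nu(theta) = sin(theta)`, and the values of `ybar` and `beta` are all assumptions, chosen so that each update has a closed form.

```python
import math


def nu(theta):
    # Hypothetical "awkward" occurrence of theta (an assumption for this toy).
    return math.sin(theta)


def mbp_iterate(ybar, beta, theta_init, n_iter):
    """Iterate (4): solve q(theta_next, nu(theta_k)) = 0 for theta_next,
    with the toy estimating function q(theta, nu) = ybar - theta - beta * nu."""
    theta = theta_init
    for _ in range(n_iter):
        theta = ybar - beta * nu(theta)  # closed-form root in the first argument
    return theta


def residual(theta, ybar, beta):
    # Full estimating function f(theta) = q(theta, nu(theta)).
    return ybar - theta - beta * nu(theta)


# Contraction case: |beta * cos(theta)| <= 0.5 < 1, so the iteration converges.
theta_ok = mbp_iterate(ybar=1.0, beta=0.5, theta_init=0.0, n_iter=100)

# Non-contraction case: the root theta = 0 is a repelling fixed point (|beta| > 1).
theta_bad = mbp_iterate(ybar=0.0, beta=1.5, theta_init=0.1, n_iter=100)

print(abs(residual(theta_ok, 1.0, 0.5)))   # essentially zero
print(abs(residual(theta_bad, 0.0, 1.5)))  # stays far from zero
```

In the first call the update map is a contraction and the iterates settle on the root of the full estimating function. In the second call the iterates remain bounded but never approach the root, however many iterations are run.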
Therefore, a feasible efficient estimator relies upon the choice of a tuning parameter $k(T)$, going to infinity at a sufficient rate with the sample size $T$, in order to obtain an estimator $\hat\theta_T^{(k(T))}$ that is asymptotically equivalent to $\hat\theta_T$. This may obviously come with the computational cost of a large number $k(T)$ of iterations, especially when the required population contraction mapping property is barely fulfilled. Needless to say, the situation is even worse when it is not fulfilled at all, as illustrated in Section 4 below.

The main goal of this paper is to promote a new efficient two-step procedure that does not require the contraction mapping property. We will argue that even though its second step may be more computationally involved than each step of MBP, it keeps some of its simplicity, in particular by comparison with the brute force computation of the efficient estimator $\hat\theta_T$.

Our efficient two-step procedure is actually an extension of a two-step extremum estimator first proposed by Trognon and Gourieroux (1990). The key intuition is to correct the naive two-step objective function $Q_T[\theta, \tilde\nu_T]$ to compensate for the inefficiency caused by plugging in the first-step consistent estimator $\tilde\nu_T$. Our proposed extremum estimator would then be
\[
\hat\theta_T^{ext} = \arg\max_{\theta} \tilde Q_T[\theta, \tilde\nu_T], \tag{5}
\]
with
\[
\tilde Q_T[\theta, \tilde\nu_T] = Q_T[\theta, \tilde\nu_T] + \frac{\partial Q_T[\theta, \tilde\nu_T]}{\partial \nu'}\,[\nu(\theta) - \tilde\nu_T] - \frac{1}{2}\,[\nu(\theta) - \tilde\nu_T]'\, J_T(\theta)\, [\nu(\theta) - \tilde\nu_T] \tag{6}
\]
and
\[
\operatorname*{Plim}_{T=\infty} \left\{ J_T(\theta^0) + \frac{\partial^2 Q_T[\theta^0, \nu(\theta^0)]}{\partial \nu\, \partial \nu'} \right\} = 0.
\]
We show that, when consistent, the estimator $\hat\theta_T^{ext}$ is asymptotically equivalent to the efficient estimator $\hat\theta_T$. The main intuition for this result is that, up to the occurrence of unknown $\theta$ inside the matrix $J_T(\theta)$, the first order conditions of the maximization program (5) can be seen as a linearization of the first order conditions (2) of the efficient program (1), namely, a linearization with respect to $\nu$ in the neighborhood of the first-step estimator $\tilde\nu_T$.
Then, the efficiency argument will be based on a generalization of an argument extensively studied by Robinson (1988). In this seminal paper, general efficiency comparisons are made between roots of rival estimating equations, in particular those provided by local linearizations. However, we point out a difficulty that seems to have been overlooked in the literature so far. When linearization around a preliminary consistent estimator is applied to a vector of estimating equations like $f_T(\theta) = q_T[\theta, \nu(\theta)]$, but is performed only with respect to the second set of occurrences of $\theta$ (the so-called awkward occurrences within $\nu(\theta)$), the fact that $f_T(\theta)$ may also depend nonlinearly on $\theta$ through the first occurrences, say $\theta = \theta^*$ in $q_T[\theta^*, \nu(\theta)]$, can impair the consistency of the estimator defined as the root of this (partially) linearized estimating equation. More precisely, local identification is granted but not global identification.

Our proposed hedge against this risk is the addition of a penalty term $\alpha_T \|\nu(\theta) - \tilde\nu_T\|^2$ to the (partially linearized) estimating equations, with a tuning parameter $\alpha_T$ going to infinity slower than the rate of convergence of our initial estimator $\tilde\nu_T$. In other words, both the MBP approach and our new two-step procedure with penalized partial linearization come with the cost of a tuning parameter. While the MBP approach requires choosing the number $k(T)$ of iterations, we will instead have to choose the rate of divergence of the penalty weight $\alpha_T$. We will see, for instance, that in the standard case where all estimators are root-$T$ consistent, a rate $T^{1/4}$ is well suited. Moreover, we will propose two different two-step procedures, both based on partial linearization, depending upon whether we have a first-step consistent estimator $\tilde\nu_T$ only of the unpleasant parameters, or we have at our disposal an initial consistent estimator $\tilde\theta_T$ for the whole parameter vector $\theta$ (and then $\tilde\nu_T = \nu(\tilde\theta_T)$).
Of course, in the latter case, the penalty term could instead be $\alpha_T \|\theta - \tilde\theta_T\|^2$ and should be able to enforce global identification in even more general circumstances.

The paper is organized as follows. The proposed extension of the Trognon and Gourieroux (1990) efficient two-step procedure is studied in Section 2. Our general result explains why some well known two-step estimators are efficient, in spite of appearances to the contrary: Hatanaka (1974) for a dynamic regression model, Gourieroux, Monfort and Renault (1996) for a GMM estimator. It is worth stressing that efficiency is warranted in these two specific examples because consistency is not an issue. However, we also point out other examples, such as nonlinear least squares and GMM, where consistency is not warranted unless one uses the penalty strategy that we have devised through first order conditions. Robinson's (1988) comparison of estimators is developed in Section 3. It allows us to propose two different penalized two-step estimators, depending on whether one has at her disposal a first-step consistent estimator of $\theta^0$ or only of $\nu(\theta^0)$. Section 4 sets the focus on the separable estimation problem, with a detailed comparison with MBP, both analytically and through Monte Carlo experiments in the framework of a copula example. Section 5 addresses the general implied states issue, in the context of both maximum likelihood and GMM. Again, we are able to provide a detailed comparison with MBP, both analytically and through Monte Carlo experiments, in the simple framework of Merton's credit risk model. Concluding remarks are given in Section 6. Mathematical proofs, regularity conditions and detailed Monte Carlo evidence are all gathered in the Appendix.

2 An efficient two-step extremum estimator

2.1 General framework

Let $\Theta \subset \mathbb{R}^p$ be a compact parameter space, and $\theta^0$ the true unknown value of $\theta$. Additional parameters $\nu$ are defined by some continuous function $\nu(\cdot)$ from $\Theta$ to some subset $\Gamma$ of $\mathbb{R}^q$.
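We assume that the extremum estimator $\hat\theta_T$ of $\theta$, defined by (1), is a consistent asymptotically normal estimator of $\theta^0$. In addition, we assume the following standard regularity conditions are satisfied.

Assumption A1: There is a real-valued deterministic function $Q_\infty[\cdot, \cdot]$, continuous on $\Theta \times \Gamma$ (with $\Gamma \subset \mathbb{R}^q$ the set in which $\nu(\cdot)$ takes its values) and such that:
(i) $\operatorname*{Plim}_{T=\infty} \sup_{\theta \in \Theta} |Q_\infty[\theta, \nu(\theta)] - Q_T[\theta, \nu(\theta)]| = 0$, and
(ii) $\theta^0 = \arg\max_{\theta \in \Theta} Q_\infty[\theta, \nu(\theta)]$.

Assumption A2: The following are satisfied:
(i) $\nu(\cdot)$ is twice continuously differentiable on the interior of $\Theta$.
(ii) $\theta^0 \in \mathrm{Int}(\Theta)$, interior set of $\Theta$, and $\nu^0 = \nu(\theta^0) \in \mathrm{Int}(\Gamma)$, interior set of $\Gamma$.
(iii) The function $Q_T[\theta, \nu]$ is twice continuously differentiable on $\mathrm{Int}(\Theta) \times \mathrm{Int}(\Gamma)$ and, with $q_T[\theta, \nu(\theta)]$ defined by (2),
  (1) $\sqrt{T}\, q_T[\theta^0, \nu(\theta^0)] \to_d \mathcal{N}[0, I_0]$,
  (2) $\operatorname*{Plim}_{T=\infty} \left\{ \dfrac{\partial q_T[\theta^0, \nu(\theta^0)]}{\partial \theta'} + \dfrac{\partial q_T[\theta^0, \nu(\theta^0)]}{\partial \nu'}\, \dfrac{\partial \nu(\theta^0)}{\partial \theta'} \right\} = H_0$.

In addition, we maintain the following high-level assumptions.

Assumption A3: $(\hat\theta_T', \tilde\nu_T')'$ is a root-$T$ consistent asymptotically normal estimator of $(\theta^{0\prime}, \nu^{0\prime})'$ and
\[
\operatorname*{Plim}_{T=\infty} \sup_{\theta \in \Theta} \left\| J_T(\theta) + \frac{\partial^2 Q_T[\theta, \tilde\nu_T]}{\partial \nu\, \partial \nu'} \right\| = 0.
\]

The focus of interest in this section is the comparison of the efficient estimator $\hat\theta_T$ with the two-step alternative $\hat\theta_T^{ext}$ defined in the introduction. For the sake of interpretation, it is worth comparing $\hat\theta_T$ and $\hat\theta_T^{ext}$ with the infeasible estimator
\[
\hat\theta_T^{*} = \arg\max_{\theta \in \Theta} \tilde Q_T^0[\theta, \tilde\nu_T],
\]
where
\[
\tilde Q_T^0[\theta, \tilde\nu_T] = Q_T[\theta, \tilde\nu_T] + \frac{\partial Q_T[\theta, \tilde\nu_T]}{\partial \nu'}\,[\nu(\theta) - \tilde\nu_T] - \frac{1}{2}\,[\nu(\theta) - \tilde\nu_T]'\, J_T(\theta^0)\, [\nu(\theta) - \tilde\nu_T].
\]
We are then able to prove the following result.

Theorem 2.1: Under the maintained assumption that they are all root-$T$ consistent, the three estimators $\hat\theta_T$, $\hat\theta_T^{ext}$ and $\hat\theta_T^{*}$ are asymptotically equivalent.

We stress that, as announced in the introduction, it is only the careful analysis of the first order conditions (see Section 3 below) that will allow us to devise a proper penalty to ensure consistency of our two-step estimators.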
It is, however, worth interpreting them further to discern the reason why the two-step approach is not responsible for any efficiency loss. We can do that at least in the only case considered by Trognon and Gourieroux (1990), namely the case of a genuine nuisance parameter, $\theta = (\theta_1', \theta_2')'$, $\nu(\theta) = \theta_2$. Then, the modified objective function becomes
\[
\tilde Q_T[\theta, \tilde\nu_T] = Q_T[\theta, \tilde\nu_T] + \frac{\partial Q_T[\theta, \tilde\nu_T]}{\partial \nu'}\,[\theta_2 - \tilde\nu_T] - \frac{1}{2}\,[\theta_2 - \tilde\nu_T]'\, J_T(\theta)\, [\theta_2 - \tilde\nu_T],
\]
and the parameters of interest for efficient estimation are included in the sub-vector $\theta_1$. With this point in mind, we can set the focus on an even simpler two-step estimator obtained as the maximizer of the following simplified objective function where, to avoid confusion about partial derivatives, we use two different notations for the same first-step estimator, $\tilde\theta_{2,T} = \tilde\nu_T$:
\[
\breve Q_T[\theta, \tilde\nu_T] = Q_T[\theta_1, \tilde\theta_{2,T}, \tilde\nu_T] + \frac{\partial Q_T[\theta_1, \tilde\theta_{2,T}, \tilde\nu_T]}{\partial \nu'}\,[\theta_2 - \tilde\nu_T] - \frac{1}{2}\,[\theta_2 - \tilde\nu_T]'\, J_T(\theta_1, \tilde\theta_{2,T})\, [\theta_2 - \tilde\nu_T].
\]
Then, it is easy to profile $\theta_2$ out of $\breve Q_T[\theta, \tilde\nu_T]$:
\[
\frac{\partial \breve Q_T[\theta, \tilde\nu_T]}{\partial \theta_2} = 0 \iff \theta_2 = \tilde\nu_T + \left[ J_T(\theta_1, \tilde\theta_{2,T}) \right]^{-1} \frac{\partial Q_T[\theta_1, \tilde\theta_{2,T}, \tilde\nu_T]}{\partial \nu}.
\]
Plugging the above value of $\theta_2$ into $\breve Q_T[\theta, \tilde\nu_T]$, we can concentrate the objective function with respect to the nuisance parameters $\nu(\theta) = \theta_2$ and obtain the following profile objective function:
\[
\breve Q_{c,T}[\theta_1, \tilde\nu_T] = Q_T[\theta_1, \tilde\theta_{2,T}, \tilde\nu_T] + \frac{1}{2}\, \frac{\partial Q_T[\theta_1, \tilde\theta_{2,T}, \tilde\nu_T]}{\partial \nu'} \left[ J_T(\theta_1, \tilde\theta_{2,T}) \right]^{-1} \frac{\partial Q_T[\theta_1, \tilde\theta_{2,T}, \tilde\nu_T]}{\partial \nu}.
\]
For the sake of interpretation, let us consider instead the infeasible objective function and its profile counterpart. Then, the concentrated score vector is
\[
\frac{\partial \breve Q^0_{c,T}[\theta_1, \tilde\nu_T]}{\partial \theta_1} = \frac{\partial Q_T[\theta_1, \tilde\theta_{2,T}, \tilde\nu_T]}{\partial \theta_1} + \frac{\partial^2 Q_T[\theta_1, \tilde\theta_{2,T}, \tilde\nu_T]}{\partial \theta_1\, \partial \nu'} \left[ J_T(\theta_1^0, \tilde\theta_{2,T}) \right]^{-1} \frac{\partial Q_T[\theta_1, \tilde\theta_{2,T}, \tilde\nu_T]}{\partial \nu}.
\]
From the definition of the matrix $J_T(\theta^0)$, we can then deduce that
\[
\operatorname*{Plim}_{T=\infty} \frac{\partial^2 \breve Q^0_{c,T}[\theta_1^0, \nu^0]}{\partial \theta_1\, \partial \nu'} = 0. \tag{7}
\]
Equation (7) is precisely the standard condition (see e.g. Newey and McFadden, 1994, formula (6.6), p. 2179) ensuring that the asymptotic distribution of the estimator of the parameters of interest $\theta_1$ does not depend on the asymptotic distribution of the estimator for the nuisance parameters $\nu$. This provides clear intuition as to why Theorem 2.1 works, at least in the particular case considered by Trognon and Gourieroux (1990): the modification of the objective function in (6) has been devised precisely to restore the asymptotic independence between the two kinds of parameters.

However, the main contribution of the paper is to provide a much more general setup for efficient two-step estimation through the use of penalized estimating equations. This penalty amounts to a slight twist (via targeting) on the two-step estimator $\hat\theta_T^{ext}$, in order to ensure its consistency. Then, its asymptotic equivalence with $\hat\theta_T$, as stated in Theorem 2.1, ensures its asymptotic efficiency. As far as the equivalence between $\hat\theta_T^{ext}$ and $\hat\theta_T^{*}$ is concerned, note that this is germane to the equivalence between two-step efficient GMM and continuously updated GMM, as first put forward by Hansen et al. (1996).

2.2 Application to nonlinear regression

In this subsection, we consider the example of nonlinear least squares. Note that while we consider only ordinary least squares, weighted least squares would not introduce any specific difficulty. Joint estimation of models for conditional mean and variance using Gaussian QMLE (Bollerslev and Wooldridge, 1992) would also fit in this class of examples. Thus, for the sake of notational simplicity, let us just consider the following objective function:
\[
Q_T[\theta, \nu(\theta)] = -\frac{1}{T} \sum_{t=1}^{T} \left[ y_t - g(x_t, \theta, \nu(\theta)) \right]^2,
\]
where $g(\cdot, \cdot, \cdot)$ is a known function such that
\[
g(x_t, \theta^0, \nu(\theta^0)) = E[y_t \mid x_t]. \tag{8}
\]
Hence, the maintained identification assumption is
\[
E[y_t - g(x_t, \theta, \nu(\theta)) \mid x_t] = 0 \iff \theta = \theta^0. \tag{9}
\]
Then,
\[
\frac{\partial Q_T[\theta, \nu(\theta)]}{\partial \nu} = \frac{2}{T} \sum_{t=1}^{T} \frac{\partial g(x_t, \theta, \nu(\theta))}{\partial \nu} \left[ y_t - g(x_t, \theta, \nu(\theta)) \right],
\]
\[
\frac{\partial^2 Q_T[\theta, \nu(\theta)]}{\partial \nu\, \partial \nu'} = -\frac{2}{T} \sum_{t=1}^{T} \frac{\partial g(x_t, \theta, \nu(\theta))}{\partial \nu}\, \frac{\partial g(x_t, \theta, \nu(\theta))}{\partial \nu'} + \frac{2}{T} \sum_{t=1}^{T} \frac{\partial^2 g(x_t, \theta, \nu(\theta))}{\partial \nu\, \partial \nu'} \left[ y_t - g(x_t, \theta, \nu(\theta)) \right].
\]
However, by applying (8), the second sum is asymptotically negligible at the true value, and we can choose the following consistent estimator for (minus) the Hessian matrix with respect to the parameters $\nu$:
\[
J_T(\theta) = \frac{2}{T} \sum_{t=1}^{T} \frac{\partial g(x_t, \theta, \tilde\nu_T)}{\partial \nu}\, \frac{\partial g(x_t, \theta, \tilde\nu_T)}{\partial \nu'}.
\]
With this choice, the modified extremum estimator is obtained as the maximizer of
\[
\begin{aligned}
\tilde Q_T[\theta, \tilde\nu_T] &= Q_T[\theta, \tilde\nu_T] + \frac{\partial Q_T[\theta, \tilde\nu_T]}{\partial \nu'}\,[\nu(\theta) - \tilde\nu_T] - \frac{1}{2}\,[\nu(\theta) - \tilde\nu_T]'\, J_T(\theta)\, [\nu(\theta) - \tilde\nu_T] \\
&= -\frac{1}{T} \sum_{t=1}^{T} \left\{ y_t - g(x_t, \theta, \tilde\nu_T) - \frac{\partial g(x_t, \theta, \tilde\nu_T)}{\partial \nu'}\,[\nu(\theta) - \tilde\nu_T] \right\}^2.
\end{aligned} \tag{10}
\]
In other words, while the estimator defined as the solution to
\[
\min_{\theta} \sum_{t=1}^{T} \left[ y_t - g(x_t, \theta, \tilde\nu_T) \right]^2
\]
is not efficient in general, we can restore efficiency by the additional term in (10). The fact that a nonlinear regression model can be efficiently estimated after linearization of the regression function around a first-step consistent estimator has been known since Hartley (1961). However, it must be kept in mind that the case (10) is more general because we consider only a partial linearization, so as to deal with the nasty occurrences $\nu(\theta)$ in $g(\cdot)$. As a result, efficiency is warranted only when consistency is enforced, which may require the penalty strategy developed in Section 3. To see this, note that the identification assumption (9) does not say that
\[
E[y_t \mid x_t] = g(x_t, \theta, \nu(\theta^0)) + \frac{\partial g(x_t, \theta, \nu(\theta^0))}{\partial \nu'} \left[ \nu(\theta) - \nu(\theta^0) \right] \implies \theta = \theta^0.
\]
The role of targeting will be to enforce the equality $\nu(\theta) = \nu(\theta^0)$, so that the implication above becomes a consequence of the identification assumption (9).

Fortunately, there are cases where penalty/targeting is not needed because consistency is directly implied. Trognon and Gourieroux (1990) point out the example of Hatanaka's (1974) two-step estimator for a dynamic adjustment model with autoregressive errors.
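Before detailing that example, a small numerical sketch may help fix ideas about the correction in (10). It is only an illustration under stated assumptions: the regression function `g`, the mapping `nu(theta) = theta`, the design points, the noise-free responses, and the deliberately perturbed first-step value `nu_tilde` are all hypothetical choices, and a crude grid search stands in for a proper optimizer. The criteria are written as sums of squares to be minimized, which is equivalent to maximizing the corresponding objective functions.

```python
import math

THETA0 = 1.0                              # hypothetical true value
X = [0.1 * k for k in range(1, 21)]       # hypothetical design points


def nu(theta):
    # Awkward occurrence: here simply nu(theta) = theta (toy assumption).
    return theta


def g(x, theta, nu_val):
    # Toy regression function, friendly in theta, awkward in nu.
    return theta * math.exp(-nu_val * x)


def dg_dnu(x, theta, nu_val):
    # Partial derivative of g with respect to nu.
    return -x * theta * math.exp(-nu_val * x)


# Noise-free responses, so the full criterion is minimized exactly at THETA0.
Y = [g(x, THETA0, nu(THETA0)) for x in X]


def naive_ssr(theta, nu_tilde):
    # Plug-in criterion: sum of [y - g(x, theta, nu_tilde)]^2.
    return sum((y - g(x, theta, nu_tilde)) ** 2 for x, y in zip(X, Y))


def corrected_ssr(theta, nu_tilde):
    # Criterion (10) up to sign and scale: g is linearized in nu around nu_tilde.
    delta = nu(theta) - nu_tilde
    return sum(
        (y - g(x, theta, nu_tilde) - dg_dnu(x, theta, nu_tilde) * delta) ** 2
        for x, y in zip(X, Y)
    )


def grid_argmin(criterion, nu_tilde, lo=0.2, hi=3.0, step=0.001):
    # Crude grid search; adequate for a one-dimensional illustration.
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return min(grid, key=lambda th: criterion(th, nu_tilde))


nu_tilde = nu(THETA0) + 0.1               # stand-in for a noisy first-step estimate
theta_naive = grid_argmin(naive_ssr, nu_tilde)
theta_corrected = grid_argmin(corrected_ssr, nu_tilde)
print(theta_naive, theta_corrected)
```

In this toy setting the plug-in estimate inherits an error of the same order as the error in `nu_tilde`, whereas the corrected criterion lands much closer to `THETA0`. The sketch says nothing about sampling variability, since the data are noise-free by construction.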
With obvious notation, the model is
\[
y_t = \alpha_1 y_{t-1} + \alpha_2 z_t + u_t, \qquad u_t = \rho\, u_{t-1} + \varepsilon_t,
\]
and is generally rewritten as
\[
y_t - \rho\, y_{t-1} = \alpha_1 (y_{t-1} - \rho\, y_{t-2}) + \alpha_2 (z_t - \rho\, z_{t-1}) + \varepsilon_t.
\]
Thus, we end up with a nonlinear regression model that can be rewritten in the notational system of (8):
\[
y_t = g(x_t, \alpha_1, \alpha_2, \nu(\theta)) + \varepsilon_t, \qquad x_t = (z_t, y_{t-1}), \quad \theta = (\alpha_1, \alpha_2, \rho)', \quad \nu(\theta) = \rho.
\]
However, a key remark is that the regression function, albeit nonlinear, is linear with respect to $\nu$ when the friendly occurrence of $\theta$ is fixed. Therefore, this partial linearization with respect to $\nu$ does not cause consistency to break down. Theorem 2.1 can be directly applied to confirm that Hatanaka's (1974) two-step estimator is efficient.

2.3 Application to GMM

We now contemplate the case of a parameter identified through $H$ moment restrictions with two kinds of occurrences for the parameters:
\[
E[\varphi_t(\theta, \nu(\theta))] = 0 \iff \theta = \theta^0. \tag{11}
\]
Moment restrictions of the form in (11), and their possible applications, are discussed in more detail in Section 5 within the setting of implied states GMM. In this section, however, we give a general discussion in the context of modified two-step extremum estimators. When working with (11), we typically have in mind estimators defined from the criterion function
\[
Q_T[\theta, \nu(\theta)] = -\bar\varphi_T(\theta, \nu(\theta))'\, W_T\, \bar\varphi_T(\theta, \nu(\theta)),
\]
where
\[
\bar\varphi_T(\theta, \nu(\theta)) = \frac{1}{T} \sum_{t=1}^{T} \varphi_t(\theta, \nu(\theta))
\]
and $W_T$ is some sequence of positive definite matrices. Note that, in order to obtain an estimator $\hat\theta_T$, defined by (1), that reaches the semiparametric efficiency bound, the sequence $W_T$ should provide a consistent estimator for the inverse of the long term variance matrix $\lim_{T=\infty} \mathrm{Var}\left[ \sqrt{T}\, \bar\varphi_T(\theta^0, \nu(\theta^0)) \right]$. However, this issue is irrelevant for us as we only discuss how to obtain estimators that are asymptotically equivalent to $\hat\theta_T$, irrespective of its efficiency.
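From the definition of $Q_T[\theta, \nu(\theta)]$,
\[
\frac{\partial Q_T[\theta, \nu(\theta)]}{\partial \nu} = -2\, \frac{\partial \bar\varphi_T(\theta, \nu(\theta))'}{\partial \nu}\, W_T\, \bar\varphi_T(\theta, \nu(\theta)),
\]
\[
\frac{\partial^2 Q_T[\theta, \nu(\theta)]}{\partial \nu\, \partial \nu'} = -2\, \frac{\partial \bar\varphi_T(\theta, \nu(\theta))'}{\partial \nu}\, W_T\, \frac{\partial \bar\varphi_T(\theta, \nu(\theta))}{\partial \nu'} - 2 \sum_{h=1}^{H} \frac{\partial^2 \bar\varphi_{h;T}(\theta, \nu(\theta))}{\partial \nu\, \partial \nu'}\, W_{h\cdot;T}\, \bar\varphi_T(\theta, \nu(\theta)),
\]
where $W_{h\cdot;T}$ stands for the $h$th row of $W_T$. Then, we can choose the following consistent estimator for (minus) the Hessian matrix with respect to the parameters $\nu$:
\[
J_T(\theta) = 2\, \frac{\partial \bar\varphi_T(\theta, \tilde\nu_T)'}{\partial \nu}\, W_T\, \frac{\partial \bar\varphi_T(\theta, \tilde\nu_T)}{\partial \nu'}.
\]
With this choice, the modified extremum estimator is obtained as the maximizer of
\[
\begin{aligned}
\tilde Q_T[\theta, \tilde\nu_T] &= Q_T[\theta, \tilde\nu_T] + \frac{\partial Q_T[\theta, \tilde\nu_T]}{\partial \nu'}\,[\nu(\theta) - \tilde\nu_T] - \frac{1}{2}\,[\nu(\theta) - \tilde\nu_T]'\, J_T(\theta)\, [\nu(\theta) - \tilde\nu_T] \\
&= -\left\{ \bar\varphi_T(\theta, \tilde\nu_T) + \frac{\partial \bar\varphi_T(\theta, \tilde\nu_T)}{\partial \nu'}\,[\nu(\theta) - \tilde\nu_T] \right\}' W_T \left\{ \bar\varphi_T(\theta, \tilde\nu_T) + \frac{\partial \bar\varphi_T(\theta, \tilde\nu_T)}{\partial \nu'}\,[\nu(\theta) - \tilde\nu_T] \right\}.
\end{aligned} \tag{12}
\]
In other words, while the solution of
\[
\min_{\theta}\, [\bar\varphi_T(\theta, \tilde\nu_T)]'\, W_T\, [\bar\varphi_T(\theta, \tilde\nu_T)] \tag{13}
\]
would not be equivalent to $\hat\theta_T$ in general, we can restore equivalence (and efficiency in the sense of $\hat\theta_T$) by using the additional term in (12). However, since (similarly to the former subsection) it is only a partial linearization of the moment conditions, consistency may not be warranted. To see why consistency may be an issue, note that the identification assumption (11) does not say that
\[
E\left\{ \varphi_t(\theta, \nu(\theta^0)) + \frac{\partial \varphi_t(\theta, \nu(\theta^0))}{\partial \nu'} \left[ \nu(\theta) - \nu(\theta^0) \right] \right\} = 0 \implies \theta = \theta^0. \tag{14}
\]
The role of targeting in this context is to enforce the equality $\nu(\theta) = \nu(\theta^0)$, so that the implication in (14) becomes a consequence of the identification assumption (11).

Fortunately, there are cases where the penalty/targeting is not needed because consistency is directly implied. Gourieroux et al. (1996) consider the case where the vector of moment conditions can be split in two parts, with only the second one depending on $\nu$:
\[
\varphi_t(\theta, \nu(\theta)) = \left[ \varphi_{1t}(\theta)',\; \varphi_{2t}(\theta, \nu(\theta))' \right]'. \tag{15}
\]
Then, the implication (14) is obviously warranted when the first set of moment conditions is sufficient to identify $\theta$, which is typically the case considered by Gourieroux et al. (1996). In this case, Theorem 2.1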
ensures efficiency of the modified two-step estimator. Interestingly enough, the efficient two-step estimator proposed by Gourieroux et al. (1996) may be different from $\hat\theta_T^{ext}$. It is only when the first set of moment conditions $\varphi_{1t}(\theta)$ is linear with respect to $\theta$ that they will numerically coincide (see section 2.6 in Gourieroux et al., 1996). In the general case, their two-step efficient estimator is not based on a (partial) linearization but on minimizing the norm of the moment vector $\bar\varphi_T(\theta, \tilde\nu_T)$, where the weighting matrix is a suitably twisted version of the estimator for the inverse of the long term variance matrix. Note that we know from (13) that efficiency cannot be met without such a twist. Moreover, our equivalence result is more general: not only does it apply to general moment conditions (not only in the form (15)), but it also does not assume that $W_T$ is a consistent estimator of the inverse of the long term variance matrix.

3 Stochastic differences for linearized estimating equations

We first state our general result concerning roots of linearized estimating equations, which extends Theorem 2 of Robinson (1988). Then, in a second subsection, we provide two more user-friendly versions of our two-step estimator, depending on whether one wants to use a first-step consistent estimator of $\theta^0$ or only of $\nu(\theta^0)$. Note that this section is generally valid for estimating equations and their linearizations, irrespective of whether these estimating equations can be seen as first order conditions provided by an extremum estimator, as in our leading example studied in Section 2.

3.1 The general result

Linear approximations will be considered in some neighborhood $\mathcal{N}(\varepsilon)$, $\varepsilon > 0$, of the true unknown value:
\[
\mathcal{N}(\varepsilon) = \left\{ \theta \in \mathbb{R}^p : \|\theta - \theta^0\| < \varepsilon \right\} \subset \Theta.
\]
Note that the existence of such an $\varepsilon$ is tantamount to the maintained assumption that the true unknown value $\theta^0$ belongs to the interior of the parameter space.
In order to extend the results of Robinson (1988), we first characterize our benchmark estimator $\hat\theta_T$ as the solution of some just-identified estimating equations. For the sake of generality, we maintain some high level assumptions about these estimating equations, although they would in general be implied by more primitive assumptions, as seen in our leading example of extremum estimation in section 2.

Assumption B1: $f_T(\theta) = q_T[\theta,\nu(\theta)]$ is a $p$-vector valued random variable such that:
(i) $f_T$ has a zero $\hat\theta_T = \theta^0 + o_P(1)$,
(ii) For some $\varepsilon>0$, the functions of $\theta$: $\nu(\theta)$, $f_T(\theta)$ and $\frac{\partial q_T}{\partial\nu'}[\theta,\nu(\theta^*)]$ are continuously differentiable on $\partial(\varepsilon)$, for any given $\theta^*$ in $\partial(\varepsilon)$.
(iii) $F_T(\theta^0) = F + o_P(1)$, where $F_T(\theta) = \frac{\partial f_T(\theta)}{\partial\theta'}$ and $F$ is non-singular.

Under standard regularity conditions (see appendix), the non-singular matrix $F$ can obviously be written as

$$F = \frac{\partial q_\infty[\theta^0,\nu(\theta^0)]}{\partial\theta'} + \frac{\partial q_\infty}{\partial\nu'}[\theta^0,\nu(\theta^0)]\,\frac{\partial\nu}{\partial\theta'}(\theta^0),$$

for some population estimating equations $q_\infty[\theta,\nu(\theta)]$ with $\theta^0$ the only zero of $q_\infty[\theta,\nu(\theta)]$.

Assumption B2: $q_\infty[\theta,\nu(\theta)] = 0 \iff \theta = \theta^0$.

We are interested in partially linear approximations of the estimating function around some consistent initial estimator $\tilde\theta_T$. Thus, let us define

$$\tilde h_T(\theta) = q_T[\theta,\nu(\tilde\theta_T)] + \frac{\partial q_T}{\partial\nu'}[\theta,\nu(\tilde\theta_T)]\,\frac{\partial\nu}{\partial\theta'}(\tilde\theta_T)\,(\theta-\tilde\theta_T).$$

Note that $\tilde h_T(\theta)$ provides alternative estimating equations that also locally identify $\theta$ since, with obvious notations (and under standard regularity conditions), a solution $\theta=\theta_T^*$ of $\tilde h_T(\theta)=0$ will converge towards a solution $\theta=\bar\theta$ of the population equations

$$q_\infty[\theta,\nu(\theta^0)] + \frac{\partial q_\infty}{\partial\nu'}[\theta,\nu(\theta^0)]\,\frac{\partial\nu}{\partial\theta'}(\theta^0)\,(\theta-\theta^0) = 0.$$

With a genuine linearization of $f_T(\theta)$ (not only a partial one), Robinson's Theorem 2 shows that a zero of $\tilde h_T(\theta)$ is, in a sense, asymptotically equivalent to $\hat\theta_T$. With a partial linearization, we cannot maintain such a claim since we may only have local identification and not global identification.
That is, there may exist some $\bar\theta\neq\theta^0$ such that, with obvious notations,

$$q_\infty[\bar\theta,\nu(\theta^0)] + \frac{\partial q_\infty}{\partial\nu'}[\bar\theta,\nu(\theta^0)]\,\frac{\partial\nu}{\partial\theta'}(\theta^0)\,(\bar\theta-\theta^0) = 0$$

even though $\theta=\theta^0$ is the only solution of $q_\infty[\theta,\nu(\theta)]=0$. To avoid such a perverse situation, we have to slightly penalize our (partially) linearized sequence by defining:

$$\tilde h_T^P(\theta) = q_T[\theta,\nu(\tilde\theta_T)] + \frac{\partial q_T}{\partial\nu'}[\theta,\nu(\tilde\theta_T)]\,\frac{\partial\nu}{\partial\theta'}(\tilde\theta_T)\,(\theta-\tilde\theta_T) + \alpha_T\,\big\|\theta-\tilde\theta_T\big\|^2 e_p, \tag{16}$$

for a real sequence $\alpha_T$ going slowly to infinity, where $e_p$ stands for the $p$-dimensional vector whose components all equal 1. More precisely, our extension of Robinson's result can be stated as follows:

Proposition 3.1: Under standard regularity conditions detailed in the appendix and under Assumption B1, if $\tilde\theta_T$ is a consistent estimator of $\theta^0$ such that $\|\tilde\theta_T-\theta^0\| = o_P(1/\alpha_T)$ with $\lim_{T=\infty}\alpha_T=\infty$, then for any zero $\tilde\theta_T^P$ of $\tilde h_T^P(\theta)$ in (16)

$$\big\|\hat\theta_T-\tilde\theta_T^P\big\| = O_P\Big(\alpha_T\,\big\|\hat\theta_T-\tilde\theta_T\big\|^2\Big).$$

Proposition 3.1 is a generalization of Theorem 2 in Robinson (1988). When there is no function $\nu$, our Assumptions B1 and B2 exactly match those of Robinson (1988). However, there is a price to pay for a linearization that is only partial: it takes a penalty term $\alpha_T$ going to infinity. Fortunately, the penalty term $\alpha_T\|\hat\theta_T-\tilde\theta_T\|^2$, which shows up in the rate of convergence, will more often than not have a very minor impact on the use of Proposition 3.1. As a matter of fact, Proposition 3.1 will often be applied to state that, when the initial estimator $\tilde\theta_T$ is root-$T$ consistent, the two estimators $\hat\theta_T$ and $\tilde\theta_T^P$ are first order asymptotically equivalent. This conclusion is indeed warranted insofar as we pick a penalty rate $\alpha_T$ going to infinity slower than $\sqrt{T}$. However, the choice of the tuning parameter is more constrained if one wants to use an initial consistent estimator $\tilde\theta_T$ converging slower than $\sqrt{T}$. Exactly as in the case of Robinson (1988), the conclusion of asymptotic equivalence between $\hat\theta_T$ and $\tilde\theta_T^P$ requires in any case an initial estimator $\tilde\theta_T$ converging faster than $T^{1/4}$.
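The rate requirements just stated can be traced in a short worked calculation. This is only a sketch: the exponent $\delta$ is notation introduced here for illustration, with $\|\tilde\theta_T-\theta^0\| = O_P(T^{-\delta})$ for some $0<\delta\le 1/2$, and $\hat\theta_T$ is assumed root-$T$ consistent.

```latex
\begin{align*}
\|\hat\theta_T-\tilde\theta_T\|
  &\le \|\hat\theta_T-\theta^0\| + \|\tilde\theta_T-\theta^0\|
   = O_P(T^{-1/2}) + O_P(T^{-\delta}) = O_P(T^{-\delta}), \\
\sqrt{T}\,\|\hat\theta_T-\tilde\theta_T^P\|
  &= O_P\!\big(\sqrt{T}\,\alpha_T\,\|\hat\theta_T-\tilde\theta_T\|^2\big)
   = O_P\!\big(\alpha_T\,T^{1/2-2\delta}\big).
\end{align*}
```

Under these assumptions, first order equivalence holds as soon as $\alpha_T = o(T^{2\delta-1/2})$. This is compatible with $\alpha_T\to\infty$ only if $\delta>1/4$, and for $\delta=1/2$ it reduces to $\alpha_T = o(\sqrt{T})$.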
But, on top of that, if the rate of convergence of $\tilde\theta_T$ is, say, $T^{(1/4)+\varepsilon}$, $\varepsilon>0$ (resp. $T^{1/4}\log(T)$), the desired asymptotic equivalence will be warranted only for a slowly diverging penalty rate $\alpha_T$ like $T^{\varepsilon}$ (resp. $\log[\log(T)]$). It is worth noting a tight similarity between the choice of this tuning parameter $\alpha_T$ and the choice of the number $k(T)$ of iterations in iterative procedures like generalized backfitting in Pastorello et al. (2003) or MBP in Fan et al. (2015). As can be seen, for instance, at the bottom of page 465 in Pastorello et al. (2003), $k(T)$ must go to infinity faster than $\log(T)$ and, in finite samples, the size of the needed $k(T)$ is inversely related to the strength of the contraction in the contraction mapping argument that drives convergence of the iterations.

3.2 A couple of two-step efficient estimators

Our two-step efficient estimator is a direct extension of Robinson (1988), replacing the complete linearization by a partial one, and is defined as a zero $\tilde\theta_T^P$ of the estimating equations

$$\tilde h_T^P(\theta) = q_T[\theta,\nu(\tilde\theta_T)] + \frac{\partial q_T}{\partial\nu'}[\theta,\nu(\tilde\theta_T)]\,\frac{\partial\nu}{\partial\theta'}(\tilde\theta_T)\,(\theta-\tilde\theta_T) + \alpha_T\,\big\|\theta-\tilde\theta_T\big\|^2 e_p.$$

These estimating equations provide an efficient estimator, in contrast with the naive two-step strategy that would only solve the equations

$$q_T[\theta,\nu(\tilde\theta_T)] = 0,$$

possibly with iteration on these equations. Up to the penalty term (only used to enforce consistency), the difference between $q_T[\theta,\nu(\tilde\theta_T)]$ and $\tilde h_T^P(\theta)$ is the introduction of the first-order correction through partial linearization. However, there is an obvious way to make the estimating equations $\tilde h_T^P(\theta)$ even more computationally friendly by making the correction term linear in the unknown parameters $\theta$; that is, by solving instead the following estimating equations:

$$h_T^{(1)}(\theta) = q_T[\theta,\nu(\tilde\theta_T)] + \frac{\partial q_T}{\partial\nu'}[\tilde\theta_T,\nu(\tilde\theta_T)]\,\frac{\partial\nu}{\partial\theta'}(\tilde\theta_T)\,(\theta-\tilde\theta_T) + \alpha_T\,\big\|\theta-\tilde\theta_T\big\|^2 e_p. \tag{17}$$

The difference between $\tilde h_T^P(\theta)$ and $h_T^{(1)}(\theta)$ is that we have also plugged in the first-step consistent estimator $\tilde\theta_T$ to replace the non-awkward occurrence of the parameters $\theta$ in the complete Jacobian matrix. The great thing with this simplifying modification is that it does not impair the general equivalence result of Proposition 3.1.

Theorem 3.1: Under standard regularity conditions detailed in the appendix and under Assumptions B1 and B2, if $\tilde\theta_T$ is a consistent estimator of $\theta^0$ such that $\|\tilde\theta_T-\theta^0\| = o_P(1/\alpha_T)$ with $\lim_{T=\infty}\alpha_T=\infty$, then for any zero $\theta_T^{(1)}$ of $h_T^{(1)}(\theta)$ in (17)

$$\big\|\hat\theta_T-\theta_T^{(1)}\big\| = O_P\Big(\alpha_T\,\big\|\hat\theta_T-\tilde\theta_T\big\|^2\Big).$$

Theorem 3.1 implies that the previous discussion about the asymptotic efficiency of $\tilde\theta_T^P$, deduced from Proposition 3.1, applies similarly to efficiency of $\theta_T^{(1)}$.

While $\theta_T^{(1)}$ is obviously the most computationally friendly two-step estimator when we have at our disposal a first-step consistent estimator $\tilde\theta_T$, it may be a shame to require the use of such an estimator when, after all, our only trouble is to properly deal with the awkward parameter occurrences $\nu(\theta)$. We now propose an alternative two-step efficient estimator $\theta_T^{(2)}$ that only requires knowledge of a first-step consistent estimator $\tilde\nu_T$ of $\nu(\theta^0)$, and not the knowledge of a consistent estimator $\tilde\theta_T$ of the complete parameter vector $\theta^0$. In applications, it may typically be the case that only a sub-vector of the parameters of interest $\theta$ can be consistently estimated in a first step. Of course, the price to pay for this additional extension of Robinson (1988) will be to give up the computational simplification brought by the change from the estimating equations $\tilde h_T^P(\theta)$ to $h_T^{(1)}(\theta)$ (change from Proposition 3.1 to Theorem 3.1). By definition, if we do not have a first-step estimator $\tilde\theta_T$, we cannot plug it in to simplify the equations.
However, we will be able to derive an alternative two-step efficient estimator through the estimating equations defined by

$$h_T^{(2)}(\theta) = q_T[\theta,\tilde\nu_T] + \frac{\partial q_T}{\partial\nu'}[\theta,\tilde\nu_T]\,\big(\nu(\theta)-\tilde\nu_T\big) + \alpha_T\,\big\|\nu(\theta)-\tilde\nu_T\big\|^2 e_p. \tag{18}$$

Theorem 3.2: Under standard regularity conditions detailed in the appendix and under Assumptions B1 and B2, if $\tilde\nu_T$ is a consistent estimator of $\nu(\theta^0)$ such that $\|\tilde\nu_T-\nu(\theta^0)\| = o_P(1/\alpha_T)$ with $\lim_{T=\infty}\alpha_T=\infty$, then for any zero $\theta_T^{(2)}$ of $h_T^{(2)}(\theta)$ in (18)

$$\big\|\hat\theta_T-\theta_T^{(2)}\big\| = O_P\Big(\alpha_T\,\big\|\nu(\hat\theta_T)-\tilde\nu_T\big\|^2\Big).$$

Theorem 3.2 implies that the previous discussion about the asymptotic efficiency of $\tilde\theta_T^P$ and $\theta_T^{(1)}$, deduced from Proposition 3.1 and Theorem 3.1, respectively, applies similarly to efficiency of $\theta_T^{(2)}$. The main difference is that the leading rate of convergence is now the one of the estimator $\tilde\nu_T$. It is also worth noting that the idea of the proof of Theorem 3.2 can be applied even when plugging in a first-step consistent estimator $\tilde\theta_T$ to replace part or all of the components of the first occurrence of $\theta$ in the Jacobian term. In particular, the two simplifying ideas of Theorems 3.1 and 3.2 can be used simultaneously.

3.3 Practical implications

In this subsection, we set the focus on the simplest case where the benchmark efficient estimator $\hat\theta_T$ and the initial estimators $\tilde\theta_T$ or $\tilde\nu_T$ are all root-$T$ consistent. When $\hat\theta_T$ is defined as the solution of

$$q_T[\hat\theta_T,\nu(\hat\theta_T)] = 0,$$

we propose two more user-friendly estimators, both associated with a sequence $\alpha_T$ of tuning parameters.

First, $\theta_T^{(1)}$ defined as solution of:

$$q_T[\theta_T^{(1)},\nu(\tilde\theta_T)] + \frac{\partial q_T}{\partial\nu'}[\tilde\theta_T,\nu(\tilde\theta_T)]\,\frac{\partial\nu}{\partial\theta'}(\tilde\theta_T)\,(\theta_T^{(1)}-\tilde\theta_T) + \alpha_T\,\big\|\theta_T^{(1)}-\tilde\theta_T\big\|^2 e_p = 0. \tag{19}$$

Second, $\theta_T^{(2)}$ defined as solution of:

$$q_T[\theta_T^{(2)},\tilde\nu_T] + \frac{\partial q_T}{\partial\nu'}[\theta_T^{(2)},\tilde\nu_T]\,\big(\nu(\theta_T^{(2)})-\tilde\nu_T\big) + \alpha_T\,\big\|\nu(\theta_T^{(2)})-\tilde\nu_T\big\|^2 e_p = 0. \tag{20}$$

Recall that several variants are possible, depending upon what part of the first-step estimator is used for the penalty term and/or for computing the derivative $\partial q_T/\partial\nu'$.
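The mechanics of (20) can be illustrated on a stylized scalar problem. The sketch below is not taken from the paper: the estimating function `q(theta, nu) = y_bar - theta - exp(nu)` with `nu(theta) = theta`, and the names `full_root` and `penalized_two_step`, are hypothetical choices made only so that the penalized, partially linearized equation has a closed-form root. The exponential term plays the role of the awkward occurrence.

```python
import math


def full_root(y_bar, lo=-50.0, hi=50.0, tol=1e-12):
    """Benchmark estimator: solve y_bar - theta - exp(theta) = 0 by bisection.

    The function is strictly decreasing in theta, so the root is unique.
    """
    def f(t):
        return y_bar - t - math.exp(t)

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def penalized_two_step(y_bar, nu_tilde, alpha):
    """Root closest to nu_tilde of the toy analogue of equation (20):

        h(theta) = q(theta, nu_tilde)
                   + dq/dnu(theta, nu_tilde) * (theta - nu_tilde)
                   + alpha * (theta - nu_tilde) ** 2,

    with q(theta, nu) = y_bar - theta - exp(nu) (hypothetical toy model).
    Writing d = theta - nu_tilde, h reduces to alpha*d**2 - b*d + f0.
    """
    f0 = y_bar - nu_tilde - math.exp(nu_tilde)  # q at theta = nu_tilde
    b = 1.0 + math.exp(nu_tilde)                # minus the slope of the linear part
    disc = b * b - 4.0 * alpha * f0
    if disc < 0:
        raise ValueError("no real root for this penalty and starting value")
    d = 2.0 * f0 / (b + math.sqrt(disc))        # numerically stable small root
    return nu_tilde + d
```

In this toy setting the error of the second-step root is of the order of the squared first-step error, inflated by the penalty, which mirrors the bound in Theorem 3.2. With `alpha = 0` the second step coincides with one Newton step from the first-step value, because the non-awkward part of this particular `q` is linear in `theta`.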
For the practical choice of the tuning parameter sequence $\alpha_T$, the two golden rules are as follows. First, for the sake of asymptotic efficiency, $\alpha_T$ must go to infinity strictly slower than $\sqrt{T}$. Second, the fact that $\alpha_T$ goes to infinity is only useful to ensure consistency (see Step 1 in the proof of Proposition 3.1).

In many circumstances, consistency will be warranted even without the penalty, that is, with $\alpha_T=0$. This, in particular, paves the way for many efficient two-step extremum estimators as exemplified in sections 2.2 and 2.3. Generally speaking, when consistency is not an issue, Theorem 2.1 states asymptotic efficiency of two-step extremum estimators $\hat\theta_T^{ext}$ computed as solutions of

$$\hat\theta_T^{ext} = \arg\max_{\theta}\left\{ Q_T[\theta,\tilde\nu_T] + \frac{\partial Q_T[\theta,\tilde\nu_T]}{\partial\nu'}\,[\nu(\theta)-\tilde\nu_T] + \frac{1}{2}\,[\nu(\theta)-\tilde\nu_T]'\,J_T(\theta)\,[\nu(\theta)-\tilde\nu_T]\right\}. \tag{21}$$

Moreover, the proof of Theorem 2.1 shows that the dependence on $\theta$ of the weighting matrix $J_T(\theta)$ can be overlooked in computing the first order conditions and then, up to the penalty term, the first order conditions for (21) are very similar to (19). Sections 2.2 and 2.3 display user friendly closed form formulas for the weighting matrix $J_T(\theta)$ that do not involve any second derivatives. However, it is important to keep in mind that consistency is not always warranted, and then the only solution is the introduction of the penalty term in the first order conditions, leading to (19) or (20).

4 Additive decomposition of Extremum Criterion

4.1 Efficient two-step Estimation via Margin Targeting

There exist many interesting situations in economics and finance where the extremum criterion takes the additively separable form

$$Q_T[\theta,\nu(\theta)] = Q_{1T}[\theta_1] + Q_{2T}[\theta_2,\nu(\theta)], \tag{22}$$

where $\theta=(\theta_1',\theta_2')'$, $\nu(\theta)=\theta_1\in\mathbb{R}^{p_1}$, $\theta_2\in\mathbb{R}^{p_2}$ and $p_1+p_2=p$. This particular structure for $Q_T[\theta,\nu(\theta)]$ includes many nonlinear time series models, such as the Dynamic Conditional Correlations (DCC-GARCH) model of Engle (2002), the rotated ARCH model of Noureldin et al. (2014), and many copula models.
In these multivariate models $\theta_1$ generally represents the parameters that govern the marginal distributions and $\theta_2$ represents the parameters that govern the dependence between the different components. In this framework, $\nu(\theta)=\theta_1$ represents the additional occurrences of $\theta_1$ that show up in the dependence structure and complicate estimation of $\theta$.

In this setting, a common way of estimating $\theta=(\theta_1',\theta_2')'$ is the so-called inference from the margins, where a root-$T$ consistent estimator $\tilde\theta_T$ is obtained by first maximizing $Q_{1T}[\theta_1]$ to obtain $\tilde\theta_{1T}$, which is equivalent to solving the estimating equations

$$\frac{\partial Q_{1T}[\tilde\theta_{1T}]}{\partial\theta_1} = 0. \tag{23}$$

$\tilde\theta_{1T}$ then replaces the unknown $\theta_1$ in $Q_{2T}[\theta_2,\theta_1]$, and $Q_{2T}[\theta_2,\tilde\theta_{1T}]$ is maximized to obtain $\tilde\theta_{2T}$, which is equivalent to solving

$$\frac{\partial Q_{2T}[\tilde\theta_{2T},\tilde\theta_{1T}]}{\partial\theta_2} = 0. \tag{24}$$

If (23) and (24) are unbiased estimating equations for $\theta^0$, in the sense that

$$\lim_{T=\infty}\frac{\partial Q_{1T}[\theta_1]}{\partial\theta_1} = 0 \iff \theta_1=\theta_1^0, \qquad \lim_{T=\infty}\frac{\partial Q_{2T}[\theta_2,\theta_1^0]}{\partial\theta_2} = 0 \iff \theta_2=\theta_2^0,$$

then $\tilde\theta_T=(\tilde\theta_{1T}',\tilde\theta_{2T}')'$ is generally a root-$T$ consistent estimator of $\theta^0$.

While computationally simple, the estimator $\tilde\theta_T$ is inefficient, which is seen by noting that the efficient estimator $\hat\theta_T$, the maximizer of $Q_T[\theta,\nu(\theta)]$, solves the estimating equations

$$q_T[\theta,\nu(\theta)] = \begin{bmatrix} q_{1T}[\theta,\nu(\theta)] \\ q_{2T}[\theta,\nu(\theta)] \end{bmatrix},$$

where

$$q_{1T}[\theta,\nu(\theta)] = \frac{\partial Q_{1T}[\theta_1]}{\partial\theta_1} + \frac{\partial Q_{2T}[\theta_2,\nu(\theta)]}{\partial\nu}, \qquad q_{2T}[\theta,\nu(\theta)] = \frac{\partial Q_{2T}[\theta_2,\nu(\theta)]}{\partial\theta_2}.$$

Computationally simple and efficient estimators can be obtained in this setting using the two-step estimators $\theta_T^{(1)}$ and $\theta_T^{(2)}$, defined in Section 3.2 as the solutions to the estimating equations $0=h_T^{(1)}(\theta)$ and $0=h_T^{(2)}(\theta)$ (with $h_T^{(1)}(\theta)$ and $h_T^{(2)}(\theta)$ given in (19) and (20) respectively), and the first-step estimator $\tilde\theta_T$ defined by estimating equations (23) and (24).

Obtaining $\theta_T^{(1)}$ and $\theta_T^{(2)}$ when $Q_T[\theta,\nu(\theta)]$ is additively separable following (22) then requires specializing the definitions of $h_T^{(1)}(\theta)$ and $h_T^{(2)}(\theta)$.
To this end, for

$$h_T^{(1)}(\theta) = \begin{bmatrix} h_{1T}^{(1)}(\theta) \\ h_{2T}^{(1)}(\theta) \end{bmatrix}, \qquad h_T^{(2)}(\theta) = \begin{bmatrix} h_{1T}^{(2)}(\theta) \\ h_{2T}^{(2)}(\theta) \end{bmatrix},$$

we have that $\theta_T^{(1)}$, defined as the solution to $0=h_T^{(1)}(\theta)$, solves

$$0 = h_{1T}^{(1)}(\theta_T^{(1)}) = \frac{\partial Q_{1T}[\theta_{1T}^{(1)}]}{\partial\theta_1} + \frac{\partial Q_{2T}[\theta_{2T}^{(1)},\tilde\theta_{1T}]}{\partial\nu} + \frac{\partial^2 Q_{2T}[\tilde\theta_{2T},\tilde\theta_{1T}]}{\partial\nu\,\partial\nu'}\,(\theta_{1T}^{(1)}-\tilde\theta_{1T}) + \alpha_T\,\big\|\theta_T^{(1)}-\tilde\theta_T\big\|^2 e_{p_1}, \tag{25}$$

$$0 = h_{2T}^{(1)}(\theta_T^{(1)}) = \frac{\partial Q_{2T}[\theta_{2T}^{(1)},\tilde\theta_{1T}]}{\partial\theta_2} + \frac{\partial^2 Q_{2T}[\tilde\theta_{2T},\tilde\theta_{1T}]}{\partial\theta_2\,\partial\nu'}\,(\theta_{1T}^{(1)}-\tilde\theta_{1T}) + \alpha_T\,\big\|\theta_T^{(1)}-\tilde\theta_T\big\|^2 e_{p_2}, \tag{26}$$

and $\theta_T^{(2)}$, defined as the solution to $0=h_T^{(2)}(\theta)$, solves

$$0 = h_{1T}^{(2)}(\theta_T^{(2)}) = \frac{\partial Q_{1T}[\theta_{1T}^{(2)}]}{\partial\theta_1} + \frac{\partial Q_{2T}[\theta_{2T}^{(2)},\tilde\theta_{1T}]}{\partial\nu} + \frac{\partial^2 Q_{2T}[\theta_{2T}^{(2)},\tilde\theta_{1T}]}{\partial\nu\,\partial\nu'}\,(\theta_{1T}^{(2)}-\tilde\theta_{1T}) + \alpha_T\,\big\|\theta_{1T}^{(2)}-\tilde\theta_{1T}\big\|^2 e_{p_1}, \tag{27}$$

$$0 = h_{2T}^{(2)}(\theta_T^{(2)}) = \frac{\partial Q_{2T}[\theta_{2T}^{(2)},\tilde\theta_{1T}]}{\partial\theta_2} + \frac{\partial^2 Q_{2T}[\theta_{2T}^{(2)},\tilde\theta_{1T}]}{\partial\theta_2\,\partial\nu'}\,(\theta_{1T}^{(2)}-\tilde\theta_{1T}) + \alpha_T\,\big\|\theta_{1T}^{(2)}-\tilde\theta_{1T}\big\|^2 e_{p_2}, \tag{28}$$

for some sequence $\alpha_T$ going to infinity slower than $\sqrt{T}$.

Obviously, solving (25) and (26) (respectively, (27) and (28)) to obtain $\theta_T^{(1)}$ (respectively, $\theta_T^{(2)}$) is more computationally involved than computing the estimator $\tilde\theta_T$. However, both $\theta_T^{(1)}$ and $\theta_T^{(2)}$ share with $\tilde\theta_T$ the convenient feature that the cumbersome occurrence of $\theta_1$ in $Q_{2T}[\theta_2,\theta_1]$ never shows up as an unknown parameter in the estimating equations, which makes our two-step efficient estimator computationally friendly in comparison with the brute force efficient estimator $\hat\theta_T$.

This simplification of the estimating equations is also shared by the MBP estimator proposed in SFK. When $Q_T[\theta,\nu(\theta)]$ is additively separable following (22), the MBP algorithm takes as its starting value $\tilde\theta_T$ and defines a sequence of iterative estimators $\hat\theta_T^{(k)}$, $k>1$, by solving

$$0 = \frac{\partial Q_{1T}[\hat\theta_{1T}^{(k+1)}]}{\partial\theta_1} + \frac{\partial Q_{2T}[\hat\theta_{2T}^{(k)},\hat\theta_{1T}^{(k)}]}{\partial\theta_1},$$

$$0 = \frac{\partial Q_{2T}[\hat\theta_{2T}^{(k+1)},\hat\theta_{1T}^{(k)}]}{\partial\theta_2}.$$

While each iteration of the MBP procedure is computationally simpler than the second step of the penalized two-step estimators, the price to pay for this simplicity is two-fold: one, to achieve efficiency we require $k\to\infty$, possibly according to a tuning parameter $k=k(T)$, and two, convergence of the MBP iterations requires a local contraction mapping condition, often called an information dominance condition.

If the information dominance condition is nearly unsatisfied, the MBP iterations converge very slowly, and if this condition is not satisfied $\hat\theta_T^{(k)}$ does not converge. To deal with such situations FPR propose a modification of the MBP estimator in SFK that regains a portion of the information associated with the occurrence of $\theta_2$ in $Q_{2T}[\theta_2,\theta_1]$ neglected by the original MBP scheme. Consequently, FPR define this alternative MBP estimator $\tilde\theta_T^{(k)}$ as the solution to the following estimating equations,

$$0 = \frac{\partial Q_{1T}[\tilde\theta_{1T}^{(k+1)}]}{\partial\theta_1} + \frac{\partial Q_{2T}[\tilde\theta_{2T}^{(k+1)},\tilde\theta_{1T}^{(k)}]}{\partial\theta_1}, \tag{29}$$

$$0 = \frac{\partial Q_{2T}[\tilde\theta_{2T}^{(k+1)},\tilde\theta_{1T}^{(k)}]}{\partial\theta_2}. \tag{30}$$

Note that this estimator is nothing but the MBP estimator conformable to the general definition (4).

It is straightforward to compare the computational burden associated with the MBP estimator in (29), (30) and the two-step penalized estimator $\theta_T^{(1)}$ (dubbed P-TS1), as well as the additional two-step estimator $\theta_T^{(1)}(P)$ (dubbed TS1) that arises from neglecting the penalty terms; i.e., the TS1 estimator $\theta_T^{(1)}(P)$ solves the estimating equations (25), (26), but with $\alpha_T=0$.¹ Firstly, comparing the MBP estimator and TS1 (respectively, P-TS1), the only difference between the two estimators is that TS1 (respectively, P-TS1) entails some minor computational burden associated with the introduction of a linear function of $\theta_{1T}^{(1)}$ (this statement holds up to the penalty term for P-TS1). This tiny additional complexity is the price to pay to get efficiency in two steps instead of fishing for the limit of an iterative procedure, which, as stated above, may require many iterations depending on the strength of the local contraction mapping.
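The link between the strength of the contraction and the number of iterations can be seen in a deliberately stylized sketch. This is not the MBP algorithm: it iterates a hypothetical scalar linear map with modulus `c` (the names `iterations_until_convergence`, `c`, `x_star` are assumptions made for illustration) and counts iterations until successive iterates are close.

```python
def iterations_until_convergence(c, x_star=1.0, x0=0.0, tol=1e-5, max_iter=1000):
    """Iterate x <- x_star + c * (x - x_star) until successive iterates differ
    by less than tol.

    |c| < 1 is a contraction with fixed point x_star; |c| > 1 is not.
    Returns the number of iterations used, or None if max_iter is reached.
    """
    x = x0
    for k in range(1, max_iter + 1):
        x_new = x_star + c * (x - x_star)
        if abs(x_new - x) < tol:
            return k
        x = x_new
    return None


# Example usage: a strong contraction, a weak one, and a violated condition.
for modulus in (0.5, 0.95, 1.05):
    print(modulus, iterations_until_convergence(modulus))
```

Under these assumptions, the iteration count grows as the modulus approaches one, and the loop never terminates by the tolerance rule once the modulus exceeds one, which is the stylized counterpart of a violated information dominance condition.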
However, when the local contraction mapping is strong, the MBP procedure of SFK is the simplest from a computational standpoint. As the required contraction mapping condition becomes weaker, the MBP estimator becomes more computationally burdensome.² In contrast, the two-step procedures discussed herein do not require a contraction mapping condition and can therefore yield consistent and efficient estimators in situations where this condition is violated.

In comparison with the aforementioned estimators, the penalized two-step estimator $\theta_T^{(2)}$ (dubbed P-TS2), and the corresponding version $\theta_T^{(2)}(P)$ (dubbed TS2) that neglects the penalty function, incur additional computational complexity because $\theta_2$ occurs within the partial Hessian term in the estimating equations. However, in this setting, the P-TS2 (and TS2) estimator is unique in that it only requires a consistent first-step estimator for $\theta_1^0$, and not for $\theta_2^0$.

In the framework of estimation from the margins, this advantageous property of TS2 (and P-TS2) can be interpreted as follows. In many multivariate models, $\theta_1$ can simply be estimated from the margins and its estimation is numerically stable. In contrast, estimation of the dependence parameters $\theta_2$ is often tricky and numerically unstable. Indeed, this is a primary reason why (unconditional) variance targeting, as initially proposed by Engle and Mezrich (1996), became popular in the estimation of multivariate GARCH models, with similar reasoning leading researchers to contemplate correlation targeting in estimation of GARCH-DCC models. From a targeting standpoint, the P-TS2 (TS2) estimator first obtains a simple estimate $\tilde\theta_{1T}$ of $\theta_1^0$ from the margins, then uses $\tilde\theta_{1T}$ via a "margin targeting" procedure whereby the second step of the estimation procedure is stabilized by targeting the consistent marginal parameter estimates. In contrast to (unconditional) variance targeting, P-TS2 (and TS2) does not incur an efficiency loss associated with margin targeting. More importantly, P-TS2 (and TS2) need not maintain the problematic assumption in unconditional variance targeting on the existence of higher order unconditional moments, which is required in order for variance targeting to yield an asymptotically normal estimator of the unconditional variance.

¹ Note that, from Proposition 3.1 and Theorems 3.1 and 3.2, when consistent, the two-step estimators that disregard the penalty term will also be asymptotically efficient.
² This statement also holds for the MBP estimator proposed in FPR.

4.2 Bivariate Gaussian Copula Models

In the following subsection, we illustrate the above comparison between the different estimation procedures using a Gaussian copula model. The bivariate Gaussian copula model has been extensively studied in statistics and economics (see, e.g., Joe (1997) and Song (2000), among others) and is often used in empirical analysis.

Assume our goal is to estimate the parameters governing the distribution of $y_i=(y_{i,1},y_{i,2})'$. Denoting the marginal distribution of $y_{i,j}$ as $F_j(\cdot;\alpha_j)$, where $\alpha_j$ is a vector of unknown parameters, the joint distribution can be constructed using a copula function $C(u_1,u_2;\rho)$, where $\rho$ denotes the copula dependence parameter. In what follows, we assume $y_i=(y_{i,1},y_{i,2})'$ follows a bivariate Gaussian copula with cumulative distribution function (CDF)

$$C(F_1(y_{i,1};\alpha_1),F_2(y_{i,2};\alpha_2);\rho) = \Phi_\rho\big(\Phi^{-1}(F_1(y_{i,1};\alpha_1)),\ \Phi^{-1}(F_2(y_{i,2};\alpha_2))\big), \tag{31}$$

where $\Phi_\rho(\cdot)$ is the bivariate Gaussian cumulative distribution function with correlation parameter $\rho$ and $\Phi(\cdot)$ is the standard normal CDF. Denote by $c(F_1(y_{i,1};\alpha_1),F_2(y_{i,2};\alpha_2);\rho)$ the copula density derived from equation (31). For $(u_1,u_2)'\in(0,1)^2$, Song (2000) demonstrates that the density of the bivariate Gaussian copula is

$$c(u_1,u_2;\rho) = \frac{1}{\sqrt{1-\rho^2}}\exp\left(-\frac{\rho^2(z_1^2+z_2^2)-2\rho\,(z_1\cdot z_2)}{2(1-\rho^2)}\right),$$

where $z_j=\Phi^{-1}(u_j)$ for $j=1,2$.
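The density formula above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names are hypothetical. It evaluates the closed form and, as a cross-check, the ratio of the bivariate normal density to the product of the standard normal densities, which defines the same quantity.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and standard deviation 1


def gaussian_copula_density(u1, u2, rho):
    """Closed-form bivariate Gaussian copula density for (u1, u2) in (0,1)^2."""
    z1 = _STD_NORMAL.inv_cdf(u1)
    z2 = _STD_NORMAL.inv_cdf(u2)
    quad = rho ** 2 * (z1 ** 2 + z2 ** 2) - 2.0 * rho * z1 * z2
    return math.exp(-quad / (2.0 * (1.0 - rho ** 2))) / math.sqrt(1.0 - rho ** 2)


def density_ratio(u1, u2, rho):
    """Same density computed as (bivariate normal pdf) / (product of marginals)."""
    z1 = _STD_NORMAL.inv_cdf(u1)
    z2 = _STD_NORMAL.inv_cdf(u2)
    joint = math.exp(
        -(z1 ** 2 - 2.0 * rho * z1 * z2 + z2 ** 2) / (2.0 * (1.0 - rho ** 2))
    ) / (2.0 * math.pi * math.sqrt(1.0 - rho ** 2))
    return joint / (_STD_NORMAL.pdf(z1) * _STD_NORMAL.pdf(z2))
```

With `rho = 0` the density is identically one, as expected under independence.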
Let $f_j(y_{i,j};\alpha_j)$ denote the marginal density of $y_{i,j}$ and define $\theta_1=(\alpha_1',\alpha_2')'$, $\theta_2=\rho$, with $\theta=(\theta_1',\theta_2)'$. Inference for $\theta$ in the bivariate Gaussian copula model can be carried out using maximum likelihood, with corresponding log-likelihood function

$$Q_T[\theta,\nu(\theta)] = \sum_{i=1}^{T}\sum_{j=1}^{2}\log\big(f_j(y_{i,j};\alpha_j)\big) - \frac{T}{2}\log(1-\rho^2) - \frac{\rho}{2(1-\rho^2)}\big(\rho A(\theta_1)-2B(\theta_1)\big). \tag{32}$$

Herein, $A(\theta_1)=\sum_{i=1}^{T}\big[z_{i,1}(\alpha_1)^2+z_{i,2}(\alpha_2)^2\big]$, $B(\theta_1)=\sum_{i=1}^{T}z_{i,1}(\alpha_1)\,z_{i,2}(\alpha_2)$, and $z_{i,j}(\alpha_j)=\Phi^{-1}(F_j(y_{i,j};\alpha_j))$ for $j=1,2$. The likelihood in (32) is separable and we denote the two pieces

$$Q_{1T}[\theta_1] = \sum_{i=1}^{T}\sum_{j=1}^{2}\log\big(f_j(y_{i,j};\alpha_j)\big), \quad\text{and}\quad Q_{2T}[\theta_2,\nu(\theta)] = -\frac{T}{2}\log(1-\rho^2) - \frac{\rho}{2(1-\rho^2)}\big(\rho A(\theta_1)-2B(\theta_1)\big),$$

where, again, $\nu(\theta)=\theta_1$.

4.2.1 Estimators of $\theta$

Depending on the specification of the marginals $f_j(\cdot;\alpha_j)$, maximizing $Q_T[\theta,\nu(\theta)]$ to obtain the Maximum Likelihood estimator (MLE) $\hat\theta_T$ can be difficult. In these cases a simple two-step estimation approach, the so-called inference from margins (IFM) approach, is often used to estimate $\theta$ (see, e.g., Shih and Louis (1995), Joe (1997) and Patton (2009) for examples and discussion). The IFM approach first maximizes $Q_{1T}[\theta_1]=\sum_{i=1}^{T}\sum_{j=1}^{2}\log(f_j(y_{i,j};\alpha_j))$ to obtain $\tilde\theta_{1T}=(\tilde\alpha_{1T}',\tilde\alpha_{2T}')'$, defined as the solution to

$$0 = \frac{\partial Q_{1T}[\theta_1]}{\partial\theta_1} = \begin{pmatrix} \sum_{i=1}^{T}\dfrac{1}{f_1(y_{i,1};\alpha_1)}\dfrac{\partial f_1(y_{i,1};\alpha_1)}{\partial\alpha_1} \\[2ex] \sum_{i=1}^{T}\dfrac{1}{f_2(y_{i,2};\alpha_2)}\dfrac{\partial f_2(y_{i,2};\alpha_2)}{\partial\alpha_2} \end{pmatrix}.$$

Next, the unknown $\theta_1$ in $Q_{2T}[\theta_2,\theta_1]$ is replaced with $\tilde\theta_{1T}$ and

$$Q_{2T}[\theta_2,\tilde\theta_{1T}] = -\frac{T}{2}\log(1-\rho^2) - \frac{\rho}{2(1-\rho^2)}\big(\rho A(\tilde\theta_{1T})-2B(\tilde\theta_{1T})\big)$$

is maximized to obtain $\tilde\theta_{2T}=\tilde\rho_T$, defined as the solution to

$$0 = \frac{\partial Q_{2T}[\tilde\theta_{2T},\tilde\theta_{1T}]}{\partial\theta_2} = \frac{T\rho}{1-\rho^2} - \frac{1}{(1-\rho^2)^2}\big(\rho A(\tilde\theta_{1T})-(1+\rho^2)B(\tilde\theta_{1T})\big).$$

It is clear from this decomposition that the IFM estimator disregards the information about $\theta_1$ contained in

$$\frac{\partial Q_{2T}[\theta_2,\theta_1]}{\partial\theta_1} = -\frac{\rho}{1-\rho^2}\begin{pmatrix} \dfrac{\rho}{2}\dfrac{\partial A(\theta_1)}{\partial\alpha_1}-\dfrac{\partial B(\theta_1)}{\partial\alpha_1} \\[2ex] \dfrac{\rho}{2}\dfrac{\partial A(\theta_1)}{\partial\alpha_2}-\dfrac{\partial B(\theta_1)}{\partial\alpha_2} \end{pmatrix}.$$

From the above definitions, we see that the efficient MBP and penalized two-step estimators obtain efficiency by adding back, in differing combinations, terms associated with $\partial Q_{2T}[\theta_2,\theta_1]/\partial\theta_1$. MBP accomplishes this task by adding back $\partial Q_{2T}[\theta_2,\theta_1]/\partial\theta_1$ to the estimating equations for $\theta_1$ and iterating over the cumbersome occurrences of $\theta_1$ (and $\theta_2$, depending on the precise MBP method). On the other hand, the penalized two-step estimator $\theta_T^{(1)}$ (previously dubbed P-TS1) linearizes $\partial Q_{2T}[\theta_2,\theta_1]/\partial\theta_1$, with respect to the cumbersome occurrence of $\theta_1$, around the consistent estimator $\tilde\theta_{1T}$, and targets the second-step estimators using the initially consistent $\tilde\theta_T$. The penalized two-step estimator $\theta_T^{(2)}$ (previously dubbed P-TS2) is similar to P-TS1 but only penalizes the estimating equations with respect to the margins estimator $\tilde\theta_{1T}$. Both two-step approaches have the same asymptotic distribution, but can behave differently in finite samples.

In comparison with the two-step procedures, the critical regularity condition needed for the MBP estimator to be efficient is the satisfaction of a local contraction mapping condition, also termed the information dominance condition. However, in the bivariate Gaussian copula model, simulation evidence in SFK and Liu and Luger (2009) demonstrates that the MBP approach can behave poorly if there is even moderate correlation. Intuitively, this phenomenon is present because, as $\rho$ increases, the portions of the estimating equations that MBP iterates over become more informative for estimating the parameters. For $\rho$ large enough the MBP algorithm neglects too much information and yields an inconsistent estimator.
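The two IFM steps can be sketched in a few lines of standard-library Python. The sketch takes exponential marginals with rate parameters as an illustrative assumption, for which the first step has the closed form `1 / mean(y)`; the function names and the simulation helper are hypothetical. The second step finds a root of the score in $\rho$ by bisection after multiplying it by $(1-\rho^2)^2>0$, which leaves a cubic that is nonnegative at $\rho=-1$ and nonpositive at $\rho=1$.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def simulate(T, alpha1, alpha2, rho, seed=0):
    """Draw T pairs from a Gaussian copula with exponential marginals (rates alpha_j)."""
    rng = random.Random(seed)
    y1, y2 = [], []
    for _ in range(T):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        z1 = e1
        z2 = rho * e1 + math.sqrt(1.0 - rho ** 2) * e2
        # y = F^{-1}(Phi(z)) with F exponential: y = -log(1 - Phi(z)) / alpha,
        # and 1 - Phi(z) = Phi(-z) is used for numerical accuracy.
        y1.append(-math.log(_STD_NORMAL.cdf(-z1)) / alpha1)
        y2.append(-math.log(_STD_NORMAL.cdf(-z2)) / alpha2)
    return y1, y2


def normal_score(y, alpha):
    """z = Phi^{-1}(F(y; alpha)) for the exponential CDF, via the survival function."""
    p = math.exp(-alpha * y)                 # 1 - F(y; alpha)
    p = min(max(p, 1e-300), 1.0 - 1e-16)     # keep p strictly inside (0, 1)
    return -_STD_NORMAL.inv_cdf(p)


def ifm_estimate(y1, y2):
    """Inference from margins: marginal MLEs first, then the score equation in rho."""
    T = len(y1)
    a1, a2 = T / sum(y1), T / sum(y2)        # exponential rate MLEs
    z1 = [normal_score(y, a1) for y in y1]
    z2 = [normal_score(y, a2) for y in y2]
    A = sum(u * u + v * v for u, v in zip(z1, z2))
    B = sum(u * v for u, v in zip(z1, z2))

    def g(r):                                # score times (1 - r^2)^2
        return T * r * (1.0 - r * r) - r * A + (1.0 + r * r) * B

    lo, hi = -1.0, 1.0                       # g(-1) = A + 2B >= 0, g(1) = 2B - A <= 0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return a1, a2, 0.5 * (lo + hi)
```

Bisection returns one root of the cubic; in large samples the cubic is close to $(\rho^0-\rho)(1+\rho^2)$ up to scale, so a single root is expected, but this is not guaranteed in every finite sample.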
4.2.2 Example: Exponential Marginals

In this subsection we compare the finite sample properties of the MBP approach of SFK and four different efficient two-step procedures: the penalized two-step estimator P-TS1, the non-penalized counterpart to P-TS1 given by TS1, the partially penalized two-step estimator P-TS2, and the non-penalized counterpart to P-TS2 given by TS2. Data for the exercise are generated from the Gaussian copula in the situation where the marginal densities are exponential:

$$f_j(y_{i,j};\alpha_j) = \alpha_j\exp(-\alpha_j y_{i,j}), \quad \alpha_j>0,\ j=1,2.$$

In particular, the simulation study compares the effects of the correlation parameter and sample size on the various estimators.

For the simulation study we set $\alpha_1=.1$, $\alpha_2=1$ and consider three different values for the correlation parameter, $\rho=.75,\ .95,\ .985$. Across the three values of $\rho$ we consider three different sample sizes, $T=100,\ 200,\ 300$. For each $T$ and $\rho$ combination we create 1,000 synthetic samples. It is important to note that for $\rho$ greater than approximately .95 the information dominance condition associated with the proposed MBP procedure is no longer satisfied. Therefore, at high levels of correlation we expect the finite sample properties of the MBP estimator to be poor in comparison with the various two-stage estimators.

The estimators are compared in terms of their means, mean squared error (MSE) and mean absolute error (MAE), across the different sample sizes. We define convergence for the MBP algorithm as the maximum absolute difference across the parameters being less than $1.0\mathrm{e}{-05}$ for two or more successive iterations. Tables 1 to 3 report the averages over the 1,000 synthetic samples for the mean, MSE and MAE across the three correlation values $\rho=\{.75,\ .95,\ .985\}$. For the penalized two-step estimators the penalty term is taken proportional to $T^{1/4}$.

For low values of the correlation parameter the MBP algorithm and the efficient two-step estimators are very similar.
However, as the correlation parameter increases, the penalized two-step methods give smaller MSEs and MAEs than the MBP estimator and the non-penalized two-step estimator. With high correlation values and larger sample sizes the MBP algorithm encounters difficulty in estimation since the matrix driving the updates does not fulfill the information dominance condition. It is important to point out that the same behavior is not found in the two-stage and penalized two-stage estimates, which perform well even for $\rho=.985$.

The various combinations of penalized and non-penalized two-step estimators all deliver stable parameter estimates with good finite sample properties. However, the fully penalized estimator P-TS1 does seem to have a slight edge over the other estimators in terms of performance. The small impact of the penalty term in this situation is very easy to interpret: for copula models the IFM procedure often provides accurate starting values, and therefore the need to penalize is drastically reduced.³ In other words, in the copula case, the two-step procedure merely ensures efficiency and penalization seems not to be required.

³ Recall, the penalty term is needed to rule out any perverse solutions to the estimating equations, which can exist because of the partial linearization.

5 Efficient two-step estimation with Implied States

In this section we analyze situations where $\theta^0$ is determined by the law of motion governing a latent stochastic process of interest $\{Y_t^*: t\geq 1\}$. The latent state variables $Y_t^*$ are unobservable to the econometrician, but are related to observed data $Y_t$ through a function $h[\cdot,\nu^0]$, known up to the unknown parameters $\nu^0=\nu(\theta^0)$, according to the relationship $Y_t=h[Y_t^*,\nu^0]$. We are only interested in situations where $Y^*\mapsto h[Y^*,\nu]$ is one-to-one for any $\nu$, which implies that, if $\nu^0$ was known, $Y_t^*$ could be directly obtained by inverting $h[\cdot,\nu^0]$; i.e.,

$$Y_t = h[Y_t^*,\nu^0] \iff Y_t^* = g[Y_t,\nu^0]. \tag{33}$$

When $\nu(\theta^0)$ is unknown, equation (33) defines the implied state (variable) $Y_t^*(\theta)=g[Y_t,\nu(\theta)]$.

As has been noted by several authors, such as Renault and Touzi (1996) and Pastorello et al. (2003), the setup in (33) covers many interesting applications in economics and finance. However, estimation of $\theta^0$ is often complicated by the nature of the function $h[\cdot,\nu]$ and the difficulties encountered when transforming the estimation problem from one based on latent states $Y_t^*$ to one based on implied states $g[Y_t,\nu(\theta)]$.

In what follows, we demonstrate that the efficient penalized two-step estimator can often be used to obtain consistent and efficient estimators for $\theta^0$ in models with implied states. In particular, we focus on the use of implied states in GMM, so-called Implied States GMM, and in likelihood models with latent states. A comparison with existing estimation approaches in these settings is also given.

5.1 Implied States GMM

Pan (2002) uses the terminology Implied States GMM (IS-GMM) to describe GMM estimation in the context of option pricing models with latent variables. More specifically, the IS-GMM estimator of Pan (2002) uses observed option price data to back out, through an option pricing formula, the latent state variables driving the price process.

Formally, we are interested in analyzing a model with true parameter $\theta^0$, defined as the unique zero of a vector of moment conditions derived from the law of motion for $Y_t^*$:

$$E[\psi^*(Y_t^*,\theta)] = 0 \iff \theta=\theta^0. \tag{34}$$

Clearly, GMM estimation from (34) is not feasible since $Y_t^*$ is unobservable. Implementation of GMM in this setting can, however, be carried out by substituting the implied states, say, $g[Y_t,\theta]$, which are obtained by inverting $Y_t=h[Y_t^*,\theta]$ to get $Y_t^*=g[Y_t,\theta]$, into (34). The existence of the one-to-one relationship in (33) is common in many arbitrage-based asset pricing models.
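The inversion in (33) rarely has a closed form, but when $h[\cdot,\nu]$ is continuous and strictly increasing in the latent state it can be computed numerically. The sketch below is an illustration under that monotonicity assumption; the function `toy_pricing` is a hypothetical stand-in for a pricing formula and is not taken from the paper.

```python
import math


def implied_state(y, nu, h, lo, hi, tol=1e-10):
    """Solve y = h(y_star, nu) for y_star by bisection on [lo, hi].

    Assumes h is continuous and strictly increasing in its first argument,
    so that the implied state g[y, nu] is well defined on the bracket.
    """
    if not (h(lo, nu) <= y <= h(hi, nu)):
        raise ValueError("observed value is outside the bracketed range of h")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h(mid, nu) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def toy_pricing(y_star, nu):
    """Hypothetical monotone pricing map, used only for illustration."""
    return nu * math.exp(y_star) + y_star


# Example usage: the implied state depends on the value of nu that is plugged in.
y_obs = toy_pricing(0.3, 2.0)
print(implied_state(y_obs, 2.0, toy_pricing, -5.0, 5.0))  # recovers the latent state
print(implied_state(y_obs, 2.2, toy_pricing, -5.0, 5.0))  # a different implied state
```

The second call shows why the occurrences of $\theta$ through $\nu(\theta)$ are awkward: every trial value of $\nu$ requires a fresh numerical inversion for each observation.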
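For instance, in options pricing $Y_t$ may be the observed option price, $Y_t^*$ can represent the latent variables driving the price process and $h[Y_t^*,\theta]$ will be the pricing formula linking $Y_t$ and $Y_t^*$. Plugging the implied states $g[Y_t,\theta]$ into (34) yields

$$E[\psi(Y_t,\theta,\nu(\theta))] = 0, \quad\text{with}\quad \psi(Y_t,\theta,\nu(\theta)) = \psi^*(g[Y_t,\nu(\theta)],\theta). \tag{35}$$

The first occurrence of $\theta$ within (35) represents the original occurrences of $\theta$ in the moment conditions written in terms of the latent data, while $\nu(\theta)$ represents the occurrences of $\theta$ in the implied states. This latter occurrence of $\nu(\theta)$ in $\psi(\cdot)$ is generally computationally cumbersome in comparison with the former occurrence of $\theta$ in $\psi(\cdot)$.

When the moment conditions in (35) are overidentified, we take as our extremum criterion $Q_T[\theta,\nu(\theta)]$ the efficient two-step GMM criterion:

$$Q_T[\theta,\nu(\theta)] = -\bar\psi_T[\theta,\nu(\theta)]'\,W_T^{-1}(\tilde\theta_T)\,\bar\psi_T[\theta,\nu(\theta)],$$

where $\bar\psi_T[\theta,\nu(\theta)]=\frac{1}{T}\sum_{t=1}^{T}\psi(Y_t,\theta,\nu(\theta))$, $\tilde\theta_T$ is a preliminary consistent estimator, and $W_T(\tilde\theta_T)$ is a consistent estimator for the long-run variance matrix of $\sqrt{T}\,\bar\psi_T[\theta^0,\nu(\theta^0)]$. The efficient estimator $\hat\theta_T$ can then be defined as the (unique) zero of the estimating equations

$$0 = \left[\frac{\partial\bar\psi_T[\theta,\nu(\theta)]}{\partial\theta'} + \frac{\partial\bar\psi_T[\theta,\nu(\theta)]}{\partial\nu'}\,\frac{\partial\nu(\theta)}{\partial\theta'}\right]' W_T^{-1}(\tilde\theta_T)\,\bar\psi_T[\theta,\nu(\theta)]. \tag{36}$$

Note that the estimator $\hat\theta_T$ uses a consistent estimator for the selection matrix

$$\Gamma(\theta^0) = E\left[\frac{\partial\psi[\theta^0,\nu(\theta^0)]}{\partial\theta'} + \frac{\partial\psi[\theta^0,\nu(\theta^0)]}{\partial\nu'}\,\frac{\partial\nu(\theta^0)}{\partial\theta'}\right]' W^{-1}(\theta^0).$$

In contrast, the simpler IS-GMM estimator $\theta_T^{IS}$, defined as the solution of

$$\max_{\theta}\ -\bar\psi_T[\theta,\nu(\tilde\theta_T)]'\,W_T^{-1}(\tilde\theta_T)\,\bar\psi_T[\theta,\nu(\tilde\theta_T)],$$

solves the estimating equations

$$0 = \left[\frac{\partial\bar\psi_T[\theta,\nu(\tilde\theta_T)]}{\partial\theta'}\right]' W_T^{-1}(\tilde\theta_T)\,\bar\psi_T[\theta,\nu(\tilde\theta_T)],$$

and therefore employs a consistent estimator for the selection matrix

$$\tilde\Gamma(\theta^0) = E\left[\frac{\partial\psi[\theta^0,\nu(\theta^0)]}{\partial\theta'}\right]' W^{-1}(\theta^0). \tag{37}$$

When $\dim(\psi)>\dim(\Theta)$, the selection matrix $\tilde\Gamma(\theta^0)$ selects $p$ linear combinations of the estimating equations in a suboptimal manner, and so $\theta_T^{IS}$ will be inefficient in general.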
Intuitively, the inefficiency of $\theta_T^{IS}$ is a direct consequence of the estimator's disregard for the impact of the awkward occurrences $\nu(\theta)$ of $\theta$ on the selection matrix, through the Jacobian matrix. Unlike the unconditional moment setting described herein, Pan (2002) considers the application of IS-GMM in the context of conditional moment restrictions. However, when it comes to optimal instruments, the same inefficiency issue will be faced if we overlook components of the Jacobian matrix associated with the occurrences of $\theta$ in the implied states.

Besides the above IS-GMM estimators, Pastorello et al. (2003) propose an iterative latent backfitting estimator that defines estimates $\tilde\theta_T^k$ through the iterations
\[ \tilde\theta_T^{k+1} = \arg\max_\theta Q_T[\theta, \nu(\tilde\theta_T^k)]. \]
Upon convergence, $\tilde\theta_T^k$ solves the estimating equations
\[ 0 = \frac{\partial \bar\psi_T'[\theta, \nu(\theta)]}{\partial\theta} \, W_T^{-1}(\tilde\theta_T) \, \bar\psi_T[\theta, \nu(\theta)], \tag{38} \]
and therefore, similar to $\theta_T^{IS}$, the latent backfitting estimator of Pastorello et al. (2003) is inefficient when $\dim(\psi) > \dim(\Theta)$.

An alternative to directly solving (36), and to the inefficient estimators that solve (37) and (38), is the penalized two-step estimator developed herein. Clearly, we have at our disposal an initial consistent estimator $\tilde\theta_T$ of $\theta^0$. Moreover, $\tilde\theta_T$ can also be used to consistently estimate the optimal instruments via
\[ \Gamma_T(\tilde\theta_T) = \left[ \frac{\partial \bar\psi_T'[\tilde\theta_T, \nu(\tilde\theta_T)]}{\partial\theta} + \frac{\partial\nu'(\tilde\theta_T)}{\partial\theta} \frac{\partial \bar\psi_T'[\tilde\theta_T, \nu(\tilde\theta_T)]}{\partial\nu} \right] W_T^{-1}(\tilde\theta_T). \]
The existence of a consistent estimator of $\Gamma(\theta^0)$ allows us to define a new two-step estimator that utilizes $\Gamma_T(\tilde\theta_T)$ to simplify the existing two-step estimator $\theta_T^{(1)}$ defined in equation (19). To this end, defining $q_T[\theta, \nu(\tilde\theta_T)] = \Gamma_T(\tilde\theta_T)\, \bar\psi_T[\theta, \nu(\tilde\theta_T)]$, we can obtain a simplified efficient two-step estimator $\theta_T^{*IS}$, in the spirit of $\theta_T^{(1)}$, by solving
\[ 0 = h_T^{*IS}(\theta) = q_T[\theta, \nu(\tilde\theta_T)] + \Gamma_T(\tilde\theta_T) \frac{\partial \bar\psi_T[\tilde\theta_T, \nu(\tilde\theta_T)]}{\partial\nu'} \frac{\partial\nu(\tilde\theta_T)}{\partial\theta'} (\theta - \tilde\theta_T) + \alpha_T \|\theta - \tilde\theta_T\|^2. \]
The main difference between $\theta_T^{*IS}$ and $\theta_T^{(1)}$ is that $\theta_T^{(1)}$ requires differentiating $\bar\psi_T[\theta, \nu(\theta)]$ with respect to $\nu(\theta)$ and also the occurrences of $\nu(\theta)$ in $\Gamma_T(\theta)$. In other words, for the estimator $\theta_T^{*IS}$, $\Gamma_T(\theta)$ is calculated once and is not altered thereafter. Efficiency of $\theta_T^{*IS}$ can be shown by a direct application of Theorem 3.1.

The two-step estimator $\theta_T^{*IS}$ is similar to the IS-GMM estimator developed in FPR, which is defined by the sequence of estimators $\hat\theta_T^{(k)}$, the solutions of
\[ 0 = \Gamma_T(\hat\theta_T^{(k-1)}) \, \bar\psi_T[\theta, \nu(\hat\theta_T^{(k-1)})]. \]
In comparison, neither the two-step nor the MBP estimator actively searches over the cumbersome occurrences of $\theta$ in $\nu(\theta)$. In this way, both approaches share some of the computational simplicity associated with the inefficient estimators $\theta_T^{IS}$ and $\tilde\theta_T^k$; in contrast to those estimators, however, both retain efficiency (under certain conditions). The main difference between the two approaches is that the two-step approach directly corrects the information loss associated with not optimizing over the occurrences of $\theta$ due to $\nu(\theta)$ by forming a consistent estimator of these quantities; the FPR approach, on the other hand, only offers this correction as $k \to \infty$, and only if the required contraction condition is satisfied.$^4$ The price to pay for this two-step procedure is the additional computational cost induced by the linear term in $h_T^{*IS}(\theta)$ and the addition of a penalty to guarantee consistency.

$^4$ See FPR for a precise statement of this contraction mapping condition.

5.2 Implied States in Latent Likelihood

Let us now consider the case where the unobservable stochastic process $\{Y_t^* : t \geq 1\}$ is drawn from a transition density that is known up to the unknown $\theta^0$, and let $P = \{f(\cdot|\cdot; \theta) : \theta \in \Theta\}$ denote the family of transition densities indexed by $\theta$.
Denoting the log-likelihood based on the unobservable latent state variables $Y_t^*$ by
\[ Q_T^*[\theta] = \frac{1}{T}\sum_{t=1}^{T} \ell(Y_t^* | Y_{t-1}^*; \theta), \quad \text{where } \ell(Y_t^* | Y_{t-1}^*; \theta) = \log f(Y_t^* | Y_{t-1}^*; \theta), \]
the implied states framework utilizes the relationship $Y_t = h[Y_t^*, \nu^0]$ to transform the estimation problem from one based on $Y_t^*$ and $Q_T^*[\theta]$ to one based on $Y_t$. Using the implied states $g[Y_t, \nu(\theta)]$, obtained by inverting (33) at the value $\theta$, and the Jacobian formula, the infeasible log-likelihood $Q_T^*[\theta]$ is transformed into the feasible log-likelihood
\[ Q_T[\theta, \nu(\theta)] = \frac{1}{T}\sum_{t=1}^{T} \ell(g[Y_t, \nu(\theta)] \,|\, g[Y_{t-1}, \nu(\theta)]; \theta) + \frac{1}{T}\sum_{t=1}^{T} \log |H_y g[Y_t, \nu(\theta)]|. \]
Here $|H_y g[Y_t, \nu(\theta)]|$ is the absolute value of the Jacobian for $Y$ associated with the map $Y \mapsto g[Y, \nu(\theta)]$. Estimation of $\theta^0$ from $Q_T[\theta, \nu(\theta)]$ is often encountered in estimation of option pricing models, see, e.g., Renault and Touzi (1996), as well as credit risk models, see, e.g., Duan (1994). Maximization of $Q_T[\theta, \nu(\theta)]$ is generally much more difficult than would be maximization of $Q_T^*[\theta]$, if such maximization were indeed feasible. It is clear that directly solving
\[ 0 = q_T[\theta, \nu(\theta)] = \frac{\partial Q_T[\theta, \nu(\theta)]}{\partial\theta} + \frac{\partial\nu'(\theta)}{\partial\theta} \frac{\partial Q_T[\theta, \nu(\theta)]}{\partial\nu} \]
can be cumbersome, as $\theta$ shows up in several places within $Q_T[\theta, \nu(\theta)]$ and in highly nonlinear ways. While the two-step procedures discussed herein can be applied to such settings, it is perhaps more informative to consider precise implementation of these estimators in a relatively simple example.

5.2.1 Example: Merton Credit Risk Model

To demonstrate the penalized two-step methodology in the situation of implied states likelihood estimation, we now consider estimation of the parameters in the structural credit risk model of Merton (1974). Suppose that the firm's debt consists of a zero coupon bond with face value $B$ and maturity date $\tau$.
Letting $V_t$ denote the firm's unobservable market value at time $t$, the firm's observable equity price can be interpreted as a European call option written on the firm's market value with strike price $B$ and maturity $\tau$; i.e.,
\[ S_\tau \equiv \max[V_\tau - B, 0]. \tag{39} \]
From (39) the observed equity prices $S_0, \ldots, S_T$ can be interpreted as option prices written on the firm's unobservable market values $V_0, \ldots, V_T$. In the simplest case, the firm's unobservable market value is described as a Geometric Brownian Motion:
\[ \frac{dV_t}{V_t} = \mu\, dt + \sigma\, dW_t, \tag{40} \]
where $W_t$ is a standard Brownian motion. Equation (40) allows us to write the conditional likelihood of the sample path $(V_1, V_2, \ldots, V_T)$ given some initial value $V_0$ and historical parameters $(\mu, \sigma)$. The conditional log-likelihood function of the unobserved asset values is then given by
\[ Q_T^*[\mu, \sigma^2] = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2T}\sum_{t=1}^{T} \frac{\left( \ln(V_t/V_{t-1}) - (\mu - \frac{1}{2}\sigma^2) \right)^2}{\sigma^2} - \frac{1}{T}\sum_{t=1}^{T} \ln V_t, \]
see, e.g., Duan (1994, 2000) and FPR for a discussion. Unfortunately, maximum likelihood estimation of $(\mu, \sigma)$ from $Q_T^*[\mu, \sigma^2]$ is not feasible since the sample path $(V_1, V_2, \ldots, V_T)$ is unobserved. However, when the dynamics of the firm's market value are described by (40), the observable equity values can be related to the unobservable firm values through the Black and Scholes option pricing formula:
\[ S_t = V_t \Phi(d_t) - B \exp(-r(\tau - t))\, \Phi(d_t - \sigma\sqrt{\tau - t}), \tag{41} \]
where
\[ d_t(\sigma^2) = \frac{\ln(V_t/B) + (r + \frac{1}{2}\sigma^2)(\tau - t)}{\sigma\sqrt{\tau - t}}, \]
$\Phi(\cdot)$ is the standard normal CDF and $r$ is the risk-free interest rate, assumed to be deterministic and time-invariant. Letting $g[\cdot, \sigma^2]$ denote the inverse of the Black and Scholes option pricing formula, the unobserved firm values are related to the observed equity prices through
\[ V_t = g[S_t, \sigma^2], \]
which can be obtained, at least numerically, from equation (41) and a given value of $\sigma^2$. Technically, $g[\cdot, \sigma^2]$ depends on $t$ through the time-to-maturity $(\tau - t)$; however, we suppress this dependence in favor of notational simplicity.
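The numerical inversion of equation (41) mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the bisection scheme, and the values chosen for the interest rate `r` and the time-to-maturity `ttm` are assumptions made only for this example. Bisection is used because the equity value is strictly increasing in the firm value.

```python
import math


def norm_cdf(x):
    """Standard normal CDF, Phi(x), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def equity_value(V, sigma2, B, r, ttm):
    """Equation (41): equity as a European call on firm value V."""
    vol = math.sqrt(sigma2 * ttm)  # sigma * sqrt(tau - t)
    d = (math.log(V / B) + (r + 0.5 * sigma2) * ttm) / vol
    return V * norm_cdf(d) - B * math.exp(-r * ttm) * norm_cdf(d - vol)


def implied_asset_value(S, sigma2, B, r, ttm, n_iter=100):
    """g[S, sigma2]: invert equation (41) for V by bisection."""
    lo = S                           # a call is worth at most its underlying
    hi = S + B * math.exp(-r * ttm)  # and at least V - B*exp(-r*ttm)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if equity_value(mid, sigma2, B, r, ttm) < S:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Illustrative round trip; r and ttm are hypothetical choices.
V_true, B, r, ttm, sigma2 = 10_000.0, 9_000.0, 0.05, 1.0, 0.09
S_obs = equity_value(V_true, sigma2, B, r, ttm)
V_implied = implied_asset_value(S_obs, sigma2, B, r, ttm)
```

Pricing the equity at a known firm value and then inverting should recover that firm value up to floating-point precision, which is a convenient sanity check before using the implied values in a likelihood.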
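Therefore, even though $V_t$ is unobserved, if $\sigma^2$ were known its value could be imputed from $V_t = g[S_t, \sigma^2]$ for each $t = 1, \ldots, T$. Given this fact, using $V_t = g[S_t, \sigma^2]$ and the Jacobian formula, we transform the log-likelihood from one based on $V_t$ to one based on $S_t$. Following arguments in Duan (1994), the conditional log-likelihood based on observable equity values is given by
\[ Q_T[\mu, \sigma^2] = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2T}\sum_{t=1}^{T} \frac{\left( R_t(\sigma^2) - (\mu - \frac{1}{2}\sigma^2) \right)^2}{\sigma^2} - \frac{1}{T}\sum_{t=1}^{T} \ln g(S_t, \sigma^2) - \frac{1}{T}\sum_{t=1}^{T} \ln \Phi(d_t(\sigma^2)), \]
where the implicit returns $R_t(\sigma^2) = \ln(g[S_t, \sigma^2]) - \ln(g[S_{t-1}, \sigma^2])$ can be obtained using the Black and Scholes formula and a given value of $\sigma^2$. Estimation of $(\mu, \sigma^2)$ then proceeds by maximizing $Q_T[\mu, \sigma^2]$. Since estimation of $\mu$ is not a priority, the first step is often to concentrate out $\mu$, which yields
\[ \mu_T(\sigma^2) = \frac{1}{T}\sum_{t=1}^{T} R_t(\sigma^2) + \frac{\sigma^2}{2} = \bar R_T(\sigma^2) + \frac{\sigma^2}{2}, \]
and the log-likelihood based on the observable equity values becomes
\[ Q_T[\sigma^2] = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{1}{2T}\sum_{t=1}^{T} \frac{\left( R_t(\sigma^2) - \bar R_T(\sigma^2) \right)^2}{\sigma^2} - \frac{1}{T}\sum_{t=1}^{T} \log g[S_t, \sigma^2] - \frac{1}{T}\sum_{t=1}^{T} \log \Phi(d_t(\sigma^2)). \]
$Q_T[\sigma^2]$ depends, in several places, on the structural relationship $g[S_t, \sigma^2]$, which makes directly maximizing $Q_T[\sigma^2]$ numerically unstable. As in Section 5.1, we denote the problematic occurrences of $\sigma^2$ in $Q_T[\sigma^2]$ due to the structural relationship $g[S_t, \sigma^2]$ by $\nu(\sigma^2)$; note that $\nu(\sigma^2) = \sigma^2$, and the distinction between the two occurrences of $\sigma^2$ is purely notational. The concentrated log-likelihood function then becomes
\[ Q_T[\sigma^2, \nu(\sigma^2)] = -\frac{1}{2}\ln(2\pi\sigma^2) - \frac{1}{2T}\sum_{t=1}^{T} \frac{\left( R_t(\nu(\sigma^2)) - \bar R_T(\nu(\sigma^2)) \right)^2}{\sigma^2} - \frac{1}{T}\sum_{t=1}^{T} \ln g[S_t, \nu(\sigma^2)] - \frac{1}{T}\sum_{t=1}^{T} \ln \Phi(d_t(\nu(\sigma^2))). \]
Defining
\[ \tilde\sigma_T^2[\nu(\sigma^2)] = \frac{1}{T}\sum_{j=1}^{T} \left( R_j(\nu(\sigma^2)) - \bar R_T(\nu(\sigma^2)) \right)^2 \quad \text{and} \quad A_T[\nu(\sigma^2)] = 2\, \frac{\partial Q_T[\sigma^2, \nu(\sigma^2)]}{\partial\nu} \frac{\partial\nu(\sigma^2)}{\partial\sigma^2}, \]
an estimator of $\sigma^2$ can be obtained as the solution to the log-likelihood first-order conditions
\[ 0 = -\frac{1}{\sigma^2} + \frac{1}{\sigma^4}\, \tilde\sigma_T^2[\nu(\sigma^2)] + A_T[\nu(\sigma^2)]. \]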
Solving the above equation is equivalent to solving the estimating equation $0 = q_T[\sigma^2, \nu(\sigma^2)]$, where
\[ q_T[\sigma^2, \nu(\sigma^2)] = \sigma^4 A_T[\nu(\sigma^2)] - \sigma^2 + \tilde\sigma_T^2[\nu(\sigma^2)]. \]
Directly solving $0 = q_T[\sigma^2, \nu(\sigma^2)]$ to estimate $\sigma^2$ can be cumbersome, and a popular alternative, due to Kealhofer, McQuown and Vasicek and dubbed the KMV iterative method, is to base estimation of $\sigma^2$ on
\[ \tilde\sigma_T^2[\nu(\sigma^2)] = \frac{1}{T}\sum_{j=1}^{T} \left( R_j(\nu(\sigma^2)) - \bar R_T(\nu(\sigma^2)) \right)^2. \]
Given a starting value $\hat\sigma^{2(1)}$, for $k > 1$, the KMV iterative method updates its estimates of $\sigma^2$ by calculating
\[ \hat\sigma^{2(k)} = \tilde\sigma_T^2[\nu(\hat\sigma^{2(k-1)})] = \frac{1}{T}\sum_{t=1}^{T} \left( R_t(\nu(\hat\sigma^{2(k-1)})) - \bar R_T(\nu(\hat\sigma^{2(k-1)})) \right)^2, \]
and iterating until convergence. This iterative procedure is often much simpler than one based on solving $q_T[\sigma^2, \nu(\sigma^2)] = 0$ since it completely neglects the influence of $A_T[\nu(\sigma^2)]$ on the estimates of $\sigma^2$. FPR demonstrate that the iterative KMV approach coincides with the latent backfitting estimator proposed by Pastorello et al. (2003) (hereafter, PPR). While much simpler than maximum likelihood, the KMV/PPR estimator does not utilize all of the information in the estimating equation $q_T[\sigma^2, \nu(\sigma^2)]$ and therefore is not asymptotically equivalent to the MLE. To address this, FPR use the MBP approach to obtain an estimator that maintains some of the computational advantages of the KMV/PPR iterative strategy yet still delivers an estimator that is asymptotically equivalent to the MLE. Given an initial estimator $\hat\sigma^{2(1)}$, at the $k$-th iteration ($k > 1$) the MBP estimator solves the following second-order equation in $\sigma^2$:
\[ \sigma^4 A_T[\nu(\hat\sigma^{2(k-1)})] - \sigma^2 + \tilde\sigma_T^2[\nu(\hat\sigma^{2(k-1)})] = 0. \tag{42} \]
An alternative to the KMV/PPR and MBP approaches is the two-step approach discussed herein. In this context, the two-step approach linearizes the estimating equations $q_T[\sigma^2, \nu(\sigma^2)]$, with respect to the cumbersome occurrences of $\nu(\sigma^2) = \sigma^2$, around an initially consistent estimator.
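The KMV/PPR iteration described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: time is measured in unit steps so that the formulas apply as written, and the per-period values of `mu`, `sigma2_true`, `r`, the maturity date, the starting value, and the random seed are all hypothetical choices made only to produce a well-behaved synthetic sample.

```python
import math
import random


def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def equity_value(V, sigma2, B, r, ttm):
    """Equation (41) with time-to-maturity ttm = tau - t."""
    vol = math.sqrt(sigma2 * ttm)
    d = (math.log(V / B) + (r + 0.5 * sigma2) * ttm) / vol
    return V * norm_cdf(d) - B * math.exp(-r * ttm) * norm_cdf(d - vol)


def implied_asset_value(S, sigma2, B, r, ttm, n_iter=60):
    """g[S, sigma2]: bisection on the increasing map V -> equity value."""
    lo, hi = S, S + B * math.exp(-r * ttm)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if equity_value(mid, sigma2, B, r, ttm) < S:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def kmv_update(S, sigma2, B, r, maturity):
    """One KMV step: impute V_t = g[S_t, sigma2], then return the
    sample variance of the implied log returns R_t(sigma2)."""
    V = [implied_asset_value(s, sigma2, B, r, maturity - t)
         for t, s in enumerate(S)]
    R = [math.log(V[t] / V[t - 1]) for t in range(1, len(V))]
    R_bar = sum(R) / len(R)
    return sum((x - R_bar) ** 2 for x in R) / len(R)


def kmv_iterate(S, sigma2_start, B, r, maturity, tol=1e-7, max_iter=200):
    """Iterate the KMV update until the relative change is below tol."""
    sigma2 = sigma2_start
    for _ in range(max_iter):
        sigma2_new = kmv_update(S, sigma2, B, r, maturity)
        if abs(sigma2_new - sigma2) <= tol * sigma2:
            return sigma2_new
        sigma2 = sigma2_new
    raise RuntimeError("KMV iteration did not converge")


# Synthetic sample (all numerical values below are hypothetical).
random.seed(0)
T, B, r, maturity = 250, 9_000.0, 0.0001, 500
mu, sigma2_true = 0.0005, 0.0004
V_path = [10_000.0]
for _ in range(T):
    shock = random.gauss(mu - 0.5 * sigma2_true, math.sqrt(sigma2_true))
    V_path.append(V_path[-1] * math.exp(shock))
S_path = [equity_value(v, sigma2_true, B, r, maturity - t)
          for t, v in enumerate(V_path)]
sigma2_kmv = kmv_iterate(S_path, 0.001, B, r, maturity)
```

The returned value is a fixed point of the update map, so it ignores the term $A_T[\nu(\sigma^2)]$ entirely, which is the source of the inefficiency discussed in the text.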
For $\hat\sigma^{2(1)}$ an initial estimator of $\sigma^2$, the non-penalized two-step approach estimates $\sigma^2$ by solving
\[ 0 = \sigma^4 A_T[\nu(\hat\sigma^{2(1)})] - \sigma^2 + \tilde\sigma_T^2[\nu(\hat\sigma^{2(1)})] + \frac{\partial A_T[\nu(\hat\sigma^{2(1)})]}{\partial\nu}\, \sigma^4 (\sigma^2 - \hat\sigma^{2(1)}) + \frac{\partial \tilde\sigma_T^2[\nu(\hat\sigma^{2(1)})]}{\partial\nu} (\sigma^2 - \hat\sigma^{2(1)}), \tag{43} \]
and, for $\alpha_T$ a penalty term, the penalized two-step approach estimates $\sigma^2$ by solving
\[ 0 = \sigma^4 A_T[\nu(\hat\sigma^{2(1)})] - \sigma^2 + \tilde\sigma_T^2[\nu(\hat\sigma^{2(1)})] + \frac{\partial A_T[\nu(\hat\sigma^{2(1)})]}{\partial\nu}\, \sigma^4 (\sigma^2 - \hat\sigma^{2(1)}) + \frac{\partial \tilde\sigma_T^2[\nu(\hat\sigma^{2(1)})]}{\partial\nu} (\sigma^2 - \hat\sigma^{2(1)}) + \alpha_T (\sigma^2 - \hat\sigma^{2(1)})^2. \tag{44} \]
Note that the two-step estimators in equations (43) and (44) require solving a third-order equation in $\sigma^2$, whereas the MBP estimator in equation (42) solves a second-order equation. However, the two-step estimators solve only one third-order equation in $\sigma^2$, whereas the MBP estimator requires solving (potentially) many second-order equations in $\sigma^2$. The computational merits of both approaches will depend on the quality of the first-step estimator $\hat\sigma^{2(1)}$ and, in the case of MBP, the strength of the contraction mapping guiding the iterations. For both estimation procedures a convenient starting value can be obtained using the KMV/PPR estimation procedure.

5.2.2 Simulation Example

To illustrate the usefulness of the efficient two-stage method in the context of the Merton credit risk model, we devise a small Monte Carlo experiment comparing the MBP estimator with the penalized (respectively, non-penalized) two-step estimator. We construct 1,000 synthetic samples of 250 and 500 time series observations for daily returns. The firm's value trajectory is initialized at 10,000 and the face value of the firm's debt is fixed at $B = 9{,}000$. The parameters are set to $\mu = .01$ and $\sigma^2 = .09$. We focus on estimation of $\sigma^2$ only and so we work directly with the concentrated log-likelihood function for both estimators.$^5$ The MBP estimator is obtained using a Newton-Raphson approach to solve equation (42).
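To make the structure of equations (43) and (44) concrete, the sketch below solves the cubic in $\sigma^2$ once the four quantities evaluated at the first-step estimator are available as numbers. It is only an illustration under assumptions: the function names and all numerical inputs are hypothetical, and a plain Newton iteration started at the first-step estimator is used here as a simple stand-in rather than a reproduction of the authors' solver.

```python
def two_step_residual(x, A, dA, s2, ds2, x1, alpha):
    """Right-hand side of (44) at sigma^2 = x; alpha = 0 gives (43).

    A, dA   : A_T and its derivative w.r.t. nu at the first-step value x1
    s2, ds2 : tilde-sigma_T^2 and its derivative w.r.t. nu at x1
    """
    return (A * x ** 2 - x + s2
            + dA * x ** 2 * (x - x1)
            + ds2 * (x - x1)
            + alpha * (x - x1) ** 2)


def two_step_slope(x, A, dA, s2, ds2, x1, alpha):
    """Derivative of two_step_residual with respect to x."""
    return (2.0 * A * x - 1.0
            + dA * (3.0 * x ** 2 - 2.0 * x1 * x)
            + ds2
            + 2.0 * alpha * (x - x1))


def solve_two_step(A, dA, s2, ds2, x1, alpha=0.0, tol=1e-14, max_iter=100):
    """Newton iteration started at the first-step estimator x1."""
    x = x1
    for _ in range(max_iter):
        step = (two_step_residual(x, A, dA, s2, ds2, x1, alpha)
                / two_step_slope(x, A, dA, s2, ds2, x1, alpha))
        x -= step
        if abs(step) <= tol:
            return x
    raise RuntimeError("Newton iteration did not converge")


# Hypothetical inputs, chosen only to exercise the solver.
inputs = dict(A=2.0, dA=0.5, s2=0.075, ds2=0.1, x1=0.09)
sigma2_ts = solve_two_step(alpha=0.0, **inputs)    # equation (43)
sigma2_pts = solve_two_step(alpha=4.0, **inputs)   # equation (44)
```

Starting the search at the first-step estimator reflects the idea that the relevant root of the cubic lies close to it; a cubic can have other real roots, so a production solver would need to guard against converging to one of those.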
The penalized (respectively, non-penalized) two-step estimator is obtained using a mix of bisection and interpolation, and the penalty term satisfies $\alpha_T \propto T^{1/4}$. Both methods use starting values obtained from the KMV/PPR method. Across the 1,000 synthetic samples we calculate the mean, median, root mean squared error (RMSE) and mean absolute error (MAE) for the MBP estimator and the two-step estimators. The results of the Monte Carlo experiment are contained in Table 5. Table 5 demonstrates that the two-step estimators and the MBP estimator have similar finite sample properties, with the penalized two-step estimator having significantly smaller RMSE and MAE. It is also important to point out that, as with the copula example in Section 4.2.2, the finite sample properties of the penalized and non-penalized two-step estimators are very similar.

6 Conclusion

The development of nonlinear dynamic models in financial econometrics has given rise to estimation problems that are often viewed as computationally difficult. This potential computational burden has led to the development of computationally light estimators whose starting point is often a simple consistent estimator of some instrumental parameters. This first step estimator can be used either for targeting the structural parameters (Indirect Inference à la Gourieroux, Monfort and Renault (1993)) or for simplifying estimating equations for the parameters of interest. More often than not, this simplification comes at the price of some loss in efficiency. Not only do two-step estimators have, in general, an asymptotic distribution that depends on the distribution of the first step estimator, but even iterations may not be able to restore efficiency (see PPR and references therein). FPR demonstrate that the aforementioned inefficiency is caused by disregarding the information contained in (some of) the awkward occurrences of the parameters in the criterion function.
Popular iterative (or two-step) procedures are devised precisely to allow us to overlook these awkward occurrences, possibly at the cost of efficiency loss. The goal of FPR was to propose efficient iterative estimation procedures whose computational cost, at each step of the iteration, is no higher than that of popular inefficient inference procedures. This goal was made possible by the fact that their algorithms iterate on the occurrences of the parameters that researchers would like to overlook. In this way, the informational content of these occurrences was no longer ignored, at least in the limit of the iterative procedure.

$^5$ The proposed simulation design is similar to that of FPR.

In the present paper, we replace the method of iteration by a partial linearization of the estimating equations around a first step consistent estimator for the parameters that are difficult to deal with. With respect to the efficient iterative procedure of FPR, the pros and cons of our two-step procedure are as follows. On the one hand, our approach is not required to compute a sequence of estimators but only a second step estimator. Our second step, in general, maintains the computational simplicity associated with each step of the FPR iterations. Moreover, while consistency of the FPR iterations may break down when their so-called Information Dominance condition is not fulfilled, our approach does not require such a condition. On the other hand, linearization, when it is only partial, may be a risky exercise because it may deliver a solution for the non-linearized portion that is biased, even asymptotically, by the approximated linearization. Then the consistency property of the estimator may be lost. In order to hedge against this risk, we develop a strategy of targeting first step consistent estimators, in the spirit of indirect inference. However, in contrast with indirect inference, targeting is for us only a complementary tool for enforcing consistency.
In particular, we do not want the asymptotic variance of our second step estimator to be inefficiently driven by the first step estimator used for targeting. This is the reason why we must elicit a tuning parameter (the penalty weight) that goes to infinity, in order to enforce consistency, but not too fast, in order to avoid the efficiency loss that would be produced by contamination of the second step estimator by the inefficiency of the first step estimator.

Finally, it is worth noting that the strategy developed in this paper may be of more general interest. While indirect inference has demonstrated the usefulness of targeting instrumental parameters for simple identification of structural parameters of interest, the recent literature on multivariate GARCH has stressed that targeting some unconditional moments may be a safe way to hedge against the risk of numerical instability associated with supposedly efficient estimators, at least in the presence of high dimensional and/or highly nonlinear optimization problems. In a companion paper, we demonstrate that for multivariate GARCH models, in contrast to existing targeting strategies, our penalization/targeting approach can deliver numerically stable estimates with good finite sample properties without the need to sacrifice efficiency. Moreover, as pointed out in our copula example, in addition to unconditional moments, the relatively simple and robust estimators of the marginal distributions can often provide a useful target.

References

T. Bollerslev and J. M. Wooldridge. Quasi-maximum likelihood estimation and inference in dynamic models with time-varying covariances. Econometric Reviews, 11(2):143–172, 1992.

B. Crepon, F. Kramarz, and A. Trognon. Parameters of interest, nuisance parameters and orthogonality conditions: An application to autoregressive error component models. Journal of Econometrics, 82(1):135–156, 1997.

J.-C. Duan. Maximum likelihood estimation using price data of the derivative contract.
Mathematical Finance, 4(2):155–167, 1994.

J.-C. Duan. Correction: Maximum likelihood estimation using price data of the derivative contract. Mathematical Finance, 10(4):461–462, 2000.

R. Engle. Dynamic conditional correlation. Journal of Business & Economic Statistics, 20(3):339–350, 2002.

R. Engle and J. Mezrich. GARCH for groups. RISK, 9(8):36–40, 1996.

Y. Fan, S. Pastorello, and E. Renault. Maximization by parts in extremum estimation. The Econometrics Journal, 2015. Forthcoming.

C. Gourieroux, A. Monfort, and E. Renault. Indirect inference. Journal of Applied Econometrics, 8:S85–S118, 1993. Special Issue on Econometric Inference Using Simulation Techniques.

C. Gourieroux, A. Monfort, and E. Renault. Two-stage generalized moment method with applications to regressions with heteroscedasticity of unknown form. Journal of Statistical Planning and Inference, 50(1):37–63, 1996. Econometric Methodology, Part III.

L. P. Hansen, J. Heaton, and A. Yaron. Finite-sample properties of some alternative GMM estimators. Journal of Business & Economic Statistics, 14(3):262–280, 1996.

H. O. Hartley. The modified Gauss-Newton method for the fitting of non-linear regression functions by least squares. Technometrics, 3(2):269–280, 1961.

M. Hatanaka. An efficient two-step estimator for the dynamic adjustment model with autoregressive errors. Journal of Econometrics, 2(3):199–220, 1974.

H. Joe. Multivariate models and dependence concepts, volume 73. London: Chapman & Hall, 1997.

Y. Liu and R. Luger. Efficient estimation of copula-GARCH models. Computational Statistics & Data Analysis, 53(6):2284–2297, 2009. The Fourth Special Issue on Computational Econometrics.

R. C. Merton. On the pricing of corporate debt: The risk structure of interest rates. The Journal of Finance, 29(2):449–470, 1974.

W. K. Newey and D. L. McFadden. Large sample estimation and hypothesis testing. In R. F. Engle and D. L. McFadden, editors, Handbook of Econometrics, volume 4, pages 2111–2245.
1994.

D. Noureldin, N. Shephard, and K. Sheppard. Multivariate rotated ARCH models. Journal of Econometrics, 179(1):16–30, 2014.

A. Pagan. Two stage and related estimators and their applications. The Review of Economic Studies, 53(4):517–538, 1986.

A. Pakes and D. Pollard. Simulation and the asymptotics of optimization estimators. Econometrica, 57(5):1027–1057, 1989.

J. Pan. The jump-risk premia implicit in options: Evidence from an integrated time-series study. Journal of Financial Economics, 63(1):3–50, 2002.

S. Pastorello, V. Patilea, and E. Renault. Iterative and recursive estimation in structural nonadaptive models [with comments, rejoinder]. Journal of Business & Economic Statistics, 21(4):449–482, 2003.

A. J. Patton. Copula-based models for financial time series. In T. Andersen, R. Davis, J.-P. Kreiss, and T. Mikosch, editors, Handbook of Financial Time Series, pages 767–785. Springer Verlag, 2009.

E. Renault and N. Touzi. Option hedging and implied volatilities in a stochastic volatility model. Mathematical Finance, 6(3):279–302, 1996.

P. M. Robinson. The stochastic difference between econometric statistics. Econometrica, 56(3):531–548, 1988.

J. H. Shih and T. A. Louis. Inferences on the association parameter in copula models for bivariate survival data. Biometrics, 51(4):1384–1399, 1995.

P. X.-K. Song. Multivariate dispersion models generated from Gaussian copula. Scandinavian Journal of Statistics, 27(2):305–320, 2000.

P. X.-K. Song, Y. Fan, and J. D. Kalbfleisch. Maximization by parts in likelihood inference [with comments, rejoinder]. Journal of the American Statistical Association, 100(472):1145–1167, 2005.

A. Trognon and C. Gourieroux. A note on the efficiency of two-step estimation methods. In Essays in Honor of Edmond Malinvaud, pages 233–248. MIT Press, 1990.
A Regularity Conditions for Extremum Estimators

In all the applications considered in this paper, the estimating equations $f_T(\theta) = q_T[\theta, \nu(\theta)]$ of interest are obtained as first order conditions of some extremum estimation program:
\[ \hat\theta_T = \arg\max_{\theta \in \Theta} Q_T[\theta, \nu(\theta)] \]
so that
\[ q_T[\theta, \nu(\theta)] = \frac{\partial Q_T[\theta, \nu(\theta)]}{\partial\theta} + \frac{\partial\nu'(\theta)}{\partial\theta} \frac{\partial Q_T[\theta, \nu(\theta)]}{\partial\nu}. \tag{45} \]
It is worth noting that, in contrast with the possibly more general framework mentioned in the introduction, we have introduced a simplification by considering only a fixed known function $\nu(\theta)$ instead of a more general sample-dependent function $\nu_T(\theta)$. This may look restrictive since the function $\nu_T(\theta)$ may typically show up when profiling out some specific occurrences of some components of $\theta$ and thus be computed as data dependent. However, it must be kept in mind that the difference is more notational than real, since a general objective function $Q_T^*[\theta, \nu_T(\theta)]$ may always be rewritten $Q_T[\theta, \theta]$ with a new function defined from $Q_T^*[\cdot, \cdot]$ and $\nu_T(\cdot)$ by:
\[ Q_T[\theta, \theta^*] = Q_T^*[\theta, \nu_T(\theta^*)]. \tag{46} \]
This remark actually shows that we could always choose $\nu(\theta) = \theta$. We prefer to keep the notation $\nu(\theta)$ for the sake of notational transparency. In most cases, $\nu(\theta)$ will be nothing but a sub-vector of $\theta$. However, while we will keep in mind that (45) is actually not less general than (46), we will make explicit how the regularity conditions must be interpreted when $\nu_T(\theta)$ is actually a sample-dependent consistent estimator of some underlying unknown true $\nu^0(\theta)$.

In the simple setup of (45), the maintained regularity conditions are the following.

R1. The following are satisfied: (1) $\Theta \subset \mathbb{R}^p$ and $\mathcal{N} \subset \mathbb{R}^q$ are two compact parameter spaces. (2) $\nu(\cdot)$ is a continuous function from $\Theta$ to $\mathcal{N}$, twice continuously differentiable on the interior of $\Theta$. (3) $\theta^0 \in \mathrm{Int}(\Theta)$, the interior set of $\Theta$, and $\nu^0 = \nu(\theta^0) \in \mathrm{Int}(\mathcal{N})$, the interior set of $\mathcal{N}$.

R2. $Q_T[\theta, \nu]$ converges in probability towards a nonstochastic function $Q_\infty[\theta, \nu]$ uniformly on $(\theta, \nu) \in \Theta \times \mathcal{N}$.

R3. The function $\theta \mapsto$
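$Q_\infty[\theta, \nu(\theta)]$ attains a unique global maximum on $\Theta$ at $\theta = \theta^0$, the unique solution of the equations $q_\infty[\theta, \nu(\theta)] = 0$, where
\[ q_\infty[\theta, \nu(\theta)] = \frac{\partial Q_\infty[\theta, \nu(\theta)]}{\partial\theta} + \frac{\partial\nu'(\theta)}{\partial\theta} \frac{\partial Q_\infty[\theta, \nu(\theta)]}{\partial\nu}. \]

R4. The function $Q_T[\theta, \nu]$ is twice continuously differentiable on $\mathrm{Int}(\Theta) \times \mathrm{Int}(\mathcal{N})$.

R5. The following are satisfied: (1) With $\lambda' = (\theta', \nu')$, the second derivative $\partial^2 Q_T(\lambda) / \partial\lambda\,\partial\lambda'$ converges uniformly on $\lambda \in \mathrm{Int}(\Theta) \times \mathrm{Int}(\mathcal{N})$ towards a nonstochastic matrix $D(\lambda)$. (2) The matrix $D_{\theta\theta}(\lambda^0) = \mathrm{Plim}_{T=\infty}\, \partial^2 Q_T(\lambda^0) / \partial\theta\,\partial\theta'$ (where $\lambda^{0\prime} = (\theta^{0\prime}, \nu^{0\prime})$) is negative definite.

R6. $\sqrt{T} \left[ \dfrac{\partial Q_T(\lambda^0)}{\partial\theta} + \dfrac{\partial\nu'}{\partial\theta}(\theta^0)\, \dfrac{\partial Q_T(\lambda^0)}{\partial\nu} \right]$ converges in distribution towards a normal distribution with zero mean and variance $\Omega$.

It is worth reinterpreting these regularity conditions when the objective function $Q_T$ is actually deduced from another function $Q_T^*$ as in (46). Note that in this case, $\nu(\cdot)$ is just the identity function ($\nu(\theta) = \theta$, $\Theta = \mathcal{N}$), making trivial all maintained assumptions about $\nu$. However, it must be kept in mind that the role of the data-dependent function $\nu_T(\cdot)$ will typically be the consistent estimation of some true unknown function $\nu^0(\cdot)$. Then, the above regularity conditions can be rewritten identically by only replacing the functions $Q_T[\theta, \nu]$ and $\nu(\cdot)$ by the functions $Q_T^*[\theta, \nu]$ and $\nu^0(\cdot)$. Only the limit arguments involving the function $\nu^0(\cdot)$ have to be revisited to take into account its consistent estimation. We will basically rewrite conditions R2 and R6 as follows:

R2*. The following are satisfied: (1) $Q_T^*[\theta, \nu]$ converges in probability towards a nonstochastic function $Q_\infty^*[\theta, \nu]$ uniformly on $(\theta, \nu) \in \Theta \times \mathcal{N}$. (2) $\nu_T(\theta)$ converges in probability towards $\nu^0(\theta)$ uniformly on $\theta \in \Theta$.

R6*. $\sqrt{T} \left[ \dfrac{\partial Q_T(\theta^0, \theta^0)}{\partial\theta} + \dfrac{\partial Q_T(\theta^0, \theta^0)}{\partial\theta^*} \right]$, where $Q_T[\theta, \theta^*] = Q_T^*[\theta, \nu_T(\theta^*)]$, converges in distribution towards a normal distribution with zero mean and variance $\Omega$.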
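Obviously, a more primitive condition for R6* should be based on an assumption of joint asymptotic normality, involving not only the score function but also $\sqrt{T}(\nu_T(\theta^0) - \nu^0(\theta^0))$, whose impact on the asymptotic distribution would be deduced from a Taylor expansion
\[ \sqrt{T}\, \frac{\partial Q_T^*(\theta^0, \nu_T(\theta^0))}{\partial\lambda} = \sqrt{T}\, \frac{\partial Q_T^*(\theta^0, \nu^0(\theta^0))}{\partial\lambda} + \frac{\partial^2 Q_T^*(\theta^0, \nu^0(\theta^0))}{\partial\lambda\,\partial\nu'}\, \sqrt{T} \left[ \nu_T(\theta^0) - \nu^0(\theta^0) \right]. \]
While this more specific setup would not introduce any theoretical complication, we omit it throughout for the sake of expositional simplicity.

B Proofs

B.1 Proof of Theorem 2.1

Part (i): Asymptotic equivalence between $\hat\theta_T$ and $\hat\theta_T^*$.

The first order conditions that characterize $\hat\theta_T^*$ can be written $q_T^*[\hat\theta_T^*, \tilde\nu_T] = 0$ with
\[ q_T^*[\theta, \tilde\nu_T] = \frac{\partial Q_T[\theta, \tilde\nu_T]}{\partial\theta} + \frac{\partial^2 Q_T[\theta, \tilde\nu_T]}{\partial\theta\,\partial\nu'} [\nu(\theta) - \tilde\nu_T] + \frac{\partial\nu'(\theta)}{\partial\theta} \frac{\partial Q_T[\theta, \tilde\nu_T]}{\partial\nu} - \frac{\partial\nu'(\theta)}{\partial\theta}\, J_T(\theta^0)\, [\nu(\theta) - \tilde\nu_T]. \]
Thus, by comparing with (2):
\[ q_T^*[\theta, \tilde\nu_T] = q_T[\theta, \tilde\nu_T] + \frac{\partial q_T[\theta, \tilde\nu_T]}{\partial\nu'} [\nu(\theta) - \tilde\nu_T] - \xi_T(\theta) \]
with
\[ \xi_T(\theta) = \frac{\partial\nu'(\theta)}{\partial\theta} \left[ \frac{\partial^2 Q_T[\theta, \tilde\nu_T]}{\partial\nu\,\partial\nu'} + J_T(\theta^0) \right] [\nu(\theta) - \tilde\nu_T]. \]
Hence,
\[ 0 = q_T[\hat\theta_T^*, \tilde\nu_T] + \frac{\partial q_T[\hat\theta_T^*, \tilde\nu_T]}{\partial\nu'} \left[ \nu(\hat\theta_T^*) - \tilde\nu_T \right] - \xi_T(\hat\theta_T^*) \quad \text{with} \quad \xi_T(\hat\theta_T^*) = o_P\left(1/\sqrt{T}\right) \]
by virtue of Assumption A3, since $\hat\theta_T^*$ is root-$T$ consistent. We will see in Section 3 that, whenever consistent, an estimator $\mathring\theta_T$ solving $h_T(\mathring\theta_T) = 0$ with
\[ h_T(\theta) = q_T[\theta, \tilde\nu_T] + \frac{\partial q_T[\theta, \tilde\nu_T]}{\partial\nu'} [\nu(\theta) - \tilde\nu_T] \]
is asymptotically equivalent to $\hat\theta_T$. Thus, by application of Theorem 3.3 of Pakes and Pollard (1989), it is also the case for a solution $\hat\theta_T^*$ of
\[ h_T(\hat\theta_T^*) = o_P\left(1/\sqrt{T}\right). \]
Note that we can apply Theorem 3.3 of Pakes and Pollard (1989) in particular because, by virtue of Assumptions A2 and A3, $\sqrt{T}\, h_T(\theta^0)$ is asymptotically normal.

Part (ii): Asymptotic equivalence between $\hat\theta_T^*$ and $\hat\theta_T^{ext}$.

By definition, $\hat\theta_T^{ext}$ is the solution of first order conditions $g_T(\hat\theta_T^{ext}) = 0$ such that
\[ g_T(\hat\theta_T^*) = o_P\left(1/\sqrt{T}\right), \]
since $g_T(\hat\theta_T^*)$ is a $p$-dimensional vector whose component $j = 1, \ldots, p$ is
\[ \left[ \nu(\hat\theta_T^*) - \tilde\nu_T \right]' J_{j;T}(\hat\theta_T^*) \left[ \nu(\hat\theta_T^*) - \tilde\nu_T \right], \]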
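where $J_{j;T}(\theta)$ stands for the matrix of partial derivatives with respect to $\theta_j$ of all the coefficients of the matrix $J_T(\theta)$. Then, the announced asymptotic equivalence follows again by application of Theorem 3.3 of Pakes and Pollard (1989).

B.2 Proof of Proposition 3.1

Step 1: We show that $\tilde\theta_T^P$ is a consistent estimator of $\theta^0$.

By definition:
\[ 0 = q_T[\tilde\theta_T^P, \nu(\tilde\theta_T)] + \frac{\partial q_T}{\partial\nu'}[\tilde\theta_T^P, \nu(\tilde\theta_T)]\, \frac{\partial\nu}{\partial\theta'}(\tilde\theta_T)\, (\tilde\theta_T^P - \tilde\theta_T) + \alpha_T \left\| \tilde\theta_T^P - \tilde\theta_T \right\|^2 e_p. \tag{47} \]
Since the parameter space is compact, we only have to show that for any subsequence of $\tilde\theta_T^P$ that converges in probability towards some limit value $\bar\theta$, we necessarily have $\bar\theta = \theta^0$. By the regularity conditions (continuity and uniform convergence) we deduce from (47) that
\[ 0 = q_\infty[\bar\theta, \nu(\theta^0)] + \frac{\partial q_\infty}{\partial\nu'}[\bar\theta, \nu(\theta^0)]\, \frac{\partial\nu}{\partial\theta'}(\theta^0)\, (\bar\theta - \theta^0) + \mathrm{Plim}_{T=\infty}\, \alpha_T \left\| \tilde\theta_T^P - \tilde\theta_T \right\|^2 e_p. \tag{48} \]
Since $\mathrm{Plim}_{T=\infty}\, \tilde\theta_T = \theta^0$ and $\lim_{T=\infty} \alpha_T = \infty$, (48) implies that $\mathrm{Plim}_{T=\infty}\, \tilde\theta_T^P = \theta^0$.

Step 2: We show that
\[ \left\| \hat\theta_T - \tilde\theta_T^P \right\| = O_P\left( \left\| f_T(\hat\theta_T) - \tilde h_T^P(\hat\theta_T) \right\| \right) = O_P\left( \left\| \tilde h_T^P(\hat\theta_T) \right\| \right). \]
This result is a direct consequence of Robinson (1988), Theorem 1, if we can show that the function $\tilde h_T^P(\theta)$ is conformable to Robinson's Assumption A2. We have
\[ \frac{\partial \tilde h_T^P(\theta)}{\partial\theta'} = \frac{\partial q_T[\theta, \nu(\tilde\theta_T)]}{\partial\theta'} + \frac{\partial}{\partial\theta'}\left\{ \frac{\partial q_T}{\partial\nu'}[\theta, \nu(\tilde\theta_T)] \right\} \left[ \frac{\partial\nu}{\partial\theta'}(\tilde\theta_T)(\theta - \tilde\theta_T) \otimes Id_p \right] + \frac{\partial q_T}{\partial\nu'}[\theta, \nu(\tilde\theta_T)]\, \frac{\partial\nu}{\partial\theta'}(\tilde\theta_T) + 2\alpha_T\, e_p \left( \theta - \tilde\theta_T \right)', \]
where, for a $(p \times q)$ matrix $A$ whose coefficients are functions of $\theta$, we define $\partial A / \partial\theta'$ as the $(p \times qp)$ matrix
\[ \left[ \frac{\partial A_1}{\partial\theta'} \;\; \frac{\partial A_2}{\partial\theta'} \;\; \cdots \;\; \frac{\partial A_q}{\partial\theta'} \right], \]
where $A_1, A_2, \ldots, A_q$ stand for the $q$ columns of the matrix $A$. Since, by assumption, $\tilde\theta_T - \theta^0 = o_P(1/\alpha_T)$, we deduce that, under regularity conditions,
\[ \mathrm{Plim}\, \frac{\partial \tilde h_T^P(\theta^0)}{\partial\theta'} = \frac{\partial q_\infty[\theta^0, \nu(\theta^0)]}{\partial\theta'} + \frac{\partial q_\infty}{\partial\nu'}[\theta^0, \nu(\theta^0)]\, \frac{\partial\nu}{\partial\theta'}(\theta^0) = F, \]
which is by assumption a non-singular matrix. Therefore, we get Assumption A2 of Robinson (1988) under standard regularity conditions.

Step 3: We show that
\[ \left\| \hat\theta_T - \tilde\theta_T^P \right\| = O_P\left( \alpha_T \left\| \hat\theta_T - \tilde\theta_T \right\|^2 \right). \]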
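We have
\[ \begin{aligned} f_T(\hat\theta_T) &= q_T[\hat\theta_T, \nu(\hat\theta_T)] \\ &= q_T[\hat\theta_T, \nu(\tilde\theta_T)] + \frac{\partial q_T}{\partial\nu'}[\hat\theta_T, \nu(\tilde\theta_T)]\, \frac{\partial\nu}{\partial\theta'}(\tilde\theta_T)\, (\hat\theta_T - \tilde\theta_T) + O_P\left( \left\| \hat\theta_T - \tilde\theta_T \right\|^2 \right) \\ &= \tilde h_T^P(\hat\theta_T) + O_P\left( \left\| \hat\theta_T - \tilde\theta_T \right\|^2 \right) - \alpha_T \left\| \hat\theta_T - \tilde\theta_T \right\|^2 e_p. \end{aligned} \]
Therefore,
\[ f_T(\hat\theta_T) - \tilde h_T^P(\hat\theta_T) = O_P\left( \alpha_T \left\| \hat\theta_T - \tilde\theta_T \right\|^2 \right), \]
which gives the announced result by using the result of Step 2.

B.3 Proof of Theorem 3.1

We just show that the proof of Proposition 3.1 will go through with very minor changes. The proof of consistency (Step 1) is the same except that equation (48) must now be replaced by
\[ 0 = q_\infty[\bar\theta, \nu(\theta^0)] + \frac{\partial q_\infty}{\partial\nu'}[\theta^0, \nu(\theta^0)]\, \frac{\partial\nu}{\partial\theta'}(\theta^0)\, (\bar\theta - \theta^0) + \mathrm{Plim}_{T=\infty}\, \alpha_T \left\| \theta_T^{(1)} - \tilde\theta_T \right\|^2 e_p. \tag{49} \]
Obviously, the same consistency argument is a fortiori still valid. Since $\mathrm{Plim}_{T=\infty}\, \tilde\theta_T = \theta^0$ and $\lim_{T=\infty} \alpha_T = \infty$, (49) implies that $\mathrm{Plim}_{T=\infty}\, \theta_T^{(1)} = \theta^0$. With this new way to partially linearize, the Jacobian of the estimating equation is simplified as follows:
\[ \frac{\partial h_T^{(1)}(\theta)}{\partial\theta'} = \frac{\partial q_T[\theta, \nu(\tilde\theta_T)]}{\partial\theta'} + \frac{\partial q_T}{\partial\nu'}[\tilde\theta_T, \nu(\tilde\theta_T)]\, \frac{\partial\nu}{\partial\theta'}(\tilde\theta_T) + 2\alpha_T\, e_p \left( \theta - \tilde\theta_T \right)'. \]
Thus, we still have
\[ \mathrm{Plim}\, \frac{\partial h_T^{(1)}(\theta^0)}{\partial\theta'} = \frac{\partial q_\infty[\theta^0, \nu(\theta^0)]}{\partial\theta'} + \frac{\partial q_\infty}{\partial\nu'}[\theta^0, \nu(\theta^0)]\, \frac{\partial\nu}{\partial\theta'}(\theta^0) = F \]
and thus we can prove a Step 2 exactly as in Proposition 3.1. This Step 2 will tell us that
\[ \left\| \hat\theta_T - \theta_T^{(1)} \right\| = O_P\left( \left\| f_T(\hat\theta_T) - h_T^{(1)}(\hat\theta_T) \right\| \right). \]
We already know from Proposition 3.1 that
\[ f_T(\hat\theta_T) - \tilde h_T^P(\hat\theta_T) = O_P\left( \alpha_T \left\| \hat\theta_T - \tilde\theta_T \right\|^2 \right). \]
Thus, the triangle inequality will give the result if we can also show that
\[ \tilde h_T^P(\hat\theta_T) - h_T^{(1)}(\hat\theta_T) = O_P\left( \alpha_T \left\| \hat\theta_T - \tilde\theta_T \right\|^2 \right). \]
We have
\[ \tilde h_T^P(\hat\theta_T) - h_T^{(1)}(\hat\theta_T) = \left\{ \frac{\partial q_T}{\partial\nu'}[\hat\theta_T, \nu(\tilde\theta_T)] - \frac{\partial q_T}{\partial\nu'}[\tilde\theta_T, \nu(\tilde\theta_T)] \right\} \frac{\partial\nu}{\partial\theta'}(\tilde\theta_T)\, (\hat\theta_T - \tilde\theta_T). \]
Assuming that the initial estimating equations $q_T[\theta, \nu]$ are twice continuously differentiable on the interior of the compact set $\Theta \times \mathcal{N}$ (see regularity conditions in appendix), we know that:
\[ \frac{\partial q_T}{\partial\nu'}[\hat\theta_T, \nu(\tilde\theta_T)] - \frac{\partial q_T}{\partial\nu'}[\tilde\theta_T, \nu(\tilde\theta_T)] = O_P\left( \left\| \hat\theta_T - \tilde\theta_T \right\| \right). \]
Therefore
\[ \tilde h_T^P(\hat\theta_T) - h_T^{(1)}(\hat\theta_T) = O_P\left( \left\| \hat\theta_T - \tilde\theta_T \right\|^2 \right) = O_P\left( \alpha_T \left\| \hat\theta_T - \tilde\theta_T \right\|^2 \right) \]
since $\alpha_T$ goes to infinity.

B.4 Proof of Theorem 3.2

We just show that the proof of Proposition 3.1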
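will go through with some suitable changes. The proof of consistency (Step 1) is the same except that equation (48) must now be replaced by:

$$0 = q_\infty[\bar\theta, \nu(\theta^0)] + \frac{\partial q_\infty}{\partial \nu'}[\bar\theta, \nu(\theta^0)]\,\left(\nu(\bar\theta) - \nu(\theta^0)\right) + \operatorname*{Plim}_{T=\infty}\, \alpha_T \left\|\nu(\theta_T^{(2)}) - \tilde\nu_T\right\|^2 e_p \tag{50}$$

Obviously, the same kind of consistency argument is still valid. Since $\operatorname{Plim}_{T=\infty} \tilde\nu_T = \nu(\theta^0)$ and $\lim_{T=\infty} \alpha_T = \infty$, (50) implies that $\operatorname{Plim}_{T=\infty} \nu(\theta_T^{(2)}) = \nu(\bar\theta) = \nu(\theta^0)$. Therefore we must have

$$0 = q_\infty[\bar\theta, \nu(\theta^0)] = q_\infty[\bar\theta, \nu(\bar\theta)]$$

from which we deduce $\bar\theta = \theta^0$ by virtue of Assumption B2.

To get Step 2, we now compute the Jacobian of the estimating equations:

$$\frac{\partial h_T^{(2)}(\theta)}{\partial \theta'} = \frac{\partial q_T[\theta, \tilde\nu_T]}{\partial \theta'} + \frac{\partial}{\partial \theta'}\left\{\frac{\partial q_T}{\partial \nu'}[\theta, \tilde\nu_T]\right\}\left[(\nu(\theta) - \tilde\nu_T) \otimes \mathrm{Id}_p\right] + \frac{\partial q_T}{\partial \nu'}[\theta, \tilde\nu_T]\,\frac{\partial \nu}{\partial \theta'}(\theta) + 2\alpha_T\, e_p\,[\nu(\theta) - \tilde\nu_T]'\,\frac{\partial \nu}{\partial \theta'}(\theta)$$

Thus, we still have

$$\operatorname*{Plim}_{T=\infty} \frac{\partial h_T^{(2)}(\theta^0)}{\partial \theta'} = \frac{\partial q_\infty[\theta^0, \nu(\theta^0)]}{\partial \theta'} + \frac{\partial q_\infty}{\partial \nu'}[\theta^0, \nu(\theta^0)]\,\frac{\partial \nu}{\partial \theta'}(\theta^0) = F$$

and thus we can prove a Step 2 exactly as in Proposition 3.1. This Step 2 will tell us that

$$\left\|\hat\theta_T - \theta_T^{(2)}\right\| = O_P\left(\left\|f_T(\hat\theta_T) - h_T^{(2)}(\hat\theta_T)\right\|\right).$$

To get the announced result, we now (Step 3) need to show that

$$\left\|f_T(\hat\theta_T) - h_T^{(2)}(\hat\theta_T)\right\| = O_P\left(\alpha_T \left\|\nu(\hat\theta_T) - \tilde\nu_T\right\|^2\right).$$

We have

$$\begin{aligned}
f_T(\hat\theta_T) &= q_T[\hat\theta_T, \nu(\hat\theta_T)] \\
&= q_T[\hat\theta_T, \tilde\nu_T] + \frac{\partial q_T}{\partial \nu'}[\hat\theta_T, \tilde\nu_T]\,\left(\nu(\hat\theta_T) - \tilde\nu_T\right) + O_P\left(\left\|\nu(\hat\theta_T) - \tilde\nu_T\right\|^2\right) \\
&= h_T^{(2)}(\hat\theta_T) + O_P\left(\left\|\nu(\hat\theta_T) - \tilde\nu_T\right\|^2\right) - \alpha_T \left\|\nu(\hat\theta_T) - \tilde\nu_T\right\|^2 e_p
\end{aligned}$$

which gives the announced result, since $\alpha_T$ goes to infinity.

C Tables

This section reports the Monte Carlo results from the simulation experiments in Section 4.2.2 and Section 5.2.1. In the tables below, MBP stands for the maximization by parts estimator, P-TS$_1$ is the fully penalized two-step estimator, P-TS$_2$ is the two-step estimator with partial penalization, TS$_1$ is the two-step estimator without penalization, and TS$_2$ is the simplified two-step estimator.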
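As a concrete illustration of the fully penalized two-step estimator (P-TS$_1$) listed above, the following minimal sketch solves a scalar analogue of the penalized, partially linearized equation (47). The toy model is an assumption made purely for illustration and is not the simulation design of the paper: the estimating function is $q_T[\theta, \nu] = \bar x - \theta - \nu$, the hypothetical awkward occurrence is $\nu(\theta) = \theta^3$, the first-step estimator uses only half of the sample, and the penalty weight is $\alpha_T = T^{1/4}$. With $p = 1$ the vector $e_p$ reduces to 1.

```python
import random


def nu(theta):
    """Hypothetical 'awkward' occurrence nu(theta); chosen only for this toy example."""
    return theta ** 3


def dnu(theta):
    """Derivative of nu with respect to theta."""
    return 3.0 * theta ** 2


def newton(g, dg, start, tol=1e-12, max_iter=100):
    """Plain Newton iteration for a scalar equation g(x) = 0."""
    x = start
    for _ in range(max_iter):
        step = g(x) / dg(x)
        x -= step
        if abs(step) < tol:
            break
    return x


def full_estimator(xbar):
    """Solve f_T(theta) = xbar - theta - nu(theta) = 0 directly."""
    return newton(lambda t: xbar - t - nu(t), lambda t: -1.0 - dnu(t), 0.0)


def penalized_two_step(xbar, theta_tilde, alpha):
    """Solve the scalar analogue of (47); here dq/dnu = -1 and e_p = 1."""
    v, slope = nu(theta_tilde), dnu(theta_tilde)

    def h(t):
        d = t - theta_tilde
        # q[t, nu(theta_tilde)] + (dq/dnu) * dnu(theta_tilde) * d + alpha * d^2
        return (xbar - t - v) - slope * d + alpha * d * d

    def dh(t):
        return -1.0 - slope + 2.0 * alpha * (t - theta_tilde)

    # Start at the first-step estimate so Newton picks the root closest to it.
    return newton(h, dh, theta_tilde)


if __name__ == "__main__":
    random.seed(0)
    theta0, T = 0.5, 2000
    x = [theta0 + nu(theta0) + random.gauss(0.0, 1.0) for _ in range(T)]
    xbar = sum(x) / T
    # First step: same equation, solved on the first half of the sample only.
    theta_tilde = full_estimator(sum(x[: T // 2]) / (T // 2))
    alpha = T ** 0.25  # penalty weight proportional to T^(1/4)
    print("first step :", theta_tilde)
    print("full       :", full_estimator(xbar))
    print("penalized  :", penalized_two_step(xbar, theta_tilde, alpha))
```

In this sketch the penalized equation is a quadratic in $\theta$, so it has two roots; starting Newton's iteration at $\tilde\theta_T$ selects the root near the first-step estimate, which is the one the consistency argument of Step 1 refers to.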
Table 1: $\mu_{100}$ is the estimated mean, times 100, for the different estimators, MSE$_{100}$ is the Monte Carlo mean squared error, times 1000, and MAE$_{100}$ is the Monte Carlo mean absolute error, times 1000. The penalization parameter was taken proportional to $T^{1/4}$. The parameters $\alpha_1$, $\alpha_2$ were set equal to .1 and 1 across all sample sizes. The table below fixes $\rho = .75$.

Columns: MBP, P-TS$_1$, P-TS$_2$, TS$_1$, TS$_2$ (each with sub-columns $\alpha_1$, $\alpha_2$, $\rho$)

10.0389 100.2121 75.3633 10.03671 100.1963 75.3633 10.0387 100.2106 74.6152 10.03615 100.1919 75.3727 10.0366 100.1881 75.3606
1.0375 103.4928 430.1764 1.0369 103.4614 398.9833 1.0375 103.4881 430.1701 1.0367 103.4473 398.5865 1.0429 103.3290 399.0850
81.2189 9.9658 801.4590 99.6928 2038.4899 74.9160 81.2033 9.9636 801.4120 99.6764 1963.6687 74.9160 81.2172 9.9657 801.4460 99.6922 2038.4766 74.2100 81.1981 9.9634 801.3737 99.6749 1962.7233 74.9204 81.3085 9.9641 801.1743 99.6738 1963.9320 74.9131
0.5655 54.0369 439.7444 0.5650 54.0001 410.1344 0.5655 54.0372 439.7422 0.56507 54.0004 409.9491 0.5694 54.0119 410.2502
10.0047 99.6255 75.0674 59.8332 10.0071 605.4414 99.6437 2008.6820 75.0646 59.6107 10.0048 605.3864 99.6264 2008.3948 75.0646 59.6264 10.0071 605.6657 99.6434 2078.9956 74.3475 59.6130 605.3785 2007.9506 59.8332 10.0058 605.4414 99.6216 2008.6820 75.0615
0.3413 34.5841 431.9198 0.34109 34.5723 402.2031 0.3413 34.5842 431.9187 0.34109 34.5726 402.0887 0.3432 34.5963 402.3241
46.2821 469.1803 2065.2523 46.2837 469.1192 1993.5322 46.2821 469.1822 2065.2499 46.2839 469.1240 1993.2528 46.3782 469.3129 1993.8420
Row labels ($\rho = .75$): $\mu_{100}$, MSE$_{100}$, MAE$_{100}$, $\mu_{200}$, MSE$_{200}$, MAE$_{200}$, $\mu_{300}$, MSE$_{300}$, MAE$_{300}$

Table 2: $\mu_{100}$ is the estimated mean, times 100, for the different estimators, MSE$_{100}$ is the Monte Carlo mean squared error, times 1000, and MAE$_{100}$ is the Monte Carlo mean absolute error, times 1000. The penalization parameter was taken proportional to $T^{1/4}$. The parameters $\alpha_1$, $\alpha_2$ were set equal to .1 and 1 across all sample sizes. The table below fixes $\rho = .95$.

Columns: MBP, P-TS$_1$, P-TS$_2$, TS$_1$, TS$_2$ (each with sub-columns $\alpha_1$, $\alpha_2$, $\rho$)

10.0381 100.3489 95.0797 10.0361 100.3309 95.0797 10.0381 100.3489 94.8122 10.0361 100.3307 95.0799 10.0368 100.3326 95.0740
1.0382 101.3217 0.8194 1.0380 101.3135 0.5909 1.0382 101.3216 0.8194 1.03802 101.3133 0.5908 1.0436 101.6088 0.5900
9.9679 99.5895 94.9604 81.2629 802.9575 72.4880 81.2689 802.8293 61.3679 81.2629 802.9570 72.4880 9.9645 99.6126 94.9894 9.9624 99.5939 94.9894 9.9645 99.6126 94.7308 81.2689 9.9624 802.8277 99.5938 61.3661 94.9895 81.2541 803.7984 61.2780
0.5656 56.7682 0.4688 0.5655 56.7602 0.2910 0.5656 56.7682 0.4688 0.5655 56.7603 0.2909 0.5843 57.6078 0.2925
10.0062 99.8697 94.7739 10.0041 99.8505 95.0280 60.4423 10.0062 608.1178 99.8697 43.7780 95.0280 59.6428 10.0041 604.8638 99.8505 43.5963 95.0280 59.6420 604.96135 54.3549 59.6429 604.8641 43.5954 60.4423 10.0609 608.1178 99.8313 43.7780 94.5256
46.3441 468.8434 46.7765 46.3435 468.8306 37.6353 0.34205 35.0885 0.3468 46.3440 468.8434 46.7766 0.3419 46.3434 35.0822 468.8304 0.2124 37.6357 0.3420 35.0885 0.3468 0.3419 35.0822 0.2124 0.5784 61.1738 59.6184 615.5324 0.5025 56.1719
Row labels ($\rho = .95$): $\mu_{100}$, MSE$_{100}$, MAE$_{100}$, $\mu_{200}$, MSE$_{200}$, MAE$_{200}$, $\mu_{300}$, MSE$_{300}$, MAE$_{300}$

Table 3: $\mu_{100}$ is the estimated mean, times 100, for the different estimators, MSE$_{100}$ is the Monte Carlo mean squared error, times 1000, and MAE$_{100}$ is the Monte Carlo mean absolute error, times 1000. The penalization parameter was taken proportional to $T^{1/4}$. The parameters $\alpha_1$, $\alpha_2$ were set equal to .1 and 1 across all sample sizes. The table below fixes $\rho = .985$.

Columns: MBP, P-TS$_1$, P-TS$_2$, TS$_1$, TS$_2$ (each with sub-columns $\alpha_1$, $\alpha_2$, $\rho$)

10.0367 100.3582 98.5239 10.0357 100.3488 99.8524 10.0367 100.3582 98.4358 10.0357 100.3488 98.5239 10.0372 100.8034 98.0858
1.0393 102.2073 0.0801 1.0392 102.2061 0.0541 1.0393 102.2073 0.0801 1.0392 102.2060 0.0541 1.2314 123.5412 0.2934
81.2839 804.7325 22.7917 81.2881 804.7207 18.6253 81.2839 804.7324 22.7917 81.2881 804.7207 18.6253 88.7035 879.2445 43.3899
9.9631 99.6056 98.4977 9.96217 99.5960 98.4977 9.9631 99.6055 98.4122 9.9621 99.5960 98.4977 10.1446 100.1246 96.4156
112.0659 1108.4347 208.4360 0.5662 57.0553 0.0457 0.5662 57.0553 0.0263 0.5662 57.0552 0.0457 10.0039 99.9360 98.5098 59.6806 600.3720 16.9301 59.6819 600.34731 13.1029 10.0049 99.9456 98.5098 10.0039 99.9361 98.5098
0.3421 34.8572 0.0331 0.3421 34.8534 0.0193 0.3421 34.8571 0.0330 0.3421 34.8534 0.0193
10.2592 2.4063 01.5723 242.56481 94.8641 13.5292 59.6806 10.0049 600.3720 99.9456 16.9308 98.42705 0.5662 59.6819 57.0553 600.34733 0.0263 13.1028 1.7055 165.4135 4.6093
46.3561 468.5121 14.5022 46.3557 468.471 11.3652 46.3560 468.5121 14.5021 46.3557 468.4711 11.3651 143.4284 1439.3493 363.5840

Row labels ($\rho = .985$): $\mu_{100}$, MSE$_{100}$, MAE$_{100}$, $\mu_{200}$, MSE$_{200}$, MAE$_{200}$, $\mu_{300}$, MSE$_{300}$, MAE$_{300}$

Table 4: Relative computing time, in seconds, and number of iterations for MBP (in brackets).
               T=100                          T=200                          T=300
               MBP          P-TS    TS        MBP          P-TS    TS        MBP          P-TS    TS
$\rho = .75$   0.0119 [3]   0.0203  0.0207    0.0152 [4]   0.0193  0.0199    0.0169 [4]   0.0184  0.0187
$\rho = .95$   0.0260 [6]   0.0166  0.0164    0.0450 [16]  0.0141  0.0167    0.1001 [42]  0.0143  0.0160
$\rho = .985$  0.0922 [43]  0.0137  0.0142    0.1005 [44]  0.0138  0.0148    0.1018 [45]  0.0145  0.0149

Table 5: Results for the two-step (TS$_1$), penalized two-step (P-TS$_1$) and MBP estimators in the Merton credit risk model. MAE is the median absolute error across the simulations multiplied by 100, and RMSE is the root mean squared error across the simulations multiplied by 100.

            TS                  P-TS                MBP
            T=250    T=500      T=250    T=500      T=250    T=500
Parameter   = 0.09   = 0.09     = 0.09   = 0.09     = 0.09   = 0.09
Median      0.0889   0.0897     0.0895   0.0898     0.0892   0.0894
Mean        0.0888   0.0895     0.0890   0.0895     0.0888   0.0898
MAE         6.1099   4.7301     5.9292   4.6715     8.1746   6.7727
RMSE        7.7331   5.8404     7.5341   5.6284     9.8406   6.6129
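The summary statistics reported in the tables can be computed from a vector of Monte Carlo estimates along the following lines. This is a minimal sketch under the scaling conventions stated in the captions (mean times 100 and MSE and MAE times 1000 for Tables 1 to 3; median absolute error and RMSE times 100 for Table 5). The function name `monte_carlo_summary` and the dictionary keys are hypothetical, not taken from the paper.

```python
from math import sqrt
from statistics import mean, median


def monte_carlo_summary(estimates, true_value):
    """Summarize Monte Carlo replications of a scalar estimator (illustrative helper)."""
    errors = [est - true_value for est in estimates]
    mse = mean([e * e for e in errors])
    return {
        # Conventions of Tables 1 to 3
        "mean_x100": 100.0 * mean(estimates),
        "mse_x1000": 1000.0 * mse,
        "mean_abs_err_x1000": 1000.0 * mean([abs(e) for e in errors]),
        # Conventions of Table 5
        "median": median(estimates),
        "median_abs_err_x100": 100.0 * median([abs(e) for e in errors]),
        "rmse_x100": 100.0 * sqrt(mse),
    }


if __name__ == "__main__":
    # Three made-up replications of an estimator of a parameter equal to 0.2.
    for name, value in monte_carlo_summary([0.1, 0.2, 0.3], 0.2).items():
        print(name, value)
```

Note that "MAE" denotes the mean absolute error in Tables 1 to 3 but the median absolute error in Table 5, so the two are kept under separate keys here.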