Efficiency of Weighted Average Derivative Estimators

Whitney K. Newey
Thomas M. Stoker

No. 92-8
April 1992
Economics Working Paper #92-8

EFFICIENCY OF WEIGHTED AVERAGE DERIVATIVE ESTIMATORS
AND INDEX MODELS

by Whitney K. Newey*
and Thomas M. Stoker**

June 1988, revised April, 1992

*Department of Economics
Massachusetts Institute of Technology
Cambridge, MA 02139

**Sloan School of Management
Massachusetts Institute of Technology
Cambridge, MA 02139
An earlier version of this paper was presented at the 1988 summer meeting of the Econometric Society at the University of Minnesota. Financial support was provided by the NSF and the Sloan Foundation. Helpful comments were provided by referees and G. Chamberlain, L. Hansen, R. Matzkin, and D. McFadden.
ABSTRACT

Weighted average derivatives are useful parameters for semiparametric index models, including limited dependent variables, and in nonparametric demand estimation. Efficiency of weighted average derivative estimators is a concern, because the weight may affect efficiency and the presence of a nonparametric function estimator might lead to low efficiency. The purpose of this paper is to give efficiency results for average derivative estimators, including formulating estimators that have high efficiency.

Our efficiency analysis proceeds by deriving the semiparametric efficiency bounds for average derivative and index models, and then formulating average derivative estimators with high efficiency. We first derive the bound for weighted average derivatives of conditional location functionals, such as the conditional mean and median. This bound gives the asymptotic distribution of any sufficiently well-behaved weighted average derivative estimator. We compare efficiency for different location measures, e.g. mean versus median, finding that the comparison is similar to the location model. Many semiparametric limited dependent variable and regression models take the form of index models, where the location measure (e.g. conditional mean) depends only on linear combinations of the regressors, i.e. on "indices." We derive the semiparametric efficiency bound for these index models. We then compare the bound with the asymptotic variance of weighted average derivative estimators of the index coefficients. We derive the form of an efficient weight function when the distribution of the regressors is elliptically symmetric and discuss existence of optimal weights. We discuss combining different weights to achieve efficiency, in the process deriving a general condition for approximate efficiency of a pooled minimum chi-square estimator in a semiparametric model. Also, we discuss ways the type and number of weights could be selected to achieve high efficiency in practice.

Keywords: average derivative, index model, efficiency bound, optimal weights, minimum chi-square, spanning condition.
1. Introduction

Average derivatives are useful parameters in a number of semiparametric models. As discussed in Stoker (1986), they can be used in estimation of index models, including limited dependent variable models and partially linear regression models. Also, they are used in nonparametric demand estimation, as in Hardle, Hildenbrand, and Jerison (1991). Efficiency of average derivative estimators is a concern, because there are several types that have been proposed. Also, the presence of a nonparametric function estimator might lead to low efficiency. The purpose of this paper is to give efficiency results for average derivative estimators, including formulating estimators that have high efficiency.

Our efficiency analysis proceeds by deriving the semiparametric efficiency bounds for average derivative and index models, and then formulating average derivative estimators with high efficiency. We first derive the bound for weighted average derivatives of conditional location functionals, such as the conditional mean and median. For the conditional mean the previously suggested estimators of Hardle and Stoker (1989) and Stoker (1991a) attain this bound. This result is what one would anticipate, because this bound places no restrictions on the data distribution, and in such cases any estimator that is asymptotically equivalent to a sample average and sufficiently regular will be efficient (e.g. see Newey, 1990a). We also give the bound for other location functionals such as the median. We find that when the efficiency of average derivative estimators for different location functionals can be compared, the comparison is similar to that for location models, e.g. with the average derivative conditional median being more efficient than the conditional mean for "fat-tailed" distributions.
Many semiparametric limited dependent variable and regression models take the form of index models, where the location measure (e.g. conditional mean) depends only on linear combinations of the regressors, i.e. on "indices." We derive the semiparametric efficiency bound for these index models. We then compare the bound with the asymptotic variance of weighted average derivative estimators of the index coefficients. In an index model, consistent estimators arise from the use of different weighting functions, so an important efficiency question is the choice of weights. We derive the form of an efficient weight function when the distribution of the regressors is elliptically symmetric. It is also shown that linearity of certain conditional expectations is necessary for existence of an efficient weight. We discuss the possibility of combining different weights to achieve efficiency, showing that it is possible to obtain an approximately efficient estimator by pooling using minimum chi-square. We give general results on when such pooling will lead to efficiency in semiparametric models and on how the number of estimators to combine can be chosen from the data. These results are specialized to derive conditions for achieving approximate efficiency from combining many weighted average derivative estimators. Also, we suggest ways of combining a few weighted average derivative estimators so as to achieve high efficiency.
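The minimum chi-square pooling idea can be illustrated in a simple finite-dimensional setting. The sketch below is our illustration, not the paper's estimator: it combines two estimators of the same scalar parameter with a known (hypothetical) joint covariance matrix, using the minimum chi-square (generalized least squares) weighting.

```python
import numpy as np

def min_chi_square_pool(estimates, cov):
    """Pool several estimates of one scalar parameter by minimum chi-square:
    minimize (d - delta*1)' cov^{-1} (d - delta*1) over delta."""
    d = np.asarray(estimates, dtype=float)
    ones = np.ones_like(d)
    w = np.linalg.solve(cov, ones)      # cov^{-1} 1
    delta = w @ d / (w @ ones)          # GLS (minimum chi-square) combination
    variance = 1.0 / (w @ ones)         # variance of the pooled estimate
    return delta, variance

# Two estimators of the same delta, variances 1.0 and 2.0, correlation 0.3.
cov = np.array([[1.0, 0.3], [0.3, 2.0]])
pooled, var = min_chi_square_pool([0.9, 1.3], cov)
# The pooled variance is never larger than the best individual variance.
```

The pooled variance here works out to about 0.80, below the smaller individual variance of 1.0, which is the sense in which pooling different weightings can raise efficiency.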
Other papers give some efficiency results on average derivatives or index models. Chamberlain (1987) previously derived the semiparametric efficiency bound for conditional mean, single index models.(1) Following our initial work, Samarov (1990) gave the efficiency bound for the (unweighted) average derivative of the conditional mean. Newey (1991a) gives efficiency bounds for linear functionals of mean-square projections (that includes average derivatives of conditional means as a special case). None of these papers gives regularity conditions for the bounds.(2) Recently Hall and Ichimura (1991) derived some efficiency results for index estimators when there is a residual that is independent of the regressors.

(1) We cite the working paper Chamberlain (1987) because the index model bound does not appear in the published version.
(2) To be precise, they do not exhibit a sequence of regular parametric submodels for which the Cramer-Rao bound for the submodel approximates the candidates for the bound they suggest.
2. Weighted Average Derivatives and Partial Index Models

Let y denote a dependent variable, x a k × 1 vector of regressors, ρ(ε) a loss function of a real-valued variable, and

(2.1)    g(x) = argmin_g E[ρ(y − g)|x].

Here g(x) is a conditional location function. Examples include the conditional mean for ρ(ε) = ε², the conditional median for ρ(ε) = |ε|, as well as other more exotic location functionals such as quantiles or expectiles. Our interest is in properties of estimators of weighted average derivatives of g(x). Partition x as x = (x₁ᵀ, x₂ᵀ)ᵀ, suppose that x₁ is continuously distributed, and for a function a(x) let a′(x) = ∂a(x)/∂x₁. A weighted average derivative of g(x) is

(2.2)    δ = E[w(x)g′(x)],

where w(x) is a scalar function. For the average derivative to be well defined, x₁ must be continuously distributed, but x₂ can be discrete.
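As a concrete instance of definition (2.2), the following sketch (our illustration, with a hypothetical choice of g and w) checks a weighted average derivative by Monte Carlo against its closed form: for x ~ N(0,1), g(x) = exp(x) and w(x) = 1, so that δ = E[g′(x)] = E[exp(x)] = exp(1/2).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)

g_prime = np.exp(x)            # g(x) = exp(x) implies g'(x) = exp(x)
w = np.ones_like(x)            # unit weight

delta_mc = np.mean(w * g_prime)    # Monte Carlo estimate of E[w(x)g'(x)]
delta_true = np.exp(0.5)           # E[exp(x)] = exp(1/2) for x ~ N(0,1)
```

With 200,000 draws the Monte Carlo average is within a small fraction of a percent of exp(1/2).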
A primary motivation for weighted average derivatives is a partial index model, with

(2.3)    g(x) = G(x₁ᵀβ, x₂).

Under this model, δ = E[w(x)∂G(x₁ᵀβ, x₂)/∂(x₁ᵀβ)]·β, so that the weighted average derivative is proportional to β, and hence can be used to estimate β up to scale. Rodriguez and Stoker (1992) have recently used this model in specification analysis for estimation of conditional means.

The motivation for the index model where g(x) is the conditional mean and x₂ is not present is well known (e.g. see Stoker, 1986). Other cases can be motivated by a variety of semiparametric regression models.
In particular, suppose that

(2.4)    y|x ∼ y|(x₁ᵀβ, x₂),

a conditional distribution index model that allows y to have a conditional mean and variance (and other moments) that depend on x₂ in an arbitrary way. It is implied by many more restrictive models. For instance, it is implied by y = τ(x₁ᵀβ + σ(x₂)v), where v is independent of x and τ(r) is a transformation that can be either known or unknown, a model analyzed in Newey (1990b). The transformation could be τ(r) = 1(r > 0), corresponding to a binary choice model that allows for heteroskedasticity to depend on x₂ in an arbitrary way, or it could be τ(r) = ℓ⁻¹(r), corresponding to a model ℓ(y) = x₁ᵀβ + η that is important for duration data. This model implies that equation (2.3) is satisfied, so that partial index models are implied by transformed, semiparametric regression models that allow heteroskedasticity to depend nonparametrically on some regressors.
For the semiparametric model in equation (2.4), the choice of loss function ρ(ε) can be motivated by efficiency considerations similar to those for the linear model, namely that if the distribution of y given x is fat-tailed, then a more efficient estimator might be obtained by working with a ρ(ε) that gives less weight to large values of ε. This feature will become apparent from the semiparametric efficiency bounds derived below. Also, comparison of parameters from different choices of ρ(ε) may allow one to test restrictions on the conditional distribution of y given x, similarly to Koenker and Bassett (1982).
In some cases it is possible to weaken equation (2.4) so that essentially only one choice of ρ(ε) will produce a partial index model. An important such case is

(2.5)    y = τ(x₁ᵀβ + u(x₂) + v, x₂),    median(v|x) = 0,

where τ(r, x₂) is monotonic in r. Because the median of a monotonic transformation is the transformation of the median, this is a partial index model for the conditional median, where ρ(ε) = |ε|, but not for other conditional location measures. This model is a generalization of one considered by Powell (1991). For the case where x₂ is not present, Doksum and Samarov (1992) have suggested using average derivative estimators to estimate the inverse of τ when τ is monotonic.
The primary purpose of this paper is to develop the efficiency properties of weighted average derivative and partial index estimators, and discuss how and when different weighted average derivative estimators can be combined into an approximately efficient estimator of a partial index model. It is beyond the scope of this paper to discuss the properties of particular estimators, although the results here have implications for the asymptotic properties of average derivative estimators. As discussed in Newey (1991a), the semiparametric efficiency bound for an unrestricted functional, such as an average derivative, is the asymptotic variance of any sufficiently regular estimator. Thus, one would expect the asymptotic variance of average derivative estimators to have the form given below.
For instance, consider the following kernel estimator that is well-defined even when ρ(ε) is not smooth (e.g. for ρ(ε) = |ε|). Let f(x) be the density of x. By integration by parts,

(2.6)    δ = E[ℓ(x)g(x)],    ℓ(x) = −w′(x) − w(x)f′(x)/f(x).

For a kernel K(v), let K_h(x) = h⁻ᵏK(x/h), let f̂(x) = Σ_{i=1}^n K_h(x − x_i)/n be a kernel density estimator, and let ĝ(x) = argmin_g Σ_{i=1}^n K_h(x − x_i)ρ(y_i − g)/n be the kernel location estimator of Tsybakov (1982). Then an estimator of δ corresponding to equation (2.6) is

(2.7)    δ̂ = (Σ_{i=1}^n w(x_i)/n)[Σ_{i=1}^n ℓ̂(x_i)x_{1i}ᵀ/n]⁻¹ Σ_{i=1}^n ℓ̂(x_i)ĝ(x_i)/n,    ℓ̂(x) = −w′(x) − w(x)f̂′(x)/f̂(x).

For ρ(ε) = ε², this estimator is similar to those analyzed in Stoker (1991a).(3)

(3) The two leading terms form a nonparametric estimator of the identity matrix, and so do not contribute to the asymptotic variance, but they lead to a reduction in the severe finite sample bias in the estimator, as discussed in Stoker (1991b).
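A minimal numerical sketch of a kernel-based weighted average derivative estimator for the conditional mean case (ρ(ε) = ε²). This is our illustration rather than the exact estimator in (2.7): it estimates g′(x) directly by the slope of a local linear regression and averages w(x_i)ĝ′(x_i); the bandwidth, the linear data-generating design, and w(x) = 1 are hypothetical choices.

```python
import numpy as np

def local_linear_slope(x0, x, y, h):
    """Slope of a Gaussian-kernel-weighted linear fit of y on x at point x0."""
    k = np.exp(-0.5 * ((x - x0) / h) ** 2)        # kernel weights
    X = np.column_stack([np.ones_like(x), x - x0])
    W = k[:, None] * X
    beta = np.linalg.solve(X.T @ W, W.T @ y)      # weighted least squares
    return beta[1]

rng = np.random.default_rng(1)
n, h = 2000, 0.3
x = rng.standard_normal(n)
y = 2.0 * x + 0.1 * rng.standard_normal(n)        # g(x) = 2x, so g'(x) = 2

slopes = np.array([local_linear_slope(xi, x, y, h) for xi in x])
delta_hat = np.mean(slopes)                       # w(x) = 1
```

Because g is linear here, the averaged local slopes recover the population average derivative of 2 closely; for nonlinear g the bandwidth choice matters much more.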
An alternative approach is to use a series estimator with smooth approximating functions. Let Pᴷ(x) be a K × 1 vector of differentiable functions, and suppose that linear combinations of this vector can approximate functions and their derivatives. Consider the estimator

(2.8)    δ̂ = Σ_{i=1}^n w(x_i)ĝ′(x_i)/n,    ĝ(x) = π̂ᵀPᴷ(x),    π̂ = argmin_π Σ_{i=1}^n ρ(y_i − πᵀPᴷ(x_i)).

This estimator is based on differentiating the series estimator ĝ(x) of g(x). For the conditional mean case this estimator is analyzed in Newey (1991a).
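A matching sketch of the series approach in (2.8) for the conditional mean case, with a polynomial basis; the data-generating design and the degree of the basis are our hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.standard_normal(n)
y = x ** 3 + 0.1 * rng.standard_normal(n)      # g(x) = x^3, so g'(x) = 3x^2

coef = np.polyfit(x, y, deg=4)                 # pi-hat: least squares series fit
dcoef = np.polyder(coef)                       # derivative of the fitted series
delta_hat = np.mean(np.polyval(dcoef, x))      # w(x) = 1: average of g'-hat(x_i)
# Population target: delta = E[3 x^2] = 3 for x ~ N(0,1).
```

The polynomial basis nests x³ exactly, so the averaged derivative is close to the population value 3 up to sampling noise.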
Regularity conditions for asymptotic normality of these estimators are beyond the scope of this paper. Nevertheless, the results of this paper do provide a formula for the asymptotic variance of these estimators. As discussed in Section 3, the semiparametric bound we derive will be the asymptotic variance of any estimator that is asymptotically equivalent to a sample average and sufficiently regular.
3. Efficiency for Weighted Average Derivative Estimation

In this Section we derive the semiparametric efficiency bound for weighted average derivative estimators. The average derivative is an unrestricted parameter, in that its definition places no substantive restrictions on the data distribution. The efficiency bound for estimators of such unrestricted parameters can be calculated as the variance of the pathwise derivative of the parameter with respect to the distribution of the data, as shown in Pfanzagl and Wefelmeyer (1982).
To discuss the pathwise derivative it is useful to introduce some terminology. Let z = (y, xᵀ)ᵀ denote a single observation and f(z) the true density of z (with respect to some dominating measure). A "parametric submodel" or "path" is a parametric family of densities f(z|θ) (with respect to some dominating measure) that pass through the truth, i.e. such that f(z|θ₀) = f(z) for some parameter value θ₀. Let δ(θ) denote the value of the average derivative in equation (2.2) when z is distributed as f(z|θ), and let S_θ(z) = ∂ln f(z|θ)/∂θ|_{θ=θ₀} denote the score of f(z|θ) at θ₀. The pathwise derivative is precisely defined, e.g., in Newey (1990a), where typically ψ(z) is a function with finite mean-square (i.e. E[ψ(z)ᵀψ(z)] < ∞) such that E[ψ(z)] = 0 and, for any sufficiently regular submodel,

(3.1)    ∂δ(θ)/∂θ|_{θ=θ₀} = E[ψ(z)S_θ(z)ᵀ].
o
When the
z
distribution of
is
unrestricted, as
here,
is
it
it
is
typically possible to
show that a parametric submodel can be chosen so that the score
function of
z
with mean zero and finite mean-square.
asymptotic variance of semiparametric estimators of
V =
(3.2)
E[0(z)0(z)
formula
Cramer-Rao bound for a submodel
V
when
Another interpretation of
estimator
(3.3)
8,
is
Ed/»(z)S Q (z)
G
T
is
all
9
is
The
parametric submodels.
](E[S Q (z)S Q (z)
This matrix
S (z)
^(z)
V
that by the reasoning of Stein (1956),
is
and the "delta-method."
approximately equal to
is
].
should be the supremum of Cramer-Rao bounds over
(3.1)
a lower bound on the
In that case,
8
approximates any
T
Briefly, the idea behind this
equation
S (z)
T
_1
])
E[S Q (z)i//(z)
bounded above by
approximately equal to
T
by
]
9
Another interpretation of V is as the asymptotic variance of an efficient estimator, with ψ(z) as the influence function of an efficient estimator δ̂ satisfying

(3.3)    √n(δ̂ − δ) = Σ_{i=1}^n ψ(z_i)/√n + o_p(1).

The "influence function" terminology, which originated in the robust estimation literature, is motivated by the fact that in large samples ψ(z) approximately gives the effect of a single observation on δ̂. It is a convenient way to think about the asymptotic properties of √n-consistent estimators, because most will satisfy equation (3.3) for some influence function (e.g. mean average derivative estimators, as shown in Stoker, 1991a). By the central limit theorem the asymptotic variance of δ̂ will be E[ψ(z)ψ(z)ᵀ], equal to the bound when the influence function equals the pathwise derivative in equation (3.2).
In general, any influence function will be a pathwise derivative, i.e. if an estimator satisfies equation (3.3) for some ψ(z), and certain regularity conditions hold, then its influence function satisfies equation (3.1) (e.g. see Newey, 1990a). When the model imposes no substantive restrictions on the data distribution, the set of scores is unrestricted (except for having mean zero), as it is in this Section. Therefore, there can be at most one pathwise derivative, and hence at most one influence function.(4) As discussed in Newey (1991a), this fact can be used to find the influence function of any estimator, by calculating the pathwise derivative of the parameter that is estimated under unrestricted distributions. In particular, the influence function of an average derivative estimator will be equal to the pathwise derivative derived below, because we impose no substantive restrictions on the data distribution. Thus, we are justified in interpreting V not only as the variance bound for average derivative estimators but also as the asymptotic variance of any (regular) average derivative estimator satisfying equation (3.3). For instance, the bound derived below will be the asymptotic variance of the kernel and series estimators suggested at the end of Section 2, under the plausible assumptions that they satisfy equation (3.3) and are regular.

Because the form of the pathwise derivative is the main result of this Section, we will first give this form and discuss it, postponing the regularity conditions until the end of the Section.

(4) For two different influence functions ψ(z) and ψ*(z), choosing S_θ(z) (approximately) equal to ψ*(z) − ψ(z) and differencing equation (3.1) implies 0 = E[{ψ*(z) − ψ(z)}ᵀS_θ(z)] = E[{ψ*(z) − ψ(z)}ᵀ{ψ*(z) − ψ(z)}].
Let ε = y − g(x) and

(3.4)    m(ε) = dρ(ε)/dε,    ν(x) = ∂E[m(y − g)|x]/∂g|_{g=g(x)},    u = −ν(x)⁻¹m(y − g(x)),

and let ℓ(x) be as given in equation (2.6). Theorem 3.1 below shows that the pathwise derivative is

(3.5)    ψ(z) = w(x)g′(x) − δ + ℓ(x)u = w(x)g′(x) − δ − ℓ(x)ν(x)⁻¹m(ε).

The semiparametric variance bound is then

(3.6)    V = E[ψ(z)ψ(z)ᵀ].

Two interesting special cases are the conditional mean and median. For the mean, u = y − g(x), so that

(3.7)    ψ(z) = w(x)g′(x) − δ + ℓ(x)[y − g(x)].

For the median, u = [2f(0|x)]⁻¹sgn(ε) for sgn(ε) = 1(ε > 0) − 1(ε < 0), where f(0|x) is the conditional density of ε given x at ε = 0, so that

(3.8)    ψ(z) = w(x)g′(x) − δ + ℓ(x)[2f(0|x)]⁻¹sgn(ε).

In general, the variance bound can be decomposed into two terms,

E[ψ(z)ψ(z)ᵀ] = Var(w(x)g′(x)) + E[u²ℓ(x)ℓ(x)ᵀ],

where the two terms are uncorrelated because E[u|x] = 0.
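As a consistency check on the definitions in (3.4), the mean and median cases can be worked out directly (our derivation):

```latex
% Conditional mean: rho(e) = e^2
m(\varepsilon) = 2\varepsilon, \qquad
\nu(x) = \partial E[2(y-g)\,|\,x]/\partial g = -2, \qquad
u = -\nu(x)^{-1} m(\varepsilon) = y - g(x).

% Conditional median: rho(e) = |e|
m(\varepsilon) = \mathrm{sgn}(\varepsilon), \qquad
\nu(x) = \partial E[\mathrm{sgn}(y-g)\,|\,x]/\partial g = -2 f(0|x), \qquad
u = [2 f(0|x)]^{-1}\,\mathrm{sgn}(\varepsilon).
```

Substituting each u into (3.5) gives the influence functions (3.7) and (3.8).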
The first term is the asymptotic variance of Σ_{i=1}^n w(x_i)g′(x_i)/n, and is the bound when g(x) is known. The second term is the bound when g(x) is unknown but the distribution f(x) of x is known, by the following argument:(5) when f(x) is known, a parametric submodel parameterizes the conditional distribution of y given x, so that scores satisfy E[S_θ(z)|x] = 0. Such a score can approximate any conditional mean zero function, and hence can approximate ℓ(x)u. Also, since E[u|x] = 0, the function ℓ(x)u satisfies equation (3.1) for these more restrictive parametric submodels, so the argument for equation (3.2) also applies here.

The magnitude of the second piece depends on the loss function ρ(ε), similarly to the way the asymptotic variance of location estimators depends on the loss function. Because ℓ(x) does not depend on ρ(ε), the dependence is through E[u²ℓ(x)ℓ(x)ᵀ] = E[E[u²|x]ℓ(x)ℓ(x)ᵀ], where E[u²|x] = ν(x)⁻²E[m(ε)²|x] for m(ε) and ν(x) given in equation (3.4). In particular, if ε were i.i.d. with density f(ε|x), E[u²|x] would be equal to the asymptotic variance for estimation of the location parameter μ = argmin_μ E[ρ(ε − μ)] corresponding to ρ(ε). Thus, when average derivatives for different ρ(ε) functions are equal, so that the corresponding bounds can be compared, the comparisons can be carried out in a way similar to that for estimation of location parameters. For example, when the conditional density of ε given x has "thick tails" for most values of x, the second piece of the bound will tend to be smaller for the conditional median than for the conditional mean.

(5) We thank Gary Chamberlain for suggesting the following interpretation.
Turning now to the statement of regularity conditions, we first give an assumption that is essential to the result.

Assumption 3.1: w(x)f(x) is zero on the boundary of the support of x.

This condition allows us to ignore boundary terms in the derivation of the bounds. Without this assumption, E[w(x)g′(x)] may include boundary terms that depend on g(x) evaluated at particular points. For continuously distributed regressors, the value of a conditional expectation at a point has an infinite variance bound, so that the average derivative will not be √n-consistently estimable. For a simple example, suppose that x is a scalar that is uniformly distributed on [0,1], g(x) is the conditional mean, and w(x) = 1. Then

(3.9)    δ = E[g′(x)] = ∫ g′(x)dx = g(1) − g(0),

which is not √n-consistently estimable. It is easy to show that the semiparametric variance bound is infinite in this case, which is consistent with the well known fact that the value of a conditional expectation at a particular point is not √n-consistently estimable. One way to guarantee that this assumption holds in index models is to choose w(x) to be zero outside the interior of the support of x. Of course, such a choice might be in conflict with the efficient choice of weights discussed in Section 5.
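The boundary identity in (3.9) is easy to verify numerically; the cubic g below is our hypothetical choice.

```python
import numpy as np

g = lambda x: x ** 3 + x             # a smooth g on [0, 1]
g_prime = lambda x: 3 * x ** 2 + 1

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, 500_000)   # x uniform on [0,1], so f(x) = 1

delta_mc = np.mean(g_prime(x))       # Monte Carlo E[g'(x)] with w(x) = 1
boundary = g(1.0) - g(0.0)           # the boundary-value representation
```

Both quantities equal 2 here, illustrating that with uniform x and unit weight the average derivative reduces to a difference of boundary values of g.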
Additional regularity conditions are useful for deriving the result.

Assumption 3.2: The support X of x is convex and compact, w(x) and w′(x) are bounded on X, E[‖f′(x)/f(x)‖²] < ∞, and E[ψ(z)ψ(z)ᵀ] exists. Also, g(x) is continuously differentiable in x on X and Prob(ν(x) ≠ 0) = 1. There is a convex and compact set 𝒢 containing the closure of g(X) in its interior such that E[m(y − g)|x] = 0 has a unique solution at g = g(x) on 𝒢 and, for any ζ(y,x) that is bounded with E[ζ(y,x)|x] continuously differentiable in x, E[m(y − g)ζ(y,x)|x] is continuously differentiable in x and g on X × 𝒢 with bounded derivatives.

This assumption will not be satisfied if y is discrete and m(ε) is not continuous over the range of y − g(x). In particular, it does not hold if y is binary and m(ε) = sgn(ε), because E[m(y − g)|x] will not be continuous in g. The last smoothness condition does not seem very primitive, but it is straightforward to give sufficient conditions for particular m(ε). For example, for m(ε) = ε it will follow from the other assumptions and continuity of E[y²|x], and for m(ε) = sgn(ε) it will be implied by y being continuously distributed given x, with conditional density f(y|x) that is absolutely continuous with f(y + a|x)^{1/2} mean-square continuously differentiable in the scalar a.(6)
As is well known, the class of estimators must be restricted to obtain an efficiency bound result. We do this by restricting attention to estimators δ̂ that are regular, meaning that for a class of regular parametric submodels the limiting distribution of √n(δ̂ − δ(θ_n)) does not depend on {θ_n} when √n(θ_n − θ₀) is bounded and the data has distribution f(z|θ_n) for each sample size n.

(6) Assumption 3.2 follows in these cases by Lemmas C.2 and C.3 of Newey (1991b) and by E[m(y − g)ζ(y,x)|x] = ∫ m(u)ζ(u + g, x)f(u + g|x)du when y is continuously distributed.
Theorem 3.1: If Assumptions 3.1 and 3.2 are satisfied, then ψ(z) in equation (3.5) satisfies E[ψ(z_i)] = 0 and E[ψ(z_i)ᵀψ(z_i)] < ∞, and there is a class of regular parametric models such that ∂δ(θ)/∂θ|_{θ=θ₀} = E[ψ(z)S_θ(z)ᵀ]. If V = E[ψ(z)ψ(z)ᵀ] is nonsingular, then V is the supremum of Cramer-Rao bounds of regular parametric models, and any estimator δ̂ of δ that is regular satisfies √n(δ̂ − δ) → Z + U, where Z is distributed N(0,V) and U is independent of Z. Furthermore, if √n(δ̂ − δ) = Σ_{i=1}^n ψ*(z_i)/√n + o_p(1) and δ̂ is regular, then ψ*(z_i) = ψ(z_i).
In the environment considered here, where the data distribution is not restricted, the assumption that the pathwise derivative formula hold for a parametric submodel is just a convenient regularity condition. It is verified in the proof of the theorem that there exists a class of regular parametric models where the pathwise derivative formula is satisfied, with score that can be chosen to approximate any mean square random vector. The last conclusion of this theorem shows that any influence function must equal that given in equation (3.5). In particular, the kernel and series estimators described in equations (2.7) and (2.8) will have ψ(z) as their influence function, and hence asymptotic variance E[ψ(z)ψ(z)ᵀ], as long as they satisfy equation (3.3) (for some ψ*(z)) and are sufficiently regular.

4. Efficiency Bounds for Multiple Index Models

In this Section we derive the variance bound for the parameters of multiple index models, where the function g(x) described earlier is restricted to depend on a function of x and parameters. Let v(x,β) be a vector of functions of x and a q × 1 parameter vector β. A multiple index model is one where there is a function G(v) such that

(4.1)    g(x) = G(v(x,β₀)).

An important example is the partial index model discussed in Section 2, where v(x,β) = (x₁ᵀβ, x₂ᵀ)ᵀ.
The pathwise derivative can be used to calculate the efficiency bound for estimators of β₀, although this approach must be modified to account for the restrictions imposed by equation (4.1). We now carry out this calculation, using tangent set and projection methods. Because β₀ is now an implicit parameter rather than an explicit functional, a more specific parameterization of the problem is useful. Consider parameterizing the distribution of z by submodels f(z|θ) with θ = (βᵀ, ηᵀ)ᵀ, where η is a nuisance parameter vector for any feature of the distribution of z other than β. Let S_β and S_η denote the respective scores, where for notational convenience we have suppressed the θ argument.
Then equation (3.1) for a pathwise derivative reduces to

(4.2)    E[ψ(z)S_βᵀ] = I,    E[ψ(z)S_ηᵀ] = 0.

By the same reasoning following equation (3.2), the efficiency bound will be the variance of a ψ(z) such that this equation is satisfied and ψ(z) can be approximated by a linear combination of S_β and nuisance scores. This ψ(z) can be found by a projection calculation. Let

𝒯 = {t(z) : for each ε > 0 there is a constant matrix C and a score S_η with E[‖t − CS_η‖²] < ε}

be the mean-square closure of the union of all q × 1 linear combinations of all possible nuisance scores, referred to as the tangent set. Assuming that 𝒯 is a linear set, let t̄ be the mean square projection of S_β on 𝒯, characterized by the two conditions t̄ ∈ 𝒯 and E[(S_β − t̄)ᵀt] = 0 for all t ∈ 𝒯, and let S = S_β − t̄, referred to as the efficient score. The random vector ψ(z) = (E[SSᵀ])⁻¹S satisfies equation (4.2) and can be approximated by a linear combination of scores, so that the semiparametric variance bound for β will be V = (E[SSᵀ])⁻¹. Begun, Hall, Huang, and Wellner (1983) and Bickel, Klaasen, Ritov, and Wellner (1992) developed this projection form of the bound.
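The projection form of the bound can be checked in a finite-dimensional example. The sketch below (our illustration, with a hypothetical information matrix) draws jointly normal scores (S_β, S_η), forms the efficient score S = S_β − t̄ by regressing S_β on S_η over the sample, and compares (E[SSᵀ])⁻¹ with the partitioned-information formula (I_ββ − I_βη I_ηη⁻¹ I_ηβ)⁻¹.

```python
import numpy as np

# Joint information matrix of (S_beta, S_eta): 1-dim beta, 2-dim eta.
I = np.array([[2.0, 0.5, 0.3],
              [0.5, 1.0, 0.2],
              [0.3, 0.2, 1.5]])

rng = np.random.default_rng(4)
scores = rng.multivariate_normal(np.zeros(3), I, size=400_000)
s_beta, s_eta = scores[:, :1], scores[:, 1:]

# Mean-square projection of S_beta on the linear span of the nuisance scores.
C, *_ = np.linalg.lstsq(s_eta, s_beta, rcond=None)
S = s_beta - s_eta @ C                        # efficient score S = S_beta - t-bar

V_mc = np.linalg.inv(S.T @ S / len(S))        # (E[S S'])^{-1}, Monte Carlo
V_formula = np.linalg.inv(
    I[:1, :1] - I[:1, 1:] @ np.linalg.inv(I[1:, 1:]) @ I[1:, :1]
)
```

The two variance bounds agree up to Monte Carlo error, illustrating that projecting out the nuisance scores reproduces the partitioned inverse of the information matrix.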
The form of the efficient score is the main result of this Section, so we present and discuss it first, and then give regularity conditions. Let v = v(x,β₀) and

v_β = [∂v(x,β₀)ᵀ/∂β]·∂G(v)/∂v,

where by convention each component of v_β is set equal to zero if the corresponding component of v(x,β) does not depend on β, and let σ²(x) = E[u²|x] = ν(x)⁻²E[m(ε)²|x]. Theorem 4.1 below shows that the efficient score is

(4.3)    S = σ⁻²(x)·u·{v_β − E[σ⁻²(x)v_β|v]/E[σ⁻²(x)|v]}.

Thus, the semiparametric variance bound is

(4.4)    V = (E[SSᵀ])⁻¹ = (E[σ⁻²(x){v_β − E[σ⁻²(x)v_β|v]/E[σ⁻²(x)|v]}{v_β − E[σ⁻²(x)v_β|v]/E[σ⁻²(x)|v]}ᵀ])⁻¹.

Although the bound is complicated, it has a straightforward interpretation as the asymptotic variance of an optimally weighted m-estimator where the regression function has been "concentrated out." Assuming that ν(x) ≠ 0, define the weight ω(x) = 1/{ν(x)E[u²|x]}.
Note that

G(v(x,β),β) = argmin_{g(v(x,β))} E[ω(x)ρ(y − g(v(x,β)))] = argmin_g E[ω(x)ρ(y − g)|v(x,β)],

so that this function minimizes the population value of a weighted m-estimation criterion. Consider the estimator of β that minimizes the sample counterpart to this criterion, with G(v(x,β),β) substituted for g:

(4.5)    β̂ = argmin_β Σ_{i=1}^n ω(x_i)ρ(y_i − G(v(x_i,β),β)).

This is an estimator where the unknown function G(v) has been replaced by the function that minimizes the population counterpart, i.e. where G(v) has been "concentrated out" in the population. By the usual formula for a parametric m-estimator, β̂ will have asymptotic variance (E[σ⁻²(x)G_β(x)G_β(x)ᵀ])⁻¹ for G_β(x) = ∂G(v(x,β),β)/∂β|_{β=β₀}. This asymptotic variance is the semiparametric bound, because

(4.6)    G_β(x) = v_β − E[σ⁻²(x)v_β|v]/E[σ⁻²(x)|v].(7)
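The concentrated criterion (4.5) can be sketched for a single-index conditional mean (ρ(ε) = ε², ω(x) = 1): for each candidate β the unknown G is replaced by a leave-one-out kernel regression of y on the index, and β̂ minimizes the resulting sum of squared residuals. This is our illustration of the profiling idea, not the paper's estimator; the link function, bandwidth, and grid are hypothetical choices.

```python
import numpy as np

def profile_ssr(b2, x, y, h=0.3):
    """Concentrated criterion: kernel-regress y on v = x1 + b2*x2
    (leave-one-out), then return the sum of squared residuals."""
    v = x[:, 0] + b2 * x[:, 1]
    k = np.exp(-0.5 * ((v[:, None] - v[None, :]) / h) ** 2)
    np.fill_diagonal(k, 0.0)                      # leave-one-out
    g_hat = k @ y / (k.sum(axis=1) + 1e-8)        # G concentrated out
    return np.sum((y - g_hat) ** 2)

rng = np.random.default_rng(5)
n = 500
x = rng.standard_normal((n, 2))
y = np.tanh(x[:, 0] + 1.5 * x[:, 1]) + 0.1 * rng.standard_normal(n)

grid = np.arange(0.5, 2.51, 0.05)                 # first coefficient set to one
b2_hat = grid[np.argmin([profile_ssr(b, x, y) for b in grid])]
```

The grid minimizer lands near the true second index coefficient of 1.5; with a smooth optimizer in place of the grid this is the structure of a profile (concentrated) semiparametric least squares estimator.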
This interpretation suggests an approach to efficient estimation that proceeds by replacing ω(x) and G(v(x,β),β) in equation (4.5) by nonparametric estimators. In particular, it follows by Proposition 2 of Newey (1991a) that the replacement of G by a nonparametric estimator will not affect the asymptotic variance, essentially because G has been "concentrated out." Also, as usual in m-estimation, estimation of ω(x) will not affect the asymptotic variance, so that this estimator will be efficient. In the single index, conditional mean case, where σ²(x) = Var(ε|x), Ichimura's (1991) weighted kernel estimator that uses a known weight ω(x) can be interpreted as an estimator of this type. In general, the estimation of G and ω(x) will not affect the asymptotic distribution under appropriate regularity conditions.
The first assumption gives regularity conditions for the distribution of the data as a function of β. Let ε(β) = y − G(v(x,β),β), where G(v(x,β),β) denotes the location functional when β is the true parameter, and let E_β[·] denote expectations for a parametric submodel when β is the true parameter. The way that G can depend on β directly will be left unspecified, because it does not affect the form of the bound.

Assumption 4.1: Let X denote the support of x and B an open set containing β₀. i) The marginal distribution of x does not depend on β, and v(x,β) is bounded and continuously differentiable in β with bounded derivative on X × B; ii) G(v,β) is bounded and continuously differentiable in β with bounded derivatives;
(7) The moment restriction E[m(ε)|x] = 0 implies that ∂E[ω(x)m(ε)|v(x,β)]/∂β = ∂E[ω(x)E[m(ε)|x]|v(x,β)]/∂β = 0. Then differentiation of the first order condition E[ω(x)m(y − G(v(x,β),β))|v(x,β)] = 0 with respect to β gives ∂G(v,β)/∂β = −E[σ⁻²(x)v_β|v]/E[σ⁻²(x)|v], so that differentiating separately with respect to the two β arguments in G(v(x,β),β) gives equation (4.6).
iii) the conditional distribution of y given x at β has density f(y|x,β) that is regular in β with (conditional) information matrix that is nonsingular and bounded, and σ²(x) = E_β[m(ε(β))²|x] is bounded and bounded away from zero with probability one; iv) there is a compact set 𝒢 containing in its interior the closure of {G(v(x,β),β)} such that on X × B × 𝒢, E_β[m(y − g)|x] is continuously differentiable with bounded derivative, G(v(x,β),β) solves E_β[m(y − g)|x] = 0, and ∂E_β[m(y − g)|x]/∂g|_{g=G(v(x,β),β)} is bounded and bounded away from zero.
This Assumption consists of more or less standard regularity conditions.
The next
hypothesis imposes additional smoothness conditions.
Assumption 4.2: For any function ℓ(y,x,β) that is bounded, continuously differentiable in y and β, and has bounded derivative, E[m(y−g)ℓ(y,x,β)|x,β] is continuously differentiable in β and g with bounded derivative. Also, there exists a bounded, continuously differentiable function m̄(ε) with bounded derivatives such that E_β[m̄(ε(β))m(ε(β))|x] is bounded away from zero uniformly in x and β.

The first hypothesis is similar to the last part of Assumption 3.2, so that primitive conditions for this condition can be specified as in Section 3. Also, it is straightforward to formulate more primitive conditions for the second hypothesis with particular m(ε). By E_β[m̄(ε(β))²|x] bounded and bounded away from zero, it suffices to find m̄(ε) such that E_β[{m̄(ε(β)) − m(ε(β))}²|x] is small uniformly in x and β on X × B. For example, in the conditional mean case, with m(ε) = ε and E_β[|ε|²|x] bounded, choosing m̄(ε) with |m̄(ε)| ≤ |ε| and m̄(ε) = ε except when |ε| is big enough will satisfy this assumption. Also, in the conditional median case with m(ε) = sgn(ε), with ε continuously distributed given x with bounded conditional density, choosing m̄(ε) with |m̄(ε)| ≤ 1 and m̄(ε) = sgn(ε) except in a small enough neighborhood of zero will satisfy this assumption.
Theorem 4.1: If Assumptions 4.1 to 4.3 are satisfied and Σ = E[SS′] is nonsingular, then for the class of parametric submodels such that E_η[m(y−g)|x] is continuously differentiable in a neighborhood of the truth, ∂E_η[m(ε)|x]/∂η = E[m(ε)S_η|x], and G(v,η) solves E_η[m(y−g)|x] = 0, Σ^{-1} is the supremum of Cramer-Rao bounds, and any estimator of β that is regular satisfies √n(β̂ − β) is distributed as U + Z, where U is N(0,V) for V = (E[SS′])^{-1} and Z is independent of U.

Of particular interest for studying weighted average derivatives is the case where v(x,β) = (x_11 + x_12′β, x_2′)′, corresponding to a partial index model. In this case, β is only identified up to scale, so we normalize the first coefficient to be one, with G_1(v) = ∂G(s,x_2)/∂s|_{s = x_11 + x_12′β}. Here the efficient score is

(4.7)  S = σ^{-2}(x)·u·G_1(v)·{x_12 − E[σ^{-2}(x)x_12|v]/E[σ^{-2}(x)|v]}.

5. Efficiency of Weighted Average Derivative Estimators of Index Coefficients

As we discussed in Section 2, average derivatives estimate the parameters of partial index models up to scale. In this Section we consider the efficiency of average derivative estimators of index model coefficients. We adopt the same normalization as before, where the first coefficient equals one. Assuming that g(x) = G(x_11 + x_12′β, x_2) obeys the identification condition δ_1 = E[w(x)∂g(x)/∂x_11] ≠ 0, let δ_2 = E[w(x)∂g(x)/∂x_12], so that β̂ = δ̂_2/δ̂_1 will be consistent for β. The asymptotic efficiency of δ̂_2/δ̂_1 is analyzed in this Section.

To evaluate the efficiency of this estimator we will assume that δ̂ satisfies equation (3.3), having influence function given in equation (3.5). This assumption is justified because the influence function of any (regular) average derivative estimator satisfies equation (3.5), by Theorem 3.1. We also assume that equation (4.2) is satisfied, a standard regularity condition that is known to be a consequence of regularity of the estimator (e.g. see Newey, 1990a).
5.1 The Asymptotic Variance of Relative Average Derivative Estimators

Given that δ̂ has influence function in equation (3.5), the influence function of β̂ (and hence its asymptotic variance) can be derived by the delta method. Recalling that the partial derivative of δ_2/δ_1 with respect to δ is ∂(δ_2/δ_1)/∂δ = δ_1^{-1}[−β,I], the influence function of β̂ will equal

(5.1)  ψ(z) = δ_1^{-1}[−β,I]{w(x)g′(x) − δ − [w′(x) + w(x)f′(x)/f(x)]u} = ℓ(x)u,
(5.2)  ℓ(x) = −δ_1^{-1}[−β,I]{w′(x) + w(x)f′(x)/f(x)}.

Note that for any function of the form a(x) = a(x_11 + x_12′β, x_2), i.e. a function of v, [−β,I]a′(x) = 0, since its derivatives with respect to x_11 and x_12 are proportional; in particular [−β,I]g′(x) = 0 and [−β,I]δ = 0, so the first term in (3.5) drops out of (5.1). For notational convenience we use the same notation ℓ(x) here as in Sections 2 and 3, even though the expressions are not the same.

The asymptotic variance of β̂ is E[ψ(z)ψ(z)′]. It is interesting to note that the term Var(w(x)g′(x)) in the average derivative bound has dropped out of the asymptotic variance of the index estimator. This result is consistent with the interpretation of this term as the one that accounts for the density of x being unknown. Since β is a feature of the conditional distribution of y given x, the distribution of x is ancillary for estimation of β, and knowledge of this distribution should not affect the efficiency of estimators for β.
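The algebra behind (5.1) and (5.2), that [−β,I] annihilates the gradient of any function of the index, can be checked numerically. A minimal sketch with a made-up link function and evaluation point (none of the numbers below come from the paper):

```python
import numpy as np

# Numerical check (with a hypothetical index function G) that [-beta, I]
# annihilates the gradient of g(x) = G(x11 + x12'beta) with respect to
# (x11, x12), which is why w(x)g'(x) - delta drops out of (5.1).
beta = np.array([0.5, -2.0])
G = np.tanh  # hypothetical smooth link

def g(x):  # x = (x11, x12_1, x12_2)
    return G(x[0] + x[1:] @ beta)

def grad(f, x, h=1e-6):
    # central finite-difference gradient
    e = np.eye(len(x))
    return np.array([(f(x + h * e[i]) - f(x - h * e[i])) / (2 * h)
                     for i in range(len(x))])

x = np.array([0.3, 1.0, -0.7])
J = np.hstack([-beta.reshape(-1, 1), np.eye(2)])  # the matrix [-beta, I]
print(np.abs(J @ grad(g, x)).max() < 1e-6)  # True
```

The gradient is proportional to (1, β′)′, which [−β,I] maps to zero regardless of the link G.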
5.2 Efficient Weighting

In this Section we consider the efficiency of β̂ as an estimator in the partial index model of Sections 2 and 4, where ρ(ε) is given. In this model the efficiency of β̂ depends on the weight function w(x), and it would be useful to know whether a weight function can be chosen so that β̂ is efficient, with asymptotic variance equal to the bound of Section 4. In this subsection we consider existence and the form of such an efficient weight.

Our first result is that an efficient weight function will exist when x has an elliptically symmetric distribution and σ²(x) depends only on v, as described in the following result.
Theorem 5.1: Suppose that i) x has an elliptically symmetric distribution, with nonsingular variance and density ℓ((x−μ)′A(x−μ)) for a differentiable ℓ(·), vector μ, and positive definite matrix A; ii) σ²(x) = E[u²|v] = σ²(v); and iii) G_1(v) is nonzero with probability one. Then an efficient weight function is

(5.5)  w(x) = σ^{-2}(v)G_1(v)/h((x−μ)′A(x−μ)),   h(q) = ℓ(q)/∫_q^∞ ℓ(r)dr.

The optimal weight depends on the density of the regressors through the inverse of the hazard h(q), evaluated at q = (x−μ)′A(x−μ). The factor 1/h(q) is constant, and hence does not affect the weight, if and only if ℓ(q) is proportional to exp(−aq), corresponding to exponential ℓ(q) and hence normally distributed x. In cases where ℓ(q) is "thicker tailed" than normal (e.g. ℓ(q) proportional to 1/(1+q)^a, a > 1), 1/h(q) will tend to give more weight to larger values of x, while if ℓ(q) is "thinner tailed" than normal (e.g. ℓ(q) proportional to exp(−aq^α), α > 1) it will tend to give less weight.
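The behavior of the hazard h(q) = ℓ(q)/∫_q^∞ ℓ(r)dr described above can be illustrated numerically; the tail functions below are illustrative choices, with the exponential case corresponding to normal regressors:

```python
import numpy as np

# Illustration of the hazard h(q) = l(q) / int_q^inf l(r) dr from
# Theorem 5.1.  For l(q) proportional to exp(-q/2) (normal regressors)
# h is constant, so 1/h(q) does not affect the weight; for a thinner
# tail such as l(q) = exp(-q^2), h(q) grows, down-weighting large q.
def hazard(l, q, upper=60.0, n=200001):
    r = np.linspace(q, upper, n)        # truncate the upper limit
    y = l(r)
    tail = ((y[:-1] + y[1:]) / 2 * np.diff(r)).sum()  # trapezoid rule
    return l(q) / tail

qs = [0.5, 1.0, 2.0, 4.0]
h_normal = [hazard(lambda r: np.exp(-r / 2), q) for q in qs]
h_thin = [hazard(lambda r: np.exp(-r ** 2), q) for q in qs]
print(np.allclose(h_normal, 0.5, rtol=1e-3))  # constant hazard
print(h_thin[0] < h_thin[-1])                 # increasing hazard
```

For ℓ(q) = exp(−q/2) the tail integral is 2exp(−q/2), so h(q) = 1/2 exactly, consistent with the normal case above.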
Some properties of elliptically symmetric distributions are essential to existence of an efficient weight, because of implicit constraints on ℓ(x). Let x_{-k} denote the vector of all elements of x other than the k-th, x = (x_k, x_{-k}′)′. Then ℓ(x) is constrained in the following way.

Lemma 5.2: E[ℓ_k(x)|x_{-k}] = 0, k ≤ q − 1.

This result can be interpreted in terms of the amount of information used by a single average derivative ratio. Each ratio β̂_k uses only the restriction that the index includes x_11 + x_{12,k}β_k, and allows for g(x) to depend on the other elements of x in an arbitrary way. In other words, each β̂_k is consistent for the coefficient in a partial index model with index x_11 + x_{12,k}β_k, and Lemma 5.2 is a consequence: it is exactly the condition that makes the influence function uncorrelated with nuisance parameter scores in such a partial index model, as required by consistency of β̂_k. Also, each term ℓ_k(x) multiplying u in the influence function is of this constrained form. In contrast, the elements of the efficient score, ℓ*(x) = G_1(v)σ^{-2}(x){x_12 − E[σ^{-2}(x)x_12|v]/E[σ^{-2}(x)|v]}, only have conditional expectation zero given v, as in equation (4.2). This results from the index model imposing more restrictions than any individual average derivative ratio, leading to more information (variance) in the efficient score.

Theorem 5.1 shows that it is sometimes possible to combine the information from the individual derivative ratios to obtain efficiency. Intuitively, although the individual terms have less information than the efficient score, together they can have as much, because the index interpretation of the vector of derivative ratios comes from the index model. However, the requirement that a single average derivative vector use all the information is quite restrictive, relying on linearity of certain conditional expectations as a necessary condition.

———————
8. The declining weight implicit in the density weighted estimator of Powell, Stock, and Stoker (1989) would tend to have high efficiency when the x distribution has thinner tails than normal, although it is difficult to find an example where 1/h(q) behaves exactly like ℓ(q).
Theorem 5.3: Suppose that σ²(x) = σ²(v), G_1(v) is nonzero with probability one, and an efficient weight function exists. Then for each k ≤ q − 1 there is a constant c_k such that E[x_k|x_{-k}] = E[x_k|v] + c_k′{x_{-k} − E[x_{-k}|v]}.

Thus, linearity of E[x_k|x_{-k}] is necessary for existence of an efficient weight. We could also derive a result for the case where σ²(x) depends on x in a more general way, but for simplicity we have not allowed for this generality here.
Although x having nonlinear conditional expectations will rule out the existence of an efficient weight, approximate efficiency can still be achieved by combining influence functions from many derivative estimators. Intuitively, combining average derivatives imposes the index model information, that can lead to efficiency if results from different weighting functions are used. Unfortunately, the efficient combination will not generally exist in closed form, and hence is difficult to describe.
We use certain Hilbert space results to argue that average derivatives can be combined to achieve approximate efficiency, but not in closed form. The basic result that is useful here is that the closure of the direct sum of closed linear subspaces is equal to the orthogonal complement of the intersection of orthogonal complements of the subspaces. Take the Hilbert space to be the usual one of functions of x with finite mean square. Take the subspaces to be the set of functions of x satisfying Lemma 5.2 for each k. These have orthogonal complement equal to the set of functions of x_{-k}, for each k. The intersection (over k ≤ q−1) of this set is the set of functions of v. This intersection has orthogonal complement equal to the set of functions that have conditional mean zero given v. Then, if an efficient weight function exists, for each k ≤ q − 1 it follows by E[ℓ_k(x)|v] = 0 that each element of ℓ_k(x) is in the closure of {ℓ_1(x)+···+ℓ_{q-1}(x) : E[ℓ_k(x)|x_{-k}] = 0}. Thus, the terms in the efficient score that depend on x can be approximated by the terms in the average derivative that depend on x, which will lead to approximate efficiency. However, it is impossible to give an explicit form for this additive decomposition, because a direct sum of subspaces need not be closed. Also, even when it is closed the decomposition into the sum does not generally have an explicit form, except when the subspaces are orthogonal or finite dimensional. Here the subspaces {ℓ_k(x) : E[ℓ_k(x)|x_{-k}] = 0} are never orthogonal, and many finite dimensional cases are covered by Theorem 5.1. Thus, in general an explicit form for the optimal combination of different weighted average derivative estimators cannot be found.
5.3 Pooling Weighted Average Derivatives Via Minimum Chi-Square

Combining different weighted average derivative estimators provides a way to achieve approximate efficiency. Also, when an efficient weight exists, it may have components that are unknown, and pooling estimators with known weights is an approach to feasible efficient estimation. An optimal way to pool different estimators so as to improve efficiency is by minimum chi-square estimation, that can be described as follows.

Let w_j(x), (j = 1, ..., J), be linearly independent weighting functions, let δ̂_j denote the weighted average derivative estimator using w_j(x), and let β̂_j = δ̂_{j2}/δ̂_{j1} be the associated ratio estimator. Stack the separate estimators into a vector γ̂ = (β̂_1′, ..., β̂_J′)′. Let ψ_j(z) denote the influence function of β̂_j (satisfying equation (5.2) for w_j(x)), let Ψ(z) = (ψ_1(z)′, ..., ψ_J(z)′)′, and let Ψ̂(z) be obtained by replacing the unknown components of Ψ(z) in equation (5.2) by nonparametric estimators. The asymptotic variance of γ̂ is then Ω = E[Ψ(z)Ψ(z)′]. Let H = [I, ..., I]′ = e ⊗ I, where e is the J vector of ones and I is the identity matrix with dimension equal to the number of elements of β.
The pooled estimator is

(5.6)  β̄ = argmin_β (γ̂ − Hβ)′Ω̂^{-1}(γ̂ − Hβ) = (H′Ω̂^{-1}H)^{-1}H′Ω̂^{-1}γ̂,   Ω̂ = Σ_{i=1}^n Ψ̂(z_i)Ψ̂(z_i)′/n.

It follows from standard minimum chi-square theory that β̄ is √n-consistent and asymptotically normal with asymptotic variance (H′Ω^{-1}H)^{-1} that is no larger than the asymptotic variance of any linear combination of the β̂_j, (j = 1, ..., J). Also, a consistent estimator of the asymptotic variance will be (H′Ω̂^{-1}H)^{-1}. The minimum chi-square estimator will be efficient if the efficient score is a linear combination of the influence functions. Also, it will be approximately efficient for large J if a linear combination of the influence functions can approximate the efficient score in mean-square. We state these results in the following theorem, also giving an interpretation of the difference of the efficient information matrix and the minimum chi-square precision matrix.
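The pooling step in equation (5.6) is a generalized-least-squares computation; a minimal sketch with made-up inputs (the numbers below are not estimates from any data set):

```python
import numpy as np

# Sketch of the minimum chi-square pooling step: given stacked ratio
# estimates gamma (J blocks of dimension p) and an estimate Omega of
# their joint asymptotic variance, the pooled estimator is
# (H'inv(Omega)H)^{-1} H'inv(Omega) gamma with H = e (x) I.
def pool_min_chi_square(gamma, omega, J, p):
    H = np.kron(np.ones((J, 1)), np.eye(p))   # H = e (x) I
    omega_inv = np.linalg.inv(omega)
    A = H.T @ omega_inv @ H
    beta = np.linalg.solve(A, H.T @ omega_inv @ gamma)
    avar = np.linalg.inv(A)                   # estimated asymptotic variance
    return beta, avar

J, p = 3, 2
gamma = np.array([1.0, -0.5, 1.2, -0.4, 0.9, -0.6])  # three estimates of beta
omega = np.diag([1.0, 1.0, 4.0, 4.0, 2.0, 2.0])      # their (diagonal) variance
beta, avar = pool_min_chi_square(gamma, omega, J, p)
print(beta)  # precision-weighted combination of the three estimates
```

With a diagonal Ω̂ the pooled estimate reduces to a precision-weighted average of the three component estimates, and the reported variance is no larger than that of any single one.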
Theorem 5.4: Σ − H′Ω^{-1}H = min_Π E[{S − ΠΨ(z)}{S − ΠΨ(z)}′], so that if there exists Π such that S = ΠΨ(z), then β̄ is efficient. Furthermore, if for each ε > 0 there exist J and Π such that E[‖S − ΠΨ(z)‖²] < ε, then lim_{J→∞} (H′Ω^{-1}H)^{-1} = V.
The second condition, on existence of a mean-square approximation of the scores by the influence function, is a "spanning condition" for efficiency. It is the (minimum chi-square) analog of the generalized method of moments spanning condition given in Newey (1990b).

To achieve approximate efficiency by combining weighted average derivative estimators, the weights will have to satisfy a spanning condition corresponding to that of Theorem 5.4. This spanning condition is quite complicated, because the influence functions depend on both the weight and its derivative. For this reason, we do not try and give as general a spanning condition as possible, instead focusing on conditions that are relatively easy to verify. The first of these conditions involves the data distribution. Let f(x) denote the density of x and let X_k denote the random variable with realization x_k.
Assumption 5.1: x has compact support, E[‖f′(x)/f(x)‖²] < ∞, and for k ≤ q−1 and any positive integer r, f(x_k|x_{-k}) is bounded and bounded away from zero, and σ^{-2}(x), ∂f(x_k|x_{-k})/∂x_k, and f(x_k|x_{-k})^{-2}[∂f(x_k|x_{-k})/∂x_k]Cov(1(X_k ≤ x_k), X_k^r | x_{-k}) are continuous on the support of x.

The last part of this condition does not seem primitive, but it is straightforward to check for particular distributions. In particular, as long as f(x_k|x_{-k}) and ∂f(x_k|x_{-k})/∂x_k are continuous on the interior of the support, the last expression will be continuous on the interior, so that it suffices to show continuity on the boundary.
it
a
beta density, proportional to
x_
,
x,
,
b
x, (l-x,
)
For example, suppose
for coefficients
b
a,
T{x,\x_,)
k
|
x_
=
)~ 2
k
dT{x
|
k
a;_
k
and
on
b
)/ax Cov(l(X sa; ),X^ x_ )
k k
k
k
[a - (a+b)a^]|
|
r
lc
(t
-[a/(a+b)]
This function will be continuous at any
convergence).
a
Then
suppressed for notational convenience.
f (x
a
that are continuous in
bounded, positive, and bounded away from one, where dependence of
is
is
X
a
r
)t
b
where
Also, if a sequence is such that
dt/[a^
(l-t)
X,
<
x,
+1
(l-a^)
< 1
b+1
]
(by dominated
converges to zero or one, then
continuity follows by L'Hopital's rule, with the limit of this expression equal to
r
-a[a/(a+b)] /(a+l)
and
at
at the limiting value of
When Assumption
x
,
r
-bU-[a/(a+b)] }/(l-b)
k
1
for
a
and
b
evaluated
.
5.1 is satisfied there
for the spanning condition to be satisfied.
the
at
are simple conditions on the weight functions
Let
xAt,x)
denote the vector with
position and other components equal to the corresponding components of
- 24 -
t
in
x_^.
Assumption 5.2: w_j(x) = ω_j(v)p_j(x), where 1/C ≤ ω_j(v) ≤ C for some C > 0. Also, for each k ≤ q − 1, any continuous function a(x), and any ε > 0, there is a J̄ such that for all J ≥ J̄ there are constants π_{1k}, ..., π_{Jk} with sup_x |a(x) − Σ_{j=1}^J π_{jk}∂p_j(x)/∂x_k| < ε and Σ_{j=1}^J π_{jk}p_j(x) = Σ_{j=1}^J π_{jk}∫ ∂p_j(x_k(t,x))/∂x_k dt, the integral running from the infimum of the support of X_k to x_k.
This hypothesis says that the partial derivatives with respect to each x_k can approximate any continuous function, and that the linear combination is a definite integral of the partial derivative. It is easy to check that this hypothesis is satisfied for particular choices of p_j(x). For example, suppose p_j(x) is a power series with all terms of a given integer order (sum of exponents) and below included. Then the hypotheses follow by the Weierstrass theorem and because derivatives and definite integrals of power series are also power series. Also, for similar reasons, this hypothesis will be satisfied for Gallant's (1981) flexible Fourier form.
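The closure properties of power series used in this verification (derivatives and definite integrals of polynomials are again polynomials, and the definite integral of the derivative recovers the function up to a constant) can be checked mechanically; a small sketch with arbitrary coefficients:

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Differentiation and integration stay inside the polynomial family,
# which is why power-series weights satisfy Assumption 5.2.  The
# coefficients below are arbitrary.
c = np.array([1.0, -2.0, 0.5, 3.0])   # 1 - 2x + 0.5x^2 + 3x^3
dc = P.polyder(c)                     # derivative: still a polynomial
ic = P.polyint(dc)                    # definite integral from 0 to x
# Fundamental theorem of calculus: integral of the derivative recovers
# the polynomial up to its constant term.
x = np.linspace(-1.0, 1.0, 11)
print(np.allclose(P.polyval(x, ic) + c[0], P.polyval(x, c)))  # True
```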
An important assumption here is that w_j(x) is the product of ω_j(v), a function of v, and p_j(x), a function of x, where the index x_11 + x_12′β in v depends on the unknown coefficients, so that these weights are not feasible. They can be made feasible by replacing β with an estimator. Because the choice of weights does not affect the consistency of the estimators, under appropriate regularity conditions the replacement of β with its corresponding estimated value will have no effect on the limiting distribution of the minimum chi-square estimator, and hence no effect on its efficiency.
The next result shows Assumptions 5.1 and 5.2 are sufficient for the spanning condition for near efficiency of minimum chi-square.

Theorem 5.5: If Assumptions 5.1 and 5.2 are satisfied, then lim_{J→∞} (H′Ω^{-1}H)^{-1} = V.
5.4 The Choice of Weights in Practice

We can combine the near efficiency of minimum chi-square with the form of the optimal weights given for the elliptically symmetric case to suggest a sequence of weights that should have high efficiency with just a few terms included. The basic idea is to initialize the weight sequence at values that are close to efficient when the regressors are normal, add first some terms to allow for nonnormal but elliptically symmetric x's, and then include terms that would account for non elliptical distributions.

Let Ĝ_1(v) and σ̂²(v), where v = x_11 + x_12′β, be obtained from a preliminary parametric estimator of the index model. For example, if y is binomial and g(x) is the conditional mean, then corresponding to a preliminary probit estimator one might choose Ĝ_1(v) = φ(a + b·v) and σ̂²(v) = Φ(a + b·v)[1 − Φ(a + b·v)], with a and b the constant and the index coefficient from probit. Let μ̂ and Σ̂ be the sample mean and variance of the regressors and let m_j(·) be functions of a scalar argument, with m_1(·) = 1 and m_j(q) = [q/(1+q)]^{j-1} for j ≤ J̄, where J̄ is some (small) positive integer. Then a weight sequence that is initialized at normal and elliptically symmetric regressors is as given in Assumption 5.2 for ω_j(v) = σ̂^{-2}(v)Ĝ_1(v) and p_j(x) = m_j((x−μ̂)′Σ̂^{-1}(x−μ̂)). Higher (than J̄) order terms might include powers of x, allowing for distributions other than normal and other elliptically symmetric distributions. As noted above, estimation of the weights will not affect the asymptotic distribution of relative average derivative estimators.
An important practical issue is the choice of the number of weights J in applications. One way the number of weights might be chosen is by the minimum chi-square analog of the GMM cross-validation criteria suggested in Newey (1990b). The basic idea is to use equation (4.2) to construct a cross-validated estimator of the mean-square of the difference between the efficient score and the influence function, which by Theorem 5.4 is related to the difference of the efficient information and the minimum chi-square precision.

By equation (4.2), E[{S − ΠΨ(z)}′T{S − ΠΨ(z)}] = E[S′TS] − 2·trace(TΠH) + trace(TΠΩΠ′) for positive definite T; the first term does not depend on J, so in comparing mean-squared error for different J this term can be dropped. Let Ψ̂(z) be an estimator of Ψ(z), as described above for the minimum chi-square estimator, and let Ω̂_{-i} = Σ_{j≠i}Ψ̂(z_j)Ψ̂(z_j)′/(n−1) and Π̂_{-i} = H′Ω̂_{-i}^{-1}. Then a cross-validated criteria for choosing J is

CV(J) = trace(T·Σ_{i=1}^n {Π̂_{-i}Ψ̂(z_i)Ψ̂(z_i)′Π̂_{-i}′ − 2Π̂_{-i}H}/n),

an estimator of the scalar mean-square error criteria E[{S − ΠΨ(z)}′T{S − ΠΨ(z)}] up to an additive constant. The leave-one-out inverse can be computed as Ω̂_{-i}^{-1} = (n−1)n^{-1}[Ω̂^{-1} + n^{-1}Ω̂^{-1}Ψ̂(z_i)(1 − Ψ̂(z_i)′n^{-1}Ω̂^{-1}Ψ̂(z_i))^{-1}Ψ̂(z_i)′Ω̂^{-1}].
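The leave-one-out inverses that a criterion of this kind requires can be computed with a rank-one update of the full-sample inverse rather than n separate inversions; a sketch on simulated influence-function draws (not from any estimator in the paper):

```python
import numpy as np

# Leave-one-out inverse of Omega_hat = sum_i psi_i psi_i'/n via a
# rank-one (Sherman-Morrison) update: with A = n*Omega_hat,
# Omega_{-i} = (A - u u')/(n-1), so its inverse is (n-1)*(A - u u')^{-1}.
rng = np.random.default_rng(0)
n, p = 200, 3
psi = rng.standard_normal((n, p))        # simulated influence functions
omega = psi.T @ psi / n
omega_inv = np.linalg.inv(omega)

i = 7
u = psi[i]
A_inv = omega_inv / n                    # inverse of A = n*Omega_hat
denom = 1.0 - u @ A_inv @ u
loo_inv = (n - 1) * (A_inv + np.outer(A_inv @ u, u @ A_inv) / denom)

direct = np.linalg.inv((psi.T @ psi - np.outer(u, u)) / (n - 1))
print(np.allclose(loo_inv, direct))  # True
```

The update costs O(p²) per observation instead of O(p³) per inversion, which matters when the criterion is evaluated over many J.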
5.5 Efficiency For Conditional Distribution Index Models

When the conditional index distribution model of equation (2.4) holds, relative average derivatives will estimate the same coefficients for different ρ(ε) functions, so that their asymptotic efficiencies can be compared. The asymptotic variance is E[σ²(x)ℓ(x)ℓ(x)′] for σ²(x) = ν(x)^{-2}E[m(ε)²|x]. As discussed in Section 3, σ²(x) is the asymptotic variance for estimation of the location parameter μ(x) = argmin_u E[ρ(y−u)|x], so that the asymptotic variance of relative average derivative estimators depends on ρ(ε) in a way that is analogous to the location parameter. Here the dependence is even more direct, since the first term in the average derivative bound has dropped out. For example, if the conditional distribution of y given x is "fat-tailed" enough at each x, so that the median has lower asymptotic variance than the mean, then the asymptotic variance of a weighted average derivative estimator based on the median will be smaller than that based on the mean.

———————
9. In a Monte Carlo example for a different semiparametric estimation problem, given in Newey (1990b), an analogous GMM criteria led to an estimator with dispersion that was almost as small as that of the best J.
It is possible to approximately attain the semiparametric bound for the conditional index model, that is given in Newey (1990b), by combining relative average derivative estimators for different weights and ρ(ε) functions. A sufficient spanning condition is that the weighted average derivative estimators are calculated from all combinations of a sequence of weights satisfying Assumptions 5.1 and 5.2 and a sequence of m(ε) functions for which linear combinations can, for the conditional distribution of ε given any x, approximate any function with finite mean square. For brevity, we have omitted a full description and the proof of this result.
Appendix: Proofs of Theorems

Throughout the appendix let C denote a generic matrix of positive constants that may be different in different appearances.
Proof of Theorem 3.1: Let ζ(y,x) and r(x) be bounded and continuously differentiable, and let ζ̄(y,x) = ζ(y,x) − E[ζ|x] and γ(x) = r(x) − E[r(x)]. Consider the parametric submodel

(A.1)  f(z|θ) = f(z)[1 + θ′ζ̄(y,x)][1 + θ′γ(x)].

Both ζ̄(y,x) and γ(x) are bounded, so this is a density function for θ close enough to zero. In addition,

(A.2)  E_θ[a|x] = E[a|x] + θ′E[aζ̄|x].

Mean-square continuous differentiability of f(z|θ)^{1/2} in a neighborhood of θ = 0 follows by Lemma C.4 of Newey (1991b), with score

(A.3)  S_θ(z) = ζ̄(y,x) + γ(x).

By Assumption 3.2 and Lemma C.2 of Newey (1991b), E[m(y−g)|x] and E[m(y−g)ζ̄(y,x)|x] are continuously differentiable in g with bounded derivatives, and hence by eq. (A.2), E_θ[m(y−g)|x] is continuously differentiable on X×S×Θ. By the implicit function theorem, for each (x,θ) there is a unique solution g(x,θ) to E_θ[m(y−g)|x] = 0 on a neighborhood on which g(x,θ) and ∂g(x,θ)/∂θ exist and are continuous, so that by compactness of X, Θ can be chosen small enough that g(x,θ) and ∂g(x,θ)/∂θ exist and are continuous and bounded on X×Θ, with

(A.4)  ∂g(x,θ)/∂θ′|_{θ=0} = ν(x)^{-1}E[uζ̄(y,x)′|x].

Next, the marginal density of x in the parametric submodel is f(x|θ) = f(x)[1 + θ′γ(x)], so that

(A.5)  δ(θ) = E_θ[w(x)g′(x,θ)] = E[w(x)g′(x,θ)] + E[w(x)g′(x,θ)γ(x)′]θ.

Then δ(θ) is differentiable in θ, and by eq. (A.4) and an integration by parts applied to the first term of eq. (A.5),

(A.6)  ∂δ(θ)/∂θ′|_{θ=0} = E[ℓ(x)uζ̄(y,x)′] + E[w(x)g′(x)γ(x)′] = E[ψ(z)S_θ(z)′],

where the second equality uses E[ζ̄|x] = 0 and E[u|x] = 0. Thus, ψ(z) is the pathwise derivative for all parametric submodels as specified above. Also, for any s(z) with E[s(z)] = 0 and E[‖s(z)‖²] finite, and for any ε > 0, it follows, e.g. by Lemma C.7 of Newey (1991b), that there are ζ(y,x) and r(x) satisfying the above boundedness and smoothness hypotheses with E[‖s(z) − E[s|x] − ζ̄‖²] < ε and E[‖E[s|x] − E[s] − γ‖²] < ε, so that

(A.7)  E[‖s(z) − S_θ(z)‖²] ≤ 2E[‖s(z) − E[s|x] − ζ̄‖²] + 2E[‖E[s|x] − E[s] − γ‖²] ≤ 4ε,

where the inequality follows by the Cauchy-Schwartz inequality. The first conclusion then follows by Bickel, Klaasen, Ritov, and Wellner (1992, Chapter 3, Theorem 2). To obtain the second conclusion, note that by regularity of the estimator and Theorem 2.2 of Newey (1990a), ∂δ(θ)/∂θ′|_{θ=0} = E[ψ̃(z)S_θ(z)′] for the influence function ψ̃(z) of the estimator, so by eq. (A.6), E[{ψ̃(z) − ψ(z)}S_θ(z)′] = 0. The second conclusion then follows because S_θ(z) can approximate any mean zero vector function in mean square, and hence can approximate ψ̃(z) − ψ(z), so that E[‖ψ̃(z) − ψ(z)‖²] = 0. QED.
We prove Theorem 4.1 using two Lemmas. The first gives the form of the tangent set and the second the projection on the tangent set.
Lemma A.1: If Assumptions 4.1 to 4.3 are satisfied then the tangent set, for all parametric submodels satisfying the hypotheses of Theorem 4.1, is T = {t(z) : E[t(z)] = 0, E[t(z)²] < ∞, E[tu|x] = E[tu|v]}.

Proof: We prove this result by showing that any nuisance score must lie in T and by exhibiting a class of parametric submodels whose scores can approximate anything in T in mean square. Consider first any nuisance score S_η for a submodel satisfying the hypotheses of Theorem 4.1. By the implicit function theorem, G(v,η) is differentiable at the truth with ∂G(v,η)/∂η = ν(x)^{-1}∂E_η[m(ε)|x]/∂η = E[uS_η|x]. Since G(v,η) depends on x only through v, E[uS_η|x] is a function of v, and so S_η lies in T.

To construct a class of regular parametric submodels with scores that can approximate anything in T, let D denote the statement "the function is bounded, continuously differentiable in y and β, and has bounded derivatives." Let ζ(y,x) satisfy D. By Assumption 4.1 and Lemma C.2 of Newey (1991b), E_β[ζ(y,x)|x] is continuously differentiable in β with derivative E_β[ζ(y,x)S_β(y|x)′|x], where S_β(y|x) is the (conditional) score for f(y|x,β); this matrix is bounded by boundedness of the conditional information. Let m̄(ε) be as specified in Assumption 4.2, and let m̃(ε(β)) = m̄(ε(β)) − E_β[m̄(ε(β))|x] and ζ̄(y,x,β) = ζ(y,x) − E_β[ζ(y,x)|x], so that by Assumptions 4.1 and 4.2, E_β[m̃(ε(β))m(ε(β))|x] and E_β[ζ̄(y,x,β)m(ε(β))|x] satisfy D. For a function a(v) that is bounded and continuously differentiable with bounded derivative, let

ζ̃(y,x,β) = ζ̄(y,x,β) − m̃(ε(β))(E_β[m̃(ε(β))m(ε(β))|x])^{-1}{E_β[ζ̄(y,x,β)m(ε(β))|x] − a(v(x,β))}.

Then E_β[ζ̃(y,x,β)|x] = 0 and E_β[ζ̃(y,x,β)m(ε(β))|x] = a(v(x,β)) by construction. Therefore, for θ = (β,η′)′ with η in a small enough neighborhood of zero, Λ(z,θ) = 1 + η′ζ̃(y,x,β) is bounded away from zero and E_β[Λ(z,θ) − 1] = 0, so that by Lemma C.4 of Newey (1991b), f(z|θ) = f(z|β)Λ(z,θ) is a density with mean-square continuously differentiable square root, and score for η

(A.8)  S_η = ζ̄(z) − m̃(ε)(E[m̃(ε)m(ε)|x])^{-1}{E[ζ̄(z)m(ε)|x] − a(v)}.

To show that f(z|θ) is a parametric submodel, note that

(A.9)  E_θ[m(y−g)|x] = E_β[m(y−g)|x] + η′E_β[m(y−g)ζ̃(y,x,β)|x],

and by Assumption 4.1 there is ε > 0 small enough that ∂E_β[m(y−g)|x]/∂g is bounded away from zero uniformly in x and β for all g ∈ [G(v(x,β),β)−ε, G(v(x,β),β)+ε]. Therefore, by eq. (A.9), for η small enough there is λ(v(x,β),η) in a neighborhood of zero such that

(A.10)  E_θ[m(y − G(v(x,β),β) − λ(v(x,β),η))|x] = 0,  G(v(x,β),θ) = G(v(x,β),β) + λ(v(x,β),η).

This is a local minimum of E_θ[ρ(y−g)|x] by continuous differentiability of E_θ[m(y−g)|x] and the derivative bounded away from zero, and a global minimum for η small enough by the theorem of the maximum and compactness of S. Thus, f(z|θ) satisfies the index restriction, and hence is a smooth parametric submodel. Furthermore, the other hypotheses in Theorem 4.1 for the parametric submodels are satisfied by construction.

To show that S_η can approximate anything in T, note that for t ∈ T, by boundedness of E[m(ε)²|x], E[‖E[m(ε)t|v]‖²] ≤ E[E[m(ε)²|x]E[‖t‖²|x]] ≤ CE[‖t‖²] < ∞. Then for any ε > 0 it follows by Lemma C.7 of Newey (1991b) that there are ζ(y,x) and a(v), bounded and continuously differentiable with bounded derivatives, such that E[‖t − ζ‖²] < ε and E[‖E[m(ε)t|v] − a(v)‖²] < ε. Also, E[‖E[m(ε)(ζ−t)|x]‖²] ≤ CE[‖ζ−t‖²] < Cε and E[‖E[t|x] − E[ζ|x]‖²] ≤ E[‖t − ζ‖²] < ε. Therefore,

(A.11)  E[‖t − S_η‖²] ≤ C{E[‖t − E[t|x] − ζ̄‖²] + E[Var(m̃(ε)|x)(E[m̃(ε)m(ε)|x])^{-2}‖E[m(ε)ζ|x] − a(v)‖²]} ≤ Cε + CE[‖E[m(ε)(ζ−t)|x]‖²] + CE[‖E[m(ε)t|v] − a(v)‖²] ≤ Cε.

QED.
Lemma A.2: If σ²(x) = E[u²|x] exists, is nonzero with probability one, and E[σ^{-2}(x)|v] < ∞, then the projection of a random vector s with finite mean-square on T is t = s − E[s] − uR(x), where R(x) = σ^{-2}(x){E[u·s|x] − E[σ^{-2}(x)u·s|v]/E[σ^{-2}(x)|v]}.

Proof: For notational simplicity, consider the case where s is a scalar; the vector case follows element by element. By the Cauchy-Schwartz inequality, E[u·s|x]² ≤ σ²(x)E[s²|x], so that E[{σ^{-2}(x)uE[u·s|x]}²] ≤ E[E[{u/σ²(x)}²|x]σ²(x)E[s²|x]σ²(x)]/σ²(x)·... = E[E[s²|x]] = E[s²] < ∞, and similarly for the second term of R(x), so the expressions in R(x) are finite with probability one and t has finite mean-square. Then

E[tu|x] = E[su|x] − E[u²|x]R(x) = E[σ^{-2}(x)u·s|v]/E[σ^{-2}(x)|v],

which is a function of v, so that E[tu|x] = E[tu|v] and t ∈ T. Also,

E[(s−t)t] = E[uR(x)t] = E[E[tu|x]R(x)] = E[E[tu|v]R(x)] = E[E[tu|v]E[R(x)|v]] = 0,

since E[R(x)|v] = E[σ^{-2}(x)E[u·s|x]|v] − E[σ^{-2}(x)|v]E[σ^{-2}(x)u·s|v]/E[σ^{-2}(x)|v] = 0. QED.
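The two properties established in this proof — that E[tu|x] is a function of v alone and that s − t is orthogonal to t — can be verified exactly on a small finite distribution; the σ²(x), s(x,u), and v below are arbitrary choices:

```python
import numpy as np

# Finite-state check of the projection in Lemma A.2: with made-up
# sigma^2(x), s(x,u), and v = x mod 2, the residual
# t = s - E[s] - u R(x) satisfies E[tu|x] = E[tu|v] and E[(s-t)t] = 0.
xs = np.array([0, 1, 2, 3])
px = np.array([0.1, 0.2, 0.3, 0.4])
vx = xs % 2                              # index v(x)
sig2 = np.array([1.0, 2.0, 0.5, 3.0])    # sigma^2(x)

# outcomes: (x, u) with u = +/- sigma(x), each with prob p(x)/2
X = np.repeat(xs, 2)
P = np.repeat(px, 2) / 2
U = np.sqrt(np.repeat(sig2, 2)) * np.tile([1.0, -1.0], 4)
S = X + U * (X + 1.0)                    # an arbitrary s(x,u)

Eus = np.array([np.sum(P[X == x] * U[X == x] * S[X == x]) / px[i]
                for i, x in enumerate(xs)])            # E[us|x]
c = np.array([np.sum(px[vx == v] / sig2[vx == v] * Eus[vx == v])
              / np.sum(px[vx == v] / sig2[vx == v]) for v in (0, 1)])
R = (Eus - c[vx]) / sig2                 # R(x) of Lemma A.2

T = S - np.sum(P * S) - U * R[X]
Etu = np.array([np.sum(P[X == x] * T[X == x] * U[X == x]) / px[i]
                for i, x in enumerate(xs)])
print(np.allclose(Etu, c[vx]))                 # E[tu|x] depends only on v
print(abs(np.sum(P * (S - T) * T)) < 1e-12)    # orthogonality
```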
Proof of Theorem 4.1: By Assumption 4.1 and the implicit function theorem, f(z|θ) is smooth with score S_β satisfying

(A.12)  E[uS_β|x] = v_β + ∂G(v,β)/∂β,

where v_β denotes the derivative of G(v(x,β),β) with respect to β through the index. This equation implies that

(A.13)  E[σ^{-2}(x)uS_β|v] = E[σ^{-2}(x)E[uS_β|x]|v] = E[σ^{-2}(x)v_β|v] + [∂G(v,β)/∂β]E[σ^{-2}(x)|v].

By Lemma A.1 the tangent set is T, so by Lemma A.2 and eq. (A.13) the residual from the projection of S_β on T is S. The conclusion now follows by Bickel, Klaasen, Ritov, and Wellner (1992, Chapter 3, Theorem 2). QED.
There is a convenient characterization of efficiency that is useful for proving Theorem 5.1.

Lemma A.3: A regular estimator with influence function ψ(z) is efficient if and only if there is a constant matrix B such that ψ(z) = BS.

Proof: Let B = E[ψ(z)Sᵀ](E[SSᵀ])^{-1}. Then

    E[{ψ(z) − BS}{ψ(z) − BS}ᵀ] = E[ψ(z)ψ(z)ᵀ] − E[ψ(z)Sᵀ](E[SSᵀ])^{-1}E[Sψ(z)ᵀ] ≥ 0.

By eq. (4.2), the right-hand side is the difference between the variance of the estimator and the efficiency bound, so the estimator is efficient if and only if the left-hand side is zero, i.e. if and only if ψ(z) = BS. QED.
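The "only if" direction of Lemma A.3 can be made concrete with a hypothetical finite example (illustrative, not from the paper): perturbing ψ = BS by any e with E[eS] = 0 leaves the cross moment E[ψS] unchanged but strictly inflates the variance, so no influence function outside the span of the score can attain the bound.

```python
from fractions import Fraction as F

# Hypothetical 4-point sample space, each point with probability 1/4.
# Score S and perturbation e are chosen with E[S] = 0, E[e] = 0, E[eS] = 0.
S = [F(1), F(-1), F(2), F(-2)]
e = [F(1), F(1), F(-1), F(-1)]   # orthogonal to S: (1 - 1 - 2 + 2)/4 = 0

def E(vals):
    # exact expectation under equal probabilities
    return sum(vals) / len(vals)

assert E([e[i] * S[i] for i in range(4)]) == 0  # orthogonality of e and S

B = F(3, 2)                                      # a constant (scalar) matrix B
psi_eff = [B * S[i] for i in range(4)]           # the efficient form ψ = BS
psi_alt = [psi_eff[i] + e[i] for i in range(4)]  # same E[ψS], extra noise

m_eff = E([psi_eff[i] * S[i] for i in range(4)])  # cross moment of ψ = BS
m_alt = E([psi_alt[i] * S[i] for i in range(4)])  # unchanged by the perturbation
v_eff = E([psi_eff[i] ** 2 for i in range(4)])    # variance of the efficient ψ
v_alt = E([psi_alt[i] ** 2 for i in range(4)])    # strictly larger variance
```

Here v_alt − v_eff = E[e²] exactly, the Pythagorean decomposition behind the display in the proof.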
Proof of Theorem 5.1: By Lemma A.3, it suffices to show that E[ℓ(x)ᵀu·t] = 0 for all t ∈ 𝒯 implies E[ℓ(x)|v] = 0. For any bounded function a(v), σ^{-2}(x)u·a(v) ∈ 𝒯, since it has finite mean-square (E[σ^{-2}(x)] finite) and

    E[u·{σ^{-2}(x)a(v)u}|x] = a(v)σ^{-2}(x)E[u²|x] = a(v).

Thus,

    0 = E[ℓ(x)ᵀu·{σ^{-2}(x)a(v)u}] = E[ℓ(x)ᵀa(v)] = E[E[ℓ(x)|v]ᵀa(v)]

for any bounded a(v), and choosing a(v) to approximate E[ℓ(x)|v] in mean-square gives E[ℓ(x)|v] = 0, implying the conclusion. QED.

Proof of Theorem 5.2: As noted in the text, the density of x has the form f(x) = h(q(x)) for q(x) = (x − μ)ᵀΛ(x − μ), so that

(A.14)    F(x) ≡ −f(x)^{-1}∂f(x)/∂x = −2{h(q)^{-1}dh(q)/dq}Λ(x − μ).

Nonsingularity of Var(x) and of Λ imply that Var(F(x)) is nonsingular. Standard linear regression arguments then imply that there is a nonsingular matrix D such that DF(x) is the residual vector from the population linear regression of x₁₂ on v, and linearity of conditional expectations for spherically symmetric distributions gives

    DF(x) = x₁₂ − E[x₁₂|v].

Then since σ²(x) = σ²(v), by eq. (A.14) and Lemma A.3,

    Dψ(z) = σ^{-2}(v)G_v(v)DF(x)u = σ^{-2}(v)G_v(v){x₁₂ − E[x₁₂|v]}u,

so ψ(z) is a nonsingular linear combination of the efficient influence function, giving the conclusion. QED.

Proof of Theorem 5.3: As noted in the text, in the partial index model the influence function is ℓ(x)u with w(x) equal to the efficient weight, and without changing w(x), efficiency and Lemma A.3 imply that there is a matrix B such that ℓ(x) = B{x₁₂ − E[x₁₂|v]}. For notational simplicity let x_j denote the jth element of x₁₂ and x₋ⱼ = (x₁₂)₋ⱼ the remaining elements. Then, with Prob(u ≠ 0) = 1, this equation implies that

(A.15)    ℓ_j(x) = b_j(x_j − E[x_j|v]) + b₋ⱼᵀ(x₋ⱼ − E[x₋ⱼ|v])

for a constant scalar b_j and vector b₋ⱼ.
Also, b_j ≠ 0. Either b_j or b₋ⱼ is not zero, because nonsingularity of B follows from nonsingularity of E[SSᵀ]; and if b_j = 0, a nonzero b₋ⱼᵀ(x₋ⱼ − E[x₋ⱼ|v]) (again implied by finiteness of the variance bound for the index model) contradicts eq. (A.15). Dividing by b_j, we obtain

    E[x_j|v, x₋ⱼ] = E[x_j|v] + (−b₋ⱼ/b_j)ᵀ{x₋ⱼ − E[x₋ⱼ|v]},

giving the conclusion. QED.

Proof of Theorem 5.4: By equation (4.6),

    H′Ω^{-1}H = E[Sψ(z)ᵀ](E[ψ(z)ψ(z)ᵀ])^{-1}E[ψ(z)Sᵀ] = E[SSᵀ] − min_Π E[{S − Πψ(z)}{S − Πψ(z)}ᵀ],

and the other statements follow as immediate consequences. QED.
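The minimization in the last display is a second-moment projection. In the scalar case (a hypothetical check with made-up values, not the paper's multivariate statement), E[S²] − min_b E[(S − bΦ)²] = E[SΦ]²/E[Φ²], with the minimum attained at b* = E[SΦ]/E[Φ²]:

```python
from fractions import Fraction as F

# Hypothetical 4-point space, equal probabilities; scalar S and Φ.
S   = [F(2), F(-1), F(3), F(0)]
Phi = [F(1), F(1), F(-1), F(2)]

def E(vals):
    # exact expectation under equal probabilities
    return sum(vals) / len(vals)

ESS   = E([s * s for s in S])                    # E[S^2]
ESPhi = E([S[i] * Phi[i] for i in range(4)])     # E[SΦ]
EPhi2 = E([p * p for p in Phi])                  # E[Φ^2]

b_star = ESPhi / EPhi2                           # projection coefficient

def obj(b):
    # the objective E[(S - bΦ)^2] being minimized
    return E([(S[i] - b * Phi[i]) ** 2 for i in range(4)])

# The quadratic identity behind the displayed bound:
gap = ESS - obj(b_star)                          # equals E[SΦ]^2 / E[Φ^2]
```

Since obj(b) is quadratic in b, b* beats any other coefficient, which is what makes the "min" in the display well defined.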
Proof of Theorem 5.5: It suffices to prove the result with w(x) = σ²(v), since σ²(v) is bounded away from zero and factors out of each ℓ_j(x). Note first that for any vector b(x), E[‖b(x)u‖²] ≤ C·E[‖b(x)‖²], so it suffices to show the existence of square matrices Π_j such that E[‖ℓ(x) − Σ_j Π_j ℓ_j(x)‖²] → 0, i.e. that ℓ(x) is in the closure of ℒ₁ ⊕ ··· ⊕ ℒ_J, where

    ℒ_k = {ℓ: ℓ(x) = ∂a(x)/∂x_k + f(x)^{-1}{∂f(x)/∂x_k}a(x), a(x) ∈ ℒ₂, E[ℓ|x₋ₖ] = 0, E[ℓ²] < ∞},

and E[{f(x)^{-1}∂f(x)/∂x_k}²] is finite. Since x is bounded, polynomials in x are dense in ℒ₂, while for any polynomial p(x) and any a(x) with E[a|x₋ⱼ] = 0,

    E[{a(x) − (p(x) − E[p(x)|x₋ⱼ])}²] = E[Var(a(x) − p(x)|x₋ⱼ)] ≤ Var(a(x) − p(x)) ≤ E[{a(x) − p(x)}²],

so that the set {p(x) − E[p(x)|x₋ⱼ]: p(x) a polynomial} is dense in {a: E[a|x₋ⱼ] = 0, E[a²] < ∞}. Also, by the Hilbert space fact cited in the text, it suffices to show that for each k and each a(x) = p(x) − E[p(x)|x₋ₖ] there are coefficients π_jk such that

(A.16)    E[{a(x) − Σ_j π_jk ℓ_jk(x)}²] is arbitrarily small.

Let x̄(t, x) denote x with its kth component replaced by t. Since E[a|x₋ₖ] = 0,

    a(x) = b(x) + f(x)^{-1}{∂f(x)/∂x_k}∫_{t≤x_k} b(x̄(t, x))dt

for a continuous b(x). It follows by Assumption 5.1 that, for any ε > 0, there are polynomials p_k and coefficients π_jk of Assumption 5.2 such that |b(x) − Σ_j π_jk ∂p_k(x)/∂x_k| ≤ ε on the support of x, so that

    |a(x) − Σ_j π_jk ℓ_jk(x)| ≤ |b(x) − Σ_j π_jk ∂p_k(x)/∂x_k|
        + |f(x)^{-1}∂f(x)/∂x_k|·|∫_{t≤x_k} [b(x̄(t, x)) − Σ_j π_jk ∂p_k(x̄(t, x))/∂x_k]dt|
        ≤ C(1 + |f(x)^{-1}∂f(x)/∂x_k|)ε.

Therefore, since E[{f(x)^{-1}∂f(x)/∂x_k}²] is finite and ε is arbitrarily small, eq. (A.16) is satisfied. QED.
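The denseness step used above, that on a bounded support least-squares polynomial fits of increasing degree drive the mean-square approximation error down, is easy to see numerically. A small hypothetical illustration in pure Python (fitting |x| on a grid; the function, grid, and degrees are illustrative choices, not from the paper):

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # partial pivot
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def poly_mse(f, deg, grid):
    """Mean-square error of the degree-`deg` least-squares polynomial fit."""
    X = [[t ** j for j in range(deg + 1)] for t in grid]  # monomial design
    y = [f(t) for t in grid]
    n = deg + 1
    # normal equations X'X c = X'y
    AtA = [[sum(X[i][a] * X[i][b] for i in range(len(grid))) for b in range(n)]
           for a in range(n)]
    Aty = [sum(X[i][a] * y[i] for i in range(len(grid))) for a in range(n)]
    c = solve(AtA, Aty)
    fit = [sum(c[j] * X[i][j] for j in range(n)) for i in range(len(grid))]
    return sum((fit[i] - y[i]) ** 2 for i in range(len(grid))) / len(grid)

grid = [-1 + 2 * i / 200 for i in range(201)]  # bounded support [-1, 1]
mse2 = poly_mse(abs, 2, grid)                  # degree-2 fit of |x|
mse6 = poly_mse(abs, 6, grid)                  # degree-6 fit: strictly better
```

Since the degree-6 space nests the degree-2 space, the error can only fall as the degree grows, which is the mean-square denseness the proof relies on.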
References

Begun, J., W. Hall, W. Huang, and J. Wellner (1983): "Information and Asymptotic Efficiency in Parametric-Nonparametric Models," Annals of Statistics, 11, 432-452.

Bickel, P., C. Klaasen, Y. Ritov, and J. Wellner (1992): Efficient and Adaptive Inference in Semiparametric Models, forthcoming, Johns Hopkins University Press.

Chamberlain, G. (1987): "Efficiency Bounds for Semiparametric Regression," working paper, University of Wisconsin.

Doksum, K. and A. Samarov (1992): "Average Derivative Estimators of Transformations," work in progress.

Gallant, A.R. (1981): "On the Bias in Flexible Functional Forms and an Essentially Unbiased Form: The Fourier Flexible Form," Journal of Econometrics, 15, 211-245.

Hall, P. and H. Ichimura (1991): "Optimal Semiparametric Estimation in Single Index Models," preprint, Australian National University.

Hardle, W., W. Hildenbrand, and M. Jerison (1991): "Empirical Evidence on the Law of Demand," Econometrica, 59, 1525-1549.

Hardle, W. and T. Stoker (1989): "Investigating Smooth Multiple Regression by the Method of Average Derivatives," Journal of the American Statistical Association, 84, 986-995.

Ichimura, H. (1991): "Semiparametric Least Squares (SLS) and Weighted SLS Estimation of Single Index Models," preprint, Department of Economics, University of Minnesota.

Koenker, R. and G. Bassett (1982): "Robust Tests for Heteroskedasticity Based on Regression Quantiles," Econometrica, 50, 43-61.

Newey, W.K. (1990a): "Semiparametric Efficiency Bounds," Journal of Applied Econometrics, 5, 99-135.

Newey, W.K. (1990b): "Efficient Estimation of Semiparametric Models via Moment Restrictions," preprint, MIT Department of Economics.

Newey, W.K. (1991a): "The Asymptotic Variance of Semiparametric Estimators," MIT Department of Economics Working Paper No. 583.

Newey, W.K. (1991b): "Efficient Estimation of Tobit Models Under Symmetry," in Nonparametric and Semiparametric Methods in Econometrics and Statistics, W.A. Barnett, J.L. Powell, and G. Tauchen, eds., Cambridge: Cambridge University Press.

Pfanzagl, J. and W. Wefelmeyer (1982): Contributions to a General Asymptotic Statistical Theory, New York: Springer-Verlag.

Powell, J.L. (1991): "Estimation of Monotonic Regression Models Under Quantile Restrictions," in Nonparametric and Semiparametric Methods in Econometrics and Statistics, W.A. Barnett, J.L. Powell, and G. Tauchen, eds., Cambridge: Cambridge University Press.

Powell, J.L., J.H. Stock, and T.M. Stoker (1989): "Semiparametric Estimation of Index Coefficients," Econometrica, 57, 1403-1430.

Rodriguez, D. and T. Stoker (1992): "A Regression Test of Semiparametric Index Model Specification," working paper, Sloan School of Management, MIT.

Samarov, A. (1990): "On Asymptotic Efficiency of Average Derivative Estimates," preprint, Massachusetts Institute of Technology.

Stein, C. (1956): "Efficient Nonparametric Testing and Estimation," Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, Berkeley: University of California Press.

Stoker, T.M. (1986): "Consistent Estimation of Scaled Coefficients," Econometrica, 54, 1461-1481.

Stoker, T.M. (1991a): "Equivalence of Direct, Indirect and Slope Estimators of Average Derivatives," in Nonparametric and Semiparametric Methods in Econometrics and Statistics, W.A. Barnett, J.L. Powell, and G. Tauchen, eds., Cambridge: Cambridge University Press.

Stoker, T.M. (1991b): "Smoothing Bias in Density Derivative Estimation," MIT Sloan School of Management Working Paper.

Tsybakov, A.B. (1982): "Robust Estimates of a Function," Problems of Information Transmission, 18, 190-201.