Presented at the seventh international workshop on matrices and statistics in celebration of T. W. Anderson's 80th birthday, Fort Lauderdale, Florida, Dec. 11-14, 1998

FOUNDATIONS OF MULTIVARIATE INFERENCE USING MODERN COMPUTERS

H. D. Vinod, Economics Dept., Fordham University, Bronx, New York 10458, VINOD@murray.FORDham.edu

KEY WORDS: Bootstrap, Regression, Fisher information, Robustness, Pivot, Double bootstrap. JEL classifications: C12, C20, C52, C8.

ABSTRACT

In the 1930's Fisher suggested analytically structured pivot functions (PFs), whose distribution does not depend on unknown parameters. These pivots provided a foundation for (asymptotic) statistical inference. Anderson (1958, p. 116) introduced the concept of a critical function of observables, which gives the rejection probability of a test based on Fisher's pivot. Vinod (1998) shows that Godambe's (1985) pivot function (GPF), based on the Godambe-Durbin "estimating functions" (EFs) of 1960, is particularly robust compared to the pivots of Efron and Hinkley (1978) and Royall (1986). Vinod argues that numerically computed roots of GPFs based on scaled score functions can fill a long-standing need of the bootstrap literature for robust pivots. This paper considers Cox's example in detail and reports a simulation for it. It also discusses new pivots for the Poisson mean, the Binomial probability and the Normal standard deviation. In the context of regression problems we propose and discuss a second multivariate pivot (denoted by GPF²), which is asymptotically χ², along with robust choices of error covariances to allow for heteroscedasticity and autocorrelation.

1 Introduction and how estimating functions evolve into Godambe's pivot functions

The basic framework of asymptotic and small-sample statistical inference developed by Sir R. A. Fisher and others since the 1930's relies on the normality of the estimator θ̂ of θ. The Neyman-Pearson lemma provides a sufficient condition for the existence of the uniformly most powerful (UMP) test based on the likelihood ratio (LR). A level α test is called UMP if it has greater power than any other test of the same size α and retains its size α for all admissible parameter values. For the exponential family of distributions it is easy to verify that the LR function is monotone in the LR statistic.

Anderson (1958, p. 116) introduced the concept of a critical function of observables, which gives the rejection probability of a test based on Fisher's pivot. He also considers the properties of Hotelling's T² statistic, which is a quadratic form in normal variables. Since the critical region of the LR test for T² is a strictly increasing function of the observable statistic, whether the statistic is in the critical (rejection) region under the null hypothesis does not depend on unknown noncentrality parameters. Hence, Anderson argues that it is a UMP test. Mittelhammer (1996, Ch. 9-11) discusses this material in the modern jargon of pivots and explains the duality between confidence and critical regions, whereby UMP tests lead to uniformly most accurate (UMA) confidence regions. These methods are the foundations of statistical inference, which relies on parametric modeling using (asymptotic) normality. We show that statistical inference can be made more robust and nonparametric by exploiting the power of modern computers, which did not exist when the traditional methods were first developed.
The bootstrap literature (see, e.g., Hall, 1992, Vinod, 1993, Davison and Hinkley, 1997) notes that reliable inference from bootstraps needs valid pivot functions (PFs), whose distribution does not depend on unknown parameters θ. Let θ̂ be an estimator, SE be its standard error, and FPF denote Fisher's PF. Typical bootstraps resample only the older Wald-type statistics, FPF = (θ̂ − θ)/SE (see Hall, 1992, p. 128). Hu and Kalbfleisch (1997) include and refer to some exceptions. For certain biased estimators and ill-behaved SEs bootstraps can fail, because the FPFs are invalid pivots. Vinod (1998) shows that Godambe's (1985) pivot function (GPF) equals a scaled sum of T quasi-likelihood score functions (QSFs), where T denotes the number of observations. As a sum of T items, the GPF converges to N(0,I) unit normality directly by the central limit theorem (CLT). Since GPFs need neither unbiasedness of θ̂ nor well-behaved SEs to be valid pivots, they can fill a long-standing need in the bootstrap literature for valid pivots. Although the distribution of GPFs never depends on unknown parameters, we shall see that they often need greater computing power to find numerical roots of equations of the form GPF = (a constant).

Vinod (1998) reviews the attempts by Efron and Hinkley (1978) and Royall (1986) to inject robustness into Fisher's PF for nonnormal situations. He shows that there is an important link between robustness, nonnormality and the so-called "information matrix equality" (I_F = I_2op = I_opg) between the Fisher information matrix (I_F), the matrix of second order partials (I_2op) of the log likelihood, and the matrix of outer products of gradients (I_opg) of the log likelihood. We shall see that traditional confidence intervals (CIs) are obtained by analytically inverting the Wald-type statistic. Although Godambe (1985) mentions his pivot functions (GPFs), he uses them only when such analytical inversions are possible. Vinod (1998) suggests numerical inversions by extending GPFs to consider the numerical roots of equations GPF = (constant), called GPF-roots. Vinod's proposition 1 formally proves that GPF-roots yield more robust pivots than the Efron-Hinkley-Royall pivots. Similar to Royall, the robustness of GPF-roots is achieved by allowing I_2op ≠ I_opg, which occurs when there is nonnormal skewness and kurtosis. Further robustness is achieved by not insisting on analytical inversion of the GPF-roots, thereby permitting non-symmetric confidence intervals obtained by computer-intensive numerical roots.

If one needs confidence intervals for some 'estimable' functions of parameters f(θ), Vinod (1998) proposes numerically solving the GPF = (constant) equation for f(θ). The numerical GPF-roots are robust, because they avoid the 'Wald-type' statistic W = [f(θ̂) − f(θ)]/SE(f) altogether. One need not find the potentially "mixture" sampling distribution f_w(W) of W. There is no need to make sure that f_w(W) does not depend on the unknown θ. In fact, one need not even find the standard error SE(f) for each f(θ) or invert the statistic for confidence intervals (CIs). Assuming that reasonable starting values can be obtained, our numerical GPF-root method simply needs a reliable computer algorithm for solving nonlinear equations. Of course, some f(θ) will be better behaved than others. Dufour (1997) discusses inference for some ill-behaved f(θ) called 'locally almost unidentified' (LAU) functions, lau(θ).
He shows that LAU functions are quite common in applications, that lau(θ) may have unbounded CIs, and that the usual CIs can have zero coverage probability. Moreover, Dufour states that Edgeworth expansions or traditional bootstraps (on Wald-type statistics) do not solve the problem. Later, we state proposition 2, which avoids (bootstrapping) invalid pivots (Wald-type statistics) by solving GPF = constant for lau(θ). Again, the limiting normality of GPFs is proved directly by the CLT, avoiding the sampling distribution of lau(θ̂) altogether.

An introduction to the estimating function (EF) literature is given in Godambe and Kale (1991), Dunlop (1994), Liang and Zeger (1995), Heyde (1997) and Vinod (1997b, 1998). The EF estimators date back to 1960 and are defined as roots of a function g(y,θ) = 0 of data and parameters. The EF theory has several adherents in biostatistics, survey sampling and many applied branches of statistics. The optimum EF (OptEF), g* = 0, is unbiased and minimizes Godambe's optimality criterion:

Godambe-criterion = [Var(g)] / (E ∂g/∂θ)².    (1)

This minimizes the variance in the numerator; at the same time, the denominator lets g(θ+δ) for "small" nearby values (δ > 0) differ as much as possible from g. If g is standardized to g_s = g/(E ∂g/∂θ), then minimization of (1) means selecting the g with minimum (standardized) variance. In many special cases, the criterion (1) yields the maximum likelihood (ML) estimator θ̂_ml of the parameter θ.

Given T observations, the likelihood function Π_{t=1}^T f_t(y_t; θ) is a product of probability density functions. Let L = Σ_{t=1}^T L_t denote the log of the likelihood function, where L_t = ln f_t(y_t; θ). Kendall and Stuart (1979, sec. 17.16) note that the Cramer-Rao lower bound on the variance of an unbiased estimator τ̂_unb of some function τ(θ) is Var(τ̂) ≥ 1/E(S²), where S = ∂L/∂θ denotes the score. The lower bound is attained if and only if τ̂_unb − τ(θ) is proportional to the score, or if

S = ∂L/∂θ = A(θ) [τ̂_unb − τ(θ)].    (2)

Hence we derive the important result that the score equation S = 0 is an OptEF. The property of "attaining the Cramer-Rao lower bound" on the variance is a property of the underlying score function (EF) itself. Kendall and Stuart also prove a scalar version of the "information matrix equality" mentioned above. Later, we shall also use the corresponding observed information matrices Î_F, Î_opg and Î_2op. Kendall and Stuart note that the factor of proportionality A(θ) in (2) admits arbitrary functions of θ, and the implicit likelihood function for such A(θ) is obtained by integrating the score in (2). Note that such a likelihood function belongs to the exponential family of distributions defined by

f(y|θ) = exp{A(θ)B(y) + C(y) + D(θ)}.    (3)

It is well known, Rao (1973, p. 195) and Gourieroux et al. (1984), that several discrete and continuous distributions belong to the exponential family, including the binomial, Poisson, negative binomial, gamma, normal, uniform, etc. Wedderburn (1974) recommends using the exponential family score function even when the underlying distribution is not explicitly specified, and defines the quasi-likelihood function as the integral of the score. An appealing property of the EFs is that they "attain Cramer-Rao," which can be proved by using quasi-likelihood score functions (QSFs) without assuming knowledge beyond the mean and variance.
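To make (2) and (3) concrete, the following minimal symbolic sketch (in Python, not from the paper) verifies for the Poisson family, a member of the exponential family (3), that the score of T iid observations is proportional to (ȳ − θ), so ȳ attains the Cramer-Rao bound for τ(θ) = θ:

```python
# A minimal sympy sketch (illustrative only): check relation (2) for the Poisson family.
import sympy as sp

theta = sp.symbols('theta', positive=True)
T = 5                                   # small illustrative sample size
y = sp.symbols('y1:%d' % (T + 1), nonnegative=True)

# log likelihood L = sum_t [ y_t*log(theta) - theta - log(y_t!) ]
L = sum(yt * sp.log(theta) - theta - sp.log(sp.factorial(yt)) for yt in y)
score = sp.diff(L, theta)               # S = dL/dtheta

ybar = sum(y) / T
# S = A(theta)*(ybar - theta) with A(theta) = T/theta, so the score is an OptEF.
print(sp.simplify(score - (T / theta) * (ybar - theta)))   # prints 0
```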
See Heyde (1997) for non-exponential-family extensions of EFs and a cogent discussion of the advantages of EFs over traditional methods.

Remark 1: An important lesson from the EF theory is that an indirect approach is good for estimation. One should choose the best available EF (e.g., an unbiased EF attaining the Cramer-Rao bound) to indirectly ensure the best available estimators, which are defined as roots of EF = (constant). In many examples, the familiar direct method seeking the best properties of the roots (estimators) themselves can end up failing to achieve them. Vinod (1998) proposes a similar lesson for statistical inference: GPFs having desirable properties (asymptotic normality) indirectly achieve desirable properties (e.g., short CIs conditional on coverage) of the (numerical) GPF-roots.

This paper discusses Cox's example, the Poisson mean, the binomial probability, the Normal standard deviation and some multivariate pivots in the context of the regression problem omitted in Vinod (1998). This paper also derives a second kind of GPF, called GPF², whose asymptotic distribution is χ², again arising from CLT-type arguments. Section 3 relates GPFs to CIs and studies the relation between the information matrix equality and robustness. For Cox's example, section 4 derives the GPF and section 5 develops the CIs from bootstraps and simulates Cox's example. Section 6 develops a sequence of robust improvements from the FPF to the GPF. Section 7 gives more GPF examples. Section 8 discusses two types of GPFs for regressions. Section 9 has a summary and conclusions.

2. Superiority of EFs over maximum likelihood

If normality is not assumed and only the first two moments are specified, then the likelihood is unknown and the ML estimator is undefined. The QSFs as OptEFs remain available, although asymmetric derivatives lead to a failure of "integrability conditions," and hence non-existence of quasi-likelihoods (as integrals of quasi scores). Heyde (1997) proves that desirable properties (e.g., minimum variance) of the underlying EFs also ensure desirable properties of their roots. Heyde (1997, p. 2 and ch. 2) provides examples where the traditional θ̂ (= EF root) is not a 'sufficient statistic,' while the QSF provides a 'minimal sufficient partitioning of the sample space.' Moreover, two or more EFs can be combined and applied to semiparametric and semimartingale models, even if scores (QSFs) do not exist, Heyde (1997, ch. 11). Vinod's (1997b) and Vinod and Samanta's (1997) examples propose a new EF-estimator providing superior out-of-sample forecasts compared to the generalized method of moments (GMM). The superiority of EFs over ML is particularly noteworthy when regression heteroscedasticity is a known function of the regression parameters β. An example with a binary dependent variable is given in Vinod and Geddes (1998). Thus choosing optimal EFs and their roots as EF-estimators can be recommended. Godambe and Kale (1991) and Heyde (1997) prove that, both with and without normality, whenever optimal EF-estimators do not coincide with the usual ones (LS, ML or GMM), the EF estimators are superior to the others in a class of linear estimators.

3. Optimum EFs and CIs from Godambe's PF

Recall that only those EFs which minimize (1) are called OptEFs and denoted by g*. Godambe and Heyde (1987) define the quasi-likelihood score function (QSF) as the OptEF and prove three equivalent properties: (i) E(g* − S)² ≤ E(g − S)², (ii) corr(g*, S) ≥ corr(g, S), where corr(.)
denotes the correlation coefficient, and (iii) for large samples, the asymptotic confidence interval widths (CIWs) obtained by inversion of a quasi score function satisfy CIW(g*) ≤ CIW(g). Vinod (1998) emphasizes the third property involving CIWs, which is not discussed in great detail by Godambe and Heyde. His simulations support the third property.

If the limiting distribution is normal, θ̂ →d N(θ, Var(θ̂)), we have the following equivalent expressions for the inverse of the Fisher information matrix: I_F^{−1} = Var(θ̂) = I_opg^{−1} = I_2op^{−1}. Denoting by ASE the asymptotic standard errors computed from the diagonals of Var(θ̂), the asymptotic 95% CI (CI95) for the i-th element of θ is simply

[θ̂_i − 1.96 ASE_i, θ̂_i + 1.96 ASE_i].    (4)

There is no loss of generality in discussing the 95% = 100(1−α)% CIs, since one can readily modify α (= 0.05) to any desired significance level α < 1/2. From the tables of the standard normal distribution, where z ~ N(0,1), the constant 1.96 in (4) comes from the probability statement:

Pr[−1.96 ≤ z ≤ 1.96] = 0.95 = 1 − α.    (5)

It is known that Fisher's pivot function (FPF) is Wald-type and converges to the unit normal:

FPF = z_F = (θ̂_i − θ_i)/ASE_i →d N(0,1).    (6)

One can derive the CI95 in (4) by "inverting Fisher's PF," i.e., by replacing the z of (5) by the z_F of (6). This amounts to solving FPF = z_α (a constant = ±1.96). For more general two-sided scalar intervals, both a left-hand-tail probability α_L and a distinct right-hand-tail probability α_U must be given. Then we find quantiles z_L and z_U from normal tables to satisfy:

Pr[z_L ≤ z ≤ z_U] = 1 − α_L − α_U.    (7)

Now, the FPF roots solving FPF = z_L and FPF = z_U yield the limits of a more general CI. In finite samples, Student's t distribution is often used when the standard errors (SE_i) are estimated from the data, which will slightly increase the constant 1.96. In the sequel, we avoid notational clutter by using CI95 as a generic interval, 1.96 as a generic value of z from normal or t tables, and ASE or SE(.) as standard errors from the square root of the matrix Var(θ̂), usually based on Fisher's information matrix I_F.

Greene (1997, p. 153) defines a pivot as a function f_p(θ, θ̂) of θ and θ̂ with a known distribution. Actually, it would be a useless PF if the 'known distribution' depends on the unknown parameters θ. The distribution of a valid PF must be independent of θ. If θ̂ is a biased estimator of θ, where the bias depends on θ, then the distribution of the FPF in (6) obviously depends on θ, implying that the FPF is invalid; see Vinod's (1995) example of ridge regression. Resampling such invalid FPFs can lead to a failure of the bootstrap. We obtain the CIs in (5) by inverting a test statistic. Our pivot functions, GPF(y,θ), also include y but exclude θ̂ and/or its functions from the list of arguments. Clearly, GPFs are more general and contain all the information in the sample. Vinod's (1998) proposition 1, reproduced in section 6, states that GPFs converge robustly (with minimal assumptions) to the unit normal N(0,I) and are certainly not 'Wald-type' in the sense of Dufour (1997). We view GPFs as a step in the sequence of improvements to Fisher's PFs, started by Efron and Hinkley and Royall, designed to inject robustness. In the sequel, section 4 discusses the improvement sequence in the context of Cox's example, whereas a general discussion for univariate θ is in section 6. A vector generalization of "inverting a pivot" leads to confidence sets instead of intervals.
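A minimal numerical sketch (in Python, with hypothetical θ̂ and ASE values, not from the paper) of "inverting Fisher's pivot" as in (4)-(7):

```python
# Invert FPF = (theta_hat - theta)/ASE at normal quantiles; equal- and unequal-tailed CIs.
from scipy.stats import norm

theta_hat, ase = 5.18, 1.15          # illustrative values only
alpha = 0.05

z = norm.ppf(1 - alpha / 2)          # about 1.96, as in (5)
ci95 = (theta_hat - z * ase, theta_hat + z * ase)          # eq. (4)

alpha_L, alpha_U = 0.01, 0.04        # unequal tail probabilities, as in (7)
zL, zU = norm.ppf(alpha_L), norm.ppf(1 - alpha_U)
ci_general = (theta_hat - zU * ase, theta_hat - zL * ase)   # roots of FPF = z_U and FPF = z_L
print(ci95, ci_general)
```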
4. Derivation of the GPF for Cox's example

Cox's (1975) example estimates a univariate θ from y_it ~ N(θ, σ_i²), (i = 1,2 and t = 1,…,T), with known dichotomous random variances. One imagines two (independent) gadgets with distinct known measurement error variances. The choice of gadget i depends on the outcome of a toss of an unbiased coin. The subscripts of y_it imply that a record is kept of both the gadget number i and the t-th measurement. Denote by T_1 the total number of heads, by T_2 the total number of tails, and by ȳ_i the mean of the i-th sample. The log likelihood function is

L = constant − Σ_{t=1}^T Σ_{i=1}^2 log σ_i − (1/2) Σ_{t=1}^T Σ_{i=1}^2 (y_it − θ)²/σ_i².

The score is S = ∂L/∂θ = (L̃ − θR̃), where L̃ = (Σ_{t=1}^T y_1t/σ_1²) + (Σ_{t=1}^T y_2t/σ_2²) = (T_1 ȳ_1/σ_1²) + (T_2 ȳ_2/σ_2²) and R̃ = (T_1/σ_1²) + (T_2/σ_2²). The ML estimator of θ is the root of the score equation S = 0 solved for θ:

θ̂_ml = L̃/R̃ = [(T_1 ȳ_1/σ_1²) + (T_2 ȳ_2/σ_2²)] / [(T_1/σ_1²) + (T_2/σ_2²)].    (8)

Efron and Hinkley (EH, 1978) show that the (expected or true) Fisher information equals I_F = Var(S) = E(−Ṡ) = (T/2)[Σ_{i=1}^2 σ_i^{−2}], and the observed Fisher information Î_F is simply the R̃ just defined. Fisher's PF here is FPF = (θ̂ − θ)/SE, where SE is the square root of Var(θ̂) = T/I_F = 2[(1/σ_1²) + (1/σ_2²)]^{−1}. The CI95 for Cox's example will obviously invert this FPF. EH (1978) argue that T/I_F as a variance is not robust, since it uses T = T_1 + T_2, disregarding the realized number of heads (T_1) and tails (T_2). Instead, they suggest the 'observed Fisher information' variance

T/Î_F = T[(T_1/σ_1²) + (T_2/σ_2²)]^{−1}    (9)

to estimate the variance robustly. It is clear that (9) will equal T/I_F only if T_1 = T_2. The CIs from (9) will also be more robust.

So far, we are accepting on faith that the σ_i² are known. To inject further robustness Royall (1986) allows for errors in the supposedly 'known' σ_i². Royall's Var(θ̂) is

Â = T Î_2op^{−1} Î_opg Î_2op^{−1}.    (10)

For Cox's example, Royall derives the following expression for the variance based on (10):

T {Σ_{t=1}^T (y_1t − θ)²/σ_1⁴ + Σ_{t=1}^T (y_2t − θ)²/σ_2⁴} (Î_F)^{−2},    (11)

where Î_F is known from (9). Royall states that (11) is more robust than Efron and Hinkley's T/Î_F of (9), because it offers protection against errors in the assumed variances σ_i². Since (10) reduces to T/Î_F only if I_2op = I_opg, Royall's variance is more robust. By avoiding the 'information matrix equality' Royall avoids the restrictive assumptions that the skewness γ_1 = 0 and the kurtosis γ_2 = 0.

Before developing the GPF for Cox's example, we need the optimal EF and a "point estimate" of θ. If normality of y_it is believed, the likelihood function is available, the score equation S = 0 is the OptEF and its root yields the ML estimator of (8) above. If we assume only knowledge of the first two moments (instead of normality), we can construct the quasi scores from the two moments for each t as g*_t = Σ_{i=1}^2 σ_i^{−2}(y_it − θ) = Σ_{i=1}^2 g*_it, defining the g*_it notation. Though not needed here, it is instructive to work with a general method for EF point estimates. First verify the orthogonality E g*_1t g*_2t = 0. Then solve:

Σ_{t=1}^T g*_1t [E(∂g*_1t/∂θ)/E(g*_1t)²] + Σ_{t=1}^T g*_2t [E(∂g*_2t/∂θ)/E(g*_2t)²] = 0.    (12)

Note that the terms simplify as E(∂g*_it/∂θ) = −σ_i^{−2} and E(g*_it)² = E σ_i^{−4}(y_it − θ)² = σ_i^{−2}. Hence (12) becomes 0 = Σ_{t=1}^T Σ_{i=1}^2 g*_it σ_i^{−2} σ_i² = Σ_{t=1}^T Σ_{i=1}^2 σ_i^{−2}(y_it − θ), and its solution is the optimal EF-estimator.
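For concreteness, a minimal numerical sketch (in Python, with simulated data; the paper's own computations are in GAUSS) of the point estimate (8) and the three variance formulas (9)-(11) as stated above:

```python
# Simulate the two-gadget data and compare the Fisher, Efron-Hinkley and Royall variances.
import numpy as np

rng = np.random.default_rng(0)
theta_true, sig = 5.0, np.array([1.0, 2.0])
T1, T2 = 50, 30
y1 = rng.normal(theta_true, sig[0], T1)
y2 = rng.normal(theta_true, sig[1], T2)

Ltil = y1.sum() / sig[0]**2 + y2.sum() / sig[1]**2
Rtil = T1 / sig[0]**2 + T2 / sig[1]**2          # observed information, R tilde
theta_ml = Ltil / Rtil                          # eq. (8), also the EF estimate (13)

T = T1 + T2
var_fisher = T / ((T / 2) * (sig**-2).sum())    # T / I_F, the expected-information variance
var_eh = T / Rtil                               # eq. (9), observed-information variance
var_royall = T * ((((y1 - theta_ml)**2) / sig[0]**4).sum()
                  + (((y2 - theta_ml)**2) / sig[1]**4).sum()) / Rtil**2   # eq. (11)
print(theta_ml, np.sqrt([var_fisher, var_eh, var_royall]))
```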
One can verify that the OptEF estimator in this general setting for Cox's example is simply

θ̂_ef = L̃/R̃ = θ̂_ml,    (13)

where L̃ and R̃ are the same as in (8). Thus even if normality of y_it is not believed (relaxing Cox's specification for greater robustness), the ML estimator is still the optimal EF-theory point estimate. Finally, we are ready for the GPF. Godambe's (1985) pivot is defined in terms of g*_t as

GPF = z_G = Σ_{t=1}^T g*_t / √{Σ_{t=1}^T g*_t²} = Σ_{t=1}^T S̃_t →d N(0,1).    (14)

This defines the notation S̃_t for 'scaled quasi-scores' and z_G. Now, Cox's example GPF-roots are solutions of GPF = constant. Without the (θ − θ̂) term, (14) does not look like a typical Fisherian pivot (FPF) as in (6). However, we claim that: (i) the very absence of (θ − θ̂) is an important asset of GPFs for robust inference, (ii) E(GPF) = 0 can hold even if E(θ̂) ≠ θ, and (iii) numerical GPF-roots similar to (4) will yield CIs for the unknown θ, associated with EF-estimators, while avoiding explicit studies of their sampling distributions. In support of our claim, Heyde (1997, p. 62) proves that CIs from 'asymptotic normal' GPFs are shorter than CIs from 'locally asymptotic mixed normal' pivots. Mixtures of distinct distributions in the usual FPF = (θ̂ − θ)/SE obviously come from the distribution of the random variable θ̂ and the distribution of the estimated SE in the denominator of the FPF. No mixture distributions are needed to assert (14), since its normality is based on the CLT applied to its definition as a sum of scaled scores. We are now ready to place GPFs in the context of the bootstrap for robust computer-intensive inference.

5. Computation of GPF roots, CIs from bootstraps and a simulation for Cox's example

Two analytical solutions of FPF = (θ̂ − θ)/SE = z_α = ±1.96 give the limits of the CI95 of (4). From (14) it is clear that a CI95 obtained by solving GPF = (a nonzero constant) analytically is impossible. One needs numerical methods to estimate the limits of a CI95 from the GPF. Let us rewrite (14) without the reciprocals of square roots, for better behavior of numerical algorithms, as follows:

z_α {Σ_{t=1}^T Σ_{i=1}^2 σ_i^{−4}(y_it − θ)²}^{0.5} = Σ_{t=1}^T Σ_{i=1}^2 σ_i^{−2}(y_it − θ).    (15)

The choice z_α = ±1.96 depends on the normality in (14), which may not hold exactly in finite samples. Bootstraps achieve robustness by resampling the S̃_t of (14) using a nonparametric distribution induced by the empirical distribution function (EDF) of a large number (J = 999) of solutions of GPF = 0.

Hall's (1992) approach to short bootstrap confidence intervals is typical. It involves recomputing J times the FPF = (θ̂ − θ)/SE values. The resampling from the EDF of such FPF values avoids parametric distributional assumptions. Next, one considers j = 1,…,J estimates of FPF-roots θ̂_j, their 'order statistics,' and perhaps seeks a detailed numerical look at the approximate sampling distribution of θ̂. If the sampling distribution of the FPF depends on θ (e.g., if the bias E(θ̂) − θ depends on θ), that FPF is an invalid pivot, and hence any bootstrap using an invalid pivot may need considerable adjustment, if not complete rejection. An adjustment for ridge regression is discussed in Vinod (1995).

Since SE > 0, if we solve FPF = 0 for θ the solution is equivalent to solving (θ̂ − θ) = 0. The solution of FPF = 0 is θ̂ (the ML estimator). One can imagine pivot functions whose roots do not equal θ̂. However, when solving GPF = 0 it is easy to verify that the root of a sum of scaled scores ΣS̃_t is the same as the ML estimator (if it is well defined) or the root of the score function ΣS_t. After all, the scale factor vanishes when solving GPF = 0. Thus for Cox's example we can use the ML estimates (= θ̂_ef) to generate scaled scores S̃_t for each t.
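Continuing the numerical sketch above (reusing y1, y2, sig and theta_ml), here is one illustrative way, not the paper's GAUSS/NLSYS code, to find the GPF-roots of (15) at z_α = ±1.96; note that the GPF is decreasing in θ, so z = +1.96 yields the lower limit:

```python
# Solve equation (15) numerically for theta at z = +1.96 and z = -1.96.
import numpy as np
from scipy.optimize import brentq

def gpf_equation(theta, z):
    """Left side minus right side of (15); its root in theta is a GPF-root."""
    scores = np.concatenate([(y1 - theta) / sig[0]**2, (y2 - theta) / sig[1]**2])
    quad = np.concatenate([(y1 - theta)**2 / sig[0]**4, (y2 - theta)**2 / sig[1]**4])
    return z * np.sqrt(quad.sum()) - scores.sum()

lo = brentq(gpf_equation, theta_ml - 5, theta_ml, args=(+1.96,))   # lower limit
hi = brentq(gpf_equation, theta_ml, theta_ml + 5, args=(-1.96,))   # upper limit
print("GPF-root CI95:", (lo, hi))
```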
Let us denote by ẑ_Gt the estimated t-th scaled score, obtained by evaluating the S̃_t of (14) at θ = θ̂_ef.    (16)

We suggest a bootstrap-type shuffling with replacement of these (t = 1,…,T) scaled scores J (= 999) times. This creates J replications of GPFs from their own EDF. As with the FPF-roots, solving the resulting equations numerically for each replicate yields J estimates of GPF-roots, to be analyzed by descriptive and order statistics as follows. Let z̄_Gj and sd_j(.) denote the sample mean and standard deviation over the T values, for each j = 1,…,J. Although E(GPF) = 0 holds for large T, for relatively small T the observed mean z̄_Gj may be nonzero. However, if we can assume that any discrepancy between z̄_Gj and zero does not depend on the unknown θ, the following FPF remains valid: ẑ*_j = (ẑ_Gtj − z̄_Gj)/sd_j(ẑ_Gt), where ẑ_Gtj denotes the scaled score from (16) for the j-th replicate. Therefore, we can approximate the sampling distribution of the root θ̂_ef by substituting the standardized resampled ẑ*_j for the z_α in (15). Thus one can use the estimated S̃_t to provide a refined z_α instead of the traditional ±1.96. Next, we substitute these refined z_α's in (15) and numerically solve for each j = 1,…,J to yield the GPF-roots θ̂*_j. Arranging the roots in increasing order yields "order statistics" denoted by θ̂*_(j). Hence a possibly non-symmetric, (single) bootstrap nonparametric CI95 is given by

[θ̂*_(25), θ̂*_(975)].    (17)

If any CI procedure insists that the CI must always be symmetric around θ̂_ef, it is intuitively obvious that it will not be robust. After all, it is not hard to construct examples where a non-symmetric CI is superior. The Efron-Hinkley and Royall CI95's (based on (9) or (11) respectively) are prima facie not fully robust, simply because they retain the symmetric structure. For the FPF = (θ̂ − θ)/SE, Hall (1992, p. 111-113) proves that symmetric CIs are asymptotically superior to their equal-tailed counterparts. However, Hall does not consider GPFs, and his example shows that his superiority result depends on the confidence level: it holds true for a 90% interval, for example, but not for a CI95. In finite samples, symmetric CIs are not robust in general.

A comparison between the (single) bootstrap GPF-CI95 of (17) and the parametric bootstrap from (15) using the generic constants ±1.96 is possible. Instead, our GPF-N(0,I) algorithm generates J (parametric) unit normal deviates z_Gj from N(0,1) and substitutes them for z_α in (15). Again, each z_Gj yields a nonlinear equation (15), which may be solved by numerical methods, with GPF-roots denoted by θ̂_efj. The appropriate order statistics yield the following CI95 from our GPF-N(0,I) algorithm:

[θ̂_ef(25), θ̂_ef(975)].    (18)
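Continuing the sketch above (reusing gpf_equation, theta_ml and the data), the following illustrates the two CI95 algorithms; the studentization used for the nonparametric replicates is one plausible reading of the recipe around (16), not a transcription of the paper's GAUSS implementation:

```python
# Parametric GPF-N(0,I) bootstrap of (18) and nonparametric shuffle bootstrap of (17).
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
J = 999

def gpf_root(z):
    """Solve (15) for theta at a given quantile z (GPF is decreasing in theta)."""
    return brentq(gpf_equation, theta_ml - 5, theta_ml + 5, args=(z,))

# parametric algorithm: replace z_alpha in (15) by unit normal deviates
roots_par = np.sort([gpf_root(z) for z in rng.standard_normal(J)])
ci95_par = (roots_par[24], roots_par[974])          # order statistics as in (18)

# nonparametric algorithm: resample the estimated scaled scores and studentize each sum
g_t = np.concatenate([(y1 - theta_ml) / sig[0]**2, (y2 - theta_ml) / sig[1]**2])
s_t = g_t / np.sqrt((g_t**2).sum())                 # estimated scaled scores, eq. (16)
zs = []
for _ in range(J):
    s_star = rng.choice(s_t, size=s_t.size, replace=True)
    zs.append((s_star.sum() - s_t.sum()) / (np.sqrt(s_t.size) * s_star.std(ddof=1)))
roots_np = np.sort([gpf_root(z) for z in zs])
ci95_np = (roots_np[24], roots_np[974])             # as in (17)
print(ci95_par, ci95_np)
```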
Remark 2: This remark is a digression from the main theme. If computational resources are limited (e.g., if T is very large), the following CI95 remains available for some problems. First, find an unbiased estimate θ̂_unb, which can be obtained from the mean of a small simulation. Next, compute the bias factor b = θ̂_unb/θ̂_ef and an 'unbiased estimate of squared bias' U = (θ̂_ef − θ̂_unb)². Adding the estimated variance to U yields an unbiased estimate of the mean squared error (UMSE), and its square root r(UMSE). Vinod (1984) derives the sampling distribution of a generalized t ratio, where the ASE is replaced by r(UMSE), as a ratio of weighted χ² random variables. He indicates approximate methods for obtaining appropriate constants like the generic 1.96 for a numerically 'known' bias factor b.

Now we discuss our simulation of Cox's example with T_1 = 50, T_2 = 30, σ_1 = 1 and σ_2 = 2. We generate y_i ~ N(5, σ_i²) for i = 1,2. Our ȳ_i = (5.2787, 4.8654). The ML estimator θ̂_ml = 5.1833 with an ASE of 1.1547, and the ML CI95 is [2.9201, 7.4465]. The Efron-Hinkley estimate of the ASE is 1.1094 with a CI95 of [3.0089, 7.3577]. Royall's ASE = 1.0706 gives a shorter CI95: [3.0848, 7.2817]. Our simulation of the GPF of (14) is implemented with (15). We simply compare the two CI95's from the nonparametric algorithm of (17) and the parametric algorithm of (18). For (18), we use J = 999 unit normal deviates (GAUSS computer language) and rank-order the roots θ̂_j obtained by solving (15) with the help of GAUSS's NLSYS library. The smallest among the 999 estimates, min(θ̂_j), is 4.7761, and the mean(θ̂_j) is 5.1835. The median and maximum are respectively 5.1885 and 5.5408. The standard deviation is 0.1263. Since the median is slightly larger than the mean, the approximate sampling distribution is slightly skewed to the left. Otherwise, the sampling distribution is quite tight and fairly well behaved, with a remarkably short CI95 from (18): [4.9312, 5.4373]. A similar GPF interval from the nonparametric single bootstrap of (17) is [4.9572, 5.4924], with a slightly larger standard deviation of 0.1283 over the 999 realizations.

Note that in this simulation we know that the true value of θ is 5, and we can compare the widths of intervals given that the true value 5 is inside the CIs (conditional on coverage). The widths (CIWs) in decreasing order are: 4.53 for classical ML, 4.35 for Efron-Hinkley, 4.20 for Royall, 0.51 for our parametric version, and 0.53 for our nonparametric version. Hence we conclude that the parametric CI95 from the simulated (18) is the best for this example, with the shortest CIW. The CIW of 0.53 for the nonparametric (17) is almost as low, and the difference may be due to random variation. Thus for Cox's example, used by others in the present context, our simulation shows that the GPFs provide a superior alternative. This supports Godambe and Heyde's (1987) property (iii) and our discussion.

6. Sequence of robust improvements: Fisher, Efron-Hinkley, Royall and GPFs

Having defined our basic ideas in the context of Cox's simple example, we are ready to express our results for a univariate θ in a general setting. Our aim is to work with score functions and review the sequence of robust improvements over Fisher's setting for CIs. Recall that L = Σ_{t=1}^T L_t = Σ_{t=1}^T ln f_t(y_t; θ) is the log of the likelihood function. Fisher's pivot is

z_F = (θ − θ̂){Σ_{t=1}^T −E ∂²L_t/∂θ²}^{0.5} = (θ − θ̂)√{Σ_{t=1}^T −E Ṡ_t}.    (19)

This equals the FPF in (6) provided the population expected values of the likelihood second order partials are such that √{Σ_{t=1}^T −E Ṡ_t} = 1/ASE. Hence the likelihood structure is implicitly preordained. That is, (19) needs to impose nonrobust assumptions to estimate E Ṡ_t from observable variances. Efron and Hinkley (1978) remove some structure (i.e., inject robustness) by removing the expectation operator E from (19). They formally prove that the following pivot is more robust than the FPF:

z_EH = (θ − θ̂){Σ_{t=1}^T −∂²L_t/∂θ²}^{0.5}.    (20)

Royall (1986) goes a step further and argues that the asymptotic CIs from (20) can be invalid. He proves that when the assumed parametric model fails, the variance estimator is inconsistent.
Royall uses the delta method to obtain a simple alternative variance estimator included in the following pivot:

z_R = (θ − θ̂){Σ_{t=1}^T −∂²L_t/∂θ²}{Σ_{t=1}^T [∂L_t/∂θ]²}^{−0.5}.    (21)

Now Godambe's (1985, 1991) PF, whose distribution also does not depend on θ, is

GPF = z_G = Σ_{t=1}^T ∂L_t/∂θ {Σ_{t=1}^T [∂L_t/∂θ]²}^{−0.5}.    (22)

For Cox's example, (17) and (18) give CI95's from (14) as a special case of (22). Unlike z_F, z_EH and z_R, our GPF of (22) lacks the term (θ − θ̂), yet we can almost always give CI95's from its numerical GPF-roots. To the best of my knowledge, (22) has not been implemented in the literature for Cox's or other examples. A multivariate extension of (22) is given later in (33).

Let ψ_T = Π_{t=1}^T (1 + iλS̃_t), where i² = −1 and λ is a real number. Assuming (a) ψ_T is uniformly integrable, (b) E(ψ_T) → 1 as T → ∞, (c) Σ S̃_t² → 1 in probability as T → ∞, and (d) max_t |S̃_t| → 0 in probability as T → ∞, McLeish (1974) proved (without assuming finite second moments) a central limit theorem (CLT) for dependent processes (common in econometrics). Now we state without proof Vinod's (1998) proposition:

Proposition 1: Let the scaled quasi scores S̃_t satisfy McLeish's four assumptions. Then, by his CLT, their partial sum GPF = z_G = Σ_{t=1}^T S̃_t →d N(0,1), i.e., converges in distribution to the unit normal as T → ∞. Defining robustness as the absence of additional assumptions, such as asymptotic normality of a root θ̂, z_G is more robust than z_F, z_EH and z_R.

7. Further GPF examples from the exponential family

In this section we illustrate our proposal for various well-known exponential family distributions and postpone the normal regression application to the next section.

Poisson mean: If y_t are iid Poisson random variables, the ML estimator of the mean θ is ȳ. The variance of y_t is also θ, and Var(ȳ) = ȳ/T. Hence the CI95 is [ȳ ± 1.96√(ȳ/T)] for both Fisher and Efron-Hinkley. Royall's variance estimate of Var(y_t) is V̂ = Σ_{t=1}^T (y_t − ȳ)²/T, and his CI95 is [ȳ ± 1.96√(V̂/T)]. If the Poisson model is valid, V̂ ≈ ȳ and the CI95 is the same as Fisher's or Efron-Hinkley's. Now f_t(y_t; θ) = (θ^{y_t}/y_t!) exp(−θ). Hence ∂L_t/∂θ = (y_t/θ) − 1 = θ^{−1}(y_t − θ), and the GPF-roots are obtained by solving GPF = constant, where

GPF = z_G = Σ_{t=1}^T (y_t − θ) / √{Σ_{t=1}^T (y_t − θ)²}.    (23)

Binomial probability: Let y_t be independently distributed as Binomial, with probability (k choose y) θ^y (1−θ)^{k−y} for y = 0, 1,…,k. The ML estimator is θ̂ = ȳ/k, and the variance used by both z_F and z_EH is (ȳ/k)[1 − (ȳ/k)]/k. Royall shows that V̂ = Σ_{t=1}^T (y_t − ȳ)²/(Tk²) is robust in the sense that, if the actual distribution is hypergeometric, his V̂ converges to the true variance, unlike Fisher's or Efron-Hinkley's. Since ∂L_t/∂θ = y_t θ^{−1}(1−θ)^{−1} − k(1−θ)^{−1} = (1−θ)^{−1}θ^{−1}(y_t − kθ), the GPF-roots need numerical solutions, as above. The GPF is defined by

z_G = Σ_{t=1}^T (1−θ)^{−1}θ^{−1}(y_t − kθ) {Σ_{t=1}^T (1−θ)^{−2}θ^{−2}(y_t − kθ)²}^{−0.5}.    (24)

Normal standard deviation: For y_t independent normal, N(μ, σ²), the ML estimate of σ is σ̂ = √{Σ_{t=1}^T (y_t − ȳ)²/T}. The asymptotic variance is σ̂²/2, whether one uses Fisher's or Efron-Hinkley's estimate. Royall's variance estimate is V̂ = (1/(4σ̂²))[Σ_{t=1}^T {(y_t − ȳ)² − σ̂²}²/T], and his asymptotic CI95 is robust against nonnormality. Since ∂L_t/∂σ = σ^{−3}(y_t − μ)² − σ^{−1}, we again need numerical solutions of GPF = constant, where the GPF is defined by

z_G = {Σ_{t=1}^T (y_t − μ)² − Tσ²} {Σ_{t=1}^T (y_t − μ)⁴ − 2σ² Σ_{t=1}^T (y_t − μ)² + Tσ⁴}^{−0.5}.    (25)
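A minimal self-contained sketch (in Python, with illustrative simulated data, not from the paper) of the Poisson GPF-roots of (23), compared with the Wald-type CI95:

```python
# Solve (23) = +/-1.96 numerically for theta (GPF is decreasing in theta).
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)
y = rng.poisson(3.0, size=40)
ybar, T = y.mean(), y.size

def gpf_poisson(theta, z):
    # rewritten as in (15): z * sqrt(sum (y_t - theta)^2) - sum (y_t - theta)
    return z * np.sqrt(((y - theta)**2).sum()) - (y - theta).sum()

lo = brentq(gpf_poisson, ybar - 5, ybar, args=(+1.96,))   # z = +1.96 gives the lower limit
hi = brentq(gpf_poisson, ybar, ybar + 5, args=(-1.96,))
wald = (ybar - 1.96 * np.sqrt(ybar / T), ybar + 1.96 * np.sqrt(ybar / T))
print((lo, hi), wald)
```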
8. The GPFs for regressions

In this section we consider two types of GPFs for regression problems, which also involve the exponential family. Consider the usual regression model with T observations and p regressors: y = Xβ + ε, E(ε) = 0, Eεε′ = σ²I, where I is the identity matrix. The log likelihood function L = Σ_{t=1}^T L_t contains

L_t = (−0.5) log(2πσ²) − 2^{−1}σ^{−2}[y_t − (Xβ)_t]′[y_t − (Xβ)_t],    (26)

where (Xβ)_t denotes the t-th element of the T×1 vector Xβ and ε_t = y_t − (Xβ)_t. Now ∂L_t/∂β is proportional to [X_t′y_t − X_t′X_tβ], where X_t = (x_t1,…,x_tp) is a row vector, and the (X_t′X_t) are p×p matrices giving p equations for each t = 1,…,T. It is well known, Davidson and MacKinnon (DM) (1993, p. 489), that (26) often has an additional term representing the contribution of a Jacobian factor. For example, when y_t is subjected to a transformation τ(y_t) = log(y_t), ∂τ(y_t)/∂y_t = 1/y_t and the Jacobian factor is its absolute value. Since log(|1/y_t|) = −log(|y_t|), this will be the additional term in (26). Another kind of familiar generality is achieved by replacing Xβ by a nonlinear function X(β) of β and letting the covariances depend on parameters φ, with ε ~ N(0, H(φ)), DM (1993, p. 302).

Note from (26) that the underlying quasi score function S = Σ_{t=1}^T (∂L_t/∂β) = 0 is interpreted here as g*, the OptEF. The EF solution here is simply β̂_ef ≡ β̂_ols = (X′X)^{−1}X′y, the ordinary least squares (OLS) estimator. Hence the OLS estimator is equivalent to the root of the following OptEF or "normal equations":

g_ols = g* = X′(y − Xβ) = Σ_{t=1}^T X_t′(y_t − X_tβ) = 0.    (27)

If E(εε′) = σ²H is known, we have the generalized least squares (GLS) estimator, whose "normal equations" in the notation of Vinod (1998) are

g_gls = X′H^{−1}y − X′H^{−1}Xβ = Σ_{t=1}^T H_t′(y_t − X_tβ) = Σ_{t=1}^T H_t′ε_t = Σ_{t=1}^T S_t = 0,    (28)

where we view g_gls = 0 = g* as our OptEF, H_t denotes the t-th row of H^{−1}X, and the score S_t is p×1. The ML estimator under the normality assumption is the same as the GLS estimator. We have shown that the usual normal equations in (28) represent a sum of T scores, whose asymptotic normality can be proved directly from the CLT. For constructing the GPF for inference, we seek a scaled sum of scores, where the scale factors are based on variances. The usual asymptotic theory for the nonlinear regression y = X(β) + ε, ε ~ N(0,H), implies the following expression for the variances, DM (1993, p. 290):

√T(β̂ − β) →d N(0, plim_{T→∞}[T^{−1}X(β)′H^{−1}X(β)]^{−1}).    (29)

In the linear case, X(β) = Xβ, the Fisher information matrix I_F is [X′H^{−1}X]/σ̂², where (T−p)σ̂² = (y − Xβ̂)′H^{−1}(y − Xβ̂). Now its inverse, I_F^{−1}, is the asymptotic covariance matrix denoted as (ASE)². In the OLS regression with spherical errors, a 100(1−α)% confidence region for β contains those values of β which satisfy the inequality, Donaldson and Schnabel (1987):

(y − Xβ)′(y − Xβ) − (y − Xβ̂_ols)′(y − Xβ̂_ols) ≤ s²p F_{p,T−p,1−α},

where s² = (y − Xβ̂_ols)′(y − Xβ̂_ols)/(T−p), and where F_{p,T−p,1−α} denotes the upper 100(1−α)% quantile of the F distribution with p and T−p degrees of freedom in the numerator and denominator, respectively. It is equivalently written as

(β̂_ols − β)′X′X(β̂_ols − β) ≤ s²p F_{p,T−p,1−α},

which shows that the shape of the region is ellipsoidal. If the Frisch-Waugh theorem is used to convert the regression problem into p separate univariate problems, use of the bootstrap seems to be a good way to obtain robustness.
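A minimal numerical sketch (in Python, with simulated data; variable names are illustrative and reused by the later sketches) of the Donaldson-Schnabel ellipsoidal region just described:

```python
# Check whether a candidate beta lies inside the 95% confidence ellipsoid
# (beta_ols - beta)' X'X (beta_ols - beta) <= s^2 * p * F_{p,T-p,0.95}.
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(3)
T, p = 80, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, p - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=T)

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_ols
s2 = resid @ resid / (T - p)
crit = s2 * p * f_dist.ppf(0.95, p, T - p)

def in_region(beta):
    d = beta_ols - beta
    return d @ (X.T @ X) @ d <= crit     # ellipsoidal region membership

print(in_region(beta_true), in_region(beta_true + 5.0))
```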
The usual F tests for regressions are based on Fisher's pivot z_F = (β − β̂)(ASE)^{−1}. Replacing the expected information matrix by the observed information, one obtains the Efron-Hinkley pivot z_EH = (β − β̂)(ASÊ)^{−1}. If X is nonstochastic, E[X′H^{−1}X] = [X′H^{−1}X] and z_EH equals z_F. A simple binary variable regression where this makes a difference, and is intuitively sensible, is given by DM (1993, p. 267). For the special case of heteroscedasticity, where H is a known diagonal matrix, Royall (1986) suggests the following robust estimator of the covariance matrix:

Â_1 = T[X′H^{−1}X]^{−1} X′H^{−1} diag(ε̂_t²) H^{−1}X [X′H^{−1}X]^{−1},    (30)

where ε̂_t = y_t − (Xβ̂)_t is the residual and diag(.) denotes a diagonal matrix. Royall argues that this is essentially a weighted jackknife variance estimator. Now, let us use the g_gls of (28) to define the outer product of gradients (opg) matrix to substitute in (10). Then we have

Â_2 = T[X′Ĥ^{−1}X]^{−1} (Σ_{t=1}^T X_t′Ĥ^{−1}(y_t − X_tβ̂)(y_t − X_tβ̂)′Ĥ^{−1}X_t) [X′Ĥ^{−1}X]^{−1},    (31)

where a consistent estimate of H is denoted by Ĥ. If H is the identity matrix, (30) reduces to what is known in econometrics texts, DM (1993, p. 553), as the Eicker-White heteroscedasticity consistent (HC) covariance matrix estimator T(X′X)^{−1}X′ĤX(X′X)^{−1}. In the econometric literature four different HC estimates are distinguished by the choice of the diagonal matrix Ĥ = diag(HCj) for j = 0,1,2,3:

HC0 = ε̂_t², HC1 = ε̂_t² T/(T−p), HC2 = ε̂_t²/(1−h_t), and HC3 = ε̂_t²/(1−h_t)²,    (32)

where h_t denotes the t-th diagonal element of the hat matrix X(X′X)^{−1}X′. Among these, DM (1993) recommend HC3 by alluding to some Monte Carlo studies. Royall's pivot based on (31) is z_R = (β − β̂)(Â_2)^{−0.5}.

A p-variate GPF similar to (22) is [X′H^{−1}Eεε′H^{−1}X]^{−0.5}X′H^{−1}ε, where Eεε′ = H is assumed to be known. Substituting Eεε′ = H, this expression yields p equations in the p unknown coefficients β:

GPF = Σ_{t=1}^T S̃_t = [X′H^{−1}X]^{−0.5} X′H^{−1}ε.    (33)

Recall that (28) shows that X′H^{−1}ε = ΣS_t, a sum of scores. In (33) this sum of scores is premultiplied by a scaling matrix. Thus our GPF for regressions in (33) is a sum of T scaled scores, and the sum is asymptotically N(0,I) by the CLT. In particular, the OLS case has H = I and GPF = [X′X]^{−0.5}X′ε. Rather than assuming a known H, a flexible choice discussed below is H = H(φ), where the φ parameters can represent the autocorrelation and/or heteroscedasticity among the ε errors.

A bootstrap of (33) in Vinod (1998) shuffles (and sometimes also studentizes) J (= 999) times with replacement the T scaled scores S̃_t. Babu (1997) studies the "breakdown point" of bootstraps, and suggests Winsorization (replacing a certain percentage of "extreme" values by the nearest non-extreme values) before resampling. Since the S̃_t are p×1 vectors, we must choose a vector norm |S̃_t| to define their extreme values. For robustness, the norm should help eliminate the extreme values of the numerical roots of the shuffled GPF = ΣS̃_t = (constant) as estimates of β. If one is interested in a CI95, we can Winsorize less than 5% of the "extreme" |S̃_t| values. Further research and simulation are needed to know how robust and how wide the CI95's are for each norm. Now a GPF-N(0,I) bootstrap algorithm is to make J = 999 numerical evaluations of GPF = ±1.96 for any β or scalar function f(β) and construct a CI95 for inference.

We propose a second multivariate pivot (denoted by the superscript 2) using a quadratic form of the original scores in (28):

GPF² = ε′H^{−1}X[X′H^{−1}X]^{−1}X′H^{−1}ε ~ χ²(p).    (34)

The limiting distribution of (34) is readily shown to be a central χ² (zero noncentrality), which does not depend on unknown parameters.
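Continuing the regression sketch above (reusing X, y, beta_ols, beta_true and s2, all illustrative), the following shows the OLS versions of the GPF of (33) and the quadratic-form pivot GPF² of (34), with the s² rescaling anticipating the F refinement discussed next:

```python
# p-variate GPF (33) with H = I, and GPF^2 (34) rescaled by the residual variance s2.
import numpy as np
from scipy.stats import chi2

w_x, V_x = np.linalg.eigh(X.T @ X)
scale = V_x @ np.diag(w_x**-0.5) @ V_x.T            # [X'X]^(-1/2)

def gpf_ols(beta):
    return scale @ (X.T @ (y - X @ beta))           # eq. (33) with H = I, a p-vector

def gpf2_ols(beta):
    eps = y - X @ beta
    return eps @ X @ np.linalg.inv(X.T @ X) @ (X.T @ eps)   # eq. (34) with H = I

print(gpf_ols(beta_ols))                            # essentially zero at the OLS/EF root
stat = gpf2_ols(beta_true) / s2                     # scale out the common error variance
print(stat, stat <= chi2.ppf(0.95, df=X.shape[1]))  # inside the 95% chi2(p) region?
```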
As a refinement we may want an F rather than a χ² reference distribution. Then, we assume Eεε′ = σ²H instead of Eεε′ = H, where σ² is a common error variance. Next, we replace the ε vector on the left side of (34) by the scaled vector (ε/σ). If we estimate σ² by the regression residual variance s² = ε̂′ε̂/(T−p), the reference distribution on the right-hand side of (34) becomes the F distribution with p and T−p degrees of freedom.

Remark 3: How do we bootstrap using (34)? We need algebraic manipulations to write GPF² as a sum of T items, which can be shuffled. Note that (34) is a quadratic form ε′Aε, and we can decompose the T×T matrix A as GΛG′, where G is orthogonal and Λ is diagonal. Let us define a T×1 vector ε̃ by ε̃′ = ε′GΛ^{0.5}, and note that ε′Aε = ε̃′ε̃ = Σ(ε̃_t)², using the elements of ε̃. Now, our bootstrap shuffles the T values of (ε̃_t)² with replacement. Since the (ε̃_t)² values are scalars, they do not need any norms before Winsorization. Unlike the GPF-N(0,I) algorithm above, the GPF²'s yield ellipsoidal regions rather than convenient upper and lower limits of the usual confidence intervals. The limits of these regions come from the tabulated upper 95% value, T_α, of the χ² or F distribution. If we are interested in inference about a scalar function f(β) of the β vector, we can construct a CI95 by shuffling the (ε̃_t)² and solving GPF² = T_α for f(β) J times. Simultaneous confidence intervals for β are more difficult: we have to numerically maximize and minimize each element of β subject to the inequality constraint GPF² ≤ T_α.

The next few paragraphs discuss five robust choices of H(φ) for both the GPFs of (33) and (34).

(i) First, under heteroscedasticity, H = diag(ε_t²), and we replace H by a consistent estimate based on HC0 to HC3 defined in (32).

(ii) Second, under autocorrelation alone, Vinod (1996, 1998) suggests using the iid "recursive residuals" for time series bootstraps. Note that one retains the (first) p residuals unchanged and constructs T−p iid recursive residuals for shuffling. The key advantage of this method is that it is valid for arbitrary error autocorrelation structures. This is a nonparametric choice of H without any φ.

(iii) Third, under first order autoregression, AR(1), among the regression errors, we have the following model at time t: y_t = X_tβ + ε_t, where ε_t = ρε_{t−1} + u_t. Under normality of the errors the likelihood function is f(y_1, y_2,…, y_T) = f(y_1)f(y_2|y_1)f(y_3|y_2)⋯f(y_T|y_{T−1}). Next we use the so-called quasi first differences (e.g., y_t − ρy_{t−1}). It is well known that here one treats the first observation differently from the others, Greene (1997, p. 600). For this model the log-likelihood function is available in textbooks and the corresponding score function equals the partials of the log-likelihood, with a different formula for t = 1 than for all other t values. The partial with respect to (wrt) β is (1/σ_u²) Σ_{t=1}^T u_t X*_t, where u_1 = (1−ρ²)^{0.5}(y_1 − X_1β), and for t = 2,3,…,T: u_t = (y_t − ρy_{t−1}) − (X_t − ρX_{t−1})β, where we have defined the 1×p vectors X*_1 = (1−ρ²)^{0.5}X_1 and X*_t = (X_t − ρX_{t−1}) for t = 2,3,…,T. Collecting all u_t we have the u vector of dimension T×1. Similarly, we have the T×p matrix X* of quasi-differenced regressor data satisfying the definitional relation (X*′X*) = X′H^{−1}X. The partial wrt σ_u² is (−T/(2σ_u²)) + (1/(2σ_u⁴)) Σ_{t=1}^T u_t². Finally, the partial derivative wrt ρ is (1/σ_u²) Σ_{t=2}^T u_t ε_{t−1} + (ρε_1²/σ_u²) − ρ/(1−ρ²), where ε_t = y_t − X_tβ. Thus, for the parameter vector θ = (β, σ_u², ρ), we have analytical expressions for the score vector, which is the optimal EF. Simultaneous solution of the p+2 equations in the parameter vector θ gives the usual ML estimator, which coincides with the EF estimator here.
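Continuing the regression sketch (reusing X, y and beta_ols), a minimal illustration of the quasi-differencing in choice (iii), anticipating the scaled-score form given just below:

```python
# Build the innovation vector u and transformed regressors X* for AR(1) errors,
# so that (X*'X*) = X'H^{-1}X and the score wrt beta is proportional to X*'u.
import numpy as np

def quasi_difference(y, X, beta, rho):
    Xs = np.empty_like(X)
    u = np.empty(y.shape[0])
    Xs[0] = np.sqrt(1 - rho**2) * X[0]
    u[0] = np.sqrt(1 - rho**2) * (y[0] - X[0] @ beta)
    Xs[1:] = X[1:] - rho * X[:-1]
    u[1:] = (y[1:] - rho * y[:-1]) - Xs[1:] @ beta
    return u, Xs

u, Xs = quasi_difference(y, X, beta_ols, rho=0.4)       # rho = 0.4 is an arbitrary trial value
w_s, V_s = np.linalg.eigh(Xs.T @ Xs)
gpf_ar1 = V_s @ np.diag(w_s**-0.5) @ V_s.T @ (Xs.T @ u)  # [X*'X*]^(-1/2) X*'u
print(gpf_ar1)
```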
Instead of simultaneous solutions, Durbin's (1960) seminal paper, which started the whole EF literature, also suggested the following two-step OLS estimator for ρ: regress y_t on y_{t−1}, X_t and (−X_{t−1}), and use the coefficient of y_{t−1} as ρ̂. Also, one can use a consistent estimate s_u² of σ_u² and simplify the score for β as (1/s_u²) Σ_{t=1}^T u_t X*_t = (1/s_u²)X*′u. Using rescaled scores, GPF = z_G = [X*′X*]^{−0.5}X*′u, and the analogous GPF² is the quadratic form (z_G)′z_G.

(iv) Fourth, we consider H(φ) induced by mixed autoregressive moving average (ARMA) errors. Note that any stable, invertible dynamic error process can be approximated by an ARMA(q, q−1) process. Hence Vinod (1985) provides an algorithm for exact ML estimation of the regression coefficients β when the model errors are ARMA(q, q−1). His approximation is based on analytically known eigenvalues and eigenvectors of tri-diagonal matrices, with an explicit derivation for the ARMA(2,1). In general, this regression error process implies that H(φ) is a function of 2q−1 elements of the φ vector, representing q parameters on the AR side and q−1 parameters on the moving average (MA) side.

(v) Our fifth robust choice of H uses heteroscedasticity and autocorrelation consistent (HAC) estimators of H, discussed in DM (1993, pp. 553, 613). Assume that both heteroscedasticity and general autocorrelation among the regression errors are present and we are not willing to assume any parametric specification H(φ). Instead of H, we denote the nonparametric HAC covariance matrix as Eεε′ = Ω, to emphasize its nonzero off-diagonals due to autocorrelation and nonconstant diagonal elements due to heteroscedasticity. Practical construction of Ω requires the following smoothing and truncation adjustments using the (quasi) score functions S_t defined in (28). Define our basic building block as the p×p matrix of autocovariances H_j = (1/T) Σ_{t=j+1}^T S_t(S_{t−j})′. We smooth them by using [H_j + H_j′] to guarantee a symmetric matrix. We further assume that the autocovariances die down after m lags, with a known m, and truncate the sum after m terms. After all this truncation and smoothing, we construct Ω as a nonsingular symmetric matrix proposed by Newey and West (1987):

Ω = H_0 + Σ_{j=1}^m w(j,m)[H_j + H_j′], where w(j,m) = 1 − j(m+1)^{−1}.    (35)

The w(j,m) are Bartlett's window weights familiar from spectral analysis, declining linearly as j increases. One can refine (35) by using the pre-whitened HAC estimator proposed by Andrews and Monahan (1992). Here we do not merely use (35) in the usual fashion as a HAC estimator of variance. We are also extending the EF-theory lesson mentioned in remark 1 to construct a robust (HAC) estimator of the variance of the underlying score function itself, which permits construction of the scaled scores S̃_t needed to define the GPF in (33). Thus, upon smoothing and truncation, we propose the GPF as [X′Ω^{−1}X]^{−0.5}X′Ω^{−1}ε with nonparametric Ω. Now, similar to (34), we have

GPF² = ε′Ω^{−1}X[X′Ω^{−1}X]^{−1}X′Ω^{−1}ε ~ χ²(p).    (36)

Now recall the scaled vector (ε/σ) used after (34) for a refinement. As before, instead of Eεε′ = Ω we use Eεε′ = σ²Ω, insert the σ², and use (ε/σ) as in (34); the reference distribution in (36) then becomes the F distribution with p and T−p degrees of freedom.
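Continuing the regression sketch (reusing X, y, beta_ols, beta_true and T), a minimal illustration of the Newey-West building block (35) computed from the OLS scores; the final scaling of the summed scores by T·Ω is an assumed, natural reading of choice (v), not a transcription of the paper's procedure:

```python
# Bartlett-weighted HAC estimate of the score autocovariances, then a HAC-scaled pivot.
import numpy as np

def newey_west_scores(X, eps, m):
    """Omega = H_0 + sum_j (1 - j/(m+1)) [H_j + H_j'], H_j = (1/T) sum_t S_t S_{t-j}'."""
    T = X.shape[0]
    S = X * eps[:, None]                         # row t holds the score S_t'
    omega = S.T @ S / T
    for j in range(1, m + 1):
        Hj = S[j:].T @ S[:-j] / T
        omega += (1 - j / (m + 1)) * (Hj + Hj.T)
    return omega

eps_hat = y - X @ beta_ols
omega = newey_west_scores(X, eps_hat, m=4)       # m = 4 lags is an arbitrary choice

def gpf_hac(beta):
    w_o, V_o = np.linalg.eigh(T * omega)         # Var(sum_t S_t) is roughly T * omega
    return V_o @ np.diag(w_o**-0.5) @ V_o.T @ (X.T @ (y - X @ beta))

print(gpf_hac(beta_true))                        # roughly N(0, I) at the true beta
```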
The difficulty noted in Remark 3 about simultaneous CIs for β is relevant for (36) also. Since we have one (scalar) equation in the p unknowns of β, let us focus on one function f(β) or one element of β, say β_1. It can be shown that this entails no loss of generality if one can rearrange the model in terms of a revised set of p parameters. Let us still denote them by β to avoid notational clutter. The trick is to use the Frisch-Waugh theorem, Greene (1997, p. 247), to focus on one parameter at a time without loss of generality, as follows. For example, we can rewrite our model after partitioning as y = X_1β_1 + X_2β_2 and focus on one parameter β_1, combining all the other parameters into the vector β_2. Now construct a vector y* of residuals from the regression of y on X_2, and also create the regressor column from the residuals of the regression of X_1 on X_2. The Frisch-Waugh theorem guarantees that this gives exactly the same regression coefficient as the original model. Since β_1 could be any one of the p elements of β by rearrangement, there is no loss of generality in this method, and the χ²(p) in (36) becomes χ²(1) with only the one unknown β_1. Since E(χ²(1)) = 1, one tempting possibility is to solve the equation GPF² = 1 for β_1 some J = 999 times to construct a CI95 for β_1, and eventually for all parameters of β. Unfortunately, this does not work, and numerical evaluation of the inequality GPF² ≤ F_{p,T−p,1−α} for p = 1 based on (36) is needed. If there is a small-sample discrepancy between the theoretically correct F values and the observed density of the GPF²'s in the resamples, one can use the upper (1−α) quantile from the observed order statistics instead of F_{p,T−p,1−α}. One can also use the double bootstrap (d-boot) refinement if adequate computer resources are available.

Practitioners often need to test theoretical propositions that lead to restrictions on β. A general formulation, including the use of gradients of the restriction functions to deal with nonlinearities, is available in textbooks, Greene (1997, ch. 7). It is customary to consider m ≤ p linearly independent restrictions Rβ = q, where the matrix R is of dimension m×p, and to assume spherical disturbances, Eεε′ = I. Let C(β) = Rβ − q represent the m equations which are zero under the null hypothesis. Now the GPF = [X′X]^{−0.5}X′ε based on H = I in (33) represents a set of p equations in the p unknown elements of β. At the numerical level these equations can be viewed as m equations in C(β) and p−m equations in a (linearly independent) subset of the coefficients in the β vector. Clearly, one can solve them for C(β) and construct bootstrap confidence intervals for C(β). If the observed CI95 for any row of C(β) contains zero, we do not reject the null hypothesis represented by that row. For testing nonlinear restrictions, our method does not need any Taylor series linearizations. One uses the parametric χ² or F distribution or nonparametric bootstraps in (34) or (36). The LAU functions lau(β) mentioned earlier are a special case of C(β). If any row of C(β) contains more than one element of β, it is somewhat difficult (see Remark 3) to construct a GPF² confidence set for that row of C(β). This is a practical disadvantage of GPF²'s.
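A minimal sketch (continuing the regression example above, reusing X, y and beta_ols) of the Frisch-Waugh partialling-out invoked above: regressing residuals on residuals reproduces the coefficient of interest from the full regression.

```python
# Frisch-Waugh: the coefficient of the second column equals the residual-on-residual slope.
import numpy as np

X1, X2 = X[:, [1]], X[:, [0, 2]]               # single out one column; the rest form X2

def residuals(target, regressors):
    coef, *_ = np.linalg.lstsq(regressors, target, rcond=None)
    return target - regressors @ coef

y_star = residuals(y, X2)                      # y purged of X2
x_star = residuals(X1, X2)                     # X1 purged of X2
b1, *_ = np.linalg.lstsq(x_star, y_star, rcond=None)
print(beta_ols[1], b1[0])                      # the two estimates coincide
```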
Proposition 2: Let z_G denote a sum of scaled scores. Assume that a nonsingular estimate of E z_G z_G′ is available (possibly after smoothing and truncation), and that we are interested in inference on Dufour's 'locally almost unidentified' functions lau(β). By choosing a GPF as in (34) or (36), which does not explicitly involve the scalar lau(β) at all, we can construct valid confidence intervals.

Proof: Dufour's (1997) result is that CIs obtained by inverting Wald-type statistics [lau(β) − lau(β̂)]/SE(lâu) can have zero coverage probability for lau(β). One problem is that the Wald-type statistic is not a valid pivotal quantity when the sampling distribution of lau(β̂) depends on unknown nuisance parameters. Another problem is that the covariance matrix needed for SE(lâu) can be singular. The GPF of (33) is asymptotically normal by proposition 1. Such GPFs are certainly not Wald-type, since they do not even contain the expression lau(β̂). Assuming a nonsingular E z_G z_G′ is less stringent than assuming a nonsingular matrix of variances of lau(β̂) for each choice of lau(β). Let ι denote a p×1 vector of ones. We obtain a CI95 for any function f(β) of the regression parameters by numerically solving for f(β) a system of p equations GPF(y,β) = ±1.96 ι = z_α. For example, if f(β) = lau(β) is a ratio of two regression coefficients, then its denominator can be zero, and the variance of such an f(β̂) can have obvious difficulties. Rao (1973, Sec. 4b) discusses an ingenious squaring method for finding the CIs of ratios. Our approach is more general and simple to program. In the absence of software, Rao's squaring method is rarely, if ever, implemented. By contrast, our GPF is readily implemented in Vinod and Samanta (1997) and Vinod (1997) for functions (ratios) of regression coefficients. Given a scalar function lau(β), we simply replace one of the p parameters in β by lau(β), the parametric function of interest, and solve GPF = (constant) to construct a valid CI for the lau(β); a sketch of this reparameterization appears at the end of this section. For further improvements we suggest the bootstrap.

The bootstrap resampling of scaled scores to compute CIs is attractive in light of the limited simulations in Vinod (1998). To confirm it we would have to consider a wide range of actual and nominal "test sizes and powers" for a wide range of data sets. The GPF-N(0,I) algorithm is a parametric bootstrap, and relaxing the parametric assumptions of N(0,I) and H(φ) leads to nonparametric and double bootstrap (GPF-d-boot) algorithms. McCullough and Vinod (1997) offer practical details about implementing the d-boot. Although the d-boot imposes a heavy computational burden and cannot be readily simulated, Letson and McCullough (1998) do report encouraging results from their d-boot simulations. Also difficult to simulate are the algorithms developed here for the new GPF²'s. Our limited experiments with the new algorithms show that GPF²'s are quite feasible, but have the following drawback: GPF²'s cannot be readily solved for arbitrary nonlinear vector functions of β. Our recent experiments with non-normal, nonparametric, non-spherical error GPF algorithms indicate that their 95% confidence intervals can be wider than the usual intervals. We have not attempted Winsorization, since it offers too rich a variety of options, including the choice of the Winsorized percentage. From Babu's (1997) theory it is clear that greater robustness can be achieved by Winsorization, and we have shown how shorter bootstrap CIs can arise by removing extreme values.
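As promised above, here is one simple illustrative reading (reusing X, y, beta_ols and gpf_ols from the earlier sketches; not the implementation of Vinod and Samanta, 1997) of proposition 2's recipe for a ratio of regression coefficients:

```python
# Replace the third coefficient by ratio * (second coefficient), then solve the p
# equations GPF(y, beta) = z * iota for (beta_0, beta_1, ratio) at z = +/-1.96.
import numpy as np
from scipy.optimize import fsolve

def gpf_ratio_equations(params, z):
    b0, b1, ratio = params                       # 'ratio' plays the role of lau(beta)
    beta = np.array([b0, b1, ratio * b1])        # reparameterized so the ratio is a coordinate
    return gpf_ols(beta) - z * np.ones(3)

start = np.array([beta_ols[0], beta_ols[1], beta_ols[2] / beta_ols[1]])
root_plus = fsolve(gpf_ratio_equations, start, args=(+1.96,))[2]
root_minus = fsolve(gpf_ratio_equations, start, args=(-1.96,))[2]
print(sorted([root_plus, root_minus]))           # a CI95 for the ratio of coefficients
```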
9. Summary and conclusions

This paper explains the background of parametric asymptotic inference based on the early work of Fisher, T. W. Anderson and others, developed before the era of modern computers. We also discuss the early idea of focusing on functions, as in Anderson's critical functions. The estimating functions (EFs) were developed by Godambe and Durbin in 1960. The main lesson of EF-theory (in remark 1 above) is that good EFs automatically lead to good EF-estimators (= EF-roots). Similarly, good pivots (GPFs), which contain all the information in the sample, lead to reliable inference. A proposition formally shows that GPF = ΣS̃_t, a sum of T items, converges to N(0,I) by the central limit theorem. We provide details on bootstrap shuffling of the scaled scores S̃_t for statistical inference from the resulting confidence intervals. For regression coefficients β, when H(φ) is the matrix of error variances, we suggest explicit bootstrap algorithms for five robust choices of H, depending on the particular application at hand.

Although the GPF appears in Godambe (1985), the use of its numerical roots is first proposed in Vinod (1998). He claims that GPFs fill a long-standing need of the bootstrap literature for robust pivots and enable robust statistical inference in many general situations. We support the claim by using Cox's simple example studied by Efron, Hinkley and Royall. For it, we derive Fisher's highly structured pivot z_F, its modification z_EH by Efron and Hinkley (1978), and a further modification z_R by Royall (1986) to inject robustness. We explain why the GPF for Cox's example is more robust than the others, and we simulate all these pivots for Cox's example. A parametric GPF-N(0,I) bootstrap algorithm uses about a thousand standard normal deviates to simulate the sampling distribution. A nonparametric (single) bootstrap algorithm uses the empirical distribution function for robustness. For Cox's univariate example the simulation shows that GPFs yield short and robust CIs, without having to use the d-boot. The width of the traditional interval is 4.53, whereas the width of the GPF intervals is only about 0.53. This is obviously a major reduction in the confidence interval (CI) width, which is predicted by the asymptotic property (iii) given by Godambe and Heyde (1987). Thus we have demonstrated that for univariate problems our bootstrap methods based on GPFs offer superior statistical inference.

This paper discusses new GPF formulas for some exponential family members, including the Poisson mean, the Binomial probability and the Normal standard deviation. It also derives, from a quadratic form of the GPF, an asymptotically χ² second type of pivot, GPF², for regressions. The five robust choices of H(φ) mentioned above are shown to be available for GPF²'s also. We discuss inference problems for some ill-behaved functions, where traditional CIs can have zero coverage probability. Our solution to this problem, stated as proposition 2, is to numerically solve an equation involving the GPF. Since Dufour (1997) shows that such ill-behaved functions are ubiquitous in applications (e.g., computation of a long-run multiplier), our solution is of considerable practical interest in econometrics and other fields where regressions are used to test nonlinear theoretical propositions, especially those involving ratios of random variables.

Heyde (1997) and Davison and Hinkley (1997) offer formal proofs showing that GPFs and d-boots, respectively, are powerful tools for the construction of short and robust CIs. By defining the tail areas as rejection regions, our CIs can obviously be used for significance testing. The CIs from GPF-roots can serve as a foundation for further research on asymptotic inference in an era of powerful computing. For example, numerical pivots may help extend the well-developed EF-theory for nuisance parameters, Liang and Zeger (1995). The potential of EFs and GPFs for semiparametric and semimartingale models with nuisance parameters is indicated by Heyde (1997).
Of course, these ideas need to be developed, and we need greater practical experience with many more examples. We have shown that our proposal can potentially simplify, robustify and improve the asymptotic inference methods currently used in statistics and econometrics.

ACKNOWLEDGMENTS

I thank Professor Godambe of the University of Waterloo for important suggestions. A version of this paper was circulated as a Fordham University Economics Department Discussion Paper dated June 12, 1996. It was revised in 1998 to incorporate new references.

REFERENCES

Anderson, T. W., 1958, An Introduction to Multivariate Statistical Analysis. New York: J. Wiley.
Andrews, D. W. K. and J. C. Monahan, 1992, An improved heteroscedasticity and autocorrelation consistent covariance matrix estimator. Econometrica 60(4), 953-966.
Babu, G. J., 1997, Breakdown theory for estimators based on bootstrap and other resampling schemes. Dept. of Statistics, Penn State University, University Park, PA 16802.
Cox, D. R., 1975, Partial likelihood. Biometrika 62, 269-276.
Davidson, R. and J. G. MacKinnon, 1993, Estimation and Inference in Econometrics. New York: Oxford Univ. Press.
Davison, A. C. and D. V. Hinkley, 1997, Bootstrap Methods and Their Application. New York: Cambridge Univ. Press.
Donaldson, J. R. and R. B. Schnabel, 1987, Computational experience with confidence regions and confidence intervals for nonlinear least squares. Technometrics 29, 67-82.
Dufour, Jean-Marie, 1997, Some impossibility theorems in econometrics with applications to structural and dynamic models. Econometrica 65, 1365-1387.
Dunlop, D. D., 1994, Regression for longitudinal data: A bridge from least squares regression. American Statistician 48, 299-303.
Durbin, J., 1960, Estimation of parameters in time-series regression models. Journal of the Royal Statistical Society, Ser. B, 22, 139-153.
Efron, B. and D. V. Hinkley, 1978, Assessing the accuracy of the maximum likelihood estimator: Observed versus expected Fisher information. Biometrika 65, 457-482.
Godambe, V. P., 1985, The foundations of finite sample estimation in stochastic processes. Biometrika 72, 419-428.
Godambe, V. P., 1991, Orthogonality of estimating functions and nuisance parameters. Biometrika 78, 143-151.
Godambe, V. P. and C. C. Heyde, 1987, Quasi-likelihood and optimal estimation. International Statistical Review 55, 231-244.
Godambe, V. P. and B. K. Kale, 1991, Estimating functions: An overview. Ch. 1 in V. P. Godambe (ed.), Estimating Functions. Oxford: Clarendon Press.
Gourieroux, C., A. Monfort and A. Trognon, 1984, Pseudo maximum likelihood methods: Theory. Econometrica 52, 681-700.
Greene, W. H., 1997, Econometric Analysis, 3rd ed. New York: Prentice Hall.
Hall, P., 1992, The Bootstrap and Edgeworth Expansion. New York: Springer Verlag.
Heyde, C. C., 1997, Quasi-Likelihood and Its Applications. New York: Springer Verlag.
Hu, F. and J. D. Kalbfleisch, 1997, Estimating equations and the bootstrap. In Basawa, Godambe and Taylor (eds.), Selected Proceedings of the Symposium on Estimating Equations, IMS Lecture Notes-Monograph Series, Vol. 32. Hayward, California: IMS, 405-416.
Kendall, M. and A. Stuart, 1979, The Advanced Theory of Statistics, Vol. 2, 4th ed. New York: Macmillan.
Letson, D. and B. D. McCullough, 1998, Better confidence intervals: The double bootstrap with no pivot. American Journal of Agricultural Economics (forthcoming).
Liang, K. and S. L. Zeger, 1995, Inference based on estimating functions in the presence of nuisance parameters. Statistical Science 10, 158-173.
McLeish, D. L., 1974, Dependent central limit theorems and invariance principles. Annals of Probability 2(4), 620-628.
McCullough, B. D. and H. D. Vinod, 1997, Implementing the double bootstrap. Computational Economics 10, 1-17.
Newey, W. K. and K. D. West, 1987, A simple positive semi-definite, heteroscedasticity and autocorrelation consistent covariance matrix. Econometrica 55(3), 703-708.
Rao, C. R., 1973, Linear Statistical Inference and Its Applications. New York: J. Wiley.
Royall, R. M., 1986, Model robust confidence intervals using maximum likelihood estimators. International Statistical Review 54(2), 221-226.
Vinod, H. D., 1984, Distribution of a generalized t ratio for biased estimators. Economics Letters 14, 43-52.
Vinod, H. D., 1985, Exact maximum likelihood regression estimation with ARMA(n,n-1) errors. Economics Letters 17, 355-358.
Vinod, H. D., 1993, Bootstrap methods: Applications in econometrics. In G. S. Maddala, C. R. Rao and H. D. Vinod (eds.), Handbook of Statistics: Econometrics, Vol. 11. New York: North Holland, Elsevier, Chapter 23, 629-661.
Vinod, H. D., 1995, Double bootstrap for shrinkage estimators. Journal of Econometrics 68(2), 287-302.
Vinod, H. D., 1996, Comments on bootstrapping time series data. Econometric Reviews 15(2), 183-190.
Vinod, H. D., 1997, Concave consumption, Euler equation and inference using estimating functions. 1997 Proceedings of the Business and Economic Statistics Section of the American Statistical Association (to appear).
Vinod, H. D., 1997b, Using Godambe-Durbin estimating functions in econometrics. In Basawa, Godambe and Taylor (eds.), Selected Proceedings of the Symposium on Estimating Equations, IMS Lecture Notes-Monograph Series, Vol. 32. Hayward, California: IMS, 215-238.
Vinod, H. D., 1998, Foundations of statistical inference based on numerical roots of robust pivot functions (Fellow's Corner). Journal of Econometrics 86, 387-396.
Vinod, H. D. and R. R. Geddes, 1998, Generalized estimating equations for panel data and managerial monitoring in electric utilities. Presented at the International Indian Statistical Association Conference, McMaster University, Hamilton, Ontario, Canada, Oct. 10-11.
Vinod, H. D. and P. Samanta, 1997, Forecasting exchange rate dynamics using GMM, estimating functions and numerical conditional variance methods. Presented at the 17th Annual International Symposium on Forecasting, Barbados, June 1997.
Wedderburn, R. W. M., 1974, Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika 61, 439-447.

Figure 1: Frequency distribution of double bootstrap counts (Cj/100) for β1 (elasticity of capital). Panel A (upper): Cauchy errors; Panel B (lower): Normal errors (in simulating Y values).
Figure 2: Frequency distribution of double bootstrap counts (Cj/100) for β2 (elasticity of labor). Panel A (upper): Cauchy errors; Panel B (lower): Normal errors (in simulating Y values).
Figure 3: Frequency distribution of double bootstrap counts (Cj/100) for β3 (technology elasticity). Panel A (upper): Cauchy errors; Panel B (lower): Normal errors (in simulating Y values).

Recall the GLS estimating equation

g_gls = X′H^{-1}Xβ − X′H^{-1}y = 0,  equivalently  Σ_{t=1}^T X_t′ H_t^{-1}(y_t − X_t β) = Σ_{t=1}^T S_t = 0  (for diagonal H with entries H_t).

A p-variate GPF similar to (22) is [X′H^{-1}E(εε′)H^{-1}X]^{-1/2} X′H^{-1}ε, where E(εε′) = H is assumed to be known. Upon substitution, this expression becomes [X′H^{-1}X]^{-1/2} X′H^{-1}ε. For OLS, H = I and the expression simplifies further to GPF = [X′X]^{-1/2} X′ε.
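To make the GPF-root computations concrete, the following Python sketch applies the OLS simplification above to a small simulated data set. It is purely illustrative and not part of the original paper: the simulated data, the sample size, and the use of numpy/scipy routines (an eigendecomposition for (X′X)^{-1/2} and fsolve for the roots) are our own assumptions. Because the OLS GPF is linear in β, the numerical roots of GPF(y, β) = ±1.96 ι here simply reproduce familiar normal-theory endpoints; the same skeleton, however, carries over to the nonlinear f(β) cases discussed earlier.

    import numpy as np
    from scipy.optimize import fsolve

    rng = np.random.default_rng(0)                    # hypothetical simulated data set
    T, p = 100, 3
    X = np.column_stack([np.ones(T), rng.normal(size=(T, p - 1))])
    y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(size=T)

    # (X'X)^{-1/2} via the eigendecomposition of X'X
    evals, evecs = np.linalg.eigh(X.T @ X)
    root_inv = evecs @ np.diag(evals ** -0.5) @ evecs.T

    def gpf(beta):
        # scaled sum of OLS scores: (X'X)^{-1/2} X'(y - X beta), approximately N(0, I_p)
        return root_inv @ (X.T @ (y - X @ beta))

    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    iota = np.ones(p)

    # GPF-roots: solve the p equations GPF(y, beta) = +/- 1.96*iota for beta
    root_a = fsolve(lambda b: gpf(b) - 1.96 * iota, beta_ols)
    root_b = fsolve(lambda b: gpf(b) + 1.96 * iota, beta_ols)
    ci95 = np.sort(np.column_stack([root_a, root_b]), axis=1)   # one row per coefficient
    print("GPF at the OLS estimate:", np.round(gpf(beta_ols), 8))
    print("CI95 endpoints from GPF-roots:")
    print(np.round(ci95, 3))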
If H is unknown, but we do know that it is a diagonal matrix (heteroscedasticity), we propose the following scaled sum of scores as the GPF:

zG = [X′ diag(ε̃_t²) X]^{-1/2} g* ~ N(0, I_p),   (36)

where ε̃_t denotes the t-th residual and g* = X′ε̃ is the corresponding sum of scores. Let X_t denote the t-th row of X (the data on all regressors at time t). Clearly the score S_t = X_t′ε_t is a p×1 vector, and let Ω = X′E(εε′)X denote the covariance matrix of the scores. By definition, the GPF is a sum of T scaled scores, where the scale factor is Ω^{-1/2} = {(1/T) Σ_{t=1}^T Σ_{s=1}^T S_t S_s′}^{-1/2}. Unfortunately, εε′ without the expectation operator is, in general, a singular matrix. Upon smoothing and truncation, we propose

GPF(y, β) = Ω̃^{-1/2} X′ε.   (39)

In the notation of Heyde (1997, p. 61), GPF = Q ~ Normal, so that Q′Q is asymptotically χ² and the acceptance region is Q′Q ≤ χ²(p, α); that is, ε′X Ω̃^{-1} X′ε ~ χ²(p). In the heteroscedastic case, we have

ε′X [X′ diag(ε_t²) X]^{-1} X′ε ~ χ²(p).

Now we construct a sum of T individual components of the quantity on the left side, so that we can shuffle it with replacement for a bootstrap. To this end, use the singular value decomposition X = HΛ^{1/2}G′ and X′ = GΛ^{1/2}H′. The left side becomes

ε′HΛ^{1/2}G′ [GΛ^{1/2}H′ diag(ε_t²) HΛ^{1/2}G′]^{-1} GΛ^{1/2}H′ε
= ε′HΛ^{1/2}G′ [GΛ^{-1/2}H′ diag(ε_t^{-2}) HΛ^{-1/2}G′] GΛ^{1/2}H′ε
= ε′HΛ^{1/2}Λ^{-1/2}H′ diag(ε_t^{-2}) HΛ^{-1/2}Λ^{1/2}H′ε,  using G′G = I
= ε′HH′ diag(ε_t^{-2}) HH′ε,  using Λ^{-1/2}Λ^{1/2} = I.

Note that HH′ is of dimension T×T and can be calculated once and for all. Denote the columns of HH′ by h_1, h_2, ..., h_T, and let ε′h_t = f_t, a scalar. Now we write the expression as

Σ_{t=1}^T f_t² / ε_t².

Now replace ε_t by the t-th residual, compute the initial T components of this sum, and shuffle them J times for a bootstrap. Each shuffle creates a realization from a χ² random variable. Next, find the largest (smallest) value of each regression coefficient such that this sum of squares is no larger than the 95th percentile of the J bootstrap sums of Σ_{t=1}^T f_t²/ε_t². That is, use a constrained optimization algorithm to find the maximum and minimum of each element of β subject to Σ_{t=1}^T f_t²/ε_t² ≤ the upper 95th percentile from the J bootstrap replicates. The maximum β is then the upper limit of the CI95 and the minimum is the lower limit. (These steps are sketched in code at the end of these notes.)

In the OLS regression with spherical errors, a 100(1−α)% confidence region for β contains those values of β which satisfy the inequality (Donaldson and Schnabel, 1987):

(y − Xβ)′(y − Xβ) − (y − Xβ̂_ols)′(y − Xβ̂_ols) ≤ s² p F(p, T−p, 1−α),   (37)

where s² = (y − Xβ̂_ols)′(y − Xβ̂_ols)/(T−p). It is equivalently written as

(β̂_ols − β)′ X′X (β̂_ols − β) ≤ s² p F(p, T−p, 1−α),   (38)

which shows that the shape of the region is ellipsoidal. If the Frisch-Waugh theorem is used to convert the regression problem into p separate univariate problems, use of the bootstrap seems to be a good way to obtain robustness.

For multivariate problems (e.g., regressions) Vinod (1998) suggests two algorithms: the GPF-N(0,I) and GPF-d-boot algorithms. In some implementations they can involve thousands of numerically computed GPF-roots, which are rank-ordered to compute CIs. The computational burden is manageable for many typical econometric problems, and is expected to lighten further over time. Although the simulations in Vinod (1998) are incomplete, they have a sound theoretical basis in the property (iii) given by Godambe and Heyde (1987). Often, estimating functions (EFs) are QSFs and EF estimators are the quasi maximum likelihood (QML) estimators viewed as roots of QSFs.
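The GPF² bootstrap CI outlined above (the f_t²/ε̃_t² components, their shuffling, and the constrained optimization) can be sketched in Python as follows. This is a rough reading rather than the paper's own implementation, and it rests on explicit assumptions: the denominator holds the squared OLS residuals fixed, the bootstrap reference distribution is built by resampling the residuals entering f_t with replacement, J = 999 shuffles are used, and SLSQP is the constrained optimizer; the simulated heteroscedastic data and all variable names are illustrative.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)                    # hypothetical heteroscedastic data
    T, p = 100, 3
    X = np.column_stack([np.ones(T), rng.normal(size=(T, p - 1))])
    y = X @ np.array([1.0, 0.5, -0.25]) + rng.normal(size=T) * (1.0 + np.abs(X[:, 1]))

    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat                          # OLS residuals, held fixed in diag()
    U = np.linalg.svd(X, full_matrices=False)[0]      # left singular vectors of X
    HHt = U @ U.T                                     # the T x T matrix HH', computed once

    def gpf2(beta):
        # sum over t of f_t^2 / resid_t^2, with f_t = (HH'(y - X beta))_t
        f = HHt @ (y - X @ beta)
        return np.sum(f ** 2 / resid ** 2)

    # Bootstrap cutoff: resample the residuals entering f_t with replacement, J times
    J = 999
    boot = np.empty(J)
    for j in range(J):
        eps_star = rng.choice(resid, size=T, replace=True)
        boot[j] = np.sum((HHt @ eps_star) ** 2 / resid ** 2)
    cutoff = np.percentile(boot, 95)

    # CI95 per coefficient: maximize and minimize beta_j subject to gpf2(beta) <= cutoff
    cons = [{"type": "ineq", "fun": lambda b: cutoff - gpf2(b)}]
    ci95 = np.empty((p, 2))
    for j in range(p):
        lo = minimize(lambda b, j=j: b[j], beta_hat, constraints=cons, method="SLSQP")
        hi = minimize(lambda b, j=j: -b[j], beta_hat, constraints=cons, method="SLSQP")
        ci95[j] = (lo.x[j], hi.x[j])
    print("GPF^2 bootstrap CI95 (one row per coefficient):")
    print(np.round(ci95, 3))

As the text notes, HH′ depends only on X and is computed once; in this sketch only the resampled residual vector changes across bootstrap shuffles, and the constrained search starts from the OLS estimate, which trivially satisfies the constraint.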