Presented at the seventh international workshop on matrices and statistics in celebration of T. W. Anderson's
80th birthday, Fort Lauderdale, Florida, Dec. 11-14, 1998
FOUNDATIONS OF MULTIVARIATE INFERENCE USING MODERN
COMPUTERS
H. D. Vinod
Economics Dept. Fordham University,
Bronx, New York 10458
VINOD@murray.FORDham.edu
KEY WORDS: Bootstrap, Regression, Fisher information, Robustness, Pivot, Double bootstrap.
JEL classifications: C12, C20, C52, C8.
ABSTRACT
In the 1930s Fisher suggested analytically structured pivot functions (PFs), whose distribution does not
depend on unknown parameters. These pivots provided a foundation for (asymptotic) statistical
inference. Anderson (1958, p. 116) introduced the concept of a critical function of observables, which
finds the rejection probability of a test for Fisher's pivot. Vinod (1998) shows that Godambe's (1985)
pivot function (GPF), based on the Godambe-Durbin "estimating functions" (EFs) from 1960, is
particularly robust compared to the pivots of Efron and Hinkley (1978) and Royall (1986). Vinod argues
that numerically computed roots of GPFs based on scaled score functions can fill a long-standing need of
the bootstrap literature for robust pivots. This paper considers Cox's example in detail and reports a
simulation for it. It also discusses new pivots for the Poisson mean, the binomial probability and the
normal standard deviation. In the context of regression problems we propose and discuss a second
multivariate pivot (denoted GPF²), which is asymptotically χ², and robust choices of error
covariances that allow for heteroscedasticity and autocorrelation.
1. Introduction and how estimating functions evolve into Godambe's pivot functions
The basic framework of asymptotic and small-sample statistical inference developed by Sir R.
A. Fisher and others since the 1930s relies on the normality of the estimator θ̂ of θ. The Neyman-Pearson
lemma provides a sufficient condition for the existence of the uniformly most powerful (UMP)
test based on the likelihood ratio (LR). A level α test is called UMP if it has power at least as great as
any other test of the same size α and retains its size α for all admissible parameter values. For the
exponential family of distributions it is easy to verify that the LR is a monotone function of the test
statistic. Anderson (1958, p. 116) introduced the concept of a critical function of observables, which
finds the rejection probability of a test for Fisher's pivot. He also considers the properties of Hotelling's
T² statistic, which is a quadratic form in normal variables. Since the critical region of the LR test for T²
is a strictly increasing function of the observable statistic, whether the statistic is in the critical
(rejection) region under the null hypothesis does not depend on unknown noncentrality parameters.
Hence, Anderson argues that it is a UMP test. Mittelhammer (1996, Ch. 9-11) discusses this material
in the modern jargon of pivots and explains the duality between confidence and critical regions, whereby
UMP tests lead to uniformly most accurate (UMA) confidence regions. These methods are the
foundations of statistical inference, which relies on parametric modeling using (asymptotic) normality.
We show that statistical inference can be made more robust and nonparametric by exploiting the power
of modern computers, which did not exist when the traditional methods were first developed.
The bootstrap literature (see, e.g., Hall, 1992; Vinod, 1993; Davison and Hinkley, 1997) notes that
reliable inference from bootstraps needs valid pivot functions (PFs), whose distribution does not depend
on the unknown parameters θ. Let θ̂ be an estimator, SE its standard error, and FPF Fisher's PF.
Typical bootstraps resample only the older Wald-type statistic, FPF = (θ̂ − θ)/SE (see Hall, 1992, p. 128).
Hu and Kalbfleisch (1997) include and refer to some exceptions. For certain biased estimators and
ill-behaved SEs bootstraps can fail, because the FPFs are invalid pivots. Vinod (1998) shows that Godambe's
(1985) pivot function (GPF) equals a scaled sum of T quasi-likelihood score functions (QSFs), where T
denotes the number of observations. As a sum of T items, the GPF converges to N(0,I) unit normality
directly by the central limit theorem (CLT). Since GPFs need neither unbiasedness of θ̂ nor well-behaved
SEs to be valid pivots, they can fill a long-standing need in the bootstrap literature for valid pivots.
Although the distribution of GPFs never depends on unknown parameters, we shall see that they often
need greater computing power to find numerical roots of equations of the form GPF = (a constant).
Vinod (1998) reviews the attempts by Efron and Hinkley (1978) and Royall (1986) to inject
robustness into Fisher's PF for nonnormal situations. He shows that there is an important link between
robustness, nonnormality and the so-called "information matrix equality" (I_F = I_2op = I_opg) between the
Fisher information matrix (I_F), the matrix of second-order partials (I_2op) of the log likelihood and the
matrix of the outer product of gradients (I_opg) of the log likelihood. We shall see that traditional confidence
intervals (CIs) are obtained by analytically inverting the Wald-type statistic. Although Godambe (1985)
mentions his pivot functions (GPFs), he uses them only when such analytical inversions are possible.
Vinod (1998) suggests numerical inversions by extending GPFs to consider the numerical roots of
equations GPF = (constant), called GPF-roots. Vinod's Proposition 1 formally proves that GPF-roots yield
more robust pivots than the Efron-Hinkley-Royall pivots. As with Royall, the robustness of GPF-roots is
achieved by allowing I_2op ≠ I_opg, which occurs when there is nonnormal skewness and kurtosis. Further
robustness is achieved by not insisting on analytical inversion of the GPF-roots, thereby permitting
nonsymmetric confidence intervals obtained from computer-intensive numerical roots.
If one needs confidence intervals for some `estimable' functions of parameters f(θ), Vinod
(1998) proposes numerically solving the GPF = (constant) equation for f(θ). The numerical GPF-roots
are robust, because they avoid the `Wald-type' statistic W = [f(θ̂) − f(θ)]/SE(f) altogether. One need not
find the potentially "mixture" sampling distribution f_W(W) of W. There is no need to make sure that
f_W(W) does not depend on the unknown θ. In fact, one need not even find the standard error SE(f) for each
f(θ) or invert the statistic for confidence intervals (CIs). Assuming that reasonable starting values can
be obtained, our numerical GPF-root method simply needs a reliable computer algorithm for solving
nonlinear equations. Of course, some f(θ) will be better behaved than others. Dufour (1997) discusses
inference for some ill-behaved f(θ) called `locally almost unidentified' (LAU) functions, lau(θ). He
shows that LAU functions are quite common in applications, that lau(θ) may have unbounded CIs, and
that the usual CIs can have zero coverage probability. Moreover, Dufour states that Edgeworth
expansions or traditional bootstraps (on Wald-type statistics) do not solve the problem. Later, we state
Proposition 2, which avoids (bootstrapping) invalid pivots (Wald-type statistics) by solving GPF =
(constant) for lau(θ). Again, the limiting normality of GPFs is proved directly by the CLT, avoiding the
sampling distribution of lau(θ̂) altogether.
An introduction to the estimating function (EF) literature is given in Godambe and Kale (1991),
Dunlop (1994), Liang and Zeger (1995), Heyde (1997) and Vinod (1997b, 1998). The EF estimators
date back to 1960 and are defined as roots of a function g(y,θ) = 0 of data and parameters. The EF
theory has several adherents in biostatistics, survey sampling and many applied branches of statistics.
The optimum EF (OptEF), g* = 0, is unbiased and minimizes Godambe's optimality criterion:

Godambe criterion = Var(g) / [E(∂g/∂θ)]².    (1)

This minimizes the variance in the numerator; at the same time, the denominator lets g(θ + δ) for
"small" nearby values (δ > 0) differ as much as possible from g. If g is standardized to g_s = g/ġ, where ġ
denotes E(∂g/∂θ), then minimization of (1) means selecting the g with minimum (standardized) variance.
In many special cases, the criterion (1) yields the maximum likelihood (ML) estimator θ̂_ml of the parameter
θ. Given T observations, the likelihood function Π_{t=1}^T f_t(y_t; θ) is a product of probability density
functions. Let L = Σ_{t=1}^T L_t denote the log of the likelihood function, where L_t = ln f_t(y_t; θ).
Kendall and Stuart (1979, sec. 17.16) note that the Cramer-Rao lower bound on the variance of
an unbiased estimator ψ̂_unb of some function ψ(θ) is Var(ψ̂) ≥ 1/E(S²), where S = ∂L/∂θ denotes the
score. The lower bound is attained if and only if ψ̂_unb − ψ(θ) is proportional to the score, or if

S = ∂L/∂θ = A(θ)[ψ̂_unb − ψ(θ)].    (2)

Hence we derive the important result that the score equation S = 0 is an OptEF. The property of
"attaining the Cramer-Rao lower bound" on the variance is a property of the underlying score function (EF)
itself. Kendall and Stuart also prove a scalar version of the "information matrix equality" mentioned
above. Later, we shall also use the corresponding observed information matrices Î_F, Î_opg and Î_2op.
Kendall and Stuart note that the factor of proportionality A(θ) in (2) can contain arbitrary functions of θ, and the
implicit likelihood function for such A(θ) is obtained by integrating the score in (2). Note that such a
likelihood function belongs to the exponential family of distributions defined by

f(y|θ) = exp{A(θ)B(y) + C(y) + D(θ)}.    (3)

It is well known (Rao, 1973, p. 195; Gourieroux et al., 1984) that several discrete and continuous
distributions belong to the exponential family, including the binomial, Poisson, negative binomial,
gamma, normal, uniform, etc. Wedderburn (1974) recommends using the exponential family score
function even when the underlying distribution is not explicitly specified, and defines the quasi-likelihood
function as the integral of the score. An appealing property of the EFs is that they
"attain Cramer-Rao," which can be proved by using quasi-likelihood score functions (QSFs), without
assuming knowledge beyond the mean and variance. See Heyde (1997) for non-exponential-family
extensions of EFs and a cogent discussion of the advantages of EFs over traditional methods.
Remark 1: An important lesson from the EF theory is that an indirect approach is good for estimation.
One should choose the best available EF (e.g., unbiased EF attaining Cramer-Rao) to indirectly ensure
best available estimators, which are defined as roots of EF=(constant). In many examples, the familiar
direct method seeking best properties of roots (estimators) themselves can end up failing to achieve
them.
Vinod (1998) proposes a similar lesson for statistical inference. GPFs having desirable
properties (asymptotic normality) indirectly ensure desirable properties (e.g., short CIs conditional on
coverage) of the (numerical) GPF-roots. This paper discusses Cox's example, the Poisson mean, the binomial
probability, the normal standard deviation and some multivariate pivots in the context of the regression
problem, all omitted in Vinod (1998). This paper also derives a second kind of GPF, called GPF²,
whose asymptotic distribution is χ², again arising from CLT-type arguments. Section 3 relates
GPFs to CIs and studies the relation between the information matrix equality and robustness. For Cox's
example, section 4 derives the GPF and section 5 develops the CIs from bootstraps and simulates Cox's
example. Section 6 develops a sequence of robust improvements from the FPF to the GPF. Section 7 gives
more GPF examples. Section 8 discusses two types of GPFs for regressions. Section 9 contains a summary
and conclusions.
2. Superiority of EFs over maximum likelihood.
If normality is not assumed and only the first two moments are specified, then the likelihood is
unknown and the ML estimator is undefined. The QSFs as OptEFs remain available, although
asymmetric derivatives lead to a failure of the "integrability conditions," and hence to the non-existence of
quasi-likelihoods (as integrals of quasi-scores). Heyde (1997) proves that desirable properties (e.g., minimum
variance) of the underlying EFs also ensure desirable properties of their roots. Heyde (1997, p. 2, and ch. 2)
provides examples where the traditional θ̂ (= EF root) is not a `sufficient statistic,' while the QSF
provides a `minimal sufficient partitioning of the sample space.' Moreover, two or more EFs can be
combined and applied to semiparametric and semimartingale models, even if scores (QSFs) do not
exist, Heyde (1997, ch. 11). The examples in Vinod (1997b) and Vinod and Samanta (1997) propose a new
EF-estimator, providing superior out-of-sample forecasts compared to the generalized method of
moments (GMM). The superiority of EFs over ML is particularly noteworthy when regression
heteroscedasticity is a known function of the regression parameters β. An example with a binary dependent
variable is given in Vinod and Geddes (1998). Thus choosing optimal EFs and their roots as EF-estimators
can be recommended. Godambe and Kale (1991) and Heyde (1997) prove that, both with and
without normality, whenever optimal EF-estimators do not coincide with the usual ones (LS, ML or
GMM), the EF estimators are superior within a class of linear estimators.
3. Optimum EFs and CIs from Godambe's PF
Recall that only those EFs which minimize (1) are called OptEFs and denoted by g*. Godambe
and Heyde (1987) define the quasi-likelihood score function (QSF) as the OptEF and prove three
equivalent properties: (i) E(g* − S)² ≤ E(g − S)², (ii) corr(g*, S) ≥ corr(g, S), where corr(.) denotes the
correlation coefficient, and (iii) for large samples, the asymptotic confidence interval widths (CIWs)
obtained by inversion of a quasi-score function satisfy CIW(g*) ≤ CIW(g).
Vinod (1998) emphasizes the third property involving CIWs, which is not discussed in great
detail by Godambe and Heyde. His simulations support the third property. If the limiting distribution is
normal, θ̂ →_d N(θ, Var(θ̂)), we have the following equivalent expressions for the inverse of the Fisher
information matrix: I_F^{-1} = Var(θ̂) = I_2op^{-1} = I_opg^{-1}. Denoting by ASE_i the asymptotic standard errors
computed from the diagonals of Var(θ̂), the asymptotic 95% CI (CI95) for the i-th element of θ is simply

[θ̂_i − 1.96 ASE_i,  θ̂_i + 1.96 ASE_i].    (4)

There is no loss of generality in discussing the 95% = 100(1−α)% CIs, since one can readily modify α
(= 0.05) to any desired significance level α < 1/2. From the tables of the standard normal distribution,
where z ~ N(0,1), the constant 1.96 in (4) comes from the probability statement:

Pr[−1.96 ≤ z ≤ 1.96] = 0.95 = 1 − α.    (5)

It is known that Fisher's pivot function (FPF) is Wald-type and converges to the unit normal:

FPF = z_F = (θ̂_i − θ_i)/ASE_i →_d N(0,1).    (6)

One can derive the CI95 in (4) by "inverting Fisher's PF," i.e., by replacing the z of (5) by the z_F of (6).
This amounts to solving FPF = z_α (a constant = ±1.96). For more general two-sided scalar intervals both
a left-hand-tail probability α_L and a distinct right-hand-tail probability α_U must be given. Then, we find
quantiles z_L and z_U from normal tables to satisfy:

Pr[z_L ≤ z ≤ z_U] = 1 − α_L − α_U.    (7)

Now, the FPF-roots solving FPF = z_L and FPF = z_U yield the lower and upper limits of a more general CI.
In finite samples, Student's t distribution is often used when the standard errors (SE_i) are estimated from the
data, which slightly increases the constant 1.96. In the sequel, we avoid notational clutter by using
CI95 as a generic interval, 1.96 as a generic value of z from normal or t tables, and ASE or SE(.) as
standard errors from the square root of the matrix Var(θ̂), usually based on Fisher's information matrix I_F.
Greene (1997, p. 153) defines the pivot as a function of θ and θ̂, f_p(θ, θ̂), with a known
distribution. Actually, it would be a useless PF if the `known distribution' depended on the unknown
parameters θ. The distribution of a valid PF must be independent of θ. If θ̂ is a biased estimator of θ,
where the bias depends on θ, then the distribution of the FPF in (6) obviously depends on θ, implying that
the FPF is invalid; see Vinod's (1995) example of ridge regression. Resampling such
invalid FPFs can lead to a failure of the bootstrap. We obtain the CIs in (5) by inverting a test statistic.
Our pivot functions, GPF(y, θ), also include y but exclude θ̂ and/or its functions from the list of
arguments. Clearly, GPFs are more general and contain all information in the sample.
Vinod's (1998) Proposition 1, reproduced in section 6, states that GPFs converge robustly (with
minimal assumptions) to the unit normal N(0,I) and are certainly not `Wald-type' in the sense of
Dufour (1997). We view GPFs as a step in the sequence of improvements to Fisher's PFs, started by
Efron and Hinkley and by Royall, designed to inject robustness. In the sequel, section 4 discusses the
improvement sequence in the context of Cox's example, whereas a general discussion for univariate θ is
in section 6. A vector generalization of "inverting a pivot" leads to confidence sets instead of intervals.
4. Derivation of the GPF for Cox's example
Cox's (1975) example estimates a univariate θ from y_it ~ N(θ, σ_i²), (i = 1, 2 and t = 1,…,T), with
known dichotomous random variances. One imagines two (independent) gadgets with distinct known
measurement-error variances. The choice of gadget i depends on the outcome of a toss of an
unbiased coin. The subscripts of y_it imply that a record is kept of both the gadget number i and the t-th
measurement. Denote by T₁ the total number of heads, by T₂ the total number of tails, and by ȳ_i the mean of the i-th
sample. The log likelihood function is L = constant − Σ_{t=1}^T Σ_{i=1}^2 log σ_i − (1/2) Σ_{t=1}^T Σ_{i=1}^2 (y_it − θ)²/σ_i². The
score is S = ∂L/∂θ = L̃ − θR̃, where L̃ = (Σ_{t=1}^T y_1t/σ₁²) + (Σ_{t=1}^T y_2t/σ₂²) = (T₁ȳ₁/σ₁²) + (T₂ȳ₂/σ₂²) and R̃ =
(T₁/σ₁²) + (T₂/σ₂²). The ML estimator of θ is the root of the score equation S = 0 solved for θ:

θ̂_ml = L̃/R̃ = [(T₁ȳ₁/σ₁²) + (T₂ȳ₂/σ₂²)] / [(T₁/σ₁²) + (T₂/σ₂²)].    (8)
Efron and Hinkley (EH, 1978) show that the (expected or true) Fisher information equals
I_F = Var(S) = E(−∂S/∂θ) = (T/2)[Σ_{i=1}^2 σ_i^{-2}], and the observed Fisher information Î_F is simply the R̃ just defined.
Fisher's PF here is FPF = (θ̂ − θ)/SE, where SE is the square root of Var(θ̂) = T/I_F = 2[(1/σ₁²) +
(1/σ₂²)]^{-1}. The CI95 for Cox's example will obviously invert this FPF. EH (1978) argue that T/I_F as a
variance is not robust, since it uses T = T₁ + T₂, disregarding the realized number of heads (T₁)
and tails (T₂). Instead, they suggest the `observed Fisher information,' or

T/Î_F = T[(T₁/σ₁²) + (T₂/σ₂²)]^{-1},    (9)

to estimate the variance robustly. It is clear that (9) will equal T/I_F only if T₁ = T₂. The CIs from (9)
will also be more robust. So far, we are accepting on faith that the σ_i² are known. To inject further
robustness Royall (1986) allows for errors in the supposedly `known' σ_i². Royall's Var(θ̂) is:

Â = T Î_2op^{-1} Î_opg Î_2op^{-1}.    (10)

For Cox's example, Royall derives the following expression for the variance based on (10):

T {Σ_{t=1}^T (y_1t − θ)²/σ₁⁴ + Σ_{t=1}^T (y_2t − θ)²/σ₂⁴} (Î_F)^{-2},    (11)

where Î_F is known from (9). Royall states that (11) is more robust than Efron and Hinkley's T/Î_F of
(9), because it offers protection against errors in the assumed variances σ_i². Since (10) reduces to T/Î_F
only if I_2op = I_opg, Royall's variance is more robust. By avoiding the `information matrix equality' Royall
avoids the restrictive assumptions that the skewness γ₁ = 0 and the kurtosis γ₂ = 0.
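As a computational aside (not part of the original derivations), the three variance expressions just described are easy to evaluate on a recorded sample. The short Python sketch below uses simulated data and follows the T-scaled formulas for T/I_F, (9) and (11) literally; the sample sizes, seed and true mean are illustrative assumptions, so the printed numbers will differ from the simulation reported in section 5.

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = np.array([1.0, 2.0])                 # known gadget standard deviations
    T1, T2 = 50, 30                              # realized numbers of heads and tails
    y1 = rng.normal(5.0, sigma[0], T1)           # measurements from gadget 1
    y2 = rng.normal(5.0, sigma[1], T2)           # measurements from gadget 2
    T = T1 + T2

    # EF (= ML) point estimate of theta, equations (8) and (13)
    L_tilde = y1.sum() / sigma[0]**2 + y2.sum() / sigma[1]**2
    R_tilde = T1 / sigma[0]**2 + T2 / sigma[1]**2
    theta_hat = L_tilde / R_tilde

    # expected-information variance T/I_F, observed-information variance (9), and Royall's (11)
    var_fisher = 2.0 / (1.0 / sigma[0]**2 + 1.0 / sigma[1]**2)
    var_eh = T / R_tilde
    opg = ((y1 - theta_hat)**2 / sigma[0]**4).sum() + ((y2 - theta_hat)**2 / sigma[1]**4).sum()
    var_royall = T * opg / R_tilde**2
    print(theta_hat, var_fisher, var_eh, var_royall)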
Before developing the GPF for Cox's example, we need the optimal EF and a "point estimate"
of θ. If normality of y_it is believed, the likelihood function is available, the score equation S = 0 is the
OptEF, and its root yields the ML estimator of (8) above. If we assume only knowledge of the first
two moments (instead of normality), we can construct the quasi-scores from the two moments for each t
as g*_t = Σ_{i=1}^2 σ_i^{-2}(y_it − θ) = Σ_{i=1}^2 g*_it, defining the g*_it notation. Though not needed here, it is instructive to
work with a general method for EF point estimates. First verify the orthogonality E(g*_1t g*_2t) = 0. Then solve:

Σ_{t=1}^T g*_1t [E(∂g*_1t/∂θ) / E(g*_1t)²] + Σ_{t=1}^T g*_2t [E(∂g*_2t/∂θ) / E(g*_2t)²] = 0.    (12)

Note that the weights simplify as E(∂g*_it/∂θ) = −σ_i^{-2} and E(g*_it)² = E[σ_i^{-4}(y_it − θ)²] = σ_i^{-2}. Hence (12)
becomes 0 = Σ_{t=1}^T Σ_{i=1}^2 g*_it σ_i^{-2} σ_i^{2} = Σ_{t=1}^T Σ_{i=1}^2 σ_i^{-2}(y_it − θ), and its solution is the optimal EF-estimator.
One can verify that the OptEF estimator in this general setting for Cox's example is simply

θ̂_ef = L̃/R̃ = θ̂_ml,    (13)

where L̃ and R̃ are the same as in (8). Thus, even if normality of y_it is not believed (relaxing Cox's
specification for greater robustness), the ML estimator is still the optimal EF-theory point estimate.
Finally, we are ready for the GPF. Godambe's (1985) pivot is defined in terms of the g*_t as:

GPF = z_G = Σ_{t=1}^T g*_t / √{Σ_{t=1}^T (g*_t)²} = Σ_{t=1}^T S̃_t →_d N(0,1).    (14)

This defines the notation S̃_t for `scaled quasi-scores' and z_G. Now, the GPF-roots for Cox's example are
solutions of GPF = constant. Without the (θ − θ̂) term, (14) does not look like a typical Fisherian pivot
(FPF) as in (6). However, we claim that: (i) the very absence of (θ − θ̂) is an important asset of GPFs for
robust inference, (ii) E(GPF) = 0 can hold even if E(θ̂) ≠ θ, and (iii) numerical GPF-roots similar to (4)
will yield CIs for the unknown θ, associated with EF-estimators, while avoiding explicit studies of their
sampling distributions. In support of our claim, Heyde (1997, p. 62) proves that CIs from `asymptotically
normal' GPFs are shorter than CIs from `locally asymptotic mixed normal' pivots. Mixtures of distinct
distributions in the usual FPF = (θ̂ − θ)/SE obviously come from the distribution of the random variable θ̂
and the distribution of the estimated SE in the denominator of the FPF. No mixture distributions are needed to
assert (14), since its normality is based on the CLT applied to its definition as a sum of scaled scores.
We are now ready to place GPFs in the context of the bootstrap for robust computer-intensive inference.
5. Computation of GPF roots, CIs from bootstraps and a simulation for Cox's example
Two analytical solutions of FPF = (θ̂ − θ)/SE = z_α = ±1.96 give the limits of the CI95 of (4). From
(14) it is clear that obtaining a CI95 by solving GPF = (a nonzero constant) analytically is impossible. One needs
numerical methods to estimate the limits of a CI95 from the GPF. Let us rewrite (14) without the
reciprocals of square roots, for better behavior of numerical algorithms, as follows:

z_α {Σ_{t=1}^T Σ_{i=1}^2 σ_i^{-4}(y_it − θ)²}^{0.5} = Σ_{t=1}^T Σ_{i=1}^2 σ_i^{-2}(y_it − θ).    (15)

The choice z_α = ±1.96 depends on the normality in (14), which may not hold exactly in finite samples.
Bootstraps achieve robustness by resampling the S̃_t of (14) using a nonparametric distribution induced
by the empirical distribution function (EDF) of a large number (J = 999) of solutions of GPF = 0.
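To make the numerical step concrete, a minimal sketch follows. It is written in Python (our own computations use GAUSS, as reported in section 5), and the simulated data and the bracketing interval passed to the root finder are illustrative assumptions.

    import numpy as np
    from scipy.optimize import brentq

    def eq15(theta, y1, y2, sigma, z):
        """Equation (15): z * sqrt(sum of squared quasi-scores) minus the sum of quasi-scores."""
        s1 = (y1 - theta) / sigma[0]**2
        s2 = (y2 - theta) / sigma[1]**2
        ssq = (s1**2).sum() + (s2**2).sum()
        return z * np.sqrt(ssq) - (s1.sum() + s2.sum())

    rng = np.random.default_rng(1)
    sigma = np.array([1.0, 2.0])
    y1 = rng.normal(5.0, sigma[0], 50)
    y2 = rng.normal(5.0, sigma[1], 30)

    # the GPF-roots at z = +1.96 and z = -1.96 are the lower and upper CI95 limits;
    # the bracket [0, 10] is an assumed search interval surrounding the data
    lower = brentq(eq15, 0.0, 10.0, args=(y1, y2, sigma, +1.96))
    upper = brentq(eq15, 0.0, 10.0, args=(y1, y2, sigma, -1.96))
    print(lower, upper)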
Hall's (1992) approach to short bootstrap confidence intervals is typical. It involves recomputing
the FPF = (θ̂ − θ)/SE values J times. The resampling from the EDF of such FPF values avoids
parametric distributional assumptions. Next, one considers the j = 1,…,J estimates of FPF-roots θ̂_j, their
`order statistics,' and perhaps seeks a detailed numerical look at the approximate sampling distribution
of θ̂. If the sampling distribution of the FPF depends on θ (e.g., if the bias E(θ̂) − θ depends on θ), that
FPF is an invalid pivot, and hence any bootstrap using an invalid pivot may need considerable
adjustment, if not complete rejection. An adjustment for ridge regression is discussed in Vinod (1995).
Since SE > 0, if we solve FPF = 0 for θ the solution is equivalent to solving (θ̂ − θ) = 0. The
solution of FPF = 0 is θ̂ (the ML estimator). One can imagine pivot functions whose roots do not equal θ̂.
However, when solving GPF = 0 it is easy to verify that the root of the sum of scaled scores ΣS̃_t is the
same as the ML estimator (if it is well defined), or the root of the score function ΣS_t. After all, the scale
factor vanishes when solving GPF = 0. Thus for Cox's example we can use the ML estimate (= θ̂_ef) to
generate scaled scores S̃_t for each t. Let us denote the estimated t-th scaled score as:

Ŝ̃_t = Σ_{i=1}^2 σ_i^{-2}(y_it − θ̂_ef) / √{Σ_{t=1}^T Σ_{i=1}^2 σ_i^{-4}(y_it − θ̂_ef)²}.    (16)
We suggest a bootstrap-type shuffling with replacement of these (t = 1,…,T) scaled scores J
(= 999) times. This creates J replications of GPFs from their own EDF. As with the FPF-roots, numerically
solving GPF = (constant) built from the resampled scores of (16) for each replicate yields J estimates of GPF-roots,
to be analyzed by descriptive and order statistics as follows. Let z̄_Gj and sd_j(.) denote the sample mean and
standard deviation over the T values, for each j = 1,…,J. Although E(GPF) = 0 holds for large T, for relatively
small T the observed mean z̄_Gj may be nonzero. However, if we can assume that any discrepancy between z̄_Gj
and zero does not depend on the unknown θ, the following FPF remains valid: ẑ*_j = (ẑ_Gtj − z̄_Gj)/sd_j(ẑ_Gt),
where ẑ_Gtj denotes the scaled score from (16) for the j-th replicate. Therefore, we can approximate the sampling
distribution of the root θ̂_ef by substituting the standardized resampled ẑ*_j for the z_α in (15). Thus one
can use the estimated S̃_t to provide a refined z_α instead of the traditional ±1.96.

Next, we substitute these refined z_α's in (15) and numerically solve for each j = 1,…,J to yield
the GPF-roots θ̂*_j. Arranging the roots in increasing order yields the "order statistics" denoted by θ̂*_(j).
Hence a possibly nonsymmetric, (single) bootstrap nonparametric CI95 is given by

[θ̂*_(25), θ̂*_(975)].    (17)
If any CI procedure insists that the CI must always be symmetric around θ̂_ef, it is intuitively
obvious that it will not be robust. After all, it is not hard to construct examples where a nonsymmetric
CI is superior. Efron-Hinkley's and Royall's CI95s (based on (9) or (11), respectively) are prima facie
not fully robust, simply because they retain the symmetric structure. For the FPF = (θ̂ − θ)/SE, Hall
(1992, pp. 111-113) proves that symmetric CIs are asymptotically superior to their equal-tailed counterparts.
However, Hall does not consider GPFs, and his example shows that his superiority result depends
on the confidence level. For example, it holds true for a 90% interval, but not for a CI95. In finite
samples symmetric CIs are not robust, in general.

A comparison of the (single) bootstrap GPF-CI95 of (17) with the parametric bootstrap
from (15) using the generic constants ±1.96 is possible. Instead, our GPF-N(0,I) algorithm generates
J (parametric) unit normal deviates z_Gj from N(0,1) and substitutes them for z_α in (15). Again, each z_Gj
yields a nonlinear equation (15), which may be solved by numerical methods, with GPF-roots denoted by
θ̂_efj. The appropriate order statistics yield the following CI95 from our GPF-N(0,I) algorithm:

[θ̂_ef(25), θ̂_ef(975)].    (18)
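Both algorithms reduce to repeatedly solving (15) at different constants. The Python sketch below (an illustration, not the GAUSS code used in section 5) draws the constants from N(0,1) for the parametric algorithm of (18), and obtains refined constants for the nonparametric algorithm of (17) by shuffling the estimated scaled scores of (16); the studentization shown is one plausible reading of the refinement described above.

    import numpy as np
    from scipy.optimize import brentq

    def solve_root(z, y1, y2, sigma):
        """Numerical GPF-root of equation (15) at a given constant z."""
        def eq(theta):
            s = np.concatenate([(y1 - theta) / sigma[0]**2, (y2 - theta) / sigma[1]**2])
            return z * np.sqrt((s**2).sum()) - s.sum()
        return brentq(eq, -50.0, 50.0)               # assumed wide bracket

    rng = np.random.default_rng(2)
    sigma = np.array([1.0, 2.0])
    y1 = rng.normal(5.0, sigma[0], 50)
    y2 = rng.normal(5.0, sigma[1], 30)
    J = 999

    # parametric GPF-N(0,I) algorithm of (18): constants drawn from N(0,1)
    roots_par = np.sort([solve_root(z, y1, y2, sigma) for z in rng.standard_normal(J)])
    ci_par = (roots_par[24], roots_par[974])         # 25th and 975th order statistics

    # nonparametric single bootstrap of (17): shuffle the estimated scaled scores
    theta_ef = solve_root(0.0, y1, y2, sigma)        # EF (= ML) point estimate
    s_hat = np.concatenate([(y1 - theta_ef) / sigma[0]**2, (y2 - theta_ef) / sigma[1]**2])
    s_tilde = s_hat / np.sqrt((s_hat**2).sum())      # estimated scaled scores of (16); they sum to zero
    T = s_tilde.size
    z_refined = []
    for _ in range(J):
        shuffled = rng.choice(s_tilde, size=T, replace=True)
        # studentized resampled GPF value; the centering term is zero because the s_tilde sum to zero
        z_refined.append(shuffled.sum() / (np.sqrt(T) * shuffled.std(ddof=1)))
    roots_np = np.sort([solve_root(z, y1, y2, sigma) for z in z_refined])
    ci_np = (roots_np[24], roots_np[974])
    print(ci_par, ci_np)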
Remark 2: This remark is a digression from the main theme. If computational resources are limited
(e.g., if T is very large), the following CI95 remains available for some problems. First, find an
unbiased estimate θ̂_unb, which can be from the mean of a small simulation. Next, compute κ = θ̂_unb/θ̂_ef
and an `unbiased estimate of squared bias,' U = (θ̂_ef − θ̂_unb)². Adding the estimated variance to U yields an
unbiased estimate of the mean squared error (UMSE), and its square root √(UMSE). Vinod (1984)
derives the sampling distribution of a generalized t ratio, where the ASE is replaced by √(UMSE), as a ratio
of weighted χ² random variables. He indicates approximate methods for obtaining appropriate constants
like the generic 1.96 for a numerically `known' bias factor κ.
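Remark 2 involves only simple bookkeeping; a hedged Python sketch follows, with all three inputs (the EF estimate, an unbiased estimate from a small simulation, and an estimated variance) assumed for illustration.

    import numpy as np

    theta_ef, theta_unb, var_hat = 5.18, 5.02, 0.017   # assumed illustrative inputs
    kappa = theta_unb / theta_ef                       # numerically 'known' bias factor
    U = (theta_ef - theta_unb)**2                      # unbiased estimate of squared bias
    umse = U + var_hat                                 # unbiased estimate of the MSE
    print(kappa, np.sqrt(umse))                        # sqrt(UMSE) replaces the ASE in the generalized t ratio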
Now we discuss our simulation of Cox's example with T₁ = 50, T₂ = 30, σ₁ = 1 and σ₂ = 2. We
generate y_i ~ N(5, σ_i²) for i = 1, 2. Our ȳ_i = (5.2787, 4.8654). The ML estimator is θ̂_ml = 5.1833 with an
ASE of 1.1547, and the ML CI95 is [2.9201, 7.4465]. The Efron-Hinkley estimate of the ASE is 1.1094, with a
CI95 of [3.0089, 7.3577]. Royall's ASE = 1.0706 gives a shorter CI95: [3.0848, 7.2817].

Our simulation of the GPF of (14) is implemented with (15). We simply compare the two CI95s
from the nonparametric algorithm of (17) and the parametric algorithm of (18). For (18), we use J = 999
unit normal deviates (GAUSS computer language) and rank-order the roots θ̂_j based on solving (15)
with the help of GAUSS's NLSYS library. The smallest among the 999 estimates, min_j(θ̂_j), is 4.7761, and
mean(θ̂_j) is 5.1835. The median and maximum are 5.1885 and 5.5408, respectively. The standard
deviation is 0.1263. Since the median is slightly larger than the mean, the approximate sampling
distribution is slightly skewed to the left. Otherwise, the sampling distribution is quite tight and fairly
well behaved, with a remarkably short CI95 from (18): [4.9312, 5.4373]. A similar GPF interval from
the nonparametric single bootstrap of (17) is [4.9572, 5.4924], with a slightly larger standard deviation
of 0.1283 over the 999 realizations. Note that in this simulation we know that the true value of θ is 5, and
we can compare the widths of the intervals given that the true value 5 is inside the CIs (conditional on
coverage). The widths (CIWs) in decreasing order are: 4.53 for classical ML, 4.35 for Efron-Hinkley,
4.20 for Royall, 0.51 for our parametric version, and 0.53 for our nonparametric version. Hence we
can conclude that the parametric CI95 from the simulated (18) is the best for this example, with the shortest
CIW. The CIW of 0.53 for the nonparametric (17) is almost as low, and the difference may be due to
random variation. Thus for Cox's example, used by others in the present context, our simulation shows
that GPFs provide a superior alternative. This supports Godambe and Heyde's (1987) property (iii)
and our discussion.
6. Sequence of robust improvements: Fisher, Efron-Hinkley, Royall and GPFs.
Having defined our basic ideas in the context of Cox's simple example, we are ready to express
our results for a univariate θ in a general setting. Our aim is to work with score functions and review
the sequence of robust improvements over Fisher's setting for CIs. Recall that L = Σ_{t=1}^T L_t = Σ_{t=1}^T ln f_t(y_t; θ)
is the log of the likelihood function. Fisher's pivot is:

z_F = (θ − θ̂){Σ_{t=1}^T −E(∂²L_t/∂θ²)}^{0.5} = (θ − θ̂)√{Σ_{t=1}^T −E(∂S_t/∂θ)}.    (19)

This equals the FPF in (6) provided the population expected values of the likelihood second-order partials
are such that √{Σ_{t=1}^T −E(∂S_t/∂θ)} = 1/ASE. Hence the likelihood structure is implicitly preordained. That is,
(19) needs to impose nonrobust assumptions to estimate E(∂S_t/∂θ) from observable variances. Efron and
Hinkley (1978) remove some structure (i.e., inject robustness) by removing the expectation operator E
from (19). They formally prove that the following pivot is more robust than the FPF:

z_EH = (θ − θ̂){Σ_{t=1}^T −∂²L_t/∂θ²}^{0.5}.    (20)

Royall (1986) goes a step further and argues that the asymptotic CIs from (20) can be invalid. He
proves that when the assumed parametric model fails, the variance estimator is inconsistent. Royall
uses the delta method to obtain a simple alternative variance estimator included in the following pivot:

z_R = (θ − θ̂){Σ_{t=1}^T −∂²L_t/∂θ²} {Σ_{t=1}^T [∂L_t/∂θ]²}^{-0.5}.    (21)

Now Godambe's (1985, 1991) PF, whose distribution also does not depend on θ, is:

GPF = z_G = Σ_{t=1}^T (∂L_t/∂θ) {Σ_{t=1}^T [∂L_t/∂θ]²}^{-0.5}.    (22)

For Cox's example, (17) and (18) give CI95s from (14) as a special case of (22). Unlike z_F, z_EH and
z_R, our GPF of (22) lacks the term (θ − θ̂), yet we can almost always give CI95s from its numerical
GPF-roots. To the best of my knowledge, (22) has not been implemented in the literature for Cox's or
other examples. A multivariate extension of (22) is given later in (34).
Let φ_T = Π_{t=1}^T (1 + iλS̃_t), where i² = −1 and λ is any real constant. Assuming (a) φ_T is uniformly
integrable, (b) E(φ_T) → 1 as T → ∞, (c) Σ S̃_t² → 1 in probability as T → ∞, and (d) max_t |S̃_t| → 0 in
probability as T → ∞, McLeish (1974) proved (without assuming finite second moments) a central limit theorem
(CLT) for dependent processes (common in econometrics). Now we state, without proof, Vinod's (1998)
proposition:

Proposition 1: Let the scaled quasi-scores S̃_t satisfy McLeish's four assumptions. Then, by his CLT,
their partial sum GPF = z_G = Σ_{t=1}^T S̃_t →_d N(0,1), i.e., it converges in distribution to the unit normal as T → ∞.
Defining robustness as the absence of additional assumptions, such as asymptotic normality of a root θ̂, z_G
is more robust than z_F, z_EH and z_R.
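Proposition 1 is easy to check by simulation. The sketch below (Python, with an assumed exponential model that is deliberately skewed and nonnormal) computes z_G of (22) at the true θ in many replications, using the exponential score ∂L_t/∂θ = (y_t − θ)/θ², and verifies that its mean and variance are close to 0 and 1.

    import numpy as np

    rng = np.random.default_rng(3)
    theta_true, T, reps = 2.0, 50, 5000
    z_vals = []
    for _ in range(reps):
        y = rng.exponential(theta_true, T)            # skewed, nonnormal data
        score = (y - theta_true) / theta_true**2      # per-observation score dL_t/dtheta
        z_vals.append(score.sum() / np.sqrt((score**2).sum()))   # z_G of (22)
    z_vals = np.array(z_vals)
    print(z_vals.mean(), z_vals.var())                # approximately 0 and 1, as Proposition 1 predicts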
7. Further GPF examples from the exponential family
In this section we illustrate our proposal for various well-known exponential family
distributions and postpone the normal regression application to the next section.
Poisson mean: If the y_t are iid Poisson random variables, the ML estimator of the mean is ȳ. The variance
of y_t is also estimated by ȳ, and Var(ȳ) = ȳ/T. Hence the CI95 is [ȳ ± 1.96√(ȳ/T)] for both Fisher and Efron-Hinkley.
Royall's V̂ = Var̂(y_t) = Σ_{t=1}^T (y_t − ȳ)²/T, and his CI95 is [ȳ ± 1.96√(V̂/T)]. If the Poisson model
is valid, V̂ = ȳ and the CI95 is the same as Fisher's or Efron-Hinkley's. Now f_t(y_t; θ) = (θ^{y_t}/y_t!) exp(−θ).
Hence ∂L_t/∂θ = (y_t/θ) − 1 = θ^{-1}(y_t − θ), and the GPF-roots are obtained by solving GPF = constant, where

GPF = z_G = Σ_{t=1}^T (y_t − θ) / √{Σ_{t=1}^T (y_t − θ)²}.    (23)
Binomial probability: Let y be independently distributed as binomial, (k choose y) θ^y (1−θ)^{k−y} for y = 0, 1,…,k.
Now the ML estimator is θ̂ = ȳ/k, and z_F and z_EH are based on the common variance estimate (ȳ/k)[1 − (ȳ/k)]/k.
Royall shows that V̂ = Σ_{t=1}^T (y_t − ȳ)²/(Tk²) is robust in the sense that if the actual distribution is
hypergeometric, his V̂ converges to the true variance, unlike Fisher's or Efron-Hinkley's.
Since ∂L_t/∂θ = y_t θ^{-1}(1−θ)^{-1} − k(1−θ)^{-1} = (1−θ)^{-1}θ^{-1}(y_t − kθ), the GPF-roots need
numerical solutions, as above. The GPF is defined by:

z_G = Σ_{t=1}^T (1−θ)^{-1}θ^{-1}(y_t − kθ) {Σ_{t=1}^T (1−θ)^{-2}θ^{-2}(y_t − kθ)²}^{-0.5}.    (24)
Normal standard deviation: For y_t independent normal, N(μ, σ²), the ML estimate of σ is
σ̂ = √{Σ_{t=1}^T (y_t − ȳ)²/T}. The asymptotic variance is σ̂²/2, whether one uses Fisher's or Efron-Hinkley's
estimate. Royall's V̂ = (1/4σ̂²)[Σ_{t=1}^T {(y_t − ȳ)² − σ̂²}²/T], and his asymptotic CI95 is robust against
nonnormality. Since ∂L_t/∂σ = σ^{-3}(y_t − μ)² − σ^{-1}, we again need numerical solutions of GPF = constant,
where the GPF is defined by:

z_G = {Σ_{t=1}^T (y_t − μ)² − Tσ²} {Σ_{t=1}^T (y_t − μ)⁴ − 2σ² Σ_{t=1}^T (y_t − μ)² + Tσ⁴}^{-0.5}.    (25)
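Each of (23)-(25) leads to the same computational pattern: write the per-observation score s_t(θ) and solve Σ s_t(θ) = z √{Σ s_t(θ)²} numerically at z = ±1.96. The Python sketch below does this for the Poisson case (23); the data and search bracket are assumptions, and the binomial and standard-deviation cases of (24)-(25) only require a different score function.

    import numpy as np
    from scipy.optimize import brentq

    def gpf_root(score_fn, y, z, lo, hi):
        """Solve sum(s_t) - z*sqrt(sum(s_t^2)) = 0 for theta over an assumed bracket [lo, hi]."""
        def eq(theta):
            s = score_fn(y, theta)
            return s.sum() - z * np.sqrt((s**2).sum())
        return brentq(eq, lo, hi)

    poisson_score = lambda y, theta: y - theta       # numerator of (23); the 1/theta factor cancels
    rng = np.random.default_rng(4)
    y = rng.poisson(3.0, 40)                         # assumed Poisson sample
    limits = sorted(gpf_root(poisson_score, y, z, 0.1, 20.0) for z in (1.96, -1.96))
    print(limits)                                    # numerical GPF-root CI95 for the Poisson mean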
8. The GPFs for regressions
In this section we consider two types of GPFs for regression problems, which also involve
the exponential family. Consider the usual regression model with T observations and p regressors: y = Xβ
+ ε, E(ε) = 0, E(εε′) = σ²I, where I is the identity matrix. The log likelihood function L = Σ_{t=1}^T L_t contains

L_t = (−0.5) log(2πσ²) − (1/2)σ^{-2}[y_t − (Xβ)_t]²,    (26)

where (Xβ)_t denotes the t-th element of the T×1 vector Xβ and ε_t = y_t − (Xβ)_t. Now ∂L_t/∂β is proportional
to [X_t′y_t − X_t′X_tβ], where X_t = (x_t1,…,x_tp) is a row vector and X_t′X_t is a p×p matrix, giving
p equations for each t = 1,…,T. It is well known, Davidson and MacKinnon (DM) (1993, p. 489), that
(26) often has an additional term representing the contribution of the Jacobian factor. For example,
when y_t is subjected to a transformation τ(y_t) = log(y_t), ∂τ(y_t)/∂y_t = 1/y_t and the Jacobian factor is its
absolute value. Since log(|1/y_t|) = −log(|y_t|), this is the additional term in (26). Another kind of
familiar generality is achieved by replacing Xβ by a nonlinear function X(β) of β and letting the
covariances depend on parameters λ with ε ~ N(0, H(λ)), DM (1993, p. 302).
Note from (26) that the underlying quasi-score function S = Σ_{t=1}^T (∂L_t/∂β) = 0 is interpreted here
as g*, the OptEF. The EF solution here is simply β̂_ef ≡ β̂_ols = (X′X)^{-1}X′y, the ordinary least squares
(OLS) estimator. Hence the OLS estimator is equivalent to the root of the following OptEF or "normal
equations":

g_ols = g* = X′(y − Xβ) = Σ_{t=1}^T X_t′(y_t − X_tβ) = 0.    (27)

If E(εε′) = σ²H is known, we have the generalized least squares (GLS) estimator, whose "normal
equations" in the notation of Vinod (1998) are:

g_gls = X′H^{-1}y − X′H^{-1}Xβ = X′H^{-1}(y − Xβ) = Σ_{t=1}^T H_t′ε_t = Σ_{t=1}^T S_t = 0,    (28)

where we view g_gls = 0 = g* as our OptEF, H_t denotes the t-th row of H^{-1}X, and the score S_t is p×1. The
ML estimator under the normality assumption is the same as the GLS estimator. We have shown that the
usual normal equations in (28) represent a sum of T scores, whose asymptotic normality can be proved
directly from the CLT. For constructing the GPF for inference, we seek a scaled sum of scores, where the
scale factors are based on variances. The usual asymptotic theory for the nonlinear regression y = X(β)
+ ε, ε ~ N(0,H), implies the following expression for the variances, DM (1993, p. 290):

√T(β̂ − β) →_d N(0, plim_{T→∞} [T^{-1}X(β)′H^{-1}X(β)]^{-1}).    (29)

In the linear case, X(β) = Xβ, the Fisher information matrix I_F is [X′H^{-1}X]/σ̂², where (T−p)σ̂² = (y −
Xβ̂)′H^{-1}(y − Xβ̂). Now its inverse, I_F^{-1}, is the asymptotic covariance matrix denoted as (ASE)². In the
OLS regression case with spherical errors, a 100(1−α)% confidence region for β contains those values
of β which satisfy the inequality, Donaldson and Schnabel (1987):

(y − Xβ)′(y − Xβ) − (y − Xβ̂_ols)′(y − Xβ̂_ols) ≤ s²p F_{p,T−p,1−α},

where s² = (y − Xβ̂_ols)′(y − Xβ̂_ols)/(T−p), and where F_{p,T−p,1−α} denotes the upper 100(1−α)% quantile of the
F distribution with p and T−p degrees of freedom in the numerator and denominator, respectively. It is
equivalently written as

(β̂_ols − β)′X′X(β̂_ols − β) ≤ s²p F_{p,T−p,1−α},

which shows that the shape will be ellipsoidal. If the Frisch-Waugh theorem is used to convert the
regression problem into p separate univariate problems, use of the bootstrap seems to be a good way to
obtain robustness.
The usual F tests for regressions are based on Fisher's pivot z_F = (β − β̂)(ASE)^{-1}. Replacing the
expected information matrix by the observed one obtains the Efron-Hinkley pivot z_EH = (β − β̂)(ASÊ)^{-1}.
If X is nonstochastic, E[X′H^{-1}X] = [X′H^{-1}X] and z_EH equals z_F. A simple binary-variable regression
where this makes a difference, and is intuitively sensible, is given by DM (1993, p. 267). For the special
case of heteroscedasticity, where H is a known diagonal matrix, Royall (1986) suggests the following
robust estimator of the covariance matrix:

Â₁ = T[X′H^{-1}X]^{-1} X′H^{-1} diag(ε̂_t²) H^{-1}X [X′H^{-1}X]^{-1},    (30)

where ε̂_t = y_t − (Xβ̂)_t is the residual and diag(.) denotes a diagonal matrix. Royall argues that this is
essentially a weighted jackknife variance estimator. Now, let us use g_gls of (28) to define the outer product
of gradients (opg) matrix to substitute into (10). Then we have

Â₂ = T[X′Ĥ^{-1}X]^{-1} {Σ_{t=1}^T X_t′Ĥ_t^{-1}(y_t − X_tβ̂)(y_t − X_tβ̂)′Ĥ_t^{-1}X_t} [X′Ĥ^{-1}X]^{-1},    (31)

where a consistent estimate of H is denoted by Ĥ. If H is the identity matrix, (30) reduces to what is
known in econometrics texts, DM (1993, p. 553), as the Eicker-White heteroscedasticity consistent (HC)
covariance matrix estimator T(X′X)^{-1}X′ĤX(X′X)^{-1}. In the econometric literature four different HC
estimates are distinguished by the choice of the diagonal matrix Ĥ = diag(HCj) for j = 0, 1, 2, 3:

HC0 = ε̂_t²,  HC1 = ε̂_t² T/(T−p),  HC2 = ε̂_t²/(1 − h_t),  and  HC3 = ε̂_t²/(1 − h_t)²,    (32)

where h_t denotes the t-th diagonal element of the hat matrix X(X′X)^{-1}X′. Among these, DM (1993)
recommend HC3 by alluding to some Monte Carlo studies. Royall's pivot based on (31) is z_R = (β −
β̂)(Â₂)^{-0.5}. A p-variate GPF similar to (22) is [X′H^{-1}E(εε′)H^{-1}X]^{-0.5}X′H^{-1}ε, where E(εε′) = H is assumed
to be known. Substituting E(εε′) = H, this expression yields p equations in the p unknown coefficients β:

GPF = Σ_{t=1}^T S̃_t = [X′H^{-1}X]^{-0.5} X′H^{-1}ε.    (33)

Recall that (28) shows that X′H^{-1}ε = ΣS_t, a sum of scores. In (33) this sum of scores is premultiplied
by a scaling matrix. Thus our GPF for regressions in (33) is a sum of T scaled scores. The sum is
asymptotically N(0,I) by the CLT. In particular, OLS has H = I and GPF = [X′X]^{-0.5}X′ε. Rather than
assuming a known H, a flexible choice discussed below is H = H(λ), where the parameters λ can represent
the autocorrelation and/or heteroscedasticity among the errors ε. A bootstrap of (33) in Vinod (1998)
shuffles (and sometimes also studentizes) J (= 999) times with replacement the T scaled scores S̃_t, as
sketched in the code below. Babu (1997) studies the "breakdown point" of bootstraps and suggests
Winsorization (replacing a certain percentage of "extreme" values by the nearest non-extreme values)
before resampling. Since the S̃_t are p×1 vectors, we must choose a vector norm |S̃_t| to define their extreme
values. For robustness, the norm should help eliminate the extreme values of the numerical roots of the
shuffled GPF = ΣS̃_t = (constant) as estimates of β. If one is interested in a CI95, we can Winsorize fewer
than 5% of the "extreme" |S̃_t| values. Further research and simulation are needed to know how robust and
how wide the CI95 is for each norm.
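The Python sketch below illustrates (32)-(33) and the scaled-score shuffling just described for the heteroscedastic choice (i) of this section, with H^{-1} built from HC3 weights; the simulated data, the symmetric matrix square roots, and the linear solve used for the shuffled roots are illustrative assumptions rather than the paper's GAUSS implementation.

    import numpy as np

    def eig_power(A, power):
        """Symmetric matrix power (e.g., -0.5 or +0.5) via an eigen-decomposition."""
        vals, vecs = np.linalg.eigh(A)
        return (vecs * vals**power) @ vecs.T

    rng = np.random.default_rng(5)
    T, p, J = 100, 3, 999
    X = np.column_stack([np.ones(T), rng.normal(size=(T, p - 1))])   # assumed design with intercept
    y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=T) * (1 + 0.5 * np.abs(X[:, 1]))

    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta_ols
    hat = np.einsum('ij,jk,ik->i', X, np.linalg.inv(X.T @ X), X)     # hat-matrix diagonals h_t
    h_inv = (1.0 - hat)**2 / resid**2                                # 1/HC3, robust choice (i) for H^{-1}

    XtHiX = X.T @ (X * h_inv[:, None])
    A_half, A_mhalf = eig_power(XtHiX, 0.5), eig_power(XtHiX, -0.5)
    s_tilde = (X * (h_inv * resid)[:, None]) @ A_mhalf.T             # T scaled scores; GPF of (33) is their sum

    betas = []
    for _ in range(J):
        z_star = s_tilde[rng.integers(0, T, T)].sum(axis=0)          # shuffled sum replaces the +/-1.96 vector
        # solve [X'H^-1 X]^{-0.5} X'H^{-1}(y - X beta) = z_star, which is linear in beta
        betas.append(np.linalg.solve(XtHiX, X.T @ (h_inv * y) - A_half @ z_star))
    betas = np.sort(np.array(betas), axis=0)
    print(beta_ols, np.column_stack([betas[24], betas[974]]))        # element-wise CI95 limits for beta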
Now a GPF-N(0,I) bootstrap algorithm is to make J = 999 numerical solutions of
GPF = ±1.96 for any β or scalar function f(β) and construct a CI95 for inference. We propose a
second multivariate pivot (denoted by the superscript 2) using a quadratic form of the original scores in (28):

GPF² = ε′H^{-1}X[X′H^{-1}X]^{-1}X′H^{-1}ε ~ χ²(p).    (34)

The limiting distribution of (34) is readily shown to be a central χ² (zero noncentrality), which does not
depend on unknown parameters. As a refinement we may want F instead of χ². Then, we assume E(εε′) =
σ²H instead of E(εε′) = H, where σ² is a common error variance. Next, we replace the ε vector in (34) by the
scaled vector (ε/σ). If we estimate σ² by the regression residual variance (ε̂′ε̂)/(T−p), the reference
distribution on the right-hand side of (34) becomes F_{p,T−p}.

Remark 3: How do we bootstrap using (34)? We need algebraic manipulations to write GPF² as a sum
of T items, which can be shuffled. Note that (34) is a quadratic form ε′Aε, and we can decompose the
T×T matrix A as GΛG′, where G is orthogonal and Λ is diagonal. Let us define a T×1 vector
ε̃ = Λ^{0.5}G′ε, and note that ε′Aε = ε̃′ε̃ = Σ(ε̃_t)², using the elements of ε̃. Now, our bootstrap shuffles the T
values of (ε̃_t)² with replacement. Since the (ε̃_t)² values are scalars, they do not need any norms before
Winsorization. Unlike the GPF-N(0,I) algorithm above, the GPF²s yield ellipsoidal regions rather than the
convenient upper and lower limits of the usual confidence intervals. The limits of these regions come
from the tabulated upper 95% critical value T_α of the χ² or F distribution. If we are interested in inference about a scalar
function f(β) of the β vector, we can construct a CI95 by shuffling the (ε̃_t)² and solving GPF² = T_α for
f(β) J times. Simultaneous confidence intervals for β are more difficult. We have to numerically maximize and
minimize each element of β subject to the inequality constraint GPF² ≤ T_α.
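A sketch of the parametric use of GPF² and of the Remark 3 decomposition follows (Python; H = I and the data are assumed). For simplicity the other coefficients are held at their OLS values while one slope is scanned against the tabulated χ²(p) critical value; the Frisch-Waugh rearrangement discussed below treats this more carefully.

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(6)
    T, p = 80, 3
    X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])       # assumed design
    y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=T)
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

    A = X @ np.linalg.solve(X.T @ X, X.T)                            # H = I case of the quadratic form (34)
    lam, G = np.linalg.eigh(A)                                       # Remark 3 decomposition A = G diag(lam) G'

    def gpf2(beta):
        eps_tilde = np.sqrt(np.clip(lam, 0, None)) * (G.T @ (y - X @ beta))   # T x 1 vector eps-tilde
        return (eps_tilde**2).sum()                                  # equals eps' A eps

    crit = chi2.ppf(0.95, p)                                         # tabulated critical value T_alpha
    grid = np.linspace(beta_ols[1] - 1.0, beta_ols[1] + 1.0, 401)    # scan one slope; others held at OLS values
    inside = [b for b in grid if gpf2(np.array([beta_ols[0], b, beta_ols[2]])) <= crit]
    print(min(inside), max(inside))                                  # approximate confidence limits for that slope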
The next few paragraphs discuss five robust choices of H(λ) for both GPFs, (33) and (34).
(i) First, under heteroscedasticity, H = diag(ε_t²), and we replace H by a consistent estimate
based on HC0 to HC3 defined in (32).
(ii) Second, under autocorrelation alone, Vinod (1996, 1998) suggests using the iid "recursive
residuals" for time-series bootstraps. Note that one retains the (first) p residuals unchanged and
constructs T−p iid recursive residuals for shuffling. The key advantage of this method is that it is valid
for arbitrary error autocorrelation structures. This is a nonparametric choice of H without any λ.
(iii) Third, under first-order autoregression, AR(1), among the regression errors, we have the
following model at time t: y_t = X_t′β + ε_t, where ε_t = ρε_{t−1} + u_t. Now, under normality of the errors, the
likelihood function is f(y₁,y₂,…,y_T) = f(y₁)f(y₂|y₁)f(y₃|y₂)…f(y_T|y_{T−1}). Next we use the so-called quasi
first differences (e.g., y_t − ρy_{t−1}). It is well known that here one treats the first observation differently
from the others, Greene (1997, p. 600). For this model the log-likelihood function is available in textbooks,
and the corresponding score function equals the partials of the log-likelihood. The partial derivative
formula is different for t = 1 than for all other t values. The partial with respect to (wrt) β is
(1/σ_u²) Σ_{t=1}^T u_t X*_t′, where u₁ = (1−ρ²)^{0.5}(y₁ − X₁β), and for t = 2,3,…,T: u_t = (y_t − ρy_{t−1}) − (X_t − ρX_{t−1})β,
where we have defined the 1×p vectors X*₁ = (1−ρ²)^{0.5}X₁ and X*_t = (X_t − ρX_{t−1}) for t = 2,3,…,T. Collecting
all the u_t we have the u vector of dimension T×1. Similarly, we have the T×p matrix of quasi-differenced
regressor data X* satisfying the definitional relation (X*′X*) = X′H^{-1}X. The partial wrt σ_u² is

(−T/2σ_u²) + (1/2σ_u⁴) Σ_{t=1}^T u_t².

Finally, the partial derivative wrt ρ is

(1/σ_u²) Σ_{t=2}^T u_t ε_{t−1} + (ρε₁²/σ_u²) − ρ/(1−ρ²),

where ε_t = y_t − X_tβ. Thus, for the parameter vector θ = (β, σ_u², ρ), we have analytical expressions for the
score vector, which is the optimal EF. Simultaneous solution of these p+2 equations in the parameter
vector θ gives the usual ML estimator, which coincides with the EF estimator here.

Instead of simultaneous solutions, Durbin's (1960) seminal paper, which started the whole EF
literature, also suggested the following two-step OLS estimator of ρ: regress y_t on y_{t−1}, X_t′ and (−X_{t−1}′), and
use the coefficient of y_{t−1} as ρ̂. Also, one can use a consistent estimate s_u² of σ_u² and simplify the
score for β as (1/s_u²) Σ_{t=1}^T u_t X*_t′ = (1/s_u²) X*′u. Using rescaled scores, GPF = z_G = [X*′X*]^{-0.5} X*′u, and the
analogous GPF² is the quadratic form (z_G)′z_G.
(iv) Fourth, we consider H(λ) induced by mixed autoregressive moving average (ARMA)
errors. Note that any stable invertible dynamic error process can be approximated by an ARMA(q, q−1)
process. Hence Vinod (1985) provides an algorithm for exact ML estimation of the regression coefficients
β when the model errors are ARMA(q, q−1). His approximation is based on analytically known
eigenvalues and eigenvectors of tri-diagonal matrices, with an explicit derivation for the ARMA(2,1). In
general, this regression error process implies that H(λ) is a function of the 2q−1 elements of the λ
vector, representing q parameters on the AR side and q−1 parameters on the moving average (MA) side.
(v) Our fifth robust choice of H uses heteroscedasticity and autocorrelation consistent (HAC)
estimators of H discussed in DM (1993, pp. 553, 613). Assume that both heteroscedasticity and general
autocorrelation among the regression errors are present and that we are not willing to assume any
parametric specification H(λ). Instead of H, we denote the nonparametric HAC covariance matrix as E(εε′) = Ω,
to emphasize its nonzero off-diagonal elements due to autocorrelation and its nonconstant diagonal elements due to
heteroscedasticity. Practical construction of Ω requires the following smoothing and truncation adjustments
using the (quasi) score functions S_t defined in (28). Define our basic building block as the p×p matrix of
autocovariances H_j = (1/T) Σ_{t=j+1}^T S_t(S_{t−j})′. We smooth them by using [H_j + H_j′] to guarantee a
symmetric matrix. We further assume that the autocovariances die down after m lags, with a known
m, and truncate the sum after m terms. After all this truncation and smoothing, we construct Ω as the
nonsingular symmetric matrix proposed by Newey and West (1987):

Ω = H₀ + Σ_{j=1}^m w(j,m)[H_j + H_j′],  where w(j,m) = 1 − j(m+1)^{-1}.    (35)

The w(j,m) are Bartlett's window weights, familiar from spectral analysis, declining linearly as j
increases. One can refine (35) by using the pre-whitened HAC estimator proposed by Andrews and
Monahan (1992). Here we do not merely use (35) in the usual fashion as a HAC estimator of a variance.
We are also extending the EF-theory lesson mentioned in Remark 1 to construct a robust (HAC)
estimator of the variance of the underlying score function itself, which permits construction of the scaled
scores S̃_t needed to define the GPF in (33). Thus, upon smoothing and truncation, we propose the GPF as
[X′Ω^{-1}X]^{-0.5} X′Ω^{-1}ε with nonparametric Ω. Now, similar to (34), we have

GPF² = ε′Ω^{-1}X[X′Ω^{-1}X]^{-1}X′Ω^{-1}ε ~ χ²(p).    (36)

Now recall the scaled vector (ε/σ) used after (34) for a refinement. As before, instead of E(εε′) = Ω we
use E(εε′) = σ²Ω, insert the σ², and use (ε/σ) as with (34). The reference distribution on the right-hand side of (36)
then becomes F_{p,T−p}. The difficulty noted in Remark 3 about simultaneous CIs for β is relevant for (36) also;
a computational sketch of (35) is given below.
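A minimal sketch of the Newey-West construction (35) applied to the score matrix follows (Python); the scores, the lag truncation m, and the final rescaling of the summed scores are assumptions consistent with one reading of the text.

    import numpy as np

    def newey_west(S, m):
        """Equation (35): Bartlett-weighted long-run covariance of the T x p matrix of scores S."""
        T = S.shape[0]
        omega = S.T @ S / T                              # H_0
        for j in range(1, m + 1):
            w = 1.0 - j / (m + 1.0)                      # Bartlett weight w(j, m)
            Hj = S[j:].T @ S[:-j] / T                    # lag-j autocovariance of the scores
            omega += w * (Hj + Hj.T)                     # smoothed, symmetric term
        return omega

    rng = np.random.default_rng(7)
    T, m = 200, 4
    X = np.column_stack([np.ones(T), rng.normal(size=T)])
    e = np.zeros(T)
    for t in range(1, T):                                # assumed AR(1) errors to create autocorrelation
        e[t] = 0.6 * e[t - 1] + rng.normal()
    S = X * e[:, None]                                   # OLS scores S_t = X_t' * e_t
    Omega = newey_west(S, m)

    vals, vecs = np.linalg.eigh(T * Omega)               # variance of the summed scores
    z_g = (vecs / np.sqrt(vals)) @ vecs.T @ S.sum(axis=0)
    print(Omega, z_g)                                    # z_g is approximately N(0, I) at the true coefficients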
Since we have one equation in the p unknowns of β, let us focus on one function f(β) or one element of β,
say β₁. It can be shown that this entails no loss of generality if one can rearrange the model in terms of a
revised set of p parameters. Let us still denote them by β to avoid notational clutter. The trick is to use
the Frisch-Waugh theorem, Greene (1997, p. 247), to focus on one parameter at a time without loss of
generality, as follows. For example, we can rewrite our model after partitioning as y = X₁β₁ + X₂β₂,
focus on the one parameter β₁, and combine all other parameters into the vector β₂. Now construct a vector
y* of residuals from the regression of y on X₂ and also create the regressor column from the residuals of
the regression of X₁ on X₂. The Frisch-Waugh theorem guarantees that this gives exactly the same
regression coefficient as the original model. Since β₁ could be any one of the p elements of β by
rearrangement, there is no loss of generality in this method, and the χ²(p) in (36) becomes χ²(1) with
only one unknown β₁. Since E(χ²(1)) = 1, one tempting possibility is to solve the equation GPF² = 1 for
β₁ some J = 999 times to construct a CI95 for β₁, and eventually for all parameters of β. Unfortunately,
this does not work, and numerical evaluation of the inequality GPF² ≤ F_{p,T−p,1−α} for p = 1 based on
(36) is needed. If there is a small-sample discrepancy between the theoretically correct F values and the
observed density of the GPF²s in resamples, one can use the upper (1−α) quantile from the observed order
statistics instead of F_{p,T−p,1−α}. One can also use the double bootstrap (d-boot) refinement if adequate
computer resources are available.
Practitioners often need to test theoretical propositions that lead to restrictions on β. A
general formulation, including the use of gradients of the restriction functions to deal with nonlinearities, is
available in textbooks, Greene (1997, ch. 7). It is customary to consider m ≤ p linearly independent
restrictions Rβ = q, where the matrix R is of dimension m×p, and to assume spherical disturbances, E(εε′) = I.
Let C(β) = Rβ − q represent the m equations which are zero under the null hypothesis. Now the GPF =
[X′X]^{-0.5}X′ε based on H = I in (33) represents a set of p equations in the p unknown elements of β. At
the numerical level these equations can be viewed as m equations in C(β) and p−m equations in a (linearly
independent) subset of the coefficients in the β vector. Clearly, one can solve them for C(β) and construct
bootstrap confidence intervals for C(β). If the observed CI95 for any row of C(β) contains
zero, we do not reject the null hypothesis represented by that row. For testing nonlinear restrictions, our
method does not need any Taylor series linearizations. One uses the parametric χ² or F distribution or
nonparametric bootstraps in (34) or (36). The LAU functions lau(β) mentioned earlier are a special
case of C(β). If any row of C(β) contains more than one β, it is somewhat difficult (see Remark 3) to
construct a GPF² confidence set for that row of C(β). This is a practical disadvantage of GPF²s.
Proposition 2: Let z_G denote a sum of scaled scores. Assume that a nonsingular estimate of E(z_G z_G′) is
available (possibly after smoothing and truncation), and that we are interested in inference on Dufour's
`locally almost unidentified' functions lau(β). By choosing a GPF as in (34) or (36), which does not
explicitly involve the scalar lau(β) at all, we can construct valid confidence intervals.

Proof: Dufour's (1997) result is that CIs obtained by inverting Wald-type statistics [lau(β) −
lau(β̂)]/SÊ(lau) can have zero coverage probability for lau(β). One problem is that the Wald-type
statistic is not a valid pivotal quantity when the sampling distribution of lau(β̂) depends on unknown
nuisance parameters. Another problem is that the covariance matrix needed for SÊ(lau) can be singular.
The GPF of (33) is asymptotically normal by Proposition 1. Such GPFs are certainly not Wald-type,
since they do not even contain the expression lau(β̂). Assuming a nonsingular E(z_G z_G′) is less stringent than
assuming a nonsingular matrix of variances of lau(β̂) for each choice of lau(β). Let ι denote a p×1 vector
of ones. We obtain a CI95 for any function f(β) of the regression parameters by numerically solving for
f(β) the system of p equations GPF(y, β) = ±1.96(ι) = z_α. For example, if f(β) = lau(β) is a ratio of two
regression coefficients, then its denominator can be zero, and the variance of such an f(β̂) can have obvious
difficulties. Rao (1973, Sec. 4b) discusses an ingenious squaring method for finding the CIs of ratios.
Our approach is more general and simple to program. In the absence of software, Rao's squaring
method is rarely, if ever, implemented. By contrast, our GPF is readily implemented in Vinod and
Samanta (1997) and Vinod (1997) for functions (ratios) of regression coefficients. Given a scalar function
lau(β), we simply replace one of the p parameters in β by lau(β), the parametric function of interest,
and solve GPF = (constant) to construct a valid CI for lau(β). For further improvements we
suggest the bootstrap.
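To make Proposition 2 concrete, the Python sketch below (data and reparameterization are illustrative assumptions) writes a two-regressor model in terms of β₁ and the ratio r = β₂/β₁, and obtains limits for r from numerical roots of the OLS-case GPF of (33) set equal to ±1.96 times a vector of ones, without ever computing a standard error for the ratio.

    import numpy as np
    from scipy.optimize import fsolve

    rng = np.random.default_rng(8)
    T = 120
    x1, x2 = rng.normal(size=T), rng.normal(size=T)
    y = 2.0 * x1 + 1.0 * x2 + rng.normal(size=T)         # true ratio r = beta2/beta1 = 0.5
    X = np.column_stack([x1, x2])
    vals, vecs = np.linalg.eigh(X.T @ X)
    scale = (vecs / np.sqrt(vals)) @ vecs.T               # [X'X]^{-0.5}

    def gpf(params, z):
        b1, r = params                                    # reparameterization: beta2 = r * beta1
        eps = y - b1 * x1 - b1 * r * x2
        return scale @ (X.T @ eps) - z                    # p equations: GPF of (33) minus the constant vector

    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    start = np.array([b_ols[0], b_ols[1] / b_ols[0]])     # assumed starting values from OLS
    roots = [fsolve(gpf, start, args=(z * np.ones(2),)) for z in (-1.96, 1.96)]
    print(sorted(r for _, r in roots))                    # GPF-root limits for the ratio r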
The bootstrap resampling of scaled scores to compute CIs is attractive in light of the limited
simulations in Vinod (1998). To confirm it we would have to consider a wide range of actual and
nominal "test sizes and powers" for a wide range of data sets. The GPF-N(0,I) algorithm is a
parametric bootstrap, and relaxing the parametric assumptions of N(0,I) and H(λ) leads to
nonparametric and double bootstrap (GPF-d-boot) algorithms. McCullough and Vinod (1997) offer
practical details about implementing the d-boot. Although the d-boot imposes a heavy computational
burden and cannot be readily simulated, Letson and McCullough (1998) do report encouraging results
from their d-boot simulations. Also difficult to simulate are the algorithms developed here for the new
GPF²s. Our limited experiments with the new algorithms show that GPF²s are quite feasible,
but have the following drawback: GPF²s cannot be readily solved for arbitrary nonlinear vector
functions of β. Our recent experiments with non-normal, nonparametric, non-spherical error GPF
algorithms indicate that their 95% confidence intervals can be wider than the usual intervals. We have
not attempted Winsorization, since it offers too rich a variety of options, including the choice of the
Winsorized percentage. From Babu's (1997) theory it is clear that greater robustness can be achieved by
Winsorization, and we have shown how shorter bootstrap CIs can arise by removing extreme values.
9. Summary and conclusions
This paper explains the background of parametric asymptotic inference based on the early work
of Fisher, T. W. Anderson and others developed before the era of modern computers. We also discuss
the early idea of focusing on functions as in Anderson's critical functions. The estimating functions
(EFs) were developed by Godambe and Durbin in 1960. The main lesson of EF-theory (in remark 1
above) is that good EFs automatically lead to good EF-estimators (=EF-roots). Similarly, good pivots
(GPFs) which contain all information in the sample, lead to reliable inference. A proposition formally
shows that GPF=D S˜t , a sum of T items, converges to N(0,I) by the central limit theorem. We provide
details on bootstrap shuffling of scaled scores S˜t for statistical inference from the confidence intervals.
For regression coefficients " when H(9) is the matrix of error variances, depending on the particular
application at hand, we suggest explicit bootstrap algorithms for five robust choices of H.
Although GPF appears in Godambe (1985), the use of its numerical roots is first proposed in
Vinod (1998). He claims that the GPFs fill a long-standing need of the bootstrap literature for robust
pivots and enable robust statistical inference in many general situations. We support the claim by using
Cox's simple example studied by Efron, Hinkley and Royall. For it, we derive Fisher's highly structured
pivot zF , its modification zEH by Efron and Hinkley (1978), and a further modification zR by Royall
(1986) to inject robustness. We explain why the GPF for Cox's example is more robust than others.
We also simulate all these pivots for Cox's example. A parametric GPF-N(0,I) bootstrap
algorithm uses about a thousand standard normal deviates to simulate the sampling distribution. A
nonparametric (single) bootstrap algorithm uses the empirical distribution function for robustness. For
Cox's univariate example, the simulation shows that GPFs yield short and robust CIs, without having to
use the d-boot. The width of the traditional interval is 4.53, whereas the width of the GPF intervals is
only about 0.53. This major reduction in confidence interval (CI) width is predicted by the asymptotic
property (iii) given by Godambe and Heyde (1987). Thus, we have demonstrated that for univariate
problems our bootstrap methods based on GPFs offer superior statistical inference.
This paper discusses new GPF formulas for some exponential family members, including the
Poisson mean, the Binomial probability, and the Normal standard deviation. It also derives, from a
quadratic form of the GPF, an asymptotically χ² second type of pivot (GPF²) for regressions. The five
robust choices of H(θ) mentioned above are shown to be available for GPF²s also. We discuss inference
problems for some ill-behaved functions, where traditional CIs can have zero coverage probability. Our
solution to this problem, stated as proposition 2, is to numerically solve an equation involving the GPF.
Since Dufour (1997) shows that such ill-behaved functions are ubiquitous in applications (e.g.,
computation of the long-run multiplier), our solution is of considerable practical interest in econometrics
and other fields where regressions are used to test nonlinear theoretical propositions, especially those
involving ratios of random variables.
Heyde (1997) and Davison and Hinkley (1997) offer formal proofs showing that GPFs and d-boots,
respectively, are powerful tools for the construction of short and robust CIs. By defining the tail
areas as rejection regions, our CIs can obviously be used for significance testing. The CIs from
GPF-roots can serve as a foundation for further research on asymptotic inference in an era of powerful
computing. For example, numerical pivots may help extend the well-developed EF-theory for nuisance
parameters (Liang and Zeger, 1995). The potential of EFs and GPFs for semiparametric and
semimartingale models with nuisance parameters is indicated by Heyde (1997). Of course, these ideas
need further development, and we need greater practical experience with many more examples. We have
shown that our proposal can potentially simplify, robustify and improve the asymptotic inference
methods currently used in statistics and econometrics.
ACKNOWLEDGMENTS
I thank Professor Godambe of the University of Waterloo for important suggestions. A version
of this paper was circulated as a Fordham University Economics Department Discussion Paper dated
June 12, 1996. It was revised in 1998 to incorporate new references.
REFERENCES
Anderson, T. W., 1958, Introduction to Multivariate Statistical Analysis. New York: J. Wiley
Andrews, D. W. K. and J. C. Monahan, 1992, An improved heteroscedasticity and autocorrelation
consistent covariance matrix estimator. Econometrica 60(4), 953-966.
Babu, G. J., 1997, Breakdown theory for estimators based on bootstrap and other resampling
schemes, Dept. of Statistics, Penn. State University, University Park, PA 16802.
Cox, D. R., 1975, Partial likelihood. Biometrika 62, 269-276.
Davidson, R. and J. G. MacKinnon, 1993, Estimation and inference in econometrics.
New York, Oxford Univ. Press.
Davison, A. C. and D. V. Hinkley, 1997, Bootstrap methods and their application. New
York, Cambridge Univ. Press.
Donaldson, J. R. and R. B. Schnabel, 1987, Computational experience with confidence regions
and confidence intervals for nonlinear least squares. Technometrics 29, 67-82.
Dufour, Jean-Marie, 1997. Some impossibility theorems in econometrics with
applications to structural and dynamic models. Econometrica 65, 1365-1387.
Dunlop, D. D., 1994, Regression for longitudinal data: A bridge from least squares
regression. American Statistician 48, 299-303.
Durbin, J., 1960, Estimation of parameters in time-series regression models. Journal of
the Royal Statistical Society, Ser. B, 22, 139-153.
Efron, B. and D. V. Hinkley, 1978, Assessing the accuracy of maximum likelihood
estimation: Observed versus expected information. Biometrika 65, 457-482.
Godambe, V. P., 1985, The foundations of finite sample estimation in stochastic
processes. Biometrika 72, 419-428.
Godambe, V. P., 1991, Orthogonality of estimating functions and nuisance parameters.
Biometrika 78, 143-151.
Godambe, V. P. and C.C. Heyde, 1987, Quasilikelihood and optimal estimation.
International Statistical Review 55, 231-244.
Godambe, V. P. and B. K. Kale, 1991, Estimating functions: an overview, Ch. 1 in V. P.
Godambe (ed.) Estimating functions, (Oxford: Clarendon Press).
Gourieroux, C., A. Monfort and A. Trognon, 1984, Pseudo maximum likelihood methods:
theory. Econometrica 52, 681-700.
Greene, W. H., 1997, Econometric Analysis. 3rd. ed. New York: Prentice Hall.
Hall, Peter, 1992, The bootstrap and Edgeworth expansion. New York: Springer Verlag.
Heyde, C. C., 1997, Quasi-Likelihood and Its Applications. New York: Springer Verlag.
Hu, Feifang and J. D. Kalbfleisch, 1997, Estimating equations and the bootstrap. in
Basawa, Godambe and Taylor (eds.) Selected proceedings of the symposium on
estimating equations. IMS Lecture Notes-Monographs Series, Vol. 32 Hayward,
California: IMS, 405-416.
Kendall, M. and A. Stuart, 1979, The advanced theory of statistics,
New York: Macmillan, Vol. 2, Fourth Edition.
Letson, D. and B. D. McCullough, 1998, Better confidence intervals: The double
bootstrap with no pivot. American Journal of Agricultural Economics (forthcoming)
Liang, K. and S. L. Zeger, 1995, Inference based on estimating functions in the presence
of nuisance parameters. Statistical Science 10, 158-173.
McLeish, D. L., 1974, Dependent central limit theorems and invariance principles, Annals
of Probability. 2(4), 620-628.
McCullough, B. D. and H. D. Vinod, 1997, Implementing the double bootstrap.
Computational Economics 10, 1-17.
Newey, W. K. and K. D. West, 1987, A simple positive semi-definite, heteroscedasticity
and autocorrelation consistent covariance matrix. Econometrica 55(3), 703-708.
Rao, C. R., 1973, Linear statistical inference and its applications. New York: J. Wiley.
Royall, R. M., 1986, Model robust confidence intervals using maximum likelihood
estimators. International Statistical Review 54 (2), 221-226.
Vinod, H. D., 1984, Distribution of a generalized t ratio for biased estimators.
Economics Letters 14, 43-52.
Vinod, H. D., 1985, Exact maximum likelihood regression estimation with ARMA(n,n-1) errors.
Economics Letters 17, 355-358.
Vinod, H. D., 1993, Bootstrap Methods: Applications in Econometrics. in G. S. Maddala,
C. R. Rao, and H. D. Vinod (eds.) Handbook of Statistics: Econometrics, Vol. 11,
New York: North Holland, Elsevier, Chapter 23, 629-661.
Vinod H. D., 1995, Double bootstrap for shrinkage estimators. Journal of Econometrics,
68(2), 287-302.
Vinod, H.D., 1996, Comments on bootstrapping time series data. Econometric Reviews,
15(2), 183-190.
Vinod, H.D., 1997, Concave consumption, Euler equation and inference using estimating
functions. 1997 Proceedings of the Business and Economic Statistics Section of the
American Statistical Association (to appear).
Vinod, H.D., 1997b, Using Godambe-Durbin estimating functions in econometrics. in
Basawa, Godambe and Taylor (eds.) Selected proceedings of the symposium on
estimating equations. IMS Lecture Notes-Monographs Series, Vol. 32 Hayward,
California: IMS, 215-238.
Vinod, H.D., 1998, Foundations of statistical inference based on numerical roots of robust
pivot functions (Fellow's Corner). Journal of Econometrics 86, 387-396.
Vinod, H. D. and R. R. Geddes, 1998, Generalized Estimating Equations for Panel Data and
Managerial Monitoring in Electric Utilities. Presented at the International Indian
Statistical Association conference, McMaster University, Hamilton, Ontario, Canada, Oct. 10-11.
Vinod, H. D. and P. Samanta, 1997, Forecasting exchange rate dynamics using GMM,
estimating functions and numerical conditional variance methods, Presented at the
17-th annual international symposium on forecasting, June 1997 in Barbados.
Wedderburn, R. W. M., 1974, Quasi-likelihood functions, generalized linear models and the
Gauss-Newton method. Biometrika 61, 439-447.
Figure 1: Frequency distribution of double bootstrap counts (Cj/100) for β1 (elasticity of capital).
Panel A (upper): Cauchy errors; Panel B (lower): Normal errors (in simulating Y values).
Figure 2: Frequency distribution of double bootstrap counts (Cj/100) for β2 (elasticity of labor).
Panel A (upper): Cauchy errors; Panel B (lower): Normal errors (in simulating Y values).
Figure 3: Frequency distribution of double bootstrap counts (Cj/100) for β3 (technology elasticity).
Panel A (upper): Cauchy errors; Panel B (lower): Normal errors (in simulating Y values).
Recall the GLS estimating equation
$$ g_{gls} = X'H^{-1}X\beta - X'H^{-1}y = \sum_{t=1}^{T} X_t' H^{-1}(y_t - X_t\beta) = \sum_{t=1}^{T} S_t = 0. $$
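As a quick numerical illustration (ours, not the paper's), the GLS estimator is the root of this estimating equation; the sketch below verifies this for simulated data with a known diagonal H.

```python
import numpy as np

# Small numerical check (ours) that the GLS estimator is the root of the
# estimating equation g_gls(beta) = X'H^{-1}X beta - X'H^{-1}y = 0 above.

rng = np.random.default_rng(2)
T = 60
X = np.column_stack([np.ones(T), rng.normal(size=T)])
h = 1.0 + rng.uniform(size=T)                    # known diagonal error variances
y = X @ np.array([0.5, 1.5]) + rng.normal(size=T) * np.sqrt(h)

Hinv = np.diag(1.0 / h)
beta_gls = np.linalg.solve(X.T @ Hinv @ X, X.T @ Hinv @ y)
g = X.T @ Hinv @ (X @ beta_gls) - X.T @ Hinv @ y
print(np.allclose(g, 0.0))                       # estimating equation vanishes at beta_gls
```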
A p-variate GPF similar to (22) is $[X'H^{-1}E(\varepsilon\varepsilon')H^{-1}X]^{-0.5}\,X'H^{-1}\varepsilon$, where $E(\varepsilon\varepsilon') = H$ is assumed to
be known. Upon substitution, this expression becomes $[X'H^{-1}X]^{-0.5}\,X'H^{-1}\varepsilon$. For OLS, H = I and the
expression simplifies further to GPF $= [X'X]^{-0.5}\,X'\varepsilon$.
If H is unknown, but we do know that it is a diagonal matrix (heteroscedasticity), we propose
the following scaled sum of scores as the GPF:
$$ z_G = [X'\,\mathrm{diag}(\varepsilon_t^2)\,X]^{-0.5}\, g^* \sim N(0, I_p). \qquad (36) $$
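As a rough sketch (ours), and taking g* to be the OLS score sum X'ε (an assumption, since g* is defined elsewhere in the paper), the OLS pivot and the heteroscedasticity-scaled pivot of (36) can be computed as follows.

```python
import numpy as np

# Sketch only: GPF = [X'X]^{-0.5} X'e and the heteroscedastic z_G of (36),
# with g* taken to be X'e.  The helper inv_sqrt and the simulated data are ours.

def inv_sqrt(A):
    """Symmetric inverse square root via an eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

def gpf_ols(X, e):
    return inv_sqrt(X.T @ X) @ (X.T @ e)                 # approx N(0, I_p) at true beta

def z_g(X, e):
    B = X.T @ (e[:, None] ** 2 * X)                      # X' diag(e_t^2) X
    return inv_sqrt(B) @ (X.T @ e)

rng = np.random.default_rng(1)
T = 100
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(size=T) * (1 + np.abs(X[:, 1]))   # heteroscedastic errors
e = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]            # OLS residuals
print(gpf_ols(X, e), z_g(X, e))
```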
Let X_t denote the t-th row of X (the data on all regressors at time t). Clearly the score $S_t = X_t'\varepsilon_t$
is a p×1 vector, and let $\Omega = X'E(\varepsilon\varepsilon')X$ denote the covariance matrix of the scores. By definition,
the GPF is a sum of scaled scores, where the scale factor is
$$ \Omega^{-0.5} = \Big\{ (1/T) \sum_{t=1}^{T} \sum_{s=1}^{T} S_t S_s' \Big\}^{-0.5}. $$
Unfortunately, $\varepsilon\varepsilon'$ without the expectation operator is, in general, a singular matrix.
Upon smoothing and truncation (yielding $\tilde{\Omega}$), we propose
$$ GPF(y, \beta) = \tilde{\Omega}^{-0.5}\, X'\varepsilon. \qquad (39) $$
In the notation of Heyde's (1997, p. 61) book, GPF = Q ~ Normal, so that the quadratic form Q'Q is
asymptotically χ² with p degrees of freedom, and the confidence region $Q'Q \le \chi^2(p, \alpha)$ becomes
$$ \varepsilon' X\, \tilde{\Omega}^{-1} X'\varepsilon \;\le\; \chi^2(p, \alpha). $$
In the heteroscedastic case, we have
$$ \varepsilon' X \,[X'\,\mathrm{diag}(\varepsilon_t^2)\,X]^{-1}\, X'\varepsilon \;\sim\; \chi^2(p, \alpha). $$
Now we construct a sum of T individual components of the quantity on the left side, so that we can
shuffle them with replacement for a bootstrap. To this end, use the singular value decomposition
$X = H\Lambda^{0.5}G'$ and $X' = G\Lambda^{0.5}H'$. The left side then becomes
$$ \varepsilon' H\Lambda^{0.5}G' \,[\, G\Lambda^{0.5}H'\,\mathrm{diag}(\varepsilon_t^2)\,H\Lambda^{0.5}G' \,]^{-1}\, G\Lambda^{0.5}H'\varepsilon $$
$$ = \varepsilon' H\Lambda^{0.5}G' \,[\, G\Lambda^{-0.5}H'\,\mathrm{diag}(\varepsilon_t^{-2})\,H\Lambda^{-0.5}G' \,]\, G\Lambda^{0.5}H'\varepsilon $$
$$ = \varepsilon' H\Lambda^{0.5}\Lambda^{-0.5}H'\,\mathrm{diag}(\varepsilon_t^{-2})\,H\Lambda^{-0.5}\Lambda^{0.5}H'\varepsilon, \quad \text{using } G'G = I, $$
$$ = \varepsilon' HH'\,\mathrm{diag}(\varepsilon_t^{-2})\,HH'\varepsilon, \quad \text{using } \Lambda^{-0.5}\Lambda^{0.5} = I. $$
Note that HH' is of dimension T×T and can be calculated once and for all. Denote the columns of HH' by
$h_1, h_2, \ldots, h_T$, and let $\varepsilon' h_t = f_t$, a scalar. The expression can then be written as
$$ \sum_{t=1}^{T} f_t^2 / \varepsilon_t^2 . $$
Now replace $\varepsilon_t$ with the t-th residual, compute the initial T components of this sum, and shuffle
them with replacement J times for a bootstrap. Each shuffle creates one realization of the χ² random variable.
Now find the largest (and smallest) value of each regression coefficient such that this sum of squares is
no larger than the 95th percentile of the J bootstrap sums of $\sum_{t=1}^{T} f_t^2 / \varepsilon_t^2$. That is, use a
constrained optimization algorithm to find the maximum and minimum of each element of β subject to
$$ \sum_{t=1}^{T} f_t^2 / \varepsilon_t^2 \;\le\; \text{(upper 95th percentile of the J bootstrap replicates)}. $$
The maximum β is then the upper limit of the CI95 and the minimum is the lower limit.
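The following Python sketch (ours) puts these steps together under one reading of the algorithm: the components f_t²/ε_t² are re-evaluated at each candidate β (a choice we make here), the helper names are hypothetical, and a general-purpose constrained optimizer stands in for whatever routine one prefers.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch of the GPF-squared bootstrap CI: the SVD of X gives HH', the T
# components f_t^2/e_t^2 are shuffled with replacement J times, and each
# coefficient is pushed up and down subject to the chi-square-type constraint.

def components(beta, y, X, HHt):
    e = y - X @ beta
    f = HHt @ e                                   # f_t = e'h_t, columns h_t of HH'
    return f ** 2 / e ** 2

def gpf2_ci(y, X, J=999, seed=0):
    rng = np.random.default_rng(seed)
    T, p = X.shape
    H = np.linalg.svd(X, full_matrices=False)[0]  # left singular vectors of X
    HHt = H @ H.T                                 # computed once and for all
    b_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    c = components(b_hat, y, X, HHt)              # initial T components at the residuals
    sums = [rng.choice(c, size=T, replace=True).sum() for _ in range(J)]
    cut = np.quantile(sums, 0.95)                 # bootstrap 95th percentile
    con = {"type": "ineq",
           "fun": lambda b: cut - components(b, y, X, HHt).sum()}
    ci = np.empty((p, 2))
    for j in range(p):
        lo = minimize(lambda b: b[j], b_hat, constraints=[con], method="SLSQP")
        hi = minimize(lambda b: -b[j], b_hat, constraints=[con], method="SLSQP")
        ci[j] = (lo.x[j], hi.x[j])
    return ci

# Usage with simulated data (illustration only):
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(80), rng.normal(size=(80, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=80) * (1 + np.abs(X[:, 1]))
print(gpf2_ci(y, X))
```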
In the OLS regression with spherical errors, a 100(1−α)% confidence region for β contains
those values of β which satisfy the following inequality (Donaldson and Schnabel, 1987):
$$ (y - X\beta)'(y - X\beta) - (y - X\hat{\beta}_{ols})'(y - X\hat{\beta}_{ols}) \;\le\; s^2\, p\, F_{p,\,T-p,\,1-\alpha}, \qquad (37) $$
where $s^2 = (y - X\hat{\beta}_{ols})'(y - X\hat{\beta}_{ols})/(T - p)$. It is equivalently written as
$$ (\hat{\beta}_{ols} - \beta)'\, X'X\, (\hat{\beta}_{ols} - \beta) \;\le\; s^2\, p\, F_{p,\,T-p,\,1-\alpha}, \qquad (38) $$
which shows that the shape will be ellipsoidal. If the Frisch-Waugh theorem is used to convert the
regression problem into p separate univariate problems, use of the bootstrap seems to be a good way to
obtain robustness.
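For reference, a few lines (ours, not from the paper) suffice to check whether a candidate β lies inside the F-based region (38).

```python
import numpy as np
from scipy.stats import f

# Membership check (ours) for the classical joint confidence region (38).
def in_f_region(beta, y, X, alpha=0.05):
    T, p = X.shape
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b_ols
    s2 = resid @ resid / (T - p)
    d = b_ols - np.asarray(beta)
    return d @ (X.T @ X) @ d <= s2 * p * f.ppf(1 - alpha, p, T - p)
```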
For multivariate problems (e.g., regressions), Vinod (1998) suggests two algorithms: the GPF-N(0,I)
and GPF-d-boot algorithms. In some implementations they can involve thousands of numerically
computed GPF-roots, which are rank-ordered to compute CIs. The computational burden is manageable
for many typical econometric problems, and it is expected to lighten further over time.
Although the simulations in Vinod (1998) are incomplete, they do have a sound theoretical basis in
property (iii) given by Godambe and Heyde (1987).
Often, estimating functions (EFs) are QSFs, and EF estimators are the quasi-maximum likelihood
(QML) estimators viewed as roots of QSFs.