Fixed-smoothing Asymptotics in a Two-step GMM Framework

Yixiao Sun
Department of Economics,
University of California, San Diego
July 30, 2014, First version: December 2011
Abstract
The paper develops the fixed-smoothing asymptotics in a two-step GMM framework. Under this type of asymptotics, the weighting matrix in the second-step GMM criterion function converges weakly to a random matrix and the two-step GMM estimator is asymptotically mixed normal. Nevertheless, the Wald statistic, the GMM criterion function statistic and the LM statistic remain asymptotically pivotal. It is shown that critical values from the fixed-smoothing asymptotic distribution are high order correct under the conventional increasing-smoothing asymptotics. When an orthonormal series covariance estimator is used, the critical values can be approximated very well by the quantiles of a noncentral F distribution. A simulation study shows that the new statistical tests based on the fixed-smoothing critical values are much more accurate in size than the conventional chi-square test.
JEL Classification: C12, C32
Keywords: F distribution, Fixed-smoothing Asymptotics, Heteroskedasticity and Autocorrelation Robust, Increasing-smoothing Asymptotics, Noncentral F Test, Two-step GMM Estimation
1 Introduction
Recent research on heteroskedasticity and autocorrelation robust (HAR) inference has focused on developing distributional approximations that are more accurate than the conventional chi-square approximation or the normal approximation. To a great extent and from a broad perspective, this development is in line with many other areas of research in econometrics where more accurate distributional approximations are the focus of interest. A common theme for coming up with a new approximation is to embed the finite sample situation in a different limiting thought experiment. In the case of HAR inference, the conventional limiting thought experiment assumes that the amount of smoothing increases with the sample size but at a slower rate. The new limiting thought experiment assumes that the amount of smoothing is held fixed as the sample size increases. This leads to two types of asymptotics: the conventional increasing-smoothing asymptotics and the more recent fixed-smoothing asymptotics. Sun (2014) coins these two inclusive terms so that they are applicable to different HAR variance estimators, including both kernel HAR variance estimators and orthonormal series (OS) HAR variance estimators.

Email: yisun@ucsd.edu. For helpful comments, the author thanks Jungbin Hwang, David Kaplan, Min Seong Kim, Elie Tamer, the coeditor, and anonymous referees. The author gratefully acknowledges partial research support from NSF under Grant No. SES-0752443. Correspondence to: Department of Economics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0508.
There is a large and growing literature on the fixed-smoothing asymptotics. For kernel HAR variance estimators, the fixed-smoothing asymptotics is the so-called fixed-b asymptotics first studied by Kiefer and Vogelsang (2002a, 2002b, 2005, KV hereafter) in the econometrics literature. For other studies, see, for example, Jansson (2004), Sun, Phillips and Jin (2008), Sun and Phillips (2009), and Gonçalves and Vogelsang (2011) in the time series setting; Bester, Conley, Hansen and Vogelsang (2011, BCHV hereafter) and Sun and Kim (2014) in the spatial setting; and Gonçalves (2011), Kim and Sun (2013), and Vogelsang (2012) in the panel data setting. For OS HAR variance estimators, the fixed-smoothing asymptotics is the so-called fixed-K asymptotics. For its theoretical development and related simulation evidence, see, for example, Phillips (2005), Müller (2007), and Sun (2011, 2013). The approximation approaches in some other papers can also be regarded as special cases of the fixed-smoothing asymptotics. These include, among others, Ibragimov and Müller (2010), Shao (2010) and Bester, Hansen and Conley (2011).
All of the recent developments on the fixed-smoothing asymptotics have been devoted to first-step GMM estimation and inference. In this paper, we establish the fixed-smoothing asymptotics in a general two-step GMM framework. For two-step estimation and inference, the HAR variance estimator not only appears in the covariance estimator but also plays the role of the optimal weighting matrix in the second-step GMM criterion function. Under the fixed-smoothing asymptotics, the weighting matrix converges to a random matrix. As a result, the second-step GMM estimator is not asymptotically normal but rather asymptotically mixed normal. On the one hand, the asymptotic mixed normality captures the estimation uncertainty of the GMM weighting matrix and is expected to be closer to the finite sample distribution of the second-step GMM estimator. On the other hand, the lack of asymptotic normality poses a challenge for pivotal inference. It is far from obvious that the Wald statistic is still asymptotically pivotal under the new asymptotics. To confront this challenge, we have to judiciously rotate and transform the asymptotic distribution and show that it is equivalent to a distribution that is free of nuisance parameters.
The fixed-smoothing asymptotic distribution not only depends on the kernel function or basis functions and the smoothing parameter, which is the same as in the one-step GMM framework, but also depends on the degree of over-identification, which is different from existing results. In general, the degree of over-identification or the number of moment conditions remains in the asymptotic distribution only under the so-called many-instruments or many-moments asymptotics; see for example Han and Phillips (2006) and references therein. Here the number of moment conditions, and hence the degree of over-identification, is finite and fixed. Intuitively, the degree of over-identification remains in the asymptotic distribution because it is indicative of the dimension of the limiting random weighting matrix.
In the case of OS HAR variance estimation, the fixed-smoothing asymptotic distribution is a mixed noncentral F distribution, that is, a noncentral F distribution with a random noncentrality parameter. This is an intriguing result, as a noncentral F distribution is not expected under the null hypothesis. It is reassuring that the random noncentrality parameter becomes degenerate as the amount of smoothing increases. Replacing the random noncentrality parameter by its mean, we show that the mixed noncentral F distribution can be approximated extremely well by a noncentral F distribution. The noncentral F distribution is implemented in standard programming languages and packages such as the R language, MATLAB, Mathematica and STATA, so critical values are readily available. No computation-intensive simulation is needed. This can be regarded as an advantage of using the OS HAR variance estimator.
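As an illustration of how readily such critical values are available, the following Python sketch retrieves a noncentral F critical value via `scipy.stats.ncf`. The degrees of freedom and the noncentrality value are hypothetical placeholders, not values prescribed by the paper:

```python
from scipy.stats import ncf, f

# Hypothetical example: df1 and df2 are the numerator and denominator
# degrees of freedom; delta2 is a placeholder noncentrality parameter.
df1, df2 = 2, 10
delta2 = 0.5

# 95% critical value of the noncentral F distribution with noncentrality delta2
cv_ncf = ncf.ppf(0.95, df1, df2, delta2)

# For comparison: the central F critical value (noncentrality zero)
cv_f = f.ppf(0.95, df1, df2)

# A positive noncentrality parameter shifts the quantiles upward
assert cv_ncf > cv_f
```

No simulation is involved; the quantile is a single library call.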
In the case of kernel HAR variance estimation, the fixed-smoothing asymptotic distribution is a mixed chi-square distribution, that is, a chi-square distribution scaled by an independent and positive random variable. This resembles an F distribution except that the random denominator has a nonstandard distribution. Nevertheless, we are able to show that critical values from the fixed-smoothing asymptotic distribution are high order correct under the conventional increasing-smoothing asymptotics. This result is established on the basis of two distributional expansions. The first is the expansion of the fixed-smoothing asymptotic distribution as the amount of smoothing increases, and the second is the high order Edgeworth expansion established in Sun and Phillips (2009). We arrive at the two expansions via completely different routes, yet in the end some of the terms in the two expansions are exactly the same.
Our framework for establishing the fixed-smoothing asymptotics is general enough to accommodate both the kernel HAR variance estimators and the OS HAR variance estimators. The fixed-smoothing asymptotics is established under weaker conditions than those typically assumed in the literature. More specifically, instead of maintaining a functional CLT assumption, we make use of a regular CLT, which is weaker than an FCLT. Our method of proof is also novel. It applies directly to both smooth and nonsmooth kernel functions. There is no need to give a separate treatment to nonsmooth kernels such as the Bartlett kernel. The unified method of proof leads to a unified representation of the fixed-smoothing asymptotic distribution.
The fixed-smoothing asymptotics is established for three commonly used test statistics in the GMM framework: the Wald statistic, the GMM criterion function statistic, and the score type statistic or the LM statistic. As in the conventional increasing-smoothing asymptotic framework, we show that these three test statistics are asymptotically equivalent and converge to the same limiting distribution.
In the Monte Carlo simulations, we examine the accuracy of the fixed-smoothing approximation to the distribution of the Wald statistic. We find that the tests based on the new fixed-smoothing approximation have much more accurate size than the conventional chi-square tests. This is especially true when the degree of over-identification is large. When the model is over-identified, the fixed-smoothing approximation that accounts for the randomness of the GMM weighting matrix is also more accurate than the fixed-smoothing approximation that ignores that randomness. When the OS HAR variance estimator is used, the convenient noncentral F test has almost identical size properties to the nonstandard test whose critical values have to be simulated.
The rest of the paper is organized as follows. Section 2 describes the estimation and testing problems at hand. Section 3 establishes the fixed-smoothing asymptotics for the variance estimators and the associated test statistics. Section 4 gives different representations of the fixed-smoothing asymptotic distribution. In the case of OS HAR variance estimation, a noncentral F distribution is shown to be a very accurate approximation to the nonstandard fixed-smoothing asymptotic distribution. In the case of kernel HAR variance estimation, the fixed-smoothing approximation is shown to provide a high order refinement over the chi-square approximation. Section 5 studies the fixed-smoothing asymptotic approximation under local alternatives. The next section reports simulation evidence and applies our method to GMM estimation and inference in a stochastic volatility model. The last section provides some concluding discussion. Technical lemmas and proofs of the main results are given in the Appendix.
A word on notation: we use $F_{p,K-p-q+1}(\delta^2)$ to denote a random variable that follows the noncentral F distribution with degrees of freedom $(p, K-p-q+1)$ and noncentrality parameter $\delta^2$. We use $F_{p,K-p-q+1}(z; \delta^2)$ to denote the CDF of the noncentral F distribution. For a positive semidefinite symmetric matrix $A$, $A^{1/2}$ is the symmetric matrix square root of $A$ when it is not defined explicitly.
2 Two-step GMM Estimation and Testing
We are interested in a $d \times 1$ vector of parameters $\theta \in \Theta \subset \mathbb{R}^d$. Let $v_t \in \mathbb{R}^{d_v}$ denote a vector of observations at time $t$. Let $\theta_0$ denote the true parameter value and assume that $\theta_0$ is an interior point of $\Theta$. The moment conditions
$$E f(v_t, \theta) = 0, \quad t = 1, 2, \ldots, T \tag{1}$$
hold if and only if $\theta = \theta_0$, where $f(v_t, \theta)$ is an $m \times 1$ vector of continuously differentiable functions. The process $f(v_t, \theta_0)$ may exhibit autocorrelation of unknown forms. We assume that $m \geq d$ and that the rank of $E\,\partial f(v_t, \theta_0)/\partial\theta'$ is equal to $d$. That is, we consider a model that is possibly over-identified with the degree of over-identification $q = m - d$.
Define
$$g_t(\theta) = \frac{1}{T} \sum_{j=1}^{t} f(v_j, \theta);$$
then the GMM estimator of $\theta_0$ is given by
$$\hat\theta_{GMM} = \arg\min_{\theta \in \Theta} g_T(\theta)' W_T^{-1} g_T(\theta),$$
where $W_T$ is a positive definite weighting matrix.
To obtain an initial first-step estimator, we often choose a simple weighting matrix $W_o$ that does not depend on model parameters, leading to
$$\tilde\theta_T = \arg\min_{\theta \in \Theta} g_T(\theta)' W_o^{-1} g_T(\theta).$$
As an example, we may set $W_o = I_m$ in the general GMM setting. In the IV regression, we may set $W_o = Z'Z/T$, where $Z$ is the data matrix for the instruments. We assume that
$$W_o \overset{p}{\to} W_{o,\infty},$$
a matrix that is positive definite almost surely.
According to Hansen (1982), the optimal weighting matrix $W_T$ is the asymptotic variance matrix of $\sqrt{T} g_T(\theta_0)$. On the basis of the first-step estimate $\tilde\theta_T$, we can use $\tilde u_t := f(v_t, \tilde\theta_T)$ to estimate the asymptotic variance matrix. Many nonparametric estimators of the variance matrix are available in the literature. In this paper, we consider a class of quadratic variance estimators, which includes the conventional kernel variance estimators of Andrews (1991), Newey and West (1987), and Politis (2011), the sharp and steep kernel variance estimators of Phillips, Sun and Jin (2006, 2007), and the orthonormal series variance estimators of Phillips (2005), Müller (2007), and Sun (2011, 2013) as special cases. Following Phillips, Sun and Jin (2006, 2007), we refer to the conventional kernel estimators as contracted kernel estimators and the sharp and steep kernel estimators as exponentiated kernel estimators.
For a given value of $\theta$, the quadratic HAR variance estimator takes the form:
$$W_T(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{s=1}^{T} Q_h\!\left(\frac{t}{T}, \frac{s}{T}\right) \left[ f(v_t, \theta) - \frac{1}{T} \sum_{\tau=1}^{T} f(v_\tau, \theta) \right] \left[ f(v_s, \theta) - \frac{1}{T} \sum_{\tau=1}^{T} f(v_\tau, \theta) \right]'.$$
The feasible HAR variance estimator based on $\tilde\theta_T$ is then given by
$$W_T(\tilde\theta_T) = \frac{1}{T} \sum_{t=1}^{T} \sum_{s=1}^{T} Q_h\!\left(\frac{t}{T}, \frac{s}{T}\right) \left( \tilde u_t - \frac{1}{T} \sum_{\tau=1}^{T} \tilde u_\tau \right) \left( \tilde u_s - \frac{1}{T} \sum_{\tau=1}^{T} \tilde u_\tau \right)'. \tag{2}$$
In the above definitions, $Q_h(r, s)$ is a symmetric weighting function that depends on the smoothing parameter $h$. For conventional kernel estimators, $Q_h(r, s) = k\left((r - s)/b\right)$ and we take $h = 1/b$. For the sharp kernel estimator, $Q_h(r, s) = (1 - |r - s|)^\rho\, 1\{|r - s| < 1\}$ and we take $h = \sqrt{\rho}$. For steep quadratic kernel estimators, $Q_h(r, s) = k^\rho(r - s)$ and we take $h = \sqrt{\rho}$. For the OS estimators, $Q_h(r, s) = K^{-1} \sum_{j=1}^{K} \Phi_j(r) \Phi_j(s)$ and we take $h = K$, where $\{\Phi_j(r)\}$ are orthonormal basis functions in $L^2[0, 1]$ satisfying $\int_0^1 \Phi_j(r)\, dr = 0$. A prewhitened version of the above estimators can also be used. See, for example, Andrews and Monahan (1992) and Xiao and Linton (2002).
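To make these weighting functions concrete, the following Python sketch evaluates $Q_h$ for a contracted Bartlett kernel and for an OS estimator built from the Fourier basis. The Fourier basis is one common choice satisfying the orthonormality and zero-integral requirements; the specific kernel and basis here are illustrative, not prescribed by the paper:

```python
import numpy as np

def bartlett(x):
    """Bartlett kernel k(x) = (1 - |x|) 1{|x| <= 1}."""
    return np.maximum(1.0 - np.abs(x), 0.0)

def Qh_contracted(r, s, b):
    """Contracted-kernel weight Q_h(r, s) = k((r - s)/b), with h = 1/b."""
    return bartlett((r - s) / b)

def Qh_os(r, s, K):
    """OS weight Q_h(r, s) = K^{-1} sum_{j=1}^K Phi_j(r) Phi_j(s), with h = K,
    using the Fourier basis Phi_{2j-1}(r) = sqrt(2) cos(2 pi j r),
    Phi_{2j}(r) = sqrt(2) sin(2 pi j r) (K even)."""
    total = 0.0
    for j in range(1, K // 2 + 1):
        total += 2.0 * np.cos(2 * np.pi * j * r) * np.cos(2 * np.pi * j * s)
        total += 2.0 * np.sin(2 * np.pi * j * r) * np.sin(2 * np.pi * j * s)
    return total / K

# Each Fourier basis function integrates to zero over [0, 1], as required
# of the OS basis; check with a midpoint Riemann sum.
grid = (np.arange(2000) + 0.5) / 2000
phi1 = np.sqrt(2) * np.cos(2 * np.pi * grid)
assert abs(phi1.mean()) < 1e-8            # zero integral
assert abs((phi1 * phi1).mean() - 1.0) < 1e-6  # unit L2 norm
```

Note that $Q_h(r, s)$ is symmetric in $(r, s)$ in both cases, as the definition requires.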
It is important to point out that the HAR variance estimator in (2) is based on the demeaned moment process $\tilde u_t - T^{-1} \sum_{\tau=1}^{T} \tilde u_\tau$. For over-identified GMM models, the sample average $T^{-1} \sum_{\tau=1}^{T} \tilde u_\tau$ is in general different from zero. Under the conventional asymptotic theory, demeaning is necessary to ensure the consistency of the HAR variance estimator for misspecified models. Under the new asymptotic theory, demeaning is necessary to ensure that the associated test statistics are asymptotically pivotal. Hall (2000, 2005: Sec 4.3) provides more discussion of the benefits of demeaning, including power improvement for over-identification tests. See also Sun and Kim (2012).
With the variance estimator $W_T(\tilde\theta_T)$, the two-step GMM estimator is:
$$\hat\theta_T = \arg\min_{\theta \in \Theta} g_T(\theta)' W_T^{-1}(\tilde\theta_T) g_T(\theta).$$
Suppose we want to test the linear null hypothesis $H_0: R\theta_0 = r$ against the alternative $H_1: R\theta_0 \neq r$, where $R$ is a $p \times d$ matrix with full row rank. Nonlinear restrictions can be converted to linear ones by the delta method. We consider three types of test statistics. The first type is the conventional Wald statistic. The normalized Wald statistic is
$$\mathcal{W}_T := \mathcal{W}_T(\hat\theta_T) = T (R\hat\theta_T - r)' \left[ R \left( G_T(\hat\theta_T)' W_T^{-1}(\hat\theta_T) G_T(\hat\theta_T) \right)^{-1} R' \right]^{-1} (R\hat\theta_T - r)/p,$$
where $G_T(\theta) = \partial g_T(\theta)/\partial\theta'$.
When $p = 1$ and for one-sided alternative hypotheses, we can construct the t statistic $t_T = t_T(\hat\theta_T)$, where $t_T(\theta_T)$ is defined to be
$$t_T(\theta_T) = \frac{\sqrt{T}(R\theta_T - r)}{\left\{ R \left[ G_T(\theta_T)' W_T^{-1}(\theta_T) G_T(\theta_T) \right]^{-1} R' \right\}^{1/2}}.$$
The second type of test statistic is based on the likelihood ratio principle. Let $\hat\theta_{T,R}$ be the restricted second-step GMM estimator:
$$\hat\theta_{T,R} = \arg\min_{\theta \in \Theta} g_T(\theta)' W_T^{-1}(\tilde\theta_T) g_T(\theta) \quad \text{s.t. } R\theta = r.$$
The likelihood ratio principle suggests the GMM distance statistic (or GMM criterion function statistic) given by
$$D_T := \left[ T g_T(\hat\theta_{T,R})' W_T^{-1}(\bar\theta_T) g_T(\hat\theta_{T,R}) - T g_T(\hat\theta_T)' W_T^{-1}(\bar\theta_T) g_T(\hat\theta_T) \right]/p,$$
where $\bar\theta_T$ is a $\sqrt{T}$-consistent estimator of $\theta_0$, i.e., $\bar\theta_T - \theta_0 = O_p(1/\sqrt{T})$. Typically, we take $\bar\theta_T$ to be $\tilde\theta_T$, the unrestricted first-step estimator. To ensure that $D_T \geq 0$, we use the same $W_T^{-1}(\bar\theta_T)$ in computing the restricted and unrestricted GMM criterion functions.
The third type of test statistic is the GMM counterpart of the score statistic or Lagrange Multiplier (LM) statistic. It is based on the score or gradient of the GMM criterion function, i.e., $\Delta_T(\theta) = G_T(\theta)' W_T^{-1}(\bar\theta_T) g_T(\theta)$. The test statistic is given by
$$S_T = T \Delta_T(\hat\theta_{T,R})' \left[ G_T(\hat\theta_{T,R})' W_T^{-1}(\bar\theta_T) G_T(\hat\theta_{T,R}) \right]^{-1} \Delta_T(\hat\theta_{T,R})/p,$$
where as before $\bar\theta_T$ is a $\sqrt{T}$-consistent estimator of $\theta_0$.
It is implicit in our notation that we use the same smoothing parameter $h$ in $W_T(\bar\theta_T)$, $W_T(\tilde\theta_T)$ and $W_T(\hat\theta_T)$. In principle, we could use different smoothing parameters, but the asymptotic distributions would become very complicated. Exploring the advantages of using different smoothing parameters is an interesting topic for future research. In this paper we employ the same smoothing parameter in all HAR variance estimators.
Under the usual asymptotics where $h \to \infty$ but at a slower rate than the sample size $T$, all three statistics $\mathcal{W}_T$, $D_T$ and $S_T$ are asymptotically distributed as $\chi^2_p/p$, and the t-statistic $t_T$ is asymptotically standard normal under the null. The question is, what are the limiting distributions of $\mathcal{W}_T$, $D_T$, $S_T$ and $t_T$ when $h$ is held fixed as $T \to \infty$? The rationale for considering this type of thought experiment is that it may deliver asymptotic approximations that are more accurate than the chi-square or normal approximation in finite samples.
When $h$ is fixed, that is, when $b$, $\rho$ or $K$ is fixed, the variance matrix estimator involves a fixed amount of smoothing in that it is approximately equal to an average of a fixed number of quantities from a frequency domain perspective. The fixed-b, fixed-$\rho$ or fixed-K asymptotics may be collectively referred to as the fixed-smoothing asymptotics. Correspondingly, the conventional asymptotics under which $h \to \infty$ and $T \to \infty$ jointly is referred to as the increasing-smoothing asymptotics. In view of the definition of $h$ for each HAR variance estimator, the magnitude of $h$ indicates the amount or level of smoothing in each case.
3 The Fixed-smoothing Asymptotics

3.1 Fixed-smoothing asymptotics for the variance estimator
To establish the fixed-smoothing asymptotics, we first introduce another HAR variance estimator. Let
$$Q_{T,h}(r, s) = Q_h(r, s) - \frac{1}{T} \sum_{\tau_1=1}^{T} Q_h\!\left(\frac{\tau_1}{T}, s\right) - \frac{1}{T} \sum_{\tau_2=1}^{T} Q_h\!\left(r, \frac{\tau_2}{T}\right) + \frac{1}{T^2} \sum_{\tau_1=1}^{T} \sum_{\tau_2=1}^{T} Q_h\!\left(\frac{\tau_1}{T}, \frac{\tau_2}{T}\right)$$
be the demeaned version of $Q_h(r, s)$; then we have
$$W_T(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{s=1}^{T} Q_{T,h}\!\left(\frac{t}{T}, \frac{s}{T}\right) f(v_t, \theta) f(v_s, \theta)'. \tag{3}$$
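The equality between the demeaned-data form in (2) and the demeaned-weight form in (3) is a finite-sample algebraic identity, which can be verified numerically. A small Python sketch (the Bartlett kernel and the dimensions are arbitrary illustrative choices; random normals stand in for the moment process $f(v_t, \theta)$):

```python
import numpy as np

rng = np.random.default_rng(0)
T, m, b = 50, 2, 0.3  # illustrative sample size, moment dimension, bandwidth

def k_bartlett(x):
    return np.maximum(1.0 - np.abs(x), 0.0)

# Weight matrix Q_h(t/T, s/T) for a contracted Bartlett kernel, h = 1/b
t = np.arange(1, T + 1) / T
Q = k_bartlett((t[:, None] - t[None, :]) / b)

# Demeaned weights Q_{T,h}: center Q in both arguments
Qc = Q - Q.mean(axis=0, keepdims=True) - Q.mean(axis=1, keepdims=True) + Q.mean()

u = rng.standard_normal((T, m))          # stand-in for f(v_t, theta)
uc = u - u.mean(axis=0, keepdims=True)   # demeaned moment process

W_demeaned_data = uc.T @ Q @ uc / T      # estimator (2): demean the data
W_demeaned_weight = u.T @ Qc @ u / T     # representation (3): demean the weights

assert np.allclose(W_demeaned_data, W_demeaned_weight)
```

The identity holds because centering the data corresponds to pre- and post-multiplying the weight matrix by the same projection matrix $I - \mathbf{1}\mathbf{1}'/T$.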
As $T \to \infty$, we obtain the 'continuous' version of $Q_{T,h}(r, s)$ given by
$$Q_h^*(r, s) = Q_h(r, s) - \int_0^1 Q_h(\tau_1, s)\, d\tau_1 - \int_0^1 Q_h(r, \tau_2)\, d\tau_2 + \int_0^1 \int_0^1 Q_h(\tau_1, \tau_2)\, d\tau_1\, d\tau_2.$$
$Q_h^*(r, s)$ can be regarded as a centered version of $Q_h(r, s)$, as it satisfies $\int_0^1 Q_h^*(r, \tau)\, dr = \int_0^1 Q_h^*(\tau, s)\, ds = 0$ for any $\tau$. For OS HAR variance estimators, we have $Q_h^*(r, s) = Q_h(r, s)$, and so centering is not necessary.
For our theoretical development, it is helpful to define the HAR variance estimator:
$$W_T^*(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{s=1}^{T} Q_h^*\!\left(\frac{t}{T}, \frac{s}{T}\right) f(v_t, \theta) f(v_s, \theta)'.$$
It is shown in Lemma 1(a) below that $W_T(\bar\theta_T)$ and $W_T^*(\bar\theta_T)$ are asymptotically equivalent for any $\sqrt{T}$-consistent estimator $\bar\theta_T$. So the limiting distribution of $W_T(\bar\theta_T)$ follows from that of $W_T^*(\bar\theta_T)$.
To establish the fixed-smoothing asymptotics of $W_T^*(\bar\theta_T)$, and hence that of $W_T(\bar\theta_T)$, we maintain Assumption 1 on the kernel function and basis functions.

Assumption 1 (a) For kernel HAR variance estimators, the kernel function $k(\cdot)$ satisfies the following condition: for any $b \in (0, 1]$ and $\rho \geq 1$, $k_b(x) := k(x/b)$ and $k^\rho(x)$ are symmetric, continuous, piecewise monotonic, and piecewise continuously differentiable on $[-1, 1]$. (b) For the OS HAR variance estimator, the basis functions $\Phi_j(\cdot)$ are piecewise monotonic, continuously differentiable and orthonormal in $L^2[0, 1]$, and $\int_0^1 \Phi_j(x)\, dx = 0$.
Assumption 1 on the kernel function is very mild. It covers many commonly used kernel functions such as the Bartlett kernel, Parzen kernel, QS kernel, Daniell kernel and Tukey-Hanning kernel. Assumption 1 ensures the asymptotic equivalence of $W_T(\bar\theta_T)$ and $W_T^*(\bar\theta_T)$. In addition, under Assumption 1, we can use the Fourier series expansion to show that $Q_h^*(r, s)$ has the following unified representation for all the HAR variance estimators we consider:
$$Q_h^*(r, s) = \sum_{j=1}^{\infty} \lambda_j \Phi_j(r) \Phi_j(s), \tag{4}$$
where $\{\Phi_j(\cdot)\}$ is a sequence of continuously differentiable functions satisfying $\int_0^1 \Phi_j(r)\, dr = 0$. The right hand side of (4) converges absolutely and uniformly over $(r, s) \in [0, 1] \times [0, 1]$. For notational simplicity, the dependence of $\lambda_j$ and $\Phi_j(\cdot)$ on $h$ has been suppressed. For positive semidefinite kernels $k(\cdot)$, the above representation also follows from Mercer's theorem, in which case $\lambda_j$ is an eigenvalue and $\Phi_j(\cdot)$ is the corresponding orthonormal eigenfunction of the Fredholm operator with kernel $Q_h^*(\cdot, \cdot)$. Alternatively, the representation can be regarded as a spectral decomposition of this Fredholm operator, which can be shown to be compact. This representation enables us to give a unified proof for the contracted kernel HAR variance estimator, the exponentiated kernel HAR variance estimator, and the OS HAR variance estimator. For kernel HAR variance estimation, there is no need to single out nonsmooth kernels such as the Bartlett kernel and treat them differently as in KV (2005).
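For the OS estimator, the representation (4) is immediate from the definition of $Q_h$ given in Section 2, with finitely many nonzero terms and no Mercer argument needed:

```latex
Q_h^*(r, s) = Q_h(r, s) = \frac{1}{K} \sum_{j=1}^{K} \Phi_j(r) \Phi_j(s),
\qquad
\lambda_j =
\begin{cases}
1/K, & 1 \le j \le K, \\
0,   & j > K.
\end{cases}
```

This special case explains why the fixed-K limits below take particularly tractable (Wishart-based) forms.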
Define
$$G_t(\theta) = \frac{\partial g_t(\theta)}{\partial\theta'} = \frac{1}{T} \sum_{j=1}^{t} \frac{\partial f(v_j, \theta)}{\partial\theta'} \quad \text{for } t \geq 1, \qquad \text{and } G_0(\theta) = 0 \text{ for all } \theta \in \Theta.$$
Let $u_t = f(v_t, \theta_0)$, $\Phi_0(r) \equiv 1$, and $e_t \sim \text{iid } N(0, I_m)$. We make the following four assumptions on the GMM estimators and the data generating process.
Assumption 2 As $T \to \infty$, $\hat\theta_T = \theta_0 + o_p(1)$, $\hat\theta_{T,R} = \theta_0 + o_p(1)$, and $\tilde\theta_T = \theta_0 + o_p(1)$ for an interior point $\theta_0 \in \Theta$.

Assumption 3 $\sum_{j=-\infty}^{\infty} \|\Gamma_j\| < \infty$, where $\Gamma_j = E u_t u_{t-j}'$.

Assumption 4 For any $\bar\theta_T = \theta_0 + o_p(1)$, $\operatorname{plim}_{T \to \infty} G_{[rT]}(\bar\theta_T) = rG$ uniformly in $r$, where $G = G(\theta_0)$ and $G(\theta) = E\,\partial f(v_t, \theta)/\partial\theta'$.

Assumption 5 (a) $T^{-1/2} \sum_{t=1}^{T} \Phi_j(t/T) u_t$ converges weakly to a continuous distribution, jointly over $j = 0, 1, \ldots, J$ for every finite $J$.
(b) The following holds:
$$P\left( \frac{1}{\sqrt{T}} \sum_{t=1}^{T} \Phi_j\!\left(\frac{t}{T}\right) u_t \leq x \text{ for } j = 0, 1, \ldots, J \right) = P\left( \frac{1}{\sqrt{T}} \sum_{t=1}^{T} \Phi_j\!\left(\frac{t}{T}\right) \Lambda e_t \leq x \text{ for } j = 0, 1, \ldots, J \right) + o(1)$$
as $T \to \infty$ for every finite $J$, where $x \in \mathbb{R}^m$ and $\Lambda$ is the matrix square root of $\Omega$, i.e., $\Lambda\Lambda' = \Omega := \sum_{j=-\infty}^{\infty} \Gamma_j$.
Assumptions 2–4 are standard in the literature on the fixed-smoothing asymptotics. They are the same as those in Kiefer and Vogelsang (2005) and Sun and Kim (2012), among others. Assumption 5 is a variant of the standard multivariate CLT. If Assumption 1 holds and
$$\sqrt{T} g_{[rT]}(\theta_0) = \frac{1}{\sqrt{T}} \sum_{t=1}^{[rT]} u_t \overset{d}{\to} \Lambda B_m(r),$$
where $B_m(r)$ is a standard Brownian motion, then Assumption 5 holds. So Assumption 5 is weaker than the above FCLT, which is typically assumed in the literature on the fixed-smoothing asymptotics.

A great advantage of maintaining only the CLT assumption is that our results can be easily generalized to the case of higher dimensional dependence, such as spatial dependence and spatial-temporal dependence. In fact, Sun and Kim (2014) have established the fixed-smoothing asymptotics in the spatial setting using only the CLT assumption. They avoid the more restrictive FCLT assumption maintained in BCHV (2011). With some minor notational changes and a small change in Assumption 4 as in Sun and Kim (2014), our results remain valid in the spatial setting.
Assumption 5(b) only assumes that the approximation error is $o(1)$, which is enough for our first order fixed-smoothing asymptotics. However, it is useful to discuss the composition of the approximation error. Let $\Phi_J(t) = (\Phi_0(t), \Phi_1(t), \ldots, \Phi_J(t))'$ and let $a_J \in \mathbb{R}^{J+1}$ be a vector of the same dimension. It follows from Lemma 1 in Taniguchi and Puri (1996) that under some additional conditions:
$$P\left( a_J' \frac{1}{\sqrt{T}} \sum_{t=1}^{T} \Phi_J\!\left(\frac{t}{T}\right) u_t \leq x \right) = P\left( a_J' \frac{1}{\sqrt{T}} \sum_{t=1}^{T} \Phi_J\!\left(\frac{t}{T}\right) \Lambda e_t \leq x \right) + \frac{c(x)}{\sqrt{T}} + O\!\left(\frac{1}{T}\right)$$
for some function $c(x)$. In the above Edgeworth expansion, the approximation error of order $c(x)/\sqrt{T}$ captures the skewness of $u_t$. When $u_t$ is Gaussian or has a symmetric distribution, this term disappears. Part of the approximation error of order $O(1/T)$ comes from the stochastic dependence of $\{u_t\}$. If we replace the iid Gaussian process $\{\Lambda e_t\}$ by a dependent Gaussian process $\{u_t^N\}$ that has the same covariance structure as $\{u_t\}$, then we can remove this part of the approximation error.
To present Assumption 5 and our theoretical results more compactly, we introduce the notion of asymptotic equivalence in distribution. Consider two stochastically bounded sequences of random vectors $\xi_T \in \mathbb{R}^p$ and $\eta_T \in \mathbb{R}^p$. We say that they are asymptotically equivalent in distribution, and write $\xi_T \overset{a}{\sim} \eta_T$, if and only if $Ef(\xi_T) = Ef(\eta_T) + o(1)$ as $T \to \infty$ for all bounded and continuous functions $f(\cdot)$ on $\mathbb{R}^p$. According to Lemma 2 in the appendix, Assumption 5 is equivalent to
$$\frac{1}{\sqrt{T}} \sum_{t=1}^{T} \Phi_j\!\left(\frac{t}{T}\right) u_t \overset{a}{\sim} \frac{1}{\sqrt{T}} \sum_{t=1}^{T} \Phi_j\!\left(\frac{t}{T}\right) \Lambda e_t$$
jointly over $j = 0, 1, \ldots, J$.
Let
$$\Psi_j(\theta) = \left[ \frac{1}{\sqrt{T}} \sum_{t=1}^{T} \Phi_j\!\left(\frac{t}{T}\right) f(v_t, \theta) \right] \left[ \frac{1}{\sqrt{T}} \sum_{t=1}^{T} \Phi_j\!\left(\frac{t}{T}\right) f(v_t, \theta) \right]'.$$
Using the uniform series representation in (4), we can write
$$W_T^*(\theta) = \sum_{j=1}^{\infty} \lambda_j \Psi_j(\theta),$$
which is an infinite weighted sum of outer products. To establish the fixed-smoothing asymptotic distribution of $W_T^*(\theta)$ under Assumption 5, we split $W_T^*(\theta)$ into a finite sum part and the remainder part: $W_T^*(\theta) = \sum_{j=1}^{J} \lambda_j \Psi_j(\theta) + \sum_{j=J+1}^{\infty} \lambda_j \Psi_j(\theta)$. For each finite $J$, we can use Assumptions 2–5 to obtain the asymptotically equivalent distribution for the first term. However, for the second term to vanish, we require $J \to \infty$. To close the gap in our argument, we use Lemma 4 in the appendix, which puts our proof on a rigorous footing and may be of independent interest.
Lemma 1 Let Assumptions 1–5 hold. Then for any $\sqrt{T}$-consistent estimator $\bar\theta_T$ and for a fixed $h$:
(a) $W_T(\bar\theta_T) = W_T^*(\bar\theta_T) + O_p(1/T)$;
(b) $W_T^*(\bar\theta_T) = W_T^*(\theta_0) + o_p(1)$;
(c) $W_T(\bar\theta_T) \overset{a}{\sim} W_{eT}$ and $W_T^*(\bar\theta_T) \overset{a}{\sim} W_{eT}$, where
$$W_{eT} = \Lambda \left[ \frac{1}{T} \sum_{t=1}^{T} \sum_{\tau=1}^{T} Q_h\!\left(\frac{t}{T}, \frac{\tau}{T}\right) (e_t - \bar e)(e_\tau - \bar e)' \right] \Lambda'$$
and $\bar e = \frac{1}{T} \sum_{t=1}^{T} e_t$;
(d) $W_T(\bar\theta_T) \overset{d}{\to} W_\infty$ and $W_T^*(\bar\theta_T) \overset{d}{\to} W_\infty$, where
$$W_\infty = \Lambda \tilde W_\infty \Lambda' \quad \text{and} \quad \tilde W_\infty = \int_0^1 \int_0^1 Q_h^*(r, s)\, dB_m(r)\, dB_m(s)'.$$
Lemma 1(c)(d) gives a unified representation of the asymptotically equivalent distribution and the limiting distribution of $W_T(\bar\theta_T)$ and $W_T^*(\bar\theta_T)$. The representation applies to smooth kernels as well as nonsmooth kernels such as the Bartlett kernel.

Although we do not make the FCLT assumption, the limiting distribution of $W_T(\bar\theta_T)$ and $W_T^*(\bar\theta_T)$ can still be represented by a functional of Brownian motion as in Lemma 1(d). This representation serves two purposes. First, it gives an explicit representation of the limiting distribution. Second, the representation in Lemma 1(d) enables us to compare the results we obtain here with existing results. It is reassuring that the limiting distribution $W_\infty$ is the same as that obtained by Kiefer and Vogelsang (2005), Phillips, Sun and Jin (2006, 2007), and Sun and Kim (2012), among others.

It is standard practice to obtain the limiting distribution in order to conduct asymptotically valid inference. However, this is not necessary, as we can simply use the asymptotically equivalent distribution in Lemma 1(c). There are some advantages in doing so. First, it is easier to simulate the asymptotically equivalent distribution, as it involves only $T$ iid normal vectors. In contrast, to simulate the Brownian motion in the limiting distribution, we need to employ a large number of iid normal vectors. Second, the asymptotically equivalent distribution can provide a more accurate approximation in some cases. Note that $W_{eT}$ takes the same form as $W_T(\theta_0)$: if $f(v_t, \theta_0)$ is iid $N(0, \Omega)$, then $W_T(\theta_0)$ and $W_{eT}$ have the same distribution in finite samples, and there is no approximation error. In contrast, the limiting distribution $W_\infty$ is always an approximation to the finite sample distribution of $W_T(\theta_0)$.
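To illustrate how cheaply the asymptotically equivalent distribution can be simulated, the following Python sketch draws one realization of $W_{eT}$ for the OS case with the Fourier basis. The basis choice and the values of $T$, $m$, $K$ are illustrative, and $\Lambda$ is set to $I_m$ for simplicity:

```python
import numpy as np

rng = np.random.default_rng(1)
T, m, K = 200, 3, 12  # illustrative sample size, dimension, number of basis functions

# Fourier basis Phi_j evaluated at t/T (K even; one common OS choice)
t = np.arange(1, T + 1) / T
Phi = np.empty((T, K))
for j in range(1, K // 2 + 1):
    Phi[:, 2*j - 2] = np.sqrt(2) * np.cos(2 * np.pi * j * t)
    Phi[:, 2*j - 1] = np.sqrt(2) * np.sin(2 * np.pi * j * t)

# One draw from the distribution of W_eT (with Lambda = I_m):
# only T iid N(0, I_m) vectors are needed.
e = rng.standard_normal((T, m))
e = e - e.mean(axis=0)          # demean, as in the definition of W_eT
Z = Phi.T @ e / np.sqrt(T)      # K x m matrix of basis projections
W_eT = Z.T @ Z / K              # = T^{-1} sum_t sum_tau Q_h(t/T, tau/T)(e_t - ebar)(e_tau - ebar)'

assert W_eT.shape == (m, m)
assert np.allclose(W_eT, W_eT.T)  # symmetric by construction
```

For the OS weighting function, the double sum collapses to an average of $K$ outer products, which is why the draw costs only a $K \times m$ matrix multiplication.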
3.2 Fixed-smoothing asymptotics for the test statistics
We now establish the asymptotic distributions of $\mathcal{W}_T$, $D_T$, $S_T$ and $t_T$ when $h$ is fixed. For kernel variance estimators, we focus on the case that $k(\cdot)$ is positive definite, as otherwise the two-step estimator may not even be consistent. In this case, we can show that, for all the variance estimators we consider, $W_\infty$ is nonsingular with probability one. Hence, using Lemma 1 and Lemma 3 in the appendix, we have $W_T(\tilde\theta_T)^{-1} \overset{a}{\sim} W_{eT}^{-1}$ and $W_T(\tilde\theta_T)^{-1} \overset{d}{\to} W_\infty^{-1}$. Using these results and Assumptions 1–5, we have
$$\sqrt{T}(\hat\theta_T - \theta_0) = -\left[ G' W_T^{-1}(\tilde\theta_T) G \right]^{-1} G' W_T^{-1}(\tilde\theta_T) \sqrt{T} g_T(\theta_0) + o_p(1)$$
$$\overset{a}{\sim} -\left( G' W_{eT}^{-1} G \right)^{-1} G' W_{eT}^{-1} \Lambda \frac{1}{\sqrt{T}} \sum_{t=1}^{T} e_t \overset{d}{\to} -\left( G' W_\infty^{-1} G \right)^{-1} G' W_\infty^{-1} \Lambda B_m(1). \tag{5}$$
The limiting distribution is not normal but rather mixed normal. More specifically, $W_\infty^{-1}$ is independent of $B_m(1)$; so conditional on $W_\infty^{-1}$, the limiting distribution is normal with the conditional variance depending on $W_\infty^{-1}$.
Using (5) and Lemma 3 in the appendix, we have $\mathcal{W}_T \overset{a}{\sim} F_{eT}$ and $\mathcal{W}_T \overset{d}{\to} F_\infty$, where
$$F_{eT} = \left[ R \left( G' W_{eT}^{-1} G \right)^{-1} G' W_{eT}^{-1} \Lambda \frac{1}{\sqrt{T}} \sum_{t=1}^{T} e_t \right]' \left[ R \left( G' W_{eT}^{-1} G \right)^{-1} R' \right]^{-1} \left[ R \left( G' W_{eT}^{-1} G \right)^{-1} G' W_{eT}^{-1} \Lambda \frac{1}{\sqrt{T}} \sum_{t=1}^{T} e_t \right] / p$$
and
$$F_\infty = \left[ R \left( G' W_\infty^{-1} G \right)^{-1} G' W_\infty^{-1} \Lambda B_m(1) \right]' \left[ R \left( G' W_\infty^{-1} G \right)^{-1} R' \right]^{-1} \left[ R \left( G' W_\infty^{-1} G \right)^{-1} G' W_\infty^{-1} \Lambda B_m(1) \right] / p.$$
To show that $F_{eT}$ and $F_\infty$ are pivotal, we let $e_t := (e_{t,p}', e_{t,d-p}', e_{t,q}')'$. The subscripts $p$, $d-p$, and $q$ on $e$ indicate not only the dimensions of the random vectors but also distinguish them so that, for example, $e_{t,p}$ is different and independent from $e_{t,q}$ for all values of $p$ and $q$. Denote
$$C_{p,T} = \frac{1}{\sqrt{T}} \sum_{t=1}^{T} e_{t,p}, \qquad C_{q,T} = \frac{1}{\sqrt{T}} \sum_{t=1}^{T} e_{t,q}$$
and
$$C_{pp,T} = \frac{1}{T} \sum_{t=1}^{T} \sum_{\tau=1}^{T} Q_h\!\left(\frac{t}{T}, \frac{\tau}{T}\right) \tilde e_{t,p} \tilde e_{\tau,p}', \qquad C_{pq,T} = \frac{1}{T} \sum_{t=1}^{T} \sum_{\tau=1}^{T} Q_h\!\left(\frac{t}{T}, \frac{\tau}{T}\right) \tilde e_{t,p} \tilde e_{\tau,q}',$$
$$C_{qq,T} = \frac{1}{T} \sum_{t=1}^{T} \sum_{\tau=1}^{T} Q_h\!\left(\frac{t}{T}, \frac{\tau}{T}\right) \tilde e_{t,q} \tilde e_{\tau,q}', \qquad D_{pp,T} = C_{pp,T} - C_{pq,T} C_{qq,T}^{-1} C_{pq,T}', \tag{6}$$
where $\tilde e_{i,j} = e_{i,j} - T^{-1} \sum_{s=1}^{T} e_{s,j}$ for $i = t, \tau$ and $j = p, q$. Similarly, let
$$B_m(r) := \left( B_p'(r), B_{d-p}'(r), B_q'(r) \right)',$$
where $B_p(r)$, $B_{d-p}(r)$ and $B_q(r)$ are independent standard Brownian motion processes of dimensions $p$, $d-p$, and $q$, respectively. Denote
$$C_{pp} = \int_0^1 \int_0^1 Q_h^*(r, s)\, dB_p(r)\, dB_p(s)', \qquad C_{pq} = \int_0^1 \int_0^1 Q_h^*(r, s)\, dB_p(r)\, dB_q(s)', \tag{7}$$
$$C_{qq} = \int_0^1 \int_0^1 Q_h^*(r, s)\, dB_q(r)\, dB_q(s)', \qquad D_{pp} = C_{pp} - C_{pq} C_{qq}^{-1} C_{pq}'.$$
Then using the theory of multivariate statistics, we obtain the following theorem.
Theorem 1 Let Assumptions 1–5 hold. Assume that the $k(\cdot)$ used in the kernel HAR variance estimation is positive definite. Then, for a fixed $h$:
(a) $\mathcal{W}_T(\hat\theta_T) \overset{a}{\sim} \left[ C_{p,T} - C_{pq,T} C_{qq,T}^{-1} C_{q,T} \right]' D_{pp,T}^{-1} \left[ C_{p,T} - C_{pq,T} C_{qq,T}^{-1} C_{q,T} \right] / p \overset{d}{=} F_{eT}$;
(b) $\mathcal{W}_T(\hat\theta_T) \overset{d}{\to} \left[ B_p(1) - C_{pq} C_{qq}^{-1} B_q(1) \right]' D_{pp}^{-1} \left[ B_p(1) - C_{pq} C_{qq}^{-1} B_q(1) \right] / p \overset{d}{=} F_\infty$;
(c) $t_T(\hat\theta_T) \overset{a}{\sim} t_{eT} := \left[ C_{p,T} - C_{pq,T} C_{qq,T}^{-1} C_{q,T} \right] / \sqrt{D_{pp,T}}$;
(d) $t_T(\hat\theta_T) \overset{d}{\to} t_\infty := \left[ B_p(1) - C_{pq} C_{qq}^{-1} B_q(1) \right] / \sqrt{D_{pp}}$;
where $(C_{p,T}', C_{q,T}')'$ is independent of $C_{pq,T} C_{qq,T}^{-1}$ and $D_{pp,T}$, and $(B_p(1)', B_q(1)')'$ is independent of $C_{pq} C_{qq}^{-1}$ and $D_{pp}$.
Remark 1 Theorem 1(a) and (b) show that both the asymptotically equivalent distribution $F_{eT}$ and the limiting distribution $F_\infty$ are pivotal. In particular, they do not depend on the long run variance $\Omega$. Both $F_{eT}$ and $F_\infty$ are nonstandard but can be simulated. It is easier to simulate $F_{eT}$, as it involves only $T$ iid standard normal vectors.
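A minimal Monte Carlo sketch of this simulation in Python, for the OS case with the Fourier basis. All design values ($T$, $p$, $q$, $K$, and the number of replications) are hypothetical choices for illustration, not values used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
T, p, q, K, nrep = 200, 2, 3, 14, 3000   # hypothetical design values

# Fourier basis Phi_j(t/T), j = 1, ..., K (K even)
t = np.arange(1, T + 1) / T
Phi = np.empty((T, K))
for j in range(1, K // 2 + 1):
    Phi[:, 2*j - 2] = np.sqrt(2) * np.cos(2 * np.pi * j * t)
    Phi[:, 2*j - 1] = np.sqrt(2) * np.sin(2 * np.pi * j * t)

draws = np.empty(nrep)
for i in range(nrep):
    ep = rng.standard_normal((T, p))
    eq = rng.standard_normal((T, q))
    Cp = ep.sum(axis=0) / np.sqrt(T)                  # C_{p,T}
    Cq = eq.sum(axis=0) / np.sqrt(T)                  # C_{q,T}
    Zp = Phi.T @ (ep - ep.mean(axis=0)) / np.sqrt(T)  # projections of demeaned e_{t,p}
    Zq = Phi.T @ (eq - eq.mean(axis=0)) / np.sqrt(T)
    Cpp, Cpq, Cqq = Zp.T @ Zp / K, Zp.T @ Zq / K, Zq.T @ Zq / K
    Dpp = Cpp - Cpq @ np.linalg.solve(Cqq, Cpq.T)     # D_{pp,T} in (6)
    v = Cp - Cpq @ np.linalg.solve(Cqq, Cq)           # C_{p,T} - C_{pq,T} C_{qq,T}^{-1} C_{q,T}
    draws[i] = v @ np.linalg.solve(Dpp, v) / p        # one draw of F_{eT}

cv95 = np.quantile(draws, 0.95)
# The fixed-smoothing critical value exceeds the chi-square one,
# chi^2_{2, 0.95} / 2 = 2.996, reflecting the randomness of the weighting matrix.
assert cv95 > 2.996
```

Each replication costs only a handful of small matrix operations, so simulated critical values are cheap even for a fine grid of $(p, q, K)$ values.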
Remark 2 Let
$$\tilde{\mathcal{W}}_T(\tilde\theta_T) = T (R\tilde\theta_T - r)' \tilde V_T^{-1} (R\tilde\theta_T - r)/p$$
be the Wald statistic based on the first-step GMM estimator $\tilde\theta_T$, where
$$\tilde V_T = R \left[ \tilde G_T' W_o^{-1} \tilde G_T \right]^{-1} \left[ \tilde G_T' W_o^{-1} W_T(\tilde\theta_T) W_o^{-1} \tilde G_T \right] \left[ \tilde G_T' W_o^{-1} \tilde G_T \right]^{-1} R'$$
and $\tilde G_T = G_T(\tilde\theta_T)$. It is easy to show that $\tilde{\mathcal{W}}_T(\tilde\theta_T) \overset{d}{\to} B_p(1)' C_{pp}^{-1} B_p(1)/p$. See KV (2005) for contracted kernel HAR variance estimators, Phillips, Sun and Jin (2006, 2007) for exponentiated kernel HAR variance estimators, and Sun (2013) for OS HAR variance estimators. Comparing the limit $B_p(1)' C_{pp}^{-1} B_p(1)/p$ with $F_\infty$ given in Theorem 1, we can see that $F_\infty$ contains additional adjustment terms.
Remark 3 When the model is exactly identified, we have $q = 0$. In this case,
$$F_\infty \overset{d}{=} B_p(1)' C_{pp}^{-1} B_p(1)/p.$$
This limit is the same as that in the one-step GMM framework. This is not surprising: when the model is exactly identified, the GMM weighting matrix becomes irrelevant and the two-step estimator is numerically identical to the one-step estimator.
Remark 4 It is interesting to see that the limiting distribution F_\infty (and the asymptotically equivalent distribution) depends on the degree of over-identification. Typically such dependence shows up only when the number of moment conditions is assumed to grow with the sample size. Here the number of moment conditions is fixed. It remains in the limiting distribution because it contains information on the dimension of the random limiting matrix \tilde W_\infty.
Theorem 2 Let Assumptions 1-5 hold. Then, for a fixed h,

D_T = W_T + o_p(1) and S_T = W_T + o_p(1).

It follows from Theorem 2 that all three statistics are asymptotically equivalent in distribution to the same distribution \tilde F_T, as given in Theorem 1(a). They also have the same limiting distribution F_\infty. So the asymptotic equivalence of the three statistics under the increasing-smoothing asymptotics remains valid under the fixed-smoothing asymptotics.

Careful inspection of the proofs of Theorems 1 and 2 shows that the two theorems hold regardless of which \sqrt{T}-consistent estimator of \theta_0 we use in estimating W_T(\theta_0) and G_T(\theta_0) in D_T and S_T. However, it may make a difference to higher orders.
4 Representation and Approximation of the Fixed-smoothing Asymptotic Distribution

In this section, we establish different representations of the fixed-smoothing limiting distribution. These representations highlight the connection and difference between the nonstandard approximation and the standard chi-square or normal approximation.
4.1 The case of OS HAR variance estimation
To obtain a new representation of F_\infty, we consider the matrix

[C_{pp}, C_{pq}; C_{pq}', C_{qq}] =^d \int_0^1 \int_0^1 Q_h(r,s) dB_{p+q}(r) dB_{p+q}'(s) = \frac{1}{K} \sum_{j=1}^{K} [\int_0^1 \Phi_j(r) dB_{p+q}(r)] [\int_0^1 \Phi_j(r) dB_{p+q}(r)]',

where B_{p+q}(r) is the standard Brownian motion of dimension p + q. Since \{\Phi_j(r)\} are orthonormal, \int_0^1 \Phi_j(r) dB_{p+q}(r) \sim iid N(0, I_{p+q}) and K times the above matrix follows a standard Wishart distribution W_{p+q}(K, I_{p+q}). A well known property of a Wishart random matrix is that D_{pp} = C_{pp} - C_{pq} C_{qq}^{-1} C_{pq}' \sim W_p(K - q, I_p)/K and is independent of both C_{pq} and C_{qq}. This implies that D_{pp} is independent of B_p(1) - C_{pq} C_{qq}^{-1} B_q(1), and that F_\infty is equal in distribution to a quadratic form with an independent scaled Wishart matrix as the (inverse) weighting matrix.
This brings F_\infty close to Hotelling's T^2 distribution, which is the same as a standard F distribution after some multiplicative adjustment. The only difference is that B_p(1) - C_{pq} C_{qq}^{-1} B_q(1) is not normal and hence ||B_p(1) - C_{pq} C_{qq}^{-1} B_q(1)||^2 does not follow a chi-square distribution. However, conditional on C_{pq} C_{qq}^{-1} B_q(1), ||B_p(1) - C_{pq} C_{qq}^{-1} B_q(1)||^2 follows a noncentral \chi^2_p distribution with noncentrality parameter \delta^2 = ||C_{pq} C_{qq}^{-1} B_q(1)||^2.
It then follows that F_\infty is conditionally distributed as a noncentral F distribution. Let

\kappa = \frac{K}{K - p - q + 1} and \bar\delta^2 = E \delta^2 = \frac{pq}{K - q - 1}.
The following theorem presents this result formally.
Theorem 3 For OS HAR variance estimation, we have
(a) \kappa^{-1} F_\infty =^d F_{p, K-p-q+1}(\delta^2), a mixed noncentral F random variable with random noncentrality parameter \delta^2;
(b) P(\kappa^{-1} F_\infty < z) = F_{p, K-p-q+1}(z; \bar\delta^2) + o(K^{-1}) as K \to \infty;
(c) P(p F_\infty < z) = G_p(z) - G_p'(z) z (p + 2q - 1)/K + G_p''(z) z^2 / K + o(1/K) as K \to \infty, where G_p(z) is the CDF of the chi-square distribution \chi^2_p.
Remark 5 According to Theorem 3(a),

F_\infty =^d \frac{\chi^2_p(\delta^2)/p}{\chi^2_{K-p-q+1}/K},

where \chi^2_p(\delta^2) is independent of \chi^2_{K-p-q+1}. It is easy to see that

\chi^2_p(\delta^2) =^d ||C_{p \times 1} - C_{p \times K} C_{q \times K}' (C_{q \times K} C_{q \times K}')^{-1} C_{q \times 1}||^2,

where each C_{m \times n} is an m \times n matrix (vector) with iid standard normal elements and C_{p \times 1}, C_{p \times K}, C_{q \times K} and C_{q \times 1} are mutually independent. This provides a simple way to simulate F_\infty.
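The representation in Remark 5 lends itself to direct Monte Carlo simulation. The sketch below (NumPy; the function name and the illustrative values of p, q and K are ours) draws the noncentral chi-square numerator from the independent standard normal matrices and divides by an independent scaled chi-square draw:

```python
import numpy as np

def simulate_F_inf(p, q, K, reps=100_000, rng=None):
    """Draw from the fixed-smoothing limit F_inf via the Remark-5 representation:
    F_inf = [chi^2_p(delta^2)/p] / [chi^2_{K-p-q+1}/K], with the numerator written as
    ||C_{p x 1} - C_{p x K} C'_{q x K} (C_{q x K} C'_{q x K})^{-1} C_{q x 1}||^2."""
    rng = np.random.default_rng(rng)
    draws = np.empty(reps)
    for i in range(reps):
        Cp1 = rng.standard_normal(p)
        if q > 0:
            CpK = rng.standard_normal((p, K))
            CqK = rng.standard_normal((q, K))
            Cq1 = rng.standard_normal(q)
            # subtract the implied long run projection of Cq1
            Cp1 = Cp1 - CpK @ CqK.T @ np.linalg.solve(CqK @ CqK.T, Cq1)
        num = Cp1 @ Cp1 / p                      # noncentral chi^2_p(delta^2)/p
        den = rng.chisquare(K - p - q + 1) / K   # independent chi^2_{K-p-q+1}/K
        draws[i] = num / den
    return draws

# e.g. a simulated 95% critical value for p = 2, q = 1, K = 24 (illustrative values)
cv = np.quantile(simulate_F_inf(2, 1, 24, reps=20_000, rng=0), 0.95)
```

With q = 0 and K large, the draws are approximately central F_{p, K-p+1}, which provides a quick sanity check.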
Remark 6 Parts (a) and (b) of Theorem 3 are intriguing in that the limiting distribution under the null is approximated by a noncentral F distribution, although the noncentrality parameter \delta^2 goes to zero (in probability) as K \to \infty.
Remark 7 When q = 0, that is, when the model is just identified, we have \delta^2 = \bar\delta^2 = 0 and so

F_\infty =^d \frac{K}{K - p + 1} F_{p, K-p+1}.

This result is the same as that obtained by Sun (2013). So an advantage of using the OS HAR variance estimator is that the fixed-smoothing asymptotic distribution is exactly a standard F distribution for just identified models.
Remark 8 The asymptotics obtained under the specification that K is fixed and then letting K \to \infty is a sequential asymptotics. As K \to \infty, we may show that

F_\infty =^d \frac{\chi^2_p(\delta^2)/p}{\chi^2_{K-p-q+1}/(K - p - q + 1)} + o_p(1) = \chi^2_p/p + o_p(1).

To the first order, the fixed-smoothing asymptotic distribution reduces to the distribution of \chi^2_p/p. As a result, if first order asymptotics are used in both steps in the sequential asymptotic theory, then the sequential asymptotic distribution is the same as the conventional joint asymptotic distribution. However, Theorem 3 is not based on the first order asymptotics but rather a high order asymptotics. The high order sequential asymptotics can be regarded as a convenient way to obtain an asymptotic approximation that better reflects the finite sample distribution of the test statistic D_T, S_T or W_T.
Remark 9 Instead of approximations via asymptotic expansions, all three types of approximations are distributional approximations. More specifically, the finite sample distributions of D_T, S_T, and W_T are approximated by the distributions of

\frac{\chi^2_p}{p}, \quad \frac{\chi^2_p(\delta^2)/p}{\chi^2_{K-p-q+1}/K}, \quad and \quad \frac{\chi^2_p(\bar\delta^2)/p}{\chi^2_{K-p-q+1}/K},  (8)

respectively under the conventional joint asymptotics, the fixed-smoothing asymptotics, and the higher order sequential asymptotics. An advantage of using distributional approximations is that they hold uniformly over their supports (by Pólya's Lemma).
Remark 10 Theorem 3(c) gives a first order distributional expansion of F_\infty. It is clear that the difference between p F_\infty and \chi^2_p depends on K, p and q. Let \chi_p^{1-\alpha} = G_p^{-1}(1 - \alpha) be the (1 - \alpha) quantile of G_p(z); then

P(p F_\infty > \chi_p^{1-\alpha}) = \alpha + [G_p'(\chi_p^{1-\alpha}) \chi_p^{1-\alpha} (p + 2q - 1) - G_p''(\chi_p^{1-\alpha}) (\chi_p^{1-\alpha})^2] \frac{1}{K} + o(\frac{1}{K}).

For a typical critical value \chi_p^{1-\alpha}, G_p'(\chi_p^{1-\alpha}) > 0 and G_p''(\chi_p^{1-\alpha}) < 0, so we expect P(F_\infty > \chi_p^{1-\alpha}/p) > \alpha, at least when K is large. So the critical value from F_\infty is expected to be larger than \chi_p^{1-\alpha}/p. For given p and large K, the difference between P(F_\infty > \chi_p^{1-\alpha}/p) and \alpha increases with q, the degree of over-identification. A practical implication of Theorem 3(c) is that when the degree of over-identification is large, using the chi-square critical value rather than the critical value from F_\infty may lead to the finding of a statistically significant relationship that does not actually exist.
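By Theorem 3(a)-(b), the F_\infty critical value can be approximated by the (1 - \alpha) quantile of \kappa F_{p, K-p-q+1}(\bar\delta^2), which is available in standard packages. A minimal sketch using SciPy's noncentral F distribution (the function name and the example values of p, q and K are ours):

```python
from scipy.stats import ncf, f as fdist, chi2

def ncf_critical_value(p, q, K, alpha=0.05):
    """(1 - alpha) critical value kappa * F^{1-alpha}_{p, K-p-q+1}(bar-delta^2)
    from Theorem 3, with kappa = K/(K-p-q+1) and bar-delta^2 = pq/(K-q-1)."""
    kappa = K / (K - p - q + 1)
    nc = p * q / (K - q - 1)
    if nc == 0:                      # q = 0: reduces to a scaled central F quantile
        return kappa * fdist.ppf(1 - alpha, p, K - p - q + 1)
    return kappa * ncf.ppf(1 - alpha, p, K - p - q + 1, nc)

# The NCF critical value exceeds the chi-square one and grows with q:
p, K = 2, 24
cvs = [ncf_critical_value(p, q, K) for q in (0, 1, 2, 3)]
chi_cv = chi2.ppf(0.95, p) / p
```

This reproduces the pattern in Remark 10: the approximate critical values increase with q and always lie above the chi-square critical value.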
[Figure 1 here. Four panels for p = 1, 2, 3, 4; each panel plots the F_\infty and NCF critical values for q = 0, 1, 2, 3 against K from 10 to 50.]

Figure 1: 95% quantile of the nonstandard distribution and its noncentral F approximation
Figure 1 reports 95% critical values from the two approximations: the noncentral (scaled) F_{p, K-p-q+1}(\bar\delta^2) approximation and the nonstandard F_\infty approximation, for different combinations of p and q. The nonstandard F_\infty distribution is simulated according to the representation in Remark 5 and the nonstandard critical values are based on 10000 simulation replications. Since the noncentral F distribution appears naturally in power analysis, it has been implemented in standard statistical and programming packages, so critical values from the noncentral F distribution can be obtained easily. Let NCF^{1-\alpha} := \kappa F^{1-\alpha}_{p, K-p-q+1}(\bar\delta^2) be the scaled (1 - \alpha) quantile of the noncentral F_{p, K-p-q+1}(\bar\delta^2) distribution; then NCF^{1-\alpha} corresponds to F_\infty^{1-\alpha}, the (1 - \alpha) quantile of the nonstandard F_\infty distribution. Figure 1 graphs NCF^{1-\alpha} and F_\infty^{1-\alpha} against different K values. The figure exhibits several patterns, all of which are consistent with Theorem 3(c). First, the noncentral F critical values are remarkably close to the nonstandard critical values. So approximating \delta^2 by its mean \bar\delta^2 does not incur much approximation error. Second, the noncentral F and nonstandard critical values increase with the degree of over-identification q. The increase is more significant when K is smaller. This is expected, as both the noncentrality parameter \bar\delta^2 and the multiplicative factor \kappa increase with q. In addition, as q increases, the denominators in the noncentral F and nonstandard distributions become more random. All these effects shift the probability mass to the right as q increases. Third, for any given p and q combination, the noncentral F and nonstandard critical values decrease monotonically as K increases and approach the corresponding critical values from the chi-square distribution.
We have also considered approximating the 90% critical values. In this case, the noncentral F critical values are also very close to the nonstandard critical values.
In parallel with Theorem 3, we can represent and approximate t_\infty as follows.
Theorem 4 Let \eta = K/(K - q). For OS HAR variance estimation, we have
(a) t_\infty/\sqrt{\eta} =^d t_{K-q}(\delta), a mixed noncentral t random variable with random noncentrality parameter \delta = C_{1q} C_{qq}^{-1} B_q(1);
(b) P(t_\infty/\sqrt{\eta} < z) = \frac{1}{2} + sgn(z) \frac{1}{2} F_{1, K-q}(z^2; \bar\delta^2) + o(K^{-1}) as K \to \infty, where \bar\delta^2 = q/(K - q - 1) and sgn(\cdot) is the sign function;
(c) P(t_\infty < z) = \Phi(z) - |z| \phi(z) [z^2 + (4q + 1)]/(4K) + o(K^{-1}) as K \to \infty, where \Phi(z) and \phi(z) are the CDF and pdf of the standard normal distribution.
Theorem 4(a) is similar to Theorem 3(a). With a scale adjustment, the fixed-smoothing asymptotic distribution of the t statistic is a mixed noncentral t distribution.

According to Theorem 4(b), we can approximate the quantile of t_\infty by that of a noncentral F distribution. More specifically, let t_\infty^{1-\alpha} be the (1 - \alpha) quantile of t_\infty; then

t_\infty^{1-\alpha} \doteq \sqrt{\eta} \sqrt{F^{1-2\alpha}_{1, K-q}(\bar\delta^2)} for \alpha < 0.5 and t_\infty^{1-\alpha} \doteq -\sqrt{\eta} \sqrt{F^{2\alpha-1}_{1, K-q}(\bar\delta^2)} for \alpha \geq 0.5,

where F^{1-2\alpha}_{1, K-q}(\bar\delta^2) is the (1 - 2\alpha) quantile of the noncentral F distribution. For a two-sided t test, the (1 - \alpha) quantile |t_\infty|^{1-\alpha} of |t_\infty| can be approximated by \sqrt{\eta} \sqrt{F^{1-\alpha}_{1, K-q}(\bar\delta^2)}. As in the case of quantile approximation for F_\infty, the above approximation is remarkably accurate. Figure 2, which is similar to Figure 1, illustrates this. The figure graphs t_\infty^{1-\alpha} and its approximation against the values of K for different degrees of over-identification and for \alpha = 0.05 and 0.95. As K increases, the quantiles approach the normal quantiles \pm 1.645. However, when K is small or q is large, there is a significant difference between the normal quantiles and the corresponding quantiles from t_\infty. This is consistent with Theorem 4(c). For a given small K, the absolute difference increases with the degree of over-identification q. For a given q, the absolute difference decreases with K.

Theorem 4(c) and Figure 2 suggest that the quantile of t_\infty is larger than the corresponding normal quantile in absolute value. So the test based on the normal approximation rejects the null more often than the test based on the fixed-smoothing approximation. This provides an explanation of the large size distortion of the asymptotic normal test.
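The two-sided quantile approximation from Theorem 4(b) is equally easy to compute; a small sketch (assuming SciPy; the helper name is ours):

```python
import numpy as np
from scipy.stats import ncf, f as fdist

def t_inf_abs_cv(q, K, alpha=0.10):
    """Approximate (1 - alpha) quantile of |t_inf| from Theorem 4:
    sqrt(eta * F^{1-alpha}_{1, K-q}(bar-delta^2)), with eta = K/(K - q)
    and bar-delta^2 = q/(K - q - 1)."""
    eta = K / (K - q)
    nc = q / (K - q - 1)
    quant = fdist.ppf(1 - alpha, 1, K - q) if nc == 0 else ncf.ppf(1 - alpha, 1, K - q, nc)
    return np.sqrt(eta * quant)
```

For q = 0 the formula collapses to the usual Student-t critical value with K degrees of freedom; for q > 0 and small K it lies visibly above the normal quantile 1.645, matching Figure 2.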
4.2 The case of kernel HAR variance estimation

In the case of kernel HAR variance estimation, D_{pp} = C_{pp} - C_{pq} C_{qq}^{-1} C_{pq}' is not independent of C_{pq} or C_{qq}. It is not as easy to simplify the nonstandard distribution as in the case of OS HAR variance estimation. Different proofs are needed.
Let

\nu_1 = \int_0^1 Q_h(r,r) dr and \nu_2 = \int_0^1 \int_0^1 [Q_h(r,s)]^2 dr ds.  (9)

In Lemma 5 in the appendix, we show that \nu_1 - 1 \asymp 1/h and \nu_2 \asymp 1/h, where "a \asymp b" indicates that a and b are of the same order of magnitude.

Theorem 5 For kernel HAR variance estimation, we have
[Figure 2 here. Four panels for q = 0, 1, 2, 3; each panel plots the 5% and 95% quantiles of t_\infty and their approximate quantiles against K from 0 to 50.]

Figure 2: 5% and 95% quantiles of the nonstandard distribution t_\infty and their noncentral F approximations
(a) p F_\infty =^d \chi^2_p/\xi^2, where \chi^2_p is independent of

\xi^2 =^d \frac{e_p' (I_p + C_{pq} C_{qq}^{-1} (C_{qq}')^{-1} C_{pq}') e_p}{e_p' (I_p + C_{pq} C_{qq}^{-1} (C_{qq}')^{-1} C_{pq}') (C_{pp} - C_{pq} C_{qq}^{-1} C_{pq}')^{-1} (I_p + C_{pq} C_{qq}^{-1} (C_{qq}')^{-1} C_{pq}') e_p}

and e_p = (1, 0, ..., 0)' \in R^p.

(b) As h \to \infty, we have \xi^2 \to^p 1 and

P(p F_\infty < z) = G_p(z) + G_p'(z) z [(\nu_1 - 1) - (p + 2q - 1) \nu_2] + G_p''(z) z^2 \nu_2 + o(\nu_2),

where as before G_p(z) is the CDF of the chi-square distribution \chi^2_p.
Remark 11 Theorem 5(a) shows that p F_\infty follows a scale-mixed chi-square distribution. Since \xi^2 \to^p 1 as h \to \infty, the sequential limit of p W_T is the usual chi-square distribution. The result is the same as in the OS HAR variance estimation. The virtue of Theorem 5(a) is that it gives an explicit characterization of the random scaling factor \xi^2.
Remark 12 Using Theorem 5(a), we have P(p F_\infty < z) = E G_p(z \xi^2). Theorem 5(b) follows by taking a Taylor expansion of G_p(z \xi^2) around G_p(z) and approximating the moments of \xi^2. In the case of the contracted kernel HAR variance estimation, Sun (2014) establishes an expansion of the asymptotic distribution for W_T(\tilde\theta_T), which is based on the one-step GMM estimator \tilde\theta_T. Using the notation in this paper, it is shown that

P(B_p(1)' C_{pp}^{-1} B_p(1) < z) = G_p(z) + G_p'(z) z [(\nu_1 - 1) - (p - 1) \nu_2] + G_p''(z) z^2 \nu_2 + o(\nu_2).

So up to the order O(\nu_2), the two expansions agree except that there is an additional term in Theorem 5(b) that reflects the degree of over-identification. If we use the chi-square distribution or the fixed-smoothing asymptotic distribution B_p(1)' C_{pp}^{-1} B_p(1) as the reference distribution, then the probability of over-rejection increases with q, at least when \nu_2 is small.
We now focus on the case of contracted kernel HAR variance estimation. As h \to \infty, i.e., b \to 0, we can show that

\nu_1 = 1 - c_1 b + o(b) and \nu_2 = c_2 b + o(b),  (10)

where

c_1 = \int_{-1}^{1} k(x) dx and c_2 = \int_{-1}^{1} k^2(x) dx.

Using Theorem 5(b) and the identity G'(z^2)(z^2 + 1) = -2 G''(z^2) z^2, we have, for p = 1:

P(F_\infty < z^2) = G(z^2) - c_1 G'(z^2) z^2 b - \frac{c_2}{2} [z^4 + (4q + 1) z^2] G'(z^2) b + o(b),  (11)

where G(z^2) = G_1(z^2).
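The kernel constants c_1 and c_2 defined in (10) can be checked by direct numerical integration. The sketch below evaluates them for the Bartlett and Parzen kernels; the known values are c_1 = 1, c_2 = 2/3 for Bartlett and c_1 = 0.75, c_2 ≈ 0.5393 for Parzen:

```python
from scipy.integrate import quad

def bartlett(x):
    x = abs(x)
    return max(1 - x, 0.0)

def parzen(x):
    x = abs(x)
    if x <= 0.5:
        return 1 - 6 * x**2 + 6 * x**3
    if x <= 1.0:
        return 2 * (1 - x) ** 3
    return 0.0

def kernel_constants(k):
    """c1 = int_{-1}^{1} k(x) dx and c2 = int_{-1}^{1} k(x)^2 dx."""
    c1, _ = quad(k, -1, 1)
    c2, _ = quad(lambda x: k(x) ** 2, -1, 1)
    return c1, c2
```

Both kernels are supported on [-1, 1], so the integration range in (10) is exact here; for the QS kernel the integrals would run over the whole real line instead.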
The above expansion is related to the high order Edgeworth expansion established in Sun and Phillips (2009) for linear IV regressions. Sun and Phillips (2009) consider the conventional joint limit under which b \to 0 and T \to \infty jointly and show that

P(|t_T(\hat\theta_T)| < z) = \Phi(z) - \Phi(-z) - [c_1 z + \frac{c_2}{2}(z^3 + z(4q + 1))] b \phi(z) + (bT)^{-q_0} \mu_{1,1} z \phi(z) + s.o.,  (12)

where (bT)^{-q_0} \mu_{1,1} z \phi(z) captures the nonparametric bias of the kernel HAR variance estimator, and 's.o.' stands for smaller order terms. Here q_0 is the order of the kernel used and the explicit expression for \mu_{1,1} is not important here but is given in Sun and Phillips (2009). Observing that \Phi(z) - \Phi(-z) = G(z^2), we have \phi(z) = G'(z^2) z and \phi'(z) = G'(z^2) + 2 z^2 G''(z^2). Using these equations, we can represent the high order expansion in (12) as

P(|t_T(\hat\theta_T)|^2 < z^2) = P(W_T < z^2) = G(z^2) - c_1 G'(z^2) z^2 b - \frac{c_2}{2} [z^4 + (4q + 1) z^2] G'(z^2) b + (bT)^{-q_0} \mu_{1,1} G'(z^2) z^2 + s.o.  (13)

Comparing this with (11), the terms of order O(b) are seen to be exactly the same across the two expansions.
Let z^2 = F_\infty^{1-\alpha} be the (1 - \alpha) quantile from the distribution of F_\infty, i.e., P(F_\infty < F_\infty^{1-\alpha}) = 1 - \alpha. Then

G(F_\infty^{1-\alpha}) - c_1 G'(F_\infty^{1-\alpha}) F_\infty^{1-\alpha} b - \frac{c_2}{2} [(F_\infty^{1-\alpha})^2 + (4q + 1) F_\infty^{1-\alpha}] G'(F_\infty^{1-\alpha}) b = 1 - \alpha + o(b)

and so

P(W_T < F_\infty^{1-\alpha}) = 1 - \alpha + (bT)^{-q_0} \mu_{1,1} G'(F_\infty^{1-\alpha}) F_\infty^{1-\alpha} + s.o.
That is, using the critical value F_\infty^{1-\alpha} eliminates a term in the higher order distributional expansion of W_T under the conventional asymptotics. The nonstandard critical value F_\infty^{1-\alpha} is thus high order correct under the conventional asymptotics. In other words, the nonstandard approximation provides a high order refinement over the conventional chi-square approximation.

Although the high order refinement is established for the case of p = 1 and linear IV regressions, we expect it to be true more generally and for all three types of HAR variance estimators. All we need is a higher order Edgeworth expansion for the Wald statistic in a general GMM setting. In view of Sun and Phillips (2009), this is not difficult conceptually but can be technically very tedious.
5 The Fixed-smoothing Asymptotics under the Local Alternatives
In this section, we consider the fixed-smoothing asymptotics under a sequence of local alternatives

H_1: R \theta_0 = r + \delta_0/\sqrt{T},

where \delta_0 is a p \times 1 vector. This configuration, known as the Pitman drift, is widely used in local power analysis. Under the local alternatives, both \theta_0 and the data generating process depend on the sample size. For notational convenience, we have suppressed an extra T subscript on \theta_0.
If Assumptions 2-5 hold under the local alternatives, then we can use the same arguments as before and show that W_T(\hat\theta_T) \to^d F_{\infty,\delta_0} and t_T(\hat\theta_T) \to^d t_{\infty,\delta_0}, where

F_{\infty,\delta_0} = [R (G' W_\infty^{-1} G)^{-1} G' W_\infty^{-1} \Lambda B_m(1) + \delta_0]' [R (G' W_\infty^{-1} G)^{-1} R']^{-1} [R (G' W_\infty^{-1} G)^{-1} G' W_\infty^{-1} \Lambda B_m(1) + \delta_0]/p

and

t_{\infty,\delta_0} = [R (G' W_\infty^{-1} G)^{-1} G' W_\infty^{-1} \Lambda B_m(1) + \delta_0] / [R (G' W_\infty^{-1} G)^{-1} R']^{1/2},

with \Lambda \Lambda' = \Omega. Let

V = R (G' \Omega^{-1} G)^{-1} R', \quad \lambda = V^{-1/2} \delta_0,

and

F_\infty(||\lambda||^2) = [B_p(1) - C_{pq} C_{qq}^{-1} B_q(1) + \lambda]' D_{pp}^{-1} [B_p(1) - C_{pq} C_{qq}^{-1} B_q(1) + \lambda]/p,  (14)
t_\infty(\lambda) = [B_p(1) - C_{pq} C_{qq}^{-1} B_q(1) + \lambda]/\sqrt{D_{pp}} when p = 1.  (15)
The following theorem shows that F_{\infty,\delta_0} and F_\infty(||\lambda||^2) have the same distribution. Similarly, t_{\infty,\delta_0} and t_\infty(\lambda) have the same distribution.

Theorem 6 Let Assumption 1 hold with k(\cdot) being positive definite. In addition, let Assumptions 2-5 hold under H_1. Then, for a fixed h,
(a) W_T(\hat\theta_T) \to^d F_\infty(||\lambda||^2);
(b) t_T(\hat\theta_T) \to^d t_\infty(\lambda) when p = 1,
where (B_p'(1), B_q'(1))' is independent of C_{pq} C_{qq}^{-1} and D_{pp}.
In the spirit of Theorem 1, we can also approximate the distributions of W_T and t_T by their respective asymptotically equivalent distributions. To conserve space, we omit the details here. In addition, we can show that, under the assumptions of Theorem 6, D_T and S_T are asymptotically equivalent in distribution to W_T. Hence they also converge in distribution to F_\infty(||\lambda||^2). When \delta_0 = 0, Theorem 6 reduces to Theorems 1(b) and (d).

The proof of Theorem 6 shows that the right hand side of (14) depends on \lambda only through ||\lambda||^2. For this reason, we can write F_\infty(\cdot) as a function of ||\lambda||^2 only. But ||\lambda||^2 = \delta_0' V^{-1} \delta_0 is exactly the same as the noncentrality parameter in the noncentral chi-square distribution under the conventional increasing-smoothing asymptotics. See, for example, Hall (2005, p. 163). When h \to \infty, F_\infty(||\lambda||^2) converges in distribution to \chi^2_p(||\lambda||^2)/p. That is, the conventional limiting distribution can be recovered from F_\infty(||\lambda||^2). However, for a finite h, these two distributions can be substantially different.
For the Wald statistic based on the first-step estimator, it is not hard to show that for a fixed h,

W_T(\tilde\theta_T) \to^d \tilde F_\infty(||\tilde\lambda||^2) := [B_p(1) + \tilde\lambda]' C_{pp}^{-1} [B_p(1) + \tilde\lambda]/p,

where

\tilde V = R (G' W_{o,\infty}^{-1} G)^{-1} [G' W_{o,\infty}^{-1} \Omega W_{o,\infty}^{-1} G] (G' W_{o,\infty}^{-1} G)^{-1} R' and \tilde\lambda = \tilde V^{-1/2} \delta_0.  (16)

Similar to F_\infty(||\lambda||^2), the limiting distribution \tilde F_\infty(||\tilde\lambda||^2) depends on \tilde\lambda only through ||\tilde\lambda||^2. Note that \tilde V and V are the asymptotic variances of the first-step and two-step estimators in the conventional asymptotic framework. It follows from the standard GMM theory that ||\lambda||^2 \geq ||\tilde\lambda||^2. So in general the two-step test has a larger noncentrality parameter than the first-step test. This is a source of the power advantage of the two-step test.
To characterize the difference between ||\lambda||^2 and ||\tilde\lambda||^2, we let U_G \Sigma_G V_G' be a singular value decomposition of G, where

\Sigma_{G, m \times d} = (A_G', O_G')',

A_G is a nonsingular d \times d matrix, O_G is a q \times d matrix of zeros, and U_G and V_G are orthonormal matrices. We rotate the moment conditions and partition the rotated moment conditions U_G' f(v_t, \theta) according to

U_G' f(v_t, \theta) := (f_{U1}(v_t, \theta)', f_{U2}(v_t, \theta)')',  (17)

where f_{U1}(v_t, \theta) \in R^d and f_{U2}(v_t, \theta) \in R^q. Then the moment conditions in (1) are equivalent to the following rotated moment conditions:

E U_G' f(v_t, \theta) = 0, t = 1, 2, ..., T,  (18)

which hold if and only if \theta = \theta_0. The corresponding rotated weighting matrix and long run variance matrix are

W_{U,\infty} := U_G' W_{o,\infty} U_G = [W_{U,\infty}^{11}, W_{U,\infty}^{12}; W_{U,\infty}^{21}, W_{U,\infty}^{22}],
\Omega_U := U_G' \Omega U_G = [\Omega_U^{11}, \Omega_U^{12}; \Omega_U^{21}, \Omega_U^{22}],

where the matrices are partitioned conforming to the two blocks in (17).
Let B_{U,\infty} = (W_{U,\infty}^{22})^{-1} W_{U,\infty}^{21}. In the appendix we show that

\tilde V = R V_G A_G^{-1} [I_d, -B_{U,\infty}'] \Omega_U [I_d, -B_{U,\infty}']' (A_G^{-1})' V_G' R',  (19)
V = R V_G A_G^{-1} [I_d, -\Omega_U^{12} (\Omega_U^{22})^{-1}] \Omega_U [I_d, -\Omega_U^{12} (\Omega_U^{22})^{-1}]' (A_G^{-1})' V_G' R',

and

\tilde V - V = R V_G A_G^{-1} [(\Omega_U^{22})^{-1} \Omega_U^{21} - B_{U,\infty}]' \Omega_U^{22} [(\Omega_U^{22})^{-1} \Omega_U^{21} - B_{U,\infty}] (A_G^{-1})' V_G' R'.  (20)

So the difference between \tilde V and V depends on the difference between (\Omega_U^{22})^{-1} \Omega_U^{21} and B_{U,\infty}. While (\Omega_U^{22})^{-1} \Omega_U^{21} is the long run (population) regression matrix when f_{U1}(v_t, \theta_0) is regressed on f_{U2}(v_t, \theta_0), B_{U,\infty} can be regarded as the long run regression matrix implied by the weighting matrix W_{U,\infty}. When the two long run regression matrices are close to each other, \tilde V and V will also be close to each other. Otherwise, \tilde V and V can be very different.
The difference between \tilde V and V translates into the difference between the noncentrality parameters. To gain deeper insights, we let

f_{U1}^*(v_t, \theta) := V_G A_G^{-1} [f_{U1}(v_t, \theta) - B_{U,\infty}' f_{U2}(v_t, \theta)].

Note that the long run variance matrix of R f_{U1}^*(v_t, \theta_0) is \tilde V. It is easy to show that the GMM estimator based on the moment conditions E f_{U1}^*(v_t, \theta_0) = 0 has the same asymptotic distribution as the first-step estimator \tilde\theta_T. Hence E f_{U1}^*(v_t, \theta_0) = 0 can be regarded as the effective moment conditions behind \tilde\theta_T. Intuitively, \tilde\theta_T employs the moment conditions E f_{U1}^*(v_t, \theta_0) = 0 only while ignoring the second block of moment conditions E f_{U2}(v_t, \theta_0) = 0. In contrast, the two-step estimator \hat\theta_T employs both blocks of moment conditions. The relative merits of the first-step estimator and the two-step estimator depend on whether the second block of moment conditions contains additional information about \theta_0. To measure the information content, we introduce the long run correlation matrix between R f_{U1}^*(v_t, \theta_0) and f_{U2}(v_t, \theta_0):

\Gamma_{12} = \tilde V^{-1/2} R V_G A_G^{-1} (\Omega_U^{12} - B_{U,\infty}' \Omega_U^{22}) (\Omega_U^{22})^{-1/2},

where R V_G A_G^{-1} (\Omega_U^{12} - B_{U,\infty}' \Omega_U^{22}) is the long run covariance between R f_{U1}^*(v_t, \theta_0) and f_{U2}(v_t, \theta_0). Proposition 7 gives a characterization of ||\lambda||^2 - ||\tilde\lambda||^2 in terms of \Gamma_{12}.
Proposition 7 If \tilde V is nonsingular, then

||\lambda||^2 - ||\tilde\lambda||^2 = \delta_0' \tilde V^{-1/2} (I_p - \Gamma_{12} \Gamma_{12}')^{-1/2} \Gamma_{12} \Gamma_{12}' (I_p - \Gamma_{12} \Gamma_{12}')^{-1/2} \tilde V^{-1/2} \delta_0.

Let \Gamma_{12} = \sum_{i=1}^{p} \gamma_i a_i b_i' be the singular value decomposition of \Gamma_{12}, where \{a_i\} and \{b_i\} are orthonormal vectors in R^p and R^q respectively. Sorted in descending order, the \gamma_i^2 are the (squared) long run canonical correlation coefficients between R f_{U1}^*(v_t, \theta_0) and f_{U2}(v_t, \theta_0). Then

||\lambda||^2 - ||\tilde\lambda||^2 = \sum_{i=1}^{p} \frac{\gamma_i^2}{1 - \gamma_i^2} [a_i' \tilde V^{-1/2} \delta_0]^2.
Consider a special case that \max_{i=1}^{p} \gamma_i^2 = 0, i.e., \Gamma_{12} is a matrix of zeros. In this case, the second block of moment conditions contains no additional information, and we have ||\lambda||^2 = ||\tilde\lambda||^2. This case happens when B_{U,\infty} = (\Omega_U^{22})^{-1} \Omega_U^{21}, i.e., the implied long run regression coefficient is the same as the population long run regression coefficient. Consider another special case that \gamma_1^2 = \max_{i=1}^{p} \gamma_i^2 approaches 1. If a_1' \tilde V^{-1/2} \delta_0 \neq 0, then ||\lambda||^2 - ||\tilde\lambda||^2 approaches \infty as \gamma_1^2 approaches 1 from below. This case happens when the second block of moment conditions has very high long run prediction power for the first block.
There are many cases in-between. Depending on the magnitude of the long run canonical correlation coefficients and the direction of the local alternative hypothesis, ||\lambda||^2 - ||\tilde\lambda||^2 can range from 0 to \infty. We expect that, under the fixed-smoothing asymptotics, there is a threshold value c^* > 0 such that when ||\lambda||^2 - ||\tilde\lambda||^2 > c^*, the two-step test is asymptotically more powerful than the first-step test. Otherwise, the two-step test is asymptotically less powerful. The threshold value c^* should depend on h, p, q, ||\lambda||^2 and the significance level \alpha. It should be strictly greater than zero, as the two-step test entails the cost of estimating the optimal weighting matrix. We leave the threshold determination to future research.
6 Simulation Evidence and Empirical Application

6.1 Simulation Evidence
In this subsection, we study the finite sample performance of the fixed-smoothing approximations to the Wald statistic. We consider the following data generating process:

y_t = x_{0,t} \alpha + x_{1,t} \theta_1 + x_{2,t} \theta_2 + x_{3,t} \theta_3 + \varepsilon_{y,t},

where x_{0,t} \equiv 1 and x_{1,t}, x_{2,t} and x_{3,t} are scalar regressors that are correlated with \varepsilon_{y,t}. The dimension of the unknown parameter \theta = (\alpha, \theta_1, \theta_2, \theta_3)' is d = 4. We have m instruments z_{0,t}, z_{1,t}, ..., z_{m-1,t} with z_{0,t} \equiv 1. The reduced-form equations for x_{1,t}, x_{2,t} and x_{3,t} are given by

x_{j,t} = z_{j,t} + \sum_{i=d-1}^{m-1} z_{i,t} + \varepsilon_{x_j,t} for j = 1, 2, 3.
We assume that z_{i,t} for i \geq 1 follows either an AR(1) process

z_{i,t} = \rho z_{i,t-1} + \sqrt{1 - \rho^2} e_{z_i,t}

or an MA(1) process

z_{i,t} = \rho e_{z_i,t-1} + \sqrt{1 - \rho^2} e_{z_i,t},

where

e_{z_i,t} = (e_{z,t}^i + e_{z,t}^0)/\sqrt{2}

and e_t = (e_{z,t}^0, e_{z,t}^1, ..., e_{z,t}^{m-1})' \sim iid N(0, I_m). By construction, the variance of z_{i,t} for any i = 1, 2, ..., m - 1 is 1. Due to the presence of the common shocks e_{z,t}^0, the correlation coefficient between the non-constant z_{i,t} and z_{j,t} for i \neq j is 0.5. The DGP for \varepsilon_t = (\varepsilon_{y,t}, \varepsilon_{x_1,t}, \varepsilon_{x_2,t}, \varepsilon_{x_3,t})' is the same as that for (z_{1,t}, ..., z_{m-1,t}) except for the dimensionality difference. The two vector processes \varepsilon_t and (z_{1,t}, ..., z_{m-1,t}) are independent of each other.
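The instrument process above is straightforward to generate directly. The sketch below (NumPy; the function name is ours, and m and rho are illustrative) builds the AR(1) variant with the common-shock structure that induces the 0.5 cross-correlation:

```python
import numpy as np

def simulate_instruments(T, m, rho, rng=None, burn=50):
    """AR(1) instruments z_{i,t} = rho*z_{i,t-1} + sqrt(1-rho^2)*e_{z_i,t},
    with e_{z_i,t} = (e^i_{z,t} + e^0_{z,t})/sqrt(2), so Var(z_{i,t}) = 1 and
    Corr(z_{i,t}, z_{j,t}) = 0.5 for i != j among the non-constant instruments."""
    rng = np.random.default_rng(rng)
    e = rng.standard_normal((T + burn, m))        # columns: e^0, e^1, ..., e^{m-1}
    ez = (e[:, 1:] + e[:, [0]]) / np.sqrt(2)      # common-shock innovations
    z = np.empty_like(ez)
    z[0] = ez[0]                                  # unit-variance initialization
    for t in range(1, T + burn):
        z[t] = rho * z[t - 1] + np.sqrt(1 - rho**2) * ez[t]
    return np.column_stack([np.ones(T), z[burn:]])   # prepend z_{0,t} = 1
```

The scaling sqrt(1 - rho^2) keeps each instrument's variance at exactly 1 for any rho, so the AR(1) designs with different rho are comparable.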
In the notation of this paper, we have f(v_t, \theta) = z_t (y_t - x_t' \theta), where v_t = (y_t, x_t', z_t')', x_t = (1, x_{1t}, x_{2t}, x_{3t})', and z_t = (1, z_{1t}, ..., z_{m-1,t})'. We take \rho = -0.8, -0.5, 0.0, 0.5, 0.8 and 0.9. We consider m = 4, 5, 6 and the corresponding degrees of over-identification are q = 0, 1, 2. The null hypotheses of interest are

H_{01}: \theta_1 = 0; H_{02}: \theta_1 = \theta_2 = 0; H_{03}: \theta_1 = \theta_2 = \theta_3 = 0,

where p = 1, 2 and 3 respectively. The corresponding matrix R consists of rows 2 through p + 1 of the identity matrix I_d. We consider two significance levels \alpha = 5% and \alpha = 10% and three different sample sizes T = 100, 200, 500. The number of simulation replications is 10000.
We first examine the finite sample size accuracy of different tests using the OS HAR variance estimator. The tests are based on the same Wald test statistic, so they have the same size-adjusted power. The difference lies in the reference distributions or critical values used. We employ the following critical values: \chi_p^{1-\alpha}/p, \frac{K}{K-p+1} F_{p, K-p+1}^{1-\alpha}, \frac{K}{K-p-q+1} F_{p, K-p-q+1}^{1-\alpha}(\bar\delta^2) with \bar\delta^2 = pq/(K - q - 1), and F_\infty^{1-\alpha}, leading to the \chi^2 test, the CF (central F) test, the NCF (noncentral F) test and the nonstandard F_\infty test. The \chi^2 test uses the conventional chi-square approximation. The CF test uses the fixed-smoothing approximation for the Wald statistic based on a first-step GMM estimator. Alternatively, the CF test uses the fixed-smoothing approximation with q = 0. The NCF test uses the noncentral F approximation given in Theorem 3. The F_\infty test uses the nonstandard limiting distribution F_\infty with simulated critical values. For each test, the initial first step estimator is the IV estimator with weight matrix W_o = Z'Z/T, where Z is the matrix of the observed instruments.
To speed up the computation, we assume that K is even and use the basis functions \Phi_{2j-1}(x) = \sqrt{2} \cos 2\pi j x, \Phi_{2j}(x) = \sqrt{2} \sin 2\pi j x, j = 1, ..., K/2. In this case, the OS HAR variance estimator can be computed using discrete Fourier transforms. The OS HAR estimator is a simple average of periodograms. We select K based on the AMSE criterion implemented using the VAR(1) plug-in procedure in Phillips (2005). For completeness, we reproduce the MSE optimal formula for K here:

K_{MSE} = 2 \lceil 0.5 ( \frac{tr[(I_{m^2} + K_{mm})(\Omega \otimes \Omega)]}{4 vec(B)' vec(B)} )^{1/5} T^{4/5} \rceil,

where \lceil \cdot \rceil is the ceiling function, K_{mm} is the m^2 \times m^2 commutation matrix and

B = -\frac{\pi^2}{6} \sum_{j=-\infty}^{\infty} j^2 E f(v_t, \theta_0) f(v_{t-j}, \theta_0)'.  (21)

We compute K_{MSE} on the basis of the initial first step estimator \tilde\theta_T and use it in computing both W_T(\tilde\theta_T) and W_T(\hat\theta_T).
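The OS variance estimator with the Fourier basis above can be sketched directly. The minimal NumPy implementation below uses the projection form \hat\Omega = K^{-1} \sum_j \Lambda_j \Lambda_j' with \Lambda_j = T^{-1/2} \sum_t \Phi_j(t/T) f_t rather than the DFT shortcut mentioned in the text (the function name is ours); since each \Phi_j sums to zero over the grid, demeaning f_t is not needed:

```python
import numpy as np

def os_hac(F, K):
    """Orthonormal-series LRV estimator from the T x m moment matrix F (rows f_t),
    with even K and the Fourier basis Phi_{2j-1}(x) = sqrt(2) cos(2 pi j x),
    Phi_{2j}(x) = sqrt(2) sin(2 pi j x), j = 1, ..., K/2."""
    T, m = F.shape
    assert K % 2 == 0
    x = np.arange(1, T + 1) / T
    omega = np.zeros((m, m))
    for j in range(1, K // 2 + 1):
        for phi in (np.sqrt(2) * np.cos(2 * np.pi * j * x),
                    np.sqrt(2) * np.sin(2 * np.pi * j * x)):
            lam = F.T @ phi / np.sqrt(T)      # the m-vector Lambda_j
            omega += np.outer(lam, lam)       # rank-one periodogram term
    return omega / K                          # simple average of K terms
```

By construction the estimate is positive semidefinite, and for iid data it is approximately distributed as a Wishart W_m(K, \Omega)/K, mirroring the fixed-K theory of Section 4.1.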
Table 1 gives the empirical size of the different tests for the AR(1) case with sample size T = 100. The nominal size of the tests is \alpha = 5%. The results for \alpha = 10% are qualitatively similar. First, as is clear from the table, the chi-square test can have a large size distortion. The size distortion can be very severe. For example, when \rho = 0.9, p = 3 and q = 2, the empirical size of the chi-square test can be as high as 71.2%, which is far from 5%, the nominal size of the test. Second, the size distortion of the CF test is substantially smaller than that of the chi-square test when the degree of over-identification is small. This is because the CF test employs the asymptotic approximation that partially captures the estimation uncertainty of the HAR variance estimator. Third, the empirical size of the NCF test is nearly the same as that of the nonstandard F_\infty test. This is consistent with Figure 1. This result provides further evidence that the noncentral F distribution approximates the nonstandard F_\infty distribution very well, at least for the 95% quantile. Finally, among the four tests, the NCF and F_\infty tests have the most accurate size. For the two-step GMM estimator, the HAR variance estimator appears in two different places and plays two different roles: first as the inverse of the optimal weighting matrix and then as part of the asymptotic variance estimator for the GMM estimator. The nonstandard F_\infty approximation and the noncentral F approximation attempt to capture the estimation uncertainty of the HAR variance estimator in both places. In contrast, a crucial step underlying the central F approximation is that the HAR variance estimator is treated as consistent when it acts as the optimal weighting matrix. As a result, the central F approximation does not adequately capture the estimation uncertainty of the HAR variance estimator. This explains why the NCF and F_\infty tests have more accurate size than the CF test.
Table 2 presents the simulated empirical size for the MA(1) case. The qualitative observations for the AR(1) case remain valid. The chi-square test is the most size distorted. The NCF and F_\infty tests are the least size distorted. The CF test is in between. As before, the size distortion increases with the serial dependence, the number of joint hypotheses, and the degree of over-identification. This table provides further evidence that the noncentral F distribution and the nonstandard F_\infty distribution provide more accurate approximations to the sampling distribution of the Wald statistic.

Tables 3 and 4 report results for the case when the sample size is 200. Compared to the cases with sample size 100, all tests become more accurate in size. The size accuracy improves further when the sample size increases to 500. This is well expected. In terms of size accuracy, the NCF test and the F_\infty test are close to each other. They dominate the CF test, which in turn dominates the chi-square test.
Next, we consider the empirical size of different tests based on kernel HAR variance estimators. Both the contracted kernel approach and the exponentiated kernel approach are considered, but we report only the commonly-used contracted kernel approach as the results are qualitatively similar. We employ three kernels: the Bartlett, Parzen and QS kernels. These three kernels are positive (semi)definite. For each kernel, we use the data-driven AMSE optimal bandwidth and its VAR(1) plug-in implementation from Andrews (1991). For each Wald statistic, three critical values are used: \chi_p^{1-\alpha}/p, F_\infty^{1-\alpha}[0], and F_\infty^{1-\alpha}[q], where F_\infty^{1-\alpha}[q] is the (1 - \alpha) quantile from the nonstandard F_\infty distribution with degree of over-identification q. F_\infty^{1-\alpha}[0] coincides with the critical value from the fixed-smoothing asymptotic distribution derived for the first-step IV estimator.

To save space, we only report the results for the Parzen kernel. Tables 5-8 correspond to Tables 1-4. It is clear that the qualitative results exhibited in Tables 1-4 continue to apply. According to the size accuracy, the F_\infty[q] test dominates the F_\infty[0] test, which in turn dominates the conventional \chi^2 test.
6.2 Empirical Application
To illustrate the fixed-smoothing approximations, we consider the log-normal stochastic volatility model of the form:

r_t = \sigma_t Z_t,
\log \sigma_t^2 = \omega + \beta (\log \sigma_{t-1}^2 - \omega) + \sigma_u u_t,

where r_t is the rate of return and (Z_t, u_t)' is iid N(0, I_2). The first equation specifies the distribution of the return as heteroscedastic normal. The second equation specifies the dynamics of the log-volatility as an AR(1). The parameter vector is \theta = (\omega, \beta, \sigma_u)'. We impose the restriction that \beta \in (0, 1), which is an empirically relevant range. The model and the parameter restriction are the same as those considered by Andersen and Sorensen (1996), which gives a detailed discussion on the motivation of the stochastic volatility models and the GMM approach. For more discussions, see Ghysels, Harvey and Renault (1996) and references therein.
We employ GMM to estimate the log-normal stochastic volatility model. The data are weekly returns of S&P 500 stocks, which are constructed by compounding daily returns with dividends from a CRSP index file. We consider both value-weighted returns (vwretd) and equal-weighted returns (ewretd). The weekly returns range from the first week of 2001 to the last week of 2012, with sample size T = 627. We use weekly data in order to minimize problems associated with daily data such as asynchronous trading and bid-ask bounce. This is consistent with Jacquier, Polson and Rossi (1994).
The GMM approach relies on functions of the time series {r_t} to identify the parameters of the model. For the log-normal stochastic volatility model, we can obtain the following moment conditions:

E|r_t|^ℓ = c_ℓ E(σ_t^ℓ) for ℓ = 1, 2, 3, 4 with (c₁, c₂, c₃, c₄) = (√(2/π), 1, 2√(2/π), 3),
E|r_t r_{t−j}| = (2/π) E(σ_t σ_{t−j}) and E(r_t² r_{t−j}²) = E(σ_t² σ_{t−j}²) for j = 1, 2, ...,

where

E(σ_t^ℓ) = exp( ωℓ/2 + ℓ² σ_u² / (8(1 − β²)) ),
E(σ_t^{ℓ₁} σ_{t−j}^{ℓ₂}) = E(σ_t^{ℓ₁}) E(σ_t^{ℓ₂}) exp( σ_u² ℓ₁ ℓ₂ β^j / (4(1 − β²)) ).
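The analytical moment formulas above follow from the log-normality of σ_t² and can be checked by simulation. The sketch below is an illustration only: the parameter values ω = −1, β = 0.5, σ_u = 0.5 are chosen for Monte Carlo speed and are not the paper's estimates.

```python
import math
import random

random.seed(1)

omega, beta, sigma_u = -1.0, 0.5, 0.5   # illustrative parameters, not the paper's
T = 200_000

# Analytical moments: log sigma_t^2 ~ N(omega, sigma_u^2/(1 - beta^2)), hence
# E sigma_t^ell = exp(ell*omega/2 + ell^2 * sigma_u^2 / (8*(1 - beta^2))).
v = sigma_u**2 / (1.0 - beta**2)
E_sigma = lambda ell: math.exp(ell * omega / 2.0 + ell**2 * v / 8.0)
c = {1: math.sqrt(2.0 / math.pi), 2: 1.0}

# Simulate r_t = sigma_t * Z_t with AR(1) log-volatility.
h, r = omega, []
for _ in range(T):
    h = omega + beta * (h - omega) + sigma_u * random.gauss(0.0, 1.0)
    r.append(math.exp(h / 2.0) * random.gauss(0.0, 1.0))

results = {}
for ell in (1, 2):
    analytic = c[ell] * E_sigma(ell)
    simulated = sum(abs(x) ** ell for x in r) / T
    results[ell] = (analytic, simulated)
    print(f"ell={ell}: analytic {analytic:.4f}, simulated {simulated:.4f}")
```

The simulated and analytical values of E|r_t| and E r_t² agree up to Monte Carlo error, confirming the closed-form expressions used to build the moment conditions.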
Higher order moments can be computed, but we choose to focus on a subset of lower order moments. Andersen and Sorensen (1996) point out that it is generally not optimal to include too many moment conditions when the sample is limited. On the other hand, it is not advisable to include only as many moment conditions as there are parameters. When T = 500 and θ = (−7.36, 0.90, 0.363), an empirically relevant parameter vector, Table 1 in Andersen and Sorensen (1996) shows that it is MSE-optimal to employ nine moment conditions. For this reason, we employ two sets of nine moment conditions given in the appendix of Andersen and Sorensen (1996). The baseline set of nine moment conditions is Ef_i(r_t; θ) = 0 for i = 1, ..., 9
with

f_i(r_t; θ) = |r_t|^i − c_i E(σ_t^i), i = 1, ..., 4,
f_5(r_t; θ) = |r_t r_{t−1}| − (2/π) E(σ_t σ_{t−1}),
f_6(r_t; θ) = |r_t r_{t−3}| − (2/π) E(σ_t σ_{t−3}),
f_7(r_t; θ) = |r_t r_{t−5}| − (2/π) E(σ_t σ_{t−5}),
f_8(r_t; θ) = r_t² r_{t−2}² − E(σ_t² σ_{t−2}²),
f_9(r_t; θ) = r_t² r_{t−4}² − E(σ_t² σ_{t−4}²).   (22)
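The baseline moment vector maps a return series and a candidate θ into nine sample moments that should average to zero at the true parameters. The sketch below is an illustration only (the helper names and the parameter values are illustrative, not from the paper): it evaluates the population side of (22) in closed form and checks that the sample counterparts computed on simulated data are close.

```python
import math
import random

random.seed(2)

SQ2PI = math.sqrt(2.0 / math.pi)

def population_moments(theta):
    """The nine population moments matched by the baseline set (22)."""
    omega, beta, su = theta
    v = su**2 / (1.0 - beta**2)                     # Var(log sigma_t^2)
    Es = lambda l: math.exp(l * omega / 2.0 + l * l * v / 8.0)
    Ecross = lambda l1, l2, j: Es(l1) * Es(l2) * math.exp(l1 * l2 * beta**j * v / 4.0)
    c = [SQ2PI, 1.0, 2.0 * SQ2PI, 3.0]
    pop = [c[i - 1] * Es(i) for i in (1, 2, 3, 4)]
    pop += [(2.0 / math.pi) * Ecross(1, 1, j) for j in (1, 3, 5)]
    pop += [Ecross(2, 2, j) for j in (2, 4)]
    return pop

# Simulate the model at the true theta (illustrative values, not the paper's).
theta = (-1.0, 0.5, 0.5)
omega, beta, su = theta
T = 200_000
h, r = omega, []
for _ in range(T):
    h = omega + beta * (h - omega) + su * random.gauss(0.0, 1.0)
    r.append(math.exp(h / 2.0) * random.gauss(0.0, 1.0))

# Sample counterparts of the nine moments, using lags up to 5.
n = T - 5
samp = [sum(abs(r[t]) ** i for t in range(5, T)) / n for i in (1, 2, 3, 4)]
samp += [sum(abs(r[t] * r[t - j]) for t in range(5, T)) / n for j in (1, 3, 5)]
samp += [sum((r[t] * r[t - j]) ** 2 for t in range(5, T)) / n for j in (2, 4)]

# The sample mean of f_i(r_t; theta) is samp[i] - pop[i]; near zero at the true theta.
fbar = [s - p for s, p in zip(samp, population_moments(theta))]
print([round(x, 3) for x in fbar])
```

In an actual GMM implementation these nine averages, evaluated at candidate θ values, would enter the two-step criterion function.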
The alternative set of nine moment conditions is Ef_i(r_t; θ) = 0 for i = 1, ..., 9 with

f_i(r_t; θ) = |r_t|^i − c_i E(σ_t^i), i = 1, ..., 4,
f_5(r_t; θ) = |r_t r_{t−2}| − (2/π) E(σ_t σ_{t−2}),
f_6(r_t; θ) = |r_t r_{t−4}| − (2/π) E(σ_t σ_{t−4}),
f_7(r_t; θ) = |r_t r_{t−6}| − (2/π) E(σ_t σ_{t−6}),
f_8(r_t; θ) = r_t² r_{t−1}² − E(σ_t² σ_{t−1}²),
f_9(r_t; θ) = r_t² r_{t−3}² − E(σ_t² σ_{t−3}²).   (23)
In each case m = 9 and d = 3, and so the degree of over-identification is q = m − d = 6.
We focus on constructing 90% and 95% confidence intervals (CIs) for β. Given the high nonlinearity of the moment conditions, we invert the GMM distance statistic to obtain the CIs. The CIs so obtained are invariant to reparametrization of the model parameters. For example, a 95% confidence interval is the set of β values in (0, 1) that the D_T test does not reject at the 5% level. We search over the grid from 0.01 to 0.99 with increment 0.01 to invert the D_T test. As in the simulation study, we employ three different critical values: χ²_{p,1−α}/p, F∞^{1−α}[0], and F∞^{1−α}[q], which correspond to three different asymptotic approximations. For the series LRV estimator, F∞^{1−α}[0] is a critical value from the corresponding F approximation.
Before presenting the empirical results, we note that there is no 'hole' in the CIs we obtain: the values of β that are not rejected form an arithmetic sequence. Given this, we report only the minimum and maximum values of β over the grid that are not rejected. It turns out the maximum value is always equal to the upper bound 0.99. It thus suffices to report the minimum value, which we take as the lower limit of the CI.
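The grid inversion just described is mechanical: a value β₀ enters the confidence set when the distance statistic evaluated at β₀ stays below the critical value. A minimal sketch, in which a toy quadratic `stat` stands in for the actual GMM distance statistic D_T:

```python
def invert_test(stat, crit, grid):
    """Collect grid points where the test does not reject; return (lower, upper)."""
    kept = [b for b in grid if stat(b) <= crit]
    return (min(kept), max(kept)) if kept else (None, None)

# Toy distance statistic: quadratic in beta with minimum at 0.7 (illustrative).
stat = lambda b: ((b - 0.7) / 0.1) ** 2
crit = 3.84  # 5% chi-square(1) critical value, for illustration

grid = [i / 100 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99
lo, hi = invert_test(stat, crit, grid)
print(lo, hi)  # → 0.51 0.89
```

With a larger fixed-smoothing critical value in place of 3.84, fewer grid points are rejected and the resulting interval widens, which is exactly the pattern reported in Table 9.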
Table 9 presents the lower limits of different 95% CIs together with the smoothing parameters used. The smoothing parameters are selected in the same way as in the simulation study. Since the selected K values are relatively large and the selected b values are relatively small, the CIs based on the χ² approximation and the F∞[0] approximation are close to each other. However, there is still a noticeable difference between the CIs based on the χ² approximation and the nonstandard F∞[q] approximation. This is especially true for the case with the baseline moment conditions, the Bartlett kernel, and equally-weighted returns. Taking into account the randomness of the estimated weighting matrix leads to wider CIs. The log-volatility may not be as persistent as previously thought.
7 Conclusion
The paper has developed the fixed-smoothing asymptotics for heteroskedasticity and autocorrelation robust inference in a two-step GMM framework. We have shown that the conventional chi-square test, which ignores the estimation uncertainty of the GMM weighting matrix and the covariance estimator, can have very large size distortion. This is especially true when the number of joint hypotheses being tested and the degree of over-identification are high or the underlying processes are persistent. The test based on our new fixed-smoothing approximation reduces the size distortion substantially and is thus recommended for practical use.
There are a number of interesting extensions. First, given that our proof uses only a CLT, the results of the paper can be extended easily to spatial, spatial-temporal, or panel data settings. See Kim and Sun (2013) for an extension along this line. Second, in the Monte Carlo experiments, we use the conventional MSE criterion to select the smoothing parameters. We could have used the methods proposed in Sun and Phillips (2009), Sun, Phillips and Jin (2008) or Sun (2014) that are tailored toward confidence interval construction or hypothesis testing. But in their present forms these methods are either available only for the t-test or work only in the first-step GMM framework. It would be interesting to extend them to more general tests in the two-step GMM framework. Third, we may employ the asymptotically equivalent distribution to conduct more reliable inference. Simulation results not reported here show that the size difference between using the asymptotically equivalent distribution and the limiting distribution is negligible. However, it is easy to replace the iid process in the asymptotically equivalent distribution F̃_T by a dependent process that mimics the degree of autocorrelation in the moment process. Such a replacement may lead to even more accurate tests and is similar in spirit to Zhang and Shao (2013), who establish the high order refinement of a Gaussian bootstrap procedure under the fixed-smoothing asymptotics. Finally, the general results of the paper can be extended to a nonparametric sieve GMM framework. See Chen, Liao and Sun (2014) for a recent development on autocorrelation robust sieve inference for time series models.
Table 1: Empirical size of the nominal 5% χ² test, F test, noncentral F test and nonstandard F∞ test based on the OS HAR variance estimator for the AR(1) case with T = 100, number of joint hypotheses p and number of overidentifying restrictions q.
[Table body not reliably recoverable from the extraction: for each panel (p, q) with p = 1, 2, 3 and q = 0, 1, 2, the table reports the empirical sizes of the χ², CF, NCF and F∞ tests for ρ = −0.8, −0.5, 0.0, 0.5, 0.8, 0.9.]
Table 2: Empirical size of the nominal 5% χ² test, F test, noncentral F test and nonstandard F∞ test based on the series LRV estimator for the MA(1) case with T = 100, number of joint hypotheses p and number of overidentifying restrictions q.
[Table body not reliably recoverable from the extraction: for each panel (p, q) with p = 1, 2, 3 and q = 0, 1, 2, the table reports the empirical sizes of the χ², CF, NCF and F∞ tests for the six parameter values −0.8, −0.5, 0.0, 0.5, 0.8, 0.9.]
Table 3: Empirical size of the nominal 5% χ² test, F test, noncentral F test and nonstandard F∞ test based on the series LRV estimator for the AR(1) case with T = 200, number of joint hypotheses p and number of overidentifying restrictions q.
[Table body not reliably recoverable from the extraction: for each panel (p, q) with p = 1, 2, 3 and q = 0, 1, 2, the table reports the empirical sizes of the χ², CF, NCF and F∞ tests for ρ = −0.8, −0.5, 0.0, 0.5, 0.8, 0.9.]
Table 4: Empirical size of the nominal 5% χ² test, F test, noncentral F test and nonstandard F∞ test based on the series LRV estimator for the MA(1) case with T = 200, number of joint hypotheses p and number of overidentifying restrictions q.
[Table body not reliably recoverable from the extraction: for each panel (p, q) with p = 1, 2, 3 and q = 0, 1, 2, the table reports the empirical sizes of the χ², CF, NCF and F∞ tests for the six parameter values −0.8, −0.5, 0.0, 0.5, 0.8, 0.9.]
Table 5: Empirical size of the nominal 5% χ² test, the F∞[0] test and the F∞[q] test based on the Parzen kernel LRV estimator for the AR(1) case with T = 100, number of joint hypotheses p and number of overidentifying restrictions q.
[Table body not reliably recoverable from the extraction: for each panel (p, q) with p = 1, 2, 3 and q = 0, 1, 2, the table reports the empirical sizes of the χ², F∞[0] and F∞[q] tests for ρ = −0.8, −0.5, 0.0, 0.5, 0.8, 0.9.]
Table 6: Empirical size of the nominal 5% χ² test, the F∞[0] test and the F∞[q] test based on the Parzen kernel LRV estimator for the MA(1) case with T = 100, number of joint hypotheses p and number of overidentifying restrictions q.
[Table body not reliably recoverable from the extraction: for each panel (p, q) with p = 1, 2, 3 and q = 0, 1, 2, the table reports the empirical sizes of the χ², F∞[0] and F∞[q] tests for the six parameter values −0.8, −0.5, 0.0, 0.5, 0.8, 0.9.]
Table 7: Empirical size of the nominal 5% χ² test, the F∞[0] test and the F∞[q] test based on the Parzen kernel LRV estimator for the AR(1) case with T = 200, number of joint hypotheses p and number of overidentifying restrictions q.
[Table body not reliably recoverable from the extraction: for each panel (p, q) with p = 1, 2, 3 and q = 0, 1, 2, the table reports the empirical sizes of the χ², F∞[0] and F∞[q] tests for ρ = −0.8, −0.5, 0.0, 0.5, 0.8, 0.9.]
Table 8: Empirical size of the nominal 5% χ² test, the F∞[0] test and the F∞[q] test based on the Parzen kernel LRV estimator for the MA(1) case with T = 200, number of joint hypotheses p and number of overidentifying restrictions q.
[Table body not reliably recoverable from the extraction: for each panel (p, q) with p = 1, 2, 3 and q = 0, 1, 2, the table reports the empirical sizes of the χ², F∞[0] and F∞[q] tests for the six parameter values −0.8, −0.5, 0.0, 0.5, 0.8, 0.9.]
Table 9: Lower limits of different 95% CIs for β and data-driven smoothing parameters in the log-normal stochastic volatility model

                    Equally-weighted Return      Value-weighted Return
                    Baseline    Alternative      Baseline    Alternative
Series χ²           0.74        0.58             0.78        0.62
Series F∞[0]        0.73        0.57             0.78        0.62
Series F∞[q]        0.70        0.53             0.76        0.57
K                   (140)       (138)            (140)       (140)
Bartlett χ²         0.56        0.52             0.78        0.60
Bartlett F∞[0]      0.54        0.52             0.77        0.60
Bartlett F∞[q]      0.35        0.42             0.75        0.53
b                   (.0155)     (.0157)          (0.155)     (.0155)
Parzen χ²           0.72        0.52             0.78        0.50
Parzen F∞[0]        0.73        0.52             0.78        0.50
Parzen F∞[q]        0.69        0.47             0.77        0.43
b                   (.0136)     (.0139)          (.0136)     (.0136)
QS χ²               0.74        0.52             0.78        0.52
QS F∞[0]            0.74        0.52             0.78        0.52
QS F∞[q]            0.71        0.48             0.77        0.46
b                   (.0068)     (.0069)          (.0067)     (.0067)

Baseline: CI using the baseline 9 moment conditions given in (22);
Alternative: CI using the alternative 9 moment conditions given in (23);
Series χ²: the CI constructed based on the OS HAR variance estimator and using χ² as the reference distribution. Other row titles are similarly defined.
8 Appendix of Proofs

8.1 Technical Lemmas

Consider two stochastically bounded sequences of random variables {ξ_T ∈ R^p} and {η_T ∈ R^p}. Given the stochastic boundedness, there exists a nondecreasing mapping s(T): N → N such that ξ_{s(T)} →_d ξ_∞ and η_{s(T)} →_d η_∞ for some random vectors ξ_∞ and η_∞. Let C(ξ_∞, η_∞) be the set of points at which the CDFs of ξ_∞ and η_∞ are both continuous. The following lemma provides a characterization of asymptotic equivalence in distribution. The proof is similar to that of the Lévy-Cramér continuity theorem and is omitted here.

Lemma 2 The following two statements are equivalent:
(a) P(ξ_{s(T)} ≤ x) = P(η_{s(T)} ≤ x) + o(1) for all x ∈ C(ξ_∞, η_∞);
(b) ξ_{s(T)} ∼_a η_{s(T)}.
Lemma 3 If (ξ_{1T}, η_{1T}) converges weakly to (ξ, η) and

P(ξ_{1T} ≤ x, η_{1T} ≤ y) = P(ξ_{2T} ≤ x, η_{2T} ≤ y) + o(1) as T → ∞   (24)

for all continuity points (x, y) of the CDF of (ξ, η), then

g(ξ_{1T}, η_{1T}) ∼_a g(ξ_{2T}, η_{2T}),

where g(·,·) is continuous on a set C such that P((ξ, η) ∈ C) = 1.

Proof of Lemma 3. Since (ξ_{1T}, η_{1T}) converges weakly, we can apply Lemma 2 with s(T) = T. It follows from the condition in (24) that Ef(ξ_{1T}, η_{1T}) − Ef(ξ_{2T}, η_{2T}) → 0 for any f ∈ BC. But Ef(ξ_{1T}, η_{1T}) → Ef(ξ, η), and so Ef(ξ_{2T}, η_{2T}) → Ef(ξ, η) for any f ∈ BC. That is, (ξ_{2T}, η_{2T}) also converges weakly to (ξ, η). Using the same argument as in the proof of the continuous mapping theorem, we have Ef(g(ξ_{1T}, η_{1T})) − Ef(g(ξ_{2T}, η_{2T})) → 0 for any f ∈ BC. Therefore g(ξ_{1T}, η_{1T}) ∼_a g(ξ_{2T}, η_{2T}).
Lemma 4 Suppose ω_T = ξ_{T,J} + η_{T,J} and ω_T does not depend on J. Assume that there exist ζ_{T,J} and ζ_T such that
(i) P(ξ_{T,J} < δ) − P(ζ_{T,J} < δ) = o(1) for each finite J and each δ ∈ R as T → ∞;
(ii) P(ζ_{T,J} < δ) − P(ζ_T < δ) = o(1) uniformly over sufficiently large T for each δ ∈ R as J → ∞;
(iii) the CDF of ζ_T is equicontinuous on R when T is sufficiently large; and
(iv) for every Δ > 0, sup_T P(|η_{T,J}| > Δ) → 0 as J → ∞.
Then

P(ω_T < δ) = P(ζ_T < δ) + o(1) for each δ ∈ R as T → ∞.
Proof of Lemma 4. Let ε > 0 and δ ∈ R. Under condition (iii), we can find a Δ := Δ(ε) > 0 such that, for some integer T_min > 0,

P(δ − Δ ≤ ζ_T < δ + Δ) ≤ ε for all T ≥ T_min.

Here T_min does not depend on δ or ε. Under condition (iv), we can find a J_min := J_min(ε) that does not depend on T such that

P(|η_{T,J}| > Δ) ≤ ε for all J ≥ J_min and all T.

From condition (ii), we can find a J′_min ≥ J_min and a T′_min ≥ T_min such that

|P(ζ_{T,J} < δ ± Δ) − P(ζ_T < δ ± Δ)| ≤ ε

for all J ≥ J′_min and all T ≥ T′_min. It follows from condition (i) that for any finite J₀ ≥ J′_min, there exists a T″_min(J₀) ≥ T′_min such that

|P(ξ_{T,J₀} < δ + Δ) − P(ζ_{T,J₀} < δ + Δ)| ≤ ε and |P(ξ_{T,J₀} < δ − Δ) − P(ζ_{T,J₀} < δ − Δ)| ≤ ε

for T ≥ T″_min(J₀). When T ≥ T″_min(J₀), we have

P(ω_T ≤ δ) = P(ξ_{T,J₀} + η_{T,J₀} ≤ δ)
≤ P(ξ_{T,J₀} ≤ δ + Δ) + P(|η_{T,J₀}| > Δ)
≤ P(ζ_{T,J₀} ≤ δ + Δ) + 2ε
≤ P(ζ_T < δ + Δ) + 3ε
≤ P(ζ_T < δ) + 4ε.

Similarly,

P(ω_T ≤ δ) ≥ P(ξ_{T,J₀} ≤ δ − Δ) − P(|η_{T,J₀}| > Δ)
≥ P(ζ_{T,J₀} ≤ δ − Δ) − 2ε
≥ P(ζ_T < δ − Δ) − 3ε
≥ P(ζ_T < δ) − 4ε.

Since the above two inequalities hold for all ε > 0, we must have P(ω_T < δ) = P(ζ_T < δ) + o(1) as T → ∞.

8.2 Proof of the Main Results
Proof of Lemma 1. Part (a). Let

e_T(s) = (1/T) Σ_{r=1}^T Q_h(r/T, s/T) − ∫₀¹ Q_h(r, s/T) dr.

Then

||W_T(θ̃_T) − W̃_T(θ̃_T)|| ≤ 2 ||(1/√T) Σ_{t=1}^T f(v_t, θ̃_T)|| · ||(1/√T) Σ_{t=1}^T e_T(t) f(v_t, θ̃_T)|| + O(1/T) ||(1/√T) Σ_{t=1}^T f(v_t, θ̃_T)||²,   (25)

where the O(1/T) term arises because, under Assumption 1, for any fixed h:

(1/T²) Σ_{τ=1}^T Σ_{r=1}^T Q_h(τ/T, r/T) − ∫₀¹∫₀¹ Q_h(τ, r) dτ dr = O(1/T).

We proceed to evaluate the orders of the terms in the above upper bound. First, for some θ̄_T between θ̃_T and θ₀, we have

(1/√T) Σ_{t=1}^T f(v_t, θ̃_T) = (1/√T) Σ_{t=1}^T f(v_t, θ₀) + G_T(θ̄_T) √T(θ̃_T − θ₀) = O_p(1)   (26)

by Assumptions 4 and 5. Here θ̄_T in the matrix G_T(θ̄_T) can be different for different rows. For notational economy, we do not make this explicit; we follow this practice in the rest of the proof.

Next, using summation by parts, we have

(1/√T) Σ_{t=1}^T e_T(t) f(v_t, θ̃_T)
= (1/√T) Σ_{t=1}^T e_T(t) f(v_t, θ₀) + (1/T) Σ_{t=1}^T e_T(t) [∂f(v_t, θ̄_T)/∂θ′] √T(θ̃_T − θ₀)
= (1/√T) Σ_{t=1}^T e_T(t) f(v_t, θ₀) + Σ_{t=1}^{T−1} [e_T(t) − e_T(t+1)] G_t(θ̄_T) √T(θ̃_T − θ₀) + e_T(T) G_T(θ̄_T) √T(θ̃_T − θ₀).

Assumption 1 implies that e_T(t) = O(1/T) uniformly over t, so e_T(T) G_T(θ̄_T) √T(θ̃_T − θ₀) = O_p(1/T). For any conformable vector a,

var( (1/√T) Σ_{t=1}^T e_T(t) a′ f(v_t, θ₀) ) ≤ (1/T²) ( Σ_{j=−∞}^∞ ||Γ_j|| ) ||a||² = O(1/T²)

by Assumption 3, so (1/√T) Σ_{t=1}^T e_T(t) f(v_t, θ₀) = O_p(1/T). It then follows that

(1/√T) Σ_{t=1}^T e_T(t) f(v_t, θ̃_T) = Σ_{t=1}^{T−1} [e_T(t) − e_T(t+1)] G_t(θ̄_T) √T(θ̃_T − θ₀) + O_p(1/T).   (27)

Letting D_t = G_t(θ̄_T) − (t/T) G, we can write

Σ_{t=1}^{T−1} [e_T(t) − e_T(t+1)] G_t(θ̄_T) √T(θ̃_T − θ₀)
= Σ_{t=1}^{T−1} [e_T(t) − e_T(t+1)] D_t √T(θ̃_T − θ₀) + [ (1/T) Σ_{t=1}^{T−1} e_T(t) − ((T−1)/T) e_T(T) ] G √T(θ̃_T − θ₀),   (28)

where the second term is O_p(1/T) since e_T(t) = O(1/T) uniformly over t. Given that sup_t ||D_t|| = o_p(1) under Assumption 4 and e_T(·) is piecewise monotonic under Assumption 1, we can easily show that Σ_{t=1}^{T−1} [e_T(t) − e_T(t+1)] D_t √T(θ̃_T − θ₀) = o_p(1/T). Combining this with (27) and (28) yields

(1/√T) Σ_{t=1}^T e_T(t) f(v_t, θ̃_T) = O_p(1/T).   (29)

Part (a) now follows from (25), (26) and (29).
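The order e_T(t) = O(1/T) of the Riemann-sum error is easy to see numerically. The sketch below is an illustration only: the smooth, mean-zero function Q(r, s) = cos(πr)cos(πs) stands in for the weighting function Q_h; it computes max_t |e_T(t/T)| for T and 2T and checks that the maximal error roughly halves.

```python
import math

# Stand-in for Q_h: a smooth function whose integral over r in [0, 1] is zero.
Q = lambda r, s: math.cos(math.pi * r) * math.cos(math.pi * s)

def e_T(T, s):
    """Riemann-sum error e_T(s) = (1/T) * sum_r Q(r/T, s) - int_0^1 Q(r, s) dr."""
    riemann = sum(Q(r / T, s) for r in range(1, T + 1)) / T
    return riemann - 0.0  # the integral over r is exactly zero for this Q

ms = []
for T in (100, 200):
    m = max(abs(e_T(T, t / T)) for t in range(1, T + 1))
    ms.append(m)
    print(f"T={T}: max_t |e_T(t/T)| = {m:.6f}")
# Doubling T roughly halves the maximal error, consistent with e_T(t) = O(1/T).
```

For this stand-in the error can even be computed exactly: the right Riemann sum of cos(πr) over [0, 1] equals −1/T, so e_T(s) = −cos(πs)/T.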
Part (b). For some θ̄_T between θ̃_T and θ₀, we have

W_T(θ̃_T) = (1/T) Σ_{t=1}^T Σ_{τ=1}^T Q_h(t/T, τ/T) [u_t + (∂f(v_t, θ̄_T)/∂θ′)(θ̃_T − θ₀)] [u_τ + (∂f(v_τ, θ̄_T)/∂θ′)(θ̃_T − θ₀)]′
= W_T(θ₀) + I₁ + I₁′ + I₂,   (30)

where

I₁ = (1/T) Σ_{τ=1}^T Σ_{t=1}^T Q_h(t/T, τ/T) [∂f(v_t, θ̄_T)/∂θ′](θ̃_T − θ₀) u_τ′,
I₂ = (1/T) Σ_{t=1}^T Σ_{τ=1}^T Q_h(t/T, τ/T) [∂f(v_t, θ̄_T)/∂θ′](θ̃_T − θ₀)(θ̃_T − θ₀)′ [∂f(v_τ, θ̄_T)/∂θ′]′.

For a sequence of vectors or matrices {C_t}, define S₀(C) = 0 and S_t(C) = T⁻¹ Σ_{τ=1}^t C_τ. Using summation by parts, we obtain, for two sequences of matrices {A_t} and {B_t}:

(1/T²) Σ_{τ=1}^T Σ_{t=1}^T Q_h(t/T, τ/T) A_t B_τ′
= Σ_{τ=1}^{T−1} Σ_{t=1}^{T−1} ∇Q_h(t/T, τ/T) S_t(A) S_τ′(B)
+ Σ_{t=1}^{T−1} [Q_h(t/T, 1) − Q_h((t+1)/T, 1)] S_t(A) S_T′(B)
+ Σ_{τ=1}^{T−1} [Q_h(1, τ/T) − Q_h(1, (τ+1)/T)] S_T(A) S_τ′(B)
+ Q_h(1, 1) S_T(A) S_T′(B),

where

∇Q_h(t/T, τ/T) := Q_h(t/T, τ/T) − Q_h((t+1)/T, τ/T) − Q_h(t/T, (τ+1)/T) + Q_h((t+1)/T, (τ+1)/T).

Plugging A_t = [∂f(v_t, θ̄_T)/∂θ′](θ̃_T − θ₀) and B_t = u_t into the above expression, and noting that S_t(A) = G_t(θ̄_T)(θ̃_T − θ₀) = (D_t + (t/T)G)(θ̃_T − θ₀) with D_t = G_t(θ̄_T) − (t/T)G as before, we obtain a new representation of I₁:

I₁ = T Σ_{τ=1}^{T−1} Σ_{t=1}^{T−1} ∇Q_h(t/T, τ/T) D_t(θ̃_T − θ₀) S_τ′(u)
+ T Σ_{t=1}^{T−1} [Q_h(t/T, 1) − Q_h((t+1)/T, 1)] D_t(θ̃_T − θ₀) S_T′(u)
+ T Σ_{τ=1}^{T−1} [Q_h(1, τ/T) − Q_h(1, (τ+1)/T)] D_T(θ̃_T − θ₀) S_τ′(u)
+ T Q_h(1, 1) D_T(θ̃_T − θ₀) S_T′(u)
+ G(θ̃_T − θ₀) (1/T) Σ_{τ=1}^T Σ_{t=1}^T Q_h(t/T, τ/T) u_τ′
:= I₁₁ + I₁₂ + I₁₃ + I₁₄ + I₁₅.

We now show that each of the five terms in the above expression is o_p(1). It is easy to see that I₁₅ = o_p(1): since (1/T) Σ_{t=1}^T Q_h(t/T, s) = ∫₀¹ Q_h(t, s) dt + O(1/T) = O(1/T) uniformly in s by the demeaning of Q_h, we have

I₁₅ = G √T(θ̃_T − θ₀) O(1/T) (1/√T) Σ_{τ=1}^T u_τ′ = o_p(1).

Similarly,

I₁₄ = Q_h(1, 1) D_T √T(θ̃_T − θ₀) [√T S_T(u)]′ = o_p(1) O_p(1) O_p(1) = o_p(1),

and, since Q_h(1, ·) is piecewise monotonic under Assumption 1 so that Σ_{τ=1}^{T−1} |Q_h(1, τ/T) − Q_h(1, (τ+1)/T)| = O(1), while sup_τ ||√T S_τ(u)|| = O_p(1),

I₁₃ = D_T √T(θ̃_T − θ₀) O_p(1) = o_p(1).

To show that I₁₂ = o_p(1), we note that under the piecewise monotonicity condition in Assumption 1, Q_h(t/T, 1) is piecewise monotonic in t/T. So for some finite J we can partition the set {1, 2, ..., T} into J maximal non-overlapping subsets ∪_{j=1}^J I_j such that Q_h(t/T, 1) is monotonic on each I_j := {I_j^L, ..., I_j^U}. Now

Σ_{t=1}^{T−1} |Q_h(t/T, 1) − Q_h((t+1)/T, 1)| ||D_t||
≤ Σ_{j=1}^J |Q_h(I_j^L/T, 1) − Q_h(I_j^U/T, 1)| sup_s ||D_s|| + o_p(1)
= O(1) sup_s ||D_s|| + o_p(1) = o_p(1),

where the o_p(1) term in the first inequality reflects the case when t and t + 1 belong to different subsets. As a result,

||I₁₂|| ≤ Σ_{t=1}^{T−1} |Q_h(t/T, 1) − Q_h((t+1)/T, 1)| ||D_t|| ||√T(θ̃_T − θ₀)|| ||√T S_T(u)|| = o_p(1).   (31)

To show that I₁₁ = o_p(1), we use a similar argument. Under the piecewise monotonicity condition, we can partition the set of lattice points {(i, j): i, j = 1, 2, ..., T} into a finite number of maximal non-overlapping subsets

I_ℓ = {(i, j): I_ℓ^{1L} ≤ i ≤ I_ℓ^{1U}, I_ℓ^{2L} ≤ j ≤ I_ℓ^{2U}}

for ℓ = 1, ..., ℓ_max such that ∇Q_h(i/T, j/T) has the same sign for all (i, j) ∈ I_ℓ. Now

||I₁₁|| ≤ Σ_{ℓ=1}^{ℓ_max} Σ_{(i,j)∈I_ℓ} |∇Q_h(i/T, j/T)| · sup_t ||D_t|| · ||√T(θ̃_T − θ₀)|| · sup_j ||√T S_j(u)||
= Σ_{ℓ=1}^{ℓ_max} |Q_h(I_ℓ^{1L}/T, I_ℓ^{2L}/T) − Q_h(I_ℓ^{1U}/T, I_ℓ^{2L}/T) − Q_h(I_ℓ^{1L}/T, I_ℓ^{2U}/T) + Q_h(I_ℓ^{1U}/T, I_ℓ^{2U}/T)| · o_p(1) O_p(1) = o_p(1).

We have therefore proved that I₁ = o_p(1). We can use similar arguments to show that I₂ = o_p(1); details are omitted. This completes the proof of Part (b).
Parts (c) and (d). We start by establishing the series representation of the weighting function as given in (4). The representation trivially holds with Φ_j(·) = φ_j(·) for the OS HAR variance estimator. It remains to show that it also holds for the contracted kernel k_b(·) and the exponentiated kernel k_ρ(·). We focus on the former, as the proof for the latter is similar. Under Assumption 1(a), k_b(x) has a Fourier cosine series expansion

k_b(x) = c + Σ_{j=1}^∞ λ̃_j cos(jπx),   (32)

where c is the constant term in the expansion, Σ_{j=1}^∞ |λ̃_j| < ∞, and the right hand side converges uniformly over x ∈ [−1, 1]. This implies that

k_b(r − s) = c + Σ_{j=1}^∞ λ̃_j cos(jπr) cos(jπs) + Σ_{j=1}^∞ λ̃_j sin(jπr) sin(jπs),

and the right hand side converges uniformly over (r, s) ∈ [0, 1] × [0, 1]. Now

Q_h(r, s) = k_b(r − s) − ∫₀¹ k_b(τ − s) dτ − ∫₀¹ k_b(r − τ) dτ + ∫₀¹∫₀¹ k_b(τ₁ − τ₂) dτ₁ dτ₂
= Σ_{j=1}^∞ λ̃_j cos(jπr) cos(jπs) + Σ_{j=1}^∞ λ̃_j [sin(jπr) − ∫₀¹ sin(jπτ) dτ] [sin(jπs) − ∫₀¹ sin(jπτ) dτ]
:= Σ_{j=1}^∞ λ_j Φ_j(r) Φ_j(s)

by taking

Φ_j(r) = cos(½ jπr), j is even; Φ_j(r) = sin(½ (j+1)πr) − ∫₀¹ sin(½ (j+1)πτ) dτ, j is odd,   (33)

and

λ_j = λ̃_{j/2}, j is even; λ_j = λ̃_{(j+1)/2}, j is odd.

Obviously Φ_j(r) is continuously differentiable and satisfies ∫₀¹ Φ_j(r) dr = 0 for all j = 1, 2, .... The uniform convergence of Σ_{j=1}^∞ λ_j Φ_j(r) Φ_j(s) to Q_h(r, s) is inherited from the uniform convergence of the Fourier series in (32).
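The expansion (32) is concrete for the contracted Bartlett kernel k_b(x) = max(0, 1 − |x|/b): a standard integral (computed here for illustration, not taken from the paper) gives the coefficients λ̃_j = 2(1 − cos(jπb)) / (b(jπ)²) with constant term c = b/2. The sketch below reconstructs k_b from a truncated series and checks the approximation at a few points.

```python
import math

b = 0.1     # illustrative bandwidth for the contracted Bartlett kernel
J = 5000    # truncation order of the cosine expansion

def k_b(x):
    """Contracted Bartlett kernel k_b(x) = max(0, 1 - |x|/b)."""
    return max(0.0, 1.0 - abs(x) / b)

def k_b_series(x):
    """Truncated Fourier cosine expansion of k_b on [-1, 1]:
    k_b(x) ~ b/2 + sum_j 2(1 - cos(j pi b)) / (b (j pi)^2) * cos(j pi x)."""
    s = b / 2.0
    for j in range(1, J + 1):
        lam = 2.0 * (1.0 - math.cos(j * math.pi * b)) / (b * (j * math.pi) ** 2)
        s += lam * math.cos(j * math.pi * x)
    return s

errs = []
for x in (0.0, 0.05, 0.2):
    err = abs(k_b(x) - k_b_series(x))
    errs.append(err)
    print(f"x={x}: kernel {k_b(x):.4f}, series {k_b_series(x):.4f}")
```

Note that these λ̃_j are nonnegative, so for the Bartlett case the series representation has nonnegative weights, consistent with the positive semidefiniteness of the kernel.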
In view of part (b), it suffices to show that W_T(\theta_0) \sim^a \tilde{W}_T. Using the Cramér–Wold device, it suffices to show that

    tr[W_T(\theta_0)A] \sim^a tr[\tilde{W}_T A]

for any conformable symmetric matrix A. Note that

    tr[W_T(\theta_0)A] = \frac{1}{T}\sum_{t=1}^T\sum_{\tau=1}^T Q_h^*\Big(\frac{t}{T}, \frac{\tau}{T}\Big) u_t' A u_\tau,
    tr(\tilde{W}_T A) = \frac{1}{T}\sum_{t=1}^T\sum_{\tau=1}^T Q_h^*\Big(\frac{t}{T}, \frac{\tau}{T}\Big) (\Lambda e_t)' A (\Lambda e_\tau).

We write tr[W_T(\theta_0)A] as

    tr[W_T(\theta_0)A] = \sum_{j=1}^{\infty} \lambda_j \Big[\frac{1}{\sqrt{T}}\sum_{t=1}^T \Phi_j\Big(\frac{t}{T}\Big)u_t\Big]' A \Big[\frac{1}{\sqrt{T}}\sum_{t=1}^T \Phi_j\Big(\frac{t}{T}\Big)u_t\Big] := \xi_{T,J} + \eta_{T,J},

where

    \xi_{T,J} = \sum_{j=1}^{J} \lambda_j \Big[\frac{1}{\sqrt{T}}\sum_{t=1}^T \Phi_j\Big(\frac{t}{T}\Big)u_t\Big]' A \Big[\frac{1}{\sqrt{T}}\sum_{t=1}^T \Phi_j\Big(\frac{t}{T}\Big)u_t\Big],
    \eta_{T,J} = \sum_{j=J+1}^{\infty} \lambda_j \Big[\frac{1}{\sqrt{T}}\sum_{t=1}^T \Phi_j\Big(\frac{t}{T}\Big)u_t\Big]' A \Big[\frac{1}{\sqrt{T}}\sum_{t=1}^T \Phi_j\Big(\frac{t}{T}\Big)u_t\Big].

We proceed to apply Lemma 4 by verifying the four conditions. First, it follows from Assumption 5 that for every fixed J,

    \xi_{T,J} \sim^a \tilde{\xi}_{T,J} := \sum_{j=1}^{J} \lambda_j \Big[\frac{1}{\sqrt{T}}\sum_{t=1}^T \Phi_j\Big(\frac{t}{T}\Big)\Lambda e_t\Big]' A \Big[\frac{1}{\sqrt{T}}\sum_{t=1}^T \Phi_j\Big(\frac{t}{T}\Big)\Lambda e_t\Big]    (34)

as T \to \infty. Second, let

    \eta_T = \sum_{j=1}^{\infty} \lambda_j \Big[\frac{1}{\sqrt{T}}\sum_{t=1}^T \Phi_j\Big(\frac{t}{T}\Big)\Lambda e_t\Big]' A \Big[\frac{1}{\sqrt{T}}\sum_{t=1}^T \Phi_j\Big(\frac{t}{T}\Big)\Lambda e_t\Big] = tr(\tilde{W}_T A)

and A = \sum_{\ell=1}^m \pi_\ell a_\ell a_\ell' be the spectral decomposition of A. Then

    \eta_T - \tilde{\xi}_{T,J} = \sum_{j=J+1}^{\infty} \lambda_j \sum_{\ell=1}^m \pi_\ell \Big[\frac{1}{\sqrt{T}}\sum_{t=1}^T \Phi_j\Big(\frac{t}{T}\Big)a_\ell'\Lambda e_t\Big]^2.

So

    E|\eta_T - \tilde{\xi}_{T,J}| \le C\sum_{j=J+1}^{\infty} |\lambda_j| \frac{1}{T}\sum_{t=1}^T \Phi_j^2\Big(\frac{t}{T}\Big) \le C\sum_{j=J+1}^{\infty} |\lambda_j|

for some constant C that does not depend on T. That is, E|\eta_T - \tilde{\xi}_{T,J}| \to 0 uniformly in T as J \to \infty. So for any \delta > 0 and \varepsilon > 0, there exists a J_0 that does not depend on T such that P(|\eta_T - \tilde{\xi}_{T,J}| \ge \delta) \le \varepsilon for all J \ge J_0 and all T. It then follows that there exists a T_0 which does not depend on J_0 such that

    P(\tilde{\xi}_{T,J} < z) = P(\eta_T - (\eta_T - \tilde{\xi}_{T,J}) < z) \le P(\eta_T < z + \delta) + P(|\eta_T - \tilde{\xi}_{T,J}| \ge \delta)
    \le P(\eta_T < z + \delta) + \varepsilon \le P(\eta_T < z) + 2\varepsilon

for J \ge J_0 and T \ge T_0. Here we have used the equicontinuity of the CDF of \eta_T when T is sufficiently large; this is verified below. Similarly, we can show that for J \ge J_0 and T \ge T_0 we have

    P(\tilde{\xi}_{T,J} < z) \ge P(\eta_T < z) - 2\varepsilon.

Since the above two inequalities hold for all \varepsilon > 0, it must be true that P(\tilde{\xi}_{T,J} < z) - P(\eta_T < z) = o(1) uniformly over sufficiently large T as J \to \infty.

Third, we can use Lemma 4 to show that \eta_T converges in distribution to \eta_\infty, where

    \eta_\infty = \sum_{j=1}^{\infty} \lambda_j \Big[\int_0^1 \Phi_j(r)\Lambda\,dB_m(r)\Big]' A \Big[\int_0^1 \Phi_j(s)\Lambda\,dB_m(s)\Big] = \int_0^1\!\int_0^1 Q_h^*(r,s)\,[\Lambda\,dB_m(r)]' A\,[\Lambda\,dB_m(s)].

Details are omitted here. So for any \delta and z, we have

    P(\eta_T < z + \delta) = P(\eta_\infty < z + \delta) + o(1)

as T \to \infty. Given the continuity of the CDF of \eta_\infty, for any \varepsilon > 0 we can find a \delta > 0 such that P(z \le \eta_T < z + \delta) \le \varepsilon for all T when T is sufficiently large. We have thus verified Condition (iii) in Lemma 4. Finally, for every \delta > 0, we have

    P(|\eta_{T,J}| > \delta) \le \frac{1}{\delta}\sum_{j=J+1}^{\infty} |\lambda_j| \sum_{\ell=1}^m |\pi_\ell|\, E\Big[\frac{1}{\sqrt{T}}\sum_{t=1}^T \Phi_j\Big(\frac{t}{T}\Big)a_\ell'u_t\Big]^2.

But by Assumption 3,

    \sum_{j=J+1}^{\infty} |\lambda_j|\, E\Big[\frac{1}{\sqrt{T}}\sum_{t=1}^T \Phi_j\Big(\frac{t}{T}\Big)a_\ell'u_t\Big]^2
    = \sum_{j=J+1}^{\infty} |\lambda_j| \Big\{ \frac{1}{T}\sum_{t=1}^T \Phi_j^2\Big(\frac{t}{T}\Big) E(a_\ell'u_t)^2 + \frac{1}{T}\sum_{t \ne s} \Phi_j\Big(\frac{t}{T}\Big)\Phi_j\Big(\frac{s}{T}\Big) E a_\ell'u_t u_s'a_\ell \Big\}
    \le C\sum_{j=J+1}^{\infty} |\lambda_j| + O\Big(\sum_{j=J+1}^{\infty} |\lambda_j|\Big) = o(1)

uniformly over T, as J \to \infty. So we have proved that \sup_T P(|\eta_{T,J}| > \delta) = o(1) as J \to \infty.

Combining the above steps, we have

    tr[W_T(\theta_0)A] \sim^a \eta_T = tr(\tilde{W}_T A)

and

    tr(\tilde{W}_T A) \to^d \int_0^1\!\int_0^1 Q_h^*(r,s)\,[\Lambda\,dB_m(r)]' A\,[\Lambda\,dB_m(s)] = tr(W_\infty A).

As a consequence, W_T(\theta_0) \sim^a \tilde{W}_T \to^d W_\infty. In view of Part (a), we also have

    W_T(\tilde{\theta}_T) \sim^a \tilde{W}_T \to^d W_\infty.
Proof of Theorem 1. We prove part (b) only as the proofs for the other parts are similar. It suffices to show that

    F_\infty \overset{d}{=} [B_p(1) - C_{pq}C_{qq}^{-1}B_q(1)]' D_{pp}^{-1} [B_p(1) - C_{pq}C_{qq}^{-1}B_q(1)]/p.

Let G^* = \Lambda^{-1}G, which is an m \times d matrix, and let \tilde{W}_\infty := \Lambda^{-1}W_\infty(\Lambda')^{-1} = \int_0^1\!\int_0^1 Q_h^*(r,s)\,dB_m(r)\,dB_m(s)'. Then

    F_\infty = [R(G^{*\prime}\tilde{W}_\infty^{-1}G^*)^{-1}G^{*\prime}\tilde{W}_\infty^{-1}B_m(1)]' [R(G^{*\prime}\tilde{W}_\infty^{-1}G^*)^{-1}R']^{-1} [R(G^{*\prime}\tilde{W}_\infty^{-1}G^*)^{-1}G^{*\prime}\tilde{W}_\infty^{-1}B_m(1)]/p.

Let U_{m\times m}\Sigma_{m\times d}V_{d\times d}' be a singular value decomposition (SVD) of G^*. By definition, U'U = UU' = I_m, VV' = V'V = I_d, and

    \Sigma = [ A_{d\times d} ; O_{q\times d} ],

where A is a diagonal matrix with the singular values on the main diagonal and O is a matrix of zeros. Then

    R(G^{*\prime}\tilde{W}_\infty^{-1}G^*)^{-1}G^{*\prime}\tilde{W}_\infty^{-1}B_m(1) = R[V\Sigma'U'\tilde{W}_\infty^{-1}U\Sigma V']^{-1}V\Sigma'U'\tilde{W}_\infty^{-1}B_m(1)
    = RV[\Sigma'(U'\tilde{W}_\infty^{-1}U)\Sigma]^{-1}\Sigma'(U'\tilde{W}_\infty^{-1}U)\,U'B_m(1) \overset{d}{=} RV[\Sigma'\tilde{W}_\infty^{-1}\Sigma]^{-1}\Sigma'\tilde{W}_\infty^{-1}B_m(1)

and

    R(G^{*\prime}\tilde{W}_\infty^{-1}G^*)^{-1}R' = R[V\Sigma'U'\tilde{W}_\infty^{-1}U\Sigma V']^{-1}R' \overset{d}{=} RV[\Sigma'\tilde{W}_\infty^{-1}\Sigma]^{-1}V'R',

where we have used the distributional equivalence of [\tilde{W}_\infty^{-1}, B_m(1)] and [U'\tilde{W}_\infty^{-1}U, U'B_m(1)]. Let

    \tilde{W}_\infty = [ C_{11}  C_{12} ; C_{21}  C_{22} ]  and  \tilde{W}_\infty^{-1} = [ C^{11}  C^{12} ; C^{21}  C^{22} ],

where C_{11} and C^{11} are d \times d matrices, C_{22} and C^{22} are q \times q matrices, and C_{12} = C_{21}', C^{12} = (C^{21})'. By definition,

    C_{11} = \int_0^1\!\int_0^1 Q_h^*(r,s)\,dB_d(r)\,dB_d(s)' = [ C_{pp}  C_{p,d-p} ; C_{d-p,p}  C_{d-p,d-p} ],    (35)
    C_{12} = \int_0^1\!\int_0^1 Q_h^*(r,s)\,dB_d(r)\,dB_q(s)' = [ C_{pq} ; C_{d-p,q} ],    (36)
    C_{22} = \int_0^1\!\int_0^1 Q_h^*(r,s)\,dB_q(r)\,dB_q(s)' = C_{qq},    (37)

where C_{pp}, C_{pq} and C_{qq} are defined in (7), and C_{d-p,d-p}, C_{p,d-p} and C_{d-p,q} are similarly defined. It follows from the partitioned inverse formula that

    C^{11} = (C_{11} - C_{12}C_{22}^{-1}C_{21})^{-1}  and  C^{12} = -C^{11}C_{12}C_{22}^{-1}.

With the above partition of \tilde{W}_\infty^{-1}, we have

    \Sigma'\tilde{W}_\infty^{-1}\Sigma = [A'\ O'] [ C^{11}  C^{12} ; C^{21}  C^{22} ] [ A ; O ] = A'C^{11}A,  so  [\Sigma'\tilde{W}_\infty^{-1}\Sigma]^{-1} = A^{-1}(C^{11})^{-1}(A')^{-1},

and so

    RV[\Sigma'\tilde{W}_\infty^{-1}\Sigma]^{-1}\Sigma'\tilde{W}_\infty^{-1} = RVA^{-1}(C^{11})^{-1}(A')^{-1}[A'\ O'] [ C^{11}  C^{12} ; C^{21}  C^{22} ] = RVA^{-1}(C^{11})^{-1}[C^{11}\ C^{12}] = RVA^{-1}[I_d,\ -C_{12}C_{22}^{-1}]

and

    RV[\Sigma'\tilde{W}_\infty^{-1}\Sigma]^{-1}V'R' = RVA^{-1}(C^{11})^{-1}(A')^{-1}V'R'.    (38)

As a result,

    F_\infty \overset{d}{=} \{RVA^{-1}[I_d, -C_{12}C_{22}^{-1}]B_m(1)\}' [RVA^{-1}(C^{11})^{-1}(A')^{-1}V'R']^{-1} \{RVA^{-1}[I_d, -C_{12}C_{22}^{-1}]B_m(1)\}/p.

Let B_m(1) = [B_d'(1), B_q'(1)]' and \tilde{U}_{p\times p}\tilde{\Sigma}_{p\times d}\tilde{V}_{d\times d}' be a SVD of RVA^{-1}, where

    \tilde{\Sigma} = [ \tilde{A}_{p\times p},\ O_{p\times(d-p)} ].

Then

    F_\infty \overset{d}{=} \{\tilde{U}\tilde{\Sigma}\tilde{V}'[B_d(1) - C_{12}C_{22}^{-1}B_q(1)]\}' [\tilde{U}\tilde{\Sigma}\tilde{V}'(C^{11})^{-1}\tilde{V}\tilde{\Sigma}'\tilde{U}']^{-1} \{\tilde{U}\tilde{\Sigma}\tilde{V}'[B_d(1) - C_{12}C_{22}^{-1}B_q(1)]\}/p
    = \{\tilde{\Sigma}[\tilde{V}'B_d(1) - \tilde{V}'C_{12}C_{22}^{-1}B_q(1)]\}' [\tilde{\Sigma}\tilde{V}'(C^{11})^{-1}\tilde{V}\tilde{\Sigma}']^{-1} \{\tilde{\Sigma}[\tilde{V}'B_d(1) - \tilde{V}'C_{12}C_{22}^{-1}B_q(1)]\}/p.

Noting that the joint distribution of [\tilde{V}'B_d(1), \tilde{V}'C_{12}, C_{22}, \tilde{V}'(C^{11})^{-1}\tilde{V}] is invariant to the orthonormal matrix \tilde{V}, we have

    F_\infty \overset{d}{=} \{[\tilde{A}, O][B_d(1) - C_{12}C_{22}^{-1}B_q(1)]\}' \{[\tilde{A}, O](C^{11})^{-1}[\tilde{A}, O]'\}^{-1} \{[\tilde{A}, O][B_d(1) - C_{12}C_{22}^{-1}B_q(1)]\}/p.

Writing

    (C^{11})^{-1} = C_{11} - C_{12}C_{22}^{-1}C_{21} = [ D_{pp}  D_{12} ; D_{21}  D_{22} ],

where D_{pp} = C_{pp} - C_{pq}C_{qq}^{-1}C_{pq}' and D_{22} is a (d-p) \times (d-p) matrix, and using equations (35)–(37), we have [\tilde{A}, O](C^{11})^{-1}[\tilde{A}, O]' = \tilde{A}D_{pp}\tilde{A}' and [\tilde{A}, O][B_d(1) - C_{12}C_{22}^{-1}B_q(1)] = \tilde{A}[B_p(1) - C_{pq}C_{qq}^{-1}B_q(1)], so that \tilde{A} cancels and

    F_\infty \overset{d}{=} [B_p(1) - C_{pq}C_{qq}^{-1}B_q(1)]' D_{pp}^{-1} [B_p(1) - C_{pq}C_{qq}^{-1}B_q(1)]/p.
Proof of Theorem 2. Let \lambda_T be the p \times 1 vector of Lagrange multipliers for the constrained GMM estimation. The first order conditions for \hat{\theta}_{T,R} are

    \frac{\partial g_T'(\hat{\theta}_{T,R})}{\partial\theta} W_T^{-1}(\tilde{\theta}_T) g_T(\hat{\theta}_{T,R}) + R'\lambda_T = 0  and  R\hat{\theta}_{T,R} = r.    (39)

Linearizing the first set of conditions and using Assumption 4, we have the following system of equations:

    [ \tilde{\Gamma}  R' ; R  0_{p\times p} ] [ \sqrt{T}(\hat{\theta}_{T,R} - \theta_0) ; \sqrt{T}\lambda_T ] = [ -G'W_T^{-1}(\tilde{\theta}_T)\sqrt{T}g_T(\theta_0) ; 0_{p\times 1} ] + o_p(1),

where \tilde{\Gamma} := \tilde{\Gamma}(\tilde{\theta}_T) = G'W_T^{-1}(\tilde{\theta}_T)G. From this, we get

    \sqrt{T}(\hat{\theta}_{T,R} - \theta_0) = -\{\tilde{\Gamma}^{-1} - \tilde{\Gamma}^{-1}R'(R\tilde{\Gamma}^{-1}R')^{-1}R\tilde{\Gamma}^{-1}\} G'W_T^{-1}(\tilde{\theta}_T)\sqrt{T}g_T(\theta_0) + o_p(1)    (40)

and

    \sqrt{T}\lambda_T = -(R\tilde{\Gamma}^{-1}R')^{-1}R\tilde{\Gamma}^{-1}G'W_T^{-1}(\tilde{\theta}_T)\sqrt{T}g_T(\theta_0) + o_p(1).    (41)

Combining (5) with (40), we have

    \sqrt{T}(\hat{\theta}_{T,R} - \hat{\theta}_T) = \tilde{\Gamma}^{-1}R'(R\tilde{\Gamma}^{-1}R')^{-1}R\tilde{\Gamma}^{-1}G'W_T^{-1}(\tilde{\theta}_T)\sqrt{T}g_T(\theta_0) + o_p(1),    (42)

which implies that \sqrt{T}(\hat{\theta}_{T,R} - \hat{\theta}_T) = O_p(1). So

    g_T(\hat{\theta}_{T,R}) = g_T(\hat{\theta}_T) + G_T(\hat{\theta}_T)(\hat{\theta}_{T,R} - \hat{\theta}_T) + o_p(1/\sqrt{T}).

Plugging this into the definition of D_T, we obtain

    D_T = T(\hat{\theta}_{T,R} - \hat{\theta}_T)' G_T'(\hat{\theta}_T)W_T^{-1}(\tilde{\theta}_T)G_T(\hat{\theta}_T) (\hat{\theta}_{T,R} - \hat{\theta}_T)/p + 2T g_T'(\hat{\theta}_T)W_T^{-1}(\tilde{\theta}_T)G_T(\hat{\theta}_T)(\hat{\theta}_{T,R} - \hat{\theta}_T)/p + o_p(1).    (43)

Using the first order conditions for \hat{\theta}_T, namely g_T'(\hat{\theta}_T)W_T^{-1}(\tilde{\theta}_T)G_T(\hat{\theta}_T) = 0, and Lemma 1(a), we obtain

    D_T = T(\hat{\theta}_{T,R} - \hat{\theta}_T)' \tilde{\Gamma} (\hat{\theta}_{T,R} - \hat{\theta}_T)/p + o_p(1).    (44)

Plugging (42) into (44) and simplifying the resulting expression, we have

    D_T = [R\tilde{\Gamma}^{-1}G'W_T^{-1}(\tilde{\theta}_T)\sqrt{T}g_T(\theta_0)]' (R\tilde{\Gamma}^{-1}R')^{-1} [R\tilde{\Gamma}^{-1}G'W_T^{-1}(\tilde{\theta}_T)\sqrt{T}g_T(\theta_0)]/p + o_p(1)
    = [\sqrt{T}(R\hat{\theta}_T - r)]' (R\tilde{\Gamma}^{-1}R')^{-1} [\sqrt{T}(R\hat{\theta}_T - r)]/p + o_p(1)
    = W_T + o_p(1).    (45)

Next, we prove the second result in the theorem. In view of the first order conditions in (39) and equation (41), we have

    \frac{\partial g_T'(\hat{\theta}_{T,R})}{\partial\theta} W_T^{-1}(\tilde{\theta}_T)\sqrt{T}g_T(\hat{\theta}_{T,R}) = -\sqrt{T}R'\lambda_T + o_p(1)
    = R'(R\tilde{\Gamma}^{-1}R')^{-1}R\tilde{\Gamma}^{-1}G'W_T^{-1}(\tilde{\theta}_T)\sqrt{T}g_T(\theta_0) + o_p(1)
    = \tilde{\Gamma}\sqrt{T}(\hat{\theta}_{T,R} - \hat{\theta}_T) + o_p(1),

and so

    S_T = T(\hat{\theta}_T - \hat{\theta}_{T,R})' \tilde{\Gamma} (\hat{\theta}_T - \hat{\theta}_{T,R})/p + o_p(1) = D_T + o_p(1) = W_T + o_p(1).
Proof of Theorem 3. Part (a). Let

    H = \Big( \frac{B_p(1) - C_{pq}C_{qq}^{-1}B_q(1)}{\|B_p(1) - C_{pq}C_{qq}^{-1}B_q(1)\|},\ \tilde{H} \Big)

be an orthonormal matrix whose first column is the normalized vector B_p(1) - C_{pq}C_{qq}^{-1}B_q(1). Then

    F_\infty \overset{d}{=} [B_p(1) - C_{pq}C_{qq}^{-1}B_q(1)]' H (H'D_{pp}^{-1}H) H' [B_p(1) - C_{pq}C_{qq}^{-1}B_q(1)]/p
    = \|B_p(1) - C_{pq}C_{qq}^{-1}B_q(1)\|^2\, e_p'(H'D_{pp}^{-1}H)e_p / p,

where e_p = (1, 0, \ldots, 0)' \in R^p. But H'D_{pp}^{-1}H has the same distribution as D_{pp}^{-1}, and D_{pp} is independent of B_p(1) - C_{pq}C_{qq}^{-1}B_q(1). So

    F_\infty \overset{d}{=} \frac{\|B_p(1) - C_{pq}C_{qq}^{-1}B_q(1)\|^2/p}{(e_p'D_{pp}^{-1}e_p)^{-1}}.    (46)

That is, F_\infty is equal in distribution to a ratio of two independent random variables. It is easy to see that

    (e_p'D_{pp}^{-1}e_p)^{-1} \overset{d}{=} \Big[ e_{p+q}'\Big( \int_0^1\!\int_0^1 Q_h^*(r,s)\,dB_{p+q}(r)\,dB_{p+q}(s)' \Big)^{-1} e_{p+q} \Big]^{-1} \overset{d}{=} \frac{1}{K}\chi^2_{K-p-q+1},

where e_{p+q} = (1, 0, \ldots, 0)' \in R^{p+q}. Furthermore, since B_p(1) \sim N(0, I_p) is independent of (C_{pq}, C_{qq}, B_q(1)), the numerator \|B_p(1) - C_{pq}C_{qq}^{-1}B_q(1)\|^2 follows \chi_p^2(\delta^2) conditional on \delta^2 := \|C_{pq}C_{qq}^{-1}B_q(1)\|^2. With this, we can now represent F_\infty as

    F_\infty \overset{d}{=} \frac{\chi_p^2(\delta^2)/p}{\chi^2_{K-p-q+1}/K}    (47)

and so

    \frac{K-p-q+1}{K}F_\infty \overset{d}{=} \frac{\chi_p^2(\delta^2)/p}{\chi^2_{K-p-q+1}/(K-p-q+1)} \overset{d}{=} F_{p,K-p-q+1}(\delta^2).    (48)
Part (b). Since the numerator and the denominator in (48) are independent, K^{-1}(K-p-q+1)F_\infty is distributed as a noncentral F random variable, conditional on \delta^2. More specifically, we have

    P\big( K^{-1}(K-p-q+1)F_\infty < z \big) = P\big( F_{p,K-p-q+1}(\delta^2) < z \big) = E\, F_{p,K-p-q+1}(z; \delta^2),

where F_{p,K-p-q+1}(z; \delta^2) is the CDF of the noncentral F distribution with degrees of freedom (p, K-p-q+1) and noncentrality parameter \delta^2, and F_{p,K-p-q+1}(\delta^2) is a random variable with CDF F_{p,K-p-q+1}(z; \delta^2).
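The representation in (46)-(48) can be checked by direct simulation, which also illustrates why chi-square critical values are too small when K is moderate. The sketch below (numpy only; the dimensions p = 2, q = 3, K = 14 and the replication count are arbitrary illustrative choices, not values from the paper) simulates p F_\infty from its building blocks and compares its 95% quantile with the chi-square critical value.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, K, R = 2, 3, 14, 100_000   # illustrative values

# Building blocks of (46)-(48): xi_j ~ N(0, I_p), eta_j ~ N(0, I_q) as in (49),
# plus the independent endpoints B_p(1) and B_q(1).
xi = rng.standard_normal((R, K, p))
eta = rng.standard_normal((R, K, q))
Bp = rng.standard_normal((R, p))
Bq = rng.standard_normal((R, q))

Cpq = np.einsum("rkp,rkq->rpq", xi, eta) / K      # K^{-1} sum_j xi_j eta_j'
Cqq = np.einsum("rkq,rks->rqs", eta, eta) / K     # K^{-1} sum_j eta_j eta_j'

# Numerator of (46): ||B_p(1) - Cpq Cqq^{-1} B_q(1)||^2 / p
w = np.linalg.solve(Cqq, Bq[:, :, None])[:, :, 0]  # Cqq^{-1} B_q(1)
num = np.sum((Bp - np.einsum("rpq,rq->rp", Cpq, w)) ** 2, axis=1) / p

# Denominator of (47): chi^2_{K-p-q+1}/K, drawn independently of the numerator
den = rng.chisquare(K - p - q + 1, size=R) / K

q95 = np.quantile(p * num / den, 0.95)            # simulated 95% quantile of p*F_infinity
print(q95)                                        # well above chi^2_{2,0.95} = 5.99
```

The simulated quantile exceeds the chi-square critical value both because the denominator \chi^2_{K-p-q+1}/K has far fewer degrees of freedom than K and because \delta^2 > 0 inflates the numerator; these are exactly the two effects the noncentral F approximation captures.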
We proceed to compute the mean of \delta^2. Let

    \xi_j = \int_0^1 \Phi_j(r)\,dB_p(r) \sim iid\ N(0, I_p)  and  \eta_j = \int_0^1 \Phi_j(r)\,dB_q(r) \sim iid\ N(0, I_q).

Note that \{\xi_j\} are independent of \{\eta_j\}. We can represent C_{pq} and C_{qq} as

    C_{pq} = K^{-1}\sum_{j=1}^K \xi_j\eta_j'  and  C_{qq} = K^{-1}\sum_{j=1}^K \eta_j\eta_j'.    (49)

So

    E\delta^2 = E B_q(1)'C_{qq}^{-1}C_{pq}'C_{pq}C_{qq}^{-1}B_q(1) = E\,tr\big( C_{qq}^{-1}C_{pq}'C_{pq}C_{qq}^{-1} \big)
    = E\,tr\Big[ \Big(\frac{1}{K}\sum_{j=1}^K \eta_j\eta_j'\Big)^{-1}\Big(\frac{1}{K}\sum_{j=1}^K \eta_j\xi_j'\Big)\Big(\frac{1}{K}\sum_{j=1}^K \xi_j\eta_j'\Big)\Big(\frac{1}{K}\sum_{j=1}^K \eta_j\eta_j'\Big)^{-1} \Big]
    = p\,tr\,E(\Theta),

where the last equality uses E_\xi[\sum_j \eta_j\xi_j'\sum_i \xi_i\eta_i'] = p\sum_j \eta_j\eta_j', and

    \Theta := \Big(\sum_{j=1}^K \eta_j\eta_j'\Big)^{-1}

follows an inverse-Wishart distribution with

    E(\Theta) = \frac{I_q}{K-q-1}

for K large enough. Therefore E\delta^2 = \frac{pq}{K-q-1} := \bar{\delta}^2.

Next we compute the variance of \delta^2. It follows from the law of total variance that

    var(\delta^2) = E\big[ var\big(\delta^2\,|\,C_{qq}^{-1}C_{pq}'C_{pq}C_{qq}^{-1}\big) \big] + var\big[ tr\big(C_{qq}^{-1}C_{pq}'C_{pq}C_{qq}^{-1}\big) \big].

Note that B_q(1) is independent of C_{qq}^{-1}C_{pq}'C_{pq}C_{qq}^{-1}. So conditional on C_{qq}^{-1}C_{pq}'C_{pq}C_{qq}^{-1}, \delta^2 is a quadratic form in standard normals. Hence the conditional variance of \delta^2 is

    var\big(\delta^2\,|\,C_{qq}^{-1}C_{pq}'C_{pq}C_{qq}^{-1}\big) = 2\,tr\big( C_{qq}^{-1}C_{pq}'C_{pq}C_{qq}^{-1}\,C_{qq}^{-1}C_{pq}'C_{pq}C_{qq}^{-1} \big).

Using the representation in (49) and taking expectations over \{\xi_j\} first, we have

    E\,tr\big( C_{qq}^{-1}C_{pq}'C_{pq}C_{qq}^{-1}C_{qq}^{-1}C_{pq}'C_{pq}C_{qq}^{-1} \big)
    = E\,tr\Big[ \frac{1}{K^4}\sum_{j_1=1}^K\sum_{i_1=1}^K\sum_{j_2=1}^K\sum_{i_2=1}^K E(\xi_{j_1}'\xi_{i_1}\,\xi_{j_2}'\xi_{i_2})\,\eta_{j_1}\eta_{i_1}'C_{qq}^{-2}\eta_{j_2}\eta_{i_2}'C_{qq}^{-2} \Big].

Since

    E(\xi_{j_1}'\xi_{i_1}\,\xi_{j_2}'\xi_{i_2}) = \begin{cases} p^2, & j_1 = i_1,\ j_2 = i_2,\ j_1 \ne j_2; \\ p, & j_1 = j_2,\ i_1 = i_2,\ j_1 \ne i_1; \\ p, & j_1 = i_2,\ i_1 = j_2,\ j_1 \ne i_1; \\ p^2 + 2p, & j_1 = j_2 = i_1 = i_2; \\ 0, & \text{otherwise}, \end{cases}

collecting the terms case by case yields

    E\,tr\big( C_{qq}^{-1}C_{pq}'C_{pq}C_{qq}^{-1}C_{qq}^{-1}C_{pq}'C_{pq}C_{qq}^{-1} \big) = (p^2 + p)\,E\,tr(\Theta^2) + p\,E[tr(\Theta)]^2 + \text{smaller terms}
    = (p^2 + p)\sum_{i=1}^q\sum_{j=1}^q E\Theta_{ij}^2 + p\,E\Big(\sum_{i=1}^q \Theta_{ii}\Big)^2 + \text{smaller terms}.

It follows from Theorem 5.2.2 of Press (2005, p. 119, using the notation here) that

    E\Theta_{ii}^2 = \frac{2}{(K-q-1)^2(K-q-3)} + \frac{1}{(K-q-1)^2} = O(K^{-2}),

and, for i \ne j,

    E\Theta_{ij}^2 = \frac{1}{(K-q)(K-q-1)(K-q-3)} = O(K^{-3}),  E\Theta_{ii}\Theta_{jj} = \frac{2}{(K-q)(K-q-1)^2(K-q-3)} + \frac{1}{(K-q-1)^2} = O(K^{-2}).

Hence

    E\,tr\big( C_{qq}^{-1}C_{pq}'C_{pq}C_{qq}^{-1}C_{qq}^{-1}C_{pq}'C_{pq}C_{qq}^{-1} \big) = O(K^{-2}).    (50)

Next,

    var\big[ tr\big(C_{qq}^{-1}C_{pq}'C_{pq}C_{qq}^{-1}\big) \big] \le E\big[ tr\big(C_{qq}^{-1}C_{pq}'C_{pq}C_{qq}^{-1}\big) \big]^2 = p^2\,E[tr(\Theta)]^2 + 2p\,E\,tr(\Theta^2),

using the same expansion over \{\xi_j\} as above. Using the same formulae from Press (2005), we can show that both terms are of order O(K^{-2}). This, combined with (50), leads to var(\delta^2) = O(K^{-2}).

Taking a Taylor expansion and using the mean and variance of \delta^2, we have

    P\big( K^{-1}(K-p-q+1)F_\infty < z \big) = E\,F_{p,K-p-q+1}(z; \delta^2)
    = E\,F_{p,K-p-q+1}(z; \bar{\delta}^2) + E\Big[ \frac{\partial F_{p,K-p-q+1}(z; \bar{\delta}^2)}{\partial\delta^2}(\delta^2 - \bar{\delta}^2) \Big] + E\Big[ \frac{1}{2}\frac{\partial^2 F_{p,K-p-q+1}(z; \tilde{\delta}^2)}{\partial(\delta^2)^2}(\delta^2 - \bar{\delta}^2)^2 \Big]
    = F_{p,K-p-q+1}(z; \bar{\delta}^2) + E\Big[ \frac{1}{2}\frac{\partial^2 F_{p,K-p-q+1}(z; \tilde{\delta}^2)}{\partial(\delta^2)^2}(\delta^2 - \bar{\delta}^2)^2 \Big]    (51)

for some \tilde{\delta}^2 between \bar{\delta}^2 and \delta^2. By definition,

    F_{p,K-p-q+1}(z; \lambda) = P\Big( \frac{\chi_p^2(\lambda)/p}{\chi^2_{K-p-q+1}/(K-p-q+1)} < z \Big) = E\,G_p\Big( pz\frac{\chi^2_{K-p-q+1}}{K-p-q+1};\ \lambda \Big),

where G_p(z; \lambda) is the CDF of the noncentral chi-square distribution \chi_p^2(\lambda) with noncentrality parameter \lambda. In view of the relationship G_p(z; \lambda) = \exp(-\lambda/2)\sum_{j=0}^\infty \frac{(\lambda/2)^j}{j!}G_{p+2j}(z), we have

    \frac{\partial F_{p,K-p-q+1}(z; \lambda)}{\partial\lambda}
    = -\frac{1}{2}\exp\Big(-\frac{\lambda}{2}\Big)\sum_{j=0}^\infty \frac{(\lambda/2)^j}{j!}\,E\,G_{p+2j}\Big( pz\frac{\chi^2_{K-p-q+1}}{K-p-q+1} \Big) + \frac{1}{2}\exp\Big(-\frac{\lambda}{2}\Big)\sum_{j=0}^\infty \frac{(\lambda/2)^j}{j!}\,E\,G_{p+2+2j}\Big( pz\frac{\chi^2_{K-p-q+1}}{K-p-q+1} \Big)
    = \frac{1}{2}\sum_{j=0}^\infty \exp\Big(-\frac{\lambda}{2}\Big)\frac{(\lambda/2)^j}{j!}\,E\Big[ G_{p+2+2j}\Big( pz\frac{\chi^2_{K-p-q+1}}{K-p-q+1} \Big) - G_{p+2j}\Big( pz\frac{\chi^2_{K-p-q+1}}{K-p-q+1} \Big) \Big],

and so, differentiating once more and using 0 \le G_{p+2j}(\cdot) \le 1,

    \Big| \frac{\partial^2 F_{p,K-p-q+1}(z; \lambda)}{\partial\lambda^2} \Big| \le \frac{1}{4} + \frac{1}{4} = \frac{1}{2}

for all z and \lambda. Combining the boundedness of \partial^2 F_{p,K-p-q+1}(z; \lambda)/\partial\lambda^2 with (51) yields

    P\big( K^{-1}(K-p-q+1)F_\infty < z \big) = F_{p,K-p-q+1}(z; \bar{\delta}^2) + O\big(var(\delta^2)\big) = F_{p,K-p-q+1}(z; \bar{\delta}^2) + O\Big(\frac{1}{K^2}\Big) = F_{p,K-p-q+1}(z; \bar{\delta}^2) + o\Big(\frac{1}{K}\Big).
Part (c). It follows from Part (b) that

    P(pF_\infty < z) = E\,G_p\Big( z\frac{\chi^2_{K-p-q+1}}{K};\ \bar{\delta}^2 \Big) + o\Big(\frac{1}{K}\Big)
    = E\,G_p\Big( z\frac{\chi^2_{K-p-q+1}}{K};\ 0 \Big) + E\Big[ \frac{\partial G_p}{\partial\lambda}\Big( z\frac{\chi^2_{K-p-q+1}}{K};\ 0 \Big) \Big]\bar{\delta}^2 + E\Big[ \frac{1}{2}\frac{\partial^2 G_p}{\partial\lambda^2}\Big( z\frac{\chi^2_{K-p-q+1}}{K};\ \tilde{\lambda} \Big) \Big]\bar{\delta}^4 + o\Big(\frac{1}{K}\Big),    (52)

where \tilde{\lambda} is between 0 and \bar{\delta}^2. As in the proof of Part (b), we can show that

    \Big| \frac{\partial^2}{\partial\lambda^2}G_p(z; \lambda) \Big| \le 1.    (53)

As a result,

    E\Big[ \frac{1}{2}\frac{\partial^2 G_p}{\partial\lambda^2}\Big( z\frac{\chi^2_{K-p-q+1}}{K};\ \tilde{\lambda} \Big)\bar{\delta}^4 \Big] = O\Big(\frac{1}{K^2}\Big) = o\Big(\frac{1}{K}\Big).

Consequently,

    P(pF_\infty < z) = E\,G_p\Big( z\frac{\chi^2_{K-p-q+1}}{K};\ 0 \Big) + E\Big[ \frac{\partial G_p}{\partial\lambda}\Big( z\frac{\chi^2_{K-p-q+1}}{K};\ 0 \Big) \Big]\bar{\delta}^2 + o\Big(\frac{1}{K}\Big).

By direct calculations, it is easy to show that

    \frac{\partial G_p(z; \lambda)}{\partial\lambda}\Big|_{\lambda=0} = -\frac{1}{2}\big[ G_p(z) - G_{p+2}(z) \big] = -\frac{1}{p}\frac{z^{p/2}e^{-z/2}}{2^{p/2}\Gamma(p/2)} = -\frac{1}{p}G_p'(z)z.    (54)

Therefore,

    P(pF_\infty < z) = E\,G_p\Big( z\frac{\chi^2_{K-p-q+1}}{K} \Big) - \frac{\bar{\delta}^2}{p}E\Big[ G_p'\Big( z\frac{\chi^2_{K-p-q+1}}{K} \Big)z\frac{\chi^2_{K-p-q+1}}{K} \Big] + o\Big(\frac{1}{K}\Big)
    = G_p(z) + G_p'(z)z\,E\Big( \frac{\chi^2_{K-p-q+1}}{K} - 1 \Big) + \frac{1}{2}G_p''(z)z^2\,E\Big( \frac{\chi^2_{K-p-q+1}}{K} - 1 \Big)^2 - \frac{\bar{\delta}^2}{p}G_p'(z)z + o\Big(\frac{1}{K}\Big)
    = G_p(z) - G_p'(z)z\frac{p+q-1}{K} + G_p''(z)z^2\frac{K-p-q+1}{K^2} - \frac{q}{K}G_p'(z)z + o\Big(\frac{1}{K}\Big)
    = G_p(z) - \frac{p+2q-1}{K}G_p'(z)z + \frac{1}{K}G_p''(z)z^2 + o\Big(\frac{1}{K}\Big).
Proof of Theorem 4. Part (a). Using the same argument as in the proof of Theorem 3(a), we have

    t_\infty \overset{d}{=} \frac{B_1(1) - C_{1q}C_{qq}^{-1}B_q(1)}{\sqrt{\chi^2_{K-q}/K}}

and so

    \sqrt{\frac{K-q}{K}}\,t_\infty \overset{d}{=} \frac{B_1(1) - C_{1q}C_{qq}^{-1}B_q(1)}{\sqrt{\chi^2_{K-q}/(K-q)}} \overset{d}{=} t_{K-q}(\delta),    (55)

a noncentral t random variable with K-q degrees of freedom, conditional on the (random) noncentrality parameter \delta.

Part (b). Since the distribution of t_\infty is symmetric about 0, we have, for any z \in R^+,

    P\Big( \sqrt{\frac{K-q}{K}}\,t_\infty < z \Big) = \frac{1}{2} + \frac{1}{2}P\Big( \Big|\sqrt{\frac{K-q}{K}}\,t_\infty\Big| < |z| \Big) = \frac{1}{2} + \frac{1}{2}P\Big( \frac{K-q}{K}t_\infty^2 < z^2 \Big)
    = \frac{1}{2} + \frac{1}{2}F_{1,K-q}(z^2; \bar{\delta}^2) + o\Big(\frac{1}{K}\Big),

where the last equality follows from Theorem 3(b) with p = 1. When z \in R^-, we have

    P\Big( \sqrt{\frac{K-q}{K}}\,t_\infty < z \Big) = \frac{1}{2} - \frac{1}{2}P\Big( \Big|\sqrt{\frac{K-q}{K}}\,t_\infty\Big| < |z| \Big) = \frac{1}{2} - \frac{1}{2}F_{1,K-q}(z^2; \bar{\delta}^2) + o\Big(\frac{1}{K}\Big).

Therefore

    P\Big( \sqrt{\frac{K-q}{K}}\,t_\infty < z \Big) = \frac{1}{2} + \frac{1}{2}\mathrm{sgn}(z)\,F_{1,K-q}(z^2; \bar{\delta}^2) + o\Big(\frac{1}{K}\Big).

Part (c). Using Theorem 3(c) and the symmetry of the distribution of t_\infty about 0, we have, for any z \in R^+,

    P(t_\infty < z) = \frac{1}{2} + \frac{1}{2}P(|t_\infty| < |z|) = \frac{1}{2} + \frac{1}{2}P(t_\infty^2 < z^2)
    = \frac{1}{2} + \frac{1}{2}\Big[ G_1(z^2) - \frac{2q}{K}G_1'(z^2)z^2 + \frac{1}{K}G_1''(z^2)z^4 \Big] + o\Big(\frac{1}{K}\Big).

Using the relationships

    \frac{1}{2} + \frac{1}{2}G_1(z^2) = \Phi(z)

and

    \frac{q}{K}G_1'(z^2)z^2 - \frac{1}{2K}G_1''(z^2)z^4 = \frac{1}{4K}\phi(z)\big[ z^3 + z(4q+1) \big],

we have

    P(t_\infty < z) = \Phi(z) - \frac{1}{4K}z\phi(z)\big[ z^2 + (4q+1) \big] + o\Big(\frac{1}{K}\Big).

Similarly, when z \in R^-, we have

    P(t_\infty < z) = \Phi(z) + \frac{1}{4K}z\phi(z)\big[ z^2 + (4q+1) \big] + o\Big(\frac{1}{K}\Big).

Therefore

    P(t_\infty < z) = \Phi(z) - \frac{1}{4K}|z|\phi(z)\big[ z^2 + (4q+1) \big] + o\Big(\frac{1}{K}\Big).
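The expansion in Part (c) can be checked against a direct simulation of t_\infty from its representation in Part (a). The sketch below (numpy and the standard library only; q = 2, K = 24, z = 2 and the replication count are arbitrary illustrative choices) compares the empirical CDF with the corrected normal approximation.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
q, K, R, z = 2, 24, 100_000, 2.0   # illustrative values

# Simulate t_infinity via Theorem 4(a): the numerator is B_1(1) - C_{1q} Cqq^{-1} B_q(1)
# and the denominator is sqrt(chi^2_{K-q}/K), with C_{1q}, C_{qq} built as in (49).
xi = rng.standard_normal((R, K, 1))
eta = rng.standard_normal((R, K, q))
B1 = rng.standard_normal(R)
Bq = rng.standard_normal((R, q))

C1q = np.einsum("rkp,rkq->rq", xi, eta) / K
Cqq = np.einsum("rkq,rks->rqs", eta, eta) / K
w = np.linalg.solve(Cqq, Bq[:, :, None])[:, :, 0]
t = (B1 - np.einsum("rq,rq->r", C1q, w)) / np.sqrt(rng.chisquare(K - q, size=R) / K)

emp = np.mean(t < z)

# Theorem 4(c): P(t_inf < z) = Phi(z) - |z| phi(z) [z^2 + (4q+1)] / (4K) + o(1/K)
Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
phi = math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)
approx = Phi - abs(z) * phi * (z * z + 4 * q + 1) / (4 * K)
print(emp, approx)   # close; both below Phi(2) = 0.9772
```

The downward correction relative to \Phi(z) reflects the heavier tails of t_\infty, which is why normal critical values over-reject when K is moderate.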
Before proving Theorem 5, we present a technical lemma. Part (i) of the lemma is proved in Sun (2014). Parts (ii) and (iii) of the lemma are proved in Sun, Phillips and Jin (2011). Define g_0 = \lim_{x\to 0}[1-k(x)]/x^{q_0}, where q_0 is the Parzen exponent of the kernel function, c_1 = \int_{-1}^1 k(x)\,dx, and c_2 = \int_{-1}^1 k^2(x)\,dx. Recall the definitions of \mu_1 and \mu_2 in (9).

Lemma 5 (i) For conventional kernel HAR variance estimators, we have, as h \to \infty:
(a) \mu_1 = 1 - bc_1 + O(b^2);
(b) \mu_2 = bc_2 + O(b^2).
(ii) For sharp kernel HAR variance estimators, we have, as h \to \infty:
(a) \mu_1 = 1 - \frac{2}{\rho+2};
(b) \mu_2 = \frac{1}{\rho+1} + O\big(\frac{1}{\rho^2}\big).
(iii) For steep kernel HAR variance estimators, we have, as h \to \infty:
(a) \mu_1 = 1 - \sqrt{\pi}\,(g_0\rho)^{-1/2} + O\big(\frac{1}{\rho}\big);
(b) \mu_2 = \sqrt{\pi}\,(2g_0\rho)^{-1/2} + O\big(\frac{1}{\rho}\big).
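Part (i) of Lemma 5 is easy to verify numerically. The sketch below does so for the Bartlett kernel, for which c_1 = 1 and c_2 = 2/3. Identifying \mu_1 with \int_0^1 Q_h^*(r,r)\,dr and \mu_2 with \int_0^1\!\int_0^1 [Q_h^*(r,s)]^2\,dr\,ds is an assumption of the sketch (the paper's equation (9) is not reproduced here), chosen to be consistent with E C_{qq} = \mu_1 I_q used in the proof of Theorem 5; b and the grid size are illustrative choices.

```python
import numpy as np

# Centered kernel Q*(r,s) for the Bartlett kernel on a midpoint grid.
b, n = 0.05, 800
r = (np.arange(n) + 0.5) / n
kb = np.maximum(1.0 - np.abs(r[:, None] - r[None, :]) / b, 0.0)   # k((r - s)/b)
Q = kb - kb.mean(axis=0, keepdims=True) - kb.mean(axis=1, keepdims=True) + kb.mean()

mu1 = np.trace(Q) / n           # approximates int_0^1 Q*(r,r) dr
mu2 = (Q ** 2).mean()           # approximates int int [Q*(r,s)]^2 dr ds

# Lemma 5(i) for the Bartlett kernel: c1 = 1 and c2 = 2/3, so
# mu_1 = 1 - b + O(b^2) and mu_2 = 2b/3 + O(b^2).
print(mu1, 1 - b)        # ~0.9508 vs 0.95
print(mu2, 2 * b / 3)    # ~0.031 vs 0.0333
```

The O(b^2) remainders are visible in the small gaps between the numerical values and the leading-order predictions; they shrink as b decreases.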
Proof of Theorem 5. Part (a). Recall

    pF_\infty = [B_p(1) - C_{pq}C_{qq}^{-1}B_q(1)]' D_{pp}^{-1} [B_p(1) - C_{pq}C_{qq}^{-1}B_q(1)].

Conditional on (C_{pq}, C_{qq}, C_{pp}), B_p(1) - C_{pq}C_{qq}^{-1}B_q(1) is normal with mean zero and variance I_p + C_{pq}C_{qq}^{-1}C_{qq}^{-1}C_{pq}'. Let L be the lower triangular matrix such that LL' is the Choleski decomposition of I_p + C_{pq}C_{qq}^{-1}C_{qq}^{-1}C_{pq}'. Then the conditional distribution of \xi := L^{-1}[B_p(1) - C_{pq}C_{qq}^{-1}B_q(1)] is N(0, I_p). Since this conditional distribution does not depend on (C_{pq}, C_{qq}, C_{pp}), we can conclude that \xi is independent of (C_{pq}, C_{qq}, C_{pp}). So we can write

    pF_\infty \overset{d}{=} \xi'A\xi,  where A = L'D_{pp}^{-1}L.

Given that A is a function of (C_{pq}, C_{qq}, C_{pp}), we know that \xi and A are independent. As a result, \xi'A\xi \overset{d}{=} \xi'(OAO')\xi for any orthonormal matrix O that is independent of \xi.

Let H = (\xi/\|\xi\|,\ \tilde{H}) be an orthonormal matrix with first column \xi/\|\xi\|. We choose \tilde{H} to be independent of A; this is possible as \xi and A are independent. Then

    pF_\infty \overset{d}{=} \|\xi\|^2\,\frac{\xi'}{\|\xi\|}(HH')OAO'(HH')\frac{\xi}{\|\xi\|} = \|\xi\|^2\,e_p'H'OAO'He_p,

where e_p = (1, 0, \ldots, 0)' is the first basis vector in R^p. Since \|\xi\|^2, H and A are mutually independent, we can write

    pF_\infty \overset{d}{=} \|\xi\|^2\,e_p'H_0'O'AOH_0e_p

for any orthonormal matrix H_0 that is independent of both \xi and A. Letting O = H_0', we obtain

    pF_\infty \overset{d}{=} \|\xi\|^2\,e_p'Ae_p = \frac{\|\xi\|^2}{(e_p'Ae_p)^{-1}} = \frac{\|\xi\|^2}{(e_p'L'D_{pp}^{-1}Le_p)^{-1}}.

Since Le_p is the first column of L, we have, using the definition of the Choleski decomposition:

    Le_p = \frac{(I_p + C_{pq}C_{qq}^{-1}C_{qq}^{-1}C_{pq}')e_p}{\sqrt{e_p'(I_p + C_{pq}C_{qq}^{-1}C_{qq}^{-1}C_{pq}')e_p}}.

As a result,

    pF_\infty \overset{d}{=} \frac{\|\xi\|^2}{\tau^2},

where

    \tau^2 = (e_p'L'D_{pp}^{-1}Le_p)^{-1} = \frac{e_p'(I_p + C_{pq}C_{qq}^{-1}C_{qq}^{-1}C_{pq}')e_p}{e_p'(I_p + C_{pq}C_{qq}^{-1}C_{qq}^{-1}C_{pq}')D_{pp}^{-1}(I_p + C_{pq}C_{qq}^{-1}C_{qq}^{-1}C_{pq}')e_p}.

Part (b). It is easy to show that

    EC_{qq} = \mu_1 I_q  and  var[vec(C_{qq})] = \mu_2(I_{qq} + K_{qq}),

where \mu_1 and \mu_2 are defined in (9), I_{qq} is the q^2 \times q^2 identity matrix, and K_{qq} is the q^2 \times q^2 commutation matrix. So C_{qq} = \mu_1 I_q + o_p(1) and C_{qq}^{-1} = \mu_1^{-1}I_q + o_p(1) as h \to \infty. Similarly, C_{pp} = \mu_1 I_p + o_p(1) and C_{pp}^{-1} = \mu_1^{-1}I_p + o_p(1) as h \to \infty. In addition, using the same argument, we can show that C_{pq} = o_p(1). Therefore

    \tau^2 = \frac{e_p'(I_p + \mu_1^{-2}C_{pq}C_{pq}')e_p}{\mu_1^{-1}e_p'(I_p + \mu_1^{-2}C_{pq}C_{pq}')^2 e_p}\,(1 + o_p(1)) = 1 + o_p(1),

since \mu_1 \to 1 and C_{pq} = o_p(1). That is, \tau^2 \to^p 1 as h \to \infty.

We proceed to prove the distributional expansion. The (i,j)-th elements C_{pp}(i,j), C_{pq}(i,j) and C_{qq}(i,j) of C_{pp}, C_{pq} and C_{qq} are equal to either \int_0^1\!\int_0^1 Q_h^*(r,s)\,dB(r)\,dB(s) or \int_0^1\!\int_0^1 Q_h^*(r,s)\,dB(r)\,d\tilde{B}(s), where B(\cdot) and \tilde{B}(\cdot) are independent standard Brownian motion processes. By direct calculations, we have, for any \varsigma \in (0, 3/8):

    P\big( |C_{ef}(i,j) - EC_{ef}(i,j)| > \mu_2^{\varsigma} \big) \le \frac{E|C_{ef}(i,j) - EC_{ef}(i,j)|^8}{\mu_2^{8\varsigma}} = O\big(\mu_2^{4-8\varsigma}\big) = o(\mu_2),

where e, f = p or q. Define the event E as

    E = \{ \omega: |C_{ef}(i,j) - EC_{ef}(i,j)| \le \mu_2^{\varsigma}\ \text{for all}\ i, j\ \text{and all}\ e\ \text{and}\ f \};

then the complement E^c of E satisfies P(E^c) = o(\mu_2) as h \to \infty. Let \tilde{E} be another event; then

    P(\tilde{E}) = P(\tilde{E}\cap E) + P(\tilde{E}\cap E^c) = P(\tilde{E}\cap E) + o(\mu_2) = P(\tilde{E}|E)P(E) + o(\mu_2) = P(\tilde{E}|E)(1 - o(\mu_2)) + o(\mu_2) = P(\tilde{E}|E) + o(\mu_2).

That is, up to an error of order o(\mu_2), P(\tilde{E}), P(\tilde{E}\cap E) and P(\tilde{E}|E) are equivalent. So for the purpose of proving the theorem, it is innocuous to condition on E or to remove the conditioning, if needed.

Now, conditioning on E, the numerator of \tau^2 satisfies

    e_p'(I_p + C_{pq}C_{qq}^{-1}C_{qq}^{-1}C_{pq}')e_p = 1 + \mu_1^{-2}e_p'C_{pq}(I_q - M_{qq})(I_q - M_{qq})C_{pq}'e_p = 1 + \mu_1^{-2}e_p'C_{pq}\tilde{M}_{qq}C_{pq}'e_p,

where M_{qq} is a matrix with elements M_{qq}(i,j) satisfying |M_{qq}(i,j)| = O(\mu_2^{\varsigma}) conditional on E, and \tilde{M}_{qq} is a matrix satisfying |\tilde{M}_{qq}(i,j) - 1\{i=j\}| = O(\mu_2^{\varsigma}) conditional on E.

Let

    C_{p+q,p+q} = [ C_{pp}  C_{pq} ; C_{pq}'  C_{qq} ]

and e_{p+q,p} be the matrix consisting of the first p columns of the identity matrix I_{p+q}. To evaluate the denominator of \tau^2, we note that

    D_{pp}^{-1} = e_{p+q,p}'C_{p+q,p+q}^{-1}e_{p+q,p}
    = \mu_1^{-1}e_{p+q,p}'\big( I_{p+q} - \mu_1^{-1}[C_{p+q,p+q} - EC_{p+q,p+q}] + \mu_1^{-2}[C_{p+q,p+q} - EC_{p+q,p+q}][C_{p+q,p+q} - EC_{p+q,p+q}] \big)e_{p+q,p} + M_{pp}
    = \mu_1^{-1}\big( I_p - \mu_1^{-1}[C_{pp} - EC_{pp}] + \mu_1^{-2}\{[C_{pp} - EC_{pp}][C_{pp} - EC_{pp}] + C_{pq}C_{pq}'\} \big) + M_{pp},

where M_{pp} is a matrix with elements M_{pp}(i,j) satisfying |M_{pp}(i,j)| = O(\mu_2^{3\varsigma}) = o(\mu_2) for \varsigma > 1/3, conditional on E.

For the purpose of proving our result, M_{pp} can be ignored, as its presence generates an approximation error of order o(\mu_2), which is the same as the order of the approximation error given in the theorem. More specifically, let

    \tilde{C}_{pp} = \mu_1^{-2}C_{pq}\tilde{M}_{qq}C_{pq}',  \tilde{D}_{pp} = \mu_1^{-1}\big( I_p - \mu_1^{-1}[C_{pp} - EC_{pp}] + \mu_1^{-2}\{[C_{pp} - EC_{pp}][C_{pp} - EC_{pp}] + C_{pq}C_{pq}'\} \big)

and

    \tilde{\tau}^2 := \tilde{\tau}^2(\tilde{C}_{pp}, \tilde{D}_{pp}) = \frac{1 + e_p'\tilde{C}_{pp}e_p}{e_p'(I_p + \tilde{C}_{pp})\tilde{D}_{pp}(I_p + \tilde{C}_{pp})e_p};

then

    P(pF_\infty < z) = E\,G_p(\tau^2 z) = E\,G_p(\tilde{\tau}^2 z) + o(\mu_2).

Note that for any q \times q matrix L_{qq}, we have

    E C_{pq}L_{qq}C_{pq}' = E\int_0^1\!\int_0^1\!\int_0^1\!\int_0^1 Q_h^*(r_1,s_1)Q_h^*(r_2,s_2)\,dB_p(r_1)\,dB_q(s_1)'L_{qq}\,dB_q(s_2)\,dB_p(r_2)'
    = E\int_0^1\!\int_0^1\!\int_0^1\!\int_0^1 Q_h^*(r_1,s_1)Q_h^*(r_2,s_2)\,dB_p(r_1)\,dB_p(r_2)'\,tr\big[ dB_q(s_2)\,dB_q(s_1)'L_{qq} \big]
    = tr(L_{qq})\int_0^1\!\int_0^1 [Q_h^*(r,s)]^2\,dr\,ds\ I_p = tr(L_{qq})\,\mu_2\,I_p.    (56)

Expanding \tilde{\tau}^2(\tilde{C}_{pp}, \tilde{D}_{pp}) and retaining terms up to the required order, we obtain \tilde{\tau}^2 = \tau_0^2 + err, where err is the approximation error and

    \tau_0^2 = \mu_1\big( 1 - e_p'\tilde{C}_{pp}e_p + \mu_1^{-1}e_p'[C_{pp} - EC_{pp}]e_p - \mu_1^{-2}e_p'\{[C_{pp} - EC_{pp}][C_{pp} - EC_{pp}] + C_{pq}C_{pq}'\}e_p + \mu_1^{-2}(e_p'[C_{pp} - EC_{pp}]e_p)^2 \big).

We keep enough terms in \tau_0^2 so that E\,G_p(\tilde{\tau}^2 z) = E\,G_p(\tau_0^2 z) + o(\mu_2). Now we write

    P(pF_\infty < z) = E\,G_p(\tau_0^2 z) + o(\mu_2) = G_p(z) + G_p'(z)z\,E(\tau_0^2 - 1) + \frac{1}{2}G_p''(z)z^2\,E(\tau_0^2 - 1)^2 + o(\mu_2).

In view of

    E(\tau_0^2 - 1) = (\mu_1 - 1) - \big( q\mu_2 + (p+1)\mu_2 + q\mu_2 - 2\mu_2 \big) + o(\mu_2) = (\mu_1 - 1) - (p + 2q - 1)\mu_2 + o(\mu_2),

where we used (56) to obtain E e_p'\tilde{C}_{pp}e_p = q\mu_2 + o(\mu_2) and E e_p'C_{pq}C_{pq}'e_p = q\mu_2, together with E e_p'[C_{pp} - EC_{pp}][C_{pp} - EC_{pp}]e_p = (p+1)\mu_2 + o(\mu_2) and E(e_p'[C_{pp} - EC_{pp}]e_p)^2 = 2\mu_2 + o(\mu_2), and

    E(\tau_0^2 - 1)^2 = E\big\{ (\mu_1 - 1) + e_p'[C_{pp} - EC_{pp}]e_p \big\}^2 + o(\mu_2) = 2\mu_2 + (\mu_1 - 1)^2 + o(\mu_2),

we have

    P(pF_\infty < z) = G_p(z) + G_p'(z)z\big[ (\mu_1 - 1) - (p + 2q - 1)\mu_2 \big] + \frac{1}{2}G_p''(z)z^2\big[ 2\mu_2 + (\mu_1 - 1)^2 \big] + o(\mu_2)
    = G_p(z) + G_p'(z)z(\mu_1 - 1) - (p + 2q - 1)\mu_2\,G_p'(z)z + G_p''(z)z^2\mu_2 + o(\mu_2),    (57)

since (\mu_1 - 1)^2 = o(\mu_2) by Lemma 5.
Proof of Theorem 6. We prove part (a) only as the proof for part (b) is similar. Using the same arguments and the notation as in the proof of Theorem 1, we have

    R(G^{*\prime}\tilde{W}_\infty^{-1}G^*)^{-1}G^{*\prime}\tilde{W}_\infty^{-1}B_m(1) + \delta_0 \overset{d}{=} RVA^{-1}[I_d, -C_{12}C_{22}^{-1}]B_m(1) + \delta_0

and

    R(G^{*\prime}\tilde{W}_\infty^{-1}G^*)^{-1}R' \overset{d}{=} RVA^{-1}(C^{11})^{-1}(A')^{-1}V'R'.

So

    F_{\infty,\delta_0} \overset{d}{=} \big[ RVA^{-1}[I_d, -C_{12}C_{22}^{-1}]B_m(1) + \delta_0 \big]' \big[ RVA^{-1}(C^{11})^{-1}(A')^{-1}V'R' \big]^{-1} \big[ RVA^{-1}[I_d, -C_{12}C_{22}^{-1}]B_m(1) + \delta_0 \big]/p.

Let B_m(1) = [B_d'(1), B_q'(1)]' and \tilde{U}\tilde{\Sigma}\tilde{V}' be a SVD of RVA^{-1} with \tilde{\Sigma} = [\tilde{A}, O]. Then

    F_{\infty,\delta_0} \overset{d}{=} \big\{ \tilde{U}\tilde{\Sigma}\tilde{V}'[B_d(1) - C_{12}C_{22}^{-1}B_q(1)] + \delta_0 \big\}' \big[ \tilde{U}\tilde{\Sigma}\tilde{V}'(C^{11})^{-1}\tilde{V}\tilde{\Sigma}'\tilde{U}' \big]^{-1} \big\{ \tilde{U}\tilde{\Sigma}\tilde{V}'[B_d(1) - C_{12}C_{22}^{-1}B_q(1)] + \delta_0 \big\}/p
    = \big\{ \tilde{\Sigma}\tilde{V}'[B_d(1) - C_{12}C_{22}^{-1}B_q(1)] + \tilde{U}'\delta_0 \big\}' \big[ \tilde{\Sigma}\tilde{V}'(C^{11})^{-1}\tilde{V}\tilde{\Sigma}' \big]^{-1} \big\{ \tilde{\Sigma}\tilde{V}'[B_d(1) - C_{12}C_{22}^{-1}B_q(1)] + \tilde{U}'\delta_0 \big\}/p.

The above distribution does not depend on the orthonormal matrix \tilde{V} except through quantities whose joint distribution is invariant to \tilde{V}. So, proceeding as in the proof of Theorem 1,

    F_{\infty,\delta_0} \overset{d}{=} \big\{ \tilde{A}[B_p(1) - C_{pq}C_{qq}^{-1}B_q(1)] + \tilde{U}'\delta_0 \big\}' \big[ \tilde{A}D_{pp}\tilde{A}' \big]^{-1} \big\{ \tilde{A}[B_p(1) - C_{pq}C_{qq}^{-1}B_q(1)] + \tilde{U}'\delta_0 \big\}/p
    = \big[ B_p(1) - C_{pq}C_{qq}^{-1}B_q(1) + \tilde{A}^{-1}\tilde{U}'\delta_0 \big]' D_{pp}^{-1} \big[ B_p(1) - C_{pq}C_{qq}^{-1}B_q(1) + \tilde{A}^{-1}\tilde{U}'\delta_0 \big]/p.

Let H = (\tilde{A}^{-1}\tilde{U}'\delta_0/\|\tilde{A}^{-1}\tilde{U}'\delta_0\|,\ \tilde{H}) be a p \times p orthonormal matrix; then

    F_{\infty,\delta_0} \overset{d}{=} \big[ H'B_p(1) - H'C_{pq}C_{qq}^{-1}B_q(1) + e_p\|\tilde{A}^{-1}\tilde{U}'\delta_0\| \big]' (H'D_{pp}^{-1}H) \big[ H'B_p(1) - H'C_{pq}C_{qq}^{-1}B_q(1) + e_p\|\tilde{A}^{-1}\tilde{U}'\delta_0\| \big]/p.

But the joint distribution of (H'B_p(1), H'C_{pq}, H'D_{pp}^{-1}H) is invariant to H. Hence we can write

    F_{\infty,\delta_0} \overset{d}{=} \big[ B_p(1) - C_{pq}C_{qq}^{-1}B_q(1) + e_p\|\tilde{A}^{-1}\tilde{U}'\delta_0\| \big]' D_{pp}^{-1} \big[ B_p(1) - C_{pq}C_{qq}^{-1}B_q(1) + e_p\|\tilde{A}^{-1}\tilde{U}'\delta_0\| \big]/p.

That is, the distribution of F_{\infty,\delta_0} depends on \delta_0 only through \|\tilde{A}^{-1}\tilde{U}'\delta_0\|. Note that

    \tilde{U}\tilde{A}\tilde{A}'\tilde{U}' = (\tilde{U}\tilde{\Sigma}\tilde{V}')(\tilde{U}\tilde{\Sigma}\tilde{V}')' = (RVA^{-1})(RVA^{-1})' = RVA^{-2}V'R' = R(G^{*\prime}G^*)^{-1}R' = R\big[ G'(\Lambda\Lambda')^{-1}G \big]^{-1}R',

and so

    \|\tilde{A}^{-1}\tilde{U}'\delta_0\|^2 = \delta_0'(\tilde{U}\tilde{A}\tilde{A}'\tilde{U}')^{-1}\delta_0 = \delta_0'\big\{ R\big[ G'(\Lambda\Lambda')^{-1}G \big]^{-1}R' \big\}^{-1}\delta_0 = \|\lambda\|^2,

where \lambda = \{R[G'(\Lambda\Lambda')^{-1}G]^{-1}R'\}^{-1/2}\delta_0. We therefore have

    F_{\infty,\delta_0} \overset{d}{=} \big[ B_p(1) - C_{pq}C_{qq}^{-1}B_q(1) + e_p\|\lambda\| \big]' D_{pp}^{-1} \big[ B_p(1) - C_{pq}C_{qq}^{-1}B_q(1) + e_p\|\lambda\| \big]/p = F_\infty(\|\lambda\|^2).
Proof of Proposition 7. We first prove (20). Using the singular value decomposition G = U_G\Sigma_G V_G' and repeating the block calculations in the proof of Theorem 1 with W_{U,\infty} in place of \tilde{W}_\infty, partition W_{U,\infty} and \Sigma_U conformably into d \times d, d \times q, q \times d and q \times q blocks, and let

    W_{U,\infty}^{11\cdot 2} = W_{U,\infty}^{11} - W_{U,\infty}^{12}(W_{U,\infty}^{22})^{-1}W_{U,\infty}^{21}  and  B_{U,\infty} = W_{U,\infty}^{12}(W_{U,\infty}^{22})^{-1}.

The same manipulations as before then give

    V = RV_G A_G^{-1}\,\Sigma_U^{11\cdot 2}\,A_G^{-1}V_G'R',  where \Sigma_U^{11\cdot 2} = \Sigma_U^{11} - \Sigma_U^{12}(\Sigma_U^{22})^{-1}\Sigma_U^{21},

and

    \tilde{V} = RV_G A_G^{-1}\big[ \Sigma_U^{11} - B_{U,\infty}\Sigma_U^{21} - \Sigma_U^{12}B_{U,\infty}' + B_{U,\infty}\Sigma_U^{22}B_{U,\infty}' \big]A_G^{-1}V_G'R',

which proves (20).

Next, we prove the proposition. Writing \|\tilde{\lambda}\|^2 = \delta_0'\tilde{V}^{-1}\delta_0 and

    \|\lambda\|^2 = \delta_0'V^{-1}\delta_0 = \delta_0'\tilde{V}^{-1/2}\big[ \tilde{V}^{-1/2}V\tilde{V}^{-1/2} \big]^{-1}\tilde{V}^{-1/2}\delta_0,

the above equality holds when \tilde{V} is nonsingular. But, by the identity

    \Sigma_U^{11} - B_{U,\infty}\Sigma_U^{21} - \Sigma_U^{12}B_{U,\infty}' + B_{U,\infty}\Sigma_U^{22}B_{U,\infty}' = \Sigma_U^{11\cdot 2} + \big( \Sigma_U^{12} - B_{U,\infty}\Sigma_U^{22} \big)(\Sigma_U^{22})^{-1}\big( \Sigma_U^{21} - \Sigma_U^{22}B_{U,\infty}' \big),

we have

    \tilde{V} - V = \big[ RV_G A_G^{-1}(\Sigma_U^{12} - B_{U,\infty}\Sigma_U^{22})(\Sigma_U^{22})^{-1/2} \big]\big[ RV_G A_G^{-1}(\Sigma_U^{12} - B_{U,\infty}\Sigma_U^{22})(\Sigma_U^{22})^{-1/2} \big]' \ge 0,

so \tilde{V} \ge V and

    I_p - \tilde{V}^{-1/2}V\tilde{V}^{-1/2} = \tilde{V}^{-1/2}(\tilde{V} - V)\tilde{V}^{-1/2} = \bar{\Sigma}_{12}\bar{\Sigma}_{12}',

where \bar{\Sigma}_{12} := \tilde{V}^{-1/2}RV_G A_G^{-1}(\Sigma_U^{12} - B_{U,\infty}\Sigma_U^{22})(\Sigma_U^{22})^{-1/2}. Hence

    \tilde{V}^{-1/2}V\tilde{V}^{-1/2} = I_p - \bar{\Sigma}_{12}\bar{\Sigma}_{12}',

as desired.
References

[1] Andersen, T. G. and Sorensen, B. E. (1996): "GMM Estimation of a Stochastic Volatility Model: A Monte Carlo Study." Journal of Business & Economic Statistics 14(3), 328–352.

[2] Andrews, D. W. K. (1991): "Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation." Econometrica 59, 817–858.

[3] Andrews, D. W. K. and J. C. Monahan (1992): "An Improved Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimator." Econometrica 60(4), 953–966.

[4] Bester, A., Conley, T., Hansen, C., and Vogelsang, T. (2011): "Fixed-b Asymptotics for Spatially Dependent Robust Nonparametric Covariance Matrix Estimators." Working paper, Michigan State University.

[5] Bester, C. A., T. G. Conley, and C. B. Hansen (2011): "Inference with Dependent Data Using Cluster Covariance Estimators." Journal of Econometrics 165(2), 137–151.

[6] Chen, X., Z. Liao and Y. Sun (2014): "Sieve Inference on Possibly Misspecified Semi-nonparametric Time Series Models." Journal of Econometrics 178(3), 639–658.

[7] Gonçalves, S. (2011): "The Moving Blocks Bootstrap for Panel Linear Regression Models with Individual Fixed Effects." Econometric Theory 27(5), 1048–1082.

[8] Gonçalves, S. and T. Vogelsang (2011): "Block Bootstrap HAC Robust Tests: The Sophistication of the Naive Bootstrap." Econometric Theory 27(4), 745–791.

[9] Ghysels, E., A. C. Harvey and E. Renault (1996): "Stochastic Volatility." In G. S. Maddala and C. R. Rao (Eds.), Statistical Methods in Finance, 119–191.

[10] Han, C. and P. C. B. Phillips (2006): "GMM with Many Moment Conditions." Econometrica 74(1), 147–192.

[11] Hansen, L. P. (1982): "Large Sample Properties of Generalized Method of Moments Estimators." Econometrica 50, 1029–1054.

[12] Hall, A. R. (2000): "Covariance Matrix Estimation and the Power of the Over-identifying Restrictions Test." Econometrica 68, 1517–1527.

[13] Hall, A. R. (2005): Generalized Method of Moments. Oxford University Press.

[14] Ibragimov, R. and U. K. Müller (2010): "t-Statistic Based Correlation and Heterogeneity Robust Inference." Journal of Business and Economic Statistics 28, 453–468.

[15] Jansson, M. (2004): "On the Error of Rejection Probability in Simple Autocorrelation Robust Tests." Econometrica 72, 937–946.

[16] Jacquier, E., N. G. Polson and P. E. Rossi (1994): "Bayesian Analysis of Stochastic Volatility Models." Journal of Business & Economic Statistics 12(4), 371–389.

[17] Kiefer, N. M. and T. J. Vogelsang (2002a): "Heteroskedasticity-Autocorrelation Robust Testing Using Bandwidth Equal to Sample Size." Econometric Theory 18, 1350–1366.

[18] Kiefer, N. M. and T. J. Vogelsang (2002b): "Heteroskedasticity-Autocorrelation Robust Standard Errors Using the Bartlett Kernel without Truncation." Econometrica 70, 2093–2095.

[19] Kiefer, N. M. and T. J. Vogelsang (2005): "A New Asymptotic Theory for Heteroskedasticity-Autocorrelation Robust Tests." Econometric Theory 21, 1130–1164.

[20] Kim, M. S. and Sun, Y. (2013): "Heteroskedasticity and Spatiotemporal Dependence Robust Inference for Linear Panel Models with Fixed Effects." Journal of Econometrics 177(1), 85–108.

[21] Müller, U. K. (2007): "A Theory of Robust Long-Run Variance Estimation." Journal of Econometrics 141, 1331–1352.

[22] Newey, W. K. and K. D. West (1987): "A Simple, Positive Semidefinite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix." Econometrica 55, 703–708.

[23] Phillips, P. C. B. (2005): "HAC Estimation by Automated Regression." Econometric Theory 21, 116–142.

[24] Phillips, P. C. B., Y. Sun and S. Jin (2006): "Spectral Density Estimation and Robust Hypothesis Testing Using Steep Origin Kernels without Truncation." International Economic Review 47, 837–894.

[25] Phillips, P. C. B., Y. Sun and S. Jin (2007): "Consistent HAC Estimation and Robust Regression Testing Using Sharp Origin Kernels with No Truncation." Journal of Statistical Planning and Inference 137, 985–1023.

[26] Politis, D. (2011): "Higher-order Accurate, Positive Semi-definite Estimation of Large-sample Covariance and Spectral Density Matrices." Econometric Theory 27, 703–744.

[27] Press, S. J. (2005): Applied Multivariate Analysis, Second Edition. Dover Books on Mathematics.

[28] Shao, X. (2010): "A Self-normalized Approach to Confidence Interval Construction in Time Series." Journal of the Royal Statistical Society B 72, 343–366.

[29] Sun, Y. (2011): "Robust Trend Inference with Series Variance Estimator and Testing-optimal Smoothing Parameter." Journal of Econometrics 164(2), 345–366.

[30] Sun, Y. (2013): "A Heteroskedasticity and Autocorrelation Robust F Test Using Orthonormal Series Variance Estimator." Econometrics Journal 16, 1–26.

[31] Sun, Y. (2014): "Let's Fix It: Fixed-b Asymptotics versus Small-b Asymptotics in Heteroscedasticity and Autocorrelation Robust Inference." Journal of Econometrics 178(3), 659–677.

[32] Sun, Y. and D. Kaplan (2012): "Fixed-smoothing Asymptotics and Accurate F Approximation Using Vector Autoregressive Covariance Matrix Estimators." Working paper, UC San Diego.

[33] Sun, Y., P. C. B. Phillips and S. Jin (2008): "Optimal Bandwidth Selection in Heteroskedasticity-Autocorrelation Robust Testing." Econometrica 76(1), 175–194.

[34] Sun, Y., P. C. B. Phillips and S. Jin (2011): "Power Maximization and Size Control in Heteroscedasticity and Autocorrelation Robust Tests with Exponentiated Kernels." Econometric Theory 27(6), 1320–1368.

[35] Sun, Y. and P. C. B. Phillips (2009): "Bandwidth Choice for Interval Estimation in GMM Regression." Working paper, Department of Economics, UC San Diego.

[36] Sun, Y. and M. S. Kim (2012): "Simple and Powerful GMM Over-identification Tests with Accurate Size." Journal of Econometrics 166(2), 267–281.

[37] Sun, Y. and M. S. Kim (2014): "Asymptotic F Test in a GMM Framework with Cross Sectional Dependence." Review of Economics and Statistics, forthcoming.

[38] Taniguchi, M. and Puri, M. L. (1996): "Valid Edgeworth Expansions of M-estimators in Regression Models with Weakly Dependent Residuals." Econometric Theory 12, 331–346.

[39] Vogelsang, T. (2012): "Heteroskedasticity, Autocorrelation, and Spatial Correlation Robust Inference in Linear Panel Models with Fixed-effects." Journal of Econometrics 166, 303–319.

[40] Xiao, Z. and O. Linton (2002): "A Nonparametric Prewhitened Covariance Estimator." Journal of Time Series Analysis 23, 215–250.

[41] Zhang, X. and X. Shao (2013): "Fixed-Smoothing Asymptotics for Time Series." Annals of Statistics 41(3), 1329–1349.