Fixed-smoothing Asymptotics in a Two-step GMM Framework

Yixiao Sun*
Department of Economics, University of California, San Diego

July 30, 2014. First version: December 2011.

*Email: yisun@ucsd.edu. For helpful comments, the author thanks Jungbin Hwang, David Kaplan, Min Seong Kim, Elie Tamer, the coeditor, and anonymous referees. The author gratefully acknowledges partial research support from NSF under Grant No. SES-0752443. Correspondence to: Department of Economics, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0508.

Abstract

The paper develops the fixed-smoothing asymptotics in a two-step GMM framework. Under this type of asymptotics, the weighting matrix in the second-step GMM criterion function converges weakly to a random matrix and the two-step GMM estimator is asymptotically mixed normal. Nevertheless, the Wald statistic, the GMM criterion function statistic and the LM statistic remain asymptotically pivotal. It is shown that critical values from the fixed-smoothing asymptotic distribution are high order correct under the conventional increasing-smoothing asymptotics. When an orthonormal series covariance estimator is used, the critical values can be approximated very well by the quantiles of a noncentral F distribution. A simulation study shows that the new statistical tests based on the fixed-smoothing critical values are much more accurate in size than the conventional chi-square tests.

JEL Classification: C12, C32

Keywords: F distribution, Fixed-smoothing Asymptotics, Heteroskedasticity and Autocorrelation Robust, Increasing-smoothing Asymptotics, Noncentral F Test, Two-step GMM Estimation

1 Introduction

Recent research on heteroskedasticity and autocorrelation robust (HAR) inference has been focusing on developing distributional approximations that are more accurate than the conventional chi-square approximation or the normal approximation. To a great extent and from a broad perspective, this development is in line with many other areas of research in econometrics where more accurate distributional approximations are the focus of interest. A common theme for coming up with a new approximation is to embed the finite sample situation in a different limiting thought experiment. In the case of HAR inference, the conventional limiting thought experiment assumes that the amount of smoothing increases with the sample size but at a slower rate. The new limiting thought experiment assumes that the amount of smoothing is held fixed as the sample size increases. This leads to two types of asymptotics: the conventional increasing-smoothing asymptotics and the more recent fixed-smoothing asymptotics. Sun (2014) coins these two inclusive terms so that they are applicable to different HAR variance estimators, including both kernel HAR variance estimators and orthonormal series (OS) HAR variance estimators.

There is a large and growing literature on the fixed-smoothing asymptotics. For kernel HAR variance estimators, the fixed-smoothing asymptotics is the so-called fixed-b asymptotics first studied by Kiefer and Vogelsang (2002a, 2002b, 2005; KV hereafter) in the econometrics literature. For other studies, see, for example, Jansson (2004), Sun, Phillips and Jin (2008), Sun and Phillips (2009), and Gonçalves and Vogelsang (2011) in the time series setting; Bester, Conley, Hansen and Vogelsang (2011; BCHV hereafter) and Sun and Kim (2014) in the spatial setting; and Gonçalves (2011), Kim and Sun (2013), and Vogelsang (2012) in the panel data setting.
For OS HAR variance estimators, the fixed-smoothing asymptotics is the so-called fixed-K asymptotics. For its theoretical development and related simulation evidence, see, for example, Phillips (2005), Müller (2007), and Sun (2011, 2013). The approximation approaches in some other papers can also be regarded as special cases of the fixed-smoothing asymptotics. These include, among others, Ibragimov and Müller (2010), Shao (2010) and Bester, Hansen and Conley (2011).

All of the recent developments on the fixed-smoothing asymptotics have been devoted to first-step GMM estimation and inference. In this paper, we establish the fixed-smoothing asymptotics in a general two-step GMM framework. For two-step estimation and inference, the HAR variance estimator not only appears in the covariance estimator but also plays the role of the optimal weighting matrix in the second-step GMM criterion function. Under the fixed-smoothing asymptotics, the weighting matrix converges to a random matrix. As a result, the second-step GMM estimator is not asymptotically normal but rather asymptotically mixed normal. On the one hand, the asymptotic mixed normality captures the estimation uncertainty of the GMM weighting matrix and is expected to be closer to the finite sample distribution of the second-step GMM estimator. On the other hand, the lack of asymptotic normality poses a challenge for pivotal inference. It is far from obvious that the Wald statistic is still asymptotically pivotal under the new asymptotics. To confront this challenge, we have to judiciously rotate and transform the asymptotic distribution and show that it is equivalent to a distribution that is free of nuisance parameters.

The fixed-smoothing asymptotic distribution not only depends on the kernel function or basis functions and the smoothing parameter, as in the one-step GMM framework, but also on the degree of over-identification, which is different from existing results. In general, the degree of over-identification or the number of moment conditions remains in the asymptotic distribution only under the so-called many-instruments or many-moments asymptotics; see, for example, Han and Phillips (2006) and the references therein. Here the number of moment conditions, and hence the degree of over-identification, is finite and fixed. Intuitively, the degree of over-identification remains in the asymptotic distribution because it is indicative of the dimension of the limiting random weighting matrix.

In the case of OS HAR variance estimation, the fixed-smoothing asymptotic distribution is a mixed noncentral F distribution, that is, a noncentral F distribution with a random noncentrality parameter. This is an intriguing result, as a noncentral F distribution is not expected under the null hypothesis. It is reassuring that the random noncentrality parameter becomes degenerate as the amount of smoothing increases. Replacing the random noncentrality parameter by its mean, we show that the mixed noncentral F distribution can be approximated extremely well by a noncentral F distribution. The noncentral F distribution is implemented in standard programming languages and packages such as the R language, MATLAB, Mathematica and STATA, so critical values are readily available. No computation-intensive simulation is needed. This can be regarded as an advantage of using the OS HAR variance estimator.
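To make the last point concrete, here is a minimal sketch of how such a critical value can be computed. It is written in Python with SciPy's noncentral F quantile function (the paper points to R, MATLAB, Mathematica and STATA, which work the same way); the scaling factor and mean noncentrality parameter implement the approximation developed later in Section 4 (Theorem 3), and the function name and example values of $p$, $q$ and $K$ are illustrative assumptions.

```python
# Critical value for the two-step Wald test via the noncentral F
# approximation of Theorem 3; a minimal sketch, not the paper's own code.
from scipy.stats import ncf

def ncf_critical_value(p, q, K, alpha=0.05):
    """(1 - alpha) critical value based on lambda * F_{p, K-p-q+1}(delta_bar^2)."""
    lam = K / (K - p - q + 1)        # multiplicative adjustment lambda
    delta2 = p * q / (K - q - 1)     # mean noncentrality parameter delta_bar^2
    return lam * ncf.ppf(1 - alpha, p, K - p - q + 1, delta2)

print(ncf_critical_value(p=2, q=1, K=14))
```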
In the case of kernel HAR variance estimation, the fixed-smoothing asymptotic distribution is a mixed chi-square distribution, that is, a chi-square distribution scaled by an independent and positive random variable. This resembles an F distribution except that the random denominator has a nonstandard distribution. Nevertheless, we are able to show that critical values from the fixed-smoothing asymptotic distribution are high order correct under the conventional increasing-smoothing asymptotics. This result is established on the basis of two distributional expansions. The first is the expansion of the fixed-smoothing asymptotic distribution as the amount of smoothing increases, and the second is the high order Edgeworth expansion established in Sun and Phillips (2009). We arrive at the two expansions via completely different routes, yet in the end some of the terms in the two expansions are exactly the same.

Our framework for establishing the fixed-smoothing asymptotics is general enough to accommodate both the kernel HAR variance estimators and the OS HAR variance estimators. The fixed-smoothing asymptotics is established under weaker conditions than those typically assumed in the literature. More specifically, instead of maintaining a functional CLT (FCLT) assumption, we make use of a regular CLT, which is weaker than an FCLT. Our method of proof is also novel. It applies directly to both smooth and nonsmooth kernel functions. There is no need to give a separate treatment to nonsmooth kernels such as the Bartlett kernel. The unified method of proof leads to a unified representation of the fixed-smoothing asymptotic distribution.

The fixed-smoothing asymptotics is established for three commonly used test statistics in the GMM framework: the Wald statistic, the GMM criterion function statistic, and the score-type statistic or LM statistic. As in the conventional increasing-smoothing asymptotic framework, we show that these three test statistics are asymptotically equivalent and converge to the same limiting distribution.

In the Monte Carlo simulations, we examine the accuracy of the fixed-smoothing approximation to the distribution of the Wald statistic. We find that the tests based on the new fixed-smoothing approximation have much more accurate size than the conventional chi-square tests. This is especially true when the degree of over-identification is large. When the model is over-identified, the fixed-smoothing approximation that accounts for the randomness of the GMM weighting matrix is also more accurate than the fixed-smoothing approximation that ignores the randomness. When the OS HAR variance estimator is used, the convenient noncentral F test has almost identical size properties to the nonstandard test whose critical values have to be simulated.

The rest of the paper is organized as follows. Section 2 describes the estimation and testing problems at hand. Section 3 establishes the fixed-smoothing asymptotics for the variance estimators and the associated test statistics. Section 4 gives different representations of the fixed-smoothing asymptotic distribution. In the case of OS HAR variance estimation, a noncentral F distribution is shown to be a very accurate approximation to the nonstandard fixed-smoothing asymptotic distribution. In the case of kernel HAR variance estimation, the fixed-smoothing approximation is shown to provide a high order refinement over the chi-square approximation. Section 5 studies the fixed-smoothing asymptotic approximation under local alternatives.
The next section reports simulation evidence and applies our method to GMM estimation and inference in a stochastic volatility model. The last section provides some concluding discussion. Technical lemmas and proofs of the main results are given in the Appendix.

A word on notation: we use $F_{p,K-p-q+1}(\delta^2)$ to denote a random variable that follows the noncentral F distribution with degrees of freedom $(p, K-p-q+1)$ and noncentrality parameter $\delta^2$. We use $F_{p,K-p-q+1}(z;\delta^2)$ to denote the CDF of the noncentral F distribution. For a positive semidefinite symmetric matrix $A$, $A^{1/2}$ is the symmetric matrix square root of $A$ when it is not defined explicitly.

2 Two-step GMM Estimation and Testing

We are interested in a $d\times 1$ vector of parameters $\theta \in \Theta \subseteq \mathbb{R}^d$. Let $v_t \in \mathbb{R}^{d_v}$ denote a vector of observations at time $t$. Let $\theta_0$ denote the true parameter value and assume that $\theta_0$ is an interior point of $\Theta$. The moment conditions

$$Ef(v_t,\theta) = 0, \quad t = 1,2,\ldots,T \quad (1)$$

hold if and only if $\theta = \theta_0$, where $f(v_t,\theta)$ is an $m\times 1$ vector of continuously differentiable functions. The process $f(v_t,\theta_0)$ may exhibit autocorrelation of unknown form. We assume that $m \ge d$ and that the rank of $E[\partial f(v_t,\theta_0)/\partial\theta']$ is equal to $d$. That is, we consider a model that is possibly over-identified, with degree of over-identification $q = m - d$.

Define

$$g_t(\theta) = \frac{1}{T}\sum_{j=1}^{t} f(v_j,\theta);$$

then the GMM estimator of $\theta_0$ is given by

$$\hat\theta_{GMM} = \arg\min_{\theta\in\Theta}\, g_T(\theta)' W_T^{-1} g_T(\theta),$$

where $W_T$ is a positive definite weighting matrix. To obtain an initial first-step estimator, we often choose a simple weighting matrix $W_o$ that does not depend on model parameters, leading to

$$\tilde\theta_T = \arg\min_{\theta\in\Theta}\, g_T(\theta)' W_o^{-1} g_T(\theta).$$

As an example, we may set $W_o = I_m$ in the general GMM setting. In the IV regression, we may set $W_o = Z'Z/T$, where $Z$ is the data matrix for the instruments. We assume that $W_o \to^p W_{o,\infty}$, a matrix that is positive definite almost surely.

According to Hansen (1982), the optimal weighting matrix $W_T$ is the asymptotic variance matrix of $\sqrt{T}\,g_T(\theta_0)$. On the basis of the first-step estimate $\tilde\theta_T$, we can use $\tilde u_t := f(v_t,\tilde\theta_T)$ to estimate the asymptotic variance matrix. Many nonparametric estimators of the variance matrix are available in the literature. In this paper, we consider a class of quadratic variance estimators, which includes the conventional kernel variance estimators of Andrews (1991), Newey and West (1987) and Politis (2011), the sharp and steep kernel variance estimators of Phillips, Sun and Jin (2006, 2007), and the orthonormal series variance estimators of Phillips (2005), Müller (2007), and Sun (2011, 2013) as special cases. Following Phillips, Sun and Jin (2006, 2007), we refer to the conventional kernel estimators as contracted kernel estimators and the sharp and steep kernel estimators as exponentiated kernel estimators.

For a given value of $\theta$, the quadratic HAR variance estimator takes the form

$$W_T(\theta) = \frac{1}{T}\sum_{t=1}^T\sum_{s=1}^T Q_h\left(\frac{t}{T},\frac{s}{T}\right)\left[f(v_t,\theta) - \frac{1}{T}\sum_{\tau=1}^T f(v_\tau,\theta)\right]\left[f(v_s,\theta) - \frac{1}{T}\sum_{\tau=1}^T f(v_\tau,\theta)\right]'.$$

The feasible HAR variance estimator based on $\tilde\theta_T$ is then given by
$$W_T(\tilde\theta_T) = \frac{1}{T}\sum_{t=1}^T\sum_{s=1}^T Q_h\left(\frac{t}{T},\frac{s}{T}\right)\left(\tilde u_t - \frac{1}{T}\sum_{\tau=1}^T \tilde u_\tau\right)\left(\tilde u_s - \frac{1}{T}\sum_{\tau=1}^T \tilde u_\tau\right)'. \quad (2)$$

In the above definitions, $Q_h(r,s)$ is a symmetric weighting function that depends on the smoothing parameter $h$. For conventional kernel estimators, $Q_h(r,s) = k((r-s)/b)$ and we take $h = 1/b$. For the sharp kernel estimators, $Q_h(r,s) = (1-|r-s|)^\rho\, 1\{|r-s|<1\}$ and we take $h = \rho$. For the steep quadratic kernel estimators, $Q_h(r,s) = k^\rho(r-s)$ and we take $h = \sqrt{\rho}$. For the OS estimators, $Q_h(r,s) = K^{-1}\sum_{j=1}^K \Phi_j(r)\Phi_j(s)$ and we take $h = K$, where $\{\Phi_j(r)\}$ are orthonormal basis functions in $L^2[0,1]$ satisfying $\int_0^1 \Phi_j(r)\,dr = 0$. A prewhitened version of the above estimators can also be used. See, for example, Andrews and Monahan (1992) and Xiao and Linton (2002).

It is important to point out that the HAR variance estimator in (2) is based on the demeaned moment process $\tilde u_t - T^{-1}\sum_{\tau=1}^T \tilde u_\tau$. For over-identified GMM models, the sample average $T^{-1}\sum_{\tau=1}^T \tilde u_\tau$ is in general different from zero. Under the conventional asymptotic theory, demeaning is necessary to ensure the consistency of the HAR variance estimator for misspecified models. Under the new asymptotic theory, demeaning is necessary to ensure that the associated test statistics are asymptotically pivotal. Hall (2000, 2005: Sec. 4.3) provides more discussion of the benefits of demeaning, including power improvement for over-identification tests. See also Sun and Kim (2012).

With the variance estimator $W_T(\tilde\theta_T)$, the two-step GMM estimator is

$$\hat\theta_T = \arg\min_{\theta\in\Theta}\, g_T(\theta)' W_T^{-1}(\tilde\theta_T)\, g_T(\theta).$$

Suppose we want to test the linear null hypothesis $H_0: R\theta_0 = r$ against the alternative $H_1: R\theta_0 \neq r$, where $R$ is a $p\times d$ matrix with full row rank. Nonlinear restrictions can be converted into linear ones by the delta method. We consider three types of test statistics. The first type is the conventional Wald statistic. The normalized Wald statistic is

$$W_T := W_T(\hat\theta_T) = T(R\hat\theta_T - r)'\left[R\left(G_T(\hat\theta_T)' W_T^{-1}(\hat\theta_T)\, G_T(\hat\theta_T)\right)^{-1} R'\right]^{-1}(R\hat\theta_T - r)/p,$$

where $G_T(\theta) = \partial g_T(\theta)/\partial\theta'$. When $p = 1$ and for one-sided alternative hypotheses, we can construct the t statistic $t_T = t_T(\hat\theta_T)$, where $t_T(\theta_T)$ is defined to be

$$t_T(\theta_T) = \frac{\sqrt{T}(R\theta_T - r)}{\left\{R\left[G_T(\theta_T)' W_T^{-1}(\theta_T)\, G_T(\theta_T)\right]^{-1} R'\right\}^{1/2}}.$$

The second type of test statistic is based on the likelihood ratio principle. Let $\hat\theta_{T,R}$ be the restricted second-step GMM estimator:

$$\hat\theta_{T,R} = \arg\min_{\theta\in\Theta}\, g_T(\theta)' W_T^{-1}(\tilde\theta_T)\, g_T(\theta) \quad \text{s.t.} \quad R\theta = r.$$

The likelihood ratio principle suggests the GMM distance statistic (or GMM criterion function statistic) given by

$$D_T := \left[T g_T(\hat\theta_{T,R})' W_T^{-1}(\bar\theta_T)\, g_T(\hat\theta_{T,R}) - T g_T(\hat\theta_T)' W_T^{-1}(\bar\theta_T)\, g_T(\hat\theta_T)\right]/p,$$

where $\bar\theta_T$ is a $\sqrt T$-consistent estimator of $\theta_0$, i.e., $\bar\theta_T - \theta_0 = O_p(1/\sqrt T)$. Typically, we take $\bar\theta_T$ to be $\tilde\theta_T$, the unrestricted first-step estimator. To ensure that $D_T \ge 0$, we use the same $W_T^{-1}(\bar\theta_T)$ in computing the restricted and unrestricted GMM criterion functions.

The third type of test statistic is the GMM counterpart of the score statistic or Lagrange Multiplier (LM) statistic. It is based on the score or gradient of the GMM criterion function, i.e., $\Delta_T(\theta) = G_T'(\theta)\, W_T^{-1}(\bar\theta_T)\, g_T(\theta)$. The test statistic is given by

$$S_T = T\left[\Delta_T(\hat\theta_{T,R})\right]'\left[G_T'(\hat\theta_{T,R})\, W_T^{-1}(\bar\theta_T)\, G_T(\hat\theta_{T,R})\right]^{-1}\left[\Delta_T(\hat\theta_{T,R})\right]/p,$$

where as before $\bar\theta_T$ is a $\sqrt T$-consistent estimator of $\theta_0$.

It is implicit in our notation that we use the same smoothing parameter $h$ in $W_T(\bar\theta_T)$, $W_T(\tilde\theta_T)$ and $W_T(\hat\theta_T)$. In principle, we could use different smoothing parameters, but the asymptotic distributions would become very complicated. For future research, it is interesting to explore the advantages of using different smoothing parameters.
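As a concrete illustration of (2) in the OS case, the following minimal Python sketch computes the demeaned HAR variance estimate from a $T\times m$ array of moment conditions, using the Fourier basis adopted later in Section 6. The function name and interface are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def os_har_variance(u, K):
    """OS HAR variance estimate from a (T, m) array of moment conditions,
    using K Fourier basis functions (K even), as in equation (2)."""
    T, m = u.shape
    u = u - u.mean(axis=0)                 # demeaned moment process
    t = np.arange(1, T + 1) / T            # t/T
    W = np.zeros((m, m))
    for j in range(1, K // 2 + 1):
        for phi in (np.sqrt(2) * np.cos(2 * np.pi * j * t),
                    np.sqrt(2) * np.sin(2 * np.pi * j * t)):
            v = u.T @ phi / np.sqrt(T)     # (1/sqrt(T)) sum_t Phi_j(t/T) u_t
            W += np.outer(v, v) / K
    return W
```

With `W = os_har_variance(u_tilde, K)`, the same matrix serves as the second-step weighting matrix and as the variance estimator inside the Wald statistic.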
In this paper we employ the same smoothing parameter in all HAR variance estimators.

Under the usual asymptotics where $h\to\infty$ but at a slower rate than the sample size $T$, all three statistics $W_T$, $D_T$ and $S_T$ are asymptotically distributed as $\chi^2_p/p$, and the t statistic $t_T$ is asymptotically standard normal under the null. The question is: what are the limiting distributions of $W_T$, $D_T$, $S_T$ and $t_T$ when $h$ is held fixed as $T\to\infty$? The rationale for considering this type of thought experiment is that it may deliver asymptotic approximations that are more accurate in finite samples than the chi-square or normal approximations. When $h$ is fixed, that is, when $b$, $\rho$ or $K$ is fixed, the variance matrix estimator involves a fixed amount of smoothing, in that from a frequency domain perspective it is approximately equal to an average of a fixed number of quantities. The fixed-b, fixed-$\rho$ or fixed-K asymptotics may be collectively referred to as the fixed-smoothing asymptotics. Correspondingly, the conventional asymptotics, under which $h\to\infty$ and $T\to\infty$ jointly, is referred to as the increasing-smoothing asymptotics. In view of the definition of $h$ for each HAR variance estimator, the magnitude of $h$ indicates the amount or level of smoothing in each case.

3 The Fixed-smoothing Asymptotics

3.1 Fixed-smoothing asymptotics for the variance estimator

To establish the fixed-smoothing asymptotics, we first introduce another HAR variance estimator. Let

$$Q_{T,h}(r,s) = Q_h(r,s) - \frac{1}{T}\sum_{\tau_1=1}^T Q_h\left(\frac{\tau_1}{T}, s\right) - \frac{1}{T}\sum_{\tau_2=1}^T Q_h\left(r, \frac{\tau_2}{T}\right) + \frac{1}{T^2}\sum_{\tau_1=1}^T\sum_{\tau_2=1}^T Q_h\left(\frac{\tau_1}{T},\frac{\tau_2}{T}\right)$$

be the demeaned version of $Q_h(r,s)$; then we have

$$W_T(\theta) = \frac{1}{T}\sum_{t=1}^T\sum_{s=1}^T Q_{T,h}\left(\frac{t}{T},\frac{s}{T}\right) f(v_t,\theta) f(v_s,\theta)'. \quad (3)$$

As $T\to\infty$, we obtain the "continuous" version of $Q_{T,h}(r,s)$, given by

$$Q_h^*(r,s) = Q_h(r,s) - \int_0^1 Q_h(\tau_1, s)\,d\tau_1 - \int_0^1 Q_h(r,\tau_2)\,d\tau_2 + \int_0^1\int_0^1 Q_h(\tau_1,\tau_2)\,d\tau_1\,d\tau_2.$$

$Q_h^*(r,s)$ can be regarded as a centered version of $Q_h(r,s)$, as it satisfies $\int_0^1 Q_h^*(r,\tau)\,dr = \int_0^1 Q_h^*(\tau,s)\,ds = 0$ for any $\tau$. For OS HAR variance estimators, we have $Q_h^*(r,s) = Q_h(r,s)$, and so centering is not necessary. For our theoretical development, it is helpful to define the HAR variance estimator

$$W_T^*(\theta) = \frac{1}{T}\sum_{t=1}^T\sum_{s=1}^T Q_h^*\left(\frac{t}{T},\frac{s}{T}\right) f(v_t,\theta) f(v_s,\theta)'.$$

It is shown in Lemma 1(a) below that $W_T(\theta_T)$ and $W_T^*(\theta_T)$ are asymptotically equivalent for any $\sqrt T$-consistent estimator $\theta_T$, so the limiting distribution of $W_T(\theta_T)$ follows from that of $W_T^*(\theta_T)$. To establish the fixed-smoothing asymptotics of $W_T^*(\theta_T)$, and hence that of $W_T(\theta_T)$, we maintain Assumption 1 on the kernel function and basis functions.

Assumption 1 (a) For kernel HAR variance estimators, the kernel function $k(\cdot)$ satisfies the following condition: for any $b\in(0,1]$ and $\rho\ge 1$, $k_b(x) := k(x/b)$ and $k^\rho(x)$ are symmetric, continuous, piecewise monotonic, and piecewise continuously differentiable on $[-1,1]$. (b) For the OS HAR variance estimator, the basis functions $\Phi_j(\cdot)$ are piecewise monotonic, continuously differentiable and orthonormal in $L^2[0,1]$, and $\int_0^1 \Phi_j(x)\,dx = 0$.

Assumption 1 on the kernel function is very mild. It includes many commonly used kernel functions such as the Bartlett, Parzen, QS, Daniell and Tukey-Hanning kernels. Assumption 1 ensures the asymptotic equivalence of $W_T(\theta_T)$ and $W_T^*(\theta_T)$.
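The double demeaning that defines $Q_{T,h}$ is straightforward to carry out in matrix form: subtract row means and column means from the $T\times T$ matrix of kernel weights and add back the grand mean. A minimal sketch (assuming the Bartlett kernel as the default; names are illustrative) that also verifies the centering property:

```python
import numpy as np

def demeaned_kernel_matrix(T, b, k=lambda x: np.maximum(0.0, 1 - np.abs(x))):
    """T x T matrix of Q_{T,h}(t/T, s/T) for a kernel k (default: Bartlett)
    with bandwidth fraction b, doubly demeaned as in (3)."""
    r = np.arange(1, T + 1) / T
    Q = k((r[:, None] - r[None, :]) / b)
    Q = Q - Q.mean(axis=0, keepdims=True) - Q.mean(axis=1, keepdims=True) + Q.mean()
    return Q

Q = demeaned_kernel_matrix(T=200, b=0.1)
print(np.abs(Q.sum(axis=0)).max())   # row/column sums vanish up to rounding
```

The estimator in (3) is then `u.T @ Q @ u / T` for a $T\times m$ array `u` of moment conditions; no explicit demeaning of `u` is needed because the rows and columns of `Q` sum to zero.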
In addition, under Assumption 1, we can use the Fourier series expansion to show that $Q_h^*(r,s)$ has the following unified representation for all the HAR variance estimators we consider:

$$Q_h^*(r,s) = \sum_{j=1}^\infty \lambda_j \Phi_j(r)\Phi_j(s), \quad (4)$$

where $\{\Phi_j(\cdot)\}$ is a sequence of continuously differentiable functions satisfying $\int_0^1 \Phi_j(r)\,dr = 0$. The right hand side of (4) converges absolutely and uniformly over $(r,s)\in[0,1]\times[0,1]$. For notational simplicity, the dependence of $\lambda_j$ and $\Phi_j(\cdot)$ on $h$ has been suppressed. For positive semidefinite kernels $k(\cdot)$, the above representation also follows from Mercer's theorem, in which case $\lambda_j$ is an eigenvalue and $\Phi_j(\cdot)$ is the corresponding orthonormal eigenfunction of the Fredholm operator with kernel $Q_h^*(\cdot,\cdot)$. Alternatively, the representation can be regarded as a spectral decomposition of this Fredholm operator, which can be shown to be compact. This representation enables us to give a unified proof for the contracted kernel, the exponentiated kernel, and the OS HAR variance estimators. For kernel HAR variance estimation, there is no need to single out nonsmooth kernels such as the Bartlett kernel and treat them differently, as in KV (2005).

Define

$$G_t(\theta) = \frac{\partial g_t(\theta)}{\partial\theta'} = \frac{1}{T}\sum_{j=1}^t \frac{\partial f(v_j,\theta)}{\partial\theta'} \ \text{ for } 1 \le t \le T,$$

and $G_0(\theta) = 0$ for all $\theta\in\Theta$. Let $u_t = f(v_t,\theta_0)$, $\Phi_0(t)\equiv 1$, and $e_t \sim \text{iid } N(0, I_m)$. We make the following four assumptions on the GMM estimators and the data generating process.

Assumption 2 As $T\to\infty$, $\hat\theta_T = \theta_0 + o_p(1)$, $\hat\theta_{T,R} = \theta_0 + o_p(1)$, and $\tilde\theta_T = \theta_0 + o_p(1)$ for an interior point $\theta_0\in\Theta$.

Assumption 3 $\sum_{j=-\infty}^\infty \|\Gamma_j\| < \infty$, where $\Gamma_j = E u_t u_{t-j}'$.

Assumption 4 For any $\theta_T = \theta_0 + o_p(1)$, $\operatorname{plim}_{T\to\infty} G_{[rT]}(\theta_T) = rG$ uniformly in $r$, where $G = G(\theta_0)$ and $G(\theta) = E\,\partial f(v_t,\theta)/\partial\theta'$.

Assumption 5 (a) $T^{-1/2}\sum_{t=1}^T \Phi_j(t/T)\,u_t$ converges weakly to a continuous distribution, jointly over $j = 0,1,\ldots,J$ for every finite $J$. (b) The following holds:

$$P\left(\frac{1}{\sqrt T}\sum_{t=1}^T \Phi_j\left(\frac{t}{T}\right) u_t \le x \ \text{ for } j = 0,1,\ldots,J\right) = P\left(\frac{1}{\sqrt T}\sum_{t=1}^T \Phi_j\left(\frac{t}{T}\right)\Lambda e_t \le x \ \text{ for } j = 0,1,\ldots,J\right) + o(1)$$

as $T\to\infty$ for every finite $J$, where $x\in\mathbb{R}^m$ and $\Lambda$ is the matrix square root of $\Omega$, i.e., $\Lambda\Lambda' = \Omega := \sum_{j=-\infty}^\infty \Gamma_j$.

Assumptions 2-4 are standard assumptions in the literature on the fixed-smoothing asymptotics. They are the same as those in Kiefer and Vogelsang (2005) and Sun and Kim (2012), among others. Assumption 5 is a variant of the standard multivariate CLT. If Assumption 1 holds and

$$T^{-1/2} g_{[rT]}(\theta_0) = \frac{1}{\sqrt T}\sum_{t=1}^{[rT]} u_t \stackrel{d}{\to} \Lambda B_m(r),$$

where $B_m(r)$ is a standard Brownian motion, then Assumption 5 holds. So Assumption 5 is weaker than the above FCLT, which is typically assumed in the literature on the fixed-smoothing asymptotics. A great advantage of maintaining only the CLT assumption is that our results can be easily generalized to the case of higher dimensional dependence, such as spatial dependence and spatial-temporal dependence. In fact, Sun and Kim (2014) have established the fixed-smoothing asymptotics in the spatial setting using only the CLT assumption. They avoid the more restrictive FCLT assumption maintained in BCHV (2011). With some minor notational changes and a small change in Assumption 4 as in Sun and Kim (2014), our results remain valid in the spatial setting.

Assumption 5(b) only assumes that the approximation error is $o(1)$, which is enough for our first order fixed-smoothing asymptotics. However, it is useful to discuss the composition of the approximation error. Let $\Phi^J(t) = (\Phi_0(t), \Phi_1(t),\ldots,\Phi_J(t))'$ and let $a_J \in \mathbb{R}^{J+1}$ be a vector of the same dimension.
It follows from Lemma 1 in Taniguchi and Puri (1996) that, under some additional conditions,

$$P\left(a_J'\,\frac{1}{\sqrt T}\sum_{t=1}^T \Phi^J\left(\frac{t}{T}\right) u_t \le x\right) = P\left(a_J'\,\frac{1}{\sqrt T}\sum_{t=1}^T \Phi^J\left(\frac{t}{T}\right)\Lambda e_t \le x\right) + \frac{c(x)}{\sqrt T} + O\left(\frac{1}{T}\right)$$

for some function $c(x)$. In the above Edgeworth expansion, the approximation error of order $c(x)/\sqrt T$ captures the skewness of $u_t$. When $u_t$ is Gaussian or has a symmetric distribution, this term disappears. Part of the approximation error of order $O(1/T)$ comes from the stochastic dependence of $\{u_t\}$. If we replace the iid Gaussian process $\{\Lambda e_t\}$ by a dependent Gaussian process $u_t^N$ that has the same covariance structure as $\{u_t\}$, then we can remove this part of the approximation error.

To present Assumption 5 and our theoretical results more compactly, we introduce the notion of asymptotic equivalence in distribution. Consider two stochastically bounded sequences of random vectors $\xi_T \in \mathbb{R}^p$ and $\eta_T \in \mathbb{R}^p$. We say that they are asymptotically equivalent in distribution, and write $\xi_T \stackrel{a}{\sim} \eta_T$, if and only if $Ef(\xi_T) = Ef(\eta_T) + o(1)$ as $T\to\infty$ for all bounded and continuous functions $f(\cdot)$ on $\mathbb{R}^p$. According to Lemma 2 in the appendix, Assumption 5 is equivalent to

$$\frac{1}{\sqrt T}\sum_{t=1}^T \Phi_j\left(\frac{t}{T}\right) u_t \ \stackrel{a}{\sim}\ \frac{1}{\sqrt T}\sum_{t=1}^T \Phi_j\left(\frac{t}{T}\right)\Lambda e_t$$

jointly over $j = 0,1,\ldots,J$. Let

$$\hat\Lambda_j(\theta) = \left[\frac{1}{\sqrt T}\sum_{t=1}^T \Phi_j\left(\frac{t}{T}\right) f(v_t,\theta)\right]\left[\frac{1}{\sqrt T}\sum_{t=1}^T \Phi_j\left(\frac{t}{T}\right) f(v_t,\theta)\right]'.$$

Using the uniform series representation in (4), we can write

$$W_T^*(\theta) = \sum_{j=1}^\infty \lambda_j \hat\Lambda_j(\theta),$$

which is an infinite weighted sum of outer products. To establish the fixed-smoothing asymptotic distribution of $W_T^*(\theta)$ under Assumption 5, we split $W_T^*(\theta)$ into a finite sum part and the remainder part: $W_T^*(\theta) = \sum_{j=1}^J \lambda_j\hat\Lambda_j(\theta) + \sum_{j=J+1}^\infty \lambda_j\hat\Lambda_j(\theta)$. For each finite $J$, we can use Assumptions 2-5 to obtain the asymptotically equivalent distribution for the first term. However, for the second term to vanish, we require $J\to\infty$. To close the gap in our argument, we use Lemma 4 in the appendix, which puts our proof on a rigorous footing and may be of independent interest.

Lemma 1 Let Assumptions 1-5 hold. Then, for any $\sqrt T$-consistent estimator $\theta_T$ and for a fixed $h$:
(a) $W_T(\theta_T) = W_T^*(\theta_T) + O_p(1/T)$;
(b) $W_T^*(\theta_T) = W_T^*(\theta_0) + o_p(1)$;
(c) $W_T(\theta_T) \stackrel{a}{\sim} W_{eT}$ and $W_T^*(\theta_T) \stackrel{a}{\sim} W_{eT}$, where

$$W_{eT} = \Lambda\left[\frac{1}{T}\sum_{t=1}^T\sum_{\tau=1}^T Q_h\left(\frac{t}{T},\frac{\tau}{T}\right)(e_t - \bar e)(e_\tau - \bar e)'\right]\Lambda' \quad \text{and} \quad \bar e = \frac{1}{T}\sum_{t=1}^T e_t;$$

(d) $W_T(\theta_T) \stackrel{d}{\to} W_\infty$ and $W_T^*(\theta_T) \stackrel{d}{\to} W_\infty$, where $W_\infty = \Lambda\tilde W_\infty\Lambda'$ and

$$\tilde W_\infty = \int_0^1\int_0^1 Q_h^*(r,s)\,dB_m(r)\,dB_m(s)'.$$

Lemma 1(c)(d) gives a unified representation of the asymptotically equivalent distribution and the limiting distribution of $W_T(\theta_T)$ and $W_T^*(\theta_T)$. The representation applies to smooth kernels as well as nonsmooth kernels such as the Bartlett kernel. Although we do not make the FCLT assumption, the limiting distribution of $W_T(\theta_T)$ and $W_T^*(\theta_T)$ can still be represented by a functional of Brownian motion, as in Lemma 1(d). This representation serves two purposes. First, it gives an explicit representation of the limiting distribution. Second, the representation in Lemma 1(d) enables us to compare the results we obtain here with existing results. It is reassuring that the limiting distribution $W_\infty$ is the same as that obtained by Kiefer and Vogelsang (2005), Phillips, Sun and Jin (2006, 2007), and Sun and Kim (2012), among others.

It is standard practice to obtain the limiting distribution in order to conduct asymptotically valid inference. However, this is not necessary, as we can simply use the asymptotically equivalent distribution in Lemma 1(c). There are some advantages in doing so. First, it is easier to simulate the asymptotically equivalent distribution, as it involves only $T$ iid normal vectors. In contrast, to simulate the Brownian motion in the limiting distribution, we need to employ a large number of iid normal vectors. Second, the asymptotically equivalent distribution can provide a more accurate approximation in some cases. Note that $W_{eT}$ takes the same form as $W_T(\theta_0)$: if $f(v_t,\theta_0)$ is iid $N(0,\Omega)$, then $W_T(\theta_0)$ and $W_{eT}$ have the same distribution in finite samples. There is no approximation error. In contrast, the limiting distribution $W_\infty$ is always an approximation to the finite sample distribution of $W_T(\theta_0)$.
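To illustrate Lemma 1(c): a draw from the asymptotically equivalent distribution requires only $T$ iid standard normal vectors. A minimal sketch with $\Lambda = I_m$, so that the draw is of the inner matrix $\tilde W_{eT}$ (for a general $\Lambda$, pre- and post-multiply by $\Lambda$ and $\Lambda'$); the function name is an illustrative assumption:

```python
import numpy as np

def draw_W_eT(T, m, Q, rng):
    """One draw of the matrix in Lemma 1(c) with Lambda = I_m, where Q is the
    T x T matrix of demeaned weights Q_{T,h}(t/T, tau/T)."""
    e = rng.standard_normal((T, m))   # T iid N(0, I_m) vectors suffice
    return e.T @ Q @ e / T            # demeaning of e is absorbed by Q

# Reusing demeaned_kernel_matrix from the sketch after (3):
# Q = demeaned_kernel_matrix(T=200, b=0.1)
# W = draw_W_eT(200, 3, Q, np.random.default_rng(0))
```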
3.2 Fixed-smoothing asymptotics for the test statistics

We now establish the asymptotic distributions of $W_T$, $D_T$, $S_T$ and $t_T$ when $h$ is fixed. For kernel variance estimators, we focus on the case that $k(\cdot)$ is positive definite, as otherwise the two-step estimator may not even be consistent. In this case, we can show that, for all the variance estimators we consider, $W_\infty$ is nonsingular with probability one. Hence, using Lemma 1 and Lemma 3 in the appendix, we have $W_T(\tilde\theta_T)^{-1} \stackrel{a}{\sim} W_{eT}^{-1}$ and $W_T(\tilde\theta_T)^{-1} \stackrel{d}{\to} W_\infty^{-1}$. Using these results and Assumptions 1-5, we have

$$\sqrt T\left(\hat\theta_T - \theta_0\right) = -\left[G' W_T^{-1}(\tilde\theta_T)\, G\right]^{-1} G' W_T^{-1}(\tilde\theta_T)\,\sqrt T\, g_T(\theta_0) + o_p(1)$$
$$\stackrel{a}{\sim} -\left(G' W_{eT}^{-1} G\right)^{-1} G' W_{eT}^{-1}\Lambda\,\frac{1}{\sqrt T}\sum_{t=1}^T e_t \ \stackrel{d}{\to}\ -\left(G' W_\infty^{-1} G\right)^{-1} G' W_\infty^{-1}\Lambda B_m(1). \quad (5)$$

The limiting distribution is not normal but rather mixed normal. More specifically, $W_\infty^{-1}$ is independent of $B_m(1)$, so conditional on $W_\infty^{-1}$, the limiting distribution is normal with a conditional variance that depends on $W_\infty^{-1}$.

Using (5) and Lemma 3 in the appendix, we have $W_T \stackrel{a}{\sim} F_{eT}$ and $W_T \stackrel{d}{\to} F_\infty$, where

$$F_{eT} = \left[R\left(G' W_{eT}^{-1} G\right)^{-1} G' W_{eT}^{-1}\Lambda\,\frac{1}{\sqrt T}\sum_{t=1}^T e_t\right]'\left[R\left(G' W_{eT}^{-1} G\right)^{-1} R'\right]^{-1}\left[R\left(G' W_{eT}^{-1} G\right)^{-1} G' W_{eT}^{-1}\Lambda\,\frac{1}{\sqrt T}\sum_{t=1}^T e_t\right]\Big/\,p$$

and

$$F_\infty = \left[R\left(G' W_\infty^{-1} G\right)^{-1} G' W_\infty^{-1}\Lambda B_m(1)\right]'\left[R\left(G' W_\infty^{-1} G\right)^{-1} R'\right]^{-1}\left[R\left(G' W_\infty^{-1} G\right)^{-1} G' W_\infty^{-1}\Lambda B_m(1)\right]\Big/\,p.$$

To show that $F_{eT}$ and $F_\infty$ are pivotal, we let $e_t := (e_{t,p}', e_{t,d-p}', e_{t,q}')'$. The subscripts $p$, $d-p$ and $q$ on $e$ indicate not only the dimensions of the random vectors but also distinguish them, so that, for example, $e_{t,p}$ is different from and independent of $e_{t,q}$ for all values of $p$ and $q$. Denote

$$C_{p,T} = \frac{1}{\sqrt T}\sum_{t=1}^T e_{t,p}, \qquad C_{q,T} = \frac{1}{\sqrt T}\sum_{t=1}^T e_{t,q}, \quad (6)$$

and

$$C_{pp,T} = \frac{1}{T}\sum_{t=1}^T\sum_{\tau=1}^T Q_h^*\left(\frac{t}{T},\frac{\tau}{T}\right)\tilde e_{t,p}\,\tilde e_{\tau,p}', \qquad C_{pq,T} = \frac{1}{T}\sum_{t=1}^T\sum_{\tau=1}^T Q_h^*\left(\frac{t}{T},\frac{\tau}{T}\right)\tilde e_{t,p}\,\tilde e_{\tau,q}',$$

$$C_{qq,T} = \frac{1}{T}\sum_{t=1}^T\sum_{\tau=1}^T Q_h^*\left(\frac{t}{T},\frac{\tau}{T}\right)\tilde e_{t,q}\,\tilde e_{\tau,q}', \qquad D_{pp,T} = C_{pp,T} - C_{pq,T}\, C_{qq,T}^{-1}\, C_{pq,T}',$$

where $\tilde e_{i,j} = e_{i,j} - T^{-1}\sum_{s=1}^T e_{s,j}$ for $i = t,\tau$ and $j = p,q$. Similarly, let $B_m(r) := (B_p'(r), B_{d-p}'(r), B_q'(r))'$, where $B_p(r)$, $B_{d-p}(r)$ and $B_q(r)$ are independent standard Brownian motion processes of dimensions $p$, $d-p$ and $q$, respectively. Denote

$$C_{pp} = \int_0^1\int_0^1 Q_h^*(r,s)\,dB_p(r)\,dB_p(s)', \qquad C_{pq} = \int_0^1\int_0^1 Q_h^*(r,s)\,dB_p(r)\,dB_q(s)', \quad (7)$$

$$C_{qq} = \int_0^1\int_0^1 Q_h^*(r,s)\,dB_q(r)\,dB_q(s)', \qquad D_{pp} = C_{pp} - C_{pq}\, C_{qq}^{-1}\, C_{pq}'.$$

Then, using the theory of multivariate statistics, we obtain the following theorem.

Theorem 1 Let Assumptions 1-5 hold. Assume that $k(\cdot)$ used in the kernel HAR variance estimation is positive definite. Then, for a fixed $h$:
(a) $W_T(\hat\theta_T) \stackrel{a}{\sim} \left[C_{p,T} - C_{pq,T} C_{qq,T}^{-1} C_{q,T}\right]' D_{pp,T}^{-1}\left[C_{p,T} - C_{pq,T} C_{qq,T}^{-1} C_{q,T}\right]/p \stackrel{d}{=} F_{eT}$;
(b) $W_T(\hat\theta_T) \stackrel{d}{\to} \left[B_p(1) - C_{pq} C_{qq}^{-1} B_q(1)\right]' D_{pp}^{-1}\left[B_p(1) - C_{pq} C_{qq}^{-1} B_q(1)\right]/p \stackrel{d}{=} F_\infty$;
(c) $t_T(\hat\theta_T) \stackrel{a}{\sim} t_{eT} := \left[C_{p,T} - C_{pq,T} C_{qq,T}^{-1} C_{q,T}\right]/\sqrt{D_{pp,T}}$;
(d) $t_T(\hat\theta_T) \stackrel{d}{\to} t_\infty := \left[B_p(1) - C_{pq} C_{qq}^{-1} B_q(1)\right]/\sqrt{D_{pp}}$;
where $(C_{p,T}', C_{q,T}')'$ is independent of $C_{pq,T} C_{qq,T}^{-1}$ and $D_{pp,T}$, and $(B_p(1)', B_q(1)')'$ is independent of $C_{pq} C_{qq}^{-1}$ and $D_{pp}$.

Remark 1 Theorems 1(a) and (b) show that both the asymptotically equivalent distribution $F_{eT}$ and the limiting distribution $F_\infty$ are pivotal. In particular, they do not depend on the long run variance $\Omega$. Both $F_{eT}$ and $F_\infty$ are nonstandard but can be simulated. It is easier to simulate $F_{eT}$, as it involves only $T$ iid standard normal vectors.

Remark 2 Let

$$\tilde W_T(\tilde\theta_T) = T(R\tilde\theta_T - r)'\,\tilde V_T^{-1}\,(R\tilde\theta_T - r)/p$$

be the Wald statistic based on the first-step GMM estimator $\tilde\theta_T$, where

$$\tilde V_T = R\left[\tilde G_T' W_o^{-1}\tilde G_T\right]^{-1}\left[\tilde G_T' W_o^{-1} W_T(\tilde\theta_T)\, W_o^{-1}\tilde G_T\right]\left[\tilde G_T' W_o^{-1}\tilde G_T\right]^{-1} R'$$

and $\tilde G_T = G_T(\tilde\theta_T)$. It is easy to show that $\tilde W_T(\tilde\theta_T) \stackrel{d}{\to} B_p(1)' C_{pp}^{-1} B_p(1)/p$. See KV (2005) for contracted kernel HAR variance estimators, Phillips, Sun and Jin (2006, 2007) for exponentiated kernel HAR variance estimators, and Sun (2013) for OS HAR variance estimators. Comparing the limit $B_p(1)' C_{pp}^{-1} B_p(1)/p$ with $F_\infty$ given in Theorem 1, we can see that $F_\infty$ contains additional adjustment terms.

Remark 3 When the model is exactly identified, we have $q = 0$. In this case, $F_\infty \stackrel{d}{=} B_p(1)' C_{pp}^{-1} B_p(1)/p$. This limit is the same as that in the one-step GMM framework. This is not surprising: when the model is exactly identified, the GMM weighting matrix becomes irrelevant and the two-step estimator is numerically identical to the one-step estimator.

Remark 4 It is interesting to see that the limiting distribution $F_\infty$ (and the asymptotically equivalent distribution) depends on the degree of over-identification. Typically such dependence shows up only when the number of moment conditions is assumed to grow with the sample size. Here the number of moment conditions is fixed. It remains in the limiting distribution because it contains information on the dimension of the random limiting matrix $\tilde W_\infty$.

Theorem 2 Let Assumptions 1-5 hold. Then, for a fixed $h$, $D_T = W_T + o_p(1)$ and $S_T = W_T + o_p(1)$.

It follows from Theorem 2 that all three statistics are asymptotically equivalent in distribution to the same distribution $F_{eT}$, as given in Theorem 1(a). They also have the same limiting distribution $F_\infty$. So the asymptotic equivalence of the three statistics under the increasing-smoothing asymptotics remains valid under the fixed-smoothing asymptotics. Careful inspection of the proofs of Theorems 1 and 2 shows that the two theorems hold regardless of which $\sqrt T$-consistent estimator of $\theta_0$ we use in estimating $W_T(\theta_0)$ and $G_T(\theta_0)$ in $D_T$ and $S_T$. However, it may make a difference at higher orders.

4 Representation and Approximation of the Fixed-smoothing Asymptotic Distribution

In this section, we establish different representations of the fixed-smoothing limiting distribution. These representations highlight the connection and difference between the nonstandard approximation and the standard $\chi^2$ or normal approximation.
4.1 The case of OS HAR variance estimation

To obtain a new representation of $F_\infty$, we consider the matrix

$$\begin{pmatrix} C_{pp} & C_{pq} \\ C_{pq}' & C_{qq} \end{pmatrix} \stackrel{d}{=} \int_0^1\int_0^1 Q_h(r,s)\,dB_{p+q}(r)\,dB_{p+q}(s)' = \frac{1}{K}\sum_{j=1}^K\left[\int_0^1 \Phi_j(r)\,dB_{p+q}(r)\right]\left[\int_0^1 \Phi_j(r)\,dB_{p+q}(r)\right]',$$

where $B_{p+q}(r)$ is a standard Brownian motion of dimension $p+q$. Since $\{\Phi_j(r)\}$ are orthonormal, $\int_0^1 \Phi_j(r)\,dB_{p+q}(r) \sim \text{iid } N(0, I_{p+q})$ over $j$, and the above matrix, multiplied by $K$, follows a standard Wishart distribution $W_{p+q}(K, I_{p+q})$. A well known property of a Wishart random matrix is that $D_{pp} = C_{pp} - C_{pq} C_{qq}^{-1} C_{pq}' \sim W_p(K-q, I_p)/K$ and is independent of both $C_{pq}$ and $C_{qq}$. This implies that $D_{pp}$ is independent of $B_p(1) - C_{pq} C_{qq}^{-1} B_q(1)$, and that $F_\infty$ is equal in distribution to a quadratic form with an independent scaled Wishart matrix as the (inverse) weighting matrix. This brings $F_\infty$ close to Hotelling's $T^2$ distribution, which is the same as a standard F distribution after some multiplicative adjustment. The only difference is that $B_p(1) - C_{pq} C_{qq}^{-1} B_q(1)$ is not normal, and hence $\|B_p(1) - C_{pq} C_{qq}^{-1} B_q(1)\|^2$ does not follow a chi-square distribution. However, conditional on $C_{pq} C_{qq}^{-1} B_q(1)$, $\|B_p(1) - C_{pq} C_{qq}^{-1} B_q(1)\|^2$ follows a noncentral $\chi^2_p$ distribution with noncentrality parameter $\delta^2 = \|C_{pq} C_{qq}^{-1} B_q(1)\|^2$. It then follows that $F_\infty$ is conditionally distributed as a noncentral F distribution. Let

$$\lambda = \frac{K}{K-p-q+1} \quad\text{and}\quad \bar\delta^2 = E\delta^2 = \frac{pq}{K-q-1}.$$

The following theorem presents this result formally.

Theorem 3 For OS HAR variance estimation, we have:
(a) $\lambda^{-1} F_\infty \stackrel{d}{=} F_{p,K-p-q+1}(\delta^2)$, a mixed noncentral F random variable with random noncentrality parameter $\delta^2$;
(b) $P\left(\lambda^{-1} F_\infty < z\right) = F_{p,K-p-q+1}\left(z;\bar\delta^2\right) + o\left(K^{-1}\right)$ as $K\to\infty$;
(c) $P\left(pF_\infty < z\right) = G_p(z) - G_p'(z)\,z\,\frac{p+2q-1}{K} + G_p''(z)\,z^2\,\frac{1}{K} + o\left(\frac{1}{K}\right)$ as $K\to\infty$, where $G_p(z)$ is the CDF of the chi-square distribution $\chi^2_p$.

Remark 5 According to Theorem 3(a),

$$F_\infty \stackrel{d}{=} \frac{\chi^2_p(\delta^2)/p}{\chi^2_{K-p-q+1}/K},$$

where, conditional on $\delta^2$, the noncentral chi-square variate $\chi^2_p(\delta^2)$ is independent of $\chi^2_{K-p-q+1}$. It is easy to see that

$$\delta^2 \stackrel{d}{=} \left\|C_{p\times K}\, C_{q\times K}'\left(C_{q\times K}\, C_{q\times K}'\right)^{-1} C_{q\times 1}\right\|^2,$$

where each $C_{m\times n}$ is an $m\times n$ matrix (vector) with iid standard normal elements, and $C_{p\times K}$, $C_{q\times K}$ and $C_{q\times 1}$ are mutually independent. This provides a simple way to simulate $F_\infty$.
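The representation in Remark 5 can be coded directly. The sketch below simulates $F_\infty$ this way and compares the simulated 95% critical value with the noncentral F approximation of Theorem 3(b); variable names and replication counts are illustrative choices.

```python
import numpy as np
from scipy.stats import ncf, ncx2, chi2

def F_infinity_draws(p, q, K, nrep=100000, seed=0):
    """Draws of F_infinity via the finite-dimensional representation in
    Remark 5: noncentral chi^2_p over an independent chi^2_{K-p-q+1}/K."""
    rng = np.random.default_rng(seed)
    CpK = rng.standard_normal((nrep, p, K))
    CqK = rng.standard_normal((nrep, q, K))
    Cq1 = rng.standard_normal((nrep, q, 1))
    # delta^2 = || C_pK C_qK' (C_qK C_qK')^{-1} C_q1 ||^2
    beta = np.linalg.solve(CqK @ CqK.transpose(0, 2, 1), Cq1)
    delta2 = ((CpK @ CqK.transpose(0, 2, 1) @ beta) ** 2).sum(axis=(1, 2))
    num = ncx2.rvs(p, delta2, random_state=rng) / p
    den = chi2.rvs(K - p - q + 1, size=nrep, random_state=rng) / K
    return num / den

p, q, K = 2, 1, 14
sim = np.quantile(F_infinity_draws(p, q, K), 0.95)
lam = K / (K - p - q + 1)
approx = lam * ncf.ppf(0.95, p, K - p - q + 1, p * q / (K - q - 1))
print(sim, approx)   # the two 95% critical values are close
```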
The high order sequential asymptotics can be regarded as a convenient way to obtain an asymptotic approximation that better re‡ects the …nite sample distribution of the test statistic DT ; ST or WT : Remark 9 Instead of approximations via asymptotic expansions, all three types of approximations are distributional approximations. More speci…cally, the …nite sample distributions of DT ; ST ; and WT are approximated by the distributions of 2 p p 2 2 p ; =p ; and q+1 =K 2 K p 2 p 2 =p (8) 2 K p q+1 =K respectively under the conventional joint asymptotics, the …xed-smoothing asymptotics, and the higher order sequential asymptotics. An advantage of using distributional approximations is that they hold uniformly over their supports (by Pólya’s Lemma). Remark 10 Theorem 3(c) gives a …rst order distributional expansion of F1 : It is clear that the di¤ erence between pF1 and 2p depends on K; p and q: Let p1 = Gp 1 (1 ) be the (1 ) quantile of Gp (z); then P pF1 1 p = + Gp0 1 p 1 p p + 2q K 1 Gp00 1 p 1 p 2 1 +o K 1 K : For a typical critical value 1p ; Gp0 ( 1p ) > 0 and Gp00 ( 1p ) < 0; so we expect P (F1 1 =p) > ; at least when K is large. So the critical value from F1 is expected to be larger than p 1 1 =p: For given p and large K; the di¤ erence between P (F1 =p) and increases with q; p p the degree of over-identi…cation. A practical implication of Theorem 3(c) is that when the degree of over-identi…cation is large, using the chi-square critical value rather than the critical value from F1 may lead to the …nding of a statistically signi…cant relationship that does not actually exist. 14 p=1 p=2 8 8 7.5 7.5 7 7 6.5 6.5 6 6 5.5 5.5 5 5 4.5 4.5 4 4 3.5 3.5 3 10 3 15 20 25 30 K 35 40 45 50 10 15 20 25 p=3 35 40 45 50 p=4 8 8 F ∞, q = 0 7.5 NCF, q = 0 NCF, q = 1 NCF, q = 2 NCF, q = 3 5.5 5 F ∞, q = 2 6.5 F ∞, q = 3 6 F ∞, q = 1 7 F ∞, q = 2 6.5 F ∞, q = 0 7.5 F ∞, q = 1 7 F ∞, q = 3 NCF, q = 0 NCF, q = 1 NCF, q = 2 NCF, q = 3 6 5.5 5 4.5 4.5 4 4 3.5 3.5 3 10 30 K 3 15 20 25 30 K 35 40 45 50 10 15 20 25 30 K 35 40 45 50 Figure 1: 95% quantitle of the nonstandard distribution and its noncentral F approximation Figure 1 reports 95% critical values from the two approximations: the noncentral (scaled) Fp;K p q+1 ; 2 approximation and the nonstandard F1 approximation, for di¤erent combinations of p and q. The nonstandard F1 distribution is simulated according to the representation in Remark 5 and the nonstandard critical values are based on 10000 simulation replications. Since the noncentral F distribution appears naturally in power analysis, it has been implemented in standard statistical and programming packages. So critical values from the noncentral F dis1 2 tribution can be obtained easily. Let N CF 1 := Fp;K be the (1 ) quantile of p q+1 2 1 1 the noncentral Fp;K p q+1 ; distribution, then N CF corresponds to F1 ; the (1 ) 1 quantile of the nonstandard F1 distribution. Figure 1 graphs N CF 1 and F1 against different K values. The …gure exhibits several patterns, all of which are consistent with Theorem 3(c). First, the noncentral F critical values are remarkably close to the nonstandard critical values. So approximating 2 by its mean 2 does not incur much approximation error. Second, the noncentral F and nonstandard critical values increase with the degree of over-identi…cation q. The increase is more signi…cant when K is smaller. 
This is expected, as both the noncentrality parameter $\bar\delta^2$ and the multiplicative factor $\lambda$ increase with $q$. In addition, as $q$ increases, the denominators in the noncentral F and nonstandard distributions become more random. All these effects shift the probability mass to the right as $q$ increases. Third, for any given $p$ and $q$ combination, the noncentral F and nonstandard critical values decrease monotonically as $K$ increases and approach the corresponding critical values from the chi-square distribution.

We have also considered approximating the 90% critical values. In this case, the noncentral F critical values are also very close to the nonstandard critical values.

In parallel with Theorem 3, we can represent and approximate $t_\infty$ as follows.

Theorem 4 Let $\eta = K/(K-q)$. For OS HAR variance estimation, we have:
(a) $t_\infty/\sqrt\eta \stackrel{d}{=} t_{K-q}(\delta)$, a mixed noncentral t random variable with random noncentrality parameter $\delta = -C_{1q} C_{qq}^{-1} B_q(1)$;
(b) $P\left(t_\infty/\sqrt\eta < z\right) = \frac12 + \operatorname{sgn}(z)\,\frac12 F_{1,K-q}\left(z^2;\bar\delta^2\right) + o\left(K^{-1}\right)$ as $K\to\infty$, where $\bar\delta^2 = \frac{q}{K-q-1}$ and $\operatorname{sgn}(\cdot)$ is the sign function;
(c) $P\left(t_\infty < z\right) = \Phi(z) - |z|\,\phi(z)\,\frac{z^2 + 4q + 1}{4K} + o\left(K^{-1}\right)$ as $K\to\infty$, where $\Phi(z)$ and $\phi(z)$ are the CDF and pdf of the standard normal distribution.

Theorem 4(a) is similar to Theorem 3(a). With a scale adjustment, the fixed-smoothing asymptotic distribution of the t statistic is a mixed noncentral t distribution. According to Theorem 4(b), we can approximate the quantiles of $t_\infty$ by those of a noncentral F distribution. More specifically, let $t_\infty^{1-\alpha}$ be the $(1-\alpha)$ quantile of $t_\infty$; then

$$t_\infty^{1-\alpha} \doteq \begin{cases} -\sqrt{\eta\, F_{1,K-q}^{2\alpha-1}\left(\bar\delta^2\right)}, & \alpha > 0.5, \\[4pt] \ \ \sqrt{\eta\, F_{1,K-q}^{1-2\alpha}\left(\bar\delta^2\right)}, & \alpha < 0.5, \end{cases}$$

where $F_{1,K-q}^{1-2\alpha}(\bar\delta^2)$ is the $(1-2\alpha)$ quantile of the noncentral F distribution. For a two-sided t test, the $(1-\alpha)$ quantile $|t_\infty|^{1-\alpha}$ of $|t_\infty|$ can be approximated by $\sqrt{\eta\, F_{1,K-q}^{1-\alpha}(\bar\delta^2)}$.

As in the case of the quantile approximation for $F_\infty$, the above approximation is remarkably accurate. Figure 2, which is similar to Figure 1, illustrates this. The figure graphs $t_\infty^{1-\alpha}$ and its approximation against the values of $K$ for different degrees of over-identification and for $\alpha = 0.05$ and $0.95$.

Figure 2: 5% and 95% quantiles of the nonstandard distribution $t_\infty$ and their noncentral F approximations. (Four panels, $q = 0, 1, 2, 3$, plotted against $K$.)

As $K$ increases, the quantiles approach the normal quantiles $\pm 1.645$. However, when $K$ is small or $q$ is large, there is a significant difference between the normal quantiles and the corresponding quantiles of $t_\infty$. This is consistent with Theorem 4(c). For a given small $K$, the absolute difference increases with the degree of over-identification $q$. For a given $q$, the absolute difference decreases with $K$. Theorem 4(c) and Figure 2 suggest that the quantiles of $t_\infty$ are larger than the corresponding normal quantiles in absolute value. So the test based on the normal approximation rejects the null more often than the test based on the fixed-smoothing approximation. This provides an explanation of the large size distortion of the asymptotic normal test.

4.2 The case of kernel HAR variance estimation

In the case of kernel HAR variance estimation, $D_{pp} = C_{pp} - C_{pq} C_{qq}^{-1} C_{pq}'$ is not independent of $C_{pq}$ or $C_{qq}$. It is not as easy to simplify the nonstandard distribution as in the case of OS HAR variance estimation. Different proofs are needed. Let

$$\mu_1 = \int_0^1 Q_h^*(r,r)\,dr \quad\text{and}\quad \mu_2 = \int_0^1\int_0^1 \left[Q_h^*(r,s)\right]^2 dr\,ds. \quad (9)$$

In Lemma 5 in the appendix, we show that $\mu_1 - 1 \asymp 1/h$ and $\mu_2 \asymp 1/h$, where "$a \asymp b$" indicates that $a$ and $b$ are of the same order of magnitude.

Theorem 5 For kernel HAR variance estimation, we have:
0 0 [Qh (r; s)]2 drds: 1=h and Theorem 5 For kernel HAR variance estimation, we have 16 2 1=h where \a (9) b” indicates q=0 q=1 3 3 2 2 1 95% quantile of t 5% quantile of t 0 1 ∞ ∞ ∞ ∞ Approximate 95% quantile Approximate 5% quantile -1 -2 -3 5% quantile of t 0 Approximate 95% quantile Approximate 5% quantile -1 95% quantile of t -2 0 10 20 30 40 -3 50 0 10 20 K 30 40 50 30 40 50 K q=2 q=3 4 10 3 2 5 1 0 0 -1 -2 -5 -3 -4 0 10 20 30 40 -10 50 0 10 20 K K Figure 2: 5% and 95% quantitles of the nonstandard distribution t1 and their noncentral F approximations d (a) pF1 = 2= 2 p where 2 p is independent of 0 e0p Ip + Cpq Cqq1 Cqq1 Cpq ep 2 d = 2, 0 e0p Ip + Cpq Cqq1 Cqq1 Cpq and ep = (1; 0; :::; 0)0 2 Rp . (b) As h ! 1; we have p 2 ! Cpp 0 Cpq Cqq1 Cpq 1 0 Ip + Cpq Cqq1 Cqq1 Cpq ep 1 and P (pF1 < z) = Gp (z) + Gp0 (z) z ( 1 1) 2 (p + 2q 1 where as before Gp (z) is the CDF of the chi-square distribution 1) + Gp00 (z) z 2 2 + o( 2) 2: p Remark 11 Theorem 5(a) shows that pF1 follows a scale-mixed chi-square distribution. Since 2 !p 1 as h ! 1; the sequential limit of pW is the usual chi-square distribution. The result T is the same as in the OS HAR variance estimation. The virtue of Theorem 5(a) is that it gives an explicit characterization of the random scaling factor 2 : Remark 12 Using Theorem 5(a), we have P (pF1 < z) = EGp z 2 : Theorem 5(b) follows by taking a Taylor expansion of Gp z 2 around Gp (z) and approximating the moments of 2 : In the case of the contracted kernel HAR variance estimation, Sun (2014) establishes an expansion 17 of the asymptotic distribution for WT (~T ), which is based on the one-step GMM estimator ~T : Using the notation in this paper, it is shown that P Bp (1)0 Cpp1 Bp (1) < z = Gp (z) + Gp0 (z) z ( 1) 1 2 (p 1 1) + Gp00 (z) z 2 2 + o( 2) : So up to the order O( 2 ); the two expansions agree except that there is an additional term in Theorem 5(b) that re‡ects the degree of over-identi…cation. If we use the chi-square distribution or the …xed smoothing asymptotic distribution Bp (1)0 Cpp1 Bp (1) as the reference distribution, then the probability of over-rejection increases with q, at least when 2 is small. We now focus on the case of contracted kernel HAR variance estimation. As h ! 1; i.e., b ! 0; we can show that 1 =1 where c1 = bc1 + o(b) and Z 1 k (x) dx; c2 = 1 Using Theorem 5(b) and the identity that P F1 < z 2 = G z 2 G 0 z2 c1 G 0 z 2 z 2 b 2 Z = bc2 + o (b) 1 (10) k 2 (x) dx: 1 z 2 + 1 = 2G 00 z 2 z 2 ; we have, for p = 1 : c2 4 z + (4q + 1) z 2 G 0 z 2 b + o(b) 2 (11) where G z 2 = G1 z 2 : The above expansion is related to the high order Edgeworth expansion established in Sun and Phillips (2009) for linear IV regressions. Sun and Phillips (2009) consider the conventional joint limit under which b ! 0 and T ! 1 jointly and show that P (jtT (^T )j < z) = (z) ( z) 1 c1 z + c2 z 3 + z (4q + 1) 2 b (z) + (bT ) q0 1;1 z (z) + s:o: (12) where (bT ) q0 1;1 z captures the nonparametric bias of the kernel HAR variance estimator, and ‘s.o.’ stands for smaller order terms. Here q0 is the order of the kernel used and the explicit expression for 1;1 is not important here but is given in Sun and Phillips (2009). 
Observing that $\Phi(z) - \Phi(-z) = G(z^2)$, we have $\phi(z) = G'(z^2)\,z$ and $\phi'(z) = G'(z^2) + 2z^2 G''(z^2)$. Using these equations, we can represent the high order expansion in (12) as

$$P\left(|t_T(\hat\theta_T)|^2 < z^2\right) = P\left(W_T < z^2\right) = G(z^2) - c_1 G'(z^2)\,z^2\, b - \frac{c_2}{2}\left[z^4 + (4q+1)\,z^2\right] G'(z^2)\, b + (bT)^{-q_0}\,\omega_{1,1}\, G'(z^2)\,z^2 + \text{s.o.} \quad (13)$$

Comparing this with (11), the terms of order $O(b)$ are seen to be exactly the same across the two expansions.

Let $z^2 = F_\infty^{1-\alpha}$ be the $(1-\alpha)$ quantile of the distribution of $F_\infty$, i.e., $P(F_\infty < F_\infty^{1-\alpha}) = 1-\alpha$. Then

$$G\left(F_\infty^{1-\alpha}\right) - c_1 G'\left(F_\infty^{1-\alpha}\right) F_\infty^{1-\alpha}\, b - \frac{c_2}{2}\left[\left(F_\infty^{1-\alpha}\right)^2 + (4q+1)\, F_\infty^{1-\alpha}\right] G'\left(F_\infty^{1-\alpha}\right) b = 1 - \alpha + o(b),$$

and so

$$P\left(W_T < F_\infty^{1-\alpha}\right) = 1 - \alpha + (bT)^{-q_0}\,\omega_{1,1}\, G'\left(F_\infty^{1-\alpha}\right) F_\infty^{1-\alpha} + \text{s.o.}$$

That is, using the critical value $F_\infty^{1-\alpha}$ eliminates a term in the higher order distributional expansion of $W_T$ under the conventional asymptotics. The nonstandard critical value $F_\infty^{1-\alpha}$ is thus high order correct under the conventional asymptotics. In other words, the nonstandard approximation provides a high order refinement over the conventional chi-square approximation. Although the high order refinement is established for the case of $p = 1$ and linear IV regressions, we expect it to be true more generally and for all three types of HAR variance estimators. All we need is a higher order Edgeworth expansion for the Wald statistic in a general GMM setting. In view of Sun and Phillips (2009), this is not difficult conceptually but can be technically very tedious.

5 The Fixed-smoothing Asymptotics under Local Alternatives

In this section, we consider the fixed-smoothing asymptotics under the sequence of local alternatives

$$H_1: R\theta_0 = r + \frac{\delta_0}{\sqrt T},$$

where $\delta_0$ is a $p\times 1$ vector. This configuration, known as the Pitman drift, is widely used in local power analysis. Under the local alternatives, both $\theta_0$ and the data generating process depend on the sample size. For notational convenience, we have suppressed the extra $T$ subscript on $\theta_0$.

If Assumptions 2-5 hold under the local alternatives, then we can use the same arguments as before and show that $W_T(\hat\theta_T) \stackrel{d}{\to} F_{\infty,\delta_0}$ and $t_T(\hat\theta_T) \stackrel{d}{\to} t_{\infty,\delta_0}$, where

$$F_{\infty,\delta_0} = \left[R\left(G' W_\infty^{-1} G\right)^{-1} G' W_\infty^{-1}\Lambda B_m(1) + \delta_0\right]'\left[R\left(G' W_\infty^{-1} G\right)^{-1} R'\right]^{-1}\left[R\left(G' W_\infty^{-1} G\right)^{-1} G' W_\infty^{-1}\Lambda B_m(1) + \delta_0\right]\Big/\,p$$

and

$$t_{\infty,\delta_0} = \left[R\left(G' W_\infty^{-1} G\right)^{-1} G' W_\infty^{-1}\Lambda B_m(1) + \delta_0\right]\Big/\left[R\left(G' W_\infty^{-1} G\right)^{-1} R'\right]^{1/2} \ \text{ when } p = 1.$$

Let

$$V = R\left(G'\Omega^{-1} G\right)^{-1} R', \qquad \delta = V^{-1/2}\delta_0,$$

and define

$$F_\infty(\|\delta\|^2) = \left[B_p(1) - C_{pq} C_{qq}^{-1} B_q(1) + \delta\right]' D_{pp}^{-1}\left[B_p(1) - C_{pq} C_{qq}^{-1} B_q(1) + \delta\right]\Big/\,p, \quad (14)$$

$$t_\infty(\delta) = \left[B_p(1) - C_{pq} C_{qq}^{-1} B_q(1) + \delta\right]\Big/\sqrt{D_{pp}} \ \text{ when } p = 1. \quad (15)$$

The following theorem shows that $F_{\infty,\delta_0}$ and $F_\infty(\|\delta\|^2)$ have the same distribution. Similarly, $t_{\infty,\delta_0}$ and $t_\infty(\delta)$ have the same distribution.

Theorem 6 Let Assumption 1 hold with $k(\cdot)$ positive definite. In addition, let Assumptions 2-5 hold under $H_1$. Then, for a fixed $h$:
(a) $W_T(\hat\theta_T) \stackrel{d}{\to} F_\infty(\|\delta\|^2)$;
(b) $t_T(\hat\theta_T) \stackrel{d}{\to} t_\infty(\delta)$ when $p = 1$;
where $(B_p'(1), B_q'(1))'$ is independent of $C_{pq} C_{qq}^{-1}$ and $D_{pp}$.
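Theorem 6 makes local power straightforward to simulate: only $\|\delta\|^2$ matters, so the drift can be placed on the first coordinate. Below is a small Monte Carlo sketch of the rejection probability based on (14) for the OS case, where the $C$ matrices are generated from their Wishart-type representation in Section 4.1; the function name, defaults and the choice of critical value are illustrative assumptions.

```python
import numpy as np

def local_power(p, q, K, delta_norm2, crit, nrep=50000, seed=0):
    """Rejection probability of the two-step test under the local alternative,
    using representation (14) with the OS weighting (K basis functions)."""
    rng = np.random.default_rng(seed)
    delta = np.zeros(p)
    delta[0] = np.sqrt(delta_norm2)          # only ||delta||^2 matters
    rej = 0
    for _ in range(nrep):
        xi = rng.standard_normal((K, p + q)) # int Phi_j dB ~ iid N(0, I_{p+q})
        S = xi.T @ xi / K                    # (C_pp, C_pq; C_pq', C_qq)
        B1 = rng.standard_normal(p + q)      # B_{p+q}(1), independent of S
        Spq, Sqq = S[:p, p:], S[p:, p:]
        z = B1[:p] - Spq @ np.linalg.solve(Sqq, B1[p:]) + delta
        Dpp = S[:p, :p] - Spq @ np.linalg.solve(Sqq, Spq.T)
        rej += (z @ np.linalg.solve(Dpp, z) / p) > crit
    return rej / nrep

# e.g. with the noncentral F critical value from the earlier sketch:
# local_power(p=2, q=1, K=14, delta_norm2=4.0, crit=ncf_critical_value(2, 1, 14))
```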
But jj jj2 = 00 V 1 0 is exactly the same as the noncentrality parameter in the noncentral chi-square distribution under the conventional increasing-smoothing asymptotics. See, for example, Hall (2005, p. 163). When h ! 1; F1 (jj jj2 ) converges in distribution to 2p (jj jj2 )=p: That is, the conventional limiting distribution can be recovered from F1 (jj jj2 ): However, for a …nite h; these two distributions can be substantially di¤erent. For the Wald statistic based on the …rst-step estimator, it is not hard to show that for a …xed h; h i h i0 d WT (~T ) ! F~1 (jj~jj2 ) := Bp (1) + ~ C 1 Bp (1) + ~ =p pp where 1 V~ = R G0 Wo;1 G 1 1 1 1 G0 Wo;1 Wo;1 G G0 Wo;1 G 1 R0 and ~ = V~ 1=2 0. (16) Similar to F1 (jj jj2 ), the limiting distribution F~1 (jj~jj2 ) depends on ~ only through jj~jj2 : Note that V~ and V are the asymptotic variances of the …rst-step and two-step estimators in the conventional asymptotic framework. It follows from the standard GMM theory that jj jj2 jj~jj2 . So in general the two-step test has a larger noncentrality parameter than the …rst-step test. This is a source of the power advantage of the two-step test. To characterize the di¤erence between jj jj2 and jj~jj2 ; we let UG G VG0 be a singular value decomposition of G; where 0 0 0 G;m d = AG ; OG ; AG is a nonsingular d d matrix, OG is a q d matrix of zeros, and UG and VG are orthonormal matrices. We rotate the moment conditions and partition the rotated moment conditions UG0 f (vt ; ) according to fU 1 (vt ; ) (17) UG0 f (vt ; ) := fU 2 (vt ; ) where fU 1 (vt ; ) 2 Rd and fU 2 (vt ; ) 2 Rq . Then the moment conditions in (1) are equivalent to the following rotated moment conditions: EUG0 f (vt ; ) = 0; t = 1; 2; :::; T hold if and only if matrix are = 0. (18) The corresponding rotated weighting matrix and long run variance 11 12 WU;1 WU;1 21 22 WU;1 WU;1 WU;1 : = UG0 Wo;1 UG = U : = UG0 UG = 11 U 21 U 12 U 22 U ; ; where the matrices are partitioned conforming to the two blocks in (17). 20 1 22 Let BU;1 = WU;1 21 . In the appendix we show that WU;1 Id BU;1 V~ = RVG AG1 0 Id BU;1 U 0 Id V = RVG AG1 22 U 0 RVG AG1 ; 1 (19) Id U 21 U 1 22 U 0 RVG AG1 ; 21 U and V~ V = RVG AG1 h 1 22 U 21 U BU;1 i0 22 U h 22 U 1 21 U BU;1 i 0 RVG AG1 : (20) 1 21 22 So the di¤erence between V~ and V depends on the di¤erence between U U and BU;1 : 1 21 22 While U U is the long run (population) regression matrix when fU 1 (vt ; 0 ) is regressed on fU 2 (vt ; 0 ); BU;1 can be regarded as the long run regression matrix implied by the weighting matrix WU;1 : When the two long run regression matrices are close to each other, V~ and V will also be close to each other. Otherwise, V~ and V can be very di¤erent. The di¤erence between V~ and V translates into the di¤erence between the noncentrality parameters. To gain deeper insights, we let fU 1 (vt ; ) := VG AG1 fU 1 (vt ; ) 0 BU;1 fU 2 (vt ; ) : ~ It is easy to show that the GMM Note that the long run variance matrix of RfU 1 (vt ; 0 ) is V. estimator based on the moment conditions EfU 1 (vt ; 0 ) = 0 has the same asymptotic distribution as the …rst-step estimator ~T : Hence EfU 1 (vt ; 0 ) = 0 can be regarded as the e¤ective moment conditions behind ~T : Intuitively, ~T employs the moment conditions EfU 1 (vt ; 0 ) = 0 only while ignoring the second block of moment conditions EfU 2 (vt ; 0 ) = 0: In contrast, the two-step estimator ^T employs both blocks of moment conditions. 
The relative merits of the …rst-step estimator and the two-step estimator depend on whether the second block of moment conditions contains additional information about 0 : To measure the information content, we introduce the long run correlation matrix between RfU 1 (vt ; 0 ) and fU 2 (vt ; 0 ) : 12 = V~ 1=2 RVG AG1 12 U 0 BU;1 22 U 22 U 1=2 0 22 ) is the long run covariance between Rf where RVG AG1 ( 12 BU;1 U U U 1 (vt ; 2 2 Proposition 7 gives a characterization of jj jj jj~jj in terms of 12 : 0) and fU 2 (vt ; 0) : Proposition 7 If V~ is nonsingular, then jj jj2 jj~jj2 = 0 ~ 1=2 0V Ip 1=2 0 12 12 0 12 12 0 12 12 Ip 1=2 V~ 1=2 0: P Let 12 = pi=1 i ai b0i be the singular value decomposition of 12 where fai g and fbi g are 2 orthonormal vectors in Rp and Rq respectively. Sorted in the descending order, are the i (squared) long run canonical correlation coe¢ cients between RfU 1 (vt ; 0 ) and fU 2 (vt ; 0 ) : Then 2 jj jj jj~jj2 = p X i=1 2 i 1 21 2 i h a0i V~ 1=2 0 i2 : Consider a special case that maxpi=1 2i = 0; i.e., 12 is a matrix of zeros. In this case, the second block of moment conditions contains no additional information, and we have jj jj2 = jj~jj2 : 1 21 22 This case happens when BU;1 = U U , i.e., the implied long run regression coe¢ cient is the same as the population long run regression coe¢ cient. Consider another special case that p 2 2 approaches 1. If a01 V~ 1=2 0 6= 0; then jj jj2 jj~jj2 approaches 1 as 21 1 = maxi=1 i approaches 1 from below. This case happens when the second block of moment conditions has very high long run prediction power for the …rst block. There are many cases in-between. Depending on the magnitude of the long run canonical correlation coe¢ cients and the direction of the local alternative hypothesis, jj jj2 jj~jj2 can range from 0 to 1: We expect that, under the …xed-smoothing asymptotics, there is a threshold value c > 0 such that when jj jj2 jj~jj2 > c ; the two-step test is asymptotically more powerful than the …rst-step test. Otherwise, the two-step test is asymptotically less powerful. The threshold value c should depend on h; p; q, jj jj2 and the signi…cance level . It should be strictly greater than zero, as the two-step test entails the cost of estimating the optimal weighting matrix. We leave the threshold determination to future research. 6 Simulation Evidence and Empirical Application 6.1 Simulation Evidence In this subsection, we study the …nite sample performance of the …xed-smoothing approximations to the Wald statistic. We consider the following data generating process: yt = x0;t + x1;t 1 + x2;t 2 + x3;t 3 + "y;t where x0;t 1 and x1;t , x2;t and x3;t are scalar regressors that are correlated with "y;t : The dimension of the unknown parameter = ( ; 1 ; 2 ; 3 )0 is d = 4: We have m instruments z0;t ; z1;t ; :::; zm 1;t with z0;t 1. The reduced-form equations for x1;t , x2;t and x3;t are given by xj;t = zj;t + m X1 zi;t + "xj ;t for j = 1; 2; 3: i=d 1 We assume that zi;t for i or an MA(1) process 1 follows either an AR(1) process p 2e zi;t = zi;t 1 + 1 zi ;t ; zi;t = ezi ;t 1 where ezi ;t = + p 1 2e zi ;t ; eizt + e0zt p 2 1 0 and et = [e0zt ; e1zt ; :::; em zt ] s iidN (0; Im ): By construction, the variance of zit for any i = 1; 2; :::; m 1 is 1: Due to the presence of the common shocks e0zt ; the correlation coe¢ cient between the non-constant zi;t and zj;t for i 6= j is 0:5: The DGP for "t = ("yt ; "x1 t ; "x2 t ; "x3 t )0 is the same as that for (z1;t ; :::; zm 1;t ) except the dimensionality di¤erence. 
We take $\rho = -0.8, -0.5, 0.0, 0.5, 0.8$ and $0.9$. We consider $m = 4, 5, 6$, and the corresponding degrees of over-identification are $q = 0, 1, 2$. The null hypotheses of interest are
$$ H_{01}: \theta_1 = 0; \quad H_{02}: \theta_1 = \theta_2 = 0; \quad H_{03}: \theta_1 = \theta_2 = \theta_3 = 0, $$
for which $p = 1, 2$ and $3$, respectively. The corresponding matrix $R$ consists of rows 2 to $p+1$ of the identity matrix $I_d$. We consider two significance levels $\alpha = 5\%$ and $\alpha = 10\%$ and three different sample sizes $T = 100, 200, 500$. The number of simulation replications is 10,000.

We first examine the finite sample size accuracy of different tests using the OS HAR variance estimator. The tests are based on the same Wald test statistic, so they have the same size-adjusted power. The difference lies in the reference distributions or critical values used. We employ the following critical values: $\chi_p^{2,1-\alpha}/p$; $\frac{K}{K-p+1} F_{p,K-p+1}^{1-\alpha}$; $\frac{K}{K-p-q+1} F_{p,K-p-q+1}^{1-\alpha}(\delta^2)$ with $\delta^2 = pq/(K-q-1)$; and $F_\infty^{1-\alpha}$. These lead to the $\chi^2$ test, the CF (central F) test, the NCF (noncentral F) test and the nonstandard $F_\infty$ test, respectively. The $\chi^2$ test uses the conventional chi-square approximation. The CF test uses the fixed-smoothing approximation for the Wald statistic based on a first-step GMM estimator; equivalently, the CF test uses the fixed-smoothing approximation with $q = 0$. The NCF test uses the noncentral F approximation given in Theorem 3. The $F_\infty$ test uses the nonstandard limiting distribution $F_\infty$ with simulated critical values. For each test, the initial first-step estimator is the IV estimator with weight matrix $W_o = Z'Z/T$, where $Z$ is the matrix of the observed instruments.

To speed up the computation, we assume that $K$ is even and use the basis functions $\phi_{2j-1}(x) = \sqrt{2} \cos 2\pi j x$ and $\phi_{2j}(x) = \sqrt{2} \sin 2\pi j x$, $j = 1, \ldots, K/2$. In this case, the OS HAR variance estimator can be computed using discrete Fourier transforms: the OS HAR estimator is a simple average of the first $K/2$ periodogram ordinates. We select $K$ based on the AMSE criterion, implemented using the VAR(1) plug-in procedure in Phillips (2005). For completeness, we reproduce the MSE-optimal formula for $K$ here:
$$ K_{MSE} = \left\lceil \left( \frac{\operatorname{tr}\big[ (I_{m^2} + K_{mm}) (\Omega \otimes \Omega) \big]}{4\, \operatorname{vec}(B)' \operatorname{vec}(B)} \right)^{1/5} T^{4/5} \right\rceil, $$
where $\lceil \cdot \rceil$ is the ceiling function, $K_{mm}$ is the $m^2 \times m^2$ commutation matrix, and
$$ B = \frac{\pi^2}{6} \sum_{j=-\infty}^{\infty} j^2 E f(v_t, \theta_0) f(v_{t-j}, \theta_0)'. \quad (21) $$
We compute $K_{MSE}$ on the basis of the initial first-step estimator $\tilde\theta_T$ and use it in computing both $W_T(\tilde\theta_T)$ and $W_T(\hat\theta_T)$.
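As an illustration of how little machinery the OS estimator requires, the following Python sketch computes $\hat\Omega$ directly from its definition, $\hat\Omega = K^{-1} \sum_{j=1}^{K} \Lambda_j \Lambda_j'$ with $\Lambda_j = T^{-1/2} \sum_{t=1}^T \phi_j(t/T) f_t$, rather than through the DFT shortcut mentioned above. The function name and interface are ours.

    import numpy as np

    def os_har_variance(F, K):
        # F: (T, m) matrix with rows f(v_t, theta_hat)'; K even
        T, m = F.shape
        grid = np.arange(1, T + 1) / T
        omega = np.zeros((m, m))
        for j in range(1, K // 2 + 1):
            for phi in (np.sqrt(2.0) * np.cos(2 * np.pi * j * grid),
                        np.sqrt(2.0) * np.sin(2 * np.pi * j * grid)):
                lam = F.T @ phi / np.sqrt(T)   # Lambda_j, an m-vector
                omega += np.outer(lam, lam)
        return omega / K

Because each pair $(\Lambda_{2j-1}, \Lambda_{2j})$ is, up to scale, the discrete Fourier coefficient of $\{f_t\}$ at frequency $j$, $\hat\Omega$ is proportional to an average of the first $K/2$ periodogram ordinates, which is what makes the DFT implementation fast.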
Table 1 gives the empirical size of the different tests for the AR(1) case with sample size T = 100. The nominal size of the tests is $\alpha = 5\%$; the results for $\alpha = 10\%$ are qualitatively similar. First, as is clear from the table, the chi-square test can have a large size distortion, and the distortion can be very severe. For example, when $\rho = 0.9$, $p = 3$ and $q = 2$, the empirical size of the chi-square test can be as high as 71.2%, far from the 5% nominal size. Second, the size distortion of the CF test is substantially smaller than that of the chi-square test when the degree of over-identification is small. This is because the CF test employs an asymptotic approximation that partially captures the estimation uncertainty of the HAR variance estimator. Third, the empirical size of the NCF test is nearly the same as that of the nonstandard $F_\infty$ test. This is consistent with Figure 1, and it provides further evidence that the noncentral F distribution approximates the nonstandard $F_\infty$ distribution very well, at least at the 95% quantile. Finally, among the four tests, the NCF and $F_\infty$ tests have the most accurate size.

For the two-step GMM estimator, the HAR variance estimator appears in two different places and plays two different roles: first as the inverse of the optimal weighting matrix and then as part of the asymptotic variance estimator for the GMM estimator. The nonstandard $F_\infty$ approximation and the noncentral F approximation attempt to capture the estimation uncertainty of the HAR variance estimator in both places. In contrast, a crucial step underlying the central F approximation is that the HAR variance estimator is treated as consistent when it acts as the optimal weighting matrix. As a result, the central F approximation does not adequately capture the estimation uncertainty of the HAR variance estimator. This explains why the NCF and $F_\infty$ tests have more accurate size than the CF test.

Table 2 presents the simulated empirical size for the MA(1) case. The qualitative observations for the AR(1) case remain valid. The chi-square test is the most size distorted; the NCF and $F_\infty$ tests are the least size distorted; and the CF test is in between. As before, the size distortion increases with the serial dependence, the number of joint hypotheses, and the degree of over-identification. This table provides further evidence that the noncentral F distribution and the nonstandard $F_\infty$ distribution provide more accurate approximations to the sampling distribution of the Wald statistic.

Tables 3 and 4 report results for the case when the sample size is 200. Compared to the cases with sample size 100, all tests become more accurate in size, and the size accuracy improves further when the sample size increases to 500. This is well expected. In terms of size accuracy, the NCF test and the $F_\infty$ test are close to each other. They dominate the CF test, which in turn dominates the chi-square test.

Next, we consider the empirical size of different tests based on kernel HAR variance estimators. Both the contracted kernel approach and the exponentiated kernel approach are considered, but we report only the commonly-used contracted kernel approach, as the results are qualitatively similar. We employ three kernels: the Bartlett, Parzen and QS kernels, all of which are positive semidefinite. For each kernel, we use the data-driven AMSE-optimal bandwidth and its VAR(1) plug-in implementation from Andrews (1991). For each Wald statistic, three critical values are used: $\chi_p^{2,1-\alpha}/p$, $F_\infty^{1-\alpha}[0]$ and $F_\infty^{1-\alpha}[q]$, where $F_\infty^{1-\alpha}[q]$ is the $(1-\alpha)$ quantile of the nonstandard $F_\infty$ distribution with degree of over-identification $q$. $F_\infty^{1-\alpha}[0]$ coincides with the critical value from the fixed-smoothing asymptotic distribution derived for the first-step IV estimator. To save space, we report only the results for the Parzen kernel. Tables 5–8 correspond to Tables 1–4, and the qualitative results exhibited in Tables 1–4 continue to apply: in terms of size accuracy, the $F_\infty[q]$ test dominates the $F_\infty[0]$ test, which in turn dominates the conventional $\chi^2$ test.
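All of the critical values used in this subsection are available in closed form except the simulated $F_\infty$ ones. A short Python sketch, assuming SciPy is available and using our own function name, collects the $\chi^2$, CF and NCF critical values for the OS-based Wald test:

    from scipy import stats

    def os_critical_values(p, q, K, alpha=0.05):
        # chi-square test: chi^2_p quantile divided by p
        cv_chi2 = stats.chi2.ppf(1 - alpha, p) / p
        # CF test: fixed-K approximation with q = 0
        cv_cf = K / (K - p + 1) * stats.f.ppf(1 - alpha, p, K - p + 1)
        # NCF test: noncentral F approximation of Theorem 3, delta^2 = pq/(K - q - 1)
        delta2 = p * q / (K - q - 1)
        cv_ncf = K / (K - p - q + 1) * stats.ncf.ppf(1 - alpha, p, K - p - q + 1, delta2)
        return cv_chi2, cv_cf, cv_ncf

When $q \ge 1$, the NCF critical value exceeds the CF one, which in turn exceeds the chi-square one, reflecting the progressively larger amount of estimation uncertainty being accounted for; when $q = 0$, the CF and NCF critical values coincide.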
6.2 Empirical Application

To illustrate the fixed-smoothing approximations, we consider the log-normal stochastic volatility model of the form
$$ r_t = \sigma_t Z_t, \quad \log \sigma_t^2 - \omega = \beta \big( \log \sigma_{t-1}^2 - \omega \big) + \sigma_u u_t, $$
where $r_t$ is the rate of return and $(Z_t, u_t)$ is iid $N(0, I_2)$. The first equation specifies the distribution of the return as heteroskedastic normal. The second equation specifies the dynamics of the log-volatility as an AR(1). The parameter vector is $\theta = (\omega, \beta, \sigma_u)'$. We impose the restriction that $\beta \in (0, 1)$, which is an empirically relevant range. The model and the parameter restriction are the same as those considered by Andersen and Sorensen (1996), which gives a detailed discussion of the motivation for stochastic volatility models and the GMM approach. For more discussion, see Ghysels, Harvey and Renault (1996) and references therein.

We employ GMM to estimate the log-normal stochastic volatility model. The data are weekly returns of S&P 500 stocks, which are constructed by compounding daily returns with dividends from a CRSP index file. We consider both value-weighted returns (vwretd) and equal-weighted returns (ewretd). The weekly returns range from the first week of 2001 to the last week of 2012, with sample size T = 627. We use weekly data in order to minimize problems associated with daily data, such as asynchronous trading and bid-ask bounce. This is consistent with Jacquier, Polson and Rossi (1994).

The GMM approach relies on functions of the time series $\{r_t\}$ to identify the parameters of the model. For the log-normal stochastic volatility model, we can obtain the following moment conditions:
$$ E|r_t|^\ell = c_\ell E \sigma_t^\ell \text{ for } \ell = 1, 2, 3, 4 \text{ with } (c_1, c_2, c_3, c_4) = \big( \sqrt{2/\pi},\ 1,\ 2\sqrt{2/\pi},\ 3 \big), $$
$$ E|r_t r_{t-j}| = \frac{2}{\pi} E(\sigma_t \sigma_{t-j}) \quad \text{and} \quad E r_t^2 r_{t-j}^2 = E\big( \sigma_t^2 \sigma_{t-j}^2 \big) \text{ for } j = 1, 2, \ldots, $$
where
$$ E \sigma_t^\ell = \exp\left( \frac{\omega \ell}{2} + \frac{\ell^2 \sigma_u^2}{8(1 - \beta^2)} \right), \quad E\big( \sigma_t^{\ell_1} \sigma_{t-j}^{\ell_2} \big) = E\big( \sigma_t^{\ell_1} \big) E\big( \sigma_t^{\ell_2} \big) \exp\left( \frac{\ell_1 \ell_2 \beta^j \sigma_u^2}{4(1 - \beta^2)} \right). $$
Higher order moments can be computed, but we choose to focus on a subset of lower order moments. Andersen and Sorensen (1996) points out that it is generally not optimal to include too many moment conditions when the sample is limited. On the other hand, it is not advisable to include only as many moment conditions as the number of parameters. When T = 500 and $\theta = (-7.36, 0.90, 0.363)$, which is an empirically relevant parameter vector, Table 1 in Andersen and Sorensen (1996) shows that it is MSE-optimal to employ nine moment conditions. For this reason, we employ two sets of nine moment conditions given in the appendix of Andersen and Sorensen (1996). The baseline set of nine moment conditions is $E f_i(r_t, \theta) = 0$ for $i = 1, \ldots, 9$ with
$$ f_i(r_t, \theta) = |r_t|^i - c_i E(\sigma_t^i), \quad i = 1, \ldots, 4; $$
$$ f_5(r_t, \theta) = |r_t r_{t-1}| - \frac{2}{\pi} E(\sigma_t \sigma_{t-1}); \quad f_6(r_t, \theta) = |r_t r_{t-3}| - \frac{2}{\pi} E(\sigma_t \sigma_{t-3}); \quad f_7(r_t, \theta) = |r_t r_{t-5}| - \frac{2}{\pi} E(\sigma_t \sigma_{t-5}); $$
$$ f_8(r_t, \theta) = r_t^2 r_{t-2}^2 - E\big( \sigma_t^2 \sigma_{t-2}^2 \big); \quad f_9(r_t, \theta) = r_t^2 r_{t-4}^2 - E\big( \sigma_t^2 \sigma_{t-4}^2 \big). \quad (22) $$
The alternative set of nine moment conditions is $E f_i(r_t, \theta) = 0$ for $i = 1, \ldots, 9$ with
$$ f_i(r_t, \theta) = |r_t|^i - c_i E\big( \sigma_t^i \big), \quad i = 1, \ldots, 4; $$
$$ f_5(r_t, \theta) = |r_t r_{t-2}| - \frac{2}{\pi} E(\sigma_t \sigma_{t-2}); \quad f_6(r_t, \theta) = |r_t r_{t-4}| - \frac{2}{\pi} E(\sigma_t \sigma_{t-4}); \quad f_7(r_t, \theta) = |r_t r_{t-6}| - \frac{2}{\pi} E(\sigma_t \sigma_{t-6}); $$
$$ f_8(r_t, \theta) = r_t^2 r_{t-1}^2 - E\big( \sigma_t^2 \sigma_{t-1}^2 \big); \quad f_9(r_t, \theta) = r_t^2 r_{t-3}^2 - E\big( \sigma_t^2 \sigma_{t-3}^2 \big). \quad (23) $$
In each case m = 9 and d = 3, and so the degree of over-identification is $q = m - d = 6$. We focus on constructing 90% and 95% confidence intervals (CIs) for $\beta$. Given the high nonlinearity of the moment conditions, we invert the GMM distance statistic to obtain the CIs. The CIs so obtained are invariant to reparametrization of the model parameters. For example, a 95% confidence interval is the set of $\beta$ values in (0,1) that the $D_T$ test does not reject at the 5% level. We search over the grid from 0.01 to 0.99 with increment 0.01 to invert the $D_T$ test.
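Before inverting the distance statistic, it may help to see the moment vector in code. The following Python sketch evaluates the baseline nine moments in (22) at a candidate $\theta = (\omega, \beta, \sigma_u)$, using the moment formulas reconstructed above; the function name and the handling of the initial observations are our own choices.

    import numpy as np

    def sv_moments(r, theta):
        # Baseline moment conditions (22); r: 1-d array of returns
        omega, beta, sig_u = theta
        s2 = sig_u**2 / (1.0 - beta**2)

        def e_sig(l):            # E sigma_t^l
            return np.exp(omega * l / 2.0 + l**2 * s2 / 8.0)

        def e_cross(l1, l2, j):  # E sigma_t^{l1} sigma_{t-j}^{l2}
            return e_sig(l1) * e_sig(l2) * np.exp(l1 * l2 * beta**j * s2 / 4.0)

        c = [np.sqrt(2 / np.pi), 1.0, 2 * np.sqrt(2 / np.pi), 3.0]
        a = np.abs(r)
        rows = []
        for t in range(5, len(r)):   # drop 5 observations to accommodate the lags
            f = [a[t]**l - c[l - 1] * e_sig(l) for l in (1, 2, 3, 4)]
            f += [a[t] * a[t - j] - (2 / np.pi) * e_cross(1, 1, j) for j in (1, 3, 5)]
            f += [r[t]**2 * r[t - j]**2 - e_cross(2, 2, j) for j in (2, 4)]
            rows.append(f)
        return np.asarray(rows)      # (T - 5) x 9 matrix of f_i(r_t, theta)

Averaging the rows gives $g_T(\theta)$; the $D_T$-based confidence interval for $\beta$ then comes from evaluating the distance statistic over the grid $\beta = 0.01, 0.02, \ldots, 0.99$ and collecting the values that are not rejected.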
As in the simulation study, we employ three different critical values: $\chi_p^{2,1-\alpha}/p$, $F_\infty^{1-\alpha}[0]$, and $F_\infty^{1-\alpha}[q]$, which correspond to three different asymptotic approximations. For the series LRV estimator, $F_\infty^{1-\alpha}[0]$ is a critical value from the corresponding F approximation. Before presenting the empirical results, we note that there is no 'hole' in the CIs we obtain: the values of $\beta$ that are not rejected form an arithmetic sequence. Given this, we report only the minimum and maximum values of $\beta$ over the grid that are not rejected. It turns out the maximum value is always equal to the upper bound 0.99. It thus suffices to report the minimum value, which we take as the lower limit of the CI.

Table 9 presents the lower limits of the different 95% CIs together with the smoothing parameters used. The smoothing parameters are selected in the same way as in the simulation study. Since the selected K values are relatively large and the selected b values are relatively small, the CIs based on the $\chi^2$ approximation and the $F_\infty[0]$ approximation are close to each other. However, there is still a noticeable difference between the CIs based on the $\chi^2$ approximation and the nonstandard $F_\infty[q]$ approximation. This is especially true for the case with the baseline moment conditions, the Bartlett kernel, and equal-weighted returns. Taking into account the randomness of the estimated weighting matrix leads to wider CIs. The log-volatility may not be as persistent as previously thought.

7 Conclusion

The paper has developed the fixed-smoothing asymptotics for heteroskedasticity and autocorrelation robust inference in a two-step GMM framework. We have shown that the conventional chi-square test, which ignores the estimation uncertainty of the GMM weighting matrix and the covariance estimator, can have very large size distortion. This is especially true when the number of joint hypotheses being tested and the degree of over-identification are high or the underlying processes are persistent. The test based on our new fixed-smoothing approximation reduces the size distortion substantially and is thus recommended for practical use.

There are a number of interesting extensions. First, given that our proof uses only a CLT, the results of the paper can be extended easily to the spatial setting, the spatial-temporal setting or the panel data setting. See Kim and Sun (2013) for an extension along this line. Second, in the Monte Carlo experiments, we use the conventional MSE criterion to select the smoothing parameters. We could have used the methods proposed in Sun and Phillips (2009), Sun, Phillips and Jin (2008) or Sun (2014) that are tailored toward confidence interval construction or hypothesis testing. But in their present forms these methods are either available only for the t-test or work only in the first-step GMM framework. It would be interesting to extend them to more general tests in the two-step GMM framework. Third, we may employ the asymptotically equivalent distribution to conduct more reliable inference. Simulation results not reported here show that the size difference between using the asymptotically equivalent distribution and the limiting distribution is negligible. However, it is easy to replace the iid process in the asymptotically equivalent distribution $\tilde F_T$ by a dependent process that mimics the degree of autocorrelation in the moment process.
Such a replacement may lead to even more accurate tests and is similar in spirit to Zhang and Shao (2013) who have shown the high order re…nement of a Gaussian bootstrap procedure under the …xed-smoothing asymptotics. Finally, the general results of the paper can be extended to a nonparametric sieve GMM framework. See Chen, Liao and Sun (2014) for a recent development on autocorrelation robust sieve inference for time series models. Table 1: Empirical size of the nominal 5% 2 test, F test, noncentral F test and nonstandard F1 test based on the OS HAR variance estimator for the AR(1) case with T = 100, number of joint hypotheses p and number of overidentifying restrictions q 2 -0.8 -0.5 0.0 0.5 0.8 0.9 0.114 0.081 0.063 0.094 0.134 0.166 -0.8 -0.5 0.0 0.5 0.8 0.9 0.186 0.113 0.081 0.128 0.196 0.252 -0.8 -0.5 0.0 0.5 0.8 0.9 0.260 0.154 0.104 0.171 0.268 0.332 CF NCF p = 1; q = 0 0.072 0.072 0.060 0.060 0.051 0.051 0.063 0.063 0.086 0.086 0.117 0.117 p = 1; q = 1 0.129 0.081 0.086 0.065 0.065 0.053 0.091 0.064 0.140 0.089 0.184 0.126 p = 1; q = 2 0.190 0.080 0.117 0.061 0.085 0.055 0.129 0.065 0.199 0.085 0.257 0.124 F1 2 0.073 0.059 0.050 0.063 0.088 0.120 0.197 0.117 0.083 0.142 0.229 0.290 0.077 0.065 0.052 0.063 0.086 0.123 0.307 0.175 0.113 0.204 0.331 0.420 0.079 0.062 0.055 0.066 0.082 0.121 0.425 0.244 0.148 0.279 0.449 0.529 CF NCF p = 2; q = 0 0.087 0.087 0.066 0.066 0.052 0.052 0.065 0.065 0.100 0.100 0.150 0.150 p = 2; q = 1 0.153 0.088 0.099 0.069 0.075 0.057 0.110 0.073 0.172 0.101 0.242 0.155 p = 2; q = 2 0.256 0.090 0.146 0.065 0.101 0.062 0.165 0.065 0.269 0.090 0.358 0.148 27 F1 2 0.084 0.066 0.053 0.065 0.097 0.146 0.310 0.174 0.112 0.222 0.355 0.437 0.087 0.068 0.056 0.072 0.099 0.153 0.457 0.247 0.155 0.308 0.489 0.589 0.083 0.061 0.058 0.061 0.082 0.137 0.602 0.351 0.211 0.415 0.623 0.712 CF NCF p = 3; q = 0 0.109 0.109 0.077 0.077 0.060 0.060 0.077 0.077 0.119 0.119 0.181 0.181 p = 3; q = 1 0.193 0.107 0.116 0.079 0.085 0.060 0.130 0.079 0.209 0.112 0.301 0.183 p = 3; q = 2 0.318 0.100 0.175 0.074 0.119 0.065 0.200 0.073 0.330 0.100 0.433 0.160 F1 0.111 0.078 0.062 0.078 0.122 0.184 0.113 0.080 0.060 0.080 0.118 0.192 0.097 0.073 0.063 0.072 0.097 0.157 Table 2: Empirical size of the nominal 5% 2 test, F test, noncentral F test and nonstandard F1 test based on the series LRV estimator for the MA(1) case with T = 100, number of joint hypotheses p and number of overidentifying restrictions q 2 -0.8 -0.5 0.0 0.5 0.8 0.9 0.072 0.074 0.063 0.078 0.081 0.077 -0.8 -0.5 0.0 0.5 0.8 0.9 0.099 0.097 0.081 0.108 0.116 0.107 -0.8 -0.5 0.0 0.5 0.8 0.9 0.137 0.131 0.104 0.143 0.153 0.137 CF NCF p = 1; q = 0 0.055 0.055 0.055 0.055 0.051 0.051 0.054 0.054 0.056 0.056 0.055 0.055 p = 1; q = 1 0.074 0.057 0.075 0.057 0.065 0.053 0.078 0.057 0.082 0.057 0.077 0.058 p = 1; q = 2 0.105 0.054 0.100 0.055 0.085 0.055 0.109 0.060 0.115 0.060 0.105 0.058 F1 2 0.055 0.055 0.050 0.054 0.056 0.055 0.106 0.101 0.083 0.119 0.125 0.114 0.056 0.057 0.052 0.056 0.057 0.057 0.160 0.150 0.113 0.171 0.185 0.163 0.054 0.055 0.055 0.061 0.061 0.059 0.216 0.204 0.148 0.231 0.247 0.218 CF NCF p = 2; q = 0 0.058 0.058 0.058 0.058 0.052 0.052 0.060 0.060 0.060 0.060 0.059 0.059 p = 2; q = 1 0.090 0.061 0.088 0.061 0.075 0.057 0.093 0.062 0.097 0.062 0.093 0.063 p = 2; q = 2 0.133 0.058 0.128 0.063 0.101 0.062 0.138 0.063 0.146 0.062 0.134 0.061 F1 2 0.058 0.059 0.053 0.061 0.062 0.060 0.153 0.144 0.112 0.182 0.195 0.170 0.060 0.060 0.056 0.062 0.062 0.062 0.223 0.207 0.155 0.253 0.274 0.239 0.056 0.060 0.058 
0.060 0.059 0.059 0.305 0.284 0.211 0.339 0.366 0.316 CF NCF p = 3; q = 0 0.069 0.069 0.069 0.069 0.060 0.060 0.071 0.071 0.071 0.071 0.070 0.070 p = 3; q = 1 0.099 0.067 0.096 0.068 0.085 0.060 0.107 0.068 0.115 0.069 0.107 0.070 p = 3; q = 2 0.153 0.066 0.146 0.066 0.119 0.065 0.168 0.067 0.177 0.067 0.159 0.070 F1 0.070 0.070 0.062 0.072 0.071 0.070 0.068 0.068 0.060 0.068 0.070 0.071 0.066 0.065 0.063 0.067 0.066 0.070 Table 3: Empirical size of the nominal 5% 2 test, F test, noncentral F test and nonstandard F1 test based on the series LRV estimator for the AR(1) case with T = 200, number of joint hypotheses p and number of overidentifying restrictions q 2 -0.8 -0.5 0.0 0.50 0.8 0.9 0.102 0.070 0.060 0.079 0.114 0.140 -0.8 -0.5 0.0 0.5 0.8 0.9 0.147 0.083 0.069 0.099 0.164 0.205 -0.8 -0.5 0.0 0.5 0.8 0.9 0.195 0.101 0.074 0.113 0.225 0.276 CF NCF p = 1; q = 0 0.071 0.071 0.058 0.058 0.054 0.054 0.063 0.063 0.072 0.072 0.091 0.091 p = 1; q = 1 0.107 0.072 0.070 0.058 0.060 0.053 0.080 0.062 0.112 0.076 0.145 0.095 p = 1; q = 2 0.148 0.071 0.082 0.060 0.066 0.053 0.091 0.056 0.170 0.072 0.210 0.090 F1 2 0.071 0.058 0.054 0.063 0.073 0.094 0.150 0.088 0.065 0.104 0.187 0.243 0.072 0.057 0.053 0.061 0.073 0.090 0.234 0.117 0.081 0.139 0.270 0.347 0.073 0.060 0.054 0.056 0.073 0.088 0.319 0.144 0.096 0.166 0.371 0.460 CF NCF p = 2; q = 0 0.073 0.073 0.062 0.062 0.052 0.052 0.063 0.063 0.079 0.079 0.107 0.107 p = 2; q = 1 0.126 0.080 0.081 0.066 0.064 0.058 0.088 0.065 0.138 0.083 0.178 0.110 p = 2; q = 2 0.192 0.075 0.098 0.066 0.074 0.056 0.107 0.064 0.216 0.075 0.280 0.099 28 F1 2 0.073 0.063 0.052 0.064 0.078 0.104 0.234 0.117 0.081 0.144 0.283 0.366 0.081 0.065 0.058 0.062 0.083 0.108 0.348 0.154 0.100 0.182 0.394 0.498 0.070 0.063 0.057 0.059 0.070 0.090 0.460 0.190 0.117 0.234 0.539 0.639 CF NCF p = 3; q = 0 0.091 0.091 0.071 0.071 0.059 0.059 0.066 0.066 0.092 0.092 0.124 0.124 p = 3; q = 1 0.147 0.092 0.094 0.073 0.071 0.062 0.092 0.070 0.159 0.087 0.219 0.121 p = 3; q = 2 0.233 0.083 0.116 0.069 0.083 0.059 0.129 0.070 0.264 0.081 0.349 0.106 F1 0.091 0.072 0.057 0.070 0.093 0.128 0.094 0.072 0.061 0.072 0.090 0.127 0.082 0.070 0.060 0.069 0.078 0.103 Table 4: Empirical size of the nominal 5% 2 test, F test, noncentral F test and nonstandard F1 test based on the series LRV estimator for the MA(1) case with T = 200, number of joint hypotheses p and number of overidentifying restrictions q 2 -0.8 -0.5 0.0 0.5 0.8 0.9 0.066 0.066 0.060 0.070 0.072 0.069 -0.8 -0.5 0.0 0.5 0.8 0.9 0.080 0.077 0.069 0.085 0.089 0.080 -0.8 -0.5 0.0 0.5 0.8 0.9 0.096 0.095 0.074 0.100 0.103 0.093 CF NCF p = 1; q = 0 0.054 0.054 0.056 0.056 0.054 0.054 0.058 0.058 0.057 0.057 0.056 0.056 p = 1; q = 1 0.066 0.055 0.065 0.056 0.060 0.053 0.071 0.058 0.072 0.057 0.067 0.057 p = 1; q = 2 0.080 0.056 0.079 0.055 0.066 0.053 0.080 0.054 0.083 0.054 0.079 0.055 F1 2 0.053 0.055 0.054 0.057 0.057 0.056 0.082 0.079 0.065 0.089 0.091 0.084 0.055 0.056 0.053 0.057 0.056 0.056 0.105 0.102 0.081 0.115 0.123 0.110 0.056 0.057 0.054 0.055 0.055 0.056 0.130 0.123 0.096 0.136 0.146 0.131 CF NCF p = 2; q = 0 0.056 0.056 0.054 0.054 0.052 0.052 0.057 0.057 0.058 0.058 0.057 0.057 p = 2; q = 1 0.075 0.063 0.076 0.064 0.064 0.058 0.077 0.060 0.077 0.058 0.074 0.059 p = 2; q = 2 0.090 0.059 0.089 0.060 0.074 0.056 0.094 0.059 0.096 0.059 0.093 0.059 29 F1 2 0.056 0.056 0.052 0.057 0.058 0.057 0.106 0.103 0.081 0.120 0.130 0.114 0.061 0.062 0.058 0.058 0.057 0.058 0.141 0.134 0.100 0.152 0.164 0.142 0.058 0.060 0.057 0.058 0.056 
0.058 0.177 0.165 0.117 0.184 0.202 0.172 CF NCF p = 3; q = 0 0.062 0.062 0.061 0.061 0.059 0.059 0.060 0.060 0.060 0.060 0.061 0.061 p = 3; q = 1 0.084 0.069 0.082 0.066 0.071 0.062 0.083 0.064 0.085 0.064 0.084 0.066 p = 3; q = 2 0.109 0.067 0.106 0.065 0.083 0.059 0.110 0.065 0.113 0.063 0.105 0.065 F1 0.064 0.064 0.057 0.063 0.063 0.063 0.068 0.066 0.061 0.063 0.065 0.065 0.066 0.064 0.060 0.066 0.063 0.065 Table 5: Empirical size of the nominal 5% 2 test, the F1 [0] test and the F1 [q] test based on the Parzen kernel LRV estimator for the AR(1) case with T = 100, number of joint hypotheses p and number of overidentifying restrictions q 2 -0.8 -0.5 0.0 0.5 0.8 0.9 0.132 0.085 0.063 0.093 0.164 0.216 -0.8 -0.5 0.0 0.5 0.8 0.9 0.197 0.105 0.073 0.117 0.223 0.315 -0.8 -0.5 0.0 0.5 0.8 0.9 0.257 0.128 0.086 0.139 0.282 0.384 F1 [0] F1 [q] p = 1; q = 0 0.091 0.091 0.073 0.073 0.059 0.059 0.074 0.074 0.102 0.102 0.124 0.124 p = 1; q = 1 0.145 0.110 0.091 0.076 0.067 0.059 0.095 0.079 0.151 0.110 0.204 0.145 p = 1; q = 2 0.195 0.115 0.110 0.078 0.079 0.060 0.115 0.081 0.201 0.113 0.269 0.144 2 0.215 0.116 0.071 0.131 0.273 0.382 0.313 0.150 0.096 0.169 0.362 0.521 0.406 0.188 0.114 0.210 0.456 0.605 F1 [0] F1 [q] p = 2; q = 0 0.112 0.112 0.082 0.082 0.058 0.058 0.076 0.076 0.126 0.126 0.164 0.164 p = 2; q = 1 0.174 0.127 0.104 0.088 0.077 0.067 0.111 0.090 0.189 0.132 0.271 0.182 p = 2; q = 2 0.242 0.139 0.131 0.091 0.091 0.070 0.140 0.088 0.260 0.135 0.365 0.190 2 0.326 0.160 0.093 0.182 0.408 0.559 0.443 0.201 0.121 0.236 0.521 0.704 0.555 0.248 0.150 0.291 0.613 0.774 F1 [0] F1 [q] p = 3; q = 0 0.153 0.153 0.100 0.100 0.070 0.070 0.096 0.096 0.154 0.154 0.212 0.212 p = 3; q = 1 0.224 0.164 0.126 0.104 0.084 0.074 0.132 0.104 0.230 0.160 0.329 0.226 p = 3; q = 2 0.305 0.166 0.155 0.103 0.108 0.080 0.167 0.105 0.319 0.157 0.438 0.221 Table 6: Empirical size of the nominal 5% 2 test, the F1 (0) test and the F1 (q) test based on the Parzen kernel LRV estimator for the MA(1) case with T = 100, number of joint hypotheses p and number of overidentifying restrictions q 2 -0.8 -0.5 0.0 0.5 0.8 0.9 0.075 0.073 0.063 0.079 0.081 0.076 -0.8 -0.5 0.0 0.5 0.8 0.9 0.091 0.086 0.073 0.097 0.104 0.095 -0.8 -0.5 0.0 0.5 0.8 0.9 0.114 0.110 0.086 0.120 0.125 0.113 F1 [0] F1 [q] p = 1; q = 0 0.065 0.065 0.065 0.065 0.059 0.059 0.064 0.064 0.066 0.066 0.063 0.063 p = 1; q = 1 0.079 0.067 0.076 0.067 0.067 0.059 0.081 0.070 0.083 0.069 0.081 0.069 p = 1; q = 2 0.099 0.072 0.096 0.069 0.079 0.060 0.100 0.071 0.105 0.072 0.098 0.071 2 0.101 0.096 0.071 0.107 0.113 0.103 0.133 0.126 0.096 0.140 0.149 0.134 0.169 0.157 0.114 0.168 0.181 0.160 F1 [0] F1 [q] p = 2; q = 0 0.070 0.070 0.070 0.070 0.058 0.058 0.069 0.069 0.070 0.070 0.070 0.070 p = 2; q = 1 0.094 0.078 0.094 0.078 0.077 0.067 0.094 0.077 0.097 0.077 0.092 0.075 p = 2; q = 2 0.118 0.083 0.114 0.081 0.091 0.070 0.117 0.083 0.122 0.081 0.114 0.081 30 2 0.133 0.126 0.093 0.150 0.160 0.140 0.173 0.162 0.121 0.191 0.203 0.181 0.215 0.201 0.150 0.229 0.249 0.216 F1 [0] F1 [q] p = 3; q = 0 0.086 0.086 0.085 0.085 0.070 0.070 0.084 0.084 0.086 0.086 0.082 0.082 p = 3; q = 1 0.108 0.088 0.106 0.087 0.084 0.074 0.111 0.088 0.116 0.090 0.108 0.091 p = 3; q = 2 0.139 0.093 0.133 0.091 0.108 0.080 0.138 0.092 0.144 0.091 0.133 0.091 Table 7: Empirical size of the nominal 5% 2 test, the F1 [0] test and the F1 [q] test based on the Parzen kernel LRV estimator for the AR(1) case with T = 200, number of joint hypotheses p and number of overidentifying restrictions q 2 
-0.8 -0.5 0.0 0.5 0.8 0.9 0.110 0.075 0.059 0.080 0.124 0.170 -0.8 -0.5 0.0 0.5 0.8 0.9 0.143 0.083 0.063 0.091 0.155 0.235 -0.8 -0.5 0.0 0.5 0.8 0.9 0.168 0.092 0.066 0.098 0.193 0.296 F1 [0] F1 [q] p = 1; q = 0 0.090 0.090 0.073 0.073 0.061 0.061 0.073 0.073 0.091 0.091 0.105 0.105 p = 1; q = 1 0.117 0.096 0.079 0.071 0.064 0.061 0.085 0.074 0.118 0.094 0.160 0.116 p = 1; q = 2 0.138 0.094 0.087 0.073 0.067 0.061 0.090 0.071 0.154 0.096 0.215 0.121 2 0.150 0.090 0.061 0.096 0.186 0.286 0.208 0.108 0.074 0.117 0.239 0.391 0.257 0.124 0.078 0.131 0.295 0.469 F1 [0] F1 [q] p = 2; q = 0 0.098 0.098 0.078 0.078 0.058 0.058 0.075 0.075 0.102 0.102 0.129 0.129 p = 2; q = 1 0.136 0.110 0.091 0.083 0.070 0.066 0.091 0.079 0.142 0.110 0.199 0.140 p = 2; q = 2 0.174 0.114 0.103 0.083 0.074 0.065 0.103 0.079 0.185 0.113 0.275 0.147 2 0.218 0.114 0.074 0.126 0.261 0.421 0.293 0.137 0.086 0.147 0.328 0.542 0.349 0.149 0.092 0.172 0.415 0.637 F1 [0] F1 [q] p = 3; q = 0 0.123 0.123 0.093 0.093 0.070 0.070 0.086 0.086 0.122 0.122 0.161 0.161 p = 3; q = 1 0.166 0.134 0.109 0.100 0.079 0.075 0.102 0.088 0.167 0.127 0.243 0.170 p = 3; q = 2 0.213 0.137 0.116 0.095 0.084 0.074 0.121 0.092 0.230 0.129 0.338 0.173 Table 8: Empirical size of the nominal 5% 2 test, the F1 [0] test and the F1 [q] test based on the Parzen kernel LRV estimator for the MA(1) case with T = 200, number of joint hypotheses p and number of overidentifying restrictions q 2 -0.8 -0.5 0.0 0.5 0.8 0.9 0.065 0.065 0.059 0.070 0.071 0.069 -0.8 -0.5 0.0 0.5 0.8 0.9 0.076 0.074 0.063 0.077 0.081 0.075 -0.8 -0.5 0.0 0.5 0.8 0.9 0.087 0.085 0.066 0.085 0.089 0.083 F1 [0] F1 [q] p = 1; q = 0 0.064 0.064 0.065 0.065 0.061 0.061 0.066 0.066 0.066 0.066 0.066 0.066 p = 1; q = 1 0.074 0.065 0.073 0.066 0.064 0.061 0.074 0.066 0.076 0.066 0.072 0.064 p = 1; q = 2 0.083 0.069 0.083 0.069 0.067 0.061 0.082 0.066 0.083 0.065 0.081 0.066 2 0.077 0.075 0.061 0.082 0.085 0.079 0.095 0.091 0.074 0.097 0.102 0.095 0.107 0.104 0.078 0.111 0.118 0.108 F1 [0] F1 [q] p = 2; q = 0 0.068 0.068 0.067 0.067 0.058 0.058 0.066 0.066 0.066 0.066 0.065 0.065 p = 2; q = 1 0.082 0.074 0.082 0.074 0.070 0.066 0.080 0.070 0.081 0.071 0.078 0.069 p = 2; q = 2 0.092 0.073 0.088 0.072 0.074 0.065 0.092 0.072 0.094 0.072 0.090 0.071 31 2 0.098 0.093 0.074 0.104 0.111 0.101 0.118 0.113 0.086 0.118 0.126 0.114 0.135 0.131 0.092 0.137 0.145 0.132 F1 [0] F1 [q] p = 3; q = 0 0.077 0.077 0.077 0.077 0.070 0.070 0.074 0.074 0.075 0.075 0.076 0.076 p = 3; q = 1 0.093 0.085 0.093 0.086 0.079 0.075 0.089 0.080 0.090 0.078 0.089 0.080 p = 3; q = 2 0.108 0.089 0.106 0.086 0.084 0.074 0.104 0.083 0.107 0.082 0.102 0.085 Table 9: Lower limits of di¤erent 95% CI’s for and data-driven smoothing parameters in the log-normal stochastic volatility model Equally-weighted Return Value-weighted Return Baseline Alternative Baseline Alternative Series 2 0.74 0.58 0.78 0.62 Series F1 [0] 0.73 0.57 0.78 0.62 Series F1 [q] 0.70 0.53 0.76 0.57 K (140) (138) (140) (140) Series 2 Bartlett 2 Bartlett F1 [0] Bartlett F1 [q] b 0.56 0.54 0.35 (.0155) 0.52 0.52 0.42 (.0157) 0.78 0.77 0.75 (0.155) 0.60 0.60 0.53 (.0155) Parzen 2 Parzen F1 [0] Parzen F1 [q] b 0.72 0.73 0.69 (.0136) 0.52 0.52 0.47 (.0139) 0.78 0.78 0.77 (.0136) 0.50 0.50 0.43 (.0136) QS 2 0.74 0.52 0.78 0.52 QS F1 [0] 0.74 0.52 0.78 0.52 QS F1 [q] 0.71 0.48 0.77 0.46 b (.0068) (.0069) (.0067) (.0067) Baseline: CI using baseline 9 moment conditions given in (22); Alternative: CI using alternative 9 moment conditions given in (23); : the CI constructed 
based on the OS HAR variance estimator and using the $\chi^2$ approximation as the reference distribution. Other row titles are similarly defined.

8 Appendix of Proofs

8.1 Technical Lemmas

Consider two stochastically bounded sequences of random variables $\{\xi_T \in \mathbb{R}^p\}$ and $\{\eta_T \in \mathbb{R}^p\}$. Given the stochastic boundedness, there exists a nondecreasing mapping $s(T): \mathbb{N} \to \mathbb{N}$ such that $\xi_{s(T)} \overset{d}{\to} \xi_\infty$ and $\eta_{s(T)} \overset{d}{\to} \eta_\infty$ for some random vectors $\xi_\infty$ and $\eta_\infty$. Let $C(\xi_\infty, \eta_\infty)$ be the set of points at which the CDFs of $\xi_\infty$ and $\eta_\infty$ are continuous. The following lemma provides a characterization of asymptotic equivalence in distribution. The proof is similar to that for the Lévy–Cramér continuity theorem and is omitted here.

Lemma 2 The following two statements are equivalent: (a) $P(\xi_{s(T)} \le x) = P(\eta_{s(T)} \le x) + o(1)$ for all $x \in C(\xi_\infty, \eta_\infty)$; (b) $\xi_{s(T)} \overset{a}{\sim} \eta_{s(T)}$.

Lemma 3 If $(\xi_{1T}, \eta_{1T})$ converges weakly to $(\xi, \eta)$ and
$$ P(\xi_{1T} \le x, \eta_{1T} \le y) = P(\xi_{2T} \le x, \eta_{2T} \le y) + o(1) \text{ as } T \to \infty \quad (24) $$
for all continuity points $(x, y)$ of the CDF of $(\xi, \eta)$, then $g(\xi_{1T}, \eta_{1T}) \overset{a}{\sim} g(\xi_{2T}, \eta_{2T})$, where $g(\cdot, \cdot)$ is continuous on a set $C_g$ such that $P((\xi, \eta) \in C_g) = 1$.

Proof of Lemma 3. Since $(\xi_{1T}, \eta_{1T})$ converges weakly, we can apply Lemma 2 with $s(T) = T$. It follows from the condition in (24) that $Ef(\xi_{1T}, \eta_{1T}) - Ef(\xi_{2T}, \eta_{2T}) \to 0$ for any bounded continuous function $f$. But $Ef(\xi_{1T}, \eta_{1T}) \to Ef(\xi, \eta)$, and so $Ef(\xi_{2T}, \eta_{2T}) \to Ef(\xi, \eta)$ for any such $f$. That is, $(\xi_{2T}, \eta_{2T})$ also converges weakly to $(\xi, \eta)$. Using the same argument as in the proof of the continuous mapping theorem, we have $Ef(g(\xi_{1T}, \eta_{1T})) - Ef(g(\xi_{2T}, \eta_{2T})) \to 0$ for any bounded continuous $f$. Therefore $g(\xi_{1T}, \eta_{1T}) \overset{a}{\sim} g(\xi_{2T}, \eta_{2T})$.

Lemma 4 Suppose $\omega_T = \xi_{T,J} + \eta_{T,J}$ and $\omega_T$ does not depend on $J$. Assume that there exist $\zeta_{T,J}$ and $\zeta_T$ such that (i) $P(\xi_{T,J} < \delta) - P(\zeta_{T,J} < \delta) = o(1)$ for each finite $J$ and each $\delta \in \mathbb{R}$ as $T \to \infty$; (ii) $P(\zeta_{T,J} < \delta) - P(\zeta_T < \delta) = o(1)$ uniformly over sufficiently large $T$ for each $\delta \in \mathbb{R}$ as $J \to \infty$; (iii) the CDF of $\zeta_T$ is equicontinuous on $\mathbb{R}$ when $T$ is sufficiently large; and (iv) for every $\tau > 0$, $\sup_T P(|\eta_{T,J}| > \tau) \to 0$ as $J \to \infty$. Then
$$ P(\omega_T < \delta) = P(\zeta_T < \delta) + o(1) \text{ for each } \delta \in \mathbb{R} \text{ as } T \to \infty. $$

Proof of Lemma 4. Let $\varepsilon > 0$ and $\delta \in \mathbb{R}$. Under condition (iii), we can find a $\tau := \tau(\varepsilon) > 0$ such that, for some integer $T_{\min} > 0$,
$$ \big| P(\zeta_T < \delta \pm \tau) - P(\zeta_T < \delta) \big| \le \varepsilon \text{ for all } T \ge T_{\min}. $$
Here $T_{\min}$ does not depend on $\delta$ or $\varepsilon$. Under condition (iv), we can find a $J_{\min} := J_{\min}(\varepsilon)$ that does not depend on $T$ such that
$$ P(|\eta_{T,J}| > \tau) \le \varepsilon \text{ for all } J \ge J_{\min} \text{ and all } T. $$
From condition (ii), we can find a $J'_{\min} \ge J_{\min}$ and a $T'_{\min} \ge T_{\min}$ such that
$$ \big| P(\zeta_{T,J} < \delta) - P(\zeta_T < \delta) \big| \le \varepsilon \text{ for all } J \ge J'_{\min} \text{ and all } T \ge T'_{\min}. $$
It follows from condition (i) that for any finite $J_0 \ge J'_{\min}$, there exists a $T''_{\min}(J_0) \ge T'_{\min}$ such that
$$ \big| P(\xi_{T,J_0} < \delta \pm \tau) - P(\zeta_{T,J_0} < \delta \pm \tau) \big| \le \varepsilon \text{ for } T \ge T''_{\min}(J_0). $$
When $T \ge T''_{\min}(J_0)$, we have
$$ P(\omega_T \le \delta) \le P(\xi_{T,J_0} \le \delta + \tau) + P(|\eta_{T,J_0}| > \tau) \le P(\zeta_{T,J_0} \le \delta + \tau) + 2\varepsilon \le P(\zeta_T < \delta + \tau) + 3\varepsilon \le P(\zeta_T < \delta) + 4\varepsilon. $$
Similarly,
$$ P(\omega_T \le \delta) \ge P(\xi_{T,J_0} \le \delta - \tau) - P(|\eta_{T,J_0}| > \tau) \ge P(\zeta_{T,J_0} \le \delta - \tau) - 2\varepsilon \ge P(\zeta_T < \delta - \tau) - 3\varepsilon \ge P(\zeta_T < \delta) - 4\varepsilon. $$
Since the above two inequalities hold for all $\varepsilon > 0$, we must have $P(\omega_T < \delta) = P(\zeta_T < \delta) + o(1)$ as $T \to \infty$.

8.2 Proof of the Main Results

Proof of Lemma 1. Part (a).
Let T r 1X Qh ;s eT (s) = T T r=1 Z 1 Qh (r; s)dr; 0 Then WT ( T ) WT ( T ) " #" T # " #" T # T T X X 1X 1X 0 0 = f (vt ; T ) eT (s)f (vs ; T ) eT (t)f (vt ; T ) f (vs ; T ) T T t=1 s=1 t=1 s=1 " #" # Z 1Z 1 T X T T X T X X 1 r 1 + Qh ; Qh ( ; r)d dr f (vt ; T )f (vs ; T )0 : T T T T 0 0 =1 r=1 t=1 s=1 Under Assumption 1, we have, for any …xed h : T T 1 XX r Qh ; T T T =1 r=1 Z 0 1Z 1 0 34 Qh ( ; r)d dr = O 1 T : Hence WT ( T) WT ( T 1 X p f (vt ; T t=1 T) T) T 1 X eT (t)f (vt ; + p T t=1 T 1 X p eT (s)f (vs ; T s=1 T) T) T 1 X p f (vs ; T s=1 2 T 1 1 X f (vt ; T ) + O( ) p T T t=1 T) : (25) We proceed to evaluate the orders of the terms in above upper bound. First, for some between T and 0 ; we have T 1 X p f (vt ; T t=1 T) T 1 X =p f (vt ; T t=1 0) + GT T p T T 0 = Op (1) T (26) by Assumptions 4 and 5. Here T in the matrix GT T can be di¤erent for di¤erent rows. For notational economy, we do not make this explicit. We follow this practice in the rest of the proof. Next, using summation by parts, we have T 1 X p eT (t)f (vt ; T t=1 T) = T 1 X p eT (t)f (vt ; T t=1 0) + T 1 X eT (t)f (vt ; 0 ) + = p T t=1 p +eT (T )GT ( T ) T T T @f (vt ; 1X eT (t) T @ 0 t=1 T 1 X [eT (t) T) p T eT (t + 1)] Gt 0 T T p T 0 : p Assumption 1 implies that eT (t) = O(1=T ) uniformly over t: So eT (T )GT ( T ) T Op (1=T ) : For any conformable vector a; 0 1 # " T 1 1 X X X 1 1 @ 1 eT (t)a0 f (vt ; 0 ) a0 j a var p k j kA kak2 = O 2 2 T T T t=1 by Assumption 3. So T T 1 X p eT (t)f (vt ; T t=1 j= 1 1=2 PT T) = j= 1 t=1 eT (t)f (vt ; 0 ) T X1 [eT (t) 0 T t=1 T 0 = 1 T2 = Op (1=T ): It then follows that eT (t + 1)] Gt t=1 35 T p T T 0 + Op 1 T : (27) Letting T X1 = = = t=1 T X1 t=1 T X1 t=1 T X1 = Gt ( t t T G; T) we can write [eT (t) eT (t + 1)] Gt [eT (t) eT (t + 1)] [eT (t) [eT (t) eT (t + 1)] eT (t + 1)] t t t p T p p p T T 0 T T 0 T T 0 T 0 T T 1 1 X [eT (t) T + t=1 T X 1 T + p eT (t + 1)] tG T p eT (t)G T T 0 0 T p eT (T )G T 0 T t=1 1 T + Op t=1 : (28) Given that supt k t k = op (1) under Assumption 4 and eT ( ) is piecewise monotonic under Asp P sumption 1, we can easily show that Tt=11 [eT (t) eT (t + 1)] t T T 0 = op (1=T ) : Combining this with (27) and (28) yields 1=2 T T X eT (t)f (vt ; T) = Op (1=T ) : (29) t=1 Part (a) now follows from (25), (26) and (29). Part (b). 
For some T between T and 0 ; we have WT ( T) = T T 1 XX Qh T t ; T T t=1 =1 ut + @f (vt ; @ 0 T) 0 T u + @f (v ; @ 0 T) 0 0 T = WT ( 0 ) + I1 + I10 + I2 ; (30) where T T 1 XX Qh I1 = T I2 = 1 T2 =1 t=1 T X T X t ; T T Qh t=1 =1 @f (vt ; @ 0 t ; T T T) @f (vt ; @ 0 0 T T) T u0 ; 0 @f (v ; @ 0 T) 0 T 0 For a sequence of vectors or matrices fCt g ; de…ne S0 (C) = 0 and St (C) = T Using summation by parts, we obtain, for two sequence of matrices fAt g and fBt g : T T 1 XX Qh T2 = + + =1 t=1 T X1 TX1 =1 t=1 T X1 Qh t=1 T X1 =1 t ; T T rQh t ;1 T Qh 1; T 1 : Pt At B 0 t ; T T Qh St (A) S 0 (B) t+1 ;1 T Qh 1; +1 T St (A) ST0 (B) ST (A) S 0 (B) + Qh (1; 1) ST (A) ST0 (B) : 36 =1 C : where rQh t ; T T Plugging At = Gt ( T) t TG t ; T T := Qh @f (vt ; @ 0 T) t+1 ; T T Qh +1 T t+1 ; T + Qh +1 T and Bt = ut into the above expression and letting 0 T t ; T Qh : t = as before, we obtain a new representation of I1 : I1 = T +T +T T X1 TX1 =1 t=1 T X1 t ;1 T Qh t=1 T X1 t ; T T rQh Qh 1; =T +T +T T X1 TX1 =1 t=1 T X1 t ;1 T Qh 1; + T Qh (1; 1) T ST0 0 T (G + t 0 T T) T 0 ST0 (u) S 0 (u) S 0 (u) 0 T t+1 ;1 T t T 0 ST0 (u) +1 T T T 0 S 0 (u) Qh 1; T =1 t Qh t G+ T ST0 (u) 0 T t ; T T rQh Qh t=1 T X1 T) S 0 (u) 0 T +1 T Qh 1; + T Qh (1; 1) (G + t t+1 ;1 T Qh T =1 t G+ T (u) + G 0 T T T 1 XX Qh T =1 t=1 := I11 + I12 + I13 + I14 + I15 : t ; T T u0 We now show that each of the …ve terms in the above expression is op (1) : It is easy to see that I15 = G 0 T p = G T T p = G T T p = G T T T T 1 XX t Qh ; u0 T T T =1 t=1 " # T T 1 X 1X t Qh ; u0 0 p T T T T =1 t=1 Z 1 T X 1 1 Qh t; dt + O 0 p T T T 0 1 p T 0 I14 = T Qh (1; 1) = Qh (1; 1) T T p =1 T X =1 0 T T O T 37 1 T u0 = op (1) ; ST0 (u) p 0 T ST (u) = op (1) ; 0 u0 and I13 = T T X1 Qh 1; =1 = hp T + T hp T T T i 0 T +1 T Qh 1; 0 T = op (1) : T 0 T " T 1 X p Qh 1; u T T =1 i hp i Qh (1; 1) T ST (u) S 0 (u) # To show that I12 = op (1), we note that under the piecewise monotonicity condition in Assumption 1, Qh (t=T; 1) is piecewise monotonic in t=T: So for some …nite J we can partition the set f1; 2; :::; T g into J maximal non-overlapping subsets [Jj=1 Ij such that Qh (t=T; 1) is monotonic on each Ij := fIjL ; :::; IjU g : Now T X1 t ;1 T Qh t=1 J X X t ;1 T Qh j=1 t2Ij J X X Qh t ;1 T X Qh j=1 t2Ij = J X ( )j j=1 = t2Ij J X Qh j=1 t+1 ;1 T Qh t+1 ;1 T Qh t+1 ;1 T Qh Qh IjL ;1 T + op (1) t t+1 ;1 T Qh IjU ;1 T t sup k s k + op (1) s t ;1 T sup k s k + op (1) s sup k s k + op (1) s = O (1) sup k s k + op (1) = op (1) s where the op (1) term in the …rst inequality re‡ects the case when t and t + 1 belong to di¤erent partitions and “( )j ” takes “+” or “ ” depending on whether Qh (t=T; 1) is increasing or decreasing on the interval [IjL ; IjU ]: As a result, kI12 k = T T X1 t=1 Qh T X1 t=1 t ;1 T Qh t ;1 T Qh Qh t+1 ;1 T t+1 ;1 T t p T t 0 T 0 T ST0 (u) p T ST0 (u) = op (1) : (31) To show that I11 = op (1) ; we use a similar argument. 
Under the piecewise monotonicity condition, we can partition the set of lattice points f(i; j) : i; j = 1; 2; :::; T g into a …nite number of maximal non-overlapping subsets I` = f(i; j) : I`1L i I`1U ; I`2L 38 j I`2U g for ` = 1; :::; `max such that rQh (i=T; j=T ) have the same sign for all (i; j) 2 I` : Now kI11 k = = T X1 TX1 =1 t=1 `X max X `X max ( )` `=1 (i;j)2I` = `=1 = `X max (i;j)2I` Qh `=1 i j ; T T rQh X t ; T T rQh i t hp T i j ; T T rQh I`1L I`2L ; T T hp T T T 0 0 i hp p sup k t k i hp i T Sj0 (u) T I`1U I`2L ; T T sup 0 T t Qh i T S 0 (u) I`1L I`2U ; T T Qh p + Qh T S 0 (u) I`1U I`2U ; T T op (1) = op (1) : We have therefore proved that I1 = op (1). We can use similar arguments to show that I2 = op (1). Details are omitted. This completes the proof of Part (b). Parts (c) and (d). We start by establishing the series representation of the weighting function as given in (4). The representation trivially holds with j ( ) = j ( ) for the OS-HAR variance estimator. It remains to show that it also holds for the contracted kernel kb ( ) and the exponentiated kernel k ( ) : We focus on the former as the proof for the latter is similar. Under Assumption 1(a), kb (x) has a Fourier cosine series expansion kb (x) = c + 1 X ~ j cos j x (32) j=1 where c is the constant term in the expansion, uniformly over x 2 [ 1; 1]: This implies that kb (r s) = c + 1 X P1 j=1 ~ j < 1, and the right hand side converges ~ j cos j r cos j s + j=1 1 X ~ j sin j r sin j s j=1 and the right hand side converges uniformly over (r; s) 2 [0; 1] [0; 1]: Now Z 1 Z 1 Z 1Z 1 kh (r; s) = kb (r; s) kb ( s) d kb (r )d + kb ( 1 0 = 1 X := 0 ~ j cos j r cos j s + j=1 1 X 1 X ~ j sin j r j (r) j 2) d 1d 2 0 1 sin j d sin j s 0 j=1 j Z 0 0 (s) j=1 by taking j (r) = cos sin 1 2 1 2 jr ; (j + 1) r R1 0 39 sin 1 2 Z (j + 1) j is even d ; j is odd (33) 1 sin j d and ~ j=2 ; j is even, ~ (j+1)=2 ; j is odd. R1 Obviously j (r) is continuously di¤erentiable and satis…es j (r) dr = 0 for all j = 1; : : : : The 0 P uniform convergence of 1 (r) (s) to k (r; s) inherits from the uniform convergence of j j j j=1 h the Fourier series in (32). a In view of part (b), it su¢ ces to show that WT ( 0 ) s WeT : Using the Cramér-Wold device, it su¢ ces to show that a tr [WT ( 0 )A] s tr [WeT A] j = for any conformable symmetry matrix A. Note that T T 1 XX Qh tr [WT ( 0 )A] = T tr (WeT A) = 1 T t=1 =1 T X T X Qh t=1 =1 t ; T T u0t Au ; t ; T T ( et )0 A ( et ) : We write tr [WT ( 0 )A] as tr [WT ( 0 )A] = 1 X j j=1 := T;J + T 1 X p T t=1 j t T ut !0 A T 1 X p T t=1 t T j ut ! T;J where T;J = J X j j=1 T;J = " T 1 X p j T t=1 " T 1 X j p T t=1 1 X j=J+1 t T j #0 " T 1 X ut A p j T t=1 #0 " T t 1 X ut A p T T t=1 # t T ut ; # t T j ut : We proceed to apply Lemma 4 by verifying the four conditions. First, it follows from Assumption 5 that for every J …xed, " #0 " # J T T X X X 1 t 1 t a (34) et A p et j p j j T;J s T;J := T T T t=1 T t=1 j=1 as T ! 1: Second, let T = 1 X j=1 j " T 1 X p T t=1 j t T et = tr (WeT A) and A = Pm `=1 0 ` a` a` #0 " T 1 X A p T t=1 be the spectral decomposition of A: Then " 1 m T X X 1 X t p = j j T ` T;J T T t=1 j=J+1 `=1 40 j t T a0` et et #2 : # So E T 1 X C T;J j=J+1 j jj T 1X T t T 2 j t=1 C 1 X j=J+1 j jj for some constant C that does not depend on T: That is, E T T;J ! 0 uniformly in T as J ! 
1: So for any > 0 and " > 0, there exists a J0 that does not depend on T such that P for all J that T E T;J T T;J = " J0 and all T: It then follows that there exists a T0 which does not depend on J0 such P T;J < =P T + P( T < + )+P T P( T < + )+" P( T T;J < T;J T < ) + 2" for J J0 and T T0 : Here we have used the equicontinuity of the CDF of T when T is su¢ ciently large. This is veri…ed below. Similarly, we can show that for J J0 and T T0 we have P T;J < P ( T < ) 2": Since the above two inequalities hold for all " > 0; it must be true that P T;J < P( T < )= o(1) uniformly over su¢ ciently large T as J ! 1: Third, we can use Lemma 4 to show that T converges in distribution to 1 where 1 = = 1 X j 0 0 j=1 Z 1Z 1 Z j (r) dBm (r) A 0 Z 1 j (s) dBm (s) 0 Qh (r; s) [ dBm (r)]0 A dBm (s) : Details are omitted here. So for any P( 0 1 T and ; we have < + )=P( 1 < + ) + o (1) as T ! 1: Given the continuity of the CDF of 1 ; for any " > 0; we can …nd a > 0 such that P( " for all T when T is su¢ ciently large. We have thus veri…ed Condition T < + ) (iii) in Lemma 4. Finally, for every > 0, we have 0 1 " #2 1 m T X X X 1 t j jj j `j p P P@ a0` ut > A j T;J > T T t=1 j=J+1 `=1 " #2 m 1 T X X 1 X t 1 0 j `j E j jj p a` ut : j T T t=1 j=J+1 `=1 41 But by Assumption 3, " 1 T X 1 X E j jj p T t=1 j=J+1 = 1 X j=J+1 C j jj 1 X j=J+1 T 1X T t T j 1 X j=J+1 #2 t T E a0` ut j jj 1X 0 a` T 2 j t=1 j jj + a0` ut 2 + t 1 X t s 1X j j T T T j=J+1 t6=s 0 1 1 X j j jA = o(1) s a` = O @ j jj j=J+1 t6=s uniformly over T; as J ! 1: So we have proved that supT P Combining the above steps, we have a tr [WT ( 0 )A] s and d tr (WeT A) ! As a consequence, Z 0 1Z 1 0 Ea0` ut u0s a` T T;J > = o (1) as J ! 1: = tr (WeT A) ; Qh (r; s) [ dBm (r)]0 A dBm (s) = tr(W1 A): a d a d WT ( T) s WeT ! W1 : WT ( T) s WeT ! W1 : In view of Part (a), we also have Proof of Theorem 1. We prove part (b) only as the proofs for other parts are similar. It su¢ ces d 0 to show that F1 = Bp (1) Cpq Cqq1 Bq (1) Dpp1 Bp (1) which is an m d matrix, then F1 = h i ~ 11 G R G0 W h i ~ 11 G R G0 W 1 1 ~ 11 Bm (1) G0 W 0 Cpq Cqq1 Bq (1) =p: Let G = h i ~ 11 G R G0 W 1 1 G, 1 R0 ~ 11 Bm (1) =p: G0 W Let Um m m d Vd0 d be a singular value decomposition (SVD) of G : By de…nition, U 0 U = U U 0 = Im , V V 0 = V 0 V = Id and Ad d = ; Oq d where A is a diagonal matrix with singular values on the main diagonal and O is a matrix of zeros. Then h i 1 h i 1 ~ 11 G ~ 11 Bm (1) = R V 0 U 0 W ~ 11 U V 0 ~ 11 Bm (1) G0 W R G0 W V 0U 0W h i 1 h i h i 1 0 0 ~ 1 0 0 ~ 1 0 0 ~ 1 0 0 ~ 1 = RV U W1 U U W1 Bm (1) = RV U W1 U U W1 U U 0 Bm (1) h i 1 d 0 ~ 1 0 ~ 1 = RV W1 W1 Bm (1); 42 and h i ~ 11 G R G0 W 1 h R0 = R V 0 ~ 11 U V 0 U 0W i 1 d R0 = RV h 0 i ~ 11 W 1 V 0 R0 ; ~ 1 ; Bm (1)] and [U 0 W ~ 1 U; U 0 Bm (1)]: Let where we have used the distributional equivalence of [W 1 1 C11 C12 C21 C22 ~1 = W C 11 C 12 C 21 C 22 ~ 1= and W 1 0 ; C 12 = (C 21 )0 . where C11 and C 11 are d d matrices, C22 and C 22 are q q matrices, and C12 = C21 By de…nition Z 1Z 1 Cpp Cp;d p C11 = Qh (r; s)dBd (r)dBd (s)0 = (35) 0 C C d p;d p p;d p 0 0 Z 1Z 1 Cpq C12 = Qh (r; s)dBd (r)dBq (s)0 = (36) C d p;q 0 0 Z 1Z 1 Qh (r; s)dBq (r)dBq (s)0 = Cqq (37) C22 = 0 0 where Cpp ; Cpq ; and Cqq are de…ned in (7); and Cd p;d It follows from the partitioned inverse formula that 1 C12 C221 C21 C 11 = C11 p; Cp;d , C 12 = and Cd p p;q are similarly de…ned. 
C 11 C12 C221 : ~ 11 ; we have With the above partition of W h 0 i ~ 11 W 1 C 11 C 12 C 21 C 22 A0 O 0 = 1 = A0 C 11 A =A 1 1 A O 1 C 11 1 A0 and so h RV ~ 1 W 1 i 1 0 ~ 1 W 1 = RV A 1 C 11 1 A0 1 = RV A 1 C 11 1 A0 1 A0 = RV A 1 Id ; C 11 1 C 12 and RV As a result, 0 h 0 ~ 1 W 1 i 1 C 11 C 12 C 21 C 22 A0 O0 V 0 R0 = RV A C 11 C 12 ; 1 (38) C 11 h i0 d F1 = RV A 1 Id ; C 11 1 C 12 Bm (1) RV A i h RV A 1 Id ; C 11 1 C 12 Bm (1) =p 0 = RV A 1 Id ; C12 C221 Bm (1) RV A 1 Id ; C12 C221 Bm (1) =p: RV A 43 1 C 11 1 1 A0 1 C 11 1 A0 V 0 R0 : 1 A0 1 V 0 R0 1 V 0 R0 1 1 ~p Let Bm (1) = [Bd0 (1) ; Bq0 (1)]0 and U ~p d p ~p ~0 d Vd d A~p = p; 1; be a SVD of RV A ~p O (d p) where : Then n d ~ ~ V~ 0 Bd (1) F1 = U ~ ~ V~ 0 Bd (1) U n = ~ V~ 0 Bd (1) C12 C221 Bq (1) o0 h 1 ~ ~ V~ 0 C 11 U C12 C221 Bq (1) =p o0 h ~ V~ 0 C 11 C12 C221 Bq (1) 1 V~ ~ 0 ~ V~ 0 Bd (1) C12 C 1 Bq (1) =p 22 n h io0 h ~ V~ 0 C 11 = ~ V~ 0 Bd (1) V~ 0 C12 C221 Bq (1) i h ~ V~ 0 Bd (1) V~ 0 C12 C 1 Bq (1) : 22 1 i ~0 V~ ~ 0 U d n ~ Bd (1) ~ Bd (1) = h C12 C221 Bq (1) C12 C221 Bq (1) =p ~ O ~ A; Bd (1) ~ O ~ A; C 11 ~ O ~ A; Bd (1) 1 o0 h ~ C 11 V~ ~ 0 C12 C221 Bq (1) i 1 ~ O ~ 0 A; C12 C221 Bq (1) 1 1 1 i h Noting that the joint distribution of V~ 0 Bd (1) ; V~ 0 C12 ; C22 ; V~ 0 C 11 orthonormal matrix V~ ; we have F1 = i 1 ~0 i 1 V~ i is invariant to the 1 0 =p: Writing C 11 1 = C11 Dpp D12 D21 D22 C12 C221 C21 = 0 and D 22 is a (d p) (d p) matrix and using equations (35)–(37), where Dpp = Cpp Cpq Cqq1 Cpq we have 0 d F1 = Bp (1) Cpq Cqq1 Bq (1) Dpp1 Bp (1) Cpq Cqq1 Bq (1) =p: Proof of Theorem 2. Let T be the p 1 vector of Lagrange multipliers for the constrained GMM estimation. The …rst order conditions for ^T;R are @gT0 ^T;R @ WT 1 (~T )gT ^T;R + R0 T = 0 and R^T;R = r: (39) Linearizing the …rst set of conditions and using Assumption 4, we have the following system of equations: ! p p ~ T ^T;R R0 G0 WT 1 (~T ) T gT ( 0 ) 0 = + op (1) p R 0p p 0p 1 T T 44 where ~ := ~ (~T ) = G0 WT 1 (~T )G. From this, we get p T ^T;R and p = 0 T 1 n R~ = T p G0 WT 1 (~T ) T gT ( 0 ) n o 1 p ~ 1 R0 R ~ 1 R0 R ~ 1 G0 WT 1 (~T ) T gT ( 0 ) + op (1) ~ 1 R0 Combining (5) with (40), we have p T ^T;R which implies that ^T p ~ = T ^T;R 1 ^T o 1 n R0 R ~ R~ 1 1 R0 p G0 WT 1 (~T ) T gT ( 0 ) + op (1) : o 1 R~ 1 (40) (41) p G0 WT 1 (~T ) T gT ( 0 ) + op (1) (42) = Op (1) : So gT ^T;R = gT ^T + GT (^T ) ^T;R p ^T + op 1= T : Plugging this into the de…nition of DT ; we obtain DT = T ^T;R ^T 0 G0T (^T )WT 1 ( 2T gT0 ^T WT 1 ( ^ T )G( T ) ^ T )GT ( T ) ^T;R ^T;R ^T =p ^T =p + op (1) (43) Using the …rst order conditions for ^T : gT0 ^T WT 1 (~T )GT (^T ) = 0; and Lemma 1(a), we obtain 0 (44) DT = T ^T;R ^T ~ ^T;R ^T =p + op (1) : Plugging (42) into (44) and simplifying the resulting expression, we have n DT = R 0 R ~ ~ 1 R0 n R R~ 1 0 o 1 1 R 0 R~ o 1 1 p G0 WT 1 (~T ) T gT ( 0 ) R~ 1 0 p G0 WT 1 (~T ) T gT ( 0 ) =p + op (1) i 1h h i0 h i p p R ~ 1 G0 WT 1 (~T ) T gT ( 0 ) =p + op (1) = R ~ 1 G0 WT 1 (~T ) T gT ( 0 ) R ~ 1 R0 hp i0 h i 1 hp i 1 0 ^ ~ = T R ^T R R T R =p + op (1) 0 0 T = WT + op (1) : (45) Next, we prove the second result in the theorem. In view of the …rst order conditions in (39) and equation (41), we have p T T @gT0 ^T;R p p 0 WT 1 (~T ) T gT ^T;R + op (1) = TR @ n o 1 p = R0 R ~ 1 R0 R ~ 1 G0 WT 1 (~T ) T gT ( 0 ) + op (1) p = ~ T ^T ^T;R + op (1) ; ^T;R = 45 T + op (1) and so ST = T ^T ^T;R 0 ~ ^T ^T;R =p + op (1) = DT + op (1) = WT + op (1) : Proof of Theorem 3. Part (a). Let H= Bp (1) Cpq Cqq1 Bq (1) Bp (1) Cpq Cqq1 Bq (1) ;H ! 
be an orthonormal matrix, then d Cpq Cqq1 Bq F1 = Bp (1) " H 0 Dpp1 HH 0 (1) " Bp (1) Cpq Cqq1 Bq (1) Bp (1) Cpq Cqq1 Bq (1) Bp (1) Cpq Cqq1 Bq (1) Bp (1) Cpq Cqq1 Bq (1) 2 0 ep Cpq Cqq1 Bq (1) = Bp (1) 2 # #0 H =p H 0 Dpp1 H ep =p; where ep = (1; 0; 0; :::; 0)0 2 Rp : But H 0 Dpp1 H has the same distribution as Dpp1 and Dpp is independent of Bp (1) Cpq Cqq1 Bq (1) : So d F1 = Bp (1) Cpq Cqq1 Bq (1) 2 =p 1 e0p Dpp1 ep : (46) That is, F1 is equal in distribution to a ratio of two independent random variables. It is easy to see that e0p Dpp1 ep 1 d = " Z e0p+q 0 1Z 1 0 1 0 Qh (r; s) dBp+q (r) dBp+q (s) ep+q # 1 d = 1 K 2 K p q+1 where ep+q = (1; 0; :::; 0)0 2 Rp+q : With this, we can now represent F1 as 2 p d F1 = and so 1 d F1 = 2 p 2 2 K p q+1 = (K 2 =p (47) 2 K p q+1 =K =p p d q + 1) = Fp;K p q+1 2 : (48) 1F Part (b). Since the numerator and the denominator in (48) are independent, 1 is 2 distributed as a noncentral F distribution, conditional on : More speci…cally, we have P 1 F1 < z = P Fp;K p q+1 46 2 < z = EFp;K p q+1 z; 2 where Fp;K p q+1 (z; ) is the CDF of the noncentral F distribution with degrees of freedom (p; K p q + 1) and noncentrality parameter ; and Fp;K p q+1 ( ) is a random variable with CDF Fp;K p q+1 (z; ) : We proceed to compute the mean of 2 . Let Z 1 Z 1 j (r) dBp (r) s iidN (0; Ip ) and j = j (r) dBq (r) s iidN (0; Iq ): j = 0 Note that j 0 are independent of Cpq = K j 1 : We can represent Cpq and Cqq as K X 0 j j and Cqq = K 1 j=1 K X 0 j j: (49) j=1 So 2 E 0 0 = EBq (1)0 Cqq1 Cpq Cpq Cqq1 Bq (1) = Etr Cqq1 Cpq Cpq Cqq1 1 10 10 10 20 K K K K X X X X 1 1 0A 0A@ 1 0A@ 1 @ = Etr 4@ j j j j j j K K K K j=1 j=1 j=1 j=1 = ptrE ( ) ; where := PK j=1 0 j j 1 1 0A j j 13 5 follows an inverse-Wishart distribution and E( )= Iq K q 1 for K large enough. Therefore E 2 = K pqq 1 = 2 : Next we compute the variance of 2 : It follows from the law of total variance that var 2 = E var 2 0 jCqq1 Cpq Cpq Cqq1 + var 0 tr Cqq1 Cpq Cpq Cqq1 : 0 C C 1 : So conditional on C 1 C 0 C C 1 ; Note that Bq (1) is independent of Cqq1 Cpq pq qq qq pq pq qq quadratic form in standard normals. Hence the conditional variance of 2 is var 2 0 0 0 jCqq1 Cpq Cpq Cqq1 = 2tr Cqq1 Cpq Cpq Cqq1 Cqq1 Cpq Cpq Cqq1 : Using the representation in (49), we have 0 0 Etr Cqq1 Cpq Cpq Cqq1 Cqq1 Cpq Cpq Cqq1 0 10 1 0 K K K X X X 1 0A@ 1 0A 2@ 1 @ = Etr Cqq j j j j K K K j=1 j=1 j=1 2 K K K X K X 1 XX 0 0 2 4 = Etr j1 j2 j1 i1 i1 Cqq K4 j1 =1 i1 =1 j2 =1 i2 =1 2 K K K K 1 XXXX 0 2 0 2 = Etr 4 4 j1 i1 Cqq j2 i2 Cqq E K j1 =1 i1 =1 j2 =1 i2 =1 47 j 10 K X 0A@ 1 j K 0 j2 i2 0 j1 i1 j=1 1 0A Cqq2 j j 3 0 25 i2 Cqq 0 j2 i2 3 5: 2 is a Since 0 j1 i1 E 0 j2 i2 we have 8 2 p ; > > > > < p; p = > > p2 + 2p; > > : 0; j1 = i1 ; j2 = i2 ; and j1 6= j2 ; j1 = j2 ; i1 = i2 ; and j1 = 6 i1 ; j1 = i2 ; i1 = j2 ; and j1 6= i1 ; j1 = j2 = i1 = i2 ; otherwise, 0 0 Etr Cqq1 Cpq Cpq Cqq1 Cqq1 Cpq Cpq Cqq1 2 3 3 2 K X K K X K X X 1 1 0 2 0 25 2 0 2 0 25 = Etr 4 4 p + Etr 4 4 p j1 j1 Cqq j2 j2 Cqq j1 i1 Cqq j1 i1 Cqq K K j1 =1 j2 =1 j1 =1 i1 =1 3 2 K X K X 1 0 2 0 25 p + Etr 4 4 j1 i1 Cqq i1 j1 Cqq K j1 =1 i1 =1 3 " 0 1 2 2 # K K K X X X 1 1 0 0 2 25 0 A tr = Etr @ p2 + p + Etr 4 2 j1 j1 Cqq ` 1 `1 i1 Cqq i1 p K K2 j1 =1 `1 =1 i1 =1 2 = E tr 2 p2 + p + E [tr ( )] p 0 1 q X q q X X =@ E 2ij A p2 + p + E i=1 j=1 ii i=1 !2 p: It follows from Theorem 5.2.2 of Press (2005, pp. 
119, using the notation here) that E 2 ij = (K (K q + 1) q) (K ij + (K q 2 q 1) (K 1) q 3) + ij [K = O( 2 q 1] 1 ) K2 and for i 6= j; E ii jj = 2 (K q) (K q 2 1) (K q 3) + [K 1 q Hence 0 0 Etr Cqq1 Cpq Cpq Cqq1 Cqq1 Cpq Cpq Cqq1 = O( 48 2 1] 1 ): K2 = O( 1 ): K2 (50) Next var 0 tr Cqq1 Cpq Cpq Cqq1 0 0 E tr Cqq1 Cpq Cpq Cqq1 tr Cqq1 Cpq Cpq Cqq1 20 10 1 3 20 K K K X X 1 1 1 X 0A@ 0A 25 4 @ = Etr 4@ C tr j j j j qq K K K j=1 j=1 j=1 2 3 2 K K K K 1 XX 1 XX 0 0 25 4 = Etr 4 C tr i1 i2 i1 j1 j1 qq K K i1 =1 j1 =1 =E 10 K X 0A@ 1 j K j j=1 0 2 i1 j1 Cqq 0 2 i2 j2 Cqq tr E 3 0A Cqq2 5 j j 3 0 i2 j2 0 25 j2 Cqq 0 i1 j1 0 i2 j2 i2 =1 j2 =1 K K K K 1 XXXX tr K2 1 i1 =1 j1 =1 i2 =1 j2 =1 = E [tr ( )]2 p2 + E K K 1 XX tr Cqq2 K2 0 2 0 j1 j1 Cqq i1 i1 2p i1 =1 j1 =1 2 2 2 = E [tr ( )] p + Etr 2p: Using the same formulae from Press (2005), we can show that the last term is of order O K This, combined with (50), leads to var 2 = O(K 2 ): Taking a Taylor expansion and using the mean and variance of 2 ; we have P 1 F1 < z = EFp;K = EFp;K p q+1 p q+1 : 2 z; z; 2 2 +E @Fp;K p q+1 z; @ 2 Fp;K 2 2 2 @ +E p q+1 @ z; ~ 2 2 2 2 2 (51) = EFp;K p q+1 z; 2 for some ~ 2 between @ 2 Fp;K +E 2 Fp;K and p q+1 @ 2 z; ~ 2 : By de…nition, 0 p q+1 (z; 2 2 2 2 )=P@ = EGp 2( p ) p pz " " 2 K p q+1 K p q+1 # 2 K p q+1 K p q+1 # ; ! where Gp (z; ) is the CDF of the noncentral chi-square distribution 49 1 1 < zA 2( p ) with noncentrality parameter : In view of the relationship Gp (z; ) = exp @Fp;K p q+1 (z; ) @ " 1 X @ = exp @ 2 j=0 = 1 exp 2 2 1 + exp 2 2 1 1X = exp 2 j=0 2 # ( =2)j EGp+2j j! 1 X ( =2)j EGp+2j j! pz pz j=0 " " p j=0 ( =2)j j! Gp+2j (z) ; we have #! 2 K p q+1 K P1 q+1 #! 2 K p q+1 K p q+1 " #! 1 2 X ( =2)j K p q+1 EGp+2+2j pz j! K p q+1 j=0 " #! " 2 ( =2)j K p q+1 E Gp+2+2j pz 2 j! K p q+1 pz Gp+2j " 2 K p q+1 K p q+1 #!# and so @ 2 Fp;K p q+1 (z; ) @ 2 " 1 1X @ 2 @ 1 4 j=0 1 X exp ( =2)j j! 2 1 exp ( =2)j 1X + exp j! 4 2 j=0 for all z and : Combining the boundedness of @ 2 Fp;K 1 p q+1 (z; F1 < z = Fp;K p q+1 z; 2 + O var( = Fp;K p q+1 z; 2 +O 1 K2 2 ( =2)(j 1) (j 1)! 2 j=1 1 1 1 + = 4 4 2 P # 2 ) =@ with (51) yields ) = Fp;K p q+1 z; 2 +o 1 K : Part (c). It follows from Part (b) that z P (pF1 < z) = EGp = EGp +E " z 2 K p q+1 K 2 K p q+1 @2 Gp @ 2 K z ! ;0 ; ! ; 2 2 1 K +o @ + E Gp @ !# 2 K p q+1 K 2 4 z 2 K p q+1 K ;0 ! 2 +o ; where is between 0 and 2 : As in the proof of Part (b), we can show that As a result, " !# z 2K p q+1 2 @2 1 1 4 E ; = O( 2 ) = o( ): 2 Gp K K K @ 50 1 K (52) @2 G (z; @ 2 p ) 1. (53) Consequently, P (pF1 < z) = EGp z 2 K p q+1 K ! ;0 z @ + E Gp @ 2 K p q+1 K ;0 ! 2 + o( 1 ): K By direct calculations, it is easy to show that @ Gp (z; 0) = @ 1 [Gp (z) 2 Gp+2 (z)] = 1 z p=2 e p 2p=2 z=2 p 2 1 0 G (z) z: p p = (54) Therefore, P (pF1 < z) = EGp z 2 K p q+1 K ! 2 K p q+1 = Gp (z) + Gp0 (z) zE K 2 K p q+1 K ! 1 ! z 2 K p q+1 2 K +o K q+1 K p q+1 + Gp00 (z) z 2 K K2 K p + 2q 1 1 1 + Gp00 (z) z 2 + o : K K K Gp0 (z) z 1 K ! 2 K p q+1 1 + Gp00 (z) z 2 var 2 p = Gp (z) + Gp0 (z) z = Gp (z) z 1 EGp0 p q q 1 2 p Gp0 (z) z + o 1 K Gp0 (z) z Proof of Theorem 4. Part (a). Using the same argument for proving Theorem 3(a), we have d t1 = and so B1 (1) C1q Cqq1 Bq (1) q 2 K q =K t d B1 (1) p1 = q C1q Cqq1 Bq (1) 2 K q = (K (55) d = tK q ( ): q) Part (b). Since the distribution of t1 is symmetric about 0; we have for any z 2 R+ : P t p1 < z p 1 1 1 1 + P (jt1 = j < jzj) = + P (t21 = < z 2 ) 2 2 2 2 1 1 1 2 2 = + F1;K q z ; + o( ) 2 2 K = where the second last equality follows from Theorem 3(b). 
When z 2 R ; we have P t p1 < z 1 2 1 = 2 = p 1 1 1 P (jt1 = j < jzj) = P (t21 = < z 2 ) 2 2 2 1 1 F1;K q z 2 ; 2 + o( ): 2 K 51 Therefore P t p1 < z 1 1 + sgn(z)F1;K 2 2 = q z2; 2 + o( 1 ): K Part (c). Using Theorem 3(c) and the symmetry of the distribution of t1 about 0; we have for any z 2 R+ : 1 1 1 1 + P (jt1 j < jzj) = + P (t21 < z 2 ) 2 2 2 2 1 1 q 1 1 = + G z2 G 0 z2 z2 + G 00 z 2 z 4 + o 2 2 K 2 K P (t1 < z) = 1 K : Using the relationships that 1 1 + G z2 = 2 2 and G 0 z2 z2 q K 1 1 + G 00 z 2 z 4 = 2 K we have (z); 1 (z) z 3 + z(4q + 1) ; 4K (z) 1 1 z (z) z 2 + (4q + 1) + o( ): 4K K P (t1 < z) = (z) + 1 1 z (z) z 2 + (4q + 1) + o( ): 4K K P (t1 < z) = (z) 1 1 jzj (z) z 2 + (4q + 1) + o( ): 4K K P (t1 < z) = Similarly when z 2 R ; we have Therefore Before proving Theorem 5, we present a technical lemma. Part (i) of the lemma is proved in Sun (2014). Parts (ii) and (iii) of the lemma are proved in Sun, Phillips and Jin (2011) [1 k (x)] =xq0 , q0 is the Parzen exponent of the kernel function, c1 = R 1 De…ne g0 = limRx!0 1 2 1 k (x) dx: Recall the de…nitions of 1 and 2 in (9). 1 k (x) dx; c2 = Lemma 5 (i) For conventional kernel HAR variance estimators, we have, as h ! 1; (a) 1 = 1 bc1 + O(b2 ); (b) 2 = bc2 + O(b2 ). (ii) For sharp kernel HAR variance estimators, we have, as h ! 1; 2 (a) 1 = 1 +2 ; 1 (b) 2 = +1 + O 12 : (iii) For steep kernel HAR variance estimator, we have, as h ! 1; (a) 1 =1 (b) 2 = 1=2 g0 1=2 2 g0 +O +O 1 1 ; : 52 Proof of Theorem 5. Part (a). Recall 0 Cpq Cqq1 Bq (1) Dpp1 Bp (1) pF1 = Bp (1) Cpq Cqq1 Bq (1) : Conditional on (Cpq ; Cqq ; Cpp ) ; Bp (1) Cpq Cqq1 Bq (1) is normal with mean zero and variance Ip + 0 : Let L be the lower triangular matrix such that LL0 is the Choleski decomposition Cpq Cqq1 Cqq1 Cpq 0 : Then the conditional distribution of of Ip + Cpq Cqq1 Cqq1 Cpq := L 1 Bp (1) Cpq Cqq1 Bq (1) is N (0; Ip ): Since the conditional distribution does not depend on (Cpq ; Cqq ; Cpp ) ; we can conclude that is independent of (Cpq ; Cqq ; Cpp ) : So we can write d 0 pF1 = A where A = L0 Dpp1 L. Given that A is a function of (Cpq ; Cqq ; Cpp ) ; we know that and A are d independent. As a result, 0 A = 0 (OAO0 ) for any orthonormal matrix O that is independent of : ~ be an orthonormal matrix with …rst column = k k : We choose H to Let H = = k k ; H be independent of A: This is possible as and A are independent. Then 0 d pF1 = k k2 OAO0 k k k k = k k2 =k 0 k k k2 e0p H H 0 OAO0 H H 0 H 0 OAO0 H ep k k where ep = (1; 0; :::; 0)0 is the …rst basis vector in Rp : Since k k2 ; H and A are mutually independent from each other, we can write d pF1 = k k2 e0p H0 O0 AOH ep for any orthonormal matrix H that is independent of both d pF1 = k k 2 e0p Aep = k k2 d 1 e0p Aep = and A: Letting O = H0 ; we obtain: k k2 e0p L0 Dpp1 Lep 1: Since Lep is the …rst column of L; we have, using the de…nition of the Choleski decomposition: 0 ep Ip + Cpq Cqq1 Cqq1 Cpq Lep = q : 0 e0p Ip + Cpq Cqq1 Cqq1 Cpq ep As a result, d pF1 = k k2 2 where 2 = e0p L0 Dpp1 Lep 1 = 0 e0p Ip + Cpq Cqq1 Cqq1 Cpq ep 0 0 e0p Ip + Cpq Cqq1 Cqq1 Cpq Dpp1 Ip + Cpq Cqq1 Cqq1 Cpq ep Part (b). It is easy to show that ECqq = 1 Iq and var [vec (Cqq )] = 53 2 (Iqq + Kqq ) : where 1 and 2 are de…ned in (9), Iqq is the q 2 q 2 identity matrix, and Kqq is the q 2 q 2 commutation matrix. So Cqq = 1 Iq + op (1) and Cqq1 = 1 1 Iq + op (1) as h ! 1: Similarly, Cpp = 1 Ip + op (1) and Cpp1 = 1 1 Ip + op (1) ; as h ! 
1: In addition, using the same argument, we can show that Cpq = op (1) : Therefore 2 = 0 = e0p Ip + Cpq Cpq e0p Ip + 0 = 2 Cpq Cpq 1 Ip 2 1 ep 1 0 = 2 Cpq Cpq 1 0 = Ip + Cpq Cpq 2 1 ep (1 + op (1)) = 1 + op (1) That is, 2 !p 1 as h ! 1: We proceed to prove the distributional expansion. The (i; j)-th elements Cpp (i; j), Cpq (i; j) R1R1 R1R1 ~ and Cqq (i; j) of Cpp ; Cpq and Cqq are equal to either 0 0 Qh (r; s) dB(r)dB(s) or 0 0 Qh (r; s) dB(r)dB(s) ~ ( ) are independent standard Brownian motion processes. By direct calculawhere B( ) and B tions, we have, for any & 2 (0; 3=8); P (jCef (i; j) & 2) ECef (i; j)j > E jCef (i; j) ECef (i; j)j8 8& 2 4 2 8& 2 =O = o( 2) where e; f = p or q: De…ne the event E as E = f! : jCef (i; j) ECef (i; j)j then the complement E c of E satis…es P (E c ) = o( & 2 for all i; j and all e and f g ; 2) as h ! 1: Let E~ be another event, then = P E~ \ E + P E~ \ E c = P E~ \ E + o( P E~ ~ = P (EjE)P (E) + o( ~ = P (EjE) + o ( 2 ) : 2) ~ (1 = P (EjE) o( 2 )) 2) + o( 2) ~ That is, up to an error of order o( 2 ); P E~ ; P E~ \ E and P (EjE) are equivalent. So for the purpose of proving the theorem, it is innocuous to condition on E or remove the conditioning, if needed. Now conditioning E, the numerator of 2 satis…es: 0 e0p Ip + Cpq Cqq1 Cqq1 Cpq ep =1+ 1 0 2 ep Cpq Iq + Cqq =1+ 1 0 2 ep Cpq 1 (Iq 1 Iq + 1 1 =1+ ECqq Mqq ) (Iq Cqq ECqq 1 0 Cpq ep 1 0 Mqq ) Cpq ep 1 0 0 ~ 2 ep Cpq Mqq Cpq ep 1 where Mqq is a matrix with elements Mqq (i; j) satisfying jMqq (i; j)j = O ( &2 ) conditional on E, ~ qq is a matrix satisfying M ~ qq (i; j) 1 fi = jg = O ( & ) conditional on E. and M 2 Let Cpp Cpq Cp+q;p+q = 0 Cpq Cqq 54 and eq+q;p be the matrix consisting of the …rst p columns of the identity matrix Ip+q : To evaluate the denominator of 2 ; we note that 1 Dpp1 = e0q+q;p Ip+q + Cp+q;p+q 1 1 = eq+q;p 1 Cp+q;p+q e0q+q;p Ip+q ECp+q;p+q 1 eq+q;p 1 +e0q+q;p 1 = 1 ECp+q;p+q Ip [Cp+q;p+q ECp+q;p+q ] [Cp+q;p+q ECp+q;p+q ] eq+q;p + Mpp 2 1 Cpp 1 ECpp + [Cpp 0 ECpp ] + Cpq Cpq ECpp ] [Cpp 2 1 1 + Mpp = o( 2 ) for where Mpp is a matrix with elements Mpp (i; j) satisfying jMpp (i; j)j = O 3& 2 & > 1=3 conditional on E. For the purpose of proving our result, Mpp can be ignored as its presence generates an approximation error of order o ( 2 ) ; which is the same as the order of the approximation error given in the theorem. 
Part (b). It is easy to show that $EC_{qq} = \mu_1 I_q$ and $\operatorname{var}\left[\operatorname{vec}\left(C_{qq}\right)\right] = \mu_2\left(I_{qq} + K_{qq}\right)$, where $\mu_1$ and $\mu_2$ are defined in (9), $I_{qq}$ is the $q^2 \times q^2$ identity matrix, and $K_{qq}$ is the $q^2 \times q^2$ commutation matrix. So $C_{qq} = \mu_1 I_q + o_p(1)$ and $C_{qq}^{-1} = \mu_1^{-1}I_q + o_p(1)$ as $h \to \infty$. Similarly, $C_{pp} = \mu_1 I_p + o_p(1)$ and $C_{pp}^{-1} = \mu_1^{-1}I_p + o_p(1)$ as $h \to \infty$. In addition, using the same argument, we can show that $C_{pq} = o_p(1)$. Therefore
$$\eta^2 = \frac{e_p'\left(I_p + \mu_1^{-2}C_{pq}C_{pq}'\right)\mu_1^{-1}\left(I_p + \mu_1^{-2}C_{pq}C_{pq}'\right)e_p}{e_p'\left(I_p + \mu_1^{-2}C_{pq}C_{pq}'\right)e_p}\left(1 + o_p(1)\right) = 1 + o_p(1).$$
That is, $\eta^2 \to_p 1$ as $h \to \infty$.

We proceed to prove the distributional expansion. The $(i,j)$-th elements $C_{pp}(i,j)$, $C_{pq}(i,j)$ and $C_{qq}(i,j)$ of $C_{pp}$, $C_{pq}$ and $C_{qq}$ are equal to either $\int_0^1\int_0^1 Q_h(r,s)\,dB(r)\,dB(s)$ or $\int_0^1\int_0^1 Q_h(r,s)\,dB(r)\,d\tilde{B}(s)$, where $B(\cdot)$ and $\tilde{B}(\cdot)$ are independent standard Brownian motion processes. By direct calculation, we have, for any $\varsigma \in (0, 3/8)$:
$$P\left(\left|C_{ef}(i,j) - EC_{ef}(i,j)\right| > \mu_2^{\varsigma}\right) \leq \frac{E\left|C_{ef}(i,j) - EC_{ef}(i,j)\right|^8}{\mu_2^{8\varsigma}} = O\left(\mu_2^{4-8\varsigma}\right) = o\left(\mu_2\right),$$
where $e, f = p$ or $q$. Define the event $\mathcal{E}$ as
$$\mathcal{E} = \left\{\omega : \left|C_{ef}(i,j) - EC_{ef}(i,j)\right| \leq \mu_2^{\varsigma} \text{ for all } i, j \text{ and all } e \text{ and } f\right\};$$
then the complement $\mathcal{E}^c$ of $\mathcal{E}$ satisfies $P\left(\mathcal{E}^c\right) = o\left(\mu_2\right)$ as $h \to \infty$. Let $\tilde{\mathcal{E}}$ be another event; then
$$P\left(\tilde{\mathcal{E}}\right) = P\left(\tilde{\mathcal{E}} \cap \mathcal{E}\right) + P\left(\tilde{\mathcal{E}} \cap \mathcal{E}^c\right) = P\left(\tilde{\mathcal{E}} \cap \mathcal{E}\right) + o\left(\mu_2\right) = P\left(\tilde{\mathcal{E}} \mid \mathcal{E}\right)P\left(\mathcal{E}\right) + o\left(\mu_2\right) = P\left(\tilde{\mathcal{E}} \mid \mathcal{E}\right)\left(1 - o\left(\mu_2\right)\right) + o\left(\mu_2\right) = P\left(\tilde{\mathcal{E}} \mid \mathcal{E}\right) + o\left(\mu_2\right).$$
That is, up to an error of order $o\left(\mu_2\right)$, $P(\tilde{\mathcal{E}})$, $P(\tilde{\mathcal{E}} \cap \mathcal{E})$ and $P(\tilde{\mathcal{E}} \mid \mathcal{E})$ are equivalent. So, for the purpose of proving the theorem, it is innocuous to condition on $\mathcal{E}$ or to remove the conditioning, as needed. Now, conditioning on $\mathcal{E}$, the numerator of $\eta^2$ satisfies
$$e_p'\left(I_p + C_{pq}C_{qq}^{-1}C_{qq}^{-1}C_{pq}'\right)e_p = 1 + \mu_1^{-2}e_p'C_{pq}\left(I_q + \mu_1^{-1}\left(C_{qq} - EC_{qq}\right)\right)^{-1}\left(I_q + \mu_1^{-1}\left(C_{qq} - EC_{qq}\right)\right)^{-1}C_{pq}'e_p$$
$$= 1 + \mu_1^{-2}e_p'C_{pq}\left(I_q - M_{qq}\right)\left(I_q - M_{qq}\right)C_{pq}'e_p = 1 + \mu_1^{-2}e_p'C_{pq}\tilde{M}_{qq}C_{pq}'e_p,$$
where $M_{qq}$ is a matrix with elements $M_{qq}(i,j)$ satisfying $|M_{qq}(i,j)| = O\left(\mu_2^{\varsigma}\right)$ conditional on $\mathcal{E}$, and $\tilde{M}_{qq}$ is a matrix satisfying $\tilde{M}_{qq}(i,j) - 1\{i = j\} = O\left(\mu_2^{\varsigma}\right)$ conditional on $\mathcal{E}$.

Let
$$C_{p+q,p+q} = \begin{pmatrix} C_{pp} & C_{pq} \\ C_{pq}' & C_{qq} \end{pmatrix}$$
and let $e_{p+q,p}$ be the matrix consisting of the first $p$ columns of the identity matrix $I_{p+q}$. To evaluate the denominator of $\eta^2$, we note that
$$D_{pp}^{-1} = e_{p+q,p}'\,C_{p+q,p+q}^{-1}\,e_{p+q,p} = \mu_1^{-1}e_{p+q,p}'\left(I_{p+q} - \mu_1^{-1}\left[C_{p+q,p+q} - EC_{p+q,p+q}\right] + \mu_1^{-2}\left[C_{p+q,p+q} - EC_{p+q,p+q}\right]^2\right)e_{p+q,p} + M_{pp}$$
$$= \mu_1^{-1}\left(I_p - \mu_1^{-1}\left[C_{pp} - EC_{pp}\right] + \mu_1^{-2}\left(\left[C_{pp} - EC_{pp}\right]\left[C_{pp} - EC_{pp}\right] + C_{pq}C_{pq}'\right)\right) + M_{pp},$$
where $M_{pp}$ is a matrix with elements $M_{pp}(i,j)$ satisfying $|M_{pp}(i,j)| = O\left(\mu_2^{3\varsigma}\right) = o\left(\mu_2\right)$ for $\varsigma > 1/3$, conditional on $\mathcal{E}$. For the purpose of proving our result, $M_{pp}$ can be ignored, as its presence generates an approximation error of order $o\left(\mu_2\right)$, which is the same as the order of the approximation error given in the theorem. More specifically, let
$$\tilde{C}_{pp} = \mu_1^{-2}C_{pq}\tilde{M}_{qq}C_{pq}', \qquad \tilde{D}_{pp} = \mu_1^{-1}\left(I_p - \mu_1^{-1}\left[C_{pp} - EC_{pp}\right] + \mu_1^{-2}\left(\left[C_{pp} - EC_{pp}\right]\left[C_{pp} - EC_{pp}\right] + C_{pq}C_{pq}'\right)\right)$$
and
$$\tilde{\eta}^2 := \tilde{\eta}^2\left(\tilde{C}_{pp}, \tilde{D}_{pp}\right) = \frac{e_p'\left(I_p + \tilde{C}_{pp}\right)\tilde{D}_{pp}\left(I_p + \tilde{C}_{pp}\right)e_p}{1 + e_p'\tilde{C}_{pp}e_p};$$
then
$$P\left(pF_\infty < z\right) = EG_p\left(\eta^{-2}z\right) = EG_p\left(\tilde{\eta}^{-2}z\right) + o\left(\mu_2\right).$$
Note that, for any $q \times q$ matrix $L_{qq}$, we have
$$EC_{pq}L_{qq}C_{pq}' = E\int_0^1\!\!\int_0^1\!\!\int_0^1\!\!\int_0^1 Q_h(r_1, s_1)Q_h(r_2, s_2)\,dB_p(r_1)\,dB_q(s_1)'L_{qq}\,dB_q(s_2)\,dB_p(r_2)'$$
$$= E\int_0^1\!\!\int_0^1\!\!\int_0^1\!\!\int_0^1 Q_h(r_1, s_1)Q_h(r_2, s_2)\,dB_p(r_1)\,dB_p(r_2)'\operatorname{tr}\left(dB_q(s_2)\,dB_q(s_1)'L_{qq}\right)$$
$$= \operatorname{tr}\left(L_{qq}\right)\int_0^1\!\!\int_0^1\left[Q_h(r, s)\right]^2 dr\,ds\,I_p = \operatorname{tr}\left(L_{qq}\right)\mu_2 I_p. \tag{56}$$
Taking an expansion of $\tilde{\eta}^2\left(\tilde{C}_{pp}, \tilde{D}_{pp}\right)$ around $\tilde{C}_{pp} = q\mu_2\mu_1^{-2}I_p$ and $\tilde{D}_{pp} = \mu_1^{-1}I_p$, we obtain $\tilde{\eta}^{-2} = \eta_0^2 + \text{err}$, where err is the approximation error and
$$\eta_0^2 = \mu_1 - 2q\mu_2\mu_1^{-1} + e_p'\left[C_{pp} - EC_{pp}\right]e_p - \mu_1^{-1}e_p'\left[C_{pp} - EC_{pp}\right]\left[C_{pp} - EC_{pp}\right]e_p + \mu_1^{-1}\left(e_p'\left[C_{pp} - EC_{pp}\right]e_p\right)^2.$$
We keep enough terms in $\eta_0^2$ so that $EG_p\left(\tilde{\eta}^{-2}z\right) = EG_p\left(\eta_0^2 z\right) + o\left(\mu_2\right)$. Now we write
$$P\left(pF_\infty < z\right) = EG_p\left(\eta_0^2 z\right) + o\left(\mu_2\right) = G_p(z) + G_p'(z)\,z\,E\left(\eta_0^2 - 1\right) + \frac{1}{2}G_p''(z)\,z^2\,E\left(\eta_0^2 - 1\right)^2 + o\left(\mu_2\right).$$
In view of $E\,e_p'\left[C_{pp} - EC_{pp}\right]e_p = 0$,
$$E\,e_p'\left[C_{pp} - EC_{pp}\right]\left[C_{pp} - EC_{pp}\right]e_p = (p+1)\mu_2 \quad \text{and} \quad E\left(e_p'\left[C_{pp} - EC_{pp}\right]e_p\right)^2 = 2\mu_2,$$
we have
$$E\left(\eta_0^2 - 1\right) = \left(\mu_1 - 1\right) - 2q\mu_2\mu_1^{-1} - (p+1)\mu_2\mu_1^{-1} + 2\mu_2\mu_1^{-1} = \left(\mu_1 - 1\right) - \left(p + 2q - 1\right)\mu_2 + o\left(\mu_2\right)$$
and
$$E\left(\eta_0^2 - 1\right)^2 = E\left(\left(\mu_1 - 1\right) + e_p'\left[C_{pp} - EC_{pp}\right]e_p\right)^2 + o\left(\mu_2\right) = 2\mu_2 + \left(\mu_1 - 1\right)^2 + o\left(\mu_2\right) = 2\mu_2 + o\left(\mu_2\right).$$
Therefore
$$P\left(pF_\infty < z\right) = G_p(z) + G_p'(z)\,z\left[\left(\mu_1 - 1\right) - \left(p + 2q - 1\right)\mu_2\right] + G_p''(z)\,z^2\mu_2 + o\left(\mu_2\right). \tag{57}$$
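The expansion (57) is straightforward to evaluate. The sketch below specializes to the orthonormal series case, taking $\mu_1 = 1$ and $\mu_2 = 1/K$; these OS values are an assumption in the sense that the OS formulas behind them are not restated above. It reports the null rejection probability of the nominal 5% chi-square test predicted by (57), using the fact that the derivative of the $\chi^2_p$ density $f$ satisfies $f'(z) = f(z)\left[(p/2 - 1)/z - 1/2\right]$.

```python
from scipy.stats import chi2

def expansion_57(z, p, q, mu1, mu2):
    # Right side of (57): G_p(z) + G_p'(z) z [(mu1-1) - (p+2q-1) mu2] + G_p''(z) z^2 mu2
    Gp = chi2.cdf(z, p)
    dGp = chi2.pdf(z, p)
    d2Gp = dGp * ((p / 2 - 1) / z - 0.5)   # derivative of the chi^2_p density
    return Gp + dGp * z * ((mu1 - 1) - (p + 2 * q - 1) * mu2) + d2Gp * z**2 * mu2

p, q = 2, 3
z95 = chi2.ppf(0.95, p)                    # conventional chi-square critical value
for K in (8, 12, 16, 24, 48):
    print(K, round(1 - expansion_57(z95, p, q, 1.0, 1.0 / K), 4))
```

The predicted over-rejection of the chi-square test shrinks at rate $1/K$, in line with the increasing-smoothing limit.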
Proof of Theorem 6. We prove part (a) only, as the proof of part (b) is similar. Using the same arguments and notation as in the proof of Theorem 1, we have
$$R\left(G'W_\infty^{-1}G\right)^{-1}G'W_\infty^{-1}B_m(1) + \delta_0 \overset{d}{=} RVA^{-1}\left(I_d, -C_{12}C_{22}^{-1}\right)B_m(1) + \delta_0$$
and
$$R\left(G'W_\infty^{-1}G\right)^{-1}R' \overset{d}{=} RVA^{-1}\left(C^{11}\right)^{-1}\left(A^{-1}\right)'V'R'.$$
So
$$F_{\infty,\delta_0} \overset{d}{=} \left[RVA^{-1}\left(I_d, -C_{12}C_{22}^{-1}\right)B_m(1) + \delta_0\right]'\left[RVA^{-1}\left(C^{11}\right)^{-1}\left(A^{-1}\right)'V'R'\right]^{-1}\left[RVA^{-1}\left(I_d, -C_{12}C_{22}^{-1}\right)B_m(1) + \delta_0\right]/p.$$
Let $B_m(1) = \left[B_d(1)', B_q(1)'\right]'$ and let $\tilde{U}\tilde{A}\tilde{V}'$ be a singular value decomposition of $RVA^{-1}$. Then
$$F_{\infty,\delta_0} \overset{d}{=} \left[\tilde{U}\tilde{A}\tilde{V}'\left(B_d(1) - C_{12}C_{22}^{-1}B_q(1)\right) + \delta_0\right]'\left[\tilde{U}\tilde{A}\tilde{V}'\left(C^{11}\right)^{-1}\tilde{V}\tilde{A}'\tilde{U}'\right]^{-1}\left[\tilde{U}\tilde{A}\tilde{V}'\left(B_d(1) - C_{12}C_{22}^{-1}B_q(1)\right) + \delta_0\right]/p$$
$$= \left[\tilde{A}\tilde{V}'\left(B_d(1) - C_{12}C_{22}^{-1}B_q(1)\right) + \tilde{U}'\delta_0\right]'\left[\tilde{A}\tilde{V}'\left(C^{11}\right)^{-1}\tilde{V}\tilde{A}'\right]^{-1}\left[\tilde{A}\tilde{V}'\left(B_d(1) - C_{12}C_{22}^{-1}B_q(1)\right) + \tilde{U}'\delta_0\right]/p.$$
The above distribution does not depend on the orthonormal matrix $\tilde{V}$. So, in the notation of the proof of Theorem 5 and writing $\tilde{A}$, with some abuse of notation, for its nonsingular $p \times p$ block, we have
$$F_{\infty,\delta_0} \overset{d}{=} \left\{\tilde{A}\left[B_p(1) - C_{pq}C_{qq}^{-1}B_q(1) + \tilde{A}^{-1}\tilde{U}'\delta_0\right]\right\}'\left(\tilde{A}D_{pp}\tilde{A}'\right)^{-1}\left\{\tilde{A}\left[B_p(1) - C_{pq}C_{qq}^{-1}B_q(1) + \tilde{A}^{-1}\tilde{U}'\delta_0\right]\right\}/p$$
$$= \left[B_p(1) - C_{pq}C_{qq}^{-1}B_q(1) + \tilde{A}^{-1}\tilde{U}'\delta_0\right]'D_{pp}^{-1}\left[B_p(1) - C_{pq}C_{qq}^{-1}B_q(1) + \tilde{A}^{-1}\tilde{U}'\delta_0\right]/p.$$
Let $H = \left(\tilde{A}^{-1}\tilde{U}'\delta_0/\|\tilde{A}^{-1}\tilde{U}'\delta_0\|, \tilde{H}\right)$ be a $p \times p$ orthonormal matrix. Then
$$F_{\infty,\delta_0} \overset{d}{=} \left[H'B_p(1) - H'C_{pq}C_{qq}^{-1}B_q(1) + H'\tilde{A}^{-1}\tilde{U}'\delta_0\right]'\left(H'D_{pp}^{-1}H\right)\left[H'B_p(1) - H'C_{pq}C_{qq}^{-1}B_q(1) + H'\tilde{A}^{-1}\tilde{U}'\delta_0\right]/p$$
$$= \left[H'B_p(1) - H'C_{pq}C_{qq}^{-1}B_q(1) + e_p\|\tilde{A}^{-1}\tilde{U}'\delta_0\|\right]'\left(H'D_{pp}^{-1}H\right)\left[H'B_p(1) - H'C_{pq}C_{qq}^{-1}B_q(1) + e_p\|\tilde{A}^{-1}\tilde{U}'\delta_0\|\right]/p.$$
But the joint distribution of $\left(H'B_p(1), H'C_{pq}, H'D_{pp}^{-1}H\right)$ is invariant to $H$. Hence we can write
$$F_{\infty,\delta_0} \overset{d}{=} \left[B_p(1) - C_{pq}C_{qq}^{-1}B_q(1) + e_p\|\tilde{A}^{-1}\tilde{U}'\delta_0\|\right]'D_{pp}^{-1}\left[B_p(1) - C_{pq}C_{qq}^{-1}B_q(1) + e_p\|\tilde{A}^{-1}\tilde{U}'\delta_0\|\right]/p.$$
That is, the distribution of $F_{\infty,\delta_0}$ depends on $\delta_0$ only through $\|\tilde{A}^{-1}\tilde{U}'\delta_0\|$. Note that $\tilde{U}\tilde{A}\tilde{A}'\tilde{U}' = \left(\tilde{U}\tilde{A}\tilde{V}'\right)\left(\tilde{U}\tilde{A}\tilde{V}'\right)' = RVA^{-1}\left(A^{-1}\right)'V'R'$, so
$$\|\tilde{A}^{-1}\tilde{U}'\delta_0\|^2 = \delta_0'\tilde{U}\left(\tilde{A}\tilde{A}'\right)^{-1}\tilde{U}'\delta_0 = \delta_0'\left[RVA^{-1}\left(A^{-1}\right)'V'R'\right]^{-1}\delta_0 = \delta_0'\left[R\left(G'\Omega^{-1}G\right)^{-1}R'\right]^{-1}\delta_0 = \|\Delta\|^2.$$
Therefore
$$F_{\infty,\delta_0} \overset{d}{=} \left[B_p(1) - C_{pq}C_{qq}^{-1}B_q(1) + e_p\|\Delta\|\right]'D_{pp}^{-1}\left[B_p(1) - C_{pq}C_{qq}^{-1}B_q(1) + e_p\|\Delta\|\right]/p = F_\infty\left(\|\Delta\|^2\right).$$

Proof of Proposition 7. We first prove (20). Using the singular value decomposition $G = U_G\Lambda_GV_G'$ with $\Lambda_G = \left(A_G', 0\right)'$, partition $W_{U,\infty} := U_G'W_\infty U_G$ and $\Sigma_U := U_G'\Omega U_G$ conformably as
$$W_{U,\infty} = \begin{pmatrix} W_{U,\infty}^{11} & W_{U,\infty}^{12} \\ W_{U,\infty}^{21} & W_{U,\infty}^{22} \end{pmatrix}, \qquad \Sigma_U = \begin{pmatrix} \Sigma_U^{11} & \Sigma_U^{12} \\ \Sigma_U^{21} & \Sigma_U^{22} \end{pmatrix},$$
and define $W_{U,\infty}^{11\cdot2} = W_{U,\infty}^{11} - W_{U,\infty}^{12}\left(W_{U,\infty}^{22}\right)^{-1}W_{U,\infty}^{21}$ and $B_{U,\infty} = W_{U,\infty}^{12}\left(W_{U,\infty}^{22}\right)^{-1}$. Note that
$$\left(G'W_\infty^{-1}G\right)^{-1} = V_GA_G^{-1}W_{U,\infty}^{11\cdot2}\left(A_G^{-1}\right)'V_G' \quad \text{and} \quad \left(G'W_\infty^{-1}G\right)^{-1}G'W_\infty^{-1} = V_GA_G^{-1}\left(I_d, -B_{U,\infty}\right)U_G',$$
so the sandwich variance associated with the weighting matrix $W_\infty$ is
$$\tilde{V} = RV_GA_G^{-1}\left(I_d, -B_{U,\infty}\right)\Sigma_U\left(I_d, -B_{U,\infty}\right)'\left(A_G^{-1}\right)'V_G'R'.$$
In view of
$$V = R\left(G'\Omega^{-1}G\right)^{-1}R' = RV_GA_G^{-1}\Sigma_U^{11\cdot2}\left(A_G^{-1}\right)'V_G'R', \qquad \Sigma_U^{11\cdot2} := \Sigma_U^{11} - \Sigma_U^{12}\left(\Sigma_U^{22}\right)^{-1}\Sigma_U^{21},$$
we have
$$\tilde{V} - V = RV_GA_G^{-1}\left[\Sigma_U^{12}\left(\Sigma_U^{22}\right)^{-1} - B_{U,\infty}\right]\Sigma_U^{22}\left[\Sigma_U^{12}\left(\Sigma_U^{22}\right)^{-1} - B_{U,\infty}\right]'\left(A_G^{-1}\right)'V_G'R' \geq 0.$$
Next, we prove the proposition, starting with $\|\tilde{\delta}\|^2$:
$$\|\tilde{\delta}\|^2 = \delta_0'\tilde{V}^{-1}\delta_0 = \left(V^{-1/2}\delta_0\right)'\left(V^{1/2}\tilde{V}^{-1}V^{1/2}\right)\left(V^{-1/2}\delta_0\right).$$
The above equality holds when $\tilde{V}$ is nonsingular. But
$$\tilde{V}^{-1/2}V\tilde{V}^{-1/2} = I_p - \tilde{V}^{-1/2}\left(\tilde{V} - V\right)\tilde{V}^{-1/2} \leq I_p,$$
and $V^{1/2}\tilde{V}^{-1}V^{1/2}$ has the same eigenvalues as $\tilde{V}^{-1/2}V\tilde{V}^{-1/2}$, so $V^{1/2}\tilde{V}^{-1}V^{1/2} \leq I_p$. Hence
$$\|\tilde{\delta}\|^2 \leq \left(V^{-1/2}\delta_0\right)'\left(V^{-1/2}\delta_0\right) = \delta_0'V^{-1}\delta_0 = \|\delta\|^2,$$
that is, $\|\delta\|^2 - \|\tilde{\delta}\|^2 \geq 0$, as desired.
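The invariance established in the proof of Theorem 6 (the law of $F_{\infty,\delta_0}$ depends on the shift only through its norm) can also be checked by simulation. The sketch below reuses the orthonormal-series representation of the limiting matrices (an assumption, as in the earlier sketch) and compares two shift directions with a common norm; the quantiles should agree up to Monte Carlo error.

```python
import numpy as np

p, q, K, ndraw = 2, 3, 12, 100_000
m = p + q

def F_local(delta, seed):
    # F_{inf,delta0} = [B_p - C_pq C_qq^{-1} B_q + delta]' D_pp^{-1} [...] / p,
    # with C drawn from the assumed OS limit C = (1/K) sum_j xi_j xi_j'.
    rng = np.random.default_rng(seed)
    xi = rng.standard_normal((ndraw, K, m))
    C = xi.transpose(0, 2, 1) @ xi / K
    Cpq, Cqq_inv = C[:, :p, p:], np.linalg.inv(C[:, p:, p:])
    Dpp = C[:, :p, :p] - Cpq @ Cqq_inv @ Cpq.transpose(0, 2, 1)
    B = rng.standard_normal((ndraw, m, 1))
    u = B[:, :p] - Cpq @ Cqq_inv @ B[:, p:] + delta.reshape(1, p, 1)
    return (u.transpose(0, 2, 1) @ np.linalg.inv(Dpp) @ u).ravel() / p

norm = 2.0
d1 = norm * np.array([1.0, 0.0])   # e_p times the norm of Delta
d2 = norm * np.array([0.6, 0.8])   # same norm, different direction
for a in (0.5, 0.9, 0.95):
    print(a, np.quantile(F_local(d1, 2), a).round(3), np.quantile(F_local(d2, 3), a).round(3))
```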
References

[1] Andersen, T. G. and B. E. Sorensen (1996): "GMM Estimation of a Stochastic Volatility Model: A Monte Carlo Study." Journal of Business & Economic Statistics 14(3), 328–352.

[2] Andrews, D. W. K. (1991): "Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimation." Econometrica 59, 817–858.

[3] Andrews, D. W. K. and J. C. Monahan (1992): "An Improved Heteroskedasticity and Autocorrelation Consistent Covariance Matrix Estimator." Econometrica 60(4), 953–966.

[4] Bester, A., T. Conley, C. Hansen, and T. Vogelsang (2011): "Fixed-b Asymptotics for Spatially Dependent Robust Nonparametric Covariance Matrix Estimators." Working paper, Michigan State University.

[5] Bester, C. A., T. G. Conley, and C. B. Hansen (2011): "Inference with Dependent Data Using Cluster Covariance Estimators." Journal of Econometrics 165(2), 137–151.

[6] Chen, X., Z. Liao and Y. Sun (2014): "Sieve Inference on Possibly Misspecified Semi-nonparametric Time Series Models." Journal of Econometrics 178(3), 639–658.

[7] Gonçalves, S. (2011): "The Moving Blocks Bootstrap for Panel Linear Regression Models with Individual Fixed Effects." Econometric Theory 27(5), 1048–1082.

[8] Gonçalves, S. and T. Vogelsang (2011): "Block Bootstrap HAC Robust Tests: The Sophistication of the Naive Bootstrap." Econometric Theory 27(4), 745–791.

[9] Ghysels, E., A. C. Harvey and E. Renault (1996): "Stochastic Volatility." In G. S. Maddala and C. R. Rao (Eds.), Statistical Methods in Finance, 119–191.

[10] Han, C. and P. C. B. Phillips (2006): "GMM with Many Moment Conditions." Econometrica 74(1), 147–192.

[11] Hansen, L. P. (1982): "Large Sample Properties of Generalized Method of Moments Estimators." Econometrica 50, 1029–1054.

[12] Hall, A. R. (2000): "Covariance Matrix Estimation and the Power of the Over-identifying Restrictions Test." Econometrica 68, 1517–1527.

[13] Hall, A. R. (2005): Generalized Method of Moments. Oxford University Press.

[14] Ibragimov, R. and U. K. Müller (2010): "t-Statistic Based Correlation and Heterogeneity Robust Inference." Journal of Business & Economic Statistics 28, 453–468.

[15] Jansson, M. (2004): "On the Error of Rejection Probability in Simple Autocorrelation Robust Tests." Econometrica 72, 937–946.

[16] Jacquier, E., N. G. Polson and P. E. Rossi (1994): "Bayesian Analysis of Stochastic Volatility Models." Journal of Business & Economic Statistics 12(4), 371–389.

[17] Kiefer, N. M. and T. J. Vogelsang (2002a): "Heteroskedasticity-Autocorrelation Robust Testing Using Bandwidth Equal to Sample Size." Econometric Theory 18, 1350–1366.

[18] Kiefer, N. M. and T. J. Vogelsang (2002b): "Heteroskedasticity-Autocorrelation Robust Standard Errors Using the Bartlett Kernel without Truncation." Econometrica 70, 2093–2095.

[19] Kiefer, N. M. and T. J. Vogelsang (2005): "A New Asymptotic Theory for Heteroskedasticity-Autocorrelation Robust Tests." Econometric Theory 21, 1130–1164.

[20] Kim, M. S. and Y. Sun (2013): "Heteroskedasticity and Spatiotemporal Dependence Robust Inference for Linear Panel Models with Fixed Effects." Journal of Econometrics 177(1), 85–108.

[21] Müller, U. K. (2007): "A Theory of Robust Long-Run Variance Estimation." Journal of Econometrics 141, 1331–1352.

[22] Newey, W. K. and K. D. West (1987): "A Simple, Positive Semidefinite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix." Econometrica 55, 703–708.

[23] Phillips, P. C. B. (2005): "HAC Estimation by Automated Regression." Econometric Theory 21, 116–142.

[24] Phillips, P. C. B., Y. Sun and S. Jin (2006): "Spectral Density Estimation and Robust Hypothesis Testing Using Steep Origin Kernels without Truncation." International Economic Review 47, 837–894.

[25] Phillips, P. C. B., Y. Sun and S. Jin (2007): "Consistent HAC Estimation and Robust Regression Testing Using Sharp Origin Kernels with No Truncation." Journal of Statistical Planning and Inference 137, 985–1023.

[26] Politis, D. (2011): "Higher-order Accurate, Positive Semi-definite Estimation of Large-sample Covariance and Spectral Density Matrices." Econometric Theory 27, 703–744.

[27] Press, J. (2005): Applied Multivariate Analysis, Second Edition. Dover Books on Mathematics.

[28] Shao, X. (2010): "A Self-normalized Approach to Confidence Interval Construction in Time Series." Journal of the Royal Statistical Society B 72, 343–366.

[29] Sun, Y. (2011): "Robust Trend Inference with Series Variance Estimator and Testing-optimal Smoothing Parameter." Journal of Econometrics 164(2), 345–366.

[30] Sun, Y. (2013): "A Heteroskedasticity and Autocorrelation Robust F Test Using Orthonormal Series Variance Estimator." Econometrics Journal 16, 1–26.

[31] Sun, Y. (2014): "Let's Fix It: Fixed-b Asymptotics versus Small-b Asymptotics in Heteroscedasticity and Autocorrelation Robust Inference." Journal of Econometrics 178(3), 659–677.

[32] Sun, Y. and D. Kaplan (2012): "Fixed-smoothing Asymptotics and Accurate F Approximation Using Vector Autoregressive Covariance Matrix Estimators." Working paper, UC San Diego.

[33] Sun, Y., P. C. B. Phillips and S. Jin (2008): "Optimal Bandwidth Selection in Heteroskedasticity-Autocorrelation Robust Testing." Econometrica 76(1), 175–194.

[34] Sun, Y., P. C. B. Phillips and S. Jin (2011): "Power Maximization and Size Control in Heteroscedasticity and Autocorrelation Robust Tests with Exponentiated Kernels." Econometric Theory 27(6), 1320–1368.
[35] Sun, Y. and P. C. B. Phillips (2009): "Bandwidth Choice for Interval Estimation in GMM Regression." Working paper, Department of Economics, UC San Diego.

[36] Sun, Y. and M. S. Kim (2012): "Simple and Powerful GMM Over-identification Tests with Accurate Size." Journal of Econometrics 166(2), 267–281.

[37] Sun, Y. and M. S. Kim (2014): "Asymptotic F Test in a GMM Framework with Cross Sectional Dependence." Review of Economics and Statistics, forthcoming.

[38] Taniguchi, M. and M. L. Puri (1996): "Valid Edgeworth Expansions of M-estimators in Regression Models with Weakly Dependent Residuals." Econometric Theory 12, 331–346.

[39] Vogelsang, T. (2012): "Heteroskedasticity, Autocorrelation, and Spatial Correlation Robust Inference in Linear Panel Models with Fixed-effects." Journal of Econometrics 166, 303–319.

[40] Xiao, Z. and O. Linton (2002): "A Nonparametric Prewhitened Covariance Estimator." Journal of Time Series Analysis 23, 215–250.

[41] Zhang, X. and X. Shao (2013): "Fixed-Smoothing Asymptotics for Time Series." Annals of Statistics 41(3), 1329–1349.