Supplement to ‘Individual and Time Effects in Nonlinear Panel Models with Large N, T ’ Iván Fernández-Val‡ Martin Weidner§ March 31, 2015 Abstract This supplemental material contains five appendices. Appendix S.1 presents the results of an empirical application and a Monte Carlo simulation calibrated to the application. Following Aghion et al. (2005), we use a panel of U.K. industries to estimate Poisson models with industry and time effects for the relationship between innovation and competition. Appendix S.2 gives the proofs of Theorems 4.3 and 4.4. Appendices S.3, S.4, and S.5 contain the proofs of Appendices B, C, and D, respectively. Appendix S.6 collects some useful intermediate results that are used in the proofs of the main results. S.1 S.1.1 Relationship between Innovation and Competition Empirical Example To illustrate the bias corrections with real data, we revisit the empirical application of Aghion, Bloom, Blundell, Griffith and Howitt (2005) (ABBGH) that estimated a count data model to analyze the relationship between innovation and competition. They used an unbalanced panel of seventeen U.K. industries followed over the 22 years between 1973 and 1994.1 The dependent variable, Yit , is innovation as measured by a citation-weighted number of patents, and the explanatory variable of interest, Zit , is competition as measured by one minus the Lerner index in the industry-year. Following ABBGH we consider a quadratic static Poisson model with industry and year effects where 2 Yit | ZiT , αi , γt ∼ P(exp[β1 Zit + β2 Zit + αi + γt ]), for (i = 1, ..., 17; t = 1973, ..., 1994), and extend the analysis to a dynamic Poisson model with industry and year effects where 2 Yit | Yit−1 , Zit , αi , γ t ∼ P(exp[βY log(1 + Yi,t−1 ) + β1 Zit + β2 Zit + αi + γt ]), ‡ § Department of Economics, Boston University, 270 Bay State Road, Boston, MA 02215-1403, USA. Email: ivanf@bu.edu Department of Economics, University College London, Gower Street, London WC1E 6BT, UK, and and CeMMaP. Email: m.weidner@ucl.ac.uk 1 We assume that the observations are missing at random conditional on the explanatory variables and unobserved effects and apply the corrections without change since the level of attrition is low in this application. 1 for (i = 1, ..., 17; t = 1974, ..., 1994). In the dynamic model we use the year 1973 as the initial condition for Yit . Table S1 reports the results of the analysis. Columns (2) and (3) for the static model replicate the empirical results of Table I in ABBGH (p. 708), adding estimates of the APEs. Columns (4) and (5) report estimates of the analytical corrections that do not assume that competition is strictly exogenous with L = 1 and L = 2, and column (6) reports estimates of the jackknife bias corrections described in equation (3.4) of the paper. Note that we do not need to report separate standard errors for the corrected estimators, because the standard errors of the uncorrected estimators are consistent for the corrected estimators under the asymptotic approximation that we consider.2 Overall, the corrected estimates, while numerically different from the uncorrected estimates in column (3), agree with the inverted-U pattern in the relationship between innovation and competition found by ABBGH. The close similarity between the uncorrected and bias corrected estimates gives some evidence in favor of the strict exogeneity of competition with respect to the innovation process. 
The results for the dynamic model show substantial positive state dependence in the innovation process that is not explained by industry heterogeneity. Uncorrected fixed effects underestimates the coefficient and APE of lag patents relative to the bias corrections, specially relative to the jackknife. The pattern of the differences between the estimates is consistent with the biases that we find in the numerical example in Table S4. Accounting for state dependence does not change the inverted-U pattern, but flattens the relationship between innovation and competition. Table S.2 implements Chow-type homogeneity tests for the validity of the jackknife corrections. These tests compare the uncorrected fixed effects estimators of the common parameters within the elements of the cross section and time series partitions of the panel. Under time homogeneity, the probability limit of these estimators is the same, so that a standard Wald test can be applied based on the difference of the estimators in the sub panels within the partition. For the static model, the test is rejected at the 1% level in both the cross section and time series partitions. Since the cross sectional partition is arbitrary, these rejection might be a signal of model misspecification. For the dynamic model, the test is rejected at the 1% level in the time series partition, but it cannot be rejected at conventional levels in the cross section partition. The rejection of the time homogeneity might explain the difference between the jackknife and analytical corrections in the dynamic model. S.1.2 Calibrated Monte Carlo Simulations We conduct a simulation that mimics the empirical example. The designs correspond to static and dynamic Poisson models with additive individual and time effects. We calibrate all the parameters and exogenous variables using the dataset from ABBGH. 2 In numerical examples, we find very little gains in terms of the ratio SE/SD and coverage probabilities when we reestimate the standard errors using bias corrected estimates. 2 S.1.2.1 Static Poisson model The data generating process is 2 Yit | ZiT , α, γ ∼ P(exp[Zit β1 + Zit β2 + αi + γt ]), (i = 1, ..., N ; t = 1, ..., T ), where P denotes the Poisson distribution. The variable Zit is fixed to the values of the competition variable in the dataset and all the parameters are set to the fixed effect estimates of the model. We generate unbalanced panel data sets with T = 22 years and three different numbers of industries N : 17, 34, and 51. In the second (third) case, we double (triple) the cross-sectional size by merging two (three) independent realizations of the panel. Table S3 reports the simulation results for the coefficients β1 and β2 , and the APE of Zit . We com2 pute the APE using the expression (2.5) with H(Zit ) = Zit . Throughout the table, MLE corresponds to the pooled Poisson maximum likelihood estimator (without individual and time effects), MLE-TE corresponds to the Poisson estimator with only time effects, MLE-FETE corresponds to the Poisson maximum likelihood estimator with individual and time fixed effects, Analytical (L=l) is the bias corrected estimator that uses the analytical correction with L = l, and Jackknife is the bias corrected estimator that uses SPJ in both the individual and time dimensions. The analytical corrections are different from the uncorrected estimator because they do not use that the regressor Zit is strictly exogenous. The cross-sectional division in the jackknife follows the order of the observations. 
The choice of these estimators is motivated by the empirical analysis of ABBGH. All the results in the table are reported in percentage of the true parameter value. The results of the table agree with the no asymptotic bias result for the Poisson model with exogenous regressors. Thus, the bias of MLE-FETE for the coefficients and APE is negligible relative to the standard deviation and the coverage probabilities get close to the nominal level as N grows. The analytical corrections preserve the performance of the estimators and have very little sensitivity to the trimming parameter. The jackknife correction increases dispersion and rmse, specially for the small cross-sectional size of the application. The estimators that do not control for individual effects are clearly biased. S.1.2.2 Dynamic Poisson model The data generating process is 2 Yit | Yit−1 , Zit , α, γ ∼ P(exp[βY log(1 + Yi,t−1 ) + Zit β1 + Zit β2 + αi + γt ]), (i = 1, ..., N ; t = 1, ..., T ). The competition variable Zit and the initial condition for the number of patents Yi0 are fixed to the values in the dataset and all the parameters are set to the fixed effect estimates of the model. To generate panels, we first impute values to the missing observations of Zit using forward and backward predictions from a panel AR(1) linear model with individual and time effects. We then draw panel data sets with T = 21 years and three different numbers of industries N : 17, 34, and 51. As in the static model, we double (triple) the cross-sectional size by merging two (three) independent realizations of the panel. We make the generated panels unbalanced by dropping the values corresponding to the missing observations in the original dataset. 3 Table S4 reports the simulation results for the coefficient βY0 and the APE of Yi,t−1 . The estimators considered are the same as for the static Poisson model above. We compute the partial effect of Yi,t−1 using (2.5) with Zit = Yi,t−1 , H(Zit ) = log(1 + Zit ), and dropping the linear term. Table S5 reports the simulation results for the coefficients β10 and β20 , and the APE of Zit . We compute the partial effect 2 using (2.5) with H(Zit ) = Zit . Again, all the results in the tables are reported in percentage of the true parameter value. The results in table S4 show biases of the same order of magnitude as the standard deviation for the fixed effects estimators of the coefficient and APE of Yi,t−1 , which cause severe undercoverage of confidence intervals. Note that in this case the rate of convergence for the estimator of the APE is √ rN T = N T , because the individual and time effects are hold fixed across the simulations. The analytical corrections reduce bias by more than half without increasing dispersion, substantially reducing rmse and bringing coverage probabilities closer to their nominal levels. The jackknife corrections reduce bias and increase dispersion leading to lower improvements in rmse and coverage probability than the analytical corrections. The results for the coefficient of Zit in table 8 are similar to the static model. The results for the APE of Zit are imprecise, because the true value of the effect is close to zero. S.2 Proofs of Theorems 4.3 and 4.4 We start with a lemma that shows the consistency of the fixed effects estimators of averages of the data and parameters. We will use this result to show the validity of the analytical bias corrections and the consistency of the variance estimators. Lemma S.1. 
Let G(β, φ) := [N (T − j)]−1 and Bε0 be a subset of R dim β+2 P i,t≥j+1 g(Xit , Xi,t−j , β, αi + γt , αi + γt−j ) for 0 ≤ j < T, 0 0 that contains an ε-neighborhood of (β, πit , πi,t−j ) for all i, t, j, N, T , and for some ε > 0. Assume that (β, π1 , π2 ) 7→ gitj (β, π1 , π2 ) := g(Xit , Xi,t−j , β, π1 , π2 ) is Lipschitz continuous over Bε0 a.s, i.e. |gitj (β1 , π11 , π21 ) − gitj (β0 , π10 , π20 )| ≤ Mitj k(β1 , π11 , π21 ) − (β, π10 , π20 )k b φ) b be for all (β0 , π10 , π20 ) ∈ B 0 , (β1 , π11 , π21 ) ∈ B 0 , and some Mitj = OP (1) for all i, t, j, N, T . Let (β, ε ε an estimator of (β, φ) such that kβb − β 0 k →P 0 and kφb − φ0 k∞ →P 0. Then, b φ) b →P E[G(β 0 , φ0 )], G(β, provided that the limit exists. Proof of Lemma S.1. By the triangle inequality b φ) b − E[G(β 0 , φ0 )]| ≤ |G(β, b φ) b − G(β 0 , φ0 )| + oP (1), |G(β, because |G(β 0 , φ0 ) − E[G(β 0 , φ0 )]| = oP (1). By the local Lipschitz continuity of gitj and the consistency b φ), b of (β, b φ) b − G(β 0 , φ0 )| ≤ |G(β, X 1 0 bα Mitj k(β, bi + γ bt , α bi + γ bt−j ) − (β 0 , αi0 + γt0 , αi0 + γt−j )k N (T − j) i,t≥j+1 ≤ X 1 Mitj (kβb − β 0 k + 4kφb − φ0 k∞ ) N (T − j) i,t≥j+1 4 wpa1. The result then follows because [N (T −j)]−1 P i,τ ≥t b 0 k∞ ) = Mitτ = OP (1) and (kβb−β 0 k+4kφ−φ oP (1) by assumption. Proof of Theorem 4.3. We separate the proof in three parts corresponding to the three statements of the theorem. c →P W ∞ . The asymptotic variance and its fixed effects estimators can be Part I: Proof of W b φ), b where W (β, φ) has a first order representation as c = W (β, expressed as W ∞ = E[W (β 0 , φ0 )] and W a continuously differentiable transformation of terms that have the form of G(β, φ) in Lemma S.1. The result then follows by the continuous mapping theorem noting that kβb − β 0 k →P 0 and kφb − φ0 k∞ ≤ kφb − φ0 kq →P 0 by Theorem C.1. √ −1 Part II: Proof of N T (βeA − β 0 ) →d N (0, W ∞ ). By the argument given after equation (3.3) in the b →P D∞ . These asymptotic biases and their fixed b →P B ∞ and D text, we only need to show that B effects estimators are either time-series averages of fractions of cross-sectional averages, or vice versa. c , but The nesting of the averages makes the analysis a bit more cumbersome than the analysis of W the result follows by similar standard arguments, also using that L → ∞ and L/T → 0 guarantee that b is also consistent for the spectral expectations; see Lemma 6 in Hahn and the trimmed estimator in B Kuersteiner (2011). Part III: Proof of √ −1 N T (βeJ − β 0 ) →d N (0, W ∞ ). For T1 = {1, . . . , b(T + 1)/2c}, T2 = {bT /2c + 1, . . . , T }, T0 = T1 ∪ T2 , N1 = {1, . . . , b(N + 1)/2c}, N2 = {bN/2c + 1, . . . , N }, and N0 = N1 ∪ N2 , let βb(jk) be the fixed effect estimator of β in the subpanel defined by i ∈ Nj and t ∈ Tk .3 In this notation, βeJ = 3βb(00) − βb(10) /2 − βb(20) /2 − βb(01) /2 − βb(02) /2. √ We derive the asymptotic distribution of N T (βeJ − β 0 ) from the joint asymptotic distribution of √ b = N T (βb(00) − β 0 , βb(10) − β 0 , βb(20) − β 0 , βb(01) − β 0 , βb(02) − β 0 ) with dimension 5 × dim β. the vector B By Theorem C.1, √ −1 21(j>0) 21(k>0) √ N T (βb(jk) − β 0 ) = NT −1 (1a,1) for ψit = W ∞ Dβ `it , bit = W ∞ [Uit is implicitly defined by U (·) = (N T ) (1b,1,1) + Uit P −1/2 i,t X [ψit + bit + dit ] + oP (1), i∈Nj ,t∈Tk −1 (1a,4) ], and dit = W ∞ [Uit (1b,4,4) + Uit (·) ], where the Uit (·) Uit . Here, none of the terms carries a superscript (jk) by Assumption 4.3. 
The influence function ψit has zero mean and determines the asymptotic variance −1 W ∞ , whereas bit and dit determine the asymptotic biases B ∞ and D∞ , but do not affect the asymptotic variance. By this representation, 1 1 −1 b B →d N κ 1 ⊗ B ∞ + κ 2 2 3 1 1 1 1 1 2 2 ⊗ D∞ , 1 1 1 2 0 1 1 0 2 1 1 1 1 2 1 1 1 0 1 −1 1 ⊗ W∞ , 0 2 1 Note that this definition of the subpanels covers all the cases regardless of whether N and T are even or odd. 5 where we use that {ψit : 1 ≤ i ≤ N, 1 ≤ t ≤ T } is independent across i and martingale difference across t and Assumption 4.3. The result follows by writing √ b and using the properties N T (βeJ −β 0 ) = (3, −1/2, −1/2, −1/2, −1/2)B of the multivariate normal distribution. Proof of Theorem 4.4. We separate the proof in three parts corresponding to the three statements of the theorem. δ δ c in part I of the proof of Part I: Vb δ →P V ∞ . V ∞ and Vb δ have a similar structure to W ∞ and W Theorem 4.3, so that the consistency follows by an analogous argument. √ δ 0 Part II: N T (δeA − δN T ) →d N (0, V ∞ ). As in the proof of Theorem 4.2, we decompose rN T √ 0 0 rN T (δeA − δN N T (δeA − δ). T ) = rN T (δ − δN T ) + √ NT Then, by Mann-Wald theorem, √ N T (δeA − δ) = √ δ(1) b δ /T − D b δ /N − δ) →d N (0, V N T (δb − B ∞ ), δ(1) δ(2) b δ →P Dδ∞ , and rN T (δ − δ 0 ) →d N (0, V δ(2) b δ →P B δ∞ and D provided that B ∞ ), where V ∞ and V ∞ NT are defined as in the proof of Theorem 4.2. The statement thus follows by using a similar argument to b δ and D b δ , and because (δ − δ 0 ) and part II of the proof of Theorem 4.3 to show the consistency of B NT √ δ δ(2) δ(1) (δeA − δ) are asymptotically independent, and V ∞ = V +V limN,T →∞ (rN T / N T )2 . √ δ Part III: N T (δeJ − δ 0 ) →d N (0, V ). As in part II, we decompose ∞ NT rN T √ 0 0 rN T (δeJ − δN N T (δeJ − δ). T ) = rN T (δ − δN T ) + √ NT Then, by an argument similar to part III of the proof of Theorem 4.3, √ δ(2) δ(1) N T (δeJ − δ) →d N (0, V ∞ ), δ(1) δ(2) 0 and rN T (δ − δN T ) →d N (0, V ∞ ), where V ∞ and V ∞ are defined as in the proof of Theorem 4.2. δ 0 The statement follows because (δ − δN T ) and (δeJ − δ) are asymptotically independent, and V ∞ = √ δ(2) δ(1) V +V limN,T →∞ (rN T / N T )2 . S.3 Proofs of Appendix B (Asymptotic Expansions) The following Lemma contains some statements that are not explicitly assumed in Assumptions B.1, but that are implied by it. Lemma S.1. Let Assumptions B.1 be satisfied. Then 6 (i) H(β, φ) > 0 for all β ∈ B(rβ , β 0 ) and φ ∈ Bq (rφ , φ0 ) wpa1, sup √ k∂ββ 0 L(β, φ)k = OP sup NT , β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) sup sup β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) sup sup β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) sup β∈B(rβ k∂φφφ L(β, φ)kq = OP ((N T ) ) , sup ,β 0 ) sup φ∈Bq (rφ k∂βφ0 L(β, φ)kq = OP (N T )1/(2q) , ,φ0 ) k∂βφφ L(β, φ)kq = OP ((N T ) ), sup β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) −1 H (β, φ) = OP (1). q −1 −1 (ii) Moreover, kSk = OP (1) , H−1 = OP (1) , H = OP (1) , H−1 − H = oP (N T )−1/8 , −1 −1 −1 e −1 H − H − H HH = oP (N T )−1/4 , k∂βφ0 Lk = OP (N T )1/4 , k∂βφφ Lk = OP ((N T ) ) , P P −1 g ∂φφ0 φg L [H−1 S]g = OP (N T )−1/4+1/(2q)+ , and g ∂φφ0 φg L [H S]g = OP (N T )−1/4+1/(2q)+ . Rdim β and w, u ∈ Rdim φ . By a Taylor expansion of Proof of Lemma S.1. # Part (i): Let v ∈ ∂βφ0 φg L(β, φ) around (β 0 , φ0 ) X ug v 0 ∂βφ0 φg L(β, φ) w g " = X ug v 0 # X X 0 0 ∂βφ0 φg L + (βk − βk )∂βk βφ0 φg L(β̃, φ̃) − (φh − φh )∂βφ0 φg φh L(β̃, φ̃) w, g k h with (β̃, φ̃) between (β 0 , φ0 ) and (β, φ). 
Thus k∂βφφ L(β, φ)kq = sup kvk=1 sup sup X kukq =1 kwkq/(q−1) =1 g ug v 0 ∂βφ0 φg L(β, φ) w ≤ k∂βφφ Lkq + kβ − β k sup ∂ββφφ L(β̃, φ̃) + kφ − φ0 kq sup ∂βφφφ L(β̃, φ̃) , 0 q (β̃,φ̃) (β̃,φ̃) q where the supremum over (β̃, φ̃) is necessary, because those parameters depend on v, w, u. By Assumption B.1, for large enough N and T, sup sup β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) k∂βφφ L(β, φ)kq ≤ k∂βφφ Lk + rβ + rφ sup sup sup β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) sup β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) k∂ββφφ L(β, φ)kq k∂βφφφ L(β, φ)kq = OP [(N T ) + rβ (N T ) + rφ (N T ) ] = OP ((N T ) ) . The proofs for the bounds on k∂ββ 0 L(β, φ)k, k∂βφ0 L(β, φ)kq and k∂φφφ L(β, φ)kq are analogous. Next, we show that H(β, φ) is non-singular for all β ∈ B(rβ , β 0 ) and φ ∈ Bq (rφ , φ0 ) wpa1. By a Taylor expansion and Assumption B.1, for large enough N and T, sup sup β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) kH(β, φ) − Hkq ≤ rβ + rφ sup sup β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) sup sup β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) 7 k∂βφφ L(β, φ)kq k∂φφφ L(β, φ)kq = oP (1). (S.1) e Define ∆H(β, φ) = H − H(β, φ). Then k∆H(β, φ)kq ≤ kH(β, φ) − Hkq + H , and therefore q sup sup β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) k∆H(β, φ)kq = oP (1), by Assumption B.1 and equation (S.1). −1 For any square matrix with kAkq < 1, (1 − A)−1 q ≤ (1 − kAkq ) , see e.g. p.301 in Horn and Johnson (1985). Then sup sup β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) −1 H (β, φ) q = sup sup β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) = sup sup β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) −1 ≤ H q −1 ≤ H q −1 H − ∆H(β, φ) q −1 −1 −1 H 1 − ∆H(β, φ)H q −1 −1 sup sup 1 − ∆H(β, φ)H β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) q −1 −1 sup sup 1 − ∆H(β, φ)H q β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) −1 −1 ≤ H (1 − oP (1)) = OP (1). q #Part (ii): By the properties of the `q -norm and Assumption B.1(v), kSk = kSk2 ≤ (dim φ)1/2−1/q kSkq = Op (1). Analogously, k∂βφ0 Lk ≤ (dim φ)1/2−1/q k∂βφ0 Lkq = OP (N T )1/4 . −1 −1 −1 By Lemma S.4, kH kq/(q−1) = kH kq because H is symmetric, and q −1 −1 −1 −1 −1 = ≤ kH kq/(q−1) kH kq = kH kq = OP (1). H H 2 Analogously, k∂βφφ Lk ≤ k∂βφφ Lkq = OP ((N T ) ) , X X −1 −1 ∂φφ0 φg L [H S]g ≤ ∂φφ0 φg L [H S]g g g q ≤ k∂φφφ Lkq H−1 q kSkq = OP X X −1 −1 ∂φφ0 φg L [H S]g ∂φφ0 φg L [H S]g ≤ g g q −1 ≤ k∂φφφ Lkq H kSkq = OP (N T )−1/4+1/(2q)+ , (N T )−1/4+1/(2q)+ . q −1 e Assumption B.1 guarantees that H H < 1 wpa1. Therefore, H−1 = H −1 e 1 + HH −1 −1 =H −1 ∞ ∞ X X e −1 )s = H−1 − H−1 HH e −1 + H−1 e −1 )s . (−HH (−HH s=0 s=2 8 (S.2) −1 P∞ −1 e s −1 P∞ e −1 )s Note that H , and therefore ≤ H s=2 H H s=2 (−HH −1 3 e 2 −1 H H −1 −1 e −1 = oP (N T )−1/4 , H − H − H HH ≤ −1 e 1 − H H by Assumption B.1(vi) and equation (S.2). −1 −1 −1 and H − H follow immediately. The results for H S.3.1 Legendre Transformed Objective Function We consider the shrinking neighborhood B(rβ , β 0 ) × Bq (rφ , φ0 ) of the true parameters (β 0 , φ0 ). Statement (i) of Lemma S.1 implies that the objective function L(β, φ) is strictly concave in φ in this shrinking neighborhood wpa1. We define L∗ (β, S) = max φ∈Bq (rφ ,φ0 ) [L(β, φ) − φ0 S] , where β ∈ B(rβ , β 0 ) and S ∈ Φ(β, S) = argmax [L(β, φ) − φ0 S] , (S.3) φ∈Bq (rφ ,φ0 ) Rdim φ . The function L∗ (β, S) is the Legendre transformation of the objective function L(β, φ) in the incidental parameter φ. We denote the parameter S as the dual parameter to φ, and L∗ (β, S) as the dual function to L(β, φ). We only consider L∗ (β, S) and Φ(β, S) for parameters β ∈ B(rβ , β 0 ) and S ∈ S(β, Bq (rφ , φ0 )), where the optimal φ is defined by the first order conditions, i.e. is not a boundary solution. 
We define the corresponding set of pairs (β, S) that is dual to B(rβ , β 0 ) × Bq (rφ , φ0 ) by SB r (β 0 , φ0 ) = (β, S) ∈ Rdim β+dim φ : (β, Φ(β, S)) ∈ B(rβ , β 0 ) × Bq (rφ , φ0 ) . Assumption B.1 guarantees that for β ∈ B(rβ , β 0 ) the domain S(β, Bq (rφ , φ0 )) includes S = 0, the origin of Rdim φ , as an interior point, wpa1, and that L∗ (β, S) is four times differentiable in a neighborhood of S = 0 (see Lemma S.2 below). The optimal φ = Φ(β, S) in equation (S.3) satisfies the first order condition S = S(β, φ). Thus, for given β, the functions Φ(β, S) and S(β, φ) are inverse to each other, and the relationship between φ and its dual S is one-to-one. This is a consequence of strict concavity of L(β, φ) in the neighborhood of the true parameter value that we consider here.4 One can show that ∂L∗ (β, S) , ∂S Φ(β, S) = − which shows the dual nature of the functions L(β, φ) and L∗ (β, S). For S = 0 the optimization in (S.3) b b is just over the objective function L(β, φ), so that Φ(β, 0) = φ(β) and L∗ (β, 0) = L(β, φ(β)), the profile objective function. We already introduced S = S(β 0 , φ0 ), i.e. at β = β 0 the dual of φ0 is S, and vica 4 Another consequence of strict concavity of L(β, φ) is that the dual function L∗ (β, S) is strictly convex in S. The original L(β, φ) can be recovered from L∗ (β, S) by again performing a Legendre transformation, namely L(β, φ) = min R S∈ dim φ 9 L∗ (β, S) + φ0 S . b versa. We can write the profile objective function L(β, φ(β)) = L∗ (β, 0) as a Taylor series expansion of L∗ (β, S) around (β, S) = (β 0 , S), namely 1 b L(β, φ(β)) = L∗ (β 0 , S) + (∂β 0 L∗ )∆β − ∆β 0 (∂βS 0 L∗ )S + ∆β 0 (∂ββ 0 L∗ )∆β + . . . , 2 where ∆β = β − β 0 , and here and in the following we omit the arguments of L∗ (β, S) and of its partial derivatives when they are evaluated at (β 0 , S). Analogously, we can obtain Taylor expansions for the b b profile score ∂β L(β, φ(β)) = ∂β L∗ (β, 0) and the estimated nuisance parameter φ(β) = −∂S L∗ (β, 0) in ∆β and S, see the proof of Theorem B.1 below. Apart from combinatorial factors those expansions b feature the same coefficients as the expansion of L(β, φ(β)) itself. They are standard Taylor expansions that can be truncated at a certain order, and the remainder term can be bounded by applying the mean value theorem. The functions L(β, φ) and its dual L∗ (β, S) are closely related. In particular, for given β their first derivatives with respect to the second argument S(β, φ) and Φ(β, S) are inverse functions of each other. We can therefore express partial derivatives of L∗ (β, S) in terms of partial derivatives of L(β, φ). This is done in Lemma S.2. The norms k∂βSSS L∗ (β, S)kq , k∂SSSS L∗ (β, S)kq , etc., are defined as in equation (A.1) and (A.2). Lemma S.2. Let assumption B.1 be satisfied. (i) The function L∗ (β, S) is well-defined and is four times continuously differentiable in SB r (β 0 , φ0 ), wpa1. 
(ii) For L∗ = L∗ (β 0 , S), ∂S L∗ = −φ0 , ∂β L∗ = ∂β L, ∂SS 0 L∗ = −(∂φφ0 L)−1 = H−1 , ∂βS 0 L∗ = −(∂βφ0 L)H−1 , X ∂ββ 0 L∗ = ∂ββ 0 L + (∂βφ0 L)H−1 (∂φ0 β L), ∂SS 0 Sg L∗ = − H−1 (∂φφ0 φh L)H−1 (H−1 )gh , h ∂βk SS 0 L∗ = H−1 (∂βk φ0 φ L)H−1 + X H−1 (∂φg φ0 φ L)H−1 [H−1 ∂βk φ L]g , g ∂βk βl S 0 L∗ = −(∂βk βl φ0 L)H−1 − (∂βl φ0 L)H−1 (∂βk φφ0 L)H−1 − (∂βk φ0 L)H−1 (∂βl φ0 φ L)H−1 X − (∂βk φ0 L)H−1 (∂φg φ0 φ L)H−1 [H−1 ∂βl φ L]g , g ∗ ∂βk βl βm L = ∂βk βl βm L + X (∂βk φ0 L)H−1 (∂φg φ0 φ L)H−1 (∂βl φ L)[H−1 ∂φβm L]g g + (∂βk φ0 L)H−1 (∂βl φ0 φ L)H−1 ∂φβm L + (∂βm φ0 L)H−1 (∂βk φ0 φ L)H−1 ∂φβl L + (∂βl φ0 L)H−1 (∂βm φ0 φ L)H−1 ∂φβk L + (∂βk βl φ0 L)H−1 (∂φ0 βm L) + (∂βk βm φ0 L)H−1 (∂φ0 βl L) + (∂βl βm φ0 L)H−1 (∂φ0 βk L), 10 and ∂SS 0 Sg Sh L∗ = X H−1 (∂φφ0 φf φe L)H−1 (H−1 )gf (H−1 )he f,e +3 X H−1 (∂φφ0 φe L)H−1 (∂φφ0 φf L)H−1 (H−1 )gf (H−1 )he , f,e ∂βk SS 0 Sg L = − X − X − X − X − X − X − X − X ∗ H−1 (∂βk φ0 φ L)H−1 (∂φφ0 φh L)H−1 [H−1 ]gh h H−1 (∂φφ0 φh L)H−1 (∂βk φ0 φ L)H−1 [H−1 ]gh h H−1 (∂φφ0 φh L)H−1 [H−1 (∂βk φ0 φ L)H−1 ]gh h H−1 (∂φf φ0 φ L)H−1 (∂φφ0 φh L)H−1 [H−1 ]gh [H−1 ∂βk φ L]f h,f H−1 (∂φφ0 φh L)H−1 (∂φf φ0 φ L)H−1 [H−1 ]gh [H−1 ∂βk φ L]f h,f H−1 (∂φφ0 φh L)H−1 [H−1 (∂φf φ0 φ L)H−1 ]gh [H−1 ∂βk φ L]f h,f H−1 (∂βk φφ0 φh L)H−1 [H−1 ]gh h H−1 (∂φφ0 φh φf L)H−1 [H−1 ]gh [H−1 (∂βk φ L)]f . h,f (iii) Moreover, k∂βββ L∗ (β, S)k = OP (N T )1/2+1/(2q)+ , sup (β,S)∈SBr (β 0 ,φ0 ) sup (β,S)∈SBr (β 0 ,φ0 ) sup (β,S)∈SBr (β 0 ,φ0 ) sup (β,S)∈SBr (β 0 ,φ0 ) sup (β,S)∈SBr (β 0 ,φ0 ) k∂ββS L∗ (β, S)kq = OP (N T )1/q+ , k∂βSS L∗ (β, S)kq = OP (N T )1/(2q)+ , k∂βSSS L∗ (β, S)kq = OP (N T )1/(2q)+2 , k∂SSSS L∗ (β, S)kq = OP (N T )2 . Proof of Lemma S.2. #Part (i): According to the definition (S.3), L∗ (β, S) = L(β, Φ(β, S))−Φ(β, S)0 S, where Φ(β, S) solves the FOC, S(β, Φ(β, S)) = S, i.e. S(β, .) and Φ(β, .) are inverse functions for every β. Taking the derivative of S(β, Φ(β, S)) = S wrt to both S and β yields [∂S Φ(β, S)0 ][∂φ S(β, Φ(β, S))0 ] = 1, [∂β S(β, Φ(β, S))0 ] + [∂β Φ(β, S)0 ][∂φ S(β, Φ(β, S))0 ] = 0. (S.4) By definition, S = S(β 0 , φ0 ). Therefore, Φ(β, S) is the unique function that satisfies the boundary condition Φ(β 0 , S) = φ0 and the system of partial differential equations (PDE) in (S.4). Those PDE’s 11 can equivalently be written as ∂S Φ(β, S)0 = −[H(β, Φ(β, S))]−1 , ∂β Φ(β, S)0 = [∂βφ0 L(β, Φ(β, S))][H(β, Φ(β, S))]−1 . (S.5) This shows that Φ(β, S) (and thus L∗ (β, S)) are well-defined in any neighborhood of (β, S) = (β 0 , S) in which H(β, Φ(β, S)) is invertible (inverse function theorem). Lemma S.1 shows that H(β, φ) is invertible in B(rβ , β 0 ) × Bq (rφ , φ0 ), wpa1. The inverse function theorem thus guarantee that Φ(β, S) and L∗ (β, S) are well-defined in SB r (β 0 , φ0 ). The partial derivatives of L∗ (β, S) of up to fourth order can be expressed as continuous transformations of the partial derivatives of L(β, φ) up to fourth order (see e.g. proof of part (ii) of the lemma). Hence, L∗ (β, S) is four times continuously differentiable because L(β, φ) is four times continuously differentiable. #Part (ii): Differentiating L∗ (β, S) = L(β, Φ(β, S)) − Φ(β, S)0 S wrt β and S and using the FOC of the maximization over φ in the definition of L∗ (β, S) gives ∂β L∗ (β, S) = ∂β L(β, Φ(β, S)) and ∂S L∗ (β, S) = −Φ(β, S), respectively. Evaluating this expression at (β, S) = (β 0 , S) gives the first two statements of part (ii). 
Using ∂S L∗ (β, S) = −Φ(β, S), the PDE (S.5) can be written as ∂SS 0 L∗ (β, S) = H−1 (β, Φ(β, S)), ∂βS 0 L∗ (β, S) = −[∂βφ0 L(β, Φ(β, S))]H−1 (β, Φ(β, S)). Evaluating this expression at (β, S) = (β 0 , S) gives the next two statements of part (ii). Taking the derivative of ∂β L∗ (β, S) = ∂β L(β, Φ(β, S)) wrt to β and using the second equation of (S.5) gives the next statement when evaluated at (β, S) = (β 0 , S). Taking the derivative of ∂SS 0 L∗ (β, S) = −[∂φφ0 L(β, Φ(β, S))]−1 wrt to Sg and using the first equation of (S.5) gives the next statement when evaluated at (β, S) = (β 0 , S). Taking the derivative of ∂SS 0 L∗ (β, S) = −[∂φφ0 L(β, Φ(β, S))]−1 wrt to βk and using the second equation of (S.5) gives ∂βk SS 0 L∗ (β, S) = H−1 (β, φ)[∂βk φ0 φ L(β, φ)]H−1 (β, φ) X + H−1 (β, φ)[∂φg φ0 φ L(β, φ)]H−1 (β, φ){H−1 (β, φ)[∂βk φ L(β, φ)]}g , (S.6) g where φ = Φ(β, S). This becomes the next statement when evaluated at (β, S) = (β 0 , S). We omit the proofs for ∂βk βl S 0 L∗ , ∂βk βl S L∗ , ∂SS 0 Sg Sh L∗ and ∂βk SS 0 Sg L∗ because they are analogous. #Part (iii): We only show the result for k∂βSS L∗ (β, S)kq , the proof of the other statements is analogous. By equation (S.6) 2 3 k∂βSS L∗ (β, S)kq ≤ H−1 (β, φ)q k∂βφφ L(β, φ)kq + H−1 (β, φ)q k∂φφφ L(β, φ)kq k∂βφ0 L(β, φ)kq , where φ = Φ(β, S). Then, by Lemma S.1 sup (β,S)∈SBr (β 0 ,φ0 ) k∂βSS L∗ (β, S)kq ≤ sup sup β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) −1 H (β, φ)2 k∂βφφ L(β, φ)k q q 3 + H−1 (β, φ)q k∂φφφ L(β, φ)kq k∂βφ0 L(β, φ)kq 12 = O (N T )1/(2q)+ . To derive the rest of the bounds we can use that the expressions from part (ii) hold not only for (β 0 , S), but also for other values (β, S), provided that (β, Φ(β, S) is used as the argument on the rhs expressions. S.3.2 Proofs of Theorem B.1, Corollary B.2, and Theorem B.3 b Proof of Theorem B.1, Part 1: Expansion of φ(β). Let β = βN T ∈ B(β 0 , rβ ). A Taylor expansion of ∂S L∗ (β, 0) around (β 0 , S) gives 1X b φ(β) = −∂S L∗ (β, 0) = −∂S L∗ − (∂Sβ 0 L∗ )∆β + (∂SS 0 L∗ )S − (∂SS 0 Sg L∗ )SSg + Rφ (β), 2 g where we first expand in β holding S = S fixed, and then expand in S. For any v ∈ Rdim φ the remainder term satisfies 0 φ v R (β) = v 0 X 1X [∂Sβ 0 βk L∗ (β̃, S)](∆β)(∆βk ) + [∂SS 0 βk L∗ (β 0 , S̃)]S(∆βk ) 2 k k 1X ∗ 0 [∂SS 0 Sg Sh L (β , S̄)]SSg Sh , + 6 − g,h where β̃ is between β 0 and β, and S̃ and S̄ are between 0 and S. By part (ii) of Lemma S.2, X b (∂φφ0 φg L)H−1 S(H−1 S)g + Rφ (β). φ(β) − φ0 = H−1 (∂φβ 0 L)∆β + H−1 S + 12 H−1 g Using that the vector norm k.kq/(q−1) is the dual to the vector norm k.kq , Assumption B.1, and Lemmas S.1 and S.2 yields φ R (β) = sup q v 0 Rφ (β) kvkq/(q−1) =1 1 1 ∂Sββ L∗ (β̃, S) k∆βk2 + ∂SSβ L∗ (β 0 , S̃) kSkq k∆βk + ∂SSSS L∗ (β 0 , S̄)q kSk3q 2 6 q q h i 1/q+ −1/4+1/q+ −3/4+3/(2q)+2 = OP (N T ) rβ k∆βk + (N T ) k∆βk + (N T ) = oP (N T )−1/2+1/(2q) + oP (N T )1/(2q) kβ − β 0 k , ≤ uniformly over β ∈ B(β 0 , rβ ) by Lemma S.2. Proof of Theorem B.1, Part 2: Expansion of profile score. Let β = βN T ∈ B(β 0 , rβ ). A Taylor expansion of ∂β L∗ (β, 0) around (β 0 , S) gives b ∂β L(β, φ(β)) = ∂β L∗ (β, 0) = ∂β L∗ + (∂ββ 0 L∗ )∆β − (∂βS 0 L∗ )S + 1X (∂βS 0 Sg L∗ )SSg + R1 (β), 2 g where we first expand in β for fixed S = S, and then expand in S. For any v ∈ Rdim β the remainder term satisfies X X 1 [∂ββ 0 βk L∗ (β̃, S)](∆β)(∆βk ) − [∂ββk S 0 L∗ (β 0 , S̃)]S(∆βk ) v R1 (β) = v 2 k k 1X ∗ 0 − [∂βS 0 Sg Sh L (β , S̄)]SSg Sh , 6 0 0 g,h 13 where β̃ is between β 0 and β, and S̃ and S̄ are between 0 and S. 
By Lemma S.2, b ∂β L(β, φ(β)) = ∂β L + ∂ββ 0 L + (∂βφ0 L)H−1 (∂φ0 β L) (β − β 0 ) + (∂βφ0 L)H−1 S 1X ∂βφ0 φg L + [∂βφ0 L] H−1 [∂φφ0 φg L] [H−1 S]g H−1 S + R1 (β), + 2 g where for any v ∈ Rdim β , kR1 (β)k = sup v 0 R1 (β) kvk=1 ≤ 1 ∂βββ L∗ (β̃, S) k∆βk2 + (N T )1/2−1/q ∂ββS L∗ (β 0 , S̃) kSkq k∆βk 2 q 1 + (N T )1/2−1/q ∂βSSS L∗ (β 0 , S̄)q kSk3q 6h = OP (N T )1/2+1/(2q)+ rβ k∆βk + (N T )1/4+1/(2q)+ k∆βk + (N T )−1/4+1/q+2 √ = oP (1) + oP ( N T kβ − β 0 k), i uniformly over β ∈ B(β 0 , rβ ) by Lemma S.2. We can also write √ −1 b e −1 S − (∂βφ0 L)H−1 HH e −1 S dβ L(β, φ(β)) = ∂β L − N T W (∆β) + (∂βφ0 L)H S + (∂βφ0 L)H −1 1 X −1 −1 + ∂βφ0 φg L + [∂βφ0 L] H [∂φφ0 φg L] [H S]g H S + R(β), 2 g √ = U − N T W (∆β) + R(β), where we decompose the term linear in S into multiple terms by using that h ih i e H−1 − H−1 HH e −1 + . . . . −(∂βS 0 L∗ ) = (∂βφ0 L)H−1 = (∂βφ0 L) + (∂βφ0 L) The new remainder term is i h −1 e R(β) = R1 (β) + (∂ββ 0 L)∆β + (∂βφ0 L)H−1 (∂φ0 β L) − (∂βφ0 L)H (∂φ0 β L) ∆β h −1 i −1 e −1 S − (∂βφ0 L)H e −1 S e −1 HH + (∂βφ0 L) H−1 − H − H HH " # X 1 X −1 −1 −1 −1 + ∂βφ0 φg L[H S]g H S − ∂βφ0 φg L[H S]g H S 2 g g X 1 X −1 −1 −1 + [∂βφ0 L] H−1 [∂φφ0 φg L][H−1 S]g H−1 S − [∂βφ0 L] H [∂φφ0 φg L][H S]g H S . 2 g g 14 By Assumption B.1 and Lemma S.1, −1 kR(β)k ≤ kR1 (β)k + ∂ββ 0 Le k∆βk + k∂βφ0 Lk H−1 − H k∂φ0 β Lk k∆βk −1 + ∂βφ0 Le H k∂φ0 β Lk + ∂φ0 β L k∆βk −1 −1 e −1 2 e −1 + k∂βφ0 Lk H−1 − H − H HH kSk kSk + H ∂βφ0 Le H −1 1 −1 + k∂βφφ Lk H−1 + H H−1 − H kSk2 2 1 −1 2 + H ∂βφφ Le kSk2 2 X X 1 −1 −1 −1 −1 −1 −1 0 0 0 0 + [∂ L] H [∂ L][H S] H S − [∂ L] H [∂ L][H S] H S βφ φφ φg g βφ φφ φg g 2 g g i h √ = kR1 (β)k + oP (1) + oP ( N T kβ − β 0 k) + OP (N T )−1/8++1/(2q) √ = oP (1) + oP ( N T kβ − β 0 k), uniformly over β ∈ B(β 0 , rβ ). Here we use that X X −1 −1 −1 [∂βφ0 L] H [∂φφ0 φg L][H S]g H S [∂βφ0 L] H−1 [∂φφ0 φg L][H−1 S]g H−1 S − g g X −1 −1 ≤ k∂βφ0 Lk H−1 − H H−1 + H kSk ∂φφ0 φg L [H−1 S]g g X −1 −1 −1 ∂φφ0 φg L [H S]g + k∂βφ0 Lk H−1 − H H kSk g X −1 −1 2 ∂φφ0 φg L [H S]g + ∂βφ0 Le H kSk g X −1 −1 −1 e 0 + ∂βφ L H ∂φφg φh L [H S]g [H S]h . g,h Proof of Corollary B.2. βb solves the FOC b φ( b β)) b = 0. ∂β L(β, By βb − β 0 = oP (rβ ) and Theorem B.1, √ N T (βb − β 0 ) + oP (1) + oP ( N T kβb − β 0 k). √ √ √ −1 −1 Thus, N T (βb − β 0 ) = W U + oP (1) + oP ( N T kβb − β 0 k) = W ∞ U + oP (1) + oP ( N T kβb − β 0 k), b φ( b β)) b =U −W 0 = ∂β L(β, √ −1 −1 where we use that W = W ∞ + oP (1) is invertible wpa1 and that W = W ∞ + oP (1). We conclude √ √ −1 0 b b that N T (β − β ) = OP (1) because U = OP (1), and therefore N T (β − β 0 ) = W ∞ U + oP (1). b Proof of Theorem B.3. # Consistency of φ(β): Let η = ηN T > 0 be such that η = oP (rφ ), (N T )−1/4+1/(2q) = oP (η), and (N T )1/(2q) rβ = oP (η). For β ∈ B(rβ , β 0 ), define φb∗ (β) := argmin {φ: kφ−φ0 kq ≤η} 15 kS(β, φ)kq . (S.7) Then, kS(β, φb∗ (β))kq ≤ kS(β, φ0 )kq , and therefore by a Taylor expansion of S(β, φ0 ) around β = β 0 , kS(β, φb∗ (β)) − S(β, φ0 )kq ≤ kS(β, φb∗ (β))kq + kS(β, φ0 )kq ≤ 2kS(β, φ0 )kq ≤ 2kSkq + 2 ∂φβ 0 L(β̃, φ0 ) kβ − β 0 k q h i = OP (N T )−1/4+1/(2q) + (N T )1/(2q) kβ − β 0 k , uniformly over β ∈ B(rβ , β 0 ), where β̃ is between β 0 and β, and we use Assumption B.1(v) and Lemma S.1. Thus, sup h i kS(β, φb∗ (β)) − S(β, φ0 )kq = OP (N T )−1/4+1/(2q) + (N T )1/(2q) rβ . 
β∈B(rβ ,β 0 ) By a Taylor expansion of Φ(β, S) around S = S(β, φ0 ), b∗ φ (β) − φ0 = Φ(β, S(β, φb∗ (β))) − Φ(β, S(β, φ0 )) ≤ ∂S Φ(β, S̃)0 S(β, φb∗ (β)) − S(β, φ0 ) q q q q −1 ∗ 0 ∗ 0 b b = H (β, Φ(β, S̃)) S(β, φ (β)) − S(β, φ ) = OP (1) S(β, φ (β)) − S(β, φ ) , q q q where S̃ is between S(β, φb∗ (β)) and S(β, φ0 ) and we use Lemma S.1(i). Thus, h i sup φb∗ (β) − φ0 = OP (N T )−1/4+1/(2q) + (N T )1/(2q) rβ = oP (η). q β∈B(rβ ,β 0 ) This shows that φb∗ (β) is an interior solution of the minimization problem (S.7), wpa1. Thus, S(β, φb∗ (β)) = 0, because the objective function L(β, φ) is strictly concave and differentiable, and therefore φb∗ (β) = b b φ(β). We conclude that sup φ(β) − φ0 = OP (η) = oP (rφ ). β∈B(rβ ,β 0 ) q b We have already shown that Assumption B.1(ii) is satisfied, in addition to the # Consistency of β: remaining parts of Assumption B.1, which we assume. The bounds on the spectral norm in Assumption B.1(vi) and in part (ii) of Lemma S.1 can be used to show that U = OP ((N T )1/4 ). First, we consider the case dim(β) = 1 first. The extension to dim(β) > 1 is discussed below. Let −1 η = 2(N T )−1/2 W |U |. Our goal is to show that βb ∈ [β 0 − η, β 0 + η]. By Theorem B.1, √ √ √ √ b 0 + η)) = U − W N T η + oP (1) + oP ( N T η) = oP ( N T η) − W N T η, ∂β L(β 0 + η, φ(β √ √ √ √ b 0 − η)) = U + W N T η + oP (1) + oP ( N T η) = oP ( N T η) + W N T η, ∂β L(β 0 − η, φ(β and therefore for sufficiently large N, T b 0 + η)) ≤ 0 ≤ ∂β L(β 0 − η, φ(β b 0 − η)). ∂β L(β 0 + η, φ(β b φ( b β)) b = 0, for sufficiently large N, T , Thus, since ∂β L(β, b 0 + η)) ≤ ∂β L(β, b φ( b β)) b ≤ ∂β L(β 0 − η, φ(β b 0 − η)). ∂β L(β 0 + η, φ(β b The profile objective L(β, φ(β)) is strictly concave in β because L(β, φ) is strictly concave in (β, φ). b Thus, ∂β L(β, φ(β)) is strictly decreasing. The previous set of inequalities implies that for sufficiently large N, T β 0 + η ≥ βb ≥ β 0 − η. 16 We conclude that kβb − β 0 k ≤ η = OP ((N T )−1/4 ). This concludes the proof for dim(β) = 1. To generalize the proof to dim(β) > 1 we define β± = β 0 ± η 0 b β−β . 0k b kβ−β Let hβ− , β+ i = {rβ− + (1 − r)β+ | r ∈ [0, 1]} be the line segment between β− and β+ . By restricting attention to values β ∈ hβ− , β+ i we can repeat the above argument for the case dim(β) = 1 and thus show that βb ∈ hβ− , β+ i, which implies kβb − β 0 k ≤ η = OP ((N T )−1/4 ). S.3.3 Proof of Theorem B.4 Proof of Theorem B.4. A Taylor expansion of ∆(β, φ) around (β 0 , φ0 ) yields ∆(β, φ) = ∆ + [∂β 0 ∆](β − β 0 ) + [∂φ0 ∆](φ − φ0 ) + 21 (φ − φ0 )0 [∂φφ0 ∆](φ − φ0 ) + R1∆ (β, φ), with remainder term R1∆ (β, φ) = 12 (β − β 0 )0 [∂ββ 0 ∆(β̄, φ)](β − β 0 ) + (β − β 0 )0 [∂βφ0 ∆(β 0 , φ̃)](φ − φ0 ) X (φ − φ0 )0 [∂φφ0 φg ∆(β 0 , φ̄)](φ − φ0 )[φ − φ0 ]g , + 16 g where β̄ is between β and β 0 , and φ̃ and φ̄ are between φ and φ0 . b β) b in Theorem B.1, By assumption, kβb − β 0 k = oP ((N T )−1/4 ), and by the expansion of φb = φ( 3 2 b kφb − φ0 kq ≤ H−1 q kSkq + H−1 q k∂φβ 0 Lkq βb − β 0 + 12 H−1 q k∂φφφ Lkq kSkq + Rφ (β) q −1/4+1/(2q) = OP ((N T ) b φ), b b∆ := R∆ (β, Thus, for R 1 1 b∆ 1 b R1 ≤ 2 kβ − β 0 k2 q ). sup k∂ββ 0 ∆(β, φ)k sup β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) + (N T )1/2−1/q kβb − β 0 kkφb − φ0 kq sup sup β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) + 16 (N T )1/2−1/q kφb − φ0 k3q √ = oP (1/ N T ). sup sup β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) b β) b from Theorem B.1, Again by the expansion of φb = φ( b φ) b − ∆ = ∂β 0 ∆ + [∂φ ∆]0 H−1 [∂φβ 0 L] (βb − β 0 ) δb − δ = ∆(β, ! 
dim Xφ [∂φφ0 φg L]H−1 S[H−1 S]g + + [∂φ ∆]0 H−1 S + 12 1 2 k∂βφ0 ∆(β, φ)kq k∂φφφ ∆(β, φ)kq S 0 H−1 [∂φφ0 ∆]H−1 S + R2∆ , g=1 where ∆ ∆ b + 1 (φb − φ0 + H−1 S)0 [∂φφ0 ∆](φb − φ0 − H−1 S) R2 = R1 + [∂φ ∆]0 Rφ (β) 2 b ≤ R1∆ + (N T )1/2−1/q k∂φ ∆kq Rφ (β) q 1/2−1/q b 0 −1 1 + 2 (N T ) φ − φ + H S k∂φφ0 ∆kq φb − φ0 − H−1 S q q √ = oP (1/ N T ), 17 (S.8) that uses φb − φ0 − H−1 S = OP (N T )−1/2+1/q+ . From equation (S.8), the terms of the expansion q for δb − δ are analogous to the terms of the expansion for the score in Theorem B.1, with ∆(β, φ) taking the role of S.4 √1 NT ∂βk L(β, φ). Proofs of Appendix C (Theorem C.1) Proof of Theorem C.1, Part (i). Assumption B.1(i) is satisfied because limN,T →∞ −1 κ+κ dim φ √ NT = limN,T →∞ N +T √ NT = . Assumption B.1(ii) is satisfied because `it (β, π) and (v 0 φ)2 are four times continuously differentiable and the same is true φ). ∗for L(β, ∗ −1 Let D = diag H(αα) , H(γγ) . Then, D = OP (1) by Assumption 4.1(v). By the properties ∞ −1 −1 −1 −1 = OP (1). Thus, of the matrix norms and Lemma D.1, H − D ≤ (N + T ) H − D ∞ max −1 −1 −1 −1 −1 H ≤ H ≤ D + H − D = OP (1) by Lemma S.4 and the triangle inequality. ∞ q ∞ ∞ We conclude that Assumption B.1(iv) holds. We now show that the assumptions of Lemma S.7 are satisfied: P (i) By Lemma S.2, χi = √1T t ∂βk `it satisfies Eφ (χ2i ) ≤ B. Thus, by independence across i 2 !2 X X 1 1 X 1 ∂βk `it = Eφ √ χi = Eφ χ2i ≤ B, Eφ √ N i N T i,t N i and therefore √1 NT P i,t ∂βk `it = OP (1). Analogously, 1 NT P i,t √ {∂βk βl `it − Eφ [∂βk βl `it ]} = OP (1/ N T ) = oP (1). Next, 2 X 1 Eφ sup sup ∂βk βl βm `it (β, πit ) β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) N T i,t 2 2 X X 1 1 |∂βk βl βm `it (β, πit )| ≤ Eφ M (Zit ) ≤ Eφ sup sup N T i,t β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) N T i,t ≤ Eφ 1 X 1 X M (Zit )2 = Eφ M (Zit )2 = OP (1), N T i,t N T i,t and therefore supβ∈B(rβ ,β 0 ) supφ∈Bq (rφ ,φ0 ) P gives N1T i,t ∂βk βl `it = OP (1). 1 NT P i,t 18 ∂βk βl βm `it (β, πit ) = OP (1). A similar argument (ii) For ξit (β, φ) = ∂βk π `it (β, πit ) or ξit (β, φ) = ∂βk βl π `it (β, πit ), q # " 1 X 1 X Eφ sup sup ξit (β, φ) T N 0 0 β∈B(rβ ,β ) φ∈Bq (rφ ,φ ) t i " !q # 1X 1 X ≤ Eφ sup sup |ξit (β, φ)| N i β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) T t " !q # # " 1X 1 X 1X 1 X ≤ Eφ M (Zit ) M (Zit )q ≤ Eφ T t N i T t N i 1X 1 X Eφ M (Zit )q = OP (1), = T t N i q P P i.e. supβ∈B(rβ ,β 0 ) supφ∈Bq (rφ ,φ0 ) T1 t N1 i ξit (β, φ) = OP (1). Analogously, it follows that q P P supβ∈B(rβ ,β 0 ) supφ∈Bq (rφ ,φ0 ) N1 i T1 t ξit (β, φ) = OP (1). (iii) For ξit (β, φ) = ∂πr `it (β, πit ), with r ∈ {3, 4}, or ξit (β, φ) = ∂βk πr `it (β, πit ), with r ∈ {2, 3}, or ξit (β, φ) = ∂βk βl π2 `it (β, πit ), !(8+ν) X 1 |ξit (β, φ)| Eφ sup sup max T t β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) i !(8+ν) X 1 = Eφ max sup sup |ξit (β, φ)| i β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) T t !(8+ν) !(8+ν) X X 1X X 1 ≤ Eφ |ξit (β, φ)| M (Zit ) ≤ Eφ sup sup T t β∈B(rβ ,β 0 ) φ∈Bq (rφ ,φ0 ) T t i i " # X1X X1X ≤ Eφ M (Zit )(8+ν) = Eφ M (Zit )(8+ν) = OP (N ). T T t t i i 1 T |ξit (β, φ)| = OP N 1/(8+ν) = OP N 2 . Analo P gously, it follows that supβ∈B(rβ ,β 0 ) supφ∈Bq (rφ ,φ0 ) maxt N1 i |ξit (β, φ)| = OP N 2 . P (iv) Let χt = √1N i ∂π `it . By cross-sectional independence and Eφ (∂π `it )8 ≤ Eφ M (Zit )8 = OP (1), q P P P Eφ χ8t = OP (1) uniformly over t. Thus, Eφ T1 t χ8t = OP (1) and therefore T1 t √1N i ∂π `it = Thus, supβ∈B(rβ ,β 0 ) supφ∈Bq (rφ ,φ0 ) maxi P t OP (1), with q = 8. P 0 ). By Lemma S.2 and Eφ (∂π `it )8+ν ≤ Eφ M (Zit )8+ν = OP (1), Eφ χ8i = Let χi = √1T t ∂π `it (β 0 , πit OP (1) uniformly over i. 
Here we use µ > 4/[1 − 8/(8 + ν)] = 4(8 + ν)/ν that q is imposed in P 8 P 1 P 1 1 Assumption B.1. Thus, Eφ N i χi = OP (1) and therefore N i √T t ∂π `it = OP (1), with q = 8. The proofs for 1 T 2 P 1 P √ t N i ∂βk π `it − Eφ [∂βk π `it ] = OP (1) and 1 N 2 P 1 P √ i T t ∂βk π `it −Eφ [∂βk π `it ] = OP (1) are analogous. (v) It follows by the independence of {(`i1 , . . . , `iT ) : 1 ≤ i ≤ N } across i, conditional on φ, in Assumption B.1(ii). 0 0 (vi) Let ξit = ∂πr `it (β 0 , πit ) − Eφ [∂πr `it ], with r ∈ {2, 3}, or ξit = ∂βk π2 `it (β 0 , πit ) − Eφ ∂βk π2 `it . For 19 8+ν̃ ν̃ = ν, maxi Eφ ξit = OP (1) by assumption. By Lemma S.1, X X |Covφ (ξit , ξis )| Eφ [ξit ξis ] = s s X 1/(8+ν) 1/(8+ν) ≤ [8 a(|t − s|)]1−2/(8+ν) Eφ |ξt |8+ν Eφ |ξs |8+ν s = C̃ ∞ X m−µ[1−2/(8+ν)] ≤ C̃ m=1 ∞ X m−4 = C̃π 4 /90, m=1 where C̃ is a constant. Here we use that µ > 4(8 + ν)/ν implies µ[1 − 2/(8 + ν) > 4. We thus have P shown maxi maxt s Eφ [ξit ξjs ] ≤ C̃π 4 /90 =: C. h i8 P √1 ξ ≤ C, Analogous to the proof of part (iv), we can use Lemma S.2 to obtain maxi Eφ it t T h i8 P √1 and independence across i to obtain maxt Eφ ≤ C. Similarly, by Lemma S.2 i ξit N " #4 1 X √ [ξit ξjt − Eφ (ξit ξjt )] ≤ C, max Eφ i,j T t which requires µ > 2/[1 − 4/(4 + ν/2)], which is implied by the assumption that µ > 4(8 + ν)/ν. −1 (vii) We have already shown that H = OP (1). q Therefore, we can apply Lemma S.7, which shows that Assumption B.1(v) and (vi) hold. We have already shown that Assumption B.1(i), (ii), (iv), (v) and (vi) hold. One can also check that (N T )−1/4+1/(2q) = oP (rφ ) and (N T )1/(2q) rβ = oP (rφ ) are satisfied. In addition, L(β, φ) is strictly concave. We can therefore invoke Theorem B.3 to show that Assumption B.1(iii) holds and that kβb − β 0 k = OP ((N T )−1/4 ). Proof of Theorem C.1, Part (ii). For any N ×T matrix A we define the N ×T matrix PA as follows (PA)it = αi∗ + γt∗ , (α∗ , γ ∗ ) ∈ argmin X α,γ e A = PÃ, P where Ãit = (S.1) i,t Here, the minimization is over α ∈ RN and γ ∈ RT . The operator PP = P. It is also convenient to define 2 Eφ (−∂π2 `it ) (Ait − αi − γt ) . P is a linear projection, i.e. we have Ait . Eφ (−∂π2 `it ) (S.2) e is a linear operator, but not a projection. Note that Λ and Ξ defined in (C.1) and (4.3) can be written P e A and Ξk = P e Bk , where Ait = −∂π `it and Bk,it = −Eφ (∂β π `it ), for k = 1, . . . , dim β.5 as Λ = P k By Lemma S.8(ii), W =−√ 5 N T 1 1 XX −1 ∂ββ 0 L + [∂βφ0 L] H [∂φβ 0 L] = − [Eφ (∂ββ 0 `it ) + Eφ (−∂π2 `it ) Ξit Ξ0it ] . N T i=1 t=1 NT Bk and Ξk are N × T matrices with entries Bk,it and Ξk,it , respectively, while Bit and Ξit are dim β-vectors with entries Bk,it and Ξk,it . 20 By Lemma S.8(i), U (0) = ∂β L + [∂βφ0 L] H −1 S=√ N T 1 X 1 XX (∂β `it − Ξit ∂π `it ) = √ Dβ `it . N T i,t N T i=1 t=1 We decompose U (1) = U (1a) + U (1b) , with e H−1 S, e −1 S − [∂βφ0 L] H−1 H U (1a) = [∂βφ0 L]H U (1b) = dim Xφ ∂βφ0 φg L + [∂βφ0 L] H −1 −1 −1 [∂φφ0 φg L] H S[H S]g /2. g=1 By Lemma S.8(i) and (iii), U (1a) = − √ N T 1 XX 1 X Λit ∂βπ `˜it + Ξit ∂π2 `˜it = − √ Λit [Dβπ `it − Eφ (Dβπ `it )] , N T i,t N T i=1 t=1 and U (1b) = i h X 1 −1 √ Λ2it Eφ (∂βπ2 `it ) + [∂βφ0 L] H Eφ (∂φ ∂π2 `it ) , 2 N T i,t where for each i, t, ∂φ ∂π2 `it is a dim φ-vector, which can be written as ∂φ ∂π2 `it = A1T A0 1N for an N × T matrix A with elements Ajτ = ∂π3 `jτ if j = i and τ = t, and Ajτ = 0 otherwise. Thus, Lemma S.8(i) P −1 gives [∂βφ0 L] H ∂φ ∂π2 `it = − j,τ Ξjτ 1(i = j)1(t = τ )∂π3 `it = −Ξit ∂π3 `it . Therefore U (1b) = T N X X X 1 1 √ Λ2it Eφ (Dβπ2 `it ). 
Λ2it Eφ ∂βπ2 `it − Ξit ∂π3 `it = √ 2 N T i,t 2 N T i=1 t=1 Proof of Theorem C.1, Part (iii). Showing that Assumption B.2 is satisfied is analogous to the proof of Lemma S.7 and of part (ii) of this Theorem. In the proof of Theorem 4.1 we show that Assumption 4.1 implies that U = OP (1). This fact √ together with part (i) of this theorem show that Corollary B.2 is applicable, so that N T (βb − β 0 ) = −1 W ∞ U + oP (1) = OP (1), and we can apply Theorem B.4. √ By Lemma S.8 and the result for N T (βb − β 0 ), 0 h i X √ 1 −1 −1 N T ∂β 0 ∆ + (∂φ0 ∆)H (∂φβ 0 L) (βb − β 0 ) = Eφ (Dβ ∆it ) W ∞ U (0) + U (1) + oP (1). N T i,t (S.3) (0) (1) We apply Lemma S.8 to U∆ and U∆ defined in Theorem B.4 to give √ 1 X (0) N T U∆ = − √ Eφ (Ψit )∂π `it , N T i,t √ 1 X (1) N T U∆ = √ Λit [Ψit ∂π2 `it − Eφ (Ψit )Eφ (∂π2 `it )] N T i,t X 1 + √ Λ2it [Eφ (∂π2 ∆it ) − Eφ (∂π3 `it )Eφ (Ψit )] . 2 N T i,t 21 (S.4) The derivation of (S.3) and (S.4) is analogous to the proof of the part (ii) of the Theorem. Combining Theorem B.4 with equations (S.3) and (S.4) gives the result. S.5 Proofs of Appendix D (Lemma D.1) The following Lemmas are useful to prove Lemma D.1. Let L∗ (β, φ) = (N T )−1/2 P i,t `it (β, αi + γt ). Lemma S.1. If the statement of Lemma D.1 holds for some constant b > 0, then it holds for any constant b > 0. ∗ h i ∗ ∗ ∂2 ∗ vv 0 , where H = Eφ − ∂φ∂φ . Since H v = 0, 0L √ √ † ∗ † b ∗ † NT NT ∗ † 0 0 = H vv = H vv 0 , + √ vv = H + + 0 2 bkvv k b(N + T )2 NT Proof of Lemma S.1. Write H = H + H −1 √b NT where † refers to the Moore-Penrose pseudo-inverse.Thus, if H1 is the expected Hessian √ for b = b1 > 0 −1 −1 1 1 NT 0 and H2 is the expected Hessian for b = b2 > 0, H1 − H2 = b1 − b2 (N +T = )2 vv max max −1/2 O (N T ) . Lemma S.2. Let Assumption 4.1 hold and let 0 < b ≤ bmin 1 + −1 H(αα) H(αγ) ∞ <1− b bmax max(N,T ) bmax min(N,T ) bmin −1 , and H(γγ) H(γα) ∞ <1− −1 . Then, b . bmax Proof of Lemma S.2. Let hit = Eφ (−∂π2 `it ), and define X hjt − b 1 P . P P −1 b−1 + j ( τ hjτ ) τ hjτ j √ √ ∗ ∗ ∗ By definition, H(αα) = H(αα) + b1N 10N / N T and H(αγ) = H(αγ) − b1N 10T / N T . The matrix H(αα) √ √ P ∗ is diagonal with elements t hit / N T . The matrix H(αγ) has elements hit / N T . The Woodbury h̃it = hit − b − identity states that −1 ∗−1 ∗−1 H(αα) = H(αα) − H(αα) 1N √ ∗−1 N T b−1 + 10N H(αα) 1N −1 ∗−1 10N H(αα) . √ −1 ∗−1 Then, H(αα) H(αγ) = H(αα) H̃/ N T , where H̃ is the N × T matrix with elements h̃it . Therefore P t h̃it −1 . H(αα) H(αγ) = max P i ∞ t hit Assumption 4.1(iv) guarantees that bmax ≥ hit ≥ bmin , which implies hjt − b ≥ bmin − b > 0, and 1 X hjt − b N bmax P h̃it > hit − b − −1 ≥ bmin − b 1 + ≥ 0. b T bmin τ hjτ j We conclude that −1 H(αα) H(αγ) ∞ P X hjt − b X h̃ 1 1 it b + P = max Pt = 1 − min P P P −1 i i h jτ b−1 + j ( τ hjτ ) t hit t hit t τ j b <1− . bmax −1 Analogously, H(γγ) H(γα) < 1 − ∞ b bmax . 22 −1 . Then, b ≤ b 1+ Proof of Lemma D.1. We choose b < bmin 1 + max(κ2 , κ−2 ) bbmax min min max(N,T ) bmax min(N,T ) bmin −1 for large enough N and T , so that Lemma S.2 becomes applicable. The choice of b has no effect on the general validity of the lemma for all b > 0 by Lemma S.1. By the inversion formula for partitioned matrices, H −1 = −1 ! A −A H(αγ) H(γγ) −H(γγ) H(γα) A H(γγ) + H(γγ) H(γα) A H(αγ) H(γγ) −1 −1 −1 , −1 −1 with A := (H(αα) − H(αγ) H(γγ) H(γα) )−1 . 
The Woodbury identity states that −1 ∗−1 ∗−1 H(αα) = H(αα) − H(αα) 1N | −1 H(γγ) = ∗−1 H(γγ) − ∗−1 H(γγ) 1T √ ∗−1 N T /b + 10N H(αα) 1N {z −1 =:C(αα) √ | ∗−1 N T /b + 10T H(γγ) 1T {z −1 =:C(γγ) ∗−1 10N H(αα) , } ∗−1 10T H(γγ) . } √ ∗−1 ∗−1 ∗ By Assumption 4.1(v), kH(αα) k∞ = OP (1), kH(γγ) k∞ = OP (1), kH(αγ) kmax = OP (1/ N T ). Therefore6 ∗−1 kC(αα) kmax ≤ kH(αα) k2∞ k1N 10N kmax −1 √ ∗−1 N T /b + 10N H(αα) 1N −1 √ = OP (1/ N T ), ∗−1 kH(αα) k∞ ≤ kH(αα) k∞ + N kC(αα) kmax = OP (1). √ −1 ∗ Analogously, kC(γγ) kmax = OP (1/ N T ) and kH(γγ) k∞ = OP (1). Furthermore, kH(αγ) kmax ≤ kH(αγ) kmax + √ √ b/ N T = OP (1/ N T ). Define B := −1 −1 1N − H(αα) H(αγ) H(γγ) H(γα) −1 − 1N = ∞ X −1 −1 H(αα) H(αγ) H(γγ) H(γα) n . n=1 −1 −1 ∗−1 −1 −1 −1 Then, A = H(αα) + H(αα) B = H(αα) − C(αα) + H(αα) B. By Lemma S.2, kH(αα) H(αγ) H(γγ) H(γα) k∞ ≤ 2 −1 −1 b kH(αα) H(αγ) k∞ kH(γγ) H(γα) k∞ < 1 − bmax < 1, and kBkmax ≤ ≤ ∞ X −1 −1 kH(αα) H(αγ) H(γγ) H(γα) k∞ n=0 "∞ X b n=0 bmax 1− 2n # n −1 −1 kH(αα) k∞ kH(αγ) k∞ kH(γγ) k∞ kH(γα) kmax √ −1 −1 T kH(αα) k∞ kH(γγ) k∞ kH(γα) k2max = OP (1/ N T ). By the triangle inequality, −1 −1 kAk∞ ≤ kH(αα) k∞ + N kH(αα) k∞ kBkmax = OP (1). Thus, for the different blocks of !−1 ∗ H(αα) 0 −1 H − = ∗ 0 H(γγ) 6 ∗−1 −1 A − H(αα) −1 −H(γγ) H(γα) A −A H(αγ) H(γγ) −1 −1 H(γγ) H(γα) A H(αγ) H(γγ) − C(γγ) ! , Here and in the following me make use of the inequalities kABkmax < kAk∞ kBkmax , kABkmax < kAkmax kB 0 k∞ , kAk∞ ≤ nkAkmax , which hold for any m × n matrix A and n × p matrix B. 23 we find ∗−1 A − H(αα) max −1 = H(αα) B − C(αα) max √ − kC(αα) kmax = OP (1/ N T ), √ −1 ≤ kAk∞ kH(αγ) kmax kH(γγ) k∞ = OP (1/ N T ), ≤ −1 kH(αα) k∞ kBkmax −1 −A H(αγ) H(γγ) max −1 −1 −1 ≤ kH(γγ) k2∞ kH(γα) k∞ kAk∞ kH(αγ) kmax + kC(γγ) kmax H(γγ) H(γα) A H(αγ) H(γγ) − C(γγ) max √ −1 ≤ N kH(γγ) k2∞ kAk∞ kH(αγ) k2max + kC(γγ) kmax = OP (1/ N T ). √ The bound OP (1/ N T ) for the max-norm of each block of the matrix yields the same bound for the max-norm of the matrix itself. S.6 S.6.1 Useful Lemmas Some Properties of Stochastic Processes Here we collect some known properties of α-mixing processes, which are useful for our proofs. Lemma S.1. Let {ξt } be an α-mixing process with mixing coefficients a(m). Let E|ξt |p < ∞ and E|ξt+m |q < ∞ for some p, q ≥ 1 and 1/p + 1/q < 1. Then, |Cov (ξt , ξt+m )| ≤ 8 a(m)1/r [E|ξt |p ] 1/p [E|ξt+m |q ] 1/q , where r = (1 − 1/p − 1/q)−1 . Proof of Lemma S.1. See, for example, Proposition 2.5 in Fan and Yao (2003). The following result is a simple modification of Theorem 1 in Cox and Kim (1995). Lemma S.2. Let {ξt } be an α-mixing process with mixing coefficients a(m). Let r ≥ 1 be an integer, δ and let δ > 2r, µ > r/(1 − 2r/δ), c > 0 and C > 0. Assume that supt E |ξt | ≤ C and that a(m) ≤ c m−µ for all m ∈ {1, 2, 3, . . .}. Then there exists a constant B > 0 depending on r, δ, µ, c and C, but not depending on T or any other distributional characteristics of ξt , such that for any T > 0, !2r T X 1 ≤ B. ξt E √ T t=1 The following is a central limit theorem for martingale difference sequences. Lemma S.3. Consider the scalar process ξit = ξN T,it , i = 1, . . . , N , t = 1, . . . , T . Let {(ξi1 , . . . , ξiT ) : 1 ≤ i ≤ N } be independent across i, and be a martingale difference sequence for each i, N , T . Let E|ξit |2+δ be uniformly bounded across i, t, N, T for some δ > 0. Let σ = σ N T > ∆ > 0 for all sufficiently P 2 large N T , and let N1T i,t ξit − σ 2 →P 0 as N T → ∞.7 Then, X 1 √ ξit →d N (0, 1). 
σ N T i,t 7 Here can allow for an arbitrary sequence of (N, T ) with N T → ∞. 24 Proof of Lemma S.3. Define ξm = ξM,m = ξN T,it , with M = N T and m = T (i − 1) + t ∈ {1, . . . , M }. Then {ξm , m = 1, . . . , M } is a martingale difference sequence. With this redefinition the statement of the Lemma is equal to Corollary 5.26 in White (2001), which is based on Theorem 2.3 in Mcleish (1974), PM and which shows that σ √1M m=1 ξm →d N (0, 1). S.6.2 Some Bounds for the Norms of Matrices and Tensors The following lemma provides bounds for the matrix norm k.kq in terms of the matrix norms k.k1 , k.k2 , k.k∞ , and a bound for k.k2 in terms of k.kq and k.kq/(q−1) . For sake of clarity we use notation k.k2 for the spectral norm in this lemma, which everywhere else is denoted by k.k, without any index. Recall P that kAk∞ = maxi j |Aij | and kAk1 = kA0 k∞ . Lemma S.4. For any matrix A we have 1/q kAkq ≤ kAk1 kAk1−1/q , ∞ for q ≥ 1, 2/q kAk2 kAk1−2/q , ∞ for q ≥ 2, kAkq ≤ kAk2 ≤ q kAkq kAkq/(q−1) , for q ≥ 1. Note also that kAkq/(q−1) = kA0 kq for q ≥ 1. Thus, for a symmetric matrix A, we have kAk2 ≤ kAkq ≤ kAk∞ for any q ≥ 1. Proof of Lemma S.4. The statements follow from the fact that log kAkq is a convex function of 1/q, which is a consequence of the Riesz-Thorin theorem. For more details and references see e.g. Higham (1992). The following lemma shows that the norm k.kq applied to higher-dimensional tensors with a special structure can be expressed in terms of matrix norms k.kq . In our panel application all higher dimensional tensors have such a special structure, since they are obtained as partial derivatives wrt to α and γ from the likelihood function. Lemma S.5. Let a be an N -vector with entries ai , let b be a T -vector with entries bt , and let c be an N × T matrix with entries cit . Let A be an N × N × . . . × N tensor with entries | {z } p times ( Ai1 i2 ...ip = ai1 if i1 = i2 = . . . = ip , 0 otherwise. Let B be an T × T × . . . × T tensor with entries {z } | r times ( Bt1 t2 ...tr = bt1 if t1 = t2 = . . . = tr , 0 otherwise. Let C be an N × N × . . . × N × T × T × . . . × T tensor with entries {z } | {z } | p times r times ( Ci1 i2 ...ip t1 t2 ...tr = ci1 t1 if i1 = i2 = . . . = ip and t1 = t2 = . . . = tr , 0 otherwise. 25 e be an T × T × . . . × T × N × N × . . . × N tensor with entries Let C | {z } | {z } r times p times ( et t ...t i i ...i = C 1 2 r 1 2 p ci1 t1 if i1 = i2 = . . . = ip and t1 = t2 = . . . = tr , 0 otherwise. Then, kAkq = max |ai |, for p ≥ 2, kBkq = max |bt |, for r ≥ 2, kCkq ≤ kckq , for p ≥ 1, r ≥ 1, e q ≤ kc0 kq , kCk for p ≥ 1, r ≥ 1, i t where k.kq refers to the q-norm defined in (A.1) with q ≥ 1. Proof of Lemma S.5. Since the vector norm k.kq/(q−1) is dual to the vector norm k.kq we can rewrite the definition of the tensor norm kCkq as follows kCkq = max max max ku(1) kq/(q−1) =1 ku(k) kq = 1 kv (l) kq = 1 k = 2, . . . , p l = 1, . . . , r X T N X (r) (p) (1) (2) (1) (2) ui1 ui2 · · · uip vi1 vt2 · · · vtr Ci1 i2 ...ip t1 t2 ...tr . i1 i2 ...ip =1 t1 t2 ...tr =1 The specific structure of C yields kCkq = ≤ max max ku(1) kq/(q−1) =1 ku(k) kq = 1 N T X X (1) (2) (p) (1) (2) (r) ui ui · · · ui vt vt · · · vt cit max kv (l) kq = 1 i=1 t=1 k = 2, . . . , p l = 1, . . . , r max kukq/(q−1) ≤1 N X T X max ui vi cit = kckq , kvkq ≤1 i=1 t=1 where we define u ∈ RN with elements ui = ui ui · · · ui (1) (2) (p) and v ∈ RT with elements vt = vt vt · · · vt , (1) (2) (r) (1) and we use that ku(k) kq = 1, for k = 2, . . . , p, and kv (l) kq = 1, for l = 2, . . . 
, r, implies |ui | ≤ |ui | (1) |vt |, and |vt | ≤ and therefore kukq/(1−q) ≤ ku(1) kq/(1−q) = 1 and kvkq ≤ kv (1) kq = 1. The proof of e q ≤ kc0 kq is analogous. kCk Let A(p) = A, as defined above, for a particular value of p. For p = 2, A(2) is a diagonal N × N 26 1/q 1−1/q matrix with diagonal elements ai , so that kA(2) kq ≤ kA(2) k1 kA(2) k∞ = maxi |ai |. For p > 2, N X (p) (1) (2) (p) ui1 ui2 · · · uip Ai1 i2 ...ip max A = (1) max (k) q ku kq/(q−1) =1 ku kq = 1 i1 i2 ...ip =1 k = 2, . . . , p = N X (1) (2) (p−1) (p) (2) ui ui · · · ui uj Aij max (k) ku kq = 1 i,j=1 max ku(1) kq/(q−1) =1 k = 2, . . . , p ≤ max kukq/(q−1) ≤1 N T X X (2) max ui vi Aij = kA(2) kq ≤ max |ai |, i kvkq =1 i=1 t=1 where we define u ∈ RN with elements ui = ui ui · · · ui (1) (2) (p−1) and v = u(p) , and we use that ku(k) kp = 1, (1) |ui | for k = 2, . . . , p − 1, implies |ui | ≤ and therefore kukq/(q−1) ≤ ku(1) kq/(q−1) = 1. We have thus (p) shown A ≤ maxi |ai |. From the definition of A(p) q above, we obtain A(p) q ≥ maxi |ai | by choosing all u(k) equal to the standard basis vector, whose i∗ ’th component equals one, where i∗ ∈ argmaxi |ai |. Thus, A(p) q = maxi |ai | for p ≥ 2. The proof for kBkq = maxt |bt | is analogous. The following lemma provides an asymptotic bound for the spectral norm of N × T matrices, whose entries are mean zero, and cross-sectionally independent and weakly time-serially dependent conditional on φ. PT Lemma S.6. Let e be an N × T matrix with entries eit . Let σ̄i2 = T1 t=1 Eφ (e2it ), let Ω be the T × T PT PN matrix with entries Ωts = N1 i=1 Eφ (eit eis ), and let ηij = √1T t=1 [eit ejt − Eφ (eit ejt )]. Consider asymptotic sequences where N, T → ∞ such that N/T converges to a finite positive constant. Assume that (i) The distribution of eit is independent across i, conditional on φ, and satisfies Eφ (eit ) = 0. 4 PN PN PN 1 1 1 4 4 4 (ii) N1 i=1 σ̄i2 = OP (1), i,j=1 Eφ ηij = i=1 Eφ ηii = OP (1), T Tr(Ω ) = OP (1), N N2 OP (1). Then, Eφ kek8 = OP (N 5 ), and therefore kek = OP (N 5/8 ). Proof of Lemma S.6. Let k.kF be the Frobenius norm of a matrix, i.e. kAkF = 27 p Tr(AA0 ). For σ̄i4 = (σ̄i2 )2 , σ̄i8 = (σ̄i2 )4 and δjk = 1(j = k), 0 8 0 2 0 kek = kee ee k ≤ kee ee0 k2F = N X N X T X i,j=1 k=1 t,τ =1 !2 eit ekt ekτ ejτ "N # N 2 X X 1/2 2 1/2 2 ηik + T δik σ̄i =T ηjk + T δjk σ̄j 2 =T 2 i,j=1 k=1 N X N X i,j=1 k=1 N X ≤ 3T 2 ηik ηjk + 2T N X i,j=1 = 3T !2 N X 2 i,j=1 2 1/2 ηij σ̄i2 + T δij σ̄i4 !2 2 4 + 4T ηij σ̄i + T 2 δij σ̄i8 ηik ηjk k=1 N X !2 ηik ηjk + 12T 3 2 σ̄i4 ηij + 3T 3 i,j=1 k=1 2 N X 2 N X σ̄i8 , i=1 3 where we used that (a + b + c) ≤ 3(a + b + c ). By the Cauchy Schwarz inequality, v u ! N !2 N N N N u X X X X X u 4 ) + 3T 3 σ̄i8 Eφ (ηij σ̄i8 Eφ kek8 ≤ 3T 2 Eφ ηik ηjk + 12T 3 t N i,j=1 N X N X i,j=1 k=1 = 3T 2 Eφ i=1 k=1 i,j=1 i=1 !2 ηik ηjk + OP (T 3 N 2 ) + OP (T 3 N ). Moreover, N X N X i,j=1 k=1 Eφ ≤ !2 ηik ηjk X i, j, k, l = N X N X Eφ (ηik ηjk ηil ηjl ) = Eφ (ηij ηjk ηkl ηli ) i,j,k,l=1 i,j,k,l=1 X N aijk Eφ (ηii ηij ηjk ηki ) , Eφ (ηij ηjk ηkl ηli ) + 4 i,j,k=1 mutually different ≤ X i, j, k, l 3 1/4 N N X X 4 4 Eφ (ηij ηjk ηkl ηli ) + 4 Eφ (ηii ) Eφ (ηij ) i,j,k=1 i,j,k=1 mutually different = X i, j, k, l 3 1/4 " # N N 1 X X 1 3 4 4 Eφ (ηii ) E (η ) Eφ (ηij ηjk ηkl ηli ) + 4N φ ij N2 N i=1 i,j=1 mutually different = X i, j, k, l Eφ (ηij ηjk ηkl ηli ) + OP (N 3 ). 
The following lemma provides an asymptotic bound for the spectral norm of $N \times T$ matrices whose entries are mean zero, cross-sectionally independent, and weakly time-serially dependent, conditional on $\phi$.

Lemma S.6. Let $e$ be an $N \times T$ matrix with entries $e_{it}$. Let $\bar\sigma_i^2 = \frac{1}{T}\sum_{t=1}^T E_\phi(e_{it}^2)$, let $\Omega$ be the $T \times T$ matrix with entries $\Omega_{ts} = \frac{1}{N}\sum_{i=1}^N E_\phi(e_{it}e_{is})$, and let $\eta_{ij} = \frac{1}{\sqrt T}\sum_{t=1}^T \left[ e_{it}e_{jt} - E_\phi(e_{it}e_{jt}) \right]$. Consider asymptotic sequences where $N, T \to \infty$ such that $N/T$ converges to a finite positive constant. Assume that

(i) The distribution of $e_{it}$ is independent across $i$, conditional on $\phi$, and satisfies $E_\phi(e_{it}) = 0$.

(ii) $\frac{1}{N}\sum_{i=1}^N (\bar\sigma_i^2)^4 = O_P(1)$, $\frac{1}{N}\sum_{i=1}^N E_\phi(\eta_{ii}^4) = O_P(1)$, $\frac{1}{N^2}\sum_{i,j=1}^N E_\phi(\eta_{ij}^4) = O_P(1)$, and $\frac{1}{T}\operatorname{Tr}(\Omega^4) = O_P(1)$.

Then, $E_\phi\|e\|^8 = O_P(N^5)$, and therefore $\|e\| = O_P(N^{5/8})$.

Proof of Lemma S.6. Let $\|\cdot\|_F$ be the Frobenius norm of a matrix, i.e. $\|A\|_F = \sqrt{\operatorname{Tr}(AA')}$. Note that independence across $i$ and $E_\phi(e_{it}) = 0$ give $\sum_{t=1}^T e_{it}e_{kt} = \sqrt T\, \eta_{ik} + T\,\delta_{ik}\bar\sigma_i^2$, where $\delta_{jk} = 1(j=k)$. For $\bar\sigma_i^4 = (\bar\sigma_i^2)^2$ and $\bar\sigma_i^8 = (\bar\sigma_i^2)^4$,
$$\|e\|^8 = \|ee'ee'\|^2 \le \|ee'ee'\|_F^2 = \sum_{i,j=1}^{N} \left( \sum_{k=1}^{N} \sum_{t,\tau=1}^{T} e_{it}e_{kt}e_{k\tau}e_{j\tau} \right)^2 = T^2 \sum_{i,j=1}^{N} \left( \sum_{k=1}^{N} \eta_{ik}\eta_{jk} + 2T^{1/2}\eta_{ij}\bar\sigma_i^2 + T\delta_{ij}\bar\sigma_i^4 \right)^2$$
$$\le 3T^2 \sum_{i,j=1}^{N} \left[ \left( \sum_{k=1}^{N} \eta_{ik}\eta_{jk} \right)^2 + 4T\eta_{ij}^2\bar\sigma_i^4 + T^2\delta_{ij}\bar\sigma_i^8 \right] = 3T^2 \sum_{i,j=1}^{N} \left( \sum_{k=1}^{N} \eta_{ik}\eta_{jk} \right)^2 + 12T^3 \sum_{i,j=1}^{N} \bar\sigma_i^4\eta_{ij}^2 + 3T^4 \sum_{i=1}^{N} \bar\sigma_i^8,$$
where we used that $(a+b+c)^2 \le 3(a^2+b^2+c^2)$. By the Cauchy–Schwarz inequality,
$$E_\phi\|e\|^8 \le 3T^2\, E_\phi \sum_{i,j=1}^{N} \left( \sum_{k=1}^{N} \eta_{ik}\eta_{jk} \right)^2 + 12T^3 \sqrt{ N \left( \sum_{i=1}^{N}\bar\sigma_i^8 \right) \sum_{i,j=1}^{N} E_\phi(\eta_{ij}^4) } + 3T^4 \sum_{i=1}^{N}\bar\sigma_i^8 = 3T^2\, E_\phi \sum_{i,j=1}^{N} \left( \sum_{k=1}^{N} \eta_{ik}\eta_{jk} \right)^2 + O_P(T^3N^2) + O_P(T^4N).$$
Moreover,
$$\sum_{i,j=1}^{N} E_\phi \left( \sum_{k=1}^{N} \eta_{ik}\eta_{jk} \right)^2 = \sum_{i,j=1}^{N} \sum_{k,l=1}^{N} E_\phi(\eta_{ik}\eta_{jk}\eta_{il}\eta_{jl}) = \sum_{i,j,k,l=1}^{N} E_\phi(\eta_{ij}\eta_{jk}\eta_{kl}\eta_{li})$$
$$\le \sum_{\substack{i,j,k,l \\ \text{mutually different}}} E_\phi(\eta_{ij}\eta_{jk}\eta_{kl}\eta_{li}) + 4 \sum_{i,j,k=1}^{N} a_{ijk} \left| E_\phi(\eta_{ii}\eta_{ij}\eta_{jk}\eta_{ki}) \right| \le \sum_{\substack{i,j,k,l \\ \text{mutually different}}} E_\phi(\eta_{ij}\eta_{jk}\eta_{kl}\eta_{li}) + 4 \left[ \sum_{i,j,k=1}^{N} E_\phi(\eta_{ii}^4) \right]^{1/4} \left[ \sum_{i,j,k=1}^{N} E_\phi(\eta_{ij}^4) \right]^{3/4}$$
$$= \sum_{\substack{i,j,k,l \\ \text{mutually different}}} E_\phi(\eta_{ij}\eta_{jk}\eta_{kl}\eta_{li}) + 4N^3 \left[ \frac{1}{N}\sum_{i=1}^{N} E_\phi(\eta_{ii}^4) \right]^{1/4} \left[ \frac{1}{N^2}\sum_{i,j=1}^{N} E_\phi(\eta_{ij}^4) \right]^{3/4} = \sum_{\substack{i,j,k,l \\ \text{mutually different}}} E_\phi(\eta_{ij}\eta_{jk}\eta_{kl}\eta_{li}) + O_P(N^3),$$
where in the second step we just renamed the indices and used that $\eta_{ij}$ is symmetric in $i,j$; $a_{ijk} \in [0,1]$ in the second line is a combinatorial pre-factor; and in the third step we applied the Cauchy–Schwarz inequality.

Let $\Omega_i$ be the $T \times T$ matrix with entries $\Omega_{i,ts} = E_\phi(e_{it}e_{is})$, such that $\Omega = \frac{1}{N}\sum_{i=1}^N \Omega_i$. For $i,j,k,l$ mutually different,
$$E_\phi(\eta_{ij}\eta_{jk}\eta_{kl}\eta_{li}) = \frac{1}{T^2} \sum_{t,s,u,v=1}^{T} E_\phi(e_{it}e_{jt}e_{js}e_{ks}e_{ku}e_{lu}e_{lv}e_{iv}) = \frac{1}{T^2} \sum_{t,s,u,v=1}^{T} E_\phi(e_{iv}e_{it})\, E_\phi(e_{jt}e_{js})\, E_\phi(e_{ks}e_{ku})\, E_\phi(e_{lu}e_{lv}) = \frac{1}{T^2}\operatorname{Tr}(\Omega_i\Omega_j\Omega_k\Omega_l) \ge 0,$$
because $\Omega_i \ge 0$ for all $i$. Thus,
$$\sum_{\substack{i,j,k,l \\ \text{mutually different}}} E_\phi(\eta_{ij}\eta_{jk}\eta_{kl}\eta_{li}) = \frac{1}{T^2} \sum_{\substack{i,j,k,l \\ \text{mut. different}}} \operatorname{Tr}(\Omega_i\Omega_j\Omega_k\Omega_l) \le \frac{1}{T^2} \sum_{i,j,k,l=1}^{N} \operatorname{Tr}(\Omega_i\Omega_j\Omega_k\Omega_l) = \frac{N^4}{T^2}\operatorname{Tr}(\Omega^4) = O_P(N^4/T).$$
Combining all the above results gives $E_\phi\|e\|^8 = O_P(N^5)$, since $N$ and $T$ are assumed to grow at the same rate.
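The conclusion can be illustrated by simulation under one concrete design satisfying (i) and (ii); the AR(1) specification below is an arbitrary choice of ours, used only to generate cross-sectionally independent, weakly serially dependent, mean-zero errors:

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.5
for N in (50, 100, 200, 400):
    T = N                                    # N/T -> 1, as in the lemma
    e = np.empty((N, T))
    e[:, 0] = rng.standard_normal(N)
    for t in range(1, T):                    # AR(1): weak time-serial dependence
        e[:, t] = rho * e[:, t - 1] + np.sqrt(1 - rho**2) * rng.standard_normal(N)
    # the ratio ||e|| / N^{5/8} stays bounded; for this design it even shrinks,
    # since ||e|| = O_P(N^{1/2}) for such matrices
    print(N, np.linalg.norm(e, 2) / N ** (5 / 8))
```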
S.6.3 Verifying the Basic Regularity Conditions in Panel Models

The following lemma provides sufficient conditions under which the panel fixed effects estimators in the main text satisfy the high-level regularity conditions in Assumptions B.1(v) and (vi).

Lemma S.7. Let $L(\beta,\phi) = \frac{1}{\sqrt{NT}} \left[ \sum_{i,t} \ell_{it}(\beta, \pi_{it}) - \frac{b}{2}(v'\phi)^2 \right]$, where $\pi_{it} = \alpha_i + \gamma_t$, $\alpha = (\alpha_1, \ldots, \alpha_N)'$, $\gamma = (\gamma_1, \ldots, \gamma_T)'$, $\phi = (\alpha', \gamma')'$, and $v = (1_N', -1_T')'$. Assume that $\ell_{it}(\cdot,\cdot)$ is four times continuously differentiable in an appropriate neighborhood of the true parameter values $(\beta^0, \phi^0)$. Consider limits as $N, T \to \infty$ with $N/T \to \kappa^2 > 0$. Let $4 < q \le 8$ and $0 \le \epsilon < 1/8 - 1/(2q)$. Let $r_\beta = r_{\beta,NT} > 0$ and $r_\phi = r_{\phi,NT} > 0$, with $r_\beta = o\!\left[(NT)^{-1/(2q)-\epsilon}\right]$ and $r_\phi = o\!\left[(NT)^{-\epsilon}\right]$. Assume that

(i) For $k,l,m \in \{1, 2, \ldots, \dim\beta\}$,
$$\frac{1}{\sqrt{NT}}\sum_{i,t} \partial_{\beta_k}\ell_{it} = O_P(1), \qquad \frac{1}{NT}\sum_{i,t} \partial_{\beta_k\beta_l}\ell_{it} = O_P(1), \qquad \frac{1}{NT}\sum_{i,t} \left\{ \partial_{\beta_k\beta_l}\ell_{it} - E_\phi[\partial_{\beta_k\beta_l}\ell_{it}] \right\} = o_P(1),$$
$$\sup_{\beta \in \mathcal{B}(r_\beta,\beta^0)}\ \sup_{\phi \in \mathcal{B}_q(r_\phi,\phi^0)} \left| \frac{1}{NT}\sum_{i,t} \partial_{\beta_k\beta_l\beta_m}\ell_{it}(\beta,\pi_{it}) \right| = O_P(1).$$

(ii) Let $k,l \in \{1, 2, \ldots, \dim\beta\}$. For $\xi_{it}(\beta,\phi) = \partial_{\beta_k\pi}\ell_{it}(\beta,\pi_{it})$ or $\xi_{it}(\beta,\phi) = \partial_{\beta_k\beta_l\pi}\ell_{it}(\beta,\pi_{it})$,
$$\sup_{\beta \in \mathcal{B}(r_\beta,\beta^0)}\ \sup_{\phi \in \mathcal{B}_q(r_\phi,\phi^0)} \frac{1}{N}\sum_i \left| \frac{1}{T}\sum_t \xi_{it}(\beta,\phi) \right|^q = O_P(1), \qquad \sup_{\beta \in \mathcal{B}(r_\beta,\beta^0)}\ \sup_{\phi \in \mathcal{B}_q(r_\phi,\phi^0)} \frac{1}{T}\sum_t \left| \frac{1}{N}\sum_i \xi_{it}(\beta,\phi) \right|^q = O_P(1).$$

(iii) Let $k,l \in \{1, 2, \ldots, \dim\beta\}$. For $\xi_{it}(\beta,\phi) = \partial_{\pi^r}\ell_{it}(\beta,\pi_{it})$ with $r \in \{3,4\}$, or $\xi_{it}(\beta,\phi) = \partial_{\beta_k\pi^r}\ell_{it}(\beta,\pi_{it})$ with $r \in \{2,3\}$, or $\xi_{it}(\beta,\phi) = \partial_{\beta_k\beta_l\pi^2}\ell_{it}(\beta,\pi_{it})$,
$$\sup_{\beta \in \mathcal{B}(r_\beta,\beta^0)}\ \sup_{\phi \in \mathcal{B}_q(r_\phi,\phi^0)} \max_i \frac{1}{T}\sum_t |\xi_{it}(\beta,\phi)| = O_P\!\left(N^{2\epsilon}\right), \qquad \sup_{\beta \in \mathcal{B}(r_\beta,\beta^0)}\ \sup_{\phi \in \mathcal{B}_q(r_\phi,\phi^0)} \max_t \frac{1}{N}\sum_i |\xi_{it}(\beta,\phi)| = O_P\!\left(N^{2\epsilon}\right).$$

(iv) Moreover,
$$\frac{1}{N}\sum_i \left| \frac{1}{\sqrt T}\sum_t \partial_\pi\ell_{it} \right|^q = O_P(1), \qquad \frac{1}{T}\sum_t \left| \frac{1}{\sqrt N}\sum_i \partial_\pi\ell_{it} \right|^q = O_P(1),$$
$$\frac{1}{N}\sum_i \left| \frac{1}{\sqrt T}\sum_t \left( \partial_{\beta_k\pi}\ell_{it} - E_\phi[\partial_{\beta_k\pi}\ell_{it}] \right) \right|^2 = O_P(1), \qquad \frac{1}{T}\sum_t \left| \frac{1}{\sqrt N}\sum_i \left( \partial_{\beta_k\pi}\ell_{it} - E_\phi[\partial_{\beta_k\pi}\ell_{it}] \right) \right|^2 = O_P(1).$$

(v) The sequence $\{(\ell_{i1}, \ldots, \ell_{iT}) : 1 \le i \le N\}$ is independent across $i$ conditional on $\phi$.

(vi) Let $k \in \{1, 2, \ldots, \dim\beta\}$. For $\xi_{it} = \partial_{\pi^r}\ell_{it} - E_\phi[\partial_{\pi^r}\ell_{it}]$ with $r \in \{2,3\}$, or $\xi_{it} = \partial_{\beta_k\pi^2}\ell_{it} - E_\phi[\partial_{\beta_k\pi^2}\ell_{it}]$, and some $\tilde\nu > 0$,
$$\max_i E_\phi \left[ \frac{1}{\sqrt T}\sum_t \xi_{it} \right]^8 \le C, \qquad \max_i \max_t \sum_s \left| E_\phi[\xi_{it}\xi_{is}] \right| \le C, \qquad \max_{i,t} E_\phi |\xi_{it}|^{8+\tilde\nu} \le C,$$
$$\max_{i,j} E_\phi \left[ \frac{1}{\sqrt T}\sum_t \left( \xi_{it}\xi_{jt} - E_\phi(\xi_{it}\xi_{jt}) \right) \right]^4 \le C, \qquad \max_t E_\phi \left[ \frac{1}{\sqrt N}\sum_i \xi_{it} \right]^8 \le C,$$
uniformly in $N,T$, where $C > 0$ is a constant.

(vii) $\|\bar H^{-1}\|_q = O_P(1)$.

Then, Assumptions B.1(v) and (vi) are satisfied with the same parameters $q$, $\epsilon$, $r_\beta = r_{\beta,NT}$ and $r_\phi = r_{\phi,NT}$ used here.
Proof of Lemma S.7. The penalty term $\frac{b}{2}(v'\phi)^2$ is quadratic in $\phi$ and does not depend on $\beta$. This term thus only enters $\partial_\phi L(\beta,\phi)$ and $\partial_{\phi\phi'}L(\beta,\phi)$, but it does not affect any other partial derivative of $L(\beta,\phi)$. Furthermore, the contribution of the penalty drops out of $S = \partial_\phi L(\beta^0,\phi^0)$, because we impose the normalization $v'\phi^0 = 0$. It also drops out of $\tilde H$, because it contributes the same to $H$ and $\bar H$. We can therefore ignore the penalty term for the purpose of proving the lemma (but it is necessary to satisfy the assumption $\|\bar H^{-1}\|_q = O_P(1)$).

# Assumption (i) implies that $\|\partial_\beta L\| = O_P(1)$, $\|\partial_{\beta\beta'}L\| = O_P(\sqrt{NT})$, $\|\partial_{\beta\beta'}\tilde L\| = o_P(\sqrt{NT})$, and $\sup_{\beta \in \mathcal{B}(r_\beta,\beta^0)} \sup_{\phi \in \mathcal{B}_q(r_\phi,\phi^0)} \|\partial_{\beta\beta\beta}L(\beta,\phi)\| = O_P(\sqrt{NT})$. Note that it does not matter which norms we use here because $\dim\beta$ is fixed.

# By Assumption (ii), $\|\partial_{\beta\phi'}L\|_q = O_P\!\left[(NT)^{1/(2q)}\right]$ and $\sup_{\beta \in \mathcal{B}(r_\beta,\beta^0)} \sup_{\phi \in \mathcal{B}_q(r_\phi,\phi^0)} \|\partial_{\beta\beta\phi}L(\beta,\phi)\|_q = O_P\!\left[(NT)^{1/(2q)}\right]$. For example, $\partial_{\beta_k\alpha_i}L = \frac{1}{\sqrt{NT}}\sum_t \partial_{\beta_k\pi}\ell_{it}$, and therefore
$$\|\partial_{\beta_k\alpha}L\|_q = \left( \sum_i \left| \frac{1}{\sqrt{NT}}\sum_t \partial_{\beta_k\pi}\ell_{it} \right|^q \right)^{1/q} = O_P\!\left(N^{1/q}\right) = O_P\!\left[(NT)^{1/(2q)}\right].$$
Analogously, $\|\partial_{\beta_k\gamma}L\|_q = O_P[(NT)^{1/(2q)}]$, and therefore $\|\partial_{\beta_k\phi}L\|_q \le \|\partial_{\beta_k\alpha}L\|_q + \|\partial_{\beta_k\gamma}L\|_q = O_P[(NT)^{1/(2q)}]$. This also implies $\|\partial_{\beta\phi'}L\|_q = O_P[(NT)^{1/(2q)}]$ because $\dim\beta$ is fixed.

# By Assumption (iii), $\|\partial_{\phi\phi\phi}L\|_q = O_P[(NT)^\epsilon]$, $\|\partial_{\beta\phi\phi}L\|_q = O_P[(NT)^\epsilon]$, and the suprema over $\mathcal{B}(r_\beta,\beta^0) \times \mathcal{B}_q(r_\phi,\phi^0)$ of $\|\partial_{\beta\beta\phi\phi}L(\beta,\phi)\|_q$, $\|\partial_{\beta\phi\phi\phi}L(\beta,\phi)\|_q$, and $\|\partial_{\phi\phi\phi\phi}L(\beta,\phi)\|_q$ are all $O_P[(NT)^\epsilon]$. For example,
$$\|\partial_{\phi\phi\phi}L\|_q \le \|\partial_{\alpha\alpha\alpha}L\|_q + \|\partial_{\alpha\alpha\gamma}L\|_q + \|\partial_{\alpha\gamma\alpha}L\|_q + \|\partial_{\alpha\gamma\gamma}L\|_q + \|\partial_{\gamma\alpha\alpha}L\|_q + \|\partial_{\gamma\alpha\gamma}L\|_q + \|\partial_{\gamma\gamma\alpha}L\|_q + \|\partial_{\gamma\gamma\gamma}L\|_q$$
$$\le \|\partial_{\pi\alpha\alpha}L\|_q + \|\partial_{\pi\gamma\gamma}L\|_q + 3\|\partial_{\pi\alpha\gamma}L\|_q + 3\|\partial_{\pi\gamma\alpha}L\|_q \le \|\partial_{\pi\alpha\alpha}L\|_\infty + \|\partial_{\pi\gamma\gamma}L\|_\infty + 3\|\partial_{\pi\alpha\gamma}L\|_\infty^{1-1/q}\|\partial_{\pi\gamma\alpha}L\|_\infty^{1/q} + 3\|\partial_{\pi\alpha\gamma}L\|_\infty^{1/q}\|\partial_{\pi\gamma\alpha}L\|_\infty^{1-1/q}$$
$$= \frac{1}{\sqrt{NT}}\left[ \max_i \sum_t |\partial_{\pi^3}\ell_{it}| + \max_t \sum_i |\partial_{\pi^3}\ell_{it}| + 3\left( \max_i \sum_t |\partial_{\pi^3}\ell_{it}| \right)^{1-1/q}\left( \max_t \sum_i |\partial_{\pi^3}\ell_{it}| \right)^{1/q} + 3\left( \max_i \sum_t |\partial_{\pi^3}\ell_{it}| \right)^{1/q}\left( \max_t \sum_i |\partial_{\pi^3}\ell_{it}| \right)^{1-1/q} \right] = O_P\!\left(N^{2\epsilon}\right) = O_P\!\left[(NT)^\epsilon\right].$$
Here, we use Lemma S.5 to bound the norms of the 3-tensors in terms of the norms of matrices, e.g. $\|\partial_{\alpha\alpha\gamma}L\|_q \le \|\partial_{\pi\alpha\gamma}L\|_q$, because $\partial_{\alpha_i\alpha_j\gamma_t}L = 0$ if $i \ne j$ and $\partial_{\alpha_i\alpha_i\gamma_t}L = (\partial_{\pi\alpha\gamma}L)_{it}$.⁸ Then, we use Lemma S.4 to bound $q$-norms in terms of $\infty$-norms, and explicitly express those $\infty$-norms in terms of the elements of the matrices. Finally, we use that $|\sum_i \partial_{\pi^3}\ell_{it}| \le \sum_i |\partial_{\pi^3}\ell_{it}|$ and $|\sum_t \partial_{\pi^3}\ell_{it}| \le \sum_t |\partial_{\pi^3}\ell_{it}|$, and apply Assumption (iii).

⁸ With a slight abuse of notation, we write $\partial_{\pi\alpha\gamma}L$ for the $N \times T$ matrix with entries $(\partial_{\pi\alpha\gamma}L)_{it} = (NT)^{-1/2}\partial_{\pi^3}\ell_{it}$, and analogously for $\partial_{\pi\alpha\alpha}L$, $\partial_{\pi\gamma\gamma}L$, and $\partial_{\pi\gamma\alpha}L$.

# By Assumption (iv), $\|S\|_q = O_P\!\left[(NT)^{-1/4+1/(2q)}\right]$ and $\|\partial_{\beta\phi'}\tilde L\| = O_P(1)$. For example,
$$\|S\|_q \le \frac{1}{\sqrt{NT}}\left[ \left( \sum_i \left| \sum_t \partial_\pi\ell_{it} \right|^q \right)^{1/q} + \left( \sum_t \left| \sum_i \partial_\pi\ell_{it} \right|^q \right)^{1/q} \right] = O_P\!\left(N^{-1/2+1/q}\right) = O_P\!\left[(NT)^{-1/4+1/(2q)}\right].$$

# By Assumptions (v) and (vi), $\|\tilde H\| = O_P[(NT)^{-3/16}] = o_P[(NT)^{-1/8}]$ and $\|\partial_{\beta\phi\phi}\tilde L\| = O_P[(NT)^{-3/16}] = o_P[(NT)^{-1/8}]$. We now show this for $\|\tilde H\|$; the proof for $\|\partial_{\beta\phi\phi}\tilde L\|$ is analogous. By the triangle inequality,
$$\|\tilde H\| = \|\partial_{\phi\phi'}L - E_\phi[\partial_{\phi\phi'}L]\| \le \|\partial_{\alpha\alpha'}L - E_\phi[\partial_{\alpha\alpha'}L]\| + \|\partial_{\gamma\gamma'}L - E_\phi[\partial_{\gamma\gamma'}L]\| + 2\|\partial_{\alpha\gamma'}L - E_\phi[\partial_{\alpha\gamma'}L]\|.$$
Let $\xi_{it} = \partial_{\pi^2}\ell_{it} - E_\phi[\partial_{\pi^2}\ell_{it}]$. Since $\partial_{\alpha\alpha'}L$ is a diagonal matrix with diagonal entries $\frac{1}{\sqrt{NT}}\sum_t \partial_{\pi^2}\ell_{it}$, we have $\|\partial_{\alpha\alpha'}L - E_\phi[\partial_{\alpha\alpha'}L]\| = \max_i \left| \frac{1}{\sqrt{NT}}\sum_t \xi_{it} \right|$, and therefore
$$E_\phi \|\partial_{\alpha\alpha'}L - E_\phi[\partial_{\alpha\alpha'}L]\|^8 = E_\phi \max_i \left( \frac{1}{\sqrt{NT}}\sum_t \xi_{it} \right)^8 \le \sum_i E_\phi \left( \frac{1}{\sqrt{NT}}\sum_t \xi_{it} \right)^8 \le CN\left( \frac{1}{\sqrt N} \right)^8 = O_P(N^{-3}).$$
Thus, $\|\partial_{\alpha\alpha'}L - E_\phi[\partial_{\alpha\alpha'}L]\| = O_P(N^{-3/8})$. Analogously, $\|\partial_{\gamma\gamma'}L - E_\phi[\partial_{\gamma\gamma'}L]\| = O_P(N^{-3/8})$.

Let $\xi$ be the $N \times T$ matrix with entries $\xi_{it}$. We now show that $\xi$ satisfies all the regularity conditions of Lemma S.6 with $e_{it} = \xi_{it}$. Independence across $i$ holds by Assumption (v). Furthermore, $\bar\sigma_i^2 = \frac{1}{T}\sum_{t=1}^T E_\phi(\xi_{it}^2) \le C^{1/4}$, so that $\frac{1}{N}\sum_{i=1}^N (\bar\sigma_i^2)^4 = O_P(1)$. For $\Omega_{ts} = \frac{1}{N}\sum_{i=1}^N E_\phi(\xi_{it}\xi_{is})$,
$$\frac{1}{T}\operatorname{Tr}(\Omega^4) \le \|\Omega\|^4 \le \|\Omega\|_\infty^4 = \left( \max_t \sum_s \left| \frac{1}{N}\sum_i E_\phi[\xi_{it}\xi_{is}] \right| \right)^4 \le C^4 = O_P(1).$$
For $\eta_{ij} = \frac{1}{\sqrt T}\sum_{t=1}^T [\xi_{it}\xi_{jt} - E_\phi(\xi_{it}\xi_{jt})]$ we assume $E_\phi(\eta_{ij}^4) \le C$, which implies $\frac{1}{N}\sum_{i=1}^N E_\phi(\eta_{ii}^4) = O_P(1)$ and $\frac{1}{N^2}\sum_{i,j=1}^N E_\phi(\eta_{ij}^4) = O_P(1)$. Then, Lemma S.6 gives $\|\xi\| = O_P(N^{5/8})$. Note that $\partial_{\alpha\gamma'}L - E_\phi[\partial_{\alpha\gamma'}L] = \frac{1}{\sqrt{NT}}\,\xi$, and therefore $\|\partial_{\alpha\gamma'}L - E_\phi[\partial_{\alpha\gamma'}L]\| = O_P(N^{-3/8})$. We conclude that $\|\tilde H\| = O_P(N^{-3/8}) = O_P[(NT)^{-3/16}]$.

# Moreover, for $\xi_{it} = \partial_{\pi^2}\ell_{it} - E_\phi[\partial_{\pi^2}\ell_{it}]$,
$$E_\phi \|\tilde H\|_\infty^{8+\tilde\nu} = E_\phi \left( \max_i \frac{1}{\sqrt{NT}}\sum_t |\xi_{it}| \right)^{8+\tilde\nu} \le E_\phi \sum_i \left( \frac{1}{\sqrt{NT}}\sum_t |\xi_{it}| \right)^{8+\tilde\nu} \le \left( \frac{T}{\sqrt{NT}} \right)^{8+\tilde\nu} \sum_i \frac{1}{T}\sum_t E_\phi |\xi_{it}|^{8+\tilde\nu} = O_P(N),$$
and therefore $\|\tilde H\|_\infty = o_P(N^{1/8})$. Thus, by Lemma S.4,
$$\|\tilde H\|_q \le \|\tilde H\|_2^{2/q}\, \|\tilde H\|_\infty^{1-2/q} = o_P\!\left( N^{\frac{1}{8}\left[ -\frac{6}{q} + \left( 1 - \frac{2}{q} \right) \right]} \right) = o_P\!\left( N^{-1/q+1/8} \right) = o_P(1),$$
where we use that $q \le 8$.

# Finally, we show that $\left\| \sum_{g,h=1}^{\dim\phi} \partial_{\phi\phi_g\phi_h}\tilde L\, [\bar H^{-1}S]_g\, [\bar H^{-1}S]_h \right\| = o_P[(NT)^{-1/4}]$. First,
$$\left\| \sum_{g,h=1}^{\dim\phi} \partial_{\phi\phi_g\phi_h}\tilde L\, [\bar H^{-1}S]_g [\bar H^{-1}S]_h \right\| \le \left\| \sum_{g,h=1}^{\dim\phi} \partial_{\alpha\phi_g\phi_h}\tilde L\, [\bar H^{-1}S]_g [\bar H^{-1}S]_h \right\| + \left\| \sum_{g,h=1}^{\dim\phi} \partial_{\gamma\phi_g\phi_h}\tilde L\, [\bar H^{-1}S]_g [\bar H^{-1}S]_h \right\|.$$
Let $(v,w)' := \bar H^{-1}S$, where $v$ is an $N$-vector and $w$ is a $T$-vector. We assume $\|\bar H^{-1}\|_q = O_P(1)$. By Lemma S.1 this also implies $\|\bar H^{-1}\| = O_P(1)$, and $\|S\| = O_P(1)$. Thus, $\|v\| \le \|\bar H^{-1}\|\,\|S\| = O_P(1)$, $\|w\| \le \|\bar H^{-1}\|\,\|S\| = O_P(1)$, $\|v\|_\infty \le \|v\|_q \le \|\bar H^{-1}\|_q\,\|S\|_q = O_P[(NT)^{-1/4+1/(2q)}]$, and $\|w\|_\infty \le \|w\|_q \le \|\bar H^{-1}\|_q\,\|S\|_q = O_P[(NT)^{-1/4+1/(2q)}]$. Furthermore, by an argument analogous to the above proof for $\|\tilde H\|$, Assumptions (v) and (vi) imply that $\|\partial_{\pi\alpha\alpha'}\tilde L\| = O_P(N^{-3/8})$, $\|\partial_{\pi\alpha\gamma'}\tilde L\| = O_P(N^{-3/8})$, and $\|\partial_{\pi\gamma\gamma'}\tilde L\| = O_P(N^{-3/8})$. Then,
$$\sum_{g,h=1}^{\dim\phi} \partial_{\alpha_i\phi_g\phi_h}\tilde L\, [\bar H^{-1}S]_g [\bar H^{-1}S]_h = \sum_{j,k=1}^{N} (\partial_{\alpha_i\alpha_j\alpha_k}\tilde L)\, v_j v_k + 2\sum_{j=1}^{N}\sum_{t=1}^{T} (\partial_{\alpha_i\alpha_j\gamma_t}\tilde L)\, v_j w_t + \sum_{t,s=1}^{T} (\partial_{\alpha_i\gamma_t\gamma_s}\tilde L)\, w_t w_s$$
$$= \left( \sum_{t=1}^{T} (\partial_{\pi\alpha\alpha'}\tilde L)_{it} \right) v_i^2 + 2\sum_{t=1}^{T} (\partial_{\pi\alpha\gamma'}\tilde L)_{it}\, v_i w_t + \sum_{t=1}^{T} (\partial_{\pi\alpha\gamma'}\tilde L)_{it}\, w_t^2,$$
and therefore
$$\left\| \sum_{g,h=1}^{\dim\phi} \partial_{\alpha\phi_g\phi_h}\tilde L\, [\bar H^{-1}S]_g [\bar H^{-1}S]_h \right\| \le \|\partial_{\pi\alpha\alpha'}\tilde L\|\,\|v\|\,\|v\|_\infty + 2\|\partial_{\pi\alpha\gamma'}\tilde L\|\,\|w\|\,\|v\|_\infty + \|\partial_{\pi\alpha\gamma'}\tilde L\|\,\|w\|\,\|w\|_\infty$$
$$= O_P(N^{-3/8})\, O_P\!\left[(NT)^{-1/4+1/(2q)}\right] = O_P\!\left[(NT)^{-1/4-3/16+1/(2q)}\right] = o_P\!\left[(NT)^{-1/4}\right],$$
where we use that $q > 4$. Analogously, $\| \sum_{g,h=1}^{\dim\phi} \partial_{\gamma\phi_g\phi_h}\tilde L\, [\bar H^{-1}S]_g [\bar H^{-1}S]_h \| = o_P[(NT)^{-1/4}]$, and thus also $\| \sum_{g,h=1}^{\dim\phi} \partial_{\phi\phi_g\phi_h}\tilde L\, [\bar H^{-1}S]_g [\bar H^{-1}S]_h \| = o_P[(NT)^{-1/4}]$.⁹

⁹ Given the structure of this last part of the proof of Lemma S.7, one might wonder why, instead of $\sum_{g,h=1}^{\dim\phi} \partial_{\phi\phi_g\phi_h}\tilde L\, [\bar H^{-1}S]_g [\bar H^{-1}S]_h = o_P[(NT)^{-1/4}]$, we did not directly impose $\| \sum_g \partial_{\phi_g\phi\phi'}\tilde L \|_q = o_P[(NT)^{-1/(2q)}]$ as a high-level condition in Assumption B.1(vi). While this alternative high-level assumption would indeed be more elegant and sufficient to derive our results, it would not be satisfied for panel models, because it involves bounding $\| \sum_i \partial_{\alpha_i\gamma\gamma'}\tilde L \|$ and $\| \sum_t \partial_{\gamma_t\alpha\alpha'}\tilde L \|$, which was avoided in the proof of Lemma S.7.

S.6.4 A Useful Algebraic Result

Let $\widetilde P$ be the linear operator defined in equation (S.2), and let $P$ be the related projection operator defined in (S.1). Lemma S.8 shows how, in the context of panel data models, some expressions that appear in the general expansion of Appendix B can be conveniently expressed using the operator $\widetilde P$. This lemma is used extensively in the proof of part (ii) of Theorem C.1.

Lemma S.8. Let $A$, $B$ and $C$ be $N \times T$ matrices, and let the expected incidental parameter Hessian $\bar H$ be invertible. Define the $(N+T)$-vectors $\mathcal A$ and $\mathcal B$ and the $(N+T) \times (N+T)$ matrix $\mathcal C$ as follows:¹⁰
$$\mathcal A = \frac{1}{NT}\begin{pmatrix} A\, 1_T \\ A'\, 1_N \end{pmatrix}, \qquad \mathcal B = \frac{1}{NT}\begin{pmatrix} B\, 1_T \\ B'\, 1_N \end{pmatrix}, \qquad \mathcal C = \frac{1}{NT}\begin{pmatrix} \operatorname{diag}(C\,1_T) & C \\ C' & \operatorname{diag}(C'\,1_N) \end{pmatrix}.$$
Then,

(i) $\displaystyle \mathcal A' \bar H^{-1} \mathcal B = \frac{1}{(NT)^{3/2}} \sum_{i,t} (\widetilde P A)_{it} B_{it} = \frac{1}{(NT)^{3/2}} \sum_{i,t} (\widetilde P B)_{it} A_{it}$,

(ii) $\displaystyle \mathcal A' \bar H^{-1} \mathcal B = \frac{1}{(NT)^{3/2}} \sum_{i,t} E_\phi(-\partial_{\pi^2}\ell_{it})\, (\widetilde P A)_{it} (\widetilde P B)_{it}$,

(iii) $\displaystyle \mathcal A' \bar H^{-1}\, \mathcal C\, \bar H^{-1} \mathcal B = \frac{1}{(NT)^2} \sum_{i,t} (\widetilde P A)_{it}\, C_{it}\, (\widetilde P B)_{it}$.

¹⁰ Note that $A\,1_T$ is simply the $N$-vector with entries $\sum_t A_{it}$, and $A'\,1_N$ is simply the $T$-vector with entries $\sum_i A_{it}$, and analogously for $B$ and $C$.

Proof. Let $\tilde\alpha_i^* + \tilde\gamma_t^* = (P\tilde A)_{it} = (\widetilde P A)_{it}$, with $\tilde A$ as defined in equation (S.2). The first order condition of the minimization problem in the definition of $(P\tilde A)_{it}$ can be written as $\frac{1}{\sqrt{NT}}\, \bar H \begin{pmatrix} \tilde\alpha^* \\ \tilde\gamma^* \end{pmatrix} = \mathcal A$. One solution to this equation is $\begin{pmatrix} \tilde\alpha^* \\ \tilde\gamma^* \end{pmatrix} = \sqrt{NT}\, \bar H^{-1} \mathcal A$ (this is the solution that imposes the normalization $\sum_i \tilde\alpha_i^* = \sum_t \tilde\gamma_t^*$, but this is of no importance in the following). Thus,
$$\sqrt{NT}\, \mathcal A' \bar H^{-1} \mathcal B = \begin{pmatrix} \tilde\alpha^* \\ \tilde\gamma^* \end{pmatrix}' \mathcal B = \frac{1}{NT}\left( \sum_{i,t} \tilde\alpha_i^* B_{it} + \sum_{i,t} \tilde\gamma_t^* B_{it} \right) = \frac{1}{NT} \sum_{i,t} (\widetilde P A)_{it} B_{it}.$$
This gives the first equality of Statement (i); the second equality of Statement (i) follows by symmetry. Statement (ii) is a special case of Statement (iii) with $C_{it} = \sqrt{NT}\, E_\phi(-\partial_{\pi^2}\ell_{it})$, so we only need to prove Statement (iii). Let $\alpha_i^* + \gamma_t^* = (P\tilde B)_{it} = (\widetilde P B)_{it}$, where $\tilde B_{it} = B_{it}/E_\phi(-\partial_{\pi^2}\ell_{it})$. By an argument analogous to the one given above, we can choose $\begin{pmatrix} \alpha^* \\ \gamma^* \end{pmatrix} = \sqrt{NT}\, \bar H^{-1} \mathcal B$ as one solution to the corresponding minimization problem. Then,
$$NT\, \mathcal A' \bar H^{-1}\, \mathcal C\, \bar H^{-1} \mathcal B = \frac{1}{NT} \sum_{i,t} \left[ \tilde\alpha_i^* C_{it} \alpha_i^* + \tilde\alpha_i^* C_{it} \gamma_t^* + \tilde\gamma_t^* C_{it} \alpha_i^* + \tilde\gamma_t^* C_{it} \gamma_t^* \right] = \frac{1}{NT} \sum_{i,t} (\widetilde P A)_{it}\, C_{it}\, (\widetilde P B)_{it}.$$
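Statement (i) is a finite-sample algebraic identity and can be verified numerically. In the sketch below, the positive weights $w_{it}$ stand in for $E_\phi(-\partial_{\pi^2}\ell_{it})$ and $b > 0$ for the penalty weight in the definition of $L$; both are placeholder choices of ours.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, b = 7, 5, 1.0
w = rng.exponential(1.0, (N, T))              # stands in for E_phi(-d^2_pi ell_it)
A = rng.standard_normal((N, T))
B = rng.standard_normal((N, T))
s = np.sqrt(N * T)

core = np.block([[np.diag(w @ np.ones(T)), w],
                 [w.T, np.diag(w.T @ np.ones(N))]]) / s
v = np.concatenate([np.ones(N), -np.ones(T)]) # normalization direction v
Hbar = core + (b / s) * np.outer(v, v)        # expected Hessian with penalty: invertible

Avec = np.concatenate([A @ np.ones(T), A.T @ np.ones(N)]) / (N * T)
Bvec = np.concatenate([B @ np.ones(T), B.T @ np.ones(N)]) / (N * T)

lhs = Avec @ np.linalg.solve(Hbar, Bvec)      # A' Hbar^{-1} B
phi = s * np.linalg.solve(Hbar, Avec)         # (alpha~*, gamma~*) of the proof
PA = phi[:N, None] + phi[None, N:]            # (P~ A)_it = alpha~*_i + gamma~*_t
rhs = (N * T) ** (-1.5) * np.sum(PA * B)
print(np.isclose(lhs, rhs))                   # True: Statement (i)
```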
References

Aghion, P., Bloom, N., Blundell, R., Griffith, R., and Howitt, P. (2005). Competition and innovation: an inverted-U relationship. The Quarterly Journal of Economics, 120(2):701–728.

Cox, D. D. and Kim, T. Y. (1995). Moment bounds for mixing random variables useful in nonparametric function estimation. Stochastic Processes and their Applications, 56(1):151–158.

Fan, J. and Yao, Q. (2003). Nonlinear Time Series: Nonparametric and Parametric Methods. Springer, New York.

Hahn, J. and Kuersteiner, G. (2011). Bias reduction for dynamic nonlinear panel models with fixed effects. Econometric Theory, 27(6):1152–1191.

Higham, N. J. (1992). Estimating the matrix p-norm. Numerische Mathematik, 62(1):539–555.

Horn, R. A. and Johnson, C. R. (1985). Matrix Analysis. Cambridge University Press.

McLeish, D. (1974). Dependent central limit theorems and invariance principles. The Annals of Probability, 2(4):620–628.

White, H. (2001). Asymptotic Theory for Econometricians. Academic Press, New York.
Table S1: Poisson model for patents

Dependent variable: citation-weighted patents
------------------------------------------------------------------------------------
                          (1)       (2)       (3)       (4)       (5)       (6)
------------------------------------------------------------------------------------
Static model
Competition            165.12    152.81    387.46    389.99    401.88    401.51
                       (54.77)   (55.74)   (67.74)
  APE                  -20.00     -6.43     -5.98     -5.49     -6.25     -4.74
                        (7.74)    (8.61)   (19.68)
Competition squared    -88.55    -80.99   -204.55   -205.84   -212.15   -214.03
                       (29.08)   (29.61)   (36.17)

Dynamic model
Lag-patents              1.05      1.07      0.46      0.48      0.50      0.70
                        (0.02)    (0.03)    (0.05)
  APE                    0.86      0.87      0.36      0.38      0.39      0.56
                        (0.02)    (0.03)    (0.07)
Competition             62.95     95.70    199.68    184.70    184.64    255.44
                       (62.68)   (65.08)   (76.66)
  APE                  -12.78     -9.03     -1.68     -0.15     -0.43    -18.45
                        (7.54)    (8.18)   (15.53)
Competition squared    -34.15    -51.09   -105.24    -97.23    -97.22   -136.97
                       (33.21)   (34.48)   (40.87)

Year effects                       Yes       Yes       Yes       Yes       Yes
Industry effects                             Yes       Yes       Yes       Yes
Bias correction                                          A         A         J
(number of lags)                                         1         2
------------------------------------------------------------------------------------
Notes: Data set obtained from ABBGH. Competition is measured by (1 − Lerner index) in the industry-year. All columns are estimated using an unbalanced panel of seventeen industries over the period 1973 to 1994. The first year available is used as the initial condition in the dynamic model. The estimates of the coefficients for the static model in columns (2) and (3) replicate the results in ABBGH. A denotes the bias corrected estimator that uses an analytical correction, with the number of lags used to estimate the spectral expectations given in the bottom row; J denotes the jackknife bias corrected estimator that uses the split panel jackknife in both the individual and time dimensions. Standard errors in parentheses; rows labeled APE report average partial effects.

Table S.2: Homogeneity test for the jackknife

                    Cross section    Time series
Static Model            10.49           13.37
                        (0.01)          (0.00)
Dynamic Model            1.87           12.41
                        (0.60)          (0.01)

Notes: Wald test for equality of common parameters across sub panels. P-values in parentheses.
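For readers replicating the designs in Tables S3–S5, the following schematic shows the data-generating step for the static design described in the notes to Table S3. The coefficient values, fixed effects, and regressor below are placeholder draws of ours, not the ABBGH calibration used in the actual simulations, and the balanced panel ignores the missing-data pattern.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 17, 22
Z = rng.uniform(0.8, 1.0, (N, T))        # placeholder for the competition measure
beta1, beta2 = 2.0, -1.0                 # placeholders for calibrated coefficients
alpha = rng.normal(0.0, 0.5, N)          # placeholder industry effects
gamma = rng.normal(0.0, 0.5, T)          # placeholder year effects
mu = np.exp(beta1 * Z + beta2 * Z**2 + alpha[:, None] + gamma[None, :])
Y = rng.poisson(mu)                      # one simulated panel draw
ape = np.mean((beta1 + 2 * beta2 * Z) * mu)   # average effect, as in the notes
```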
Table S3: Finite-sample properties in static Poisson model

                        Bias   Std. Dev.   RMSE   SE/SD   p; .95
N = 17, T = 22, unbalanced
 Coefficient of Z_it
  MLE                    -58       14       60     1.04     0.01
  MLE-TE                 -62       14       64     1.01     0.01
  MLE-FETE                -2       17       17     1.02     0.96
  Analytical (L=1)        -1       17       17     1.02     0.96
  Analytical (L=2)        -1       17       17     1.02     0.96
  Jackknife               -3       25       25     0.69     0.83
 Coefficient of Z_it^2
  MLE                    -59       14       60     1.03     0.01
  MLE-TE                 -62       14       64     1.01     0.01
  MLE-FETE                -2       17       17     1.02     0.96
  Analytical (L=1)        -1       17       17     1.02     0.96
  Analytical (L=2)        -1       17       17     1.02     0.96
  Jackknife               -3       25       25     0.70     0.83
 APE of Z_it
  MLE                    222      113      248     1.15     0.60
  MLE-TE                  -9      139      139     1.04     0.94
  MLE-FETE               -15      226      226     1.49     1.00
  Analytical (L=1)        -9      225      225     1.50     1.00
  Analytical (L=2)        -6      225      225     1.50     1.00
  Jackknife              -15      333      333     1.01     0.95
N = 34, T = 22, unbalanced
 Coefficient of Z_it
  MLE                    -57       10       59     1.03     0.00
  MLE-TE                 -61       10       62     1.00     0.00
  MLE-FETE                 0       12       12     0.99     0.96
  Analytical (L=1)         0       12       12     0.99     0.96
  Analytical (L=2)         1       13       13     0.99     0.96
  Jackknife               -1       14       14     0.90     0.93
 Coefficient of Z_it^2
  MLE                    -58       10       58     1.03     0.00
  MLE-TE                 -61       10       62     1.00     0.00
  MLE-FETE                 0       13       13     0.99     0.94
  Analytical (L=1)         0       13       13     0.99     0.94
  Analytical (L=2)         1       13       13     0.99     0.94
  Jackknife               -1       14       14     0.90     0.94
 APE of Z_it
  MLE                    226       81      240     0.98     0.20
  MLE-TE                  -3       97       97     0.95     0.94
  MLE-FETE                -6      158      158     1.12     0.98
  Analytical (L=1)         0      159      158     1.11     0.98
  Analytical (L=2)         3      159      159     1.11     0.98
  Jackknife              -15      208      208     0.85     0.90
N = 51, T = 22, unbalanced
 Coefficient of Z_it
  MLE                    -57        8       58     1.00     0.00
  MLE-TE                 -61        8       61     1.00     0.00
  MLE-FETE                 0       10       10     0.97     0.94
  Analytical (L=1)         0       10       10     0.97     0.94
  Analytical (L=2)         1       10       11     0.96     0.94
  Jackknife                0       11       11     0.90     0.93
 Coefficient of Z_it^2
  MLE                    -58        8       57     1.00     0.00
  MLE-TE                 -61        8       61     1.00     0.00
  MLE-FETE                 0       11       11     0.97     0.96
  Analytical (L=1)         0       11       11     0.97     0.96
  Analytical (L=2)         1       11       11     0.96     0.96
  Jackknife                0       11       11     0.90     0.93
 APE of Z_it
  MLE                    228       66      238     0.96     0.06
  MLE-TE                  -1       77       77     0.95     0.94
  MLE-FETE                -4      128      128     1.04     0.96
  Analytical (L=1)         2      129      128     1.04     0.96
  Analytical (L=2)         5      129      129     1.04     0.96
  Jackknife              -12      169      170     0.79     0.88

Notes: All the entries are in percentage of the true parameter value. 500 repetitions. The data generating process is: Y_it ∼ Poisson(exp[β₁Z_it + β₂Z²_it + α_i + γ_t]), with all the variables and coefficients calibrated to the dataset of ABBGH. The average effect is E[(β₁ + 2β₂Z_it) exp(β₁Z_it + β₂Z²_it + α_i + γ_t)]. MLE is the Poisson maximum likelihood estimator without individual and time fixed effects; MLE-TE is the Poisson maximum likelihood estimator with time fixed effects; MLE-FETE is the Poisson maximum likelihood estimator with individual and time fixed effects; Analytical (L = l) is the bias corrected estimator that uses an analytical correction with l lags to estimate the spectral expectations; and Jackknife is the bias corrected estimator that uses the split panel jackknife in both the individual and time dimensions.

Table S4: Finite-sample properties in dynamic Poisson model: lagged dependent variable

                        Bias   Std. Dev.   RMSE   SE/SD   p; .95
N = 17, T = 21, unbalanced
 Coefficient of Y_i,t-1
  MLE                    135        3      135     1.82     0.00
  MLE-TE                 142        3      142     1.95     0.00
  MLE-FETE               -17       15       23     0.96     0.78
  Analytical (L=1)        -7       15       17     0.98     0.91
  Analytical (L=2)        -5       15       16     0.96     0.92
  Jackknife                4       20       21     0.73     0.85
 APE of Y_i,t-1
  MLE                    158        2      158     3.75     0.00
  MLE-TE                 163        3      163     4.17     0.00
  MLE-FETE               -17       15       22     1.38     0.89
  Analytical (L=1)        -8       14       16     1.41     0.97
  Analytical (L=2)        -5       15       16     1.38     0.98
  Jackknife                4       20       20     1.03     0.95
N = 34, T = 21, unbalanced
 Coefficient of Y_i,t-1
  MLE                    135        2      135     1.76     0.00
  MLE-TE                 141        2      141     1.77     0.00
  MLE-FETE               -16       11       19     0.93     0.65
  Analytical (L=1)        -7       11       13     0.95     0.89
  Analytical (L=2)        -4       11       12     0.93     0.91
  Jackknife                3       13       14     0.77     0.85
 APE of Y_i,t-1
  MLE                    158        2      158     2.82     0.00
  MLE-TE                 162        2      162     2.69     0.00
  MLE-FETE               -16       10       19     1.05     0.71
  Analytical (L=1)        -7       10       12     1.08     0.92
  Analytical (L=2)        -4       10       11     1.05     0.94
  Jackknife                3       13       13     0.86     0.89
N = 51, T = 21, unbalanced
 Coefficient of Y_i,t-1
  MLE                    135        2      135     1.81     0.00
  MLE-TE                 141        2      141     1.79     0.00
  MLE-FETE               -15        8       17     0.97     0.55
  Analytical (L=1)        -6        8       10     0.99     0.90
  Analytical (L=2)        -3        8        9     0.97     0.93
  Jackknife                3       11       11     0.77     0.87
 APE of Y_i,t-1
  MLE                    158        1      158     2.58     0.00
  MLE-TE                 162        2      162     2.41     0.00
  MLE-FETE               -15        8       17     1.03     0.55
  Analytical (L=1)        -6        8       10     1.05     0.91
  Analytical (L=2)        -4        8        9     1.03     0.93
  Jackknife                3       10       11     0.80     0.88

Notes: All the entries are in percentage of the true parameter value. 500 repetitions. The data generating process is: Y_it ∼ Poisson(exp[β_Y log(1 + Y_i,t-1) + β₁Z_it + β₂Z²_it + α_i + γ_t]), where all the exogenous variables, the initial condition and the coefficients are calibrated to the application of ABBGH. The average effect is β_Y E[exp{(β_Y − 1) log(1 + Y_i,t-1) + β₁Z_it + β₂Z²_it + α_i + γ_t}]. MLE is the Poisson maximum likelihood estimator without individual and time fixed effects; MLE-TE is the Poisson maximum likelihood estimator with time fixed effects; MLE-FETE is the Poisson maximum likelihood estimator with individual and time fixed effects; Analytical (L = l) is the bias corrected estimator that uses an analytical correction with l lags to estimate the spectral expectations; and Jackknife is the bias corrected estimator that uses the split panel jackknife in both the individual and time dimensions.
Table S5: Finite-sample properties in dynamic Poisson model: exogenous regressor

                        Bias   Std. Dev.   RMSE   SE/SD   p; .95
N = 17, T = 21, unbalanced
 Coefficient of Z_it
  MLE                    -76       27       81     1.13     0.29
  MLE-TE                 -65       28       71     1.12     0.44
  MLE-FETE                 9       40       41     0.95     0.92
  Analytical (L=1)         4       40       40     0.97     0.94
  Analytical (L=2)         3       39       39     0.97     0.94
  Jackknife                3       57       57     0.68     0.82
 Coefficient of Z_it^2
  MLE                    -76       27       80     1.13     0.30
  MLE-TE                 -65       29       71     1.12     0.45
  MLE-FETE                 9       41       42     0.95     0.92
  Analytical (L=1)         4       40       40     0.97     0.94
  Analytical (L=2)         3       40       40     0.97     0.94
  Jackknife                3       57       57     0.68     0.81
 APE of Z_it
  MLE                    760      351      837     1.47     0.89
  MLE-TE                 541      356      647     1.65     0.99
  MLE-FETE                -3     1151     1150     1.03     0.99
  Analytical (L=1)        11     1117     1116     1.06     0.99
  Analytical (L=2)        15     1110     1109     1.07     0.99
  Jackknife               24     1653     1651     0.74     0.86
N = 34, T = 21, unbalanced
 Coefficient of Z_it
  MLE                    -74       19       77     1.18     0.04
  MLE-TE                 -64       19       67     1.18     0.15
  MLE-FETE                 6       28       28     0.97     0.94
  Analytical (L=1)         2       27       27     0.99     0.95
  Analytical (L=2)         0       27       27     0.99     0.95
  Jackknife                2       31       31     0.87     0.92
 Coefficient of Z_it^2
  MLE                    -75       19       77     1.18     0.05
  MLE-TE                 -65       19       67     1.18     0.15
  MLE-FETE                 6       28       29     0.97     0.94
  Analytical (L=1)         2       28       28     0.99     0.95
  Analytical (L=2)         0       27       27     1.00     0.95
  Jackknife                2       31       31     0.87     0.92
 APE of Z_it
  MLE                    777      252      817     1.65     0.42
  MLE-TE                 534      248      589     1.75     0.88
  MLE-FETE               -68      734      736     1.08     0.94
  Analytical (L=1)       -51      713      714     1.11     0.95
  Analytical (L=2)       -47      706      707     1.12     0.95
  Jackknife              -38     1012     1012     0.75     0.85
N = 51, T = 21, unbalanced
 Coefficient of Z_it
  MLE                    -73       15       76     1.17     0.00
  MLE-TE                 -63       16       65     1.15     0.05
  MLE-FETE                 8       22       23     1.01     0.93
  Analytical (L=1)         4       21       22     1.02     0.95
  Analytical (L=2)         2       21       21     1.03     0.95
  Jackknife                4       25       25     0.89     0.91
 Coefficient of Z_it^2
  MLE                    -74       15       75     1.17     0.00
  MLE-TE                 -63       16       65     1.15     0.05
  MLE-FETE                 8       22       24     1.01     0.93
  Analytical (L=1)         4       22       22     1.02     0.95
  Analytical (L=2)         2       22       22     1.03     0.95
  Jackknife                3       25       25     0.89     0.91
 APE of Z_it
  MLE                    768      201      794     1.48     0.18
  MLE-TE                 535      197      570     1.68     0.74
  MLE-FETE               -27      606      606     0.99     0.95
  Analytical (L=1)       -11      588      587     1.02     0.96
  Analytical (L=2)        -5      581      580     1.03     0.96
  Jackknife                8      838      837     0.71     0.83

Notes: All the entries are in percentage of the true parameter value. 500 repetitions. The data generating process is: Y_it ∼ Poisson(exp[β_Y log(1 + Y_i,t-1) + β₁Z_it + β₂Z²_it + α_i + γ_t]), where all the exogenous variables, the initial condition and the coefficients are calibrated to the application of ABBGH. The average effect is E[(β₁ + 2β₂Z_it) exp(β_Y log(1 + Y_i,t-1) + β₁Z_it + β₂Z²_it + α_i + γ_t)]. MLE is the Poisson maximum likelihood estimator without individual and time fixed effects; MLE-TE is the Poisson maximum likelihood estimator with time fixed effects; MLE-FETE is the Poisson maximum likelihood estimator with individual and time fixed effects; Analytical (L = l) is the bias corrected estimator that uses an analytical correction with l lags to estimate the spectral expectations; and Jackknife is the bias corrected estimator that uses the split panel jackknife in both the individual and time dimensions.