Nonparametric estimation of finite mixtures†

Stéphane Bonhomme, CEMFI, Madrid
Koen Jochmans‡, Sciences Po, Paris
Jean-Marc Robin, Sciences Po, Paris, and University College London

[Revised October 17, 2012]

Abstract. The aim of this paper is to provide simple nonparametric methods to estimate finite-mixture models from data with repeated measurements. Three measurements suffice for the mixture to be fully identified, and so our approach can be used even with very short panel data. We provide distribution theory for estimators of the number of mixture components and the mixing proportions, as well as of the mixture distributions and various functionals thereof. These estimators are found to perform well in a series of Monte Carlo exercises. We apply our techniques to document heterogeneity in log annual earnings using PSID data spanning the period 1969–1998.

Keywords: finite-mixture model, nonparametric estimation, series expansion, simultaneous diagonalization.

†The first version of this paper dates from January 2012. We are grateful to Marc Henry and Bernard Salanié for comments and discussion.
‡Address for correspondence: Sciences Po, Department of Economics, 28 rue des Saints-Pères, 75007 Paris, France. E-mail: koen.jochmans@sciences-po.org

1. Introduction

Finite mixtures encompass a large class of models. Popular applications include modeling unobserved heterogeneity, performing factor analysis and ecological inference, as well as dealing with corrupted data and misclassified observations. We refer to Henry et al. (2010) for illustrations of mixture models in economics and to McLachlan and Peel (2000) for a book-length treatment.

The conventional approach to inference in finite mixtures is parametric—that is, the distribution function of the outcome of interest, say $y$, conditional on a discrete latent variable, say $x$, is specified up to a finite-dimensional parameter—with identification being driven fully by functional form. Only a few results on the nonparametric identifiability and estimability of mixtures have so far been obtained, and our aim here is to contribute to their development.

In line with the seminal work of Hall and Zhou (2003), our analysis builds on the availability of repeated measurements on $y$. These measurements—which could be collected either contemporaneously or over time, as with panel data, for example—are assumed to be independent and identically distributed conditional on $x$. Our arguments yield point identification of the number of mixture components, the component distributions, and the mixing proportions when three or more repeated measurements are available and a rank condition involving the component distributions is satisfied. Our approach to identification is constructive. The resulting estimators are attractive from a computational point of view and have desirable large-sample properties. For brevity, we focus on the case where the outcome variable exhibits continuous variation. However, our methods can also be applied when $y$ is a discrete random variable, allowing the number of support points to grow with the sample size.

Our analysis is based on projections of the marginal densities and component mixtures onto a basis of functions. We show that the mixture structure imposes a set of linear restrictions on the (generalized) Fourier coefficients of these densities.
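To preview the structure that drives our results (stated formally as equation (2.1) in Section 2), write $b_{ik} \equiv E[\chi_i(y)\mid x = k]$ for the coefficients of component $k$ in an orthonormal basis $\{\chi_i\}$. Conditional independence of three measurements given $x$ forces the observable trivariate coefficients to factor as
$$a_{i_1i_2i_3} \equiv E[\chi_{i_1}(y_1)\,\chi_{i_2}(y_2)\,\chi_{i_3}(y_3)] = \sum_{k=1}^K b_{i_1k}\,b_{i_2k}\,b_{i_3k}\,\omega_k,$$
a multilinear system in the unknown $b_{ik}$ and $\omega_k$ that the next section solves constructively.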
When the coefficients of the component densities are linearly independent, the number of component mixtures is identified as the rank of the matrix containing the Fourier coefficients of the bivariate marginal density. Further, the coefficients themselves are identified as the eigenvalues of a sequence of whitened matrices of Fourier coefficients of the trivariate marginal density. This implies point identification of the component mixtures on their entire support. The mixing proportions are then readily pinned down as the unique minimizers of a minimum-distance criterion.

We propose estimating the number of components by sequentially testing the rank of an empirical analog of the matrix of bivariate Fourier coefficients, and show consistency of the resulting point estimate. To estimate the component densities, we employ an algorithm for the joint approximate diagonalization of a set of eigenmatrices developed by Cardoso and Souloumiac (1993). We derive both mean integrated squared error (MISE) and uniform convergence rates, and provide conditions for pointwise asymptotic normality. Interestingly, the convergence rates coincide with those that would be obtained if data could be sampled directly from the component densities, and are optimal in the MISE sense for a variety of basis functions. A least-squares type estimator for the mixing proportions that converges at the parametric rate is then readily constructed. We also present distribution theory for GMM estimators of finite-dimensional parameters defined through conditional moment restrictions involving the component mixtures.

Our identification result regarding the number of components improves on the lower bound recently established by Kasahara and Shimotsu (2012). The conditions underlying our identification analysis of the component mixtures and mixing proportions are similar to those in Allman et al. (2009). However, our approach is constructive. The resulting estimators also complement and extend the work of Hall and Zhou (2003) and Hall et al. (2005), who proposed computationally cumbersome nonparametric approaches to estimate bivariate mixtures.

We present numerical evidence on the small-sample performance of our procedures through two Monte Carlo illustrations. As an empirical illustration, we use our methods to investigate the presence of unobserved heterogeneity in earnings data for the period 1969–1998. We find evidence of substantial nonlinear heterogeneity in earnings and document its evolution over time.

2. Identification

Let $x$ be a latent discrete random variable, normalized to take on the value $k = 1, 2, \dots, K$ with probability $\omega_k > 0$. Let $y^T \equiv (y_1, y_2, \dots, y_T)'$ be a vector of repeated measurements on an observable outcome variable $y$ whose marginal probability density function (PDF) takes the form
$$f(y_1, \dots, y_T) = \sum_{k=1}^K\Big\{\prod_{t=1}^T f_k(y_t)\Big\}\,\omega_k,$$
where $f_k$ denotes the PDF of $y$ conditional on $x = k$.

Let $L^2(\mathcal S)$ denote the set of functions that are square-integrable on a space $\mathcal S$, that is, the set of functions $g$ for which $\int_{\mathcal S}|g(u)|^2\,du < \infty$. Impose the following condition.

Assumption 1. For each $k = 1, 2, \dots, K$, $f_k \in L^2([-1,1])$ and is absolutely continuous.

Restricting the support to equal the interval $[-1, 1]$ is innocuous, as the outcome may always be transformed by means of a strictly monotonic function whose image is $[-1, 1]$; a rescaled arctangent or hyperbolic tangent, for instance, maps the whole real line into this interval. Square-integrability validates a generalized Fourier expansion in a basis of functions, which is key for our subsequent developments.
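As a concrete example of such a basis—the normalized Chebyshev system that we adopt in the Monte Carlo work of Section 4—the following sketch (ours, for illustration only) evaluates the basis functions and verifies orthonormality numerically via the substitution $u = \cos t$:

```python
import numpy as np

def chi(i, u):
    """Normalized Chebyshev basis function on [-1, 1] (1-indexed); see Section 4."""
    norm = np.sqrt(2.0 / (np.pi * (2.0 if i == 1 else 1.0)))
    return norm * np.cos((i - 1) * np.arccos(u)) / (1.0 - u**2) ** 0.25

# Orthonormality check: with u = cos(t) the weight (1 - u^2)^(-1/2) cancels,
# so the L2([-1, 1]) inner product becomes an integral over t in [0, pi].
t = np.linspace(0.0, np.pi, 20001)
for i in range(1, 5):
    for j in range(1, 5):
        ni = np.sqrt(2.0 / (np.pi * (2.0 if i == 1 else 1.0)))
        nj = np.sqrt(2.0 / (np.pi * (2.0 if j == 1 else 1.0)))
        f = ni * nj * np.cos((i - 1) * t) * np.cos((j - 1) * t)
        integral = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(t))  # trapezoid rule
        assert abs(integral - float(i == j)) < 1e-6
```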
Let $\{\chi_i: i \ge 1\}$ be a complete orthonormal basis for $L^2([-1,1])$. For any integer $I$, the (generalized) Fourier approximation of $f_k$ is given by the projection of the density onto the subspace spanned by $\chi_1, \chi_2, \dots, \chi_I$, that is,
$$\mathrm{Proj}_I f_k \equiv \sum_{i=1}^I b_{ik}\,\chi_i, \qquad b_{ik} \equiv \int_{-1}^1 \chi_i(y)\,f_k(y)\,dy = E[\chi_i(y)\mid x = k],$$
and satisfies $\lim_{I\to\infty}\|\mathrm{Proj}_I f_k - f_k\|_2 = 0$ and $\lim_{I\to\infty}|\mathrm{Proj}_I f_k(y) - f_k(y)| = 0$ for all $-1 \le y \le 1$, where $\|\cdot\|_2$ denotes the $L^2$-norm. (Throughout, the notation $\|\cdot\|$ will be reserved for the Euclidean norm when its argument is a vector and for the matrix norm when its argument is a matrix. We use $|\cdot|$ to denote the absolute value when the argument is a real number and the cardinality when it is a set.) Likewise, the projection of the marginal density of $(y_1, y_2, \dots, y_H)$ is
$$\mathrm{Proj}_{I,H} f \equiv \sum_{i_1=1}^I\sum_{i_2=1}^I\cdots\sum_{i_H=1}^I a_{i_1i_2\cdots i_H}\,\chi_{i_1}\chi_{i_2}\cdots\chi_{i_H}, \qquad a_{i_1i_2\cdots i_H} \equiv E[\chi_{i_1}(y_1)\,\chi_{i_2}(y_2)\cdots\chi_{i_H}(y_H)].$$

In our setup, while the data do not directly nonparametrically identify $\{b_{ik}: i \ge 1\}$, they do reveal the sequences $\{a_{i_1i_2\cdots i_H}: i_1, i_2, \dots, i_H \ge 1\}$ for all $H \le T$, while the mixture structure implies the set of linear restrictions
$$a_{i_1i_2\cdots i_H} = \sum_{k=1}^K b_{i_1k}\,b_{i_2k}\cdots b_{i_Hk}\,\omega_k, \tag{2.1}$$
for each $H \le T$ and all $i_1, i_2, \dots, i_H$. Under weak restrictions, the relations in (2.1) can be uniquely solved for the Fourier coefficients of the component mixtures and their associated mixing proportions (up to an arbitrary relabeling). To discuss identification it suffices to set $T = 3$. (Non-trivial bounds on the component mixtures when $T = K = 2$ are given by Hall and Zhou, 2003.)

It is useful to write the restrictions in matrix form. For any $I$, let
$$B \equiv \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1K}\\ b_{21} & b_{22} & \cdots & b_{2K}\\ \vdots & \vdots & \ddots & \vdots\\ b_{I1} & b_{I2} & \cdots & b_{IK}\end{pmatrix}$$
be the $I \times K$ matrix whose $k$th column contains the first $I$ Fourier coefficients of $f_k$. Similarly, let $a \equiv (a_1, a_2, \dots, a_I)'$ and introduce the $J + 1$ symmetric $I \times I$ matrices
$$A_0 \equiv \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1I}\\ a_{21} & a_{22} & \cdots & a_{2I}\\ \vdots & \vdots & \ddots & \vdots\\ a_{I1} & a_{I2} & \cdots & a_{II}\end{pmatrix}, \qquad A_j \equiv \begin{pmatrix} a_{11j} & a_{12j} & \cdots & a_{1Ij}\\ a_{21j} & a_{22j} & \cdots & a_{2Ij}\\ \vdots & \vdots & \ddots & \vdots\\ a_{I1j} & a_{I2j} & \cdots & a_{IIj}\end{pmatrix},$$
containing the Fourier coefficients of the bivariate and of the trivariate marginal PDF, respectively. Then, from (2.1),
$$a = B\omega, \qquad A_0 = B\Omega B', \qquad A_j = B\Omega^{1/2}D_j\Omega^{1/2}B', \quad j = 1, 2, \dots, J, \tag{2.2}$$
where $\omega \equiv (\omega_1, \omega_2, \dots, \omega_K)'$, $\Omega \equiv \mathrm{diag}[\omega]$, and $D_j \equiv \mathrm{diag}[b_{j1}, b_{j2}, \dots, b_{jK}]$. Identification rests on the following assumption.

Assumption 2. The matrix $B$ has full column rank, that is, $\mathrm{rank}[B] = K$.

The assumption of maximal column rank imposes the absence of multicollinearity among the Fourier coefficients of the component mixtures and is intuitive. Clearly, it requires that $I \ge K$. Theorem 1 follows.

Theorem 1. (i) The number of mixture components, (ii) the component mixtures, and (iii) the mixing proportions are all nonparametrically identified, where (ii) and (iii) are identified up to an arbitrary relabeling of the components.

The proof of Theorem 1 is constructive. By Assumption 2, the matrix $A_0$, which is real and symmetric, has rank $K$. As this matrix is nonparametrically identified, so is its rank and, hence, the number of mixture components. This establishes Theorem 1(i). Continuing on, $A_0$ admits the spectral decomposition
$$A_0 = V\Lambda V', \qquad V \equiv (v_1, v_2, \dots, v_K) \in \mathbb R^{I\times K}, \qquad \Lambda \equiv \mathrm{diag}[\lambda_1, \lambda_2, \dots, \lambda_K] \in \mathbb R^{K\times K},$$
where the $K$ eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_K$ are all positive and real, and the corresponding eigenvectors $v_1, v_2, \dots, v_K$ are linearly independent.
Observe that, if some eigenvalues are multiple, the matrix $V$ is not unique. By construction, $A_0$ can be transformed into the identity matrix through pre- and post-multiplication by the matrix $W \equiv \Lambda^{-1/2}V' \in \mathbb R^{K\times I}$ and its transpose, respectively. Moreover, from (2.2), we obtain $WA_0W' = I_K = UU'$ for $U \equiv WB\Omega^{1/2}$. Note that, by Assumption 2, $U$ is a $K \times K$ orthonormal matrix of full rank. Now, we can transform each $A_j$ in the same way and obtain
$$C_j \equiv WA_jW' = UD_jU'.$$
That is, the sequence of matrices $\{C_j: j \ge 1\}$ is simultaneously diagonalizable in the same basis (namely, the columns of $U$). The eigenvalue of $C_j$ corresponding to its $k$th eigenvector equals $b_{jk}$, the $j$th Fourier coefficient of $f_k$. Hence, $\mathrm{Proj}_J f_k$ is identified for any $J$, which implies that $f_k$ is identified in the sense that $\lim_{J\to\infty}\mathrm{Proj}_J f_k(y) = f_k(y)$ for each $y$ in $[-1, 1]$ and for all $k = 1, 2, \dots, K$. Identification is achieved only up to an arbitrary relabeling because the eigenvectors can be rearranged without affecting the argument. This establishes Theorem 1(ii).

It is useful to remark at this point that identification does not require that $I \to \infty$. That is, we need not calculate the whole sequence $\{a_{i_1i_2i_3}: i_1, i_2, i_3 \ge 1\}$ to recover $\{b_{jk}: j \ge 1\}$.

Finally, from (2.2), $B^+a = \omega$, where $B^+ \equiv (B'B)^{-1}B'$ is the Moore–Penrose pseudo-inverse of $B$. Because $B$ is identified, so is $\omega$. This establishes Theorem 1(iii).

Knowledge of the component mixtures implies identification of all their functionals. Corollary 1 provides one possible representation of this result that is particularly useful for the purpose of estimation. To state it, let
$$\tau_k(y) \equiv \frac{f_k(y)}{f(y)}, \qquad k = 1, 2, \dots, K,$$
identification of which follows immediately from Theorem 1.

Corollary 1. For any measurable function $g$ whose expectation under $f_k$ exists, $E[g(y)\mid x = k] = E[g(y)\,\tau_k(y)]$ for each $k = 1, 2, \dots, K$. Furthermore, the function $\tau_k$ and, therefore, all such functionals are nonparametrically identified.

Corollary 1 is instrumental in inferring a parameter $\theta_0$ defined through moment conditions of the form $E[g(y;\theta_0)\mid x = k] = 0$ for a known measurable function $g$. The weight function $\tau_k$ is also interesting in its own right. More precisely, $\pi_k \equiv \omega_k\tau_k$ provides the probability that $x = k$ as a function of $y$. This is a key object of interest in latent-class analysis, allowing classification of observations based on marginal information.

3. Estimation

For $y_n^T \equiv (y_{n1}, y_{n2}, \dots, y_{nT})'$, let $\{y_n^T: n = 1, 2, \dots, N\}$ denote a sample of size $N$ drawn at random from $f$. Throughout, we assume that $T \ge 3$. Below we present, in turn, the large-sample properties of our estimators of the number of components, of the component mixtures and the mixture weights, and of their functionals. The proofs of all theorems in this section are collected in Appendix B.

Let $\rho_H$ be the set of ordered $H$-tuples of distinct integers from the set $\{1, 2, \dots, T\}$. An unbiased estimator of $a_{i_1i_2\cdots i_H}$ for any $H$ is given by the sample average
$$\hat a_{i_1i_2\cdots i_H} \equiv \frac1N\sum_{n=1}^N\frac{1}{|\rho_H|}\sum_{(t_1,t_2,\dots,t_H)\in\rho_H}\chi_{i_1}(y_{nt_1})\,\chi_{i_2}(y_{nt_2})\cdots\chi_{i_H}(y_{nt_H}), \tag{3.1}$$
and is $\sqrt N$-consistent and asymptotically normal.

3.1. Number of components

Use (3.1) with $H = 2$ to construct an estimator of $A_0$, say $\hat A_0$. Then $\hat A_0$ is $\sqrt N$-consistent and asymptotically normal, that is,
$$\sqrt N\,\mathrm{vec}[\hat A_0 - A_0] \xrightarrow{L} N(0, V_{A_0}), \qquad V_{A_0} \equiv E[\psi_{A_0}(y^T)\,\psi_{A_0}(y^T)'],$$
with the $i_1 + (i_2-1)I$th entry of $\psi_{A_0}(y^T)$ being $|\rho_2|^{-1}\sum_{(t_1,t_2)\in\rho_2}\chi_{i_1}(y_{t_1})\chi_{i_2}(y_{t_2}) - a_{i_1i_2}$.
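To fix ideas, here is a minimal sketch (our own illustration, not the authors' code) of the sample averages in (3.1) for $H = 2$ and $H = 3$, assuming a precomputed array X of basis evaluations with X[i, n, t] = $\chi_{i+1}(y_{nt})$:

```python
import numpy as np
from itertools import permutations

def A0_hat(X):
    """Sample analog of A0: entries a_{i1 i2} from (3.1) with H = 2.

    X is an I x N x T array of basis evaluations X[i, n, t] = chi_{i+1}(y_nt).
    Averages over all ordered pairs of distinct time indices and over n."""
    I, N, T = X.shape
    pairs = list(permutations(range(T), 2))
    A = sum(X[:, :, t1] @ X[:, :, t2].T for t1, t2 in pairs)
    return A / (N * len(pairs))

def Aj_hat(X, j):
    """Sample analog of A_j: entries a_{i1 i2 j} from (3.1) with H = 3,
    the third basis index fixed at j (1-indexed, as in the text)."""
    I, N, T = X.shape
    triples = list(permutations(range(T), 3))
    A = sum((X[:, :, t1] * X[j - 1, :, t3]) @ X[:, :, t2].T
            for t1, t2, t3 in triples)
    return A / (N * len(triples))
```

With these, A0_hat(X) plays the role of $\hat A_0$ and Aj_hat(X, j) that of $\hat A_j$ in the rank test and diagonalization steps below.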
Inferring $K$ then boils down to testing the rank of $\hat A_0$, which can be done by any of a number of approaches. The procedure of Kleibergen and Paap (2006) has several attractive features and therefore carries our preference. (Kleibergen and Paap (2006) discuss the relative merits of their test statistic. Prime advantages include its insensitivity to the ordering of the variables and the fact that its limit distribution under the null is free of nuisance parameters.)

To describe it, start by ordering the eigenvalues of $A_0$ in decreasing order, and place the eigenvectors in $V$ accordingly. Fix $k \in \{0, 1, \dots, I-1\}$ and let $\Lambda_{I-k} \equiv \Lambda(k+1:I, k+1:I)$ denote the lower-right $(I-k)\times(I-k)$ block of the matrix of eigenvalues of $A_0$. Also, let $V_k \equiv V(1:k, k+1:I)$ and $V_{I-k} \equiv V(k+1:I, k+1:I)$, using obvious notation for matrix-block selection. Thus, $(V_k', V_{I-k}')'$ is the matrix of eigenvectors associated with the last $I-k$ eigenvalues. Now let
$$\varphi_k \equiv (\bar V_{I-k} \otimes \bar V_{I-k})\,\mathrm{vec}[A_0], \qquad \bar V_{I-k} \equiv (V_{I-k}V_{I-k}')^{1/2}\,(V_{I-k}')^{-1}\,(V_k', V_{I-k}').$$
Then, if $\mathrm{rank}[A_0] = K \le k$, $\Lambda_{I-k} = 0$ and
$$\hat r_k \equiv N|\rho_2|\,\hat\varphi_k'\,\hat V_{\varphi_k}^{-1}\,\hat\varphi_k \xrightarrow{L} \chi^2\big((I-k)^2\big), \qquad V_{\varphi_k} \equiv (\bar V_{I-k} \otimes \bar V_{I-k})\,V_{A_0}\,(\bar V_{I-k} \otimes \bar V_{I-k})',$$
where $\hat\varphi_k$ is the sample analog of $\varphi_k$ and $\hat V_{\varphi_k}$ is a consistent estimator of $V_{\varphi_k}$. The rank statistic $\hat r_k$ can be used to test the null $H_0: \mathrm{rank}[A_0] = k$ against the alternative $H_1: \mathrm{rank}[A_0] > k$. Furthermore, the rank statistic also leads to a consistent estimator of $K$, based on a sequential-testing procedure. Following Robin and Smith (2000), let
$$\hat K \equiv \min_{k\in\{0,1,\dots,I-1\}}\big\{k : \hat r_\ell \ge p_{1-\alpha}(\ell) \text{ for } \ell = 0, 1, \dots, k-1, \text{ and } \hat r_k < p_{1-\alpha}(k)\big\},$$
for $p_{1-\alpha}(\ell)$ the $100(1-\alpha)$th percentile of the $\chi^2((I-\ell)^2)$ distribution and $\alpha = \alpha(N)$ a chosen significance level. That is, $\hat K$ is the first integer for which we fail to reject the null at significance level $\alpha$. Theorem 2 follows directly from Theorem 5.2 in Robin and Smith (2000).

Theorem 2. Let $\alpha \to 0$ and $-\log\alpha/N \to 0$ as $N \to \infty$. Then $\hat K \xrightarrow{P} K$.

Our estimator of $K$ is similar to a recent proposal of Kasahara and Shimotsu (2012), who derived a consistent estimator of a lower bound on $K$. Their approach is based on the histogram as an estimator of the bivariate marginal density. For some integer values $H_1$ and $H_2$, let $\{\mathcal H_{h_1}\times\mathcal H_{h_2}: h_1 = 1,\dots,H_1;\ h_2 = 1,\dots,H_2\}$ be a set of adjacent rectangles that partition the Cartesian square $[-1,1]\times[-1,1]$. Let $H$ be the $H_1\times H_2$ matrix whose $(h_1,h_2)$th entry is the cell probability $E[1\{(y_1,y_2)\in\mathcal H_{h_1}\times\mathcal H_{h_2}\}]$. Kasahara and Shimotsu (2012) show that $K$ equals the non-negative rank of $H$, that is, $K = \mathrm{rank}_+[H]$. Because $\mathrm{rank}[H] \le \mathrm{rank}_+[H]$, an estimator of a lower bound on $K$ can be constructed as above, with $\hat A_0$ replaced by a nonparametric estimator of $H$.

3.2. Component densities and mixing proportions

Dealing with the pretesting problem that arises from using $\hat K$ rather than $K$ in estimating the mixture model is beyond the scope of this paper, and so, throughout the remainder of this section, we take $K$ as given. The key components of our estimators of the $f_k$ and the $\omega_k$ are the matrices of Fourier coefficients $\{D_j: j = 1, 2, \dots, J\}$, and so we start our discussion there.

Let $\hat W$ be the sample counterpart to $W$, constructed from the spectral decomposition of $\hat A_0$. For $j = 1, 2, \dots, J$, let $\hat A_j$ be an estimator of $A_j$ and let $\hat C_j \equiv \hat W\hat A_j\hat W'$. Our estimator of $D_j = \mathrm{diag}[b_{j1}, b_{j2}, \dots, b_{jK}]$ is
$$\hat D_j \equiv \hat U'\hat C_j\hat U, \qquad \hat U \equiv \arg\min_{X\in O(K)} \sum_{i=1}^I \mathrm{off}\big[X'\hat C_i X\big], \tag{3.2}$$
where $O(K)$ denotes the set of orthonormal matrices of dimension $K \times K$ and the function $\mathrm{off}[\cdot]$ maps a square matrix into the sum of its squared off-diagonal elements.

It is useful to note that estimation is not based on the eigenspectra of the individual matrices $\{\hat C_j: j = 1, 2, \dots, J\}$, which will typically all be different due to sampling error. Rather, our estimator of the eigenvectors $U$ in (3.2) is defined as the orthogonal matrix $X$ in $O(K)$ that makes the set $\{X'\hat C_i X: i = 1, 2, \dots, I\}$ as diagonal as possible. (More precisely, minimizing the criterion defined in (3.2) is equivalent to solving the least-squares problem $\min_{X\in O(K),\,L_i\in\mathcal L(K)}\sum_{i=1}^I\|\hat C_i - XL_iX'\|^2$, where $\mathcal L(K)$ denotes the set of $K \times K$ diagonal matrices. Note also that $\hat U$ could be computed from more than the leading $I$ matrices of whitened Fourier coefficients; that is, $I$ in (3.2) may be replaced by any fixed integer.) This approach is attractive from an efficiency point of view, as it utilizes the restriction that the $C_i$ share the same eigenvectors. The estimator $\hat U$ can easily be computed using the algorithm for the joint approximate diagonalization of eigenmatrices (JADE) developed by Cardoso and Souloumiac (1993). JADE is fast and straightforward to implement.

In Appendix A we derive the asymptotic properties of the estimator of the Fourier coefficients. In particular, Theorem A.1 shows that, although we cannot sample data directly from the component densities, their Fourier coefficients are estimable at the conventional parametric rate.

Given our estimator of the Fourier coefficients, our estimator of $f_k$ ($k = 1, 2, \dots, K$) is a conventional series estimator, that is,
$$\hat f_k \equiv \sum_{j=1}^J \hat b_{jk}\,\chi_j,$$
where $\hat b_{jk}$ is the $k$th diagonal entry of $\hat D_j$.

Let $\|\cdot\|_\infty$ denote the supremum norm. Impose the following high-level assumption.

Assumption 3. The non-zero eigenvalues of $A_0$ are all simple. The basis functions satisfy $\|\chi_j\|_\infty \le \zeta$, where $\zeta$ is a finite constant that does not depend on $j$. For each $k = 1, 2, \dots, K$, $\|\mathrm{Proj}_J f_k - f_k\|_\infty = O(J^{-\beta})$ for some $\beta \ge 1$.

The restriction on the eigenvalues is mainly made to facilitate the exposition. It is well known that the asymptotic distribution of eigenvalues depends in a complicated way on their multiplicity, which is required to be known; see, e.g., Eaton and Tyler (1991). The uniform bound on the basis functions is weak. The bound on the approximation error is conventional in nonparametric curve estimation by series expansions. It is compatible with a large variety of basis functions and can be interpreted as a smoothness requirement on the component mixtures; Chen (2007) provides examples with specific basis functions. Observe that imposing a certain degree of smoothness on the component mixtures is substantially less demanding than requesting the same from the marginal density: the mixing will typically create non-smoothness, even if the components are all smooth.

The following theorem provides mean integrated squared error (MISE) and uniform convergence rates for the estimator of the component mixtures.

Theorem 3. The estimator of the component mixtures satisfies
$$E\|\hat f_k - f_k\|_2^2 = O\big(J/N + J^{-2\beta}\big), \qquad \|\hat f_k - f_k\|_\infty = O_P\big(J/\sqrt N + J^{-\beta}\big),$$
for each $k = 1, 2, \dots, K$.

The rates in Theorem 3 equal the conventional univariate rates of nonparametric series estimators; see, e.g., Schwartz (1967). The MISE rate is known to be optimal for splines in the sense that it achieves the bound established by Stone (1982).
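For completeness, the joint approximate-diagonalization step in (3.2) can be coded in a few lines. The sketch below is a plain NumPy transcription of the classic Jacobi-rotation scheme in the spirit of Cardoso and Souloumiac (1993)—using the standard closed-form rotation angle for real symmetric matrices—rather than the authors' own implementation:

```python
import numpy as np

def joint_diagonalize(mats, sweeps=100, tol=1e-12):
    """Jacobi-rotation joint approximate diagonalization of real symmetric
    matrices: find an orthogonal U making U' C_j U as diagonal as possible,
    simultaneously over the whole collection (cf. equation (3.2))."""
    A = np.stack([np.array(m, dtype=float) for m in mats])  # (nmat, K, K)
    K = A.shape[1]
    U = np.eye(K)
    for _ in range(sweeps):
        rotated = False
        for p in range(K - 1):
            for q in range(p + 1, K):
                # closed-form angle reducing off-diagonal mass in the (p, q) plane
                h = np.stack([A[:, p, p] - A[:, q, q],
                              A[:, p, q] + A[:, q, p]])      # 2 x nmat
                G = h @ h.T
                ton, toff = G[0, 0] - G[1, 1], G[0, 1] + G[1, 0]
                theta = 0.5 * np.arctan2(toff, ton + np.hypot(ton, toff))
                c, s = np.cos(theta), np.sin(theta)
                if abs(s) > tol:
                    rotated = True
                    # rotate rows p, q and columns p, q of every matrix
                    Ap, Aq = A[:, p, :].copy(), A[:, q, :].copy()
                    A[:, p, :], A[:, q, :] = c * Ap + s * Aq, c * Aq - s * Ap
                    Ap, Aq = A[:, :, p].copy(), A[:, :, q].copy()
                    A[:, :, p], A[:, :, q] = c * Ap + s * Aq, c * Aq - s * Ap
                    Up, Uq = U[:, p].copy(), U[:, q].copy()
                    U[:, p], U[:, q] = c * Up + s * Uq, c * Uq - s * Up
        if not rotated:  # converged: no rotation exceeded the tolerance
            break
    return U, A  # U and the (approximately) diagonalized matrices
```

In our notation, calling this routine with the whitened matrices $\hat C_1, \dots, \hat C_I$ returns $\hat U$, and the diagonals of the rotated $\hat C_j$ deliver the coefficient estimates $\hat b_{jk}$.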
Theorem 4 below presents the limit distribution of the estimator of the component mixtures at a fixed evaluation point $y$. To state the result, some additional notation is required. Let
$$\psi_{f_k}(y_n^T) \equiv \frac{1}{|\rho_3|}\sum_{(t_1,t_2,t_3)\in\rho_3}\Big[\bar\tau_k(y_{nt_1},y_{nt_2})\,\kappa_J(y_{nt_3};y) - E\big[\bar\tau_k(y_{t_1},y_{t_2})\,\kappa_J(y_{t_3};y)\big]\Big], \tag{3.3}$$
whose dependence on $y$ is kept implicit and where it should be clear that the expectation is taken with respect to $(y_{t_1}, y_{t_2}, y_{t_3})$. Here,
$$\bar\tau_k(y_1,y_2) \equiv \sum_{i_1=1}^I\sum_{i_2=1}^I u_k'w_{i_1}\,\chi_{i_1}(y_1)\chi_{i_2}(y_2)\,w_{i_2}'u_k, \qquad \kappa_J(y_1,y_2) \equiv \sum_{j=1}^J \chi_j(y_1)\chi_j(y_2),$$
where $w_i$ is the $i$th column of the $K \times I$ whitening matrix $W$ and $u_k$ is the $k$th column of the $K \times K$ matrix of joint eigenvectors $U$.

The proof of asymptotic normality shows that the limit distribution of $\hat f_k(y) - f_k(y)$ coincides with that of the sample average of (3.3). The form of this function is interesting. The function $\kappa_J$ is known as the $J$th kernel of the system $\{\chi_j: j \ge 1\}$ and behaves much like its namesake in conventional kernel estimation of densities; e.g., $E[\kappa_J(y_{t_1}, y)] = \mathrm{Proj}_J f(y)$. Concerning $\bar\tau_k$, note that $E[\bar\tau_k(y_{t_1},y_{t_2}) \mid x = k'] = \omega_k^{-1}\,1\{k' = k\}$, and so
$$E[\bar\tau_k(y_{t_1},y_{t_2})\,\kappa_J(y_{t_3},y)] = \mathrm{Proj}_J f_k(y) = E[\tau_k(y_{t_1})\,\kappa_J(y_{t_1},y)],$$
as is readily verified by use of the law of iterated expectations. Hence, $\hat f_k(y)$ can be viewed as a reweighting estimator based on an estimator of $f(y)$, the marginal density of $y$.

Theorem 4. Let $J \to \infty$ so that $J/N \to 0$ and $N J^{-2\beta} \to 0$ as $N \to \infty$. Then, for each $k = 1, 2, \dots, K$ and all $y \in [-1,1]$ where $0 < f(y) < \infty$, the estimated component mixture satisfies
$$\sqrt N\,V_{f_k}^{-1/2}\,\big[\hat f_k(y) - f_k(y)\big] \xrightarrow{L} N(0,1), \qquad V_{f_k} \equiv E[\psi_{f_k}(y^T)\,\psi_{f_k}(y^T)],$$
as $N \to \infty$.

The proof of Theorem 4 shows that $V_{f_k} = O(\alpha_J(y))$ for $\alpha_J(y) \equiv \int_{-1}^1 \kappa_J(u,y)^2\,du$. Thus, the pointwise convergence rate is determined by the growth rate of $\alpha_J(y)$. A weak bound is
$$|\hat f_k(y) - f_k(y)| = O_P\big(\sqrt{J/N}\big)$$
because $\|\chi_j\|_\infty \le \zeta$, which implies that $\alpha_J(y) = O(J)$. The conditions in the theorem impose that $J/N = o(1)$. The second restriction on the growth rate of $J$ ensures that the limit distribution is free of bias.

Use the $\hat b_{jk}$ to form a plug-in estimator of $B$, say $\hat B$. On subsequently constructing $\hat B^+ \equiv (\hat B'\hat B)^{-1}\hat B'$, our estimator of $\omega$ takes the form
$$\hat\omega \equiv \hat B^+\hat a,$$
where the $I$-vector $\hat a$ is constructed using (3.1) with $H = 1$. Note that $\hat\omega$ minimizes a sample version of the $L^2$-distance $\|\mathrm{Proj}_I f - \sum_k \mathrm{Proj}_I f_k\,p_k\|_2^2$ with respect to $p_1, p_2, \dots, p_K$. It is straightforward to construct other minimum-distance estimators of $\omega$ based on weight functions different from the uniform one used here.

The asymptotic properties of $\hat\omega$ are easily established using the results from Appendix A. Denote by $\psi_a(y_n^T)$ the $I$-vector whose $i$th entry equals $T^{-1}\sum_{t=1}^T \chi_i(y_{nt}) - a_i$, and let $\psi_B$ be the influence function of $\mathrm{vec}[\hat B]$. Note that $\psi_B$ is readily obtained from the influence functions of the $\mathrm{vec}[\hat D_i]$, which are provided in Theorem A.1 in Appendix A. On introducing
$$\psi_\omega(y_n^T) \equiv (B'B)^{-1}\big[(a'\otimes I_K)\,\psi_B(y_n^T) + B'\psi_a(y_n^T)\big],$$
Theorem 5 can be stated.

Theorem 5. The minimum-distance estimator $\hat\omega$ of the mixing proportions $\omega$ satisfies
$$\sqrt N(\hat\omega - \omega) \xrightarrow{L} N(0, V_\omega), \qquad V_\omega \equiv E[\psi_\omega(y^T)\,\psi_\omega(y^T)'],$$
as $N \to \infty$.

Our minimum-distance estimator is similar in spirit to the proposal of Titterington (1983), who worked in a framework where data can be sampled directly from the $K$ component mixtures.
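Continuing the sketch from above (again ours, under the same assumptions), the last two steps of this subsection are one-liners: the component density is the series estimator built from the diagonals of the $\hat D_j$, and the mixing proportions solve the least-squares problem $\hat\omega = \hat B^+\hat a$:

```python
import numpy as np

def f_hat(y, B_hat, k, chi):
    """Series estimate of component k at points y: sum_j B_hat[j, k] * chi_{j+1}(y).

    B_hat is the J x K matrix of estimated Fourier coefficients b_hat_{jk};
    chi(i, y) evaluates the i-th basis function (1-indexed)."""
    J = B_hat.shape[0]
    return sum(B_hat[j, k] * chi(j + 1, y) for j in range(J))

def omega_hat(B_hat, a_hat):
    """Minimum-distance estimator of the mixing proportions: B^+ a."""
    omega, *_ = np.linalg.lstsq(B_hat, a_hat, rcond=None)
    return omega
```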
3.3. Functionals

Now consider the problem of inferring a vector $\theta_0$ defined as the unique solution to a moment condition of the form $E[g(y;\theta)\mid x = k] = 0$ for some known function $g$. Prime examples of such $\theta_0$ are the moments of the component mixtures; the first moment of $f_k$, for example, is defined through $E[y - \theta_0 \mid x = k] = 0$. From Corollary 1 we have that $E[g(y;\theta)\mid x = k] = E[g(y;\theta)\,\tau_k(y)]$, and so a natural way to proceed is to consider a GMM estimator of the form
$$\hat\theta \equiv \arg\min_{\theta\in\Theta}\ \hat m(\theta)'\,\hat\Sigma\,\hat m(\theta),$$
where $\Theta$ is the parameter space, $\hat\Sigma$ is a positive-definite weight matrix that converges in probability to a positive-definite and non-stochastic matrix $\Sigma$, and
$$\hat m(\theta) \equiv \frac1N\sum_{n=1}^N \hat m_n(y_n^T;\theta), \qquad \hat m_n(y_n^T;\theta) \equiv \frac1T\sum_{t=1}^T g(y_{nt};\theta)\,\hat\tau_k(y_{nt}).$$
Here, $\hat\tau_k(y)$ is an estimator of the weight function $\tau_k$ at $y$. We use
$$\hat\tau_k(y) \equiv \frac{\hat f_k(y)}{\sum_{k'=1}^K \hat f_{k'}(y)\,\hat\omega_{k'}}, \qquad k = 1, 2, \dots, K,$$
although other possibilities could be entertained. (Some trimming will typically be warranted to keep $\hat m(\theta)$ well defined. With some work, Theorem 6 below may be generalized to hold under a trimming scheme where the tuning parameter converges to the identity slowly with $N$.)

Let $g_0$ be $g$ evaluated at $\theta_0$. Impose the following regularity conditions.

Assumption 4. The space $\Theta$ is compact and $\theta_0$ lies in its interior. The function $g$ is twice continuously differentiable on $\Theta$, and $\sup_{\theta\in\Theta} E\|g(y;\theta)\|$ and $\sup_{\theta\in\Theta} E\|\nabla_{\theta'} g(y;\theta)\|$ are finite. The matrix $E[\nabla_{\theta'}g(y;\theta_0)\,\tau_k(y)]$ has full column rank. For $k' = 1, 2, \dots, K$, $\tau_{k'}$ is bounded away from zero and infinity on $[-1,1]$, $g_0\tau_{k'}$ is square-integrable, and there exists an integer $\eta \ge 1$ so that $\|\mathrm{Proj}_J g_0\tau_{k'} - g_0\tau_{k'}\|_\infty = O(J^{-\eta})$.

Assumption 4 contains familiar conditions for GMM estimators to be asymptotically normal. It also postulates a shrinkage rate for the bias of series estimators of the $g_0\tau_k$, which can be motivated as before.

To state the asymptotic distribution of $\hat\theta$, let $M_\theta \equiv E[\nabla_{\theta'}g(y;\theta_0)\,\tau_k(y)]$ and let $M_\omega$ be the $K$-vector that has $E[g(y;\theta_0)\,\tau_k(y)\,\tau_{k'}(y)]$ as its $k'$th entry. Introduce
$$\psi_\theta(y_n^T) \equiv (M_\theta'\Sigma M_\theta)^{-1} M_\theta'\Sigma\,\big[m_n(y_n^T;\theta_0) + \psi_\tau(y_n^T)\big],$$
where $m_n(y_n^T;\theta) \equiv T^{-1}\sum_{t=1}^T g(y_{nt};\theta)\,\tau_k(y_{nt})$ and $\psi_\tau(y_n^T)$ is defined as
$$\frac{1}{|\rho_3|}\sum_{(t_1,t_2,t_3)\in\rho_3} g(y_{nt_3};\theta_0)\Big[\bar\tau_k(y_{nt_1},y_{nt_2}) - \tau_k(y_{nt_3})\sum_{k'=1}^K \bar\tau_{k'}(y_{nt_1},y_{nt_2})\,\omega_{k'}\Big] + M_\omega'\,\psi_\omega(y_n^T).$$
The function $\psi_\theta$ is the influence function of $\hat\theta$ and has the usual structure. The term involving $\psi_\tau$ captures the impact of first-stage estimation error on the asymptotic variance of $\hat\theta$. Let
$$V_m \equiv E\big[\big(m_n(y^T;\theta_0) + \psi_\tau(y^T)\big)\big(m_n(y^T;\theta_0) + \psi_\tau(y^T)\big)'\big]$$
and assume this matrix to be positive definite. Theorem 6 provides the limit distribution of $\hat\theta$.

Theorem 6. Let $J \to \infty$ so that $J^4/N \to 0$, $N J^{-2\beta} \to 0$, and $N J^{-2\eta} \to 0$ as $N \to \infty$. Then, for any $k = 1, 2, \dots, K$, the estimator $\hat\theta$ of the estimand $\theta_0$ satisfies
$$\sqrt N(\hat\theta - \theta_0) \xrightarrow{L} N(0, V_\theta), \qquad V_\theta \equiv E[\psi_\theta(y^T)\,\psi_\theta(y^T)'],$$
as $N \to \infty$.

Standard arguments show that the optimally-weighted estimator is obtained on setting $\Sigma \propto V_m^{-1}$, in which case Theorem 6 implies that $\sqrt N(\hat\theta - \theta_0) \xrightarrow{L} N(0, [M_\theta' V_m^{-1} M_\theta]^{-1})$.

4. Monte Carlo evaluation

We present numerical evidence on the small-sample performance of our estimators by means of two illustrations. We work with the family of generalized Beta distributions, which is popular for modeling the distribution of income; see, e.g., McDonald (1984).
On the interval $[\underline y, \bar y] \subset \mathbb R$, the generalized Beta density is
$$b(y;\vartheta_1,\vartheta_2;\underline y,\bar y) \equiv \frac{1}{(\bar y - \underline y)^{\vartheta_1+\vartheta_2-1}}\,\frac{1}{B(\vartheta_1,\vartheta_2)}\,(y-\underline y)^{\vartheta_1-1}(\bar y-y)^{\vartheta_2-1},$$
where $B(\vartheta_1,\vartheta_2) \equiv \int_0^1 u^{\vartheta_1-1}(1-u)^{\vartheta_2-1}\,du$ and $\vartheta_1$ and $\vartheta_2$ are positive real shape parameters. Its mean and variance are
$$\mu \equiv \underline y + (\bar y - \underline y)\,\frac{\vartheta_1}{\vartheta_1+\vartheta_2}, \qquad \sigma^2 \equiv (\bar y - \underline y)^2\,\frac{\vartheta_1\vartheta_2}{(\vartheta_1+\vartheta_2)^2(\vartheta_1+\vartheta_2+1)}, \tag{4.1}$$
respectively. Throughout, we use normalized Chebyshev polynomials as basis functions. The $i$th such basis function is
$$\chi_i(u) = \sqrt{\frac{2}{\pi\,2^{1\{i=1\}}}}\;\frac{\cos\{(i-1)\arccos(u)\}}{(1-u^2)^{1/4}}, \qquad u \in [-1, 1].$$

We estimate the component mixtures and their associated CDFs, as well as the mixing proportions and the means and variances of the component mixtures. To ensure bona fide density estimates we use
$$\tilde f_k(y) \equiv \max\{0, \hat f_k(y) - c_k\}, \tag{4.2}$$
where $c_k$ is chosen so that $\int \tilde f_k(u)\,du = 1$. Note that $\tilde f_k$ is the projection of $\hat f_k$ onto the space of square-integrable functions that are non-negative and integrate to one; see Gajek (1986). To infer the conditional CDFs, $F_k(y) \equiv \int_{-\infty}^y f_k(u)\,du$, we use Clenshaw and Curtis (1960) quadrature to approximate the integral $\int_{-\infty}^y \tilde f_k(u)\,du$; our approximation uses 101 quadrature nodes. The moments are estimated using the GMM estimator from Theorem 6, without any trimming. To describe the population, we also present simple kernel estimates of the marginal density of the data. These are obtained using a Gaussian kernel and a bandwidth set according to Silverman's rule of thumb.

Experiment 1. Our first experiment involves a mixture of three generalized Beta distributions on the interval $[-1, 1]$:
$$f_1(y) = b(y; 2, 7; -1, 1),\ \omega_1 = .20; \quad f_2(y) = b(y; 5, 4; -1, 1),\ \omega_2 = .35; \quad f_3(y) = b(y; 6, 2; -1, 1),\ \omega_3 = .45.$$
Using (4.1), the means of the component mixtures are $\mu_1 = -5/9 \approx -.556$, $\mu_2 = 1/9 \approx .111$, and $\mu_3 = 1/2 = .500$, while their respective variances are $\sigma_1^2 = 28/405 \approx .069$, $\sigma_2^2 = 8/81 \approx .099$, and $\sigma_3^2 = 1/12 \approx .083$. Throughout, we set $T = 4$ and $I = J = 6$.

Table 1. Sequential rank test in Experiment 1

            α = .100                α = .050                α = .025
   N     K̂<K   K̂=K   K̂>K       K̂<K   K̂=K   K̂>K       K̂<K   K̂=K   K̂>K
  500   .005  .927  .068      .002  .959  .039      .001  .978  .021
  750   .006  .926  .068      .002  .958  .040      .001  .976  .023
 1000   .005  .924  .071      .002  .957  .041      .001  .979  .020
 1500   .005  .929  .066      .002  .960  .038      .000  .976  .024
 2000   .006  .930  .064      .002  .956  .042      .002  .980  .018
 2500   .004  .929  .067      .002  .965  .033      .000  .975  .024

Table 1 presents simulation results for the estimator of $K$ defined through the sequential-testing procedure in Theorem 2 for various values of $N$ and $\alpha$. The table reports the frequency with which $K$ was underestimated, correctly estimated, or overestimated in 10,000 Monte Carlo replications. Overall, $\hat K$ is found to perform well, correctly picking the true number of component mixtures in about 95% of the cases.

The first three panels in Figure 1 show upper 95% and lower 5% envelopes (dashed lines) over 1,000 estimates from samples of size $N = 1000$ of the marginal PDF (left upper panel), of the component PDFs (right upper panel), and of the component CDFs (left middle panel), together with their respective true values (solid lines). They reveal that the JADE-based estimator backs out the component mixtures very well, even though inspection of the estimated marginal density does not directly suggest the particular mixture structure. The remaining three panels provide box plots of the sampling distribution of the mixing proportions (right middle panel) and of the means (lower left panel) and variances (lower right panel) of the component mixtures.
All box plots are centered around the respective true values; an asterisk marks zero. Overall, the plots suggest good performance. The sampling distributions of the mixing proportions and of the mixture means are broadly correctly centered and have small interquartile ranges. The box plots for the variances show a slightly larger bias in the estimates of the mixture variances, in particular for the first and third component mixtures. Nonetheless, the magnitude of the bias is very small. It is well known that higher-order moments are more difficult to estimate accurately.

[Figure 1. Estimates in Experiment 1. Six panels: marginal PDF, component PDFs, component CDFs, and box plots of the mixing proportions and of the component means and variances.]

Experiment 2. Our methods contribute to the analysis of non-separable fixed-effect models, that is, models of the form $y_t = g(x, \varepsilon_t)$ for some unobservable $\varepsilon_t$. A location-scale version is
$$y_t = x + \eta_t, \qquad \eta_t = \sigma(x)\,\varepsilon_t, \qquad x \perp \varepsilon_t,$$
for some function $\sigma$. The location-scale model can be seen as a stripped-down version of a linear fixed-effect model or as a one-factor model. Note, however, that the factor $x$ and the error $\eta_t$ are not independent.

Suppose that $\varepsilon_t$ is drawn from the generalized Beta distribution on $[-1, 1]$ with $\vartheta_1 = \vartheta_2 = \vartheta$. Then its distribution is symmetric and $\mathrm{var}[\varepsilon_t] = 1/(2\vartheta+1)$, while $f_k$ is supported on the interval $[k - \sigma(k), k + \sigma(k)]$. Below we report results for the scale-function specification $\sigma(x) = \sqrt{(2\vartheta+1)/x}$. Here,
$$E[y \mid x = k] = k, \qquad \mathrm{var}[y \mid x = k] = \frac1k,$$
so that the variability is inversely related to the mean. Table 2 and Figure 2, which have the same structure as Table 1 and Figure 1 above, present simulation results for this location-scale model with $K = 5$ and $\omega = (.30, .20, .25, .15, .10)'$. This choice of mixture weights implies that $f$ is unimodal and virtually symmetric, hiding the underlying mixture structure well. The simulations were performed with $I = J = 8$. The table was generated over 10,000 replications; the plots are based on 1,000 Monte Carlo runs.

Table 2. Sequential rank test in Experiment 2

            α = .100                α = .050                α = .025
   N     K̂<K   K̂=K   K̂>K       K̂<K   K̂=K   K̂>K       K̂<K   K̂=K   K̂>K
  500   .140  .642  .218      .201  .644  .155      .264  .617  .119
  750   .053  .748  .199      .083  .781  .136      .120  .782  .098
 1000   .015  .790  .193      .030  .836  .134      .044  .864  .092
 1500   .001  .804  .194      .001  .864  .135      .004  .900  .097
 2000   .000  .809  .189      .000  .864  .136      .000  .904  .096
 2500   .000  .816  .181      .000  .866  .134      .000  .905  .095

Table 2 shows that the sequential rank test continues to perform well, although there is a larger tendency to over- and underestimate $K$ when the sample is fairly small. The plots in Figure 2 demonstrate excellent estimates of the component PDFs and CDFs. The box plots of the mixing proportions, too, are again broadly correctly centered, although the interquartile ranges are larger than in Experiment 1. From the bottom panels in the figure it may be observed that the sampling distributions of the GMM estimators of the means and variances of the component mixtures are now more disperse. The estimates of the variances also suffer from non-negligible bias. This suggests that some trimming will typically be needed in applications where the support is large.

[Figure 2. Estimates in Experiment 2. Same layout as Figure 1.]

5. Empirical application

The classic model of earnings dynamics allows for heterogeneity in log-earnings levels only through the inclusion of unit-specific intercepts; see, e.g., Hall and Mishkin (1982). The recent literature has argued that unobserved heterogeneity in log earnings stretches beyond such additive fixed effects; see, e.g., Browning et al.
(2010) for a discussion and an extensive empirical investigation in a parametric framework. In this section, we apply our methods to investigate the degree of heterogeneity in earnings data on males for the period 1969–1998. Our approach is very flexible in terms of unobserved heterogeneity: it allows for persistence of essentially arbitrary form through unobserved time-constant factors, but rules out the presence of state dependence.

From the PSID 1969–1998 we construct a set of three-period balanced subpanels, using a rolling window of length one. This yields 28 subpanels. (The PSID is the Panel Study of Income Dynamics public-use dataset, produced and distributed by the University of Michigan, Ann Arbor, MI, with primary funding from the National Science Foundation, the National Institute of Aging, and the National Institute of Child Health and Human Development. We excluded self-employed individuals and students, as well as individuals for whom earnings were top-coded. We further restricted the sample to individuals between the ages of 20 and 60, with at most 40 years of experience.) For each such subpanel, we obtain our measure of log (annual) earnings of unit $n$ at time $t$, say $y_{nt}$, as the residual of a pooled regression of reported log earnings on a constant term, a set of time dummies, years of schooling, and a second-degree polynomial in experience. A graphical inspection of the marginal densities in each subpanel (not reported) does not suggest large dispersion between the univariate marginal densities within a given subpanel, so our smoothing policy seems reasonable.

In line with Heckman and Singer (1984) and many others since, we view the latent factor $x$ as representing type heterogeneity and consider a discretization approach, estimating a trivariate mixture model for each subpanel. Values of $K$ up to ten yielded a similar pattern in the component mixtures; we focus on a relatively small $K$ for ease of exposition. In this way, one can think of $x$ as an indicator for low, intermediate, and high innate ability, for example. The densities were estimated using Chebyshev polynomials as basis functions, with $I$ set to five and $J = .7\sqrt[3]{N}$ (rounded up to an integer) to accommodate the relatively strong increase in the number of cross-sectional observations in the subpanels spanning later time periods. We used (4.2) to ensure bona fide density estimates.

[Figure 3. Conditional densities of log earnings. Four panels plot the estimated $f_k$ against $y$ for the subpanels 1969–1971 (N = 585, J = 6), 1976–1978 (N = 906, J = 7), 1984–1986 (N = 1107, J = 8), and 1996–1998 (N = 1420, J = 8).]

Figure 3 contains the estimated conditional densities for a selection of subpanels. The pattern in the plots is representative of the other subpanels formed from the PSID data. The values of $N$ and $J$ are indicated above each plot. Figure 3 shows well-separated unimodal component mixtures. From here on, index the component mixtures $k = 1, 2, 3$ by mode, that is, from left to right. The second conditional density is fairly symmetric about its mode, while the outer two densities are somewhat skewed. Overall, the plots show that a higher value of $x$ translates into a higher probability of being in the top of the earnings distribution.

To inspect the evolution of the location and variability of log earnings over time, we estimated the mode and interquartile range of the component mixtures in each subpanel. The interquartile range was estimated by inverting the CDF of the components, which, in turn, was computed using Clenshaw–Curtis quadrature (with 101 quadrature nodes, as before). The individual estimates are indicated by a • for $k = 1$, by an x for $k = 2$, and by a + for $k = 3$. To capture the overall time trend, linear-regression lines through the individual points are also shown.

[Figure 4. Mixture modes and interquartile ranges over time. Left panel: mode against year; right panel: interquartile range against year.]

The left plot in Figure 4 shows the evolution of the mixture modes
over the sampling period. As is clear from the regression lines, the dispersion between the modes tends to increase somewhat over time, with the mode for $k = 1$ decreasing, the mode for $k = 2$ remaining relatively constant, and the mode for $k = 3$ increasing. In terms of location, this indicates that the various component mixtures move further away from each other, suggesting an increase in the between-group dispersion of log earnings. The right plot of Figure 4 provides the interquartile ranges. The plot shows that the spread of the component mixtures is inversely related to their mode. That is, individuals at the lower end of the earnings distribution are exposed to higher volatility. Further, the difference in the interquartile ranges decreases over the sampling period. The regression lines show that the interquartile ranges of the first and second component mixtures have decreased, while the interquartile range of the high-end component mixture displays an upward trend.

6. Conclusion

We have discussed methods to estimate nonparametric finite-mixture models from short panel data. The estimators are straightforward to implement and have desirable large-sample properties. A Monte Carlo assessment further indicated good performance in small samples. As an empirical illustration we applied our approach to earnings data. We found evidence of substantial nonlinear heterogeneity in earnings processes, and documented its evolution over time.

In future work, we aim to provide a data-driven method for selecting the truncation parameter optimally, paving the path for adaptive estimation. Another potentially fruitful direction for further research is to investigate how our approach can be modified to allow $K$ to grow with $T$. With some work, it should also be possible to extend our methodology to Markovian models; one area where this can be of use is the estimation of dynamic discrete-choice models. Finally, our projection approach can potentially be used to estimate continuous mixtures. However, this will involve an ill-posed inverse problem, and will lead to the associated decrease in convergence rates.

Appendix A: Asymptotic properties of estimated Fourier coefficients

This appendix works towards Theorem A.1, which establishes the limit behavior of the $\hat D_j$. We do so by providing lemmata which, in turn, develop the large-sample behavior of the spectral-decomposition matrix $\hat W$ (Lemma A.1), of the whitened matrices of Fourier coefficients $\hat C_j$ (Lemma A.2), and of the JADE estimator $\hat U$ (Lemma A.3). To state our results, some additional notation is needed.
First, observe that, for any $j$, $\mathrm{vec}[\hat A_j - A_j]$ is asymptotically linear, and that its influence function at $y_n^T$, say $\psi_{A_j}(y_n^T)$, has
$$\frac{1}{|\rho_3|}\sum_{(t_1,t_2,t_3)\in\rho_3}\chi_{i_1}(y_{nt_1})\chi_{i_2}(y_{nt_2})\chi_j(y_{nt_3}) - a_{i_1i_2j} \tag{A.1}$$
as its $i_1 + (i_2-1)I$th element.

Let $\bar V \equiv (\bar v_1, \bar v_2, \dots, \bar v_K)$ for $\bar v_k \equiv v_k \otimes v_k$, where $V = (v_1, v_2, \dots, v_K)$, and define the selection matrix $Q \in \mathbb R^{K^2\times K}$ through the relation $Q\ell = \mathrm{vec}[\mathrm{diag}[\ell]]$, where $\ell$ is any $K$-vector. Use these expressions to introduce
$$\psi_W(y^T) \equiv \Big[(I_I \otimes \Lambda^{-3/2}V') - \tfrac12(V \otimes I_K)\,Q\,\Lambda^{-3/2}\bar V'\Big]\psi_{A_0}(y^T),$$
$$\psi_{W'}(y^T) \equiv \Big[(\Lambda^{-3/2}V' \otimes I_I) - \tfrac12(I_K \otimes V)\,Q\,\Lambda^{-3/2}\bar V'\Big]\psi_{A_0}(y^T).$$
The interpretation of these functions is that they are the respective influence functions of $\hat W$ and $\hat W'$, as stated in Lemma A.1.

Lemma A.1. The estimators $\hat W$ and $\hat W'$ of the matrices $W$ and $W'$ satisfy
$$\sqrt N\,\mathrm{vec}[\hat W - W] \xrightarrow{L} N(0, V_W), \qquad \sqrt N\,\mathrm{vec}[\hat W' - W'] \xrightarrow{L} N(0, V_{W'}),$$
for $V_W \equiv E[\psi_W(y^T)\psi_W(y^T)']$ and $V_{W'} \equiv E[\psi_{W'}(y^T)\psi_{W'}(y^T)']$, respectively, as $N \to \infty$.

Proof of Lemma A.1. Recall that $A_0$ is real, symmetric, and has rank $K$. Also, its $K$ non-zero eigenvalues are all distinct. Further, $\hat A_0$—which, by construction, is symmetric too—satisfies $\sqrt N\,\mathrm{vec}[\hat A_0 - A_0] \xrightarrow{L} N(0, V_{A_0})$. From Theorem 4.2 in Eaton and Tyler (1991) and Theorem 1 in Magnus (1985), it then follows that $\sqrt N(\hat\lambda - \lambda) \xrightarrow{L} N(0, V_\lambda)$, where $[V_\lambda]_{k_1,k_2} \equiv (v_{k_1}\otimes v_{k_1})'\,V_{A_0}\,(v_{k_2}\otimes v_{k_2})$. Because $\mathrm{vec}[\Lambda] = Q\lambda$, the Jacobian associated with the transformation from $\lambda$ to $\mathrm{vec}[\Lambda^{-1/2}]$ is $-\frac12 Q\,\mathrm{diag}[\lambda_1^{-3/2},\lambda_2^{-3/2},\dots,\lambda_K^{-3/2}] \equiv -\frac12 Q\Lambda^{-3/2}$. Hence, an application of the delta method gives
$$\sqrt N\,\mathrm{vec}[\hat\Lambda^{-1/2} - \Lambda^{-1/2}] \xrightarrow{L} N(0, V_{\Lambda^{-1/2}}), \tag{A.2}$$
for $V_{\Lambda^{-1/2}} \equiv \frac14\,Q\Lambda^{-3/2}\bar V'\,V_{A_0}\,\bar V\Lambda^{-3/2}Q'$. Moving on, from Corollary 1 in Bura and Pfeiffer (2008), we have that the estimated eigenvectors satisfy
$$\sqrt N\,\mathrm{vec}[\hat V - V] \xrightarrow{L} N(0, V_V), \qquad \sqrt N\,\mathrm{vec}[\hat V' - V'] \xrightarrow{L} N(0, V_{V'}), \tag{A.3}$$
where $V_V \equiv (\Lambda^{-1}V'\otimes I_I)\,V_{A_0}\,(V\Lambda^{-1}\otimes I_I)$ and $V_{V'} \equiv (I_I\otimes\Lambda^{-1}V')\,V_{A_0}\,(I_I\otimes V\Lambda^{-1})$. Then combining (A.2) and (A.3) with the linearization
$$\hat W - W = (\hat\Lambda^{-1/2} - \Lambda^{-1/2})V' + \Lambda^{-1/2}(\hat V - V)' + o_P(1/\sqrt N)$$
provides the result. □

Now use Lemma A.1 to construct
$$\psi_{C_j}(y^T) \equiv (WA_j \otimes I_K)\,\psi_W(y^T) + (W\otimes W)\,\psi_{A_j}(y^T) + (I_K \otimes WA_j)\,\psi_{W'}(y^T),$$
for each $j = 1, 2, \dots, J$. Then Lemma A.2 follows readily.

Lemma A.2. For each $j = 1, 2, \dots, J$, the estimator $\hat C_j = \hat W\hat A_j\hat W'$ of the whitened matrix $C_j = WA_jW'$ satisfies
$$\sqrt N\,\mathrm{vec}[\hat C_j - C_j] \xrightarrow{L} N(0, V_{C_j}), \qquad V_{C_j} \equiv E[\psi_{C_j}(y^T)\psi_{C_j}(y^T)'],$$
as $N \to \infty$.

Proof of Lemma A.2. Fix $j \in \{1, 2, \dots, J\}$. First note that, from (3.1), the $(i_1,i_2)$th element of $\hat A_j$ takes the form
$$\hat a_{i_1i_2j} = \frac1N\sum_{n=1}^N\frac{1}{|\rho_3|}\sum_{(t_1,t_2,t_3)\in\rho_3}\chi_{i_1}(y_{nt_1})\chi_{i_2}(y_{nt_2})\chi_j(y_{nt_3}),$$
and that $\sqrt N\,\mathrm{vec}[\hat A_j - A_j] \xrightarrow{L} N(0, V_{A_j})$. Because $\hat W$, too, is $\sqrt N$-consistent, we may linearize $\mathrm{vec}[\hat C_j - C_j]$ as
$$(WA_j\otimes I_K)\,\mathrm{vec}[\hat W - W] + (W\otimes W)\,\mathrm{vec}[\hat A_j - A_j] + (I_K\otimes WA_j)\,\mathrm{vec}[\hat W' - W'] + o_P(1/\sqrt N).$$
Combined with Lemma A.1, this yields the result by an application of the delta method. □

Moving on to the JADE estimator, let
$$\psi_U(y_n^T) \equiv (I_K \otimes U\Delta_I)\sum_{i=1}^I \bar\nabla_i\,(U'\otimes U')\,\psi_{C_i}(y_n^T), \qquad \psi_{U'}(y_n^T) \equiv (U\Delta_I \otimes I_K)\sum_{i=1}^I \bar\nabla_i\,(U'\otimes U')\,\psi_{C_i}(y_n^T),$$
where $\Delta_I$ is the $K \times K$ matrix whose $(k_1,k_2)$th element is
$$[\Delta_I]_{k_1,k_2} \equiv \begin{cases}\big[\sum_{i=1}^I (b_{ik_1}-b_{ik_2})^2\big]^{-1} & \text{if } k_1 \ne k_2,\\ 0 & \text{if } k_1 = k_2,\end{cases}$$
and $\bar\nabla_i \equiv \mathrm{diag}[\mathrm{vec}[\nabla_i]]$ with $\nabla_i \equiv D_i\iota_K\iota_K' - \iota_K\iota_K'D_i$ for $i = 1, 2, \dots, I$ and $\iota_K$ the $K$-vector of ones.
Note that $\Delta_I$ is well defined for any $I \ge K$ by Assumption 2 and the fact that $\lim_{I\to\infty}\sum_{i=1}^I(b_{ik_1}-b_{ik_2})^2 = \|f_{k_1}-f_{k_2}\|_2^2 < \infty$. The asymptotic distribution of JADE is stated in Lemma A.3.

Lemma A.3. The JADE estimator defined in (3.2) satisfies
$$\sqrt N\,\mathrm{vec}[\hat U - U] \xrightarrow{L} N(0, V_U), \qquad \sqrt N\,\mathrm{vec}[\hat U' - U'] \xrightarrow{L} N(0, V_{U'}),$$
for $V_U \equiv E[\psi_U(y^T)\psi_U(y^T)']$ and $V_{U'} \equiv E[\psi_{U'}(y^T)\psi_{U'}(y^T)']$, respectively, as $N \to \infty$.

Proof of Lemma A.3. As the $\hat C_i$ are $\sqrt N$-consistent and asymptotically normal, the result follows from Theorem 5 in Bonhomme and Robin (2009). □

On combining Lemma A.2 and Lemma A.3, and introducing
$$\psi_{D_j}(y^T) \equiv (I_K \otimes U'C_j)\,\psi_U(y^T) + (U'\otimes U')\,\psi_{C_j}(y^T) + (U'C_j \otimes I_K)\,\psi_{U'}(y^T),$$
we obtain Theorem A.1.

Theorem A.1. For each $j = 1, 2, \dots, J$, the estimator $\hat D_j = \hat U'\hat C_j\hat U$ of the matrix $D_j = U'C_jU$ satisfies
$$\sqrt N\,\mathrm{vec}[\hat D_j - D_j] \xrightarrow{L} N(0, V_{D_j}), \qquad V_{D_j} \equiv E[\psi_{D_j}(y^T)\psi_{D_j}(y^T)'],$$
as $N \to \infty$.

Proof of Theorem A.1. The result follows on combining Lemma A.2 and Lemma A.3 with a linearization of $\mathrm{vec}[\hat D_j - D_j]$. □

Appendix B: Proofs of theorems in the main text

Proof of Theorem 3. Throughout the proof we fix $k$. Let $b \equiv (b_{1k}, b_{2k}, \dots, b_{Jk})'$ and $\hat b \equiv (\hat b_{1k}, \hat b_{2k}, \dots, \hat b_{Jk})'$. We first show that $\|\hat b - b\| = O_P(\sqrt{J/N})$; Theorem 3 will then follow readily. Note that
$$\|\hat b - b\|^2 \le \sum_{j=1}^J \|\hat D_j - D_j\|^2 = \sum_{j=1}^J \mathrm{vec}[\hat D_j - D_j]'\,\mathrm{vec}[\hat D_j - D_j]. \tag{B.1}$$
Write $\hat D_j - D_j = \hat S\hat A_j\hat S' - SA_jS'$, where $S \equiv U'W$ and $\hat S \equiv \hat U'\hat W$. Observe that Lemmata A.1 and A.3 imply that $\|\hat S - S\| = O_P(1/\sqrt N)$. Also, for any $j$,
$$E\|\hat A_j - A_j\|^2 = \sum_{i_1=1}^I\sum_{i_2=1}^I E\big[(\hat a_{i_1i_2j} - a_{i_1i_2j})^2\big] = \sum_{i_1=1}^I\sum_{i_2=1}^I E\Big[\Big(\frac1N\sum_{n=1}^N\frac{1}{|\rho_3|}\sum_{(t_1,t_2,t_3)\in\rho_3}\chi_{i_1}(y_{nt_1})\chi_{i_2}(y_{nt_2})\chi_j(y_{nt_3}) - a_{i_1i_2j}\Big)^2\Big] = \sum_{i_1=1}^I\sum_{i_2=1}^I \frac{E[\chi_{i_1}^2(y_1)\chi_{i_2}^2(y_2)\chi_j^2(y_3)] - a_{i_1i_2j}^2}{N|\rho_3|} \le \sum_{i_1=1}^I\sum_{i_2=1}^I \frac{\zeta^6 - a_{i_1i_2j}^2}{N|\rho_3|},$$
where, recall, $\zeta$ is a finite constant such that $\|\chi_j\|_\infty \le \zeta$ for all $j$. Therefore,
$$\sum_{j=1}^J \|\hat A_j - A_j\|^2 = O_P(J/N) \tag{B.2}$$
by Markov's inequality and Parseval's identity, that is, the square-summability of the Fourier coefficients. Now, as
$$\mathrm{vec}[\hat D_j - D_j] = (S\otimes S)\,\mathrm{vec}[\hat A_j - A_j] + ((\hat S - S)\otimes(\hat S - S))\,\mathrm{vec}[A_j] + ((\hat S - S)\otimes(\hat S - S))\,\mathrm{vec}[\hat A_j - A_j] + (S\otimes(\hat S - S))\,\mathrm{vec}[\hat A_j - A_j] + ((\hat S - S)\otimes S)\,\mathrm{vec}[\hat A_j - A_j] + ((\hat S - S)\otimes S)\,\mathrm{vec}[A_j] + (S\otimes(\hat S - S))\,\mathrm{vec}[A_j], \tag{B.3}$$
it suffices to quantify the contribution of each of these terms to (B.1). The first term will be the dominant one. On using the Cauchy–Schwarz inequality and (B.2),
$$\sum_{j=1}^J \mathrm{vec}[S(\hat A_j - A_j)S']'\,\mathrm{vec}[S(\hat A_j - A_j)S'] \le \|(S\otimes S)'(S\otimes S)\|\sum_{j=1}^J\|\hat A_j - A_j\|^2 = O_P(J/N),$$
as $\|(S\otimes S)'(S\otimes S)\| = O(1)$. The remaining terms can be shown to be of smaller order using similar arguments. For example,
$$\sum_{j=1}^J \mathrm{vec}[S(\hat A_j - A_j)S']'\,\mathrm{vec}[(\hat S - S)A_j(\hat S - S)'] \le \|(S\otimes S)[(\hat S - S)\otimes(\hat S - S)]\|\sum_{j=1}^J\|\hat A_j - A_j\|\,\|A_j\|$$
is asymptotically negligible because $\|(S\otimes S)[(\hat S - S)\otimes(\hat S - S)]\| = O_P(1/N)$ and
$$\sum_{j=1}^J\|\hat A_j - A_j\|\,\|A_j\| \le \Big[\sum_{j=1}^J\|\hat A_j - A_j\|^2\Big]^{1/2}\Big[\sum_{j=1}^J\|A_j\|^2\Big]^{1/2} = O_P(\sqrt{J/N}),$$
which follows as before. Thus, we conclude that $\|\hat b - b\| = O_P(\sqrt{J/N})$.

Now turn to the MISE result. Using orthonormality,
$$E\|\hat f_k - f_k\|_2^2 \le E\|\hat f_k - \mathrm{Proj}_J f_k\|_2^2 + \|\mathrm{Proj}_J f_k - f_k\|_2^2 = E\|\hat b - b\|^2 + \|\mathrm{Proj}_J f_k - f_k\|_2^2.$$
From above, $E\|\hat b - b\|^2 = O(J/N)$, while $\|\mathrm{Proj}_J f_k - f_k\|_2^2 \le 2\|\mathrm{Proj}_J f_k - f_k\|_\infty^2 = O(J^{-2\beta})$ by Assumption 3. Whence, $E\|\hat f_k - f_k\|_2^2 = O(J/N + J^{-2\beta})$, as claimed. To obtain the uniform-convergence rate, we may proceed similarly.
By the triangle inequality,
$$\|\hat f_k - f_k\|_\infty \le \|\hat f_k - \mathrm{Proj}_J f_k\|_\infty + \|\mathrm{Proj}_J f_k - f_k\|_\infty.$$
Let $\chi^J \equiv (\chi_1, \chi_2, \dots, \chi_J)'$ and note that $\sup_{y\in[-1,1]}\|\chi^J(y)\| = O(\sqrt J)$ by the uniform bound on the basis functions in Assumption 3. Thus,
$$\|\hat f_k - \mathrm{Proj}_J f_k\|_\infty \le \sup_{y\in[-1,1]}\|\chi^J(y)\|\,\|\hat b - b\| = O(\sqrt J)\,O_P(\sqrt{J/N})$$
by the Cauchy–Schwarz inequality. Because $\|\mathrm{Proj}_J f_k - f_k\|_\infty = O(J^{-\beta})$ by Assumption 3, this yields the uniform-convergence rate and concludes the proof of Theorem 3. □

Proof of Theorem 4. Throughout the proof we fix $k$ and consider the asymptotic behavior of $\hat f_k$ at a given value $y$. Write
$$\hat f_k(y) - f_k(y) = [\hat f_k(y) - \mathrm{Proj}_J f_k(y)] + [\mathrm{Proj}_J f_k(y) - f_k(y)].$$
By Assumption 3, $|\mathrm{Proj}_J f_k(y) - f_k(y)| \le \|\mathrm{Proj}_J f_k - f_k\|_\infty = O(J^{-\beta})$. Further, because $\sqrt N J^{-\beta} \to 0$ as $N \to \infty$, this term is asymptotically negligible. Define the $K^2 \times 1$ selection vector $q_k$ through the relation $\hat b_{jk} - b_{jk} = q_k'\,\mathrm{vec}[\hat D_j - D_j]$. Using (B.3), a small calculation yields
$$\sqrt{\frac{N}{\alpha_J(y)}}\,\big[\hat f_k(y) - \mathrm{Proj}_J f_k(y)\big] = \frac{1}{\sqrt N}\sum_{n=1}^N\big[\xi_{n,1} + \xi_{n,2} + \xi_{n,3} + \xi_{n,4}\big] + o_P(1),$$
for $\xi_{n,1} \equiv q_k'(S\otimes S)\vartheta_n = \psi_{f_k}(y_n^T)/\sqrt{\alpha_J(y)}$ and
$$\xi_{n,2} \equiv q_k'(S\otimes(\hat S - S))\vartheta_n, \qquad \xi_{n,3} \equiv q_k'((\hat S - S)\otimes S)\vartheta_n, \qquad \xi_{n,4} \equiv q_k'((\hat S - S)\otimes(\hat S - S))\vartheta_n,$$
where $S$ and $\hat S$ are as in the previous proof, $\vartheta_n \equiv \sum_{j=1}^J \psi_{A_j}(y_n^T)\chi_j(y)/\sqrt{\alpha_J(y)}$, and $\psi_{A_j}$ is the influence function of $\mathrm{vec}[\hat A_j]$ as defined in (A.1) above. To see why the remaining terms in (B.3) can be ignored, note that they are all of the form
$$O_P\Big(\frac{1}{\sqrt N}\Big)\sum_{j=1}^J\sum_{i_1=1}^I\sum_{i_2=1}^I a_{i_1i_2j}\,\chi_j(y)$$
because $\|\hat S - S\| = O_P(1/\sqrt N)$ by Lemmata A.2 and A.3. Further, from the restrictions imposed by the mixture structure in (2.1),
$$\sum_{j=1}^J\sum_{i_1=1}^I\sum_{i_2=1}^I a_{i_1i_2j}\,\chi_j(y) = \sum_{k'=1}^K\Big[\sum_{i_1=1}^I\sum_{i_2=1}^I b_{i_1k'}b_{i_2k'}\Big]\Big[\sum_{j=1}^J b_{jk'}\chi_j(y)\Big]\omega_{k'} = O(1),$$
as the first term in square brackets is bounded and the second term is just $\mathrm{Proj}_J f_{k'}(y)$, which converges to $f_{k'}(y)$ as $J \to \infty$. Below we will further show that
$$\sqrt{\frac{N}{\alpha_J(y)}}\,\big[\hat f_k(y) - \mathrm{Proj}_J f_k(y)\big] = \frac{1}{\sqrt N}\sum_{n=1}^N \xi_{n,1} + o_P(1), \tag{B.4}$$
so that it suffices to show that this sample average is asymptotically normal. Doing so requires verifying that the conditions of Lyapunov's central limit theorem are satisfied. That is, it needs to be shown that (i) $E[\xi_{n,1}] = 0$, (ii) $\mathrm{var}[\xi_{n,1}] = O(1)$, and (iii) there exists a $\delta > 0$ so that $E|\xi_{n,1}/\sqrt N|^{2+\delta} = o(1)$. Condition (i) readily follows from inspection of (3.3). To verify Condition (ii), use the law of iterated expectations to see that it suffices to show that $E[\bar\tau_k(y_1,y_2)^2 \mid x = k'] = O(1)$ and that $E[\kappa_J(y_3;y)^2 \mid x = k'] = O(\alpha_J(y))$ for all $k' = 1, 2, \dots, K$, where the expectations are taken over the triple $(y_1,y_2,y_3)$. From Theorem 2.2.2 in Viollaz (1989), $E[\kappa_J(y_3;y)^2 \mid x = k']/\alpha_J(y) \to f_{k'}(y)$, which is $O(1)$. Also,
$$E[\bar\tau_k(y_1,y_2)^2 \mid x = k'] = \Big[\sum_{i_1=1}^I\sum_{i_2=1}^I u_k'w_{i_1}\,E[\chi_{i_1}(y_1)\chi_{i_2}(y_1)\mid x = k']\,w_{i_2}'u_k\Big]^2 = O(1),$$
which follows from a direct calculation. Finally, regarding Condition (iii), first note that
$$\alpha_J(y)^{\frac{2+\delta}{2}}\,E|\xi_{n,1}|^{2+\delta} = E\Big|\sum_{j=1}^J q_k'(S\otimes S)\,\psi_{A_j}(y^T)\,\chi_j(y)\Big|^{2+\delta} \le \Big[\sum_{j=1}^J\big\|q_k'(S\otimes S)\big\|^2|\chi_j(y)|^2\Big]^{\frac{2+\delta}{2}}\,E\Big[\sum_{j=1}^J\|\psi_{A_j}(y^T)\|^2\Big]^{\frac{2+\delta}{2}} \le O(1)\,\alpha_J(y)^{\frac{2+\delta}{2}}\,\Big[\sum_{j=1}^J\big(E\|\psi_{A_j}(y^T)\|^{2+\delta}\big)^{\frac{2}{2+\delta}}\Big]^{\frac{2+\delta}{2}},$$
where the transitions are justified by the Cauchy–Schwarz inequality and the Minkowski inequality.
Further, because $\|\chi^I(y)\chi^I(y)'\|$ is uniformly bounded by $I\zeta^2$ by Assumption 3, and as $\|A_j\| = O(1)$ uniformly over $j$, we also have that
$$E\|\psi_{A_j}(y^T)\|^{2+\delta} \le \Big[\sup_{y\in[-1,1]^T}\|\psi_{A_j}\|\Big]^{2+\delta} \le \big[I\zeta^2\|\chi_j\|_\infty + \|A_j\|\big]^{2+\delta} \le \big[I\zeta^3 + \|f\|_2\big]^{2+\delta} = O(1).$$
Therefore, $E|\xi_{n,1}/\sqrt N|^{2+\delta} = O(J/N)^{\frac{2+\delta}{2}} \to 0$ as $N \to \infty$. To prove the theorem it only remains to validate (B.4). Observe that the $i_1 + (i_2-1)I$th element of $\sqrt{\alpha_J(y)}\,\vartheta_n$ equals
$$\frac{1}{|\rho_3|}\sum_{(t_1,t_2,t_3)\in\rho_3}\Big[\chi_{i_1}(y_{nt_1})\chi_{i_2}(y_{nt_2})\,\kappa_J(y_{nt_3};y) - \sum_{k'=1}^K b_{i_1k'}b_{i_2k'}\,\mathrm{Proj}_J f_{k'}(y)\,\omega_{k'}\Big].$$
By the same arguments as above, it is straightforward to verify that $N^{-1/2}\sum_{n=1}^N\vartheta_n$ satisfies the conditions of Lyapunov's central limit theorem and is, therefore, $O_P(1)$. Hence,
$$\frac{1}{\sqrt N}\sum_{n=1}^N\xi_{n,2} = o_P(1), \qquad \frac{1}{\sqrt N}\sum_{n=1}^N\xi_{n,3} = o_P(1), \qquad \frac{1}{\sqrt N}\sum_{n=1}^N\xi_{n,4} = o_P(1),$$
because $\|\hat S - S\| = O_P(1/\sqrt N)$. This completes the proof. □

Proof of Theorem 5. Theorem A.1 implies that $\|\hat B - B\| = O_P(1/\sqrt N)$. Assumption 2 ensures that $B'B$ has full rank. Hence, $(\hat B'\hat B)^{-1} \xrightarrow{P} (B'B)^{-1}$. Further, a linearization yields $\hat B'\hat a - B'a = (\hat B - B)'a + B'(\hat a - a) + o_P(1/\sqrt N)$. In tandem with the asymptotic results on $\hat B$ and $\hat a$, the delta method then provides the result. □

Proof of Theorem 6. Fix $k$ throughout the proof. Given our assumptions, consistency of $\hat\theta$ follows from uniform convergence of the empirical moment to the population moment. Let $m(\theta) \equiv N^{-1}\sum_n m_n(y_n^T;\theta)$. Using Theorems 3 and 5,
$$\sup_{\theta\in\Theta}\|\hat m(\theta) - E[m(\theta)]\| = O_P(J/\sqrt N + J^{-\beta}), \tag{B.5}$$
which converges to zero as $N \to \infty$. For asymptotic normality, the usual argument based on a Taylor expansion around $\theta_0$ yields
$$\sqrt N(\hat\theta - \theta_0) = (M_\theta'\Sigma M_\theta)^{-1}M_\theta'\Sigma\,\sqrt N\,\hat m(\theta_0) + o_P(1),$$
because $\hat\Sigma \xrightarrow{P} \Sigma$ and $\sup_{\theta\in\Theta}\|\nabla_{\theta'}\hat m(\theta) - M_\theta(\theta)\| = o_P(1)$, which follows as (B.5). It then remains to show that
$$\sqrt N\,\hat m(\theta_0) = \frac{1}{\sqrt N}\sum_{n=1}^N\big[m_n(y_n^T;\theta_0) + \psi_\tau(y_n^T)\big] + o_P(1). \tag{B.6}$$
To do so, first note that a second-order Taylor expansion, together with Theorems 3 and 5, yields
$$\sqrt N\,\big[\hat m(\theta_0) - m(\theta_0)\big] = \frac{1}{\sqrt N}\sum_{n=1}^N\sum_{k'=1}^K\big[\xi_{k'}(y_n^T) + \nabla_{\omega_{k'}}m_n(y_n^T;\theta_0)(\hat\omega_{k'} - \omega_{k'})\big] + o_P(1),$$
for
$$\xi_{k'}(y_n^T) \equiv T^{-1}\sum_{t=1}^T g(y_{nt};\theta_0)\,\nabla_{f_{k'}}\tau_k(y_{nt})\,\big[\hat f_{k'}(y_{nt}) - f_{k'}(y_{nt})\big]. \tag{B.7}$$
Here, the order of the remainder term follows from the fact that $J$ is restricted to grow so that $\sqrt N(J^2/N + J^{-2\beta}) \to 0$ as $N \to \infty$. Using (3.3) and Theorem 4 for each $k'$, the sample average of (B.7) can be expressed as a symmetric $U$-statistic of order two. The projection of this $U$-statistic is $N^{-1}\sum_n \vartheta_{k'}(y_n^T)$ for
$$\vartheta_{k'}(y_n^T) \equiv \frac{1}{|\rho_3|}\sum_{(t_1,t_2,t_3)\in\rho_3}E\Big[g(y;\theta_0)\,\nabla_{f_{k'}}\tau_k(y)\,\big[\bar\tau_{k'}(y_{nt_1},y_{nt_2})\,\kappa_J(y_{nt_3};y) - \mathrm{Proj}_J f_{k'}(y)\big]\Big],$$
where the expectation is taken with respect to $y$. Computing this expectation for each $k'$ and invoking Lemma A.3 in Ahn and Powell (1993) shows that $N^{-1/2}\sum_{n=1}^N\sum_{k'=1}^K\xi_{k'}(y_n^T)$ equals
$$\frac{1}{\sqrt N}\sum_{n=1}^N\frac{1}{|\rho_3|}\sum_{(t_1,t_2,t_3)\in\rho_3}g(y_{nt_3};\theta_0)\Big[\bar\tau_k(y_{nt_1},y_{nt_2}) - \tau_k(y_{nt_3})\sum_{k'=1}^K\bar\tau_{k'}(y_{nt_1},y_{nt_2})\,\omega_{k'}\Big] + o_P(1),$$
using that $\sqrt N\,\|\mathrm{Proj}_J g_0\tau_{k'} - g_0\tau_{k'}\|_\infty = \sqrt N\,O(J^{-\eta}) = o(1)$. Also, by essentially the same argument,
$$\frac{1}{\sqrt N}\sum_{n=1}^N\sum_{k'=1}^K\nabla_{\omega_{k'}}m_n(y_n^T;\theta_0)(\hat\omega_{k'} - \omega_{k'}) = \frac{1}{\sqrt N}\sum_{n=1}^N M_\omega'\,\psi_\omega(y_n^T) + o_P(1).$$
Hence,
$$\sqrt N\,\big[\hat m(\theta_0) - m(\theta_0)\big] = \frac{1}{\sqrt N}\sum_{n=1}^N\psi_\tau(y_n^T) + o_P(1),$$
from which (B.6) follows. This concludes the proof. □

References

Ahn, H. and J. L. Powell (1993). Semiparametric estimation of censored selection models with a nonparametric selection mechanism. Journal of Econometrics 58, 3–29.

Allman, E. S., C. Matias, and J. A. Rhodes (2009). Identifiability of parameters in latent structure models with many observed variables. Annals of Statistics 37, 3099–3132.
Bonhomme, S. and J.-M. Robin (2009). Consistent noisy independent component analysis. Journal of Econometrics 149, 12–25.

Browning, M., M. Ejrnæs, and J. Alvarez (2010). Modeling income processes with lots of heterogeneity. Review of Economic Studies 77, 1353–1381.

Bura, E. and R. Pfeiffer (2008). On the distribution of left singular vectors of a random matrix and its applications. Statistics & Probability Letters 78, 2275–2280.

Cardoso, J.-F. and A. Souloumiac (1993). Blind beamforming for non-Gaussian signals. IEE Proceedings F 140, 362–370.

Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. In J. J. Heckman and E. E. Leamer (Eds.), Handbook of Econometrics, Volume VI, Part B. Elsevier.

Clenshaw, C. W. and A. R. Curtis (1960). A method for numerical integration on an automatic computer. Numerische Mathematik 2, 197–205.

Eaton, M. L. and D. E. Tyler (1991). On Wielandt's inequality and its applications. Annals of Statistics 19, 260–271.

Gajek, L. (1986). On improving density estimators which are not bona fide functions. Annals of Statistics 14, 1612–1618.

Hall, P., A. Neeman, R. Pakyari, and R. Elmore (2005). Nonparametric inference in multivariate mixtures. Biometrika 92, 667–678.

Hall, P. and X.-H. Zhou (2003). Nonparametric estimation of component distributions in a multivariate mixture. Annals of Statistics 31, 201–224.

Hall, R. and F. Mishkin (1982). The sensitivity of consumption to transitory income: Estimates from panel data on households. Econometrica 50, 461–481.

Heckman, J. J. and B. Singer (1984). A method for minimizing the impact of distributional assumptions in econometric models for duration data. Econometrica 52, 271–320.

Henry, M., Y. Kitamura, and B. Salanié (2010). Identifying finite mixtures in econometric models. Discussion Paper 0910-20, Columbia University, Department of Economics.

Kasahara, H. and K. Shimotsu (2012). Nonparametric identification and estimation of the number of components in multivariate mixtures. Unpublished manuscript.

Kleibergen, F. and R. Paap (2006). Generalized reduced rank tests using the singular value decomposition. Journal of Econometrics 133, 97–126.

Magnus, J. R. (1985). On differentiating eigenvalues and eigenvectors. Econometric Theory 1, 179–191.

McDonald, J. B. (1984). Some generalized functions for the size distribution of income. Econometrica 52, 647–663.

McLachlan, G. J. and D. Peel (2000). Finite Mixture Models. Wiley-Blackwell.

Robin, J.-M. and R. J. Smith (2000). Tests of rank. Econometric Theory 16, 151–175.

Schwartz, S. C. (1967). Estimation of probability density by an orthogonal series. Annals of Mathematical Statistics 38, 1261–1265.

Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. Annals of Statistics 10, 1040–1053.

Titterington, D. M. (1983). Minimum distance non-parametric estimation of mixture proportions. Journal of the Royal Statistical Society, Series B 45, 37–46.

Viollaz, A. J. (1989). Nonparametric estimation of probability density functions based on orthogonal expansions. Revista Matemática de la Universidad Complutense de Madrid 2, 41–82.