Nonparametric estimation of finite mixtures†
Stéphane Bonhomme
CEMFI, Madrid
Koen Jochmans‡
Sciences Po, Paris
Jean-Marc Robin
Sciences Po, Paris and University College London
[Revised October 17, 2012]
Abstract. The aim of this paper is to provide simple nonparametric methods to estimate finite-mixture models from data with repeated measurements. Three measurements suffice for the
mixture to be fully identified and so our approach can be used even with very short panel
data. We provide distribution theory for estimators of the number of mixture components,
the mixing proportions, as well as of the mixture distributions and various functionals thereof.
These estimators are found to perform well in a series of Monte Carlo exercises. We apply our
techniques to document heterogeneity in log annual earnings using PSID data spanning the
period 1969–1998.
Keywords: finite-mixture model, nonparametric estimation, series expansion, simultaneous-diagonalization system.
1. Introduction
Finite mixtures encompass a large class of models. Popular applications include modeling
unobserved heterogeneity, performing factor analysis and ecological inference, as well as
dealing with corrupted data and misclassified observations. We refer to Henry et al. (2010)
for illustrations of mixture models in economics and to McLachlan and Peel (2000) for
a book-length treatment. The conventional approach to inference in finite mixtures is
parametric—i.e., by specifying the distribution function of the outcome of interest, say y,
conditional on a discrete latent variable, say x, up to a finite-dimensional parameter—with
identification being driven fully by functional form. Only a few results on the nonparametric
identifiability and estimability of mixtures have so far been obtained, and our aim here is
to contribute to their development.
In line with the seminal work of Hall and Zhou (2003), our analysis builds on the
availability of repeated measurements on y. These measurements—which could be collected
†The first version of this paper dates from January 2012. We are grateful to Marc Henry and
Bernard Salanié for comments and discussion.
‡Address for correspondence: Sciences Po, Department of Economics, 28 rue des Saints-Pères,
75007 Paris, France.
E-mail: koen.jochmans@sciences-po.org
either contemporaneously or over time, as with panel data, for example—are assumed to
be independent and identically distributed conditional on x. Our arguments yield point
identification of the number of component mixtures, the component distributions, and the
mixing proportions when three or more repeated measurements are available and a rank
condition involving the component distributions is satisfied. Our approach to identification
is constructive. The resulting estimators are attractive from a computational point of view
and have desirable large-sample properties.
For brevity, we will focus on the case where the outcome variable exhibits continuous
variation. However, our methods can also be applied when y is a discrete random variable,
allowing for the number of support points to grow with the sample size. Our analysis is based
on projections of the marginal densities and component mixtures into a basis of functions.
We show that the mixture structure imposes a set of linear restrictions on the (generalized)
Fourier coefficients of these densities. When the coefficients of the component densities are
linearly independent, the number of component mixtures is identified as the rank of the
matrix containing the Fourier coefficients of the bivariate marginal density. Further, the
coefficients themselves are identified as the eigenvalues of a sequence of whitened matrices
of Fourier coefficients of the trivariate marginal density. This implies point identification of
the component mixtures on their entire support. The mixing proportions are then readily
pinned down as the unique minimizers of a minimum-distance criterion.
We propose estimating the number of components by sequentially testing the rank of
an empirical analog of the matrix of bivariate Fourier coefficients and show consistency of
the resulting point estimate. To estimate the component densities, we employ an algorithm
for the joint approximate-diagonalization of a set of eigenmatrices developed by Cardoso
and Souloumiac (1993). We derive both mean integrated square error (MISE) and uniform
convergence rates, and provide conditions for pointwise asymptotic normality. Interestingly,
the convergence rates coincide with those that would be obtained when data could be
sampled directly from the component densities, and are optimal in the MISE sense for a
variety of basis functions. A least-squares type estimator for the mixing proportions that
converges at the parametric rate is then readily constructed. We also present distribution
theory for GMM estimators of finite-dimensional parameters defined through conditional
moment restrictions involving the component mixtures.
Our identification result regarding the number of components improves on the lower
bound recently established by Kasahara and Shimotsu (2012). The conditions underlying
our identification analysis of the component mixtures and mixing proportions are similar
to those in Allman et al. (2009). However, our approach is constructive. The resulting
estimators also complement and extend the work by Hall and Zhou (2003) and Hall et al.
(2005), who proposed computationally cumbersome nonparametric approaches to estimate
bivariate mixtures.
We present numerical evidence on the small-sample performance of our procedures
through two Monte Carlo illustrations. As an empirical illustration, we use our methods to
investigate the presence of unobserved heterogeneity in earnings data from the period 1969–
1998. We find evidence of substantial nonlinear heterogeneity in earnings and document its
evolution over time.
2. Identification
Let $x$ be a latent discrete random variable, normalized to take on value $k = 1, 2, \ldots, K$ with probability $\omega_k > 0$. Let $y^T \equiv (y_1, y_2, \ldots, y_T)'$ be a vector of repeated measurements on an observable outcome variable $y$ whose marginal probability density function (PDF) takes the form
$$f(y_1, \ldots, y_T) = \sum_{k=1}^{K} \Big\{ \prod_{t=1}^{T} f_k(y_t) \Big\} \omega_k,$$
where fk denotes the PDF of y conditional on x = k.
Let $L_2(\mathcal{S})$ denote the set of functions that are square-integrable on a space $\mathcal{S}$, that is, the set of functions $g$ for which $\int_{\mathcal{S}} |g(\epsilon)|^2 \, d\epsilon < \infty$. Impose the following condition.
Assumption 1. For each k = 1, 2, . . . , K, fk ∈ L2 ([−1, 1]) and is absolutely continuous.
Restricting the support to equal the interval $[-1, 1]$ is innocuous, as the outcome may always be transformed by means of a strictly monotonic function whose image is $[-1, 1]$. Square-integrability validates a generalized Fourier expansion in a basis of functions, which is key for our subsequent developments.
Let $\{\chi_i, i \geq 1\}$ be a complete orthonormal basis for $L_2([-1, 1])$. For any integer $I$, the (generalized) Fourier approximation of $f_k$ is given by the projection of the density onto the subspace spanned by $\chi_1, \chi_2, \ldots, \chi_I$, that is,
$$\mathrm{Proj}_I f_k \equiv \sum_{i=1}^{I} b_{ik} \chi_i, \qquad b_{ik} \equiv \int_{-1}^{1} \chi_i(y) f_k(y) \, dy = E[\chi_i(y) \mid x = k],$$
and satisfies $\lim_{I \to \infty} \|\mathrm{Proj}_I f_k - f_k\|_2 = 0$ and $\lim_{I \to \infty} |\mathrm{Proj}_I f_k(y) - f_k(y)| = 0$ for all $-1 \leq y \leq 1$, where $\|\cdot\|_2$ denotes the $L_2$-norm.¹ Likewise, the projection of the marginal density of $(y_1, y_2, \ldots, y_H)$ is
$$\mathrm{Proj}_{I,H} f \equiv \sum_{i_1=1}^{I} \sum_{i_2=1}^{I} \cdots \sum_{i_H=1}^{I} a_{i_1 i_2 \cdots i_H} \, \chi_{i_1} \chi_{i_2} \cdots \chi_{i_H},$$
where $a_{i_1 i_2 \cdots i_H} \equiv E[\chi_{i_1}(y_1) \chi_{i_2}(y_2) \cdots \chi_{i_H}(y_H)]$.
In our setup, while the data do not directly nonparametrically identify $\{b_{ik}, i \geq 1\}$, they do reveal the sequences $\{a_{i_1 i_2 \cdots i_H},\ i_1, i_2, \ldots, i_H \geq 1\}$ for all $H \leq T$, while the mixture structure implies the set of linear restrictions
$$a_{i_1 i_2 \cdots i_H} = \sum_{k=1}^{K} b_{i_1 k} b_{i_2 k} \cdots b_{i_H k} \, \omega_k, \qquad (2.1)$$
for each $H \leq T$ and all $i_1, i_2, \ldots, i_H$. Under weak restrictions, the relations in (2.1) can be uniquely solved for the Fourier coefficients of the component mixtures and their associated mixing proportions (up to an arbitrary relabeling).

¹The notation $\|\cdot\|$ will be reserved for the Euclidean norm when the argument is a vector and for the matrix norm when it is a matrix. We use $|\cdot|$ to denote the absolute value for a real number and the cardinality for a set.
To discuss identification it suffices to set $T = 3$.² It is useful to write the restrictions in matrix form. For any $I$, let
$$B \equiv \begin{pmatrix} b_{11} & b_{12} & \cdots & b_{1K} \\ b_{21} & b_{22} & \cdots & b_{2K} \\ \vdots & \vdots & \ddots & \vdots \\ b_{I1} & b_{I2} & \cdots & b_{IK} \end{pmatrix}$$
be the $I \times K$ matrix whose $k$th column contains the first $I$ Fourier coefficients of $f_k$.
Similarly, let $a \equiv (a_1, a_2, \ldots, a_I)'$ and introduce the $J + 1$ symmetric $I \times I$ matrices
$$A_0 \equiv \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1I} \\ a_{21} & a_{22} & \cdots & a_{2I} \\ \vdots & \vdots & \ddots & \vdots \\ a_{I1} & a_{I2} & \cdots & a_{II} \end{pmatrix}, \qquad A_j \equiv \begin{pmatrix} a_{11j} & a_{12j} & \cdots & a_{1Ij} \\ a_{21j} & a_{22j} & \cdots & a_{2Ij} \\ \vdots & \vdots & \ddots & \vdots \\ a_{I1j} & a_{I2j} & \cdots & a_{IIj} \end{pmatrix},$$
containing the Fourier coefficients of the bivariate and of the trivariate marginal PDF, respectively. Then, from (2.1),
$$a = B\omega, \qquad A_0 = B \Omega B', \qquad A_j = B \Omega^{1/2} D_j \Omega^{1/2} B', \quad j = 1, 2, \ldots, J, \qquad (2.2)$$
where $\omega \equiv (\omega_1, \omega_2, \ldots, \omega_K)'$, $\Omega \equiv \mathrm{diag}[\omega]$, and $D_j \equiv \mathrm{diag}[b_{j1}, b_{j2}, \ldots, b_{jK}]$.
Identification rests on the following assumption.
Assumption 2. The matrix B has full column rank, that is, rank[B] = K.
The assumption of maximal column rank imposes the absence of multicollinearity among the Fourier coefficients of the component mixtures and is intuitive. Clearly, it requires that $I \geq K$.
Theorem 1 follows.
Theorem 1. (i) The number of mixture components, (ii) the component mixtures, and
(iii) the mixing proportions are all nonparametrically identified, where (ii) and (iii) are up
to arbitrary relabeling of the components.
²Non-trivial bounds on the component mixtures when $T = K = 2$ are given by Hall and Zhou (2003).
The proof of Theorem 1 is constructive. By Assumption 2, the matrix $A_0$, which is real and symmetric, has rank $K$. As this matrix is nonparametrically identified, so is its rank and, hence, the number of mixture components. This establishes Theorem 1(i). Continuing on, $A_0$ admits the spectral decomposition
$$A_0 = V \Lambda V', \qquad V \equiv (v_1, v_2, \ldots, v_K) \in \mathbb{R}^{I \times K}, \qquad \Lambda \equiv \mathrm{diag}[\lambda_1, \lambda_2, \ldots, \lambda_K] \in \mathbb{R}^{K \times K},$$
where the $K$ eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_K$ are all positive and real, and the corresponding eigenvectors $v_1, v_2, \ldots, v_K$ are linearly independent. Observe that, if some eigenvalues coincide, the matrix $V$ is not unique. By construction, $A_0$ can be transformed to the identity matrix through pre- and post-multiplication with the matrix $W \equiv \Lambda^{-1/2} V' \in \mathbb{R}^{K \times I}$ and its transpose, respectively. Moreover, from (2.2), we obtain
$$W A_0 W' = I_K = U U',$$
for $U \equiv W B \Omega^{1/2}$. Note that, by Assumption 2, $U$ is a $K \times K$ orthonormal matrix of full rank. Now, we can transform each $A_j$ in the same way and obtain
$$C_j \equiv W A_j W' = U D_j U'.$$
That is, the sequence of matrices $\{C_j, j \geq 1\}$ is simultaneously diagonalizable in the same basis (namely, the columns of $U$). The eigenvalue of $C_j$ corresponding to its $k$th eigenvector equals $b_{jk}$, the $j$th Fourier coefficient of $f_k$. Hence, $\mathrm{Proj}_J f_k$ is identified for any $J$, which implies that $f_k$ is identified in the sense that $\lim_{J \to \infty} \mathrm{Proj}_J f_k(y) = f_k(y)$ for each $y$ in $[-1, 1]$ and for all $k = 1, 2, \ldots, K$. Identification is achieved up to arbitrary relabeling only because the eigenvectors can be rearranged without affecting the argument. This establishes Theorem 1(ii). It is useful to remark at this point that identification does not require that $I \to \infty$. That is, we need not calculate the whole sequence $\{a_{i_1 i_2 i_3}, i_1, i_2, i_3 \geq 1\}$ to recover $\{b_{jk}, j \geq 1\}$. Finally, from (2.2),
$$B^{+} a = \omega,$$
where $B^{+} \equiv (B'B)^{-1} B'$ is the Moore-Penrose pseudo-inverse of $B$. Because $B$ is identified, so is $\omega$. This establishes Theorem 1(iii).
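To make the construction concrete, the following sketch (in Python; the matrices $B$ and $\omega$ are hypothetical values chosen purely for illustration, not taken from the paper) walks through the steps of the proof on exact population quantities: it builds $A_0$ and the $A_j$ via (2.2), whitens them, and recovers the Fourier coefficients $b_{jk}$ as joint eigenvalues.

```python
import numpy as np

# Hypothetical population quantities (illustration only): I = 4, K = 2.
B = np.array([[0.9, 0.2],
              [0.1, 0.8],
              [0.3, -0.4],
              [-0.2, 0.5]])            # columns: Fourier coefficients of f_1, f_2
omega = np.array([0.4, 0.6])           # mixing proportions
Om_half = np.diag(np.sqrt(omega))

A0 = B @ np.diag(omega) @ B.T          # bivariate coefficient matrix, cf. (2.2)
A = [B @ Om_half @ np.diag(B[j]) @ Om_half @ B.T for j in range(4)]

# Spectral decomposition of A0; retain the K = 2 non-zero eigenvalues.
lam, V = np.linalg.eigh(A0)
top = np.argsort(lam)[::-1][:2]
W = np.diag(lam[top] ** -0.5) @ V[:, top].T    # whitening matrix, W A0 W' = I_K

# All whitened matrices C_j = W A_j W' share the eigenvectors U; their
# eigenvalues are the coefficients b_jk (up to a common relabeling).
_, U = np.linalg.eigh(W @ A[0] @ W.T)
for j in range(4):
    D_j = U.T @ (W @ A[j] @ W.T) @ U
    print(np.round(np.diag(D_j), 6))   # recovers row j of B, columns relabeled
```

The printout reproduces the rows of $B$ up to the column permutation inherent in Theorem 1(ii).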
Knowledge of the component mixtures implies identification of all their functionals.
Corollary 1 provides one possible representation of this result that is particularly useful for
the purpose of estimation. To state it, let
$$\tau_k(y) \equiv \frac{f_k(y)}{f(y)}, \qquad k = 1, 2, \ldots, K,$$
identification of which follows immediately from Theorem 1.
Corollary 1. For any measurable function g whose expectation under fk exists,
E[g(y)|x = k] = E[g(y)τk (y)]
for each k = 1, 2, . . . , K. Furthermore, the function τk and, therefore, all the functionals
are nonparametrically identified.
Corollary 1 is instrumental in inferring a parameter θ0 defined through moment conditions
of the form
E[g(y; θ0 )|x = k] = 0
for a known measurable function g.
The weight function τk is interesting in its own right. More precisely, πk ≡ ωk τk provides
the probability that x = k as a function of y. This is a key object of interest in latent-class
analysis, allowing classification of observations based on marginal information.
3. Estimation
For $y_n^T \equiv (y_{n1}, y_{n2}, \ldots, y_{nT})'$, let $\{y_n^T, n = 1, 2, \ldots, N\}$ denote a sample of size $N$ drawn
at random from f . Throughout, we assume that T ≥ 3. Below we present, in turn, the
large-sample properties of our estimators of the number of components, the component
mixtures and the mixture weights, and of their functionals. The proofs of all theorems in
this section are collected in Appendix B.
Let $\rho_H$ be the set of ordered $H$-tuples of distinct integers from the set $\{1, 2, \ldots, T\}$. An unbiased estimator of $a_{i_1 i_2 \ldots i_H}$ for any $H$ is given by the sample average
$$\hat{a}_{i_1 i_2 \ldots i_H} \equiv \frac{1}{N} \frac{1}{|\rho_H|} \sum_{n=1}^{N} \sum_{(t_1, t_2, \ldots, t_H) \in \rho_H} \chi_{i_1}(y_{nt_1}) \chi_{i_2}(y_{nt_2}) \cdots \chi_{i_H}(y_{nt_H}), \qquad (3.1)$$
and is $\sqrt{N}$-consistent and asymptotically normal.
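A direct implementation of (3.1) might look as follows (a minimal sketch; `basis(i, y)` stands for whatever orthonormal system $\{\chi_i\}$ is used, for instance the normalized Chebychev polynomials of Section 4, and is an assumption here).

```python
import numpy as np
from itertools import permutations

def a_hat(y, basis, idx):
    """Sample analog (3.1) of a_{i_1 ... i_H}: average the product of basis
    functions over all ordered H-tuples of distinct periods (the set rho_H)
    and over the N units.  y is an (N, T) array; idx = (i_1, ..., i_H)."""
    N, T = y.shape
    tuples = list(permutations(range(T), len(idx)))
    total = 0.0
    for ts in tuples:
        prod = np.ones(N)
        for i, t in zip(idx, ts):
            prod *= basis(i, y[:, t])
        total += prod.mean()
    return total / len(tuples)
```

With $H = 2$ this produces the entries of $\hat{A}_0$ below, and with $H = 3$ those of the $\hat{A}_j$.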
3.1. Number of components
Use (3.1) with $H = 2$ to construct an estimator of $A_0$, say $\hat{A}_0$. Then $\hat{A}_0$ is $\sqrt{N}$-consistent and asymptotically normal, that is,
$$\sqrt{N} \, \mathrm{vec}[\hat{A}_0 - A_0] \xrightarrow{L} N(0, V_{A_0}), \qquad V_{A_0} \equiv E[\psi_{A_0}(y^T) \psi_{A_0}(y^T)'],$$
with the $i_1 + (i_2 - 1)I$th entry of $\psi_{A_0}(y^T)$ being $|\rho_2|^{-1} \sum_{(t_1, t_2) \in \rho_2} \chi_{i_1}(y_{t_1}) \chi_{i_2}(y_{t_2}) - a_{i_1 i_2}$.
Inferring $K$ then boils down to testing the rank of this matrix, which can be done by any of a number of approaches. The procedure by Kleibergen and Paap (2006) has several attractive features and therefore carries our preference.³
To describe it, start by ordering the eigenvalues of $A_0$ in decreasing order, and place the eigenvectors in $V$ accordingly. Fix $k \in \{0, 1, \ldots, I-1\}$ and let $\Lambda_{I-k} \equiv \Lambda(k+1:I, k+1:I)$ denote the lower-right $(I-k) \times (I-k)$ block of the matrix of eigenvalues of $A_0$. Also, let $V_k \equiv V(1:k, k+1:I)$ and $V_{I-k} \equiv V(k+1:I, k+1:I)$, using obvious notation for matrix block selection. Thus, $(V_k', V_{I-k}')'$ is the matrix of eigenvectors of the last $I - k$ eigenvalues.
³Kleibergen and Paap (2006) discuss the relative merits of their test statistic. Prime advantages include its non-sensitivity to the ordering of variables and the fact that its limit distribution under the null is free of nuisance parameters.
Now let $\varphi_k \equiv (\bar{V}_{I-k} \otimes \bar{V}_{I-k}) \mathrm{vec}[A_0]$ for $\bar{V}_{I-k} \equiv (V_{I-k} V_{I-k}')^{-1/2} V_{I-k}' \, [V_k', V_{I-k}']$. Then, if $\mathrm{rank}[A_0] = K \leq k$, $\Lambda_{I-k} = 0$ and
$$\hat{r}_k \equiv N |\rho_2| \, \hat{\varphi}_k' \hat{V}_{\varphi_k}^{-1} \hat{\varphi}_k \xrightarrow{L} \chi^2((I-k)^2), \qquad V_{\varphi_k} \equiv (\bar{V}_{I-k} \otimes \bar{V}_{I-k}) V_{A_0} (\bar{V}_{I-k} \otimes \bar{V}_{I-k})',$$
where $\hat{\varphi}_k$ is the sample analog of $\varphi_k$ and $\hat{V}_{\varphi_k}$ is a consistent estimator of $V_{\varphi_k}$. The rank statistic $\hat{r}_k$ can be used to test the null $H_0: \mathrm{rank}[A_0] = k$ against the alternative $H_1: \mathrm{rank}[A_0] > k$.
Furthermore, the rank statistic also leads to a consistent estimator of $K$, based on a sequential testing procedure. Following Robin and Smith (2000), let
$$\hat{K} \equiv \min_{k \in \{0, 1, \ldots, I-1\}} \big\{ k : \hat{r}_\ell \geq p_{1-\alpha}(\ell),\ \ell = 0, 1, \ldots, k-1,\ \hat{r}_k < p_{1-\alpha}(k) \big\},$$
for $p_{1-\alpha}(\ell)$ the $100(1-\alpha)$th percentile of the $\chi^2((I-\ell)^2)$ distribution and $\alpha = \alpha(N)$ a chosen significance level. That is, $\hat{K}$ is the first integer for which we fail to reject the null at significance level $\alpha$. Theorem 2 follows directly from Theorem 5.2 in Robin and Smith (2000).

Theorem 2. Let $\alpha \to 0$ and $-\log \alpha / N \to 0$ as $N \to \infty$. Then $\hat{K} \xrightarrow{P} K$.
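Operationally, the selection rule is a single loop over $k$ (a sketch; `rank_statistic` is a hypothetical function returning $\hat{r}_k$ for the data at hand):

```python
from scipy.stats import chi2

def estimate_K(rank_statistic, I, alpha):
    """Sequential-testing estimator of Theorem 2: return the first k for
    which H0: rank[A_0] = k is not rejected at level alpha."""
    for k in range(I):
        if rank_statistic(k) < chi2.ppf(1 - alpha, (I - k) ** 2):
            return k
    return I  # every null rejected; the estimated rank is at least I
```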
Our estimator of $K$ is similar to a recent proposal of Kasahara and Shimotsu (2012), who derived a consistent estimator of a lower bound on $K$. Their approach is based on the histogram as an estimator of the bivariate marginal density. For some integer values $H_1$ and $H_2$, let $\{\mathcal{H}_{h_1} \times \mathcal{H}_{h_2},\ h_1 = 1, 2, \ldots, H_1;\ h_2 = 1, 2, \ldots, H_2\}$ be a set of adjacent rectangles that partition the Cartesian square $[-1, 1] \times [-1, 1]$. Let $H$ be the $H_1 \times H_2$ matrix whose $(h_1, h_2)$th entry is the cell probability $E[1\{(y_1, y_2) \in \mathcal{H}_{h_1} \times \mathcal{H}_{h_2}\}]$. Kasahara and Shimotsu (2012) show that $K$ equals the non-negative rank of $H$, that is, $K = \mathrm{rank}_+[H]$. Because $\mathrm{rank}[H] \leq \mathrm{rank}_+[H]$, an estimator of a lower bound on $K$ can be constructed as above, with $\hat{A}_0$ replaced by a nonparametric estimator of $H$.
3.2. Component densities and mixing proportions
Dealing with the pretesting problem that arises from utilizing $\hat{K}$ rather than $K$ in estimating the mixture model is beyond the scope of this paper and so, throughout the remainder of this section, we take $K$ as given. The key components of our estimators of the $f_k$ and the $\omega_k$ are the matrices of Fourier coefficients $\{D_j, j = 1, 2, \ldots, J\}$, and so we start our discussion there.
Let $\hat{W}$ be the sample counterpart to $W$, constructed from the spectral decomposition of $\hat{A}_0$. For $j = 1, 2, \ldots, J$, let $\hat{A}_j$ be an estimator of $A_j$ and let $\hat{C}_j \equiv \hat{W} \hat{A}_j \hat{W}'$. Our estimator of $D_j = \mathrm{diag}[b_{j1}, b_{j2}, \ldots, b_{jK}]$ is
$$\hat{D}_j \equiv \hat{U}' \hat{C}_j \hat{U}, \qquad \hat{U} \equiv \arg\min_{X \in \mathcal{O}(K)} \sum_{i=1}^{I} \mathrm{off}\big[X' \hat{C}_i X\big], \qquad (3.2)$$
where $\mathcal{O}(K)$ denotes the set of orthonormal matrices of dimension $K \times K$ and the function $\mathrm{off}[\cdot]$ maps a square matrix into the sum of its squared off-diagonal elements.

It is useful to note that estimation is not based on the eigenspectra of the individual matrices $\{\hat{C}_j, j = 1, 2, \ldots, J\}$, which will typically all be different due to sampling error. Rather, our estimator of the eigenvectors $U$ in (3.2) is defined as the orthogonal matrix $X$ in $\mathcal{O}(K)$ that makes the set $\{X' \hat{C}_i X, i = 1, 2, \ldots, I\}$ as diagonal as possible.⁴ It is attractive from an efficiency point of view, as it utilizes the restriction that the $\{C_i, i = 1, 2, \ldots, I\}$ do share the same eigenvectors. The estimator $\hat{U}$ can easily be computed using the algorithm for the joint approximate-diagonalization of eigenmatrices (JADE) developed by Cardoso and Souloumiac (1993). JADE is fast and straightforward to implement.
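A compact implementation of the Jacobi-rotation scheme underlying JADE is given below. This is a sketch following Cardoso and Souloumiac's Jacobi-angle construction for real symmetric matrices, not the authors' own code; function names and defaults are our choices.

```python
import numpy as np

def jade(C_list, tol=1e-12, max_sweeps=100):
    """Joint approximate diagonalization of symmetric K x K matrices via
    Givens rotations (Cardoso and Souloumiac).  Returns an orthonormal U
    making every U' C_i U as diagonal as possible."""
    C = np.array(C_list, dtype=float)        # (I, K, K) working copies
    K = C.shape[1]
    U = np.eye(K)
    for _ in range(max_sweeps):
        rotated = False
        for p in range(K - 1):
            for q in range(p + 1, K):
                # Stack the (p, q)-plane statistics of every matrix; the
                # optimal Jacobi angle comes from the top eigenvector of h h'.
                h = np.vstack([C[:, p, p] - C[:, q, q],
                               C[:, p, q] + C[:, q, p]])
                _, vec = np.linalg.eigh(h @ h.T)
                x, y = vec[:, -1]
                if x < 0:
                    x, y = -x, -y
                r = np.hypot(x, y)
                if r < tol:
                    continue
                c = np.sqrt((x + r) / (2 * r))
                s = y / np.sqrt(2 * r * (x + r))
                if abs(s) > tol:
                    rotated = True
                    R = np.eye(K)
                    R[p, p] = R[q, q] = c
                    R[p, q], R[q, p] = -s, s
                    for i in range(C.shape[0]):
                        C[i] = R.T @ C[i] @ R   # rotate every matrix in place
                    U = U @ R
        if not rotated:
            break
    return U
```

Given `U = jade(C_hat_list)`, the estimator $\hat{D}_j$ is the diagonal of $\hat{U}' \hat{C}_j \hat{U}$.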
In Appendix A we derive the asymptotic properties of the estimator of the Fourier coefficients. Moreover, Theorem A.1 shows that, although we cannot directly sample data from the component densities, their Fourier coefficients are estimable at the conventional parametric rate.

Given our estimator of the Fourier coefficients, our estimator of $f_k$ ($k = 1, 2, \ldots, K$) is a conventional series estimator, i.e.,
$$\hat{f}_k \equiv \sum_{j=1}^{J} \hat{b}_{jk} \chi_j,$$
where $\hat{b}_{jk}$ is the $k$th diagonal entry of $\hat{D}_j$.
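Evaluating the estimator is then a one-line reduction over the estimated coefficients (a sketch, with `basis` as before):

```python
import numpy as np

def f_hat(y, b_k, basis):
    """Series estimate of the kth component density at points y;
    b_k = (b_hat_{1k}, ..., b_hat_{Jk}) are the estimated coefficients."""
    return sum(b * basis(j + 1, y) for j, b in enumerate(b_k))
```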
Let $\|\cdot\|_\infty$ denote the supremum norm. Impose the following high-level assumption.

Assumption 3. The non-zero eigenvalues of $A_0$ are all simple. The basis functions satisfy $\|\chi_j\|_\infty \leq \zeta$, where $\zeta$ is a finite constant that does not depend on $j$. For each $k = 1, 2, \ldots, K$, $\|\mathrm{Proj}_J f_k - f_k\|_\infty = O(J^{-\beta})$ for some $\beta \geq 1$.
The restriction on the eigenvalues is mainly imposed to facilitate the exposition. It is well known that the asymptotic distribution of eigenvalues depends in a complicated way on their multiplicity, which is required to be known; see, e.g., Eaton and Tyler (1991). The uniform bound on the basis functions is weak. The bound on the approximation error is conventional in nonparametric curve estimation by series expansions. It is compatible with a large variety of basis functions and can be interpreted as a smoothness requirement on the component mixtures. Chen (2007) provides examples with specific basis functions. Observe that imposing a certain degree of smoothness on the component mixtures is substantially less demanding than requiring the same of the marginal density. The mixing will typically create non-smoothness, even if the components are all smooth.
⁴More precisely, minimizing the criterion defined in (3.2) is equivalent to solving the least-squares problem $\min_{X \in \mathcal{O}(K), L_i \in \mathcal{L}(K)} \sum_{i=1}^{I} \|\hat{C}_i - X L_i X'\|^2$, where $\mathcal{L}(K)$ denotes the set of $K \times K$ diagonal matrices. Note also that $\hat{U}$ could be computed from more than the leading $I$ matrices of whitened Fourier coefficients. That is, $I$ in (3.2) may be replaced by any fixed integer.
The following theorem provides mean integrated squared error (MISE) and uniform convergence rates for the estimator of the component mixtures.

Theorem 3. The estimator of the component mixture satisfies
$$E\big\|\hat{f}_k - f_k\big\|_2^2 = O\big(J/N + J^{-2\beta}\big), \qquad \big\|\hat{f}_k - f_k\big\|_\infty = O_P\big(J/\sqrt{N} + J^{-\beta}\big),$$
for each $k = 1, 2, \ldots, K$.

The rates in Theorem 3 equal the conventional univariate rates of nonparametric series estimators; see, e.g., Schwartz (1967). The MISE rate is known to be optimal for splines in the sense that it achieves the bound established by Stone (1982).
Theorem 4 below presents the limit distribution of the estimator of the component mixtures at a fixed evaluation point $y$. To state the result, some additional notation is required. Let
$$\psi_{f_k}(y_n^T) \equiv \frac{1}{|\rho_3|} \sum_{(t_1, t_2, t_3) \in \rho_3} \Big[ \bar{\tau}_k(y_{nt_1}, y_{nt_2}) \kappa_J(y_{nt_3}; y) - E[\bar{\tau}_k(y_{t_1}, y_{t_2}) \kappa_J(y_{t_3}; y)] \Big], \qquad (3.3)$$
whose dependence on $y$ is kept implicit and where it should be clear that the expectation is taken with respect to $(y_{t_1}, y_{t_2}, y_{t_3})$. Here,
$$\bar{\tau}_k(y_1, y_2) \equiv \sum_{i_1=1}^{I} \sum_{i_2=1}^{I} u_k' w_{i_1} \chi_{i_1}(y_1) \chi_{i_2}(y_2) w_{i_2}' u_k, \qquad \kappa_J(y_1, y_2) \equiv \sum_{j=1}^{J} \chi_j(y_1) \chi_j(y_2),$$
where $w_i$ is the $i$th column of the $K \times I$ whitening matrix $W$ and $u_k$ is the $k$th column of the $K \times K$ joint-eigenvalue matrix $U$. The proof of asymptotic normality shows that the limit distribution of $\hat{f}_k(y) - f_k(y)$ coincides with the limit distribution of the sample average of (3.3). The form of this function is interesting. The function $\kappa_J$ is known as the $J$th kernel of the system $\{\chi_j, j \geq 1\}$ and behaves much like its namesake in conventional kernel estimation of densities; e.g., $E[\kappa_J(y_{t_1}, y)] = \mathrm{Proj}_J f(y)$. Concerning $\bar{\tau}_k$, note that $E[\bar{\tau}_k(y_{t_1}, y_{t_2}) \mid x = k'] = \omega_k^{-1} 1\{k' = k\}$ and so $E[\bar{\tau}_k(y_{t_1}, y_{t_2}) \kappa_J(y_{t_3}, y)] = \mathrm{Proj}_J f_k(y) = E[\tau_k(y_{t_1}) \kappa_J(y_{t_1}, y)]$, as is readily verified by use of the law of iterated expectations. Hence, $\hat{f}_k(y)$ can be viewed as a reweighting estimator based on an estimator of $f(y)$, the marginal density of $y$.
Theorem 4. Let $J \to \infty$ so that $J/N \to 0$ and $N J^{-2\beta} \to 0$ as $N \to \infty$. Then, for each $k = 1, 2, \ldots, K$ and all $y \in [-1, 1]$ where $0 < f(y) < \infty$, the estimated component mixture satisfies
$$\sqrt{N} \, V_{f_k}^{-1/2} \big[\hat{f}_k(y) - f_k(y)\big] \xrightarrow{L} N(0, 1), \qquad V_{f_k} \equiv E[\psi_{f_k}(y^T) \psi_{f_k}(y^T)],$$
as $N \to \infty$.
The proof of Theorem 4 shows that $V_{f_k} = O(\alpha_J(y))$ for $\alpha_J(y) \equiv \int_{-1}^{1} \kappa_J(\epsilon, y)^2 \, d\epsilon$. Thus, the pointwise convergence rate is determined by the growth rate of $\alpha_J(y)$. A weak bound is
$$|\hat{f}_k(y) - f_k(y)| = O_P\big(\sqrt{J/N}\big)$$
because $\|\chi_j\|_\infty \leq \zeta$, which implies that $\alpha_J(y) = O(J)$. The conditions in the theorem impose that $J/N = o(1)$. The second restriction on the growth rate of $J$ ensures the limit distribution to be free of bias.
Use the $\hat{b}_{jk}$ to form a plug-in estimator of $B$, say $\hat{B}$. On subsequently constructing $\hat{B}^{+} \equiv (\hat{B}'\hat{B})^{-1} \hat{B}'$, our estimator of $\omega$ takes the form
$$\hat{\omega} \equiv \hat{B}^{+} \hat{a},$$
where the $I$-vector $\hat{a}$ is constructed using (3.1) for $H = 1$. Note that $\hat{\omega}$ minimizes a sample version of the $L_2$-distance $\|\mathrm{Proj}_I f - \sum_k \mathrm{Proj}_I f_k \, p_k\|_2^2$ with respect to $p_1, p_2, \ldots, p_K$. It is straightforward to construct other minimum-distance estimators of $\omega$ based on different weight functions than the uniform one used here.
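In code, the estimator is a single least-squares fit (a sketch; `B_hat` and `a_hat` denote the plug-in quantities just described):

```python
import numpy as np

def omega_hat(B_hat, a_hat):
    """Minimum-distance estimator of the mixing proportions,
    omega_hat = (B'B)^{-1} B' a, computed as a least-squares fit."""
    w, *_ = np.linalg.lstsq(B_hat, a_hat, rcond=None)
    return w
```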
The asymptotic properties of $\hat{\omega}$ are easily established using the results from Appendix A. Denote by $\psi_a(y_n^T)$ the $I$-vector whose $i$th entry equals $T^{-1} \sum_{t=1}^{T} \chi_i(y_{nt}) - a_i$ and let $\psi_B$ be the influence function of $\mathrm{vec}[\hat{B}]$. Note that $\psi_B$ is readily obtained from the influence function of the $\mathrm{vec}[\hat{D}_j]$, which is provided in Theorem A.1 in Appendix A. On introducing
$$\psi_\omega(y_n^T) \equiv (B'B)^{-1} \big( (a' \otimes I_K) \psi_B(y_n^T) + B' \psi_a(y_n^T) \big),$$
Theorem 5 can be stated.
Theorem 5. The minimum-distance estimator $\hat{\omega}$ of the mixing proportions $\omega$ satisfies
$$\sqrt{N}(\hat{\omega} - \omega) \xrightarrow{L} N(0, V_\omega), \qquad V_\omega \equiv E[\psi_\omega(y^T) \psi_\omega(y^T)'],$$
as $N \to \infty$.
Our minimum-distance estimator is similar in spirit to the proposal of Titterington (1983),
who worked in a framework where data can be sampled directly from the K component
mixtures.
3.3. Functionals
Now consider the problem of inferring a vector $\theta_0$ defined as the unique solution to a moment condition of the form $E[g(y; \theta) \mid x = k] = 0$ for some known function $g$. Prime examples of such $\theta_0$ are the moments of the component mixtures; the first moment of $f_k$, for example, is defined through $E[y - \theta_0 \mid x = k] = 0$.

From Corollary 1 we have that $E[g(y; \theta) \mid x = k] = E[g(y; \theta) \tau_k(y)]$, and so a natural way to proceed is to consider a GMM estimator of the form
$$\hat{\theta} \equiv \arg\min_{\theta \in \Theta} \hat{m}(\theta)' \hat{\Sigma} \, \hat{m}(\theta),$$
where $\Theta$ is the parameter space, $\hat{\Sigma}$ is a positive-definite weight matrix that converges in probability to a positive-definite and non-stochastic matrix $\Sigma$, and
$$\hat{m}(\theta) \equiv \frac{1}{N} \sum_{n=1}^{N} \hat{m}_n(y_n^T; \theta), \qquad \hat{m}_n(y_n^T; \theta) \equiv \frac{1}{T} \sum_{t=1}^{T} g(y_{nt}; \theta) \, \hat{\tau}_k(y_{nt}).$$
Here, $\hat{\tau}_k(y)$ is an estimator of the weight function $\tau_k$ at $y$.⁵ We use
$$\hat{\tau}_k(y) \equiv \frac{\hat{f}_k(y)}{\sum_{k'=1}^{K} \hat{f}_{k'}(y) \, \hat{\omega}_{k'}}, \qquad k = 1, 2, \ldots, K,$$
although other possibilities could be entertained.
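For the component mean, where $g(y; \theta) = y - \theta$, the estimator has a closed form and a sketch is particularly simple (`tau_hat_k` denotes the estimated weight function above; the trimming mentioned in footnote 5 is omitted):

```python
import numpy as np

def component_mean(y, tau_hat_k):
    """GMM estimator of E[y | x = k] under g(y; theta) = y - theta: the
    sample moment (1/NT) sum (y_nt - theta) tau_hat_k(y_nt) = 0 is solved
    exactly.  y is an (N, T) array; tau_hat_k must accept arrays."""
    w = tau_hat_k(y)
    return np.sum(y * w) / np.sum(w)
```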
Let $g_0$ be $g$ evaluated at $\theta_0$. Impose the following regularity conditions.

Assumption 4. The space $\Theta$ is compact and $\theta_0$ lies in its interior. The function $g$ is twice continuously differentiable on $\Theta$, and $\sup_{\theta \in \Theta} E\|g(y; \theta)\|$ and $\sup_{\theta \in \Theta} E\|\nabla_{\theta'} g(y; \theta)\|$ are finite. The matrix $E[\nabla_{\theta'} g(y; \theta_0) \tau_k(y)]$ has full column rank. For $k' = 1, 2, \ldots, K$, $\tau_{k'}$ is bounded away from zero and infinity on $[-1, 1]$, $g_0 \tau_{k'}$ is square integrable, and there exists an integer $\eta \geq 1$ so that $\|\mathrm{Proj}_J g_0 \tau_{k'} - g_0 \tau_{k'}\|_\infty = O(J^{-\eta})$.
Assumption 4 contains familiar conditions for GMM estimators to be asymptotically normal.
It also postulates a shrinkage rate on the bias of series estimators of the g0 τk , which can be
motivated as before.
To state the asymptotic distribution of $\hat{\theta}$, let $M_\theta \equiv E[\nabla_{\theta'} g(y; \theta_0) \tau_k(y)]$ and let $M_\omega$ be the $K$-vector that has $E[g(y; \theta_0) \tau_k(y) \tau_{k'}(y)]$ as its $k'$th entry. Introduce
$$\psi_\theta(y_n^T) \equiv (M_\theta' \Sigma M_\theta)^{-1} M_\theta' \Sigma \big( m_n(y_n^T; \theta_0) + \psi_\tau(y_n^T) \big),$$
where $m_n(y_n^T; \theta) \equiv T^{-1} \sum_{t=1}^{T} g(y_{nt}; \theta) \tau_k(y_{nt})$ and $\psi_\tau(y_n^T)$ is defined as
$$\frac{1}{|\rho_3|} \sum_{(t_1, t_2, t_3) \in \rho_3} g(y_{nt_3}; \theta_0) \Big[ \bar{\tau}_k(y_{nt_1}, y_{nt_2}) - \tau_k(y_{nt_3}) \sum_{k'=1}^{K} \bar{\tau}_{k'}(y_{nt_1}, y_{nt_2}) \omega_{k'} \Big] + M_\omega' \psi_\omega(y_n^T).$$
The function $\psi_\theta$ is the influence function of $\hat{\theta}$ and has the usual structure. The term involving $\psi_\tau$ captures the impact of first-stage estimation error on the asymptotic variance of $\hat{\theta}$. Let
$$V_m \equiv E\big[ \big( m_n(y^T; \theta_0) + \psi_\tau(y^T) \big) \big( m_n(y^T; \theta_0) + \psi_\tau(y^T) \big)' \big]$$
and assume this matrix to be positive definite. Theorem 6 provides the limit distribution of $\hat{\theta}$.
Theorem 6. Let $J \to \infty$ so that $J^4/N \to 0$, $N J^{-2\beta} \to 0$, and $N J^{-2\eta} \to 0$ as $N \to \infty$. Then, for any $k = 1, 2, \ldots, K$, the estimator $\hat{\theta}$ of the estimand $\theta_0$ satisfies
$$\sqrt{N}(\hat{\theta} - \theta_0) \xrightarrow{L} N(0, V_\theta), \qquad V_\theta \equiv E[\psi_\theta(y^T) \psi_\theta(y^T)'],$$
as $N \to \infty$.

Standard arguments show that the optimally-weighted estimator is obtained on setting $\Sigma \propto V_m^{-1}$, in which case Theorem 6 implies that $\sqrt{N}(\hat{\theta} - \theta_0) \xrightarrow{L} N(0, [M_\theta' V_m^{-1} M_\theta]^{-1})$.
⁵Some trimming will typically be warranted to keep $\hat{m}(\theta)$ well defined. With some work, Theorem 6 may be generalized to hold under a trimming scheme where the tuning parameter converges to the identity slowly with $N$.
4. Monte Carlo evaluation
We present numerical evidence on the small-sample performance of our estimators by means of two illustrations. We will work with the family of generalized Beta distributions, which is popular for modeling the distribution of income; see, e.g., McDonald (1984). On the interval $[\underline{y}, \overline{y}] \subset \mathbb{R}$, the generalized Beta distribution is
$$b(y; \vartheta_1, \vartheta_2; \underline{y}, \overline{y}) \equiv \frac{1}{(\overline{y} - \underline{y})^{\vartheta_1 + \vartheta_2 - 1}} \frac{1}{B(\vartheta_1, \vartheta_2)} (y - \underline{y})^{\vartheta_1 - 1} (\overline{y} - y)^{\vartheta_2 - 1},$$
where $B(\vartheta_1, \vartheta_2) \equiv \int_0^1 \epsilon^{\vartheta_1 - 1} (1 - \epsilon)^{\vartheta_2 - 1} \, d\epsilon$ and $\vartheta_1$ and $\vartheta_2$ are positive real shape parameters. Its mean and variance are
$$\mu \equiv \underline{y} + (\overline{y} - \underline{y}) \frac{\vartheta_1}{\vartheta_1 + \vartheta_2}, \qquad \sigma^2 \equiv (\overline{y} - \underline{y})^2 \frac{\vartheta_1 \vartheta_2}{(\vartheta_1 + \vartheta_2)^2 (\vartheta_1 + \vartheta_2 + 1)}, \qquad (4.1)$$
respectively.
Throughout, we use normalized Chebychev polynomials as basis functions. The $i$th such polynomial is
$$\chi_i(\epsilon) = \sqrt{\frac{2^{1\{i \neq 1\}}}{\pi}} \, \frac{\cos[(i - 1) \arccos(\epsilon)]}{(1 - \epsilon^2)^{1/4}},$$
for $\epsilon \in [-1, 1]$.
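A direct transcription of this basis (a sketch; note that the $(1 - \epsilon^2)^{1/4}$ factor requires evaluation points strictly inside $(-1, 1)$):

```python
import numpy as np

def chebyshev_basis(i, eps):
    """ith normalized Chebychev polynomial, orthonormal in L2([-1, 1]);
    eps must lie strictly inside (-1, 1)."""
    scale = np.sqrt((2.0 if i > 1 else 1.0) / np.pi)
    return scale * np.cos((i - 1) * np.arccos(eps)) / (1.0 - eps ** 2) ** 0.25
```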
We will estimate the component mixtures and their associated CDFs, as well as the mixing proportions and the mean and variance of the component mixtures. To ensure bona fide density estimates we use
$$\tilde{f}_k(y) \equiv \max\{0, \hat{f}_k(y) - c_k\}, \qquad (4.2)$$
where $c_k$ is chosen so that $\int \tilde{f}_k(\epsilon) \, d\epsilon = 1$. Note that $\tilde{f}_k$ is the projection of $\hat{f}_k$ into the space of square-integrable functions that are non-negative and integrate to one; see Gajek (1986). To infer the conditional CDFs, $F_k(y) \equiv \int_{-\infty}^{y} f_k(\epsilon) \, d\epsilon$, we use Clenshaw and Curtis (1960) quadrature to approximate the integral $\int_{-\infty}^{y} \tilde{f}_k(\epsilon) \, d\epsilon$. Our approximation uses 101 quadrature nodes. The moments are estimated using the GMM estimator from Theorem 6, without any trimming. To describe the population, we also present simple kernel estimates of the marginal density of the data. These are obtained using a Gaussian kernel and a bandwidth set according to Silverman's rule of thumb.
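The constant $c_k$ in (4.2) can be found by a one-dimensional bisection on a quadrature grid; a sketch under the assumption that nodes and weights on $[-1, 1]$ are available (any quadrature scheme will do):

```python
import numpy as np

def bona_fide(f_vals, quad_w, tol=1e-10):
    """Find c_k in (4.2) by bisection: the mass of max{0, f_hat - c} is
    decreasing in c, so bracket the root and halve.  f_vals holds f_hat
    at the quadrature nodes, quad_w the corresponding weights."""
    mass = lambda c: np.sum(quad_w * np.maximum(0.0, f_vals - c))
    lo, hi = min(0.0, f_vals.min()) - 1.0, f_vals.max()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if mass(mid) > 1.0 else (lo, mid)
    return np.maximum(0.0, f_vals - 0.5 * (lo + hi))
```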
Experiment 1. Our first experiment involves three generalized Beta distributions on the interval $[-1, 1]$. Moreover, we consider
$$f_1(y) = b(y; 2, 7; -1, 1), \quad \omega_1 = .20; \qquad f_2(y) = b(y; 5, 4; -1, 1), \quad \omega_2 = .35; \qquad f_3(y) = b(y; 6, 2; -1, 1), \quad \omega_3 = .45.$$
Table 1. Sequential rank test in Experiment 1

                α = .100                 α = .050                 α = .025
   N      K̂<K    K̂=K    K̂>K      K̂<K    K̂=K    K̂>K      K̂<K    K̂=K    K̂>K
  500     .005   .927   .068      .002   .959   .039      .001   .978   .021
  750     .006   .926   .068      .002   .958   .040      .001   .976   .023
 1000     .005   .924   .071      .002   .957   .041      .001   .979   .020
 1500     .005   .929   .066      .002   .960   .038      .000   .976   .024
 2000     .006   .930   .064      .002   .956   .042      .002   .980   .018
 2500     .004   .929   .067      .002   .965   .033      .000   .975   .024
Using (4.1), the means of the component mixtures are $\mu_1 = -5/9 \approx -.556$, $\mu_2 = 1/9 \approx .111$, and $\mu_3 = 1/2 = .500$, while their respective variances are $\sigma_1^2 = 28/405 \approx .069$, $\sigma_2^2 = 8/81 \approx .099$, and $\sigma_3^2 = 1/12 \approx .083$. Throughout, we set $T = 4$ and $I = J = 6$.

Table 1 presents simulation results for the estimator of $K$ defined through the sequential testing procedure in Theorem 2 for various values of $N$ and $\alpha$. The table reports the frequency with which $K$ was either underestimated, correctly estimated, or overestimated in 10,000 Monte Carlo replications. Overall, $\hat{K}$ is found to perform well, correctly picking the true number of component mixtures in about 95% of the cases.
The first three panels in Figure 1 show upper 95% and lower 5% envelopes (dashed lines) over 1000 estimates from samples of size $N = 1000$ of the marginal PDF (left upper panel), of the component PDFs (right upper panel), and of the component CDFs (left middle panel), together with their respective true values (solid lines). They reveal the JADE-based estimator to back out the component mixtures very well, even though inspection of the estimated marginal density does not directly suggest the particular mixture structure. The remaining three panels provide box plots of the sampling distribution of the mixing proportions (right middle panel) and the mean (lower left panel) and variance (lower right panel) of the component mixtures. All box plots are centered around the respective true values. An asterisk marks zero. Overall, the plots suggest good performance. The sampling distributions of the mixing proportions and of the mixture means are broadly correctly centered and have small interquartile ranges. The box plots for the variance show there to be a slightly larger bias in the estimates of the mixture variances, in particular for the first and third component mixture. Nonetheless, the magnitude of the bias is very small. It is well known that higher-order moments are more difficult to estimate accurately.
Experiment 2. Our methods contribute to the analysis of non-separable fixed-effect models, that is, models of the form $y_t = g(x, \epsilon_t)$ for some unobservable $\epsilon_t$. A location-scale version is
$$y_t = x + \eta_t, \qquad \eta_t = \sigma(x) \epsilon_t, \qquad x \perp \epsilon_t,$$
for some function $\sigma$. The location-scale model can be seen as a stripped-down version of a linear fixed-effect model or as a one-factor model. Note, however, that the factor $x$ and the error $\eta_t$ are not independent.
Figure 1. Estimates in Experiment 1
Table 2. Sequential rank test in Experiment 2

                α = .100                 α = .050                 α = .025
   N      K̂<K    K̂=K    K̂>K      K̂<K    K̂=K    K̂>K      K̂<K    K̂=K    K̂>K
  500     .140   .642   .218      .201   .644   .155      .264   .617   .119
  750     .053   .748   .199      .083   .781   .136      .120   .782   .098
 1000     .015   .790   .193      .030   .836   .134      .044   .864   .092
 1500     .001   .804   .194      .001   .864   .135      .004   .900   .097
 2000     .000   .809   .189      .000   .864   .136      .000   .904   .096
 2500     .000   .816   .181      .000   .866   .134      .000   .905   .095
Suppose that $\epsilon_t$ is drawn from the generalized Beta distribution on $[-1, 1]$ with $\vartheta_1 = \vartheta_2 = \vartheta$. Then its distribution is symmetric and $\mathrm{var}[\epsilon_t] = 1/(2\vartheta + 1)$, while $f_k$ is supported on the interval $[k - \sigma(k), k + \sigma(k)]$. Below we report results for the scale-function specification $\sigma(x) = \sqrt{(2\vartheta + 1)/x}$. Here,
$$E[y \mid x = k] = k, \qquad \mathrm{var}[y \mid x = k] = \frac{1}{k},$$
so that the variability is inversely related to the mean. Table 2 and Figure 2, which have the same structure as Table 1 and Figure 1 above, present simulation results for this location-scale model with $K = 5$ and $\omega = (.30, .20, .25, .15, .10)'$. This choice of mixture weights implies $f$ to be unimodal and virtually symmetric, hiding well the underlying mixture structure. The simulations were performed with $I = J = 8$. The table was generated over 10,000 replications. The plots are based on 1000 Monte Carlo runs.
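For reference, the data-generating process of this experiment can be reproduced in a few lines. This is a sketch: the value of $\vartheta$ is not reported in the text, so the one below is an assumption, and $T = 4$ is carried over from Experiment 1.

```python
import numpy as np

def simulate_exp2(N, T=4, theta=2.0, rng=None):
    """Simulate the Experiment 2 DGP: x in {1,...,5} with the stated
    weights, eps ~ Beta(theta, theta) rescaled to [-1, 1], and
    y_t = x + sigma(x) eps_t with sigma(x) = sqrt((2 theta + 1)/x),
    so that var[y | x = k] = 1/k.  theta and T are assumptions."""
    rng = np.random.default_rng(0) if rng is None else rng
    omega = np.array([0.30, 0.20, 0.25, 0.15, 0.10])
    x = rng.choice(np.arange(1, 6), size=N, p=omega)
    eps = 2.0 * rng.beta(theta, theta, size=(N, T)) - 1.0
    sigma = np.sqrt((2.0 * theta + 1.0) / x)
    return x, x[:, None] + sigma[:, None] * eps
```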
Table 2 shows that the sequential-rank test continues to perform well, although there is
a larger tendency to over- and underestimate K when the sample is fairly small.
The plots in Figure 2 demonstrate excellent estimates of the component PDFs and CDFs.
The box plots of the mixing proportions, too, are again broadly correctly centered, although
the interquartile ranges are higher than in Experiment 1. From the bottom panels in the
figure it may be observed that the sampling distributions of the GMM estimators of the
mean and variance of the component mixtures are now more dispersed. The estimates of
the variance also suffer from non-negligible bias. This suggests that some trimming will
typically be needed in applications where the support is large.
5. Empirical application
The classic model of earnings dynamics allows for heterogeneity in log-earnings levels only
through the inclusion of unit-specific intercepts; see, e.g., Hall and Mishkin (1982). The
recent literature has argued that unobserved heterogeneity in log earnings stretches beyond
such additive fixed effects; see, e.g., Browning et al. (2010) for a discussion and an extensive
empirical investigation in a parametric framework. In this section, we apply our methods to
investigate the degree of heterogeneity in earnings data on males for the period 1969–1998.
Our approach is very flexible in terms of unobserved heterogeneity. It allows for persistence of essentially arbitrary form through unobserved time-constant factors but rules out the presence of state dependence.

Figure 2. Estimates in Experiment 2
From the PSID 1969–1998 we construct a set of three-period balanced subpanels, using a rolling window of length one.⁶,⁷ This yields 28 subpanels. For each such subpanel, we obtain our measure of log (annual) earnings of unit $n$ at time $t$, say $y_{nt}$, as the residual of a pooled regression of reported log earnings on a constant term, a set of time dummies, years of schooling, and a second-degree polynomial in experience. A graphical inspection of the marginal densities in each subpanel (not reported) does not suggest large dispersion between the univariate marginal densities in a given subpanel, so that our smoothing policy seems reasonable.
In line with Heckman and Singer (1984) and many others since, we view the latent factor $x$ as representing type heterogeneity, consider a discretization approach, and estimate a trivariate mixture model for each subpanel. Values of $K$ up to ten yielded a similar pattern in the component mixtures. We focus on a relatively small $K$ for ease of exposition. In this way, one can think of $x$ as an indicator for low, intermediate, and high innate ability, for example. The densities were estimated using Chebychev polynomials as basis functions, with $I$ set to five and $J = .7 \sqrt[3]{N}$ to accommodate the relatively strong increase in the number of cross-sectional observations in the subpanels spanning later time periods. We used (4.2) to ensure bona fide density estimates.
Figure 3 contains the estimated conditional densities for a selection of subpanels. The pattern in the plots is representative for the other subpanels formed from the PSID data. The values of $N$ and $J$ are indicated above each plot. Figure 3 shows well-separated unimodal component mixtures. From here on out, index the component mixtures $k = 1, 2, 3$ by mode, that is, from left to right. The second conditional density is fairly symmetric about its mode while the outer two densities are somewhat skewed. Overall, the plots show that a higher value of $x$ translates into a higher probability of being in the top of the earnings distribution.
To inspect the evolution of the location and variability of log earnings over time, we
estimated the mode and interquartile range of the component mixtures in each subpanel.
The interquartile range was estimated by inverting the CDF of the components which,
in turn, was computed using Clenshaw-Curtis quadrature (with 101 quadrature nodes, as
before). The individual estimates are indicated by a • for k = 1, by an x for k = 2, and by a
+ for k = 3. To capture the overall time trend, linear-regression lines through the individual
points are also given.

⁶PSID: Panel Study of Income Dynamics public use dataset. Produced and distributed by the University of Michigan with primary funding from the National Science Foundation, the National Institute of Aging, and the National Institute of Child Health and Human Development. Ann Arbor, MI.

⁷We excluded self-employed individuals and students, as well as individuals for whom earnings were top coded. We further restricted the sample to individuals between the ages of 20 and 60, with at most 40 years of experience.

Figure 3. Conditional densities of log earnings
[Four panels plotting the estimated $f_k$ against $y$: 1969–1971 ($N = 585$, $J = 6$); 1976–1978 ($N = 906$, $J = 7$); 1984–1986 ($N = 1107$, $J = 8$); 1996–1998 ($N = 1420$, $J = 8$).]

Figure 4. Mixture modes and interquartile ranges over time
[Left panel: mixture modes by year, 1970–1995. Right panel: interquartile ranges by year, 1970–1995.]

The left plot in Figure 4 shows the evolution of the mixture modes over the sampling period. As is clear from the regression lines, the dispersion between the
modes tends to increase somewhat over time, with the mode for k = 1 decreasing, the mode
for k = 2 remaining relatively constant, and the mode for k = 3 increasing. In terms of
location, this indicates the various component mixtures to move further away from each
other, suggesting an increase in the between-group dispersion of log earnings. The right
plot of Figure 4 provides the interquartile ranges. The plot shows that the spread of the
component mixtures is inversely related to their mode. That is, individuals at the lower
end of the earnings distribution are exposed to higher volatility. Further, the difference in
the interquartile ranges decreases over the sampling period. The regression lines show that
the interquartile ranges of the first and second component mixture have decreased while
the interquartile range of the high-end component mixture displays an upward trend.
6. Conclusion
We have discussed methods to estimate nonparametric finite-mixture models from short panel data. The estimators are straightforward to implement and have desirable large-sample properties. A Monte Carlo assessment further indicated good performance in small samples. As an empirical illustration we applied our approach to earnings data. We found evidence of substantial nonlinear heterogeneity in earnings processes, and documented their evolution over time.
In future work, we aim to provide a data-driven method for selecting the truncation parameter optimally, paving the way for adaptive estimation. Another potentially fruitful
direction for further research is to investigate how our approach can be modified to allow
K to grow with T . With some work, it should also be possible to extend our methodology
to Markovian models. One area where this can be of use is in the estimation of dynamic
discrete-choice models. Finally, our projection approach can potentially be used to estimate
continuous mixtures. However, this will involve an ill-posed inverse problem, and will lead
to the associated decrease in convergence rates.
Appendix A: Asymptotic properties of estimated Fourier coefficients
This appendix works towards Theorem A.1, which establishes the limit behavior of the $\hat{D}_j$. We do so by providing lemmata which, in turn, develop the large-sample behavior of the spectral-decomposition matrix $\hat{W}$ (Lemma A.1), the whitened matrices of Fourier coefficients, $\hat{C}_j$ (Lemma A.2), and the JADE estimator, $\hat{U}$ (Lemma A.3).

To state our results, some additional notation is needed. First, observe that, for any $j$, $\mathrm{vec}[\hat{A}_j - A_j]$ is asymptotically linear and that its influence function at $y_n^T$, $\psi_{A_j}(y_n^T)$, say, has
$$\frac{1}{|\rho_3|} \sum_{(t_1, t_2, t_3) \in \rho_3} \chi_{i_1}(y_{nt_1}) \chi_{i_2}(y_{nt_2}) \chi_j(y_{nt_3}) - a_{i_1 i_2 j}, \qquad (A.1)$$
as its $i_1 + (i_2 - 1)I$th element.
Let $\bar{V} \equiv (\bar{v}_1, \bar{v}_2, \ldots, \bar{v}_K)$ for $\bar{v}_k \equiv v_k \otimes v_k$, where $V = (v_1, v_2, \ldots, v_K)$, and define the selection matrix $Q \in \mathbb{R}^{K^2 \times K}$ through the relation $Q\epsilon = \mathrm{vec}[\mathrm{diag}[\epsilon]]$, where $\epsilon$ is any $K$-vector. Use these expressions to introduce
$$\psi_W(y^T) \equiv \Big( (I_I \otimes \Lambda^{-3/2} V') - \tfrac{1}{2} (V \otimes I_K) Q \Lambda^{-3/2} \bar{V}' \Big) \psi_{A_0}(y^T),$$
$$\psi_{W'}(y^T) \equiv \Big( (\Lambda^{-3/2} V' \otimes I_I) - \tfrac{1}{2} (I_K \otimes V) Q \Lambda^{-3/2} \bar{V}' \Big) \psi_{A_0}(y^T).$$
The interpretation of these functions is that they are the respective influence functions of $\hat{W}$ and $\hat{W}'$, as stated in Lemma A.1.

Lemma A.1. The estimators $\hat{W}$ and $\hat{W}'$ of the matrices $W$ and $W'$ satisfy
$$\sqrt{N} \mathrm{vec}[\hat{W} - W] \xrightarrow{L} N(0, V_W), \qquad \sqrt{N} \mathrm{vec}[\hat{W}' - W'] \xrightarrow{L} N(0, V_{W'}),$$
for $V_W \equiv E[\psi_W(y^T) \psi_W(y^T)']$ and $V_{W'} \equiv E[\psi_{W'}(y^T) \psi_{W'}(y^T)']$, respectively, as $N \to \infty$.
Proof of Lemma A.1. Recall that $A_0$ is real, symmetric, and has rank $K$. Also, its $K$ non-zero eigenvalues are all distinct. Further, $\hat{A}_0$ (which, by construction, too, is symmetric) satisfies $\sqrt{N} \mathrm{vec}[\hat{A}_0 - A_0] \xrightarrow{L} N(0, V_{A_0})$. From Theorem 4.2 in Eaton and Tyler (1991) and Theorem 1 in Magnus (1985), it then follows that $\sqrt{N}(\hat{\lambda} - \lambda) \xrightarrow{L} N(0, V_\lambda)$, where $[V_\lambda]_{k_1, k_2} \equiv (v_{k_1} \otimes v_{k_1})' V_{A_0} (v_{k_2} \otimes v_{k_2})$. Because $\mathrm{vec}[\Lambda] = Q\lambda$, the Jacobian associated with the transformation from $\lambda$ to $\mathrm{vec}[\Lambda^{-1/2}]$ is $-\frac{1}{2} Q \, \mathrm{diag}[\lambda_1^{-3/2}, \lambda_2^{-3/2}, \ldots, \lambda_K^{-3/2}] \equiv -\frac{1}{2} Q \Lambda^{-3/2}$. Hence, an application of the delta method gives
$$\sqrt{N} \mathrm{vec}[\hat{\Lambda}^{-1/2} - \Lambda^{-1/2}] \xrightarrow{L} N(0, V_{\Lambda^{-1/2}}), \qquad (A.2)$$
for $V_{\Lambda^{-1/2}} \equiv \frac{1}{4} Q \Lambda^{-3/2} \bar{V}' V_{A_0} \bar{V} \Lambda^{-3/2} Q'$. Moving on, from Corollary 1 in Bura and Pfeiffer (2008), we have that the estimated eigenvectors satisfy
$$\sqrt{N} \mathrm{vec}[\hat{V} - V] \xrightarrow{L} N(0, V_V), \qquad \sqrt{N} \mathrm{vec}[\hat{V}' - V'] \xrightarrow{L} N(0, V_{V'}), \qquad (A.3)$$
where $V_V \equiv (\Lambda^{-1} V' \otimes I_I) V_{A_0} (V \Lambda^{-1} \otimes I_I)$ and $V_{V'} \equiv (I_I \otimes \Lambda^{-1} V') V_{A_0} (I_I \otimes V \Lambda^{-1})$. Then combining (A.2) and (A.3) with the linearization
$$\hat{W} - W = (\hat{\Lambda}^{-1/2} - \Lambda^{-1/2}) V' + \Lambda^{-1/2} (\hat{V} - V)' + o_P(1/\sqrt{N})$$
provides the result. □
Now use Lemma A.1 to construct
$$\psi_{C_j}(y^T) \equiv (W A_j \otimes I_K) \psi_W(y^T) + (W \otimes W) \psi_{A_j}(y^T) + (I_K \otimes W A_j) \psi_{W'}(y^T),$$
for each $j = 1, 2, \ldots, J$. Then Lemma A.2 follows readily.

Lemma A.2. For each $j = 1, 2, \ldots, J$, the estimator $\hat{C}_j = \hat{W} \hat{A}_j \hat{W}'$ of the whitened matrix $C_j = W A_j W'$ satisfies
$$\sqrt{N} \mathrm{vec}[\hat{C}_j - C_j] \xrightarrow{L} N(0, V_{C_j}), \qquad V_{C_j} \equiv E[\psi_{C_j}(y^T) \psi_{C_j}(y^T)'],$$
as $N \to \infty$.
Proof of Lemma A.2. Fix $j \in \{1, 2, \ldots, J\}$. First note that, from (3.1), the $(i_1, i_2)$th element of $\hat{A}_j$ takes the form
$$\hat{a}_{i_1 i_2 j} = \frac{1}{N} \frac{1}{|\rho_3|} \sum_{n=1}^{N} \sum_{(t_1, t_2, t_3) \in \rho_3} \chi_{i_1}(y_{nt_1}) \chi_{i_2}(y_{nt_2}) \chi_j(y_{nt_3}),$$
and that $\sqrt{N} \mathrm{vec}[\hat{A}_j - A_j] \xrightarrow{L} N(0, V_{A_j})$. Because $\hat{W}$, too, is $\sqrt{N}$-consistent, we may linearize $\mathrm{vec}[\hat{C}_j - C_j]$ as
$$(W A_j \otimes I_K) \mathrm{vec}[\hat{W} - W] + (W \otimes W) \mathrm{vec}[\hat{A}_j - A_j] + (I_K \otimes W A_j) \mathrm{vec}[\hat{W}' - W'] + o_P(1/\sqrt{N}).$$
Combined with Lemma A.1 this yields the result by an application of the delta method. □
Moving on to the JADE estimator, let
$$\psi_U(y_n^T) \equiv (I_K \otimes U \Delta_I) \sum_{i=1}^{I} \bar{\nabla}_i (U' \otimes U') \psi_{C_i}(y_n^T), \qquad \psi_{U'}(y_n^T) \equiv (U \Delta_I \otimes I_K) \sum_{i=1}^{I} \bar{\nabla}_i (U' \otimes U') \psi_{C_i}(y_n^T),$$
where $\Delta_I$ is the $K \times K$ matrix whose $(k_1, k_2)$th element is
$$[\Delta_I]_{k_1, k_2} \equiv \begin{cases} \big[ \sum_{i=1}^{I} (b_{ik_1} - b_{ik_2})^2 \big]^{-1} & \text{if } k_1 \neq k_2, \\ 0 & \text{if } k_1 = k_2, \end{cases}$$
and $\bar{\nabla}_i \equiv \mathrm{diag}[\mathrm{vec}[\nabla_i]]$ with $\nabla_i \equiv D_i \iota_K \iota_K' - \iota_K \iota_K' D_i$ for $i = 1, 2, \ldots, I$ and $\iota_K$ the $K$-vector of ones. Note that $\Delta_I$ is well defined for any $I \geq K$ by Assumption 2 and the fact that $\lim_{I \to \infty} \sum_{i=1}^{I} (b_{ik_1} - b_{ik_2})^2 = \|f_{k_1} - f_{k_2}\|_2^2 < \infty$. The asymptotic distribution of JADE is stated in Lemma A.3.

Lemma A.3. The JADE estimator defined in (3.2) satisfies
$$\sqrt{N} \mathrm{vec}[\hat{U} - U] \xrightarrow{L} N(0, V_U), \qquad \sqrt{N} \mathrm{vec}[\hat{U}' - U'] \xrightarrow{L} N(0, V_{U'}),$$
for $V_U \equiv E[\psi_U(y^T) \psi_U(y^T)']$ and $V_{U'} \equiv E[\psi_{U'}(y^T) \psi_{U'}(y^T)']$, respectively, as $N \to \infty$.

Proof of Lemma A.3. As the $\hat{C}_i$ are $\sqrt{N}$-consistent and asymptotically normal, the result follows from Theorem 5 in Bonhomme and Robin (2009). □
On combining Lemma A.2 and Lemma A.3, and introducing
$$\psi_{D_j}(y^T) \equiv (I_K \otimes U' C_j) \psi_U(y^T) + (U' \otimes U') \psi_{C_j}(y^T) + (U' C_j \otimes I_K) \psi_{U'}(y^T),$$
we obtain Theorem A.1.

Theorem A.1. For each $j = 1, 2, \ldots, J$, the estimator $\hat{D}_j = \hat{U}' \hat{C}_j \hat{U}$ of the matrix $D_j = U' C_j U$ satisfies
$$\sqrt{N} \mathrm{vec}[\hat{D}_j - D_j] \xrightarrow{L} N(0, V_{D_j}), \qquad V_{D_j} \equiv E[\psi_{D_j}(y^T) \psi_{D_j}(y^T)'],$$
as $N \to \infty$.

Proof of Theorem A.1. The result follows on combining Lemma A.2 and Lemma A.3 with a linearization of $\mathrm{vec}[\hat{D}_j - D_j]$. □
Appendix B: Proofs of theorems in the main text
Proof of Theorem 3. Throughout the proof we fix $k$. Let $b \equiv (b_{1k}, b_{2k}, \ldots, b_{Jk})'$ and $\hat{b} \equiv (\hat{b}_{1k}, \hat{b}_{2k}, \ldots, \hat{b}_{Jk})'$. We first show that $\|\hat{b} - b\| = O_P(\sqrt{J/N})$. Theorem 3 will then follow readily. Note that
$$\|\hat{b} - b\|^2 \leq \sum_{j=1}^{J} \|\hat{D}_j - D_j\|^2 = \sum_{j=1}^{J} \mathrm{vec}[\hat{D}_j - D_j]' \mathrm{vec}[\hat{D}_j - D_j]. \qquad (B.1)$$
Write $\hat{D}_j - D_j = \hat{S} \hat{A}_j \hat{S}' - S A_j S'$, where $S \equiv U' W$ and $\hat{S} \equiv \hat{U}' \hat{W}$. Observe that Lemmata A.1 and A.3 imply that $\|\hat{S} - S\| = O_P(1/\sqrt{N})$. Also, for any $j$,
$$E\|\hat{A}_j - A_j\|^2 = \sum_{i_1=1}^{I} \sum_{i_2=1}^{I} E\big[(\hat{a}_{i_1 i_2 j} - a_{i_1 i_2 j})^2\big] = \sum_{i_1=1}^{I} \sum_{i_2=1}^{I} \frac{E[\chi_{i_1}^2(y_1) \chi_{i_2}^2(y_2) \chi_j^2(y_3)] - a_{i_1 i_2 j}^2}{N |\rho_3|} \leq \sum_{i_1=1}^{I} \sum_{i_2=1}^{I} \frac{\zeta^6 - a_{i_1 i_2 j}^2}{N |\rho_3|},$$
where, recall, $\zeta$ is a finite constant so that $\|\chi_j\|_\infty \leq \zeta$ for all $j$. Therefore,
$$\sum_{j=1}^{J} \|\hat{A}_j - A_j\|^2 = O_P(J/N) \qquad (B.2)$$
by Markov's inequality and Parseval's identity, i.e., the square-summability of the Fourier coefficients. Now, as
$$\begin{aligned} \mathrm{vec}[\hat{D}_j - D_j] &= (S \otimes S)\, \mathrm{vec}[\hat{A}_j - A_j] \\ &\quad + ((\hat{S} - S) \otimes (\hat{S} - S))\, \mathrm{vec}[A_j] + ((\hat{S} - S) \otimes (\hat{S} - S))\, \mathrm{vec}[\hat{A}_j - A_j] \\ &\quad + ((\hat{S} - S) \otimes S)\, \mathrm{vec}[\hat{A}_j - A_j] + (S \otimes (\hat{S} - S))\, \mathrm{vec}[\hat{A}_j - A_j] \\ &\quad + ((\hat{S} - S) \otimes S)\, \mathrm{vec}[A_j] + (S \otimes (\hat{S} - S))\, \mathrm{vec}[A_j], \end{aligned} \qquad (B.3)$$
it suffices to quantify the contribution of each of these terms to (B.1). The first term will be the dominant one. On using the Cauchy-Schwarz inequality and (B.2),
$$\sum_{j=1}^{J} \mathrm{vec}[S(\hat{A}_j - A_j)S']' \, \mathrm{vec}[S(\hat{A}_j - A_j)S'] \leq \|(S \otimes S)'(S \otimes S)\| \sum_{j=1}^{J} \|\hat{A}_j - A_j\|^2 = O_P(J/N),$$
as $\|(S \otimes S)'(S \otimes S)\| = O(1)$. The remaining terms can be shown to be of a smaller order using similar arguments. For example,
$$\sum_{j=1}^{J} \mathrm{vec}[S(\hat{A}_j - A_j)S']' \, \mathrm{vec}[(\hat{S} - S) A_j (\hat{S} - S)'] \leq \|(S \otimes S)[(\hat{S} - S) \otimes (\hat{S} - S)]\| \sum_{j=1}^{J} \|\hat{A}_j - A_j\| \|A_j\|$$
is asymptotically negligible because $\|(S \otimes S)[(\hat{S} - S) \otimes (\hat{S} - S)]\| = O_P(1/N)$ and
$$\sum_{j=1}^{J} \|\hat{A}_j - A_j\| \|A_j\| \leq \Big[ \sum_{j=1}^{J} \|\hat{A}_j - A_j\|^2 \Big]^{1/2} \Big[ \sum_{j=1}^{J} \|A_j\|^2 \Big]^{1/2} = O_P(\sqrt{J/N}),$$
which follows as before. Thus, we conclude that $\|\hat{b} - b\| = O_P(\sqrt{J/N})$. Now turn to the MISE result. Using orthonormality,
$$E\|\hat{f}_k - f_k\|_2^2 \leq E\|\hat{f}_k - \mathrm{Proj}_J f_k\|_2^2 + \|\mathrm{Proj}_J f_k - f_k\|_2^2 = E\|\hat{b} - b\|^2 + \|\mathrm{Proj}_J f_k - f_k\|_2^2.$$
From above, $E\|\hat{b} - b\|^2 = O_P(J/N)$, while $\|\mathrm{Proj}_J f_k - f_k\|_2^2 \leq \|\mathrm{Proj}_J f_k - f_k\|_\infty^2 = O(J^{-2\beta})$ by Assumption 3. Whence,
$$E\|\hat{f}_k - f_k\|_2^2 = O_P(J/N + J^{-2\beta}),$$
as claimed. To obtain the uniform-convergence rate, we may proceed similarly. By the triangle inequality,
$$\|\hat{f}_k - f_k\|_\infty \leq \|\hat{f}_k - \mathrm{Proj}_J f_k\|_\infty + \|\mathrm{Proj}_J f_k - f_k\|_\infty.$$
Let $\chi^J \equiv (\chi_1, \chi_2, \ldots, \chi_J)'$ and note that $\sup_{y \in [-1,1]} \|\chi^J\| = O(\sqrt{J})$ by the uniform bound on the basis functions in Assumption 3. Thus,
$$\|\hat{f}_k - \mathrm{Proj}_J f_k\|_\infty \leq \sup_{y \in [-1,1]} \|\chi^J\| \, \|\hat{b} - b\| = O(\sqrt{J}) \, O_P(\sqrt{J/N})$$
by the Cauchy-Schwarz inequality. Because $\|\mathrm{Proj}_J f_k - f_k\|_\infty = O(J^{-\beta})$ by Assumption 3, this yields the uniform-convergence rate and concludes the proof of Theorem 3. □
Proof of Theorem 4. Throughout the proof we fix $k$ and consider the asymptotic behavior of $\hat{f}_k$ at a given value $y$. Write
$$\hat{f}_k(y) - f_k(y) = [\hat{f}_k(y) - \mathrm{Proj}_J f_k(y)] + [\mathrm{Proj}_J f_k(y) - f_k(y)].$$
By Assumption 3, $|\mathrm{Proj}_J f_k(y) - f_k(y)| \leq \|\mathrm{Proj}_J f_k - f_k\|_\infty = O(J^{-\beta})$. Further, because $\sqrt{N} J^{-\beta} \to 0$ as $N \to \infty$, this term is asymptotically negligible. Define the $K^2 \times 1$ selection vector $q_k$ through the relation $(\hat{b}_{jk} - b_{jk}) = q_k' \mathrm{vec}[\hat{D}_j - D_j]$. Using (B.3), a small calculation yields
$$\sqrt{\frac{N}{\alpha_J(y)}} \big[ \hat{f}_k(y) - \mathrm{Proj}_J f_k(y) \big] = \frac{1}{\sqrt{N}} \sum_{n=1}^{N} \big( \xi_{n,1} + \xi_{n,2} + \xi_{n,3} + \xi_{n,4} \big) + o_P(1),$$
for $\xi_{n,1} \equiv q_k'(S \otimes S)\vartheta_n = \psi_{f_k}(y_n^T)/\sqrt{\alpha_J(y)}$ and
$$\xi_{n,2} \equiv q_k'(S \otimes (\hat{S} - S))\vartheta_n, \qquad \xi_{n,3} \equiv q_k'((\hat{S} - S) \otimes S)\vartheta_n, \qquad \xi_{n,4} \equiv q_k'((\hat{S} - S) \otimes (\hat{S} - S))\vartheta_n,$$
where $S$ and $\hat{S}$ are as in the previous proof, $\vartheta_n \equiv \sum_{j=1}^{J} \psi_{A_j}(y_n^T) \chi_j(y)/\sqrt{\alpha_J(y)}$, and $\psi_{A_j}$ is the influence function of $\mathrm{vec}[\hat{A}_j]$ as defined in (A.1) above. To see why the remaining terms in (B.3) can be ignored, note that they are all of the form
$$O_P\Big( \frac{1}{\sqrt{N}} \sum_{j=1}^{J} \sum_{i_1=1}^{I} \sum_{i_2=1}^{I} a_{i_1 i_2 j} \chi_j(y) \Big)$$
because $\|\hat{S} - S\| = O_P(1/\sqrt{N})$ by Lemmata A.2 and A.3. Further, from the restrictions imposed by the mixture structure in (2.1),
$$\sum_{j=1}^{J} \sum_{i_1=1}^{I} \sum_{i_2=1}^{I} a_{i_1 i_2 j} \chi_j(y) = \sum_{k'=1}^{K} \Big[ \sum_{i_1=1}^{I} \sum_{i_2=1}^{I} b_{i_1 k'} b_{i_2 k'} \Big] \Big[ \sum_{j=1}^{J} b_{jk'} \chi_j(y) \Big] \omega_{k'} = O(1)$$
as the first term in square brackets is bounded and the second term is just $\mathrm{Proj}_J f_{k'}(y)$, which converges to $f_{k'}(y)$ as $J \to \infty$. Below we will further show that
$$\sqrt{\frac{N}{\alpha_J(y)}} \big[ \hat{f}_k(y) - \mathrm{Proj}_J f_k(y) \big] = \frac{1}{\sqrt{N}} \sum_{n=1}^{N} \xi_{n,1} + o_P(1), \qquad (B.4)$$
so that it suffices to show that this sample average is asymptotically normal. Doing so requires verifying that the conditions of Lyapunov's central limit theorem are satisfied. Moreover, it needs to be shown that (i) $E[\xi_{n,1}] = 0$, (ii) $\mathrm{var}[\xi_{n,1}] = O(1)$, and (iii) there exists a $\delta > 0$ so that $E|\xi_{n,1}/\sqrt{N}|^{2+\delta} = o(1)$. Condition (i) readily follows from inspection of (3.3). To verify Condition (ii), use the law of iterated expectations to see that it suffices to show that $E[\bar{\tau}_k(y_1, y_2)^2 \mid x = k'] = O(1)$ and that $E[\kappa_J(y_3; y)^2 \mid x = k'] = O(\alpha_J(y))$ for all $k' = 1, 2, \ldots, K$, where the expectations are taken over the triple $(y_1, y_2, y_3)$. From Theorem 2.2.2 in Viollaz (1989),
$$E[\kappa_J(y_3; y)^2 \mid x = k'] / \alpha_J(y) \to f_{k'}(y),$$
which is $O(1)$. Also,
$$E[\bar{\tau}_k(y_1, y_2)^2 \mid x = k'] = \Big[ \sum_{i_1=1}^{I} \sum_{i_2=1}^{I} u_k' w_{i_1} E[\chi_{i_1}(y_1) \chi_{i_2}(y_1) \mid x = k'] w_{i_2}' u_k \Big]^2 = O(1),$$
which follows from a direct calculation. Finally, regarding Condition (iii), first note that
$$\begin{aligned} \alpha_J(y)^{\frac{2+\delta}{2}} E|\xi_{n,1}|^{2+\delta} &= E\Big| \sum_{j=1}^{J} [q_k'(S \otimes S)] \psi_{A_j}(y^T) \chi_j(y) \Big|^{2+\delta} \\ &\leq \Big[ \sum_{j=1}^{J} \|q_k'(S \otimes S) \chi_j(y)\|^2 \Big]^{\frac{2+\delta}{2}} E\Big[ \sum_{j=1}^{J} \|\psi_{A_j}(y^T)\|^2 \Big]^{\frac{2+\delta}{2}} \\ &\leq \|q_k'(S \otimes S)\|^{2+\delta} \Big[ \sum_{j=1}^{J} |\chi_j(y)|^2 \Big]^{\frac{2+\delta}{2}} E\Big[ \sum_{j=1}^{J} \|\psi_{A_j}(y^T)\|^2 \Big]^{\frac{2+\delta}{2}} \\ &\leq O(1)\, \alpha_J(y)^{\frac{2+\delta}{2}} \Big[ \sum_{j=1}^{J} \big( E\|\psi_{A_j}(y^T)\|^{2+\delta} \big)^{\frac{2}{2+\delta}} \Big]^{\frac{2+\delta}{2}}, \end{aligned}$$
where the transitions are justified by the Cauchy-Schwarz inequality and the Minkowski inequality. Further, because $\|\chi^I \chi^{I\prime}\|$ is uniformly bounded by $I\zeta^2$ by Assumption 3 and as $\|A_j\| = O(1)$ uniformly over $j$, we also have that
$$E\|\psi_{A_j}(y^T)\|^{2+\delta} \leq \Big[ \sup_{y \in [-1,1]^T} \|\psi_{A_j}\| \Big]^{2+\delta} \leq \big[ I\zeta^2 \|\chi_j\|_\infty + \|A_j\| \big]^{2+\delta} \leq \big[ I\zeta^3 + \|f\|_2 \big]^{2+\delta} = O(1).$$
Therefore, $E|\xi_{n,1}/\sqrt{N}|^{2+\delta} = O(J/N)^{\frac{2+\delta}{2}} \to 0$ as $N \to \infty$. To prove the theorem it only remains to validate (B.4). Observe that $\sqrt{\alpha_J(y)}\,\vartheta_n$ equals
$$\frac{1}{|\rho_3|} \sum_{(t_1, t_2, t_3) \in \rho_3} \sum_{i_1=1}^{I} \sum_{i_2=1}^{I} \Big[ \chi_{i_1}(y_{nt_1}) \chi_{i_2}(y_{nt_2}) \kappa_J(y_{nt_3}; y) - \sum_{k'=1}^{K} b_{i_1 k'} b_{i_2 k'} \, \mathrm{Proj}_J f_{k'}(y) \, \omega_{k'} \Big].$$
By the same arguments as above, it is straightforward to verify that $N^{-1/2} \sum_{n=1}^{N} \vartheta_n$ satisfies the conditions of Lyapunov's central limit theorem and is, therefore, $O_P(1)$. Hence,
$$\frac{1}{\sqrt{N}} \sum_{n=1}^{N} \xi_{n,2} = o_P(1), \qquad \frac{1}{\sqrt{N}} \sum_{n=1}^{N} \xi_{n,3} = o_P(1), \qquad \frac{1}{\sqrt{N}} \sum_{n=1}^{N} \xi_{n,4} = o_P(1),$$
because $\|\hat{S} - S\| = O_P(1/\sqrt{N})$. This completes the proof. □
Proof of Theorem 5. Theorem A.1 implies that $\|\hat{B} - B\| = O_P(1/\sqrt{N})$. Assumption 2 ensures that $B'B$ has full rank. Hence, $(\hat{B}'\hat{B})^{-1} \xrightarrow{P} (B'B)^{-1}$. Further, a linearization yields $\hat{B}'\hat{a} - B'a = (\hat{B}' - B')a + B'(\hat{a} - a) + o_P(1/\sqrt{N})$. In tandem with the asymptotic results on $\hat{B}$ and $\hat{a}$, the delta method then provides the result. □
Proof of Theorem 6. Fix $k$ throughout the proof. Given our assumptions, consistency of $\hat{\theta}$ follows from uniform convergence of the empirical moment to the population moment. Let $m(\theta) \equiv N^{-1} \sum_n m_n(y_n^T; \theta)$. Using Theorems 3 and 5,
$$\sup_{\theta \in \Theta} \|\hat{m}(\theta) - E[m(\theta)]\| = O_P(J/\sqrt{N} + J^{-\beta}), \qquad (B.5)$$
which converges to zero as $N \to \infty$. For asymptotic normality, the usual argument based on a Taylor expansion around $\theta_0$ yields
$$\sqrt{N}(\hat{\theta} - \theta_0) = (M_\theta' \Sigma M_\theta)^{-1} M_\theta' \Sigma \, \sqrt{N} \, \hat{m}(\theta_0) + o_P(1),$$
because $\hat{\Sigma} \xrightarrow{P} \Sigma$ and $\sup_{\theta \in \Theta} \|\nabla_{\theta'} \hat{m}(\theta) - M_\theta(\theta)\| = o_P(1)$, which follows as (B.5). It then remains to show that
$$\sqrt{N} \, \hat{m}(\theta_0) = \frac{1}{\sqrt{N}} \sum_{n=1}^{N} \big( m_n(y_n^T; \theta_0) + \psi_\tau(y_n^T) \big) + o_P(1). \qquad (B.6)$$
To do so, first note that a second-order Taylor expansion, together with Theorems 3 and 5, yields
$$\sqrt{N} \big( \hat{m}(\theta_0) - m(\theta_0) \big) = \frac{1}{\sqrt{N}} \sum_{n=1}^{N} \sum_{k'=1}^{K} \big[ \xi_{k'}(y_n^T) + \nabla_{\omega_{k'}} m_n(y_n^T; \theta_0)(\hat{\omega}_{k'} - \omega_{k'}) \big] + o_P(1),$$
for
$$\xi_{k'}(y_n^T) \equiv T^{-1} \sum_{t=1}^{T} g(y_{nt}; \theta_0) \nabla_{f_{k'}} \tau_k(y_{nt}) \big( \hat{f}_{k'}(y_{nt}) - f_{k'}(y_{nt}) \big). \qquad (B.7)$$
Here, the order of the remainder term follows from the fact that $J$ is restricted to grow so that $\sqrt{N}(J^2/N + J^{-2\beta}) \to 0$ as $N \to \infty$. Using (3.3) and Theorem 4 for each $k'$, the sample average of (B.7) can be expressed as a symmetric $U$-statistic of order two. The projection of this $U$-statistic is $N^{-1} \sum_n \vartheta_{k'}(y_n^T)$ for
$$\vartheta_{k'}(y_n^T) \equiv \frac{1}{|\rho_3|} \sum_{(t_1, t_2, t_3) \in \rho_3} E\Big[ g(y; \theta_0) \nabla_{f_{k'}} \tau_k(y) \big( \bar{\tau}_{k'}(y_{nt_1}, y_{nt_2}) \kappa_J(y_{nt_3}; y) - \mathrm{Proj}_J f_{k'}(y) \big) \Big],$$
where the expectation is taken with respect to $y$. Computing this expectation for each $k'$ and invoking Lemma A.3 in Ahn and Powell (1993) shows that $N^{-1/2} \sum_{n=1}^{N} \sum_{k'=1}^{K} \xi_{k'}(y_n^T)$ equals
$$\frac{\sqrt{N}}{N |\rho_3|} \sum_{n=1}^{N} \sum_{(t_1, t_2, t_3) \in \rho_3} g(y_{nt_3}; \theta_0) \Big[ \bar{\tau}_k(y_{nt_1}, y_{nt_2}) - \tau_k(y_{nt_3}) \sum_{k'=1}^{K} \bar{\tau}_{k'}(y_{nt_1}, y_{nt_2}) \omega_{k'} \Big] + o_P(1),$$
using that $\sqrt{N} \|\mathrm{Proj}_J g_0 \tau_{k'} - g_0 \tau_{k'}\|_\infty = \sqrt{N} O(J^{-\eta}) = o(1)$. Also, by essentially the same argument,
$$\frac{1}{\sqrt{N}} \sum_{n=1}^{N} \sum_{k'=1}^{K} \nabla_{\omega_{k'}} m_n(y_n^T; \theta_0)(\hat{\omega}_{k'} - \omega_{k'}) = \frac{1}{\sqrt{N}} \sum_{n=1}^{N} M_\omega' \psi_\omega(y_n^T) + o_P(1).$$
Hence,
$$\sqrt{N} \big( \hat{m}(\theta_0) - m(\theta_0) \big) = \frac{1}{\sqrt{N}} \sum_{n=1}^{N} \psi_\tau(y_n^T) + o_P(1),$$
from which (B.6) follows. This concludes the proof. □
References

Ahn, H. and J. L. Powell (1993). Semiparametric estimation of censored selection models with a nonparametric selection mechanism. Journal of Econometrics 58, 3–29.

Allman, E. S., C. Matias, and J. A. Rhodes (2009). Identifiability of parameters in latent structure models with many observed variables. Annals of Statistics 37, 3099–3132.

Bonhomme, S. and J.-M. Robin (2009). Consistent noisy independent component analysis. Journal of Econometrics 149, 12–25.

Browning, M., M. Ejrnæs, and J. Alvarez (2010). Modeling income processes with lots of heterogeneity. Review of Economic Studies 77, 1353–1381.

Bura, E. and R. Pfeiffer (2008). On the distribution of left singular vectors of a random matrix and its applications. Statistics & Probability Letters 78, 2275–2280.

Cardoso, J.-F. and A. Souloumiac (1993). Blind beamforming for non-Gaussian signals. IEE Proceedings F 140, 362–370.

Chen, X. (2007). Large sample sieve estimation of semi-nonparametric models. In J. J. Heckman and E. E. Leamer (Eds.), Handbook of Econometrics, Volume VI, Part B. Elsevier.

Clenshaw, C. W. and A. R. Curtis (1960). A method for numerical integration on an automatic computer. Numerische Mathematik 2, 197–205.

Eaton, M. L. and D. E. Tyler (1991). On Wielandt's inequality and its applications. Annals of Statistics 19, 260–271.

Gajek, L. (1986). On improving density estimators which are not bona fide functions. Annals of Statistics 14, 1612–1618.

Hall, P., A. Neeman, R. Pakyari, and R. Elmore (2005). Nonparametric inference in multivariate mixtures. Biometrika 92, 667–678.

Hall, P. and X.-H. Zhou (2003). Nonparametric estimation of component distributions in a multivariate mixture. Annals of Statistics 31, 201–224.

Hall, R. and F. Mishkin (1982). The sensitivity of consumption to transitory income: Estimates from panel data on households. Econometrica 50, 461–481.

Heckman, J. J. and B. Singer (1984). A method for minimizing the impact of distributional assumptions in econometric models for duration data. Econometrica 52, 271–320.

Henry, M., Y. Kitamura, and B. Salanié (2010). Identifying finite mixtures in econometric models. Discussion Paper 0910-20, Columbia University, Department of Economics.

Kasahara, H. and K. Shimotsu (2012). Nonparametric identification and estimation of the number of components in multivariate mixtures. Unpublished manuscript.

Kleibergen, F. and R. Paap (2006). Generalized reduced rank tests using the singular value decomposition. Journal of Econometrics 133, 97–126.

Magnus, J. R. (1985). On differentiating eigenvalues and eigenvectors. Econometric Theory 1, 179–191.

McDonald, J. B. (1984). Some generalized functions for the size distribution of income. Econometrica 52, 647–663.

McLachlan, G. J. and D. Peel (2000). Finite Mixture Models. Wiley-Blackwell.

Robin, J.-M. and R. J. Smith (2000). Tests of rank. Econometric Theory 16, 151–175.

Schwartz, S. C. (1967). Estimation of probability density by an orthogonal series. Annals of Mathematical Statistics 38, 1261–1265.

Stone, C. J. (1982). Optimal global rates of convergence for nonparametric regression. Annals of Statistics 10, 1040–1053.

Titterington, D. M. (1983). Minimum distance non-parametric estimation of mixture proportions. Journal of the Royal Statistical Society, Series B 45, 37–46.

Viollaz, A. J. (1989). Nonparametric estimation of probability density functions based on orthogonal expansions. Revista Matemática de la Universidad Complutense de Madrid 2, 41–82.