Asymptotic standard errors in independent factor analysis

Angela Montanari (1) and Cinzia Viroli (2)

(1) University of Bologna, angela.montanari@unibo.it
(2) University of Bologna, cinzia.viroli@unibo.it
Summary. Independent Factor Analysis is a recent approach to latent variable modeling in which the factors are assumed to be mutually independent and not necessarily gaussian distributed. The factors are modeled by gaussian mixtures, which are flexible enough to approximate a wide range of probability density functions. The model can be estimated quite easily by the EM algorithm when the number of factors is not too high. The EM algorithm, however, does not directly provide the standard errors of the maximum likelihood estimates. In this paper we focus on the derivation of the standard errors of the Independent Factor Analysis parameter estimates, making use of the information matrix of the complete model density, in accordance with the so-called "missing information principle".
Key words: Independent Factor Analysis, EM algorithm, Information matrix
1 Introduction
The normality assumption characterizes the most common latent variable models. In some circumstances, however, the assumption of gaussian factors may not be realistic. The covariance matrix is then no longer a sufficient statistic, which means that higher order cross-product moments give additional information about the underlying structure of the observed data. A recent approach that allows one to consider all the cross-product moments of the data is the so-called Independent Factor Analysis (IFA) model [Att99], which can be reinterpreted as a particular latent variable model with non gaussian and independent factors.
In Independent Factor Analysis a set of p observed variables x is modeled in terms of a smaller set of k unobserved independent latent variables y and an additive specific term u. In compact form the IFA model is x = Λy + u, where the random vector u is assumed to be normally distributed, u ∼ N(0, Ψ) with diagonal Ψ, and the factor loading matrix $\Lambda = \{\lambda_{ji}\}$ is also termed the mixing matrix. The latent variables y are assumed to be mutually independent, and the density of the ith factor is modeled by a mixture of $m_i$ gaussians with means $\mu_{i,q_i}$, variances $\nu_{i,q_i}$ and mixing proportions $w_{i,q_i}$ ($q_i = 1, \ldots, m_i$), which are constrained to be non-negative and to sum to unity:
$$f(y_i|\vartheta_i) = \sum_{q_i=1}^{m_i} w_{i,q_i}\, N(\mu_{i,q_i}, \nu_{i,q_i}), \quad \text{with } \vartheta_i = \{w_{i,q_i}, \mu_{i,q_i}, \nu_{i,q_i}\} \qquad (1)$$
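To make the generative model concrete, the following short Python sketch simulates data from x = Λy + u with each factor drawn from its own gaussian mixture as in (1). It is not the authors' code, and the dimensions and parameter values are purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_ifa(n, Lambda, psi_diag, w, mu, nu):
        # w, mu, nu are lists with one entry per factor; each entry has length m_i
        p, k = Lambda.shape
        Y = np.empty((n, k))
        for i in range(k):
            states = rng.choice(len(w[i]), size=n, p=w[i])       # allocation variable z_i
            Y[:, i] = rng.normal(mu[i][states], np.sqrt(nu[i][states]))
        U = rng.normal(0.0, np.sqrt(psi_diag), size=(n, p))      # specific term u ~ N(0, Psi)
        return Y @ Lambda.T + U, Y

    # illustrative setting: p = 3 observed variables, k = 2 factors, 2 components each
    Lambda = np.array([[1.0, 0.5], [0.3, 1.0], [-0.7, 0.2]])
    psi_diag = np.array([0.10, 0.20, 0.15])
    w  = [np.array([0.4, 0.6]), np.array([0.5, 0.5])]
    mu = [np.array([-2.0, 1.0]), np.array([0.0, 3.0])]
    nu = [np.array([0.5, 1.0]), np.array([1.0, 0.8])]
    X, Y = sample_ifa(1000, Lambda, psi_diag, w, mu, nu)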
The estimation problem for the whole set of IFA parameters may be successfully solved by the EM algorithm when the number of factors is not too high. One critical aspect connected with the use of the EM algorithm is the derivation of the standard errors of the parameter estimates. Notable contributions to this problem include Louis [Lou82], Meng and Rubin [MR91], Oakes [Oak99] and Jamshidian and Jennrich [JJ00]. Meng and Rubin definitively solved the problem by deriving the asymptotic variance-covariance matrix as a function of the complete-data variance-covariance matrix, which is algebraically tractable even in the most complicated estimation problems, such as the IFA one. Inherent to the proposed procedure, called the Supplemented EM algorithm (SEM), is an appealing property: the observed information can be rephrased as the difference between the complete information and the missing information. This relation is better known as the "missing information principle" defined by Orchard and Woodbury [OW72].
It must be pointed out that another way to approximate variance-covariance matrices is by resorting to techniques such as the bootstrap or the jackknife, which resample data sets from the empirical distribution function and apply the EM algorithm to each such data set. A further alternative approximation could be obtained by computing the empirical information matrix, which is a consistent estimator of the expected information matrix (see Redner and Walker [Red84]). However, these techniques are computationally very demanding for complicated models like the IFA one.
In this paper we derive the standard errors of the IFA model parameter estimates through the evaluation of the complete-data variance-covariance matrix, consistently with the "missing information principle".
2 The IFA model estimation
For convenience, denote the IFA parameters collectively by θ. In order to derive the maximum likelihood estimates of the IFA parameters, the likelihood function f(x|θ) has to be maximized. Unfortunately this quantity is hard to deal with, since it is not expressed in closed form. In such situations, it is convenient to introduce additional unobserved variables such that, if they were known, the optimal value of θ could be easily computed. In the IFA model the introduction of two hidden layers is required: the first one is directly related to the factors, while the second one is involved in the gaussian mixture modeling of the factors. As a consequence, the likelihood of the observed data (the "incomplete" data) can be rephrased in terms of the "pseudo-complete" data density:
$$f(x|\theta) = \sum_z \int f(z, y, x|\theta)\, dy = \sum_z \int f(x|y, \theta)\, f(y|z, \theta)\, f(z|\theta)\, dy, \qquad (2)$$
where y is the vector of the latent variables, which are modeled as gaussian mixtures, and z represents a multivariate allocation variable that, for each observation, denotes the identity of the group from which it is drawn. The vector $z = \{z_1, \ldots, z_k\}$ follows a multivariate multinomial distribution:
$$f(z|\theta) = w_z \qquad (3)$$

where $w_z = \prod_{i=1}^{k} \prod_{q_i=1}^{m_i} w_{i,q_i}^{z_{i,q_i}}$.
Conditionally on the states z, the joint distribution of the factor vector y is gaussian:

$$f(y|z, \theta) = N(\mu_z, V_z) \qquad (4)$$
with mean vector and covariance matrix defined as

$$\mu_z = \left[\prod_{q_1=1}^{m_1} \mu_{1,q_1}^{z_{1,q_1}}, \ldots, \prod_{q_k=1}^{m_k} \mu_{k,q_k}^{z_{k,q_k}}\right], \qquad V_z = \mathrm{diag}\left[\prod_{q_1=1}^{m_1} \nu_{1,q_1}^{z_{1,q_1}}, \ldots, \prod_{q_k=1}^{m_k} \nu_{k,q_k}^{z_{k,q_k}}\right].$$
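As a small illustration of (4) (a sketch, not taken from the paper): given an allocation z that selects one component index $q_i$ per factor, the conditional mean and covariance simply pick out the corresponding component parameters.

    import numpy as np

    def conditional_moments(q, mu, nu):
        # q: sequence of selected component indices, one per factor
        # mu, nu: lists over factors of component means and variances
        mu_z = np.array([mu[i][qi] for i, qi in enumerate(q)])
        V_z = np.diag([nu[i][qi] for i, qi in enumerate(q)])
        return mu_z, V_z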
Furthermore f(x|y, z, θ) = f(x|y, θ), since the observed variables x depend on the factor vector y but not on the factor state vector z. Conditionally on the factors, the density of the observed variables is given by:

$$f(x|y, \theta) = N(\Lambda y, \Psi). \qquad (5)$$
The maximization of the log-likelihood ln f(x|θ) can now be achieved by the EM algorithm. In this framework, as a consequence of Jensen's inequality:

$$\ln f(x|\theta) = \ln \sum_z \int f(z, y, x|\theta)\, dy \;\geq\; \sum_z \int f(z, y|x, \theta') \ln \frac{f(z, y, x|\theta)}{f(z, y|x, \theta')}\, dy = \mathcal{F}(\theta', \theta). \qquad (6)$$
Therefore the EM algorithm consists in maximizing $\mathcal{F}(\theta', \theta)$ with respect to θ. Due to the hierarchical representation of the IFA model expressed in (2), $\mathcal{F}(\theta', \theta)$ can be decomposed as the sum of four terms:

$$\mathcal{F}(\theta', \theta) = F_O + F_B + F_T + F_H. \qquad (7)$$
The last term FH is the entropy of the posterior density of the latent variables given
the data and does not depend on the model parameters θ but only on the parameters
θ′ estimated at the previous iteration.
Thus, the optimization procedure involves only the first three terms of expression (7), which may be respectively expressed as:

$$\sum_z \int f(z, y|x, \theta') \ln f(z, y, x|\theta)\, dy = \sum_z \int f(z, y|x, \theta') \ln f(x|y, \theta)\, dy + \sum_z \int f(z, y|x, \theta') \ln f(y|z, \theta)\, dy + \sum_z \int f(z, y|x, \theta') \ln f(z|\theta)\, dy.$$
From (3), (4) and (5) it follows that $F_O$ depends only on the parameters Λ and Ψ and can be regarded as the observable layer, $F_B$ depends on the parameters $\{\mu_{i,q_i}, \nu_{i,q_i}\}$ (the bottom hidden layer), and $F_T$ depends only on the coefficients $\{w_{i,q_i}\}$ (the top hidden layer). Note that all of them also depend on the parameter estimates from the previous iteration. By making use of this decomposition, the two steps of the EM algorithm can be derived analytically. After laborious but straightforward calculations, the following estimates of the new parameters θ in terms of the old ones θ′ are obtained:
$$\hat{\Lambda} = x E[y^T|x]\, E[yy^T|x]^{-1}, \qquad \hat{\Psi} = xx^T - x E[y^T|x]\, \Lambda^T,$$

$$\hat{\mu}_{i,q_i} = \frac{f(z_i|x)\, E[y_i|z_i, x]}{f(z_i|x)}, \qquad \hat{\nu}_{i,q_i} = \frac{f(z_i|x)\, E[y_i^2|z_i, x]}{f(z_i|x)} - \mu_{i,q_i}^2, \qquad \hat{w}_{i,q_i} = f(z_i|x). \qquad (8)$$
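Written for a sample of n observations (the single-observation expectations in (8) then become sums over the sample), the M-step can be sketched as follows. This is only a sketch, not the authors' code: the posterior moments are assumed to have already been computed in the E-step, and all variable names are illustrative.

    import numpy as np

    def m_step(X, Ey, Eyy, post_z, Ey_z, Ey2_z):
        # X      : (n, p) observed data
        # Ey     : (n, k) posterior means E[y | x]
        # Eyy    : (n, k, k) posterior second moments E[y y^T | x]
        # post_z : list over factors of (n, m_i) arrays f(z_i | x)
        # Ey_z   : list over factors of (n, m_i) arrays E[y_i | z_i, x]
        # Ey2_z  : list over factors of (n, m_i) arrays E[y_i^2 | z_i, x]
        n = X.shape[0]
        Lambda = (X.T @ Ey) @ np.linalg.inv(Eyy.sum(axis=0))
        psi_diag = np.diag(X.T @ X - (X.T @ Ey) @ Lambda.T) / n
        w, mu, nu = [], [], []
        for pz, ey, ey2 in zip(post_z, Ey_z, Ey2_z):
            denom = pz.sum(axis=0)
            m = (pz * ey).sum(axis=0) / denom
            v = (pz * ey2).sum(axis=0) / denom - m ** 2
            w.append(denom / n)
            mu.append(m)
            nu.append(v)
        return Lambda, psi_diag, w, mu, nu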
3 Evaluating the Information in the IFA model
When the complete density is a member of the exponential family, the estimation of the standard errors requires the derivation of the information matrix according to the Cramér-Rao theorem. The complete density of the IFA model indeed belongs to the exponential family. In fact, the first term of the decomposition

$$f(x, y, z|\theta) = f(z|w_z)\, f(y|z, \mu_z, V_z)\, f(x|y, \Lambda, \Psi) \qquad (9)$$

is a multinomial distribution depending on $w_z$; from equation (4) the second conditional density is a multivariate normal distribution depending on $\{\mu_z, V_z\}$ only, and the last one is a multivariate normal distribution depending on the set of parameters $\{\Lambda, \Psi\}$. Moreover, each set of parameters is independently estimated in a separate stage of the EM algorithm. Thus, the problem reduces to the derivation of the Cramér-Rao lower bound, i.e. the inverse of the observed information matrix.
The observed information matrix can be defined as:

$$I_o(\theta|x) = -\frac{\partial^2 \ln f(x|\theta)}{\partial\theta \cdot \partial\theta} \qquad (10)$$
A direct evaluation of this expression can be very difficult especially when dealing
with latent variables. From Bayes’ rule it follows
$$\ln f(x|\theta) = \ln f(x, y, z|\theta) - \ln f(y, z|x, \theta)$$
and after taking second derivatives and averaging over f (y, z|x, θ):
$$I_o(\theta|x) = I_{oc} - I_{om} \qquad (11)$$
which expresses the so-called "missing information principle", stating that the observed information is equal to the complete information minus the missing information. The complete information is given by
$$I_{oc} = E_{y,z|x,\theta'}\left[\frac{\partial^2 \ln f(x, y, z|\theta)}{\partial\theta \cdot \partial\theta}\right] \qquad (12)$$
and from equations (6) and (7), under regularity conditions, it follows that:

$$I_{oc} = \frac{\partial^2 [F_O(\theta'; \Lambda, \Psi)]}{\partial\theta \cdot \partial\theta} + \frac{\partial^2 [F_B(\theta'; \mu_z, V_z)]}{\partial\theta \cdot \partial\theta} + \frac{\partial^2 [F_T(\theta'; w_z)]}{\partial\theta \cdot \partial\theta} + \frac{\partial^2 [F_H(\theta')]}{\partial\theta \cdot \partial\theta},$$

where the last term is zero, since $F_H$ depends only on the previous set of parameters θ′. Equation (11) can be written as $I_o(\theta|x) = (I - I_{om} I_{oc}^{-1})\, I_{oc}$.
Dempster et al. (1977) showed that the product $I_{om} I_{oc}^{-1}$ is equal to the matrix rate of convergence of the EM algorithm, whose elements can be numerically evaluated at each step of the algorithm. Denoting this matrix by R and inverting gives the asymptotic variance-covariance matrix V:

$$V = I_{oc}^{-1}(I - R)^{-1} = I_{oc}^{-1} + I_{oc}^{-1} R (I - R)^{-1} = I_{oc}^{-1} + \Delta V \qquad (13)$$

The additive term ΔV denotes the increase in the variance due to the missing information.
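A possible numerical recipe for this computation is sketched below. It assumes that the fitted parameters have been stacked into a single vector theta_hat and that em_step applies one full EM iteration to such a vector; none of these names come from the paper, and the forward-difference approximation of R is only one of several reasonable choices.

    import numpy as np

    def sem_rate_matrix(theta_hat, em_step, eps=1e-4):
        # forward-difference estimate of R, the Jacobian of the EM mapping at theta_hat
        d = theta_hat.size
        R = np.empty((d, d))
        for i in range(d):
            theta = theta_hat.copy()
            theta[i] += eps
            R[i, :] = (em_step(theta) - theta_hat) / eps
        return R

    def asymptotic_covariance(I_oc, R):
        # V = I_oc^{-1} (I - R)^{-1} = I_oc^{-1} + Delta V, as in eq. (13)
        I_oc_inv = np.linalg.inv(I_oc)
        delta_V = I_oc_inv @ R @ np.linalg.inv(np.eye(R.shape[0]) - R)
        return I_oc_inv + delta_V, delta_V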
4 Asymptotic standard errors for the parameter estimates
In order to derive the asymptotic standard errors, the complete information matrix, given by the decomposition of $I_{oc}$ obtained in Section 3, has to be evaluated at the optimum point θ = θ̂. The off-diagonal elements of the complete information matrix are all zero, since the different parameters of each term of expression (7) are learned separately and the maximum likelihood estimators of the mean and the variance are independent in a gaussian model. Hence, the complete information matrix can be thought of as a block diagonal matrix. The first block of the complete information matrix can be obtained by computing the second derivatives of $F_O$ with respect to each element $\lambda_{ji}$ of the factor loading matrix:
$$\frac{\partial^2 F_O}{\partial\lambda_{ji}^2} = \frac{\partial^2}{\partial\lambda_{ji}^2}\left[\int f(y|x, \theta') \ln f(x|y, \theta)\, dy\right]. \qquad (14)$$
Since Ψ is assumed to be diagonal, it follows from (5) that the observed variables are conditionally independent given the factors, and the conditional density f(x|y, θ) may be factorized into the product

$$f(x|y, \theta) = \prod_{j=1}^{p} f(x_j|y, \theta) = \prod_{j=1}^{p} N(\lambda_j y, \psi_{jj})$$

where $\psi_{jj}$ is the jth element of the diagonal matrix Ψ.
Under regularity conditions, since f(y|x, θ′) does not depend on $\hat\Lambda$, expression (14) may be rephrased as follows:

$$\frac{\partial^2 F_O}{\partial\lambda_{ji}^2} = \int f(y|x, \theta')\left(-\frac{y_i^2}{\psi_{jj}}\right) dy = -\frac{E[y_i^2|x]}{\psi_{jj}}.$$
Considering a sample of n i.i.d. observations, the standard error of the element $\lambda_{ji}$ of the mixing matrix is:

$$S.E.(\hat\lambda_{ji}) = \sqrt{\frac{\hat\psi_{jj}}{n \cdot E[y_i^2|x]} + [\Delta V]_{\lambda_{ji}}}, \qquad (15)$$

where $[\Delta V]_{\lambda_{ji}}$ denotes the diagonal element of the ΔV matrix defined in equation (13) corresponding to $\lambda_{ji}$. This additive term has no explicit analytical expression but, thanks to its direct relation to the rate of convergence of the EM algorithm, it can easily be evaluated numerically in the E-step of the model estimation process.
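In code, formula (15) amounts to the following sketch; the argument names are illustrative, and the ΔV entry is taken as an input obtained from the SEM computation above.

    import numpy as np

    def se_lambda(psi_jj_hat, Ey2_i, delta_v_ji, n):
        # psi_jj_hat : estimated specific variance of variable j
        # Ey2_i      : E[y_i^2 | x] for factor i (averaged over the sample)
        # delta_v_ji : diagonal element of Delta V corresponding to lambda_ji
        return np.sqrt(psi_jj_hat / (n * Ey2_i) + delta_v_ji)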
The standard error for the noise covariance matrix can be derived as follows. With reference to one observation, the second derivative of $F_O$ is given by:

$$\frac{\partial^2 F_O}{\partial\Psi^2} = \frac{1}{2}\Psi^{-2} - \Psi^{-2} A \Psi^{-1},$$

where $A = xx^T - 2x E[y^T|x]\Lambda^T + \Lambda E[yy^T|x]\Lambda^T$. From eq. (8)

$$E\left[\frac{\partial^2 F_O}{\partial\Psi^2}\right]_{\Psi=\hat\Psi} = -\frac{1}{2}\hat\Psi^{-2}.$$
Thus, for all the n observations

$$S.E.(\hat\Psi) = \sqrt{\frac{2\hat\Psi^2}{n} + [\Delta V]_{\Psi}}. \qquad (16)$$
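Correspondingly, a minimal sketch of (16) for a single specific variance (illustrative names, ΔV entry supplied externally):

    import numpy as np

    def se_psi(psi_jj_hat, delta_v_jj, n):
        # standard error of the j-th diagonal element of Psi, eq. (16)
        return np.sqrt(2.0 * psi_jj_hat ** 2 / n + delta_v_jj)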
With reference to the mixture parameters, the diagonal elements of the complete information matrix for the means $\mu_{i,q_i}$ and the variances $\nu_{i,q_i}$ are obtained from

$$\frac{\partial^2 F_B}{\partial\mu_{i,q_i}^2} = -\frac{1}{\nu_{i,q_i}}\, f(z_i|x),$$

$$\frac{\partial^2 F_B}{\partial\nu_{i,q_i}^2} = -\frac{f(z_i|x)}{2}\left[\frac{2\left(E[y_i^2|z_i, x] - 2E[y_i|z_i, x]\mu_{i,q_i} + \mu_{i,q_i}^2\right) - \nu_{i,q_i}}{\nu_{i,q_i}^3}\right],$$
and after taking the expectations

$$S.E.(\hat\mu_{i,q_i}) = \sqrt{\frac{\hat\nu_{i,q_i}}{n\hat{w}_{i,q_i}} + [\Delta V]_{\mu_{i,q_i}}}, \qquad S.E.(\hat\nu_{i,q_i}) = \sqrt{\frac{2\hat\nu_{i,q_i}^2}{n\hat{w}_{i,q_i}} + [\Delta V]_{\nu_{i,q_i}}}.$$
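A sketch of these two formulas (illustrative names, with the ΔV entries supplied externally):

    import numpy as np

    def se_mu(nu_hat, w_hat, delta_v, n):
        # standard error of a mixture component mean
        return np.sqrt(nu_hat / (n * w_hat) + delta_v)

    def se_nu(nu_hat, w_hat, delta_v, n):
        # standard error of a mixture component variance
        return np.sqrt(2.0 * nu_hat ** 2 / (n * w_hat) + delta_v)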
Finally, let $\bar{w}_{i,q_i}$ be new weight parameters introduced in order to ensure the two constraints of non-negativity and normalization. Then:

$$\frac{\partial^2 F_T}{\partial\bar{w}_{i,q_i}^2} = \frac{\partial}{\partial\bar{w}_{i,q_i}}\left(-\frac{e^{\bar{w}_{i,q_i}}}{\sum_{q_i'} e^{\bar{w}_{i,q_i'}}}\right) = -w_{i,q_i}(1 - w_{i,q_i}),$$

$$S.E.(\hat{w}_{i,q_i}) = \sqrt{\frac{1}{n\hat{w}_{i,q_i}(1 - \hat{w}_{i,q_i})} + [\Delta V]_{w_{i,q_i}}}.$$
5 An Example: Boston neighborhood data
This data set was published in full in Belsley et al. [Bel80], and its analysis through Exploratory Projection Pursuit [Fri87] has revealed striking structure in the data and non gaussian projections.
Figure 1 shows the ordinary Factor Analysis solution, based on iterated principal factors, and the Independent Factor Analysis solution, each with two factors. The IFA solution exhibits a clear clustering structure of the data. These groups are not so clearly captured by ordinary factor analysis. Table 1 contains the parameter estimates of the factor loading matrix for the unique solution of the IFA model, compared with the unrotated Factor Analysis solution and the varimax rotated solution. The IFA factor loadings exhibit a factor structure that is simpler, in Thurstone's sense, than the one obtained by FA and by its varimax rotation. The asymptotic standard error estimates for the IFA parameters are reported in brackets.
Fig. 1. Boston Neighborhood Data: Factor Analysis and Independent Factor Analysis solutions (two scatterplots of Factor2 against Factor1, one for each method).
Table 1. Factor loading estimates of IFA, FA unrotated solution and FA with
varimax rotation. In brackets the asymptotic standard errors of the IFA estimates.
            IFA factor 1      IFA factor 2      FA (unrotated)     Varimax
MEDV        -0.36 (0.03)       0.82 (0.03)      -0.69   0.65      -0.21   0.93
CRIM         0.67 (0.04)       0.03 (0.03)       0.57  -0.01       0.47  -0.33
ZN          -0.29 (0.05)       0.18 (0.06)      -0.58  -0.11      -0.54   0.24
INDUS        0.64 (0.04)      -0.16 (0.04)       0.84   0.13       0.77  -0.36
NOX          0.73 (0.04)      -0.04 (0.04)       0.83   0.29       0.85  -0.22
RM          -0.20 (0.03)       0.61 (0.04)      -0.50   0.51      -0.12   0.70
AGE          0.50 (0.05)       0.00 (0.05)       0.74   0.25       0.75  -0.20
DIS         -0.55 (0.05)      -0.09 (0.05)      -0.76  -0.43      -0.87   0.07
RAD          0.97 (0.02)       0.03 (0.01)       0.74   0.10       0.67  -0.33
TAX          0.90 (0.02)      -0.05 (0.02)       0.81   0.07       0.71  -0.39
PTRATIO      0.37 (0.05)      -0.33 (0.04)       0.49  -0.25       0.26  -0.48
B           -0.50 (0.05)       0.04 (0.03)      -0.45  -0.01      -0.38   0.25
LSTAT        0.52 (0.04)      -0.34 (0.04)       0.78  -0.28       0.49  -0.67
From the results we have obtained on this and other examples, we can conclude that the IFA model represents an interesting approach to latent variable modeling, since it allows for non gaussian latent factors, and our approach leads to coherent standard error estimates. It is true that the IFA model could be cumbersome and difficult to apply to complex data sets, as the computational burden needed to fit the model grows rapidly with the number of factors and the number of mixture components involved. However, in our experience good convergence behavior of the EM algorithm has been observed with up to five or six estimated factors with fewer than ten mixture components each.
References
[Att99] Attias, H.: Independent factor analysis. Neural Computation, 11, 803-851 (1999)
[Bel80] Belsley, D.A., Kuh, E., Welsch, R.E.: Regression Diagnostics. John Wiley, New York (1980)
[Fri87] Friedman, J.: Exploratory Projection Pursuit. Journal of the American Statistical Association, 82, 249-266 (1987)
[JJ00] Jamshidian, M., Jennrich, R.: Standard errors for EM estimation. Journal of the Royal Statistical Society B, 62, 257-270 (2000)
[Lou82] Louis, T.A.: Finding the Observed Information Matrix when using the EM Algorithm. Journal of the Royal Statistical Society B, 44, 226-233 (1982)
[MR91] Meng, X.L., Rubin, D.B.: Using EM to Obtain Asymptotic Variance-Covariance Matrices: The SEM Algorithm. Journal of the American Statistical Association, 86, 899-909 (1991)
[Oak99] Oakes, D.: Direct calculation of the information matrix via the EM algorithm. Journal of the Royal Statistical Society B, 61, 479-482 (1999)
[OW72] Orchard, T., Woodbury, M.A.: A Missing Information Principle: Theory and Application. Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, 1, 697-715 (1972)
[Red84] Redner, R.A., Walker, H.F.: Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26, 195-239 (1984)