Asymptotic standard errors in independent factor analysis

Angela Montanari¹ and Cinzia Viroli²

¹ University of Bologna, angela.montanari@unibo.it
² University of Bologna, cinzia.viroli@unibo.it

Summary. Independent Factor Analysis is a recent approach to latent variable modelling in which the factors are assumed to be mutually independent and not necessarily gaussian distributed. The factors are modeled by gaussian mixtures, which are flexible enough to approximate any probability density function. The model can be estimated quite easily by the EM algorithm when the number of factors is not too high. The EM algorithm, however, does not directly provide the standard errors of the maximum likelihood estimates. In this paper we derive the standard errors of the Independent Factor Analysis parameter estimates by means of the information matrix of the complete model density, in accordance with the so-called "missing information principle".

Key words: Independent Factor Analysis, EM algorithm, Information matrix

1 Introduction

The normality assumption characterizes the most common latent variable models. In some circumstances, however, the assumption of gaussian factors may not be realistic. The covariance matrix is then no longer a sufficient statistic, which means that higher order cross-product moments provide additional information about the underlying structure of the observed data. A recent approach that takes all the cross-product moments of the data into account is the so-called Independent Factor Analysis (IFA) model [Att99], which can be reinterpreted as a particular latent variable model with non gaussian and independent factors.

In Independent Factor Analysis a set of p observed variables x is modeled in terms of a smaller set of k unobserved independent latent variables y and an additive specific term u. In compact form the IFA model is x = Λy + u, where the random vector u is assumed to be normally distributed, u ∼ N(0, Ψ) with diagonal Ψ, and the factor loading matrix Λ = {λ_{ji}} is also termed the mixing matrix. The latent variables y are assumed to be mutually independent, and the density of the i-th factor is modeled by a mixture of m_i gaussians with means µ_{i,q_i}, variances ν_{i,q_i} and mixing proportions w_{i,q_i} (q_i = 1, ..., m_i), which are constrained to be non-negative and to sum to unity:

$$ f(y_i \mid \vartheta_i) = \sum_{q_i=1}^{m_i} w_{i,q_i}\, N(\mu_{i,q_i}, \nu_{i,q_i}) \quad \text{with} \quad \vartheta_i = \{w_{i,q_i}, \mu_{i,q_i}, \nu_{i,q_i}\}. \qquad (1) $$

The estimation problem for the whole set of IFA parameters can be successfully solved by the EM algorithm when the number of factors is not too high. One critical aspect connected with the use of the EM algorithm is the derivation of the standard errors of the parameter estimates. Notable contributions to this problem include Louis [Lou82], Meng and Rubin [MR91], Oakes [Oak99] and Jamshidian and Jennrich [JJ00]. Meng and Rubin definitively solved the problem by deriving the asymptotic variance-covariance matrix as a function of the complete-data variance-covariance matrix, which is algebraically tractable even in the most complicated estimation problems, such as the IFA one. Inherent to the proposed procedure, called the Supplemented EM algorithm (SEM), is an appealing property: the observed information can be rephrased as the difference between the complete information and the missing information. This relation is better known as the "missing information principle" introduced by Orchard and Woodbury [OW72].
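To make the generative assumptions of the model concrete before turning to estimation, the following minimal sketch simulates data from x = Λy + u with mixture-of-gaussians factors. All dimensions, parameter values and variable names are illustrative choices of ours, not quantities taken from the paper.

```python
# Minimal simulation sketch of the IFA generative model x = Lambda y + u,
# with each factor drawn from its own univariate gaussian mixture.
# All sizes and parameter values are arbitrary illustrations.
import numpy as np

rng = np.random.default_rng(0)
p, k, n = 5, 2, 1000                      # observed variables, factors, sample size

Lambda = rng.normal(size=(p, k))          # mixing (factor loading) matrix
Psi = np.diag(rng.uniform(0.1, 0.5, p))   # diagonal noise covariance

# per-factor mixture parameters: weights w, means mu, variances nu
w  = [np.array([0.3, 0.7]), np.array([0.5, 0.5])]
mu = [np.array([-2.0, 1.0]), np.array([0.0, 3.0])]
nu = [np.array([0.5, 1.0]), np.array([1.0, 0.3])]

y = np.empty((n, k))
for i in range(k):
    z_i = rng.choice(len(w[i]), size=n, p=w[i])            # latent allocation for factor i
    y[:, i] = rng.normal(mu[i][z_i], np.sqrt(nu[i][z_i]))  # draw from the selected component

u = rng.multivariate_normal(np.zeros(p), Psi, size=n)      # specific (noise) term
x = y @ Lambda.T + u                                       # observed data, n x p
```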
It must be pointed out that another way to approximate variance-covariance matrices is to resort to techniques such as the bootstrap or the jackknife, which resample data sets from the empirical distribution function and apply the EM algorithm to each of them. A further alternative approximation could be obtained by computing the empirical information matrix, which is a consistent estimator of the expected information matrix (see Redner and Walker [Red84]). However, these techniques are computationally very demanding for complicated models like the IFA one. In this paper we derive the standard errors of the IFA model parameter estimates through the evaluation of the complete-data variance-covariance matrix, coherently with the "missing information principle".

2 The IFA model estimation

For convenience, denote the IFA parameters collectively by θ. In order to derive the maximum likelihood estimates of the IFA parameters, the likelihood function f(x|θ) has to be maximized. Unfortunately this quantity is hard to deal with, since it is not expressed in closed form. In such situations it is convenient to introduce additional unobserved variables, such that if they were known the optimal value of θ could be easily computed. In the IFA model the introduction of two hidden layers is required: the first one is directly related to the factors, while the second one is involved in the gaussian mixture modeling of the factors. As a consequence, the likelihood of the observed data (the "incomplete" data) can be rephrased in terms of the "pseudo-complete" data density:

$$ f(x \mid \theta) = \sum_z \int f(z, y, x \mid \theta)\, dy = \sum_z \int f(x \mid y, \theta)\, f(y \mid z, \theta)\, f(z \mid \theta)\, dy, \qquad (2) $$

where y is the vector of the latent variables, which are modeled as gaussian mixtures, and z represents a multivariate allocation variable that, for each observation, denotes the identity of the mixture component from which it is drawn. z = {z_1, ..., z_k} follows a multivariate multinomial distribution:

$$ f(z \mid \theta) = w_z, \qquad (3) $$

where $w_z = \prod_{i=1}^{k} \prod_{q_i=1}^{m_i} w_{i,q_i}^{z_{i,q_i}}$. Conditionally on the states z, the joint density of the factor vector y is gaussian,

$$ f(y \mid z, \theta) = N(\mu_z, V_z), \qquad (4) $$

with mean vector and covariance matrix defined as

$$ \mu_z = \left[ \prod_{q_1=1}^{m_1} \mu_{1,q_1}^{z_{1,q_1}}, \ldots, \prod_{q_k=1}^{m_k} \mu_{k,q_k}^{z_{k,q_k}} \right], \qquad V_z = \operatorname{diag}\left[ \prod_{q_1=1}^{m_1} \nu_{1,q_1}^{z_{1,q_1}}, \ldots, \prod_{q_k=1}^{m_k} \nu_{k,q_k}^{z_{k,q_k}} \right]. $$

Furthermore f(x|y, z, θ) = f(x|y, θ), since the observed variables x depend on the factor vector y but not on the factor state vector z. Conditionally on the factors, the density of the observed variables is given by:

$$ f(x \mid y, \theta) = N(\Lambda y, \Psi). \qquad (5) $$

The maximization of the log-likelihood ln f(x|θ) can now be achieved by the EM algorithm. In this framework, as a consequence of Jensen's inequality:

$$ \ln f(x \mid \theta) = \ln \sum_z \int f(z, y, x \mid \theta)\, dy \;\geq\; \sum_z \int f(z, y \mid x, \theta')\, \ln \frac{f(z, y, x \mid \theta)}{f(z, y \mid x, \theta')}\, dy = \mathcal{F}(\theta', \theta). \qquad (6) $$

Therefore the EM algorithm consists in maximizing $\mathcal{F}(\theta', \theta)$ with respect to θ. Thanks to the hierarchical representation of the IFA model expressed in (2), $\mathcal{F}(\theta', \theta)$ can be decomposed into the sum of four terms:

$$ \mathcal{F}(\theta', \theta) = F_O + F_B + F_T + F_H. \qquad (7) $$

The last term F_H is the entropy of the posterior density of the latent variables given the data; it does not depend on the model parameters θ but only on the parameters θ′ estimated at the previous iteration.
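All the expectations that enter $\mathcal{F}(\theta', \theta)$ are taken with respect to the posterior density f(z, y|x, θ′), which is computed in the E step. The sketch below shows, for a single observation, one standard way of obtaining these quantities by gaussian conditioning within each joint state z; the function name, the explicit enumeration over all joint states and the use of scipy are our own illustrative choices, not the authors' implementation.

```python
# Hedged E-step sketch for one observation x: posterior state probabilities f(z|x)
# and conditional factor means E[y|z,x], obtained by gaussian conditioning.
import itertools
import numpy as np
from scipy.stats import multivariate_normal

def e_step_one_obs(x, Lambda, Psi, w, mu, nu):
    """w, mu, nu are per-factor lists of mixture weights, means and variances."""
    k = Lambda.shape[1]
    Psi_inv = np.linalg.inv(Psi)
    states = list(itertools.product(*[range(len(w_i)) for w_i in w]))
    post_z, cond_mean = [], []
    for q in states:                                  # one joint state z = (q_1, ..., q_k)
        w_z = np.prod([w[i][q[i]] for i in range(k)])
        mu_z = np.array([mu[i][q[i]] for i in range(k)])
        V_z = np.diag([nu[i][q[i]] for i in range(k)])
        # marginal of x given z: N(Lambda mu_z, Lambda V_z Lambda' + Psi), from (4)-(5)
        S = Lambda @ V_z @ Lambda.T + Psi
        post_z.append(w_z * multivariate_normal.pdf(x, mean=Lambda @ mu_z, cov=S))
        # posterior of y given (x, z): gaussian, written in precision form
        Sigma_z = np.linalg.inv(Lambda.T @ Psi_inv @ Lambda + np.linalg.inv(V_z))
        cond_mean.append(Sigma_z @ (Lambda.T @ Psi_inv @ x + np.linalg.inv(V_z) @ mu_z))
    post_z = np.array(post_z)
    return post_z / post_z.sum(), np.array(cond_mean), states
```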
The optimization procedure therefore involves only the first three terms of expression (7), which may respectively be written as:

$$ \sum_z \int f(z, y \mid x, \theta')\, \ln f(z, y, x \mid \theta)\, dy = \sum_z \int f(z, y \mid x, \theta')\, \ln f(x \mid y, \theta)\, dy + \sum_z \int f(z, y \mid x, \theta')\, \ln f(y \mid z, \theta)\, dy + \sum_z \int f(z, y \mid x, \theta')\, \ln f(z \mid \theta)\, dy. $$

From (3), (4) and (5) it follows that F_O depends only on the parameters Λ and Ψ and can be referred to as the observable layer, F_B depends on the parameters {µ_{i,q_i}, ν_{i,q_i}} (the bottom hidden layer) and F_T depends only on the coefficients {w_{i,q_i}} (the top hidden layer). Note that all three terms also depend on the previous parameter estimates. By making use of this decomposition, the two steps of the EM algorithm can be derived analytically. After laborious but straightforward calculations, the following estimates of the new parameters θ in terms of the old ones θ′ are obtained:

$$ \hat{\Lambda} = x E[y^\top \mid x]\, \big( E[y y^\top \mid x] \big)^{-1}, \qquad \hat{\Psi} = x x^\top - x E[y^\top \mid x]\, \Lambda^\top, $$
$$ \hat{\mu}_{i,q_i} = \frac{f(z_i \mid x)\, E[y_i \mid z_i, x]}{f(z_i \mid x)}, \qquad \hat{\nu}_{i,q_i} = \frac{f(z_i \mid x)\, E[y_i^2 \mid z_i, x]}{f(z_i \mid x)} - \mu_{i,q_i}^2, \qquad \hat{w}_{i,q_i} = f(z_i \mid x). \qquad (8) $$

3 Evaluating the Information in the IFA model

When the complete density is a member of the exponential family, the estimation of the standard errors requires the derivation of the information matrix according to the Cramér-Rao theorem. The complete density of the IFA model does indeed belong to the exponential family. In fact, in the decomposition

$$ f(x, y, z \mid \theta) = f(z \mid w_z)\, f(y \mid z, \mu_z, V_z)\, f(x \mid y, \Lambda, \Psi), \qquad (9) $$

the first term is a multinomial distribution depending on w_z; from equation (4) the second conditional density is a multivariate normal distribution depending on {µ_z, V_z} only, and the last one is a multivariate normal distribution depending on the set of parameters {Λ, Ψ}. Moreover, each set of parameters is estimated independently in a separate stage of the EM algorithm. Thus, the problem reduces to the derivation of the Cramér-Rao lower bound, i.e. the inverse of the observed information matrix. The observed information matrix can be defined as:

$$ I_o(\theta \mid x) = -\frac{\partial^2 \ln f(x \mid \theta)}{\partial\theta \cdot \partial\theta}. \qquad (10) $$

A direct evaluation of this expression can be very difficult, especially when dealing with latent variables. From Bayes' rule it follows that ln f(x|θ) = ln f(x, y, z|θ) − ln f(y, z|x, θ) and, after taking second derivatives and averaging over f(y, z|x, θ),

$$ I_o(\theta \mid x) = I_{oc} - I_{om}, \qquad (11) $$

which expresses the so-called "missing information principle", stating that the observed information is equal to the complete information minus the missing information. The complete information is given by

$$ I_{oc} = -E_{y,z \mid x, \theta'}\!\left[ \frac{\partial^2 \ln f(x, y, z \mid \theta)}{\partial\theta \cdot \partial\theta} \right] \qquad (12) $$

and from equations (6) and (7), under regularity conditions, it follows that:

$$ I_{oc} = -\frac{\partial^2 F_O(\theta'; \Lambda, \Psi)}{\partial\theta \cdot \partial\theta} - \frac{\partial^2 F_B(\theta'; \mu_z, V_z)}{\partial\theta \cdot \partial\theta} - \frac{\partial^2 F_T(\theta'; w_z)}{\partial\theta \cdot \partial\theta} - \frac{\partial^2 F_H(\theta')}{\partial\theta \cdot \partial\theta}, $$

where the last term is zero since F_H depends only on the previous set of parameters θ′. Equation (11) can then be written as $I_o(\theta \mid x) = (I - I_{om} I_{oc}^{-1})\, I_{oc}$.

Dempster et al. [DLR77] showed that the product $I_{om} I_{oc}^{-1}$ is equal to the matrix rate of convergence of the EM algorithm, whose elements can be evaluated numerically at each step of the algorithm. Denoting this matrix by R and inverting gives the asymptotic variance-covariance matrix V:

$$ V = I_{oc}^{-1} (I - R)^{-1} = I_{oc}^{-1} + I_{oc}^{-1} R (I - R)^{-1} = I_{oc}^{-1} + \Delta V. \qquad (13) $$

The additive term ΔV represents the increase in variance due to the missing information.
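The matrix R, and hence ΔV in (13), can be evaluated numerically along the lines of the Supplemented EM algorithm of Meng and Rubin [MR91]. The sketch below illustrates the principle under simplifying assumptions: `em_step` (one complete EM update of a flattened parameter vector), the converged estimate `theta_hat`, an earlier iterate `theta_t` and the complete information matrix `I_oc` are hypothetical inputs, not part of the paper's code.

```python
# Hedged sketch of the SEM device: estimate the EM rate-of-convergence matrix R
# component by component from the EM mapping, then form Delta V = I_oc^{-1} R (I - R)^{-1}.
import numpy as np

def sem_rate_matrix(em_step, theta_hat, theta_t):
    """r_ij = [M_j(theta_hat with component i set to theta_t[i]) - theta_hat[j]]
              / (theta_t[i] - theta_hat[i])."""
    d = len(theta_hat)
    R = np.zeros((d, d))
    for i in range(d):
        theta_i = theta_hat.copy()
        theta_i[i] = theta_t[i]                      # perturb the i-th component only
        R[i, :] = (em_step(theta_i) - theta_hat) / (theta_t[i] - theta_hat[i])
    return R

def supplemented_covariance(I_oc, R):
    """V = I_oc^{-1}(I - R)^{-1} = I_oc^{-1} + Delta V, as in equation (13)."""
    I_oc_inv = np.linalg.inv(I_oc)
    delta_V = I_oc_inv @ R @ np.linalg.inv(np.eye(R.shape[0]) - R)
    return I_oc_inv + delta_V, delta_V
```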
4 Asymptotic standard errors for the parameter estimates

In order to derive the asymptotic standard errors, the complete information matrix given by the decomposition of I_{oc} in Section 3 has to be evaluated at the optimum point θ = θ̂. The off-diagonal elements of the complete information matrix are all zero, since the parameters of each term of expression (7) are learned separately and the maximum likelihood estimators of the mean and the variance are independent in a gaussian model. Hence, the complete information matrix can be treated as a block diagonal matrix.

The first block of the complete information matrix is obtained by computing the second derivatives of F_O with respect to each element λ_{ji} of the factor loading matrix:

$$ \frac{\partial^2 F_O}{\partial\lambda_{ji}^2} = \frac{\partial^2}{\partial\lambda_{ji}^2} \left[ \int f(y \mid x, \theta')\, \ln f(x \mid y, \theta)\, dy \right]. \qquad (14) $$

Since Ψ is assumed to be diagonal, it follows from (5) that the observed variables are conditionally independent given the factors, so that the conditional density f(x|y, θ) may be factorized into the product

$$ f(x \mid y, \theta) = \prod_{j=1}^{p} f(x_j \mid y, \theta) = \prod_{j=1}^{p} N(\lambda_j y, \psi_{jj}), $$

where ψ_{jj} is the j-th element of the diagonal matrix Ψ. Under regularity conditions, since f(y|x, θ′) does not depend on Λ̂, expression (14) may be rephrased as:

$$ \frac{\partial^2 F_O}{\partial\lambda_{ji}^2} = \int f(y \mid x, \theta') \left( -\frac{y_i^2}{\psi_{jj}} \right) dy = -\frac{E[y_i^2 \mid x]}{\psi_{jj}}. $$

Considering a sample of n i.i.d. observations, the standard error of the element λ_{ji} of the mixing matrix is:

$$ \mathrm{S.E.}(\hat{\lambda}_{ji}) = \sqrt{ \frac{\hat{\psi}_{jj}}{n\, E[y_i^2 \mid x]} + [\Delta V]_{\lambda_{ji}} }, \qquad (15) $$

where [ΔV]_{λ_{ji}} denotes the diagonal element of the ΔV matrix defined in equation (13) corresponding to λ_{ji}. This additive term does not have an explicit analytical expression, but thanks to its direct relation to the rate of convergence of the EM algorithm it can easily be evaluated numerically in the E step of the model estimation process.

The standard error for the noise covariance matrix can be derived as follows. With reference to one observation, the second derivative of F_O is given by:

$$ \frac{\partial^2 F_O}{\partial\Psi^2} = \frac{1}{2}\Psi^{-2} - \Psi^{-2} A \Psi^{-1}, $$

where $A = x x^\top - 2 x E[y^\top \mid x] \Lambda^\top + \Lambda E[y y^\top \mid x] \Lambda^\top$. From eq. (8),

$$ E\!\left[ \frac{\partial^2 F_O}{\partial\Psi^2} \right]_{\Psi = \hat{\Psi}} = -\frac{1}{2}\hat{\Psi}^{-2}. $$

Thus, for all the n observations,

$$ \mathrm{S.E.}(\hat{\Psi}) = \sqrt{ \frac{2\hat{\Psi}^2}{n} + [\Delta V]_{\Psi} }. \qquad (16) $$

With reference to the mixture parameters, the diagonal elements of the complete information matrix for the means µ_{i,q_i} and the variances ν_{i,q_i} are obtained from

$$ \frac{\partial^2 F_B}{\partial\mu_{i,q_i}^2} = -\frac{1}{\nu_{i,q_i}}\, f(z_i \mid x), \qquad \frac{\partial^2 F_B}{\partial\nu_{i,q_i}^2} = -\frac{f(z_i \mid x)}{2} \left[ \frac{ 2\big( E[y_i^2 \mid z_i, x] - 2 E[y_i \mid z_i, x]\, \mu_{i,q_i} + \mu_{i,q_i}^2 \big) - \nu_{i,q_i} }{ \nu_{i,q_i}^3 } \right], $$

and, after taking the expectations,

$$ \mathrm{S.E.}(\hat{\mu}_{i,q_i}) = \sqrt{ \frac{\hat{\nu}_{i,q_i}}{n\, \hat{w}_{i,q_i}} + [\Delta V]_{\mu_{i,q_i}} }, \qquad \mathrm{S.E.}(\hat{\nu}_{i,q_i}) = \sqrt{ \frac{2\hat{\nu}_{i,q_i}^2}{n\, \hat{w}_{i,q_i}} + [\Delta V]_{\nu_{i,q_i}} }. $$

Finally, let w̄_{i,q_i} be the new weight parameters introduced in order to ensure the two constraints of non-negativity and normalization, so that $w_{i,q_i} = e^{\bar{w}_{i,q_i}} / \sum_{q'} e^{\bar{w}_{i,q'}}$. Then:

$$ \frac{\partial^2 F_T}{\partial\bar{w}_{i,q_i}^2} = \frac{\partial}{\partial\bar{w}_{i,q_i}} \left( f(z_i \mid x) - \frac{e^{\bar{w}_{i,q_i}}}{\sum_{q'} e^{\bar{w}_{i,q'}}} \right) = -w_{i,q_i}(1 - w_{i,q_i}), $$

$$ \mathrm{S.E.}(\hat{w}_{i,q_i}) = \sqrt{ \frac{1}{n\, \hat{w}_{i,q_i}(1 - \hat{w}_{i,q_i})} + [\Delta V]_{w_{i,q_i}} }. $$
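Once the EM algorithm has converged, the closed-form parts of the standard errors above reduce to simple expressions in the fitted quantities. The following minimal sketch collects them; function and argument names are ours, and the [ΔV] terms are assumed to be supplied by the numerical SEM evaluation sketched in Section 3.

```python
# Closed-form parts of the standard errors, equations (15), (16) and the mixture
# parameter formulas; delta_v is the corresponding diagonal element of Delta V.
import numpy as np

def se_lambda(psi_jj_hat, E_yi2_x, n, delta_v):        # eq. (15)
    return np.sqrt(psi_jj_hat / (n * E_yi2_x) + delta_v)

def se_psi(psi_jj_hat, n, delta_v):                    # eq. (16), one diagonal element
    return np.sqrt(2.0 * psi_jj_hat**2 / n + delta_v)

def se_mu(nu_hat, w_hat, n, delta_v):                  # mixture component mean
    return np.sqrt(nu_hat / (n * w_hat) + delta_v)

def se_nu(nu_hat, w_hat, n, delta_v):                  # mixture component variance
    return np.sqrt(2.0 * nu_hat**2 / (n * w_hat) + delta_v)

def se_w(w_hat, n, delta_v):                           # mixing proportion
    return np.sqrt(1.0 / (n * w_hat * (1.0 - w_hat)) + delta_v)
```

For instance, with hypothetical inputs, se_lambda(0.2, 1.0, 506, 0.0005) returns the standard error of a single loading according to (15).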
5 An Example: Boston neighborhood data

This data set was published in full in Belsley et al. [Bel80], and its analysis through Exploratory Projection Pursuit [Fri87] has revealed striking structure in the data and non gaussian projections. Figure 1 shows the ordinary Factor Analysis solution, based on iterated principal factors, and the Independent Factor Analysis solution, each with two factors. The IFA solution exhibits a clear clustering structure of the data. These groups are not so clearly captured by ordinary factor analysis.

Fig. 1. Boston Neighborhood Data: Factor Analysis and Independent Factor Analysis solutions (scatter plots of Factor2 versus Factor1 for each method).

Table 1 contains the factor loading estimates of the unique solution of the IFA model, compared with the unrotated Factor Analysis solution and the varimax rotated one. The IFA factor loadings exhibit a factor structure that is simpler, in Thurstone's sense, than the ones obtained by FA and by its varimax rotation. The asymptotic standard errors of the IFA estimates are reported in brackets.

Table 1. Factor loading estimates of IFA, of the FA unrotated solution and of FA with varimax rotation. In brackets, the asymptotic standard errors of the IFA estimates.

                  IFA                          FA               Varimax
          Factor1        Factor2        Factor1  Factor2    Factor1  Factor2
MEDV     -0.36 (0.03)    0.82 (0.03)     -0.69    0.65       -0.21    0.93
CRIM      0.67 (0.04)    0.03 (0.03)      0.57   -0.01        0.47   -0.33
ZN       -0.29 (0.05)    0.18 (0.06)     -0.58   -0.11       -0.54    0.24
INDUS     0.64 (0.04)   -0.16 (0.04)      0.84    0.13        0.77   -0.36
NOX       0.73 (0.04)   -0.04 (0.04)      0.83    0.29        0.85   -0.22
RM       -0.20 (0.03)    0.61 (0.04)     -0.50    0.51       -0.12    0.70
AGE       0.50 (0.05)    0.00 (0.05)      0.74    0.25        0.75   -0.20
DIS      -0.55 (0.05)   -0.09 (0.05)     -0.76   -0.43       -0.87    0.07
RAD       0.97 (0.02)    0.03 (0.01)      0.74    0.10        0.67   -0.33
TAX       0.90 (0.02)   -0.05 (0.02)      0.81    0.07        0.71   -0.39
PTRATIO   0.37 (0.05)   -0.33 (0.04)      0.49   -0.25        0.26   -0.48
B        -0.50 (0.05)    0.04 (0.03)     -0.45   -0.01       -0.38    0.25
LSTAT     0.52 (0.04)   -0.34 (0.04)      0.78   -0.28        0.49   -0.67

From the results obtained on this and other examples we can conclude that the IFA model represents an interesting approach to latent variable modeling, since it allows for non gaussian latent factors, and our approach leads to coherent standard error estimates. It is true that the IFA model may be cumbersome and difficult to apply to complex data sets, as the computational burden needed to fit the model grows rapidly with the number of factors and the number of mixture components involved. However, in our experience, good convergence behavior of the EM algorithm has been observed with up to five or six estimated factors with fewer than ten mixture components each.

References

[Att99] Attias, H.: Independent factor analysis. Neural Computation, 11, 803–851 (1999)
[Bel80] Belsley, D.A., Kuh, E., Welsh, R.E.: Regression Diagnostics. John Wiley, New York (1980)
[DLR77] Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39, 1–38 (1977)
[Fri87] Friedman, J.: Exploratory Projection Pursuit. Journal of the American Statistical Association, 82, 249–266 (1987)
[JJ00] Jamshidian, M., Jennrich, R.: Standard errors for EM estimation. Journal of the Royal Statistical Society B, 62, 257–270 (2000)
[Lou82] Louis, T.A.: Finding the Observed Information Matrix when using the EM Algorithm. Journal of the Royal Statistical Society B, 44, 226–233 (1982)
[MR91] Meng, X.L., Rubin, D.B.: Using EM to Obtain Asymptotic Variance-Covariance Matrices: The SEM Algorithm. Journal of the American Statistical Association, 86, 899–909 (1991)
[Oak99] Oakes, D.: Direct calculation of the information matrix via the EM algorithm. Journal of the Royal Statistical Society B, 61, 479–482 (1999)
[OW72] Orchard, T., Woodbury, M.A.: A Missing Information Principle: Theory and Application. Proceedings of the 6th Berkeley Symposium on Mathematical Statistics and Probability, 1, 697–715 (1972)
[Red84] Redner, R.A., Walker, H.F.: Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26, 195–239 (1984)