BLIND SEPARATION OF DEPENDENT SOURCES AND SUBSPACES BY MINIMUM MUTUAL INFORMATION

J. A. Palmer and S. Makeig
Swartz Center for Computational Neuroscience, University of California San Diego, La Jolla, CA 92093

ABSTRACT

We consider the problem of linear separation of dependent sources by minimization of mutual information. We define a type of generalized variance dependence for random vectors, which can be analyzed into what we call “level curve dependence” and “envelope dependence”. We define subgaussian dependence and supergaussian dependence in terms of convexity with respect to the quadratic, and define the “homogeneous” dependence types, sub-sub and sup-sup, and the “conflicting” types, sub-sup and sup-sub, in terms of the respective convexity of the envelope and level curve functions. We show that mixtures of subspaces of homogeneously dependent sources can be separated by minimizing the total output mutual information, without requiring a priori knowledge of the subspace structure, and we show that conflicting dependence types are actually separated by maximizing mutual information, creating a problem for subspace separation by simple minimization of output mutual information. Monte Carlo simulations are provided to verify the theory.

Index Terms— Independent subspace analysis, mutual information, dependence, blind source separation

1. INTRODUCTION

We consider the problem of blindly inverting an instantaneous linear mixture of dependent sources, x(t) = A s(t), where the components s_1(t), . . . , s_n(t) of the vector s(t) represent the sources, and A ∈ R^{n×n} is invertible, with W ≜ A^{-1}. The mutual information minimization approach to blind source separation has proved very effective at separating linear mixtures of independent, nongaussian sources [1]. This approach is equivalent to a maximum likelihood approach in which the source density models are adapted as well [2].

In general, the sources may exhibit mutual dependence, e.g. in signal power, leading to what has been variously called multidimensional ICA [3], independent subspace analysis [4], and independent vector analysis (IVA) [5]. If the dependence structure is known, and the dependence can be modeled analytically, then a maximum likelihood / minimum mutual information approach can be followed, similar to the independent source case. For example, the Fourier coefficients of speech signals exhibit dependence across frequencies: the power in different frequencies of the spectrum of a given source tends to co-vary, while coefficients from the spectra of independent sources remain independent. The covariation of the Fourier coefficients can be modeled as multiplication of a vector with independent coefficients by a common random scalar, inducing the variance dependence. Implementation of this model, and the corresponding ML optimization, constitutes the IVA approach to separation of sources in the frequency domain. The dependence structure is known in this case because dependency exists between, and only between, Fourier coefficients corresponding to the same source.

We distinguish between the dependence type, or the statistical model of m-ary dependence, and the dependence structure, by which we mean the graphical structure of the dependence linkage.
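As a concrete illustration of the common-scalar model just described, the following Python sketch (our own toy example; the dimensions and the Gamma scale distribution are illustrative assumptions, not taken from the paper) generates components that are uncorrelated but whose powers co-vary, i.e. variance-dependent sources.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 4, 100000

# Independent "coefficient" signals (Laplacian, i.e. supergaussian).
z = rng.laplace(size=(n, T))

# A common random scalar per sample induces variance dependence: the
# components stay uncorrelated, but their powers co-vary.
xi = rng.gamma(shape=2.0, scale=1.0, size=(1, T))
s = np.sqrt(xi) * z

print(np.round(np.corrcoef(s)[0, 1], 3))      # ~0: signals are uncorrelated
print(np.round(np.corrcoef(s**2)[0, 1], 3))   # >0: powers are positively correlated
```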
The dependence type model considered in the following is variance dependence, and more generally joint probability density functions of the form,

p(s) = f( g(s_1) + · · · + g(s_n) )    (1)

When the dependence structure is unknown, even limiting consideration to variance dependence, the problem seems to be a combinatorial one. A brute force approach might perform ML estimation with all possible dependence structures, i.e. all possible groupings of the n sources into subspaces. However, this quickly becomes infeasible as n grows. On the other hand, basic ICA, i.e. the minimization of mutual information of the output sources, is often found to successfully separate sources of the variance dependence type, with the subspace dependence structure ascertainable after the separation. In the case of very large subspaces, this permutation problem may be difficult, as it is for the naive approach to frequency domain ICA, in which the Fourier coefficients of different frequencies of the same source are modeled as independent. The IVA approach, using the (in this case) known dependence structure, avoids the permutation problem. For subspaces of moderate but unknown size, the dependent subspace identification problem has been found to be solved reasonably well using existing heuristic clustering techniques [6].

The good performance of basic ICA in the dependent subspace context has led to the conjecture that the minimization of mutual information of the output is able to perform separation of certain dependent sources as well [3]. In this paper we show that the minimization of mutual information approach of basic ICA is able to perform separation of sources of a certain type, which may be called supergaussian [7]. The analytical methods used are the entropy power inequality [8], and a differential entropy calculation approach very similar to that used by Benveniste et al. in [7].

In §2, we specify the problem and derive some background results. In §3 we show that for a certain class of dependent sources, which includes supergaussian variance dependent sources, minimization of mutual information is achieved if and only if the sources are separated (with arbitrary permutation and scaling). In §4 we consider the general class of dependent densities of the “generalized variance dependent” type (1), and determine when the minimization of mutual information achieves separation. We show that for what we define as subgaussian variance dependent sources, separation is actually achieved by maximization of the mutual information. In §5 we present some Monte Carlo experiments to demonstrate and verify the theory, and in §6 we discuss the results and conclude.

2. LINEAR TRANSFORMATIONS: ENTROPY POWER AND MUTUAL INFORMATION

In the following, let x be a random vector with nongaussian, possibly dependent components, each with finite (marginal) entropy, h(x_i) = E{− log p(x_i)} < ∞. We denote the joint entropy of the components of x by h(x), and the mutual information by I(x). A basic tool is the entropy power inequality (EPI), stated in the following for univariate random variables, since this is the only formulation we require in the sequel.

Theorem 1 (Entropy Power Inequality). Let x_i, i = 1, . . . , n, be independent random variables with finite entropies h(x_i). Let y = Σ_i x_i. Then,

exp(2h(y)) ≥ Σ_{i=1}^n exp(2h(x_i))

with equality if and only if the x_i are Gaussian.

Note that the EPI applies to independent random variables x_i.
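Theorem 1 is easy to probe numerically. The sketch below (our own illustration; the Laplacian and uniform distributions and the histogram-based entropy estimator are choices made for the example, not part of the paper) estimates both sides of the inequality from samples; the inequality is strict here because neither term is Gaussian.

```python
import numpy as np

def hist_entropy(samples, bins=300):
    """Crude differential entropy estimate (in nats) from a histogram."""
    f, edges = np.histogram(samples, bins=bins, density=True)
    dx = np.diff(edges)
    m = f > 0
    return -np.sum(f[m] * np.log(f[m]) * dx[m])

rng = np.random.default_rng(1)
N = 500000
x1 = rng.laplace(size=N)            # nongaussian
x2 = rng.uniform(-1, 1, size=N)     # nongaussian
y = x1 + x2

lhs = np.exp(2 * hist_entropy(y))
rhs = np.exp(2 * hist_entropy(x1)) + np.exp(2 * hist_entropy(x2))
print(lhs, rhs, lhs >= rhs)         # strict inequality expected for non-Gaussian terms
```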
If y = w^T x, then we have,

exp(2h(w^T x)) ≥ Σ_{i=1}^n |w_i|^2 exp(2h(x_i))

Dividing both sides by ||w||^2, taking logarithms, and using the concavity of the logarithm, we get [8],

h(w^T x) ≥ log ||w|| + Σ_{i=1}^n (|w_i| / ||w||)^2 h(x_i)    (2)

with equality if and only if the x_i are i.i.d. Gaussian. Again, this inequality is proved only for independent random variables.

We shall also use the mutual information difference due to a linear transformation, which is calculated as follows. Let x be a random vector with dependent components and mutual information I(x). Let y = Wx, and let w_i be the columns of W^T. Then,

I(y) = Σ_i h(y_i) − h(y) = − log |det W| + Σ_i h(w_i^T x) − h(x)

Since I(x) = Σ_i h(x_i) − h(x), we have,

I(y) − I(x) = − log |det W| + Σ_{i=1}^n [ h(w_i^T x) − h(x_i) ]    (3)

Thus the difference in mutual information due to a linear transformation is a function only of the determinant of the transformation matrix and the difference of the sums of the marginal entropies.

We are interested to know in what circumstances independent subspaces achieve the minimum overall mutual information among the components, that is, when minimization of mutual information is achieved by separation of the subspaces, together with separation of the components within the dependent subspaces. Evidently there are two issues: (1) the separation of the subspaces from one another, i.e. driving the mutual information between subspace random vectors to zero, and (2) the separation of the components within a subspace. The latter depends on the effect linear transformations have on the type of dependence in the subspace. This will be explored in more detail in the next section.

We are interested in the question of the effect of a linear transformation on a vector of dependent random variables. That is, for dependent x, do we have,

I(Wx) ≥ I(x) ?

For our present purposes, we abstract the inequality (2), proved for independent random variables, to dependent random variables, and make the following definition.

Definition 1 (Linear Minimum and Maximum Entropy Representation). Let the dependent random variables x_1, . . . , x_n be orthogonal with equivalent finite entropy, h_x. Then x is said to be linearly minimum entropy if h(w^T x) ≥ h_x for all w such that ||w|| = 1, and x is said to be linearly maximum entropy if h(w^T x) ≤ h_x for all ||w|| = 1.

Gaussian scale mixtures (GSMs) are examples of dependent, linearly minimum entropy random vectors. GSMs are defined by x = ξ^{1/2} z, where z ∼ N(0, I) and ξ is a nonnegative random scalar. In this case we in fact attain equality, since, with y = w^T x, we have,

E{− log p(y)} = E{ E{− log p(ξ^{1/2} w^T z) | ξ} } = log ||w|| + E{− log p(ξ^{1/2} z_1)} = log ||w|| + h_x

where h_x is the common marginal entropy of the components of x = ξ^{1/2} z, and z_1 is a standard normal scalar.
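The identity (3) is easy to verify numerically. The following sketch (our own check, not part of the paper) does so for a Gaussian vector, chosen only because all entropies then have closed form; the identity itself holds for any distribution with finite marginal entropies.

```python
import numpy as np

def gauss_joint_entropy(Sigma):
    """Differential entropy (nats) of a zero-mean Gaussian with covariance Sigma."""
    n = Sigma.shape[0]
    return 0.5 * (n * np.log(2 * np.pi * np.e) + np.linalg.slogdet(Sigma)[1])

def gauss_mutual_info(Sigma):
    """I(x) = sum of marginal entropies minus joint entropy."""
    marg = 0.5 * np.log(2 * np.pi * np.e * np.diag(Sigma))
    return marg.sum() - gauss_joint_entropy(Sigma)

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
Sigma_x = A @ A.T                     # covariance of a dependent Gaussian x
W = rng.standard_normal((3, 3))
Sigma_y = W @ Sigma_x @ W.T           # covariance of y = W x

lhs = gauss_mutual_info(Sigma_y) - gauss_mutual_info(Sigma_x)

# Right-hand side of (3): -log|det W| + sum_i [h(w_i^T x) - h(x_i)],
# where w_i^T x is Gaussian with variance (W Sigma_x W^T)_{ii}.
h_proj = 0.5 * np.log(2 * np.pi * np.e * np.diag(Sigma_y))
h_marg = 0.5 * np.log(2 * np.pi * np.e * np.diag(Sigma_x))
rhs = -np.log(abs(np.linalg.det(W))) + np.sum(h_proj - h_marg)

print(np.isclose(lhs, rhs))           # True: the identity holds exactly
```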
We now make a simple but important and greatly simplifying observation regarding the calculation of the mutual information difference for random variables with finite entropy. We know that mutual information is invariant to scaling, so we can for example take the rows of W to have unit norm, i.e. ||w_i|| = 1, i = 1, . . . , n. Now we claim that we can also, without loss of generality, take the marginal entropies to be equivalent, assuming they are all finite. This follows by a simple logical argument. Suppose there exists a W such that I(Wx) < I(x). Let D_x be the diagonal matrix such that the marginal entropies of D_x x are equivalent and equal to h_x. Then,

I(W D_x^{-1} D_x x) < I(x) = I(D_x x)

Defining B = W D_x^{-1} and z = D_x x, we see that there exists a linear transformation, namely B, such that,

I(Bz) < I(z)

where z = D_x x is the scaled version of x such that the marginal entropies are equal. Thus, if we can show that there is no transformation that reduces the mutual information of a dependent vector with equivalent marginal entropies, then logically there can be no linear transformation reducing the mutual information in the more general case. Symbolically, P → Q implies ¬Q → ¬P; in this case ¬P is the statement that I(Wx) ≥ I(x) for all W. Hence, proving that I(Bz) ≥ I(z) for all B and all z with h(z_i) = h_x for all i is sufficient to prove that I(Wx) ≥ I(x) for all W and all x with finite marginal entropies.

3. INDEPENDENT SUBSPACE ANALYSIS

Let x_j, j = 1, . . . , m, be subvectors of arbitrary length, concatenated to form the vector x of dimension n = Σ_j n_j. Suppose that the x_j are mutually independent, but may be dependent within their subspaces, i.e. the x_j are not interdependent, but may be “intradependent”. We have the following.

Theorem 2. If x_j, j = 1, . . . , m, are independent, nongaussian, linearly minimum entropy random vectors, then,

I(Wx) ≥ I(x)

with equality if and only if W is a scaled permutation matrix.

Proof. Let w_i be the columns of W^T, and let w_{ij} be the partitioned subvectors corresponding to the partition of x into the x_j. Without loss of generality, let the components of all x_j have entropy h_x, and let ||w_i||^2 = Σ_j ||w_{ij}||^2 = 1. If we put y = Wx, then,

y_i = w_i^T x = Σ_{j=1}^m w_{ij}^T x_j

By the entropy power inequality, we have,

exp(2h(y_i)) ≥ Σ_{j=1}^m exp(2h(w_{ij}^T x_j))

And since the x_j are linearly minimum entropy,

exp(2h(y_i)) ≥ Σ_{j=1}^m ||w_{ij}||^2 exp(2h_x)

As before, taking logarithms and using concavity, we have h(y_i) ≥ h_x for all i. Thus, from the formula (3), the difference in mutual information is nonnegative: h(x_i) = h_x for all i, and |det W| ≤ 1 (and thus − log |det W| ≥ 0) by the determinant volume inequality, since the columns of W^T have unit norm. Thus the difference in mutual information is nonnegative for all W by the reasoning at the end of §2.
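The practical content of Theorem 2 can be probed empirically. The sketch below (our own demonstration; it uses scikit-learn's FastICA as a convenient stand-in for a mutual-information-minimizing ICA, and the Laplacian/Gamma parameters are illustrative assumptions) mixes variance-dependent supergaussian sources with a random matrix and checks whether plain ICA recovers a scaled permutation; results will vary with the seed and the strength of the dependence.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(3)
n, T = 4, 50000

# Variance-dependent (sup-sup) sources: i.i.d. Laplacian coefficients scaled
# by a common Gamma factor at each sample (parameter values are illustrative).
z = rng.laplace(size=(n, T))
xi = rng.gamma(shape=0.5, scale=2.0, size=(1, T))
S = np.sqrt(xi) * z

A = rng.standard_normal((n, n))       # random square mixing matrix
X = A @ S                             # observed mixtures

ica = FastICA(n_components=n, random_state=0, max_iter=1000)
ica.fit(X.T)                          # scikit-learn expects (samples, features)

# If separation succeeded, G = W A should be close to a scaled permutation,
# i.e. each row dominated by a single entry.
G = ica.components_ @ A
dominance = np.max(np.abs(G), axis=1) / np.sum(np.abs(G), axis=1)
print(np.round(dominance, 2))         # values near 1 indicate separation
```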
4. SUPERGAUSSIAN AND SUBGAUSSIAN DEPENDENCE

We have seen that Gaussian scale mixtures, which have radially symmetric densities, are linearly minimum entropy dependent vectors. We now consider random vectors with somewhat more general dependent densities, which include more general types of scale mixtures, like Generalized Gaussian scale mixtures. Consider first a two dimensional dependent subspace with density,

p(x_1, x_2) = f( g(x_1) + g(x_2) )

We consider the entropy of the projection y_1 = cos(θ) x_1 + sin(θ) x_2 as a function of θ, denoted h_y(θ). Define y_2 = cos(θ) x_2 − sin(θ) x_1. We determine the density of y_1 as the marginal of the two dimensional density of (y_1, y_2). We have,

p(y_1, y_2; θ) = f( g(cos(θ) y_1 − sin(θ) y_2) + g(cos(θ) y_2 + sin(θ) y_1) )

with p(y_1; θ) = ∫ p(y_1, y_2; θ) dy_2. The entropy of y_1 is given by,

h_y(θ) = − ∫_{−∞}^{∞} p(y_1; θ) log p(y_1; θ) dy_1

We define,

ψ(y_1) ≜ − log p(y_1; θ)    (4)

Theorem 3. Let f be decreasing, with − log f(√x) concave. Let g(√x) be increasing and concave on x ∈ (0, ∞). Then for θ ∈ (0, π/4), we have,

h'_y(θ) ≥ 0

Proof. The derivative with respect to θ is,

h'_y(θ) = − ∫_{−∞}^{∞} [d/dθ p(y_1; θ)] log p(y_1; θ) dy_1 = − ∫_{−∞}^{∞} ∫_{−∞}^{∞} [d/dθ p(y_1, y_2; θ)] log p(y_1; θ) dy_1 dy_2    (5)

using the fact that ∫ d/dθ p(y_1; θ) dy_1 = 0. Now, writing x_1(θ) = cos(θ) y_1 − sin(θ) y_2 and x_2(θ) = cos(θ) y_2 + sin(θ) y_1, we have,

d/dθ p(y_1, y_2; θ) = f'( g(x_1(θ)) + g(x_2(θ)) ) [ g'(x_1(θ)) x'_1(θ) + g'(x_2(θ)) x'_2(θ) ]

where

x'_1(θ) = − sin(θ) y_1 − cos(θ) y_2 = −x_2(θ)

and likewise x'_2(θ) = x_1(θ). Thus, changing variables from (y_1, y_2) to (x_1, x_2) in (5), we get,

h'_y(θ) = ∫∫ f'( g(x_1) + g(x_2) ) [ g'(x_2) x_1 − g'(x_1) x_2 ] ψ( cos(θ) x_1 + sin(θ) x_2 ) dx_1 dx_2

Dividing the x_1 and x_2 integrations into positive and negative regions, and using the fact that g'(−x) = −g'(x) and ψ(−y) = ψ(y) by assumption, we get,

h'_y(θ) = 2 ∫_0^∞ ∫_0^∞ f'( g(x_1) + g(x_2) ) x_1 x_2 [ g'(x_2)/x_2 − g'(x_1)/x_1 ] [ ψ(cos(θ) x_1 + sin(θ) x_2) − ψ(cos(θ) x_1 − sin(θ) x_2) ] dx_1 dx_2

Now change from Cartesian (x_1, x_2) coordinates to polar (r, φ) coordinates, with x_1 = r cos(φ) and x_2 = r sin(φ). Then,

h'_y(θ) = ∫_0^∞ ∫_0^{π/2} r^3 [ −f'( g(r cos φ) + g(r sin φ) ) ] sin(2φ) [ g'(r cos φ)/(r cos φ) − g'(r sin φ)/(r sin φ) ] [ ψ(r cos(φ − θ)) − ψ(r cos(φ + θ)) ] dφ dr

Dividing the angular integration into the sum of integrals over (0, π/4) and (π/4, π/2), using the trigonometric relations, and rearranging, we get,

h'_y(θ) = ∫_0^∞ ∫_0^{π/4} r^3 [ −f'( g(r cos φ) + g(r sin φ) ) ] sin(2φ) [ g'(r cos φ)/(r cos φ) − g'(r sin φ)/(r sin φ) ] [ b(φ − θ) − b(φ + θ) ] dφ dr

where b(φ) ≜ ψ(r cos(φ)) + ψ(r sin(φ)) (with the dependence on r suppressed). Note that the function b(φ) satisfies the symmetries b(φ) = b(−φ) = b(π/2 − φ), by the symmetry assumed in ψ and trigonometric identities. Also, we have,

b'(φ) = (1/2) r^2 sin(2φ) [ ψ'(r sin φ)/(r sin φ) − ψ'(r cos φ)/(r cos φ) ]

Hence, since f' is negative by assumption, the sign of h'_y depends on the functions g'(x)/x and ψ'(x)/x. If they are both increasing, or both decreasing, on (0, ∞), or equivalently if g(√x) and ψ(√x) are both concave or both convex on (0, ∞), then,

h'_y(θ) ≥ 0,  θ ∈ (0, π/4)

Since they are both concave by assumption, the result follows.

If the convexity is not “homogeneous” as in Thm. 3, but rather “conflicting”, such that one is concave and one is convex, then we have,

h'_y(θ) ≤ 0,  θ ∈ (0, π/4)

Definition 2. A density p(x_1, . . . , x_n) is said to be sup-sup dependent (respectively sub-sub dependent) if it is of the form,

p(x_1, . . . , x_n) = f( g_1(x_1) + · · · + g_n(x_n) )

with f decreasing on (0, ∞), − log f(√y) concave (respectively convex), g_i(x_i) non-negative, symmetric, and increasing on (0, ∞), and g_i(√x) concave (respectively convex) on (0, ∞), for i = 1, . . . , n. Sup-sub and sub-sup dependence are defined by the concave-convex and convex-concave scenarios respectively.

Corollary 1. Sup-sup and sub-sub dependent densities are linearly minimum entropy, and sub-sup and sup-sub dependent densities are linearly maximum entropy.

In the sup-sup case, ψ is guaranteed by convexity properties to have the same type as − log f. In the sub-sub case, the mixing that defines ψ may change y_1 into a sup-sub dependent density, but will not necessarily do so.

The higher dimensional case is handled by treating an arbitrary rotation as a sequence of two dimensional rotations, and using the fact derived here that each two-dimensional rotation increases marginal entropy for linearly minimum entropy sources. Details are omitted due to space limitations.

5. SIMULATIONS AND VALIDATION

In Figure 1, we present some experiments to empirically validate the theory. We generate four sets of two-dimensional dependent sources, corresponding to the sup-sup, sub-sub, sub-sup, and sup-sub cases respectively. The “sup” density is Laplacian, i.e. p(x) ∝ exp(−|x|), and the “sub” density is Generalized Gaussian with shape parameter 5, p(x) ∝ exp(−|x|^5). The sup-sup data is generated by multiplying i.i.d. Laplacian samples by a common instance-dependent scaling, which is Gamma distributed. This creates a supergaussian envelope dependence. The sub-sub data is generated by inducing a slight variance dependence on i.i.d. subgaussian data by multiplying it by a common random Gamma scaling that is tightly concentrated about unity. The sub-sup data is generated by multiplying uniform data over the diamond (Laplacian level curves) by a slight common scaling to induce a subgaussian envelope over Laplacian level curves. The sup-sub data is generated by multiplying i.i.d. uniform data by a strong scaling, to induce a supergaussian envelope on uniform (subgaussian) level curves. The “time series” are shown in the second row, shifted to improve visibility. The bottom row plots the entropy of projections as a function of the rotation angle for θ ∈ (0, π/2). Symmetry is expected about π/4, and deviation from it gives an idea of the noise in the empirical entropy calculation.
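A simulation of this kind is straightforward to reproduce. The sketch below (our own version of the sup-sup experiment; the Gamma parameters, sample size, and histogram estimator are illustrative choices, not the paper's exact settings) estimates the entropy of the projection as a function of the rotation angle.

```python
import numpy as np

def hist_entropy(samples, bins=500):
    """Crude histogram-based differential entropy estimate (nats)."""
    f, edges = np.histogram(samples, bins=bins, density=True)
    dx = np.diff(edges)
    m = f > 0
    return -np.sum(f[m] * np.log(f[m]) * dx[m])

rng = np.random.default_rng(4)
T = 400000

# Sup-sup pair: i.i.d. Laplacian samples times a common Gamma-distributed
# scale (shape/scale values are our own illustrative choices).
z = rng.laplace(size=(2, T))
xi = rng.gamma(shape=0.5, scale=2.0, size=(1, T))
s = np.sqrt(xi) * z

for theta in np.linspace(0, np.pi / 2, 9):
    y1 = np.cos(theta) * s[0] + np.sin(theta) * s[1]
    print(f"theta = {theta:.2f}  h(y1) ~ {hist_entropy(y1):.3f}")
# For this sup-sup case the estimated entropy should rise from 0 to pi/4 and
# fall back symmetrically toward pi/2, as in the bottom row of Fig. 1.
```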
Entropy is calculated by approximately integrating the histogram. It can be seen that entropy increases with rotation for the sup-sup and sub-sub dependent sources, while it decreases for the sub-sup and sup-sub dependent sources, as predicted.

Fig. 1. Examples of dependent densities with various combinations of subgaussian and supergaussian envelope and level curve functions. The bottom row plots the entropy of projections as a function of rotation angle, showing that sup-sup and sub-sub are linearly minimum entropy, while sub-sup and sup-sub are linearly maximum entropy. See text.

6. DISCUSSION AND CONCLUSION

We have considered the problem of blind source separation of dependent sources. In particular we considered sources that are generalized variance dependent as in (1). We have shown that the mutual information relationship between sources with this type of dependence depends on the convexity correspondence properties of the “envelope” and “level curve” functions. If both are supergaussian, then basic ICA can separate the dependent sources and subspaces without knowing the subspace structure. If both are subgaussian, and the mixing rotation does not change the subgaussian envelope into a supergaussian one, then separation can likewise be performed. If there is a conflict between the sub- and supergaussianity of the envelope and level curve functions, then the mutual information is actually reduced by mixing within the subspace, and it seems that basic ICA cannot hope to unmix the sources without explicitly performing a type of ML estimation given the subspace structure.

7. REFERENCES

[1] D. T. Pham, “Mutual information approach to blind separation of stationary sources,” IEEE Transactions on Information Theory, vol. 48, no. 7, pp. 1935–1946, 2002.

[2] J.-F. Cardoso, “Infomax and maximum likelihood for source separation,” IEEE Signal Processing Letters, vol. 4, no. 4, pp. 112–114, 1997.

[3] J.-F. Cardoso, “Multidimensional independent component analysis,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1998), Seattle, WA, 1998, pp. 1941–1944.

[4] A. Hyvärinen and P. O. Hoyer, “Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces,” Neural Computation, vol. 12, pp. 1705–1720, 2000.
[5] T. Kim, T. Eltoft, and T.-W. Lee, “Independent vector analysis: An extension of ICA to multivariate components,” in Proceedings of the 6th International Conference on Independent Component Analysis, J. Rosca et al., Eds., Lecture Notes in Computer Science, Springer-Verlag, 2006, pp. 165–172.

[6] K. Rose, E. Gurewitz, and G. Fox, “Constrained clustering as an optimization method,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, no. 8, pp. 785–794, 1993.

[7] A. Benveniste, M. Goursat, and G. Ruget, “Robust identification of a nonminimum phase system,” IEEE Transactions on Automatic Control, vol. 25, no. 3, pp. 385–399, 1980.

[8] A. Dembo, T. M. Cover, and J. A. Thomas, “Information theoretic inequalities,” IEEE Transactions on Information Theory, vol. 37, no. 6, pp. 1501–1518, 1991.