BLIND SEPARATION OF DEPENDENT SOURCES AND SUBSPACES BY
MINIMUM MUTUAL INFORMATION
J. A. Palmer and S. Makeig
Swartz Center for Computational Neuroscience
University of California San Diego, La Jolla, CA 92093
ABSTRACT
We consider the problem of linear separation of dependent sources by
minimization of mutual information. We define a type of generalized
variance dependence for random vectors, which can be analyzed into
what we call “level curve dependence” and “envelope dependence”.
We define subgaussian dependence and supergaussian dependence in
terms of convexity with respect to the quadratic, and define the “homogeneous dependence” types, sub-sub and sup-sup, and the “conflicting” types sub-sup and sup-sub, in terms of the respective convexity of the envelope and level curve functions. We show that mixtures of subspaces of homogeneously dependent sources can be separated by minimizing the total output mutual information, without
requiring a priori knowledge of the subspace structure, and we show
that conflicting dependence types are actually separated by maximizing mutual information, creating a problem for subspace separation
by simple minimization of output mutual information. Monte Carlo
simulations are provided verifying the theory.
Index Terms— Independent subspace analysis, mutual information, dependence, blind source separation
1. INTRODUCTION
We consider the problem of blindly inverting an instantaneous linear
mixture of dependent sources, x(t) = As(t), where the components of the vector s(t), s1 (t), . . . , sn (t) represent the sources, and
A ∈ R^{n×n} is invertible, with W ≜ A^{-1}. The mutual information
minimization approach to blind source separation has proved very
effective at separating linear mixtures of independent, nongaussian
sources [1]. This approach is equivalent to a maximum likelihood
approach in which the source density models are adapted as well [2].
In general, the sources may exhibit mutual dependence, e.g. in signal
power, leading to what has been variously called Multidimensional
ICA [3], independent subspace analysis [4], and independent vector
analysis [5].
If the dependence structure is known, and the dependence can be
analytically modeled, then a maximum likelihood / minimum mutual information approach can be followed similar to the independent source case. For example, the Fourier coefficients of speech
signals exhibit dependence across frequencies, i.e. the powers at different frequencies of the spectrum of a given source tend to co-vary, while coefficients from the spectra of independent sources remain independent. The covariation of the Fourier coefficients can be modeled as the multiplication of a vector of independent coefficients by a common random scalar, inducing the variance dependence. Implementation of this model, and of the corresponding ML optimization, constitutes the IVA approach to separation of sources in
the frequency domain. The dependence structure is known in this
case because dependency exists between, and only between, Fourier
coefficients corresponding to the same source.
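As a concrete illustration of this generative mechanism, the sketch below multiplies a vector of independent coefficients by a common random scale. The sizes and the Gamma scale distribution are hypothetical choices for illustration only, not the settings used in this paper; the point is simply that the coefficients come out uncorrelated across frequencies while their powers co-vary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_freq, n_samples = 8, 100_000          # hypothetical sizes

# Independent coefficients for one source across frequencies
z = rng.laplace(size=(n_freq, n_samples))

# A common random scale per sample induces variance (power) dependence
xi = rng.gamma(shape=2.0, scale=1.0, size=n_samples)
s = np.sqrt(xi) * z                     # coefficients now co-vary in power

print(round(np.corrcoef(s)[0, 1], 3))       # ~0: coefficients are uncorrelated
print(round(np.corrcoef(s**2)[0, 1], 3))    # > 0: their powers are dependent
```

The first printed correlation is near zero while the second is clearly positive, which is the signature of variance dependence exploited by IVA.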
We distinguish between the dependence type, or the statistical
model of m-ary dependence, and dependence structure, by which
we mean the graphical structure of dependence linkage. The dependence type model considered in the following is variance dependence, and more generally joint probability density functions of the
form,
$$ p(s) \;=\; f\big( g(s_1) + \cdots + g(s_n) \big) \tag{1} $$
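A concrete instance of (1), anticipating the Gaussian scale mixtures discussed in §2, is obtained by taking g(s_i) = s_i^2 and letting a common scale variable ξ determine f:

$$ p(s) \;=\; \int_0^{\infty} \prod_{i=1}^{n} \mathcal{N}(s_i;\,0,\,\xi)\; p(\xi)\, d\xi \;=\; f\big( s_1^2 + \cdots + s_n^2 \big), \qquad f(u) \;=\; \int_0^{\infty} (2\pi\xi)^{-n/2}\, e^{-u/(2\xi)}\, p(\xi)\, d\xi . $$

Unless ξ is deterministic, the components of s are uncorrelated but dependent through their common scale.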
When the dependence structure is unknown, even limiting consideration to variance dependence, the problem seems to be a combinatorial one. A brute force approach might perform ML estimation with
all possible dependence structures, i.e. all possible groupings of the
n sources into subspaces. However, this quickly becomes infeasible
as n grows.
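To make the growth concrete (an illustrative count, not taken from the paper): the number of ways to group n sources into disjoint subspaces is the Bell number B_n, which satisfies

$$ B_{n+1} \;=\; \sum_{k=0}^{n} \binom{n}{k} B_k, \qquad B_1 = 1,\; B_2 = 2,\; B_3 = 5,\; \ldots,\; B_{10} = 115{,}975, $$

so an exhaustive search over dependence structures is already impractical for a few tens of sources.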
On the other hand, basic ICA, i.e. the minimization of mutual
information of the output sources, is often found to successfully separate sources of the variance dependence type, with the subspace
dependence structure ascertainable after the separation. In the case
of very large subspaces, this permutation problem may be difficult,
as it is for the naive approach to frequency-domain ICA in which
the Fourier coefficients of different frequencies of the same source
are modeled as independent. The IVA approach, using the (in this
case) known dependence structure, avoids the permutation problem.
For moderate but unknown size subspaces, the dependent subspace
identification problem has been found to be solved reasonably
well using existing heuristic clustering techniques [6].
The good performance of basic ICA in the dependent subspace
context has led to the conjecture that the minimization of mutual
information of the output is able to perform separation of certain dependent sources as well [3]. In this paper we show that the minimization of mutual information approach of basic ICA is able to perform
separation of sources of a certain type, which may be called supergaussian [7]. The analytical methods used are the Entropy Power
inequality [8], and a differential entropy calculation approach very
similar to that used by Benveniste et al. in [7].
In §2, we specify the problem and derive some background
results. In §3 we show that for a certain class of dependent sources,
which includes supergaussian variance dependent sources, minimization of mutual information is achieved if and only if the sources are
separated (with arbitrary permutation and scaling). In §4 we consider the general class of dependent densities of the “generalized
variance dependent” type (1), and determine when the minimization
of mutual information achieves separation. We show that, for what
we define as subgaussian variance dependent sources, separation is
actually achieved by maximization of the mutual information. In
§5 we present some Monte Carlo experiments to demonstrate and
verify the theory, and in §6 we discuss the results and conclude.
2. LINEAR TRANSFORMATIONS: ENTROPY POWER
AND MUTUAL INFORMATION
In the following, let x be a random vector with nongaussian, possibly dependent components, each with finite (marginal) entropy,
h(xi ) = E{− log p(xi )} < ∞. We denote the joint entropy of
the components in x by h(x), and the mutual information by I(x).
A basic tool is the entropy power inequality, stated in the following for univariate random variables, since this is the only formulation
we require in the sequel.
Theorem 1 (Entropy Power Inequality). Let xi , i = 1, . . . , n, be
independent random variables with finite entropies h(x_i), and let y = Σ_i x_i. Then

$$ \exp 2h(y) \;\ge\; \sum_{i=1}^{n} \exp 2h(x_i), $$

with equality if and only if the x_i are Gaussian.
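As a quick check of the equality condition (a standard Gaussian entropy calculation, not an addition to the theorem): for independent x_i ∼ N(0, σ_i^2) we have h(x_i) = ½ log 2πe σ_i^2 in nats, so the entropy powers add exactly:

$$ \exp 2h(x_i) \;=\; 2\pi e\,\sigma_i^2, \qquad \exp 2h\Big( \sum_i x_i \Big) \;=\; 2\pi e \sum_i \sigma_i^2 \;=\; \sum_i \exp 2h(x_i). $$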
Note that the EPI applies to independent random variables x_i. If y = w^T x, then we have,

$$ \exp 2h(w^T x) \;\ge\; \sum_{i=1}^{n} |w_i|^2 \exp 2h(x_i) $$

Dividing both sides by ‖w‖^2, taking logarithms, and using the concavity of the logarithm, we get [8],

$$ h(w^T x) \;\ge\; \log\|w\| \,+\, \sum_{i=1}^{n} \big( |w_i| / \|w\| \big)^2\, h(x_i) \tag{2} $$

with equality if and only if the x_i are i.i.d. Gaussian. Again, this inequality is proved only for independent random variables.

We shall also use the mutual information difference due to a linear transformation, which is calculated as follows. Let x be a random vector with dependent components and mutual information I(x). Let y = Wx, and let w_i be the columns of W^T. Then

$$ I(y) \;=\; \sum_i h(y_i) - h(y) \;=\; -\log|\det W| \,+\, \sum_i h(w_i^T x) - h(x) $$

Since I(x) = Σ_i h(x_i) − h(x), we have,

$$ I(y) - I(x) \;=\; -\log|\det W| \,+\, \sum_{i=1}^{n} \big[\, h(w_i^T x) - h(x_i) \,\big] \tag{3} $$

Thus the difference in mutual information due to a linear transformation is a function only of the determinant of the transformation matrix and the difference of the sums of the marginal entropies.

We are interested to know in what circumstances independent subspaces achieve the minimum overall mutual information among the components, that is, when minimization of mutual information is achieved by separation of the subspaces and by separation of the components within the dependent subspaces. Evidently there are two issues: (1) the separation of the subspaces from one another, i.e. driving the mutual information between the subspace random vectors to zero, and (2) the separation of the components within each subspace. The latter depends on the effect linear transformations have on the type of dependence in the subspace, which is explored in more detail in the next section. For our present purposes, we abstract the inequality (2), proved for independent random variables, to dependent random variables, and make the following definition.

Definition 1 (Linear Minimum and Maximum Entropy Representation). Let the dependent random variables x_1, ..., x_n be orthogonal with equivalent finite entropy h_x. Then x is said to be in its linearly minimum entropy representation if h(w^T x) ≥ h_x for all w such that ‖w‖ = 1, and in its linearly maximum entropy representation if h(w^T x) ≤ h_x for all ‖w‖ = 1.

Gaussian scale mixtures are examples of dependent random vectors in their linearly minimum entropy representation. GSMs are defined by x = ξ^{1/2} z, where z ∼ N(0, I) and ξ is a non-negative random scalar. In this case we in fact attain equality, since, with y = w^T x, we have

$$ E\{-\log p(y)\} \;=\; E\big\{ E\{-\log p(\xi^{1/2} w^T z) \mid \xi\} \big\} \;=\; \log\|w\| + E\{-\log p(\xi^{1/2} z)\} \;=\; \log\|w\| + h_x $$

where h_x is the entropy of x = ξ^{1/2} z, and z is standard normal.

We make the following simple, but important and greatly simplifying, observation regarding the calculation of the mutual information difference for random variables with finite entropy. We are interested in the effect of a linear transformation on a vector of dependent random variables. That is, for dependent x, do we have

$$ I(Wx) \;\ge\; I(x)\;? $$

We know that mutual information is invariant to scaling, so we can for example take the rows of W to have unit norm, i.e. ‖w_i‖ = 1, i = 1, ..., n. Now we claim that we can also, without loss of generality, take the marginal entropies to be equivalent, assuming they are all finite. This follows by a simple logical argument. Suppose there exists a W such that I(Wx) < I(x). Let D_x be the diagonal matrix such that the marginal entropies of D_x x are equivalent and equal to h_x. Then,

$$ I(W D_x^{-1} D_x x) \;<\; I(x) \;=\; I(D_x x) $$

Defining B = W D_x^{-1} and z = D_x x, we see that there exists a linear transformation, namely B, such that

$$ I(Bz) \;<\; I(z) $$

where z = D_x x is the scaled version of x with equal marginal entropies. Thus, if we can show that there is no transformation that reduces the mutual information of a dependent vector with equivalent marginal entropies, then logically there can be no linear transformation reducing the mutual information in the more general case. Symbolically, P → Q ⇒ ¬Q → ¬P, where here ¬P is I(Wx) ≥ I(x). Hence, proving that I(Bz) ≥ I(z) for all B and all z with h(z_i) = h_x for all i is sufficient to prove that I(Wx) ≥ I(x) for all W and all x with finite marginal entropies.
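The Gaussian scale mixture equality above can also be checked numerically. The sketch below uses a crude histogram-based entropy estimate; the helper entropy_hist, the Gamma scale distribution, and the sample size are illustrative choices, not taken from the paper. It estimates h(w^T x) for several directions w of a two-dimensional GSM and compares it with log‖w‖ + ĥ_x.

```python
import numpy as np

def entropy_hist(samples, bins=200):
    # crude differential-entropy estimate by integrating a histogram
    p, edges = np.histogram(samples, bins=bins, density=True)
    w = np.diff(edges)
    m = p > 0
    return -np.sum(p[m] * np.log(p[m]) * w[m])

rng = np.random.default_rng(1)
N = 200_000
xi = rng.gamma(shape=2.0, scale=1.0, size=N)   # common scale variable (illustrative choice)
z = rng.standard_normal((2, N))
x = np.sqrt(xi) * z                            # two-dimensional Gaussian scale mixture

h_x = entropy_hist(x[0])                       # marginal entropy estimate of x1
for w in ([1.0, 0.0], [0.6, 0.8], [2.0, 1.0]):
    w = np.asarray(w)
    print(round(entropy_hist(w @ x), 3),
          round(np.log(np.linalg.norm(w)) + h_x, 3))   # h(w^T x) vs log||w|| + h_x
```

The two printed numbers in each row should agree to within the estimation error of the histogram.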
3. INDEPENDENT SUBSPACE ANALYSIS
Let x_j, j = 1, ..., m, be subvectors of arbitrary length n_j, concatenated to form the vector x of dimension n = Σ_j n_j. Suppose that the x_j are mutually independent, but may be dependent within their subspaces, i.e. the x_j are not interdependent, but may be “intradependent”. We have the following.
Theorem 2. If xj , j = 1, . . . , m, are independent nongaussian and
linearly minimum entropy random vectors, then,
I(Wx) ≥ I(x)
with equality if and only if W is a scaled permutation matrix.
Proof. Let w_i be the columns of W^T, and let w_{ij} be the partitioned subvectors corresponding to the partition of x into the x_j. Without loss of generality, let the components of all x_j have entropy h_x, and let ‖w_i‖^2 = Σ_j ‖w_{ij}‖^2 = 1. If we put y = Wx, then

$$ y_i \;=\; w_i^T x \;=\; \sum_{j=1}^{m} w_{ij}^T x_j $$

By the entropy power inequality, we have,

$$ \exp 2h(y_i) \;\ge\; \sum_{j=1}^{m} \exp 2h(w_{ij}^T x_j) $$

And since the x_j are linearly minimum entropy,

$$ \exp 2h(y_i) \;\ge\; \sum_{j=1}^{m} \|w_{ij}\|^2 \exp(2 h_x) $$

As before, taking logarithms and using concavity, we have h(y_i) ≥ h_x for all i. Thus, from formula (3), the difference in mutual information is nonnegative: h(x_i) = h_x, and |det W| ≤ 1 (and thus −log|det W| ≥ 0) by the determinant volume (Hadamard) inequality, since the columns of W^T have unit norm. The difference in mutual information is therefore nonnegative for all W by the reasoning at the end of §2.
4. SUPERGAUSSIAN AND SUBGAUSSIAN DEPENDENCE
We have seen that Gaussian scale mixtures, which have radially symmetric densities, are dependent vectors in their linearly minimum entropy representation. We now consider random vectors with somewhat more general dependent densities, which include more general types of scale mixtures, like Generalized Gaussian scale mixtures.

Definition 2. A density p(x_1, ..., x_n) is said to be sup-sup dependent (respectively sub-sub dependent) if it is of the form,

$$ p(x_1, \ldots, x_n) \;=\; f\big( g_1(x_1) + \cdots + g_n(x_n) \big) $$

with f decreasing on (0, ∞), −log f(√y) concave (respectively convex), g_i(x_i) non-negative, symmetric, and increasing on (0, ∞), and g_i(√x) concave (respectively convex) on (0, ∞), for i = 1, ..., n. Sup-sub and sub-sup dependence are defined by the concave-convex and convex-concave scenarios respectively.

Consider first a two dimensional dependent subspace with density,

$$ p(x_1, x_2) \;=\; f\big( g(x_1) + g(x_2) \big) $$

We consider the entropy of the projections,

$$ y_1 \;=\; \cos(\theta)\, x_1 + \sin(\theta)\, x_2 $$

as a function of θ, h_y(θ). Define y_2 = cos(θ) x_2 − sin(θ) x_1. We determine the density of y_1 as the marginal of the two dimensional density of (y_1, y_2). We have,

$$ p(y_1, y_2; \theta) \;=\; f\big( g(\cos(\theta) y_1 - \sin(\theta) y_2) + g(\cos(\theta) y_2 + \sin(\theta) y_1) \big) $$

with p(y_1; θ) = ∫ p(y_1, y_2; θ) dy_2. The entropy of y_1 is given by,

$$ h_y(\theta) \;=\; -\int_{-\infty}^{\infty} p(y_1; \theta) \log p(y_1; \theta)\, dy_1 $$

We define,

$$ \psi(y_1) \;\triangleq\; -\log p(y_1; \theta) \tag{4} $$

Theorem 3. Let f be decreasing, with −log f(√x) concave, and let g(√x) be increasing and concave on x ∈ (0, ∞). Then for θ ∈ (0, π/4), we have,

$$ h_y'(\theta) \;\ge\; 0 $$

Proof. The derivative with respect to θ is,

$$ h_y'(\theta) \;=\; -\frac{d}{d\theta} \int_{-\infty}^{\infty} p(y_1; \theta) \log p(y_1; \theta)\, dy_1 \;=\; -\int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} \frac{d}{d\theta} p(y_1, y_2; \theta)\, \log p(y_1; \theta)\, dy_1\, dy_2 \tag{5} $$

(the remaining term vanishes since ∫ p(y_1; θ) dy_1 = 1 for all θ). Now, we have,

$$ \frac{d}{d\theta} p(y_1, y_2; \theta) \;=\; f'\big( g(x_1(\theta)) + g(x_2(\theta)) \big)\, \big[\, g'(x_1(\theta))\, x_1'(\theta) + g'(x_2(\theta))\, x_2'(\theta) \,\big] $$

where x_1(θ) = cos(θ) y_1 − sin(θ) y_2, x_2(θ) = cos(θ) y_2 + sin(θ) y_1, and

$$ x_1'(\theta) \;=\; -\sin(\theta)\, y_1 - \cos(\theta)\, y_2 \;=\; -x_2(\theta) $$

and likewise x_2'(θ) = x_1(θ). Thus, changing variables from (y_1, y_2) to (x_1, x_2) in (5), we get,

$$ h_y'(\theta) \;=\; \int\!\!\int f'\big( g(x_1) + g(x_2) \big)\, \big[\, g'(x_2)\, x_1 - g'(x_1)\, x_2 \,\big]\; \psi\big( \cos(\theta) x_1 + \sin(\theta) x_2 \big)\, dx_1\, dx_2 $$

Dividing the x_1 and x_2 integrations into positive and negative regions, and using the fact that g'(−x) = −g'(x) and ψ(−y) = ψ(y) by assumption, we get,

$$ h_y'(\theta) \;=\; 2 \int_0^{\infty}\!\!\int_0^{\infty} f'\big( g(x_1) + g(x_2) \big)\, x_1 x_2 \left[ \frac{g'(x_2)}{x_2} - \frac{g'(x_1)}{x_1} \right] \Big[ \psi\big( \cos(\theta) x_1 + \sin(\theta) x_2 \big) - \psi\big( \cos(\theta) x_1 - \sin(\theta) x_2 \big) \Big]\, dx_1\, dx_2 $$

Now change from Cartesian (x_1, x_2) coordinates to polar (r, φ) coordinates. Then x_1 = r cos(φ), x_2 = r sin(φ), and

$$ h_y'(\theta) \;=\; \int_0^{\infty}\!\!\int_0^{\pi/2} r^3 \Big[ -f'\big( g(r\cos\phi) + g(r\sin\phi) \big) \Big] \sin(2\phi) \left[ \frac{g'(r\cos\phi)}{r\cos\phi} - \frac{g'(r\sin\phi)}{r\sin\phi} \right] \Big[ \psi\big( r\cos(\phi - \theta) \big) - \psi\big( r\cos(\phi + \theta) \big) \Big]\, d\phi\, dr $$

Dividing the angular integration into the sum of integrals over (0, π/4) and (π/4, π/2), using the trigonometric relations, and rearranging, we get,

$$ h_y'(\theta) \;=\; \int_0^{\infty}\!\!\int_0^{\pi/4} r^3 \Big[ -f'\big( g(r\cos\phi) + g(r\sin\phi) \big) \Big] \sin(2\phi) \left[ \frac{g'(r\cos\phi)}{r\cos\phi} - \frac{g'(r\sin\phi)}{r\sin\phi} \right] \Big[ b(\phi - \theta) - b(\phi + \theta) \Big]\, d\phi\, dr $$

where b(φ) ≜ ψ(r cos(φ)) + ψ(r sin(φ)). Note that the function b(φ) satisfies the symmetries b(φ) = b(−φ) = b(π/2 − φ), by the symmetry assumed in ψ and trigonometric identities. Also, we have,

$$ b'(\phi) \;=\; \tfrac{1}{2}\, r^2 \sin(2\phi) \left[ \frac{\psi'(r\sin\phi)}{r\sin\phi} - \frac{\psi'(r\cos\phi)}{r\cos\phi} \right] $$

Hence, since f' is negative by assumption, the sign of h_y' depends on the functions g'(x)/x and ψ'(x)/x. If they are both increasing, or both decreasing, on (0, ∞), or equivalently if g(√x) and ψ(√x) are both convex or both concave on (0, ∞), then

$$ h_y'(\theta) \;\ge\; 0, \qquad \theta \in (0, \pi/4) $$

Since they are both concave by assumption (for ψ, see the remark below), the result follows.

In the sup-sup case, ψ is guaranteed by convexity properties to have the same type as −log f. In the sub-sub case, the mixing that defines ψ may change y into a sup-sub dependent density, but will not necessarily do so.

If the convexity is not “homogeneous” as in Thm. 3, but rather “conflicting”, such that one of the functions is concave and the other convex, then we have,

$$ h_y'(\theta) \;\le\; 0, \qquad \theta \in (0, \pi/4) $$

Corollary 1. Sup-sup and sub-sub dependent densities are linearly minimum entropy, and sub-sup and sup-sub dependent densities are linearly maximum entropy.

The higher dimensional case is handled by treating an arbitrary rotation as a sequence of two dimensional rotations, and using the fact derived here that each two-dimensional rotation increases the marginal entropy for linearly minimum entropy sources. Details are omitted due to space limitations.
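The two-dimensional analysis can be visualized directly by numerical integration. In the sketch below, the particular sup-sup density (1 + |x_1| + |x_2|)^{-4}, the grid size, and the integration box are illustrative choices, not from the paper; it evaluates h_y(θ) on a grid following the construction of (y_1, y_2) above, and by Theorem 3 the printed values should be nondecreasing on (0, π/4), up to discretization error.

```python
import numpy as np

# Sup-sup example density (illustrative choice): p(x1, x2) ∝ (1 + |x1| + |x2|)^(-4).
# Here f(u) = (1 + u)^(-4) is decreasing with -log f(sqrt(u)) = 4 log(1 + sqrt(u)) concave,
# and g(x) = |x| gives g(sqrt(u)) = sqrt(u), increasing and concave: the sup-sup case.
a = 4.0
L, n = 30.0, 1501                       # integration box [-L, L]^2 and grid size (arbitrary)
t = np.linspace(-L, L, n)
dt = t[1] - t[0]
Y1, Y2 = np.meshgrid(t, t, indexing="ij")

def h_of_projection(theta):
    # Joint density of (y1, y2): evaluate p at x1(theta), x2(theta) as in the text
    x1 = np.cos(theta) * Y1 - np.sin(theta) * Y2
    x2 = np.sin(theta) * Y1 + np.cos(theta) * Y2
    p = (1.0 + np.abs(x1) + np.abs(x2)) ** (-a)
    p /= p.sum() * dt * dt              # normalize on the (truncated) grid
    p1 = p.sum(axis=1) * dt             # marginal density of y1
    m = p1 > 0
    return -np.sum(p1[m] * np.log(p1[m])) * dt

for theta in np.linspace(0.0, np.pi / 4, 6):
    print(round(theta, 3), round(h_of_projection(theta), 4))
# h_y(theta) should be nondecreasing on (0, pi/4) for this sup-sup density (Theorem 3)
```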
5. SIMULATIONS AND VALIDATION
In Figure 1, we present some experiments to empirically validate the
theory. We generate four sets of two-dimensional dependent sources,
corresponding to the sup-sup, sub-sub, sub-sup, and sup-sub cases
respectively. The “sup” density is Laplacian, i.e. p(x) ∝ exp(−|x|),
and the “sub” density is Generalized Gaussian with shape parameter
5, p(x) ∝ exp(−|x|^5). The sup-sup data is generated by multiplying i.i.d. Laplacian samples by a common, instance-dependent scaling, which is Gamma distributed. This creates a supergaussian envelope dependence. The sub-sub data is generated by inducing a slight
variance dependence on i.i.d. subgaussian data by multiplying it by
a common random Gamma scaling that is tightly concentrated about
unity. The sub-sup data is generated by multiplying uniform data
over the diamond (Laplacian level curves) by a slight common scaling to induce a subgaussian envelope over Laplacian level curves.
The sup-sub data is generated by multiplying i.i.d. uniform data by a
strong scaling, to induce a supergaussian envelope on uniform (subgaussian) level curves. The “time series” are shown in the second
row, shifted to improve visibility. The bottom row plots the entropy
of projections as a function of the rotation angle for θ ∈ (0, π/2).
Symmetry is expected about π/4, and deviation gives an idea of the
noise in the empirical entropy calculation. Entropy is calculated by
approximately integrating the histogram.
It can be seen that entropy increases with rotation for the sup-sup
and sub-sub dependent sources, while it decreases for the sub-sup
and sup-sub dependent sources, as predicted.
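For reference, the following sketch mirrors the data-generation recipes described above together with a histogram-based entropy of projections. The Gamma shapes and scales, the sample size, and the helper functions are hypothetical choices (the exact parameters of the experiments are not specified here); with these choices the qualitative pattern of Fig. 1 should be reproduced, with entropy rising in θ for sup-sup and sub-sub and falling for sub-sup and sup-sub.

```python
import numpy as np

def entropy_hist(samples, bins=200):
    # crude differential-entropy estimate by integrating a histogram
    p, edges = np.histogram(samples, bins=bins, density=True)
    w = np.diff(edges)
    m = p > 0
    return -np.sum(p[m] * np.log(p[m]) * w[m])

rng = np.random.default_rng(0)
N = 200_000

def gg5(size):                          # generalized Gaussian, shape 5: p(x) ∝ exp(-|x|^5)
    return rng.choice([-1.0, 1.0], size) * rng.gamma(1 / 5, 1.0, size) ** (1 / 5)

def uniform_diamond(n):                 # uniform on {|x1| + |x2| <= 1} by rejection sampling
    pts = rng.uniform(-1, 1, size=(2, 3 * n))
    pts = pts[:, np.abs(pts).sum(axis=0) <= 1.0]
    return pts[:, :n]

datasets = {                            # common per-sample Gamma scalings (illustrative values)
    "sup-sup": rng.gamma(2.0, 1.0, N) * rng.laplace(size=(2, N)),
    "sub-sub": rng.gamma(100.0, 0.01, N) * gg5((2, N)),
    "sub-sup": rng.gamma(100.0, 0.01, N) * uniform_diamond(N),
    "sup-sub": rng.gamma(2.0, 1.0, N) * rng.uniform(-1, 1, size=(2, N)),
}

thetas = np.linspace(0.0, np.pi / 2, 9)
for name, s in datasets.items():
    h = [entropy_hist(np.cos(t) * s[0] + np.sin(t) * s[1]) for t in thetas]
    print(name, np.round(h, 3))
```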
[Figure 1 about here: four panels (Sup-Sup, Sub-Sub, Sub-Sup, Sup-Sub), each showing the scatter of (s1, s2), the time series s1(t), s2(t), and the entropy h_y of projections as a function of the rotation angle θ.]

Fig. 1. Examples of dependent densities with various combinations of subgaussian and supergaussian envelope and level curve functions. The bottom row plots the entropy of projections as a function of rotation angle, showing that sup-sup and sub-sub are linearly minimum entropy, while sub-sup and sup-sub are linearly maximum entropy. See text.

6. DISCUSSION AND CONCLUSION

We have considered the problem of blind source separation of dependent sources. In particular we considered sources that are generalized variance dependent as in (1). We have shown that the mutual
information relationship between sources with this type of dependence depends on the convexity correspondence properties of the
“envelope” and “level curve” functions. If both are supergaussian
then basic ICA can separate the dependent sources and subspaces
without knowing the subspace structure. If both are subgaussian,
and the mixing rotation does not change the subgaussian envelope into a supergaussian one, then separation can likewise be performed.
If there is a conflict between the sub- and supergaussianity of the
envelope and level curve functions, then the mutual information is
actually reduced by mixing within the subspace, and it seems that
basic ICA cannot hope to unmix the sources without explicitly performing a type of ML estimation given the subspace structure.
7. REFERENCES
[1] D. T. Pham, “Mutual information approach to blind separation of stationary sources,” IEEE Trans. Information Theory, vol. 48, no. 7, pp. 1935–1946, 2002.
[2] J.-F. Cardoso, “Infomax and maximum likelihood for blind source separation,” IEEE Signal Processing Letters, vol. 4, no. 4, pp. 112–114, 1997.
[3] J.-F. Cardoso, “Multidimensional independent component analysis,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 1998), Seattle, WA, 1998, pp. 1941–1944.
[4] A. Hyvärinen and P. O. Hoyer, “Emergence of phase- and shift-invariant features by decomposition of natural images into independent feature subspaces,” Neural Computation, vol. 12, pp. 1705–1720, 2000.
[5] T. Kim, T. Eltoft, and T.-W. Lee, “Independent vector analysis: An extension of ICA to multivariate components,” in Proc. 6th International Conference on Independent Component Analysis (ICA 2006), J. Rosca et al., Eds., Lecture Notes in Computer Science, Springer-Verlag, 2006, pp. 165–172.
[6] K. Rose, E. Gurewitz, and G. Fox, “Constrained clustering as an optimization method,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15, no. 8, pp. 785–794, 1993.
[7] A. Benveniste, M. Goursat, and G. Ruget, “Robust identification of a nonminimum phase system,” IEEE Trans. Automatic Control, vol. 25, no. 3, pp. 385–399, 1980.
[8] A. Dembo, T. M. Cover, and J. A. Thomas, “Information theoretic inequalities,” IEEE Trans. Information Theory, vol. 37, no. 6, pp. 1501–1518, 1991.