IJCNN’99, Washington, DC, July 10-16, 1999
An Information Theoretic Method for Designing Multiresolution Principal Component
Transforms
Omid S. Jahromi and Bruce A. Francis
Department of Electrical and Computer Engineering, University of Toronto
Toronto, Ontario, Canada M5S 3G4
{omidj, francis}@control.toronto.edu
ABSTRACT
Principal Component Analysis (PCA) is basically concerned with finding an optimal way to represent a random vector through a linear combination of a few uncorrelated random variables. In signal processing, multiresolution transforms are used to decompose a time signal into components of different resolutions. In this paper, we consider designing optimal multiresolution transforms such that the components at each resolution provide the best approximation to the original signal at that resolution. We call a transformation that admits this optimality property a Principal Component Multiresolution Transform (PCMT).

We show that PCMTs can be designed by minimizing the information transfer through their basic building blocks. We then propose a method that performs this minimization in a stage-by-stage manner. This latter method has great appeal in terms of its computational simplicity as well as its theoretical interpretation. In particular, it agrees with Linsker's principle of self-organization. Finally, we provide analytic arguments and computer simulations to demonstrate the efficiency of our method.
1. INTRODUCTION
Principal component analysis (PCA) is a widely used statistical technique in the theory of neural networks as well as in many other fields of the natural sciences. In particular, it can be used to design optimum source compression methods of a class called transform coding. Conventional transform coding techniques suffer from several drawbacks, including the complexity of estimating the optimal transform, the large number of parameters to be estimated, and the so-called blocking effect, which is caused by processing the signal to be compressed in disjoint blocks independently.
Multirate filter banks (equivalently lapped transforms)
have been introduced to overcome some of these problems
by providing a great deal of flexibility in the way that the
signal is (hypothetically) segmented before being transformed.
Apart from some (impractical) special cases that are based
on filters with infinite support [9], [12], little progress has
been made in the development of methods for designing optimal and yet practical filter banks. Our goal in this paper is
to develop such a method, from an information theory point
of view, for a class of multirate filter banks described below.
A very important class of filter banks are those that implement multiresolution transforms [7]. Such filter banks admit a very appealing structure consisting of two-channel orthogonal filter banks connected as a binary tree. We show that information minimization can be used for designing the optimal multiresolution transform, which we call a Principal Component Multiresolution Transform (PCMT). A PCMT has the property that it decomposes the input signal into multiresolution components in such a way that each resolution component is the best possible approximation, at that resolution, to the original signal.
The paper is organized as follows. Section 2 introduces PCA and the concept of optimum transforms for random variables. An Info-Max interpretation of PCA is given in section 3. Section 4 introduces principal component filter banks, a concept that extends PCA to the processing of correlated random signals. We introduce the PCMT, its implementation, and design issues in section 5. In section 6 we consider the PCMT from an information theory point of view and introduce a method that greatly reduces the complexity of the nonlinear optimization associated with its design. Finally, we make some concluding remarks in section 7.
Notation: Vectors are denoted by capital letters. The error vector, however, is denoted by $\epsilon$. Boldface capital letters are used for matrices. The $ij$ element of a matrix $\mathbf{A}$ is denoted by $(\mathbf{A})_{ij}$.
2. PRINCIPAL COMPONENT ANALYSIS
The idea of using linear subspaces for statistical data analysis goes back to the 1930s when Hotelling [2] introduced
Principal Component Analysis. The importance of Hotelling’s
approach in data compression was apparently first realized
by Kramer and Mathews [5] in the 1950s. Since then, PCA
has been widely used for data compression, pattern classification and noise reduction applications.
Principal component analysis is concerned with explaining the variance-covariance structure of a set of N variables
through a few, K < N , linear combinations of these variables. Although N components are required to reproduce
the total system variability, often much of this variability
can be accounted for by a small number K of the principal
components. If so, there is (almost) as much information in
the K components as there is in the original N variables.
The K principal components can then replace the original
N variables, and the original data set, consisting of P measurements on N variables, is compressed to a data set consisting of P measurements on K principal components.
Let us consider an $N$-dimensional stochastic process $X(n)$ whose samples are independent and identically distributed. The autocorrelation matrix of $X$ is defined as¹

$$C_{XX} \triangleq E\{XX^T\} \qquad (1)$$

¹ To simplify the notation, we drop the time index $n$ wherever the time dependency is not of explicit concern.

where $E$ is the mathematical expectation operator. It can be shown (e.g., [8]) that every autocorrelation matrix $C_{XX}$ can be expressed in terms of its orthonormal eigenvectors $U_i \in \mathbb{R}^N$:

$$C_{XX} = \sum_{i=1}^{N} \lambda_i U_i U_i^T \qquad (2)$$

where the $\lambda_i$ are the eigenvalues of $C_{XX}$. The $U_i$ come in handy for the statistical approximation of $X$ as an arbitrary vector in $\mathbb{R}^N$:

$$X \approx \sum_{i=1}^{K} (U_i^T X) U_i \qquad (3)$$

The approximation error is then given by

$$\epsilon = X - \sum_{i=1}^{K} (U_i^T X) U_i = \sum_{i=K+1}^{N} (U_i^T X) U_i$$

The coefficients $U_i^T X$ are called the principal components of $X$, and by using an increasing number of the largest terms in (3), $X$ is approximated with increasing accuracy. If $K = N$, $\epsilon$ becomes zero.

The approximation formula above can be written in a more compact form by packing the principal components of interest (i.e. the $K$ largest ones) in a vector $\tilde{X}$. Doing so, we can rewrite (3) as

$$X \approx P^T \tilde{X} \qquad (4)$$

where

$$\tilde{X} = PX, \quad \text{and} \quad P = \begin{pmatrix} U_1^T \\ U_2^T \\ \vdots \\ U_K^T \end{pmatrix} \qquad (5)$$

Assuming that $X$ is sampled from the random process $X(n)$, the approximation given by (4) is optimal in the sense that it results in a residual with minimum expected norm. That is to say, $\epsilon$ as given above minimizes $E\{\epsilon^T \epsilon\}$ for any fixed value of $K$ [8]. In this sense, the matrix $P$ is the optimum linear transformation that reduces the $N$-dimensional vector of variables $X$ to a $K$-dimensional vector $\tilde{X}$.

The matrix $P$ is commonly referred to as the Karhunen-Loève transform (KLT) in the signal processing literature². As an orthogonal matrix, the KLT is very rich in structure and properties [4].

² In statistics, it is also known as the Hotelling transform or simply the principal component transform (PCT).
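To make eqs. (1)-(5) concrete, the following minimal sketch (in Python/NumPy; the companion code for this paper is in MATLAB [3], so this snippet is purely illustrative and all names in it are ours) estimates $C_{XX}$ from synthetic samples, builds the KLT matrix $P$, and checks that the expected residual energy equals the sum of the discarded eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: many samples of a correlated N-dimensional random vector X.
N, K, num = 8, 3, 100_000
A = rng.standard_normal((N, N))
X = rng.standard_normal((num, N)) @ A.T

# Estimate the autocorrelation matrix C_XX = E{X X^T} of eq. (1).
C = (X.T @ X) / num

# Eigen-decomposition of eq. (2); sort eigenvalues in decreasing order.
lam, U = np.linalg.eigh(C)
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

# KLT matrix P of eq. (5): its rows are the K dominant eigenvectors.
P_klt = U[:, :K].T

# Principal components and the K-term approximation of eqs. (3)-(4).
X_tilde = X @ P_klt.T
X_hat = X_tilde @ P_klt
resid = X - X_hat

# The expected residual energy E{eps^T eps} equals the sum of the
# discarded eigenvalues lambda_{K+1} .. lambda_N.
print(np.mean(np.sum(resid**2, axis=1)), lam[K:].sum())
```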
3. AN INFORMATION-THEORETIC INTERPRETATION

In the previous section we derived the KLT solely on the basis of its optimal mean-square approximation properties. It has a very nice interpretation based on information-theoretic concepts too! Here we briefly explain this latter interpretation, as it will play a central role in our development of design methods for multiresolution PCA in section 6.

The mutual information between $\tilde{X}$ and $X$ can, in general, be expressed as

$$I(\tilde{X}; X) = h(\tilde{X}) - h(\tilde{X}|X)$$

where $h(\tilde{X})$ denotes the (differential) entropy and $h(\tilde{X}|X)$ denotes the conditional entropy of $\tilde{X}$ given $X$ [1]. Here, the conditional entropy $h(\tilde{X}|X)$ is zero because there is no uncertainty in evaluating $\tilde{X}$ when $X$ is known. Hence

$$I(\tilde{X}; X) = h(\tilde{X}) \qquad (6)$$

Now, assuming that the original random process $X$ is Gaussian with zero mean, the (differential) entropy $h(\tilde{X})$ is given by [8]

$$h(\tilde{X}) = \tfrac{1}{2}\left(K + K\log(2\pi) + \log\left|\det(C_{\tilde{X}\tilde{X}})\right|\right) \qquad (7)$$

Substituting (7) in (6), the problem of maximizing the mutual information $I(\tilde{X}; X)$ reduces to maximizing the determinant of $C_{\tilde{X}\tilde{X}}$. It is straightforward to verify that the matrix $P$ as given in (5) is in fact the matrix that maximizes $\det(C_{\tilde{X}\tilde{X}})$ subject to the orthogonality condition $PP^T = I$. This proves that the KLT is also optimal in the sense of transferring maximum information about the random vector $X$ to $\tilde{X}$.
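This optimality is easy to probe numerically. The sketch below (again illustrative Python of ours, not from the paper) compares $\log|\det(P C_{XX} P^T)|$ for the KLT of (5) against many random orthonormal $K \times N$ matrices; the KLT should never lose:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 8, 3
A = rng.standard_normal((N, N))
C = A @ A.T                                   # a valid covariance matrix C_XX

lam, U = np.linalg.eigh(C)                    # ascending eigenvalues
P_klt = U[:, ::-1][:, :K].T                   # rows = K dominant eigenvectors

def logdet_proj(P):
    # log|det C_X~X~| for X~ = P X, i.e. log det(P C P^T); see eqs. (6)-(7)
    return np.linalg.slogdet(P @ C @ P.T)[1]

# Random K x N matrices with orthonormal rows, obtained via QR decomposition.
best_random = max(
    logdet_proj(np.linalg.qr(rng.standard_normal((N, K)))[0].T)
    for _ in range(10_000)
)
print(best_random, "<=", logdet_proj(P_klt))  # the KLT attains the maximum
```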
4. PRINCIPAL COMPONENT FILTER BANKS
In this section we extend the idea of optimum rate reduction (compression) to the case of correlated stationary random signals. To do this, we have to go beyond simple orthogonal transforms and use a special class of multi-input multi-output dynamical systems known as filter banks.
[Figure 1: An M-channel multirate analysis/synthesis filter bank. The analysis filters $H_0(z), \ldots, H_{M-1}(z)$ followed by $M$-fold decimators produce the subband signals $v_0(n), \ldots, v_{M-1}(n)$; expanders and the synthesis filters $F_0(z), \ldots, F_{M-1}(z)$ produce the reconstruction $y(n)$.]
An $M$-channel analysis/synthesis filter bank is shown in Fig. 1. The filters $H_0$ to $H_{M-1}$ constitute the analysis filter bank. These filters, together with the decimators following them, generate $M$ subband signals whose rate is $1/M$ of that of the input signal. The idea is to design the analysis and synthesis filters such that:

1. When all the subband channels are transmitted to the synthesis bank, the original signal is perfectly reconstructed by the synthesis bank (within perhaps a delay).

2. When only $K$ out of $M$ subband channels are used for the synthesis, the mean-square error between the input $x(n)$ and the synthesized signal $y(n)$ is minimum.

3. The analysis and synthesis banks are lossless³. That is, the sum of the subband signals' energies is equal to the number of channels times the input signal energy: $\sum_{i=0}^{M-1} E\{v_i^2(n)\} = M\,E\{x^2(n)\}$
A filter bank that satisfies the above properties for all values of $0 < K < M$ is called a principal component filter bank (PCFB) [9]. Methods for designing PCFBs when the filters are allowed to have ideal (brick-wall) frequency responses have been developed recently [9], [12]. These methods, however, are only of theoretical interest since the resulting ideal filters are not realizable.
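As a concrete, runnable instance of conditions (1) and (3), consider the two-channel Haar filter bank; it is paraunitary and hence lossless, although it is a PCFB only for particular input statistics. The sketch is ours, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(1000)

s = np.sqrt(0.5)
v0 = s * (x[0::2] + x[1::2])     # lowpass subband
v1 = s * (x[0::2] - x[1::2])     # highpass subband

y = np.empty_like(x)             # synthesis from both subbands
y[0::2] = s * (v0 + v1)
y[1::2] = s * (v0 - v1)

print(np.allclose(x, y))                                   # condition (1)
print(np.mean(v0**2) + np.mean(v1**2), 2 * np.mean(x**2))  # condition (3)
```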
In this paper, we do not consider uniform filter banks⁴ like the one shown in Fig. 1. Instead, we consider a class of filter banks called tree-structured filter banks. These filter banks are constructed by cascading two-channel filter banks in a dyadic tree (Fig. 2).
³ This requirement is not actually necessary, but it induces very convenient properties in both the design and implementation phases.
⁴ Uniform filter banks have some drawbacks. In particular, their lattice implementation (for $M > 2$) fails to satisfy condition (1) when finite-precision arithmetic is involved [10].
5. PRINCIPAL COMPONENT
MULTIRESOLUTION TRANSFORMS
Tree-structured filter banks have very appealing properties. In particular, they perform a multiresolution decomposition of the input signal space and, hence, are very closely related to wavelet transforms [7]. In fact, the outputs of the analysis filters can be thought of as approximations of the input signal at different resolutions or rates. Now, the question that naturally arises is how to choose the best filter bank (equivalently, the best wavelet) so that these lower-resolution (lower-rate) approximations represent the original signal as closely as possible. In this context, using arguments similar to those in section 4, one can define the optimal filter bank as the one that satisfies the three conditions mentioned there. We call this optimal filter bank a principal component multiresolution transform (PCMT).
[Figure 2: A multiresolution transform (tree-structured analysis bank): two-channel sections $H_{00}(z), H_{01}(z)$ and $H_{10}(z), H_{11}(z)$ with 2-fold decimators connected in a dyadic tree.]
It is straightforward to verify that for a PCMT it is sufficient that each two-channel section satisfies the three optimality conditions. That is to say, a PCMT can be obtained simply by cascading two-channel PCFBs in a tree structure [11]. In addition, it is known that an FIR two-channel filter bank that satisfies conditions (1) and (3) can be implemented using a so-called paraunitary lattice [10]. This lattice for the analysis filters is shown in Fig. 3⁵. In this figure, the blocks named $T_1, T_2, \ldots, T_N$ are $2 \times 2$ rotation matrices. That is,

$$T_i = \begin{pmatrix} \cos(\theta_i) & \sin(\theta_i) \\ -\sin(\theta_i) & \cos(\theta_i) \end{pmatrix} \qquad (8)$$

where the $\theta_i$ are rotation angles that determine the free parameters of the lattice. For a two-channel paraunitary lattice to be a PCFB, therefore, one has only to choose the $N$ rotation angles $\theta_1$ to $\theta_N$ such that condition (2) is satisfied.
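The lattice is easy to simulate. The sketch below (ours) follows the standard paraunitary lattice construction of [10], with rotation stages separated by a one-sample delay on the lower polyphase branch; the exact wiring of Fig. 3(b) may differ in inessential details such as delay placement. The final line checks the lossless condition (3):

```python
import numpy as np

def rot(theta):
    # 2x2 rotation block T_i of eq. (8)
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s], [-s, c]])

def lattice_analysis(x, thetas):
    """Two-channel paraunitary lattice analysis (sketch).

    x      : 1-D input signal of even length
    thetas : rotation angles theta_1 .. theta_N
    returns the two subband signals v0, v1 (at half the input rate)
    """
    u = np.vstack([x[0::2], x[1::2]])      # polyphase split (after decimators)
    u = rot(thetas[0]) @ u
    for th in thetas[1:]:
        # one-sample delay on the lower branch between consecutive stages
        u[1] = np.concatenate(([0.0], u[1, :-1]))
        u = rot(th) @ u
    return u[0], u[1]

rng = np.random.default_rng(2)
x = rng.standard_normal(10_000)
v0, v1 = lattice_analysis(x, [0.3, -0.7, 1.1, 0.2])
# Lossless condition (3): subband energies sum to M = 2 times the input energy.
print(np.mean(v0**2) + np.mean(v1**2), 2 * np.mean(x**2))
```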
Condition (2) requests minimum reconstruction error when only one subband is used. Suppose we want to reconstruct $x(n)$ using $v_0(n)$ only (Fig. 3); knowing that the lossless condition (3) is automatically satisfied by the paraunitary lattice, we can easily verify that $E\{(x(n) - y(n))^2\} = E\{v_1^2(n)\}$. This means the optimal analysis bank should be designed such that the subband to be dropped has minimum energy⁶.

⁵ The synthesis filters are then implemented using the dual of this lattice.
⁶ Equivalently, the retained subband should have maximum possible energy (because the total subband energy is constant).
[Figure 3: A two-channel paraunitary filter bank (a) and its lattice representation (b): analysis filters $H_0(z), H_1(z)$ with decimators, expanders and synthesis filters $F_0(z), F_1(z)$ in (a); rotation stages $T_1, \ldots, T_N$ separated by delays in (b).]
Obviously, this energy depends on the input signal statistics and the choice of rotation angles. Hence, turning a paraunitary lattice into a two-channel PCFB is equivalent to the following optimization problem:

$\mathcal{P}$: Given the input signal statistics and the lattice in Fig. 3, find $T_1, \ldots, T_N$ such that $E\{v_1^2(n)\}$ is minimized.

$\mathcal{P}$ is a nonlinear optimization problem and no analytic solution has been reported for it yet. A main contribution of this paper is to reveal the structure and level of difficulty of this optimization. We achieve this in the next section by showing that it is in fact a multi-stage information minimization problem. As a by-product, we also show that we often have to deal with ill-conditioned covariance matrices which, in turn, make achieving accurate results computationally demanding.
6. INFORMATION TRANSFER IN PARAUNITARY
LATTICES
To start with, we manipulate the paraunitary lattice in Fig. 3(b) and draw it in a form that contains a delay chain followed by a dynamic-free structure consisting of $2 \times 2$ rotation blocks (Fig. 4). That this structure is in fact equivalent to the lattice in Fig. 3(b) can be verified using signal flow graph methods. Also introduced in Fig. 4 are intermediate vector variables denoted by $U_1, \ldots, U_N$. Assuming that $x(n)$ is wide-sense stationary and Gaussian, these vectors will be Gaussian random variables whose covariance matrices we denote by $C_{U_i U_i}$. The $T_i$ blocks are orthogonal rotations; the signal energy is thus conserved from their input to their output irrespective of the angle of rotation $\theta_i$. Using this fact, it is straightforward to show that

Theorem 1: $\mathrm{Tr}\{C_{U_i U_i}\} = 2(N - i + 1)\,E\{x^2(n)\}$

The above theorem basically means that in passing through the rack of rotations from stage $i$ to stage $i+1$, the signal energy is reduced by a constant factor⁷ that is independent of the choice of rotation angles.

⁷ This factor is simply the ratio of the number of retained branches to the number of input branches in each stage. Also note that for the last stage $\mathrm{Tr}\{C_{WW}\} = \mathrm{Tr}\{C_{U_N U_N}\} = 2E\{x^2(n)\}$.

[Figure 4: Modified representation of a two-channel paraunitary lattice: a delay chain feeding rotation stages $T_1, \ldots, T_N$, with intermediate vectors $U_1 = X, \ldots, U_N$ and output vector $W$.]

The next theorem establishes the link between PCFBs and information theory by showing that $\mathcal{P}$ is an information minimization problem.
Theorem 2: For a stationary Gaussian input, an $N$-stage paraunitary lattice is a PCFB iff $I(X; W)$, i.e. the mutual information between the input vector $X$ and the output vector $W$, is minimum and $C_{WW}$ is diagonal.
Proof: Recalling $\mathcal{P}$, we would like to find the rotation matrices $T_i$ such that $E\{v_1^2(n)\}$ is minimum. Using the notation introduced in Fig. 4, $E\{v_1^2(n)\}$ is equal to the lower diagonal element of $C_{WW}$, i.e. $(C_{WW})_{22}$. Let us call the eigenvalues of this covariance matrix $\lambda_1$ and $\lambda_2$, where the indices are chosen such that $\lambda_1 \geq \lambda_2$. For a covariance matrix, the eigenvalues majorize the diagonal elements; that is to say, $(C_{WW})_{22} \geq \lambda_2$ with equality if and only if $C_{WW}$ is a diagonal matrix [4]. To minimize $(C_{WW})_{22}$, therefore, the last rotation matrix $T_N$ should diagonalize $C_{WW}$. Choosing $T_N$ to do so, we get

$$E\{v_1^2(n)\} = \lambda_2 \qquad (9)$$

Since $T_N$ is orthogonal, $\lambda_1$ and $\lambda_2$ are eigenvalues of $C_{U_N U_N}$ too. Hence, problem $\mathcal{P}$ reduces to finding rotation blocks $T_1$ to $T_{N-1}$ such that the smallest eigenvalue of $C_{U_N U_N}$ is minimized. Now, noticing the fact that

$$\lambda_1 + \lambda_2 = \mathrm{Tr}\{C_{U_N U_N}\} = \text{const}$$

minimizing the smallest eigenvalue $\lambda_2$ becomes the same as minimizing the product $\lambda_1 \lambda_2$ which, in turn, is equal to $\det(C_{U_N U_N})$. Up to now, we have shown that $\mathcal{P}$ is equivalent to

$\mathcal{P}'$: Find $T_1, \ldots, T_{N-1}$ such that $\det(C_{U_N U_N})$ is minimized. Then, find $T_N$ to diagonalize $C_{U_N U_N}$.
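A quick numerical check (ours) of the majorization step used in the proof: the diagonal elements of a $2 \times 2$ covariance matrix never fall below its smallest eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((2, 2))
C = A @ A.T                           # a random 2x2 covariance matrix C_WW
lam = np.linalg.eigvalsh(C)           # lam[0] is the smallest eigenvalue
print(C[1, 1] >= lam[0] - 1e-12)      # (C_WW)_22 >= lambda_2, always True
```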
There exists an information-theoretic interpretation of $\mathcal{P}'$. To see this, we first note that if a vector $U_i$ is given, all subsequent vectors $U_j$, $j > i$, are uniquely known. This means $h(U_j | U_i) = 0$ for $j > i$. The mutual information between these vectors, therefore, is given by the differential entropy of the one with smaller dimension:

$$I(U_i; U_j) = h(U_j), \quad j > i \qquad (10)$$

Since $x(n)$ is assumed to be Gaussian, the vectors $U_i$ have Gaussian distributions too. Therefore, minimizing $\det(C_{U_N U_N})$ (as required by $\mathcal{P}'$) is equivalent to minimizing the differential entropy $h(U_N)$. Using (10) with $j = N$ and $i = 1$, this latter minimization can be further interpreted as minimizing $I(U_1; U_N)$.

Finally, we notice that there is no loss of information in going from $U_N$ to $W$. In other words, there is no distinction between minimizing $I(U_1; U_N)$ or $I(U_1; W)$. This completes the proof since $U_1 = X$.
As mentioned in the proof above, $I(X; W)$ is independent of the choice of $T_N$. This leads to the following procedure for designing a two-channel PCFB:

$\mathcal{P}''$: For the lattice in Fig. 3, find $T_1, \ldots, T_{N-1}$ such that $I(X; U_N) = h(U_N)$ is minimized. Then choose $T_N$ to diagonalize $C_{U_N U_N}$.
$\mathcal{P}''$ is still a multivariable nonlinear optimization problem in the rotation angles $\theta_1, \theta_2, \ldots, \theta_{N-1}$. We believe, however, that it can be further re-cast as a sequence of single-variable optimizations. The idea here has a very close relation to Linsker's principle of self-organization [6]. We state our method as

Algorithm 1: To minimize $I(X; U_N)$ for the structure shown in Fig. 4, minimize $I(U_i; U_{i+1})$ for $i = 1, \ldots, N-1$ in a successive manner.

The above method basically says that, to minimize the information available at the last stage, one can minimize the amount of information transferred from each intermediate stage to the next⁸.

Algorithm 1 is quite appealing intuitively. However, at this time, it is not known to the authors whether it is in fact equivalent to the joint multivariable optimization defined by $\mathcal{P}''$. Various simulations show that this algorithm and $\mathcal{P}''$ give practically the same results but, nevertheless, we do not consider such experiments conclusive. For one reason, we cannot be sure we have obtained the global minimum in $\mathcal{P}''$ since the problem is non-convex and frequently ill-conditioned. In the following, we provide numerical results for a typical case.

⁸ Actually, Linsker's principle for the optimization of neural networks deals with information maximization. Here we are dealing with information minimization, so we can call ours an anti-Linsker method!
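The following covariance-domain sketch (ours) illustrates the stage-by-stage idea of Algorithm 1. A caveat: the exact rack-of-rotations wiring of Fig. 4 could not be fully recovered from the figure, so `stage_cov` is a hypothetical stand-in that applies the same rotation to consecutive pairs of components and then drops the two boundary components, which matches the dimension bookkeeping of Theorem 1 but should not be read as the authors' exact structure:

```python
import numpy as np

def stage_cov(C_u, theta, keep):
    """Hypothetical covariance-domain model of one Fig. 4 stage: apply the
    same 2x2 rotation to each consecutive pair of components, then retain
    the index set `keep`. The exact wiring of Fig. 4 is paraphrased here,
    not reproduced."""
    d = C_u.shape[0]
    c, s = np.cos(theta), np.sin(theta)
    R = np.eye(d)
    for k in range(0, d - 1, 2):
        R[k:k + 2, k:k + 2] = [[c, s], [-s, c]]
    C = R @ C_u @ R.T
    return C[np.ix_(keep, keep)]

def algorithm1(C_x, N, grid=np.linspace(0.0, np.pi, 361)):
    """Stage-by-stage minimization: at stage i pick theta_i minimizing
    h(U_{i+1}), i.e. log det C_{U_{i+1}U_{i+1}} for a Gaussian input.
    C_x is the 2N x 2N covariance of the delay-chain vector U_1 = X."""
    thetas, C = [], C_x
    for _ in range(N - 1):
        keep = list(range(1, C.shape[0] - 1))   # drop the two boundary lines
        best = min(grid,
                   key=lambda t: np.linalg.slogdet(stage_cov(C, t, keep))[1])
        thetas.append(float(best))
        C = stage_cov(C, best, keep)
    return thetas, C                             # C approximates C_{U_N U_N}
```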
Simulation example⁹: In this example we consider designing a two-channel PCFB using a 4-stage paraunitary lattice.

⁹ MATLAB™ code for the simulation example provided here is available [3].
[Figure 5: The (optimized) positions of the eigenvalues of the covariance matrices of the intermediate variables. (The vertical scale does not represent any quantity.)]
The 8 eigenvalues of $C_{XX}$ used for this example are shown in the bottom row of Fig. 5 using '*' signs. The three other rows, from bottom to top, show the eigenvalues of the intermediate covariances $C_{U_i U_i}$ for $i = 2, 3$ and $4$, respectively. Those depicted by '*' are obtained using stage-by-stage minimization of $h(U_i)$, while those represented by '+' are obtained by solving the multivariable optimization $\mathcal{P}''$. As can be seen, the eigenvalue sets are practically the same. This means both methods give similar values for $\det(C_{U_4 U_4})$ and, hence, for $I(X; W)$ at the end.

It should be emphasized that by trying to minimize the determinant of the intermediate covariance matrices, we are actually making them ill-conditioned! This is a very important point, as it predicts that numerical instability becomes a serious concern during the course of multivariable optimization. This artifact is avoided to a great extent when we use Algorithm 1.
7. SUMMARY AND CONCLUSION
We considered the extension of PCA ideas to correlated random signals by using multirate filter banks instead of matrices. In particular, we considered the class of orthogonal tree-structured filter banks, which are known to generate a multiresolution decomposition. We then addressed the problem of designing an optimal multiresolution transform (i.e. a PCMT) in detail. We showed that, like ordinary PCA, there is an information-theoretic interpretation for this multiresolution transform too. Finally, we proposed a simple, yet very efficient, algorithm to solve the very difficult optimization that arises in the process of PCMT design.

We summarize our proposed procedure for designing an FIR PCMT in the algorithm that follows.

Algorithm 2: To design a PCMT, choose the number of resolutions $P$ into which the signal is to be decomposed. The tree-structured filter bank implementing this transform will then have $P - 1$ two-channel paraunitary lattices in its $P - 1$ nodes. Also, choose the number of stages $N$ to be used in each paraunitary lattice. Do the following steps starting from the two-channel lattice at the root of the tree:
1. From the autocorrelation (or, equivalently, power spectral density) function of the input to the lattice, construct $C_{XX}$.

2. Use Algorithm 1 to calculate the first $N - 1$ rotation angles of the lattice.

3. Calculate $C_{U_N U_N}$ and choose the last rotation angle to diagonalize it.

4. Through a decimator, connect the lattice output with higher energy to the next node in the tree. Calculate the autocorrelation (power spectral density) function of the signal entering this node.

Repeat items 1 to 4 for subsequent two-channel lattices in the tree until the rotation angles for all $P - 1$ lattices are found.
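One possible end-to-end reading of Algorithm 2 (ours), reusing `lattice_analysis` and `algorithm1` from the earlier sketches. To keep it runnable, each node's autocorrelation is estimated from a signal realization instead of being propagated analytically as a power spectrum, which simplifies step 4:

```python
import numpy as np
from scipy.linalg import toeplitz

def diagonalizing_angle(C):
    # Rotation angle whose T_N (eq. 8) diagonalizes a symmetric 2x2 matrix.
    return 0.5 * np.arctan2(2.0 * C[0, 1], C[0, 0] - C[1, 1])

def autocorr(x, lags):
    # Biased sample autocorrelation r[k] = E{x(n)x(n-k)}, k = 0..lags-1.
    return np.array([np.dot(x[k:], x[: x.size - k]) / x.size
                     for k in range(lags)])

def design_pcmt(x, P, N):
    """Sketch of Algorithm 2 on a realization x: returns P-1 angle
    vectors, one per tree node, root first."""
    nodes = []
    for _ in range(P - 1):
        C_xx = toeplitz(autocorr(x, 2 * N))        # step 1: construct C_XX
        thetas, C_un = algorithm1(C_xx, N)         # step 2: first N-1 angles
        thetas.append(diagonalizing_angle(C_un))   # step 3: last angle
        nodes.append(thetas)
        x = x[: (x.size // 2) * 2]                 # even length for the split
        v0, v1 = lattice_analysis(x, thetas)       # step 4: keep the
        x = v0 if np.mean(v0**2) >= np.mean(v1**2) else v1  # higher-energy branch
    return nodes
```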
The PCMT has very high potential as a tool for applications involving the classification, compression, denoising and modulation of time signals. In particular, it reveals a great deal of information about the structure of a signal by showing how the signal's energy is distributed among its low-resolution components. Note that the PCMT is a signal-dependent transform, and the PCMT coefficients derived for a specific signal contain valuable information about the statistical structure of that signal. These coefficients, therefore, could themselves be used as a means of classification. Investigating the statistical properties of PCMT coefficients and their applications remains an open research topic.

As a final comment, we mention that MATLAB™ code implementing the algorithms mentioned in this paper, as well as additional related material, is available on-line [3].
8. REFERENCES

[1] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley, 1991.

[2] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417-441, 1933.

[3] O. S. Jahromi. http://www.control.toronto.edu/~omidj/publications/publications.html. World Wide Web site.

[4] R. A. Johnson and D. W. Wichern. Applied Multivariate Statistical Analysis. Prentice-Hall, 4th edition, 1998.

[5] H. P. Kramer and M. V. Mathews. A linear coding for transmitting a set of correlated signals. IRE Transactions on Information Theory, IT-2:41-46, 1956.

[6] R. Linsker. Self-organization in a perceptual network. Computer, 21(3):105-117, March 1988.

[7] S. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern Anal. Machine Intell., 11:674-693, July 1989.

[8] A. Papoulis. Probability, Random Variables and Stochastic Processes. McGraw-Hill, 3rd edition, 1991.

[9] M. K. Tsatsanis and G. B. Giannakis. Principal component filter banks for optimal multiresolution analysis. IEEE Trans. Signal Processing, 43(8):1766-1777, August 1995.

[10] P. P. Vaidyanathan. Multirate Systems and Filter Banks. Prentice-Hall, 1993.

[11] P. P. Vaidyanathan. Review of recent results on optimal orthonormal subband coders. In Proceedings of SPIE, San Diego, July 1997.

[12] P. P. Vaidyanathan. Theory of optimal orthonormal filter banks. IEEE Trans. Signal Processing, pages 1528-1543, June 1998.