IJCNN'99, Washington, DC, July 10-16, 1999

An Information Theoretic Method for Designing Multiresolution Principal Component Transforms

Omid S. Jahromi and Bruce A. Francis
Department of Electrical and Computer Engineering, University of Toronto
Toronto, Ontario, Canada M5S 3G4
{omidj, francis}@control.toronto.edu

ABSTRACT

Principal Component Analysis (PCA) is concerned, basically, with finding an optimal way to represent a random vector through a linear combination of a few uncorrelated random variables. In signal processing, multiresolution transforms are used to decompose a time signal into components of different resolutions. In this paper, we consider designing optimal multiresolution transforms such that the components in each resolution provide the best approximation to the original signal in that resolution. We call a transformation that admits this optimality property a Principal Component Multiresolution Transform (PCMT). We show that PCMTs can be designed by minimizing the information transfer through their basic building blocks. We then propose a method to do the minimization in a stage-by-stage manner. This latter method has great appeal in terms of its computational simplicity as well as its theoretical interpretation. In particular, it agrees with Linsker's principle of self-organization. Finally, we provide analytic arguments and computer simulations to demonstrate the efficiency of our method.

1. INTRODUCTION

Principal component analysis (PCA) is a widely used statistical technique in the theory of neural networks as well as in many other fields of natural science. In particular, it can be used to design optimum source compression methods in a class called transform coding. Conventional transform coding techniques suffer from several drawbacks, including the complexity of estimating the optimal transform, the high number of parameters to be estimated, and the so-called blocking effect, which is caused by processing the signal to be compressed in disjoint blocks independently. Multirate filter banks (equivalently, lapped transforms) have been introduced to overcome some of these problems by providing a great deal of flexibility in the way the signal is (hypothetically) segmented before being transformed. Apart from some (impractical) special cases that are based on filters with infinite support [9], [12], little progress has been made in the development of methods for designing optimal and yet practical filter banks. Our goal in this paper is to develop such a method, from an information theory point of view, for a class of multirate filter banks described below.

A very important class of filter banks are those that implement multiresolution transforms [7]. Such filter banks admit a very appealing structure consisting of two-channel orthogonal filter banks connected as a binary tree. We show that information minimization can be used to design the optimal multiresolution transform, which we call a Principal Component Multiresolution Transform (PCMT). A PCMT has the property that it decomposes the input signal into multiresolution components in such a way that each resolution is the best possible approximation to the original signal in that resolution.

The paper is organized as follows. Section 2 introduces PCA and the concept of optimum transforms for random variables. An Info-Max interpretation of PCA is given in section 3. Section 4 introduces principal component filter banks, a concept that extends PCA to the processing of correlated random signals.
We introduce the PCMT, its implementation, and the associated design issues in section 5. In section 6 we consider the PCMT from an information theory point of view and introduce a method that greatly reduces the complexity of the nonlinear optimization associated with its design. Finally, we make some concluding remarks in section 7.

Notation: Vectors are denoted by capital letters. The error vector, however, is denoted by $\epsilon$. Boldface capital letters are used for matrices. The $ij$ element of a matrix $A$ is denoted by $(A)_{ij}$.

2. PRINCIPAL COMPONENT ANALYSIS

The idea of using linear subspaces for statistical data analysis goes back to the 1930s, when Hotelling [2] introduced Principal Component Analysis. The importance of Hotelling's approach to data compression was apparently first realized by Kramer and Mathews [5] in the 1950s. Since then, PCA has been widely used for data compression, pattern classification, and noise reduction applications.

Principal component analysis is concerned with explaining the variance-covariance structure of a set of $N$ variables through a few, $K < N$, linear combinations of these variables. Although $N$ components are required to reproduce the total system variability, often much of this variability can be accounted for by a small number $K$ of the principal components. If so, there is (almost) as much information in the $K$ components as there is in the original $N$ variables. The $K$ principal components can then replace the original $N$ variables, and the original data set, consisting of $P$ measurements on $N$ variables, is compressed to a data set consisting of $P$ measurements on $K$ principal components.

Let's consider an $N$-dimensional stochastic process $X(n)$ whose samples are independent and identically distributed. (To simplify the notation, we drop the time index $n$ wherever time dependency is not of explicit concern.) The autocorrelation matrix of $X$ is defined as

$$ C_{XX} = E\{X X^T\} \qquad (1) $$

where $E$ is the mathematical expectation operator. It can be shown (e.g., [8]) that every autocorrelation matrix $C_{XX}$ can be expressed in terms of its orthonormal eigenvectors $U_i \in \mathbb{R}^N$:

$$ C_{XX} = \sum_{i=1}^{N} \lambda_i U_i U_i^T \qquad (2) $$

where the $\lambda_i$ are the eigenvalues of $C_{XX}$. The $U_i$ come in handy for the statistical approximation of $X$, an arbitrary vector in $\mathbb{R}^N$:

$$ \hat{X} = \sum_{i=1}^{K} (U_i^T X) U_i. \qquad (3) $$

The approximation error is then given by

$$ \epsilon = X - \hat{X} = \sum_{i=K+1}^{N} (U_i^T X) U_i. $$

The coefficients $U_i^T X$ are called the principal components of $X$, and by using an increasing number of the largest terms in (3), $X$ is approximated with increasing accuracy. If $K = N$, $\epsilon$ becomes zero.

The approximation formula above can be written in a more compact form by packing the principal components of interest (i.e., the $K$ largest ones) in a vector $\tilde{X}$. Doing so, we can rewrite (3) as

$$ \hat{X} = P^T \tilde{X} \qquad (4) $$

where

$$ \tilde{X} = \begin{pmatrix} U_1^T X \\ U_2^T X \\ \vdots \\ U_K^T X \end{pmatrix} = P X, \quad \text{and} \quad P = \begin{pmatrix} U_1^T \\ U_2^T \\ \vdots \\ U_K^T \end{pmatrix}. \qquad (5) $$

Assuming that $X$ is sampled from the random process $X(n)$, the approximation given by (4) is optimal in the sense that it results in a residual with minimum expected norm. This is to say, $P$ as given in (5) minimizes $E\{\epsilon^T \epsilon\}$ for any fixed value of $K$ [8]. In this sense, the matrix $P$ is the optimum linear transformation that reduces the $N$-dimensional vector of variables $X$ to a $K$-dimensional vector $\tilde{X}$. The matrix $P$ is commonly referred to as the Karhunen-Loève transform (KLT) in the signal processing literature. (In statistics, it is also known as the Hotelling transform or simply the principal component transform (PCT).) As an orthogonal matrix, the KLT is very rich in structure and properties [4].
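To make equations (1)-(5) concrete, the following is a minimal Python/NumPy sketch (our illustration, not the authors' MATLAB code referenced in [3]; the dimensions and synthetic data are invented). It estimates $C_{XX}$ from samples, builds $P$ from the $K$ dominant eigenvectors, and checks that the mean squared reconstruction error equals the sum of the discarded eigenvalues.

import numpy as np

rng = np.random.default_rng(0)
N, K, P_samples = 8, 3, 100_000

# synthetic zero-mean data with a nontrivial covariance
A = rng.standard_normal((N, N))
X = rng.standard_normal((P_samples, N)) @ A.T     # rows are realizations of X

Cxx = X.T @ X / P_samples                         # (1): C_XX = E{X X^T}
lam, U = np.linalg.eigh(Cxx)                      # (2): eigendecomposition
order = np.argsort(lam)[::-1]                     # sort eigenvalues descending
lam, U = lam[order], U[:, order]

P_mat = U[:, :K].T                                # (5): rows are U_1^T, ..., U_K^T
X_tilde = X @ P_mat.T                             # principal components, X_tilde = P X
X_hat = X_tilde @ P_mat                           # (4): X_hat = P^T X_tilde

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse, lam[K:].sum())                         # E{eps^T eps} = sum_{i>K} lambda_i

The two printed numbers agree (up to floating point), which is exactly the optimality statement following (5): the residual energy of the rank-$K$ KLT approximation is the sum of the $N - K$ smallest eigenvalues.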
3. AN INFORMATION-THEORETIC INTERPRETATION

In the previous section we derived the KLT solely on the basis of its optimal mean-square approximation properties. It has a very nice interpretation based on information-theoretic concepts too! Here we briefly explain this latter interpretation, as it will play a central role in our development of design methods for multiresolution PCA in section 6.

The mutual information between $\tilde{X}$ and $X$ can, in general, be expressed as

$$ I(\tilde{X}; X) = h(\tilde{X}) - h(\tilde{X} \mid X) $$

where $h(\tilde{X})$ denotes the (differential) entropy and $h(\tilde{X} \mid X)$ denotes the conditional entropy of $\tilde{X}$ given $X$ [1]. Here, the conditional entropy $h(\tilde{X} \mid X)$ is zero because there is no uncertainty in evaluating $\tilde{X}$ when $X$ is known. Hence

$$ I(\tilde{X}; X) = h(\tilde{X}). \qquad (6) $$

Now, assuming that the original random process $X$ is Gaussian with zero mean, the (differential) entropy $h(\tilde{X})$ is given by [8]

$$ h(\tilde{X}) = \frac{1}{2}\left(K + K \log(2\pi) + \log\left|\det(C_{\tilde{X}\tilde{X}})\right|\right). \qquad (7) $$

Substituting (7) in (6), the problem of maximizing the mutual information $I(\tilde{X}; X)$ reduces to maximizing the determinant of $C_{\tilde{X}\tilde{X}}$. It is straightforward to verify that the matrix $P$ as given in (5) is in fact the matrix that maximizes $\det(C_{\tilde{X}\tilde{X}})$ subject to the orthogonality condition $P P^T = I_K$. This proves that the KLT is also optimal in the sense of transferring maximum information about the random vector $X$ to $\tilde{X}$.

4. PRINCIPAL COMPONENT FILTER BANKS

In this section we extend the idea of optimum rate reduction (compression) to the case of correlated stationary random signals. To do this, we have to go beyond simple orthogonal transforms and use a special class of multi-input multi-output dynamical systems known as filter banks.

[Figure 1: An M-channel multirate analysis/synthesis filter bank. Analysis bank: filters H_0(z), ..., H_{M-1}(z) followed by M-fold decimators producing the subband signals v_0(n), ..., v_{M-1}(n). Synthesis bank: M-fold expanders followed by filters F_0(z), ..., F_{M-1}(z) producing y(n).]

An $M$-channel analysis/synthesis filter bank is shown in Fig. 1. The filters $H_0$ to $H_{M-1}$ constitute the analysis filter bank. These filters, together with the decimators following them, generate $M$ subband signals whose rate is $1/M$ of that of the input signal. The idea is to design the analysis and synthesis filters such that:

1. When all the subband channels are transmitted to the synthesis bank, the original signal is perfectly reconstructed by the synthesis bank (within, perhaps, a delay).

2. When only $K$ out of $M$ subband channels are used for the synthesis, the mean-square error between the input $x(n)$ and the synthesized signal $y(n)$ is minimum.

3. The analysis and synthesis banks are lossless. (This requirement is not actually necessary, but it induces very convenient properties in both the design and implementation phases.) This is to say, the sum of the subband signals' energies is equal to the number of channels times the input signal energy: $\sum_{i=0}^{M-1} E\{v_i^2(n)\} = M\, E\{x^2(n)\}$.

A filter bank that satisfies the above properties for all values of $0 < K < M$ is called a principal component filter bank (PCFB) [9]. Methods for designing PCFBs when the filters are allowed to have ideal (brick-wall) frequency responses have been developed recently [9], [12]. These methods, however, are only of theoretical interest since the resulting ideal filters are not realizable.

In this paper, we do not consider uniform filter banks like the one shown in Fig. 1. (Uniform filter banks have some drawbacks. In particular, their lattice implementation, for $M > 2$, fails to satisfy condition (1) when finite-precision arithmetic is involved [10].) Instead, we consider a class of filter banks called tree-structured filter banks. These filter banks are constructed by cascading two-channel filter banks in a dyadic tree (Fig. 2).
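As a simple sanity check of conditions (1) and (3), the Python sketch below implements the two-channel Haar bank, the simplest orthogonal filter bank (our choice for illustration; the paper itself places no such restriction on the filters). It verifies perfect reconstruction from both subbands and the losslessness identity with M = 2.

import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(2**12)

xe, xo = x[0::2], x[1::2]
v0 = (xe + xo) / np.sqrt(2)            # lowpass subband, at half the input rate
v1 = (xe - xo) / np.sqrt(2)            # highpass subband, at half the input rate

# condition (1): perfect reconstruction when both subbands are kept
y = np.empty_like(x)
y[0::2] = (v0 + v1) / np.sqrt(2)
y[1::2] = (v0 - v1) / np.sqrt(2)
assert np.allclose(y, x)

# condition (3): losslessness, E{v0^2} + E{v1^2} = 2 E{x^2}
print(np.mean(v0**2) + np.mean(v1**2), 2 * np.mean(x**2))

Condition (2) is the one that depends on the input statistics; for the Haar pair above it holds only for particular inputs, which is precisely why the design problem of the following sections arises.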
5. PRINCIPAL COMPONENT MULTIRESOLUTION TRANSFORMS

Tree-structured filter banks have very appealing properties. In particular, they perform a multiresolution decomposition of the input signal space and, hence, are very closely related to wavelet transforms [7]. In fact, the outputs of the analysis filters can be thought of as approximations of the input signal at different resolutions or rates. Now, the question that naturally arises is how to choose the best filter bank (equivalently, the best wavelet) so that these lower-resolution (lower-rate) approximations represent the original signal as closely as possible. In this context, using arguments similar to those in section 4, one can define the optimal filter bank as the one that satisfies the three conditions mentioned there. We call this optimal filter bank a principal component multiresolution transform (PCMT).

[Figure 2: A multiresolution transform (tree-structured analysis bank); two-channel sections with filters H_00(z), H_01(z), H_10(z), H_11(z), 2-fold decimators, and unit delays.]

It is straightforward to verify that, to obtain a PCMT, it is sufficient that each two-channel section satisfy the three optimality conditions. This is to say, a PCMT can be obtained simply by cascading two-channel PCFBs in a tree structure [11]. In addition, it is known that an FIR two-channel filter bank satisfying conditions (1) and (3) can be implemented using a so-called paraunitary lattice [10]. This lattice, for the analysis filters, is shown in Fig. 3. (The synthesis filters are then implemented using the dual of this lattice.) In this figure, the blocks named $T_1, T_2, \ldots, T_N$ are $2 \times 2$ rotation matrices. That is,

$$ T_i = \begin{pmatrix} \cos(\theta_i) & \sin(\theta_i) \\ -\sin(\theta_i) & \cos(\theta_i) \end{pmatrix} \qquad (8) $$

where the $\theta_i$ are rotation angles that determine the free parameters of the lattice.

[Figure 3: A two-channel paraunitary filter bank (a) and its lattice representation (b); the lattice consists of rotation blocks T_1, T_2, ..., T_N separated by delays.]

For a two-channel paraunitary lattice to be a PCFB, therefore, one has only to choose the $N$ rotation angles $\theta_1$ to $\theta_N$ such that condition (2) is satisfied. Condition (2) requests minimum reconstruction error when only one subband is used. Suppose we want to reconstruct $x(n)$ using $v_0(n)$ only (Fig. 3); knowing that the lossless condition (3) is automatically satisfied by the paraunitary lattice, we can easily verify that $E\{(x(n) - y(n))^2\} = E\{v_1^2(n)\}$. This means the optimal analysis bank should be designed such that the subband to be dropped has minimum energy (equivalently, the retained subband should have maximum possible energy, because the total subband energy is constant). Obviously, this energy depends on the input signal statistics and the choice of rotation angles. Hence, turning a paraunitary lattice into a two-channel PCFB is equivalent to the following optimization problem:

P: Given the input signal statistics and the lattice in Fig. 3, find $T_1, \ldots, T_N$ such that $E\{v_1^2(n)\}$ is minimized.

P is a nonlinear optimization problem and no analytic solution has been reported for it yet.
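The following Python sketch builds the polyphase matrix $E(z) = T_N \Lambda(z) \cdots \Lambda(z) T_1$ of the lattice in Fig. 3(b) from the rotation angles of (8), with $\Lambda(z) = \mathrm{diag}(1, z^{-1})$ (our reading of the standard paraunitary lattice [10]), and numerically checks the paraunitary property $\tilde{E}(z) E(z) = I$, which is what guarantees conditions (1) and (3) for any choice of angles.

import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s], [-s, c]])          # the rotation block of (8)

def polyphase_taps(thetas):
    """Taps E_0, ..., E_{N-1} of E(z) = T_N Lambda(z) ... Lambda(z) T_1."""
    E = rot(thetas[0])[None, :, :]              # shape (num_taps, 2, 2)
    for th in thetas[1:]:
        D = np.zeros((E.shape[0] + 1, 2, 2))
        D[:-1, 0, :] = E[:, 0, :]               # top row: passes straight through
        D[1:, 1, :] = E[:, 1, :]                # bottom row: delayed by one tap
        E = rot(th)[None, :, :] @ D             # apply the next rotation to every tap
    return E

thetas = np.random.default_rng(2).uniform(0, np.pi, size=4)
E = polyphase_taps(thetas)

# paraunitary check: sum_k E_k^T E_{k+m} = I for m = 0 and 0 otherwise
for m in range(E.shape[0]):
    G = sum(E[k].T @ E[k + m] for k in range(E.shape[0] - m))
    target = np.eye(2) if m == 0 else np.zeros((2, 2))
    assert np.allclose(G, target)

Conditions (1) and (3) thus hold for every angle choice; condition (2), by contrast, does depend on the angles, and finding them is exactly problem P above.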
A main contribution of this paper is to reveal the structure and the level of difficulty of this optimization. We achieve this in the next section by showing that it is, in fact, a multi-stage information minimization problem. As a by-product, we also show that one often has to deal with ill-conditioned covariance matrices which, in turn, make achieving accurate results computationally demanding.

6. INFORMATION TRANSFER IN PARAUNITARY LATTICES

To start with, we manipulate the paraunitary lattice in Fig. 3(b) and draw it in a form that contains a delay chain followed by a memoryless structure consisting of $2 \times 2$ rotation blocks (Fig. 4). That this structure is in fact equivalent to the lattice in Fig. 3(b) can be verified using signal flow graph methods. Also introduced in Fig. 4 are intermediate vector variables denoted by $U_1, \ldots, U_N$. Assuming that $x(n)$ is wide-sense stationary and Gaussian, these vectors will be Gaussian random variables whose covariance matrices we denote by $C_{U_i U_i}$. The $T_i$ blocks are orthogonal rotations; the signal energy is thus conserved from their input to their output irrespective of the rotation angle $\theta_i$. Using this fact, it is straightforward to show the following.

Theorem 1: $\mathrm{Tr}\{C_{U_i U_i}\} = 2(N - i + 1)\, E\{x^2(n)\}$.

[Figure 4: Modified representation of a two-channel paraunitary lattice: a delay chain feeds the vector $U_1 = X$ into successive racks of rotations $T_1, \ldots, T_{N-1}$, producing $U_2, \ldots, U_N$ and, after $T_N$, the output vector $W = (v_0(n), v_1(n))^T$.]

The above theorem basically means that, in passing through the rack of rotations from stage $i$ to stage $i+1$, the signal energy is reduced by a constant factor, independent of the choice of rotation angles. (This factor is simply the ratio of the number of retained branches to the number of input branches in each stage. Also note that for the last stage $\mathrm{Tr}\{C_{WW}\} = \mathrm{Tr}\{C_{U_N U_N}\} = 2 E\{x^2(n)\}$.)

The next theorem establishes the link between PCFBs and information theory by showing that P is an information minimization problem.

Theorem 2: For a stationary Gaussian input, an $N$-stage paraunitary lattice is a PCFB iff $I(X; W)$, i.e., the mutual information between the input vector $X$ and the output vector $W$, is minimum and $C_{WW}$ is diagonal.

Proof: Recalling P, we would like to find the rotation matrices $T_i$ such that $E\{v_1^2(n)\}$ is minimum. Using the notation introduced in Fig. 4, $E\{v_1^2(n)\}$ is equal to the lower diagonal element of $C_{WW}$, i.e., $(C_{WW})_{22}$. Let's call the eigenvalues of this covariance matrix $\lambda_1$ and $\lambda_2$, where the indices are chosen such that $\lambda_1 \ge \lambda_2$. For a covariance matrix, the eigenvalues majorize the diagonal elements; that is to say, $(C_{WW})_{22} \ge \lambda_2$, with equality if and only if $C_{WW}$ is a diagonal matrix [4]. To minimize $(C_{WW})_{22}$, therefore, the last rotation matrix $T_N$ should diagonalize $C_{WW}$. Choosing $T_N$ to do so, we get

$$ E\{v_1^2(n)\} = \lambda_2. \qquad (9) $$

Since $T_N$ is orthogonal, $\lambda_1$ and $\lambda_2$ are eigenvalues of $C_{U_N U_N}$ too. Hence, problem P reduces to finding rotation blocks $T_1$ to $T_{N-1}$ such that the smallest eigenvalue of $C_{U_N U_N}$ is minimized. Now, noticing the fact that

$$ \lambda_1 + \lambda_2 = \mathrm{Tr}\{C_{U_N U_N}\} = \mathrm{const}, $$

minimizing the smallest eigenvalue $\lambda_2$ becomes the same as minimizing the product $\lambda_1 \lambda_2$ (with the sum fixed, the product $\lambda_2(\mathrm{const} - \lambda_2)$ is increasing in $\lambda_2$ over $\lambda_2 \le \lambda_1$), which, in turn, is equal to $\det(C_{U_N U_N})$. Up to now, we have shown that P is equivalent to

P': Find $T_1, \ldots, T_{N-1}$ such that $\det(C_{U_N U_N})$ is minimized. Then, find $T_N$ to diagonalize $C_{U_N U_N}$.

There exists an information-theoretic interpretation for P'. To see this, we first note that if a vector $U_i$ is given, all subsequent vectors $U_j$, $j > i$, are uniquely known. This means $h(U_j \mid U_i) = 0$ for $j > i$. The mutual information between these vectors, therefore, is given by the differential entropy of the one with the smaller dimension:

$$ I(U_i; U_j) = h(U_j), \quad j > i. \qquad (10) $$

Since $x(n)$ is assumed to be Gaussian, the vectors $U_i$ have Gaussian distributions too. Therefore, minimizing $\det(C_{U_N U_N})$ (as required by P') is the same as minimizing the differential entropy $h(U_N)$. Using (10) with $j = N$ and $i = 1$, this latter minimization can be further interpreted as minimizing $I(U_1; U_N)$. Finally, we notice that there is no loss of information in going from $U_N$ to $W$. In other words, there is no distinction between minimizing $I(U_1; U_N)$ and minimizing $I(U_1; W)$. This completes the proof, since $U_1 = X$.
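The exact wiring of Fig. 4 cannot be reproduced here, so the Python sketch below encodes our reading of it, offered as an illustrative model only: at each stage the same rotation $T(\theta_i)$ acts on every pair, and each top output is then re-paired with the delayed bottom output of the neighbouring butterfly, dropping one pair per stage. Under this assumed model, the sketch propagates the covariance $C_{U_i U_i}$ stage by stage for an AR(1)-type input and confirms the trace identity of Theorem 1 for arbitrary angles.

import numpy as np
from scipy.linalg import toeplitz

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s], [-s, c]])

def next_cov(C, theta):
    """One stage of (our reading of) Fig. 4: rotate every pair by T(theta),
    then re-pair each top output with the bottom output of the next
    (delayed) butterfly, dropping one pair."""
    m = C.shape[0] // 2                      # number of butterflies at this stage
    R = np.kron(np.eye(m), rot(theta))       # the same rotation on every pair
    Cq = R @ C @ R.T
    keep = [j for k in range(m - 1) for j in (2 * k, 2 * k + 3)]
    return Cq[np.ix_(keep, keep)]

N, rho = 4, 0.9
r = rho ** np.arange(2 * N)                  # AR(1)-type autocorrelation, r(0) = 1
C = toeplitz(r)                              # C_{U_1 U_1} = C_XX (2N x 2N)

rng = np.random.default_rng(3)
for i in range(1, N):                        # arbitrary angles: the trace is invariant
    C = next_cov(C, rng.uniform(0, np.pi))
    print(np.trace(C), 2 * (N - i))          # Theorem 1: Tr = 2(N-i) E{x^2}

The printed pairs match for every random angle draw, which is consistent with Theorem 1 and supports this reading of the stage structure.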
As mentioned in the proof above, $I(X; W)$ is independent of the choice of $T_N$. This leads to the following procedure for designing a two-channel PCFB:

P'': For the lattice in Fig. 3, find $T_1, \ldots, T_{N-1}$ such that $I(X; U_N) = h(U_N)$ is minimized. Then choose $T_N$ to diagonalize $C_{U_N U_N}$.

P'' is still a multivariable nonlinear optimization problem in the rotation angles $\theta_1, \theta_2, \ldots, \theta_{N-1}$. We believe, however, that it can be further re-cast as a sequence of single-variable optimizations. The idea here has a very close relation to Linsker's principle of self-organization [6]. We state our method as

Algorithm 1: To minimize $I(X; U_N)$ for the structure shown in Fig. 4, minimize $I(U_i; U_{i+1})$ for $i = 1, \ldots, N-1$ in a successive manner.

The above method basically says that, to minimize the information available at the last stage, one can minimize the amount of information transferred from each intermediate stage to the next. (Actually, Linsker's principle for the optimization of neural networks deals with information maximization. Here we are dealing with information minimization, so we might call ours an anti-Linsker method!)

Algorithm 1 is quite appealing intuitively. However, at this time, it is not known to the authors whether it is in fact equivalent to the joint multivariable optimization defined by P''. Various simulations show that this algorithm and P'' give practically the same results but, nevertheless, we do not consider such experiments conclusive. For one reason, we cannot be sure we have obtained the global minimum in P'', since the problem is non-convex and frequently ill-conditioned. In the following, we provide numerical results for a typical case.

Simulation example: In this example we consider designing a two-channel PCFB using a 4-stage paraunitary lattice. (MATLAB code for the simulation example provided here is available [3].) The 8 eigenvalues of $C_{XX}$ used for this example are shown in the bottom row of Fig. 5 using '*' signs. The three other rows, from bottom to top, show the eigenvalues of the intermediate covariances $C_{U_i U_i}$ for $i = 2, 3$ and $4$, respectively. Those depicted by '*' are obtained using stage-by-stage minimization of $h(U_i)$, while those represented by '+' are obtained by solving the multivariable optimization P''. As can be seen, the eigenvalue sets are practically the same. This means both methods give similar values for $\det(C_{U_4 U_4})$, and hence for $I(X; W)$, at the end.

[Figure 5: The (optimized) positions of the eigenvalues of the covariance matrices of the intermediate variables. (The vertical scale does not represent any quantity.)]

It should be emphasized that, by trying to minimize the determinant of the intermediate covariance matrices, we are actually making them ill-conditioned! This is a very important point, as it predicts that numerical instability becomes a serious concern during the course of the multivariable optimization. This artifact is avoided to a great extent when we use Algorithm 1.
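Below is a sketch of Algorithm 1 under the same assumed stage model as in the previous listing (again Python rather than the authors' MATLAB [3]): each $\theta_i$ is found by a bounded scalar search minimizing $\log\det C_{U_{i+1} U_{i+1}}$, i.e., $h(U_{i+1})$ up to constants, and the last angle is the closed-form Jacobi angle that diagonalizes the final $2 \times 2$ covariance. Since the objective has period $\pi$ and can be multimodal, a grid of restarts may be needed in practice.

import numpy as np
from scipy.linalg import toeplitz
from scipy.optimize import minimize_scalar

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, s], [-s, c]])

def next_cov(C, theta):
    # same assumed stage model as in the previous sketch
    m = C.shape[0] // 2
    R = np.kron(np.eye(m), rot(theta))
    Cq = R @ C @ R.T
    keep = [j for k in range(m - 1) for j in (2 * k, 2 * k + 3)]
    return Cq[np.ix_(keep, keep)]

def algorithm1(r, N):
    """Stage-by-stage design: pick each theta_i to minimize h(U_{i+1}),
    i.e. log det C_{U_{i+1} U_{i+1}}; then pick theta_N to diagonalize
    the final 2x2 covariance C_{U_N U_N}."""
    C = toeplitz(r[: 2 * N])
    thetas = []
    for _ in range(N - 1):
        res = minimize_scalar(
            lambda th: np.linalg.slogdet(next_cov(C, th))[1],
            bounds=(0.0, np.pi), method="bounded")   # objective has period pi
        thetas.append(res.x)
        C = next_cov(C, res.x)
    a, b, d = C[0, 0], C[0, 1], C[1, 1]
    th = 0.5 * np.arctan2(2 * b, a - d)              # Jacobi angle diagonalizes C
    Cw = rot(th) @ C @ rot(th).T
    if Cw[0, 0] < Cw[1, 1]:                          # keep the high-energy branch on top
        th += np.pi / 2
    thetas.append(th)
    return thetas

rho = 0.9
print(np.round(algorithm1(rho ** np.arange(64), N=4), 4))

Each iteration solves only a one-dimensional problem over a well-conditioned intermediate covariance, which is the computational advantage of Algorithm 1 claimed above.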
7. SUMMARY AND CONCLUSION

We considered the extension of PCA ideas to correlated random signals by using multirate filter banks instead of matrices. In particular, we considered the class of orthogonal tree-structured filter banks, which are known to generate a multiresolution decomposition. We then addressed the problem of designing an optimal multiresolution transform (i.e., a PCMT) in detail. We showed that, like ordinary PCA, this multiresolution transform has an information theoretic interpretation too. Finally, we proposed a simple, yet very efficient, algorithm to solve the very difficult optimization that arises in the process of PCMT design. We summarize our proposed procedure for designing an FIR PCMT in the algorithm that follows.

Algorithm 2: To design a PCMT, choose the number of resolutions $P$ into which the signal is to be decomposed. The tree-structured filter bank implementing this transform will then have $P - 1$ two-channel paraunitary lattices in its $P - 1$ nodes. Also, choose the number of stages $N$ to be used in each paraunitary lattice. Do the following steps, starting from the two-channel lattice at the root of the tree:

1. From the autocorrelation (or, equivalently, power spectral density) function of the input to the lattice, construct $C_{XX}$.

2. Use Algorithm 1 to calculate the first $N - 1$ rotation angles of the lattice.

3. Calculate $C_{U_N U_N}$ and choose the last rotation angle to diagonalize it.

4. Through a decimator, connect the lattice output with the higher energy to the next node in the tree. Calculate the autocorrelation (power spectral density) function of the signal entering this node.

Repeat items 1 to 4 for the subsequent two-channel lattices in the tree until the rotation angles for all $P - 1$ lattices are found.

The PCMT has very high potential as a tool for applications involving classification, compression, denoising, and modulation of time signals. In particular, it reveals a great deal of information about the structure of a signal by showing how the signal's energy is distributed among its low-resolution components. Note that the PCMT is a signal-dependent transform, and the PCMT coefficients derived for a specific signal contain valuable information about the statistical structure of that signal. These coefficients, therefore, could be used as a classification means themselves. Investigating the statistical properties of PCMT coefficients and their applications remains an open research topic.

As a final comment, we mention that MATLAB code implementing the algorithms mentioned in this paper, as well as additional related material, is available on-line [3].

8. REFERENCES

[1] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.

[2] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417-441, 1933.

[3] O. S. Jahromi. http://www.control.toronto.edu/~omidj/publications/publications.html. World Wide Web site.

[4] R. A. Johnson and D. W. Wichern. Applied Multivariate Statistical Analysis. Prentice-Hall, 4th edition, 1998.

[5] H. P. Kramer and M. V. Mathews. A linear coding for transmitting a set of correlated signals. IRE Transactions on Information Theory, IT-2:41-46, 1956.

[6] R. Linsker. Self-organization in a perceptual network. Computer, 21(3):105-117, March 1988.

[7] S. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern Anal. Machine Intell., 11:674-693, July 1989.
[8] A. Papoulis. Probability, Random Variables and Stochastic Processes. McGraw-Hill, 3rd edition, 1991.

[9] M. K. Tsatsanis and G. B. Giannakis. Principal component filter banks for optimal multiresolution analysis. IEEE Trans. Signal Processing, 43(8):1766-1777, August 1995.

[10] P. P. Vaidyanathan. Multirate Systems and Filter Banks. Prentice-Hall, 1993.

[11] P. P. Vaidyanathan. Review of recent results on optimal orthonormal subband coders. In Proceedings of SPIE, San Diego, July 1997.

[12] P. P. Vaidyanathan. Theory of optimal orthonormal filter banks. IEEE Trans. Signal Processing, pages 1528-1543, June 1998.