-A Blind Noise Estimation and Compensation for Improved Characterization of Multivariate Processes by Junehee Lee B.S., Seoul National University (1993) S.M., Massachusetts Institute of Technology (1995) Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY June 2000 © Massachusetts Institute of Technology 2000. All rights reserved. Author.............. . .. .. .. .. .. .. ..'Z./ .. ... . . .. .. .. ......... .... . .. ... . epartment of Electrical Engineering and Computer Science March 7, 2000 Certified by .................. ... DaV . Staelin Professor of Electrical Engineering Thesis Supervisor A ccepted by ................. ..................Arthur C. Smith Chairman, Departmental Committee on Graduate Students MASSACHUSETTS INST ITUTE OF TECHNOLOGY JUN 2 2 20 LIBRARIES Blind Noise Estimation and Compensation for Improved Characterization of Multivariate Processes by Junehee Lee Submitted to the Department of Electrical Engineering and Computer Science on March 7, 2000, in partial fulfillment of the requirements for the degree of Doctor of Philosophy Abstract This thesis develops iterative order and noise estimation algorithms for noisy multivariate data encountered in a wide range of applications. Historically, many algorithms have been proposed for estimation of signal order for uniform noise variances, and some studies have recently been published on noise estimation for known signal order and limited data size. However, those algorithms are not applicable when both the signal order and the noise variances are unknown, which is often the case for practical multivariate datasets. The algorithm developed in this thesis generates robust estimates of signal order in the face of unknown non-uniform noise variances, and consequently produces reliable estimates of the noise variances, typically in fewer than ten iterations. The signal order is estimated, for example, by searching for a significant deviation from the noise bed in an eigenvalue screeplot. This retrieved signal order is then utilized in determining noise variances, for example, through the Expectation-Maximization (EM) algorithm. The EM algorithm is developed for jointly-Gaussian signals and noise, but the algorithm is tested on both Gaussian and non-Gaussian signals. Although it is not optimal for non-Gaussian signals, the developed EM algorithm is sufficiently robust to be applicable. This algorithm is referred to as the ION algorithm, which stands for Iterative Order and Noise estimation. The ION algorithm also returns estimates of noise sequences. The ION algorithm is utilized to improve three important applications in multivariate data analysis: 1) distortion experienced by least squares linear regression due to noisy predictor variables can be reduced by as much as five times by the ION algorithm, 2) the ION filter which subtracts the noise sequences retrieved by the ION algorithm from noisy variables increases SNR almost as much as the Wiener filter, the optimal linear filter, which requires noise variances a priori,3) the principal component (PC) transform preceded by the ION algorithm, designated as the Blind-Adjusted Principal Component (BAPC) transform, shows significant improvement over simple PC transforms in identifying similarly varying subgroups of variables. Thesis Supervisor: David H. 
Staelin Title: Professor of Electrical Engineering Acknowledgement I have been lucky to have Prof. Staelin as my thesis supervisor. When I started the long journey to Ph.D., I had no idea about what I was going to do nor how was I going to do it. Without his academic insights and ever-encouraging comments, I am certain that I would be still far away from the finish line. Thanks to Prof. Welsch and Prof. Boning, the quality of this thesis is enhanced in no small amount. They patiently went through the thesis and pointed out errors and unclear explanations. I would like to thank those at Remote Sensing Group, Phil Rosenkranz, Bill Blackwell, Felicia Brady, Scott Bresseler, Carlos Cabrera, Fred Chen, Jay Hancock, Vince Leslie, Michael Schwartz and Herbert Viggh for such enjoyable working environments. I wish to say thanks to my parents and family for their unconditional love, support and encouragement. I dedicate this thesis to Rose, who understands and supports me during all this time, and to our baby who already brought lots of happiness to our lives. Leaders for Manufacturing program at MIT, POSCO scholarship society and Korea Foundation for Advanced Studies supported this research. Thank you. 3 Contents 1 2 12 Introduction 1.1 Motivation and General Approach of the Thesis . . . . . . . . . . . . . . . . 12 1.2 T hesis O utline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Background Study on Multivariate Data Analysis 17 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.2 D efinitions . . . . . . . . . . . . . . . . . . . . . . . . . . . .......... 17 2.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2.1 Multivariate Data 2.2.2 Multivariate Data Analysis . . . . . . . . . . . . . .......... 18 2.2.3 Time-Structure of a Dataset . . . . . . . . . . . . . . . . . . . . . . . 21 2.2.4 Data Standardization . . . . . . . . . . . . . . . . . . . . . . . . . . 23 . . . . . . . . . . . . . . . . . . . . 23 Data Characterization Tools . . . . . . . . . . . . . . . . . . . . . . . 24 . . . . . . . . . . . . . . . . . . . . 24 Data Prediction Tools . . . . . . . . . . . . . . . . . . . . . . . . . . 26 . . . . . . . . . . . . . . . . . . . . . . . . 26 Total Least Squares Regression . . . . . . . . . . . . . . . . . . . . . 27 Noise-Compensating Linear Regression . . . . . . . . . . . . . . . . . 28 Principal Component Regression . . . . . . . . . . . . . . . . . . . . 32 Partial Least Squares Regression . . . . . . . . . . . . . . . . . . . . 33 Ridge Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 Rank-Reduced Least-Squares Linear Regression . . . . . . . . . . . . 34 . . . . . . . . . . . . . . . . . . . . . . . . . 38 Noise Estimation through Spectral Analysis . . . . . . . . . . . . . . 39 Traditional Multivariate Analysis Tools 2.3.1 Principal Component Transform 2.3.2 Least Squares Regression 2.3.3 Noise Estimation Tools 4 Noise-Compensating Linear Regression with estimated Ce 3 . 4 41 47 Blind Noise Estimation 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.2 D ata M odel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.2.1 Signal Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3.2.2 Noise M odel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.3 Motivation for Blind Noise Estimation . . . . . . . . . . . . . . . . . . . . . 51 3.4 Signal Order Estimation through Screeplot . . . . . . . . . . 
. . . . . . . . 52 . . . . . . . . . . . . . . . . . 54 3.4.1 Description of Two Example Datasets 3.4.2 Qualitative Description of the Proposed Method for Estimating the Number of Source Signals . . . . . . . . . . . . . . . . . . . . . . . . 56 . . . . . . . . . . . 58 3.4.3 Upper and Lower Bounds of Eigenvalues of Cxx 3.4.4 Quantitative Decision Rule for Estimating the Number of Source Signals 62 . . . . . . . . . . . . . . . . . . . . 62 Determination of the Transition Point . . . . . . . . . . . . . . . . . 64 . . . . . . . . . . . . . . . . . . . . . . . . . . 64 . . . . . . 65 3.5.1 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 3.5.2 Expectation Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 3.5.3 Maximization Step . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 3.5.4 Interpretation and Test of the Algorithm . . . . . . . . . . . . . . . 73 An Iterative Algorithm for Blind Estimation of p and G . . . . . . . . . . . 75 Evaluation of the Noise Baseline Test of the Algorithm 3.5 3.6 Noise Estimation by Expectation-Maximization (EM) Algorithm 3.6.1 Description of the Iterative Algorithm of Sequential Estimation of . . . . . . . . . . . . . . 76 Test of the ION Algorithm with Simulated Data . . . . . . . . . . . 78 Source Signal Number and Noise Variances 3.6.2 4 . . . 83 Applications of Blind Noise Estimation 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.2 Blind Adjusted Principal Component Transform and Regression . . . . . . 84 4.2.1 4.2.2 Analysis of Mean Square Error of Linear Predictor Based on Noisy Predictors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 Increase in MSE due to Finite Training Dataset . . . . . . . . . . . . 87 5 4.2.3 Description of Blind-Adjusted Principal Component Regression . 89 4.2.4 Evaluation of Performance of BAPCR . . . . . . . . . . . . . . . 91 y as a Function of Sample Size of Training Set . . . . . . . . . . 92 . . . . . . . . . . . 95 -y as a Function of Noise Distribution . . . . . . . . . . . . . . . . 97 ION Noise Filter: Signal Restoration by the ION algorithm . . . . . . . 98 4.3.1 Evaluation of the ION Filter . . . . . . . . . . . . . . . . . . . . 99 4.3.2 ION Filter vs. Wiener filter . . . . . . . . . . . . . . . . . . . . . 101 Blind Principal Component Transform . . . . . . . . . . . . . . . . . . . 102 -y as a Function of Number of Source Signals 4.3 4.4 5 Evaluations of Blind Noise Estimation and its Applications on Real Datasets108 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 5.2 Remote Sensing Data 109 5.3 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.1 Background of the AIRS Data . . . . . . . . . . . . . . . . . . . . . 110 5.2.2 Details of the Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 5.2.3 Test Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113 Packaging Paper Manufacturing Data . . . . . . . . . . . . . . . . . . . . . 115 5.3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 5.3.2 Separation of Variables into Subgroups using PC Transform . . . . . 119 5.3.3 Quantitative Metric for Subgroup Separation . . . . . . . . . . . . . 119 5.3.4 Separation of Variables into Subgroups using BAPC transform . . . 121 5.3.5 Verification of Variable Subgroups by Physical Interpretation . . . . 123 5.3.6 Quality prediction 130 . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . Conclusion 133 6.1 Summary of the Thesis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133 6.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 6.3 Suggestions for Further Research . . . . . . . . . . . . . . . . . . . . . . . . 135 6 List of Figures 2-1 Graph of the climatic data of Table 2.1 2-2 Scatter plots of variables chosen from Table 2.1. (a) average high temperature vs. . . . . . . . . . . . . . . . . . . . . average low temperature, (b) average high temperature vs. precipitation. average . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2-3 Graph of the climatic data of Table 2.2 2-4 Statistical errors minimized in, (a) least squares regression and (b) total least squares regression. 20 . . . . . . . . . . . . . . . . . . . . 21 22 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2-5 Noise compensating linear regression . . . . . . . . . . . . . . . . . . . . . . 30 2-6 A simple illustration of estimation of noise variance for a slowly changing variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 2-7 A typical X (t) and its power spectrum density. . . . . . . . . . . . . . . . . 40 2-8 Typical time plots of X 1 , - --, X 5 and Y. . . . . . . . . . . . . . . . . . . . . 42 2-9 Zero-Forcing of eigenvalues . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3-1 Model of the Noisy Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 3-2 Signal model as instantaneous mixture of p source variables . . . . . . . . . 50 3-3 Illustration of changes in noise eigenvalues for different sample sizes (Table 3.1). 57 3-4 Two screeplots of datasets specified by Table 3.1 3-5 A simple illustration of lower and upper bounds of eigenvalues of CXX. 3-6 Repetition of Figure 3-4 with the straight lines which fit best the noise dominated eigenvalues. 3-7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 61 63 Histogram of estimated number of source signal by the proposed method. Datasets of Table 3.1 are used in simulation. 3-8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Flow chart of the EM algorithm for blind noise estimation. 7 . . . . . . . . . 65 74 3-9 Estimated noise variances for two simulated datasets in Table 3.1 using the EM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75 3-10 A few estimated noise sequences for the first dataset in Table 3.1 using the EM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 3-11 A few estimated noise sequences for the second dataset in Table 3.1 using the EM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 3-12 Flow chart of the iterative sequential estimation of p and G . . . . . . . . . 79 3-13 The result of the first three iterations of the ION algorithm applied to the second dataset in Table 3.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4-1 Model for the predictors X 1 , - - -, X, and the response variable Y. . . . . . . 86 4-2 The effect of the size of the training set in linear prediction 88 4-3 Schematic diagram of blind-adjusted principal component regression. ..... 4-4 Simulation results for examples of Table 4.1 using linear regression, PCR, BAPCR, and NAPCR as a function of m. p = 15. 4-5 90 . . . . . . . . . . . . . . 94 Simulation results for examples of Table 4.2 using linear regression, PCR, BAPCR, and NAPCR as a function of p. 4-6 . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . 96 Simulation results for examples of Table 4.3 using linear regression, PCR, BAPCR, and NAPCR as a function of noise distribution. The horizontal axis of each graph represents the exponent of the diagonal elements of G. 4-7 . Increases in SNR achieved by the ION filtering for examples of Table 4.1 as a function of m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-8 100 Increases in SNR achieved by the ION filtering for examples of Table 4.2 as a function of p. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-9 98 101 Performance comparison of the Wiener filter, the ION filter, and the PC filter using examples of Table 4.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . 4-10 Schematic diagram of the three different principal component transforms. . 103 104 4-11 Percentage reduction of MSE achieved by the BPC transform over the noisy transform using Example (b) of Table 4.1. . . . . . . . . . . . . . . . . . . . 106 4-12 Percentage reduction of MSE achieved by BPC transform over the noisy 5-1 transform using Example (d) of Table 4.1. . . . . . . . . . . . . . . . . . . . 107 Format of an AIRS dataset 111 . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 5-2 Signal and noise variances of simulated AIRS dataset. . . . . . . . . . . . . 112 5-3 Three competing filters and notations. . . . . . . . . . . . . . . . . . . . . . 113 5-4 240 Variable AIRS dataset. (a) Signal and noise variances. (b) Eigenvalue screeplot. 5-5 . ........ ......... ....................... .... Plots of SNR of unfiltered, ION-filtered, PC-filtered, and Wiener-filtered datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-6 114 An example eigenvalue screeplot where signal variances are pushed to a higher principal component due to larger noise variances of other variables. 5-7 114 . . . . 115 Differences in signal-to-noise ratio between pairs of ZWIENER and ZION and of ZION and ZPc. (a) Plot of SNR2WIENER SNR 2 pc . . . . . . . . . . . . . . . - SNR 2 ION- - - ... (b) Plot of SNR2ION - . . . . . . . . . . . . 116 5-8 Schematic diagram of paper production line of company B-COM . . . . . . 117 5-9 An eigenvalue screeplot of 577 variable paper machine C dataset of B-COM. 118 5-10 First eight eigenvectors of 577 variable paper machine C dataset of B-COM. 120 5-11 Retrieved Noise standard deviation of the 577-variable B-COM dataset. 122 . . .. 5-12 Eigenvalue screeplot of blind-adjusted 577-variable B-COM dataset. . .... 123 5-13 First eight eigenvectors of the ION - normalized 577 variable B-COM dataset 124 5-14 Time plots of the five variables classified as a subgroup by the first eigenvector. 126 5-15 Time plots of the four variables classified as a subgroup by the second eigenvector. ......... ....................................... 127 5-16 Every tenth eigenvectors of the ION-normalized 577-variable B-COM dataset. 129 5-17 Scatter plot of true and predicted CMT values. Linear least-squares regression is used to determine the prediction equation. . . . . . . . . . . . . . . . 131 5-18 Scatter plot of true and predicted CMT values. BAPCR is used to determine the prediction equation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 132 List of Tables 2.1 Climatic data of capital cities of many US states in January (obtained from Yahoo internet site) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.2 Climatic data of Boston, MA (obtained from Yahoo internet site) . 
. . . . . 22 2.3 Typical parameters of interest in multivariate data analysis . . . . . . . . . 24 2.4 Results of noise estimation and compensation (NEC) linear regression, compared to the traditional linear regression and noise-compensating linear regression with known Ce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 2.5 Results of NEC linear regression when noise overcompensation occurs. . . . 44 2.6 Improvement in NEC linear regression due to zero-forcing of eigenvalues. . . 46 3.1 Summary of parameters for two example datasets to be used throughout this chapter. .......... ...................................... 56 3.2 Step-by-step description of the ION algorithm . . . . . . . . . . . . . . . . . 4.1 Important parameters of the simulations to evaluate BAPCR. Both A 1 and 80 A 2 are n x n diagonal matrices. There are n - p zero diagonal elements in each matrix. The p non-zero diagonal elements are i- 4 for A 1 and i- 3 for A 2, i = 1, *- - , . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 93 Important parameters of the simulations to evaluate BAPCR as a function of p. The values of this table are identical to those of Table 4.1 except for p and m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3 Important parameters of the simulations to evaluate BAPCR as a function ofG.... . . . ... 4.4 95 ......................................... 97 Mean square errors of the first five principal components obtained through the BPC and the traditional PC transforms using example (b) of Table 4.1. 10 106 4.5 Mean square errors of the first five principal components obtained through BPC and the traditional PC transforms using example (d) of table 4.1. . . . 107 5.1 GSM values for the eigenvectors plotted in Figure 5-10. . . . . . . . . . . . 121 5.2 GSM values for the eigenvectors plotted in Figure 5-13. . . . . . . . . . . . 123 5.3 Variables classified as a subgroup by the first eigenvector in Figure 5-13. . 125 5.4 Variables classified as a subgroup by the second eigenvector in Figure 5-13. 128 5.5 GSM values for the eigenvectors in Figure 5-16. . . . . . . . . . . . . . . . . 128 11 Chapter 1 Introduction 1.1 Motivation and General Approach of the Thesis Multivariate data are collected and analyzed in a variety of fields such as macroeconomic data study, manufacturing quality monitoring, weather forecasting using multispectral remote sensing data, medical and biological studies, and so on. For example, manufacturers of commodities such as steel, paper, plastic film, and semiconductor wafers monitor and store process variables with the hope of having a better understanding of their manufacturing processes so that they can control the processes more efficiently. Such a dataset of multiple variables is called multivariate data, and any attempt to summarize, modify, group, transform the dataset constitutes multivariate data analysis. One growing tendency in multivariate data analysis is toward larger to-be-analyzed datasets. More specifically, the number of recorded variables seems to grow exponentially and the number of observations becomes large. In some cases, the increase in size is a direct consequence of advances in data analysis tools and computer technology, thus enabling analysis of larger datasets which would not have been possible otherwise. This is what we will refer to as capability driven increase. 
On the other hand, necessity driven increase in data size occurs when the underlying processes to be monitored grow more complicated and there are more variables that need to be monitored as a result. For example, when the scale and complexity of manufacturing processes become larger, the number of process variables to be monitored and controlled increases fast. In some cases, the number of variables increases so fast that the slow-paced increase in knowledge of the physics of manufacturing processes cannot provide enough a priori knowledge which traditional analysis methods 12 require. Among such frequently required a priori information are noise variances and signal order. Exact definitions of them are provided in Chapter 3. The obvious disadvantage of lack of a priori information is that traditional analysis tools requiring the absent information cannot be used, and should be replaced with a suboptimal tool which functions without the a priori information. One example is the pair of Noise-Adjusted Principal Component (NAPC) transform [1] and Principal Component (PC) transform [2]. As we explain in detail in subsequent chapters, the NAPC transform is in general superior to the PC transform in determining signal order and filtering noise out of noisy multivariate dataset. The NAPC transform, however, requires a priori knowledge of noise variances which in many practical large-scale multivariate datasets are not available. As a result, the PC transform, which does not require noise variances, replaces the NAPC transform in practice, producing less accurate analysis results. This thesis investigates possibilities of retrieving from sample datasets vital but absent information. Our primary interest lies in retrieving noise variances when the number of signals is unknown. We will formulate the problem of joint estimation of signal order and noise variances as two individual problems of parameter estimation. The two parameters, signal order and noise variances, are to be estimated alternatively and iteratively. In developing the algorithm for estimating noise variances, we will use an information theoretic approach by assigning probability density functions to signal and noise vectors. For the thesis, we will consider only Gaussian signal and noise vectors, although the resulting methods are tested using non-ideal real data. The thesis also seeks to investigate a few applications of to-be-developed algorithms. There may be many potentially important applications of the algorithm, but we will limit our investigation to linear regression, principal component transform, and noise filtering. The three applications in fact represent the traditional tools which either suffer vastly in performance or fail to function as intended. We compare performances of these applications without the unknown parameters and with retrieved parameters for many examples made of both simulated and real datasets. 13 1.2 Thesis Outline Chapter 2 reviews some basic concepts of multivariate data analysis. In Section 2.2 we define terms that will be used throughout the thesis. Section 2.3 introduces traditional multivariate analysis methods for data characterization, data prediction, and noise estimation. These tools appear repeatedly in subsequent chapters. We especially focus on introducing many traditional tools which try to incorporate the fact that noises in a dataset affect analysis results, and therefore, cannot be ignored. Chapter 3 considers the main problem of the thesis. 
The noisy data model, which is used throughout the thesis, is introduced in Section 3.2. A noisy variable is the sum of noise and a noiseless variable, which is a linear combination of an unknown number of independent signals. Throughout the thesis, signal and noise are assumed to be uncorrelated. Section 3.3 explains why noise variances are important to know in multivariate data analysis using the PC transform as an example. In Section 3.4, we suggest a method to estimate the number of signals in noisy multivariate data. The method is based on the observations that noise eigenvalues form a flat baseline in the eigenvalue screeplot [3] when noise variances are uniform over all variables. Therefore, a clear transition point which distinguishes signal from noise can be determined by examining the screeplot. We extend this observation into the cases where noises are not uniform over variables (see Figure 3-6). In related work, we derive upper and lower bounds for eigenvalues when noises are not uniform in Section 3.4.3. Theorem 3.1 states that the increases in eigenvalues caused by a diagonal noise matrix are lower-bounded by the smallest noise variance and upper-bounded by the largest noise variance. In Section 3.5, we derive the noise-estimation algorithm based on the EM algorithm [4]. The EM algorithm refers to a computational strategy to compute maximum likelihood estimates from incomplete data by iterating an expectation step and a maximization step alternatively. The actual computational algorithms are derived for Gaussian signal and noise vectors. The derivation is similar to the one in [5], but a virtually boundless number of variables and an assumed lack of time-structure in our datasets make our derivation different. It is important to understand that the EM algorithm, summarized in Figure 3-8, takes the number of signals as an input parameter. A brief example is provided to illustrate the effectiveness of the EM algorithm. 14 The EM algorithm alone cannot solve the problem that we are interested in. The number of signals, which is unknown for our problem, has to be supplied to the EM algorithm as an input. Instead of retrieving the number of signals and noise variances in one joint estimation, we take the approach of estimating the two unknowns separately. We first estimate the number of signals, and feed the estimated value to the EM algorithm. The outcome of the EM algorithm, which may be somewhat degraded because of the potentially incorrect estimation of the number of signals in the first step, is then used to normalize the data to level the noise variances across variables. The estimation of the number of signals is then repeated, but this time using the normalized data. Since the new - normalized - data should have noises whose variances do not vary across variables as much as the previous unnormalized data, the estimation of the signal order should be more accurate than the first estimation. This improved estimate of the signal order is then fed to the EM algorithm, which should produce better estimates of noise variances. The procedure repeats until the estimated parameters do not change significantly. A simulation result, provided in Figure 313, illustrates the improvement in estimates of two unknowns as iterations progress. We designate the algorithm as ION, which stands for Iterative Order and Noise estimation. Chapter 4 is dedicated to the applications of the ION algorithm. Three potentially important application fields are investigated in the chapter. 
In Section 4.2, the problem of improving linear regression is addressed for the case of noisy predictors, which does not agree with basic assumptions of traditional linear regression. Least-squares linear regression minimizes errors in the response variable, but it does not account for errors in predictors. One of the many possible modifications to the problem is to eliminate the error in the predictors, and principal component filtering has been used extensively for this purpose. One problem to this approach is that the principal component transform does not separate noise from signal when noise variances across variables are not uniform. We propose in the section that the noise estimation by the ION algorithm, when combined with the PC transform to simulate the NAPC transform, should excel in noise filtering, thus improving linear regression. Extensive simulation results are provided in the section to support the proposition. Section 4.3 addresses the problem of noise filtering. The ION algorithm brings at least three ways to carry out noise filtering. The first method is what we refer to as BlindAdjusted Principal Component (BAPC) filtering. This is similar to the NAPC filtering, 15 but the noise variances for normalization are unknown initially and retrieved by the ION algorithm. If the ION would generate perfect estimates of noise variances, the BAPC transform would produce the same result as the NAPC transform. The second method is to apply the ION algorithm to the Wiener filter [6]. The Wiener filter is the optimal linear filter in the least-squares sense. However, one must know the noise variances in order to build the Wiener filter. We propose that when the noise variances are unknown a priorithe ION estimated noise variances be sufficiently accurate to be used for building the Wiener filter. Finally, the ION algorithm itself can be used as a noise filter. It turns out that among many by-products of the ION algorithm are there estimates of noise sequences. If these estimates are good, we may subtract these estimated noise sequences from the noisy variables to retrieve noiseless variables. This is referred to as the ION filter. Since the BAPC transform is extensively investigated in relation to linear regression in Section 4.2, we focus on the Wiener filter and the ION filter in Section 4.3. Simulations show that both methods enhance SNR of noisy variables significantly more than other traditional filtering methods such as PC filtering. In regard to the noise sequence estimation of the ION algorithm, Section 4.4 is dedicated to the blind principal component transform, which looks for the principal component transform of noiseless variables. In Chapter 5, we take on two real datasets, namely a manufacturing dataset and a remote sensing dataset. For the remote sensing dataset, we compare performance of the ION filter with the PC filter and the Wiener filter. The performances of the filters are quantified by SNR of the post-filter datasets. Simulations indicate that the ION filter, which does not require a priori information of noise variances, outperforms the PC filter significantly and performs almost as well as the Wiener filter, which requires noise variances. The manufacturing dataset is used to examine structures of elements of eigenvectors of covariance matrices. Specifically, we are interested in identifying subgroups of variables which are closely related to each other. We introduce a numerical metric which quantifies how well an eigenvector reveals a subgroup of variables. 
Simulations show that the BAPC transform is clearly better in identifying subgroups of variables than the PC transform. Finally, Chapter 6 closes the thesis by first providing summary of the thesis in Section 6.1, and contributions of the thesis in Section 6.2. Section 6.3 is dedicated to further research topics that stem from this thesis. 16 Chapter 2 Background Study on Multivariate Data Analysis 2.1 Introduction Multivariate data analysis has a wide range of applications, from finance to education to engineering. Because of its numerous applications, multivariate data analysis has been studied by many mathematicians, scientists, and engineers with specific fields of applications in mind (see, for example, [7]), thus creating vastly different terminologies from one field to another. The main objective of the chapter is to provide unified terminologies and basic background knowledges about multivariate data analysis so that the thesis is beneficial to readers from many different fields. First, we define some of the important terms used in the thesis. A partial review of the traditional multivariate data analysis tools are presented in Section 2.3. Since it is virtually impossible to review all multivariate analysis tools in the thesis, only those which are of interest in the context of this thesis are discussed. 2.2 Definitions The goal of this section is to define terms which we will use repeatedly throughout the thesis. It is important for us to agree on exactly what we mean by these words because some of these terms are defined in multiple ways depending on their fields of applications. 17 2.2.1 Multivariate Data A multivariate dataset is defined as a set of observations of numerical values of multiple variables. A two-dimensional array is typically used to record a multivariate dataset. A single observation of all variables constitutes each row, and all observations of one single variable are assembled into one column. Close attention should be paid to the differences between a multivariate dataset and a multi-dimensional dataset [8]. A multivariate dataset typically consists of sequential observations of a number of individual variables. Although a multivariate dataset is usually displayed as a two-dimensional array, its dimension is not two, but is defined roughly as the number of variables. In contrast, a multi-dimensional dataset measures one quantity in a multi-dimensional space, such as the brightness of an image as a function of location in the picture. It is a two-dimensional dataset no matter how big the image is. One example of multivariate data is given in Table 2.1. It is a table of historical climatic data of capital cities of a number of US states in January. The first column lists the names of the cities for which the climatic data are collected, and it serves as the observation index of the data, as do the numbers on the left of the table. The multivariate data, which comprise the numerical part of Table 2.1, consist of 40 observations of 5 variables. A sample multivariate dataset can be denoted by a matrix, namely X. The number of observations of the multivariate dataset is the number of rows of X, and the number of variables is the number of columns. Therefore, an m x n matrix X represents a sample multivariate dataset with m observations of n variables. For example, the sample dataset given in Table 2.1 can be written as a 40 x 5 matrix. Each variable of X is denoted as Xi, and Xi (j) is the jth observation of the ith variable. 
For example, X 3 is the variable 'Record High' and X 3 (15) = 73 for Table 2.1. Furthermore, the vector X denotes the collection of all variables, namely X = [X1,... , XT]T , and the jth observation of all variables is denoted as X(j), that is, X(j) = [X 1 (j), _. , Xn(j)] T . 2.2.2 Multivariate Data Analysis A sample multivariate dataset is just a numeric matrix. Multivariate data analysis is any attempt to describe and characterize a sample multivariate dataset, either graphically or numerically. It can be as simple as computing some basic statistical properties of the dataset 18 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 Cities Montgomery, AL Juneau, AK Phoenix, AZ Little Rock, AR Sacramento, CA Denver, CO Hartford, CT Tallahassee, FL Atlanta, GA Honolulu, HI Boise, ID Springfield, IL Indianapolis, IN Des Moines, IA Topeka, KS Baton Rouge, LA Boston, MA Lansing, MI Jackson, MS Helena, MT Lincoln, NE Concord, NH Albany, NY Raleigh, NC Bismarck, ND Columbus, OH Oklahoma City, OK Salem, OR Harrisburg, PA Providence, RI Columbia, SC Nashville, TN Austin, TX Salt Lake City, UT Richmond, VA Olympia, WA Washington D.C. Charleston, WV Madison, WI Cheyenne, WY Average Average Record Record Average Precipitation High (F) 56 29 66 49 53 43 33 63 50 80 36 33 34 28 37 60 36 29 56 30 32 30 30 49 20 34 47 46 36 37 55 46 59 36 46 44 42 41 25 38 Low (F) 36 19 41 29 38 16 16 38 32 66 22 16 17 11 16 40 22 13 33 10 10 7 11 29 -2 19 25 33 21 19 32 27 39 19 26 32 27 23 7 15 High (F) 83 57 88 83 70 73 65 82 79 87 63 71 71 65 73 82 63 66 82 62 73 68 62 79 62 74 80 65 73 66 84 78 90 62 80 63 79 79 56 66 Low (F) 0 -22 17 -4 23 -25 -26 6 -8 53 -17 -21 -22 -24 -20 9 -12 -29 2 -42 -33 -33 -28 -9 -44 -19 -4 -10 -9 -13 -1 -17 -2 -22 -12 -8 -5 -15 -37 -29 (in) 4.68 4.54 0.67 3.91 3.73 0.50 3.41 4.77 4.75 3.55 1.45 1.51 2.32 0.96 0.95 4.91 3.59 1.49 5.24 0.63 0.54 2.51 2.36 3.48 0.45 2.18 1.13 5.92 2.84 3.88 4.42 3.58 1.71 1.11 3.24 8.01 2.70 2.91 1.07 0.40 Table 2.1: Climatic data of capital cities of many US states in January (obtained from Yahoo internet site) 19 150- record high - .---- - - - - - average high average low record low average precipitation 1o- 50- - 0-48.0 7.0 6.0 -50 - -5.0 4.0 3.0 2.0 1.0 -100: 0 5 10 15 20 state index 25 35 0.0 40 Figure 2-1: Graph of the climatic data of Table 2.1 such as sample mean and sample covariance matrix. For example, Dow Jones Industrial Average (DJIA) is a consequence of a simple multivariate data analysis; multiple variables (the stock prices of the 30 industrial companies) are represented by one number, a weighted average of the variables. The first step of multivariate data analysis typically is to look at the data and to identify their main features. Simple plots of data often reveal such features as clustering, relationships between variables, presence of outliers, and so on. Even though a picture rarely captures all characterizations of an information-rich dataset, it generally highlights aspects of interest and provides direction for further investigation. Figure 2-1 is a graphical representation of Table 2.1. The horizontal axis represents the observation index obtained from the alphabetical order of the state name (Alabama, Arkansas, -- .). The four temperature variables use the left-hand vertical axis, and the average precipitation uses the right-hand vertical axis. It is clear from the graph that all four temperature variables are closely related. In other words, they tend to move together. 
The graph also reveals that the relation between the average precipitation and the other four variables is not as strong as the relations within the four variables. Even this simple graph can reveal interesting properties such as these which are not easy to see from the raw dataset of Table 2.1. Figure 2-1 is called the time-plot of Table 2.1 although the physical meaning of the horizontal axis may not be related to time. Another type of widely used graphical representation is the scatter plot. In a scatter 20 (a) (b) 60 9 i 50. 8- . 40 Ec.E 30 c .. * 7- 6 .9-5- - a)) 4 20 ~'0 < 0 -10 10 1 ' 20 30 40 50 60 Average high temperature (F) 0 20 70 30 40 50 60 70 Average high temperature (F) 80 Figure 2-2: Scatter plots of variables chosen from Table 2.1. (a) average high temperature vs. average low temperature, (b) average high temperature vs. average precipitation. plot, two variables of interest are chosen and plotted one variable against the other. When a multivariate dataset contains n variables, n(n - 1)/2 different two-dimensional scatter plots are possible. Two scatter plots of Table 2.1 are drawn in Figure 2-2. Figure 2-2(a) illustrates again the fact that the average high temperature and the average low temperature are closely related. From Figure 2-2(b), the relation between average high temperature and average precipitation is not as obvious as for the two variables in Figure 2-2(a). The notion of 'time' of a dataset disappears in a scatter plot. Once graphical representations provide intuition and highlight interesting aspects of a dataset, a number of numerical multivariate analysis tools can be applied for further analysis. Section 2.3 introduces and discusses these traditional analysis tools. 2.2.3 Time-Structure of a Dataset Existence of time-structure in a multivariate dataset is determined by inquiring if prediction of the current observation can be helped by previous and subsequent observations. In simple words, the dataset is said to have no time-structure if each observation is independent. Knowing if a dataset X has a time-structure is important because these may shape an analyst's strategy for analyzing the dataset. When a dataset X has no time-structure, the X(j)'s are mutually independent random vectors, and the statistical analysis tools described in Section 2.3 may be applied. When X has time-structure, traditional statistical analysis tools may not be enough to bring out rich characteristics of the dataset. Instead, signal 21 1 2 3 4 5 6 7 8 9 10 11 12 Month January February March April May June July August September October November December Average Average Record Record Average Precipitation High (F) 36 38 46 56 67 76 82 80 73 63 52 40 Low (F) 22 23 31 40 50 59 65 64 57 47 38 27 High (F) 63 70 81 94 95 100 102 102 100 90 78 73 Low (F) -12 -4 6 16 34 45 50 47 38 28 15 -7 (in) 3.59 3.62 3.69 3.60 3.25 3.09 2.84 3.24 3.06 3.30 4.22 4.01 Table 2.2: Climatic data of Boston, MA (obtained from Yahoo internet site) processing tools such as Fourier analysis or time-series analysis may be adopted to exploit the time-structure to extract useful information. However, signal processing tools generally do not enrich our understanding of datasets having no time-structure. 160140 .. ---- ----- - 120- --- record high average high average low record low average precipitation 100- 0 - 4.3 - -20, 4.0 - 3.7 3.4 -40- 3.1 1 2 mont5h 7 month 8 9 10 11 122. 
Figure 2-3: Graph of the climatic data of Table 2.2 Determining independency is not always easy without a priori knowledge unless the time-structure at issue is simple enough to be identified using a time-plot of the sample data. Such simple time-structures include periodicity and slow rates of change in the dataset, which is also referred to as a slowly varying dataset. A dataset with a time-structure typically loses its pattern if the order of observations are 22 changed randomly. Table 2.2 is an example of a dataset with time-structure. The dataset is for the same climatic variables of Table 2.1 from January to December in Boston. A timestructure of slow increase followed by slow decrease of temperature variables is observed. In contrast to this, easily discernible time-structure does not emerge in Figure 2-11. 2.2.4 Data Standardization In many multivariate datasets, variables are measured in different units and are not comparable in scale. When one variable is measured in inches and another in feet, it is probably a good idea to convert them to the same units. It is less obvious what to do when the units are different kinds, e.g., one in feet, another in pounds, and the third one in minutes. Since they can't be converted into the same unit, they usually are transformed into some unitless quantities. This is called data standardization. The most frequently used method of standardization is to subtract the mean from each variable and then divide it by its standard deviation. Throughout the thesis, we will adopt this 'zero-mean unit-variance' as our standardization method. There are other benefits of data standardization. Converting variables into unitless quantities effectively encrypts data and provides protection from damages to companies which distribute data for research if unintentional dissemination of data happens. Stan- dardization also prevents variables with large variances from distorting analysis results. Other methods of standardization are possible. When variables are corrupted by additive noise with known variance, each variable may be divided by its noise standard deviation. Sometimes datasets are standardized so that the range of each variable becomes [-1 2.3 1]. Traditional Multivariate Analysis Tools One of the main goals of multivariate data analysis is to describe and characterize a given dataset in a simple and insightful form. One way to achieve this goal is to represent the dataset with a small number of parameters. Typically, those parameters include location parameters such as mean, median, or mode; dispersion parameters such as variance or range; and relation parameters such as covariance and correlation coefficient (Table 2.3). Location and dispersion parameters are univariate in nature. For these parameters, no difference 'One should not say that the dataset of Table 2.1 has no time-structure based only on Figure 2-1. 23 Category Location Dispersion Relation Parameters Mean, Mode, Median Variance, Standard deviation, Range Covariance, Correlation coefficient Table 2.3: Typical parameters of interest in multivariate data analysis exists between analysis of multivariate and univariate data. What makes multivariate data analysis challenging and different from univariate data analysis is the existence of the relations parameters. Defined only for multivariate data, the relation parameters describe implicitly and explicitly the structure of the data, generally to the second order. 
Therefore, a multivariate data analysis typically involves parameters such as covariance matrix. The goal of this section is to introduce traditional multivariate analysis tools which have been widely used in many fields of applications. This section covers only those topics that are relevant to this thesis. Introducing all available tools is beyond the scope of this thesis. Readers interested in tools not covered here are referred to books such as [2, 9, 10]. 2.3.1 Data Characterization Tools When a multivariate dataset contains a large number of variables with high correlations among them, it is generally useful to exploit the correlations to reduce dimensions of the dataset with little loss of information. Among the many benefits of reduced dimensionality of a dataset are 1) easier human interpretation of the dataset, 2) less computation time and data storage space in further analysis of the dataset, and 3) the orthogonality among newly-formed variables. Multivariate tools that achieve this goal is categorized as data characterization tools. Principal Component Transform Let X E R' be a zero-mean random vector with a covariance matrix of Cxx. Consider a new variable P = XTV, where v E R". The variance of P is then vTCxxv. The first principal component of X is defined as this new variable P when the variance vTCxxv is maximized subject to vTv = 1. The vector v can be found by Lagrangian multipliers [11]: d (vTCXXv d AvTv dv ( dv ( 24 (2.1) which results in Cxxv = Av. (2.2) Thus the Lagrangian multiplier A is an eigenvalue of the covariance matrix Cxx and v is the corresponding eigenvector. From (2.2), vTCxxv = A since vTv = (2.3) 1. The largest eigenvalue of Cxx is therefore the variance of P 1 , and v is the corresponding eigenvector of Cxx. It turns out that the lower principal components can be also found from the eigenvectors of Cxx. There are n eigenvalues A1 , A2 , - - , An in descending order and n corresponding eigenvectors vi, v 2 , - - -, vn for the n x n covariance matrix Cxx. It can be shown that v 1 , - - -, vn are orthogonal to each other for the symmetric matrix Cxx [12]. The variance of the ith principal component, XTV2 , is Ai. If we define Q E Rfx" and A E Rxf" respectively as (2.4a) [Vi I . - IVn], A =diag (A,, -- - , An) , then the vector of the principal components, defined as P = [P 1 , - (2.4b) , P]T is P = QTX. If AP+1 =,- , An = 0, the principal components Pp+ 1, - - - , Pn have zero variance, indicating that they are constant. This means that even though there are n variables in X, there are only p degrees of freedom, and the first p principal components capture all variations in X. Therefore, the first p principal components - p variables - contains all information of the n original variables, reducing variable dimensionality. Sometimes it may be beneficial to dismiss principal components with small but non-zero variances as having negligible information. This practice is often referred to as data conditioning [13]. Some of the important characteristics of the principal component transform are: * Any two principal components are linearly independent: any two principal components are uncorrelated. 25 " The variance of the ith principal component P is Aj, where Ai is the ith largest eigenvalue of CXX. " The sum of the variances of all principal components is equal to the sum of the variances of all variables comprising X. 
2.3.2 Data Prediction Tools In this section, we focus on multivariate data analysis tools whose purpose is to predict future values of one or more variables in the data. For example, linear regression expresses the linear relation between the variables to be predicted, called response variables, and the variables used to predict the response variables, called predictor variables. A overwhelmingly popular linear regression is the least squares linear regression. In this section, we explain it briefly, and discuss why it is not suitable for noisy multivariate datasets. Then we discuss other regression methods which are designed for noisy multivariate datasets. Least Squares Regression Regression is used to study relationships between measurable variables. Linear regression is best used for linear relationships between predictor variables and response variables. In multivariate linear regression, several predictor variables are used to account for a single or multiple response variables. For the sake of simplicity, we limit our explanation to the case of single response variable. Extension to multiple response variables can be done without difficulty. If Y denotes the response variable and Z = [Z 1, - - - , Z" ]T denotes the vector of n regressor variables, then the relation between them is assumed to be linear, that is, Y(j) = yo + jTZ(j) + e(j) where -yo C R and -y E R are unknown parameters and (2.5) E is the statistical error term. Note that the only error term of the equation, e, is additive to Y. In regression analysis, it is a primary goal to obtain estimates of parameters -yo and y. The most frequently adopted criterion for estimating these parameters is to minimize the residual sum of squares; this is known as the least squares estimate. In the least squares linear regression, yo and y can be 26 found from = arg min Z (Y(j) 0YO - - (j). (2.6) The solution to (2.6) is given by i = (2.7a) Czy CN = (2.7b) m J71 m j=1 where Czy and Czz are covariance matrices of (Z, Y) and (Z, Z), respectively. Total Least Squares Regression In traditional linear regressions such as the least squares estimation, only the response variable is assumed to be subject to noise; the predictor variables are assumed to be fixed values and recorded without error. When the predictor variables are also subject to noise, the usual least-squares criterion may not make much sense, since only errors in the response variable are minimized. Consider the following problem: Y = yo + -fTZ+e (2.8a) X = Z + e, (2.8b) where 6 E R' is a zero-mean random vector with a known covariance matrix CEE. Y c R is again the response variable with E [Y] = 0. Z E R' is a zero-mean random vector with a covariance matrix CZZ. Note that zero mean Y and Z lead to yo = 0. The cross-covariance matrix CZ is assumed to be 0. Least squares linear regression of Y on X would yield a distorted answer due to the additive error e. Total least squares (TLS) regression [14, 15] accounts for e in obtaining linear equation. What distinguishes TLS regression from regular LS regression is how to define the statistical error to be minimized. In total least squares regression, the statistical error is defined by the distance between the data point and the regression hyper-plane. For example, the error term in the case of one predictor variable is ei,TLS = V#-2 27 + 1 (2.9) when the regression line is given by a + 8X. Total least squares regression minimizes = the sum of the squares of ei,TLs: m (ei,TLS) 2 . 
aargmin (aTLS, /TLS) (' ) (2.10) 1 Figure 2-4 illustrates the difference between the regular least squares regression and total least squares regression. (a) Y 0 0 'X (b) y 0 0 0 0 eiLS:: 0 LS 0 0 ei,T LS 0 0 0 0 0 00 0 -0--X -X 0 Figure 2-4: Statistical errors minimized in, (a) least squares regression and (b) total least squares regression. Noise-Compensating Linear Regression In this section, we are interested in finding the relation between Y and Z specified by -y of (2.8a). Should one use traditional least squares regression based on Y and X, the result given in (2.11), i = Cxy C- need not be a reasonable estimate of -y. (2.11) For example, it is observed in [16] that j of (2.11) is a biased estimator of y of (2.8a). Since what we are interested in is obtaining a good estimate of y, it is desirable to reduce noise e of (2.8b) before regression. We will limit methods of reducing noise in X to a linear transform denoted by B. After the linear transform B is applied to X, we regress Y on the transformed variables, BX. Let -Y 28 denote an estimate of -y obtained from the least squares regression of Y on BX, i.e., jB = argnin E Y - rTBX (2.12) , The goal is to find B which makes iB of (2.12) a good estimate of -y. We adopt minimizing the squared sum of errors as the criterion of a good estimate, i.e., Bnc = argMBn (iB _ _7 )T (iB - 'Y) (2.13) We call Bnc of (2.13) the noise compensating linear transform. We first find an equation between -]B and B. Since YB is the least squares regression coefficient vector when the regression is of Y on BX, we obtain (2.14) CY(BX)C(BX)(BX)I The right-hand side of (2.14) can be written CY(BX)(X)(BX) 1 = CyxBT (BCXXBT) = CyzBT (B (Cz + C,,) BT) (2.15) Combining (2.14) and (2.15), we obtain -T (B (Czz + C,) BT) = _T CzzBT, (2.16) which becomes =( TBCeeBT The minimum value of E E [(y - [(Y BBX)] BB) CzzB T (2.17) is achieved when r = yB. rTBX = E = (- (YTZ + e - BBZ - -TBBe)] -T B) CZZ (7T +BBCeeB 29 IB + OUE B) T (2.18) Replacing of (2.18) with the expression in (2.17) gives i7BCECBT E y- - iTBB) CZZY + o 2 iTBX)] = (,Y (2.19) The noise compensating linear transform Bnc is B which minimizes (jB ~~-Y)T (5B - Y) while satisfying (2.17). If we can achieve yB = -y for a certain B while satisfying (2.17), then the problem is solved. In the next equations, we will show that (2.17) holds if jB = and B = CzzC-j. -yT (I - B) CzzB T right-hand side of (2.17) =7T (I - CzzC-1 ) CzzC-1 Czz = TC C 1 CZZ - ITCzzC1 CzzC-iCzz = TCZZC 1 (Cxx - CZZ) C-Czz = iTBCECBT = left-hand side of (2.17) (2.20) Figure 2-5 illustrates the noise compensating linear regression. X- B =CzzC-x 1 BX Least Squares Linear Regression of Y Y = on BX C C-' Y(BX)(BX)(BX) Figure 2-5: Noise compensating linear regression Example 2.1 Improvement in estimating -y: distributed noise case of independent identically In this example, we compare the regression result after the linear transform B = CzzC-1 is applied with the regression result without the linear transform, which is equivalent to setting B=I, for the case of CEE = I. Then B = (Cxx - I) C-1 = 30 I - C- and we saw that B=I and i = T (I = y. If we regress Y on X without the transform X first, then - In that case, ( . )T -- = (C ) (Cy>) ;> o. 
As a numerical example, we create Z whose population covariance matrix is / Czz = 66.8200 19.1450 -8.9750 30.8350 19.1450 38.1550 17.5975 42.9700 14.3650 -8.9750 17.5975 39.7850 12.3750 17.6300 30.8350 42.9700 12.3750 61.5025 5.7675 14.3650 17.6300 5.7675 27.1050 -17.5675 and we use -y = (-0.99 0.26 0.57 - 0.41 0 . 5 8 )T -17.5675 , to create Y = yTZ + C. We set E to be Gaussian noise with variance of 0.01. X is made by adding a Gaussian noise vector e whose population covariance matrix is I. The number of samples of Z is 2000. We estimate j by first applying the transform B = I - S-1, where Sxx is the sample covariance of X. We also estimate * with B=I. We repeat it 200 times and average the results. The results are given in the following table: B=I-C-1 B=I mean a mean 0- -0.99 -0.9904 0.0049 -0.9704 0.0046 0.26 0.2618 0.0113 0.1873 0.0082 0.57 0.5701 0.0062 0.5697 0.0060 -0.41 -0.4109 0.0087 -0.3648 0.0073 0.58 0.5795 0.0090 0.6001 0.0073 This example clearly illustrates that (1) Noises in the regressor variables are detrimental in estimating the coefficients by the least squares regression. (2) If the covariance CeE is known and noise is uncorrelated with signal, then the noise compensating linear transform B = CzzC-1 can be applied to X before the least squares regression to improve the accuracy of the estimate of coefficient vector y. 31 Principal Component Regression In principal component regression (PCR), noises are reduced by PC filtering before least squares linear regression is carried out. Assuming that Cee = I, it follows from the noise compensating linear regression that B = CzzC-i BX where Q = (I - C-) = (QQT = Q - = I - C-, thus X QA-QT) X (I - A-i) Q T X (2.21) and A are eigenvector and eigenvalue matrices defined in (2.4). For notational convenience, let's define three new notations: Q(i) = [Vi I - Ivi] A(,) = diag (A1,,-- P(l) for any integer I < -- [1, (2.22a) , Al) (2.22b) (2.22c) I]T p. n. In (2.21), QT is the PC transform and Q is the inverse PC transform. The matrix is (I - A-i) I - A-' = diag (1 - Al',- 1 - An-') Therefore, the ith principal component P is scaled by 1- Ai before the inverse PC transform. Note that (1 - A-') ->- (1 - A-i) > 0. If (1 - A-i) > (1 - Aj-') = 0, then (2.21) can be written as BX = Q(i) (I - A-' QT)X (2.23) Considering that QT)X is equal to P(), this means that the last n - 1 principal components are truncated. This is due to the fact that the principal components which correspond to the eigenvalue A = 1 consist only of noise. Therefore, the truncation is equivalent to noise reduction. If there is a limit to the number of principal components which can be retained, only the largest components are usually retained. However, this may not be the best thing to do 32 because there is no reason why the principal components with the highest variances should be more useful for predicting the response variable. There are many examples in which the principal components with smaller variances are relevant to the prediction of the response variable, but are discarded in the process of choosing the principal components to be used for regression. This problem can be partially avoided if the principal components are ordered and chosen based on their correlations with the response variable [17]. Alternatively, one could use the partial least squares regression to avoid the shortcoming of the PC regression. Partial Least Squares Regression In the PC transform, the transform vector v maximizes the variance of XTv subject to vTv = 1. 
Partial Least Squares Regression

In the PC transform, the transform vector v maximizes the variance of $X^T v$ subject to $v^T v = 1$. However, the resulting principal component $X^T v$ may or may not have a strong relation with Y. If the purpose of the transform is to find a variable which accounts for as much of the variance of Y as possible, it is desirable to find a vector $\mu_1$ which maximizes the covariance of the transformed variable $X^T \mu_1$ with Y. This is the idea behind the partial least squares (PLS) transform [2, 18]. With $Y \in R$ being the zero-mean response variable and $X \in R^n$ being the zero-mean vector of regressor variables, $X^T \mu_1$, where $\mu_1 \in R^n$, is the first PLS variable if it has the maximum covariance with Y subject to the condition $\mu_1^T \mu_1 = 1$. The vector $\mu_1$ can again be found by Lagrange multipliers:

$\frac{\partial}{\partial \mu_1}\left\{ E\left[Y \left(X^T \mu_1\right)\right] - \lambda \left(\mu_1^T \mu_1 - 1\right) \right\} = 0,$   (2.24)

which results in

$\mu_1 = \frac{C_{XY}}{\left\| C_{XY} \right\|}.$   (2.25)

To find the second PLS variable, the following algorithm is carried out:

(a) Find scalars $k_0, k_1, \cdots, k_n$ so that $Y - k_0 (X^T \mu_1)$, $X_1 - k_1 (X^T \mu_1)$, $\cdots$, $X_n - k_n (X^T \mu_1)$ are all uncorrelated with $X^T \mu_1$. The answers are $k_0 = (C_{YX}\mu_1)/(\mu_1^T C_{XX}\mu_1)$ and $k_i = (C_{X_i X}\mu_1)/(\mu_1^T C_{XX}\mu_1)$.

(b) Treat the residual in Y, $Y - k_0 (X^T \mu_1)$, as the new response variable. Treat the residuals $X_1 - k_1 (X^T \mu_1), \cdots, X_n - k_n (X^T \mu_1)$ as the new regressor variables.

(c) Using the new response variable and the new regressor variables, find the second PLS variable $\mu_2$ in the same way the first was found.

(d) Repeat (a), (b), and (c) for subsequent PLS variables. Stop the procedure when the residual in the response variable is small enough.

The PLS transform retains the variations in X which have a strong relation with Y. When there is a limit to the number of transformed variables which can be retained, PLS regression generally accounts for the variations in Y better than PC regression.

Ridge Regression

Ridge regression [19] was introduced as a method for eliminating large variances in regression estimates in the presence of large collinearity among predictor variables. In ridge regression, the penalty function which is minimized over $\gamma_0$ and $\gamma$ has an additive term in addition to the usual squared error term. The additive term is proportional to the squared norm of the coefficient vector $\gamma$:

$(\hat{\gamma}_0, \hat{\gamma}) = \underset{\gamma_0, \gamma}{\arg\min}\; E\left[\left(Y - \gamma_0 - \gamma^T Z\right)^2\right] + \eta\, \gamma^T \gamma.$   (2.26)

The solution to (2.26) is

$\hat{\gamma}^T = C_{YZ} \left(C_{ZZ} + \eta I\right)^{-1},$   (2.27a)
$\hat{\gamma}_0 = \frac{1}{m} \sum_{j=1}^{m} Y(j) - \hat{\gamma}^T \frac{1}{m} \sum_{j=1}^{m} Z(j).$   (2.27b)

Comparing (2.27) with (2.6), the only difference is the additive term $\eta I$. This additive term stabilizes the matrix inverse in (2.27a) when the covariance matrix $C_{ZZ}$ is ill-conditioned. The ridge parameter $\eta > 0$ determines the degree of stabilization: a larger value implies stronger stabilization. In practice, the value of $\eta$ is determined by a model selection procedure such as cross-validation [20].
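The closed form (2.27a) is straightforward to state in code. The following sketch is not from the thesis; it is a minimal NumPy illustration on synthetic, nearly collinear predictors, and the value of eta is an arbitrary illustrative choice rather than one selected by cross-validation.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 500, 4

# Nearly collinear predictors: the 4th column is almost a copy of the 1st.
Z = rng.standard_normal((m, n))
Z[:, 3] = Z[:, 0] + 0.01 * rng.standard_normal(m)
gamma = np.array([1.0, -2.0, 0.5, 0.0])
Y = Z @ gamma + 0.1 * rng.standard_normal(m)

C_zz = np.cov(Z, rowvar=False)
C_zy = (Z - Z.mean(0)).T @ (Y - Y.mean()) / (m - 1)

def ridge(eta):
    """Ridge solution (2.27a): gamma_hat = (C_zz + eta I)^{-1} C_zy."""
    return np.linalg.solve(C_zz + eta * np.eye(n), C_zy)

print("eta = 0   :", np.round(ridge(0.0), 3))   # ill-conditioned, high-variance estimate
print("eta = 0.1 :", np.round(ridge(0.1), 3))   # stabilized estimate
```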
Rank-Reduced Least-Squares Linear Regression

Rank-reduced linear regression [21], also known as 'projected principal component regression', is a regression method concerning multiple dependent variables as well as multiple independent variables. The framework of this multivariate regression problem is as follows. Let $X = [X_1, \cdots, X_n]^T$ be the vector of n independent variables and $Y = [Y_1, \cdots, Y_m]^T$ be the vector of m dependent variables. We want to find an estimate of Y based on X of the form

$\hat{Y} = W H^T X,$   (2.28)

where W is an $m \times r$ matrix, H is an $n \times r$ matrix, and $r < \min(m, n)$. One can assume without loss of generality that W is orthonormal ($W^T W = I$), because any W can always be decomposed into an orthonormal part and a remainder, and the remainder can be absorbed into H.

The rank-reduced least-squares linear regression looks for $W_p$ and $H_p$ which satisfy the criterion

$(W_p, H_p) = \underset{W, H}{\arg\min}\; E\left[(Y - \hat{Y})^T (Y - \hat{Y})\right].$   (2.29)

It turns out that the joint minimization problem of (2.29) is separable: instead of minimizing $E[(Y - \hat{Y})^T (Y - \hat{Y})]$ over W and H jointly, we only have to consider one argument at a time.

First, for a given orthonormal matrix W, $H_{p|W}$ can be found by minimizing

$E\left[(Y - \hat{Y})^T (Y - \hat{Y})\right] = E\left[\left(Y - W H^T X\right)^T \left(Y - W H^T X\right)\right].$   (2.30)

One can always write Y as the sum of $W W^T Y$ and $(I - W W^T) Y$:

$Y = \left(I - W W^T\right) Y + W W^T Y$   (2.31)
$= Y_\perp + Y_\parallel.$   (2.32)

Note that the two components are orthogonal to each other. Replacing Y in (2.30) with (2.32) yields

$E\left[\left(Y_\perp + Y_\parallel - \hat{Y}\right)^T \left(Y_\perp + Y_\parallel - \hat{Y}\right)\right] = E\left[Y_\perp^T Y_\perp\right] + E\left[\left(Y_\parallel - \hat{Y}\right)^T \left(Y_\parallel - \hat{Y}\right)\right],$   (2.33)

because $Y_\perp$ is perpendicular to $Y_\parallel - \hat{Y}$. Note that $E[Y_\perp^T Y_\perp]$ in (2.33) is fixed for a given W. Therefore, only the second term in (2.33) can be minimized over H. This is done by setting

$E\left[\left(Y_\parallel - \hat{Y}\right) X^T\right] = 0.$   (2.34)

This condition is imposed because the remainder in $Y_\parallel$ after the estimate $\hat{Y}$ is subtracted should not be correlated with X in order for $E[(Y_\parallel - \hat{Y})^T(Y_\parallel - \hat{Y})]$ to be minimized. Substituting $\hat{Y} = W H^T X$ into (2.34) yields

$W W^T C_{YX} = W H^T C_{XX},$   (2.35)
$W^T C_{YX} = H^T C_{XX}.$   (2.36)

Assuming that $C_{XX}$ is invertible, $H^T_{p|W} = W^T C_{YX} C_{XX}^{-1}$. With this choice of H the resulting distortion is

$E\left[(Y - \hat{Y})^T (Y - \hat{Y})\right] = E\left[Y^T Y\right] - 2\,\mathrm{tr}\left\{W^T C_{YX} C_{XX}^{-1} C_{XY} W\right\} + \mathrm{tr}\left\{W^T C_{YX} C_{XX}^{-1} C_{XY} W\right\}$   (2.37–2.39)
$= E\left[Y^T Y\right] - \mathrm{tr}\left\{W^T C_{YX} C_{XX}^{-1} C_{XY} W\right\}.$   (2.40)

Thus to solve for the optimal W we must maximize

$\mathrm{tr}\left\{W^T C_{YX} C_{XX}^{-1} C_{XY} W\right\} = \mathrm{tr}\left\{W^T Q \Lambda Q^T W\right\},$   (2.41)

where $Q \Lambda Q^T$ is the eigenvalue decomposition of $C_{YX} C_{XX}^{-1} C_{XY}$. Simplifying further,

$\mathrm{tr}\left\{W^T Q \Lambda Q^T W\right\} = \sum_{j=1}^{m} \lambda_j\, q_j^T W W^T q_j,$   (2.42)

where $\lambda_j$ is the jth diagonal element of $\Lambda$ and $q_j$ is the jth column of Q. Before finding the W which maximizes (2.42), we determine lower and upper bounds for the scalar $q_j^T W W^T q_j$:

$q_j^T W W^T q_j = \left\| W^T q_j \right\|^2 \ge 0,$   (2.43)
$q_j^T W W^T q_j = \left\| W^T q_j \right\|^2 \le \left\| W^T \right\|^2 = 1,$   (2.44)

where $\|\cdot\|$ is the vector or matrix 2-norm. We also note that $\sum_j q_j^T W W^T q_j = r$ because

$\sum_{j=1}^{m} q_j^T W W^T q_j = \mathrm{tr}\left\{Q^T W W^T Q\right\} = \mathrm{tr}\left\{W W^T\right\} = r.$   (2.45)

Therefore, the original problem is equivalent to determining the W which maximizes

$\sum_{j=1}^{m} \lambda_j\, q_j^T W W^T q_j$   (2.46a)

subject to

$0 \le q_j^T W W^T q_j \le 1$   (2.46b)
$\sum_{j=1}^{m} q_j^T W W^T q_j = r.$   (2.46c)

Assuming that the $\lambda_j$ are in descending order, the solution to (2.46) is

$q_j^T W W^T q_j = \begin{cases} 1 & 1 \le j \le r \\ 0 & \text{otherwise.} \end{cases}$   (2.47)

A closed-form expression for a W which satisfies (2.47) is

$W = [\,q_1 \mid \cdots \mid q_r\,].$   (2.48)

Returning to the original problem, the optimal estimate $\hat{Y}$ of rank r based on X is given by

$\hat{Y} = W W^T C_{YX} C_{XX}^{-1} X,$   (2.49)

where W is given by (2.48). Note that the optimal estimate of rank r is the projection of the linear least squares estimate $C_{YX} C_{XX}^{-1} X$ onto the r-dimensional subspace spanned by the columns of W, or equivalently, by the first r eigenvectors of $C_{YX} C_{XX}^{-1} C_{XY}$. This is the origin of the name projected principal component regression. Note that $\hat{Y}$ becomes the LLSE if we do not impose the restriction that $\hat{Y}$ be of rank r.
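A compact NumPy sketch of (2.48)-(2.49) follows. It is not the thesis' implementation; population covariances are replaced by sample covariances, and the function name, the synthetic data, and the rank are illustrative assumptions.

```python
import numpy as np

def reduced_rank_regression(X, Y, r):
    """Rank-r estimate of Y from X, following (2.48)-(2.49) with sample covariances.

    X : (N, n) predictor samples, Y : (N, m) response samples.
    Returns the coefficient matrix M such that Y_hat = X @ M.T has rank r."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    N = X.shape[0]
    C_xx = Xc.T @ Xc / (N - 1)
    C_yx = Yc.T @ Xc / (N - 1)

    full = C_yx @ np.linalg.solve(C_xx, C_yx.T)   # C_yx C_xx^{-1} C_xy  (m x m)
    eigval, eigvec = np.linalg.eigh(full)         # ascending eigenvalues
    W = eigvec[:, ::-1][:, :r]                    # first r eigenvectors, as in (2.48)

    return W @ W.T @ C_yx @ np.linalg.inv(C_xx)   # projected LLSE coefficients (2.49)

# Usage: synthetic rank-2 relation between 6 predictors and 4 responses.
rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 6))
true_M = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 6))
Y = X @ true_M.T + 0.1 * rng.standard_normal((1000, 4))
M_hat = reduced_rank_regression(X, Y, r=2)
print(np.round(M_hat - true_M, 2))                # should be close to zero
```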
2.3.3 Noise Estimation Tools

When population statistics such as the population mean or variance are not available for a given dataset, sample statistics usually replace them in further statistical analyses. A sample statistic is often a good estimate of the corresponding population statistic if the number of samples is large. When measurement noises, usually additive, are present in the data, sample statistics of the noisy data may not be good estimates of the population statistics of the noise-free data unless the noise statistics are known and can be removed from the corresponding noisy sample statistics. For example, consider the following problem: we want to estimate the population covariance of $Z \in R^n$, denoted by $C_{ZZ}$, from m independent observations of $X \in R^n$, where X = Z + e and Z and e are uncorrelated, so that $C_{ZZ}$ is equal to $C_{XX} - C_{ee}$. If m is large, we may substitute the sample covariance of X, denoted by $S_{XX}$, for $C_{XX}$. An estimate of $C_{ZZ}$ can then be obtained from $S_{XX} - C_{ee}$ if $C_{ee}$ is known. Many statistical analysis tools, such as noise-adjusted principal component (NAPC) analysis, take the availability of $C_{ee}$ for granted.

Problems arise when the necessary noise statistics are not available. Generally speaking, sample noise statistics cannot be obtained from a dataset because the noise does not appear by itself but is embedded in the signal, unless we have a way to separate the two. For example, the sample noise covariance, denoted by $S_{ee}$, typically cannot be computed from observations of X. In this section, we explore a few ways to estimate $C_{ee}$ which often, but not always, work in practice. When necessary, we make physically realistic assumptions about the signals and noises. Once estimated, the noise covariances are used to compensate for noise so as to improve statistical tools such as the PC transform and linear regression.

Throughout the section, we consider the observed vector $X \in R^n$, which consists of the noise-free vector Z and the noise vector e, i.e.,

$X = Z + e.$   (2.50)

Assuming Z and e are uncorrelated, the population covariances of the three vectors satisfy

$C_{XX} = C_{ZZ} + C_{ee}.$   (2.51)

Noise Estimation through Spectral Analysis

For many physically realistic cases, a variable changes slowly when it is measured repeatedly over time. This observation is the basis for many noise reduction algorithms. For example, Donoho [22], who proposed a 'de-noising' algorithm, a term which means 'rejecting noise from noisy observations of data', developed an algorithm which achieves a target smoothness of the data. In such cases, power spectral analysis can reveal the variance of the additive measurement error. To make it easier to exploit this time-structure of a dataset, we incorporate the observation index, or time index, into (2.50):

$X(t) = Z(t) + e(t).$   (2.52)

Assumptions  The stochastic processes $X_i(t)$, $Z_i(t)$, and $e_i(t)$ for $i = 1, \ldots, n$ are stationary, and e(t) is uncorrelated with Z(t). $Z_i(t)$ for $i = 1, \ldots, n$ change slowly over time, which means that the power spectral density of $Z_i(t)$, denoted by $P_{Z_i Z_i}(\omega)$, is confined to a low frequency range. $e_i(t)$ for $i = 1, \ldots, n$ are white.

Since $Z_i(t)$ and $e_i(t)$ are uncorrelated, the power spectral densities of $X_i(t)$, $Z_i(t)$ and $e_i(t)$ satisfy

$P_{X_i X_i}(\omega) = P_{Z_i Z_i}(\omega) + P_{e_i e_i}(\omega), \qquad i = 1, \ldots, n.$   (2.53)

The high frequency region of $P_{X_i X_i}(\omega)$ is dominated by the power spectral density of the noise. Therefore, the level of $P_{X_i X_i}(\omega)$ in the high frequency region indicates the noise variance of $e_i$ (see Figure 2-6). Similar assumptions were made by Lim when restoring noisy images in [8, 23].
Since $C_{ee}$ is diagonal, estimating $\sigma_{e_i}^2$ for $i = 1, \ldots, n$ amounts to estimating $C_{ee}$ because

$C_{ee} = \mathrm{diag}\left(\sigma_{e_1}^2, \cdots, \sigma_{e_n}^2\right).$   (2.54)

Example 2.2  Estimation of noise variance by power spectral analysis

Consider the variable $X_1$ and its power spectral density shown in Figure 2-7 for a numerical simulation. $P_{X_1 X_1}(\omega)$ is not flat in the high frequency region. We want to estimate the value of $\sigma_{e_1}^2$ by averaging $P_{X_1 X_1}(\omega)$ over the high frequency region. To do this we need to decide what is meant by the high frequency region. We propose that the value $\omega_B$, which separates the high frequency region from the low frequency region, be determined so that

$\min_{\omega \in [0, \omega_B)} P_{X_1 X_1}(\omega) = \frac{1}{1 - \omega_B} \int_{\omega_B}^{1} P_{X_1 X_1}(\omega)\, d\omega.$   (2.55)

It is not rare to have multiple solutions to (2.55). In that case, the smallest $\omega_B$ is selected as the division frequency, so as not to pick a value which appears as a solution only because of statistical fluctuations in the power spectral density. The estimated noise variance is then

$\hat{\sigma}_{e_1}^2 = \frac{1}{1 - \omega_B} \int_{\omega_B}^{1} P_{X_1 X_1}(\omega)\, d\omega.$   (2.56)

Figure 2-6: A simple illustration of the estimation of noise variance for a slowly changing variable.

Figure 2-7: A typical $X_1(t)$ and its power spectral density.

For Example 2.2, $\omega_B$ and $\hat{\sigma}_{e_1}^2$ are illustrated in Figure 2-7(b) as the vertical dash-dot and horizontal dashed lines, respectively. Although somewhat ad hoc, this method works very well in our simulations. For Figure 2-7, $\sigma_{e_1} = 0.8$ and $\hat{\sigma}_{e_1} = 0.82$.
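The following rough NumPy sketch illustrates the idea, not the thesis' code: the AR(1) signal, the raw periodogram, and the discrete search for the division frequency are illustrative assumptions, and the criterion follows (2.55)-(2.56) as reconstructed above.

```python
import numpy as np

rng = np.random.default_rng(3)
m = 2000

# Slowly varying signal (first-order autoregression) plus white measurement noise.
z = np.zeros(m)
for k in range(m - 1):
    z[k + 1] = 0.93 * z[k] + 0.37 * rng.standard_normal()
x = z + 0.8 * rng.standard_normal(m)              # true noise standard deviation 0.8

# Periodogram of x on normalized frequencies in [0, 1] (1 = Nyquist).
spec = np.abs(np.fft.rfft(x)) ** 2 / m
nfreq = spec.size

# Smallest division index at which the minimum of the low-frequency band first
# drops to the average of the high-frequency band, mimicking (2.55).
i_B = next(i for i in range(2, nfreq - 1)
           if spec[1:i].min() <= spec[i:].mean())
sigma2_hat = spec[i_B:].mean()                    # estimated noise variance, as in (2.56)

print("w_B ~", round(i_B / (nfreq - 1), 2))
print("estimated noise variance:", round(sigma2_hat, 2), "(true 0.64)")
```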
Noise-Compensating Linear Regression with Estimated $C_{ee}$

In our introduction of noise-compensating (NC) linear regression in Subsection 2.3.2, it was shown that measurement noise e embedded in the regressor vector X deteriorates the accuracy of the estimated regression coefficients, and that this degradation can be avoided if X is first multiplied by the matrix $B = C_{ZZ} C_{XX}^{-1} = I - C_{ee} C_{XX}^{-1}$ before the linear regression. Note that $C_{ee}$ is required to obtain B. In this section we present two examples that demonstrate the performance of NC linear regression when $C_{ee}$ is not known a priori but is estimated by the power spectral analysis proposed in the previous section. The first example shows that unless the noise covariance is grossly overestimated, the performance of NC linear regression with estimated $C_{ee}$ is not much inferior to that of NC linear regression with known $C_{ee}$. The second example presents a case in which $C_{ee}$ is grossly overestimated; when this happens, the noise is overcompensated and the performance of NC linear regression degrades substantially. We then suggest a method for correcting the overcompensation of noise.

Example 2.3  Noise-compensating linear regression with noise estimation

After the noise variances are estimated, the noise-compensating linear regression can be applied. For this example, we generate 2000 samples of $Z \in R^5$. To make each variable change slowly over time, we use the first-order autoregressive model

$Z_i(k+1) = 0.93\, Z_i(k) + 0.37\, r_i(k), \qquad k = 1, \ldots, 1999,\; i = 1, \ldots, 5,$   (2.57)

where $r_i(k) \sim N(0, 1)$. (The notation $y \sim N(m, \sigma^2)$ indicates that y is a Gaussian random variable with mean m and variance $\sigma^2$.) The response variable Y is obtained through

$Y = \gamma^T Z + \varepsilon,$   (2.58)

where $\varepsilon \sim N(0, 0.1^2)$ and $\gamma = (-0.99\;\; 0.26\;\; 0.57\;\; -0.41\;\; 0.58)^T$. The observed vector X is

$X = Z + e,$   (2.59)

where $e \sim N(0, 0.8^2 I)$. Typical time plots of the $X_i(t)$ and Y are shown in Figure 2-8. To estimate $\gamma$, we use only X, Y, and the facts that the noise covariance $C_{ee}$ is diagonal and that each variable changes slowly over time.

Figure 2-8: Typical time plots of $X_1, \cdots, X_5$ and Y.

In the noise estimation and compensation (NEC) linear regression, the noise covariances are first estimated through the method explained in Example 2.2, and then the noise-compensating linear regression is carried out using the estimated noise covariance. The resulting estimated coefficients are presented in Table 2.4, together with results from other methods for comparison. It can be seen that the NEC linear regression performs much better than the traditional linear regression and almost as well as the noise-compensating linear regression with known $C_{ee}$.

   gamma      with known C_ee       with estimated C_ee     without estimation of C_ee
              mean      sigma        mean      sigma          mean      sigma
   -0.99    -0.9954    0.0375      -1.0874    0.0501        -0.5940    0.0367
    0.26     0.2417    0.0325       0.2868    0.0396         0.1550    0.0297
    0.57     0.5322    0.0306       0.6252    0.0421         0.3410    0.0327
   -0.41    -0.3750    0.0317      -0.4493    0.0413        -0.2464    0.0305
    0.58     0.5337    0.0356       0.6367    0.0466         0.3487    0.0324
   (gamma_hat - gamma)^T (gamma_hat - gamma):
             0.0052                 0.0180                   0.3005

Table 2.4: Results of noise estimation and compensation (NEC) linear regression, compared to traditional linear regression and noise-compensating linear regression with known $C_{ee}$.

It is important to note that all eigenvalues of $B = C_{ZZ} C_{XX}^{-1} = I - C_{ee} C_{XX}^{-1}$ must be non-negative, because $C_{XX}$ and $C_{ZZ}$ are both positive-semidefinite. When we use an estimated $C_{ee}$ to obtain B, one or more eigenvalues of B can fall below zero if one or more noise variances are estimated to be larger than their true values. We refer to this as grossly overestimated noise variances. This happens frequently when one or more eigenvalues of $C_{ZZ}$ are close to zero. In the next example, we modify the Z of Example 2.3 so that one of the eigenvalues of $C_{ZZ}$ is close to zero, and illustrate that overestimation of the noise variances can lead to a poor NC linear regression.

Example 2.4  Overcompensation of noise

This example demonstrates that the performance of NEC regression can be poor if the noise variances are grossly overestimated. For the example, we take the Z generated in Example 2.3 and multiply it by the matrix

$\begin{pmatrix} 0.61 & -0.98 & -0.53 & 0.87 & 0.57 \\ -0.18 & -0.76 & 0.27 & 0.72 & -0.76 \\ 0.25 & 0.67 & -0.46 & 0.86 & 0.84 \\ -0.68 & 0.88 & 0.08 & 0.20 & -0.17 \\ 0.63 & -0.67 & -0.36 & -0.55 & 0.32 \end{pmatrix}$   (2.60)

to get a new Z. The eigenvalues of the new covariance matrix $C_{ZZ}$ are (4.85, 3.05, 0.87, 0.49, 0.07). Note that the smallest eigenvalue of $C_{ZZ}$ is close to zero. We then applied the same noise estimation technique as in Example 2.2. The results of 50 simulations are presented in Table 2.5, together with results obtained from other methods. The results show that when the noise variances are overestimated, which makes one or more eigenvalues of B negative, the resulting NEC linear regression produces poor estimates of the regression coefficients.

   gamma      with known C_ee       with estimated C_ee     without estimation of C_ee
              mean      sigma        mean      sigma          mean      sigma
   -0.99    -1.0013    0.0697      -0.9234    0.3965        -0.8242    0.0250
    0.26     0.3094    0.1703      -0.3303    0.5419         0.0463    0.0287
    0.57     0.5661    0.0448       0.6528    0.1576         0.4813    0.0274
   -0.41    -0.4590    0.1214      -0.0706    0.3013        -0.2496    0.0251
    0.58     0.5443    0.1926       0.8968    0.4658         0.5081    0.0294
   (gamma_hat - gamma)^T (gamma_hat - gamma):
             0.0063                 0.5753                   0.1119

Table 2.5: Results of NEC linear regression when noise overcompensation occurs.
The previous example shows that noise-compensating linear regression should be used only when all eigenvalues of $C_{ZZ}$ are large, unless we can correct the overcompensation of noise. Fortunately, we can tell when overcompensation occurs by monitoring the eigenvalues of B, because each time overcompensation occurs, one or more eigenvalues of B become negative.

Fact  If the eigenvalues of $B_1 = I - \hat{C}_{ee} C_{XX}^{-1}$ are $\hat{\lambda}_1, \cdots, \hat{\lambda}_n$, the eigenvalues of $\hat{C}_{ee} C_{XX}^{-1}$ are $1 - \hat{\lambda}_1, \cdots, 1 - \hat{\lambda}_n$. Furthermore, the eigenvalues of $\theta \hat{C}_{ee} C_{XX}^{-1}$ are $\theta(1 - \hat{\lambda}_1), \cdots, \theta(1 - \hat{\lambda}_n)$.

When $\hat{\lambda}_n < 0$, we know that overcompensation occurs. We propose the following recompensation algorithm for noise overcompensation.

Proposition  When $\hat{\lambda}_n < 0$, we force this eigenvalue to zero by redefining $B = I - \theta \hat{C}_{ee} C_{XX}^{-1}$, where $\theta = 1/(1 - \hat{\lambda}_n)$. When $\hat{\lambda}_n \ge 0$, no recompensation is necessary, so $\theta = 1$:

$\theta = \begin{cases} 1 & \text{if } \hat{\lambda}_n \ge 0 \\ \dfrac{1}{1 - \hat{\lambda}_n} & \text{otherwise.} \end{cases}$

Since this proposition forces the negative eigenvalue $\hat{\lambda}_n$ to zero, we call it zero-forcing of eigenvalues. The zero-forcing algorithm is depicted in Figure 2-9, and a code sketch of the rule is given after Example 2.5. Note that the algorithm uses only X and the facts that $C_{ee}$ is diagonal and that the $X_i(t)$ change slowly in time.

Figure 2-9: Zero-forcing of eigenvalues: estimate $\sigma_{e_i}^2$ from $P_{X_i X_i}(\omega)$; form $B_1 = I - \hat{C}_{ee} C_{XX}^{-1}$ with eigenvalues $\hat{\lambda}_1 \ge \cdots \ge \hat{\lambda}_n$; if $\hat{\lambda}_n \ge 0$, set $B = B_1$; otherwise set $B = I - \frac{1}{1 - \hat{\lambda}_n} \hat{C}_{ee} C_{XX}^{-1}$.

Example 2.5  Zero-forcing of eigenvalues in noise-compensating linear regression

We apply the algorithm of Figure 2-9 to the dataset generated in the previous example. To compare the result with those of other methods, Table 2.6 reproduces Table 2.5 and adds a column for zero-forcing. It can be seen that the zero-forcing technique dramatically improves the regression results.

   gamma    known C_ee          estimated C_ee,        estimated C_ee,        no estimation of C_ee
                                no zero-forcing        zero-forcing
            mean     sigma      mean      sigma        mean      sigma        mean      sigma
   -0.99  -1.0013   0.0697    -0.9234    0.3965      -1.0115    0.3188      -0.8242    0.0250
    0.26   0.3094   0.1703    -0.3303    0.5419       0.2987    0.3157       0.0463    0.0287
    0.57   0.5661   0.0448     0.6528    0.1576       0.5959    0.1136       0.4813    0.0274
   -0.41  -0.4590   0.1214    -0.0706    0.3013      -0.3742    0.2027      -0.2496    0.0251
    0.58   0.5443   0.1926     0.8968    0.4658       0.6376    0.2717       0.5081    0.0294
   (gamma_hat - gamma)^T (gamma_hat - gamma):
           0.0063              0.5753                 0.0072                 0.1119

Table 2.6: Improvement in NEC linear regression due to zero-forcing of eigenvalues.
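The zero-forcing rule of Figure 2-9 reduces to a few lines of NumPy. The sketch below is illustrative rather than the thesis' implementation; S_xx is the sample covariance of X and noise_var_est holds the spectrally estimated diagonal noise variances.

```python
import numpy as np

def zero_forcing_transform(S_xx, noise_var_est):
    """Noise-compensating transform B = I - theta * C_ee_hat * S_xx^{-1},
    with theta chosen by the zero-forcing rule so that no eigenvalue of B
    is negative (sketch of Figure 2-9)."""
    n = S_xx.shape[0]
    C_ee_hat = np.diag(noise_var_est)
    S_inv = np.linalg.inv(S_xx)
    B1 = np.eye(n) - C_ee_hat @ S_inv
    lam_min = np.linalg.eigvals(B1).real.min()
    if lam_min >= 0:
        return B1                                  # no overcompensation detected
    theta = 1.0 / (1.0 - lam_min)                  # forces the smallest eigenvalue to zero
    return np.eye(n) - theta * C_ee_hat @ S_inv
```

Regressing Y on the transformed variables (rows of X times B transposed) then proceeds exactly as in the earlier noise-compensation sketch.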
Chapter 3

Blind Noise Estimation

3.1 Introduction

This chapter addresses the problem of estimating noise and its variance in noisy multivariate datasets with finite observations, while imposing minimal assumptions on the dataset. We are especially interested in multivariate datasets for which the number of degrees of freedom of the signal portion is 1) much smaller than the number of degrees of freedom of the noisy dataset, and 2) unknown a priori. This problem has been studied in the context of signal enhancement as single- and two-sensor problems in [24, 5], in which the order of the time-series model underlying the dataset is assumed known; this is in effect equivalent to assuming known degrees of freedom. Estimating the number of signals in the case of uniform noise variances has been studied in the context of model identification in [25, 26, 27, 28], where model selection criteria are used to select the degrees of freedom.

This chapter is organized as follows. The noisy multivariate data model is defined in Section 3.2. It is assumed throughout the chapter that the signal and noise are independent Gaussian random vectors. The assumption that measurement noise is independent and Gaussian is natural. The justifications for assuming a Gaussian signal are:

1. Hardly anything is known about the signal.
2. A Gaussian is a good approximation of many stochastic processes found in nature (see, for example, [29], page 364).
3. The Gaussian distribution is easily tractable in developing complex algorithms.
4. The algorithm we develop based on the Gaussian assumption works well on signals drawn from other distributions in our preliminary simulations.

The noisy data vector is modeled as the sum of a signal vector and a noise vector, where the signal vector is modeled as an instantaneous linear mixture of a few Gaussian random variables. Nothing is assumed to be known about the noise vector except that it is an independent Gaussian random vector with a diagonal covariance matrix. The remainder of the chapter addresses estimation of the noise variances and the number of source signals. In Section 3.3 the motivation for and usefulness of blind estimation of noise variances are addressed in the context of the Noise-Adjusted Principal Component (NAPC) transform [1]. Section 3.4 addresses the problem of estimating the number of source signals. Our approach is based on detection of the 'pseudo-flat noise baseline' in an eigenvalue screeplot, which can suggest the number of source signals in the case of an identity noise matrix. We first propose a numerical criterion for estimating the number of source signals in the case of an identity noise matrix, and then extend the criterion to the case of non-identity noise matrices. As a related problem, upper and lower bounds on the noise eigenvalues for a non-identity noise matrix are derived in Section 3.4.3.

In Section 3.5, we consider the blind noise estimation problem when the number of source signals is known. Our approach is based on the Expectation-Maximization (EM) algorithm, a computational strategy for solving complex maximum-likelihood estimation problems. We first describe the problem of maximum-likelihood (ML) estimation of the noise variances and noise sequences in Section 3.5.1. Detailed derivations of the expectation step and the maximization step are presented in Sections 3.5.2 and 3.5.3, respectively. A test of the algorithm is provided in Section 3.5.4. In Section 3.6, we address the problem of blind noise estimation when the number of source signals is unknown. Because EM-based blind noise estimation requires the number of source signals or an estimate of it, we supply the EM algorithm with the source signal number retrieved by the algorithm developed in Section 3.4. We take this one step further by repeating the estimation of the source signal number and the noise variances back and forth, so that an improvement in one estimate leads to a better estimate of the other, and vice versa. This iterative order and noise estimation procedure is designated the ION algorithm. We present simulation results of the ION algorithm in Section 3.6.2.
3.2 Data Model

Figure 3-1: Model of the noisy data: the signal vector Z(t) passes through n sensors, each of which adds noise e(t), producing the noisy data X(t).

Figure 3-1 illustrates the noisy data model used throughout this chapter. The n-dimensional signal vector Z(t) is measured by n sensors idealized as having n additive noise sources e(t). The readings of the sensors constitute the n-dimensional vector X(t). It is important to realize that neither Z(t) nor e(t) is available to us; only finite observations of X(t) are available. Our objective is to recover the unknown noise variances from the available finite samples of X(t). It is also desirable to recover the actual noise sequences e(t), if possible. We should note that this objective cannot be achieved in general, that is, the noisy data vector cannot be separated reliably into the signal vector and the noise vector.

3.2.1 Signal Model

The signal vector Z(t) is modeled as an instantaneous linear mixture of the elements of the p-dimensional source vector P(t). Figure 3-2 illustrates this. The relation between P(t) and Z(t) is

$Z(t) = A\, P(t),$   (3.1)

where A is the mixing matrix, assumed to be of full column rank but otherwise unknown. We only consider the case where the mixing matrix is fixed over time. The term 'instantaneous' emphasizes that only present values of the source variables are used to generate current values of the signal vector Z(t). The number of source variables, denoted by p, is unknown. This means that we do not know the mixing matrix, and we also do not know how many columns it has. In summary, each variable in Z(t) is an unknown linear combination of an unknown number of unknown source variables.

Figure 3-2: Signal model as an instantaneous mixture of p source variables, Z(t) = A P(t).

The source vector P(t) of unknown dimension p is modeled as a statistically independent Gaussian random vector with zero mean. The variances of the source variables are in principle arbitrary, since a scalar factor can be exchanged between any source and its associated column of the mixing matrix without altering X(t). Therefore, without any loss of generality we can assume the variance of each source variable to be unity. Combined with the assumed independence of the source variables, the unit variances yield an identity matrix of unknown dimension p x p as the covariance matrix of P(t). Since linear combinations of Gaussian random variables are Gaussian, the signal vector Z(t) is a Gaussian random vector.

3.2.2 Noise Model

The noise vector e(t) is modeled as a Gaussian random vector, independent of Z(t), with zero mean and an unknown diagonal covariance matrix denoted by G. The noisy data vector X(t) is then a Gaussian random vector with mean and covariance

$E\left[X(t)\right] = A\, E\left[P(t)\right] + E\left[e(t)\right] = 0,$   (3.2)
$C_{XX} = E\left[X(t) X(t)^T\right] = A A^T + G,$   (3.3)

where the superscript T denotes matrix transposition.
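For concreteness, the following NumPy sketch generates data from this model. It is not part of the thesis; the values of m, n, p and the range of noise variances are illustrative assumptions (they roughly echo the example datasets introduced later in this chapter).

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, p = 3000, 50, 15                      # samples, observed variables, sources

A = rng.standard_normal((n, p))             # unknown mixing matrix (full column rank)
G = np.diag(rng.uniform(0.04, 2.0, n))      # unknown diagonal noise covariance

P = rng.standard_normal((m, p))             # unit-variance, independent Gaussian sources
Z = P @ A.T                                 # noise-free signal, Z(t) = A P(t)
E = rng.standard_normal((m, n)) @ np.sqrt(G)
X = Z + E                                   # observed noisy data

# Sample covariance of X approaches A A^T + G of (3.3) as m grows.
S = np.cov(X, rowvar=False)
print(round(np.abs(S - (A @ A.T + G)).max(), 2))
```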
3.3 Motivation for Blind Noise Estimation

When we received a number of multivariate datasets for analysis from various manufacturing facilities, we suspected that many of them were subject to significant measurement errors. Our suspicions were generally confirmed by manufacturing site visits. However, even at the plants there often was no a priori information concerning the noise statistics. Since $C_{ee}$ is essential for many of the multivariate analysis methods we wanted to use, we became interested in ways to estimate $C_{ee}$ from a given dataset.

The simplest form of $C_{ee}$ is a constant multiple of an identity matrix, namely $\sigma_e^2 I$. This is the case when the measurement errors are uncorrelated with each other and each measurement error has the same variance. It is reasonable to assume that most factory measurement errors are uncorrelated, since measurement errors originating from different sensors tend to be independent. However, it is not very realistic to assume that those measurement errors have the same variance. For most manufacturing datasets we attempted to analyze, some variables have very large measurement errors while others appear to have small measurement errors. Therefore, it is more reasonable to assume that $C_{ee}$ is a diagonal matrix with diagonal elements $\sigma_{e_1}^2, \cdots, \sigma_{e_n}^2$:

$C_{ee} = \mathrm{diag}\left(\sigma_{e_1}^2, \cdots, \sigma_{e_n}^2\right).$   (3.4)

When the dimension p of the source vector and the population noise covariance G are known, the so-called noise-adjusted principal component (NAPC) transform [1, 30] is typically used to compress an n-dimensional noisy data vector X(t) into the p-dimensional subspace spanned by the p eigenvectors associated with the largest p eigenvalues of the covariance of the "noise-adjusted" random vector, defined as $G^{-1/2} X(t)$. For notational convenience, we drop the time index t henceforth when doing so does not introduce confusion, and reinstate it when necessary. In NAPC, the covariance of the noise-adjusted random vector is obtained first:

$E\left[G^{-1/2} X X^T G^{-1/2}\right] = G^{-1/2} C_{XX} G^{-1/2}$   (3.5)
$= G^{-1/2} \left(A A^T + G\right) G^{-1/2}$   (3.6)
$= G^{-1/2} A A^T G^{-1/2} + I.$   (3.7)

Since A is not known, $A A^T$ of (3.7) is typically replaced by $S_{XX} - G$, where $S_{XX}$ is the sample covariance of X computed from a sample dataset X. Let $\Lambda$ be the diagonal matrix of eigenvalues of (3.7) in descending order and Q the corresponding matrix of eigenvectors. It can be shown that the transformation of X into the subspace spanned by the first p columns of Q contains all variations originating from the signal Z, or equivalently, that the transformation of X into the subspace spanned by the remaining n - p columns of Q contains variations originating only from noise. The subspace spanned by the first p eigenvectors of Q is called the signal subspace, and the subspace orthogonal to it is called the noise subspace. As such, NAPC is capable of dividing a noisy vector into a vector of pure noise and another vector of signal plus noise. This characteristic of NAPC is utilized in many applications such as noise filtering, data compression, and system identification and characterization.

Up to now, we have assumed the case of known G and p. In fact, the practicality of NAPC depends heavily upon the availability of these two parameters, and what limits the usability of NAPC is that neither parameter is known a priori for many 'real-world' datasets. This prevents one from noise-adjusting X before the principal component transform. In that case, an algorithm which reliably estimates the source dimension and the noise covariance from a sample dataset would be useful.
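A compact sketch of the noise adjustment and eigendecomposition described above follows; it assumes G and p are known (the situation that blind estimation is meant to relax), and the function name and return values are illustrative. The data-model sketch from Section 3.2 can supply X and G.

```python
import numpy as np

def napc(X, G, p):
    """Noise-Adjusted Principal Components: whiten by G^{-1/2}, eigendecompose the
    sample covariance of the noise-adjusted data, and keep the top-p components."""
    G_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(G)))
    Xw = (X - X.mean(0)) @ G_inv_sqrt            # noise-adjusted data, G^{-1/2} X
    S = np.cov(Xw, rowvar=False)                 # estimate of G^{-1/2} A A^T G^{-1/2} + I
    eigval, eigvec = np.linalg.eigh(S)
    order = np.argsort(eigval)[::-1]             # descending eigenvalues
    Q_p = eigvec[:, order[:p]]                   # signal-subspace eigenvectors
    return Xw @ Q_p, eigval[order]               # compressed data and the screeplot values
```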
3.4 Signal Order Estimation through Screeplot

The problem of detecting the number of meaningful source signals in a noisy observation vector X has been actively studied over the last few years, partially in response to the surging need to compress multivariate datasets whose sizes grow exponentially with technological advances in sensors and data storage. Detecting the number of source signals contained in a noisy vector is beneficial in many ways: 1) to boost the signal-to-noise ratio by eliminating noise which is not in the signal subspace, 2) to better characterize the complex system represented by the noisy observations, and 3) to reduce the size of the dataset.

If the population noise covariance G is an identity matrix multiplied by a scalar, namely $\sigma_e^2 I$, and if the population covariance $C_{XX}$ of the noisy observation vector X is known, then examining the eigenvalues of $C_{XX}$ reveals the number of source signals easily, because n - p eigenvalues of $C_{XX}$ are equal to $\sigma_e^2$ and the remaining p eigenvalues are greater than $\sigma_e^2$. Therefore, determining the number of source signals is as easy as determining the multiplicity of the smallest eigenvalue of $C_{XX}$. In most practical situations, the population covariance $C_{XX}$ is unknown, and the sample covariance matrix $S_{XX}$ computed from finite samples of X is used in its place. The problem with this practice is that the finite sample size ensures that all resulting eigenvalues of $S_{XX}$ are different, making it difficult to determine the number of source signals merely by counting the multiplicity of the smallest eigenvalue.

More sophisticated approaches to the problem, developed in [28, 31], are based on an information-theoretic approach. In those papers, the authors suggested that the estimate of the source number be obtained by minimizing model selection criteria first introduced by Akaike in [25, 26] and by Schwartz and Rissanen in [27, 32]. Cabrera-Mercader suggested in [30] that libraries of sample noise eigenvalues be generated by computer simulation for different sample sizes; these computer-generated sample noise eigenvalues are then used in place of the population noise eigenvalues. By doing so, he observed substantial improvement in the accuracy of the signal-order estimate in numerical simulations. To the best of our knowledge, all previous studies on source signal number estimation have been conducted only for cases in which either the population noise covariance G is an identity matrix or G is known a priori.

In this section we propose a simple numerical method for estimating the source signal number. The proposed method is a numerical implementation of what the human eye does when separating signal-dominated from noise-dominated eigenvalues in an eigenvalue screeplot: first determine a straight line which fits the noise-dominated eigenvalues, and then determine the point at which the screeplot diverges significantly from that line. In determining the source signal number, the results of the proposed method will be shown to be comparable with those made by eye. The advantages of the proposed method are twofold. 1) No human intervention is required in determining the source signal number; this plays an important role in the successive estimation of p and G in the coming sections. 2) The proposed method is more robust to changes in eigenvalues caused by a non-identity noise covariance. Unlike the methods presented in [28, 31], in which knowledge of G is assumed, the proposed method yields a relatively accurate estimate of the source signal number when G is unknown and not necessarily an identity matrix.
We begin in Section 3.4.1 by briefly describing the example datasets used throughout the chapter, and we define a multivariate version of the signal-to-noise ratio (SNR) in terms of signal and noise eigenvalues. Section 3.4.2 presents an introductory description of the proposed method: we explain qualitatively how one can determine the number of source signals when the population noise covariance G is an identity matrix, and we consider the changes in the noise eigenvalues caused by substituting the population covariance matrix with a sample covariance matrix computed from a finite number of samples of X. We then extend the proposed method to the case of an unknown, non-identity noise covariance. We justify this extension in Section 3.4.3 by providing upper and lower bounds on the noise eigenvalues when G is a non-identity diagonal matrix, and we show that the shape of the screeplot is altered only to the extent of the difference between the largest and smallest noise variances. The quantitative algorithm itself is presented in Section 3.4.4.

3.4.1 Description of Two Example Datasets

It is our intention to supplement each new algorithm introduced throughout the chapter with simulation results and graphs. Before describing the method for source signal number estimation, we construct two simple noisy multivariate datasets. By using the same two examples whenever simulations are required, we maintain coherence among the algorithms introduced in this chapter. The two example datasets are designed to be simple, yet sophisticated enough to represent real multivariate datasets, which are modeled in this thesis as being generated from a jointly Gaussian random distribution.

As explained in Section 3.2, a noisy data vector X is modeled as the sum of the noise-free signal vector Z and the noise vector e, where the signal vector Z is an instantaneous linear mixture of the source vector P, and P is a jointly Gaussian random vector with zero mean and an identity covariance matrix. The dimension of P is set to fifteen and the dimension of X is set to fifty for both example datasets. Consider next the selection of the population noise covariance G. We would like to create one dataset with an identity noise covariance matrix and another with a non-identity covariance matrix. The first dataset is therefore created so that the population noise covariance is I. This dataset emulates cases where noise variances are naturally uniform over variables, or where variables have been standardized by their noise variances before acquisition of the dataset. The second dataset is created so that the population noise covariance is a diagonal matrix whose diagonal elements are $1/50, \ldots, 1/2, 1$ multiplied by a constant c. The constant c controls the multivariate signal-to-noise ratio, to be defined shortly. This noise covariance is intended to emulate cases where the noise variances vary over the variables.

In specifying the mixing matrix A, we would like A to be as random as possible, except that the two datasets should have comparable SNR. To achieve this, we must first make clear what we mean by SNR for a multivariate dataset. In a univariate dataset, SNR is defined as the ratio of the signal variance to the noise variance. For example, for a variable $X_1 = Z_1 + e_1$, the SNR is defined as

$\mathrm{SNR}(X_1) = 10 \log_{10} \frac{\sigma_{Z_1}^2}{\sigma_{e_1}^2} \;\; \text{(dB)},$   (3.8)

where $\sigma_{Z_1}^2$ and $\sigma_{e_1}^2$ denote the variances of $Z_1$ and $e_1$, respectively.
In the case of a multivariate dataset, we define SNR as the ratio of the sum of all signal variances to the sum of all noise variances, or

$\mathrm{SNR}(X) = 10 \log_{10} \frac{\mathrm{tr}\left(C_{ZZ}\right)}{\mathrm{tr}\left(G\right)} \;\; \text{(dB)},$   (3.9)

where tr(.) denotes the trace of a matrix. Recalling that $\mathrm{tr}(C_{ZZ})$ is equal to the sum of the eigenvalues of $C_{ZZ}$, (3.9) can also be expressed in terms of the eigenvalues of $C_{ZZ}$ and G,

$\mathrm{SNR} = 10 \log_{10} \frac{\sum_{i=1}^{n} \lambda_{Z,i}}{\sum_{i=1}^{n} \lambda_{G,i}} \;\; \text{(dB)}.$   (3.10)

The value of tr(G) is 50 for the first example and 9 for the second example. Somewhat arbitrarily, we set the SNR to be 24 dB for both examples. The eigenvalues of $C_{ZZ}$ should then satisfy $\sum_i \lambda_{Z,i} = 50 \times 10^{2.4}$ for the first example and $\sum_i \lambda_{Z,i} = 9 \times 10^{2.4}$ for the second. We generate two mixing matrices, one for each example, so that the eigenvalues of $C_{ZZ}$ meet these conditions; otherwise the two mixing matrices are arbitrary. Finally, we set the sample size m to 3000. The specifications for the two examples are summarized in Table 3.1.

          m      n     p    SNR (dB)    G
   (a)  3000    50    15     24.16      I
   (b)  3000    50    15     24.16      2 diag(1/50, 1/49, ..., 1/2, 1)

Table 3.1: Summary of parameters for the two example datasets used throughout this chapter.

3.4.2 Qualitative Description of the Proposed Method for Estimating the Number of Source Signals

When the population noise covariance G is an identity matrix up to a scalar multiplication factor $\sigma_e^2$, the largest p eigenvalues of the population covariance matrix $C_{XX}$ are larger than $\sigma_e^2$ and the remaining n - p eigenvalues are equal to $\sigma_e^2$. In theory the number of source signals can be determined by counting the number of eigenvalues greater than the smallest eigenvalue $\sigma_e^2$; this is called the subspace separation method [35]. In reality, the often unavailable population covariance $C_{XX}$ is replaced by the sample covariance $S_{XX}$ computed from a finite-sized sample X, and the sample noise eigenvalues obtained by eigendecomposition of $S_{XX}$ are all different with certainty. Therefore, we cannot determine the number of source signals by counting the eigenvalues larger than the smallest eigenvalue, for this count will always be n - 1 no matter what the true value of p is.

There has been various research on this problem. Parallel analysis, suggested in [33], determines the number of signals by comparing the eigenvalues of the covariance matrix of the standardized dataset at hand with those of a simulated dataset drawn from a standardized normal distribution. Allen and Hubbard [34] further promoted this idea and developed a regression equation which determines the eigenvalues for standardized normal datasets. Our approach to the problem is an extension of the subspace separation method.

Figure 3-3 shows a few screeplots obtained from Example (a) of Table 3.1. The sample covariances are computed from different sample sizes while the population covariance matrix remains unchanged. It is clear from the figure that the noise-dominated eigenvalues are no longer constant when the sample covariance $S_{XX}$ replaces the population covariance $C_{XX}$, and the effect is more pronounced for smaller sample sizes.

Figure 3-3: Illustration of changes in noise eigenvalues for different sample sizes (dataset (a) of Table 3.1).
More importantly, noise eigenvalues tend to form a straight line when the vertical axis is on a logarithmic scale. Based on this observation, we can still obtain a rough estimate of the number of source signals by counting eigenvalues which lie significantly above the straight line of noise eigenvalues. We will define quantitatively the meaning of eigenvalues that lie significantly above the straight line of noise eigenvalues in Section 3.4.4. Our proposed method can be also applied when the population noise covariance is arbitrary. This is best visualized by an example. Let's consider the two screeplots of Figure 3-4. These two screeplots are obtained from two example datasets whose parameters are specified in Table 3.1. Figure 3-4(a) is a sample screeplot of the first dataset in which the population noise covariance G is an identity matrix. As we explained, the noise-dominated eigenvalues form a straight line and the transition from noise-dominated eigenvalues to signal-dominated eigenvalues can be hand-picked by finding the point where the screeplot departs from the straight line that fits the noise-dominated eigenvalues. We marked the point in the figure. The actual break is at Index = 15. This idea is also applicable to the screeplot of Figure 3-4(b) , which is a sample screeplot of the second dataset in which the 57 population noise covariance is a diagonal matrix with the elements of 2/50,-.. , 2/2, 2. It can be seen that the slope of the signal-dominated eigenvalues is more negative than the slope of the noise-dominated eigenvalues. One can estimate the number of signal sources by determining where the change in slope occurs. Before we move on to derivations of upper and lower bounds of eigenvalues of Cxx, we would like to emphasize that this method works best when the noise covariance is an identity matrix. When the noise covariance is arbitrary, this method tends to underestimate the number of source signals as it can be seen by comparing (a) and (b) of Figure 3-4. Therefore, this method should be considered only as a way to obtain a rough estimate of the number of source signals when G is arbitrary. 5 0 5 (a) 10 (b) 10 Transition ........ 10 10 Transition 102 12 i~2 0 10 20 30 40 0 50 10 20 Index 30 40 50 Index Figure 3-4: Two screeplots of datasets specified by Table 3.1 Upper and Lower Bounds of Eigenvalues of Cxx 3.4.3 Consider an ri x ri covariance matrix Cx Ax,n A, > = Cz + CeE and its eigenvalues Ax,1 >.- -> 0. Assuming that Cz is a p rank matrix, the eigenvalues of Cz are denoted by - .- - Az,p > Az,+i = -= 0. The noise covariance C az,= is a diagonal matrix withn the entries twoniewa ou ,.-., cr = x ndiagonal . We covariance to find an matri bound +o,jand a lower CXX, CZ want + CEand its,upper=ievle thi bound of AX,3 as functions of AZ,2 and the diagonal entries of Cs. For future reference, the smallest diagonal entry of Ca is denoted by ul and the largest entry by uj. If we define InA 2 and C 58 eigenvalues are, respectively, i= AZi + gal, (3.11a) i = p + 1,---,n , AX,3,i =1* (3.11b) = p + 1, - - -,n 2 0,1 In Theorem 3.1, we will prove that Ax,a,i and Ax,3,i are lower and upper bounds for Ax,j for i = 1, ..., n. Theorem 3.1 For Ax,o,, --- , Ax,,n, AX, 1 ,l -.., Ax,0,n, and Ax,1, .. , AX,n described in the previous paragraph, Ax'a,1 :5 Ax,1 < Ax,0,1 (3.12) AX,n AX,a,n < AX,/3,n Theorem 3.1 states that the plot of eigenvalues of Cxx always lies between the plots of eigenvalues of CXX,a and Cxx,3. 
We already saw an example of a screeplot which illustrates the meaning of the theorem. In order to establish Theorem 3.1, we need the following lemma.

Lemma 3.1.1  Let $C_{XX} = C_{ZZ} + \sigma_{e_k}^2 e_k e_k^T$, where $e_k$ is the n-dimensional unit vector whose only nonzero entry is its kth element. Note that $\sigma_{e_k}^2 e_k e_k^T$ is an n x n matrix whose only nonzero entry is its (k, k) element, and the value of that entry is $\sigma_{e_k}^2$. Then $\lambda_{X,i} \ge \lambda_{Z,i}$ for all $i = 1, \cdots, n$.

Proof.
(a) $\lambda_{X,1} \ge \lambda_{Z,1}$: Let the orthonormal eigenvectors of $C_{XX}$ and $C_{ZZ}$ be $h_{X,1}, \cdots, h_{X,n}$ and $h_{Z,1}, \cdots, h_{Z,n}$, respectively. Then

$\lambda_{X,1} = h_{X,1}^T C_{XX} h_{X,1} \ge h_{Z,1}^T C_{XX} h_{Z,1} = h_{Z,1}^T \left(C_{ZZ} + \sigma_{e_k}^2 e_k e_k^T\right) h_{Z,1} = \lambda_{Z,1} + \sigma_{e_k}^2 \left(h_{Z,1}^T e_k\right)^2 \ge \lambda_{Z,1}.$

(b) $\lambda_{X,2} \ge \lambda_{Z,2}$: Define a unit vector u as

$u = c_1 h_{Z,1} + c_2 h_{Z,2},$   (3.13)

where

$c_1 = \frac{h_{Z,2}^T h_{X,1}}{\sqrt{\left(h_{Z,1}^T h_{X,1}\right)^2 + \left(h_{Z,2}^T h_{X,1}\right)^2}}, \qquad c_2 = \frac{-\,h_{Z,1}^T h_{X,1}}{\sqrt{\left(h_{Z,1}^T h_{X,1}\right)^2 + \left(h_{Z,2}^T h_{X,1}\right)^2}}.$   (3.14)

Note that u is perpendicular to $h_{X,1}$ because

$u^T h_{X,1} = c_1 h_{Z,1}^T h_{X,1} + c_2 h_{Z,2}^T h_{X,1} = 0.$   (3.15)

Then $\lambda_{X,2}$, the second eigenvalue of $C_{XX}$, is not smaller than $\lambda_{Z,2}$ because

$\lambda_{X,2} = \max_{\|v\|=1,\; v \perp h_{X,1}} v^T C_{XX} v \ge u^T C_{XX} u = u^T \left(C_{ZZ} + \sigma_{e_k}^2 e_k e_k^T\right) u = c_1^2 \lambda_{Z,1} + c_2^2 \lambda_{Z,2} + \sigma_{e_k}^2 \left(u^T e_k\right)^2 \ge \lambda_{Z,2}.$

The method of (b) can be extended to prove $\lambda_{X,i} \ge \lambda_{Z,i}$ for any $i = 3, \cdots, n$.

Similar to Lemma 3.1.1 is Corollary 3.1.1.

Corollary 3.1.1  Let $C_{XX} = C_{ZZ} - \sigma_{e_k}^2 e_k e_k^T$. Then $\lambda_{X,i} \le \lambda_{Z,i}$ for all $i = 1, \cdots, n$.

Now we are ready to prove Theorem 3.1.

Proof of Theorem 3.1.  We can write $C_{XX}$ as

$C_{XX} = \left(C_{ZZ} + \sigma_{e,\alpha}^2 I\right) + \sum_{k=1}^{n} \left(\sigma_{e_k}^2 - \sigma_{e,\alpha}^2\right) e_k e_k^T.$   (3.16)

From (3.16) and Lemma 3.1.1, the eigenvalues of $C_{XX}$ are greater than or equal to the eigenvalues of $C_{ZZ} + \sigma_{e,\alpha}^2 I$, or

$\lambda_{X,\alpha,i} \le \lambda_{X,i}.$   (3.17)

Similarly, $C_{XX}$ can be written as

$C_{XX} = \left(C_{ZZ} + \sigma_{e,\beta}^2 I\right) - \sum_{k=1}^{n} \left(\sigma_{e,\beta}^2 - \sigma_{e_k}^2\right) e_k e_k^T.$   (3.18)

From (3.18) and Corollary 3.1.1, the eigenvalues of $C_{XX}$ are smaller than or equal to the eigenvalues of $C_{ZZ} + \sigma_{e,\beta}^2 I$, or

$\lambda_{X,i} \le \lambda_{X,\beta,i}.$   (3.19)

Figure 3-5 visualizes the meaning of Theorem 3.1: the screeplot of the eigenvalues of $C_{XX}$ lies in the narrow area between the screeplot of $\lambda_{X,\alpha,1}, \cdots, \lambda_{X,\alpha,n}$ (lower curve) and the screeplot of $\lambda_{X,\beta,1}, \cdots, \lambda_{X,\beta,n}$ (upper curve). Of course, if the difference between $\sigma_{e,\alpha}^2$ and $\sigma_{e,\beta}^2$ is large, the area may not be so narrow. However, the difference is not larger than one in a normalized dataset. This implies that the screeplot changes only slightly even if the noise variances are not uniform, and that the extent of the change is bounded by the difference between $\sigma_{e,\alpha}^2$ and $\sigma_{e,\beta}^2$.

Figure 3-5: A simple illustration of lower and upper bounds on the eigenvalues of $C_{XX}$. The smallest and largest diagonal entries of $C_{ee}$ are $\sigma_{e,\alpha}^2$ and $\sigma_{e,\beta}^2$, respectively.

3.4.4 Quantitative Decision Rule for Estimating the Number of Source Signals

The objective of this section is to develop a quantitative algorithm for estimating the number of source signals. This is nothing more than a numerical implementation of the qualitative method of determining p by examining an eigenvalue screeplot by eye, described in Section 3.4.2. The section is organized as follows. We first explain the method for determining the line which fits a pre-determined portion of the noise-dominated eigenvalues; this line is called the noise baseline. We then give a definition of eigenvalues which depart significantly from the baseline. Counting the number of these eigenvalues yields an estimate of the number of source signals.

Evaluation of the Noise Baseline

To determine the line which fits the noise eigenvalues, we should first determine which eigenvalues are noise eigenvalues.
By definition, we cannot identify the full set of noise eigenvalues, because otherwise we would not have to estimate the number of source signals in the first place. Therefore, we first decide a priori on eigenvalues which can safely be assumed to be noise-dominated. This depends on the compressibility of the individual dataset. For this thesis, we assume that the number of source signals is always fewer than 0.4 times the number of variables. This assumption is made empirically from the many manufacturing datasets we worked on; we emphasize that it may have to be modified for different datasets.

There is another consideration before we obtain the noise baseline. When m is not sufficiently large compared to the number of variables, the noise eigenvalues often decrease rapidly near the end of a screeplot. For the purpose of obtaining the noise baseline, these fast-decreasing eigenvalues should not be included. For this thesis, we assume that the steep decrease in noise eigenvalues does not begin until the smallest 40 percent of the eigenvalues. Combining this with the assumption of the previous paragraph, we conclude that the middle 20 percent of the eigenvalues are noise-dominated eigenvalues free from any decrease in value due to an insufficient sample size. From these eigenvalues, we deduce the noise baseline.

The actual process of determining the noise baseline is simple. Let $\lambda_{l_1}, \cdots, \lambda_{l_2}$ represent the middle 20 percent of the eigenvalues, where $l_1, \cdots, l_2$ are the indexes of those eigenvalues. Noting that the vertical axis of a screeplot is generally on a logarithmic scale, we set up the linear equation between the index i and $\lambda_i$ as

$\log_{10} \lambda_i = \alpha \cdot i + \beta + \epsilon_i, \qquad i = l_1, \cdots, l_2,$   (3.20)

where $\epsilon_i$ represents the statistical error term. Using the standard least squares linear regression criterion, we obtain estimates of $\alpha$ and $\beta$ from

$(\hat{\alpha}, \hat{\beta}) = \underset{\alpha, \beta}{\arg\min} \sum_{i=l_1}^{l_2} \left(\log_{10} \lambda_i - \alpha \cdot i - \beta\right)^2,$   (3.21)

and the noise baseline is given by

$\log_{10} \hat{\lambda}_i = \hat{\alpha} \cdot i + \hat{\beta}, \qquad i = 1, \cdots, n.$   (3.22)

In Figure 3-6, we redraw Figure 3-4 with the noise baselines. Note that the two noise baselines follow the noise-dominated eigenvalues closely; the divergences between the baselines and the eigenvalues occur at the transition points indicated in Figure 3-4.

Figure 3-6: Repetition of Figure 3-4 with the straight lines which best fit the noise-dominated eigenvalues.

Determination of the Transition Point

The number of source signals is estimated by the number of signal-dominated eigenvalues, namely those diverging significantly from the noise baseline. Here we define the term significantly in quantitative terms. Let $\lambda_1, \cdots, \lambda_n$ be the eigenvalues of the covariance matrix of X and $\hat{\lambda}_1, \cdots, \hat{\lambda}_n$ be the corresponding values on the noise baseline. Let $L_i$ denote the difference between the logarithmic values of $\lambda_i$ and $\hat{\lambda}_i$, that is,

$L_i = \log_{10} \lambda_i - \log_{10} \hat{\lambda}_i.$   (3.23)
Then $\lambda_i$ is defined as a signal-dominated eigenvalue if the following conditions are met:

1. $L_i$ must be larger than a pre-determined threshold. We set the threshold to

$\gamma = 20\, \frac{\sum_{i=l_1}^{l_2} L_i^2}{l_2 - l_1}.$   (3.24)

2. There should be only one transition point between signal-dominated and noise-dominated eigenvalues. Therefore, for $\lambda_i$ to be a signal-dominated eigenvalue, $\lambda_1, \cdots, \lambda_{i-1}$ must all be signal-dominated eigenvalues.

If these two conditions are met, $\lambda_i$ is regarded as a signal-dominated eigenvalue. The number of source signals is estimated by counting the number of signal-dominated eigenvalues determined this way. Admittedly, this rule is somewhat ad hoc; however, it has provided a good estimate of the point of divergence over the many experiments we conducted. A code sketch of the complete procedure is given at the end of this section.

Test of the Algorithm

To examine the performance of the proposed method for source number estimation, we applied the algorithm to the two datasets of Table 3.1. Both datasets are subject to computer-generated additive independent noise; the noise covariance is an identity matrix for the first dataset and the non-identity diagonal matrix $\sigma^2 \mathrm{diag}(1/50, 1/49, \cdots, 1)$, where $\sigma^2$ is a scalar equalizing the SNR of the two datasets, for the second. We repeat the simulation 200 times and obtain 200 estimates of the source signal number for each dataset. The results are presented as histograms in Figure 3-7. The mean value of the estimated source number is 11.31 for (a) and 8.64 for (b). Considering that the true p is 15 and that the transition points of Figure 3-4, which were chosen by human judgment, are $\hat{p} = 11$ and $\hat{p} = 9$, respectively, we conclude:

1. The estimated source signal number returned by the proposed method is, on average, not much different from the estimate determined by a human eye from a screeplot.
2. The estimated source signal number returned by the proposed method is smaller than the true p on average. The magnitude of the underestimation is small when the noise covariance is an identity matrix.

Figure 3-7: Histograms of the number of source signals estimated by the proposed method for the datasets of Table 3.1.
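The sketch below strings together the baseline fit (3.20)-(3.22) and the transition rule. It is an illustrative NumPy rendering rather than the thesis' code: the middle-20-percent window follows the empirical choice stated above, and the threshold uses (3.24) as reconstructed here.

```python
import numpy as np

def estimate_order(X):
    """Estimate the number of source signals from the eigenvalue screeplot of X.

    Sketch of Section 3.4.4: fit a straight noise baseline to the middle 20 percent
    of the log-eigenvalues, then count the leading eigenvalues that stay above the
    baseline by more than the threshold gamma."""
    S = np.cov(X - X.mean(0), rowvar=False)
    lam = np.sort(np.linalg.eigvalsh(S))[::-1]              # descending eigenvalues
    n = lam.size
    l1, l2 = int(0.4 * n), int(0.6 * n)                     # middle 20 percent of indexes

    idx = np.arange(l1, l2)
    alpha, beta = np.polyfit(idx, np.log10(lam[idx]), 1)    # baseline fit, as in (3.21)
    baseline = alpha * np.arange(n) + beta                  # log10 lambda_hat_i, (3.22)

    L = np.log10(lam) - baseline                            # deviations, (3.23)
    gamma = 20.0 * np.mean(L[idx] ** 2)                     # threshold, (3.24) as reconstructed

    p_hat = 0
    for Li in L:                                            # single transition point only
        if Li > gamma:
            p_hat += 1
        else:
            break
    return p_hat
```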
3.5 Noise Estimation by Expectation-Maximization (EM) Algorithm

The objective of this section is to derive a step-by-step algorithm for computing maximum-likelihood (ML) estimates of the noise variances. It is also of interest to recover the actual time-sequence of the noise. From the noisy data model of Figure 3-1, the observed noisy vector X can be written in terms of the source signal vector and noise:

$X = A P + G^{1/2} w,$   (3.25)

where w is a vector of unit-variance noises and G is the n x n diagonal matrix such that $e = G^{1/2} w$. The diagonal elements of G are the unknown noise variances. We want to compute the ML estimates of those diagonal elements from m samples of the noisy vector X. For now, we assume that the number of source signals is known; when it is not, we can use an estimate of it obtained through the method explained in Section 3.4, a case we discuss in detail in the next section.

In this section, we adopt the Expectation-Maximization (EM) algorithm to obtain the ML estimates of the noise variances iteratively. The general description of the EM algorithm was first presented in [4]. Although it is often referred to as the EM algorithm, it is more of a strategy than a fixed algorithm for obtaining ML estimates of the unknown parameters of a dataset; an actual algorithm based on the EM strategy must be worked out in detail once a specific problem of interest is defined. We define the problem and derive the EM-based iterative algorithm in this section. A similar approach for a simpler case was studied in [5].

3.5.1 Problem Description

In (3.25), A and the diagonal matrix G are unknown but fixed parameters, and P and w are random vectors. Let the symbol $\theta$ denote all unknown parameters in (3.25), that is,

$\theta = \{A, G\}.$   (3.26)

Let the vector $U \in R^{n+p}$ denote

$U = \begin{pmatrix} X \\ P \end{pmatrix},$   (3.27)

and let $f_U(U; \theta)$ denote the probability density function of the vector U given the parameter $\theta$. Then the ML estimate of $\theta$ based on m independent samples of U can be written as

$\hat{\theta}_{ML} = \underset{\theta}{\arg\max} \prod_{t=1}^{m} f_U\left(U(t); \theta\right).$   (3.28)

Since log(.) is a monotonically increasing function, we can replace the probability density function in (3.28) with its logarithm without affecting $\hat{\theta}_{ML}$:

$\hat{\theta}_{ML} = \underset{\theta}{\arg\max}\; \log \prod_{t=1}^{m} f_U\left(U(t); \theta\right)$   (3.29)
$= \underset{\theta}{\arg\max} \sum_{t=1}^{m} \log f_U\left(U(t); \theta\right).$   (3.30)

Invoking Bayes' rule, we have

$f_U\left(U(t); \theta\right) = f_P\left(P(t); \theta\right) \cdot f_{X|P}\left(X(t)|P(t); \theta\right),$   (3.31)

and by taking the logarithm of (3.31), we have

$\log f_U\left(U(t); \theta\right) = \log f_P\left(P(t); \theta\right) + \log f_{X|P}\left(X(t)|P(t); \theta\right).$   (3.32)

Assuming P and w are two independent jointly Gaussian vectors with zero mean and identity covariance matrices, the two components on the right-hand side of (3.32) can be expressed as

$\log f_P\left(P(t); \theta\right) = \log \prod_{i=1}^{p} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} P_i^2(t)} = -\frac{p}{2} \log(2\pi) - \frac{1}{2} \sum_{i=1}^{p} P_i^2(t)$   (3.33–3.35)

and

$\log f_{X|P}\left(X(t)|P(t); \theta\right) = \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi G_i}}\, e^{-\frac{1}{2 G_i}\left(X_i(t) - A_i P(t)\right)^2} = \sum_{i=1}^{n} \left( -\frac{1}{2} \log\left(2\pi G_i\right) - \frac{1}{2 G_i}\left(X_i(t) - A_i P(t)\right)^2 \right),$   (3.36–3.38)

where $G_i$ is the ith diagonal element of G and $A_i$ is the ith row of A. Combining (3.35) and (3.38) with (3.32) and substituting the result into (3.30) yields

$\hat{\theta}_{ML} = \underset{\theta}{\arg\max} \sum_{t=1}^{m} \left( -\frac{p}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{p} P_i^2(t) - \frac{1}{2}\sum_{i=1}^{n} \log\left(2\pi G_i\right) - \frac{1}{2}\sum_{i=1}^{n} \frac{\left(X_i(t) - A_i P(t)\right)^2}{G_i} \right).$   (3.39)

The first two terms in (3.39) can be omitted because they do not depend on any unknown parameters and thus do not affect the maximization. The ML estimate of $\theta$ then becomes

$\hat{\theta}_{ML} = \underset{\theta}{\arg\min} \sum_{t=1}^{m} \sum_{i=1}^{n} \left( \log\left(2\pi G_i\right) + \frac{\left(X_i(t) - A_i P(t)\right)^2}{G_i} \right)$   (3.40–3.41)
$= \underset{\theta}{\arg\min} \sum_{i=1}^{n} \left( m \log(2\pi) + m \log\left(G_i\right) + \frac{1}{G_i}\left(X_i - P A_i^T\right)^T \left(X_i - P A_i^T\right) \right)$   (3.42)
$= \underset{\theta}{\arg\min} \sum_{i=1}^{n} \left( m \log\left(G_i\right) + \frac{1}{G_i}\left(X_i^T X_i - 2 X_i^T P A_i^T + A_i P^T P A_i^T\right) \right),$   (3.43)

where $X_i$ denotes the m-vector of samples of the ith observed variable and $P \in R^{m \times p}$ is the matrix of source samples.

At this point, one should note that evaluation of the argument of (3.43) requires the source signal matrix $P \in R^{m \times p}$. Since it is not available, the direct maximization of (3.43) over $\theta$ cannot be carried out. The matrices P and X are called the complete data, in the sense that $\hat{\theta}_{ML}$ would have been obtainable had they been available; by comparison, X alone is called the incomplete data. The EM algorithm proposes that P and $P^T P$ in (3.43) be replaced by the expected values of P and $P^T P$ given X and a current estimate of $\hat{\theta}_{ML}$, denoted $\theta_{(l)}$ to emphasize that it is the lth iterative estimate of $\hat{\theta}_{ML}$ (expectation step), and that the solution of (3.43) then be taken as the (l+1)th estimate of $\hat{\theta}_{ML}$ (maximization step). This two-step approach is the origin of the name expectation-maximization algorithm. If there is no other local extremum in the argument of (3.43), then

$\lim_{l \to \infty} \theta_{(l)} = \hat{\theta}_{ML}.$   (3.44)
If there is no other local extremum in the argument of (3.43), it is true that lim 6() = OML 1-00 68 (3.44) In cases of multiple local extrema, a stationary point may not be the global maximum, and thus several staring point may be needed as in any "hill-climbing" algorithms. 3.5.2 Expectation Step The expectation step comprises the two conditional expectations E [PIX; 6(,)] and E [PTPIX; 6()]. We will first focus on the computation of E [PIX; 6(,)]. Due to instan- taneity between X and P, this conditional expectation can be further simplified to the computations of E 1P(k)IX(k); 0()] , k = 1,...,m, where P(k) = [Pi(k),- X(k) = [Xi(k),-- ,,X (k)1T. If we define the vector w(k) = [wi(k), - ,Pp(k)]T and ,W (k)]T, the rela- tionship between the three vectors can be written as X(k) = AP(k) + G 1/ 2 W(k). (3.45) For notational simplicity, we will drop the time index k for now. To compute E we need to obtain the probability density function fpix (PIX; 6(,)). [PIX; 6(,)] Invoking the Bayes' rule, we have fPix (PIX; 6(1(,)= f x (x;6(i))(3.46) fx (X; 0() For notational simplicity, we will rewrite (3.46) as f (Pjx; 6(,)) ff = f (XP; 6(,)) f (P; 6(l)) (xX;6((l))=(3.47) f (X; 0(1)) Recalling that X, P, and w are all jointly Gaussian, the three components in (3.47) are expressed as f (XP; (,) = C e(X AP)G f (P;6(1)) = C 2e-2 f = C 3 e-ixTA )A+G(l) (X;0(1)) 69 (X-A) (3.48a) (3.48b) X (3.48c) where C1, C2 and C3 are scalar constants which are irrelevant in further computations. Substituting (3.48) into (3.47) yields f (PIX; ()) = C 4 e-- .H (3.49) where C4 = C1C2/03 and H = XT G5X - XT G- A(I)P - PT AT G- 1 X + PTA TGA(I)P + pTp - XT (Aml)A T + G(l) (A SPT G-A(l) + I) P + XTG- X = - A W (P- XTG- where W(i) = AT G (1) 1 (1) - (3.50) -XTG-A(l)P -PTA XT (A)AT + G-1 G- XW() A(l)W X (P - W T G 1 X (3.51) X A T)G A- G( X+XTG-X - X) T XT(A(l)A() + G(l) X, (3.52) A(i) + I. From this result, we can rewrite (3.49) as I PCWe = f (PIX;()) AT G X WPW-ATG(I) (1) 1 (1) T X (1) ) (3.53) where IXT 05 = C4 e 2 ( G-AI)W-1A G 1 1 () 1 G- + A(I)A+G(j) (1L G +1(A X. (3.54) Recalling that P given X and 0(j) is a Gaussian vector, we can simplify (3.53) into f (PIX; 0(,)) = N (W- A)G-X, W-1). (3.55) Now we have the expression for the first conditional expectation: E [PIX; 0(l)] = W AT)G X (3.56) or equivalently, E [PIX; ()] = XG- A()W. 70 (3.57) Deriving the expression for the second conditional expectation of interest E [PTPIX; 6(1) is our next task. First we observe that P = [P(1)-I - -P(m)]T E rPTPIX;8(1)] = E from which we can write P(k)P(k)TIX;6()1 (3.58) .k=1 Once again, instantaneity between X and P simplifies the problem of obtaining (3.58) into obtaining E [PPTIX; 0(t)] in which we omit the time index for simplicity. Invoking the definition of covariance matrix, E [ppTIX; 0(,)] = EpIx;6(L) + E [PIX; () ET [PIX; 6()] where E (3.59) denotes the covariance of P given X and 0(1). From (3.55), we have (3.60) W-. X Substituting (3.60) and (3.56) into (3.59) yields E PPTjX; = W- +W A TG-XXT G A(,)W- (3.61) From this, we obtain the second conditional expectation of interest: E [PTP IX; 0] = mW-f + W-A TG 3.5.3 XTXG- A(l)W- (3.62) Maximization Step After the two conditional expectations (3.56) and (3.62) are computed, the maximization step is carried out. The maximization step is to search for the values of unknown parameters 0 which maximize the expression in (3.43). 
With a simple modification of (3.43), the (l+1)th estimate of OML can be written as n 0(1+1) = argnin Q(l)(Gi, Aj) i41 71 (3.63) where Q(i) (Gi, Aj)= (X='Xi - 2XE [P IX; (1)] A[ + A 2 E PTPIX; (1) AT) + m log Gi. (3.64) To find the values of Gi and Ai at which Q(I) (G2 , Ai) is minimized, the partial derivatives of (3.64) with respect to Gi and Ai are set to zero: o9Q(j) (Gi, Aj) aAj Ai,(,+,), = 0, i =1 n, (3.65) = 0, 1 = 1, ..., n. (3.66) Gi,(,+,) 'Q(i) (G, Ai) aGi The solutions to these equations constitute 0(1+1). From (3.65), -2X[E [P IX; 6(j)] + 2Ai,(l,+)E [pTpIX; (1)] = 0 (3.67) Solving (3.67) for Ai,(+1) yields Ai,(,+1) = X[E [PIX;0(j)] (E [pTpIX; 0() (3.68) The estimate of the entire mixing matrix is then A(1+1) = XT E [PIX;0(,)] (E [PTPIX;0(l)] 1 (3.69) Similarly, from (3.66) we have mGi,(l,+) - (XTXi - 2X[E [ IX; 6(L)] A (1 + 1) + Ai,(+l+)E PTP IX; 6(1) AT( 1+1 )) = 0, (3.70) which becomes Gi,(,+)= (XfXi - 2X[E [PjX; 6(,)] A[(,+ ) 1 + Ai,(+)E PTPX; 6() A,(+1)) (3.71) Combining this with (3.68), we get the second set of maximization equations, Gi,(+) = 1 (X TXj - Ai,(+l+)ET [PIX; 6(1)] X,) 72 i =1, - - ,n. (3.72) 3.5.4 Interpretation and Test of the Algorithm Figure 3-8 illustrates the step-by-step operations of the EM algorithm for blind noise estimation. The algorithm takes in the noisy data X and the number of source signals p as input. The algorithm then generates A(,) E R"XP and G(1 ) E R"'X, initial guesses of unknown parameters. The only restriction imposed on the initial guesses is that G(1) is a diagonal matrix. The number of the noisy variables, n, is acquired from the number of columns of X. In the expectation step, C(i) and D(j) are computed from the two equations. Recall that W(i) = AT G A(l) + I. In the maximization step, the unknown parameters are updated to yield A(2 ) and G(2 ). If another iteration is needed, A( 2 ) and G( 2) are fed back to the expectation step to compute C( 2) and D( 2). Typically the total number of iterations is pre-determined, but one can simply decide to stop the iteration if changes of the unknown parameters are negligible after each iteration. Let r, be the total number of iterations and G and A denote the final updates of the unknown parameters. Then each diagonal element of G is the estimated noise variance of the corresponding noisy variable. Furthermore, E [P jX; A, ] AT is an estimate of the time sequence of the noise-free data matrix Z. Therefore, X - E [P IX; A, ] AT represents the estimated noise sequences. To illustrate the effectiveness of the EM algorithm we applied it to the two examples of noisy multivariate datasets specified in Table 3.1. Figure 3-9 provides the results of the blind noise variance estimation using the EM algorithm. In the figure, the noise variances blindly estimated by the EM algorithm are compared to the true noise variances. For the first dataset, the true noise variances are unity for all variables, and the minimum value of estimated noise variances is 0.78 and the maximum value is 1.10. For the second dataset, the true noise variances are 2/50,2/49, -. , 2/2, 2/1. The estimated noise variances again follow the true variance line very closely. As for the second dataset, if we want to apply NAPC to the dataset but we do not have the noise variances a priori, we can acquire the unknown noise variances first from the EM algorithm and then normalize variables by the estimated noise variances. Figure 3-10 and Figure 3-11 provide another capability of blind noise estimation using the EM algorithm. 
Figure 3-10 illustrates some selected noise time-sequences obtained as one of the outputs of the EM algorithm. Surprisingly, the estimated noise sequences follow the true noise sequences very closely. Recalling that all we knew in the beginning were the dataset X and the fact that there are p source variables, this result brings the possibility of filtering independent measurement noise from noisy variables into reality. Figure 3-11 also shows similar results.

Figure 3-8: Flow chart of the EM algorithm for blind noise estimation. The algorithm takes the noisy data X and the number of source signals p as inputs, forms initial guesses $A_{(1)}$ and $G_{(1)}$, and then alternates the expectation step

C_{(l)} = E[P|X; A_{(l)}, G_{(l)}] = X G_{(l)}^{-1} A_{(l)} W_{(l)}^{-1},
D_{(l)} = E[P^T P|X; A_{(l)}, G_{(l)}] = m W_{(l)}^{-1} + W_{(l)}^{-1} A_{(l)}^T G_{(l)}^{-1} X^T X G_{(l)}^{-1} A_{(l)} W_{(l)}^{-1}

(with $W_{(l)} = A_{(l)}^T G_{(l)}^{-1} A_{(l)} + I$) and the maximization step

A_{(l+1)} = X^T C_{(l)} D_{(l)}^{-1},  \qquad  G_{i,(l+1)} = \frac{1}{m}\left( X_i^T X_i - A_{i,(l+1)} C_{(l)}^T X_i \right), \quad i = 1, \ldots, n,

increasing l by one until no further iteration is required. The outputs are $\hat{G}$, the estimates of the noise variances, and $X - E[P|X; \hat{A}, \hat{G}]\hat{A}^T$, the estimates of the noise sequences.

Figure 3-9: Estimated noise variances for two simulated datasets in Table 3.1 using the EM algorithm. Panels (a) and (b) compare the true and estimated noise variances versus variable index.

3.6 An Iterative Algorithm for Blind Estimation of p and G

Up until now, we have developed two estimation algorithms, one for p and another for G. In Section 3.4 we developed a quantitative method to obtain an estimate of the source number when G is an identity matrix, and extended the method to cases where G is not quite an identity matrix. We illustrated through examples that $\hat{p}$ is close to the true value p if G is either an identity matrix or a diagonal matrix whose elements remain within a limited range. The estimate becomes more distant from the true p on average as G gets farther from an identity matrix. To obtain a good estimate of p, therefore, it is desirable to have an identity noise covariance matrix. In Section 3.5, we developed a computational algorithm to estimate the noise variances. A simulation demonstrated the capability of the algorithm to estimate the noise variances accurately. The catch is that the algorithm requires the value p as one of its inputs, which in general is not available a priori.

In this section, we address the problem of simultaneous estimation of the source signal number and the noise variances. Our approach is to decouple the joint estimation of p and G into iterations of sequential estimations of p and then G. First, we obtain an estimate of the source signal number, which may be a poor estimate at the first iteration. This estimate is then fed, together with X, to the EM algorithm to obtain estimates of the noise variances. Intuitively, this approach seems chicken-and-egg: estimating G requires p, but p cannot be estimated accurately unless G is an identity matrix or close to it. This may in fact be true during the first iteration. As the iterations continue, however, improvement in the estimation of p leads to better estimation of G, which in turn further improves the estimation of p.

Figure 3-10: A few estimated noise sequences for the first dataset in Table 3.1 using the EM algorithm. Each panel compares the actual and estimated noise sequences of one variable (the 1st, 11th, 21st, and 31st) over time.
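To make this sequential idea concrete, the MATLAB fragment below sketches one EM update of Figure 3-8 together with the outer iteration over p and G. It is a minimal sketch under the stated assumptions, not the code used for the simulations in this thesis: the function names (em_blind_noise, ion_sketch, estimate_order) are hypothetical, the order-estimation stub is a crude stand-in for the screeplot rule of Section 3.4, and the stopping rule is simply a fixed number of iterations.

    function [Ghat, Ahat, C] = em_blind_noise(X, p, niter)
    % Minimal sketch of the EM iteration of Figure 3-8 (not the author's code).
    % X : m-by-n noisy data matrix; p : assumed number of source signals.
    [m, n] = size(X);
    Ahat = randn(n, p);                     % initial guess of the mixing matrix
    g    = ones(n, 1);                      % initial guess of the noise variances (G diagonal)
    for l = 1:niter
        Gi = diag(1 ./ g);                  % G^{-1}
        W  = Ahat' * Gi * Ahat + eye(p);    % W = A' G^{-1} A + I
        C  = X * Gi * Ahat / W;             % E[P | X]          (eq. 3.57)
        D  = m * inv(W) + (W \ (Ahat' * Gi * (X' * X) * Gi * Ahat)) / W;  % E[P'P | X] (eq. 3.62)
        Ahat = (X' * C) / D;                % maximization step (eq. 3.69)
        g = (sum(X.^2, 1)' - sum(Ahat .* (X' * C), 2)) / m;              % (eq. 3.72)
    end
    Ghat = diag(g);
    end

    function [Ghat, phat] = ion_sketch(X, kmax)
    % Sketch of the iterative order-and-noise (ION) estimation loop.
    Ghat = eye(size(X, 2));                          % G_(0) = I
    for k = 1:kmax
        Xk   = X * diag(1 ./ sqrt(diag(Ghat)));      % noise-normalized dataset X * Ghat^{-1/2}
        phat = estimate_order(Xk);                   % order estimate from the normalized data
        Ghat = em_blind_noise(X, phat, 20);          % EM is fed the ORIGINAL X, not Xk
    end
    end

    function phat = estimate_order(Xk)
    % Crude stand-in for the screeplot rule of Section 3.4 (NOT the thesis's threshold):
    % count eigenvalues that rise well above the median ("noise bed") eigenvalue.
    lam  = sort(eig(cov(Xk)), 'descend');
    phat = sum(lam > 5 * median(lam));
    end

As a usage example, a dataset following the model (3.25) could be simulated with P = randn(m, p); A = randn(n, p); X = P*A' + randn(m, n)*diag(sqrt(gtrue)); for some hypothetical true noise variances gtrue, and then passed to ion_sketch.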
3.6.1 Description of the Iterative Algorithm of Sequential Estimation of Source Signal Number and Noise Variances Figure 3-12 illustrates the schematic block diagram of the proposed iterative algorithm for the estimation of the source signal number and noise variances. Let X E Rmn denote a 76 Noise of 1st variable Noise of 11Ith variable 3 2 - Actual Noise Sequence Estimated Noise Sequence 2 1 -\ 0 , -1 - -/ -1 -2 -2 0 10 20 30 40 -3 50 0 10 20 Time Noise of 21st variable 50 3 2 2 A/ 1 - -1 0 - -1 IIIJ -2 -2 10 20 30 40 -3 50 Time A 1 'Fl 1 0 0 40 Noise of 31st variable 3 -3 30 Time A -v 0 Ni 1~ ~1 - 10 20 30 40 50 Time Figure 3-11: A few estimated noise sequences for the second dataset in Table 3.1 using the EM algorithm 77 noisy data matrix. In the proposed algorithm, the data matrix X is multiplied by the inverse of the square root of the estimate of the noise covariance matrix obtained in the previous iteration. For the kth iteration, the estimate of the noise covariance from the previous iteration is denoted by G(k-1). The result(-) of the multiplication, denoted by X(k) = Xd-1/2 (k-1) and referred to as the kth noise-normalized dataset, constitutes our new dataset in which variables are normalized by the current best estimate of noise variances. For the first iteration, we initialize G(o) = I. Therefore, the first noise-normalized dataset is the same as the original dataset. Once the kth noise normalized dataset is obtained, the source signal number is estimated by the method described in Section 3.4. This kth estimate of the source signal number is fed with X to the EM algorithm as if it is the true p to obtain the kth noise covariance estimate G(k). Note that it is not the noise normalized dataset X(k) but the original X that is fed to the EM algorithm. If another iteration is necessary, this G(k) acts as the previous noise covariance estimate to yield the new noise normalized dataset. The ultimate estimate of the G and P are equal to G(kf) and G(kf), respectively, where k1 denotes the final iteration. We will name this algorithm as ION, which stands for Iterative Order and Noise estimation. Table 3.2 is the detailed step-by-step procedure of the ION algorithm. 3.6.2 Test of the ION Algorithm with Simulated Data While the estimates of p and G after the first iteration of the ION algorithm might be poor, they become more accurate after each iteration. This is best visualized by an example. The test in this section focuses on illustrating the improvement of the ION algorithm after each iteration. For this end, we again use the example datasets of Table 3.1. To compare the accuracy of noise variance estimates in each iteration, we use the cumulative squared estimation error (CSEE), or n2 CSEE(j) (G = - Gj,(i)). (3.73) j=1 In Figure 3-13, we show the results of applying the ION algorithm to the simulations defined in Table 3.1. After the first iteration, the estimated number of source signals was 8, and based on this value the EM algorithm returns estimated noise variances. The CSEE after the first iteration stood at 1.09. In the second iteration, first the variables of X were 78 x Estimation of source signal number j (k) EM L(k)P(k)iG(k) E [PIX; A-, G More Iteration ? YES G(k) Increase k by 1 NO [PIX; X-E - G Estimates of noise variances Estimate of number of source signal ,G] AT Estimates of noise sequences Figure 3-12: Flow chart of the iterative sequential estimation of p and G. 79 " Initialize: -+ k = 1, G(0) I. 
" Order estimation for the kth iteration: -X(k) =k-X -+ Compute the sample covariance matrix of X(k). Call it SXX,(k)- - Compute the eigenvalues of SXX,(k). - Obtain the current estimate P(k) of the source signal number from the method described in Section 3.4. They are labeled as A1,(k),-- , An,(k) " Noise variance estimation for the kth iteration: -+ Compute the current estimate 6(k) of the diagonal noise covariance matrix through the EM algorithm described in Section 3.5. " If another iteration is required: -+ Increase k by one. Go to the order estimation step " After the final iteration: + = G(kf) and P = P(kf). In addition, if desired, the estimated noise sequences can be obtained as a by-product of the EM algorithm from S= X - E[PIX;, 6]AT Table 3.2: Step-by-step description of the ION algorithm 80 divided by the square roots of the corresponding estimated noise variances, and the result was fed to the order estimation algorithm. The resulting estimated number of source signals increased to 13, which was closer to the true value of p = 15. The estimated noise variances based on this value had CSEE of 0.061. In the third iteration, the estimated number of source signals reached 14, and the corresponding estimated noise variances had CSEE of 0.059. After the third iteration, the estimated number of source signals remained at 14, and did not change in subsequent iterations. The ION algorithm has many applications areas, some of which have been widely used for a very long. In the next chapter, we will discuss a few applications which may have potentially important implications such as least-squares linear regression and noise filtering. We will quantify performances of those applications enhanced by the ION algorithm and compare them with performances without the ION algorithm. It will show that performances of those applications may be enhanced significantly by adopting the ION algorithm as a pre-processing. 81 First Iteration Order Estimation by noise baseline 5 10 P( 1 8. 104 Estimated noise variances vs. True noise variances . - 2 10 1.5 102 . . . . :11V /1 10 .... . . .. . . . . . . ... . . . 0.5 10 10~ 0 10 20 30 40 U- 0 5C Second Iteration 20 30 40 50 ........ .... . . '4.. . Index 105 14 10 Index . 2 P( 2) A. . . .. = 13 2 10 1.5 10 10 10-1 ..... ..... ..... 1003 102 . ... .. . .. ... 1 q7 .. -... - 20 - 10 0 20 30-- 40-50 30 40 . 0.5 00 50 10 20 Index Third Iteration .. . ..... 40 30 50 Index 105 3 ) =14 -....... P(-........ 14 -...--.. 2 102 1.5 C1 5 102 101 .... .. . r-. . .. . . .. . . . . . z 0.5 - . 100 10- ....... U) 0 10 20 30 40 U,- 0 50 Index 10 20 30 40 50 Index Figure 3-13: The result of the first three iterations of the ION algorithm applied to the second dataset in Table 3.1 82 Chapter 4 Applications of Blind Noise Estimation 4.1 Introduction The major motivation of the development of the ION algorithm was to obtain the noise variances of individual variables so that each variable can be normalized to unit noise variance before the PC transform is carried out. The combined operation of the noise normalization and the PC transform is called the Nose-Adjusted Principal Component (NAPC) transform. The Blind-Adjusted Principal Component (BAPC) transform is the NAPC transform in which the noise normalization is performed with the noise variances retrieved by the ION algorithm. The implication of the capability of retrieval of the Gaussian noise variances is very broad. A zero-mean white Gaussian noise vector can be wholly characterized by its covariance matrix. 
Therefore, if the elements of the noise vector is known to be independent so that the covariance matrix is diagonal, then the retrieval of the noise variances indeed represents the complete characterization of the noise vector. The retrieved noise characterization may be then used for many traditional multivariate data processing tools which require a prioriknowledge of noise statistics. Until now, it is not an uncommon practice to analyze a dataset with noise of unknown statistics as if the dataset is noiseless. The goal of this chapter is to suggest a few applications in which ignoring existence of noise is a common practice and to show the extent 83 of performance gain obtainable by the adoption of the retrieved noise statistics in data analysis. The performance - gain or loss - will be measured by relevant metrics which will be defined for each application. In evaluating the effectiveness of the applications of the ION algorithm, we use multiple computer-generated examples which are carefully designed to be representatives of practical multivariate datasets. In Section 4.2, we apply the ION algorithm to linear regression problem in which the multivariate predictor variables are corrupted by noise with unknown noise statistics. As we discussed briefly in Chapter 2, traditional least squares regression does not make much sense when the predictor variables are subject to noise. In Section 4.2, the ION algorithm combined with the NAPC filtering is suggested for eliminating noise in the predictor variables to enhance the regression result. The NAPC transform after the ION algorithm retrieves the noise variances is designated Blind noise-Adjusted Principal Component (BAPC) transform. Section 4.3 addresses the application of ION to noise filtering. In addition to the noise variances and signal order, the ION algorithm retrieves estimates of noise sequences. In Figure 3-12, the estimated noise sequences are shown at the bottom of the figure as e = X - E [PIX; , ] T. An ION filter refers to a simple operation of subtracting this estimated noise sequences from the noisy dataset. We evaluate the ION filter as a function of the sample size m and the signal order p. In addition, the ION filter is compared with the Wiener filter, the optimal linear filter in the least squares sense. Performance is measured in terms of SNR (dB). PC transform of the ION-filtered dataset is designated Blind Principal Component (BPC) transform and discussed in Section 4.4. 4.2 Blind Adjusted Principal Component Transform and Regression One of the assumptions on which desirable properties of least squares linear regression depend is that the predictor variables are recorded without errors. When errors occur in the predictor variables, the least squares criterion which accounts for only errors in the response variables(s) does not make much sense. Still, least squares linear regression is commonly used for noisy predictors [36, 37] because 1) alternative assumptions that allow for errors in the predictors are much more complicated, 2) no single method emerges as the agreed-upon best way to proceed, and 3) least squares linear regression works well except 84 in the case of extremely large predictor noise. Moreover, least squares linear regression predicts future response variable well as long as future predictors share the same noise statistics with the predictors of the training set [37]. However, this argument is true only when there are infinitely large sample size. 
For most practical situations where the sample size is small, noisy predictors can affect linear regressions. In this section we discuss linear regressions for noisy predictor variables and propose the ION algorithm as a potential method to alleviate distortions caused by noises in the predictor variables. In section 4.2.1 we derive an expression for mean squared error (MSE) when traditional least squares linear regression is carried out for noisy predictor variables. 4.2.1 Analysis of Mean Square Error of Linear Predictor Based on Noisy Predictors Let X E R' be the vector of n noisy predictor variables defined as X = Z+ E (4.1) where Z is the vector of noiseless variables and e is the noise vector. Let the response variable Y C R be modeled as a linear combination of Z 1,.-, Z,, corrupted by independent zero-mean Gaussian random noise e: Y = gTZ + (4.2) In linear prediction, it is of interest to obtain a linear combination of X 1 ,--- , X" such that the value of the linear combination follows Y closely. The mean-square error (MSE), defined as MSE (Y(X)) = E (Y- (X))], is often used to measure how well a particular linear combination, denoted by (4.3), follows Y. Let (4.3) Y(X) in Y*(X) denote the linear combination which minimizes MSE (V(X)). In the case of jointly Gaussian zero-mean random vectors it is well known that f*(X) is linear in X [6]; Y*(X) = LyxX 85 (4.4) X1 Z1 Zn Xn y I Figure 4-1: Model for the predictors X1, --- , X, and the response variable Y. where Lyx satisfies LyxCxx = Cyx. One can write the mean square error for MSE (f*(X)) = E [3Z = (,3T (4.5) Y*(X) as +e - CyxCjx (Z + 6) CyxC-1)Czz (8T -CyxC1) +CyxC-1GC-ICyx (4.6) T +or (4.7) where (.)-1 denotes the inverse or pseudo-inverse of (.) depending on its invertibility. Recalling that Z = AP where P is the vector of independent source signals, one can write Cyx and Cxx as Cyx = Cyz =I3TCzz = /TAAT (4.8) Cxx = Czz + G = AA T + G (4.9) 86 Replacing Cyx and Cxx in (4.7) yields MSE (Y*(X)) = AT I - (AAT + G)AAT) 8 2+or+ G 1/ 2 (AAT + G 1AAT (4.10) where || denotes the norm of a vector. It is clear from (4.10) that MSE (Y*(X)) smaller than aT2. Note that o, would be the mean-square error of Y*(X) if e lim MSE (v*(X)) = is never 0: (4.11) = 0,. Equations (4.10) and (4.11) indicate that increase in the measurement noise variance a2 causes corresponding increase in the mean-square error of 4.2.2 Y* (X). Increase in MSE due to Finite Training Dataset In deriving (4.10), we assumed that the population covariances Cxx and Cyx were available for the computation of Y*(X). In practice, these population covariances are generally not available and and replaced by the sample covariances Sxx and Syx, respectively. Since the sample covariances are computed from the finite training dataset, discrepancies between the sample covariances and the population covariances are inevitable. The discrepancies between the population and the corresponding sample covariances should be negligibly small when the sample size is large. In that case, the MSE resulting from usage of the sample covariances should not be much different from (4.10). On the contrary, when the sample size m is not large enough compared to the number of predictor variables, denoted by n, the errors in the sample covariances could increase the MSE noticeably from (4.10). Let's run an example as an illustration of increase in MSE due to the finite sample size. In the example, the vector X consists of 50 variables, X 1 , - - -, X 5 0 . 
The measurement noise vector $e$ is a zero-mean Gaussian random vector whose covariance G is an identity matrix. The 50 variables of the noise-free vector Z are linear combinations of 10 source signals. The source signals are zero-mean independent Gaussian random variables with unit variances. The response variable Y is a linear combination of the 50 variables of Z plus the statistical noise $\epsilon$, whose variance is set to 0.5. To show the relation between the sample size and the performance of linear prediction, we generated seven training datasets with sample sizes of 60, 70, 80, 90, 100, 200 and 500. For each training set, one prediction equation, in which $\hat{Y}^*(X)$ is expressed as a linear combination of the variables in X, is computed by linear regression. Substituting the 3000 samples of the predictors of the validating set into the prediction equation gives 3000 samples of $\hat{Y}^*(X)$, whose differences from the corresponding 3000 samples of Y are squared and averaged to yield the MSE. We will refer to the MSE obtained in this way as the sample MSE (SMSE). For notational convenience, we will call the MSE of (4.10) the population MSE (PMSE). Let us define the discrepancy factor $\gamma$ as the difference between SMSE and PMSE normalized by PMSE, or

\gamma = \frac{\mathrm{SMSE} - \mathrm{PMSE}}{\mathrm{PMSE}} = \frac{\mathrm{SMSE}}{\mathrm{PMSE}} - 1.    (4.12)

Note that $\gamma$ is a measure of the performance degradation of linear regression due to finite sample size.

Figure 4-2: The effect of the size of the training set in linear prediction (discrepancy factor versus size of training set, for noisy and noiseless predictors).

The result of the example is drawn in Figure 4-2 as the curve marked with *. For comparison, we also present $\gamma$ when the predictor noise $e$ does not exist, as the curve marked with o. For the noisy case, it can be seen from the figure that $\gamma$ and the sample size m are inversely related. In general, $\gamma$ asymptotically approaches 0 as $m \to \infty$ because the sample covariances $S_{XX}$ and $S_{YX}$ asymptotically approach $C_{XX}$ and $C_{YX}$, respectively [38], forcing SMSE to converge to PMSE. When the sample size decreases, the differences between the sample and population covariances grow, resulting in larger $\gamma$, because the sample covariances $S_{XX}$ and $S_{YX}$ are more susceptible to errors caused by the predictor noises. This is why least squares linear regression should be used only when the training dataset is large enough. In general, linear regression should not be used when the number of noisy predictors is comparable to the sample size.

4.2.3 Description of Blind-Adjusted Principal Component Regression

The purpose of this section is to introduce an application of the ION algorithm as a partial solution to the problems caused by the measurement noise $e$ and by having too few samples. The algorithm is mainly a concatenation of the ION algorithm and NAPC filtering. The ION algorithm provides the noise variances required by NAPC filtering. The problem of large SMSE due to not having enough samples can be alleviated by NAPC filtering because NAPC filtering compresses the signals distributed over all predictors into a smaller number of principal components. A schematic block diagram of the algorithm to be studied here is shown in Figure 4-3. We will call the algorithm BAPCR, the acronym of Blind-Adjusted Principal Component Regression. The name stems from the two major sub-algorithms which constitute the BAPCR: 1) the ION algorithm, which estimates the noise variances blindly, and 2) the noise-adjusted principal component filtering, which reduces the number of predictors.
The matrix X E Rmxn consists of the m observations of n predictors, and the corresponding m observations of the response variable constitute the column vector Y E Rm. It is known that the predictors are subject to measurement errors which have no correlation with the response variable. It is assumed that the noise covariances are diagonal but those elements are unknown. The degrees of freedom of the predictors is n due to the measurement noises. The degrees of freedom of the predictors would have been a smaller value but for the noises. The noise variances of predictors are estimated using the ION algorithm described in Chapter 3. The output O of the ION algorithm is a diagonal matrix with its diagonal elements being the ML estimates of the noise variances of corresponding variables in X. These noise estimates are subsequently used for the NAPC filtering. First, predictors are noise-adjusted by being divided by the square roots of the corresponding diagonal elements of G. The noise-adjusted dataset, denoted by X', can be written as X' XO-(1/2). 89 (4.13) Y ION G P -C l/2 PC Filtering Signal - Dominant Principal Components Regression Prediction Equation Figure 4-3: Schematic diagram of blind-adjusted principal component regression. 90 In principal component filtering, the eigenvectors of the sample covariance of X' are used to compute the principal components and the eigenvalues are used to determine the number of principal components to be retained. The number of retained principal components is equal to the number of source signals estimated by the ION algorithm. The remaining principal components are discarded as noises. The retained principal components are regarded as new predictors. Since we discarded noise principal components in the NAPC filtering, the number of new predictors is smaller than n. The prediction equation is obtained by linear regression of Y on the new predictors. 4.2.4 Evaluation of Performance of BAPCR It is of interest to see how much performance improvement BAPCR brings to linear prediction over other traditional methods such as linear regression. For this purpose we applied four methods including BAPCR to a few simulated examples of the linear prediction problem. The four methods being contested were linear regression, PCR, BAPCR, and NAPCR. As the criterion of the linear prediction performance we chose the discrepancy factor 'y de- fined in (4.12). The examples were designed to be representative of many practical multivariate datasets. The parameters that need to be specified to define examples are: the number of predictor variables n, the number of source signals p, the sample sizes m, the regression parameter vector 6 that defined the relation between Y and Z, the variance o- of the statistical noise term in Y, the mixing matrix A and the noise covariance G. We would like to compute and present -y of the four methods as a function of m, p, and G. For all examples of the section we used n = 50 and a- = 0.01. The regression parameter vector 3 was randomly generated by computer. If one uses MATLAB program for simulation, 3 could be generated by the command beta = randn(n, 1) where n denotes the number of predictor variables n = 50. Explaining the selection of mixing matrix A needs a little elaboration. Let us begin by recalling the following equation: Cxx = Czz + G = AA T + G. (4.14) Assuming that the n x p mixing matrix A is full column rank, the rank of n x n matrix Czz is p. In other words, p out of n eigenvalues of Czz are non-zero. 
Recalling that SNR of 91 a noisy multivariate dataset is defined in terms of the eigenvalues of Czz and G, it would be convenient if we fix the p non-zero eigenvalues of Czz. For our examples, we chose somewhat arbitrarily the p non-zero eigenvalues of Czz to be proportional to either i-4 or i- 3, i = 1, - - ,p. Other than these conditions on the eigenvalues of AAT, we let A be random. This can be achieved by solving the following equation for A: AAT = QAQT = Q(p)A(p)Q(p), where A(p) is the p x p diagonal matrix of non-zero eigenvalues of AAT and (4.15) Q(p) is the m x p matrix of p eigenvectors corresponding to p non-zero eigenvalues. The solution to (4.15) is A = Q(P)A 1 2 U (4.16) (1) where U is any p x p computer-generated random orthonormal matrix. -y as a Function of Sample Size of Training Set Our first set of simulations focuses on the performance of four linear predictions as a function of m. Table 4.1 summarizes the simulation parameters used. We set the number of source signals p to be 15. For the first two examples of Table 4.1, the eigenvalues of AAT are set to be 1000 x i- 4, 1 < i < p. The last two examples have 1000 x i-3 , 1 < i < p as the eigenvalues of AAT. Noises are again independent so that the noise covariance G are diagonal. As for the diagonal elements of G, we decided on two possibilities. The first choice is G = I. This corresponds to the case where the predictor variables have been noise-adjusted somehow prior to our analysis. This can happen if a dataset is noise-adjusted by someone who knows the noise variances before it is distributed for analysis. Referring to Table 4.1, this choice covers Example (a) and (c). For Example (b) and (d), we arbitrarily set the noise covariance as G = diag (i-1), i = 1, ... , n. For each of the examples in Table 4.1, we evaluated the discrepancy factor -y for the training sets of a few different sample sizes. For each sample size of the training set four different prediction equations are obtained, one from each of the following four methods: 1) linear regression, 2) principal component regression, 3) blind-adjusted principal component regression, and 4) noise-adjusted principal component regression. As for the PCR, the number of principal components to be retained for the 92 (a) (b) (c) (d) n 50 50 50 50 a 0.01 0.01 0.01 0.01 # randn(n,1) rancn(n,1) randn(n,1) randn(n,1) A 1000 x A1 1000 x A1 1000 x A 2 1000 x A 2 p 15 15 15 15 m G Variable I diag(i of 1) interest I diag(i- 1 ) Table 4.1: Important parameters of the simulations to evaluate BAPCR. Both A 1 and A 2 are n x n diagonal matrices. There are n - p zero diagonal elements in each matrix. The p non-zero diagonal elements are i- 4 for A 1 and i- 3 for A 2 , i = 1, ... ,p. regression was decided from the screeplot. The NAPCR serves as the performance bound for the BAPCR. In obtaining the results for NAPCR, the number of source signals p = 15 and the noise covariance G are assumed to be known. Therefore, the NAPCR should always outperform the other three methods, and the performance difference between the BAPCR and the NAPCR indicates the price of the lack of a priori knowledge of the noise variances and the signal order. Once prediction equations are found, it is applied to a validating set of sample size 3000 with the same parameters used to generate the training set. The discrepancy factor -y is then calculated from (4.10) and (4.12). Figure 4-4 shows the results of the simulations. Each of the four subplots correspond to each example of Table 4.1. 
Figure 4-4: Simulation results for examples of Table 4.1 using linear regression, PCR, BAPCR, and NAPCR as a function of m (p = 15). Each panel plots the discrepancy factor $\gamma$ against the sample size of the training set.

Some of the interesting observations that can be made are:

• As we speculated before the experiment, NAPCR outperformed all other methods at all sample sizes for all examples. This is expected because the true noise variances and signal order are used for NAPCR. If the noise variances and the signal orders estimated by the ION algorithm are correct, however, the performance of the BAPCR should be comparable to that of the NAPCR, as we can see for the largest sample size for Examples (a), (b), and (c). We expect that this would also be the case for Example (d) at some sample size larger than 1000.

• The extent of the outperformance of NAPCR over PCR and BAPCR is smaller in Examples (a) and (c) than in Examples (b) and (d). This is due to the fact that the noise variances of Examples (a) and (c) are uniform across variables. Therefore, the difference between PCR and NAPCR is that PCR requires the signal order estimation. Since the variables are already noise-adjusted, signal order estimation by the eigenvalue screeplot should produce the correct answer when the sample size is large.
In Figure 4-4(a), the performances of NAPCR and BAPCR are effectively indistinguishable. The same argument also holds between the BAPCR and the PCR. The difference between the PCR and the BAPCR lies in the fact that the BAPCR normalizes variables by their estimated noise variances; since the variables are already normalized, this should make no difference. In fact, BAPCR might underperform PCR if the noise estimation is less than perfect. In our simulations, BAPCR underperforms the PCR slightly when the sample size is small.

• The performance of PCR does not improve with increasing sample size as much as the other three methods in Examples (b) and (d). Since the noise variances are not normalized for these two examples, the signal order estimation through the screeplot does not yield the correct signal order p. Furthermore, for the same reason, the PC transform does not put all signals in the first p principal components. It can be seen that these two shortcomings of the PC transform limit the performance of the PCR. As a result, $\gamma$ remains relatively flat for increasing sample size. For these two examples, linear regression outperforms the PCR at most sample sizes.

• The performance improvement of BAPCR over the competition is as large as about 400%, achieved for Example (d) at the sample size of 60.

$\gamma$ as a Function of Number of Source Signals

Our second set of simulations concerns the prediction performance as a function of p. The simulation parameters are summarized in Table 4.2. Note that the values of the parameters of Table 4.2 and Table 4.1 are the same except for the variable of interest. We set the sample size m of the training set to 60, since the performances of the four methods are easily differentiated when m is small. The number of source signals runs from 5 to 30.

                      (a)            (b)            (c)            (d)
 n                    50             50             50             50
 $\sigma_\epsilon^2$  0.01           0.01           0.01           0.01
 $\beta$              randn(n,1)     randn(n,1)     randn(n,1)     randn(n,1)
 A                    1000 x $\Lambda_1$  1000 x $\Lambda_1$  1000 x $\Lambda_2$  1000 x $\Lambda_2$
 p                    variable of interest (5 to 30)
 m                    60             60             60             60
 G                    I              diag($i^{-1}$)  I             diag($i^{-1}$)
Table 4.2: Important parameters of the simulations to evaluate BAPCR as a function of p. The values of this table are identical to those of Table 4.1 except for p and m.

In Figure 4-5, we present the discrepancy factor $\gamma$ of the four methods as a function of the number of source signals p. Each of the four subplots corresponds to one example of Table 4.2. Each curve is the average of 40 simulations using randomly chosen matrices, including the orthonormal matrix U used to generate A from (4.16).
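For reference, a mixing matrix of the kind used in these examples can be generated directly from (4.15)-(4.16). The fragment below is a minimal MATLAB sketch; the function name make_mixing_matrix is hypothetical, and the eigenvalue profile 1000*i^(-4) or 1000*i^(-3) follows the definitions of Lambda_1 and Lambda_2 in Table 4.1.

    function A = make_mixing_matrix(n, p, expnt)
    % Random n-by-p mixing matrix whose Gram matrix A*A' has the p non-zero
    % eigenvalues 1000 * (1:p).^(-expnt), as in (4.15)-(4.16).
    lam    = 1000 * (1:p)'.^(-expnt);     % non-zero eigenvalues of A*A'
    [Q, ~] = qr(randn(n, p), 0);          % Q_(p): p random orthonormal columns
    [U, ~] = qr(randn(p));                % U: random p-by-p orthonormal matrix
    A      = Q * diag(sqrt(lam)) * U;     % A = Q_(p) * Lambda_(p)^{1/2} * U   (eq. 4.16)
    end

For instance, make_mixing_matrix(50, 15, 4) would roughly correspond to the A of Examples (a) and (b) of Table 4.1, up to the random choices of Q and U.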
Figure 4-5: Simulation results for examples of Table 4.2 using linear regression, PCR, BAPCR, and NAPCR as a function of p. Each panel plots the discrepancy factor $\gamma$ against the number of source signals.

• Again, the NAPCR outperformed all other methods at all values of p for all examples. This is due to the fact that the true noise variances and the signal order are assumed to be known for the NAPCR. Since the sample size is small, the noise variances and the signal orders estimated by the ION algorithm are prone to estimation error, which appears as the performance difference between the NAPCR and the BAPCR in all four graphs.

• The performance of linear regression is always worse than that of the other three methods. This confirms our previous observation that linear regression is more susceptible to over-training than the other three methods when the sample size is not large enough.

• From the four graphs, we may conclude that the performance of linear regression does not depend on the number of source signals. As for the other three methods, there is a tendency for the performance to deteriorate as p increases. We believe that this is due to the fact that an increase in p leaves fewer principal components to be removed as noise, and thus decreases the amount of noise filtered by the three methods.

$\gamma$ as a Function of Noise Distribution

This set of simulations evaluates the performance of the four methods as a function of the distribution of noise among variables. The simulation parameters are summarized in Table 4.3. As for the diagonal noise covariance matrix G, we used the following six noise distributions:

G = diag(i^{0}); \; diag(i^{-0.5}); \; diag(i^{-1}); \; diag(i^{-1.5}); \; diag(i^{-2}); \; diag(i^{-2.5}).    (4.17)

These six noise covariance matrices are chosen so that we can observe the performances of the four methods both when the noise is normalized and when the noise has very different variances among variables. For example, G = diag(i^0) indicates that the noise is already normalized, and G = diag(i^{-2.5}) represents the case in which the noise variances vary greatly among variables.

                      (a)                 (b)
 n                    50                  50
 $\sigma_\epsilon^2$  0.01                0.01
 $\beta$              randn(n,1)          randn(n,1)
 A                    1000 x $\Lambda_1$  1000 x $\Lambda_2$
 p                    5                   5
 m                    200                 200
 G                    variable of interest
Table 4.3: Important parameters of the simulations to evaluate BAPCR as a function of G.
We expect the outperformance of BAPCR over linear regression and PCR to become more obvious when the noise variances vary more among variables, because the advantage of normalizing variables by their estimated noise variances becomes larger when the noise variances differ among variables. The simulation results are presented in Figure 4-6. The horizontal axis represents the absolute value of the exponent of the diagonal elements of G; for example, G = diag(i^{-1.5}) is represented by the value 1.5 on the horizontal axis.

Figure 4-6: Simulation results for examples of Table 4.3 using linear regression, PCR, BAPCR, and NAPCR as a function of noise distribution. The horizontal axis of each graph represents the exponent of the diagonal elements of G.

• NAPCR again outperforms all other methods.

• More to come here.

4.3 ION Noise Filter: Signal Restoration by the ION Algorithm

The primary objective of the ION algorithm is to obtain the noise variances to be used for noise-adjustment of the variables before the principal component transform. However, the ION algorithm is capable of other functions as well. One of them is signal restoration, or equivalently, noise sequence estimation, which is the major subject of this section.

The EM algorithm derived in Section 3.5 computes not only the noise variances but also a few other quantities. Two of them are $\hat{A}$ and $E[P|X; \hat{A}, \hat{G}]$ (see Figure 3-8). The first quantity, $\hat{A}$, is an n x p matrix, and it is supposedly an estimate of the mixing matrix A. The other quantity, an m x p matrix $E[P|X; \hat{A}, \hat{G}]$, may be interpreted as the expected source signal sequences given the observed noisy data X, assuming $\hat{A}$ and $\hat{G}$ are the true mixing matrix and noise covariance, respectively. Recalling that the noise-free vector Z is the product of the source signal vector P and the mixing matrix A, it is not hard to realize that the product of $\hat{A}$ and $E[P|X; \hat{A}, \hat{G}]$ should be an estimate of the noise-free data matrix:

\hat{Z} = E[P|X; \hat{A}, \hat{G}] \, \hat{A}^T    (4.18)

In reality, the true source signal number p is unknown. Therefore, $\hat{A}$ and $E[P|X; \hat{A}, \hat{G}]$ are n x $\hat{p}$ and m x $\hat{p}$ matrices, respectively, where $\hat{p}$ is the estimated source signal number obtained from the ION algorithm. Obtaining $\hat{Z}$ from the noisy data X constitutes a multivariate noise filtering. We designate this operation ION noise filtering. One should realize that the characteristics of signal and noise that are exploited by conventional noise filters are not known a priori to the ION filter. For example, the difference between the frequency bands of the signal and noise power spectral densities is the basis for designing a frequency-selective filter. The blind noise filter does not utilize the signal and noise power spectral densities; in fact, the ION filter works even if the power spectral densities of signal and noise share the same frequency band.

Before we proceed further, we would like to point out that $\hat{A}$ and $E[P|X; \hat{A}, \hat{G}]$ may not be correct estimates of A and P, respectively. This is the so-called basic indeterminacy in the blind source separation problem [39]. To understand this, it is important to realize that the mixing matrix A and the signal source matrix P appear only as the product of the two in the noisy data matrix X.
This leads us to the conclusion that, for any given mixing matrix A and source signal matrix P, we could create infinitely many pairs of mixing matrices and source signal matrices that would have resulted in the same noisy data matrix X. For example, for any p x p orthonormal matrix V, the pairs (PV, $V^T A^T$) and (P, $A^T$) are not distinguishable from the noisy matrix X.

4.3.1 Evaluation of the ION Filter

To examine the effectiveness of the ION filter, we first applied the filter to the four examples of multivariate noisy data described in Table 4.1. In all examples, the criterion used to evaluate the effectiveness of the ION filter is the improvement in signal-to-noise ratio (SNR) between the sequences before and after the filter. A large increase in SNR indicates that the noises are rejected effectively by the filter. In Figure 4-7, the average increase in SNR achieved by the ION filter is drawn as a function of the sample size. For each sample size, the SNR increase for each of the 50 variables was obtained first by averaging 40 repeated simulations; in each repeated simulation, we randomly generated the source signals and noises. The SNR increases of the 50 variables were then averaged.

Figure 4-7: Increases in SNR achieved by the ION filtering for examples of Table 4.1 as a function of m (average SNR increase versus sample size for Examples (a)-(d)).

• SNR improvement increases as the sample size grows. This is consistent with our understanding that the ION algorithm works more accurately with data of larger sample size: the estimated noise-free data matrix $\hat{Z}$ is closer to Z when the sample size is larger.

• SNR improvement for Example (a) is higher than that for Example (b). This is because the aggregate noise power, defined as the sum of all noise variances over variables, is larger in Example (a) than in Example (b). Recall that the noise variances of Examples (a) and (b) are diag($i^0$) and diag($i^{-1}$), respectively. Therefore, there is more room for SNR improvement in Example (a) than in Example (b). The same explanation applies to the result that Example (c) experiences higher SNR improvement than Example (d).

In our second set of simulations we examined how the effectiveness of the ION filter is affected by the number of source signals. For this purpose, we applied the ION filter to the four examples of Table 4.2. Again, we measured the effectiveness of the filter by the average SNR improvement, which is presented in Figure 4-8. The result indicates that the average SNR improvement achieved by the ION filter is a decreasing function of the number of source signals.

Figure 4-8: Increases in SNR achieved by the ION filtering for examples of Table 4.2 as a function of p (average SNR increase versus number of source signals for Examples (a)-(d)).

4.3.2 ION Filter vs. Wiener Filter

If the signal and noise statistics of a given dataset were known, the linear least-squares estimate (LLSE) of the noiseless dataset would be obtained by applying the Wiener filter to the noisy dataset.
For the data model given in Figure 3-1, the Wiener filter is given by

\hat{Z} = C_{ZZ} C_{XX}^{-1} X.    (4.19)

To apply the Wiener filter, it is essential that we know the covariance matrix $C_{ZZ}$. When $C_{ZZ}$ is not known in advance, an estimate of it can be obtained from the relation $C_{ZZ} = C_{XX} - G$, as long as G is known. In many practical examples, neither $C_{ZZ}$ nor G is known a priori. Encouraged by the significant increases in SNR demonstrated by the ION filter in the current section, we speculate that the performance of the ION filter, measured by SNR improvement, is comparable to that of the Wiener filter. Of course, the ION filter cannot outperform the Wiener filter, whose performance serves as an upper bound; it is of interest to see how closely the ION filter approaches it. In studying the performance gap, if any, between the ION filter and the Wiener filter, we resort to computer simulations, using the simulation examples specified in Table 4.1. The results are illustrated in Figure 4-9. Figures (a-1) and (a-2) illustrate the simulation results corresponding to Example (a) of Table 4.1, and Figures (b-1) and (b-2) are the simulation results obtained from Example (b) of Table 4.1.

• As expected, the Wiener filter increases SNR more than both the ION filter and the PC filter.

• The performances of the Wiener filter, the ION filter, and the PC filter are not much different for Example (a). For example, the difference in SNR between the Wiener-filtered dataset and the ION-filtered dataset remains less than 1 dB for all variables.

• However, for Example (b), the PC filter significantly underperforms both the ION filter and the Wiener filter for variables with small SNRs. Even though the ION filter also underperforms the Wiener filter, the degree of underperformance is significantly smaller for the ION filter than for the PC filter.

• The significance of the example is the fact that the ION filter, which does not assume a priori knowledge of $C_{ZZ}$, performs almost as well as the Wiener filter, the optimal linear filter, which requires a priori knowledge of $C_{ZZ}$.

4.4 Blind Principal Component Transform

In many practical situations in which we are interested in obtaining the principal components (PCs) of a noise-free dataset Z, it is usually the noisy dataset X that is available for the transform. Let the PCs of Z and the PCs of X be referred to as the true PCs and the noisy PCs, respectively. The noisy PCs are often used as substitutes for the true PCs. Because the eigenvalues and eigenvectors of $C_{XX}$ are different from those of $C_{ZZ}$, the noisy PCs may not be good estimates of the true PCs. As we saw in the previous section, the ION algorithm can reduce the noise in X and return $\hat{Z}$, the estimate of the noise-free dataset Z. It may therefore be speculated that the PCs of the ION-filtered dataset are better estimates of the true PCs than are the noisy PCs. We will refer to the PC transform of the ION-filtered dataset as the blind principal component (BPC) transform. Figure 4-10 illustrates the schematic diagram of the BPC transform as well as the traditional PC transform.
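The per-variable comparison plotted in Figure 4-9, whose caption follows, could be reproduced along the following lines. This is a minimal MATLAB sketch, not the thesis code: it assumes a simulated noise-free Z, its noisy version X, the true G and signal order p, and the hypothetical em_blind_noise fragment sketched in Chapter 3; SNR is measured per variable in dB.

    % Given simulated Z (noise-free, m-by-n), X = Z + noise, true G, and order p:
    Czz = cov(Z);   Cxx = Czz + G;                 % population covariances (known here)
    Zw  = X * (Cxx \ Czz);                         % Wiener filter (4.19), applied to each row of X

    [V, L] = eig(cov(X));                          % PC filter: keep the first p PCs of X
    [~, idx] = sort(diag(L), 'descend');
    Vp  = V(:, idx(1:p));
    Zpc = X * Vp * Vp';

    [Ghat, Ahat, C] = em_blind_noise(X, p, 20);    % ION filter: Zhat = E[P|X] * Ahat'
    Zion = C * Ahat';

    snr = @(Zf) 10 * log10(var(Z, 0, 1) ./ var(Zf - Z, 0, 1));   % per-variable SNR in dB
    [snr(X); snr(Zion); snr(Zw); snr(Zpc)]         % rows: noisy, ION, Wiener, PC filtered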
Figure 4-9: Performance comparison of the Wiener filter, the ION filter, and the PC filter using examples of Table 4.1. Each panel plots SNR versus variable index for the noisy dataset and for the ION-filtered, Wiener-filtered, and PC-filtered datasets.

The goal of this section is to show through simulations that the PCs obtained through the BPC transform of X are in fact better estimates of the true PCs in the mean-square error (MSE) sense. The MSE of the noisy PCs, denoted by $\mathrm{MSE}_{pc}$, is defined as the mean-square difference between the noisy PCs and the true PCs. An $\mathrm{MSE}_{pc}$ value exists for each noisy PC. For example, the $\mathrm{MSE}_{pc}$ of the first noisy PC can be written as

\mathrm{MSE}_{pc} = \frac{1}{m} \sum_{k=1}^{m} \left( PC_1^X(k) - PC_1^Z(k) \right)^2    (4.20)

where $PC_1^X(\cdot)$ and $PC_1^Z(\cdot)$ denote the first noisy PC and the first true PC, respectively. Similarly, we can define the mean-square error of the BPCs, denoted by $\mathrm{MSE}_{bpc}$, as the mean-square difference between the BPCs and the true PCs. The $\mathrm{MSE}_{bpc}$ of the first BPC can be written as

\mathrm{MSE}_{bpc} = \frac{1}{m} \sum_{k=1}^{m} \left( BPC_1^X(k) - PC_1^Z(k) \right)^2    (4.21)

where $BPC_1^X(\cdot)$ denotes the first BPC of X.

Figure 4-10: Schematic diagram of the three different principal component transforms: the PC transform of Z (noise-free PCs), the PC transform of X (noisy PCs), and the PC transform of the ION-filtered dataset (BPCs).

For our simulation we chose Examples (b) and (d) of Table 4.1. Examples (a) and (c) of the same table were not used for this simulation because their noise variances are constant over variables. In those cases, the eigenvectors of the covariance of X are the same as the eigenvectors of the covariance of Z; therefore, the difference between the BPC transform, which is the PC transform applied after the ION filter retrieves the noise-free dataset $\hat{Z}$, and the noisy PC transform is minimal for these two unchosen examples.

In Table 4.4 and Table 4.5, we computed the mean-square errors of the BPC and PC transforms for different sample sizes, and listed the results for the first five BPCs and PCs. Let us consider the result presented in Table 4.4. The MSE values of the BPCs are always smaller than those of the noisy PCs. The eigenvectors of the covariance matrix of X are different from the eigenvectors of the covariance matrix of Z due to additive noise whose variances vary over variables. The ION filter reduces these noises, and the covariance matrix of $\hat{Z}$, which is obtained through the ION filtering, should be a better estimate of the true covariance matrix of Z. As a result, the eigenvectors of the covariance matrix of $\hat{Z}$ are well aligned with the eigenvectors of the true covariance. This in turn reduces the MSE values for the BPC transform. This is also true for the result of Example (d).

To illustrate the relation between the sample size m and the effectiveness of the BPC transform, we computed the percentage reduction of MSE due to the BPC transform against the sample size. The percentage reduction of MSE due to the BPC transform is defined as

\frac{\mathrm{MSE}_{pc} - \mathrm{MSE}_{bpc}}{\mathrm{MSE}_{bpc}} \times 100 \; (\%).

The result obtained from Example (b) is presented in Figure 4-11. The percentage reduction of the MSE values is larger when the sample size is bigger. Since the ION filter works better for a bigger sample size, the MSE reduction achieved by the BPC transform is bigger for a bigger sample size. The same result and explanation hold for Example (d), which is presented in Figure 4-12.
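The entries of Tables 4.4 and 4.5, which follow, are of this form. A minimal MATLAB sketch of how $\mathrm{MSE}_{pc}$, $\mathrm{MSE}_{bpc}$, and the percentage reduction could be computed for the first few components is given below; it assumes noise-free data Z, noisy data X, an ION-filtered estimate Zhat, and a sign convention that aligns each estimated PC with the corresponding true PC (eigenvectors are defined only up to sign). The function names are hypothetical.

    function [mse_pc, mse_bpc, pct] = bpc_mse(Z, X, Zhat, K)
    % Mean-square errors (4.20)-(4.21) of the first K noisy PCs and BPCs
    % relative to the true PCs, plus the percentage reduction of MSE.
    PCz = pcs(Z, K);   PCx = pcs(X, K);   BPCx = pcs(Zhat, K);
    for j = 1:K                                   % resolve the sign ambiguity of each PC
        if PCx(:, j)'  * PCz(:, j) < 0, PCx(:, j)  = -PCx(:, j);  end
        if BPCx(:, j)' * PCz(:, j) < 0, BPCx(:, j) = -BPCx(:, j); end
    end
    mse_pc  = mean((PCx  - PCz).^2, 1);           % eq. (4.20), one value per PC
    mse_bpc = mean((BPCx - PCz).^2, 1);           % eq. (4.21)
    pct     = (mse_pc - mse_bpc) ./ mse_bpc * 100;
    end

    function P = pcs(Y, K)
    % First K principal components (scores) of the data matrix Y.
    [V, L] = eig(cov(Y));
    [~, idx] = sort(diag(L), 'descend');
    P = Y * V(:, idx(1:K));
    end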
              m=60    m=80    m=100   m=200   m=500   m=1000
MSEbpc(1)     0.252   0.081   0.064   0.066   0.042   0.044
MSEpc(1)      0.281   0.103   0.085   0.091   0.067   0.072
MSEbpc(2)     0.849   0.633   0.165   0.534   0.091   0.061
MSEpc(2)      0.869   0.677   0.184   0.575   0.152   0.124
MSEbpc(3)     0.723   0.826   0.082   0.456   0.144   0.069
MSEpc(3)      0.739   0.844   0.097   0.502   0.193   0.116
MSEbpc(4)     0.527   0.093   0.116   0.073   0.056   0.051
MSEpc(4)      0.594   0.149   0.189   0.143   0.124   0.128
MSEbpc(5)     0.207   0.184   0.122   0.159   0.057   0.052
MSEpc(5)      0.305   0.295   0.203   0.350   0.113   0.152

Table 4.4: Mean-square errors of the first five principal components obtained through the BPC and the traditional PC transforms using Example (b) of Table 4.1.

Figure 4-11: Percentage reduction of MSE achieved by the BPC transform over the noisy PC transform using Example (b) of Table 4.1.

              m=60    m=80    m=100   m=200   m=500   m=1000
MSEbpc(1)     0.195   0.744   0.302   0.084   0.176   0.063
MSEpc(1)      0.214   0.798   0.334   0.112   0.222   0.110
MSEbpc(2)     0.833   0.483   2.295   0.289   0.973   0.121
MSEpc(2)      0.847   0.501   2.310   0.319   1.013   0.165
MSEbpc(3)     2.371   1.087   1.991   0.605   0.338   0.202
MSEpc(3)      2.430   1.148   2.039   0.695   0.439   0.288
MSEbpc(4)     3.536   0.894   2.034   0.102   0.240   0.133
MSEpc(4)      3.563   0.935   2.052   0.124   0.292   0.181
MSEbpc(5)     1.508   0.683   0.736   0.201   0.159   0.077
MSEpc(5)      1.512   0.730   0.773   0.251   0.191   0.111

Table 4.5: Mean-square errors of the first five principal components obtained through the BPC and the traditional PC transforms using Example (d) of Table 4.1.

Figure 4-12: Percentage reduction of MSE achieved by the BPC transform over the noisy PC transform using Example (d) of Table 4.1.

Chapter 5

Evaluations of Blind Noise Estimation and its Applications on Real Datasets

5.1 Introduction

While we studied multivariate datasets received from various industrial manufacturing companies, it became apparent to us that the variables were noisy and that the noise variances were not uniform across the variables. As a result, it was imperative to be able to retrieve noise variances from a given sample dataset reliably in order to apply the NAPC transform, which is superior to the PC transform in noise filtering. After making a few assumptions about the signals and noise, we developed the ION algorithm, which retrieves not only noise variances but also the signal order and the noise sequences. We have tested the algorithm on simulated datasets which do not violate the underlying assumptions.

One of the assumptions made about the signals is that they are Gaussian. The development of the EM algorithm is based heavily upon this assumption. However, it is very conceivable that this assumption may be violated in practice; for example, a real manufacturing dataset could easily display very non-Gaussian behavior. Another assumption that may not hold in practice is time-related. No time structure is assumed in developing the EM algorithm; linear combinations of signals are assumed to be instantaneous. However, many practical datasets that we have studied did in fact have time structure, such as slow changes.

In this chapter, we apply the ION algorithm to multivariate datasets obtained from manufacturing and remote sensing. By doing so, we want to achieve two things: 1) to complete what motivated the development of the ION algorithm, the analysis of real datasets, and 2) to test the robustness of the ION algorithm by applying it to real datasets which may violate the Gaussian assumption and the no-time-structure assumption.
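Before turning to the individual datasets, the overall shape of the ION iteration can be sketched as follows (Python/NumPy). This is only a structural sketch: `estimate_order` and `noise_variances` below are crude stand-ins of our own (a largest-gap screeplot heuristic and a PCA-residual variance estimate), not the thesis's screeplot procedure or EM algorithm.

```python
import numpy as np

def estimate_order(Xn):
    """Crude stand-in for a screeplot-based order estimate:
    take the largest drop in the sorted log-eigenvalue screeplot."""
    eigval = np.sort(np.linalg.eigvalsh(np.cov(Xn, rowvar=False)))[::-1]
    gaps = np.diff(np.log(np.clip(eigval, 1e-12, None)))
    return int(np.argmin(gaps)) + 1          # most negative gap = largest drop

def noise_variances(X, p):
    """Crude stand-in for an EM noise-variance estimate: per-variable variance of the
    residual left after removing the first p principal components."""
    Xc = X - X.mean(axis=0)
    _, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
    V = eigvec[:, ::-1][:, :p]               # top-p eigenvectors
    resid = Xc - (Xc @ V) @ V.T
    return np.var(resid, axis=0)

def ion_sketch(X, max_iter=10, tol=1e-3):
    """High-level structure of the ION iteration: estimate the order on noise-normalized
    data, re-estimate the noise variances, and repeat until the estimates settle."""
    noise_var = np.ones(X.shape[1])          # initial guess: unit noise variances
    p = None
    for _ in range(max_iter):
        p = estimate_order(X / np.sqrt(noise_var))
        new_var = noise_variances(X, p)
        converged = np.max(np.abs(new_var - noise_var)) < tol
        noise_var = new_var
        if converged:
            break
    return p, noise_var
```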
In Section 5.2 we test the ION algorithm on NASA AIRS data. We especially focus on the noise-filtering aspect of the ION algorithm in that section. First, a brief discussion of remote sensing in general and of the AIRS dataset is presented. Then the ION filter is compared with the Wiener filter and the PC filter, which are widely used for multivariate noise filtering. We explain in which cases the PC filter replaces the Wiener filter, which is known to improve SNR the most among linear filters, and how the ION filter may be applied when the Wiener filter is desired but cannot be applied. Our simulation shows that the ION filter performs much better than the PC filter and approaches the performance of the Wiener filter.

A large-scale manufacturing dataset is analyzed in Section 5.3. As a new application of the ION algorithm, the separation of variables into subgroups based on the eigenvectors of the BAPC transform is studied, and a numerical metric for the effectiveness of eigenvectors in separating variables into groups is proposed. In addition, the performance of the BAPCR is compared with traditional least-squares linear regression.

5.2 Remote Sensing Data

Remote sensing is a fairly new technique compared to aerial photography. Microwave remote sensing is preferable to optical photography because it can penetrate clouds better and it does not depend on the sun for illumination. For these reasons, microwave remote sensing techniques have become popular since the early 1960s [40]. An important class of remote sensing techniques is passive remote sensing, in which sensors measure the radiation emitted by the atmosphere, the surface, and celestial bodies in multispectral channels.

The inevitable measurement error is added to the incoming signal during the radiance measurement. There are other sources of error, such as 1/f noise, but for the sake of simplicity our evaluation of the ION algorithm will only address the additive Gaussian measurement noise. Modeling the combined effect of all noise sources as additive Gaussian noise is not uncommon in the remote sensing research community [30].

The objective of this section is to test the ION filter on simulated AIRS data and to compare its performance, measured by SNR improvement, with that of two other widely used filters. First, the structure of the AIRS data is explained.

5.2.1 Background of the AIRS Data

To provide a basic understanding of the AIRS data, we include the following excerpt from [30]: The National Aeronautics and Space Administration's (NASA) Atmospheric Infrared Sounder (AIRS) is a high-spectral-resolution infrared spectrometer that operates on a polar-orbiting platform. AIRS provides spectral observations of the earth's atmosphere and surface over 2371 frequency channels. The data can be used to retrieve various geophysical parameters such as atmospheric temperature and humidity profiles. A full 2371-channel spectrum is generated every 22.4 msec.

Figure 5-1 illustrates the format of a typical AIRS dataset. An AIRS dataset consists of 2371 columns. Each column represents the spectral observations of one narrowband frequency channel and constitutes a variable. Therefore, an entire AIRS dataset consists of 2371 variables.
Each measurement of the 2371 variables becomes a row, and the number of rows depends upon the duration and frequency of the measurements. Typically a measurement is made every 22.4 msec.

Figure 5-1: Format of an AIRS dataset.

An AIRS dataset is subject to many noise sources, such as instrument noise and errors due to imperfect calibration. The cumulative effect of these noise sources is typically modeled as independent additive Gaussian noise. In that case, the model of an AIRS dataset coincides with the one illustrated in Figure 3-1, upon which we based the development of the ION algorithm. Throughout this section, we will refer to this cumulative effect as "noise," so that the noise is independent, additive, white, and Gaussian.

According to the schedule of the NASA AIRS project, the polar-orbiting platform that will carry the AIRS infrared spectrometer will not be in orbit until the year 2000, so an actual AIRS dataset will not be available until then. However, researchers in the field were able to generate a simulated AIRS dataset based on what is expected to be observed by the spectrometer over land during nighttime. We were provided with a noiseless simulated AIRS dataset by the NASA AIRS science team. The dataset has 2371 variables and contains 7496 observations. This is what we will refer to as the simulated noiseless AIRS dataset. We do not know whether the variables of the noiseless AIRS data are Gaussian. A corresponding noisy dataset was generated by adding to the noiseless dataset pseudo-random Gaussian noise whose statistics were provided by the AIRS science team. Noiseless and noisy simulated AIRS datasets have been widely used to develop data compression and coding algorithms to be used on actual datasets. Figure 5-2 shows the signal and noise variances of the simulated AIRS dataset.

Figure 5-2: Signal and noise variances of the simulated AIRS dataset.

5.2.2 Details of the Tests

There is no question that analysis of noiseless AIRS data returns a more accurate description of the atmospheric profiles than analysis of noisy AIRS data. Therefore, it is highly desirable to remove noise from noisy AIRS data before any further analysis. Traditionally, this is done by the Wiener filter when noise statistics are known, and by the PC filter when noise statistics are not available. It is well known that the Wiener filter is optimal in the least-squares sense; therefore, its SNR improvement is higher than that of any other linear filter. On the contrary, while the PC filter does not improve SNR as much as the Wiener filter, it does not require a priori knowledge of the noise variances.

The purpose of this test is to compare the ION filter with the PC filter and the Wiener filter using the AIRS data. We want to illustrate that the ION filter, while as widely applicable as the PC filter because it does not need a priori noise variances, performs almost as well as the optimal linear least-squares filter.
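The noisy simulated dataset described above is obtained by adding zero-mean Gaussian noise with channel-dependent standard deviations to the noiseless data. A minimal sketch of that step is shown below (Python/NumPy); the array names are placeholders, and the actual noise statistics used in the thesis were supplied by the AIRS science team.

```python
import numpy as np

def add_channel_noise(Z, noise_std, seed=0):
    """Add independent zero-mean Gaussian noise to each column (channel) of Z.

    Z         : (m, n) noiseless data, one observation per row
    noise_std : (n,) per-channel noise standard deviations
    """
    rng = np.random.default_rng(seed)
    N = rng.standard_normal(Z.shape) * noise_std    # broadcasts the std across the rows
    return Z + N, N

# Hypothetical usage for an AIRS-like dataset (7496 observations, 2371 channels):
# X_noisy, N = add_channel_noise(Z_noiseless, airs_noise_std)
```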
In Figure 5-3, notation for the inputs and outputs of the three filters is defined. The ith variables of \hat{Z}^{ION}, \hat{Z}^{PC}, and \hat{Z}^{WIENER} are denoted by \hat{z}_i^{ION}, \hat{z}_i^{PC}, and \hat{z}_i^{WIENER}, respectively. The SNRs of \hat{z}_i^{ION}, \hat{z}_i^{PC}, and \hat{z}_i^{WIENER} are defined as, respectively,

SNR_{\hat{z}_i^{ION}} = \frac{\sigma_{z_i}^2}{\sigma^2_{(\hat{z}_i^{ION} - z_i)}}     (5.1a)

SNR_{\hat{z}_i^{PC}} = \frac{\sigma_{z_i}^2}{\sigma^2_{(\hat{z}_i^{PC} - z_i)}}     (5.1b)

SNR_{\hat{z}_i^{WIENER}} = \frac{\sigma_{z_i}^2}{\sigma^2_{(\hat{z}_i^{WIENER} - z_i)}}     (5.1c)

where \sigma^2_{(\hat{z}_i^{ION} - z_i)}, \sigma^2_{(\hat{z}_i^{PC} - z_i)}, and \sigma^2_{(\hat{z}_i^{WIENER} - z_i)} are the variances of \hat{z}_i^{ION} - z_i, \hat{z}_i^{PC} - z_i, and \hat{z}_i^{WIENER} - z_i, respectively. For comparison purposes, we also define the SNR of the unfiltered dataset as

SNR_{x_i} = \frac{\sigma_{z_i}^2}{\sigma^2_{(x_i - z_i)}}.     (5.1d)

Figure 5-3: Three competing filters and notations.

The difference between SNR_{\hat{z}_i^{ION}} and SNR_{\hat{z}_i^{PC}} is thus an indication of what can be gained by retrieving the noise variances through the ION algorithm, and the difference between SNR_{\hat{z}_i^{WIENER}} and SNR_{\hat{z}_i^{ION}} can be interpreted as the performance loss due to the lack of a priori knowledge of the noise statistics.

In testing the three filters, we use a simulated AIRS dataset consisting of 240 channels equally spaced among the entire 2371 channels. Figure 5-4(a) provides the signal and noise variances of the selected 240 variables. Figure 5-4(b) is the eigenvalue screeplot of the decimated AIRS dataset after the corresponding pseudo-random noise is added to each channel. A quick glance at the screeplot indicates that the number of signals falls in the range 10 < p < 25. There exists a significant gap between n and p, which is a prerequisite for the ION algorithm.

Figure 5-4: 240-variable AIRS dataset. (a) Signal and noise variances. (b) Eigenvalue screeplot.

5.2.3 Test Results

In Figure 5-5, we plot the SNRs of \hat{Z}^{ION}, \hat{Z}^{PC}, \hat{Z}^{WIENER}, and X for the 240 variables in the AIRS dataset. The horizontal axis is the variable index, and the vertical axis is the SNR in dB. The figure shows that the Wiener filter and the ION filter always increase the SNR of the noisy dataset. For the PC filter, however, SNR_{\hat{z}_i^{PC}} - SNR_{x_i} is not always positive. For example, this value is negative at high variable indexes, which means that the PC filter actually decreases the SNR for high-index variables.

Figure 5-5: Plots of SNR of the unfiltered, ION-filtered, PC-filtered, and Wiener-filtered datasets.
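The SNR curves of Figure 5-5 follow directly from the definitions in (5.1). A minimal sketch of the computation is given below (Python/NumPy, our own illustrative code), assuming the noise-free data and the three filtered datasets are available as arrays with one observation per row.

```python
import numpy as np

def snr_db(Z, Z_hat):
    """Per-variable SNR (in dB) of an estimate Z_hat of the noise-free data Z, as in (5.1)."""
    signal_var = np.var(Z, axis=0)
    error_var = np.var(Z_hat - Z, axis=0)
    return 10.0 * np.log10(signal_var / error_var)

# Hypothetical usage with the filtered datasets of Figure 5-3:
# snr_x      = snr_db(Z, X)          # unfiltered, eq. (5.1d)
# snr_ion    = snr_db(Z, Z_ion)      # ION filter, eq. (5.1a)
# snr_pc     = snr_db(Z, Z_pc)       # PC filter, eq. (5.1b)
# snr_wiener = snr_db(Z, Z_wiener)   # Wiener filter, eq. (5.1c)
```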
" The SNR difference between ZWIENER and ZION, plotted in Figure 5-7(a), is relatively small compared to the range of SNR displayed in Figure 5-5. This is a significant result considering that the ION filter does not require noise variances a priori. " Figure 5-7(b) reveals that the performance gap of the ION filter and the PC filter can be large for variables with small signal variances. Noise Signal 0 Figure 5-6: An example eigenvalue screeplot where signal variances are pushed to a higher principal component due to larger noise variances of other variables. 5.3 5.3.1 Packaging Paper Manufacturing Data Introduction As a part of the Leaders for Manufacturing (LFM) program, we had a chance to work with a company which produced paper used for packaging materials. We refer to this company 1A lower principal component means a principal component with higher variance. second principal component is lower than the fourth principal component. 115 For example, the (a) (b) 8 40 7 6 5 -35--. - 30 . .--.-. - . -- 2 5-.--.--. - .-.-.- 20 S- z CO) - 3 .. z 15 C') - .- -. - 10 5 2 1 - 0 0 - - --.--.- 50 - - . . 100 150 Variable Index 200 - -- -0 -5 250 0 50 100 150 Variable Index 200 250 Figure 5-7: Differences in signal-to-noise ratio between pairs of ZWIENER and ZION and of ZION and Z'c. (a) Plot of SNR 2WIENER - SNR2ION- (b) Plot of SNR 2 ION - N as B-COM in this thesis. The entire production line of the company can be categorized into four different operations: preparation, production, power and chemical supply, and quality measurement. Figure 5-8 is the schematic diagram of the production line of B-COM. The preparation stage includes fiber pre-process, digestion and stock preparation. Large logs are chipped into small woodchips (fiber pre-process). The woodchips are then digested into thick liquid (digestion). Lots of chemicals supplied by effluent plant are added at this stage. This liquid is then fed to three machines (A, B, and C), whose final product is paper rolls. The power plant supplies necessary electricity and steam. Before the final product is delivered to customers, it goes through a standard quality check, which is performed offline. If a roll of paper does not pass the quality check, the roll is typically recycled at the digestion stage. In addition to the offline quality check, B-COM records thousands of variables across preparations, production, and power and chemical supply stages. It was thought initially that every information about the production and quality of the product are embedded in these thousands of 'inline' variables. During our collaboration with the company, we were provided with extensive collections of observations of inline variables and corresponding values of offline quality variables and asked to look for cause(s) for variations in product qualities. If we could establish a firm relationship between product quality and certain controllable inline variables, the company would benefit greatly because product qualities 116 Power Plant Fiber Pre-Process Digestion - Steam - Electricity - 4 , Effluent Plant Stock Preparation Paper Machine Paper Machine Paper Machine A B C -F1 1~ -F-I-- Quality Measurement Figure 5-8: Schematic diagram of paper production line of company B-COM 117 2 0- -. .. . .. . .. . . .. . . .. . . - -----. -.-...--..--.-..--..---.... -.....-.. -.. -..... -4 . -6 0 . . . . . 100 200 300 Index (i) 400 500 Figure 5-9: An eigenvalue screeplot of 577 variable paper machine C dataset of B-COM. 
The raw dataset of inline variables was very large and unorganized. First, it contained various non-numeric data points. There were also many data points where the entries were missing. Although these non-numeric or missing entries may contain some valuable information about the process, our focus was mainly on the numeric parts of the datasets; therefore, we excluded those entries from our analysis. We also excluded variables which remained constant for all times, since they carry no information on quality fluctuations. This is called data preparation, or data cleaning.

The cleaned dataset still consists of thousands of variables. For the thesis, we chose to use the variables collected from paper machine C, for three reasons: 1) machine C is the largest of the three machines, 2) the three machines are totally independent, and 3) time synchronization among variables measured at different parts of the production line is not very good. For example, it takes about 18 hours for logs at the pre-processing stage to come out as paper, so a data point at the pre-processing stage is related to a data point at the final product stage 18 hours later. It is not a small task to follow such time delays for thousands of variables. In contrast, within a paper machine, the time lags among variables are negligible.

The final dataset consists of 7847 observations of 577 variables. Each variable is standardized to zero mean and unit variance. The eigenvalue screeplot of this standardized dataset is given in Figure 5-9. The noise flat bed is not clearly defined, indicating that the noise variances are not uniform across the standardized variables.

Figure 5-9: An eigenvalue screeplot of the 577-variable paper machine C dataset of B-COM.

5.3.2 Separation of Variables into Subgroups using PC Transform

Analyzing a dataset with a large number of variables sometimes involves dividing the variables into several subgroups. If the division is "well-designed," studying individual subgroups may reveal information which is otherwise hard to discover. In addition, subgroups of variables are easier to analyze because of their smaller sizes. In many cases, it is helpful to divide variables based on their correlations: divide the variables into subgroups so that strongly correlated variables are put into one subgroup and so that little correlation exists between variables across subgroups. By doing so, the variables of one subgroup, however many there are, can be considered to control (or be controlled by) one physical factor.

Because the PC transform utilizes the covariances among variables to obtain eigenvectors, which are then used as weights in computing statistically uncorrelated principal components, it is believed that the eigenvectors would be useful in dividing variables into subgroups. Specifically, the absolute values of the elements of an eigenvector represent the contributions of the corresponding variables to that principal component. Therefore, one could gather the variables corresponding to elements of large absolute value in an eigenvector into one subgroup. Figure 5-10 presents the first eight eigenvectors of the 577-variable B-COM dataset. Contrary to our expectation, these plots do not identify any variable subgroup clearly. Instead, the plots imply that most of the variables contribute roughly equally to each principal component. Although possible in theory, this is an unlikely event in practice.

Figure 5-10: First eight eigenvectors of the 577-variable paper machine C dataset of B-COM.
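The two computational steps used here, standardization and the eigendecomposition that yields the screeplot and the eigenvector loadings, can be sketched as follows (Python/NumPy; the helper names are ours, not the thesis's).

```python
import numpy as np

def standardize(X):
    """Scale each variable (column) to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def eig_screeplot(X):
    """Eigenvalues (descending) and matching eigenvectors of the covariance of X.

    For a standardized dataset the log-eigenvalues form a screeplot like Figure 5-9,
    and the eigenvector elements are the loadings plotted in Figure 5-10.
    """
    eigval, eigvec = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(eigval)[::-1]
    return eigval[order], eigvec[:, order]

# Hypothetical usage on the 7847 x 577 paper-machine dataset:
# Xs = standardize(X_raw)
# eigval, eigvec = eig_screeplot(Xs)     # eigvec[:, 0] is the first eigenvector
```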
5.3.3 Quantitative Metric for Subgroup Separation

The purpose of this section is to introduce a quantitative metric for the 'effectiveness' of an eigenvector in defining a subgroup. Qualitatively, an eigenvector is effective in defining a subgroup if most of its elements are close to zero and a few elements have exceptionally large magnitude (positive or negative). An ineffective eigenvector is one whose elements are mostly of comparable magnitude. For example, the eight eigenvectors in Figure 5-10 fit the description of ineffective eigenvectors. The subgroup separation metric (GSM) defined in this section quantifies the effectiveness of an eigenvector.

The idea of the GSM is very simple: it is the ratio of two simple averages, the average of the squared elements whose absolute values are in the largest 10 percent and the average of the squares of the remaining elements. A large GSM value indicates a good subgroup separation. Let v be an eigenvector of the covariance matrix of a given dataset and {v_i}, i = 1, ..., n, be its elements. If we rearrange {v_i} from the largest to the smallest in absolute value and rename them {q_i}, i = 1, ..., n, then the subgroup separation metric of v is defined as

GSM(v) = \frac{\frac{1}{k}\sum_{i=1}^{k} q_i^2}{\frac{1}{n-k}\sum_{i=k+1}^{n} q_i^2}     (5.2)

where k is the smallest integer larger than 0.1n. If the elements of v are drawn from a Gaussian distribution, then GSM(v) ≈ 7.

The GSM values for the eight eigenvectors of Figure 5-10 are computed and presented in Table 5.1. Although we do not yet have an absolute way of saying whether a specific GSM value, for example 23.94 for the eighth eigenvector, indicates a good subgroup separation, we note that the GSM values for the first three eigenvectors are even smaller than 7, the GSM value of a vector whose elements are Gaussian.

        PC1    PC2    PC3    PC4     PC5     PC6     PC7    PC8
GSM     5.42   4.32   6.51   10.80   16.64   13.07   9.47   23.94

Table 5.1: GSM values for the eigenvectors plotted in Figure 5-10.
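A sketch of the GSM of (5.2) is given below (Python/NumPy, our own illustrative code). It forms the ratio of the mean squared magnitude of the top 10 percent of the elements to that of the remaining 90 percent, which is about 7 for Gaussian elements, as noted above.

```python
import numpy as np

def gsm(v):
    """Subgroup separation metric of eq. (5.2) for an eigenvector v."""
    v = np.asarray(v, dtype=float)
    n = v.size
    k = int(np.floor(0.1 * n)) + 1            # smallest integer larger than 0.1 n
    q2 = np.sort(np.abs(v))[::-1] ** 2        # squared elements, largest |v_i| first
    return q2[:k].mean() / q2[k:].mean()

# Sanity check: for Gaussian elements the GSM is about 7, as stated in the text.
# rng = np.random.default_rng(0)
# print(np.mean([gsm(rng.standard_normal(577)) for _ in range(100)]))   # ~7
```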
5.3.4 Separation of Variables into Subgroups using the BAPC Transform

Our attempt to determine subgroups through the PC transform was not very successful. As an alternative and a potential improvement to the PC transform, we decided to try the BAPC transform on the same 577-variable B-COM dataset. Since the ION algorithm effectively separates noisy variables into 'correlated signals' and 'uncorrelated noises,' it was expected that variables which have strong correlations among them would be deemed by the ION algorithm to have small noise variances. The noise-adjusting part of the BAPC transform should then boost the variances of such variables by dividing them by their small noise standard deviations. As a result, the subsequent PC transform would be dominated by such variables, which should emerge as the contributing variables to the top principal components.

Figure 5-11 shows the noise standard deviations of the 577-variable B-COM dataset retrieved by the ION algorithm.

Figure 5-11: Retrieved noise standard deviations of the 577-variable B-COM dataset.

These estimated noise standard deviations are used to noise-adjust the 577 variables. The eigenvalue screeplot of the noise-adjusted dataset is shown in Figure 5-12. Compared to the corresponding screeplot of Figure 5-9, the screeplot after noise adjustment has a well-defined noise bed and high SNR values for the initial principal components.

Figure 5-12: Eigenvalue screeplot of the blind-adjusted 577-variable B-COM dataset.

Figure 5-13 shows the first eight eigenvectors obtained through the BAPC transform. The difference between Figure 5-10 and Figure 5-13 is obvious: the BAPC transform exhibits significantly better-defined subgroups than the PC transform through their respective eigenvectors.

Figure 5-13: First eight eigenvectors of the ION-normalized 577-variable B-COM dataset.

Table 5.2 presents the GSM values of the eigenvectors in Figure 5-13. These values are much larger than the GSM values of Table 5.1, indicating that the eight eigenvectors in Figure 5-13 are much more effective in separating variables into subgroups than the eigenvectors in Figure 5-10.

        PC1      PC2      PC3      PC4      PC5      PC6     PC7      PC8
GSM     3372.7   665.86   615.62   735.27   110.21   30.44   132.60   2544.77

Table 5.2: GSM values for the eigenvectors plotted in Figure 5-13.
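The blind adjustment step of the BAPC transform, dividing each standardized variable by its ION-estimated noise standard deviation before the PC transform, can be sketched as follows (Python/NumPy). Here `ion_noise_std` is a placeholder for the output of the ION algorithm, and `eig_screeplot` refers to the helper sketched in Section 5.3.2.

```python
import numpy as np

def blind_adjust(X, noise_std):
    """Noise-adjust each variable by its estimated noise standard deviation.

    Variables with small estimated noise are boosted, so they dominate the
    subsequent PC transform (the BAPC transform described in the text).
    """
    return X / noise_std

# Hypothetical usage on the standardized 577-variable dataset Xs:
# Xa = blind_adjust(Xs, ion_noise_std)          # ion_noise_std from the ION algorithm
# eigval_bapc, eigvec_bapc = eig_screeplot(Xa)  # Figure 5-12 screeplot, Figure 5-13 loadings
```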
5.3.5 Verification of Variable Subgroups by Physical Interpretation

In order to verify the accuracy of the variable subgroups obtained from the BAPC transform, we would like to examine the physical meanings of the variables which contribute significantly to the individual principal components and are therefore classified as subgroups. If the variables classified as one subgroup fall into one of the following three categories, they can be thought to confirm the validity of the subgroups obtained by the BAPC transform. The three categories are:

1. The variables classified as one subgroup measure one physical quantity repetitively. For example, if all variables in one subgroup measure temperature at the same position in the manufacturing process, those variables may be regarded as a valid subgroup.

2. The variables classified as one subgroup measure physical quantities which are known to be correlated. Determining whether variables fall into this category requires a priori engineering knowledge of the process.

3. The variables classified as one subgroup measure physical quantities which move very closely together. This does not require a priori engineering knowledge. In fact, variables which fall into this category may reveal previously unknown knowledge of the underlying manufacturing process.

As a first example, let us study the first eigenvector in Figure 5-13. Variables whose corresponding eigenvector elements have absolute values larger than 0.3 are classified as a subgroup. The list of these variables and their physical meanings is given in Table 5.3. The first two variables in the list are related to air pressure, and the next three are process speeds. We learned from a process engineer at B-COM that speed and air pressure are closely related, and the list here agrees with that knowledge.

Variable Label    Physical Quantity Measured
P4:CUCSPT.MN      Couch vacuum set point
P4:PIC202.SP      Headbox pressure set point
P4:SI450          Wire drive speed
P4:SI451          Top wire drive speed
P4:SI453          Breast roll speed

Table 5.3: Variables classified as a subgroup by the first eigenvector in Figure 5-13.

In Figure 5-14, the time plots of the five variables are presented. Although the levels of the first two variables differ from those of the last three, the plots are almost identical in shape.

Figure 5-14: Time plots of the five variables classified as a subgroup by the first eigenvector.

Let us repeat the same exercise for the second eigenvector. This time, variables whose corresponding eigenvector elements have absolute values larger than 0.15 are classified as a subgroup. Table 5.4 lists those variables. The first, second, and fourth variables measure electric current provided by the power plant. The third variable is listed as a spare, and we do not know its physical nature; however, we may conclude from this result that it is strongly related to the electric current generated and provided by the power plant.

Variable Label    Physical Quantity Measured
P4:OUTAMP.MN      Top wire amps
P4:PRSAMP.MN      Press amps
P4:SPARE2         Spare
P4:TOPAMP.MN      Top press amps

Table 5.4: Variables classified as a subgroup by the second eigenvector in Figure 5-13.

Figure 5-15 illustrates the time plots of the four variables chosen by the second eigenvector. Again, they have almost identical shapes.

Figure 5-15: Time plots of the four variables classified as a subgroup by the second eigenvector.
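The subgroup extraction used in both examples, keeping the variables whose loading magnitudes exceed a chosen threshold, amounts to the following sketch (Python/NumPy; the label handling is hypothetical).

```python
import numpy as np

def subgroup(eigvec_col, labels, threshold):
    """Variables whose loading magnitude in one eigenvector exceeds a threshold."""
    idx = np.flatnonzero(np.abs(eigvec_col) > threshold)
    return [labels[i] for i in idx]

# Hypothetical usage with the BAPC eigenvectors and the B-COM variable labels:
# group1 = subgroup(eigvec_bapc[:, 0], labels, 0.3)     # Table 5.3
# group2 = subgroup(eigvec_bapc[:, 1], labels, 0.15)    # Table 5.4
```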
Considering that the traditional PC transform did not identify even one meaningful subgroup through its eigenvectors, the results so far confirm our speculation that the BAPC transform is more effective in identifying subgroups of variables. Eigenvectors of the BAPC transform have higher GSM values, and the physical interpretations of the grouped variables are consistent.

As a last part of our analysis of the effectiveness of the BAPC transform, we present in Figure 5-16 every tenth eigenvector up to the seventieth. It shows the trend that the earlier eigenvectors define a subgroup better than the later ones. The GSM values in Table 5.5 confirm this observation.

        PC1       PC10     PC20    PC30    PC40    PC50    PC60    PC70
GSM     3372.70   108.89   44.79   34.38   19.65   20.40   11.11   11.66

Table 5.5: GSM values for the eigenvectors in Figure 5-16.

Figure 5-16: Every tenth eigenvector of the ION-normalized 577-variable B-COM dataset.

5.3.6 Quality Prediction

Among the 577 variables of the B-COM dataset, several are designated as quality variables. These variables are typically measured offline. If the product quality represented by these variables does not meet pre-determined criteria, the failed product cannot be sold and has to be recycled. Quality failure was a significant factor in the overall cost to B-COM. Because the quality test could be carried out only after the final product is produced, the company was tremendously interested in finding ways to predict final product quality from other variables which are measured in real time.

The purpose of this section is to apply the BAPCR and least-squares linear regression to the B-COM dataset and compare the results.
We are especially interested in confirming that the BAPCR performs better than least-squares linear regression for this real dataset. Depending on the result, it can reinforce the claim made in the previous chapters that the BAPCR should be used rather than traditional least-squares linear regression when the predictors are noisy.

Among the 577 variables, twenty-eight are identified as offline quality variables and separated from the other 549 variables, which are real-time variables. Among the twenty-eight quality variables, we choose one, which we call CMT, as the quality variable of interest. We will try to predict CMT using the 549 real-time variables. The 7847 observations are divided into two parts: the first 4000 observations are used to train the regression equation, and the remaining observations serve as the validation dataset. Since there are only 549 predictor variables, the 4000 observations in the training set should be more than enough.

First, traditional least-squares linear regression is used to train the linear equation between CMT and the 549 variables. Once the training is carried out, the validity of the obtained equation is examined using the validation set. In order to illustrate how good (or bad) the prediction is, we present the scatter plot of the measured CMT and the predicted CMT in Figure 5-17. The horizontal axis represents the CMT values in the validation dataset, and the vertical axis represents the predicted CMT values, where the prediction equation is obtained by traditional linear regression. A perfect regression would produce a straight line of unit slope. The nearly ball-shaped scatter plot implies that prediction by traditional linear regression is not working. The root-mean-square (rms) error is close to 2.75. This is in fact larger than the standard deviation of the CMT variable, which is close to 2.01, which means that the linear regression is a worse predictor than a constant, namely the average value of CMT.

Figure 5-17: Scatter plot of true and predicted CMT values. Linear least-squares regression is used to determine the prediction equation.

In our next prediction attempt, we use the BAPCR to train the prediction equation. Figure 5-18 is the resulting scatter plot. Although it is still very noisy, this plot shows a stronger linear relation between the measured and predicted CMT values than the scatter plot obtained by traditional linear regression. The rms error in this case is 1.78.

Figure 5-18: Scatter plot of true and predicted CMT values. The BAPCR is used to determine the prediction equation.

This example confirms that the BAPCR performs better than traditional least-squares linear regression when the predictors are corrupted by measurement noise. One may be surprised by the poor prediction result even for the BAPCR in this example. However, it turns out that a large portion of the variance in the CMT variable is simply measurement noise; in fact, one analysis shows that up to 70 percent of the CMT variance is attributable to measurement noise. That is why even the BAPCR could not predict the CMT values with more accuracy.
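The train/validate comparison described above can be sketched as follows (Python/NumPy). This is our own illustrative code, not the thesis's: ordinary least squares is written out explicitly, while `bapcr_fit` is only a placeholder for the BAPCR procedure of Chapter 4 (blind adjustment by the ION noise estimates followed by principal component regression).

```python
import numpy as np

def lstsq_fit(X_train, y_train):
    """Ordinary least-squares fit with an intercept column."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return coef

def predict(coef, X):
    """Apply a fitted linear equation (intercept plus weights) to new data."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

def rms_error(y_true, y_pred):
    """Root-mean-square prediction error on a validation set."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Hypothetical usage: X is the 7847 x 549 matrix of real-time variables, y the CMT values.
# X_tr, y_tr = X[:4000], y[:4000]                    # training set
# X_va, y_va = X[4000:], y[4000:]                    # validation set
# coef_ls = lstsq_fit(X_tr, y_tr)
# print(rms_error(y_va, predict(coef_ls, X_va)))     # about 2.75 in the thesis
# coef_bapcr = bapcr_fit(X_tr, y_tr)                 # placeholder for the BAPCR of Chapter 4
# print(rms_error(y_va, predict(coef_bapcr, X_va)))  # about 1.78 in the thesis
```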
Chapter 6

Conclusion

6.1 Summary of the Thesis

The problem of analyzing noisy multivariate data with minimal a priori information was addressed in this thesis. The primary objective was to develop efficient algorithms for maximum-likelihood estimation of noise variances. The problem falls into the category of signal enhancement of noisy observations, which has been widely investigated in recent years, but that research has been limited to cases where the degrees of freedom of the signal components in the noisy dataset are known. Similarly, the problem of estimating the degrees of freedom of the signal component has been studied by many researchers in the field of signal processing, but their application has been restricted to situations where the noise variances are either known a priori or uniform across variables. For many practical multivariate datasets, including ours in the areas of manufacturing and remote sensing, neither the noise variances nor the degrees of freedom of the signals are available, which makes it difficult to apply this previous research.

Our approach in this work is to separate the problem of joint estimation of degrees of freedom and noise variances into two individual estimation problems, namely estimation of the degrees of freedom and estimation of the noise variances. We develop an algorithm for each estimation problem as if the other parameter were known, and then cascade the two algorithms such that the degrees of freedom are first estimated and the result is applied to the estimation of the noise variances. The resulting noise variance estimates are then used for normalization of the noisy variables, which should produce an improved estimate of the degrees of freedom. These steps are repeated until the estimates converge. The algorithm for estimation of the noise variances is derived by modeling the signal and noise vectors as independent Gaussian vectors.

The potential applications of the algorithm are very wide; they are all related to noise estimation and reduction. Three applications investigated in this thesis are linear regression, noise filtering, and the principal component transform. Simulations show that the ION algorithm should improve the performance of these applications significantly.

6.2 Contributions

The main contributions of this thesis can be summarized as follows:

1. The thesis derived a maximum-likelihood (ML) estimate of the noise variances for noisy Gaussian multivariate data. We considered a situation in which unknown noise variances must be retrieved from a sample dataset in order for the intended analysis tools to work satisfactorily. We approached the problem from an information-theoretic viewpoint by deriving an iterative EM-based algorithm for noise variance estimation. As long as the degrees of freedom are known or accurately estimated, the estimated noise variances are the ML estimate.
2. We considered a more realistic scenario in which the degrees of freedom are unknown in addition to the noise variances. We proposed a scheme which separates the joint order and noise estimation problem into two individual estimation problems and alternates between the two estimation processes so that each estimate augments the other. Although we suggest an ad hoc but robust order estimation method based on the eigenvalue screeplot, we believe that other, more analytic methods could replace it thanks to the modular nature of the ION algorithm. The ION algorithm is readily applicable to various fields in multivariate data analysis, and it is easily implemented as a program.

3. We identified the application areas of the ION algorithm and provided a wide range of examples in order to visualize the benefits of the proposed algorithm over traditional multivariate analysis tools. The three applications considered in this thesis are linear regression, principal component transform, and noise filtering:

* For linear regression, we illustrated that depending on m (the number of observations) and n (the number of variables) of the training dataset, the performance, as measured by the quantity defined in (4.12), can be as much as 4 times better for the BAPCR than for PCR or ordinary linear regression.
* For the principal component transform and the subgrouping of variables, by applying the ION algorithm before the NAPC transform we had tremendous success in identifying subgroups of variables in a large multivariate dataset derived from a paper manufacturing plant. The regular principal component transform could not detect any of those subgroups.
* For noise filtering, our experiments with remote sensing data illustrated that the performance of the ION filter approaches that of the Wiener filter, which is the theoretical limit for a linear filter.

6.3 Suggestions for Further Research

In our development of the EM algorithm for estimation of noise variances, we used an independent Gaussian vector model for the signal and noise vectors because of its analytic tractability. While assuming the noise vector is Gaussian can be justified without much difficulty, the signal vector does not always have to be Gaussian. We did not attempt to extend the EM algorithm to non-Gaussian signal vectors. There are two further research topics regarding this issue:

* Apply the EM algorithm developed for Gaussian signal vectors to non-Gaussian signal vectors, such as those with Laplace, log-normal, or Rayleigh distributions.
* Extend the EM algorithm to other non-Gaussian signal vectors.

Another research opportunity involves the choice of the estimation method for the degrees of freedom. As we stated, there are many established methods for this problem [25, 26, 27, 32]. Our proposed method based on the eigenvalue screeplot is admittedly ad hoc, but it still makes sense because it is robust against non-uniform noise variances across variables. It will be interesting to see how well other algorithms could be substituted for the screeplot method.

A convergence study is also left to future research. Although we have not encountered any example for which the ION algorithm did not converge after just a few iterations, one cannot be assured that the ION algorithm converges without exception. If the algorithm does converge without exception, then future work should prove so. If the algorithm does not always converge, then it remains to be understood in what circumstances it does and does not.
As a related problem, if the ION algorithm converges without exception, it would be interesting to see whether it is asymptotically equivalent to the Wiener filter. All our examples indicate that the two have almost identical performance.

Yet another research area left out of this thesis is to exploit any time structure of the variables to characterize multivariate datasets. If a dataset has some time structure, investigating it could reveal potentially important characterizations of the dataset which may not be understood otherwise. For this approach, power spectral analysis [41], autoregressive models, and the Kalman filter [42, 43] could be a few starting points.

Bibliography

[1] James B. Lee, Stephen Woodyatt, and Mark Berman. Enhancement of high spectral resolution remote-sensing data by a noise-adjusted principal component transform. IEEE Transactions on Geoscience and Remote Sensing, 28(3):295-304, May 1990.

[2] W. J. Krzanowski and F. H. C. Marriott. Multivariate Analysis, Parts 1 and 2. Arnold, London, UK, 1995.

[3] Subhash Sharma. Applied Multivariate Techniques. John Wiley & Sons, 1996.

[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, December 1976.

[5] E. Weinstein, A. V. Oppenheim, M. Feder, and J. R. Buck. Iterative and sequential algorithms for multi-sensor signal enhancement. IEEE Transactions on Signal Processing, 42(4):846-859, April 1994.

[6] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, 3rd edition, 1991.

[7] T. W. Anderson, S. Das Gupta, and G. P. H. Styan. A Bibliography of Multivariate Statistical Analysis. Halsted Press, New York, 1972.

[8] Jae S. Lim. Two-Dimensional Signal and Image Processing. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1990.

[9] J. D. Jobson. Applied Multivariate Data Analysis. Springer-Verlag, New York, NY, 1991.

[10] Paul Newbold. Statistics for Business and Economics. Prentice-Hall, Englewood Cliffs, NJ, 4th edition, 1995.

[11] George Arfken. Mathematical Methods for Physicists. Academic Press, Orlando, FL, 3rd edition, 1985.

[12] Gilbert Strang. Linear Algebra and Its Applications. Harcourt Brace Jovanovich College Publishers, 3rd edition, 1988.

[13] David A. Belsley, Edwin Kuh, and Roy E. Welsch. Regression Diagnostics. John Wiley & Sons, 1980.

[14] Gene H. Golub and Charles F. Van Loan. An analysis of the total least squares problem. SIAM Journal of Numerical Analysis, 17:883-893, 1980.

[15] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD, 3rd edition, 1996.

[16] S. D. Hodges and P. G. Moore. Data uncertainties and least squares regression. Applied Statistics, 21:185-195, 1972.

[17] W. J. Krzanowski. Ranking principal components to reflect group structure. Journal of Chemometrics, 6:97-102, 1992.

[18] H. O. Wold. Partial least squares. Encyclopedia of Statistical Sciences, 6:581-591, 1985.

[19] A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 8:27-51, 1970.

[20] I. E. Frank and J. H. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109-148, 1993.

[21] Gregory C. Reinsel and Raja P. Velu. Multivariate Reduced-Rank Regression. Springer, 1998.

[22] David L. Donoho. De-noising by soft-thresholding. IEEE Transactions on Information Theory, 41:613-627, May 1995.
[23] Jae S. Lim. Image restoration by short space spectral subtraction. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28:191-197, April 1980.

[24] E. Weinstein, A. V. Oppenheim, and M. Feder. Signal enhancement using single and multi-sensor measurements. RLE Technical Report 560, MIT, November 1990.

[25] H. Akaike. Information theory and an extension of the maximum likelihood principle. Proceedings of the Second International Symposium on Information Theory, Supplement to Problems of Control and Information Theory, pages 267-281, 1973.

[26] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19:716-723, 1974.

[27] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461-464, 1978.

[28] Mati Wax and Thomas Kailath. Detection of signals by information theoretic criteria. IEEE Transactions on Acoustics, Speech, and Signal Processing, 33(2):387-392, April 1985.

[29] Robert G. Gallager. Information Theory and Reliable Communication. John Wiley & Sons, 1968.

[30] Carlos Cabrera-Mercader. Robust compression of multispectral remote sensing data. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, 1999.

[31] L. C. Zhao, P. R. Krishnaiah, and Z. D. Bai. On detection of the number of signals in presence of white noise. Journal of Multivariate Analysis, 20(1):1-25, October 1986.

[32] J. Rissanen. Modeling by shortest data description. Automatica, 14:465-471, 1978.

[33] J. L. Horn. A rationale and test for the number of factors in factor analysis. Psychometrika, 30:179-185, 1965.

[34] S. J. Allen and R. Hubbard. Regression equations for the latent roots of random data correlation matrices with unities on the diagonal. Multivariate Behavioral Research, 21:393-398, 1986.

[35] Charles W. Therrien. Discrete Random Signals and Statistical Signal Processing. Prentice Hall, Englewood Cliffs, NJ, 1992.

[36] Sanford Weisberg. Applied Linear Regression. John Wiley & Sons, 2nd edition, 1985.

[37] J. Berkson. Are there two regressions? Journal of the American Statistical Association, 45:164-180, 1950.

[38] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. John Wiley & Sons, Inc., 2nd edition, 1984.

[39] Vicente Zarzoso and Asoke K. Nandi. Blind separation of independent sources for virtually any source probability density function. IEEE Transactions on Signal Processing, 47(9):2419-2432, September 1999.

[40] Fawwaz T. Ulaby, Richard K. Moore, and Adrian K. Fung. Microwave Remote Sensing, volume 1. Artech House, Norwood, MA, 1981.

[41] Gwilym M. Jenkins and Donald G. Watts. Spectral Analysis and Its Applications. Holden-Day, Oakland, CA, 1968.

[42] Gilbert Strang. Introduction to Applied Mathematics. Wellesley-Cambridge Press, 1986.

[43] Carl W. Helstrom. Probability and Stochastic Processes for Engineers. Macmillan Publishing Company, New York, NY, 2nd edition, 1991.