Blind Noise Estimation and Compensation for Improved
Characterization of Multivariate Processes
by
Junehee Lee
B.S., Seoul National University (1993)
S.M., Massachusetts Institute of Technology (1995)
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2000
© Massachusetts Institute of Technology 2000. All rights reserved.
Author ........................................................
Department of Electrical Engineering and Computer Science
March 7, 2000
Certified by ........................................................
David H. Staelin
Professor of Electrical Engineering
Thesis Supervisor
Accepted by ........................................................
Arthur C. Smith
Chairman, Departmental Committee on Graduate Students
Blind Noise Estimation and Compensation for Improved
Characterization of Multivariate Processes
by
Junehee Lee
Submitted to the Department of Electrical Engineering and Computer Science
on March 7, 2000, in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
Abstract
This thesis develops iterative order and noise estimation algorithms for noisy multivariate
data encountered in a wide range of applications. Historically, many algorithms have been
proposed for estimation of signal order for uniform noise variances, and some studies have
recently been published on noise estimation for known signal order and limited data size.
However, those algorithms are not applicable when both the signal order and the noise
variances are unknown, which is often the case for practical multivariate datasets.
The algorithm developed in this thesis generates robust estimates of signal order in the
face of unknown non-uniform noise variances, and consequently produces reliable estimates
of the noise variances, typically in fewer than ten iterations. The signal order is estimated,
for example, by searching for a significant deviation from the noise bed in an eigenvalue
screeplot. This retrieved signal order is then utilized in determining noise variances, for
example, through the Expectation-Maximization (EM) algorithm. The EM algorithm is
developed for jointly-Gaussian signals and noise, but the algorithm is tested on both Gaussian and non-Gaussian signals. Although it is not optimal for non-Gaussian signals, the
developed EM algorithm is sufficiently robust to be applicable. This algorithm is referred
to as the ION algorithm, which stands for Iterative Order and Noise estimation. The ION
algorithm also returns estimates of noise sequences.
The ION algorithm is utilized to improve three important applications in multivariate
data analysis: 1) distortion experienced by least squares linear regression due to noisy
predictor variables can be reduced by as much as five times by the ION algorithm, 2) the
ION filter which subtracts the noise sequences retrieved by the ION algorithm from noisy
variables increases SNR almost as much as the Wiener filter, the optimal linear filter, which
requires noise variances a priori, and 3) the principal component (PC) transform preceded by the
ION algorithm, designated as the Blind-Adjusted Principal Component (BAPC) transform,
shows significant improvement over simple PC transforms in identifying similarly varying
subgroups of variables.
Thesis Supervisor: David H. Staelin
Title: Professor of Electrical Engineering
Acknowledgement
I have been lucky to have Prof. Staelin as my thesis supervisor. When I started the long
journey to the Ph.D., I had no idea what I was going to do nor how I was going to do it.
Without his academic insights and ever-encouraging comments, I am certain that I would
still be far away from the finish line. Thanks to Prof. Welsch and Prof. Boning, the quality of this thesis is enhanced in no small measure. They patiently went through the thesis and
pointed out errors and unclear explanations. I would like to thank those at the Remote Sensing Group, Phil Rosenkranz, Bill Blackwell, Felicia Brady, Scott Bresseler, Carlos Cabrera, Fred Chen, Jay Hancock, Vince Leslie, Michael Schwartz and Herbert Viggh for such an enjoyable working environment.
I wish to say thanks to my parents and family for their unconditional love, support and
encouragement. I dedicate this thesis to Rose, who has understood and supported me during all this time, and to our baby, who has already brought lots of happiness to our lives.
The Leaders for Manufacturing program at MIT, the POSCO Scholarship Society, and the Korea Foundation for Advanced Studies supported this research. Thank you.
Contents

1  Introduction
   1.1  Motivation and General Approach of the Thesis
   1.2  Thesis Outline

2  Background Study on Multivariate Data Analysis
   2.1  Introduction
   2.2  Definitions
        2.2.1  Multivariate Data
        2.2.2  Multivariate Data Analysis
        2.2.3  Time-Structure of a Dataset
        2.2.4  Data Standardization
   2.3  Traditional Multivariate Analysis Tools
        2.3.1  Data Characterization Tools
               Principal Component Transform
        2.3.2  Data Prediction Tools
               Least Squares Regression
               Total Least Squares Regression
               Noise-Compensating Linear Regression
               Principal Component Regression
               Partial Least Squares Regression
               Ridge Regression
               Rank-Reduced Least-Squares Linear Regression
        2.3.3  Noise Estimation Tools
               Noise Estimation through Spectral Analysis
               Noise-Compensating Linear Regression with estimated C_ee

3  Blind Noise Estimation
   3.1  Introduction
   3.2  Data Model
        3.2.1  Signal Model
        3.2.2  Noise Model
   3.3  Motivation for Blind Noise Estimation
   3.4  Signal Order Estimation through Screeplot
        3.4.1  Description of Two Example Datasets
        3.4.2  Qualitative Description of the Proposed Method for Estimating the Number of Source Signals
        3.4.3  Upper and Lower Bounds of Eigenvalues of C_XX
        3.4.4  Quantitative Decision Rule for Estimating the Number of Source Signals
               Evaluation of the Noise Baseline
               Determination of the Transition Point
               Test of the Algorithm
   3.5  Noise Estimation by Expectation-Maximization (EM) Algorithm
        3.5.1  Problem Description
        3.5.2  Expectation Step
        3.5.3  Maximization Step
        3.5.4  Interpretation and Test of the Algorithm
   3.6  An Iterative Algorithm for Blind Estimation of p and G
        3.6.1  Description of the Iterative Algorithm of Sequential Estimation of Source Signal Number and Noise Variances
        3.6.2  Test of the ION Algorithm with Simulated Data

4  Applications of Blind Noise Estimation
   4.1  Introduction
   4.2  Blind Adjusted Principal Component Transform and Regression
        4.2.1  Analysis of Mean Square Error of Linear Predictor Based on Noisy Predictors
        4.2.2  Increase in MSE due to Finite Training Dataset
        4.2.3  Description of Blind-Adjusted Principal Component Regression
        4.2.4  Evaluation of Performance of BAPCR
               γ as a Function of Sample Size of Training Set
               γ as a Function of Number of Source Signals
               γ as a Function of Noise Distribution
   4.3  ION Noise Filter: Signal Restoration by the ION algorithm
        4.3.1  Evaluation of the ION Filter
        4.3.2  ION Filter vs. Wiener filter
   4.4  Blind Principal Component Transform

5  Evaluations of Blind Noise Estimation and its Applications on Real Datasets
   5.1  Introduction
   5.2  Remote Sensing Data
        5.2.1  Background of the AIRS Data
        5.2.2  Details of the Tests
        5.2.3  Test Results
   5.3  Packaging Paper Manufacturing Data
        5.3.1  Introduction
        5.3.2  Separation of Variables into Subgroups using PC Transform
        5.3.3  Quantitative Metric for Subgroup Separation
        5.3.4  Separation of Variables into Subgroups using BAPC transform
        5.3.5  Verification of Variable Subgroups by Physical Interpretation
        5.3.6  Quality prediction

6  Conclusion
   6.1  Summary of the Thesis
   6.2  Contributions
   6.3  Suggestions for Further Research
List of Figures

2-1   Graph of the climatic data of Table 2.1
2-2   Scatter plots of variables chosen from Table 2.1. (a) average high temperature vs. average low temperature, (b) average high temperature vs. average precipitation.
2-3   Graph of the climatic data of Table 2.2
2-4   Statistical errors minimized in, (a) least squares regression and (b) total least squares regression.
2-5   Noise compensating linear regression
2-6   A simple illustration of estimation of noise variance for a slowly changing variable
2-7   A typical X(t) and its power spectrum density.
2-8   Typical time plots of X_1, ..., X_5 and Y.
2-9   Zero-Forcing of eigenvalues
3-1   Model of the Noisy Data
3-2   Signal model as instantaneous mixture of p source variables
3-3   Illustration of changes in noise eigenvalues for different sample sizes (Table 3.1).
3-4   Two screeplots of datasets specified by Table 3.1
3-5   A simple illustration of lower and upper bounds of eigenvalues of C_XX.
3-6   Repetition of Figure 3-4 with the straight lines which best fit the noise-dominated eigenvalues.
3-7   Histogram of the estimated number of source signals by the proposed method. Datasets of Table 3.1 are used in simulation.
3-8   Flow chart of the EM algorithm for blind noise estimation.
3-9   Estimated noise variances for two simulated datasets in Table 3.1 using the EM algorithm
3-10  A few estimated noise sequences for the first dataset in Table 3.1 using the EM algorithm
3-11  A few estimated noise sequences for the second dataset in Table 3.1 using the EM algorithm
3-12  Flow chart of the iterative sequential estimation of p and G
3-13  The result of the first three iterations of the ION algorithm applied to the second dataset in Table 3.1
4-1   Model for the predictors X_1, ..., X_n and the response variable Y.
4-2   The effect of the size of the training set in linear prediction
4-3   Schematic diagram of blind-adjusted principal component regression.
4-4   Simulation results for examples of Table 4.1 using linear regression, PCR, BAPCR, and NAPCR as a function of m. p = 15.
4-5   Simulation results for examples of Table 4.2 using linear regression, PCR, BAPCR, and NAPCR as a function of p.
4-6   Simulation results for examples of Table 4.3 using linear regression, PCR, BAPCR, and NAPCR as a function of noise distribution. The horizontal axis of each graph represents the exponent of the diagonal elements of G.
4-7   Increases in SNR achieved by the ION filtering for examples of Table 4.1 as a function of m.
4-8   Increases in SNR achieved by the ION filtering for examples of Table 4.2 as a function of p.
4-9   Performance comparison of the Wiener filter, the ION filter, and the PC filter using examples of Table 4.1.
4-10  Schematic diagram of the three different principal component transforms.
4-11  Percentage reduction of MSE achieved by the BPC transform over the noisy transform using Example (b) of Table 4.1.
4-12  Percentage reduction of MSE achieved by the BPC transform over the noisy transform using Example (d) of Table 4.1.
5-1   Format of an AIRS dataset
5-2   Signal and noise variances of simulated AIRS dataset.
5-3   Three competing filters and notations.
5-4   240-variable AIRS dataset. (a) Signal and noise variances. (b) Eigenvalue screeplot.
5-5   Plots of SNR of unfiltered, ION-filtered, PC-filtered, and Wiener-filtered datasets.
5-6   An example eigenvalue screeplot where signal variances are pushed to a higher principal component due to larger noise variances of other variables.
5-7   Differences in signal-to-noise ratio between pairs of Z_WIENER and Z_ION and of Z_ION and Z_PC. (a) Plot of SNR_Z_WIENER - SNR_Z_ION. (b) Plot of SNR_Z_ION - SNR_Z_PC.
5-8   Schematic diagram of the paper production line of company B-COM
5-9   An eigenvalue screeplot of the 577-variable paper machine C dataset of B-COM.
5-10  First eight eigenvectors of the 577-variable paper machine C dataset of B-COM.
5-11  Retrieved noise standard deviation of the 577-variable B-COM dataset.
5-12  Eigenvalue screeplot of the blind-adjusted 577-variable B-COM dataset.
5-13  First eight eigenvectors of the ION-normalized 577-variable B-COM dataset
5-14  Time plots of the five variables classified as a subgroup by the first eigenvector.
5-15  Time plots of the four variables classified as a subgroup by the second eigenvector.
5-16  Every tenth eigenvector of the ION-normalized 577-variable B-COM dataset.
5-17  Scatter plot of true and predicted CMT values. Linear least-squares regression is used to determine the prediction equation.
5-18  Scatter plot of true and predicted CMT values. BAPCR is used to determine the prediction equation.
List of Tables

2.1   Climatic data of capital cities of many US states in January (obtained from Yahoo internet site)
2.2   Climatic data of Boston, MA (obtained from Yahoo internet site)
2.3   Typical parameters of interest in multivariate data analysis
2.4   Results of noise estimation and compensation (NEC) linear regression, compared to the traditional linear regression and noise-compensating linear regression with known C_ee
2.5   Results of NEC linear regression when noise overcompensation occurs.
2.6   Improvement in NEC linear regression due to zero-forcing of eigenvalues.
3.1   Summary of parameters for two example datasets to be used throughout this chapter.
3.2   Step-by-step description of the ION algorithm
4.1   Important parameters of the simulations to evaluate BAPCR. Both Λ_1 and Λ_2 are n x n diagonal matrices. There are n - p zero diagonal elements in each matrix. The p non-zero diagonal elements are i^{-4} for Λ_1 and i^{-3} for Λ_2, i = 1, ..., p.
4.2   Important parameters of the simulations to evaluate BAPCR as a function of p. The values of this table are identical to those of Table 4.1 except for p and m.
4.3   Important parameters of the simulations to evaluate BAPCR as a function of G.
4.4   Mean square errors of the first five principal components obtained through the BPC and the traditional PC transforms using example (b) of Table 4.1.
4.5   Mean square errors of the first five principal components obtained through the BPC and the traditional PC transforms using example (d) of Table 4.1.
5.1   GSM values for the eigenvectors plotted in Figure 5-10.
5.2   GSM values for the eigenvectors plotted in Figure 5-13.
5.3   Variables classified as a subgroup by the first eigenvector in Figure 5-13.
5.4   Variables classified as a subgroup by the second eigenvector in Figure 5-13.
5.5   GSM values for the eigenvectors in Figure 5-16.
Chapter 1
Introduction
1.1
Motivation and General Approach of the Thesis
Multivariate data are collected and analyzed in a variety of fields such as macroeconomic
data study, manufacturing quality monitoring, weather forecasting using multispectral remote sensing data, medical and biological studies, and so on. For example, manufacturers
of commodities such as steel, paper, plastic film, and semiconductor wafers monitor and
store process variables with the hope of having a better understanding of their manufacturing processes so that they can control the processes more efficiently.
Such a dataset
of multiple variables is called multivariate data, and any attempt to summarize, modify,
group, or transform the dataset constitutes multivariate data analysis.
One growing tendency in multivariate data analysis is toward larger to-be-analyzed
datasets. More specifically, the number of recorded variables seems to grow exponentially
and the number of observations becomes large. In some cases, the increase in size is a direct
consequence of advances in data analysis tools and computer technology, thus enabling
analysis of larger datasets which would not have been possible otherwise. This is what we
will refer to as capability-driven increase. On the other hand, necessity-driven increase in
data size occurs when the underlying processes to be monitored grow more complicated and
there are more variables that need to be monitored as a result. For example, when the scale
and complexity of manufacturing processes become larger, the number of process variables
to be monitored and controlled increases rapidly.
In some cases, the number of variables
increases so fast that the slow-paced increase in knowledge of the physics of manufacturing
processes cannot provide enough a priori knowledge which traditional analysis methods
require. Among such frequently required a priori information are noise variances and signal
order. Exact definitions are provided in Chapter 3.
The obvious disadvantage of lack of a priori information is that traditional analysis
tools requiring the absent information cannot be used, and should be replaced with a suboptimal tool which functions without the a priori information. One example is the pair
of Noise-Adjusted Principal Component (NAPC) transform [1] and Principal Component
(PC) transform [2]. As we explain in detail in subsequent chapters, the NAPC transform is
in general superior to the PC transform in determining signal order and filtering noise out of
a noisy multivariate dataset. The NAPC transform, however, requires a priori knowledge of
noise variances which in many practical large-scale multivariate datasets are not available.
As a result, the PC transform, which does not require noise variances, replaces the NAPC
transform in practice, producing less accurate analysis results.
This thesis investigates possibilities of retrieving from sample datasets vital but absent
information. Our primary interest lies in retrieving noise variances when the number of signals is unknown. We will formulate the problem of joint estimation of signal order and noise
variances as two individual problems of parameter estimation. The two parameters, signal
order and noise variances, are to be estimated alternately and iteratively. In developing
the algorithm for estimating noise variances, we will use an information theoretic approach
by assigning probability density functions to signal and noise vectors. For the thesis, we
will consider only Gaussian signal and noise vectors, although the resulting methods are
tested using non-ideal real data.
The thesis also seeks to investigate a few applications of to-be-developed algorithms.
There may be many potentially important applications of the algorithm, but we will limit
our investigation to linear regression, principal component transform, and noise filtering.
The three applications in fact represent the traditional tools which either suffer vastly in
performance or fail to function as intended. We compare performances of these applications
without the unknown parameters and with retrieved parameters for many examples made
of both simulated and real datasets.
1.2
Thesis Outline
Chapter 2 reviews some basic concepts of multivariate data analysis. In Section 2.2 we define
terms that will be used throughout the thesis. Section 2.3 introduces traditional multivariate
analysis methods for data characterization, data prediction, and noise estimation. These
tools appear repeatedly in subsequent chapters. We especially focus on introducing many
traditional tools which try to incorporate the fact that noises in a dataset affect analysis
results, and therefore, cannot be ignored.
Chapter 3 considers the main problem of the thesis. The noisy data model, which is used
throughout the thesis, is introduced in Section 3.2. A noisy variable is the sum of noise and
a noiseless variable, which is a linear combination of an unknown number of independent
signals. Throughout the thesis, signal and noise are assumed to be uncorrelated. Section 3.3
explains why noise variances are important to know in multivariate data analysis using the
PC transform as an example.
In Section 3.4, we suggest a method to estimate the number of signals in noisy multivariate data. The method is based on the observation that noise eigenvalues form a flat
baseline in the eigenvalue screeplot [3] when noise variances are uniform over all variables.
Therefore, a clear transition point which distinguishes signal from noise can be determined
by examining the screeplot.
We extend this observation into the cases where noises are
not uniform over variables (see Figure 3-6).
In related work, we derive upper and lower
bounds for eigenvalues when noises are not uniform in Section 3.4.3. Theorem 3.1 states
that the increases in eigenvalues caused by a diagonal noise matrix are lower-bounded by
the smallest noise variance and upper-bounded by the largest noise variance.
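As an illustration of this screeplot idea (not the quantitative decision rule developed in Section 3.4.4), the following minimal sketch fits a noise baseline to the smallest eigenvalues of the sample covariance and reports the first eigenvalue that departs from it; the window and threshold parameters are illustrative choices.

```python
import numpy as np

def estimate_signal_order(X, window=10, threshold=3.0):
    """Illustrative screeplot-based estimate of the number of source signals.

    A straight line is fit to the log-eigenvalues of the presumed
    noise-dominated tail of the screeplot, and the signal order is taken as
    the index of the largest eigenvalue that deviates strongly from that
    noise baseline."""
    n = X.shape[1]
    C = np.cov(X, rowvar=False)                      # sample covariance C_XX
    eig = np.sort(np.linalg.eigvalsh(C))[::-1]       # eigenvalues, descending
    log_eig = np.log(eig + 1e-12)
    idx = np.arange(n)
    tail = slice(n - window, n)                      # noise bed: smallest eigenvalues
    slope, intercept = np.polyfit(idx[tail], log_eig[tail], 1)
    sigma = (log_eig[tail] - (slope * idx[tail] + intercept)).std() + 1e-12
    p_hat = 0
    for i in range(n - 1, -1, -1):                   # walk up from the noise bed
        if log_eig[i] - (slope * i + intercept) > threshold * sigma:
            p_hat = i + 1                            # first significant departure
            break
    return p_hat, eig
```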
In Section 3.5, we derive the noise-estimation algorithm based on the EM algorithm [4].
The EM algorithm refers to a computational strategy to compute maximum likelihood
estimates from incomplete data by iterating an expectation step and a maximization step
alternatively. The actual computational algorithms are derived for Gaussian signal and
noise vectors. The derivation is similar to the one in [5], but a virtually boundless number
of variables and an assumed lack of time-structure in our datasets make our derivation
different. It is important to understand that the EM algorithm, summarized in Figure 3-8,
takes the number of signals as an input parameter. A brief example is provided to illustrate
the effectiveness of the EM algorithm.
The EM algorithm alone cannot solve the problem that we are interested in. The number
of signals, which is unknown for our problem, has to be supplied to the EM algorithm as
an input.
Instead of retrieving the number of signals and noise variances in one joint
estimation, we take the approach of estimating the two unknowns separately.
We first
estimate the number of signals, and feed the estimated value to the EM algorithm. The
outcome of the EM algorithm, which may be somewhat degraded because of the potentially
incorrect estimation of the number of signals in the first step, is then used to normalize the
data to level the noise variances across variables. The estimation of the number of signals is
then repeated, but this time using the normalized data. Since the new, normalized data should have noises whose variances do not vary across variables as much as those of the previous
unnormalized data, the estimation of the signal order should be more accurate than the first
estimation. This improved estimate of the signal order is then fed to the EM algorithm,
which should produce better estimates of noise variances. The procedure repeats until the
estimated parameters do not change significantly. A simulation result, provided in Figure 3-13, illustrates the improvement in the estimates of the two unknowns as the iterations progress. We
designate the algorithm as ION, which stands for Iterative Order and Noise estimation.
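The sketch below restates this loop schematically. It reuses estimate_signal_order from the sketch above and substitutes a crude residual-variance estimate for the EM noise estimator of Section 3.5, so it illustrates only the alternating structure of ION, not its actual estimators.

```python
import numpy as np

def crude_noise_estimate(Xn, p):
    """Stand-in for the EM noise estimator: the per-variable standard deviation
    of what remains after removing the first p principal components."""
    C = np.cov(Xn, rowvar=False)
    w, V = np.linalg.eigh(C)
    Vp = V[:, np.argsort(w)[::-1][:p]]            # leading p eigenvectors
    resid = Xn - Xn @ Vp @ Vp.T                   # part unexplained by the p signals
    return np.sqrt(resid.var(axis=0))

def ion(X, max_iter=10, tol=1e-3):
    """Schematic Iterative Order and Noise (ION) estimation loop."""
    g = np.ones(X.shape[1])                       # running noise std-dev estimates
    p_hat = 0
    for _ in range(max_iter):
        Xn = X / g                                # normalize to level the noise floor
        p_hat, _ = estimate_signal_order(Xn)      # step 1: estimate the signal order
        g_new = crude_noise_estimate(Xn, p_hat) * g   # step 2: re-estimate noise
        if np.max(np.abs(g_new - g) / np.maximum(g, 1e-12)) < tol:
            g = g_new
            break                                 # estimates have stopped changing
        g = g_new
    return p_hat, g
```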
Chapter 4 is dedicated to the applications of the ION algorithm.
Three potentially
important application fields are investigated in the chapter. In Section 4.2, the problem
of improving linear regression is addressed for the case of noisy predictors, which does not
agree with basic assumptions of traditional linear regression. Least-squares linear regression
minimizes errors in the response variable, but it does not account for errors in predictors.
One of the many possible modifications to the problem is to eliminate the error in the
predictors, and principal component filtering has been used extensively for this purpose.
One problem with this approach is that the principal component transform does not separate
noise from signal when noise variances across variables are not uniform. We propose in
the section that the noise estimation by the ION algorithm, when combined with the PC
transform to simulate the NAPC transform, should excel in noise filtering, thus improving
linear regression. Extensive simulation results are provided in the section to support the
proposition.
Section 4.3 addresses the problem of noise filtering. The ION algorithm brings at least
three ways to carry out noise filtering. The first method is what we refer to as Blind-Adjusted Principal Component (BAPC) filtering. This is similar to the NAPC filtering,
but the noise variances for normalization are unknown initially and retrieved by the ION
algorithm.
If the ION algorithm generated perfect estimates of the noise variances, the BAPC
transform would produce the same result as the NAPC transform. The second method is
to apply the ION algorithm to the Wiener filter [6]. The Wiener filter is the optimal linear
filter in the least-squares sense. However, one must know the noise variances in order to
build the Wiener filter. We propose that when the noise variances are unknown a priori, the ION-estimated noise variances are sufficiently accurate to be used for building the Wiener
filter.
Finally, the ION algorithm itself can be used as a noise filter. It turns out that
among the many by-products of the ION algorithm are estimates of the noise sequences. If
these estimates are good, we may subtract these estimated noise sequences from the noisy
variables to retrieve noiseless variables. This is referred to as the ION filter. Since the
BAPC transform is extensively investigated in relation to linear regression in Section 4.2,
we focus on the Wiener filter and the ION filter in Section 4.3.
Simulations show that
both methods enhance SNR of noisy variables significantly more than other traditional
filtering methods such as PC filtering. In regard to the noise sequence estimation of the
ION algorithm, Section 4.4 is dedicated to the blind principal component transform, which
looks for the principal component transform of noiseless variables.
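For reference, under the additive-noise model used throughout the thesis (X = Z + e, with signal and noise uncorrelated), the Wiener filter takes the standard linear least-squares form Ẑ = C_ZZ C_XX^{-1} X. A minimal sketch follows; noise_var would be supplied either a priori or by the ION estimates.

```python
import numpy as np

def wiener_filter(X, noise_var):
    """Wiener (linear least-squares) filter for X = Z + e with uncorrelated
    signal and noise and a diagonal noise covariance given by noise_var."""
    Xc = X - X.mean(axis=0)                   # zero-mean the data
    Cxx = np.cov(Xc, rowvar=False)            # sample covariance C_XX
    Czz = Cxx - np.diag(noise_var)            # signal covariance C_ZZ = C_XX - C_ee
    W = Czz @ np.linalg.inv(Cxx)              # Wiener gain C_ZZ C_XX^{-1}
    return Xc @ W.T                           # Z_hat(j) = W X(j), applied row-wise
```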
In Chapter 5, we take on two real datasets, namely a manufacturing dataset and a remote
sensing dataset. For the remote sensing dataset, we compare performance of the ION filter
with the PC filter and the Wiener filter. The performances of the filters are quantified by
SNR of the post-filter datasets. Simulations indicate that the ION filter, which does not
require a priori information of noise variances, outperforms the PC filter significantly and
performs almost as well as the Wiener filter, which requires noise variances.
The manufacturing dataset is used to examine structures of elements of eigenvectors
of covariance matrices. Specifically, we are interested in identifying subgroups of variables
which are closely related to each other. We introduce a numerical metric which quantifies
how well an eigenvector reveals a subgroup of variables. Simulations show that the BAPC
transform is clearly better in identifying subgroups of variables than the PC transform.
Finally, Chapter 6 closes the thesis by providing a summary of the thesis in Section 6.1 and its contributions in Section 6.2. Section 6.3 is dedicated to further
research topics that stem from this thesis.
Chapter 2
Background Study on Multivariate
Data Analysis
2.1
Introduction
Multivariate data analysis has a wide range of applications, from finance to education
to engineering. Because of its numerous applications, multivariate data analysis has been
studied by many mathematicians, scientists, and engineers with specific fields of applications
in mind (see, for example, [7]), thus creating vastly different terminologies from one field to
another.
The main objective of the chapter is to provide unified terminology and basic background knowledge about multivariate data analysis so that the thesis is beneficial to readers from many different fields. First, we define some of the important terms used in the thesis. A partial review of traditional multivariate data analysis tools is presented in
Section 2.3. Since it is virtually impossible to review all multivariate analysis tools in the
thesis, only those which are of interest in the context of this thesis are discussed.
2.2
Definitions
The goal of this section is to define terms which we will use repeatedly throughout the
thesis. It is important for us to agree on exactly what we mean by these words because
some of these terms are defined in multiple ways depending on their fields of applications.
2.2.1
Multivariate Data
A multivariate dataset is defined as a set of observations of numerical values of multiple
variables. A two-dimensional array is typically used to record a multivariate dataset. A
single observation of all variables constitutes each row, and all observations of one single
variable are assembled into one column.
Close attention should be paid to the differences between a multivariate dataset and
a multi-dimensional dataset [8].
A multivariate dataset typically consists of sequential
observations of a number of individual variables. Although a multivariate dataset is usually
displayed as a two-dimensional array, its dimension is not two, but is defined roughly as the
number of variables. In contrast, a multi-dimensional dataset measures one quantity in a
multi-dimensional space, such as the brightness of an image as a function of location in the
picture. It is a two-dimensional dataset no matter how big the image is.
One example of multivariate data is given in Table 2.1. It is a table of historical climatic
data of capital cities of a number of US states in January. The first column lists the names of
the cities for which the climatic data are collected, and it serves as the observation index of
the data, as do the numbers on the left of the table. The multivariate data, which comprise
the numerical part of Table 2.1, consist of 40 observations of 5 variables.
A sample multivariate dataset can be denoted by a matrix, namely X. The number
of observations of the multivariate dataset is the number of rows of X, and the number of
variables is the number of columns. Therefore, an m x n matrix X represents a sample
multivariate dataset with m observations of n variables. For example, the sample dataset
given in Table 2.1 can be written as a 40 x 5 matrix. Each variable of X is denoted as Xi,
and Xi (j) is the jth observation of the ith variable. For example, X 3 is the variable 'Record
High' and X 3 (15) = 73 for Table 2.1. Furthermore, the vector X denotes the collection of
all variables, namely X = [X_1, ..., X_n]^T, and the jth observation of all variables is denoted as X(j), that is, X(j) = [X_1(j), ..., X_n(j)]^T.
2.2.2
Multivariate Data Analysis
A sample multivariate dataset is just a numeric matrix. Multivariate data analysis is any
attempt to describe and characterize a sample multivariate dataset, either graphically or
numerically. It can be as simple as computing some basic statistical properties of the dataset
      City                  Average    Average    Record     Record     Average
                            High (F)   Low (F)    High (F)   Low (F)    Precipitation (in)
  1   Montgomery, AL           56         36         83          0         4.68
  2   Juneau, AK               29         19         57        -22         4.54
  3   Phoenix, AZ              66         41         88         17         0.67
  4   Little Rock, AR          49         29         83         -4         3.91
  5   Sacramento, CA           53         38         70         23         3.73
  6   Denver, CO               43         16         73        -25         0.50
  7   Hartford, CT             33         16         65        -26         3.41
  8   Tallahassee, FL          63         38         82          6         4.77
  9   Atlanta, GA              50         32         79         -8         4.75
 10   Honolulu, HI             80         66         87         53         3.55
 11   Boise, ID                36         22         63        -17         1.45
 12   Springfield, IL          33         16         71        -21         1.51
 13   Indianapolis, IN         34         17         71        -22         2.32
 14   Des Moines, IA           28         11         65        -24         0.96
 15   Topeka, KS               37         16         73        -20         0.95
 16   Baton Rouge, LA          60         40         82          9         4.91
 17   Boston, MA               36         22         63        -12         3.59
 18   Lansing, MI              29         13         66        -29         1.49
 19   Jackson, MS              56         33         82          2         5.24
 20   Helena, MT               30         10         62        -42         0.63
 21   Lincoln, NE              32         10         73        -33         0.54
 22   Concord, NH              30          7         68        -33         2.51
 23   Albany, NY               30         11         62        -28         2.36
 24   Raleigh, NC              49         29         79         -9         3.48
 25   Bismarck, ND             20         -2         62        -44         0.45
 26   Columbus, OH             34         19         74        -19         2.18
 27   Oklahoma City, OK        47         25         80         -4         1.13
 28   Salem, OR                46         33         65        -10         5.92
 29   Harrisburg, PA           36         21         73         -9         2.84
 30   Providence, RI           37         19         66        -13         3.88
 31   Columbia, SC             55         32         84         -1         4.42
 32   Nashville, TN            46         27         78        -17         3.58
 33   Austin, TX               59         39         90         -2         1.71
 34   Salt Lake City, UT       36         19         62        -22         1.11
 35   Richmond, VA             46         26         80        -12         3.24
 36   Olympia, WA              44         32         63         -8         8.01
 37   Washington D.C.          42         27         79         -5         2.70
 38   Charleston, WV           41         23         79        -15         2.91
 39   Madison, WI              25          7         56        -37         1.07
 40   Cheyenne, WY             38         15         66        -29         0.40
Table 2.1: Climatic data of capital cities of many US states in January (obtained from
Yahoo internet site)
[Figure: five curves (record high, average high, average low, record low, average precipitation) plotted against state index; the temperature variables use the left-hand axis (F) and precipitation uses the right-hand axis (in).]
Figure 2-1: Graph of the climatic data of Table 2.1
such as sample mean and sample covariance matrix. For example, Dow Jones Industrial
Average (DJIA) is a consequence of a simple multivariate data analysis; multiple variables
(the stock prices of the 30 industrial companies) are represented by one number, a weighted
average of the variables.
The first step of multivariate data analysis typically is to look at the data and to
identify their main features. Simple plots of data often reveal such features as clustering,
relationships between variables, presence of outliers, and so on. Even though a picture
rarely captures all characterizations of an information-rich dataset, it generally highlights
aspects of interest and provides direction for further investigation.
Figure 2-1 is a graphical representation of Table 2.1.
The horizontal axis represents
the observation index obtained from the alphabetical order of the state name (Alabama,
Arkansas, ...).
The four temperature variables use the left-hand vertical axis, and the
average precipitation uses the right-hand vertical axis. It is clear from the graph that all
four temperature variables are closely related. In other words, they tend to move together.
The graph also reveals that the relation between the average precipitation and the other
four variables is not as strong as the relations within the four variables. Even this simple
graph can reveal interesting properties such as these which are not easy to see from the raw
dataset of Table 2.1. Figure 2-1 is called the time-plot of Table 2.1 although the physical
meaning of the horizontal axis may not be related to time.
Figure 2-2: Scatter plots of variables chosen from Table 2.1. (a) average high temperature vs. average low temperature, (b) average high temperature vs. average precipitation.

Another type of widely used graphical representation is the scatter plot. In a scatter plot, two variables of interest are chosen and plotted one against the other. When
a multivariate dataset contains n variables, n(n - 1)/2 different two-dimensional scatter
plots are possible. Two scatter plots of Table 2.1 are drawn in Figure 2-2. Figure 2-2(a)
illustrates again the fact that the average high temperature and the average low temperature
are closely related. From Figure 2-2(b), the relation between average high temperature and
average precipitation is not as obvious as for the two variables in Figure 2-2(a). The notion
of 'time' of a dataset disappears in a scatter plot.
Once graphical representations provide intuition and highlight interesting aspects of
a dataset, a number of numerical multivariate analysis tools can be applied for further
analysis. Section 2.3 introduces and discusses these traditional analysis tools.
2.2.3
Time-Structure of a Dataset
Existence of time-structure in a multivariate dataset is determined by inquiring if prediction
of the current observation can be helped by previous and subsequent observations. In simple
words, the dataset is said to have no time-structure if each observation is independent.
Knowing if a dataset X has a time-structure is important because it may shape an
analyst's strategy for analyzing the dataset. When a dataset X has no time-structure, the
X(j)'s are mutually independent random vectors, and the statistical analysis tools described
in Section 2.3 may be applied. When X has time-structure, traditional statistical analysis
tools may not be enough to bring out rich characteristics of the dataset. Instead, signal
      Month        Average    Average    Record     Record     Average
                   High (F)   Low (F)    High (F)   Low (F)    Precipitation (in)
  1   January         36         22         63        -12         3.59
  2   February        38         23         70         -4         3.62
  3   March           46         31         81          6         3.69
  4   April           56         40         94         16         3.60
  5   May             67         50         95         34         3.25
  6   June            76         59        100         45         3.09
  7   July            82         65        102         50         2.84
  8   August          80         64        102         47         3.24
  9   September       73         57        100         38         3.06
 10   October         63         47         90         28         3.30
 11   November        52         38         78         15         4.22
 12   December        40         27         73         -7         4.01
Table 2.2: Climatic data of Boston, MA (obtained from Yahoo internet site)
processing tools such as Fourier analysis or time-series analysis may be adopted to exploit
the time-structure to extract useful information. However, signal processing tools generally
do not enrich our understanding of datasets having no time-structure.
Figure 2-3: Graph of the climatic data of Table 2.2
Determining independence is not always easy without a priori knowledge unless the
time-structure at issue is simple enough to be identified using a time-plot of the sample data.
Such simple time-structures include periodicity and slow rates of change in the dataset,
which is also referred to as a slowly varying dataset.
A dataset with a time-structure typically loses its pattern if the order of observations is changed randomly. Table 2.2 is an example of a dataset with time-structure. The dataset is for the same climatic variables of Table 2.1 from January to December in Boston. A time-structure of slow increase followed by slow decrease in the temperature variables is observed. In contrast to this, no easily discernible time-structure emerges in Figure 2-1. (One should not say, however, that the dataset of Table 2.1 has no time-structure based only on Figure 2-1.)
2.2.4
Data Standardization
In many multivariate datasets, variables are measured in different units and are not comparable in scale. When one variable is measured in inches and another in feet, it is probably
a good idea to convert them to the same units. It is less obvious what to do when the
units are of different kinds, e.g., one in feet, another in pounds, and the third in minutes.
Since they can't be converted into the same unit, they usually are transformed into some
unitless quantities. This is called data standardization. The most frequently used method
of standardization is to subtract the mean from each variable and then divide it by its
standard deviation. Throughout the thesis, we will adopt this 'zero-mean unit-variance' as
our standardization method.
There are other benefits of data standardization.
Converting variables into unitless quantities effectively encrypts the data and offers companies that distribute data for research some protection in case of unintentional dissemination. Standardization also prevents variables with large variances from distorting analysis results.
Other methods of standardization are possible. When variables are corrupted by additive
noise with known variance, each variable may be divided by its noise standard deviation.
Sometimes datasets are standardized so that the range of each variable becomes [-1, 1].
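A minimal sketch of the zero-mean unit-variance standardization adopted here, applied column by column to an m x n data matrix:

```python
import numpy as np

def standardize(X):
    """Zero-mean, unit-variance standardization: subtract each variable's
    sample mean and divide by its sample standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)               # sample standard deviation
    return (X - mean) / std, mean, std        # keep mean/std to undo the transform
```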
2.3
Traditional Multivariate Analysis Tools
One of the main goals of multivariate data analysis is to describe and characterize a given
dataset in a simple and insightful form. One way to achieve this goal is to represent the
dataset with a small number of parameters. Typically, those parameters include location
parameters such as mean, median, or mode; dispersion parameters such as variance or range;
and relation parameters such as covariance and correlation coefficient (Table 2.3). Location
and dispersion parameters are univariate in nature. For these parameters, no difference exists between the analysis of multivariate and univariate data.

Category        Parameters
Location        Mean, Mode, Median
Dispersion      Variance, Standard deviation, Range
Relation        Covariance, Correlation coefficient

Table 2.3: Typical parameters of interest in multivariate data analysis
What makes multivariate data analysis challenging and different from univariate data analysis is the existence of the relation parameters.
Defined only for multivariate data, the relation parameters describe
implicitly and explicitly the structure of the data, generally to the second order. Therefore,
a multivariate data analysis typically involves parameters such as the covariance matrix.
The goal of this section is to introduce traditional multivariate analysis tools which have
been widely used in many fields of applications. This section covers only those topics that
are relevant to this thesis. Introducing all available tools is beyond the scope of this thesis.
Readers interested in tools not covered here are referred to books such as [2, 9, 10].
2.3.1
Data Characterization Tools
When a multivariate dataset contains a large number of variables with high correlations
among them, it is generally useful to exploit the correlations to reduce dimensions of the
dataset with little loss of information. Among the many benefits of reduced dimensionality
of a dataset are 1) easier human interpretation of the dataset, 2) less computation time
and data storage space in further analysis of the dataset, and 3) the orthogonality among
newly-formed variables.
Multivariate tools that achieve this goal are categorized as data
characterization tools.
Principal Component Transform
Let X ∈ R^n be a zero-mean random vector with a covariance matrix of C_XX. Consider a new variable P_1 = X^T v, where v ∈ R^n. The variance of P_1 is then v^T C_XX v. The first principal component of X is defined as this new variable P_1 when the variance v^T C_XX v is maximized subject to v^T v = 1. The vector v can be found by Lagrangian multipliers [11]:

    d/dv (v^T C_XX v) = d/dv (λ v^T v),    (2.1)

which results in

    C_XX v = λ v.    (2.2)

Thus the Lagrangian multiplier λ is an eigenvalue of the covariance matrix C_XX and v is the corresponding eigenvector. From (2.2),

    v^T C_XX v = λ    (2.3)

since v^T v = 1. The largest eigenvalue of C_XX is therefore the variance of P_1, and v is the corresponding eigenvector of C_XX.

It turns out that the lower principal components can also be found from the eigenvectors of C_XX. There are n eigenvalues λ_1, λ_2, ..., λ_n in descending order and n corresponding eigenvectors v_1, v_2, ..., v_n for the n x n covariance matrix C_XX. It can be shown that v_1, ..., v_n are orthogonal to each other for the symmetric matrix C_XX [12]. The variance of the ith principal component, X^T v_i, is λ_i. If we define Q ∈ R^{n x n} and Λ ∈ R^{n x n} respectively as

    Q = [v_1 | ... | v_n],    (2.4a)
    Λ = diag(λ_1, ..., λ_n),    (2.4b)

then the vector of the principal components, defined as P = [P_1, ..., P_n]^T, is

    P = Q^T X.
If λ_{p+1} = ... = λ_n = 0, the principal components P_{p+1}, ..., P_n have zero variance, indicating that they are constant. This means that even though there are n variables in X, there are only p degrees of freedom, and the first p principal components capture all variations in X. Therefore, the first p principal components, that is, p variables, contain all information of the n original variables, reducing variable dimensionality. Sometimes it may be beneficial to dismiss principal components with small but non-zero variances as having negligible information. This practice is often referred to as data conditioning [13].
Some of the important characteristics of the principal component transform are:

* Any two principal components are linearly independent: any two principal components are uncorrelated.
* The variance of the ith principal component P_i is λ_i, where λ_i is the ith largest eigenvalue of C_XX.
* The sum of the variances of all principal components is equal to the sum of the variances of all variables comprising X.
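As a numerical illustration of the transform defined by (2.4), a minimal sketch that estimates Q and Λ from the sample covariance of a data matrix and forms the principal components:

```python
import numpy as np

def pc_transform(X):
    """Principal component transform of an m x n data matrix X.

    Returns the principal components P (m x n), the eigenvector matrix Q,
    and the eigenvalues (principal-component variances) in descending order."""
    Xc = X - X.mean(axis=0)                   # zero-mean the variables
    Cxx = np.cov(Xc, rowvar=False)            # sample covariance C_XX
    eigval, eigvec = np.linalg.eigh(Cxx)      # eigh returns ascending order
    order = np.argsort(eigval)[::-1]
    lam, Q = eigval[order], eigvec[:, order]  # Λ and Q as in (2.4)
    P = Xc @ Q                                # P(j) = Q^T X(j), applied row-wise
    return P, Q, lam
```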
2.3.2
Data Prediction Tools
In this section, we focus on multivariate data analysis tools whose purpose is to predict
future values of one or more variables in the data. For example, linear regression expresses
the linear relation between the variables to be predicted, called response variables, and the
variables used to predict the response variables, called predictor variables. An overwhelmingly popular linear regression is the least squares linear regression.
In this section, we
explain it briefly, and discuss why it is not suitable for noisy multivariate datasets. Then
we discuss other regression methods which are designed for noisy multivariate datasets.
Least Squares Regression
Regression is used to study relationships between measurable variables. Linear regression
is best used for linear relationships between predictor variables and response variables. In
multivariate linear regression, several predictor variables are used to account for a single or
multiple response variables. For the sake of simplicity, we limit our explanation to the case
of a single response variable. Extension to multiple response variables can be done without difficulty. If Y denotes the response variable and Z = [Z_1, ..., Z_n]^T denotes the vector of n regressor variables, then the relation between them is assumed to be linear, that is,

    Y(j) = γ_0 + γ^T Z(j) + ε(j),    (2.5)

where γ_0 ∈ R and γ ∈ R^n are unknown parameters and ε is the statistical error term. Note that the only error term of the equation, ε, is additive to Y. In regression analysis, it is a primary goal to obtain estimates of the parameters γ_0 and γ. The most frequently adopted criterion for estimating these parameters is to minimize the residual sum of squares; this is known as the least squares estimate. In least squares linear regression, γ_0 and γ can be found from

    (γ̂_0, γ̂) = arg min_{γ_0, γ} Σ_{j=1}^{m} ( Y(j) - γ_0 - γ^T Z(j) )^2.    (2.6)

The solution to (2.6) is given by

    γ̂ = C_ZY C_ZZ^{-1},    (2.7a)
    γ̂_0 = (1/m) Σ_{j=1}^{m} Y(j) - γ̂^T (1/m) Σ_{j=1}^{m} Z(j),    (2.7b)

where C_ZY and C_ZZ are the covariance matrices of (Z, Y) and (Z, Z), respectively.
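As a small numerical counterpart to (2.6) and (2.7), a sketch that computes the least squares estimates from sample covariances:

```python
import numpy as np

def least_squares(Z, Y):
    """Least squares fit Y ≈ γ_0 + γ^T Z from m observations (Z: m x n, Y: m)."""
    Zc = Z - Z.mean(axis=0)
    Yc = Y - Y.mean()
    m = len(Y)
    Czz = Zc.T @ Zc / (m - 1)                 # sample covariance of Z
    Czy = Zc.T @ Yc / (m - 1)                 # sample cross-covariance with Y
    gamma = np.linalg.solve(Czz, Czy)         # solves C_ZZ γ = C_ZY, cf. (2.7a)
    gamma0 = Y.mean() - gamma @ Z.mean(axis=0)   # cf. (2.7b)
    return gamma0, gamma
```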
Total Least Squares Regression
In traditional linear regressions such as the least squares estimation, only the response
variable is assumed to be subject to noise; the predictor variables are assumed to be fixed
values and recorded without error. When the predictor variables are also subject to noise,
the usual least-squares criterion may not make much sense, since only errors in the response
variable are minimized. Consider the following problem:
    Y = γ_0 + γ^T Z + ε,    (2.8a)
    X = Z + e,    (2.8b)

where e ∈ R^n is a zero-mean random vector with a known covariance matrix C_ee. Y ∈ R is again the response variable with E[Y] = 0. Z ∈ R^n is a zero-mean random vector with a covariance matrix C_ZZ. Note that zero-mean Y and Z lead to γ_0 = 0. The cross-covariance matrix C_Ze is assumed to be 0. Least squares linear regression of Y on X would yield a distorted answer due to the additive error e.
Total least squares (TLS) regression [14, 15] accounts for e in obtaining the linear equation. What distinguishes TLS regression from regular LS regression is how the statistical error to be minimized is defined. In total least squares regression, the statistical error is defined by the distance between the data point and the regression hyper-plane. For example, the error term in the case of one predictor variable is

    e_{i,TLS} = ( Y(i) - α - β X(i) ) / sqrt(β^2 + 1),    (2.9)

when the regression line is given by Ŷ = α + βX. Total least squares regression minimizes the sum of the squares of e_{i,TLS}:

    (α_TLS, β_TLS) = arg min_{α, β} Σ_{i=1}^{m} (e_{i,TLS})^2.    (2.10)
Figure 2-4 illustrates the difference between the regular least squares regression and total
least squares regression.
Figure 2-4: Statistical errors minimized in, (a) least squares regression and (b) total least
squares regression.
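For the one-predictor case of (2.9)-(2.10), the minimizer can be computed in closed form: the TLS line passes through the sample mean and its direction is the leading eigenvector of the 2 x 2 sample covariance of (X, Y). A minimal sketch:

```python
import numpy as np

def tls_line(x, y):
    """Total least squares fit of y ≈ α + βx, minimizing the summed squared
    perpendicular distances of the data points to the line."""
    xm, ym = x.mean(), y.mean()
    C = np.cov(np.vstack([x, y]))                 # 2 x 2 sample covariance
    eigval, eigvec = np.linalg.eigh(C)
    d = eigvec[:, np.argmax(eigval)]              # direction of the TLS line
    beta = d[1] / d[0]                            # slope (assumes a non-vertical line)
    alpha = ym - beta * xm                        # the line passes through the mean
    return alpha, beta
```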
Noise-Compensating Linear Regression
In this section, we are interested in finding the relation between Y and Z specified by γ of (2.8a). Should one use traditional least squares regression based on Y and X, the result given in (2.11),

    γ̂ = C_XY C_XX^{-1},    (2.11)

need not be a reasonable estimate of γ. For example, it is observed in [16] that γ̂ of (2.11) is a biased estimator of γ of (2.8a). Since what we are interested in is obtaining a good estimate of γ, it is desirable to reduce the noise e of (2.8b) before regression. We will limit methods of reducing noise in X to a linear transform denoted by B. After the linear transform B is applied to X, we regress Y on the transformed variables, BX. Let γ̂_B denote an estimate of γ obtained from the least squares regression of Y on BX, i.e.,

    γ̂_B = arg min_r E[ (Y - r^T BX)^2 ].    (2.12)

The goal is to find B which makes γ̂_B of (2.12) a good estimate of γ. We adopt minimizing the squared sum of errors as the criterion of a good estimate, i.e.,

    B_nc = arg min_B (γ̂_B - γ)^T (γ̂_B - γ).    (2.13)

We call B_nc of (2.13) the noise compensating linear transform. We first find an equation between γ̂_B and B. Since γ̂_B is the least squares regression coefficient vector when the regression is of Y on BX, we obtain

    γ̂_B^T = C_Y(BX) C_(BX)(BX)^{-1}.    (2.14)

The right-hand side of (2.14) can be written

    C_Y(BX) C_(BX)(BX)^{-1} = C_YX B^T (B C_XX B^T)^{-1} = C_YZ B^T (B (C_ZZ + C_ee) B^T)^{-1}.    (2.15)

Combining (2.14) and (2.15), we obtain

    γ̂_B^T (B (C_ZZ + C_ee) B^T) = γ^T C_ZZ B^T,    (2.16)

which becomes

    γ̂_B^T B C_ee B^T = (γ^T - γ̂_B^T B) C_ZZ B^T.    (2.17)

The minimum value of E[(Y - r^T BX)^2] is achieved when r = γ̂_B:

    E[(Y - γ̂_B^T BX)^2] = E[(γ^T Z + ε - γ̂_B^T B Z - γ̂_B^T B e)^2]
                        = (γ^T - γ̂_B^T B) C_ZZ (γ^T - γ̂_B^T B)^T + γ̂_B^T B C_ee B^T γ̂_B + σ_ε^2.    (2.18)

Replacing γ̂_B^T B C_ee B^T of (2.18) with the expression in (2.17) gives

    E[(Y - γ̂_B^T BX)^2] = (γ^T - γ̂_B^T B) C_ZZ γ + σ_ε^2.    (2.19)

The noise compensating linear transform B_nc is the B which minimizes (γ̂_B - γ)^T (γ̂_B - γ) while satisfying (2.17). If we can achieve γ̂_B = γ for a certain B while satisfying (2.17), then the problem is solved. In the next equations, we show that (2.17) holds if γ̂_B = γ and B = C_ZZ C_XX^{-1}:

    right-hand side of (2.17) = γ^T (I - B) C_ZZ B^T
        = γ^T (I - C_ZZ C_XX^{-1}) C_ZZ C_XX^{-1} C_ZZ
        = γ^T C_ZZ C_XX^{-1} C_ZZ - γ^T C_ZZ C_XX^{-1} C_ZZ C_XX^{-1} C_ZZ
        = γ^T C_ZZ C_XX^{-1} (C_XX - C_ZZ) C_XX^{-1} C_ZZ
        = γ^T B C_ee B^T
        = left-hand side of (2.17).    (2.20)

Figure 2-5 illustrates the noise compensating linear regression.
[Figure: block diagram. X is passed through the transform B = C_ZZ C_XX^{-1} to give BX; Y is then regressed on BX by least squares, yielding γ̂_B^T = C_Y(BX) C_(BX)(BX)^{-1}.]
Figure 2-5: Noise compensating linear regression
Example 2.1
Improvement in estimating -y:
distributed noise
case of independent identically
In this example, we compare the regression result after the linear
transform B = CzzC-1 is applied with the regression result without the linear transform,
which is equivalent to setting B=I, for the case of CEE = I. Then B = (Cxx - I) C-1 =
30
I - C-
and we saw that
B=I and i
=
T (I
= y. If we regress Y on X without the transform X first, then
-
In that case, (
.
)T
--
= (C
)
(Cy>) ;> o.
As a numerical example, we create Z whose population covariance matrix is
/
Czz
=
66.8200
19.1450
-8.9750
30.8350
19.1450
38.1550
17.5975
42.9700
14.3650
-8.9750
17.5975
39.7850
12.3750
17.6300
30.8350
42.9700
12.3750
61.5025
5.7675
14.3650
17.6300
5.7675
27.1050
-17.5675
and we use -y = (-0.99
0.26 0.57 - 0.41
0 . 5 8 )T
-17.5675
,
to create Y = yTZ + C. We set E to be
Gaussian noise with variance of 0.01. X is made by adding a Gaussian noise vector e whose
population covariance matrix is I. The number of samples of Z is 2000. We estimate j by
first applying the transform B = I - S-1,
where Sxx is the sample covariance of X. We
also estimate * with B=I. We repeat it 200 times and average the results. The results are
given in the following table:
           B = I - S_XX^{-1}        B = I
  γ        mean       σ             mean       σ
  -0.99    -0.9904    0.0049        -0.9704    0.0046
   0.26     0.2618    0.0113         0.1873    0.0082
   0.57     0.5701    0.0062         0.5697    0.0060
  -0.41    -0.4109    0.0087        -0.3648    0.0073
   0.58     0.5795    0.0090         0.6001    0.0073
This example clearly illustrates that:

(1) Noise in the regressor variables is detrimental to estimating the coefficients by least squares regression.

(2) If the covariance C_εε is known and the noise is uncorrelated with the signal, then the noise-compensating linear transform B = C_ZZ C_XX^{-1} can be applied to X before the least squares regression to improve the accuracy of the estimate of the coefficient vector γ.
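The comparison of Example 2.1 can be reproduced with a short numerical sketch. The Python code below is not from the thesis: the signal covariance is a generic positive-definite stand-in (not the C_ZZ printed above), and all function names are illustrative. It regresses Y on BX with B = I - S_XX^{-1} and with B = I, and averages the estimated coefficients over repeated trials.

import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
C_zz = M @ M.T + 0.5 * np.eye(5)        # hypothetical positive-definite signal covariance
gamma = np.array([-0.99, 0.26, 0.57, -0.41, 0.58])
m, n_trials = 2000, 200
L = np.linalg.cholesky(C_zz)

def ls_coeffs(Y, V):
    """Least squares coefficients of Y on the columns of V (zero-mean data)."""
    S_vv = V.T @ V / len(V)
    S_vy = V.T @ Y / len(V)
    return np.linalg.solve(S_vv, S_vy)

est_nc, est_plain = [], []
for _ in range(n_trials):
    Z = rng.standard_normal((m, 5)) @ L.T          # signal with covariance C_zz
    Y = Z @ gamma + 0.1 * rng.standard_normal(m)   # response with small noise
    X = Z + rng.standard_normal((m, 5))            # noisy regressors, C_ee = I
    S_xx = X.T @ X / m
    B = np.eye(5) - np.linalg.inv(S_xx)            # noise-compensating transform
    est_nc.append(ls_coeffs(Y, X @ B.T))           # regression of Y on BX
    est_plain.append(ls_coeffs(Y, X))              # regression of Y on X (B = I)

print("true gamma          :", gamma)
print("mean, B = I - Sxx^-1:", np.mean(est_nc, axis=0).round(3))
print("mean, B = I         :", np.mean(est_plain, axis=0).round(3))

As in the table above, the plain least squares coefficients are biased toward zero while the noise-compensated coefficients are close to γ.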
Principal Component Regression
In principal component regression (PCR), noise is reduced by PC filtering before the least squares linear regression is carried out. Assuming that C_εε = I, it follows from the noise-compensating linear regression that B = C_ZZ C_XX^{-1} = I - C_XX^{-1}, thus

BX = (I - C_XX^{-1}) X = (Q Q^T - Q Λ^{-1} Q^T) X = Q (I - Λ^{-1}) Q^T X,   (2.21)

where Q and Λ are the eigenvector and eigenvalue matrices defined in (2.4).
For notational convenience, let us define three new notations:

Q_(l) = [ v_1 | ... | v_l ],   (2.22a)
Λ_(l) = diag(λ_1, ..., λ_l),   (2.22b)
P_(l) = [P_1, ..., P_l]^T,   (2.22c)

for any integer l ≤ n.
In (2.21), Q^T X is the PC transform and Q is the inverse PC transform. The matrix (I - Λ^{-1}) is

I - Λ^{-1} = diag(1 - λ_1^{-1}, ..., 1 - λ_n^{-1}).

Therefore, the ith principal component P_i is scaled by 1 - λ_i^{-1} before the inverse PC transform. Note that (1 - λ_1^{-1}) ≥ ... ≥ (1 - λ_n^{-1}) ≥ 0. If (1 - λ_l^{-1}) > (1 - λ_{l+1}^{-1}) = 0, then (2.21) can be written as

BX = Q_(l) (I - Λ_(l)^{-1}) Q_(l)^T X.   (2.23)

Considering that Q_(l)^T X is equal to P_(l), this means that the last n - l principal components are truncated. This is due to the fact that the principal components which correspond to the eigenvalue λ = 1 consist only of noise. Therefore, the truncation is equivalent to noise reduction.
If there is a limit to the number of principal components which can be retained, only
the largest components are usually retained. However, this may not be the best thing to do
because there is no reason why the principal components with the highest variances should
be more useful for predicting the response variable. There are many examples in which the
principal components with smaller variances are relevant to the prediction of the response
variable, but are discarded in the process of choosing the principal components to be used for
regression. This problem can be partially avoided if the principal components are ordered
and chosen based on their correlations with the response variable [17]. Alternatively, one
could use the partial least squares regression to avoid the shortcoming of the PC regression.
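As a rough illustration of the PC filtering of (2.21)-(2.23), the following sketch (not from the thesis; the toy data and names are invented) scales each principal component of a zero-mean dataset by 1 - 1/λ_i, so that components whose eigenvalues are at the assumed unit noise level are truncated.

import numpy as np

def pc_filter(X):
    """PC filtering of (2.21): scale each principal component by 1 - 1/lambda_i,
    assuming unit-variance white noise (C_ee = I). X holds one sample per row."""
    m = X.shape[0]
    S_xx = X.T @ X / m                          # sample covariance
    lam, Q = np.linalg.eigh(S_xx)
    lam, Q = lam[::-1], Q[:, ::-1]              # descending eigenvalues
    scale = np.clip(1.0 - 1.0 / lam, 0.0, None) # eigenvalues near 1 -> 0 (truncation, eq. 2.23)
    P = X @ Q                                   # principal components
    return (P * scale) @ Q.T                    # inverse PC transform of the scaled components

# toy usage: 3 latent sources mixed into 10 noisy variables
rng = np.random.default_rng(1)
A = rng.standard_normal((10, 3))
Z = rng.standard_normal((2000, 3)) @ A.T
X = Z + rng.standard_normal(Z.shape)            # white noise, variance 1
X_filt = pc_filter(X)
print("noise power before:", np.mean((X - Z) ** 2).round(3),
      " after:", np.mean((X_filt - Z) ** 2).round(3))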
Partial Least Squares Regression
In the PC transform, the transform vector v maximizes the variance of X^T v subject to v^T v = 1. However, the resulting principal component X^T v may or may not have a strong relation with Y. If the purpose of the transform is to find a variable which accounts for as much variance of Y as possible, it is desirable to find a vector μ_1 which maximizes the covariance of the transformed variable X^T μ_1 with Y. This is the idea behind the partial least squares (PLS) transform [2, 18].

With Y ∈ R being the zero-mean response variable and X ∈ R^n being the zero-mean vector of regressor variables, X^T μ_1, where μ_1 ∈ R^n, is the first PLS variable if it has the maximum covariance with Y subject to the condition μ_1^T μ_1 = 1. The vector μ_1 can again be found by Lagrangian multipliers:

∂/∂μ_1 E[Y (X^T μ_1)] = ∂/∂μ_1 (λ μ_1^T μ_1),   (2.24)

which results in

μ_1 = C_XY / ||C_XY||.   (2.25)
To find the second PLS variable, the following algorithm is carried out:
(a) Find scalars k_0, k_1, ..., k_n so that Y - k_0 (X^T μ_1), X_1 - k_1 (X^T μ_1), ..., X_n - k_n (X^T μ_1) are all uncorrelated with X^T μ_1. The answers are k_0 = (C_YX μ_1)/(μ_1^T C_XX μ_1) and k_i = (C_{X_i X} μ_1)/(μ_1^T C_XX μ_1).

(b) Treat the residual in Y, Y - k_0 (X^T μ_1), as the new response variable. Treat the residuals X_1 - k_1 (X^T μ_1), ..., X_n - k_n (X^T μ_1) as the new regressor variables.

(c) Using the new response variable and the new regressor variables, find the second PLS variable μ_2 in the same way the first was found.

(d) Repeat (a), (b), and (c) for subsequent PLS variables. Stop the procedure when the residual in the response variable is small enough.
The PLS transform retains the variations in X which have a strong relation with Y. When
there is a limit to the number of transformed variables which can be retained, the PLS
regression generally accounts for the variations in Y better than the PC regression.
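The deflation steps (a)-(d) can be written down compactly. The sketch below is one possible reading of that procedure under the stated zero-mean assumptions; it is not the thesis's code, and the function and variable names are invented.

import numpy as np

def pls_components(X, Y, n_comp):
    """At each step the weight vector is the normalized cross-covariance of the current
    residuals with the response residual (eq. 2.25); X and Y are then deflated so that
    the residuals are uncorrelated with the extracted component (steps (a)-(b))."""
    m = len(Y)
    Xr, Yr = X.copy(), Y.copy()
    weights, scores = [], []
    for _ in range(n_comp):
        c_xy = Xr.T @ Yr / m                 # cross-covariance with the response residual
        mu = c_xy / np.linalg.norm(c_xy)     # weight vector, mu^T mu = 1
        t = Xr @ mu                          # PLS variable X^T mu, one value per sample
        tt = t @ t / m
        k0 = (t @ Yr / m) / tt               # step (a): k_0 = C_Yt / var(t)
        k = (Xr.T @ t / m) / tt              #           k_i = C_{Xi,t} / var(t)
        Yr = Yr - k0 * t                     # step (b): residual response
        Xr = Xr - np.outer(t, k)             #           residual regressors
        weights.append(mu)
        scores.append(t)
    return np.array(weights), np.array(scores)

# toy usage
rng = np.random.default_rng(2)
X = rng.standard_normal((500, 8))
Y = X @ rng.standard_normal(8) + 0.1 * rng.standard_normal(500)
W, T = pls_components(X - X.mean(0), Y - Y.mean(), n_comp=3)
print("component/response correlations:",
      [round(abs(np.corrcoef(t, Y)[0, 1]), 3) for t in T])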
Ridge Regression
Ridge regression [19] was introduced as a method for eliminating large variances in regression
estimates in the presence of large collinearity among predictor variables. In ridge regression,
the penalty function which is minimized over γ_0 and γ has an additive term in addition to the usual squared error term. The additive term is proportional to the squared norm of the coefficient vector γ:

(γ̂_0, γ̂) = argmin_(γ_0, γ) E[(Y - γ_0 - γ^T Z)^2] + η γ^T γ.   (2.26)

The solution to (2.26) is

γ̂^T = C_YZ (C_ZZ + η I)^{-1},   (2.27a)
γ̂_0 = (1/m) Σ_{j=1}^m Y(j) - γ̂^T (1/m) Σ_{j=1}^m Z(j).   (2.27b)
Comparing (2.27) with (2.6), the only difference is the additive term ηI. This additive term stabilizes the matrix inverse in (2.27a) when the covariance matrix C_ZZ is ill-conditioned. The ridge parameter η > 0 determines the degree of stabilization; a larger value implies stronger stabilization. In practice, the value of η is determined by some model selection procedure such as cross-validation [20].
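A minimal sketch of the ridge solution (2.27), using sample moments in place of the population covariances (an assumption of this sketch, since (2.27) is written with population quantities); the toy data and names are illustrative only.

import numpy as np

def ridge_fit(Z, Y, eta):
    """Ridge solution of (2.27): gamma = (C_ZZ + eta*I)^{-1} C_ZY from sample moments,
    plus the intercept gamma_0. Z: (m, n), Y: (m,)."""
    m, n = Z.shape
    z_mean, y_mean = Z.mean(axis=0), Y.mean()
    Zc, Yc = Z - z_mean, Y - y_mean
    C_zz = Zc.T @ Zc / m
    C_zy = Zc.T @ Yc / m
    gamma = np.linalg.solve(C_zz + eta * np.eye(n), C_zy)   # (2.27a)
    gamma0 = y_mean - gamma @ z_mean                        # (2.27b)
    return gamma0, gamma

# toy usage with nearly collinear predictors: eta > 0 stabilizes the inverse
rng = np.random.default_rng(3)
base = rng.standard_normal((300, 1))
Z = np.hstack([base, base + 1e-3 * rng.standard_normal((300, 1)),
               rng.standard_normal((300, 2))])
Y = Z @ np.array([1.0, 1.0, -0.5, 0.3]) + 0.1 * rng.standard_normal(300)
for eta in (0.0, 0.1):
    g0, g = ridge_fit(Z, Y, eta)
    print(f"eta={eta}: gamma_hat = {g.round(2)}")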
Rank-Reduced Least-Squares Linear Regression
Rank-reduced linear regression [21], also known as the 'projected principal component regression', is a regression method concerning multiple dependent variables as well as multiple
independent variables. The framework of this specific multivariate regression problem is described in the next paragraph.
Let X = [X_1, ..., X_n]^T be the vector of n independent variables. Let Y = [Y_1, ..., Y_m]^T be the vector of m dependent variables. We want to find an estimate of Y based on X in the following form:

Ŷ = W H^T X,   (2.28)

where W is an m × r matrix, H is an n × r matrix, and r ≤ min(m, n). One can assume without loss of generality that W is orthonormal (W^T W = I) because any W can always be decomposed into an orthonormal part and a remainder, and the remainder can be included in H. The rank-reduced least-squares linear regression looks for W_p and H_p which satisfy the following criterion:

(W_p, H_p) = argmin_(W,H) E[(Y - Ŷ)^T (Y - Ŷ)].   (2.29)
It turns out that the joint minimization problem of (2.29) is separable: instead of minimizing E[(Y - Ŷ)^T (Y - Ŷ)] over W and H jointly, we only have to consider one argument at a time. First, for a given orthonormal matrix W, H_{p|W} can be found by minimizing

E[(Y - Ŷ)^T (Y - Ŷ)] = E[(Y - W H^T X)^T (Y - W H^T X)].   (2.30)

One can always write Y as the sum of W W^T Y and (I - W W^T) Y:

Y = (I - W W^T) Y + W W^T Y   (2.31)
  = Y_⊥ + Y_∥.   (2.32)

Note that the two components are orthogonal to each other. Replacing Y in (2.30) with (2.32) yields

E[(Y_⊥ + Y_∥ - Ŷ)^T (Y_⊥ + Y_∥ - Ŷ)] = E[Y_⊥^T Y_⊥] + E[(Y_∥ - Ŷ)^T (Y_∥ - Ŷ)],   (2.33)

because Y_⊥ is perpendicular to Y_∥ - Ŷ. Note that E[Y_⊥^T Y_⊥] in (2.33) is fixed for a given W. Therefore, only the second term in (2.33) can be minimized over H. This can be done by setting

E[(Y_∥ - Ŷ) X^T] = 0.   (2.34)

This condition is imposed because the remainder in Y_∥ after the estimate Ŷ is subtracted should not be correlated with X in order for E[(Y_∥ - Ŷ)^T (Y_∥ - Ŷ)] to be minimized. Substituting Ŷ = W H^T X into (2.34) yields

W W^T C_YX = W H^T C_XX,   (2.35)
W^T C_YX = H^T C_XX.   (2.36)

Assuming that C_XX is invertible, H^T_{p|W} = W^T C_YX C_XX^{-1}. With this choice of H, the resulting distortion is
E[(Y - Ŷ)^T (Y - Ŷ)] = E[Y^T Y] - 2 E[Y^T Ŷ] + E[Ŷ^T Ŷ]   (2.37)
 = E[Y^T Y] - 2 tr{ E[W W^T C_YX C_XX^{-1} X Y^T] } + tr{ E[W^T C_YX C_XX^{-1} X X^T C_XX^{-1} C_XY W] }   (2.38)
 = E[Y^T Y] - 2 tr{ W^T C_YX C_XX^{-1} C_XY W } + tr{ W^T C_YX C_XX^{-1} C_XY W }   (2.39)
 = E[Y^T Y] - tr{ W^T C_YX C_XX^{-1} C_XY W }.   (2.40)
Thus to solve for the optimal W we must maximize

tr{ W^T C_YX C_XX^{-1} C_XY W } = tr{ W^T Q Λ Q^T W },   (2.41)

where Q Λ Q^T is the eigenvalue decomposition of C_YX C_XX^{-1} C_XY. If we simplify the above expression further, we get

tr{ W^T Q Λ Q^T W } = Σ_{j=1}^m λ_j q_j^T W W^T q_j,   (2.42)

where λ_j is the jth diagonal element of Λ and q_j is the jth column of Q.

Before finding the W which maximizes (2.42), we would like to find a lower bound and an upper bound for the scalar q_j^T W W^T q_j. They can be determined easily by recognizing

q_j^T W W^T q_j = ||W^T q_j||^2 ≥ 0,   (2.43)
q_j^T W W^T q_j = ||W^T q_j||^2 ≤ ||W^T||^2 = 1,   (2.44)

where ||·|| is the vector or matrix 2-norm. We also note that Σ_j q_j^T W W^T q_j = r because

Σ_{j=1}^m q_j^T W W^T q_j = tr{ Q^T W W^T Q } = tr{ W W^T } = r.   (2.45)
Therefore, the original problem is equivalent to determining the W which maximizes

Σ_{j=1}^m λ_j q_j^T W W^T q_j   (2.46a)

subject to

0 ≤ q_j^T W W^T q_j ≤ 1   (2.46b)

and

Σ_{j=1}^m q_j^T W W^T q_j = r.   (2.46c)

Assuming that the λ_j are in descending order, the solution to (2.46) is

q_j^T W W^T q_j = 1 for 1 ≤ j ≤ r, and 0 otherwise.   (2.47)

A closed-form expression for a W which satisfies (2.47) is

W = [ q_1 | ... | q_r ].   (2.48)

Returning to the original problem, the optimal estimate Ŷ of rank r based on X is given by

Ŷ = W W^T C_YX C_XX^{-1} X,   (2.49)

where W is given by (2.48). Note that the optimal estimate of rank r is the projection of the linear least squares estimate C_YX C_XX^{-1} X into the r-dimensional subspace spanned by the columns of W, or equivalently, spanned by the first r eigenvectors of C_YX C_XX^{-1} C_XY. This is the origin of the name projected principal component regression. Note that Ŷ becomes the LLSE if we do not impose the restriction that Ŷ has to be of rank r.
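The rank-reduced estimate (2.48)-(2.49) can be computed directly from sample covariances. The following sketch assumes zero-mean data and uses sample moments in place of the population covariances; the names and toy data are illustrative.

import numpy as np

def rank_reduced_regression(X, Y, r):
    """Sketch of (2.49): project the full linear LS estimate C_YX C_XX^{-1} X onto the
    span of the first r eigenvectors of C_YX C_XX^{-1} C_XY. X: (m, n), Y: (m, q),
    zero-mean columns; returns W (q, r) and H (n, r) with Y_hat = W H^T X."""
    m = X.shape[0]
    C_xx = X.T @ X / m
    C_yx = Y.T @ X / m
    F = C_yx @ np.linalg.solve(C_xx, C_yx.T)      # C_YX C_XX^{-1} C_XY, symmetric PSD
    evals, evecs = np.linalg.eigh(F)
    W = evecs[:, ::-1][:, :r]                     # first r eigenvectors, eq. (2.48)
    H = np.linalg.solve(C_xx, C_yx.T) @ W         # H = C_XX^{-1} C_XY W
    return W, H

# toy usage: 6 responses driven by a 2-dimensional structure of 8 predictors
rng = np.random.default_rng(4)
X = rng.standard_normal((1000, 8))
coupling = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 6))
Y = X @ coupling + 0.1 * rng.standard_normal((1000, 6))
W, H = rank_reduced_regression(X, Y, r=2)
Y_hat = X @ H @ W.T
print("residual variance with rank-2 fit:", np.mean((Y - Y_hat) ** 2).round(4))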
2.3.3
Noise Estimation Tools
When population statistics such as population mean or variance are not available for a given
dataset, sample statistics usually replace them to be used for further statistical analyses. It
is often the case that a sample statistic is a good estimate of the corresponding population
statistic if the number of samples is large. When measurement noises - usually additive - are present in data, sample statistics of noisy data may not be good estimates of population
statistics of noise-free data unless the noise statistics are known and easily removed from
the corresponding noisy sample statistics. For example, consider the following problem: we
want to estimate the population covariance of Z E R", denoted by Czz, from m independent
observations of X E R", where X = Z + e and Z and e are uncorrelated so that Czz is
equal to Cxx - CeE. If m is large, we may substitute the sample covariance of X, denoted
by Sxx, for Cxx.
Then an estimate of C_ZZ can be obtained from S_XX - C_εε if C_εε is known. Many statistical analysis tools such as the noise-adjusted principal component (NAPC) analysis take the availability of C_εε for granted.
Problems may arise when the necessary noise statistics are not available.
Generally
speaking, sample noise statistics cannot be obtained from a dataset because noises do not
appear by themselves but are embedded in signals, unless we have ways to separate noises
from signals. For example, the sample noise covariance, denoted by SEE, typically cannot
be obtained from observations of X.
In this section, we will explore a few ways to estimate CEE which often, but not always,
work in practice.
When necessary, we will make physically realistic assumptions about
signals and noises. Once estimated, the noise covariances will be used to compensate noises
to improve statistical tools such as the PC transform and linear regression.
Throughout the section, we will consider the observed vector X ∈ R^n, which consists of the noise-free vector Z and the noise vector e, i.e.,

X = Z + e.   (2.50)

Assuming Z and e are uncorrelated, the population covariances of the three vectors satisfy the relation

C_XX = C_ZZ + C_εε.   (2.51)
Noise Estimation through Spectral Analysis
For many physically realistic cases, a variable changes slowly when it is measured repeatedly
over time. This observation is a basis for many noise reduction algorithms. For example,
Donoho proposed in [22] a 'de-noising' algorithm - 'de-noising' meaning rejecting noise from noisy observations of data - which achieves a target smoothness of the data. In that case, power spectral analysis could reveal the variance of the additive measurement error. To make it easy to study this specific time structure of a
the additive measurement error. To make it easy to study this specific time-structure of a
dataset, we incorporate the observation index, or time index, into (2.50),
X(t) = Z(t) + e(t).   (2.52)

Assumptions
The stochastic processes X_i(t), Z_i(t), and e_i(t) for i = 1, ..., n are stationary, and e(t) is uncorrelated with Z(t). Z_i(t) for i = 1, ..., n change slowly over time, which means that the power spectral density of Z_i(t), denoted by P_{Z_i Z_i}(ω), is confined to a low frequency range. e_i(t) for i = 1, ..., n are white.

Since Z_i(t) and e_i(t) are uncorrelated, the power spectral densities of X_i(t), Z_i(t) and e_i(t) satisfy

P_{X_i X_i}(ω) = P_{Z_i Z_i}(ω) + P_{e_i e_i}(ω)   for i = 1, ..., n.   (2.53)

The high frequency region of P_{X_i X_i}(ω) is dominated by the power spectral density of the noise. Therefore, the level of P_{X_i X_i}(ω) in the high frequency region indicates the noise variance of e_i (see Figure 2-6). Similar assumptions were made by Lim when restoring noisy images in [8, 23]. Since C_εε is diagonal, estimating σ_{e_i}^2 for i = 1, ..., n amounts to estimating C_εε, because

C_εε = D(σ_{e_1}^2, ..., σ_{e_n}^2).   (2.54)

Example 2.2  Estimation of noise variance by power spectral analysis
Consider the variable X_1 and its power spectral density shown in Figure 2-7 for a numerical simulation. P_{X_1 X_1}(ω) is not flat in the high frequency region. We want to estimate the value σ_{e_1}^2 by averaging P_{X_1 X_1}(ω) in the high frequency region. To do this we need to decide what is defined as the high frequency region.

[Figure 2-6: A simple illustration of estimation of noise variance for a slowly changing variable]

[Figure 2-7: A typical X_i(t) and its power spectral density]

We propose that the value ω_B, which separates the high frequency and the low frequency regions, should be determined so that

min_{ω ∈ [0, ω_B)} P_{X_1 X_1}(ω) = (1/(1 - ω_B)) ∫_{ω_B}^{1} P_{X_1 X_1}(ω) dω.   (2.55)

It is not rare to have multiple solutions to (2.55). In that case, the smallest ω_B is selected as the division frequency so as not to pick a value which appears as a solution due to statistical fluctuations in the power spectral density. The estimated noise variance is then

σ̂_{e_1}^2 = (1/(1 - ω_B)) ∫_{ω_B}^{1} P_{X_1 X_1}(ω) dω.   (2.56)
For Example 2.2, ω_B and σ̂_{e_1}^2 are illustrated in Figure 2-7(b) as the vertical dash-dot line and the horizontal dashed line, respectively. Although somewhat ad hoc, this method works very well in our simulations. For Figure 2-7, σ_{e_1} = 0.8 and σ̂_{e_1} = 0.82.
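The division-frequency rule (2.55) and the noise-variance estimate (2.56) can be prototyped with a periodogram. The sketch below is one plausible implementation under the assumptions above (normalized frequency axis, white noise, slowly varying signal); the smoothing window and search grid are choices of this sketch, not of the thesis.

import numpy as np

def estimate_noise_variance(x):
    """Pick the smallest division frequency w_B at which the (smoothed) periodogram's
    minimum over the low band reaches the mean level of the high band, then average
    the periodogram over the high band. Frequencies normalized so Nyquist = 1."""
    m = len(x)
    x = x - x.mean()
    psd = np.abs(np.fft.rfft(x)) ** 2 / m          # flat level of white noise = its variance
    freqs = np.linspace(0.0, 1.0, len(psd))
    psd_s = np.convolve(psd, np.ones(9) / 9.0, mode="same")   # light smoothing
    for i in range(5, len(psd) - 5):               # candidate division frequencies
        if psd_s[:i].min() <= psd_s[i:].mean():    # smallest w_B satisfying the criterion
            return freqs[i], psd[i:].mean()
    return 1.0, psd[-len(psd) // 4:].mean()        # fallback: last quarter of the band

# toy usage: slowly varying AR(1) signal plus white noise with standard deviation 0.8
rng = np.random.default_rng(5)
m, sigma = 2000, 0.8
z = np.zeros(m)
for k in range(m - 1):
    z[k + 1] = 0.93 * z[k] + 0.37 * rng.standard_normal()
x = z + sigma * rng.standard_normal(m)
w_b, var_hat = estimate_noise_variance(x)
print(f"w_B = {w_b:.2f}, estimated noise std = {np.sqrt(var_hat):.2f} (true {sigma})")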
Noise-Compensating Linear Regression with Estimated C_εε

In our introduction of the noise-compensating (NC) linear regression in Subsection 2.3.2, it was shown that measurement noise embedded in the regressor vector X deteriorates the accuracy of the estimated regression coefficients, and that this degradation can be avoided if X is first multiplied by a matrix B = C_ZZ C_XX^{-1} = I - C_εε C_XX^{-1} before the linear regression. Note that C_εε is required to obtain B.

In this section we present two examples to demonstrate performance improvements via NC linear regression when C_εε is not known a priori but is estimated by the power spectral analysis proposed in the previous section. The first example shows that unless the noise covariance is grossly overestimated, the performance of noise-compensating linear regression with estimated C_εε is not much inferior to that with known C_εε. In the second example, we present a case where C_εε is grossly overestimated. When this happens, the noise is overcompensated and the performance of noise-compensating linear regression is much degraded. We then suggest a method to correct the overcompensation of noise.
Example 2.3  Noise-compensating linear regression with noise estimation

After the noise variances are estimated, the noise-compensating linear regression can be applied. For this example, we generate 2000 samples of Z ∈ R^5. As a way to make each variable change slowly over time, we use the first-order autoregressive model

Z_i(k+1) = 0.93 Z_i(k) + 0.37 r(k),   k = 1, ..., 1999 and i = 1, ..., 5,   (2.57)

where r(k) ~ N(0, 1).² We use γ = (-0.99, 0.26, 0.57, -0.41, 0.58)^T. The response variable Y is obtained through the equation

Y = γ^T Z + ε,   (2.58)

where ε ~ N(0, 0.1²). The observed vector X is

X = Z + e,   (2.59)

where e ~ N(0, 0.8² I). Typical time plots of the X_i(t)'s and Y are shown in Figure 2-8. To estimate γ, we use only X, Y, and the facts that the noise covariance C_εε is diagonal and that each variable changes slowly over time.

[Figure 2-8: Typical time plots of X_1, ..., X_5 and Y]

² y ~ N(m, σ²) indicates that y is a Gaussian random variable with mean m and variance σ².
In the noise estimation and compensation (NEC) linear regression, the noise covariances are first estimated through the method explained in Example 2.2, and then the noise-compensating linear regression is undertaken using the estimated noise covariance. The resulting estimated coefficients are presented in Table 2.4 with results from other methods for comparison. It can be seen from the results that the NEC linear regression performs much better than the traditional linear regression and almost as well as the noise-compensating linear regression with known C_εε. □

  γ               with known C_εε     with estimated C_εε    without estimation of C_εε
                  mean       σ        mean       σ           mean       σ
  -0.99          -0.9954    0.0375   -1.0874    0.0501      -0.5940    0.0367
   0.26           0.2417    0.0325    0.2868    0.0396       0.1550    0.0297
   0.57           0.5322    0.0306    0.6252    0.0421       0.3410    0.0327
  -0.41          -0.3750    0.0317   -0.4493    0.0413      -0.2464    0.0305
   0.58           0.5337    0.0356    0.6367    0.0466       0.3487    0.0324
  (γ̂-γ)^T(γ̂-γ)    0.0052     -        0.0180     -           0.3005     -

Table 2.4: Results of noise estimation and compensation (NEC) linear regression, compared to the traditional linear regression and the noise-compensating linear regression with known C_εε.
It is important to note that all eigenvalues of B = C_ZZ C_XX^{-1} = I - C_εε C_XX^{-1} must be non-negative because C_XX and C_ZZ are both positive-semidefinite. When we use an estimated C_εε to obtain B, one of the eigenvalues of B could fall below zero if one or more noise variances are estimated to be larger than their true values. This is referred to as grossly overestimated noise variances. It can happen frequently if one or more eigenvalues of C_ZZ are close to zero. In the next example, we modify the Z of Example 2.3 so that one of the eigenvalues of C_ZZ is close to zero. It will be illustrated that overestimation of noise variance can lead to a poor NC linear regression.

Example 2.4  Overcompensation of Noise

This example demonstrates that the performance of NEC regression can be poor if the noise variances are grossly overestimated. For the example, we take the Z generated in Example 2.3
and multiply it by the matrix

[  0.61  -0.98  -0.53   0.87   0.57
  -0.18  -0.76   0.27   0.72  -0.76
   0.25   0.67  -0.46   0.86   0.84
  -0.68   0.88   0.08   0.20  -0.17
   0.63  -0.67  -0.36  -0.55   0.32 ]   (2.60)

to get a new Z. The eigenvalues of the new covariance matrix C_ZZ are (4.85, 3.05, 0.87, 0.49, 0.07). Note that the smallest eigenvalue of C_ZZ is close to zero. We then applied the same noise estimation technique as in Example 2.2. The results of 50 simulations are presented in Table 2.5 with results obtained from other methods. The results show that when the noise variances are overestimated, which makes one or more eigenvalues of B negative, the resulting NEC linear regression produces poor estimates of the regression coefficients. □

  γ               with known C_εε     with estimated C_εε    without estimation of C_εε
                  mean       σ        mean       σ           mean       σ
  -0.99          -1.0013    0.0697   -0.9234    0.3965      -0.8242    0.0250
   0.26           0.3094    0.1703   -0.3303    0.5419       0.0463    0.0287
   0.57           0.5661    0.0448    0.6528    0.1576       0.4813    0.0274
  -0.41          -0.4590    0.1214   -0.0706    0.3013      -0.2496    0.0251
   0.58           0.5443    0.1926    0.8968    0.4658       0.5081    0.0294
  (γ̂-γ)^T(γ̂-γ)    0.0063     -        0.5753     -           0.1119     -

Table 2.5: Results of NEC linear regression when noise overcompensation occurs.
The previous example shows that the noise-compensating linear regression should be
used only when all eigenvalues of Czz are large unless, of course, we can recompensate the
overcompensation of noise. Fortunately, we will know if overcompensation of noise occurs
by monitoring the eigenvalues of B because each time the overcompensation occurs, one or
more eigenvalues of B become negative.
Fact  If the eigenvalues of I - C_εε C_XX^{-1} are λ_1, ..., λ_n, the eigenvalues of C_εε C_XX^{-1} are 1 - λ_1, ..., 1 - λ_n. Furthermore, the eigenvalues of θ C_εε C_XX^{-1} are θ(1 - λ_1), ..., θ(1 - λ_n).
When λ_n < 0, we know that overcompensation occurs. We propose the following recompensation algorithm for the noise overcompensation.

Proposition  When λ_n < 0, we should force the smallest eigenvalue to be zero by redefining B = I - θ C_εε C_XX^{-1}, where θ = 1/(1 - λ_n). When λ_n ≥ 0, no recompensation of the overcompensation is necessary, thus θ = 1:

θ = 1                if λ_n ≥ 0,
θ = 1/(1 - λ_n)      otherwise.

Since this proposition forces the negative eigenvalue λ_n to be zero, we will call it zero-forcing of eigenvalues. In Figure 2-9, the zero-forcing algorithm is depicted. Note that in the algorithm we use only X, the fact that C_εε is diagonal, and the fact that the X_i(t)'s change slowly in time.
[Figure 2-9: Zero-forcing of eigenvalues. The noise variances σ_{e_i}^2 are first estimated from P_{X_i X_i}(ω); B_1 = I - C_εε C_XX^{-1} is formed and its eigenvalues λ_i are computed; if λ_n ≥ 0 then B = B_1, otherwise B = I - (1/(1 - λ_n)) C_εε C_XX^{-1}.]
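A compact sketch of the zero-forcing rule of Figure 2-9, assuming the noise variances have already been estimated (here they are deliberately overestimated to trigger the correction); all names and the toy data are illustrative.

import numpy as np

def nc_transform_zero_forcing(S_xx, noise_vars):
    """Form B_1 = I - C_ee Sxx^{-1} from the estimated noise variances; if its smallest
    eigenvalue is negative, shrink the noise term by theta = 1/(1 - lambda_n) so that
    the smallest eigenvalue of the corrected B becomes zero."""
    n = S_xx.shape[0]
    C_ee = np.diag(noise_vars)
    B1 = np.eye(n) - C_ee @ np.linalg.inv(S_xx)
    lam_min = np.linalg.eigvals(B1).real.min()
    if lam_min >= 0:
        return B1
    theta = 1.0 / (1.0 - lam_min)                     # zero-forcing of the negative eigenvalue
    return np.eye(n) - theta * C_ee @ np.linalg.inv(S_xx)

# toy usage: overestimated noise variances drive an eigenvalue of B_1 negative
rng = np.random.default_rng(6)
A = rng.standard_normal((5, 2))                       # nearly rank-2 signal covariance
Z = rng.standard_normal((2000, 2)) @ A.T + 0.05 * rng.standard_normal((2000, 5))
X = Z + rng.standard_normal((2000, 5))                # true noise variance 1 on every channel
S_xx = X.T @ X / len(X)
B = nc_transform_zero_forcing(S_xx, np.full(5, 1.5))  # grossly overestimated noise variances
print("smallest eigenvalue of corrected B:", np.linalg.eigvals(B).real.min().round(3))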
Example 2.5  Zero-forcing of eigenvalues in noise-compensating linear regression
We apply the algorithm of Figure 2-9 to the dataset generated in the previous example. To
compare the result with those of other methods, in Table 2.6 we reproduce Table 2.5 and
add a column for zero-forcing. It can be seen that the zero-forcing technique dramatically
improves the regression results.
                  known C_εε          estimated C_εε                             no estimation of C_εε
                                      no zero-forcing      zero-forcing
  γ               mean       σ        mean       σ         mean       σ          mean       σ
  -0.99          -1.0013    0.0697   -0.9234    0.3965    -1.0115    0.3188     -0.8242    0.0250
   0.26           0.3094    0.1703   -0.3303    0.5419     0.2987    0.3157      0.0463    0.0287
   0.57           0.5661    0.0448    0.6528    0.1576     0.5959    0.1136      0.4813    0.0274
  -0.41          -0.4590    0.1214   -0.0706    0.3013    -0.3742    0.2027     -0.2496    0.0251
   0.58           0.5443    0.1926    0.8968    0.4658     0.6376    0.2717      0.5081    0.0294
  (γ̂-γ)^T(γ̂-γ)    0.0063     -        0.5753     -         0.0072     -          0.1119     -

Table 2.6: Improvement in NEC linear regression due to zero-forcing of eigenvalues.
Chapter 3
Blind Noise Estimation
3.1
Introduction
This chapter addresses the problem of estimating noise and its variance in the context of
noisy multivariate datasets with finite observations while imposing minimal assumptions
on the dataset. The problem we are especially interested in involves multivariate datasets
where the number of degrees of freedom of the signal portion is 1) much smaller than the
number of degrees of freedom of the noisy dataset, and 2) unknown a priori. This problem
has been studied in the context of signal enhancement as single and two-sensor problems
in [24, 5], in which the order of the time-series model on which the dataset is based is assumed known. This is in effect equivalent to assuming known degrees of freedom. Estimating the number of signals in the case of uniform noise variances has been studied in the context of model identification in [25, 26, 27, 28]; these works use model selection criteria to select the degrees of freedom.
This chapter is organized as follows. The noisy multivariate data model is defined in
Section 3.2. It is assumed throughout the chapter that the signal and noise are independent
Gaussian random vectors.
The assumption that measurement noise is independent and
Gaussian is natural. The justifications for the signal being assumed to be Gaussian are:
1. Hardly anything is known about the signal.
2. Gaussian is a good approximation of many stochastic processes found in nature. (For
example, see [29] page 364.)
3. The Gaussian distribution is easily tractable in developing complex algorithms.

4. The algorithm we will develop based on the Gaussian assumption works well on signals drawn from other distributions in our preliminary simulations.
The noisy data vector is modeled as a sum of a signal vector and a noise vector, where
the signal vector is modeled as an instantaneous linear mixture of a few Gaussian random
variables. Nothing is assumed to be known about the noise vector except that it is an
independent Gaussian random vector with a diagonal covariance matrix. The remainder
of the chapter addresses estimations of noise variances and the number of source signals.
In Section 3.3 the motivation and the usefulness of blind estimation of noise variances are
addressed in the context of the Noise-Adjusted Principal Component (NAPC) transform [1].
Section 3.4 addresses the problem of estimating the number of source signals. Our approach is based on the detection of the 'pseudo-flat noise baseline' in an eigenvalue screeplot,
which can suggest the number of source signals in the case of an identity noise matrix. We
propose first a numerical criterion to obtain an estimate of the number of source signals
for the case of an identity noise matrix, and then we extend the criterion to the case of
non-identity noise matrices. As a related problem, an upper bound and a lower bound of
noise eigenvalues in the case of a non-identity noise matrix are derived in Section 3.4.3.
In Section 3.5, we consider the blind noise estimation problem in the case of known
source signal numbers. Our approach is based on the Expectation-Maximization (EM) approach, which is a computational strategy to solve complex maximum-likelihood estimation
problems. We first describe the problem of maximum-likelihood (ML) parameter estimation of the noise variances and noise sequences in Section 3.5.1. Detailed derivations of the
expectation step and the maximization step are presented in Section 3.5.2 and Section 3.5.3,
respectively. Test of the algorithm is provided in Section 3.5.4.
In Section 3.6, we address the problem of blind noise estimation in the case of an
unknown number of source signals. Because EM-based blind noise estimation requires the
number of source signals or its estimate, by definition of the case, we need to supply an
estimate of the number of source signals to the EM algorithm. We will use the number of source signals retrieved by the algorithm developed in Section 3.4. We take this one step
further by repeating the estimation of the source signal number and the noise variances
back and forth so that an improvement in one estimate leads to a better estimate of the
other, and vice versa. This iterative order and noise estimation procedure is designated the
ION algorithm. We present simulation results of the ION algorithm in Section 3.6.2.
[Figure 3-1: Model of the noisy data: the noiseless data Z(t) passes through sensors that add noise e(t), producing the noisy data X(t)]
3.2
Data Model
Figure 3-1 illustrates the noisy data model used throughout this chapter. The n-dimensional
signal vector Z(t) is measured by n sensors idealized as having n additive noise sources
e(t). The readings of the sensors constitute the n dimensional vector, denoted as X(t). It
is important to realize that neither Z(t) nor e(t) is available to us. Only finite observations
of X(t) are available.
Our objective is to recover the unknown noise variances from the
available finite samples of X(t). It is also desired to recover the actual sequences of noise
e(t), if possible. We should note that these objectives cannot be achieved in general; that
is, the noisy data vector cannot be separated reliably into the signal vector and the noise
vector.
3.2.1
Signal Model
The signal vector Z(t) is modeled as an instantaneous linear mixture of the elements of a p-dimensional source vector P(t). Figure 3-2 illustrates this. The relation between P(t) and
Z(t) is
Z(t) = AP(t).
(3.1)
where A is the mixing matrix, assumed to be full column rank but otherwise unknown. We
only consider the case where the mixing matrix is fixed over time. The term 'instantaneous'
emphasizes that only present values of source variables are used to generate current values
of the signal vector Z(t). The number of source variables, denoted by p, is unknown. This means that we do not know the mixing matrix and that we also do not know how many columns it has. In summary, each variable in Z(t) is an unknown linear combination of an unknown number of unknown source variables.

[Figure 3-2: Signal model as an instantaneous mixture of p source variables, Z(t) = A P(t), with A an n × p mixing matrix]
The source vector P(t) of unknown dimension p is modeled as a statistically independent
Gaussian random vector with zero mean. The variances of source variables are in principle arbitrary since a scalar factor can be exchanged between any source and its associated
column in the mixing matrix without altering X(t). Therefore, without any loss of generality we can assume the variance of each source variable to be unity. Combined with the
assumed independence of the source variables, the unity variances yield an
identity matrix of unknown dimension p x p as the covariance matrix of P(t). Since linear
combinations of Gaussian random variables are Gaussian, we note that the signal vector
Z(t) is a Gaussian random vector.
3.2.2
Noise Model
The noise vector E(t) is modeled as a Gaussian random vector independent of Z(t) with
zero mean and an unknown diagonal covariance matrix, denoted by G. The noisy data
vector X(t) is then a Gaussian random vector with mean and variance of
E[X(t)] = A E[P(t)] + E[e(t)] = 0,   (3.2)

C_XX = E[X(t) X(t)^T] = A A^T + G,   (3.3)
where the superscript T denotes matrix transposition.
3.3
Motivation for Blind Noise Estimation
When we received a number of multivariate datasets for analysis from various manufacturing
facilities, we suspected that many of them were subject to significant measurement errors.
Our suspicions were generally confirmed by manufacturing site visits. However, even at the
plant there often was no a priori information concerning the noise statistics. Since C_εε is essential for many multivariate analysis methods we wanted to use, we became interested in ways to estimate C_εε from a given dataset.

The simplest form of C_εε is a constant multiple of an identity matrix, namely σ_ε²I. This is the case when measurement errors are uncorrelated with each other and each measurement error has the same variance. It is reasonable to assume that most factory measurement errors are uncorrelated; measurement errors originating from different sensors tend to be independent. However, it is not very realistic to assume those measurement errors have the same variance. For most manufacturing datasets we attempted to analyze, some variables have very large measurement errors while others appear to have small measurement errors. Therefore, it is more reasonable to assume that C_εε is a diagonal matrix with diagonal elements σ_{ε_1}^2, ..., σ_{ε_n}^2:

C_εε = D(σ_{ε_1}^2, ..., σ_{ε_n}^2).   (3.4)
(3.4)
When the dimension p of the source vector and the population noise covariance G is
known, the so-called noise-adjusted principal component (NAPC) [1, 30] is typically used to
compress an n-dimensional noisy data vector X(t) into the p-dimensional subspace spanned
by the p eigenvectors associated with the largest p eigenvalues of the covariance of the
"noise-adjusted" random vector, defined as G- 1/ 2X(t). For notational convenience, we will
drop the time index t henceforth if doing so does not introduce any confusion. If necessary,
we will reinstate the time index.
In NAPC, the covariance of the noise-adjusted random vector is obtained first by
E[G^{-1/2} X X^T G^{-1/2}] = G^{-1/2} C_XX G^{-1/2}   (3.5)
 = G^{-1/2} (A A^T + G) G^{-1/2}   (3.6)
 = G^{-1/2} A A^T G^{-1/2} + I.   (3.7)
Since A is not known, AAT of (3.7) is typically substituted by Sxx - G where Sxx is
the sample covariance of X computed from a sample dataset X. Let A be the diagonal
matrix of eigenvalues of (3.7) in descending order and
Q
be the corresponding matrix of
eigenvectors. It can be shown that the transformation of X into the subspace spanned by
the first p columns of
Q
contains all variations originating from signal Z, or equivalently
that the transformation of X into the subspace spanned by the remaining n - p columns
of
Q
contains variations originated only from noise. The subspace spanned by the first p
eigenvectors of
Q
is called the signal subspace, and the subspace orthogonal to the signal
subspace is called the noise subspace. As such, NAPC is capable of dividing a noisy vector
into a vector of complete noise and another vector of signal and noise. This characteristic
of NAPC is utilized in many applications such as noise filtering, data compression, system
identification and characterization, and so for.
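For concreteness, the NAPC compression of (3.5)-(3.7) might be prototyped as follows, assuming G is known (or already estimated) and diagonal; the toy mixing matrix and all names are invented for illustration.

import numpy as np

def napc_signal_subspace(X, G, p):
    """Whiten the data by G^{-1/2}, eigendecompose the sample covariance of the
    whitened data, and keep the p leading eigenvectors (the signal subspace).
    G is the diagonal noise covariance; X holds one sample per row."""
    m = X.shape[0]
    g_inv_sqrt = 1.0 / np.sqrt(np.diag(G))
    Xw = X * g_inv_sqrt                        # noise-adjusted data G^{-1/2} X
    S = Xw.T @ Xw / m                          # approx. G^{-1/2} A A^T G^{-1/2} + I
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1]
    Q_signal = evecs[:, order[:p]]             # signal subspace of the noise-adjusted vector
    return Xw @ Q_signal, evals[order]         # p NAPC components and the eigenvalue screeplot

# toy usage: 3 sources, 12 channels, channel-dependent noise variances
rng = np.random.default_rng(7)
A = rng.standard_normal((12, 3))
G = np.diag(rng.uniform(0.2, 2.0, size=12))
X = rng.standard_normal((3000, 3)) @ A.T + rng.standard_normal((3000, 12)) * np.sqrt(np.diag(G))
comps, screeplot = napc_signal_subspace(X, G, p=3)
print("trailing (noise) eigenvalues, expected near 1:", screeplot[-5:].round(2))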
Up until now, we have assumed the case of known G and p. In fact, the practicality of NAPC depends heavily upon the availability of these two parameters. What limits the usability of NAPC is that neither parameter is known a priori for many 'real-world' datasets. This
This prevents one from noise-adjusting X before the principal component transform. In that
case, an algorithm which reliably estimates the source dimension and the noise covariance
from a sample dataset would be useful.
3.4
Signal Order Estimation through Screeplot
The problem of detecting the number of meaningful source signals in a noisy observation
vector X has been actively studied over the last few years. This is partially in response to the
surging necessity of compression of multivariate datasets whose sizes increase exponentially
in response to the technological advances in sensors and data storages. The detection of
the number of source signals contained in a noisy vector would be beneficial in many ways:
1) to boost signal-to-noise ratio by eliminating noises which are not in the signal subspace
2) to characterize better the complex system represented by the noisy observation
3) to reduce the size of the dataset
If the population noise covariance G is an identity matrix multiplied by a scalar, namely σ²I, and if the population covariance C_XX of the noisy observation vector X is known, then examining the eigenvalues of C_XX reveals the number of source signals easily, because n - p eigenvalues of C_XX are equal to σ² and the remaining p eigenvalues are greater than σ². Therefore, determining the number of source signals is as easy as determining the
multiplicity of the smallest eigenvalue of CXX. In most practical situations, the population
covariance Cxx is unknown. In that case, the sample covariance matrix Sxx computed
from finite samples of X is used in place of the population covariance matrix. The problem
of this practice is that the finite sample size of X ensures that all resulting eigenvalues of
Sxx are different, thus making it difficult to determine the number of source signals merely
by counting the multiplicity of the smallest eigenvalue.
More sophisticated approaches to the problem, developed in [28, 31], are based on an
information theoretic approach. In those papers, the authors suggested that the estimate of
the source number be obtained by minimizing the model selection criteria, first introduced
by Akaike in [25, 26] and by Schwartz and Rissanen in [27, 32]. Cabrera-Mercader suggested
in [30] that the libraries of sample noise eigenvalues be generated for different sample sizes in
computer simulations. These computer-generated sample noise eigenvalues are then used in
place of the population noise eigenvalues. By doing so, he observed substantial improvement
in the accuracy of the signal estimate in numerical simulations. To our best knowledge, all of
the previous studies on the subject of source signal number estimation have been conducted
only for cases when either the population noise covariance G is an identity matrix or G is
known a priori.
In this section we propose a simple numerical method for obtaining an estimate of
the source signal number.
The proposed method is a numerical implementation of the
function of human eyes in determining the separation of signal-dominated eigenvalues and
noise-dominated eigenvalues in an eigenvalue screeplot; a procedure of first determining a
straight line which fits the noise-dominated eigenvalues and then determining the point at
which the straight line and the screeplot diverge significantly. In determining the source
signal number, the results of the proposed method will be shown to be comparable with
those made by human eyes. The advantages of the proposed method are twofold. 1) No
human intervention is required in the determination of the source signal number. This plays
an important role in the successive estimation of p and G in the coming sections. 2) The
proposed method is more robust to changes in eigenvalues caused by a non-identity noise
covariance. Unlike those methods presented in [28, 31] in which knowledge of G is assumed,
the proposed method yields a relatively accurate estimate of the source signal number when
G is unknown and not necessarily an identity matrix.
We begin in Section 3.4.1 by describing briefly the example datasets used throughout
the chapter. We also define the multivariate version of signal-to-noise ratio (SNR) in terms
of signal and noise eigenvalues. Section 3.4.2 presents an introductory description of our
proposed method. We explain qualitatively how one can determine the number of source
signals when the population noise covariance G is an identity matrix.
The changes of
the noise eigenvalues due to the substitution of the population covariance matrix with the
sample covariance matrix computed from a finite number of samples of X is considered. We
extend the proposed method to the case of an unknown and non-identity noise covariance.
We will try to justify this extension in Section 3.4.3 by providing upper and lower bounds of
noise eigenvalues when G is an non-identity diagonal matrix. We will show that the shape
of the screeplot is altered only to the extent of the difference between the largest and the
smallest variances of the noise variables. The actual quantitative algorithm of the proposed
method is presented in Section 3.4.4.
3.4.1
Description of Two Example Datasets
It is our intention to supplement each new algorithm introduced throughout the chapter
with simulation results and graphs. Prior to commencing to describe the method for source
signal number estimation, we would like to make up two simple noisy multivariate datasets.
By using the same two examples repeatedly when simulations are required, we can maintain coherence among algorithms introduced in this chapter. The two example datasets are
designed to be simple and yet sophisticated enough to represent real multivariate datasets
which are modeled in this thesis as being generated from a jointly Gaussian random distribution. As we already explained in Section 3.2, a noisy data vector X is modeled as the
sum of the noise-free signal vector Z and the noise vector e, where the signal vector Z is an
instantaneous linear mixture of the source vector P, where P is a jointly Gaussian random
vector with zero mean and an identity covariance matrix. The dimension of P is set to
be fifteen and the dimension of X is set to be fifty for both example datasets. Consider
the selection of the population noise covariance G. We would like to create one dataset
with an identity covariance matrix and the other dataset with a non-identity covariance
matrix. The first dataset is therefore created so that the population noise covariance is
I. This dataset emulates the cases either where noise variances are naturally uniform over
variables or where variables are standardized by their noise variances before acquisition of
the dataset.
The second dataset is created so that the population noise covariance is a
diagonal matrix and the diagonal elements are 1/50,...
,
1/2, 1 multiplied by a constant c.
The constant c is to control the multivariate signal-to-noise ratio to be defined shortly. This
noise covariance is intended to emulate the cases when noise variances vary over variables.
In specifying the mixing matrix A, we would like to have A as random as possible
except that the two datasets should have comparable SNR. To achieve this, we should
make it clear first what we mean by SNR for multivariate datasets. In a univariate dataset,
SNR is defined as the ratio of the signal variance to the noise variance. For example, for a
variable X_1 = Z_1 + ε_1, the SNR is defined as

SNR(X_1) = 10 log_10 (σ_{Z_1}^2 / σ_{ε_1}^2)   (dB),   (3.8)

where σ_{Z_1}^2 and σ_{ε_1}^2 denote the variances of Z_1 and ε_1, respectively. In the case of a multivariate dataset, we define SNR as the ratio of the sum of all signal variances to the sum of all noise variances, or

SNR(X) = 10 log_10 ( Σ_{i=1}^n σ_{Z_i}^2 / Σ_{i=1}^n σ_{ε_i}^2 ) = 10 log_10 ( tr(C_ZZ) / tr(G) )   (dB),   (3.9)

where tr(·) denotes the trace of a matrix. Recalling that tr(C_ZZ) is equal to the sum of the eigenvalues of C_ZZ, (3.9) can also be expressed in terms of the eigenvalues of C_ZZ and G:

SNR = 10 log_10 ( Σ_{i=1}^n λ_{Z,i} / Σ_{i=1}^n λ_{G,i} )   (dB).   (3.10)

The value of tr(G) is 50 for the first example and 9 for the second example. Somewhat arbitrarily, we set the SNR to be 24 dB for both examples. Then the eigenvalues of C_ZZ should meet the condition Σ_{i=1}^{15} λ_{Z,i} = 50 × 10^2.4 for the first example and 9 × 10^2.4 for the second example. We generate two mixing matrices, one for each example, so that the eigenvalues of C_ZZ meet the above conditions. Otherwise the two mixing matrices are arbitrary. Finally, we set the sample size m to be 3000. The specifications for the two examples are summarized in Table 3.1.
         m      n     p     SNR (dB)    G
  (a)   3000    50    15    24.16       I
  (b)   3000    50    15    24.16       2 · D(1/50, 1/49, ..., 1)

Table 3.1: Summary of parameters for the two example datasets used throughout this chapter.
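The two example datasets of Table 3.1 can be generated along the following lines. This is a sketch of one way to meet the stated specifications (a random mixing matrix scaled to the target SNR of (3.9)); it is not the exact construction used for the thesis's simulations, and all names are illustrative.

import numpy as np

def make_example_dataset(m=3000, n=50, p=15, snr_db=24.0, uniform_noise=True, seed=0):
    """Random n x p mixing matrix scaled so that tr(C_ZZ)/tr(G) matches the target SNR
    of (3.9); independent Gaussian noise with diagonal covariance G is then added."""
    rng = np.random.default_rng(seed)
    if uniform_noise:
        g = np.ones(n)                          # example (a): G = I
    else:
        g = 2.0 * (1.0 / np.arange(n, 0, -1))   # example (b): G = 2*D(1/50, ..., 1/2, 1)
    A = rng.standard_normal((n, p))
    target_signal_power = g.sum() * 10.0 ** (snr_db / 10.0)
    A *= np.sqrt(target_signal_power / np.trace(A @ A.T))   # tr(C_ZZ) = tr(A A^T)
    P = rng.standard_normal((m, p))             # unit-variance independent sources
    E = rng.standard_normal((m, n)) * np.sqrt(g)
    return P @ A.T + E, A, g

X, A, g = make_example_dataset(uniform_noise=False)
snr = 10 * np.log10(np.trace(A @ A.T) / g.sum())
print("dataset shape:", X.shape, " SNR (dB):", round(snr, 2))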
3.4.2  Qualitative Description of the Proposed Method for Estimating the Number of Source Signals
When the population noise covariance G is an identity matrix up to a scalar multiplication factor σ_ε², the largest p eigenvalues of the population covariance matrix C_XX are larger than σ_ε² and the remaining n - p eigenvalues are equal to σ_ε². In theory the number of source signals can be determined by counting the number of eigenvalues which are greater than the smallest eigenvalue, σ_ε². This is called the subspace separation method [35]. In reality, the often unavailable population covariance C_XX is replaced by the sample covariance S_XX computed from a finite number of samples of X. The sample noise eigenvalues obtained by eigendecomposition of the sample covariance S_XX are all different with certainty. Therefore, we cannot determine the number of source signals by counting the number of eigenvalues bigger than the smallest eigenvalue, for this count will always be n - 1 no matter what the true value of p is.
There has been various research on this problem. Parallel analysis, suggested in [33], determines the number of signals by comparing the eigenvalues of the covariance matrix of the standardized dataset at hand with those of a simulated dataset drawn from a standardized normal distribution. Allen and Hubbard [34] further promoted this idea and developed a regression equation which determines the eigenvalues for standardized normal datasets.
Our approach to the problem is an extension of the subspace separation method. Figure 3-3 shows a few screeplots obtained from Example (a) of Table 3.1.
The sample co-
variances are computed from different sample sizes while the population covariance matrix
remains unchanged.
It is clear from the figure that the noise-dominated eigenvalues are
not constant anymore when the sample covariance S_XX replaces the population covariance C_XX, and the effect is clearer for a smaller sample size.

[Figure 3-3: Illustration of changes in noise eigenvalues for different sample sizes (100, 200, and 2000 samples), for Example (a) of Table 3.1]

However, the differences of consecutive noise eigenvalues are much smaller compared to the differences of consecutive signal
eigenvalues. More importantly, noise eigenvalues tend to form a straight line when the vertical axis is on a logarithmic scale. Based on this observation, we can still obtain a rough
estimate of the number of source signals by counting eigenvalues which lie significantly
above the straight line of noise eigenvalues. We will define quantitatively the meaning of
eigenvalues that lie significantly above the straight line of noise eigenvalues in Section 3.4.4.
Our proposed method can be also applied when the population noise covariance is arbitrary. This is best visualized by an example. Let's consider the two screeplots of Figure 3-4.
These two screeplots are obtained from two example datasets whose parameters are specified in Table 3.1.
Figure 3-4(a) is a sample screeplot of the first dataset in which the
population noise covariance G is an identity matrix. As we explained, the noise-dominated
eigenvalues form a straight line and the transition from noise-dominated eigenvalues to
signal-dominated eigenvalues can be hand-picked by finding the point where the screeplot
departs from the straight line that fits the noise-dominated eigenvalues. We marked the
point in the figure. The actual break is at Index = 15. This idea is also applicable to the
screeplot of Figure 3-4(b) , which is a sample screeplot of the second dataset in which the
population noise covariance is a diagonal matrix with the elements of 2/50,-..
, 2/2,
2. It
can be seen that the slope of the signal-dominated eigenvalues is more negative than the
slope of the noise-dominated eigenvalues. One can estimate the number of signal sources
by determining where the change in slope occurs.
Before we move on to derivations of upper and lower bounds of eigenvalues of Cxx, we
would like to emphasize that this method works best when the noise covariance is an identity
matrix. When the noise covariance is arbitrary, this method tends to underestimate the
number of source signals, as can be seen by comparing (a) and (b) of Figure 3-4. Therefore,
this method should be considered only as a way to obtain a rough estimate of the number
of source signals when G is arbitrary.
[Figure 3-4: Two screeplots of the datasets specified by Table 3.1, with the transition points between signal-dominated and noise-dominated eigenvalues marked]
3.4.3  Upper and Lower Bounds of Eigenvalues of C_XX
Consider an n × n covariance matrix C_XX = C_ZZ + C_εε and its eigenvalues λ_{X,1} ≥ ... ≥ λ_{X,n} ≥ 0. Assuming that C_ZZ is a rank-p matrix, the eigenvalues of C_ZZ are denoted by λ_{Z,1} ≥ ... ≥ λ_{Z,p} > λ_{Z,p+1} = ... = λ_{Z,n} = 0. The noise covariance C_εε is a diagonal matrix with the diagonal entries σ_{ε_1}^2, ..., σ_{ε_n}^2. We want to find an upper bound and a lower bound of λ_{X,i} as functions of λ_{Z,i} and the diagonal entries of C_εε. For future reference, the smallest diagonal entry of C_εε is denoted by σ_α^2 and the largest entry by σ_β^2. If we define C_{XX,α} = C_ZZ + σ_α^2 I and C_{XX,β} = C_ZZ + σ_β^2 I, their eigenvalues are, respectively,

λ_{X,α,i} = λ_{Z,i} + σ_α^2,  i = 1, ..., p;     λ_{X,α,i} = σ_α^2,  i = p+1, ..., n   (3.11a)
λ_{X,β,i} = λ_{Z,i} + σ_β^2,  i = 1, ..., p;     λ_{X,β,i} = σ_β^2,  i = p+1, ..., n   (3.11b)
In Theorem 3.1, we will prove that λ_{X,α,i} and λ_{X,β,i} are lower and upper bounds for λ_{X,i} for i = 1, ..., n.

Theorem 3.1  For λ_{X,α,1}, ..., λ_{X,α,n}, λ_{X,β,1}, ..., λ_{X,β,n}, and λ_{X,1}, ..., λ_{X,n} described in the previous paragraph,

λ_{X,α,1} ≤ λ_{X,1} ≤ λ_{X,β,1}
      ⋮
λ_{X,α,n} ≤ λ_{X,n} ≤ λ_{X,β,n}.   (3.12)

Theorem 3.1 states that the plot of eigenvalues of C_XX always lies between the plots of eigenvalues of C_{XX,α} and C_{XX,β}. We already saw an example of a screeplot which illustrates the meaning of the theorem. In order to establish Theorem 3.1, we need the following lemma.
Lemma 3.1.1  Let C_XX = C_ZZ + σ_k^2 e_k e_k^T, where e_k is the n-dimensional unit vector whose only nonzero entry is its kth element. Note that σ_k^2 e_k e_k^T is an n × n matrix whose only nonzero entry is its (k, k) element and the value of that entry is σ_k^2. Then λ_{X,i} ≥ λ_{Z,i} for all i = 1, ..., n.

Proof.

(a) λ_{X,1} ≥ λ_{Z,1}.  Let the orthonormal eigenvectors of C_XX and C_ZZ be h_{X,1}, ..., h_{X,n} and h_{Z,1}, ..., h_{Z,n}, respectively. Then

λ_{X,1} = h_{X,1}^T C_XX h_{X,1} ≥ h_{Z,1}^T C_XX h_{Z,1} = h_{Z,1}^T (C_ZZ + σ_k^2 e_k e_k^T) h_{Z,1} = λ_{Z,1} + σ_k^2 (h_{Z,1}^T e_k)^2 ≥ λ_{Z,1}.

(b) λ_{X,2} ≥ λ_{Z,2}.  Define a unit vector u as

u = c_1 h_{Z,1} + c_2 h_{Z,2},   (3.13)

where

c_1 = -(h_{Z,2}^T h_{X,1}) / sqrt( (h_{Z,1}^T h_{X,1})^2 + (h_{Z,2}^T h_{X,1})^2 ),
c_2 =  (h_{Z,1}^T h_{X,1}) / sqrt( (h_{Z,1}^T h_{X,1})^2 + (h_{Z,2}^T h_{X,1})^2 ).   (3.14)

Note that u is perpendicular to h_{X,1} because

u^T h_{X,1} = c_1 h_{Z,1}^T h_{X,1} + c_2 h_{Z,2}^T h_{X,1} = 0.   (3.15)

Then λ_{X,2}, the second eigenvalue of C_XX, is not smaller than λ_{Z,2} because

λ_{X,2} = max_{v ⊥ h_{X,1}, ||v|| = 1} v^T C_XX v ≥ u^T C_XX u = u^T (C_ZZ + σ_k^2 e_k e_k^T) u = c_1^2 λ_{Z,1} + c_2^2 λ_{Z,2} + σ_k^2 (u^T e_k)^2 ≥ λ_{Z,2}.

The method of (b) can be extended to prove λ_{X,i} ≥ λ_{Z,i} for any i = 3, ..., n.  □

Similar to Lemma 3.1.1 is Corollary 3.1.1.

Corollary 3.1.1  Let C_XX = C_ZZ - σ_k^2 e_k e_k^T. Then λ_{X,i} ≤ λ_{Z,i} for all i = 1, ..., n.
Now we are ready to prove Theorem 3.1.

Proof of Theorem 3.1.  We can write C_XX as

C_XX = (C_ZZ + σ_α^2 I) + Σ_{k=1}^n (σ_{ε_k}^2 - σ_α^2) e_k e_k^T.   (3.16)

Since each coefficient σ_{ε_k}^2 - σ_α^2 is non-negative, it follows from (3.16) and Lemma 3.1.1 that the eigenvalues of C_XX are greater than or equal to the eigenvalues of C_ZZ + σ_α^2 I, or

λ_{X,α,i} ≤ λ_{X,i}.   (3.17)

Similarly, C_XX can be written as

C_XX = (C_ZZ + σ_β^2 I) - Σ_{k=1}^n (σ_β^2 - σ_{ε_k}^2) e_k e_k^T.   (3.18)

From (3.18) and Corollary 3.1.1, the eigenvalues of C_XX are smaller than or equal to the eigenvalues of C_ZZ + σ_β^2 I, or

λ_{X,i} ≤ λ_{X,β,i}.   (3.19)  ∎
Figure 3-5 visualizes the meaning of Theorem 3.1. The screeplot of eigenvalues of C_XX lies in the narrow shaded area in Figure 3-5, where the lower curve is the screeplot of λ_{X,α,1}, ..., λ_{X,α,n} and the upper curve is the screeplot of λ_{X,β,1}, ..., λ_{X,β,n}. Of course, if the difference between σ_α^2 and σ_β^2 is large, the shaded area may not be so narrow. However, the difference is not larger than one in a normalized dataset. This implies that the screeplot will be changed only slightly even if noise variances are not uniform, and that the extent of change in the screeplot is bounded by the difference of σ_α^2 and σ_β^2.

[Figure 3-5: A simple illustration of lower and upper bounds of eigenvalues of C_XX. The smallest and the largest diagonal entries of C_εε are σ_α^2 and σ_β^2, respectively.]
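Theorem 3.1 is easy to check numerically. The following sketch uses a synthetic rank-p signal covariance and non-uniform noise variances chosen purely for illustration.

import numpy as np

# Numerical check of Theorem 3.1: the eigenvalues of C_XX = C_ZZ + C_ee are bracketed by
# those of C_ZZ + sigma_alpha^2 I and C_ZZ + sigma_beta^2 I, where sigma_alpha^2 and
# sigma_beta^2 are the smallest and largest noise variances.
rng = np.random.default_rng(8)
n, p = 20, 4
A = rng.standard_normal((n, p))
C_zz = A @ A.T                                   # rank-p signal covariance
noise_vars = rng.uniform(0.5, 1.5, size=n)       # non-uniform diagonal noise covariance
s_a, s_b = noise_vars.min(), noise_vars.max()

lam_x = np.linalg.eigvalsh(C_zz + np.diag(noise_vars))[::-1]
lam_lo = np.linalg.eigvalsh(C_zz + s_a * np.eye(n))[::-1]   # lower bounds (3.11a)
lam_hi = np.linalg.eigvalsh(C_zz + s_b * np.eye(n))[::-1]   # upper bounds (3.11b)

print("bounds hold for every index:",
      bool(np.all(lam_lo <= lam_x + 1e-10) and np.all(lam_x <= lam_hi + 1e-10)))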
3.4.4  Quantitative Decision Rule for Estimating the Number of Source Signals
The objective of this section is to develop a quantitative algorithm for obtaining an estimate
of the number of source signals. This is nothing more than a numerical implementation of
the qualitative method of determining p by examining an eigenvalue screeplot through
human eyes described in Section 3.4.2. This section is organized as follows. We will first
explain the method for determining the line which fits a pre-determined portion of noise
dominated eigenvalues. The line is called the noise baseline. Then we will give a definition
of eigenvalues which depart significantly from the baseline. Counting the number of these
eigenvalues will yield an estimate of the number of source signals.
Evaluation of the Noise Baseline
To determine the straight line which fits the noise eigenvalues, we should first determine which eigenvalues are noise eigenvalues. By definition, we cannot determine the entire set of noise eigenvalues, because otherwise we would not have to estimate the number of source
noise eigenvalues because otherwise we would not have to estimate the number of source
signals in the first place. Therefore, we first decide a priori on the eigenvalues which can
be safely assumed as noise-dominated eigenvalues.
This depends on the compressibility
of the individual dataset. For the thesis, we assume that the number of source signals is
always fewer than 0.4 times the number of variables. This assumption is made empirically
from many manufacturing datasets we worked on. We would like to emphasize that this
assumption may have to be modified for different datasets.
There is another consideration to make before we obtain the noise baseline. When m is
not sufficiently large compared to the number of variables, noise eigenvalues often decrease
fast near the end of a screeplot. For the purpose of obtaining the noise baseline, these fast-decreasing eigenvalues should not be included. For this thesis, we assume that the steep decrease in noise eigenvalues does not start until the eigenvalues in the smallest 40 percent. Combining this assumption with the one made in the previous paragraph, we arrive at the conclusion that the middle 20 percent of eigenvalues are noise-dominated eigenvalues and they are free from the decrease in value caused by an insufficient sample size. From these
eigenvalues, we deduce the noise baseline.
The actual process to determine the noise baseline is rather simple. Let λ_{l1}, ..., λ_{l2} represent the middle 20 percent of eigenvalues, where l1, ..., l2 represent the indexes of those eigenvalues. Noting that the vertical axis of a screeplot is in general on a logarithmic scale, we can set up a linear equation between the index i and λ_i as

log_10 λ_i = α · i + β + e_i,   i = l1, ..., l2,   (3.20)

where e_i represents the statistical error term. Using the standard least squares linear regression criterion, we can obtain estimates of α and β from

(α̂, β̂) = argmin_(α,β) Σ_{i=l1}^{l2} (log_10 λ_i - α·i - β)^2,   (3.21)

and the noise baseline is given by

log_10 λ̃_i = α̂·i + β̂,   i = 1, ..., n.   (3.22)
In Figure 3-6, we redraw Figure 3-4 with the noise baselines. Note that the two noise baselines follow the noise dominated eigenvalues closely. The divergences between the baselines
and eigenvalues occur at the transition points indicated in Figure 3-4.
[Figure 3-6: Repetition of Figure 3-4 with the straight lines which best fit the noise-dominated eigenvalues]
Determination of the Transition Point
The number of source signals is estimated by the number of signal-dominated eigenvalues.
The signal-dominated eigenvalues are those diverging significantly from the noise baseline.
Here we would like to define quantitatively the term significantly.
Let λ_1, ..., λ_n be the eigenvalues of the covariance matrix of X and λ̃_1, ..., λ̃_n be the corresponding values on the noise baseline. Let L_i denote the difference between the logarithmic values of λ_i and λ̃_i, that is,

L_i = log_10 λ_i - log_10 λ̃_i.   (3.23)
Then λ_i is defined as a signal-dominated eigenvalue if the following conditions are met:

1. L_i has to be bigger than a pre-determined threshold. We set the threshold to be 20 times the mean absolute deviation of the baseline-fit eigenvalues from the baseline:

20 · (1/(l2 - l1)) Σ_{l=l1}^{l2} |L_l|.   (3.24)

2. There should be only one transition point between signal-dominated eigenvalues and noise-dominated eigenvalues. Therefore, for λ_i to be a signal-dominated eigenvalue, λ_1, ..., λ_{i-1} should all be signal-dominated eigenvalues.

If these two conditions are met, λ_i is regarded as a signal-dominated eigenvalue. The number of source signals is estimated by counting the number of signal-dominated eigenvalues determined this way. Admittedly, this method is somewhat ad hoc. However, it has provided a good estimate of the point of divergence over many experiments we conducted.
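The baseline fit (3.20)-(3.22) and the transition rule (3.23)-(3.24) combine into a short routine. The sketch below follows the description above under the 40/20-percent assumptions; the middle-percentile indices and the factor of 20 are taken from the text, while everything else (names, toy data) is illustrative.

import numpy as np

def estimate_signal_order(eigvals):
    """Fit a straight noise baseline to the middle 20 percent of the log-eigenvalue
    screeplot, then count the leading eigenvalues exceeding the baseline by more than
    20 times the mean absolute residual of the fit, stopping at the first non-exceedance."""
    lam = np.sort(eigvals)[::-1]
    n = len(lam)
    log_lam = np.log10(lam)
    idx = np.arange(n)
    l1, l2 = int(0.4 * n), int(0.6 * n)            # middle 20% assumed noise-dominated
    alpha, beta = np.polyfit(idx[l1:l2], log_lam[l1:l2], 1)
    baseline = alpha * idx + beta                  # noise baseline (3.22)
    L = log_lam - baseline                         # deviations (3.23)
    threshold = 20.0 * np.mean(np.abs(L[l1:l2]))   # threshold (3.24)
    p_hat = 0
    for Li in L:                                   # only one transition point allowed
        if Li > threshold:
            p_hat += 1
        else:
            break
    return p_hat

# toy usage on a dataset resembling example (a) of Table 3.1
rng = np.random.default_rng(9)
m, n, p = 3000, 50, 15
A = rng.standard_normal((n, p)) * 3.0
X = rng.standard_normal((m, p)) @ A.T + rng.standard_normal((m, n))
eigvals = np.linalg.eigvalsh(X.T @ X / m)
print("estimated number of source signals:", estimate_signal_order(eigvals))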
Test of the Algorithm
To examine the performance of the proposed method for source number estimation, we
applied the algorithm for estimating the source signal number to the two datasets of Table 3.1. Both datasets are subject to computer-generated additive independent noise. The
noise covariances are an identity matrix for the first dataset and the non-identity diagonal matrix σ² D(1/50, 1/49, ..., 1) for the second, where σ² is a scalar chosen to equalize the SNR of the two datasets. We repeat the simulation 200 times and obtain 200 estimates of the source signal number for each dataset. The results are presented as histograms in Figure 3-7. The mean values of the estimated source number are 11.31 for (a) and 8.64 for (b).
[Figure 3-7: Histograms of the estimated number of source signals returned by the proposed method; the datasets of Table 3.1 are used in the simulation]
Considering that the true p is 15 and that the transition points of Figure 3-4, which were chosen by human judgment, are p̂ = 11 and p̂ = 9, respectively, we conclude:
1. The estimated source signal number returned by the proposed method is not much
different on average from the estimated source signal number determined by a human
eye from a screeplot.
2. The estimated source signal number returned by the proposed method is smaller than
the true p on average.
The magnitude of underestimation is small when the noise
covariance is an identity matrix.
3.5
Noise Estimation by Expectation-Maximization (EM) Algorithm
The objective of this section is to derive the step-by-step algorithm for computing maximum-likelihood (ML) estimates of the noise variances. It is also of interest to recover the actual time sequences of the noises. From the noisy data model of Figure 3-1, the observed noisy vector X can be written in terms of the source signal vector and noise:

X = A P + G^{1/2} w,   (3.25)

where w is a vector of unit-variance noises and G is the n × n diagonal matrix such that ε = G^{1/2} w. The diagonal elements of G are the unknown noise variances. We want to
compute the ML estimates of those diagonal elements from m samples of the noisy vector
X. For now, we assume that the number of the source signals is known to us. When this
value is not known, we can use an estimate of it obtained through the method explained in
Section 3.4. We will discuss this case in detail in the next section.
In this section, we adopt the Expectation-Maximization (EM) algorithm to obtain iteratively the ML estimates of noise variances. The general description of the EM algorithm
is first presented in [4].
Although it is often referred to as the EM algorithm, it is more
of a strategy than a fixed algorithm to obtain ML estimates of unknown parameter(s) of
a dataset. An actual algorithm based on the EM strategy should be developed in detail
when a specific problem of interest is defined. We will define the problem and derive the
EM-based iterative algorithm in this section. A similar approach for a simpler case was
studied in [5].
3.5.1
Problem Description
In (3.25), A and the diagonal matrix G are unknown but fixed parameters, and P and W
are random vectors. Let a symbol 6 denote all unknown parameters in (3.25), that is,
6 = {A, G}.
(3.26)
Let a vector U E RT+P denote
X
U=
,(3.27)
P
and let fu (U; 6) denote the probability density function of the vector U given the parameter
0. Then the ML estimate of 0 based on m independent samples of U can be written as
6
ML =
argnax
fu (U(t);6)
0
66
(3.28)
Since log(-) is a monotonically increasing function, we can replace the probability density
function in (3.28) with the logarithm of it without affecting
6
ML
6
ML-
=
argnax (log Hfu (U(t); 6)
=
argnax 1log
(3.29)
fu (U(t); 6)
(3.30)
t=1
Invoking Bayes' rule, we have
fu (U(t); 6) = fp (P(t); 6) -fxIP (X(t)IP(t); 6) ,
(3.31)
and by taking the logarithm of (3.31), we have
log fu (U(t); 6) = log fp (P(t);6) + log fxIP (X(t)IP(t);6)
(3.32)
Assuming P and w are two independent jointly Gaussian vectors with zero mean and an
identity covariance matrix, the two components on the righthand side of (3.32) can be
expressed as
log fp (P(t); 6)
=
(
log
(log
1p2
(t))
)
P2(t)
(4k)7-
log(27r)
-
-
(3.33)
(3.34)
(
(3.35)
(Xi(t)Aip(t))2
(3.36)
-
and
log fxIP (X(t)IP(t); 6)
log
i=1
'1
(log(
2r G i)
2Gi
(Xi(t) - AiP(t))2)
1
-2 log (27rG ) - 2Gi (Xi(t) - AiP(t) )2
67
(3.37)
(3.38)
where Gi is the ith diagonal element of G and Ai is the ith row of A. Combining (3.35)
and (3.38) with (3.32) and substituting the result to (3.30) yields
(-
OML = argmx
log(27r) - IZPf
t=1
(t)
log (27rGi) +
2
-
i=1
(Xi(t) - AiP(t))2
i=1
(3.39)
The first two terms in (3.39) can be omitted because they do not depend on any unknown
parameters, and thus do not affect the maximization over the unknown parameters. The
ML estimate of 6 then yields
(Xi(t) - AiP(t))2
(ML
log (27rGi) +
argmax
(3.40)
0 t=1 i1
=
(log (27rGi) +
argmin 1
0t=
(X (t) - AiP(t))2
(3.41)
1 1i
=
argmin
m log (2r) + m log (Gi) + 1 (X, - PAT)T (X
=
argnin
m log (Gi) ±
(X, -
=
argmin
m log (Gi) +
(XTXz - 2XTPAT + AiPT PAT)).
+
PAT)
(Xi
-
PAT))
PA)
-
(3.42)
(3.43)
At this point, one should note that the evaluation of the argument of (3.43) requires the
source signal matrix P E Rmxp. Since it is not available, the direct maximization of (3.43)
over 6 cannot be carried out. The matrices P and X are called the complete data in the sense
that OML would have been obtainable if they had been available. By comparison, X is called
the incomplete data. The EM algorithm proposes that P and pTp in (3.43) be replaced by
expected values of P and pTp given X and a current estimate of OML, which we denote as
0(l) to emphasize that it is the lth iterative estimate of OML (Expectation step), and that
the solution of (3.43) be considered as the (I + 1)th estimate of
6
ML (Maximization step).
This two step approach is the origin of the name expectation-maximization algorithm. If
there is no other local extremum in the argument of (3.43), it is true that
lim 6() = OML
1-00
68
(3.44)
In cases of multiple local extrema, a stationary point may not be the global maximum, and
thus several staring point may be needed as in any "hill-climbing" algorithms.
3.5.2
Expectation Step
The expectation step comprises the two conditional expectations E [PIX; 6(,)] and
E [PTPIX; 6()].
We will first focus on the computation of E [PIX; 6(,)].
Due to instan-
taneity between X and P, this conditional expectation can be further simplified to the
computations of E 1P(k)IX(k); 0()] , k = 1,...,m, where P(k) = [Pi(k),-
X(k)
=
[Xi(k),--
,,X (k)1T. If we define the vector w(k)
=
[wi(k), -
,Pp(k)]T and
,W (k)]T, the rela-
tionship between the three vectors can be written as
X(k) = AP(k) + G 1/ 2 W(k).
(3.45)
For notational simplicity, we will drop the time index k for now. To compute E
we need to obtain the probability density function fpix (PIX; 6(,)).
[PIX; 6(,)]
Invoking the Bayes'
rule, we have
fPix (PIX; 6(1(,)=
f x (x;6(i))(3.46)
fx (X; 0()
For notational simplicity, we will rewrite (3.46) as
f (Pjx; 6(,))
ff
=
f (XP; 6(,)) f (P; 6(l))
(xX;6((l))=(3.47)
f (X; 0(1))
Recalling that X, P, and w are all jointly Gaussian, the three components in (3.47) are
expressed as
f
(XP; (,) =
C e(X AP)G
f (P;6(1))
=
C 2e-2
f
=
C 3 e-ixTA )A+G(l)
(X;0(1))
69
(X-A)
(3.48a)
(3.48b)
X
(3.48c)
where C1,
C2 and C3 are scalar constants which are irrelevant in further computations.
Substituting (3.48) into (3.47) yields
f (PIX;
())
= C 4 e-- .H
(3.49)
where C4 = C1C2/03 and
H
=
XT G5X
-
XT G- A(I)P - PT AT G- 1 X + PTA TGA(I)P
+ pTp
-
XT (Aml)A T + G(l)
(A
SPT
G-A(l) + I) P
+ XTG- X
=
-
A
W
(P-
XTG-
where W(i) = AT G
(1)
1
(1)
-
(3.50)
-XTG-A(l)P
-PTA
XT (A)AT + G-1
G- XW()
A(l)W
X
(P
- W
T
G
1
X
(3.51)
X
A T)G
A- G( X+XTG-X
-
X) T
XT(A(l)A() + G(l)
X, (3.52)
A(i) + I. From this result, we can rewrite (3.49) as
I
PCWe
=
f (PIX;())
AT G
X
WPW-ATG(I)
(1)
1
(1)
T
X
(1)
)
(3.53)
where
IXT
05
=
C4 e 2
(
G-AI)W-1A
G
1
1 ()
1
G- + A(I)A+G(j)
(1L
G +1(A
X.
(3.54)
Recalling that P given X and 0(j) is a Gaussian vector, we can simplify (3.53) into
f
(PIX; 0(,)) = N (W- A)G-X, W-1).
(3.55)
Now we have the expression for the first conditional expectation:
E [PIX; 0(l)] = W
AT)G
X
(3.56)
or equivalently,
E [PIX;
()] = XG- A()W.
70
(3.57)
Deriving the expression for the second conditional expectation of interest E [PTPIX; 6(1)
is our next task. First we observe that P = [P(1)-I - -P(m)]T
E rPTPIX;8(1)] = E
from which we can write
P(k)P(k)TIX;6()1
(3.58)
.k=1
Once again, instantaneity between X and P simplifies the problem of obtaining (3.58) into
obtaining E [PPTIX; 0(t)] in which we omit the time index for simplicity. Invoking the
definition of covariance matrix,
E [ppTIX; 0(,)] = EpIx;6(L) + E [PIX; () ET [PIX; 6()]
where E
(3.59)
denotes the covariance of P given X and 0(1). From (3.55), we have
(3.60)
W-.
X
Substituting (3.60) and (3.56) into (3.59) yields
E PPTjX;
= W- +W
A TG-XXT
G
A(,)W-
(3.61)
From this, we obtain the second conditional expectation of interest:
E [PTP IX; 0] = mW-f + W-A TG
3.5.3
XTXG-
A(l)W-
(3.62)
Maximization Step
After the two conditional expectations (3.56) and (3.62) are computed, the maximization
step is carried out. The maximization step is to search for the values of unknown parameters
0 which maximize the expression in (3.43). With a simple modification of (3.43), the (l+1)th
estimate of OML can be written as
n
0(1+1) = argnin
Q(l)(Gi, Aj)
i41
71
(3.63)
where
Q(i) (Gi, Aj)=
(X='Xi
-
2XE [P IX; (1)] A[ + A 2 E PTPIX;
(1)
AT) + m log Gi.
(3.64)
To find the values of Gi and Ai at which Q(I) (G2 , Ai) is minimized, the partial derivatives
of (3.64) with respect to Gi and Ai are set to zero:
o9Q(j) (Gi, Aj)
aAj
Ai,(,+,),
= 0,
i =1
n,
(3.65)
= 0,
1 = 1, ..., n.
(3.66)
Gi,(,+,)
'Q(i) (G, Ai)
aGi
The solutions to these equations constitute 0(1+1). From (3.65),
-2X[E [P IX; 6(j)] + 2Ai,(l,+)E [pTpIX; (1)] = 0
(3.67)
Solving (3.67) for Ai,(+1) yields
Ai,(,+1) = X[E [PIX;0(j)] (E [pTpIX; 0()
(3.68)
The estimate of the entire mixing matrix is then
A(1+1) = XT E [PIX;0(,)] (E [PTPIX;0(l)]
1
(3.69)
Similarly, from (3.66) we have
mGi,(l,+) - (XTXi - 2X[E
[
IX; 6(L)] A (1 + 1) + Ai,(+l+)E PTP IX; 6(1) AT( 1+1 ))
= 0,
(3.70)
which becomes
Gi,(,+)=
(XfXi - 2X[E [PjX; 6(,)] A[(,+ )
1
+ Ai,(+)E
PTPX; 6()
A,(+1))
(3.71)
Combining this with (3.68), we get the second set of maximization equations,
Gi,(+)
=
1 (X TXj - Ai,(+l+)ET [PIX; 6(1)] X,)
72
i =1, - - ,n.
(3.72)
3.5.4
Interpretation and Test of the Algorithm
Figure 3-8 illustrates the step-by-step operations of the EM algorithm for blind noise estimation.
The algorithm takes in the noisy data X and the number of source signals p
as input. The algorithm then generates A(,) E R"XP and G(1 ) E R"'X,
initial guesses of
unknown parameters. The only restriction imposed on the initial guesses is that G(1) is
a diagonal matrix. The number of the noisy variables, n, is acquired from the number of
columns of X. In the expectation step, C(i) and D(j) are computed from the two equations.
Recall that W(i) = AT G
A(l) + I. In the maximization step, the unknown parameters
are updated to yield A(2 ) and G(2 ). If another iteration is needed, A( 2 ) and G( 2) are fed
back to the expectation step to compute C( 2) and D( 2).
Typically the total number of
iterations is pre-determined, but one can simply decide to stop the iteration if changes of
the unknown parameters are negligible after each iteration. Let r, be the total number of
iterations and G and A denote the final updates of the unknown parameters. Then each
diagonal element of G is the estimated noise variance of the corresponding noisy variable.
Furthermore, E [P jX;
A, ]
AT is an estimate of the time sequence of the noise-free data
matrix Z. Therefore, X - E [P IX;
A, ] AT
represents the estimated noise sequences.
To illustrate the effectiveness of the EM algorithm we applied it to the two examples
of noisy multivariate datasets specified in Table 3.1. Figure 3-9 provides the results of the
blind noise variance estimation using the EM algorithm. In the figure, the noise variances
blindly estimated by the EM algorithm are compared to the true noise variances. For the
first dataset, the true noise variances are unity for all variables, and the minimum value of
estimated noise variances is 0.78 and the maximum value is 1.10. For the second dataset,
the true noise variances are 2/50,2/49, -. , 2/2, 2/1. The estimated noise variances again
follow the true variance line very closely. As for the second dataset, if we want to apply
NAPC to the dataset but we do not have the noise variances a priori, we can acquire the
unknown noise variances first from the EM algorithm and then normalize variables by the
estimated noise variances.
Figure 3-10 and Figure 3-11 provide another capability of blind noise estimation using
the EM algorithm. Figure 3-10 illustrates some selected noise time-sequences obtained as
one of the output of the EM algorithm. Surprisingly, the estimated noise sequences follow
the true noise sequences very closely. Recalling that all we knew in the beginning were the
73
p (number of source signals)
---------------Initial Guess
A(,), G(i)
Expectation step
X
- C(i) = E [PIX; A(,), G()
= XG
- D(i) = E PTPIX; A(l), G(l)
+WlAT
=
4-
A(I)W-
mW
1
GlXTXGlA(,)W
Maximization step
- A(1+1) = XTC(1)D-l
- Gi,(+1) = -
(XyX
- Ai,(l+)CTX
1,---,n
YES
More Iteration?
Increase I by 1
NO
6
Estimates of
noise variances
X - E PIX; A, G]AT
Estimates of
noise sequences
Figure 3-8: Flow chart of the EM algorithm for blind noise estimation.
74
True Noise Variances
Estimated Noise Variances
(a)
2
2-
1.5Ca
CZ15
Co
o
zC0.5
o
.5An~
V
.0
0.5
0
(b)
2.5
0.5
0
10
20
30
40
0
0
50
Index
10
20
30
40
50
Index
Figure 3-9: Estimated noise variances for two simulated datasets in Table 3.1 using the EM
algorithm
dataset X and the fact that there are p source variables, this result brings the possibility of
filtering independent measurement noise from noisy variables into reality. Figure 3-11 also
shows similar results.
3.6
An Iterative Algorithm for Blind Estimation of p and G
Up until now, we developed two estimation algorithms, one for p and another for G. In
Section 3.4 we developed a quantitative method to obtain an estimate of the source number
when G is an identity matrix and extended the method to the cases where G is not quite
an identity matrix. We illustrated through examples that P is close to true value p if G is
either an identity matrix or a diagonal matrix whose elements remains in a limited range.
The result becomes distant from the true p on average as G gets far from an identity matrix.
To obtain a good estimate of p, therefore, it is desirable to have an identity noise covariance
matrix. In Section 3.5, we developed an computational algorithm to estimate the noise
variances.
A simulation demonstrates the capability of the algorithm for estimating the
noise variances accurately. The catch is that the algorithm requires the value p as one of
its inputs, which is not available a priori in general.
In this section, we address the problem of simultaneous estimations of the source signal
number and noise variances. Our approach to the problem is to decouple the joint estimation
of p and G into iterations of sequential estimations of p and then G. First, we obtain an
75
Noise
of 1 st variable
Noise of 11th variable
2
KI
41
I"
~
A
0
d I I
-1
10
20
30
40
50
0
10
20
'
~
* 21 '4 \ /V.1k'
-2
0
ik
j~i
30
Time
Time
Noise of 21st variable
Noise of 31st variable
I
i
40
50
40
50
Actual Noise Sequence
Estimated Noise Sequence
2
~~-
A4
1
**
~
0
-1
IfI
AJ\ILI
N1
-2
0
10
20
30
40
-3
50
0
10
Time
20
30
Time
Figure 3-10: A few estimated noise sequences for the first dataset in Table 3.1 using the
EM algorithm
estimate of the source signal number which may be a poor estimate at the first iteration.
This estimate is then sequentially fed to the EM algorithm with X for estimates of the
noise variances. Intuitively, this approach seems chicken-and-egg : estimating G requires
p, but p cannot be estimated accurately unless G is either an identity matrix or close to
it. This may be in fact true during the first iteration. As the iteration continues, however,
improvement in estimation of p leads to better estimation of G, which in turn further
improves the estimation of p.
3.6.1
Description of the Iterative Algorithm of Sequential Estimation of
Source Signal Number and Noise Variances
Figure 3-12 illustrates the schematic block diagram of the proposed iterative algorithm for
the estimation of the source signal number and noise variances. Let X E Rmn denote a
76
Noise of 1st variable
Noise of 11Ith variable
3
2
-
Actual Noise Sequence
Estimated Noise Sequence
2
1
-\
0
,
-1
-
-/
-1
-2
-2
0
10
20
30
40
-3
50
0
10
20
Time
Noise of 21st variable
50
3
2
2
A/
1
-
-1
0
-
-1
IIIJ
-2
-2
10
20
30
40
-3
50
Time
A
1
'Fl
1
0
0
40
Noise of 31st variable
3
-3
30
Time
A
-v
0
Ni
1~
~1
-
10
20
30
40
50
Time
Figure 3-11: A few estimated noise sequences for the second dataset in Table 3.1 using the
EM algorithm
77
noisy data matrix. In the proposed algorithm, the data matrix X is multiplied by the inverse
of the square root of the estimate of the noise covariance matrix obtained in the previous
iteration.
For the kth iteration, the estimate of the noise covariance from the previous
iteration is denoted by
G(k-1).
The result(-)
of the multiplication, denoted by X(k) = Xd-1/2
(k-1)
and referred to as the kth noise-normalized dataset, constitutes our new dataset in which
variables are normalized by the current best estimate of noise variances.
For the first
iteration, we initialize G(o) = I. Therefore, the first noise-normalized dataset is the same as
the original dataset. Once the kth noise normalized dataset is obtained, the source signal
number is estimated by the method described in Section 3.4.
This kth estimate of the
source signal number is fed with X to the EM algorithm as if it is the true p to obtain the
kth noise covariance estimate G(k). Note that it is not the noise normalized dataset X(k)
but the original X that is fed to the EM algorithm.
If another iteration is necessary, this G(k) acts as the previous noise covariance estimate
to yield the new noise normalized dataset. The ultimate estimate of the G and P are equal
to G(kf) and G(kf), respectively, where k1 denotes the final iteration. We will name this
algorithm as ION, which stands for Iterative Order and Noise estimation. Table 3.2 is the
detailed step-by-step procedure of the ION algorithm.
3.6.2
Test of the ION Algorithm with Simulated Data
While the estimates of p and G after the first iteration of the ION algorithm might be poor,
they become more accurate after each iteration. This is best visualized by an example. The
test in this section focuses on illustrating the improvement of the ION algorithm after
each iteration. For this end, we again use the example datasets of Table 3.1. To compare
the accuracy of noise variance estimates in each iteration, we use the cumulative squared
estimation error (CSEE), or
n2
CSEE(j)
(G
=
- Gj,(i)).
(3.73)
j=1
In Figure 3-13, we show the results of applying the ION algorithm to the simulations
defined in Table 3.1. After the first iteration, the estimated number of source signals was
8, and based on this value the EM algorithm returns estimated noise variances. The CSEE
after the first iteration stood at 1.09. In the second iteration, first the variables of X were
78
x
Estimation of
source signal number
j (k)
EM
L(k)P(k)iG(k)
E [PIX; A-, G
More Iteration ?
YES G(k)
Increase k by 1
NO
[PIX;
X-E
-
G
Estimates of
noise variances
Estimate of
number of
source signal
,G] AT
Estimates of
noise sequences
Figure 3-12: Flow chart of the iterative sequential estimation of p and G.
79
"
Initialize:
-+ k = 1,
G(0)
I.
" Order estimation for the kth iteration:
-X(k)
=k-X
-+
Compute the sample covariance matrix of X(k). Call it SXX,(k)-
-
Compute the eigenvalues of SXX,(k).
-
Obtain the current estimate P(k) of the source signal number from the
method described in Section 3.4.
They are labeled as A1,(k),--
, An,(k)
" Noise variance estimation for the kth iteration:
-+
Compute the current estimate 6(k) of the diagonal noise covariance matrix through the EM algorithm described in Section 3.5.
" If another iteration is required:
-+
Increase k by one. Go to the order estimation step
" After the final iteration:
+ = G(kf) and P = P(kf). In addition, if desired, the estimated noise
sequences can be obtained as a by-product of the EM algorithm from
S= X - E[PIX;,
6]AT
Table 3.2: Step-by-step description of the ION algorithm
80
divided by the square roots of the corresponding estimated noise variances, and the result
was fed to the order estimation algorithm. The resulting estimated number of source signals
increased to 13, which was closer to the true value of p = 15. The estimated noise variances
based on this value had CSEE of 0.061.
In the third iteration, the estimated number of
source signals reached 14, and the corresponding estimated noise variances had CSEE of
0.059. After the third iteration, the estimated number of source signals remained at 14,
and did not change in subsequent iterations.
The ION algorithm has many applications areas, some of which have been widely used
for a very long. In the next chapter, we will discuss a few applications which may have
potentially important implications such as least-squares linear regression and noise filtering.
We will quantify performances of those applications enhanced by the ION algorithm and
compare them with performances without the ION algorithm. It will show that performances of those applications may be enhanced significantly by adopting the ION algorithm
as a pre-processing.
81
First Iteration
Order Estimation by noise baseline
5
10
P( 1 8.
104
Estimated noise variances vs. True noise variances
.
-
2
10
1.5
102
. . . .
:11V
/1
10
.... . . .. . . . . . . ... . . .
0.5
10
10~
0
10
20
30
40
U-
0
5C
Second Iteration
20
30
40
50
........ .... . .
'4..
.
Index
105
14
10
Index
.
2
P( 2)
A. . . ..
=
13
2
10
1.5
10
10
10-1
.....
.....
.....
1003
102
. ... .. . .. ...
1
q7
..
-...
-
20
-
10
0
20
30--
40-50
30
40
.
0.5
00
50
10
20
Index
Third Iteration
..
. .....
40
30
50
Index
105
3 ) =14
-....... P(-........
14
-...--..
2
102
1.5
C1 5
102
101
....
.. .
r-. . .. .
. ..
. . . . .
z
0.5
- .
100
10-
.......
U)
0
10
20
30
40
U,-
0
50
Index
10
20
30
40
50
Index
Figure 3-13: The result of the first three iterations of the ION algorithm applied to the
second dataset in Table 3.1
82
Chapter 4
Applications of Blind Noise
Estimation
4.1
Introduction
The major motivation of the development of the ION algorithm was to obtain the noise
variances of individual variables so that each variable can be normalized to unit noise
variance before the PC transform is carried out.
The combined operation of the noise
normalization and the PC transform is called the Nose-Adjusted Principal Component
(NAPC) transform. The Blind-Adjusted Principal Component (BAPC) transform is the
NAPC transform in which the noise normalization is performed with the noise variances
retrieved by the ION algorithm.
The implication of the capability of retrieval of the Gaussian noise variances is very
broad. A zero-mean white Gaussian noise vector can be wholly characterized by its covariance matrix. Therefore, if the elements of the noise vector is known to be independent
so that the covariance matrix is diagonal, then the retrieval of the noise variances indeed
represents the complete characterization of the noise vector. The retrieved noise characterization may be then used for many traditional multivariate data processing tools which
require a prioriknowledge of noise statistics.
Until now, it is not an uncommon practice to analyze a dataset with noise of unknown
statistics as if the dataset is noiseless. The goal of this chapter is to suggest a few applications in which ignoring existence of noise is a common practice and to show the extent
83
of performance gain obtainable by the adoption of the retrieved noise statistics in data
analysis. The performance - gain or loss - will be measured by relevant metrics which will
be defined for each application. In evaluating the effectiveness of the applications of the
ION algorithm, we use multiple computer-generated examples which are carefully designed
to be representatives of practical multivariate datasets.
In Section 4.2, we apply the ION algorithm to linear regression problem in which the
multivariate predictor variables are corrupted by noise with unknown noise statistics. As we
discussed briefly in Chapter 2, traditional least squares regression does not make much sense
when the predictor variables are subject to noise. In Section 4.2, the ION algorithm combined with the NAPC filtering is suggested for eliminating noise in the predictor variables to
enhance the regression result. The NAPC transform after the ION algorithm retrieves the
noise variances is designated Blind noise-Adjusted Principal Component (BAPC) transform.
Section 4.3 addresses the application of ION to noise filtering. In addition to the noise
variances and signal order, the ION algorithm retrieves estimates of noise sequences.
In
Figure 3-12, the estimated noise sequences are shown at the bottom of the figure as e
=
X - E [PIX;
,
]
T.
An ION filter refers to a simple operation of subtracting this
estimated noise sequences from the noisy dataset. We evaluate the ION filter as a function
of the sample size m and the signal order p. In addition, the ION filter is compared with the
Wiener filter, the optimal linear filter in the least squares sense. Performance is measured in
terms of SNR (dB). PC transform of the ION-filtered dataset is designated Blind Principal
Component (BPC) transform and discussed in Section 4.4.
4.2
Blind Adjusted Principal Component Transform and Regression
One of the assumptions on which desirable properties of least squares linear regression
depend is that the predictor variables are recorded without errors. When errors occur in
the predictor variables, the least squares criterion which accounts for only errors in the
response variables(s) does not make much sense.
Still, least squares linear regression is
commonly used for noisy predictors [36, 37] because 1) alternative assumptions that allow
for errors in the predictors are much more complicated, 2) no single method emerges as the
agreed-upon best way to proceed, and 3) least squares linear regression works well except
84
in the case of extremely large predictor noise. Moreover, least squares linear regression
predicts future response variable well as long as future predictors share the same noise
statistics with the predictors of the training set [37]. However, this argument is true only
when there are infinitely large sample size. For most practical situations where the sample
size is small, noisy predictors can affect linear regressions.
In this section we discuss linear regressions for noisy predictor variables and propose
the ION algorithm as a potential method to alleviate distortions caused by noises in the
predictor variables. In section 4.2.1 we derive an expression for mean squared error (MSE)
when traditional least squares linear regression is carried out for noisy predictor variables.
4.2.1
Analysis of Mean Square Error of Linear Predictor Based on Noisy
Predictors
Let X E R' be the vector of n noisy predictor variables defined as
X = Z+ E
(4.1)
where Z is the vector of noiseless variables and e is the noise vector. Let the response
variable Y C R be modeled as a linear combination of Z 1,.-,
Z,, corrupted by independent
zero-mean Gaussian random noise e:
Y = gTZ +
(4.2)
In linear prediction, it is of interest to obtain a linear combination of X 1 ,---
, X"
such
that the value of the linear combination follows Y closely. The mean-square error (MSE),
defined as
MSE (Y(X))
= E (Y-
(X))],
is often used to measure how well a particular linear combination, denoted by
(4.3), follows Y. Let
(4.3)
Y(X) in
Y*(X) denote the linear combination which minimizes MSE (V(X)).
In the case of jointly Gaussian zero-mean random vectors it is well known that
f*(X)
is
linear in X [6];
Y*(X) = LyxX
85
(4.4)
X1
Z1
Zn
Xn
y
I
Figure 4-1: Model for the predictors X1, --- , X, and the response variable Y.
where Lyx satisfies
LyxCxx = Cyx.
One can write the mean square error for
MSE (f*(X))
=
E [3Z
=
(,3T
(4.5)
Y*(X) as
+e
-
CyxCjx (Z + 6)
CyxC-1)Czz (8T -CyxC1)
+CyxC-1GC-ICyx
(4.6)
T
+or
(4.7)
where (.)-1 denotes the inverse or pseudo-inverse of (.) depending on its invertibility. Recalling that Z = AP where P is the vector of independent source signals, one can write
Cyx and Cxx as
Cyx = Cyz =I3TCzz = /TAAT
(4.8)
Cxx = Czz + G = AA T + G
(4.9)
86
Replacing Cyx and Cxx in (4.7) yields
MSE (Y*(X))
=
AT
I - (AAT + G)AAT) 8
2+or+
G 1/ 2 (AAT + G
1AAT
(4.10)
where ||
denotes the norm of a vector. It is clear from (4.10) that MSE (Y*(X))
smaller than aT2. Note that o, would be the mean-square error of Y*(X) if e
lim MSE
(v*(X))
=
is never
0:
(4.11)
= 0,.
Equations (4.10) and (4.11) indicate that increase in the measurement noise variance a2
causes corresponding increase in the mean-square error of
4.2.2
Y* (X).
Increase in MSE due to Finite Training Dataset
In deriving (4.10), we assumed that the population covariances Cxx and Cyx were available
for the computation of
Y*(X).
In practice, these population covariances are generally not
available and and replaced by the sample covariances Sxx and Syx, respectively. Since the
sample covariances are computed from the finite training dataset, discrepancies between
the sample covariances and the population covariances are inevitable. The discrepancies
between the population and the corresponding sample covariances should be negligibly
small when the sample size is large. In that case, the MSE resulting from usage of the
sample covariances should not be much different from (4.10). On the contrary, when the
sample size m is not large enough compared to the number of predictor variables, denoted
by n, the errors in the sample covariances could increase the MSE noticeably from (4.10).
Let's run an example as an illustration of increase in MSE due to the finite sample
size. In the example, the vector X consists of 50 variables, X 1 , - - -, X 5 0 . The measurement
noise vector e is a zero-mean Gaussian random vector with covariance G being an identity
matrix. The 50 variables of the noise-free vector Z are linear combinations of 10 source
signals.
The source signals are zero-mean independent Gaussian random variables with
unit variances. The response variable Y is a linear combination of 50 variables of Z plus
the statistical noise c.
The variance of E is set to be 0.5. To show the relation between
the sample size and the performance of of linear prediction, we generated seven training
datasets with sample sizes of 60, 70, 80, 90, 100, 200 and 500. For each training set, one
87
10
Noisy predictors
Noiseless predictors
.---
10
0O
100
150
20
250
30
Size of trsining set
350
400
450
500
Figure 4-2: The effect of the size of the training set in linear prediction
prediction equation in which
Y* (X)
is expressed as a linear combination of variables in
X is computed from linear regression. Substituting 3000 samples of the predictors of the
validating set to the prediction equation will give 3000 samples of Y* (X) whose differences
from the corresponding 3000 samples of Y are averaged to yield the MSE. We will refer to
the MSE obtained in this way as the sample MSE (SMSE). For notational convenience, we
will call MSE of (4.10) the population MSE (PMSE). Let's define the discrepancy factor 'y
as the difference between SMSE and PMSE normalized by PMSE, or
SMSE
S=
-
PMSE
PME
_SMSE
=PS
- 1.
(4.12)
Note that -y is a measure of performance degradation of linear regression due to finite sample
size. The result of the example is drawn in Figure 4-2 as the curve with
*.
For comparison,
we also present y when the predictor noise e does not exist as the curve with o. For the
noisy case, it can be seen from the figure that -y and the sample size m are inversely related.
In general, 'y asymptotically approaches 0 as m
and Syx asymptotically approaches Cx
-+
oo because the sample covariances Sxx
and Cyx, respectively [38], thus force SMSE to
converge to PMSE.
When the sample size decreases, the differences between sample and population covariances get bigger, resulting in larger 'y because the sample covariances Sxx and Syx are
more susceptible to errors caused by the predictor noises. This is why least squares linear
88
regression should be used only when there is large enough training dataset.
In general,
linear regression should not be used when the number of noisy predictors is comparable to
the sample size.
4.2.3
Description of Blind-Adjusted Principal Component Regression
The purpose of this section is to introduce an application of the ION algorithm as a partial
solution to the problem caused by measurement noise e and having too few samples. The
algorithm is mainly a concatenation of the ION algorithm and NAPC filtering. The ION
algorithm provides the noise variances which are required in NAPC filtering. The problem
of large SMSE due to not having enough samples could be alleviated by NAPC filtering
because NAPC filtering compresses the signals distributed over all predictors into a smaller
number of principal components.
A schematic block diagram of the algorithm to be studied here is shown in Figure 4-3. We
will call the algorithm BAPCR, the acronym of the Blind-Adjusted Principal Component
Regression.
The name stems from the two major sub-algorithms which constitute the
BAPCR: 1) the ION algorithm which estimates the noise variances blindly, and 2) the
noise-adjusted principal component filtering which reduces the number of predictors.
The matrix X E Rmxn consists of the m observations of n predictors, and the corresponding m observations of the response variable constitute the column vector Y
E Rm. It
is known that the predictors are subject to measurement errors which have no correlation
with the response variable. It is assumed that the noise covariances are diagonal but those
elements are unknown. The degrees of freedom of the predictors is n due to the measurement noises. The degrees of freedom of the predictors would have been a smaller value but
for the noises.
The noise variances of predictors are estimated using the ION algorithm described in
Chapter 3.
The output
O
of the ION algorithm is a diagonal matrix with its diagonal
elements being the ML estimates of the noise variances of corresponding variables in X.
These noise estimates are subsequently used for the NAPC filtering. First, predictors are
noise-adjusted by being divided by the square roots of the corresponding diagonal elements
of G. The noise-adjusted dataset, denoted by X', can be written as
X'
XO-(1/2).
89
(4.13)
Y
ION
G
P -C
l/2
PC Filtering
Signal -
Dominant
Principal
Components
Regression
Prediction
Equation
Figure 4-3: Schematic diagram of blind-adjusted principal component regression.
90
In principal component filtering, the eigenvectors of the sample covariance of X' are used to
compute the principal components and the eigenvalues are used to determine the number of
principal components to be retained. The number of retained principal components is equal
to the number of source signals estimated by the ION algorithm. The remaining principal
components are discarded as noises.
The retained principal components are regarded as new predictors. Since we discarded
noise principal components in the NAPC filtering, the number of new predictors is smaller
than n. The prediction equation is obtained by linear regression of Y on the new predictors.
4.2.4
Evaluation of Performance of BAPCR
It is of interest to see how much performance improvement BAPCR brings to linear prediction over other traditional methods such as linear regression. For this purpose we applied
four methods including BAPCR to a few simulated examples of the linear prediction problem. The four methods being contested were linear regression, PCR, BAPCR, and NAPCR.
As the criterion of the linear prediction performance we chose the discrepancy factor 'y de-
fined in (4.12).
The examples were designed to be representative of many practical multivariate datasets.
The parameters that need to be specified to define examples are: the number of predictor
variables n, the number of source signals p, the sample sizes m, the regression parameter
vector 6 that defined the relation between Y and Z, the variance o- of the statistical noise
term in Y, the mixing matrix A and the noise covariance G. We would like to compute and
present -y of the four methods as a function of m, p, and G. For all examples of the section
we used n = 50 and a- = 0.01. The regression parameter vector 3 was randomly generated
by computer. If one uses MATLAB program for simulation,
3 could be generated by the
command beta = randn(n, 1) where n denotes the number of predictor variables n = 50.
Explaining the selection of mixing matrix A needs a little elaboration. Let us begin by
recalling the following equation:
Cxx = Czz + G = AA
T
+ G.
(4.14)
Assuming that the n x p mixing matrix A is full column rank, the rank of n x n matrix Czz
is p. In other words, p out of n eigenvalues of Czz are non-zero. Recalling that SNR of
91
a noisy multivariate dataset is defined in terms of the eigenvalues of Czz and G, it would
be convenient if we fix the p non-zero eigenvalues of Czz.
For our examples, we chose
somewhat arbitrarily the p non-zero eigenvalues of Czz to be proportional to either i-4
or i- 3, i = 1, - - ,p. Other than these conditions on the eigenvalues of AAT, we let A be
random. This can be achieved by solving the following equation for A:
AAT
=
QAQT
=
Q(p)A(p)Q(p),
where A(p) is the p x p diagonal matrix of non-zero eigenvalues of AAT and
(4.15)
Q(p)
is the
m x p matrix of p eigenvectors corresponding to p non-zero eigenvalues. The solution to
(4.15) is
A = Q(P)A 1 2 U
(4.16)
(1)
where U is any p x p computer-generated random orthonormal matrix.
-y as a Function of Sample Size of Training Set
Our first set of simulations focuses on the performance of four linear predictions as a function
of m. Table 4.1 summarizes the simulation parameters used. We set the number of source
signals p to be 15. For the first two examples of Table 4.1, the eigenvalues of AAT are set to
be 1000 x i- 4, 1 < i < p. The last two examples have 1000 x i-3 , 1 < i < p as the eigenvalues
of AAT. Noises are again independent so that the noise covariance G are diagonal. As for
the diagonal elements of G, we decided on two possibilities. The first choice is G = I. This
corresponds to the case where the predictor variables have been noise-adjusted somehow
prior to our analysis. This can happen if a dataset is noise-adjusted by someone who knows
the noise variances before it is distributed for analysis. Referring to Table 4.1, this choice
covers Example (a) and (c). For Example (b) and (d), we arbitrarily set the noise covariance
as G = diag (i-1), i = 1, ... , n. For each of the examples in Table 4.1, we evaluated the
discrepancy factor -y for the training sets of a few different sample sizes. For each sample
size of the training set four different prediction equations are obtained, one from each of
the following four methods: 1) linear regression, 2) principal component regression, 3)
blind-adjusted principal component regression, and 4) noise-adjusted principal component
regression. As for the PCR, the number of principal components to be retained for the
92
(a)
(b)
(c)
(d)
n
50
50
50
50
a
0.01
0.01
0.01
0.01
#
randn(n,1)
rancn(n,1)
randn(n,1)
randn(n,1)
A
1000 x A1
1000 x A1
1000 x A 2
1000 x A 2
p
15
15
15
15
m
G
Variable
I
diag(i
of
1)
interest
I
diag(i- 1 )
Table 4.1: Important parameters of the simulations to evaluate BAPCR. Both A 1 and A 2
are n x n diagonal matrices. There are n - p zero diagonal elements in each matrix. The p
non-zero diagonal elements are i- 4 for A 1 and i- 3 for A 2 , i = 1, ... ,p.
regression was decided from the screeplot. The NAPCR serves as the performance bound
for the BAPCR. In obtaining the results for NAPCR, the number of source signals p
=
15
and the noise covariance G are assumed to be known. Therefore, the NAPCR should always
outperform the other three methods, and the performance difference between the BAPCR
and the NAPCR indicates the price of the lack of a priori knowledge of the noise variances
and the signal order.
Once prediction equations are found, it is applied to a validating
set of sample size 3000 with the same parameters used to generate the training set. The
discrepancy factor -y is then calculated from (4.10) and (4.12).
Figure 4-4 shows the results of the simulations. Each of the four subplots correspond to
each example of Table 4.1. Some of the interesting observations that can be made are;
" As we speculated before the experiment, NAPCR outperformed all other methods at
all sample sizes for all examples. This is expected because the true noise variances and
signal order are used for NAPCR. If the noise variances and the signal orders estimated
by the ION algorithm are correct, however, the performance of the BAPCR should
be comparable to that of the NAPCR as we can see for the largest sample size for
Example (a), (b), and (c). We expect that this would be the case also for Example
(d) for some sample size larger than 1000.
" The extent of outperformance of NAPCR over PCR and BAPCR is smaller in Example
(a) and (c) than in Example (b) and (d). This is due to the fact that the noise variances
93
(a)
(b)
.......................
..I
..........I.........
101
...
... ..
101
Linear Regression
v-..-PCR
e-..
BAPCR
- - -NAPCR
...
-*-
-...100
100
-
10
102
0
-...
200
400
600
800
sample size of training set
101
10
200
400
600
800
sample size of training set
(d)
(c)
10
.. .. .. .. ... .. .
. . .... . . . . . . . .... . . . . . . . . . . . . .
.
... .. .. .. ........................
. . ... .. .
... . .. .. . ... .. .. .. ...
.. ..... .. .
I... .. .. .. ..... .. ... .. .. . .. .. .. .. .. .. .. .. ..
.. . . .. .. .. .. .. .. ..
... .. .... .I .. .. .. .*
... .......
........
... .. ..... ......
. .. .. .. ....
. .. .. .. . ....
...................
...........
.........
10 0
..
..
. .. .
..
. .
..
...
. : : : : : : : : :.: :-7 - .... .
..
. .... . . . . . . . . ... .
..
. .
:
. .. .. .. .. .. .
.....................................
.....................................
.................................................
. . . . . . . . .... . . . . . . . . . . . . . . . . . . . . . . . . . . .
...
..
..
..
..
..
..
..
......
..
..
..
..
..
..
..
..
......
...
..
..............
..
..........
.
.
...
.
...
. .. . . .. . ..
. . . . ... . . . . . . . . .
.. .. . . .. .
:
.........
. . . ..
. .. .. .
. .... .. .. .. .. .. .. .. .. .. .. .. .. ...
10-1
.. .
..................
..............
.........
10-1
.. . .. . . . . . .
. .. .. .
.. .. .. .. .. .. .. ... .... . . . . . . . .
. .. . .. ... .. .. .* ... .. .. .. .. ...
. . . . . . . . . ... . . . . . . . . ... . . .
.
.. .. .. .. ..
. . . . . . .. .
0
200
400
600
E00
sample size of training set
I . . . . . . . . ..
. . ... . .. ...
....................
..............
. . . . . . . . . .... . . . . . . ... . . . . . .
. . . ...
... .. .. .. .. .. .. .. ... .. .. .
. .. . .. .. .. .. .. .. ..
.....
. .. .. .. . .... .. .
102
1000
... .. ..
.. .
.............................
. . . . . . . . . .... . . . . . . . .... . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .... . . . . . . . . .
.. .
....
....
...4
........
................ ......-
F
1000
10
100
- -.. . ...... .
.....
-.
..
... .. .. .. ..... .. .. ... ..... .. .. .. .. .. .. . .. ...
777 o ..
I
.. .. .. ... ..... .. .. .. ..... .. . .. .. . .. .. .. ..
1000
10
0
200
400
600
800
sample size of training set
1000
Figure 4-4: Simulation results for examples of Table 4.1 using linear regression, PCR,
BAPCR, and NAPCR as a function of m. p = 15.
of Example (a) and (c) are uniform across variables. Therefore, the difference between
PCR and NAPCR is that PCR requires the signal order estimation. Since the variables
are already noise-adjusted, signal order estimation by the eigenvalue screeplot should
produce the correct answer when the sample size is large.
In Figure 4-4(a), the
performances of NAPCR and BAPCR are effectively indistinguishable.
The same
argument also holds between the BAPCR and the PCR. The difference between the
PCR and the BAPCR exists in the fact that the BAPCR normalizes variables by their
estimated noise variances. Since variables are already normalized, this should make
no difference. In fact, BAPCR might underperform PCR if the noise estimation is
less than perfect. In our simulations, BAPCR underperforms the PCR slightly when
the sample size is small.
9 The performance of PCR does not improve for increasing sample size as much as
94
(a)
(b)
(c)
(d)
50
50
50
50
0.01
0.01
0.01
0.01
i3
randn(n,1)
randn(n,1)
randn(n,1)
randn(n,1)
A
1000 x A 1
1000 x A 1
1000 x A 2
1000 x A 2
n
o0
of
Variable
p
in t eres
t
m
60
60
60
60
G
I
diag(i- 1)
I
diag(i- 1 )
Table 4.2: Important parameters of the simulations to evaluate BAPCR as a function of p.
The values of this table are identical to those of Table 4.1 except for p and m.
the other three methods in Example (b) and (d). Since the noise variances are not
normalized for these two examples, the signal order estimation through the screeplot
does not yield the correct signal order p. Furthermore, due to the same reason the
PC transform does not put all signals in the first p principal components. It can be
seen that these two shortcomings of the PC transform limit the performance of the
PCR. As a result, -y remains relatively flat for increasing sample size. For these two
examples, linear regression outperforms the PCR at most sample sizes.
9 The performance improvement of BAPCR over the competition is as large as about
400 %, achieved for Example (d) at the sample size of 60.
-y as a Function of Number of Source Signals
Our second set of simulations concerns the prediction performance as a function of n. The
simulation parameters are summarized in Table 4.2. Note that the values of the parameters
of Table 4.2 and Table 4.1 are common except for the variable of interest. We set the sample
size m of the training set to be 60 since the performances of the four methods are easily
differentiated when m is small. The number of source signals runs from 5 to 30.
In Figure 4-5, we presented the discrepancy factor y of the four methods as a function
of the number of source signals p. Each of the four subplots corresponds to each example
of Table 4.2. Each curve is the average of 40 simulations using randomly chosen matrix
and U, which are used to generate A from (4.16).
95
Q
(a)
.............
................
...............................
...........
.
....... .
......I.....
.....
.........
.......
..
..........................................
...............................
..................
.............................
......................
.........
10 1
10 0
.....
.....
.....
....
...
.....
...
....
....
..
...
...
......
....
....
....
...
.....
....
....
....
..:
...
...
....
...
...
...
....
....
....
....
.....................................
...
. . . . . . . . . .... . .
. .. .. .. .. . ..
. . . . . . . . .... . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . ... . . . . . . . .
10-
.. .. .. .. .
I
5
10
15
20
# of source signals
25
30
(b)
..
..............
......... ......
..
..
..
...
.:
:
:
:
:
.
.
....
...
.....
..
...
...
. ....... .....
.......
....
...
......
..
...
....
......................
................ .........
...........
........- :.......*' ' :...*. ..............
......................... ...
........ ....
...
....... .........
10
............
...........I....... ..........
100 ... .......
...*.......
...
..................................................
.............................................
.......
..
..
..
..
..
....
...
...
....
....
....
........_..
1: 1, .........
......................
...
...........
....
...
.......
..........
.. .......
......... ..........
..
....................................... ..........
10
5
10
15
(c)
10 1
..
.. .. . .. .. ..
20
25
30
# of source signals
(d)
10
. .. .. .... . . . . . . . . . . . . . . . . .
. . .. .....
. . .. ..... .. .. .. ... .. .. .. .. . .. .. .. ..
.......... .........
..........
I........................................
..........
...........
............
.........
10 0
.. .. .. .. .. ... . .. .. .. ..
. ..
.......
. . .. .. .. ..... .. .. ... .. .. .. ..
... .. .. . .. .. .. .. .. .. ..... .. .. .. .. .. .. .. ...
... .. .. . .. .. .. .. .. .. .. .. .. .. .. .. .. .
. .. .. .. .
. . ... . .. .. ..
..
.. ... ......................... ...........
777_.'!
10 0
--
10
5
Figure 4-5:
10
15
20
# of source signals
25
---.
30
Linear Rgeso
-v- PCR
.. . .. .. .. . . .. .. .. .. ..... .. .. .. ..... ... .. .............
10
5
10
-e,Ar-
BAPCR
NAPCR
15
20
# of source signals
25
30
Simulation results for examples of Table 4.2 using linear regression, PCR,
BAPCR, and NAPCR as a function of p.
" Again, the NAPCR outperformed all other methods at all values of p for all examples.
This is due to the fact that the true noise variances and the signal order are assumed
to be known for the NAPCR. Since the sample size is small, the noise variances and
the signal orders estimated by the ION algorithm are prone to estimation error, which
appears as the performance difference between the NAPCR and the BAPCR in all
four graphs.
" The performance of linear regression is always worse than the other three methods.
This confirms our previous observation that linear regression is more susceptible to
over-training than the other three methods when the sample size is not large enough.
" From the four graphs, we may conclude that the performance of linear regression does
not depend on the number of source signals. As for the other three methods, there is
a tendency that the performance deteriorates as p increases. We believe that this is
96
(a)
(b)
50
50
0.01
0.01
3
randn(n,1)
randn(n,1)
A
1000 x A1
1000 x A 2
p
5
5
m
200
200
G
Variable
n
2
of
interest
Table 4.3: Important parameters of the simulations to evaluate BAPCR as a function of G.
due to the fact that increase in p leaves fewer principal components to be removed as
noise, and thus decreases the amount of noise filtered by the three methods.
y as a Function of Noise Distribution
This set of simulations evaluates the performance of the four methods as a function of noise
distributions among variables. The simulation parameters are summarized in Table 4.3. As
for the diagonal noise covariance matrix G, we used the following six noise distributions:
G = diag (io) ; diag (t-0-5) ; diag (i-1) ; diag (i-1.5) ; diag (i-2) ; diag (i-2.5)
.
(4.17)
These six noise covariance matrices are chosen so that we can observe the performances
of the four methods when the noise is normalized and when the noise has very different
variances among variables. For example, G = diag (i0 ) indicates that the noise is already
normalized, and G = diag (i-2.5 ) represents the case when the noise variances vary greatly
among variables. We expect that the outperformance by BAPCR over linear regression and
PCR becomes more obvious when the noise variances vary more among variables because
the advantage of normalizing variables by estimated noise variances becomes larger when
the noise variances are different among variables.
The simulation results are presented in Figure 4-6. The horizontal axis represents the
absolute value of the exponents of the diagonal elements of G. For example, G = diag (i-1.5 )
is represented by the value 1.5 on the horizontal axis.
97
+ Linear Regression
-v- PCR
+
BAPCR
NAPCR
101(a)
10
10
11(b)
10
- . . . . . . . .. . . . . ..-.
10
10
- . ..
0
0.5
1
1.5
Noise exponent
. .. . ..
7'
. . .. . .
- . -.
10
.-.-.
. -.
. -..-
2
10
10-2
2.5
-
0
. . . . . . .. . . . . . .
-
--
0.5
-
1
1.5
Noise exponent
--
- -
2
-
2.5
Figure 4-6: Simulation results for examples of Table 4.3 using linear regression, PCR,
BAPCR, and NAPCR as a function of noise distribution. The horizontal axis of each graph
represents the exponent of the diagonal elements of G.
" NAPCR again outperforms all other methods.
" More to come here.
4.3
ION Noise Filter: Signal Restoration by the ION algorithm
The primary objective of the ION algorithm is to obtain the noise variances to be used for
noise-adjustment of variables before the principal component transform. However, there are
other functions that the ION algorithm is capable of. One of them is the signal restoration,
or equivalently, the noise sequence estimation, which is the major subject of this section.
The EM algorithm derived in Section 3.5 computes not only the noise variances but also
a few other quantities. Two of them are A and E [P JX; A, G] (See Figure 3-8). The first
quantity A is an n x p matrix, and it is supposedly an estimate of the mixing matrix A.
The other quantity, an m x p matrix E [PIX; A, d], may be interpreted as the expected
source signal sequences given the observed noisy data X, assuming A and
O
are the true
mixing matrix and noise covariance, respectively. Recalling the noise-free vector Z is the
98
product of the source signal vector P and the mixing matrix A, it is not hard to realize that
the product of
A
and E [PIX;
A, ]
should be an estimate of the noise-free data matrix;
Z = E [PiX; A,
A
(4.18)
In reality, the true source signal number p is unknown. Therefore, A and E [P IX;
A, ]
are
n xp and m xp matrices, respectively, where p is the estimated source signal number obtained
from the ION algorithm. Obtaining
Z from the noisy data X constitutes a multivariate
noise filtering. We designate this operation as the ION noise filtering. One should realize
that the characteristics of signal and noise that are utilized by conventional noise filters are
not known a priori to the ION filter. For example, the difference of the frequency bands of
the signal and noise power spectrum densities are the basis of designing a frequency-selective
filter. The blind noise filter does not utilize the signal and noise power spectrum densities.
In fact, the ION filter works even if the power spectral densities for signal and noise can
share the same frequency band.
Before we proceed further, we would like to point out that
A and E [PIX; A, d] may
not be correct estimates of A and P, respectively. This is the so-called basic indeterminacy
in the blind source separation problem [39]. To understand this, it is important to realize
that the mixing matrix A and the signal source matrix P appear only as the product of the
two in the noisy data matrix X. This leads us to the conclusion that for any given mixing
matrix A and the source signal matrix P, we could create infinitely many pairs of mixing
matrices and source signal matrices that would have resulted in the same noisy data matrix
X. For example, for any p x p orthonormal matrix V, the pairs PV, VTAT and P, AT
are not distinguished by the noisy matrix X.
4.3.1
Evaluation of the ION Filter
To examine the effectiveness of the ION filter, we first applied the filter to the four examples
of multivariate noisy data described in Table 4.1. In all examples, the element of judgment
used to evaluate the effectiveness of the ION filter is the improvement in signal-to-noise ratio
(SNR) between sequences before and after the filter. A big increase in SNR indicates that
the noises are rejected effectively by the filter. In Figure 4-7, the average increase of SNR
achieved by the ION filter is drawn as a function of the sample size. For each sample size,
99
12
10
/
10
- -0/ EI
_
Z
8 .. ..
6..
-
U) 09-
0
51
0
-
/ 07-------------------------------------------------
......
50
500
:140
Example
axample b
1000
Sample Size
1500
. . . . . . . . . . . . . . ..
m
2..........1
........
31
0
2000
----------------
--
500
Example (c)
Example (d) 2...
1000
Sample Size
1500
2000
Figure 4-7: Increases in SNR achieved by the ION filtering for examples of Table 4.1 as a
function of m.
the SNR increase for each of the 50 variables was obtained first by averaging 40 repeated
simulations. In each repeated simulation, we randomly generated the source signals and
noises. The SNR increase for 50 variables was then subsequently averaged.
" SNR improvement increases as the sample size grows. This is consistent with our
understanding that the ION algorithm works more accurately with data of larger
sample size. The estimated noise-free data matrix
Z is closer to Z when the sample
size is larger.
" SNR improvement for Example (a) is higher than that for Example (b).
This is
because the congregate noise power, defined as the sum of all noise variances over
variables, is larger in Example (a) than in Example (b). Recall that noise variances
of Example (a) and (b) are diag (i0 ) and diag (i- 1), respectively. Therefore, there is
more room for SNR improvement for Example (a) than for Example (b). The same
explanation goes for the result that Example (c) experiences higher SNR improvement
than Example (d).
In our second set of simulations we tried to find out how the effectiveness of the ION
filter is affected by the number of the source signals. For this purpose, we applied the ION
filter to the four examples of Table 4.2. Again, we measured the effectiveness of the filter
by the average SNR improvement, which is presented in Figure 4-8. The result indicates
that the average SNR improvement achieved by the ION filter is a decreasing function of
100
14
12
-
Example (a)
Example (b)
-e12
--
10
\
-~
Cz
0)
.z
U 0 .. . .
.. .... ...
5
.... . . ... . . . .
0
2......1..1..2
4
...
.. .. . ... ...
>...6 ....
4'
0
Example (c)
Example (d)
...................
10
15
01
0
20
# of source signals
5
10
15
20
# of source signals
Figure 4-8: Increases in SNR achieved by the ION filtering for examples of Table 4.2 as a
function of p.
the number of source signals.
4.3.2
ION Filter vs. Wiener filter
If the signal and noise statistics of a given dataset were known, the linear least-squares estimate (LLSE) of the noiseless dataset would be obtained by applying the Wiener filter to the noisy dataset. For the data model given in Figure 3-1, the Wiener filter is given by

Ẑ = C_ZZ C_XX^{-1} X.          (4.19)

To apply the Wiener filter, it is essential that we know the covariance matrix C_ZZ. When C_ZZ is not known in advance, an estimate of it can be obtained from the relation C_ZZ = C_XX − G as long as the noise covariance G is known. In many practical examples, neither C_ZZ nor G is known a priori.
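When G does happen to be known, as in the simulations of this chapter, the relation C_ZZ = C_XX − G makes (4.19) straightforward to apply. The following minimal NumPy sketch illustrates this; the function name wiener_filter and the synthetic example are ours, and the data are assumed to be zero-mean with variables in columns.

```python
import numpy as np

def wiener_filter(X, G):
    """Apply the Wiener filter of (4.19) to a zero-mean data matrix X
    (m samples x n variables), given the noise covariance matrix G."""
    m, n = X.shape
    Cxx = (X.T @ X) / m            # sample covariance of the noisy data
    Czz = Cxx - G                  # estimate of the noise-free covariance, Czz = Cxx - G
    W = Czz @ np.linalg.inv(Cxx)   # Wiener gain matrix Czz Cxx^{-1}
    return X @ W.T                 # row-wise application of Z_hat = Czz Cxx^{-1} x

# Synthetic example: 5 latent signals mixed into 50 noisy variables.
rng = np.random.default_rng(0)
m, n, p = 1000, 50, 5
Z = rng.standard_normal((m, p)) @ rng.standard_normal((p, n))   # noise-free data
g = 1.0 / np.arange(1, n + 1)                                   # per-variable noise variances
X = Z + rng.standard_normal((m, n)) * np.sqrt(g)                # noisy observations
Z_hat = wiener_filter(X, np.diag(g))
```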
Encouraged by the significant increases in SNR demonstrated by the ION filter in this section, we speculate that the performance of the ION filter, measured by SNR improvement, is comparable to that of the Wiener filter. Of course, the ION filter cannot outperform the Wiener filter, whose performance serves as an upper bound; it is of interest to see how closely the ION filter approaches it.

In studying the performance gap, if any, between the ION filter and the Wiener filter, we resort to computer simulations using the examples specified in Table 4.1. The results are illustrated in Figure 4-9. Figures (a-1) and (a-2) illustrate the simulation results corresponding to example (a) in Table 4.1, and Figures (b-1) and (b-2) show the results obtained from example (b) of Table 4.1.
" As expected, the Wiener filter increases SNR more than both the ION filter and the
PC filter.
" Performances of the Wiener filter, the ION filter, and the PC filter are not much
different for example (a). For example, the differences in SNRs between the Wiener
filtered dataset and the ION filtered dataset remains less than idB for all variables.
" However, for example (b), the PC filter significantly underperform both the ION
filter and the Wiener filter for variables with small SNRs. Even though the ION filter
also underperforms the Wiener filter, the degree of underperformance is significantly
smaller for the ION filter than for the PC filter.
" The significance of the example is the fact that the ION filter, which does not assume
the a prioriknowledge of Czz performs almost as well as the Wiener filter, the optimal
linear filter which requires a priori knowledge of Czz.
4.4 Blind Principal Component Transform
In many practical situations in which we are interested in obtaining principal components
(PCs) of a noise-free dataset Z, it is usually the noisy dataset X that is available for the
transform. Let the PCs of Z and the PCs of X be referred to as the true PCs and the noisy
PCs, respectively. The noisy PCs are often used as substitutes for the true PCs. Because the eigenvalues and eigenvectors of C_XX are different from those of C_ZZ, however, the noisy PCs may not be good estimates of the true PCs.

As we saw in the previous section, the ION algorithm can reduce the noise in X and return Ẑ, the estimate of the noise-free dataset Z. It may therefore be speculated that the PCs of
the ION-filtered dataset are better estimates of the true PCs than are the noisy PCs. We
will refer to the PC transform of the ION-filtered dataset as the blind principal component
(BPC) transform. Figure 4-10 illustrates the schematic diagram for the BPC transform as
well as the traditional PC transform.
Figure 4-9: Performance comparison of the Wiener filter, the ION filter, and the PC filter
using examples of Table 4.1.
Figure 4-10: Schematic diagram of the three different principal component transforms.

The goal of this section is to show through simulations that the PCs obtained through the BPC transform of X are in fact better estimates of the true PCs in the mean-square error (MSE) sense. The MSE of the noisy PCs, denoted by MSE_pc, is defined as the mean-square difference between the noisy PCs and the true PCs. An MSE_pc value exists for each noisy PC. For example, the MSE_pc of the first noisy PC can be written as

MSE_pc = (1/M) Σ_{k=1}^{M} ( PC_1^X(k) − PC_1^Z(k) )²,          (4.20)

where PC_1^X(·) and PC_1^Z(·) denote the first noisy PC and the first true PC, respectively. Similarly, we can define the mean-square error of the BPCs, denoted by MSE_bpc, as the mean-square difference between the BPCs and the true PCs. The MSE_bpc of the first BPC can be written as

MSE_bpc = (1/M) Σ_{k=1}^{M} ( BPC_1^X(k) − PC_1^Z(k) )²,          (4.21)

where BPC_1^X(·) denotes the first BPC of X.
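The following sketch illustrates how (4.20) and (4.21) can be evaluated numerically; it assumes zero-mean data with variables in columns, the helper names are ours, and the sign-alignment step (eigenvectors are defined only up to sign) is an implementation detail not spelled out in the text.

```python
import numpy as np

def principal_components(D, k):
    """Return the first k principal components (m x k score matrix) of the
    zero-mean data matrix D (m samples x n variables)."""
    C = (D.T @ D) / D.shape[0]
    w, V = np.linalg.eigh(C)           # eigenvalues in ascending order
    V = V[:, ::-1][:, :k]              # top-k eigenvectors
    return D @ V

def pc_mse(D_est, Z, k=5):
    """MSE of (4.20)/(4.21): mean-square difference, per component, between
    the PCs of an estimated dataset (X or Z_hat) and the true PCs of Z."""
    P_est, P_true = principal_components(D_est, k), principal_components(Z, k)
    # Align signs, since each eigenvector is only defined up to a sign flip.
    signs = np.sign(np.sum(P_est * P_true, axis=0))
    return np.mean((P_est * signs - P_true) ** 2, axis=0)

# mse_pc  = pc_mse(X, Z)       # noisy PCs vs. true PCs       (4.20)
# mse_bpc = pc_mse(Z_hat, Z)   # BPCs (ION-filtered) vs. PCs  (4.21)
```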
For our simulation we chose Examples (b) and (d) of Table 4.1. Examples (a) and (c) of the same table were not used because their noise variances are constant over variables. In those cases, the eigenvectors of the covariance of X are the same as the eigenvectors of the covariance of Z, so the difference between the BPC transform (the PC transform applied after the ION filter retrieves the noise-free dataset estimate Ẑ) and the noisy PC transform is minimal.
In Tables 4.4 and 4.5, we computed the mean-square errors of the BPC and PC transforms for different sample sizes, and listed the results for the first five BPCs and PCs. Consider the result presented in Table 4.4. The MSE values of the BPCs are always smaller than those of the noisy PCs. The eigenvectors of the covariance matrix of X differ from the eigenvectors of the covariance matrix of Z because of additive noise whose variances vary over variables. The ION filter reduces this noise, and the covariance matrix of Ẑ, obtained through ION filtering, should be a better estimate of the true covariance matrix of Z. As a result, the eigenvectors of the covariance matrix of Ẑ are better aligned with the eigenvectors of the true covariance, which in turn reduces the MSE values for the BPC transform. The same holds for the result of Example (d).
To illustrate the relation between the sample size m and the effectiveness of the BPC transform, we computed the percentage reduction of MSE due to the BPC transform as a function of the sample size. The percentage reduction of MSE due to the BPC transform is defined as

(MSE_pc − MSE_bpc) / MSE_bpc × 100 (%).

The result obtained from Example (b) is presented in Figure 4-11. The percentage reduction of the MSE values is larger when the sample size is bigger. Since the ION filter works better for larger sample sizes, the MSE reduction achieved by the BPC transform is correspondingly larger. The same result and explanation hold for Example (d), which is presented in Figure 4-12.
              m=60    m=80    m=100   m=200   m=500   m=1000
MSEbpc(1)     0.252   0.081   0.064   0.066   0.042   0.044
MSEpc(1)      0.281   0.103   0.085   0.091   0.067   0.072
MSEbpc(2)     0.849   0.633   0.165   0.534   0.091   0.061
MSEpc(2)      0.869   0.677   0.184   0.575   0.152   0.124
MSEbpc(3)     0.723   0.826   0.082   0.456   0.144   0.069
MSEpc(3)      0.739   0.844   0.097   0.502   0.193   0.116
MSEbpc(4)     0.527   0.093   0.116   0.073   0.056   0.051
MSEpc(4)      0.594   0.149   0.189   0.143   0.124   0.128
MSEbpc(5)     0.207   0.184   0.122   0.159   0.057   0.052
MSEpc(5)      0.305   0.295   0.203   0.350   0.113   0.152

Table 4.4: Mean square errors of the first five principal components obtained through the BPC and the traditional PC transforms using example (b) of Table 4.1.
Figure 4-11: Percentage reduction of MSE achieved by the BPC transform over the noisy PC transform using Example (b) of Table 4.1.
              m=60    m=80    m=100   m=200   m=500   m=1000
MSEbpc(1)     0.195   0.744   0.302   0.084   0.176   0.063
MSEpc(1)      0.214   0.798   0.334   0.112   0.222   0.110
MSEbpc(2)     0.833   0.483   2.295   0.289   0.973   0.121
MSEpc(2)      0.847   0.501   2.310   0.319   1.013   0.165
MSEbpc(3)     2.371   1.087   1.991   0.605   0.338   0.202
MSEpc(3)      2.430   1.148   2.039   0.695   0.439   0.288
MSEbpc(4)     3.536   0.894   2.034   0.102   0.240   0.133
MSEpc(4)      3.563   0.935   2.052   0.124   0.292   0.181
MSEbpc(5)     1.508   0.683   0.736   0.201   0.159   0.077
MSEpc(5)      1.512   0.730   0.773   0.251   0.191   0.111

Table 4.5: Mean square errors of the first five principal components obtained through the BPC and the traditional PC transforms using example (d) of Table 4.1.
Figure 4-12: Percentage reduction of MSE achieved by the BPC transform over the noisy PC transform using Example (d) of Table 4.1.
Chapter 5
Evaluations of Blind Noise Estimation and its Applications on Real Datasets

5.1 Introduction
While we studied multivariate datasets received from various industrial manufacturing companies, it became apparent to us that the variables were noisy and that the noise variances were not uniform across variables. As a result, it was imperative to be able to retrieve noise variances from a given sample dataset reliably in order to apply the NAPC transform, which is superior to the PC transform in noise filtering. After making a few assumptions about signals and noises, we developed the ION algorithm, which retrieves not only noise variances but also the signal order and the noise sequences. We have tested the algorithm on simulated datasets which do not violate the underlying assumptions.

One of the assumptions made about the signals is that they are Gaussian. The development of the EM algorithm is based heavily upon this assumption. However, it is conceivable that this assumption may be violated in practice; for example, a real manufacturing dataset could easily display very non-Gaussian behavior. Another assumption that may not hold in practice is time-related. No time structure is assumed in developing the EM algorithm; linear combinations of signals are assumed to be instantaneous. However, many practical datasets that we have studied did have time structure, such as slow changes.
In this chapter, we apply the ION algorithm to multivariate datasets obtained from manufacturing and remote sensing. By doing so, we want to achieve two things: 1) to complete the analysis of real datasets that motivated the development of the ION algorithm in the first place, and 2) to test the robustness of the ION algorithm by applying it to real datasets that may violate the Gaussian assumption and the no-time-structure assumption.

In Section 5.2 we test the ION algorithm on NASA AIRS data, focusing especially on its noise-filtering aspect. First, a brief discussion of remote sensing in general and the AIRS dataset is presented. Then the ION filter is compared with the Wiener filter and the PC filter, which are widely used for multivariate noise filtering. We explain in which cases the PC filter replaces the Wiener filter, which is known to improve SNR the most among linear filters, and how the ION filter may be applied when the Wiener filter is desired but cannot be applied. Our simulation shows that the ION filter performs much better than the PC filter and approaches the performance of the Wiener filter.

A large-scale manufacturing dataset is analyzed in Section 5.3. As a new application of the ION algorithm, separation of variables based on eigenvectors of the BAPC is studied. A numerical metric for the effectiveness of eigenvectors in separating variables into groups is proposed. In addition, the performance of the BAPCR is compared with traditional least-squares linear regression.
5.2 Remote Sensing Data
Remote sensing is a fairly new technique compared to aerial photography. Microwave remote
sensing is preferable to optical photography because it can penetrate clouds better and it
does not depend on the sun for illumination. For these reasons, microwave remote sensing
techniques have grown in popularity since the early 1960s [40].
An important class of remote sensing techniques is passive remote sensing, in which sensors measure radiation emitted by the atmosphere, the surface, and celestial bodies in multispectral channels. Inevitable measurement error is added to the incoming signal during the radiance measurement. There are other sources of error, such as 1/f noise, but for the sake of simplicity our evaluation of the ION algorithm addresses only additive Gaussian measurement noise. Modeling the combined effect of all noise sources as additive Gaussian noise is not uncommon in the remote sensing research community [30].
The objective of this section is to test the ION filter on simulated AIRS data and to compare its performance, measured by SNR improvement, with that of two other widely used filters. First, the structure of the AIRS data is explained.
5.2.1 Background of the AIRS Data
To provide a basic understanding of the AIRS data, we include the following excerpt from [30]:

The National Aeronautics and Space Administration's (NASA) Atmospheric Infrared Sounder (AIRS) is a high-spectral-resolution infrared spectrometer that operates on a polar-orbiting platform. AIRS provides spectral observations of the earth's atmosphere and surface over 2371 frequency channels. The data can be used to retrieve various geophysical parameters such as atmospheric temperature and humidity profiles, etc. A full 2371-channel spectrum is generated every 22.4 msec.
Figure 5-1 illustrates the format of a typical AIRS dataset. An AIRS dataset consists of 2371 columns. Each column holds the spectral observations of one narrowband frequency channel and constitutes a variable; an entire AIRS dataset therefore consists of 2371 variables. Each measurement of the 2371 variables becomes a row, and the number of rows depends upon the duration and frequency of the measurement. Typically a measurement is made every 22.4 msec.
An AIRS dataset is subject to several noise sources, such as instrument noise and errors due to imperfect calibration. The cumulative effect of these noise sources is typically modeled as independent additive Gaussian noise. In that case, the model of an AIRS dataset coincides with the one illustrated in Figure 3-1, upon which we based the development of the ION algorithm. Throughout this section, we will refer to this cumulative effect as "noise," so the noise is independent, additive, white, and Gaussian.
According to the schedule of the NASA AIRS project, the polar-orbiting platform that will carry the AIRS infrared spectrometer will not be in orbit until the year 2000, so an actual AIRS dataset will not be available until then. However, researchers in the field were able to generate a simulated AIRS dataset based on what the spectrometer was expected to observe over land during nighttime.

Figure 5-1: Format of an AIRS dataset.
We were provided with a noiseless simulated AIRS dataset by the NASA AIRS science team. The dataset has 2371 variables and contains 7496 observations. This is what we will refer to as the simulated noiseless AIRS dataset. We do not know whether the variables of the noiseless AIRS data are Gaussian. A corresponding noisy dataset was generated by adding to the noiseless dataset pseudo-random Gaussian noise whose statistics were provided by the AIRS science team. Noiseless and noisy simulated AIRS datasets have been widely used to develop data compression and coding algorithms intended for the actual datasets. Figure 5-2 shows the signal and noise variances of the simulated AIRS dataset.
5.2.2 Details of the Tests
There is no question that analysis of noiseless AIRS data yields a more accurate description of the atmospheric profiles than analysis of noisy AIRS data. Therefore, it is highly desirable to remove noise from noisy AIRS data before any further analysis. Traditionally, this is done by the Wiener filter when noise statistics are known, and by the PC filter when noise statistics are not available. It is well known that the Wiener filter is optimal in the least-squares sense; its performance, measured by SNR improvement, is therefore higher than that of any other linear filter. In contrast, while the PC filter does not improve SNR as much as the Wiener filter, it does not require a priori knowledge of noise variances.
The purpose of this test is to compare the ION filter with the PC filter and the Wiener filter using the AIRS data. We want to illustrate that the ION filter, while as widely applicable as the PC filter because it does not need a priori noise variances, performs almost as well as the optimal linear least-squares filter.

Figure 5-2: Signal and noise variances of the simulated AIRS dataset.

Figure 5-3: Three competing filters and notations.

In Figure 5-3 the notation for the inputs and outputs of the three filters is defined. The ith variables of Ẑ^ION, Ẑ^PC, and Ẑ^WIENER are denoted by ẑ_i^ION, ẑ_i^PC, and ẑ_i^WIENER, respectively. Their SNRs are defined as, respectively,

SNR_{ẑ_i^ION} = σ²_{z_i} / σ²_{(ẑ_i^ION − z_i)},          (5.1a)
SNR_{ẑ_i^PC} = σ²_{z_i} / σ²_{(ẑ_i^PC − z_i)},            (5.1b)
SNR_{ẑ_i^WIENER} = σ²_{z_i} / σ²_{(ẑ_i^WIENER − z_i)},    (5.1c)

where σ²_{(ẑ_i^ION − z_i)}, σ²_{(ẑ_i^PC − z_i)}, and σ²_{(ẑ_i^WIENER − z_i)} are the variances of ẑ_i^ION − z_i, ẑ_i^PC − z_i, and ẑ_i^WIENER − z_i, respectively. For comparison purposes, we also define the SNR of the unfiltered dataset as

SNR_{x_i} = σ²_{z_i} / σ²_{(x_i − z_i)}.                  (5.1d)
The difference between SNR_{ẑ_i^ION} and SNR_{ẑ_i^PC} is thus an indication of what can be gained by retrieving noise variances through the ION algorithm, and the difference between SNR_{ẑ_i^WIENER} and SNR_{ẑ_i^ION} can be interpreted as the performance loss due to the lack of a priori knowledge of noise statistics. In testing the three filters, we use a simulated AIRS dataset consisting of 240 channels equally spaced among the entire 2371 channels. Figure 5-4(a) shows the signal and noise variances of the selected 240 variables. Figure 5-4(b) is the eigenvalue screeplot of the decimated AIRS dataset after the corresponding pseudo-random noise is added to each channel. A quick glance at the screeplot indicates that the number of signals falls in the range 10 < p < 25. There exists a significant gap between n and p, which is a prerequisite for the ION algorithm.
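As an illustration, once the noise-free dataset Z and a filtered estimate are in hand, as in the simulations of this section, the SNRs of (5.1) can be evaluated per variable with a few lines of NumPy; the function name below is ours and the result is expressed in dB.

```python
import numpy as np

def snr_db(Z_hat, Z):
    """Per-variable SNR of (5.1): signal variance divided by the variance of
    the residual (estimate minus noise-free signal), returned in dB."""
    sig = np.var(Z, axis=0)
    err = np.var(Z_hat - Z, axis=0)
    return 10.0 * np.log10(sig / err)

# snr_ion    = snr_db(Z_ion, Z)      # (5.1a)
# snr_pc     = snr_db(Z_pc, Z)       # (5.1b)
# snr_wiener = snr_db(Z_wiener, Z)   # (5.1c)
# snr_x      = snr_db(X, Z)          # (5.1d)
```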
5.2.3 Test Results
In Figure 5-5 we plot the SNRs of Ẑ^ION, Ẑ^PC, Ẑ^WIENER, and X for the 240 variables in the AIRS dataset. The horizontal axis is the variable index, and the vertical axis is the SNR in dB. The figure shows that the Wiener filter and the ION filter always increase the SNR of the noisy dataset.
Figure 5-4: 240-variable AIRS dataset. (a) Signal and noise variances. (b) Eigenvalue screeplot.
For the PC filter, however, SNR_{ẑ_i^PC} − SNR_{x_i} is not always positive. For example, this value is negative at high variable indices, which means that the PC filter actually decreases SNR for high-index variables.
Figure 5-5: Plots of SNR of unfiltered, ION-filtered, PC-filtered, and Wiener-filtered AIRS datasets (240 variables, 7496 observations).
• The PC filter, which has been widely used for reduction of noise with unknown variances, sometimes increases noise variances. For example, SNR decreases for the last 112 variables (from variable 129 to variable 240) in Figure 5-5. This can happen when the signal variances of certain variables are extremely small, even smaller than the noise variances of other variables. PC transforms of such datasets can end up putting noise into lower principal components while signals are transformed into higher principal components (a lower principal component here means one with larger variance; for example, the second principal component is lower than the fourth). Figure 5-6 sketches an eigenvalue screeplot of such an example.

• The ION filter increases SNR for all variables.

• The SNR difference between Ẑ^WIENER and Ẑ^ION, plotted in Figure 5-7(a), is relatively small compared to the range of SNR displayed in Figure 5-5. This is a significant result considering that the ION filter does not require noise variances a priori.

• Figure 5-7(b) reveals that the performance gap between the ION filter and the PC filter can be large for variables with small signal variances.
Figure 5-6: An example eigenvalue screeplot where signal variances are pushed to a higher principal component due to larger noise variances of other variables.

Figure 5-7: Differences in signal-to-noise ratio between pairs of Ẑ^WIENER and Ẑ^ION and of Ẑ^ION and Ẑ^PC. (a) Plot of SNR_{ẑ_i^WIENER} − SNR_{ẑ_i^ION}. (b) Plot of SNR_{ẑ_i^ION} − SNR_{ẑ_i^PC}.

5.3 Packaging Paper Manufacturing Data

5.3.1 Introduction

As a part of the Leaders for Manufacturing (LFM) program, we had a chance to work with a company that produced paper used for packaging materials. We refer to this company
as B-COM in this thesis. The entire production line of the company can be categorized into
four different operations: preparation, production, power and chemical supply, and quality
measurement. Figure 5-8 is the schematic diagram of the production line of B-COM.
The preparation stage includes fiber pre-process, digestion, and stock preparation. Large logs are chipped into small woodchips (fiber pre-process). The woodchips are then digested into a thick liquid (digestion). Many chemicals, supplied by the effluent plant, are added at this stage. This liquid is then fed to three machines (A, B, and C), whose final product is paper rolls. The power plant supplies the necessary electricity and steam. Before the final product is delivered to customers, it goes through a standard quality check, which is performed offline. If a roll of paper does not pass the quality check, the roll is typically recycled at the digestion stage.
In addition to the offline quality check, B-COM records thousands of variables across the preparation, production, and power and chemical supply stages. It was thought initially that all information about the production and quality of the product is embedded in these thousands of 'inline' variables. During our collaboration with the company, we were provided with extensive collections of observations of inline variables and corresponding values of offline quality variables, and we were asked to look for the cause(s) of variations in product quality. If we could establish a firm relationship between product quality and certain controllable inline variables, the company would benefit greatly, because product quality could then be checked and controlled immediately through those inline variables instead of through long-delayed offline quality checks.

Figure 5-8: Schematic diagram of the paper production line of company B-COM.

Figure 5-9: An eigenvalue screeplot of the 577-variable paper machine C dataset of B-COM.
The raw dataset of inline variables was very large and unorganized. First, it contained various non-numeric data points. There were also many data points where entries were missing. Although these non-numeric or missing entries may contain some valuable information about the process, our focus was mainly on the numeric parts of the datasets; therefore, we excluded those entries from our analysis. We also excluded variables that remained constant at all times, since they carried no information about quality fluctuations. This is called data preparation, or data cleaning. The cleaned dataset still consists of thousands of variables.
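A minimal sketch of this cleaning step, assuming the raw data have been loaded into a pandas DataFrame (the function and variable names are ours):

```python
import pandas as pd

def clean_inline_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Data cleaning: keep only numeric entries, drop observations with
    missing values, and drop variables that never vary."""
    numeric = raw.apply(pd.to_numeric, errors="coerce")  # non-numeric entries become NaN
    numeric = numeric.dropna(axis=0, how="any")          # exclude observations with missing entries
    varying = numeric.std(ddof=0) > 0                    # True for variables that actually vary
    return numeric.loc[:, varying]

# cleaned = clean_inline_data(raw_bcom_data)   # raw_bcom_data: hypothetical DataFrame of inline variables
```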
For this thesis, we chose to use the variables collected from paper machine C, for three reasons: 1) machine C is the largest of the three machines, 2) the three machines are totally independent, and 3) time-synchronization among variables measured at different parts of the production line is not very good. For example, it takes about 18 hours for logs at the pre-process stage to come out as paper, so a data point at the pre-process stage is related to a data point at the final product stage 18 hours later. It is not a small task to follow such time delays for thousands of variables. In contrast, within a paper machine the time lags among variables are negligible.
The final dataset consists of 7847 observations of 577 variables. Each variable is standardized to zero mean and unit variance. The eigenvalue screeplot of this standardized dataset is given in Figure 5-9. The flat noise bed is not clearly defined, indicating that the noise variances are not uniform across the standardized variables.
5.3.2 Separation of Variables into Subgroups using PC Transform
Analyzing a dataset with a large number of variables sometimes involves dividing the variables into several subgroups. If the division is "well-designed," studying individual subgroups may reveal information that is otherwise hard to discover. In addition, subgroups of variables are easier to analyze because of their smaller sizes. In many cases, it is helpful to divide variables based on their correlations: divide the variables into subgroups so that strongly correlated variables are put into one subgroup and little correlation exists between variables across subgroups. By doing so, the variables of one subgroup, however many there are, can be considered to control (or be controlled by) one physical factor. Because the PC transform utilizes covariances among variables to obtain eigenvectors, which are then used as weights in computing statistically uncorrelated principal components, it is believed that eigenvectors should be useful in dividing variables into subgroups. Specifically, the absolute values of the elements of an eigenvector represent the contributions of the corresponding variables to the principal component. Therefore, one could gather the variables corresponding to elements of large absolute value in an eigenvector into one subgroup.

Figure 5-10 presents the first eight eigenvectors of the 577-variable B-COM dataset. Contrary to our expectation, these plots do not identify any variable subgroup clearly. Instead, the plots imply that most variables contribute roughly equally to each principal component. Although possible in theory, this is an unlikely event in practice.
5.3.3 Quantitative Metric for Subgroup Separation
The purpose of this section is to introduce a quantitative metric for the 'effectiveness' of eigenvectors in defining a subgroup. Qualitatively, an eigenvector is effective in defining a subgroup if most of its elements are close to zero and a few elements have exceptionally large magnitude (positive or negative). An ineffective eigenvector is one whose elements are mostly of comparable magnitude. For example, the eight eigenvectors in Figure 5-10 fit the description of ineffective eigenvectors.

Figure 5-10: First eight eigenvectors of the 577-variable paper machine C dataset of B-COM.

The subgroup separation metric (GSM) defined in this section quantifies the effectiveness of an eigenvector. The idea of the GSM is very simple. It is the ratio between two simple averages: the average of the squared elements whose absolute values are in the largest 10 percent, and the average of the remaining elements squared. A large GSM value indicates a good subgroup separation.
Let v be an eigenvector of the covariance matrix of a given dataset and {v_i}, i = 1, ..., n, be its elements. If we rearrange {v_i} from the largest to the smallest in absolute value and rename them {q_i}, i = 1, ..., n, then the subgroup separation measure of v is defined as

GSM(v) = [ (1/k) Σ_{i=1}^{k} q_i² ] / [ (1/(n−k)) Σ_{i=k+1}^{n} q_i² ],          (5.2)

where k is the smallest integer larger than 0.1n. If the elements of v are drawn from a Gaussian distribution, then GSM(v) ≈ 7.
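A minimal sketch of the GSM computation, assuming v is a one-dimensional array of eigenvector elements (the function name is ours):

```python
import numpy as np

def gsm(v):
    """Subgroup separation metric of (5.2): ratio of the mean square of the
    largest 10 percent of |elements| to the mean square of the rest."""
    q = np.sort(np.abs(np.asarray(v, dtype=float)))[::-1]   # |v_i| in decreasing order
    n = q.size
    k = int(np.floor(0.1 * n)) + 1     # smallest integer larger than 0.1n
    top, rest = q[:k] ** 2, q[k:] ** 2
    return top.mean() / rest.mean()

# For Gaussian-distributed elements the value is indeed close to 7:
# gsm(np.random.default_rng(0).standard_normal(10_000))   # -> about 7
```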
The GSM values for the eight eigenvectors of Figure 5-10 are computed and presented in Table 5.1. Although we do not yet have an absolute criterion for whether a specific GSM value, for example 23.94 for the eighth eigenvector, indicates a good subgroup separation, we note that the GSM values for the first three eigenvectors are even smaller than 7, the GSM value of a vector whose elements are Gaussian.
        PC1    PC2    PC3    PC4     PC5     PC6     PC7    PC8
GSM     5.42   4.32   6.51   10.80   16.64   13.07   9.47   23.94

Table 5.1: GSM values for the eigenvectors plotted in Figure 5-10.
5.3.4 Separation of Variables into Subgroups using the BAPC Transform
Our attempt to determine subgroups through the PC transform was not very successful. As an alternative and a potential improvement, we decided to try the BAPC transform on the same 577-variable B-COM dataset. Since the ION algorithm effectively separates noisy variables into 'correlated signals' and 'uncorrelated noises,' it was expected that variables that are strongly correlated with one another would be deemed by the ION algorithm as having small noise variances.
Figure 5-11: Retrieved noise standard deviations of the 577-variable B-COM dataset.

The noise-adjusting part of the
BAPC transform then should boost the variances of such variables by dividing them by
their small noise standard deviations. As a result, the subsequent PC transform would be
dominated by such variables, which should emerge as the contributing variables to the top
principal components.
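A minimal sketch of this noise-adjusting step, assuming the variables have already been standardized to zero mean and that sigma_hat holds the noise standard deviations retrieved by the ION algorithm (the function and variable names are ours):

```python
import numpy as np

def bapc_eigenvectors(X, noise_std):
    """Noise-adjust each (zero-mean) variable by dividing it by its estimated
    noise standard deviation, then return the eigenvalues and eigenvectors of
    the covariance of the noise-adjusted data, sorted by decreasing variance."""
    Xa = X / noise_std                   # noise-adjusted variables (unit noise variance)
    C = (Xa.T @ Xa) / Xa.shape[0]
    w, V = np.linalg.eigh(C)
    order = np.argsort(w)[::-1]          # largest eigenvalues first (screeplot order)
    return w[order], V[:, order]

# eigvals, eigvecs = bapc_eigenvectors(X_std, sigma_hat)   # sigma_hat from the ION algorithm
```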
Figure 5-11 shows the noise standard deviations of the 577-variable B-COM dataset retrieved by the ION algorithm. These estimated noise standard deviations are used to noise-adjust the 577 variables. The eigenvalue screeplot of the noise-adjusted dataset is shown in Figure 5-12. Compared to the corresponding screeplot of Figure 5-9, the screeplot after noise-adjustment has a well-defined noise bed and high SNR values for the initial PC components.
Figure 5-13 shows the first eight eigenvectors obtained through the BAPC transform. The difference between Figure 5-10 and Figure 5-13 is obvious: the BAPC transform exhibits significantly better-defined subgroups than the PC transform, as seen through their respective eigenvectors. Table 5.2 presents the GSM values of the eigenvectors in Figure 5-13. These values are much larger than the GSM values of Table 5.1, indicating that the eight eigenvectors in Figure 5-13 are much more effective in separating variables into subgroups than the eigenvectors in Figure 5-10.
Figure 5-12: Eigenvalue screeplot of blind-adjusted 577-variable B-COM dataset.
        PC1      PC2      PC3      PC4      PC5      PC6     PC7      PC8
GSM     3372.7   665.86   615.62   735.27   110.21   30.44   132.60   2544.77

Table 5.2: GSM values for the eigenvectors plotted in Figure 5-13.
5.3.5 Verification of Variable Subgroups by Physical Interpretation
In order to verify the accuracy of the variable subgroups obtained from the BAPC transform, we would like to examine the physical meanings of the variables that contribute significantly to the individual principal components and are thus classified into subgroups. If the variables classified as one subgroup fall into one of the following three categories, they can be taken to confirm the validity of the subgroups obtained by the BAPC. The three categories are:

1. Variables classified as one subgroup measure one physical quantity repetitively. For example, if all variables in one subgroup measure temperature at the same position in the manufacturing process, those variables may be regarded as a valid subgroup.

2. Variables classified as one subgroup measure physical quantities which are known to be correlated. Determining whether variables fall in this category requires a priori engineering knowledge of the process.
3. Variables classified as one subgroup measure physical quantities which move very closely together. This does not require a priori engineering knowledge; in fact, variables which fall in this category may reveal previously unknown characteristics of the underlying manufacturing process.

Figure 5-13: First eight eigenvectors of the ION-normalized 577-variable B-COM dataset.
As a first example, let's study the first eigenvector in Figure 5-13.
Variables whose
corresponding eigenvector elements have absolute values larger than 0.3 are classified as a
subgroup. The list of variables and their physical meanings is given in Table 5.3. The first two variables in the list are related to air pressure, and the next three measure process speed. We learned from a process engineer at B-COM that speed and air pressure are closely related, and the list here agrees with that knowledge.

Variable Label     Physical Quantity Measured
P4:CUCSPT.MN       Couch vacuum set point
P4:PIC202.SP       Headbox pressure set point
P4:SI450           Wire drive speed
P4:SI451           Top wire drive speed
P4:SI453           Breast roll speed

Table 5.3: Variables classified as a subgroup by the first eigenvector in Figure 5-13.

In Figure 5-14, the time plots of the five variables are presented. Although the levels of the first two variables differ from those of the last three, the curves are almost identical in shape.
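A minimal sketch of this thresholding step, reusing the eigvecs returned by the earlier BAPC sketch together with a hypothetical variable_labels list:

```python
import numpy as np

def subgroup(eigvec, labels, threshold):
    """Collect the variables whose eigenvector elements exceed the threshold
    in absolute value, as done for Tables 5.3 and 5.4."""
    idx = np.flatnonzero(np.abs(eigvec) > threshold)
    return [labels[i] for i in idx]

# subgroup(eigvecs[:, 0], variable_labels, 0.30)   # first eigenvector, |element| > 0.3
# subgroup(eigvecs[:, 1], variable_labels, 0.15)   # second eigenvector, |element| > 0.15
```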
Let's repeat the same exercise for the second eigenvector. This time, variables whose
corresponding eigenvector elements have absolute values larger than 0.15 are classified as a
subgroup. Table 5.4 lists those variables. The first, second, and fourth variables measure
electric current provided by the power plant. The third variable is listed as spare, and we
do not know the physical nature of the variable. However, we may conclude from the result
that it is strongly related to electric current generated and provided by the power plant.
Figure 5-15 illustrates the time plots of the four variables chosen by the second eigenvector.
Again, they have almost identical shapes.
125
Figure 5-14: Time plots of the five variables classified as a subgroup by the first eigenvector.
Figure 5-15: Time plots of the four variables classified as a subgroup by the second eigenvector.
Variable Label     Physical Quantity Measured
P4:OUTAMP.MN       Top wire amps
P4:PRSAMP.MN       Press amps
P4:SPARE2          Spare
P4:TOPAMP.MN       Top press amps

Table 5.4: Variables classified as a subgroup by the second eigenvector in Figure 5-13.
Considering that the traditional PC transform did not identify even one meaningful subgroup through its eigenvectors, the results so far confirm our speculation that the BAPC is more effective in identifying subgroups of variables. Eigenvectors of the BAPC transform have higher GSM values, and the physical interpretations of the grouped variables are consistent. As a last part of our analysis of the effectiveness of the BAPC transform, we present in Figure 5-16 every tenth eigenvector up to the seventieth. It shows the trend that the earlier eigenvectors define a subgroup better than the later ones. The GSM values in Table 5.5 confirm this observation.
        PC1       PC10     PC20    PC30    PC40    PC50    PC60    PC70
GSM     3372.70   108.89   44.79   34.38   19.65   20.40   11.11   11.66

Table 5.5: GSM values for the eigenvectors in Figure 5-16.
Figure 5-16: Every tenth eigenvector of the ION-normalized 577-variable B-COM dataset.
5.3.6 Quality Prediction
Among the 577 variables of the B-COM dataset, several variables are designated as quality variables. These variables are typically measured offline. If the product quality represented by these variables does not meet pre-determined criteria, the failed product cannot be sold and has to be recycled. Quality failure was a significant factor in the overall cost to B-COM. Because the quality test could be carried out only after the final product is produced, the company was tremendously interested in finding ways to predict final product quality from other variables which are measured in real time.

The purpose of this section is to apply the BAPCR and least-squares linear regression to the B-COM dataset and compare the results. We are especially interested in whether the BAPCR performs better than least-squares linear regression for this real dataset. A favorable result would reinforce our claim, made in the previous chapters, that the BAPCR should be used rather than traditional least-squares linear regression in cases of noisy predictors.
Among the 577 variables, twenty-eight are identified as offline quality variables and separated from the other 549 variables, which are real-time variables. Among the twenty-eight quality variables, we chose one variable, which we call CMT, as the quality variable of interest. We will try to predict CMT using the 549 real-time variables. The 7847 observations are divided into two parts: the first 4000 observations are used to train the regression equation, and the remaining observations serve as the validation dataset. Since there are only 549 predictor variables, the 4000 observations in the training set should be more than enough.
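A minimal sketch of the training/validation experiment for the ordinary least-squares case, assuming the real-time variables and the CMT values are available as NumPy arrays (the function and variable names are ours):

```python
import numpy as np

def train_validate_ls(X, y, n_train=4000):
    """Fit ordinary least-squares linear regression on the first n_train
    observations and report the rms prediction error on the remaining
    (validation) observations, together with the validation-set std of y."""
    Xtr, Xva = X[:n_train], X[n_train:]
    ytr, yva = y[:n_train], y[n_train:]
    A = np.column_stack([np.ones(len(Xtr)), Xtr])        # add an intercept column
    coef, *_ = np.linalg.lstsq(A, ytr, rcond=None)        # least-squares fit
    pred = np.column_stack([np.ones(len(Xva)), Xva]) @ coef
    rms = np.sqrt(np.mean((pred - yva) ** 2))
    return rms, np.std(yva)                               # compare rms error to the std of CMT

# rms_error, cmt_std = train_validate_ls(realtime_vars, cmt)   # hypothetical arrays
```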
First, traditional least-squares linear regression is used to train the linear equation between CMT and the 549 variables. Once the training is carried out, the validity of the obtained equation is examined using the validation set. To illustrate how good (or bad) the prediction is, we present the scatter plot of the measured CMT and the predicted CMT in Figure 5-17. The horizontal axis represents the CMT values in the validation dataset, and the vertical axis represents the predicted CMT values, where the prediction equation is obtained by traditional linear regression. A perfect regression would produce a straight line of unity slope. The nearly ball-shaped scatter plot implies that the prediction by traditional linear regression is not working. The root-mean-square (rms) error is close to 2.75. This is in fact larger than the standard deviation of the CMT variable, which is close to 2.01, meaning that the linear regression is a worse predictor than a constant, namely the average value of CMT.

Figure 5-17: Scatter plot of true and predicted CMT values. Linear least-squares regression is used to determine the prediction equation.
In our next prediction attempt, we use the BAPCR to train the prediction equation. Figure 5-18 is the resulting scatter plot. Although it is still very noisy, this plot shows a stronger linear relation between measured and predicted CMT values than the scatter plot obtained by traditional linear regression. The rms error in this case is 1.78.

This example confirms that the BAPCR performs better than traditional least-squares linear regression when the predictors are corrupted by measurement noise. One may be surprised by the poor prediction result even with the BAPCR in this example. However, it turns out that a large portion of the variance in the CMT variable is simply measurement noise; in fact, one analysis shows that up to 70 percent of the CMT variance is attributable to measurement noise. That is why even the BAPCR could not predict the CMT values with more accuracy.
Figure 5-18: Scatter plot of true and predicted CMT values. BAPCR is used to determine the prediction equation.
Chapter 6
Conclusion

6.1 Summary of the Thesis
The problem of analyzing noisy multivariate data with minimal a priori information was addressed in this thesis. The primary objective was to develop efficient algorithms for maximum-likelihood estimation of noise variances. The problem falls into the category of signal enhancement of noisy observations, which has been widely investigated in recent years, but that research has been limited to cases where the degrees of freedom of the signal components in the noisy dataset are known. Similarly, the problem of estimating the degrees of freedom of the signal component has been studied by many researchers in the field of signal processing, but its application has been restricted to situations where the noise variances are either known a priori or uniform across variables. For many practical multivariate datasets, including ours in the areas of manufacturing and remote sensing, neither the noise variances nor the degrees of freedom of the signals are available, which makes previous results hard to apply.

Our approach in this work is to separate the problem of jointly estimating the degrees of freedom and the noise variances into two individual estimation problems, namely estimation of the degrees of freedom and estimation of the noise variances. We develop an algorithm for each estimation problem as if the other parameter were known, and cascade the two algorithms so that the degrees of freedom are estimated first and the result is applied to the estimation of the noise variances. The resulting noise variance estimates are then used to normalize the noisy variables, which should produce an improved estimate of the degrees of freedom. These steps are repeated until the estimates converge. The algorithm for estimation of noise variances is derived by modeling the signal and noise vectors as independent Gaussian vectors.
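For concreteness, the following sketch shows the shape of this iteration. It is not the thesis's exact algorithm: the order step uses a crude eigenvalue-gap heuristic in place of screeplot inspection, and scikit-learn's FactorAnalysis (an EM estimator that returns per-variable noise variances for a given order) stands in for the EM noise-variance estimator developed in this thesis. All names below are ours.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

def estimate_order(X_adj, gap=3.0):
    """Crude stand-in for the screeplot step: count eigenvalues that rise a
    factor `gap` above the median eigenvalue (the 'noise bed')."""
    w = np.linalg.eigvalsh(np.cov(X_adj, rowvar=False))
    return max(1, int(np.sum(w > gap * np.median(w))))

def ion_like(X, n_iter=10):
    """Alternate order estimation and EM-based noise-variance estimation
    until the noise-variance estimates settle."""
    g = np.ones(X.shape[1])                        # initial noise-variance estimates
    for _ in range(n_iter):
        p = estimate_order(X / np.sqrt(g))         # order from the noise-normalized data
        g_new = FactorAnalysis(n_components=p).fit(X).noise_variance_
        converged = np.allclose(g_new, g, rtol=1e-3)
        g = g_new
        if converged:
            break
    return p, g

# p_hat, g_hat = ion_like(X)    # X: m x n zero-mean noisy data matrix
```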
The potential applications of the algorithm are wide-ranging; they are all related to noise estimation and reduction. The three applications investigated in this thesis are linear regression, noise filtering, and principal component transforms. Simulations show that the ION algorithm should improve the performance of these applications significantly.
6.2 Contributions
The main contributions of this thesis can be summarized as follows:

1. The thesis derived a maximum-likelihood (ML) estimate of the noise variances for noisy Gaussian multivariate data. We considered a situation in which unknown noise variances must be retrieved from a sample dataset in order for the intended analysis tools to work satisfactorily. We approached the problem from an information-theoretic viewpoint by deriving an iterative EM-based algorithm for noise variance estimation. As long as the degrees of freedom are known or accurately estimated, the estimated noise variances are the ML estimate.

2. We considered a more realistic scenario in which the degrees of freedom are unknown, as are the noise variances. We proposed a scheme which separates the joint order and noise estimation problem into two individual estimation problems and alternates the two estimation processes so that each augments the other. Although we suggest an ad hoc but robust order estimation method based on the eigenvalue screeplot, we believe that other, more analytic methods could replace it thanks to the modular nature of the ION algorithm. The ION algorithm is readily applicable to various fields in multivariate data analysis and is easily implemented as a program.
3. We identified the application areas of the ION algorithm and provided a wide range of examples in order to visualize the benefits of the proposed algorithm over traditional multivariate analysis tools. The three applications considered in this thesis are linear regression, principal component transforms, and noise filtering:

• For linear regression, we illustrated that, depending on m (the number of observations) and n (the number of variables) of the training dataset, the performance represented by y of (4.12) could be as much as 4 times better for the BAPCR than for PCR or ordinary linear regression.

• For principal component transforms and the subgrouping of variables, by applying the ION algorithm before the NAPC transform we had tremendous success in identifying subgroups of variables from a large multivariate dataset derived from a paper manufacturing plant. The regular principal component transform could not detect any of those subgroups.

• For noise filtering, our experiments with remote sensing data illustrated that the performance of the ION filter approaches that of the Wiener filter, which is the theoretical limit of a linear filter.
6.3 Suggestions for Further Research
In our development of the EM algorithm for estimation of noise variances, we used an independent Gaussian vector model for the signal and noise vectors because of its analytic tractability. While assuming that the noise vector is Gaussian can be justified without much difficulty, the signal vector does not always have to be Gaussian. We did not attempt to extend the EM algorithm to non-Gaussian signal vectors. There are two further research topics regarding this issue:

• Apply the EM algorithm developed for Gaussian signal vectors to non-Gaussian signal vectors, such as those with Laplace, log-normal, or Rayleigh distributions.

• Extend the EM algorithm to other non-Gaussian signal vectors.

Another research opportunity involves the choice of the estimation method for the degrees of freedom. As we stated, there are many established methods for this problem [25, 26, 27, 32]. Our proposed method based on the eigenvalue screeplot is admittedly ad hoc, but it still makes sense because the method is robust against non-uniform noise variances across variables. It will be interesting to see how well other algorithms could be substituted for the screeplot method.
A convergence study is also left to future research. Although we have not encountered any example for which the ION algorithm did not converge after just a few iterations, one cannot be assured that the ION algorithm converges without exception. If the algorithm does converge without exception, then future work should prove so; if it does not always converge, then it remains to be understood in what circumstances it does and does not. As a related problem, if the ION algorithm converges without exception, it would be interesting to see whether it is asymptotically equivalent to the Wiener filter; all our examples indicate that they are almost identical performance-wise.

Yet another research area that is left out of this thesis is exploiting any time structure of the variables to characterize multivariate datasets. If a dataset has some time structure, investigating it could reveal potentially important characterizations of the dataset which may not be understood otherwise. Regarding this approach, power spectral analysis [41], autoregressive models, and Kalman filtering [42, 43] could be a few starting points.
Bibliography
[1] James B. Lee, Stephen Woodyatt, and Mark Berman. Enhancement of high spectral
resolution remote-sensing data by a noise-adjusted principal component transform.
IEEE Transactions on Geoscience and Remote Sensing, 28(3):295-304, May 1990.
[2] W. J. Krzanowski and F. H. C. Marriott. Multivariate Analysis, Part 1, 2. Arnold,
London, UK, 1995.
[3] Subhash Sharma. Applied Multivariate Techniques. John Wiley & Sons, 1996.
[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.
[5] E. Weinstein, A. V. Oppenheim, M. Feder, and J. R. Buck. Iterative and sequential algorithms for multi-sensor signal enhancement. IEEE Transactions on Signal Processing, 42(4):846-859, April 1994.
[6] A. Papoulis. Probability, Random Variables, and Stochastic Processes. McGraw-Hill, 3rd edition, 1991.
[7] T. W. Anderson, S. Das Gupta, and G. P. H. Styan. A Bibliography of Multivariate
Statistical Analysis. Halsted Press, New York, 1972.
[8] Jae S. Lim. Two-Dimensional Signal and Image Processing. Prentice-Hall, Englewood Cliffs, NJ, 1990.
[9] J. D. Jobson. Applied Multivariate Data Analysis. Springer-Verlag, New York, NY, 1991.
[10] Paul Newbold. Statistics for Business and Economics. Prentice-Hall, Englewood Cliffs, NJ, 4th edition, 1995.
[11] George Arfken. Mathematical Methods for Physicists. Academic Press, Orlando, FL,
third edition, 1985.
[12] Gilbert Strang. Linear Algebra and Its Applications. Harcourt Brace Jovanovich College
Publishers, 3rd edition, 1988.
[13] David A. Belsley, Edwin Kuh, and Roy E. Welsch. Regression Diagnostics. John Wiley
& Sons, 1980.
[14] Gene H. Golub and Charles F. Van Loan. An analysis of the total least squares problem.
SIAM Journal of Numerical Analysis, 17:883-893, 1980.
[15] Gene H. Golub and Charles F. Van Loan. Matrix Computations. The Johns Hopkins University Press, Baltimore, MD, 3rd edition, 1996.
[16] S. D. Hodges and P. G. Moore. Data uncertainties and least squares regression. Applied
Statistics, 21:185-195, 1972.
[17] W. J. Krzanowski. Ranking principal components to reflect group structure. Journal
of Chemometrics, 6:97-102, 1992.
[18] H. O. Wold. Partial least squares. Encyclopedia of Statistical Sciences, 6:581-591, 1985.
[19] A. E. Hoerl and R. W. Kennard. Ridge regression: Biased estimation for nonorthogonal
problems. Technometrics, 8:27-51, 1970.
[20] I. E. Frank and J. H. Friedman. A statistical view of some chemometrics regression
tools. Technometrics, 35(2):109-148, 1993.
[21] Gregory C. Reinsel and Raja P. Velu. Multivariate Reduced-Rank Regression. Springer,
1998.
[22] David L. Donoho. De-noising by soft-thresholding. IEEE Transactions on Information
Theory, 41:613-627, May 1995.
[23] Jae S. Lim. Image restoration by short space spectral subtraction. IEEE Transactions
on Acoustics, Speech, and Signal Processing, 28:191-197, April 1980.
[24] E. Weinstein, A. V. Oppenheim, and M. Feder. Signal enhancement using single and multi-sensor measurements. RLE Technical Report, MIT, 560, November 1990.
[25] H. Akaike. Information theory and an extension of the maximum likelihood principle. Proceedings of the Second International Symposium on Information Theory, Supplement to Problems of Control and Information Theory, pages 267-281, 1973.
[26] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19:716-723, 1974.
[27] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461-464, 1978.
[28] Mati Wax and Thomas Kailath. Detection of signals by information theoretic criteria.
IEEE Transactions on Acoustics, Speech, and Signal Processing, 33(2):387-392, April
1985.
[29] Robert. G. Gallager. Information Theory and Reliable Communication. John Wiley &
Sons, 1968.
[30] Carlos Cabrera-Mercader.
Robust compression of multispectral remote sensing data.
PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, 1999.
[31] L. C. Zhao, P. R. Krishnaiah, and Z. D. Bai. On detection of the number of signals in presence of white noise. Journal of Multivariate Analysis, 20(1):1-25, October 1986.
[32] J. Rissanen. Modeling by shortest data description. Automatica, 14:465-471, 1978.
[33] J. L. Horn. A rationale and test for the number of factors in factor analysis. Psychometrika, 30:179-185, 1965.
[34] S. J. Allen and R. Hubbard. Regression equations for the latent roots of random data correlation matrices with unities on the diagonal. Multivariate Behavioral Research, 21:393-398, 1986.
[35] Charles W. Therrien. Discrete Random Signals and Statistical Signal Processing. Prentice Hall, Englewood Cliffs, NJ, 1992.
[36] Sanford Weisberg. Applied Linear Regression. John Wiley & Sons, 2nd edition, 1985.
[37] J. Berkson. Are there two regressions? J. Am. Statist. Assoc., 45:164-180, 1950.
[38] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. John Wiley &
Sons, Inc., second edition, 1984.
[39] Vicente Zarzoso and Asoke K. Nandi. Blind separation of independent sources for virtually any source probability density function. IEEE Transactions on Signal Processing, 47(9):2419-2432, September 1999.
[40] Fawwaz T. Ulaby, Richard K. Moore, and Adrian K. Fung. Microwave Remote Sensing,
volume 1. Artech House, Norwood, MA, 1981.
[41] Gwilym M. Jenkins and Donald G. Watts. Spectral Analysis and Its Applications. Holden-Day, Oakland, CA, 1968.
[42] Gilbert Strang.
Introduction to Applied Mathematics. Wellesley-Cambridge Press,
1986.
[43] Carl W. Helstrom. Probability and Stochastic Processes for Engineers. Macmillan Publishing Company, New York, NY, 2nd edition, 1991.