© Tobias Gulden Dahl, 2002
ISSN 1501-7710
Cover:
Inger Sandved Anfinsen
Series of dissertations submitted to the
Faculty of Mathematics and Natural Sciences, University of Oslo
No. 229
All rights reserved. No part of this publication may be reproduced
or transmitted, in any form or by any means, without permission.
Printed in Norway:
GCS Media AS, Oslo
Publisher:
Unipub AS, Oslo 2002
Unipub forlag is a subsidiary company of Akademika AS owned by
The University Foundation for Student Life (SiO).
Contents

1 Introduction                                                        3
  1.1 Empirical Modelling of Motion                                   4
  1.2 Generalized Procrustes Analysis                                 6
  1.3 GCA and related methods                                        10
  1.4 Statistical Shape Analysis                                     12
  1.5 Mobile Communication Systems: MIMO                             14
  1.6 A two-agent problem                                            20
2 Short Summary of the Papers                                        27
3 Papers                                                             31
  3.1 A Bridge between Tucker-1 and Carroll’s Generalized
      Canonical Analysis                                             33
  3.2 Outlier and Group detection in Sensory Analysis using
      Hierarchical Cluster Analysis with the Procrustes Distance     61
  3.3 BIMA - Blind Iterative MIMO Algorithm                          87
  3.4 Blind MIMO Estimation based on the Power Method                99
  3.5 The Game of Blind MIMO Channel Estimation                     129
4 Discussion                                                        159
  4.1 Sensory Analysis                                              161
  4.2 Mobile Communications                                         162
Preface
This is a thesis for the dr. scient degree in Applied and Industrial Mathematics
(a program including Applied Statistics). The thesis is submitted to the
Department of Informatics at the University of Oslo. The thesis was prepared
under the supervision of Professor Nils D. Christophersen, University of Oslo,
Professor Tormod Næs, University of Oslo and MATFORSK, and Ole-Christian
Lingjærde, University of Oslo. The work has been financed by the University of
Oslo, Department of Informatics.
The thesis consists of five papers. Papers One and Two are about sensory
panel analysis. Papers Three, Four and Five are about wireless mobile
communication systems. The methods, techniques and ideas are closely
connected, even if the applications are different. The Introduction describes
these connections in detail.
Acknowledgments
I want to express my gratitude to my supervisors, Professor Tormod Næs and
Nils D. Christophersen. I am particularly grateful for all the time they have
spent with me, not only on the papers and the thesis, but also on talks not directly related to work. Tormod Næs has been very important to me, in that he has
taught me a lot about the processes involved in writing papers and researching,
as well as finding a balance between scientific work and other activities. Professor Nils Christophersen has always taken time to listen to my more and less
original and interesting ideas. Without his ’open door policy’, my progress would
have been much slower. I also want to direct special thanks to Dr. David Gesbert
at the Department of Informatics. Without his knowledge of mobile communication systems, it would have been impossible to publish in an area that was so
new to me. Special thanks to Ole-Christian Lingjærde, University of Oslo, for
joint work, and to Professor Nils Lid Hjort. Working together on the final thesis
paper was fun, and I learned a lot from our discussions. I also want to thank my
colleagues in Glasgow for inspiration, for all I learned about shape analysis, and
for giving me a statistical home away from home. Finally, thanks to my friends
and my family for support in periods of much work.
Tobias Dahl
Oslo, March 2002
Chapter 1
Introduction
From Sensory Analysis to Mobile Communication
Systems: A personal account
In this introduction, I try to show how a number of ideas, spread through the
various sub-projects of my thesis, relate to each other. The introduction is rather
personal, and more about the process behind the papers than about the results
they present.
Simply presenting background material from different fields of applied
statistics would not suit this material well: The central part of the thesis is
about connecting ideas from different areas of natural science. The reader
will find that the papers contain ideas and contributions from several areas of
quantitative discipline, including
• Motion Analysis
• Sensory Analysis, Psychometrics and Chemometrics
• Classic Multivariate techniques (Clustering, Principal Component Analysis, MANOVA) and Three-way Analysis
• Regularization and Stochastic Simulation
• Statistical Shape Analysis
• Signal Processing and Wireless Telecommunication
• Game Theory and Artificial Intelligence (AI)
This breadth calls for connections. Sometimes, the transfer of methodology from
one area to another is obvious and straightforward, as a specific method can
be more or less insensitive to the type of data it is applied to. At other
times, the connections are more intricate, but therefore more interesting. One
example in this thesis can be found when the first and the second last papers
are seen together: my attempt to make a new method for sensory panel
analysis at first turned out to be a replica of a well-known method. It had been
formulated by me in an iterative fashion, when a closed-form solution had been
known for many years. Still, it was my iterative solution (and not the closed-form) to this problem that carried over to solve a blind channel estimation
problem in telecommunications. In another paper, I take advantage of the
close connections between psychometrics and statistical shape analysis, and
show that developments in the latter, which is a rather new field, give rise
to new approaches of solving practical problems in the former. Orthogonal
approximation, which is closely related to Procrustes methods developed in
Psychometrics, is used in two papers on wireless mobile communications. More
examples are given later on in this chapter.
The purpose of this introduction is two-fold. Firstly, it presents background
material for the individual sub-fields I have worked in, and secondly, it gives
an account of how the different sub-projects relate to each other. As the
thesis covers several areas of quantitative science, I have tried to make the
introduction readable by persons from any of those, balancing on a thin line
between too much and too little detail on a specific subject. Still, I hope the
reader will find this section interesting and readable, and that it will serve well
as an introduction to the core of the thesis.
I will work my way through the papers in a chronological fashion, starting
with the end of my M.Sc. (about Motion Analysis), and ending up with
the final paper on game theory (Multi-Agent Systems) applied to a wireless
telecommunications problem.
1.1 Empirical Modelling of Motion
In my M.Sc. thesis, I studied the movements of athletes using empirical analysis techniques. Movement analysis is often seen from a bio-mechanical point
of view. My idea was to use methods from multivariate statistics (principal component analysis, discriminant analysis) rather than the deterministic techniques
(differential equations) typical for this kind of analysis. Attaching reflex markers to different parts of the athletes’ bodies (typically the joints), I would track
a specific movement using a special camera set-up. I would then compare and analyze the movements of the different athletes who participated in the experiment.
The purpose was to detect differences in the execution, rather than finding an
“optimal movement”, and also to study how different body parts moved to perform
a whole action. One of the first problems I faced was that of matching the
motions of the athletes. This is not quite the same as comparing them; it was
rather a question of data pre-processing: no athlete stood at exactly the same
place as another when performing the movement. Athletes were also of different sizes. Sometimes, the setup of the 3D motion cameras had changed since the
last recording session, and so there was a constant need to “align” the movement
recordings prior to any further analysis.
To compensate for these effects, I initially used an affine transformation (see
e.g. Newman and Sproull, 1976). This is a rather general transformation, capable of rotating, scaling, stretching, translating, and bending curves in certain
ways, known as a “shear”. If two sets of points in 3-space are held as the rows of
the matrices X1, X2, both in R^{N×3}, the matching is done by solving

    min_{Q,t} ||(X1 Q + 1 t^T) − X2||_F^2                    (1.1)

where || · ||_F is the Frobenius norm, Q is a 3 by 3 matrix, t ∈ R^3 is a translation
vector, and 1 is the N×1 vector of ones. Of course, this problem can be generalized
to cover points in p-space, making the matrices X1, X2 both in R^{N×p}, with Q ∈ R^{p×p}
and t ∈ R^p.
When I took account of the fact that movements could be initiated at certain
points in time as well as in space, preprocessing with the affine transformations
gave interesting and interpretable results in the subsequent analysis. From the
literature, we knew another option: to use the procrustes transformation for matching the curves. This transformation can be seen as a restricted version of the affine
transformation. It admits rotation, translation, and isotropic scaling. The
optimal procrustes transformation T(X1) to match X2 is found by solving

    min_{Q,c,t} ||(c · X1 Q + 1 t^T) − X2||_F^2              (1.2)

subject to the constraint that Q is orthogonal, Q^T Q = I, and c is a scalar. One
could argue that the unwanted effects (positioning, rotation, size compensation)
would be exactly the ones that could be compensated for by using the procrustes
transformation, whereas the flexibility of the affine transformation could lead to
over-fitting.
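The closed-form solution of (1.2) for the orthogonal part is a classical SVD result (Schönemann, 1966). As a minimal numpy sketch (the function and variable names here are mine, not from the thesis):

```python
import numpy as np

def procrustes_match(X1, X2):
    """Sketch of solving (1.2): center both configurations (the optimal
    translation), then find the orthogonal Q and isotropic scale c
    minimizing ||c * X1c @ Q - X2c||_F via the SVD of X1c^T X2c."""
    X1c = X1 - X1.mean(axis=0)          # remove translation
    X2c = X2 - X2.mean(axis=0)
    U, s, Vt = np.linalg.svd(X1c.T @ X2c)
    Q = U @ Vt                          # optimal orthogonal matrix
    c = s.sum() / (X1c ** 2).sum()      # optimal isotropic scale given Q
    return c * X1c @ Q, Q, c
```

Matching a configuration against a rotated, scaled and shifted copy of itself should then give a near-zero residual, which is exactly the alignment needed before comparing motion curves.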
Procrustes Analysis was, according to Dryden and Mardia (1998), “developed
initially for applications in psychology”. First references include Mosier (1939)
and later Green (1952), Cliff (1966), Schönemann (1966, 1968), Gruvaeus (1970),
and Schönemann and Carroll (1970). A list of authors that have described the
pure rotation case (a restriction so that multiplication by Q does not involve
reflections), as well as a number of books introducing Procrustes Analysis, can
be found in chapter 5 of Dryden and Mardia (1998).
For the problem of matching motion curves, the procrustes transformation
turned out to work well. Pre-processing with this transformation gave better
discrimination between athletes than we found using the affine transformation.
I was still interested in how the two transformations related, and how they
affected my data as pre-processing steps. As part of my M.Sc., I constructed
a bridge between the two transformations. The bridge was controlled by
a flexibility parameter α. Choosing α = 0 would give a “rigid” procrustes
transformation, α = 1 would give a “flexible” affine transformation, and values
in-between would yield intermediate flexibility in the transformation.
The starting point of my PhD was an attempt to modify Generalized
Procrustes Analysis (GPA, explained below), giving it the extra flexibility of the
affine transformation to avoid some of the problems associated with rigidness.
1.2 Generalized Procrustes Analysis
The procrustes transformation is the central part of a sub-field called “procrustes
analysis” (PA). The technique for matching one set of points to another
also has a “big brother”, known as Generalized Procrustes Analysis (GPA).
Whereas PA can be used to make two configurations similar, in GPA one tries to
make a group of configurations as similar as possible to each other. Central
papers on GPA include Kristof and Wingersky (1971), Gower (1971, 1975),
Ten Berge (1977), Sibson (1978,1979), Langron and Collins (1985) and Goodall
(1991).
GPA was originally designed to handle multi-dimensional data rather than
planar or 3D points that could be visualized on a piece of paper, or in space.
A typical example of a situation where GPA can be applied is when a panel
of judges is assessing a selection of wines. For each wine, each member will
give scores for sourness, bitterness, saltiness, spiciness, oak-blend, fruitfulness
etc., on a scale from 0-10. What the manufacturer wants is an “overall expert
opinion”, which enables him to put a label on the back of each bottle, telling
whether the wine is “fruitful”, “ripe”, “dry” or “bitter” etc. A first shot would be
the average score for each wine, and for each category, taken over the judges. In
practice, however, it turns out that the judges use the scales differently (Langron, 1983). One person’s perception of “sourness” may for another be a mixture
of "sourness" and "bitterness". There is even a technique called “free choice profiling” (Arnold and Williams, 1985), where the judges themselves (individually)
pick the vocabulary to use for a tasting session. The assumption made by the
researcher is that the panel members have a similar underlying perception or
taste experience, but that they use the attributes differently. In such situations,
comparison is meaningless without some kind of compensation. Also, one judge
may be restrictive in his use of the scales, rating all wines (and their qualities)
in the range 3-5, while another uses the full scale 0-10. A judge may also
be overly negative or positive in his assessment, giving scores 0-3 or 7-10 more
often than the rest of the panel.
Ignoring these effects, the “average chart” for the judges can be non-informative.
It can be assumed, however (Arnold & Williams, 1985), that each judge will stick
to his scale and his taste interpretation throughout the whole session, which
makes it possible for a mathematician to compensate for the systematic differences (or “divergence in use of the terms”). The procrustes transformations
are used for this compensation. The compensation can be understood geometrically,
as each wine can be represented as a point in a high-dimensional space, with
co-ordinates given by the different tasting dimensions. The method I used for
matching motion curves, had been used for decades to improve the reliability of
tasting experiments.
Let X1, X2, . . . , Xq denote q sensory profiles, each matrix Xi being N samples
× P variables. The rows correspond to food samples (or alternatively, markers
on a body), and the columns to tasting categories (or dimensions x, y, z in space).
Throughout, we assume that the profiles are centered, i.e. that they are each
pre-multiplied by the matrix C,

    Xi := C Xi                                               (1.3)

    C = I − (1/N) 1 1^T                                      (1.4)
It can be shown that this centering is equivalent to a translation of all rows in
each Xi to have center in the origin, and furthermore, that this is the optimal
way of using translation in profile matching (see Dahl, 1998 or Mardia et al.,
1997). The GPA problem is now that of solving (see e.g. Risvik and Næs, 1996)
    min_{T_ij} Σ_{i=1}^{q} Σ_{j<i} ||T_ij(Xi) − T_ji(Xj)||_F^2        (1.5)
where T_ij denotes the optimal Procrustes transformation to match¹ the profile
Xi with the profile Xj. The standard approach to solving this problem is to
replace it with another problem,

    min_{T_i} Σ_{i=1}^{q} ||T_i(Xi) − Y||_F^2                (1.6)

    Y = (1/q) Σ_{i=1}^{q} Xi                                 (1.7)
Equivalence between (1.5) and (1.6) can be shown using standard results from
multivariate analysis. The matrix Y now denotes the average or consensus that
the other profiles are transformed to match. The criterion is usually minimized in an iterative fashion. First, an initial consensus is made (see Ten Berge
(1977) for details). Next, all the profiles are transformed to match this consensus, before a new average Y is taken, etc. The GPA routine can be summarized
as:
GPA Algorithm (Ten Berge, 1977)
0. Center the profiles Xi and make an initial Y.
1. Transform each of the profiles X1, X2, . . . to resemble Y by an orthogonal
transformation.
2. Make a new Y as the average of the transformed X1, X2, . . . .
3. Repeat from 1. until convergence.
4. Scale the (transformed) profiles X1, X2, . . . to match each other.
Some stopping criterion is used in step 3, typically by defining a threshold value
for the decrease in the objective function (1.6). Isotropic scaling of all matrices
to match each other (step 4) is described in ten Berge and Paul (1993).

¹ Care must of course be taken when selecting the transformations, because as opposed to the
case with ordinary Procrustes Analysis, both profiles will be transformed.
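The loop above can be sketched in a few lines of numpy. This illustrates steps 0-3 only (the isotropic-scaling step 4 of ten Berge and Paul is omitted), and the function name and defaults are mine:

```python
import numpy as np

def gpa(profiles, tol=1e-12, max_iter=1000):
    """Sketch of GPA steps 0-3: center each profile, then alternate
    between orthogonally matching every profile to the consensus Y
    and recomputing Y as the mean of the matched profiles."""
    Xs = [X - X.mean(axis=0) for X in profiles]   # step 0: center
    Y = sum(Xs) / len(Xs)                         # initial consensus
    prev = np.inf
    for _ in range(max_iter):
        for i, X in enumerate(Xs):                # step 1: orthogonal match
            U, _, Vt = np.linalg.svd(X.T @ Y)
            Xs[i] = X @ (U @ Vt)
        Y = sum(Xs) / len(Xs)                     # step 2: new consensus
        obj = sum(np.linalg.norm(X - Y, 'fro') ** 2 for X in Xs)
        if prev - obj < tol:                      # step 3: convergence check
            break
        prev = obj
    return Y, Xs
```

When the profiles really are rotated copies of one common configuration, the objective (1.6) should be driven to essentially zero.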
1.2.1 The Rigidness of GPA
The GPA procedure can be criticized for its lack of flexibility. In sensory
analysis, the operational steps constituting the procrustes transformation put
restrictions on how much compensation can be made on the score charts. There
is no obvious reason that compensations should be orthogonal. The isotropic
scaling also forces all tasting dimensions to be scaled equally much. A judge
could for instance be very restrictive in one of the tasting dimensions only. To
compensate for this, it would be more correct to scale just that dimension to
match the average opinion, and not all the other dimensions. There was also the
question of whether a reduced-rank consensus could be found (some work has
been done, e.g. Peay (1988)), as there is no guarantee that a set of panel judges has
all tasting dimensions in common.
1.2.2 Softening GPA
It turned out that my attempts at softening GPA led me to rediscover a
well-known method. Still, as the reader will see, my solution to this problem
enabled me to solve a problem in mobile communications later on, that would
not have been easily solved without this experience. In particular, I wanted to
“improve” GPA by changing step number 1 to become
1. Affine Transform each of the profiles X1 , X2 , . . . , to resemble Y.
or even
1. Transform each of the profiles X1 , X2 , . . . to resemble Y, using my flexible
transformation controlled by α.
This would, I believed, enable me to bridge GPA with a “soft GPA” or perhaps a
"Generalized Affine Analysis", just as I had succeeded in bridging the procrustes
and the affine transformation. I envisioned a soft GPA that could capture the
subtle nuances of tasting experiences better, and consequently, provide a better
representation of the overall tasting experience than GPA could.
1.2.2.1 Collapse
I tried to replace the original step 1. with one based on the affine transformation.
Doing so, I realized two things: first, the procedure did not converge. Rather,
the elements of the Y matrix tended to infinity (or sometimes towards
0). I tried to help this by normalizing each of the columns in Y, introducing an
extra step,
2½. Normalize each column yi in Y: yi := yi/||yi||.
Doing this helped the situation only a little. The elements of Y no longer tended
to infinity, but it turned out that all columns of Y became identical (or sometimes, identical up to a change of sign).
It was hard to interpret this result. With all columns identical, it could seem
like the computer program was saying: “There is only one dimension (or tasting
category) present in the whole data set X1, X2, . . . . This is the dimension you
see in all columns of Y”. I was quite confident that the judges were capable of having more than one common, underlying dimension of taste. Surely, the
judges must be using more than one set of taste buds. Even when I generated a
set of random data matrices, the same phenomenon persisted: all columns of Y
became identical.
To help the situation, which I had “half helped” by normalizing the vectors yi,
I resorted to “normalization’s big brother”, “orthogonalization”. This choice was
inspired by the observation that the parallelism of the Y columns only appeared
after a few iterations. Initially, the columns of Y could be quite different, but
they would gradually become more and more similar. I therefore decided to let
the first column be as it was (apart from normalizing it, of course), then from
the second column I would subtract whatever could be accounted for by the first
column, and from the third column, whatever could be found in the first two
vectors, etc. This was an extra step,
2¾. From each column yi, remove any contribution from the previous columns
y1, y2, . . . , y(i−1).
This was just a series of projections, or a Gram-Schmidt process. Doing this, I
forced the columns of Y to be different, but I still gave the data some “liberty
of speech”, by not fixing them completely. Still, I wasn’t satisfied. The Y
matrix that I was now calculating was clearly some kind of average opinion
of the judges. However, it seemed a little bit funny that all columns would
have to be independent (or, looking at them as vectors, perpendicular); they
had no correlation. Furthermore, all columns were of the same norm after
the normalizing procedure. Clearly, this “forced consensus Y” didn’t look very
much like any of the individual profiles X1, X2, . . . , which sometimes had a lot of
correlation between the columns, and rarely had any column norm equal to one.
I worked a little more on the problem to try to understand how the algorithm
processed the data. By writing out the recursive procedure as a single equation,
I managed to show that the columns of Y were the eigenvectors of a particular
matrix, constituted by certain sums and products of the profiles X1 , X2 , . . . .
Forgetting about orthogonalization and normalization for a moment, and using
Y^(0) as the initial consensus guess, the matrix Y^(k) would after k iterations
become

    Y^(k) = ( Σ_i Xi (Xi^T Xi)^† Xi^T ) · · · ( Σ_i Xi (Xi^T Xi)^† Xi^T ) · Y^(0)   [k factors]   (1.8)

or alternatively,

    Y^(k) = [ Σ_i Xi (Xi^T Xi)^† Xi^T ]^k Y^(0)              (1.9)
Here, the “†” denotes the Moore-Penrose pseudo-inverse. It can be seen that
this is a power method (see e.g. Golub and Van Loan, 1996) for each column
of Y^(0) to converge to the top eigenvector of the matrix Σ_i Xi (Xi^T Xi)^† Xi^T. By
keeping the columns of Y^(k) orthogonal through a Gram-Schmidt process, I got
a set of P orthogonal eigenvectors as the (orthogonalized) consensus matrix
Y. Shortly before publishing these results, I found out that my new method
had already been discovered over 30 years ago. It was known as Generalized
Canonical Analysis, or GCA, described by J.D. Carroll in 1968. My work on
GPA, which by now had changed to become work on GCA, led me into studying
more methods for multiple data matrices X1, X2, . . . , XQ, and to write my first
thesis paper about connections between such methods.
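In numerical linear algebra, the combination of the power iteration (1.9) with a Gram-Schmidt step is known as orthogonal iteration (a block power method). A small numpy sketch, with my own naming and with a QR factorization playing the role of Gram-Schmidt:

```python
import numpy as np

def consensus_by_orthogonal_iteration(profiles, P, n_iter=100, seed=0):
    """Block power method for the consensus of (1.9): repeatedly
    multiply by Z = sum_i X_i (X_i^T X_i)^+ X_i^T and re-orthogonalize
    the columns (QR in place of explicit Gram-Schmidt)."""
    N = profiles[0].shape[0]
    Z = np.zeros((N, N))
    for X in profiles:
        Z += X @ np.linalg.pinv(X.T @ X) @ X.T
    # random orthonormal start, analogous to an initial consensus guess
    Y = np.linalg.qr(np.random.default_rng(seed).standard_normal((N, P)))[0]
    for _ in range(n_iter):
        Y, _ = np.linalg.qr(Z @ Y)    # power step + orthogonalization
    return Y, Z
```

The columns of the returned Y span the top-P eigenspace of Z, matching the eigenvector characterization above; when all judges share the same underlying column space, Y recovers exactly that space.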
1.3 GCA and related methods
GCA exists in several variants. Important papers on the subject include Carroll
(1968: MAXVAR), Kettenring (1971: SUMCOR,MINVAR,SSQCOR,GENVAR),
van de Geer (1984), Tenenhaus (1987), Kiers et al. (1994), and van der Burg and
Dijksterhuis (1996). The GCA criterion used by Carroll is

    max_{β_ij, y_j} Σ_{i=1}^{Q} corr^2(Xi β_ij, y_j)         (1.10)
under the constraint Y^T Y = I, where Y = [y1, y2, . . . , yK], and β_ij is a vector of
regression coefficients. The dimension of the solution space K can be selected
by the user. van der Burg et al. (1994) have shown that this problem can be
reformulated to become
    min_{β_i, Y} Σ_{i=1}^{Q} ||Xi β_i − Y||^2   subject to Y^T Y = I     (1.11)
with {βi } as regression coefficient matrices. The solution to any of the problems
(1.10),(1.11) is to let the columns of Y be the orthogonal eigenvectors of
    Z = Σ_{i=1}^{Q} Xi (Xi^T Xi)^† Xi^T                      (1.12)
as demonstrated in the appendix of the same paper, and as I myself had
discovered through my work.
Reading the GCA literature, and exploring other methods for studying profile
data, I came across a set of so-called three-way methods. The most well
known, apart from Generalized Procrustes Analysis, are probably the 3-way
factor analysis methods PARAFAC, Tucker-1, Tucker-2 and Tucker-3, (see e.g.
Kroonenberg and De Leeuw, 1980, Naes and Kowalski 1989).
Some of these three-way methods resemble GCA (or my own “new” procedure)
quite a lot. Not only do they yield a kind of “consensus matrix” Y that reflects
the overall opinion of the judges, but they do so in ways similar to other methods
I had studied. The Tucker-1 method is known as a 3-way principal component
analysis. It detects structures that are common for several judges, along several
tasting dimensions, and for several samples (of food, wine etc) in a data set. It
then uses these structures as a basis for exploring and describing the individual
judges in a “common language”, a kind of “all-agree-upon-this-terminology” that
can be detected. The Tucker-1 problem is that of solving
    min_{ {β_i}, Y^T Y = I } Σ_i ||Xi − Y β_i||^2            (1.13)
which is not very different from the least-squares GCA criterion (1.11). Both
methods use a kind of consensus matrix that contains compressed information
about the data. GCA and Tucker-1 were indeed closely related, yet it seemed
like the choice of which method to use in a particular situation was a matter
of scientific background, and not well explored.² GCA and Tucker-1 were so
similar that I could “bridge” the two, in a similar way as when bridging the
procrustes and the affine transformation. Using this bridge on empirical tasting
data, I demonstrated the use of both methods in a unified framework that
facilitated comparison. This provided me with “a way from one method to the
other”; it enabled me to create “border cases” between the two, highlighting their
differences, and to discuss when one method was more appropriate than the other.
I developed a number of plots that could help the practical user decide which
particular method to use, or even give him the option of a hybrid. Another aspect
of the work was that GCA could be quite sensitive to noise in the data, while
Tucker-1 was more robust. Even if the nature of the problem a user wanted to solve
called for using GCA, the wish to regularize the noise sensitivity could make
him go slightly towards a Tucker-1 method. A “regularized GCA” (RGCA)
is a hybrid method of the framework. These results formed the first paper of
my thesis, with the title “A Bridge between Tucker-1 and Carroll’s Generalized
Canonical Analysis”. The paper also discussed several techniques for choosing the α parameter,
including a kind of external validation using MANOVA in conjunction with the
process parameters for making the food samples. The paper is joint with Tormod
Næs.

² One of the reviewers of the first paper has argued that theoretical comparisons between
GCA and Tucker-1 have in fact been done, but that combining the methods seems to be a new
approach.
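The paper’s own α-parameterization is not reproduced in this introduction. Purely as an illustration of the regularization idea, note that GCA diagonalizes the pseudo-inverse-whitened matrix (1.12), and a ridge-type modification of that whitening interpolates towards an unwhitened, Tucker-1-like cross-product. The following sketch is hypothetical (my own construction, not the thesis method):

```python
import numpy as np

def hybrid_consensus(profiles, lam, K):
    """Hypothetical ridge-type hybrid: top-K eigenvectors of
    sum_i X_i (X_i^T X_i + lam*I)^{-1} X_i^T.  As lam -> 0 this
    approaches the GCA matrix Z of (1.12); increasing lam damps the
    whitening, which tames GCA's noise sensitivity.  Illustration only."""
    N, P = profiles[0].shape
    Z = np.zeros((N, N))
    for X in profiles:
        # solve() gives X (X^T X + lam I)^{-1} X^T without explicit inverse
        Z += X @ np.linalg.solve(X.T @ X + lam * np.eye(P), X.T)
    w, V = np.linalg.eigh(Z)
    return V[:, ::-1][:, :K]    # eigenvectors by decreasing eigenvalue
```

The returned consensus columns are orthonormal by construction, as in the GCA constraint Y^T Y = I.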
1.4 Statistical Shape Analysis
As a part of my PhD, I was at the University of Glasgow for a six month period.
Just before my arrival, a new project had been started on shape analysis of babies’ cleft lip and palate. With my background in image analysis and procrustes
methods, I soon joined a small group of people trying to get to grips with statistical shape analysis, studying a new book by Dryden and Mardia (1998). In this book,
procrustes methods were a central topic. I learned that GPA had been in use for
shape analysis problems since the 1990s. Among the important papers in this
area are Kendall (1984, 1989), Le and Kendall (1993), Goodall (1991), Ziezold
(1994), Le (1995), and Kent and Mardia (1997). The book explores the origins of
Procrustes Analysis, describing the connections between multivariate analysis
and shape analysis in an interesting way. Introduction of old methodology into
new fields can lead to new developments. For example, in sensory analysis, the
researchers were concerned with the problem of making profiles X1, X2, . . . as
similar as possible to one another, by means of procrustes transformations, before taking an average Y that would serve as a “consensus”. In shape analysis,
there was an interest in assessing how similar the profiles X1, X2, . . . were, after
transformation. The profiles (or rather, configurations) were no longer tasting
score charts, but rather curves, defined by point sets outlining objects such as
skulls or tanks.
(Table: Synonyms: profiles, configurations, point sets, matrices. Illustrate their
interpretations and them being identical).
To measure the difference between shapes (after procrustes transformation),
the procrustes distance (Le and Kendall, 1993) was introduced. For centered
configurations X1, X2, this is

    d(X1, X2) = ||T(X1) − X2||_F^2                           (1.14)
where T(·) denotes the optimal procrustes transformation for X1 to match X2.
The problem with this distance measure is that it is asymmetric. Generally,

    d(X1, X2) ≠ d(X2, X1)                                    (1.15)

which makes it inappropriate for a number of purposes. However, if the matrices
are normalized to have fixed variance (for instance, unit variance), we have the
full procrustes distance,

    d_F(X1, X2) = ||T(X1/||X1||) − X2/||X2||||_F^2           (1.16)

and symmetry, d_F(X1, X2) = d_F(X2, X1), is obtained.
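A direct computation of (1.16), using the standard SVD closed form for the optimal transformation, can be sketched as follows (function and variable names are mine):

```python
import numpy as np

def full_procrustes_distance(X1, X2):
    """Full Procrustes distance (1.16): center and scale both
    configurations to unit Frobenius norm, optimally rotate and
    rescale the first onto the second, and return the squared
    residual (which works out to 1 - (sum of singular values)^2)."""
    A = X1 - X1.mean(axis=0)
    A = A / np.linalg.norm(A)           # unit-variance normalization
    B = X2 - X2.mean(axis=0)
    B = B / np.linalg.norm(B)
    U, s, Vt = np.linalg.svd(A.T @ B)
    Q = U @ Vt                          # optimal orthogonal transform
    c = s.sum()                         # optimal scale, since ||A|| = 1
    return np.linalg.norm(c * A @ Q - B, 'fro') ** 2
```

Unlike the raw distance (1.14), this value is unchanged when the two arguments are swapped, and it is invariant to rotation and scaling of either configuration.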
It occurred to me that this new measure d_F could be brought back into
sensory analysis. Little work had been done on the following problem: if a
set of judges all taste a selection of food samples, but one of the judges has
no or little common experience with the others, what good will standardization
with the GPA procedure do? If there are members of two different expert
groups in the tasting panel, how can one check whether these two groups are capable
of detecting the same tasting sensations? One group could be highly trained
to detect certain spices, another may be trained to spot nuances of saltiness.
Forcing all data from such panel members through GPA, without any previous
check on the consistency of the board, can yield a consensus not representative
of any of the judges. This would be a “head-in-the-oven, feet-in-the-freezer, and
on-the-average, we’re-doing-just-fine” situation. Even if two expert groups could
independently spot important (but different) qualities in the food, taking a GPA
average of them all could cut off important contributions from both sides. A way
of detecting group structures and outliers among the judges is to collect them
in clusters. If the perception of two judges differed only in terms of how they
used the scales (e.g. they had a common underlying tasting experience, distorted
only by misuse of terms), they could be grouped to become one. If a third judge
was similar to (at least) one of the two, he could be grouped as one of them as
well. Repeating this procedure, one could create a chart (Figure 1.1) that can
be used for detecting sub-groups and outliers among the sensory judges. The
central element in this procedure was the procrustes distance, which was used
to measure pair-wise differences between the judges, after transforming them
to similarity by the procrustes transformation. The combination of Hierarchical
Cluster Analysis (HCA) and the Full Procrustes Distance was the subject of my
second paper, joint with Tormod Næs.
Against this approach one could raise the same critique as against GPA.
Using the procrustes transformation in a sensory context takes away some of
the flexibility needed to detect nuances in the consensus of the judges. However,
since this paper was concerned with pair-wise comparison of profiles, there really
was no need to use the GPA algorithm on the full set of profiles. Working with
pairs of profiles only, one can at any time resort to an affine transformation, and
create an affine distance to replace the procrustes distance, or even use more
complicated distances.
0.9
0.8
0.7
0.6
0.5
0.4
1
4
9
2
10
7
3
5
8
6
Figure 1.1: Dendrogram for Hierarchical Clustering with the Full Procrustes Distance.
The numbers on the horizontal axis are the judge indexes. The lengths of the vertical
lines connecting judges are the distances dF(·, ·). In this experiment, judges 1, 2, 4, 7, 9
and 10 are similar, judge 6 is an outlier, and judges 3, 5 and 8 are possible outliers,
not well connected with the consensus group.
1.5 Mobile Communication Systems: MIMO
When I returned from Glasgow, there was a growing interest in mobile
communications systems at the University of Oslo. New techniques based on
MIMO antennas (multiple input, multiple output) were being studied. At first
glance, this seemed like a subject far away from my previous work on sensory
and shape analysis, but it soon turned out that ideas carried over from one field
to another.
1.5.1 MIMO Background
Quoting Gesbert and Akhtar (2002), MIMO systems can be defined as referring
“to a link for which the transmitting end as well as the receiving end is equipped
with multiple antenna elements”. The properties of MIMO systems were first
discussed in a number of information-theory articles published by members of
the Bell Labs (Telatar 1995, Foschini 1996). Systems with several antenna
elements at one side have been in use since the seventies (smart antennas).
Typically, the base station (where cost and space are more easily affordable than
on a portable subscriber unit such as a PDA, laptop or cell-phone) is equipped
with several antenna elements, whereas
Figure 1.2: Multiple antenna elements at both the transmitter and the receiver
give a large increase in channel capacity compared with conventional systems (smart
antennas).
the mobile subscriber unit has a single antenna (a SIMO/MISO system, “Single
Input, Multiple Output”/”Multiple Input, Single Output”). Smart antennas refer
to signal processing techniques that exploit the fact that data is transmitted
from or received by multiple antenna elements. One aspect of this ’smartness’
is to offer more reliable communication in the presence of adverse propagation
conditions, such as multi-path fading and interference. Another concept is
that of beam-forming, which means that power is distributed over the antenna
elements to focus energy in a desired direction, thereby increasing the signal-to-noise
ratio (SNR). As we will see, MIMO channels inherit all this smartness,
and have some extra advantages over conventional smart antennas.
Multiple antennas both at the receiver and the transmitter make a matrix
channel H, of dimensions (N =number of receive antennas) times (M =number
of transmit antennas). Transmission of a data vector x, from the base station
BTS to the mobile subscriber unit X through the channel H can be expressed as
y = Hx + n
(1.17)
where n is a noise term, usually assumed to be white Gaussian noise, n ∼
N(0, σ²I). From the viewpoint of the receiver, this means that each of the
individual antenna elements receives its signals through channels that are
sufficiently different. In terms of linear algebra, and momentarily assuming
no noise, this means that the equation system
y = Hx
(1.18)
must be solvable for any x. If the coefficients {y_i} of the vector y correspond to
the signals at the receive antenna elements, and {x_i} is the power output from
each transmit antenna element, then the signal at receive element i is

y_i = h_i^T x    (1.19)

where h_i^T is the i’th row of H. This equation states that each receive antenna
element i sees the data vector x through a channel determined by h_i. Now, if there
are linear dependencies between the columns of H, successful determination
of x will only be possible in certain situations. Note that if multiple antenna
elements existed only at the transmitter, there would only be one coefficient
x1 in the vector x. With multiple receive antennas, the capacity of the system
is increased, for each of the coefficients of the vector x can be a symbol in an
individual data stream. This is known as spatial multiplexing, and is one of the
advantages MIMO has over smart antennas. Note that the channel matrix H
is determined using training data (sequences known by both parties), before the
individual streams can be separated and the symbol data estimated.
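This training-based estimation can be illustrated with a small simulation (a sketch with a synthetic Gaussian channel, not any particular standard's training scheme): collecting T training vectors as the columns of a matrix, the receiver recovers H by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, T = 4, 3, 50                      # receive antennas, transmit antennas, training length
H = rng.standard_normal((N, M))         # the unknown channel matrix
X_train = rng.standard_normal((M, T))   # training vectors known to both parties
noise = 0.01 * rng.standard_normal((N, T))
Y = H @ X_train + noise                 # each column obeys y = Hx + n, as in (1.17)

# Least-squares estimate of H from the training block
H_hat = Y @ np.linalg.pinv(X_train)
```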
Another striking property of MIMO systems is the ability to exploit certain
spatial modes of transmission and retrieval so as to maximize the SNR. In
particular, the singular vectors of the matrix H play an important role. The
reason for this is the fact that if H = Σ_{j=1}^r σ_j u_j v_j^T is a singular value
decomposition of H, with r = rank(H), then

v_1 = arg max_{x: ||x||=1} E{||Hx + n||²}    (1.20)
The received signal will be strongest if x is parallel with the top right-side
singular vector of H, or equivalently, when the coefficients of that singular vector
are used for weighting the power output at the antenna elements. Typically,
x = cv1 will be used for transmitting the symbol c, as will be explained in
section 1.5.2.1. It is even possible to use several singular vectors v1 , v2 , . . . in
superposition, to operate independent modes with maximum SNR in descending
order. Here one exploits the fact that signals follow different spatial paths
(multi-path propagation, each path corresponding to a set of weights determined
by the coefficients of a singular vector). Of course, to realize this potential,
the transmitter must have knowledge of the channel matrix H to estimate the
singular vectors.
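The optimality statement (1.20) is easy to check numerically in the noiseless case (a sketch; the channel below is a random synthetic matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 3))
U, s, Vt = np.linalg.svd(H, full_matrices=False)
v1 = Vt[0]                               # top right singular vector

best = np.linalg.norm(H @ v1)            # equals the top singular value s[0]
# Received signal strengths for many other unit-norm transmit vectors:
gains = [np.linalg.norm(H @ (x / np.linalg.norm(x)))
         for x in rng.standard_normal((1000, 3))]
```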
Summing up, one of the most striking properties of MIMO systems is
the possibility of turning multi-path propagation, usually a pitfall of wireless
transmission, into an advantage for increasing the user’s data rate, as was
shown by Foschini (1996,1998).
1.5.2 BIMA - Blind Iterative MIMO Algorithm
The work I have done on MIMO (joint with D. Gesbert and N. Christophersen)
is about how to maximize the capacity of a MIMO channel by blindly estimating
the channel matrix. Most MIMO algorithms today (for instance V-BLAST by
Golden et al., 1999) assume that the channel matrix is estimated at the receiver
side only. In this case, optimal weighting of the antenna elements at the
transmitter is impossible. For the transmitter to transmit on the top singular
vector, he must have knowledge of the channel matrix H. In a Time-Division
Duplex (TDD) system, data can also be transmitted in the opposite direction
of (1.17). If the channel “exhibits reciprocity”, transmission from the mobile
subscriber unit (SU) to the base station (BTS) is expressible as
x = HT y + n
(1.21)
Since the singular vectors of H and H^T are the same (but with left and
right interchanged), both parties can obtain knowledge of them by sending training
sequences, and optimal weight allocation at both the transmitter and the receiver
is possible. Still, it would be desirable if this training period could be skipped, and
the block SVD either simplified or performed iteratively. Typically, the channel H
will vary with time and must be re-estimated at regular intervals. In the GSM
systems of today, and also in third generation systems (UMTS, the Universal
Mobile Telecommunications System), up to 20% of
the traffic is training data, used for approximating H. If this regular training
phase can be skipped, the capacity of the channel is increased. My goal was to
find the singular vectors without having to pay the price of using training data,
and also without any need for an estimate of H or an actual (computer-intensive)
block SVD. I constructed the following algorithm for transmission in two directions (Downlink: BTS to SU, Uplink: SU to BTS). Assume for the time being
that H as well as the vectors x and y are real-valued matrices/vectors. This
simplifies the notation, but the demonstrated results carry directly over to the
complex case.
BIMA (Blind Iterative MIMO Algorithm)
1. Start with a random vector x(0) , and set i = 1.
2. Send from BTS the vector x(i−1) , which will be received as y(i) = Hx(i−1) + n
by X.
3. Normalize the received vector y(i) := y(i) /||y(i) ||.
4. Re-send from X the normalized vector y(i) , received at BTS as x(i) =
HT y(i) + n
5. Normalize the received vector x(i) := x(i) /||x(i) ||.
6. Increase i, repeat from 2.
By doing this, one can show that x and y converge to the top singular vector
pair, the one with optimum performance. Forgetting about the normalization for
a moment (that is, not normalizing the vectors before they are sent), it is easily
seen that

x^(i) = (H^T H)(H^T H) · · · (H^T H) x^(0) = (H^T H)^i x^(0)    (1.22)

y^(i) = (HH^T)(HH^T) · · · (HH^T) y^(0) = (HH^T)^i y^(0)    (1.23)

with i factors in each product.
Again, each of these equations defines a power method for finding the top
eigenvector of H^T H or HH^T, respectively. But these eigenvectors are, by the
definition of the singular value decomposition, also the top singular vectors of
the matrix H. Note that the BIMA algorithm can be given a modular design.
The operations can be split between the base station and the mobile subscriber
unit in a way that requires no extra communication between the parties.
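Steps 1-6 can be sketched in a few lines (a noiseless, real-valued simulation with a synthetic channel; the noise terms and the split of the operations between BTS and SU are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic channel with a prescribed singular system, so the spectral
# gap (and hence the convergence rate) is known: u1 = Q1[:, 0], v1 = Q2[0].
Q1, _ = np.linalg.qr(rng.standard_normal((4, 4)))
Q2, _ = np.linalg.qr(rng.standard_normal((3, 3)))
H = Q1[:, :3] @ np.diag([3.0, 2.0, 1.0]) @ Q2

x = rng.standard_normal(3)          # step 1: random start at the BTS
for _ in range(100):
    y = H @ x                       # step 2: downlink transmission
    y /= np.linalg.norm(y)          # step 3: normalize at the SU
    x = H.T @ y                     # step 4: uplink through the transpose channel
    x /= np.linalg.norm(x)          # step 5: normalize at the BTS
# x and y converge to the top singular pair (v1, u1), up to a common sign.
```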
1.5.2.1 Symbol transmission and multiple singular modes
A data symbol, a number c, is transmitted by multiplying it by a singular vector
prior to transmission. If the singular vectors are known, and we select
x = cv1
(1.24)
as the transmit vector, then this vector is received as

y = Hx = (Σ_{i=1}^r σ_i u_i v_i^T) c v_1 = σ_1 c u_1    (1.25)

in the deterministic case with no channel noise. If the receiving party knows u_1
and σ_1, he can recapture the symbol perfectly,

ĉ = y^T u_1 / σ_1 = c    (1.26)
It is also possible to operate several data streams (or channels) in parallel. If
c_1, . . . , c_K are symbols from K independent data streams, each of those symbols
can be transmitted from the base station using a singular vector v_i, by letting

x = Σ_{i=1}^K c_i v_i    (1.27)
On the mobile subscriber unit, this is received as

y = Hx = (Σ_{i=1}^r σ_i u_i v_i^T)(Σ_{j=1}^K c_j v_j) = Σ_{i=1}^K σ_i c_i u_i    (1.28)

again assuming no channel noise. The symbol from the j’th data stream can
be recaptured using the corresponding left singular vector u_j and the singular
value σ_j,

ĉ_j = y^T u_j / σ_j    (1.29)
Transmission and receiving of data on the top singular modes only requires
channel information on a “need-to-know” basis. One side must know the
singular vector estimates {v̂_i} and the other party the corresponding set {û_i}.
If the singular values are also estimated, we have a reduced-rank estimate of
the channel matrix H,

Ĥ_K = Σ_{i=1}^K σ̂_i û_i v̂_i^T    (1.30)

If the estimates are correct, this is the optimal reduced-rank estimate of H in
the L2-sense (Golub and Van Loan, 1996). With respect to (1.29), it will in many
cases be possible to decide c_j from y^T u_j alone, without dividing by σ_j. Thus, it is
not necessary to estimate the singular values.
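The multi-stream scheme of equations (1.27)-(1.29) can be illustrated numerically (a sketch with a synthetic channel whose singular system is known by construction):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 4x4 channel with a known singular system.
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # left singular vectors
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # right singular vectors
sigma = np.array([3.0, 2.0, 1.0, 0.5])
H = U @ np.diag(sigma) @ V.T

K = 2                              # number of parallel streams
c = np.array([1.0, -1.0])          # one symbol per stream
x = V[:, :K] @ c                   # x = sum_i c_i v_i, as in (1.27)
y = H @ x                          # noiseless reception, as in (1.28)

# Recapture each symbol with the matching left singular vector, as in (1.29)
c_hat = (y @ U[:, :K]) / sigma[:K]
```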
The ability to operate several independent channels on the top singular
modes enables a large increase in the capacity of the MIMO channel. In
practice of course, there is channel noise, which increases the probability that
a cj is wrongly decided. Also, the singular vectors are not perfectly known,
but are estimated as part of the BIMA algorithm. BIMA can be adapted to
estimate more than one singular vector pair, by exploiting certain properties
of the modulation alphabet, which is the set of values that a symbol cj may
have. Another interesting property of BIMA is the fact that even if the
initial estimates of the singular vectors and the symbols are incorrect, they
will gradually converge to their correct values. This is demonstrated in the
third and fourth papers of this thesis. We also show by simulation that if the
channel H varies continuously with time, the algorithm will track the singular
vectors. In the above, we have assumed that H is real-valued. In practice, H
will have complex values, which calls for a slight modification of the algorithm.
The vectors that are normalized in steps 3 and 5 of the algorithm will also be
conjugated, as described in the fourth thesis paper. The reason for this is that
the complex generalization of equations (1.22) and (1.23) replaces
the matrices H^T H and HH^T with H∗H and HH∗, respectively.
not conjugated, this transition from the real to the complex case will fail. In the
real case of course, complex conjugation is always superfluous.
The concept of normalizing, conjugating and returning a vector has a strong
connection with the works of M. Fink (1997) in medical ultrasound.
The fact that steps 1-5 give convergence to the first singular
vector pair has been observed by other authors, such as Bach Andersen (2000).
However, using this property directly as a part of a communication process is, to
the best of our knowledge, new.
1.5.2.2 Links with sensory analysis
Deriving the mathematical details, the BIMA algorithm is not much different
from my "Soft GPA" (or iterative GCA). Comparing the equation (1.9) in
the GPA section with equations (1.22) and (1.23) in the sections on mobile
communications, both can be seen to be power methods for finding an eigenvector.
However, there is one important difference. In (1.9), the eigenvectors
could be found directly as the eigenvectors of Σ_i X_i (X_i^T X_i)^† X_i^T using some
numerical method. This could be done because the matrices X_i were “at hand to
be played with”. In the case of (1.22) and (1.23) this is impossible, because the
matrix H is unknown. It is only through interaction with the transmit vectors
x^(i) and y^(i) that H enters the equations, unless one wants to use training data
for estimating H.
We mentioned above that by modifying BIMA, one can transmit independent
data streams using several singular modes. To keep the singular vector
estimates independent, and prevent them all from converging to the top
singular vector, the orthogonalization procedure that I used for the “soft GPA”
(or iterative GCA) was used. Summing up, ideas that were superfluous in one
area had the right to live in another.
As a final detail, I will mention that the algorithm performs even better if
the singular vector estimates, varying with time as the channel H itself varies,
are smoothed. The simplest way of doing this is to take the average of the last
few singular vector estimates. However, the average of matrices of orthogonal
vectors need not be orthogonal itself. In the fourth paper, a modified average
borrowing methodology from Procrustes Analysis is used for smoothing the
singular vectors.
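The modified average of the fourth paper is not reproduced here, but the idea can be sketched (one simple variant, shown for illustration only): average the recent estimates, then project the result back to the nearest matrix with orthonormal columns, i.e. the orthogonal factor of the polar decomposition.

```python
import numpy as np

def nearest_orthonormal(A):
    """Orthogonal factor of the polar decomposition of A: the closest
    matrix with orthonormal columns in the Frobenius sense."""
    W, _, Zt = np.linalg.svd(A, full_matrices=False)
    return W @ Zt

rng = np.random.default_rng(0)
V_true, _ = np.linalg.qr(rng.standard_normal((4, 2)))  # "true" singular vectors
# Noisy per-iteration estimates; their plain average is not orthonormal.
estimates = [V_true + 0.1 * rng.standard_normal((4, 2)) for _ in range(10)]
V_smooth = nearest_orthonormal(sum(estimates) / len(estimates))
```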
1.6 A two-agent problem
The final part of my thesis was a joint manuscript with Nils Christophersen, Ole
Christian Lingjærde and Nils Lid Hjort, considering a closely related but much
more difficult problem:
In Frequency Division Duplex (FDD) systems, the transmit/receive equation
differs from the equations (1.17), (1.21) above. If H is used for transmission of x
to become

y = Hx + n

the opposite relation is not x = H^T y + n, but

x = Gy + n    (1.31)
Whereas H and HT are perfectly related, G can be quite different from H and
might represent a "different physical reality" than H. This will typically happen
if the two communicating parties use different frequencies (wavelengths) for
data transmission. For instance, the scattering of high-frequency waves will
often be different from the physics of low-frequency ones. G is then a new
quantity that must be accounted for in the algorithms. It is still desirable to find
the singular vectors of G and H in order to maximize the channel performance.
It turns out that the problem of finding the top singular modes of the two
channels H and G is one in which the two parties (base station X, mobile
subscriber Y) have to cooperate. There is currently a great interest in the
artificial intelligence (AI) community in so-called “multi-agent systems” (MAS).
Among the more recent references are Cruz and Simaan (1999), and Wolpert
et al. (1999). There are many applications for problems with multiple agents,
and crucial elements include (Stone, 2000) agent heterogeneity, knowledge of
strategies, control distribution and communication possibilities. Also, the agents
have individual goals, expressed as reward functions, which must be chosen
carefully so that fulfillment of the individual goals also leads to fulfillment of
the overall goal. There is a risk that agents might work at “cross-purposes”,
or get in each other’s way, frustrating one another when trying to solve their
respective problems (interaction problems). In the final thesis paper, we cast the
problem of finding the top singular modes of G and H into a two-agent problem
setting. We show how the problem can be solved using a leader-follower strategy.
Again, the use of optimal rotations is central in the work. A procrustes
statistic is used for performance assessment, and the polar decomposition,
which is a close relative to the orthogonal procrustes transformation, is one
of the central building blocks in the algorithm. This emphasizes the close
methodological connection between my contributions in these otherwise
quite different areas of applied statistics.
Bibliography
[1] Arnold, G.M, Williams, A.A. (1985) The use of generalized Procrustes
Techniques in sensory analysis, In: Statistical Procedures in Food Research,
Piggot, J.R. (Ed.)
[2] Bach Andersen, J. (2000) Array gain and capacity for known random
channels with multiple element arrays at both ends, IEEE Journal on
Selected Areas in Communications, Vol. 18, No. 11, pp. 2172-2178.
[3] ten Berge, J.M.F., Bekker, P.A. (1993) The isotropic scaling problem
in generalized procrustes analysis. Computational Statistics and Data
Analysis 16, No.2, 201-204.
[4] van der Burg, E. & Dijksterhuis, G. (1996) Generalized canonical analysis
of individual sensory profiles and instrumental data, In: Multivariate
Analysis of Data in Sensory Science , edited by Naes T. & Risvik E, Elsevier
Science.
[5] Carroll, J.D. (1968) Generalization of canonical analysis to three or more
sets of variables, Proceedings of the 76th Convention of the American
Psychological Association 3, pp. 227-228.
[6] Cruz J.B. , Simaan, Jr. M.A (1999) Multi-Agent Control Strategies with
Incentives, in Proceedings, Symposion on Advances in Enterprise Control,
pp. 177-182, San Diego, CA.
[7] Dahl, T. (1998) Empirical Modeling of Human Motion, M.Sc Thesis,
University of Oslo, Department of Mathematics.
[8] Dryden, I.L., Mardia, K.V. (1998) Statistical shape analysis, (Wiley).
[9] van de Geer, J.P. (1984) Linear Relations among k sets of Variables,
Psychometrika Vol. 49, No 1, pp. 79-94.
[10] Fink, M. (1997) Time-reversed acoustics, Physics Today, Vol. 20, pp. 34-40.
[11] Foschini, G.J. (1996) Layered Space-time architecture for wireless communications in a fading environment, Bell Labs Technical Journal, Vol. 1, No
2, pp. 41-59.
[12] Foschini G.J., Gans, M.J. (1998) On the limit of wireless communications
in a fading environment when using multiple antennas, Wireless Personal
Communications, Vol.6, No 3, pp. 311-335.
[13] Foschini G.J. (1998) Layered space-time architecture for wireless communication, Bell Labs Technical Journal, Vol. 1, pp. 41-59
[14] Gesbert, D. Akhtar, J. (2002) Breaking the barriers of Shannon’s capacity:
An overview of MIMO wireless systems, Telenor’s journal: Telektronikk.
[15] Golden, G.D., Foschini, G.J., Valenzuela, R.A. and Wolniasky, P.W. (1999)
Detection algorithm and initial laboratory results using the V-BLAST
space-time communication architecture, Electronics Letters, Vol. 35, No. 1,
pp. 14-15., 1999
[16] Golub, G. & Van Loan, C.F. (1996) Matrix computations, 3rd ed. The Johns
Hopkins Univ. Press
[17] Goodall, C. (1991) Procrustes methods in the statistical analysis of shape.
(with discussion). Journal of the Royal Statistical Society: Series B Vol. 53,
pp. 285-339.
[18] Gower, J.C. (1975) Generalized Procrustes analysis, Psychometrika Vol. 40,
pp. 33-51.
[19] Green, B.F. (1952) The orthogonal approximation of an oblique structure in
factor analysis. Psychometrika Vol. 17, pp. 429-440.
[20] Gruvaeus, G.T. (1970) A general approach to Procrustes pattern rotation.
Psychometrika Vol. 35 pp. 493-505.
[21] Kendall D.G. (1984) Shape manifolds, Procrustean metrics and complex
projective spaces. Bulletin of the London Mathematical Society, Vol. 16,
pp. 81-121.
[22] Kendall D.G. (1989) A survey of the statistical theory of shape. Statistical
Science, pp. 87-120.
[23] Kent, J.T., Mardia, K.V. (1997) Consistency of Procrustes
estimators. Journal of the Royal Statistical Society: Series B Vol.
59, pp. 281-290.
[24] Kettenring, J.R (1971) Canonical analysis of several sets of variables,
Biometrika, Vol. 58, pp. 433-451.
[25] Kiers, H.A.L., Cléroux, R., ten Berge, J.M.F. (1994), Generalized Canonical
Analysis based on optimizing matrix correlations and a relation with
IDIOSCAL, Computational Statistics and Data Analysis, Vol. 18, No. 3, pp.
331-340.
[26] Kristof, W. and Wingersy, B. (1971) Generalizations of the orthogonal
Procrustes rotation procedure to more than two matrices. Proceedings of
the 79th Annual Convention of the American Psychological Association, 6,
pp. 81-90.
[27] Kroonenberg, P. & De Leeuw, J. (1980) Principal component analysis
of three-mode data by means of alternating least squares algorithms,
Psychometrika, Vol. 45, pp. 69-97.
[28] Le H.-L. and Kendall D.G. (1993) The Riemannian Structure of Euclidean
shape spaces: a novel environment for statistics. Annals of Statistics Vol
21, pp.1225-1271.
[29] Langron, S.P. (1983) The application of Procrustes statistics to sensory
profiling. In: Sensory Quality in Foods & Beverages: Definition, Measurement & Control, A. A. Williams & R.K. Atkin (Eds), Ellis Horwood Ltd,
Chichester, pp. 89-95.
[30] Langron, S.P. and Collins, A.J. (1985) Perturbation theory for generalized
Procrustes analysis. Journal of the Royal Statistical Society: Series B Vol.
47, pp. 277-284.
[31] Mardia, K.V., Kent, J.T., Bibby, J.M. (1997) Multivariate Analysis, Academic
Press
[32] Newman, W.M., Sproull R.F. (1976) Principles of interactive computer
graphics. McGraw-Hill.
[33] Naes, T. Kowalski, B. (1989) Predicting sensory profiles from external
instrumental measurements, Food Quality and Preference, 4/5, pp. 135-147.
[34] Peay, E.R. (1988) Multidimensional rotation and scaling of configurations
to optimal agreement. Psychometrika Vol. 53, pp.199-208.
[35] Risvik E. & Naes T. (1996) Multivariate Analysis of Data in Sensory Science
(Elsevier Science)
[36] Schönemann, P.H. (1966) A generalized solution to the orthogonal Procrustes problem, Psychometrika Vol. 31, pp.1-10.
[37] Schönemann, P.H. (1968) On two sided orthogonal Procrustes problems.
Psychometrika Vol. 33, pp. 19-33.
[38] Schönemann, P.H. and Carroll R.M. (1970) Fitting one matrix to another
under choice of central dilation and rigid motion. Psychometrika Vol. 35,
pp. 245-255.
[39] Sibson, R. (1978) Studies in the robustness of multidimensional scaling:
Procrustes statistics. Journal of the Royal Statistical Society: Series B Vol.
40, pp. 234-238.
[40] Sibson, R. (1979) Studies in the robustness of multidimensional scaling:
perturbation analysis of classic scaling. Journal of the Royal Statistical
Society: Series B Vol. 41, pp. 217-229.
[41] Stone P., Veloso M. (2000) Multiagent Systems: A Survey from a Machine
Learning Perspective, Autonomous Robotics, Vol. 8, No. 3.
[42] Telatar, I.E. (1995) Capacity of multi-antenna Gaussian channels, Bell Labs
Technical Memorandum.
[43] Tenenhaus, M. (1987) Generalized Canonical Analysis, Bernoulli, Vol.2, pp.
133-136.
[44] Wolpert, D.H., Wheeler, K.R., Tumer, K. (1999) “General Principles of
Learning-based Multi-Agent Systems”, Proceedings of the Third International Conference on Autonomous Agents (Agents’99).
[45] Ziezold, H. (1977) On expected figures and a strong law of large numbers
for random elements in quasi-metric spaces. In Transactions of the Seventh
Prague Conference on Information Theory, Statistical Decision Functions,
Random Processes and of the 1974 European meeting of Statisticians, Vol.
A, pp. 591-602, Prague. Academica: Czechoslovak Academy of Sciences.
Chapter 2
Short Summary of the Papers
Paper One: A Bridge Between Carroll’s Generalized Canonical Analysis
and the Tucker-1 Method.
by T. Dahl and T. Næs.
Submitted to Psychometrika, in 2nd review.
This paper is addressed to the sensory science community. When working with score
chart analysis for multiple judges, there are various subcultures within the field. The choice of analysis tools is sometimes more a matter of background than
of anything else. This paper demonstrates how two popular methods, the Generalized Canonical Analysis and the Tucker-1 method for three-way analysis,
although seemingly very different, essentially derive their result from the same
structures in the data set. Moreover, the paper shows how the two methods can
be combined in a joint ridge-regression-like framework, and that this bridge,
determined by a parameter α, is a new method in its own right. Choice of the
parameter setting in various theoretical and real situations is discussed, and
illustrated with new kinds of plots and figures. The main contributions are (a)
demonstrating the close connection between two methods which seem very different, and (b) developing new ways of visualizing the workings of multivariate
methods in sensory science.
Paper Two: Outlier and Group Detection in Sensory Panel Analysis
with the Procrustes Distance.
T. Dahl and T. Næs.
To be submitted to Food Quality and Preference.
This paper also concerns sensory analysis. When score charts, represented as
matrices (N food samples times P tasting categories), are obtained from several judges, it is of interest to find a consensus matrix for the whole tasting
panel. Often, the judges use terms (tasting categories) and scales differently,
and this must be taken into account to make a consensus. To this end, Generalized Procrustes Analysis (GPA) is commonly used. However, it might be that
no common structure can be found for the whole tasting panel. This paper introduces the procrustes distance, developed in statistical shape analysis, into
sensory analysis. In this paper, it is suggested for grouping panel judges who
have similar assessments about the samples into clusters. This ensures that the
consensus comes from a relatively homogeneous group of panel members, differing only in their use of terms, and not in the underlying perception about the
products/samples. Outliers will typically be grouped with the other judges at a
very late stage, which can easily be seen from the dendrograms. Validation of
the method using process data is also demonstrated. The main contribution is
the use of the Procrustes Distance for Hierarchical Clustering.
Paper Three: BIMA - Blind Iterative MIMO Algorithm.
T. Dahl, N. Christophersen and D. Gesbert.
Accepted for ICASSP 2002.
In multi-antenna wireless communication systems, MIMO ("Multiple Input, Multiple Output") systems in particular, transmission is through a matrix
channel. Prior to symbol transmission, the channel must be estimated. For this
estimation, training data, occupying as much as 20 % of the capacity, is needed
to track the time-varying channel. Alternatively, the channel can be estimated
blindly, but this normally requires the use of computer-intensive higher-order
methods. We have developed an algorithm (BIMA) that finds the optimal transmission modes of a MIMO channel without the need for a statistical estimate of
the channel, in a computationally effective way. Furthermore, this method can
track the optimal modes of a time-varying channel, as part of normal operation,
and without extra computational and channel capacity costs. The algorithm is a
variant of the power method, a numerical method for estimating eigenvectors.
It has connections with the “time reversal mirror” used in ultrasound imaging.
The main contributions are (a) the exploitation of the ’intrinsic power method’
in uplink and downlink communication, and (b) blind channel estimation with
iterative estimation of the singular vectors.
Paper Four: Blind MIMO Estimation based on the Power Method.
T. Dahl, N. Christophersen and D. Gesbert.
To be submitted to IEEE Transactions on Signal Processing.
This paper further develops the BIMA algorithm. More details and simulations
are given, and regularization/smoothing of the transmission parameters is introduced, improving the SNR considerably. More background on MIMO, as well
as the method’s connection with the “time reversal” mirror, is given.
Paper Five: The Game of Blind MIMO Channel Estimation.
T. Dahl, N. Christophersen, O.C. Lingjærde, and N. Lid Hjort.
The BIMA algorithm works for reciprocal channels only. If communication from
one side to the other (uplink) is through a matrix channel, then communication the other way (downlink) is assumed to be through the transpose matrix
channel. When different frequencies are used for uplink and downlink communication (Frequency-division-duplex) this condition fails to hold. We show that
it is still possible to blindly estimate the optimal transmission modes for the uplink and downlink channels. This is formulated as a two-agent problem in game
theory. It borrows its notation and ideas from the field of Multi-Agent Systems,
a subfield of Artificial Intelligence. Various techniques from statistical analysis,
such as non-linear nested optimization, optimal rotations, quadratic data fitting
and stochastic simulations are used to present a framework for the solution of
the problem. The main contributions are (a) the formulation of the problem in a
two-agent framework, (b) the use of ellipsoid fitting for estimating one set of singular vectors only (partial SVD), and (c) the use of principal component analysis
for removing non-linearities in the optimization.
Chapter 3
Papers
• Paper I: A Bridge between Tucker-1 and Carroll’s
Generalized Canonical Analysis
• Paper II: Outlier and Group detection in Sensory
Analysis using Hierarchical Cluster Analysis with
the Procrustes Distance
• Paper III: BIMA - Blind Iterative MIMO Algorithm
• Paper IV: Blind MIMO Estimation based on the
Power Method
• Paper V: The Game of Blind MIMO Channel
Estimation
Paper I:
A Bridge between Tucker-1 and Carroll’s
Generalized Canonical Analysis
A Bridge between Tucker-1 and Carroll’s
Generalized Canonical Analysis
Tobias Dahl and Tormod Næs ∗
Abstract
This paper concerns tools for analyzing relationships between and within
multiple data matrices. A new unified approach is developed, bridging
two existing methods; Carroll’s Generalized Canonical Analysis (GCA) with
the Tucker-1 method for principal component analysis of multiple matrices.
GCA and Tucker-1 are shown to correspond to particular choices of a ridge
parameter. The unified method may again be generalized to a larger space
of methods.
key words: Generalized Canonical Analysis, Three-way Factor Analysis,
Singular Value Decomposition, Ridge Regression, Principal Components,
MANOVA.
1 Introduction
A common problem in applied statistics is to relate a number of matrices {Xi }
to each other, in order to find common and unique structures. Typical examples
are sensory analysis, i.e. tasting experiments of food (Baardseth et al., 1992,
Amerine et al., 1965), and analysis of shape or body motions (Dahl, 1998).
In the former example, each matrix contains sensory scores given by one
single assessor. Each column may represent a variable (attribute), and each
row a sample or an object. An important issue is to investigate the common
structures among the individuals’ sensory ratings (Naes & Risvik, 1996), and
also to understand what is unique information. The purpose may be quality
control of the sensory panel, better understanding of individual differences of
perception and scoring, or simply the need for a sensible consensus matrix of
data representing the whole panel. Such a consensus can be used for further
statistical analysis.
In the second example, each matrix contains trajectories for a number of
markers (usually bright reflexes) through a particular body movement. The
∗ MATFORSK (Norwegian Food Research Institute) and University of Oslo.
matrices are related in order to find common structures in movement patterns,
summarizing how different joints and limbs co-ordinate for each individual, as
well as detecting common structures for the subjects (Dahl, 1998).
A number of different techniques have already been suggested for such
studies. The most well known are probably the 3-way factor analysis methods
PARAFAC, Tucker-1, Tucker-2 and Tucker-3, (Kroonenberg, 1980, Naes &
Kowalski 1989), Generalized Procrustes Analysis (Kristof & Wingersky 1971,
Gower 1975, ten Berge 1977) and various versions of Generalized Canonical
Analysis (Carroll 1968: MAXVAR, Kettenring 1971: SUMCOR, MINVAR,
SSQCOR, GENVAR, van de Geer 1984, Tenenhaus 1987, Kiers et al. 1994, van
de Burg & Dijksterhuis 1996). All these techniques are in use, but are based
on quite different approaches and belong to apparently quite different ways of
handling the problems. There are, however, some common features which seem
to have been little investigated in the literature. For instance, all techniques
end up with a “consensus” type of matrix representing common information for
the whole dataset. They also provide information about how and how well the
different matrices {Xi } relate to this consensus.
This paper is devoted to a discussion of some mathematical and practical
relationships between some of these techniques. Building a “bridge” between
three-way factor analysis, in particular Tucker-1, and Carroll’s Generalized
Canonical Analysis, will be the main focus. First, the techniques are presented
and discussed as members of two quite different philosophies. It is then shown
that both can be considered as special cases of a more general model framework
- a bridge.
Computations on real sensory data will be used for illustrations.
A number of plots facilitating empirical studies will also be proposed, and
validation of the methods will be discussed.
The unified framework connecting GCA and Tucker 1 may again be
generalized; It gives a class of potentially interesting methods for analysis
of three-way data. The generalized framework is flexible, and can be used
to connect three-way analysis with other methods, such as regression and
classification.
2 Two classes of methods
In this paper, we will consider situations where a number of data matrices
{X1, X2, . . . , XQ}, Xi ∈ R^{N,Pi}, are available. Each row corresponds to a
sample/object (a total of N), and each column to a variable, typically a dimension
of evaluation, for instance a taste score. The number of variables Pi may differ
between the matrices. If Pi = P for all i, the data {Xi} can be regarded as
a three-way (N × P × Q) structure X. Throughout the paper, the profiles (or
matrices) are assumed to be centered,¹ Xi = H Xi, with H = (I_N − (1/N) 1_N 1_Nᵀ).

¹ This kind of centering is common in PCA and in correlation-based methods to ensure one
is working with the correct covariance and correlation matrices. In Procrustes-like methods it
is interpreted as a translation step, centering the objects around the origin.
2.1 Methods based on linear fitting
One of the important classes of methods for relating Q matrices to each other is
based on estimating individual transformation matrices {βi } , right-multiplied
by their corresponding matrices {Xi } in order to approximate a consensus matrix
Y. This can be formalized as
min_{ {βi}, Y } Σ_i ||Xi βi − Y||²        (1)
The minimization may be subject to constraints on {βi}, on Y, or on both.
Note that some type of restriction is always needed, in order to avoid the
trivial 0 solution. The most well known technique based on such a sum is the
Generalized Procrustes Analysis (GPA, Gower 1975, ten Berge 1977). It restricts
each βi to be orthogonal, βiT βi = I, and works in an iterative way to minimize the
criterion by rotations and reflections. The outcome of this process is a full rank
consensus matrix Y ∈ RN,P with no reduction to a lower-dimensional space,
although variants with dimension-reduction exist (Peay, 1988). The Ordinary
GPA consensus Y is sometimes considered a better representative of the set
{Xi} than the straightforward average Y0 = (1/Q) Σ_i Xi (see e.g. Langron 1983 or
Arnold & Williams, 1985).
Generalized Canonical Correlation Analysis (GCA) can also be cast into this
framework. It is usually applied with individual transformation matrices {βi }
that have a reduced number of columns. The Y matrix represents not only
a type of consensus, but also information compressed to a lower dimensional
space, which may be important for interpretation purposes. GCA restricts Y to
have orthogonal columns. The criterion (1) can be reformulated to become a
measure of the correlation between the profiles (see below).
Users of methods such as GCA and GPA may be interested in several aspects
of the results. Important examples are the common information represented
by Y, the degree of fit of the individual matrices, or the coefficients {βi}, which
assign each sample coordinates in an interesting common space.
2.2 Three-way data compression methods
Another, and equally important class of techniques is the family of so-called
three-way factor analysis methods. These methods also give a common scores matrix Y
as output. This Y is multiplied by different types of restricted loadings matrices
in order to approximate {Xi } . A general framework is
min_{ {βi}, YᵀY=I } Σ_i ||Xi − Y βi||²        (2)
Here, Y is the consensus (or common scores) matrix and {βi } are the individual
loading matrices. For compression to take place and the approach to be
meaningful, the number of columns Py in Y needs to be low, typically Py < Pi for
all i. Y is usually assumed orthogonal, but other restrictions may be imposed.
The methods in this class differ in the way that {βi } is restricted. With no
restriction on {βi}, the method is called Tucker-1.² Tucker-2 uses βi = Ri Q,
where Q is the common loadings matrix and Ri is the rotation matrix for the
different individuals. PARAFAC and Tucker-3 apply other restrictions (Naes &
Kowalski, 1989).
This paper deals with Tucker-1, from now on referred to only as the Tucker
method. The applications of three-way methods are similar to those of PCA with
ordinary two-way data, loadings and scores providing compressed information.
3 The relation between GCA and Tucker-1
The purpose of this paper is to connect these two apparently different method
classes to each other. In particular, we will concentrate on the methods Tucker-1
and GCA and show that they can be formulated within a common framework. It
will be shown that both these methods, as well as solutions in-between, can be
found by constructing a matrix Z depending on a parameter α, and extracting
eigenvectors from this matrix.
3.1 Carroll’s Generalized Canonical Analysis
The GCA problem is usually formulated by using a correlation criterion, defined
as

max_{βij, yj} Σ_i corr²(Xi βij, yj)        (3)

under the restriction YᵀY = I. van der Burg et al. (1994) have shown that
the solution vectors in Y of the GCA problem are the orthogonal eigenvectors
of Z_GCA = Σ_i Xi (Xiᵀ Xi)† Xiᵀ (M† denoting the Moore-Penrose inverse of M), and
that the natural ordering of columns is by the ranking of the corresponding
eigenvalues. There is of course also the question of “fairness” (Van de Geer,
1984) and weighting, e.g. to use a representation Z_GCA^w = Σ_i wi Xi (Xiᵀ Xi)† Xiᵀ,
but this can be accomplished by suitable scaling of the profiles {Xi}, and will
not be pursued here. Note also that Z_GCA = Σ_i Ui Uiᵀ, if Ui Si Viᵀ = Xi is an SVD of
Xi.
van der Burg et al. (1994, Appendix) also showed that the GCA problem can
be reformulated to become

min_{βi, Y} Σ_i ||Xi βi − Y||²  subject to YᵀY = I        (4)
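The eigenvector characterization above can be sketched numerically in a few lines of NumPy. This is an illustrative sketch only, not code from the paper; `profiles` is a hypothetical list of centered matrices {Xi}:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16
# Hypothetical centered profiles X_i with differing column counts P_i.
profiles = [rng.standard_normal((N, p)) for p in (5, 7, 6)]
profiles = [X - X.mean(axis=0) for X in profiles]

# Z_GCA = sum_i X_i (X_i^T X_i)^dagger X_i^T, a sum of projectors onto R(X_i).
Z_gca = sum(X @ np.linalg.pinv(X.T @ X) @ X.T for X in profiles)

# Equivalent SVD form: Z_GCA = sum_i U_i U_i^T, with X_i = U_i S_i V_i^T.
Z_svd = np.zeros((N, N))
for X in profiles:
    U = np.linalg.svd(X, full_matrices=False)[0]
    Z_svd += U @ U.T
assert np.allclose(Z_gca, Z_svd)

# The GCA solution vectors are the leading eigenvectors of Z_GCA,
# ordered by the ranking of the eigenvalues.
eigvals, eigvecs = np.linalg.eigh(Z_gca)
Y = eigvecs[:, ::-1][:, :3]
```

Since Z_GCA is symmetric, the extracted solution vectors are automatically orthonormal, matching the restriction YᵀY = I.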
² Since Tucker-1 is an unfolding PCA, it can be shown that the restriction βi = R Qiᵀ, with
Σ_i Qiᵀ Qi = QᵀQ = I, is (trivially) fulfilled once the Y matrix is found.
showing that the GCA method indeed is a special case of the general framework
described in section 2.1.
3.2 Tucker 1
The Tucker 1 method can be formulated in a number of different ways,
depending on purpose of study and arrangement of matrices and vectors. Tucker
1 is sometimes referred to as an “unfolding” method (Naes & Kowalski 1989),
since its solution vectors can be obtained by unfolding of the three-dimensional
(N × P × Q) structure X. The Tucker 1 method can be defined properly as
min_{ {βi}, YᵀY=I } Σ_i ||Xi − Y βi||²        (5)

It is well known that the solution vectors in Y of the Tucker problem are the
eigenvectors of Z_Tucker = Σ_i Xi Xiᵀ, see e.g. Gifi (1990). In SVD notation,
Z_Tucker = Σ_i Ui Si² Uiᵀ. Note the similarity between this solution and the SVD
version of GCA.
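Since Tucker-1 is an unfolding PCA, the solution vectors can equivalently be obtained as the left singular vectors of the concatenated matrix (X1, X2, . . .). A small consistency check of this equivalence (illustrative code with hypothetical random profiles, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10
profiles = [rng.standard_normal((N, p)) for p in (4, 6, 5)]

# Unfolded N x (P_1 + P_2 + ...) matrix X = (X_1, X_2, ...).
X_unf = np.hstack(profiles)
U_unf = np.linalg.svd(X_unf, full_matrices=False)[0]

# Z_Tucker = sum_i X_i X_i^T; its leading eigenvectors are the Tucker solution.
Z_tucker = sum(X @ X.T for X in profiles)
eigvals, eigvecs = np.linalg.eigh(Z_tucker)
U_eig = eigvecs[:, ::-1]  # reorder to descending eigenvalue order

# The leading vectors from the two routes agree up to sign.
for k in range(3):
    assert np.isclose(abs(U_unf[:, k] @ U_eig[:, k]), 1.0)
```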
3.2.1 Ridge GCA and its connection with Tucker-1
Ridge-techniques have been used successfully for regularization of various
multivariate methods. The classic Ridge Regression (Hoerl & Kennard 1970),
as well as Regularized Discriminant Analysis (Friedman 1989) are examples.
An analogue for GCA is now presented.
The usefulness of the ridge parameter in this context has two aspects: First
of all, since GCA is a correlation based technique, the ridge parameter can be
used to improve stability of the solution in cases with collinear data. Secondly,
introducing the ridge parameter α leads to a parameterization connecting GCA
and Tucker-1.
Let Xi = Ui Si ViT again be the SVD. The following technique down-weights
the influence of the components with low associated variance: Replace ZGCA by
ZRidge,

ZRidge = Σ_i Xi (Xiᵀ Xi + αI)† Xiᵀ        (6)
       = Σ_i Ui Si Viᵀ (Vi Si² Viᵀ + αI)† Vi Si Uiᵀ        (7)
       = Σ_i Ui Si Viᵀ [Vi (Si² + αI)† Viᵀ] Vi Si Uiᵀ        (8)
       = Σ_i Ui Si (Si² + αI)† Si Uiᵀ = Σ_i Ui Di(α) Uiᵀ        (9)
where Di(α) = diag{ σij² / (σij² + α) }. The equation (6) can, without loss of generality, be
rewritten as

Zα = Σ_i Xi [(1 − α) Xiᵀ Xi + αI]^{-1} Xiᵀ        (10)
The set of matrices (up to scaling by a constant) covered by ZRidge for α ∈ [0, ∞)
now appears on a [0, 1]-scale. Restricting the parameter of interest to a closed
interval simplifies further analysis greatly. Comparison of different α values is
much easier, and construction of plots and figures with α varying along one or
two axes is straightforward. Note that GCA is still obtained for α = 0, while
α = 1 yields Tucker-1.
It is easily shown that

Zα = Σ_i Ui Λi(α) Uiᵀ        (11)

where Λi(α) = diag{ σij² / ((1 − α) σij² + α) }. This bridge, defined by Λi(α), is of course not
the only conceivable bridge between the methods.
One could ask whether one α parameter for each matrix, αi , would not be
more appropriate. This question parallels the problem of choosing between
quadratic and linear discriminant analysis in a supervised classification
problem. Fewer parameters make estimation more robust, at the cost of
reduced flexibility. Clearly, choosing one αi for each Xi would bring a certain
selectivity into the regularization, shrinking some matrices more and others less.
In the following, only the use and selection of a single α is treated.
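The bridge (10) is easy to verify numerically. The sketch below (illustrative only, with hypothetical random profiles) checks that α = 0 recovers Z_GCA and α = 1 recovers Z_Tucker:

```python
import numpy as np

def Z_alpha(profiles, alpha):
    """Z_alpha = sum_i X_i [(1 - alpha) X_i^T X_i + alpha I]^{-1} X_i^T (eq. 10);
    pinv is used so that alpha = 0 matches the Moore-Penrose form of Z_GCA."""
    return sum(
        X @ np.linalg.pinv((1 - alpha) * (X.T @ X) + alpha * np.eye(X.shape[1])) @ X.T
        for X in profiles
    )

rng = np.random.default_rng(2)
profiles = [rng.standard_normal((12, p)) for p in (4, 5)]

Z_gca = sum(X @ np.linalg.pinv(X.T @ X) @ X.T for X in profiles)
Z_tucker = sum(X @ X.T for X in profiles)

assert np.allclose(Z_alpha(profiles, 0.0), Z_gca)     # GCA end of the bridge
assert np.allclose(Z_alpha(profiles, 1.0), Z_tucker)  # Tucker-1 end of the bridge
```

At α = 1 the regularized inverse collapses to the identity, which is exactly why the Tucker-1 matrix Σ_i Xi Xiᵀ appears.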
3.2.2 A possible generalization of the framework
All methods in the unified framework can, just as GCA and Tucker, be seen from
a principal vector point of view. The solution vectors in Y are then the principal
column components (left singular vectors) extracted from Ũ,

Ũ = (U1 Λ1(α), U2 Λ2(α), . . .)        (12)
Λi = diag{λi1(α), λi2(α), . . .}        (13)
This has previously been observed by van de Geer (1984), who raises the
question of “what to analyze?”, the original profiles {Xi } , or matrices {Ui }
spanning their column spaces. Tucker-1 is equivalent to (PCA-) analyzing the
original profiles, while GCA amounts to analyzing the principal vectors. The two
methods also correspond to specific choices of {Λi } . A whole space of methods
can be defined by varying the elements of the individual diagonal coefficient
matrices {Λi } . The weighting of an individual principal vector uij affects the
probability that uij will be well described by the first solution vectors in Y. Some
other methods in this class are discussed at the end of the paper.
3.2.3 Intuitive explanation of the framework.
Analogous to PCA for matrix data, the Tucker-1 and GCA methods seek to
describe variability, i.e. covariance structure, for a set of matrices. According to
the mathematical description above, the difference between the two covariance
structures extracted by the two methods can be stated as follows:
(1) Tucker-1 is a weighted GCA, in the sense that the singular vectors,
appearing with no extra weights in the GCA expression Z = Σ_i Σ_j uij uijᵀ, are
weighted to become Z = C Σ_i Σ_j σij² uij uijᵀ, where C is an irrelevant constant.
(2) Analogously, GCA is a Tucker-1 analysis on singular vector/value pairs
so that the singular values are equal to 1. In this sense, GCA is a normalized
Tucker-1 analysis.
In the Xi -domain, it is clear that GCA is a Tucker-1 analysis on normalized
data with normalization factors (XTi Xi )−1/2 .
3.2.4 Noise and Stability
Stability.
Tucker-1 focuses on describing the X-matrices in the best possible way. Thus,
it will not be as influenced by the small principal vectors as the case may be
with GCA. The first principal vectors are weighted higher, in a relative sense, in
Tucker-1 than in GCA.³ The larger eigenvectors are usually more stable with
respect to perturbation of {Xi } than the rest, indicating that Tucker modeling
may give more stable solutions than GCA.
Low-variance components.
In practical problems, there might be some singular values σij very close to
zero, indicating that a very small portion of the variance is explained by the
corresponding vector. The occurrence of such small singular values can be due to
high collinearity in the underlying “true” variables, as well as numerical errors
and noise in the data. In the α = 0, or GCA case, this introduces an error
source in the estimation of the solution eigenvectors, especially if the number
of profiles Q is low. Increasing α slightly would weight down the corresponding
low-variance singular vectors, giving a regularizing effect. It could, however,
be argued that the need for regularization is not too great. If a low-variance
component from a profile is given a considerable weight (as e.g. in GCA,
where all components are weighted equally), it is unlikely to be matched and
strengthened by the low-variance components of other profiles unless some true
common structure exists between them. When the contributions are summed
and eigenvectors extracted, the influence of the noise component on the solution
space R(Y) is likely to be low.
³ Whether the actual weight increases or not depends on whether σij ≷ 1, but in any case the
relative influence of the first singular vectors is higher in Tucker-1.
An alternative way of regularizing would be to remove from the basis vector
matrices Ui all singular vectors uij with associated variance below a certain
threshold. Doing this, one of course runs the risk of losing contributions from
important low-variance vectors that carry interplay between the profiles.
A table is given below, summarizing algebraic results, interpretation and
stability issues.
Table 1: Summary of the properties of the GCA/Tucker-framework

GCA (α = 0)
  Criterion:              min Σ_i ||Xi βij − Y||² over {βij}, subject to YᵀY = I
  {yj} eigenvectors of:   Σ_i Xi (Xiᵀ Xi)† Xiᵀ
  ...in SVD domain:       Σ_i Ui Uiᵀ
  Data analyzed:          X* = (X*1, X*2, . . .), X*i = Xi (Xiᵀ Xi)^{-1/2}, or equivalently U = (U1, U2, . . .)
  Component weighting:    all components are weighted equally
  Focus of analysis:      maximizing correlation
  Noise:                  sensitive

Ridge GCA (0 < α < 1)
  Criterion:              ?
  {yj} eigenvectors of:   Σ_i Xi [(1 − α) Xiᵀ Xi + αI]^{-1} Xiᵀ
  ...in SVD domain:       Σ_i Ui Λi(α) Uiᵀ, with λij²(α) = σij² / ((1 − α) σij² + α)
  Data analyzed:          X̃ = (X̃1, X̃2, . . .), X̃i = Xi [(1 − α) Xiᵀ Xi + αI]^{-1/2}, or equivalently Ũ = (U1 Λ1(α), U2 Λ2(α), . . .)

Tucker-1 (α = 1)
  Criterion:              min Σ_i ||Xi − Y βij||² over {βij}, subject to YᵀY = I
  {yj} are left principal vectors of X = (X1, X2, . . .), i.e. eigenvectors of Σ_i Xi Xiᵀ
  ...in SVD domain:       Σ_i Ui Si² Uiᵀ
  Data analyzed:          X = (X1, X2, . . .), or equivalently U* = (U1 S1, U2 S2, . . .)
  Component weighting:    components with high variance are favoured
  Focus of analysis:      describing variability
  Noise:                  insensitive
4 Selection of the Ridge α
The question of validation is central in regularization problems: What choice of
the regularization parameter(s), in this case α , should be used? The answer
depends on the problem setting. If there is external information related to the
profiles, external validation may be used for choosing α. In regression problems,
this corresponds to parameter selection with respect to the response y. Methods
such as PCR, PLS, TSVD, CG are usually cross-validated on the MSE with
respect to y.
4.1 Internal Validation
Looking at the unified framework as a set of tools for finding structures
of relationships, two ideas for selecting the number of components are the
following:
(1) Cross-validation leaving out either samples, or whole profiles {Xi } : If a
GCA/Tucker-model is estimated for all but one sample, the predictability of this
sample (residual error when the sample is projected onto the reduced model) can
be tested with a varying number of components. The procedure is repeated for
all the samples, removing one sample, putting the old one back, and calculating
the MSE for all samples.
(2) Looking for stable intervals of the α parameter, e.g. intervals for which
the solution space is stable (the vectors in Y vary little), and selecting α as,
for instance, the point in the middle of the interval. Linear subspaces can be
compared by means of principal angles (Golub & Van Loan, 1996).⁴ The degree of
change in principal angles may be used as a measure of stability.
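Idea (2) can be sketched with SciPy's `subspace_angles`, which computes the principal angles between two column spaces. The profiles and grid below are hypothetical; the point is only the stability measure:

```python
import numpy as np
from scipy.linalg import subspace_angles

def solution_space(profiles, alpha, k):
    """Orthonormal basis of the first k solution vectors of Z_alpha (eq. 10)."""
    Z = sum(
        X @ np.linalg.pinv((1 - alpha) * (X.T @ X) + alpha * np.eye(X.shape[1])) @ X.T
        for X in profiles
    )
    eigvals, eigvecs = np.linalg.eigh(Z)
    return eigvecs[:, ::-1][:, :k]

rng = np.random.default_rng(3)
profiles = [rng.standard_normal((14, p)) for p in (5, 6)]

# Degree of change in principal angles between neighbouring alpha values:
# small maximal angles over an interval suggest a stable solution space.
grid = np.linspace(0.05, 0.95, 10)
spaces = [solution_space(profiles, a, k=2) for a in grid]
changes = [np.max(subspace_angles(A, B)) for A, B in zip(spaces, spaces[1:])]
assert all(0.0 <= c <= np.pi / 2 + 1e-12 for c in changes)
```

One could then pick α from the middle of the interval where the change curve stays near zero.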
4.2 External Validation by MANOVA
When external information exists, such as for instance design variables or direct
measurements of related quantities, it is possible to use a MANOVA model and
corresponding testing methodology to find out which value of the parameter α
is most appropriate. Assuming that the solution vectors in Y relate in a linear
way to the design variables (or predictors), Wilk’s λ and its associated P-value
from Wilk’s λ -distribution can be used to determine the influence of the design
(or, alternatively, the individual design variables) on the solution space. A low
P-value is a clear indication of good correspondence, or in other words, that a
particular choice of α is a good one.
Selection of α by MANOVA will be illustrated at the end of the paper.
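The paper's full MANOVA setup is not reproduced here, but the statistic itself is simple to sketch: Wilk's λ is det(E) / det(E + H) for the residual and hypothesis cross-product matrices of a multivariate regression of the solution vectors on the design. The design matrix D below is hypothetical:

```python
import numpy as np

def wilks_lambda(Y, D):
    """Wilk's lambda, det(E) / det(E + H), for the multivariate least-squares
    regression of Y (N x k) on a design matrix D (N x q, intercept included)."""
    B = np.linalg.lstsq(D, Y, rcond=None)[0]
    fitted = D @ B
    resid = Y - fitted
    E = resid.T @ resid              # residual (error) SSCP matrix
    Hc = fitted - Y.mean(axis=0)     # hypothesis deviations from the grand mean
    H = Hc.T @ Hc                    # hypothesis SSCP matrix
    return np.linalg.det(E) / np.linalg.det(E + H)

rng = np.random.default_rng(4)
N, k = 16, 3
# Hypothetical design: intercept plus two binary factors.
D = np.column_stack([np.ones(N), rng.integers(0, 2, size=(N, 2))])
Y = D @ rng.standard_normal((3, k)) + 0.1 * rng.standard_normal((N, k))

lam = wilks_lambda(Y, D)
assert 0.0 < lam <= 1.0   # a low value indicates good correspondence
```

In the paper's usage one would compute this for Y(α) over a grid of α values and compare the associated P-values.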
5 Visualization Tools
There is an extensive literature on visualization tools for GCA; see e.g.
Dijksterhuis & van der Burg (1996) for an overview. Methods in use involve
object scores and loading plots, and various techniques for studying the behavior
of samples, variables and profiles. The tools presented here are more focused on
the solution vectors in Y than on individual samples and variables, and are, to
the authors’ best knowledge, new. For the practical user of the framework, it
is worth mentioning that existing methods from GCA and 3-way traditions can
be adapted to work within the unified framework.
⁴ Which is actually a canonical correlation analysis.
5.1 Hiding plots
A useful viewpoint on the relationship between solution vectors in Y and
the column spaces R(Xi ) of the profiles {Xi } is that the solution vectors are
“embedded” or “hidden” in the column spaces. The methods in the unified
framework serve as “filters”, recovering the solution vectors. The “hiding-place”
of the vectors may be of interest, raising questions like “Were the first
solution vectors constituted from (linear combinations of) high-variance or
low-variance components of the individual profiles?” The hiding plot is a bar plot
of the principal variances in a profile (σij²), topped with a curve illustrating the
projection coefficients of a solution vector yj onto the principal components Ui.
The curves are plotted from the components of the vectors hij,⁵

hij = |Uiᵀ yj|²        (14)
Each hij (j varying over the solutions yj ) can be seen as a curve by plotting its
component values against its index. Spline smoothing is applied to the curve to
make the figure more readable. Figure 1 shows an example for our dataset.
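The curve data (14) is a one-line computation; a sketch with a hypothetical profile and an arbitrary unit-norm stand-in for a solution vector:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 16
X_i = rng.standard_normal((N, 6))
X_i -= X_i.mean(axis=0)                      # centered profile
U_i = np.linalg.svd(X_i, full_matrices=False)[0]

# Arbitrary unit-norm stand-in for a solution vector y_j.
y_j = rng.standard_normal(N)
y_j /= np.linalg.norm(y_j)

# h_ij = |U_i^T y_j|^2 elementwise (eq. 14): projection coefficients of y_j
# onto the principal components of profile i, to be drawn over the bar plot
# of the principal variances sigma_ij^2.
h_ij = (U_i.T @ y_j) ** 2

# The coefficients sum to at most 1 (equality iff y_j lies in R(X_i)).
assert h_ij.sum() <= 1.0 + 1e-12
```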
5.2 Manhattan plots
The solution vectors in Y as well as the principal components {uij } are vectors
that account for variability. In PCA-like applications it is customary to look at
projections of the data onto principal vectors, to see how much of the variance is
explained by a “reduced model”. In the following, this idea is further developed
in two directions. First, the reduced model need no longer be defined by principal
vectors. Instead, projections of the data onto an arbitrary basis (explanatory
variables in W) is considered. Second, a matrix is constructed, that holds
cumulative projection variances of the original data columns onto the given
basis. This matrix can be visualized in a figure, where the high values are
bright and the low values are dark.
The Manhattan matrix H(X, W) ∈ R^{Q,P} is defined by its elements

hij(X, W) = ( Σ_{r=1}^{i} (wrᵀ xj)² ) / ||xj||²        (15)

where xj and wr denote the j'th column of X and the r'th column of W,
respectively, under the assumption that WᵀW = I, i.e. the explanatory variables
are normalized and orthogonalized. It is easily seen that 0 ≤ hij ≤ 1. Three natural
examples of Manhattan plots are
examples of Manhattan Plots are
(i) to look at H(Xi , Ui ), which can be used in a standard PCA-way to see how
well the data are explained by a reduced model,
(ii) to look at H(Xi , Y), to see how well each of the profiles’ columns can be
explained by the solution vectors in Y, and
⁵ This way of visualizing vectors is common, e.g. in NIR spectroscopy.
[Figure 1: eight bar-plot panels (a)-(h), one per profile, showing principal variances 1-13 with projection curves overlaid.]
Figure 1: Hiding plots for the profiles of the sausage data using α = 0.05.
Two curves are plotted on top of each set of principal bars. The solid line (-)
corresponds to the first solution vector y1, the dashed one (- -) to the second
solution vector y2 (the third vector was omitted for readability). It is clearly
seen that y1 was mainly constituted from the first principal components of the
profiles, with the notable exceptions of the fourth (d) and the sixth profile (f).
y1 was also influenced by some medium-variance components, see e.g. (b), (c), (e)
and (f). The second solution vector y2 has less obvious projections, though one
could argue that it is influenced by the subsequent principal components 2-4, see
e.g. (a), (b), (c), (g), (e), which all have dashed-line peaks in these neighborhoods.
(iii) to look at the relationship between solution vectors in Y and the
individual principal column vectors through H(Ui , Y), giving insight into how
well each principal basis is explained by the common basis.
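A sketch of the Manhattan matrix (15) in NumPy (illustrative code, using a hypothetical X with its own principal components as basis, i.e. example (i)):

```python
import numpy as np

def manhattan(X, W):
    """Manhattan matrix H(X, W), eq. (15): entry (i, j) is the cumulative fraction
    of the variance of column x_j explained by the first i columns of W.
    Assumes W^T W = I (normalized, orthogonalized explanatory variables)."""
    proj2 = (W.T @ X) ** 2                       # (w_r^T x_j)^2
    cum = np.cumsum(proj2, axis=0)               # sum over r = 1..i
    return cum / (np.linalg.norm(X, axis=0) ** 2)

rng = np.random.default_rng(6)
X = rng.standard_normal((20, 7))
U = np.linalg.svd(X, full_matrices=False)[0]     # principal components of X

H = manhattan(X, U)                              # example (i): H(X_i, U_i)
assert H.shape == (7, 7)
assert np.all((H >= -1e-12) & (H <= 1.0 + 1e-12))   # 0 <= h_ij <= 1
assert np.allclose(H[-1], 1.0)   # the full basis of R(X) explains every column
```

Plotted as an intensity image, each column brightens from top to bottom, exactly the behaviour described for figure 2.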
Figure 2 shows Manhattan plots H(Xi, Ui) for the first four profiles of the
Sausage Data set analyzed at the end of the paper. Examples of this
approach are found in figures 5-6.
5.3 CORV plots
CORV (CORellation-to-Variability-explanation) can be used to identify what the
solution vectors do explain. As α changes, the solution vectors Y(α) move along
a scale. This scale has “correlation scoring” and “explanation of variability” as
its ends:
    GCA <---------------- α ----------------> Tucker-1
    Correlation scoring <----> Explanation of variability
One may ask “Could the GCA solution vectors also pass as Tucker vectors, or
vice versa?” A vector designed to focus on correlation (GCA modeling) may also
be suitable for describing a lot of the total variability (Tucker modeling). A chart
such as the one below could be useful:
                          Y(α) designed for
                            GCA      Tucker
    Y(α) scoring   GCA       1        0.7
    on             Tucker    0.3      1
The 1’s are obvious “maximum scores”, but the numbers 0.3 and 0.7 tell
something about how “flexible” the vectors are: whether they are useful as
solution vectors for one of the problems (GCA/Tucker) only, or whether the
solutions are interchangeable. This gives an indication of the robustness of the
α eventually chosen. Clearly, if all elements in the chart were 1’s, the choice of
α wouldn’t matter.
GCA and Tucker are only the ends of a scale, and so it would be more natural
to have more boxes in the chart, covering GCA, Tucker, and the intermediate
cases along both axes. Even better, a continuum f (., .) could be constructed and
visualized as an intensity image rather than a chart. Since both the GCA and
the Tucker problems can be described as eigenvector problems, one can study
fj(α, β) = yj(α)ᵀ (Zβ / ||Zβ||₂) yj(α)        (16)
[Figure 2: four Manhattan-plot panels (a)-(d), 13 × 13 intensity images.]
Figure 2: Manhattan plots for four profiles, with principal components as
explanatory variables. (a) shows H(X1, U1 ), (b) shows H(X2 , U2 ), (c) shows
H(X3 , U3 ) and (d) shows H(X4 , U4 ). One way of interpreting sub-figure (a)
would be the following: (i) The four first variables (columns of the data matrix)
cannot be explained by the first explanatory variable, neither can the ninth and
the thirteenth variable. This can be seen from the upper row of the matrix,
where these positions are black. (ii) The explanations of the first, second and
thirteenth variables increase significantly when another principal component
(explanatory variable) is introduced. (iii) The degree of explanation increases
naturally as more and more explanatory variables are introduced. This can be
observed from the figure, since all columns become increasingly brighter when
going from top to bottom.
for this purpose. The function fj gives the scoring of the changing solution vector
yj (α) on all matrices {Zβ }. One might plot all f1 (α, β), f2(α, β), . . ., or just a few.
Typically, eigenvectors with low eigenvalues are unstable, so there is a limit to
how many figures are informative. Normalization of the matrices to have the
same L2 norm (forcing them all to have first singular value σi1 = 1) simplifies
comparison. In the figures, the functions fj (α, β) are sampled into matrices by
choosing suitable grids for the α and β values. High f values are bright, low
values dark. Clearly, 0 ≤ fj (α, β) ≤ 1 for all α, β, j. Furthermore, it is clear that
the diagonal fj(α = β) will be the brightest part of the figure only when j = 1, i.e.
for the first solution vector y1(α). Subsequent solution vectors are perpendicular
to the first one, and will therefore fail to obtain maximum scoring along the
diagonal.
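A sketch of how the CORV surface (16) could be sampled on a grid (illustrative code, not from the paper; `Z_alpha` re-implements eq. (10) for hypothetical random profiles):

```python
import numpy as np

def Z_alpha(profiles, alpha):
    """Z_alpha from eq. (10)."""
    return sum(
        X @ np.linalg.pinv((1 - alpha) * (X.T @ X) + alpha * np.eye(X.shape[1])) @ X.T
        for X in profiles
    )

def corv(y, Z):
    """f_j(alpha, beta) = y^T (Z_beta / ||Z_beta||_2) y, eq. (16)."""
    return y @ Z @ y / np.linalg.norm(Z, 2)

rng = np.random.default_rng(7)
profiles = [rng.standard_normal((12, p)) for p in (4, 5)]

grid = np.linspace(0.0, 1.0, 5)          # coarse (alpha, beta) grid
Zs = [Z_alpha(profiles, b) for b in grid]
F = np.empty((len(grid), len(grid)))
for a, Za in enumerate(Zs):
    y1 = np.linalg.eigh(Za)[1][:, -1]    # first solution vector y_1(alpha)
    for b, Zb in enumerate(Zs):
        F[a, b] = corv(y1, Zb)

# 0 <= f_1 <= 1, and for j = 1 the diagonal is the brightest part of each row.
assert np.all((F >= -1e-9) & (F <= 1.0 + 1e-9))
assert np.allclose(np.diag(F), 1.0)
```

The diagonal equals 1 because y1(α) is the top eigenvector of Zα, so its Rayleigh quotient attains the spectral norm of Zα exactly.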
Perturbation Issues.
When α is varied, this can be seen as a perturbation of Zα in (10).
Ordered eigenvector extraction is an unstable process with respect to matrix
perturbation. When the variance associated with certain directions in the
space is changed, the ordering of the principal components may be altered.
Furthermore, if a subset of the principal components have identical variance,
almost any set of vectors spanning the subspace may be output to describe it.
One example where this is likely to occur is within the unified framework: when
the influence of low-variance components becomes comparable with that of the
high-variance ones as α → 0, there may be several components with comparable
associated variance. When vectors swap place by order, rifts are produced in
the surface described by fj (α, β). The CORV-plots therefore give additional
information about when the ordering of principal vectors is altered, but these
shifts also make the figures less readable. 6
The good news is that principal subspaces are stable with respect to
perturbation, even if the vectors spanning them are not (Golub & Van Loan
1996). If something is known about the dimension of the solution space (for
instance, the number of underlying parameters controlling the process), the
function (16) can be extended to focus on subspaces rather than on individual
vectors. If Y(α)k = (y1, y2, . . . , yk) ∈ R^{N,k} holds the first k orthogonal solution
vectors for a particular choice of α as its column vectors,

f_k^S(α, β) = ||Y(α)kᵀ Zβ Y(α)k||₂ / ||Zβ||₂        (17)
will do for subspaces what f(α, β) did for vectors. Figure 3 shows some examples
of CORV-plots taken from simulations. Here of course, fkS (α, β) > 1 might occur,
6
The “rift problem” can possibly be resolved by a vector tracing approach (not pursued here):
If one suspects that two eigenvectors have swapped place due to perturbation of the matrix, a
parallelism test may be performed to reorder the vectors. The tracing must necessarily happen
on a sufficiently coarse α-grid, otherwise the eigenvalues will at some point be nearly identical,
and tracing becomes very difficult.
[Figure 3: three CORV-plot panels (a)-(c) with α and β axes; value ranges: (a) min 0.26272, max 1; (b) min 0.27814, max 0.9958; (c) min 0.38741, max 1.3931.]
Figure 3: CORV plot for a synthetic example. (a) and (b) are CORV plots for
the two first solution vectors y1 , y2 . The clear shift of eigenvectors is seen.
The common-plot (c), based on fkS (α, β), for both solution vectors, shows that
correlation scoring and variability may both be well explained. The creation of
this example is rather complicated, and has been omitted for the sake of space.
whereas fj(α, β) was bounded above by 1. The bright areas represent high
scores (or high degree of explanation), the dark represent low scores.
6 A Case Study
6.1 Description of the dataset
The data in the following study come from sensory profiling of sausages. The
original sausage data set (published by Baardseth et al., 1992) consists of
measurements for eight sensory judges, each of them assessing 60 sausage
samples (N), using P = 15 attribute variables. Each profile is then represented
by a 60 × 15 matrix Xi ∈ R^{60,15}, i = 1 . . . 8. The sausages were produced
according to four design variables, with 5, 3, 2 and 2 levels respectively, giving
a total of 5×3×2×2 = 60 combinations. In previous analyses, the design factors
have been shown to fit extremely well with the profiles. For the purpose of
illustration, the problem was here made more difficult by removing all samples
corresponding to the top three levels of the first design variable, and the top
level of the second. This gives a dataset with 2×2×2×2 = 16 sausage samples. The
profiles, now with 16 samples, were column centered, standardized to unit Frobenius
norm (||Xi|| = 1), and variables 5 and 6 were removed, as earlier publications on
these data have shown their lack of relationship with the design.
6.2 Results
A step size for α of 0.05 and 3 solution vectors were used (the number of
components to use is a question in itself that will not be pursued here). Plotting
the Wilk’s λ values and the associated P-values (figure 4) shows that the best fit
between consensus profile and design is achieved for a relatively low value of α.
The sub-figures show Wilk’s λ and P-curves for the model.
Manhattan plots (figures 5-7), with Y(α) holding the explanatory variables and
the principal variables {uij} as the ones to be explained, show a clear shift once the
focus is turned away from pure correlation scoring: there is a significant change
as α goes from 0 to 0.05, then the picture changes slowly until α = 1.
The CORV plots (figure 8) display the same phenomenon. The distinct first
rows and columns in all sub-figures indicate a clear shift as α increases from
0. Sub-figure (a) suggests that the first solution vector is stable, accounting
both for correlation scoring and explanation of variability, as soon as α > 0.
The second and third solution vectors, whose CORV plots are given in (b) and
(c), seem to be better for correlation scoring than explanation of variability (the
figures get darker towards the lower right corner). The overall CORV plot (d)
for the subspace spanned by the three solution vectors shows the same tendency
(there is a slow decrease in the scores going along the diagonal from the upper
left to the lower right corner). The sharp shift from the first to the second
row/column shows that choosing α = 0 gives quite different results
from choosing α = 0.05. Increasing α further gives small but gradual change,
indicating that α is “robust” upwards from 0.05. These results are generally in
line with the P-values and Wilks' λ for the MANOVA model, although one could argue
that the quickly rising curves after α = 0.05 indicate less robustness with respect
to the choice of α than the CORV plots do.
A geometrical interpretation would be the following: When α = 0, the
influence of low-variance components (probably noise) is very high. Increasing
α just a little gives RGCA and a set of solution vectors matching the process
variables well. As α is further increased, the solution spaces gradually shift
focus in the direction of high-variance components. This change, however,
reduces the influence of components that relate well with the process (design)
variables. Hence the increase in the P-values and Wilks' λ as α goes to 1. Note
that none of the P-values are very low, suggesting that the relationship between
the profiles and the design variables was not very strong (this is due to the
removal of samples with specific levels, as described above).
The hiding plots (1) give detailed information on how the solution vectors
can be represented as linear combinations of the profiles' principal vectors.
They suggest that the solution vectors in Y were constituted mainly, but not
exclusively, from high-variance components. An overall interpretation of the plots
could be the following: strong connections between the profiles and the design
are not found when the first singular vectors alone determine the solution (the
Tucker case), but rather when the later singular vectors are influential as
well (although not quite as much as in the pure GCA case). One explanation
of this phenomenon (or at least a guess) could be that the overall “tasting
experience” described by the judges is a certain “sausageness”, reflected in the
early principal components of the profiles. Changing the sausage design (more
or less flour, salt etc.) gives rise to more subtle nuances, but these are only
found in the lower-variance components. To investigate this issue further, one
could look in more detail at the linear combinations matching the profiles with the
design data in the MANOVA procedure, checking whether they correspond to high- or
low-variance components in the profiles. This, however, is outside the scope of the
present paper.
The above example is a case where RGCA, compared with Tucker and
GCA, proved useful, improving solution stability by shrinking the low-variance
components. Using the MANOVA validation procedure and the plotting tools, it was
possible to select the ridge parameter α = 0.05 in a meaningful way.
The sausages in the data set were custom-produced and tasted
at the Norwegian Food Research Institute. If, however, these sausages (or some
other product) had been part of a production line, with a trained panel of judges
and ingredients varying with season and delivery, the method could be made
part of an industrial quality control: one could estimate α from one tasting
session, and use it as a parameter of consensus estimation and product labeling
for future sessions. It would be even better to cross-validate α over previous
sessions, so that the best estimate of α is always available for the next.
7 Conclusions & Discussion
GCA and Tucker can be seen as two ends of a scale, assuming the data to be
full three-way data. The whole set of methods is essentially a three-way
analysis tool-box, since it combines information from three domains of variation:
the samples, the variables and the profiles. The methods on the scale belong
to an even larger space of methods, in which GCA and Tucker are only two
points. An interesting package of tools can be designed by considering the diagonal
elements of the matrices Λi as linear filter factors. Any kind of filter could then
[Figure: two panels, (a) λ(α) and (b) P(α), each plotted against α from 0 to 1.]
Figure 4: Wilks' λ for the model, the design variables, and their components,
and the associated P-values.
[Figure: eight Manhattan-plot panels, (a)–(h), for α = 0.]
Figure 5: Manhattan plot with the profiles' principal components as the
horizontal (or original) variables, α = 0, which is the GCA case. Note that the
leftmost columns, corresponding to the principal vectors with high associated
variance, have only slightly “higher towers” than the other ones.
[Figure: eight Manhattan-plot panels, (a)–(h), for α = 0.05.]
Figure 6: Turning to α = 0.05 (RGCA), there is a swift change. Now there is a
much clearer tendency for the first principal components to have high towers.
[Figure: eight Manhattan-plot panels, (a)–(h), for α = 1.]
Figure 7: The picture changes slowly as α increases (here α = 1, the Tucker
case). Now the first columns are clearly the brightest.
[Figure: four CORV score maps, panels (a)–(d), over (α, β) ∈ [0, 1]²; the reported
per-panel min/max scores range from 0.17 to 1.72.]
Figure 8: CORV plots for the sausages. Panels (a), (b) and (c) are for the three
first (individual) solution vectors, (d) for the subspace spanned by all of them.
be designed to put focus on certain components in the profiles. A low-pass filter
resembling a normal distribution, for example, with arbitrary placement of its
center and width, could be used to “tune high and low frequencies on a set of radios
(profiles)” and see how they harmonize. The power of modern computers makes it
possible to scan for such relationships in real time, combining human intuition
with statistical and computational power.
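A hedged sketch of this filter-factor idea (the function name `filtered_profile` and its `center`/`width` parameters are illustrative, not from the paper; a Gaussian curve over the component index stands in for the normal-distribution-shaped filter described above):

```python
import numpy as np

def filtered_profile(X, center, width):
    """Re-weight a profile's singular components by Gaussian filter factors.

    The diagonal filter factors f_k replace the hard 0/1 (Tucker) or
    ridge-type (RGCA) weighting; moving `center`/`width` "tunes" which
    variance components of the profile are emphasised."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = np.arange(len(s))
    f = np.exp(-0.5 * ((k - center) / width) ** 2)  # filter factors on the diagonal
    return (U * (f * s)) @ Vt                       # re-weighted profile
```

With a very wide bell, every component passes through unchanged; with a narrow bell centered at 0, only the first (highest-variance) component survives.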
It is important to emphasize that the GCA method only looks for linear
combinations that are highly correlated. All directions are equally important
regardless of whether they describe much of the information in Xi or not. If there
are strong correlations in some of the "smaller" directions, these will be picked
up just as easily as the others. Such directions can, however, sometimes have
strong effects on the stability of the solution. This problem is well understood
in regression analysis. An interesting study would be a statistical analysis of
the stability of the framework. Simulations based on random profile generation
and framework analysis for fixed α are one approach.
The connection between GCA and multiple regression (MR) has been covered
elsewhere. It has e.g. been shown that MR is a form of GCA. GCA turns out
to be a remarkably general framework, encompassing canonical analysis as a
special case, and therefore discriminant analysis, multiple regression, analysis
of variance and correspondence analysis (McKeon 1965, Gittins 1985, Carroll
1968, Kettenring 1971, Tenenhaus 1987). It would therefore be of interest to
explore whether RGCA as a framework would contain Ridge Regression and
Regularized Discriminant Analysis as well as the others.
Variable subset selection.
With the high dimensionality of the data, and the total number of columns
in particular, it would be of interest to perform dimension reduction by variable
selection. There may be certain columns in the profiles that account for much of
the variability. If strong underlying structures exist, certain variables may turn
out always to be linear combinations of others. A subset selection approach,
focusing on a few variables spanning column subspaces comparable to those
obtained when all variables are used, could simplify interpretation as well as
the experimental procedures (e.g. by reducing the number of tastes each judge
would have to describe).
Definition of consensus.
With respect to the seemingly different classes of optimization problems
presented in the introduction, relatively little has been said about the
interpretation of the consensus matrix Y. While it is well understood in the
specific cases of GCA and Tucker-1, it might still be possible to say something
about how the consensus captures and compresses information about the profiles,
on the full scale (or even generalized space) of methods.
Links with Continuum Regression (CR) and PLS.
Changing the focus of study from correlation to explanation of variability is not
too different from the idea of CR. Continuum regression involves a scale bridging
classical least squares regression with principal component regression, with
PLS (partial least squares) regression in between. Since PLS can be expressed
(non-linearly) using filter coefficients in an SVD setting (Hansen, 1996), linking
our methodology with external response data in a PLS or multi-response regression
setting is a topic for further research.
The authors are grateful for comments from Ole-Christian Lindgjærde and
Nils Christophersen. Special thanks to Øyvind Langsrud for the MATLAB
implementation of MANOVA.
References
[1] Amerine, M.A., Pangborn, R.M., Rossler, E.B.(1965) Principles of Sensory
Evaluation of Food, Academic Press, New York.
[2] Arnold, G.M., Williams, A.A., (1985) The use of generalized Procrustes
Techniques in sensory analysis, In: Statistical Procedures in Food Research,
Piggot, J.R. (Ed.)
[3] Baardseth P., Næs T., Mielnik J., Skrede G., Hølland S., and Eide O. (1992)
Dairy ingredients effects on sausage sensory properties studied by principal
component analysis, Journal of Food Science, Vol. 57, No.4 pp. 822-828.
[4] ten Berge, J. M. F. (1977) Orthogonal Procrustes rotation for two or more
matrices. Psychometrika, 42, pp. 267-276
[5] van der Burg, E. & Dijksterhuis, G. (1996) Generalized canonical analysis
of individual sensory profiles and instrumental data, In: Multivariate
Analysis of Data in Sensory Science , edited by Naes T. & Risvik E, Elsevier
Science.
[6] van der Burg, E., de Leeuw, J., and Dijksterhuis, G.B. (1994) OVERALS:
nonlinear canonical correlation analysis with k sets of variables. Computational Statistics and Data Analysis, no. 18, pp. 141-163.
[7] Carroll, J.D. (1968) Generalization of canonical analysis to three or more
sets of variables, Proceedings of the 76th Convention of the American
Psychological Association 3, pp. 227-228.
[8] Dahl, T. (1998) Empirical Modeling of Human Motion, MSc Thesis,
University of Oslo, Department of Mathematics.
[9] Friedman, J.H. (1989) Regularized Discriminant Analysis, Journal of the
American Statistical Association, Vol. 84, No. 405, pp. 165-175.
[10] van de Geer, J.P. (1984) Linear Relations among k sets of Variables (1984),
Psychometrika, Vol. 49, No 1, pp. 79-94.
[11] Gifi, A. (1990) Nonlinear Multivariate Analysis. Wiley Series in Probability
and Mathematical Statistics. Chichester: John Wiley & Sons.
[12] Gittins, R. (1985) Canonical Analysis, a review with applications in ecology,
Berlin: Springer-Verlag.
[13] Golub, G. & Van Loan, C.F. (1996) Matrix computations, 3rd ed. The Johns
Hopkins Univ. Press.
[14] Gower, J.C. (1975) Generalized Procrustes Analysis, Psychometrika 40, pp.
33-51.
[15] Hansen, P.C. (1996) Rank Deficient and Discrete Ill-posed Problems, Ph.D.
dissertation, Technical University of Denmark, DK-2800 Lyngby, Denmark.
[16] Hoerl, A.E. & Kennard, R.W. (1970) Ridge Regression: Biased Estimation
for Nonorthogonal Problems, Technometrics, 12, pp. 55-67.
[17] Kettenring, J.R (1971) Canonical analysis of several sets of variables,
Biometrika, Vol. 58, pp. 433-451.
[18] Kristof, W. & Wingersky, B. (1971) Generalizations of the orthogonal
Procrustes rotation procedure to more than two matrices. Proceedings of
the 79th Annual Convention of the American Psychological Association, 6,
pp. 81-90.
[19] Kroonenberg, P. & De Leeuw, J. (1980) Principal component analysis
of three-mode data by means of alternating least squares algorithms,
Psychometrika, Vol. 45, pp. 69-97.
[20] Langron, S.P. (1983) The application of Procrustes statistics to sensory
profiling. In: Sensory Quality in Foods & Beverages: Definition, Measurement & Control, A. A. Williams & R.K. Atkin (Eds), Ellis Horwood Ltd,
Chichester, pp. 89-95.
[21] Naes, T. & Kowalski, B. (1989) Predicting sensory profiles from external
instrumental measurements, Food Quality and Preference, 4/5, pp. 135-147.
[22] Naes T. & Risvik E. (1996) Multivariate Analysis of Data in Sensory Science,
Elsevier Science.
[23] Mardia, K.V., Kent, J.T., Bibby, J.M. (1997) Multivariate Analysis, Academic
Press.
[24] McKeon, J.J., (1965) Canonical Analysis: some relations between canonical
correlation, factor analysis, discriminant function analysis, and scaling
theory. Psychometric Monograph 13. University of Chicago Press, Chicago.
[25] Peay, E.R. (1988) Multidimensional rotation and scaling of configurations
to optimal agreement. Psychometrika 53, pp.199-208.
[26] Tenenhaus, M. (1987) Generalized Canonical Analysis, Bernoulli, Vol.2, pp.
133-136.
[27] Tucker, L.R. (1958) An inter-battery method of factor analysis.
Psychometrika 23, pp. 111-136.
Paper II:
Outlier and Group detection in Sensory
Analysis using Hierarchical Cluster
Analysis with the Procrustes Distance
Outlier and Group Detection in Sensory
Panels using Hierarchical Cluster Analysis
with the Procrustes Distance
Tobias Dahl and Tormod Næs∗
Abstract
Generalized Procrustes Analysis (GPA) is a well-known method both in
multivariate analysis and shape analysis. It is used to find a representative
average for a set of matrices (configurations, shapes, profiles). In this paper,
hierarchical clustering is suggested for situations where the data profiles
are believed to come from non-homogeneous groups. The Full Procrustes
Distance is used as the dissimilarity measure for the amalgamation. This
new approach to sensory panel analysis may be used at an exploratory stage,
in combination with GPA, to gain insight into the structures of the data set.
It can help the researcher detect outliers and sub-groups, help him/her make
decisions regarding further analysis, and reduce the risk of erroneous inference about the data.
key words: Procrustes Distance, Generalized Procrustes Analysis,
dendrograms, Sensory Analysis, Shape Analysis
1 Introduction
In many practical situations, it is important to define a meaningful average of
data matrices. Examples of such situations are
(i) A set of sensory profiles for a number of products. This could be N
food samples assessed by Q judges, using P tasting variables (sourness,
saltiness, bitterness etc), giving Q N-by-P matrices.
(ii) A set of psychological profiles, which could be N patients interviewed by Q
therapists giving scores along P dimensions (depression, anxiety etc).
∗
MATFORSK (Norwegian Food Research Institute) and University of Oslo.
(iii) A set of point configurations, or shapes in two or three dimensions, which
are similar and differ mainly in terms of rigid transformations. If N is the
number of points, and Q the number of shapes, Q N-by-2 or N-by-3 matrices
must be analyzed.
The easiest way of handling this data is to use simple averages, but there are
some obvious problems with this approach:
• There may be confusion about the use of terms (saltiness and bitterness, or
depression and anxiety)
• There may be differences in the scaling. One judge or therapist may use
a larger portion of the scale than another, e.g. one scores in the range 2-6,
and another in 1-10.
• The center of the scales may differ.
These comments are relevant for cases (i) and (ii) above. In case (iii),
this corresponds to orientational dislocation, size differences and center
displacements of the shapes.
Generalized Procrustes Analysis (GPA) is a technique frequently used to
handle such problems. It is based on standardizing profiles with respect to
rotation/reflection, isotropic scaling and translation, in order to provide a better
average, a consensus. Even though case (iii) seems different from the two others,
they can be shown to be equivalent 1 by geometrical reasoning.
Procrustes methods have a history in two distinct disciplines of statistics.
They were introduced in psychometrics, a branch of multivariate analysis,
where important references include Mosier (1939), Green (1952), Cliff (1966),
Schönemann (1966, 1968), Gruvaeus (1970), Schönemann and Carroll (1970),
Kristof & Wingersky (1971), Gower (1971, 1975), Ten Berge (1977), Sibson
(1978,1979), Langron and Collins(1985). The methods have been used as
standard tools in the related field of sensory analysis since the 80s, due to
important contributions by Arnold & Williams (1985) and later Dijksterhuis
(1996). Procrustes methods were brought into statistical shape analysis by
Kendall (1984, 1989), Goodall (1991) and later Dryden & Mardia (1997).
Their introduction in shape analysis led to new practical and theoretical
developments. The individual matrices (profiles or configurations) known from
psychometrics were formalized as shapes in the new field. The shape space was
explored by Kendall (1984,1989) and Le and Kendall (1993), and a distance
measure known as the Procrustes Distance was developed to measure degree
of difference between configurations. The new findings can be used to widen the
range of applications involving Procrustes Analysis, even in psychometrics, as
will be shown.
1
Variants of Procrustes analysis work without the possibility of reflection, leaving rotation as the only
admissible orthogonal transformation. This is common in shape analysis.
[Figure: panel (a) shows the two point configurations with landmark indices 1–45
along their contours; panel (b) shows the matched configurations and their consensus.]
Figure 1: An artificial example from shape analysis: (a) Two point configurations
(or profiles), one banana and one mushroom, are to be matched by Procrustes
rotations (after an initial centering/translation). The numbers along the lines are the
landmark indices. (b) The consensus (drawn with a thick line), obtained by rotating
the configurations to match and then averaging, is neither representative of the
mushroom nor of the banana.
1.1 The consensus is not always meaningful
As described above, GPA is based on a model for the difference between matrices.
Figure 1 shows an example where the Procrustes transformation is not very
meaningful. Other examples are:
• A sensory experiment where the judges come from two expert groups.
• A tasting session where the food samples have been presented to the judges
at different times. Certain chemical processes could have changed the taste
of the food.
• A study in psychotherapy where one of the therapists comes from a
different school.
• A sensory panel where one or more judges have caught a cold.
Working with two- or three-dimensional data, such problems can be detected
by plotting the data, especially if the points have a natural ordering (as in
figure 1, with successive numbering along the contour of the shape). For
higher dimensions, other techniques are required. This paper presents and
discusses a new approach to sensory panel analysis. By grouping the judges with
Hierarchical Cluster Analysis (HCA), one can check the relevance and quality of the
Procrustes model for high-dimensional data. The main focus will be on detecting
groups of configurations and detecting objects which are totally different from
the rest (outliers). For other ideas concerning clustering methods on three-way
data, see Carroll and Arabie (1983), Gordon and Vichi (1999), Krieger and Green
(1999) and Vichi (1999).
To illustrate the gains of this new approach, existing GPA diagnostics will be
computed and compared with the new ones. A data set from sensory analysis,
manipulated for the purpose of illustration, will be used in the computations.
Finally, a real-life dataset (peas, Næs & Kowalski 1989) will be studied in detail.
HCA with the Procrustes Distance is intended to be used in an informal way,
at an early stage, to explore the nature of the data at hand. Questions such as
determining the number of clusters have been considered by other authors (see
e.g. Gordon, 1999), and are not central to the paper.
2 Methodology
2.1 Cluster Analysis
The purpose of clustering is to group n objects into g groups or clusters. The
members of each cluster should be “similar” in some sense, and the clusters thus
“homogeneous”. There are different approaches to clustering, both hierarchical
and criterion-based ones. In this paper, the focus will be on the former. At each
level, the two most similar groups are agglomerated. The results from such
analyses are usually presented in tree structures, called dendrograms. Even
though all objects will, in such a process, end up in the same group, the grouping
process is interesting. The length of the edge connecting the nodes matches the
degree of dissimilarity between the subgroups.
A hierarchical clustering process is based on the distances or similarities
between the objects. The dissimilarity between two clusters Gi, Gj is given by
a measure d(·, ·) that can be constructed in a number of different
ways, depending on the nature of the objects and the purpose of the clustering.
If {x1 , . . . , xn } are vectors, a frequent choice of distance function between two
clusters is
d(Gi, Gj) = min_{z∈Gi, w∈Gj} ||z − w||_2^2   (1)
Using the distance function (1) leads to single linkage clustering. Other variants
will be described in section 2.4, together with a distance function for matrices
rather than vectors. For more information on clustering, see Mardia et al. (1998)
or Gordon (1999).
2.2 Procrustes Analysis & The Procrustes Distance
If ||.||F denotes the Frobenius norm, the Procrustes transformation for a matrix
X1 to match with X2 (both n × p) is found by solving
min_T ||T(X1) − X2||_F = d(X1, X2)   (2)
with the requirement that T is a transformation composed of a rotation/reflection
matrix Q (Q^T Q = I) and an isotropic scaling factor c ∈ R,
T(X1) = X1 · cQ,   c := tr(X1^T X2) / ||X1||_F^2   (3)
The solution to the optimal orthogonal transformation problem is
Q = UV^T   (4)

where X1^T X2 = USV^T is the singular value decomposition of X1^T X2. This
can easily be derived using standard results from linear algebra (Cliff, 1966,
or Mardia et al., 1979). The dissimilarity measure d measures the degree of
dissimilarity between two matrices after rotational and scaling effects have been
removed. It is, however, not symmetric. Generally,
d(X1, X2) ≠ d(X2, X1)   (5)
or less formally, the distance from one object X to an object Y is not the same as
the distance from Y to X. This makes the amalgamation difficult to interpret. By
ensuring that X1 and X2 are scaled to have the same variance (after centering),
||X1||_F^2 = ||X2||_F^2 = K, the distance function becomes symmetric. The resulting
distance is called the Full Procrustes Distance:
min_T || T(X1/||X1||_F) − X2/||X2||_F ||_F = d_F(X1, X2)   (6)
with the same requirement on T as above. Typically K = 1 is chosen, which
means that the profiles are scaled to have unit variance. A proof for the
symmetry of dF can be found in Dryden & Mardia (1997). Translation is usually
applied together with scaling and rotation. It can be proven that the optimal way
of translating the point-sets is by column-centering the profiles. Throughout the
paper, it is assumed that all profiles (shapes) are pre-processed in this way.
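Under these conventions (profiles column-centered, then scaled to unit Frobenius norm), the Full Procrustes Distance can be sketched as follows; the function name is ours, and the optimal rotation and scaling follow the SVD construction of (3)–(4):

```python
import numpy as np

def full_procrustes_distance(X1, X2):
    """Symmetric Full Procrustes Distance between two centered profiles.

    Both profiles are scaled to unit Frobenius norm before matching,
    which is what makes the distance symmetric (Dryden & Mardia, 1997)."""
    A = X1 / np.linalg.norm(X1)
    B = X2 / np.linalg.norm(X2)
    U, S, Vt = np.linalg.svd(A.T @ B)      # A^T B = U S V^T
    Q = U @ Vt                             # optimal rotation/reflection, eq. (4)
    c = S.sum()                            # optimal isotropic scaling (||A||_F = 1)
    return np.linalg.norm(A @ (c * Q) - B) # residual after matching
```

By construction the distance vanishes when one profile is a rotated, reflected and rescaled copy of the other.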
2.3 Generalized Procrustes Analysis
Generalized Procrustes Analysis (GPA) does 2 for several matrices what
Procrustes Analysis does for two matrices. It is based on an iteratively updated
average or consensus Z, and it makes a set of matrices as similar with Z as
possible by Procrustes transformations. This also makes the profiles as similar
as possible with each other (Gower, 1975). If Ti denotes the optimal (and
implicitly defined) transformation by scaling and rotations for the i’th profile
Xi ∈ RN ×P , the GPA minimizes
g(X1, . . . , XQ) = Σ_{i=1}^{Q} ||Ti(Xi) − Z||_F^2   (7)
Since this paper is concerned with dissimilarity measures, note that the match
of transformed profiles with the consensus Z can be checked by considering the
individual terms in (7).
gi = ||Ti(Xi) − Z||_F^2,   i = 1, . . . , Q   (8)
An algorithm for GPA can be found in the appendix.
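The appendix's algorithm is not reproduced here, but a minimal illustrative version of the iterative GPA loop might look like this (assuming centered profiles; the names are ours, and the procedure finds a local optimum only, cf. Ten Berge, 1977):

```python
import numpy as np

def gpa(profiles, n_iter=50):
    """Sketch of iterative GPA: match each profile to the current consensus
    Z by the optimal rotation/reflection and isotropic scaling, then update
    Z as the mean of the transformed profiles."""
    transformed = [X.copy() for X in profiles]
    Z = sum(transformed) / len(transformed)      # initial consensus
    for _ in range(n_iter):
        for i, X in enumerate(profiles):
            U, S, Vt = np.linalg.svd(X.T @ Z)
            Q = U @ Vt                           # optimal rotation toward Z, eq. (4)
            c = S.sum() / np.linalg.norm(X)**2   # optimal isotropic scaling
            transformed[i] = X @ (c * Q)
        Z = sum(transformed) / len(transformed)  # update consensus
    return Z, transformed
```

For profiles that are exact rotated copies of one another, the residuals gi of (8) shrink to zero.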
2.4 HCA with the Procrustes Distance
If the profiles {X1 , . . . , XQ } are too different, taking a GPA average may be
meaningless. To detect such situations, diagnostics are needed, but gi has
intrinsic problems:
• If one profile is an outlier, it has already been allowed to influence the
consensus Z when gi is calculated. Thus, an outlier will seem less of an
outlier after the iterative procedure. 3
• If there are several groups of profiles, homogeneous within but not across
groups, the gi diagnostics will not reflect the grouping. At worst, the procedure
may produce a consensus not representative of any profile; at best, it will
yield high values of gi for all i, reflecting a poorly defined common opinion.
Cluster analysis reveals group structures as well as individuals not fitting
well with existing groups. Visualization by trees (dendrograms) can help the
researcher detect when the averaging process “breaks down” due to the influence
of profiles very different from the rest, as well as situations where subgroups of
profiles differ substantially from one another.
2
more or less, anyway. It is well known that GPA algorithms only find a local minimum for
the optimization, whereas the Procrustes transformation, covering only two matrices, finds the
global optimum. For details, see Ten Berge (1977).
3
This is of course also the case in ordinary regression analysis: The outliers do influence the
parameters and thus affect the estimated relationship between the variables.
2.4.1 Single, Complete, Average Linkage
Let the sensory profiles {X1 , . . . , XQ} be the objects to be grouped in a cluster
analysis. Each cluster Gi contains one or more profiles at any time. The distance
between two clusters Gi , Gj is based on the full Procrustes distance and may be
one of the following:
dS(Gi, Gj) = min_{X∈Gi, Y∈Gj} d_F(X, Y)   (9)

dC(Gi, Gj) = max_{X∈Gi, Y∈Gj} d_F(X, Y)   (10)

dM(Gi, Gj) = average_{X∈Gi, Y∈Gj} d_F(X, Y)   (11)
From these three distances and a general clustering algorithm specified below,
three HCA variants are derived.
A general clustering algorithm
Let {G1, . . . , GQ} = {X1, . . . , XQ} be the initial clusters. A distance matrix
D = {dij} ∈ R^{g×g}, where g is the current number of clusters, is given by
the elements dij = d(Gi, Gj), where d is the chosen clustering distance function.
Next, the minimum element dij of the matrix D is found; its row and
column indices identify the two clusters to be joined in the current
step. The two groups are linked, the new g − 1 clusters are relabeled, and the
procedure is repeated until there is only one cluster left, or until a stopping
criterion ends the process.
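The general algorithm above can be sketched as follows (Python; `dist` would be the Full Procrustes Distance between profiles, while `linkage` selects the HCA variant: `min` for single, `max` for complete linkage; the names are illustrative):

```python
import itertools

def hierarchical_cluster(objects, dist, linkage=min):
    """Sketch of the general agglomerative algorithm described in the text.
    Returns the merge history as (cluster_i, cluster_j, distance) tuples."""
    clusters = [[i] for i in range(len(objects))]
    history = []
    while len(clusters) > 1:
        # find the pair of current clusters at minimum linkage distance
        d, i, j = min(
            (linkage(dist(objects[a], objects[b]) for a in ci for b in cj), i, j)
            for (i, ci), (j, cj) in itertools.combinations(enumerate(clusters), 2)
        )
        history.append((clusters[i], clusters[j], d))
        clusters[i] = clusters[i] + clusters[j]   # amalgamate the two groups
        del clusters[j]
    return history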
2.4.2 Centroid Linkage & GPA
A fourth clustering variant is also possible, perhaps reflecting some intrinsic
ideas of GPA more clearly than the other methods. Rather than computing
distances (minimum, maximum or average) between elements of clusters, each
cluster could be represented by an average element, and the distance between
clusters could be measured in terms of the full Procrustes Distance between
average elements, or consensuses. A natural candidate for the average element is
the GPA consensus Zi of the matrices in each cluster Gi ,
dij = d(Zi, Zj)   (12)
At the final step, this amalgamation is equivalent to GPA, because all matrices
are members of the same cluster, and the average matrix computed for this
cluster is the ordinary GPA consensus. The first g − 1 clustering steps are, in
this light, sub-GPA’s.
At any point in the amalgamation process, an atypically long vertical link
in the dendrogram can be seen as a “breakdown”. The other three clustering
variants could also be turned into sub-GPA’s, merely by computing GPA
consensuses Zi after each clustering step. The main drawback with the centroid
method is that it fails to preserve a basic clustering property: the continuous
increase of distance levels through the stages of the clustering process. Let ds
denote the distance at which two clusters are joined when there are g − s
clusters left. We should expect that

d1 ≤ d2 ≤ d3 ≤ · · · ≤ dg−1   (13)
This property can easily be derived for single, complete and average linkage.
But see figure 2, sub-figure (c), for an example where the centroid method fails to
fulfill (13). This is commonly called “inversion” in Hierarchical Cluster Analysis,
and is typical when centroid linkage is used.
3 Experiments: Detecting group structure and outliers
To illustrate the methods and the practical problems that can be analyzed, a set
of experiments was carried out. The original data set comes from a tasting session
of sausages (Baardseth et al., 1992), which was manipulated in various ways to
highlight certain situations that may occur in real tasting experiments. The set
contains eight judge profiles, each describing N = 60 sausage samples in P = 13
variables. A profile is represented by a matrix Xi ∈ R^{60×13}, i = 1, . . . , Q = 8, all
column-centered and scaled to have unit variance (or unit Frobenius norm). The
few (5-10) seconds.
3.1 Basic Study
A basic study illustrates the use of the methods in the simplest setting. Using
the original profiles X1 , . . . , X8 , the four clustering variants were applied (their
dendrograms can be found in figure 2). The different methods give very similar
results. Some basic properties can be seen that are typical of any hierarchical
clustering method4 .
• The complete linkage tends to have longer edges in the tree than the
others. This is because it joins cluster elements of maximum distance,
and thus takes longer before connecting single objects with large clusters.
The chance that one object in a big cluster is far away from a single object
outside the cluster is usually large. Rather than grouping objects quickly,
it tends to pair up objects, then pairs of pairs, and so on.
4 and not just when using the Procrustes Distance.
PAPER II: OUTLIER AND GROUP DETECTION IN SENSORY PANELS
[Figure 2: four dendrograms, panels (a) Complete Linkage, (b) Single Linkage, (c) Centroid Linkage and (d) Average Linkage, with merge heights between 0.7 and 0.95 and the eight judges as leaves.]
Figure 2: Basic Survey. The methods give similar clusters as results. The Complete
Linkage Method differs from the other ones in that it connects objects 2 and 8 at an
early stage. Note panel (c), corresponding to centroid linkage, and the downward edge
(inversion) when joining the cluster {4, 6} with object 7.
• The single linkage method, on the contrary, easily allocates new (single)
objects to clusters with many members. It is much easier to find one within-cluster
object close to the new object than to make certain no within-cluster
object is far apart from the new object.
• The centroid method has the “inversion”, which is typical for centroid
methods, and to be expected. This is due to the calculation of a cluster
representative (centroid) which does not conform with the principle of an
increasing sequence of distances.
• The average linkage seems, not surprisingly, to lie between single and
complete linkage.
From the figures, one can also argue that object 1 is an outlier.
3.2 Splitting the data into two sessions
A situation with two separate tasting sessions was simulated by dividing each
sensory matrix Xi into two halves, each consisting of subsets of n1 = n2 = 30
samples. Each sub-matrix can be seen as a single profile for a single experiment.
This manipulation corresponds to a real-life situation where two separate
tasting experiments are performed by the same set of judges. If the judges
were sufficiently clear in their scoring (relative to the others), one would expect
the clustering to be very similar for the two experiments. The four clustering
variants were run to check whether the judges would group in the same way in
these two "tasting sessions". There were differences between them, as figure 3
shows, including the following observations:
• Two clusters are "stable" with respect to the choice of tasting session: Objects 4
and 6 always cluster together, and the same is true for 3 and 7, with the
notable exception of the centroid linkage method (here 4 and 6 only join in
the first session, while 3 and 7 only join in the second). In the first session,
the 4-and-6 group joins with 8 before joining the existing 3-and-7 cluster for all
but the centroid linkage method.
Other details can be seen in the dendrogram, but the main observations are
(1) there are differences between the sessions and (2) the amalgamations are
similar, within the sessions, for single, complete and average linkage, but
notably different from the amalgamations of the centroid linkage variant.
3.3 Outlier detection
The purpose of the third experiment was to illustrate outlier detection. The
matrix of the seventh judge was "turned into an outlier" by exchanging the first
30 rows with the last 30 rows of the matrix X7 . The dendrograms are found in
figure 4.
• All variants cluster correctly, grouping the outlier with the other objects
only at the last stage.
• The complete and average linkage variants are less clear than the other
two in displaying object 7 as the outlier. The centroid and single linkage
have long edges connecting the other objects with object 7, and thus show
the outlier effect much more clearly.
• The strictness of complete linkage against building large groups quickly
makes the edge connecting object 7 with the others comparatively small. The
average linkage result shares this property to some extent.
[Figure 3: eight dendrograms, panels (a)-(h): complete, single, average and centroid linkage applied to the first and second subsets, with merge heights between roughly 0.6 and 0.9.]
Figure 3: Splitting the data into two sets/sessions. The grouping of the objects is
different for the two subsets, but similar across the four clustering variants. The centroid
linkage differs from the other variants.
• For single linkage, which groups clusters quickly, adding object 7 to all the
others requires accepting an object far away from the rest. The relatively
short distances in the early stages make the final linkage seem large, thus
emphasizing the outlying nature of X7 .
When considering outliers, diagnostic tools become interesting. Two diagnostics are
available so far:
• The GPA residuals gi
• The inter-group clustering distances ds
To see whether the HCA approach gives anything new, it is natural to compare
the two measures gi and ds in a number of situations. They are, however, on
different scales, and must be re-scaled to make comparison meaningful: Each
element in each set (gi or the cluster distances ds for a specific HCA variant)
was divided by its maximum element. The original GPA diagnostics (table 1)
match the diagnostic impression given by the dendrograms and the clustering
[Figure 4: four dendrograms, panels (a) Complete Linkage, (b) Single Linkage, (c) Centroid Linkage and (d) Average Linkage, with object 7 joined last in each.]
Figure 4: Case I: One outlier. All variants detect the outlier by grouping it with the
others at the final stage only. Note that figures (b) and (c) show the outlier effect more
clearly than the other ones.
distances (table 2): The (scaled) Procrustes diagnostic g7 = 1.0000, large relative
to the other gi 's, indicates that profile 7 is an outlier. The same indication is
made by HCA, since the seventh object is joined with the other ones only at the
last stage. The (scaled) clustering distance 1.0000 at this step is also relatively
large, at least for the single linkage and centroid linkage. Overall, one may
conclude that HCA gave little new information in this specific case. In the next
sections, some examples for the opposite case are given.
3.4 Two classes
In some situations, there may be reason to look for group structures among the
judges. If the judges come from two separate expert groups in, e.g., wine tasting,
this may be reflected in the scoring charts. If severe, systematic differences exist
between the groups, extracting a meaningful average is impossible.
A two-group situation was created by exchanging the first 30 rows with the
last 30 rows in the first four profiles X1 , . . . , X4 , leaving X5 , . . . , X8 as they were.
The analysis of this data can be seen in figure 5. There is, of course, no way the
[Figure 5: four dendrograms, panels (a) Complete Linkage, (b) Single Linkage, (c) Centroid Linkage and (d) Average Linkage, each separating objects 5-8 from the modified objects 1-4.]
Figure 5: Case II: A two-group situation. HCA with the Procrustes Distance correctly
identifies the two groups by joining them only at the final stage, and then with a high
associated clustering distance.
standardized gi diagnostics (table 3) can identify the two-group situation, but all
the dendrograms reflect it. The single and centroid linkage give a
better illustration than the other two.
3.5 Two classes and one outlier
To conclude the experimental section, a setting was made with two classes and
an outlier. One class consists of the profiles X5 , X6 , X8 . The other class held modified
profiles X1 , X2 , X3 , X4 , with the last 10 rows moved on top of the first 50. Finally,
X7 was made an outlier by exchanging the first 30 with the last 30 rows. The
dendrograms are found in figure 6. All variants reveal the two distinct classes,
and the one outlier, which is joined with one of the classes at the second-to-last stage.
The gi diagnostics (table 4) fail even to identify the outlier correctly, and are
otherwise very uninformative, with several gi close to the value 1.0000. X4 is
reported as the furthest away from the consensus, and the outlier X7 is actually
[Figure 6: four dendrograms, panels (a) Complete Linkage, (b) Single Linkage, (c) Centroid Linkage and (d) Average Linkage.]
Figure 6: Case III: Two groups and an outlier.
closer to the consensus than the others. In this case, the standard diagnostic gi
is useless.
Table 1: GPA Lack-of-fits
gi : 0.9165  0.9073  0.8665  0.8553  0.8787  0.8673  1.0000  0.8641

Table 2: Cluster distances used for linking at each stage
Method/Step         1       2       3       4       5       6       7
Complete Linkage  0.7685  0.8142  0.8224  0.8661  0.9037  0.9392  1.0000
Single Linkage    0.7943  0.8416  0.8422  0.8555  0.8565  0.8762  1.0000
Centroid Linkage  0.7869  0.8010  0.7779  0.7910  0.8396  0.8758  1.0000
Average Linkage   0.7849  0.8315  0.8364  0.8587  0.8939  0.9172  1.0000

Table 3: GPA Lack-of-fits
gi : 0.9136  0.9565  0.9599  0.9716  1.0000  0.9775  0.9918  0.9212

Table 4: GPA Lack-of-fits
gi : 0.9594  0.9829  0.9998  1.0000  0.9231  0.9565  0.8385  0.9544
4 Analysis of descriptive sensory data
from peas
The pea data used for this example have previously been analyzed by Næs and
Kowalski (1989) and the reader is referred to that paper for details. The data
contain sensory measurements made by 10 assessors on 60 different samples of
peas (different varieties and different degrees of maturity). 10 sensory attributes
were measured, but in this paper we only consider 4 of them (pea flavor,
sweetness, off-flavor and mealiness). There were 2 replicates for each sample
and these were averaged before statistical analysis.
For this data set we confine ourselves to single linkage only. The dendrogram
for the pea data using this technique is shown in Figure 7. It gives a clear idea
about the similarity and differences among the assessors. First of all, there
is a clear group of 6 assessors, 1, 2, 4, 7, 9, and 10, who are very similar to each
other. The (full) Procrustes Distance (vertical axis) between assessors 1 and 4 is
only slightly smaller than between 1 and 7, which is the last one joined to this
cluster. The next assessor to be joined is number 3, which is considerably further
away. Assessor number 6 seems to be quite different from the rest in this study.
For this particular study, near infrared reflectance (NIR) data were also
acquired for all the samples. The NIR data in this case contained absorbance
readings at 116 different wavelengths. In order to verify the conclusions above,
the sensory profiles were related to principal components of the NIR data. First
of all, the scores for each individual assessor were considered, and the results are
presented in Figure 8. As can be seen, already after 3-4 principal components
the prediction ability is quite good. The prediction error is here ||Wβ − Y||2F ,
where W is the N × P matrix of (NIR) principal component scores for the
N samples in P = 1, 2, . . . , 10 (principal) dimensions, β is the corresponding matrix
of regression coefficients, and Y is a sensory profile, normalized to have unit
variance, Y := Xi /||Xi || for some assessor/profile indexed by i. It is also clear
that the same group of assessors as was determined to be similar, are the ones
that are easiest to predict. Assessor number 6 is clearly the one with the least
clear relationship between sensory and NIR data. Assessors 3, 5 and 8 occupy
an intermediate position. Figure 9 presents similar results for the
consensus of the six similar assessors, for the full panel and for the raw average
of the profiles (the prediction ability was measured in the same way as for the
individual profiles, with the same normalization of the target (consensus) as was
used for the individual profiles, as described above). Assessor 6 is also plotted for
comparison. The results show that the consensus from the six similar ones is
clearly easier to relate to the NIR data than the full consensus.
These results together clearly indicate that assessors 1,4,9,7,2 and 10 are
similar and have a simpler and more predictable relationship to NIR than the
other four. A possible and quite tempting explanation for this is that they are
simply more reliable than the rest of the assessors in this case. If no strong
[Figure 7: single-linkage dendrogram for the 10 assessors, leaves ordered 1, 4, 9, 2, 10, 7, 3, 5, 8, 6, with merge heights between roughly 0.4 and 0.9.]
Figure 7: Pea data: Dendrogram, HCA with the Procrustes Distance.
non-linearities are present in the relationship, this is clearly the most natural
explanation. The results also give a clear indication of a possible outlier. The
assessors 1,2,4,7,9 and 10 are very similar, then there is a gap to the next group
3,5,8 before a new gap separates assessor 6 from the rest. This analysis shows
that the HCA/PD approach is a natural first step in an analysis of sensory panel
data.
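The prediction-error computation described above can be sketched as principal component regression. The data below are synthetic stand-ins for the NIR and sensory matrices (the paper's dimensions are kept: N = 60 samples, 116 wavelengths, 4 sensory attributes):

```python
# Sketch of the prediction-error curve: regress a unit-norm sensory profile Y
# on the first P principal component scores W of the NIR matrix, P = 1..10.
import numpy as np

rng = np.random.default_rng(1)
N, n_wave, p = 60, 116, 4
nir = rng.standard_normal((N, n_wave))      # stand-in NIR absorbances
Y = rng.standard_normal((N, p))             # stand-in sensory profile
Y /= np.linalg.norm(Y)                      # same normalisation as the paper

nir_c = nir - nir.mean(axis=0)
U, S, Vt = np.linalg.svd(nir_c, full_matrices=False)
scores = U * S                              # principal component scores

errors = []
for P in range(1, 11):
    W = scores[:, :P]
    beta, *_ = np.linalg.lstsq(W, Y, rcond=None)
    errors.append(np.linalg.norm(W @ beta - Y) ** 2)

# Adding components can only decrease the residual (nested models).
assert all(e2 <= e1 + 1e-12 for e1, e2 in zip(errors, errors[1:]))
```

With the real data, the curve for each assessor flattens out after 3-4 components, which is the behaviour reported in Figure 8.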
5 Conclusions & Discussion
5.1 Conclusions
The simulations illustrated a number of situations where HCA could be used in
combination with GPA. The single linkage and the centroid linkage seemed to
be better at isolating group structures and outliers. The centroid variant has
two drawbacks. First, it has the “inversions” which are generally considered
inappropriate, and secondly, it requires the computation of a GPA consensus
at each amalgamation step, making it computer-intensive. The single linkage
HCA, using the Procrustes Distance as dissimilarity measure, is therefore the
[Figure 8: prediction-error curves for judges 1-10 against the number of principal components (1-10); Judge 6 lies highest, Judges 8, 5 and 3 below it, and the six similar judges lowest.]
Figure 8: Errors for regression of NIR data (compressed data in p-space) onto the
individual profiles.
approach we recommend for sensory analysts. Through our simulations, we
have demonstrated that two-group situations and outliers can be detected using
this approach. In a real-life experiment (pea data), HCA was demonstrated
as a natural first step in examining panel data. The indications given by the
clustering were confirmed by later analysis, illustrating the usefulness
of the approach we have presented.
5.2 The Researcher's Position
The recommended HCA variant could easily be incorporated into any sensory
scientist's toolbox. Rather than using straightforward GPA, a routine check could
be run to verify that sufficiently common opinions about the products exist.
There is otherwise a potential danger that outliers and subgroups may ruin
the experiment. Routine use of the suggested clustering methods could also help
manufacturers select members for an expert sensory panel. The
tools developed here are intended to be used in an informal way, to gain
insight into the structures of the dataset. Using HCA would be natural at an
exploratory stage, by a person who has a good understanding of the data.
So far, the method has been used on one dataset only, and it would be
necessary to study more sets in order to assess its full value as a tool.
[Figure 9: prediction-error curves against the number of principal components (1-10) for Assessor 6 (highest), the raw average, the full consensus, and the consensus of assessors 1, 2, 4, 7, 9, 10 (lowest).]
Figure 9: Regression between NIR data and consensuses. The errors for the raw
average are plotted for reference (–).
5.3 Further Research
5.3.1 Other distance functions
One of the main criticisms of GPA is the fact that it uses only rigid
transformations (rotation, isotropic scaling and translation) to compensate for
systematic differences between judges. However, there is no reason to believe
that more subtle misunderstandings do not exist. There are methods
that handle such subtleties better than GPA (GCA, Tucker or PARAFAC), but
none of these will give tree-structure representations such as our dendrograms.
However, since we do not actually carry out a GPA (except in the centroid-linkage
case), one can group the objects in more flexible ways by using
dissimilarity measures other than the Procrustes Distance. These could be
constructed to detect similarity with respect to non-rigid transformations, such
as affine transformations, or thin plate splines for shape analysis. It might also
be possible to design dissimilarity measures based on entropy measures (from
information theory). In this case, the dissimilarity between two matrices would
be determined from their joint entropy. In this situation, one does not have to
find an optimal mapping from one profile to another, it suffices to measure the
degree of common information which helps one determine if there is an optimal,
possibly non-linear mapping (Hyvärinen, 1999).
5.3.2 Alternative ways of studying profile data
In this paper, iterative averaging (GPA) and HCA with the Procrustes Distance
were presented. These are only two possible ways of exploring profile data. Another
approach is to study profiles from the perspective of minimum spanning
trees, which is equivalent to single linkage clustering. This technique is used
in botany (Dahl, 1982 and Gauslaa, 1985) and computer networking. It has
recently been employed to connect multiple PCA and PLS models in chemometrics
(unpublished work by Martens, Anderssen & Høy). Minimum spanning trees
generated by the Procrustes Distance could be used to create a map (a graph) to
see how judges relate.
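The equivalence between single linkage and minimum spanning trees can be checked numerically (synthetic distances, not the judge profiles): the sorted MST edge weights coincide with the single-linkage merge heights.

```python
# Sketch: the single-linkage merge heights equal the sorted edge weights of
# the minimum spanning tree of the dissimilarity graph.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(2)
pts = rng.standard_normal((8, 3))            # 8 stand-in "profiles"
D = squareform(pdist(pts))                   # full distance matrix

mst = minimum_spanning_tree(D).toarray()     # 7 nonzero entries = MST edges
mst_weights = np.sort(mst[mst > 0])

merge_heights = np.sort(linkage(pdist(pts), method="single")[:, 2])
assert np.allclose(mst_weights, merge_heights)
```

This is why a Procrustes-distance MST would carry exactly the single-linkage grouping information, laid out as a graph rather than a tree.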
Appendix
The GPA Algorithm
A simplified version of the GPA algorithm is presented. For a full version with
details, see Ten Berge (1977).
(1) Scale all matrices Xi to have unit variance.
(2) Make initial rotations of the Xi : Rotate X2 to match X1 , then X3 to match
1/2 (X1 + X2 ), and so on. Make an initial Z as their average, Z = (1/Q) Σi Xi
(sum over i = 1, . . . , Q, using the rotated Xi ).
(3) Rotate all matrices Xi to match Z.
(4) Recalculate Z as the average of the transformed Xi 's.
(5) Repeat from (3) until convergence.
(6) Scale all matrices to maximum agreement, preserving the total variance
Σi ||Xi ||2F = C.
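A minimal numerical sketch of steps (1)-(5), assuming unit Frobenius norm as the scaling and omitting the final rescaling step (6):

```python
# Simplified GPA: scale, then alternate rotation-to-consensus and
# consensus recomputation until the total residual stops decreasing.
import numpy as np

def rotate_to(X, Z):
    """Orthogonal R minimising ||X R - Z||_F (Procrustes rotation)."""
    U, _, Vt = np.linalg.svd(X.T @ Z)
    return U @ Vt

def gpa(mats, n_iter=50, tol=1e-10):
    X = [M / np.linalg.norm(M) for M in mats]        # step (1)
    Z = sum(X) / len(X)                              # crude initial consensus
    prev = np.inf
    for _ in range(n_iter):
        X = [Xi @ rotate_to(Xi, Z) for Xi in X]      # step (3)
        Z = sum(X) / len(X)                          # step (4)
        fit = sum(np.linalg.norm(Xi - Z) ** 2 for Xi in X)
        if prev - fit < tol:                         # step (5)
            break
        prev = fit
    return Z, X

rng = np.random.default_rng(3)
base = rng.standard_normal((60, 13))
# Orthogonally transformed copies of one profile: GPA should align them.
mats = [base @ np.linalg.qr(rng.standard_normal((13, 13)))[0] for _ in range(4)]
Z, aligned = gpa(mats)
residual = sum(np.linalg.norm(Xi - Z) ** 2 for Xi in aligned)
assert residual < 1e-8
```

For identical-shape inputs the alignment is recovered essentially in one pass; for real panel data the residual left after convergence is what the gi diagnostics measure.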
References
[1] Arnold, G.M., Williams, A.A. The use of generalized Procrustes Techniques
in sensory analysis, In: Statistical Procedures in Food Research. (1985)
[2] Baardseth P., Næs, T., Mielnik, J., Skrede, G., Hølland, S., and Eide, O.
Dairy ingredients effects on sausage sensory properties studied by principal
component analysis, Journal of Food Science, Vol. 57, 4, (1992) 822-828.
[3] ten Berge, J. M. F. Orthogonal Procrustes rotation for two or more matrices.
Psychometrika 42, (1977) 267-276.
[4] Carroll, J.D., Arabie, P. INDCLUS: an individual differences generalization
of the ADCLUS model and the MAPCLUS algorithm, (1983) Psychometrika
48, 157-169.
[5] Cliff, N. Orthogonal rotation to congruence, Psychometrika 31, (1966) 33-42
[6] Dahl, E. Unpublished work: Ordination analysis. Analysis of randomly
selected test areas (Norwegian: Ordinasjonsanalyse. Analyser av tilfeldig
valgte prøveflater), (1982) NLH - Norwegian Agricultural University.
[7] Dijksterhuis, G. Procrustes analysis in sensory research. In Multivariate
Analysis of Data in Sensory Science, Ed. T.Næs and E.Risvik (Elsevier
Science, 1996)
[8] Dryden, I.L., Mardia, K.V. Statistical shape analysis (Wiley, 1998)
[9] Gauslaa, Y. The ecology of Lobarion Pulmonariae and Parmelion Capetarae
in Quercus dominated forests in south-west Norway. Lichenologist. 17 (2),
(1985) 117-140.
[10] Goodall, C. Procrustes methods in the statistical analysis of shape. (with
discussion). Journal of the Royal Statistical Society: Series B 53, (1991)
285-339.
[11] Gordon, A.D. Classification. 2nd ed. Monographs on Statistics and Applied
Probability. Boca Raton, (FL: Chapman & Hall, 1999).
[12] Gordon, A.D. and Vichi, M., Partitions of partitions, Journal of Classification, 15, (1999) 265-285.
[13] Gower, J.C. Statistical methods for comparing different multivariate
analyses of the same data. In Hodson, F.R., Kendall D.G., and Tautu, P.
editors, Mathematics in the Archeological and Historical Sciences, (1983),
138-149, Edinburgh. Edinburgh University Press.
[14] Gower, J.C. Generalized Procrustes Analysis, Psychometrika 40 (1975) 33-51.
[15] Green, B.F. The orthogonal approximation of an oblique structure in factor
analysis. Psychometrika 17, (1952) 429-440.
[16] Gruvaeus, G.T. A general approach to Procrustes pattern rotation.
Psychometrika 35 (1970) 493-505.
[17] Hyvärinen, A. Fast and Robust Fixed-Point Algorithms for Independent
Component Analysis, IEEE Trans. on Neural Networks (1999), 10(3), 626-634.
[18] Kendall D.G. Shape manifold, Procrustean metrics and complex projective
spaces. Bulletin of the London Mathematical Association 16, (1984) 81-121.
[19] Kendall D.G. A survey of the statistical theory of shape. Statistical Science
4 (1989) 87-120.
[20] Kent, J.T., Mardia, K.V. Consistency of Procrustes estimators. Journal of
the Royal Statistical Society: Series B 59, (1997) 281-290.
[21] Krieger A.M., and Green P.E, A generalized Rand-Index method for
consensus clustering of separate partitions of the same data base, Journal
of Classification, 16, 63-89.
[22] Kristof, W. and Wingersy, B. Generalizations of the orthogonal Procrustes
rotation procedure to more than two matrices. Proceedings of the 79th
Annual Convention of the American Psychological Association, 6, (1971) 81-90.
[23] Langron, S.P. and Collins, A.J. Perturbation theory for generalized
Procrustes analysis. Journal of the Royal Statistical Society: Series B 47,
(1985) 277-284.
[24] Le H.-L. and Kendall D.G. The Riemannian Structure of Euclidean shape
spaces: a novel environment for statistics. Annals of Statistics 21, (1993)
1225-1271.
[25] Mardia, K.V., Kent, J.T., Bibby, J.M., Multivariate Analysis, (Academic
Press, 1979)
[26] Martens, H., Anderssen, E., Høy, M. Unpublished work on minimum
spanning trees for PCA and PLS models (2000).
[27] Mosier, C.I. Determining a simple structure when loadings for certain tests
are known, Psychometrika 4, (1939) 149-162.
[28] Næs, T. and Kowalski, B. Predicting sensory profiles from external
instrumental measurements. Food Quality and Preference (1989), (4/5),
135-147.
[29] Risvik E. and Næs T. Multivariate Analysis of Data in Sensory Science
(Elsevier Science, 1996)
[30] Schönemann, P.H. A generalized solution to the orthogonal Procrustes
problem, Psychometrika 31, (1966) 1-10.
[31] Schönemann, P.H. On two sided orthogonal Procrustes problems. Psychometrika 33, (1968) 19-33.
[32] Schönemann, P.H. and Carroll R.M. Fitting one matrix to another under
choice of central dilation and rigid motion. Psychometrika 35, (1970) 245-255.
[33] Sibson, R. Studies in the robustness of multidimensional scaling: Procrustes statistics. Journal of the Royal Statistical Society: Series B 40,
(1978) 234-238.
[34] Sibson, R. Studies in the robustness of multidimensional scaling: perturbation analysis of classic scaling. Journal of the Royal Statistical Society:
Series B 41 (1979) 217-229.
[35] Vichi (1999) One-mode classification of a three way data matrix, Journal of
Classification, (1999) 16, 27-44.
Paper III:
BIMA - Blind Iterative MIMO Algorithm
BIMA: Blind Iterative MIMO Algorithm
accepted for ICASSP 2002
T. Dahl, N. Christophersen, D. Gesbert
Abstract
Identification of the channel matrix is of main concern in wireless MIMO
(Multiple Input Multiple Output) systems. Here, we present an SVD-based approach for blind identification of the main independent parallel
channels. The right and left singular vectors are estimated directly (no
channel matrix estimation is necessary) and continuously updated during
normal transmission. The approach is related to the iterative Power Method
[8], as well as the time reversal approach [4].
1 Introduction
Wireless MIMO systems are capable of delivering large increases in capacity
through utilization of parallel communication channels [5], [6], [12].
For an N (receive) × M (transmit) channel matrix H of rank K0 ≤ min(N, M),
the parallel channels are naturally realized through the Singular Value
Decomposition (SVD) H = USVH , when the channel matrix is known both at
the transmitter and the receiver side. S is the diagonal matrix of singular values
σ1 ≥ σ2 ≥ · · · ≥ σK0 > 0, and
U = [u1 , . . . , uK0 ] ∈ C N,K0   (1)
V = [v1 , . . . , vK0 ] ∈ C M,K0   (2)
are unitary matrices whose columns can be used as receive and transmit
vectors {ui } and {vi }, respectively. One can select a number K (K ≤ K0 ) of
transmit/receive vectors to use for communication. Under stationary conditions,
one may try to determine H experimentally and subsequently perform the SVD
as in the sonar application [10]. For time-varying systems, most studies have
assumed that H is unknown at the transmitter and known - through training
data - at the receiver. However, first, this implies overhead, and second, using
channel knowledge at the receiver only leads to less efficient use of
the MIMO system. The transmit array diversity gain is not realized, and
one is unable to transmit on the top singular vectors, those giving maximum
performance/complexity tradeoff.
In the method presented, two-way transmission of data allows the two parties
to estimate a selected set of left and right singular vectors, without explicit
knowledge of H. Unlike previous methods for blind MIMO estimation
(for example [13] and references therein), which rely on statistics-based
estimation of the channel matrix, our technique estimates the eigen-structure of
the MIMO channel directly, without the need for an actual SVD. The key advantage of
this technique is that it exploits transmission of regular symbol data to acquire
an update of the singular vectors.
2 Methods
Assume a flat fading MIMO channel H exhibiting reciprocity. The uplink and
downlink channels are the same, as in TDD (Time Division Duplex) systems.
[Diagram: two terminals with M and N antennas, linked by the forward channel H and the reverse channel HT .]
Without noise, transmission (s) and receiving (r) for two parties X and Y (for
instance, X=base station, Y =mobile), are given by:
yr = Hxs ,   xr = HT ys   (3)
Taking xs = vi , ys = uj , we have
yr = Hxs = Hvi = σi ui   (4)
xr = HT ys = HT uj = σj vj   (5)
Therefore, transmitting data on a right singular vector yields data lying on the
corresponding left singular vector, and vice versa. Consider now the following Power Method
[8]:
1. xs = initial, ys = initial.
2. yr = Hxs , xr = HT ys .
3. ys = yr /||yr ||,xs = xr /||xr ||.
4. Repeat from 2.
Expressing the initial xs and ys in terms of the basis vectors vj and uj ,
respectively, it can be shown that ys → u1 and xs → v1 . This is a straightforward
generalization of the proof in [8]. Thus, as suggested independently by Bach-Andersen [1],
transmission and retransmission lead to convergence to the first
pair of singular vectors. This algorithm is called the NIPALS algorithm [14] in
chemometrics, but could equally well be termed a Two-way Power Method. It is
closely related to the time reversal approach [4].
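The two-way Power Method is easy to verify numerically. In this sketch a channel with a known singular-value gap is constructed so that convergence is fast (the gap and iteration count are our choices, not the paper's):

```python
# The two-way Power Method: repeated transmit/retransmit through H and H^T
# drives the transmit vectors to the top singular pair (u1, v1).
import numpy as np

rng = np.random.default_rng(4)
N, M = 4, 3
# Build a channel with a known singular-value gap.
Qu = np.linalg.qr(rng.standard_normal((N, N)))[0]
Qv = np.linalg.qr(rng.standard_normal((M, M)))[0]
H = Qu[:, :M] @ np.diag([3.0, 1.5, 0.5]) @ Qv.T

U, s, Vt = np.linalg.svd(H)
u1, v1 = U[:, 0], Vt[0]
assert np.allclose(H @ v1, s[0] * u1)        # the relation in eqs. (4)-(5)

xs = rng.standard_normal(M)                  # initial transmit vector at X
ys = rng.standard_normal(N)                  # initial transmit vector at Y
for _ in range(100):
    yr, xr = H @ xs, H.T @ ys                # step 2
    ys, xs = yr / np.linalg.norm(yr), xr / np.linalg.norm(xr)   # step 3

# Convergence to the first singular pair, up to sign.
assert min(np.linalg.norm(ys - u1), np.linalg.norm(ys + u1)) < 1e-8
assert min(np.linalg.norm(xs - v1), np.linalg.norm(xs + v1)) < 1e-8
```

The convergence rate is governed by (σ2/σ1)², exactly as for the classical power iteration on HᵀH.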
2.1 Updating several singular vectors during transmission
We now describe how several singular vector pairs of H can be estimated and
tracked as part of normal communication. The method works for a noisy,
time-varying channel, and shows robustness, as will be demonstrated by simulations
in the next section.
For simplicity, all matrices are assumed real, but generalization to the
complex case is straightforward. Assume that a block of data is sent during
a short period of time where the channel Hk = H(tk ) can be assumed constant.
Sending and receiving data through a noisy channel is expressed as
Yrk = Hk Xks + ηyk   (6)
Xkr = HkT Ysk + ηxk   (7)
Using estimates Ûk−1 ∈ RN,K and V̂k−1 ∈ RM,K from the previous iteration,
symbols can be sent using Xks = V̂k−1Ckx and Ysk = Ûk−1 Cky . Here, Ckx and Cky
are the K-by-r symbol matrices holding binary (±1) symbols that need to be
transmitted at time tk . The block size (r) is a design parameter, selected with
respect to the slot size and the variation of the channel. In the algorithm we
implemented, several blocks of data can be transmitted within a slot. To keep
the algebra simple, however, we will in the following assume that just one block
of data is sent per slot. The central point in the forthcoming computations is the
following pair of singular vector relations: If Hk = Uk Sk VkT , then
HkT Uk = Vk Sk   (8)
Hk Vk = Uk Sk   (9)
which in particular is to say that multiplication of Hk or HkT by suitable
orthogonal matrices yields matrices with perpendicular column vectors as output
(e.g. Uk Sk or Vk Sk ). Using Ûk−1 ,V̂k−1 rather than the true Uk , Vk (which are
unavailable), the approximate relations become
Yrk = Hk V̂k−1 Ckx + ηyk = Ũk ŜkL Ckx + εky   (10)
Xkr = HkT Ûk−1 Cky + ηxk = Ṽk ŜkR Cky + εkx   (11)
The estimates Ũk and Ṽk are not identical to Ûk−1 and V̂k−1, but are rather
the “approximate SVD counter-pairs” 1 that arise when V̂k−1 and Ûk−1 are
multiplied by Hk and HkT . To save space, we shall assume that the error terms
εky , εkx reflect both "failure of perpendicularity" and "delay error" arising from
using the old transmission vectors Ûk−1 , V̂k−1 , as well as the channel noise.
At the recipient side, the symbols can be re-estimated through multiplication
by the old transmission matrices Ûk−1 ,V̂k−1, followed by taking signs:
Ĉkx = sign(Ûk−1T [Ũk ŜkL Ckx + εky ]) ≈ sign(ŜkL Ckx )   (12)
Ĉky = sign(V̂k−1T [Ṽk ŜkR Cky + εkx ]) ≈ sign(ŜkR Cky )   (13)
ŜkL and ŜkR are defined to be positive, and have no effect on the sign operator.
If the errors εky , εkx are small relative to the singular value estimates ŜkL , ŜkR (in
a Frobenius-norm sense, e.g. ||ŜL ||2F ≫ ||εky ||2F ), the symbol estimates will be
correct. The reason for this is that ŨkT Ûk−1 ≈ I and ṼkT V̂k−1 ≈ I under the
sign operator, or in other words: The approximation errors are not large enough
to change the signs. Once symbol estimates are obtained, they can be used to
find new estimates Ûk and V̂k . Assuming Ĉkx and Ĉky to have full row rank
(implying r ≥ K), consider
Yrk Ĉkx † ≈ Ũk ŜkL   (14)
Xkr Ĉky † ≈ Ṽk ŜkR   (15)
where C† denotes the Moore-Penrose pseudo-inverse of C. The right-hand side
matrices in equations (14),(15) are, ideally, composed of scaled singular
vectors. Normalizing the columns on the left side, one could hope to retrieve
the true singular vectors Uk , Vk . In addition to the presence of channel noise,
however,
1 It is also reasonable to assume ŜL ≈ ŜR , i.e. that the left and right side
singular value estimates are roughly the same.
• the original vectors Ûk−1 , V̂k−1 were incorrect, so orthogonality will not
appear, and,
• through multiple passes, the Power Method will attract all column vectors
towards the first singular vector.
Consequently, an orthogonalization process must be applied to prevent
unwanted convergence of all columns towards the first singular vector pair. A
QR-decomposition (Gram-Schmidt Process) can be used to this end,
Yrk Ĉkx† = Qu Ru   (16)

Xkr Ĉky† = Qv Rv   (17)
taking Ûk = Qu and V̂k = Qv as the new estimates, effectively “filtering out” the scaling effects of ŜR, ŜL. The reason why the estimates Ûk, V̂k are better than
the previous Ûk−1 , V̂k−1, is that they have passed one more time through the
channel H. It can be shown, mathematically as well as experimentally, that the
method converges to the correct vectors after a few iterations [3].
Our method has a connection with Decision-Directed (DD) estimation, with one important difference: One can start with any guess on the singular vectors (and consequently get wrong estimates of the transmitted symbols), but the method will still converge to the correct set of singular vectors². In the case of K = 1 (the maximum diversity algorithm), the method is completely insensitive to symbol decision errors.
Other important issues addressed in our full paper [3] include:
• How the method depends on the convergence properties of the Power
Method.
• How the method relates closely to the Power Iterations for symmetric
matrices used by Golub and Van Loan [8].
A simplified version of the BIMA algorithm is given in Table 1.
3 Simulations & Results
We investigate the performance of the proposed algorithm in an exemplary radio
communication situation with moderately to rapidly time-varying Rayleigh fading channels. We consider here an i.i.d. MIMO model, but more general models can
equally well be used ([2], [7]).
We first plot the Bit Error Rate (BER) performance, then illustrate tracking
of the eigen-modes by the method.
² The symbols will be correct up to a permanent change of sign, e.g. +1 transmitted may become −1 when received. This can be taken care of by using differential coding.
1. Û0 = initial, V̂0 = initial. Set k = 1.
2. Xks = V̂k−1 Ckx , Ysk = Ûk−1 Cky
3. Yrk = Hk Xks , Xkr = HkT Ysk
4. Ĉkx = sign(Ûk−1T Yrk ) , Ĉky = sign(V̂k−1T Xkr ).
5. [Ûk , Ru ] = qr(Yrk Ĉkx † ) , [V̂k , Rv ] = qr(Xkr Ĉky † )
6. Increase k, repeat from 2.
Table 1: Simplified BIMA Algorithm. An important detail is the sorting of the
columns of Yrk , Xkr by norm, prior to performing QR. Some extra rules are needed
to handle changes in the sorting order (to be covered in [3]).
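As an illustration, the loop of Table 1 can be sketched in NumPy for the real-valued (BPSK) case. The stationary toy channel, its singular-value spacing, the block size n and the iteration count are assumptions made for this sketch, not parameters from the paper; the norm-sorting mentioned in the caption is included:

```python
import numpy as np

rng = np.random.default_rng(0)

def sort_by_norm(A):
    """Order columns by decreasing norm, per the Table 1 caption."""
    return A[:, np.argsort(-np.linalg.norm(A, axis=0))]

def bima_iteration(H, U_prev, V_prev, Cx, Cy):
    """One pass of the simplified BIMA loop (steps 2-5 of Table 1)."""
    Yr = H @ (V_prev @ Cx)                 # steps 2-3: transmit blocks through channel
    Xr = H.T @ (U_prev @ Cy)
    Cx_hat = np.sign(U_prev.T @ Yr)        # step 4: symbol decisions with old estimates
    Cy_hat = np.sign(V_prev.T @ Xr)
    # step 5: strip symbols via the pseudo-inverse, sort, re-orthogonalize
    U_new, _ = np.linalg.qr(sort_by_norm(Yr @ np.linalg.pinv(Cx_hat)))
    V_new, _ = np.linalg.qr(sort_by_norm(Xr @ np.linalg.pinv(Cy_hat)))
    return U_new, V_new

# stationary 4x4 toy channel with well-separated singular values (an assumption)
Qu, _ = np.linalg.qr(rng.standard_normal((4, 4)))
Qv, _ = np.linalg.qr(rng.standard_normal((4, 4)))
H = Qu @ np.diag([3.0, 2.0, 1.0, 0.5]) @ Qv.T

K, n = 2, 16                                      # track two eigen-modes
U, _ = np.linalg.qr(rng.standard_normal((4, K)))  # step 1: random start
V, _ = np.linalg.qr(rng.standard_normal((4, K)))
for _ in range(60):                               # step 6: iterate
    Cx = rng.choice([-1.0, 1.0], size=(K, n))     # fresh BPSK symbol blocks
    Cy = rng.choice([-1.0, 1.0], size=(K, n))
    U, V = bima_iteration(H, U, V, Cx, Cy)

# compare with the true top-2 singular vectors; each entry ideally near 1 (up to sign)
print(np.abs(np.diag(U.T @ Qu[:, :K])))
print(np.abs(np.diag(V.T @ Qv[:, :K])))
```

Note that the channel here is simulated; in the real system the multiplications by H and HT are performed by the physical uplink and downlink passes.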
Figure 1 shows the BER for a channel H with 4 TX and 4 RX antennas in a pedestrian-like environment (10 Hz Doppler spread), of which we choose to estimate/track the two best eigen-modes (out of four). The transmission rate
is 22 kBits/s (GSM like) per eigen-channel, and the Downlink/Uplink slotting
is 50 bits/slot, corresponding to 2.3 ms ping-pong period. In fact, much higher
data rates can be used, since the algorithm is mostly sensitive to the number
of Tx/Rx iterations, itself determined by the ping-pong period, and not by how
many bits are in each slot. The plot shows the BER, averaged over the two top
eigen-modes, and during the stationary regime (tracking mode) of the algorithm.
For reference, we compare with the same situation where instead a SVD would
be computed at each slot from a perfectly known channel. The plot shows a 5 dB
difference. A trained estimation algorithm will give intermediate performance.
Figure 2 illustrates tracking of singular values. The quality of such
estimates depends on the quality of estimates of the corresponding singular
vectors. In this case the Doppler spread was increased to 100Hz (vehicular
situation), using a transmission rate of 100 kBits/s and a slot size of 25 bits
corresponding to 0.25 ms. The SNR was 15 dB. The plot shows how well the
two top singular values (and more generally, singular vector pairs) are tracked
despite fast fading.
4 Discussion
The demonstrated scheme used binary transmission (BPSK) and real-valued
matrices. Using complex modulations is straightforward.
The methods discussed assumed reciprocity (uplink is Hk , downlink HkT ).
If this is not the case, Ûk , V̂k will no longer be singular vector estimates, but
rather estimates relating to eigenvectors of certain products of the uplink and
downlink channel matrices. The presented ideas are still valid, but some kind
of phase normalization must be applied. A generalized singular value decomposition (GSVD) approach may also be worth considering. These questions are addressed in the full paper [3].

[Figure 1: Bit Error Rates versus SNR. The lower line is the BER/SNR curve when the channel matrix H is known, the upper when it is unknown and the singular vectors are estimated by our procedure.]

[Figure 2: The BIMA algorithm used to track the two top singular vectors/values on a 4 TX × 4 RX fading channel with a 100 Hz Doppler shift, 100 kbit/s and 25 bits per slot, corresponding to 0.25 ms. The SNR was 15 dB. Two independent channels (K = 2) were used. The heavy lines are the true singular values, the jagged ones are estimates.]
It is also clear that, for the symmetric case, the convergence of the
eigenvectors could be considerably improved. The QR-decomposition effectively
filters out important information held in the received data-blocks. There is most
likely a connection with Krylov Methods for estimating eigenvectors [8].
References
[1] J. Bach Andersen, “Array gain and capacity for known random channels with multiple element arrays at both ends”, IEEE Journal on Selected Areas in Communications, Vol. 18, No. 11, 2000, pp. 2172–2178.

[2] H. Bolcskei, D. Gesbert, A. Paulraj, “On the capacity of OFDM-based multi-antenna systems”, submitted to IEEE Trans. on Communications, Nov. 1999. Shorter version in ICASSP 2000.

[3] T. Dahl, N. Christophersen, D. Gesbert, “A blind iterative MIMO algorithm based on the Power Method” (in preparation).

[4] M. Fink, “Time-reversed acoustics”, Physics Today, Vol. 50, 1997, pp. 34–40.

[5] G.J. Foschini, “Layered space-time architecture for wireless communications in a fading environment”, Bell Labs Technical Journal, Vol. 1, No. 2, 1996, pp. 41–59.

[6] G.J. Foschini, M.J. Gans, “On the limit of wireless communications in a fading environment when using multiple antennas”, Wireless Personal Communications, Vol. 6, No. 3, 1998, pp. 311–335.

[7] D. Gesbert, H. Bolcskei, D. Gore, A. Paulraj, “MIMO channels: Capacity and performance prediction”, submitted to IEEE Trans. on Communications, July 2000. Shorter version in Proceedings of the IEEE Globecom Conference, Nov. 2000.

[8] G. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, 3rd edition, 1996.

[9] B. Hassibi, “An efficient square-root algorithm for BLAST”, ICASSP 2000.

[10] D.B. Kilfoyle, “Spatial Modulation in the Underwater Acoustic Communication Channel”, PhD thesis, MIT, June 2000.

[11] A. Paulraj, C. Papadias, “Space-time processing for wireless communications”, IEEE Signal Processing Magazine, Nov. 1997.

[12] E. Telatar, “Capacity of multi-antenna Gaussian channels”, European Trans. on Telecommunications, Vol. 10, No. 6, 1999, pp. 585–595.

[13] A. Touzni, I. Fijalkow, M.G. Larimore, J.R. Treichler, “A globally convergent approach for blind MIMO adaptive deconvolution”, IEEE Transactions on Signal Processing, Vol. 49, No. 6, 2001, pp. 1166–1178.

[14] H. Wold, “Soft modelling by latent variables: The non-linear iterative partial least squares (NIPALS) approach”, in Perspectives in Probability and Statistics, Papers in Honour of M.S. Bartlett, 1975, pp. 117–142.
Paper IV:
Blind MIMO Estimation based on the Power Method
Blind MIMO Estimation based on the Power Method
T. Dahl, N. Christophersen, D. Gesbert
Abstract
Identification of the channel matrix is of main concern in wireless MIMO (Multiple Input Multiple Output) systems. Here, we present an SVD-based approach for blind identification of the main independent parallel
channels. The right and left singular vectors are estimated directly (no
channel matrix estimation is necessary) and continuously updated during
normal transmission. The approach is related to the iterative Power Method
[9], as well as the “time reversal mirror” approach [4].
1 Introduction
Wireless MIMO systems are capable of delivering large increases in capacity
through utilization of parallel communication channels [5], [6], [7], [13].
Appearing first in a series of information theory articles published by members of Bell Labs, MIMO systems have quickly evolved to become one of the most popular topics among wireless communication researchers. They figure prominently on the list of “hot” technologies that may help resolve the traffic capacity bottlenecks in the forthcoming broadband wireless Internet access networks (UMTS¹ and beyond).
Multiple antennas both at the transmitter and the receiver create a matrix
channel. The key advantage is the possibility of transmitting over several
spatial modes of the channel matrix within the same time-frequency slot, at no
additional power expenditure. In addition, if the channel matrix is known both
at the transmitter (TX) and the receiver (RX), certain spatial modes (singular
modes) can be used to maximize the SNR.
For a N(receive) × M(transmit) channel matrix H of rank K0 ≤ min(N, M),
these modes are naturally realized through the Singular Value Decomposition
(SVD) H = USV∗ . Here, “∗ ” denotes the complex conjugate transpose. S is the
diagonal matrix of singular values σ1 ≥ σ2 ≥ · · · ≥ σK0 > 0, and

U = [u1, . . . , uK0] ∈ C^{N,K0}   (1)

V = [v1, . . . , vK0] ∈ C^{M,K0}   (2)

are matrices with orthonormal columns, which can be used as receive and transmit vectors {ui} and {vi}, respectively. One can select a number K (K ≤ K0) of transmit/receive vectors to use for communication.

¹ Universal Mobile Telephone Services
Under stationary conditions, one may try to determine H experimentally and
subsequently perform the SVD as in the sonar application [11]. For time-varying
systems, a majority of the algorithms have assumed that H is unknown at the
transmitter and known - through training data - at the receiver (e.g. V-BLAST,
[5],[8]).
However, first, this implies overhead, and second, the use of channel knowledge only at the receiver leads to less efficient use of the MIMO system. The transmit array diversity gain is not realized, and one is unable to transmit on the top singular vectors, those giving the best performance/complexity tradeoff. The advantage of a method employing the SVD over a V-BLAST-like (training-based) algorithm is the possibility of performing spatial water-filling, in which an optimal weighting of the eigenmodes is used.
The contributions of this paper are twofold. First, in the method presented, two-way transmission of regular data allows the two parties to estimate a selected set of left and right singular vectors without explicit knowledge of H. Second, unlike previous methods for blind MIMO estimation (for example [14] and references therein), which rely on a statistics-based estimation of the channel matrix, our technique estimates the eigen-structure of the MIMO channel directly, without the need for a repeated block SVD. The key advantage of this technique is that it exploits transmission of regular symbol data to acquire an update of the singular vectors (our first paper [2] on this technique was accepted for ICASSP 2002).
Our approach has a connection with the “time reversal mirror”, developed
by M. Fink [4] in ultrasound imaging: By repeatedly sending a pulse into a
body, recording the reflected signals and re-sending them after normalization
and time-reversal 2 , convergence towards a top eigenvector is reached. This
technique is for example used for detection and destruction of kidney stones,
typically corresponding to a top eigen-mode. This physical process is nothing but
a Power method (see section 2.1) for finding a top eigen-vector. This is analogous
to how our technique works: By sending a signal up and down a channel, there
is a natural convergence towards the top singular mode.
Other authors have also commented on sending-re-sending schemes. Bach
Andersen [1] has observed independently that such a procedure leads to
convergence towards the top singular mode of the channel. Kilfoyle [11], on
the other hand, uses training data to find singular modes in a non-flat fading
(underwater) channel, but also comments that there is important information
in the data vectors sent up and down. To our best knowledge, however, we
are the first to use these ideas in a blind MIMO communication setting and
to incorporate simultaneous estimation of multiple singular modes without any prior estimate of the channel matrix.

² In the frequency domain, this is the same as complex conjugation.
1.1 Organization of the paper
The paper is laid out as follows: Section 2 presents the methodology and algorithm (2.2-2.3), and section 2.4 provides visualizations to enhance the understanding of the algorithm's workings. We then consider details for improving performance (2.5), smoothing of the singular vectors for robustness (2.6), and differential symbol coding (2.7). Section 3 is a simulation section, showing the performance of the method in various communication scenarios, both in the acquisition phase and in tracking of a time-varying channel. Section 4 concludes on the findings and discusses potential improvements and extensions.
Symbol          Meaning
H               channel matrix
{ui}, {vi}      singular vectors
{σi}            singular values
U, S, V         singular vector/value matrices
Û, V̂, Ŝ         estimates of vectors/values
c, cx, cy       symbols
C, Cx, Cy       symbol blocks
Ĉ, Ĉx, Ĉy       symbol block estimates
x, y            transmit/receive vectors
X, Y            transmit/receive matrices
2 Algorithm
2.1 Preliminaries: the Power method
We briefly recapitulate the Power method, which is an iterative numerical method for finding eigenvectors. For a full reference, see [9]. Assume a matrix A ∈ R^{P,P}, symmetric and real. This matrix has an eigen-decomposition,
A = Σ_{j=1}^{P} λj vj vjT   (3)
where {v1 , . . . , vP } are the orthonormal eigenvectors, and the eigenvalues {λj }
are ordered, λ1 ≥ λ2 ≥ · · · ≥ 0. If A is not of full rank (rank(A) = r < P ), then
all the eigenvalues λr+1 , . . . , λP will be zero, and the corresponding eigenvectors
vr+1 , . . . , vP span the null-spaces (row and column) of A. Now, assume a random
vector x(0) in RP . This vector can be decomposed using the eigenvectors {vj },
x(0) = Σ_{k=1}^{P} ck vk   (4)
If this vector is repeatedly pre-multiplied by A,
x(i) = A A · · · A x(0) = A^i x(0)   (i times)
then
x(i) = (Σ_{j=1}^{P} λj^i vj vjT)(Σ_{k=1}^{P} ck vk) = Σ_{j=1}^{P} λj^i cj vj   (5)
will be dominated by the term λ1 c1 v1 when i tends to infinity. If the vector x(i)
is normalized after each iteration, this becomes a method for finding the top
eigenvector v1 . The idea of using matrix powers for finding eigenvectors also
carries over to non-symmetric and non-real matrices. A generalization of the
Power method (sometimes called NIPALS, [15]) can be used for finding singular
vectors, which is what we are interested in.
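For concreteness, the normalized power iterations for a singular vector pair can be sketched as follows (the toy diagonal matrix is an assumption chosen so that the answer is known in advance):

```python
import numpy as np

def top_singular_pair(H, iters=100):
    """NIPALS-style power iterations for the top singular pair (u1, v1) of H."""
    rng = np.random.default_rng(1)
    x = rng.standard_normal(H.shape[1])
    x /= np.linalg.norm(x)
    for _ in range(iters):
        y = H @ x                  # "uplink" multiplication
        y /= np.linalg.norm(y)
        x = H.T @ y                # "downlink" multiplication (real case)
        x /= np.linalg.norm(x)
    return y, x                    # converge to u1, v1 up to a common sign

# toy matrix whose top singular pair is (e1, e1) with sigma1 = 3
H = np.diag([3.0, 2.0, 1.0])
u1, v1 = top_singular_pair(H)
print(np.abs(u1).round(6))         # -> [1. 0. 0.]
print(np.abs(v1).round(6))         # -> [1. 0. 0.]
```

The contraction per iteration is governed by the ratios λ2/λ1 (eigenvector case) or σ2/σ1 (singular vector case), as in equation (5).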
2.2 Estimating multiple singular vector pairs
Consider the following (noiseless) scheme for transmission on a TDD (Time
Division Duplex) channel H exhibiting reciprocity:
y(i) = H x(i−1)   (6)

z(i) = HT w(i)   (7)
Here, x(i−1) is the data vector sent uplink (UL), and w(i) is the data vector sent downlink (DL). We introduce feedback by defining
w(i) := ȳ(i)
x(i) := z̄(i)
In effect, this states that the signal received by one party is returned to the other
party after it has been complex conjugated. But then, (7) becomes
z(i) = HT w(i) = HT ȳ(i)

z(i) = x̄(i) = HT ȳ(i)

and when conjugating the latter equality, we get

x(i) = H∗ y(i)   (8)
The equations (6) and (8) serve as the basis for our elaborations. Usually, a
block of data (a matrix) will be sent, but for now consider the case of sending one
individual vector.
Assume that transmission starts with a random vector x(0), and let H = Σ_{i=1}^{min(N,M)} σi ui vi∗ be a full SVD of H including the singular vectors spanning the null-spaces, which correspond to singular values σi = 0. Using this basis, x(0) can be decomposed as x(0) = Σ_{j=1}^{min(N,M)} αj vj for some set {αj} of constants. Now,

y(1) = Hx(0) = (Σi σi ui vi∗)(Σj αj vj) = Σi σi αi ui   (9)

x(1) = H∗ y(1) = (Σj σj uj vj∗)∗(Σi σi αi ui) = Σi σi² αi vi   (10)

Continuing this way, one arrives at the following recursion, for i ≥ 1:

y(i) = Σk σk^(2i−1) αk uk   (11)

x(i) = Σk σk^(2i) αk vk   (12)
Clearly, y(i) will be dominated by u1 , and x(i) by v1 as i → ∞. If, after each
iteration, normalization is applied, e.g.
x(i) := x(i)/||x(i)||2   (13)

y(i) := y(i)/||y(i)||2   (14)
then x(i) → v1 and y(i) → u1 as i → ∞. Now, assume that some estimates û1 and v̂1 are known from this estimation process. The following iteration is then used to estimate the second pair of singular vectors, u2, v2:
y2(i) = (I − û1 û1∗) H x2(i−1)   (15)

x2(i) = (I − v̂1 v̂1∗) H∗ y2(i)   (16)

Normalization of the vectors y2(i), x2(i) must be included in the same way as in (13) and (14). This technique effectively removes the contribution of the first singular vector pair from the sums. Consequently, x2(i) and y2(i) will now converge towards the second pair of singular vectors u2 and v2, provided that the estimates of the first singular vectors are sufficiently good. It is easily seen that the orthogonalization is a Gram-Schmidt process, and that one could expect to find all the singular vector pairs by keeping successive estimates perpendicular to each other and of unit length: To estimate the r'th singular vector pair, one uses
yr(i) = (I − Σ_{k=1}^{r−1} ûk ûk∗) H xr(i−1)   (17)

xr(i) = (I − Σ_{k=1}^{r−1} v̂k v̂k∗) H∗ yr(i)   (18)
always with a subsequent normalization. In practice, there is no need to wait
with estimating the second (or third or fourth) pair until the first one is correctly
estimated. The following algorithm will estimate a full set of singular vectors
(held as columns of Û, V̂):
1. X0 = random, i = 1
2. Yi = HXi−1
3. QR = Yi , Yi := Û = Q
4. Xi = H∗ Yi
5. QR = Xi , Xi := V̂ = Q
6. Increase i, repeat from 2. until convergence
Here, Xi is the Uplink (UL) data, and Yi is the Downlink (DL) data. Note
that QR = Z denotes the decomposition of a matrix Z into one orthogonal
matrix Q and one matrix R which is upper triangular (see e.g. [9] for details
on the QR-decomposition). The matrices Xi , Yi converge to the matrices of
singular vectors, Xi → V, Yi → U. Note that one part of the job (recording,
orthogonalization, conjugation and re-sending) could be carried out by one party
and the corresponding (but independent) part by the other one. This gives
a functional framework where only one set of singular vectors {ui } or {vi } is
known by each party, which is sufficient for operating the channels.
This algorithm is a realization of equations (17),(18), since the QR
decomposition is nothing but a Gram-Schmidt process, assuming the QR-algorithm keeps the diagonal elements of the R matrix real and positive. Our
algorithm for finding multiple singular vector pairs is a direct extension of
the Golub & Van Loan (1996) QR-method for finding multiple eigenvectors. It
extends it by finding singular vectors of non-symmetric matrices rather than
eigenvectors of symmetric matrices.
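A minimal NumPy sketch of steps 1.-6. above, for the real-valued case where H∗ = HT. The simulated channel and its singular values are assumptions for the demonstration; in the actual system the multiplications by H and HT are carried out by the physical up- and downlink passes:

```python
import numpy as np

rng = np.random.default_rng(0)

# simulated 4x4 channel with distinct singular values (an assumption for the demo)
Qu, _ = np.linalg.qr(rng.standard_normal((4, 4)))
Qv, _ = np.linalg.qr(rng.standard_normal((4, 4)))
H = Qu @ np.diag([3.0, 2.0, 1.0, 0.5]) @ Qv.T

X = rng.standard_normal((4, 4))        # 1. X0 = random
for _ in range(100):                   # 6. repeat until convergence
    Y = H @ X                          # 2. uplink pass
    Y, _ = np.linalg.qr(Y)             # 3. QR: Y becomes the estimate U-hat
    X = H.T @ Y                        # 4. downlink pass (real case: H* = H^T)
    X, _ = np.linalg.qr(X)             # 5. QR: X becomes the estimate V-hat

# columns should align with the true singular vectors, up to sign
print(np.abs(np.diag(Y.T @ Qu)))
print(np.abs(np.diag(X.T @ Qv)))
```

Because the singular values are distinct, the QR step orders the columns so that Y → U and X → V up to per-column signs.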
2.3 Transmitting symbol data while estimating the
SVD
Using the algorithm above, one can estimate a set of singular vectors by
transmitting blocks, Xi , Yi , and performing successive QR-decompositions. We
select symbols cj ∈ C from a modulation alphabet. If the singular vector pairs {ui}, {vi} are known, one can use the transmit vectors

xs = Σj cxj vj   (19)

ys = Σj cyj uj   (20)
Here cxj and cyj , both complex numbers, correspond to data symbols to be
transmitted uplink and downlink. The singular vector pairs (u1 , v1 ), (u2 , v2 ), . . .
are known to be the vectors giving the highest SNR (signal-to-noise ratio), thus
maximizing the chance of correct symbol reconstruction. When the vectors are
transmitted, they will be received as
yr = Hxs = H Σj cxj vj = Σj σj cxj uj   (21)

xr = H∗ ys = H∗ Σj cyj uj = Σj σj cyj vj   (22)
The symbols are then decoded using the corresponding singular vectors:

ĉxj = uj∗ yr = σj cxj   (23)

ĉyj = vj∗ xr = σj cyj   (24)
Using a set of rules, the (scaled) symbols cxj and cyj can be decided. In practice,
of course, the true singular vectors are unknown, and must be replaced by the
best available estimates, leading to new approximate variants of the equations
above (û replaces u, etc.). Let Cxi, Cyi be symbol blocks (uplink and downlink, respectively), both in C^{K,n} (n ≥ K), containing n vectors of complex symbols for K independent channels. The first column of a block corresponds to the symbols sent at time t, the next at time (t + 1), and so on.
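A small numerical check of equations (19), (21) and (23), assuming real-valued BPSK symbols and perfectly known singular vectors (the channel values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# a fixed channel with known SVD (illustrative values)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
s = np.array([3.0, 2.0, 1.0, 0.5])
H = U @ np.diag(s) @ V.T

K = 2                                  # use the two strongest modes
cx = np.array([1.0, -1.0])             # uplink symbols c_x1, c_x2
xs = V[:, :K] @ cx                     # eq (19): superpose the modes
yr = H @ xs                            # eq (21): received uplink vector
c_hat = U[:, :K].T @ yr                # eq (23): per-mode decode, = sigma_j * c_xj
print(c_hat)                           # -> [ 3. -2.]
print(np.sign(c_hat))                  # -> [ 1. -1.], the transmitted symbols
```

Each mode scales its symbol by its singular value, so the strongest modes give the highest SNR for decision.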
BIMA - Blind Iterative MIMO Algorithm
1. X0 = random, Û0 = random, V̂0 = random, i = 1.
2. Yi = HXi−1Cxi−1
3. Decide Ĉxi−1 from Yi∗Ûi−1
4. QR = Yi Ĉx(i−1)†, Yi := Ûi = Q
5. Xi = H∗ Yi Cyi
6. Decide Ĉyi from Xi∗ V̂i−1
7. QR = Xi Ĉyi†, Xi := V̂i = Q
8. Repeat from 2. until convergence
If the estimation steps (3. and 6.) are correct, this algorithm is completely
equivalent to the first algorithm (assuming Ĉxi Cxi † = I, Ĉyi Cyi † = I, since n ≥ K).
The important change is that by multiplying in the symbols Cxi , Cyi in
the transmit data blocks, each single column of the received matrices has
a contribution from each of the singular vectors, as suggested by equations
(19) and (20). In the original algorithm, each column, and therefore each
transmission, contained one singular vector only. Note that this algorithm too
can be given an operational form, with two parties performing independent
symbol and singular vector estimation.
2.3.1 Convergence of singular vectors and symbol estimates
It is not immediately clear that the algorithm will converge to the correct set of
singular vectors and/or symbols. In fact, even if the singular vectors are correctly
estimated, the symbol block estimates Ĉxi , Ĉyi will have their rows biased by a
multiplication by a complex number of unit norm, due to an ambiguity in the
estimates Û, V̂ of singular vectors: If V and U were estimated together, at the
same base station, they could be selected so that their “diagonalization abilities”
were real and positive, that is
ui∗ H vj = dij with dij ∈ R   (25)
with dij = 0 for i ≠ j, and dii ≥ 0. Without the possibility of rotating the singular vectors (multiplying by a complex number of unit norm), the latter positivity relation will generally fail in the functional setting of the BIMA algorithm, and there will be a systematic rotation of the symbols. Symbol rotation problems are well known in GSM-like systems, however, and are resolved using differential coding, adding two extra steps to our algorithm:
0. Perform differential encoding of the symbol matrices Cxi , Cyi .
9. Perform differential decoding of the estimated symbol matrices, Ĉxi , Ĉyi
A precise description of this encoding/decoding is given in section 2.6.
Note that a systematic symbol rotation, corrected by pre- and post-processing the data, does not affect the convergence of the singular vectors. The possible rotation of a symbol cannot be distinguished from the multiplication of some singular vector estimate by a complex number d of unit norm. However, a singular vector multiplied by such a number is no less a singular vector.³

An extensive discussion is given next to aid the reader's geometric understanding of the algorithm, including the need for differential coding. It shows how the algorithm can send symbols across several parallel channels and, even if H changes with time, track the singular vector pairs as part of normal operation.
³ The same holds true for any estimate of a singular vector.
2.4 Discussion: Behavior and Convergence
The purpose of this section is to illustrate graphically how the singular vector
estimates converge to the correct vectors, and how symbol transmission is done
while this is happening. For simplicity, consider a real-valued H ∈ R^{2,2}, i.e. a two-antenna RX/TX scenario. The singular vectors will then also be real-valued in R², and the symbol alphabet is limited to two symbols, ±1 (BPSK). The ambiguity (25) will be a ±1-ambiguity.
2.4.1 Convergence of a single singular vector
Figure 1 illustrates the convergence of the Power method when only one singular vector pair is estimated. In panel (a), the true right singular vectors v1, v2 are plotted together with a first guess x1,norm(1). When x1,norm(1) is sent uplink (multiplied by H) it appears as the vector y1(1) on the contour of the ellipse with half-axes σ1u1 and σ2u2, as seen in panel (b). If all vectors in this figure are normalized, they naturally appear on the unit circle, see panel (c). The normalized version of y1(1), y1,norm(1), is then sent downlink (multiplied by HT). It appears as x1(2) on the contour of the ellipse with half-axes σ1v1 and σ2v2 in panel (d). Normalizing it to become x1,norm(2) completes the cycle. The new estimate of v1 is seen to be a lot closer to the true singular vector than the initial estimate. While the estimates of v1 improve, so do the corresponding estimates of u1 in the right-hand panels of the figure.
2.4.2 Convergence could be towards v1 or −v1
From equations (9),(10) it is seen that the convergence of a singular vector can go two ways. If α1 > 0, the convergence (after due normalization) will be towards v1; if α1 < 0, it will be towards −v1, corresponding to the ambiguity in (25). In the rare case α1 = 0, convergence will be towards one of the subsequent singular vectors with corresponding αi ≠ 0, but in practice, round-off errors and noise will prevent this from happening.
2.4.3 Convergence while symbol data is transmitted
This section describes symbol decision. Note that the convergence of the Power method (without symbol modulation) is “continuous”, in the sense that no two successive vector estimates are negatively correlated. If a negative correlation between two successive estimates is detected, this must be because the opposite party changed the sign of the vector before transmitting it. In the K = 1 case (one channel only), and with a block size of n = 1 vectors per block, this is how the BIMA algorithm works: Assume a new vector y1(i) is received by one party. The previously received vector y1(i−1) is, when normalized, also the current estimate of the first singular vector, û1(i−1), which was used for decoding symbols in steps 3. and 6. of our algorithm, or alternatively, in equation (23).

[Figure 1: Estimation of the top singular vectors in a 2 × 2 antenna MIMO system.]

[Figure 2: Convergence towards v1 or −v1. Panel (a) shows a situation where the initial guess converges towards v1, panel (b) shows the situation where it converges towards −v1.]
Now we can use the sign operator for decision,
ĉx(i) = sign(y1(i)T û1(i−1)) = sign(y1(i)T y1(i−1))   (26)
This is where differential coding comes in. It is impossible to say whether a
(correctly) “received −1” corresponds to a “sent −1”. Rather than trying to detect
“+1” and “-1” symbols, shifts of symbols are considered. If two successive symbols
are equal, a “+1” is decided on, otherwise it is a “-1”. If a “-1” is decided, then
the received vector y(i) has its sign changed y(i) := −y(i) to keep the series {y(i) }
“up-to-date”. Figure 3 illustrates two different scenarios: one where the second iteration (normalized) vector x1,norm(2) is positively correlated with the other two iteration-step vectors, and one where it is negatively correlated. In the former case (a), and following differential coding, the symbol series would be {1, 1}; in the latter case (b), it would be {−1, −1}. If H varies continuously with time, the singular modes will be tracked as part of normal data transmission.
In practice, all the symbols are differentially encoded before they are
transmitted, and decoded when received. This corresponds to steps 0. and 9. in the algorithm.
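The differential encoding/decoding idea can be sketched for BPSK as follows. This is a minimal illustration with a leading reference symbol, not necessarily the exact scheme of section 2.6; the point is that decoding is unaffected by a global sign flip of the received sequence:

```python
def diff_encode(bits):
    """Differential BPSK: a -1 flips the transmitted sign, a +1 keeps it."""
    out = [1]                      # leading reference symbol (an assumption)
    for b in bits:
        out.append(out[-1] * b)    # encode each bit as a sign change (or not)
    return out

def diff_decode(symbols):
    """Decode by comparing successive signs; products are sign-flip invariant."""
    return [symbols[k] * symbols[k - 1] for k in range(1, len(symbols))]

bits = [1, -1, -1, 1, -1]
tx = diff_encode(bits)
print(diff_decode(tx))                  # -> [1, -1, -1, 1, -1]
print(diff_decode([-s for s in tx]))    # same, despite a permanent sign flip
```

Since only sign *changes* carry information, the ±1-ambiguity of the singular vector estimates becomes harmless.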
2.4.4 Using multiple singular modes for transmission
The basis for using several independent modes of transmission is equations (19),(20),(23) and (24). Figures 4 and 5 illustrate how the algorithm works with two independent channels.

[Figure 3: Changing the sign of the vector estimate x1,norm(i) only affects the convergence in one aspect: It changes the convergence from going towards v1 to going towards −v1. The speed of the convergence is not altered by arbitrarily changing the sign.]

Figure 4 shows how symbols and singular vectors are combined to become one transmit vector. The (true) singular vectors are parallel with the elementary axes (e1, e2). The figure shows what the transmit vectors might look like in the case where the singular vectors are perfectly known (panel (a)), as well as what they might look like if an initial and arbitrary guess is made on those vectors (panel (b)). The combined symbol vectors are the ones going from the center onto the circle of radius √2. There are four possible vector combinations in each case.
The next figure 5 illustrates the “improvement of the estimates”. Even if the
initial vectors are themselves incorrect, our procedure of transmission, decoding
by combination of symbol vectors, normalization and retransmission leads to
gradually better estimates. Let us consider this figure in some more detail:
(1)
(1)
In panel (a), there are two (orthogonal) vectors v1 and v2 which are initial
guesses on the right singular vectors. Assume, as in the previous figure, that
the true right singular vectors are parallel with the elementary (e1 , e2 ) axes.
(1)
(1)
(1)
(1)
From the initial guesses v1 and v2 , two symbol combinations, v1 + v2 and
(1)
(1)
v1 − v2 are formed. These are transmitted (that is, multiplied with H),
and the results are visualized in panel (b). In panel (c), the "received symbol
(1)
(1)
(1)
(1)
vectors", H[v1 + v2 (1)] and H[v1 − v2 ] are added to become 2Hv1 . This can
be done provided that the "recipient party" is able to correctly guess the symbol
combinations that were sent. If successful, it is also possible to combine the two
112
PAPER IV: BLIND MIMO ESTIMATION BASED ON THE POWER METHOD
(a)
(b)
−v(1)+v(1)
−v1+v2
1
v
2
v +v
1
2
v
2
2
v(1)
2
v(1)+v(1)
1
v
v
1
2
1
v(1)
1
v −v
1
2
−v(1)−v(1)
1
−v1−v2
2
v(1)
−v(1)
1
2
Figure 4: When more than one singular mode is used to transmit and receive data,
superpositions are used. If H ∈ R2,2 and two singular modes are to be used, symbol
combinations ±v1 ± v2 are sent. Ideally, the symbols will be made from the true singular
(1)
vectors v1 and v2 (panel (a)) but in practice, we have to make initial guesses v1 and
(1)
v2 . The symbols made from these vectors can be seen in panel (b)
113
2 ALGORITHM
(1)
(1)
vectors to get 2Hv2 . In (d) the normalized version of 2Hv1 is taken as the new
(1)
(1)
(1)
u1 := 2Hv1 /||2Hv1 ||. Note also, that if v1 is systematically mistaken for −v1 ,
then the combination of the two vectors would amount to −2Hv21 rather than
2Hv21. This is not a problem, because
• both are equally close to a singular vector, and
• with respect to symbol decoding, the sign ambiguity is taken care of by
differential coding.
The same procedure is carried out for $u_2^{(1)}$, but any component along the new $u_1^{(1)}$ is removed from $u_2^{(1)}$, which is part of the Gram-Schmidt process. Also in panel (d), two new symbol combinations are made from $u_1^{(1)}$ and $u_2^{(1)}$, and sent downlink, which is to say that the symbols are multiplied by $H^T$. This can be seen in panel (e). Again, the received vectors are combined to get $2H^T u_1^{(1)}$ and $2H^T u_2^{(1)}$. A new Gram-Schmidt step is carried out. The result is new estimates of the right-side singular vectors, $v_1^{(2)}$ and $v_2^{(2)}$. Clearly, these are closer to the elementary axes (the $x$, $y$ vectors) than $v_1^{(1)}$ and $v_2^{(1)}$ were. Notice, finally, that in practice only one symbol vector can be sent at a time ($v_1^{(1)} + v_2^{(1)}$ and $v_1^{(1)} - v_2^{(1)}$ must be sent with a short delay). Provided that $H$ is constant or varies little during this period, the derivations are valid.
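The full iteration above can be condensed into a short numerical sketch. This is an idealized, noise-free illustration (a hypothetical fixed 2-by-2 real channel, perfectly decoded symbol combinations), not the production algorithm:

```python
import numpy as np

# Hypothetical 2x2 channel with well-separated singular values.
H = np.array([[2.0, 0.3],
              [0.1, 1.0]])

# Initial orthonormal guesses of the right singular vectors.
rng = np.random.default_rng(0)
V1, _ = np.linalg.qr(rng.standard_normal((2, 2)))

for _ in range(40):
    # Uplink: transmit the symbol combinations v1 + v2 and v1 - v2.
    r_plus = H @ (V1[:, 0] + V1[:, 1])
    r_minus = H @ (V1[:, 0] - V1[:, 1])
    # Assuming the symbol combinations are decoded correctly, the receiver
    # recombines them into scaled estimates H v1 and H v2 ...
    U1 = np.column_stack([(r_plus + r_minus) / 2, (r_plus - r_minus) / 2])
    # ... then normalizes and removes dependencies (Gram-Schmidt via QR).
    U1, _ = np.linalg.qr(U1)
    # Downlink: the same scheme through H^T updates the right-side estimates.
    t_plus = H.T @ (U1[:, 0] + U1[:, 1])
    t_minus = H.T @ (U1[:, 0] - U1[:, 1])
    V1, _ = np.linalg.qr(np.column_stack([(t_plus + t_minus) / 2,
                                          (t_plus - t_minus) / 2]))

# Up to signs, the columns of V1 align with the true right singular vectors.
_, _, Vt = np.linalg.svd(H)
print(np.abs(V1.T @ Vt.T))   # close to the 2x2 identity
```

Each pass of the loop is one ping-pong cycle; the QR step plays the role of the normalization and Gram-Schmidt operations described above.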
2.5 Considerations and Details
2.5.1 Sorting columns prior to QR
The BIMA algorithm is a (pure) power method for the first columns of $\hat{U}$ and $\hat{V}$. However, since all transmitted vectors are unit vectors, starting from random initial guesses, it might be a good idea to give "the best candidate a head start". When the data matrices (blocks) are received and multiplied by the pseudo-inverse of the estimated symbols to become $X_i \hat{C}_y^{i\dagger}$ and $Y_i \hat{C}_x^{(i-1)\dagger}$, the column vectors of these matrices would ideally be scaled singular vectors of the channel matrix:

$$X_i \hat{C}_y^{i\dagger} = H^* \hat{U}^i C_y^i \hat{C}_y^{i\dagger} = VS \qquad (27)$$
$$Y_i \hat{C}_x^{(i-1)\dagger} = H \hat{V}^{(i-1)} C_x^{(i-1)} \hat{C}_x^{(i-1)\dagger} = US \qquad (28)$$

under perfect conditions, $\hat{U}^i = U$, $\hat{V}^{(i-1)} = V$, $\hat{C}_x^i = C_x^i$, $\hat{C}_y^i = C_y^i$. Under imperfect conditions, an error term could be added to each of the right-hand sides, or the equalities replaced by approximations ($\approx$). This serves as the basis of our method, together with the QR decomposition removing dependencies between the columns. When communication starts, there is no way of knowing which of the column vectors in the transmit matrix constituents $\hat{U}^{(i-1)}$ and $\hat{V}^{(i-1)}$ correspond to a first singular vector. Thus, when performing the QR decomposition, it is best to let those columns of $X_i \hat{C}_y^{i\dagger}$ and $Y_i \hat{C}_x^{(i-1)\dagger}$ which have the maximum norm be the first vectors in the Gram-Schmidt process. These column vectors will be closest to the first singular vectors, etc.
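As a sketch, this norm-based column ordering before the QR step could look as follows (an illustrative helper, not taken from the paper):

```python
import numpy as np

def sorted_qr(B):
    """QR (Gram-Schmidt) after ordering the columns of B by decreasing
    norm, so the strongest singular-vector candidate leads the process."""
    order = np.argsort(-np.linalg.norm(B, axis=0))
    Q, R = np.linalg.qr(B[:, order])
    return Q, order

# Columns play the role of noisy scaled singular vectors, in arbitrary order.
B = np.array([[0.1, 3.0],
              [1.0, 0.2]])
Q, order = sorted_qr(B)
print(order)                              # [1 0]: larger-norm column first
print(np.allclose(Q.T @ Q, np.eye(2)))    # True: Q is orthonormal
```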
Figure 5: A scheme consisting of combining vectors with fixed norm (a), vector transmission and observation on the opposite side (b), combination of vectors to form new scaled singular vector estimates (c), normalization (d), and retransmission, combination and normalization in the opposite direction, panels (e), (f) and (a) again.
2.5.2 Data sub-blocks
The singular vector estimates are obtained by combining information from a
sub-block of data. The original BIMA algorithm implies the use of a block as a whole (steps 2, 5), but in practice, we do not use all the data in a block to estimate
the singular vectors. Instead, we use a “moving sub-block” of the receive data
vectors for this estimation. The reasons for this are:
• By the end of the slot, the information contained in the first samples is
outdated, and does not necessarily contribute in a meaningful way.
• By using a "moving sub-block" for the estimation, we can track the
changing recipient vectors.
The optimal vectors for decoding the channels will be contained in the "recently
estimated" matrix $\hat{V}$ (or $\hat{U}$) (steps 4, 7). The "encoding vectors" in $\hat{U}$ (or $\hat{V}$) will degrade from good estimates to worse ones the further from the slot start one is. This is natural, since no feedback is provided during the sending period. Nonetheless, it is important to track these vectors for symbol decisions, even though they drift away from being singular vectors.
2.5.3 Color Coding
In general, the singular values {σi }, estimated by both parties, can be used
for ordering of the independent channels. However, in the presence of noise, it
could happen that the ordering on one side (X) would be different from the other
side (Y), particularly if two singular values are close in magnitude. This could lead to data being sent "to the wrong place", e.g. a bit-stream transmitted on the top singular vector being received on the second singular vector. We therefore assume that color coding or some other matching technique
is used, so that the different data streams can be recognized by the recipient.
An alternative to color coding is to use a blind technique that can overcome the
problem of erroneous ordering of singular values in the presence of noise. Such
a method is described in the appendix.
2.6 Smoothing
Both singular value and singular vector estimates are subject to variation because of
• the fading of $H$,
• the noise in the data,
• the fact that from transmission to transmission, the "sub-block" (the data from which vectors/values are estimated) changes, removing one observation from the set and including a new one.
The first we have to live with; the influence of the other two we can reduce. Increasing the size of the sub-block will indirectly give estimates of all vectors and values with less variation, simply because the estimates are averages derived from a larger data set. By using more data vectors, the effect of including/removing a single vector is decreased, and, according to standard theory, the influence of noise is reduced. The drawback is that an increased sub-block size will lead to less accurate estimates for time-varying $H$, since older data is used.
To reduce the variance of the estimation process in BIMA, we focus on
averaging singular vectors. Taking an average over the singular vector estimates
for a certain time-span is the simplest way of reducing the influence of noise.
Let Û(tk ) be the matrix of singular vector estimates at time (or tick) tk . Given
a history span of Q ticks, the problem is to find an average of these matrices.
However, this average must also be orthogonal, and it is not given that an
average over a set of orthogonal matrices will be orthogonal itself. Thus, we
seek an orthogonal average $\tilde{U}$ that solves the problem

$$\min_{\tilde{U}} \sum_{l=0}^{Q-1} \|\hat{U}(t_{k-l}) - \tilde{U}\|^2 \qquad (29)$$

with the requirement that $\tilde{U}$ should be orthogonal, $\tilde{U}^* \tilde{U} = I$. If $M = \sum_{l=0}^{Q-1} \hat{U}(t_{k-l})$, with the SVD $M = QPR^*$, then $\tilde{U} = QR^*$ solves the problem (see Appendix for a proof).
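A direct numerical sketch of this orthogonal averaging (illustrative names; 2-by-2 rotation matrices stand in for the estimates $\hat{U}(t_k)$):

```python
import numpy as np

def orthogonal_average(U_hats):
    """Orthogonal 'average' of orthonormal matrices: sum them, take the
    SVD M = Q P R^*, and return Q R^* (the orthogonal matrix nearest to
    M in the Frobenius norm)."""
    M = np.sum(U_hats, axis=0)
    Q, _, Rt = np.linalg.svd(M)
    return Q @ Rt

def rot(theta):
    """2x2 rotation matrix, standing in for an estimate U_hat(t_k)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# Three estimates scattered around the rotation by 0.50 rad.
U_tilde = orthogonal_average([rot(0.50), rot(0.52), rot(0.48)])
print(np.allclose(U_tilde.T @ U_tilde, np.eye(2)))   # True: stays orthogonal
print(np.allclose(U_tilde, rot(0.50)))               # True: the natural average
```

Note that a naive entrywise mean of orthogonal matrices is generally not orthogonal; the SVD-based projection restores that property.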
Another way of reducing the effect of inclusion/removal of data produced by
the moving sub-blocks, is to use weights. Rather than labeling data vectors
as "old" or "new" (in or outside the sub-block), weights can be included in the
averaging process. An intuitive choice for a weighting function is a sigmoid function,

$$s(t_i) = \frac{f(t - t_i)}{1 + f(t - t_i)}, \qquad (30)$$

where $f(t)$ is some strictly positive, strictly increasing function, say $f(t) = e^{(t-k)/a}$. The weighting approach is not pursued here.
2.7 Complex differential coding
In the simulation section, we will assume QPSK symbol coding. The decision
rule is based on complex signs, a direct extension of the decision used in the
BPSK case. We now allow our symbols to take on the values ±1 ± i. To
compensate for the ambiguity (25), we use the following technique. Let $\{c_k\}$ denote a sequence of received symbols for any singular vector mode, and let $\{c'_k\}$ be the resulting symbol series output by our differential decoding technique. The following rules are used:
• If a symbol is the same as the previous one ($c_k = c_{k-1}$), then $c'_k = 1 + i$ is decoded.
• If a symbol is the opposite of the previous one ($c_k = -c_{k-1}$), then $c'_k = -1 - i$ is decoded.
• If a symbol is a positive $90°$ rotation of the previous one ($c_k = c_{k-1} e^{i\pi/2}$), then $c'_k = -1 + i$ is decoded.
• If a symbol is a negative $90°$ rotation of the previous one ($c_k = c_{k-1} e^{-i\pi/2}$), then $c'_k = 1 - i$ is decoded.
It is easy to make algorithms based on the rotational properties of unit norm
complex numbers, both for encoding and for decoding of these sequences. Note
also that sequences coded this way are quite robust to sudden alterations in the
rotational displacement. Only one symbol will be lost, as opposed to having a
permanent symbol error. A major drawback of this technique is that it has a steep loss function in the presence of noise.
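These four rules amount to decoding each symbol from the ratio $c_k / c_{k-1}$, since $(1+i)\,r$ reproduces the table above for $r \in \{1, -1, i, -i\}$. A compact, hypothetical encoder/decoder pair:

```python
import numpy as np

def diff_encode(data, c0=1 + 1j):
    """Differential QPSK encoder: each transmitted symbol is the previous
    one rotated by data_symbol / (1+1j), so only *changes* carry data."""
    tx = [c0]
    for d in data:                      # d in {1+1j, -1+1j, -1-1j, 1-1j}
        tx.append(tx[-1] * d / (1 + 1j))
    return np.array(tx)

def diff_decode(rx):
    """Decode from the ratio of consecutive symbols; a constant sign or
    phase ambiguity on the whole sequence cancels out."""
    r = rx[1:] / rx[:-1]                # ideally in {1, -1, 1j, -1j}
    return (1 + 1j) * np.round(r)       # snap to the nearest rotation

data = np.array([1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j, 1 + 1j])
tx = diff_encode(data)
print(np.allclose(diff_decode(tx), data))    # True
print(np.allclose(diff_decode(-tx), data))   # True: sign flip is harmless
```

The last line illustrates the robustness claimed above: flipping the sign of every transmitted symbol leaves the decoded data unchanged.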
3 Simulations
This section demonstrates BIMA in a simulated TDD MIMO environment
(Rayleigh-fading channel matrix H), using QPSK coding. Summing up on the
previous sections, the following parameters must be set:
$M, N$: number of transmit/receive elements
$K$: number of modes
$n$: slot size (number of bits in a UL/DL slot)
$n_{\hat{U},\hat{V}}$: number of received vectors to use for estimation of $\hat{U}$, $\hat{V}$
$Q$: number of vectors/bits for smoothing the singular vectors

In addition, there are parameters to set when making a simulated fading $H$:

$D$: the Doppler shift [Hz]
$f$: data rate [bits/second]
3.1 Convergence of Singular Vectors
Figure 6 shows the convergence of the estimated singular vectors in a constant MIMO channel. The initial singular vector estimates are random, and convergence towards the correct vectors is demonstrated. Two singular modes
convergence towards the correct vectors is demonstrated. Two singular modes
(K = 2) are estimated for a 3 by 3 MIMO Channel, with singular values
Figure 6: Convergence of the singular vectors. The errors $\epsilon^2 = \frac{1}{K}\sum_{i=1}^{K} \|u_i - \hat{u}_i\|^2$ are plotted (as $10 \times \log_{10}(\epsilon^2)$) as a function of iteration step (slot index), at various SNRs. Here $K = 2$ is the number of singular modes (channels) used in a $3TX \times 3RX$ system. Panel (a) shows convergence of the left singular vectors, panel (b) convergence of the right singular vectors.
$\sigma_1 = 4$, $\sigma_2 = 2$, $\sigma_3 = 1$. The parameter $n_{\hat{U},\hat{V}}$ is set equal to 400. Various levels
of channel noise are tested. The errors of both singular modes are averaged,
and convergence for the left and right side vectors plotted. This demonstrates
that the BIMA algorithm converges to the correct set of singular vectors in the
presence of channel noise.
3.2 Simulation of Fading Channel
In figure 7, we simulate a fast-fading $4TX \times 4RX$ channel ($D = 50$ Hz Doppler spread), with a transmission rate of $f = 220$ kbit/s per eigen-mode, using $K = 2$ independent channels, QPSK modulation, and a ping-pong time of 1 ms. Smoothing of the singular vectors was done by averaging over the $Q = 4$ last singular vectors. $n_{\hat{U},\hat{V}}$ is, as above, equal to 400. Ten simulations were performed for each SNR scenario, and the mean BER over these simulations was taken; see figure 7. The upper curve shows the BER obtained with our method after an initial acquisition period; the lower shows what could be achieved if the singular modes of the channel were perfectly known. The loss is in the range 2-5 dB. There seems to be a "flooring effect", which is even more prominent for faster
Figure 7: The SNR/BER for a $4TX \times 4RX$ channel fading at 50 Hz Doppler, 220 kbit/s. Mean BER values for the various SNRs.
fading channels. This effect does not appear in channels that vary more slowly (e.g. 10 Hz or 20 Hz Doppler spread).
4 Conclusions and Discussion
We have shown that the top singular modes can be estimated and tracked without training data, without the need for a statistical estimate of $H$, and without performing an actual SVD. These results are based upon the assumption that the channel we consider exhibits reciprocity. In general, one can say that transmission and retransmission of vectors in a MIMO system leads to convergence of the involved vectors towards the top singular mode. By combining properties of the modulation alphabet with this transmission/retransmission idea, multiple singular vector pairs can be extracted as part of normal operation.
4.1 Improving & Generalizing BIMA
In another paper, [3], we develop a method for blindly estimating singular modes
for a FDD (Frequency Division Duplex) channel. In this case, the uplink channel
matrix is generally not the transpose of the downlink channel, and so the BIMA iterations and the intrinsic power method cannot be used. More advanced optimization techniques are needed to blindly estimate these modes, which are now doubled in number (one set for the downlink channel, another for the uplink). It can still be done, by combining a local linear optimizer with an approach based on principal component analysis (PCA) for "smoothing the interaction" between the up- and down-link singular modes. As a spin-off, a better way of estimating singular vectors emerges, which can be used to improve the BIMA algorithm. A
brief description follows:
A key observation in section 2 is that all (normalized) vectors sent through the channel $H$ (or $H^T$) are received as points on an ellipsoid centered at the origin. The principal axes of this ellipsoid are identical to one set of singular vectors. With only a few sample points, it is possible to estimate such an ellipsoid,
and extract the singular vectors (figure 8). As a consequence, it is not necessary to use the QR/power-method iterations to find the singular modes; it can be done in one iteration step only (one time-slot). This improves the performance of the algorithm, both in acquisition mode and in tracking mode. The price to pay for this is some additional processing complexity. The ellipsoid fitting involves a norm minimization, equivalent to finding one (bottom) singular vector of a matrix, as well as the eigen-decomposition of a matrix $A$ representing the quadratic form of the ellipsoid. It is not necessary that both parties (base station and subscriber) do this processing. Finding the top singular modes at the base is sufficient to improve the singular mode tracking.
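In two dimensions, this one-shot estimation can be sketched as follows (noise-free, invertible channel assumed; the quadratic form $y^T A y = 1$ is fitted by linear least squares, and the eigenvectors of $A = (HH^T)^{-1}$ are the left singular vectors, with eigenvalues $1/\sigma_i^2$). All names are illustrative:

```python
import numpy as np

# Hypothetical invertible 2x2 channel.
H = np.array([[1.5, 0.4],
              [0.2, 0.8]])

# Send a few unit-norm probe vectors; the received points lie on the
# ellipse {Hx : ||x|| = 1} centered at the origin (noise-free sketch).
thetas = np.array([0.0, 0.7, 1.4, 2.1])
X = np.vstack([np.cos(thetas), np.sin(thetas)])
Y = H @ X

# Fit the quadratic form y^T A y = 1 (A symmetric: 3 unknowns in 2D)
# by linear least squares over the received points.
rows = np.array([[y[0] ** 2, 2 * y[0] * y[1], y[1] ** 2] for y in Y.T])
a, b, c = np.linalg.lstsq(rows, np.ones(len(rows)), rcond=None)[0]
A = np.array([[a, b], [b, c]])

# Eigenvectors of A are the left singular vectors of H; the half-axis
# lengths 1/sqrt(eigenvalue) are the singular values.
w, V = np.linalg.eigh(A)
sigma_est = np.sort(1 / np.sqrt(w))[::-1]
sigma_true = np.linalg.svd(H, compute_uv=False)
print(np.allclose(sigma_est, sigma_true))   # True
```

With noise, the least-squares fit no longer has zero residual, and more probe points would be needed to stabilize the estimate.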
5 Appendix
A: Blind ordering of singular vectors
In previous sections, we assumed that the various independent channels were
color coded, so that a stream decoded with a singular vector v was correctly associated with its source (encoding vector) u. Here it is shown how channel mapping
can be done completely blindly, at the cost of some performance. If the singular
values {σi } are sufficiently different from one another, they provide a good label
for their corresponding singular vector. Each party estimates singular values
{σi } and vectors {ui } (party Y) or {vi } (party X).
Let {σ̂i }X denote the singular values estimated by the party X, and {σ̂i }Y denote
the singular values estimated by party Y. These are smoothed (as time-series),
using a moving average, to reduce the influence of channel noise. The natural
link between the set {vi } (or {ui } ) and the singular values {σ̂i }X or {σ̂i }Y , leads
to a natural ordering: The singular vector corresponding to the largest singular
value is termed v1 , the one corresponding to the second largest is termed v2 , and
Figure 8: Singular vector estimation. The left panel (a) shows a few observations on an ellipsoid; panel (b) shows the extracted singular vectors $u_1$, $u_2$, multiplied by their "lengths" $\lambda_1$, $\lambda_2$ (equivalent to the singular values up to a scalar factor). These vectors/values correspond to the half-axes of the ellipse.
so on, as in a normal SVD. If a vector $v_j$ is associated with a specific singular value $\hat{\sigma}_{X,j}$ in $\{\hat{\sigma}_i\}_X$, and a vector $u_j$ is associated with the most similar singular value $\hat{\sigma}_{Y,j}$ in $\{\hat{\sigma}_i\}_Y$ ($\hat{\sigma}_{Y,j} \approx \hat{\sigma}_{X,j}$), then $(u_j, v_j)$ constitutes a natural singular pair: encoding symbol data with $v_j$ will lead to those symbols being decoded and recaptured with $u_j$, and vice versa.
The problems occur when two singular values become close in magnitude. Due to the TDD, the singular values are not estimated simultaneously by both parties. While one party transmits in one time-slot, the other receives and estimates singular values. Then the scene changes. It might happen that, in a period of time (one or more time-slots) where two singular values are close, the ordering/correspondence between singular vectors and values is different for X and Y. This will correct itself when the spacing between the $\{\sigma_i\}$ increases again. Prior to that, one will often experience "data being sent to the wrong place". In such situations, it is better to postpone the sorting altogether, and stick with the vectors in the order they were in before the confusion started. Observe that our algorithm is perfectly capable of tracking the singular vectors, even if the corresponding singular value estimates change ordering.
The following scheme will take care of the sorting in a way that avoids this potential mix-up: one party, X (the "leader"), is always responsible for re-ordering his singular vectors when he thinks it is safe. This will immediately be observed as a "swap" in the ordering of singular values by the party Y (the "follower"). If this "swap" is sufficiently clear, and not to be confused with the normal (and noisy) variation of the channel, Y can change the order of his singular vectors accordingly. The tricky task here is of course the "sufficiently clear" part. Consider figure 9: the top left panel (a) shows two singular values changing place when no "re-ordering" at the shift point is imposed by party X (assume momentarily that X has knowledge of the singular values at all times). Panel (b) shows how this might be observed by the party Y (the "follower"). The blank slots are the transmission periods for Y, for which no singular values are estimated. Assuming little or no noise, the singular values observed are quite close to the true ones (dotted part of the lines). Consider now panel (c), where "re-ordering" is introduced by X. Panel (d) shows how this might be observed by Y. Next, consider panel (e), where the singular values do not actually cross, but move close before they part again. Panel (f) shows how this is observed by Y. Now, it is easily seen that it is practically impossible for Y to distinguish between the two situations (d) and (f). Y simply cannot tell whether the two singular values have changed places, whether they have been reordered by X, or whether they did not cross at all. Yet, this knowledge is crucial to get the bit streams distributed correctly.
Figure 10 illustrates how the problem is solved: The top panel (a) shows the
full series of singular values (observed by the "all-seeing" X), and a re-ordering
at a (“safe”) point in time after the crossing. The lower panel (b) shows how this
is perceived by the party Y. In this case, it is clear that the observed shift cannot
be due to the ambiguity "re-ordering by X/swap-or-close-in", and so it is safe for
Y to change its singular vectors. Finally, it is of course not necessary for X to be
"all-seeing", it suffices to observe the singular values on a grid similar to that of
Y. The main idea is to postpone the decision of swapping singular vectors until
the confusion around the ordering is over. Summing up, we devise the following
method for blind ordering of the singular channels:
• Party X: if (a) a crossing in singular values is observed, and (b) the singular values have moved a safe distance (say $\epsilon_1$) apart, swap the order of the singular vectors in accordance with the new $\{\hat{\sigma}_i\}_X$ order.
• Party Y: if (a) it is observed that the relation between $\{\hat{\sigma}_i\}_Y$ and $\{u_i\}$ has been out of order, (b) their current difference in magnitude is greater than some $\epsilon_2$, and (c) they suddenly swap to the correct order, then change the ordering of the singular vectors (according to the "correct order", defined by $\{\hat{\sigma}_i\}_Y$).
Clearly, we must assume $\epsilon_1 > \epsilon_2$; i.e., when party X chooses to perform the swap, he must be certain that a swap will follow by party Y. The thresholds $\epsilon_1, \epsilon_2$ could be defined as a certain percentage of the overall "energies", $\sum_i \sigma_{i,X}^2$ and $\sum_i \sigma_{i,Y}^2$. In the presence of much noise, there is a chance that this technique will fail to produce corresponding swaps, reducing the performance of the system.
B: Smoothing of singular vectors
The problem

$$\min_{\{\tilde{U} \,:\, \tilde{U}^*\tilde{U} = I\}} \sum_{l=0}^{Q-1} \|\hat{U}(t_{k-l}) - \tilde{U}\|^2 \qquad (31)$$

is solved by letting $M = \sum_{l=0}^{Q-1} \hat{U}(t_{k-l})$, with the SVD $M = QPR^*$, and setting $\tilde{U} = QR^*$.

Proof: Let

$$f(\tilde{U}) = \sum_{l=0}^{Q-1} \|\hat{U}(t_{k-l}) - \tilde{U}\|^2 = \sum_{l=0}^{Q-1} \left( \|\hat{U}(t_{k-l})\|^2 - 2\,\mathrm{tr}\, \hat{U}(t_{k-l})^* \tilde{U} + \|\tilde{U}\|^2 \right)$$
Figure 9: Blind ordering of singular values. The heavy lines denote one time-series of a singular value, the lighter lines another time-series. Panels (a)-(f) show the situations discussed in the text.
Figure 10: Solution of the blind ordering problem: sorting must only be imposed by X when the singular values are sufficiently far apart.
Now, $\tilde{U}$ and $\hat{U}(t_k)$, $k = 1, 2, \ldots$ are unitary matrices, thus $\|\tilde{U}\|^2 = \|\hat{U}(t_k)\|^2 = K$, the number of channels in use. Consequently, it is sufficient to maximize

$$g(\tilde{U}) = \sum_{l=0}^{Q-1} 2\,\mathrm{tr}\,\hat{U}(t_{k-l})^* \tilde{U} = 2\,\mathrm{tr}\,\Big[\sum_{l=0}^{Q-1} \hat{U}(t_{k-l})\Big]^* \tilde{U} = 2\,\mathrm{tr}\, M^* \tilde{U}$$

Now, finding the matrix $\tilde{U}$ that maximizes $g$ is the same as finding the orthogonal matrix closest to $M$, because $\|M - \tilde{U}\|^2 = \|M\|^2 - 2\,\mathrm{tr}\, M^* \tilde{U} + \|\tilde{U}\|^2 = C - 2\,\mathrm{tr}\, M^* \tilde{U} = C - g(\tilde{U})$. Standard theory (e.g. the polar decomposition [9]) then gives $\tilde{U} = QR^*$.
References
[1] J. Bach Andersen, “Array gain and capacity for known random channels
with multiple element arrays at both ends” IEEE Journal on Selected areas
in Communication, Vol. 18, No 11, 2000, pp. 2172–2178.
[2] T. Dahl, N. Christophersen, D. Gesbert, "BIMA - Blind Iterative MIMO Algorithm", accepted for ICASSP-2002.
[3] T. Dahl, N. Christophersen, O.-C. Lingjærde, N. Lid Hjort, "The Game of Blind MIMO Channel Estimation", in preparation.
[4] M. Fink, "Time-reversed acoustics", Physics Today, Vol. 50, No. 3, 1997, pp. 34-40.
[5] G.J. Foschini, “Layered Space-time architecture for wireless communications
in a fading environment”, Bell Labs Technical Journal, Vol. 1, No 2, 1996, pp.
41-59.
[6] G.J. Foschini, M.J. Gans, “On the limit of wireless communications in
a fading environment when using multiple antennas”, Wireless Personal
Communications, Vol. 6, No. 3, 1998, pp. 311-335.
[7] D. Gesbert, H. Bolcskei, D. Gore, A. Paulraj, “MIMO channels: Capacity
and performance prediction”, submitted to IEEE Trans. on Communications,
July 2000. Shorter version in Proceedings of IEEE Globecom Conference,
Nov. 2000.
[8] G.D. Golden, G.J. Foschini, R.A. Valenzuela, and P.W. Wolniasky, “Detection
algorithm and initial laboratory results using the V-BLAST space-time
communication architecture”, Electronics Letters, Vol. 35, No. 1, pp. 14-15,
1999
[9] G. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, 3rd edition, 1996.
[10] B. Hassibi, "An efficient square-root algorithm for BLAST", ICASSP 2000.
[11] D.B. Kilfoyle, “Spatial Modulation in the Underwater Acoustic Communication Channel”, PhD Thesis, MIT, June 2000.
[12] A. Paulraj, C. Papadias “Space-time Processing for Wireless Communications”, IEEE Signal Processing Magazine, Nov. 1997.
[13] E. Telatar, "Capacity of multi-antenna Gaussian channels", European Trans. on Telecommunications, Vol. 10, No. 6, 1999, pp. 585-595.
[14] A. Touzni, I. Fijalkow, M.G. Larimore, J.R. Treichler, “A globally convergent
approach for blind MIMO adaptive deconvolution” IEEE Transactions on
Signal Processing, Vol. 49, No. 6, 2001, pp. 1166-1178.
[15] H. Wold, “Soft modelling by latent variables: The non-linear iterative
partial least squares (NIPALS) approach” Perspect. Probab. Stat., Pap. Honor
M. S. Bartlett Occas. 65th Birthday, 1975, pp. 117-142.
Paper V:
The Game of Blind MIMO Channel
Estimation
The Game of Blind MIMO Channel
Estimation
Tobias Dahl, Nils Christophersen, Ole-Christian Lingjærde,
Nils Lid Hjort∗
Abstract
Some optimization problems involving multiple entities (individuals,
“agents”) can only be solved if they work together. In Artificial Intelligence
(AI) there is currently a great interest in collaborating multi-agent systems.
The agents work in a decentralized, automated fashion to solve an overall problem. Each agent has a reward-function, expressing degree of success in completing a subtask. Stratergies have to be chosen carefully, so
that the agents do not “work at cross-purposes”. In this paper, we discuss a
two-agent problem for finding the optimal transmission/receiving parameters for a multi-antenna wireless communication system. Techniques from
multivariate analysis and geometrical modelling are used to build a framework for solving the problem.
Key words: Game Theory, Multi-Agent Systems, Ellipsoid Fitting,
Orthogonal Transformation, Principal Components.
1 Introduction
1.1 Background
Game theory usually deals with situations where two or more competitors
try to maximize their individual outcome in a battle over a limited resource.
Another aspect of game theory, not as much discussed in the literature, is that
of team-work. Multiple-Agent systems (MAS, [21],[22], [27]), a popular topic
in Distributed Artificial Intelligence, provides a framework for such problems.
Consider a situation where two agents (for example persons, companies or
electronic devices) have to co-operate in order to maximize some function
* Department of Statistics, University of Oslo.
reflecting the joint outcome or reward. Each agent has to contribute by selecting a set of parameters ($\alpha$ or $\beta$). The problem can be expressed as

$$\max_{\alpha,\beta} f(\alpha, \beta) \qquad (1)$$
For the problem to be relevant, we must assume that α and β are chosen
independently. One agent (X) chooses α and the other agent (Y) chooses β.
If both α and β could be studied and compared at some time, the two-player
game (1) could be reformulated to become a one-player game. In many practical
problems, there must be some communication between X and Y for the situation
to be more than a mere “make-another-guess”-game. We assume therefore that
X has indirect knowledge of β and vice versa.
• The X-agent observes gX (α, β)
• The Y-agent observes hY (β, α)
The mappings gX and hY could be termed “communication functions”. Note that
the information one agent has about the other agent’s parameter is nested with
his own parameters, and is as such not explicit knowledge.
Agents X and Y will often work in a synchronized fashion. Suppose $\alpha_i$ is the $i$'th guess for the X-agent, and $\beta_i$ is the $i$'th guess for the Y-agent. We then assume that X observes $h_Y(\beta_{i-1}, \alpha_{i-1})$ prior to guessing $\alpha_i$, and that Y observes $g_X(\alpha_{i-1}, \beta_{i-1})$ prior to guessing $\beta_i$. Various properties of the mappings $g_X$ and $h_Y$
can apply in different practical problems. Assume
$$g_X: \mathbb{R}^n \to \mathbb{R}^p \qquad (2)$$
$$h_Y: \mathbb{R}^m \to \mathbb{R}^q \qquad (3)$$
with $p \leq n$ and $q \leq m$. If $g_X$ and $h_Y$ are linear functions, then the "degrees of compression" $(n - p)$, $(m - q)$ put constraints on how much information one
agent can have about the other. Properties such as linearity and/or continuity
of the mappings can make a problem easier to solve. Extra considerations must
sometimes be taken into account. In some applications it might be necessary to have a minimum number of guesses, i.e. we want the sequences of guesses $\{\alpha_i\}$ and $\{\beta_i\}$ to be short.¹ This would be the case if each selection of $\alpha$ or $\beta$ is the result of an experiment, e.g. a well dug or an expensive test carried out. Another restriction could be monotonic or almost monotonic convergence towards an equilibrium, ruling out strategies based on many and/or varied guesses on $\alpha$ and $\beta$.
Both agents contribute to the solution of the overall problem, indirectly, by
maximizing their own reward functions. The two-agent problem (1) can thus be
replaced by two subproblems
$$\max_{\alpha,\beta} f_Y(\alpha, \beta) \qquad (4)$$
$$\max_{\alpha,\beta} f_X(\alpha, \beta) \qquad (5)$$

¹ Meaning that the number of wrong guesses in the sequences must be low.
which upon solution coincide with the solution of the original problem. In some situations, the agents will not know that their sub-tasks are part of an overall optimization problem. However, if they do, one agent can choose to help the other maximize his reward², if he believes that this serves his own long-term purposes. Designing suitable rewards is sometimes difficult, and is a central part of Collaborating Agents problems [22].
The functions gX (α, β) and hY (β, α) are generally not known in advance.
The mappings, or at least certain aspects of them, could possibly be learned
as part of the optimization. This is what is referred to as “modelling of the
other agent’s behavior” [18]. Each agent can also have prior knowledge of
the optimization strategy that the other agent is going to use [2]. Any of the
functions f, fY , fX , gX , hY could be observed with noise, making the problem
stochastic rather than deterministic.
We discuss the possibility of independently selecting sequences of guesses
$\{\alpha_i\}$ and $\{\beta_i\}$ converging to a pair $(\alpha, \beta)$ maximizing $f(\alpha, \beta)$ in a specific two-agent problem.
1.2 Problem formulation
We present the joint optimization problem. In section 2.1 we argue that it describes an important problem in wireless telecommunications. For two matrices $H \in \mathbb{R}^{N,P}$, $G \in \mathbb{R}^{P,N}$, we want to solve

$$\max_{x,w} f(x, w) = E\{\|Hx + n_x\|^2\} + E\{\|Gw + n_y\|^2\} \qquad (6)$$

under the constraints $\|x\| = \|w\| = 1$. Here, $\|\cdot\|$ denotes the $L_2$-norm. The vectors $n_x \in \mathbb{R}^N$ and $n_y \in \mathbb{R}^P$ are noise vectors, assumed to be white and Gaussian.
Agent X has to select the vector x and agent Y has to select the vector w. The
agents also observe each others behavior,
• Y observes hY (x) = Hx + nx
• X observes gX (w) = Gw + ny
We propose a leader-follower strategy: agent Y returns the vector he receives, $y = Hx + n_x$, after pre-multiplying it with a matrix $R$ and normalizing, giving $w = (RHx + n_x)/\|RHx + n_x\|$. Agent X selects vectors $x$ based on some optimization algorithm and the feedback he gets from Y. In the understanding that the data $x$ from agent X affects the data $w$ returned from Y and vice versa, we write

$$g_X(w, x) = Gw_{(R,x)} + n_y = g_X(R, x) \qquad (7)$$
$$h_Y(x, w) = Hx_{(x,R)} + n_x = h_Y(x, R) \qquad (8)$$

² Even prior to maximizing his own reward.
Now, X picks $x$ and Y picks $R$. Each agent can also have "memory" of previous data vectors and use it for decision making. One could replace the subscripts $(R, x)$ and $(x, R)$ with $(\{R_i\}, \{x_i\})$ and $(\{x_i\}, \{R_i\})$ respectively, for $i = 1, \ldots, n - 1$, where $n$ is the "present" iteration. For readability, we keep the notation in (7), (8). The overall goal is reached if the two agents Y and X each succeed in maximizing their own reward functions,

$$f_Y(x, R) = E\{\|Hx_{(x,R)} + n_x\|^2\} \qquad (9)$$
$$f_X(R, x) = E\{\|Gw_{(R,x)} + n_y\|^2\} \qquad (10)$$
Note that $f_Y(x, R) = E\{\|h_Y(x, R)\|^2\}$ and $f_X(R, x) = E\{\|g_X(R, x)\|^2\}$, a very close connection between the reward and the communication functions, and that $f = f_X + f_Y$. There are two extra considerations for this
problem:
• The sequences of guesses {xi } and {wi } should converge as quickly as
possible to their optimum.
• Each element in $\{x_i\}$ and $\{w_i\}$ should be similar to the previous element, i.e. $\|x_{i-1} - x_i\| \leq \epsilon$ and $\|w_{i-1} - w_i\| \leq \epsilon$ for some small positive $\epsilon$, and for all $i$.
We will give some examples of problems for multi-agent systems, then show
that the problem we have stated has an important application in wireless
telecommunications.
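As an illustrative special case (assuming reciprocity, $G = H^T$, and no noise), the alternating guess-and-return scheme reduces to a power iteration, and the independently chosen sequences converge to the top singular pair. A minimal sketch with invented names:

```python
import numpy as np

rng = np.random.default_rng(1)

# Channel with a known singular structure, for illustration only.
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
H = U @ np.diag([3.0, 2.0, 1.0, 0.5]) @ V.T
G = H.T                                   # reciprocity assumption

# Agent X's initial guess; each agent only normalizes what it observes.
x = rng.standard_normal(4)
x /= np.linalg.norm(x)
for _ in range(200):
    w = H @ x                             # Y observes h_Y(x) = Hx
    w /= np.linalg.norm(w)                # ... and returns it normalized
    x = G @ w                             # X observes g_X(w) = Gw
    x /= np.linalg.norm(x)

# x converges (up to sign) to the top right singular vector V[:, 0],
# and w to the top left singular vector: f(x, w) is maximized.
err = min(np.linalg.norm(x - V[:, 0]), np.linalg.norm(x + V[:, 0]))
print(err < 1e-8)   # True
```

With $G \neq H^T$, as in the FDD setting discussed later, this simple iteration no longer applies, which is what motivates the more general framework of the paper.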
1.3 Examples
Problems that can be cast into a MAS setting arise in everyday life as well as in
science. Examples are the following:
• Two people living together will have to contribute their share of opinions and actions, trying to maximize the joint happiness, without always knowing the precise contributions of their partner. The “intended words” α and β might not be directly understood, only perceived through “mappings” gX(α, β) and hY(β, α).
• Closing a contract in a business situation, one might not know the precise
investments and agreements the other agent has made, but each agent is
still interested in maximizing the outcome of a joint contract.
• In design of control systems for routing over a communication network,
multiple units (servers) have to collaborate to direct the traffic and reduce
overhead.
• Multi-Agent Control Systems are used for directing constellations of
communication satellites and planetary exploration vehicles [27].
PAPER V: THE GAME OF BLIND MIMO CHANNEL ESTIMATION
• In wireless communications, a base station (Agent Y) and a mobile subscriber unit (Agent X) have to tune their antenna parameters independently (but in accordance with one another) in order to maximize the capacity/reliability of the channel.
In this paper, we focus on the last example. The problem of interest is to find the
optimal transmission/receive parameters for communication through a wireless
multi-antenna channel. We present a framework for solving this difficult, nested
non-linear optimization problem.
1.4 Wireless mobile communication
It has been known for a long time that the use of multiple antenna elements
at the transmitter improves channel capacity (“smart antennas”). By carefully
spreading the transmit power over the antenna elements, one can focus energy
into desired directions, increasing the average signal-to-noise ratio (SNR).
In a series of articles on information theory published by members of the Bell Laboratories (E. Telatar [24], J. Foschini [9]), it was demonstrated that the use of multiple antenna elements both at the Base Station (BTS) and at the mobile Subscriber Unit (SU: cell phone, PDA or laptop)
results in a large increase of the channel capacity. In MIMO systems (“Multiple
Input, Multiple Output”), the channel is a matrix. Let y be the result of
transmitting the data vector x across the channel H. This is expressed as
y = Hx + nx    (11)

where the noise term nx is assumed to be white and Gaussian, nx ∼ N(0, σ²I).
The components of the vectors x and y are the transmitted and received signals
at the respective antenna elements. The vector x is composed of symbols from independent data-streams (modulation), each of which the receiver may try to recapture using the inverse H⁻¹ followed by demodulation. This diversity gain
is one of the reasons that MIMO outperforms conventional systems. Usually, H
is estimated by transmission of a training sequence of known symbols. There
are two drawbacks with this approach. First, there is overhead. The channel
characteristics H vary with time and must therefore be re-estimated at regular
intervals. In 3G systems³, up to 20% of all the data sent is for training.
The second drawback with most training-based systems is that they take no advantage of the singular modes of H. The received energy will be at the maximum if x is parallel with the first singular vector v1H, since

v1H = arg max{x : ||x||=1} E{||Hx + nx||²}    (12)

where H = Σ_{i=1}^{r} σiH uiH viHT is a singular value decomposition (SVD) of H. The singular vector v1H of H maximizes the expected signal-to-noise ratio (SNR)
³Third generation systems, as opposed to e.g. GSM, which is a second generation system.
when used for transmission. The corresponding singular vector u1H is used for optimal weighting of the receive antenna elements. To employ the singular modes, the channel matrix⁴ must be known by the transmitter as well as the
receiver [6]. Most systems based on training data assume channel knowledge on
the receiver only (e.g. V-BLAST⁵, [9], [12]), and transmission with maximum
average SNR is therefore impossible. In this paper, we try to find the top
singular vectors blindly, without the need for a statistical estimate of H. To
complicate things further, in a FDD system (Frequency Division Duplex), the
channel matrix H only characterizes transmission in one direction (from BTS
to SU). Communication in the opposite direction (from SU to BTS) is generally
characterized by a different channel matrix G. Transmission then amounts to
w = Gz + ny    (13)

Let G = Σ_{i=1}^{rG} σiG uiG viGT be a SVD of G. Now, if w and x are parallel with the leading singular vectors, w = cv1G and x = dv1H, the agents can transmit the symbols
c and d using the top singular vectors. The symbols usually come from some finite modulation alphabet. In this paper, we will focus on the transmit vectors only, and symbol modulation is not discussed. Also, data is usually not transmitted in single vectors, but rather in blocks containing
(H = GT ) the problem of blind estimation of the singular modes has been solved
[6].
1.5 Layout of the Paper
The paper is laid out as follows. Section 2 is a methodology section: it explains the usefulness of solving the problem (6) in wireless mobile communication (2.1), and describes the necessary building blocks for a solution (2.2, 2.3). Section 3 is about the algorithms, explaining in more detail what the agents X and Y have to do to jointly solve the overall optimization problem. It also has a subsection on sensitivity and noise (3.4), pointing out situations that can be difficult to handle without special precautions. Section 4 is a simulation section, assessing the performance of some of the sub-tasks, as well as simulating the two-agent system in operation. Section 5 discusses the findings and suggests improvements.
2 Methodology
2.1 Blind MIMO Channel Estimation
We interpret the equations (6), (7), (8), (9), (10) in terms of blind FDD channel estimation.
⁴or at least the leading singular vectors
⁵Vertical Bell Labs Layered Space-Time
Two agents X and Y want to transmit their data on top singular modes of
a MIMO channel in a FDD system. The equation (6) states that if optimal
vectors x and w are selected, then the received power at both stations will be
at the maximum. The constraints ||x|| = ||w|| = 1 set a limit on the power
output from the antenna arrays, which is practical from a cost perspective. It
is clear that x = v1H and w = v1G at the maximum of f (x, w). The equations
(7), (8) simply state that X and Y ’see’ the other agent’s sending vector (w or x)
through a channel matrix. Equations (9) and (10) state that X and Y each try
to get the maximum received power. It is clear that if the reward functions fX
and fY are maximized, the overall problem is solved. At this optimum point,
gX(x, R) = σ1G u1G and hY(R, x) = σ1H u1H, so that all four singular vectors are known. It is sufficient that X knows v1H and u1G, and Y knows v1G and u1H, to transmit/receive on the top singular modes.
Consider the extra requirements (convergence speed and continuity). They
can be interpreted as follows:
The initial guesses x0 and w0 are both random vectors. Each vector xi and wi will carry a symbol, and we want to maximize the chance that this symbol is correctly recaptured. Thus, we want {xi} and {wi} to converge to their optima as quickly as possible.
Continuity is desirable because previous (earlier) estimates are used for symbol decision, and too much variation between successive elements in the sequences {xi} and {wi} makes the task difficult.
Throughout the paper, we assume that all matrices and vectors are real-valued,
as this simplifies visualization and the discussions considerably. The channel
matrices will be complex in practical problems, but the methodology we present
can be extended to work in the complex case. We also assume G and H, both
in RP,P , to be square and invertible, simplifying both analysis and performance
assessment. A final working system must of course handle non-square and degenerate/singular cases.
We now present the building blocks necessary for our framework. The two agents X and Y need different tools to complete their sub-tasks. X needs a numerical optimization method (Local CG, section 2.2) in order to find an optimum x. Y, on the other hand, needs to get as much information about H as possible, in order to make a “good correction matrix” R. The ability to estimate one set of singular vectors and the singular values of H is crucial, and ellipsoid fitting (2.3) is the central tool.
2.2 Local CG Algorithm
Consider the problem of maximizing
f(x) = ||f(x)||²    (14)
subject to ||x|| = 1. Using the Lagrange multiplier method, we seek a stationary
point (x, α) of
f*(x) = ||f(x)||² − α(xT x − 1)    (15)
The stationarity condition on α implies that xT x = 1. Consider the problem of
determining x. Resorting to the same strategy as used in the Newton–Raphson
method, let x0 denote the current estimate for x, let u0 denote the gradient
vector, and W the Hessian matrix of f (x), both evaluated at x0 . We then find the
next estimate for x by equating to zero the first order Taylor expansion of Eq. (15),

u0 + W(x − x0) − αx = 0    (16)

that is,

Wx = αx + Wx0 − u0    (17)
However, since the function f is only implicitly given in our problem, neither u0
nor W are known, and we have to resort to approximations. Suppose f (x) is well
approximated locally by a linear map, i.e. f (x) ≈ Bx for x in a neighborhood of
x0 . Then, u0 and W are well approximated by BT Bx0 and BT B, respectively.
Then Eq. (17) simplifies to
BT Bx = αx    (18)

which needs to be solved for α and x under the constraint xT x = 1. A solution that
also solves the Taylor approximation to the optimization problem in Eq. (15) is
given by x = v1 , where v1 is the principal eigenvector of BT B or, equivalently, the
right singular vector of B corresponding to the largest singular value. This leads
to the following iterative method for solving the original nonlinear optimization
problem in Eq. (15):
xi+1 := xi + λ(v̂1 − xi)    (19)
In order to determine the matrix B = Bi to be used in the ith iteration of Eq. 19,
let d ≥ 0 be a given integer and define
Xi = [xi−d, . . . , xi]    (20)
Zi = [f(xi−d), . . . , f(xi)]    (21)
Bi = Zi Xi†    (22)
Here, “†” denotes the Moore-Penrose pseudo-inverse. The parameter λ controls
the step size.
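As a concrete illustration, the iteration (19)-(22) can be sketched in a few lines of numpy. This is our own minimal sketch under the assumptions of this section (a window of d + 1 recent vectors and a local linear model), not the authors' implementation; the sign alignment of the singular vector is an implementation detail we add to keep the step well-defined.

```python
import numpy as np

def local_cg_step(X_win, Z_win, x, lam=0.5):
    """One Local CG update, Eqs. (19)-(22).

    X_win : (P, d+1) matrix whose columns are recent inputs x_{i-d}, ..., x_i
    Z_win : (P, d+1) matrix whose columns are the responses f(x_{i-d}), ..., f(x_i)
    """
    # Local linear model B_i = Z_i X_i^dagger, Eq. (22)
    B = Z_win @ np.linalg.pinv(X_win)
    # Top right singular vector of B solves B^T B x = alpha x, Eq. (18)
    v1 = np.linalg.svd(B)[2][0]
    # Singular vectors are defined only up to sign; align with the current x
    if v1 @ x < 0:
        v1 = -v1
    # Damped step towards v1, Eq. (19), then renormalize so that ||x|| = 1
    x_new = x + lam * (v1 - x)
    return x_new / np.linalg.norm(x_new)
```

For a noise-free linear response f(x) = Hx, iterating this step drives x towards the top right singular vector of H, in line with the discussion above.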
2.3 Ellipsoid fitting
When vectors of constant norm ( ||x|| = 1) are multiplied by a channel matrix H,
the resultant vectors y = Hx + nx will approximately lie on an ellipsoid with
the center in the origin. If this ellipsoid can be determined, one set of singular
vectors and the singular values of H can be estimated, without knowledge of
the other set. Ellipsoid fitting is in this sense a “partial SVD”. Ellipsoid fitting
to data in 2 or 3-space is not a novel idea (Turner et al. 1999, Fitzgibbon et
al. 1996). Applications are plentiful in computer vision and planetary motion
problems. Fitting of data to an ellipsoid in higher dimension is, to the best of our knowledge, an undiscussed problem. An ellipsoid in RP, with center in the
origin, can be described as
yT Ay = Σ_{i,j=1}^{P} aij yi yj = c    (23)
where y ∈ RP , the matrix A = {aij } ∈ RP,P is symmetric, and c ∈ R is a constant.
This is a general equation for a quadratic form (an ellipsoid, a hyperboloid or a
paraboloid), so we must assume that A is positive semi-definite if c is positive,
and negative semi-definite if c is negative. Given a set of observations on the
ellipsoid {y1 , y2 , . . . }, we can compute A: The quadratic form can be replaced by
a linear form,
F(a, yq) = aT yq = 0    (24)
where yq is a vector containing the cross-product terms,
yq = (y1², y2², . . . , y1y2, y1y3, . . . , 1)T    (25)
and the vector a contains the corresponding elements in the matrix A and the
constant c,
a = (a11, a22, 2·a12, 2·a13, . . . , −c)T    (26)
Note that the symmetry of A implies the use of half the set of products in a and yq, e.g. finding both a12 and a21 is unnecessary. By reinserting the elements of a into the matrix A, and adjusting the off-diagonal by a factor of 1/2, the problem is solved. Of course, one observation yq is not enough, so we consider Yq = [yq,i−d, . . . , yq,i], where i is the iteration⁶ index, and d is some positive integer.
Inspired by [8], we try to solve

min_{a : ||a||=1} ||aT Yq||²    (27)
The restriction ||a|| = 1 comes from the observation that there is an extra degree of freedom in (23). But if Yq = Σ_{i=1}^{r} σi ui viT is a SVD of the data block (assumed to have full column rank r), then the trailing singular vector a = ur will be the solution to this problem. Assume that A is pre-multiplied with (1/c), A := (1/c)A, so that the quadratic form is

yT Ay = 1    (28)
Let A = QΛQT be the spectral decomposition of A, with Q the matrix of orthogonal column eigenvectors {qi} and Λ = diag(λ1, λ2, . . . ) the matrix of increasingly ordered eigenvalues, λ1 ≤ λ2 ≤ · · · ≤ λP.

⁶Remark: If the ellipsoid that is to be estimated changes, old samples will be outdated after a while. This will be the case in our application.

We show how to use ellipsoid fitting for estimating a set of singular vectors/values. Under perfect conditions (no noise)
y = Hx    (29)

for some transmit vector x with unit norm, ||x|| = 1. Then x = H⁻¹y, and

yT H⁻T H⁻¹ y = yT (HHT)⁻¹ y = 1    (30)
Comparing with (28), observe that A corresponds to (HHT)⁻¹, if y is obtained
from (29). The left singular vectors of H are the eigenvectors of HHT . These are
the same as the eigenvectors of (HHT )−1 , except that the ordering with respect
to the eigenvalues is reversed. In the presence of noise, A is now an estimate of
(HHT )−1 , and the eigenvectors of A are estimates of the singular vectors of H.
The singular value estimates σ̂iH are obtained from the eigenvalues in Λ as

ŜH⁻¹ = √Λ    (31)

where √· denotes the element-wise square root of a matrix. Similarly, we
estimate the singular vectors by ÛH = Q. If some of the eigenvalues are negative, (31) is not defined, as will be discussed later. Unless otherwise explicitly stated, we will use the terms “eigenvectors”/“eigenvalues” to refer to the eigenvectors/values of the matrix A, and “singular vectors”/“singular values” to refer to the singular vectors/values of H.
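The whole pipeline of this section — building the cross-product data block (25), extracting the trailing singular vector (27), reassembling A (26) and reading off Ŝ and Û from the eigendecomposition (31) — fits in a short function. The sketch below is our illustration for the noise-free case; the function name and the np.abs guard against negative eigenvalue estimates are our additions.

```python
import numpy as np
from itertools import combinations

def fit_ellipsoid_svd(Y):
    """Estimate one set of singular vectors/values of H from received
    vectors Y (columns y = Hx, ||x|| = 1), via Eqs. (25)-(31).
    Returns (U_hat, s_hat), ordered by decreasing singular value."""
    P, n = Y.shape
    pairs = list(combinations(range(P), 2))
    # Cross-product data block Yq, Eq. (25): squares, mixed terms, constant
    Yq = np.vstack([Y**2,
                    np.array([Y[i] * Y[j] for i, j in pairs]),
                    np.ones((1, n))])
    # Trailing left singular vector minimizes ||a^T Yq||, Eq. (27)
    a = np.linalg.svd(Yq)[0][:, -1]
    # Reassemble A and c from a, Eq. (26); off-diagonals carry a factor 1/2
    c = -a[-1]
    A = np.diag(a[:P])
    for k, (i, j) in enumerate(pairs):
        A[i, j] = A[j, i] = a[P + k] / 2.0
    A = A / c                    # normalize so that y^T A y = 1, Eq. (28)
    lam, Q = np.linalg.eigh(A)   # eigenvalues in increasing order
    # A estimates (H H^T)^{-1}, Eq. (30): sigma_i = 1/sqrt(lambda_i), Eq. (31)
    s_hat = 1.0 / np.sqrt(np.abs(lam))
    return Q, s_hat
```

On noise-free data the recovered singular values match those of H, and the leading column of Q matches the leading left singular vector of H up to sign.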
3 Algorithm
We now describe the individual tasks for X and Y in the two-agent framework.
A particular feature of our proposal is that one of the agents (Y) does not try to maximize his reward from the very beginning. Instead, he starts by using information in the receive vectors {yi} to help X maximize his reward. Only when X has a good chance of converging to his optimum reward does Y try to maximize his own.
3.1 Agent X: Optimizing x
The job of the “leader” X is straightforward. Starting from a random x0 , he
continuously tries to approximate the mapping gX (R, x) as a linear mapping (he
assumes R to be constant), and seeks the optimum of fX (R, x) = ||gX (R, x)||2
using the local CG algorithm. The only thing he needs to be aware of is that the problem can change abruptly if Y changes R. However, if he is “lucky”, Y will pick an R not only to help himself, but also to make the problem that X faces as simple as possible.
3.2 Agent Y: Making the problem linear
Consider first the ideal situation of no channel noise. When a sequence of
vectors {x} is sent from X to Y, it is received as vectors {yi } = {Hxi}. Assume
now that agent Y has successfully estimated the left singular vectors and the
singular values, using ellipsoid fitting. The estimates are held in the matrices ÛH = UH, ŜH = SH. Here, H = ÛH ŜH V̂HT is the SVD of H. Now let Y choose

R0 = ÛH ŜH⁻¹ ÛHT = UH SH⁻¹ UHT    (32)
and consider what happens when a received vector y = Hx is pre-multiplied
with R0 :
R0 Hx = UH SH⁻¹ UHT UH SH VHT x = UH VHT x    (33)
The composition of R0 and H is now an orthogonal matrix UH VHT. It can be shown [13] that

UH VHT = arg min_{Q : QT Q=I} ||H − Q||²F    (34)
so the composition R0H effectively replaces the matrix H by its closest orthogonal matrix. Here, || · ||F denotes the Frobenius norm. Adding the normalization step has no effect: for x with ||x|| = 1,

R0Hx/||R0Hx|| = UH VHT x/||UH VHT x|| = UH VHT x    (35)
since ||UH VHT x|| = ||x|| = 1. The vector received by the agent X is now

gX(R0, x) = G UH VHT x    (36)
which is a linear mapping of x. Consequently, the local linear CG algorithm will find its optimum value. The vector w = UH VHT x = v1G will be the top
right singular vector of G. Under noise-free conditions, the top singular vector
pair of G can be found by X without concern of “sparse optima” coming from
the non-linear normalization. This trick is nothing but a principal component
analysis (PCA), decorrelating the received vectors. Figure 1 illustrates the effect
of using the matrix R0 on the received vectors {yi }. In the presence of noise, the
estimates become less accurate. This can sometimes have severe consequences
on R0 , and on its usefulness as a pre-processor/linearizer, at least when applied
to some receive vectors y. This is discussed in 3.4.
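With exact estimates (the noise-free setting of this subsection), the construction (32)-(33) is easy to verify numerically. The snippet below is a sanity check we add for illustration, not part of the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
H = rng.standard_normal((5, 5))
U, s, Vt = np.linalg.svd(H)          # exact estimates: U_hat = U_H, S_hat = S_H

# Correction matrix, Eq. (32): R0 = U_H S_H^{-1} U_H^T
R0 = U @ np.diag(1.0 / s) @ U.T

# Eq. (33): R0 H = U_H V_H^T, the orthogonal matrix closest to H
M = R0 @ H
assert np.allclose(M, U @ Vt)
assert np.allclose(M.T @ M, np.eye(5))
```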
3.3 Agent Y: Adjusting the optimum
When R0 has been introduced, and the local CG algorithm has converged, the
vectors {y = Hx} received by agent Y will also converge in the mean sense. We
approximate

yOpt = (1/d) Σ_{k=0}^{d−1} yi−k    (37)
Figure 1: The figure illustrates the effect of Y using the matrix R0 to help X maximize
his reward in a 2 × 2 channel matrix scenario. In panel (a) are the vectors {xi } sent
through the channel H. When received by Y, they are normalized and resent through G
(R0 = I initially). The resulting points are found in panel (b). Note the sparse density of
points on the extrema of the ellipsoid coming from the non-linearity introduced by the
normalization. The panel (d) shows the situation from an angular point of view. Along
the horizontal axis is the angle θ ∈ [0, 2π], where θ is taken from x(θ) = [cos θ, sin θ], and
the radii y = r(θ) = ||x(θ)|| is plotted along the vertical axes. Clearly, it must be difficult
for an optimization algorithm based on assumptions of continuity to pick an angle θ that
maximizes ||x(θ)||. Panel (c) shows the effect of pre-multiplying by R0 before re-sending
through G, and in (e), the improved situation is shown in a angle/radii plot.
for some positive integer d, a parameter to the algorithm. The vector xOpt
is defined in the same fashion for X. If Y is to have maximum reward, the
convergence point yOpt should ideally be parallel with the leading left singular
vector of H, or
yOpt = σ1H u1H    (38)
This can be accomplished by introducing a rotation R1 “manipulating” X to send vectors received as σ1H u1H. Assume that the mapping R1 has the property

R1 : u1H → yOpt/||yOpt||    (39)
One can construct such an R1 which is a rotation in the subspace spanned by the two vectors yOpt and u1H, leaving all vectors perpendicular to that space as they are. For calculation of such rotation matrices, see [13]. Consider what happens if the vector v1H is transmitted through H and the result Hv1H is pre-multiplied with R1R0:

R1 R0 Hv1H = R1 UH VHT v1H = R1 u1H = c yOpt    (40)
where c is an irrelevant constant that will disappear in the normalization. We
define the mapping

R := R1 R0    (41)
The composition of the two mappings is the “correction matrix” for Y. The
fact that R1 is a rotation makes the composition R1 R0 H a linear mapping,
even after the normalization step. The Local CG algorithm will still find
the optimum singular vector for G, while the introduction of the rotation R1
’suggests’ that the singular vectors of H are found simultaneously. In subsequent
sections, the words “orthogonal transformation” and “rotation” will be used
interchangeably, although strictly speaking, a “rotation” does not involve the
possibility of reflection, as does “orthogonal transformation”.
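The subspace rotation of (39) can be constructed explicitly. The paper refers to [13] for this; the formula below is one standard construction we supply for illustration: a rotation acting only in the plane spanned by u and v, taking u to v.

```python
import numpy as np

def plane_rotation(u, v):
    """Orthogonal R with R u = v, rotating only in span{u, v} (cf. Eq. (39)).
    u and v are assumed to be unit vectors that are not antiparallel;
    vectors perpendicular to span{u, v} are left unchanged."""
    c = u @ v                          # cos(theta) between u and v
    w = v - c * u                      # component of v perpendicular to u
    s = np.linalg.norm(w)              # sin(theta)
    if s < 1e-12:                      # u and v already aligned
        return np.eye(len(u))
    w = w / s
    # Identity outside the plane; a 2-D rotation by theta inside it
    P = np.outer(u, u) + np.outer(w, w)
    return np.eye(len(u)) + (c - 1.0) * P + s * (np.outer(w, u) - np.outer(u, w))
```

The returned matrix is orthogonal with determinant 1, so it is a proper rotation, not a reflection.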
3.4 Sensitivity Analysis
In the simulation section, we show that the suggested two-agent leader-follower strategy leads to convergence of the sequences {xi} and {wi} towards the top right singular vectors of H and G in the deterministic case, σ² = 0. However,
if noise of variance σ 2 > 0 is added, the solution is sometimes unstable. In this
section, we examine the possible causes of this instability, and discuss ways of
making the estimation more robust.
3.4.1 Sensitivity of Ellipsoid fitting
Ellipsoid fitting is a central part of Y’s task. The “partial SVD”, giving the estimates ÛH and ŜH, is used in the initial R0 matrix. The ellipsoid fitting is sensitive in two ways.
First, the quadratic form matrix A is derived from the vector a, which is the
solution to the problem (27). To solve the problem, a must be parallel with the
last singular vector of the matrix Yq . The trailing singular vectors of a matrix
are more sensitive to variation in the data than the leading singular vectors (see
e.g. Hansen [14] or Hastie et al. [16]). A small change in the data matrix Yq
can lead to a great change⁷ in the vector a, and in turn, in the matrix A and the
estimates ŜH , ÛH .
Second, the quadratic form represented by A is only a local approximation to the true ellipsoid that the noise-free data would lie on. We can expect
the approximation to be better close to the sample points, and worse further
away. Particularly, if the sample points are close to some of the short half-axes
of the ellipsoid, the estimates for the long half-axes will be poor. Since the true
corresponding eigenvalues are already low (low λ → large σ), the estimates may
even drop below zero.
3.4.1.1 Suggestions for improvement
Ellipsoid fitting could be regularized by biasing the fit towards a sphere. One
can strike a balance between the ellipsoid that fits the data best, and the sphere
that fits the data best. This could be done in a number of ways, by penalizing
(27), or by estimating the optimal sphere and adjusting the eigenvalues of the
ellipsoid to be more equal to the (single) eigenvalue obtained from sphere fitting.
This approach could also be useful in the case where H is degenerate (singular
or close to singular). Negative, zero or small eigenvalues would be positively
biased, and large eigenvalues negatively biased, which could be useful: in eigenvalue estimation in the presence of noise, the estimation error is ’in the opposite direction’ [6]. Another approach is to regularize Ŝ with respect
to the particular sample points used for estimating it. Components pointing
away from the sample points {yi } on the ellipsoid should be subject to carefully
selected scaling or much regularization.
3.4.2 Reduced rank in the Local CG algorithm
If and when R0 has been introduced to simplify X’s problem, there should be no
problems for X - at first glance. However, the matrix
X = [xi−d , . . . , xi ]
(42)
will, upon convergence of x, have reduced column rank since all columns will be identical. Then the inversion of X in (22) is not straightforward, since a full (basis expanded) SVD of Bi ∈ Rn,p, Bi = Σ_{i=1}^{K} σi ui viT, K = min(n, p), at some point will have σ1 ≥ 0 but σi = 0 for some i > 1. It is not easy to determine the precise column rank of Bi, especially in the presence of noise.
⁷This can also be argued geometrically, by plotting a few points of an ellipse on paper. A small movement of a sensitive point can change the ellipse completely.
3.4.2.1 Suggestions for improvement
One possibility is to ridge (Hoerl & Kennard, 1970) the inversion of X, e.g.
by adding a small positive constant α to each of the singular values prior to
inversion [14]. Another alternative is to introduce some random variation in
{xi } . This will not only solve the problem of the rank reduction, but also give
agent Y a better chance to estimate the ellipsoid, due to the data variation. This
will happen at the cost of some performance in SNR. Regularization involves estimation of a regularization parameter, which adds another level of complexity. In some situations, however, it is possible to obtain optimal values for these parameters.
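A ridged pseudo-inverse of the kind suggested above can be written directly from the SVD. This is our sketch; alpha is a tuning parameter of our choosing, not a value prescribed by the paper:

```python
import numpy as np

def ridged_pinv(X, alpha=1e-3):
    """Moore-Penrose pseudo-inverse with ridging: a small constant alpha is
    added to every singular value before inversion, stabilizing X^dagger when
    the columns x_{i-d}, ..., x_i have (almost) converged and X is near rank 1."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt.T @ np.diag(1.0 / (s + alpha)) @ U.T
```

With alpha = 0 this reduces to the ordinary pseudo-inverse for full-rank X; for a rank-deficient X it returns a bounded matrix instead of blowing up.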
3.4.3 Unstable adjustment of the optimum
The purpose of the matrix R1 is to encourage agent X to send vectors that are parallel with v1H, in the understanding that this also will help him maximize his reward function. However, R1 “has no concept” of directions in H other than u1H. For maximizing the reward functions, one would hope that if X failed to send something close to v1H, he should send something closer to v2H than to, say, v5H. However, since R1 “has no concept of” directions outside span{yOpt, u1H}, a big contribution in a low variance direction could maximize the reward function of X just as easily as an x with a big contribution in the direction v2H. Thus, X could sabotage Y’s attempt of getting a high reward.
4 Simulations
4.1 Simulation: Performance of Y
We try to assess the performance of Y in his attempt to help X optimize x. We examine various steps involved in calculating and applying a matrix R0 under different signal-to-noise scenarios.
4.1.1 Constructing sequences {xi }
Y has to make its inference about H based on an incoming sequence of vectors
{yi} from (11). Agent X, trying to find a vector x maximizing his reward, may sometimes be close to his convergence point xOpt. In this case, the variation in the sequences {xi} and {yi} will be low, making it more difficult to estimate the necessary parameters. We will mainly investigate such situations. We make no assumptions regarding the optimization algorithm that was used for selecting {xi}. The Local CG algorithm or any other (possibly non-linear) method could be used. We examine situations where the elements in {xi} vary around a certain target vector t,

||xi − t|| < ε    (43)
Typical choices will be t = vkH, where k indexes a singular vector of H. The elements of the sequence {yi} will vary around the corresponding vector σkH ukH. The variation in this sequence will depend on the magnitude of the kth singular value of H as well as the noise.
4.1.2 Simulation settings
Random matrices H ∈ R5,5 with i.i.d. elements were generated and used for simulations. If σ1H, . . . , σPH are the singular values of H, invertibility is imposed by requiring that the condition number σ1H/σPH < 10. We then constructed sequences {xi}, with or without specific relations to H, each with 100 elements. The procedures were repeated 1000 times (1000 H-matrices) for each SNR scenario. In all the examples, ||xi|| = 1 for all i. The expected signal-to-noise ratio (SNR), expressible in decibel (dB) as

SNR(dB) = 10 · log10 (||E{Hx}|| / ||E{n}||)    (44)

is selected by adjusting the variance σ².
4.1.3 Performance Diagnostics
In the following, Y = HX + Nx denotes the block of vectors {yi}, Y = [y1, y2, . . . , y100], as they are observed with noise, and X = [x1, x2, . . . , x100] contains the vectors {xi}. Nx = [nx1, nx2, . . . , nx100] is a matrix of the same size as Y, with columns nxi ∼ N(0, σ²I). In the simulation framework, the noise-free version YTrue = HX is available for reference.
4.1.3.1 Perpendicularity.
From the matrix A, agent Y estimates a set of singular vectors in ÛH. If the singular vector estimates are correct, ÛH = UH, then the product matrix

M = ÛHT H    (45)

should have perpendicular rows, since UHT H = SH VHT. In this case, the outer
product MM should have zero elements off the diagonal. The statistic
tP end (M) =
||diag(MMT )||
||MMT ||F
(46)
can be used to assess the successfulness of Y when calculating a singular vector matrix ÛH, a high value indicating success. The distribution of tPend can be simulated under the null hypothesis that M is random, by repeatedly picking random matrices H and calculating tPend(UH) for some random orthogonal U. The randomness of H makes this distribution equivalent to that of tPend(H), without U. A deviation from the null hypothesis, in the direction of a ’successful perpendicularization’, is reported as a low simulated P-value.
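The statistic (46) is one line of numpy; the sketch below is our illustration:

```python
import numpy as np

def t_pend(M):
    """Perpendicularity statistic, Eq. (46): ||diag(M M^T)|| / ||M M^T||_F.
    Equals 1 when the rows of M are perpendicular (M M^T diagonal),
    and drops below 1 as off-diagonal mass appears."""
    MMt = M @ M.T
    return np.linalg.norm(np.diag(MMt)) / np.linalg.norm(MMt, "fro")
```

For example, any matrix of the form D·Q with D diagonal and Q orthogonal scores exactly 1, while a generic random matrix scores strictly less.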
4.1.3.2 Rotation.
The purpose of R0 = ÛH ŜH⁻¹ ÛHT is to make the composition R0H an orthogonal
transformation. Noise can make the estimates ÛH , ŜH inaccurate. However,
these matrices are not our only interest. We are also interested in whether or
not the composition R0 H rotates the specific sequence {xi } . In other words, we
want to check whether the mapping from X to Z = R0 HX is a rotation of the
points in {xi}. To this end, we can use the Procrustes fit. This measure,

d(X, Z) = min_{Q : QT Q=I} ||QX − Z||F    (47)
tells us whether or not one matrix can be rotated to match another. The optimal
rotation Q ∈ RP,P can be found explicitly (see e.g. Dryden and Mardia, 1997).
We consider the statistic
tRotationEst(X, R0HX) = d(X, R0HX) = d(X, R0YTrue)    (48)
From noisy observations Y, we make a matrix R0 , and check whether this is
good enough to work as a rotation of the points X when combined with H. We
simulate the distribution of this statistic under the null hypothesis (H0 : H is
random⁸, e.g. not orthogonal) the following way: We pick random matrices H,
compute Y T rue = HX and compute tRotationEst (X, Y T rue ). If the matrix R0 H works
as an orthogonal transformation of X, the statistic d(X, R0HX) will be low. This
is reported as a low simulated P-value.
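The Procrustes fit (47) has a closed-form solution via the SVD (see e.g. Dryden and Mardia, 1997); a minimal sketch, with names of our choosing:

```python
import numpy as np

def procrustes_distance(X, Z):
    """Procrustes fit, Eq. (47): returns d(X, Z) = min_{Q: Q^T Q = I} ||QX - Z||_F
    together with the minimizing orthogonal matrix Q."""
    # The optimal Q maximizes trace(Q^T Z X^T); with X Z^T = U S V^T it is V U^T
    U, _, Vt = np.linalg.svd(X @ Z.T)
    Q = Vt.T @ U.T
    return np.linalg.norm(Q @ X - Z, "fro"), Q
```

If Z really is a rotated copy of X, the distance is zero and the true rotation is recovered.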
Note that the statistic tRotationEst is dependent on {xi}, but not on the noise, while tPend is dependent neither on {xi} nor on the noise.
⁸Here, the matrix H plays the role of the composition R0H.
SNR              0.0    5.0    10.0   15.0   20.0   25.0   30.0
P(tPend)         0.369  0.240  0.040  0.017  0.003  0.000  0.000
P(tRotationEst)  0.006  0.000  0.000  0.002  0.000  0.000  0.000

Table 1: Performance diagnostics, Case I

SNR              0.0    5.0    10.0   15.0   20.0   25.0   30.0
P(tPend)         0.399  0.152  0.052  0.020  0.006  0.000  0.000
P(tRotationEst)  0.004  0.001  0.001  0.001  0.000  0.000  0.000

Table 2: Performance diagnostics, Case II

SNR              0.0    5.0    10.0   15.0   20.0   25.0   30.0
P(tPend)         0.367  0.065  0.023  0.002  0.000  0.000  0.000
P(tRotationEst)  0.001  0.000  0.000  0.007  0.012  0.000  0.000

Table 3: Performance diagnostics, Case III

SNR              0.0    5.0    10.0   15.0   20.0   25.0   30.0
P(tPend)         0.541  0.483  0.411  0.185  0.095  0.041  0.011
P(tRotationEst)  0.025  0.016  0.017  0.275  0.277  0.111  0.004

Table 4: Performance diagnostics, Case IV
Figure 2: Distribution of the statistic tPend under the null hypothesis.
4.2 Results
4.2.1 Case I: Random elements {xi }
This situation is, from the viewpoint of our optimization problem, related to sending training data, and is considered only for reference. The P-values for the performance diagnostics are found in Table 1. In this reference case, the two statistics indicate good performance. When the elements in {xi} are random, there will be high variation in the corresponding sequence {yi}, which improves the ellipsoid fitting.
4.2.2 Case II: Elements in a small random neighborhood
We study the situation where t is taken as a random, unit norm target vector from a small neighborhood, without any particular connection with the channel matrix H (e.g. not in the neighborhood of a specific singular vector). The variation was kept low, ||xi − t|| < ε = 0.1 for all i, to simulate the situation where {xi} has converged. There is little variation in {yi}, which is likely to make the ellipsoid fitting more difficult. Comparing the results in Table 2 with the reference, performance is almost identical. This indicates that we can hope for
Figure 3: Distribution of tRotationEst, Case I, under the null hypothesis.
150
30
good performance, even if the sequence {x_i} is confined to a small region, e.g. upon convergence.
4.2.3 Case III: Elements in a small neighbourhood around v_1^H

A sequence {x_i} with ||x_i − v_1^H|| ≤ 0.1 (for all i) is constructed and used for simulation. We compare the results in Table 3 with reference cases I and II. The ability to make Û^T H have perpendicular columns is better than in the reference cases, particularly in the presence of much noise. This indicates that ellipsoids are better sampled at points with high curvature, e.g. when the sample points lie along the longest half-axis. For SNRs in the range 15-25 dB, the rotation ability is lower than in the reference cases. This is probably due to errors in the estimates σ̂_i^H; the small singular values in particular will be poorly estimated. Upon inversion, these errors degrade the "rotation ability" of the composition R_0 H. At low SNRs, the noise in the data could lead to more equal eigenvalues.^9 This could explain the good relative performance at low SNRs.
4.2.4 Case IV: Elements in a small neighbourhood around v_5^H

The setup is identical to that of the previous example, with the target vector chosen to be v_5 rather than v_1. We compare with reference cases I and II, as well as III. Both statistics show worse performance than in all the other cases, indicating that this scenario is more difficult than both III and II.
4.2.5 Discussion: Aspects of the sequence {y_i}

Y does not know in advance the convergence point y_Opt of the sequence {y_i}. If y_Opt happens to be parallel with the first singular vector u_1^H of H, then estimation of the singular vectors in Û_H and the singular values in Ŝ_H will be better, due to the curvature around the point u_1. If, on the other hand, y_Opt is parallel with one of the late singular vectors, say u_p, the curvature around that point will be much lower, and so the estimates will be worse.
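The geometry invoked here — the received vectors y = Hx for unit-norm x sample an ellipsoid whose principal axes are the left singular vectors of H and whose half-axis lengths are the singular values — can be checked numerically. The following NumPy sketch is an illustration added to this text, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((5, 5))
U, s, Vt = np.linalg.svd(H)              # H = U diag(s) Vt

# Sending the i-th right singular vector gives y = s_i * u_i, i.e. the
# ellipsoid's i-th principal axis scaled by the i-th half-axis length:
for i in range(5):
    assert np.allclose(H @ Vt[i], s[i] * U[:, i])

# Any unit-norm x lands on the ellipsoid: the coordinates of y = Hx in the
# U basis, divided by the singular values, form a unit vector.
x = rng.standard_normal(5)
x /= np.linalg.norm(x)
coords = (U.T @ (H @ x)) / s
assert np.isclose(np.linalg.norm(coords), 1.0)
```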
4.3 The System at Work

We demonstrate that our system finds the top singular vectors in the deterministic case (no noise). In the presence of noise, more work on regularization is needed to improve stability.
Two random matrices G and H, both in R^{5,5}, were computed, and a random initial vector x_0 used as a starting point. The "correction matrix" was initially defined to be R_0 = I. Subsequently, R was computed in two stages. After 60 iterations, R = R_0 = Û_H Ŝ_H^{-1} Û_H^T was introduced, to make the problem of finding
^9 A kind of "natural" regularization. For a white-noise matrix, all eigenvalues have modulus 1.
v_1^G linear. After another 60 iterations, the extra rotation R_1 was introduced (R := R_1 R_0) to adjust the position of the optimum y_Opt. The reason for this splitting is that it is unlikely that a good and stable candidate for y_Opt can be found before the problem has been 'stabilized' by R_0.
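The purpose of R_0 can be seen algebraically: with exact SVD factors (the system only has the estimates Û_H and Ŝ_H), R_0 H = U S^{-1} U^T U S V^T = U V^T is orthogonal, so the composite map becomes a pure rotation. A small NumPy check of this identity, added here for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.standard_normal((5, 5))
U, s, Vt = np.linalg.svd(H)

# Correction matrix, here built from the exact factors of H:
R0 = U @ np.diag(1.0 / s) @ U.T

M = R0 @ H                               # = U S^{-1} U^T U S V^T = U V^T
assert np.allclose(M, U @ Vt)
assert np.allclose(M.T @ M, np.eye(5), atol=1e-10)   # orthogonal: a rotation
```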
The simulation results are shown in Figure 4. Consider the three phases: during the first 30 iterations, the estimate v̂_1^G becomes more or less correct, but the optimum is unstable, due to the non-linearities involved. There is of course no convergence of v̂_1^H towards v_1^H, since the local CG algorithm only tries to find v_1^G. After the introduction of R_0, the problem of finding v_1^G can be solved stably by linear approximation. The estimate v̂_1^H stabilizes, but there is as yet no convergence to the correct singular vector. After the introduction of R_1, both singular vector estimates converge to the correct vectors.
In practice, we want to avoid the sharp shifts displayed in the figure after the correcting steps. This is also necessary to fulfill the requirement that the sequences {x_i} and {w_i} should change gradually rather than abruptly. The sharp shifts are caused by the B_i matrix in the local CG algorithm: when the problem of finding the top singular vector of G is modified by the introduction or change of R_0, the matrices Z and X suddenly contain observations (columns) from two different problems. This leads to temporarily meaningless estimates B_i. The problem corrects itself once the samples belonging to the "previous" problem are taken out of the estimation process. One way to make convergence smoother is to introduce the mappings more gradually.
4.3.1 Discussion: Errors in Ŝ^{-1}, Contraction and Expansion

We briefly comment on the possible effects of erroneous estimation of eigenvalues/singular values. The singular values corresponding to singular vectors parallel with y_Opt will be well estimated. Consider what happens if y_Opt is parallel with u_5^H, a trailing singular vector. In this case, σ_1^H will be poorly estimated. Upon inversion this error is reflected in Ŝ_H^{-1}. If X modifies his vector x_Opt in the direction of the first singular vector of H (as 'encouraged' by the introduction of R_1), this contribution will be contracted or expanded, depending on the error in σ̂_1. This is not a good situation for agent Y: if X tries to change his vector x_Opt to find a vector more parallel with v_1^H, the corresponding variation in y_Opt could be very high or very low, making improvement in the situation for Y hard to establish.
We conclude this section by discussing some theoretical reference scenarios. Using our knowledge about the relationships between the principal directions of G and H, we discuss which situations are more or less difficult to handle.
4.4 Reciprocity: G = H^T

This is the easiest case. From the SVD of the matrices, it is seen that u_1^H = v_1^G, and so if X picks x = v_1^H, y_Opt will be parallel with v_1^G. With the initial R_0 = I, the vector transmitted through G will be y_Opt/||y_Opt|| = σ_1^H u_1^H/||σ_1^H u_1^H|| = v_1^G, and both top singular vectors are found simultaneously. In this particular case, the matrix R_0 is superfluous, and the only thing that can fail is that this matrix is wrongly estimated. A working system should be able to detect this situation, and avoid calculating and using R_0 altogether.
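In this reciprocal case, the ping-pong of normalized vectors through H and G = H^T is exactly power iteration on H^T H, so both top singular vectors appear without any correction matrix. The following NumPy sketch is an added illustration; the channel H is constructed with a known SVD and well-separated singular values so that convergence is fast:

```python
import numpy as np

rng = np.random.default_rng(2)
# Construct H with a known SVD and well-separated singular values.
U0, _ = np.linalg.qr(rng.standard_normal((5, 5)))
V0, _ = np.linalg.qr(rng.standard_normal((5, 5)))
s0 = np.array([3.0, 1.5, 1.0, 0.5, 0.2])
H = U0 @ np.diag(s0) @ V0.T
G = H.T                                  # reciprocal (TDD) channel

x = rng.standard_normal(5)
x /= np.linalg.norm(x)
for _ in range(60):                      # ping-pong: x -> Hx -> G(Hx) -> ...
    x = G @ (H @ x)                      # one round trip = one step of
    x /= np.linalg.norm(x)               # power iteration on H^T H

v1, u1 = V0[:, 0], U0[:, 0]
# x converges to +/- v_1 (top right singular vector of H) ...
assert min(np.linalg.norm(x - v1), np.linalg.norm(x + v1)) < 1e-8
# ... and the received vector y = Hx is then parallel with u_1 = v_1^G.
y = H @ x
y /= np.linalg.norm(y)
assert min(np.linalg.norm(y - u1), np.linalg.norm(y + u1)) < 1e-8
```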
4.5 Inversion, G = H^{-1}

In this case, v_1^G = u_r^H. This means that if X finds an x maximizing his reward function, the corresponding y_Opt will be parallel with the last singular vector of H, u_r^H. If, in addition, X finds this x after only a few iterations, and the initial x_0 is close to the optimal x_Opt, then the ellipsoid fitting of Y must be based on observations of y almost parallel to the last singular vector u_r^H. The simulations indicate that this is difficult.
5 Discussion

We have described the implementation of a framework for finding the top singular modes in an FDD system. The calculations used do not involve estimates Ĝ or Ĥ, as in conventional systems based on training data. The channel matrices G and H are never observed, only interacted with through matrix multiplication. We can therefore expect that if the channel matrices G and H vary continuously with time, our method is able to track the top singular modes, a clear advantage over most training-based systems. After an initial acquisition period, common to all blind methods, the matrices R_0 and R_1 are updated continuously. Furthermore, for a training-based system to employ the top singular modes, G and H must be known by both parties. In most conventional training-based systems (e.g. V-BLAST), only the receiver has knowledge of the channel matrix.
5.1 Asymmetric workloads

It was shown that the job of finding the top singular vectors can be divided between X and Y with an asymmetric workload. Clearly, agent X has the simpler job: all that is needed is an optimizer capable of finding the extreme point on an ellipsoid. Agent Y has considerably harder work: he has to fit points to an ellipsoid, calculate the principal axes and the corresponding half-axis lengths, and try to determine the convergence point of the sequence {y_i}. This imbalance in workload suggests that agent Y should be the base station (BTS), where processing power and complexity are more affordable, and X the mobile subscriber unit.
5.2 Regularization

More research must be done to avoid the possible problems of having errors in Ŝ_H^{-1}. It is also clear that the mapping R_0 is intimately related to the rotation R_1. The latter determines the positioning of y_Opt, and consequently the quality of the ellipsoid fitting. Updating the rotations R_0 and R_1 interchangeably, combined with regularization of the ellipsoid fitting, could improve overall performance. We have also not considered the robustness of the "final rotation" R_1. It is clear that this mapping can be sensitive, since it is constructed without consideration of principal directions in H other than u_1^H.
5.3 Considerations for an improved system

Based on our experience with the framework, we discuss some potential improvements and sources of information that are not yet exploited.
• Both X and Y can tell whether or not they have obtained their optimal values for f_X and f_Y, by considering the eigenvalues/eigenvectors from the ellipsoid fitting. Unless the vector x_Opt or y_Opt is parallel with the trailing eigenvector (leading singular vector), the true optimum has not been reached. This can be used as a stopping criterion for the optimization.

• Upon convergence of the sequence {y_i} to σ_1 u_1^H (except for the random variation from the channel noise), the elements of the sequence can be averaged before they are processed with R, normalized and sent back through G, avoiding "double noise accounting".

• Interchangeable update of R_0 and R_1: introducing R_1 "too quickly" is risky for two reasons. First, it encourages replacement of the original y_Opt by a vector σ_1 u_1^H that is not well estimated, and second, the scalings in Ŝ^{-1} could make contributions in this direction "explode" or "vanish". It seems better to use a "partial rotation" R_1^p, with p ∈ [0, 1]. When y_Opt changes, the matrix R_0 is recalculated using the new observations. Continuing this way, gradually rotating with (a new) R_0, the optimum could be reached safely.

• Better estimation of R_1: as mentioned, this rotation is confined to a subspace, and thus one cannot say how changes in x along directions outside this subspace affect the reward functions. However, one could consider the possibility that Y "learned" from his mistakes and created a more complex rotation that would ensure a better reward.

• If processing power permits it, one could also consider schemes where X and Y change roles.
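The "partial rotation" R_1^p can be realized as a principal matrix power. The paper does not specify a construction, so the following is only one possible sketch, using the complex eigendecomposition of the rotation matrix:

```python
import numpy as np

def partial_rotation(R, p):
    """Principal matrix power R^p of a proper rotation matrix, p in [0, 1]."""
    w, V = np.linalg.eig(R)                  # complex eigendecomposition
    Rp = V @ np.diag(w ** p) @ np.linalg.inv(V)
    return np.real(Rp)                       # imaginary parts are round-off

# Example: a 90-degree rotation in the plane of the first two coordinates.
theta = np.pi / 2
R1 = np.eye(4)
R1[:2, :2] = [[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]]

R_half = partial_rotation(R1, 0.5)           # "half" of the rotation
assert np.allclose(R_half @ R_half, R1, atol=1e-12)       # composes back to R1
assert np.allclose(R_half.T @ R_half, np.eye(4), atol=1e-12)  # still orthogonal
```

Applying R_half twice reproduces R_1, so a sequence of such partial steps rotates the optimum gradually rather than abruptly, in the spirit of the bullet above.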
5.4 Using multiple singular modes

In a complete working system, it should be possible to estimate multiple singular modes simultaneously, while communicating symbols on these modes (using superpositions). This has already been accomplished in the reciprocal case G = H^T (Dahl et al. 2001). The ideas used there for increasing the number of singular modes can be generalized to the non-reciprocal case. As an extra bonus, there will be larger variation in the vectors {x_i} and {w_i}; this variation will improve the ellipsoid fitting, and thus overall performance.
5.5 Concluding remarks

A number of issues must be investigated before robust algorithms for the two agents can be devised. We have not discussed the performance of the Local CG algorithm in the presence of various noise levels, nor is the estimation of y_Opt covered in any detail. Selection of the step parameter λ is another issue, as is selection of potential regularization parameters and of the integer d used both in the Local CG and in the detection of y_Opt. The non-square and degenerate channel matrix cases, as well as symbol modulation and operation in complex modes, are all subjects for further research. Based on the experience with the present framework, we believe that the formulation of blind FDD MIMO channel estimation as a two-agent problem will prove to be a useful contribution.
References

[1] Alonso, A. and Kudenko, D. (2001) Machine Learning Techniques for Adaptive Logic-Based Multi-Agent Systems: A Preliminary Report. Artificial Intelligence Group, Department of Computer Science, University of New York.

[2] Cruz, J.B., Jr. and Simaan, M.A. (1999) Multi-Agent Control Strategies with Incentives, Proceedings, Symposium on Advances in Enterprise Control, pp. 177-182.

[3] Dahl, T., Christophersen, N. and Gesbert, D. (2001) BIMA - Blind Iterative MIMO Algorithm, accepted for ICASSP-2002.

[4] Fitzgibbon, A.W., Pilu, M. and Fisher, R.B. (1996) "Direct Least Squares Fitting of Ellipses", International Conference on Pattern Recognition, Vienna.

[5] Foschini, G.J. (1996) Layered space-time architecture for wireless communications in a fading environment, Bell Labs Technical Journal, Vol. 1, No. 2, pp. 41-59.
Figure 4: Convergence of the singular vector estimates v̂_1^G and û_1^H. The upper panel (a) shows the error ||u_1^H − û_1^H||. The lower panel (b) shows the error ||v_1^G − v̂_1^G||.
[6] Frank, I.E. and Friedman, J.H. (1989) Classification: Oldtimers and newcomers, Journal of Chemometrics, Vol. 3, pp. 463-475.

[7] Golden, G.D., Foschini, G.J., Valenzuela, R.A. and Wolniansky, P.W. (1999) Detection algorithm and initial laboratory results using the V-BLAST space-time communication architecture, Electronics Letters, Vol. 35, No. 1, pp. 14-15.

[8] Golub, G.H. and Van Loan, C.F. (1996) Matrix Computations, Johns Hopkins University Press, 3rd edition.

[9] Hansen, P.C. (1996) Rank-Deficient and Discrete Ill-Posed Problems, Ph.D. dissertation, Technical University of Denmark, DK-2800 Lyngby, Denmark.

[10] Hastie, T., Buja, A. and Tibshirani, R. (1995) "Penalized Discriminant Analysis", The Annals of Statistics, Vol. 23, No. 1, pp. 73-102.

[11] Hoerl, A.E. and Kennard, R.W. (1970) "Ridge Regression: Biased Estimation for Nonorthogonal Problems", Technometrics, 8, pp. 27-51.

[12] Hu, J. and Wellman, M.P. (1998) Online learning about other agents in a dynamic multiagent system. In Proceedings of the Second International Conference on Autonomous Agents, pp. 239-246.

[13] Stone, P., Tumer, K., Gmytrasiewicz, P., Greenwald, A., Littman, M., Namatame, A., Sen, S., Veloso, M., Vidal, J. and Wolpert, D. (2002) Collaborative Learning Agents, AAAI-2002 Spring Symposium, March 2002, Stanford, CA.

[14] Stone, P. and Veloso, M. (2000) "Multiagent Systems: A Survey from a Machine Learning Perspective", Autonomous Robots, Vol. 8, No. 3.

[15] Telatar, I.E. (1995) Capacity of multi-antenna Gaussian channels, Bell Labs Technical Memorandum.

[16] Turner, D.A., Anderson, I.J., Mason, J.C. and Cox, M.G. (1998) An algorithm for fitting an ellipsoid to data, Technical Report RR9803, School of Computing and Mathematics, University of Huddersfield, UK.

[17] Wolpert, D.H., Wheeler, K.R. and Tumer, K. (1999) "General Principles of Learning-based Multi-Agent Systems", Proceedings of the Third International Conference on Autonomous Agents (Agents '99).
Chapter 4
Discussion
In this chapter, I briefly discuss some of my findings. In particular, I try to describe problems that have not been solved satisfactorily, and discuss lines of improvement and further research.
4.1 Sensory Analysis

Finding a Meaningful Consensus

At the outset of my thesis, the aim was to find a method that could find and summarize connections between sensory profiles better than Generalized Procrustes Analysis (GPA). The main critique against this method was its rigidity, e.g. that it uses orthogonal transformations rather than general linear (or non-linear) mappings. In my work I have considered and learned about techniques such as three-way methods and GCA. Still, none of these methods yields an average (consensus) that both represents compressed information about the judges and resembles the profiles of the individual judges in some way. In all the methods I have encountered, the consensus is either obtained using a restricted transformation, or it is subject to some kind of orthogonality criterion. It would be interesting to find a method that generated a consensus representing an average assessor, having some of the same covariance structure between the variables as the individual profiles, but still capturing the more subtle nuances not found by GPA. One idea is to study minimum spanning trees between the profiles (as discussed in Paper Two), and to try to avoid the 'collapse' described in the Introduction by deciding more carefully which profiles are transformed to match. By keeping the number of transformations to a minimum, the 'collapse' induced by multiple iterations could possibly be avoided.
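As a sketch of the minimum-spanning-tree idea (the profile data and dimensions here are invented for illustration), one can compute pairwise orthogonal-Procrustes distances between the judges' profile matrices and extract the MST with SciPy:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def procrustes_distance(X1, X2):
    """Residual after optimally rotating X2 onto X1 (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(X2.T @ X1)      # Schonemann's solution
    Q = U @ Vt
    return np.linalg.norm(X1 - X2 @ Q)

# Five hypothetical judges, each a 10-samples x 4-attributes profile matrix.
rng = np.random.default_rng(3)
profiles = [rng.standard_normal((10, 4)) for _ in range(5)]

n = len(profiles)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = procrustes_distance(profiles[i], profiles[j])

mst = minimum_spanning_tree(D).toarray()     # keeps the n-1 cheapest edges
edges = np.transpose(np.nonzero(mst))
assert len(edges) == n - 1                   # a spanning tree over the judges
```

Matching each profile only along MST edges keeps the number of transformations at n − 1, which is the "minimum" the paragraph above refers to. (This version omits the centering and scaling steps of full GPA.)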
Test for subtle nuances

In Paper Five on mobile communication, we used a Procrustes measure to statistically determine whether or not a matrix could be transformed to match another by an orthogonal transformation. Working along those lines, it could be possible to construct a statistical test of whether the relationship between two matrices is orthogonal, or whether it is significantly better described by a non-orthogonal transformation. This could be used to check the validity of using
three-way analysis rather than GPA on sensory panel data. As an alternative to general linear transformations, one could also consider using information theory to decide whether there exists a mapping, possibly non-linear, between two profiles.
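One possible form for such a test statistic (only a sketch; the significance assessment, e.g. by permutation or Monte Carlo, is omitted) is the gap between the residual of the best orthogonal fit and the residual of the best general linear fit:

```python
import numpy as np

def orthogonal_residual(X1, X2):
    """Best-fit residual over orthogonal maps (orthogonal Procrustes)."""
    U, _, Vt = np.linalg.svd(X2.T @ X1)
    return np.linalg.norm(X1 - X2 @ (U @ Vt))

def linear_residual(X1, X2):
    """Best-fit residual over all linear maps (ordinary least squares)."""
    B, *_ = np.linalg.lstsq(X2, X1, rcond=None)
    return np.linalg.norm(X1 - X2 @ B)

# Hypothetical pair related by a pure rotation plus a little noise.
rng = np.random.default_rng(4)
X2 = rng.standard_normal((20, 4))
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
X1 = X2 @ Q + 0.01 * rng.standard_normal((20, 4))

r_orth = orthogonal_residual(X1, X2)
r_lin = linear_residual(X1, X2)
assert r_lin <= r_orth + 1e-9     # the general linear fit can never be worse
# Here the gap is small: an orthogonal map already explains the relationship.
assert r_orth - r_lin < 0.05
```

A large gap r_orth − r_lin would then be evidence that the relationship is significantly better described by a non-orthogonal transformation.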
4.2 Mobile Communications

Improving Blind Estimation

In Paper Five, working on blind estimation of non-reciprocal channels (FDD), the use of ellipsoid fitting as a "partial SVD" was described. This could be used to improve performance in the reciprocal channel (TDD) case as well. The power method, a numerical method for finding eigenvalues and eigenvectors, is no longer considered a serious method for eigen-estimation. The reason it was used in our papers is that it arises naturally as an implicit process from the physics involved. It more or less lends itself to being exploited, at low processing and channel capacity costs compared to methods in use today. If A ∈ R^{p,p} is a real symmetric matrix, and x some random vector in the column space of A, then the Krylov sequence {x, Ax, A^2 x, ..., A^n x} can be used for eigenvector estimation. This sequence arises naturally in the power method, but a lot more can be said about the eigen-structure of A from this sequence. For instance, the conjugate gradient method (CG) can make better estimates of the eigenvectors from this sequence than the power method, which simply uses the last element as an approximation. There must be unexploited information in the vectors that are transmitted in a TDD system. A study of the CG algorithm and other Krylov methods for the sake of improving the eigen-mode estimation could be useful. Comparing such techniques with the "partial SVD" used for ellipsoid fitting in Paper Five is another aspect.
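The point can be illustrated numerically: from the same Krylov sequence, a Rayleigh-Ritz projection (the mechanism underlying CG/Lanczos-type methods) uses the whole subspace, whereas the power method keeps only the last vector. A NumPy sketch, added here for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
A0 = rng.standard_normal((8, 8))
A = A0 @ A0.T                            # real symmetric test matrix

x = rng.standard_normal(8)
n = 4
K = [x]                                  # Krylov sequence {x, Ax, ..., A^n x}
for _ in range(n):
    K.append(A @ K[-1])
K = np.column_stack(K)

# Power-method estimate: the normalized last Krylov vector.
pm = K[:, -1] / np.linalg.norm(K[:, -1])

# Rayleigh-Ritz over the whole Krylov subspace: orthonormalize, project A,
# and lift the top eigenvector of the small problem back to R^8.
Q, _ = np.linalg.qr(K)
w, V = np.linalg.eigh(Q.T @ A @ Q)
rr = Q @ V[:, -1]                        # Ritz vector for the top Ritz value

rq = lambda v: float(v @ A @ v)          # Rayleigh quotient of a unit vector
# The Ritz vector maximizes the Rayleigh quotient over the subspace, so it
# is never worse than the power-method vector built from the same products.
assert rq(rr) >= rq(pm) - 1e-9
```

Both estimates cost the same n matrix-vector products; only the post-processing differs, which is exactly the unexploited information referred to above.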
Simplifications of the blind FDD algorithm

The blind FDD algorithm in Paper Five can be considerably simplified when there is only one receive element at the mobile subscriber unit. This corresponds to a SIMO/MISO system, or smart antennas. In this case, a simplified version of the Local CG algorithm can be used to find the optimal weight distribution for transmission from the base station to the mobile subscriber unit. The connection between this method and other techniques for blind SIMO/MISO estimation needs to be understood. It is clearly more difficult to come up with new ideas in an area that has been studied intensively for thirty years. Still, ideas inspired by MIMO could bring some improvement to smart antenna systems.
Symbol modulation

The performance of blind channel estimation in both TDD and FDD systems could be improved by working more on the symbol modulation and the properties of the modulation alphabet. Spatial water-filling could also be implemented to improve performance. In the non-reciprocal case, symbol modulation was not discussed, and only the top singular modes are used for transmission. Implementation of a modulation alphabet, as well as using more singular modes for transmitting and receiving data, is the natural next step in this work.

The connection with game theory and Multi-Agent Systems is another aspect. In particular, optimization of nested, time-varying functions is a subject that could be of interest to a wider statistically oriented audience.