© Tobias Gulden Dahl, 2002
ISSN 1501-7710
Cover: Inger Sandved Anfinsen
Series of dissertations submitted to the Faculty of Mathematics and Natural Sciences, University of Oslo, No. 229
All rights reserved. No part of this publication may be reproduced or transmitted, in any form or by any means, without permission.
Printed in Norway: GCS Media AS, Oslo
Publisher: Unipub AS, Oslo 2002
Unipub forlag is a subsidiary company of Akademika AS, owned by The University Foundation for Student Life (SiO).

Contents

1 Introduction  3
  1.1 Empirical Modelling of Motion  4
  1.2 Generalized Procrustes Analysis  6
  1.3 GCA and related methods  10
  1.4 Statistical Shape Analysis  12
  1.5 Mobile Communication Systems: MIMO  14
  1.6 A two-agent problem  20

2 Short Summary of the Papers  27

3 Papers  31
  3.1 A Bridge between Tucker-1 and Carroll's Generalized Canonical Analysis  33
  3.2 Outlier and Group detection in Sensory Analysis using Hierarchical Cluster Analysis with the Procrustes Distance  61
  3.3 BIMA - Blind Iterative MIMO Algorithm  87
  3.4 Blind MIMO Estimation based on the Power Method  99
  3.5 The Game of Blind MIMO Channel Estimation  129

4 Discussion  159
  4.1 Sensory Analysis  161
  4.2 Mobile Communications  162

Preface

This is a thesis for the dr. scient. degree in Applied and Industrial Mathematics (a program including Applied Statistics). The thesis is submitted to the Department of Informatics at the University of Oslo. The thesis was prepared under the supervision of Professor Nils D.
Christophersen, University of Oslo, Professor Tormod Næs, University of Oslo and MATFORSK, and Ole-Christian Lingjærde, University of Oslo. The work has been financed by the University of Oslo, Department of Informatics.

The thesis consists of five papers. Papers One and Two are about sensory panel analysis. Papers Three, Four and Five are about wireless mobile communication systems. The methods, techniques and ideas are closely connected, even though the applications are different. The Introduction describes these connections in detail.

Acknowledgments

I want to express my gratitude to my supervisors, Professor Tormod Næs and Nils D. Christophersen. I am particularly grateful for all the time they have spent with me, not only on the papers and the thesis, but also on talks not directly related to work. Tormod Næs has been very important to me, in that he has taught me a lot about the processes involved in writing papers and doing research, as well as finding a balance between scientific work and other activities. Professor Nils Christophersen has always taken time to listen to my more or less original and interesting ideas. Without his "open door policy", my progress would have been much slower. I also want to direct special thanks to Dr. David Gesbert at the Department of Informatics. Without his knowledge of mobile communication systems, it would have been impossible to publish in an area that was so new to me. Special thanks to Ole-Christian Lingjærde, University of Oslo, for joint work, and to Professor Nils Lid Hjort. Working together on the final thesis paper was fun, and I learned a lot from our discussions. I also want to thank my colleagues in Glasgow for inspiration, for all I learned about shape analysis, and for giving me a statistical home away from home. Finally, thanks to my friends and my family for support in periods of much work.
Tobias Dahl
Oslo, March 2002

Chapter 1

Introduction

From Sensory Analysis to Mobile Communication Systems: A personal account

In this introduction, I try to show how a number of ideas, spread through the various sub-projects of my thesis, relate to each other. The introduction is rather personal, and more about the process behind the papers than about the results they present. Simply presenting background material from different fields of applied statistics would not suit this material well: The central part of the thesis is about connecting ideas from different areas of natural science. The reader will find that the papers contain ideas and contributions from several quantitative disciplines, including

• Motion Analysis
• Sensory Analysis, Psychometrics and Chemometrics
• Classic Multivariate techniques (Clustering, Principal Component Analysis, MANOVA) and Three-way Analysis
• Regularization and Stochastic Simulation
• Statistical Shape Analysis
• Signal Processing and Wireless Telecommunication
• Game Theory and Artificial Intelligence (AI)

This breadth calls for connections. Sometimes, the transition of methodology from one area to another is obvious and straightforward, as a specific method can be more or less insensitive to the type of data it is applied to. At other times, the connections are more intricate, but therefore more interesting. One example in this thesis can be found when the first and the second-to-last papers are seen together: My attempt to make a new method for sensory panel analysis at first turned out to be a replica of a well-known method. It had been formulated by me in an iterative fashion, when a closed-form solution had been known for many years. Still, it was my iterative solution (and not the closed form) to this problem that carried over to solve a blind channel estimation problem in telecommunications.
In another paper, I take advantage of the close connections between psychometrics and statistical shape analysis, and show that developments in the latter, a rather new field, give rise to new approaches to solving practical problems in the former. Orthogonal approximation, which is closely related to Procrustes methods developed in psychometrics, is used in two papers on wireless mobile communications. More examples are given later in this chapter.

The purpose of this introduction is two-fold. Firstly, it presents background material for the individual sub-fields I have worked in; secondly, it gives an account of how the different sub-projects relate to each other. As the thesis covers several areas of quantitative science, I have tried to make the introduction readable by persons from any of them, balancing on a thin line between too much and too little detail on a specific subject. Still, I hope the reader will find this section interesting and readable, and that it will serve well as an introduction to the core of the thesis. I will work my way through the papers in a chronological fashion, starting with the end of my M.Sc. (about Motion Analysis), and ending up with the final paper on game theory (Multi-Agent Systems) applied to a wireless telecommunications problem.

1.1 Empirical Modelling of Motion

In my M.Sc. thesis, I studied the movements of athletes using empirical analysis techniques. Movement analysis is often seen from a bio-mechanical point of view. My idea was to use methods from multivariate statistics (principal component analysis, discriminant analysis) rather than the deterministic techniques (differential equations) typical for this kind of analysis. Attaching reflex markers to different parts of the athletes' bodies (typically the joints), I would track a specific movement using a special camera set-up. I would then compare and analyze the movements of the different athletes who participated in the experiment.
The purpose was to detect differences in the execution, rather than finding an "optimal movement", and also to study how different body parts moved to perform a whole action. One of the first problems I faced was that of matching the motions of the athletes. This is not quite the same as comparing them; it was rather a question of data pre-processing: No athlete stood at exactly the same place as another when performing the movement. Athletes were also of different sizes. Sometimes, the setup of the 3D motion cameras had changed since the last recording session, and so there was a constant need to "align" the movement recordings prior to any further analysis.

To compensate for these effects, I initially used an affine transformation (see e.g. Newman and Sproull, 1976). This is a rather general transformation, capable of rotating, scaling, stretching, translating, and distorting curves in certain ways (known as "shears"). If two sets of points in 3-space are held as the rows of the matrices X1, X2, both in R^(N×3), the matching is done by solving

    min_{Q,t} ||(X1 Q + 1 t^T) - X2||_F^2        (1.1)

where ||·||_F is the Frobenius norm, Q is a 3 by 3 matrix, t ∈ R^3 is a translation vector, and 1 is the N×1 vector of ones. Of course, this problem can be generalized to cover points in p-space, making the matrices X1, X2 both N×p, Q ∈ R^(p×p) and t ∈ R^p. When I took account of the fact that movements could be initiated at certain points in time as well as in space, pre-processing with the affine transformations gave interesting and interpretable results in the subsequent analysis.

From the literature, we knew another option: to use the procrustes transformation for matching the curves. This transformation can be seen as a restricted version of the affine transformation. It admits rotation, translation, and isotropic scaling.
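Returning for a moment to the affine matching problem (1.1): once both configurations are centered, it is an ordinary least-squares problem with a direct solution. The sketch below is my own minimal illustration of this, not code from the thesis; the function name `affine_match` and the use of NumPy are my assumptions.

```python
import numpy as np

def affine_match(X1, X2):
    """Least-squares affine match of X1 onto X2, as in (1.1):
    minimise ||(X1 Q + 1 t^T) - X2||_F^2 over Q and t."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    X1c, X2c = X1 - m1, X2 - m2
    # Per-column least squares gives the unconstrained (affine) Q.
    Q, *_ = np.linalg.lstsq(X1c, X2c, rcond=None)
    # The optimal translation maps the centroid of X1 onto that of X2.
    t = m2 - m1 @ Q
    return Q, t

# Toy check: recover a known affine map exactly (no noise).
rng = np.random.default_rng(0)
X1 = rng.standard_normal((20, 3))
Q_true = rng.standard_normal((3, 3))
t_true = np.array([1.0, -2.0, 0.5])
X2 = X1 @ Q_true + t_true
Q, t = affine_match(X1, X2)
```

With noisy curves the fit is no longer exact, which is precisely where the over-fitting concern discussed below enters.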
The optimal procrustes transformation T(X1) to match X2 is found by solving

    min_{Q,c,t} ||(c · X1 Q + 1 t^T) - X2||_F^2        (1.2)

subject to the constraint that Q is orthogonal, Q^T Q = I, and c is a scalar. One could argue that the unwanted effects (positioning, rotation, size differences) are exactly the ones that can be compensated for by the procrustes transformation, whereas the flexibility of the affine transformation could lead to over-fitting.

Procrustes Analysis was, according to Dryden and Mardia (1998), "developed initially for applications in psychology". First references include Mosier (1939) and later Green (1952), Cliff (1966), Schönemann (1966, 1968), Gruvaeus (1970), and Schönemann and Carroll (1970). A list of authors who have described the pure rotation case (a restriction so that multiplication by Q does not involve reflections), as well as a number of books introducing Procrustes Analysis, can be found in chapter 5 of Dryden and Mardia (1998).

For the problem of matching motion curves, the procrustes transformation turned out to work well. Pre-processing with this transformation gave better discrimination between athletes than we found using the affine transformation. I was still interested in how the two transformations related, and how they affected my data as pre-processing steps. As part of my M.Sc., I constructed a bridge between the two transformations. The bridge was controlled by a flexibility parameter α. Choosing α = 0 would give a "rigid" procrustes transformation, α = 1 would give a "flexible" affine transformation, and values in between would yield intermediate flexibility in the transformation.

The starting point of my PhD was an attempt to modify Generalized Procrustes Analysis (GPA, explained below), giving it the extra flexibility of the affine transformation to avoid some of the problems associated with rigidity.
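The procrustes problem (1.2) has a well-known closed-form solution through the singular value decomposition (the classical orthogonal Procrustes result). A minimal sketch, assuming NumPy; the function name `procrustes_match` is mine:

```python
import numpy as np

def procrustes_match(X1, X2):
    """Solve (1.2): min ||c * X1 Q + 1 t^T - X2||_F^2 with Q^T Q = I,
    via the SVD of the cross-product of the centered configurations."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    X1c, X2c = X1 - m1, X2 - m2
    U, s, Vt = np.linalg.svd(X1c.T @ X2c)
    Q = U @ Vt                        # optimal orthogonal matrix
    c = s.sum() / (X1c ** 2).sum()    # optimal isotropic scaling
    t = m2 - c * m1 @ Q               # optimal translation
    return Q, c, t

# Recover a known similarity transformation of a random configuration.
rng = np.random.default_rng(0)
X1 = rng.standard_normal((20, 3))
Q0, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # a random orthogonal Q
X2 = 1.7 * X1 @ Q0 + np.array([3.0, 0.0, -1.0])
Q, c, t = procrustes_match(X1, X2)
```

Because Q is restricted to be orthogonal and c is a single scalar, only rotation, isotropic size and position are compensated, which is the "rigidity" discussed in the following sections.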
1.2 Generalized Procrustes Analysis

The procrustes transformation is the central part of a sub-field called "procrustes analysis" (PA). The technique for matching one set of points to another also has a "big brother", known as Generalized Procrustes Analysis (GPA). Whereas PA can be used to make two configurations similar, in GPA one tries to make a group of configurations as similar to each other as possible. Central papers on GPA include Kristof and Wingersky (1971), Gower (1971, 1975), Ten Berge (1977), Sibson (1978, 1979), Langron and Collins (1985) and Goodall (1991).

GPA was originally designed to handle multi-dimensional data rather than planar or 3D points that could be visualized on a piece of paper, or in space. A typical example of a situation where GPA can be applied is when a panel of judges is assessing a selection of wines. For each wine, each member will give scores for sourness, bitterness, saltiness, spiciness, oak-blend, fruitfulness etc., on a scale from 0-10. What the manufacturer wants is an "overall expert opinion", which enables him to put a label on the back of each bottle, telling whether the wine is "fruitful", "ripe", "dry" or "bitter" etc. A first shot would be the average score for each wine, and for each category, taken over the judges. In practice, however, it turns out that the judges use the scales differently (Langron, 1983). One person's perception of "sourness" may for another be a mixture of "sourness" and "bitterness". There is even a technique called "free choice profiling" (Arnold and Williams, 1985), where the judges themselves (individually) pick the vocabulary to use for a tasting session. The assumption made by the researcher is that the panel members have a similar underlying perception or taste experience, but that they use the attributes differently. In such situations, comparison is meaningless without some kind of compensation.
Also, one judge may be restrictive in his use of the scales, placing all wines (and their qualities) in the range 3-5, while another uses the full scale 0-10. A judge may also be overly negative or positive in his assessments, giving scores 0-3 or 7-10 more often than the rest of the panel. Ignoring these effects, the "average chart" for the judges can be non-informative. It can be assumed, however (Arnold & Williams, 1985), that each judge will stick to his scale and his taste interpretation throughout the whole session, which makes it possible for a mathematician to compensate for the systematic differences (or "divergence in use of the terms"). The procrustes transformations are used to compensate. This compensation can be understood geometrically, as each wine can be represented as a point in a high-dimensional space, with co-ordinates given by the different tasting dimensions. The method I had used for matching motion curves had been used for decades to improve the reliability of tasting experiments.

Let X1, X2, ..., Xq denote q sensory profiles, each matrix Xi of size N samples × P variables. The rows correspond to food samples (or alternatively, markers on a body), and the columns to tasting categories (or dimensions x, y, z in space). Throughout, we assume that the profiles are centered, i.e. that they are each pre-multiplied by the matrix C,

    Xi := C Xi        (1.3)

    C = I - (1/N) 1 1^T        (1.4)

It can be shown that this centering is equivalent to a translation of all rows in each Xi to have their center in the origin, and furthermore, that this is the optimal way of using translation in profile matching (see Dahl, 1998 or Mardia et al., 1997). The GPA problem is now that of solving (see e.g. Risvik and Næs, 1996)

    min_{Tij} Σ_{i=1}^q Σ_{j<i} ||Tij(Xi) - Tji(Xj)||_F^2        (1.5)

where Tij denotes the optimal procrustes transformation to match¹ the profile Xi with the profile Xj.
The standard approach to solving this problem is to replace it with another problem,

    min_{Ti} Σ_{i=1}^q ||Ti(Xi) - Y||_F^2        (1.6)

    Y = (1/q) Σ_{i=1}^q Xi        (1.7)

Equivalence between (1.5) and (1.6) can be shown using standard results from multivariate analysis. The matrix Y now denotes the average or consensus that the other profiles are transformed to match. The criterion is usually minimized in an iterative fashion. First, an initial consensus is made (see Ten Berge (1977) for details). Next, all the profiles are transformed to match this consensus, before a new average Y is taken, etc. The GPA routine can be summarized as:

GPA Algorithm (Ten Berge, 1977)

0. Center the profiles Xi and make an initial Y.
1. Transform each of the profiles X1, X2, ... to resemble Y by an orthogonal transformation.
2. Make a new Y as the average of the transformed X1, X2, ....
3. Repeat from 1. until convergence.
4. Scale the (transformed) profiles Xi to match each other.

Some stopping criterion is used in step 3, typically by defining a threshold value for the decrease in the objective function (1.6). Isotropic scaling of all matrices to match each other (step 4) is described in Ten Berge and Paul (1993).

¹ Care must of course be taken when selecting the transformations, because as opposed to the case with ordinary Procrustes Analysis, both profiles will be transformed.

1.2.1 The Rigidness of GPA

The GPA procedure can be criticized for its lack of flexibility. In sensory analysis, the operational steps constituting the procrustes transformation put restrictions on how much compensation can be made to the score charts. There is no obvious reason that compensations should be orthogonal. The isotropic scaling also forces all tasting dimensions to be scaled equally much. A judge could for instance be very restrictive in one of the tasting dimensions only.
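As an aside, the GPA loop listed above (steps 0-3; the scaling step 4 is omitted here) can be sketched in a few lines. This is my own minimal illustration in NumPy, not the implementation used in the thesis:

```python
import numpy as np

def gpa(profiles, tol=1e-12, max_iter=200):
    """Steps 0-3 of the GPA algorithm: center, rotate each profile onto
    the consensus Y with the optimal orthogonal matrix, re-average,
    and repeat until the fit (1.6) stops decreasing."""
    X = [P - P.mean(axis=0) for P in profiles]     # step 0: centering (1.3)
    Y = X[0].copy()                                # a simple initial consensus
    fit, prev_fit = np.inf, np.inf
    for _ in range(max_iter):
        rotated = []
        for Xi in X:
            U, _, Vt = np.linalg.svd(Xi.T @ Y)     # step 1: orthogonal match
            rotated.append(Xi @ U @ Vt)
        Y = sum(rotated) / len(rotated)            # step 2: new consensus
        fit = sum(((R - Y) ** 2).sum() for R in rotated)
        if prev_fit - fit < tol:                   # step 3: convergence check
            break
        prev_fit = fit
    return Y, fit

# Profiles that are pure rotations of one configuration should fit exactly.
rng = np.random.default_rng(0)
base = rng.standard_normal((10, 3))
rots = [np.linalg.qr(rng.standard_normal((3, 3)))[0] for _ in range(4)]
Y, fit = gpa([base @ R for R in rots])
```

With real sensory data the residual fit stays strictly positive, and how much of it reflects genuine disagreement (rather than rigidity of the transformation) is exactly the question raised in this section.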
To compensate for this, it would be more correct to scale just that dimension to match the average opinion, and not all the other dimensions. There was also the question of whether a reduced-rank consensus could be found (some work has been done, e.g. Peay (1988)), as there is no guarantee that a set of panel judges has all tasting dimensions in common.

1.2.2 Softening GPA

It would turn out that my attempts at softening GPA led me to rediscover a well-known method. Still, as the reader will see, my solution to this problem enabled me to solve a problem in mobile communications later on that would not have been easily solved without this experience. In particular, I wanted to "improve" GPA by changing step number 1 to become

1. Affine-transform each of the profiles X1, X2, ..., to resemble Y.

or even

1. Transform each of the profiles X1, X2, ... to resemble Y, using my flexible transformation controlled by α.

This would, I believed, enable me to bridge GPA with a "soft GPA" or perhaps a "Generalized Affine Analysis", just as I had succeeded in bridging the procrustes and the affine transformation. I envisioned a soft GPA that could capture the subtle nuances of tasting experiences better, and consequently provide a better representation of the overall tasting experience than GPA could.

1.2.2.1 Collapse

I tried to replace the original step 1. with one based on the affine transformation. Doing so, I realized two things: First, the procedure did not converge. Rather, the elements of the Y matrix tended to infinity (or sometimes towards 0). I tried to remedy this by normalizing each of the columns in Y, introducing an extra step,

2½. Normalize each column yi in Y: yi := yi/||yi||.

Doing this helped the situation only a little. The elements of Y no longer tended to infinity, but, it turned out, all columns of Y became identical (or sometimes, identical up to a change of sign).
It was hard to interpret this result. With all columns identical, it could seem like the computer program was saying: "There is only one dimension (or tasting category) present in the whole data set X1, X2, .... This is the dimension you see in all columns of Y". I was quite confident that the judges were capable of having more than one common, underlying dimension of taste. Surely, the judges must be using more than one set of taste buds. Even when I generated a set of random data matrices, the same phenomenon persisted: All columns of Y became identical.

To help the situation, which I had "half helped" by normalizing the vectors yi, I resorted to "normalization's big brother", orthogonalization. This choice was inspired by the observation that the parallelism of the Y columns only appeared after a few iterations. Initially, the columns of Y could be quite different, but they would gradually become more and more similar. I therefore decided to let the first column be as it was (apart from normalizing it, of course), then from the second column I would subtract whatever could be accounted for by the first column, and from the third column, whatever could be found in the first two vectors, etc. This was an extra step,

2¾. From each column yi, remove any contribution from the previous columns y1, y2, ..., y(i-1).

This was just a series of projections, or a Gram-Schmidt process. Doing this, I forced the columns of Y to be different, but I still gave the data some "liberty of speech" by not fixing them completely. Still, I wasn't satisfied. The Y matrix that I was now calculating was clearly some kind of average opinion of the judges. However, it seemed a little odd that all columns would have to be independent (or, if looking at them as vectors, perpendicular): they had no correlation. Furthermore, all columns were of the same norm after the normalizing procedure.
Clearly, this "forced consensus Y" didn't look very much like any of the individual profiles X1, X2, ..., which sometimes had a lot of correlation between the columns, and rarely had any column norm equal to one.

I worked a little more on the problem to try to understand how the algorithm processed the data. By writing out the recursive procedure as a single equation, I managed to show that the columns of Y were the eigenvectors of a particular matrix, constituted by certain sums and products of the profiles X1, X2, .... Forgetting about orthogonalization and normalization for a moment, and using Y(0) as the initial consensus guess, the matrix Y(k) would after k iterations become

    Y(k) = (Σ_i Xi (Xi^T Xi)^† Xi^T) · · · (Σ_i Xi (Xi^T Xi)^† Xi^T) · Y(0)    [k times]        (1.8)

or alternatively,

    Y(k) = [Σ_i Xi (Xi^T Xi)^† Xi^T]^k Y(0)        (1.9)

Here, "†" denotes the Moore-Penrose pseudo-inverse. It can be seen that this is a power method (see e.g. Golub and Van Loan, 1996) causing each column of Y(0) to converge to the top eigenvector of the matrix Σ_i Xi (Xi^T Xi)^† Xi^T. By keeping the columns of Y(k) orthogonal through a Gram-Schmidt process, I got a set of P orthogonal eigenvectors as the (orthogonalized) consensus matrix Y.

Shortly before publishing these results, I found out that my new method had already been discovered over 30 years ago. It was known as Generalized Canonical Analysis, or GCA, described by J.D. Carroll in 1968. My work on GPA, which by now had changed to become work on GCA, led me into studying more methods for multiple data matrices X1, X2, ..., XQ, and to write my first thesis paper about connections between such methods.
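The iteration (1.9), with Gram-Schmidt re-orthogonalization added, is a simultaneous power method. The sketch below is my own reading of it (function name mine; NumPy's QR factorization plays the role of the Gram-Schmidt step):

```python
import numpy as np

def consensus_power(profiles, k, n_iter=100, seed=0):
    """Iterate Y <- Z Y, re-orthogonalizing the columns each round, where
    Z = sum_i Xi (Xi^T Xi)^+ Xi^T as in (1.9).  The columns of Y converge
    towards top eigenvectors of Z."""
    Z = sum(Xi @ np.linalg.pinv(Xi.T @ Xi) @ Xi.T for Xi in profiles)
    Y = np.random.default_rng(seed).standard_normal((Z.shape[0], k))
    for _ in range(n_iter):
        Y, _ = np.linalg.qr(Z @ Y)    # multiply, then Gram-Schmidt (QR)
    return Y, Z

# With identical profiles, Z is (q times) the projection onto col(X),
# so the consensus must lie inside the judges' common column space.
rng = np.random.default_rng(1)
X = rng.standard_normal((12, 3))
Y, Z = consensus_power([X, X.copy()], k=3)
P = X @ np.linalg.pinv(X.T @ X) @ X.T    # projection onto col(X)
```

This makes the "collapse" described above easy to see: without the QR step, every column is attracted to the same dominant eigenvector.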
1.3 GCA and related methods

GCA exists in several variants. Important papers on the subject include Carroll (1968: MAXVAR), Kettenring (1971: SUMCOR, MINVAR, SSQCOR, GENVAR), van de Geer (1984), Tenenhaus (1987), Kiers et al. (1994), and van der Burg and Dijksterhuis (1996). The GCA criterion used by Carroll is

    max_{βij, yj} Σ_{i=1}^Q corr^2(Xi βij, yj)        (1.10)

under the constraint Y^T Y = I, where Y = [y1, y2, ..., yK], and βij is a vector of regression coefficients. The dimension of the solution space, K, can be selected by the user. van der Burg et al. (1994) have shown that this problem can be reformulated to become

    min_{βi, Y} Σ_{i=1}^Q ||Xi βi - Y||^2   subject to   Y^T Y = I        (1.11)

with {βi} as regression coefficient matrices. The solution to either of the problems (1.10), (1.11) is to let the columns of Y be the orthogonal eigenvectors of

    Z = Σ_{i=1}^Q Xi (Xi^T Xi)^† Xi^T        (1.12)

as demonstrated in the appendix of the same paper, and as I myself had discovered through my work.

Reading the GCA literature, and exploring other methods for studying profile data, I came across a set of so-called three-way methods. The most well known, apart from Generalized Procrustes Analysis, are probably the 3-way factor analysis methods PARAFAC, Tucker-1, Tucker-2 and Tucker-3 (see e.g. Kroonenberg and De Leeuw, 1980, Naes and Kowalski, 1989). Some of these three-way methods resemble GCA (or my own "new" procedure) quite a lot. Not only do they yield a kind of "consensus matrix" Y that reflects the overall opinion of the judges, but they do so in ways similar to other methods I had studied. The Tucker-1 method is known as a 3-way principal component analysis. It detects structures that are common for several judges, along several tasting dimensions, and for several samples (of food, wine etc.) in a data set. It then uses these structures as a basis for exploring and describing the individual judges in a "common language", a kind of "all-agree-upon-this-terminology" that can be detected. The Tucker-1 problem is that of solving

    min_{βi, Y^T Y = I} Σ_i ||Xi - Y βi||^2        (1.13)

which is not very different from the least-squares GCA criterion (1.11). Both methods use a kind of consensus matrix that contains compressed information about the data.
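Both consensus matrices have closed forms, which makes the kinship concrete: GCA takes eigenvectors of Z in (1.12), while the Tucker-1 criterion (1.13) is solved by the top left singular vectors of the column-wise concatenation [X1 X2 ... Xq]. A sketch under those standard results (function names are mine; this is not the bridged method of the paper):

```python
import numpy as np

def gca_consensus(profiles, k):
    """Columns of Y = top-k eigenvectors of Z in (1.12)."""
    Z = sum(Xi @ np.linalg.pinv(Xi.T @ Xi) @ Xi.T for Xi in profiles)
    _, V = np.linalg.eigh(Z)          # eigh returns ascending eigenvalues
    return V[:, ::-1][:, :k]          # reorder to descending, keep k

def tucker1_consensus(profiles, k):
    """Criterion (1.13): Y = top-k left singular vectors of [X1 ... Xq]."""
    U, _, _ = np.linalg.svd(np.hstack(profiles), full_matrices=False)
    return U[:, :k]

# For a single full-rank profile with k = P, the two consensus
# subspaces coincide: both span the column space of X.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 4))
Yg = gca_consensus([X], k=4)
Yt = tucker1_consensus([X], k=4)
```

The difference appears with several profiles: the pseudo-inverse in Z whitens each Xi before summing (which is why GCA is the more noise-sensitive of the two), while Tucker-1 weights the profiles by their raw variance.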
GCA and Tucker-1 were indeed closely related, yet it seemed like the choice of which method to use in a particular situation was a matter of scientific background, and not well explored.² GCA and Tucker-1 were so similar that I could "bridge" the two, in a similar way as when bridging the procrustes and the affine transformation. Using this bridge on empirical tasting data, I demonstrated the use of both methods in a unified framework that facilitated comparison. This provided me with "a way from one method to the other"; it enabled me to create "border cases" between the two, highlighting their differences, and to discuss when one method was more appropriate than the other. I developed a number of plots that could help the practical user decide which particular method to use, or even give him the option of a hybrid. Another aspect of the work was that GCA could be quite sensitive to noise in the data, while Tucker-1 was more robust. Even if the nature of the problem a user wanted to solve called for GCA, the wish to regularize the noise sensitivity could make him go slightly towards a Tucker-1 method. A "regularized GCA" (RGCA) is a hybrid method of the framework.

These results formed the first paper of my thesis, with the title "A Bridge between Tucker-1 and Carroll's Generalized Canonical Analysis". The paper also discussed several techniques for choosing the α parameter, including a kind of external validation using MANOVA in conjunction with the process parameters for making the food samples. The paper is joint work with Tormod Næs.

² One of the reviewers of the first paper has argued that theoretical comparisons between GCA and Tucker-1 have in fact been done, but that combining the methods seems to be a new approach.
1.4 Statistical Shape Analysis

As part of my PhD, I was at the University of Glasgow for a six-month period. Just before my arrival, a new project had been started on shape analysis of babies' cleft lip and palate. With my background in image analysis and procrustes methods, I soon joined a small group of people trying to get to grips with statistical shape analysis, studying a new book by Dryden and Mardia (1998). In this book, procrustes methods were a central topic. I learned that GPA had been in use for shape analysis problems since the 1990s. Among the important papers in this area are Kendall (1984, 1989), Le and Kendall (1993), Goodall (1991), Ziezold (1994), Le (1995), and Kent and Mardia (1997). The book explores the origins of Procrustes Analysis, describing the connections between multivariate analysis and shape analysis in an interesting way.

Introduction of old methodology into new fields can lead to new developments. In sensory analysis, the researchers were concerned with the problem of making profiles X1, X2, ... as similar as possible to one another, by means of procrustes transformations, before taking an average Y that would serve as a "consensus". In shape analysis, there was an interest in assessing how similar the profiles X1, X2, ... were, after transformation. The profiles (or rather, configurations) were no longer tasting score charts, but rather curves, defined by point sets outlining objects such as skulls or tanks. (Table: Synonyms: profiles, configurations, point sets, matrices. Illustrate their interpretations and them being identical.)

To measure the difference between shapes (after procrustes transformation), the procrustes distance (Le and Kendall, 1993) was introduced. For centered configurations X1, X2, this is

    d(X1, X2) = ||T(X1) - X2||_F^2        (1.14)

where T(·) denotes the optimal procrustes transformation for X1 to match X2. The problem with this distance measure is that it is asymmetrical. Generally,

    d(X1, X2) ≠ d(X2, X1)        (1.15)

which makes it inappropriate for a number of purposes.
However, if the matrices are normalized to have fixed variance (for instance, unit variance), we have the full procrustes distance,

    dF(X1, X2) = ||T(X1/||X1||) - X2/||X2||||_F        (1.16)

and symmetry, dF(X1, X2) = dF(X2, X1), is obtained.

It occurred to me that this new measure dF could be brought back into sensory analysis. Little work had been done on the following problem: If a set of judges all taste a selection of food samples, but one of the judges has little or no common experience with the others, what good will standardization with the GPA procedure do? If there are members of two different expert groups in the tasting panel, how can one check whether these two groups are capable of detecting the same tasting sensations? One group could be highly trained to detect certain spices; another may be trained to spot nuances of saltiness. Forcing all data from such panel members through GPA, without any previous check on the consistency of the panel, can yield a consensus not representative of any of the judges. This would be a "head-in-the-oven, feet-in-the-freezer, and on-the-average, we're doing just fine" situation. Even if two expert groups could independently spot important (but different) qualities in the food, taking a GPA average of them all could cut off important contributions from both sides.

A way of detecting group structures and outliers among the judges is to collect them in clusters. If the perception of two judges differed only in terms of how they used the scales (i.e. they had a common underlying tasting experience, distorted only by misuse of terms), they could be grouped to become one. If a third judge was similar to (at least) one of the two, he could be grouped with them as well. Repeating this procedure, one could create a chart (Figure 1.1) that can be used for detecting sub-groups and outliers among the sensory judges.
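The procedure just described — pair-wise full procrustes distances fed into a hierarchical clustering — can be sketched directly. This is my own minimal illustration, not the paper's exact procedure; it assumes SciPy for the clustering, the standard result that for unit-norm centered configurations the optimally matched residual is 1 - (Σ s_i)² with s_i the singular values of the cross-product, and function names of my choosing:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def full_procrustes_distance(X1, X2):
    """Full procrustes distance (1.16): center, scale to unit Frobenius
    norm, then match with the optimal procrustes transformation."""
    A = X1 - X1.mean(axis=0); A = A / np.linalg.norm(A)
    B = X2 - X2.mean(axis=0); B = B / np.linalg.norm(B)
    s = np.linalg.svd(A.T @ B, compute_uv=False)
    # The optimal rotation and scale leave a residual of 1 - (sum s)^2.
    return np.sqrt(max(1.0 - s.sum() ** 2, 0.0))

def cluster_judges(profiles, n_groups):
    """Hierarchical clustering of judges from pair-wise dF(., .)."""
    q = len(profiles)
    D = np.zeros((q, q))
    for i in range(q):
        for j in range(i + 1, q):
            D[i, j] = D[j, i] = full_procrustes_distance(profiles[i], profiles[j])
    tree = linkage(squareform(D), method="average")   # the dendrogram
    return fcluster(tree, t=n_groups, criterion="maxclust")

# Two synthetic judge groups: rotated copies of two different score charts.
rng = np.random.default_rng(0)
b1, b2 = rng.standard_normal((9, 4)), rng.standard_normal((9, 4))
rot = lambda: np.linalg.qr(rng.standard_normal((4, 4)))[0]
judges = [b1 @ rot() for _ in range(3)] + [b2 @ rot() for _ in range(3)]
labels = cluster_judges(judges, n_groups=2)
```

Judges whose charts differ only by a procrustes transformation are at distance zero and merge first; an outlier judge joins the tree last, which is what the dendrogram makes visible.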
The central element in this procedure was the procrustes distance, which was used to measure pair-wise differences between the judges, after transforming them to similarity by the procrustes transformation. The combination of Hierarchical Cluster Analysis (HCA) and the Full Procrustes Distance was the subject of my second paper, joint with Tormod Næs.

Against this approach, one could raise the same critique as against GPA. Using the procrustes transformation in a sensory context takes away some of the flexibility needed to detect nuances in the consensus of the judges. However, since this paper was concerned with pair-wise comparison of profiles, there really was no need to use the GPA algorithm on the full set of profiles. Working with pairs of profiles only, one can at any time resort to an affine transformation, and create an affine distance to replace the procrustes distance, or even use more complicated distances.

[Figure 1.1: Dendrogram for Hierarchical Clustering with the Full Procrustes Distance. The numbers on the horizontal axis are the judge indexes. The lengths of the vertical lines connecting judges are the distances dF(·, ·). In this experiment, judges 1, 2, 4, 7, 9 and 10 are similar, judge 6 is an outlier, and judges 3, 5 and 8 are possible outliers, not well connected with the consensus group.]

1.5 Mobile Communication Systems: MIMO

When I returned from Glasgow, there was a growing interest in mobile communication systems at the University of Oslo. New techniques based on MIMO antennas (multiple input, multiple output) were being studied. At first glance, this seemed like a subject far away from my previous work on sensory and shape analysis, but it soon turned out that ideas carried over from one field to another.
1.5.1 MIMO Background

Quoting Gesbert and Akhtar (2002), MIMO systems can be defined as referring “to a link for which the transmitting end as well as the receiving end is equipped with multiple antenna elements”. The properties of MIMO systems were first discussed in a number of information-theory articles published by members of Bell Labs (Telatar 1995, Foschini 1996). Systems with several antenna elements at one side have been in use since the seventies (smart antennas). Typically, the base station, where cost and space are more easily afforded than on a portable subscriber unit such as a PDA, laptop or cell-phone, is equipped with several antenna elements, whereas the mobile subscriber unit has a single antenna (a SIMO/MISO system, “Single Input, Multiple Output”/“Multiple Input, Single Output”).

Figure 1.2: Multiple antenna elements both at the transmitter and the receiver give a large increase in channel capacity compared with conventional systems (smart antennas).

Smart antennas refer to signal processing techniques that exploit the fact that data is transmitted from or received by multiple antenna elements. One aspect of this ‘smartness’ is to offer more reliable communication in the presence of adverse propagation conditions, such as multi-path fading and interference. Another concept is that of beam-forming, which means that power is distributed over the antenna elements to focus energy into a desired direction, thereby increasing the signal-to-noise ratio (SNR). As we will see, MIMO channels inherit all this smartness, and have some extra advantages over conventional smart antennas. Multiple antennas both at the receiver and the transmitter make up a matrix channel H, of dimensions (N = number of receive antennas) times (M = number of transmit antennas).
Transmission of a data vector x from the base station BTS to the mobile subscriber unit X through the channel H can be expressed as

y = Hx + n    (1.17)

where n is a noise term, usually assumed to be white Gaussian noise, n ∼ N(0, σ²I). From the viewpoint of the receiver, this means that each of the individual antenna elements receives its signals through channels that are sufficiently different. In terms of linear algebra, and momentarily assuming no noise, this means that the equation system

y = Hx    (1.18)

must be solvable for any x. If the coefficients {y_i} of the vector y correspond to the signals at the receive antenna elements, and {x_i} is the power output from each transmit antenna element, then the signal at receive element i is

y_i = h_i^T x    (1.19)

where h_i^T is the i'th row of H. This equation states that each receive antenna element i sees the data vector x through a channel determined by h_i. If there are linear dependencies between the columns of H, successful determination of x will only be possible in certain situations. Note that if multiple antenna elements existed only at the transmitter, there would only be one coefficient x_1 in the vector x. With multiple receive antennas, the capacity of the system is increased, for each of the coefficients of the vector x can be symbols in an individual data stream. This is known as spatial multiplexing, and is one of the advantages MIMO has over smart antennas. Note that the channel matrix H must be determined using training data (sequences known by both parties) before the individual streams can be separated and the symbol data estimated. Another striking property of MIMO systems is the ability to exploit certain spatial modes of transmission and retrieval so as to maximize the SNR. In particular, the singular vectors of the matrix H play an important role.
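The transmission model (1.17)–(1.19) is straightforward to simulate. The sketch below is a minimal numpy illustration (the function name `mimo_channel` is my own), passing a transmit vector through a matrix channel with additive white Gaussian noise:

```python
import numpy as np

def mimo_channel(H, x, sigma, rng=None):
    """Simulate y = Hx + n (equation 1.17), with n ~ N(0, sigma^2 I)."""
    rng = np.random.default_rng() if rng is None else rng
    n = sigma * rng.standard_normal(H.shape[0])
    return H @ x + n
```

With sigma = 0 this reduces to (1.18), and each received coefficient y_i is the inner product of the i'th row of H with x, as in (1.19).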
The reason for this is the fact that if

\[ \mathbf{H} = \sum_{j=1}^{r} \sigma_j \mathbf{u}_j \mathbf{v}_j^T \]

is a singular value decomposition of H, with r = rank(H), then

\[ \mathbf{v}_1 = \arg\max_{\{\mathbf{x} \,:\, \|\mathbf{x}\|=1\}} E\{\|\mathbf{H}\mathbf{x} + \mathbf{n}\|^2\} \tag{1.20} \]

The received signal will be strongest if x is parallel with the top right singular vector of H, or equivalently, when the coefficients of this singular vector are used for weighting the power output at the antenna elements. Typically, x = cv_1 will be used for transmitting the symbol c, as will be explained in Section 1.5.2.1. It is even possible to use several singular vectors v_1, v_2, ... in superposition, to operate independent modes with maximum SNR in descending order. Here one exploits the fact that signals follow different spatial paths (multi-path propagation, each path corresponding to a set of weights determined by the coefficients of a singular vector). Of course, to realize this potential, the transmitter must have knowledge of the channel matrix H to estimate the singular vectors. Summing up, one of the most striking properties of MIMO systems is the possibility of turning multi-path propagation, usually a pitfall of wireless transmission, into an advantage for increasing the user’s data rate, as was shown by Foschini (1996, 1998).

1.5.2 BIMA - Blind Iterative MIMO Algorithm

The work I have done on MIMO (joint with D. Gesbert and N. Christophersen) is about how to maximize the capacity of a MIMO channel by blindly estimating the channel matrix. Most MIMO algorithms today (for instance V-BLAST by Golden et al., 1999) assume that the channel matrix is estimated at the receiver side only. In this case, optimal weighting over the antenna elements at the transmitter is impossible. For the transmitter to transmit on the top singular vector, he must have knowledge of the channel matrix H. In a Time-Division Duplex (TDD) system, data can also be transmitted in the opposite direction of (1.17).
If the channel “exhibits reciprocity”, transmission from the mobile subscriber unit (SU) to the base station (BTS) is expressible as

x = H^T y + n    (1.21)

Since the singular vectors of H and H^T are the same (but with left and right reversed), both parties can gain knowledge of them through the sending of training sequences, and optimal weight allocation at both the transmitter and the receiver is possible. Still, it would be desirable if this training period could be skipped, and the block SVD simplified or performed iteratively. Typically, the channel H will vary with time and must be re-estimated at regular intervals. In the GSM systems of today, and also in third generation systems (UMTS, the Universal Mobile Telecommunications System), up to 20% of the traffic is training data, used for approximating H. If this regular training phase can be skipped, the capacity of the channel is increased. My goal was to find the singular vectors without having to pay the price of using training data, and also without any need for an estimate of H or an actual (computer-intensive) block SVD. I constructed the following algorithm for transmission in two directions (downlink: BTS to SU, uplink: SU to BTS). Assume for the time being that H as well as the vectors x and y are real-valued matrices/vectors. This simplifies the notation, but the demonstrated results carry directly over to the complex case.

BIMA (Blind Iterative MIMO Algorithm)
1. Start with a random vector x^(0), and set i = 1.
2. Send from BTS the vector x^(i−1), which will be received by the SU as y^(i) = Hx^(i−1) + n.
3. Normalize the received vector y^(i) := y^(i)/||y^(i)||.
4. Re-send from the SU the normalized vector y^(i), received at BTS as x^(i) = H^T y^(i) + n.
5. Normalize the received vector x^(i) := x^(i)/||x^(i)||.
6. Increase i, repeat from 2.

By doing this, one can show that x and y converge to the top singular vector pair, the one with optimum performance.
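The round-trip iteration above can be sketched in code as follows. This is a minimal real-valued numpy illustration of steps 1–6 (the function name `bima` is my own); in the noiseless case each round trip is one power-method step with H^T H:

```python
import numpy as np

def bima(H, n_iter, sigma=0.0, rng=None):
    """Blind iterative estimation of the top singular vector pair of H.

    Neither side uses H directly: the BTS sends x, the SU normalizes
    and echoes what it receives, and the BTS normalizes again."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = rng.standard_normal(H.shape[1])                        # step 1: random start
    x /= np.linalg.norm(x)
    for _ in range(n_iter):
        y = H @ x + sigma * rng.standard_normal(H.shape[0])    # step 2: downlink
        y /= np.linalg.norm(y)                                 # step 3
        x = H.T @ y + sigma * rng.standard_normal(H.shape[1])  # step 4: uplink
        x /= np.linalg.norm(x)                                 # step 5
    return x, y
```

In the noiseless case the estimates agree, up to sign, with the top singular vectors from a full SVD, and ||Hx|| approaches the largest singular value σ_1.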
Forgetting about the normalization for a moment (that is, suppose the vectors are not normalized before they are sent), it is easily seen that

\[ \mathbf{x}^{(i)} = \underbrace{(\mathbf{H}^T\mathbf{H})(\mathbf{H}^T\mathbf{H})\cdots(\mathbf{H}^T\mathbf{H})}_{i\ \text{times}}\, \mathbf{x}^{(0)} = (\mathbf{H}^T\mathbf{H})^i \mathbf{x}^{(0)} \tag{1.22} \]

\[ \mathbf{y}^{(i)} = \underbrace{(\mathbf{H}\mathbf{H}^T)(\mathbf{H}\mathbf{H}^T)\cdots(\mathbf{H}\mathbf{H}^T)}_{i\ \text{times}}\, \mathbf{y}^{(0)} = (\mathbf{H}\mathbf{H}^T)^i \mathbf{y}^{(0)} \tag{1.23} \]

Each of these equations is a power method for finding the top eigenvectors of H^T H and HH^T respectively. But these eigenvectors are, by the definition of the singular value decomposition, also the top singular vectors of the matrix H. Note that the BIMA algorithm can be given a modular design. The operations can be split between the base station and the mobile subscriber unit in a way that requires no extra communication between the parties.

1.5.2.1 Symbol transmission and multiple singular modes

A data symbol, a number c, is transmitted by multiplying it by a singular vector prior to transmission. If the singular vectors are known, and we select

x = cv_1    (1.24)

as the transmit vector, then this vector is received as

\[ \mathbf{y} = \mathbf{H}\mathbf{x} = \Bigl(\sum_{i=1}^{r} \sigma_i \mathbf{u}_i \mathbf{v}_i^T\Bigr) c\mathbf{v}_1 = \sigma_1 c \mathbf{u}_1 \tag{1.25} \]

in the deterministic case with no channel noise. If the receiving party knows u_1 and σ_1, he can recapture the symbol perfectly,

\[ \hat{c} = \frac{\mathbf{y}^T \mathbf{u}_1}{\sigma_1} = c \tag{1.26} \]

It is also possible to operate several data streams (or channels) in parallel. If c_1, ..., c_K are symbols from K independent data streams, each of those symbols can be transmitted from the base station using a singular vector v_i, by letting

\[ \mathbf{x} = \sum_{i=1}^{K} c_i \mathbf{v}_i \tag{1.27} \]

At the mobile subscriber unit, this is received as

\[ \mathbf{y} = \mathbf{H}\mathbf{x} = \Bigl(\sum_{i=1}^{r} \sigma_i \mathbf{u}_i \mathbf{v}_i^T\Bigr)\Bigl(\sum_{j=1}^{K} c_j \mathbf{v}_j\Bigr) = \sum_{i=1}^{K} \sigma_i c_i \mathbf{u}_i \tag{1.28} \]

again assuming no channel noise. The symbol from the j'th data stream can be recaptured using the corresponding left singular vector u_j and the singular value σ_j,

\[ \hat{c}_j = \frac{\mathbf{y}^T \mathbf{u}_j}{\sigma_j} \tag{1.29} \]

Transmission and receiving of data on the top singular modes only requires channel information on a “need-to-know” basis.
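The noiseless symbol mapping (1.27)–(1.29) can be sketched as below. This is a minimal numpy illustration with the hypothetical helper `transmit_symbols` computing the singular modes from the true channel; in BIMA the modes would instead be the blindly estimated vectors:

```python
import numpy as np

def transmit_symbols(H, symbols):
    """Send K symbols on the top K singular modes of H (noiseless).

    The transmitter weights symbol c_i by the right singular vector v_i
    (equation 1.27); the receiver projects the received vector onto u_j
    and divides by sigma_j to recover each stream (equation 1.29)."""
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    K = len(symbols)
    x = Vt[:K].T @ symbols                 # x = sum_i c_i v_i
    y = H @ x                              # y = sum_i sigma_i c_i u_i
    return np.array([y @ U[:, j] / s[j] for j in range(K)])
```

With no channel noise, the recovered symbols equal the transmitted ones exactly, one per singular mode.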
One side must know the singular vector estimates {v̂_i} and the other party the corresponding set {û_i}. If the singular values are also estimated, we have a reduced-rank estimate of the channel matrix H,

\[ \hat{\mathbf{H}}_K = \sum_{i=1}^{K} \hat{\sigma}_i \hat{\mathbf{u}}_i \hat{\mathbf{v}}_i^T \tag{1.30} \]

If the estimates are correct, this is the optimal reduced-rank estimate of H in the L2-sense (Golub and Van Loan, 1996). With respect to (1.29), it will in many cases be possible to decide c_j from y^T u_j alone, without dividing by σ_j. Thus, it is not necessary to estimate the singular values. The ability to operate several independent channels on the top singular modes enables a large increase in the capacity of the MIMO channel. In practice, of course, there is channel noise, which increases the probability that a c_j is wrongly decided. Also, the singular vectors are not perfectly known, but are estimated as part of the BIMA algorithm. BIMA can be adapted to estimate more than one singular vector pair, by exploiting certain properties of the modulation alphabet, which is the set of values that a symbol c_j may take. Another interesting property of BIMA is the fact that even if the initial estimates of the singular vectors and the symbols are incorrect, they will gradually converge to their correct values. This is demonstrated in the third and fourth papers of this thesis. We also show by simulation that if the channel H varies continuously with time, the algorithm will track the singular vectors.

In the above, we have assumed that H is real-valued. In practice, H will have complex values, which calls for a slight modification of the algorithm. The vectors that are normalized in steps 3 and 5 of the algorithm must also be conjugated, as described in the fourth thesis paper. The reason for this is the fact that the complex generalization of equations (1.22) and (1.23) is to replace the matrices H^T H and HH^T with H*H and HH* respectively. If the vectors are not conjugated, this transition from the real to the complex case will fail.
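The effect of the conjugation step can be illustrated as follows. This is a minimal complex-valued sketch under the same reciprocity assumption (the function name `bima_complex` is my own, and the exact placement of the conjugation is my reading of the description above): writing H* for the conjugate transpose of H, each round trip with conjugation maps x to a vector proportional to (H*H)x, the complex analogue of (1.22).

```python
import numpy as np

def bima_complex(H, n_iter, rng=None):
    """Complex BIMA sketch: each party conjugates as well as normalizes
    the vector it echoes back, so that one noiseless round trip through
    H (downlink) and H^T (uplink) realizes a power-method step with
    the Hermitian matrix H* H."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = rng.standard_normal(H.shape[1]) + 1j * rng.standard_normal(H.shape[1])
    x /= np.linalg.norm(x)
    for _ in range(n_iter):
        y = H @ x                                 # downlink (noiseless)
        y_echo = np.conj(y) / np.linalg.norm(y)   # SU: conjugate and normalize
        x_up = H.T @ y_echo                       # uplink through H^T
        x = np.conj(x_up) / np.linalg.norm(x_up)  # BTS: conjugate and normalize
    return x, y / np.linalg.norm(y)
```

The estimates converge to the top singular vector pair up to an arbitrary unit-modulus phase factor; without the two conjugations, the round trip would involve H^T H instead of H*H and the complex iteration would not converge to the singular vectors.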
In the real case, of course, complex conjugation is superfluous. The concept of normalizing, conjugating and returning a vector has a strong connection with the work of M. Fink (1997) in medical ultrasound. The observation that steps 1–5 give convergence to the first singular vector pair has been made by other authors, such as Bach Andersen (2000). However, using this property directly as part of a communication process is, to the best of our knowledge, new.

1.5.2.2 Links with sensory analysis

In its mathematical details, the BIMA algorithm is not much different from my "soft GPA" (or iterative GCA). Comparing equation (1.9) in the GPA section with equations (1.22) and (1.23) in the sections on mobile communications, both can be seen to be power methods for finding an eigenvector. However, there is one important difference. In (1.9), the eigenvectors could be found directly as the eigenvectors of \( \sum_i \mathbf{X}_i (\mathbf{X}_i^T \mathbf{X}_i)^{\dagger} \mathbf{X}_i^T \) using some numerical method. This could be done because the matrices X_i were "at hand to be played with". In the case of (1.22) and (1.23) this is impossible, because the matrix H is unknown. It is only through interaction with the transmit vectors x^(i) and y^(i) that H enters the equations, unless one wants to use training data for estimating H. We mentioned above that by modifying BIMA, one can transmit independent data streams using several singular modes. To keep the singular vector estimates independent, and prevent them all from converging to the top singular vector, the orthogonalization procedure that I used for the "soft GPA" (or iterative GCA) was used. Summing up, ideas that were superfluous in one area had the right to live in another. As a final detail, I will mention that the algorithm performs even better if the singular vector estimates, varying with time as the channel H itself varies, are smoothed.
The simplest way of doing this is to take the average of the last few singular vector estimates. However, the average of matrices of orthogonal vectors need not be orthogonal itself. In the fourth paper, a modified average borrowing methodology from Procrustes Analysis is used for smoothing the singular vectors.

1.6 A two-agent problem

The final part of my thesis was a joint manuscript with Nils Christophersen, Ole Christian Lingjærde and Nils Lid Hjort, considering a closely related but much more difficult problem: in Frequency Division Duplex (FDD) systems, the transmit/receive equations differ from equations (1.17) and (1.21) above. If H is used for transmission of x to become y = Hx + n, the opposite relation is not x = H^T y + n, but

x = Gy + n    (1.31)

Whereas H and H^T are perfectly related, G can be quite different from H and might represent a "different physical reality" than H. This will typically happen if the two communicating parties use different frequencies (wavelengths) for data transmission. For instance, the scattering of high-frequency waves will often be different from the physics of low-frequency ones. G is then a new quantity that must be accounted for in the algorithms. It is still desirable to find the singular vectors of G and H in order to maximize the channel performance. It turns out that the problem of finding the top singular modes of the two channels H and G is one in which the two parties (base station X, mobile subscriber Y) have to cooperate.

There is currently great interest in the artificial intelligence (AI) community in so-called "multi-agent systems" (MAS). Among the more recent references are Cruz and Simaan (1999) and Wolpert et al. (1999). There are many applications for problems with multiple agents, and crucial elements (Stone, 2000) include agent heterogeneity, knowledge of strategies, control distribution and communication possibilities.
Also, the agents have individual goals, expressed as reward functions, which must be chosen carefully so that fulfillment of the individual goals also leads to fulfillment of the overall goal. There is a risk that agents might work at “cross-purposes”, getting in each other's way and frustrating one another when trying to solve their respective problems (interaction problems). In the final thesis paper, we cast the problem of finding the top singular modes of G and H into a two-agent problem setting. We show how the problem can be solved using a leader-follower strategy. Again, the use of optimal rotations is central in the work. A Procrustes statistic is used for performance assessment, and the polar decomposition, which is a close relative of the orthogonal Procrustes transformation, is one of the central building blocks in the algorithm. This emphasizes the close methodological connection between my contributions in these otherwise quite different areas of applied statistics.

Bibliography

[1] Arnold, G.M., Williams, A.A. (1985) The use of generalized Procrustes techniques in sensory analysis. In: Statistical Procedures in Food Research, Piggot, J.R. (Ed.)
[2] Bach Andersen, J. (2000) Array gain and capacity for known random channels with multiple element arrays at both ends. IEEE Journal on Selected Areas in Communication, Vol. 18, No. 11, pp. 2172–2178.
[3] ten Berge, Jos M.F., Bekker, Paul A. (1993) The isotropic scaling problem in generalized Procrustes analysis. Computational Statistics and Data Analysis, Vol. 16, No. 2, pp. 201–204.
[4] van der Burg, E. & Dijksterhuis, G. (1996) Generalized canonical analysis of individual sensory profiles and instrumental data. In: Multivariate Analysis of Data in Sensory Science, edited by Naes, T. & Risvik, E., Elsevier Science.
[5] Carroll, J.D.
(1968) Generalization of canonical analysis to three or more sets of variables. Proceedings of the 76th Convention of the American Psychological Association, 3, pp. 227–228.
[6] Cruz, J.B., Jr., Simaan, M.A. (1999) Multi-agent control strategies with incentives. In: Proceedings, Symposium on Advances in Enterprise Control, pp. 177–182, San Diego, CA.
[7] Dahl, T. (1998) Empirical Modeling of Human Motion. M.Sc. Thesis, University of Oslo, Department of Mathematics.
[8] Dryden, I.L., Mardia, K.V. (1998) Statistical Shape Analysis. Wiley.
[9] van de Geer, J.P. (1984) Linear relations among k sets of variables. Psychometrika, Vol. 49, No. 1, pp. 79–94.
[10] Fink, M. (1997) Time-reversed acoustics. Physics Today, Vol. 20, pp. 34–40.
[11] Foschini, G.J. (1996) Layered space-time architecture for wireless communications in a fading environment. Bell Labs Technical Journal, Vol. 1, No. 2, pp. 41–59.
[12] Foschini, G.J., Gans, M.J. (1998) On the limit of wireless communications in a fading environment when using multiple antennas. Wireless Personal Communications, Vol. 6, No. 3, pp. 311–335.
[13] Foschini, G.J. (1998) Layered space-time architecture for wireless communication. Bell Labs Technical Journal, Vol. 1, pp. 41–59.
[14] Gesbert, D., Akhtar, J. (2002) Breaking the barriers of Shannon’s capacity: An overview of MIMO wireless systems. Telektronikk (Telenor’s journal).
[15] Golden, G.D., Foschini, G.J., Valenzuela, R.A. and Wolniansky, P.W. (1999) Detection algorithm and initial laboratory results using the V-BLAST space-time communication architecture. Electronics Letters, Vol. 35, No. 1, pp. 14–15.
[16] Golub, G. & Van Loan, C.F. (1996) Matrix Computations, 3rd ed. The Johns Hopkins Univ. Press.
[17] Goodall, C. (1991) Procrustes methods in the statistical analysis of shape (with discussion). Journal of the Royal Statistical Society: Series B, Vol. 53, pp. 285–339.
[18] Gower, J.C. (1975) Generalized Procrustes analysis. Psychometrika, Vol. 40, pp. 33–51.
[19] Green, B.F. (1952) The orthogonal approximation of an oblique structure in factor analysis. Psychometrika, Vol. 17, pp. 429–440.
[20] Gruvaeus, G.T. (1970) A general approach to Procrustes pattern rotation. Psychometrika, Vol. 35, pp. 493–505.
[21] Kendall, D.G. (1984) Shape manifolds, Procrustean metrics and complex projective spaces. Bulletin of the London Mathematical Society, Vol. 16, pp. 81–121.
[22] Kendall, D.G. (1989) A survey of the statistical theory of shape. Statistical Science, pp. 87–120.
[23] Kent, John T., Mardia, Kanti V. (1997) Consistency of Procrustes estimators. Journal of the Royal Statistical Society: Series B, Vol. 59, pp. 281–290.
[24] Kettenring, J.R. (1971) Canonical analysis of several sets of variables. Biometrika, Vol. 58, pp. 433–451.
[25] Kiers, H.A.L., Cléroux, R., ten Berge, J.M.F. (1994) Generalized canonical analysis based on optimizing matrix correlations and a relation with IDIOSCAL. Computational Statistics and Data Analysis, Vol. 18, No. 3, pp. 331–340.
[26] Kristof, W. and Wingersky, B. (1971) Generalizations of the orthogonal Procrustes rotation procedure to more than two matrices. Proceedings of the 79th Annual Convention of the American Psychological Association, 6, pp. 81–90.
[27] Kroonenberg, P. & De Leeuw, J. (1980) Principal component analysis of three-mode data by means of alternating least squares algorithms. Psychometrika, Vol. 45, pp. 69–97.
[28] Le, H.-L. and Kendall, D.G. (1993) The Riemannian structure of Euclidean shape spaces: a novel environment for statistics. Annals of Statistics, Vol. 21, pp. 1225–1271.
[29] Langron, S.P. (1983) The application of Procrustes statistics to sensory profiling. In: Sensory Quality in Foods & Beverages: Definition, Measurement & Control, A.A. Williams & R.K. Atkin (Eds), Ellis Horwood Ltd, Chichester, pp. 89–95.
[30] Langron, S.P. and Collins, A.J. (1985) Perturbation theory for generalized Procrustes analysis.
Journal of the Royal Statistical Society: Series B, Vol. 47, pp. 277–284.
[31] Mardia, K.V., Kent, J.T., Bibby, J.M. (1997) Multivariate Analysis. Academic Press.
[32] Newman, W.M., Sproull, R.F. (1976) Principles of Interactive Computer Graphics. McGraw-Hill.
[33] Naes, T., Kowalski, B. (1989) Predicting sensory profiles from external instrumental measurements. Food Quality and Preference, 4/5, pp. 135–147.
[34] Peay, E.R. (1988) Multidimensional rotation and scaling of configurations to optimal agreement. Psychometrika, Vol. 53, pp. 199–208.
[35] Risvik, E. & Naes, T. (1996) Multivariate Analysis of Data in Sensory Science. Elsevier Science.
[36] Schönemann, P.H. (1966) A generalized solution to the orthogonal Procrustes problem. Psychometrika, Vol. 31, pp. 1–10.
[37] Schönemann, P.H. (1968) On two-sided orthogonal Procrustes problems. Psychometrika, Vol. 33, pp. 19–33.
[38] Schönemann, P.H. and Carroll, R.M. (1970) Fitting one matrix to another under choice of central dilation and rigid motion. Psychometrika, Vol. 35, pp. 245–255.
[39] Sibson, R. (1978) Studies in the robustness of multidimensional scaling: Procrustes statistics. Journal of the Royal Statistical Society: Series B, Vol. 40, pp. 234–238.
[40] Sibson, R. (1979) Studies in the robustness of multidimensional scaling: perturbation analysis of classical scaling. Journal of the Royal Statistical Society: Series B, Vol. 41, pp. 217–229.
[41] Stone, P., Veloso, M. (2000) Multiagent systems: A survey from a machine learning perspective. Autonomous Robots, Vol. 8, No. 3.
[42] Telatar, I.E. (1995) Capacity of multi-antenna Gaussian channels. Bell Labs Technical Memorandum.
[43] Tenenhaus, M. (1987) Generalized canonical analysis. Bernoulli, Vol. 2, pp. 133–136.
[44] Wolpert, D.H., Wheeler, K.R., Tumer, K. (1999) General principles of learning-based multi-agent systems. In: Proceedings of the Third International Conference on Autonomous Agents (Agents’99).
[45] Ziezold, H.
(1977) On expected figures and a strong law of large numbers for random elements in quasi-metric spaces. In: Transactions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes and of the 1974 European Meeting of Statisticians, Vol. A, pp. 591–602, Prague. Academia: Czechoslovak Academy of Sciences.

Chapter 2 Short Summary of the Papers

Paper One: A Bridge Between Carroll’s Generalized Canonical Analysis and the Tucker-1 Method. T. Dahl and T. Næs. Submitted to Psychometrika, in 2nd review.

This paper is addressed to the sensory science community. When working with score chart analysis for multiple judges, there are various subcultures within this community, and the choice of analysis tools is sometimes more a matter of background than of anything else. This paper demonstrates how two popular methods, Generalized Canonical Analysis and the Tucker-1 method for three-way analysis, although seemingly very different, essentially derive their results from the same structures in the data set. Moreover, the paper shows how the two methods can be combined in a joint ridge-regression-like framework, and that this bridge, determined by a parameter α, is a new method in its own right. Choice of the parameter setting in various theoretical and real situations is discussed, and illustrated with new kinds of plots and figures. The main contributions are (a) demonstrating the close connection between two methods which seem very different, and (b) developing new ways of visualizing the workings of multivariate methods in sensory science.

Paper Two: Outlier and Group Detection in Sensory Panel Analysis with the Procrustes Distance. T. Dahl and T. Næs. To be submitted to Food Quality and Preference.

This paper also concerns sensory analysis. When score charts, represented as matrices (N food samples times P tasting categories), are obtained from several judges, it is of interest to find a consensus matrix for the whole tasting panel. Often, the judges use terms (tasting categories) and scales differently, and this must be taken into account to make a consensus. To this end, Generalized Procrustes Analysis (GPA) is commonly used. However, it might be that no common structure can be found for the whole tasting panel. This paper introduces the Procrustes distance, developed in statistical shape analysis, into sensory analysis. It is used for grouping panel judges who give similar assessments of the samples into clusters. This ensures that the consensus comes from a relatively homogeneous group of panel members, differing only in their use of terms, and not in their underlying perception of the products/samples. Outliers will typically be grouped with the other judges at a very late stage, which can easily be seen from the dendrograms. Validation of the method using process data is also demonstrated. The main contribution is the use of the Procrustes distance for hierarchical clustering.

Paper Three: BIMA - Blind Iterative MIMO Algorithm. T. Dahl, N. Christophersen and D. Gesbert. Accepted for ICASSP 2002.

In multi-antenna wireless communication systems, MIMO ("Multiple Input, Multiple Output") systems in particular, transmission is through a matrix channel. Prior to symbol transmission, the channel must be estimated. For this estimation, training data, occupying as much as 20% of the capacity, is needed to track the time-varying channel. Alternatively, the channel can be estimated blindly, but this normally requires the use of computer-intensive higher-order methods. We have developed an algorithm (BIMA) that finds the optimal transmission modes of a MIMO channel, without the need for a statistical estimate of the channel, in a computationally effective way.
Furthermore, this method can track the optimal modes of a time-varying channel, as part of normal operation, and without extra computational and channel capacity costs. The algorithm is a variant of the power method, a numerical method for estimating eigenvectors. It has a connection with the “time reversal mirror” used in ultrasound imaging. The main contributions are (a) the exploitation of the ‘intrinsic power method’ in uplink and downlink communication, and (b) blind channel estimation with iterative estimation of the singular vectors.

Paper Four: Blind MIMO Estimation based on the Power Method. T. Dahl, N. Christophersen and D. Gesbert. To be submitted to IEEE Transactions on Signal Processing.

This paper further develops the BIMA algorithm. More details and simulations are given, and regularization/smoothing of the transmission parameters is introduced, improving the SNR considerably. More background on MIMO, as well as the method’s connection with the “time reversal mirror”, is given.

Paper Five: The Game of Blind MIMO Channel Estimation. T. Dahl, N. Christophersen, O.C. Lingjærde, and N. Lid Hjort.

The BIMA algorithm works for reciprocal channels only. If communication from one side to the other (uplink) is through a matrix channel, then communication the other way (downlink) is assumed to be through the transposed matrix channel. When different frequencies are used for uplink and downlink communication (Frequency Division Duplex), this condition fails to hold. We show that it is still possible to blindly estimate the optimal transmission modes for the uplink and downlink channels. This is formulated as a two-agent problem in game theory. It borrows its notation and ideas from the field of Multi-Agent Systems, a subfield of Artificial Intelligence.
Various techniques from statistical analysis, such as non-linear nested optimization, optimal rotations, quadratic data fitting and stochastic simulations, are used to present a framework for the solution of the problem. The main contributions are (a) the formulation of the problem in a two-agent framework, (b) the use of ellipsoid fitting for estimating one set of singular vectors only (partial SVD), and (c) the use of principal component analysis for removing non-linearities in the optimization.

Chapter 3 Papers

• Paper I: A Bridge between Tucker-1 and Carroll’s Generalized Canonical Analysis
• Paper II: Outlier and Group detection in Sensory Analysis using Hierarchical Cluster Analysis with the Procrustes Distance
• Paper III: BIMA - Blind Iterative MIMO Algorithm
• Paper IV: Blind MIMO Estimation based on the Power Method
• Paper V: The Game of Blind MIMO Channel Estimation

Paper I: A Bridge between Tucker-1 and Carroll’s Generalized Canonical Analysis

A Bridge between Tucker-1 and Carroll’s Generalized Canonical Analysis
Tobias Dahl and Tormod Næs∗

Abstract

This paper concerns tools for analyzing relationships between and within multiple data matrices. A new unified approach is developed, bridging two existing methods: Carroll’s Generalized Canonical Analysis (GCA) and the Tucker-1 method for principal component analysis of multiple matrices. GCA and Tucker-1 are shown to correspond to particular choices of a ridge parameter. The unified method may again be generalized to a larger space of methods.

key words: Generalized Canonical Analysis, Three-way Factor Analysis, Singular Value Decomposition, Ridge Regression, Principal Components, MANOVA.

1 Introduction

A common problem in applied statistics is to relate a number of matrices {X_i} to each other, in order to find common and unique structures. Typical examples are sensory analysis, i.e.
tasting experiments of food (Baardseth et al., 1992, Amerine et al., 1965), and analysis of shape or body motions (Dahl, 1998). In the former example, each matrix contains sensory scores given by one single assessor. Each column may represent a variable (attribute), and each row a sample or an object. An important issue is to investigate the common structures among the individuals’ sensory ratings (Naes & Risvik, 1996), and also to understand what is unique information. The purpose may be quality control of the sensory panel, better understanding of individual differences of perception and scoring, or simply the need for a sensible consensus matrix of data representing the whole panel. Such a consensus can be used for further statistical analysis. In the second example, each matrix contains trajectories for a number of markers (usually bright reflexes) through a particular body movement. The matrices are related in order to find common structures in movement patterns, summarizing how different joints and limbs co-ordinate for each individual, as well as detecting common structures for the subjects (Dahl, 1998).

∗ MATFORSK (Norwegian Food Research Institute) and University of Oslo.

A number of different techniques have already been suggested for such studies. The most well known are probably the 3-way factor analysis methods PARAFAC, Tucker-1, Tucker-2 and Tucker-3 (Kroonenberg 1980, Naes & Kowalski 1989), Generalized Procrustes Analysis (Kristof & Wingersky 1971, Gower 1975, ten Berge 1977) and various versions of Generalized Canonical Analysis (Carroll 1968: MAXVAR; Kettenring 1971: SUMCOR, MINVAR, SSQCOR, GENVAR; van de Geer 1984, Tenenhaus 1987, Kiers et al. 1994, van der Burg & Dijksterhuis 1996). All these techniques are in use, but are based on quite different approaches and belong to apparently quite different ways of handling the problems.
There are, however, some common features which seem to have been little investigated in the literature. For instance, all techniques end up with a "consensus" type of matrix representing common information for the whole dataset. They also provide information about how, and how well, the different matrices {Xi} relate to this consensus. This paper is devoted to a discussion of some mathematical and practical relationships between some of these techniques. Building a "bridge" between three-way factor analysis, in particular Tucker-1, and Carroll's Generalized Canonical Analysis will be the main focus. First, the techniques are presented and discussed as members of two quite different philosophies. It is then shown that both can be considered as special cases of a more general model framework - a bridge. Computations on real sensory data will be used for illustrations. A number of plots facilitating empirical studies will also be proposed, and validation of the methods will be discussed. The unified framework connecting GCA and Tucker-1 may again be generalized; it gives a class of potentially interesting methods for analysis of three-way data. The generalized framework is flexible, and can be used to connect three-way analysis with other methods, such as regression and classification.

2 Two classes of methods

In this paper, we will consider situations where a number of data matrices {X1, X2, ..., XQ}, Xi ∈ R^{N,Pi}, are available. Each row corresponds to a sample/object (a total of N), and each column to a variable, typically a dimension of evaluation, for instance a taste score. The number of variables Pi may differ between the matrices. If Pi = P for all i, the data {Xi} can be regarded as a three-way (N × P × Q) structure X. Throughout the paper, the profiles (or matrices) are assumed to be centered(1): Xi = H Xi, with H = I_N - (1/N) 1_N 1_N^T.
(1) This kind of centering is common in PCA and in correlation-based methods, to ensure one is working with the correct covariance and correlation matrices. In Procrustes-like methods it is interpreted as a translation step, centering the objects around the origin.

2.1 Methods based on linear fitting

One of the important classes of methods for relating Q matrices to each other is based on estimating individual transformation matrices {βi}, right-multiplied by their corresponding matrices {Xi} in order to approximate a consensus matrix Y. This can be formalized as

    min_{ {βi}, Y } Σ_i ||Xi βi - Y||²    (1)

The minimization may be subject to constraints on {βi}, on Y, or on both. Note that some type of restriction is always needed, in order to avoid the trivial zero solution. The most well known technique based on such a sum is Generalized Procrustes Analysis (GPA; Gower, 1975; ten Berge, 1977). It restricts each βi to be orthogonal, βi^T βi = I, and works in an iterative way to minimize the criterion by rotations and reflections. The outcome of this process is a full rank consensus matrix Y ∈ R^{N,P} with no reduction to a lower-dimensional space, although variants with dimension reduction exist (Peay, 1988). The ordinary GPA consensus Y is sometimes considered a better representative of the set {Xi} than the straightforward average Y0 = (1/Q) Σ_i Xi (see e.g. Langron, 1983, or Arnold & Williams, 1985). Generalized Canonical Correlation Analysis (GCA) can also be cast into this framework. It is usually applied with individual transformation matrices {βi} that have a reduced number of columns. The Y matrix represents not only a type of consensus, but also information compressed to a lower-dimensional space, which may be important for interpretation purposes. GCA restricts Y to have orthogonal columns. The criterion (1) can be reformulated to become a measure of the correlation between the profiles (see below).
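As a concrete numerical illustration (with hypothetical toy data, not from the paper), the orthogonal sub-problem of criterion (1) for a single βi with Y held fixed - the rotation step used inside GPA iterations - has a closed-form SVD solution, available in SciPy as `orthogonal_procrustes`:

```python
# Sketch of one GPA rotation step: find the orthogonal beta_i minimizing
# ||Xi beta_i - Y||_F for a fixed consensus Y. Toy data throughout.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(1)
Y = rng.standard_normal((10, 3))                      # current consensus (toy)
Xi = Y @ rng.standard_normal((3, 3)) \
     + 0.1 * rng.standard_normal((10, 3))             # one noisy profile

beta_i, _ = orthogonal_procrustes(Xi, Y)              # argmin ||Xi b - Y||_F over b^T b = I
```

Iterating this step over all profiles and re-estimating Y as the average of the rotated profiles gives the basic GPA loop.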
Users of methods such as GCA and GPA may be interested in several aspects of the results. Important examples are the common information represented by Y, the degree of fit of the individual matrices, or the coefficients {βi}, which assign each sample coordinates in an interesting common space.

2.2 Three-way data compression methods

Another, and equally important, class of techniques is the family of so-called three-way factor analysis methods. These methods also give a common scores matrix Y as output. This Y is multiplied by different types of restricted loadings matrices in order to approximate {Xi}. A general framework is

    min_{ {βi}, Y : Y^T Y = I } Σ_i ||Xi - Y βi||²    (2)

Here, Y is the consensus (or common scores) matrix and {βi} are the individual loading matrices. For compression to take place and the approach to be meaningful, the number of columns Py in Y needs to be low, typically Py < Pi for all i. Y is usually assumed orthogonal, but other restrictions may be imposed. The methods in this class differ in the way that {βi} is restricted. With no restriction on {βi}, the method is called Tucker-1(2). Tucker-2 uses βi = Ri Q, where Q is the common loadings matrix and Ri is the rotation matrix for the different individuals. PARAFAC and Tucker-3 apply other restrictions (Naes & Kowalski, 1989). This paper deals with Tucker-1, from now on referred to only as the Tucker method. The applications of three-way methods are similar to those of PCA with ordinary two-way data, loadings and scores providing compressed information.

3 The relation between GCA and Tucker-1

The purpose of this paper is to connect these two apparently different method classes to each other. In particular, we will concentrate on the methods Tucker-1 and GCA and show that they can be formulated within a common framework.
It will be shown that both these methods, as well as solutions in between, can be found by constructing a matrix Z depending on a parameter α, and extracting eigenvectors from this matrix.

3.1 Carroll's Generalized Canonical Analysis

The GCA problem is usually formulated by using a correlation criterion, defined as

    max_{βij, yj} Σ_i corr²(Xi βij, yj)    (3)

under the restriction Y^T Y = I. van der Burg et al. (1994) have shown that the solution vectors in Y of the GCA problem are the orthogonal eigenvectors of Z_GCA = Σ_i Xi (Xi^T Xi)† Xi^T (M† denoting the Moore-Penrose inverse of M), and that the natural ordering of columns is by the ranking of the corresponding eigenvalues. There is of course also the question of "fairness" (Van de Geer, 1984) and weighting, e.g. to use a representation Z_GCA^w = Σ_i wi Xi (Xi^T Xi)† Xi^T, but this can be accomplished by suitable scaling of the profiles {Xi}, and will not be pursued here. Note also that Z_GCA = Σ_i Ui Ui^T, if Ui Si Vi^T = Xi is an SVD of Xi. van der Burg et al. (1994, Appendix) also showed that the GCA problem can be reformulated to become

    min_{βi, Y} Σ_i ||Xi βi - Y||²  subject to  Y^T Y = I    (4)

showing that the GCA method indeed is a special case of the general framework described in section 2.1.

3.2 Tucker-1

The Tucker-1 method can be formulated in a number of different ways, depending on the purpose of the study and the arrangement of matrices and vectors. Tucker-1 is sometimes referred to as an "unfolding" method (Naes & Kowalski, 1989), since its solution vectors can be obtained by unfolding the three-dimensional (N × P × Q) structure X.

(2) Since Tucker-1 is an unfolding PCA, it can be shown that a restriction of the form βi = R Qi^T, with Σ_i Qi^T Qi = Q^T Q = I, is (trivially) fulfilled once the Y matrix is found.
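The unfolding view can be sketched numerically (with hypothetical toy profiles, not the paper's data): placing the centered profiles side by side and taking left singular vectors of the unfolded matrix yields the common scores, i.e. the eigenvectors of Σ_i Xi Xi^T.

```python
# Sketch of Tucker-1 as an "unfolding" PCA on toy data.
import numpy as np

rng = np.random.default_rng(0)
profiles = [rng.standard_normal((16, 5)) for _ in range(4)]
profiles = [X - X.mean(axis=0) for X in profiles]   # column-centre (X_i = H X_i)

X_unf = np.hstack(profiles)                         # N x (P_1 + ... + P_Q) unfolded matrix
Y, s, _ = np.linalg.svd(X_unf, full_matrices=False)
Y = Y[:, :3]                                        # first three common score vectors
```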
The Tucker-1 method can be defined properly as

    min_{ {βi}, Y : Y^T Y = I } Σ_i ||Xi - Y βi||²    (5)

It is well known that the solution vectors in Y of the Tucker problem are the eigenvectors of Z_Tucker = Σ_i Xi Xi^T, see e.g. Gifi (1990). In SVD notation, Z_Tucker = Σ_i Ui Si² Ui^T. Note the similarity between this solution and the SVD version of GCA.

3.2.1 Ridge GCA and its connection with Tucker-1

Ridge techniques have been used successfully for regularization of various multivariate methods. The classic Ridge Regression (Hoerl & Kennard, 1970), as well as Regularized Discriminant Analysis (Friedman, 1989), are examples. An analogue for GCA is now presented. The usefulness of the ridge parameter in this context has two aspects: First of all, since GCA is a correlation-based technique, the ridge parameter can be used to improve stability of the solution in cases with collinear data. Secondly, introducing the ridge parameter α leads to a parameterization connecting GCA and Tucker-1. Let Xi = Ui Si Vi^T again be the SVD. The following technique down-weights the influence of the components with low associated variance: Replace Z_GCA by Z_Ridge,

    Z_Ridge = Σ_i Xi (Xi^T Xi + αI)† Xi^T    (6)
            = Σ_i Ui Si Vi^T (Vi Si² Vi^T + αI)† Vi Si Ui^T    (7)
            = Σ_i Ui Si Vi^T [Vi (Si² + αI)† Vi^T] Vi Si Ui^T    (8)
            = Σ_i Ui Si (Si² + αI)† Si Ui^T = Σ_i Ui Di(α) Ui^T    (9)

where Di(α) = diag{σij² / (σij² + α)}. The equation (6) can without loss of generality be rewritten as

    Z_α = Σ_i Xi [(1-α) Xi^T Xi + αI]^{-1} Xi^T    (10)

The set of matrices (up to scaling by a constant) covered by Z_Ridge for α ∈ [0, ∞) now appears on a [0, 1] scale. Restricting the parameter of interest to a closed interval simplifies further analysis greatly. Comparison of different α values is much easier, and construction of plots and figures with α varying along one or two axes is straightforward. Note that GCA is still obtained for α = 0, while α = 1 yields Tucker-1.
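The bridge matrix of (10) can be assembled directly in the SVD domain; a minimal sketch (a hypothetical helper on toy data, not the paper's code), where α = 0 gives the GCA matrix Σ_i Ui Ui^T and α = 1 gives Z_Tucker = Σ_i Xi Xi^T:

```python
# Sketch of the ridge bridge Z_alpha, built from the per-profile SVDs.
import numpy as np

def z_alpha(profiles, alpha):
    """Z_alpha = sum_i U_i diag{ s_ij^2 / ((1-alpha) s_ij^2 + alpha) } U_i^T.
    Note: alpha = 0 needs nonzero singular values (or dropping null directions,
    as the pseudo-inverse in the GCA definition does)."""
    N = profiles[0].shape[0]
    Z = np.zeros((N, N))
    for X in profiles:
        U, s, _ = np.linalg.svd(X, full_matrices=False)
        lam = s**2 / ((1.0 - alpha) * s**2 + alpha)   # per-component weights
        Z += (U * lam) @ U.T
    return Z
```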
It is easily shown that

    Z_α = Σ_i Ui Λi(α) Ui^T    (11)

where Λi(α) = diag{σij² / ((1-α)σij² + α)}. This bridge, defined by Λi(α), is of course not the only conceivable bridge between the methods. One could ask whether one α parameter for each matrix, αi, would not be more appropriate. This question parallels the problem of choosing between quadratic and linear discriminant analysis in a supervised classification problem. Fewer parameters make estimation more robust, at the cost of reduced flexibility. Clearly, choosing one αi for each Xi would bring a certain selectivity into the regularization, shrinking some matrices more and others less. Thus, only the use and selection of a single α is treated.

3.2.2 A possible generalization of the framework

All methods in the unified framework can, just as GCA and Tucker, be seen from a principal vector point of view. The solution vectors in Y are then the principal column components (left singular vectors) extracted from Ũ,

    Ũ = (U1 Λ1(α), U2 Λ2(α), ...)    (12)
    Λi = diag{λi1(α), λi2(α), ...}    (13)

This has previously been observed by van de Geer (1984), who raises the question of "what to analyze?": the original profiles {Xi}, or the matrices {Ui} spanning their column spaces. Tucker-1 is equivalent to (PCA-)analyzing the original profiles, while GCA amounts to analyzing the principal vectors. The two methods also correspond to specific choices of {Λi}. A whole space of methods can be defined by varying the elements of the individual diagonal coefficient matrices {Λi}. The weighting of an individual principal vector uij affects the probability that uij will be well described by the first solution vectors in Y. Some other methods in this class are discussed at the end of the paper.

3.2.3 Intuitive explanation of the framework

Analogous to PCA for matrix data, the Tucker-1 and GCA methods seek to describe variability, i.e. covariance structure, for a set of matrices.
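The principal-vector route above can be sketched in code (a toy helper, not from the paper), under the assumption that the diagonal weights in the concatenated matrix are the square roots of the eq. (11) weights, so that Ũ Ũ^T = Z_α and the left singular vectors of Ũ are the solution vectors:

```python
# Sketch: solution vectors of the unified framework as left singular vectors
# of the weighted concatenation U~ = (U_1 L_1(alpha), U_2 L_2(alpha), ...).
import numpy as np

def bridge_scores(profiles, alpha, k):
    """First k solution vectors for a given alpha (0 = GCA, 1 = Tucker-1)."""
    blocks = []
    for X in profiles:
        U, s, _ = np.linalg.svd(X, full_matrices=False)
        lam = np.sqrt(s**2 / ((1.0 - alpha) * s**2 + alpha))  # sqrt of eq. (11) weights
        blocks.append(U * lam)
    Y, _, _ = np.linalg.svd(np.hstack(blocks), full_matrices=False)
    return Y[:, :k]
```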
According to the mathematical description above, the difference between the covariance structures extracted by the two methods can be stated as follows: (1) Tucker-1 is a weighted GCA, in the sense that the singular vectors, appearing with no extra weights in the GCA expression Z = Σ_i Σ_j uij uij^T, are weighted to become Z = C Σ_i Σ_j σij² uij uij^T, where C is an irrelevant constant. (2) Analogously, GCA is a Tucker-1 analysis on singular vector/value pairs where the singular values are set equal to 1. In this sense, GCA is a normalized Tucker-1 analysis. In the Xi-domain, it is clear that GCA is a Tucker-1 analysis on normalized data with normalization factors (Xi^T Xi)^{-1/2}.

3.2.4 Noise and Stability

Stability. Tucker-1 is focused on describing the X-matrices in the best possible way. Thus, it will not be as influenced by the small principal vectors as may be the case with GCA. The first principal vectors are weighted higher, in a relative sense, in Tucker-1 than in GCA.(3) The larger eigenvectors are usually more stable with respect to perturbation of {Xi} than the rest, indicating that Tucker modeling may give more stable solutions than GCA.

Low-variance components. In practical problems, there might be some singular values σij very close to zero, indicating that a very small portion of the variance is explained by the corresponding vector. The occurrence of such small singular values can be due to high collinearity in the underlying "true" variables, as well as numerical errors and noise in the data. In the α = 0, or GCA, case, this introduces an error source in the estimation of the solution eigenvectors, especially if the number of profiles Q is low. Increasing α slightly would weight down the corresponding low-variance singular vectors, giving a regularizing effect. It could, however, be argued that the need for regularization is not too great.
If a low-variance component from a profile is given a considerable weight (as e.g. in GCA, where all components are weighted equally), it is unlikely to be matched and strengthened by the low-variance components of other profiles unless some true common structure exists between them. When the contributions are summed and eigenvectors extracted, the influence of the noise component on the solution space R(Y) is likely to be low.

(3) Whether the actual weight increases or not depends on whether σij < 1 or σij > 1, but in any case the influence of the associated singular vector is higher for the first singular vectors in Tucker-1.

An alternative way of regularizing would be to remove all singular vectors uij with associated variance below a certain threshold from the basis vector matrices Ui. Doing this, one of course runs the risk of losing contributions to important low-variance vectors carrying interplay between the profiles. A table is given below, summarizing algebraic results, interpretation and stability issues.

Table 1: Summary of the properties of the GCA/Tucker framework

                        GCA (α = 0)                  Ridge GCA (0 < α < 1)                    Tucker-1 (α = 1)
    Criterion           min Σ_i ||Xi βij - Y||²,     -                                        min Σ_i ||Xi - Y βij||²,
                        Y^T Y = I                                                             Y^T Y = I
    {yj} are eigen-     Σ_i Xi (Xi^T Xi)† Xi^T       Σ_i Xi [(1-α) Xi^T Xi + αI]^{-1} Xi^T    Σ_i Xi Xi^T
    vectors of
    ...in SVD domain    Σ_i Ui Ui^T                  Σ_i Ui Λi(α) Ui^T,                       Σ_i Ui Si² Ui^T
                                                     λij²(α) = σij² / ((1-α) σij² + α)
    {yj} are left       X* = (X*1, X*2, ...),        X~ = (X~1, X~2, ...),                    X = (X1, X2, ...)
    principal           X*i = Xi (Xi^T Xi)^{-1/2}    X~i = Xi [(1-α) Xi^T Xi + αI]^{-1/2}
    vectors of
    ...in SVD domain    U = (U1, U2, ...)            Ũ = (U1 Λ1(α), U2 Λ2(α), ...)            U* = (U1 S1, U2 S2, ...)
    Component           All components               -                                        High-variance components
    weighting           weighted equally                                                      are favoured
    Focus of analysis   Maximizing correlation       -                                        Describing variability
    Noise               Sensitive                    -                                        Insensitive

4 Selection of the Ridge α

The question of validation is central in regularization problems: What choice of the regularization parameter(s), in this case α, should be used? The answer depends on the problem setting. If there is external information related to the profiles, external validation may be used for choosing α. In regression problems, this corresponds to parameter selection with respect to the response y. Methods such as PCR, PLS, TSVD and CG are usually cross-validated on the MSE with respect to y.

4.1 Internal Validation

Looking at the unified framework as a set of tools for finding structures of relationships, two ideas for selecting the number of components are the following: (1) Cross-validation, leaving out either samples or whole profiles {Xi}: If a GCA/Tucker model is estimated for all but one sample, the predictability of this sample (the residual error when the sample is projected onto the reduced model) can be tested with a varying number of components. The procedure is repeated for all the samples, removing one sample, putting the old one back, and calculating the MSE over all samples. (2) Looking for stable intervals of the α parameter, e.g. intervals over which the solution space is stable (the vectors in Y vary little), and selecting α as, for instance, the midpoint of such an interval. Linear subspaces can be compared by means of principal angles (Golub & Van Loan, 1996)(4). The degree of change in the principal angles may be used as a measure of stability.

4.2 External Validation by MANOVA

When external information exists, such as design variables or direct measurements of related quantities, it is possible to use a MANOVA model and corresponding testing methodology to find out which value of the parameter α is most appropriate.
Assuming that the solution vectors in Y relate in a linear way to the design variables (or predictors), Wilk's λ and its associated P-value from the Wilk's λ distribution can be used to determine the influence of the design (or, alternatively, the individual design variables) on the solution space. A low P-value is a clear indication of good correspondence, or in other words, that a particular choice of α is a good one. Selection of α by MANOVA will be illustrated at the end of the paper.

5 Visualization Tools

There is an extensive literature on visualization tools for GCA; see e.g. Dijksterhuis & van der Burg (1996) for an overview. Methods in use involve object scores and loading plots, and various techniques for studying the behavior of samples, variables and profiles. The tools presented here are more focused on the solution vectors in Y than on individual samples and variables, and are, to the authors' best knowledge, new. For the practical user of the framework, it is worth mentioning that existing methods from the GCA and 3-way traditions can be adapted to work within the unified framework.

(4) Comparing subspaces by principal angles is actually a canonical correlation analysis.

5.1 Hiding plots

A useful viewpoint on the relationship between the solution vectors in Y and the column spaces R(Xi) of the profiles {Xi} is that the solution vectors are "embedded" or "hidden" in the column spaces. The methods in the unified framework serve as "filters", recovering the solution vectors. The "hiding place" of the vectors may be of interest, raising questions like "Were the first solution vectors constituted from (linear combinations of) high-variance or low-variance components of the individual profiles?" The hiding plot is a bar plot of the principal variances in a profile (σij²), topped with a curve illustrating the projection coefficients of a solution vector yj onto the principal components Ui. The curves are plotted from the components of the vectors hij.
    hij = |Ui^T yj|²    (14)

Each hij (with j varying over the solutions yj) can be seen as a curve by plotting its component values against their index.(5) Spline smoothing is applied to the curve to make the figure more readable. Figure 1 shows an example for our dataset.

5.2 Manhattan plots

The solution vectors in Y, as well as the principal components {uij}, are vectors that account for variability. In PCA-like applications it is customary to look at projections of the data onto principal vectors, to see how much of the variance is explained by a "reduced model". In the following, this idea is developed further in two directions. First, the reduced model need no longer be defined by principal vectors. Instead, projections of the data onto an arbitrary basis (explanatory variables in W) are considered. Second, a matrix is constructed that holds cumulative projection variances of the original data columns onto the given basis. This matrix can be visualized in a figure, where high values are bright and low values are dark. The Manhattan matrix H(X, W) ∈ R^{Q,P} is defined by its elements

    hij(X, W) = Σ_{r=1..i} (wr^T xj)² / ||xj||²    (15)

where xj and wr are the j'th and r'th column vectors of X and W respectively, under the assumption that W^T W = I, i.e. the explanatory variables are normalized and orthogonalized. It is easily seen that 0 ≤ hij ≤ 1. Three natural examples of Manhattan plots are (i) to look at H(Xi, Ui), which can be used in a standard PCA way to see how well the data are explained by a reduced model, (ii) to look at H(Xi, Y), to see how well each of the profiles' columns can be explained by the solution vectors in Y, and

(5) This way of visualizing vectors is common, e.g. in NIR spectroscopy.
[Figure 1: Hiding plots for the eight profiles of the sausage data, panels (a)-(h), using α = 0.05. Two curves are plotted on top of each set of principal bars. The solid line (-) corresponds to the first solution vector y1, the dashed one (- -) to the second solution vector y2 (the third vector was omitted for readability). It is clearly seen that y1 was mainly constituted from the first principal components of the profiles, with the notable exceptions of the fourth (d) and the sixth profile (f). y1 was also influenced by some medium-variance components, see e.g. (b), (c), (e) and (f). The second solution vector y2 has less obvious projections, though one could argue that it is influenced by the subsequent principal components 2-4, see e.g. (a), (b), (c), (g) and (e), which all have dashed-line peaks in these neighborhoods.]

(iii) to look at the relationship between the solution vectors in Y and the individual principal column vectors through H(Ui, Y), giving insight into how well each principal basis is explained by the common basis. Figure 2 shows Manhattan plots H(Xi, Ui) for the first four profiles of the sausage data set analyzed at the end of the paper. Examples of this approach are found in figures 5-6.

5.3 CORV plots

CORV (CORrelation-to-Variability-explanation) plots can be used to identify what the solution vectors explain. By changing α, Y(α) yields solution vectors lying on a scale. This scale has "correlation scoring" and "explanation of variability" as its ends.
    GCA <---------- α ----------> Tucker-1
    Correlation scoring <--> Explanation of variability

One may ask "Could the GCA solution vectors also pass as Tucker vectors, or vice versa?" A vector designed to focus on correlation (GCA modeling) may also be suitable for describing a lot of the total variability (Tucker modeling). A chart such as the one below could be useful:

                                Y(α) designed for
                                GCA        Tucker
    scoring on    GCA           1          0.7
                  Tucker        0.3        1

The 1's are obvious "maximum scores", but the numbers 0.3 and 0.7 tell something about how "flexible" the vectors are: whether they are useful as solution vectors for one of the problems (GCA/Tucker) only, or whether the solutions are interchangeable. This gives a pointer to the robustness of the α eventually chosen. Clearly, if all elements in the chart were 1's, the choice of α wouldn't matter. GCA and Tucker are only the ends of a scale, and so it would be more natural to have more boxes in the chart, covering GCA, Tucker, and the intermediate cases along both axes. Even better, a continuum f(., .) could be constructed and visualized as an intensity image rather than a chart. Since both the GCA and the Tucker problems can be described as eigenvector problems, one can study

    fj(α, β) = yj(α)^T Zβ yj(α) / ||Zβ||₂    (16)

Figure 2: Manhattan plots for four profiles, with principal components as explanatory variables. (a) shows H(X1, U1), (b) shows H(X2, U2), (c) shows H(X3, U3) and (d) shows H(X4, U4). One way of interpreting sub-figure (a) would be the following: (i) The first four variables (columns of the data matrix) cannot be explained by the first explanatory variable; neither can the ninth and the thirteenth variable. This can be seen from the upper row of the matrix, where these positions are black.
(ii) The explanations of the first, second and thirteenth variables increase significantly when another principal component (explanatory variable) is introduced. (iii) The degree of explanation naturally increases as more and more explanatory variables are introduced. This can be observed from the figure, since all columns become increasingly brighter going from top to bottom.

The function fj in (16) serves this purpose, giving the scoring of the changing solution vector yj(α) on all matrices {Zβ}. One might plot all of f1(α, β), f2(α, β), ..., or just a few. Typically, eigenvectors with low eigenvalues are unstable, so there is a limit to how many figures are informative. Normalization of the matrices to have the same L2 norm (forcing them all to have first singular value σi1 = 1) simplifies comparison. In the figures, the functions fj(α, β) are sampled into matrices by choosing suitable grids for the α and β values. High f values are bright, low values dark. Clearly, 0 ≤ fj(α, β) ≤ 1 for all α, β, j. Furthermore, it is clear that the diagonal fj(α = β) will be the brightest part of the figure only when j = 1, i.e. for the first solution vector y1(α). Subsequent solution vectors are perpendicular to the first one, and will therefore fail to obtain maximum scoring along the diagonal.

Perturbation Issues. When α is varied, this can be seen as a perturbation of Zα in (10). Ordered eigenvector extraction is an unstable process with respect to matrix perturbation. When the variance associated with certain directions in the space is changed, the ordering of the principal components may be altered. Furthermore, if a subset of the principal components have identical variance, almost any set of vectors spanning the subspace may be output to describe it.
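The CORV score of (16) for a single solution vector can be sketched as follows (a hypothetical helper, not the paper's code); since Zβ is normalized by its spectral norm, a unit vector scores 1 exactly when it is the first eigenvector of Zβ:

```python
# Sketch of the per-vector CORV score f_j(alpha, beta) = y^T Z y / ||Z||_2.
import numpy as np

def corv_score(y, Z):
    """CORV score of a unit-norm solution vector y on a symmetric PSD bridge
    matrix Z; np.linalg.norm(Z, 2) is the spectral norm."""
    return float(y @ Z @ y) / np.linalg.norm(Z, 2)
```

Sampling this over grids of α (for y) and β (for Z) gives the intensity images described above.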
One example where this is likely to occur is within the unified framework: when the influence of low-variance components becomes comparable with that of the high-variance ones as α → 0, there may be several components with comparable associated variance. When vectors swap places in the ordering, rifts are produced in the surface described by fj(α, β). The CORV plots therefore give additional information about when the ordering of principal vectors is altered, but these shifts also make the figures less readable.(6) The good news is that principal subspaces are stable with respect to perturbation, even if the vectors spanning them are not (Golub & Van Loan, 1996). If something is known about the dimension of the solution space (for instance, the number of underlying parameters controlling the process), the function (16) can be extended to focus on subspaces rather than on individual vectors. If Y(α)k = (y1, y2, ..., yk) ∈ R^{N,k} holds the first k orthogonal solution vectors for a particular choice of α as its column vectors,

    fkS(α, β) = ||Y(α)k^T Zβ Y(α)k||₂ / ||Zβ||₂    (17)

will do for subspaces what fj(α, β) did for vectors. Figure 3 shows some examples of CORV plots taken from simulations. Here, of course, fkS(α, β) > 1 might occur, whereas fj(α, β) was bounded above by 1.

(6) The "rift problem" can possibly be resolved by a vector tracing approach (not pursued here): If one suspects that two eigenvectors have swapped places due to perturbation of the matrix, a parallelism test may be performed to reorder the vectors. The tracing must necessarily happen on a sufficiently coarse α grid, otherwise the eigenvalues will at some point be nearly identical, and tracing becomes very difficult.

Figure 3: CORV plot for a synthetic example.
(a) and (b) are CORV plots for the first two solution vectors y1, y2. The clear shift of eigenvectors is seen. The common plot (c), based on fkS(α, β) for both solution vectors, shows that correlation scoring and variability may both be well explained. The creation of this example is rather complicated, and has been omitted for the sake of space. The bright areas represent high scores (or a high degree of explanation), the dark areas represent low scores.

6 A Case Study

6.1 Description of the dataset

The data in the following study come from sensory profiling of sausages. The original sausage data set (published by Baardseth et al., 1992) consists of measurements for eight sensory judges, each of them assessing 60 sausage samples (N), using P = 15 attribute variables. Each profile is then represented by a 60 × 15 matrix Xi ∈ R^{60,15}, i = 1, ..., 8. The sausages were produced according to four design variables, with 5, 3, 2 and 2 levels respectively, giving a total of 5*3*2*2 = 60 combinations. In previous analyses, the design factors have been shown to fit extremely well with the profiles. For the purpose of illustration, the problem was here made more difficult by removing all samples corresponding to the top three levels of the first design variable, and the top level of the second. This gives a dataset with 2*2*2*2 = 16 sausage samples. The profiles, now with 16 samples, were column centered and standardized by Frobenius norm (||Xi||² = 1), and variables 5 and 6 were removed, as earlier publications on these data have proven their lack of relationship with the design.

6.2 Results

A step size for α of 0.05 and 3 solution vectors were used (the number of components to use is a question in itself that will not be pursued here). Plotting the Wilk's λ values and the associated P-values (figure 4) shows that the best fit between consensus profile and design is achieved for a relatively low value of α.
The sub-figures show Wilk's λ and P-curves for the model. Manhattan plots (figures 5-7), with Y(α) holding the explanatory variables and the principal variables {uij} as the ones to be explained, show a clear shift once the focus is turned away from pure correlation scoring: There is a significant change as α goes from 0 to 0.05, then the picture changes slowly until α = 1. The CORV plots (figure 8) display the same phenomenon. The distinguished first rows and columns in all sub-figures indicate a clear shift when α increases from 0. Sub-figure (a) suggests that the first solution vector is stable, accounting both for correlation scoring and explanation of variability, as soon as α > 0. The second and third solution vectors, whose CORV plots are given in (b) and (c), seem to be better for correlation scoring than for explanation of variability (the figures get darker towards the lower right corner). The overall CORV plot (d) for the subspace spanned by the three solution vectors shows the same tendency (there is a slow decrease in the scores going along the diagonal from the upper left to the lower right corner). The sharp shift from the first to the second row/column shows that choosing α = 0 will give quite different results from choosing α = 0.05. Increasing α further gives little but gradual change, indicating that α is "robust" upwards from 0.05. These results are generally in line with the P-values and Wilk's λ for the MANOVA model, but one could argue that the quickly rising curves after α = 0.05 indicate less robustness with respect to choosing α than the CORV plots did. A geometrical interpretation would be the following: When α = 0, the influence of low-variance components (probably noise) is very high. Increasing α just a little gives RGCA and a set of solution vectors matching the process variables well. As α is further increased, the solution spaces gradually shift focus in the direction of high-variance components.
This change, however, reduces the influence of components that relate well with the process (design) variables; hence the increase in the P-values and Wilks' λ as α goes to 1. Note that none of the P-values are very low, suggesting that the relationship between the profiles and the design variables was not very strong (this is due to the removal of samples at specific levels, as described above). The hiding plots (1) give detailed information on how the solution vectors can be represented as linear combinations of the profiles' principal vectors. They suggest that the solution vectors in Y were constituted mainly, but not exclusively, from high-variance components. An overall interpretation of the plots could be the following: strong connections between the profiles and the design are not found when the first singular vectors alone determine the solution (the Tucker case), but rather when the later singular vectors are influential as well (although not quite as much as in the pure GCA case). One possible explanation for this phenomenon (or at least a guess) could be that the overall "tasting experience" described by the judges is a certain "sausageness", reflected in the early principal components of the profiles. Changing the sausage design (more or less flour, salt etc.) gives rise to more subtle nuances, but these are found only in the lower-variance components. To investigate this issue further, one could look in more detail at the linear combinations matching the profiles with the design data in the MANOVA procedure, checking whether they correspond to high- or low-variance components in the profiles. This, however, is outside the scope of the present paper. The above example is a case where RGCA, compared with Tucker and GCA, proved useful, improving solution stability by shrinking the low-variance components.
Using the MANOVA validation procedure and the tools described above, it was possible to select the ridge parameter α = 0.05 in a meaningful way. The sausages in the data set were custom-made, and the tasting sessions conducted, at the Norwegian Food Research Institute. If, however, these sausages (or some other product) had been part of a production line, with a trained panel of judges and ingredients varying with seasons and deliveries, the method could be made part of an industrial quality control: one could estimate α from one tasting session and use it as a parameter for consensus estimation and product labeling in future sessions. It would be even better to cross-validate α over previous sessions, so that the best estimate of α is always available for the next.

7 Conclusions & Discussion

GCA and Tucker can be seen as two ends of a scale, assuming the data to be full three-way data. The whole set of methods is essentially a three-way analysis toolbox, since it combines information from three domains of variation: the samples, the variables and the profiles. The methods on the scale belong to an even larger space of methods, in which GCA and Tucker are only two points. An interesting package of tools can be designed by considering the diagonal elements of the matrices Λ_i as linear filter factors. Any kind of filter could then

Figure 4: Wilks' λ for the model, the design variables, and their components, and associated P-values.
Figure 5: Manhattan plot with the profiles' principal components as the horizontal (or original) variables, α = 0, which is the GCA case. Note that the leftmost columns, corresponding to the principal vectors with high associated variance, have only slightly "higher towers" than the other ones.

Figure 6: Turning to α = 0.05 (RGCA), there is a swift change. Now there is a much clearer tendency for the first principal components to have high towers.

Figure 7: The picture changes slowly as α increases (here α = 1, the Tucker case). Now the first columns are clearly the brightest.
Figure 8: CORV plots for the sausages. Sub-figures (a), (b) and (c) are for the first three (individual) solution vectors, (d) for the subspace spanned by all of them.

be designed to put focus on certain components in the profiles. A lowpass filter resembling a normal distribution, with arbitrary placement of center and width, could for example be used to "tune high and low frequencies on a set of radios (the profiles)" to see how they harmonize. The power of modern computers makes it possible to scan for such relationships in real time, combining human intuition with statistical and computational power.

It is important to emphasize that the GCA method only looks for linear combinations that are highly correlated. All directions are equally important, regardless of whether they describe much of the information in X_i or not. If there are strong correlations in some of the "smaller" directions, these will be picked up just as easily as the others. Such directions can, however, sometimes have strong effects on the stability of the solution. This problem is well understood in regression analysis. An interesting study would be a statistical analysis of the stability of the framework. Simulations based on random profile generation and framework analysis for fixed α would be one approach.

The connection between GCA and multiple regression (MR) has been covered elsewhere. It has, e.g., been shown that MR is a form of GCA.
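The radio-tuning idea can be made concrete. The Python/NumPy sketch below shrinks the singular components of a single profile by Gaussian filter factors with a free center and width; this is only one plausible filter of the kind suggested here, and the function names and parameter values are illustrative rather than the specific RGCA filter of the paper:

```python
import numpy as np

def filter_profile(X, center, width):
    """Shrink the singular components of a profile X = U S V^T by
    Gaussian 'lowpass' factors f_j = exp(-0.5*((j - center)/width)^2),
    treating the f_j as diagonal linear filter factors."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    j = np.arange(1, len(s) + 1)                   # component indices
    f = np.exp(-0.5 * ((j - center) / width) ** 2)
    return U @ np.diag(f * s) @ Vt

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 13))                  # same shape as one reduced profile
Xf = filter_profile(X, center=1.0, width=2.0)      # emphasize early components
```

Moving the center towards later components, or widening the Gaussian, changes which part of the variance spectrum the filtered profile emphasizes, which is exactly the kind of interactive "tuning" described above.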
GCA turns out to be a remarkably general framework, encompassing canonical analysis as a special case, and therefore discriminant analysis, multiple regression, analysis of variance and correspondence analysis (McKeon 1965, Gittins 1985, Carroll 1968, Kettenring 1971, Tenenhaus 1987). It would therefore be of interest to explore whether RGCA, as a framework, would contain Ridge Regression and Regularized Discriminant Analysis as well as the others.

Variable subset selection. With the high dimensionality of the data, and the total number of columns in particular, it would be of interest to perform dimension reduction by variable selection. There may be certain columns in the profiles that account for much of the variability. If strong underlying structures exist, certain variables may turn out always to be linear combinations of others. A subset selection approach, focusing on a few variables spanning column subspaces comparable to those obtained when all variables are used, could simplify interpretation as well as the experimental procedures (e.g. by reducing the number of tastes each judge would have to describe).

Definition of consensus. With respect to the seemingly different classes of optimization problems presented in the introduction, relatively little has been said about the interpretation of the consensus matrix Y. While it is well understood in the specific cases of GCA and Tucker-1, it might still be possible to say something about how the consensus captures and compresses information about the profiles on the full scale (or even generalized space) of methods.

Links with Continuum Regression (CR) and PLS. Changing the focus of study from correlation to explanation of variability is not too different from the idea of CR. Continuum regression involves a scale bridging classical least squares regression with principal component regression, with PLS (partial least squares) regression in between.
Since PLS can be expressed (non-linearly) using filter coefficients in an SVD setting (Hansen, 1996), ideas linking our methodology with external response data in a PLS or multiresponse-regression setting are a topic for further research.

The authors are grateful for comments from Ole-Christian Lingjærde and Nils Christophersen. Special thanks to Øyvind Langsrud for the MATLAB implementation of MANOVA.

References

[1] Amerine, M.A., Pangborn, R.M., Roessler, E.B. (1965) Principles of Sensory Evaluation of Food, Academic Press, New York.
[2] Arnold, G.M. & Williams, A.A. (1985) The use of generalized Procrustes techniques in sensory analysis. In: Statistical Procedures in Food Research, Piggot, J.R. (Ed.).
[3] Baardseth, P., Næs, T., Mielnik, J., Skrede, G., Hølland, S. & Eide, O. (1992) Dairy ingredients effects on sausage sensory properties studied by principal component analysis, Journal of Food Science, Vol. 57, No. 4, pp. 822-828.
[4] ten Berge, J.M.F. (1977) Orthogonal Procrustes rotation for two or more matrices. Psychometrika, 42, pp. 267-276.
[5] van der Burg, E. & Dijksterhuis, G. (1996) Generalized canonical analysis of individual sensory profiles and instrumental data. In: Multivariate Analysis of Data in Sensory Science, Naes, T. & Risvik, E. (Eds.), Elsevier Science.
[6] van der Burg, E., de Leeuw, J. & Dijksterhuis, G.B. (1994) OVERALS: nonlinear canonical correlation analysis with k sets of variables. Computational Statistics and Data Analysis, No. 18, pp. 141-163.
[7] Carroll, J.D. (1968) Generalization of canonical analysis to three or more sets of variables. Proceedings of the 76th Convention of the American Psychological Association, 3, pp. 227-228.
[8] Dahl, T. (1998) Empirical Modeling of Human Motion, MSc Thesis, University of Oslo, Department of Mathematics.
[9] Friedman, J.H. (1989) Regularized Discriminant Analysis, Journal of the American Statistical Association, Vol. 84, No. 405, pp.
165-175.
[10] van de Geer, J.P. (1984) Linear relations among k sets of variables, Psychometrika, Vol. 49, No. 1, pp. 79-94.
[11] Gifi, A. (1990) Nonlinear Multivariate Analysis, Wiley Series in Probability and Mathematical Statistics. Chichester: John Wiley & Sons.
[12] Gittins, R. (1985) Canonical Analysis: A Review with Applications in Ecology, Berlin: Springer-Verlag.
[13] Golub, G. & Van Loan, C.F. (1996) Matrix Computations, 3rd ed., The Johns Hopkins Univ. Press.
[14] Gower, J.C. (1975) Generalized Procrustes analysis, Psychometrika, 40, pp. 33-51.
[15] Hansen, P.C. (1996) Rank-Deficient and Discrete Ill-Posed Problems, Ph.D. dissertation, Technical University of Denmark, DK-2800 Lyngby, Denmark.
[16] Hoerl, A.E. & Kennard, R.W. (1970) Ridge regression: Biased estimation for nonorthogonal problems, Technometrics, Vol. 12, pp. 55-67.
[17] Kettenring, J.R. (1971) Canonical analysis of several sets of variables, Biometrika, Vol. 58, pp. 433-451.
[18] Kristof, W. & Wingersky, B. (1971) Generalizations of the orthogonal Procrustes rotation procedure to more than two matrices. Proceedings of the 79th Annual Convention of the American Psychological Association, 6, pp. 81-90.
[19] Kroonenberg, P. & de Leeuw, J. (1980) Principal component analysis of three-mode data by means of alternating least squares algorithms, Psychometrika, Vol. 45, pp. 69-97.
[20] Langron, S.P. (1983) The application of Procrustes statistics to sensory profiling. In: Sensory Quality in Foods & Beverages: Definition, Measurement & Control, Williams, A.A. & Atkin, R.K. (Eds.), Ellis Horwood Ltd, Chichester, pp. 89-95.
[21] Naes, T. & Kowalski, B. (1989) Predicting sensory profiles from external instrumental measurements, Food Quality and Preference, 4/5, pp. 135-147.
[22] Naes, T. & Risvik, E. (1996) Multivariate Analysis of Data in Sensory Science, Elsevier Science.
[23] Mardia, K.V., Kent, J.T., Bibby, J.M. (1997) Multivariate Analysis, Academic Press.
[24] McKeon, J.J. (1965) Canonical analysis: some relations between canonical correlation, factor analysis, discriminant function analysis, and scaling theory. Psychometric Monograph 13, University of Chicago Press, Chicago.
[25] Peay, E.R. (1988) Multidimensional rotation and scaling of configurations to optimal agreement. Psychometrika, 53, pp. 199-208.
[26] Tenenhaus, M. (1987) Generalized canonical analysis, Bernoulli, Vol. 2, pp. 133-136.
[27] Tucker, L.R. (1958) An inter-battery method of factor analysis. Psychometrika, 23, pp. 111-136.

Paper II: Outlier and Group Detection in Sensory Analysis using Hierarchical Cluster Analysis with the Procrustes Distance

Outlier and Group Detection in Sensory Panels using Hierarchical Cluster Analysis with the Procrustes Distance

Tobias Dahl and Tormod Næs*

Abstract

Generalized Procrustes Analysis (GPA) is a well-known method in both multivariate analysis and shape analysis. It is used to find a representative average for a set of matrices (configurations, shapes, profiles). In this paper, hierarchical clustering is suggested for situations where the data profiles are believed to come from non-homogeneous groups. The Full Procrustes Distance is used as the dissimilarity measure for the amalgamation. This new approach to sensory panel analysis may be used at an exploratory stage, in combination with GPA, to gain insight into the structure of the data set. It can help the researcher detect outliers and sub-groups, help him/her make decisions regarding further analysis, and reduce the risk of erroneous inference about the data.

key words: Procrustes Distance, Generalized Procrustes Analysis, dendrograms, Sensory Analysis, Shape Analysis

1 Introduction

In many practical situations, it is important to define a meaningful average of data matrices. Examples of such situations are:

(i) A set of sensory profiles for a number of products.
This could be N food samples assessed by Q judges, using P tasting variables (sourness, saltiness, bitterness etc.), giving Q N-by-P matrices.

(ii) A set of psychological profiles: N patients interviewed by Q therapists, giving scores along P dimensions (depression, anxiety etc.).

* MATFORSK (Norwegian Food Research Institute) and University of Oslo.

(iii) A set of point configurations, or shapes, in two or three dimensions, which are similar and differ mainly in terms of rigid transformations. If N is the number of points and Q the number of shapes, Q N-by-2 or N-by-3 matrices must be analyzed.

The easiest way of handling these data is to use simple averages, but there are some obvious problems with this approach:

• There may be confusion about the use of terms (saltiness and bitterness, or depression and anxiety).
• There may be differences in scaling. One judge or therapist may use a larger portion of the scale than another, e.g. one scores in the range 2-6 and another in 1-10.
• The center of the scales may differ.

These comments are relevant for cases (i) and (ii) above. In case (iii), this corresponds to orientational dislocation, size differences and center displacements of the shapes. Generalized Procrustes Analysis (GPA) is a technique frequently used to handle such problems. It is based on standardizing the profiles with respect to rotation/reflection, isotropic scaling and translation, in order to provide a better average, a consensus. Even though case (iii) seems different from the other two, they can be shown to be equivalent¹ by geometrical reasoning. Procrustes methods have a history in two distinct disciplines of statistics.
They were introduced in psychometrics, a branch of multivariate analysis, where important references include Mosier (1939), Green (1952), Cliff (1966), Schönemann (1966, 1968), Gruvaeus (1970), Schönemann and Carroll (1970), Kristof & Wingersky (1971), Gower (1971, 1975), Ten Berge (1977), Sibson (1978, 1979), and Langron and Collins (1985). The methods have been used as standard tools in the related field of sensory analysis since the 1980s, due to important contributions by Arnold & Williams (1985) and later Dijksterhuis (1996). Procrustes methods were brought into statistical shape analysis by Kendall (1984, 1989), Goodall (1991) and later Dryden & Mardia (1997). Their introduction in shape analysis led to new practical and theoretical developments. The individual matrices (profiles or configurations) known from psychometrics were formalized as shapes in the new field. The shape space was explored by Kendall (1984, 1989) and Le and Kendall (1993), and a distance measure known as the Procrustes Distance was developed to measure the degree of difference between configurations. The new findings can be used to widen the range of applications involving Procrustes Analysis, even in psychometrics, as will be shown.

1 Variants of Procrustes analysis work without the possibility of reflection, leaving rotation as the only admissible orthogonal transformation. This is common in shape analysis.

Figure 1: An artificial example from shape analysis: (a) Two point configurations (or profiles), one banana and one mushroom, are to be matched by Procrustes rotations (after an initial centering/translation). The numbers along the lines are the landmark indices.
(b) The consensus (drawn with a thick line), obtained by rotating the configurations to match and then averaging, is neither representative of the mushroom nor of the banana.

1.1 The consensus is not always meaningful

As described above, GPA is based on a model for the differences between matrices. Figure 1 shows an example where the Procrustes transformation is not very meaningful. Other examples are:

• A sensory experiment where the judges come from two expert groups.
• A tasting session where the food samples have been presented to the judges at different times. Certain chemical processes could have changed the taste of the food.
• A study in psychotherapy where one of the therapists comes from a different school.
• A sensory panel where one or more judges have caught a cold.

When working with two- or three-dimensional data, such problems can be detected by plotting the data, especially if the points have a natural ordering (as in figure 1, with successive numbering along the contour of the shape). For higher dimensions, other techniques are required. This paper presents and discusses a new approach to sensory panel analysis. By grouping the judges with Hierarchical Cluster Analysis (HCA), one can check the relevance and quality of the Procrustes model for high-dimensional data. The main focus will be on detecting groups of configurations and detecting objects which are totally different from the rest (outliers). For other ideas concerning clustering methods for three-way data, see Carroll and Arabie (1983), Gordon and Vichi (1999), Krieger and Green (1999) and Vichi (1999). To illustrate the gains of this new approach, existing GPA diagnostics will be computed and compared with the new ones. A data set from sensory analysis, manipulated for the purpose of illustration, will be used in the computations. Finally, a real-life data set (peas, Næs & Kowalski 1989) will be studied in detail.
HCA with the Procrustes Distance is intended to be used in an informal way, at an early stage, to explore the nature of the data at hand. Questions such as determining the number of clusters have been considered by other authors (see e.g. Gordon, 1999) and are not central to this paper.

2 Methodology

2.1 Cluster Analysis

The purpose of clustering is to group n objects into g groups, or clusters. The members of each cluster should be "similar" in some sense, and the clusters thus "homogeneous". There are different approaches to clustering, both hierarchical and criterion-based; in this paper, the focus is on the former. At each level, the two most similar groups are agglomerated. The results of such analyses are usually presented in tree structures called dendrograms. Even though all objects will, in such a process, end up in the same group, the grouping process itself is interesting. The length of the edge connecting two nodes matches the degree of dissimilarity between the subgroups. A hierarchical clustering process is based on the distances or similarities between the objects. The dissimilarity between two clusters G_i, G_j is defined by a measure d(·, ·) that can be constructed in a number of different ways, depending on the nature of the objects and the purpose of the clustering. If {x_1, ..., x_n} are vectors, a frequent choice of distance function between two clusters is

d(G_i, G_j) = min_{z ∈ G_i, w ∈ G_j} ||z − w||_2^2    (1)

Using the distance function (1) leads to single linkage clustering. Other variants will be described in section 2.4, together with a distance function for matrices rather than vectors. For more information on clustering, see Mardia et al. (1998) or Gordon (1999).
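The agglomeration scheme described above can be sketched generically. The Python function below is an illustrative sketch, not the paper's MATLAB code: it takes a precomputed symmetric distance matrix (which could hold squared Euclidean distances between vectors as in (1), or the Procrustes-based distances introduced later) and repeatedly merges the pair of clusters at minimum linkage distance, recording the merge history that a dendrogram would display.

```python
import numpy as np

def hca(D, linkage="single"):
    """Agglomerative clustering over a symmetric distance matrix D:
    repeatedly merge the two clusters at minimum inter-cluster distance.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples."""
    n = D.shape[0]
    clusters = [[i] for i in range(n)]
    agg = {"single": min, "complete": max,
           "average": lambda ds: sum(ds) / len(ds)}[linkage]
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = agg([D[a, b] for a in clusters[i] for b in clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i][:], clusters[j][:], d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return history

# toy distance matrix for four objects (hypothetical values)
D = np.array([[0.0, 0.1, 0.9, 0.8],
              [0.1, 0.0, 0.85, 0.95],
              [0.9, 0.85, 0.0, 0.2],
              [0.8, 0.95, 0.2, 0.0]])
merges = hca(D, "single")
```

With this toy matrix, objects 0 and 1 merge first, then 2 and 3, and finally the two pairs join; the recorded distances are non-decreasing, which is the monotonicity property discussed for single, complete and average linkage below.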
2.2 Procrustes Analysis & the Procrustes Distance

If ||·||_F denotes the Frobenius norm, the Procrustes transformation of a matrix X_1 to match X_2 (both n × p) is found by solving

min_T ||T(X_1) − X_2||_F = d(X_1, X_2)    (2)

with the requirement that T is a transformation composed of a rotation/reflection matrix Q (Q^T Q = I) and an isotropic scaling factor c ∈ R,

T(X_1) = c X_1 Q,   c := tr(Q^T X_1^T X_2) / ||X_1||_F^2    (3)

The solution to the optimal orthogonal transformation problem is

Q = U V^T    (4)

where X_1^T X_2 = U S V^T is the singular value decomposition of X_1^T X_2. This can easily be derived using standard results from linear algebra (Cliff, 1966, or Mardia et al., 1979). The measure d quantifies the dissimilarity between two matrices after rotational and scaling effects have been removed. It is, however, not symmetric. In general,

d(X_1, X_2) ≠ d(X_2, X_1)    (5)

or, less formally, the distance from an object X to an object Y is not the same as the distance from Y to X. This makes the amalgamation difficult to interpret. By ensuring that X_1 and X_2 are scaled to have the same variance (after centering), ||X_1||_F^2 = ||X_2||_F^2 = K, however, the distance function becomes symmetric. The new distance is called the Full Procrustes Distance:

min_T ||T(X_1 / ||X_1||_F) − X_2 / ||X_2||_F||_F = d_F(X_1, X_2)    (6)

with the same requirement on T as above. Typically K = 1 is chosen, which means that the profiles are scaled to have unit variance. A proof of the symmetry of d_F can be found in Dryden & Mardia (1997). Translation is usually applied together with scaling and rotation. It can be proven that the optimal way of translating the point sets is by column-centering the profiles. Throughout the paper, it is assumed that all profiles (shapes) are pre-processed in this way.
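A direct Python/NumPy sketch of d_F along the lines of eqs. (2)-(6) follows: normalize both centered profiles, rotate one onto the other with Q = UV^T, apply the optimal scale, and take the residual. The helper name and toy data are illustrative, and the claimed symmetry can be checked numerically:

```python
import numpy as np

def full_procrustes_distance(X1, X2):
    """Full Procrustes distance d_F between two centered profiles:
    scale both to unit Frobenius norm, rotate A onto B with Q = U V^T
    from the SVD of A^T B, apply optimal scaling, take the residual."""
    A = X1 / np.linalg.norm(X1)
    B = X2 / np.linalg.norm(X2)
    U, s, Vt = np.linalg.svd(A.T @ B)
    Q = U @ Vt            # optimal rotation/reflection, eq. (4)
    c = s.sum()           # optimal isotropic scale for unit-norm profiles
    return np.linalg.norm(c * (A @ Q) - B)

# two random centered "profiles" (illustration only)
rng = np.random.default_rng(1)
X1 = rng.standard_normal((10, 3)); X1 -= X1.mean(axis=0)
X2 = rng.standard_normal((10, 3)); X2 -= X2.mean(axis=0)
d12 = full_procrustes_distance(X1, X2)
d21 = full_procrustes_distance(X2, X1)
```

Because X_1^T X_2 and X_2^T X_1 share the same singular values, d12 and d21 agree to machine precision, which is the symmetry property that makes d_F usable for the amalgamation.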
2.3 Generalized Procrustes Analysis

Generalized Procrustes Analysis (GPA) does² for several matrices what Procrustes Analysis does for two. It is based on an iteratively updated average, or consensus, Z, and it makes a set of matrices as similar to Z as possible by Procrustes transformations. This also makes the profiles as similar as possible to each other (Gower, 1975). If T_i denotes the optimal (and implicitly defined) transformation by scaling and rotation for the i'th profile X_i ∈ R^{N×P}, GPA minimizes

g(X_1, ..., X_Q) = Σ_{i=1}^{Q} ||T_i(X_i) − Z||_F^2    (7)

Since this paper is concerned with dissimilarity measures, note that the match of the transformed profiles with the consensus Z can be checked by considering the individual terms in (7),

g_i = ||T_i(X_i) − Z||_F^2,   i = 1, ..., Q    (8)

An algorithm for GPA can be found in the appendix.

2.4 HCA with the Procrustes Distance

If the profiles {X_1, ..., X_Q} are too different, taking a GPA average may be meaningless. To detect such situations, diagnostics are needed, but g_i has intrinsic problems:

• If one profile is an outlier, it has already been allowed to influence the consensus Z when g_i is calculated. Thus, an outlier will seem less of an outlier after the iterative procedure.³
• If there are several groups of profiles, homogeneous within but not across groups, the g_i diagnostics will not reflect the grouping. At worst, the procedure may produce a consensus not representative of any profile. At best, it will yield high values of g_i for all i, reflecting a poorly defined common opinion.

Cluster analysis reveals group structures as well as individuals not fitting well into existing groups. Visualization by trees (dendrograms) can help the researcher detect when the averaging process "breaks down" due to the influence of profiles very different from the rest, as well as situations where subgroups of profiles differ substantially from one another.

2 More or less, anyway.
It is well known that GPA algorithms find only a local minimum of the optimization criterion, whereas the Procrustes transformation, covering only two matrices, finds the global optimum. For details, see Ten Berge (1977).

3 This is of course also the case in ordinary regression analysis: outliers do influence the parameters and thus affect the estimated relationship between the variables.

2.4.1 Single, Complete, Average Linkage

Let the sensory profiles {X_1, ..., X_Q} be the objects to be grouped in a cluster analysis. Each cluster G_i contains one or more profiles at any time. The distance between two clusters G_i, G_j is based on the Full Procrustes Distance and may be one of the following:

d_S(G_i, G_j) = min_{X ∈ G_i, Y ∈ G_j} d_F(X, Y)    (9)
d_C(G_i, G_j) = max_{X ∈ G_i, Y ∈ G_j} d_F(X, Y)    (10)
d_M(G_i, G_j) = average_{X ∈ G_i, Y ∈ G_j} d_F(X, Y)    (11)

From these three distances and the general clustering algorithm specified below, three HCA variants are derived.

A general clustering algorithm. Let {G_1, ..., G_Q} = {X_1, ..., X_Q} be the initial clusters. A distance matrix D = {d_ij} ∈ R^{g×g}, where g is the current number of clusters, is given by the elements d_ij = d(G_i, G_j), where d is the chosen clustering distance function. Next, the minimum element d_ij of the matrix D is found; its row and column indices identify the two clusters to be merged in the current step. The two groups are linked, the new g − 1 clusters are relabeled, and the procedure is repeated until there is only one cluster left, or until a stopping criterion terminates the process.

2.4.2 Centroid Linkage & GPA

A fourth clustering variant is also possible, perhaps reflecting some intrinsic ideas of GPA more clearly than the other methods.
Rather than computing distances (minimum, maximum or average) between elements of clusters, each cluster could be represented by an average element, and the distance between clusters could be measured in terms of the Full Procrustes Distance between the average elements, or consensuses. A natural candidate for the average element is the GPA consensus Z_i of the matrices in each cluster G_i,

d_ij = d(Z_i, Z_j)    (12)

At the final step, this amalgamation is equivalent to GPA, because all matrices are then members of the same cluster, and the average matrix computed for this cluster is the ordinary GPA consensus. The first g − 1 clustering steps are, in this light, sub-GPAs. At any point in the amalgamation process, an atypically long vertical link in the dendrogram can be seen as a "breakdown". The other three clustering variants could also be turned into sub-GPAs, merely by computing GPA consensuses Z_i after each clustering step. The main drawback of the centroid method is that it fails to preserve a basic clustering property: the monotone increase of distance levels through the stages of the clustering process. Let d_s denote the distance at which two clusters are joined when g − s clusters are left. We should expect that

d_1 ≤ d_2 ≤ ... ≤ d_{g−1}    (13)

This property can easily be derived for single, complete and average linkage. But see figure 2, sub-figure (c), for an example where the centroid method fails to fulfill (13). This is commonly called "inversion" in Hierarchical Cluster Analysis, and is typical when centroid linkage is used.

3 Experiments: Detecting group structure and outliers

To illustrate the methods and the practical problems that can be analyzed, a set of experiments was carried out. The original data is a set from a tasting session of sausages (Baardseth et al., 1992), which was manipulated in various ways to highlight certain situations that may occur in real tasting experiments.
The set contains eight judge profiles, each describing N = 60 sausage samples in P = 13 variables. A profile is represented by a matrix X_i ∈ R^{60×13}, i = 1, ..., Q = 8, all column-centered and scaled to have unit variance (or unit Frobenius norm). The experiments were run in MATLAB on a Sun Ultra 5, with a running time of a few (5-10) seconds.

3.1 Basic Study

A basic study illustrates the use of the methods in the simplest setting. Using the original profiles X_1, ..., X_8, the four clustering variants were applied (their dendrograms can be found in figure 2). The different methods give very similar results. Some basic properties can be seen that are typical of any hierarchical clustering method⁴:

• Complete linkage tends to have longer edges in the tree than the others. This is because it joins cluster elements at maximum distance, and thus takes longer before connecting single objects with large clusters. The chance that some object in a big cluster is far away from a single object outside the cluster is usually large. Rather than growing clusters quickly, it tends to pair up objects, then pairs of pairs, and so on.

4 And not just when using the Procrustes Distance.

Figure 2: Basic study. The methods give similar clusters as results. The complete linkage method differs from the others in that it connects objects 2 and 8 at an early stage. Note panel (c), corresponding to centroid linkage, and the downward edge (inversion) when joining the cluster {4, 6} with object 7.

• The single linkage method, on the contrary, easily allocates new (single) objects to clusters with many members.
It is much easier to find one within-cluster object close to the new object than to make certain that no within-cluster object is far from it.

• The centroid method shows the "inversion" which is typical for centroid methods, and is to be expected. It is due to the calculation of a cluster representative (centroid), which does not conform with the principle of an increasing sequence of distances.
• The average linkage method is, not surprisingly, an in-between of single and complete linkage.

From the figures, one can also argue that object 1 is an outlier.

3.2 Splitting the data into two sessions

A situation with two separate tasting sessions was simulated by dividing each sensory matrix X_i into two halves, each consisting of a subset of n_1 = n_2 = 30 samples. Each sub-matrix can be seen as a single profile for a single experiment. This manipulation corresponds to a real-life situation where two separate tasting experiments are performed by the same set of judges. If the judges were sufficiently clear in their scoring (relative to the others), one would expect the clustering to be very similar for the two experiments. The four clustering variants were run to check whether the judges would group in the same way in the two "tasting sessions". There were differences between them, as figure 3 shows, including the following observations:

• Two clusters are "stable" with respect to the choice of tasting session: objects 4 and 6 always cluster together, and the same is true for 3 and 7, with the notable exception of the centroid linkage method (here 4 and 6 join only in the first session, while 3 and 7 join only in the second). In the first session, the 4-and-6 group joins with 8 before joining the existing 3-and-7 cluster, for all but the centroid linkage method.
Other details can be seen in the dendrograms, but the main observations are (1) there are differences between the sessions, and (2) within each session the amalgamations are similar for single, complete and average linkage, but notably different from those of the centroid linkage variant.

3.3 Outlier detection

The purpose of the third experiment was to illustrate outlier detection. The matrix of the seventh judge was "turned into an outlier" by exchanging the first 30 rows with the last 30 rows of the matrix X7. The dendrograms are found in figure 4.

• All variants cluster correctly, grouping the outlier with the other objects only at the last stage.

• The complete and average linkage variants are less clear than the other two in displaying object 7 as the outlier. Centroid and single linkage have long edges connecting the other objects with object 7, and thus show the outlier effect much more clearly.

• The strictness of complete linkage against building large groups quickly makes the edge connecting object 7 with the others comparably small. The average linkage result inherits this property to some extent.

[Figure 3 shows eight dendrograms, one per subset and linkage variant: (a)-(b) complete, (c)-(d) single, (e)-(f) average, and (g)-(h) centroid linkage, for the first and second subsets respectively.] Figure 3: Splitting the data into two sets/sessions. The grouping of the objects is different for the two subsets, but similar across the four clustering variants. The centroid linkage differs from the other variants.
• For single linkage, which groups clusters quickly, adding object 7 to all the others requires accepting an object far away from the rest. The relatively short distances in the early stages make the final linkage seem large, thus emphasizing the outlying nature of X7.

When considering outliers, diagnostic tools become interesting. Two diagnostics are available so far:

• The GPA residuals gi

• The inter-group clustering distances ds

To see whether the HCA approach gives anything new, it is natural to compare the two measures gi and ds in a number of situations. They are, however, on different scales, and must be re-scaled to make comparison meaningful: each element in each set (the gi, or the cluster distances ds for a specific HCA variant) was divided by its maximum element. The original GPA diagnostics (table 1) match the diagnostic impression given by the dendrograms and the clustering distances (table 2): the (scaled) Procrustes diagnostic g7 = 1.0000, large relative to the other gi's, indicates that profile 7 is an outlier. The same indication is made by HCA, since the seventh object is joined with the other ones only at the last stage. The (scaled) clustering distance 1.0000 at this step is also relatively large, at least for single linkage and centroid linkage. Overall, one may conclude that HCA gave little new information in this specific case. In the next sections, some examples of the opposite case are given.

[Figure 4 shows four dendrograms: (a) Complete Linkage, (b) Single Linkage, (c) Centroid Linkage, (d) Average Linkage.] Figure 4: Case I: One outlier. All variants detect the outlier by grouping it with the others at the final stage only. Note that figures (b) and (c) show the outlier effect more clearly than the other ones.
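The max-rescaling used to make the two diagnostics comparable is simply division of each set by its largest element. A small sketch with hypothetical raw values (the g and d numbers below are illustrative, not the paper's):

```python
import numpy as np

# hypothetical raw diagnostics for eight judges / seven amalgamation steps
g = np.array([0.55, 0.54, 0.52, 0.51, 0.53, 0.52, 0.60, 0.52])  # GPA residuals g_i
d = np.array([0.77, 0.81, 0.82, 0.87, 0.90, 0.94, 1.10])        # cluster distances d_s

g_scaled = g / g.max()     # both sets now end at 1.0000 ...
d_scaled = d / d.max()     # ... and can be compared directly

# the judge carrying the maximal (scaled) residual is the outlier candidate
outlier = int(np.argmax(g_scaled)) + 1
```

With these illustrative values, `outlier` is judge 7, mirroring the pattern reported in Table 1.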
3.4 Two classes

In some situations, there may be reason to look for group structures among the judges. If the judges come from two separate expert groups on, e.g., wine tasting, this may show in the scoring charts. If severe, systematic differences exist between the groups, extracting a meaningful average is impossible. A two-group situation was created by exchanging the first 30 rows with the last 30 rows in the first four profiles X1, . . . , X4, leaving X5, . . . , X8 as they were. The analysis of this data can be seen in figure 5. There is, of course, no way the standardized gi diagnostics (table 3) can identify the two-group situation, but all the dendrograms reflect it. Single and centroid linkage give a better illustration than the other two.

[Figure 5 shows four dendrograms: (a) Complete Linkage, (b) Single Linkage, (c) Centroid Linkage, (d) Average Linkage.] Figure 5: Case II: A two-group situation. HCA with the Procrustes Distance correctly identifies the two groups by joining them only at the final stage, and then with a high associated clustering distance.

3.5 Two classes and one outlier

To conclude the experimental section, a setting was made with two classes and one outlier. One class consists of the profiles X5, X6, X8. The other held modified profiles X1, X2, X3, X4, with the last 10 rows put on top of the first 50. Finally, X7 was made an outlier by exchanging the first 30 rows with the last 30. The dendrograms are found in figure 6. All variants reveal the two distinct classes, and the one outlier, which is joined with one of the classes only at the second-to-last stage. The gi diagnostics (table 4) fail even to identify the outlier correctly, and are otherwise not very informative, with several gi close to the value 1.0000.
X4 is reported the furthest away from the consensus, and the outlier X7 is actually closer to the consensus than the others. In this case, the standard diagnostic gi is useless.

[Figure 6 shows four dendrograms: (a) Complete Linkage, (b) Single Linkage, (c) Centroid Linkage, (d) Average Linkage.] Figure 6: Case III: Two groups and an outlier.

Table 1: GPA lack-of-fits gi (Case I), judges 1-8:
0.9165  0.9073  0.8665  0.8553  0.8787  0.8673  1.0000  0.8641

Table 2: Cluster distances used for linking at each stage

Method/Step          1       2       3       4       5       6       7
Complete Linkage  0.7685  0.8142  0.8224  0.8661  0.9037  0.9392  1.0000
Single Linkage    0.7943  0.8416  0.8422  0.8555  0.8565  0.8762  1.0000
Centroid Linkage  0.7869  0.8010  0.7779  0.7910  0.8396  0.8758  1.0000
Average Linkage   0.7849  0.8315  0.8364  0.8587  0.8939  0.9172  1.0000

Table 3: GPA lack-of-fits gi (Case II), judges 1-8:
0.9594  0.9829  0.9136  0.9565  0.9599  0.9716  1.0000  0.9775

Table 4: GPA lack-of-fits gi (Case III), judges 1-8:
0.9918  0.9212  0.9998  1.0000  0.9231  0.9565  0.8385  0.9544

4 Analysis of descriptive sensory data from peas

The pea data used for this example have previously been analyzed by Næs and Kowalski (1989), and the reader is referred to that paper for details. The data contain sensory measurements made by 10 assessors on 60 different samples of peas (different varieties and different degrees of maturity). 10 sensory attributes were measured, but in this paper we only consider 4 of them (pea flavor, sweetness, off-flavor and mealiness). There were 2 replicates for each sample, and these were averaged before statistical analysis. For this data set we confine ourselves to single linkage only. The dendrogram for the pea data using this technique is shown in Figure 7.
It gives a clear idea of the similarities and differences among the assessors. First of all, there is a clear group of six assessors, 1, 2, 4, 7, 9 and 10, who are very similar to each other. The (full) Procrustes Distance (vertical axis) between assessors 1 and 4 is only slightly smaller than between 1 and 7, which is the last one joined to this cluster. The next assessor to be joined is number 3, which is considerably further away. Assessor number 6 seems to be quite different from the rest in this study. For this particular study, near-infrared reflectance (NIR) data were also acquired for all the samples. The NIR data in this case contained absorbance readings at 116 different wavelengths. In order to verify the conclusions above, the sensory profiles were related to principal components of the NIR data. First, the scores for each individual assessor were considered, and the results are presented in Figure 8. As can be seen, already after 3-4 principal components the prediction ability is quite good. The prediction error is here ||Wβ − Y||²_F, where W is the N × P matrix of (NIR) principal component scores for the N samples in P = 1, 2, . . . , 10 (principal) dimensions, β ∈ R^{P×p} is the matrix of regression coefficients (p being the number of sensory attributes), and Y is a sensory profile normalized to have unit variance, Y := Xi/||Xi|| for the assessor/profile indexed by i. It is also clear that the same group of assessors that was determined to be similar is the one that is easiest to predict. Assessor number 6 is clearly the one with the least clear relationship between the sensory and NIR data. Assessors 3, 5 and 8 come in an intermediate position. Figure 9 presents similar results for the consensus of the six similar assessors, for the full panel, and for the raw average of the profiles (the prediction ability was measured the same way as for the individual profiles, with the same normalization of the target (consensus) as described above).
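A prediction-error curve of this form can be sketched as follows; this uses random stand-ins for the NIR spectra and a sensory profile (the pea data themselves are not reproduced here), regressing the profile on an increasing number of NIR principal components:

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_wavelengths, p = 60, 116, 4          # samples, NIR channels, sensory attributes

X_nir = rng.standard_normal((N, n_wavelengths))   # stand-in NIR absorbance matrix
Y = rng.standard_normal((N, p))                   # stand-in sensory profile
Y = Y / np.linalg.norm(Y)                         # normalised as in the paper

# principal component scores of the centred NIR data, via the SVD
Xc = X_nir - X_nir.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

errors = []
for P in range(1, 11):
    W = U[:, :P] * s[:P]                          # N x P score matrix
    beta, *_ = np.linalg.lstsq(W, Y, rcond=None)  # beta is P x p
    errors.append(np.linalg.norm(W @ beta - Y) ** 2)
```

`errors[P-1]` is ||Wβ − Y||²_F for P components; it is non-increasing in P by construction, and for real data it flattens once the informative NIR dimensions are included.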
Assessor 6 is also plotted for comparison. The results show that the consensus of the six similar assessors is clearly easier to relate to the NIR data than the full consensus. Together, these results indicate that assessors 1, 4, 9, 7, 2 and 10 are similar and have a simpler and more predictable relationship to NIR than the other four. A possible and quite tempting explanation for this is that they are simply more reliable than the rest of the assessors in this case. If no strong non-linearities are present in the relationship, this is clearly the most natural explanation. The results also give a clear indication of a possible outlier: assessors 1, 2, 4, 7, 9 and 10 are very similar, then there is a gap to the next group 3, 5, 8, before a new gap separates assessor 6 from the rest. This analysis shows that the HCA/PD approach is a natural first step in an analysis of sensory panel data.

[Figure 7 shows a single-linkage dendrogram with assessors ordered 1, 4, 9, 2, 10, 7, 3, 5, 8, 6.] Figure 7: Pea data: dendrogram, HCA with the Procrustes Distance.

5 Conclusions & Discussion

5.1 Conclusions

The simulations illustrated a number of situations where HCA could be used in combination with GPA. Single linkage and centroid linkage seemed to be better at isolating group structures and outliers. The centroid variant has two drawbacks: first, it exhibits the "inversions", which are generally considered inappropriate, and secondly, it requires the computation of a GPA consensus at each amalgamation step, making it computationally intensive. The single linkage HCA, using the Procrustes Distance as dissimilarity measure, is therefore the approach we will recommend for sensory analysts.

[Figure 8 shows regression error curves against the number of principal components (1-10) for each of the ten judges; Judge 6 lies highest, Judge 1 lowest.] Figure 8: Errors for regression of NIR data (compressed data in p-space) onto the individual profiles.
Through our simulations, we have demonstrated that two-group situations and outliers can be detected using this approach. In a real-life experiment (pea data), HCA was demonstrated as a natural first step in examining panel data. The indications given by the clustering were confirmed by later analysis, illustrating the usefulness of the approach we have presented.

5.2 The Researcher's Position

The recommended HCA variant could easily be incorporated into any sensory scientist's toolbox. Rather than using straightforward GPA, a routine check could be run to verify that sufficiently common opinions about the products exist. Otherwise, there is a potential danger that outliers and subgroups may ruin the experiment. Routine use of the suggested clustering methods could also help manufacturers select members for an expert sensory panel. The tools developed here are intended to be used in an informal way, to gain insight into the structure of the dataset. Using HCA would be natural at an exploratory stage, by a person who has a good understanding of the data. So far, the method has been used on one dataset only, and more sets would need to be studied in order to assess its full value as a tool.

[Figure 9 shows regression error curves against the number of principal components (1-10) for Assessor 6, the raw average, the full consensus, and the consensus of assessors 1, 2, 4, 7, 9, 10.] Figure 9: Regression between NIR data and consensuses. The errors for the raw average are plotted for reference (–).

5.3 Further Research

5.3.1 Other distance functions

One of the main criticisms against GPA is that it uses only rigid transformations (rotation, isotropic scaling and translation) to compensate for systematic differences between judges. However, there is no reason to believe that more subtle misunderstandings do not exist.
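As a concrete illustration of what a rigid transformation can absorb, the sketch below fits one profile onto another by translation, rotation and isotropic scaling, the three components GPA compensates for. It is a minimal sketch with random stand-in matrices, not the paper's code; when the disagreement is purely rigid, the residual vanishes.

```python
import numpy as np

def rigid_fit(X, Y):
    """Least-squares fit of Y onto X by translation, rotation and isotropic scale."""
    Xc = X - X.mean(axis=0)                       # remove translations
    Yc = Y - Y.mean(axis=0)
    U, s, Vt = np.linalg.svd(Yc.T @ Xc)
    R = U @ Vt                                    # optimal orthogonal rotation
    scale = s.sum() / np.linalg.norm(Yc) ** 2     # optimal isotropic scale
    return scale * Yc @ R + X.mean(axis=0)

rng = np.random.default_rng(3)
X = rng.standard_normal((60, 13))
# Y is X rotated, scaled and shifted: a purely "rigid" disagreement
Q, _ = np.linalg.qr(rng.standard_normal((13, 13)))
Y = 0.5 * X @ Q + 2.0

# the fit recovers X almost exactly, so the Procrustes residual is ~0
residual = np.linalg.norm(rigid_fit(X, Y) - X)
```

A non-rigid disagreement (e.g. an affine shear) would leave a nonzero residual here, which is exactly the criticism discussed above.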
There are methods that handle such subtleties better than GPA (GCA, Tucker or PARAFAC), but none of these gives tree-structure representations such as our dendrograms. However, since we do not actually carry out a GPA (except in the centroid-linkage case), one can group the objects in more flexible ways by using dissimilarity measures other than the Procrustes Distance. These could be constructed to detect similarity with respect to non-rigid transformations, such as affine transformations, or thin plate splines from shape analysis. It might also be possible to design dissimilarity measures based on entropy (from information theory). In this case, the dissimilarity between two matrices would be determined from their joint entropy. One then does not have to find an optimal mapping from one profile to another; it suffices to measure the degree of common information, which helps one determine whether there is an optimal, possibly non-linear, mapping (Hyvärinen, 1999).

5.3.2 Alternative ways of studying profile data

In this paper, iterative averaging (GPA) and HCA with the Procrustes Distance were presented. These are only two possible ways of exploring profile data. Another approach is to study profiles from the perspective of minimum spanning trees, which is equivalent to single linkage clustering. This technique is used in botany (Dahl, 1982, and Gauslaa, 1985) and computer networking, and has recently been employed to connect multiple PCA and PLS models in chemometrics (unpublished work by Martens, Anderssen & Høy). Minimum spanning trees generated by the Procrustes Distance could be used to create a map (a graph) showing how judges relate.

Appendix

The GPA Algorithm

A simplified version of the GPA algorithm is presented. For a full version with details, see ten Berge (1977).

(1) Scale all matrices Xi to have unit variance.
(2) Make initial rotations of the Xi: rotate X2 to match X1, then X3 to match (1/2)(X1 + X2), and so on. Form an initial consensus Z as the average Z = (1/Q) Σ_{i=1}^{Q} Xi (using the rotated Xi).

(3) Rotate all matrices Xi to match Z.

(4) Recalculate Z as the average of the transformed Xi.

(5) Repeat from (3) until convergence.

(6) Scale all matrices to maximum agreement, preserving the total variance Σ_{i=1}^{Q} ||Xi||²_F = C.

References

[1] Arnold, G.M., Williams, A.A. The use of generalized Procrustes techniques in sensory analysis. In: Statistical Procedures in Food Research (1985).

[2] Baardseth, P., Næs, T., Mielnik, J., Skrede, G., Hølland, S., and Eide, O. Dairy ingredients effects on sausage sensory properties studied by principal component analysis. Journal of Food Science 57 (4), (1992) 822-828.

[3] ten Berge, J.M.F. Orthogonal Procrustes rotation for two or more matrices. Psychometrika 42, (1977) 267-276.

[4] Carroll, J.D., Arabie, P. INDCLUS: an individual differences generalization of the ADCLUS model and the MAPCLUS algorithm. Psychometrika 48, (1983) 157-169.

[5] Cliff, N. Orthogonal rotation to congruence. Psychometrika 31, (1966) 33-42.

[6] Dahl, E. Unpublished work: Ordination analysis. Analysis of randomly selected test areas (Norwegian: Ordinasjonsanalyse. Analyser av tilfeldig valgte prøveflater), (1982) NLH - Norwegian Agricultural University.

[7] Dijksterhuis, G. Procrustes analysis in sensory research. In Multivariate Analysis of Data in Sensory Science, Ed. T. Næs and E. Risvik (Elsevier Science, 1996).

[8] Dryden, I.L., Mardia, K.V. Statistical Shape Analysis (Wiley, 1998).

[9] Gauslaa, Y. The ecology of Lobarion Pulmonariae and Parmelion Capetarae in Quercus dominated forests in south-west Norway. Lichenologist 17 (2), (1985) 117-140.

[10] Goodall, C. Procrustes methods in the statistical analysis of shape (with discussion).
Journal of the Royal Statistical Society: Series B 53, (1991) 285-339.

[11] Gordon, A.D. Classification, 2nd ed. Monographs on Statistics and Applied Probability (Boca Raton, FL: Chapman & Hall, 1999).

[12] Gordon, A.D. and Vichi, M. Partitions of partitions. Journal of Classification 15, (1999) 265-285.

[13] Gower, J.C. Statistical methods for comparing different multivariate analyses of the same data. In Hodson, F.R., Kendall, D.G., and Tautu, P., editors, Mathematics in the Archeological and Historical Sciences, (1983) 138-149. Edinburgh: Edinburgh University Press.

[14] Gower, J.C. Generalized Procrustes analysis. Psychometrika 40, (1975) 33-51.

[15] Green, B.F. The orthogonal approximation of an oblique structure in factor analysis. Psychometrika 17, (1952) 429-440.

[16] Gruvaeus, G.T. A general approach to Procrustes pattern rotation. Psychometrika 35, (1970) 493-505.

[17] Hyvärinen, A. Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. on Neural Networks 10 (3), (1999) 626-634.

[18] Kendall, D.G. Shape manifolds, Procrustean metrics and complex projective spaces. Bulletin of the London Mathematical Society 16, (1984) 81-121.

[19] Kendall, D.G. A survey of the statistical theory of shape. Statistical Science 4, (1989) 87-120.

[20] Kent, J.T., Mardia, K.V. Consistency of Procrustes estimators. Journal of the Royal Statistical Society: Series B 59, (1997) 281-290.

[21] Krieger, A.M., and Green, P.E. A generalized Rand-index method for consensus clustering of separate partitions of the same data base. Journal of Classification 16, 63-89.

[22] Kristof, W. and Wingersky, B. Generalizations of the orthogonal Procrustes rotation procedure to more than two matrices. Proceedings of the 79th Annual Convention of the American Psychological Association 6, (1971) 81-90.

[23] Langron, S.P. and Collins, A.J. Perturbation theory for generalized Procrustes analysis.
Journal of the Royal Statistical Society: Series B 47, (1985) 277-284.

[24] Le, H.-L. and Kendall, D.G. The Riemannian structure of Euclidean shape spaces: a novel environment for statistics. Annals of Statistics 21, (1993) 1225-1271.

[25] Mardia, K.V., Kent, J.T., Bibby, J.M. Multivariate Analysis (Academic Press, 1979).

[26] Martens, H., Anderssen, E., Høy, M. Unpublished work on minimum spanning trees for PCA and PLS models (2000).

[27] Mosier, C.I. Determining a simple structure when loadings for certain tests are known. Psychometrika 4, (1939) 149-162.

[28] Næs, T. and Kowalski, B. Predicting sensory profiles from external instrumental measurements. Food Quality and Preference (4/5), (1989) 135-147.

[29] Risvik, E. and Næs, T. Multivariate Analysis of Data in Sensory Science (Elsevier Science, 1996).

[30] Schönemann, P.H. A generalized solution to the orthogonal Procrustes problem. Psychometrika 31, (1966) 1-10.

[31] Schönemann, P.H. On two-sided orthogonal Procrustes problems. Psychometrika 33, (1968) 19-33.

[32] Schönemann, P.H. and Carroll, R.M. Fitting one matrix to another under choice of central dilation and rigid motion. Psychometrika 35, (1970) 245-255.

[33] Sibson, R. Studies in the robustness of multidimensional scaling: Procrustes statistics. Journal of the Royal Statistical Society: Series B 40, (1978) 234-238.

[34] Sibson, R. Studies in the robustness of multidimensional scaling: perturbation analysis of classic scaling. Journal of the Royal Statistical Society: Series B 41, (1979) 217-229.

[35] Vichi, M. One-mode classification of a three-way data matrix. Journal of Classification 16, (1999) 27-44.

Paper III: BIMA - Blind Iterative MIMO Algorithm

CHAPTER 3. PAPERS

BIMA: Blind Iterative MIMO Algorithm

Accepted for ICASSP 2002

T. Dahl, N. Christophersen, D.
Gesbert

Abstract

Identification of the channel matrix is of main concern in wireless MIMO (Multiple Input Multiple Output) systems. Here, we present an SVD-based approach for blind identification of the main independent parallel channels. The right and left singular vectors are estimated directly (no channel matrix estimation is necessary) and continuously updated during normal transmission. The approach is related to the iterative Power Method [8], as well as to the time reversal approach [4].

1 Introduction

Wireless MIMO systems are capable of delivering large increases in capacity through utilization of parallel communication channels [5], [6], [12]. For an N (receive) × M (transmit) channel matrix H of rank K0 ≤ min(N, M), the parallel channels are naturally realized through the Singular Value Decomposition (SVD) H = U S V^H, when the channel matrix is known both at the transmitter and the receiver side. S is the diagonal matrix of singular values σ1 ≥ σ2 ≥ · · · ≥ σK0 > 0, and

U = [u1, . . . , uK0] ∈ C^{N×K0}   (1)
V = [v1, . . . , vK0] ∈ C^{M×K0}   (2)

are matrices with orthonormal columns that can be used as receive and transmit vectors {ui} and {vi}, respectively. One can select a number K (K ≤ K0) of transmit/receive vector pairs to use for communication. Under stationary conditions, one may try to determine H experimentally and subsequently perform the SVD, as in the sonar application [10]. For time-varying systems, most studies have assumed that H is unknown at the transmitter and known - through training data - at the receiver. However, this first implies overhead, and second, channel knowledge at the receiver only leads to less efficient use of the MIMO system: the transmit array diversity gain is not realized, and one is unable to transmit on the top singular vectors, those giving the best performance/complexity tradeoff.
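The SVD decoupling can be illustrated numerically: transmitting a symbol on a right singular vector v_i and projecting the received vector onto u_i yields the symbol scaled by σ_i, independently for each i. A minimal sketch (random real-valued stand-in channel, no noise):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 4
H = rng.standard_normal((N, M))          # stand-in channel matrix

U, s, Vt = np.linalg.svd(H)              # H = U diag(s) V^T

symbols = np.array([1.0, -1.0])          # BPSK symbols on the two top channels
x = Vt[:2].T @ symbols                   # transmit on v1 and v2
y = H @ x                                # propagate through the channel
received = U[:, :2].T @ y                # match-filter with u1 and u2

# each symbol arrives on its own parallel channel, scaled by its singular value
assert np.allclose(received, s[:2] * symbols)
```

Recovering the symbols then only requires dividing out the singular values (or, for BPSK, just taking signs), which is why the per-channel SNR is governed by σ_i.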
2 Methods

In the method presented, two-way transmission of data allows the two parties to estimate a selected set of left and right singular vectors, without explicit knowledge of H. Unlike previous methods for blind MIMO estimation (for example [13] and references therein), which rely on statistically based estimation of the channel matrix, our technique estimates the eigen-structure of the MIMO channel directly, without the need for an actual SVD. The key advantage of this technique is that it exploits transmission of regular symbol data to acquire updates of the singular vectors.

Assume a flat-fading MIMO channel H exhibiting reciprocity: the uplink and downlink channels are the same, as in TDD (Time Division Duplex) systems. Without noise, transmission (s) and reception (r) for two parties X and Y (for instance, X = base station, Y = mobile) are given by:

yr = H xs,   xr = H^T ys   (3)

Taking xs = vi, ys = uj, we have

yr = H xs = H vi = σi ui   (4)
xr = H^T ys = H^T uj = σj vj   (5)

Therefore, transmitting data on a right singular vector yields data lying on the corresponding left singular vector, and vice versa. Consider now the following Power Method [8]:

1. xs = initial, ys = initial.
2. yr = H xs, xr = H^T ys.
3. ys = yr/||yr||, xs = xr/||xr||.
4. Repeat from 2.

Expressing the initial xs and ys in terms of the basis vectors vj and uj, respectively, it can be shown that ys → u1 and xs → v1. This is a straightforward generalization of the proof in [8]. Thus, as suggested independently by Bach-Andersen [1], transmission and retransmission lead to convergence to the first pair of singular vectors. This algorithm is called the NIPALS algorithm [14] in chemometrics, but could equally well be termed a Two-way Power Method. It is closely related to the time reversal approach [4].
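The Two-way Power Method above can be sketched in a few lines; this is an illustrative stand-in (a fixed real test channel with well-separated singular values rather than a fading one, an assumption for the example), with convergence checked against a reference SVD:

```python
import numpy as np

# stand-in channel; well-separated singular values assumed for fast convergence
H = np.array([[3.0, 0.5, 0.0, 0.0],
              [0.0, 2.0, 0.3, 0.0],
              [0.0, 0.0, 1.0, 0.1],
              [0.0, 0.0, 0.0, 0.5]])

rng = np.random.default_rng(2)
x = rng.standard_normal(4); x /= np.linalg.norm(x)   # initial transmit vector at X
y = rng.standard_normal(4); y /= np.linalg.norm(y)   # initial transmit vector at Y

for _ in range(100):                  # steps 2-4: transmit, retransmit, normalize
    yr, xr = H @ x, H.T @ y
    y, x = yr / np.linalg.norm(yr), xr / np.linalg.norm(xr)

# compare with a reference SVD: x -> v1 and y -> u1, up to sign
U, s, Vt = np.linalg.svd(H)
assert abs(x @ Vt[0]) > 1 - 1e-9
assert abs(y @ U[:, 0]) > 1 - 1e-9
```

Note that x effectively undergoes power iteration on H^T H (and y on H H^T) every second pass, which is why the convergence rate is governed by the ratio (σ2/σ1)².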
2.1 Updating several singular vectors during transmission

We now describe how several singular vector pairs of H can be estimated and tracked as part of normal communication. The method works for a noisy, time-varying channel, and is robust, as will be demonstrated by simulations in the next section. For simplicity, all matrices are assumed real, but generalization to the complex case is straightforward.

Assume that a block of data is sent during a short period of time in which the channel H^k = H(t_k) can be assumed constant. Sending and receiving data through a noisy channel is expressed as

Y_r^k = H^k X_s^k + η_y^k   (6)
X_r^k = H^{kT} Y_s^k + η_x^k   (7)

Using estimates Û^{k−1} ∈ R^{N×K} and V̂^{k−1} ∈ R^{M×K} from the previous iteration, symbols can be sent using X_s^k = V̂^{k−1} C_x^k and Y_s^k = Û^{k−1} C_y^k. Here, C_x^k and C_y^k are the K-by-r symbol matrices holding the binary (±1) symbols that need to be transmitted at time t_k. The block size r is a design parameter, selected with respect to the slot size and the variation of the channel. In the algorithm we implemented, several blocks of data can be transmitted within a slot. To keep the algebra simple, however, we will in the following assume that just one block of data is sent per slot. The central point in the forthcoming computations is the following pair of singular vector relations: if H^k = U^k S^k V^{kT}, then

H^{kT} U^k = V^k S^k   (8)
H^k V^k = U^k S^k   (9)

which in particular is to say that multiplication of H^k or H^{kT} by suitable orthogonal matrices yields matrices with perpendicular column vectors as output (e.g. U^k S^k or V^k S^k). Using Û^{k−1}, V̂^{k−1} rather than the true U^k, V^k (which are unavailable), the approximate relations become

Y_r^k = H^k V̂^{k−1} C_x^k + η_y^k = Ũ^k Ŝ_L^k C_x^k + ε_y^k   (10)
X_r^k = H^{kT} Û^{k−1} C_y^k + η_x^k = Ṽ^k Ŝ_R^k C_y^k + ε_x^k   (11)

The estimates Ũ^k and Ṽ^k are not identical to Û^{k−1} and V̂^{k−1}, but are rather the "approximate SVD counter-pairs" that arise when V̂^{k−1} and Û^{k−1} are multiplied by H^k and H^{kT}.
To save space, we shall assume that the error terms ε_y^k, ε_x^k reflect both the "failure of perpendicularity" and the "delay error" arising from using the old transmission vectors Û^{k−1}, V̂^{k−1}, as well as the channel noise. At the recipient side, the symbols can be re-estimated through multiplication by the old transmission matrices Û^{k−1}, V̂^{k−1}, followed by taking signs:

Ĉ_x^k = sign(Û^{k−1,T} [Ũ^k Ŝ_L^k C_x^k + ε_y^k]) ≈ sign(Ŝ_L^k C_x^k)   (12)
Ĉ_y^k = sign(V̂^{k−1,T} [Ṽ^k Ŝ_R^k C_y^k + ε_x^k]) ≈ sign(Ŝ_R^k C_y^k)   (13)

Ŝ_L^k and Ŝ_R^k are defined to be positive, and have no effect on the sign operator. (It is also reasonable to assume Ŝ_L ≈ Ŝ_R, i.e. that the left and right side singular value estimates are roughly the same.) If the errors ε_y^k, ε_x^k are small relative to the singular value estimates Ŝ_L^k, Ŝ_R^k (in a Frobenius-norm sense, e.g. ||Ŝ_L||²_F ≫ ||ε_y^k||²_F), the symbol estimates will be correct. The reason is that Ũ^{kT} Û^{k−1} ≈ I and Ṽ^{kT} V̂^{k−1} ≈ I under the sign operator; in other words, the approximation errors are not large enough to change the signs. Once symbol estimates are obtained, they can be used to find new estimates Û^k and V̂^k. Assuming Ĉ_x^k and Ĉ_y^k to have full rank (implying r ≥ K), consider

Y_r^k Ĉ_x^{k†} ≈ Ũ^k Ŝ_L^k   (14)
X_r^k Ĉ_y^{k†} ≈ Ṽ^k Ŝ_R^k   (15)

where C† denotes the Moore-Penrose pseudo-inverse of C. The right-hand side matrices in equations (14), (15) are, ideally, composed of scaled singular vectors. Normalizing the columns on the left-hand side, one could hope to retrieve the true singular vectors U^k, V^k. In addition to the presence of channel noise, however,

• the original vectors Û^{k−1}, V̂^{k−1} were incorrect, so orthogonality will not appear, and

• through multiple passes, the Power Method will attract all column vectors towards the first singular vector.

Consequently, an orthogonalization process must be applied to prevent unwanted convergence of all columns towards the first singular vector pair.
A QR-decomposition (Gram-Schmidt process) can be used to this end,

Y_r^k Ĉ_x^{k†} = Q_u R_u   (16)
X_r^k Ĉ_y^{k†} = Q_v R_v   (17)

taking Û^k = Q_u and V̂^k = Q_v as the new estimates, effectively "filtering out" the scaling effects of Ŝ_R, Ŝ_L. The reason why the estimates Û^k, V̂^k are better than the previous Û^{k−1}, V̂^{k−1} is that they have passed one more time through the channel H. It can be shown, mathematically as well as experimentally, that the method converges to the correct vectors after a few iterations [3]. Our method has a connection with Decision-Directed (DD) estimation, with one important exception: one can start with any guess for the singular vectors (and consequently get wrong estimates of the transmitted symbols), but the method will still converge to the correct set of singular vectors. (The symbols will be correct only up to a permanent change of sign, e.g. +1 transmitted may become -1 when received; this can be taken care of by using differential coding.) In the case of K = 1 (maximum diversity algorithm), the method is completely insensitive to symbol decisions. Other important issues addressed in our full paper [3] include:

• How the method depends on the convergence properties of the Power Method.

• How the method relates closely to the Power Iterations for symmetric matrices used by Golub and Van Loan [8].

A simplified version of the BIMA algorithm is given in Table 1.

3 Simulations & Results

We investigate the performance of the proposed algorithm in an exemplary radio communication situation with moderate to fast time-varying, Rayleigh fading channels. We consider here an i.i.d. MIMO model, but more general models can equally well be used ([2], [7]). We first plot the Bit Error Rate (BER) performance, then illustrate the tracking of the eigen-modes by the method.

1. Û^0 = initial, V̂^0 = initial. Set k = 1.
2. X_s^k = V̂^{k−1} C_x^k, Y_s^k = Û^{k−1} C_y^k
3. Y_r^k = H^k X_s^k, X_r^k = H^{kT} Y_s^k
4. Ĉ_x^k = sign(Û^{k−1,T} Y_r^k), Ĉ_y^k = sign(V̂^{k−1,T} X_r^k)
5.
[Û^k, R_u] = qr(Y_r^k Ĉ_x^{k†}), [V̂^k, R_v] = qr(X_r^k Ĉ_y^{k†})
6. Increase k, repeat from 2.

Table 1: Simplified BIMA Algorithm. An important detail is the sorting of the columns of Y_r^k, X_r^k by norm prior to performing the QR-decomposition. Some extra rules are needed to handle changes in the sorting order (to be covered in [3]).

Figure (1) shows the BER for a channel H with 4 transmitters and 4 receivers in a pedestrian-like environment (10 Hz Doppler spread), of which we choose to estimate/track the two best eigen-modes (out of four). The transmission rate is 22 kbits/s (GSM-like) per eigen-channel, and the downlink/uplink slotting is 50 bits/slot, corresponding to a 2.3 ms ping-pong period. In fact, much higher data rates can be used, since the algorithm is mostly sensitive to the number of Tx/Rx iterations, itself determined by the ping-pong period, and not to how many bits are in each slot. The plot shows the BER, averaged over the two top eigen-modes, during the stationary regime (tracking mode) of the algorithm. For reference, we compare with the same situation where an SVD is instead computed at each slot from a perfectly known channel. The plot shows a 5 dB difference. A trained estimation algorithm would give intermediate performance. Figure (2) illustrates the tracking of singular values. The quality of such estimates depends on the quality of the estimates of the corresponding singular vectors. In this case the Doppler spread was increased to 100 Hz (vehicular situation), using a transmission rate of 100 kbits/s and a slot size of 25 bits, corresponding to 0.25 ms. The SNR was 15 dB. The plot shows how well the two top singular values (and more generally, singular vector pairs) are tracked despite the fast fading.

4 Discussion

The demonstrated scheme used binary transmission (BPSK) and real-valued matrices. Using complex modulations is straightforward. The methods discussed assumed reciprocity (uplink is H^k, downlink H^{kT}).
If this is not the case, Ûk, V̂k will no longer be singular vector estimates, but rather estimates relating to eigenvectors of certain products of the uplink and downlink channel matrices. The presented ideas are still valid, but some kind of phase normalization must be applied. A generalized singular value decomposition (GSVD) approach may also be worth considering. These questions are addressed in the full paper [3]. It is also clear that, for the symmetric case, the convergence of the eigenvectors could be considerably improved. The QR-decomposition effectively filters out important information held in the received data blocks. There is most likely a connection with Krylov methods for estimating eigenvectors [8].

Figure 1: Bit Error Rates versus SNR. The lower line is the BER/SNR curve when the channel matrix H is known, the upper when it is unknown and the singular vectors are estimated by our procedure.

Figure 2: The BIMA algorithm used to track the two top singular vectors/values on a 4 TX x 4 RX fading channel with a 100 Hz Doppler shift, 100 kbit/s and 25 bits per slot, corresponding to 0.25 ms. The SNR was 15 dB. Two independent channels (K = 2) were used. The heavy lines are the true singular values, the jagged ones are estimates.

References

[1] J. Bach Andersen, “Array gain and capacity for known random channels with multiple element arrays at both ends”, IEEE Journal on Selected Areas in Communications, Vol. 18, No. 11, 2000, pp. 2172–2178.

[2] H. Bolcskei, D. Gesbert, A. Paulraj, “On the capacity of OFDM-based multi-antenna systems”, submitted to IEEE Trans. on Communications, Nov. 1999. Shorter version in ICASSP 2000.

[3] T. Dahl, N. Christophersen, D.
Gesbert, “A blind iterative MIMO algorithm based on the Power Method” (in preparation).

[4] M. Fink, “Time-reversed acoustics”, Phys. Today, Vol. 20, 1997, pp. 34–40.

[5] G.J. Foschini, “Layered space-time architecture for wireless communications in a fading environment”, Bell Labs Technical Journal, Vol. 1, No. 2, 1996, pp. 41–59.

[6] G.J. Foschini, M.J. Gans, “On the limit of wireless communications in a fading environment when using multiple antennas”, Wireless Personal Communications, Vol. 6, No. 3, 1998, pp. 311–335.

[7] D. Gesbert, H. Bolcskei, D. Gore, A. Paulraj, “MIMO channels: Capacity and performance prediction”, submitted to IEEE Trans. on Communications, July 2000. Shorter version in Proceedings of IEEE Globecom Conference, Nov. 2000.

[8] G. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, 3rd edition, 1996.

[9] B. Hassibi, “An efficient square-root algorithm for BLAST”, ICASSP 2000.

[10] D.B. Kilfoyle, “Spatial Modulation in the Underwater Acoustic Communication Channel”, PhD Thesis, MIT, June 2000.

[11] A. Paulraj, C. Papadias, “Space-time processing for wireless communications”, IEEE Signal Processing Magazine, Nov. 1997.

[12] E. Telatar, “Capacity of multi-antenna Gaussian channels”, European Trans. on Telecommunications, Vol. 10, No. 6, 1999, pp. 585–595.

[13] A. Touzni, I. Fijalkow, M.G. Larimore, J.R. Treichler, “A globally convergent approach for blind MIMO adaptive deconvolution”, IEEE Transactions on Signal Processing, Vol. 49, No. 6, 2001, pp. 1166–1178.

[14] H. Wold, “Soft modelling by latent variables: The non-linear iterative partial least squares (NIPALS) approach”, Perspectives in Probability and Statistics, Papers in Honour of M.S. Bartlett, 1975, pp. 117–142.

Paper IV: Blind MIMO Estimation based on the Power Method

Blind MIMO Estimation based on the Power Method

T. Dahl, N. Christophersen, D.
Gesbert

Abstract

Identification of the channel matrix is of main concern in wireless MIMO (Multiple Input Multiple Output) systems. Here, we present an SVD-based approach for blind identification of the main independent parallel channels. The right and left singular vectors are estimated directly (no channel matrix estimation is necessary) and continuously updated during normal transmission. The approach is related to the iterative Power Method [9], as well as the “time reversal mirror” approach [4].

1 Introduction

Wireless MIMO systems are capable of delivering large increases in capacity through utilization of parallel communication channels [5], [6], [7], [13]. Appearing first in a series of information theory articles published by members of Bell Labs, MIMO systems have quickly evolved to become one of the most popular topics among wireless communication researchers. MIMO figures prominently on the list of ’hot’ technologies that may resolve the bottlenecks of traffic capacity in the forthcoming broadband wireless Internet access networks (UMTS1 and beyond).

Multiple antennas at both the transmitter and the receiver create a matrix channel. The key advantage is the possibility of transmitting over several spatial modes of the channel matrix within the same time-frequency slot, at no additional power expenditure. In addition, if the channel matrix is known both at the transmitter (TX) and the receiver (RX), certain spatial modes (singular modes) can be used to maximize the SNR. For an N (receive) × M (transmit) channel matrix H of rank K0 ≤ min(N, M), these modes are naturally realized through the Singular Value Decomposition (SVD) H = USV∗. Here, “∗” denotes the complex conjugate transpose, S is the diagonal matrix of singular values σ1 ≥ σ2 ≥ · · · ≥ σK0 > 0, and

U = [u1, . . . , uK0] ∈ C^(N,K0)    (1)
V = [v1, . . . , vK0] ∈ C^(M,K0)    (2)

are matrices with orthonormal columns that can be used as receive and transmit vectors {ui} and {vi}, respectively. One can select a number K (K ≤ K0) of transmit/receive vector pairs to use for communication. Under stationary conditions, one may try to determine H experimentally and subsequently perform the SVD, as in the sonar application [11]. For time-varying systems, the majority of algorithms have assumed that H is unknown at the transmitter and known, through training data, at the receiver (e.g. V-BLAST, [5], [8]). However, first, this implies overhead, and second, the use of channel knowledge only at the receiver leads to less efficient use of the MIMO system. The transmit array diversity gain is not realized, and one is unable to transmit on the top singular vectors, those giving the best performance/complexity tradeoff. The advantage of a method employing the SVD over a V-BLAST-like (training-based) algorithm is the possibility of performing spatial water-filling, in which an optimal weighting of the eigenmodes is used.

In this paper, the contributions are twofold. Firstly, in the method presented, two-way transmission of regular data allows the two parties to estimate a selected set of left and right singular vectors without explicit knowledge of H. Secondly, unlike previous methods for blind MIMO estimation (for example [14] and references therein), which rely on statistically based estimation of the channel matrix, our technique estimates the eigen-structure of the MIMO channel directly, without need of a repeated block SVD. The key advantage of this technique is that it exploits transmission of regular symbol data to acquire an update of the singular vectors (our first paper [2] on this technique was accepted for ICASSP 2002). Our approach has a connection with the “time reversal mirror”, developed by M.

1 Universal Mobile Telephone Services
Fink [4] in ultrasound imaging: by repeatedly sending a pulse into a body, recording the reflected signals, and re-sending them after normalization and time reversal2, convergence towards a top eigenvector is reached. This technique is used, for example, for detection and destruction of kidney stones, which typically correspond to a top eigen-mode. This physical process is nothing but a Power method (see section 2.1) for finding a top eigenvector. Our technique works analogously: by sending a signal up and down a channel, there is a natural convergence towards the top singular mode. Other authors have also commented on such sending/re-sending schemes. Bach Andersen [1] has observed independently that such a procedure leads to convergence towards the top singular mode of the channel. Kilfoyle [11], on the other hand, uses training data to find singular modes in a non-flat fading (underwater) channel, but also comments that there is important information in the data vectors sent up and down. To the best of our knowledge, however, we are the first to use these ideas in a blind MIMO communication setting and to incorporate simultaneous estimation of multiple singular modes without any prior estimate of the channel matrix.

2 In the frequency domain, this is the same as complex conjugation.

1.1 Organization of the paper

The paper is laid out as follows: Section 2 presents the methodology and algorithm (2.2–2.3), and section 2.4 provides a visualization to enhance the understanding of the algorithm’s workings. We then consider details for improving performance (2.5), smoothing of the singular vectors for robustness (2.6), and differential symbol coding (2.7). Section 3 is a simulation section, showing the performance of the method in various communication scenarios, both in the acquisition phase and in tracking of a time-varying channel.
Section 4 concludes on the findings and discusses potential improvements and extensions.

Notation:
H — channel matrix
{ui}, {vi} — singular vectors
{σi} — singular values
U, S, V — singular vector/value matrices
Û, V̂, Ŝ — estimates of vectors/values
c, cx, cy — symbols
C, Cx, Cy — symbol blocks
Ĉ, Ĉx, Ĉy — symbol block estimates
x, y — transmit/receive vectors
X, Y — transmit/receive matrices

2 Algorithm

2.1 Preliminaries: the Power method

We briefly recapitulate the Power method, an iterative numerical method for finding eigenvectors; for a full reference, see [9]. Assume a matrix A ∈ R^(P,P), symmetric and real. This matrix has an eigen-decomposition

A = Σ_{j=1}^{P} λj vj vjT    (3)

where {v1, . . . , vP} are the orthonormal eigenvectors, and the eigenvalues {λj} are ordered, λ1 ≥ λ2 ≥ · · · ≥ 0. If A is not of full rank (rank(A) = r < P), then the eigenvalues λ_{r+1}, . . . , λP are zero, and the corresponding eigenvectors v_{r+1}, . . . , vP span the null-spaces (row and column) of A. Now, assume a random vector x(0) in R^P. This vector can be decomposed using the eigenvectors {vj},

x(0) = Σ_{k=1}^{P} ck vk    (4)

If this vector is repeatedly pre-multiplied by A, x(i) = A · · · A x(0) = A^i x(0), then

x(i) = (Σ_{j=1}^{P} λj^i vj vjT)(Σ_{k=1}^{P} ck vk) = Σ_{j=1}^{P} λj^i cj vj    (5)

will be dominated by the term λ1^i c1 v1 as i tends to infinity. If the vector x(i) is normalized after each iteration, this becomes a method for finding the top eigenvector v1. The idea of using matrix powers for finding eigenvectors also carries over to non-symmetric and non-real matrices. A generalization of the Power method (sometimes called NIPALS, [15]) can be used for finding singular vectors, which is what we are interested in.
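As an aside, the normalized iteration of equations (3)–(5) can be sketched in a few lines of present-day numpy code (an illustrative rendering of the textbook method, not part of the paper):

```python
import numpy as np

# Power iteration for the top eigenvector of a symmetric matrix, eqs. (3)-(5):
# repeatedly pre-multiply by A and normalize; the top eigen-component dominates.
rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
A = B @ B.T                       # symmetric (PSD) test matrix
x = rng.standard_normal(5)        # random starting vector x(0)
for _ in range(1000):
    x = A @ x
    x /= np.linalg.norm(x)        # normalize after each iteration

w, V = np.linalg.eigh(A)
print(abs(x @ V[:, -1]))          # alignment with the top eigenvector, close to 1
```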
2.2 Estimating multiple singular vector pairs

Consider the following (noiseless) scheme for transmission on a TDD (Time Division Duplex) channel H exhibiting reciprocity:

y(i) = H x(i−1)    (6)
z(i) = HT w(i)    (7)

Here, x(i−1) is the data vector sent uplink (UL), and w(i) is the data vector sent downlink (DL). We introduce feedback by defining

w(i) := ȳ(i),  x(i) := z̄(i)

In effect, this states that the signal received by one party is returned to the other party after complex conjugation. But then, (7) becomes

z(i) = HT w(i) = HT ȳ(i)

and since x(i) = z̄(i), conjugating the latter equality gives

x(i) = H∗ y(i)    (8)

Equations (6) and (8) serve as the basis for our elaborations. Usually, a block of data (a matrix) will be sent, but for now consider the case of sending one individual vector. Assume that transmission starts with a random vector x(0), and let H = Σ_{i=1}^{min(N,M)} σi ui vi∗ be a full SVD of H, including the singular vectors spanning the null-spaces, which correspond to singular values σi = 0. Using this basis, x(0) can be decomposed as x(0) = Σ_{j=1}^{min(N,M)} αj vj for some set {αj} of constants. Now,

y(1) = H x(0) = (Σ_i σi ui vi∗)(Σ_j αj vj) = Σ_i σi αi ui    (9)
x(1) = H∗ y(1) = (Σ_j σj uj vj∗)∗(Σ_i σi αi ui) = Σ_i σi² αi vi    (10)

Continuing this way, one arrives at the following recursion, for i ≥ 1:

y(i) = Σ_k σk^(2i−1) αk uk    (11)
x(i) = Σ_k σk^(2i) αk vk    (12)

Clearly, y(i) will be dominated by u1, and x(i) by v1, as i → ∞. If, after each iteration, normalization is applied, i.e.

x(i) := x(i)/||x(i)||2    (13)
y(i) := y(i)/||y(i)||2    (14)

then x(i) → v1 and y(i) → u1 as i → ∞. Now, assume that some estimates û1 and v̂1 are known from this estimation process.
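The single-pair ping-pong recursion of equations (6), (8), (13) and (14) can be sketched as follows (a numpy illustration of the scheme, not the paper's implementation):

```python
import numpy as np

# Ping-pong iteration on a complex channel: uplink y = Hx, conjugate-and-return
# x = H* y, normalizing after each hop. x converges to v1, y to u1 (up to phase).
rng = np.random.default_rng(2)
H = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
x = rng.standard_normal(3) + 1j * rng.standard_normal(3)   # random x(0)
for _ in range(100):
    y = H @ x                     # uplink, eq. (6)
    y /= np.linalg.norm(y)        # normalization, eq. (14)
    x = H.conj().T @ y            # conjugate-and-return, eq. (8)
    x /= np.linalg.norm(x)        # normalization, eq. (13)

U, s, Vh = np.linalg.svd(H)
print(abs(Vh[0] @ x), abs(U[:, 0].conj() @ y))   # both close to 1
```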
The following iteration is then used to estimate the second pair of singular vectors, u2, v2:

y2(i) = (I − û1 û1∗) H x2(i−1)    (15)
x2(i) = (I − v̂1 v̂1∗) H∗ y2(i)    (16)

Normalization of the vectors y2(i), x2(i) must be included in the same way as in (13) and (14). This technique effectively removes the contribution of the first singular vector pair from the sums. Consequently, x2(i) and y2(i) will now converge towards the second pair of singular vectors u2 and v2, provided that the estimates of the first singular vectors are sufficiently good. The orthogonalization is easily seen to be a Gram-Schmidt process, and one can expect to find all the singular vector pairs by keeping successive estimates perpendicular to each other and of unit length. To estimate the r-th singular vector pair, one uses

yr(i) = (I − Σ_{k=1}^{r−1} ûk ûk∗) H xr(i−1)    (17)
xr(i) = (I − Σ_{k=1}^{r−1} v̂k v̂k∗) H∗ yr(i)    (18)

always with a subsequent normalization. In practice, there is no need to wait with estimating the second (or third, or fourth) pair until the first one is correctly estimated. The following algorithm will estimate a full set of singular vectors (held as columns of Û, V̂):

1. X0 = random, i = 1
2. Yi = H Xi−1
3. QR = Yi , Yi := Û = Q
4. Xi = H∗ Yi
5. QR = Xi , Xi := V̂ = Q
6. Increase i, repeat from 2. until convergence

Here, Xi is the uplink (UL) data, and Yi is the downlink (DL) data. QR = Z denotes the decomposition of a matrix Z into an orthogonal matrix Q and an upper triangular matrix R (see e.g. [9] for details on the QR-decomposition). The matrices Xi, Yi converge to the matrices of singular vectors, Xi → V, Yi → U. Note that one part of the job (recording, orthogonalization, conjugation and re-sending) could be carried out by one party, and the corresponding (but independent) part by the other one.
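The block algorithm above (steps 1–6) amounts to a QR-based orthogonal iteration alternating between H and its transpose; a numpy sketch of the noiseless, real-valued case (our rendering, for illustration only):

```python
import numpy as np

# Steps 1-6 for a fixed real channel: multiply by H (uplink), Gram-Schmidt via
# QR, multiply by H^T (downlink), QR again. Y converges to U and X to V,
# column by column, up to signs.
rng = np.random.default_rng(3)
H = rng.standard_normal((4, 3))
X = rng.standard_normal((3, 3))       # step 1: X0 = random
for _ in range(200):
    Y, _ = np.linalg.qr(H @ X)        # steps 2-3: uplink block, then QR
    X, _ = np.linalg.qr(H.T @ Y)      # steps 4-5: downlink block, then QR

U, s, Vt = np.linalg.svd(H)
print(np.abs(np.diag(U[:, :3].T @ Y)))   # each entry close to 1 (match up to sign)
```

In the operational setting, the Y-side of the loop would be carried out by one party and the X-side, independently, by the other.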
This gives a functional framework where only one set of singular vectors, {ui} or {vi}, is known by each party, which is sufficient for operating the channels. This algorithm is a realization of equations (17), (18), since the QR decomposition is nothing but a Gram-Schmidt process, assuming the QR algorithm keeps the diagonal elements of the R matrix real and positive. Our algorithm for finding multiple singular vector pairs is a direct extension of the QR-method of Golub & Van Loan (1996) for finding multiple eigenvectors: it finds singular vectors of non-symmetric matrices rather than eigenvectors of symmetric matrices.

2.3 Transmitting symbol data while estimating the SVD

Using the algorithm above, one can estimate a set of singular vectors by transmitting blocks Xi, Yi and performing successive QR-decompositions. We select symbols cj ∈ C from a modulation alphabet. If the singular vector pairs {ui}, {vi} are known, one can use the transmit vectors

xs = Σ_j cxj vj    (19)
ys = Σ_j cyj uj    (20)

Here cxj and cyj, both complex numbers, correspond to data symbols to be transmitted uplink and downlink. The singular vector pairs (u1, v1), (u2, v2), . . . are known to be the vectors giving the highest SNR (signal-to-noise ratio), thus maximizing the chance of correct symbol reconstruction. When these vectors are transmitted, they will be received as

yr = H xs = H Σ_j cxj vj = Σ_j σj cxj uj    (21)
xr = H∗ ys = H∗ Σ_j cyj uj = Σ_j σj cyj vj    (22)

The symbols are then decoded by projection onto the corresponding sets of singular vectors:

ĉxj = uj∗ yr = σj cxj    (23)
ĉyj = vj∗ xr = σj cyj    (24)

Using a set of rules, the (scaled) symbols cxj and cyj can then be decided. In practice, of course, the true singular vectors are unknown and must be replaced by the best available estimates, leading to approximate variants of the equations above (û replaces u, etc.).
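The encode/decode pair (19), (21), (23) is easy to verify numerically when the singular vectors are known exactly (a numpy sketch for the real BPSK case):

```python
import numpy as np

# Encode K symbols on the top-K right singular vectors (eq. 19), pass through
# the channel (eq. 21), and decode by projecting onto the matching left
# singular vectors (eq. 23): each projection returns sigma_j * c_j.
rng = np.random.default_rng(4)
H = rng.standard_normal((4, 4))
U, s, Vt = np.linalg.svd(H)

K = 2
c = np.array([1.0, -1.0])          # BPSK symbols for the two best modes
xs = Vt[:K].T @ c                  # transmit vector, eq. (19)
yr = H @ xs                        # received vector, eq. (21)
c_hat = U[:, :K].T @ yr            # eq. (23): equals s[:K] * c exactly
print(np.sign(c_hat))              # recovers the symbols [1, -1]
```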
Let Cxi, Cyi be symbol blocks (uplink and downlink, respectively), both in C^(K,n) (n ≥ K), containing n vectors of complex symbols for K independent channels. The first column of a block corresponds to the symbols sent at time t, the next at time (t + 1), and so on.

BIMA - Blind Iterative MIMO Algorithm

1. X0 = random, Û0 = random, V̂0 = random, i = 1.
2. Yi = H Xi−1 Cxi−1
3. Decide Ĉxi−1 from Yi∗ Ûi−1
4. QR = Yi Ĉxi−1† , Yi := Ûi = Q
5. Xi = H∗ Yi Cyi
6. Decide Ĉyi from Xi∗ V̂i−1
7. QR = Xi Ĉyi† , Xi := V̂i = Q
8. Repeat from 2. until convergence

If the estimation steps (3. and 6.) are correct, this algorithm is completely equivalent to the first algorithm (assuming Ĉxi Cxi† = I and Ĉyi Cyi† = I, since n ≥ K). The important change is that by multiplying the symbols Cxi, Cyi into the transmit data blocks, each single column of the received matrices has a contribution from each of the singular vectors, as suggested by equations (19) and (20). In the original algorithm, each column, and therefore each transmission, contained one singular vector only. Note that this algorithm, too, can be given an operational form, with the two parties performing independent symbol and singular vector estimation.

2.3.1 Convergence of singular vectors and symbol estimates

It is not immediately clear that the algorithm will converge to the correct set of singular vectors and/or symbols. In fact, even if the singular vectors are correctly estimated, the symbol block estimates Ĉxi, Ĉyi will have their rows biased by multiplication by a complex number of unit norm, due to an ambiguity in the estimates Û, V̂ of the singular vectors: if V and U were estimated together, at the same base station, they could be selected so that their “diagonalization abilities” were real and positive, that is,

ui∗ H vj = dij with dij ∈ R    (25)

with dij = 0 for i ≠ j, and dii ≥ 0.
Without the possibility of rotating the singular vectors (multiplying by a complex number of unit norm), the latter positivity relation will generally fail in the functional setting of the BIMA algorithm, and there will be a systematic rotation of the symbols. Symbol rotation problems are well known in GSM-like systems, however, and are resolved using differential coding, adding two extra steps to our algorithm:

0. Perform differential encoding of the symbol matrices Cxi, Cyi.
9. Perform differential decoding of the estimated symbol matrices Ĉxi, Ĉyi.

A precise description of this encoding/decoding is given in section 2.7. Note that a systematic symbol rotation, corrected by pre- and post-processing the data, does not affect the convergence of the singular vectors. A possible rotation of a symbol cannot be distinguished from multiplication of some singular vector estimate by a complex number d of unit norm. However, a singular vector multiplied by such a number is no less a singular vector3. An extensive discussion is given next to aid the reader’s geometric understanding of the algorithm, including the need for differential coding, the algorithm’s capability of sending symbols across several parallel channels, and its ability to track the singular vector pairs as part of normal operation even when H changes with time.

3 And the same holds true for any estimate of a singular vector.

2.4 Discussion: Behavior and Convergence

The purpose of this section is to illustrate graphically how the singular vector estimates converge to the correct vectors, and how symbol transmission is done while this is happening. For simplicity, consider a real-valued H ∈ R^(2,2), a two-antenna RX/TX scenario. The singular vectors will also be real-valued in R², and the symbol alphabet is limited to two symbols, ±1 (BPSK). The ambiguity (25) will be a ±1-ambiguity.
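Before turning to the geometry, the full iteration of section 2.3 can be sketched in numpy for the real-valued BPSK case (an illustrative rendering under noiseless conditions; variable names are ours, and a real channel would add noise and fading):

```python
import numpy as np

# BIMA steps 1-8, real BPSK, fixed noiseless H: symbols are mixed into each
# transmitted block, decided by sign() against the current estimates, and
# divided out again via the pseudo-inverse before each QR step.
rng = np.random.default_rng(0)
N, M, K, n = 4, 4, 2, 16              # antennas, tracked modes, block length
H = rng.standard_normal((N, M))
U_hat, _ = np.linalg.qr(rng.standard_normal((N, K)))   # step 1: random starts
V_hat, _ = np.linalg.qr(rng.standard_normal((M, K)))
for _ in range(40):
    Cx = rng.choice([-1.0, 1.0], size=(K, n))          # uplink symbol block
    Yr = H @ (V_hat @ Cx)                              # step 2
    Cx_hat = np.sign(U_hat.T @ Yr)                     # step 3: decision
    U_hat, _ = np.linalg.qr(Yr @ np.linalg.pinv(Cx_hat))   # step 4
    Cy = rng.choice([-1.0, 1.0], size=(K, n))          # downlink symbol block
    Xr = H.T @ (U_hat @ Cy)                            # step 5
    Cy_hat = np.sign(V_hat.T @ Xr)                     # step 6: decision
    V_hat, _ = np.linalg.qr(Xr @ np.linalg.pinv(Cy_hat))   # step 7

U, s, Vt = np.linalg.svd(H)
# Subspace alignment with the top-K true left singular vectors, close to 1:
print(np.linalg.svd(U[:, :K].T @ U_hat, compute_uv=False))
```

Even when early symbol decisions are wrong, dividing out the decided block only mixes the columns, so the QR step still tracks the same top-K subspace.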
2.4.1 Convergence of a single singular vector

Figure 1 illustrates the convergence of the Power method when only one singular vector pair is estimated. In panel (a), the true right singular vectors v1, v2 are plotted together with a first guess x1,norm(1). When x1,norm(1) is sent uplink (multiplied by H), it appears as the vector y1(1) on the contour of the ellipse with half-axes σ1 u1 and σ2 u2, as seen in panel (b). If all vectors in this figure are normalized, they naturally appear on the unit circle, see panel (c). The normalized version of y1(1), y1,norm(1), is then sent downlink (multiplied by HT). It appears as x1(2) on the contour of the ellipse with half-axes σ1 v1 and σ2 v2 in panel (d). When normalized to become x1,norm(2), the cycle is completed. The new estimate of v1 is seen to be much closer to the true singular vector than the initial estimate. While the estimates of v1 improve, so do the corresponding estimates of u1 in the right-hand panels of the figure.

2.4.2 Convergence could be towards v1 or −v1

From equations (9), (10) it is seen that convergence of a singular vector can go two ways. If α1 > 0, the convergence (after due normalization) will be towards v1; if α1 < 0, it will be towards −v1, corresponding to the ambiguity in (25). In the rare case α1 = 0, convergence will be towards one of the subsequent singular vectors with corresponding αi ≠ 0, but in practice, round-off errors and noise will prevent this from happening.

2.4.3 Convergence while symbol data is transmitted

This section describes symbol decision. Note that the convergence of the Power method (without symbol modulation) is “continuous”, in the sense that no two successive vector estimates are negatively correlated. If a negative correlation between two successive estimates is detected, this must be because the opposite party changed the sign of the vector before transmitting it.
In the K = 1 case (one channel only), and with a block size of n = 1 vector per block, the BIMA algorithm works as follows. Assume a new vector y1(i) is received by one party. The previously received vector y1(i−1) is, when normalized, also the current estimate of the first singular vector, û1(i−1). This was used for decoding symbols in steps 3. and 6. of our algorithm, or alternatively, in equation (23). Now we can use the sign operator for decision:

ĉx(i) = sign(y1(i)T û1(i−1)) = sign(y1(i)T y1(i−1))    (26)

Figure 1: Estimation of the top singular vectors in a 2 × 2 antenna MIMO system.

Figure 2: Convergence towards v1 or −v1. Panel (a) shows a situation where the initial guess converges towards v1, panel (b) a situation where it converges towards −v1.

This is where differential coding comes in. It is impossible to say whether a (correctly) “received −1” corresponds to a “sent −1”. Rather than trying to detect “+1” and “−1” symbols, shifts of symbols are considered. If two successive symbols are equal, a “+1” is decided on; otherwise, a “−1”. If a “−1” is decided, the received vector y(i) has its sign changed, y(i) := −y(i), to keep the series {y(i)} “up to date”. Figure 3 illustrates two different scenarios, one where the second-iteration (normalized) vector x1,norm(2) is positively correlated with the other two iteration-step vectors, and one where it is negatively correlated. In the former case (a), and following differential coding, the symbol series would be {1, 1}; in the latter case (b), it would be {−1, −1}. If H varies continuously with time, the singular modes will be tracked as part of normal data transmission.
In practice, all the symbols are differentially encoded before they are transmitted and decoded when received. This corresponds to steps 0. and 9. in the algorithm.

Figure 3: Changing the sign of the vector estimate x1,norm(i) only affects the convergence in one respect: it changes the convergence from going towards v1 to going towards −v1. The speed of the convergence is not altered by arbitrarily changing the sign.

2.4.4 Using multiple singular modes for transmission

The basis for using several independent modes of transmission is equations (19), (20), (23) and (24). Figures 4 and 5 illustrate how the algorithm works with two independent channels. Figure 4 shows how symbols and singular vectors are combined into one transmit vector. The (true) singular vectors are parallel with the elementary axes (e1, e2). The figure shows what the transmit vectors might look like when the singular vectors are perfectly known (panel (a)), as well as when an initial, arbitrary guess is made for those vectors (panel (b)). The combined symbol vectors are the ones going from the center onto the circle of radius √2. There are four possible vector combinations in each case. Figure 5 illustrates the improvement of the estimates: even if the initial vectors are themselves incorrect, our procedure of transmission, decoding by combination of symbol vectors, normalization and retransmission leads to gradually better estimates. Let us consider this figure in some more detail. In panel (a), there are two (orthogonal) vectors v1(1) and v2(1) which are initial guesses on the right singular vectors. Assume, as in the previous figure, that the true right singular vectors are parallel with the elementary (e1, e2) axes.
From the initial guesses v1(1) and v2(1), two symbol combinations, v1(1) + v2(1) and v1(1) − v2(1), are formed. These are transmitted (that is, multiplied by H), and the results are visualized in panel (b). In panel (c), the received symbol vectors H[v1(1) + v2(1)] and H[v1(1) − v2(1)] are added to become 2Hv1(1). This can be done provided that the recipient party is able to correctly guess the symbol combinations that were sent. If successful, it is also possible to combine the two vectors to get 2Hv2(1). In panel (d), the normalized version of 2Hv1(1) is taken as the new u1(1) := 2Hv1(1)/||2Hv1(1)||. Note also that if v1 is systematically mistaken for −v1, the combination of the two vectors would amount to −2Hv1(1) rather than 2Hv1(1). This is not a problem, because

• both are equally close to a singular vector, and
• with respect to symbol decoding, the sign ambiguity is taken care of by differential coding.

Figure 4: When more than one singular mode is used to transmit and receive data, superpositions are used. If H ∈ R^(2,2) and two singular modes are to be used, symbol combinations ±v1 ± v2 are sent. Ideally, the symbols will be made from the true singular vectors v1 and v2 (panel (a)), but in practice we have to make initial guesses v1(1) and v2(1). The symbols made from these vectors can be seen in panel (b).

The same procedure is carried out for u2(1), but any component along the new u1(1) is removed from u2(1), which is part of the Gram-Schmidt process. Also in panel (d), two new symbol combinations are made from u1(1) and u2(1) and sent downlink, which is to say that the symbols are multiplied by HT. This can be seen in panel (e). Again, the received vectors are combined, to get 2HT u1(1) and 2HT u2(1).
A new Gram-Schmidt is carried out. The result is new estimates of the right singular vectors, v1(2) and v2(2). Clearly, these are closer to the elementary axes than v1(1) and v2(1) were. Notice, finally, that in practice only one symbol vector can be sent at a time (v1(1) + v2(1) and v1(1) − v2(1) must be sent with a short delay). Provided that H is constant or varies little during this period, the derivations are valid.

2.5 Considerations and Details

2.5.1 Sorting columns prior to QR

The BIMA algorithm is a (pure) Power method for the first columns of Û and V̂. However, since all transmitted vectors are unit vectors, starting from random, it might be a good idea to give “the best candidate a head start”. When the data matrices (blocks) are received and multiplied by the pseudo-inverse of the estimated symbols, the column vectors of these matrices would ideally be scaled singular vectors of the channel matrix:

Xi Ĉyi† = H∗ Ûi Cyi Ĉyi† = VS    (27)
Yi Ĉx(i−1)† = H V̂(i−1) Cx(i−1) Ĉx(i−1)† = US    (28)

under perfect conditions, Ûi = U, V̂(i−1) = V, Ĉxi = Cxi, Ĉyi = Cyi. Under imperfect conditions, an error term could be added to each right-hand side, or the equalities replaced by approximations (≈). This serves as the basis of our method, together with the QR decomposition removing dependencies between the columns. When communication starts, there is no way of knowing which of the column vectors in the transmit matrix constituents Û(i−1) and V̂(i−1) correspond to a first singular vector. Thus, when performing the QR decomposition, it is best to let those columns of Xi Ĉyi† and Yi Ĉx(i−1)† which have the maximum norm come first in the Gram-Schmidt process: these column vectors will be closest to the first singular vectors, etc.
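The head-start idea amounts to permuting columns by norm before the QR step, as in the following sketch (the helper name `qr_sorted` is ours):

```python
import numpy as np

# Sort columns by norm before QR: after dividing out the symbols, each column
# is roughly sigma_i times a singular vector, so the largest-norm column is the
# best candidate to lead the Gram-Schmidt process.
def qr_sorted(B):
    order = np.argsort(-np.linalg.norm(B, axis=0))   # largest norm first
    Q, _ = np.linalg.qr(B[:, order])
    return Q, order

# Toy block whose columns have norms 0.5, 3.0 and 1.0:
B = np.eye(4)[:, :3] * np.array([0.5, 3.0, 1.0])
Q, order = qr_sorted(B)
print(order.tolist())   # [1, 2, 0]: the largest-norm column leads
```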
Figure 5: A scheme consisting of combining vectors with fixed norm (a), vector transmission and observation on the opposite side (b), combination of vectors to form new scaled singular vector estimates (c), normalization (d), and retransmission, combination and normalization in the opposite direction, panels (e), (f) and (a) again.

2.5.2 Data sub-blocks

The singular vector estimates are obtained by combining information from a sub-block of data. The original BIMA algorithm implies the use of a block as a whole (steps 2, 5), but in practice we do not use all the data in a block to estimate the singular vectors. Instead, we use a “moving sub-block” of the received data vectors for this estimation. The reasons for this are:

• By the end of the slot, the information contained in the first samples is outdated, and does not necessarily contribute in a meaningful way.
• By using a moving sub-block for the estimation, we can track the changing recipient vectors. The optimal vectors for decoding the channels will be contained in the recently estimated matrix V̂ (or Û) (steps 4, 7). The encoding vectors in Û (or V̂) will degrade from good estimates to worse ones the further from the slot start one is. This is natural, since no feedback is provided during the sending period. Nonetheless, it is important to track these vectors for symbol decision, even though they diverge from being singular vectors.

2.5.3 Color Coding

In general, the singular values {σi}, estimated by both parties, can be used for ordering the independent channels.
However, in the presence of noise it could happen that the ordering on one side (X) differs from that on the other side (Y), particularly if two singular values are close in magnitude. This could lead to data being sent "to the wrong place", e.g. a bit-stream transmitted on the top singular vector being received on the second singular vector. We therefore assume that color coding, or some other matching technique, is used, so that the different data streams can be recognized by the recipient. An alternative to color coding is a blind technique that can overcome the problem of erroneous ordering of singular values in the presence of noise. Such a method is described in the appendix.

2.6 Smoothing

Both singular value and singular vector estimates are subject to variations because of

• the fading of H,
• the noise in the data,
• the fact that from transmission to transmission the "sub-block" (the data from which vectors/values are estimated) changes, removing one observation from the set and including a new one.

The first we have to live with; the influence of the other two we can reduce. Increasing the size of the sub-block will give lower-variance estimates of all vectors and values, simply because the estimates are averages derived from a larger data set. By using more data vectors, the effect of including/removing a single vector is decreased, and, according to standard theory, the influence of noise is reduced. The drawback is that an increased sub-block size leads to less accurate estimates for a time-varying H, since older data is used. To reduce the variance of the estimation process in BIMA, we focus on averaging singular vectors. Taking an average over the singular vector estimates for a certain time-span is the simplest way of reducing the influence of noise. Let Û(t_k) be the matrix of singular vector estimates at time (or tick) t_k.
Given a history span of Q ticks, the problem is to find an average of these matrices. However, this average must also be orthogonal, and it is not given that an average over a set of orthogonal matrices is itself orthogonal. Thus, we seek an orthogonal average Ũ that solves

$$\min_{\tilde{U}} \sum_{l=0}^{Q-1} \|\hat{U}(t_{k-l}) - \tilde{U}\|^2 \qquad (29)$$

with the requirement that Ũ should be orthogonal, Ũ*Ũ = I. If $M = \sum_{l=0}^{Q-1} \hat{U}(t_{k-l})$, with the SVD M = QPR*, then Ũ = QR* solves the problem (see the Appendix for a proof).

Another way of reducing the effect of the inclusion/removal of data produced by the moving sub-blocks is to use weights. Rather than labeling data vectors as "old" or "new" (inside or outside the sub-block), weights can be included in the averaging process. An intuitive choice for a weighting function is a sigmoid,

$$s(t_i) = \frac{f(t - t_i)}{1 + f(t - t_i)}, \qquad (30)$$

where f(t) is some strictly positive, strictly increasing function, say $f(t) = e^{(t-k)/a}$. The weighting approach is not pursued here.

2.7 Complex differential coding

In the simulation section, we will assume QPSK symbol coding. The decision rule is based on complex signs, a direct extension of the decision rule used in the BPSK case. We now allow our symbols to take on the values ±1 ± i. To compensate for the ambiguity (25), we use the following technique. Let {c_k} denote a sequence of received symbols for any singular vector mode, and let {c'_k} be the resulting symbol series output by our differential decoding technique. The following rules are used:

• If a symbol is the same as the previous one (c_k = c_{k-1}), then c'_k = 1 + i is decoded.
• If a symbol is the opposite of the previous one (c_k = -c_{k-1}), then c'_k = -1 - i is decoded.
• If a symbol is a positive 90° rotation of the previous one (c_k = c_{k-1} e^{iπ/2}), then c'_k = -1 + i is decoded.
• If a symbol is a negative 90° rotation of the previous one (c_k = c_{k-1} e^{-iπ/2}), then c'_k = 1 - i is decoded.

It is easy to make algorithms based on the rotational properties of unit-norm complex numbers, both for encoding and for decoding of these sequences. Note also that sequences coded this way are quite robust to sudden alterations in the rotational displacement: only one symbol will be lost, as opposed to a permanent symbol error. A major drawback of this technique is that it has a steep loss function in the presence of noise.

3 Simulations

This section demonstrates BIMA in a simulated TDD MIMO environment (Rayleigh-fading channel matrix H), using QPSK coding. Summing up the previous sections, the following parameters must be set:

• M, N: number of transmit/receive elements
• K: number of modes
• n: slot size (number of bits in a UL/DL slot)
• n_{Û,V̂}: number of received vectors to use for estimation of Û, V̂
• Q: number of vectors/bits for smoothing the singular vectors

In addition, there are parameters to set when simulating a fading H:

• D: the Doppler shift [Hz]
• f: data rate [bits/second]

3.1 Convergence of Singular Vectors

Figure 6 shows the convergence of the estimated singular vectors in a constant MIMO channel. The initial singular vector estimates are random, and convergence towards the correct vectors is demonstrated. Two singular modes (K = 2) are estimated for a 3 by 3 MIMO channel, with singular values σ₁ = 4, σ₂ = 2, σ₃ = 1. The parameter n_{Û,V̂} is set equal to 400. Various levels of channel noise are tested. The errors of both singular modes are averaged, and convergence for the left and right side vectors is plotted. This demonstrates that the BIMA algorithm converges to the correct set of singular vectors in the presence of channel noise.

Figure 6: Convergence of the singular vectors. The errors $\epsilon^2 = \frac{1}{K}\sum_{i=1}^{K} \|u_i - \hat{u}_i\|^2$ are plotted (as 10 log₁₀(ε²)) as a function of iteration step (slot index), at SNRs from 0.0 dB to 20.0 dB. Here K = 2 is the number of singular modes (channels) used in a 3TX × 3RX system. Panel (a) shows convergence of the left singular vectors, panel (b) convergence of the right singular vectors.

3.2 Simulation of Fading Channel

In figure 7, we simulate a fast-fading 4TX × 4RX channel (D = 50 Hz Doppler spread), with a transmission rate of f = 220 kbit/s per eigen-mode, using K = 2 independent channels, QPSK modulation, and a ping-pong time of 1 ms. Smoothing of the singular vectors was done by averaging over the Q = 4 last singular vectors. n_{Û,V̂} is, as above, equal to 400. 10 simulations were performed for each SNR scenario, and the mean BER over these simulations taken, see figure 7. The upper curve shows the BER obtained with our method after an initial acquisition period; the lower shows what could be achieved if the singular modes of the channel were perfectly known. The loss is in the range 2-5 dB. There seems to be a "flooring effect", which is even more prominent for faster fading channels. This effect does not appear in channels varying more slowly (e.g. 10 Hz or 20 Hz Doppler spread).

Figure 7: The SNR/BER for a 4TX × 4RX channel fading at 50 Hz Doppler, 220 kbit/s. Mean BER value for the various SNRs.

4 Conclusions and Discussion

We have shown that top singular modes can be estimated and tracked without training data, without the need for a statistical estimate of H, and without performing an actual SVD. These results are based upon the assumption that the channel we consider exhibits reciprocity.
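The convergence mechanism underlying these results, alternating transmission with normalization acting as a power iteration on the channel, can be sketched numerically. This is a minimal noiseless sketch of the principle, not the full BIMA algorithm (no symbol coding, one mode only, reciprocity assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.standard_normal((4, 4))     # downlink channel; uplink is H.T (reciprocity)
U, s, Vt = np.linalg.svd(H)

x = rng.standard_normal(4)          # random initial transmit vector at X
x /= np.linalg.norm(x)
for _ in range(500):                # ping-pong: send, normalize, retransmit
    u = H @ x                       # received at Y over the downlink
    u /= np.linalg.norm(u)          # Y normalizes and sends it back
    x = H.T @ u                     # received at X over the uplink
    x /= np.linalg.norm(x)

# x converges (up to sign) to the top right singular vector v1,
# and u to the top left singular vector u1.
```

Each round trip multiplies x by HᵀH (followed by normalization), which is exactly a power iteration; the convergence rate is governed by the ratio (σ₂/σ₁)².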
In general, one can say that transmission and retransmission of vectors in a MIMO system leads to convergence of the involved vectors towards the top singular mode. By combining properties of the modulation alphabet with this transmission/retransmission idea, multiple singular vector pairs can be extracted as part of normal operation.

4.1 Improving & Generalizing BIMA

In another paper [3], we develop a method for blindly estimating singular modes for an FDD (Frequency Division Duplex) channel. In this case, the uplink channel matrix is generally not the transpose of the downlink channel, and so the BIMA iterations and the intrinsic power method cannot be used. More advanced optimization techniques are needed to blindly estimate these modes, which are now doubled in number (one set for the downlink channel, another for the uplink). It can still be done, by combining a local linear optimizer with an approach based on principal component analysis (PCA) for "smoothing the interaction" between the up- and downlink singular modes. As a spin-off, a better way of estimating singular vectors emerges, which can be used to improve the BIMA algorithm. A brief description follows: a key observation in section 2 is that all (normalized) vectors sent through the channel H (or Hᵀ) are received as points on an ellipsoid centered at the origin. The principal axes of this ellipsoid are identical to one set of singular vectors. With only a few sample points, it is possible to estimate such an ellipsoid and extract the singular vectors (figure 8). As a consequence, it is not necessary to use the QR/power-method iterations to find the singular modes; it can be done in one iteration step only (one time-slot). This improves the performance of the algorithm, both in acquisition mode and in tracking mode. The price to pay for this is some more processing complexity.
The ellipsoid fitting involves a norm minimization, equivalent to finding one (bottom) singular vector of a matrix, as well as the eigen-decomposition of a matrix A representing the quadratic form of the ellipsoid. It is not necessary that both parties (base station and subscriber) do this processing; finding the top singular modes at the base is sufficient to improve the singular mode tracking.

5 Appendix

A: Blind ordering of singular vectors

In previous sections, we assumed that the various independent channels were color coded, so that a stream decoded with a singular vector v was correctly associated with its source (encoding vector) u. Here it is shown how channel mapping can be done completely blindly, at the cost of some performance. If the singular values {σᵢ} are sufficiently different from one another, they provide a good label for their corresponding singular vectors. Each party estimates singular values {σᵢ} and vectors {uᵢ} (party Y) or {vᵢ} (party X). Let {σ̂ᵢ}_X denote the singular values estimated by party X, and {σ̂ᵢ}_Y those estimated by party Y. These are smoothed (as time-series), using a moving average, to reduce the influence of channel noise. The natural link between the set {vᵢ} (or {uᵢ}) and the singular values {σ̂ᵢ}_X or {σ̂ᵢ}_Y leads to a natural ordering: the singular vector corresponding to the largest singular value is termed v₁, the one corresponding to the second largest is termed v₂, and so on, as in a normal SVD.

Figure 8: Singular vector estimation. The left panel (a) shows a few observations on an ellipsoid; panel (b) shows the extracted singular vectors u₁, u₂, multiplied by their "lengths" λ₁, λ₂ (equivalent to the singular values up to a scalar). These vectors/values correspond to the half-axes of the ellipse.
If a vector v_j is associated with a specific singular value σ̂_j^X in {σ̂ᵢ}_X, and a vector u_j is associated with the most similar singular value σ̂_j^Y in {σ̂ᵢ}_Y (σ̂_j^Y ≈ σ̂_j^X), then (u_j, v_j) constitutes a natural singular pair: encoding symbol data with v_j will lead to those symbols being decoded and recaptured with u_j, and vice versa. Problems occur when two singular values become close in magnitude. Due to the TDD, the singular values are not estimated simultaneously by both parties: while one party transmits in one time-slot, the other receives and estimates singular values; then the roles are reversed. It might happen that, in a period of time (one or more time-slots) where two singular values are close, the ordering/correspondence between singular vectors and values becomes different for X and Y. This will correct itself when the spacing between the {σᵢ} increases again, but prior to that, one will often experience "data being sent to the wrong place". In such situations, it is better to postpone the sorting altogether, and stick with the vectors in the order they were in before the confusion started. Observe that our algorithm is perfectly capable of tracking the singular vectors even if the corresponding singular value estimates change ordering. The following scheme takes care of the sorting in a way that avoids this potential mix-up: one party, X (the "leader"), is always responsible for re-ordering his singular vectors when he thinks it is safe. This will immediately be observed as a "swap" in the ordering of the singular values by the other party, Y (the "follower"). If this "swap" is sufficiently clear, and not to be confused with the normal (and noisy) variation of the channel, Y can change the order of his singular vectors accordingly. The tricky task here is, of course, the "sufficiently clear" part.
Consider figure 9: the top left panel (a) shows two singular values changing place when no "re-ordering" at the shift point is imposed by party X (assume momentarily that X has knowledge of the singular values at all times). Panel (b) shows how this might be observed by party Y (the "follower"). The blank slots are the transmission periods for Y, during which no singular values are estimated. Assuming little or no noise, the singular values observed are quite close to the true ones (the dotted parts of the lines). Consider now panel (c), where "re-ordering" is introduced by X. Panel (d) shows how this might be observed by Y. Next, consider panel (e), where the singular values do not actually cross, but move close before they part again. Panel (f) shows how this is observed by Y. Now, it is easily seen that it is practically impossible for Y to distinguish between the two situations (d) and (f): Y simply cannot tell whether the two singular values have changed places, whether they have been reordered by X, or whether they did not cross at all. Yet, this knowledge is crucial to get the bit streams distributed correctly. Figure 10 illustrates how the problem is solved: the top panel (a) shows the full series of singular values (observed by the "all-seeing" X), and a re-ordering at a ("safe") point in time after the crossing. The lower panel (b) shows how this is perceived by party Y. In this case, it is clear that the observed shift cannot be due to the ambiguity "re-ordering by X / swap-or-close-in", and so it is safe for Y to change its singular vectors. Finally, it is of course not necessary for X to be "all-seeing"; it suffices to observe the singular values on a grid similar to that of Y. The main idea is to postpone the decision of swapping singular vectors until the confusion around the ordering is over.
Summing up, we devise the following method for blind ordering of the singular channels:

• Party X: if (a) a crossing in singular values is observed, and (b) the singular values have moved to a safe level (say ε₁) apart, swap the order of the singular vectors in accordance with the new {σ̂ᵢ}_X order.
• Party Y: if (a) it is observed that the relation between {σ̂ᵢ}_Y and {uᵢ} has been out of order, (b) their current difference in magnitude is greater than some ε₂, and (c) they suddenly swap to the correct order, then change the ordering of the singular vectors (according to the "correct order", defined by {σ̂ᵢ}_Y).

Clearly, we must assume ε₁ > ε₂, i.e. when party X chooses to perform the swap, he must be certain that a swap will follow by party Y. The thresholds ε₁, ε₂ could be defined as a certain percentage of the overall "energies", $\sum_i \sigma_{i,X}^2$ and $\sum_i \sigma_{i,Y}^2$. In the presence of much noise, there is a chance that this technique will fail to produce corresponding swaps, reducing the performance of the system.

Figure 9: Blind ordering of singular values. The heavy lines denote one time-series of a singular value, the lighter lines another.

Figure 10: Solution of the blind ordering problem: sorting must only be imposed by X when the singular values are sufficiently far apart.

B: Smoothing of singular vectors

The problem

$$\min_{\{\tilde{U} \,:\, \tilde{U}^{*}\tilde{U} = I\}} \sum_{l=0}^{Q-1} \|\hat{U}(t_{k-l}) - \tilde{U}\|^2 \qquad (31)$$

is solved by letting $M = \sum_{l=0}^{Q-1} \hat{U}(t_{k-l})$, with the SVD M = QPR*, and setting Ũ = QR*.

Proof: Let

$$f(\tilde{U}) = \sum_{l=0}^{Q-1} \|\hat{U}(t_{k-l}) - \tilde{U}\|^2 = \sum_{l=0}^{Q-1} \left( \|\hat{U}(t_{k-l})\|^2 - 2\,\mathrm{tr}\, \hat{U}(t_{k-l})^{*} \tilde{U} + \|\tilde{U}\|^2 \right)$$

Now, Ũ and Û(t_k), k = 1, 2, ..., are unitary matrices, thus ||Ũ||² = ||Û(t_k)||² = K, the number of channels in use.
Consequently, it is sufficient to maximize

$$g(\tilde{U}) = \sum_{l=0}^{Q-1} 2\,\mathrm{tr}\, \hat{U}(t_{k-l})^{*} \tilde{U} = 2\,\mathrm{tr} \left[ \sum_{l=0}^{Q-1} \hat{U}(t_{k-l}) \right]^{*} \tilde{U} = 2\,\mathrm{tr}\, M^{*} \tilde{U}$$

Now, finding the matrix Ũ that maximizes g is the same as finding the orthogonal matrix closest to M, because ||M − Ũ||² = ||M||² − 2 tr M*Ũ + ||Ũ||² = C − 2 tr M*Ũ = C − g(Ũ). Standard theory (e.g. the polar decomposition [9]) then gives Ũ = QR*.

References

[1] J. Bach Andersen, "Array gain and capacity for known random channels with multiple element arrays at both ends", IEEE Journal on Selected Areas in Communications, Vol. 18, No. 11, 2000, pp. 2172-2178.
[2] T. Dahl, N. Christophersen, D. Gesbert, "BIMA - Blind Iterative MIMO Algorithm", accepted for ICASSP-2002.
[3] "The Game of Blind MIMO Channel Estimation", in preparation.
[4] M. Fink, "Time-reversed acoustics", Physics Today, Vol. 20, 1997, pp. 34-40.
[5] G.J. Foschini, "Layered space-time architecture for wireless communications in a fading environment", Bell Labs Technical Journal, Vol. 1, No. 2, 1996, pp. 41-59.
[6] G.J. Foschini, M.J. Gans, "On the limit of wireless communications in a fading environment when using multiple antennas", Wireless Personal Communications, Vol. 6, No. 3, 1998, pp. 311-335.
[7] D. Gesbert, H. Bolcskei, D. Gore, A. Paulraj, "MIMO channels: Capacity and performance prediction", submitted to IEEE Trans. on Communications, July 2000. Shorter version in Proceedings of IEEE Globecom Conference, Nov. 2000.
[8] G.D. Golden, G.J. Foschini, R.A. Valenzuela, P.W. Wolniansky, "Detection algorithm and initial laboratory results using the V-BLAST space-time communication architecture", Electronics Letters, Vol. 35, No. 1, 1999, pp. 14-15.
[9] G. Golub and C.F. Van Loan, Matrix Computations, Johns Hopkins University Press, 3rd edition, 1996.
[10] B. Hassibi, "An efficient square-root algorithm for BLAST", ICASSP 2000.
[11] D.B. Kilfoyle, "Spatial Modulation in the Underwater Acoustic Communication Channel", PhD Thesis, MIT, June 2000.
[12] A. Paulraj, C. Papadias, "Space-time processing for wireless communications", IEEE Signal Processing Magazine, Nov. 1997.
[13] E. Telatar, "Capacity of multi-antenna Gaussian channels", European Transactions on Telecommunications, Vol. 10, No. 6, 1999, pp. 585-595.
[14] A. Touzni, I. Fijalkow, M.G. Larimore, J.R. Treichler, "A globally convergent approach for blind MIMO adaptive deconvolution", IEEE Transactions on Signal Processing, Vol. 49, No. 6, 2001, pp. 1166-1178.
[15] H. Wold, "Soft modelling by latent variables: The non-linear iterative partial least squares (NIPALS) approach", Perspectives in Probability and Statistics, Papers in Honour of M.S. Bartlett on the Occasion of his 65th Birthday, 1975, pp. 117-142.

Paper V: The Game of Blind MIMO Channel Estimation

The Game of Blind MIMO Channel Estimation

Tobias Dahl, Nils Christophersen, Ole-Christian Lingjærde, Nils Lid Hjort*

* Department of Statistics, University of Oslo.

Abstract

Some optimization problems involving multiple entities (individuals, "agents") can only be solved if they work together. In Artificial Intelligence (AI) there is currently great interest in collaborating multi-agent systems. The agents work in a decentralized, automated fashion to solve an overall problem. Each agent has a reward function, expressing its degree of success in completing a subtask. Strategies have to be chosen carefully, so that the agents do not "work at cross-purposes". In this paper, we discuss a two-agent problem of finding the optimal transmission/receiving parameters for a multi-antenna wireless communication system. Techniques from multivariate analysis and geometrical modelling are used to build a framework for solving the problem.

Key words: Game Theory, Multi-Agent Systems, Ellipsoid Fitting, Orthogonal Transformation, Principal Components.

1 Introduction

1.1 Background

Game theory usually deals with situations where two or more competitors try to maximize their individual outcome in a battle over a limited resource.
Another aspect of game theory, less discussed in the literature, is that of team-work. Multi-agent systems (MAS, [21], [22], [27]), a popular topic in Distributed Artificial Intelligence, provide a framework for such problems. Consider a situation where two agents (for example persons, companies or electronic devices) have to co-operate in order to maximize some function reflecting the joint outcome or reward. Each agent has to contribute by selecting a set of parameters (α or β). The problem can be expressed as

$$\max_{\alpha, \beta} f(\alpha, \beta) \qquad (1)$$

For the problem to be relevant, we must assume that α and β are chosen independently. One agent (X) chooses α and the other agent (Y) chooses β. If both α and β could be studied and compared at some point, the two-player game (1) could be reformulated as a one-player game. In many practical problems, there must be some communication between X and Y for the situation to be more than a mere "make-another-guess" game. We assume therefore that X has indirect knowledge of β and vice versa:

• The X-agent observes g_X(α, β).
• The Y-agent observes h_Y(β, α).

The mappings g_X and h_Y could be termed "communication functions". Note that the information one agent has about the other agent's parameter is nested with his own parameters, and is as such not explicit knowledge. Agents X and Y will often work in a synchronized fashion. Suppose αᵢ is the i-th guess of the X-agent, and βᵢ the i-th guess of the Y-agent. We then assume that X observes h_Y(β_{i-1}, α_{i-1}) prior to guessing αᵢ, and that Y observes g_X(α_{i-1}, β_{i-1}) prior to guessing βᵢ. Various properties of the mappings g_X and h_Y can apply in different practical problems. Assume

$$g_X : \mathbb{R}^n \to \mathbb{R}^p \qquad (2)$$
$$h_Y : \mathbb{R}^m \to \mathbb{R}^q \qquad (3)$$

with p ≤ n and q ≤ m. If g_X and h_Y are linear functions, then the "degrees of compression" (n − p) and (m − q) put constraints on how much information one agent can have about the other.
Properties such as linearity and/or continuity of the mappings can make a problem easier to solve. Extra considerations must sometimes be taken into account. In some applications it might be necessary to keep the number of guesses to a minimum, i.e. we want the sequences of guesses {αᵢ} and {βᵢ} to be short.¹ This would be the case if each selection of α or β is the result of an experiment, e.g. a well dug, or an expensive test carried out. Another restriction could be monotonic, or almost monotonic, convergence towards an equilibrium, ruling out strategies based on many and/or varied guesses of α and β. Both agents contribute to the solution of the overall problem, indirectly, by maximizing their own reward functions. The two-agent problem (1) can thus be replaced by two subproblems,

$$\max_{\alpha, \beta} f_Y(\alpha, \beta) \qquad (4)$$
$$\max_{\alpha, \beta} f_X(\alpha, \beta) \qquad (5)$$

whose solutions coincide with the solution of the original problem. In some situations, the agents will not know that their sub-tasks are part of an overall optimization problem. If they do, however, one agent can choose to help the other maximize his reward², if he believes that this serves his own long-term purposes. Designing suitable reward functions is sometimes difficult, and is a central part of collaborating-agents problems [22]. The functions g_X(α, β) and h_Y(β, α) are generally not known in advance. The mappings, or at least certain aspects of them, could possibly be learned as part of the optimization. This is what is referred to as "modelling of the other agent's behavior" [18]. Each agent can also have prior knowledge of the optimization strategy that the other agent is going to use [2]. Any of the functions f, f_Y, f_X, g_X, h_Y could be observed with noise, making the problem stochastic rather than deterministic.

¹ Meaning that the number of wrong guesses in the sequences must be low.
² Even prior to maximizing his own reward.
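The alternating-guess protocol behind (1), (4), (5) can be illustrated with a toy example. The quadratic reward below and the best-response updates are hypothetical, chosen only to make the "guess, observe, guess again" loop concrete; in the paper's actual problem each agent sees the other only through a communication function, not directly.

```python
# Hypothetical shared reward f(a, b), concave, with joint maximizer (2, 1).
def f(a, b):
    return -(a - 2 * b) ** 2 - (b - 1) ** 2

a, b = 0.0, 0.0                 # initial guesses by agents X and Y
for _ in range(100):
    a = 2 * b                   # X's best response, given Y's last guess
    b = (4 * a + 2) / 10        # Y's best response, given X's last guess

# The alternating guesses converge to the joint maximizer (a, b) = (2, 1),
# where f attains its maximum value 0.
```

Here each update is the exact maximizer of f in one variable with the other held fixed, so the loop is plain coordinate ascent; convergence is geometric for this concave quadratic.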
We discuss the possibility of independently selecting sequences of guesses {αᵢ} and {βᵢ} converging to a pair (α, β) maximizing f(α, β) in a specific two-agent problem.

1.2 Problem formulation

We first present the joint optimization problem. In section 2.1 we argue that it describes an important problem in wireless telecommunications. For two matrices H ∈ R^{N×P} and G ∈ R^{P×N}, we want to solve

$$\max_{x, w} f(x, w) = E\{\|Hx + n_x\|^2\} + E\{\|Gw + n_y\|^2\} \qquad (6)$$

under the constraints ||x|| = ||w|| = 1. Here, ||·|| denotes the L₂-norm. The vectors n_x ∈ R^N and n_y ∈ R^P are noise vectors, assumed to be white and Gaussian. Agent X has to select the vector x, and agent Y has to select the vector w. The agents also observe each other's behavior:

• Y observes h_Y(x) = Hx + n_x.
• X observes g_X(w) = Gw + n_y.

We propose a leader-follower strategy: agent Y returns the vector he receives, y = Hx + n_x, after pre-multiplying it with a matrix R and normalizing, giving w = (RHx + n_x)/||(RHx + n_x)||. Agent X selects vectors x based on some optimization algorithm and the feedback he gets from Y. With the understanding that the data x from agent X affects the data w returned from Y, and vice versa, we write

$$g_X(w, x) = Gw_{(R,x)} + n_y = g_X(R, x) \qquad (7)$$
$$h_Y(x, w) = Hx_{(x,R)} + n_x = h_Y(x, R) \qquad (8)$$

Now, X picks x and Y picks R. Each agent can also have a "memory" of previous data vectors and use it for decision making. One could replace the subscripts (R, x) and (x, R) with ({Rᵢ}, {xᵢ}) and ({xᵢ}, {Rᵢ}) respectively, for i = 1, ..., n − 1, where n is the "present" iteration. For readability, we keep the notation in (7), (8).
The overall goal is reached if the two agents Y and X each succeed in maximizing their own reward functions,

$$f_Y(x, R) = E\{\|Hx_{(x,R)} + n_x\|^2\} \qquad (9)$$
$$f_X(R, x) = E\{\|Gw_{(R,x)} + n_y\|^2\} \qquad (10)$$

Note that f_Y(x, R) = E{||h_Y(x, R)||²} and f_X(R, x) = E{||g_X(R, x)||²}, a very close connection between the reward and the communication functions, and that f = f_X + f_Y. There are two extra considerations for this problem:

• The sequences of guesses {xᵢ} and {wᵢ} should converge as quickly as possible to their optima.
• Each element in {xᵢ} and {wᵢ} should be similar to the previous element, i.e. ||x_{i-1} − xᵢ|| ≤ ε and ||w_{i-1} − wᵢ|| ≤ ε for some small positive ε, and for all i.

We will give some examples of problems for multi-agent systems, and then show that the problem we have stated has an important application in wireless telecommunications.

1.3 Examples

Problems that can be cast into an MAS setting arise in everyday life as well as in science. Examples are the following:

• Two people living together have to contribute their share of opinions and actions, trying to maximize their joint happiness, without always knowing the precise contributions of their partner. The "intended words" α and β might not be directly understood, only perceived through "mappings" g_X(α, β) and h_Y(β, α).
• When closing a contract in a business situation, one might not know the precise investments and agreements the other agent has made, but each agent is still interested in maximizing the outcome of the joint contract.
• In the design of control systems for routing over a communication network, multiple units (servers) have to collaborate to direct the traffic and reduce overhead.
• Multi-agent control systems are used for directing constellations of communication satellites and planetary exploration vehicles [27].
• In wireless communications, a base station (agent Y) and a mobile subscriber unit (agent X) have to tune their antenna parameters independently (but in accordance with one another) in order to maximize the capacity/reliability of the channel.

In this paper, we focus on the last example. The problem of interest is to find the optimal transmission/receive parameters for communication through a wireless multi-antenna channel. We present a framework for solving this difficult, nested, non-linear optimization problem.

1.4 Wireless mobile communication

It has long been known that the use of multiple antenna elements at the transmitter improves channel capacity ("smart antennas"). By carefully spreading the transmit power over the antenna elements, one can focus energy in desired directions, increasing the average signal-to-noise ratio (SNR). Beginning with a series of articles on information theory published by members of the Bell Laboratories (E. Telatar [24], J. Foschini [9]), it was demonstrated that the use of multiple antenna elements both at the Base Station (BTS) and at the mobile Subscriber Unit (SU: cell phone, PDA or laptop) results in a large increase of the channel capacity. In MIMO ("Multiple Input, Multiple Output") systems, the channel is a matrix. Let y be the result of transmitting the data vector x across the channel H. This is expressed as

$$y = Hx + n_x \qquad (11)$$

where the noise term n_x is assumed to be white and Gaussian, n_x ∼ N(0, σ²I). The components of the vectors x and y are the transmitted and received signals at the respective antenna elements. The vector x is constituted from symbols in independent data-streams (modulation), each of which the receiver may try to recapture using the inverse H⁻¹ followed by demodulation. This diversity gain is one of the reasons that MIMO outperforms conventional systems.
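A minimal numerical sketch of the model (11) and the inversion-based recapture of the data streams (a simplification: real-valued channel, BPSK symbols, and plain zero-forcing detection, which is only one of several possible detectors):

```python
import numpy as np

rng = np.random.default_rng(7)
P = 4                                        # antenna elements at each end
H = rng.standard_normal((P, P))              # channel matrix (real-valued here)

x = rng.choice([-1.0, 1.0], size=P)          # one BPSK symbol per data stream
sigma = 0.01                                 # low channel noise level
y = H @ x + sigma * rng.standard_normal(P)   # received vector, eq. (11)

# Zero-forcing recapture: invert the channel, then demodulate by sign.
x_hat = np.sign(np.linalg.inv(H) @ y)
```

At this noise level the streams are recovered exactly; for an ill-conditioned H the inverse amplifies the noise, which is one reason more robust detectors exist.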
Usually, H is estimated by transmission of a training sequence of known symbols. There are two drawbacks to this approach. First, there is overhead: the channel characteristics H vary with time and must therefore be re-estimated at regular intervals. In 3G systems³, up to 20% of all the data sent is for training. The second drawback with most training-based systems is that they take no advantage of the singular modes of H. The received energy will be at its maximum if x is parallel with the first singular vector v₁ᴴ, since

$$v_1^H = \arg\max_{\{x \,:\, \|x\|=1\}} E\{\|Hx + n_x\|^2\} \qquad (12)$$

where $H = \sum_{i=1}^{r} \sigma_i^H u_i^H (v_i^H)^T$ is a singular value decomposition (SVD) of H. The singular vector v₁ᴴ of H maximizes the expected signal-to-noise ratio (SNR) when used for transmission. The corresponding singular vector u₁ᴴ is used for optimal weighting of the receive antenna elements. To employ the singular modes, the channel matrix⁴ must be known by the transmitter as well as the receiver [6]. Most systems based on training data assume channel knowledge at the receiver only (e.g. V-BLAST⁵, [9], [12]), and transmission with maximum average SNR is therefore impossible. In this paper, we try to find the top singular vectors blindly, without the need for a statistical estimate of H. To complicate things further, in an FDD (Frequency Division Duplex) system, the channel matrix H only characterizes transmission in one direction (from BTS to SU). Communication in the opposite direction (from SU to BTS) is generally characterized by a different channel matrix G. Transmission then amounts to

$$z = Gw + n_y \qquad (13)$$

Let $G = \sum_{i=1}^{r_G} \sigma_i^G u_i^G (v_i^G)^T$ be a SVD of G. Now, if w and x are parallel with the leading singular vectors, w = c v₁ᴳ and x = d v₁ᴴ, the agents can transmit the symbols c and d using the top singular vectors. The symbols usually come from some finite modulation alphabet.

³ Third generation systems, as opposed to e.g. GSM, which is a second generation system.
In this paper, we will focus on the transmit vectors only, and symbol modulation is not discussed. Also, data is usually not transmitted in single vectors, but rather in blocks containing hundreds or thousands of symbols. Note that in the special case of reciprocity (H = G^T), the problem of blind estimation of the singular modes has been solved [6].

1.5 Layout of the Paper

The paper is laid out as follows. Section 2 is a methodology section. It explains the usefulness of solving the problem (6) in wireless mobile communication (2.1), and describes the necessary building blocks for a solution (2.2, 2.3). Section 3 is about the algorithms, explaining in more detail what the agents X and Y have to do to jointly solve the overall optimization problem. It also has a subsection on sensitivity and noise (3.4), pointing out situations that can be difficult to handle without special precautions. Section 4 is a simulation section, assessing the performance of some of the sub-tasks, as well as simulating the two-agent system in operation. Section 5 discusses the findings and suggests improvements.

2 Methodology

2.1 Blind MIMO Channel Estimation

We interpret the equations (6), (7), (8), (9), (10) in terms of blind FDD channel estimation.

4 Or at least the leading singular vectors.
5 Vertical Bell Labs Layered Space-Time.

Two agents X and Y want to transmit their data on the top singular modes of a MIMO channel in an FDD system. Equation (6) states that if optimal vectors x and w are selected, then the received power at both stations will be at the maximum. The constraints ||x|| = ||w|| = 1 set a limit on the power output from the antenna arrays, which is practical from a cost perspective. It is clear that x = v_1^H and w = v_1^G at the maximum of f(x, w). Equations (7), (8) simply state that X and Y 'see' the other agent's sending vector (w or x) through a channel matrix.
Equations (9) and (10) state that X and Y each try to get the maximum received power. It is clear that if the reward functions f_X and f_Y are maximized, the overall problem is solved. At this optimum point, g_X(x, R) = σ_1^G u_1^G and h_Y(R, x) = σ_1^H u_1^H, so that all four singular vectors are known. It is sufficient that X knows v_1^H and u_1^G and Y knows v_1^G and u_1^H to transmit/receive on the top singular modes. Consider the extra requirements (convergence speed and continuity). They can be interpreted as follows: The initial guesses x_0 and w_0 are both random vectors. Each vector x_i and w_i will carry a symbol, and we want to maximize the chance that this symbol is correctly recaptured. Thus, we want {x_i} and {w_i} to converge to their optima as quickly as possible. The reason that continuity is desirable is the fact that previous (earlier) estimates are used for symbol decision, and too much variation between successive elements in the sequences {x_i} and {w_i} makes this task difficult. Throughout the paper, we assume that all matrices and vectors are real-valued, as this simplifies visualization and the discussions considerably. The channel matrices will be complex in practical problems, but the methodology we present can be extended to work in the complex case. We also assume G and H, both in R^{P×P}, to be square and invertible, simplifying both analysis and performance assessment. A final working system must of course handle non-symmetric and degenerate/singular cases. We now present the building blocks necessary for our framework. The two agents X and Y need different tools to complete their sub-tasks. X needs a numerical optimization method (Local CG, section 2.2) in order to find an optimum x. Y, on the other hand, needs to get as much information about H as possible, in order to make a "good correction matrix" R. The ability to estimate one set of singular vectors and the singular values of H is crucial, and ellipsoid fitting (2.3) is the central tool.
2.2 Local CG Algorithm

Consider the problem of maximizing

f(x) = ||f(x)||²    (14)

subject to ||x|| = 1. Using the Lagrange multiplier method, we seek a stationary point (x, α) of

f*(x) = ||f(x)||² − α(x^T x − 1).    (15)

The stationarity condition on α implies that x^T x = 1. Consider the problem of determining x. Resorting to the same strategy as used in the Newton–Raphson method, let x_0 denote the current estimate for x, let u_0 denote the gradient vector, and W the Hessian matrix of f(x), both evaluated at x_0. We then find the next estimate for x by equating to zero the first order Taylor expansion of Eq. (15),

u_0 + W(x − x_0) − αx = 0    (16)

that is,

Wx = αx + Wx_0 − u_0    (17)

However, since the function f is only implicitly given in our problem, neither u_0 nor W are known, and we have to resort to approximations. Suppose f(x) is well approximated locally by a linear map, i.e. f(x) ≈ Bx for x in a neighborhood of x_0. Then u_0 and W are well approximated by B^T Bx_0 and B^T B, respectively, and Eq. (17) simplifies to

B^T Bx = αx    (18)

which needs to be solved for α and x under the constraint x^T x = 1. A solution that also solves the Taylor approximation to the optimization problem in Eq. (15) is given by x = v_1, where v_1 is the principal eigenvector of B^T B or, equivalently, the right singular vector of B corresponding to the largest singular value. This leads to the following iterative method for solving the original non-linear optimization problem in Eq. (15):

x_{i+1} := x_i + λ(v̂_1 − x_i)    (19)

In order to determine the matrix B = B_i to be used in the ith iteration of Eq. (19), let d ≥ 0 be a given integer and define

X_i = [x_{i−d}, . . . , x_i]    (20)
Z_i = [f(x_{i−d}), . . . , f(x_i)]    (21)
B_i = Z_i X_i†    (22)

Here, "†" denotes the Moore–Penrose pseudo-inverse. The parameter λ controls the step size.
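A minimal sketch of the Local CG iteration follows. The core steps are Eqs. (19)-(22); the renormalization of x after each step and the small random jitter (which keeps the window X_i from collapsing to rank one, cf. the rank discussion in Section 3.4.2) are our own implementation choices, not part of the paper's specification.

```python
import numpy as np

def local_cg(f, x0, n_iter=300, d=5, lam=0.5, jitter=1e-3, rng=None):
    """Sketch of the Local CG method of Section 2.2.

    B_i = Z_i X_i^dagger (Eq. 22) is a local linear model of f, and the
    step x_{i+1} = x_i + lam*(v1 - x_i) (Eq. 19) moves towards the top
    right singular vector v1 of B_i.
    """
    rng = rng or np.random.default_rng(0)
    xs = [x0 / np.linalg.norm(x0)]
    zs = [f(xs[0])]
    for _ in range(n_iter):
        X = np.column_stack(xs[-(d + 1):])   # Eq. (20)
        Z = np.column_stack(zs[-(d + 1):])   # Eq. (21)
        B = Z @ np.linalg.pinv(X)            # Eq. (22): f(x) ~ B x locally
        v1 = np.linalg.svd(B)[2][0]          # top right singular vector of B
        if v1 @ xs[-1] < 0:                  # resolve the sign ambiguity of v1
            v1 = -v1
        x = xs[-1] + lam * (v1 - xs[-1]) + jitter * rng.standard_normal(len(x0))
        x /= np.linalg.norm(x)               # re-impose the constraint ||x|| = 1
        xs.append(x)
        zs.append(f(x))
    return xs[-1]

# For an exactly linear f(x) = A x, the iteration should approach the
# top right singular vector of A.
rng = np.random.default_rng(2)
A = rng.standard_normal((5, 5))
x_hat = local_cg(lambda x: A @ x, rng.standard_normal(5), rng=rng)
v1_true = np.linalg.svd(A)[2][0]
```
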
2.3 Ellipsoid fitting

When vectors of constant norm (||x|| = 1) are multiplied by a channel matrix H, the resulting vectors y = Hx + n_x will approximately lie on an ellipsoid with its center in the origin. If this ellipsoid can be determined, one set of singular vectors and the singular values of H can be estimated, without knowledge of the other set. Ellipsoid fitting is in this sense a "partial SVD". Ellipsoid fitting to data in 2- or 3-space is not a novel idea (Turner et al. 1999, Fitzgibbon et al. 1996). Applications are plentiful in computer vision and planetary motion problems. Fitting of data to an ellipsoid in higher dimensions is, to our best knowledge, an undiscussed problem. An ellipsoid in R^P, with center in the origin, can be described as

y^T Ay = Σ_{i=1,j=1}^P a_ij y_i y_j = c    (23)

where y ∈ R^P, the matrix A = {a_ij} ∈ R^{P×P} is symmetric, and c ∈ R is a constant. This is a general equation for a quadratic form (an ellipsoid, a hyperboloid or a paraboloid), so we must assume that A is positive semi-definite if c is positive, and negative semi-definite if c is negative. Given a set of observations on the ellipsoid {y_1, y_2, . . . }, we can compute A: The quadratic form can be replaced by a linear form,

F(a, y_q) = a^T y_q = 0    (24)

where y_q is a vector containing the cross-product terms,

y_q = (y_1², y_2², . . . , y_1 y_2, y_1 y_3, . . . , 1)^T    (25)

and the vector a contains the corresponding elements in the matrix A and the constant c,

a = (a_11, a_22, . . . , 2·a_12, 2·a_13, . . . , −c)^T    (26)

Note that the symmetry of A implies the use of only half the set of products in a and y_q; e.g. finding both a_12 and a_21 is unnecessary. By reinserting the elements of a into the matrix A, and adjusting the off-diagonal elements by a factor of 1/2, the problem is solved. Of course, one observation y_q is not enough, so we consider Y_q = [y_{q,i−d}, . . . , y_{q,i}], where i is the iteration⁶ index, and d is some positive integer.
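The mapping from the quadratic form (23) to the linear form (24) can be sketched as follows. The ordering of the terms in y_q and a follows Eq. (25)-(26); the dimension and test matrix are illustrative.

```python
import numpy as np
from itertools import combinations

def quad_features(y):
    """Map a vector y to the cross-product vector y_q of Eq. (25):
    squares, then pairwise products, then a trailing 1."""
    sq = [yi * yi for yi in y]
    cross = [y[i] * y[j] for i, j in combinations(range(len(y)), 2)]
    return np.array(sq + cross + [1.0])

# With a = (a11, ..., aPP, 2*a12, 2*a13, ..., -c) as in Eq. (26),
# the linear form a^T y_q equals y^T A y - c for any y, so points on
# the ellipsoid y^T A y = c satisfy a^T y_q = 0 (Eq. 24).
rng = np.random.default_rng(3)
P = 3
M = rng.standard_normal((P, P))
A = M @ M.T                         # a symmetric positive definite form
c = 1.0
a = np.concatenate([np.diag(A),
                    [2 * A[i, j] for i, j in combinations(range(P), 2)],
                    [-c]])
y = rng.standard_normal(P)
residual = a @ quad_features(y) - (y @ A @ y - c)   # should vanish
```
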
Inspired by [8], we try to solve

min_{a: ||a||=1} ||a^T Y_q||²    (27)

The restriction ||a|| = 1 comes from the observation that there is an extra degree of freedom in (23). But if Y_q = Σ_{i=1}^r σ_i u_i v_i^T is a SVD of the data block (assumed to have full column rank r), then the trailing singular vector a = u_r will be the solution to this problem. Assume that A is pre-multiplied with (1/c), A := (1/c)A, so that the quadratic form is

y^T Ay = 1    (28)

Let A = QΛQ^T be the spectral decomposition of A, with Q the matrix of orthogonal column eigenvectors {q_i} and Λ = diag(λ_1, λ_2, . . . ) the matrix of increasingly ordered eigenvalues, λ_1 ≤ λ_2 ≤ . . . ≤ λ_P.⁶ We now show how to use ellipsoid fitting for estimating a set of singular vectors/values. Under perfect conditions (no noise),

y = Hx    (29)

for some transmit vector x with unit norm, ||x|| = 1. Then x = H⁻¹y, and

y^T H^{−T} H^{−1} y = y^T (HH^T)^{−1} y = 1    (30)

Comparing with (28), observe that A corresponds to (HH^T)^{−1} if y is obtained from (29). The left singular vectors of H are the eigenvectors of HH^T. These are the same as the eigenvectors of (HH^T)^{−1}, except that the ordering with respect to the eigenvalues is reversed. In the presence of noise, A is an estimate of (HH^T)^{−1}, and the eigenvectors of A are estimates of the singular vectors of H. The singular value estimates σ̂_i^H are obtained from the eigenvalues in Λ as

Ŝ_H^{−1} = √Λ    (31)

where √· denotes the element-wise square root of a matrix. Similarly, we estimate the singular vectors by Û_H = Q. If some of the eigenvalues are negative, (31) is not defined, as will be discussed later. Unless otherwise explicitly stated, we will use the terms "eigenvectors"/"eigenvalues" to refer to the eigenvectors/values of the matrix A, and "singular vectors"/"singular values" to refer to the singular vectors/values of H.

6 Remark: If the ellipsoid that is to be estimated changes, old samples will be outdated after a while. This will be the case in our application.
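The whole "partial SVD" pipeline, fitting (27), rebuilding A, and reading off Û_H and Ŝ_H via (28)-(31), can be sketched end to end. The sketch below assumes noise-free data, a square invertible H, and the y_q ordering of Eq. (25); dimensions and seeds are illustrative.

```python
import numpy as np
from itertools import combinations

def fit_partial_svd(Y):
    """Estimate U_H and the singular values of H from noise-free received
    vectors y = Hx, ||x|| = 1, by ellipsoid fitting (Section 2.3). Sketch only."""
    P = Y.shape[0]
    pairs = list(combinations(range(P), 2))
    # Columns of Yq are the cross-product vectors y_q (Eq. 25) of the columns of Y
    Yq = np.array([list(y * y) + [y[i] * y[j] for i, j in pairs] + [1.0]
                   for y in Y.T]).T
    # a = trailing left singular vector of Yq solves Eq. (27)
    a = np.linalg.svd(Yq)[0][:, -1]
    # Rebuild the symmetric A of Eq. (23) and rescale so that y^T A y = 1 (Eq. 28)
    c = -a[-1]
    A = np.diag(a[:P])
    for k, (i, j) in enumerate(pairs):
        A[i, j] = A[j, i] = a[P + k] / 2.0
    A = A / c
    lam, Q = np.linalg.eigh(A)            # A ~ (H H^T)^{-1}, Eq. (30)
    # eigh sorts eigenvalues increasingly; the smallest eigenvalue of
    # (H H^T)^{-1} belongs to the largest singular value of H (Eq. 31),
    # so 1/sqrt(lam) is already decreasingly ordered.
    return Q, 1.0 / np.sqrt(lam)

rng = np.random.default_rng(4)
P = 3
H = rng.standard_normal((P, P))
X = rng.standard_normal((P, 30))
X /= np.linalg.norm(X, axis=0)            # unit-norm transmit vectors
U_hat, s_hat = fit_partial_svd(H @ X)
U_true, s_true, _ = np.linalg.svd(H)
```

Note that only H @ X is passed in: the right singular vectors never enter, which is the sense in which this is a partial SVD.
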
3 Algorithm

We now describe the individual tasks for X and Y in the two-agent framework. A particular feature of our proposal is the fact that one of the agents (Y) does not try to maximize his reward from the very beginning. Instead, he starts by using the information in the receive vectors {y_i} to help X maximize his reward. Only when X has a good chance of converging to his optimum reward does Y try to maximize his own.

3.1 Agent X: Optimizing x

The job of the "leader" X is straightforward. Starting from a random x_0, he continuously tries to approximate the mapping g_X(R, x) as a linear mapping (he assumes R to be constant), and seeks the optimum of f_X(R, x) = ||g_X(R, x)||² using the local CG algorithm. The only thing he needs to be aware of is that the problem can change abruptly if Y changes R. However, if he is "lucky", Y will pick an R not only to help himself, but also to make the problem that X has as simple as possible.

3.2 Agent Y: Making the problem linear

Consider first the ideal situation of no channel noise. When a sequence of vectors {x_i} is sent from X to Y, it is received as the vectors {y_i} = {Hx_i}. Assume now that agent Y has successfully estimated the left singular vectors and the singular values, using ellipsoid fitting. The estimates are held in the matrices Û_H = U_H, Ŝ_H = S_H. Here, H = Û_H Ŝ_H V̂_H^T is the SVD of H. Now let Y choose

R_0 = Û_H Ŝ_H^{−1} Û_H^T = U_H S_H^{−1} U_H^T    (32)

and consider what happens when a received vector y = Hx is pre-multiplied with R_0:

R_0 Hx = U_H S_H^{−1} U_H^T U_H S_H V_H^T x = U_H V_H^T x    (33)

The composition of R_0 and H is now an orthogonal matrix U_H V_H^T. It can be shown [13] that

U_H V_H^T = arg min_{Q: Q^T Q=I} ||H − Q||_F²    (34)

so the composition R_0 H effectively replaces the matrix H by its closest orthogonal matrix. Here, || · ||_F denotes the Frobenius norm.
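The linearizing effect of Eq. (32)-(33) is easy to verify numerically: with perfect estimates, R_0 H is orthogonal and preserves norms, so the normalization step becomes a no-op. A small sketch (dimensions and seed illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
P = 4
H = rng.standard_normal((P, P))

U, s, Vt = np.linalg.svd(H)               # H = U S V^T
R0 = U @ np.diag(1.0 / s) @ U.T           # Eq. (32), with perfect estimates

# R0 H = U V^T is orthogonal (Eq. 33), so ||R0 H x|| = ||x|| for every x
Q = R0 @ H
x = rng.standard_normal(P)
x /= np.linalg.norm(x)
```
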
Adding the normalization step has no effect: For x with ||x|| = 1,

R_0 Hx / ||R_0 Hx|| = U_H V_H^T x / ||U_H V_H^T x|| = U_H V_H^T x    (35)

since ||U_H V_H^T x|| = ||x|| = 1. The vector received by agent X is now

g_X(R_0, x) = G U_H V_H^T x    (36)

which is a linear mapping of x. Consequently, the local linear CG algorithm will find its optimum value. The vector w = U_H V_H^T x = v_1^G will be the top right singular vector of G. Under noise-free conditions, the top singular vector pair of G can be found by X without concern for the "sparse optima" coming from the non-linear normalization. This trick is nothing but a principal component analysis (PCA), decorrelating the received vectors. Figure 1 illustrates the effect of using the matrix R_0 on the received vectors {y_i}. In the presence of noise, the estimates become less accurate. This can sometimes have severe consequences for R_0, and for its usefulness as a pre-processor/linearizer, at least when applied to some receive vectors y. This is discussed in 3.4.

3.3 Agent Y: Adjusting the optimum

When R_0 has been introduced, and the local CG algorithm has converged, the vectors {y = Hx} received by agent Y will also converge in the mean sense. We approximate

y_Opt = (1/d) Σ_{k=0}^{d−1} y_{i−k}    (37)

Figure 1: The figure illustrates the effect of Y using the matrix R_0 to help X maximize his reward in a 2 × 2 channel matrix scenario. In panel (a) are the vectors {x_i} sent through the channel H. When received by Y, they are normalized and resent through G (R_0 = I initially). The resulting points are found in panel (b). Note the sparse density of points at the extrema of the ellipsoid, coming from the non-linearity introduced by the normalization. Panel (d) shows the situation from an angular point of view. Along the horizontal axis is the angle θ ∈ [0, 2π], where θ is taken from x(θ) = [cos θ, sin θ], and the radius r(θ) of the corresponding received point is plotted along the vertical axis.
Clearly, it must be difficult for an optimization algorithm based on assumptions of continuity to pick an angle θ that maximizes the radius. Panel (c) shows the effect of pre-multiplying by R_0 before re-sending through G, and in (e), the improved situation is shown in an angle/radius plot.

for some positive integer d, a parameter of the algorithm. The vector x_Opt is defined in the same fashion for X. If Y is to have maximum reward, the convergence point y_Opt should ideally be parallel with the leading left singular vector of H, or

y_Opt = σ_1^H u_1^H    (38)

This can be accomplished by introducing a rotation R_1, "manipulating" X into sending vectors that are received as σ_1^H u_1^H. Assume that the mapping R_1 has the property

R_1: u_1^H → y_Opt/||y_Opt||    (39)

One can construct such an R_1 as a rotation in the subspace spanned by the two vectors y_Opt and u_1^H, leaving all vectors perpendicular to that space as they are. For the calculation of such rotation matrices, see [13]. Consider what happens if the vector v_1^H is transmitted through H and the result Hv_1^H is pre-multiplied with R_1 R_0:

R_1 R_0 Hv_1^H = R_1 U_H V_H^T v_1^H = R_1 u_1^H = c·y_Opt    (40)

where c is an irrelevant constant that will disappear in the normalization. We define the mapping

R := R_1 R_0    (41)

The composition of the two mappings is the "correction matrix" for Y. The fact that R_1 is a rotation makes the composition R_1 R_0 H a linear mapping, even after the normalization step. The Local CG algorithm will still find the optimum singular vector for G, while the introduction of the rotation R_1 'suggests' that the singular vectors of H are found simultaneously. In subsequent sections, the words "orthogonal transformation" and "rotation" will be used interchangeably, although strictly speaking, a "rotation" does not involve the possibility of reflection, as does an "orthogonal transformation".
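A rotation with the property of Eq. (39), acting only in the plane spanned by u_1^H and y_Opt and fixing the orthogonal complement, can be built explicitly. The construction below is a standard in-plane (Givens-like) rotation, sketched on our own; the paper refers to [13] for its preferred calculation.

```python
import numpy as np

def plane_rotation(u, b):
    """Rotation taking unit vector u to unit vector b, acting only in
    span{u, b} and leaving the orthogonal complement fixed (cf. Eq. 39).
    The antiparallel case u ~ -b is degenerate and not handled here."""
    c = float(u @ b)                        # cos(theta)
    w = b - c * u
    s = np.linalg.norm(w)                   # sin(theta)
    if s < 1e-12:                           # u and b already parallel
        return np.eye(len(u))
    w /= s
    P2 = np.outer(u, u) + np.outer(w, w)    # projector onto the plane
    return np.eye(len(u)) + (c - 1) * P2 + s * (np.outer(w, u) - np.outer(u, w))

rng = np.random.default_rng(6)
u1 = rng.standard_normal(5); u1 /= np.linalg.norm(u1)
y_opt = rng.standard_normal(5); y_opt /= np.linalg.norm(y_opt)
R1 = plane_rotation(u1, y_opt)              # R1 u1 = y_opt / ||y_opt||
```
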
3.4 Sensitivity Analysis

In the simulation section, we show that the suggested two-agent leader-follower strategy leads to convergence of the sequences {x_i} and {w_i} towards the top right singular vectors of H and G in the deterministic case, σ² = 0. However, if noise of variance σ² > 0 is added, the solution is sometimes unstable. In this section, we examine the possible causes of this instability, and discuss ways of making the estimation more robust.

3.4.1 Sensitivity of Ellipsoid fitting

Ellipsoid fitting is a central part of Y's task. The "partial SVD", giving the singular vector estimates Û_H and Ŝ_H, is used in the initial R_0 matrix. The ellipsoid fitting is sensitive in two ways. First, the quadratic form matrix A is derived from the vector a, which is the solution to the problem (27). To solve the problem, a must be parallel with the last singular vector of the matrix Y_q. The trailing singular vectors of a matrix are more sensitive to variation in the data than the leading singular vectors (see e.g. Hansen [14] or Hastie et al. [16]). A small change in the data matrix Y_q can lead to a great change⁷ in the vector a, and in turn, in the matrix A and the estimates Ŝ_H, Û_H. Second, the quadratic form represented by A is only a local approximation to the true ellipsoid that the noise-free data would lie on. We can expect the approximation to be better close to the sample points, and worse further away. In particular, if the sample points are close to some of the short half-axes of the ellipsoid, the estimates for the long half-axes will be poor. Since the true corresponding eigenvalues are already low (low λ → large σ), the estimates may even drop below zero.

3.4.1.1 Suggestions for improvement

Ellipsoid fitting could be regularized by biasing the fit towards a sphere. One can strike a balance between the ellipsoid that fits the data best, and the sphere that fits the data best.
This could be done in a number of ways: by penalizing (27), or by estimating the optimal sphere and adjusting the eigenvalues of the ellipsoid to be more equal to the (single) eigenvalue obtained from sphere fitting. This approach could also be useful in the case where H is degenerate (singular or close to singular). Negative, zero or small eigenvalues would be positively biased, and large eigenvalues negatively biased, which could be useful: In eigenvalue estimation in the presence of noise, the estimation error is in the opposite direction [6]. Another approach is to regularize Ŝ with respect to the particular sample points used for estimating it. Components pointing away from the sample points {y_i} on the ellipsoid should be subject to carefully selected scaling or much regularization.

3.4.2 Reduced rank in the Local CG algorithm

If and when R_0 has been introduced to simplify X's problem, there should be no problems for X - at first glance. However, the matrix

X = [x_{i−d}, . . . , x_i]    (42)

will, upon convergence of x, have reduced column rank, since all columns will be identical. Then the inversion of X in (22) is not straightforward, since a full (basis expanded) SVD of B_i ∈ R^{n×p}, B_i = Σ_{i=1}^K σ_i u_i v_i^T with K = min(n, p), at some point will have σ_1 ≥ 0 but σ_i = 0 for some i > 1. It is not easy to determine the precise column rank of B_i, especially in the presence of noise.

7 This can also be argued geometrically, by plotting a few points of an ellipse on paper. A small movement of a sensitive point can change the ellipse completely.

3.4.2.1 Suggestions for improvement

One possibility is to ridge (Hoerl & Kennard, 1970) the inversion of X, e.g. by adding a small positive constant α to each of the singular values prior to inversion [14]. Another alternative is to introduce some random variation in {x_i}.
This will not only solve the problem of the rank reduction, but also give agent Y a better chance to estimate the ellipsoid, due to the data variation. This will happen at the cost of some performance in SNR. Regularization involves estimation of a regularization parameter, which adds another level of complexity. In some situations, however, it is possible to obtain optimal values for these parameters.

3.4.3 Unstable adjustment of the optimum

The purpose of the matrix R_1 is to encourage agent X to send vectors that are parallel with v_1^H, in the understanding that this will also help him maximize his reward function. However, R_1 "has no concept" of directions in H other than u_1^H. For maximizing the reward functions, one would hope that if X failed to send something close to v_1^H, he should send something closer to v_2^H than to, say, v_5^H. However, since R_1 "has no concept" of directions outside span{y_Opt, u_1^H}, an x with a big contribution in a low-variance direction could maximize the reward function of X just as easily as an x with a big contribution in the direction v_2^H. Thus, X could sabotage Y's attempt at getting a high reward.

4 Simulations

4.1 Simulation: Performance of Y

We try to assess the performance of Y in his attempt to help X optimize x. We examine the various steps involved in calculating and operating a matrix R_0 under different signal-to-noise scenarios.

4.1.1 Constructing sequences {x_i}

Y has to make its inference about H based on an incoming sequence of vectors {y_i} from (11). Agent X, trying to find a vector x maximizing his reward, may sometimes be close to his convergence point x_Opt. In this case, the variation in the sequences {x_i} and {y_i} will be low, making it more difficult to estimate the necessary parameters. We will mainly investigate such situations. We make no assumptions regarding the optimization algorithm that was used for selecting {x_i}. The Local CG algorithm or any other (possibly non-linear) method could be used.
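A nearly converged sequence of unit-norm vectors clustered around a target vector (call it t) can be generated as below. The rejection-sampling construction is our own illustrative choice; dimensions, tolerance and seed are not values prescribed by the paper.

```python
import numpy as np

def clustered_sequence(t, eps, n, rng):
    """Unit-norm vectors x_i with ||x_i - t|| < eps: a sketch of a nearly
    converged sequence {x_i} around a unit-norm target vector t."""
    cols = []
    while len(cols) < n:
        x = t + (eps / 3.0) * rng.standard_normal(t.shape)
        x /= np.linalg.norm(x)              # keep the constraint ||x_i|| = 1
        if np.linalg.norm(x - t) < eps:     # accept only points inside the ball
            cols.append(x)
    return np.column_stack(cols)

rng = np.random.default_rng(7)
t = rng.standard_normal(5)
t /= np.linalg.norm(t)
X = clustered_sequence(t, 0.1, 100, rng)    # 100 vectors within 0.1 of t
```
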
We examine situations where the elements in {x_i} vary around a certain target vector t,

||x_i − t|| < ε    (43)

Typical choices will be t = v_k^H, where k indexes a singular vector of H. The elements of the sequence {y_i} will then vary around the corresponding vector σ_k^H u_k^H. The variation in this sequence will depend on the magnitude of the kth singular value of H as well as the noise.

4.1.2 Simulation settings

Random matrices H ∈ R^{5×5} with i.i.d. elements were generated and used for the simulations. If σ_1^H, . . . , σ_P^H are the singular values of H, invertibility is imposed by requiring that the condition number σ_1^H/σ_P^H < 10. We then constructed sequences {x_i}, with or without specific relations to H, each with 100 elements. The procedures were repeated 1000 times (1000 H-matrices) for each SNR scenario. In all the examples, ||x_i|| = 1 for all i. The expected signal-to-noise ratio (SNR), expressible in decibel (dB) as

SNR(dB) = 10 · log10( E{||Hx||} / E{||n||} )    (44)

is selected by adjusting the variance σ².

4.1.3 Performance Diagnostics

In the following, Y = HX + N_x denotes the block of vectors {y_i}, Y = [y_1, y_2, . . . , y_100], as they are observed with noise, and X = [x_1, x_2, . . . , x_100] contains the vectors {x_i}. N_x = [n_x^1, n_x^2, . . . , n_x^100] is a matrix of the same size as Y, with columns n_x^i ∼ N(0, σ²I). In the simulation framework, the noise-free version Y_True = HX is available for reference.

4.1.3.1 Perpendicularity

From the matrix A, agent Y estimates a set of singular vectors in Û_H. If the singular vector estimates are correct, Û_H = U_H, then the product matrix

M = Û_H^T H    (45)

should have perpendicular rows, since U_H^T H = S_H V_H^T. In this case, the outer product MM^T should have zero elements off the diagonal.
The statistic

t_Pend(M) = ||diag(MM^T)|| / ||MM^T||_F    (46)

can be used to assess how successful Y is when calculating a singular vector matrix Û_H, a high value indicating success. The distribution of t_Pend can be simulated under the null hypothesis that M is random, by repeatedly picking random matrices H and calculating t_Pend(UH) for some random orthogonal U. The randomness of H makes this distribution equivalent to that of t_Pend(H), without U. A deviation from the null hypothesis, in the direction of a 'successful perpendicularization', is reported as a low simulated P-value.

4.1.3.2 Rotation

The purpose of R_0 = Û_H Ŝ_H^{−1} Û_H^T is to make the composition R_0 H an orthogonal transformation. Noise can make the estimates Û_H, Ŝ_H inaccurate. However, these matrices are not our only interest. We are also interested in whether or not the composition R_0 H rotates the specific sequence {x_i}. In other words, we want to check whether the mapping from X to Z = R_0 HX is a rotation of the points in {x_i}. To this end, we can use the Procrustes fit. This measure,

d(X, Z) = min_{Q: Q^T Q=I} ||QX − Z||_F    (47)

tells us whether or not one matrix can be rotated to match another. The optimal rotation Q ∈ R^{P×P} can be found explicitly (see e.g. Dryden and Mardia, 1997). We consider the statistic

t_RotationEst(X, R_0 HX) = d(X, R_0 HX) = d(X, R_0 Y_True)    (48)

From noisy observations Y, we make a matrix R_0, and check whether this is good enough to work as a rotation of the points X when combined with H. We simulate the distribution of this statistic under the null hypothesis (H_0: H is random⁸, i.e. not orthogonal) in the following way: We pick random matrices H, compute Y_True = HX and compute t_RotationEst(X, Y_True). If the matrix R_0 H works as an orthogonal transformation of X, the statistic d(X, R_0 HX) will be low. This is reported as a low simulated P-value.

8 Here, the matrix H is working as the composition R_0 H.
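The two diagnostics can be sketched directly from their definitions (46) and (47). The closed-form Procrustes solution via the SVD of ZX^T is a standard result (see e.g. Dryden and Mardia); dimensions and seed below are illustrative.

```python
import numpy as np

def t_pend(M):
    """Perpendicularity statistic of Eq. (46): equals 1 when the rows of M
    are mutually orthogonal (M M^T diagonal), and is smaller otherwise."""
    G = M @ M.T
    return np.linalg.norm(np.diag(G)) / np.linalg.norm(G, 'fro')

def procrustes_dist(X, Z):
    """Procrustes distance d(X, Z) = min_Q ||Q X - Z||_F over orthogonal Q,
    Eq. (47); the minimizing Q is obtained from the SVD of Z X^T."""
    U, _, Vt = np.linalg.svd(Z @ X.T)
    Q = U @ Vt
    return np.linalg.norm(Q @ X - Z, 'fro')

rng = np.random.default_rng(8)
H = rng.standard_normal((5, 5))
U_H = np.linalg.svd(H)[0]
score = t_pend(U_H.T @ H)         # M = U_H^T H = S_H V_H^T has orthogonal rows

Qr = np.linalg.qr(rng.standard_normal((5, 5)))[0]   # a random orthogonal map
X = rng.standard_normal((5, 30))
d0 = procrustes_dist(X, Qr @ X)   # an exact orthogonal map is recovered: d ~ 0
```
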
Note that the statistic t_RotationEst is dependent on {x_i}, but not on the noise, while t_Pend depends neither on {x_i} nor on the noise.

Table 1: Performance diagnostics, Case I
SNR (dB)          0.0    5.0    10.0   15.0   20.0   25.0   30.0
P(t_Pend)         0.369  0.240  0.040  0.017  0.003  0.000  0.000
P(t_RotationEst)  0.006  0.000  0.000  0.002  0.000  0.000  0.000

Table 2: Performance diagnostics, Case II
SNR (dB)          0.0    5.0    10.0   15.0   20.0   25.0   30.0
P(t_Pend)         0.399  0.152  0.052  0.020  0.006  0.000  0.000
P(t_RotationEst)  0.004  0.001  0.001  0.001  0.000  0.000  0.000

Table 3: Performance diagnostics, Case III
SNR (dB)          0.0    5.0    10.0   15.0   20.0   25.0   30.0
P(t_Pend)         0.367  0.065  0.023  0.002  0.000  0.000  0.000
P(t_RotationEst)  0.001  0.000  0.000  0.007  0.012  0.000  0.000

Table 4: Performance diagnostics, Case IV
SNR (dB)          0.0    5.0    10.0   15.0   20.0   25.0   30.0
P(t_Pend)         0.541  0.483  0.411  0.185  0.095  0.041  0.011
P(t_RotationEst)  0.025  0.016  0.017  0.275  0.277  0.111  0.004

Figure 2: Distribution of the statistic t_Pend under the null hypothesis.

4.2 Results

4.2.1 Case I: Random elements {x_i}

This situation is, from the viewpoint of our optimization problem, related to sending training data, and is considered only for reference. The P-values for the performance diagnostics are found in Table 1. In this reference case, the two statistics indicate good performance. When the elements in {x_i} are random, there will be high variation in the corresponding sequence {y_i}, which improves the ellipsoid fitting.

4.2.2 Case II: Elements in a small random neighborhood

We study the situation where the elements are taken from a small neighborhood of a random, unit-norm target vector t, without any particular connection to the channel matrix H (e.g. not in the neighborhood of a specific singular vector).
The variation was kept low, ||x_i − t|| < ε = 0.1 for all i, to simulate the situation where {x_i} has converged. There is little variation in {y_i}, which is likely to make the ellipsoid fitting more difficult. Comparing the results in Table 2 with the reference, the performance is almost identical. This indicates that we can hope for good performance even if the sequence {x_i} is confined to a small region, e.g. upon convergence.

Figure 3: Distribution of t_RotationEst, Case I, under the null hypothesis.

4.2.3 Case III: Elements in a small neighbourhood around v_1^H

A sequence {x_i} with ||x_i − v_1^H|| < ε = 0.1 (for all i) is constructed and used for simulation. We compare the results in Table 3 with the reference cases I and II. The ability to make Û_H^T H have perpendicular rows is better than in the reference cases, particularly in the presence of much noise. This indicates that ellipsoids are better sampled at points with high curvature, e.g. if the points are parallel with the longest half-axis. For SNRs in the range 15-25 dB, the rotation ability is lower than in the reference cases. This is probably due to errors in the estimates σ̂_i^H. Particularly the small singular values will be poorly estimated. Upon inversion, this becomes a source of errors for the "rotation ability" of the composition R_0 H. At low SNRs, the noise in the data could lead to more equal eigenvalues.⁹ This could explain the good relative performance at low SNRs.

4.2.4 Case IV: Elements in a small neighbourhood around v_5^H

The setup is identical to that of the previous example, with the target vector chosen to be v_5^H rather than v_1^H. We compare with the reference cases I and II, as well as III. Both statistics show worse performance compared with all the other cases, indicating that this scenario is more difficult than III and II.
4.2.5 Discussion: Aspects of the sequence {y_i}

Y does not know in advance the convergence point y_Opt of the sequence {y_i}. If y_Opt happens to be parallel with the first singular vector u_1^H of H, then the estimation of the singular vectors in Û_H and Ŝ_H will be better, due to the curvature around the point u_1^H. If, on the other hand, y_Opt is parallel with one of the late singular vectors, say u_P^H, the curvature around that point will be much lower, and so the estimates will be worse.

4.3 The System at Work

We demonstrate that our system finds the top singular vectors in the deterministic case (no noise). In the presence of noise, more work on regularization is needed to improve stability. Two random matrices G and H, both in R^{5×5}, were computed, and a random initial vector x_0 was used as a starting point. The "correction matrix" was initially defined to be R_0 = I. Subsequently, R was computed in two stages. After 60 iterations, R = R_0 = Û_H Ŝ_H^{−1} Û_H^T was introduced, to make the problem of finding v_1^G linear. After another 60 iterations, the extra rotation R_1 was introduced (R := R_1 R_0) to adjust the position of the optimum y_Opt. The reason for this splitting is the fact that it is unlikely that a good and stable candidate for y_Opt can be found before the problem has been 'stabilized' by R_0. The simulation results are seen in Figure 5. Consider the three phases: During the first 30 iterations, the estimate v̂_1^G becomes more or less correct, but the optimum is unstable, due to the non-linearities involved. There is of course no convergence of v̂_1^H towards v_1^H, since the local CG algorithm only tries to find v_1^G. After the introduction of R_0, the problem of finding v_1^G can be successfully solved (with stability) by linear approximation. The estimate v̂_1^H stabilizes, but there is yet no convergence to the correct singular vector.

9 A kind of "natural" regularization. For a white noise matrix, all eigenvalues have modulus 1.
After the introduction of R_1, both singular vector estimates converge to the correct values. In practice, we want to avoid the sharp shifts displayed in the figure after the correcting steps. This is also necessary to fulfill the requirement that the sequences {x_i} and {w_i} should change gradually and not abruptly. The reason for the sharp shifts has to do with the B_i-matrix in the local CG algorithm. When the problem of finding the top singular vector of G is modified by the introduction/change of R_0, the matrices Z and X suddenly contain observations (columns) from two different problems. This leads to temporarily meaningless estimates B_i. The problem corrects itself when the samples belonging to the "previous" problem are taken out of the estimation process. A way to make the convergence smoother is to introduce the mappings in a more gradual way.

4.3.1 Discussion: Errors in Ŝ⁻¹, Contraction and Expansion

We briefly comment on the possible effects of erroneous estimation of eigenvalues/singular values. The singular values corresponding to singular vectors parallel with y_Opt will be well estimated. Consider what happens if y_Opt is parallel with u_5^H, a trailing singular vector. In this case, σ_1^H will be poorly estimated. Upon inversion, this error is reflected in Ŝ_H^{−1}. If X modifies his vector x_Opt in the direction of the first singular vector of H (as 'encouraged' by the introduction of R_1), this contribution will be contracted or expanded, depending on the error in σ̂_1^H. This is not a good situation for agent Y. If X tries to change his vector x_Opt to find a vector more parallel with v_1^H, the corresponding variation in y_Opt could be very high or very low, and an improvement in the situation for Y hard to establish. We conclude this section by discussing some theoretical reference scenarios. Using our knowledge about the relationships between the principal directions of G and H, we discuss which situations are more and less difficult to handle.
4.4 Reciprocity: G = H^T

This is the easiest case. From the SVD of the matrices, it is seen that u_1^H = v_1^G, and so if X picks x = v_1^H, y_Opt will be parallel with v_1^G. With the initial R_0 = I, the vector transmitted through G will be y_Opt/||y_Opt|| = σ_1^H u_1^H/||σ_1^H u_1^H|| = v_1^G, and both top singular vectors are found simultaneously. In this particular case, the matrix R_0 is superfluous, and the only thing that can fail is if this matrix is wrongly estimated. A working system should be able to detect this situation, and avoid calculation and use of R_0 altogether.

4.5 Inversion: G = H^{-1}

In this case, v_1^G = u_r^H. This means that if X finds an x maximizing his reward function, the corresponding y_Opt will be parallel with the last singular vector of H, u_r^H. If, in addition, X finds this x after only a few iterations, and if the initial x_0 is close to the optimal x_Opt, then the ellipsoid fitting of Y must be based on observations of y almost parallel to the last singular vector u_r^H. The simulations indicate that this is difficult.

5 Discussion

We have described the implementation of a framework for finding the top singular modes in an FDD system. The calculations used do not involve estimates Ĝ or Ĥ, as in conventional systems based on training data. The channel matrices G and H are never observed, only interacted with through matrix multiplication. We can therefore assume that if the channel matrices G and H vary continuously with time, our method is able to track the top singular modes, a clear advantage over most training-based systems. After an initial acquisition period, common to all blind methods, the matrices R_0 and R_1 are updated continuously. Furthermore, for a training-based system to employ the top singular modes, G and H must be known by both parties. In most conventional training-based systems (e.g. V-BLAST), only the receiver has knowledge of the channel matrix.
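The two reference scenarios above can be checked numerically. A small numpy sketch (our own illustration, with a random H that is an assumption): for G = H^T, the top right singular vector of G is parallel with u_1^H, while for G = H^{-1} it is parallel with the trailing left singular vector of H.

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.standard_normal((5, 5))
U, s, Vt = np.linalg.svd(H)

# Reciprocity G = H^T: the top right singular vector of G equals u_1^H
# up to sign, since G = V S U^T when H = U S V^T.
G = H.T
_, _, Vgt = np.linalg.svd(G)
v1_G = Vgt[0]
u1_H = U[:, 0]
print(abs(v1_G @ u1_H))        # ~1: parallel (up to sign)

# Inversion G = H^{-1} = V S^{-1} U^T: the singular values are reversed,
# so the top right singular vector of G is the TRAILING u of H.
Ginv = np.linalg.inv(H)
_, _, Vit = np.linalg.svd(Ginv)
print(abs(Vit[0] @ U[:, -1]))  # ~1: parallel with the last u of H
```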
5.1 Asymmetric workloads

It was shown that the job of finding the top singular vectors can be divided between X and Y with an asymmetric workload. Clearly, agent X has the simplest job. All that is needed is an optimizer capable of finding the extreme point on an ellipsoid. Agent Y has considerably harder work. He has to fit points to an ellipsoid, calculate the principal axes and the corresponding lengths of the half-axes, and try to determine the convergence point of the sequence {y_i}. This imbalance in workload suggests that agent Y should be the base station (BTS), where processing power and complexity are more affordable, and X the mobile subscriber unit.

5.2 Regularization

More research must be done to avoid the possible problems of having errors in Ŝ_H^{-1}. It is also clear that the mapping R_0 is intimately related to the rotation R_1. The latter determines the positioning of y_Opt, and consequently, the quality of the ellipsoid fitting. Updating the rotations R_0 and R_1 interchangeably, combined with regularization of the ellipsoid fitting, could improve overall performance. Also, we have not considered the robustness of the "final rotation" R_1. It is clear that this mapping can be sensitive, since it is constructed without consideration of principal directions in H other than u_1^H.

5.3 Considerations for an improved system

Based on our experience with the framework, we discuss some potential improvements and sources of information that are not exploited.

• Both X and Y can tell whether or not they have obtained their optimal values for f_X and f_Y, by considering the eigenvalues/eigenvectors from the ellipsoid fitting. Unless the vector x_Opt or y_Opt is parallel with the trailing eigenvector (leading singular vector), the true optimum is not reached. This can be used as a stopping criterion for optimization.
• Upon convergence of the sequence {y_i} to σ_1 u_1^H (except for the random variation from the channel noise), the elements of the sequence can be averaged before they are processed with R, normalized and sent back through G, avoiding "double noise accounting".

• Interchangeable update of R_0 and R_1: Introducing R_1 "too quickly" is risky for two reasons. First, it encourages replacement of the original y_Opt by a vector σ_1 u_1^H that is not well estimated, and second, the scalings in Ŝ^{-1} could make contributions in this direction "explode" or "vanish". It seems better to use a "partial rotation" R_1^p, with p ∈ [0, 1]. When y_Opt changes, the matrix R_0 is recalculated using the new observations. Continuing this way, gradually rotating with (a new) R_0, the optimum could be reached safely.

• Better estimation of R_1: As mentioned, this rotation is confined to a subspace, and thus one cannot say how changes in x along directions outside this subspace affect the reward functions. However, one could consider the possibility that Y "learned" from his mistakes, and created a more complex rotation that would ensure a better reward.

• If processing power permits it, one could also consider schemes where X and Y change roles.

5.4 Using multiple singular modes

In a complete working system, it should be possible to estimate multiple singular modes simultaneously, while communicating symbols on these modes (using superpositions). This has already been accomplished in the reciprocal case G = H^T (Dahl et al. 2001). The ideas used there for increasing the number of singular modes can be generalized to work for the non-reciprocal case. As an extra bonus, there will be larger variation in the vectors {x_i} and {w_i}. This variation will improve the ellipsoid fitting, and thus overall performance.
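The "partial rotation" R_1^p mentioned above is easiest to picture in the plane, where R_1^p simply rotates through a fraction p of the full angle. The toy sketch below is our own illustration (in higher dimensions, a matrix fractional power via the matrix logarithm plays the same role):

```python
import numpy as np

def rot(theta):
    """2-D rotation matrix through angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def partial_rotation(theta, p):
    """R1^p for p in [0, 1]: rotate through only a fraction p of the angle."""
    return rot(p * theta)

theta = np.pi / 3           # the full correcting rotation R1 (illustrative)
y = np.array([1.0, 0.0])    # current optimum direction (illustrative)

# Applying R1^(1/3) three times equals applying R1 once.
R_third = partial_rotation(theta, 1 / 3)
y_grad = R_third @ R_third @ R_third @ y
y_full = rot(theta) @ y
print(np.allclose(y_grad, y_full))  # True
```

The gradual variant moves y_Opt in small steps, so R_0 can be re-estimated between steps instead of absorbing one abrupt change.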
5.5 Concluding remarks

A number of issues must be investigated before robust algorithms for the two agents can be devised. We have not discussed the performance of the Local CG algorithm in the presence of various noise levels, nor is the estimation of y_Opt covered in any detail. Selection of the step parameter λ is another issue, as is the selection of potential regularization parameters, and the integer d used both in the Local CG algorithm and in the detection of y_Opt. The non-square and the degenerate channel matrix cases, as well as symbol modulation and operation in complex modes, are all subjects for further research. Based on the experience with the present framework, we believe that the formulation of blind FDD MIMO channel estimation as a two-agent problem will prove to be a useful contribution.

References

[1] Alonso, A. and Kudenko, D. (2001) Machine Learning Techniques for Adaptive Logic-Based Multi-Agent Systems: A Preliminary Report. Artificial Intelligence Group, Department of Computer Science, University of York.
[2] Cruz, J.B., Jr. and Simaan, M.A. (1999) Multi-Agent Control Strategies with Incentives, Proceedings, Symposium on Advances in Enterprise Control, pp. 177-182.
[3] Dahl, T., Christophersen, N. and Gesbert, D. (2001) BIMA - Blind Iterative MIMO Algorithm, accepted for ICASSP-2002.
[4] Fitzgibbon, A.W., Pilu, M. and Fisher, R.B. (1996) "Direct Least Squares Fitting of Ellipses", International Conference on Pattern Recognition, Vienna.
[5] Foschini, G.J. (1996) Layered Space-time Architecture for Wireless Communications in a Fading Environment, Bell Labs Technical Journal, Vol. 1, No. 2, pp. 41-59.

[Figure 4: Convergence of the singular vector estimates v̂_1^G and û_1^H. The upper panel (a) shows the error ||u_1^H − û_1^H||; the lower panel (b) shows the error ||v_1^G − v̂_1^G||.]
[6] Frank, I.E. and Friedman, J.H. (1989) Classification: Oldtimers and Newcomers, Journal of Chemometrics, Vol. 3, pp. 463-475.
[7] Golden, G.D., Foschini, G.J., Valenzuela, R.A. and Wolniansky, P.W. (1999) Detection Algorithm and Initial Laboratory Results Using the V-BLAST Space-Time Communication Architecture, Electronics Letters, Vol. 35, No. 1, pp. 14-15.
[8] Golub, G.H. and Van Loan, C.F. (1996) Matrix Computations, Johns Hopkins University Press, 3rd edition.
[9] Hansen, P.C. (1996) Rank-Deficient and Discrete Ill-Posed Problems, Ph.D. dissertation, Technical University of Denmark, DK-2800 Lyngby, Denmark.
[10] Hastie, T., Buja, A. and Tibshirani, R. (1995) "Penalized Discriminant Analysis", The Annals of Statistics, Vol. 23, No. 1, pp. 73-102.
[11] Hoerl, A.E. and Kennard, R.W. (1970) "Ridge Regression: Biased Estimation for Nonorthogonal Problems", Technometrics, Vol. 12, pp. 55-67.
[12] Hu, J. and Wellman, M. (1998) Online Learning about Other Agents in a Dynamic Multiagent System. In Proceedings of the Second International Conference on Autonomous Agents, pp. 239-246.
[13] Stone, P., Tumer, K., Gmytrasiewicz, P., Greenwald, A., Littman, M., Namatame, A., Sen, S., Veloso, M., Vidal, J. and Wolpert, D. (2002) Description: Collaborative Learning Agents, AAAI-2002 Spring Symposium, March 2002, Stanford, CA.
[14] Stone, P. and Veloso, M. (2000) "Multiagent Systems: A Survey from a Machine Learning Perspective", Autonomous Robots, Vol. 8, No. 3.
[15] Telatar, I.E. (1995) Capacity of Multi-Antenna Gaussian Channels, Bell Labs Technical Memorandum.
[16] Turner, D.A., Anderson, I.J., Mason, J.C. and Cox, M.G. (1998) An Algorithm for Fitting an Ellipsoid to Data, Technical Report RR9803, School of Computing and Mathematics, University of Huddersfield, UK.
[17] Wolpert, D.H., Wheeler, K.R. and Tumer, K.
(1999) "General Principles of Learning-Based Multi-Agent Systems", Proceedings of the Third International Conference on Autonomous Agents (Agents'99).

Chapter 4

Discussion

In this chapter, I will briefly discuss some of my findings. In particular, I try to describe problems that have not been solved satisfactorily, and discuss lines of improvement and further research.

4.1 Sensory Analysis

Finding a Meaningful Consensus

At the outset of my thesis, the aim was to find a method that could find and summarize connections between sensory profiles better than Generalized Procrustes Analysis (GPA). The main critique against this method was its rigidness, e.g. that it used orthogonal transformations rather than a general linear (or non-linear) mapping. In my work I have considered and learned about techniques such as three-way methods and GCA. Still, none of these methods yield an average (consensus) that both represents compressed information about the judges and resembles the profiles of the individual judges in some way. In all the methods I have encountered, the consensus is either obtained using a restricted transformation, or it has some kind of orthogonality criterion. It would be interesting to find a method that generated a consensus representing an average assessor, having some of the same covariance structure between the variables as the individual profiles, but that still captures the more subtle nuances not found by GPA. One idea is to study minimum spanning trees between the profiles (as discussed in Paper Two), and to try to avoid the 'collapse' described in the Introduction by more carefully deciding which profiles are transformed to match. By keeping the number of transformations to a minimum, the 'collapse' induced by multiple iterations could possibly be avoided.
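The minimum-spanning-tree idea can be sketched as follows, using the Procrustes distance between centered profiles. This is our own toy illustration, not code from Paper Two; the data, sizes, and function names are assumptions, and scipy is assumed available.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def procrustes_distance(X1, X2):
    """Residual ||X1 - X2 Q||_F after the optimal orthogonal rotation Q."""
    A = X2.T @ X1
    s = np.linalg.svd(A, compute_uv=False)
    d2 = np.sum(X1**2) + np.sum(X2**2) - 2.0 * np.sum(s)
    return np.sqrt(max(d2, 0.0))

rng = np.random.default_rng(2)
# Toy sensory profiles: 4 judges, each a (10 samples x 3 attributes) matrix,
# column-centered before comparison.
profiles = [rng.standard_normal((10, 3)) for _ in range(4)]
profiles = [X - X.mean(axis=0) for X in profiles]

n = len(profiles)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = procrustes_distance(profiles[i], profiles[j])

# The MST over the Procrustes distances suggests which profiles to match
# directly, keeping the number of transformations to a minimum.
mst = minimum_spanning_tree(D)
print(mst.toarray())
```

Matching along the tree edges only, instead of iterating all profiles against a consensus, is one way the 'collapse' from repeated transformations might be avoided.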
Test for subtle nuances

In Paper Five on mobile communication, we used a Procrustes measure to statistically determine whether or not a matrix could be transformed to match another by an orthogonal transformation. Working along those lines, it could be possible to construct a statistical test of whether the relationship between two matrices is orthogonal, or whether it is significantly better described by a non-orthogonal transformation. This could be used to check the validity of using three-way analysis rather than GPA on sensory panel data. As an alternative to general linear transformations, one could also think of using information theory to decide whether there exists a mapping, possibly non-linear, between two profiles.

4.2 Mobile Communications

Improving Blind Estimation

In Paper Five, working on blind estimation of non-reciprocal channels (FDD), the use of ellipsoid fitting as a "partial SVD" was described. This can be used to improve performance in the reciprocal channel (TDD) case as well. The power method, a numerical method for finding eigenvalues and eigenvectors, is no longer considered a serious method for eigen-estimation. The reason it was used in our papers is that it arose naturally as an implicit process of the physics involved. It more or less lends itself to be exploited, at low processing and channel capacity costs compared to methods that are in use today. If A ∈ R^{p×p} is a real symmetric matrix, and x some random vector in the column space of A, then the Krylov sequence {x, Ax, A^2 x, ..., A^n x} can be used for eigenvector estimation. This sequence arises naturally in the power method, but a lot more can be said about the eigen-structure of A from this sequence. For instance, the conjugate gradient method (CG) can make better estimates of the eigenvectors from this sequence than the power method, which simply uses the last element as an approximation.
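To illustrate the point, here is a small numpy sketch (our own toy example, with a synthetic, well-separated spectrum): a Rayleigh-Ritz estimate extracted from the whole Krylov sequence is compared with the power-method estimate, which uses only the last element of the same sequence.

```python
import numpy as np

# Toy symmetric matrix with a known, well-separated spectrum
# (a stand-in for the A in the text; the dominant eigenvector is e_1).
lam = np.array([10.0, 3.0, 1.0, 0.5, 0.2, 0.1, 0.05, 0.01])
A = np.diag(lam)
x = np.ones(8)
n_steps = 4

# The Krylov sequence {x, Ax, A^2 x, ..., A^n x} from the text.
K = [x]
for _ in range(n_steps):
    K.append(A @ K[-1])
K = np.array(K).T                      # columns span the Krylov subspace

# Power method: use only the LAST element of the sequence.
v_power = K[:, -1] / np.linalg.norm(K[:, -1])

# Rayleigh-Ritz over the WHOLE Krylov subspace (the Lanczos/CG viewpoint).
Q, _ = np.linalg.qr(K)                 # orthonormal basis of the subspace
w, Z = np.linalg.eigh(Q.T @ A @ Q)     # projected operator
v_ritz = Q @ Z[:, np.argmax(w)]        # Ritz vector for the top eigenvalue

v_true = np.eye(8)[:, 0]               # true dominant eigenvector
err_power = 1 - abs(v_power @ v_true)
err_ritz = 1 - abs(v_ritz @ v_true)
print(err_power)                       # power-method alignment error
print(err_ritz)                        # smaller: the whole sequence is used
```

Both estimates are built from exactly the same transmitted vectors; the Ritz estimate simply does not discard the earlier elements.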
There must be unexploited information in the vectors that are transmitted in a TDD system. A study of the CG algorithm and other Krylov methods for the sake of improving the eigen-mode estimation could be useful. Comparing such techniques with the "partial SVD" used for ellipsoid fitting in Paper Five is another aspect.

Simplifications of the blind FDD algorithm

The blind FDD algorithm in Paper Five can be considerably simplified when there is only one receive element at the mobile subscriber unit. This corresponds to a SIMO/MISO system, or smart antennas. In this case, a simplified version of the Local CG algorithm can be used to find the optimal weight distribution for transmission from the base station to the mobile subscriber unit. The connection between this method and other techniques for blind SIMO/MISO estimation must be understood. It is clearly more difficult to come up with new ideas in an area which has been studied intensively for thirty years. Still, it could be that ideas inspired by MIMO could bring some improvement to smart antenna systems.

Symbol modulation

The performance of blind channel estimation in both TDD and FDD systems could be improved by working more on the symbol modulation and the properties of the modulation alphabet. Spatial water-filling could also be implemented to improve performance. In the non-reciprocal case, symbol modulation was not discussed, and only the top singular modes are used for transmission. Implementation of a modulation alphabet, as well as using more singular modes for transmitting and receiving data, is a natural next step in this work. The connection with game theory and Multi-Agent Systems is another aspect. In particular, optimization of nested, time-varying functions is a subject that could be of interest for a wider statistically oriented audience.
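As an illustration of the spatial water-filling mentioned above, the following sketch implements the standard power-allocation rule over a set of singular modes. This is a textbook procedure, not an algorithm from the papers; the singular values, noise level, and names are our own assumptions.

```python
import numpy as np

def water_filling(gains, total_power, noise=1.0):
    """Allocate power over modes: p_i = max(mu - noise/g_i^2, 0),
    with the water level mu chosen so that sum(p_i) = total_power."""
    inv = noise / np.asarray(gains, dtype=float) ** 2
    order = np.argsort(inv)
    inv_sorted = inv[order]
    p = np.zeros_like(inv)
    # Try k active modes, from all modes down to one.
    for k in range(len(inv), 0, -1):
        mu = (total_power + inv_sorted[:k].sum()) / k
        if mu > inv_sorted[k - 1]:          # all k modes get positive power
            p[order[:k]] = mu - inv_sorted[:k]
            break
    return p

sigma = np.array([3.0, 2.0, 0.5, 0.1])  # singular values of the channel
p = water_filling(sigma, total_power=4.0)
print(p, p.sum())   # strong modes get the power; weak modes may get none
```

In a blind system the allocation would have to use the *estimated* singular values, so errors of the kind discussed in Paper Five would also affect the power distribution.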