Marc Sobel 1,* and Longin Jan Latecki 2
1 Dept. of Statistics, Temple University, Philadelphia, PA, USA
2 Dept. of Computer and Information Sciences, Temple University, Philadelphia, PA, USA
ABSTRACT
Data visualization is achieved by minimizing the distortion that results from observing the relationships between data points. Typically, this is accomplished by estimating latent data points designed to accurately reflect the pairwise relationships between the observed data points. The distortion masks the true pairwise relationships between data points, which are represented by the latent data. Distortion can be modeled as masking the pairwise distances between data points (i.e., pairwise distance distortion) or, alternatively, as masking dissimilarity measures between data points (i.e., pairwise dissimilarity distortion). Multidimensional scaling methods are usually used to model pairwise distance distortion. This class of methods includes principal components analysis, which minimizes the global distortion between observed and latent data. We employ an algorithm, which we call stepwise forward selection, to identify appropriate initializing values and to determine the appropriate dimensionality of the latent data space. We model pairwise dissimilarity distortion using mixtures of pairwise difference factor-analysis statistical models. Our approach differs from that of probabilistic principal components (e.g., Bishop and Tipping (1998)), where noise masks the relationship between each individual data point and its latent counterpart. By contrast, in our approach, noise masks pairwise dissimilarities between data points and the analogous latent quantities; we will see below that this difference in approach allows us to build some extra flexibility into the interpretation and modeling of high-dimensional data. Our approach is similar in spirit to that employed in relational Markov models (e.g., Koller (1999)). We show that the pairwise factor-analysis models frequently fit the data better because they allow for direct modeling of pairwise dissimilarities between data points.
Keywords: Multidimensional Scaling, factor analysis, relational Markov Models, principal components.
AMS 2000 Subject Classification: Primary 62P30; Secondary 68T45 and 68U10.
________________________________________________________________________
* Correspondence: Marc Sobel, Department of Statistics, Fox School of Business and Management, 1810 N. 13th Street, Temple University, Philadelphia, PA 19122. E-mail: marc.sobel@temple.edu
1. INTRODUCTION
There has been considerable interest in both the data mining and statistical research communities in the comparison, registration, and classification of image data. From the practitioner's perspective, there are a number of advantages accruing from their success.
First, algorithms of this sort can provide mechanisms for data visualization. Second, they provide a mechanism for learning the important features of the data. Because feature vectors typically take values in very high dimensional spaces, reducing their dimensionality is crucial to most data mining tasks. Many algorithms for reducing data dimensionality depend on estimating latent variables by minimizing certain energy (or loss) functions. Other algorithms achieve the same purpose by estimating latent variables using statistical models. Generally, dimensionality reduction methods can be divided into those which employ a metric and those which do not. In the former case, data points taking values in a high-dimensional space together with pairwise distances between them are observed. The goal, in this case, is to estimate latent data points (one for each observed data point), taking values in a lower dimensional space, whose pairwise distances accurately reflect those of their observed counterparts. Methods of this sort include those proposed by Sammon (1969). Methods of the latter variety (which do not employ a metric) start with data points whose pairwise differences are measured by dissimilarities which need not correspond to a distance. Methods of this sort include those of Kruskal (e.g., Cox and Cox (2001) ). In this paper we assume observed data points, living in a high-dimensional space. We propose a factor analysis model in which the pairwise differences between these data points are represented by the corresponding
differences between their (low-dimensional) latent counterparts. It is assumed that the true pairwise differences between observed data points are masked by noise. This noise could arise in many different ways; examples include:
(i) settings where partitioning data into groups is of paramount interest, and lack of straightforward clusters can be modeled as the impact of noise on the pairwise relationships between data, and
(ii) settings where (global) energy is being modeled; in this case noise arises in evaluating the relationship between the energies of neighboring data points.
The main goal of multidimensional scaling (MDS) is to minimize the distortion between pairwise data distances and the corresponding pairwise distances between their latent counterparts. This minimization insures that the (topological) structure of the latent data optimally reflects that of the observed data. MDS algorithms frequently involve constructing loss functions which prescribe (scaled) penalties for differences between observed and latent pairwise distances. See also McFarlane and Young (1994) for a graphical analysis of MDS. MDS methods are used widely in behavioral, econometric and social sciences (e.g., Cox and Cox (2001)). The most commonly used nonlinear projection methods within MDS involve minimizing measures of distortion (like those of
Kruskal and Sammon) (e.g., Sammon (1969) and Cox and Cox (2001)). These measures of distortion frequently take the form of loss or energy functions. For purposes of comparison we focus on the loss function proposed by J. Sammon in Sammon (1969). In this paper we compare MDS methods (as represented by Sammon's) with those using pairwise difference factor analysis methodology. We employ stepwise forward selection algorithms (see below) to provide initializing vectors appropriate for use with
all of the employed methodologies. Other options which have been recommended include
(i) random initializing values (e.g., Sammon (1969), Faloutsos and Lin (1995), and McLachlan and Peel (2000)), and
(ii) initializing at latent variable values arising from employing principal components analysis (PCA) (e.g., Jolliffe (1986)).
The former option, still commonly used, fails to provide any useful training information. The latter option fails because PCA does not provide optimal (or near-optimal) solutions to minimizing the (noise) distortion of the data. In fact, the distortion reduction for PCA-generated latent variables is very small. For multidimensional scaling models, after employing stepwise forward selection algorithms for the aforementioned purposes, we typically use gradient descent methods (e.g., Mao and Jain (1995) and Lerner et al. (1998)) to minimize Sammon's cost function. For factor analysis mixture models, after employing stepwise forward selection algorithms, we use the EM algorithm (e.g., Dempster, Laird and Rubin (1977)) to provide estimates of the parameters.
We partition the pairwise differences between data into two groups by determining data membership using EM-supplied probabilities and an appropriate threshold. The first group consists of those pairs of observations with pairwise differences fully explained by the pairwise differences between their latent counterparts; the second group consists of those pairs which fail to be fully explained. These groups provide a mechanism for distinguishing between data clusters; pairs of observations in the former group are put in the same cluster, while those in the latter group are put into different clusters.
2. FORMULATING THE PROBLEM USING STATISTICAL MODELS
Below, we compare two different ways of projecting the relationships between pairs of data points into a latent k-dimensional space, denoted by $R^k$; typically k will be taken to be 2 or 3. Using the notation $F_1,\ldots,F_n \in R^p$ for the observed (p-dimensional) feature vector data with components $F_i = (F_{i,1},\ldots,F_{i,p})$ $(1 \le i \le n)$, and $\|\cdot\|_u$ for the standard Euclidean norm in u dimensions, multidimensional scaling is concerned with projecting a known real-valued dissimilarity function $D(F_i, F_j)$ of the ordered pairs of features $F_i, F_j$ onto their latent k-dimensional distance counterparts $\|\mu_i - \mu_j\|_k$ $(1 \le i < j \le n)$. Sammon's energy function provides a typical example of this. In this setting the latent k-vectors $\mu_i$, $i = 1,\ldots,n$, are chosen to minimize a loss (or energy) function of the form,
$$S(F \mid \mu_1,\ldots,\mu_n) = \sum_{1 \le i < j \le n} \frac{\left(\|F_i - F_j\|_p - \|\mu_i - \mu_j\|_k\right)^2}{\|F_i - F_j\|_p} \qquad (2.1)$$
in the $\mu$ vector parameters. In this example the dissimilarity measure is $D(F_i, F_j) = \|F_i - F_j\|_p$. Many variations on this basic theme have been explored in the literature (e.g., Cox and Cox (2001)). As a counterpoint to this approach, we assume (as above) that the feature vectors associated with each data object are themselves observed. We employ a variant of probabilistic principal components models, introduced
in McFarlane and Young (1994). Our variant is designed to take account of the fact that we seek to model the noise distortion between pairs of feature vectors rather than the noise distortion associated with each individual feature vector. We follow the main principle of MDS, which is to map the data to a low-dimensional space in such a way that the distortion between data points is minimal. We first introduce some necessary notation. Let $D_{i,j}$ denote the dissimilarity between feature vectors $F_i$ and $F_j$ ($F_i, F_j \in R^p$; $1 \le i < j \le n$). We note that $D_{i,j}$ may now take vector values; we refer to the space in which $D_{i,j}$ takes values as the dissimilarity space. We assume that the dimension d of this dissimilarity space is no larger than the dimension p of the feature space. In the example given below, we take $D_{i,j} = F_i - F_j$ $(1 \le i < j \le n)$, in which case the dimension p of the feature space is the same as the dimension d of the dissimilarity space. Other examples include taking $D_{i,j} = \sum_{l=1}^{p} \lvert F_{i,l} - F_{j,l} \rvert$, in which case the dissimilarity space has dimension d = 1.
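To make the two choices of dissimilarity concrete, the following minimal sketch (in Python with NumPy, using a small hypothetical feature matrix) computes both the vector-valued dissimilarity $D_{i,j} = F_i - F_j$ (so that d = p) and a scalar, coordinate-summed dissimilarity (so that d = 1).

```python
import numpy as np

# Hypothetical feature matrix: n = 4 observations in p = 3 dimensions.
F = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.5, 2.5],
              [3.0, 3.0, 3.0],
              [0.2, 0.9, 2.1]])

def vector_dissimilarity(F):
    """D[i, j] = F_i - F_j for i < j; the dissimilarity space has dimension d = p."""
    n = F.shape[0]
    return {(i, j): F[i] - F[j] for i in range(n) for j in range(i + 1, n)}

def scalar_dissimilarity(F):
    """Sum of absolute coordinate differences; the dissimilarity space has dimension d = 1."""
    n = F.shape[0]
    return {(i, j): np.abs(F[i] - F[j]).sum() for i in range(n) for j in range(i + 1, n)}

D_vec = vector_dissimilarity(F)    # each value is a length-p vector
D_scal = scalar_dissimilarity(F)   # each value is a single number
```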
The general statistical model assumed below takes the form:
$$F_i - F_j = A_{z_{i,j}}\left(\mu_i - \mu_j\right) + \varepsilon_{i,j}; \qquad 1 \le i < j \le n \qquad (2.2)$$
where:
$z_{i,j}$ identifies the particular mixture model component ($z_{i,j} = s$ means that the pair (i, j) belongs to mixture component s, with s = 1 or 2);
$A_s$ are parametric $p \times k$ matrices indexed by the component s;
$\mu_i$ are latent k-dimensional vectors indexed by the observation index i $(1 \le i \le n)$;
$\varepsilon_{i,j}$ is the pairwise noise distortion for the features $F_i, F_j$ $(1 \le i < j \le n)$.
It is assumed below that the errors $\varepsilon_{i,j}$ are normally distributed with common variance $\sigma_{z_{i,j}}^2 I$ ($I$ is the $p \times p$ identity matrix).
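The sketch below illustrates how data could be generated from model (2.2); the sizes, the latent vectors, the projection matrices, and the component labels are all hypothetical stand-ins chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 20, 10, 2                     # assumed sizes: observations, feature dim, latent dim
mu = rng.normal(size=(n, k))            # latent vectors mu_i
A = [rng.normal(size=(p, k)),           # component-specific p x k projection matrices A_1, A_2
     rng.normal(size=(p, k))]
sigma = [0.1, 1.0]                      # sigma_1 << sigma_2, as assumed in the text

pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
z = {pair: rng.integers(1, 3) for pair in pairs}   # mixture label z_{i,j} in {1, 2}

# Pairwise differences generated according to equation (2.2).
D = {}
for (i, j) in pairs:
    s = z[(i, j)]
    noise = rng.normal(scale=sigma[s - 1], size=p)  # N(0, sigma_s^2 I_p) noise distortion
    D[(i, j)] = A[s - 1] @ (mu[i] - mu[j]) + noise
```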
While the dimensionality d of the D's (defined above) may be quite high, the dimensionality k of the latent vectors will typically be assumed to be quite small. The matrices $A_z$ are (latent) projection matrices projecting paired differences between parametric latent vectors onto their feature-vector paired-difference counterparts. We use the EM algorithm (e.g., Dempster, Laird and Rubin (1977)) to estimate the model parameters under the assumption that the observed dissimilarities are given by $D_{i,j} = F_i - F_j$ $(1 \le i < j \le n)$. The equations needed for doing this are given in the
appendix. In equation (2.3) below, we assume that the aforementioned mixture model, indexed by $z$, consists of exactly 2 components. The first component comprises pairs of observations whose difference has small variance; the second comprises pairs of observations whose difference has a large variance. The first component model is designed to characterize those pairs of feature vectors whose difference is well-approximated by the corresponding difference between their latent variable counterparts; the second, those pairs of feature vectors whose difference is not well-approximated by this difference. Specifically, we assume that the first component variance, $[\sigma_1]^2$, is significantly smaller than the second, $[\sigma_2]^2$. Model parameters $\mu_i$ (i = 1,...,n) were selected to minimize quantities of the form,
$$SS = \sum_{1 \le i < j \le n}\;\sum_{z=1,2}\left\|F_i - F_j - \hat{A}_z\left(\mu_i - \mu_j\right)\right\|^2 P\big(z \mid F_i, F_j\big) \qquad (2.3)$$
where $\hat{A}_z$ denotes the factor matrix estimate given by the (expectation-maximization) EM algorithm and $P(z \mid F_i, F_j)$ is the posterior probability specified in that algorithm (see the appendix, below). Using the EM algorithm, we estimate the most likely components $\hat{z}_{i,j}$ $(1 \le i < j \le n)$ for each pair of features; this imposes a natural clustering of the features, defined so that the cluster (of indices) containing a given feature $F_i$ is
$$C_i = \{\, j : j > i,\ \hat{z}_{i,j} = 1 \,\} \cup \{\, j : j < i,\ \hat{z}_{j,i} = 1 \,\}$$
(i = 1,...,n). In the examples given below, we employ this clustering for comparative purposes.
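As an illustration of the clustering rule just described, the sketch below (with hypothetical posterior probabilities standing in for the EM output, and 1/2 used as the threshold) assigns each pair its most likely component and collects, for each feature index i, the indices whose pair with i falls in the low-variance component.

```python
import numpy as np

n = 5
rng = np.random.default_rng(1)
# Hypothetical EM posterior probabilities P(z = 1 | F_i, F_j) for each pair i < j.
post1 = {(i, j): rng.uniform() for i in range(n) for j in range(i + 1, n)}

# Most likely component for each pair: component 1 when its posterior exceeds the threshold 1/2.
z_hat = {pair: 1 if p1 > 0.5 else 2 for pair, p1 in post1.items()}

# Cluster of indices attached to feature F_i: all j whose pair with i falls in component 1.
def cluster_of(i):
    return {j for j in range(n) if j != i
            and z_hat[(min(i, j), max(i, j))] == 1}

C = {i: cluster_of(i) for i in range(n)}
```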
We assess the fitness of data visualization models using Bayesian p-values (e.g., Gelman, Carlin, Stern and Rubin (1995)). This can be formulated as the probability that the information obtained from the model is less than expected under an a posteriori update of the data. Information quantities like those derived below are discussed in McLachlan and Peel (2000). This kind of calculation is not possible for typical multidimensional scaling (MDS) models because they are not formulated as statistical models. In what follows, we use the notation $D = \{F_i - F_j\}_{1 \le i < j \le n}$ for the set of observed dissimilarities, and 'INF' for the parametric information measure (e.g., Bernardo (1995)). In the model (referred to below as M) introduced at the beginning of this section, the information contained in the observed dissimilarity measures about the parameters, assuming an uninformative prior and ignoring marginal terms, is,
$$\mathrm{INF}(M \mid D) = E\big[\log L(D)\big] = \sum_{1 \le i < j \le n}\Big[\log L\big(F_i - F_j,\, 1\big)\, P\big(1 \mid F_i, F_j\big) + \log L\big(F_i - F_j,\, 2\big)\, P\big(2 \mid F_i, F_j\big)\Big] \qquad (2.4)$$
where $L$ denotes the likelihood of the data, with
$$L\big(F_i - F_j,\, z\big) = \frac{1}{(2\pi\sigma_z^2)^{p/2}}\exp\left\{-\frac{\left\|F_i - F_j - A_z\left(\mu_i - \mu_j\right)\right\|^2}{2\sigma_z^2}\right\} \qquad (1 \le i < j \le n).$$
For the model introduced in section 2, the right-hand side of equation (2.4) can be approximated, omitting terms which don't involve the observed dissimilarity measures, by
$$\mathrm{INF}(M \mid D) \approx -\sum_{1 \le i < j \le n}\frac{\left\|F_i - F_j - \hat{A}_1\big(\hat{\mu}_{i,(1)} - \hat{\mu}_{j,(1)}\big)\right\|^2}{2\,\hat{\sigma}_1^2}\,P\big(1 \mid F_i, F_j\big)\; -\; \sum_{1 \le i < j \le n}\frac{\left\|F_i - F_j - \hat{A}_2\big(\hat{\mu}_{i,(2)} - \hat{\mu}_{j,(2)}\big)\right\|^2}{2\,\hat{\sigma}_2^2}\,P\big(2 \mid F_i, F_j\big) \qquad (2.5)$$
where the hatted quantities are all EM algorithm estimates (see the appendix) (see also
McLachlan and Peel (2000) for a more complete discussion of information quantities like that given in equation (2.5)). We simulate the posterior update $D^*_{i,j}$ of the dissimilarity $F_i - F_j$ via,
$$D^*_{i,j} \sim N\Big(\hat{A}_1\big(\hat{\mu}_{i,(1)} - \hat{\mu}_{j,(1)}\big),\, \hat{\sigma}_1^2\Big)\, P\big(1 \mid F_i, F_j\big) \; +\; N\Big(\hat{A}_2\big(\hat{\mu}_{i,(2)} - \hat{\mu}_{j,(2)}\big),\, \hat{\sigma}_2^2\Big)\, P\big(2 \mid F_i, F_j\big) \qquad (2.6)$$
$(1 \le i < j \le n)$. ($N(m, \sigma^2)$ refers to the normal distribution with mean $m$ and variance $\sigma^2$.) Below, we use the notation $D^* = \{D^*_{i,j}\}_{1 \le i < j \le n}$ for the updated dissimilarities. In this notation, the posterior Bayes p-value is given by,
$$\text{Bayes p-value} = P\big(\mathrm{INF}(M \mid D^*) \le \mathrm{INF}(M \mid D)\big) \qquad (2.7)$$
(the probability in equation (2.7) being calculated over the distribution specified by equation (2.6)). For the models examined below, the Bayesian p-values were all between 80% and 90%, indicating good fits.
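A sketch of this posterior check, under equations (2.6)-(2.7): dissimilarities are re-simulated from the fitted two-component model and the Bayes p-value is estimated by the Monte Carlo frequency with which the information (2.5) of the simulated dissimilarities falls below that of the observed ones. All fitted quantities below are hypothetical stand-ins for the EM estimates of the appendix, and the observed-data posterior weights are reused for the simulated dissimilarities, which is one possible reading of (2.7).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 10, 6, 2
# Hypothetical EM estimates (stand-ins for the quantities returned by the appendix algorithm).
mu_hat = [rng.normal(size=(n, k)), rng.normal(size=(n, k))]   # latent means per component
A_hat = [rng.normal(size=(p, k)), rng.normal(size=(p, k))]    # projection matrices per component
sig_hat = np.array([0.2, 1.5])                                # component standard deviations
F = rng.normal(size=(n, p))                                   # observed features
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
# Hypothetical posterior weights P(z = 1 | F_i, F_j); P(z = 2 | ...) = 1 - these.
post1 = {pr: rng.uniform() for pr in pairs}

def info(D):
    """Approximate information (2.5) for a dictionary of pairwise dissimilarities."""
    total = 0.0
    for (i, j), d in D.items():
        for z, w in ((0, post1[(i, j)]), (1, 1.0 - post1[(i, j)])):
            resid = d - A_hat[z] @ (mu_hat[z][i] - mu_hat[z][j])
            total -= w * resid @ resid / (2.0 * sig_hat[z] ** 2)
    return total

D_obs = {(i, j): F[i] - F[j] for (i, j) in pairs}

def simulate_update():
    """Draw D*_{i,j} from the fitted mixture, as in equation (2.6)."""
    D_star = {}
    for (i, j) in pairs:
        z = 0 if rng.uniform() < post1[(i, j)] else 1
        mean = A_hat[z] @ (mu_hat[z][i] - mu_hat[z][j])
        D_star[(i, j)] = mean + rng.normal(scale=sig_hat[z], size=p)
    return D_star

info_obs = info(D_obs)
draws = [info(simulate_update()) for _ in range(200)]
bayes_p_value = np.mean([d <= info_obs for d in draws])   # Monte Carlo estimate of (2.7)
```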
3. ALGORITHMS EMPLOYED FOR MDS DATA VISUALIZATION
We use online gradient descent algorithms to estimate parameters in the MDS approach to data visualization (e.g., McFarlane and Young (1994)). The gradient of
Sammon's energy function (see equation (2.1)) with respect to the parametric vector $\mu_i$, restricted to terms involving $\mu_j$, $j \ne i$, is:
$$\nabla_{\mu_i} S\big|_{j} = -2\,\frac{\|F_i - F_j\|_p - \|\mu_i - \mu_j\|_k}{\|F_i - F_j\|_p}\cdot\frac{\mu_i - \mu_j}{\|\mu_i - \mu_j\|_k} \qquad (3.1)$$
The analogous quantity with the indices i and j switched is $\nabla_{\mu_j} S\big|_{i} = -\nabla_{\mu_i} S\big|_{j}$ $(1 \le i < j \le n)$. An online gradient descent algorithm can, in theory, be based on an iterative calculation of the latent vectors by updating them using the following iterative steps:
$$\mu_i^{(new)} = \mu_i^{(current)} - \nabla_{\mu_i} S\big|_{j}; \qquad \mu_j^{(new)} = \mu_j^{(current)} - \nabla_{\mu_j} S\big|_{i} \qquad (3.2)$$
where $\mu_i^{(current)}$ (respectively, $\mu_i^{(new)}$) denotes the current (respectively, new) value of the latent vector $\mu_i$, and $\nabla_{\mu_i} S\big|_{j}$ denotes the quantity given by equation (3.1) (i = 1,...,n).
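A compact sketch of the online updates (3.2) applied to Sammon's energy (2.1): for a randomly chosen pair (i, j), both latent vectors are moved against the gradient of that pair's contribution to the stress. The random initialization and the step size eta are assumptions made only for illustration; the text does not fix a step size.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 30, 8, 2
F = rng.normal(size=(n, p))    # observed feature vectors
mu = rng.normal(size=(n, k))   # latent vectors (random initialization, for illustration only)
eta = 0.05                     # assumed step size

def sammon_stress(F, mu):
    """Sammon's energy (2.1): normalized squared gaps between observed and latent distances."""
    s = 0.0
    for i in range(len(F)):
        for j in range(i + 1, len(F)):
            d_obs = np.linalg.norm(F[i] - F[j])
            d_lat = np.linalg.norm(mu[i] - mu[j])
            s += (d_obs - d_lat) ** 2 / d_obs
    return s

def pair_gradient(F, mu, i, j):
    """Gradient (3.1) of the (i, j) term of Sammon's energy with respect to mu_i."""
    d_obs = np.linalg.norm(F[i] - F[j])
    d_lat = np.linalg.norm(mu[i] - mu[j]) + 1e-12   # guard against division by zero
    return -2.0 * (d_obs - d_lat) / d_obs * (mu[i] - mu[j]) / d_lat

# Online updates (3.2): sweep over random pairs, moving mu_i and mu_j in opposite directions.
for _ in range(5000):
    i, j = rng.choice(n, size=2, replace=False)
    g = pair_gradient(F, mu, i, j)
    mu[i] -= eta * g
    mu[j] += eta * g               # the gradient with i and j switched is the negative

final_stress = sammon_stress(F, mu)
```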
We have already remarked on the problem of training a large number of $\mu$-vectors for the purpose of initializing the gradient descent and EM algorithms. We show below how to select a small number k of vantage vectors $F_{v_1},\ldots,F_{v_k}$, with indices $v_1,\ldots,v_k$ selected from among the full set of indices 1,...,n, having the property that Sammon's distortion function,
$$S(F \mid v_1,\ldots,v_k) = \sum_{1 \le i < j \le n}\frac{\left(\|F_i - F_j\|_p - \left[\sum_{u=1}^{k}\big(\|F_{v_u} - F_i\|_p - \|F_{v_u} - F_j\|_p\big)^2\right]^{1/2}\right)^2}{\|F_i - F_j\|_p} \qquad (3.3)$$
is fairly small. This provides us with well-trained (i.e., well-fitted) initializing $\mu$-vectors, given by projecting the original data points $F_1,\ldots,F_n$ to $R^k$ in a way described in the next section. Since, for purposes of visualization, k = 2 or k = 3 is typically sufficient to ensure small values of $S(F \mid v_1,\ldots,v_k)$ for a moderately sized data set, two or three vantage vectors usually suffice in this case (e.g., Fraley and Raftery (1999)). For a large number n of observed data points, vantage objects can be obtained by the stepwise forward selection process described below. We note that this process improves on the ad hoc procedures used heretofore (McLachlan and Peel (2000)).
4. A GEOMETRIC METHOD TO ESTIMATE LATENT VARIABLES
In this section we describe a process for selecting optimal initializing vectors for nonlinear projection methods. We show below that Sammon's mapping, as well as the pairwise difference factor analysis methodology, yields better results when initialized with these vectors. To the best of our knowledge, this is the first serious approach to providing adequate initializing vectors for multidimensional scaling (MDS) and related techniques. As mentioned above, the methods actually used heretofore either employ random initializing vectors or latent vectors obtained using principal components analysis (PCA) (e.g., Jolliffe (1986)). While the drawbacks of random initialization do not require any discussion, our experimental results in the next section demonstrate that employing PCA for this purpose does not produce good results. This is a consequence of the fact that PCA minimizes the distortion between the observed data points and their projections (latent variables), while MDS techniques minimize the distortion between the mutual distances between the original data points and the corresponding distances between their latent counterparts. For purposes of simplifying our presentation, we limit discussion to the problem of constructing initializing latent vectors $\mu_i$, $i = 1,\ldots,n$, for projecting the data set $F = \{F_1,\ldots,F_n\}$ into two-dimensional Euclidean space. Without loss of generality we show how to select latent initializing vectors to minimize Sammon's energy (see equation (2.1)). We seek a set of latent vectors $\mu_i$, $i = 1,\ldots,n$, in 2-space having the property that their pairwise distances are as close as possible to the corresponding pairwise distances between the data points. Below, we outline the construction of the coordinate system into which the data are projected. As above, we use the notation $v_s$ (s = 1,...,n) to denote indices selected from the index set $\{1,\ldots,n\}$.
a) First, select two different data points, $F_{v_1}$ and $F_{v_2}$ (called vantage points below), from the original data set $F = \{F_1,\ldots,F_n\}$ in such a way that the distortion $S(F \mid v_1, v_2)$ (defined by equation (3.3)) is minimized.
b) Next, select (respective) 2-dimensional projected versions, $\mu_{v_1}$ and $\mu_{v_2}$, of $F_{v_1}$ and of $F_{v_2}$ having the properties that:
i) $\mu_{v_1}$ corresponds to the origin, and $\mu_{v_2}$ defines the direction of the x-axis in the new coordinate system, and
ii) the distance (in the new coordinate system) between $\mu_{v_1}$ and $\mu_{v_2}$ is identical to the distance $\|F_{v_1} - F_{v_2}\|_p$ between $F_{v_1}$ and $F_{v_2}$.
c) Finally, select a point $F_{v_3}$, non-collinear with the set $\{F_{v_1}, F_{v_2}\}$; its projection, $\mu_{v_3}$, serves to distinguish the positive direction of the y-axis in the new coordinate system.
Heron's formula (e.g., Coxeter (1969)), relating the area of a triangle to the lengths of its sides, is used to calculate the projection $\mu_i$ of each data point $F_i$ in the 2-dimensional coordinate system described by (a), (b), (c) above (i = 1,...,n). Our experimental results demonstrate that the proposed selection of projected vectors $\mu_i$, $i = 1,\ldots,n$, provides a good visualization for various standard data sets. This is demonstrated in Figure 1, where the procedure is applied to the Iris data set. The Iris data set is composed of 150 vectors, each having 4 components. It is known that there are 3 clusters, each having 50 points; these consist of one clear cluster and two clusters that are hard to distinguish from one another (see below).
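A sketch of the geometric projection (a)-(c): it uses the law of cosines, which is equivalent to the Heron's-formula computation described above, and it resolves the sign of each y-coordinate by matching the distance to the third vantage point, one concrete way of realizing step (c). The vantage indices in the usage example are arbitrary, and the random data merely shares the shape of the Iris set.

```python
import numpy as np

def geometric_projection(F, v1, v2, v3):
    """Project p-dimensional points to the plane using distances to two vantage points.

    mu[v1] sits at the origin, mu[v2] on the positive x-axis at the original distance;
    the sign of each y-coordinate is chosen so that the distance to the projection of the
    third (non-collinear) point F[v3] is best preserved."""
    n = F.shape[0]
    base = np.linalg.norm(F[v1] - F[v2])
    mu = np.zeros((n, 2))
    # First pass: x from the law of cosines, |y| from the residual (Heron-type computation).
    for i in range(n):
        d1 = np.linalg.norm(F[i] - F[v1])
        d2 = np.linalg.norm(F[i] - F[v2])
        x = (d1 ** 2 + base ** 2 - d2 ** 2) / (2 * base)
        y = np.sqrt(max(d1 ** 2 - x ** 2, 0.0))
        mu[i] = (x, y)
    # Second pass: flip y where the opposite sign matches the distance to F[v3] better.
    ref = mu[v3].copy()
    for i in range(n):
        d3 = np.linalg.norm(F[i] - F[v3])
        keep = abs(np.linalg.norm(mu[i] - ref) - d3)
        flip = abs(np.linalg.norm(mu[i] * (1, -1) - ref) - d3)
        if flip < keep:
            mu[i, 1] = -mu[i, 1]
    return mu

# Example with random data of the same shape as the Iris set and arbitrary vantage indices.
rng = np.random.default_rng(4)
F = rng.normal(size=(150, 4))
mu = geometric_projection(F, v1=0, v2=1, v3=2)
```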
[Figure 1: Irises projected using distances to vectors 39 and 51.]
Figure 1. The optimal initializing vectors computed by the geometric method, outlined above, provide a good visualization of the Iris data set.
We note that results extending those given above, using the Cayley-Menger determinant
(e.g., Sommerville (1958)), can be used to construct optimal initializing vectors in spaces of 3 or more dimensions. However, optimal selection in these cases can be computationally expensive. Therefore, we propose to employ the following stepwise forward selection algorithm to reduce the cost of computation.
Algorithm: Stepwise Forward Selection
As above, we use the notation $v_s$ (s = 1,...,n) to denote indices selected from the full index set $\{1,\ldots,n\}$. At each step s = 1,...,n, the stepwise forward selection algorithm selects one new vantage vector index $v_s$ that is added to the set of previously chosen vantage vector indices $v_1,\ldots,v_{s-1}$. Using the notation introduced in equation (3.3), the vantage vector index $v_s$ is chosen to satisfy,
$$v_s = \arg\min_{v \in \{1,\ldots,n\}\setminus\{v_1,\ldots,v_{s-1}\}} S\big(F \mid v_1,\ldots,v_{s-1}, v\big) \qquad (4.1)$$
where 'arg min' denotes the index in the set $A = \{1,\ldots,n\}\setminus\{v_1,\ldots,v_{s-1}\}$ for which the minimum value of $S(F \mid v_1,\ldots,v_{s-1}, v)$ is attained; A corresponds to the set of indices between 1 and n which do not correspond to one of the indices $v_1,\ldots,v_{s-1}$ already selected. At stage s, having chosen the vantage object $v_s$, we prune the vantage vector indices by comparing the distortion measures $S(F \mid v_1,\ldots,v_{i-1}, v_{i+1},\ldots,v_s)$ (i = 1,...,s) with the measure $S(F \mid v_1,\ldots,v_{s-1}, v_s)$. If any of the former measures are smaller than the latter measure, we remove the vantage object $v_i$ (corresponding to the smallest of the former distortion measures), reorder those indices which remain, and move to the next step; if none of the former measures are smaller, we increase s by 1 and move to the next step. The algorithm stops when s = k.
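A sketch of the stepwise forward selection algorithm, including the pruning step. The distortion function below follows one reading of equation (3.3), comparing original pairwise distances with distances between the vectors of distances-to-vantage-points; the iteration cap is an added safeguard not mentioned in the text.

```python
import numpy as np
from itertools import combinations

def vantage_distortion(F, vantage):
    """Sammon-type distortion (3.3) for a candidate set of vantage indices."""
    n = F.shape[0]
    # Coordinates of each point in "vantage space": its distances to the chosen vantage vectors.
    coords = np.stack([np.linalg.norm(F - F[v], axis=1) for v in vantage], axis=1)
    s = 0.0
    for i, j in combinations(range(n), 2):
        d_obs = np.linalg.norm(F[i] - F[j])
        d_proj = np.linalg.norm(coords[i] - coords[j])
        s += (d_obs - d_proj) ** 2 / d_obs
    return s

def stepwise_forward_selection(F, k, max_steps=50):
    """Greedily add the vantage index minimizing (4.1); prune an earlier index whenever
    dropping it yields a smaller distortion than keeping the full current set."""
    n = F.shape[0]
    vantage = []
    for _ in range(max_steps):                 # guard against add/prune oscillation
        if len(vantage) >= k:
            break
        candidates = [v for v in range(n) if v not in vantage]
        v_new = min(candidates, key=lambda v: vantage_distortion(F, vantage + [v]))
        vantage.append(v_new)
        full = vantage_distortion(F, vantage)
        # Pruning: distortions obtained by leaving one earlier index out.
        leave_out = [(vantage_distortion(F, vantage[:i] + vantage[i + 1:]), i)
                     for i in range(len(vantage) - 1)]
        if leave_out and min(leave_out)[0] < full:
            vantage.pop(min(leave_out)[1])
    return vantage

rng = np.random.default_rng(5)
F = rng.normal(size=(60, 4))
vantage = stepwise_forward_selection(F, k=2)
```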
5. EXPERIMENTAL RESULTS
In this section we examine the performance of the proposed algorithms on various data sets. We begin with the classical Iris data set (e.g., Kohonen (2001)). The Iris data is composed of 150 vectors, each having 4 components. It is known that there are 3 clusters, each having 50 points; these consist of one clear cluster, denoted by A below, and two clusters, B and C, that are hard to distinguish from one another. We first compare 2-dimensional projections obtained using the geometric method with input vantage vectors produced via stepwise forward selection. Figure 2 (top) shows a two-dimensional projection of the Iris data obtained using the geometric selection algorithm for data visualization presented in Section 4; the distortion measure for this estimate (computed using Sammon's energy function) is 522. Figure 2 (bottom) shows a two-dimensional projection of the Iris data obtained using PCA projection; the distortion measure (computed using Sammon's energy function) for this estimate is 540. As can be seen, the geometric method and PCA projection both do well in this case. However, the geometric method, pictured on the top of Figure 2, gives a more accurate visualization of the original data set, since pairwise distances are better preserved in this case.
[Figure 2: (top) Irises projected using distances to vectors 39 and 51, Sammon's distortion = 522; (bottom) PCA projection of Irises, Sammon's distortion = 540.]
Figure 2. Two-dimensional visualization of the Iris data obtained by: (top) the geometric method presented in Section 4, and (bottom) PCA projection.
We now turn to evaluating two-dimensional projections for the data set referred to below as composite movie. Composite movie is composed of 10 shots (each having 10 frames) taken from 4 different movies; these consist of: a) 4 shots taken from the movie Mr. Bean's Christmas: frames 1 to 40; b) 3 shots taken from the movie House Tour: frames 41 to 70; c) 2 shots taken from a movie we created (referred to below as Mov1): frames 71 to 90; d) 1 shot from a movie in which Kylie Minogue is interviewed: frames 91 to 100.
The frames can be viewed at the site http://www.cis.temple.edu/~latecki/ImSim . Using image processing techniques described in Latecki and Wildt (2002), we assign a vector with 72 features to each of the 100 frames. We obtain a data set consisting of one hundred 72-component feature vectors. Composite movie has two hierarchical grouping levels; it can be grouped using shots and, separately, using movies. We expect to distinguish both between the shots and, on a higher level, between the movies. The best data visualization algorithm (cf. Figure 3, left) for this data set was obtained using the pairwise difference factor analysis mixture model outlined in section 2; we used initializing vantage vectors computed using stepwise forward selection. As can be seen in Figure 3 (left), below, there are 4 clear clusters that belong together in the upper left hand corner of the figure. They represent the 4 shots grouped to form excerpts from Mr. Bean's Christmas. In the lower right corner, we see two clear clusters. These are the two shots from the movie referred to as Mov1. The 3 shots from the movie House Tour are represented by the 3 rightmost clusters in the middle of the figure. Figure 3 (right), below, employs Sammon's data visualization algorithm, using gradient descent (see section 3) with the same vantage vectors. Sammon's data visualization gave a significantly worse picture of the data. This is demonstrated by the fact that the movies are no longer grouped correctly. For example, the four clusters from Mr. Bean's Christmas are mixed with clusters from the other movies in the lower right hand quadrant.
Figure 3: 2-dimensional projections of the composite movie data obtained by (left) the pairwise difference factor analysis mixture algorithm and (right) Sammon's algorithm computed using gradient descent.
6. CONCLUSIONS AND FUTURE RESEARCH
We have introduced stepwise forward selection algorithms and demonstrated their value in providing initializing values for factor mixture models and multidimensional scaling algorithms. It has been shown that pairwise difference factor mixture models provide good data visualization for a wide variety of data when vantage vectors, constructed using stepwise forward selection, are used to generate appropriate initializing values.
Our examples illustrate that factor mixture models frequently provide better data visualization than multidimensional scaling algorithms designed for the same purpose. Their superiority arises as a result of their flexibility in modeling data distortion. We have shown how to assess the fitness of factor mixture models and used these results to assess fit in the examples presented above. We would like to extend our current work to include mixture factor models which incorporate intra-component correlations.
REFERENCES
Bernardo, J. (1995) Bayesian Theory, Wiley Series in Probability and Statistics.
Bishop, C.M., and Tipping, M.E. (1998) A Hierarchical Latent Variable Model for Data Visualization, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3), 281-293.
Coxeter, H.S.M. (1969) Introduction to Geometry, Wiley, New York, 12.
Cox, T.F., and Cox, M.A. (2001) Multidimensional Scaling, Chapman and Hall.
Faloutsos, C., and Lin, K.-I. (1995) FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets, Proc. ACM SIGMOD International Conference on Management of Data, 163-174.
Fraley, C., and Raftery, A.E. (1999) How Many Clusters? Which Clustering Method? Answers via Model-Based Cluster Analysis, Computer Journal, 41, 297-306.
Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995) Bayesian Data Analysis, Chapman and Hall.
Jolliffe, I.T. (1986) Principal Component Analysis, Springer-Verlag.
Kohonen, T. (2001) Self-Organizing Maps, Springer-Verlag, New York.
Koller, D. (1999) Probabilistic Relational Models, invited contribution to Inductive Logic Programming, 9th International Workshop (ILP-99), Saso Dzeroski and Peter Flach, Eds., Springer Verlag, 3-13.
Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977) Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B, 39, 1-38.
Latecki, L.J., and Wildt, D., (2002) Automatic Recognition of Unpredictable Events in
Videos, Proceedings of the International Conference on Pattern Recognition (ICPR), 16.
Lerner, B., Guterman, H., Aladjem, M., Dinstein, I., and Romem, Y. (1998) On Pattern Classification with Sammon's Nonlinear Mapping - An Experimental Study, Pattern Recognition, 31, 371-381.
McLachlan, G., and Peel, D. (2000) Finite Mixture Models, Wiley Series in Probability and Statistics.
Mao, J., and Jain, A.K. (1995) Artificial Neural Networks for Feature Extraction and Multivariate Data Projection, IEEE Transactions on Neural Networks, 6(2).
McFarlane, M., and Young, F.W. (1994) Graphical Sensitivity Analysis for Multidimensional Scaling, Journal of Computational and Graphical Statistics, 3(1), 23-33.
Sammon, J.W., Jr., (1969) A nonlinear mapping for data structure analysis, IEEE Trans.
Comput. 18, 401—409.
Sommerville, D.M.Y. (1958) An Introduction to the Geometry of N Dimensions, New York: Dover, 124-125.
APPENDIX
Calculations via the EM algorithm needed for the mixture factor difference model:
In this section we describe the Expectation-Maximization (EM) algorithm used to estimate the latent variables and parameters introduced in section 2 above. For purposes of clarity we repeat the formulation of our model:
$$F_i - F_j = A_{z_{i,j}}\left(\mu_i - \mu_j\right) + \varepsilon_{i,j}; \qquad 1 \le i < j \le n \qquad (A.1)$$
where:
$z_{i,j}$ identifies the particular mixture model component ($z_{i,j} = s$ means that the pair (i, j) belongs to mixture component s, with s = 1 or 2);
$A_s$ are parametric $p \times k$ matrices indexed by the component s;
$\mu_i$ are latent k-dimensional vectors indexed by the observation index i $(1 \le i \le n)$;
$\varepsilon_{i,j}$ is the pairwise noise distortion for the features $F_i, F_j$ $(1 \le i < j \le n)$.
It is assumed below that the errors $\varepsilon_{i,j}$ are normally distributed with common variance $\sigma_{z_{i,j}}^2 I$ ($I$ is the $p \times p$ identity matrix).
In the notation below, $\mu = (\mu_1,\ldots,\mu_n)$ refers to the matrix of mean vectors, and $\mu_{i,z}^{(current)}$ (respectively, $\mu_{i,z}^{(new)}$) denotes the current value (respectively, the new, updated value) of the latent variable $\mu_i$ for the z'th component (z = 1, 2) (i = 1,...,n). Analogous notation is used to characterize the projection matrix $A_z$. We also employ the notation $\bar{\mu}_{z}^{(current)}(-i)$ for the average of the columns of (the current) $\mu$ excluding $\mu_i$; similar notation applies to the updated parameters (z = 1, 2; i = 1,...,n). In this notation, we write
$$P\big(z \mid F_i, F_j\big) = \frac{\big[\sigma_z^{(current)}\big]^{-p}\exp\left\{-\dfrac{\left\|F_i - F_j - A_z^{(current)}\big(\mu_{i,z}^{(current)} - \mu_{j,z}^{(current)}\big)\right\|^2}{2\big[\sigma_z^{(current)}\big]^2}\right\}}{\sum_{s=1}^{2}\big[\sigma_s^{(current)}\big]^{-p}\exp\left\{-\dfrac{\left\|F_i - F_j - A_s^{(current)}\big(\mu_{i,s}^{(current)} - \mu_{j,s}^{(current)}\big)\right\|^2}{2\big[\sigma_s^{(current)}\big]^2}\right\}} \qquad (A.2)$$
for the posterior probability weight attached to the component z, conditional on the observed dissimilarity measure $F_i - F_j$.
We update the latent mean vectors $\mu_{i,z}^{(new)}$ (z = 1, 2; i = 1,...,n) via,
$$\mu_{i,z}^{(new)} = \frac{\sum_{j:\, j\ne i}\Big[A_z^{(new)\prime}A_z^{(new)}\Big]^{-1}A_z^{(new)\prime}\Big(F_i - F_j + A_z^{(new)}\,\bar{\mu}_{z}^{(new)}(-i)\Big)\, P\big(z \mid F_i, F_j\big)}{\sum_{j:\, j\ne i} P\big(z \mid F_i, F_j\big)} \qquad (A.3)$$
The projection matrix $A_z^{(new)}$ is updated using the formula,
$$A_z^{(new)} = \left[\sum_{1 \le i < j \le n}\big(F_i - F_j\big)\big(\mu_{i,z}^{(new)} - \mu_{j,z}^{(new)}\big)'\, P\big(z \mid F_i, F_j\big)\right]\left[\sum_{1 \le i < j \le n}\big(\mu_{i,z}^{(new)} - \mu_{j,z}^{(new)}\big)\big(\mu_{i,z}^{(new)} - \mu_{j,z}^{(new)}\big)'\, P\big(z \mid F_i, F_j\big)\right]^{-1} \qquad (A.4)$$
for z = 1, 2. We update the variances $[\sigma_z^{(new)}]^2$ via,
$$[\sigma_z^{(new)}]^2 = \frac{\sum_{1 \le i < j \le n}\left\|F_i - F_j - A_z^{(new)}\big(\mu_{i,z}^{(new)} - \mu_{j,z}^{(new)}\big)\right\|^2 P\big(z \mid F_i, F_j\big)}{\sum_{1 \le i < j \le n} P\big(z \mid F_i, F_j\big)} \qquad (A.5)$$
(z = 1, 2).
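A condensed sketch of one EM iteration implementing (A.2)-(A.5). Several details the equations leave open are treated as assumptions here: equal implicit mixing weights in the E-step, sums taken over all pairs involving i in the mean update, and the per-pair value $\mu_j$ used in place of the column average $\bar{\mu}_z(-i)$, so the mean update is only in the spirit of (A.3).

```python
import numpy as np

def em_step(F, mu, A, sigma):
    """One EM pass for the two-component pairwise difference factor model (A.1)-(A.5).

    F: (n, p) observed features; mu: list of two (n, k) latent-mean arrays;
    A: list of two (p, k) projection matrices; sigma: length-2 array of standard deviations."""
    n, p = F.shape
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]

    # E-step (A.2): posterior responsibility of each component for each pairwise difference.
    resp = {}
    for (i, j) in pairs:
        logw = np.empty(2)
        for z in range(2):
            r = F[i] - F[j] - A[z] @ (mu[z][i] - mu[z][j])
            logw[z] = -p * np.log(sigma[z]) - r @ r / (2.0 * sigma[z] ** 2)
        w = np.exp(logw - logw.max())
        resp[(i, j)] = w / w.sum()

    mu_new, A_new, sigma_new = [], [], np.zeros(2)
    for z in range(2):
        # Latent-mean update in the spirit of (A.3): weighted least squares for mu_i,
        # treating the other latent vectors as fixed at their current values.
        proj = np.linalg.solve(A[z].T @ A[z], A[z].T)   # (A'A)^{-1} A'
        mu_z = mu[z].copy()
        for i in range(n):
            num = np.zeros(mu_z.shape[1])
            den = 0.0
            for j in range(n):
                if j == i:
                    continue
                w = resp[(min(i, j), max(i, j))][z]
                num += w * (proj @ (F[i] - F[j]) + mu_z[j])
                den += w
            if den > 0:
                mu_z[i] = num / den
        mu_new.append(mu_z)

        # Projection-matrix update (A.4).
        num = sum(resp[pr][z] * np.outer(F[pr[0]] - F[pr[1]], mu_z[pr[0]] - mu_z[pr[1]])
                  for pr in pairs)
        den = sum(resp[pr][z] * np.outer(mu_z[pr[0]] - mu_z[pr[1]], mu_z[pr[0]] - mu_z[pr[1]])
                  for pr in pairs)
        A_new.append(num @ np.linalg.pinv(den))

        # Variance update (A.5): posterior-weighted average squared residual norm.
        rss = sum(resp[pr][z] * np.sum((F[pr[0]] - F[pr[1]]
                  - A_new[z] @ (mu_z[pr[0]] - mu_z[pr[1]])) ** 2) for pr in pairs)
        sigma_new[z] = np.sqrt(rss / sum(resp[pr][z] for pr in pairs))
    return mu_new, A_new, sigma_new

# Example usage with random initial values (hypothetical sizes).
rng = np.random.default_rng(6)
n, p, k = 15, 8, 2
F = rng.normal(size=(n, p))
mu = [rng.normal(size=(n, k)) for _ in range(2)]
A = [rng.normal(size=(p, k)) for _ in range(2)]
sigma = np.array([0.5, 2.0])
for _ in range(10):
    mu, A, sigma = em_step(F, mu, A, sigma)
```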