Two-stage Framework for Visualization of Clustered High Dimensional Data

Jaegul Choo∗    Shawn Bohn†    Haesun Park∗

∗ College of Computing, Georgia Institute of Technology, 266 Ferst Drive, Atlanta, GA 30332, USA
  e-mail: {joyfull, hpark}@cc.gatech.edu
† National Visualization and Analytics Center, Pacific Northwest National Laboratory, 902 Battelle Blvd, Richland, WA 99354, USA
  e-mail: shawn.bohn@pnl.gov
ABSTRACT
In this paper, we discuss dimension reduction methods for 2D visualization of high dimensional clustered data. We propose a two-stage framework for visualizing such data based on dimension reduction methods. In the first stage, we obtain the reduced dimensional data by applying a supervised dimension reduction method, such as linear discriminant analysis, which preserves the original cluster structure in terms of its criterion. The resulting optimal reduced dimension depends on the optimization criterion and is often larger than 2. In the second stage, the dimension is further reduced to 2 for visualization purposes by another dimension reduction method such as principal component analysis. The role of the second stage is to minimize the loss of information due to reducing the dimension all the way to 2. Using this framework, we propose several two-stage methods, and present their theoretical characteristics as well as experimental comparisons on both artificial and real-world text data sets.
Keywords: dimension reduction, linear discriminant analysis,
principal component analysis, orthogonal centroid method, 2D projection, clustered data, regularization, generalized singular value
decomposition
Index Terms: H.5.2 [INFORMATION INTERFACES AND PRESENTATION]: User Interfaces—Theory and methods
1 INTRODUCTION
Within the visual analytics community, various types of information content are represented using high dimensional signatures. To
make these signatures useful they often need to be transformed into
a lower dimension (i.e., 2D or 3D) for a variety of visual representations such as scatter plots. Many researchers in this community
have used a wide assortment of dimension reduction techniques,
e.g., self-organizing map (SOM) [12], principal component analysis (PCA) [11], multidimensional scaling (MDS) [2], etc. However,
it is not always clear why a certain technique has been chosen over
another, especially to the end user. Typically, the goal of dimension reduction can be viewed in terms of two aspects: efficiency and accuracy. Efficiency, as defined here, is the time needed to compute the reduction, but accuracy may not be as simple to quantify. Many would agree to quantify accuracy as a measure of how well the relationships in the high dimensional space are preserved in the reduced dimensional space. Note that most techniques either directly or indirectly work on this principle.
There are other properties that are important to those interpreting
the semantics of the reduced space. Specifically, we note that while local neighborhood preservation is important, its importance depends upon the analysis task. No single reduction technique will provide the complete
view as various properties of the space are obscured or lost. We
have mentioned that typically the primary objective is relationship
preservation. However, there are at least two others: outlier and
macro structure visualization. Outliers are conceptually easy (i.e.,
a variance beyond some threshold), but more difficult to quantify,
as we do not necessarily know which set of outliers are important
to accentuate to the user. Certain techniques (e.g., PCA) tend to show outliers readily, but in showcasing those outliers they also tend to compress the rest of the reduced space. Other techniques (e.g., SOM) maximize space usage well, but do so at the expense of masking or even hiding those outliers. Likewise, macro structures of the high dimensional space may be masked or heavily distorted during the reduction. Macro structures are the larger-order groupings (e.g., clusters) that exist in the original dimensional space. We recognize that such structures are important in dimension reduction research and to the visual analytics community. However, few existing studies focus on data representation specifically for visualization of clustered data [20, 13, 3].
We propose theoretical measures for these properties and efficient algorithms that will aid not only researchers but ultimately users/analysts in better understanding which balance of properties is important and for which analytic tasks.
2 MOTIVATION
The focus of this paper is the fundamental characteristics of dimension reduction techniques for visualizing high dimensional data in
the form of a 2D scatter plot when the data has cluster structure.
The role of dimension reduction here is to give a 2-dimensional
representation of data while preserving cluster structure as much as
possible. To this end, supervised dimension reduction methods that
incorporate cluster information such as linear discriminant analysis (LDA) [4] or orthogonal centroid method (OCM) [10] can be
naturally considered.
However, one of the issues is that with many dimension reduction methods designed to preserve the cluster structure in the data,
the theoretically optimal reduced dimension, which is the smallest
dimension that is acceptable with respect to the optimization criteria of the dimension reduction method, is usually larger than 2. For
example, in LDA, the minimum reduced dimension that preserves
the cluster structure quality measure defined as a trace maximization problem is one less than the number of clusters in the data in
general [8, 7].
In this case, one may simply choose the two dimensions that
contribute most to such a measure. However, with only two dimensions, such a measure may become significantly smaller than the
original quantity after dimension reduction. This results in loss of
information that hinders visualization in properly reflecting the true
cluster relationship of the data. A similar situation may occur when using PCA to visualize data that do not have a cluster structure. Even though PCA finds the principal axes that maximally capture the variance of the data, when the resulting 2-dimensional representation maintains only a small fraction of the total variance, the relationships of the data in 2 dimensions are likely to be highly inconsistent with those in the original dimension.
Such loss of information is inevitable in that the dimension has to
be reduced to 2. Our main motivation is to deal with such loss more
carefully by separating the loss-introducing stage from the original dimension reduction methods. Based on this idea, we propose
the two-stage framework of dimension reduction for visualization.
In this framework, a supervised dimension reduction method is applied in the first stage so that the original dimension is reduced to
the minimum dimension achievable while preserving the quality of
cluster measure as defined in a dimension reduction method. The
reduced dimension achieved in the first stage is often larger than
2. Thus in the second stage, we find another dimension reducing
transformation that minimizes the loss introduced in further reducing the dimension all the way to 2. This two-stage framework provides us with a means to flexibly apply different types of dimension reduction techniques in each stage and to systematically analyze their effects, which helps us understand the behavior of the overall dimension reduction methods for visualization of clustered data.
The issues then are the design of the most appropriate dimension
reduction methods, the modeling of optimization criteria, and the
corresponding solution methods.
In this paper, we present both theoretical and empirical answers
to these issues. Specifically, we propose several two-stage methods
utilizing linear dimension reduction methods such as LDA, orthogonal centroid method (OCM), and principal component analysis
(PCA), and we present their theoretical justifications by modeling
the optimization criteria for which each method provides the optimal solution. Also, we illustrate and compare the effectiveness of
the proposed methods by showing empirical visualizations on synthetic and real-world data sets. Although nonlinear dimension reduction methods such as MDS or other manifold learning methods
such as isometric feature mapping [16] and locally linear embedding [14] may also be utilized for the effective 2D visualization of
high dimensional data, our focus in this paper is on linear methods.
The linear methods are computationally more efficient in general,
and unlike most of the manifold learning methods, they also provide
dimension reducing transformations that can be applied to map and
visualize unseen data points in the same space where the existing
data are visualized.
Our approach of successively applying two dimension reduction methods should be distinguished from previous works [18, 19, 21] in that they usually aim at improving the computational efficiency, scalability, or applicability of a certain dimension reduction method, e.g., LDA.
The rest of this paper is organized as follows. In Section 3, LDA,
OCM, and PCA are described based on a unified framework of the
scatter matrices and their trace optimization problems. In Section 4,
we formulate two-stage dimension reduction methods, and in Section 5, several two-stage methods for visualization are proposed and
compared along with their criteria. Experimental comparisons are
given using artificial and real-world data sets in Section 6, and conclusion and future work are addressed in Section 7.
3 DIMENSION REDUCTION AS TRACE OPTIMIZATION PROBLEM
In this section, we introduce the notions of scatter matrices used in defining cluster quality and optimization criteria for dimension reduction.
Suppose a dimension reducing linear transformation G^T ∈ R^{l×m} maps an m-dimensional data vector x to a vector z in an l-dimensional space (m > l):
G^T : x ∈ R^{m×1} → z = G^T x ∈ R^{l×1}.   (1)
Suppose also that a data matrix A = [a_1  a_2  ···  a_n] ∈ R^{m×n} is given, where the columns a_j, j = 1, ..., n, of A represent n data items in an m-dimensional space, and they are partitioned into k clusters. Without loss of generality, for simplicity of notation, we further assume that A is partitioned as
A = [A_1  A_2  ···  A_k],   where A_i ∈ R^{m×n_i} and \sum_{i=1}^{k} n_i = n.
Let N_i denote the set of column indices that belong to cluster i, and n_i the size of N_i. The i-th cluster centroid c^{(i)} and the global centroid c are defined, respectively, as
c^{(i)} = \frac{1}{n_i} \sum_{j ∈ N_i} a_j   and   c = \frac{1}{n} \sum_{j=1}^{n} a_j.
The scatter matrix within the i-th cluster S_w^{(i)}, the within-cluster scatter matrix S_w, the between-cluster scatter matrix S_b, and the total (or mixture) scatter matrix S_t are defined [9, 15], respectively, as
S_w^{(i)} = \sum_{j ∈ N_i} (a_j − c^{(i)})(a_j − c^{(i)})^T,
S_w = \sum_{i=1}^{k} S_w^{(i)} = \sum_{i=1}^{k} \sum_{j ∈ N_i} (a_j − c^{(i)})(a_j − c^{(i)})^T,   (2)
S_b = \sum_{i=1}^{k} \sum_{j ∈ N_i} (c^{(i)} − c)(c^{(i)} − c)^T = \sum_{i=1}^{k} n_i (c^{(i)} − c)(c^{(i)} − c)^T
    = \frac{1}{n} \sum_{i=1}^{k−1} \sum_{j=i+1}^{k} n_i n_j (c^{(i)} − c^{(j)})(c^{(i)} − c^{(j)})^T,   and   (3)
S_t = \sum_{j=1}^{n} (a_j − c)(a_j − c)^T.   (4)
Note that the total scatter matrix S_t is related to S_w and S_b as [9]
S_t = S_w + S_b.   (5)
When G^T in Eq. (1) is applied to the matrix A, the scatter matrices S_w, S_b, and S_t in the original dimensional space are reduced to the l × l matrices
G^T S_w G,   G^T S_b G,   and   G^T S_t G,
respectively. By computing the trace of the scatter matrices as
trace(S_w) = \sum_{i=1}^{k} \sum_{j ∈ N_i} (a_j − c^{(i)})^T (a_j − c^{(i)}) = \sum_{i=1}^{k} \sum_{j ∈ N_i} \|a_j − c^{(i)}\|_2^2,   (6)
trace(S_b) = \sum_{i=1}^{k} \sum_{j ∈ N_i} (c^{(i)} − c)^T (c^{(i)} − c) = \sum_{i=1}^{k} n_i \|c^{(i)} − c\|_2^2   (7)
           = \frac{1}{n} \sum_{i=1}^{k−1} \sum_{j=i+1}^{k} n_i n_j \|c^{(i)} − c^{(j)}\|_2^2,   and   (8)
trace(S_t) = \sum_{j=1}^{n} (a_j − c)^T (a_j − c) = \sum_{j=1}^{n} \|a_j − c\|_2^2,   (9)
we obtain values that can be used to measure the cluster quality. Note that from Eqs. (7) and (8), trace(S_b) can be viewed both as a weighted sum of the squared distances between each centroid and the global centroid and as a weighted sum of the squared pairwise distances between cluster centroids.
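For concreteness, the scatter matrices and traces above can be computed directly from a data matrix. The following NumPy sketch is only an illustration (the column-wise layout of A and the label array are assumptions of this sketch, not notation from the paper):

```python
import numpy as np

def scatter_matrices(A, labels):
    """Within-, between-, and total scatter matrices of Eqs. (2)-(4).
    A is m x n with one data item per column; labels[j] is the cluster
    index of column j."""
    labels = np.asarray(labels)
    c = A.mean(axis=1, keepdims=True)                 # global centroid
    Sw = np.zeros((A.shape[0], A.shape[0]))
    Sb = np.zeros_like(Sw)
    for i in np.unique(labels):
        Ai = A[:, labels == i]
        ci = Ai.mean(axis=1, keepdims=True)           # cluster centroid c^(i)
        Sw += (Ai - ci) @ (Ai - ci).T                 # Eq. (2)
        Sb += Ai.shape[1] * (ci - c) @ (ci - c).T     # Eq. (3)
    St = (A - c) @ (A - c).T                          # Eq. (4); numerically St = Sw + Sb, Eq. (5)
    return Sw, Sb, St
```

The quantities in Eqs. (6)-(9) are then simply np.trace(Sw), np.trace(Sb), and np.trace(St).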
The cluster structure quality can be defined by analyzing how well the clusters can be discriminated from one another. High-quality clusters usually have small trace(S_w) and large trace(S_b), corresponding to small variance within each cluster and large distances between clusters. Subsequently, dimension reduction methods may be designed to maximize trace(G^T S_b G) and minimize trace(G^T S_w G) in the reduced dimensional space. This simultaneous optimization can be approximated by a single criterion as
J_{b/w}(G) = \max trace((G^T S_w G)^{-1} (G^T S_b G)),   (10)
which is the criterion of LDA. In addition, one may focus on maximizing the distances between clusters, which can be represented as
the criterion of OCM, i.e.,
J_b(G) = \max_{G^T G = I} trace(G^T S_b G).   (11)
On the other hand, regardless of cluster dependent terms, Sw and
Sb , the trace of the total scatter matrix St can be maximized as
J_t(G) = \max_{G^T G = I} trace(G^T S_t G),   (12)
which turns out to be the criterion of PCA. In Eqs. (11) and (12),
without the constraint, GT G = I, Jb (G) and Jt (G) can become arbitrarily large.
In what follows, LDA, OCM, and PCA are discussed based on
such maximization criteria, and their properties relevant to visualization are identified.
3.1 Linear Discriminant Analysis (LDA)
Conceptually, in LDA, we are looking for a dimension reducing
transformation that keeps the between-cluster relationship as remote as possible by maximizing trace(GT Sb G) while keeping the
within cluster relationship as compact as possible by minimizing
trace(GT Sw G). As shown in Eq. (10), the criterion of LDA can be
written as
Jb/w (G) = max trace((GT Sw G)−1 (GT Sb G)).
(13)
It can be shown that for any G ∈ R^{m×l} where m > l,
trace((G^T S_w G)^{-1} (G^T S_b G)) ≤ trace(S_w^{-1} S_b),   (14)
meaning that the cluster structure quality measured by trace(S_w^{-1} S_b) cannot be increased after dimension reduction [4]. By setting the
derivative of Eq. (13) with respect to G to zero, which gives the
first order optimality condition, it can be shown that the solution
of LDA, where we denote it as GLDA , has the columns which are
the leading generalized eigenvectors u of the generalized eigenvalue
problem,
Sb u = λ Sw u.
(15)
Since the rank of Sb is at most k − 1, LDA achieves the upper bound
of trace((GT Sw G)−1 (GT Sb G)) in Eq. (14) for any l such that l ≥
k − 1, i.e.,
trace((G_{LDA}^T S_w G_{LDA})^{-1} (G_{LDA}^T S_b G_{LDA})) = trace(S_w^{-1} S_b)   for l ≥ k − 1,   (16)
which indicates that trace(S_w^{-1} S_b) is preserved between the original space and the reduced dimensional space obtained by G_{LDA}.
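As a minimal sketch of how G_LDA can be computed when S_w is nonsingular (the singular, undersampled case is revisited in Section 6.1), a standard generalized symmetric eigensolver suffices; this function is illustrative and not the authors' implementation:

```python
import numpy as np
from scipy.linalg import eigh

def lda_transform(Sw, Sb, k):
    """G_LDA: the k-1 leading generalized eigenvectors of
    S_b u = lambda * S_w u (Eq. (15)); assumes S_w is nonsingular."""
    evals, evecs = eigh(Sb, Sw)            # generalized symmetric eigenproblem
    order = np.argsort(evals)[::-1]        # decreasing generalized eigenvalues
    return evecs[:, order[:k - 1]]         # m x (k-1) transformation G_LDA
```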
3.2 Orthogonal Centroid Method (OCM)
Orthogonal centroid method (OCM) [10] focuses only on maximizing trace(GT Sb G) under the constraint of GT G = I. The criterion
of OCM is shown as
J_b(G) = \max_{G^T G = I} trace(G^T S_b G).   (17)
It is known that for any G ∈ Rm×l where m > l such that GT G =
I,
trace(GT Sb G) ≤ trace(Sb ),
(18)
which means the cluster structure quality measured by trace(Sb )
cannot be increased after dimension reduction. The solution of Eq.
(17) can be obtained by setting the columns of G as the leading
eigenvectors of Sb . Since Sb has at most k − 1 nonzero eigenvalues,
the upper bound of trace(GT Sb G) in Eq. (18) can be achieved for
any l such that l ≥ k − 1, i.e.,
trace(GT Sb G) = trace(Sb ) for l ≥ k − 1.
(19)
Eq. (19) indicates trace(Sb ) is preserved between the original and
the reduced dimensional spaces.
An advantage of OCM is that it achieves the upper bound of trace(G^T S_b G) more efficiently by using a QR decomposition, avoiding the eigendecomposition. The algorithm of OCM is as follows. First, the centroid matrix C is formed so that each column of C is a cluster centroid vector, i.e., C = [c^{(1)}  c^{(2)}  ···  c^{(k)}]. Then the reduced QR decomposition [5] of C is computed as C = Q_k R, where Q_k ∈ R^{m×k} with Q_k^T Q_k = I and R ∈ R^{k×k} is upper triangular. The solution of OCM, G_{OCM}, is found as
G_{OCM} = Q_k.
Note that the columns of G_{OCM} are orthonormal basis vectors for the subspace spanned by the centroids, and l = k in this case. Finally, OCM achieves
trace(G_{OCM}^T S_b G_{OCM}) = trace(S_b),   where l = k.
By using the equivalence between the two expressions for S_b in Eq. (3), one can show that each pairwise distance between cluster centroids is also preserved in the reduced dimensional space obtained by OCM.
Another important property of OCM is that, by projecting the data onto the subspace spanned by the centroids, the order of the similarities between any particular point and the centroids is preserved in terms of both the Euclidean norm and the cosine similarity measure [10, 7]. In other words, for any vector q ∈ R^{m×1} and cluster centroids c^{(i)} and c^{(j)}, we have
\|q − c^{(i)}\|_2 < \|q − c^{(j)}\|_2  ⇒  \|G_{OCM}^T q − G_{OCM}^T c^{(i)}\|_2 < \|G_{OCM}^T q − G_{OCM}^T c^{(j)}\|_2,   and
\frac{q^T c^{(i)}}{\|q\|_2 \|c^{(i)}\|_2} < \frac{q^T c^{(j)}}{\|q\|_2 \|c^{(j)}\|_2}  ⇒  \frac{(G_{OCM}^T q)^T (G_{OCM}^T c^{(i)})}{\|G_{OCM}^T q\|_2 \|G_{OCM}^T c^{(i)}\|_2} < \frac{(G_{OCM}^T q)^T (G_{OCM}^T c^{(j)})}{\|G_{OCM}^T q\|_2 \|G_{OCM}^T c^{(j)}\|_2}.
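A sketch of the OCM algorithm described above (illustrative only; the centroid matrix is built column by column as in the text):

```python
import numpy as np

def ocm_transform(A, labels):
    """G_OCM = Q_k from the reduced QR decomposition of the centroid
    matrix C = [c^(1) ... c^(k)], one centroid per column."""
    labels = np.asarray(labels)
    C = np.column_stack([A[:, labels == i].mean(axis=1)
                         for i in np.unique(labels)])
    Q, _ = np.linalg.qr(C)      # reduced QR: Q is m x k with orthonormal columns
    return Q
```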
3.3 Principal Component Analysis (PCA)
PCA is a well-known dimension reduction method that captures the
maximal variance in the data. The criterion of PCA can be written
as
J_t(G) = \max_{G^T G = I} trace(G^T S_t G).   (20)
For any G ∈ Rm×l where m > l such that GT G = I, we have
trace(GT St G) ≤ trace(St ),
(21)
which means trace(St ) cannot be increased after dimension reduction. The solution of Eq. (20), where we denote it as GPCA , can
be obtained by setting the columns of G as the leading eigenvectors of St . Since the rank of St is at most min(m, n), PCA achieves
the upper bound of trace(GT St G) in Eq. (21) for any l such that
l ≥ min(m, n), i.e.,
trace(GTPCA St GPCA ) = trace(St ) for l ≥ min(m, n).
Table 1: Comparison of dimension reduction methods (G^T : x ∈ R^{m×1} → y ∈ R^{l×1}). It is assumed that S_b and S_t are full rank.
• LDA. Optimization criterion: J_{b/w}(G) = \max trace((G^T S_w G)^{-1} (G^T S_b G)); algorithm: generalized eigendecomposition; smallest dimension achieving the criterion upper bound: k − 1.
• OCM. Optimization criterion: J_b(G) = \max_{G^T G = I} trace(G^T S_b G); algorithm: QR decomposition; smallest dimension achieving the criterion upper bound: k.
• PCA. Optimization criterion: J_t(G) = \max_{G^T G = I} trace(G^T S_t G); algorithm: symmetric eigendecomposition; smallest dimension achieving the criterion upper bound: min(m, n).

In many applications of PCA, however, l is usually chosen as a fixed value less than the rank of S_t for the purpose of dimension reduction or noise reduction. The noisy subspace corresponds to the eigenvectors of S_t with the smallest eigenvalues, and it is removed by PCA for a better representation of the data.
Although S_t is related to S_b and S_w as in Eq. (5), S_t by itself does not contain any information on cluster labels. That is, unlike LDA and OCM, PCA ignores the cluster structure represented by S_b and/or S_w, which is why PCA is considered an unsupervised dimension reduction method.
Usually, PCA assumes that the global centroid is zero, which is achieved by subtracting the empirical mean of the data from each data vector. The centered data can be represented as A − ce^T, where e is an n-dimensional vector whose components are all 1's.
PCA has the unique property that, given a fixed l, it produces the best reduced dimensional representation in the sense of minimizing the difference between the centered matrix A − ce^T and its projection GG^T(A − ce^T) onto the reduced dimensional space, where G has orthonormal columns, i.e.,
G_{PCA} = \arg\min_{G, G^T G = I_l} \|GG^T(A − ce^T) − (A − ce^T)\|,
where the matrix norm \|·\| is either the Frobenius norm or the Euclidean norm.
The three discussed methods are summarized in Table 1.
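For completeness, a small illustrative sketch of PCA as used here (centering by the global centroid and keeping the l leading eigenvectors of S_t; an SVD of the centered matrix would give the same subspace):

```python
import numpy as np

def pca_transform(A, l):
    """G_PCA: the l leading eigenvectors of S_t computed from the
    centered data A - c e^T, where c is the global centroid."""
    Ac = A - A.mean(axis=1, keepdims=True)       # centered data, A - c e^T
    St = Ac @ Ac.T                               # total scatter matrix
    evals, evecs = np.linalg.eigh(St)
    return evecs[:, np.argsort(evals)[::-1][:l]]
```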
4 FORMULATION OF TWO-STAGE FRAMEWORK FOR VISUALIZATION
Suppose we want to find a dimension reducing linear transformation V^T ∈ R^{2×m} that maps an m-dimensional data vector x to a vector z in a 2-dimensional space (m ≫ 2):
V^T : x ∈ R^{m×1} → z = V^T x ∈ R^{2×1}.   (22)
Further assume that it is composed of two stages of dimension reduction as follows. In the first stage, a dimension reducing linear transformation G^T ∈ R^{l×m} maps an m-dimensional data vector x to a vector y in the l-dimensional space (l ≪ m):
G^T : x ∈ R^{m×1} → y = G^T x ∈ R^{l×1},   (23)
where l is fixed as the minimum optimal dimension under the first-stage criterion. When l ≤ 2, we have no further dimension reduction to do after the first stage. However, the optimal l in many methods and for many data sets is larger than 2, and so we assume that l > 2.
In the second stage, another dimension reducing linear transformation H^T ∈ R^{2×l} maps the l-dimensional data vector y to a vector z in the 2-dimensional space (l > 2):
H^T : y ∈ R^{l×1} → z = H^T y ∈ R^{2×1}.   (24)
Such consecutive dimension reductions performed by G^T followed by H^T can be combined, resulting in a single dimension reducing transformation V^T as
V^T = H^T G^T.   (25)
In the next section, the discussion focuses on various ways of choosing the first-stage dimension reducing transformation G and the second-stage dimension reducing transformation H, with the purpose of constructing the combined dimension reducing transformation V^T = H^T G^T for 2-dimensional visualization according to various optimization criteria.
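The composition in Eq. (25) is a single matrix product. As a minimal sketch (assuming G and H come from any of the methods discussed in the next section):

```python
import numpy as np

def two_stage_projection(A, G, H):
    """Compose a first-stage reduction G (m x l) and a second-stage
    reduction H (l x 2) into V = G H (so V^T = H^T G^T, Eq. (25)),
    and return the 2D coordinates Z = V^T A of the data matrix A."""
    V = G @ H
    Z = V.T @ A            # 2 x n coordinates for the scatter plot
    return V, Z
```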
5 TWO-STAGE METHODS FOR 2D VISUALIZATION
All the proposed two-stage methods start from one of the supervised dimension reduction methods, such as LDA or OCM, that are designed for clustered data. In the first stage (by G^T ∈ R^{l×m} in Eq. (23)), the dimension is reduced by LDA or OCM to the smallest dimension that satisfies Eq. (16) or (19), respectively. Therefore, in the first stage, the cluster structure quality measured either by trace(S_w^{-1} S_b) or by trace(S_b) is preserved. Then we perform the second-stage dimension reduction (by H^T ∈ R^{2×l} in Eq. (24)) that minimizes the loss of information, either by applying the same criterion used in the first stage or by using J_t in Eq. (20), i.e., that of PCA. As seen in Section 3.3, Eq. (20) gives the best approximation of the first-stage results in the sense of minimizing the difference in terms of the Frobenius/Euclidean norm.
In what follows, we describe each of the two-stage methods in detail, and derive their equivalent single-stage methods (by V^T ∈ R^{2×m} in Eq. (22)) in case they exist.

5.1 Rank-2 LDA
In this method, LDA is applied in the first stage, and trace(S_w^{-1} S_b) is preserved in the l-dimensional space where l = k − 1. In the second stage, the same criterion J_{b/w}(H) is used to reduce the l-dimensional first-stage results to 2-dimensional data.
The criterion of the second-stage dimension reducing matrix H can be formulated as
H_{b/w} = \arg\max_{H ∈ R^{l×2}} trace((H^T (G_{LDA}^T S_w G_{LDA}) H)^{-1} (H^T (G_{LDA}^T S_b G_{LDA}) H)).   (26)
Assuming the columns of G_{LDA}, which are the generalized eigenvectors of Eq. (15), are in decreasing order of their corresponding generalized eigenvalues, i.e., G_{LDA} = [u_1  u_2  ···  u_{k−1}] where λ_1 ≥ λ_2 ≥ ··· ≥ λ_{k−1}, the solution of Eq. (26) is
H_{b/w} = [e_1  e_2],
where e_1 and e_2 are the first and the second standard unit vectors, i.e., e_1 = [1  0  ···  0]^T ∈ R^{l×1} and e_2 = [0  1  0  ···  0]^T ∈ R^{l×1}. This solution is equivalent to choosing the two dimensions with the most leading generalized eigenvalues from the first-stage result, and the resulting two-stage method can be represented as a single-stage dimension reduction method by V ∈ R^{m×2} which directly maximizes J_{b/w}, i.e.,
V_{b/w} = \arg\max_{V ∈ R^{m×2}} J_{b/w}(V) = \arg\max_{V ∈ R^{m×2}} trace((V^T S_w V)^{-1} (V^T S_b V)).   (27)
The solution of Eq. (27) becomes
V_{b/w} = G_{LDA} H_{b/w} = [u_1  u_2],
where u_1 and u_2 are the leading generalized eigenvectors of Eq. (15). This solution is also known as reduced-rank linear discriminant analysis [6].
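In code, the second stage of rank-2 LDA amounts to selecting the first two columns of the first-stage result; a hedged sketch, reusing a G_LDA whose columns are sorted by decreasing generalized eigenvalue as assumed above:

```python
import numpy as np

def rank2_lda_from_first_stage(G_lda, A):
    """Rank-2 LDA: with H_{b/w} = [e_1 e_2], the combined transformation
    V_{b/w} = G_LDA H_{b/w} of Eq. (27) is just the first two columns of
    G_LDA (the leading generalized eigenvectors u_1, u_2 of Eq. (15))."""
    V_bw = G_lda[:, :2]
    return V_bw, V_bw.T @ A     # transformation and 2D coordinates
```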
5.2 LDA followed by PCA
In this method, LDA is applied in the first stage, and trace(S_w^{-1} S_b) is preserved in the l-dimensional space where l = k − 1. In the second stage, PCA is applied in order to obtain the best approximation of the l-dimensional first-stage results in terms of the Frobenius/Euclidean norm.
The second-stage dimension reducing matrix H is obtained by solving
H_t = \arg\max_{H ∈ R^{l×2}, H^T H = I} trace(H^T (G_{LDA}^T S_t G_{LDA}) H),   (28)
where the solution is the two leading eigenvectors of the total scatter matrix of the first-stage result, G_{LDA}^T S_t G_{LDA}.
From Eq. (5), we have
G_{LDA}^T S_t G_{LDA} = G_{LDA}^T (S_b + S_w) G_{LDA}.   (29)
Since LDA conceptually maximizes trace(G^T S_b G) and minimizes trace(G^T S_w G), the result is expected to satisfy
trace(G_{LDA}^T S_b G_{LDA}) ≫ trace(G_{LDA}^T S_w G_{LDA}),
which means that G_{LDA}^T S_t G_{LDA} is dominated by G_{LDA}^T S_b G_{LDA}, i.e.,
G_{LDA}^T (S_b + S_w) G_{LDA} ≃ G_{LDA}^T S_b G_{LDA}.
In this case, the principal axes that PCA gives in the second stage better reflect those of the between-cluster matrix of the first-stage result, G_{LDA}^T S_b G_{LDA}, and they may in turn discriminate the clusters clearly in the 2-dimensional space. In this sense, LDA followed by PCA achieves a clear cluster structure as well as a good approximation of the first-stage result.
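A sketch of LDA followed by PCA, under the same assumptions as the earlier sketches (columns of A are data items, G_lda is the m x (k-1) first-stage transformation); this illustrates Eq. (28) and is not the authors' code:

```python
import numpy as np

def lda_plus_pca(A, G_lda):
    """Second-stage PCA (Eq. (28)) applied to the first-stage LDA output.
    Returns the combined m x 2 transformation V = G_lda @ H_t, where H_t
    holds the two leading eigenvectors of G_lda^T S_t G_lda."""
    Y = G_lda.T @ A                          # first-stage result, (k-1) x n
    Yc = Y - Y.mean(axis=1, keepdims=True)   # center by the global centroid
    St_red = Yc @ Yc.T                       # total scatter of the first-stage result
    evals, evecs = np.linalg.eigh(St_red)
    H_t = evecs[:, np.argsort(evals)[::-1][:2]]
    return G_lda @ H_t
```

OCM followed by PCA (Section 5.3) can be sketched in the same way by passing G_OCM in place of G_lda.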
5.3 OCM followed by PCA
In this method, OCM is applied in the first stage, and trace(S_b) is preserved in the l-dimensional space where l = k. In the second stage, PCA is applied in order to obtain the best approximation of the l-dimensional first-stage results in terms of the Frobenius/Euclidean norm.
As in Section 5.2, the second-stage dimension reducing matrix H is obtained by solving
H_t = \arg\max_{H ∈ R^{l×2}, H^T H = I} trace(H^T (G_{OCM}^T S_t G_{OCM}) H),   (30)
where the solution is the two leading eigenvectors of the total scatter matrix of the first-stage result, G_{OCM}^T S_t G_{OCM}.
From Eq. (5), we have
G_{OCM}^T S_t G_{OCM} = G_{OCM}^T (S_b + S_w) G_{OCM}.   (31)
Unlike LDA, OCM does not minimize trace(G^T S_w G), as shown in Eq. (17). Therefore the following may not be the case:
trace(G_{OCM}^T S_b G_{OCM}) ≫ trace(G_{OCM}^T S_w G_{OCM}),
which means that G_{OCM}^T S_b G_{OCM} does not necessarily dominate G_{OCM}^T S_t G_{OCM}. Then the two principal axes of G_{OCM}^T S_t G_{OCM} obtained by PCA in the second stage tend to fail to reflect those of G_{OCM}^T S_b G_{OCM}, which may rather scatter the data points within each cluster, eventually preventing the visualization results from showing a clear cluster structure.
5.4 Rank-2 PCA on Sb
In this method, OCM is applied in the first stage, and trace(S_b) is preserved in the l-dimensional space where l = k. In the second stage, the same criterion J_b(H) is used to reduce the l-dimensional first-stage results to 2-dimensional data.
The second-stage dimension reducing matrix H is obtained by solving
H_b = \arg\max_{H ∈ R^{l×2}, H^T H = I} trace(H^T (G_{OCM}^T S_b G_{OCM}) H),   (32)
where the solution is the two leading eigenvectors of the between-cluster scatter matrix of the first-stage result, G_{OCM}^T S_b G_{OCM}. The columns of G_{OCM} form a basis of the subspace spanned by the centroids, and this subspace includes the range space of S_b. Accordingly, one can easily show that each eigenvector u_i^Y ∈ R^{l×1} of G_{OCM}^T S_b G_{OCM} is related to the corresponding eigenvector u_i ∈ R^{m×1} of S_b as
u_i^Y = G_{OCM}^T u_i,
with their corresponding eigenvalues matched as well, i.e., λ_i^Y = λ_i. Hence, the solution of Eq. (32) can be written as
H_b = [u_1^Y  u_2^Y] = G_{OCM}^T [u_1  u_2].   (33)
Using Eq. (33) and the relationship shown in Eq. (25), the single-stage dimension reducing transformation V_b can be built as
V_b^T = H_b^T G_{OCM}^T = [u_1  u_2]^T G_{OCM} G_{OCM}^T = [u_1  u_2]^T,   (34)
and therefore
V_b = \arg\max_{V ∈ R^{m×2}} J_b(V) = \arg\max_{V ∈ R^{m×2}} trace(V^T S_b V).   (35)
Eq. (34) holds since the eigenvectors u_1 and u_2 of S_b are in the range space of G_{OCM}. The criterion of Eq. (35) has been used in one of the successful visual analytic systems, IN-SPIRE, for the 2D representation of document data [17].
The discussed two-stage methods are summarized in Table 2.

Table 2: Summary of the optimization criteria of the two-stage dimension reduction methods (stage 1: G^T : x ∈ R^{m×1} → y ∈ R^{l×1}; stage 2: H^T : y ∈ R^{l×1} → z ∈ R^{2×1}).
• Rank-2 LDA. Stage 1 (preservation): trace((G^T S_w G)^{-1} (G^T S_b G)) = trace(S_w^{-1} S_b); Stage 2 (maximization): trace((H^T (G^T S_w G) H)^{-1} (H^T (G^T S_b G) H)).
• LDA + PCA. Stage 1 (preservation): trace((G^T S_w G)^{-1} (G^T S_b G)) = trace(S_w^{-1} S_b); Stage 2 (maximization): \max_{H^T H = I} trace(H^T (G^T S_t G) H).
• OCM + PCA. Stage 1 (preservation): trace(G^T S_b G) = trace(S_b); Stage 2 (maximization): \max_{H^T H = I} trace(H^T (G^T S_t G) H).
• Rank-2 PCA on S_b. Stage 1 (preservation): trace(G^T S_b G) = trace(S_b); Stage 2 (maximization): \max_{H^T H = I} trace(H^T (G^T S_b G) H).
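Because of the single-stage equivalence in Eq. (35), Rank-2 PCA on S_b can also be sketched without forming the OCM stage explicitly (illustrative only; in practice the k x k matrix G_OCM^T S_b G_OCM of Eq. (32) is cheaper to eigendecompose when k ≪ m):

```python
import numpy as np

def rank2_pca_on_sb(Sb):
    """V_b = [u_1 u_2]: the two leading eigenvectors of the (symmetric
    positive semidefinite) between-cluster scatter matrix S_b, Eq. (35)."""
    evals, evecs = np.linalg.eigh(Sb)
    return evecs[:, np.argsort(evals)[::-1][:2]]
```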
6 EXPERIMENTS
In this section, we present visualization results using the proposed
methods for several data sets, especially focusing on undersampled text data visualization where the data item is represented in
m-dimensional space and the number of the data items n is less
than m (m > n).
Table 3: Description of data sets.
• GAUSSIAN: original dimension m = 1100, number of data items n = 1000, number of clusters k = 10.
• MEDLINE: m = 22095, n = 500, k = 5.
• NEWSGROUPS: m = 16702, n = 770, k = 11.
• REUTERS: m = 3907, n = 800, k = 10.
6.1 Regularization on LDA for undersampled data
In undersampled cases, the LDA criterion shown in Eq. (13) cannot be applied directly because Sw is singular. In order to overcome this singularity problem, Howland et al. proposed a universal algorithmic framework of LDA using the generalized singular
value decomposition (LDA/GSVD) [8, 7]. Specifically, for the case
when m ≫ n ≫ k, which is usual for most undersampled problems, LDA/GSVD gives the solution for G such that GT Sw G = 0
while maintaining the maximum value of trace(G^T S_b G). This solution makes sense since the LDA criterion is formulated to minimize trace(G^T S_w G). However, it means that all of the data points belonging to a specific cluster are represented as a single point in the reduced dimensional space, which lessens the generalization ability for classification as well as the ability to visualize the individual data relationships within each cluster.
On the other hand, the fact that LDA makes G^T S_w G = 0 can be viewed as an advantage for visualization purposes, since LDA has the capability to fully minimize trace(G^T S_w G). By means of regularization on S_w, one can control trace(G^T S_w G), which determines the scatter of the data points within each cluster. In regularized LDA, which was originally proposed to avoid the singularity of S_w in the classification context, S_w is replaced by a nonsingular matrix S_w + γI,
where I is an identity matrix, and γ is a control parameter. In general, as γ is increased, the within-cluster distance, trace(GT Sw G),
also becomes larger with data points being more scattered around
their corresponding centroids. As γ is decreased, the within-cluster
distance becomes smaller, and the data points gather closer around
their centroids. Such manipulation of γ can be exploited in a visualization context because one can choose an appropriate value of
γ so that the second-stage method such as PCA, which maximizes
trace(GT St G) = trace(GT Sb G + GT Sw G), does not focus too much
on trace(GT Sw G). The results that follow are based on such choices
of γ .
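A sketch of the regularization just described: replace S_w by S_w + γI in the generalized eigenproblem of Eq. (15). This is illustrative only; for very high-dimensional text data one would work with factored forms or LDA/GSVD rather than explicit m x m matrices.

```python
import numpy as np
from scipy.linalg import eigh

def regularized_lda(Sw, Sb, gamma, l):
    """Regularized LDA: solve S_b u = lambda (S_w + gamma I) u and keep
    the l leading generalized eigenvectors. Larger gamma increases
    trace(G^T S_w G), spreading each cluster around its centroid;
    smaller gamma draws the points closer to their centroids."""
    m = Sw.shape[0]
    evals, evecs = eigh(Sb, Sw + gamma * np.eye(m))
    return evecs[:, np.argsort(evals)[::-1][:l]]
```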
6.2 Data Sets
The data sets tested are composed of one artificially-generated
Gaussian-mixture dataset (GAUSSIAN) and three real-world text
data sets (MEDLINE, NEWSGROUPS, and REUTERS) that are
clustered based on their topics. All the text documents are encoded as term-document matrices, where each dimension corresponds to a particular word and the value of a dimension is the frequency with which the corresponding word appears in the document. Each data set is set up to have an equal number of data items per cluster and a mean of zero, which is attained by subtracting the global mean (see Section 6.3).
The descriptions of data sets, which are also summarized in Table 3, are as follows.
The GAUSSIAN data set is a randomly generated Gaussian mixture with 10 clusters. Each cluster is made up of 100 data vectors,
which add up to 1000 in total, and the dimension is set to 1100,
which is slightly more than the number of the data items. In its visualization shown in Fig. (1), the clusters are labeled using letters
as
• ’a’, ’b’, . . . , and ’j’.
The MEDLINE data set is a document corpus related to medical science from the National Institutes of Health (available at http://www.cc.gatech.edu/~hpark/data.html). The original dimension is 22095, and the number of clusters is 5, where each cluster has 100 documents. The cluster labels that correspond to the document topics are shown as
• heart attack (’h’), colon cancer (’c’), diabetes (’d’), oral cancer
(’o’), and tooth decay (’t’),
where the letters in parentheses are used in the visualization shown
in Fig. (2).
The NEWSGROUPS data set [1] is a collection of newsgroup
documents, and originally composed of 20 topics. However, we
have chosen 11 topics for visualization, and each cluster is set to
have 70 documents. The original dimension is 16702, and the cluster labels are shown as
• comp.sys.ibm.pc.hardware (’p’), comp.sys.mac.hardware (’a’), misc.forsale (’f’), rec.sport.baseball (’b’), sci.crypt (’y’), sci.electronics (’e’), sci.med (’d’), soc.religion.christian (’c’), talk.politics.guns (’g’), talk.politics.misc (’p’), and talk.religion.misc (’r’),
where the letters in parentheses are used in the visualization shown
in Fig. (3).
The REUTERS data set [1] is the document collection that appeared in the Reuters newswire in 1987, and originally composed
of hundreds of topics. Among them, 10 topics related to economic
subjects are chosen for visualization, and each cluster has 80 documents. The original dimension is 3907, and the cluster labels are
shown as
• earn (’e’), acquisitions (’a’), money-fx (’m’), grain (’g’), crude (’r’),
trade (’t’), interest (’i’), ship (’s’), wheat (’w’), and corn (’c’),
where the letters in parentheses are used in the visualization shown
in Fig. (4).
6.3 Effects of Data Centering
Fig. 5 shows an example of applying OCM+PCA to the MEDLINE data set with and without data centering. Once the MEDLINE data set is encoded as a term-document matrix, every component is non-negative, which results in a global centroid that is significantly far from the origin. Performing PCA without data centering may then give a first principal axis that reflects the global centroid rather than one that discriminates the clusters. If we consider projecting the data onto each of the horizontal and vertical axes in Fig. 5, the former, which corresponds to the first principal axis, does not help in showing the cluster structure clearly, and only the vertical axis, which corresponds to the second principal axis from PCA, discriminates the clusters. We have found that such undesirable behavior is common without data centering, which is why we assume that the data are centered throughout this paper. Accordingly, all the results shown in Figs. 1-4 are obtained after data centering.
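The centering used throughout is a single subtraction of the global centroid before the scatter matrices (or the OCM/PCA stages) are computed; a one-line sketch under the same column-wise data layout assumed earlier:

```python
import numpy as np

def center_columns(A):
    """Return A - c e^T: subtract the global centroid from every column,
    so that the first principal axis reflects cluster structure rather
    than the offset of the centroid from the origin (cf. Fig. 5)."""
    return A - A.mean(axis=1, keepdims=True)
```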
6.4 Comparison of Visualization Results
The results of the four two-stage methods for the tested data sets are shown in Figs. 1-4. (In the electronic version of this paper, these figures can be arbitrarily magnified without losing resolution.)
In all cases, LDA-based methods show cluster structures more clearly than OCM-based methods. This demonstrates the effectiveness of LDA, which considers both within- and between-cluster measures, while OCM takes into account only the latter. Due to this difference, OCM generally produces a widely scattered data representation within each cluster. As a result, in the NEWSGROUPS data set, such a wide within-cluster variance significantly deteriorates the
cluster structure visualization even if OCM still attempts to maximize the between-cluster distance.
In the MEDLINE and the REUTERS data sets, all of the four
methods produce relatively similar results. However, we have
controlled the within-cluster variance in LDA-based methods using the regularization term γ I. In addition, the fact that rank-2
LDA and LDA+PCA produce almost identical results indicates that
GTLDA St GLDA is dominated by GTLDA Sb GLDA after LDA is applied in
the first stage as we expected.
Rank-2 LDA represents each cluster most compactly by minimizing the within-cluster radii both in the first and the second stage.
However, it may reduce the between-cluster distances as well because Jb/w maximizes the conceptual ratio of two scatter measures.
As can be seen in the two LDA-based methods applied to the NEWSGROUPS data set, while rank-2 LDA minimizes the within-cluster radii, it also places the centroids closer to each other as compared to those in LDA+PCA. Due to this effect, which one is preferable between rank-2 LDA and LDA+PCA depends on the data set to be visualized.
Overall, OCM+PCA and Rank-2 PCA on S_b show similar results. This suggests that G^T S_b G ≃ G^T S_t G, since the difference between the two methods lies only in whether PCA is applied to G^T S_b G or to G^T S_t G in the second stage. Since performing PCA on G^T S_b G is computationally more efficient than PCA on G^T S_t G, Rank-2 PCA on S_b can be
a good alternative to OCM+PCA in case efficient computation is
important.
Finally, these visualization results reveal interesting cluster relationships underlying the data. In Fig. (2), the clusters
for colon cancer (’c’) and oral cancer (’o’) are shown close to
each other. In Fig. (3), the clusters of soc.religion.christian (’c’)
and talk.religion.misc (’r’), those of comp.sys.ibm.pc.hardware
(’p’) and comp.sys.mac.hardware (’a’), and those of sci.crypt (’y’)
and sci.med (’d’) are closely located respectively in LDA-based
methods. In addition, the two clusters, misc.forsale (’f’) and
rec.sport.baseball (’b’), are shown to be the most distinctive, which
makes sense because those topics are quite irrelevant to the others.
In Fig. (4), the clusters of grain (’g’), wheat (’w’), and corn (’c’)
as well as those of money-fx (’m’) and interest (’i’) are visualized
very close.
7 CONCLUSION AND FUTURE WORK
According to our results, LDA-based methods are shown to be superior to OCM-based methods, since both within- and between-cluster relationships are taken into account in LDA. In particular, combined with PCA in the second stage, LDA+PCA achieves a clear discrimination between clusters as well as the best approximation of the results of LDA when the distance between data items is measured in terms of the Frobenius/Euclidean norm.
However, many classes, except for a few that are clearly unrelated, tend to overlap, especially when dealing with large numbers of data points and clusters. This is inherently due to the nature of the second-stage dimension reduction, in which only two axes are chosen so that the classes that contribute most to the second-stage criteria can be well discriminated. Such behavior can exaggerate the distances between particular clusters, and further work toward new criteria that better fit visualization is required.
In the MEDLINE and the REUTERS data sets, the visualization results seem to have a tail shape along specific directions. We often found this phenomenon to occur in many other data sets as well. It is still unclear what causes this and how it affects the visualization, e.g., the characteristics of information loss in the second stage. Finally, in order
to determine how much loss of information is introduced by each
method, more rigorous analysis based on various quantitative measures such as pairwise between-cluster distance and within-cluster
radii should be conducted.
ACKNOWLEDGEMENTS
The work of these authors was supported in part by the National
Science Foundation grants CCF-0732318 and CCF-0808863. Any
opinions, findings and conclusions or recommendations expressed
in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
REFERENCES
[1] A. Asuncion and D. Newman. UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences, 2007.
[2] T. F. Cox and M. A. A. Cox. Multidimensional Scaling. Chapman &
Hall, London, 1994.
[3] I. S. Dhillon, D. S. Modha, and W. S. Spangler. Class visualization of
high-dimensional data with applications. Computational Statistics &
Data Analysis, 41(1):59 – 90, 2002.
[4] K. Fukunaga. Introduction to Statistical Pattern Recognition, second
edition. Academic Press, Boston, 1990.
[5] G. H. Golub and C. F. van Loan. Matrix Computations, third edition.
Johns Hopkins University Press, Baltimore, 1996.
[6] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical
Learning: Data Mining, Inference, and Prediction. Springer, 2001.
[7] P. Howland, M. Jeon, and H. Park. Structure preserving dimension reduction for clustered text data based on the generalized singular value
decomposition. SIAM Journal on Matrix Analysis and Applications,
25(1):165–179, 2003.
[8] P. Howland and H. Park. Generalizing discriminant analysis using
the generalized singular value decomposition. Pattern Analysis and
Machine Intelligence, IEEE Transactions on, 26(8):995–1006, Aug.
2004.
[9] A. Jain and R. Dubes. Algorithms for clustering data. Prentice-Hall,
Inc. Upper Saddle River, NJ, USA, 1988.
[10] M. Jeon, H. Park, and J. B. Rosen. Dimensional reduction based on
centroids and least squares for efficient processing of text data. In
Proceedings of the First SIAM International Workshop on Text Mining.
Chicago, IL, 2001.
[11] I. Jolliffe. Principal component analysis. Springer, 2002.
[12] T. Kohonen. Self-organizing maps. Springer, 2001.
[13] Y. Koren and L. Carmel. Visualization of labeled data using linear
transformations. In Information Visualization, 2003. INFOVIS 2003.
IEEE Symposium on, pages 23–30, Oct. 2003.
[14] S. T. Roweis and L. K. Saul. Nonlinear Dimensionality Reduction by
Locally Linear Embedding. Science, 290(5500):2323–2326, 2000.
[15] D. L. Swets and J. J. Weng. Using discriminant eigenfeatures for
image retrieval. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 18(8):831–836, 1996.
[16] J. B. Tenenbaum, V. d. Silva, and J. C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science,
290(5500):2319–2323, 2000.
[17] J. A. Wise. The ecological approach to text visualization. Journal
of the American Society for Information Science, 50(13):1224–1233,
1999.
[18] J. Ye and Q. Li. A two-stage linear discriminant analysis via QR-decomposition. Pattern Analysis and Machine Intelligence, IEEE
Transactions on, 27(6):929–941, June 2005.
[19] H. Yu and J. Yang. A direct LDA algorithm for high-dimensional data
with application to face recognition. Pattern Recognition, 34:2067–
2070, 2001.
[20] X. Zhang, C. Myers, and S. Kung. Cross-weighted Fisher discriminant analysis for visualization of DNA microarray data. In Acoustics, Speech,
and Signal Processing, 2004. Proceedings. (ICASSP’04). IEEE International Conference on, volume 5, pages V–589–92 vol.5, May 2004.
[21] W. Zhao, R. Chellappa, and A. Krishnaswamy. Discriminant analysis
of principal components for face recognition. Automatic Face and
Gesture Recognition, IEEE International Conference on, 0:336, 1998.
Figure 1: Comparison of the two-stage methods in the GAUSSIAN data set. Panels: (a) Rank-2 LDA, (b) LDA+PCA, (c) OCM+PCA, (d) Rank-2 PCA on Sb.

Figure 2: Comparison of the two-stage methods in the MEDLINE data set. Panels: (a) Rank-2 LDA, (b) LDA+PCA, (c) OCM+PCA, (d) Rank-2 PCA on Sb.

Figure 3: Comparison of the two-stage methods in the NEWSGROUPS data set. Panels: (a) Rank-2 LDA, (b) LDA+PCA, (c) OCM+PCA, (d) Rank-2 PCA on Sb.

Figure 4: Comparison of the two-stage methods in the REUTERS data set. Panels: (a) Rank-2 LDA, (b) LDA+PCA, (c) OCM+PCA, (d) Rank-2 PCA on Sb.

Figure 5: Example of effects of data centering in the MEDLINE data set. Panels: (a) OCM+PCA with data centering, (b) OCM+PCA without data centering.