Implementation of textile plot

Natsuhiko Kumasaka (1) and Ritei Shibata (2)

1 Fundamental Science and Technology, Keio University, kumasaka@stat.math.keio.ac.jp
2 Department of Mathematics, Keio University, shibata@math.keio.ac.jp

Summary. The textile plot is a new data visualisation technique for exploring high dimensional data. It is a parallel coordinate plot whose axis locations and scales are chosen simultaneously so that all connecting lines, each of which represents an observation, are aligned as horizontally as possible. The textile plot can visualise not only numerical data but also ordered or unordered categorical data, or a mixture of these, together with various attributes. The aim of this article is to report an implementation of the textile plot. An algorithm for computing the optimal locations and scales has been developed, and the information necessary for the display is discussed in detail.

Key words: Textile plot, High dimensional data visualisation, Parallel coordinate plot, Constrained maximisation problem, DandDR.

1 Introduction

The parallel coordinate plot [Ins85] [Weg90] has been frequently used for exploring high dimensional data. The plot is a simple way of visualising very high dimensional data, but it is restricted to numerical data, since its primary aim is to represent a set of data points in Euclidean space on a two dimensional display. Several attempts have been made to visualise both numerical and categorical data, for example [Wil96], [The02] and [Mat03]. There remains, however, the problem that it becomes harder to understand what is going on behind the data as the number of intersections of the connecting lines increases. The textile plot [KS06] is a solution to this problem. The locations and scales of all axes on the parallel coordinate plot are selected so as to align all connecting lines as horizontally as possible.
This not only makes it easier for the user to understand the relationship between adjacent axes but also helps the user to grasp several global relationships among the underlying data vectors. The textile plot is named by analogy to a fabric in which warps (axes) and wefts (connecting lines) are woven together.

Furthermore, the textile plot can display ordered or unordered categorical data together with numerical data, even if missing values exist. This is an advantage of our criterion, which selects the locations and scales of all axes at once. Categorical data can be handled in the same way as numerical data as long as it is encoded by a set of contrasts. The positions of the levels on an axis are then determined by the criterion. It is also worth noting that the result is independent of the choice of contrasts.

There are several related works, particularly in homogeneity analysis [Gif90]. In homogeneity analysis, categorical data vectors are quantified so as to minimise the total distance from the object scores. It is common practice to display the quantified vectors as a two dimensional plot, but a plot on parallel coordinates, the so-called optimised parallel coordinate plot, has also been proposed [MD01]. The optimised parallel coordinate plot and the textile plot provide the same picture as long as all data vectors are categorical and no missing value exists. The objective of the textile plot is, however, different from that of the optimised parallel coordinate plot. The textile plot is a tool for exploring any high dimensional data as it is, without any specific objective. The design policy of the textile plot is therefore to provide necessary and sufficient information in a concise and effective way. The order of the parallel axes on the textile plot is also carefully chosen so as to give the user a clear image of the data.

2 Textile Plot

2.1 Selection of locations and scales

We will use the following notation.
The norm $\|\mathbf{x}\|_{\mathbf{v}}^2 = \sum_{i=1}^{n} v_i x_i^2$ is a weighted norm of the vector $\mathbf{x}$ with the weight vector $\mathbf{v}$, and $\mathbf{x}\cdot\mathbf{v}$, $\mathbf{x}/\mathbf{v}$ and $\mathbf{x} \le \mathbf{v}$ are the element-wise product, division and inequality for two vectors $\mathbf{x}$ and $\mathbf{v}$.

Assume that data with $n$ observations on $p$ variables are given. We organise the data into $p$ data vectors $\{\mathbf{x}_1, \ldots, \mathbf{x}_p\}$, each of which consists of $n$ elements. To make a parallel coordinate plot, the data vectors $\mathbf{x}_1, \ldots, \mathbf{x}_p$ are transformed into the $p$ coordinate vectors

\[
\mathbf{y}_j = \alpha_j \mathbf{1} + \beta_j \mathbf{x}_j, \quad j = 1, \ldots, p, \tag{1}
\]

where $\mathbf{1}$ is the vector of all ones. The location parameter vector $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_p)^T$ and the scale parameter vector $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_p)^T$ are chosen simultaneously so as to minimise the sum of squared deviations

\[
S^2(\boldsymbol{\alpha}, \boldsymbol{\beta}) = \sum_{j=1}^{p} \|\mathbf{y}_j - \mathbf{m}\|_{\mathbf{w}_j}^2, \tag{2}
\]

where

\[
\mathbf{m} = \sum_{j=1}^{p} \mathbf{w}_j \cdot \mathbf{y}_j / \mathbf{w} \tag{3}
\]

is the mean vector of the $\mathbf{y}_j$'s. The vector $\mathbf{w}_j$, whose elements are 0 or 1, indicates the locations of missing values in the data vector $\mathbf{x}_j$, $j = 1, \ldots, p$; that is, an element of $\mathbf{w}_j$ is 0 if the corresponding element of $\mathbf{x}_j$ is missing, and 1 otherwise. Each element of the vector $\mathbf{w} = \sum_{j=1}^{p} \mathbf{w}_j$ then gives the number of non-missing values for the corresponding observation.

In the textile plot, a constraint is imposed on $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ to avoid trivial solutions such as $\boldsymbol{\alpha} = \boldsymbol{\beta} = \mathbf{0}$. The constraint is that the total dispersion of the points displayed on a textile plot should equal the effective number $N = \mathbf{1}^T \mathbf{w}$ of points displayed, that is,

\[
\sum_{j=1}^{p} \|\mathbf{y}_j - \bar{y}_{\cdot j} \mathbf{1}\|_{\mathbf{w}_j}^2 = N, \tag{4}
\]

where $\bar{y}_{\cdot j} = \mathbf{w}_j^T \mathbf{y}_j / \mathbf{1}^T \mathbf{w}_j$.

If a data vector $\mathbf{x}_j$ is a categorical data vector with $q_j$ levels, it is transformed into the coordinate vector

\[
\mathbf{y}_j = \alpha_j \mathbf{1} + X_j \boldsymbol{\beta}_j \tag{5}
\]

instead of (1), where $X_j$ is an $n \times (q_j - 1)$ matrix encoded by a set of contrasts. As noted above, the textile plot is invariant under a change of contrasts. If $\mathbf{x}_j$ is not only categorical but also ordered, the order of the levels should be retained in the process of transformation.
This implies a constraint on the choice of the scale parameter $\boldsymbol{\beta}_j$. It is simply described as

\[
\boldsymbol{\beta}_j \ge \mathbf{0} \quad \text{or} \quad \boldsymbol{\beta}_j \le \mathbf{0}, \tag{6}
\]

as long as the specific contrast matrix

\[
C = \begin{pmatrix}
0 & \cdots & 0 \\
1 & \ddots & \vdots \\
\vdots & \ddots & 0 \\
1 & \cdots & 1
\end{pmatrix} \tag{7}
\]

is employed.

2.2 Design of the Point Display

For better visualisation, a good design of the point display on each warp is indispensable in helping the user to understand various aspects of the data. The notion of data types plays an important role. The distinction between numerical and non-numerical data is not enough for a proper understanding of the data, particularly when it is high dimensional. In the textile plot, numerical data is classified as continuous or discrete, and non-numerical data is classified as ordered, unordered or logical. The points on a warp are displayed differently according to the data type of the given data vector. Figure 1 illustrates the way of display on a warp for each data type. For all data types, each point on a warp is drawn as a circle whose area is proportional to the number of duplicated values. As is shown later, it is quite important to display such duplication even for continuous data. The number of missing values in each data vector is also indicated by the area of the circle with the symbol NA.

[Figure 1 omitted. Panels: Numerical data (Continuous, Discrete); Non-numerical data (Ordered, Unordered, Logical).]
Fig. 1. Design of Point Display on a Warp

Common features of the continuous and discrete data types are an arrow head indicating the direction of the coordinates, the possible maximum and minimum values placed at both ends of the warp, and the maximum and minimum values of the given data vector.
The maximum and minimum values of the data vector are also used for the tick labels placed on the left hand side of the warp. The indication of possible values differs between the two types. For continuous data, the possible range is indicated by a vertical line. For discrete data, the possible values are shown by tick marks. The same principle applies to non-numerical data. All possible values or levels, including levels with zero frequency, are indicated by the level names for ordered, unordered or logical data. The order of the levels for the ordered data type is indicated by a sequence of arrows.

As in the parallel coordinate plot, all warps are placed in parallel on a display and the coordinates are connected by polygonal lines to identify observations. The polygonal line is, however, disconnected at a warp if the observation is missing there. The order of the warps is also important for providing a clear image of the data. In the textile plot, the warps are placed in ascending order of the dispersion $\|\mathbf{y}_j - \mathbf{m}\|$, $j = 1, \ldots, p$. An exception is the warp corresponding to an ID vector, which identifies the observations. It is placed leftmost on the display and only the ID names are placed at the coordinates. As is shown later, the coordinate vector for the ID warp is always obtained from the mean vector $\mathbf{m}$, so it is not necessary to include the ID vector among the data vectors when computing the coordinate vectors. We can then always identify an observation by looking at the leftmost warp.
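The ordering rule for the warps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `order_warps` is hypothetical, missing values are represented by `None`, and the dispersion of each warp is computed over its non-missing entries only, with the mean vector taken over the non-missing coordinates of each observation as in equation (3). The ID warp is not part of the input, since it is always placed leftmost.

```python
import math

def order_warps(coords):
    """Return (order, disp): warp indices sorted by ascending dispersion.

    coords is a list of p coordinate vectors, each a list of n numbers,
    where None marks a missing value (an assumption of this sketch).
    """
    p, n = len(coords), len(coords[0])

    # Mean vector m: for each observation, average the non-missing coordinates.
    m = []
    for i in range(n):
        present = [y[i] for y in coords if y[i] is not None]
        m.append(sum(present) / len(present) if present else None)

    # Dispersion of warp j: Euclidean distance from m over non-missing entries.
    def dispersion(y):
        return math.sqrt(sum((yi - mi) ** 2
                             for yi, mi in zip(y, m) if yi is not None))

    disp = [dispersion(y) for y in coords]
    order = sorted(range(p), key=lambda j: disp[j])
    return order, disp

# Three warps, three observations; the third warp has one missing value.
order, disp = order_warps([[1.0, -1.0, 0.0],
                           [4.0, -4.0, 0.0],
                           [2.0, -2.0, None]])
print(order)
```

In this small example the third warp lies closest to the mean vector and would be drawn immediately to the right of the ID warp.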
[Figure 2 omitted. Warps shown: ID, Species, Petal.Length (cm), Petal.Width (cm), Sepal.Length (cm), Sepal.Width (cm).]
Fig. 2. Textile Plot of Iris Data

Figure 2 is the textile plot of the famous Iris data. The leftmost warp is the ID warp, whose labels are the sequence numbers of the observations. It can easily be seen from the plot that all data vectors are continuous measurements except for Species. The different sizes of the circles on the continuous measurements indicate that there are many duplicated values. This is because the measurements are recorded to one decimal place. The well known fact that Petal Width and Petal Length play an important role in discriminating Species is clearly visualised in this plot.

3 Implementation

3.1 Computation

To compute the coordinate vectors $\mathbf{y}_j$, $j = 1, \ldots, p$, we need to know the data type (numerical, categorical or ordered categorical) of each given data vector $\mathbf{x}_j$, $j = 1, \ldots, p$, and the locations of the missing values. The distinction between Continuous and Discrete, or between Unordered and Logical, is not necessary for this computation.

Preparation

To simplify the implementation, we assume that the first $r$ data vectors $\mathbf{x}_1, \ldots, \mathbf{x}_r$ are ordered categorical and $\mathbf{x}_{r+1}, \ldots, \mathbf{x}_p$ are other types of data vectors.
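Each data vector $\mathbf{x}_j$ is transformed into an $n \times (q_j - 1)$ data matrix $X_j$, which is encoded by a set of contrasts if it is a non-numerical data vector; otherwise $X_j = \mathbf{x}_j$. Here $q_j$ is the number of levels of $\mathbf{x}_j$, or $q_j = 2$ if it is numerical. The $i$th row of the matrix $X_j$ is filled with 0's if the $i$th element of $\mathbf{x}_j$ is a missing value. The data matrices $X_j$, $j = 1, \ldots, p$, are combined into an $n \times Q$ data matrix $X = (X_1, \ldots, X_p)$, where $Q = \sum_{j=1}^{p} (q_j - 1)$. The index set

\[
I_j = \left\{ \sum_{i=1}^{j-1} (q_i - 1) + 1, \; \ldots, \; \sum_{i=1}^{j} (q_i - 1) \right\}
\]

is used for indicating $\mathbf{x}_j$, $j = 1, \ldots, p$, and $I = \bigcup_{j=1}^{p} I_j = \{1, \ldots, Q\}$.

The Use of Generalised Inverse

We use the notation $\mathbf{v}(K)$ or $M(K, L)$ for the sub-vector or the sub-matrix specified by the index sets $K$ and $L$. We also use the principal sub-matrix $S(K) = S(K, K)$ for a square matrix $S$ [Hor85]. The minimisation problem in Section 2.1 is to minimise

\[
f(\boldsymbol{\alpha}, \boldsymbol{\beta}) = S^2(\boldsymbol{\alpha}, \boldsymbol{\beta}) - N
= \boldsymbol{\alpha}^T A_{11} \boldsymbol{\alpha} - 2 \boldsymbol{\alpha}^T A_{12} \boldsymbol{\beta} + \boldsymbol{\beta}^T A_{22} \boldsymbol{\beta}
\]

with respect to the location parameter vector $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_p)^T$ and the scale parameter vector $\boldsymbol{\beta} = (\boldsymbol{\beta}_1^T, \ldots, \boldsymbol{\beta}_p^T)^T$ under the constraints

\[
\boldsymbol{\beta}^T B \boldsymbol{\beta} = N \tag{8}
\]

and

\[
\boldsymbol{\beta}_j \ge \mathbf{0} \quad \text{or} \quad \boldsymbol{\beta}_j \le \mathbf{0}, \quad j = 1, \ldots, r. \tag{9}
\]

Here

\[
A_{11}(j, k) = \begin{cases}
-\mathbf{w}_j^T (\mathbf{w}_k / \mathbf{w}), & j \ne k, \\
-\mathbf{w}_j^T (\mathbf{w}_k / \mathbf{w}) + \mathbf{1}^T \mathbf{w}_j, & j = k,
\end{cases}
\]

\[
A_{12}(j, I_k) = \begin{cases}
\mathbf{w}_j^T (X_k / \mathbf{w}), & j \ne k, \\
\mathbf{w}_j^T (X_k / \mathbf{w}) - \mathbf{w}_j^T X_j, & j = k,
\end{cases}
\]

\[
A_{22}(I_j, I_k) = \begin{cases}
-X_j^T (X_k / \mathbf{w}), & j \ne k, \\
X_j^T \mathbf{w}_j \mathbf{w}_j^T X_j / (\mathbf{1}^T \mathbf{w}_j) - X_j^T (X_k / \mathbf{w}), & j = k,
\end{cases}
\]

and

\[
B(I_j, I_k) = \begin{cases}
O, & j \ne k, \\
X_j^T X_j - X_j^T \mathbf{w}_j \mathbf{w}_j^T X_j / (\mathbf{1}^T \mathbf{w}_j), & j = k.
\end{cases}
\]

The notation $/$ is extended here to an $n \times r$ matrix $Z$ and a vector $\mathbf{v}$ as $Z / \mathbf{v} = (\mathbf{z}_1 / \mathbf{v}, \ldots, \mathbf{z}_r / \mathbf{v})$. Since $\boldsymbol{\alpha}$ is not involved in the constraints (8) and (9), we see from $\partial f / \partial \boldsymbol{\alpha} = \mathbf{0}$ that any solution $\hat{\boldsymbol{\alpha}}$ satisfies the equation

\[
A_{11} \boldsymbol{\alpha} = A_{12} \boldsymbol{\beta}
\]

for a given $\boldsymbol{\beta}$. Because $A_{11}$ is singular, we need a generalised inverse of $A_{11}$ to write the solution explicitly.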
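In fact, it can be written as

\[
\hat{\boldsymbol{\alpha}} = A_{11}^{+} A_{12} \boldsymbol{\beta} + (I - A_{11}^{+} A_{11}) \mathbf{z} \tag{10}
\]

for an arbitrary $p$-dimensional vector $\mathbf{z}$, using the Moore-Penrose inverse [Rao73] of $A_{11}$. We then have

\[
f(\hat{\boldsymbol{\alpha}}, \boldsymbol{\beta}) = \boldsymbol{\beta}^T (-A_{12}^T A_{11}^{+} A_{12} + A_{22}) \boldsymbol{\beta}.
\]

It is now clear that the solution $\hat{\boldsymbol{\beta}}$ is obtained as the $\boldsymbol{\beta}$ which maximises the quadratic form of $A = A_{12}^T A_{11}^{+} A_{12} - A_{22}$ under the constraints (8) and (9). If no ordered categorical data vector is involved in the given data, this amounts to finding the eigenvector of $A$ with respect to $B$ for the largest eigenvalue.

Inequality Constraint Maximisation Problem

The computation becomes somewhat more complicated if an ordered categorical data vector is included. We have to solve a quadratic maximisation problem with the equality constraint (8) and the inequality constraint (9). As is described in [KS06], we have to find an index set $I_0 \subseteq I_{\mathrm{ord}} = \bigcup_{k=1}^{r} I_k$ such that

1. $\hat{\boldsymbol{\beta}}(I_0) = \mathbf{0}$, and $\hat{\boldsymbol{\beta}}(I_0^c)$ is an eigenvector of $A(I_0^c)$ with respect to $B(I_0^c)$ for the largest eigenvalue $\hat{\lambda}$, where $I_0^c = I \setminus I_0$,
2. Either $2\{A(I_k, I) - \hat{\lambda} B(I_k, I)\} \hat{\boldsymbol{\beta}} \ge \hat{\boldsymbol{\beta}}(I_k)$ or $2\{A(I_k, I) - \hat{\lambda} B(I_k, I)\} \hat{\boldsymbol{\beta}} \le \hat{\boldsymbol{\beta}}(I_k)$ is satisfied for $1 \le k \le r$.

An algorithm for the computation is as follows, where $\lambda_{\max}(A, B)$ denotes the largest eigenvalue of $A$ with respect to $B$, and $\mathbf{v}_{\max}(A, B)$ denotes the eigenvector of $A$ with respect to $B$ for the largest eigenvalue.

    input $A$, $B$, $I_{\mathrm{ord}}$
    $\hat{\lambda} \Leftarrow 0.0$
    for all $I_0$ s.t. $I_0 \subseteq I_{\mathrm{ord}}$ do
        $I_0^c = I \setminus I_0$
        $\lambda \Leftarrow \lambda_{\max}(A(I_0^c), B(I_0^c))$
        if $\hat{\lambda} < \lambda$ then
            $\boldsymbol{\beta}(I_0^c) \Leftarrow \mathbf{v}_{\max}(A(I_0^c), B(I_0^c))$
            $\boldsymbol{\beta}(I_0) \Leftarrow \mathbf{0}$
            if $\boldsymbol{\beta}(I_j) \ge \mathbf{0}$ or $\boldsymbol{\beta}(I_j) \le \mathbf{0}$, $j = 1, \ldots, r$ then
                $\hat{\lambda} \Leftarrow \lambda$
                $\hat{\boldsymbol{\beta}} \Leftarrow \boldsymbol{\beta}$
            end if
        end if
    end for
    $\hat{\boldsymbol{\beta}} \Leftarrow (N / \hat{\boldsymbol{\beta}}^T B \hat{\boldsymbol{\beta}})^{1/2} \hat{\boldsymbol{\beta}}$
    $\hat{\boldsymbol{\alpha}} \Leftarrow A_{11}^{+} A_{12} \hat{\boldsymbol{\beta}}$
    return $\hat{\boldsymbol{\alpha}}$, $\hat{\boldsymbol{\beta}}$ and $\hat{\lambda}$

3.2 Textile Plot Display

The coordinate vectors $\mathbf{y}_1, \ldots, \mathbf{y}_p$ are obtained from formula (5) by using $\hat{\boldsymbol{\alpha}}$ and $\hat{\boldsymbol{\beta}}$. The coordinate vector $\mathbf{y}_0$ for the ID yarn is given by

\[
\mathbf{y}_0 = \frac{1}{\hat{\lambda}} (\mathbf{m} - \bar{m} \mathbf{1}) + \bar{m} \mathbf{1},
\]

where $\bar{m} = \mathbf{1}^T \mathbf{m} / n$ and $\hat{\lambda}$ is the largest eigenvalue, which is already obtained in the algorithm.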
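The exhaustive search over index sets $I_0$ in the algorithm of Section 3.1 can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions rather than the C and CLAPACK implementation reported in this article. The function names `largest_gen_eigpair` and `textile_beta`, the 0-based index lists in `ord_blocks` and the sign tolerance `tol` are all hypothetical. The sketch also assumes that $A$ is symmetric and that $B$ is symmetric positive definite on every retained index set, so that a Cholesky factor exists. The matrices $A$ and $B$ are taken as given.

```python
from itertools import combinations
import numpy as np

def largest_gen_eigpair(A, B):
    """Largest eigenvalue and eigenvector of A with respect to B (A v = lam B v)."""
    L = np.linalg.cholesky(B)            # B = L L^T, requires B positive definite
    Linv = np.linalg.inv(L)
    C = Linv @ A @ Linv.T                # equivalent standard symmetric problem
    vals, vecs = np.linalg.eigh(C)       # eigenvalues in ascending order
    return vals[-1], Linv.T @ vecs[:, -1]

def textile_beta(A, B, ord_blocks, N, tol=1e-12):
    """Search all I0 within the ordered blocks; return (lambda_hat, beta_hat)."""
    Q = A.shape[0]
    I_ord = [i for blk in ord_blocks for i in blk]
    best_lam, best_beta = 0.0, None      # lambda_hat starts at 0.0 as in the algorithm
    for size in range(len(I_ord) + 1):
        for I0 in combinations(I_ord, size):
            keep = [i for i in range(Q) if i not in I0]   # complement of I0
            if not keep:
                continue
            sub = np.ix_(keep, keep)
            lam, v = largest_gen_eigpair(A[sub], B[sub])
            if lam <= best_lam:
                continue
            beta = np.zeros(Q)           # beta(I0) = 0
            beta[keep] = v
            # Constraint (9): each ordered block is entirely >= 0 or entirely <= 0.
            if all(np.all(beta[blk] >= -tol) or np.all(beta[blk] <= tol)
                   for blk in ord_blocks):
                best_lam, best_beta = lam, beta
    if best_beta is None:
        raise ValueError("no admissible solution with positive eigenvalue")
    # Rescale so that constraint (8), beta^T B beta = N, holds.
    best_beta = best_beta * np.sqrt(N / (best_beta @ B @ best_beta))
    return best_lam, best_beta

# Toy input: one ordered vector with two contrast columns (indices 0 and 1).
A = np.array([[1.0, -2.0], [-2.0, 0.5]])
lam_hat, beta_hat = textile_beta(A, np.eye(2), [[0, 1]], N=4.0)
print(lam_hat, beta_hat)
```

In the toy input the unconstrained eigenvector has entries of mixed sign, so the search falls back to a solution with one element of $\hat{\boldsymbol{\beta}}$ set to zero. Given $A_{11}$ and $A_{12}$, the location vector could then be obtained as `np.linalg.pinv(A11) @ A12 @ beta_hat`. The number of candidate sets grows as $2^{|I_{\mathrm{ord}}|}$, so this direct enumeration is only practical when the ordered vectors have few levels in total.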
Sequence numbers of the observations are used as the labels for the points on the ID yarn if no ID vector is given a priori. The other warps are placed to the right of the ID yarn in ascending order of $\|\mathbf{y}_j - \mathbf{m}\|$, $j = 1, \ldots, p$. All points on each warp are displayed according to the design of the point display described in Section 2.2. We always need to make a frequency table of the coordinates on each warp. For numerical data, the direction of each warp can be seen from the sign of $\beta_j$, but the possible minimum and maximum values of the data vector should be given a priori, since they determine the length of the axis line or the number of ticks drawn on the warp. For non-numerical data, all possible levels should be known a priori in order to show zero frequency levels on a warp.

4 Textile Plot on DandDR

The textile plot is currently implemented as a part of DandDR [YS04], which is an interface between R [R D04] and DandD. DandD (Data and Description) is a project to create a good environment for modelling data. DandDR receives the necessary information described in a DandD instance, together with the data itself, from DandDServer and creates a dad object in R. The object has its own plot method, which produces the textile plot. The main algorithm of the computation is written in C. The package CLAPACK [LPK] is used for obtaining the generalised inverse and solving the generalised eigenvalue problem. DandDR is available from the DandD project home page [DAD].

References

[DAD] DandD Project: DandD Home Page, http://www.stat.math.keio.ac.jp/DandD/
[Gif90] Gifi, A.: Nonlinear Multivariate Analysis. John Wiley & Sons Ltd (1990)
[Hor85] Horn, R., Johnson, C.: Matrix Analysis. Cambridge University Press (1985)
[Ins85] Inselberg, A.: The plane with parallel coordinates. The Visual Computer, 1, 69–91 (1985)
[KS06] Kumasaka, N., Shibata, R.: High Dimensional Data Visualisation: Textile Plot. Research Report in Department of Mathematics, KSTS/RR-06/001, Keio University (2006)
[LPK] LAPACK: Home Page, http://www.netlib.org/lapack/
[Mat03] Matthias, S.: Visualizing categorical data arising in the health sciences using hammock plots. American Statistical Association; 2003, CD-ROM (2003)
[MD01] Michailidis, G., de Leeuw, J.: Data visualization through graph drawing. Computational Statistics, 16, 435–450 (2001)
[RM71] Rao, C.R., Mitra, S.K.: Generalized Inverse of Matrices and its Applications. John Wiley & Sons, Inc. (1971)
[R] R Project: R Project Home Page, http://www.r-project.org/
[The02] Theus, M.: Interactive data visualization using mondrian. Journal of Statistical Software, 7 (2002)
[Weg90] Wegman, E.: Hyperdimensional data analysis using parallel coordinates. Journal of The American Statistical Association, 85, 664–675 (1990)
[Wil96] Wills, G.J.: Selection: 524,288 ways to say this is interesting. Proceedings of the 1996 IEEE Symposium on Information Visualization, IEEE Computer Society, Washington, DC, USA, 54–60 (1996)
[YS04] Yokouchi, D., Shibata, R.: DandD Client Server System. Compstat 2004 CD-ROM, Physica-Verlag, Prague (2004)