Implementation of textile plot

Natsuhiko Kumasaka (1) and Ritei Shibata (2)
(1) Fundamental Science and Technology, Keio University, kumasaka@stat.math.keio.ac.jp
(2) Department of Mathematics, Keio University, shibata@math.keio.ac.jp
Summary. The textile plot is a new data visualisation technique for exploring high
dimensional data. It is a parallel coordinate plot with axes whose
locations and scales are simultaneously chosen so that all connecting lines, each of
which signifies an observation, are aligned as horizontally as possible. The textile
plot can visualise not only numerical data but also ordered or unordered categorical
data, or a mixture of these, together with various attributes. The aim of this article is
to report an implementation of the textile plot. An algorithm for computing the
optimal locations and scales is developed, and the information necessary for the
display is discussed in detail.
Key words: Textile plot, High dimensional data visualisation, Parallel coordinate
plot, Constrained maximisation problem, DandDR.
1 Introduction
The parallel coordinate plot [Ins85] [Weg90] has frequently been used for exploring high
dimensional data. The plot is a simple way of visualising very high dimensional data
but is restricted to numerical data, since its primary aim is to represent a
set of data points in Euclidean space on a two dimensional display.
Several attempts have been made to visualise both numerical and categorical
data, for example [Wil96], [The02] and [Mat03]. But there remains the problem that
it becomes harder to understand what is going on behind the data as the number
of intersections of the connected lines increases.
The textile plot [KS06] is a solution to this problem. The locations and scales
of all axes on the parallel coordinate plot are selected so as to align all connected
lines as horizontally as possible. This not only makes it easier for the user to understand
the relationship between adjacent axes but also helps the user to grasp several global relationships
among the underlying data vectors. The textile plot is named by analogy to a fabric
in which warps (axes) and wefts (connected lines) are woven.
Furthermore, the textile plot can display any ordered or unordered categorical data
together with numerical data, even if missing values exist. This is an advantage of
our criterion, which selects the locations and scales of all axes at once. Categorical
data can be handled in the same way as numerical data as long as it is encoded by a set of contrasts.
The positions of the levels on an axis are then determined by the criterion. It is also
worth noting that the result is independent of the choice of contrasts.
There are several related works, particularly in homogeneity analysis [Gif90]. In homogeneity analysis, categorical data vectors are
quantified so as to minimise the total distance from object scores. It is common
practice to display the quantified vectors as a two dimensional plot, but a plot
on parallel coordinates, the so-called optimised parallel coordinate plot, has also been proposed [MD01]. The optimised parallel coordinate plot and the textile
plot provide the same picture as long as all data vectors are categorical and no missing
value exists.
The objective of the textile plot is, however, different from that of the optimised parallel
coordinate plot. The textile plot is a tool for exploring any high dimensional data
as it is, without any specific objective. Therefore the design policy of the textile plot is
to provide necessary and sufficient information in a concise and effective way. The
order of the parallel axes on the textile plot is also carefully chosen so as to give a
clear image of the data to the user.
2 Textile Plot
2.1 Selection of locations and scales
We will use the following notation. The norm $\|x\|_v^2 = \sum_{i=1}^{n} v_i x_i^2$ is a weighted norm
of the vector $x$ with the weight vector $v$, and $x \cdot v$, $x/v$ and $x \le v$ are the element-wise
product, division and inequality for two vectors $x$ and $v$.
Assume that data consisting of $n$ observations on $p$ variables are given. We organise the data
into $p$ data vectors $\{x_1, \dots, x_p\}$, each of which consists of $n$ elements. Then the
data vectors $x_1, \dots, x_p$ are transformed into the $p$ coordinate vectors

$$ y_j = \alpha_j \mathbf{1} + \beta_j x_j, \qquad j = 1, \dots, p, \tag{1} $$

to make a parallel coordinate plot, where $\mathbf{1}$ is the vector of all ones. The location parameter vector $\alpha = (\alpha_1, \dots, \alpha_p)^T$ and the scale parameter vector $\beta = (\beta_1, \dots, \beta_p)^T$
are simultaneously chosen so as to minimise the sum of squared deviations

$$ S^2(\alpha, \beta) = \sum_{j=1}^{p} \|y_j - m\|_{w_j}^2, \tag{2} $$

where

$$ m = \Bigl( \sum_{j=1}^{p} w_j \cdot y_j \Bigr) / w \tag{3} $$

is the mean vector of the $y_j$'s. The vector $w_j$, whose elements are 0 or 1,
indicates the locations of missing values in the data vector $x_j$, $j = 1, \dots, p$; that
is, an element of $w_j$ is 0 if the corresponding element of $x_j$ is missing, and 1 otherwise.
Each element of the vector $w = \sum_{j=1}^{p} w_j$ then gives the number of
non-missing values per observation.
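The quantities in (1) to (3) can be illustrated with a minimal sketch for numerical data. It assumes that a missing value is represented by `None` and that every observation has at least one observed value; the function names are hypothetical and are not taken from the authors' implementation.

```python
# Sketch of equations (1)-(3); None marks a missing value (an assumption).

def coordinates(x, alpha, beta):
    """y_j = alpha_j * 1 + beta_j * x_j; missing entries stay None."""
    return [None if v is None else alpha + beta * v for v in x]

def mean_vector(ys):
    """m_i = (sum of the observed y_ij over j) / w_i, where w_i is the
    number of non-missing values of observation i (assumed >= 1)."""
    m = []
    for i in range(len(ys[0])):
        observed = [y[i] for y in ys if y[i] is not None]
        m.append(sum(observed) / len(observed))
    return m

def criterion(ys):
    """S^2 = sum_j sum_i w_ij * (y_ij - m_i)^2."""
    m = mean_vector(ys)
    return sum((y[i] - m[i]) ** 2
               for y in ys for i in range(len(m)) if y[i] is not None)

xs = [[1.0, 2.0, 3.0], [2.0, None, 8.0]]
ys = [coordinates(xs[0], 0.0, 1.0), coordinates(xs[1], 0.0, 0.5)]
print(mean_vector(ys))  # mean of the observed coordinates per observation
print(criterion(ys))    # sum of squared deviations from the mean vector
```

The second observation has only one observed coordinate, so its mean equals that coordinate and it contributes nothing to $S^2$.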
In the textile plot, a constraint is imposed on $\alpha$ and $\beta$ to avoid trivial solutions
such as $\alpha = \beta = 0$. The constraint is that the total dispersion of the points displayed on a textile
plot should be equal to the effective number $N = \mathbf{1}^T w$ of points displayed, that
is,

$$ \sum_{j=1}^{p} \|y_j - \bar{y}_{\cdot j} \mathbf{1}\|_{w_j}^2 = N, \tag{4} $$

where $\bar{y}_{\cdot j} = w_j^T y_j / \mathbf{1}^T w_j$.
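Constraint (4) fixes the overall scale of the plot. The sketch below, for numerical data only, rescales a given set of scale parameters by one common factor so that (4) holds. The use of `None` for missing values and all function names are assumptions made for illustration.

```python
# Sketch of constraint (4); None marks a missing value (an assumption).

def dispersion(ys):
    """Left-hand side of (4): sum_j sum_i w_ij * (y_ij - ybar_j)^2."""
    total = 0.0
    for y in ys:
        observed = [v for v in y if v is not None]
        ybar = sum(observed) / len(observed)
        total += sum((v - ybar) ** 2 for v in observed)
    return total

def effective_number(ys):
    """N = 1^T w: the number of non-missing points."""
    return sum(1 for y in ys for v in y if v is not None)

def rescale_betas(xs, betas):
    """Multiply all beta_j by one common factor so that (4) holds.
    The locations alpha_j do not enter (4), so they are left out here."""
    ys = [[None if v is None else b * v for v in x] for x, b in zip(xs, betas)]
    c = (effective_number(ys) / dispersion(ys)) ** 0.5
    return [c * b for b in betas]

xs = [[1.0, 2.0, 3.0], [2.0, None, 8.0]]
betas = rescale_betas(xs, [1.0, 0.5])
ys = [[None if v is None else b * v for v in x] for x, b in zip(xs, betas)]
print(dispersion(ys), effective_number(ys))  # the two agree up to rounding
```

Because the dispersion is quadratic in the scale parameters, a single square-root factor is enough to satisfy the constraint.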
If a data vector $x_j$ is a categorical data vector with $q_j$ levels, it is transformed
into the coordinate vector

$$ y_j = \alpha_j \mathbf{1} + X_j \beta_j \tag{5} $$

instead of (1), where $X_j$ is an $n \times (q_j - 1)$ matrix encoded by a set of contrasts and $\beta_j$ is the corresponding $(q_j - 1)$-dimensional scale parameter vector. As
noted above, the textile plot is invariant under a change of contrasts.
If $x_j$ is not only categorical but also ordered, the order of the levels should be
retained in the process of transformation. This implies a constraint on the choice of
the scale parameter $\beta_j$. It is simply described as

$$ \beta_j \ge 0 \quad \text{or} \quad \beta_j \le 0, \tag{6} $$

provided that the specific contrast matrix

$$ C = \begin{pmatrix} 0 & \cdots & 0 \\ 1 & \ddots & \vdots \\ \vdots & \ddots & 0 \\ 1 & \cdots & 1 \end{pmatrix} \tag{7} $$

is employed.
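The effect of the contrast matrix (7) can be seen in a small sketch. It assumes that $C$ has $q_j$ rows, one per level in order, and $q_j - 1$ columns, so that the position of a level is the location plus the cumulative sum of the scale parameters. The function names are hypothetical.

```python
# Sketch of the contrast matrix (7) for an ordered factor with q levels.

def ordered_contrasts(q):
    """q x (q-1) matrix: row k (k = 0, ..., q-1) has k ones followed by zeros."""
    return [[1 if col < row else 0 for col in range(q - 1)] for row in range(q)]

def level_positions(alpha, beta):
    """Position of each level on the warp: alpha * 1 + C beta."""
    C = ordered_contrasts(len(beta) + 1)
    return [alpha + sum(c * b for c, b in zip(row, beta)) for row in C]

C = ordered_contrasts(4)
pos = level_positions(0.0, [0.5, 0.0, 1.5])
print(C)
print(pos)  # non-decreasing because every element of beta is >= 0
```

Each element of $\beta_j$ is the gap between two adjacent levels, which is why a common sign as in (6) is enough to retain the order. A zero element places two adjacent levels at the same position.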
2.2 Design of the Point Display
For better visualisation, a good design of the point display on each warp is indispensable for helping the user to understand various aspects of
the data.
The notion of data types plays an important role. The distinction between numerical and non-numerical data is not enough for a proper understanding of the data,
particularly when it is high dimensional. In the textile plot, numerical data is classified as continuous or discrete, and non-numerical data is classified as ordered,
unordered or logical.
The points on a warp are displayed differently according to the data type of the
given data vector. Figure 1 illustrates the way of display on a warp for each data
type.
For all data types, each point on a warp is indicated by a circle whose
area is proportional to the number of duplicated values. As is shown later, it is quite
important to display such duplication even for continuous data. The number of
missing values in each data vector is also indicated by the area of the circle with the
symbol NA.
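The circle sizes follow directly from a frequency table of the coordinates. The sketch below assumes that missing values are given as `None` and that a circle for a single value has radius `unit_radius`; both the function name and this parameter are hypothetical. Since the area is proportional to the count, the radius grows with its square root.

```python
from collections import Counter
import math

def point_sizes(y, unit_radius=1.0):
    """Return ({value: radius}, radius of the NA circle) for one warp.
    Area is proportional to the count, so radius ~ sqrt(count)."""
    counts = Counter(v for v in y if v is not None)
    n_missing = sum(1 for v in y if v is None)
    radii = {v: unit_radius * math.sqrt(c) for v, c in counts.items()}
    return radii, unit_radius * math.sqrt(n_missing)

radii, na_radius = point_sizes([5.1, 4.9, 5.1, None, 5.1, 5.1])
print(radii, na_radius)
```

A value observed four times is drawn with twice the radius, and hence four times the area, of a value observed once.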
[Figure 1: panels "Numerical data" (Continuous, Discrete) and "Non-numerical data" (Ordered, Unordered, Logical); each warp carries an axis label and an NA circle.]
Fig. 1. Design of Point Display on a Warp
Common features for the continuous and discrete data types are an arrow head
indicating the direction of the coordinates, the possible maximum and minimum values
placed at both ends of the warp, and the maximum and minimum values of
the given data vector. The maximum and minimum values of the data vector are
also used for the tick labels placed on the left hand side of the warp. The indication of
possible values differs: for continuous data, the possible range is indicated by
a vertical line, whereas for discrete data the possible values are shown by tick marks.
The same principle applies to non-numerical data. All possible values or levels,
including zero frequency levels, are indicated by the level names for ordered, unordered or logical data. The order of the levels in the case of the ordered data type is
indicated by a sequence of arrows.
As in the parallel coordinate plot, all warps are placed in parallel on a display
and the coordinates are connected by polygonal lines to identify observations. But
the polygonal line is disconnected at a warp if the observation is missing there.
The order of warps is also important to provide a clear image of the data. In the
textile plot, the warps are placed in ascending order of the dispersion $\|y_j - m\|$,
$j = 1, \dots, p$. An exception is the warp corresponding to an ID vector which identifies
the observations. It is placed at the leftmost position of the display and only the ID names
are placed at the coordinates.
As is shown later, the coordinate vector for the ID warp is always obtained from
the mean vector $m$, so that it is not necessary to include the ID vector as a part
of the data vectors to compute the coordinate vectors. We can then always identify an
observation by looking at the leftmost warp.
[Figure 2: textile plot with warps ID, Sepal.Length (cm), Sepal.Width (cm), Petal.Length (cm), Petal.Width (cm) and Species (levels setosa, versicolor, virginica).]
Fig. 2. Textile Plot of Iris Data
Figure 2 is the textile plot of the famous Iris data. The leftmost warp is the ID warp,
whose labels are the sequence numbers of the observations. It can easily be seen from
the plot that all data vectors are continuous measurements except for Species. The
different sizes of the circles on the continuous measurements indicate that there are many
duplicated values. This is because the measurements are recorded to one decimal
place. The well known fact that Petal Width and Petal Length play an important
role in discriminating Species is clearly visualised in this plot.
3 Implementation
3.1 Computation
To compute the coordinate vectors $y_j$, $j = 1, \dots, p$, we need to know the
data type (numerical, categorical or ordered categorical) of each given data vector
$x_j$, $j = 1, \dots, p$, and the locations of missing values. The distinction between
Continuous and Discrete, or between Unordered and Logical, is not necessary for this computation.
Preparation
To simplify the implementation, we assume that the first $r$ data vectors $x_1, \dots, x_r$
are ordered categorical and $x_{r+1}, \dots, x_p$ are other types of data vectors.
Each data vector $x_j$ is transformed into an $n \times (q_j - 1)$ data matrix $X_j$, which is
encoded by a set of contrasts if it is a non-numerical data vector; otherwise $X_j = x_j$.
Here $q_j$ is the number of levels of $x_j$, or $q_j = 2$ if it is numerical. The $i$th row of
the matrix $X_j$ is filled with 0's if the $i$th element of $x_j$ is a missing value.
The data matrices $X_j$, $j = 1, \dots, p$, are combined into an $n \times Q$ data matrix
$X = (X_1, \dots, X_p)$, where $Q = \sum_{j=1}^{p} (q_j - 1)$. The index set

$$ I_j = \Bigl\{ \sum_{i=1}^{j-1} (q_i - 1) + 1, \; \dots, \; \sum_{i=1}^{j} (q_i - 1) \Bigr\} $$

is used for indicating $x_j$, $j = 1, \dots, p$, and $I = \bigcup_{j=1}^{p} I_j = \{1, \dots, Q\}$.
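The preparation step can be sketched as follows. The text does not fix a particular set of contrasts for unordered data, so this sketch assumes treatment contrasts with the first level as the baseline. Missing values are assumed to be `None`, and all function names are hypothetical.

```python
# Sketch of the preparation step: encode each data vector as X_j, combine
# the X_j column-wise into X, and compute the 1-based index sets I_j.

def numeric_column(x):
    """n x 1 matrix for a numerical vector (q_j = 2); missing -> zero row."""
    return [[0.0] if v is None else [float(v)] for v in x]

def treatment_contrasts(x, levels):
    """n x (q-1) indicator matrix with the first level as baseline
    (an assumed choice of contrasts); missing -> zero row."""
    return [[0.0] * (len(levels) - 1) if v is None
            else [1.0 if v == lev else 0.0 for lev in levels[1:]]
            for v in x]

def index_sets(q):
    """I_j = {sum_{i<j}(q_i - 1) + 1, ..., sum_{i<=j}(q_i - 1)}."""
    sets, start = [], 0
    for qj in q:
        sets.append(list(range(start + 1, start + qj)))
        start += qj - 1
    return sets

blocks = [numeric_column([5.1, None, 4.7]),
          treatment_contrasts(["a", "c", None], ["a", "b", "c"])]
X = [sum((blk[i] for blk in blocks), []) for i in range(3)]
print(X)                   # n x Q matrix with Q = (2 - 1) + (3 - 1) = 3
print(index_sets([2, 3]))  # columns of X that belong to each data vector
```

The weight vectors $w_j$ have to be kept alongside $X$, because a zero row alone does not distinguish a missing value from a baseline level.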
The Use of Generalised Inverse
We use the notation $v(K)$ or $M(K, L)$ for the sub-vector or the sub-matrix specified
by the index sets $K$ and $L$. We also use the principal sub-matrix $S(K) = S(K, K)$ for a
square matrix $S$ [Hor85].
The minimisation problem in Section 2.1 is to minimise

$$ f(\alpha, \beta) = S^2(\alpha, \beta) - N = \alpha^T A_{11} \alpha - 2 \alpha^T A_{12} \beta + \beta^T A_{22} \beta $$

with respect to the location parameter vector $\alpha = (\alpha_1, \dots, \alpha_p)^T$ and the scale parameter vector $\beta = (\beta_1^T, \dots, \beta_p^T)^T$ under the constraints

$$ \beta^T B \beta = N \tag{8} $$

and

$$ \beta_j \ge 0 \quad \text{or} \quad \beta_j \le 0, \qquad j = 1, \dots, r. \tag{9} $$

Here

$$ A_{11}(j, k) = \begin{cases} -w_j^T (w_k / w) & j \ne k, \\ -w_j^T (w_k / w) + \mathbf{1}^T w_j & j = k, \end{cases} $$

$$ A_{12}(j, I_k) = \begin{cases} w_j^T (X_k / w) & j \ne k, \\ w_j^T (X_k / w) - w_j^T X_j & j = k, \end{cases} $$

$$ A_{22}(I_j, I_k) = \begin{cases} -X_j^T (X_k / w) & j \ne k, \\ X_j^T w_j w_j^T X_j / (\mathbf{1}^T w_j) - X_j^T (X_k / w) & j = k, \end{cases} $$

and

$$ B(I_j, I_k) = \begin{cases} O & j \ne k, \\ X_j^T X_j - X_j^T w_j w_j^T X_j / (\mathbf{1}^T w_j) & j = k. \end{cases} $$

The notation $/$ is extended here to an $n \times r$ matrix $Z$ and a vector $v$ as $Z/v = (z_1/v, \dots, z_r/v)$.
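The block definitions above translate almost literally into array code. The following is a sketch that assumes NumPy is available, that each $X_j$ is an array with zero rows at missing positions, and that every observation has at least one observed value. The function name `textile_matrices` and the toy data are hypothetical.

```python
import numpy as np

def textile_matrices(Xs, ws):
    """Build A11, A12, A22 and B from data matrices X_j (zero rows where
    missing) and 0/1 weight vectors w_j, following the block definitions."""
    p = len(Xs)
    w = np.sum(ws, axis=0)                 # observed values per observation

    def centre(j):                         # X_j^T w_j w_j^T X_j / (1^T w_j)
        s = Xs[j].T @ ws[j]
        return np.outer(s, s) / ws[j].sum()

    A11 = np.array([[(ws[j].sum() if j == k else 0.0) - ws[j] @ (ws[k] / w)
                     for k in range(p)] for j in range(p)])
    A12 = np.block([[(ws[j] @ (Xs[k] / w[:, None])
                      - (ws[j] @ Xs[j] if j == k else 0.0))[None, :]
                     for k in range(p)] for j in range(p)])
    A22 = np.block([[(centre(j) if j == k else 0.0)
                     - Xs[j].T @ (Xs[k] / w[:, None])
                     for k in range(p)] for j in range(p)])
    B = np.block([[Xs[j].T @ Xs[j] - centre(j) if j == k
                   else np.zeros((Xs[j].shape[1], Xs[k].shape[1]))
                   for k in range(p)] for j in range(p)])
    return A11, A12, A22, B

# Toy data: one numerical vector with a missing third value and one
# three-level factor encoded by two indicator columns.
x1 = np.array([[1.0], [2.0], [0.0], [4.0]])
w1 = np.array([1.0, 1.0, 0.0, 1.0])
X2 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
w2 = np.ones(4)
A11, A12, A22, B = textile_matrices([x1, X2], [w1, w2])
print(A11.shape, A12.shape, A22.shape, B.shape)
```

A useful sanity check is that $\alpha^T A_{11} \alpha - 2 \alpha^T A_{12} \beta + \beta^T (A_{22} + B) \beta$ reproduces $S^2(\alpha, \beta)$ computed directly from (2) for arbitrary $\alpha$ and $\beta$, and that $\beta^T B \beta$ reproduces the left-hand side of (4).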
Since $\alpha$ is not involved in the constraints (8) and (9), we see from $\partial f / \partial \alpha = 0$
that any solution $\hat\alpha$ satisfies the equation

$$ A_{11} \alpha = A_{12} \beta $$

for a given $\beta$.
Because of the singularity of $A_{11}$, we need a generalised inverse of $A_{11}$ to write the solution explicitly. In fact, it can be written as

$$ \hat\alpha = A_{11}^{+} A_{12} \beta + (I - A_{11}^{+} A_{11}) z \tag{10} $$

for an arbitrary $p$-dimensional vector $z$ by using the Moore-Penrose inverse [Rao73] $A_{11}^{+}$ of $A_{11}$. We then have

$$ f(\hat\alpha, \beta) = \beta^T (-A_{12}^T A_{11}^{+} A_{12} + A_{22}) \beta. $$
It is now clear that the solution $\hat\beta$ can be obtained as the $\beta$ which maximises the
quadratic form of

$$ A = A_{12}^T A_{11}^{+} A_{12} - A_{22} $$

under the constraints (8) and (9). If no ordered categorical data vector is
involved in the given data, this simply amounts to finding the eigenvector of $A$ with
respect to $B$ for the largest eigenvalue.
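The reduction to an eigenvalue problem can be made explicit with a Lagrange multiplier. The following derivation is a sketch that ignores the inequality constraint (9) and uses the symmetry of $A$ and $B$; the multiplier $\lambda$ is the only symbol not defined above.

```latex
\begin{align*}
L(\beta, \lambda) &= \beta^T A \beta - \lambda\,(\beta^T B \beta - N), \\
\frac{\partial L}{\partial \beta} = 0 \;&\Longleftrightarrow\; A \beta = \lambda B \beta, \\
\beta^T A \beta &= \lambda\, \beta^T B \beta = \lambda N .
\end{align*}
```

Every stationary point is therefore a generalised eigenvector of $A$ with respect to $B$, and the attained value $\lambda N$ is largest for the largest eigenvalue $\hat\lambda$. Since $f(\hat\alpha, \hat\beta) = -\hat\beta^T A \hat\beta$, the minimised criterion is $S^2 = N(1 - \hat\lambda)$.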
Inequality Constraint Maximisation Problem
The computation becomes a bit more complicated if an ordered categorical data vector
is included. We have to solve a quadratic maximisation problem with an equality
constraint (8) and an inequality constraint (9).
As is described in [KS06], we have to find an index set $I_0 \subseteq I_{ord} = \bigcup_{k=1}^{r} I_k$
such that
1. $\hat\beta(I_0) = 0$, and $\hat\beta(I_0^c)$ is an eigenvector of $A(I_0^c)$ with respect to $B(I_0^c)$ for
the largest eigenvalue $\hat\lambda$, where $I_0^c = I \setminus I_0$,
2. Either
$$ 2\{A(I_k, I) - \hat\lambda B(I_k, I)\} \hat\beta \ge \hat\beta(I_k) $$
or
$$ 2\{A(I_k, I) - \hat\lambda B(I_k, I)\} \hat\beta \le \hat\beta(I_k) $$
is satisfied for $1 \le k \le r$.
An algorithm for the computation is as follows.

  input A, B, I_ord
  λ̂ ⇐ 0.0
  for all I_0 s.t. I_0 ⊆ I_ord do
    I_0^c ⇐ I \ I_0
    λ ⇐ λ_max(A(I_0^c), B(I_0^c))
    if λ̂ < λ then
      β(I_0^c) ⇐ v_max(A(I_0^c), B(I_0^c))
      β(I_0) ⇐ 0
      if β(I_j) ≥ 0 or β(I_j) ≤ 0, j = 1, ..., r then
        λ̂ ⇐ λ
        β̂ ⇐ β
      end if
    end if
  end for
  β̂ ⇐ (N / β̂^T B β̂)^{1/2} β̂
  α̂ ⇐ A_11^+ A_12 β̂
  return α̂, β̂ and λ̂

Here λ_max(A, B) denotes the largest eigenvalue of A with respect to B, and v_max(A, B) denotes the eigenvector of A with respect to B for the largest eigenvalue.
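The search over index sets can be sketched in Python as below. This is an illustration rather than the authors' C implementation. It assumes NumPy is available, that $A$ is symmetric, and that every principal sub-matrix of $B$ that is used is positive definite, so the generalised eigenproblem can be reduced by a Cholesky factor. Index sets are 0-based here, and the function names are hypothetical.

```python
from itertools import chain, combinations
import numpy as np

def largest_gen_eig(A, B):
    """Largest eigenpair of A v = lambda B v (A symmetric, B positive definite)."""
    Linv = np.linalg.inv(np.linalg.cholesky(B))
    vals, vecs = np.linalg.eigh(Linv @ A @ Linv.T)  # eigenvalues ascending
    return vals[-1], Linv.T @ vecs[:, -1]

def same_sign(v, tol=1e-10):
    """True if all elements are >= 0 or all are <= 0 (up to tol)."""
    return bool(np.all(v >= -tol) or np.all(v <= tol))

def textile_beta(A, B, ordered_sets, N):
    """Enumerate all I0 within the ordered index sets, as in the algorithm.
    ordered_sets holds the 0-based index lists I_1, ..., I_r."""
    Q = A.shape[0]
    i_ord = sorted(chain.from_iterable(ordered_sets))
    subsets = chain.from_iterable(
        combinations(i_ord, k) for k in range(len(i_ord) + 1))
    best_lam, best_beta = 0.0, None
    for I0 in subsets:
        keep = [i for i in range(Q) if i not in I0]
        if not keep:
            continue
        lam, v = largest_gen_eig(A[np.ix_(keep, keep)], B[np.ix_(keep, keep)])
        if best_lam < lam:
            beta = np.zeros(Q)
            beta[keep] = v
            if all(same_sign(beta[Ij]) for Ij in ordered_sets):
                best_lam, best_beta = lam, beta
    if best_beta is None:
        raise ValueError("no feasible index set with a positive eigenvalue")
    best_beta = np.sqrt(N / (best_beta @ B @ best_beta)) * best_beta
    return best_beta, best_lam

# Toy problem: the unconstrained leading eigenvector has mixed signs on the
# ordered set {0, 1}, so one of its elements has to be fixed at zero.
A = np.array([[2.0, -1.0, 0.0], [-1.0, 2.0, 0.0], [0.0, 0.0, 1.0]])
B = np.eye(3)
beta, lam = textile_beta(A, B, [[0, 1]], N=4.0)
print(beta, lam)
```

The loop visits every subset of the ordered index set, so its cost grows exponentially with the total number of columns belonging to ordered categorical vectors. The location vector could then be obtained as `np.linalg.pinv(A11) @ A12 @ beta`, following (10) with $z = 0$.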
3.2 Textile Plot Display
The coordinate vectors $y_1, \dots, y_p$ are obtained from formula (5) by using $\hat\alpha$ and
$\hat\beta$. The coordinate vector $y_0$ for the ID warp is given by

$$ y_0 = \frac{1}{\hat\lambda} (m - \bar m \mathbf{1}) + \bar m \mathbf{1}, $$

where $\bar m = \mathbf{1}^T m / n$ and $\hat\lambda$ is the largest eigenvalue, which is already obtained in the
algorithm. Sequence numbers of observations are used as the labels for the points
on the ID warp if no ID vector is given a priori.
The other warps are placed to the right of the ID warp in ascending order
of $\|y_j - m\|$, $j = 1, \dots, p$. All points on each warp are displayed according to the
design of the point display described in Section 2.2.
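The two layout rules above can be sketched as follows. The sketch assumes that missing coordinates are `None` and that the norm used for ordering skips them, which the text does not state explicitly. The function names are hypothetical.

```python
# Sketch of the display step: ID warp coordinates and the order of warps.

def id_coordinates(m, lam):
    """y_0 = (m - mbar * 1) / lam + mbar * 1, with mbar the mean of m."""
    mbar = sum(m) / len(m)
    return [(mi - mbar) / lam + mbar for mi in m]

def warp_order(ys, m):
    """Indices of the warps in ascending order of ||y_j - m||, computed
    over the observed entries only (an assumption of this sketch)."""
    def distance(y):
        return sum((v - mi) ** 2 for v, mi in zip(y, m) if v is not None) ** 0.5
    return sorted(range(len(ys)), key=lambda j: distance(ys[j]))

m = [1.0, 2.0, 3.0]
y0 = id_coordinates(m, 0.5)
order = warp_order([[1.0, 2.0, 5.0], [1.0, None, 3.5], [1.0, 2.0, 3.0]], m)
print(y0, order)
```

Dividing the centred mean vector by $\hat\lambda$ stretches the ID warp about $\bar m$ whenever $\hat\lambda < 1$.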
We always need to make a frequency table of the coordinates on each warp. In
the case of numerical data, the direction of each warp can be seen from the sign of $\beta_j$,
but the possible minimum and maximum values of the data vector should be given a
priori; they are used for determining the length of the axis line or the number of
ticks drawn on the warp. In the case of non-numerical data, all possible levels should be
known a priori in order to show zero frequency levels on a warp.
4 Textile Plot on DandDR
The textile plot is currently implemented as a part of DandDR [YS04], which is an
interface between R [R D04] and DandD. DandD (Data and Description) is a project
to create a good environment for modelling data. DandDR receives any necessary
information described in a DandD instance, together with the data itself, from DandDServer
and creates a dad object in R. The object has its own plot method, which produces
the textile plot. The main algorithm of the computation is written in C.
The package CLAPACK [LPK] is used for obtaining the generalised inverse and solving
the generalised eigenvalue problem.
DandDR is available from the DandD project home page [DAD].
References
[DAD] DandD Project: DandD Home Page, http://www.stat.math.keio.ac.jp/DandD/
[Gif90] Gifi, A.: Nonlinear Multivariate Analysis. John Wiley & Sons Ltd, (1990)
[Hor85] Horn, R., Johnson, C.: Matrix Analysis. Cambridge University Press, (1985)
[Ins85] Inselberg, A.: The plane with parallel coordinates. The Visual Computer, 1, 69–91 (1985)
[KS06] Kumasaka, N., Shibata, R.: High Dimensional Data Visualisation: Textile Plot. Research Report in Department of Mathematics, KSTS/RR-06/001, Keio University, (2006)
[LPK] LAPACK: Home Page, http://www.netlib.org/lapack/
[Mat03] Matthias, S.: Visualizing categorical data arising in the health sciences using hammock plots. American Statistical Association; 2003, CD-ROM (2003)
[MD01] Michailidis, G., de Leeuw, J.: Data visualization through graph drawing. Computational Statistics, 16, 435–450 (2001)
[RM71] Rao, C.R., Mitra, S.K.: Generalized Inverse of Matrices and its Applications. John Wiley & Sons, Inc., (1971)
[R] R Project: R Project Home Page, http://www.r-project.org/
[The02] Theus, M.: Interactive data visualization using mondrian. Journal of Statistical Software, 7 (2002)
[Weg90] Wegman, E.: Hyperdimensional data analysis using parallel coordinates. Journal of The American Statistical Association, 85, 664–675 (1990)
[Wil96] Wills, G.J.: Selection: 524,288 ways to say this is interesting. Proceedings of the 1996 IEEE Symposium on Information Visualization, IEEE Computer Society Washington, DC, USA, 54–60 (1996)
[YS04] Yokouchi, D., Shibata, R.: DandD Client Server System, Compstat 2004 CD-ROM, Physica-Verlag, Prague. (2004)