Multiple Cluster Comparison:The Identification of Stable Objects

advertisement
Visual Analytics in Data Stability and Data Flow Analysis
Jianping Zhou, Georges Grinstein
ABSTRACT
Coupled visualization and analysis, a.k.a. visual analytics, has become a major
methodology for data analysis and exploration. Effective visual analysis of various
clustering results and categorical information within a single graphical interactive display
is highly recommended. We describe a specific integration of analysis and visualization
for partition stability evaluation. Partitions are decompositions of a dataset into a family
of disjoint subsets. They may refer to the results of clustering, of groupings of categorical
dimensions, of binned numerical dimensions, of predetermined class labeling dimensions,
or prior knowledge structured in mutually exclusive format (one data item associated
with one and only one outcome). Partition or cluster stability analysis identifies nearoptimal structure, builds ensembles, or conducts validation.
We developed a set of partition stability metrics and a new visualization tool,
CComViz (Cluster Comparison Visualization), for mutual comparison and evaluation of
multiple partitioning of the same dataset. We've added a novel layout algorithm for
informatively rearranging the order of the records and dimensions. With these techniques,
we are able to visualize data stability, data flow, density distribution and hierarchy, and
data correlation at the record, at the group and at the dimension levels within a single
graphical interactive display. We provide an example application of CComViz.
1
INTRODUCTION
Partition comparison and evaluation is a common task in comparative data analysis. In
this context, partitions refer to the results of clustering in flat format, of groupings of
categorical dimensions, of binned numerical dimensions, of predetermined class labeling
dimensions, or prior knowledge structured in mutually exclusive format (one data item
associated with one and only one outcome). Partition stability analysis deals with
comparison and evaluation of multiple partitions in order to identify near-optimal
structure, build ensembles, or conduct validation. Partition stability is a measure
regarding the consistence of records within clusters or data categories. Stable records
characterize the main features of their belonging clusters while unstable records indicate
outliers and anomalies. The results of partition stability analysis often answer questions
like
 Which data items are strongly clustered together? Which data items are marginally
clustered?
 Is the clustering result obtained by a method better than other methods?
 What is the optimal number of clusters over multiple clustering runs with different
settings?
 How consistent, or stable, are data memberships over multiple partitions?
 How data flow starting from one partition to others? How data volumes change?
What are their density distributions and hierarchical relationships?
The integration of analysis and visualization for partition stability evaluation has become
a major methodology in exploratory data analysis. We have developed a set of partition
stability metrics and a new visualization tool, CComViz (Cluster Comparison
Visualization), to carry out this methodology. We proposed a novel layout algorithm used
in CComViz for informatively rearranging the record display order. With these
techniques, we are able to visualize data stability, data flow, density distribution and
hierarchy, and data correlation at the record, at the cluster and at the dimension levels
within a single graphical interactive display.
2
BACKGROUND
Partition stability metrics characterize diverse partition stability analyses. In terms of
distance, correlation or similarity measure between partitions or clusters to be compared,
partition stability metrics are divided into partition-based metrics and cluster-based
metrics. There are totally four levels of partition stability, which are respectively called
record stability, group stability, dimension stability and dataset stability. Record stability
measures the record’s tendency to follow a trend. Group stability measures the stability of
a cluster or a partitioning group across different partitions. And dimension stability is
overall difference of a partition from others. Finally, dataset stability reflects the dataset’s
complexity indicating the chance of data pattern existence with underlying selection of
partitions. Partition-based stability metrics include only metrics at the dimension and the
dataset levels, and are often used for estimating the number of clusters [1] [2] [3] [4].
Some external and relative criteria [5] used in cluster analysis, including Rand index [6],
Fowlkes-Mallows index [7] [8] and Hubert’s  statistic [5], can be used to define
partition-based stability metrics. Cluster-based stability metrics consist of metrics at all
levels. In general, any cluster distance, correlation or similarity measures are options for
defining cluster-based stability metrics. These options include F-measure [9], label
distance, label correlation coefficient [11], cluster similarity [10]. More cluster distance
measures can be found in the references [12] [13].
Information theory [14] is traditionally used to measure the amount of shared information
between two variables. Strehl and Ghosh [15], Fred and Jain [16] adopted mutual
information index from information theory and defined two consistency measures applied
to a pair of partitions. These measures can be further used for partition-based stability
metrics. A study using cluster-based stability metrics for perturbation-based cluster
analysis was conducted by Kaze and Grinstein [11]. Perturbation-based cluster analysis is
a kind of partition stability analysis, which investigates the response of various clustering
methods to random noise added to input data in order to find out the most stable
clustering through a large quantity of experiments. To support this technique, the authors
defined stability metrics at the record, the cluster, and the dimension levels. The
calculations of these stability measures are based on comparisons between the original
(unperturbed) clustering and every perturbed clustering. These metrics require the setting
of number of clusters to be the same in every perturbation run, which equal to the number
of clusters of the unperturbed clustering. In addition, the correspondent cluster pairs
between unperturbed and perturbed clustering results, known as label correspondence
problem [15] [18], need to be recognized. For perturbation-based cluster analysis, Bittner
et al. [17] presented a partition-based stability analytics, called WARP (Weighted
Average Discrepant Pairs). FOM (Figure of Merit) is another kind of partition stability
analysis, proposed by Yeung et al. [22]. This technique is analogous to leave-one-out
validation. In every run, one dimension is left out and serves as the validation dimension.
Clustering is applied to the remaining dataset. And a comparison then takes place
between the validation dimension and the clustering result. As every dimension is chosen
as the validation dimension, the average similarity is able to measure the partition
stability.
Visualization provides an intuitive and interactive means to help partition stability
analysis. Traditionally, interactive comparison of multiple partitions, projected in
multiple visualization instances and arranged side by side, is a straightforward approach
to perform partition stability analysis. However due to the limitations of computer screen
space, as well as human short-term memory, the comparison with multiple graphic
displays often cause efficiency problem. Apparently, a single graphic interactive display
is a solution to address this problem. Chen et al. [1] proposed a tree graph to express the
similarities and hierarchical grouping relationships of various clustering results.
Expression Profiler [23] visualization package developed at EBI (European
Bioinformatics Institute) presents a method for comparing various clustering results in
either flat or hierarchical structures. For the comparison of two flat clustering results, a
bipartite graph is used to express their relationships in [23]. Each side of the bipartite
graph is a group of a clustering result. An algorithm is applied to rearrange the original
cluster order, such that some clusters are merged into “superclusters” nodes and the
number of crossing edges is minimized. After node arrangement and merge, the weight of
each edge between two nodes is proportional to the number of overlapped records in both
nodes. SM (Stable Matrix) visualization tool, proposed by Cvek [24], visualizes a
symmetric 2D matrix in heatmap. The element of this matrix is calculated by SM (i, j) =
nij / N; where N is the total number of clustering results involved in the comparison, and
nij is the number of clustering results which recognize the pair of records (i, j) staying in
the same cluster. The record index order in SM is carefully arranged by using an
algorithm, so that the record stability statuses and relationships are well visually
presented. Parallel Sets [25] and Interactive Sankey Diagrams [26] are variants of parallel
coordinates [27]. Unlike parallel coordinates, each record in these visualizations has a
unique position on each axis regardless of its value. When record display order on each
axis is appropriately arranged, these visualization tools work well for displaying data
flows, density distributions and hierarchies of multiple variables. In order to reduce
clutter and occlusion, a set of rules are defined in these tools based on grouping and
sorting.
3
PARTITION STABILITY METRICS
In order to perform mutual comparison and evaluation of multiple partitions, we
developed a set of cluster-based partition stability metrics at the record, the cluster and
the dimension levels. For illustrating the calculation of these metrics, we created a
synthetic dataset, as shown in Table 1, which contains three categorical dimensions.
Table 1 Synthetic Dataset
Record
C1
C2
Index
0
c11
c22
1
c12
c22
2
c11
c21
3
c11
c22
4
c11
c22
5
c12
c22
6
c11
c21
7
c11
c21
8
c12
c21
9
c13
c22
10
c13
c22
11
c12
c21
12
c13
c22
13
c13
c22
14
c13
c21
15
c13
c21
C3
c31
c32
c33
c31
c33
c33
c33
c33
c32
c31
c32
c33
c33
c32
c32
c31
3.1
Record Stability
For a record with index k and a given set of partitions N = {C1, C2, …, Cn}, cmF(k)
represents a partitioning group of Cm and contains the record k, i.e. cmF(k)  Cm, Cm  N,
k  cmF(k), the record stability for record k, RS(k), is defined.
RS(k) =
2
n
n
n (n -1)
∑ i = 1 ∑ j = i +1 D(ciF(k), cjF(k))
In above equation, D(ca, cb) is called D measure, which may be a reciprocal distance,
correlation or similarity measure between partitioning group ca and cb. This measure is
symmetric, i.e. D(ca, cb) = D(cb, ca). As we reviewed in the background, a variety of
metrics can serve as this measure. In this paper, we adopt the label correlation coefficient
[11], which is defined as
corCl (ca, cb) =
| ca  cb |
 | ca | * | cb |
Below is an example from the synthetic dataset to show the calculation. Considering 0 
c11, 0  c22, 0  c31, the label correlation coefficient for record with index 0 is calculated.
RS(0) =
2
(corCl (c11, c22) + corCl (c11, c31) + corCl (c22, c31) )
3 *2
=
1
3
(
3
6*8
+
2
 6*4
+
3
 8*4
) = 0.44
Table 2 lists all RS values for the synthetic dataset.
Table 2 Record Stability Value
Record
RS
Index
0
0.44
1
0.41
2
0.55
3
0.44
4
0.47
5
0.36
6
0.55
7
0.55
8
0.39
9
0.48
10
0.51
11
0.44
12
0.36
13
0.51
14
0.4
15
0.3
Various D measures, although diverse definitions, share a common feature for which the
higher the D measure value, the higher the proportion of records whose dimension values
fall into the paired partitioning groups. As a result, these high proportional records form a
pattern. So, if the average D measure value of multiple partitioning groups which a
specific record falls into is high, we could conjecture this record has high tendency to
follow a trend across the multiple partitions. In reverse situation, if no patterns are
associated with a specific record, this record could be considered as an outlier or anomaly.
The range of RS depends on the range of D measure. In the case of corCl serving as the D
measure, RS is between (0, 1]. Its value for a record will be 1 when all partitioning
groups which the record falls into are the same.
RS value is a mutual comparison result. From a statistical point of view, this measure
reflects this record’s stability feature in overall.
3.2
Group stability
Group stability, GS, is the average record stability of a partitioning group.
GS (c) =
∑k  c RS(k)
|c|
Table 3 contains GS results for each portioning group of the synthetic dataset.
Table 3 Partitioning group and group stability
Group
Members
GS
{0,2,3,4,6,7}
0.5
c11
{1,5,8,11}
0.4
c12
{9,10,12,13,14,15}
0.43
c13
{2,6,7,8,11,13,14,15}
0.45
c21
0.44
{0,1,3,4,5,9,10,12}
c22
0.42
{0,3,9,15}
c31
{1,8,10,13,14}
0.44
c32
{2,4,5,6,7,11,12}
0.47
c33
3.3
Relative Dimension Stability
Given two partitions, Ca and Cb, their relative partition stability, RPS, is defined.
RPS (Ca , Cb) =
1
m
∑m
k = 1 D(caF(k), cbF(k))
where m is the record count of the dataset. Due to the symmetric property of D measure,
RPS is also symmetric.
3.4
Dimension Stability
Dimension stability is defined.
1
∑m
∑ ni = 1, i ≠u D(cuF(k), ciF(k))
PS (Cu) =
k=1
m (n-1)
When corCl serves as the D measure, RDS and DS values for the synthetic dataset are
calculated and shown in Table 3.
Table 3 PRS and PS results
Partition RPS (Ci ,C1) RPS (Ci ,C2)
C1
1
0.43
C2
0.43
1
C3
0.51
0.45
RPS (Ci ,C3)
0.51
0.45
1
PS (Ci)
0.47
0.44
0.48
3.5
Dataset Stability
Dataset stability is defined.
DS =
1
n
∑ ni = 1 PS( Ci )
In later section, we will see partition stability metrics defined above are not only the
measures of data feature but also the criteria for achieving visual aesthetics in our new
developed visualization tool.
4
CCOMVIZ VISUALIZATION AND LAYOUT ALGORITHM
In order to seek a visualization tool more suitable for partition stability analysis, we
developed CComViz (Cluster Comparison Visualization) visualization. CComViz
utilizes the same data representation model as that of Parallel Sets [25] and Interactive
Sankey Diagrams [26]. Fig. 2 depicts how data are projected in CComViz using the
synthetic dataset. In this figure, the numbers within boxes represent record indices.
c31
c32
c33
C3
C1
C2
0
3
9
15
1
8
10
13
14
2
4
5
6
7
11
12
0
2
3
4
6
7
1
5
8
11
9
10
12
13
14
15
2
6
7
8
11
14
15
0
1
3
4
5
9
10
12
13
c11
c12
c13
c21
c22
Fig. 1 Illustration of CcomViz projection
Like parallel coordinates [27], CComViz lays dimension axes out in parallel. But
these axes are not coordinates. Every axis is an expression of individual record display
order and represented as a set of continuous color bars. Each color bar corresponds to a
partitioning group with its length scaled according to its density in dimension. In
CComViz, data items are also represented as polylines that link their points on every axis.
The point position of each data item on axis is determined by its rank in the record
display order. Since each data item has a unique rank on axis, no more than one data item
is projected to the same point on each axis (without considering the rounding of
projection position to screen pixel). Therefore, no more than one data item is projected to
the same polyline. This feature benefits in particular categorical dimensions in contrast to
regular parallel coordinates, in which data items with the same category value are
projected to the same point on axis. In CComViz, one of the most important visual
properties is called hot dimension. The hot dimension, like C3 in Fig. 1, is a selected
dimension by which all record lines are colored. Any active dimension can be selected as
the hot dimension. Choosing different hot dimension provides a convenient way to view
how records in a group of one dimension fall into groups of other dimensions. In this way,
data flow, density distribution and hierarchy starting from different dimension can be
easily explored.
Unlike parallel coordinates, CComViz focuses on achieving visual analytics
rather than geometrical data mapping. The observation of geometrical data properties in
CComViz may be disabled, meaning that the position of record projection has no direct
connection to its value. Although this may inadvertently impact on numerical dimensions,
by means of data and tool linking mechanism, such as under the UVP (Universal
Visualization Platform) [28], this disadvantage can be compensated through using other
visualization tools. As a matter of fact, parallel coordinates and CComViz share many
common characteristics and interfaces. Using both visualization tools in an interactive
and linked environment can generate a strong and unified power for comparative cluster
analysis.
In CComViz, the feature that no more than one record is projected to the same
point on each axis avoids record overlap. But clutter due to crossing remains a concern.
As seen in Fig. 1, without proper record display order rearrangement, the severe edge
crossings make it difficult to observe data patterns and data relationships. This problem
has to be solved. Otherwise, this data representation model is useless for most real-world
datasets. Beyond that, from a visualization point of view, visually representing data with
less clutter is not enough. Data representation should amplify human cognition [29].
Now that CComViz is intended to display data stability, data flow, density distribution
and hierarchy across multiple partitions, the record and dimension display order needs to
enhance readability of these data features so cognition can be amplified.
Information visualization is fundamentally dependent upon the properties of
human perception. According to visualization study [29], grouping, sorting, and
positioning are among the most effective ways to support human perceptual inference,
enhance pattern detection, as well as reduce information search time and memory usage.
Another influential factor in human perceptual inference is related to semantics. As
CComViz is designed to distinguish stable and unstable records, we expect their visual
appearances to be different. Relying on semantic associations, to make stable records
look smooth and unstable records fluctuate can help with human perceptual inference.
In order to achieve the matching between data representation model and human
perception model, we brought above concepts into a methodology in developing a
sophisticated layout algorithm for record and dimension display order rearrangement in
terms of selection of the hot dimension and partition stability measures. The mechanism
behind the methodology is based on consideration and utilization of user perception
modeling, partition stability measurement, and interactive functions involved in the
course of data exploration and pattern recognition. The methodology used in the layout
algorithm makes CComViz an ideal tool to visualize data stability, data flow, density
distribution and hierarchy, as well as data correlation across a number of partitions within
a single graphical display.
In CComViz, the process for rearranging record and dimension display order is in
a pipeline sequence. The computation of record display order rearrangement is based on
dimension display order. Any change in adding, removing or reordering dimensions will
cause a change in the record display order.
4.1
Dimension Display Order Rearrangement
For partition comparison and evaluation, it is important to identify distinctions
and variations between similar partitions. The similarity is defined by different metrics,
such as clustering criteria and partitioning consistency. Since similar partitions have a
high degree of partitioning consistency, there should be fewer crossing lines if similar
partitions are adjacent and the display orders of partitioning groups along their axes
maintain a correspondent relationship (To be discussed later). Therefore, from visual
aesthetics point of view, putting similar partitions together assists the observation of
outliers or odd records while from human visual perception point of view, it facilitates
comparison and perceptual inference.
As the layout of dimension axes in CComViz is a linear sequence, there are two
approaches to dimension display order rearrangement, which in CComViz is an optional
function. By default, the dimension display order is arranged by users to achieve
flexibility.
4.1.1
Sorting Approach
Cluster evaluation through sorting by criteria is common in comparative cluster
analysis. CComViz provides a natural way to visualize sorted clustering results.
In general, any kind of clustering criteria or clustering quality measures can be used for
sorting. Due to their varying characteristics, the comparison results can be either
consistent or inconsistent. For example, data stability pattern may not be observed better
through sorting by internal criteria than by external or relative criteria, since internal
criteria are not intended to measure data stability. However, it is still worthwhile to
project CComViz in different sorting orders to explore partitioning features and verify
partitioning reliability from various perspectives.
As one of CComViz projection goals is to reduce crossing, the stability-based
criteria, such as our proposed PS metric, are ideal for that purpose. Regarding the
synthetic dataset displayed in Table 1, through sorting by PS, the dimension display order
is <C3, C1, C2>.
4.1.2
Sequential Selection Approach
Sequential selection is a process for rearranging the dimension display order by
using relative partition similarity measure. Once the first dimension which the user thinks
is the best or the most interesting in terms of certain criteria is identified, a dimension that
is the most similar to the current dimension among all unselected candidate dimensions is
chosen as the immediate successive dimension. Our proposed RPS is one of these kinds
of similarity measures.
In the synthetic dataset, when C3, which has higher PS value over others, is
chosen as the first dimension, the dimension display order by sequential selection using
RPS remains to be < C3, C1, C2>.
The dimension display order rearrangement is capable of improving not only visual
aesthetics but also comparison results. Because stability-based partition comparison and
evaluation work in voting fashion, when too many inferior partitions or divergent
partitions are involved in a comparison, the results will not be reliable. By visually
checking partition quality or similarity, and throwing inferior or divergent partitions away
from the sorted sequence, the results of the next run on the trimmed series of partitions
will be improved. In summary, a proper dimension display order makes users focus more
on high quality or interesting partitions when many partitions are involved in comparison
and evaluation. It also leads to progressive refinement of comparison results.
4.2
Record Display Order Rearrangement
To some extent, record display order rearrangement in CComViz is similar to the
Bipartite Graph Crossing Minimization (BGCM) problem [30]. One of the BGCM
algorithms, called gravity-center algorithm, has been successfully used in Expression
Profiler visualization tool package [23] for reducing crossing when two clustering results
are compared. However, a distinction exists: BGCM algorithms deal with one pair of
partitions, whereas record display order rearrangement in CComViz deals with unlimited
number of partitions. Our methodology does not directly treat record display order
rearrangement as a BGCM problem; instead it implicitly achieves crossing reduction. As
discussed earlier, reducing crossing is not the only goal of CComViz visualization. Other
goals include facilitating observation of data stability, density distribution and hierarchy,
as well as other interesting relationships. To achieve these goals, we take the grouping
and reordering approach to developing a layout algorithm that enforces the formation of
record projections in envelope (band) shapes with fewer crossings [25] [31] [32].
This algorithm contains four steps, which involves record grouping, group
matching, group reordering, record sorting, and linked list intersection and concatenation
operations. The second and the third steps need information about record stability and
group stability. The record display order depends on the selection of dimensions to be
compared, the selection of the hot dimension, as well as the dimension display order. We
now use the synthetic dataset to illustrate these steps with the dimension display order <
C3, C1, C2> and the setting of C3 as the hot dimension
The first step is to group each dimension by its own categorical values, as we see
in Fig 1. In the second step, the groups of the hot dimension are first reordered by
following the descending order of their group stability values. For the synthetic dataset,
since GS(c33) > GS(c31) > GSA(c32), the group display order of C3 is < c33, c31, c32 > (top
to bottom). And then starting from the hot dimension, the groups of other dimensions are
sequentially reordered to both ends in terms of group matching relationships. The group
matching relationships specify a set of group pairs. Each pair is considered as the most
similar groups of adjacent dimensions by D measure. Each group can only be paired once
with one of its adjacent dimension groups. For the synthetic dataset, the construction of
the group matching relationships is illustrated in Table 4 and Table 5. Note that the pair
matching operations proceed by following the corCl measure order. After the matching,
the relationships are expressed as c33c11, c32c13, c31c12, c13c22, c11c21.
Once the group matching relationships are built, the group reordering is then conducted.
During the group reordering, a rule is applied for a situation where the number of groups
of a dimension is larger than that of the preceding dimension. In this situation, the group
order of the remaining groups of this dimension follows the descending order of their
group stability values. Fig. 2 shows the group reordering result for the synthetic dataset.
Since the hot dimension C3 is the first dimension, no sequentially backward reordering is
applied.
Table 4 Group Matching between C3 and C1
C3
C1
corCl 
c33
c11
0.62
c32
c13
0.55
c32
c12
0.45
c11
c31
0.41
c31
c33
c33
c31
c32
c13
c12
c13
c12
c11
0.41
0.38
0.15
0
0
Table 4 Group Matching between C1 and C2
C1
C2
corCl 
c13
c22
0.54
c11
c21
0.46
c11
c22
0.41
c21
c12
0.38
c22
c12
0.33
c13
c21
0.31
GS
0.47
0.44
0.42
C3
C1
C2
2
4
5
6
7
11
12
0
3
9
15
1
8
10
13
14
0
2
3
4
6
7
1
5
8
11
9
10
12
13
14
15
2
6
7
8
11
14
15
0
1
3
4
5
9
10
12
13
c31
c32
c33



c11
c12
c13


c21
c22
Fig. 2 Step 2: Group reordering, starting
from the hot dimension to both ends
The third step contains two kinds of operations. First, the records of the last dimension
(rightmost) are reordered within each group in descending order by record stability. And
then starting from the dimension prior to the last dimension backwards to the hot
dimension, the record orders of other dimensions are sequentially reordered by
performing a series of linked list intersection and concatenation operations. A linked list
intersection operation returns the overlapped elements of two or more linked lists with
the order following the first linked list. With these operations, the new record order of
each current dimension’s group is determined by
cum =
Where

i 
 ( cvi  c’um )
 is the linked list intersection operator and

 is the linked list

concatenation operator. cum and c’um denote the record orders of group cum after and
before reordering respectively. v is the dimension right next to u. The accumulated
operations follow the group order of dimension Cv starting from top to bottom. For
example, the new order of c11 is:
c11 =


( c21 
 c’11)  ( c22  c’11)
= ({2,6,7,11,14,8,15}
= {2,6,7}
 {0,2,3,4,6,7}) 
 ({10,13,9,4,0,3,1,5,12} 
 {0,2,3,4,6,7})

 {4,0,3} = {2,6,7,4,0,3}

Fig. 3 shows the reordering results after the third step.
Sequential recording
c31
c32
c33
C3
C1
C2
RS
2
6
7
4
11
5
12
0
3
15
9
8
1
14
10
13
2
6
7
4
0
3
11
8
1
5
14
15
10
13
9
12
2
6
7
11
14
8
15
10
13
9
4
0
3
1
5
12
0.55
0.55
0.55
0.44
0.41
0.39
0.3
0.51
0.51
0.48
0.47
0.44
0.44
0.41
0.36
0.36
c11
c12
c13


c21
c22
Fig. 3 Step 3: Record reordering from
the last dimension to the hot dimension
The fourth step takes the same linked list intersection and concatenation operations, but
starts from the hot dimension to both ends. For the synthetic dataset, since the hot
dimension C3 is the first dimension, no sequential reordering to the front end is needed.
Fig. 4 displays the final result after all reordering.
Sequential recording
C3
C1
C2
2
6
7
4
11
5
12
0
3
15
9
8
1
14
10
13
2
6
7
4
0
3
11
5
8
1
12
15
9
14
10
13
2
6
7
11
8
15
14
4
0
3
5
1
12
9
10
13
c11
c12
c13
c31
c32
c33
c21
c22
Fig. 4 Step 4: Record reordering from
the hot dimension to both ends
c31
c32
c33
C3
C1
C2
2
6
7
4
11
5
12
0
3
15
9
8
1
14
10
13
2
6
7
4
0
3
11
5
8
1
12
15
9
14
10
13
2
6
7
11
8
15
14
4
0
3
5
1
12
9
10
13
c11
c12
c13
c21
c22
Fig. 5 Final result in CComViz
Finally, we add record lines on the final reordering result (Fig. 5). In contrast to Fig. 1,
Fig. 5 more clearly reveals information which is hidden before. For example, in Fig. 5,
the hierarchies related to records 11, 5 and 12 are easily seen; Smooth record lines for
records 2, 6, 7 10 and 13 well reflect their high stability. In the layout algorithm, we do
not explicitly use any target function to reduce crossing. But as we see from the result,
crossing reduction is implicitly achieved. Behind the mechanism is the use of partition
stability metrics. The computational complexity of this algorithm also reaches its lower
bound, which is linear to record size. The efficiency of this solution apparently
outperforms explicit crossing reduction solutions.
In next section, we apply this layout algorithm to a subset of breast cancer patient
data, which is used in our collaborative project with Massachusetts General Hospital,
and generate a set of CComViz plots to demonstrate the effectiveness of above
techniques, as well as a number of important interactive functions implemented in
CComViz.
5
CASE STUDY AND CCOMVIZ INTERACTIVE FUNCTIONS
The breast cancer dataset used in this case study contains 1210 patient records,
three individual risk score dimensions from three breast cancer prediction models (Gail
[54], Myriad model [33] [34], BRCAPRO [35]), two clinical test measure dimensions
(atypical hyperplasia in biopsy and average mammogram density [19] [20]), cancer status
dimension, and four personal attribute dimensions (menarche age, age of first birth, bra
size, weight). For cluster analysis, we selected the three risk score and two clinical test
measure dimensions, and employed EM (Expectation Maximization), KM (Kmeans), XM
(Extended Kmeans) and DB( Density based clustering) clustering algorithms
implemented in Weka project to generate four clustering results. The number of clusters,
which is 6, in the subspace with five dimensions selected was estimated by a automatic
function in Weka EM.
DB6
KM6
XM6
EM6
Cancer
DB6
KM6
XM6
EM6
Cancer DB6
KM6
XM6
EM6
Cancer




C5DB
C3DB
C0DB
C4DB
C2DB
C1DB
C5KM
C3KM
C0KM
C4KM
C2KM
C1KM
* DA: Diagnosed
(a) All
C3XM
C1XM
C4XM
C2XM
C0XM
C5XM
C3EM
C0EM
C2EM
C1EM
C4EM
C5EM
NR
DA
NR: Not Report
C5DB
C3DB
C0DB
C4DB
C2DB
C1DB
C5KM
C3KM
C0KM
C4KM
C2KM
C1KM
C3XM
C1XM
C4XM
C2XM
C0XM
C5XM
: Low risky group
(b) RS > 0.488
C3EM
C0EM
C2EM
C1EM
C4EM
C5EM
NR
DA
C5DB
C3DB
C0DB
C4DB
C2DB
C1DB
C5KM
C3KM
C0KM
C4KM
C2KM
C1KM
C3XM
C1XM
C4XM
C2XM
C0XM
C5XM
C3EM
C0EM
C2EM
C1EM
C4EM
C5EM
: High risky group
(c) RS < 0.372
Fig 6 Display data flows and hierarchies with optimal dimension reordering in sequential
sorting.
Fig. 6 displays the 1210 patient records with the four clustering results and the cancer
status dimensions selected in CComViz. In terms of the dimension selection, RS value
for each record is calculated. Fig. 6b and Fig. 6c highlight the record selections with RS
thresholds. Since big bands intently correspond to records with high RS values, from
Fig.6b, we can easily distinguish a few of stable sub-clusters from the four clustering
results. These sub-clusters statistically give us confidence to make more reliable
inference. For instance, based on the observation of distribution, we can quickly infer
some high or low risky populations marked with red (high risk) or green (low risk)
triangles in Fig.6b. On the other hand, unstable records are represented by small and
fluctuating bands (Fig. 6c), which help us quickly identify outliers or data items which
need to be verified. As the dimension display order is sorted, Fig. 6 also conveys
additional information about the stability ranking of the four clustering results.
NR
DA
To facilitate data exploration and observation, CComViz provides a number of important
interactive functions. These functions include:
5.1
Changing Hot Dimension
As we mentioned earlier, changing the hot dimension during data exploration
provides an easy way to investigate data flows, density hierarchies and distributions
starting from different dimensions. As opposed to Fig. 6, Fig. 7 chooses XM6 as the hot
dimension and provides snapshot of data flow from another angle. The operation of
changing hot dimension in CComViz is very easy. It can be a simple click of mouse on
the designated dimension axis. Since record display order rearrangement relies on the
selection of the hot dimension, the rearrangement is recomputed once the hot dimension
is changed.
DB6
C5DB
C3DB
C0DB
C4DB
C2DB
C1DB
KM6
XM6
C5KM
C3KM
C0KM
C4KM
C2KM
C1KM
* DA: Diagnosed
EM6
C3XM
C1XM
C4XM
C2XM
C0XM
C5XM
Cancer
C3EM
C0EM
C2EM
C1EM
C4EM
C5EM
NR
DA
NR: Not Report
Fig 7 Change hot dimension
5.2
DB6
KM6
C5DB
C3DB
C0DB
C4DB
C2DB
C1DB
XM6
C5KM
C3KM
C0KM
C4KM
C2KM
C1KM
* DA: Diagnosed
EM6
C3XM
C1XM
C4XM
C2XM
C0XM
C5XM
Cancer
C3EM
C0EM
C2EM
C1EM
C4EM
C5EM
NR
DA
NR: Not Report
Fig 8 Change group rendering order
Changing Group Rendering Order
For most real-world datasets, record occlusion in CComViz can not be avoided
even with the help of proper dimension and record display order rearrangement. Under
certain circumstances, transparent drawing provides helps for small datasets. But for
large and complex datasets, visual aesthetics in transparent drawing can not be achieved.
In this situation, the control of group rendering orders offers an ideal and flexible solution.
With this control, the most user-interested partitioning group, which is called hot group,
is rendered in the front. The selection of a hot group is made by clicking the mouse on the
group bar. Note that the hot group is a group of the hot dimension. As the hot group is
selected, the hot dimension also selected. The usefulness of changing group rendering
order can be seen by comparing Fig. 8 and Fig. 7.
5.3
Performing Boolean Operation for Record Selection
Boolean operations AND, OR, NAND, and NOR, as well as their combination are
available in CComViz for record selections. Once record selection bars are draw and
Boolean operation formula is made, the records in the result set are highlighted. As
shown in Fig. 11, the record selection is made by AND operation on selected record
subsets A and B.
DB6
KM6
XM6
EM6
C5KM
C3KM
C0KM
C4KM
C2KM
C1KM
* DA: Diagnosed
C3XM
C1XM
C4XM
C2XM
C0XM
C5XM
C3EM
C0EM
C2EM
C1EM
C4EM
C5EM
NR
DA
NR: Not Report
Fig 9 Boolean operation for
record selection
5.4
EM6
KM6
DB6
XM6
Cancer
B
A
A
C5DB
C3DB
C0DB
C4DB
C2DB
C1DB
Cancer
Customizing Dimension Display Order
C3EM
C0EM
C2EM
C1EM
C4EM
C5EM
C5KM
C3KM
C0KM
C4KM
C1KM
C2KM
* DA: Diagnosed
C5DB
C3DB
C0DB
C4DB
C1DB
C2DB
C3XM
C1XM
C4XM
C2XM
C5XM
C0XM
NR
DA
NR: Not Report
Fig 10 Display data flows and
hierarchies in custom dimension
order
Although automatic sorting or optimal reordering of dimensions helps reducing
line crossing, custom dimension display order is often needed for data exploration. Fig.
10 gives another view of the breast cancer dataset in different dimension display order
from Fig. 8. This view facilitates the investigation of EM6 clustering result.
In addition to interactive functions discussed above, CComViz is also able to handle
numeric data dimension by binning, choose options of D measures for stability analysis
and partition quality measures for dimension display order sorting, change record
reordering in different modes including silhouette index sorting and previous order
freezing etc. Due to the limitation of paper space, we skip their introduction in this paper.
For continuing the case study, we now project the four personal attribute dimensions and
cancer status dimension as well. In Fig. 11, the dimension display order is rearranged by
using sequential selection approach. This order suggests that AgeFirstBirth and
MenarcheAge are more similar to cancer status than Weight and BraSize dimensions. It is
not surprised to see a large number of crossings and branchings in Fig. 9a, which make it
difficult to draw inference even though with the help of dimension and record display
order rearrangement since no evidences show these four personal attributes are firmly
effective risk factors in entire population. However, when selecting stable records to a
degree, some high or low risky populations, as seen in Fig. 9b, are discerned. The visual
results suggest the bra size might be a risk factor in these populations and deserves
further investigation.
Through this case study, we demonstrate how CComViz can be effectively used to draw
inference through the aggregation of multiple information sources. Behind the success is
the development of metaphors that stable records are expressed within wide and smooth
bands while unstable records are within narrow and fluctuating strips. The statistical
robustness of these metaphors along with the graphic layout algorithm makes CComViz
visualization distinctive from any others.
6
CONCLUSION
In visual analytics field, it is very challenging to achieve visual aesthetics by
using metaphors based on statistics. In this paper, we proposed a suite of techniques
addressing mutual comparison of multiple partitions to meet this challenge. These
techniques include a set of partition stability metrics and CComViz visualization with the
enhancement of a sophisticated graphic layout algorithm. The partition stability metrics
are defined at the record, the group, the dimension and the dataset levels. These metrics
are not only used to measure data feature at different level, and also help graphic layout
in CComViz for implicitly reducing crossing lines. CComViz is a highly interactive
visual analytic tool to visualize data stability, data flow, density distribution and
hierarchy, and data correlation at the record, the group and the dimension levels within a
single graphical interactive display. In order to achieve visual aesthetics and reduce
crossing lines, we presented a novel methodology used to develop a layout algorithm for
informatively rearranging the order of the records and dimensions in CComViz. Our
proposed techniques can be extensively used to identify near-optimal structure, build
ensembles, or conduct validation.
Cancer
AgeFirstBirth
Not Report
Diagnosed
20 - 30
NA
> 34
< 20
31 - 34
MenarcheAge
11 - 15
NA
< 11
> 15
Weight
141 - 180
NA
>180
101 – 140
<100
(a) All Records
Cancer
AgeFirstBirth
MenarcheAge
Weight
BraSize
#38
NA
#42
#34
<#32
#36
#40
#32
#44
>#44
BraSize




Not Report
Diagnosed
: Low risky group
20 - 30
NA
> 34
< 20
31 - 34
11 - 15
NA
< 11
> 15
: High risky group
(b) RS > 0.425
141 - 180
NA
>180
101 – 140
<100
#38
NA
#42
#34
<#32
#36
#40
#32
#44
>#44
Fig 11 Personal attribute dimensions vs. cancer status dimension
REFERENCES:
1. G. Chen, Saied A. Jaradat, Nila Banerjee,Tetsuya S. Tanaka, Minoru S. H. Ko and
Michael Q. Zhang (2002), Evaluation And Comparison Of Clustering Algorithms In
Anglyzing Es Cell Gene Expression Data, Statistica Sinica 12(2002), 241-262
2. A. Ben-Hur, A. Elisseeff, and I. Guyon, A stability based method for discovering
structure in clustered data, in Pasific Symposium on Biocomputing, vol. 7, 2002, pp.
6--17
3. Francisco Azuaje, Nadia Bolshakova , Chapter 13: Clustering Genomic Expression
Data: Design And Evaluation Principles, Clustering Genome Expression Data:
Design and Evaluation Principles”, in Understanding and Using Microarray Analysis
Techniques: A Practical Guide, London: Springer Verlag, 2002
4. Julia Handl, Joshua Knowles and Douglas B. Kell, Computational cluster validation
in post-genomic data analysis, bioinformatics Vol. 21 no. 15 2005, pages 3201–3212
5. S. Theodoridis, K. Koutroumbas, Pattern recognition, 1999,San Diego, CA:
Academic Press
6. Rand,W.M. (1971) Objective criteria for the evaluation of clustering methods. J. Am.
Stat. Assoc., 66, 846–850.
7. E. Fowlkes and C. Mallows. A method for comparing two hierarchical clusterings.
Journal of the American Statistical Asociation, 78, 1983.
8. Y. Batistakis M. Halkidi and M. Vazirgiannis. On clustering validation techniques.
Journal of Intelligent Information System, 17:2/3, 2001.
9. A van Rijsbergen. Information Retrieval, second edition. Butterworths, 1979
10. J. Seo and B. Shneiderman, A Rank-by-Feature Framework for Unsupervised
Multidimensional Data Exploration Using Low Dimensional Projections, IEEE
Symposium on Information Visualization 2004
11. Simon Katz and Georges Grinstein, Perturbation Methods for Cluster Analysis.
Technical Report, Umass Lowell, 2006
12. Brian Everitt. Cluster analysis. Halsted Press, New York, 1974, 1993.
13. J.C. Bezdek, N.R. Pal, Some new indexes of cluster validity, IEEE Transactions on
Systems, Man and Cybernetics, Vol. 28, Part B, 1998, pp. 301-315
14. Kenneth Ward Church and Patrick Hanks. Word association norms, mutual
information, and lexicography, Proceedings of the 27th Annual Meeting of the
Association for Computational Linguistics, 1989.
15. A. Strehl and J. Ghosh, Cluster ensembles - a knowledge reuse framework for
combining multiple partitions, Journal of Machine Learning Research, vol. 3(Dec), pp.
583-617, 2002
16. Ana L. N. Fred, Anil K. Jain: Combining Multiple Clusterings Using Evidence
Accumulation. IEEE Trans. Pattern Anal. Mach. Intell. 27(6): 835-850 (2005)
17. M. Bittner et al., Molecular classification of cutaneous malignant melanoma by
gene expression profiling, Nature, Vol. 406, 2000, pp. 536-540
18. Behrouz Minaei-Bidgoli, Alexander Topchy and William F. Punch, A Comparison of
Resampling Methods for Clustering Ensembles, MLMTA 2004
19. H. Stalsberg et al., Human Cancer Histologic types of breast carcinoma in relation to
international variation and breast cancer risk factors. International Journal of Cancer,
Volume 44, Issue 3 , Pages 399 - 409
20. Byrne C et al. Biopsy confirmed benign breast disease, postmenopausal hormone use
of exogenous female hormones, and breast cancer risk. Cancer 2000 Nov 15 89 20462052.
21. M,H, Gail et al., Projecting individualized probabilities of developing breast cancer
for white females who are being examined annually. Biostatistics Branch, National
Cancer Institute, Bethesda, MD 20892.
22. K.Y. Yeung, D.R. Haynor, W.L. Ruzzo, Validating clustering for gene expression
data, Bioinfomatics, vol. 17, pp. 309-318,2001.
23. Aurora Torrente, Misha Kapusheski and Alvis Brazma, A new method for comparing
results from gene expression data clustering, ISMB/ECCB 2004
24. Urska Cvek, Visual and analytical tools for record level cluster analysis, ScD thesis,
Department of Computer Science, Umass Lowell, 2004
25. Fabian Bendix, Robert Kosara and Helwig Hauser,Parallel Sets: Visual Analysis of
Categorical Data, Proceedings of the Proceedings of the 2005 IEEE Symposium on
Information Visualization (INFOVIS'05) - Volume 00
26. Patrick Riehmann, Manfred Hanfler, and Bernd Froehlich, Interactive Sankey
Diagrams, In Proceedings of the IEEE Symposium on Information Visualization
(InfoVis 05), pp. 233-240, October 2005
27. A. Inselberg and B. Dimsdale. Parallel coordinates: A tool for visualizing multidimensional geometry. In Proc of IEEE Visualization Conference, pages 361-370,
1990
28. Alexander G. Gee, Hongli Li, Min Yu, Mary Beth Smrtic, Urska Cvek, Howie
Goodell, Vivek Gupta, Christine Lawrence, Jianping Zhou, Chih-Hung Chiang,
Georges G. Grinstein (2005). Universal Visualization Platform. In Robert F. Erbacher,
Philip C. Chen, Jonathan C. Roberts, Matti T. Gröhn, Katy Börner (Eds),
Visualization and Data Analysis 2005 (San Jose, California, January 17-18),
Proceedings of SPIE-IS&T Electronic Imaging, SPIE Vol. 5669, pp. 274 – 283
29. S.K. Card, J.D. Mackinlay and B. Shneiderman (Eds.). Readings in Information
Visualization: Using Vision to Think. San Francisco, California: Morgan Kaufmann.
1999
30. Lanbo Zheng, Le Song and Peter Eades. Crossing minimization problems of drawing
bipartite graphs in two clusters. ACM International Conference Proceeding Series;
Vol. 109, 2005
31. Y. H. Fua, M. O.Ward, and E. A. Rundensteiner. Hierarchical parallel coordinates for
exploration of large datasets. In IEEE Visualization, pages 43–50, 1999.
32. G. Andrienko and N. Andrienko. Parallel coordinates for exploring properties of
subsets. In Proceedings of the second IEEE International conference on coordinated
and multiple views in exploratory visualization, pages 93–104, 2004
33. Frank, T. S., Deffenbaugh, A. M., Reid, J.E, Hulick, M. Ward, B. E., Lingenfelter, B.,
Gumpper, K.L., Scholl, T. Tavtigian, S. V., Pruss, D. R., Critchfield, G.C. 2002.
Clinical Characteristics of Individuals With Germline Mutations in BRCA1 and
BRCA2: Analysis of 10,000 Individuals. Journal of Clinical Oncology, Vol 20, Issue
6: 1480-1490
34. Frank T.S, Manley S.A, Olopade O.I, et al. 1998. Sequence analysis of BRCA1 and
BRCA2: Correlation of mutations with family history and ovarian cancer risk. J Clin
Oncol 16:2417–2425
35. Pratt, K.B. and Tschapek, G. 2003. Visualizing concept drift, in Proceedings 9th
ACM SIGKDD Conference,Washington DC, ACM Press, New York NY,2003,
pp.735-740
Download