Visual Analytics in Data Stability and Data Flow Analysis Jianping Zhou, Georges Grinstein ABSTRACT Coupled visualization and analysis, a.k.a. visual analytics, has become a major methodology for data analysis and exploration. Effective visual analysis of various clustering results and categorical information within a single graphical interactive display is highly recommended. We describe a specific integration of analysis and visualization for partition stability evaluation. Partitions are decompositions of a dataset into a family of disjoint subsets. They may refer to the results of clustering, of groupings of categorical dimensions, of binned numerical dimensions, of predetermined class labeling dimensions, or prior knowledge structured in mutually exclusive format (one data item associated with one and only one outcome). Partition or cluster stability analysis identifies nearoptimal structure, builds ensembles, or conducts validation. We developed a set of partition stability metrics and a new visualization tool, CComViz (Cluster Comparison Visualization), for mutual comparison and evaluation of multiple partitioning of the same dataset. We've added a novel layout algorithm for informatively rearranging the order of the records and dimensions. With these techniques, we are able to visualize data stability, data flow, density distribution and hierarchy, and data correlation at the record, at the group and at the dimension levels within a single graphical interactive display. We provide an example application of CComViz. 1 INTRODUCTION Partition comparison and evaluation is a common task in comparative data analysis. In this context, partitions refer to the results of clustering in flat format, of groupings of categorical dimensions, of binned numerical dimensions, of predetermined class labeling dimensions, or prior knowledge structured in mutually exclusive format (one data item associated with one and only one outcome). Partition stability analysis deals with comparison and evaluation of multiple partitions in order to identify near-optimal structure, build ensembles, or conduct validation. Partition stability is a measure regarding the consistence of records within clusters or data categories. Stable records characterize the main features of their belonging clusters while unstable records indicate outliers and anomalies. The results of partition stability analysis often answer questions like Which data items are strongly clustered together? Which data items are marginally clustered? Is the clustering result obtained by a method better than other methods? What is the optimal number of clusters over multiple clustering runs with different settings? How consistent, or stable, are data memberships over multiple partitions? How data flow starting from one partition to others? How data volumes change? What are their density distributions and hierarchical relationships? The integration of analysis and visualization for partition stability evaluation has become a major methodology in exploratory data analysis. We have developed a set of partition stability metrics and a new visualization tool, CComViz (Cluster Comparison Visualization), to carry out this methodology. We proposed a novel layout algorithm used in CComViz for informatively rearranging the record display order. With these techniques, we are able to visualize data stability, data flow, density distribution and hierarchy, and data correlation at the record, at the cluster and at the dimension levels within a single graphical interactive display. 2 BACKGROUND Partition stability metrics characterize diverse partition stability analyses. In terms of distance, correlation or similarity measure between partitions or clusters to be compared, partition stability metrics are divided into partition-based metrics and cluster-based metrics. There are totally four levels of partition stability, which are respectively called record stability, group stability, dimension stability and dataset stability. Record stability measures the record’s tendency to follow a trend. Group stability measures the stability of a cluster or a partitioning group across different partitions. And dimension stability is overall difference of a partition from others. Finally, dataset stability reflects the dataset’s complexity indicating the chance of data pattern existence with underlying selection of partitions. Partition-based stability metrics include only metrics at the dimension and the dataset levels, and are often used for estimating the number of clusters [1] [2] [3] [4]. Some external and relative criteria [5] used in cluster analysis, including Rand index [6], Fowlkes-Mallows index [7] [8] and Hubert’s statistic [5], can be used to define partition-based stability metrics. Cluster-based stability metrics consist of metrics at all levels. In general, any cluster distance, correlation or similarity measures are options for defining cluster-based stability metrics. These options include F-measure [9], label distance, label correlation coefficient [11], cluster similarity [10]. More cluster distance measures can be found in the references [12] [13]. Information theory [14] is traditionally used to measure the amount of shared information between two variables. Strehl and Ghosh [15], Fred and Jain [16] adopted mutual information index from information theory and defined two consistency measures applied to a pair of partitions. These measures can be further used for partition-based stability metrics. A study using cluster-based stability metrics for perturbation-based cluster analysis was conducted by Kaze and Grinstein [11]. Perturbation-based cluster analysis is a kind of partition stability analysis, which investigates the response of various clustering methods to random noise added to input data in order to find out the most stable clustering through a large quantity of experiments. To support this technique, the authors defined stability metrics at the record, the cluster, and the dimension levels. The calculations of these stability measures are based on comparisons between the original (unperturbed) clustering and every perturbed clustering. These metrics require the setting of number of clusters to be the same in every perturbation run, which equal to the number of clusters of the unperturbed clustering. In addition, the correspondent cluster pairs between unperturbed and perturbed clustering results, known as label correspondence problem [15] [18], need to be recognized. For perturbation-based cluster analysis, Bittner et al. [17] presented a partition-based stability analytics, called WARP (Weighted Average Discrepant Pairs). FOM (Figure of Merit) is another kind of partition stability analysis, proposed by Yeung et al. [22]. This technique is analogous to leave-one-out validation. In every run, one dimension is left out and serves as the validation dimension. Clustering is applied to the remaining dataset. And a comparison then takes place between the validation dimension and the clustering result. As every dimension is chosen as the validation dimension, the average similarity is able to measure the partition stability. Visualization provides an intuitive and interactive means to help partition stability analysis. Traditionally, interactive comparison of multiple partitions, projected in multiple visualization instances and arranged side by side, is a straightforward approach to perform partition stability analysis. However due to the limitations of computer screen space, as well as human short-term memory, the comparison with multiple graphic displays often cause efficiency problem. Apparently, a single graphic interactive display is a solution to address this problem. Chen et al. [1] proposed a tree graph to express the similarities and hierarchical grouping relationships of various clustering results. Expression Profiler [23] visualization package developed at EBI (European Bioinformatics Institute) presents a method for comparing various clustering results in either flat or hierarchical structures. For the comparison of two flat clustering results, a bipartite graph is used to express their relationships in [23]. Each side of the bipartite graph is a group of a clustering result. An algorithm is applied to rearrange the original cluster order, such that some clusters are merged into “superclusters” nodes and the number of crossing edges is minimized. After node arrangement and merge, the weight of each edge between two nodes is proportional to the number of overlapped records in both nodes. SM (Stable Matrix) visualization tool, proposed by Cvek [24], visualizes a symmetric 2D matrix in heatmap. The element of this matrix is calculated by SM (i, j) = nij / N; where N is the total number of clustering results involved in the comparison, and nij is the number of clustering results which recognize the pair of records (i, j) staying in the same cluster. The record index order in SM is carefully arranged by using an algorithm, so that the record stability statuses and relationships are well visually presented. Parallel Sets [25] and Interactive Sankey Diagrams [26] are variants of parallel coordinates [27]. Unlike parallel coordinates, each record in these visualizations has a unique position on each axis regardless of its value. When record display order on each axis is appropriately arranged, these visualization tools work well for displaying data flows, density distributions and hierarchies of multiple variables. In order to reduce clutter and occlusion, a set of rules are defined in these tools based on grouping and sorting. 3 PARTITION STABILITY METRICS In order to perform mutual comparison and evaluation of multiple partitions, we developed a set of cluster-based partition stability metrics at the record, the cluster and the dimension levels. For illustrating the calculation of these metrics, we created a synthetic dataset, as shown in Table 1, which contains three categorical dimensions. Table 1 Synthetic Dataset Record C1 C2 Index 0 c11 c22 1 c12 c22 2 c11 c21 3 c11 c22 4 c11 c22 5 c12 c22 6 c11 c21 7 c11 c21 8 c12 c21 9 c13 c22 10 c13 c22 11 c12 c21 12 c13 c22 13 c13 c22 14 c13 c21 15 c13 c21 C3 c31 c32 c33 c31 c33 c33 c33 c33 c32 c31 c32 c33 c33 c32 c32 c31 3.1 Record Stability For a record with index k and a given set of partitions N = {C1, C2, …, Cn}, cmF(k) represents a partitioning group of Cm and contains the record k, i.e. cmF(k) Cm, Cm N, k cmF(k), the record stability for record k, RS(k), is defined. RS(k) = 2 n n n (n -1) ∑ i = 1 ∑ j = i +1 D(ciF(k), cjF(k)) In above equation, D(ca, cb) is called D measure, which may be a reciprocal distance, correlation or similarity measure between partitioning group ca and cb. This measure is symmetric, i.e. D(ca, cb) = D(cb, ca). As we reviewed in the background, a variety of metrics can serve as this measure. In this paper, we adopt the label correlation coefficient [11], which is defined as corCl (ca, cb) = | ca cb | | ca | * | cb | Below is an example from the synthetic dataset to show the calculation. Considering 0 c11, 0 c22, 0 c31, the label correlation coefficient for record with index 0 is calculated. RS(0) = 2 (corCl (c11, c22) + corCl (c11, c31) + corCl (c22, c31) ) 3 *2 = 1 3 ( 3 6*8 + 2 6*4 + 3 8*4 ) = 0.44 Table 2 lists all RS values for the synthetic dataset. Table 2 Record Stability Value Record RS Index 0 0.44 1 0.41 2 0.55 3 0.44 4 0.47 5 0.36 6 0.55 7 0.55 8 0.39 9 0.48 10 0.51 11 0.44 12 0.36 13 0.51 14 0.4 15 0.3 Various D measures, although diverse definitions, share a common feature for which the higher the D measure value, the higher the proportion of records whose dimension values fall into the paired partitioning groups. As a result, these high proportional records form a pattern. So, if the average D measure value of multiple partitioning groups which a specific record falls into is high, we could conjecture this record has high tendency to follow a trend across the multiple partitions. In reverse situation, if no patterns are associated with a specific record, this record could be considered as an outlier or anomaly. The range of RS depends on the range of D measure. In the case of corCl serving as the D measure, RS is between (0, 1]. Its value for a record will be 1 when all partitioning groups which the record falls into are the same. RS value is a mutual comparison result. From a statistical point of view, this measure reflects this record’s stability feature in overall. 3.2 Group stability Group stability, GS, is the average record stability of a partitioning group. GS (c) = ∑k c RS(k) |c| Table 3 contains GS results for each portioning group of the synthetic dataset. Table 3 Partitioning group and group stability Group Members GS {0,2,3,4,6,7} 0.5 c11 {1,5,8,11} 0.4 c12 {9,10,12,13,14,15} 0.43 c13 {2,6,7,8,11,13,14,15} 0.45 c21 0.44 {0,1,3,4,5,9,10,12} c22 0.42 {0,3,9,15} c31 {1,8,10,13,14} 0.44 c32 {2,4,5,6,7,11,12} 0.47 c33 3.3 Relative Dimension Stability Given two partitions, Ca and Cb, their relative partition stability, RPS, is defined. RPS (Ca , Cb) = 1 m ∑m k = 1 D(caF(k), cbF(k)) where m is the record count of the dataset. Due to the symmetric property of D measure, RPS is also symmetric. 3.4 Dimension Stability Dimension stability is defined. 1 ∑m ∑ ni = 1, i ≠u D(cuF(k), ciF(k)) PS (Cu) = k=1 m (n-1) When corCl serves as the D measure, RDS and DS values for the synthetic dataset are calculated and shown in Table 3. Table 3 PRS and PS results Partition RPS (Ci ,C1) RPS (Ci ,C2) C1 1 0.43 C2 0.43 1 C3 0.51 0.45 RPS (Ci ,C3) 0.51 0.45 1 PS (Ci) 0.47 0.44 0.48 3.5 Dataset Stability Dataset stability is defined. DS = 1 n ∑ ni = 1 PS( Ci ) In later section, we will see partition stability metrics defined above are not only the measures of data feature but also the criteria for achieving visual aesthetics in our new developed visualization tool. 4 CCOMVIZ VISUALIZATION AND LAYOUT ALGORITHM In order to seek a visualization tool more suitable for partition stability analysis, we developed CComViz (Cluster Comparison Visualization) visualization. CComViz utilizes the same data representation model as that of Parallel Sets [25] and Interactive Sankey Diagrams [26]. Fig. 2 depicts how data are projected in CComViz using the synthetic dataset. In this figure, the numbers within boxes represent record indices. c31 c32 c33 C3 C1 C2 0 3 9 15 1 8 10 13 14 2 4 5 6 7 11 12 0 2 3 4 6 7 1 5 8 11 9 10 12 13 14 15 2 6 7 8 11 14 15 0 1 3 4 5 9 10 12 13 c11 c12 c13 c21 c22 Fig. 1 Illustration of CcomViz projection Like parallel coordinates [27], CComViz lays dimension axes out in parallel. But these axes are not coordinates. Every axis is an expression of individual record display order and represented as a set of continuous color bars. Each color bar corresponds to a partitioning group with its length scaled according to its density in dimension. In CComViz, data items are also represented as polylines that link their points on every axis. The point position of each data item on axis is determined by its rank in the record display order. Since each data item has a unique rank on axis, no more than one data item is projected to the same point on each axis (without considering the rounding of projection position to screen pixel). Therefore, no more than one data item is projected to the same polyline. This feature benefits in particular categorical dimensions in contrast to regular parallel coordinates, in which data items with the same category value are projected to the same point on axis. In CComViz, one of the most important visual properties is called hot dimension. The hot dimension, like C3 in Fig. 1, is a selected dimension by which all record lines are colored. Any active dimension can be selected as the hot dimension. Choosing different hot dimension provides a convenient way to view how records in a group of one dimension fall into groups of other dimensions. In this way, data flow, density distribution and hierarchy starting from different dimension can be easily explored. Unlike parallel coordinates, CComViz focuses on achieving visual analytics rather than geometrical data mapping. The observation of geometrical data properties in CComViz may be disabled, meaning that the position of record projection has no direct connection to its value. Although this may inadvertently impact on numerical dimensions, by means of data and tool linking mechanism, such as under the UVP (Universal Visualization Platform) [28], this disadvantage can be compensated through using other visualization tools. As a matter of fact, parallel coordinates and CComViz share many common characteristics and interfaces. Using both visualization tools in an interactive and linked environment can generate a strong and unified power for comparative cluster analysis. In CComViz, the feature that no more than one record is projected to the same point on each axis avoids record overlap. But clutter due to crossing remains a concern. As seen in Fig. 1, without proper record display order rearrangement, the severe edge crossings make it difficult to observe data patterns and data relationships. This problem has to be solved. Otherwise, this data representation model is useless for most real-world datasets. Beyond that, from a visualization point of view, visually representing data with less clutter is not enough. Data representation should amplify human cognition [29]. Now that CComViz is intended to display data stability, data flow, density distribution and hierarchy across multiple partitions, the record and dimension display order needs to enhance readability of these data features so cognition can be amplified. Information visualization is fundamentally dependent upon the properties of human perception. According to visualization study [29], grouping, sorting, and positioning are among the most effective ways to support human perceptual inference, enhance pattern detection, as well as reduce information search time and memory usage. Another influential factor in human perceptual inference is related to semantics. As CComViz is designed to distinguish stable and unstable records, we expect their visual appearances to be different. Relying on semantic associations, to make stable records look smooth and unstable records fluctuate can help with human perceptual inference. In order to achieve the matching between data representation model and human perception model, we brought above concepts into a methodology in developing a sophisticated layout algorithm for record and dimension display order rearrangement in terms of selection of the hot dimension and partition stability measures. The mechanism behind the methodology is based on consideration and utilization of user perception modeling, partition stability measurement, and interactive functions involved in the course of data exploration and pattern recognition. The methodology used in the layout algorithm makes CComViz an ideal tool to visualize data stability, data flow, density distribution and hierarchy, as well as data correlation across a number of partitions within a single graphical display. In CComViz, the process for rearranging record and dimension display order is in a pipeline sequence. The computation of record display order rearrangement is based on dimension display order. Any change in adding, removing or reordering dimensions will cause a change in the record display order. 4.1 Dimension Display Order Rearrangement For partition comparison and evaluation, it is important to identify distinctions and variations between similar partitions. The similarity is defined by different metrics, such as clustering criteria and partitioning consistency. Since similar partitions have a high degree of partitioning consistency, there should be fewer crossing lines if similar partitions are adjacent and the display orders of partitioning groups along their axes maintain a correspondent relationship (To be discussed later). Therefore, from visual aesthetics point of view, putting similar partitions together assists the observation of outliers or odd records while from human visual perception point of view, it facilitates comparison and perceptual inference. As the layout of dimension axes in CComViz is a linear sequence, there are two approaches to dimension display order rearrangement, which in CComViz is an optional function. By default, the dimension display order is arranged by users to achieve flexibility. 4.1.1 Sorting Approach Cluster evaluation through sorting by criteria is common in comparative cluster analysis. CComViz provides a natural way to visualize sorted clustering results. In general, any kind of clustering criteria or clustering quality measures can be used for sorting. Due to their varying characteristics, the comparison results can be either consistent or inconsistent. For example, data stability pattern may not be observed better through sorting by internal criteria than by external or relative criteria, since internal criteria are not intended to measure data stability. However, it is still worthwhile to project CComViz in different sorting orders to explore partitioning features and verify partitioning reliability from various perspectives. As one of CComViz projection goals is to reduce crossing, the stability-based criteria, such as our proposed PS metric, are ideal for that purpose. Regarding the synthetic dataset displayed in Table 1, through sorting by PS, the dimension display order is <C3, C1, C2>. 4.1.2 Sequential Selection Approach Sequential selection is a process for rearranging the dimension display order by using relative partition similarity measure. Once the first dimension which the user thinks is the best or the most interesting in terms of certain criteria is identified, a dimension that is the most similar to the current dimension among all unselected candidate dimensions is chosen as the immediate successive dimension. Our proposed RPS is one of these kinds of similarity measures. In the synthetic dataset, when C3, which has higher PS value over others, is chosen as the first dimension, the dimension display order by sequential selection using RPS remains to be < C3, C1, C2>. The dimension display order rearrangement is capable of improving not only visual aesthetics but also comparison results. Because stability-based partition comparison and evaluation work in voting fashion, when too many inferior partitions or divergent partitions are involved in a comparison, the results will not be reliable. By visually checking partition quality or similarity, and throwing inferior or divergent partitions away from the sorted sequence, the results of the next run on the trimmed series of partitions will be improved. In summary, a proper dimension display order makes users focus more on high quality or interesting partitions when many partitions are involved in comparison and evaluation. It also leads to progressive refinement of comparison results. 4.2 Record Display Order Rearrangement To some extent, record display order rearrangement in CComViz is similar to the Bipartite Graph Crossing Minimization (BGCM) problem [30]. One of the BGCM algorithms, called gravity-center algorithm, has been successfully used in Expression Profiler visualization tool package [23] for reducing crossing when two clustering results are compared. However, a distinction exists: BGCM algorithms deal with one pair of partitions, whereas record display order rearrangement in CComViz deals with unlimited number of partitions. Our methodology does not directly treat record display order rearrangement as a BGCM problem; instead it implicitly achieves crossing reduction. As discussed earlier, reducing crossing is not the only goal of CComViz visualization. Other goals include facilitating observation of data stability, density distribution and hierarchy, as well as other interesting relationships. To achieve these goals, we take the grouping and reordering approach to developing a layout algorithm that enforces the formation of record projections in envelope (band) shapes with fewer crossings [25] [31] [32]. This algorithm contains four steps, which involves record grouping, group matching, group reordering, record sorting, and linked list intersection and concatenation operations. The second and the third steps need information about record stability and group stability. The record display order depends on the selection of dimensions to be compared, the selection of the hot dimension, as well as the dimension display order. We now use the synthetic dataset to illustrate these steps with the dimension display order < C3, C1, C2> and the setting of C3 as the hot dimension The first step is to group each dimension by its own categorical values, as we see in Fig 1. In the second step, the groups of the hot dimension are first reordered by following the descending order of their group stability values. For the synthetic dataset, since GS(c33) > GS(c31) > GSA(c32), the group display order of C3 is < c33, c31, c32 > (top to bottom). And then starting from the hot dimension, the groups of other dimensions are sequentially reordered to both ends in terms of group matching relationships. The group matching relationships specify a set of group pairs. Each pair is considered as the most similar groups of adjacent dimensions by D measure. Each group can only be paired once with one of its adjacent dimension groups. For the synthetic dataset, the construction of the group matching relationships is illustrated in Table 4 and Table 5. Note that the pair matching operations proceed by following the corCl measure order. After the matching, the relationships are expressed as c33c11, c32c13, c31c12, c13c22, c11c21. Once the group matching relationships are built, the group reordering is then conducted. During the group reordering, a rule is applied for a situation where the number of groups of a dimension is larger than that of the preceding dimension. In this situation, the group order of the remaining groups of this dimension follows the descending order of their group stability values. Fig. 2 shows the group reordering result for the synthetic dataset. Since the hot dimension C3 is the first dimension, no sequentially backward reordering is applied. Table 4 Group Matching between C3 and C1 C3 C1 corCl c33 c11 0.62 c32 c13 0.55 c32 c12 0.45 c11 c31 0.41 c31 c33 c33 c31 c32 c13 c12 c13 c12 c11 0.41 0.38 0.15 0 0 Table 4 Group Matching between C1 and C2 C1 C2 corCl c13 c22 0.54 c11 c21 0.46 c11 c22 0.41 c21 c12 0.38 c22 c12 0.33 c13 c21 0.31 GS 0.47 0.44 0.42 C3 C1 C2 2 4 5 6 7 11 12 0 3 9 15 1 8 10 13 14 0 2 3 4 6 7 1 5 8 11 9 10 12 13 14 15 2 6 7 8 11 14 15 0 1 3 4 5 9 10 12 13 c31 c32 c33 c11 c12 c13 c21 c22 Fig. 2 Step 2: Group reordering, starting from the hot dimension to both ends The third step contains two kinds of operations. First, the records of the last dimension (rightmost) are reordered within each group in descending order by record stability. And then starting from the dimension prior to the last dimension backwards to the hot dimension, the record orders of other dimensions are sequentially reordered by performing a series of linked list intersection and concatenation operations. A linked list intersection operation returns the overlapped elements of two or more linked lists with the order following the first linked list. With these operations, the new record order of each current dimension’s group is determined by cum = Where i ( cvi c’um ) is the linked list intersection operator and is the linked list concatenation operator. cum and c’um denote the record orders of group cum after and before reordering respectively. v is the dimension right next to u. The accumulated operations follow the group order of dimension Cv starting from top to bottom. For example, the new order of c11 is: c11 = ( c21 c’11) ( c22 c’11) = ({2,6,7,11,14,8,15} = {2,6,7} {0,2,3,4,6,7}) ({10,13,9,4,0,3,1,5,12} {0,2,3,4,6,7}) {4,0,3} = {2,6,7,4,0,3} Fig. 3 shows the reordering results after the third step. Sequential recording c31 c32 c33 C3 C1 C2 RS 2 6 7 4 11 5 12 0 3 15 9 8 1 14 10 13 2 6 7 4 0 3 11 8 1 5 14 15 10 13 9 12 2 6 7 11 14 8 15 10 13 9 4 0 3 1 5 12 0.55 0.55 0.55 0.44 0.41 0.39 0.3 0.51 0.51 0.48 0.47 0.44 0.44 0.41 0.36 0.36 c11 c12 c13 c21 c22 Fig. 3 Step 3: Record reordering from the last dimension to the hot dimension The fourth step takes the same linked list intersection and concatenation operations, but starts from the hot dimension to both ends. For the synthetic dataset, since the hot dimension C3 is the first dimension, no sequential reordering to the front end is needed. Fig. 4 displays the final result after all reordering. Sequential recording C3 C1 C2 2 6 7 4 11 5 12 0 3 15 9 8 1 14 10 13 2 6 7 4 0 3 11 5 8 1 12 15 9 14 10 13 2 6 7 11 8 15 14 4 0 3 5 1 12 9 10 13 c11 c12 c13 c31 c32 c33 c21 c22 Fig. 4 Step 4: Record reordering from the hot dimension to both ends c31 c32 c33 C3 C1 C2 2 6 7 4 11 5 12 0 3 15 9 8 1 14 10 13 2 6 7 4 0 3 11 5 8 1 12 15 9 14 10 13 2 6 7 11 8 15 14 4 0 3 5 1 12 9 10 13 c11 c12 c13 c21 c22 Fig. 5 Final result in CComViz Finally, we add record lines on the final reordering result (Fig. 5). In contrast to Fig. 1, Fig. 5 more clearly reveals information which is hidden before. For example, in Fig. 5, the hierarchies related to records 11, 5 and 12 are easily seen; Smooth record lines for records 2, 6, 7 10 and 13 well reflect their high stability. In the layout algorithm, we do not explicitly use any target function to reduce crossing. But as we see from the result, crossing reduction is implicitly achieved. Behind the mechanism is the use of partition stability metrics. The computational complexity of this algorithm also reaches its lower bound, which is linear to record size. The efficiency of this solution apparently outperforms explicit crossing reduction solutions. In next section, we apply this layout algorithm to a subset of breast cancer patient data, which is used in our collaborative project with Massachusetts General Hospital, and generate a set of CComViz plots to demonstrate the effectiveness of above techniques, as well as a number of important interactive functions implemented in CComViz. 5 CASE STUDY AND CCOMVIZ INTERACTIVE FUNCTIONS The breast cancer dataset used in this case study contains 1210 patient records, three individual risk score dimensions from three breast cancer prediction models (Gail [54], Myriad model [33] [34], BRCAPRO [35]), two clinical test measure dimensions (atypical hyperplasia in biopsy and average mammogram density [19] [20]), cancer status dimension, and four personal attribute dimensions (menarche age, age of first birth, bra size, weight). For cluster analysis, we selected the three risk score and two clinical test measure dimensions, and employed EM (Expectation Maximization), KM (Kmeans), XM (Extended Kmeans) and DB( Density based clustering) clustering algorithms implemented in Weka project to generate four clustering results. The number of clusters, which is 6, in the subspace with five dimensions selected was estimated by a automatic function in Weka EM. DB6 KM6 XM6 EM6 Cancer DB6 KM6 XM6 EM6 Cancer DB6 KM6 XM6 EM6 Cancer C5DB C3DB C0DB C4DB C2DB C1DB C5KM C3KM C0KM C4KM C2KM C1KM * DA: Diagnosed (a) All C3XM C1XM C4XM C2XM C0XM C5XM C3EM C0EM C2EM C1EM C4EM C5EM NR DA NR: Not Report C5DB C3DB C0DB C4DB C2DB C1DB C5KM C3KM C0KM C4KM C2KM C1KM C3XM C1XM C4XM C2XM C0XM C5XM : Low risky group (b) RS > 0.488 C3EM C0EM C2EM C1EM C4EM C5EM NR DA C5DB C3DB C0DB C4DB C2DB C1DB C5KM C3KM C0KM C4KM C2KM C1KM C3XM C1XM C4XM C2XM C0XM C5XM C3EM C0EM C2EM C1EM C4EM C5EM : High risky group (c) RS < 0.372 Fig 6 Display data flows and hierarchies with optimal dimension reordering in sequential sorting. Fig. 6 displays the 1210 patient records with the four clustering results and the cancer status dimensions selected in CComViz. In terms of the dimension selection, RS value for each record is calculated. Fig. 6b and Fig. 6c highlight the record selections with RS thresholds. Since big bands intently correspond to records with high RS values, from Fig.6b, we can easily distinguish a few of stable sub-clusters from the four clustering results. These sub-clusters statistically give us confidence to make more reliable inference. For instance, based on the observation of distribution, we can quickly infer some high or low risky populations marked with red (high risk) or green (low risk) triangles in Fig.6b. On the other hand, unstable records are represented by small and fluctuating bands (Fig. 6c), which help us quickly identify outliers or data items which need to be verified. As the dimension display order is sorted, Fig. 6 also conveys additional information about the stability ranking of the four clustering results. NR DA To facilitate data exploration and observation, CComViz provides a number of important interactive functions. These functions include: 5.1 Changing Hot Dimension As we mentioned earlier, changing the hot dimension during data exploration provides an easy way to investigate data flows, density hierarchies and distributions starting from different dimensions. As opposed to Fig. 6, Fig. 7 chooses XM6 as the hot dimension and provides snapshot of data flow from another angle. The operation of changing hot dimension in CComViz is very easy. It can be a simple click of mouse on the designated dimension axis. Since record display order rearrangement relies on the selection of the hot dimension, the rearrangement is recomputed once the hot dimension is changed. DB6 C5DB C3DB C0DB C4DB C2DB C1DB KM6 XM6 C5KM C3KM C0KM C4KM C2KM C1KM * DA: Diagnosed EM6 C3XM C1XM C4XM C2XM C0XM C5XM Cancer C3EM C0EM C2EM C1EM C4EM C5EM NR DA NR: Not Report Fig 7 Change hot dimension 5.2 DB6 KM6 C5DB C3DB C0DB C4DB C2DB C1DB XM6 C5KM C3KM C0KM C4KM C2KM C1KM * DA: Diagnosed EM6 C3XM C1XM C4XM C2XM C0XM C5XM Cancer C3EM C0EM C2EM C1EM C4EM C5EM NR DA NR: Not Report Fig 8 Change group rendering order Changing Group Rendering Order For most real-world datasets, record occlusion in CComViz can not be avoided even with the help of proper dimension and record display order rearrangement. Under certain circumstances, transparent drawing provides helps for small datasets. But for large and complex datasets, visual aesthetics in transparent drawing can not be achieved. In this situation, the control of group rendering orders offers an ideal and flexible solution. With this control, the most user-interested partitioning group, which is called hot group, is rendered in the front. The selection of a hot group is made by clicking the mouse on the group bar. Note that the hot group is a group of the hot dimension. As the hot group is selected, the hot dimension also selected. The usefulness of changing group rendering order can be seen by comparing Fig. 8 and Fig. 7. 5.3 Performing Boolean Operation for Record Selection Boolean operations AND, OR, NAND, and NOR, as well as their combination are available in CComViz for record selections. Once record selection bars are draw and Boolean operation formula is made, the records in the result set are highlighted. As shown in Fig. 11, the record selection is made by AND operation on selected record subsets A and B. DB6 KM6 XM6 EM6 C5KM C3KM C0KM C4KM C2KM C1KM * DA: Diagnosed C3XM C1XM C4XM C2XM C0XM C5XM C3EM C0EM C2EM C1EM C4EM C5EM NR DA NR: Not Report Fig 9 Boolean operation for record selection 5.4 EM6 KM6 DB6 XM6 Cancer B A A C5DB C3DB C0DB C4DB C2DB C1DB Cancer Customizing Dimension Display Order C3EM C0EM C2EM C1EM C4EM C5EM C5KM C3KM C0KM C4KM C1KM C2KM * DA: Diagnosed C5DB C3DB C0DB C4DB C1DB C2DB C3XM C1XM C4XM C2XM C5XM C0XM NR DA NR: Not Report Fig 10 Display data flows and hierarchies in custom dimension order Although automatic sorting or optimal reordering of dimensions helps reducing line crossing, custom dimension display order is often needed for data exploration. Fig. 10 gives another view of the breast cancer dataset in different dimension display order from Fig. 8. This view facilitates the investigation of EM6 clustering result. In addition to interactive functions discussed above, CComViz is also able to handle numeric data dimension by binning, choose options of D measures for stability analysis and partition quality measures for dimension display order sorting, change record reordering in different modes including silhouette index sorting and previous order freezing etc. Due to the limitation of paper space, we skip their introduction in this paper. For continuing the case study, we now project the four personal attribute dimensions and cancer status dimension as well. In Fig. 11, the dimension display order is rearranged by using sequential selection approach. This order suggests that AgeFirstBirth and MenarcheAge are more similar to cancer status than Weight and BraSize dimensions. It is not surprised to see a large number of crossings and branchings in Fig. 9a, which make it difficult to draw inference even though with the help of dimension and record display order rearrangement since no evidences show these four personal attributes are firmly effective risk factors in entire population. However, when selecting stable records to a degree, some high or low risky populations, as seen in Fig. 9b, are discerned. The visual results suggest the bra size might be a risk factor in these populations and deserves further investigation. Through this case study, we demonstrate how CComViz can be effectively used to draw inference through the aggregation of multiple information sources. Behind the success is the development of metaphors that stable records are expressed within wide and smooth bands while unstable records are within narrow and fluctuating strips. The statistical robustness of these metaphors along with the graphic layout algorithm makes CComViz visualization distinctive from any others. 6 CONCLUSION In visual analytics field, it is very challenging to achieve visual aesthetics by using metaphors based on statistics. In this paper, we proposed a suite of techniques addressing mutual comparison of multiple partitions to meet this challenge. These techniques include a set of partition stability metrics and CComViz visualization with the enhancement of a sophisticated graphic layout algorithm. The partition stability metrics are defined at the record, the group, the dimension and the dataset levels. These metrics are not only used to measure data feature at different level, and also help graphic layout in CComViz for implicitly reducing crossing lines. CComViz is a highly interactive visual analytic tool to visualize data stability, data flow, density distribution and hierarchy, and data correlation at the record, the group and the dimension levels within a single graphical interactive display. In order to achieve visual aesthetics and reduce crossing lines, we presented a novel methodology used to develop a layout algorithm for informatively rearranging the order of the records and dimensions in CComViz. Our proposed techniques can be extensively used to identify near-optimal structure, build ensembles, or conduct validation. Cancer AgeFirstBirth Not Report Diagnosed 20 - 30 NA > 34 < 20 31 - 34 MenarcheAge 11 - 15 NA < 11 > 15 Weight 141 - 180 NA >180 101 – 140 <100 (a) All Records Cancer AgeFirstBirth MenarcheAge Weight BraSize #38 NA #42 #34 <#32 #36 #40 #32 #44 >#44 BraSize Not Report Diagnosed : Low risky group 20 - 30 NA > 34 < 20 31 - 34 11 - 15 NA < 11 > 15 : High risky group (b) RS > 0.425 141 - 180 NA >180 101 – 140 <100 #38 NA #42 #34 <#32 #36 #40 #32 #44 >#44 Fig 11 Personal attribute dimensions vs. cancer status dimension REFERENCES: 1. G. Chen, Saied A. Jaradat, Nila Banerjee,Tetsuya S. Tanaka, Minoru S. H. Ko and Michael Q. Zhang (2002), Evaluation And Comparison Of Clustering Algorithms In Anglyzing Es Cell Gene Expression Data, Statistica Sinica 12(2002), 241-262 2. A. Ben-Hur, A. Elisseeff, and I. Guyon, A stability based method for discovering structure in clustered data, in Pasific Symposium on Biocomputing, vol. 7, 2002, pp. 6--17 3. Francisco Azuaje, Nadia Bolshakova , Chapter 13: Clustering Genomic Expression Data: Design And Evaluation Principles, Clustering Genome Expression Data: Design and Evaluation Principles”, in Understanding and Using Microarray Analysis Techniques: A Practical Guide, London: Springer Verlag, 2002 4. Julia Handl, Joshua Knowles and Douglas B. Kell, Computational cluster validation in post-genomic data analysis, bioinformatics Vol. 21 no. 15 2005, pages 3201–3212 5. S. Theodoridis, K. Koutroumbas, Pattern recognition, 1999,San Diego, CA: Academic Press 6. Rand,W.M. (1971) Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc., 66, 846–850. 7. E. Fowlkes and C. Mallows. A method for comparing two hierarchical clusterings. Journal of the American Statistical Asociation, 78, 1983. 8. Y. Batistakis M. Halkidi and M. Vazirgiannis. On clustering validation techniques. Journal of Intelligent Information System, 17:2/3, 2001. 9. A van Rijsbergen. Information Retrieval, second edition. Butterworths, 1979 10. J. Seo and B. Shneiderman, A Rank-by-Feature Framework for Unsupervised Multidimensional Data Exploration Using Low Dimensional Projections, IEEE Symposium on Information Visualization 2004 11. Simon Katz and Georges Grinstein, Perturbation Methods for Cluster Analysis. Technical Report, Umass Lowell, 2006 12. Brian Everitt. Cluster analysis. Halsted Press, New York, 1974, 1993. 13. J.C. Bezdek, N.R. Pal, Some new indexes of cluster validity, IEEE Transactions on Systems, Man and Cybernetics, Vol. 28, Part B, 1998, pp. 301-315 14. Kenneth Ward Church and Patrick Hanks. Word association norms, mutual information, and lexicography, Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, 1989. 15. A. Strehl and J. Ghosh, Cluster ensembles - a knowledge reuse framework for combining multiple partitions, Journal of Machine Learning Research, vol. 3(Dec), pp. 583-617, 2002 16. Ana L. N. Fred, Anil K. Jain: Combining Multiple Clusterings Using Evidence Accumulation. IEEE Trans. Pattern Anal. Mach. Intell. 27(6): 835-850 (2005) 17. M. Bittner et al., Molecular classification of cutaneous malignant melanoma by gene expression profiling, Nature, Vol. 406, 2000, pp. 536-540 18. Behrouz Minaei-Bidgoli, Alexander Topchy and William F. Punch, A Comparison of Resampling Methods for Clustering Ensembles, MLMTA 2004 19. H. Stalsberg et al., Human Cancer Histologic types of breast carcinoma in relation to international variation and breast cancer risk factors. International Journal of Cancer, Volume 44, Issue 3 , Pages 399 - 409 20. Byrne C et al. Biopsy confirmed benign breast disease, postmenopausal hormone use of exogenous female hormones, and breast cancer risk. Cancer 2000 Nov 15 89 20462052. 21. M,H, Gail et al., Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. Biostatistics Branch, National Cancer Institute, Bethesda, MD 20892. 22. K.Y. Yeung, D.R. Haynor, W.L. Ruzzo, Validating clustering for gene expression data, Bioinfomatics, vol. 17, pp. 309-318,2001. 23. Aurora Torrente, Misha Kapusheski and Alvis Brazma, A new method for comparing results from gene expression data clustering, ISMB/ECCB 2004 24. Urska Cvek, Visual and analytical tools for record level cluster analysis, ScD thesis, Department of Computer Science, Umass Lowell, 2004 25. Fabian Bendix, Robert Kosara and Helwig Hauser,Parallel Sets: Visual Analysis of Categorical Data, Proceedings of the Proceedings of the 2005 IEEE Symposium on Information Visualization (INFOVIS'05) - Volume 00 26. Patrick Riehmann, Manfred Hanfler, and Bernd Froehlich, Interactive Sankey Diagrams, In Proceedings of the IEEE Symposium on Information Visualization (InfoVis 05), pp. 233-240, October 2005 27. A. Inselberg and B. Dimsdale. Parallel coordinates: A tool for visualizing multidimensional geometry. In Proc of IEEE Visualization Conference, pages 361-370, 1990 28. Alexander G. Gee, Hongli Li, Min Yu, Mary Beth Smrtic, Urska Cvek, Howie Goodell, Vivek Gupta, Christine Lawrence, Jianping Zhou, Chih-Hung Chiang, Georges G. Grinstein (2005). Universal Visualization Platform. In Robert F. Erbacher, Philip C. Chen, Jonathan C. Roberts, Matti T. Gröhn, Katy Börner (Eds), Visualization and Data Analysis 2005 (San Jose, California, January 17-18), Proceedings of SPIE-IS&T Electronic Imaging, SPIE Vol. 5669, pp. 274 – 283 29. S.K. Card, J.D. Mackinlay and B. Shneiderman (Eds.). Readings in Information Visualization: Using Vision to Think. San Francisco, California: Morgan Kaufmann. 1999 30. Lanbo Zheng, Le Song and Peter Eades. Crossing minimization problems of drawing bipartite graphs in two clusters. ACM International Conference Proceeding Series; Vol. 109, 2005 31. Y. H. Fua, M. O.Ward, and E. A. Rundensteiner. Hierarchical parallel coordinates for exploration of large datasets. In IEEE Visualization, pages 43–50, 1999. 32. G. Andrienko and N. Andrienko. Parallel coordinates for exploring properties of subsets. In Proceedings of the second IEEE International conference on coordinated and multiple views in exploratory visualization, pages 93–104, 2004 33. Frank, T. S., Deffenbaugh, A. M., Reid, J.E, Hulick, M. Ward, B. E., Lingenfelter, B., Gumpper, K.L., Scholl, T. Tavtigian, S. V., Pruss, D. R., Critchfield, G.C. 2002. Clinical Characteristics of Individuals With Germline Mutations in BRCA1 and BRCA2: Analysis of 10,000 Individuals. Journal of Clinical Oncology, Vol 20, Issue 6: 1480-1490 34. Frank T.S, Manley S.A, Olopade O.I, et al. 1998. Sequence analysis of BRCA1 and BRCA2: Correlation of mutations with family history and ovarian cancer risk. J Clin Oncol 16:2417–2425 35. Pratt, K.B. and Tschapek, G. 2003. Visualizing concept drift, in Proceedings 9th ACM SIGKDD Conference,Washington DC, ACM Press, New York NY,2003, pp.735-740