Using Entropy-Related Measures in Categorical Data Visualization
Jamal Alsakran, The University of Jordan
Xiaoke Huang, Ye Zhao, Kent State University
Jing Yang, UNC Charlotte
Karl Fast, Kent State University

Categorical Datasets
• Generated in a large variety of applications
  – E.g., health/social studies, bank transactions, online shopping records, and taxonomy classifications
• Contain a series of categorical dimensions (variables)

Categorical Discreteness
• Values of a dimension comprise a set of discrete categories
• Mushroom dataset
  – 8,124 records and 23 categorical dimensions

classes  cap-shape  cap-surface  cap-color  bruises  odor  gill-attachment  gill-spacing  gill-size  gill-color
c        c          c            c          c        c     c                c             c          c
p        x          s            n          t        p     f                c             n          k
e        x          s            y          t        a     f                c             b          k
e        b          s            w          t        l     f                c             b          n
p        x          y            w          t        p     f                c             n          n

Challenges
• Multidimensional visualization methods are often undermined when directly applied to categorical datasets
  – the limited number of categories creates overlapping elements and visual clutter
  – the lack of an inherent order (in contrast to numeric variables) confounds the visualization design

Categorical data visualization
• Sieve diagram and Mosaic display
• Contingency Wheel
• Parallel Sets
• Mapping to numbers

Our Work
• Investigate the use of entropy-related measures in visualizing multidimensional categorical data
  – Show how entropy-related measures can help users understand and navigate categorical data
  – Employ these measures in managing and ordering dimensions within the parallel set visualization
  – Conduct user studies on real-world data

Entropy and Related Measures
• Consider a categorical variable as a discrete random variable X
• Probability distribution: $p(x) = P(X = x)$
• Entropy: $H(X) = -\sum_{x} p(x) \log p(x)$
  – Measures the diversity of one dimension
• Joint Entropy: $H(X, Y) = -\sum_{x} \sum_{y} p(x, y) \log p(x, y)$
  – Measures the diversity of two variables taken together
• Mutual Information: $I(X; Y) = H(X) + H(Y) - H(X, Y)$
  – Measures the variables' mutual dependence

Use of Entropy
• Chen and Jänicke proposed an information-theoretic framework for visualization.
• Pargnostics: pixel-based entropy used for order optimization of coordinates
• We use entropy and mutual information in categorical data visualization

Visualize Data Facts
• Mushroom dataset
  – Size: the number of categories
  – Color: entropy

Navigation Guide: Scatter Plot Matrix
• Joint entropy matrix
  – High joint entropy indicates diversely distributed data records in a scatter plot
  – Low joint entropy reveals many overlapping records

Navigation Guide: Scatter Plot Matrix
• Mutual information matrix
  – Large mutual information indicates high dependency between two dimensions
  – Small mutual information reveals less dependency

Dimension Management on Parallel Sets
• Use entropy-related measures to help users manage dimension spacing, ordering, and filtering
• Ribbon colors defined by mushroom classes
  – Green: edible
  – Blue: poisonous

Filtering and Spacing
Remove low-diversity dimensions by setting an entropy threshold
Arrange space between neighboring coordinates with joint entropy

Sorting Categories over Coordinates
• Unlike numerical dimensions, categorical variables have no inherent order
  – reading order
  – alphabetical order
• We use the pairwise joint probability distribution to find an optimal sequence
  – Reduce ribbon intersections

Sorting Categories over Coordinates
Before: using the reading order of the coordinates and of the categories over them
After: sorting the categories of neighboring coordinates

Optimal Ordering of Multiple Coordinates
• For parallel coordinates, many existing approaches reduce line crossings between neighboring coordinates
• Using line crossings as the cost function between every pair of dimensions, global cost minimization is achieved by a graph-theory-based method [32]
• However, reducing crossings does not necessarily lead to more effective insight discovery
  – ribbon crossings depend on the sequence of categories along each axis (reading order? alphabetical order?)
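
Sketch: computing the measures
The entropy, joint entropy, and mutual information used in the slides above (data facts, scatter plot matrix guides, entropy-threshold filtering) can be estimated directly from category counts. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the function names and the threshold value are illustrative, logarithms are base 2, and the sample values are only the four mushroom records listed in the deck, so the numbers say nothing about the full 8,124-record dataset.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy H(X) in bits, estimated from label frequencies."""
    n = len(values)
    # p * log2(1/p) with p = count / n, summed over the observed categories
    return sum((c / n) * log2(n / c) for c in Counter(values).values())


def joint_entropy(xs, ys):
    """Joint entropy H(X, Y): entropy of the paired labels."""
    return entropy(list(zip(xs, ys)))


def mutual_information(xs, ys):
    """Mutual information I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - joint_entropy(xs, ys)


def filter_dimensions(columns, threshold):
    """Keep only dimensions whose entropy reaches the threshold
    (hypothetical helper illustrating entropy-based filtering)."""
    return {name: col for name, col in columns.items()
            if entropy(col) >= threshold}


# Illustrative sample: three dimensions of the four records shown in the deck.
sample = {
    "classes": ["p", "e", "e", "p"],
    "odor": ["p", "a", "l", "p"],
    "bruises": ["t", "t", "t", "t"],
}

if __name__ == "__main__":
    for name, col in sample.items():
        print(name, entropy(col))
    print("I(classes; odor) =",
          mutual_information(sample["classes"], sample["odor"]))
    print("kept:", list(filter_dimensions(sample, threshold=0.5)))
```

In this tiny sample, a dimension with a single category (bruises) has zero entropy and is removed by any positive threshold, which mirrors the idea of dropping low-diversity dimensions.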
Our Method
• We use mutual information as the cost function
  – Benefit: the cost does not depend on the sequence of categories along the axes
• Globally maximize the sum of mutual information over a series of dimensions
• The optimal ordering is obtained by solving a Hamiltonian path formulation of the Traveling Salesman Problem

C2: Optimized by ribbon crossings with alphabetical category sequence
C3: Optimized by mutual information with alphabetical category sequence
C4: Optimized by mutual information with optimized category sequence

User Studies
• Assess user performance on insight discovery with different ordering approaches
• Design specific tasks for users to complete in a limited time period
• Apply statistical analysis to the results

Mushroom Data
• 11 participants received training and 10 minutes of practice before the test
• Each participant was given 90 seconds to find as many mushroom characteristics as possible in four groups: (T1) All-edible; (T2) All-poisonous; (T3) Mostly-edible; (T4) Mostly-poisonous
• Each participant's findings were scored against the ground truth

Results
• Average percentage of user findings over ground truth on each task

Results
• Total performance of user findings using different visualizations

Results
• Total error rate of user findings using different visualizations

Statistical Test
• We applied the Friedman test of variance by ranks (a non-parametric statistical test)
• Statistically significant differences were found
  – Between C1 and C4 (p-value = 0.011)
  – Between C2 and C4 (p-value = 0.035)
  – Between C3 and C4 (p-value = 0.007)

Congressional Voting Records
Green: Democrat
Red: Republican
C1: Using the reading order
C2: Using the optimized order

Congressional Voting Records
The leftmost dimension is the vote on education-spending
Green: nay
Red: yea
C3: Using the reading order
C4: Using the optimized order

User Study of Voting Dataset
• 35 participants were given 2 minutes to complete tasks
• Using C1 and C2, for each bill
  – (T1) which party voted more for yea?
  – (T2) which party voted more for nay?
• Using C3 and C4, for each bill
  – (T3) which congressmen group voted more for yea?
  – (T4) which congressmen group voted more for nay?

Results
• We graded each participant
  – 1 point if the answer was correct
  – -1 point if the answer was incorrect
  – 0 points if they said it was hard to identify
• The average score using C1 was 11.5
• The average score using C2 was 20.1
• The average score using C3 was 13.2
• The average score using C4 was 18.0

Statistical Test
• One-way analysis of variance (ANOVA) to compare the effect of using different visualizations
• One test was performed for C1 and C2
  – p-value = 0.0001
• Another test was performed for C3 and C4
  – p-value = 0.02

Conclusion
• Utilize measures from information theory to enhance the visualization of high-dimensional categorical data
• Support users in browsing data facts among dimensions, determining starting points of data analysis, and test-and-tune parameters for visual reasoning

Thanks!
• This work is partially supported by US NSF IIS-1352927, IIS-1352893, and a Google Faculty Research Award
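
Sketch: ordering dimensions by mutual information
The "Our Method" slide orders the axes so that the sum of mutual information between neighboring dimensions is maximized, which is a Hamiltonian path problem. The following Python sketch illustrates that objective with exhaustive search over all orderings. This is a swapped-in technique for illustration only and is practical for a handful of dimensions; the deck does not say which solver the authors used. The function names and the toy columns `a` to `d` are hypothetical.

```python
from collections import Counter
from itertools import permutations
from math import log2


def entropy(values):
    """Shannon entropy in bits, estimated from label frequencies."""
    n = len(values)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def best_axis_order(columns):
    """Return (order, score) maximizing the summed mutual information
    between neighboring axes. Brute force: tries every permutation,
    so the cost grows factorially with the number of dimensions."""
    names = list(columns)
    # Pairwise mutual information; it does not depend on category order.
    mi = {(a, b): mutual_information(columns[a], columns[b])
          for a in names for b in names if a != b}

    def path_score(order):
        return sum(mi[a, b] for a, b in zip(order, order[1:]))

    best = max(permutations(names), key=path_score)
    return best, path_score(best)


# Hypothetical toy data: a and b are identical, c and d are identical,
# and the two pairs are independent of each other.
toy = {
    "a": [0, 0, 1, 1, 0, 0, 1, 1],
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
    "b": [0, 0, 1, 1, 0, 0, 1, 1],
    "d": [0, 1, 0, 1, 0, 1, 0, 1],
}

if __name__ == "__main__":
    order, score = best_axis_order(toy)
    print(order, score)
```

Several orderings tie in this toy case, so only the score and the adjacency of the dependent pairs are meaningful, not the exact ordering returned.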