Information Visualization Design for Multidimensional Data: Integrating the Rank-by-Feature Framework with Hierarchical Clustering
Dissertation Defense
Human-Computer Interaction Lab & Dept. of Computer Science
Jinwook Seo

Outline
• Research Problems
• Clustering Result Visualization in HCE
• GRID Principles
• Rank-by-Feature Framework
• Evaluation
  – Case studies
  – User survey via emails
• Contributions and Future work

Exploration of Multidimensional Data
• To understand the story that the data tells
• To find features in the data set
• To generate hypotheses
• Lost in multidimensional space
• Tools and techniques are available in many areas
• Strategy and interface to organize them to guide discovery

Constrained by Conventions
[Diagram labels: User/Researcher; Conventional Tools; Statistical Methods; Data Mining Algorithms; Multidimensional Data]

Boosting Information Bandwidth
[Diagram labels: User/Researcher; Information Visualization Interfaces; Statistical Methods; Data Mining Algorithms; Multidimensional Data]

Contributions
• Graphics, Ranking, and Interaction for Discovery (GRID) principles
• Rank-by-Feature Framework
• The design and implementation of the Hierarchical Clustering Explorer (HCE)
• Validation through case studies and user surveys

Hierarchical Clustering Explorer: Understanding Clusters Through Interactive Exploration
• Overview of the entire clustering results: compressed overview
• The right number of clusters: minimum similarity bar
• Overall pattern of each cluster (aggregation): detail cutoff bar
• Compare two results: brushing and linking using pair-tree

HCE History
• Document-View Architecture
• 72,274 lines of C++ code, 76 C++ classes
• About 2,500 downloads since April 2002
• Commercial license to a biotech company (www.vialactia.com)
• Freely downloadable at www.cs.umd.edu/hcil/hce

Goal: Find Interesting Features in Multidimensional Data
• Finding clusters, outliers, correlations, gaps, … is difficult in multidimensional data
  – Cognitive difficulties in >3D
• Therefore utilize
low-dimensional projections
  – Perceptual efficiency in 1D and 2D
  – Orderly process to guide discovery

Do you see anything interesting?

Do you see any interesting feature?
[Scatter plot; x-axis: Ionization Energy]

Correlation…What else?
[Scatter plot; x-axis: Ionization Energy]

Outliers
[Scatter plot; x-axis: Ionization Energy; labeled points: He, Rn]

GRID Principles
• Graphics, Ranking, and Interaction for Discovery in Multidimensional Data
• Study 1D, study 2D, then find features
• Ranking guides insight, statistics confirm

Rank-by-Feature Framework
• Based on the GRID principles
• 1D → 2D
  – 1D: Histogram + Boxplot
  – 2D: Scatterplot
• Ranking Criteria
  – statistical methods
  – data mining algorithms
• Graphical Overview
• Rapid & Interactive Browsing

A Ranking Example
3138 U.S. counties with 17 attributes
Uniformness (entropy): 6.7, 6.1, 4.5, 1.5
Pearson correlation: 0.996, 0.31, 0.01, -0.69

Categorical Variables in RFF
• New ranking criteria
  – Chi-square, ANOVA
• Significance and Strength
  – How strong is a relationship?
  – How significant is a relationship?
• Partitioning and Comparison
  – partition by a column (categorical variable)
  – partition by a row (class info for columns)
  – compare clustering results for partitions
[Figure legends: one panel with color: Contingency coefficient C, size: Chi-square p-value; another panel with color: Quadracity, size: Least-square error]

Partitioning and Comparison
            s1       s2       s3    s4       s5       s6       s7
FieldType   integer  integer  real  integer  integer  integer  categorical
i1                                                             M
i2                                                             M
i3                                                             M
…                                                              …
in-1                                                           F
in                                                             F
Compare two column-clustering results

Partitioning and Comparison
            s1       s2       s3    s4       s5       s6
CID         1        1        1     2        2        2
FieldType   integer  integer  real  integer  integer  integer
i1
i2
i3
…
in-1
in
Compare two row-clustering results

Qualitative Evaluation
• Case studies
  – 30-minute weekly meeting for 6 weeks individually
  – observe how participants use HCE
  – improve HCE according to their requirements
  – 1 molecular biologist (Acute lung injuries in mice)
  – 1 biostatistician (FAMuSS Study data)
  – 1 meteorologist (Aerosol measurement)

Lessons Learned
• Rank-by-Feature Framework
  – Enables systematic/orderly exploration
  – Prevents missing important features
  – Helps confirm known features
  – Helps identify unknown features
  – Reveals outliers as signal/noise
• More work needed
  – Transformation of variables
  – More ranking criteria
  – More interactions

User Survey via Emails
• 1500 user survey emails
• 13 questions on HCE and RFF
• 60% successfully sent out
• 85 users replied
• 60 users answered a majority of questions
• 25 just curious users

[Bar chart: Which features have you used? Categories: dendrogram, histogram ordering, scatterplot ordering, tabular view, profile search, gene ontology. Bar values in extraction order: 49, 25, 24, 25, 22, 7]

[Bar chart: Do you think HCE improved the way you analyze your data set? Categories: significantly, somewhat significantly, a little bit, not at all. Bar values in extraction order: 20, 13, 12, 2]

Future Work
• Integrating RFF with Other Tools
  – More ranking criteria
  – GRID principles available in other tools
• Scaling-up
  – Selection/Filtering to handle large number of dimensions
• Interaction in RFF
• Further Evaluation

Contributions
• Graphics, Ranking, and Interaction for Discovery (GRID) principles
• Rank-by-Feature Framework
• The design and implementation of the Hierarchical Clustering Explorer (HCE)
• Validation through case studies and user surveys
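
Supplementary sketch: ranking projections
The ranking step of the Rank-by-Feature Framework (score every 1D and 2D projection by a criterion, then browse them in ranked order) can be illustrated with a small Python sketch. It assumes two of the criteria named on the slides: an entropy-based uniformity score for 1D histograms and Pearson correlation for 2D scatterplots. The function names (`histogram_entropy`, `rank_1d`, `rank_2d`), the bin count, and the toy data are hypothetical and chosen for illustration only; this is not HCE's implementation (HCE is written in C++), and the exact formulas HCE uses may differ.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption of this sketch: constant columns score 0
    return cov / math.sqrt(var_x * var_y)


def histogram_entropy(xs, bins=8):
    """Shannon entropy (bits) of an equal-width histogram.

    Higher entropy means the values are spread more uniformly over the bins.
    """
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant columns
    counts = [0] * bins
    for x in xs:
        # Clamp so the maximum value falls into the last bin.
        counts[min(int((x - lo) / width), bins - 1)] += 1
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in counts if c)


def rank_1d(columns, bins=8):
    """Rank single columns (1D projections) by histogram entropy, descending."""
    scores = {name: histogram_entropy(vals, bins) for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def rank_2d(columns):
    """Rank column pairs (2D projections) by Pearson correlation, descending."""
    scores = {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data set (hypothetical): four columns, eight items each.
columns = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "b": [2, 4, 6, 8, 10, 12, 14, 16],  # increases with "a"
    "c": [8, 7, 6, 5, 4, 3, 2, 1],      # decreases with "a"
    "d": [1, 1, 1, 1, 1, 1, 1, 9],      # heavily skewed
}

for name, score in rank_1d(columns):
    print(f"1D {name}: entropy = {score:.2f}")
for pair, score in rank_2d(columns):
    print(f"2D {pair}: r = {score:.2f}")
```

Following the GRID principles, the ranked lists only guide where to look first; each highly or lowly ranked projection would still be inspected as a histogram or scatterplot before drawing conclusions.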