An Evaluation of Microarray Visualization Tools for Biological Insight

Purvi Saraiya, Chris North
Dept. of Computer Science, Virginia Polytechnic Institute and State University
Karen Duca
Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University

Presented by Tugrul Ince and Nir Peer, University of Maryland

Goals
- Evaluate five popular visualization tools
  - Cluster/TreeView
  - TimeSearcher
  - Hierarchical Clustering Explorer (HCE)
  - Spotfire
  - GeneSpring
- Do so in the context of bioinformatics data exploration

Goals: Research Questions
- How successful are these tools in stimulating insight?
- How do various visualization techniques affect the users' perception of data?
- How does the users' background affect tool usage?
- How do these tools support hypothesis generation?
- Can insight be measured in a controlled experiment?

Visualization Evaluations
- Typically, evaluations consist of controlled measurements of user performance and accuracy on predetermined tasks
- We are looking for an evaluation that better simulates a bioinformatics data analysis scenario
- We use a protocol that focuses on recognition and quantification of insights gained from actual exploratory use of visualizations

Insights
- It is hard to define what an "insight" is
- We need this term to be quantifiable and reproducible
- Solution
  - Encourage users to think aloud and report any findings they have about the dataset
  - Videotape the session to capture and characterize individual insights as they occur
  - This generally provides more information than subjective measures from post-experiment surveys

Insights (continued)
- Define an insight as an individual observation about the data by the participant: a unit of discovery
- Essentially, any data observation made during the think-aloud protocol
- Now we can quantify some characteristics of each insight

Insight Characteristics
- Observation: the actual finding about the data
- Time: the amount of time taken to reach the insight
- Domain Value: the significance of the insight, coded by a domain expert
- Hypotheses: hypothesis and direction of research
- Directed vs. Unexpected
  - Recall: participants are asked to identify questions they want to explore
- Correctness
- Breadth vs. Depth

Insight Characteristics (continued)
- Category
  - Overview: overall distributions of gene expression
  - Patterns: identification or comparison across data attributes
  - Groups: identification or comparison of groups of genes
  - Details: focused information about specific genes

Experiment Design
- A 3×5 between-subjects design
  - Between-subjects: different subjects for each dataset-tool pair
  - Dataset: 3 treatments
  - Visualization tool: 5 treatments

Experiment Design: Participants
- 2 participants per dataset per tool
- Have at least a Bachelor's degree in a biological field
- Assigned to tools they had never worked with before
  - To prevent any advantage
  - To measure learning time
- Categories
  - 10 Domain Experts: senior researchers with extensive experience in microarray experiments and microarray data analysis
  - 11 Domain Novices: lab technicians or graduate student research assistants
  - 9 Software Developers: professionals who implement microarray software tools

Protocol and Measures
- Chose new users with only minimal tool training
  - Success in the initial usage period is critical for the tool's adoption by biologists
- Participants received initial training
  - Background description of the dataset
  - 15-minute tool tutorial
- Participants listed some analysis questions
- Instructed to examine the data with the tool for as long as needed
- They were allowed to ask for help with the tool
  - Simulates training by colleagues

Protocol and Measures (continued)
- Every 15 minutes, participants estimated the percentage of total potential insight they had obtained so far
- Finally, they assessed their overall experience with the tool during the session
- The entire session was videotaped for later analysis
- Later, all individual occurrences of insights were identified and codified

Show me pictures
- Here are the tools!!!

Cluster/TreeView = ClusterView
- Cluster: clusters the data
- TreeView: visualizes the clusters
- Uses heat-maps

TimeSearcher 1
- Parallel coordinate visualization
- Interactive filtering
- Line graphs for each data entity

HCE
- Clusters data
- Several visualizations
  - Heat-maps
  - Parallel coordinates
  - Scatter plots
  - Histograms
- Brushing and linking

Spotfire
- General-purpose visualization tool
- Several displays
  - Scatter plots
  - Bar graphs
  - Histograms
  - Pie/line charts
  - Others…
- Dynamic query sliders
- Brushing and linking

GeneSpring
- Suitable for microarray data analysis
- Shows physical positions on genomes
- Array layouts
- Pathways
- Gene-to-gene comparison
- Brushing and linking
- Clustering capability

Enough about Tools, Tell me the Results!!!

Number of Insights
(Chart by tool: ClusterView, TimeSearcher 1, HCE, Spotfire, GeneSpring)
- Spotfire: highest number of insights
- HCE: poorest

Total Domain Value
(Chart by tool: ClusterView, TimeSearcher 1, HCE, Spotfire, GeneSpring)
- Spotfire: highest insight value
- HCE, GeneSpring: poorer

Avg. Final Amount Learned
(Chart by tool: ClusterView, TimeSearcher 1, HCE, Spotfire, GeneSpring)
- Spotfire: high value in learning
- ClusterView and HCE are poor

Avg. Time to First Insight
(Chart by tool: ClusterView, TimeSearcher 1, HCE, Spotfire, GeneSpring)
- ClusterView: very short time to first insight
- TimeSearcher 1 and Spotfire are also quick

Avg. Total Time
(Chart by tool: ClusterView, TimeSearcher 1, HCE, Spotfire, GeneSpring)
- Total time users spent using the tool
- Low values: either efficient or not useful for insight

Unexpected Insights
- HCE revealed several unexpected results
- ClusterView provided a few
- TimeSearcher 1 for time series data
- Spotfire contributed to 2 unexpected insights

Hypotheses
- A few insights led to hypotheses
  - Spotfire: 3
  - ClusterView: 2
  - TimeSearcher 1: 1
  - HCE: 1

Tools vs. Datasets

Insight Categories
- Overall Gene Expression
  - Overview of genes in general
- Expression Patterns
  - Searching for patterns is critical
  - Clustering is useful
- Grouping
  - Some users wanted to group genes
  - GeneSpring enables grouping
- Detail Information
  - Users want detailed information about genes that are familiar to them

Visual Representations and Interactions
- Although some tools offer many visualization techniques, users tend to use only a few
  - Spotfire users preferred heat-maps
  - GeneSpring users preferred parallel coordinates
- Lupus dataset: visualized best with heat-maps
- Most users preferred the outputs of clustering algorithms
- HCE is not useful when a particular column arrangement is useful

Running out of time, So, wrap up
- Use a visualization tool (that's why we're here!)
  - Spotfire: best general performance
  - GeneSpring: hard to use
- The dataset dictates the best tool!
  - Time series data: TimeSearcher
  - Others: Spotfire, GeneSpring?
- Interaction is the key
- Grouping and clustering are necessary features

Critique
- In all fairness, measuring insights is really hard!
- Here are some possible issues
- Subjectivity
  - The experiment relies on users always thinking aloud
  - It also depends on a domain expert to evaluate insights
  - Results may vary widely based on participants' expertise (only two per tool-dataset pair)
  - Some insight characteristics are inherently subjective
    - Domain Value
    - Breadth vs. Depth

Critique (continued)
- How does one count insights?
  - Assumes honest reporting by participants
  - Some insights may be of no great value
  - What if a discovery just reaffirms a known fact? Is that an insight?
- Measuring the time taken to reach an insight
  - Maybe instead of measuring from the beginning of the session we should measure from the last insight
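
The per-tool measures reported above (number of insights, total domain value, average time to first insight) and the alternative timing suggested in the critique (time since the previous insight) can be sketched in a few lines of Python. This is only an illustrative sketch, not the study's actual coding scheme: the `Insight` record, its field names, the numeric domain-value scale, and the tool and participant labels ("ToolA", "p1", …) are all assumptions made up for the example, and the sample data is invented.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Insight:
    """One coded insight (hypothetical record layout)."""
    tool: str
    participant: str
    minutes: float        # time since the start of the session
    domain_value: int     # assumed numeric score assigned by a domain expert
    unexpected: bool = False


def summarize(insights):
    """Per tool: insight count, total domain value, avg. time to first insight."""
    by_tool = defaultdict(list)
    for ins in insights:
        by_tool[ins.tool].append(ins)

    summary = {}
    for tool, items in by_tool.items():
        # Earliest insight time for each participant who used this tool.
        first = {}
        for ins in items:
            prev = first.get(ins.participant, ins.minutes)
            first[ins.participant] = min(prev, ins.minutes)
        summary[tool] = {
            "count": len(items),
            "total_value": sum(i.domain_value for i in items),
            "avg_first": sum(first.values()) / len(first),
        }
    return summary


def gaps_since_last_insight(insights, participant):
    """Time to each insight measured from the previous one (first from start)."""
    times = sorted(i.minutes for i in insights if i.participant == participant)
    return [later - earlier for earlier, later in zip([0.0] + times, times)]


# Invented sample data, for illustration only.
sample = [
    Insight("ToolA", "p1", 4.0, 2),
    Insight("ToolA", "p1", 10.0, 3, unexpected=True),
    Insight("ToolA", "p2", 6.0, 1),
    Insight("ToolB", "p3", 12.0, 5),
]
print(summarize(sample))
print(gaps_since_last_insight(sample, "p1"))
```

Measuring gaps between consecutive insights, as the critique proposes, separates the initial learning period from the pace of discovery once a participant is productive; measuring from the session start mixes the two.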