An Evaluation of Microarray Visualization Tools for Biological Insight

Purvi Saraiya, Chris North
Dept. of Computer Science, Virginia Polytechnic Institute and State University
Karen Duca
Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University
Presented by Tugrul Ince and Nir Peer, University of Maryland
Goals

Evaluate five popular visualization tools
  Cluster/TreeView
  TimeSearcher
  Hierarchical Clustering Explorer (HCE)
  Spotfire
  GeneSpring
Do so in the context of bioinformatics data exploration
Goals

Research Questions
  How successful are these tools in stimulating insight?
  How do various visualization techniques affect the users’ perception of data?
  How does users’ background affect the tool usage?
  How do these tools support hypothesis generation?
  Can insight be measured in a controlled experiment?
Visualization Evaluations

Typically evaluations consist of
  controlled measurements of user performance and accuracy on predetermined tasks
We are looking for an evaluation that better simulates a bioinformatics data analysis scenario
We use a protocol that focuses on
  recognition and quantification of insights gained from actual exploratory use of visualizations
Insights



Hard to define what an “insight” is
We need this term to be quantifiable and reproducible
Solution
  Encourage users to think aloud and report any findings they have about the dataset
  Videotape a session to capture and characterize individual insights as they occur
    generally provides more information than subjective measures from post-experiment surveys
Insights

Define insight as
  an individual observation about the data by the participant
  a unit of discovery
  Essentially, any data observation made during the think-aloud protocol
Now we can quantify some characteristics of each insight
Insight Characteristics

Observation: The actual finding about the data
Time: The amount of time taken to reach the insight
Domain Value: The significance of the insight. Coded by a domain expert.
Hypotheses: Hypothesis and direction of research
Directed vs. Unexpected
  Recall: participants are asked to identify questions they want to explore
Correctness
Breadth vs. Depth
Insight Characteristics

Category
  Overview – overall distributions of gene expression
  Patterns – identification or comparison across data attributes
  Groups – identification or comparison of groups of genes
  Details – focused information about specific genes
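
The characteristics listed above could be recorded as one structured entry per insight. The following is only a minimal sketch: the field names, the integer scale for domain value, and the example entry are assumptions for illustration and are not taken from the study.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Category(Enum):
    OVERVIEW = "overview"  # overall distributions of gene expression
    PATTERNS = "patterns"  # identification or comparison across data attributes
    GROUPS = "groups"      # identification or comparison of groups of genes
    DETAILS = "details"    # focused information about specific genes


@dataclass
class Insight:
    observation: str           # the actual finding about the data
    minutes_elapsed: float     # time taken to reach the insight
    domain_value: int          # significance, coded by a domain expert (scale assumed)
    leads_to_hypothesis: bool  # did it suggest a hypothesis or research direction?
    directed: bool             # True if it answers a question the participant listed up front
    correct: Optional[bool]    # None until a domain expert has checked it
    is_broad: bool             # True for breadth, False for depth
    category: Category


# Hypothetical entry, made up purely to show the shape of a record.
example = Insight(
    observation="(hypothetical) most genes show little change across conditions",
    minutes_elapsed=4.5,
    domain_value=2,
    leads_to_hypothesis=False,
    directed=False,
    correct=None,
    is_broad=True,
    category=Category.OVERVIEW,
)
print(example.category.value)
```

Keeping correctness optional reflects that it is coded after the session rather than during it.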

Experiment Design

A 3×5 between-subjects design
  between-subjects: different subjects for each dataset-tool pair
  Dataset: 3 treatments
  Visualization tool: 5 treatments
Experiment Design

Participants
  2 participants per dataset per tool
  Have at least a Bachelor’s degree in a biological field
  Assigned to tools they had never worked with before
    to prevent advantage
    measure learning time
Categories
  10 Domain Experts
    Senior researchers with extensive experience in microarray experiments and microarray data analysis
  11 Domain Novices
    Lab technicians or graduate student research assistants
  9 Software Developers
    Professionals who implement microarray software tools
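
As a quick consistency check on the design above, the dataset-tool cells can be enumerated and the participant count compared with the three background categories. This is a sketch only: the dataset names are placeholders, since the slides do not list them here.

```python
from itertools import product

# Placeholder dataset names (assumption); the tool names come from the slides.
DATASETS = ["dataset_1", "dataset_2", "dataset_3"]
TOOLS = ["Cluster/TreeView", "TimeSearcher", "HCE", "Spotfire", "GeneSpring"]
PARTICIPANTS_PER_CELL = 2

# Every dataset-tool pair is one cell of the between-subjects design.
cells = list(product(DATASETS, TOOLS))
total_participants = len(cells) * PARTICIPANTS_PER_CELL

# Participant backgrounds as reported on the slide.
background_counts = {
    "domain expert": 10,
    "domain novice": 11,
    "software developer": 9,
}

print(len(cells))                       # 15
print(total_participants)               # 30
print(sum(background_counts.values()))  # 30
```

The two totals agree: 15 cells with 2 participants each gives the same 30 people as the three categories combined.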
Protocol and Measures

Chose new users with only minimal tool training
  Success in the initial usage period is critical for the tool’s adoption by biologists
Participants received an initial training
  Background description about the dataset
  15-minute tool tutorial
Participants listed some analysis questions
Instructed to examine the data with the tool as long as needed
They were allowed to ask for help about the tool
  Simulates training by colleagues
Protocol and Measures



Every 15 minutes, participants estimated the percent of total potential insight they had obtained so far
Finally, assessed overall experience with the tools during session
Entire session was videotaped for later analysis
  Later, all individual occurrences of insights were identified and codified
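
Once insights have been identified and coded from the recordings, per-tool measures such as insight count, total domain value, and average time to first insight can be derived from them. The sketch below assumes a simple tuple layout and uses invented records with generic tool names; none of the numbers are results from the study.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical coded records: (participant, tool, minutes into session, domain value).
coded_insights = [
    ("p1", "ToolA", 3.0, 2),
    ("p1", "ToolA", 10.0, 4),
    ("p2", "ToolA", 6.0, 1),
    ("p3", "ToolB", 8.0, 3),
]


def summarize(records):
    """Aggregate coded insights into per-tool summary measures."""
    by_tool = defaultdict(list)
    for participant, tool, minutes, value in records:
        by_tool[tool].append((participant, minutes, value))

    summary = {}
    for tool, rows in by_tool.items():
        # Earliest insight time for each participant who used this tool.
        first_by_participant = {}
        for participant, minutes, _ in rows:
            earliest = first_by_participant.get(participant, minutes)
            first_by_participant[participant] = min(minutes, earliest)
        summary[tool] = {
            "insight_count": len(rows),
            "total_domain_value": sum(value for _, _, value in rows),
            "avg_time_to_first_insight": mean(list(first_by_participant.values())),
        }
    return summary


summary = summarize(coded_insights)
print(summary["ToolA"])
```

Averaging the first-insight time per participant, rather than over all insights, keeps a prolific participant from dominating that measure.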
Show me pictures
Here are the tools!!!
Cluster/TreeView = ClusterView

Cluster
  to cluster data
TreeView
  Visualize the clusters
  Uses heat-maps
TimeSearcher 1



Parallel Coordinate Visualization
Interactive Filtering
Line Graphs for each data entity
HCE


Clusters data
Several Visualizations
  Heat-Maps
  Parallel Coordinates
  Scatter Plots
  Histograms
Brushing and Linking
Spotfire


General Purpose Visualization Tool
Several Displays
  Scatter Plots
  Bar Graphs
  Histograms
  Pie/Line Charts
  Others…
Dynamic Query Sliders
Brushing and Linking
GeneSpring

Suitable for Microarray data analysis
Shows physical positions on genomes
Array layouts
Pathways
Gene-to-gene comparison
Brushing and Linking
Clustering capability
Enough about Tools,
Tell me the Results!!!
Number of Insights
[Chart: number of insights per tool: ClusterView, TimeSearcher 1, HCE, Spotfire, GeneSpring]
Spotfire: Highest number of insights
HCE: poorest
Total Domain Value
[Chart: total domain value per tool: ClusterView, TimeSearcher 1, HCE, Spotfire, GeneSpring]
Spotfire: Highest insight value
HCE, GeneSpring: poorer
Avg. Final Amount Learned
[Chart: average final amount learned per tool: ClusterView, TimeSearcher 1, HCE, Spotfire, GeneSpring]
Spotfire: high value in learning
ClusterView and HCE are poor
Avg. Time to First Insight
[Chart: average time to first insight per tool: ClusterView, TimeSearcher 1, HCE, Spotfire, GeneSpring]
ClusterView: very short time to first insight
TimeSearcher 1 and Spotfire are also quick
Avg. Total Time
[Chart: average total time per tool: ClusterView, TimeSearcher 1, HCE, Spotfire, GeneSpring]
Total time users spent using the tool
Low Values: Efficient or Not useful for insight
Unexpected Insights




HCE revealed several unexpected results
ClusterView provided a few
TimeSearcher 1 provided some for time series data
Spotfire contributed to 2 unexpected insights
Hypotheses
  A few insights led to hypotheses
    Spotfire: 3
    ClusterView: 2
    TimeSearcher 1: 1
    HCE: 1
Tools vs. Datasets
Insight Categories

Overall Gene Expression
  Overview of genes in general
Expression Patterns
  Searching for patterns is critical
  Clustering is useful
Grouping
  Some users wanted to group genes
  GeneSpring enables grouping
Detail Information
  Users want detailed information about genes that are familiar to them
Visual Representations and Interactions

Although some tools have many visualization techniques, users tend to use only a few
  Spotfire users preferred heat-maps
  GeneSpring users preferred parallel coordinates
Lupus dataset: visualized best with heat-maps
Most users preferred outputs of clustering algorithms
HCE not useful when a particular column arrangement is useful
Running out of time, so let's wrap up
Use a visualization tool (that’s why we’re here!)
Spotfire: best general performance
GeneSpring: hard to use
Dataset dictates best tool!
  Time Series data: TimeSearcher
  Others: Spotfire, GeneSpring?
Interaction is the key
Grouping and Clustering are necessary features
Critique


In all fairness, measuring insights is really hard! Here are some possible issues
Subjectivity
  Experiment relies on users always thinking aloud
  Also, depends on a domain expert to evaluate insights
  Results may vary widely based on participants’ expertise (only two per tool-dataset pair)
  Some insight characteristics are inherently subjective
    Domain Value
    Breadth vs. Depth
Critique

How does one count insights?
  Assumes honest reporting by participants
  Some insights may be of no great value
  What if a discovery just reaffirms a known fact? Is that an insight?
Measuring time taken to reach an insight
  Maybe instead of measuring from beginning of session we should measure from last insight
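
The alternative timing suggested above can be derived from the same timestamps. A minimal sketch, assuming each insight is logged as minutes since the start of the session; the function name and sample values are hypothetical.

```python
def time_since_previous(insight_minutes):
    """Convert minutes-from-session-start into minutes since the previous insight.

    The first insight is still measured from the start of the session.
    """
    gaps = []
    previous = 0.0
    for t in sorted(insight_minutes):
        gaps.append(t - previous)
        previous = t
    return gaps


# Hypothetical timestamps for one participant.
from_start = [5.0, 12.0, 30.0]
print(time_since_previous(from_start))  # [5.0, 7.0, 18.0]
```

Measured this way, a late insight that follows quickly after an earlier one is no longer penalized for arriving late in the session.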