Information Visualization Design for
Multidimensional Data:
Integrating the Rank-by-Feature Framework with
Hierarchical Clustering
Dissertation Defense
Human-Computer Interaction Lab &
Dept. of Computer Science
Jinwook Seo
Outline
• Research Problems
• Clustering Result Visualization in HCE
• GRID Principles
• Rank-by-Feature Framework
• Evaluation
– Case studies
– User survey via emails
• Contributions and Future Work
Exploration of Multidimensional Data
• To understand the story that the data tells
• To find features in the data set
• To generate hypotheses
• Users get lost in multidimensional space
• Tools and techniques are available in many areas
• A strategy and an interface are needed to organize them and guide discovery
Constrained by Conventions
[Diagram: User/Researcher ↔ Conventional Tools ↔ Statistical Methods, Data Mining Algorithms ↔ Multidimensional Data]
Boosting Information Bandwidth
[Diagram: User/Researcher ↔ Information Visualization Interfaces ↔ Statistical Methods, Data Mining Algorithms ↔ Multidimensional Data]
Contributions
• Graphics, Ranking, and Interaction for
Discovery (GRID) principles
• Rank-by-Feature Framework
• The design and implementation of the
Hierarchical Clustering Explorer (HCE)
• Validation through case studies and user
surveys
Hierarchical Clustering Explorer:
Understanding Clusters Through Interactive Exploration
• Overview of the entire clustering result → compressed overview
• The right number of clusters → minimum similarity bar (see the sketch after this list)
• Overall pattern of each cluster (aggregation) → detail cutoff bar
• Compare two results → brushing and linking using pair-tree
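The minimum similarity bar determines the number of clusters by cutting the dendrogram at a similarity threshold. Below is a minimal sketch of that idea, assuming SciPy and made-up toy data (the variable min_similarity and the random table are illustrative only); HCE itself is implemented in C++, so this is not its code.

# Minimal sketch of the "minimum similarity bar": cutting a dendrogram at a
# similarity threshold yields a flat clustering. Toy data, not HCE's code.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
data = rng.normal(size=(30, 5))      # 30 items with 5 dimensions (toy data)

# average-linkage hierarchical clustering with correlation distance
Z = linkage(data, method="average", metric="correlation")

# a minimum-similarity bar at s corresponds to cutting the tree at
# distance 1 - s when similarity is taken as 1 - correlation distance
min_similarity = 0.5
clusters = fcluster(Z, t=1 - min_similarity, criterion="distance")
print(len(set(clusters)), "clusters:", clusters)

Raising min_similarity cuts the tree lower and produces more, tighter clusters; lowering it merges them, which is exactly the interaction the bar provides.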
HCE History
• Document-View Architecture
• 72,274 lines of C++ code, 76 C++ classes
• About 2,500 downloads since April 2002
• Commercial license to a biotech company (www.vialactia.com)
• Freely downloadable at www.cs.umd.edu/hcil/hce
Goal: Find Interesting Features in
Multidimensional Data
• Finding clusters, outliers, correlations, gaps, …
is difficult in multidimensional data
– Cognitive difficulties in >3D
• Therefore utilize low-dimensional projections
– Perceptual efficiency in 1D and 2D
– Orderly process to guide discovery
Do you see anything interesting?
[Scatter plot: an attribute of the chemical elements (0–50) vs. Ionization Energy (50–250)]
Correlation…What else?
[Scatter plot: the same view, showing the correlation]
Outliers
[Scatter plot: the same view, with the outliers He and Rn marked]
GRID Principles
• Graphics, Ranking, and Interaction for
Discovery in Multidimensional Data
• Study 1D, study 2D, then find features
• Ranking guides insight, statistics confirm
Rank-by-Feature Framework
• Based on the GRID principles
• 1D → 2D
– 1D : Histogram + Boxplot
– 2D : Scatterplot
• Ranking Criteria
– statistical methods
– data mining algorithms
• Graphical Overview
• Rapid & Interactive Browsing
A Ranking Example
• 3,138 U.S. counties with 17 attributes
• Uniformity (entropy): 6.7, 6.1, 4.5, 1.5
• Pearson correlation coefficient: 0.996, 0.31, 0.01, -0.69
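A minimal sketch of how these two criteria can score and rank features, assuming NumPy/SciPy and a made-up table standing in for the county data; the function name uniformity and all values are illustrative, not HCE's code.

# Rank each dimension by uniformity (entropy of its histogram) and each
# dimension pair by absolute Pearson correlation. Toy data only.
from itertools import combinations

import numpy as np
from scipy.stats import entropy, pearsonr

def uniformity(x, bins=10):
    """Entropy of a histogram of x in bits; larger means more uniform."""
    hist, _ = np.histogram(x, bins=bins)
    return entropy(hist, base=2)

rng = np.random.default_rng(1)
table = rng.normal(size=(3138, 17))   # placeholder for the 17 county attributes

ranked_1d = sorted(range(table.shape[1]),
                   key=lambda j: uniformity(table[:, j]),
                   reverse=True)
ranked_2d = sorted(combinations(range(table.shape[1]), 2),
                   key=lambda p: abs(pearsonr(table[:, p[0]], table[:, p[1]])[0]),
                   reverse=True)

print("most uniform dimensions:", ranked_1d[:3])
print("most correlated pairs:  ", ranked_2d[:3])

The ranked lists correspond to the ordered score overviews in the framework: the user scans from the top of each ranking instead of inspecting all 17 histograms and 136 scatterplots.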
Categorical Variables in RFF
• New ranking criteria
– Chi-square, ANOVA
• Significance and Strength
– How strong is a relationship?
– How significant is a relationship?
• Partitioning and Comparison
– partition by a column (categorical variable)
– partition by a row (class info for columns)
– compare clustering results for partitions
[Score overview legends — color: contingency coefficient C / quadracity; size: chi-square p-value / least-squares error (see the sketch below)]
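To illustrate the significance/strength distinction behind the chi-square criterion, here is a minimal sketch on a made-up 2x3 contingency table, assuming SciPy: the p-value plays the role of significance and the contingency coefficient C the role of strength. This illustrates the criterion only; it is not HCE's implementation.

# Chi-square ranking criterion for a pair of categorical variables:
# p-value = significance, contingency coefficient C = strength. Made-up data.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[30, 14,  6],    # e.g. category A of variable 1
                     [10, 22, 18]])   # e.g. category B of variable 1

chi2, p_value, dof, expected = chi2_contingency(observed)
n = observed.sum()
C = np.sqrt(chi2 / (chi2 + n))        # contingency coefficient C in [0, 1)

print(f"chi-square = {chi2:.2f}, p = {p_value:.4f} (significance)")
print(f"contingency coefficient C = {C:.3f} (strength)")

A pair can be highly significant (tiny p-value) yet weak (small C) when n is large, which is why the framework encodes the two quantities separately.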
Partitioning and Comparison
Partition by a column (the categorical variable s7):

           s1       s2       s3    s4       s5       s6       s7
FieldType  integer  integer  real  integer  integer  integer  categorical
(items i1, i2, i3, …, in-1, in; s7 values: M, M, M, …, F, F)

Compare two column-clustering results
Partitioning and Comparison
Partition by a row (the class-info row CID):

           s1       s2       s3    s4       s5       s6
CID        1        1        1     2        2        2
FieldType  integer  integer  real  integer  integer  integer
(items i1, i2, i3, …, in-1, in)

Compare two row-clustering results
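As a rough illustration of "partition by a column, then compare two column-clustering results": the sketch below partitions toy rows by a hypothetical categorical column s7 (values M/F), clusters the columns within each partition, and compares the two results. HCE makes this comparison visually with brushing and linking in the pair-tree; the adjusted Rand index here is only a simple numeric stand-in, and all data are made up.

# Partition rows by a categorical column, cluster the columns within each
# partition, then compare the two column clusterings. Toy data, not HCE code.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)
values = rng.normal(size=(200, 6))                  # columns s1..s6 (toy data)
s7 = np.where(rng.random(200) < 0.5, "M", "F")      # categorical column s7

def cluster_columns(rows, k=2):
    """Cluster the columns of the selected rows into k groups."""
    Z = linkage(rows.T, method="average", metric="correlation")
    return fcluster(Z, t=k, criterion="maxclust")

clusters_m = cluster_columns(values[s7 == "M"])
clusters_f = cluster_columns(values[s7 == "F"])
print("column clusters (M partition):", clusters_m)
print("column clusters (F partition):", clusters_f)
print("agreement (adjusted Rand index):",
      adjusted_rand_score(clusters_m, clusters_f))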
Qualitative Evaluation
• Case studies
– 30-minute weekly meetings with each participant, for 6 weeks
– observe how participants use HCE
– improve HCE according to their requirements
– 1 molecular biologist (Acute lung injuries in mice)
– 1 biostatistician (FAMuSS Study data)
– 1 meteorologist (Aerosol measurement)
Lessons Learned
• Rank-by-Feature Framework
– Enables systematic/orderly exploration
– Prevents users from missing important features
– Helps confirm known features
– Helps identify unknown features
– Reveals outliers and whether they are signal or noise
• More work needed
– Transformation of variables
– More ranking criteria
– More interactions
User Survey via Emails
• 1,500 user survey emails sent
• 13 questions on HCE and RFF
• About 60% were delivered successfully
• 85 users replied
• 60 users answered a majority of the questions
• 25 were just curious users
[Bar chart: "Which features have you used?" — dendrogram (49, most used), histogram ordering, scatterplot ordering, tabular view, profile search, gene ontology (7, least used)]
[Bar chart: "Do you think HCE improved the way you analyze your data set?" — significantly (20), somewhat significantly (13), a little bit (12), not at all (2)]
Future Work
• Integrating RFF with Other Tools
– More ranking criteria
– GRID principles available in other tools
• Scaling-up
– Selection/Filtering to handle a large number of dimensions
• Interaction in RFF
• Further Evaluation
Contributions
• Graphics, Ranking, and Interaction for
Discovery (GRID) principles
• Rank-by-Feature Framework
• The design and implementation of the
Hierarchical Clustering Explorer (HCE)
• Validation through case studies and user
surveys
Download: www.cs.umd.edu/hcil/hce