More Microarray Analysis: Unsupervised Approaches

Visualization Approaches for
Gene Expression Data
Matt Hibbs
Assistant Professor
The Jackson Laboratory
March 4, 2010
Transcriptomics & Gene
Expression
• Simultaneous measurement of
transcription for the entire genome
• Useful for broad range of biological
questions
Figure: DNA → (Transcription) → mRNA → (Translation, at the Ribosome) → Proteins
Outline
• Technologies & Specific Concerns
– cDNA microarrays (2-color & 1-color arrays)
– RNA-seq
• Normalization visualizations
• Full data displays
• Dimensionality reduction
• Sequence-order displays
• Comparative visualization
• Future Directions
Technology: 2-color cDNA
Microarrays
Figure (workflow, labels recovered from the slide):
• Spot slide with known sequences (spots A, B, C, D)
• Label the two samples: reference mRNA, test mRNA (add green dye, add red dye)
• Add mRNA to slide for hybridization; hybridize
• Scan hybridized array
• Example readout per spot: A 1.5, B 0.8, C -1.2, D 0.1
Technology: 2-color cDNA
Microarrays
Technology: RNA-seq
Image from WikiMedia
Normalization: MA-plot
• Need to account for intensity bias between channels (red/green, or multiple 1-color arrays)
• MA-plot (also called RI-plot) shows relationship between ratio and intensity
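The quantities behind an MA-plot can be sketched as follows. This is a minimal illustration, assuming the commonly used definitions M = log2(R/G) for the ratio and A = 0.5 * log2(R*G) for the intensity. The function name `ma_values` and the sample intensities are hypothetical and do not come from the slides.

```python
import math

def ma_values(red, green):
    """Return (M, A) pairs for paired channel intensities.

    M = log2(R / G)        (log ratio, plotted on the y-axis)
    A = 0.5 * log2(R * G)  (average log intensity, plotted on the x-axis)
    Assumes every intensity is strictly positive.
    """
    pairs = []
    for r, g in zip(red, green):
        log_r, log_g = math.log2(r), math.log2(g)
        pairs.append((log_r - log_g, 0.5 * (log_r + log_g)))
    return pairs

# Hypothetical intensities for four spots (illustration only)
red = [1024.0, 256.0, 64.0, 512.0]
green = [256.0, 256.0, 128.0, 512.0]
ma = ma_values(red, green)
```

Plotting M against A for every spot gives the MA-plot. A trend in M that depends on A suggests an intensity-dependent bias that normalization should remove.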
Normalization: Box-Whisker
Quantile
• Quantile normalization is often used to adjust for between-chip variance
• Box-Whisker plots typically used to visualize the process
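A minimal sketch of quantile normalization, assuming each chip is a plain list of values of equal length and that ties are broken by position rather than averaged. The names `quantile_normalize` and `box_summary` and the sample values are hypothetical. The summary uses `statistics.quantiles` with its default method, which may differ from the quartile rule a plotting tool applies.

```python
import statistics

def quantile_normalize(chips):
    """Give every chip the same value distribution.

    The k-th smallest value on each chip is replaced by the mean of the
    k-th smallest values across all chips.
    """
    n = len(chips[0])
    if any(len(chip) != n for chip in chips):
        raise ValueError("all chips must have the same number of values")
    sorted_chips = [sorted(chip) for chip in chips]
    rank_means = [sum(s[k] for s in sorted_chips) / len(chips) for k in range(n)]
    normalized = []
    for chip in chips:
        # Indices of this chip's values in increasing order
        order = sorted(range(n), key=lambda i: chip[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

def box_summary(values):
    """Numbers a box-whisker plot draws: min, Q1, median, Q3, max."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return (min(values), q1, median, q3, max(values))

# Hypothetical values for three chips (illustration only)
chips = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 6.0, 2.0], [3.0, 4.0, 8.0, 6.0]]
normalized = quantile_normalize(chips)
summaries = [box_summary(chip) for chip in normalized]
```

Before normalization the three box summaries differ. Afterwards they are identical, which is what side-by-side box-whisker plots of the chips make visible.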
Full Data Displays
• Techniques to show all of the data at once
• Heat Maps
– Displays numerical values as colors
– Good to see all data intuitively
– Requires clustering to see patterns
• Parallel Coordinates
– Line plots of high-dimensional data
– Easy to see/select trends or patterns
– Especially good for course data (time course, drug course, etc.)
Heat Maps
Figure: data matrix → Cluster → Rasterize
Color scale: -3 (Under-Expressed), 0, +3 (Over-Expressed)
Heat Maps: Stats
• Clustering important to see patterns
– Hierarchical, K-means, SOM, etc…
– Choice of distance metric in addition to
method
• Match the visualization mapping to the
statistics used for analysis
– Coloration based on actual numbers
appropriate for Euclidean distance measures
– Centered or normalized measures should use
corresponding colorings
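The idea of matching the coloring to the statistic can be sketched as below. This is an illustration under assumptions: the palette (green for low, black for zero, red for high) and the clipping limit of 3 are choices made here, and `center_row` and `value_to_rgb` are hypothetical helper names.

```python
def center_row(row):
    """Subtract the row mean, as a centered measure would."""
    mean = sum(row) / len(row)
    return [v - mean for v in row]

def value_to_rgb(value, limit=3.0):
    """Map a value to an (r, g, b) tuple.

    Assumed palette: green for negative, black for zero, red for positive.
    Values beyond +/- limit are clipped so outliers do not wash out the scale.
    """
    clipped = max(-limit, min(limit, value))
    level = int(round(255 * abs(clipped) / limit))
    if clipped >= 0:
        return (level, 0, 0)
    return (0, level, 0)

# Hypothetical expression row (illustration only)
row = [1.0, 2.0, 3.0]
raw_colors = [value_to_rgb(v) for v in row]                   # colors follow the actual numbers
centered_colors = [value_to_rgb(v) for v in center_row(row)]  # colors follow the centered values
```

If the clustering used raw values, `raw_colors` reflects what the statistic saw. If it used centered values, `centered_colors` is the corresponding coloring.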
Heat Maps: Distance Metrics
Euclidean Distance
Spearman Correlation
Pearson Correlation
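The three measures named on this slide can be sketched in plain Python as follows. The function names and the example vectors are hypothetical. The two correlations are similarities; a common convention, assumed here rather than taken from the slides, is to cluster on 1 minus the correlation.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation: linear agreement, insensitive to shift and scale."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical profiles: y rises with x, but not linearly
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 4.0, 9.0, 16.0]
d_euclid, r_pearson, r_spearman = euclidean(x, y), pearson(x, y), spearman(x, y)
```

For these vectors the Spearman correlation is 1 because the ordering agrees exactly, the Pearson correlation is slightly below 1, and the Euclidean distance is large because the magnitudes differ.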
Heat Maps: Stats
Data clustered using a rank-based statistic
Color scale: lowest value to highest value
Heat Maps: Overview + Detail
Data from Spellman et al., 1998
Java TreeView, Saldanha et al.
Parallel Coordinates
• View expression vectors as lines
– X-axis = conditions
– Y-axis = value
Time Searcher, Hochheiser et al.
Parallel Coordinates
• Selection and Interaction
methods can answer specific
questions
• Brushing techniques to select
patterns
• Drawbacks: displays become cluttered for
large datasets, and only a limited number of
conditions can be shown effectively
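Selecting a pattern by brushing can be sketched as a simple filter: keep the profiles whose lines pass through a rectangle drawn over a range of conditions. This is an illustration only and not the Time Searcher implementation; the function `brush`, the gene names, and the values are hypothetical.

```python
def brush(profiles, first, last, low, high):
    """Return names of profiles that stay inside the brushed rectangle.

    The rectangle spans condition indices first..last (inclusive) on the
    x-axis and the value range [low, high] on the y-axis.
    """
    selected = []
    for name, values in profiles.items():
        window = values[first:last + 1]
        if window and all(low <= v <= high for v in window):
            selected.append(name)
    return selected

# Hypothetical expression profiles over four conditions
profiles = {
    "geneA": [0.1, 1.2, 1.8, 0.3],
    "geneB": [0.0, 0.2, -0.4, 0.1],
    "geneC": [-0.5, 1.1, 2.4, 1.9],
}
hits = brush(profiles, first=1, last=2, low=1.0, high=2.5)
```

Here the brush covers conditions 1 and 2 with values between 1.0 and 2.5, so `geneA` and `geneC` are selected and `geneB` is not.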
Time Searcher, Hochheiser et al.
Dimensionality Reduction
• Project data from large, high-dimensional
space to a smaller space (usually 2D or 3D)
• Several techniques:
– SVD & PCA
– Multidimensional scaling
• Once projected into lower dimension, use
standard 2D (or 3D) techniques
Dimensionality Reduction
Dimensionality Reduction: SVD
Transform original data vectors into an
orthogonal basis that captures
decreasing amounts of variation
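The projection onto directions of decreasing variation can be sketched without a linear algebra library. The code below does not compute a full SVD. It swaps in power iteration with deflation on the covariance matrix, which approximates the leading principal axes; for column-centered data these coincide with the leading right singular vectors. All function names and the sample matrix are hypothetical.

```python
import math

def center_columns(data):
    """Subtract each column (condition) mean from every row (gene)."""
    n = len(data)
    means = [sum(row[j] for row in data) / n for j in range(len(data[0]))]
    return [[v - m for v, m in zip(row, means)] for row in data]

def covariance(centered):
    """Sample covariance between conditions (divides by n - 1)."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(d)] for i in range(d)]

def leading_eigenpair(matrix, iterations=500):
    """Power iteration: approximate the largest eigenvalue and its unit vector."""
    d = len(matrix)
    vec = [float(i + 1) for i in range(d)]  # asymmetric start vector
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    product = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
    return sum(v * p for v, p in zip(vec, product)), vec

def top_components(data, k=2):
    """Return k (variance, axis) pairs and each row's coordinates on those axes."""
    centered = center_columns(data)
    cov = covariance(centered)
    components = []
    for _ in range(k):
        value, vec = leading_eigenpair(cov)
        components.append((value, vec))
        # Deflation: remove the direction just found before searching again
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(len(vec))]
               for i in range(len(vec))]
    coords = [[sum(r * c for r, c in zip(row, vec)) for _, vec in components]
              for row in centered]
    return components, coords

# Hypothetical matrix: 5 genes (rows) x 3 conditions (columns)
data = [[2.0, 1.9, 0.1], [1.0, 1.1, -0.1], [0.0, 0.0, 0.0],
        [-1.0, -0.9, 0.1], [-2.0, -2.1, -0.1]]
components, coords = top_components(data, k=2)
```

Each row of `coords` is a 2D point that can go straight into a scatter plot. The first axis carries most of the variance in this toy matrix because the first two conditions move together.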
Dimensionality Reduction: SVD
SVD Example
Legend: G1, S, G2, M, M/G1
Data from Spellman et al., 1998
GeneVAnD, Hibbs et al.
Sequence-based Visualization
• View data in chromosomal order
– Copy number variation & aneuploidies
• common in cancers & other disorders
– Comparative Genomic Hybridization (CGH)
– mRNA sequencing (RNA-seq)
– Borrows concepts from genome browsers
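Viewing data in chromosomal order can be sketched as a sort followed by smoothing along each chromosome, so that a gained or lost region stands out from probe-level noise. This is an assumption-laden illustration: the record fields (`chrom`, `start`, `log_ratio`), the function names, and the values are hypothetical, and a centered moving average stands in for whatever smoothing a real tool applies.

```python
def chromosomal_order(probes):
    """Sort probe records by chromosome number, then by start position."""
    return sorted(probes, key=lambda p: (p["chrom"], p["start"]))

def running_mean(values, window=3):
    """Centered moving average; the window shrinks at both ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def smooth_by_chromosome(probes, window=3):
    """Return {chromosome: smoothed log ratios in positional order}."""
    by_chrom = {}
    for probe in chromosomal_order(probes):
        by_chrom.setdefault(probe["chrom"], []).append(probe["log_ratio"])
    return {chrom: running_mean(vals, window) for chrom, vals in by_chrom.items()}

# Hypothetical probes: chromosome 2 shows a consistently raised log ratio
probes = [
    {"name": "p3", "chrom": 1, "start": 300, "log_ratio": 0.1},
    {"name": "p1", "chrom": 1, "start": 100, "log_ratio": 0.0},
    {"name": "p5", "chrom": 2, "start": 200, "log_ratio": 1.0},
    {"name": "p2", "chrom": 1, "start": 200, "log_ratio": -0.1},
    {"name": "p4", "chrom": 2, "start": 100, "log_ratio": 0.9},
    {"name": "p6", "chrom": 2, "start": 300, "log_ratio": 1.1},
]
ordered_names = [p["name"] for p in chromosomal_order(probes)]
smoothed = smooth_by_chromosome(probes)
```

Chromosomes are stored as integers here so that the sort is numeric. String labels such as "chr10" would need a custom key to avoid sorting before "chr2".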
Sequence-based: CGH
• Karyoscope plots
Java TreeView, Saldanha et al.
Sequence-based: RNA-seq
IGV, http://www.broadinstitute.org/igv
Comparative Visualization
• Using multiple simultaneous
complementary views of data
• Each scheme emphasizes different
aspects – use multiple to show overall
picture
• Show multiple, related datasets to identify
common and unique patterns
Comparative Visualization: Single
Dataset
MeV, Saeed et al.
Comparative Visualization: Single
Dataset
Spotfire
GeneSpring
Comparative Visualization: Multidataset
Data from Spellman et al., 1998
Dendrogram
Heat Map Overview
HIDRA
Hibbs et al.
Comparative Visualization: Multidataset
Data from Spellman et al., 1998
Selection
Synchronized Details
HIDRA
Hibbs et al.
Comparative Visualization: Multidataset
Data from Spellman et al., 1998
Selection
HIDRA
Hibbs et al.
Summary & Tools
• R & Bioconductor
• Java TreeView (Saldanha, 2004)
• Time Searcher (Hochheiser et al., 2003)
• Integrative Genomics Viewer (IGV;
www.broadinstitute.org/igv)
• TIGR’s MultiExperiment Viewer (MeV;
Saeed et al., 2003)
• HIDRA (Hibbs et al., 2007)
Trends & Future Directions
• Emphasis on usability and audience
– If a “wet bench” biologist can’t use it…
• Incorporate common statistical analysis
techniques with visualizations
– e.g. differential expression tests, GO enrichments,
etc.
• Isoforms and Splice variants
• New user interaction schemes
– e.g. multi-touch interfaces, large-format displays
• Low level “systems analysis”
– linking together multiple types of data into unified
displays
Acknowledgements
• Hibbs Lab
– Karen Dowell
– Tongjun Gu
– Al Simons
• Olga Troyanskaya Lab
– Patrick Bradley
– Maria Chikina
– Yuanfang Guan
• Chad Myers
• David Hess
• Florian Markowetz
• Edo Airoldi
• Curtis Huttenhower
• Kai Li Lab
– Grant Wallace
• Amy Caudy
• Maitreya Dunham
• Botstein, Kruglyak,
Broach, Rose labs
• Kyuson Yun
• Carol Bult
Postdoctoral Opportunities in
Computational & Systems Biology
The Center for Genome Dynamics at The Jackson Laboratory
www.genomedynamics.org
Investigators use computation, mathematical modeling and
statistics, with a shared focus on the genetics of complex traits
Requires PhD (or equivalent) in quantitative field such as
computer science, statistics, applied mathematics or in
biological sciences with strong quantitative background
Programming experience recommended
The Jackson Laboratory was voted #2 in a poll of postdocs
conducted by The Scientist in 2009 and is an EOE/AA employer