Visualization Approaches for Gene Expression Data
Matt Hibbs
Assistant Professor
The Jackson Laboratory
March 4, 2010

Transcriptomics & Gene Expression
• Simultaneous measurement of transcription for the entire genome
• Useful for a broad range of biological questions
[Figure: DNA, transcription, mRNA, translation (ribosome), proteins]

Outline
• Technologies & Specific Concerns
  – cDNA microarrays (2-color & 1-color arrays)
  – RNA-seq
• Normalization visualizations
• Full data displays
• Dimensionality reduction
• Sequence-order displays
• Comparative visualization
• Future Directions

Technology: 2-color cDNA Microarrays
[Figure: workflow. Spot slide with known sequences (A, B, C, D); reference mRNA (add green dye) and test mRNA (add red dye); add mRNA to slide for hybridization; scan hybridized array. Example readout: A 1.5, B 0.8, C -1.2, D 0.1]

Technology: 2-color cDNA Microarrays
[Figure]

Technology: RNA-seq
[Figure: image from Wikimedia]

Normalization: MA-plot
• Need to account for intensity bias between channels (red/green, or multiple 1-color arrays)
• MA-plot (also called RI-plot) shows the relationship between ratio and intensity

Normalization: Box-Whisker Quantile
• Quantile normalization is often used to adjust for between-chip variance
• Box-whisker plots are typically used to visualize the process

Full Data Displays
• Techniques to show all of the data at once
• Heat Maps
  – Display numerical values as colors
  – Good for seeing all data intuitively
  – Require clustering to see patterns
• Parallel Coordinates
  – Line plots of high-dimensional data
  – Easy to see/select trends or patterns
  – Especially good for course data (time, drug, etc.)
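The two normalization slides above can be sketched numerically. The block below is a minimal illustration, not the method used in the talk: it assumes NumPy is available, that intensities are strictly positive, and that the expression matrix is laid out as genes (rows) by arrays (columns). The function names `ma_values` and `quantile_normalize` are hypothetical, and ties are broken by position rather than averaged, which is a simplification.

```python
import numpy as np


def ma_values(red, green):
    """Return (M, A) for two intensity channels.

    M = log2(red / green) is the log ratio; A = 0.5 * log2(red * green)
    is the average log intensity. Plotting M against A gives the MA-plot.
    Assumes all intensities are strictly positive.
    """
    log_red = np.log2(np.asarray(red, dtype=float))
    log_green = np.log2(np.asarray(green, dtype=float))
    return log_red - log_green, 0.5 * (log_red + log_green)


def quantile_normalize(data):
    """Quantile-normalize the columns of a genes x arrays matrix.

    Each column is sorted, the mean of every rank across columns becomes
    the shared reference distribution, and each value is replaced by the
    reference value at its rank. Ties are broken by position (assumption).
    """
    data = np.asarray(data, dtype=float)
    # Row indices that sort each column independently.
    order = np.argsort(data, axis=0, kind="stable")
    sorted_vals = np.take_along_axis(data, order, axis=0)
    # Mean across arrays at each rank.
    reference = sorted_vals.mean(axis=1)
    normalized = np.empty_like(data)
    # Write the reference value for rank i back to the row holding rank i.
    np.put_along_axis(normalized, order, reference[:, None], axis=0)
    return normalized


# Toy data, invented for illustration only.
m, a = ma_values([8.0, 2.0], [2.0, 2.0])
chips = [[5, 4, 3], [2, 1, 4], [3, 6, 7], [4, 2, 8]]
normalized_chips = quantile_normalize(chips)
```

After normalization every column contains the same set of values, which is why side-by-side box-whisker plots of the arrays line up; drawing them before and after is one way to check the step.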
Heat Maps
[Figure: data are clustered, then rasterized; color scale runs from -3 (under-expressed) through 0 to +3 (over-expressed)]

Heat Maps: Stats
• Clustering is important for seeing patterns
  – Hierarchical, K-means, SOM, etc.
  – Choice of distance metric matters in addition to the method
• Match the visualization mapping to the statistics used for analysis
  – Coloring based on the actual values is appropriate for Euclidean distance measures
  – Centered or normalized measures should use corresponding colorings

Heat Maps: Distance Metrics
[Figure panels: Euclidean Distance, Spearman Correlation, Pearson Correlation]

Heat Maps: Stats
[Figure: data clustered using a rank-based statistic; color scale runs from lowest value to highest value]

Heat Maps: Overview + Detail
[Figure: data from Spellman et al., 1998; Java TreeView, Saldanha et al.]

Parallel Coordinates
• View expression vectors as lines
  – X-axis = conditions
  – Y-axis = value
[Figure: Time Searcher, Hochheiser et al.]

Parallel Coordinates
• Selection and interaction methods can answer specific questions
• Brushing techniques select patterns
• Displays become cluttered for large datasets, and only a limited number of conditions can be shown effectively
[Figure: Time Searcher, Hochheiser et al.]

Dimensionality Reduction
• Project data from a large, high-dimensional space to a smaller space (usually 2D or 3D)
• Several techniques:
  – SVD & PCA
  – Multidimensional scaling
• Once projected into a lower dimension, use standard 2D (or 3D) techniques

Dimensionality Reduction
[Figure]

Dimensionality Reduction: SVD
• Transform the original data vectors into an orthogonal basis that captures decreasing amounts of variation

Dimensionality Reduction: SVD
[Figure: SVD]

SVD Example
[Figure: legend G1, S, G2, M, M/G1; data from Spellman et al., 1998; GeneVAnD, Hibbs et al.]
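The SVD projection described in the slides above can be sketched in a few lines. This is only an illustration under stated assumptions: NumPy is available, the matrix is genes (rows) by conditions (columns), and each condition is mean-centered before decomposition (a choice made here, not something the slides specify). The function name `svd_project` and the synthetic periodic data are hypothetical; this is not the GeneVAnD implementation.

```python
import numpy as np


def svd_project(data, n_components=2):
    """Project genes onto the leading singular vectors.

    Returns (coords, explained):
      coords    - genes x n_components coordinates for a 2D/3D scatter plot
      explained - fraction of total variation captured by every component,
                  in non-increasing order
    Columns are mean-centered first (an assumption of this sketch).
    """
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # Thin SVD: centered = u @ diag(s) @ vt, with s sorted largest first.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Coordinates in the new orthogonal basis (same as centered @ vt.T).
    coords = u[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return coords, explained


# Synthetic periodic "expression" profiles, invented for illustration only:
# 200 genes measured at 12 time points, each with a random phase plus noise.
rng = np.random.default_rng(0)
time_points = np.linspace(0.0, 2.0 * np.pi, 12, endpoint=False)
phases = rng.uniform(0.0, 2.0 * np.pi, size=200)
toy = np.sin(time_points[None, :] + phases[:, None])
toy += 0.1 * rng.normal(size=toy.shape)

coords, explained = svd_project(toy, n_components=2)
```

Plotting `coords[:, 0]` against `coords[:, 1]` with any standard 2D scatter tool gives the kind of view the slides describe, and `explained` indicates how much variation the displayed components retain.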
Sequence-based Visualization
• View data in chromosomal order
  – Copy number variation & aneuploidies
    • common in cancers & other disorders
  – Comparative Genomic Hybridization (CGH)
  – mRNA sequencing (RNA-seq)
  – Borrows concepts from genome browsers

Sequence-based: CGH
• Karyoscope plots
[Figure: Java TreeView, Saldanha et al.]

Sequence-based: RNA-seq
[Figure: IGV, http://www.broadinstitute.org/igv]

Comparative Visualization
• Use multiple simultaneous, complementary views of the data
• Each scheme emphasizes different aspects
  – use several to show the overall picture
• Show multiple related datasets to identify common and unique patterns

Comparative Visualization: Single Dataset
[Figure: MeV, Saeed et al.]

Comparative Visualization: Single Dataset
[Figure panels: Spotfire, GeneSpring]

Comparative Visualization: Multidataset
[Figure: HIDRA, Hibbs et al.; dendrogram, heat map overview; data from Spellman et al., 1998]

Comparative Visualization: Multidataset
[Figure: HIDRA, Hibbs et al.; selection, synchronized details; data from Spellman et al., 1998]

Comparative Visualization: Multidataset
[Figure: HIDRA, Hibbs et al.; selection; data from Spellman et al., 1998]

Summary & Tools
• R & Bioconductor
• Java TreeView (Saldanha, 2004)
• Time Searcher (Hochheiser et al., 2003)
• Integrative Genomics Viewer (IGV; www.broadinstitute.org/igv)
• TIGR’s MultiExperiment Viewer (MeV; Saeed et al., 2003)
• HIDRA (Hibbs et al., 2007)

Trends & Future Directions
• Emphasis on usability and audience
  – If a “wet bench” biologist can’t use it…
• Incorporate common statistical analysis techniques with visualizations
  – e.g. differential expression tests, GO enrichments, etc.
• Isoforms and splice variants
• New user interaction schemes
  – e.g. multi-touch interfaces, large-format displays
• Low level “systems analysis”
  – linking together multiple types of data into unified displays

Acknowledgements
• Hibbs Lab
  – Karen Dowell
  – Tongjun Gu
  – Al Simons
• Olga Troyanskaya Lab
  – Patrick Bradley
  – Maria Chikina
  – Yuanfang Guan
• Chad Myers
• David Hess
• Florian Markowetz
• Edo Airoldi
• Curtis Huttenhower
• Kai Li Lab
  – Grant Wallace
• Amy Caudy
• Maitreya Dunham
• Botstein, Kruglyak, Broach, Rose labs
• Kyuson Yun
• Carol Bult

Postdoctoral Opportunities in Computational & Systems Biology
The Center for Genome Dynamics at The Jackson Laboratory
www.genomedynamics.org
Investigators use computation, mathematical modeling and statistics, with a shared focus on the genetics of complex traits.
Requires a PhD (or equivalent) in a quantitative field such as computer science, statistics, or applied mathematics, or in the biological sciences with a strong quantitative background.
Programming experience recommended.
The Jackson Laboratory was voted #2 in a poll of postdocs conducted by The Scientist in 2009 and is an EOE/AA employer.