Divining Systems Biology Knowledge from High-throughput Experiments Using EGAN
Jesse Paquette
ISMB 2010
Biostatistics and Computational Biology Core
Helen Diller Family Comprehensive Cancer Center
University of California, San Francisco
(AKA BCBC HDFCCC UCSF)

High-throughput experiments
• This talk applies to
  – Expression microarrays
  – aCGH
  – SNP/CNV arrays
  – MS/MS Proteomics
  – DNA methylation
  – ChIP-Seq
  – RNA-Seq
  – In-silico experiments
• If parts of the output can be mapped to gene IDs
  – You can use EGAN

What do you hope to accomplish?
[Workflow diagram] Collect data, Process data, Differential analysis, Clusters and/or gene lists, Produce insight about the underlying biology, New testable hypotheses
[Outcomes] Publish! New papers! Drug targets! New grants!

Leverage organic intelligence
[Workflow diagram] Clusters and/or gene lists; Summarize, Visualize, Contextualize; Produce insight about the underlying biology; New testable hypotheses

Producing insight from clusters and gene lists
• Summarize: find enriched pathways (and other gene sets)
  – Hypergeometric over-representation
    • DAVID
  – Global trends
    • GSEA
• Visualize: gene relationships in a graph
  – Protein-protein interactions
    • Cytoscape
  – Network module discovery
    • Ingenuity IPA
  – Literature co-occurrence
    • PubGene
• Contextualize: pertinent literature
    • PubMed
    • Google
    • iHOP

EGAN: Exploratory Gene Association Networks
• Methods: state-of-the-art analysis of clusters and gene lists
  – Hypergeometric enrichment of gene sets
  – Global statistical trends of gene sets
  – Hypergraph visualization (via Cytoscape libraries)
  – Literature identification
  – Network module discovery
• User Interface: responds quickly to new queries from the biologist
  – Sandbox-style functionality
  – Dynamic adjustment of p-value cutoffs
  – Point-and-click interface
  – All data in-memory for immediate access
  – Links to external websites
• Modular: integrates as a flexible plug-and-play cog
  – All data is customizable
  – Proprietary data can be restricted to the client location
  – Java runs on almost every OS (PC, Mac, Linux)
  – Can be configured and launched from a different application (e.g. GenePattern)
  – Analyses can be scripted for automation

Gene sets
• A gene set is a set of semantically related genes
  – e.g. Wnt signaling pathway
• EGAN contains a database of gene sets
  – > 100k gene sets by default
    • KEGG, Reactome, NCI-Nature, Gene Ontology, MeSH, Conserved Domain, Cytoband, miRNA targets
  – You can easily add your own
    • Simple file format
    • Download from MSigDB (Broad Institute)

Gene-gene relationships
• EGAN also contains
  – Protein-protein interactions (PPI)
  – Literature co-occurrence
  – Chromosomal adjacency
  – Kinase-target relationships
• Other possibilities
  – Sequence homology
  – Expression correlation

Example with microarray and aCGH results
• Mirzoeva et al. (2009) Cancer Research
  – UCSF-LBL collaboration
  – Analysis of breast cancer cell lines
    • Basal vs. luminal
• Discoveries in this presentation
  – miRNA regulator of subtype (mir-200)
  – Annexin (ANXA1) as potential regulator of ER, glucocorticoid and EGFR signaling

Gene list - higher expression in basal cell lines

Gene set/pathway enrichment

Importing gene lists from publications

Combining expression with aCGH

Finding network modules

Where to find EGAN
• Website
  – http://akt.ucsf.edu/EGAN/
• 2010 paper in Bioinformatics
  – http://www.ncbi.nlm.nih.gov/pubmed/19933825

Acknowledgements
• BCBC HDFCCC UCSF
  – Taku Tokuyasu
  – Adam Olshen
  – Ritu Roy
  – Ajay Jain
• LBNL
  – Debopriya Das
  – Joe Gray
• Funding
  – UCSF Cancer Center Support Grant
• UCSF
  – Early adopters
    • Ingrid Revet
    • Antoine Snijders
    • Stephan Gysin
    • Sook Wah Yee
    • Joachim Silber
  – Cytoscape gurus
    • David Quigley
    • Scooter Morris
  – OTM
    • David Eramian
    • Ha Nguyen
  – Laura van ’t Veer
  – Donna Albertson
  – Graeme Hodgson
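
Supplementary sketch: hypergeometric over-representation

The slides name hypergeometric over-representation as the way to summarize a gene list, but do not show the calculation. The following is a minimal sketch of the standard one-sided test, not EGAN's actual implementation (EGAN is a Java application and its internals are not described in this talk). The function names `hypergeom_enrichment_p` and `enrich` and the gene IDs `G1` to `G10` are hypothetical, chosen only for illustration. It asks: if the gene list were drawn at random from the universe of measured genes, how likely is an overlap with the gene set at least as large as the one observed?

```python
from math import comb


def hypergeom_enrichment_p(universe_size, set_size, list_size, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    X is the number of gene-set members found when list_size genes are
    drawn without replacement from a universe of universe_size genes,
    of which set_size belong to the gene set.
    """
    if not (0 <= set_size <= universe_size and 0 <= list_size <= universe_size):
        raise ValueError("set_size and list_size must lie within the universe")
    start = max(overlap, 0)
    max_overlap = min(set_size, list_size)
    total = comb(universe_size, list_size)
    # Sum the probability mass for every overlap at least as large as observed.
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(start, max_overlap + 1)
    )
    return tail / total


def enrich(gene_list, gene_set, universe):
    """Return (overlap, p-value) for one gene list against one gene set.

    Genes outside the universe are ignored so that all counts refer to
    the same background.
    """
    universe = set(universe)
    hits = set(gene_list) & universe
    members = set(gene_set) & universe
    overlap = len(hits & members)
    p = hypergeom_enrichment_p(len(universe), len(members), len(hits), overlap)
    return overlap, p


# Toy example with hypothetical gene IDs (not real data).
universe = {f"G{i}" for i in range(1, 11)}
pathway = {"G1", "G2", "G3", "G4"}
gene_list = {"G1", "G2", "G9"}
overlap, p_value = enrich(gene_list, pathway, universe)
print(overlap, p_value)
```

With more than 100k gene sets tested against one list, the raw p-values would normally need a multiple-testing correction before applying a cutoff; which correction EGAN applies is not stated in these slides.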