Driving Discovery Through Data Integration and Analysis
John Quackenbush
Molecular Diagnostics World 2010
28 October 2010

Disease Progression and Personalized Care
Diagram labels: Birth, Treatment, Natural History of Disease, Clinical Care, Environment + Lifestyle, Outcomes, Treatment Options, Disease Staging, Patient Stratification, Early Detection, Genetic Risk, Biomarkers, Quality of Life, Death

Turning the vision into a reality
- Assure access to samples and rational consent
- Develop a technology platform
- Make information integration a central mission
- Conduct research as a vital component
- Present data and information to the local community
- Enable research beyond your own
- Engage corporate partners
- Communicate the mission to the community

Assure Access to Samples

Access, Research, Security
- Patients want to be part of the process of curing disease
- Informed consent needs to be structured to allow patients to be partners in the research process
- HIPAA requires both informed consent and that we assure patient confidentiality
- But “identifiability” is a moving target in a genomic age
- With the <$1000 genome, in the age of Facebook, what this means remains unclear
- The new genomics is a disruptive technology

Develop a Technology Platform

2006: State of the Art Sequencing
PRODUCTION
- Rooms of equipment
- Subcloning > picking > prepping
- 35 FTEs
- 3-4 weeks
SEQUENCING
- 74x capillary sequencers
- 10 FTEs
- 15-40 runs per day
- 1-2 Mb per instrument per day
- 120 Mb total capacity per day
Sequencing the genome took ~15 years and $3B

2008: Enabling a New Era in Genome Analysis
PRODUCTION
- 1x Cluster Station
- 1 FTE
- 1 day
SEQUENCING
- 1x Genome Analyzer
- Same FTE as above
- 1 run per 5 days
- 15 Gb per instrument per run
- >3 Gb per day (1x genome coverage)
We can now re-sequence the genome in ~1 week

The Challenge
- New technologies inspired by the Human Genome Project are transforming biomedical research from a laboratory science to an information science
- We need new approaches to making sense of the data we generate
- The winners in the race to understand disease are going to be those best able to collect, manage, analyze, and interpret the data

Make information integration a central mission

http://compbio.dfci.harvard.edu
Diagram labels: Gene, RNA, Gene Index Databases, Protein, TM4 Microarray Software, Network, Patient, Predict Network, Candidate Gene(s), Perturb Network (RNAi), Assay Response (mA), Resourcerer, Other Databases, Other tools, MeSHer, ClusterMed, Bayesian Nets, Central Warehouse, DNA Microarray Analysis
Other Things: Mesoscopic Expression, Correlated Signatures, State Space Gene Models, Tiling Arrays to Genes

Dealing with an Information Overload

Beating Information Overload
Diagram labels: Clinical Data, Genomics, Cytogenomics, Metabolomics, Transcriptomics, Epigenomics, Central Warehouse, Chemical Biology, etc.;
Improved Diagnostics, Individualized Therapies, More Effective Agents; PubMed, Clinical Trials, Proteomics, The HapMap, The Genome, Disease Databases (OMIM), Published Datasets, Drug Bank, GenBank, misc.

Conceptual Architecture
Diagram labels: Rules Engine, Web Center Portal, BAM Dashboard, Portals, Business Intelligence, Partners, Dana-Farber Clinical Systems, OMICS, IDX, Rx, Lab, Enterprise Service Bus, Dana-Farber Lab, External, Dana-Farber Research DB, Clinical Trial, Idm & Security, Facts, HTB, ODS, genomics, Web Service Directory, Severity Score, BPEL, De-identification, Terminology, EMPI, Mapping, Security, Auditing, Clinical Pathways, RFID
Legend: Build or Buy, Custom, Oracle, Existing

An Example: Signature Analysis
Diagram labels: PubMed, Array Express, GEO, Random Websites, In-House Studies, Warehouse, Analysis
Kerm Picard, Fenglong Liu
Aedin Culhane, Thomas Schwarzl, Joe White, Fenglong Liu, Kerm Picard

GeneChip Oncology Database
Fenglong Liu

GeneSigDB: release 2
http://compbio.dfci.harvard.edu/genesigdb

GeneSigDB: comparing cancers

Cancer is a Cell-Cycle Disease
Aedin Culhane, Daniel Gusenleitner

Breast Cancer has unique signatures
Aedin Culhane, Daniel Gusenleitner

A sample research question
How many Multiple Myeloma patients, with bone marrow or blood samples in the bank, and who have a chromosome 13 deletion, responded (complete, partial, or minor remission) to therapy, and how many did not respond?

A Path Forward
- We are working to develop a two-way strategy for the future: Clinic → Lab and Lab → Clinic
- Consider Oncotype DX
- This approach represents the intellectual framework for future success, and the bridges between the various laboratories and programs

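The sample research question above is the kind of query an integrated clinical/research warehouse should answer directly. The following is only a minimal sketch of that filter-and-count logic in Python. The `Patient` record, its field names, and the six records are hypothetical and made up for illustration; they are not the warehouse schema or real patient data.

```python
from dataclasses import dataclass

# Response categories that count as "responded" in the sample question.
RESPONDED = {"complete", "partial", "minor"}
# Banked sample types named in the sample question.
SAMPLE_TYPES = {"bone marrow", "blood"}


@dataclass(frozen=True)
class Patient:
    # Hypothetical fields, chosen for illustration only.
    patient_id: str
    diagnosis: str
    banked_samples: frozenset  # sample types held in the bank
    del13: bool                # chromosome 13 deletion reported
    response: str              # "complete", "partial", "minor", or "none"


def count_responses(patients):
    """Return (responders, non_responders) among eligible patients."""
    responders = non_responders = 0
    for p in patients:
        eligible = (
            p.diagnosis == "multiple myeloma"
            and bool(p.banked_samples & SAMPLE_TYPES)
            and p.del13
        )
        if not eligible:
            continue
        if p.response in RESPONDED:
            responders += 1
        else:
            non_responders += 1
    return responders, non_responders


# Made-up records covering each exclusion rule.
PATIENTS = [
    Patient("P1", "multiple myeloma", frozenset({"bone marrow"}), True, "complete"),
    Patient("P2", "multiple myeloma", frozenset({"blood"}), True, "none"),
    Patient("P3", "multiple myeloma", frozenset(), True, "partial"),        # nothing banked
    Patient("P4", "multiple myeloma", frozenset({"blood"}), False, "minor"),  # no deletion
    Patient("P5", "lymphoma", frozenset({"blood"}), True, "partial"),       # other diagnosis
    Patient("P6", "multiple myeloma", frozenset({"bone marrow", "blood"}), True, "minor"),
]

if __name__ == "__main__":
    print(count_responses(PATIENTS))
```

The point of the sketch is that the question joins three kinds of data (sample bank, cytogenetics, clinical outcome) in one pass, which is only possible once they sit behind a common patient identifier.
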
Conduct research as a vital component

Bayesian Networks
Amira Djebbari, Raktim Sinha, Dan Schlauch

When we say “Networks” we mean…
- Genes are represented as “nodes”
- Interactions are represented by “edges”
- Edges can be directed to show “causal” interactions
- Edges are not necessarily direct interactions

Bayesian network - example
- Nodes shown: Gene1, Gene2, Gene3, Gene4
- Edges represent dependencies
- Conditional probability table at node “Gene2”:
  Gene1                 -1    0     1
  P(Gene2=1 | Gene1)    0.1   0.2   0.7

Learning Bayesian networks:
- Structure
- Conditional probability tables

Bayesian networks - priors
- No free lunch theorem (Wolpert & Macready, 1996): the performance of a general-purpose optimization algorithm iterated on a cost function is independent of the algorithm when averaged over all cost functions
- This suggests that when considering a specific application one can introduce a potentially useful bias using domain knowledge

A low-cost lunch?
- One can “help” the search along by providing a seed structure representing what we believe is the most likely network
- The network search process will then use gene expression data to look for perturbations on the structure that are supported by the data
- There are many possible sources of prior structures, including the biomedical literature and large-scale interaction studies (PPI)

Bayesian networks using microarray data and literature
- Test set: Golub et al. ALL/AML dataset
- Learn BN with a prior structure from the literature network, from protein-protein interaction (PPI) data, and from literature+PPI
- Perform 200 bootstrap network estimations and find links that are “high confidence”
- Compare without prior (microarray data only) vs. with prior structure from the literature to look for known interactions

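The bootstrap step above (200 network estimations, then keep the “high confidence” links) can be sketched as follows. This is an illustration under stated assumptions, not the method used for the results in this talk: `learn_edges` is a stand-in that links gene pairs by correlation rather than running a Bayesian network structure search, the 0.6 correlation cutoff and the 0.8 confidence threshold are arbitrary choices, and the toy expression data are synthetic.

```python
import random
from collections import Counter


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def learn_edges(samples, genes, cutoff=0.6):
    """Stand-in structure learner (assumption): link pairs with |r| >= cutoff.

    A real analysis would run a Bayesian network search here, optionally
    seeded with a prior structure.
    """
    edges = set()
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            r = pearson([s[g] for s in samples], [s[h] for s in samples])
            if abs(r) >= cutoff:
                edges.add((g, h))
    return edges


def bootstrap_confidence(samples, genes, n_boot=200, seed=0):
    """Fraction of bootstrap resamples in which each edge is recovered."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        # Resample the arrays (samples) with replacement.
        resample = [rng.choice(samples) for _ in samples]
        counts.update(learn_edges(resample, genes))
    return {edge: c / n_boot for edge, c in counts.items()}


def high_confidence(confidence, threshold=0.8):
    """Edges recovered in at least `threshold` of the resamples."""
    return {e for e, c in confidence.items() if c >= threshold}


def make_toy_data(n=40, seed=1):
    """Synthetic data: Gene2 tracks Gene1 closely, Gene3 is independent."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        g1 = rng.gauss(0, 1)
        data.append({
            "Gene1": g1,
            "Gene2": g1 + rng.gauss(0, 0.1),
            "Gene3": rng.gauss(0, 1),
        })
    return data


if __name__ == "__main__":
    genes = ["Gene1", "Gene2", "Gene3"]
    conf = bootstrap_confidence(make_toy_data(), genes)
    print(sorted(high_confidence(conf)))
```

Swapping `learn_edges` for a real structure learner leaves the resampling and counting logic unchanged, which is why the confidence estimate can be compared across runs with and without a prior.
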
BN: No Priors
Amira Djebbari

BN: PPI Data
Amira Djebbari

BN: Literature Priors
Amira Djebbari

BN: Literature + PPI
Cell Cycle Gene Subnetwork
Amira Djebbari

Improving the Seeds
- Co-occurrence does not provide directionality for interactions, but a BN is a DAG and our assignment is ad hoc
- The literature contains information about how the genes (and their products) interact
- The challenge is extracting that information from the literature: there is too much to read
- Text mining doesn’t work well for the biomedical literature

Improving the Seeds (2)
Solution: Use a hybrid approach!
- Use text-mining tools to find sentences that contain names of two or more genes
- Use the Amazon Mechanical Turk to extract [subject]-[predicate]-[object] triples
- Define relationships between genes based on the “consensus” interaction
- Combine these results with pathway databases to build seed networks

“PredictiveNetworks” seeds from the literature

Present data and information to the local community

LGRC Research Portal

LGRC Research Portal
PAGE DETAILS
- View aggregate statistics
- View cohort details
- Build cohort sets
- Build composite phenotypes
Actions:
- Go to data download for selected cohort
- Go to assay detail for selected cohort
- Go to cohort manager

LGRC Research Portal
PAGE DETAILS
Search:
- Facets
- Search within results
- Keyword prompts
- Search history
Table:
- Paged results
- Sortable columns
Actions:
- Go to Gene detail page
- Add genes to ‘gene set’

PAGE DETAILS
Annotation summary & summary view for each assay/data type, in accordion-style sections:
- Annotation Summary
- Gene Expression Summary
- GEXP: expression profile across major Dx categories
- RNASeq: exon structure of the gene
- SNPs: table of SNPs in region of gene, highlighting association with major Dx group
- Methylation: methylation profile in region around gene
- Genomic alterations: table of CNVs & alterations observed w/ freq in region around gene
Actions:
- Click through to assay detail page
- Add gene to set

LGRC Research Portal: Analysis Tools
Form fields: Cohort 1: Set 1; Cohort 2: Set 2; Job name: My job 1
PAGE DETAILS
- Very minimal parameters and options… here just 2 cohorts of interest, maybe p-value cutoff
- View analysis parameters
- Generates comprehensive report
- Start Analysis
- Edit-in-place results: don’t set parameters, edit the results
- Analysis goes into queue, email notification when finished
Job Status: Running

Analysis of Differential Expression: My Job 1
PAGE DETAILS
- Very minimal parameters and options
- Generates comprehensive report
- Edit-in-place results: don’t set parameters, edit the results
- Accordion-style result sections: Supervised analysis, Meta analysis, Unsupervised analysis
- Generate PDF report of analysis
- Analysis goes into queue, email notification when finished

Engage corporate partners

We need to find the best tools
- We received a $1M Oracle Commitment grant to create our integrated clinical/research data warehouse
- We’ve partnered with IDBS to create data portals
- We are working with Illumina on a variety of projects
- We are forging relationships with Thomson-Reuters to link genomic profiling data to drug, trial, and patent information
- We are building partnerships with Roche, Genomatix, NEB, and others interested in entering the personal genomics space

Enable research beyond your own
John Quackenbush, Director
Mick Correll, Associate Director

The Mission
The mission of the CCCB is to provide broad-based support for the analysis and interpretation of ‘omic data and, in doing so, to further basic, clinical, and translational research. CCCB will also conduct research that opens new ways of understanding cancer.

CCCB Service Offering
IT Infrastructure
- Application hosting
- Data management
- Custom software development
- Comprehensive collaboration portals
Next-Gen Sequencing
- Competitive per-lane pricing
- Integrated informatics
- Major focus for development in 2010
Analytical Consulting
- Bioinformatics / statistical data analysis
- Experimental design
- Value-add for IT/Sequencing services

CCCB Collaborative Consulting Model
1. Initial meeting to understand project scope and objectives
2. Development of an analysis plan and time/cost estimate
3. During project execution, data and results are exchanged through a secure, password-protected collaboration portal
4. Available as ad-hoc service, or larger scale support agreements

Communicate the mission to the community

The LGRC

Genomics is here to stay

Acknowledgments
<johnq@jimmy.harvard.edu>
http://cccb.dfci.harvard.edu
http://compbio.dfci.harvard.edu

The Gene Index Team: Corina Antonescu, Valentin Antonescu, Fenglong Liu, Geo Pertea, Razvan Sultana, John Quackenbush
Array Software Hit Team: Katie Franklin, Eleanor Howe, Sarita Nair, Jerry Papenhausen, John Quackenbush, Dan Schlauch, Raktim Sinha, Joseph White
H. Lee Moffitt Center/USF: Timothy J. Yeatman, Greg Bloom
Center for Cancer Computational Biology / Microarray Expression Team: Stefan Bentink, Mick Correll, Thomas Chittenden, Howie Goodell, Aedin Culhane, Kristina Holton, Jerry Papenhausen, Jane Pak, Patricia Papastamos, Renee Rubio, John Quackenbush
(Former) Stellar Students: Martin Aryee, Kaveh Maghsoudi, Jess Mar
Systems Support: Stas Alekseev, Sys Admin
Assistant: Patricia Papastamos