SAMSI 2014-2015 Program
Beyond Bioinformatics: Statistical and Mathematical Challenges
Topic: Data Integration

Katerina Kechris, PhD
Associate Professor, Biostatistics and Informatics
Colorado School of Public Health
University of Colorado Denver

Omics
• Large-scale analyses for studying a population of molecules or molecular mechanisms
• High-throughput data
• Examples
– Genomics (entire genome – DNA)
– Proteomics (study of protein repertoire)
– Epigenomics (study of DNA and histone modifications)

Omics
[Figure: omics diagram; surviving labels: Epigenome, Phenome. Adapted from http://www.sciencebasedmedicine.org]
Image sources:
http://www.scientificpsychic.com/fitness/transcription.gif
http://themedicalbiochemistrypage.org/images/hemoglobin.jpg
http://upload.wikimedia.org/wikipedia/commons/c/c6/Clopidogrel_active_metabolite.png
http://creatia2013.files.wordpress.com/2013/03/dna.gif

Large-scale Projects & Databases
NCI 60 Database

Integration of Omics Data
• Each type of data gives a different snapshot of the biological or disease system
• Why integrate data?
• Reduce false positives/negatives
• Identify interactions between different molecules
• Explore functional mechanisms

Challenges
1. When to integrate?
2. Dimensionality
3. Resolution
4. Heterogeneity
5. Interactions and Pathways

Challenge 1: When to integrate?
• Early
– Merging data to increase sample size
• Intermediate
– Convert different data sources into common format (e.g., ranks, correlation matrices), kernel-based analysis
• Late
– Meta-analysis (combine effect size or p-value), aggregate voting for classifiers, genomic enrichment and overlap of significant results

Genomic Meta-analysis: Combining Multiple Transcriptomic Studies
Tseng Lab, U. of Pitt.

Assessing Genomic Overlap: Permutation-based Strategies
Bickel Lab, Berkeley & ENCODE
Ann. Appl. Stat. (2010) 4:4 1660-1697.
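
Example: late integration by combining p-values
The "Late" strategy in Challenge 1 mentions meta-analysis by combining p-values. A minimal sketch of one standard way to do this, Fisher's method, is given here. It is not necessarily the approach used by the labs cited on these slides. The function name `fisher_combine` and the per-study p-values are hypothetical, and the method assumes the studies are independent.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null, where k = len(p_values).
    For even degrees of freedom the survival function has a closed form:
        P(X >= x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2

    # Accumulate the finite series term by term to avoid large factorials.
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term

    return min(1.0, math.exp(-half_stat) * total)


# Hypothetical p-values for one gene across three transcriptomic studies.
study_p_values = [0.01, 0.02, 0.03]
combined = fisher_combine(study_p_values)
print(f"combined p-value: {combined:.2e}")
```

In practice this would be applied gene by gene, followed by a multiple-testing correction. Combining effect sizes instead of p-values, the other option named on the slide, requires per-study standard errors and is not shown.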

Challenge 2: Dimensionality
• Most technologies produce tens to hundreds of thousands of measurements per sample
– Exponential increase with 2+ data types
• Dimension reduction
– Process data type separately (filtering)
– Combine with model fitting
– Multivariate analysis

Sparse Multivariate Methods
• Variable Selection, Discriminant Analysis, Visualization
• Penalties (or regularization) to reduce parameter space, so that only a few entries are nonzero (sparsity)
• Sparse Canonical Correlation Analysis (CCA) and Partial Least Squares Regression (PLS)
Le Cao, U. of Queensland; Besse, U. of Toulouse; Witten, U. of Wash; Tibshirani, Stanford
Stat Appl Genet Mol Biol. 2009 January 1; 8(1): Article 28; Stat Appl Genet Mol Biol. 2008;7(1):Article 35

Challenge 3: Genomic Resolution
• Base level (conservation, motif scores)
• Regular intervals (expression/binding from tiling arrays)
• Irregular intervals
– Gene/ncRNA level data (expression)
– Individual positions (SNP, methylation sites)

Challenge 4: Heterogeneity
• Technology-specific sources of error
• Different pre-processing, normalization
• Different amounts of missing values
• Data matching
– Different identifiers
– Not always one-to-one (microarrays)
– Imputation

Challenge 4: Heterogeneity
• Continuous
– expression and binding data from microarrays, motif scores, protein/metabolite abundance
• Counts
– expression data from sequencing
• 0-1
– conservation (UCSC), DNA methylation
• Binary/Categorical
– Thresholding (e.g., motif scores), genotype

Case Study: Development
Ci
• important for differentiation of appendages during development
• transcription factor – binds to DNA near target genes
http://www.biology.ualberta.ca/locke.hp/research.htm
http://howardhughes.trinity.duke.edu
Kechris Lab, CU Denver

Hierarchical Mixture Model
• Data
– Transcriptome: Ci pathway mutants (expr) – irregular interval
– Genome: DNA binding data of Ci (bind) – regular interval, DNA conservation across 14 insect species (cons) – base level
• Goal: Predict gene
targets of Ci
• Hidden variable is gene target – hierarchical mixture model
Dvorkin et al., 2013 (under review)

Challenge 5: Interactions and Pathways
• Known Pathways
– Incorporate information in databases (curated but sparse)
– e.g., KEGG pathways have metabolite – protein interactions (directed graphs)
• De novo Pathways
– Discover novel interactions

Known Pathways
[Figure: pathway diagram; surviving labels: gene, metabolite]
Jornsten, Chalmers & Michailidis, U. Michigan
Biostatistics (2012) 13:4 748-761
Joint modeling of metabolite and transcript data to identify active pathways

de novo Interactions
• Single data
• Pair-wise
– Correlations (e.g., eQTL)
– Bayesian networks
• Multiple
– Kernel-based methods
– Probabilistic graphical models
– Network analysis
[Figure: integration diagram; surviving labels: PHENOTYPE, INTEGRATION, methylation site, gene, SNP, protein, metabolite]

de novo Interactions
Shojaie Lab, U. Washington
Biometrika (2010) 97 (3): 519-538.

Summary
Methodology
1. Meta-analysis
2. Permutation-based Methods
3. Sparse Multivariate Methods
4. Graphical Models
5. Network Analysis
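
Example: pair-wise correlation with a permutation test
The pair-wise approach listed under de novo interactions (correlations, e.g., eQTL) can be illustrated with a small sketch that also uses a permutation-based p-value, one of the method families in the summary. This is a generic illustration, not the procedure of any lab cited on these slides. The function names and the genotype and expression values are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the correlation of x and y.

    Shuffling y breaks any association with x while keeping both
    marginal distributions, which gives an empirical null.
    """
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(pearson(x, y_perm)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical data: SNP genotype coded as minor-allele count (0, 1, 2)
# and expression of one gene measured in the same ten samples.
genotype = [0, 0, 1, 1, 2, 2, 0, 1, 2, 1]
expression = [1.1, 0.9, 2.2, 1.8, 3.1, 2.9, 1.0, 2.1, 3.0, 1.9]

r = pearson(genotype, expression)
p = permutation_pvalue(genotype, expression)
print(f"r = {r:.3f}, permutation p = {p:.4f}")
```

A genome-wide screen would repeat this over many SNP and gene pairs, so the number of permutations and the multiple-testing correction both need more care than shown here.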