Data Integration

SAMSI 2014-2015 Program
Beyond Bioinformatics: Statistical and Mathematical Challenges
Topic: Data Integration

Katerina Kechris, PhD
Associate Professor, Biostatistics and Informatics
Colorado School of Public Health
University of Colorado Denver

Omics
• Large-scale analyses for studying a population of molecules or molecular mechanisms
• High-throughput data
• Examples
  – Genomics (entire genome – DNA)
  – Proteomics (study of protein repertoire)
  – Epigenomics (study of DNA and histone modifications)

Omics
[Figure: overview of omics layers; recoverable labels: Epigenome, Phenome]
Adapted from:
http://www.sciencebasedmedicine.org
http://www.scientificpsychic.com/fitness/transcription.gif
http://themedicalbiochemistrypage.org/images/hemoglobin.jpg
http://upload.wikimedia.org/wikipedia/commons/c/c6/Clopidogrel_active_metabolite.png
http://creatia2013.files.wordpress.com/2013/03/dna.gif

Large-scale Projects & Databases
NCI 60 Database

Integration of Omics Data
• Each type of data gives a different snapshot of the biological or disease system
• Why integrate data?
  – Reduce false positives/negatives
  – Identify interactions between different molecules
  – Explore functional mechanisms

Challenges
1. When to integrate?
2. Dimensionality
3. Resolution
4. Heterogeneity
5. Interactions and Pathways

Challenge 1: When to integrate?
• Early
  – Merging data to increase sample size
• Intermediate
  – Convert different data sources into a common format (e.g., ranks, correlation matrices), kernel-based analysis
• Late
  – Meta-analysis (combine effect size or p-value), aggregate voting for classifiers, genomic enrichment and overlap of significant results

Genomic Meta-analysis: Combining Multiple Transcriptomic Studies
Tseng Lab, U. of Pitt.

Assessing Genomic Overlap: Permutation-based Strategies
Bickel Lab, Berkeley & ENCODE
Ann. Appl. Stat. (2010) 4:4 1660-1697.
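As an illustration of late integration by combining p-values, here is a minimal sketch of Fisher's method, one standard way to pool p-values across independent studies. It is not taken from the slides or from the cited labs' software; the function names and the example p-values are hypothetical. The statistic is -2 times the sum of the log p-values, which under the null follows a chi-square distribution with 2k degrees of freedom for k studies. Because the degrees of freedom are even, the tail probability has a closed form and no external library is needed.

```python
import math


def chi2_sf_even_dof(x, k):
    """Upper-tail probability of a chi-square variable with 2*k degrees of freedom.

    For even degrees of freedom the tail has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    """
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combine(pvalues):
    """Combine independent p-values with Fisher's method.

    Returns (statistic, combined_p). Assumes the studies are independent
    and every p-value lies in (0, 1].
    """
    if not pvalues:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    return statistic, chi2_sf_even_dof(statistic, len(pvalues))


# Hypothetical p-values for one gene from three transcriptomic studies.
study_pvalues = [0.04, 0.10, 0.03]
statistic, combined_p = fisher_combine(study_pvalues)
print(f"Fisher statistic = {statistic:.3f}, combined p = {combined_p:.4f}")
```

Fisher's method assumes independent studies and is sensitive to a single very small p-value; effect-size models are the alternative when effect estimates and standard errors are available.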
Challenge 2: Dimensionality
• Most technologies produce tens to hundreds of thousands of measurements per sample
  – Exponential increase with 2+ data types
• Dimension reduction
  – Process each data type separately (filtering)
  – Combine with model fitting
  – Multivariate analysis

Sparse Multivariate Methods
• Variable Selection, Discriminant Analysis, Visualization
• Penalties (or regularization) to reduce the parameter space, so that only a few entries are nonzero (sparsity)
• Sparse Canonical Correlation Analysis (CCA) and Partial Least Squares Regression (PLS)
Le Cao, U. of Queensland; Besse, U. of Toulouse; Witten, U. of Wash; Tibshirani, Stanford
Stat Appl Genet Mol Biol. 2009 January 1; 8(1): Article 28; Stat Appl Genet Mol Biol. 2008;7(1):Article 35

Challenge 3: Genomic Resolution
• Base level (conservation, motif scores)
• Regular intervals (expression/binding from tiling arrays)
• Irregular intervals
  – Gene/ncRNA level data (expression)
  – Individual positions (SNP, methylation sites)

Challenge 4: Heterogeneity
• Technology-specific sources of error
• Different pre-processing, normalization
• Different amounts of missing values
• Data matching
  – Different identifiers
  – Not always one-to-one (microarrays)
  – Imputation

Challenge 4: Heterogeneity
• Continuous
  – expression and binding data from microarrays, motif scores, protein/metabolite abundance
• Counts
  – expression data from sequencing
• 0-1
  – conservation (UCSC), DNA methylation
• Binary/Categorical
  – Thresholding (e.g., motif scores), genotype

Case Study: Development
Ci
• important for differentiation of appendages during development
• transcription factor – binds to DNA near target genes
http://www.biology.ualberta.ca/locke.hp/research.htm
http://howardhughes.trinity.duke.edu
Kechris Lab, CU Denver

Hierarchical Mixture Model
• Data
  – Transcriptome: Ci pathway mutants (expr) – irregular interval
  – Genome: DNA binding data of Ci (bind) – regular interval; DNA conservation across 14 insect species (cons) – base level
• Goal: Predict gene targets of Ci
• Hidden variable is gene target – hierarchical mixture model
Dvorkin et al., 2013 (under review)

Challenge 5: Interactions and Pathways
• Known Pathways
  – Incorporate information in databases (curated but sparse)
  – e.g., KEGG pathways have metabolite–protein interactions (directed graphs)
• De novo Pathways
  – Discover novel interactions

Known Pathways
[Figure: pathway diagram; recoverable labels: gene, metabolite]
Jornsten, Chalmers & Michailidis, U. Michigan
Biostatistics (2012) 13:4 748-761
Joint modeling of metabolite and transcript data to identify active pathways

de novo Interactions
[Figure: integration diagram; recoverable labels: PHENOTYPE, INTEGRATION, methylation site, gene, SNP, protein, metabolite]
• Single data
• Pair-wise
  – Correlations (e.g., eQTL)
  – Bayesian networks
• Multiple
  – Kernel-based methods
  – Probabilistic graphical models
  – Network analysis

de novo Interactions
Shojaie Lab, U. Washington
Biometrika (2010) 97 (3): 519-538.

Summary
Methodology
1. Meta-analysis
2. Permutation-based Methods
3. Sparse Multivariate Methods
4. Graphical Models
5. Network Analysis
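To make the permutation-based item in the methodology list concrete, the sketch below tests whether two sets of significant genes overlap more than expected by chance. It is a naive gene-label resampling test over a fixed gene universe, not the structure-aware strategy of the cited Ann. Appl. Stat. paper, and it ignores genomic position and dependence between neighboring features. All names, set sizes, and gene identifiers are hypothetical.

```python
import random


def overlap_permutation_test(universe, set_a, set_b, n_perm=2000, seed=0):
    """Permutation p-value for the overlap between two gene sets.

    Under the null, set_a is a random draw of the same size from the
    universe. Returns (observed_overlap, p_value), where the p-value uses
    the add-one correction so it is never exactly zero.
    """
    universe = list(universe)
    set_a, set_b = set(set_a), set(set_b)
    if not (set_a <= set(universe) and set_b <= set(universe)):
        raise ValueError("both sets must be drawn from the universe")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = len(set_a & set_b)
    at_least_as_large = 0
    for _ in range(n_perm):
        random_a = set(rng.sample(universe, len(set_a)))
        if len(random_a & set_b) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_perm + 1)


# Hypothetical example: 200 genes, two result lists sharing 10 genes.
genes = [f"gene_{i}" for i in range(200)]
expression_hits = genes[:20]     # e.g., significant in an expression study
binding_hits = genes[10:40]      # e.g., significant in a binding study
observed, p_value = overlap_permutation_test(genes, expression_hits, binding_hits)
print(f"observed overlap = {observed}, permutation p = {p_value:.4f}")
```

The choice of universe drives the result: restricting it to genes measured on both platforms is usually the safer default, since genes absent from one platform cannot overlap.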
