Data Integration

SAMSI 2014-2015 Program
Beyond Bioinformatics: Statistical and Mathematical Challenges
Topic: Data Integration
Katerina Kechris, PhD
Associate Professor
Biostatistics and Informatics
Colorado School of Public Health
University of Colorado Denver
Omics
• Large-scale analyses for studying a population
of molecules or molecular mechanisms
• High-throughput data
• Examples
– Genomics (entire genome – DNA)
– Proteomics (study of protein repertoire)
– Epigenomics (study of DNA and histone modifications)
Omics
[Figure: omics layers, with labels including Epigenome and Phenome]
Adapted from http://www.sciencebasedmedicine.org http://www.scientificpsychic.com/fitness/transcription.gif
http://themedicalbiochemistrypage.org/images/hemoglobin.jpg http://upload.wikimedia.org/wikipedia/commons/c/c6/Clopidogrel_active_metabolite.png
http://creatia2013.files.wordpress.com/2013/03/dna.gif
Large-scale Projects & Databases
NCI 60 Database
Integration of Omics Data
• Each type of data gives a different snapshot of
the biological or disease system
• Why integrate data?
– Reduce false positives/negatives
– Identify interactions between different molecules
– Explore functional mechanisms
Challenges
1. When to integrate?
2. Dimensionality
3. Resolution
4. Heterogeneity
5. Interactions and Pathways
Challenge 1: When to integrate?
• Early
– Merging data to increase sample size
• Intermediate
– Convert different data sources into common format
(e.g., ranks, correlation matrices), kernel-based
analysis
• Late
– Meta-analysis (combine effect size or p-value),
aggregate voting for classifiers, genomic enrichment
and overlap of significant results
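As one illustration of late integration, the p-values that separate studies report for the same gene can be combined with Fisher's method. The sketch below is a minimal standard-library version that assumes the studies are independent; the function name and the example p-values are hypothetical and not taken from the talk.

```python
import math

def fisher_combine(pvalues):
    """Combine independent p-values with Fisher's method.

    Under the null, X = -2 * sum(log p_i) follows a chi-square distribution
    with 2k degrees of freedom, where k = len(pvalues).
    Returns (statistic, combined p-value).
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    # For even degrees of freedom 2k, the chi-square survival function has a
    # closed form: exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = stat / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return stat, math.exp(-half) * total

# Hypothetical p-values for one gene in three independent expression studies.
stat, p_combined = fisher_combine([0.04, 0.10, 0.03])
print(f"Fisher statistic = {stat:.2f}, combined p = {p_combined:.4f}")
```

Fisher's method is sensitive to a single very small p-value; other combination rules (for example, combining effect sizes) make different trade-offs.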
Genomic Meta-analysis:
Combining Multiple Transcriptomic Studies
Tseng Lab, U. of Pitt.
Assessing Genomic Overlap:
Permutation-based Strategies
Bickel Lab, Berkeley & ENCODE
Ann. Appl. Stat. (2010) 4:4 1660-1697.
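A permutation-style overlap test can be sketched as follows. This is a deliberately simplified stand-in, not the method of the cited paper: it treats one chromosome as circular, shifts one feature set by a random offset, and asks how often the shifted overlap is at least as large as the observed one. All function names and the toy intervals are hypothetical.

```python
import random

def overlap_bp(a, b):
    """Total base pairs shared by two lists of half-open (start, end) intervals.

    Assumes the intervals within each list do not overlap one another.
    """
    total = 0
    for s1, e1 in a:
        for s2, e2 in b:
            total += max(0, min(e1, e2) - max(s1, s2))
    return total

def circular_shift(intervals, offset, length):
    """Shift intervals by offset on a circular chromosome of the given length."""
    shifted = []
    for s, e in intervals:
        s2 = (s + offset) % length
        e2 = s2 + (e - s)
        if e2 <= length:
            shifted.append((s2, e2))
        else:  # the interval wraps past the end, so split it in two
            shifted.append((s2, length))
            shifted.append((0, e2 - length))
    return shifted

def overlap_pvalue(a, b, length, n_perm=1000, seed=0):
    """Return (observed overlap, one-sided permutation p-value)."""
    rng = random.Random(seed)
    observed = overlap_bp(a, b)
    hits = 0
    for _ in range(n_perm):
        offset = rng.randrange(1, length)
        if overlap_bp(circular_shift(a, offset, length), b) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value away from exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Toy example: two feature sets on a 10 kb chromosome.
peaks = [(100, 200), (1000, 1100)]
motifs = [(150, 250), (1050, 1150)]
obs, pval = overlap_pvalue(peaks, motifs, length=10_000)
print(obs, pval)
```

Shifting all features together preserves their spacing, which a naive shuffle of individual positions would destroy; it still ignores large-scale heterogeneity along the genome.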
Challenge 2: Dimensionality
• Most technologies produce tens to hundreds of thousands of
measurements per sample
– Exponential increase with 2+ data types
• Dimension reduction
– Process data type separately (filtering)
– Combine with model fitting
– Multivariate analysis
Sparse Multivariate Methods
• Variable Selection,
Discriminant Analysis,
Visualization
• Penalties (regularization)
reduce the parameter space so that
only a few entries are nonzero (sparsity)
• Sparse Canonical Correlation
Analysis (CCA) and Partial
Least Squares Regression
(PLS)
Le Cao, U. of Queensland; Besse, U. of Toulouse; Witten, U. of Wash; Tibshirani, Stanford
Stat Appl Genet Mol Biol. 2009 January 1; 8(1): Article 28; Stat Appl Genet Mol Biol. 2008;7(1):Article 35
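The role of the penalty can be illustrated with a small rank-1 example. The sketch below alternates soft-thresholding and normalization on the cross-covariance of two data types, in the spirit of penalized sparse CCA/PLS; it is not the cited authors' implementation. The function names, the thresholds `lam_u` and `lam_v` (on the scale of the cross-covariance), and the toy data are all assumptions.

```python
import math

def soft_threshold(vec, lam):
    """Shrink each entry toward zero by lam; entries within lam become 0."""
    return [math.copysign(max(abs(x) - lam, 0.0), x) for x in vec]

def normalize(vec):
    """Scale to unit Euclidean length (an all-zero vector is returned as is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

def cross_covariance(X, Y):
    """Return X^T Y after centering each column; X and Y are lists of sample rows."""
    n, p, q = len(X), len(X[0]), len(Y[0])
    mx = [sum(row[j] for row in X) / n for j in range(p)]
    my = [sum(row[k] for row in Y) / n for k in range(q)]
    return [[sum((X[i][j] - mx[j]) * (Y[i][k] - my[k]) for i in range(n))
             for k in range(q)] for j in range(p)]

def sparse_pair(X, Y, lam_u, lam_v, n_iter=50):
    """Find sparse weight vectors u (for X) and v (for Y) with high covariance."""
    C = cross_covariance(X, Y)
    p, q = len(C), len(C[0])
    v = normalize([1.0] * q)
    u = [0.0] * p
    for _ in range(n_iter):
        Cv = [sum(C[j][k] * v[k] for k in range(q)) for j in range(p)]
        u = normalize(soft_threshold(Cv, lam_u))
        Ctu = [sum(C[j][k] * u[j] for j in range(p)) for k in range(q)]
        v = normalize(soft_threshold(Ctu, lam_v))
    return u, v

# Toy data: 6 samples; X has 3 features, Y has 2. Only X[:, 0] and Y[:, 0]
# share a strong signal z; the remaining columns are small noise.
z = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5]
a = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1]
b = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1]
c = [-0.1, 0.1, 0.1, -0.1, 0.1, -0.1]
X = [[z[i], a[i], b[i]] for i in range(6)]
Y = [[2 * z[i], c[i]] for i in range(6)]
u, v = sparse_pair(X, Y, lam_u=1.0, lam_v=1.0)
print(u, v)  # weights on the noise features are shrunk to zero
```

In practice the thresholds would be tuned (for example, by cross-validation or permutation), and further components would be extracted after deflating the cross-covariance.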
Challenge 3: Genomic Resolution
• Base level (conservation, motif scores)
• Regular intervals (expression/binding from tiling arrays)
• Irregular intervals
– Gene/ncRNA level data (expression)
– Individual positions (SNP, methylation sites)
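One common way to reconcile these resolutions is to summarize everything at the coarsest shared unit, such as the gene. The sketch below averages a base-level score track over gene intervals and assigns individual positions (e.g., SNPs) to the gene that contains them. The function names, the half-open coordinate convention, and the toy coordinates are assumptions for illustration.

```python
from bisect import bisect_right

def mean_score_per_interval(base_scores, intervals):
    """Average a per-base score track over named half-open (start, end) intervals."""
    # Prefix sums make each interval mean a constant-time lookup.
    prefix = [0.0]
    for s in base_scores:
        prefix.append(prefix[-1] + s)
    return {name: (prefix[end] - prefix[start]) / (end - start)
            for name, (start, end) in intervals.items()}

def assign_positions(positions, intervals):
    """Map each position to the interval containing it, or None if there is none.

    Assumes the intervals do not overlap one another.
    """
    ordered = sorted(intervals.items(), key=lambda kv: kv[1][0])
    starts = [iv[0] for _, iv in ordered]
    out = {}
    for pos in positions:
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < ordered[i][1][1]:
            out[pos] = ordered[i][0]
        else:
            out[pos] = None
    return out

# Toy example: an 8 bp conservation track and two hypothetical genes.
conservation = [0.0, 1.0, 1.0, 0.0, 0.5, 0.5, 1.0, 0.0]
genes = {"geneA": (1, 3), "geneB": (4, 8)}
print(mean_score_per_interval(conservation, genes))
print(assign_positions([0, 2, 3, 7], genes))
```

Averaging is only one possible summary; a maximum or a count above a threshold may be more appropriate for peaked signals such as binding.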
Challenge 4: Heterogeneity
• Technology-specific sources of error
• Different pre-processing, normalization
• Different amounts of missing values
• Data matching
– Different identifiers
– Not always one-to-one (microarrays)
– Imputation
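The data-matching step can be sketched as follows: probe-level values are collapsed to gene level, and then only genes present in both data types are kept. The rules used here (drop probes that map to several genes, average probes that share a gene, skip missing values rather than impute them) are illustrative choices, and all identifiers are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def collapse_probes(probe_values, probe_to_genes):
    """Collapse probe-level values to gene level.

    Probes mapping to several genes are dropped as ambiguous; several probes
    mapping to the same gene are averaged.
    """
    per_gene = defaultdict(list)
    for probe, value in probe_values.items():
        genes = probe_to_genes.get(probe, [])
        if len(genes) != 1 or value is None:
            continue  # unmapped or ambiguous probe, or missing measurement
        per_gene[genes[0]].append(value)
    return {gene: mean(vals) for gene, vals in per_gene.items()}

def match_features(first, second):
    """Keep only genes measured in both data types, in a shared sorted order."""
    shared = sorted(set(first) & set(second))
    return shared, [first[g] for g in shared], [second[g] for g in shared]

# Hypothetical microarray probes and a second, gene-level data type.
probe_values = {"p1": 2.0, "p2": 4.0, "p3": 7.0, "p4": None, "p5": 1.0}
probe_to_genes = {"p1": ["G1"], "p2": ["G1"], "p3": ["G2", "G3"],
                  "p4": ["G4"], "p5": ["G5"]}
expr = collapse_probes(probe_values, probe_to_genes)
binding = {"G5": 0.2, "G1": 0.9, "G9": 0.4}
print(match_features(expr, binding))
```

Dropping unmatched or missing entries shrinks the shared feature set; imputation is the alternative when too much would be lost.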
Challenge 4: Heterogeneity
• Continuous
– expression and binding data from microarrays,
motif scores, protein/metabolite abundance
• Counts
– expression data from sequencing
• Bounded between 0 and 1
– conservation (UCSC), DNA methylation
• Binary/Categorical
– Thresholding (e.g., motif scores), genotype
Case Study: Development
Ci (Cubitus interruptus)
• important for differentiation of
appendages during development
• transcription factor – binds to DNA
near target genes
http://www.biology.ualberta.ca/locke.hp/research.htm
http://howardhughes.trinity.duke.edu
Kechris Lab, CU Denver
Hierarchical Mixture Model
• Data
– Transcriptome: Ci pathway mutants (expr), irregular interval
– Genome: DNA binding data of Ci (bind), regular interval;
DNA conservation across 14 insect species (cons), base level
• Goal: Predict gene targets of Ci
• Hidden variable is gene target (hierarchical mixture model)
Dvorkin et al., 2013 (under review)
Challenge 5: Interactions and Pathways
• Known Pathways
– Incorporate information in databases (curated but
sparse)
– e.g., KEGG pathways have metabolite – protein
interactions (directed graphs)
• De novo Pathways
– Discover novel interactions
Known Pathways
• Joint modeling of metabolite and transcript data to
identify active pathways
[Figure: pathway graph with gene and metabolite nodes]
Jornsten, Chalmers & Michailidis, U. Michigan
Biostatistics (2012) 13:4 748-761
de novo Interactions
• Single data
• Pair-wise
– Correlations (e.g., eQTL)
– Bayesian networks
• Multiple
– Kernel-based methods
– Probabilistic graphical models
– Network analysis
[Figure: integration of SNP, methylation site, gene, protein, and metabolite data with phenotype]
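The simplest pair-wise approach, correlation between genotype and expression (an eQTL-style scan), can be sketched as below. This is a toy version under stated assumptions: genotypes are coded as minor-allele counts (0, 1, 2), the SNP and gene names are hypothetical, and a real analysis would typically use regression with covariates and multiple-testing correction.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def eqtl_scan(genotypes, expression):
    """Correlate every SNP with every gene.

    Returns (snp, gene, r) tuples sorted by decreasing absolute correlation.
    """
    results = [(snp, gene, pearson(g, e))
               for snp, g in genotypes.items()
               for gene, e in expression.items()]
    return sorted(results, key=lambda t: abs(t[2]), reverse=True)

# Toy data for six samples.
genotypes = {"snp1": [0, 0, 1, 1, 2, 2], "snp2": [0, 1, 2, 0, 1, 2]}
expression = {"geneA": [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]}
for snp, gene, r in eqtl_scan(genotypes, expression):
    print(snp, gene, round(r, 3))
```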
de novo Interactions
Shojaie Lab U. Washington
Biometrika (2010) 97 (3): 519-538.
Summary Methodology
1. Meta-analysis
2. Permutation-based Methods
3. Sparse Multivariate Methods
4. Graphical Models
5. Network Analysis