Data Integration for Cancer Genomics
Personalized Medicine Tumor Board
Question: given all we know about a patient, what is the “optimal” treatment?
The Cancer Genome Atlas Project (TCGA)
SNP
Structural variations
DNA methylation
Gene expression
microRNA expression
Paired samples/unpaired samples
Data Processing Challenges
Contamination
Subclones
Biological questions
• Changes in genes between cancer and normals
• Disease heterogeneity, subtypes
• Joint modeling, mechanisms
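For the first question, the paired/unpaired distinction noted earlier changes which test statistic is appropriate. The sketch below is only an illustration with made-up expression values for a single gene: `paired_t` assumes tumor and normal samples come from the same patients in the same order, while `welch_t` assumes two independent groups. Both function names are hypothetical.

```python
import math
from statistics import mean, variance

def paired_t(tumor, normal):
    """t statistic for matched tumor/normal samples from the same patients."""
    diffs = [t - n for t, n in zip(tumor, normal)]
    return mean(diffs) / math.sqrt(variance(diffs) / len(diffs))

def welch_t(tumor, normal):
    """t statistic for unpaired groups, without assuming equal variances."""
    se = math.sqrt(variance(tumor) / len(tumor) + variance(normal) / len(normal))
    return (mean(tumor) - mean(normal)) / se

# Made-up log-expression values for one gene in five patients.
tumor = [7.1, 8.3, 6.9, 9.0, 7.7]
normal = [6.0, 7.4, 6.1, 7.8, 6.5]

t_paired = paired_t(tumor, normal)
t_unpaired = welch_t(tumor, normal)
print(f"paired t = {t_paired:.2f}, unpaired (Welch) t = {t_unpaired:.2f}")
```

With these toy values the paired statistic is much larger, because pairing removes the patient-to-patient variation that dominates the unpaired comparison.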
Integrative approach
Meta-analytical approach
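As one possible instance of a meta-analytical approach, per-dataset p-values for the same gene can be combined after each data set is analyzed separately. The sketch below uses Fisher's combined probability test, which is our choice for illustration and is not named in the slides; it assumes the p-values are independent. The function name `fisher_combined_p` and the input values are hypothetical.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null; for even degrees of freedom the
    upper-tail probability has a closed form, so no external library is needed.
    """
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # half of the test statistic
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Made-up p-values for one gene from three separately analyzed data types.
combined = fisher_combined_p([0.04, 0.03, 0.20])
print(f"combined p = {combined:.4f}")
```

An integrative approach would instead model the data types jointly, as in the factorization models that follow.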
PARADIGM: PAthway Recognition Algorithm using Data Integration on Genomic Models
$X_{p \times n} = W_{p \times (k-1)} Z_{(k-1) \times n} + e_{p \times n}$
$\mathrm{cov}(e) = \mathrm{diag}(\psi_1, \psi_2, \ldots, \psi_p)$
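To make the dimensions concrete, the latent-factor model above can be simulated. This is a minimal sketch, assuming the columns of Z are standard normal with identity covariance (an assumption not stated on the slide); under that assumption the covariance of X implied by the model is W Wᵀ + diag(ψ). All numeric values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k = 6, 20000, 3                 # p features, n samples, k - 1 latent factors

W = rng.normal(size=(p, k - 1))       # loadings, p x (k-1)
psi = rng.uniform(0.5, 1.5, size=p)   # feature-specific noise variances
Z = rng.normal(size=(k - 1, n))       # latent factors (assumed N(0, I))
e = rng.normal(size=(p, n)) * np.sqrt(psi)[:, None]  # cov(e) = diag(psi)

X = W @ Z + e                         # observed data, p x n

implied_cov = W @ W.T + np.diag(psi)  # covariance implied by the model
sample_cov = np.cov(X)                # rows of X are treated as variables
max_gap = np.abs(sample_cov - implied_cov).max()
print(f"largest covariance discrepancy: {max_gap:.3f}")
```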
Non-negative matrix factorization
$X_{M \times N} = W_{M \times K} H_{K \times N}$
All matrix entries are nonnegative
Minimize
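The objective is not reproduced here. As a minimal sketch, assuming the quantity minimized is the squared Frobenius norm of X - WH, the factorization can be fitted with the standard multiplicative updates of Lee and Seung. The function name `nmf`, the iteration count, and the toy matrix are all illustrative choices.

```python
import numpy as np

def nmf(X, K, n_iter=500, seed=0, eps=1e-9):
    """Factor a nonnegative M x N matrix X as W (M x K) times H (K x N)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.uniform(0.1, 1.0, size=(M, K))
    H = rng.uniform(0.1, 1.0, size=(K, N))
    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy nonnegative matrix of exact rank 3.
rng = np.random.default_rng(1)
X = rng.uniform(size=(20, 3)) @ rng.uniform(size=(3, 30))

W, H = nmf(X, K=3)
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_error:.4f}")
```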
$X_1$: an $M \times N_1$ matrix
$X_2$: an $M \times N_2$ matrix
$X_3$: an $M \times N_3$ matrix
$X_1 = W H_1$
$X_2 = W H_2$
$X_3 = W H_3$
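Because the three factorizations share one W, they can be fitted together. A minimal sketch, assuming each data set contributes equally to a summed squared-error objective: in that case the joint problem is the same as factoring the column-wise concatenation [X1 X2 X3] = W [H1 H2 H3]. The function name `joint_nmf` and the toy dimensions are hypothetical.

```python
import numpy as np

def joint_nmf(blocks, K, n_iter=500, seed=0, eps=1e-9):
    """Factor each X_i (M x N_i) as W @ H_i with a single shared W."""
    widths = [X.shape[1] for X in blocks]
    X = np.hstack(blocks)                         # M x (N1 + N2 + N3)
    rng = np.random.default_rng(seed)
    W = rng.uniform(0.1, 1.0, size=(X.shape[0], K))
    H = rng.uniform(0.1, 1.0, size=(K, X.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    # Cut H back into one coefficient matrix per data set.
    Hs = np.split(H, np.cumsum(widths)[:-1], axis=1)
    return W, Hs

# Three toy data sets with M = 15 shared rows and a common rank-2 basis.
rng = np.random.default_rng(1)
W_true = rng.uniform(size=(15, 2))
blocks = [W_true @ rng.uniform(size=(2, n)) for n in (10, 12, 8)]

W, Hs = joint_nmf(blocks, K=2)
errors = [np.linalg.norm(X - W @ H) / np.linalg.norm(X) for X, H in zip(blocks, Hs)]
print([round(err, 4) for err in errors])
```

If the data sets should be weighted differently, the concatenation shortcut no longer applies directly and each block would need its own weight in the updates.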
TCGA, GWAS, and ENCODE
Cancer Treatment
Examples
http://discover.nci.nih.gov/cellminer/
Gene expression data: HG-U133A chip, mapped to 12980 genes across 59 cell
lines (expression data for the cell line “LC:NCI_H23” were unavailable). Only genes
included in two lists were used: (1) 766 cancer-related genes (Chen, et al., 2008); (2) 8919
genes from the Integrated Druggable Genome Database (IDGD) Project (Hopkins
and Groom, 2002; Russ and Lampel, 2005). After this filtering, 6958 genes were
retained.
Drug response data: 101 drugs annotated in the CancerResource database
(Ahmed, et al., 2011), with response measured as -log(GI50).
Pathway association information: retrieved from the KEGG MEDICUS database
(Kanehisa, et al., 2010). We used 58 pathways that are either known to be related to
cancer or contain drug targets. Of the 6958 genes retained after the expression
filtering, 1863 are covered by these 58 pathways and constitute the final list of
genes in our real data analysis.
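The two-stage gene filtering described above amounts to a few set operations. The sketch below is illustrative only: it assumes a gene qualifies if it appears in either of the two lists (the text does not say whether the lists are combined by union or intersection), and the gene sets and pathway names are tiny made-up stand-ins for the real data.

```python
def select_genes(measured, cancer_related, druggable, pathways):
    """Return (genes retained after list filtering, final pathway-covered genes)."""
    candidates = cancer_related | druggable        # assumption: union of the two lists
    retained = measured & candidates               # genes on the chip and in a list
    pathway_genes = set().union(*pathways.values())
    final = retained & pathway_genes               # genes covered by some pathway
    return retained, final

# Toy stand-ins for the 12980 measured genes, the two lists, and the 58 pathways.
measured = {"TP53", "EGFR", "BRCA1", "GAPDH", "ACTB", "KRAS"}
cancer_related = {"TP53", "BRCA1", "MYC"}
druggable = {"EGFR", "KRAS", "ABL1"}
pathways = {
    "hypothetical_pathway_A": {"TP53", "EGFR"},
    "hypothetical_pathway_B": {"KRAS", "MYC"},
}

retained, final = select_genes(measured, cancer_related, druggable, pathways)
print(sorted(retained), sorted(final))
```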
GI50 values
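The response measure is the negative log of GI50, so a lower GI50 (less drug needed for 50% growth inhibition) gives a larger value. A minimal sketch, assuming a base-10 logarithm and GI50 expressed in molar units; neither is stated in the text, and the drug names and concentrations are made up.

```python
import math

def neg_log_gi50(gi50_molar):
    """Convert a GI50 concentration (assumed molar) to -log10(GI50)."""
    return -math.log10(gi50_molar)

# Made-up GI50 values for one cell line.
gi50 = {"drug_A": 1e-6, "drug_B": 5e-8, "drug_C": 2e-4}
response = {drug: neg_log_gi50(value) for drug, value in gi50.items()}

# Larger transformed values mean higher sensitivity.
ranked = sorted(response, key=response.get, reverse=True)
print(ranked)
```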
Cancer Types
Cancer type            Number of cell lines
Leukemia               6
Non-Small Cell Lung    8
Colon                  7
CNS                    6
Melanoma               9
Ovarian                7
Renal                  8
Prostate (excluded)    2
Breast                 6
Connectivity Map Data
• CMap Build 02 (http://www.broadinstitute.org/cmap/) provides public
downloads of genome-wide transcriptional profiles of five human cancer
cell lines (MCF7: human breast cancer; HL60: human promyelocytic
leukemia; ssMCF7: MCF7 grown in a different vehicle; PC3: human
epithelial prostate cancer; SKMEL5: human skin melanoma) both before
and after treatment with 1309 distinct bioactive small molecules.
• Used the data from the HT_HG-U133A array platform, which consists of
4466 expression response profiles, representing 1084 different
compounds.
• Integration within the same cancer type
• Integration across different cancer types
One individual with 188-fold coverage
Ideal Pipeline
• Patient diagnosis and sample collection
• Various types of genomics profiling
• Driver mutations, disease subtypes
• Targeted treatments, monitoring, and additional
treatments
Topics of Interest
• Data processing
• Relationships among different data types
• Tumor heterogeneity
• Single cell analysis
• Modeling
• Targeted treatment
• Integration over different tumor types
• TCGA, ENCODE, GWAS, 1000 Genomes, and others