Tumor subtype identification and patient stratification using aCGH data

Kurt Zhang, PhD
Bioinformatics Core, Department of Pathology
University of North Dakota School of Medicine
The 3rd PaSiPhIC, June 27, 2012

Outline
• What is aCGH data?
• Can aCGH data be used for tumor subtype identification and patient stratification?
  – Tumors are genetically heterogeneous
  – Link tumor subtypes with clinical outcomes
• Can we develop a better clustering method for aCGH data?
  – “All models are approximations; some perform better than others” – Peter Bonate

Array Comparative Genomic Hybridization (aCGH)

Array Comparative Genomic Hybridization
• An array technology that allows quantitative determination of amplifications and deletions on a genome-wide scale
[Figure panels: Normal Human Genome; MDA-MB-435 Breast Cancer Cell Genome]

Array CGH Platform: Affymetrix 100k SNP Array
• SNP: single nucleotide polymorphism
• 120k rows of data for each sample
• GTYPE software for generating copy number data

Chromosome 7q

Can aCGH data be used for tumor subtype identification and patient stratification?

Tumor intra-heterogeneity
Gerlinger et al. NEJM 2012

3*30 data: Hierarchical Clustering (Pearson’s dissimilarity / Ward’s linkage)
[Figure labels: CRC tumors; NB primary w/ pairs; SCLC lines]

PCA Clustering – 3*30 data

Nonnegative Matrix Factorization (NMF)
• NMF (Nonnegative Matrix Factorization): Lee & Seung 1999 Nature 401:788
• Theory: Perception of the whole is based on perception of its parts.
• To compare with PCA (Principal Components Analysis)

NMF Algorithm
• The data matrix A (SNPs × samples) is factored as $A = WH + E$.
• Green are the “features”. Red are the “weights”.
• Start with random elements in red and green.
• Optimize so that $\sum_{i,j} \left(a_{ij} - (WH)_{ij}\right)^2$ is minimized.

NMF Algorithm Cont’d
Column of W for sample k:
  row 1: 0.38
  row 2: 0.52
  row 3: 0.17
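The factorization and the largest-weight assignment rule on the NMF slides can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes the Lee & Seung multiplicative updates for the squared-error objective, and the names `nmf`, `assign_clusters`, `n_iter`, and `eps` are hypothetical. Following Lee & Seung, the sketch calls the markers × k feature matrix `W` and the k × samples weight matrix `H`; the slide labels the per-sample weight column W.

```python
import numpy as np


def nmf(A, k, n_iter=1000, seed=0, eps=1e-9):
    """Factor a nonnegative matrix A (markers x samples) as W @ H.

    Assumption: Lee & Seung multiplicative updates that reduce the
    squared error sum((A - W @ H) ** 2); the talk does not say which
    update rule was used.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Start with random nonnegative elements in both factors.
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry nonnegative.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H


def assign_clusters(weights):
    """Place each sample (column) in the cluster with the largest weight.

    Cluster labels are 1-based to match the slide.
    """
    return np.argmax(weights, axis=0) + 1


if __name__ == "__main__":
    # Toy data (made up): two groups of samples with disjoint gained regions.
    A = np.array([
        [3.0, 3.0, 0.0, 0.0],
        [3.0, 3.0, 0.0, 0.0],
        [3.0, 3.0, 0.0, 0.0],
        [0.0, 0.0, 3.0, 3.0],
        [0.0, 0.0, 3.0, 3.0],
        [0.0, 0.0, 3.0, 3.0],
    ])
    W, H = nmf(A, k=2)
    print("cluster labels:", assign_clusters(H))
    print("relative error:", np.linalg.norm(A - W @ H) / np.linalg.norm(A))
```

Applied to the example column on the slide (0.38, 0.52, 0.17 for rows 1 to 3), `assign_clusters` returns 2, the row holding the largest weight.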
Sample k is placed in cluster 2

NMF clustering – 3*30 data
group 1
sample.01_exp.5AAAA1 sample.02_exp.5AAAA1 sample.03_exp.5AAAA2 sample.04_exp.5AAAA2 sample.05_exp.5AAAA3 sample.06_exp.5AAAA3 sample.08_exp19BBBB sample.9_exp.2BBBB sample.09_exp.5AAAA5 sample.10_exp.5AAAA5 sample.11_exp.5AAAA sample.12_exp.5AAAA sample.13_exp.5AAAA4 sample.14_exp.5AAAA4 sample.15_exp.5AAAA sample.16_exp.5AAAA sample.17_exp.5AAAA sample.19_exp.2BBBB sample.19_exp5AAAA sample.20_exp.5AAAA sample.22_exp.2BBBB sample.23_exp.2BBBB sample18_exp.5AAAA sample_01_exp21AAAA sample_01_exp24 sample_01_exp31 sample_01_exp42 sample_02_exp21AAAA sample_02_exp24 sample_02_exp28 sample_02_exp36 sample_03_exp21AAAA sample_03_exp24 sample_03_exp25 sample_03_exp26 sample_03_exp28 sample_04_exp20BBBB sample_04_exp21AAAA sample_05_exp21AAAA sample_06_exp21AAAA sample_08_exp21AAAA sample_08_exp24 sample_08_exp25 sample_08_exp31 sample_09_exp10 sample_11_exp21AAAA sample_12_exp21AAAA
group 2
sample.20_exp.2BBBB sample_01_exp25 sample_01_exp26 sample_01_exp28 sample_01_exp36 sample_02_exp10 sample_02_exp25 sample_02_exp26 sample_02_exp31 sample_02_exp42 sample_03_exp10 sample_03_exp31 sample_08_exp26 sample_08_exp28 sample_08_exp36 sample_08_exp42 sample_09_exp21AAAA sample_09_exp24
group 3
sample.1_exp.2BBBB sample.2_exp.2BBBB sample.3_exp.2BBBB sample.4_exp.4BBBB sample.5_exp.2BBBB sample.6_exp.2BBBB sample.7_exp.2BBBB sample.07_exp19BBBB sample.8_exp.2BBBB sample.09_exp19BBBB sample.10_exp.2BBBB sample.10_exp19BBBB sample.11_exp.2BBBB sample.12_exp.2BBBB sample.13_exp.2BBBB sample.14_exp.2BBBB sample.15_exp.2BBBB sample.16_exp.2BBBB sample.17_exp.2BBBB sample.18_exp.2BBBB sample.21_exp.2BBBB sample_02_exp20BBBB sample_03_exp20BBBB sample_07_exp21AAAA sample_10_exp21AAAA

NMF with Segmentation for 3*30 Data (5262 segments)
group 1
sample_01_exp24 sample_01_exp25 sample_01_exp26 sample_01_exp28 sample_01_exp31 sample_01_exp36 sample_01_exp42 sample_02_exp10 sample_02_exp24 sample_02_exp25
sample_02_exp26 sample_02_exp28 sample_02_exp31 sample_02_exp36 sample_02_exp42 sample_03_exp10 sample_03_exp24 sample_03_exp25 sample_03_exp26 sample_03_exp28 sample_03_exp31 sample_08_exp24 sample_08_exp25 sample_08_exp26 sample_08_exp28 sample_08_exp31 sample_08_exp36 sample_08_exp42 sample_09_exp10 sample_09_exp21AAAAA sample_09_exp24
group 2
sample.01_exp.5AAAAA1 sample.02_exp.5AAAAA1 sample.03_exp.5AAAAA2 sample.04_exp.5AAAAA2 sample.05_exp.5AAAAA3 sample.06_exp.5AAAAA3 sample.08_exp19BBBBB sample.9_exp.2BBBBB sample.09_exp.5AAAAA5 sample.10_exp.5AAAAA5 sample.11_exp.5AAAAA sample.12_exp.5AAAAA sample.13_exp.5AAAAA4 sample.14_exp.5AAAAA4 sample.15_exp.5AAAAA sample.16_exp.5AAAAA sample.17_exp.5AAAAA sample.19_exp.2BBBBB sample.19_exp5AAAAA sample.20_exp.5AAAAA sample.21_exp.2BBBBB sample.22_exp.2BBBBB sample18_exp.5AAAAA sample_01_exp21AAAAA sample_02_exp21AAAAA sample_03_exp21AAAAA sample_04_exp20BBBBB sample_04_exp21AAAAA sample_05_exp21AAAAA sample_06_exp21AAAAA sample_08_exp21AAAAA sample_11_exp21AAAAA sample_12_exp21AAAAA
group 3
sample.1_exp.2BBBBB sample.2_exp.2BBBBB sample.3_exp.2BBBBB sample.4_exp.4BBBBB sample.5_exp.2BBBBB sample.6_exp.2BBBBB sample.7_exp.2BBBBB sample.07_exp19BBBBB sample.8_exp.2BBBBB sample.09_exp19BBBBB sample.10_exp.2BBBBB sample.10_exp19BBBBB sample.11_exp.2BBBBB sample.12_exp.2BBBBB sample.13_exp.2BBBBB sample.14_exp.2BBBBB sample.15_exp.2BBBBB sample.16_exp.2BBBBB sample.17_exp.2BBBBB sample.18_exp.2BBBBB sample.20_exp.2BBBBB sample.23_exp.2BBBBB sample_02_exp20BBBBB sample_03_exp20BBBBB sample_07_exp21AAAAA sample_10_exp21AAAAA

NMF and HC clustering - Astrocytoma

Kaplan-Meier Curve - Astrocytoma
[Figure: survival plots for clusters 1 and 2, panels NMF and HC; x-axis Month (10 to 60), y-axis Surviving (0.0 to 1.0)]
Log-rank test: NMF P value = 0.008; HC P value = 0.459

Kaplan-Meier Curve - Astrocytoma
[Figure: survival plots for clusters 1, 2, and 3, panels NMF and HC; x-axis Month (10 to 60), y-axis Surviving (0.0 to 1.0)]
Log-rank test (NMF)
clust1  clust2  P value
1       2       0.008
1       3       0.194
2       3       0.213

Log-rank test (HC)
clust1  clust2  P value
1       2       0.625
1       3       0.481
2       3       0.954

Leave-5-out Stability Test
[Diagram: clusters 1, 2, 3; randomly leave 5 out; clusters 1, 2, 3]

Misplacement rate
# Clusters  NMF     HC
2           1.79%   13.03%
3           9.48%   8.76%

Can we develop a better clustering method for aCGH data?

DNA copy numbers are highly correlated

A Hidden Markov Model (HMM)
[Diagram: chain of states linked by Ptran, each emitting an observation through Pemi]
• A hidden Markov model is used to calculate the true copy number of each marker from 5’ to 3’ on each chromosome.
• In the diagram, t is the index of the current SNP, y is the observed intensity value of the SNP, and x is the true copy number. The copy number of the next SNP depends only on the present SNP.
• If a DNA segment is normal, the copy number of the SNPs in this region should be 2. A copy number greater than 2 is a gain (amplification); a copy number less than 2 is a loss (deletion).
• The true copy number of the next SNP is determined by the transition probability from the current SNP and the emission probability of the observation. The transition probability is modeled by a multinomial distribution, and the emission probability is modeled by a lognormal distribution.

An HMM-based Clustering Method
[Diagram: observations yij(t-1), yij(t), yij(t+1)]
• Suppose we have G groups.
• Each group of samples is fitted with a single HMM.
• At each state, the true copy number is calculated from the transition probability from the previous state and the multiple emission probabilities for all observations within the group.
• Thus the full set of samples is modeled with a mixture HMM.

HMMC Algorithm

Simulation: clustering error rate

Simulation: convergence rate

Model Assessment: #clusters = ?
• Akaike/Bayesian Information Criteria (AIC/BIC)
  $\mathrm{AIC} = -2\log(L) + 2K$
  $\mathrm{BIC} = -2\log(L) + \log(n)\,K$
  where L is the maximum likelihood for the given model, K is the number of parameters in the model, and n is the total number of samples (sample size).

HMM Likelihood and BIC
• Log-likelihood of cluster g:
  $\ln L_g = \sum_{t=1}^{T} \left[ \ln P_{tran}(t) + \sum_{j=1}^{n_g} \ln P_{emi}(y_{jt}) \right]$
• BIC [equation not recoverable from the slide text], where G is the number of clusters, T is the number of markers, and n is the number of samples.

HMMC - Astrocytoma
[Figure: survival plots, panels NMF and HMMC; x-axis month (10 to 60), y-axis Surviving (0.0 to 1.0)]
Log-rank test: NMF P=0.0078; HMMC P=0.0046

HMMC - Astrocytoma

Summary of HMMC
• We used an HMM to take into account the spatial correlation between genomic markers.
• The number of groups is determined by an HMM-based BIC.
• A data mining algorithm and parallel computing are used to increase computational efficiency.
• The HMM clusters are associated with clinical outcome.
• Genomic CNV biomarkers can be derived from the clustering results.

Future Studies
• To implement new distance metrics, such as distance correlation, maximal information coefficient, etc.
• To use a genetic algorithm for optimal clustering.
• To extend our clustering algorithm to next-generation genome sequencing data.
• To develop a nonparametric statistical method to segment DNA copy number and to discover genomic biomarkers for each tumor subtype.

Acknowledgements
• UND bioinformatics core
  – Yi Yang
  – Maksym Tkach
  – Mohammed Mohmoud
  – Brent Weichel
  – Kaitlin Clarke
  – Paul Rodemeyer
  – Thomas Kokal
• Tumor genomics group in Abbott Laboratories
  – Dimitri Semizarov, PhD
  – Xin Lu, PhD
  – Charles van Sant, PhD
  – Viswanath Devanarayan, PhD

This study is supported by NIH INBRE.