Computing the optimal hierarchical clustering

A major disadvantage of a hierarchical clustering approach is the uncertainty of the derived clusters; for example, using a different number of features may result in different patient clusters. We assessed the uncertainty of a hierarchical clustering using Pvclust [1], which builds on multi-scale bootstrap resampling. It computes a bootstrap probability (BP) and an approximately unbiased (AU) P-value for each cluster. These P-values indicate how strongly a cluster is supported by the data. Clusters with a significance level < 0.05 are taken into consideration, which indicates that these clusters do not only "seem to exist" but are stable when we perturb the number of observations. In further analyses we used the AU P-value to assess uncertainty, as it is a better approximation than the BP P-value according to the authors of Pvclust [1].

To select the hierarchical clustering that is best supported by the data, we used the following procedure:

i) The features of each data set are ranked by across-patient standard deviation, and an increasing number of probesets is selected using 21 different cutoffs (0%, 1%, ..., 20%). The selected feature sets for GEP and DMP are then iteratively combined.
ii) Each feature set (440 in total) is then used for hierarchical cluster analysis with 1000 multi-scale bootstraps, using Ward's linkage and Pearson correlation distance.
iii) An average silhouette score [2] (a measure of the relatedness of samples within a cluster and the separation between different clusters) is computed for each significant cluster from Pvclust (Figure S2).
iv) Subsequently, the hierarchical clustering that is best supported by the data is selected.

One should expect that the clustering with the highest silhouette score among the significant Pvclust clusters preserves, to some extent, the clusters of currently known abnormalities (CEBPAsilenced, CEBPAdm, inv(16), t(8;21) and t(15;17)).
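As a rough illustration of steps i and ii, the following Python sketch ranks probesets by across-patient standard deviation, keeps the top 0% to 20% in 1% steps, and enumerates the 21 x 21 - 1 = 440 combined GEP/DMP feature sets. The function names, the toy data, the use of the population standard deviation and the rounding rule are assumptions made for this example and are not taken from the authors' implementation; the bootstrap clustering itself (Pvclust, an R package) is not reproduced here.

```python
from itertools import product
from statistics import pstdev


def top_percent_by_sd(profiles, percent):
    """Return the names of the top `percent` % of probesets, ranked by
    across-patient standard deviation (largest first).

    `profiles` maps a probeset name to its list of per-patient values.
    """
    ranked = sorted(profiles, key=lambda name: pstdev(profiles[name]), reverse=True)
    n_keep = round(len(ranked) * percent / 100)  # rounding rule is an assumption
    return ranked[:n_keep]


def feature_set_grid(gep, dmp, cutoffs=range(21)):
    """Yield (gep_percent, dmp_percent, probeset_names) for every pair of
    cutoffs, skipping the empty combination (0% GEP and 0% DMP)."""
    for gep_pct, dmp_pct in product(cutoffs, repeat=2):
        if gep_pct == 0 and dmp_pct == 0:
            continue
        selected = top_percent_by_sd(gep, gep_pct) + top_percent_by_sd(dmp, dmp_pct)
        yield gep_pct, dmp_pct, selected


# Toy data (hypothetical): 100 probesets per platform, 4 patients each.
# Probeset k has values 0, k, 0, k, so its standard deviation grows with k.
toy_gep = {f"gep_{k}": [0, k, 0, k] for k in range(1, 101)}
toy_dmp = {f"dmp_{k}": [0, k, 0, k] for k in range(1, 101)}

grid = list(feature_set_grid(toy_gep, toy_dmp))
print(len(grid))  # number of candidate feature sets: 21 * 21 - 1
```

Each element of `grid` would then be passed to the clustering step; keeping the selection separate from the clustering makes it easy to check that exactly 440 feature sets are generated.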
Our findings are in line with the expectation that the known clusters are preserved: the clustering with the highest silhouette score among the significant Pvclust clusters also showed a high silhouette score for the currently known clusters. Note that the GEP and DMP data are mean-normalized with unit variance (z-score).

Computing the stability of the detected clusters

The stability of the 18 newly derived clusters is examined using all the hierarchical clusterings derived as described above. The cluster labels determined for each hierarchical clustering are used to compute the average silhouette score for each of the hierarchical clusterings (Figure S2). We hypothesized that stable clusters are frequently seen among different hierarchical clusterings. For the optimal hierarchical clustering, we found that ten out of eighteen clusters (# 1, 2, 3, 6, 8, 9, 11, 13, 16, 18) have high silhouette scores (between 0.5 and 1) based on all other hierarchical clusterings. These include the (cyto)genetically defined groups, such as inv(16), t(15;17), CEBPAdm and CEBPAsilenced. Five clusters (# 3, 11, 13, 14, 15) had silhouette scores between 0.4 and 0.5, and three clusters (# 4, 7 and 12) showed "low" silhouette scores (between 0 and 0.4). Based on these average silhouette scores we categorized the clusters into high (n=10 clusters), medium (n=5 clusters) and low (n=3 clusters) stability (illustrated with **, * and no asterisks, respectively).

References

1. Suzuki R, Shimodaira H. Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics. Jun 15 2006;22(12):1540-1542.
2. Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics. 1987;20:53-65.

Supplementary files:

Figure S1: Read depth frequency of the aligned loci
The horizontal axis represents the read depth measured for a locus and the vertical axis its frequency.
As an example, a locus with a read depth of 100 is seen 24517 times, whereas a read depth of 1000 is seen 4 times (both are averages among all samples).

Figure S2: Selection of the optimal hierarchical clustering
Twenty-one different cutoffs are chosen based on the across-patient standard deviation for GEP and, separately, for DMP. Gene expression and DNA-methylation probesets are iteratively combined using one of the 440 combinations: 21x21 minus 1 (the combination with zero GEP and zero DMP probesets). The uncertainty of each hierarchical clustering is assessed using the estimated bootstrap labels from Pvclust and scored using silhouette scores. The hierarchical clustering with the highest average silhouette score is best supported by the data and is indicated with a red rectangle.

Figure S3: Optimal hierarchical clustering for GEP and DMP separately
Heat maps representing pairwise correlations between the 344 AML cases based solely on the gene expression or the DNA-methylation profiles. The optimal hierarchical clustering for (A) GEP was based on 650 probesets and for (B) DMP on 682 probesets. The diagonal illustrates in red the AML subtypes inv(16), t(8;21), t(15;17), CEBPA double mutants (CEBPAdm) and CEBPAsilenced, the splice factor genes U2AF35, SF3B1 and SRSF2, and the RAEB/AML-LBC cases.

Figure S4: Enriched pathways for cluster #11
Graphical representation of the top 10 enriched pathways/gene sets based on the differentially expressed and differentially DNA-methylated genes for cluster #11 (detailed results are given in Table S8).

Figure S5: Gene expression and DNA-methylation levels of annotated genes in erythroid development
Heat map representing pairwise correlations between the 344 AML cases using the integrated gene expression and DNA-methylation profiles. The splice factor mutants are indicated with red bars. Gene expression data of the differentially regulated genes involved in erythroid development are mean-normalized and depicted on the diagonal, with red bars for relatively high gene expression levels.
Blue bars depict hypomethylation levels.

Figure S6: DNA-methylation and gene expression levels for cluster #3 and the enriched pathways
(A) Graphical representation of the top 10 enriched pathways/gene sets based on the differentially expressed and differentially DNA-methylated genes for cluster #3 (detailed results are given in Table S8).
(B) Differentially expressed and differentially DNA-methylated genes in patient samples from cluster #3 compared to all other AMLs are indicated with differently colored dots. The colors depict the gene expression and DNA-methylation status; for example, the upper right corner represents genes that are hypomethylated and overexpressed (green dots). Many of these genes encode proteins involved in erythroid development or function (detailed results are given in Table S8).

Figure S7: Relapse-free survival and event-free survival for patients in clusters #3 and #11
Kaplan-Meier survival curves and multivariate analysis for (A) relapse-free survival (RFS) and (B) event-free survival (EFS) for cluster #3 vs. all patients except cluster #3 patients, cluster #11 vs. all patients except cluster #11 patients, and cluster #3 vs. cluster #11 patients. Multivariate analysis for cluster #3 for (C) RFS and (D) EFS, and for cluster #11 for (E) RFS and (F) EFS. The covariates included in the Cox proportional hazards model (hazard ratio, HR) are: NPM1mut vs. wild-type NPM1, FLT3ITD vs. no FLT3ITD, NRASmut/KRASmut vs. wild-type NRAS/KRAS, and high cytogenetic risk vs. no high cytogenetic risk; age and white blood cell count (WBC) are used as continuous variables.

Figure S8: Gene mutations in patients from cluster #3, cluster #11 and splice factor mutations outside these clusters
Columns represent patients from cluster #3, cluster #11 and splice factor mutants outside these clusters. Rows indicate, in red, mutations in SRSF2, U2AF35, SF3B1, NRAS/KRAS, ASXL1, NPM1, FLT3ITD, FLT3TKD, DNMT3A, IDH1 and IDH2. Wild-type genotypes are indicated in white and missing values in blue.
The bottom row indicates the RAEB/AML-LBC status in red and missing values in blue.

Figure S1 (plot not reproduced). Horizontal axis: read depth of aligned loci (0 to 2000); vertical axis: frequency (0 to 1000). Legend: samples 2224, 2228, 2246, 2259, 2278, 3318, 3330, 4340, 5290, 5363, 6373, 6448, 7303 and 7309, each with a matching "T cells" entry.

Figure S2 (plot not reproduced). Panels A and B. Axes: GEP probesets and DMP probesets; plotted value: average silhouette score of the Pvclust labels (scale 0 to 0.4). Marked combination: GEP 2168 probesets, DMP 2045 probesets. Cutoffs and corresponding numbers of probesets:

Cutoff   GEP    DMP
1%       217    227
2%       434    455
3%       650    682
4%       867    909
5%       1084   1136
6%       1301   1364
7%       1517   1591
8%       1734   1818
9%       1951   2045
10%      2168   2273
11%      2385   2500
12%      2601   2727
13%      2818   2954
14%      3035   3182
15%      3252   3409
16%      3468   3636
17%      3685   3863
18%      3902   4091
19%      4119   4318
20%      4336   4545

Figure S3 (heat maps not reproduced). Panels: A) GEP: 650 probesets; B) DMP: 682 probesets. Both panels show the 344 AML patients on both axes with a correlation scale from -1 to 1. Annotated groups: inv(16), t(8;21), t(15;17), CEBPAdm, CEBPAsilenced, U2AF35, SRSF2, SF3B1 and RAEB/AML-LBC.

Figure S4 (bar chart not reproduced). Axis: -log10(P-value), 0 to 20. Gene set sources: IPA pathways, BioCarta gene sets, KEGG.

Figure S5 (heat map not reproduced). Both axes: 344 AML patients; correlation scale from -1 to 1. Annotations: clusters cl1 to cl18, with cluster #3 and cluster #11 marked, and the splice factor genes SRSF2, U2AF35 and SF3B1.

Figure S6 (plots not reproduced). Panel A axis: -log10(P-value), 2 to 12; gene set sources: IPA pathways, KEGG, Reactome. Panel B horizontal axis: gene expression difference at associated genes (t-statistics), -20 to 20, from downregulation to upregulation; vertical axis: DNA methylation difference at associated genes (t-statistics), -8 to 8, labelled hypomethylation and hypermethylation.

Figure S7 (Kaplan-Meier plots not reproduced). Axes: months (0 to 60) versus cumulative percentage (0 to 100).

Relapse-free survival
Group                                N     F
Cluster 3                            17    12
Cluster 11                           16    7
All patients except for cluster 11   85    41
All patients except for cluster 3    84    36

Event-free survival
Group                                N     F
Cluster 3                            25    20
Cluster 11                           19    12
All patients except for cluster 11   106   62
All patients except for cluster 3    100   54

Relapse-free survival, multivariate analysis for cluster 3
Variable                       HR      Std.Err  z       P>|z|    95% CI low  95% CI high
Cluster 3                      2.181   0.848    2       0.045*   1.017       4.675
Age                            1.000   0.011    -0.01   0.993    0.978       1.022
Log(WBC)                       1.320   0.412    0.89    0.374    0.716       2.434
High (cytogenetic) risk group  1.670   0.861    0.99    0.32     0.608       4.589
FLT3ITD                        0.827   0.507    -0.31   0.756    0.248       2.752
NPM1+                          0.418   0.315    -1.16   0.247    0.096       1.828
N-RAS/K-RAS                    1.038   0.399    0.1     0.923    0.489       2.203

Relapse-free survival, multivariate analysis for cluster 11
Variable                       HR      Std.Err  z       P>|z|    95% CI low  95% CI high
Cluster 11                     1.167   0.703    0.26    0.798    0.358       3.800
Age                            1.005   0.011    0.48    0.634    0.984       1.027
Log(WBC)                       1.492   0.495    1.21    0.227    0.779       2.858
High (cytogenetic) risk group  1.516   0.859    0.73    0.463    0.499       4.605
FLT3ITD                        0.770   0.488    -0.41   0.679    0.222       2.664
NPM1+                          0.496   0.399    -0.87   0.383    0.102       2.397
N-RAS/K-RAS                    1.169   0.447    0.41    0.683    0.552       2.476

Event-free survival, multivariate analysis for cluster 3
Variable                       HR      Std.Err  z       P>|z|    95% CI low  95% CI high
Cluster 3                      1.640   0.494    1.64    0.1      0.909       2.960
Age                            1.019   0.009    2.13    0.033*   1.001       1.036
Log(WBC)                       1.133   0.288    0.49    0.622    0.689       1.866
High (cytogenetic) risk group  1.940   0.820    1.57    0.117    0.848       4.440
FLT3ITD                        2.113   0.774    2.04    0.041    1.031       4.332
NPM1+                          0.344   0.209    -1.75   0.08     0.104       1.134
N-RAS/K-RAS                    1.467   0.460    1.22    0.222    0.793       2.713

Event-free survival, multivariate analysis for cluster 11
Variable                       HR      Std.Err  z       P>|z|    95% CI low  95% CI high
Cluster 11                     1.436   0.622    0.83    0.404    0.614       3.355
Age                            1.021   0.009    2.47    0.013*   1.004       1.039
Log(WBC)                       1.293   0.352    0.94    0.345    0.758       2.206
High (cytogenetic) risk group  1.661   0.760    1.11    0.268    0.677       4.074
FLT3ITD                        1.939   0.738    1.74    0.082    0.920       4.087
NPM1+                          0.303   0.200    -1.81   0.07     0.083       1.103
N-RAS/K-RAS                    1.606   0.502    1.52    0.13     0.870       2.963

Figure S8 (mutation matrix not reproduced). Columns are labelled with patient identifiers and grouped as Cluster 3, Cluster 11 and Other clusters. Rows: SRSF2, U2AF35, SF3B1, NRAS/KRAS, ASXL1, NPM1, FLT3ITD, FLT3TKD, DNMT3A, IDH1, IDH2, RAEB, AML-LBC. Legend: Mutation/RAEB/AML-LBC; Missing.
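The per-cluster silhouette scoring and the high/medium/low stability categories described under "Computing the stability of the detected clusters" can be illustrated with a small Python sketch. It assumes that the silhouette is computed on the Pearson correlation distance (1 minus the correlation), which the text states for the clustering but not explicitly for the silhouette, and that the interval boundaries are handled as high >= 0.5, medium >= 0.4 and low otherwise. The function names and the toy samples are hypothetical.

```python
from math import sqrt


def pearson_distance(x, y):
    """Distance between two profiles, defined as 1 - Pearson correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / sqrt(var_x * var_y)


def cluster_silhouettes(samples, labels):
    """Return the average silhouette score per cluster label.

    Requires at least two clusters. Samples in singleton clusters get a
    silhouette of 0, following the usual convention.
    """
    members = {}
    for index, label in enumerate(labels):
        members.setdefault(label, []).append(index)

    totals = {label: 0.0 for label in members}
    for i, label in enumerate(labels):
        own = [j for j in members[label] if j != i]
        if not own:
            continue  # singleton cluster contributes 0
        # a: mean distance to the other samples of the same cluster
        a = sum(pearson_distance(samples[i], samples[j]) for j in own) / len(own)
        # b: smallest mean distance to the samples of any other cluster
        b = min(
            sum(pearson_distance(samples[i], samples[j]) for j in idxs) / len(idxs)
            for other, idxs in members.items()
            if other != label
        )
        denom = max(a, b)
        totals[label] += 0.0 if denom == 0 else (b - a) / denom
    return {label: totals[label] / len(members[label]) for label in members}


def stability_category(score):
    """Map an average silhouette score to a stability category
    (boundary handling is an assumption)."""
    if score >= 0.5:
        return "high"
    if score >= 0.4:
        return "medium"
    return "low"


# Toy example (hypothetical): two well-separated clusters of two samples each.
samples = [[1, 2, 3, 4], [1, 2, 3, 5], [4, 3, 2, 1], [5, 3, 2, 1]]
labels = ["A", "A", "B", "B"]
scores = cluster_silhouettes(samples, labels)
categories = {label: stability_category(s) for label, s in scores.items()}
print(scores, categories)
```

In the procedure described in the text, the scores of one cluster would additionally be averaged over all other hierarchical clusterings before the category is assigned; the sketch shows a single clustering only.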
