Computing the optimal hierarchical clustering
A major disadvantage of hierarchical clustering is the uncertainty of the derived
clusters; for example, using a different number of features may result in different
patient clusters. We assessed the uncertainty of each hierarchical clustering using
Pvclust [1], which builds on multi-scale bootstrap resampling. It computes a bootstrap
probability (BP) and an approximately unbiased (AU) P-value for each cluster. These
P-values indicate how strongly a cluster is supported by the data. Only clusters with
significance level < 0.05 are taken into consideration, which indicates that these
clusters do not only “seem to exist” but are stable when we perturb the number of
observations. In further analyses we used the AU P-value to assess uncertainty, as it
is a better approximation than the BP value according to the authors of Pvclust [1].
To select the hierarchical clustering that is best supported by the data, we used the
following procedure: i) The features of each data set are ranked by across-patient
standard deviation, and an increasing number of probesets is selected using 21
different cut-offs ([0,..,20%]). The selected feature sets for GEP and DMP are then
iteratively combined. ii) Each feature set (440 in total) is then used for hierarchical
cluster analysis with 1000 multi-scale bootstraps, using Ward’s linkage and Pearson
correlation distance. iii) An average silhouette score [2] (a measure of the relatedness
of samples within a cluster and the separation between clusters) is computed for each
significant cluster from Pvclust (Figure S2). iv) Subsequently, the hierarchical
clustering that is best supported by the data is selected.
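Step i) can be illustrated with a minimal sketch. It assumes each data set is stored as a list of probeset rows with one value per patient, and that the number of probesets kept at each cut-off is obtained by rounding; the function name and the rounding rule are illustrative assumptions, not the authors' implementation.

```python
from itertools import product
from statistics import pstdev


def top_features_by_sd(matrix, fraction):
    """Indices of the top `fraction` of probesets, ranked by across-patient SD.

    `matrix` is a list of probeset rows; each row holds one value per patient.
    Rounding the number of kept probesets is an assumption of this sketch.
    """
    n_keep = round(fraction * len(matrix))
    ranked = sorted(range(len(matrix)),
                    key=lambda i: pstdev(matrix[i]), reverse=True)
    return ranked[:n_keep]


# 21 cut-offs: 0%, 1%, ..., 20%
cutoffs = [c / 100 for c in range(21)]

# Every GEP x DMP pairing except the empty one (zero GEP and zero DMP probesets)
combinations = [(gep, dmp) for gep, dmp in product(cutoffs, cutoffs)
                if (gep, dmp) != (0.0, 0.0)]

# Hypothetical toy data: 4 probesets measured in 4 patients
toy = [[1, 1, 1, 1], [0, 10, 0, 10], [2, 3, 2, 3], [5, 5, 6, 5]]
selected = top_features_by_sd(toy, 0.5)
```

The length of `combinations` is 21 x 21 - 1 = 440, which matches the number of feature sets stated in the text.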
One would expect the clustering with the highest silhouette score among the significant
Pvclust clusters to preserve, to some extent, the clusters of currently known
abnormalities (CEBPAsilenced, CEBPAdm, inv(16), t(8;21) and t(15;17)). This is in line
with our findings: the clustering with the highest silhouette score for the significant
Pvclust clusters also showed a high silhouette score for the currently known clusters.
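As an illustration of the silhouette score used here, the following sketch computes the average silhouette width per cluster, following Rousseeuw's definition, with one minus the Pearson correlation as the distance. The function names and the toy profiles are hypothetical; the sketch assumes at least two clusters and no constant profiles.

```python
from math import sqrt


def pearson_distance(x, y):
    """1 - Pearson correlation between two equally long, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def silhouette_per_cluster(profiles, labels):
    """Average silhouette width for each cluster label (needs >= 2 clusters)."""
    n = len(profiles)
    dist = [[0.0 if i == j else pearson_distance(profiles[i], profiles[j])
             for j in range(n)] for i in range(n)]
    members = {}
    for sample, label in enumerate(labels):
        members.setdefault(label, []).append(sample)
    widths = {label: [] for label in members}
    for i, label in enumerate(labels):
        own = [j for j in members[label] if j != i]
        if not own:
            widths[label].append(0.0)  # singleton cluster: s(i) = 0 by convention
            continue
        # a: mean distance to the own cluster; b: mean distance to the nearest other cluster
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(sum(dist[i][j] for j in others) / len(others)
                for other, others in members.items() if other != label)
        widths[label].append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return {label: sum(w) / len(w) for label, w in widths.items()}


# Hypothetical toy profiles: two groups with opposite trends
profiles = [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 3, 2, 1], [8, 6, 4, 2.5]]
scores = silhouette_per_cluster(profiles, [0, 0, 1, 1])
```

Scores close to 1 indicate samples that are much closer to their own cluster than to any other cluster; scores near 0 or below indicate poorly separated clusters.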
Note that the GEP and DMP data are mean-centered and scaled to unit variance (z-score).
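A minimal sketch of this normalization for a single probeset is given below. It assumes the z-score is taken per probeset across patients and uses the population standard deviation; the source states neither choice, so both are assumptions.

```python
from statistics import mean, pstdev


def zscore(values):
    """Center values on their mean and scale them to unit variance."""
    m = mean(values)
    s = pstdev(values)  # population SD; a sample SD would be an equally valid choice
    return [(v - m) / s for v in values]


# Hypothetical measurements of one probeset in three patients
z = zscore([2.0, 4.0, 6.0])
```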
Computing the stability of the detected clusters
The stability of the 18 newly derived clusters is examined using all the derived
hierarchical clusterings described above. The cluster labels determined for each
hierarchical clustering are used to compute the average silhouette score for each of
the hierarchical clusterings (Figure S2). We hypothesized that stable clusters are
frequently seen among different hierarchical clusterings. For the optimal hierarchical
clustering, we found that ten out of eighteen clusters (# 1, 2, 3, 6, 8, 9, 11, 13, 16,
18) have high silhouette scores [0.5,..,1] based on all other hierarchical clusterings.
These include (cyto)genetic groups such as inv(16), t(15;17), CEBPAdm and
CEBPAsilenced. Five clusters (# 3, 11, 13, 14, 15) varied in silhouette scores
[0.4,..,0.5], and three clusters (# 4, 7 and 12) showed "low" silhouette scores
[0,..,0.4]. Based on these average silhouette scores we categorized the clusters into
high (n=10 clusters), medium (n=5 clusters) and low (n=3 clusters) stability
(illustrated with **, * and no asterisk, respectively).
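The categorization can be written as a small threshold rule. In this sketch the assignment of the boundary values 0.4 and 0.5 to the upper category is an assumption, since the stated intervals share their end points, and the example scores are hypothetical.

```python
# Markers as described in the text: ** for high, * for medium, none for low
MARKERS = {"high": "**", "medium": "*", "low": ""}


def stability_category(avg_silhouette):
    """Map an average silhouette score to a stability category.

    Boundary values (0.4 and 0.5) go to the higher category; this is an assumption.
    """
    if avg_silhouette >= 0.5:
        return "high"
    if avg_silhouette >= 0.4:
        return "medium"
    return "low"


# Hypothetical average silhouette scores for three clusters
example_scores = {"cluster A": 0.72, "cluster B": 0.45, "cluster C": 0.18}
labels = {name: stability_category(s) for name, s in example_scores.items()}
```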
References
1. Suzuki R, Shimodaira H. Pvclust: an R package for assessing the uncertainty in hierarchical clustering.
Bioinformatics. Jun 15 2006;22(12):1540-1542.
2. Rousseeuw PJ. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis.
Journal of Computational and Applied Mathematics. 1987;20:53–65.
Supplementary files:
Figure S1: Read depth frequency of the aligned loci
The horizontal axis represents the read depth measured for a locus and the vertical
axis its frequency. As an example, a read depth of 100 is seen 24517 times, whereas a
read depth of 1000 is seen 4 times (both are averages across all samples).
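The quantity plotted in Figure S1 can be sketched as a frequency count of read depths, averaged over samples. The input layout (one list of per-locus read depths per sample) and the function name are assumptions made for illustration.

```python
from collections import Counter


def depth_frequency(depths_per_sample):
    """Average number of loci per sample observed at each read depth."""
    totals = Counter()
    for depths in depths_per_sample:
        totals.update(depths)
    n_samples = len(depths_per_sample)
    return {depth: count / n_samples for depth, count in sorted(totals.items())}


# Hypothetical read depths for two samples
freq = depth_frequency([[100, 100, 1000], [100, 5]])
```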
Figure S2: Selection of the optimal hierarchical clustering
Twenty-one different cut-offs are chosen based on across-patient standard deviation for
GEP and separately for DMP. Gene expression and DNA-methylation probesets are
iteratively combined, giving 440 combinations: 21x21 minus 1 (the combination of zero
GEP and zero DMP probesets). The uncertainty of each hierarchical clustering is assessed
using the estimated bootstrap labels from Pvclust, and scored using the silhouette
scores. The hierarchical clustering with the highest average silhouette score is best
supported by the data and is indicated with a red rectangle.
Figure S3: Optimal hierarchical clustering for GEP and DMP separately
Heat maps representing pairwise correlations between the 344 AML cases based solely on
the gene expression or the DNA-methylation profiles. The optimal hierarchical clustering
for (A) GEP was based on 650 probesets and for (B) DMP on 682 probesets. The diagonal
illustrates in red the AML subtypes inv(16), t(8;21), t(15;17), CEBPA double mutants
(CEBPAdm) and CEBPAsilenced, the splice factor genes U2AF35, SF3B1 and SRSF2, and the
RAEB/AML-LBC cases.
Figure S4: Enriched pathways for cluster #11
Graphical representation of the top 10 enriched pathways/gene sets based on the
differentially expressed and DNA methylated genes for cluster #11 (detailed results
are depicted in Table S8).
Figure S5: Gene expression and DNA-methylation levels of annotated genes in
erythroid development
Heat map representing pairwise correlations between the 344 AML cases using the
integrated gene expression and DNA-methylation profiles. The splice factor mutants are
indicated with red bars. Gene expression data of the differentially regulated genes
involved in erythroid development are mean normalized and depicted on the diagonal,
with red bars for relatively high gene expression levels. Blue bars depict
hypomethylation levels.
Figure S6: DNA-methylation and gene expression levels for cluster #3 and the
enriched pathways
(A) Graphical representation of the top 10 enriched pathways/gene sets based on the
differentially expressed and differentially methylated genes for cluster #3 (detailed
results are depicted in Table S8). (B) Genes that are differentially expressed and
differentially methylated in patient samples from cluster #3 compared to all other AMLs
are indicated with differently colored dots. The colors depict the gene expression and
DNA-methylation status, i.e. the right upper corner represents genes that are
hypomethylated and overexpressed (green dots). Many of these genes encode proteins
involved in erythroid development or function (detailed results are depicted in Table
S8).
Figure S7: Relapse-free survival and Event-free survival for patients in clusters
#3 and #11
Kaplan-Meier survival curves and multivariate analysis for (A) relapse-free survival
(RFS) and (B) event-free survival (EFS) for cluster #3 vs. all patients except for
cluster #3 patients, cluster #11 vs. all patients except for cluster #11 patients, and
cluster #3 vs. cluster #11 patients. Multivariate analysis for cluster #3 for (C) RFS
and (D) EFS, and for cluster #11 for (E) RFS and (F) EFS. The covariates included in
the Cox proportional hazards model (reported as hazard ratios, HR) are: NPM1mut vs.
wild-type NPM1, FLT3ITD vs. no FLT3ITD, NRASmut/KRASmut vs. wild-type NRAS/KRAS, and
high cytogenetic risk vs. no high cytogenetic risk; age and white blood cell count
(WBC) are used as continuous variables.
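For readers unfamiliar with Kaplan-Meier curves, the estimator behind such curves can be sketched as follows. This is a generic product-limit implementation applied to hypothetical follow-up times, not the software or the data used for Figure S7.

```python
def kaplan_meier(times, events):
    """Product-limit estimate: [(time, survival)] at each distinct event time.

    times[i] is the follow-up time in months; events[i] is True when the event
    (e.g. relapse) was observed and False when the patient was censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Handle all patients sharing the same follow-up time together
        while i < len(data) and data[i][0] == t:
            n_events += 1 if data[i][1] else 0
            n_removed += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_removed
    return curve


# Hypothetical follow-up data for five patients
curve = kaplan_meier([6, 12, 12, 24, 36], [True, True, False, True, False])
```

Censored patients leave the risk set without lowering the curve, which is why the number at risk is tracked separately from the number of events.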
Figure S8: Gene mutations in patients from cluster #3, cluster #11 and splice
factor mutations outside these clusters
Columns represent patients from cluster #3, cluster #11 and splice factor mutants
outside these clusters. The rows indicate (in red) mutations in SRSF2, U2AF35, SF3B1,
NRAS/KRAS, ASXL1, NPM1mutant, FLT3ITD, FLT3TKD, DNMT3A, IDH1 and IDH2. Wild-type
genotypes are indicated in white and missing values in blue. The bottom row indicates
the RAEB/AML-LBC status in red and missing values in blue.
Figure S1 (plot not reproduced). Horizontal axis: read depth of aligned loci (0 to
2000). Vertical axis: frequency (0 to 1000). Legend: samples 2224, 2228, 2246, 2259,
2278, 3318, 3330, 4340, 5290, 5363, 6373, 6448, 7303 and 7309, each listed together
with a corresponding "T cells" entry.
Figure S2 (plot not reproduced). Panels A and B. Axes: GEP probesets and DMP probesets.
Color scale: average silhouette score (0 to 0.4), based on Pvclust labels. Annotation
in the plot: GEP: 2168 probesets, DMP: 2045 probesets.

Number of probesets per cut-off:

Cut-off   GEP probesets   DMP probesets
0%        0               0
1%        217             227
2%        434             455
3%        650             682
4%        867             909
5%        1084            1136
6%        1301            1364
7%        1517            1591
8%        1734            1818
9%        1951            2045
10%       2168            2273
11%       2385            2500
12%       2601            2727
13%       2818            2954
14%       3035            3182
15%       3252            3409
16%       3468            3636
17%       3685            3863
18%       3902            4091
19%       4119            4318
20%       4336            4545
Figure S3 (heat maps not reproduced). Panel A) GEP: 650 probesets. Panel B) DMP: 682
probesets. Both axes: 344 AML patients. Color scale: -1 to 1. Groups annotated on the
diagonal: Inv(16), t(8;21), t(15;17), CEBPAdm, CEBPAsilenced, U2AF35, SRSF2, SF3B1 and
RAEB/AML-LBC (19).
Figure S4 (bar chart not reproduced). Axis: -log10(P-value), 0 to 20. Sources: IPA
pathways, BioCarta gene sets, KEGG. Terms shown: P53 Signaling; Oocyte Meiosis; Cell
Cycle; Porphyrin And Chlorophyll Metabolism; Alpha-Hemoglobin Stabilizing Protein
Pathway (AHSP); G2; Cell cycle; Metabolism of porphyrin; Synthesis of heme; Synthesis
of porphyrin; Condensation of chromosomes; Keratitis; Segregation of chromosomes;
Morphology of red blood cells; Porphyria; Alignment of chromosomes; Morphology of
spindle fibers; Morphology of mitotic spindle; Bulimia nervosa; Abnormal morphology of
red blood cells; Spherocytosis; M phase; Carcinoma.
Figure S5 (heat map not reproduced). Both axes: 344 AML patients. Color scale: -1 to 1.
Clusters cl1 to cl18 are marked, with Cluster #3 and Cluster #11 labeled. Annotation
rows: SRSF2, U2AF35, SF3B1.
Figure S6 (plots not reproduced).
Panel A: bar chart. Axis: -log10(P-value), 2 to 12. Sources: IPA pathways, KEGG,
Reactome. Terms shown: cell death of immune cells; transcription; function of blood
cells; function of leukocytes; cell death of blood cells; expression of RNA;
transcription of RNA; cell death; cell death of tumor cell lines; proliferation of B
lymphocytes; Viral Infection; proliferation of cells; differentiation of leukocytes;
quantity of leukocytes; apoptosis; quantity of lymphocytes; inflammatory response;
quantity of mononuclear leukocytes; quantity of blood cells; antibody response;
proliferation of immune cells; apoptosis of leukocytes; leukocyte migration;
differentiation of blood cells; development of blood cells; Spliceosome; Aminoacyl
Trna Biosynthesis; Lysosome.
Panel B: scatter plot. Horizontal axis: gene expression difference at associated genes
(t-statistics), from -20 (downregulation) to 20 (upregulation). Vertical axis: DNA
methylation difference at associated genes (t-statistics), from -8 to 8, with
hypermethylation and hypomethylation marked. The gene labels in the plot are not
reproduced here; detailed results are depicted in Table S8.
Figure S7 (plots not reproduced; the recoverable content follows).

Kaplan-Meier panels. Vertical axis: cumulative percentage (0 to 100). Horizontal axis:
months (0 to 60). P-values printed in the panels: 0.944, 0.136, 0.014, 0.239, 0.638 and
0.016.

Group sizes as printed in the panels (columns N and F):

Group                                RFS N   RFS F   EFS N   EFS F
Cluster3                             17      12      25      20
Cluster11                            16      7       19      12
All patients except for cluster 11   85      41      106     62
All patients except for cluster 3    84      36      100     54

Multivariate analysis, Relapse Free Survival, cluster #3:

Variables                       HR      Std.Err   z       P>|z|    95% CI low   95% CI high
Cluster 3                       2.181   0.848     2       0.045*   1.017        4.675
Age                             1.000   0.011     -0.01   0.993    0.978        1.022
Log(WBC)                        1.320   0.412     0.89    0.374    0.716        2.434
High (cytogenetic) risk group   1.670   0.861     0.99    0.32     0.608        4.589
FLT3ITD                         0.827   0.507     -0.31   0.756    0.248        2.752
NPM1+                           0.418   0.315     -1.16   0.247    0.096        1.828
N-RAS/K-RAS                     1.038   0.399     0.1     0.923    0.489        2.203

Multivariate analysis, Relapse Free Survival, cluster #11:

Variables                       HR      Std.Err   z       P>|z|    95% CI low   95% CI high
Cluster 11                      1.167   0.703     0.26    0.798    0.358        3.800
Age                             1.005   0.011     0.48    0.634    0.984        1.027
Log(WBC)                        1.492   0.495     1.21    0.227    0.779        2.858
High (cytogenetic) risk group   1.516   0.859     0.73    0.463    0.499        4.605
FLT3ITD                         0.770   0.488     -0.41   0.679    0.222        2.664
NPM1+                           0.496   0.399     -0.87   0.383    0.102        2.397
N-RAS/K-RAS                     1.169   0.447     0.41    0.683    0.552        2.476

Multivariate analysis, Event Free Survival, cluster #3:

Variables                       HR      Std.Err   z       P>|z|    95% CI low   95% CI high
Cluster 3                       1.640   0.494     1.64    0.1      0.909        2.960
Age                             1.019   0.009     2.13    0.033*   1.001        1.036
Log(WBC)                        1.133   0.288     0.49    0.622    0.689        1.866
High (cytogenetic) risk group   1.940   0.820     1.57    0.117    0.848        4.440
FLT3ITD                         2.113   0.774     2.04    0.041    1.031        4.332
NPM1+                           0.344   0.209     -1.75   0.08     0.104        1.134
N-RAS/K-RAS                     1.467   0.460     1.22    0.222    0.793        2.713

Multivariate analysis, Event Free Survival, cluster #11:

Variables                       HR      Std.Err   z       P>|z|    95% CI low   95% CI high
Cluster 11                      1.436   0.622     0.83    0.404    0.614        3.355
Age                             1.021   0.009     2.47    0.013*   1.004        1.039
Log(WBC)                        1.293   0.352     0.94    0.345    0.758        2.206
High (cytogenetic) risk group   1.661   0.760     1.11    0.268    0.677        4.074
FLT3ITD                         1.939   0.738     1.74    0.082    0.920        4.087
NPM1+                           0.303   0.200     -1.81   0.07     0.083        1.103
N-RAS/K-RAS                     1.606   0.502     1.52    0.13     0.870        2.963
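The columns of the multivariate analysis in Figure S7 are related by standard Wald arithmetic. The sketch below assumes that the Std.Err column is on the hazard-ratio scale (so that the standard error of log(HR) is Std.Err divided by HR) and that the confidence interval is built on the log scale; these are assumptions that happen to agree with the printed values, not a statement about the software the authors used. It is applied to the Cluster 3 row of the relapse-free survival model (Hazard Ratio 2.181, Std.Err 0.848).

```python
from math import erfc, exp, log, sqrt


def wald_summary(hr, se_hr, z_crit=1.959964):
    """z statistic, two-sided P-value and 95% CI for a hazard ratio.

    Assumes `se_hr` is the standard error on the HR scale (delta method),
    so the standard error of log(HR) is se_hr / hr.
    """
    se_log = se_hr / hr
    z = log(hr) / se_log
    p = erfc(abs(z) / sqrt(2))  # two-sided normal tail probability
    low = exp(log(hr) - z_crit * se_log)
    high = exp(log(hr) + z_crit * se_log)
    return z, p, low, high


# Cluster 3 row of the relapse-free survival model
z, p, low, high = wald_summary(2.181, 0.848)
```

Within rounding of the inputs, the result agrees with the printed row (z of 2, P of 0.045, interval from 1.017 to 4.675).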
Figure S8 (mutation matrix not reproduced). Rows: SRSF2, U2AF35, SF3B1, NRAS/KRAS,
ASXL1, NPM1, FLT3ITD, FLT3TKD, DNMT3A, IDH1, IDH2, RAEB, AML-LBC. Column groups: Cluster
3, Cluster 11, Other clusters. Legend: Mutation/ RAEB/ AML-LBC; Missing. Patient IDs
shown on the columns: 5354, 6247, 7117, 7118, 7137, 7167, 7183, 7325, 2283, 5349, 7311,
2228, 3323, 2259, 2278, 2279, 3330, 5290, 5363, 6454, 7116, 7177, 7309, 7312, 7172,
3481, 7419, 5287, 6450, 6453, 2256, 3489, 7317, 2224, 2246, 4340, 6359, 6373, 6374,
6378, 6448, 7303, 3318, 7071, 2216, 2229, 5364, 6456, 6462, 7147, 7161, 7176, 7313,
7326, 2187, 7169, 2757, 7084.