Supplementary Information:
Statistical Methods:
Normalization of Microarray data: Feature extraction and normalization of the raw data were
performed using the Agilent G2567AA Feature Extraction software (Agilent Technologies). Data
from each array were normalized by linear and Lowess normalization and by the spatial
detrending protocol, as described [http://www.chem.agilent.com/temp/rad7DFA2/00047887.pdf (p.215-217)].
For data filtering, we analyzed the dye-reversal experiments using Spotfire®
DecisionSite 8.0 (Somerville, MA). A total of 307 slides were compared for correlations between
dye-reversal pairs, and for each sample only those genes with a positive correlation were used for
further analysis. The expression values and their error estimates were evaluated by taking the
mean and variance over replicate array experiments. This gave two matrices, a normalized
expression matrix of dimension 19061x135 and the complementing matrix of standard deviations.
For further analysis we selected genes that had values in at least 73 (50%) of the samples and
showed more than 2-fold variability across at least 5 samples, which resulted in a matrix of 5257
genes. Missing values in this matrix were not imputed.
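As an illustration only, the gene filter described above can be sketched in Python. The sketch assumes that the expression values are log2 ratios, that missing values are stored as None, and that "more than 2-fold variability" means an absolute log2 ratio greater than 1; the function name `filter_genes` and its parameters are hypothetical and are not taken from the original analysis.

```python
import math

def filter_genes(matrix, min_present=73, fold=2.0, min_variable=5):
    """Return the row indices (genes) that pass both filters.

    matrix: list of rows, one per gene; each row holds log2 ratios,
    with None marking a missing value (assumed data layout).
    """
    threshold = math.log2(fold)  # 2-fold corresponds to |log2 ratio| > 1
    kept = []
    for i, row in enumerate(matrix):
        present = [v for v in row if v is not None]
        variable = [v for v in present if abs(v) > threshold]
        # Keep the gene only if it is measured often enough AND varies enough.
        if len(present) >= min_present and len(variable) >= min_variable:
            kept.append(i)
    return kept
```

With the defaults, a gene is kept only if it has values in at least 73 samples and exceeds the fold threshold in at least 5 of them; missing values are left in place rather than imputed.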
Survival analysis: The aim was to derive a prognostic classifier from our data and to test it in
external independent data sets. We divided this task into three steps. First, we used internal
validation to show that prognostic classifiers can be found that are predictive in a test set.
Second, having shown that this was possible, we derived an optimized classifier using the entire
data set. We separated the derivation of an optimized classifier from internal validation for two
reasons. First, the optimal classifier depends strongly on the training set used, which raises the
problem of how to combine the optimal classifiers derived for each choice of training set (multiple
choices of training set are necessary to avoid the bias associated with using a single training/test
partition). It may be possible to combine them into an overall classifier by averaging the gene
rankings produced by each training set, but we found that this average ranking was effectively
identical to the ranking obtained using the whole data set as a training set. Second, since ultimately the
validity of a classifier rests on its performance on external data sets, we might as well use the
entire internal data set to derive such a classifier. After having derived an optimized classifier
from our cohort, the third and final step consists of testing it in completely independent external
studies.
The general scheme described above was implemented using a semi-supervised method, called
“Cox-clustering” (Bair & Tibshirani, 2004). This method uses unsupervised clustering over genes
selected in a supervised fashion (Cox proportional hazards regression) and allows unbiased
identification of subgroups of patients that differ in outcome.
These analyses were carried out on the matrix of 5257 genes. Each gene was first standardized
to have zero mean and unit standard deviation across all samples. The association of each gene
with survival was tested using a univariate Cox proportional hazards regression model. The
proportional hazards assumption was tested for each gene in the study and we confirmed that for
all genes with significant Cox-scores the proportional hazards assumption was justified.
Correction for multiple comparisons was performed by converting the p values into q values,
which estimate the proportion of genes called significant that turn out to be false leads (the false
discovery rate, FDR) (Storey & Tibshirani, 2003). Using the q statistic, the top 200 Cox-ranked
genes were bounded by an FDR of 30% (in other words, on average 60 of the 200 may be ‘false leads’).
Because genes had variable numbers of missing values, we also estimated the FDR
using a Monte-Carlo simulation that randomized the gene expression values across samples.
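The per-gene standardization, Cox ranking and multiple-testing correction described above can be sketched as follows. This is a minimal illustration and not the R code used for the analysis: `standardize`, `cox_score` and `q_values` are hypothetical helper names, missing values are assumed to be stored as None, the association with survival is summarised by the Cox score-test statistic at beta = 0 rather than by a full partial-likelihood fit, and the q-values use the conservative choice pi0 = 1 (which reduces to Benjamini-Hochberg adjusted p values) instead of an estimated pi0.

```python
import math

def standardize(values):
    """Z-score one gene across samples, ignoring missing values (None)."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    var = sum((v - mean) ** 2 for v in present) / (len(present) - 1)
    sd = math.sqrt(var)
    return [None if v is None else (v - mean) / sd for v in values]

def cox_score(x, time, event):
    """Score-test z statistic of a univariate Cox model at beta = 0.

    x: covariate per sample (None = missing); time: follow-up times;
    event: 1 if the event was observed, 0 if censored.
    Tied event times are handled with the Breslow convention.
    """
    data = [(xi, t, e) for xi, t, e in zip(x, time, event) if xi is not None]
    u = 0.0  # score: sum over events of (x - risk-set mean)
    v = 0.0  # information: sum over events of the risk-set variance
    for xi, ti, ei in data:
        if not ei:
            continue
        risk = [xj for xj, tj, _ in data if tj >= ti]
        m = sum(risk) / len(risk)
        u += xi - m
        v += sum((xj - m) ** 2 for xj in risk) / len(risk)
    return u / math.sqrt(v) if v > 0 else 0.0

def q_values(p_values, pi0=1.0):
    """Step-up q-values; with pi0 = 1 these are BH-adjusted p values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p value down
        i = order[rank - 1]
        running_min = min(running_min, pi0 * p_values[i] * m / rank)
        q[i] = running_min
    return q
```

A positive score indicates that higher expression goes with earlier events. Selecting n genes at an FDR bound f implies about n × f expected false leads, which is the 200 × 0.30 = 60 quoted in the text.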
Step 1 (internal validation to validate the sample assignment process and to test for over-fitting):
The data were divided into a training set and a test set. For a given choice of training set we ranked all
the genes using univariate Cox-regression, as before, but now using only the samples in the
training set. Classifiers were then built by sequentially adding more genes from the ranked list,
starting at 30 genes and increasing up to 200. Owing to the significant number of missing values,
gene sets with at least 30 members were needed in order to apply the subsequent
clustering algorithm. The optimal number of genes was determined using leave-one-out
cross-validation (LOOCV) on the training set, i.e., we applied robust k-means (k=2) clustering (3) to all the
training samples except one to learn the two clusters. The two clusters were then categorized as
good or bad prognosis depending on the hazard ratio. The left-out sample was then assigned to
good or bad prognosis using the nearest centroid classifier method. Repeating this over all folds of the
LOOCV gave a prognostic prediction for every sample in the training set, each acting once as the test
sample. We then performed Cox-regression of survival against this dichotomization of the samples, and the
performance of the classifier was given by the log-rank test p-value. The classifier with
the most significant p-value was chosen (in almost all cases this corresponded to the highest hazard ratio).
Having derived a classifier from the training set using LOOCV we next reevaluated this classifier
on the whole training set to learn the two clusters (and centroids) and to check the association of
the clusters with survival. Test samples were then assigned to good or bad prognosis using the
nearest centroid classifier. We next repeated all this analysis for different choices of training and
test sets, ensuring that the test sets were mutually disjoint. This guaranteed that each sample
would act as a test sample once. The prognostic class assignment of all the samples was then
tested for association with survival using Cox-regression. If the associated log-rank test p-value
was less than 0.05 we were then able to conclude that there was no significant over-fitting in the
learning procedure. The whole procedure described above was carried out for three different
training/test partition sizes: we used test sets of 1 (leave-one-out), 5 and 9 samples.
Since in the latter two cases there are many different ways of choosing 27 (135/5) and 15
(135/9) disjoint test sets among the 135 samples, we repeated the analysis for these cases a total
of 10 times to ensure that there was no bias arising from the particular way the disjoint test sets were defined.
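The construction of mutually disjoint test sets can be sketched as below. This is a hedged illustration: the function names and the use of a seeded shuffle are assumptions, and the original analysis may have drawn its partitions differently.

```python
import random

def disjoint_test_sets(n_samples, test_size, seed=0):
    """Split sample indices into mutually disjoint test sets of equal size."""
    if n_samples % test_size != 0:
        raise ValueError("test_size must divide n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # a different seed gives a different partition
    return [indices[i:i + test_size] for i in range(0, n_samples, test_size)]

def train_test_splits(n_samples, test_size, seed=0):
    """Yield (train, test) index lists; every sample is a test sample exactly once."""
    for test in disjoint_test_sets(n_samples, test_size, seed):
        held_out = set(test)
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, test
```

For 135 samples, test sets of size 5 and 9 give 27 and 15 partitions respectively, and running the procedure with 10 different seeds corresponds to the 10 repetitions mentioned above.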
Step 2 (deriving the optimal classifier using the entire data set):
The encouraging results in Step 1 suggested that the same procedure could be applied to
the entire data set, followed by the use of external independent data sets as test sets. Thus, we
applied the LOOCV methodology as described before, but now to the entire data set. While this
gives us a way of defining an optimal classifier, our experience with the methodology suggests
that it does not necessarily outperform one derived without LOOCV. Since the ultimate validation
of a classifier comes from the external tests, we decided to derive our optimal classifier without the
LOOCV-step. Having Cox-ranked the genes using the entire data set, classifiers were then built
by sequentially adding genes from the ranked list, starting at 30 genes and increasing up to 200.
The relevance of the classifiers for survival was tested by first clustering all the samples into two
groups using robust k-means (k=2) and then performing a Cox-regression to determine whether
the subtypes defined by the k-means clustering were indeed associated with survival (p < 0.05).
This procedure yielded a range of classifiers whose 95% confidence intervals for the hazard ratio mutually overlapped.
We declared the optimal classifier to be the one maximising the hazard ratio.
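The two-group clustering and the nearest centroid assignment used in Steps 1 and 2 can be sketched as follows. This is only an illustration under stated assumptions: it uses plain Lloyd k-means with a deterministic initialisation rather than the robust k-means variant of the original analysis, it assumes complete data (no missing values), and all function names are hypothetical. Deciding which cluster is the good or poor prognosis group requires the survival data and is not shown.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def nearest_centroid(x, centroids):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda k: sq_dist(x, centroids[k]))

def two_means(samples, max_iter=100):
    """Plain k-means with k = 2; returns (labels, centroids)."""
    c0 = samples[0]
    c1 = max(samples, key=lambda s: sq_dist(s, c0))  # farthest sample from c0
    centroids = [list(c0), list(c1)]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest_centroid(s, centroids) for s in samples]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in (0, 1):
            members = [s for s, lab in zip(samples, labels) if lab == k]
            if members:  # keep the old centroid if a cluster empties
                centroids[k] = centroid(members)
    return labels, centroids
```

In a leave-one-out fold, `two_means` would be run on all samples but one and the held-out sample passed to `nearest_centroid` with the learned centroids.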
Step 3 (external validation of classifier):
Having derived an optimized classifier we next attempted to validate it in two independent
external data sets (van de Vijver et al., 2002; Wang et al., 2005). Given the classifier set of
genes and an external data set we first excluded those genes not present on the external
platform. Genes were normalized to zero mean and unit variance (z-score transformation). Next,
for each of the samples in the external cohort with gene expression vector x, we computed a
continuous prognostic index, PI(x), as PI(x) = cor(x, c2) - cor(x, c1), where cor is the usual (centered)
Pearson correlation, and c2 and c1 denote the centroids of the poor and good outcome clusters,
respectively, from the optimal classifier learned in Step 2.
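The prognostic index can be sketched in a few lines. The sketch assumes that x and both centroids are complete vectors over the same genes, in the same order, after z-score transformation; the function names are hypothetical. The sign rule in `classify` follows the nearest centroid rule used for the hazard ratio (poor prognosis when the index is positive); the text does not specify the case PI(x) = 0, which is assigned to the good group here as an arbitrary choice.

```python
import math

def pearson(a, b):
    """Centered Pearson correlation between two equal-length vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def prognostic_index(x, c_good, c_poor):
    """PI(x) = cor(x, c2) - cor(x, c1), with c2 = poor and c1 = good centroid."""
    return pearson(x, c_poor) - pearson(x, c_good)

def classify(x, c_good, c_poor):
    """Poor prognosis if PI(x) > 0, otherwise good (PI = 0 is an arbitrary tie-break)."""
    return "poor" if prognostic_index(x, c_good, c_poor) > 0 else "good"
```

Because the index is a difference of two correlations, it always lies between -2 and 2.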
To evaluate the classifier we used two different measures of prognostic separation. For one of
them, the hazard ratio (10), we first assigned each external sample to poor or good prognosis
using the nearest centroid classification rule: poor
prognosis if PI(x) > 0 and good prognosis if PI(x) < 0. For the second measure, the D-index
(Royston & Sauerbrei, 2004), we ranked the continuous PI(x) values in increasing order. The
resulting risk ordering was then tested for association with survival by a Cox-regression against
the reordered scaled rankits, the estimated regression coefficient defining the D-index.
Remarks: (i) the D-index does not require the model to be recalibrated since it is determined
entirely by the relative risk ordering of the samples; (ii) significance of the D-index was tested
using a normality assumption as well as by random permutation of time labels.
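The covariate used for the D-index can be illustrated as follows. The sketch assumes that the rankits are computed with Blom's approximation to the expected normal order statistics and scaled by the constant kappa = sqrt(8/pi); ties in PI(x) are not averaged, and the Cox regression on the resulting values, whose coefficient gives the D-index, is not shown. The function name is hypothetical.

```python
import math
from statistics import NormalDist

KAPPA = math.sqrt(8 / math.pi)  # assumed scaling constant for the rankits

def scaled_rankits(prognostic_index):
    """Map each sample's PI value to its scaled rankit (same order as the input)."""
    n = len(prognostic_index)
    order = sorted(range(n), key=lambda i: prognostic_index[i])
    z = [0.0] * n
    normal = NormalDist()
    for rank, i in enumerate(order, start=1):
        # Blom's approximation to the expected standard normal order statistic
        z[i] = normal.inv_cdf((rank - 0.375) / (n + 0.25)) / KAPPA
    return z
```

Only the ordering of the PI(x) values enters the result, which is why the D-index does not require recalibration of the model.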
Software packages used: Univariate and multivariate Cox regressions, Kaplan-Meier analysis
and hazard ratio computations were carried out using the survival R-package version 2.16
(www.cran.r-project.org). Gene Ontology (GO) analysis was performed using the EASE software package
(http://david.niaid.nih.gov/david/ease.htm) and the Gene Ontology Tree Machine
(http://genereg.ornl.gov/gotm). Survival estimation with the Adjuvant! software was performed
using the online version 7.0 (https://www.adjuvantonline.com/online.jsp).
References
Bair E & Tibshirani R. (2004). Semi-supervised methods to predict patient survival from gene
expression data. PLoS Biology 2: 503-11.
Deutsch JM. (2003). Evolutionary algorithms for finding optimal gene sets in microarray
prediction. Bioinformatics 19: 45-52.
Ein-Dor L, Kela I, Getz G, Givol D & Eytan D. (2005). Outcome signature genes in breast cancer:
is there a unique set? Bioinformatics 21: 171-8.
Goldberg DE. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison
Wesley: Reading, MA.
Ooi CH & Tan P. (2003). Genetic algorithms applied to multi-class prediction for the analysis of
gene expression data. Bioinformatics 19: 37-44.
Royston P & Sauerbrei W. (2004). A new measure of prognostic separation in survival data.
Stat Med. 23: 723-48.
Storey JD & Tibshirani R. (2003). Statistical significance for genomewide studies. Proc Natl Acad
Sci U S A 100: 9440-5.
van de Vijver MJ, He YD, van 't Veer L, Dai H, Hart AAM, Voskuil DW, et al. (2002). A
gene-expression signature as a predictor of survival in breast cancer. N Engl J Med. 347:
1999-2009.
Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, et al. (2005). Gene-expression
profiles to predict distant metastasis of lymph-node-negative primary breast cancer.
Lancet 365: 671-9.
Supplementary Tables and Figures:
Clinical Parameter      GOOD (95)     POOR (40)
ER positive             75            18
ER negative             19            21
High Grade (3)          21            28
Low Grade (1 & 2)       73            12
Node positive           28            16
Node negative           66            24
Adjuvant-Poor           7             5
Adjuvant-Good           88            35
Menopause (Yes)         63            26
Menopause (No)          31            13
Size (>2cm)             26            15
Size (≤2cm)             68            25
NPI (>3.4)              41            34
NPI (≤3.4)              53            6
Age                     56.5 (±9)     57.8 (±8)
Supplementary Table S1. The distribution of clinical parameters over the good/bad
prognostic samples. ER: estrogen receptor. NPI: Nottingham prognostic index.
Cell Function             Genes                                          *p value
Mitosis and Cell Cycle    KNTC2, SPAG5, ASPM,                            0.0005
                          BUB1, PKMYT1, PTTG1
Extracellular Matrix      DCN, FBLN1, GAS6, AMH, CART,                   0.009
                          LOXL1, OMD, ANGPTL4, PDGFC,
                          SMOC2, C1R, C1S, SPARCL1, CILP
Transcription factor      CEBPD, ZMYND11, UHRF1, MLLT1,                  0.006
                          MYBL2, SMARCA4, BTF3, TIF1, PTTG1
Calcium ion binding       CDH11, FBLN1, GAS6, ANXA5, SMOC2,              0.008
                          TNNC2, C1R, C1S, SPARCL1, KIAA0703
Supplementary Table S2. Cell functions significantly associated with the prognostic
genes. Selected genes associated with each cell function and the p values of significance are given.
* Corrected for multiple testing.
                      Cox-Ranked
Clinical Features     HR        p value     (95% CI)
Pre-menopausal        4.21      0.0047      (1.13-15.72)
Postmenopausal        6.78      1e-06       (2.55-18.05)
ER negative           4.41      0.0119      (1.54-12.62)
ER positive           6.93      2e-07       (2.00-24.05)
Node negative         10.31     3e-08       (3.26-32.61)
Node positive         2.77      0.0274      (0.97-7.91)
Supplementary Table S3. Stratified Hazard ratios of Cox-ranked signature with different
clinical variables. ER: estrogen receptor. HR: Hazard Ratio. CI: confidence interval.
Signature            Naderi et al       Vijver et al        Wang et al
Cox-ranked (70g)     5.8 (p=9e-9)       3.98 (p=2e-9)       1.76 (p=0.005)
Wang 76g             2.2 (p=0.02)       2.18 (p=0.001)      2.19 (p=6e-5)
Veer 70g             1.02 (p=0.94)      11.4 (p=8e-10)      1.6 (p=0.03)
Supplementary Table S4. Performance of gene-sets across different data sets.
Hazard Ratios and p values for each signature are given.
Genes        Naderi et al    Vijver et al    Wang et al
EBP          1.95            1.69            1.26
EXO1         1.82            1.84            1.26
TIMELESS     1.81            1.81            1.3
CTPS         1.66            1.58            1.2
SMARCA4      1.89            1.33            0.84
PTTG1        1.75            1.81            1.38
PSMD2        1.69            1.71            1.36
TIF1         1.87            1.34            0.79
MYBL2        1.72            1.72            1.36
BUB1         1.74            1.84            1.47
DNMT3B       1.74            1.57            1.31
FANCA        1.59            1.69            1.21
MBP          1.54            0.86            0.77
ZWINT        1.65            1.83            1.37
BM039        2.01            1.78            1.36
PKMYT1       1.66            1.94            1.31
FLJ10292     1.46            1.35            1.54
SQLE         1.54            1.57            1.28
RAD54L       1.67            1.82            1.33
EIF4EBP1     1.56            1.41            1.27
FLJ10706     1.53            1.36            1.25
RAB22A       1.63            1.31            1.41
CDC2         1.59            1.57            1.44
APPBP1       1.66            1.34            1.26
PSMD7        1.6             1.88            1.24
DTYMK        1.58            1.62            1.23
SHMT2        1.9             1.84            1.34
HSPC171      1.91            1.44            1.24
MAD2L1       1.81            1.69            1.46
Supplementary Table S5. Overlapping prognostic genes across three independent data
sets. Exponentials of the Cox coefficients for the 29 common genes identified through Cox
analysis across the three studies. All Cox values have p < 0.05 in the three studies. Values > 1
mean that the gene is overexpressed in poor outcome samples relative to good outcome, and
values < 1 mean that the gene is relatively underexpressed in poor outcome samples.
Clinico-Pathological Features        All Patients (n = 135)
Age (years)
    Mean (SD)                        57 (9)
    < 45                             15 (11%)
    45-54                            36 (27%)
    55-64                            56 (41%)
    ≥ 65                             28 (21%)
Menopausal status
    Premenopausal                    44 (33%)
    Post-menopausal                  90 (67%)
    *Not Known                       1
T stage
    T in situ                        1
    T1                               93 (69%)
    T2                               41 (31%)
Lymph Node stage
    N0                               86 (67%)
    N1/N2                            43 (33%)
    *Not Known                       6
Grade
    1                                35 (26%)
    2                                50 (37%)
    3                                49 (37%)
ER status
    Positive                         93 (70%)
    Negative                         40 (30%)
    *Not Known                       2
Tamoxifen Therapy (ER+ cases)
    Yes                              38 (40%)
    No                               55 (60%)
Chemotherapy (CMF)
    Yes                              6 (4%)
    No                               129 (96%)
NPI score
    ≤ 3.4                            59 (44%)
    > 3.4                            75 (56%)
Supplementary Table S6. Summary of clinical and pathological features for the cohort.
ER: Estrogen Receptor. NPI: Nottingham Prognostic Index (NPI = 0.2 × size (cm) + grade + stage).
* Percentage is given for the cases with known status.
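As a small worked example, the NPI formula in the caption can be written as a function. The sketch assumes that grade and stage are integer scores (stage being the lymph node stage score) and that tumour size is in centimetres; the function names are hypothetical, and the 3.4 cut-off is the one used in the tables above.

```python
def npi(size_cm, grade, stage):
    """Nottingham Prognostic Index: 0.2 x size (cm) + grade + stage."""
    return 0.2 * size_cm + grade + stage

def npi_group(score, cutoff=3.4):
    """Assign a score to the low (<= cutoff) or high (> cutoff) NPI group."""
    return "NPI <= 3.4" if score <= cutoff else "NPI > 3.4"
```

For instance, a 3 cm, grade 3 tumour with a stage score of 2 gives 0.2 × 3 + 3 + 2 = 5.6, which falls in the NPI > 3.4 group.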
Supplementary Figure S1. Expected number of false positives vs. number of significant
tests.
FDR: False Discovery Rate
MC: Monte-Carlo simulation
q-value: Bayesian FDR
Supplementary Figure S2. Cox-Ranked prognostic genes. Variation of log-rank test p-value
(A) and Hazard Ratio (B) as a function of the number of Cox-ranked genes present in the
clustering set. Error Bars give the 95% confidence intervals. Dashed line corresponds to p= 0.05
in panel A and HR= 1 in panel B.