Supplementary Information manuscript van 't Veer et al.


Nature manuscript F08651


Method of unsupervised clustering

In our two-dimensional cluster analysis, gene clustering and experiment (tumour) clustering are performed independently, with no interference between the two dimensions, using an agglomerative hierarchical clustering algorithm [J. Hartigan, Clustering Algorithms (John Wiley & Sons, New York, 1975)]. For gene (or experiment) clustering, the distance metric (also known as the dissimilarity measure) between a pair of genes (or a pair of experiments) i and j is defined as:

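$$D_{ij} = 1 - \rho_{ij}, \tag{1}$$

where $\rho_{ij}$ is the error-weighted correlation coefficient between two genes (or two experiments) i and j across all experiments (or all genes) k = 1, …, N:

$$\rho_{ij} = \frac{\sum_{k=1}^{N} \frac{r_{ik}}{\sigma_{ik}} \, \frac{r_{jk}}{\sigma_{jk}}}{\sqrt{\sum_{k=1}^{N} \left(\frac{r_{ik}}{\sigma_{ik}}\right)^{2} \, \sum_{k=1}^{N} \left(\frac{r_{jk}}{\sigma_{jk}}\right)^{2}}}. \tag{2}$$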

In this equation, $r_{ik}$ is the logarithmic transcriptional expression level, measured relative to a baseline condition, with an error $\sigma_{ik}$ for gene (or under experiment) i under experiment (or for gene) k. The summation runs over all experiments k = 1, …, $N_{exps}$ (or all genes k = 1, …, $N_{gene}$).
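The distance in equations (1) and (2) can be illustrated with a minimal Python sketch. It assumes that equation (2) is the uncentred correlation of the error-scaled values r/σ (a sum of products normalised by the root of the two sums of squares). The function names and the toy numbers are hypothetical and are not taken from the manuscript.

```python
import math

def error_weighted_correlation(r_i, sigma_i, r_j, sigma_j):
    """Equation (2), as read here: correlation of the error-scaled log ratios."""
    x = [r / s for r, s in zip(r_i, sigma_i)]
    y = [r / s for r, s in zip(r_j, sigma_j)]
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return numerator / denominator

def distance(r_i, sigma_i, r_j, sigma_j):
    """Equation (1): D_ij = 1 - rho_ij."""
    return 1.0 - error_weighted_correlation(r_i, sigma_i, r_j, sigma_j)

# Hypothetical log ratios and errors for two genes over four experiments.
gene_a = [0.5, -0.2, 0.8, 0.1]
gene_b = [1.0, -0.4, 1.6, 0.2]   # proportional to gene_a
errors = [0.1, 0.1, 0.2, 0.1]

# Proportional profiles give a distance close to 0; opposite profiles close to 2.
print(distance(gene_a, errors, gene_b, errors))
print(distance(gene_a, errors, [-v for v in gene_a], errors))
```

Dividing each measurement by its error means that poorly measured values contribute less to the correlation than well measured ones.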

The colour display encodes the logarithm of these expression changes, where red is upregulation, green is downregulation, black represents no change, and grey represents data not available. Each row in the display represents an experiment condition pair, and each column corresponds to a gene. The rows and columns are displayed in the order given by the clustering output trees in the two dimensions.

Not all genes are retained in the clustering analysis: only the 5,000 significant genes with more than two-fold regulation and a significance of regulation of p < 0.01 in more than 5 experiments were kept. This focuses attention on the most informative genes, yet does not bias the clustering result towards any a priori assumptions as to mechanism.
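The gene filter described above might be sketched as follows. The sketch makes two assumptions that the text does not state: the log ratios are base 10, and the fold-change and p-value conditions must hold together in the same experiment. The function name and parameters are hypothetical.

```python
import math

def is_significant(log_ratios, p_values, min_fold=2.0, max_p=0.01, min_experiments=5):
    """Keep a gene if it is regulated more than min_fold with p < max_p
    in MORE than min_experiments experiments (assumed log10 ratios)."""
    threshold = math.log10(min_fold)  # assumption: ratios are log10
    hits = sum(
        1
        for ratio, p in zip(log_ratios, p_values)
        if abs(ratio) > threshold and p < max_p
    )
    return hits > min_experiments

# Hypothetical gene measured in ten experiments, regulated in six of them.
ratios = [0.5, 0.6, -0.4, 0.7, 0.5, -0.6, 0.0, 0.1, 0.0, -0.1]
p_vals = [0.001] * 6 + [0.5] * 4
print(is_significant(ratios, p_vals))
```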

Method of supervised classification

We calculated the correlation between the prognostic category (metastasis vs. no metastasis) and the logarithmic expression ratio across all 78 samples for each individual gene among the 5,000 significant genes. The distribution of the correlation coefficients is shown in red in the histogram in Figure S1 (a). Genes with a correlation coefficient (Pearson coefficient) of greater magnitude are likely candidates for reporting prognosis, i.e., short interval to metastasis or no metastasis.

To evaluate the significance of each correlation coefficient with respect to the null hypothesis that such a correlation coefficient can be found by chance, we used a permutation technique to generate Monte-Carlo data that randomise the association between the gene expression data of the 78 tumour samples and their prognostic categories. The blue histogram in Figure S1 (a) shows the distribution of correlation coefficients obtained from one Monte-Carlo trial under this null hypothesis. 10,000 such Monte-Carlo simulations were generated. Subsequently, genes with a correlation coefficient either larger than 0.3 (“correlated genes”) or less than –0.3 (“anticorrelated genes”) were selected in both the real data and the Monte-Carlo data. We found that 231 genes fulfilled this criterion in the real data set, whereas the number of genes that fulfilled the same criterion in the Monte-Carlo data is much smaller and varies from run to run. The frequency distribution of the number of genes that satisfy this criterion over the 10,000 Monte-Carlo runs is displayed in Figure S1 (b). The probability of finding 231 genes or more with a correlation of at least +/-0.3 with outcome purely by chance is estimated to be 0.3%, based on the 10,000 Monte-Carlo trials shown in Figure S1 (b). On average, 36 genes would be selected by chance.
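The permutation test above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: outcomes are coded 0/1, the p-value is the fraction of shuffled runs that select at least as many genes as the real data, and the function names and the synthetic data are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def count_reporters(expression, labels, cutoff=0.3):
    """Number of genes whose |correlation| with the labels exceeds the cutoff."""
    return sum(1 for gene in expression if abs(pearson(gene, labels)) > cutoff)

def permutation_p_value(expression, labels, n_runs=1000, cutoff=0.3, seed=0):
    """Shuffle the labels n_runs times; return the observed gene count and the
    fraction of runs that select at least as many genes by chance."""
    rng = random.Random(seed)
    observed = count_reporters(expression, labels, cutoff)
    shuffled = list(labels)
    at_least = 0
    for _ in range(n_runs):
        rng.shuffle(shuffled)
        if count_reporters(expression, shuffled, cutoff) >= observed:
            at_least += 1
    return observed, at_least / n_runs

# Synthetic data: 30 genes x 20 samples, the first 10 genes track the outcome.
rng = random.Random(42)
labels = [0] * 10 + [1] * 10
expression = [
    [rng.gauss(0, 1) + (1.5 * lab if g < 10 else 0.0) for lab in labels]
    for g in range(30)
]
print(permutation_p_value(expression, labels, n_runs=200))
```

Each row of `expression` holds one gene across all samples, so shuffling the labels breaks the gene-outcome association while preserving the correlations among genes.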

The significance of each of the 231 genes as a prognostic reporter was evaluated by a metric similar to the “Fisher” statistic. The “Fisher” metric for each gene is plotted in Figure S2 (a). The confidence level of each gene in the candidate list was estimated with respect to a null hypothesis derived from the actual data set using the random permutation technique. The p-value estimated from this distribution for each gene in the candidate list is shown in Figure S2 (b).

We used the “leave-one-out” method for cross-validation. Specifically, we took one sample out at a time and used the remaining 77 samples to define a classifier based on the set of 231 discriminating genes. We then predicted the outcome of the sample that was left out. The prediction for the left-out sample is based on its correlation coefficient to the “good prognosis” template and the “poor prognosis” template, where the “good” and “poor” templates are the average expression patterns of the clinically “good” and “poor” samples within the 77 samples. The correlation coefficient is calculated using the selected reporter genes. We repeated this procedure until each of the 78 samples had been left out once. Finally, we counted in how many cases the predictions were correct and in how many cases they were incorrect.
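A minimal sketch of this leave-one-out procedure is shown below. It assumes a simple decision rule (assign the left-out sample to whichever template it correlates with more strongly); the text only says the prediction is "based on" these correlations, so the rule, the function names and the toy profiles are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def mean_profile(profiles):
    """Average expression pattern (template) over a group of samples."""
    return [sum(column) / len(profiles) for column in zip(*profiles)]

def leave_one_out(profiles, labels):
    """Predict each sample from templates built on all the other samples."""
    predictions = []
    for i, held_out in enumerate(profiles):
        train = profiles[:i] + profiles[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        good = mean_profile([p for p, lab in zip(train, train_labels) if lab == "good"])
        poor = mean_profile([p for p, lab in zip(train, train_labels) if lab == "poor"])
        # Assumed rule: pick the template with the higher correlation.
        is_good = pearson(held_out, good) > pearson(held_out, poor)
        predictions.append("good" if is_good else "poor")
    return predictions

# Toy reporter-gene profiles (four genes per sample).
profiles = [
    [1.0, 0.8, -1.0, -0.9], [0.9, 1.0, -0.8, -1.0], [1.1, 0.9, -0.9, -1.1],
    [-1.0, -0.8, 1.0, 0.9], [-0.9, -1.0, 0.8, 1.0], [-1.1, -0.9, 0.9, 1.1],
]
labels = ["good"] * 3 + ["poor"] * 3
predictions = leave_one_out(profiles, labels)
errors = sum(1 for p, lab in zip(predictions, labels) if p != lab)
print(predictions, errors)
```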

The performance of the classifier is measured by the error rates of type 1 (false negative) and type 2 (false positive) for the selected gene set. We repeated the above leave-one-out performance evaluation, adding 5 more marker genes each time from the top of the candidate list, until all 231 genes were used as discriminating genes.

The performance as a function of the number of marker genes is shown in Figure S3. The numbers of type 1 and type 2 errors change dramatically with the number of marker genes employed. The combined error rate reaches its minimum when we use 70 marker genes from the top of our candidate list. Therefore, we consider this set of 70 genes as the optimal set of marker genes that can be used to classify patients in the “sporadic” group into two prognostic subgroups: a “good prognosis” group and a “poor prognosis” group. It is interesting to point out that the accuracy in predicting the prognosis of “sporadic” breast cancer patients is quite low when we use only a few marker genes. The accuracy improves with an increasing number of marker genes until the optimal number of marker genes is reached (~70 genes). Beyond the optimal number of marker genes, however, the accuracy becomes worse, due to the introduction of noise.
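The search for the optimal number of marker genes can be sketched as a sweep over gene-set sizes. The sketch assumes that the "combined error" is the plain sum of type 1 and type 2 error counts, and it takes a hypothetical callback `count_errors(n_genes)` standing in for a full leave-one-out evaluation with the top `n_genes` genes. The toy error curve is invented for illustration only.

```python
def candidate_sizes(total=231, step=5):
    """Gene-set sizes 5, 10, 15, ... ending with the full candidate list."""
    sizes = list(range(step, total + 1, step))
    if not sizes or sizes[-1] != total:
        sizes.append(total)
    return sizes

def best_marker_count(count_errors, total=231, step=5):
    """Return (n_genes, combined_errors) with the fewest combined errors.
    count_errors(n) must return (false_negatives, false_positives)."""
    best_n, best_combined = None, None
    for n in candidate_sizes(total, step):
        false_neg, false_pos = count_errors(n)
        combined = false_neg + false_pos  # assumption: unweighted sum
        if best_combined is None or combined < best_combined:
            best_n, best_combined = n, combined
    return best_n, best_combined

def toy_errors(n_genes):
    """Invented error curve with a minimum at 70 genes (not real data)."""
    return 3 + abs(n_genes - 70) // 5, 10

print(best_marker_count(toy_errors))
```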

Performance cross-validation

Since the 231 reporters were defined using all 78 samples, the cross-validation in the previous section may be prone to over-fitting through an information leak. To address this problem, we constructed a cross-validation procedure that has no information leak and involves no optimization. The procedure is as follows: (1) leave one sample out, (2) define reporters based on the remaining 77 samples among the set of ~5000 significant genes, (3) use the reporters to predict the outcome of the one sample that was left out in step (1), (4) repeat steps (1)-(3) exhaustively for all 78 samples.

The above procedure essentially created 78 classifiers based on 78 sets of reporters. The reporters in each “leave-one-out” case are defined as genes with a |correlation| > 0.3 in the remaining 77 samples, where the correlation is calculated between the gene expression and the outcome of the 77 samples. In this process, the sample left out is not involved in the reporter selection; the procedure therefore does not have any information leak. The prediction is based on the correlation coefficient of the reporter expression pattern for the left-out sample with the templates defined by the remaining samples.
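The difference from the earlier cross-validation is that reporter selection happens inside the loop. A minimal sketch follows, assuming outcomes coded 0 (good) and 1 (poor) and the same hypothetical "higher correlation wins" decision rule. The function names and the toy data are illustrative and not taken from the manuscript.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def select_reporters(samples, outcomes, cutoff=0.3):
    """Indices of genes with |correlation to outcome| > cutoff."""
    genes = list(zip(*samples))  # one tuple of values per gene
    return [g for g, values in enumerate(genes)
            if abs(pearson(values, outcomes)) > cutoff]

def template(samples, reporters):
    """Average expression of the reporter genes over a group of samples."""
    return [sum(s[g] for s in samples) / len(samples) for g in reporters]

def leak_free_loo(samples, outcomes, cutoff=0.3):
    """Steps (1)-(4): reporters are re-selected without the left-out sample."""
    predictions, reporter_sets = [], []
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_out = outcomes[:i] + outcomes[i + 1:]
        reporters = select_reporters(train, train_out, cutoff)
        good = template([s for s, o in zip(train, train_out) if o == 0], reporters)
        poor = template([s for s, o in zip(train, train_out) if o == 1], reporters)
        held_out = [samples[i][g] for g in reporters]
        # Assumed rule: pick the template with the higher correlation.
        is_good = pearson(held_out, good) > pearson(held_out, poor)
        predictions.append(0 if is_good else 1)
        reporter_sets.append(reporters)
    return predictions, reporter_sets

# Toy data: genes 0-2 track the outcome, gene 3 is uninformative.
samples = [
    [1.0, 0.9, -1.0, 0.1], [0.8, 1.1, -0.9, -0.1],
    [1.2, 1.0, -1.1, 0.1], [0.9, 0.8, -0.8, -0.1],
    [-1.0, -0.9, 1.0, 0.1], [-0.8, -1.1, 0.9, -0.1],
    [-1.2, -1.0, 1.1, 0.1], [-0.9, -0.8, 0.8, -0.1],
]
outcomes = [0, 0, 0, 0, 1, 1, 1, 1]
predictions, reporter_sets = leak_free_loo(samples, outcomes)
print(predictions, reporter_sets[0])
```

Returning the per-fold reporter sets makes it possible to count how often each gene is selected across folds, which is the kind of summary reported in Figure S4.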


The average number of reporters from these 78 classifiers is 238+/-23. Figure S4 presents the frequency of the original 231 genes and the union of other genes found in these 78 classifiers.

We found that the vast majority of the original 231 reporter genes are shared by the 78 classifiers. In particular, 180 of the original 231 reporters appeared 74 times or more in those 78 classifiers.

Using this cross-validation process, at the threshold range which misclassifies 3 “poor prognosis patients” as “good prognosis patients”, 17 to 19 (average 18) “good prognosis patients” are misclassified as “poor” (Figure S5). This method results in an odds ratio of 15 (95% CI 4-56, Fisher’s exact test p-value of 4.1E-6). This should be compared with the situation where the reporter genes are fixed to the original 231 genes: then, 15 to 16 “good prognosis patients” are misclassified as “poor” (Figure S6), which gives odds ratios from 18 to 20. The differences in the number of misclassifications and in the odds ratios represent a possible information leak.
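The quoted odds ratios can be checked with a short calculation. The sketch below assumes group sizes of 34 “poor prognosis” and 44 “good prognosis” patients among the 78 samples (these counts are not stated in this section) and uses the standard log-odds (Woolf) approximation for the confidence interval, which may not be the exact method the authors used. Under these assumptions the numbers in the text are reproduced to the quoted precision.

```python
import math

def odds_ratio_with_ci(true_pos, false_neg, false_pos, true_neg, z=1.96):
    """Odds ratio of a 2x2 table with a Woolf (log-odds) confidence interval."""
    odds_ratio = (true_pos * true_neg) / (false_neg * false_pos)
    se_log = math.sqrt(1 / true_pos + 1 / false_neg + 1 / false_pos + 1 / true_neg)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper

# Assumed group sizes (not given in this section).
N_POOR, N_GOOD = 34, 44

def from_error_counts(false_neg, false_pos):
    """Build the 2x2 table from the misclassification counts in the text."""
    return odds_ratio_with_ci(
        N_POOR - false_neg, false_neg, false_pos, N_GOOD - false_pos
    )

# Leak-free cross-validation: 3 false negatives, 18 false positives on average.
print(from_error_counts(3, 18))
# Fixed 231 reporters: 3 false negatives, 15 or 16 false positives.
print(from_error_counts(3, 15)[0], from_error_counts(3, 16)[0])
```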

It should be pointed out that this cross-validation process, apart from having no information leak in the reporter selection, also did not optimize the number of reporter genes, as was done in the definition of our classifier. Hence the odds ratio of 15 obtained from this process may be on the conservative side.

Multivariate logistic fit

To evaluate the added prognostic value of the microarray gene expression profiling in addition to the clinical parameters, the microarray parameter and the clinical parameters are combined to form a ‘complete’ multivariate model by logistic regression (see, for example, “S-PLUS 2000 Guide to Statistics, Vol.1”, P.301). The clinical parameters used for the current modelling are the tumour grade, oestrogen receptor (ER) status, progesterone receptor (PR) status, tumour size, patient age, and angioinvasion. To avoid quoting an optimistic number from the microarray data, the correlation coefficient to the “good prognosis” templates from the conservative cross-validation (previous section of this material) is used in the logistic regression.

In order to calculate the odds ratio from these multivariate logistic regression coefficients, all the input parameters, including that of the microarray, were converted to a binary format (0 or 1) as follows:

Parameter                 0          1
grade                     1,2        3
ER                        <=10       >10
PR                        <=10       >10
size (mm)                 <=20       >20
age                       <=40       >40
angioinvasion             0          1
Microarray Correlation    <=0.54     >0.54

The odds ratio for each parameter was calculated using the multivariate logistic regression coefficient: OR = exp(logistic coefficient), and the 95% confidence interval as: CI = exp(logistic coefficient +/- 1.96 * std error). The results are displayed in the following table:

Parameter                 Logistic Coefficient   Std. Error   Odds ratio   95% CI
grade                     -0.08                  0.79         1.1          [0.2 5.1]
ER                        0.5                    0.94         1.7          [0.3 10.4]
PR                        -0.75                  0.93         2.1          [0.3 13.1]
size (mm)                 -1.26                  0.66         3.5          [1.0 12.8]
age                       1.4                    0.79         4            [0.9 19.1]
angioinvasion             -1.55                  0.74         4.7          [1.1 20.1]
Microarray Correlation    2.87                   0.85         17.6         [3.3 93.7]
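The tabulated odds ratios can be recomputed from the coefficients and standard errors. One observation from the arithmetic, not stated in the text: the table lists odds ratios of at least 1 even for negative coefficients, and the values appear to match the exponential of the coefficient's magnitude. The sketch below therefore uses the absolute value, which is an assumption. The helper name is hypothetical.

```python
import math

# (parameter, logistic coefficient, standard error) as listed in the table.
ROWS = [
    ("grade", -0.08, 0.79),
    ("ER", 0.5, 0.94),
    ("PR", -0.75, 0.93),
    ("size (mm)", -1.26, 0.66),
    ("age", 1.4, 0.79),
    ("angioinvasion", -1.55, 0.74),
    ("Microarray Correlation", 2.87, 0.85),
]

def odds_ratio_and_ci(coefficient, std_error, z=1.96):
    """OR and 95% CI from a logistic coefficient.
    Assumption: the magnitude of the coefficient is used, so OR >= 1."""
    magnitude = abs(coefficient)
    return (
        math.exp(magnitude),
        math.exp(magnitude - z * std_error),
        math.exp(magnitude + z * std_error),
    )

for name, coefficient, std_error in ROWS:
    odds_ratio, lower, upper = odds_ratio_and_ci(coefficient, std_error)
    print(f"{name:24s} OR={odds_ratio:5.1f}  CI=[{lower:.1f} {upper:.1f}]")
```

The recomputed values agree with the table up to rounding of the published coefficients.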


Legends to the Figures in Supplementary Information.

Figure S1. (a) Histogram of the correlation coefficients of the gene expression ratio of each significant gene with the prognostic category (metastases within 5 years or metastasis-free for > 5 years), shown in red. The blue distribution is obtained from one Monte-Carlo run in which the association between the gene expression and the prognostic category was randomised. The magnitude of correlation or anti-correlation of 231 genes is greater than 0.3. (b) Frequency distribution of the number of genes that satisfy the same criterion for 10,000 Monte-Carlo runs. The mean is 36, p(n>231) is 0.3% and p(n>231/2) is 3.3%.

Figure S2. (a) The Fisher metric for each gene on the discriminating reporter candidate list. (b) The p-value obtained from the Monte-Carlo runs indicates the probability that the gene is selected as a discriminating gene by chance. The gene orders in (a) and (b) are identical.

Figure S3. The classification error rates for type 1 and type 2 as a function of the number of discriminating genes used in the classifier. The Y axis is the number of misclassified tumours; the X axis is the number of reporters. Note that the optimal combined error rate is reached with 70 discriminating marker genes. The red stars represent type 1 errors (false negative in predicting metastasis) and the blue circles represent type 2 (false positive) errors.

Figure S4. The frequency distribution of the original 231 reporter genes and the union of other genes found in the 78 no-information-leak cross-validation classifiers. Each cross-validation classifier selects its own reporter genes without the left out sample.


Figure S5. The error rates (same definition as Figure S3) versus the threshold in the correlation coefficient to the “good prognosis” template, as determined in the no-information-leak cross-validation. At the thresholds where the number of false negatives is 3, the number of false positives is 17 to 19 (average 18).

Figure S6. The error rates (same definition as Figure S3) versus the threshold in the correlation coefficient to the “good prognosis” template, as determined in the cross-validation using the 231 reporter genes selected from all 78 samples. At the thresholds where the number of false negatives is 3, the number of false positives is 15 to 16.

Table S1. List of 117 patients with clinical information. The header for each column is self-explanatory.

Table S2. List of 231 prognosis reporter genes including the optimal set of 70 markers. The header for each column is self-explanatory. The 70 optimal marker genes are the 70 genes with the highest absolute correlation coefficients.

Table S3. List of 2460 ER status reporter genes including the optimal set of 550 markers. The 550 optimal marker genes are the 550 genes with the highest absolute correlation coefficients. The header for each column is self-explanatory.


Table S4. List of 430 BRCA1 reporter genes including the optimal set of 100 markers. The 100 optimal marker genes are the 100 genes with the highest absolute correlation coefficients. The header for each column is self-explanatory.
