Supplementary Information: Statistical Methods: Normalization of Microarray data: Feature extraction and normalization of the raw data were performed using the Agilent G2567AA Feature Extraction software (Agilent Technologies). Data from an individual array was normalized by linear and Lowess normalizations and spatial detrending protocol [http://www.chem.agilent.com/temp/rad7DFA2/00047887.pdf (p.215-217)] as described. For data filtering, we analyzed the dye-reversal experiments using Spotfire® DecisionSite 8.0 (Somerville, MA). A total of 307 slides were compared for correlations between dye-reversal pairs and for each sample only those genes with a positive correlation were used for further analysis. The expression values and their error estimates were evaluated by taking the mean and variance over replicate array experiments. This gave two matrices, a normalized expression matrix of dimension 19061x135 and the complementing matrix of standard deviations. For further analysis we selected genes that had values in at least 73 (50%) of the samples and showed more than 2-fold variability across at least 5 samples, which resulted in a matrix of 5257 genes. Missing values in this matrix were not imputed. Survival analysis: The aim was to derive a prognostic classifier from our data and to test it in external independent data sets. We divided this task into three steps. First, we used internal validation to show that prognostic classifiers can be found that are predictive in a test set. Second, having shown that this was possible, we derived an optimized classifier using the entire data set. There are two reasons why we separated the process of deriving an optimized classifier from the process of internal validation. First, the optimal classifier is strongly dependent on the training set used, which raises the problem of how to combine the optimal classifiers derived for each choice of training set (multiple choices of training set are necessary to avoid bias associated with using only one training test set partition), and while it may be possible to combine the optimal classifiers into an overall classifier by calculating an average ranking of genes over the rankings produced by each training set, we found here that this average ranking was effectively identical to the ranking obtained using the whole data set as a training set. Second, since ultimately the 1 validity of a classifier rests on its performance on external data sets, we might as well use the entire internal data set to derive such a classifier. After having derived an optimized classifier from our cohort, the third and final step consists of testing it in completely independent external studies. The general scheme described above was implemented using a semi-supervised method, called “Cox-clustering” (Bair & Tibshirani, 2004). This method uses unsupervised clustering over genes selected in a supervised fashion (Cox proportional hazards regression) and allows unbiased identification of subgroups of patients that differ in outcome. These analyses were carried out on the matrix of 5257 genes. Each gene was first standardized to have zero mean and unit standard deviation across all samples. The association of each gene with survival was tested using a univariate Cox proportional hazards regression model. The proportional hazards assumption was tested for each gene in the study and we confirmed that for all genes with significant Cox-scores the proportional hazards assumption was justified. Correction for multiple comparisons was performed by converting the p values into q values, which express the proportion of significant genes that turn out to be false leads (the false discovery rate- FDR) (Storey & Tibshirani, 2003). Using the q statistic the top 200 Cox ranked genes were bound by an FDR of 30% (in other words on average 60/200 may be ‘false leads’). Owing to the fact that genes had variable numbers of missing values, we also estimated the FDR using a Monte-Carlo simulation that randomized the gene expression values across samples. Step1 (internal validation to validate sample assignment process and to test for over-fitting): The data was divided into a training and test set. For a given choice of training set we ranked all the genes using univariate Cox-regression, as before, but now using only the samples in the training set. Classifiers were then built by sequentially adding more genes from the ranked list, starting at 30 genes and increasing up to 200. Owing to the significant number of missing values, considering gene sets with at least 30 members was necessary in order to apply the subsequent clustering algorithm. The optimal number of genes was determined using LOOCV (leave one out cross-validation) on the training set, i.e we applied robust k-means (k=2) clustering (3) to all the 2 training samples except one to learn the two clusters. The two clusters were then categorized as good or bad prognosis depending on the hazard ratio. The test sample was then assigned to good or bad prognosis using the nearest centroid classifier method. For each fold of the LOOCV this gave a prognostic prediction for each sample in the training set acting as a test sample. We then performed Cox-regression of survival against this dichotomization of the samples and the performance of the classifier was given by the log-rank test p-value. A classifier was chosen with the most significant p-value (in almost all cases this corresponded to highest hazard ratio). Having derived a classifier from the training set using LOOCV we next reevaluated this classifier on the whole training set to learn the two clusters (and centroids) and to check the association of the clusters with survival. Test samples were then assigned to good or bad prognosis using the nearest centroid classifier. We next repeated all this analysis for different choices of training and test sets, ensuring that the test sets were mutually disjoint. This guaranteed that each sample would act as a test sample once. The prognostic class assignment of all the samples was then tested for association with survival using Cox-regression. If the associated log-rank test p-value was less than 0.05 we were then able to conclude that there was no significant over-fitting in the learning procedure. The whole procedure described above was carried out for three different training test set partition sizes: we used test sets of sizes 1 (leave one-out), 5 and 9 samples. Since in the latter two cases there were many different ways of choosing 27 (135/5) and 15 (135/9) disjoint test sets among the 135 samples, we repeated for these cases the analysis a total of 10 times to ensure there was no bias in the particular way the disjoint test sets were defined. Step 2 (deriving the optimal classifier using the entire data set): As results in Step-1 were encouraging, it suggested that the same procedure could be applied to the entire data set followed by the use of external independent data sets as test sets. Thus, we applied the LOOCV methodology as described before but now to the entire data set. While this gives us a way of defining an optimal classifier, our experience with the methodology suggests that it doesn't necessarily outperform one derived without LOOCV. Since the ultimate validation of a classifier comes from the external tests we decided to derive our optimal classifier without the 3 LOOCV-step. Having Cox-ranked the genes using the entire data set, classifiers were then built by sequentially adding genes from the ranked list, starting at 30 genes and increasing up to 200. The relevance of the classifiers for survival was tested by first clustering all the samples into two groups using robust k-means (k=2) and then performing a Cox-regression to determine whether the subtypes defined by the k-means clustering were indeed associated with survival (p < 0.05). This procedure yielded a range of classifiers with mutually overlapping 95% hazard ratio CI's. We declared the optimal classifier as the one maximising the hazard ratio. Step 3 (external validation of classifier) Having derived an optimized classifier we next attempted to validate it in two independent external data sets (van de Vijver et al., 2002; Wang et al., 2005). Given the classifier set of genes and an external data set we first excluded those genes not present on the external platform. Genes were normalized to zero mean and unit variance (z-score transformation). Next, for each of the samples in the external cohort with gene expression vector x, we computed a continuous prognostic index, PI(x), as PI(x)= cor(x,c2)-cor(x,c1), where cor is the usual (centered) Pearson correlation, while c2 and c1 denote the centroids of the poor and good outcome clusters from optimal classifier learned in step 2. To evaluate the classifier we used two different measures of prognostic separation. For one of them, the hazard ratio (10), we first assigned each external sample to poor or good prognosis using the nearest centroid classification rule. That is, we used the classification rule, poor prognosis if PI(x) > 0 and good prognosis if PI(x) < 0. For the second measure, the D-index (Royston & Sauerbrei, 2004), we ranked the continuous PI(x) values in increasing order. The resulting risk ordering was then tested for association with survival by a Cox-regression against the reordered scaled rankits, the estimated regression coefficient defining the D-index. Remarks: (i) the D-index does not require the model to be recalibrated since it is determined entirely by the relative risk ordering of the samples; (ii) significance of the D-index was tested using a normality assumption as well as by random permutation of time labels. 4 Software packages used: Univariate and multivariate Cox regressions, Kaplan-Meier analysis and Hazard ratio computations were carried out using the survival R-package version 2.16 (www.cran.r-project.org). Gene Ontology (GO) was performed using EASE software package (http://david.niaid.nih.gov/david/ease.htm) and the Gene Ontology Tree Machine (http://genereg.ornl.gov/gotm). Survival estimation with the Adjuvant! software was performed using the online version 7.0 (https://www.adjuvantonline.com/online.jsp). References Bair E & Tibshirani R. (2004). Semi-supervised methods to predict patient survival from gene expression data. PLoS Biology 2: 503-11. Deutsch JM. (2003). Evolutionary algorithms for finding optimal gene sets in microarray prediction. Bioinformatics 19: 45-52. Ein-Dor L, Kela I, Getz G, Givol D & Eytan D. (2005). Outcome signature genes in breast cancer: is there a unique set? Bioinformatics 21: 171-8. Goldberg DE. (1989). Genetic Algorithms in Search, Optimization and Machine Learning. Addison Wesley: Reading, MA. Ooi CH & Tan P. (2003). Genetic algorithms applied to multi-class prediction for the analysis of gene expression data. Bioinformatics 19: 37-44. Royston P & Sauerbrei W. (2004). A new approach to modelling interactions between treatment and continuous covariates in clinical trials by using fractional polynomials. Stat Med. 23: 723-48. Storey JD & Tibshirani R. (2003). Statistical significance for genomewide studies. Proc Natl Acad Sci U S A 100: 9440-5. van de Vijver MJ, He YD, van 't Veer L, Dai H, Hart AAM, Voskuil DW, et al. (2002). A geneexpression signature as a predictor of survival in breast cancer. N Engl J Med. 347: 1999-2009. 5 Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, et al. (2005). Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet 365: 671-9. 6 Supplementary Tables and Figures: Clinical Parameter GOOD (95) POOR (40) ER positive 75 18 ER negative 19 21 High Grade (3) 21 28 Low Grade (1 & 2) 73 12 Node positive 28 16 Node negative 66 24 Adjuvant-Poor 7 5 Adjuvant-Good 88 35 Menopause (Yes) 63 26 Menopause (No) 31 13 Size (>2cm) 26 15 Size (≤2cm) 68 25 NPI (>3.4) 41 34 NPI (≤3.4) 53 6 Age 56.5(±9) 57.8(±8) Supplementary Table S1. The distribution of clinical parameters over the good/bad prognostic samples. ER: estrogen receptor. NPI: Nottingham prognostic index. 7 Cell Function Mitosis and Cell Cycle Genes KNTC2, SPAG5, ASPM *p value 0.0005 BUB1, PKMYT1, PTTG1 DCN, FBLN1, GAS6, AMH, CART Extracellular Matrix LOXL1, OMD, ANGPTL4, PDGFC 0.009 SMOC2, C1R, C1S, SPARCL1, CILP Transcription factor CEBPD, ZMYND11, UHRF1, MLLT1 0.006 MYBL2, SMARCA4, BTF3, TIF1, PTTG1 Calcium ion binding CDH11, FBLN1, GAS6, ANXA5, SMOC2 0.008 TNNC2, C1R, C1S, SPARCL1, KIAA0703 Supplementary Table S2. Cell functions significantly associated with the prognostic genes. Selective genes associated with each cell function and p values of significance are given. * Corrected for the multiple testing. 8 Cox-Ranked Clinical Features HR p value (95% CI) Pre-menopausal 4.21 0.0047 (1.13-15.72) Postmenopausal 6.78 1e-06 (2.55-18.05) ER negative 4.41 0.0119 (1.54-12.62) ER positive 6.93 2e-07 (2.00-24.05) Node negative 10.31 3e-08 (3.26-32.61) Node positive 2.77 0.0274 (0.97-7.91) Supplementary Table S3. Stratified Hazard ratios of Cox-ranked signature with different clinical variables. ER: estrogen receptor. HR: Hazard Ratio. CI: confidence interval. 9 Signature Naderi et al Vijver et al Wang et al Cox-ranked (70g) 5.8 (p=9e-9) 3.98 (p=2e-9) 1.76 (p=0.005) Wang 76g 2.2 (p=0.02) 2.18 (p=0.001) 2.19 (p=6e-5) Veer 70g 1.02 (p=0.94) 11.4 (p=8e-10) 1.6 (p=0.03) Supplementary Table S4. Performance of gene-sets across different data sets. Hazard Ratios and p values for each signature are given. Genes Naderi et al Vijver et al Wang et al EBP 1.95 1.69 1.26 EXO1 1.82 1.84 1.26 TIMELESS 1.81 1.81 1.3 CTPS 1.66 1.58 1.2 SMARCA4 1.89 1.33 0.84 PTTG1 1.75 1.81 1.38 PSMD2 1.69 1.71 1.36 TIF1 1.87 1.34 0.79 MYBL2 1.72 1.72 1.36 BUB1 1.74 1.84 1.47 DNMT3B 1.74 1.57 1.31 FANCA 1.59 1.69 1.21 MBP 1.54 0.86 0.77 ZWINT 1.65 1.83 1.37 BM039 2.01 1.78 1.36 PKMYT1 1.66 1.94 1.31 FLJ10292 1.46 1.35 1.54 SQLE 1.54 1.57 1.28 RAD54L 1.67 1.82 1.33 EIF4EBP1 1.56 1.41 1.27 FLJ10706 1.53 1.36 1.25 RAB22A 1.63 1.31 1.41 CDC2 1.59 1.57 1.44 APPBP1 1.66 1.34 1.26 PSMD7 1.6 1.88 1.24 DTYMK 1.58 1.62 1.23 SHMT2 1.9 1.84 1.34 HSPC171 1.91 1.44 1.24 MAD2L1 1.81 1.69 1.46 10 Supplementary Table S5. Overlapping prognostic genes across three independent data sets. Exponentials of Cox-coefficient values for the 29 common genes identified through Coxanalysis across the three studies. All Cox-values have p < 0.05 in the three studies. Values >1 mean that the gene is overexpressed in poor outcome samples relative to good outcome, and values < 1 mean the gene is relatively underexpressed in poor outcome samples. 11 Clinico-Pathological Features All Patients (n= 135) Age (years) Mean (SD) 57 (9) < 45 15 (11%) 45-54 36 (27%) 55-64 56 (41%) ≥ 65 28 (21%) Menopausal status Premenopausal 44 (33%) Post-menopausal 90 (67%) *Not Known 1 T stage T in situ 1 T1 93 (69%) T2 41 (31%) Lymph Node stage N0 86 (67%) N1/N2 43 (33%) *Not Known 6 Grade 1 35 (26%) 2 50 (37%) 3 49 (37%) ER status Positive 93 (70%) Negative 40 (30%) *Not Known 2 Tamoxifen Therapy (ER+ cases) Yes 38 (40%) No 55 (60%) Chemotherapy (CMF) Yes 6 (4%) No 129 (96%) NPI score ≤ 3.4 59 (44%) > 3.4 75 (56%) Supplementary Table S6. Summary of clinical and pathological features for the cohort. ER: Estrogen Receptor. Nottingham Prognostic Index (NPI =0.2 size (cm) + grade + stage). * Percentage is given for the cases with known status. 12 Supplementary Figure S1. Expected number of false positives vs. number of significant tests. FDR: False Discovery Rate MC: MonteCarlo simulation q-value: Bayesian FDR 13 Supplementary Figure S2. Cox-Ranked prognostic genes. Variation of log-rank test p-value (A) and Hazard Ratio (B) as a function of the number of Cox-ranked genes present in the clustering set. Error Bars give the 95% confidence intervals. Dashed line corresponds to p= 0.05 in panel A and HR= 1 in panel B. 14