Supplementary Information for

Breast tumor subgroups reveal diverse clinical prognostic power

Zhaoqi Liu, Xiang-Sun Zhang and Shihua Zhang*

National Center for Mathematics and Interdisciplinary Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China

*Corresponding author. Email: zsh@amss.ac.cn

Table S1. Different prognostic results of the multivariate Cox PH model. P-values in bold denote that the averaged CI is larger than the mean value of the 1000 permutations.

| Subtypes | Basal-like | HER2+ | Luminal A | Luminal B | Normal-like | All |
|---|---|---|---|---|---|---|
| Sample size | 328 | 238 | 719 | 490 | 200 | 1981 |
| Clinical+gene CI | 0.605 | 0.597 | 0.712 | 0.624 | 0.709 | 0.685 |
| Clinical+gene P-value | 0.014 | 0.087 | 0.010 | 0.047 | 0.056 | — |
| Gene only CI | 0.539 | 0.557 | 0.624 | 0.573 | 0.686 | 0.626 |
| Gene only P-value | 0.030 | 0.164 | 0.298 | 0.082 | 0.010 | — |
| Clinical only CI | 0.599 | 0.625 | 0.715 | 0.619 | 0.667 | 0.677 |
| Clinical only P-value | 0.033 | 0.283 | 0.001 | 0.039 | 0.288 | — |

Besides the multivariate Cox PH model, we also applied a random survival forest model in the same manner, which gave consistent results (Table S2). For gene expression and clinical features, we used the same variables as in the Cox model. Three-fold cross-validation repeated 100 times and a permutation test were conducted in the same manner. The CI was used to estimate the prediction performance of the model.

Table S2. Different prognostic results of the random survival forest model. P-values in bold denote that the averaged CI is larger than the mean value of the 1000 permutations.

| Subtypes | Basal-like | HER2+ | Luminal A | Luminal B | Normal-like | All |
|---|---|---|---|---|---|---|
| Sample size | 328 | 238 | 719 | 490 | 200 | 1981 |
| Clinical+gene CI | 0.622 | 0.637 | 0.726 | 0.657 | 0.714 | 0.702 |
| Clinical+gene P-value | 0.029 | 0.133 | 0.010 | 0.093 | 0.085 | — |
| Gene only CI | 0.544 | 0.563 | 0.619 | 0.568 | 0.678 | 0.620 |
| Gene only P-value | 0.068 | 0.216 | 0.329 | 0.105 | 0.020 | — |
| Clinical only CI | 0.621 | 0.640 | 0.732 | 0.657 | 0.691 | 0.690 |
| Clinical only P-value | 0.017 | 0.146 | 0.001 | 0.114 | 0.318 | — |

In total, there were 7 gene network modules with a CI larger than 0.60, of which six were from the normal-like subtype and one was from the basal-like subtype.
They were the BIRC5 module (CI = 0.650), MCM10 module (0.640), AURKA module (0.635), RECQL5 module (0.632), POLQ module (0.625) and CEP55 module (0.624) from the normal-like subtype, and the IL18RAP module (0.605) from the basal-like subtype. Tables S3 and S4 give functional annotations of these 7 modules based on pathway and GO enrichment analysis, respectively.

Table S3. Functional annotations of 7 prognostic modules based on pathway enrichment analysis.

| Module | Gene set | Size | Overlap | FDR |
|---|---|---|---|---|
| BIRC5 | Mitotic M-M/G1 phases | 222 | 16 | <5.0e-4 |
| BIRC5 | Aurora B signaling | 40 | 13 | <5.0e-4 |
| BIRC5 | Chromosome Maintenance | 105 | 8 | <3.3e-4 |
| BIRC5 | Aurora A signaling | 31 | 4 | <2.5e-4 |
| MCM10 | Synthesis of DNA | 96 | 13 | <1.6e-04 |
| MCM10 | Cell Cycle Checkpoints | 118 | 11 | <1.6e-04 |
| MCM10 | Regulation of DNA replication | 75 | 10 | <1.6e-04 |
| MCM10 | Mitotic M-M/G1 phases | 222 | 13 | <1.6e-04 |
| MCM10 | S Phase | 112 | 13 | <1.6e-04 |
| MCM10 | Mitotic G1-G1/S phases | 134 | 13 | <1.6e-04 |
| MCM10 | Cell cycle | 124 | 9 | <1.4e-04 |
| MCM10 | DNA replication | 36 | 7 | <1.2e-04 |
| MCM10 | CDK regulation of DNA replication | 18 | 5 | <1.1e-04 |
| MCM10 | ATR signaling pathway | 37 | 3 | 7.0e-04 |
| AURKA | Aurora A signaling | 31 | 4 | <1.0e-03 |
| AURKA | Oocyte meiosis | 112 | 3 | 3.0e-03 |
| RECQL5 | Meiosis | 85 | 7 | <1.0e-03 |
| RECQL5 | Homologous recombination | 28 | 4 | <5.0e-04 |
| RECQL5 | Fanconi anemia pathway | 52 | 3 | 2.3e-03 |
| POLQ | Homologous recombination | 28 | 4 | <5.0e-04 |
| CEP55 | Mitotic G2-G2/M phases | 87 | 4 | <1.0e-03 |
| IL18RAP | IL23-mediated signaling events(N) | 36 | 4 | <1.0e-03 |
| IL18RAP | IL12-mediated signaling events(N) | 60 | 4 | <5.0e-04 |
| IL18RAP | IL12 signaling mediated by STAT4(N) | 30 | 3 | 3.3e-04 |

Table S4. Functional annotations of 7 prognostic modules based on GO biological process.
| Module | Gene set | Size | Overlap | FDR |
|---|---|---|---|---|
| BIRC5 | mitotic prometaphase | 83 | 14 | <3.3e-04 |
| BIRC5 | cell division | 251 | 16 | <3.3e-04 |
| BIRC5 | M phase of mitotic cell cycle | 90 | 14 | <3.3e-04 |
| BIRC5 | CenH3-containing nucleosome assembly at centromere | 22 | 8 | <2.5e-04 |
| BIRC5 | mitotic cell cycle | 298 | 14 | <2.0e-04 |
| BIRC5 | nucleosome assembly | 87 | 8 | <1.6e-04 |
| BIRC5 | chromosome segregation | 52 | 7 | <1.4e-04 |
| BIRC5 | mitotic chromosome condensation | 12 | 5 | <1.2e-04 |
| BIRC5 | attachment of spindle microtubules to kinetochore | 6 | 3 | <1.1e-04 |
| MCM10 | DNA replication | 120 | 14 | <1.6e-04 |
| MCM10 | DNA strand elongation involved in DNA replication | 30 | 9 | <1.6e-04 |
| MCM10 | M/G1 transition of mitotic cell cycle | 67 | 12 | <1.6e-04 |
| MCM10 | mitotic cell cycle | 298 | 15 | <1.6e-04 |
| MCM10 | S phase of mitotic cell cycle | 96 | 12 | <1.6e-04 |
| MCM10 | G1/S transition of mitotic cell cycle | 126 | 12 | <1.6e-04 |
| MCM10 | cell cycle checkpoint | 115 | 10 | <1.4e-04 |
| MCM10 | DNA-dependent DNA replication initiation | 18 | 7 | <1.2e-04 |
| MCM10 | DNA unwinding involved in replication | 6 | 3 | <1.1e-04 |
| AURKA | mitosis | 157 | 5 | <1.0e-03 |
| RECQL5 | DNA recombination | 54 | 6 | <1.0e-03 |
| RECQL5 | meiosis | 37 | 5 | <5.0e-04 |
| RECQL5 | reciprocal meiotic recombination | 25 | 3 | 3.3e-04 |
| POLQ | DNA recombination | 54 | 3 | <1.0e-03 |
| POLQ | DNA repair | 244 | 4 | 5.0e-04 |
| POLQ | DNA duplex unwinding | 17 | 2 | 4.6e-03 |
| POLQ | double-strand break repair via homologous recombination | 35 | 2 | 9.7e-03 |
| CEP55 | G2/M transition of mitotic cell cycle | 109 | 4 | 2.0e-03 |
| CEP55 | mitotic cell cycle | 298 | 5 | 2.0e-03 |
| CEP55 | mitosis | 157 | 4 | 2.6e-03 |
| IL18RAP | intracellular protein kinase cascade | 86 | 3 | 9.2e-02 |
| IL18RAP | protein phosphorylation | 342 | 4 | 1.0e-01 |
| IL18RAP | induction of apoptosis | 166 | 3 | 1.0e-01 |

We fitted a multivariate Cox PH regression model on the top 10 genes of each gene signature (Table S5) and reported the overall Wald test p-value. For the BIRC5 and MCM10 modules, genes were ranked by their CIs. For the PCNA and Wu signatures, the top 10 genes were selected from the ranked lists in their original papers. All five gene signatures show significant prognostic ability (p-values < 0.05) on the normal-like tumors, and only the CIN attractor shows higher significance on the luminal B subtype than on the normal-like subtype. The results confirm our original observation.

Table S5.
The performance of the multivariate Cox PH regression model of the five gene signatures on each of the five subtypes. For each gene signature, we used its top-ranked 10 genes to fit the model, and the overall Wald test p-value is reported.

| Subtypes | Basal | HER2+ | Luminal A | Luminal B | Normal-like |
|---|---|---|---|---|---|
| BIRC5 | 0.851 | 0.463 | 0.099 | 0.191 | 2.50e-04 |
| MCM10 | 0.071 | 0.757 | 0.389 | 0.216 | 1.82e-06 |
| PCNA | 0.817 | 0.840 | 0.096 | 0.022 | 0.018 |
| CIN | 0.936 | 0.277 | 0.078 | 0.003 | 0.023 |
| Wu | 0.556 | 0.263 | 0.006 | 0.005 | 5.60e-04 |

Table S6. Univariate Cox PH regression of the 32-gene stroma module on each of the five subtypes, the METABRIC dataset and the OsloVal dataset. The Cox model was calculated based on the averaged gene expression value of the 32 genes. The hazard ratio (HR) with 95% confidence interval, the Wald test p-value and the concordance index (CI) are reported.

| Subtypes | Basal | HER2+ | Luminal A | Luminal B | Normal-like | METABRIC | OsloVal |
|---|---|---|---|---|---|---|---|
| HR | 1.017 | 1.047 | 0.812 | 1.159 | 0.849 | 0.904 | 0.952 |
| 95% confidence interval | 0.79-1.31 | 0.77-1.41 | 0.67-0.98 | 0.97-1.39 | 0.57-1.26 | 0.82-0.99 | 0.72-1.25 |
| p-value | 0.895 | 0.765 | 0.029 | 0.111 | 0.421 | 0.046 | 0.727 |
| Concordance index | 0.513 | 0.518 | 0.539 | 0.522 | 0.542 | 0.533 | 0.528 |

Table S7. CIs of the CIN attractor metagene signature tested on subgroups of the METABRIC dataset defined by clinical features. The METABRIC dataset was split into two or three subgroups according to different clinical features, and we calculated the CI of the CIN gene signature on each subgroup. We chose the CIN attractor metagene by Cheng et al., which was developed from multiple cancer datasets ignoring the PAM50 subtype and had shown significantly high prognostic power on the normal-like breast cancer subtype. To be consistent with the original study, we used the top 10 genes of the CIN attractor metagene signature for the calculation.

| Clinical feature | Subgroup | CI |
|---|---|---|
| Lymph node status | Positive | 0.60 |
| Lymph node status | Negative | 0.58 |
| ER status | Positive | 0.60 |
| ER status | Negative | 0.53 |
| Tumor grade | 1 | 0.52 |
| Tumor grade | 2 | 0.59 |
| Tumor grade | 3 | 0.57 |
| Low risk | Yes | 0.55 |
| Low risk | No | 0.58 |

Table S8.
CIs of the CIN attractor metagene signature tested on subgroups of the METABRIC dataset defined by cellularity. CI1 denotes the CIs tested on each cellularity subgroup. CI2 denotes the CIs tested on the subgroups obtained by removing a given cellularity type.

| Cellularity | High | Moderate | Low |
|---|---|---|---|
| CI1 | 0.59 | 0.62 | 0.62 |
| CI2 | 0.62 | 0.60 | 0.60 |

Table S9. Ten types of clinical features used in the multivariate Cox PH regression model.

| Variable name | Type | Metric or levels |
|---|---|---|
| Age at Diagnosis | numeric | |
| Tumor size | numeric | Centimeters |
| Lymph Nodes Positive | numeric | Count |
| Grade | factor | 1 = Nottingham (Elston-Ellis) Score 3 to 5; 2 = Nottingham Score 6 to 7; 3 = Nottingham Score 8 to 9 |
| ER (Estrogen Receptor) IHC Status | factor | Negative, Positive |
| HER2 SNP6 State | factor | NEUT, GAIN, LOSS |
| Treatment received | factor | CT, CT/HT, CT/HT/RT, CT/RT, HT, HT/RT, NONE, RT |
| ER Expression | factor | + |
| PR Expression | factor | + |
| HER2 Expression | factor | + |

Age at Diagnosis: age of patient at diagnosis of disease.

Tumor size: size of tumor in cm.

Lymph Nodes Positive: This covariate is one component of the standard 'TNM' classification of breast cancer. In this case it refers to the 'N' component. The number of lymph nodes involved is prognostic. After a patient undergoes surgical 'interrogation' of her axillary nodal basin, she can be staged as:
N0 = no axillary node metastases identified
N1 = 1 - 3 nodes exhibiting metastases
N2 = 4 - 9 nodes being 'positive'
N3 > 10 nodes being 'positive'

Grade: This is a semi-quantitative measure that is a composite of three histopathologic characteristics seen under a microscope by a pathologist. It therefore should not be interpreted as a continuous variable, but should be treated as a categorical variable. The components include measures of tubule formation, mitotic count, and nuclear pleomorphism.

Estrogen receptor (ER) immunohistochemistry status: Estrogen receptor status is the 'original' molecular marker in breast cancer. It is prognostic and predictive.
It can be measured by IHC (immunohistochemistry) or hormone-binding assay. Both methods are currently in common use.

HER2 SNP6 State: A call as provided by METABRIC, using the SNP6.0 data. Again, HER2 status is an important component of clinical decision making.

Treatment received: CT: Chemotherapy. HT: Hormonal Therapy. RT: Radiation Therapy.

ER Expression: METABRIC provided in their supplemental tables a call based on their expression data. ER status is an important component of clinical decision making.

PR Expression: As provided by METABRIC as a dichotomous call based on their expression data.

HER2 Expression: A call as provided by METABRIC, using the expression data.

This table is available at the Breast Cancer Challenge support page (https://sagebionetworks.jira.com/wiki/display/BCC).

Figure S1. Kaplan-Meier cumulative survival curves of two breast cancer groups defined by the expression of the BIRC5 module over a 15-year period on the five breast cancer subtypes, the whole METABRIC dataset and the OsloVal dataset, respectively. The two patient groups in each plot were defined by partitioning the patients into two equal-sized sets using the median value of the averaged gene expression profile of the BIRC5 module.

Figure S2. KM curves of the MCM10 module.

Figure S3. KM curves of the AURKA module.

Figure S4. KM curves of the RECQL5 module.

Figure S5. KM curves of the POLQ module.

Figure S6. KM curves of the CEP55 module.

Figure S7. KM curves of the IL18RAP module.

Figure S8. Distributions of tumor grade, ER status, lymph node status and low-risk tumors among the five breast cancer subtypes and the whole METABRIC dataset. More than 50% of luminal A and normal-like samples were lymph node negative, which was not the case for the other three subtypes. 25% of luminal A and 36% of normal-like samples were grade 3, compared with 89% of basal-like, 72% of HER2+ and 54% of luminal B samples.
Low-risk tumors are defined as ER-positive, lymph node-negative and low-grade tumors. The normal-like and luminal A subtypes contained more “low risk” patients than the other three subtypes.

Figure S9. Distributions of tumor cellularity among the five breast cancer subtypes and the METABRIC dataset.

Figure S10. Distributions of histological type among the five breast cancer subtypes.
DCIS: ductal carcinoma in situ.
IDC: invasive ductal carcinoma.
MED: medullary.
MUC: mucinous.
TUB: tubular.
ILC: invasive lobular carcinoma.
INVASIVE TUMOUR MIXED NST AND A SPECIAL TYPE: NST stands for 'no special type', so this term is probably a grab bag for tumors of mixed pathologic features. It is unlikely that the few tumors in this category are pathologically identical (at least by histology).
OTHER: a grab bag term, as above.
OTHER INVASIVE: a grab bag term, but the pathologists want to reinforce that there is an invasive component.
PHYL: 'phyllodes' tumors, which are a separate category from carcinomas. They span the range of benign, borderline, and malignant disease. What they have in common is that they grow rapidly but remain confined to the breast.
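Two computations recur throughout this supplement: the concordance index (CI) reported in Tables S1, S2 and S6-S8, and the median split of the averaged module expression that defines the two patient groups in Figures S1-S7. The Python sketch below illustrates both under stated assumptions. It assumes Harrell's pairwise definition of the CI, which the supplement does not state explicitly, and the function names `concordance_index` and `median_split` as well as the toy data are hypothetical and not taken from the original analysis.

```python
from statistics import mean, median


def concordance_index(times, events, risk_scores):
    """Harrell's C (assumed definition): share of comparable pairs ranked correctly.

    times: follow-up times; events: 1 if the event was observed, 0 if censored;
    risk_scores: a higher score means a higher predicted risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i has an observed event
            # strictly before the follow-up time of patient j.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def median_split(module_expression):
    """Label each patient 'high' or 'low' by the averaged module expression.

    module_expression: one list of gene expression values per patient.
    Groups are equal-sized only when no score equals the median.
    """
    scores = [mean(genes) for genes in module_expression]
    cutoff = median(scores)
    return ["high" if s > cutoff else "low" for s in scores]


# Toy data, invented purely for illustration.
times = [2, 4, 6, 8, 10]
events = [1, 1, 0, 1, 0]
risk = [0.9, 0.4, 0.7, 0.3, 0.1]
print(concordance_index(times, events, risk))  # 7 of 8 comparable pairs: 0.875

expression = [[1.0, 2.0], [2.0, 2.0], [3.0, 3.0], [4.0, 5.0]]
print(median_split(expression))  # ['low', 'low', 'high', 'high']
```

A CI of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the tabulated values should be read. The quadratic pair loop is chosen for clarity rather than speed.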