Supplementary Material
Aberrant RNA-splicing in cancer – expression changes and driver mutations of splicing factor genes
Anita Sveen, Sami Kilpinen, Anja Ruusulehto, Ragnhild A. Lothe, Rolf I. Skotheim
Supplementary Methods
Analysis of differential splicing events among cancer types
Principal components analysis (PCA) of differential splicing events among cancer types was performed using the PCA function in the FactoMineR package (Husson, Josse, Le, and Mazet, 2015) in R version 3.0.1, based on data downloaded from the TCGA SpliceSeq database (http://projects.insilico.us.com/TCGASpliceSeq/). This database contains RNA-sequencing data from 15 randomly selected samples from each of 24 different cancer types sequenced by the TCGA (Supplementary Figure 1B). These data have been analyzed for alternative splicing using the SpliceSeq algorithm, which calculates a percent-spliced-in (PSI) value for all possible splicing events in all samples. PSI is the ratio of normalized read counts indicating inclusion of a transcript element over the total normalized reads for that event. The downloaded data include the average PSI values for each cancer type for the 3 911 splicing events with the largest variation in PSI values across the tumor samples. Splicing events annotated as “alternate acceptor sites”, “alternate donor sites”, “exon skip”, “retained intron”, or “mutually exclusive exons” were included. The PCA of these data thus compares the cancer types with respect to global splicing patterns in the 15 randomly selected samples from each cancer type. The results are shown in Supplementary Figure 1B.
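As an illustration of the PSI definition above, the following minimal Python sketch computes PSI from inclusion and total normalized read counts and ranks splicing events by the variance of PSI across samples. The event names, the read counts, and the helper names `psi` and `top_variable_events` are hypothetical. The original analysis relied on SpliceSeq output and R, not on this code.

```python
from statistics import pvariance

def psi(inclusion_reads, total_reads):
    """Percent spliced in: inclusion reads over total reads for one event."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return inclusion_reads / total_reads

def top_variable_events(psi_by_event, n_top):
    """Return the n_top event names with the largest PSI variance across samples."""
    ranked = sorted(
        psi_by_event,
        key=lambda ev: pvariance(psi_by_event[ev]),
        reverse=True,
    )
    return ranked[:n_top]

# Hypothetical (inclusion, total) normalized read counts: three events, four samples.
counts = {
    "event_A": [(90, 100), (10, 100), (50, 100), (80, 100)],
    "event_B": [(50, 100), (52, 100), (49, 100), (51, 100)],
    "event_C": [(20, 100), (60, 100), (30, 100), (40, 100)],
}
psi_by_event = {ev: [psi(i, t) for i, t in pairs] for ev, pairs in counts.items()}
top_two = top_variable_events(psi_by_event, n_top=2)
print(top_two)
```

In this toy input, event_B is nearly constant across samples and is therefore dropped when only the two most variable events are kept.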
Gene expression analysis
Gene expression data
Using the MediSapiens database (http://ist.medisapiens.com), a comprehensive and integrated collection of microarray mRNA expression data normalized across all included samples, we compared splicing factor gene expression in 8 362 cancer samples from 21 different cancer types and 876 samples from corresponding normal tissues (Supplementary Table 2). Based on Ensembl gene identifiers, expression data for the complete sample set were retrieved for 261 of the 347 splicing factor genes.
Principal components analysis
PCA was performed to compare all 21 cancer types and their corresponding normal samples with regard to splicing factor gene expression, using the prcomp function in R version 3.0.1. For the seven cancer types represented by more than 500 samples each (acute lymphocytic leukemia, acute myeloid leukemia, breast carcinoma, chronic lymphocytic leukemia, colorectal adenocarcinoma, lung adenocarcinoma, and lymphoma), a random subset of samples equal in size to the corresponding normal sample set was selected (overview in Supplementary Table 2). The PCA results are plotted in two dimensions (principal component, PC, one versus two) for each cancer type (Supplementary Figure 3). To assess the separation of cancer and normal samples based on the expression levels of splicing factor genes, independent samples t-tests comparing the cancer and corresponding normal samples were performed for both PC1 and PC2. In most cancer types (all except acute lymphocytic leukemia, glioblastoma, and lung adenocarcinoma), cancer and normal samples were significantly separated along PC1 and/or PC2 (P < 0.005; Supplementary Table 2).
Differential gene expression analysis
Differential gene expression analysis was performed by F-tests in one-way analysis of variance using the Partek Genomics Suite software (Partek Inc., St. Louis, MO, U.S.A.).
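The F-test mentioned above compares between-group and within-group variability for each gene. A minimal sketch of the F statistic for one gene in a cancer-versus-normal comparison is given below. The expression values and the function name `one_way_f` are hypothetical, and the code stands in for the Partek Genomics Suite implementation rather than reproducing it.

```python
from statistics import mean

def one_way_f(*groups):
    """F statistic of a one-way ANOVA: between-group over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Hypothetical log2 expression values of a single gene.
cancer = [8.1, 8.4, 8.9, 8.6, 8.5]
normal = [7.2, 7.5, 7.1, 7.4, 7.3]
f_value = one_way_f(cancer, normal)
print(round(f_value, 2))
```

With only two groups, as here, the F statistic equals the square of the pooled-variance two-sample t statistic, so ranking genes by F is equivalent to ranking them by the absolute t value.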
To identify the most differentially expressed genes, the genes were ranked by signal-to-noise ratio (F-values).
Adjusting for sample size and multiple comparisons
For eleven of the 21 cancer types (acute lymphocytic leukemia, acute myeloid leukemia, breast carcinoma, chronic lymphocytic leukemia, clear cell renal cell carcinoma, colorectal adenocarcinoma, lung adenocarcinoma, lung squamous cell carcinoma, lymphoma, prostate cancer, and uterine cancer), both the cancer and the corresponding normal sample sets comprise enough samples to allow high-confidence differential gene expression analysis. For these eleven cancer types, the significance level calculated from statistical power analysis is lower than 0.05, given an effect size of 0.5, a statistical power (sensitivity) of 0.8, and the numbers of samples representing the cancer and corresponding normal tissues (Supplementary Table 2). Statistical power analysis was conducted in R using the pwr.t2n.test function in the pwr package (Stephane Champely, 2012). The calculated significance level was then used as the threshold for calling differentially expressed genes within each cancer type. This accounts for the varying numbers of samples in each dataset and thus allows the number of differentially expressed genes to be compared among cancer types. Before thresholding, P-values from the differential expression analysis were adjusted for multiple comparisons by Bonferroni correction using the p.adjust function in R.
Variation in splicing factor gene expression
Cancer and corresponding normal samples were compared with regard to variance in the gene expression levels of all splicing factors (n = 261). To avoid potential bias from analyzing different numbers of samples, 41 samples were randomly selected from each sample set (41 being the number of blood myeloid cell samples, the smallest sample set).
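The sample-size-dependent significance level described under “Adjusting for sample size and multiple comparisons” can be approximated without the pwr package. The sketch below is a stand-in under stated assumptions: it replaces the noncentral t distribution used by pwr.t2n.test with a normal approximation, and the function names `significance_level` and `bonferroni` are hypothetical. It also shows the Bonferroni adjustment, which multiplies each P-value by the number of tests and caps the result at 1. The raw P-values are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def significance_level(n1, n2, effect_size=0.5, power=0.8):
    """Two-sided alpha at which a two-sample test reaches the given power.

    Normal approximation (an assumption of this sketch):
    power ~ Phi(d * sqrt(n1*n2/(n1+n2)) - z_(1 - alpha/2)).
    """
    z = NormalDist()
    noncentrality = effect_size * sqrt(n1 * n2 / (n1 + n2))
    z_crit = noncentrality - z.inv_cdf(power)
    return 2.0 * (1.0 - z.cdf(z_crit))

def bonferroni(p_values):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Sample sizes (cancer, normal) as listed in Supplementary Table 2.
alpha_all = significance_level(1462, 88)     # acute lymphocytic leukemia
alpha_bladder = significance_level(77, 10)   # bladder cancer

# Hypothetical raw P-values for four genes.
adjusted = bonferroni([1e-6, 0.0004, 0.02, 0.6])
called = [p < alpha_all for p in adjusted]
print(alpha_all, alpha_bladder, adjusted, called)
```

For these two rows, the approximation gives roughly 0.0002 and roughly 0.5, which agrees with the values tabulated in Supplementary Table 2. Small deviations from the exact t-based calculation are to be expected, particularly for small sample sets.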
Only the eleven cancer types identified by statistical power analysis as having enough samples for high-confidence differential gene expression analysis were included. The variance in the expression level of each gene was calculated separately across the samples of each cancer type and of each normal sample type. Cancer and corresponding normal samples were then compared for expression variance across all splicing factor genes by paired samples t-tests.
Somatic mutation analysis
Pan-cancer mutation frequency in splicing factor genes and random gene sets
For comparison of somatic mutation frequencies in the 347 splicing factor genes, ten sets of 347 randomly selected protein-coding genes were generated (selected among the 19 008 genes in the HUGO Gene Nomenclature Committee (HGNC) list of protein-coding genes, downloaded from http://www.genenames.org/cgi-bin/statistics). Pan-cancer mutation frequencies in the TCGA pan-cancer data were accessed for all genes from the IntOGen web portal (http://www.intogen.org/web/tcga/). The mutation frequency of the splicing factor genes (median 0.5%) was not significantly higher than that of the ten random gene sets (medians ranging from 0.46% to 0.56%; assessed by independent samples t-test).
Prediction of cancer-critical genes
The IntOGen web portal includes results from analyzing the TCGA pan-cancer somatic mutation data with five different algorithms for predicting cancer-relevant mutations. The predictions of cancer relevance are based on mutation frequency (MuSiC), specific mutation patterns (ActiveDriver, mutations in protein active sites; OncodriveFM, accumulation of mutations with high functional impact; OncodriveCLUST, significantly clustered mutations), and a combination of signals for positive selection of mutations in cancer (MutSig).
Predictions from both the pan-cancer analysis and the analyses of individual cancer types were accessed from the IntOGen web portal. The pan-cancer results were used to compare the numbers of predicted cancer-critical genes in the splicing factor gene set, the ten random gene sets, and genome-wide. Genes are considered cancer-critical when predicted as a potential cancer driver by at least one of the five algorithms. Twenty-four (7%) of the 347 splicing factor genes were identified as cancer-critical in the pan-cancer analyses, compared with 533 (3%) of the 18 161 genes included in both HGNC and IntOGen, and a median of 10.5 (3%) genes in the ten random gene sets.
Supplementary Table 1. Genes annotated with splicing-related Gene Ontology terms
Gene Ontology accession | Gene Ontology term | Number of genes | Gene symbols
GO:0008380 | RNA splicing | 245 | A8MWD9, ACIN1, AFF2, AKAP17A, ALYREF, ARL6IP4, BCAS2, BRDT, C1QBP, CASC3, CCAR1, CCAR2, CD2BP2, CDC40, CDK12, CELF3, CIR1, CLASRP, CLP1, CPSF1, CPSF2, CPSF3, CPSF7, CSTF1, CSTF2, CSTF3, CTNNBL1, DDX23, DDX39B, DDX46, DDX47, DHX15, DHX16, DHX38, DHX8, DHX9, DNAJC8, EFTUD2, ESRP1, ESRP2, FUS, GEMIN2, GTF2F1, GTF2F2, HNRNPA0, HNRNPA1, HNRNPA1L2, HNRNPA2B1, HNRNPA3, HNRNPC, HNRNPD, HNRNPF, HNRNPH1, HNRNPH2, HNRNPH3, HNRNPK, HNRNPL, HNRNPM, HNRNPR, HNRNPU, HNRNPUL1, HSPA8, IVNS1ABP, IWS1, JMJD6, KHSRP, KIAA1429, LGALS3, LSM1, LSM10, LSM2, LSM4, LSM5, LSM6, LUC7L3, MAGOH, MAGOHB, MBNL1, MBNL2, MBNL3, MPHOSPH10, NCBP1, NCBP2, NHP2L1, NOL3, NONO, NOVA1, NSRP1, NUDT21, PABPN1, PAPOLA, PCBP1, PCBP2, PCF11, PDCD7, PHF5A, POLR2A, POLR2B, POLR2C, POLR2D, POLR2E, POLR2F, POLR2G, POLR2H, POLR2I, POLR2J, POLR2K, POLR2L, PPAN, PPARGC1A, PPIG, PPP1R8, PPP1R9B, PPP2CA, PPP2R1A, PPP4R2, PRPF18, PRPF3, PRPF38A, PRPF38B, PRPF39, PRPF4, PRPF40A, PRPF40B, PRPF4B, PRPF6, PRPF8, PTBP1, PTBP3, PUF60, QKI, RBFOX1, RBFOX2, RBFOX3, RBM10, RBM11, RBM15B, RBM17, RBM20, RBM25, RBM28, RBM38, RBM39, RBM4, RBM4B, RBM5, RBM8A, RBMX, RBMXL1, RBMY1A1, RBMY1B, RBMY1C, RBMY1D, RBMY1E, RBMY1F, RNPC3, RNPS1, RP9, RRAGC, RSRC1, SAP18, SCAF1, SCAF11, SCAF8, SCNM1, SF3A1, SF3A2, SF3A3, SF3B1, SF3B14, SF3B2, SF3B3, SF3B4, SF3B5, SFPQ, SMC1A, SMNDC1, SNRNP200, SNRNP25, SNRNP27, SNRNP35, SNRNP40, SNRNP48, SNRNP70, SNRPA, SNRPA1, SNRPB, SNRPB2, SNRPD1, SNRPD2, SNRPD3, SNRPE, SNRPF, SNRPG, SNRPN, SON, SREK1, SREK1IP1, SRPK1, SRPK2, SRRM1, SRRM4, SRSF1, SRSF11, SRSF2, SRSF3, SRSF4, SRSF5, SRSF6, SRSF7, SRSF8, SRSF9, STRAP, SUGP1, SUGP2, SUPT6H, SYNCRIP, TARDBP, THOC1, THOC2, THOC3, THOC5, THOC6, THOC7, THRAP3, TTF2, TXNL4A, TXNL4B, U2AF1, U2AF1L4, U2AF2, UPF3B, USB1, USP39, WBP11, WT1, WTAP, YBX1, YTHDC1, ZCRB1, ZMAT5, ZNF326, ZNF638, ZRANB2, ZRSR2
GO:0000244 | Spliceosomal tri-snRNP complex assembly | 6 | CD2BP2, DDX20, PRPF31, PRPF6, SRSF10, SRSF12
GO:0000245 | Spliceosomal complex assembly | 19 | CRNKL1, DDX1, DDX39B, GEMIN2, GEMIN6, PRPF19, PRPF6, RBM5, SCAF11, SF1, SMN1, SNRPD1, SNRPD2, SNRPE, SNRPG, SRPK2, TXNL4A, USP39, ZRSR2
GO:0000354 | Cis assembly of pre-catalytic spliceosome | 2 | DDX23, SNRNP200
GO:0000375 | RNA splicing, via transesterification reactions | 25 | BCAS2, DBR1, DDX23, GEMIN2, KHSRP, LSM1, MPHOSPH10, PRPF3, PRPF4, PRPF6, PRPF8, SCAF11, SF3A3, SF3B1, SF3B3, SF3B4, SLU7, SMNDC1, SNRNP40, SRRM1, SRSF10, SRSF4, TRA2B, TXNL4A, WDR83
GO:0000379 | tRNA-type intron splice site recognition and cleavage | 1 | TSEN34
GO:0000380 | Alternative mRNA splicing, via spliceosome | 9 | CDK13, CELF4, HNRNPA1, HNRNPM, MBNL1, PTBP1, RSRC1, SFPQ, SLU7
GO:0000381 | Regulation of alternative mRNA splicing, via spliceosome | 24 | CELF3, CELF4, CELF6, DDX5, MAGOH, MBNL1, MBNL2, MYOD1, NSRP1, PTBP1, RBFOX2, RBFOX3, RBM15B, RBM25, RBM4, RBM5, RBM8A, RBMX, RBMY1A1, RNPS1, SAP18, SRSF12, THRAP3, TRA2B
GO:0000387 | Spliceosomal snRNP assembly | 26 | CLNS1A, DDX20, GEMIN2, GEMIN4, GEMIN5, GEMIN6, GEMIN7, GEMIN8, NCBP1, NCBP2, PHAX, PRMT5, PRMT7, SART1, SMN1, SNRPB, SNRPC, SNRPD1, SNRPD2, SNRPD3, SNRPE, SNRPF, SNRPG, SNUPN, TGS1, WDR77
GO:0000389 | mRNA 3'-splice site recognition | 5 | SF1, SF3A1, SF3A2, SF3A3, SLU7
GO:0000395 | mRNA 5'-splice site recognition | 4 | PSIP1, SNRPC, SRSF1, SRSF12
GO:0000398 | mRNA splicing, via spliceosome | 169 | ALYREF, AQR, BUD13, CACTIN, CCAR1, CD2BP2, CDC40, CDC5L, CELF3, CLP1, CPSF1, CPSF2, CPSF3, CPSF7, CRNKL1, CSTF1, CSTF2, CSTF3, CWC15, CWC22, CWC27, DBR1, DDX23, DDX39A, DDX39B, DDX41, DDX5, DGCR14, DHX35, DHX38, DHX8, DHX9, DNAJC8, EFTUD2, EIF4A3, FRG1, FUS, GEMIN5, GEMIN6, GEMIN7, GPATCH1, GTF2F1, GTF2F2, HNRNPA0, HNRNPA1, HNRNPA2B1, HNRNPA3, HNRNPC, HNRNPD, HNRNPF, HNRNPH1, HNRNPH2, HNRNPH3, HNRNPK, HNRNPL, HNRNPM, HNRNPR, HNRNPU, HNRNPUL1, ISY1, LSM2, LSM3, LSM7, MAGOH, NAA38, NCBP1, NCBP2, NHP2L1, NOVA1, NUDT21, PABPC1, PABPN1, PAPOLA, PAPOLB, PCBP1, PCBP2, PCF11, PHF5A, PLRG1, PNN, POLR2A, POLR2B, POLR2C, POLR2D, POLR2E, POLR2F, POLR2G, POLR2H, POLR2I, POLR2J, POLR2K, POLR2L, PPIE, PPIH, PPIL1, PPIL3, PPWD1, PRPF19, PRPF3, PRPF31, PRPF4, PRPF4B, PRPF6, PRPF8, PTBP1, RALY, RBM22, RBM5, RBM8A, RBMX, RNPS1, RSRC1, SART1, SF1, SF3A1, SF3A2, SF3A3, SF3B1, SF3B14, SF3B2, SF3B3, SF3B4, SF3B5, SKIV2L2, SLU7, SMC1A, SNRNP200, SNRNP40, SNRNP70, SNRPA, SNRPA1, SNRPB, SNRPB2, SNRPC, SNRPD1, SNRPD2, SNRPD3, SNRPE, SNRPF, SNRPG, SNW1, SRRM1, SRRM2, SRSF1, SRSF10, SRSF11, SRSF2, SRSF3, SRSF4, SRSF5, SRSF6, SRSF7, SRSF9, SUGP1, SYF2, SYNCRIP, TFIP11, TRA2A, TRA2B, TXNL4A, U2AF1, U2AF2, UPF3B, USP49, WDR83, XAB2, YBX1, ZCCHC8, ZRSR2
GO:0006376 | mRNA splice site selection | 15 | CELF1, CELF2, CELF4, LUC7L, LUC7L2, LUC7L3, MBNL1, PTBP2, RBMX, SFSWAP, SRSF1, SRSF10, SRSF5, SRSF6, SRSF9
GO:0006388 | tRNA splicing, via endonucleolytic cleavage and ligation | 7 | CLP1, RTCB, TRPT1, TSEN15, TSEN2, TSEN34, TSEN54
GO:0033119 | Negative regulation of alternative splicing | 5 | PTBP1, PTBP2, PTBP3, RPS13, RPS26
GO:0033120 | Positive regulation of alternative splicing | 3 | HNRNPLL, RBM20, RBM22
GO:0043484 | Regulation of RNA splicing | 24 | AFF2, AKAP17A, BRDT, CELF1, CLK1, CLK2, CLK3, CLK4, ESRP1, ESRP2, FASTK, HNRNPF, HNRNPH1, MBNL1, MBNL2, MBNL3, MYOD1, RBFOX1, RBFOX2, RBFOX3, RBM38, SNRNP70, SON, SRRM4
GO:0045292 | mRNA cis splicing, via spliceosome | 6 | DCPS, NCBP1, NCBP2, NCBP2L, RBM22, WBP4
GO:0048024 | Regulation of mRNA splicing, via spliceosome | 5 | CWC22, JMJD6, SRPK1, SRPK2, TIA1
GO:0048025 | Negative regulation of mRNA splicing, via spliceosome | 15 | ACIN1, C1QBP, HNRNPA2B1, PTBP1, RBMX, RNPS1, SAP18, SFSWAP, SRSF10, SRSF12, SRSF4, SRSF6, SRSF7, SRSF9, U2AF2
GO:0048026 | Positive regulation of mRNA splicing, via spliceosome | 7 | CELF3, CELF4, PRPF19, RBMX, SNW1, THRAP3, TRA2B
GO:0070055 | HAC1-type intron splice site recognition and cleavage | 1 | ERN1
GO:0005681 | Spliceosomal complex | 93 | A8MWD9, AKAP17A, API5, AQR, BCAS2, CDC40, CRNKL1, CTNNBL1, CWC15, CWC22, DDX20, DDX39B, DHX8, EFTUD2, GEMIN2, GEMIN4, GEMIN5, GEMIN6, GEMIN7, GEMIN8, HNRNPA1, HNRNPA1L2, HNRNPA2B1, HNRNPC, HNRNPDL, HNRNPM, HNRNPR, HSPA8, IVNS1ABP, KIAA1967, LGALS3, LSM4, LSM5, LSM6, LSM7, NAA38, NHP2L1, PPIH, PPP1R8, PRPF18, PRPF3, PRPF38A, PRPF38B, PRPF4, PRPF6, PRPF8, PTBP2, RBM17, RBM28, RBM5, RBMX, RHEB, SCAF8, SF1, SF3A1, SF3A2, SF3A3, SF3B1, SF3B2, SF3B3, SF3B4, SLU7, SMN1, SMNDC1, SNRNP200, SNRNP70, SNRPA, SNRPA1, SNRPB, SNRPB2, SNRPD2, SNRPD3, SNRPE, SNRPF, SNRPG, SNRPN, SNW1, SREK1, SRRM1, STRAP, SUGP1, TFIP11, TTF2, TXNL4A, TXNL4B, U2AF1, U2AF1L4, U2AF2, USP39, WAC, WBP4, WDR83, ZNF326
GO:0000243 | Commitment complex | 1 | SNRPC
GO:0005684 | U2-type spliceosomal complex | 2 | PRPF31, SF3A1
GO:0005689 | U12-type spliceosomal complex | 24 | DHX15, PDCD7, PHF5A, RNPC3, SF3B1, SF3B14, SF3B2, SF3B3, SF3B4, SF3B5, SNRNP25, SNRNP35, SNRNP48, SNRPB, SNRPD1, SNRPD2, SNRPD3, SNRPE, SNRPF, SNRPG, YBX1, ZCRB1, ZMAT5, ZRSR2
GO:0044530 | Supraspliceosomal complex | 3 | ADAR, RBMX, UPF1
GO:0071004 | U2-type prespliceosome | 1 | SNRPC
GO:0071013 | Catalytic step 2 spliceosome | 80 | ALYREF, AQR, CACTIN, CDC40, CDC5L, CRNKL1, CWC15, CWC22, CWC27, DDX23, DDX41, DDX5, DGCR14, DHX35, DHX38, DHX8, EFTUD2, EIF4A3, FRG1, GPATCH1, HNRNPA1, HNRNPA2B1, HNRNPA3, HNRNPC, HNRNPF, HNRNPH1, HNRNPK, HNRNPM, HNRNPR, HNRNPU, ISY1, LSM2, LSM3, MAGOH, PABPC1, PLRG1, PNN, PPIE, PPIL1, PPIL3, PPWD1, PRPF19, PRPF4B, PRPF6, PRPF8, RALY, RBM22, RBM8A, RBMX, SART1, SF3A1, SF3A2, SF3A3, SF3B1, SF3B2, SF3B3, SKIV2L2, SLU7, SNRNP200, SNRNP40, SNRPA1, SNRPB, SNRPB2, SNRPD1, SNRPD2, SNRPD3, SNRPE, SNRPF, SNRPG, SNW1, SRRM1, SRRM2, SRSF1, SYF2, SYNCRIP, TFIP11, U2AF1, WDR83, XAB2, ZCCHC8
Supplementary Table 2. Samples included in gene expression analyses of splicing factors
Cancer type | No. of cancer samples | Corresponding normal: sample origin | Corresponding normal: no. of samples | PCA (cancer versus normal)¹: PC1 | PCA (cancer versus normal)¹: PC2 | Significance level, statistical power analysis² | No. of differentially expressed splicing factor genes³
Acute lymphocytic leukemia | 1 462 | Blood lymphoid cell⁴ | 88 | 0.5 | 0.05 | 0.0002 | 91 (35%)
Acute myeloid leukemia | 1 076 | Blood myeloid cell⁵ | 41 | 1E-13 | 0.005 | 0.02 | 142 (54%)
Bladder cancer | 77 | Bladder | 10 | 0.004 | 0.001 | 0.5 | NA
Breast carcinoma | 1 502 | Breast | 88 | 2E-17 | 0.0005 | 0.0002 | 185 (71%)
Cervical cancer | 65 | Cervix | 11 | 0.01 | 0.0002 | 0.5 | NA
Chronic lymphocytic leukemia | 556 | Blood lymphoid cell⁴ | 88 | 8E-32 | 0.1 | 0.0005 | 133 (51%)
Chronic myeloid leukemia | 76 | Blood myeloid cell⁵ | 41 | 2E-06 | 6E-20 | 0.09 | NA
Clear cell renal cell carcinoma | 188 | Kidney | 73 | 0.07 | 6E-09 | 0.006 | 93 (36%)
Colorectal adenocarcinoma | 559 | Colon | 47 | 6E-10 | 0.02 | 0.01 | 82 (31%)
Glioblastoma multiforme | 268 | Brain | 30 | 0.4 | 0.005 | 0.08 | NA
Head and neck squamous cell carcinoma | 44 | Oral cavity | 77 | 5E-49 | 0.2 | 0.07 | NA
Hepatocellular carcinoma | 107 | Liver | 22 | 1E-07 | 0.03 | 0.2 | NA
Lung adenocarcinoma | 840 | Lung⁶ | 144 | 0.007 | 0.06 | 0.00009 | 78 (30%)
Lung squamous cell carcinoma | 127 | Lung⁶ | 144 | 1E-06 | 3E-19 | 0.001 | 85 (33%)
Lymphoma | 576 | Blood lymphoid cell⁴ | 88 | 8E-18 | 0.06 | 0.0005 | 138 (53%)
Ovarian serous adenocarcinoma | 287 | Ovary | 12 | 8E-08 | 0.0005 | 0.4 | NA
Pancreatic cancer | 51 | Pancreas | 20 | 3E-05 | 0.002 | 0.3 | NA
Prostate cancer | 185 | Prostate | 70 | 0.2 | 0.0002 | 0.007 | 18 (7%)
Testicular cancer | 105 | Testis | 18 | 8E-15 | 0.3 | 0.3 | NA
Thyroid carcinoma | 50 | Thyroid gland | 18 | 0.9 | 7E-09 | 0.3 | NA
Uterine cancer | 161 | Uterus | 107 | 0.4 | 1E-44 | 0.002 | 114 (44%)
Total | 4 322 | | 876 | | | |
¹ P-values from independent samples t-tests comparing principal components (PC) between cancer and normal samples. For the seven cancer types with more than 500 samples, a random selection of as many samples as in the corresponding normal set was included in the analyses.
² Significance level calculated from statistical power analysis, given effect size 0.5, statistical power (sensitivity) 0.8, and the number of samples representing the individual cancer types and corresponding normal tissues. Only the eleven cancer types represented by sufficient sample numbers to allow P < 0.05 were included in differential gene expression analyses between cancer and corresponding normal samples.
³ Number of genes (of 261 splicing factor genes in total) with P-values from F-tests in one-way analysis of variance lower than the significance level calculated from statistical power analysis, after adjusting for multiple comparisons by Bonferroni correction. NA denotes the tissue types with a significance level from statistical power analysis > 0.05.
⁴,⁵,⁶ Same samples.
Supplementary Table 3.
The ten most differentially expressed splicing factor genes in different cancer types
Cancer type | Differentially expressed genes¹: upregulated in cancer | Differentially expressed genes¹: downregulated in cancer
Acute lymphocytic leukemia | DDX41, HNRNPL, SNRPA | CSTF1, DDX23, PPIE, PRPF38B, SNRNP35, SREK1IP1, ZRSR2
Acute myeloid leukemia | HNRNPA1, HNRNPM, SNRPA, SYNCRIP | ADAR, CD2BP2, ERN1, SRPK2, TSEN34, UPF1
Breast carcinoma | - | CELF2, CELF3, GEMIN7, HNRNPL, MBNL3, MYOD1, NCBP1, QKI, RBFOX1, RBM4B
Chronic lymphocytic leukemia | CLASRP, SNRPA | HSPA8, IVNS1ABP, PRPF38B, PSIP1, RBM28, SF3B3, SREK1IP1, UPF3B
Clear cell renal cell carcinoma | CELF2, MBNL1, NOL3, SNRPD1 | C1QBP, ESRP2, PABPN1, POLR2L, PPARGC1A, WT1
Colorectal adenocarcinoma | NONO, SF3B3, SNRPB, SNRPC, SNRPD2, SYNCRIP | LGALS3, MBNL3, NOVA1, PPARGC1A
Lung adenocarcinoma | DHX8, ESRP1, ESRP2, HNRNPC, SCNM1, USP39 | POLR2A, PRPF8, SAP18, UPF1
Lung squamous cell carcinoma | DDX39A, ESRP1, NCBP2, PTBP3, SCNM1, SNRPB, USP39 | LGALS3, SAP18, SRSF5
Lymphoma | - | BCAS2, CELF2, DDX47, JMJD6, RBM22, SAP18, SRSF11, SRSF5, SRSF6, SYF2
Prostate cancer | BCAS2, CLNS1A, PPIH, PPP2CA, SNRPD1, SNW1, THOC7, WBP11 | LGALS3, RBM4B
Uterine cancer | BCAS2, ESRP1, SYNCRIP | HNRNPDL, NOVA1, PABPN1, PCBP1, SNRNP70, TRA2A, WT1
¹ Top ten differentially expressed genes between cancer and corresponding normal samples (ranked by signal-to-noise ratio from F-tests in one-way analysis of variance).
Supplementary Table 4. Splicing factor genes included in the Cancer Gene Census
Gene symbol | Gene Ontology accession
CDK12 | GO:0008380
DDX5 | GO:0000381, GO:0000398, GO:0071013
FUS | GO:0000398, GO:0008380
HNRNPA2B1 | GO:0000398, GO:0008380, GO:0048025, GO:0005681, GO:0071013
NONO | GO:0008380
PPP2R1A | GO:0008380
SF3B1 | GO:0000375, GO:0000398, GO:0005681, GO:0005689, GO:0008380, GO:0071013
SFPQ | GO:0000380, GO:0008380
SRSF2 | GO:0000398, GO:0008380
THRAP3 | GO:0000381, GO:0008380, GO:0048026
U2AF1 | GO:0000398, GO:0005681, GO:0008380, GO:0071013
WT1 | GO:0008380
ZRSR2 | GO:0000245, GO:0000398, GO:0005689, GO:0008380
Supplementary Figure 1. Genome-wide aberrant splicing across cancer types.
A) Distribution of the median number of cancer-specific splicing events (including cassette exons, competing 5’ and 3’ splice sites, and retained introns) in each of 15 cancer types. The data are summarized from Table 1 published by Dvinge and Bradley (Genome Med 2015; 7: 45), and are based on differential splicing events detected from RNA-sequencing of cancer and matched normal samples (difference in isoform ratio of ≥10%) from 636 patients (sequenced by TCGA). Patients with acute myeloid leukemia were excluded because matched normal samples were not available. The median number of cancer-specific splicing events across the cancer types was 1 285 (ranging from 726 in liver cancer to 1 796 in breast cancer).
B) Principal components analysis based on the median splicing score per tumor type of the 3 911 splicing events with the largest variation in splicing scores across tumor samples shows that leukemias (purple) and brain cancers (dark blue) have distinct splicing patterns compared with the other cancer types. The median splicing score of each event in 15 samples from each cancer type was downloaded from the TCGA SpliceSeq database (http://projects.insilico.us.com/TCGASpliceSeq). The colour coding is the same for parts A) and B).
ACC, adrenocortical carcinoma; BLCA, bladder urothelial carcinoma; BRCA, breast invasive carcinoma; CESC, cervical squamous cell carcinoma and endocervical adenocarcinoma; COAD, colon adenocarcinoma; DLBC, diffuse large B-cell lymphoma; GBM, glioblastoma multiforme; HNSC, head and neck squamous cell carcinoma; KICH, kidney chromophobe; KIRC, kidney renal clear cell carcinoma; KIRP, kidney renal papillary cell carcinoma; LAML, acute myeloid leukemia; LGG, brain lower grade glioma; LIHC, liver hepatocellular carcinoma; LUAD, lung adenocarcinoma; LUSC, lung squamous cell carcinoma; PAAD, pancreatic adenocarcinoma; PRAD, prostate adenocarcinoma; READ, rectum adenocarcinoma; SKCM, skin cutaneous melanoma; THCA, thyroid carcinoma; UCEC, uterine corpus endometrial carcinoma; UCS, uterine carcinosarcoma.
Supplementary Figure 2. Splice site mutations in cancer-relevant genes.
Sixty-two of the 513 genes in the Cancer Gene Census have been reported with somatic and/or germline mutations in splice sites in various cancer types (red squares; three genes with mutations only in benign tumors are excluded). Both cancer types and genes are sorted by the number of mutations. The column marked “Other” represents a collection of rare tumor types, and numbers indicate genes found to have mutations in more than one of these. The data were downloaded from the Cancer Gene Census in November 2013 (http://cancer.sanger.ac.uk/cancergenome/projects/census/).
Supplementary Figure 3. Principal components analysis of corresponding cancer and normal samples based on the expression levels of splicing factor genes.
Cancer (red dots) and corresponding normal (blue dots) samples are separated by the expression levels of splicing factor genes (n = 261) in most of the 21 cancer types. For all cancer types except acute lymphocytic leukemia, glioblastoma, and lung adenocarcinoma, the P-values from independent samples t-tests of principal component one (horizontal axes) and/or two (vertical axes) between cancer and normal samples are lower than 0.005 (Supplementary Table 2).
Supplementary Figure 4. Variance in splicing factor gene expression.
A) In nine of eleven cancer types, the variance in gene expression levels of splicing factor genes (n = 261) is significantly lower in the cancer samples than in the corresponding normal samples (P < 0.0001 in paired samples t-tests for all nine cancer types). In contrast, colorectal adenocarcinoma samples have significantly higher variance than normal colonic mucosa samples (P = 4 × 10⁻¹¹). No difference in expression variability was found between cancer and normal samples in acute myeloid leukemia. The cancer types are sorted by largest median difference in variance between the normal and cancer samples.
B) Among the five splicing factor genes with the largest median difference in variance of expression levels, WT1 has higher variance in the cancer samples (P = 0.03, paired samples t-test of variance in the 11 cancer types compared with the corresponding normal samples), whereas PPIH, PRMT7, CELF1, and GEMIN6 have higher variance in the normal samples (P < 0.007).
Supplementary Figure 5. Splicing factor mutation frequency in the 12 cancer types analyzed in the TCGA pan-cancer project.
The mutation frequency of the 340 of the 347 (98%) splicing factor genes that are mutated in at least one cancer type is indicated in red per cancer type. Colon and rectum cancers are plotted together. Both genes (vertically) and cancer types (horizontally) are sorted alphabetically.
Supplementary Figure 6. Median mutation frequencies of splicing factor genes in individual cancer types.
The median mutation frequency of splicing factors correlates strongly with the genome-wide median mutation rate (retrieved from reference 146; Pearson correlation, r = 0.9) in the individual cancer types (uterine carcinoma is not included). This indicates that no individual cancer type is targeted by splicing factor mutations more frequently than the others. Here, the mutation frequency of splicing factor genes is calculated from the TCGA pan-cancer somatic mutation data (n = 3 205 cancers from 12 different cancer types, accessed from the IntOGen TCGA portal, reference 105), and includes 347 genes annotated with the GO terms “RNA splicing” and/or “spliceosomal complex”.
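The Pearson correlation quoted in this legend can be computed as sketched below. The per-cancer-type values are made-up placeholders rather than the data behind the figure, and `pearson_r` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Placeholder values, one entry per cancer type (not the published data).
genome_wide_rate = [0.4, 0.9, 1.5, 2.2, 3.1]      # hypothetical median mutation rates
splicing_factor_freq = [0.3, 0.6, 1.1, 1.4, 2.4]  # hypothetical median frequencies (%)
r = pearson_r(genome_wide_rate, splicing_factor_freq)
print(round(r, 3))
```

A coefficient close to 1 means that cancer types with a higher genome-wide mutation rate also show a proportionally higher splicing factor mutation frequency, which is the pattern the legend describes.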