Supplementary Material
Aberrant RNA-splicing in cancer – expression changes and driver mutations
of splicing factor genes
Anita Sveen, Sami Kilpinen, Anja Ruusulehto, Ragnhild A. Lothe, Rolf I. Skotheim
Supplementary Methods
Analysis of differential splicing events among cancer types
Principal components analysis (PCA) of differential splicing events among cancer types was
performed using the PCA-function in the FactoMineR-package (Husson, Josse, Le, and Mazet,
2015) in R version 3.0.1, based on data downloaded from the TCGA SpliceSeq database
(http://projects.insilico.us.com/TCGASpliceSeq/). This database contains RNA-sequencing data
from 15 randomly selected samples from each of 24 different cancer types sequenced by the
TCGA (Supplementary Figure 1B). These data have been analyzed for alternative splicing using
the SpliceSeq algorithm, calculating a percent-spliced-in value (PSI; ratio of normalized read
counts indicating inclusion of a transcript element over the total normalized reads for that event)
for all possible splicing events in all samples. The downloaded data include the average PSI
values for each cancer type for the 3 911 splicing events with the largest variation in PSI
values across the tumor samples. Splicing events annotated as “alternate acceptor sites”,
“alternate donor sites”, “exon skip”, “retained intron”, or “mutually exclusive exons” were
included. The PCA of these data thus compares the cancer types with respect to global splicing
patterns in the 15 randomly selected samples from each cancer type. The results are shown in
Supplementary Figure 1B.
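The PSI definition above can be illustrated with a minimal Python sketch. The function names `percent_spliced_in` and `average_psi` and the example read counts are illustrative assumptions; the read normalization performed by SpliceSeq is not reproduced here.

```python
from statistics import mean

def percent_spliced_in(inclusion_reads, total_reads):
    """PSI = normalized reads supporting inclusion / total normalized reads for the event."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= inclusion_reads <= total_reads:
        raise ValueError("inclusion_reads must lie between 0 and total_reads")
    return inclusion_reads / total_reads

def average_psi(read_pairs):
    """Average PSI of one splicing event over the samples of one cancer type."""
    return mean(percent_spliced_in(inc, tot) for inc, tot in read_pairs)

# Hypothetical (inclusion, total) normalized read counts for three samples
example = [(30.0, 40.0), (10.0, 40.0), (20.0, 40.0)]
event_psi = average_psi(example)
```

In this sketch, one averaged value per splicing event and cancer type corresponds to one cell of the matrix that enters the PCA.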
Gene expression analysis
Gene expression data
Using the MediSapiens database (http://ist.medisapiens.com), a comprehensive and integrated
collection of microarray mRNA expression data normalized across all included samples, we
compared splicing factor gene expression in 8 362 cancer samples from 21 different cancer types
and 876 samples from corresponding normal tissues (Supplementary Table 2). Based on Ensembl
gene identifiers, expression data for the complete sample set were retrieved for 261 of the 347
splicing factor genes.
Principal components analysis
PCA was performed to compare all 21 cancer types and their corresponding normal samples with
regard to splicing factor gene expression using the prcomp-function in R version 3.0.1. For each of the
seven cancer types represented by more than 500 samples (acute lymphocytic leukemia,
acute myeloid leukemia, breast carcinoma, chronic lymphocytic leukemia, colorectal
adenocarcinoma, lung adenocarcinoma, and lymphoma), a random subset of samples
equal in size to the corresponding normal sample set (overview in
Supplementary Table 2) was selected. The PCA results are plotted in two dimensions
(principal component, PC, one versus two) for each cancer type (Supplementary Figure 3). To
assess the separation of cancer and normal samples based on the expression levels of splicing
factor genes, independent samples t-tests comparing the cancer and corresponding normal
samples were performed for both PC1 and PC2. In most cancer types (except acute lymphocytic
leukemia, glioblastoma, and lung adenocarcinoma) cancer and normal samples were significantly
separated along PC1 and/or PC2 (P < 0.005; Supplementary Table 2).
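The down-sampling and the comparison of principal component scores described above can be sketched in Python as follows. This is an illustration only: the function names, the fixed random seed, and the use of the pooled-variance (Student) form of the independent samples t-test are assumptions, since the original analysis was run in R and the variant of the t-test is not stated.

```python
import math
import random
import statistics

def balanced_subsample(cancer_scores, n_normal, min_size=500, seed=0):
    """Down-sample a large cancer set to the size of its normal set.

    Sets with at most `min_size` samples are returned unchanged.
    """
    scores = list(cancer_scores)
    if len(scores) <= min_size:
        return scores
    return random.Random(seed).sample(scores, min(n_normal, len(scores)))

def students_t(a, b):
    """Independent samples t statistic with pooled variance (assumed variant)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))

# Hypothetical PC1 scores: 600 cancer samples reduced to the 88 normal samples
cancer_pc1 = [i / 100 for i in range(600)]
subset = balanced_subsample(cancer_pc1, n_normal=88)
```

The t statistic would be computed separately for PC1 and PC2 scores of each cancer type and its corresponding normal tissue.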
Differential gene expression analysis
Differential gene expression analysis was performed by F-tests in one-way analysis of variance
using the Partek Genomics Suite software (Partek Inc., St. Louis, MO, U.S.A.). To identify the
most differentially expressed genes, the genes were ranked based on signal-to-noise ratios (F-values).
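As a hedged illustration of ranking genes by F-values, the sketch below computes the one-way analysis of variance F statistic in plain Python. It is not the Partek Genomics Suite implementation; the function names and the gene labels `GENE_A` and `GENE_B` are hypothetical.

```python
import statistics

def anova_f(groups):
    """One-way ANOVA F = between-group mean square / within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def rank_genes(expression):
    """expression: gene -> (cancer values, normal values); genes sorted by decreasing F."""
    scores = {gene: anova_f([cancer, normal]) for gene, (cancer, normal) in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical expression values for two genes
toy = {
    "GENE_A": ([1, 2, 3], [4, 5, 6]),
    "GENE_B": ([1, 2, 3], [2, 3, 4]),
}
ranking = rank_genes(toy)
```

With two groups, the F-value equals the square of the pooled-variance t statistic, so the ranking reflects the separation of group means relative to within-group spread.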
Adjusting for sample size and multiple comparisons
For eleven of the 21 cancer types (acute lymphocytic leukemia, acute myeloid leukemia,
breast carcinoma, chronic lymphocytic leukemia, clear cell renal cell carcinoma, colorectal
adenocarcinoma, lung adenocarcinoma, lung squamous cell carcinoma, lymphoma, prostate
cancer, and uterine cancer), both the cancer and corresponding normal sample sets are
represented by a sufficient number of samples to allow for high-confidence differential gene
expression analysis. These eleven cancer types have a significance level calculated from
statistical power analysis lower than 0.05, given effect size 0.5, statistical power (sensitivity) 0.8,
and the number of samples representing the cancer and corresponding normal tissues
(Supplementary Table 2). Statistical power analysis was conducted in R using the
pwr.t2n.test-function in the pwr-package (Stephane Champely, 2012). The calculated
significance level was further used as a threshold for calling differentially expressed genes within
each cancer type, to account for the varying numbers of samples in each dataset
and thus allow comparisons of the number of differentially expressed genes among cancer
types. P-values from differential expression analysis were first adjusted for multiple comparisons
by Bonferroni correction using the p.adjust-function in R.
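The thresholding procedure can be sketched as follows. Note the assumptions: the significance level is obtained here from a normal approximation to the two-sample t-test, whereas the pwr package works with the noncentral t-distribution, so values may differ slightly; the function names and the example P-values are hypothetical.

```python
from statistics import NormalDist

def approx_significance_level(n_cancer, n_normal, effect_size=0.5, power=0.8):
    """Approximate two-sided significance level reachable at the given power.

    Normal approximation (an assumption of this sketch), not the exact
    noncentral t calculation.
    """
    z = NormalDist()
    scale = (n_cancer * n_normal / (n_cancer + n_normal)) ** 0.5
    z_alpha = effect_size * scale - z.inv_cdf(power)
    return min(1.0, 2 * (1 - z.cdf(z_alpha)))

def bonferroni(p_values):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def call_differential(p_values, alpha):
    """Flag genes whose adjusted P-value is below the power-derived threshold."""
    return [p < alpha for p in bonferroni(p_values)]

# 1 462 cancer and 88 normal samples (acute lymphocytic leukemia in Supplementary Table 2)
alpha_all = approx_significance_level(1462, 88)
flags = call_differential([1e-8, 0.01, 0.5], alpha_all)  # hypothetical P-values
```

For 1 462 cancer and 88 normal samples, this approximation gives a significance level of roughly 0.0002, in line with the value listed in Supplementary Table 2.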
Variation in splicing factor gene expression
Cancer and corresponding normal samples were compared with regard to variance in gene
expression levels of all splicing factors (n = 261). To avoid the potential bias from analyzing
different numbers of samples, a random selection of 41 samples (corresponding to the number
of blood myeloid cell samples, representing the smallest sample set) was performed in each
sample set. Only the eleven cancer types identified from statistical power analysis to be
represented by a sufficient number of samples to allow for high-confidence differential gene
expression analysis were included. The variance in the expression level of each gene across all
samples from the individual cancer types and the normal sample types was calculated separately.
Cancer and corresponding normal samples were then compared for expression variance across all
splicing factor genes by paired samples t-tests.
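A minimal Python sketch of this variance comparison is given below, assuming expression values are stored per gene as lists in a common sample order. The function names, the fixed seed, and the toy values are illustrative assumptions.

```python
import math
import random
import statistics

def gene_variances(samples_by_gene, n=41, seed=0):
    """Per-gene expression variance across a random selection of n samples.

    samples_by_gene: gene -> list of expression values, same sample order for every gene.
    """
    n_samples = len(next(iter(samples_by_gene.values())))
    idx = random.Random(seed).sample(range(n_samples), min(n, n_samples))
    return {gene: statistics.variance([values[i] for i in idx])
            for gene, values in samples_by_gene.items()}

def paired_t(x, y):
    """Paired samples t statistic for matched values (here: one pair per gene)."""
    diffs = [a - b for a, b in zip(x, y)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical per-gene variances in normal and cancer samples
normal_var = [3.0, 5.0, 7.0]
cancer_var = [1.0, 2.0, 3.0]
t_stat = paired_t(normal_var, cancer_var)
```

Each gene contributes one pair of variances (cancer and normal), so the test has as many pairs as there are splicing factor genes.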
Somatic mutation analysis
Pan-cancer mutation frequency in splicing factor genes and random gene sets
For comparison of somatic mutation frequencies in the 347 splicing factor genes, ten sets of 347
randomly selected protein-coding genes were generated (selected among the 19 008 genes
included in the HUGO Gene Nomenclature Committee (HGNC) list of protein-coding genes,
downloaded from http://www.genenames.org/cgi-bin/statistics). Pan-cancer mutation frequencies
in the TCGA pan-cancer data were accessed for all genes from the IntOGen web-portal
(http://www.intogen.org/web/tcga/). The splicing factor genes did not have a significant increase
in mutation frequency (median mutation frequency 0.5%) compared with the ten sets of random
genes (range of median mutation frequency 0.46% to 0.56%; assessed by independent samples t-test).
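The construction of the random gene sets and their median mutation frequencies can be sketched as follows. The function names, the seed, the placeholder gene identifiers, and the choice to skip genes without a mutation frequency are assumptions of this sketch, not details stated above.

```python
import random
import statistics

def random_gene_sets(all_genes, set_size=347, n_sets=10, seed=0):
    """Draw n_sets random gene sets of set_size genes each, without replacement within a set."""
    rng = random.Random(seed)
    return [rng.sample(all_genes, set_size) for _ in range(n_sets)]

def median_frequency(genes, mutation_freq):
    """Median mutation frequency; genes missing from mutation_freq are skipped (assumption)."""
    return statistics.median(mutation_freq[g] for g in genes if g in mutation_freq)

# Placeholder identifiers standing in for the 19 008 protein-coding genes
protein_coding = [f"GENE{i}" for i in range(19008)]
gene_sets = random_gene_sets(protein_coding)
```

The median of each random set can then be compared with the median of the splicing factor gene set.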
Prediction of cancer-critical genes
Results from analyzing the TCGA pan-cancer somatic mutation data using five different
algorithms for predicting cancer-relevant mutations are included in the IntOGen web-portal. The
predictions of cancer-relevance are based on mutation frequency (MuSiC), specific mutation
patterns (ActiveDriver, mutations in protein active sites; OncodriveFM, accumulation of
mutations with high functional impact; OncodriveCLUST, significantly clustered mutations), and
a combination of signals for positive selection of mutations in cancer (MutSig). Predictions from
both pan-cancer analysis and analysis of individual cancer types were accessed from the IntOGen
web-portal. The pan-cancer results were applied for comparisons of the numbers of predicted
cancer-critical genes in the splicing factor gene set, the ten random gene sets, and genome-wide.
Genes are considered to be cancer-critical when predicted as a potential cancer driver by at least
one of the five algorithms. Twenty-four (7%) of the 347 splicing factor genes were identified as
cancer-critical in the pan-cancer analyses, compared with 533 (3%) of the 18 161 genes
included in both the HGNC and IntOGen, and a median of 10.5 (3%) genes in the ten random
gene sets.
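The calling rule and the reported proportions can be checked with a short Python sketch. The function names and the placeholder gene labels are hypothetical; the counts in the final loop are those reported above.

```python
def cancer_critical(predictions):
    """Union of driver predictions: a gene counts if any algorithm predicts it.

    predictions: algorithm name -> iterable of predicted driver genes.
    """
    critical = set()
    for genes in predictions.values():
        critical |= set(genes)
    return critical

def critical_fraction(gene_set, critical):
    """Fraction of gene_set that is predicted cancer-critical."""
    genes = set(gene_set)
    return len(genes & critical) / len(genes)

# Placeholder predictions, not real results
toy_predictions = {
    "MuSiC": {"GENE_A"},
    "ActiveDriver": {"GENE_B"},
    "OncodriveFM": {"GENE_A", "GENE_C"},
    "OncodriveCLUST": set(),
    "MutSig": {"GENE_C"},
}
toy_critical = cancer_critical(toy_predictions)

# Proportions from the counts reported in the text
for label, hits, total in [("splicing factor genes", 24, 347),
                           ("all genes", 533, 18161),
                           ("random sets (median)", 10.5, 347)]:
    print(f"{label}: {100 * hits / total:.1f}%")
```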
Supplementary Table 1. Genes annotated with splicing-related Gene Ontology terms
Gene Ontology accession | Gene Ontology term | Number of genes | Gene symbol
GO:0008380 | RNA splicing | 245 |
A8MWD9, ACIN1, AFF2, AKAP17A, ALYREF, ARL6IP4, BCAS2, BRDT,
C1QBP, CASC3, CCAR1, CCAR2, CD2BP2, CDC40, CDK12, CELF3, CIR1,
CLASRP, CLP1, CPSF1, CPSF2, CPSF3, CPSF7, CSTF1, CSTF2, CSTF3,
CTNNBL1, DDX23, DDX39B, DDX46, DDX47, DHX15, DHX16, DHX38,
DHX8, DHX9, DNAJC8, EFTUD2, ESRP1, ESRP2, FUS, GEMIN2, GTF2F1,
GTF2F2, HNRNPA0, HNRNPA1, HNRNPA1L2, HNRNPA2B1, HNRNPA3,
HNRNPC, HNRNPD, HNRNPF, HNRNPH1, HNRNPH2, HNRNPH3,
HNRNPK, HNRNPL, HNRNPM, HNRNPR, HNRNPU, HNRNPUL1, HSPA8,
IVNS1ABP, IWS1, JMJD6, KHSRP, KIAA1429, LGALS3, LSM1, LSM10, LSM2,
LSM4, LSM5, LSM6, LUC7L3, MAGOH, MAGOHB, MBNL1, MBNL2, MBNL3,
MPHOSPH10, NCBP1, NCBP2, NHP2L1, NOL3, NONO, NOVA1, NSRP1,
NUDT21, PABPN1, PAPOLA, PCBP1, PCBP2, PCF11, PDCD7, PHF5A,
POLR2A, POLR2B, POLR2C, POLR2D, POLR2E, POLR2F, POLR2G,
POLR2H, POLR2I, POLR2J, POLR2K, POLR2L, PPAN, PPARGC1A, PPIG,
PPP1R8, PPP1R9B, PPP2CA, PPP2R1A, PPP4R2, PRPF18, PRPF3, PRPF38A,
PRPF38B, PRPF39, PRPF4, PRPF40A, PRPF40B, PRPF4B, PRPF6, PRPF8,
PTBP1, PTBP3, PUF60, QKI, RBFOX1, RBFOX2, RBFOX3, RBM10, RBM11,
RBM15B, RBM17, RBM20, RBM25, RBM28, RBM38, RBM39, RBM4, RBM4B,
RBM5, RBM8A, RBMX, RBMXL1, RBMY1A1, RBMY1B, RBMY1C, RBMY1D,
RBMY1E, RBMY1F, RNPC3, RNPS1, RP9, RRAGC, RSRC1, SAP18, SCAF1,
SCAF11, SCAF8, SCNM1, SF3A1, SF3A2, SF3A3, SF3B1, SF3B14, SF3B2,
SF3B3, SF3B4, SF3B5, SFPQ, SMC1A, SMNDC1, SNRNP200, SNRNP25,
SNRNP27, SNRNP35, SNRNP40, SNRNP48, SNRNP70, SNRPA, SNRPA1,
SNRPB, SNRPB2, SNRPD1, SNRPD2, SNRPD3, SNRPE, SNRPF, SNRPG,
SNRPN, SON, SREK1, SREK1IP1, SRPK1, SRPK2, SRRM1, SRRM4, SRSF1,
SRSF11, SRSF2, SRSF3, SRSF4, SRSF5, SRSF6, SRSF7, SRSF8, SRSF9,
STRAP, SUGP1, SUGP2, SUPT6H, SYNCRIP, TARDBP, THOC1, THOC2,
THOC3, THOC5, THOC6, THOC7, THRAP3, TTF2, TXNL4A, TXNL4B,
U2AF1, U2AF1L4, U2AF2, UPF3B, USB1, USP39, WBP11, WT1, WTAP, YBX1,
YTHDC1, ZCRB1, ZMAT5, ZNF326, ZNF638, ZRANB2, ZRSR2
GO:0000244 | Spliceosomal tri-snRNP complex assembly | 6 |
CD2BP2, DDX20, PRPF31, PRPF6, SRSF10, SRSF12
GO:0000245 | Spliceosomal complex assembly | 19 |
CRNKL1, DDX1, DDX39B, GEMIN2, GEMIN6, PRPF19, PRPF6, RBM5, SCAF11,
SF1, SMN1, SNRPD1, SNRPD2, SNRPE, SNRPG, SRPK2, TXNL4A, USP39, ZRSR2
GO:0000354 | Cis assembly of pre-catalytic spliceosome | 2 |
DDX23, SNRNP200
GO:0000375 | RNA splicing, via transesterification reactions | 25 |
BCAS2, DBR1, DDX23, GEMIN2, KHSRP, LSM1, MPHOSPH10, PRPF3, PRPF4,
PRPF6, PRPF8, SCAF11, SF3A3, SF3B1, SF3B3, SF3B4, SLU7, SMNDC1,
SNRNP40, SRRM1, SRSF10, SRSF4, TRA2B, TXNL4A, WDR83
GO:0000379 | tRNA-type intron splice site recognition and cleavage | 1 |
TSEN34
GO:0000380 | Alternative mRNA splicing, via spliceosome | 9 |
CDK13, CELF4, HNRNPA1, HNRNPM, MBNL1, PTBP1, RSRC1, SFPQ, SLU7
GO:0000381 | Regulation of alternative mRNA splicing, via spliceosome | 24 |
CELF3, CELF4, CELF6, DDX5, MAGOH, MBNL1, MBNL2, MYOD1, NSRP1,
PTBP1, RBFOX2, RBFOX3, RBM15B, RBM25, RBM4, RBM5, RBM8A, RBMX,
RBMY1A1, RNPS1, SAP18, SRSF12, THRAP3, TRA2B
GO:0000387 | Spliceosomal snRNP assembly | 26 |
CLNS1A, DDX20, GEMIN2, GEMIN4, GEMIN5, GEMIN6, GEMIN7, GEMIN8,
NCBP1, NCBP2, PHAX, PRMT5, PRMT7, SART1, SMN1, SNRPB, SNRPC,
SNRPD1, SNRPD2, SNRPD3, SNRPE, SNRPF, SNRPG, SNUPN, TGS1, WDR77
GO:0000389 | mRNA 3'-splice site recognition | 5 |
SF1, SF3A1, SF3A2, SF3A3, SLU7
GO:0000395 | mRNA 5'-splice site recognition | 4 |
PSIP1, SNRPC, SRSF1, SRSF12
GO:0000398 | mRNA splicing, via spliceosome | 169 |
ALYREF, AQR, BUD13, CACTIN, CCAR1, CD2BP2, CDC40, CDC5L, CELF3,
CLP1, CPSF1, CPSF2, CPSF3, CPSF7, CRNKL1, CSTF1, CSTF2, CSTF3, CWC15,
CWC22, CWC27, DBR1, DDX23, DDX39A, DDX39B, DDX41, DDX5, DGCR14,
DHX35, DHX38, DHX8, DHX9, DNAJC8, EFTUD2, EIF4A3, FRG1, FUS,
GEMIN5, GEMIN6, GEMIN7, GPATCH1, GTF2F1, GTF2F2, HNRNPA0,
HNRNPA1, HNRNPA2B1, HNRNPA3, HNRNPC, HNRNPD, HNRNPF, HNRNPH1,
HNRNPH2, HNRNPH3, HNRNPK, HNRNPL, HNRNPM, HNRNPR, HNRNPU,
HNRNPUL1, ISY1, LSM2, LSM3, LSM7, MAGOH, NAA38, NCBP1, NCBP2,
NHP2L1, NOVA1, NUDT21, PABPC1, PABPN1, PAPOLA, PAPOLB, PCBP1,
PCBP2, PCF11, PHF5A, PLRG1, PNN, POLR2A, POLR2B, POLR2C, POLR2D,
POLR2E, POLR2F, POLR2G, POLR2H, POLR2I, POLR2J, POLR2K, POLR2L,
PPIE, PPIH, PPIL1, PPIL3, PPWD1, PRPF19, PRPF3, PRPF31, PRPF4,
PRPF4B, PRPF6, PRPF8, PTBP1, RALY, RBM22, RBM5, RBM8A, RBMX, RNPS1,
RSRC1, SART1, SF1, SF3A1, SF3A2, SF3A3, SF3B1, SF3B14, SF3B2, SF3B3,
SF3B4, SF3B5, SKIV2L2, SLU7, SMC1A, SNRNP200, SNRNP40, SNRNP70,
SNRPA, SNRPA1, SNRPB, SNRPB2, SNRPC, SNRPD1, SNRPD2, SNRPD3,
SNRPE, SNRPF, SNRPG, SNW1, SRRM1, SRRM2, SRSF1, SRSF10, SRSF11,
SRSF2, SRSF3, SRSF4, SRSF5, SRSF6, SRSF7, SRSF9, SUGP1, SYF2, SYNCRIP,
TFIP11, TRA2A, TRA2B, TXNL4A, U2AF1, U2AF2, UPF3B, USP49, WDR83,
XAB2, YBX1, ZCCHC8, ZRSR2
GO:0006376 | mRNA splice site selection | 15 |
CELF1, CELF2, CELF4, LUC7L, LUC7L2, LUC7L3, MBNL1, PTBP2, RBMX,
SFSWAP, SRSF1, SRSF10, SRSF5, SRSF6, SRSF9
GO:0006388 | tRNA splicing, via endonucleolytic cleavage and ligation | 7 |
CLP1, RTCB, TRPT1, TSEN15, TSEN2, TSEN34, TSEN54
GO:0033119 | Negative regulation of alternative splicing | 5 |
PTBP1, PTBP2, PTBP3, RPS13, RPS26
GO:0033120 | Positive regulation of alternative splicing | 3 |
HNRNPLL, RBM20, RBM22
GO:0043484 | Regulation of RNA splicing | 24 |
AFF2, AKAP17A, BRDT, CELF1, CLK1, CLK2, CLK3, CLK4, ESRP1, ESRP2,
FASTK, HNRNPF, HNRNPH1, MBNL1, MBNL2, MBNL3, MYOD1, RBFOX1,
RBFOX2, RBFOX3, RBM38, SNRNP70, SON, SRRM4
GO:0045292 | mRNA cis splicing, via spliceosome | 6 |
DCPS, NCBP1, NCBP2, NCBP2L, RBM22, WBP4
GO:0048024 | Regulation of mRNA splicing, via spliceosome | 5 |
CWC22, JMJD6, SRPK1, SRPK2, TIA1
GO:0048025 | Negative regulation of mRNA splicing, via spliceosome | 15 |
ACIN1, C1QBP, HNRNPA2B1, PTBP1, RBMX, RNPS1, SAP18, SFSWAP, SRSF10,
SRSF12, SRSF4, SRSF6, SRSF7, SRSF9, U2AF2
GO:0048026 | Positive regulation of mRNA splicing, via spliceosome | 7 |
CELF3, CELF4, PRPF19, RBMX, SNW1, THRAP3, TRA2B
GO:0070055 | HAC1-type intron splice site recognition and cleavage | 1 |
ERN1
GO:0005681 | Spliceosomal complex | 93 |
A8MWD9, AKAP17A, API5, AQR, BCAS2, CDC40, CRNKL1, CTNNBL1,
CWC15, CWC22, DDX20, DDX39B, DHX8, EFTUD2, GEMIN2, GEMIN4,
GEMIN5, GEMIN6, GEMIN7, GEMIN8, HNRNPA1, HNRNPA1L2,
HNRNPA2B1, HNRNPC, HNRNPDL, HNRNPM, HNRNPR, HSPA8,
IVNS1ABP, KIAA1967, LGALS3, LSM4, LSM5, LSM6, LSM7, NAA38,
NHP2L1, PPIH, PPP1R8, PRPF18, PRPF3, PRPF38A, PRPF38B, PRPF4,
PRPF6, PRPF8, PTBP2, RBM17, RBM28, RBM5, RBMX, RHEB, SCAF8, SF1,
SF3A1, SF3A2, SF3A3, SF3B1, SF3B2, SF3B3, SF3B4, SLU7, SMN1,
SMNDC1, SNRNP200, SNRNP70, SNRPA, SNRPA1, SNRPB, SNRPB2,
SNRPD2, SNRPD3, SNRPE, SNRPF, SNRPG, SNRPN, SNW1, SREK1, SRRM1,
STRAP, SUGP1, TFIP11, TTF2, TXNL4A, TXNL4B, U2AF1, U2AF1L4, U2AF2,
USP39, WAC, WBP4, WDR83, ZNF326
GO:0000243 | Commitment complex | 1 |
SNRPC
GO:0005684 | U2-type spliceosomal complex | 2 |
PRPF31, SF3A1
GO:0005689 | U12-type spliceosomal complex | 24 |
DHX15, PDCD7, PHF5A, RNPC3, SF3B1, SF3B14, SF3B2, SF3B3, SF3B4, SF3B5,
SNRNP25, SNRNP35, SNRNP48, SNRPB, SNRPD1, SNRPD2, SNRPD3, SNRPE,
SNRPF, SNRPG, YBX1, ZCRB1, ZMAT5, ZRSR2
GO:0044530 | Supraspliceosomal complex | 3 |
ADAR, RBMX, UPF1
GO:0071004 | U2-type prespliceosome | 1 |
SNRPC
GO:0071013 | Catalytic step 2 spliceosome | 80 |
ALYREF, AQR, CACTIN, CDC40, CDC5L, CRNKL1, CWC15, CWC22, CWC27,
DDX23, DDX41, DDX5, DGCR14, DHX35, DHX38, DHX8, EFTUD2, EIF4A3,
FRG1, GPATCH1, HNRNPA1, HNRNPA2B1, HNRNPA3, HNRNPC, HNRNPF,
HNRNPH1, HNRNPK, HNRNPM, HNRNPR, HNRNPU, ISY1, LSM2, LSM3,
MAGOH, PABPC1, PLRG1, PNN, PPIE, PPIL1, PPIL3, PPWD1, PRPF19,
PRPF4B, PRPF6, PRPF8, RALY, RBM22, RBM8A, RBMX, SART1, SF3A1, SF3A2,
SF3A3, SF3B1, SF3B2, SF3B3, SKIV2L2, SLU7, SNRNP200, SNRNP40, SNRPA1,
SNRPB, SNRPB2, SNRPD1, SNRPD2, SNRPD3, SNRPE, SNRPF, SNRPG, SNW1,
SRRM1, SRRM2, SRSF1, SYF2, SYNCRIP, TFIP11, U2AF1, WDR83, XAB2,
ZCCHC8
Supplementary Table 2. Samples included in gene expression analyses of splicing factors
Cancer type | No. of cancer samples | Corresponding normal sample origin | No. of normal samples | PCA, cancer versus normal, PC1 [1] | PCA, cancer versus normal, PC2 [1] | Significance level, statistical power analysis [2] | No. of differentially expressed splicing factor genes [3]
Acute lymphocytic leukemia | 1 462 | Blood lymphoid cell [4] | 88 | 0.5 | 0.05 | 0.0002 | 91 (35%)
Acute myeloid leukemia | 1 076 | Blood myeloid cell [5] | 41 | 1E-13 | 0.005 | 0.02 | 142 (54%)
Bladder cancer | 77 | Bladder | 10 | 0.004 | 0.001 | 0.5 | NA
Breast carcinoma | 1 502 | Breast | 88 | 2E-17 | 0.0005 | 0.0002 | 185 (71%)
Cervical cancer | 65 | Cervix | 11 | 0.01 | 0.0002 | 0.5 | NA
Chronic lymphocytic leukemia | 556 | Blood lymphoid cell [4] | 88 | 8E-32 | 0.1 | 0.0005 | 133 (51%)
Chronic myeloid leukemia | 76 | Blood myeloid cell [5] | 41 | 2E-06 | 6E-20 | 0.09 | NA
Clear cell renal cell carcinoma | 188 | Kidney | 73 | 0.07 | 6E-09 | 0.006 | 93 (36%)
Colorectal adenocarcinoma | 559 | Colon | 47 | 6E-10 | 0.02 | 0.01 | 82 (31%)
Glioblastoma multiforme | 268 | Brain | 30 | 0.4 | 0.005 | 0.08 | NA
Head and neck squamous cell carcinoma | 44 | Oral cavity | 77 | 5E-49 | 0.2 | 0.07 | NA
Hepatocellular carcinoma | 107 | Liver | 22 | 1E-07 | 0.03 | 0.2 | NA
Lung adenocarcinoma | 840 | Lung [6] | 144 | 0.007 | 0.06 | 0.00009 | 78 (30%)
Lung squamous cell carcinoma | 127 | Lung [6] | 144 | 1E-06 | 3E-19 | 0.001 | 85 (33%)
Lymphoma | 576 | Blood lymphoid cell [4] | 88 | 8E-18 | 0.06 | 0.0005 | 138 (53%)
Ovarian serous adenocarcinoma | 287 | Ovary | 12 | 8E-08 | 0.0005 | 0.4 | NA
Pancreatic cancer | 51 | Pancreas | 20 | 3E-05 | 0.002 | 0.3 | NA
Prostate cancer | 185 | Prostate | 70 | 0.2 | 0.0002 | 0.007 | 18 (7%)
Testicular cancer | 105 | Testis | 18 | 8E-15 | 0.3 | 0.3 | NA
Thyroid carcinoma | 50 | Thyroid gland | 18 | 0.9 | 7E-09 | 0.3 | NA
Uterine cancer | 161 | Uterus | 107 | 0.4 | 1E-44 | 0.002 | 114 (44%)
Total | 4 322 | | 876 | | | |
[1] P-values from independent samples t-tests comparing principal components (PC) between cancer and normal samples. For the seven
cancer types with more than 500 samples, a random selection of the same number of samples as in the corresponding
normal sample set was included in the analyses.
[2] Significance level calculated from statistical power analysis, given effect size 0.5, statistical power (sensitivity) 0.8, and the number of
samples representing the individual cancer types and corresponding normal. Only the eleven cancer types represented by sufficient
sample numbers to allow P < 0.05 were included in differential gene expression analyses between cancer and corresponding normal
samples.
[3] Number of genes (of the 261 splicing factor genes in total) with P-values from F-tests in one-way analysis of variance lower than the
significance level calculated from statistical power analysis, after adjusting for multiple comparisons by Bonferroni correction. NA
denotes the tissue types with significance level from statistical power analysis > 0.05.
[4], [5], [6] Same samples.
Supplementary Table 3. The ten most differentially expressed splicing factor genes in
different cancer types
Differentially expressed genes1
Cancer type
Upregulated in cancer
Downregulated in cancer
Acute lymphocytic
leukemia
DDX41, HNRNPL, SNRPA
Acute myeloid leukemia
HNRNPA1, HNRNPM, SNRPA,
SYNCRIP
Breast carcinoma
Chronic lymphocytic
leukemia
Clear cell renal cell
carcinoma
Colorectal
adenocarcinoma
Lung adenocarcinoma
Lung squamous cell
carcinoma
CLASRP, SNRPA
CELF2, MBNL1, NOL3, SNRPD1
NONO, SF3B3, SNRPB, SNRPC,
SNRPD2, SYNCRIP
DHX8, ESRP1, ESRP2, HNRNPC,
SCNM1, USP39
DDX39A, ESRP1, NCBP2, PTBP3,
SCNM1, SNRPB, USP39
CSTF1, DDX23, PPIE, PRPF38B,
SNRNP35, SREK1IP1, ZRSR2
ADAR, CD2BP2, ERN1, SRPK2, TSEN34,
UPF1
CELF2, CELF3, GEMIN7, HNRNPL,
MBNL3, MYOD1, NCBP1, QKI, RBFOX1,
RBM4B
HSPA8, IVNS1ABP, PRPF38B, PSIP1,
RBM28, SF3B3, SREK1IP1, UPF3B
C1QBP, ESRP2, PABPN1, POLR2L,
PPARGC1A, WT1
LGALS3, MBNL3, NOVA1, PPARGC1A
POLR2A, PRPF8, SAP18, UPF1
LGALS3, SAP18, SRSF5
BCAS2, CELF2, DDX47, JMJD6, RBM22,
SAP18, SRSF11, SRSF5, SRSF6, SYF2
Lymphoma
Prostate cancer
BCAS2, CLNS1A, PPIH, PPP2CA,
SNRPD1, SNW1, THOC7, WBP11
LGALS3, RBM4B
Uterine cancer
BCAS2, ESRP1, SYNCRIP
HNRNPDL, NOVA1, PABPN1, PCBP1,
SNRNP70, TRA2A, WT1
1
Top ten differentially expressed genes between cancer and corresponding normal samples
(ranked based on signal-to-noise ratios from F-tests in one-way analysis of variance)
Supplementary Table 4. Splicing factor genes included in the Cancer Gene Census
Gene symbol | Gene Ontology accession
CDK12 | GO:0008380
DDX5 | GO:0000381, GO:0000398, GO:0071013
FUS | GO:0000398, GO:0008380
HNRNPA2B1 | GO:0000398, GO:0008380, GO:0048025, GO:0005681, GO:0071013
NONO | GO:0008380
PPP2R1A | GO:0008380
SF3B1 | GO:0000375, GO:0000398, GO:0005681, GO:0005689, GO:0008380, GO:0071013
SFPQ | GO:0000380, GO:0008380
SRSF2 | GO:0000398, GO:0008380
THRAP3 | GO:0000381, GO:0008380, GO:0048026
U2AF1 | GO:0000398, GO:0005681, GO:0008380, GO:0071013
WT1 | GO:0008380
ZRSR2 | GO:0000245, GO:0000398, GO:0005689, GO:0008380
Supplementary Figure 1: Genome-wide aberrant splicing across cancer types.
A) Distribution of the median number of cancer-specific splicing events (including cassette
exons, competing 5’ and 3’ splice sites, and retained introns) in each of 15 cancer types. The data
are summarized from Table 1 published by Dvinge and Bradley (Genome Med 2015; 7: 45), and
are based on differential splicing events detected from RNA-sequencing of cancer and matched
normal samples (difference in isoform ratio of ≥10 %) from 636 patients (sequenced by TCGA).
Patients with acute myeloid leukemia were excluded because matched normal samples were not
available. The median number of cancer-specific splicing events across the cancer types was 1
285 (ranging from 726 in liver cancer to 1 796 in breast cancer). B) Principal components
analysis based on the median splicing score per tumor type of the 3 911 splicing events with
largest variation in splicing scores across tumor samples shows that leukemias (purple) and brain
cancers (dark blue) have distinct splicing patterns compared with the other cancer types. The
median splicing score of each event in 15 samples from each cancer type was downloaded from
the TCGA SpliceSeq database (http://projects.insilico.us.com/TCGASpliceSeq). The colour
coding is the same for parts A) and B). ACC, adrenocortical carcinoma; BLCA, bladder
urothelial carcinoma; BRCA, breast invasive carcinoma; CESC, cervical squamous cell
carcinoma and endocervical adenocarcinoma; COAD, colon adenocarcinoma; DLBC, diffuse
large B-cell lymphoma; GBM, glioblastoma multiforme; HNSC, head and neck squamous cell
carcinoma; KICH, kidney chromophobe; KIRC, kidney renal clear cell carcinoma; KIRP, kidney
renal papillary cell carcinoma; LAML, acute myeloid leukemia; LGG, brain lower grade glioma;
LIHC, liver hepatocellular carcinoma; LUAD, lung adenocarcinoma; LUSC, lung squamous cell
carcinoma; PAAD, pancreatic adenocarcinoma; PRAD, prostate adenocarcinoma; READ, rectum
adenocarcinoma; SKCM, skin cutaneous melanoma; THCA, thyroid carcinoma; UCEC, uterine
corpus endometrial carcinoma; UCS, uterine carcinosarcoma.
Supplementary Figure 2: Splice site mutations in cancer-relevant genes.
Sixty-two of the 513 genes in the Cancer Gene Census have been reported with somatic and/or
germ-line mutations in splice sites in various cancer types (red squares, three genes with
mutations only in benign tumors are excluded). Both cancer types and genes are sorted by the
number of mutations. The column marked “Other” represents a collection of rare tumor types,
and numbers are used to indicate genes found to have mutations in more than one of these. The
data were downloaded from the Cancer Gene Census in November 2013
(http://cancer.sanger.ac.uk/cancergenome/projects/census/).
Supplementary Figure 3. Principal components analysis of corresponding cancer and
normal samples based on the expression levels of splicing factor genes.
Cancer (red dots) and corresponding normal (blue dots) samples are separated by the expression
levels of splicing factor genes (n = 261) in most of the 21 cancer types. All cancer types except
acute lymphocytic leukemia, glioblastoma, and lung adenocarcinoma have P-values from
independent samples t-tests of principal component one (horizontal axes) and/or two (vertical
axes) between cancer and normal samples lower than 0.005 (Supplementary Table 2).
Supplementary Figure 4. Variance in splicing factor gene expression
A) In nine of eleven cancer types, the variance in gene expression levels of splicing factor genes
(n = 261) is significantly lower in the cancer than the corresponding normal samples (P < 0.0001
in paired samples t-tests of all the nine cancer types). In contrast, colorectal adenocarcinoma
samples have significantly higher variance than normal colonic mucosa samples (P = 4 × 10^-11).
No difference in expression variability was found between cancer and normal samples in acute
myeloid leukemia. The cancer types are sorted by largest median difference in the variance
between the normal and cancer samples. B) Among the five splicing factor genes with the largest
median difference in variance of expression levels, WT1 has the largest variance among the cancer
samples (P = 0.03, paired samples t-test of variance in the 11 cancer types compared with the
corresponding normal samples), whereas PPIH, PRMT7, CELF1, and GEMIN6 have the largest
variance in the normal samples (P < 0.007).
Supplementary Figure 5. Splicing factor mutation frequency in the 12 cancer types
analyzed in the TCGA pan-cancer project
The mutation frequency of the 340 of the 347 splicing factor genes (98%) that are
mutated in at least one cancer type is indicated in red per cancer type. Colon and rectum
cancers are plotted together. Both genes (vertically) and cancer types (horizontally) are sorted
alphabetically.
Supplementary Figure 6. Median mutation frequencies of splicing factor genes in individual
cancer types
The median mutation frequency of splicing factors correlates strongly with the genome-wide
median mutation rate (retrieved from reference 146; Pearson correlation, r = 0.9) in the individual
cancer types (uterine carcinoma is not included). This indicates that there is no individual cancer
type more frequently targeted by splicing factor mutations than others. Here, the mutation
frequency of splicing factor genes is calculated from the TCGA pan-cancer somatic mutation data
(n = 3 205 cancers from 12 different cancer types, accessed from the IntOGen TCGA portal,
reference 105), and includes 347 genes annotated with the GO-terms “RNA splicing” and/or
“spliceosomal complex”.
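For illustration, the Pearson correlation used in this comparison can be computed as in the following Python sketch. The function name and the example values are hypothetical and do not reproduce the mutation data behind the figure.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical per-cancer-type values: genome-wide median mutation rate
# versus median mutation frequency of splicing factor genes
genome_wide = [1.0, 2.0, 3.0, 4.0]
splicing_factors = [0.2, 0.4, 0.6, 0.8]
r = pearson_r(genome_wide, splicing_factors)
```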