Supplementary Material for “U2AF1 Mutations Alter Sequence Specificity of pre-mRNA Binding and Splicing”
Theresa Okeyo-Owuor¹, Brian S. White¹,², Rakesh Chatrikhi³, Dipika R. Mohan¹, Sanghyun
Kim¹, Malachi Griffith², Li Ding², Shamika Ketkar-Kulkarni¹, Jasreet Hundal², Kholiswa M. Laird³,
Clara L. Kielkopf³, Timothy J. Ley¹,², Matthew J. Walter¹, Timothy A. Graubert¹,⁴
¹ Department of Internal Medicine, Division of Oncology, Washington University, Saint Louis, MO, USA
² The Genome Institute, Washington University School of Medicine, Saint Louis, MO, USA
³ Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, USA
⁴ Current address: Massachusetts General Hospital Cancer Center
SUPPLEMENTARY METHODS
RNA-seq alignment and preprocessing. RNA-seq analysis was performed with The Genome
Institute’s Genome Modeling System (manuscript in preparation) using the RNA-seq
processing-profile ‘2819506’. Quality of raw RNA sequence data in the form of FastQ data files
was assessed using FastQC version 0.10.0 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
Paired 2x100 bp sequence reads were trimmed to remove ‘SPIA’ adapters
(ligated during cDNA synthesis) using the read trimmer ‘flexbar’ version 229
(https://wiki.gacrc.uga.edu/wiki/FAR) with the following parameters set: ‘--adapter
CTTTGTGTTTGA --adapter-trim-end LEFT --nono-length-dist --threads 4 --adapter-min-overlap
7 --max-uncalled 150 --min-readlength 25’. After trimming, reads were aligned to a modified
version of the human genome reference sequence (NCBI build 37) with alternative haplotype
sequences omitted. Initial segmented alignments were performed using Bowtie version 2.1.0¹
followed by spliced alignments with TopHat version 2.0.8². During alignment, TopHat was
supplied transcript models in GTF format using the ‘-G’ parameter. Transcript models
representing known and predicted human transcripts were obtained from Ensembl version 67.³
The binary sequence alignment (BAM) files obtained by alignment of RNA-seq reads with
TopHat were summarized by use of SAMStat version 1.08 and SAMtools version 0.1.18
(specifically, the idxstats and flagstat utilities).⁴ The quality of alignments was assessed using
Picard version 1.85 (specifically, the RnaSeqMetrics utility) (http://picard.sourceforge.net/).
Following alignment, expression estimates in the form of Fragments Per Kilobase of
exon per Million fragments mapped (FPKM) were calculated by Cufflinks version 2.0.2⁵ using default
parameters except for ‘--num-threads 4 --max-bundle-length 10000000’. Transcript models
were supplied to Cufflinks using the ‘-g’ option and the same GTF described above.
Transcripts corresponding to mitochondrial and ribosomal genes, as indicated by Ensembl gene
biotypes, were masked during calculation of transcript expression estimates. Exon-exon
junction counts were obtained by parsing the ‘junctions.bed’ file produced by TopHat. This file
reports the coordinates of all introns observed by splice-aware alignment of reads to the
genome and the number of reads supporting each. Each observed exon-exon junction was
annotated with the genomic coordinates of its defining acceptor and donor sites and cross-referenced against the human transcripts in Ensembl version 67 to determine the number, if
any, of exons skipped. Each junction was then assigned to one of five classes describing its
relationship to known Ensembl transcripts: ‘DA’, ‘NDA’, ‘D’, ‘A’, and ‘N’. ‘DA’ junctions are those
where the exon-exon combination corresponds to one or more known Ensembl transcripts.
‘NDA’ refers to a novel pairing of a donor site and an acceptor site that are each used in
Ensembl transcripts but have not been observed in that combination. ‘D’ refers to junctions that
use a known donor site but a novel acceptor site. ‘A’ refers to junctions that use a known
acceptor site but a novel donor site. ‘N’ refers to junctions that do not correspond to any
Ensembl transcript at donor or acceptor sites. Gene-level counts were obtained by excluding
unmapped reads or secondary read alignments from the BAM files (using samtools view -F
0x0104) and processing the resulting reads using HTSeq⁶ (with parameters: mode intersection-strict; minaqual 1; stranded no; type exon; idattr gene_id).
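The five junction classes described above can be illustrated with a minimal Python sketch. It is not the pipeline's own code: the function names, the tuple layout and the omission of strand are assumptions made for brevity, and the annotation is reduced to a set of known (chromosome, donor, acceptor) junctions.

```python
from typing import Iterable, Set, Tuple

# Hypothetical representation: (chromosome, donor position, acceptor position).
Junction = Tuple[str, int, int]
Site = Tuple[str, int]


def build_annotation(annotated: Iterable[Junction]) -> Tuple[Set[Junction], Set[Site], Set[Site]]:
    """Collect the known junctions and their individual donor and acceptor sites."""
    known = set(annotated)
    donors = {(chrom, donor) for chrom, donor, _ in known}
    acceptors = {(chrom, acceptor) for chrom, _, acceptor in known}
    return known, donors, acceptors


def classify_junction(junction: Junction, known: Set[Junction],
                      donors: Set[Site], acceptors: Set[Site]) -> str:
    """Return 'DA', 'NDA', 'D', 'A' or 'N' for one observed junction."""
    chrom, donor, acceptor = junction
    donor_known = (chrom, donor) in donors
    acceptor_known = (chrom, acceptor) in acceptors
    if junction in known:
        return "DA"   # known donor and acceptor in a known combination
    if donor_known and acceptor_known:
        return "NDA"  # both sites known, combination novel
    if donor_known:
        return "D"    # known donor, novel acceptor
    if acceptor_known:
        return "A"    # known acceptor, novel donor
    return "N"        # neither site annotated
```

A real implementation would also need to track strand and to count the exons skipped by each junction, which this sketch leaves out.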
Gene- and junction-level differential expression analysis. Differential gene- and junction-level expression across WT and S34F mutant U2AF1 samples was inferred using edgeR⁷ in R.
edgeR uses negative binomial generalized linear models (GLMs) to infer statistically significant
differences between condition-specific (here, WT and S34F) bin counts (here, corresponding to
total reads mapped to a gene or to a junction). Importantly, we exploited edgeR’s capability to
represent the paired experimental design used here. Specifically, we tested for differential
expression between WT and S34F samples within biological replicates, i.e., adjusting for
differences between biological replicates, by using an additive linear model with biological
replicate as the blocking factor: expr ~ condition + replicate, where condition corresponds to WT
or S34F and replicate is an identifier for the biological replicate. For junction-level analysis,
‘DA’, ‘NDA’, ‘D’, and ‘A’ junctions were analyzed independently. We considered only junctions
having greater than 5 counts in at least 2 (of 6) samples and excluded ‘N’ junctions. Using
edgeR, we performed TMM normalization⁷ (using calcNormFactors), estimated the average
dispersion over all junctions (using estimateGLMCommonDisp), estimated a dispersion
parameter for each junction with a count-dependent trend line (using estimateGLMTrendedDisp
and df=20), computed an empirical Bayes estimate of the negative binomial dispersion
parameter for each junction based on a log-linear model (using estimateGLMTagwiseDisp), fit
junction-wise negative binomial GLMs (using glmFit), performed likelihood ratio tests for the
condition coefficients in the model (using glmLRT), and reported test results, including p-values, false discovery rates (FDRs; i.e., p-values corrected for multiple testing using the
method of Benjamini and Hochberg⁸) and log-fold changes (using topTags). Gene-wise
differential expression analysis was performed exactly as above except that a gene was
analyzed only if it had an FPKM > 0.1 and a read count > 25 across at least 2 samples. Gene
and junction expression data can be found in Supplementary Tables 1 and 2, respectively.
A volcano plot of the differentially expressed junctions [log2(fold change) versus −log10(p-value)] was created using R.
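Two steps of the procedure above are simple enough to sketch outside of edgeR: the read-count filter applied to junctions and the Benjamini-Hochberg correction that turns p-values into FDRs. The Python below is an illustration only; the function names are hypothetical, and the GLM fitting and dispersion estimation performed by edgeR are not reproduced.

```python
from typing import List, Sequence


def passes_count_filter(counts: Sequence[int], min_count: int = 5,
                        min_samples: int = 2) -> bool:
    """True if more than `min_count` reads were observed in at least `min_samples` samples."""
    return sum(1 for c in counts if c > min_count) >= min_samples


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values (FDRs) in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With the defaults, `passes_count_filter` encodes the junction-level rule of more than 5 counts in at least 2 samples; the TCGA validation rule corresponds to `min_samples=5`.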
Validation within the publicly available TCGA AML data set (https://tcga-data.nci.nih.gov/docs/publications/laml_2012/) of the 544 junctions discovered in CD34+ cells
having FDR < 5% and | log2(fold change) | > 2 was similarly performed using edgeR. An
unpaired analysis was performed between 6 samples with S34(F/Y) point mutations in U2AF1
and 108 control samples. Other than the S34 variants in the mutant samples, none of the
samples harbored point mutations or copy number alterations in U2AF1 or any of 272 other
splicing factors. Additionally, M3 and M7 AML subtypes were excluded from the controls
because these subtypes did not harbor U2AF1 point mutations (all other subtypes did harbor
U2AF1 point mutations) and because excluding them further reduced patient heterogeneity
within the control set. Patient mutation status was assessed from the updated clinical
information in SuppTable01.update.2013.05.13.xlsx, available at the above website. Junctions
having greater than 5 reads in at least 5 samples were analyzed as described above. A p-value
assessing the likelihood of finding, by chance, 10 junctions out of 544 dysregulated (FDR < 5%) in
a direction consistent with the CD34+ discovery set was calculated from 1000 trials in which 544
junctions were randomly sampled from the subset of the TCGA data set meeting the
requirement of junctions having greater than 5 reads in at least 5 samples. Directionality of fold
change was not restricted in the 1000 random trials, as it was for the 544 discovered junctions,
thus ensuring the calculated p-value is conservative.
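The random-sampling procedure can be sketched as follows. The text does not state how the p-value was derived from the 1000 trials, so this sketch assumes it is the fraction of trials in which at least as many sampled junctions reach FDR < 5% as were observed; the function name and arguments are hypothetical.

```python
import random
from typing import Sequence


def empirical_p_value(background_fdrs: Sequence[float], observed_hits: int,
                      sample_size: int = 544, n_trials: int = 1000,
                      fdr_cutoff: float = 0.05, seed: int = 0) -> float:
    """Fraction of random trials with at least `observed_hits` significant junctions.

    `background_fdrs` holds one FDR per TCGA junction that passed the read-count
    filter. Direction of fold change is deliberately ignored in the random
    trials, which makes the resulting p-value conservative.
    """
    rng = random.Random(seed)
    pool = list(background_fdrs)
    at_least_as_extreme = 0
    for _ in range(n_trials):
        sampled = rng.sample(pool, sample_size)  # sampling without replacement
        hits = sum(1 for fdr in sampled if fdr < fdr_cutoff)
        if hits >= observed_hits:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_trials
```

For the analysis described above, `observed_hits` would be 10 and `sample_size` 544.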
Expression clustering. Gene-level expression clustering was performed using heatmap in R,
with default parameters, except for the distance function, as described below. Expression
values were log counts per million (logCPM) output by edgeR and then adjusted to
account for intra-pair correlation. Specifically, we used removeBatchEffect to fit the logCPM to
the same linear model used for statistical testing, logCPM ~ condition + replicate, and then to
subtract off the replicate effect from the logCPM to define the adjusted logCPM. This was done
by passing removeBatchEffect a linear model describing the effects we wished to preserve
(namely, logCPM ~ condition) and treating the replicate factor as the batch factor. Each
adjusted gene expression value was then transformed to a z-score by subtracting off that gene’s
mean expression across samples and dividing by the gene’s standard deviation. Resulting z-scores were input to heatmap for clustering. The distance(x,y) between genes x and y was
defined in terms of their correlation(x,y) as 1/2 * [ 1 - correlation(x,y) ], by defining the heatmap
distance function as distfun = function(x) (1 - cor(t(x))) / 2. This ensures that distances range
between 0 (for perfectly correlated genes) and 1 (for perfectly anti-correlated genes).
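The z-score transformation and the correlation-based distance can be written out in a few lines. The sketch below is in Python rather than R and is illustrative only; it assumes Pearson correlation (the default of R's cor) and the sample standard deviation (as in R's sd), and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev
from typing import List, Sequence


def z_scores(values: Sequence[float]) -> List[float]:
    """Centre one gene's expression on its mean and scale by its sample SD."""
    mu = mean(values)
    sd = stdev(values)  # n - 1 denominator
    return [(v - mu) / sd for v in values]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long expression vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Distance (1 - r) / 2: 0 for perfectly correlated genes, 1 for anti-correlated."""
    return (1.0 - pearson(x, y)) / 2.0
```

Because Pearson correlation is unchanged by centring and scaling, the distance between two genes is the same whether it is computed on the adjusted logCPM values or on their z-scores.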
Consensus sequence analysis. Consensus 5’ and 3’ splice site sequences corresponding to
skipped exons and to alternative splice site usage were computed via WebLogo.⁹ Junctions
resulting from a skipped exon (“skipped exon junctions”) were defined as those with a single
skipped exon and inferred to be differentially expressed with an FDR < 5%. The splice sites of
the skipped exon itself were determined as those belonging to a junction that was not identical to
the skipped exon junction, but which shared one of its splice sites and involved no skipped
exons. Junctions participating in alternative splice site selection were defined as those (1)
having an FDR < 5%; (2) for which the same 5’ splice site was paired with exactly two
expressed 3’ splice sites (for alternative 3’ splice site usage) or, conversely, the same 3’ splice
site was paired with exactly two expressed 5’ splice sites (for alternative 5’ splice site usage);
and (3) for which neither of the junctions involved the skipping of any exon. We refer to the
differentially expressed splice site as the alternative splice site and the other of the two splice
sites in the pair as the canonical splice site. By restricting to these simple cases (of only two
possible alternatives), we could be confident that the alternative splice site was preferentially
used relative to the single canonical splice site; this direct contrast would not have been
possible with multiple alternative splice sites, all of which may be in competition with each other
as well as with the canonical splice site. These splice sites were extended to define the flanking
sequences of skipped exons, alternative 5’ and 3’ splice sites, and canonical 5’ and 3’ splice
sites. Skipped exons, canonical splice sites and alternative splice sites were analyzed
separately according to whether they occurred more [log2(fold change) > 0] or less [log2(fold
change) < 0] in the S34F sample relative to WT. Control flanking regions were defined as those
corresponding to junctions that showed no evidence for differential expression, i.e., had
|log2(fold change)| < 0.001. All of these had an FDR = 100%.
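The selection of alternative 3’ splice sites under criteria (1) to (3) can be sketched as below. This is an illustrative Python reading of the rules, not the code used for the analysis: the Junction record is hypothetical, the input is assumed to contain only expressed junctions from one chromosome and strand, and a pair is kept only when exactly one of its two junctions is significant (an assumption, since the text does not say how pairs with two significant junctions were handled).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Junction(NamedTuple):
    donor: int          # 5' splice site coordinate
    acceptor: int       # 3' splice site coordinate
    fdr: float          # FDR from the differential expression test
    exons_skipped: int  # number of annotated exons skipped by the junction


def alternative_3ss_pairs(junctions: Iterable[Junction],
                          fdr_cutoff: float = 0.05) -> List[Tuple[Junction, Junction]]:
    """Return (alternative, canonical) junction pairs that share one 5' splice site."""
    by_donor: Dict[int, List[Junction]] = defaultdict(list)
    for junction in junctions:
        by_donor[junction.donor].append(junction)

    pairs = []
    for group in by_donor.values():
        if len(group) != 2:
            continue  # criterion 2: exactly two expressed 3' splice sites
        if any(j.exons_skipped > 0 for j in group):
            continue  # criterion 3: neither junction skips an exon
        significant = [j for j in group if j.fdr < fdr_cutoff]
        if len(significant) != 1:
            continue  # criterion 1, plus the one-significant-junction assumption
        alternative = significant[0]
        canonical = group[1] if group[0] is alternative else group[0]
        pairs.append((alternative, canonical))
    return pairs
```

Alternative 5’ splice sites follow by symmetry, grouping on the acceptor instead of the donor.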
Splice site sequence analysis of the AML sample harboring U2AF1 (Q157P) followed the
above, though dysregulated junctions were defined as those with |log2(fold change)| > 1, as
opposed to FDR < 5%, and having greater than 5 reads in half of the samples. These changes
were necessitated by the difficulty of reliably performing hypothesis testing (and computing
FDRs) without replicate U2AF1 (Q157P) samples. To verify that the different results obtained
for U2AF1 (Q157P) in AML samples relative to the FDR-based analysis of U2AF1 (S34F) in
CD34+ samples (Figure 3) were not an artifact of this different analysis or cell type, we confirmed
that the sequence contexts obtained by using a |log2(fold change)| > 1 cutoff for the AML
samples harboring U2AF1 (S34F) were similar to those obtained by using an FDR < 5% cutoff
for the CD34+ samples harboring U2AF1 (S34F; data not shown).
Purification of protein complexes. The U2AF1, U2AF2, and SF1 protein subunits were
expressed separately in E. coli and purified by affinity chromatography as fusions
with MBP (U2AF1) or GST (U2AF2 and SF1). The U2AF1 and U2AF2 proteins are full length with the exception of
nonspecific RS domains; SF1 includes the U2AF2- and pre-mRNA-binding domains. Prior to
mixing the subunits, the GST tag was cleaved and separated from U2AF2 or SF1 by cation-exchange chromatography and the MBP-U2AF1 fusion was pre-purified by size-exclusion
chromatography. The U2AF2, SF1 and MBP-U2AF1 subunits were then mixed and dialyzed
during protease cleavage of the MBP tag from U2AF1. A final step of size-exclusion
chromatography ensured homogeneous complexes for affinity determination (Supplementary
Figure 5A).
SUPPLEMENTARY FIGURE LEGENDS
Supplementary Figure 1. Unique reads mapped to the human transcriptome and the
distribution of mapped bases are similar in biological replicates. (a) 300-500M reads were
obtained by RNA-seq from all four pairs of biological replicates (R1, R2, R3 and R4). (b)
Reads uniquely mapped to the human transcriptome. The S34F mutant sample from
R3 had a significantly higher proportion of redundant reads. (c) Distribution of mapped bases;
the percent of mapped bases for coding, untranslated (UTR), intergenic, intronic and ribosomal
bases for the four biological replicates. (d) A G>A substitution was detected only in samples
transfected with mutant U2AF1. Mutant U2AF1 represented 85-97% of total U2AF1 expression.
Supplementary Figure 2. The S34F mutant does not significantly alter gene expression.
(a) Unsupervised analysis of 17,390 genes weakly segregates samples according to genotype,
with similar branch lengths within and connecting clusters/genotypes and with inconsistent over- or under-expression within a cluster. (b) Supervised analysis of 1,296 genes (with FDR < 5%)
strongly segregates samples by genotype and, further, the direction of a gene’s over- or under-expression in WT relative to S34F is always consistent across sample pairs.
Supplementary Figure 3. Junction-level analysis reveals distribution of dysregulated
splicing events that are independent of 5’ splice site context. (a) Schematic representation
of the different types of junctions discovered during RNA-seq including known junctions (known
donor and acceptor sites in known combinations) and novel junctions (known donor and
acceptor in novel combinations, novel donor and known acceptor, known donor and novel
acceptor, and novel donor and acceptor). (b) Venn diagram representing genes that were
differentially expressed (1,246, FDR<5%), genes that had differential junction expression (959,
FDR<5%) and an overlap of genes that were differentially expressed that also had differential
junction expression (241). (c) Overall distribution of various alternative splicing events involves
junctions that are expressed more in S34F versus WT (left pie chart) and those that have a
higher expression in WT (right pie chart). (d) Logos nucleotide sequence analysis at the 5’
splice sites of skipped exons (top and middle panels). 5’ splice sites of junctions not altered by
U2AF1 (S34F) [i.e., |log2(fold change)| < 0.001] are shown as control (lower panel).
Supplementary Figure 4. S34F mutation induces increased, validated expression of
known and novel junctions in CD34+ cells and patient samples. U2AF1 (S34F) caused
increased utilization of a 3’ alternative splice site in (a) KIAA1033 and IFI44, and increased exon
skipping in (b) SMN1 in transfected CD34+ cells (left panels) and primary MDS samples (right
panels). (c) Increased exon skipping in ATP5H, RAB1B, DDX50, DAP3 and IARS. (d)
Increased utilization of novel junctions in ABI1, HERC5, CCT6A and CD53. The data represent
mean +/- SD of 3 replicates, repeated in 3 biological replicates with similar results. *P<0.05,
**P<0.01, ***P<0.001.
Supplementary Figure 5. (a) Purified recombinant proteins for RNA binding experiments
analyzed by SDS-PAGE stained with Coomassie-blue. Samples include: 1, MBP-fused
U2AF1/U2AF2/SF1 complex prior to proteolytic cleavage with TEV protease to remove the MBP
tag; 2, U2AF1/U2AF2/SF1 after cleavage of MBP (40 kDa); 3, final purified wild-type
U2AF1/U2AF2/SF1 after size exclusion chromatography; 4, TEV protease (26 kDa, control); 5,
final purified U2AF1(S34F)/U2AF2/SF1 after size exclusion chromatography. (b) Fluorescence
anisotropy curves for binding of U2AF2/SF1 complexes with wild-type or U2AF1(S34F) to the
indicated RNA sites from the DEK oncogene. The average data points and standard deviations
of three independent experiments are given. Solid lines represent the nonlinear fits of the data
as described in Methods.
Supplementary Figure 6. The S34F mutation does not affect U2AF1 localization within
nuclear speckles. (a) Immunofluorescence analysis of transfected 293T cells demonstrated
that WT or U2AF1 (S34F) (green) and U2AF2 (red) co-localized (merge) normally within nuclear
speckles. (b) Staining for U2AF1-Flag (green) and Smith Antigen family of snRNP proteins
(red) further demonstrates normal co-localization of both WT and S34F U2AF1 within the nuclear
speckles (merged). Nuclei were counterstained with TOPRO (blue). Repeated in 4 separate
experiments with similar results. Images acquired at 63X, 1.4 numerical aperture, Zeiss Plan
Apochromat oil objective at 2.5 zoom and captured using Zeiss LSM510 software.
Supplementary Figure 7. Recurrent missense mutations in U2AF1 have specific effects
on alternative splicing. (a) U2AF1 expression was increased >5-fold in cells transfected with
WT or Q157P, compared to MIG control (left panel). Error bars represent SD of 3 technical
replicates. (b) Exogenous (WT or Q157P) Flag-tagged U2AF1 protein abundance compared
with MIG empty vector control. Numbers beneath the blot are the sum of U2AF1-Flag and
endogenous U2AF1 in each case, normalized to MIG, as determined by densitometry analysis.
(c) Exon skipping of GH1 minigene was decreased by the Q157P mutant, compared to MIG or
WT controls. (d) Similar effects were seen with the FMR1 minigene. For both minigene assays, SD is
shown for 3 technical replicates; repeated in 3 independent experiments. *P<0.05, **P<0.01,
***P<0.001.
Supplementary Figure 8. Junctions that are alternatively spliced in an AML sample
harboring (Q157P) mutant U2AF1 do not have a consensus sequence at e-3. Logos plots
depicting (a) skipped exons, (b) skipped 3’ junctions, and (c) alternative junctions share the
consensus C at position e-3 with (d) control junctions. This is in contrast to the preference for U
at e-3 at junctions alternatively spliced by (S34F) mutant U2AF1 (see Figure 3).
SUPPLEMENTARY REFERENCES
1. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods 2012; 9(4): 357-359.
2. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg S. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology 2013; 14(4): R36.
3. Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, et al. Ensembl 2013. Nucleic Acids Research 2013; 41(D1): D48-D55.
4. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009; 25(16): 2078-2079.
5. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol 2010; 28(5): 511-515.
6. Anders S, Pyl PT, Huber W. HTSeq - A Python Framework to Work with High-throughput Sequencing Data. bioRxiv 2014; in press.
7. Robinson M, McCarthy D, Smyth G. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010; 26: 139-140.
8. Benjamini Y, Hochberg Y. Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B-Methodological 1995; 57(1): 289-300.
9. Crooks GE, Hon G, Chandonia J-M, Brenner SE. WebLogo: A Sequence Logo Generator. Genome Research 2004; 14(6): 1188-1190.
SUPPLEMENTARY TABLES
Supplementary Table 1: Global gene expression profiling of CD34+ cells expressing U2AF1
WT or U2AF1 (S34F) mutant.
Supplementary Table 2: Differential expression of known and novel splice junctions.
Supplementary Table 3: Differential expression of junctions in RNA seq vs AML 200 data
(TCGA).
Supplementary Table 4: U at position e-3 on 5 validated junctions in MDS patient samples.
Supplementary Table 5: PCR primers used in splice junction validation.