Supplementary Material for “U2AF1 Mutations Alter Sequence Specificity of pre-mRNA Binding and Splicing”

Theresa Okeyo-Owuor1, Brian S. White1,2, Rakesh Chatrikhi3, Dipika R. Mohan1, Sanghyun Kim1, Malachi Griffith2, Li Ding2, Shamika Ketkar-Kulkarni1, Jasreet Hundal2, Kholiswa M. Laird3, Clara L. Kielkopf3, Timothy J. Ley1,2, Matthew J. Walter1, Timothy A. Graubert1,4

1 Department of Internal Medicine, Division of Oncology, Washington University, Saint Louis, MO, USA
2 The Genome Institute, Washington University School of Medicine, Saint Louis, MO, USA
3 Department of Biochemistry and Biophysics, University of Rochester Medical Center, Rochester, NY, USA
4 current address: Massachusetts General Hospital Cancer Center

SUPPLEMENTARY METHODS

RNA-seq alignment and preprocessing. RNA-seq analysis was performed with The Genome Institute’s Genome Modeling System (manuscript in preparation) using the RNA-seq processing-profile ‘2819506’. Quality of raw RNA sequence data in the form of FastQ data files was assessed using FastQC version 0.10.0 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Paired 2x100 bp sequence reads were trimmed to remove ‘SPIA’ adapters (ligated during cDNA synthesis) using the read trimmer ‘flexbar’ version 229 (https://wiki.gacrc.uga.edu/wiki/FAR) with the following parameters set: ‘--adapter CTTTGTGTTTGA --adapter-trim-end LEFT --nono-length-dist --threads 4 --adapter-min-overlap 7 --max-uncalled 150 --min-readlength 25’. After trimming, reads were aligned to a modified version of the human genome reference sequence (NCBI build 37) with alternative haplotype sequences omitted. Initial segmented alignments were performed using bowtie version 2.1.0 [1] followed by spliced alignments with TopHat version 2.0.8 [2]. During alignment, TopHat was supplied transcript models in GTF format using the ‘-g’ parameter.
Transcript models representing known and predicted human transcripts were obtained from Ensembl version 67 [3]. The binary sequence alignment (BAM) files obtained by alignment of RNA-seq reads with TopHat were summarized by use of SAMStat version 1.08 and SAMtools version 0.1.18 (specifically, the idxstats and flagstat utilities) [4]. The quality of alignments was assessed using Picard version 1.85 (specifically, the RnaSeqMetrics utility) (http://picard.sourceforge.net/). Following alignment, expression estimates in the form of Fragments Per Kilobase of exon per Million fragments mapped (FPKM) were calculated by Cufflinks version 2.0.2 [5] using default parameters except for ‘--num-threads 4 --max-bundle-length 10000000’. Transcript models were supplied to Cufflinks using the ‘-g’ option and the same GTF described above. Transcripts corresponding to mitochondrial and ribosomal genes, as indicated by Ensembl gene biotypes, were masked during calculation of transcript expression estimates.

Exon-exon junction counts were obtained by parsing the ‘junctions.bed’ file produced by TopHat. This file reports the coordinates of all introns observed by splice-aware alignment of reads to the genome and the number of reads supporting each. Each observed exon-exon junction was annotated with the genomic coordinates of its defining acceptor and donor sites and cross-referenced against the human transcripts in Ensembl version 67 to determine the number, if any, of exons skipped. Each junction was then assigned to one of five classes describing its relationship to known Ensembl transcripts: ‘DA’, ‘NDA’, ‘D’, ‘A’, and ‘N’. ‘DA’ junctions are those where the exon-exon combination corresponds to one or more known Ensembl transcripts. ‘NDA’ refers to a novel pairing of a donor site and an acceptor site, each of which is used individually in Ensembl transcripts but which have not been observed in that combination. ‘D’ refers to junctions that use a known donor site but a novel acceptor site.
‘A’ refers to junctions that use a known acceptor site but a novel donor site. ‘N’ refers to junctions that do not correspond to any Ensembl transcript at donor or acceptor sites. Gene-level counts were obtained by excluding unmapped reads or secondary read alignments from the BAM files (using samtools view -F 0x0104) and processing the resulting reads using HTSeq [6] (with parameters: mode intersection-strict; minaqual 1; stranded no; type exon; idattr gene_id).

Gene- and junction-level differential expression analysis. Differential gene- and junction-level expression across WT and S34F mutant U2AF1 samples was inferred using edgeR [7] in R. edgeR uses negative binomial generalized linear models (GLMs) to infer statistically significant differences between condition-specific (here, WT and S34F) bin counts (here, corresponding to total reads mapped to a gene or to a junction). Importantly, we exploited edgeR’s capability to represent the paired experimental design used here. Specifically, we tested for differential expression between WT and S34F samples within biological replicates, i.e., adjusting for differences between biological replicates, by using an additive linear model with biological replicate as the blocking factor: expr ~ condition + replicate, where condition corresponds to WT or S34F and replicate is an identifier for the biological replicate. For junction-level analysis, ‘DA’, ‘NDA’, ‘D’, and ‘A’ junctions were analyzed independently. We considered only junctions having greater than 5 counts in at least 2 (of 6) samples and excluded ‘N’ junctions.
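The five junction classes and the read-count filter described above can be illustrated with a minimal Python sketch. This is not the code used in the study: the function names, the (chromosome, coordinate) representation of a splice site, and the toy annotation are all assumptions made for illustration, and the real annotation was derived from Ensembl version 67.

```python
from typing import Sequence, Set, Tuple

# Hypothetical representation of a splice site: (chromosome, coordinate).
Site = Tuple[str, int]


def classify_junction(
    donor: Site,
    acceptor: Site,
    known_pairs: Set[Tuple[Site, Site]],
    known_donors: Set[Site],
    known_acceptors: Set[Site],
) -> str:
    """Return 'DA', 'NDA', 'D', 'A', or 'N' for one observed junction."""
    if (donor, acceptor) in known_pairs:
        return "DA"   # known donor and acceptor in a known combination
    donor_known = donor in known_donors
    acceptor_known = acceptor in known_acceptors
    if donor_known and acceptor_known:
        return "NDA"  # both sites known, combination novel
    if donor_known:
        return "D"    # known donor, novel acceptor
    if acceptor_known:
        return "A"    # known acceptor, novel donor
    return "N"        # neither site annotated


def passes_count_filter(
    counts: Sequence[int], min_count: int = 5, min_samples: int = 2
) -> bool:
    """True if more than `min_count` reads support the junction
    in at least `min_samples` samples."""
    return sum(1 for c in counts if c > min_count) >= min_samples


# Toy annotation with made-up coordinates (not Ensembl data).
known_pairs = {(("chr1", 100), ("chr1", 200)), (("chr1", 300), ("chr1", 400))}
known_donors = {d for d, _ in known_pairs}
known_acceptors = {a for _, a in known_pairs}
```

In this sketch a junction would enter the differential expression analysis only if its class is not ‘N’ and passes_count_filter returns True for its per-sample counts.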
Using edgeR, we performed TMM normalization [7] (using calcNormFactors), estimated the average dispersion over all junctions (using estimateGLMCommonDisp), estimated a dispersion parameter for each junction with a count-dependent trend line (using estimateGLMTrendedDisp and df=20), computed an empirical Bayes estimate of the negative binomial dispersion parameter for each junction based on a log-linear model (using estimateGLMTagwiseDisp), fit junction-wise negative binomial GLMs (using glmFit), performed likelihood ratio tests for the condition coefficients in the model (using glmLRT), and output test results, including p-values, false discovery rates (FDRs; i.e., p-values corrected for multiple testing using the method of Benjamini and Hochberg [8]) and log-fold changes (using topTags). Gene-wise differential expression analysis was performed exactly as above except that a gene was analyzed only if it had an FPKM > 0.1 and a read count > 25 across at least 2 samples. Gene and junction expression data can be found in Supplementary Tables 1 and 2, respectively. A volcano plot of junction log2(fold change) versus p-value (−log10 scale) was created using R.

Validation within the publicly available TCGA AML data set (https://tcga-data.nci.nih.gov/docs/publications/laml_2012/) of the 544 junctions discovered in CD34+ cells having FDR < 5% and |log2(fold change)| > 2 was similarly performed using edgeR. An unpaired analysis was performed between 6 samples with S34(F/Y) point mutations in U2AF1 and 108 control samples. Other than the S34 variants in the mutant samples, none of the samples harbored point mutations or copy number alterations in U2AF1 or any of 272 other splicing factors.
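For readers unfamiliar with the thresholds used above, the following Python sketch shows a Benjamini-Hochberg adjustment followed by selection on FDR < 5% and |log2(fold change)| > 2. It is only an illustration: in the study these values were produced by edgeR (topTags), and the function names and the toy results below are hypothetical.

```python
from typing import List, Sequence, Tuple


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # ascending p-values
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_junctions(
    results: Sequence[Tuple[str, float, float]],
    fdr_cutoff: float = 0.05,
    min_abs_log2fc: float = 2.0,
) -> List[str]:
    """Keep junction ids with FDR < fdr_cutoff and |log2FC| > min_abs_log2fc.

    Each entry of `results` is (junction_id, log2_fold_change, p_value).
    """
    fdrs = benjamini_hochberg([p for _, _, p in results])
    return [
        jid
        for (jid, log2fc, _), fdr in zip(results, fdrs)
        if fdr < fdr_cutoff and abs(log2fc) > min_abs_log2fc
    ]


# Made-up test results for four hypothetical junctions.
toy_results = [
    ("j1", 3.0, 0.001),
    ("j2", -2.5, 0.004),
    ("j3", 0.5, 0.002),
    ("j4", 2.2, 0.6),
]
selected = select_junctions(toy_results)
```

Here j3 is dropped for its small fold change and j4 for its large FDR, even though j4 exceeds the fold-change threshold.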
Additionally, M3 and M7 AML subtypes were excluded from the controls because these subtypes did not harbor U2AF1 point mutations (all other subtypes did) and because excluding them further reduced patient heterogeneity within the control set. Patient mutation status was assessed from the updated clinical information in SuppTable01.update.2013.05.13.xlsx, available at the above website. Junctions having greater than 5 reads in at least 5 samples were analyzed as described above. A p-value assessing the likelihood of finding, by chance, 10 of 544 junctions dysregulated (FDR < 5%) in a direction consistent with the CD34+ discovery set was calculated from 1000 trials, in each of which 544 junctions were randomly sampled from the subset of TCGA junctions having greater than 5 reads in at least 5 samples. Directionality of fold change was not restricted in the 1000 random trials, as it was for the 544 discovered junctions, thus ensuring the calculated p-value is conservative.

Expression clustering. Gene-level expression clustering was performed using heatmap in R, with default parameters, except for the distance function, as described below. The expression values clustered were log counts per million (logCPM) output by edgeR and then adjusted to account for intra-pair correlation. Specifically, we used removeBatchEffect to fit the logCPM to the same linear model used for statistical testing, logCPM ~ condition + replicate, and then to subtract the replicate effect from the logCPM to define the adjusted logCPM. This was done by passing removeBatchEffect a linear model describing the effects we wished to preserve (namely, logCPM ~ condition) and treating the replicate factor as the batch factor. Each adjusted gene expression value was then transformed to a z-score by subtracting that gene’s mean expression across samples and dividing by the gene’s standard deviation.
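The per-gene z-score transformation above, together with the correlation-based distance 1/2 * [1 - correlation(x,y)] used for clustering, can be sketched in plain Python. This is an illustrative sketch rather than the authors' R code; the function names are hypothetical, and the use of the sample standard deviation (n - 1 denominator, as in R's sd) is an assumption, since the text does not specify it.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence


def zscore(values: Sequence[float]) -> List[float]:
    """Center one gene's expression values on its mean and scale by its
    sample standard deviation (assumed; n - 1 denominator)."""
    mu = mean(values)
    sd = stdev(values)
    return [(v - mu) / sd for v in values]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(ssx * ssy)


def correlation_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Distance in [0, 1]: 0 for perfectly correlated genes,
    1 for perfectly anti-correlated genes."""
    return (1.0 - pearson(x, y)) / 2.0


# Toy adjusted logCPM values for two hypothetical genes across three samples.
gene_a = zscore([1.0, 2.0, 3.0])
gene_b = zscore([6.0, 4.0, 2.0])
d_ab = correlation_distance(gene_a, gene_b)
```

Because Pearson correlation is invariant to centering and scaling, the distance between two genes is the same whether it is computed on the adjusted logCPM or on the z-scores; the z-scores mainly affect how the heatmap is colored.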
Resulting z-scores were input to heatmap for clustering. The distance(x,y) between gene x and y was defined in terms of their correlation(x,y) as 1/2 * [ 1 - correlation(x,y) ], by defining the heatmap distance function as distfun = function(x) (1 - cor(t(x))) / 2. This ensures that distances range between 0 (for perfectly correlated genes) and 1 (for perfectly anti-correlated genes).

Consensus sequence analysis. Consensus 5’ and 3’ splice site sequences corresponding to skipped exons and to alternative splice site usage were computed via WebLogo [9]. Junctions resulting from a skipped exon (“skipped exon junctions”) were defined as those with a single skipped exon and inferred to be differentially expressed with an FDR < 5%. The splice sites of the spliced exon itself were determined as those belonging to a junction that was not identical to the skipped exon junction, but which shared one of its splice sites and involved no skipped exons. Junctions participating in alternative splice site selection were defined as those (1) having an FDR < 5%; (2) for which the same 5’ splice site was paired with exactly two expressed 3’ splice sites (for alternative 3’ splice site usage) or, conversely, the same 3’ splice site was paired with exactly two expressed 5’ splice sites (for alternative 5’ splice site usage); and (3) for which neither of the junctions involved the skipping of any exon. We refer to the differentially expressed splice site as the alternative splice site and the other of the two splice sites in the pair as the canonical splice site. By restricting to these simple cases (of only two possible alternatives), we could be confident that the alternative splice site was preferentially used relative to the single canonical splice site; this direct contrast would not have been possible with multiple alternative splice sites, all of which may be in competition with each other as well as with the canonical splice site.
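The selection of simple alternative 3’ splice site pairs described above can be sketched as follows. This is a hypothetical Python illustration, not the authors' implementation: the Junction record, the function name, and the toy coordinates are invented, the input is assumed to contain only expressed junctions, and the requirement that exactly one junction of each pair be significant is a simplifying assumption made so that the alternative and canonical sites are unambiguous.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Junction(NamedTuple):
    donor: int          # 5' splice site coordinate (toy value)
    acceptor: int       # 3' splice site coordinate (toy value)
    fdr: float          # FDR from the differential expression test
    skipped_exons: int  # number of annotated exons skipped by the junction


def alternative_3prime_pairs(
    junctions: List[Junction], fdr_cutoff: float = 0.05
) -> List[Tuple[Junction, Junction]]:
    """Return (alternative, canonical) junction pairs sharing a 5' splice site."""
    by_donor: Dict[int, List[Junction]] = defaultdict(list)
    for j in junctions:
        by_donor[j.donor].append(j)

    pairs = []
    for donor in sorted(by_donor):
        group = by_donor[donor]
        if len(group) != 2:
            continue  # require exactly two expressed 3' splice sites
        if any(j.skipped_exons > 0 for j in group):
            continue  # neither junction may skip an exon
        significant = [j for j in group if j.fdr < fdr_cutoff]
        if len(significant) != 1:
            continue  # assumption: exactly one junction is significant
        alternative = significant[0]
        canonical = next(j for j in group if j is not alternative)
        pairs.append((alternative, canonical))
    return pairs


# Toy junctions with made-up coordinates and FDRs.
toy_junctions = [
    Junction(100, 200, 0.50, 0),
    Junction(100, 230, 0.01, 0),  # donor 100: simple two-way choice
    Junction(500, 600, 0.02, 1),  # donor 500: one junction skips an exon
    Junction(500, 650, 0.90, 0),
    Junction(900, 950, 0.01, 0),  # donor 900: only one 3' splice site
]
toy_pairs = alternative_3prime_pairs(toy_junctions)
```

Alternative 5’ splice site usage would follow the same pattern with the roles of donor and acceptor exchanged.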
These splice sites were extended to define the flanking sequences of skipped exons, alternative 5’ and 3’ splice sites, and canonical 5’ and 3’ splice sites. Skipped exons, canonical splice sites and alternative splice sites were analyzed separately according to whether they occurred more [log2(fold change) > 0] or less [log2(fold change) < 0] in the S34F sample relative to WT. Control flanking regions were defined as those corresponding to junctions that showed no evidence for differential expression, i.e., had a |log2(fold change)| < 0.001. All of these had an FDR = 100%.

Splice site sequence analysis of the AML sample harboring U2AF1 (Q157P) followed the above, though dysregulated junctions were defined as those with |log2(fold change)| > 1, as opposed to FDR < 5%, and having greater than 5 reads in half of the samples. These changes were necessitated by the difficulty of reliably performing hypothesis testing (and computing FDRs) without replicate U2AF1 (Q157P) samples. To rule out that the different results obtained for U2AF1 (Q157P) in AML samples relative to the FDR-based analysis of U2AF1 (S34F) in CD34+ samples (Figure 3) were an artifact of this different analysis or cell type, we confirmed that the sequence contexts obtained by using a |log2(fold change)| > 1 cutoff for the AML samples harboring U2AF1 (S34F) were similar to those obtained by using an FDR < 5% cutoff for the CD34+ samples harboring U2AF1 (S34F; data not shown).

Purification of protein complexes. The U2AF1, U2AF2, and SF1 protein subunits were expressed separately in E. coli and purified by affinity chromatography as fusions with MBP (U2AF1) or GST (U2AF2 and SF1). The U2AF1 and U2AF2 proteins are full length with the exception of nonspecific RS domains; SF1 includes the U2AF2- and pre-mRNA-binding domains.
Prior to mixing the subunits, the GST tag was cleaved and separated from U2AF2 or SF1 by cation-exchange chromatography and the MBP-U2AF1 fusion was pre-purified by size-exclusion chromatography. The U2AF2, SF1 and MBP-U2AF1 subunits were then mixed and dialyzed during protease cleavage of the MBP tag from U2AF1. A final step of size-exclusion chromatography ensured homogeneous complexes for affinity determination (Supplementary Figure 5A).

SUPPLEMENTARY FIGURE LEGENDS

Supplementary Figure 1. Unique reads mapped to the human transcriptome and the distribution of mapped bases are similar in biological replicates. (a) 300-500M reads were obtained by RNA-seq from all four pairs of biological replicates (R1, R2, R3 and R4). (b) Uniquely mapped reads mapped to the human transcriptome. The S34F mutant sample from R3 had a significantly higher proportion of redundant reads. (c) Distribution of mapped bases; the percent of mapped bases for coding, untranslated (UTR), intergenic, intronic and ribosomal bases for the four biological replicates. (d) A G>A substitution was detected only in samples transfected with mutant U2AF1. Mutant U2AF1 represented 85-97% of total U2AF1 expression.

Supplementary Figure 2. The S34F mutant does not significantly alter gene expression. (a) Unsupervised analysis of 17,390 genes weakly segregates samples according to genotype, with similar branch lengths within and connecting clusters/genotypes and with inconsistent over- or under-expression within a cluster. (b) Supervised analysis of 1,296 genes (with FDR < 5%) strongly segregates samples by genotype and, further, the direction of a gene’s over- or under-expression in WT relative to S34F is always consistent across sample pairs.

Supplementary Figure 3. Junction-level analysis reveals distribution of dysregulated splicing events that are independent of 5’ splice site context.
(a) Schematic representation of the different types of junctions discovered during RNA-seq including known junctions (known donor and acceptor sites in known combinations) and novel junctions (known donor and acceptor in novel combinations, novel donor and known acceptor, known donor and novel acceptor, and novel donor and acceptor). (b) Venn diagram representing genes that were differentially expressed (1,246, FDR < 5%), genes that had differential junction expression (959, FDR < 5%) and an overlap of genes that were differentially expressed that also had differential junction expression (241). (c) Overall distribution of various alternative splicing events involves junctions that are expressed more in S34F versus WT (left pie chart) and those that have a higher expression in WT (right pie chart). (d) Logos nucleotide sequence analysis at the 5’ splice sites of skipped exons (top and middle panels). 5’ splice sites of junctions not altered by U2AF1 (S34F) [i.e., |log2(fold change)| < 0.001] are shown as control (lower panel).

Supplementary Figure 4. S34F mutation induces increased, validated expression of known and novel junctions in CD34+ cells and patient samples. U2AF1 (S34F) caused increased utilization of a 3’ alternative splice site in (a) KIAA1033 and IFI44, and increased exon skipping in (b) SMN1 in transfected CD34+ cells (left panels) and primary MDS samples (right panels). (c) Increased exon skipping in ATP5H, RAB1B, DDX50, DAP3 and IARS. (d) Increased utilization of novel junctions in ABI1, HERC5, CCT6A and CD53. The data represent mean +/- SD of 3 replicates, repeated in 3 biological replicates with similar results. *P<0.05, **P<0.01, ***P<0.001.

Supplementary Figure 5. (a) Purified recombinant proteins for RNA binding experiments analyzed by SDS-PAGE stained with Coomassie blue.
Samples include: 1, MBP-fused U2AF1/U2AF2/SF1 complex prior to proteolytic cleavage with TEV protease to remove the MBP tag; 2, U2AF1/U2AF2/SF1 after cleavage of MBP (40 kDa); 3, final purified wild-type U2AF1/U2AF2/SF1 after size exclusion chromatography; 4, TEV protease (26 kDa, control); 5, final purified U2AF1(S34F)/U2AF2/SF1 after size exclusion chromatography. (b) Fluorescence anisotropy curves for binding of U2AF2/SF1 complexes with wild-type U2AF1 or U2AF1(S34F) to the indicated RNA sites from the DEK oncogene. The average data points and standard deviations of three independent experiments are given. Solid lines represent the nonlinear fits of the data as described in Methods.

Supplementary Figure 6. The S34F mutation does not affect U2AF1 localization within nuclear speckles. (a) Immunofluorescence analysis of transfected 293T cells demonstrated that WT U2AF1 or U2AF1 (S34F) (green) and U2AF2 (red) co-localized (merge) normally within nuclear speckles. (b) Staining for U2AF1-Flag (green) and Smith Antigen family of snRNP proteins (red) further demonstrates normal co-localization of both WT and S34F U2AF1 within the nuclear speckles (merged). Nuclei were counterstained with TOPRO (blue). Repeated in 4 separate experiments with similar results. Images acquired at 63X, 1.4 numerical aperture, Zeiss Plan Apochromat oil objective at 2.5 zoom and captured using Zeiss LSM510 software.

Supplementary Figure 7. Recurrent missense mutations in U2AF1 have specific effects on alternative splicing. (a) U2AF1 expression was increased >5-fold in cells transfected with WT or Q157P, compared to MIG control (left panel). Error bars represent SD of 3 technical replicates. (b) Exogenous (WT or Q157P) Flag-tagged U2AF1 protein abundance compared with MIG empty vector control. Numbers beneath the blot are the sum of U2AF1-Flag and endogenous U2AF1 in each case, normalized to MIG, obtained by densitometry analysis.
(c) Exon skipping of GH1 minigene was decreased by the Q157P mutant, compared to MIG or WT controls. (d) Similar effects seen with the FMR1 minigene. For both minigene assays, SD is shown for 3 technical replicates; repeated in 3 independent experiments. *P<0.05, **P<0.01, ***P<0.001.

Supplementary Figure 8. Junctions that are alternatively spliced in an AML sample harboring (Q157P) mutant U2AF1 do not have a consensus sequence at e-3. Logos plots depicting (a) skipped exons, (b) skipped 3’ junctions, and (c) alternative junctions share the consensus C at position e-3 with (d) control junctions. This is in contrast to the preference for U at e-3 at junctions alternatively spliced by (S34F) mutant U2AF1 (see Figure 3).

SUPPLEMENTARY REFERENCES

1. Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Meth 2012; 9(4): 357-359.
2. Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg S. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology 2013; 14(4): R36.
3. Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, et al. Ensembl 2013. Nucleic Acids Research 2013; 41(D1): D48-D55.
4. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009; 25(16): 2078-2079.
5. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotech 2010; 28(5): 511-515.
6. Anders S, Pyl PT, Huber W. HTSeq--A Python Framework to Work with High-throughput Sequencing Data. bioRxiv 2014; in press.
7. Robinson M, McCarthy D, Smyth G. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010; 26: 139-140.
8. Benjamini Y, Hochberg Y.
Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society Series B (Methodological) 1995; 57(1): 289-300.
9. Crooks GE, Hon G, Chandonia J-M, Brenner SE. WebLogo: A Sequence Logo Generator. Genome Research 2004; 14(6): 1188-1190.

SUPPLEMENTARY TABLES

Supplementary Table 1: Global gene expression profiling of CD34+ cells expressing U2AF1 WT or U2AF1 (S34F) mutant.
Supplementary Table 2: Differential expression of known and novel splice junctions.
Supplementary Table 3: Differential expression of junctions in RNA-seq vs AML 200 data (TCGA).
Supplementary Table 4: U at position e-3 on 5 validated junctions in MDS patient samples.
Supplementary Table 5: PCR primers used in splice junction validation.