Computational Methods

I. Lariat Discovery

Illumina Dataset.

To identify lariat branchpoints, we analyzed the Illumina Human Body Map 2.0 total RNA deep sequencing library. The reads derive from RNA samples from 16 human tissues. Most reads are 100 bp in length, and all linker sequences were removed prior to analysis.

Hg19 Annotated.

We discovered branchpoints by searching for reads with non-canonical arrangements of intronic sequences. Reads that aligned to the hg19 genome with three or fewer mismatches were eliminated. Each remaining read was split into all possible head and tail segments in which both the head and the tail were at least 15 nt long. We mapped all head and tail segments to the hg19 genome using bowtie. Head and tail segments were required to map to exactly one place in the genome and without any errors. Reads with segments that mapped in the expected order or did not map intronically were filtered from the dataset. The remaining inverted reads were mapped to splice sites. In cases where the read tail began at the first nucleotide of the intron and the read head mapped within 500 nt upstream of a 3’ss, we determined that the read spanned the 2’-5’ lariat linkage. In cases of alignment ambiguity, we allowed up to two mutations and assumed the alignment in which the tail maps to the first nucleotide of the intron. The last nucleotide of the read head was determined to be the branchpoint. A total of 2066 lariat reads were discovered through this screen (lariat_0 – lariat_2065 in BED track). Reads that suggested branchpoints supported by spliced EST evidence were also reported. Nine lariat reads were discovered through this screen (lariat_2110 – lariat_2118 in BED track).

Unannotated, but with Illumina transcript support.

To discover lariats forming in transcripts that are unannotated in the hg19 assembly, we built a library of potential spliced products inferred from the inverted reads.
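
The head/tail enumeration from the annotated screen (each read split into every head and tail of at least 15 nt) can be illustrated with a short Python sketch. The function name `split_read` and the synthetic read are hypothetical; only the 15 nt minimum and the 100 nt read length come from the text, and the bowtie mapping step is not shown.

```python
from typing import Iterator, Tuple

def split_read(read: str, min_len: int = 15) -> Iterator[Tuple[str, str]]:
    """Yield every (head, tail) split in which both parts are >= min_len nt."""
    for cut in range(min_len, len(read) - min_len + 1):
        yield read[:cut], read[cut:]

# Synthetic 100 nt read, used only for illustration.
read = "ACGT" * 25
pairs = list(split_read(read))
print(len(pairs))  # 71 candidate head/tail pairs for a 100 nt read
```

Each pair would then be mapped independently, and a read kept only if head and tail map uniquely and in inverted order.
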
For each inverted read that did not map to an annotated 5’ss, we constructed a potential upstream exon by taking an 85-nucleotide window immediately upstream of the read tail. We created an array of 200 potential downstream exons by taking 85-nucleotide windows at distances of 1 to 200 nucleotides from the end of the read head. These windows were artificially spliced together to create a set of potential spliced products. We aligned the Illumina reads against these spliced products using bowtie, requiring that each read contain at least 15 nucleotides on either side of the splice junction and have no mismatches. In cases where a spliced product was found and the implied intron contained a 5’ GT and a 3’ AG sequence, the inverted read was determined to be a true lariat forming in an unannotated transcript. Forty-four additional lariat reads were discovered through this screen (lariat_2066 – lariat_2109 in BED track).

Lariats forming deep within introns.

The remaining out-of-order reads with intronic heads and tails, but without annotated or Illumina transcript support, were studied. From these reads, we filtered a high-confidence set of lariats by requiring that both the head and the tail were never annotated as exons (alternative events), that the beginning of the tail had a patser score of at least 6.0 against a 5’ss position-specific weight matrix, and that the read had a mutation at the branchpoint. The 5’ss position-specific weight matrix was created by inputting all hg19-annotated 5’ss sequences into the patser program. Within this high-confidence set of internal lariats, the mutational profile was similar to that of the bona fide lariats (mostly A->T mutations). We counted the number of splicing events that used an annotated 5’ss without a 3’ss, used an annotated 3’ss without a 5’ss, or occurred deep within an intron (using no annotated splice sites).
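
The construction of candidate spliced products for unannotated transcripts (one 85 nt upstream window joined to each of 200 downstream 85 nt windows) can be sketched as follows. This is a minimal illustration under stated assumptions, not the original pipeline: plus strand only, 0-based coordinates, the chromosome held as a Python string, and hypothetical names `candidate_spliced_products` and `intron_has_gt_ag`. The exact coordinate convention for "distance from the end of the read head" is assumed.

```python
import random
from typing import List, Tuple

WINDOW = 85       # exon window length, from the text
MAX_OFFSET = 200  # number of candidate downstream exons, from the text

def candidate_spliced_products(chrom_seq: str, tail_start: int,
                               head_end: int) -> List[Tuple[int, str]]:
    """Return (offset, spliced sequence) pairs for one inverted read.

    Assumed conventions: tail_start is the first nucleotide of the read
    tail (putative intron start); head_end is the last nucleotide of the
    read head (putative branchpoint).
    """
    if tail_start < WINDOW:
        return []  # not enough sequence for the upstream window
    upstream_exon = chrom_seq[tail_start - WINDOW:tail_start]
    products = []
    for offset in range(1, MAX_OFFSET + 1):
        exon_start = head_end + offset
        downstream_exon = chrom_seq[exon_start:exon_start + WINDOW]
        if len(downstream_exon) == WINDOW:
            products.append((offset, upstream_exon + downstream_exon))
    return products

def intron_has_gt_ag(chrom_seq: str, intron_start: int, intron_end: int) -> bool:
    """True if chrom_seq[intron_start:intron_end] starts with GT and ends with AG."""
    intron = chrom_seq[intron_start:intron_end]
    return intron.startswith("GT") and intron.endswith("AG")

# Synthetic chromosome, for illustration only.
rng = random.Random(0)
chrom = "".join(rng.choice("ACGT") for _ in range(1000))
products = candidate_spliced_products(chrom, tail_start=200, head_end=400)
print(len(products), len(products[0][1]))  # 200 products, each 170 nt

toy = "C" * 100 + "GT" + "C" * 50 + "AG" + "C" * 100
print(intron_has_gt_ag(toy, 100, 154))  # True
```

For a product at a given offset, the implied intron under these conventions is `chrom_seq[tail_start:head_end + offset]`, which is the span passed to the GT/AG check.
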
We also counted the number of bona fide lariats that passed the patser score, mutational, and intronic filters, and used that fraction to extrapolate how many true lariats without transcript support we expect to exist in our data. As a control, in-order reads that were most likely caused by template switching were passed through these same filters. There was a 3.3-fold increase in out-of-order reads passing these filters compared to in-order reads. In addition, the mutational profile of the in-order reads differed from that of the bona fide lariats. This suggests that these internal lariats are truly forming.

II. Analysis

Branch Point Characterization.

The branchpoint distance was measured as the distance from the last nucleotide of the read head to the first downstream annotated 3’ splice site. The mutational profile of reads that suggested a branchpoint at the last nucleotide of the intron (implying a circular intron) was compared to the mutational profile of all other reads with a ‘G’ branchpoint more distal from the 3’ss. A chi-squared test showed that these mutational profiles were significantly different. mRNA exon junction data were analyzed using tophat. We created a junction file consisting of all possible constitutive and exon-skipping events within each annotated transcript. We aligned the Illumina reads using the hg19 genome and this junction file to determine how many reads span each exon/exon junction. We calculated overall rates of alternative splicing and intersected these data with our lariat branchpoints. In cases where the lariat formed over a skipped exon or immediately upstream of a skipped exon, the branchpoint distance was measured to the first downstream exon. In cases where the lariat formed near an alternative 3’ss, the branchpoint distance was measured to the most upstream 3’ss.
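
As a small illustration of the branchpoint distance measurement, the lookup of the first downstream annotated 3’ss can be written with a binary search. This sketch assumes plus-strand coordinates on a single chromosome and a 3’ss coordinate defined as the last intronic nucleotide, so that a branchpoint on that coordinate (the circular intron case) gives a distance of 0. The function name and the example coordinates are hypothetical.

```python
from bisect import bisect_left
from typing import List, Optional

def branchpoint_distance(branchpoint: int,
                         three_ss_sorted: List[int]) -> Optional[int]:
    """Distance from the branchpoint to the first annotated 3'ss at or after it.

    three_ss_sorted must be sorted in ascending order. Returns None when
    no annotated 3'ss lies downstream of the branchpoint.
    """
    i = bisect_left(three_ss_sorted, branchpoint)
    if i == len(three_ss_sorted):
        return None
    return three_ss_sorted[i] - branchpoint

three_ss = [1000, 2500, 4000]                # hypothetical annotated 3'ss positions
print(branchpoint_distance(2470, three_ss))  # 30
print(branchpoint_distance(2500, three_ss))  # 0 (branchpoint at last intron nucleotide)
print(branchpoint_distance(4500, three_ss))  # None
```
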
For all three classes of alternative events, p-values were determined by randomly sampling the entire dataset 1000 times and counting how many times the average branchpoint distance of the sample was at least as extreme as the branchpoint distance in the alternative event.

Splice Site Recognition Analysis.

The number of ‘AG’ dinucleotides between the branchpoint and the 3’ss ‘AG’ was counted for the 2066 hg19-annotated lariats. As a control, 2066 introns were selected at random, and simulated branchpoints were placed by sampling from the branchpoint distance distribution observed in real lariats. The number of ‘AG’ dinucleotides between these simulated branchpoints and the 3’ss ‘AG’ was counted. This process was repeated 1000 times to generate a p-value.

The AG selection decision tree analysis was performed using C5.0 decision tree software. Lariat introns with exactly one branchsite and one used 3’ss were considered in this analysis. The used AG, the immediately upstream AG (if present), and the downstream AG were included in the dataset. First, the data were organized into a decision tree using classifiers from the literature, including distance to the branchsite, distance to the upstream and downstream AGs, the nucleotide upstream of the AG, and the presence of secondary structure (Gibbs free energy determined using RNAfold). In this initial run, the only informative classifiers were the presence or absence of an upstream AG, the distance between the AG and the branchsite, and the distance to the upstream and downstream AGs. Next, a range of constraints for each of the distance classifiers was applied to the dataset, and C5.0 was run on each combination of classifier constraints. The classifier sets with error rates lower than the literature classifier sets were run again on the dataset, this time using half of the dataset as training data and the other half as testing data. The highest-scoring classifier sets were subjected to 10-fold cross-validation trials.
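
The 10-fold cross-validation step can be sketched generically in Python. This is not C5.0 and does not reproduce the original classifier sets: `kfold_accuracy`, `train_threshold`, and `predict_threshold` are hypothetical names, and the toy single-threshold classifier on a made-up distance feature stands in only to make the sketch runnable.

```python
import random

def kfold_accuracy(X, y, train, predict, k=10, seed=0):
    """Fraction of samples classified correctly while held out of training."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k folds of near-equal size
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train_idx = [i for i in idx if i not in held_out]
        model = train([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict(model, X[i]) == y[i] for i in fold)
    return correct / len(X)

def train_threshold(xs, labels):
    """Toy model: midpoint between the mean distances of the two classes."""
    used = [x for x, lab in zip(xs, labels) if lab]
    unused = [x for x, lab in zip(xs, labels) if not lab]
    return (sum(used) / len(used) + sum(unused) / len(unused)) / 2

def predict_threshold(threshold, x):
    """Predict 'used AG' when the distance is below the learned threshold."""
    return x < threshold

# Made-up data: AGs closer than 40 nt to the branchsite are labeled as used.
distances = list(range(10, 70))
used_ag = [d < 40 for d in distances]

# Repeat the 10-fold trial with different shuffles and average the accuracy.
accuracies = [kfold_accuracy(distances, used_ag, train_threshold,
                             predict_threshold, k=10, seed=s) for s in range(20)]
mean_accuracy = sum(accuracies) / len(accuracies)
print(round(mean_accuracy, 3))
```

Averaging over repeated shuffles reduces the dependence of the accuracy estimate on any single partition of the data into folds.
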
These cross-validation trials were completed 1000 times. The classifier set with the highest average predictive accuracy was used to create the decision tree.

RNA-protein Interactions.

Published CLIP data for FOX2 (Yeo et al., 2009), PTB (Xue et al., 2009), and hnRNP C (König et al., 2010) were mapped around our branchpoint coordinates. The FOX2 and PTB data were smoothed by taking the center CLIP coordinate and adding 15 nucleotides to either side. The raw hnRNP C CLIP reads were aligned using bowtie, and the last nucleotide was used as the binding point (as described in that study). We smoothed the hnRNP C data by adding 15 nucleotides to either side of the binding point.

Lariat Recovery.

The number of reads spanning each annotated exon/exon junction was determined using tophat. Lariats and exon/exon junctions were both binned by intron size. The recovery rate of lariat reads was calculated by dividing the number of detected lariat reads by the number of detected exon/exon junction reads within each intron bin. Error bars were calculated by resampling the lariat read data 1000 times and taking the 95% confidence interval.
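
The per-bin recovery rate and its resampled error bars can be sketched as below. This is an illustration under assumptions the text does not state: resampling is taken to be a bootstrap (lariat reads drawn with replacement, same total count), the interval is a simple percentile interval, junction read counts are held fixed and must be positive, and the bin edges and counts are made up. All function names are hypothetical.

```python
import random
from bisect import bisect_right

def recovery_rates(lariat_sizes, junction_counts, edges):
    """Lariat reads per bin divided by exon/exon junction reads per bin."""
    counts = [0] * (len(edges) + 1)
    for size in lariat_sizes:
        counts[bisect_right(edges, size)] += 1  # bin index by intron size
    return [c / j for c, j in zip(counts, junction_counts)]

def bootstrap_ci(lariat_sizes, junction_counts, edges, n_boot=1000, seed=0):
    """Approximate 95% percentile interval of the recovery rate in each bin."""
    rng = random.Random(seed)
    n = len(lariat_sizes)
    samples = [recovery_rates(rng.choices(lariat_sizes, k=n),
                              junction_counts, edges)
               for _ in range(n_boot)]
    intervals = []
    for b in range(len(junction_counts)):
        vals = sorted(s[b] for s in samples)
        lo = vals[int(0.025 * n_boot)]
        hi = vals[int(0.975 * n_boot) - 1]
        intervals.append((lo, hi))
    return intervals

# Made-up example: intron size of the host intron for each lariat read.
edges = [1000, 10000]                       # bins: <1 kb, 1-10 kb, >=10 kb
lariat_sizes = [500] * 30 + [5000] * 50 + [50000] * 20
junction_counts = [3000, 10000, 8000]       # junction reads per bin

rates = recovery_rates(lariat_sizes, junction_counts, edges)
intervals = bootstrap_ci(lariat_sizes, junction_counts, edges)
for rate, (lo, hi) in zip(rates, intervals):
    print(f"{rate:.4f} [{lo:.4f}, {hi:.4f}]")
```
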
