Description for Validation of Unannotated Splice Junctions

Gingeras Lab, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York.
Contact: davisc@cshl.edu, gingeras@cshl.edu

Rationale:
The ENCODE project seeks to identify and characterize functional elements in the human genome. Throughout the scale-up phase of ENCODE, the transcriptome group has generated Long RNA-Seq, Small RNA-Seq, Cap-Analysis of Gene Expression (CAGE), and RNA-PET short-read data on the Illumina platform for ~40 different human primary and transformed cell lines in replicate. From these data, several high-resolution, discrete features/elements have been mined (5’ caps, splice junctions, polyadenylation sites, small RNAs, etc.). However, because these are short-read data, we have only limited “connectivity” information. For example, from the long RNA-Seq data, which were sequenced in mate-pair fashion with average insert sizes of ~200 bp, we know that the sequence from mate 1 is physically linked to the sequence in mate 2. We do not know the sequence in between, and we do not know how this mate-pair is connected to other mate-pairs in the context of longer transcripts in vivo. To date, this information has been gleaned from models generated in silico, in our case by the program Cufflinks. Consequently, we have a collection of transcript models, exhibiting a vast array of local complexity, that were assembled from short-read data and need to be experimentally tested. Alternatively, one can “cut to the chase” and use a more raw/elemental form of the data as the basis for additional experimentation to clone out the longer sequences generated in vivo at unannotated loci.

Methods:
We mined the RNA-Seq Poly-A+ data from HepG2 (Library IDs: LID16635, LID16636), H1 ES cells (Library IDs: LID8461, LID8462) and HUVEC (Library IDs: LID8463, LID8464) to identify unannotated splice junctions that mapped either to intergenic space or antisense relative to Gencode v3c annotations.
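The intergenic/antisense criterion above can be illustrated with a minimal Python sketch. This is not the pipeline that was actually used. The names `Gene`, `Junction` and `classify_junction` are hypothetical, coordinates are assumed to be 0-based half-open, and the strand of each junction is assumed to be known.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    strand: str  # "+" or "-"


class Junction(NamedTuple):
    chrom: str
    start: int   # first intronic base, 0-based
    end: int     # one past the last intronic base
    strand: str  # "+" or "-"


def classify_junction(junction: Junction, genes: List[Gene]) -> str:
    """Return 'intergenic', 'antisense' or 'genic_sense' for one junction.

    Hypothetical rule: a junction is intergenic if it overlaps no annotated
    gene, antisense if every gene it overlaps is on the opposite strand,
    and genic_sense otherwise.
    """
    overlapping = [
        g for g in genes
        if g.chrom == junction.chrom
        and g.start < junction.end
        and junction.start < g.end
    ]
    if not overlapping:
        return "intergenic"
    if all(g.strand != junction.strand for g in overlapping):
        return "antisense"
    return "genic_sense"
```

Under these assumptions, only junctions classified as `intergenic` or `antisense` would be kept as candidates. A linear scan over all genes is used for clarity; a real annotation set would normally be indexed by chromosome and position.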
The strategy was to identify candidate junctions across which multiple reads (in the .BAM files) mapped (Fig A), and to use the contiguously mapped mates on either side of each junction to design PCR primers. Candidate junctions were mined from the junction elements provided with the STAR-mapped data, NOT from the Cufflinks models, which we did not have at the time. The advantage of starting from splice junctions is that they are a distinct biochemical product requiring a transcript as substrate; hence, if validated, they are more likely to be genuine reporters of RNA polymerase activity in these regions than artifacts of contaminating genomic DNA.

This strategy allows us to test the following. From these data we know that the two mates of mate-pair #1 are physically linked, as are the two mates of mate-pair #2, but we do not know whether mate-pairs #1 and #2 are linked on the same molecule. We also have several independent data points indicating that there is an unannotated splice junction mapping to these regions of the genome. However, these data were all obtained from the same upstream pipeline (i.e. the same mapper and other common processing, library construction and sequencing methods). Therefore, this set of experiments also functions as an independent test of these upstream processes while capturing longer-read sequence data. At the time of this experiment, the longest-read high-throughput sequencing platform available was the 454 FLX.

In total, we mined 960 junctions from each of the three cell lines above, giving 2,880 candidate junctions. A (+) and a (-) PCR primer were ordered for each candidate junction as described above. First-strand cDNA was generated independently for each cell line in batch using oligo-dT priming. The cDNA was aliquoted into 96-well plates, where each well contained the +/- primer pair for one candidate junction, and junction-specific PCR was performed.
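The plate arithmetic implied above (960 junctions per cell line, one junction per well of a 96-well plate, hence 10 plates per cell line) can be sketched in Python as follows. The row-by-row fill order and the function name `plate_layout` are assumptions made for illustration; the actual well assignments are not described here.

```python
from typing import Iterator, Tuple

ROWS = "ABCDEFGH"                         # 8 rows of a 96-well plate
N_COLS = 12                               # 12 columns of a 96-well plate
WELLS_PER_PLATE = len(ROWS) * N_COLS      # 96


def plate_layout(n_junctions: int) -> Iterator[Tuple[int, int, str]]:
    """Yield (junction_index, plate_number, well) for each junction.

    Plates are numbered from 1 and wells are filled row by row
    (A1..A12, B1..B12, ...), which is an assumed convention.
    """
    for i in range(n_junctions):
        plate, offset = divmod(i, WELLS_PER_PLATE)
        row = ROWS[offset // N_COLS]
        col = offset % N_COLS + 1
        yield i, plate + 1, f"{row}{col}"


# One cell line: 960 candidate junctions fill exactly 10 plates.
layout = list(plate_layout(960))
n_plates = layout[-1][1]
```

With three cell lines this gives 3 x 960 = 2,880 reactions on 30 plates in total.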
To verify that the RT-PCR worked prior to sequencing, we ran an aliquot of each of the 2,880 RT-PCR reactions on an agarose gel (Fig B). A high percentage of reactions amplified successfully, giving products that migrated at around the expected sizes. To verify the identity of the products, we pooled the 960 RT-PCR reactions derived from each cell line, giving 3 pools: HepG2, H1 ES cells and HUVEC. Each pool was run on an agarose gel, and the region spanning 125-700 base pairs was excised, made into an independent 454 library and run on the 454 FLX. The reads were mapped with BLAT. In this way we were able to validate >80% of the junctions and also to identify a considerable number of new junctions that were not seen in the original shotgun RNA-Seq data, which shows the discovery potential of this targeted RT-PCR approach.

Accompanying Data Files Available for Download:

HUVEC:
  HUVEC/2.GAC.454Reads.fna
  HUVEC/2.GAC.454Reads.qual
  HUVEC/blat.out.psl
  HUVEC/TargetedPrimers_HUVEC.bed
  HUVEC/TargetedSJ_HUVEC.bed
  454-HUVEC-A+RTPCR.doc
  HUVEC-range-10plates.xls

HepG2: (there are 2 sets of 454 files here because we conducted a small-scale pilot test on HepG2 prior to scaling up)
  HepG2/blat.out.psl
  HepG2/TargetedSJ_HepG2.bed
  HepG2/TargetedPrimers_HepG2.bed
  HepG2/1.GAC.454Reads.fna
  HepG2/2.GAC.454Reads.fna
  HepG2/1.GAC.454Reads.qual
  HepG2/2.GAC.454Reads.qual
  454-HepG2A+RTPCR.doc
  HepG2-top-10plates-all.xls

H1 ES:
  ES/1.GAC.454Reads.fna
  ES/1.GAC.454Reads.qual
  ES/blat.out.psl
  ES/TargetedPrimers_ES.bed
  ES/TargetedSJ_ES.bed
  454-ES-A+RTPCR.doc
  H1ES-range-10plates.xls
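As an illustration of how the validation step described in Methods might be reproduced from a blat.out.psl file and a set of targeted junctions, the Python sketch below extracts candidate introns from standard PSL alignments and intersects them with a target set. It is a sketch under stated assumptions, not the analysis that produced the >80% figure. The function names and the `min_intron` threshold are hypothetical, and it is assumed that each targeted junction has already been reduced to an exact (chromosome, intron start, intron end) triple in 0-based half-open coordinates; whether the TargetedSJ BED files use that convention is not stated here.

```python
from typing import Iterable, List, Set, Tuple

# (chrom, intron_start, intron_end), 0-based half-open
Junction = Tuple[str, int, int]


def _int_list(field: str) -> List[int]:
    """Parse a comma-terminated PSL list field such as '60,40,'."""
    return [int(x) for x in field.rstrip(",").split(",")]


def junctions_from_psl_line(line: str, min_intron: int = 20) -> Set[Junction]:
    """Return target-side gaps of one PSL alignment that look like introns.

    A gap counts as an intron when the read (query) is contiguous across it
    and the genomic (target) gap is at least min_intron bases long.
    min_intron is a hypothetical threshold chosen for this sketch.
    """
    fields = line.rstrip("\n").split("\t")
    chrom = fields[13]                    # tName
    block_sizes = _int_list(fields[18])   # blockSizes
    q_starts = _int_list(fields[19])      # qStarts
    t_starts = _int_list(fields[20])      # tStarts
    found: Set[Junction] = set()
    for i in range(len(block_sizes) - 1):
        gap_start = t_starts[i] + block_sizes[i]
        gap_end = t_starts[i + 1]
        q_gap = q_starts[i + 1] - (q_starts[i] + block_sizes[i])
        if q_gap == 0 and gap_end - gap_start >= min_intron:
            found.add((chrom, gap_start, gap_end))
    return found


def validated_junctions(targets: Iterable[Junction],
                        psl_lines: Iterable[str]) -> Set[Junction]:
    """Return the targeted junctions observed in at least one alignment."""
    observed: Set[Junction] = set()
    for line in psl_lines:
        # PSL data rows start with the match count; header rows do not.
        if line and line[0].isdigit():
            observed |= junctions_from_psl_line(line)
    return set(targets) & observed
```

The validated fraction for one cell line would then be `len(validated_junctions(targets, lines)) / len(set(targets))`. Junctions in the observed set that are absent from the target set correspond to the new junctions mentioned above. Exact coordinate matching is a strict choice; a tolerance of a few bases could be allowed instead.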