
Description for Validation of Unannotated Splice Junctions
Gingeras Lab, Cold Spring Harbor Labs, Cold Spring Harbor, New York.
Contact: davisc@cshl.edu, gingeras@cshl.edu
Rationale: The ENCODE project seeks to identify and characterize functional elements in the human genome.
Throughout the scale-up phase of ENCODE, the transcriptome group has generated Long RNA-Seq, Small RNA-Seq, Cap-Analysis of Gene Expression (CAGE), and RNA-PET short-read data on the Illumina platform for ~40
different human primary and transformed cell lines in replicate. From these data several high-resolution and
discrete features/elements have been mined out (5’ caps, splice junctions, polyadenylation sites, small RNAs,
etc.). However, because these features are derived from short-read data, we have only limited “connectivity”
information. For example, from the long RNA-Seq data, which was sequenced in mate-pair fashion with
average insert sizes ~ 200 bp, we know that the sequence from mate 1 is physically linked to the sequence in
mate 2. We don’t know the sequence in between and we don’t know how this mate-pair is connected to other
mate-pairs in the context of longer transcripts in vivo. To date, this information has been gleaned from models
generated in silico, in our case by the program Cufflinks. Consequently, we have a collection of transcript
models exhibiting a vast array of local complexity assembled from short read data that need to be
experimentally tested. Alternatively, one can “cut to the chase” and use a more raw/elemental form of the data
as a basis for additional experimentation to clone out the longer sequences at unannotated loci generated in
vivo.
Methods: We mined the RNA-Seq Poly-A+ data from HepG2 (Library IDs: LID16635, LID16636), H1 ES Cells
(Library IDs: LID8461, LID8462) and HUVEC (Library IDs: LID8463, LID8464) to identify unannotated splice
junctions that mapped to either intergenic space or antisense relative to Gencode v3c annotations. The strategy
was to identify candidate junctions where multiple reads (in the .BAM files) mapped across the junction (Fig A),
and to use their contiguously mapped mates on either side of the junction as targets for PCR primer design.
The candidates were mined from the junction elements provided with the STAR-mapped data, NOT from junctions
in the Cufflinks models, which we did not have at that time. The advantage of starting with splice junctions is
that they are a distinct biochemical product that requires a transcript as its substrate. Hence, if validated, they
are more likely to be genuine reporters of RNA polymerase activity in these regions than artifacts of contaminating genomic DNA.
This strategy allows us to test the following:


From these data we know that the two mates of mate-pair #1 are physically linked, as are the two mates of
mate-pair #2, but we don’t know whether mate-pairs #1 and #2 are linked on the same molecule.
We know that we have several independent data points indicating that there is an unannotated splice
junction mapping to these regions of the genome. However, these data were all obtained from the same
upstream pipeline (i.e. the same mapper, common processing steps, library construction and sequencing
methods). Therefore, this set of experiments also functions as an independent test of these upstream
processes while capturing longer-read sequence data.
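The candidate selection described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline that was actually used: it assumes the junction elements are in the tab-delimited layout of STAR's SJ.out.tab file (chromosome, 1-based intron start, intron end, strand code, motif, annotated flag, uniquely mapped read count, ...), and the names `Junction`, `Gene`, `candidate_junctions`, `classify` and the `min_reads` threshold are hypothetical.

```python
import csv
import io
from typing import Iterator, NamedTuple, Sequence


class Junction(NamedTuple):
    chrom: str
    start: int         # first intron base, 1-based
    end: int           # last intron base, 1-based
    strand: str        # "+", "-" or "." (undefined)
    unique_reads: int  # uniquely mapped reads crossing the junction


class Gene(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str


# Assumed strand encoding: 0 = undefined, 1 = plus, 2 = minus.
STRAND_CODES = {"0": ".", "1": "+", "2": "-"}


def candidate_junctions(handle, min_reads: int = 2) -> Iterator[Junction]:
    """Yield unannotated junctions crossed by at least min_reads unique reads."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 7:
            continue  # skip blank or malformed lines
        chrom, start, end, strand, _motif, annotated, unique = row[:7]
        if annotated != "0":
            continue  # junction is already in the annotation
        if int(unique) < min_reads:
            continue  # not enough independent reads across the junction
        yield Junction(chrom, int(start), int(end),
                       STRAND_CODES.get(strand, "."), int(unique))


def classify(j: Junction, genes: Sequence[Gene]) -> str:
    """Label a junction relative to annotated gene intervals."""
    overlapping = [g for g in genes
                   if g.chrom == j.chrom and g.start <= j.end and j.start <= g.end]
    if not overlapping:
        return "intergenic"
    if j.strand in ("+", "-") and all(g.strand != j.strand for g in overlapping):
        return "antisense"
    return "sense_or_ambiguous"


# Made-up example rows, for illustration only.
example = io.StringIO(
    "chr1\t1000\t2000\t1\t1\t0\t5\t0\t30\n"  # unannotated, 5 reads: kept
    "chr1\t3000\t4000\t2\t1\t1\t9\t0\t40\n"  # annotated: dropped
    "chr2\t500\t900\t2\t1\t0\t1\t0\t25\n"    # only one read: dropped
)
hits = list(candidate_junctions(example))
labels = [classify(j, [Gene("chr1", 500, 1500, "-")]) for j in hits]
print(hits, labels)
```

Only junctions labelled intergenic or antisense would go forward to primer design under this sketch; the primer design step itself, which uses the contiguously mapped mates, is not shown.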
At the time of this experiment, the longest-range high-throughput sequencing platform available was the 454 FLX.
We mined 960 junctions from each of the three cell lines above, for 2,880 candidate junctions in total. A (+) and a (-)
PCR primer were ordered for each candidate junction as described above. First-strand cDNA was generated
independently for each cell line by batch priming with oligo-dT. The cDNA was aliquoted into 96-well plates, where
each well contained the +/- primer pair for one candidate junction, and junction-specific PCR was performed. To verify
that the RT-PCR worked prior to sequencing, we ran an aliquot of each of the 2,880 RT-PCR reactions on
an agarose gel (Fig B). A high percentage of reactions amplified successfully, giving products migrating around the
expected sizes.
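The reaction counts above follow from simple arithmetic: 960 junctions per cell line at one junction per well fill ten 96-well plates, and three cell lines give 2,880 reactions. A small sketch of that bookkeeping is shown below. The row-by-row fill order and the `well_for` helper are assumptions made for illustration; the actual plate layout is recorded in the accompanying spreadsheets.

```python
JUNCTIONS_PER_CELL_LINE = 960
WELLS_PER_PLATE = 96          # 8 rows (A-H) x 12 columns
CELL_LINES = ("HepG2", "H1 ES", "HUVEC")


def well_for(index: int) -> tuple:
    """Map a 0-based junction index to (plate number, well label).

    Assumes wells are filled row by row (A1..A12, B1..B12, ...).
    """
    plate, offset = divmod(index, WELLS_PER_PLATE)
    row, col = divmod(offset, 12)
    return plate + 1, f"{'ABCDEFGH'[row]}{col + 1}"


plates_per_line = JUNCTIONS_PER_CELL_LINE // WELLS_PER_PLATE
total_reactions = JUNCTIONS_PER_CELL_LINE * len(CELL_LINES)

print(plates_per_line, total_reactions)
print(well_for(0), well_for(95), well_for(959))
```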
To verify the identity of the products, we pooled the 960 RT-PCR reactions derived from each cell line, giving 3
pools: HepG2, H1 ES cells and HUVEC. Each pooled sample was run on an agarose gel, and the region from 125 to 700
base pairs was excised, made into an independent 454 library and run on the 454 FLX. The
reads were mapped with BLAT. We were able to validate >80% of the junctions this way. The targeted RT-PCR
approach also identified a considerable number of new junctions that were not seen in the original
shotgun RNA-Seq data, showing its discovery potential.
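One way to check targeted junctions against the BLAT output is sketched below. It is an illustration under stated assumptions rather than the analysis actually performed: it assumes the blat.out.psl files follow the standard 21-column PSL layout, that each interval in the TargetedSJ_*.bed files spans the intron in 0-based half-open coordinates, and that a junction counts as validated only when a read's alignment gap matches the target exactly. The function names and the `min_intron` cutoff are hypothetical.

```python
import io
from typing import Iterator, Set, Tuple

Intron = Tuple[str, int, int]  # (chrom, start, end), 0-based half-open


def psl_introns(handle, min_intron: int = 20) -> Iterator[Intron]:
    """Yield target gaps between adjacent alignment blocks with no query gap."""
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 21 or not fields[0].isdigit():
            continue  # skip the optional PSL header and malformed lines
        t_name = fields[13]
        sizes = [int(x) for x in fields[18].split(",") if x]
        q_starts = [int(x) for x in fields[19].split(",") if x]
        t_starts = [int(x) for x in fields[20].split(",") if x]
        for i in range(len(sizes) - 1):
            q_gap = q_starts[i + 1] - (q_starts[i] + sizes[i])
            gap_start = t_starts[i] + sizes[i]
            gap_end = t_starts[i + 1]
            # A spliced read is contiguous in the read but gapped in the genome.
            if q_gap == 0 and gap_end - gap_start >= min_intron:
                yield (t_name, gap_start, gap_end)


def read_bed_introns(handle) -> Set[Intron]:
    """Read the first three BED columns as intron intervals."""
    targets = set()
    for line in handle:
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end = line.split()[:3]
        targets.add((chrom, int(start), int(end)))
    return targets


# Made-up records, for illustration only.
psl = io.StringIO("\n".join([
    "\t".join(["100", "0", "0", "0", "0", "0", "1", "1000", "+", "read1", "100",
               "0", "100", "chr1", "249250621", "950", "2050", "2",
               "50,50,", "0,50,", "950,2000,"]),
    "\t".join(["100", "0", "0", "0", "0", "0", "1", "560", "-", "read2", "100",
               "0", "100", "chr3", "198022430", "100", "760", "2",
               "40,60,", "0,40,", "100,700,"]),
]) + "\n")
bed = io.StringIO("chr1\t1000\t2000\tSJ_1\nchr2\t500\t900\tSJ_2\n")

targets = read_bed_introns(bed)
observed = set(psl_introns(psl))
validated = targets & observed   # targeted junctions seen in 454 reads
novel = observed - targets       # junctions not in the targeted set
fraction = len(validated) / len(targets)
print(validated, novel, fraction)
```

Exact coordinate matching is a strict choice; allowing a few bases of tolerance at each splice site would be a reasonable alternative given 454 homopolymer errors.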
Accompanying Data Files Available for Download:
HUVEC:
HUVEC/2.GAC.454Reads.fna
HUVEC/2.GAC.454Reads.qual
HUVEC/blat.out.psl
HUVEC/TargetedPrimers_HUVEC.bed
HUVEC/TargetedSJ_HUVEC.bed
454-HUVEC-A+RTPCR.doc
HUVEC-range-10plates.xls
HepG2:
(There are 2 sets of 454 files here because we conducted a small-scale pilot test on HepG2 prior to
scaling up.)
HepG2/blat.out.psl
HepG2/TargetedSJ_HepG2.bed
HepG2/TargetedPrimers_HepG2.bed
HepG2/1.GAC.454Reads.fna
HepG2/2.GAC.454Reads.fna
HepG2/1.GAC.454Reads.qual
HepG2/2.GAC.454Reads.qual
454-HepG2A+RTPCR.doc
HepG2-top-10plates-all.xls
H1 ES:
ES/1.GAC.454Reads.fna
ES/1.GAC.454Reads.qual
ES/blat.out.psl
ES/TargetedPrimers_ES.bed
ES/TargetedSJ_ES.bed
454-ES-A+RTPCR.doc
H1ES-range-10plates.xls