RNA-seq: Quantifying the Transcriptome Alisha Holloway, PhD Gladstone Bioinformatics Core Director What is RNA-seq? Use of high-throughput sequencing technologies to assess the RNA content of a sample. Why do an RNA-seq experiment? • Detect differential expression • Assess allele-specific expression • Quantify alternative transcript usage • Discover novel genes/transcripts, gene fusions • Profile transcriptome • Ribosome profiling to measure translation Why do an RNA-seq experiment? • Detect differential expression • Assess allele-specific expression • Quantify alternative transcript usage • Discover novel genes/transcripts, gene fusions • Profile transcriptome • Ribosome profiling to measure translation Skelly et al. 2011 Why do an RNA-seq experiment? • Detect differential expression • Assess allele-specific expression • Quantify alternative transcript usage • Discover novel genes/transcripts, gene fusions • Profile transcriptome • Ribosome profiling to measure translation Why do an RNA-seq experiment? • Detect differential expression • Assess allele-specific expression • Quantify alternative transcript usage • Discover novel genes/transcripts, gene fusions • Profile transcriptome • Ribosome profiling to measure translation Why do an RNA-seq experiment? • Detect differential expression • Assess allele-specific expression • Quantify alternative transcript usage • Discover novel genes/transcripts, gene fusions • Profile transcriptome • Ribosome profiling to measure translation Pluripotent Stem Cell Cardiogenic Cardiac Cardiomyocytes Mesoderm Precursors Why do an RNA-seq experiment? • Detect differential expression • Assess allele-specific expression • Quantify alternative transcript usage • Discover novel genes/transcripts, gene fusions • Profile transcriptome • Ribosome profiling to measure translation More later! Ingolia et al. 2009, Weissman Lab Choose the right technology RNA-seq Microarray ID novel genes, transcripts, & exons Well vetted QC and analysis methods Greater dynamic range Less bias due to genetic variation Repeatable Well characterized biases Quick turnaround from established core facilities Currently less expensive No species-specific primer/probe design More accurate relative to qPCR Many more applications RNA-seq dynamic range and accuracy RNA-seq vs. Affy Marioni et al. 2008 RNA-seq vs. Taqman qPCR © 2010 NuGen Which sequencing technology to use? Illumina Pac-Bio Read length 100 bp paired end 2500 bp avg Throughput 200 million read pairs/lane <1% 1 million reads/ SMRT cell 15% total, most are indels, 4% SNP $7-8k/sample Error rate Cost Accessibility Uses $900/sample USCF, UC-Davis, BGI; UC-Davis; no many kits available kits/protocols available DE, ASE, Alternative Characterize splicing transcriptome When to use Pac-Bio The wet lab side…briefly New methods to keep info on strand of template too. Plan it well. • Experimental design – Biological replicates – Reference genome? – Good gene annotation? • • • • Read depth Barcoding Read length Paired vs. single-end Biological variation Technical variation Plan it well. • Experimental design – Biological replicates – Reference genome? – Good gene annotation? • • • • Read depth Barcode/Index Read length Paired vs. single-end Plan it well. • Experimental design – Biological replicates – Reference genome? – Good gene annotation? • • • • Read depth Barcode/Index Read length Paired vs. single-end How much data do we need? • ~15-20K genes expressed in a tissue | cell line. • Genes are on average 3KB • For 1x coverage using 100 bp reads, would need 600K sequence reads • In reality, we need MUCH higher coverage to accurately estimate gene expression levels. • 30-50 million reads Plan it well. • Experimental design – Biological replicates – Reference genome? – Good gene annotation? • • • • Read depth Barcode/Index Read length Paired vs. single-end 180-200 million reads / lane Run 3-4 samples / lane Plan it well. • Experimental design – Biological replicates – Reference genome? – Good gene annotation? • • • • Read depth Barcoding Read length Paired vs. single-end Uniq seq = 4read length Read length Unique seq 25 1.1x1015 50 1.3x1030 100 1.6x1060 ~60 million coding bases in vertebrate genome Plan it well. • Experimental design – Biological replicates – Reference genome? – Good gene annotation? • • • • Read depth Barcoding Read length Paired vs. single-end Power of paired-end reads • Huge impact on read mapping – Pairs give two locations to determine whether read is unique • Critical for estimating transcript-level abundance – Increases number of splice junction spanning reads Distribution of cDNA fragment sizes 200 bp 100 bp • Size select 200-300 bp cDNA fragments • Reads are usually 50 bp • Inner mate distance would be 100-200 bp on average. Ends map to different locations Paired end sequencing allows you to reliably call and remove duplicates. How do you make sense of this pile of data? • QC • Alignment • Analysis – Stats – Hypothesis tests – Combine with other data • Visualization QC • FastQC – Before alignment – http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ • RNA-SeQC – After alignment – https://confluence.broadinstitute.org/display/CGATools/RNA-SeQC • Proportion of reads that mapped uniquely – Remove duplicates; likely due to PCR amp. • Assess ribosomal RNA and possible contaminant content – human RNA (if not human samples) – Mycoplasma (if cell lines) Base quality over reads High quality Red: high density Blue: low density Low quality Position along read Mapped reads summary Sample Name Total Records Optical duplicates PCR duplicates % duplication Unmapped reads Uniquely mapping reads Read-1 uniquely mapping Read-2 uniquely mapping Concordant reads pairs mapped Reads map to “+” Reads map to “-” Non-splice reads Splice reads Annotation of reads Distribution of reads over gene body Normalized by gene length Then what? • Align reads to the genome – Easy(ish) for genomic sequence – Difficult for transcripts with splice junctions Alignment Algorithms • Bowtie (Langmead et al 2009) – BWT, multiseed heuristic alignment • BWA (Li and Durbin 2009) – BWT, Smith-Waterman alignment • SOAP2 (Li et al. 2009) – BWT, proprietary alignment algorithm • BFAST (Homer at al. 2009, based on BLAT) – multiple indexes, finds candidate alignment locations using seed and extend, followed by a gapped Smith-Waterman local alignment for each candidate http://en.wikipedia.org/wiki/List_of_sequence_alignment_software Burrows-Wheeler Transform (compression to reduce memory footprint) Alignment tools for splice junction mapping • • • • Tophat MapSplice SpliceMap HMMsplicer Tophat • Map reads to transcriptome using Bowtie • Map to genome to discover novel exons – or start here if no annotation available • Split reads to smaller segments; map to genome to discover novel splice junctions • Report best alignment for each read Trapnell et al. Bioinformatics 2009; Trapnell et al. Nature Protocols 2012 MapSplice & SpliceMap • Tag alignment (user chooses aligner) – Break reads into segments – Map reads – Unmapped segments considered for splice junction mapping based on location of partner segment – Merge segments from read for final alignment • Assess splice junction quality Wang et al. NAR 2010, Au et al. NAR 2010 HMMsplicer Local Source! Dimon et al. PLoS One 2010 Compare tools for splice junction mapping Splice Junction Mapper Junction s Prop. Junctions Bowtie map to known transcriptome (control) SpliceMap Tophat – no annotation Tophat – annotation, no novel junctions Tophat – annotation, novel junctions 23,037 21,990 20,196 22,544 22,599 0.955 0.877 0.979 0.981 Difficult to get a false positive rate because you don’t know what your annotation is missing. Transcript Assignment Aligned contiguous and spliced reads Build graph to connect neighboring concordant alignments Traverse graph to assemble variants Assembled isoforms Martin & Wang, Nature Reviews Genetics 2011 Transcript Assignment &|Abundance Tools • Assembly uses annotation – Cufflinks – MISO • De novo assembly – Cufflinks – Trans-ABySS – Trinity – Maker Cufflinks • Constructs the parsimonious set of transcripts that explain the reads observed. Basically, finds a minimum path cover on the DAG. • Derives a likelihood for the abundances of a set of transcripts given a set of fragments. • FPKM – fragments per kb of exon per million fragments mapped. Trapnell, Pachter MISO • Mixture of Isoforms • Bayesian – treats expression level of set of isoforms as random variable and estimates a distribution over the values of this variable. • Gives confidence intervals for expression estimates and measures of DE as Bayes factors Burge Lab @ MIT Bias Correction and Normalization • Random hexamer bias (Hansen et al. 2010) – From PCR or RT primers – Re-estimate read counts to account for bias • Resources for normalization – Bullard et al. 2010 – http://www.rnaseqblog.com/dataanalysis/which-methodshould-you-use-fornormalization-of-rna-seqdata/ Differential Expression • Goal: determine whether observed difference in read counts is greater than would be expected due to random variation. • If reads independently sampled from population, reads would follow multinomial distribution appx by Poisson Pr(X = k) = le k -l k! Differential Expression • BUT! We know that the count data show more variance than expected • Overdispersion problem mitigated by using the negative binomial distribution, which is determined by mean and variance Kij @ NB(mij , s ) 2 ij Sample j, gene i Differential Expression • Binomial test – Old Cuffdiff or if no replicates • Negative binomial – DESeq – estimate variance using all genes with similar expression levels – Cuffdiff – sim to DESeq, but incorp fragment assignment uncertainty simultaneously – EdgeR - moderate variance over all genes • T-test Use replicates to capture biological variation KO KO1 KO2 KO3 WT1 WT2 WT3 Pool gene RPM RPM RPM RPM RPM RPM RPM Hk2 65 299 211 0.633 9.29E-03 Col1a1 171 162 178 326 160 324 511 810 0.301 1.23E-13 1095 655 0.011 5.16E-23 Slc2a1 162 78 59 59 87 WT Pool Replicate Pool RPM adj Pval adj Pval 569 286 240 160 278 217 • Counts per sample have overlapping distribution • Pooled counts show difference between KO and WT • Biological variance is ignored in pooled analysis. Differential Expression Old cuffdiff Some biology, finally? • How have gene expression patterns have changed during the course of differentiation? • Which genes are specific to certain cell types? • What can we learn about what those coexpressed genes do? Clusters of co-expressed genes • Use unsupervised clustering to group genes by expression pattern • Use gene ontology information to determine which kinds of genes are in each group • Reveal novel associations and gene types Clusters of co-expressed genes Pluripotency/stem cell: Nanog, Oct4 Mesoderm/cell fate commitment: Mesp1, Eomes Cardiac precursors: Isl1, Mef2c, Wnt2 Cardiac structure/function: Actc1, Ryr2, Tnni3 Thanks for listening! Alisha Holloway Gladstone Institutes Bioinformatics Core alisha.holloway@gladstone.ucsf.edu