RNA-seq: Quantifying the Transcriptome
Alisha Holloway, PhD
Gladstone Bioinformatics Core Director

What is RNA-seq?
Use of high-throughput sequencing technologies to assess the RNA content of a sample.

Why do an RNA-seq experiment?
• Detect differential expression
• Assess allele-specific expression (Skelly et al. 2011)
• Quantify alternative transcript usage
• Discover novel genes/transcripts, gene fusions
• Profile transcriptome (Pluripotent Stem Cell, Cardiogenic Mesoderm, Cardiac Precursors, Cardiomyocytes)
• Ribosome profiling to measure translation. More tomorrow! Ingolia et al.
2009, Weissman Lab

RNA-seq
• ID novel genes, transcripts, & exons
• Greater dynamic range
• Less bias due to genetic variation
• Repeatable
• No species-specific primer/probe design
• More accurate relative to qPCR
• Many more applications

Microarray
• Well vetted QC and analysis methods
• Well characterized biases
• Quick turnaround from established core facilities
• Currently less expensive

RNA-seq vs. Affy
Marioni et al. 2008

RNA-seq vs. Taqman
© 2010 NuGen

When to use Pac-Bio

                 Illumina                                Pac-Bio
Read length      100 bp paired end                       2500 bp avg
Throughput       200 million read pairs/lane             1 million reads/SMRT cell
Error rate       <1%                                     15% total, most are indels, 4% SNP
Cost             $600/sample                             $7-8k/sample
Accessibility    UCSF, UC-Davis, BGI                     No commercially available protocols
Uses             DE, ASE, quant. alt. transcript usage   Characterize transcriptome

Plan it well.
• Experimental design
  – Biological replicates
  – Reference genome?
  – Good gene annotation?
• Read depth
• Barcoding
• Read length
• Paired vs. single-end

[Figure: biological variation vs. technical variation]

How much data do we need?
• ~15-20K genes expressed in a tissue or cell line.
• Genes are on average 3 kb.
• For 1x coverage using 100 bp reads, would need 600K sequence reads.
• In reality, we need MUCH higher coverage to accurately estimate gene expression levels.
• 50 million reads

Barcoding
• 200 million reads / lane
• Run 4 samples / lane

Read length
Unique sequences = 4^(read length)

Read length    Unique sequences
25             1.1 x 10^15
50             1.3 x 10^30
100            1.6 x 10^60

~60 million coding bases in vertebrate genome

Paired-end!
• Effectively doubles read length – huge impact on read mapping
• Increases number of splice junction spanning reads
• Critical for estimating transcript-level abundance

The wet lab side…briefly

How do you make sense of this pile of data?
• QC
• Alignment
• Expt: Compare two groups
  – Transcript Assignment & Abundance
  – Differential Expression
• Expt: Allele-specific expression

QC
• FastQC - http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• Proportion of reads that mapped uniquely
  – Remove duplicates; likely due to PCR amp.
• Assess ribosomal RNA content
• Assess content of possible contaminants
  – human RNA (if not human samples), Mycoplasma (if cell lines)

Then what?
• Align reads to the genome
  – Easy(ish) for genomic sequence
  – Difficult for transcripts with splice junctions

Alignment Algorithms
• Burrows-Wheeler Transform
  – Bowtie (Langmead et al. 2009)
  – BWA (Li and Durbin 2009)
  – SOAP2 (Li et al. 2009)
• Smith-Waterman
  – BFAST (Homer et al. 2009, based on BLAT) – multiple indexes, finds candidate alignment locations using seed and extend, followed by a gapped Smith-Waterman local alignment for each candidate
http://en.wikipedia.org/wiki/List_of_sequence_alignment_software

Alignment tools for splice junction mapping
• Tophat
• MapSplice
• SpliceMap
• HMMsplicer

Tophat
• Map reads to transcriptome using Bowtie
• Map to genome to discover novel exons – or start here if no annotation available
• Split reads to smaller segments; map to genome to discover novel splice junctions
• Report best alignment for each read
Trapnell et al. Bioinformatics 2009; Trapnell et al.
Nature Protocols 2012

MapSplice & SpliceMap
• Tag alignment (user chooses aligner)
  – Break reads into segments
  – Map reads
  – Unmapped segments considered for splice junction mapping based on location of partner segment
  – Merge segments from read for final alignment
• Assess splice junction quality
Wang et al. NAR 2010, Au et al. NAR 2010

HMMsplicer
• Remove reads that map contiguously
• Hidden Markov model to detect exon boundary of remaining reads
• Compute intensive
• Reference annotation not used
• Best for compact genomes
• User sets threshold for accepting splice junction.
Dimon et al. PLoS One 2010

Transcript Assignment/Abundance
Martin & Wang, Nature Reviews Genetics 2011

Transcript Assignment & Abundance Tools
• For DE:
  – Cufflinks
  – MISO
  – Scripture – not maintained
• De novo assembly
  – Cufflinks
  – Trans-ABySS
  – Trinity
  – Maker

Cufflinks
• Constructs the parsimonious set of transcripts that explain the reads observed. Basically, finds a minimum path cover on the DAG.
• Derives a likelihood for the abundances of a set of transcripts given a set of fragments.
• FPKM – fragments per kb of exon per million fragments mapped.
Trapnell, Pachter

MISO
• Mixture of Isoforms
• Bayesian – treats expression level of set of isoforms as random variable and estimates a distribution over the values of this variable.
• Gives confidence intervals for expression estimates and measures of DE as Bayes factors
Burge Lab @ MIT

Bias Correction and Normalization
• Random hexamer bias (Hansen et al. 2010)
  – From PCR or RT primers
  – Reestimate FPKM or read counts based on bias
• Upper quartile normalization (Bullard et al. 2010)
  – excellent resource for comparison to qPCR and microarray as well as methods of normalization of RNA-seq data

Differential Expression
• Goal: determine whether observed difference in read counts is greater than would be expected due to random variation.
• If reads are independently sampled from a population, read counts follow a multinomial distribution, approximated by the Poisson:
  $\Pr(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$

Differential Expression
• BUT! We know that the count data show more variance than expected
• Overdispersion problem mitigated by using the negative binomial distribution, which is determined by mean and variance:
  $K_{ij} \sim \mathrm{NB}(\mu_{ij}, \sigma_{ij}^{2})$ for sample j, gene i

Differential Expression
• Binomial test
  – Old Cuffdiff
• Negative binomial
  – DESeq – estimate variance using all genes with similar expression levels
  – Cuffdiff – similar to DESeq, but incorporates fragment assignment uncertainty simultaneously
  – edgeR – moderate variance over all genes
• T-test

Differential Expression
[Figure: Old Cuffdiff]

Some biology, finally?
• How have gene expression patterns changed during the course of differentiation?
• Which genes are specific to certain cell types?
• What can we learn about what those coexpressed genes do?

Clusters of co-expressed genes
• Use unsupervised clustering to group genes by expression pattern
• Use gene ontology information to determine which kinds of genes are in each group
• Reveal novel associations and gene types

Clusters of co-expressed genes
• Pluripotency/stem cell: Nanog, Oct4
• Mesoderm/cell fate commitment: Mesp1, Eomes
• Cardiac precursors: Isl1, Mef2c, Wnt2
• Cardiac structure/function: Actc1, Ryr2, Tnni3

Thanks for listening!
Alisha Holloway
Gladstone Institutes Bioinformatics Core
alisha.holloway@gladstone.ucsf.edu
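
Worked example: read depth and read length
The back-of-the-envelope numbers from the "How much data do we need?" and read length slides can be reproduced in a few lines of Python. This is only a sketch: the function names are illustrative (not from any library), and the inputs are the round figures quoted in the slides (20K expressed genes, 3 kb average gene length, 100 bp reads).

```python
def reads_for_coverage(n_genes, mean_gene_len_bp, read_len_bp, coverage=1.0):
    """Reads needed to cover the expressed transcriptome `coverage` times.

    Assumes every gene has the same length and is sampled uniformly,
    which real expression data never satisfies.
    """
    transcriptome_bp = n_genes * mean_gene_len_bp
    return coverage * transcriptome_bp / read_len_bp


def unique_sequences(read_len_bp):
    """Number of distinct DNA sequences of a given length (4 bases per position)."""
    return 4 ** read_len_bp


if __name__ == "__main__":
    # 20K genes x 3 kb = 60 Mb of expressed sequence, i.e. 600K reads at 1x.
    print(reads_for_coverage(20_000, 3_000, 100))
    for length in (25, 50, 100):
        print(length, f"{unique_sequences(length):.1e}")
```

Because expression levels span several orders of magnitude, the 1x figure is a lower bound only; the slides' working target of 50 million reads reflects that.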
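
Worked example: FPKM
The FPKM unit defined on the Cufflinks slide (fragments per kb of exon per million fragments mapped) can be illustrated with simple arithmetic. This sketch shows only the unit conversion; it is not how Cufflinks estimates abundance, which uses a likelihood over fragment assignments. The function name and the example numbers are hypothetical.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million mapped fragments.

    fragments              -- fragments assigned to this transcript
    exon_length_bp         -- summed exon length of the transcript, in bp
    total_mapped_fragments -- all mapped fragments in the sample
    """
    if exon_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("lengths and library size must be positive")
    # 1e9 = 1e3 (bp per kb) * 1e6 (fragments per million)
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


if __name__ == "__main__":
    # Hypothetical transcript: 1,000 fragments, 2 kb of exon, 50M mapped fragments.
    print(fpkm(1_000, 2_000, 50_000_000))
```

Dividing by both length and library size is what makes values comparable across transcripts of different lengths and samples of different depth.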
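
Worked example: checking for overdispersion
The overdispersion point from the Differential Expression slides can be checked on replicate counts for a single gene. Under a Poisson model the variance equals the mean, so a variance-to-mean ratio well above 1 suggests extra variance. The sketch below assumes the common negative binomial parameterization var = mu + alpha * mu^2 and a crude method-of-moments estimate of alpha; this is an assumption for illustration, not the estimator used by DESeq, edgeR, or Cuffdiff, which share information across genes. The counts are made up.

```python
from statistics import mean, variance


def dispersion_index(counts):
    """Sample variance divided by mean; about 1 is expected for Poisson counts."""
    m = mean(counts)
    if m == 0:
        raise ValueError("mean count is zero")
    return variance(counts) / m


def nb_dispersion_moments(counts):
    """Method-of-moments alpha in var = mu + alpha * mu**2, clipped at 0."""
    m = mean(counts)
    if m == 0:
        raise ValueError("mean count is zero")
    return max(0.0, (variance(counts) - m) / (m * m))


if __name__ == "__main__":
    replicate_counts = [90, 110, 150, 50]  # hypothetical counts for one gene
    print(dispersion_index(replicate_counts))
    print(nb_dispersion_moments(replicate_counts))
```

With only a handful of replicates, per-gene estimates like these are very noisy, which is why the tools on the slide pool or moderate variance across genes.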