mRNA-Seq: methods and applications
Jim Noonan
GENE 760

Introduction to mRNA-seq
• Technical methodology
• Read mapping and normalization
• Estimating isoform-level gene expression
• De novo transcript reconstruction
• Sensitivity and sequencing depth
• Differential expression analysis

mRNA-seq workflow
Martin and Wang. Nat Rev Genet 12:671 (2011)
Wang et al. Nat Rev Genet 10:57 (2009)

Illumina RNA-seq library preparation
Capture poly-A RNA with poly-T oligo attached beads (100 ng total) (2x)
• RNA quality must be high; degradation produces 3’ bias
• Non-poly-A RNAs are not recovered
Fragment mRNA
Synthesize ds cDNA
Ligate adapters
Amplify
Generate clusters and sequence

Ribosomal RNA subtraction
RiboMinus

Mapping RNA-seq reads and quantifying transcripts
RNA-seq reads mapped to a reference genome
Normalization: Reads per kilobase of feature length per million mapped reads (RPKM)
• What is a “feature?”
• What about genomes with poor genome annotation?
• What about species with no sequenced genome?
For a detailed comparison of normalization methods, see:
Bullard et al. BMC Bioinformatics 11:94 (2010)
Robinson and Oshlack. Genome Biol 11:R25 (2010)

Quantifying gene expression by RNA-seq
Use existing gene annotation:
• Align to genome plus annotated splices
• Depends on high-quality gene annotation
• Which annotation to use: RefSeq, GENCODE, UCSC?
• Isoform quantification?
• Identifying novel transcripts?
Reference-guided alignments:
• Align to genome sequence
• Infer splice events from reads
• Allows transcriptome analyses of genomes with poor gene annotation
De novo transcript assembly:
• Assemble transcripts directly from reads
• Allows transcriptome analyses of species without reference genomes

Composite gene model approach
Map reads to genome
Map remaining reads to known splice junctions
• Requires good gene models
• Isoforms are ignored

Which gene annotation to use?

Strategies for transcript assembly
Garber et al. Nat Methods 8:469 (2011)

Splice-aware short read aligners
Martin and Wang. Nat Rev Genet 12:671 (2011)

Reference based transcript assembly
Martin and Wang. Nat Rev Genet 12:671 (2011)

Transcript assembly programs
Martin and Wang. Nat Rev Genet 12:671 (2011)

Cufflinks: ab initio transcript assembly
Step 1: map reads to reference genome
Trapnell et al. Nat Biotechnol 28:511 (2010)

Cufflinks: ab initio transcript assembly
Isoform abundances estimated by maximum likelihood
Trapnell et al. Nat Biotechnol 28:511 (2010)

Graph-based transcript assembly
Martin and Wang. Nat Rev Genet 12:671 (2011)

Trinity: de novo transcript assembly
Grabherr et al. Nat Biotechnol 29:644 (2011)

What depth of sequencing is required to characterize a transcriptome?
Wang et al. Nat Rev Genet 10:57 (2009)

Considerations
Gene length:
• Long genes are detected before short genes
Expression level:
• High expressors are detected before low expressors
Complexity of the transcriptome:
• Tissues with many cell types require more sequencing
Feature type:
• Composite gene models
• Common isoforms
• Rare isoforms
Detection vs. quantification:
• Obtaining confident expression level estimates (e.g., “stable” RPKMs) requires greater coverage

Transcript detection is biased in favor of long genes
Tarazona et al. Genome Res 21:2213 (2011)

Applications of mRNA-seq
Characterizing transcriptome complexity
• Alternative splicing
Differential expression analysis
• Gene- and isoform-level expression comparisons
Novel RNA species
• lincRNAs and eRNAs
• Pervasive transcription
Translation
• Ribosome profiling
Allele-specific expression
• Effect of genetic variation on gene expression
• Imprinting
RNA editing
• Novel events

Alternative isoform regulation in human tissue transcriptomes
Wang et al. Nature 456:470 (2008)

Diversity of alternative splicing events in human tissues
Wang et al. Nature 456:470 (2008)

Differential expression
Garber et al. Nat Methods 8:469 (2011)

Programs for identifying DE genes in RNA-seq datasets
Program    Assumed distribution for count data    URL
DESeq      Negative binomial                      www-huber.embl.de/users/anders/DESeq/
DEGseq     Poisson                                www.bioconductor.org/packages/2.6/bioc/html/DEGseq.html
edgeR      Negative binomial                      www.bioconductor.org/packages/release/bioc/html/edgeR.html
baySeq     Negative binomial                      www.bioconductor.org/packages/release/bioc/html/baySeq.html
Cuffdiff   Negative binomial                      cufflinks.cbcb.umd.edu/

Differential expression: Characterizing transcriptome dynamics during brain development
Embryonic mouse cortex, RNA-seq, DEX
• Neuronal functions: synaptic transmission, cell adhesion, neuronal migration
• “Stemness” functions: cell cycle, M phase, Sox2, Oct4
Ayoub et al. PNAS 108:14950 (2011)

Differential expression: Characterizing transcriptome dynamics during brain development
Differential isoforms
Embryonic mouse cortex, RNA-seq, DE isoforms
Ayoub et al. PNAS 108:14950 (2011)

Novel RNA species: annotating lincRNAs
Guttman et al. Nat Biotechnol 28:503 (2010)

Enhancer-associated RNAs (eRNAs)
Neurons treated with KCl
Kim et al. Nature 465:182 (2010)

Enhancer-associated RNAs (eRNAs)
Ren B. Nature 465:173 (2010)

How much of the genome is transcribed?
van Bakel et al. PLoS Biol 8:e1000371 (2010)

Exploiting sequence information in RNA-seq reads
Majewski and Pastinen. Trends Genet 27:72 (2011)

Detecting variants that affect splicing
Pickrell et al. Nature 464:768 (2010)

Summary: mRNA-seq applications
• Quantify transcriptome complexity and compare across biological states
• Determine how transcriptomes are translated in different biological contexts
• Effect of genetic variation on gene expression
• Imprinting and RNA editing
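
Worked example: RPKM
The normalization slide defines RPKM as reads per kilobase of feature length per million mapped reads. A minimal sketch of that arithmetic follows. The function name `rpkm` and its parameter names are chosen here for illustration and do not come from any particular tool, and the "feature" is assumed to be whatever unit (exon, composite gene model, isoform) the reads were counted against.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature length per million mapped reads.

    read_count: reads assigned to the feature
    feature_length_bp: feature length in base pairs
    total_mapped_reads: all mapped reads in the library
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and library size must be positive")
    kilobases = feature_length_bp / 1e3
    millions_mapped = total_mapped_reads / 1e6
    return read_count / kilobases / millions_mapped


# 500 reads on a 2,000 bp feature in a library of 20 million mapped reads
print(rpkm(500, 2000, 20_000_000))  # 12.5
```

Dividing by length and by library size is what makes values comparable across features and across samples, which is why the choice of feature and of annotation matters so much.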
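
Worked example: isoform abundance by maximum likelihood
The Cufflinks slide states that isoform abundances are estimated by maximum likelihood. The sketch below is not the Cufflinks model, which is considerably richer. It is a simplified expectation-maximization illustration under stated assumptions: each read is described only by the set of isoforms it is compatible with, reads are sampled uniformly along an isoform, and the names `estimate_isoform_fractions`, `read_compat`, and `lengths` are hypothetical.

```python
def estimate_isoform_fractions(read_compat, lengths, n_iter=200):
    """Estimate the fraction of reads originating from each isoform.

    read_compat: list of sets; each set holds the isoforms a read could come from
    lengths: dict mapping isoform name to length in bp
    Returns a dict mapping isoform name to estimated read fraction.
    """
    isoforms = list(lengths)
    theta = {i: 1.0 / len(isoforms) for i in isoforms}  # uniform starting guess
    for _ in range(n_iter):
        counts = {i: 0.0 for i in isoforms}
        # E-step: split each read among its compatible isoforms.
        # A read from isoform i falls at a given position with probability
        # proportional to 1 / length, so the weight is theta / length.
        for compat in read_compat:
            weights = {i: theta[i] / lengths[i] for i in compat}
            total = sum(weights.values())
            if total == 0:
                continue  # read compatible with nothing, or only zero-weight isoforms
            for i, w in weights.items():
                counts[i] += w / total
        assigned = sum(counts.values())
        if assigned == 0:
            return theta
        # M-step: new fractions are the normalized expected read counts.
        theta = {i: counts[i] / assigned for i in isoforms}
    return theta


# 30 reads unique to A, 10 unique to B, 20 shared; both isoforms 1,000 bp
reads = [{"A"}] * 30 + [{"B"}] * 10 + [{"A", "B"}] * 20
fractions = estimate_isoform_fractions(reads, {"A": 1000, "B": 1000})
print(fractions)
```

With equal lengths the shared reads are divided in proportion to the current estimates, so the unique reads end up determining how the ambiguous ones are apportioned. This is the part the composite gene model approach skips by ignoring isoforms.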
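
Worked example: why long genes are detected first
The sequencing depth slides note that long genes and high expressors are detected before short genes and low expressors. One way to see this is the rough sketch below, which assumes read counts follow a Poisson distribution whose mean is implied by the RPKM definition. The Poisson assumption, the detection threshold `min_reads`, and the function names are illustrative choices made here, not part of the lecture.

```python
import math


def expected_reads(rpkm, length_bp, total_mapped_reads):
    """Expected read count implied by the RPKM definition."""
    return rpkm * (length_bp / 1e3) * (total_mapped_reads / 1e6)


def detection_probability(rpkm, length_bp, total_mapped_reads, min_reads=5):
    """P(at least min_reads reads), assuming Poisson-distributed counts."""
    lam = expected_reads(rpkm, length_bp, total_mapped_reads)
    p_below = sum(
        math.exp(-lam) * lam ** k / math.factorial(k) for k in range(min_reads)
    )
    return 1.0 - p_below


# Same expression level (1 RPKM), same depth, different gene lengths
for length in (500, 5000):
    p = detection_probability(1.0, length, 10_000_000)
    print(length, round(p, 3))
```

At a fixed RPKM the expected count scales with both gene length and library size, so a short gene needs proportionally more sequencing to reach the same detection probability as a long one.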
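
Note: Poisson vs. negative binomial count models
The table of DE programs lists the distribution each one assumes for count data. The practical difference lies in the mean-variance relationship, written below in one common parametrization. The symbols are assumptions introduced here: $Y$ is the read count for a gene, $\mu$ its mean, and $\phi \ge 0$ a dispersion parameter. Individual programs may parametrize or estimate dispersion differently.

```latex
\begin{align}
  \text{Poisson:} \quad & \operatorname{Var}(Y) = \mu \\
  \text{Negative binomial:} \quad & \operatorname{Var}(Y) = \mu + \phi\,\mu^{2}
\end{align}
```

When $\phi = 0$ the negative binomial variance reduces to the Poisson variance. A positive $\phi$ allows for variability between replicates beyond what sampling alone would produce.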