RNA sequencing, transcriptome and expression quantification
Henrik Lantz, BILS/SciLifeLab

Lecture synopsis
• What is RNA-seq?
• Basic concepts
• Mapping-based transcriptomics (genome based)
• De novo based transcriptomics (genome-free)
• Expression counts and differential expression
• Transcript annotation

RNA-seq
[Figure: gene structure from DNA to mRNA. DNA: exons and introns flanked by UTRs, with ATG start codon, TAG/TAA/TGA stop codon, and introns bounded by GT and AG. Transcription gives pre-mRNA with a poly-A tail; splicing gives mRNA (UTR, start codon, stop codon, UTR, poly-A tail); translation follows.]

Overview of RNA-Seq
From: http://www2.fml.tuebingen.mpg.de/raetsch/members/research/transcriptomics.html

Common Data Formats for RNA-Seq

FASTA format:
    >61DFRAAXX100204:1:100:10494:3070/1
    AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT

FASTQ format:
    @61DFRAAXX100204:1:100:10494:3070/1
    AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
    +
    ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA

Quality values in increasing order:
    !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~

You might get the data in .sff or .bam format. FASTQ reads are easy to extract from both of these binary (compressed) formats!
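To make the FASTQ layout and the quality-character ordering above concrete, here is a minimal Python sketch that reads four-line FASTQ records and converts quality characters to numeric scores. The function names `parse_fastq` and `phred_scores` are hypothetical, and the offset of 33 is an assumption: the slides do not state the encoding, and some older Illumina files use an offset of 64 instead.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality_string) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        # The next three lines are sequence, '+' separator, and quality string.
        seq, plus, qual = (next(it, "").rstrip("\n") for _ in range(3))
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near: {header!r}")
        yield header[1:], seq, qual


def phred_scores(quality, offset=33):
    """Convert quality characters to integer scores (offset 33 is an assumption)."""
    return [ord(char) - offset for char in quality]


# The FASTQ record shown on the slide, one list item per line of the file.
record = [
    "@61DFRAAXX100204:1:100:10494:3070/1",
    "AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT",
    "+",
    "ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA",
]

for name, seq, qual in parse_fastq(record):
    scores = phred_scores(qual)
    print(name, len(seq), min(scores), max(scores))
```

A character that appears later in the "increasing order" list always maps to a higher score, whichever offset the file uses.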
Paired-End
[Figure: a DNA fragment flanked by adapter+primer, with Read 1 and Read 2 sequenced from opposite ends. Labels: insert size, inner mate distance.]

Paired-end gives you two files

FASTQ format (old):
    @61DFRAAXX100204:1:100:10494:3070/1
    AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
    +
    ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCC

    @61DFRAAXX100204:1:100:10494:3070/2
    ATCCAAGTTAAAACAGAGGCCTGTGACAGACTCTTGGCCCATCGTGTTGATA
    +
    _^_a^cccegcgghhgZc`ghhc^egggd^_[d]defcdfd^Z^OXWaQ^ad

New:
    @<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>

Example:
    @SIM:1:FCX:1:15:6329:1045 1:N:0:2
    TCGCACTCAACGCCCTGCATATGACAAGACAGAATC
    +
    <>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=

Transcript Reconstruction from RNA-Seq Reads
Nature Biotech, 2010
• The Tuxedo Suite (TopHat, Cufflinks): End-to-end Genome-based RNA-Seq Analysis Software Package
• Trinity (with GMAP): End-to-end Transcriptome-based RNA-Seq Analysis Software Package

Basic concepts of mapping-based RNA-seq - Spliced reads
[Figure: the gene structure diagram (DNA, pre-mRNA, mRNA) shown earlier.]

RNA-seq - Spliced reads
[Figure: the same diagram, with pre-mRNA highlighted.]

Stranded RNA-seq

Overview of the Tuxedo Software Suite
• Bowtie (fast short-read alignment)
• TopHat (spliced short-read alignment)
• Cufflinks (transcript reconstruction from alignments)
• Cuffdiff (differential expression analysis)
• CummeRbund (visualization & analysis)
Slide courtesy of Cole Trapnell

Tophat-mapped reads

Alignments are reported in a compact representation: SAM format

    61G9EAAXX100520:5:100:10095:16477 83 chr1 51986 38 46M = 51789 -264 CCCAAACAAGCCGAACTAGCTGATTTGGCTCGTAAAGACCCGGAAA ###CB?=ADDBCBCDEEFFDEFFFDEFFGDBEFGEDGCFGFGGGGG MD:Z:67 NH:i:1 HI:i:1 NM:i:0 SM:i:38 XQ:i:40 X2:i:0

The same record, one field per line (field index on the left):

     0  61G9EAAXX100520:5:100:10095:16477   (read name)
     1  83                                  (FLAGS stored as bit fields; 83 = 00001010011)
     2  chr1                                (alignment target)
     3  51986                               (position alignment starts)
     4  38
     5  46M                                 (compact description of the alignment in CIGAR format)
     6  =
     7  51789
     8  -264
     9  CCCAAACAAGCCGAACTAGCTGATTTGGCTCGTAAAGACCCGGAAA   (read sequence, oriented according to the forward alignment)
    10  ###CB?=ADDBCBCDEEFFDEFFFDEFFGDBEFGEDGCFGFGGGGG   (base quality values)
    11  MD:Z:67
    12  NH:i:1
    13  HI:i:1
    14  NM:i:0                              (metadata)
    15  SM:i:38
    16  XQ:i:40
    17  X2:i:0

Still not compact enough: millions to billions of reads take up a lot of space! Convert SAM to binary: BAM format.

SAM format specification: http://samtools.sourceforge.net/SAM1.pdf

Samtools
• Tools for
  – Converting SAM <-> BAM
  – Viewing BAM files (e.g. samtools view file.bam | less)
  – Sorting BAM files, and lots more

There is also CRAM…

CRAM compression rate

    File format              File size (GB)
    SAM                      7.4
    BAM                      1.9
    CRAM lossless            1.4
    CRAM 8 bins              0.8
    CRAM no quality scores   0.26

Visualizing Alignments of RNA-Seq reads

Text-based Alignment Viewer
    % samtools tview alignments.bam target.fasta

IGV

IGV: Viewing Tophat Alignments

Transcript Reconstruction Using Cufflinks
From Martin & Wang. Nature Reviews in Genetics. 2011

GFF file format

GFF3 file format

    Seqid  source  type  start  end   score  strand  phase  attributes
    Chr1   Snap    gene  234    3657  .      +       .      ID=gene1; Name=Snap1;
    Chr1   Snap    mRNA  234    3657  .      +       .      ID=gene1.m1; Parent=gene1;
    Chr1   Snap    exon  234    1543  .      +       .      ID=gene1.m1.exon1; Parent=gene1.m1;
    Chr1   Snap    CDS   577    1543  .      +       0      ID=gene1.m1.CDS1; Parent=gene1.m1;
    Chr1   Snap    exon  1822   2674  .      +       .      ID=gene1.m1.exon2; Parent=gene1.m1;
    Chr1   Snap    CDS   1822   2674  .      +       2      ID=gene1.m1.CDS2; Parent=gene1.m1;

Other types: start_codon, stop_codon
Other attributes: Alias, note, ontology_term …

GTF file format

    Seqid  source  type  start  end   score  strand  phase  attributes
    Chr1   Snap    exon  234    1543  .      +       .      gene_id "gene1"; transcript_id "transcript1";
    Chr1   Snap    CDS   577    1543  .      +       0      gene_id "gene1"; transcript_id "transcript1";
    Chr1   Snap    exon  1822   2674  .      +       .      gene_id "gene1"; transcript_id "transcript1";
    Chr1   Snap    CDS   1822   2674  .      +       2      gene_id "gene1"; transcript_id "transcript1";

Other types: start_codon, stop_codon

De novo transcriptome assembly
No genome required
Empower studies of non-model organisms
  – expressed gene content
  – transcript abundance
  – differential expression

The General Approach to De novo RNA-Seq Assembly Using De Bruijn Graphs

Sequence Assembly via De Bruijn Graphs
From Martin & Wang, Nat. Rev. Genet. 2011

Contrasting Genome and Transcriptome Assembly

Genome Assembly
• Uniform coverage
• Single contig per locus
• Double-stranded

Transcriptome Assembly
• Exponentially distributed coverage levels
• Multiple contigs per locus (alt splicing)
• Strand-specific

Trinity Aggregates Isolated Transcript Graphs
• Genome Assembly: Single Massive Graph. Entire chromosomes represented.
• Trinity Transcriptome Assembly: Many Thousands of Small Graphs. Ideally, one graph per expressed gene.

Trinity: How it works
RNA-Seq reads -> Linear contigs -> de Bruijn graphs (thousands of disjoint graphs) -> Transcripts + Isoforms

Trinity output: A multi-fasta file

Can align Trinity transcripts to genome scaffolds to examine intron/exon structures (Trinity transcripts aligned using GMAP)

An alternative: Pacific Biosciences (PacBio)
• Pros: Long reads (average 4.5 kbp), can give you full length transcripts in one read
• Cons: High error rate on longer fragments (15%), expensive

Abundance Estimation (Aka. Computing Expression Values)

Expression Value
Slide courtesy of Cole Trapnell

Normalized Expression Values
• Transcript-mapped read counts are normalized for both length of the transcript and total depth of sequencing.
• Reported as: Number of RNA-Seq Fragments Per Kilobase of transcript per total Million fragments mapped (FPKM)

Differential Expression Analysis Using RNA-Seq

Differential expression
[Figure: mapped reads from condition 1 and from condition 2 aligned to the same genome.]

Diff. Expression Analysis Involves
• Counting reads
• Statistical significance testing

              Sample_A   Sample_B   Fold_Change   Significant?
    Gene A    1          2          2-fold        No
    Gene B    100        200        2-fold        Yes

Beware of concluding fold change from small numbers of counts
Poisson distributions for counts based on 2-fold expression differences
• Low counts: No confidence in 2-fold difference. Likely observed by chance.
• High counts: High confidence in 2-fold difference. Unlikely observed by chance.
From: http://gkno2.tumblr.com/post/24629975632/thinking-about-rna-seq-experimental-design-for

More Counts = More Statistical Power
Example: 5000 total reads per sample. Observed 2-fold differences in read counts.

              Sample A   Sample B   Fisher's Exact Test (P-value)
    geneA     1          2          1.00
    geneB     10         20         0.098
    geneC     100        200        < 0.001

Tools for DE analysis with RNA-Seq
ShrinkSeq, NoiSeq, baySeq, vst, voom, SAMseq, TSPM, DESeq, EBSeq, NBPSeq, edgeR
See: http://www.biomedcentral.com/1471-2105/14/91
+ others (not R-based), including Cuffdiff

Use of transcripts
• Transcripts can be assembled de novo or from mapped reads and then used in gene expression/differential expression studies
• Can be functionally annotated

Functional annotation
• Take transcripts from Cufflinks or Trinity
• Annotate the sequences functionally in Blast2GO

Blast2GO
KEGG-mapping
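Returning to the expression slides above, the two calculations they describe (FPKM normalization and Fisher's exact test on read counts) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the method used by any particular tool: the function names `fpkm` and `fisher_exact_two_sided` are hypothetical, the transcript in the FPKM example is made up, and the test is assumed to be two-sided, which the slide does not specify.

```python
from math import comb


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    # Dividing by (length / 1e3) and (total / 1e6) is the same as multiplying by 1e9.
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def fisher_exact_two_sided(count_a, count_b, total_a, total_b):
    """Two-sided Fisher's exact test for one gene's read counts in two samples.

    The 2x2 table is (gene reads, other reads) x (sample A, sample B).
    """
    hits = count_a + count_b
    all_tables = comb(total_a + total_b, hits)

    def prob(k):
        # Hypergeometric probability that k of the gene's reads fall in sample A.
        return comb(total_a, k) * comb(total_b, hits - k) / all_tables

    observed = prob(count_a)
    k_min = max(0, hits - total_b)
    k_max = min(hits, total_a)
    # Sum every table that is no more likely than the observed one
    # (small tolerance guards against floating-point ties).
    p_value = sum(
        p for p in map(prob, range(k_min, k_max + 1))
        if p <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Made-up example: 100 fragments on a 2000 bp transcript, 1 million mapped fragments.
print("FPKM:", fpkm(100, 2000, 1_000_000))

# Counts from the "More Counts = More Statistical Power" table, 5000 reads per sample.
for gene, a, b in [("geneA", 1, 2), ("geneB", 10, 20), ("geneC", 100, 200)]:
    print(gene, a, b, fisher_exact_two_sided(a, b, 5000, 5000))
```

All three genes show the same 2-fold difference, yet only the higher counts give a small p-value, which is the point of the table. Dedicated DE tools such as those listed above model biological variability between replicates, which this per-gene test does not.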