RNA sequencing, transcriptome and expression

RNA sequencing, transcriptome and expression quantification
Henrik Lantz, BILS/SciLifeLab

Lecture synopsis
• What is RNA-seq?
• Basic concepts
• Mapping-based transcriptomics (genome based)
• De novo based transcriptomics (genome-free)
• Expression counts and differential expression
• Transcript annotation

RNA-seq
[Figure: gene structure. DNA with exons, introns, UTRs, ATG start codon, GT/AG splice sites and TAG/TAA/TGA stop codon is transcribed to pre-mRNA, spliced to mRNA with a poly-A tail, and translated.]

Overview of RNA-Seq
From: http://www2.fml.tuebingen.mpg.de/raetsch/members/research/transcriptomics.html

Common Data Formats for RNA-Seq
FASTA format:
>61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
FASTQ format:
@61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
+
ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA
Quality values in increasing order:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
You might get the data in .sff or .bam format. FASTQ reads are easy to extract from both of these binary (compressed) formats!
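The quality line of a FASTQ record stores one score per base as a single ASCII character, which is why the characters above can be listed in increasing order. A minimal Python sketch of decoding them follows. It assumes the common Phred+33 encoding, where "!" means a score of 0; older Illumina files use an offset of 64 instead. The function names are illustrative and not part of any tool mentioned in the lecture.

```python
def decode_phred(quality_line, offset=33):
    """Convert a FASTQ quality string to a list of integer Phred scores.

    offset=33 is an assumption (Sanger / newer Illumina encoding);
    pass offset=64 for the older Illumina encoding.
    """
    return [ord(ch) - offset for ch in quality_line]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def parse_fastq_record(lines):
    """Split the four lines of one FASTQ record into (name, sequence, quality)."""
    header, sequence, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    return header[1:], sequence, quality


# The example record from the slide.
record = [
    "@61DFRAAXX100204:1:100:10494:3070/1",
    "AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT",
    "+",
    "ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA",
]
name, sequence, quality = parse_fastq_record(record)
scores = decode_phred(quality)
print(name, "first score:", scores[0], "lowest score:", min(scores))
print("error probability at the lowest score:", error_probability(min(scores)))
```

If the decoded scores come out implausibly high or negative, the file probably uses the other offset.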
Paired-End
[Figure: paired-end layout showing read 1, read 2, the DNA fragment, adapter+primer, insert size and inner mate distance]

Paired-end gives you two files
FASTQ format (old):
@61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
+
ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA
@61DFRAAXX100204:1:100:10494:3070/2
ATCCAAGTTAAAACAGAGGCCTGTGACAGACTCTTGGCCCATCGTGTTGATA
+
_^_a^cccegcgghhgZc`ghhc^egggd^_[d]defcdfd^Z^OXWaQ^ad
New:
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>
Example:
@SIM:1:FCX:1:15:6329:1045 1:N:0:2
TCGCACTCAACGCCCTGCATATGACAAGACAGAATC
+
<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=

Transcript Reconstruction from RNA-Seq Reads
Nature Biotech, 2010
• Genome-based: TopHat, Cufflinks (The Tuxedo Suite: End-to-end Genome-based RNA-Seq Analysis Software Package)
• Transcriptome-based: Trinity, GMAP (End-to-end Transcriptome-based RNA-Seq Analysis Software Package)

Basic concepts of mapping-based RNA-seq - Spliced reads
[Figure: gene structure diagram (DNA, pre-mRNA, mRNA) illustrating spliced reads]

Stranded RNA-seq

Overview of the Tuxedo Software Suite
• Bowtie (fast short-read alignment)
• TopHat (spliced short-read alignment)
• Cufflinks (transcript reconstruction from alignments)
• Cuffdiff (differential expression analysis)
• CummeRbund (visualization & analysis)
Slide courtesy of Cole Trapnell

TopHat-mapped reads

Alignments are reported in a compact representation: SAM format
Field  Value                                           Meaning
0      61G9EAAXX100520:5:100:10095:16477               read name
1      83                                              FLAGS stored as bit fields; 83 = 00001010011, i.e. 1 + 2 + 16 + 64: paired, properly paired, read on reverse strand, first in pair
2      chr1                                            alignment target
3      51986                                           position alignment starts
4      38                                              mapping quality
5      46M                                             compact description of the alignment in CIGAR format
6      =                                               alignment target of the mate ("=" means the same as field 2)
7      51789                                           position the mate's alignment starts
8      -264                                            template length
9      CCCAAACAAGCCGAACTAGCTGATTTGGCTCGTAAAGACCCGGAAA  read sequence, oriented according to the forward alignment
10     ###CB?=ADDBCBCDEEFFDEFFFDEFFGDBEFGEDGCFGFGGGGG  base quality values
11-17  MD:Z:67 NH:i:1 HI:i:1 NM:i:0 SM:i:38 XQ:i:40 X2:i:0  metadata
SAM format specification: http://samtools.sourceforge.net/SAM1.pdf

Still not compact enough…
Millions to billions of reads take up a lot of space!! Convert SAM to binary – BAM format.

Samtools
• Tools for
– converting SAM <-> BAM
– viewing BAM files (e.g. samtools view file.bam | less)
– sorting BAM files, and lots more

There is also CRAM…
CRAM compression rate
File format              File size (GB)
SAM                      7.4
BAM                      1.9
CRAM lossless            1.4
CRAM 8 bins              0.8
CRAM no quality scores   0.26

Visualizing Alignments of RNA-Seq reads
Text-based Alignment Viewer
% samtools tview alignments.bam target.fasta
IGV
IGV: Viewing TopHat Alignments

Transcript Reconstruction Using Cufflinks
From Martin & Wang, Nat. Rev. Genet. 2011

GFF3 file format
Seqid  source  type  start  end   score  strand  phase  attributes
Chr1   Snap    gene  234    3657  .      +       .      ID=gene1; Name=Snap1;
Chr1   Snap    mRNA  234    3657  .      +       .      ID=gene1.m1; Parent=gene1;
Chr1   Snap    exon  234    1543  .      +       .      ID=gene1.m1.exon1; Parent=gene1.m1;
Chr1   Snap    CDS   577    1543  .      +       0      ID=gene1.m1.CDS1; Parent=gene1.m1;
Chr1   Snap    exon  1822   2674  .      +       .      ID=gene1.m1.exon2; Parent=gene1.m1;
Chr1   Snap    CDS   1822   2674  .      +       2      ID=gene1.m1.CDS2; Parent=gene1.m1;
Other feature types: start_codon, stop_codon
Other attributes: Alias, note, ontology_term …

GTF file format
Seqid  source  type  start  end   score  strand  phase  attributes
Chr1   Snap    exon  234    1543  .      +       .      gene_id "gene1"; transcript_id "transcript1";
Chr1   Snap    CDS   577    1543  .      +       0      gene_id "gene1"; transcript_id "transcript1";
Chr1   Snap    exon  1822   2674  .      +       .      gene_id "gene1"; transcript_id "transcript1";
Chr1   Snap    CDS   1822   2674  .      +       2      gene_id "gene1"; transcript_id "transcript1";
Other feature types: start_codon, stop_codon

Transcript Reconstruction from RNA-Seq Reads
Trinity and GMAP: End-to-end Transcriptome-based RNA-Seq Analysis Software Package

De novo transcriptome assembly
• No genome required
• Empower studies of non-model organisms
– expressed gene content
– transcript abundance
– differential expression

The General Approach to De novo RNA-Seq Assembly Using De Bruijn Graphs
Sequence Assembly via De Bruijn Graphs
From Martin & Wang, Nat. Rev. Genet. 2011

Contrasting Genome and Transcriptome Assembly
Genome Assembly
• Uniform coverage
• Single contig per locus
• Double-stranded
Transcriptome Assembly
• Exponentially distributed coverage levels
• Multiple contigs per locus (alt splicing)
• Strand-specific

Trinity Aggregates Isolated Transcript Graphs
• Genome Assembly: Single Massive Graph. Entire chromosomes represented.
• Trinity Transcriptome Assembly: Many Thousands of Small Graphs. Ideally, one graph per expressed gene.

Trinity – How it works:
RNA-Seq reads → Linear contigs → de-Bruijn graphs (thousands of disjoint graphs) → Transcripts + Isoforms

Trinity output: A multi-fasta file
Can align Trinity transcripts to genome scaffolds to examine intron/exon structures (Trinity transcripts aligned using GMAP)

An alternative: Pacific Biosciences (PacBio)
• Pros: Long reads (average 4.5 kbp), can give you full length transcripts in one read
• Cons: High error rate on longer fragments (15%), expensive

Abundance Estimation (Aka. Computing Expression Values)
Expression Value
Slide courtesy of Cole Trapnell

Normalized Expression Values
• Transcript-mapped read counts are normalized for both length of the transcript and total depth of sequencing.
• Reported as: Number of RNA-Seq Fragments Per Kilobase of transcript per total Million fragments mapped (FPKM)

Differential Expression Analysis Using RNA-Seq
Differential expression
[Figure: mapped reads from condition 1 and condition 2 shown against the genome]

Diff. Expression Analysis Involves
• Counting reads
• Statistical significance testing

        Sample_A  Sample_B  Fold_Change  Significant?
Gene A  1         2         2-fold       No
Gene B  100       200       2-fold       Yes

Beware of concluding fold change from small numbers of counts
Poisson distributions for counts based on 2-fold expression differences
• Low counts: No confidence in 2-fold difference. Likely observed by chance.
• High counts: High confidence in 2-fold difference. Unlikely observed by chance.
From: http://gkno2.tumblr.com/post/24629975632/thinking-about-rna-seq-experimental-design-for

More Counts = More Statistical Power
Example: 5000 total reads per sample. Observed 2-fold differences in read counts.
       Sample A  Sample B  Fisher's Exact Test (P-value)
geneA  1         2         1.00
geneB  10        20        0.098
geneC  100       200       < 0.001

Tools for DE analysis with RNA-Seq
ShrinkSeq, NoiSeq, baySeq, vst, voom, SAMseq, TSPM, DESeq, EBSeq, NBPSeq, edgeR
See: http://www.biomedcentral.com/1471-2105/14/91
+ other (not-R) including CuffDiff

Use of transcripts
• Transcripts can be assembled de novo or from mapped reads and then used in gene expression/differential expression studies
• Can be functionally annotated

Functional annotation
• Take transcripts from Cufflinks or Trinity
• Annotate the sequences functionally in Blast2GO
Blast2GO
KEGG-mapping
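The FPKM definition and the "More Counts = More Statistical Power" table from the expression slides can be reproduced with a short Python sketch. This is an illustration under stated assumptions, not the method used by Cufflinks, Cuffdiff or any of the listed DE tools: the function names are hypothetical, the FPKM input numbers are made up, and the test is a plain two-sided Fisher's exact test on a 2x2 table of "reads from this gene" versus "all other reads" in each sample.

```python
import math


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments / (kilobases * millions)


def _log_comb(n, r):
    # log of the binomial coefficient, via log-gamma to avoid huge integers
    return math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)


def _log_hypergeom(k, total_a, total_b, gene_total):
    # log P(k of the gene's reads fall in sample A), all margins fixed
    return (_log_comb(total_a, k)
            + _log_comb(total_b, gene_total - k)
            - _log_comb(total_a + total_b, gene_total))


def fisher_exact_two_sided(count_a, count_b, total_a, total_b):
    """Two-sided Fisher's exact test for one gene in two samples.

    Sums the probabilities of every table that is no more likely
    than the observed one.
    """
    gene_total = count_a + count_b
    lowest = max(0, gene_total - total_b)
    highest = min(gene_total, total_a)
    log_observed = _log_hypergeom(count_a, total_a, total_b, gene_total)
    p_value = 0.0
    for k in range(lowest, highest + 1):
        log_p = _log_hypergeom(k, total_a, total_b, gene_total)
        if log_p <= log_observed + 1e-7:  # tolerance for rounding error
            p_value += math.exp(log_p)
    return min(1.0, p_value)


# Hypothetical transcript: 100 fragments, 2000 bp long, 5 million mapped fragments.
print("FPKM:", fpkm(100, 2000, 5_000_000))

# The three genes from the slide, 5000 total reads per sample.
for gene, a, b in [("geneA", 1, 2), ("geneB", 10, 20), ("geneC", 100, 200)]:
    print(gene, a, b, fisher_exact_two_sided(a, b, 5000, 5000))
```

The fold change is the same for all three genes; only the number of reads behind it changes, which is what moves the P-value from not significant to highly significant.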
