RNA sequencing, transcriptome and expression quantification
Henrik Lantz, BILS/SciLifeLab
Lecture synopsis
• What is RNA-seq?
• Basic concepts
• Mapping-based transcriptomics (genome based)
• De novo based transcriptomics (genome-free)
• Expression counts and differential expression
• Transcript annotation
RNA-seq
[Figure: from gene to protein. DNA: exons and introns flanked by UTRs, with the ATG start codon, a stop codon (TAG, TAA or TGA), and GT at the start and AG at the end of each intron. Transcription gives the pre-mRNA (with poly-A tail), splicing removes the introns to give the mRNA, and translation follows.]
Overview of RNA-Seq
From: http://www2.fml.tuebingen.mpg.de/raetsch/members/research/transcriptomics.html
Common Data Formats for RNA-Seq
FASTA format:
>61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
FASTQ format:
@61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
+
ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCCA
Quality values in increasing order:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
You might get the data in .sff or .bam format. FASTQ reads are easy to extract
from both of these binary (compressed) formats!
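As a rough illustration of the FASTQ layout above, the following Python sketch reads four-line records and converts a quality string to Phred scores. It assumes the common Phred+33 encoding (score = ASCII code minus 33); some older Illumina files use an offset of 64 instead, hence the `offset` parameter. The function names and the toy record are made up for this example.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'

def phred_scores(quality, offset=33):
    """Convert a quality string to integer Phred scores (assumed offset: 33)."""
    return [ord(char) - offset for char in quality]

def error_probability(phred):
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-phred / 10)

# Toy record, not taken from a real sequencing run.
toy = StringIO("@read1\nACGT\n+\nI5+!\n")
records = list(read_fastq(toy))
name, sequence, quality = records[0]
scores = phred_scores(quality)
print(name, sequence, scores)
```

Characters later in the "increasing order" list above give higher scores, so 'I' is a much better base call than '!'.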
Paired-End
[Figure: a DNA fragment with adapter+primer at each end, sequenced from both ends as Read 1 and Read 2. Labels: insert size, inner mate distance.]
Paired-end gives you two files
FASTQ format (old):
@61DFRAAXX100204:1:100:10494:3070/1
AAACAACAGGGCACATTGTCACTCTTGTATTTGAAAAACACTTTCCGGCCAT
+
ACCCCCCCCCCCCCCCCCCCCCCCCCCCCCBC?CCCCCCCCC@@CACCCCC
@61DFRAAXX100204:1:100:10494:3070/2
ATCCAAGTTAAAACAGAGGCCTGTGACAGACTCTTGGCCCATCGTGTTGATA
+
_^_a^cccegcgghhgZc`ghhc^egggd^_[d]defcdfd^Z^OXWaQ^ad
New: @<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<sample number>
Example:
@SIM:1:FCX:1:15:6329:1045 1:N:0:2
TCGCACTCAACGCCCTGCATATGACAAGACAGAATC
+
<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=
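The new-style header can be split into its named parts with a few lines of Python. This is a minimal sketch that follows the field order given above; the class and function names are invented for the example, and the last field is kept as text because it is not always a plain number in real files.

```python
from typing import NamedTuple

class IlluminaHeader(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int             # 1 or 2: which mate of the pair
    is_filtered: bool     # True if the header says "Y"
    control_number: int
    sample_number: str

def parse_header(line):
    """Split a new-style FASTQ header line into its named fields."""
    location, description = line.lstrip("@").split(" ", 1)
    instrument, run, flowcell, lane, tile, x_pos, y_pos = location.split(":")
    read, filtered, control, sample = description.split(":")
    return IlluminaHeader(
        instrument, int(run), flowcell, int(lane), int(tile),
        int(x_pos), int(y_pos), int(read), filtered == "Y",
        int(control), sample,
    )

header = parse_header("@SIM:1:FCX:1:15:6329:1045 1:N:0:2")
print(header.read, header.lane, header.tile)
```

In the old format the mate is instead given by the /1 or /2 suffix on the read name.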
Transcript Reconstruction from RNA-Seq Reads
Nature Biotech, 2010
[Figure, built up over several slides: two routes from RNA-Seq reads to transcripts]
• The Tuxedo Suite: end-to-end genome-based RNA-Seq analysis software package (TopHat, Cufflinks)
• Trinity: end-to-end transcriptome-based RNA-Seq analysis software package (Trinity, GMAP)
Basic concepts of mapping-based RNA-seq - Spliced reads
[Figure: the same gene model as before (DNA with exons, introns, UTRs, start and stop codons; transcription to pre-mRNA; splicing to mRNA; translation). Reads sequenced from the spliced mRNA that cross an exon-exon junction map to the genome as spliced reads, split across the intron.]
Stranded RNA-seq
Overview of the Tuxedo Software Suite
Bowtie (fast short-read alignment)
TopHat (spliced short-read alignment)
Cufflinks (transcript reconstruction from alignments)
Cuffdiff (differential expression analysis)
CummeRbund (visualization & analysis)
Slide courtesy of Cole Trapnell
TopHat-mapped reads
Alignments are reported in a compact representation: SAM format
One alignment record, one field per line (field index, value, description):
 0   61G9EAAXX100520:5:100:10095:16477   (read name)
 1   83   (FLAGS stored as bit fields; 83 = 00001010011)
 2   chr1   (alignment target)
 3   51986   (position alignment starts)
 4   38   (mapping quality)
 5   46M   (compact description of the alignment in CIGAR format)
 6   =   (alignment target of the mate; "=" means the same target)
 7   51789   (position the mate's alignment starts)
 8   -264   (observed template length)
 9   CCCAAACAAGCCGAACTAGCTGATTTGGCTCGTAAAGACCCGGAAA   (read sequence, oriented according to the forward alignment)
10   ###CB?=ADDBCBCDEEFFDEFFFDEFFGDBEFGEDGCFGFGGGGG   (base quality values)
11   MD:Z:67   (metadata, fields 11 to 17)
12   NH:i:1
13   HI:i:1
14   NM:i:0
15   SM:i:38
16   XQ:i:40
17   X2:i:0
SAM format specification: http://samtools.sourceforge.net/SAM1.pdf
Still not compact enough…
Millions to billions of reads take up a lot of space!!
Convert SAM to binary: BAM format.
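To make the FLAG and CIGAR fields less abstract, here is a small Python sketch that decodes them. The bit values and CIGAR operation letters follow the SAM specification; the short labels and function names are chosen for this example only, and the spliced CIGAR string is a made-up illustration.

```python
import re

# FLAG bits as defined in the SAM specification; labels are informal.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]

def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS if flag & bit]

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(length), op)
            for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

print(decode_flag(83))            # the FLAG of the record shown above
print(parse_cigar("46M"))         # the CIGAR of the record shown above
print(parse_cigar("20M500N26M"))  # hypothetical spliced read: N marks a skipped intron
```

So 83 means: paired read, mapped in a proper pair, aligned to the reverse strand, first read of the pair. A spliced alignment shows up as an N operation in the CIGAR string.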
Samtools
• Tools for
– converting SAM <-> BAM
– viewing BAM files (e.g. samtools view file.bam | less)
– sorting BAM files, and lots more
There is also CRAM…
CRAM compression rate
File format               File size (GB)
SAM                       7.4
BAM                       1.9
CRAM lossless             1.4
CRAM 8 bins               0.8
CRAM no quality scores    0.26
Visualizing Alignments of RNA-Seq reads
Text-based Alignment Viewer
% samtools tview alignments.bam target.fasta
IGV
IGV: Viewing TopHat Alignments
Transcript Reconstruction Using Cufflinks
From Martin & Wang, Nature Reviews Genetics, 2011
GFF3 file format
Seqid   source   type   start   end    score   strand   phase   attributes
Chr1    Snap     gene   234     3657   .       +        .       ID=gene1; Name=Snap1;
Chr1    Snap     mRNA   234     3657   .       +        .       ID=gene1.m1; Parent=gene1;
Chr1    Snap     exon   234     1543   .       +        .       ID=gene1.m1.exon1; Parent=gene1.m1;
Chr1    Snap     CDS    577     1543   .       +        0       ID=gene1.m1.CDS1; Parent=gene1.m1;
Chr1    Snap     exon   1822    2674   .       +        .       ID=gene1.m1.exon2; Parent=gene1.m1;
Chr1    Snap     CDS    1822    2674   .       +        2       ID=gene1.m1.CDS2; Parent=gene1.m1;
Other feature types: start_codon, stop_codon
Other attributes: Alias, note, ontology_term …
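The ID and Parent attributes are what tie GFF3 features together into gene, mRNA and exon/CDS levels. The Python sketch below parses the example rows and rebuilds that hierarchy. It assumes tab-separated columns and key=value attributes separated by semicolons; the function and variable names are made up for the illustration.

```python
from collections import defaultdict

GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]

def parse_gff3_line(line):
    """Turn one tab-separated GFF3 line into a dict with parsed attributes."""
    record = dict(zip(GFF3_COLUMNS, line.rstrip("\n").split("\t")))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for pair in record["attributes"].split(";"):
        pair = pair.strip()
        if pair:
            key, value = pair.split("=", 1)
            attributes[key] = value
    record["attributes"] = attributes
    return record

gff3_lines = [
    "Chr1\tSnap\tgene\t234\t3657\t.\t+\t.\tID=gene1; Name=Snap1;",
    "Chr1\tSnap\tmRNA\t234\t3657\t.\t+\t.\tID=gene1.m1; Parent=gene1;",
    "Chr1\tSnap\texon\t234\t1543\t.\t+\t.\tID=gene1.m1.exon1; Parent=gene1.m1;",
    "Chr1\tSnap\tCDS\t577\t1543\t.\t+\t0\tID=gene1.m1.CDS1; Parent=gene1.m1;",
    "Chr1\tSnap\texon\t1822\t2674\t.\t+\t.\tID=gene1.m1.exon2; Parent=gene1.m1;",
    "Chr1\tSnap\tCDS\t1822\t2674\t.\t+\t2\tID=gene1.m1.CDS2; Parent=gene1.m1;",
]
gff3_records = [parse_gff3_line(line) for line in gff3_lines]

# Map each parent ID to the IDs of its direct children.
children = defaultdict(list)
for record in gff3_records:
    parent = record["attributes"].get("Parent")
    if parent:
        children[parent].append(record["attributes"]["ID"])

for parent, kids in children.items():
    print(parent, "->", kids)
```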
GTF file format
Seqid   source   type   start   end    score   strand   phase   attributes
Chr1    Snap     exon   234     1543   .       +        .       gene_id "gene1"; transcript_id "transcript1";
Chr1    Snap     CDS    577     1543   .       +        0       gene_id "gene1"; transcript_id "transcript1";
Chr1    Snap     exon   1822    2674   .       +        .       gene_id "gene1"; transcript_id "transcript1";
Chr1    Snap     CDS    1822    2674   .       +        2       gene_id "gene1"; transcript_id "transcript1";
Other feature types: start_codon, stop_codon
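In GTF there are no ID/Parent links. Every line instead carries its own gene_id and transcript_id, so features are grouped by those values. As a hedged sketch (names invented for the example, tab-separated columns assumed), this Python snippet groups the exon rows by transcript and sums their lengths, which is the transcript length needed later for expression normalization.

```python
import re
from collections import defaultdict

# Matches attribute pairs of the form: key "value";
GTF_ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')

def parse_gtf_attributes(text):
    """Return the GTF attribute column as a dict."""
    return dict(GTF_ATTRIBUTE.findall(text))

gtf_lines = [
    'Chr1\tSnap\texon\t234\t1543\t.\t+\t.\tgene_id "gene1"; transcript_id "transcript1";',
    'Chr1\tSnap\tCDS\t577\t1543\t.\t+\t0\tgene_id "gene1"; transcript_id "transcript1";',
    'Chr1\tSnap\texon\t1822\t2674\t.\t+\t.\tgene_id "gene1"; transcript_id "transcript1";',
    'Chr1\tSnap\tCDS\t1822\t2674\t.\t+\t2\tgene_id "gene1"; transcript_id "transcript1";',
]

exon_length = defaultdict(int)
for line in gtf_lines:
    seqid, source, feature, start, end, score, strand, phase, attributes = line.split("\t")
    if feature == "exon":
        transcript = parse_gtf_attributes(attributes)["transcript_id"]
        # Coordinates are 1-based and inclusive, hence the +1.
        exon_length[transcript] += int(end) - int(start) + 1

print(dict(exon_length))
```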
Transcript Reconstruction from RNA-Seq Reads
[Figure repeated from earlier: the genome-based route (Tuxedo Suite: TopHat, Cufflinks) and the transcriptome-based route (Trinity, GMAP). Next: the transcriptome-based route.]
De novo transcriptome assembly
No genome required
Empower studies of non-model organisms
– expressed gene content
– transcript abundance
– differential expression
The General Approach to De novo RNA-Seq Assembly Using De Bruijn Graphs
Sequence Assembly via De Bruijn Graphs
From Martin & Wang, Nat. Rev. Genet. 2011
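A toy version of the idea, as a minimal Python sketch: reads are cut into k-mers, each k-mer becomes an edge between its first and last (k-1)-mer, and a contig is read off by walking a path through the graph. Real assemblers also deal with sequencing errors, branches and coverage; the reads, k and function names here are invented for the illustration.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unbranched_path(graph):
    """Follow the single unbranched path from the one node with no incoming edge."""
    targets = {node for successors in graph.values() for node in successors}
    starts = [node for node in graph if node not in targets]
    if len(starts) != 1:
        raise ValueError("this sketch expects exactly one start node")
    node = starts[0]
    contig = node
    seen = {node}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop instead of looping forever on a cycle
            break
        seen.add(node)
        contig += node[-1]
    return contig

# Three overlapping toy reads from one made-up transcript.
toy_reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
toy_graph = build_de_bruijn(toy_reads, k=4)
contig = walk_unbranched_path(toy_graph)
print(contig)
```

With alternative splicing the graph branches, and each path through the branches corresponds to a candidate isoform.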
Contrasting Genome and Transcriptome Assembly
Genome Assembly
• Uniform coverage
• Single contig per locus
• Double-stranded
Transcriptome Assembly
• Exponentially distributed coverage levels
• Multiple contigs per locus (alt splicing)
• Strand-specific
Trinity Aggregates Isolated Transcript Graphs
• Genome Assembly: single massive graph. Entire chromosomes represented.
• Trinity Transcriptome Assembly: many thousands of small graphs. Ideally, one graph per expressed gene.
Trinity – How it works:
RNA-Seq reads → linear contigs → de Bruijn graphs (thousands of disjoint graphs) → transcripts + isoforms
Trinity output: A multi-fasta file
Can align Trinity transcripts to genome scaffolds to examine intron/exon structures
(Trinity transcripts aligned using GMAP)
An alternative: Pacific Biosciences (PacBio)
• Pros: long reads (average 4.5 kbp) can give you full-length transcripts in one read
• Cons: high error rate on longer fragments (15%), expensive
Abundance Estimation
(a.k.a. Computing Expression Values)
Expression Value
Slide courtesy of Cole Trapnell
Normalized Expression Values
• Transcript-mapped read counts are normalized for both the length of the transcript and the total depth of sequencing.
• Reported as FPKM: number of RNA-Seq Fragments Per Kilobase of transcript per Million fragments mapped.
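Written out, the FPKM definition is: fragments mapped to the transcript, divided by the transcript length in kilobases, divided by the total mapped fragments in millions. A minimal Python sketch of that arithmetic follows; the function name and the example numbers are made up.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    length_kb = transcript_length_bp / 1_000
    depth_millions = total_mapped_fragments / 1_000_000
    return fragments / (length_kb * depth_millions)

# Hypothetical transcript: 500 fragments, 2,000 bp long, 20 million mapped fragments.
value = fpkm(500, 2_000, 20_000_000)
print(value)

# Same fragment count on a transcript twice as long gives half the FPKM.
print(fpkm(500, 4_000, 20_000_000))
```

The length term is what allows a long and a short transcript to be compared within a sample; the depth term is what allows comparison between samples sequenced to different depths.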
Differential Expression Analysis Using RNA-Seq
[Figure: differential expression. Mapped reads from condition 1 and from condition 2 shown against the same genome.]
Diff. Expression Analysis Involves
• Counting reads
• Statistical significance testing
          Sample_A   Sample_B   Fold_Change   Significant?
Gene A    1          2          2-fold        No
Gene B    100        200        2-fold        Yes
Beware of concluding fold change from small numbers of counts
Poisson distributions for counts based on 2-fold expression differences
• Small counts: no confidence in 2-fold difference. Likely observed by chance.
• Large counts: high confidence in 2-fold difference. Unlikely observed by chance.
From: http://gkno2.tumblr.com/post/24629975632/thinking-about-rna-seq-experimental-design-for
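The point can be checked numerically. Assuming counts follow a Poisson distribution, as in the figure, the Python sketch below asks how often a gene would show double its expected count purely by chance. The means 1 and 100 mirror Gene A and Gene B above; the function names are made up for the example.

```python
from math import exp, lgamma, log, sqrt

def poisson_pmf(k, mean):
    """P(X = k) for X ~ Poisson(mean), computed in log space for stability."""
    return exp(-mean + k * log(mean) - lgamma(k + 1))

def poisson_upper_tail(k, mean):
    """P(X >= k), summed directly over the tail up to a generous cut-off."""
    upper = int(max(k, mean) + 50 * sqrt(mean) + 100)
    return sum(poisson_pmf(i, mean) for i in range(k, upper + 1))

# Chance of seeing at least 2 reads when the true mean is 1.
low_count = poisson_upper_tail(2, 1.0)
# Chance of seeing at least 200 reads when the true mean is 100.
high_count = poisson_upper_tail(200, 100.0)
print(low_count, high_count)
```

At a mean of 1, a doubled count turns up in roughly a quarter of samples with no change in expression at all; at a mean of 100 it essentially never does.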
More Counts = More Statistical Power
Example: 5000 total reads per sample.
Observed 2-fold differences in read counts.
         Sample A   Sample B   Fisher’s Exact Test (P-value)
geneA    1          2          1.00
geneB    10         20         0.098
geneC    100        200        < 0.001
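The table can be approximately reproduced with a plain-Python two-sided Fisher's exact test. This is a sketch under one assumption about how the test was set up: a 2x2 table of reads from the gene versus all other reads, in each of two samples of 5000 reads. The function name is invented, and the tolerance in the comparison only guards against floating-point ties.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def probability(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denominator

    observed = probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    total = sum(p for p in (probability(x) for x in range(lowest, highest + 1))
                if p <= observed * (1 + 1e-7))
    return min(1.0, total)

READS_PER_SAMPLE = 5000
p_values = {}
for gene, count_a, count_b in [("geneA", 1, 2), ("geneB", 10, 20), ("geneC", 100, 200)]:
    p_values[gene] = fisher_exact_two_sided(
        count_a, READS_PER_SAMPLE - count_a,
        count_b, READS_PER_SAMPLE - count_b,
    )
    print(gene, p_values[gene])
```

The fold change is 2 in every row; only the number of reads behind it changes, and with it the P-value.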
Tools for DE analysis with RNA-Seq
ShrinkSeq
NOISeq
baySeq
vst
voom
SAMseq
TSPM
DESeq
EBSeq
NBPSeq
edgeR
See: http://www.biomedcentral.com/1471-2105/14/91
+ others (not R-based), including Cuffdiff
Use of transcripts
• Transcripts can be assembled de novo or from
mapped reads and then used in gene
expression/differential expression studies
• Can be functionally annotated
Functional annotation
• Take transcripts from Cufflinks or Trinity
• Annotate the sequences functionally in Blast2GO
Blast2GO
KEGG-mapping