Transcriptomics

advertisement
RNA seq (I)
Edouard Severing
A typical heat stress experiment (climate change)
Economically important frog
Heat stress
Control
(
convection)
85 minutes
5 days
How does the
frog adapt and
survive?
Coping with heat stress

The frog likely has to change several processes
in order to cope with the heat stress.




Adaptation of metabolic pathways.
Prevent water loss through skin
Changing the concentration of several enzymes,
other proteins and molecules.
We want to determine these molecule
concentration changes

Starting with proteins.
Changes at the molecular level

We could measure protein concentration directly

Not often done on a large scale

We could measure changes in the expression of
the genes that encode these proteins.

Gene expression can be approximated by
measuring the amount of mRNA molecules that
are produced by the gene.
Gene count and complexity
20.000 genes
25.000 genes
From genes to proteins (I)
Initial assumption
N
Protein coding
genes
N
N
mRNA
Proteins
Molecules
Assumption is based on studies that were performed on bacterial systems
From genes to proteins (II)
Current view
N
Protein coding
genes
XN
?N
mRNA
Proteins
Molecules
What happens here ?
Splicing
Pre-mRNA
5’-
5’-
-3’
Exon
Exon
Intron
Intron
Gene
Exon
Splicing
mRNA
5’-
Exon
Exon
Exon
-3’
-3’
Alternative splicing
Pre-mRNA
5’-
-3’
5’-
-3’ 5’-
-3’
Splicing
5’-
Splicing
-3’
5’-
-3’
Gene count and complexity
90% of genes have AS
60% of genes have AS
The average number of transcripts produced by human genes is also
higher than the average number of transcripts produced by plant genes
An extreme case
Dscam gene produces over 38,000 different
transcripts
Major alternative splicing event types
In humans exon skipping is most frequent AS event type
In plants intron retention are the most common AS event type
Humans
Plants
Exon skipping
Intron retention
RNA editing
Primary transcript
5’- A
C
U
A
C
G
A
U - 3’
(Predicted sequence)
RNA-Editing
After editing
(Observed sequence)
5’- A
C
U
A
U
G
A
U - 3’
Difficulty: Distinguish genuine RNA-editing from sequencing errors
Not everything is translated

A large fraction (>30%) of transcripts of protein
coding genes are degraded by the nonsensemediated decay (NMD) pathway.

The position of the stop codon is used to predict
whether a transcript is likely to be degraded by
the NMD pathway
Detecting putative NMD candidates
Pre-mRNA 5’-
mRNA
-3’
5’-
-3’
Exon/Exon junctions
M
Open reading frame
Stop
5’-
-3’
d > 50-55nt
Remember

The number of unique mRNA molecules is much
larger than the number of genes.

A large fraction of the mRNA molecules is
degraded by the NMD pathway.

NMD provides a means to regulate gene-expression at
the post-transcriptional level
Process the frogs into reads for analysis
Sequencing
Grind
N2
Prepare for
sequencing
>s1
ATCGTAGGGTA
>s2
ATGGCCTAGGT
Bioinformatics
Basic transcriptome analysis steps

Many research questions require the
following steps:

Reconstruction of the transcriptome
• We usually only have fragments



Quantification of the transcriptome
Differential expression analysis
Other fun stuff.
de novo transcriptome reconstruction (I)
de novo transcriptome reconstruction (II)
Genome-guided transcriptome reconstruction
Genome
5’-
-3’
mRNA
Genome-guided transcriptome reconstruction
Genome-guided transcriptome reconstruction
Remember

de novo transcriptome assembly
 When no reference genome is available
 Finding features which are not on the reference genome (tDNA
insertion)
 Programs: Trinity, Trans-ABySS, Velvet Oases

Genome-guided transcriptome reconstruction
 Reference genome is available with or without annotation
 Mapping programs: TopHat, GSNAP
 Transcriptome reconstruction: Scripture, Cufflinks
RNA seq (II)
Quantification
Edouard Severing
A typical heat stress experiment
Heat stress
Control
(convection)
Raw counts

Counting number of reads/fragments falling with
exonic regions of a gene.

Example: HTseq-count
Exon 1
Exon 2
Exon 3
Exon 4
The same fragment count yet different
expression levels
Exon 1
library
Exon 1
Library size matters
library
The same fragment count yet different
expression levels.
Exon 1
Exon 2
Exon 3
Exon 1
Transcript/gene length matters
Exon 4
Normalizing/correcting for feature length and
library size
Reads mapped to region
RPKM ≈ 1.7
300 nt
Feature length
10,000,000
All mapped reads
RPKM  109 x
Number of reads mapped to a region
Total reads x region length
Normalizing/correcting for feature length
FPKM is analogous to RPKM
RPKM = 1
RPKM = 2
FPKM = 1
Different picture emerges from raw counts and RPKM/FPKM values
Counting method issues

What to do with reads that map to multiple
isoforms (alternative splicing) or genes
Pure Random assignment?
Gene 1
Gene 2
Isoform 1
Isoform 1
No, expression can differ
Count multiple time?
No, it has been derived from a
single transcript
Isoform 2
Count issues: Back to the gene level (I)
Count issues: Back to the gene level (II)
Statistical methods:
Expression levels of
transcripts
Fishing in the dark lake experiment
Question: What fraction (t)
of the fish in the lake is
green?
Method: We catch a number
of fish and determine what
fraction is green.
Caution: Fish have to be
immediately thrown back in
the water.
Fishing in the dark lake results (I)
Sane people would do:
Sample(X)
Fraction of fish that is green
t = 1/3
Fishing in the dark lake results
Maximum likelihood estimate of t
Maximum likelihood estimate of t
Sample(X)
sample X given a certain t:
𝑃(𝑋) =
P(t))
The probability of observing our
3!
∙ 𝑡 ∙ (1 − 𝑡)2
2! ∙ 1!
Find a t that maximizes the
probability of our observation
t
Fishing in a complex dark lake.
Transcript quantification using
RNAseq is like fishing in a dark
lake with fragmented fish.
We are also forced to
determine the possible origin(s)
of the fish fragments
Only lost an eye and a
vin but not its tail
Estimating relative transcript abundances
Target
α1
Transcript 1
α2
Transcript 2
Fragmentation
Sequencing
Observation
>s1
ATCGTAGGGTA
>s2
Read mapping
ATGGCCTAGGT
Which values of the α1 and α2 gives the highest probability of
observing these reads. (α1 + α2 = 1 )
Maximum likelihood alignments

The likelihood of our observation (ʎ) corresponds
to the product of observing each of the individual
mapped reads (rj ) in our set (R)
𝑅
𝛌=
𝑃(𝑟𝑗 )
𝑗=1
R
Probability of observing a read

Probability of observing a read rj is the sum of the
individual probabilities that a read originates from
each transcript (t) in our transcript set (T).
𝑇
𝑃(𝑟𝑗 ) =
𝐾 𝑗𝑡
𝑡=1
Read j
𝛼𝑡 𝑙𝑡
∙ 𝑇
∙ 𝑃𝑗𝑡 (𝑞)
𝑖=1 𝛼𝑖 𝑙𝑖
Probability that rj originated from transcript t
Component 1: Compatibility
𝑇
𝑃(𝑟𝑗 ) =
𝐾 𝑗𝑡
𝑡=1
𝛼𝑡 𝑙𝑡
∙ 𝑇
∙ 𝑃𝑗𝑡 (𝑞)
𝑖=1 𝛼𝑖 𝑙𝑖
Does read j map to transcript t
t=1
t=2
t=3
Kj1 = 1
Kj2 = 1
Kj3 = 0
Component 2: Sequencing a read from a
specific transcript
𝑇
𝑃(𝑟𝑗 ) =
𝐾 𝑗𝑡
𝑡=1
𝛼𝑡 𝑙𝑡
∙ 𝑇
∙ 𝑃𝑗𝑡 (𝑞)
𝑖=1 𝛼𝑖 𝑙𝑖
Probability of “sequencing” a read from transcript t
𝛼𝑡 𝑙𝑡
Product of the relative expression level
and length of transcript t
Component 2: Sequencing a read from a
specific transcript
Why 𝛼𝑡 𝑙𝑡 and not just 𝛼𝑡 ?

Longer transcripts produce more fragments than
shorter transcripts at equal expression levels.
α1
Fragmentation
α2
α1 = α2
Fragments
Component
l1 = 200; α1 = 0.3
l2 = 150; α2 = 0.2
l3 = 50; α3 = 0.5
𝛼1 𝑙1
0.3 ∙ 200
=
≈ 0.52
𝑇
0.3 ∙ 200 + 0.2 ∙ 150 + 0.5 ∙ 50
𝑖=1 𝛼𝑖 𝑙𝑖
Adjust for length
normalize
Component 3:
𝑇
𝑃(𝑟𝑗 ) =
𝐾 𝑗𝑡
𝑡=1
𝛼𝑡 𝑙𝑡
∙ 𝑇
∙ 𝑃𝑗𝑡 (𝑞)
𝑖=1 𝛼𝑖 𝑙𝑖
Probability of originating from position q
on transcript t
In the case of no bias:
1
𝑃𝑗𝑡 (𝑞) =
𝑙𝑡
Components: Fragment comes from a certain
position of the transcript (I)
Occurence
300nt
200nt
More likely
Frequency
Frequency
Frequency
Frequency
Components: Fragment comes from a certain
position of the transcript (II)
Not all regions are equally
covered.
Search for abundances that best explain the
observed fragments
The method used to find
the optimum differs per
program.
Trapnell et al. 2010
 The statistical methods can also
provide an indication of the
uncertainty in the expression
estimates
 One of the sources of that
uncertainty are reads that do not
map uniquely.
Occurrence
Uncertainty in expression estimate
FPKM
Remember


The statistical methods calculate the expression
level of each transcript
The gene expression can then be obtained by
simply summing expression levels of its isoforms
Gene
RPKM = 11
Isoform 1
RPKM = 6
Isoform 2
RPKM = 5
Programs employing statistical models



Cufflinks
 Genome annotation based
• FPKM values
 Numerical method for finding the maximum likelihood
optimum
RSEM
 de novo transcriptome
• Counts and RPKM values
 Expectation maximization for finding the optimum
BitSeq
 de novo transcriptome
• Counts and RPKM values
 Markov chain Monte Carlo for sampling from the posterior
distribution.
RNA seq (III)
Differential expression
Edouard severing
A typical heat stress experiment
Heat stress
(convection)
Control
Single
measurement
Many
measurements
Is this
gene really
important ?
HSP38
Expression level
Expression level
Sources of variation
Biological
Technical
Sequencing
Treatment
Grind
N2
Convection
Freezer
RNA
extraction
Procedure
Bioinformatics
Determining expression variation

Accurately determining the variation requires
many biological samples (replicates).

Unfortunately in most case we only have two or
three replicates.

Other methods are needed to approximate/model
the variation.
Determining within condition fragment count
variation modeling (DESeq/cuffdiff)
Main assumption:
Variance depends on the
mean.
Objective:
Find a function that best
describes the relationship
between the mean and
variance.
Trapnell et al. 2012
Building the within condition fragment count
distribution (DEseq)
At this stage DESeq and Cuffdiff determined for each transcript/gene
1. The within condition fragment count mean
2. The within condition fragment count variance
With these parameters a fragment count
probability distribution can be determined.
DESeq uses a negative binomial (NB)
distribution.
NB: Variance is larger than the mean
Building the within condition fragment count
distribution (Cuffdiff)
In addition to NB distribution of DESeq Cuffdiff also takes the
uncertainty in transcripts fragment assignment into account.
The resulting count distribution is called a beta negative binomial
(BNB) distribution
This count distribution is in the end transformed to a distribution of
FPKM values.
Differential gene expression
 Now that the count or FPKM distributions are known we can start
testing for significant differences in expression levels.
 There are several ways in which one can test whether the
gene/transcript levels in two conditions are significantly different.
 DESeq: uses an exact test which is analogous to the Fisher exact test.
 (The hyper-geometrical is replaced by the NB distribution)
 Cuffdiff: uses a t-test
 (We will look at that in the next slide)
Testing for differential expression (Cuffdiff)
log FC
Expression
Heat
control
log FC
distribution
Many measurements
leads to many fold
changes
Cuffdiff p-value
According to authors of cufflinks
State that the quantity T is approximately
Normally distributed
log FC
distribution
-|T|
𝐸[log 𝐹𝐶 ]
𝑇=
𝑉𝑎𝑟[log 𝐹𝐶 ]
|T|
Unadjusted p-value of cufflinks indicates
what the probability of observing
a T which is more extreme the current T
(Red areas)

Samtools view


SRR019209.14131084
16
Chr1 3695 50
36M *
0
0
AGCGAGAGAGATCGACGGCGAAGCTCTTTACCCGCT%%"&'"(&&#'$&$)+$#,%83%0&1250'III+$'
MD:Z:34G0A0 YT:Z:UU XS:A:+ NH:i:1
Sw accepted_hits.bam

/mnt/geninf15/work/course_2013/day2/mapped_reads
AS:i:-4 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2
Download