RNA sequencing for differential expression genes

SPEAKER : TZU-CHUN LO
ADVISOR : YAO-TING HAUNG
Outline
 Molecular Central Dogma
 RNA Sequencing
 Differential Expression Gene
 Case–Control Study
 Negative Binomial Distribution
 Hypothesis Testing
 Rice
 SNP, QTL, Pathway
Molecular Central Dogma
 The central dogma of molecular biology
describes the flow of genetic information
within a biological system.
[Figure: analogy for the central dogma, with the labels Forest, Branches, BBQ]
RNA Sequencing
[Figure: Gene 1 and Gene 2 with their exons on the DNA; the mRNA (RNA); reads placed by alignment and spliced alignment; the resulting read counts]
DEG process
Finding differentially expressed genes (DEGs) via the read counts of each gene.
Differential Expression Gene
 We want to find the cold-resistant genes in rice.
 Rice genome: Gene 1, Gene 2, Gene 3
 We should compare two conditions.
 Room temperature
 Low temperature
Read counts of each gene in the two conditions:
Gene     Read counts
Gene 1   13 vs. 6
Gene 2   4 vs. 5
Gene 3   7 vs. 2
Cold-resistant differentially expressed genes:
Strategy for DEG
 Case–control study
 Two existing groups that differ in outcome are identified and compared on the basis of some supposed causal attribute.
Gene     case   control   Comparison    Judgement
Gene 1   69     71        69 vs. 71     Almost the same?
Gene 2   86     56        86 vs. 56     Possible DEG
Gene 3   66     111       66 vs. 111    More likely DEG
Gene 4   80     60        80 vs. 60     How to judge?
…        …      …
Each count is only one sample from its condition.
 Question
 Is a single count representative of the gene? → Negative binomial distribution
 How do we decide that a gene is differentially expressed? → Hypothesis test
Negative Binomial Distribution
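 The negative binomial (NB) distribution models count data and can replace the Poisson distribution when the variance of the counts is larger than their mean.
 Indices: i = 1, …, n (genes) and j = 1, …, m (samples).
 Model ingredients: gene abundance parameter, library size parameter, smooth function.
 Library size parameter, worked example for the case sample (each case count divided by the geometric mean of the case and control counts of that gene):
$$\operatorname{median}\left(\frac{69}{\sqrt{69 \times 71}},\ \frac{86}{\sqrt{86 \times 56}},\ \frac{66}{\sqrt{66 \times 111}}\right) = 0.986$$
 The smooth function is more complex, so we skip it here.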
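The library size computation above can be sketched in a few lines. This is a minimal illustration, assuming the library size parameter is the median over genes of each count divided by the gene's geometric mean across samples, which is what the worked numbers suggest. The function name `size_factors` is made up for this example.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios library size factors (illustrative sketch).

    counts: one row per gene, one column per sample.
    Returns one size factor per sample.
    """
    n_samples = len(counts[0])
    # Geometric mean of every gene across all samples.
    geo_means = [math.prod(row) ** (1.0 / n_samples) for row in counts]
    factors = []
    for j in range(n_samples):
        # Ratio of each gene's count in sample j to that gene's geometric mean.
        ratios = [row[j] / g for row, g in zip(counts, geo_means) if g > 0]
        factors.append(median(ratios))
    return factors

# Rows: Gene 1, Gene 2, Gene 3; columns: case, control (counts from the slides).
counts = [[69, 71], [86, 56], [66, 111]]
case_factor, control_factor = size_factors(counts)
print(round(case_factor, 3), round(control_factor, 3))
```

For the case column the median ratio comes from Gene 1 and rounds to 0.986, matching the value on the slide.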
FPKM
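 An indicator used to represent mRNA expression.
 Fragments Per Kilobase of transcript per Million mapped reads.
$$\mathrm{FPKM} = \frac{\text{reads of gene}}{\text{all mapped reads (millions)} \times \text{exon length of gene (kilobases)}}$$
 Example: Gene 1 has 10 reads on exons of 8, 10 and 7 bases; Gene 2 has 4 reads on exons of 8 and 9 bases.
$$\text{Gene 1} = \frac{10}{\frac{10 + 4}{10^{6}} \times \frac{8 + 10 + 7}{10^{3}}} = 0.029 \times 10^{9}$$
$$\text{Gene 2} = \frac{4}{\frac{10 + 4}{10^{6}} \times \frac{8 + 9}{10^{3}}} = 0.017 \times 10^{9}$$
 Variance of FPKM, where K is the read count of the gene, M the number of all mapped reads and L the exon length of the gene in bases:
$$\operatorname{Var}[\mathrm{FPKM}] = \left(\frac{10^{9}}{M \cdot L}\right)^{2} \operatorname{Var}[K]$$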
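The FPKM arithmetic above is easy to check in code. The sketch below assumes M is the total number of mapped reads and L the exon length in bases, so that FPKM = 10^9 · K / (M · L). The function names `fpkm` and `var_fpkm` are chosen for this example only.

```python
def fpkm(reads, total_mapped_reads, exon_length_bases):
    """FPKM = reads / (mapped reads in millions * exon length in kilobases)."""
    return reads * 1e9 / (total_mapped_reads * exon_length_bases)

def var_fpkm(var_reads, total_mapped_reads, exon_length_bases):
    """Variance of FPKM: the read-count variance scaled by (1e9 / (M * L))**2."""
    scale = 1e9 / (total_mapped_reads * exon_length_bases)
    return scale ** 2 * var_reads

# Toy example from the slide: 10 + 4 mapped reads in total.
total = 10 + 4
gene1 = fpkm(10, total, 8 + 10 + 7)   # exons of 8, 10 and 7 bases
gene2 = fpkm(4, total, 8 + 9)         # exons of 8 and 9 bases
print(f"{gene1 / 1e9:.3f}e9 {gene2 / 1e9:.3f}e9")
```

Because FPKM is a constant multiple of the read count K, its variance is the variance of K times the square of that constant, which is all `var_fpkm` does.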
 Before hypothesis testing, we have to get the FPKM and the variance of the FPKM.
K (reads)
Gene     case    control
Gene 1   69      71
Gene 2   86      56
Gene 3   66      111
…        …       …
FPKM
Gene     case    control
Gene 1   9.34    14.75
Gene 2   22.31   15.37
Gene 3   40.48   53.98
…        …       …
Var(K)
Gene     case    control
Gene 1   10      6
Gene 2   170     166
Gene 3   362     310
…        …       …
Var(FPKM)
Gene     case    control
Gene 1   6       3.6
Gene 2   136     132.8
Gene 3   120.6   109.3
…        …       …
Hypothesis Testing
 Step 1: You find some observations or clues that support a novel idea.
 Step 2: Assume the opposing claim, the one you want to argue against.
 Step 3: Test it and take a stand.
p-value
T-test
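 Use a t-test to compare the log ratio (log fold-change) of a gene's expression between condition (a) and condition (b).
 $Y = \frac{\mathrm{FPKM}_a}{\mathrm{FPKM}_b}$; if $\mathrm{FPKM}_a = \mathrm{FPKM}_b$, then $Y = 1$.
 $\log Y = \log\frac{\mathrm{FPKM}_a}{\mathrm{FPKM}_b}$; if $Y = 1$, then $\log Y = 0$.
 $H_0: \mu = 0$, $H_1: \mu \neq 0$. Assume that $H_0$ is true.
 $$T = \frac{E[\log Y] - \mu}{\sqrt{\operatorname{Var}[\log Y]}} = \frac{E[\log Y]}{\sqrt{\operatorname{Var}[\log Y]}} \approx \frac{\log\frac{\mathrm{FPKM}_a}{\mathrm{FPKM}_b}}{\sqrt{\frac{\operatorname{Var}[\mathrm{FPKM}_a]}{\mathrm{FPKM}_a^{2}} + \frac{\operatorname{Var}[\mathrm{FPKM}_b]}{\mathrm{FPKM}_b^{2}}}}$$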
T-test
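$$T \approx \frac{\log\frac{\mathrm{FPKM}_a}{\mathrm{FPKM}_b}}{\sqrt{\frac{\operatorname{Var}[\mathrm{FPKM}_a]}{\mathrm{FPKM}_a^{2}} + \frac{\operatorname{Var}[\mathrm{FPKM}_b]}{\mathrm{FPKM}_b^{2}}}} \Rightarrow p\text{-value}$$
FPKM
Gene     case    control
Gene 1   9.34    14.75
Gene 2   22.31   15.37
Gene 3   40.48   53.98
…        …       …
Var(FPKM)
Gene     case    control
Gene 1   6       3.6
Gene 2   136     132.8
Gene 3   120.6   109.3
…        …       …
T-test
Gene     p-value
Gene 1   0.187
Gene 2   0.039
Gene 3   0.014
…        …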
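The log-ratio test can be sketched as follows. Two assumptions are made here that the slides do not state. First, the variance of the log ratio is taken as the sum of the two relative variances, which is the usual delta-method approximation for independent samples. Second, the statistic is compared against a standard normal distribution. Under these assumptions the resulting p-values need not match the illustrative values in the table. The function name `log_ratio_test` is hypothetical.

```python
import math

def log_ratio_test(fpkm_a, var_a, fpkm_b, var_b):
    """Test statistic and two-sided p-value for log(FPKM_a / FPKM_b).

    Assumes Var[log Y] ~ Var[a]/a^2 + Var[b]/b^2 (delta method) and a
    standard normal reference distribution for T.
    """
    log_ratio = math.log(fpkm_a / fpkm_b)
    var_log = var_a / fpkm_a ** 2 + var_b / fpkm_b ** 2
    t = log_ratio / math.sqrt(var_log)
    # Two-sided tail probability of a standard normal: P(|Z| >= |t|).
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return t, p

# Gene 1 from the tables: FPKM 9.34 vs. 14.75, Var(FPKM) 6 vs. 3.6.
t, p = log_ratio_test(9.34, 6.0, 14.75, 3.6)
print(f"T = {t:.3f}, p = {p:.3f}")
```

When the two FPKM values are equal the log ratio is 0, so T is 0 and the p-value is 1, in line with the null hypothesis μ = 0.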
Result Investigating
 Discussing alpha = 0.05 with the read counts and p-values.
If alpha = 0.05 (V: called a DEG, X: not called):
Gene     case   control   p-value   result
Gene 1   69     71        0.187     X
Gene 2   86     56        0.039     V
Gene 3   66     111       0.016     V
Gene 4   80     60        0.045     V
 What if alpha = 0.04 or 0.03?
 We don't know which alpha is best, but we can do some subsequent processing.
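How the DEG calls change with alpha can be shown directly from the table. The sketch below assumes a gene is called a DEG when its p-value is strictly below alpha. The helper name `significant_genes` is made up for this example.

```python
# p-values from the table above.
p_values = {"Gene 1": 0.187, "Gene 2": 0.039, "Gene 3": 0.016, "Gene 4": 0.045}

def significant_genes(p_values, alpha):
    """Return the genes whose p-value is below the significance level alpha."""
    return [gene for gene, p in p_values.items() if p < alpha]

for alpha in (0.05, 0.04, 0.03):
    print(alpha, significant_genes(p_values, alpha))
```

Gene 4 is dropped at alpha = 0.04 and Gene 2 at alpha = 0.03, which is why the choice of alpha alone does not settle which genes to keep.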
RNA sequencing for Rice
 Plan
 Cold-resistant genes
 Samples
 Japonica (TN67): room temperature (R), low temperature (L)
 Indica (IR64): room temperature (R), low temperature (L)
 Rice
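 Japonica rice (TN67): grains are broad and short, stickier, with a chewy texture, e.g. Penglai rice.
 Indica rice (IR64): grains are slender and long, less sticky, and break easily, e.g. Zailai rice.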
 Zone
 TN67 : High-latitude, or high altitude
 IR64 : Low-latitude, or low altitude
Strategy for DEG
 Case–control study
 Four combinations
 Different varieties or distinct temperatures
 Four sets of differentially expressed genes
 The DEG sets of the combinations above: A, B, C, D
 Negative binomial
 Infer the probability distribution of the counts from the samples
 Hypothesis test
 Decide which genes are the DEGs we want
 Subsequent processing
 SNP, QTL, Pathway
[Figure: the four samples TN67R, IR64R, TN67L and IR64L, connected by the four pairwise comparisons A, B, C and D]
SNP
 A single-nucleotide polymorphism is a
sequence variation occurring when a single
nucleotide differs between members of a biological
species.
[Figure: case and control sequences after assembly, with the SNP positions marked]
ATGCCCTCGTAA   TTACTGCGT
ATGCGCTCGAAA   TTACTCCGT
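Once two sequences are aligned, finding SNPs is a position-by-position comparison. The sketch below assumes the sequences shown on the slide are already aligned and of equal length. It compares the two 12-base sequences with each other and the two 9-base sequences with each other. Positions are 0-based, and the function name `find_snps` is made up for this example.

```python
def find_snps(seq_a, seq_b):
    """Return (position, base_a, base_b) for every mismatch of two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

# Sequences taken from the slide.
print(find_snps("ATGCCCTCGTAA", "ATGCGCTCGAAA"))
print(find_snps("TTACTGCGT", "TTACTCCGT"))
```

Real SNP calling works on many reads per position and has to account for sequencing errors; this only shows the core comparison.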
QTL
 Quantitative traits are phenotypes (characteristics) that vary in degree and can be attributed to polygenic effects (the product of two or more genes).
 Quantitative trait loci (QTLs) are stretches of DNA containing or linked to the genes that underlie a quantitative trait.
Ex: QT (cold) loci: bases 599~799
[Figure: a DNA stretch from base 1 to base 1000 with genes and the QTL marked]
 Cold tolerance (29) & pollen fertility (43)
 QTL length: ~million bases
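One way to use a QTL in the subsequent processing is to keep only the candidate genes that overlap it. The sketch below uses the example interval 599~799 from the slide. The gene names and coordinates are hypothetical and serve only to illustrate the overlap test.

```python
# QTL interval from the slide example (bases 599~799).
QTL_START, QTL_END = 599, 799

# Hypothetical gene coordinates (start, end) on a 1000-base toy sequence.
genes = {
    "gene_a": (100, 250),
    "gene_b": (580, 640),
    "gene_c": (700, 760),
    "gene_d": (850, 990),
}

def overlaps(start, end, qtl_start, qtl_end):
    """True if the closed interval [start, end] intersects [qtl_start, qtl_end]."""
    return start <= qtl_end and end >= qtl_start

in_qtl = [name for name, (s, e) in genes.items() if overlaps(s, e, QTL_START, QTL_END)]
print(in_qtl)
```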
Pathway
 A pathway collection consists of manually drawn pathway maps that represent molecular interaction and reaction networks.
[Figure: rice pathway with Gene No.2, Gene No.55 and Gene No.99; label: cold-resistant]
Conclusion
 Review
 RNA Sequencing
 Differential Expression Gene
 Case–Control Study
 Negative Binomial Distribution
 Hypothesis Testing
 Rice
 SNP
 QTL
 Pathway
Variance of negative binomial
 The negative binomial (NB) distribution models count data and can replace the Poisson distribution when the variance of the counts is larger than their mean.
Strategy for DEG
 Case–control at the same temperature: A, C
 Case–control in the same variety: B, D
 Let $T$ be the set of all genes.
 $A \cap C = X$
 $A \cap (T - C) = Y$, $(T - A) \cap C = Z$
 $B \cap D = O$
 $B \cap (T - D) = P$, $(T - B) \cap D = Q$
 $\text{result} = \{X, Y, Z, O, P, Q\}$
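The six result sets map directly onto set operations. In the sketch below the gene identifiers and the contents of A, B, C and D are hypothetical placeholders; only the operations follow the definitions above.

```python
# Hypothetical DEG sets for the four comparisons (placeholder gene names).
A = {"g1", "g2", "g3"}
C = {"g2", "g3", "g4"}
B = {"g3", "g5"}
D = {"g5", "g6"}
T = {"g1", "g2", "g3", "g4", "g5", "g6", "g7"}  # all genes

X = A & C          # DEG in both same-temperature comparisons
Y = A & (T - C)    # DEG in A only
Z = (T - A) & C    # DEG in C only
O = B & D          # DEG in both same-variety comparisons
P = B & (T - D)    # DEG in B only
Q = (T - B) & D    # DEG in D only

result = {"X": X, "Y": Y, "Z": Z, "O": O, "P": P, "Q": Q}
for name, genes in result.items():
    print(name, sorted(genes))
```

Since A and C are subsets of T, `A & (T - C)` is the same as the set difference `A - C`; the longer form is kept to mirror the notation on the slide.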
QTL
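 Another class of biological traits includes human height, body weight, high blood pressure and diabetes; plant height, yield and the degree of disease resistance in rice; body-fat percentage in mice; milk yield in dairy cows; and egg production in chickens. Because their variation is continuous, hard to classify and easily affected by the environment, they are called quantitative traits. A quantitative trait is controlled by multiple genes; since every one of these genes affects the trait, the effect of each single gene is relatively small. The genes that control quantitative traits are called polygenes (minor-effect genes), also known as quantitative trait loci (QTL).
 Rice genome size: 430 Mb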
Negative binomial distribution
 The NB distribution models count data and lets us infer a representative count for each gene from the samples.
[Figure: model formulas indexed by i and j, including the smooth function]
RNA sequencing
[Figure: Gene 1 and Gene 2 with their exons on the DNA, the mRNA (RNA), and reads placed by alignment and spliced alignment]
We should align the reads to the regions shown in blue above.
RNA sequencing
 Spliced alignment
 TopHat
Read counts per sample:
         Condition 1: case          Condition 2: control
Sample   1      2      3      …     1      2      3      …
Gene 1   75     69     70     …     73     71     68     …
Gene 2   101    86     75     …     31     56     49     …
Gene 3   28     66     45     …     120    111    145    …
…        …      …      …      …     …      …      …      …
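The per-gene variance of the read counts can be estimated from the replicates. The sketch below uses only the three samples shown per condition (the "…" columns are left out) and the unbiased sample variance from Python's `statistics` module. With these three replicates the rounded variances come out as 10, 6, 170, 166, 362 and 310, the same numbers as in the Var(K) table earlier in the deck.

```python
from statistics import mean, variance

# Read counts of the three replicates shown per condition.
counts = {
    "Gene 1": {"case": [75, 69, 70], "control": [73, 71, 68]},
    "Gene 2": {"case": [101, 86, 75], "control": [31, 56, 49]},
    "Gene 3": {"case": [28, 66, 45], "control": [120, 111, 145]},
}

# Sample variance (n - 1 in the denominator) for every gene and condition.
var_k = {
    gene: {cond: variance(k) for cond, k in conds.items()}
    for gene, conds in counts.items()
}

for gene, conds in counts.items():
    for cond, k in conds.items():
        print(gene, cond, "mean", round(mean(k), 1), "variance", round(var_k[gene][cond]))
```

For Gene 2 and Gene 3 the variance is clearly larger than the mean, which a Poisson model (variance equal to the mean) cannot capture and which motivates the negative binomial.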
Reads
Gene     case   control
Gene 1   69     71
Gene 2   86     56
Gene 3   66     111
…        …      …
Variance
Gene     case   control
Gene 1   69     71
Gene 2   86     56
Gene 3   66     111
…        …      …