A hierarchical regression mixture model for inferring gene regulatory networks Mayetri Gupta

A hierarchical regression mixture model
for inferring gene regulatory networks
Mayetri Gupta
gupta@bios.unc.edu
University of North Carolina at Chapel Hill
gupta@bios.unc.edu -- p.1/30
Upstream regulation ↔ Downstream expression
Fundamental question: how can we understand the biological
mechanisms leading to disease?
...gtggtTAGAATagcgactgttttt... gene 1
...taggTATAATacagtctgacaaaa... gene 2
...cagcaacattgaTATAATtgccat... gene 3
...ctaaaacaatTATTATttatcagg... gene 4
Shared pattern: TATAAT
[Figure: sequence logo of the shared motif; x-axis: motif position 1 to 6, y-axis: bits]
Co-regulated genes share
similar upstream patterns
Identify genes that are differentially expressed under different
treatments or conditions
Gene Regulation: DNA Motifs
Proteins bind to DNA to activate gene transcription
[Figure: sequence logo of the TATAAT motif; x-axis: motif position 1 to 6, y-axis: bits]
Position-specific weight matrix (PSWM), or motif
Gene regulation in complex genomes
Harder problem: many transcription factors working in
co-ordination
LARGE sequence search space: using sequence data only
→ many false positives?
Upstream regulation ↔ Downstream expression
Gene expression contains information about sequence motifs
Sequence may contain information on gene co-regulation
Expression clustering → Motif discovery
or
Expression clustering ↔ Motif discovery?
What if initial clustering is inaccurate?
Cell-cycle data set (Spellman, Mol Biol Cell. 1998)
[Figure: gene expression versus time (minutes); genes grouped by cell-cycle phase: G1, S, S/G2, G2/M, M/G1]
Measurements over 18 time points, 3 different experiments
∼ 800 genes known to be cell-cycle dependent
Do clusters of genes share common TFs?
Do certain TFs work combinatorially on groups of genes?
color: time when gene is active
Using Gene Expression in Motif Discovery
REDUCE (Bussemaker, Nat. Genet. 2001) correlates
expression of a gene with its number of motif occurrences
MDScan (Liu, Nat Biotech. 2002) Most strongly differentially
expressed genes → candidate motifs.
Motif Regressor (Conlon, PNAS 2003)
Multiple regression model: Sum of motif effects explains
gene expression
$$Y_g = \alpha + \sum_{m=1}^{M} \beta_m S_{mg} + \varepsilon_g$$
$Y_g$: expression of gene g; $S_{mg}$: motif-match score
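As an illustration of the multiple regression model above, the following is a minimal sketch that fits α and the β_m by ordinary least squares on toy data. The function name `fit_motif_regression`, the simulated scores, and the use of plain least squares are assumptions made for this example; it is not necessarily the fitting procedure used by Motif Regressor.

```python
import numpy as np

def fit_motif_regression(scores, expression):
    """Least-squares fit of Y_g = alpha + sum_m beta_m * S_mg + eps_g.

    scores: (G, M) array of motif-match scores S_mg (one row per gene).
    expression: (G,) array of expression values Y_g.
    Returns (alpha, beta) with beta of shape (M,).
    """
    scores = np.asarray(scores, dtype=float)
    expression = np.asarray(expression, dtype=float)
    # Prepend a column of ones so the intercept alpha is estimated too.
    design = np.column_stack([np.ones(len(expression)), scores])
    coef, *_ = np.linalg.lstsq(design, expression, rcond=None)
    return coef[0], coef[1:]

# Toy data (illustrative only): 50 genes, 3 candidate motifs.
rng = np.random.default_rng(0)
S = rng.normal(size=(50, 3))
true_beta = np.array([2.0, 0.0, -1.0])
Y = 0.5 + S @ true_beta + 0.1 * rng.normal(size=50)

alpha_hat, beta_hat = fit_motif_regression(S, Y)
print(alpha_hat, beta_hat)
```

Motifs whose estimated β_m is close to zero contribute little to explaining expression under this model.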
Using Gene Expression in Motif Discovery
Non-parametric approaches
Phuong et al. (Bioinformatics, 2004): Classification Trees
(CART)
Arbitrary decision criterion based on the number of
occurrences of a motif type
Multivariate adaptive regression splines (Das et al, PNAS
2004)
Joint sequence-expression model without parametric connection
Holmes and Bruno (ISMB proc.,2000) joint likelihood for
sequence and expression data
Using motif information in gene clustering
Infer sets of transcription factors involved in regulating groups of
genes
Higher transcriptional activity → greater presence of TF
binding sites, more pronounced expression changes
Genes within a “cluster” may be correlated, with or without
sharing common transcription factors
Measurements on the same gene in different conditions may
be correlated due to sharing the same upstream transcription
factor binding sites
Linear mixed effects model
y = Xβ + Zb + ε
β: fixed effects
b: random effects
$\varepsilon \sim N(0, \tau_0^{-1} I)$
y: gene expression
Fixed effects: Sequence Motif
Levels of factor are reproduced exactly if experiment is
repeated
Random effects: Expression cluster
Levels of factor (expression + clusters) may not be
reproduced exactly if experiment is repeated
Joint Model for Sequence-Expression
Complication: gene cluster identity cannot be assumed known
Z matrix not “fixed”
Conditional on cluster k, (k = 1, . . . , K), vector of log-expression
values of gene g generated from mixed-effects model:
$$Y_g \mid z_g = k, X, \text{parameters} \sim N(\xi_g + X_g^T \beta_k \mathbf{1},\ \sigma_k^2 I) \quad (\equiv f_k)$$
Unconditionally, a mixture model
$$P(Y \mid X, \text{parameters}) = \prod_{\text{genes } g} \Big[ \sum_{\text{clusters } k} \pi_k f_k(Y_g \mid \text{parameters}) \Big]$$
Which β’s significant, in which cluster?
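To make the mixture likelihood concrete, here is a minimal sketch that evaluates log P(Y | X, parameters) for given parameter values, using a log-sum-exp over clusters for numerical stability. The function name and array layout are assumptions for illustration, and the gene-level means ξ_g are passed in as known values rather than integrated out.

```python
import numpy as np

def mixture_log_likelihood(Y, X, xi, beta, sigma2, pi):
    """log P(Y | X, parameters) = sum_g log sum_k pi_k f_k(Y_g).

    Y: (G, T) log-expression; X: (G, D) motif scores;
    xi: (G, T) gene-level means; beta: (K, D); sigma2: (K,); pi: (K,).
    f_k is the N(xi_g + X_g^T beta_k 1, sigma2_k I) density.
    """
    Y, X, xi = (np.asarray(a, dtype=float) for a in (Y, X, xi))
    beta, sigma2, pi = (np.asarray(a, dtype=float) for a in (beta, sigma2, pi))
    T = Y.shape[1]
    shift = X @ beta.T                                   # (G, K): X_g^T beta_k
    resid = Y[:, None, :] - xi[:, None, :] - shift[:, :, None]   # (G, K, T)
    sq = (resid ** 2).sum(axis=2)                        # (G, K)
    log_f = -0.5 * T * np.log(2 * np.pi * sigma2) - 0.5 * sq / sigma2
    log_terms = np.log(pi) + log_f                       # (G, K)
    # Log-sum-exp over clusters, then sum over genes.
    m = log_terms.max(axis=1, keepdims=True)
    per_gene = m[:, 0] + np.log(np.exp(log_terms - m).sum(axis=1))
    return float(per_gene.sum())

# Toy check: 2 genes, 2 time points, 1 motif, 2 identical clusters.
Y = np.array([[0.0, 0.0], [1.0, 1.0]])
X = np.array([[0.0], [1.0]])
xi = np.zeros((2, 2))
print(mixture_log_likelihood(Y, X, xi, [[1.0], [1.0]], [1.0, 1.0], [0.5, 0.5]))
```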
Bayesian hierarchical formulation
Prior distributions for cluster k parameters:
“Expression” model:
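$\mu_k \sim N(\cdot,\ v_{k0}^2 I)$
$\sigma_k^2 \sim \mathrm{InvGamma}(\cdot, \cdot)$
$\xi_g \mid z_g = k \sim N(\mu_k,\ \tau_0 \sigma_k^2 I)$
“Sequence” model:
$\beta_k \sim N(\beta_0, V_k)$
Probabilities of cluster membership: $P(z_g = k) = \pi_k$
$(\pi_1, \ldots, \pi_K) \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_K)$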
G genes; T measurements per gene
Bayesian hierarchical formulation
For each $\beta_k$, we use multivariate extension of g-prior (Zellner, 1986), so that
$$V_k = \frac{c\,\sigma_k^2}{T}\, S_k^{-1}, \quad \text{where } S_k = \sum_{g:\, z_g = k} X_g X_g^T$$
Why use g-prior?
Computational efficiency, varying c → more/less informative
Induces dependence among genes in a cluster due to sequence effects
$$\mathrm{Cov}(Y_g, Y_{g'} \mid Z_g, Z_{g'} = k) = v_{k0}^2 I + \frac{c\,\sigma_k^2}{T}\, \mathbf{1} X_g^T S_k^{-1} X_{g'} \mathbf{1}^T$$
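The g-prior covariance and the induced between-gene covariance can be computed directly. The sketch below assumes the motif scores of the genes currently assigned to cluster k are stacked as rows of a matrix `X_k`; the function names and the toy numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

def g_prior_covariance(X_k, sigma2_k, c, T):
    """V_k = (c * sigma2_k / T) * S_k^{-1}, with S_k = sum_g X_g X_g^T.

    X_k: (n_k, D) motif scores of the genes in cluster k (one row per gene).
    """
    X_k = np.asarray(X_k, dtype=float)
    S_k = X_k.T @ X_k                     # sum over genes of X_g X_g^T
    return (c * sigma2_k / T) * np.linalg.inv(S_k)

def cross_gene_covariance(x_g, x_h, V_k, v2_k0, T):
    """Cov(Y_g, Y_h | both in cluster k) = v2_k0 * I + (x_g^T V_k x_h) * 1 1^T."""
    ones = np.ones((T, 1))
    return v2_k0 * np.eye(T) + float(x_g @ V_k @ x_h) * (ones @ ones.T)

# Toy cluster: 3 genes, 2 motifs, T = 3 measurements per gene.
X_k = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V_k = g_prior_covariance(X_k, sigma2_k=1.0, c=9.0, T=3)
C = cross_gene_covariance(X_k[0], X_k[1], V_k, v2_k0=0.5, T=3)
print(V_k)
print(C)
```

Larger c inflates V_k and makes the prior on β_k less informative, which matches the "varying c" remark on the slide.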
Regulatory Motif Model
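Every non-site position multinomial with $\theta_0 = (\theta_{01}, \ldots, \theta_{04})$
Every motif position i multinomial with $\theta_i = (\theta_{i1}, \ldots, \theta_{i4})$
[Diagram: sequence positions labeled $\cdots\ \theta_0\ \theta_0\ \theta_1 \cdots \theta_6\ \theta_0\ \theta_0 \cdots$]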
Product Multinomial model
Challenge: Find position of sites and θ’s
MEME (Bailey, ISMB 1994), Gibbs Motif Sampler (Liu, JASA 1995), AlignACE (Roth, Nat. Biotech 1998),
BioProspector (Liu, Pac. Symp. Biocomp. 2001), Stochastic Dictionary (Gupta, JASA 2003)
Sequence Motif Scoring
Starting set of motifs
De-novo: Different clusters of genes exhibiting “strong” up- or
down-regulation (MDScan)
Databases: Derived from experimental data
Motif score for w-width motif j and upstream sequence g:
$$X_{gj} = \sum_{\text{positions } i} \frac{P(\mathrm{seq}(i, i+w-1) \mid \text{motif } j)}{P(\mathrm{seq}(i, i+w-1) \mid \text{null})}$$
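The motif score above is a sum of likelihood ratios over all windows of width w. A minimal sketch follows, assuming the motif is a product multinomial given as a list of per-position base-probability dictionaries and the null model is a single background distribution; the function name, the toy PSWM, and the uniform background are assumptions for illustration.

```python
def motif_score(sequence, pswm, background):
    """Sum over window starts i of P(window | motif) / P(window | null).

    sequence: string over the alphabet A, C, G, T.
    pswm: list of w dicts, pswm[p][base] = probability of base at motif position p.
    background: dict, background[base] = probability of base under the null model.
    """
    w = len(pswm)
    total = 0.0
    for i in range(len(sequence) - w + 1):
        ratio = 1.0
        for offset, column in enumerate(pswm):
            base = sequence[i + offset]
            # Product multinomial: positions contribute independently.
            ratio *= column[base] / background[base]
        total += ratio
    return total

# Toy width-2 motif favouring "TA", with a uniform background.
pswm = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
score = motif_score("TAC", pswm, background)
print(score)
```

Windows that resemble the motif contribute ratios well above 1, while poorly matching windows contribute ratios near 0, so the score grows with the number and strength of matches in the upstream sequence.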
Sequence motif selection
Initial set of D motif candidates (D can be large!)
In regression model, want to know which motifs correlated
“significantly” with response (gene expression)
$u = (u_1, \ldots, u_D)$ where
$$u_j = \begin{cases} 1 & \text{if motif } j \text{ is in model} \\ 0 & \text{otherwise.} \end{cases}$$
Prior probability of motif inclusion
$$P(u) = \prod_{j=1}^{D} \eta^{u_j} (1 - \eta)^{1 - u_j}$$
Variable selection from LARGE potential set
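The inclusion prior is a product of independent Bernoulli(η) terms, so its logarithm depends only on how many motifs are in the model. A small sketch, with a hypothetical function name and an illustrative value of η:

```python
import math

def log_inclusion_prior(u, eta):
    """log P(u) = sum_j [u_j * log(eta) + (1 - u_j) * log(1 - eta)].

    u: sequence of 0/1 indicators, one per candidate motif.
    eta: prior probability that any single motif is in the model (0 < eta < 1).
    """
    n_in = sum(u)
    return n_in * math.log(eta) + (len(u) - n_in) * math.log(1.0 - eta)

# With a small eta, models that include many motifs are penalised a priori.
sparse = log_inclusion_prior([1, 0, 0, 1], eta=0.1)
dense = log_inclusion_prior([1, 1, 1, 1], eta=0.1)
print(sparse, dense)
```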
Outline of method
select motifs from large initial set
update clusters given motifs active in each class
update model parameters given cluster membership and functional motifs in each class
Complications
Cluster membership z unknown
Number of clusters K may be unknown (assume fixed for now)
Number of motif candidates is large
Parameter updating using MCMC
Update from joint posterior distribution
$$P(\theta, \beta, u, z \mid Y, X, K), \qquad \theta = (\mu, \sigma^2, \pi)$$
For updating steps for parameters θ, β and z, marginalize over
other parameters for efficiency (Conjugate forms permit this)
For updating u, use evolutionary Monte Carlo (Liang and
Wong, JASA 2001)
Select motifs that have most effect on expression, and differentiate
most among clusters
Three simulated data sets
200 “genes” in K = 2 clusters
2 measurements each
Regression coeffs. for motifs corresponding to 3 PSWMs from JASPAR database: SAP1, SRF, and MEF2
[Table: coefficients β_M1, β_M2, β_M3 in clusters C1 and C2 for Data 1, Data 2, and Data 3; entries are 0, 2, or −2]
(Motif 3 not present in data)
[Figure: scatter plots of Y1 with motif 1, 2, 3 scores (columns) in 3 data sets (rows); points marked + and o]
Simulation study results: Bayes factors
[Figure: P(M_K) versus number of clusters K (1 to 6); panels: Data 1, Data 2, Data 3]
Optimal choice: K = 2
Marginal model probability for MK through Double mixture
importance sampling
$$\hat{P}(Y \mid M_K) = \frac{1}{N_t N_s} \sum_{t} \sum_{s} P(Y \mid Z^{(t)}, \theta^{(s)}, K)\, \frac{\pi_1(\theta^{(s)} \mid z^{(t)})\, \pi_2(z^{(t)})}{f_1(\theta^{(s)} \mid z^{(t)})\, f_2(z^{(t)})}$$
Simulation: β estimates
[Figure: estimated coefficients by motif type (1, 2) for clusters c1 and c2; panels: Data 1, Data 2, Data 3]
Data sets 1 and 2: Motifs 1,2 selected
Data set 3: Motif 1 selected
Spellman (1998) data
[Figure: gene expression versus time (minutes); genes grouped by cell-cycle phase: G1, S, S/G2, G2/M, M/G1]
Pick different groups of genes that are highly differentially expressed
Sets of motif candidates of widths 7-12 bp (using MDscan)
Motif overlaps lead to collinearity: remove motifs with correlation > 0.5 with a
higher-ranked one → 32 candidate motifs
Two consecutive time points: 1-2, 3-4, . . ., 17-18
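The collinearity filter described above can be sketched as a greedy pass over motifs in rank order. This example assumes the columns of the score matrix are already ordered from highest to lowest rank, that each motif is compared only against higher-ranked motifs that were themselves kept, and that the signed correlation is used as stated on the slide; these are interpretation choices, and the function name is hypothetical.

```python
import numpy as np

def filter_correlated_motifs(scores, threshold=0.5):
    """Return indices of motifs kept after removing collinear candidates.

    scores: (G, D) array of motif scores, columns ordered best-ranked first.
    A motif is dropped if its correlation with an already-kept,
    higher-ranked motif exceeds `threshold`.
    """
    scores = np.asarray(scores, dtype=float)
    corr = np.corrcoef(scores, rowvar=False)   # (D, D) correlation matrix
    kept = []
    for j in range(scores.shape[1]):
        if all(corr[j, i] <= threshold for i in kept):
            kept.append(j)
    return kept

# Toy scores for 5 genes and 3 motifs: motif 1 nearly duplicates motif 0.
toy_scores = np.array([
    [1.0, 2.0, 1.0],
    [2.0, 4.0, -1.0],
    [3.0, 6.0, 1.0],
    [4.0, 8.0, -1.0],
    [5.0, 11.0, 1.0],
])
kept = filter_correlated_motifs(toy_scores)
print(kept)
```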
Number of clusters K
log($\widehat{BF}$) compared to 1-component model
[Figure: log(BF) versus time interval (1 to 9); points labeled with the number of clusters (2 or 3)]
Significant motifs over time intervals
[Figure: sequence logos (bits versus position) of significant motifs; labels include SFF, MCM1, and MCB]
Motif influence at different time intervals
[Figure: posterior probability of selection versus motif index, one panel per time interval, with sequence logos of the selected motifs (labels include MCB, MCM1, SFF, and SCB) and the optimal K∗ for each interval]
Selected motif types for 9 time intervals for optimal K
Significant motifs match experimental PSWMs
Index  TF name  Consensus            Expt.           Phase
4      MCM1     CGAAGAG/CTCTTCG      CCNNNWWRGG      M
6      GCN1     TCAGTCA/TGACTGA      TCAGTCA
7      [CSRE]   GGACAGA/TCTGTCC      [YCGGAYRRAWGG]
10     MCB      ACGCGTA/TACGCGT      WCGCGW          G1
16     SFF      AACAACA/TGTTGTT      GTMAACAA        M
18     MCM1     CCAATTAGG/CCTAATTGG  CCNNNWWRGG      M
20     [RME1]   TTCAGGTAC/GTACCTGAA  [GAACCTCAA]
22     SCB      CGCGAAAAA/TTTTTCGCG  CNCGAAA         G1
25     PHO4     CGTACGTAC/GTACGTACG  CACGTK
29     −        CTTCGCATC/GATGCGAAG
K ≡ {G or T}; M ≡ {A or C}; N ≡ {A or C or G or T}; R ≡ {A or G}; W ≡ {A or T}; Y ≡ {C or T}
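The degenerate-base codes in the legend can be expanded into a regular expression to check whether a discovered consensus contains an experimental pattern. This is a strict, exact-match sketch using only the codes listed above; the function names are hypothetical, and the comparison behind the table on the slide may well have been looser (for example, allowing mismatches or partial overlaps).

```python
import re

# Degenerate-base codes taken from the legend on the slide.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
    "R": "[AG]", "W": "[AT]", "Y": "[CT]",
}

def iupac_to_regex(pattern):
    """Compile a degenerate consensus such as 'WCGCGW' into a regex."""
    return re.compile("".join(IUPAC[ch] for ch in pattern))

def contains_pattern(pattern, sequence):
    """True if `sequence` contains an exact instance of the degenerate pattern."""
    return iupac_to_regex(pattern).search(sequence) is not None

# Rows 10 (MCB) and 22 (SCB) of the table.
print(contains_pattern("WCGCGW", "ACGCGTA"))
print(contains_pattern("CNCGAAA", "CGCGAAAAA"))
```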
Summary
Treating gene expression clustering as a variable may help in
discovering relationships between functional sequence motifs, and
groups of genes they regulate
Different groups of genes may behave as a cluster at different
time points
Upstream sequence motifs “constant” but effects/interactions
over time may vary
Further extensions
Motif scoring issues: sensitivity, co-occurrence of sites
Efficient model selection
Extension to high density ChIP tiling arrays
Acknowledgement:
Joseph G. Ibrahim (UNC), Jason Lieb (UNC)
UNC high-performance scientific computing group
Model Selection: number of clusters K
Likelihood-based methods not valid (BIC, etc.)
Bayes factor: ratio of model marginal probabilities
Marginal probability for model MK
$$P(Y \mid M_K) = \sum_{z} \int_{\theta} P(Y \mid Z, \theta, K)\, p(\theta \mid z)\, p(Z \mid K)\, d\theta$$
Double mixture importance sampling
$$\hat{P}(Y \mid M_K) = \frac{1}{N_t N_s} \sum_{t} \sum_{s} P(Y \mid Z^{(t)}, \theta^{(s)}, K)\, \frac{\pi_1(\theta^{(s)} \mid z^{(t)})\, \pi_2(z^{(t)})}{f_1(\theta^{(s)} \mid z^{(t)})\, f_2(z^{(t)})}$$
Model Selection: Number of clusters
Double mixture importance sampling
$$\hat{P}(Y \mid M_K) = \frac{1}{N_t N_s} \sum_{t} \sum_{s} P(Y \mid Z^{(t)}, \theta^{(s)}, K)\, \frac{\pi_1(\theta^{(s)} \mid z^{(t)})\, \pi_2(z^{(t)})}{f_1(\theta^{(s)} \mid z^{(t)})\, f_2(z^{(t)})}$$
Challenge: Good sampling densities f (·) for z and θ
For a simpler case (θ marginalized), Raftery et al (TR, 2003)
propose permutation-based methods to find good candidate
sampling densities for z
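The double-sum estimator can be illustrated on a toy model that is small enough to have a closed-form marginal. Everything in the sketch below is an assumption made for the example, not the model from the talk: a single observation Y, a binary label z with prior π_2(z) = 0.5, a scalar θ with prior π_1 = N(0, 1), and Y | z, θ ∼ N(θ + z, 1). The sampling density f_2 is taken equal to the prior on z, and f_1(θ | z) is a normal centred at the posterior mean of θ given z with a deliberately wider variance.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def estimate_marginal(y, n_t=200, n_s=50, seed=0):
    """Importance-sampling estimate of P(y) with n_t draws of z and n_s of theta each."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_t):
        z = 1 if rng.random() < 0.5 else 0      # z^(t) ~ f2, here equal to the prior
        pi2 = f2 = 0.5
        centre = (y - z) / 2.0                  # posterior mean of theta given z
        for _ in range(n_s):
            theta = rng.gauss(centre, 1.0)      # theta^(s) ~ f1(. | z^(t)), variance 1
            lik = normal_pdf(y, theta + z, 1.0)
            pi1 = normal_pdf(theta, 0.0, 1.0)
            f1 = normal_pdf(theta, centre, 1.0)
            total += lik * pi1 * pi2 / (f1 * f2)
    return total / (n_t * n_s)

def exact_marginal(y):
    """Closed form for the toy model: Y | z ~ N(z, 2) after integrating out theta."""
    return 0.5 * normal_pdf(y, 0.0, 2.0) + 0.5 * normal_pdf(y, 1.0, 2.0)

estimate = estimate_marginal(0.3)
print(estimate, exact_marginal(0.3))
```

Because f_1 is centred where the posterior of θ has its mass, the importance weights stay bounded and the estimate is stable; a sampling density far from the posterior would give the same expectation but much higher variance, which is the difficulty the slide points to.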