Non-linear/non-homogeneous dynamic Bayesian networks Dirk Husmeier

Dynamic Bayesian network
Model
Parameters θ
Integral analytically tractable!
BDe: UAI 1994
BGe: UAI 1995
BDe: Nonlinear discretized model
[Figure: promoter P with activator P1 and repressor P2; panels: Activation, Inhibition]
Allow for noise: probabilities
Conditional multinomial distribution
BGe: Linear model
[A] = w1[P1] + w2[P2] + w3[P3] + w4[P4] + noise
[Figure: nodes P1, P2, P3 and P4 feed into node A with weights w1, w2, w3 and w4]
Pros and cons of the two models
Linear Gaussian model:
• Restriction to linear processes
• Original data → no information loss
Multinomial model:
• Nonlinear model
• Discretization → information loss
Can we get an approximate nonlinear model without data discretization?
Idea: piecewise linear model
[Figure: y plotted against x]
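To make the piecewise linear idea concrete, the following is a minimal sketch rather than the method from the talk. It fits a separate least-squares line to each segment of a toy curve. The function names `fit_line` and `fit_piecewise`, the toy data, and the hand-picked breakpoint are all assumptions for illustration; in the talk the segmentation is inferred, not fixed in advance.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def fit_piecewise(xs, ys, breakpoints):
    """Fit one line per segment; segments are split at the given x values."""
    edges = [float("-inf")] + sorted(breakpoints) + [float("inf")]
    fits = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Collect the points that fall into the half-open interval [lo, hi).
        seg = [(x, y) for x, y in zip(xs, ys) if lo <= x < hi]
        seg_x, seg_y = zip(*seg)
        fits.append(fit_line(seg_x, seg_y))
    return fits


# Hypothetical toy data: y = |x - 2| is nonlinear in x,
# but exactly linear on each side of x = 2.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [abs(x - 2.0) for x in xs]
fits = fit_piecewise(xs, ys, breakpoints=[2.0])
for a, b in fits:
    print(f"intercept = {a:.2f}, slope = {b:.2f}")
```

A single line fitted to all nine points would miss the kink at x = 2, while the two segment-wise fits reproduce the curve without discretizing the data.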
Example: 4 genes, 10 time points
      t1    t2    t3    t4    t5    t6    t7    t8    t9    t10
X(1)  X1,1  X1,2  X1,3  X1,4  X1,5  X1,6  X1,7  X1,8  X1,9  X1,10
X(2)  X2,1  X2,2  X2,3  X2,4  X2,5  X2,6  X2,7  X2,8  X2,9  X2,10
X(3)  X3,1  X3,2  X3,3  X3,4  X3,5  X3,6  X3,7  X3,8  X3,9  X3,10
X(4)  X4,1  X4,2  X4,3  X4,4  X4,5  X4,6  X4,7  X4,8  X4,9  X4,10
Learning with MCMC
[Figure: graphical model with parameters θ, allocation vector h, and number of components k (here: 3)]
• Parameters fixed: complexity of marginalization k*m
• Parameters not fixed: complexity of marginalization k^m
• Allocations fixed: parameters can be integrated out
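The gap between the two marginalization costs, k*m versus k^m, can be checked on a toy example. The sketch below assumes a simple mixture in which the data points are independent given fixed parameters and the allocations have independent prior weights; all numbers are hypothetical. Under those assumptions the sum over all k^m allocation vectors factorises into m sums of k terms each. Once the parameters are integrated out instead of fixed, data points in the same component become coupled and this factorisation is no longer available.

```python
import itertools
import math

# Hypothetical toy setting: m = 4 data points, k = 3 components, FIXED parameters.
# lik[t][c] = p(x_t | component c); weights[c] = prior probability of component c.
lik = [
    [0.2, 0.5, 0.1],
    [0.7, 0.1, 0.3],
    [0.4, 0.4, 0.6],
    [0.1, 0.9, 0.2],
]
weights = [0.5, 0.3, 0.2]
m, k = len(lik), len(weights)

# Brute force: sum over all k**m allocation vectors h.
brute = 0.0
for h in itertools.product(range(k), repeat=m):
    brute += math.prod(weights[c] * lik[t][c] for t, c in enumerate(h))

# Factorised form: a product over data points of a k-term sum (k*m terms in total).
factorised = math.prod(
    sum(weights[c] * lik[t][c] for c in range(k)) for t in range(m)
)

n_brute_terms = k ** m       # number of allocation vectors
n_factorised_terms = k * m   # number of terms in the factorised form
print(brute, factorised, n_brute_terms, n_factorised_terms)
```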
Viral challenge and immune activation of macrophages
Collaboration with DPM (Division of Pathway Medicine, Edinburgh University)
[Figure: macrophage; treatment with interferon gamma (IFNγ); infection with cytomegalovirus (CMV)]
12 hour time course measuring total RNA
30 min sampling
[Figure: time axis from 0 to 12 hours]
72 Agilent Arrays
25 samples per group:
• Infection with CMV
• Pre-treatment with IFNγ
• IFNγ + CMV
Analysis: clustering; time series statistical analysis (using EDGE)
Posterior probability of the number of components (top) and co-allocation of two time points to the same component (bottom)
[Figure: panels Infection, Treatment, Infection+treatment; grey scale: White=1, Black=0]
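Both quantities in this figure can be estimated from a set of sampled allocation vectors. The sketch below is a minimal illustration under assumptions: the `samples` list is made up, and the helper names `posterior_num_components` and `coallocation_matrix` are hypothetical, not taken from the talk.

```python
from collections import Counter

# Hypothetical MCMC output: each sample is an allocation vector that assigns
# every time point to a component label.
samples = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 2],
    [0, 0, 0, 0, 1, 1],
]


def posterior_num_components(samples):
    """Fraction of samples that use each number of distinct components."""
    counts = Counter(len(set(h)) for h in samples)
    return {k: c / len(samples) for k, c in sorted(counts.items())}


def coallocation_matrix(samples):
    """Entry [i][j]: fraction of samples in which time points i and j share a component."""
    m = len(samples[0])
    return [
        [sum(h[i] == h[j] for h in samples) / len(samples) for j in range(m)]
        for i in range(m)
    ]


p_k = posterior_num_components(samples)
coalloc = coallocation_matrix(samples)
print(p_k)
for row in coalloc:
    print(row)
```

Plotting `coalloc` as a grey-scale image with 1 as white and 0 as black would give a heat map in the style described by the caption.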
Literature → “Known” interactions between three cytokines: IRF1, IRF2 and IRF3
[Figure: network over IRF1, IRF2 and IRF3]
Evaluation: average marginal posterior probabilities of the edges versus non-edges
Sample of high-scoring networks → feature extraction, e.g. marginal posterior probabilities of the edges
[Figure: high-confident edge, high-confident non-edge, uncertainty about edges]
[Figure: network over IRF1, IRF2 and IRF3; average edge score versus average non-edge score]
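The evaluation described above can be sketched in a few lines. Everything in the example is assumed for illustration: the four sampled graphs, the `gold` matrix (a placeholder, not the literature network from the talk), and the helper names. An entry `g[i][j] == 1` means an edge from gene i to gene j.

```python
genes = ["IRF1", "IRF2", "IRF3"]

# Hypothetical sample of high-scoring networks (adjacency matrices).
graphs = [
    [[0, 1, 0], [1, 0, 0], [0, 0, 0]],
    [[0, 1, 0], [1, 0, 1], [0, 0, 0]],
    [[0, 1, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 1, 1], [1, 0, 0], [0, 0, 0]],
]

# Hypothetical gold standard, used only to demonstrate the computation.
gold = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]


def edge_posteriors(graphs):
    """Marginal posterior probability of each edge = its frequency in the sample."""
    n = len(graphs[0])
    return [
        [sum(g[i][j] for g in graphs) / len(graphs) for j in range(n)]
        for i in range(n)
    ]


def average_scores(post, gold):
    """Mean posterior probability over gold-standard edges and over non-edges."""
    n = len(gold)
    edge = [post[i][j] for i in range(n) for j in range(n) if gold[i][j] == 1]
    non_edge = [post[i][j] for i in range(n) for j in range(n) if gold[i][j] == 0]
    return sum(edge) / len(edge), sum(non_edge) / len(non_edge)


post = edge_posteriors(graphs)
avg_edge, avg_nonedge = average_scores(post, gold)
print(f"average edge score = {avg_edge:.3f}, average non-edge score = {avg_nonedge:.3f}")
```

A method separates edges from non-edges well when the first average is close to 1 and the second close to 0.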
Gold standard known → posterior probabilities of true interactions
[Figure: network over IRF1, IRF2 and IRF3; panels: New method, Homogeneous model]
Circadian regulation in
Arabidopsis thaliana
Circadian rhythms in Arabidopsis thaliana
Collaboration with the Institute of Molecular Plant Sciences at Edinburgh University (Andrew Millar’s group)
2 time series T20 and T28 of microarray gene expression data from Arabidopsis thaliana.
- Focus on 9 circadian genes: LHY, CCA1, TOC1, ELF4, ELF3, GI, PRR9, PRR5, and PRR3
- Both time series measured under constant light condition at 13 time points: 0h, 2h,…, 24h, 26h
- Plants entrained with different light:dark cycles: 10h:10h (T20) and 14h:14h (T28)
Gene expression time series plots (Arabidopsis data T20 and T28)
[Figure: panels T28 and T20]
Posterior probability of the number of components
Predicted network
Blue: activation
Red: inhibition
Black: mixture
Three different line widths:
- thin = PP>0.5
- medium = PP>0.75
- fat = PP>0.9
Cogs of the Plant Clockwork
Review: Rob McClung, Plant Cell 2006
Two major gene classes…
Morning genes, e.g. LHY, CCA1
… repress evening genes, e.g. TOC1, ELF3, ELF4, GI, LUX
… which activate LHY and CCA1
Circadian genes in Arabidopsis thaliana, network learned from two time series over 13 time points
[Figure: learned network over ELF3, CCA1, LHY, PRR9, GI, PRR5, ELF4, TOC1 and PRR3, with “false negatives” and “false positives” marked]
True positives (TP) = 8
False positives (FP) = 13
False negatives (FN) = 5
True negatives (TN) = 9² - 8 - 13 - 5 = 55
Sensitivity = TP/[TP+FN] = 62%
Specificity = TN/[TN+FP] = 81%
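The arithmetic behind these figures can be reproduced directly from the counts on the slide. The variable names are chosen for illustration; the total of 9² possible edges follows the slide's own count over the 9 genes.

```python
n_genes = 9
tp, fp, fn = 8, 13, 5

# All ordered gene pairs minus the three other categories.
tn = n_genes ** 2 - tp - fp - fn

sensitivity = tp / (tp + fn)   # 8 / 13
specificity = tn / (tn + fp)   # 55 / 68

print(f"TN = {tn}, sensitivity = {sensitivity:.0%}, specificity = {specificity:.0%}")
```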
Overview of the plant clock model (Locke et al., Mol. Syst. Biol. 2006)
[Figure: clock model with nodes LHY/CCA1, PRR9/PRR7, Y (GI), X and TOC1, labelled Morning and Evening; a second version of the figure marks four interactions with “Yes”]
Allocation sampler versus change-point process
Advances in Bioinformatics, in press
Heterogeneous DBN
[Figure: graphical model with parameters θ, allocation vector h, and number of components k (here: 3)]
Example: 4 genes, 10 time points
[Figure: data grid of 4 genes X(1) to X(4) over 10 time points t1 to t10; panel titles: Free allocation, Changepoint process]
Standard dynamic Bayesian network: homogeneous model
[Figure: data grid of 4 genes X(1) to X(4) over 10 time points t1 to t10]
Our new model: heterogeneous dynamic Bayesian network. Here: 2 components
[Figure: same data grid, 2 components]
Our new model: heterogeneous dynamic Bayesian network. Here: 3 components
[Figure: same data grid, 3 components]
Allocation sampler versus change-point process
Allocation sampler:
• More flexibility, unrestricted mixture model.
• Not restricted to time series
• Higher computational costs
Change-point process:
• Incorporates plausible prior knowledge for time series.
• Reduced complexity
• Less universal, not applicable to static data
Can we get an approximate nonlinear model without data discretization?
Idea: piecewise linear model
[Figure: y plotted against the segmented variable]
Allocation sampler: x → change-point process: t
Change-point model versus free allocation: Arabidopsis thaliana (13 time points)
[Figure: panels Free allocation and Change-point model]
Change-point model versus free allocation: Drosophila melanogaster (67 time points)
[Figure: panels Free allocation and Change-point process]
Not only related to the complexity and MCMC convergence … but it is intrinsic to the prior.
Prior probability of assigning two time points to the same component. White=1. Black=0.
[Figure: panels Allocation sampler and Change-point process]
Allocation sampler
[Equation not recovered; annotated symbols: allocation vector, total number of data points, number of components, number of data points assigned to the kth component]
Change-point process:
even-numbered order statistics
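A minimal sketch of drawing change-points as even-numbered order statistics follows. It assumes the common construction in which 2K+1 positions are drawn uniformly without replacement and every second one is kept, so that K change-points result. The function name `sample_changepoints`, the number of time points and the admissible range 2..m-1 are assumptions for illustration, not details recovered from the slides.

```python
import random


def sample_changepoints(n_changepoints, candidates, rng):
    """Draw change-points as the even-numbered order statistics of
    2*n_changepoints + 1 positions sampled without replacement."""
    draws = sorted(rng.sample(candidates, 2 * n_changepoints + 1))
    # Order statistics are numbered from 1, so the even-numbered ones
    # sit at Python indices 1, 3, 5, ...
    return draws[1::2]


rng = random.Random(0)
m = 20                              # hypothetical number of time points
candidates = list(range(2, m))      # assumed admissible positions 2..m-1
changepoints = sample_changepoints(3, candidates, rng)
print(changepoints)
```

Because a discarded odd-numbered draw always lies between two retained positions, neighbouring change-points are at least two time points apart, which discourages very short segments.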
Change-point process
Reallocation of a change-point
[Figure: new changepoint; prior probability ratio]
Birth of a new change-point (see Peter Green, Biometrika (1995))
Insertion of a change-point: K=1 → K=2
[Figure: prior probability ratio; allocation vector]
[Figure: panels Allocation sampler and Change-point model]
Top: change-point location j=m/2 fixed, sample size variable
Bottom: sample size m fixed, change-point location variable
Change-point model versus free allocation: Drosophila melanogaster (67 time points)
[Figure: panels Free allocation and Change-point process]
Allocation sampler applied to the macrophage gene expression time series
[Figure: panels Infection, Treatment, Infection+treatment; Allocation sampler and Change-point process]
Morphogenesis in
Drosophila melanogaster
Morphogenesis in Drosophila melanogaster
• Gene expression measurements over 66 time steps of 4028 genes (Arbeitman et al., Science, 2002).
• Selection of 11 genes involved in muscle development (Zhao et al. (2006), Bioinformatics 22).
Heterogeneous dynamic Bayesian network: Plausible segmentation?
[Figure: data grid of 4 genes X(1) to X(4) over 10 time points t1 to t10]
[Figure: posterior probability versus number of components (1 to 8), two panels]
Four stages of the Drosophila life cycle: embryo → larva → pupa → adult
[Figure: two panels plotted over time]
Morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult
Gene expression program governing the transition to adult morphology active well before the fly emerges from the pupa.
Node-specific changepoints
NIPS 2009
Standard dynamic Bayesian network: homogeneous model
[Figure: data grid of 4 genes X(1) to X(4) over 10 time points t1 to t10]
Heterogeneous dynamic Bayesian network
[Figure: same data grid]
Heterogeneous dynamic Bayesian network with node-specific breakpoints
[Figure: same data grid]
MCMC scheme
Moves that change the network structure
Moves that act on change-points: reallocation, birth and death moves
Avoiding spurious feedback loops
[Figure: panels BGe and New model]
Marginal posterior probability for edges
[Figure: panels BGe and New model]
Application to macrophage gene expression data
[Figure: panels BGe and New model]
Comparative evaluation: networks for simulating data
Generating synthetic data
Noise: normal distribution
[Figure: AUC scores for the new model and the models for comparison]
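For readers unfamiliar with AUC scores, the sketch below shows one way to compute them from edge scores and a known network. It assumes that AUC here means the area under the ROC curve, which the slides do not state explicitly. The scores and labels are hypothetical, and `auc` is an illustrative helper, not code from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen true edge receives a higher
    score than a randomly chosen non-edge; ties count one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical marginal posterior probabilities of five candidate edges and
# whether each one is present (1) or absent (0) in the data-generating network.
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 1, 0, 1, 0]
area = auc(scores, labels)
print(f"AUC = {area:.3f}")
```

An AUC of 1 means every true edge outranks every non-edge, and 0.5 corresponds to random ranking.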
Application to Arabidopsis thaliana: 4 different gene expression time series
Dynamic programming
Joint work with Marco Grzegorczyk
Two priors for changepoints
Prior for the number of changepoints, and a conditional prior on their positions
Point process prior
Probability for the first changepoint
Probability for the time between two successive changepoints
Distribution function for the distance between two successive changepoints
Negative binomial distribution
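The formulas on these slides did not survive extraction, so the sketch below only illustrates what a negative binomial distance distribution looks like. It assumes one common parameterisation, in which the distance d counts the trials needed to reach the a-th success with success probability p. The talk may use a different one, and the names `gap_pmf` and `gap_cdf` and the parameter values are hypothetical.

```python
import math


def gap_pmf(d, a, p):
    """P(distance between successive change-points = d), defined for d >= a."""
    if d < a:
        return 0.0
    return math.comb(d - 1, a - 1) * p ** a * (1 - p) ** (d - a)


def gap_cdf(d, a, p):
    """Distribution function: P(distance <= d)."""
    return sum(gap_pmf(i, a, p) for i in range(a, d + 1))


a, p = 2, 0.3   # hypothetical hyperparameters
for d in range(1, 8):
    print(d, round(gap_pmf(d, a, p), 4), round(gap_cdf(d, a, p), 4))
```

With a = 1 this reduces to a geometric distribution, in which a change-point is equally likely at every step. Larger values of a make very short segments less probable.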
Definitions
Assumes parameters can be integrated out in closed form
Prior
Recursion
Proof
Summary: definition, recursion, sampling of changepoints
Point process prior: definition, recursion, sampling of changepoints
Prior for the number of changepoints, and a conditional prior on their positions: definition, recursion, sampling of changepoints
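The recursion itself is not recoverable from the slide text. As a generic illustration of this kind of dynamic programming, the sketch below sums over all change-point configurations with a backward recursion and checks the result against brute-force enumeration. Its modelling choices are assumptions made to keep it short: a geometric distance prior (the negative binomial with a = 1) and a toy Beta-Bernoulli segment model, which stands in for a score whose parameters can be integrated out in closed form. All function names are hypothetical.

```python
import itertools
import math


def seg_marglik(y, t, s):
    """Marginal likelihood of binary data y[t..s] (inclusive) under a Bernoulli
    model with a uniform Beta(1, 1) prior on the success rate."""
    ones = sum(y[t:s + 1])
    zeros = (s - t + 1) - ones
    return math.factorial(ones) * math.factorial(zeros) / math.factorial(ones + zeros + 1)


def gap_pmf(d, p):
    """Geometric prior: P(segment length = d), d = 1, 2, ..."""
    return p * (1 - p) ** (d - 1)


def gap_tail(d, p):
    """P(segment length > d)."""
    return (1 - p) ** d


def marginal_likelihood_dp(y, p):
    """Q[t] = P(y[t:] | a segment starts at t), filled in from right to left."""
    n = len(y)
    Q = [0.0] * (n + 1)
    for t in range(n - 1, -1, -1):
        # Case 1: the segment starting at t runs to the end of the series.
        total = seg_marglik(y, t, n - 1) * gap_tail(n - 1 - t, p)
        # Case 2: the segment ends at s and a new segment starts at s + 1.
        for s in range(t, n - 1):
            total += seg_marglik(y, t, s) * gap_pmf(s - t + 1, p) * Q[s + 1]
        Q[t] = total
    return Q[0]


def marginal_likelihood_brute(y, p):
    """Same quantity by explicit enumeration of all change-point sets."""
    n = len(y)
    total = 0.0
    for r in range(n):
        for ends in itertools.combinations(range(n - 1), r):
            start, term = 0, 1.0
            for e in ends:  # e is the last index of a non-final segment
                term *= seg_marglik(y, start, e) * gap_pmf(e - start + 1, p)
                start = e + 1
            term *= seg_marglik(y, start, n - 1) * gap_tail(n - 1 - start, p)
            total += term
    return total


y = [0, 0, 0, 1, 1, 1, 1, 0]   # hypothetical binary series
p = 0.2                        # hypothetical change-point probability per step
dp_value = marginal_likelihood_dp(y, p)
brute_value = marginal_likelihood_brute(y, p)
print(dp_value, brute_value)
```

The recursion needs on the order of n² segment evaluations, whereas the enumeration visits all 2^(n-1) configurations. The same table `Q` can also be used to sample change-points one at a time, which is not shown here.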
Gibbs sampling procedure
• P(changepoints|network,data) → dynamic programming
• P(network|changepoints,data)
Gene expression profiles from Arabidopsis thaliana
ICML 2010
Flexible network structure with regularization
Comparison: integration of prior knowledge
Prior distribution:
Partition function
Ignoring the fan-in restriction: [equation not recovered; annotated quantity: number of genes]
Drosophila melanogaster: Expression of 11 muscle development genes over 66 time points
Fixed structure, flexible parameters
[Figure: plot over time]
Morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult
Gene expression program governing the transition to adult morphology active well before the fly emerges from the pupa.
Transition probabilities: flexible structure with regularization
Morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult
Comparison with:
Dondelinger, Lèbre & Husmeier
Ahmed & Xing
Simulation study
Frank Dondelinger, Sophie Lèbre, Dirk Husmeier: ICML 2010
Synthetic simulation study
Information sharing between adjacent segments
No information sharing between adjacent segments
Thank you!
Any questions?