File

advertisement
Continuous dynamic Bayesian network for gene
regulatory network modeling
Norhaini binti Baba
Student Bachelor Degree of Computer Science (Bioinformatics), Faculty of Computer Science and
Information Systems, Universiti Teknologi Malaysia, Skudai, 81310 Johor, Malaysia
ABSTRACT
In order to understand the underlying function of organisms, it is necessary to study the behavior
of genes in a gene regulatory network context. Several computational approaches are available
for modeling gene regulatory networks with different datasets. Hence, this research is conducted
to model the gene regulatory gene network using the proposed computational approach which is
the dynamic Bayesian Networks. Dynamic Bayesian Network (DBN) is extensively used to
construct GRNs based on its ability to handle microarray data and modeling feedback loops. This
DBN approach is then extend to continuous Dynamic Bayesian Network (cDBN) to construct
gene regulatory network with continuous data without discretization. The performance of the
constructed gene networks of Saccharomyces cerevisiae and E.coli are evaluated and are
compared with the previously constructed sub-networks by Dejori (2002) and S.O.S. DNA
Repair network of the Escherichia coli bacterium respectively. At the end of this research, the
gene networks constructed for Saccharomyces cerevisiae and E.coli have discovered more
potential interactions between genes. Therefore, it can be concluded that the performance of the
gene regulatory networks constructed using continuous dynamic Bayesian networks in this
research is proved to be better because it can reveal more gene relationships besides allow
feedback loops.
Key words: Gene regulatory network, Dynamic Bayesian Network, Gene Expression Data
INTRODUCTION
The regulation of gene expression is
achieved through gene regulatory networks
(GRNs) in which collections of genes
interact with one another and other
substances in a cell. Gene regulatory
networks have an important role in every
process of life. This process life including
cell differentiation, metabolism, cell cycle
and signal transduction. By understanding
the dynamics of this network, the
mechanism of disease that occurs when
these cellular processes are deregulated can
be clarified. The accurate prediction of the
behavior of gene regulatory network will
also speed up biotechnological projects,
where the predictions faster and cheaper
than lab experiment. Computational
methods will support the development of
network models and for the analysis of their
functionality have already proved to be a
valuable research tools.
Recent research efforts addressing
that shortcoming have considered undirected
graphs, directed graphs for discretization
data, or over-flexible models that lack any
information sharing among time series
segments. For instance, the Bayesian
Networks (BN) has two critical weaknesses
in which it does not allow feedback loops or
cyclic regulation and is unable to handle the
temporal aspect of time-series microarray
data.
In particular, Dynamic Bayesian
Networks (DBNs) have been applied, as
they allow feedback loops and recurrent
regulatory structures to be modeled while
avoiding the ambiguity about edge
directions common to static Bayesian
Networks (BN). In this research the method
will be extend to continuous Dynamic
Bayesian Network (cDBN) model which can
construct cyclic regulations of gene
expression data with continuous variables.
Continuous Dynamic Bayesian Network
construct gene regulatory network with
continuous data without discretization in
which discretisation of discrete dynamic
Bayesian network may lead to loss of
information. The proposed method can
optimize the network structure, which can
give best representation of the gene
interactions. The efficiency of proposed
method will show through the analysis of
Saccharomyces cerevisiae cell-cycle gene
expression and E.coli data.
METHOD
DATA PRE-PROCESSING:
IMPUTATION OF MISSING VALUE
The datasets that has been extract
may contain missing value in arbitrary
locations. This missing value occurs when
no data value is store for the variable in the
current observation. In order to prepare a
complete dataset for constructing dynamic
Bayesian networks, the missing values must
be imputed. Therefore, K-Nearest-Neighbor
method (KNN) in matlab is selected for this
research. This method using the syntax
knnimpute (data) to impute the missing
value of data.
The knnimpute(data) replaces NaNs
in data with the corresponding value from
the nearest-neighbor column. The nearestneighbor column is the closest column in the
Euclidean distance. If the corresponding
value from the nearest-neighbor column is
also NaN, the next nearest column is used.
Knnimpute(data, k) replaces NaNs in
data with the weighted mean of the k
nearest-neighbor columns. The weights are
inversely proportional to the distances from
the neighboring columns.
The experimental data is based on
the Saccharomyces cerevisiae cell cycle
time-series gene expression data from
Spellman et al. This data contains two short
time series (cln3,clb2) and four medium
time
series
(7,14,21,28,35,42,49,56,
,63,70,77,84,91,98,105,112 and 119 time
points; alpha, cdc15, cdc28 and elu).
However, the datasets that has been extract
may contain missing value in arbitrary
locations. This missing value occurs when
no data value is store for the variable in the
current observation. In order to prepare a
complete dataset for constructing dynamic
Bayesian networks, the missing values must
be imputed. A better solution is to use
imputation algorithms to estimate the
missing values by exploiting the observed
data structure and expression pattern.
Therefore, K-Nearest-Neighbor method
(KNN) is selected for this research. The
knnimpute implement as missing value
estimation method to minimize data
modeling assumptions and take advantage of
the correlation structure of the gene
expression data.
CONTINUOUS DYNAMIC BAYESIAN
NETWORK
Continuous
model
is
the mathematical practice
of
applying
a model to continuous data which has a
potentially infinite number and divisibility
of attributes. This continuous model often
uses differential equations and is converse
to discrete modeling which is need the
discretisation. Kalman Filter model is the
simplest continuous dynamic Bayesian
network. Kalman Filter is a recursive data
processing algorithm. It will generate
optimal estimate of desired quantities given
the sets of measurement. The continuous
model can embed biological knowledge in a
prior probability. The prior probability of a
network can be constructing based on the
biological knowledge such as binding site
information, DNA-protein interaction or
genes interaction by using this continuous
model.
Continuous network models of
GRNs are an extension of the Boolean
networks. Nodes still represent genes and
connections between them regulatory
influences on gene expression. Genes in
biological systems display a continuous
range of activity levels and it has been
argued
that
using
a
continuous
representation captures several properties of
gene regulatory networks not present in the
Boolean model. Continuous network model
allows grouping of inputs to a node thus
realizing another level of regulation. The
discretisation which is involve in the
discrete model will might cause information
loss. To avoid the discretisation, Kim et al.,
defined the continuous Dynamic Bayesian
Network.
This
continuous
dynamic
Bayesian network does not required the
discretisation of continuous data.
Dynamic
Bayesian
Networks
(DBNs) are widely used in regulatory
network structure inference with gene
expression data. Dynamic Bayesian
networks (DBNs) are family of probabilistic
graphical models of stochastic processes.
The inference and learning problems in
DBNs involving computing posterior
distributions aver unobserved variables or
parameters. They generalize hidden markov
models (HMMs) and linear dynamical
systems (LDSs) by representing the hidden
(and observed) state in term of state
variables which can have complex
interdependencies. The graphical structure
provides an easy way to specify the
conditional independencies and to provide a
compact parameterization of the model.
Dynamic Bayesian networks are
powerful framework for temporal data
models with application in time series
analysis. A time series of length T is a
sequence of observation vectors V = {v(1),
v(2), . . .v(T)}, where v1(t) represent the
state visible variable i at time t. The
assumption that observations may be
generated by some hidden process that
cannot directly experimentally observed.
The simplest kind of DBN is a
Hidden Markov Model (HMM) which has
one discrete hidden node and one discrete or
continuous observed node per slice.
According to Min Zou et al, (2004), for
example time slice T1 for regulatory and T2
for the target gene, where T1 precedes T2.
The time period between the time slices of
the regulatory and target (T2-T1) is
considered as the transcriptional time lag. It
is the time that it takes for the regulator gene
to express its protein product and the
transcription of the target gene to be affected
by this regulator protein.
This continuous dynamic Bayesian
network implement in Linux using
BNFinder software. BNFinder is based on a
novel polynomial-time algorithm for
learning an optimal Bayesian network
structure (Dojer, 2006). The algorithm was
designed to save reasonable speed and
perfect quality of learning in a wide class of
problems occurring in the computational
molecular biology. When dealing with
dynamic Bayesian networks, a dynamic
Bayesian network describes stochastic
evolution of a set of random variables over
discretized time. Because of that,
conditional distributions refer to random
variables in neighboring time points and the
graph is always acyclic.
BNFinder learns optimal networks
with respect to two generally used scoring
criteria: Bayesian–Dirichlet equivalence
(BDe) and minimal description length
(MDL). The BDe score originates from
Bayesian statistics and corresponds to the
posterior probability of a network-given
data. While the MDL score originates from
information theory and corresponds to the
length of the data compressed with the
compression model derived from the
network structure. It also has a statistical
interpretation as an approximation of the
posterior probability. The algorithm works
in polynomial time for both scores, but
computations with the MDL are faster,
especially for large datasets. However, the
BDe score better due to its exactness in the
statistical
interpretation.
Continuous
variables are handled with corresponding
scores, derived under the assumption that
conditional distributions belong to a family
of Gaussian mixtures.
evaluation of performance conducted by
comparing the network constructed by
Dejori (2002). In this research, the target
network is the Saccharomyces cerevisiae
cell cycle gene network in the Dejori (2002)
which is similar to this research. Therefore,
the target network in the Dejori (2002) will
be the benchmark for this research networks.
RESULTS AND DISCUSSION
SACCHAROMYCES CEREVISIAE CELLCYCLE GENE EXPRESSION
Sub-network learning
Figure 1: The overview of our proposed cDBNbased model.
CONSTRUCTING GENE REGULATORY
NETWORK
The continuous dynamic Bayesian
network that is developed is used to
construct a gene regulatory network. The
gene networks constructed are represented
in both network graph form. A gene
regulatory network in network graph form is
represented by using cytoscape software.
After gene regulatory networks are
constructed, the performance of gene
networks constructed using dynamic
Bayesian network is evaluated. The
The small sub-networks were applied
in this research due to the computational
costs and limited time. The search-algorithm
was runs on the huge datasets but the results
showed the unreasonably structure. This
showed that without restricting the subnetwork it is impossible to yield a good
reasonably structure of networks. Hence,
small sub-networks about 8 or more
variables were applied. In this research, the
sub-network of YOR263C and YPL256C
were applied to reduced the search space
and detect the good reasonably network
structures.
Figure 2: The comparison of edges in YOR263C sub-network between the network constructed
by Dejori (2002) and with the network constructed in this research. True Positive (TP) in the
dotted circle is the number of edges that exist in network constructed by Dejori and in the
network formed in this research. False Positive (FP) in the cross is the number edges that exist in
this research, but not in the network by Dejori (2002).
Figure 3: The comparison between network by Dejori (2002) and network in this research. The
true positive assign by the dotted circle. This means that 4 edges that exist in Dejori (2002) are
also exist in the network formed in this research as well. This shows that the Dynamic Bayesian
networks implemented in this research has successfully predicted and formed all the possible
existing edges or interactions between genes in the network like the one by Dejori (2002).While
the false positive represent by the cross sign where there are 20 edges formed in the network in
this research that do not exist in the network by Dejori (2002). This shows that the Dynamic
Bayesian networks implemented in this research is capable of uncovering more potential edges
or interactions between genes if compared with Dejori (2002).
YOR263C sub-network
YPL256C sub-network
Dejori (Bayesian
Network)
Proposed method
(continuous Dynamic
Bayesian Network)
Dejori (Bayesian Network)
Proposed method
(continuous Dynamic
Bayesian Network)
Nodes
8
7
12
12
Edges
6 undirected edges
12 directed edges
9 directed edges
24 directed edges
Cyclic
regulation
No
4 cyclic regulation
No
7 cyclic regulation
Up
regulation
-
-
No
Yes
Down
regulation
-
-
No
Yes
Table 1 : Comparison result between Bayesian network by Dejori and continuous dynamic
Bayesian network in this research.
Figure 2 shows the YOR263C subnetwork that is constructed in this research.
The constructed network consists of 7 nodes
(genes), 12 directed edges and 4 cyclic
regulations. The main difference between
the sub-networks constructed by Dejori
(2002) with this research is that the edges in
the network constructed in this research are
directed and has cyclic regulation, which is
it can show clearly the interactions between
genes. For example, the edge formed
between YER124C and YNR067C in subnetwork by Dejori (2002) cannot show
which gene is regulating the gene. But in
this research, the network can show clearly
that YNR0673C is regulating YER124C.
This means that the expression level of
YER124C is depending on YNR0673C.
Another interaction is between YOR264W
and YOR263C which there are cyclic
regulation. This means that the expression
level of YOR264W is depending on
YOR263C and expression level of
YOR263C is depending on YOR264W.
Figure 3 shows the YPL256C subnetwork that is constructed in this research.
The network consists of 12 nodes and 23
directed edges. The main difference of the
network formed in this research with the
network done by Dejori (2002) is that the
number of edges that are detected via this
research is about 3 times more than the
network constructed by Dejori (2002).
Besides that, all the edges in the network
developed through this research have at least
one directed edge with other nodes.
However, the network that constructed by
Dejori (2002) has failed to construct any
edge for node YGR108W that successfully
construct in this research. In addition this
research able to construct the cyclic
regulation, up regulation and down
regulation. All these proved that the
continuous dynamic Bayesian networks
implemented in this research is able to
predict and construct more potential edges
or interaction between genes in a subnetwork.
YOR263C subnetwork/ %
YPL256C subnetwork/ %
a) Sensitivity
83.3
44.4
b) Specificity
80
84.9
Table 2: Sensitivity and specificity for
YOR263C sub-networks and YPL256C subnetwork.
Table 2 shows the sensitivity and
specificity of YOR263C sub-network and
YPL256C sub-network .The sensitivity of
YOR263C sub-network (true positive rate)
is 83.3%. This means that 5 edges that exist
in Dejori (2002) are also exist in the network
formed in this research as well. This shows
that the Dynamic Bayesian networks
implemented in this research has
successfully predicted and formed all the
possible existing edges or interactions
between genes in the network like the one
by Dejori (2002). While the specificity (true
negative rate) is 80%, where there are 3
edges formed in the network in this research
do not exist in the network by Dejori (2002).
The 3 edges are YNR067C with YER124C,
YOR263C with YNR067C and YJL196C
with YNR067C. This shows that the
Dynamic Bayesian networks implemented in
this research is capable of uncovering more
potential edges or interactions between
genes if compared with Dejori (2002).
While the sensitivity of YPL256C
sub-network is for this network is
approximately 44.4%, whereby 4 directed
edges that exist in the network by Dejori
(2002) have been captured by the program
in this research as well. However, there is 5
directed edge that exists in network by
Dejori (2002), but it does not exist in the
network by this research. The specificity
(true negative rate) for this sub-network is
around 84.9%. This research has captured 20
new edges between nodes that were unable
to be captured by Dejori (2002). This
proposed method proved that it has the
ability to discover more potential interaction
between genes.
ESCHERICHIA COLI DATASET
Figure 4: Gene regulatory network construct for Escherichia coli. Activation are represented by
arrows () and inhibitions by T ( l ).
Rate / %
a) Sensitivity
70
b) Specificity
88.9
Table 3: Sensitivity and specificity of
Escherichia coli dataset compare to S.O.S.
DNA Repair network of the Escherichia coli
bacterium
Figure 4 shows the network of
Escherichia coli dataset that is constructed
in this research. The constructed network
consists of 8 nodes (genes), 12 directed
edges and 2 cyclic regulations. The main
difference between the S.O.S. DNA Repair
network of the Escherichia coli bacterium
with network in this research is that the
edges in the network constructed in this
research is it has cyclic regulation, which is
it can show clearly the interactions between
genes. For example, the edge formed
between lexA and umuD and edge between
lexA and recA in this research show the
cyclic regulations. Besides that network that
construct in this research is discover new
interactions which are between gene polB
and uvrA and gene between uvrA and uvrY.
Table 3 shows the sensitivity is 70%.
This means that 7 edges that exist in S.O.S.
DNA Repair network of the Escherichia coli
bacterium are also exist in the network
formed in this research as well. This shows
that the Dynamic Bayesian networks
implemented in this research has
successfully predicted and formed all the
possible existing edges or interactions
between genes in the network like the one
by S.O.S. DNA Repair network of the
Escherichia coli bacterium.
While the specificity is 88.9%,
where there are 5 edges formed in the
network in this research do not exist in the
network by S.O.S. DNA Repair network of
the Escherichia coli bacterium. This shows
that the continuous Dynamic Bayesian
networks implemented in this research is
capable of uncovering more potential edges
or interactions between genes if compared
with S.O.S. DNA Repair network.
optimized performance and they performed
better than the Bayesian networks developed
in Dejori (2002)’s research and network by
S.O.S. DNA Repair network of the
Escherichia coli bacterium in discovering
and predicting interactions between genes.
DISCUSSION
In this paper, new method for
constructing gene regulatory network based
on continuous dynamic Bayesian network.
The advantages of this proposed method
compared with other method such as
Bayesian network is this proposed method
can analyze the microarray data as the
continuous data without the extra data
pretreatments such as discretization. Even
nonlinear relations can be detected and
modeled by this proposed method.
The gene network constructed in this
research shows that the continuous Dynamic
Bayesian networks implemented in this
research not only have successfully
predicted all the edges that have been
constructed by Dejori (2002), but the
continuous Dynamic Bayesian networks
implemented are also able to discover more
new possible edges or interactions between
genes. Besides that the continuous DBN that
implement in this research is able to
construct cyclic regulation which is the
Bayesian network not able to construct the
cyclic regulation. Continuous DBN models
tend to estimate many false positive in the
cyclic regulation. Hence, it can be concluded
that the continuous Dynamic Bayesian
networks implemented in this research have
REFERENCES
[1] Matthew J. Beal, Francesco Falciani,
Zoubin Ghahrami. (2004). “A Bayesian
approach
to
reconstructing
genetic
regulatory networks with hidden factors.”
[2] Sunyong Kim, Seiya Imoto, Satoru
Miyano. (2004). “Dynamic bayesian
network and nonparametric regression for
nonlinear modelling of gene networks from
time series gene expression data”. science
direct .
[3] Min Zou, Suzanne D. Conzen. (2004).
“A new dynamic Bayesian network (DBN)
approach for identifying gene regulatory
networks from time course microarray data.”
[4] Zheng Li, Ping Li, Arun Krishnan,
Jingdong Liu. (2011). “Large-scale dynamic
gene
regulatory
network
inference
combining differential equation models with
the local dynamic Bayesian network
analysis.”
[5] Akther Shermin and Mehmet A. Orgun
(2009). “A 2-stage Approach for Inferring
Gene Regulatory Networks using Dynamic
Bayesian Networks.” IEEE International
Conference
on
Bioinformatics
and
Biomedicine, 2009.
[6] Wei-Po Lee and Wen-Shyong Tzou
(2009). “Computational methods for
discovering gene networks from expression
data.”Briefings In Bioinformatics, 10 (4):
408- 423
[7] Bruno-Edouard Perrin, Liva Ralaivola,
Aur´ elien Mazurie, Samuele Bottani,
Jacques Mallet and Florence d’Alch´e–Buc
(2003). “Gene networks inference using
dynamic Bayesian networks”
Bioinformatics, 19 (2) : 38-148
Download