Continuous dynamic Bayesian network for gene regulatory network modeling Norhaini binti Baba Student Bachelor Degree of Computer Science (Bioinformatics), Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, Skudai, 81310 Johor, Malaysia ABSTRACT In order to understand the underlying function of organisms, it is necessary to study the behavior of genes in a gene regulatory network context. Several computational approaches are available for modeling gene regulatory networks with different datasets. Hence, this research is conducted to model the gene regulatory gene network using the proposed computational approach which is the dynamic Bayesian Networks. Dynamic Bayesian Network (DBN) is extensively used to construct GRNs based on its ability to handle microarray data and modeling feedback loops. This DBN approach is then extend to continuous Dynamic Bayesian Network (cDBN) to construct gene regulatory network with continuous data without discretization. The performance of the constructed gene networks of Saccharomyces cerevisiae and E.coli are evaluated and are compared with the previously constructed sub-networks by Dejori (2002) and S.O.S. DNA Repair network of the Escherichia coli bacterium respectively. At the end of this research, the gene networks constructed for Saccharomyces cerevisiae and E.coli have discovered more potential interactions between genes. Therefore, it can be concluded that the performance of the gene regulatory networks constructed using continuous dynamic Bayesian networks in this research is proved to be better because it can reveal more gene relationships besides allow feedback loops. Key words: Gene regulatory network, Dynamic Bayesian Network, Gene Expression Data INTRODUCTION The regulation of gene expression is achieved through gene regulatory networks (GRNs) in which collections of genes interact with one another and other substances in a cell. Gene regulatory networks have an important role in every process of life. This process life including cell differentiation, metabolism, cell cycle and signal transduction. By understanding the dynamics of this network, the mechanism of disease that occurs when these cellular processes are deregulated can be clarified. The accurate prediction of the behavior of gene regulatory network will also speed up biotechnological projects, where the predictions faster and cheaper than lab experiment. Computational methods will support the development of network models and for the analysis of their functionality have already proved to be a valuable research tools. Recent research efforts addressing that shortcoming have considered undirected graphs, directed graphs for discretization data, or over-flexible models that lack any information sharing among time series segments. For instance, the Bayesian Networks (BN) has two critical weaknesses in which it does not allow feedback loops or cyclic regulation and is unable to handle the temporal aspect of time-series microarray data. In particular, Dynamic Bayesian Networks (DBNs) have been applied, as they allow feedback loops and recurrent regulatory structures to be modeled while avoiding the ambiguity about edge directions common to static Bayesian Networks (BN). In this research the method will be extend to continuous Dynamic Bayesian Network (cDBN) model which can construct cyclic regulations of gene expression data with continuous variables. Continuous Dynamic Bayesian Network construct gene regulatory network with continuous data without discretization in which discretisation of discrete dynamic Bayesian network may lead to loss of information. The proposed method can optimize the network structure, which can give best representation of the gene interactions. The efficiency of proposed method will show through the analysis of Saccharomyces cerevisiae cell-cycle gene expression and E.coli data. METHOD DATA PRE-PROCESSING: IMPUTATION OF MISSING VALUE The datasets that has been extract may contain missing value in arbitrary locations. This missing value occurs when no data value is store for the variable in the current observation. In order to prepare a complete dataset for constructing dynamic Bayesian networks, the missing values must be imputed. Therefore, K-Nearest-Neighbor method (KNN) in matlab is selected for this research. This method using the syntax knnimpute (data) to impute the missing value of data. The knnimpute(data) replaces NaNs in data with the corresponding value from the nearest-neighbor column. The nearestneighbor column is the closest column in the Euclidean distance. If the corresponding value from the nearest-neighbor column is also NaN, the next nearest column is used. Knnimpute(data, k) replaces NaNs in data with the weighted mean of the k nearest-neighbor columns. The weights are inversely proportional to the distances from the neighboring columns. The experimental data is based on the Saccharomyces cerevisiae cell cycle time-series gene expression data from Spellman et al. This data contains two short time series (cln3,clb2) and four medium time series (7,14,21,28,35,42,49,56, ,63,70,77,84,91,98,105,112 and 119 time points; alpha, cdc15, cdc28 and elu). However, the datasets that has been extract may contain missing value in arbitrary locations. This missing value occurs when no data value is store for the variable in the current observation. In order to prepare a complete dataset for constructing dynamic Bayesian networks, the missing values must be imputed. A better solution is to use imputation algorithms to estimate the missing values by exploiting the observed data structure and expression pattern. Therefore, K-Nearest-Neighbor method (KNN) is selected for this research. The knnimpute implement as missing value estimation method to minimize data modeling assumptions and take advantage of the correlation structure of the gene expression data. CONTINUOUS DYNAMIC BAYESIAN NETWORK Continuous model is the mathematical practice of applying a model to continuous data which has a potentially infinite number and divisibility of attributes. This continuous model often uses differential equations and is converse to discrete modeling which is need the discretisation. Kalman Filter model is the simplest continuous dynamic Bayesian network. Kalman Filter is a recursive data processing algorithm. It will generate optimal estimate of desired quantities given the sets of measurement. The continuous model can embed biological knowledge in a prior probability. The prior probability of a network can be constructing based on the biological knowledge such as binding site information, DNA-protein interaction or genes interaction by using this continuous model. Continuous network models of GRNs are an extension of the Boolean networks. Nodes still represent genes and connections between them regulatory influences on gene expression. Genes in biological systems display a continuous range of activity levels and it has been argued that using a continuous representation captures several properties of gene regulatory networks not present in the Boolean model. Continuous network model allows grouping of inputs to a node thus realizing another level of regulation. The discretisation which is involve in the discrete model will might cause information loss. To avoid the discretisation, Kim et al., defined the continuous Dynamic Bayesian Network. This continuous dynamic Bayesian network does not required the discretisation of continuous data. Dynamic Bayesian Networks (DBNs) are widely used in regulatory network structure inference with gene expression data. Dynamic Bayesian networks (DBNs) are family of probabilistic graphical models of stochastic processes. The inference and learning problems in DBNs involving computing posterior distributions aver unobserved variables or parameters. They generalize hidden markov models (HMMs) and linear dynamical systems (LDSs) by representing the hidden (and observed) state in term of state variables which can have complex interdependencies. The graphical structure provides an easy way to specify the conditional independencies and to provide a compact parameterization of the model. Dynamic Bayesian networks are powerful framework for temporal data models with application in time series analysis. A time series of length T is a sequence of observation vectors V = {v(1), v(2), . . .v(T)}, where v1(t) represent the state visible variable i at time t. The assumption that observations may be generated by some hidden process that cannot directly experimentally observed. The simplest kind of DBN is a Hidden Markov Model (HMM) which has one discrete hidden node and one discrete or continuous observed node per slice. According to Min Zou et al, (2004), for example time slice T1 for regulatory and T2 for the target gene, where T1 precedes T2. The time period between the time slices of the regulatory and target (T2-T1) is considered as the transcriptional time lag. It is the time that it takes for the regulator gene to express its protein product and the transcription of the target gene to be affected by this regulator protein. This continuous dynamic Bayesian network implement in Linux using BNFinder software. BNFinder is based on a novel polynomial-time algorithm for learning an optimal Bayesian network structure (Dojer, 2006). The algorithm was designed to save reasonable speed and perfect quality of learning in a wide class of problems occurring in the computational molecular biology. When dealing with dynamic Bayesian networks, a dynamic Bayesian network describes stochastic evolution of a set of random variables over discretized time. Because of that, conditional distributions refer to random variables in neighboring time points and the graph is always acyclic. BNFinder learns optimal networks with respect to two generally used scoring criteria: Bayesian–Dirichlet equivalence (BDe) and minimal description length (MDL). The BDe score originates from Bayesian statistics and corresponds to the posterior probability of a network-given data. While the MDL score originates from information theory and corresponds to the length of the data compressed with the compression model derived from the network structure. It also has a statistical interpretation as an approximation of the posterior probability. The algorithm works in polynomial time for both scores, but computations with the MDL are faster, especially for large datasets. However, the BDe score better due to its exactness in the statistical interpretation. Continuous variables are handled with corresponding scores, derived under the assumption that conditional distributions belong to a family of Gaussian mixtures. evaluation of performance conducted by comparing the network constructed by Dejori (2002). In this research, the target network is the Saccharomyces cerevisiae cell cycle gene network in the Dejori (2002) which is similar to this research. Therefore, the target network in the Dejori (2002) will be the benchmark for this research networks. RESULTS AND DISCUSSION SACCHAROMYCES CEREVISIAE CELLCYCLE GENE EXPRESSION Sub-network learning Figure 1: The overview of our proposed cDBNbased model. CONSTRUCTING GENE REGULATORY NETWORK The continuous dynamic Bayesian network that is developed is used to construct a gene regulatory network. The gene networks constructed are represented in both network graph form. A gene regulatory network in network graph form is represented by using cytoscape software. After gene regulatory networks are constructed, the performance of gene networks constructed using dynamic Bayesian network is evaluated. The The small sub-networks were applied in this research due to the computational costs and limited time. The search-algorithm was runs on the huge datasets but the results showed the unreasonably structure. This showed that without restricting the subnetwork it is impossible to yield a good reasonably structure of networks. Hence, small sub-networks about 8 or more variables were applied. In this research, the sub-network of YOR263C and YPL256C were applied to reduced the search space and detect the good reasonably network structures. Figure 2: The comparison of edges in YOR263C sub-network between the network constructed by Dejori (2002) and with the network constructed in this research. True Positive (TP) in the dotted circle is the number of edges that exist in network constructed by Dejori and in the network formed in this research. False Positive (FP) in the cross is the number edges that exist in this research, but not in the network by Dejori (2002). Figure 3: The comparison between network by Dejori (2002) and network in this research. The true positive assign by the dotted circle. This means that 4 edges that exist in Dejori (2002) are also exist in the network formed in this research as well. This shows that the Dynamic Bayesian networks implemented in this research has successfully predicted and formed all the possible existing edges or interactions between genes in the network like the one by Dejori (2002).While the false positive represent by the cross sign where there are 20 edges formed in the network in this research that do not exist in the network by Dejori (2002). This shows that the Dynamic Bayesian networks implemented in this research is capable of uncovering more potential edges or interactions between genes if compared with Dejori (2002). YOR263C sub-network YPL256C sub-network Dejori (Bayesian Network) Proposed method (continuous Dynamic Bayesian Network) Dejori (Bayesian Network) Proposed method (continuous Dynamic Bayesian Network) Nodes 8 7 12 12 Edges 6 undirected edges 12 directed edges 9 directed edges 24 directed edges Cyclic regulation No 4 cyclic regulation No 7 cyclic regulation Up regulation - - No Yes Down regulation - - No Yes Table 1 : Comparison result between Bayesian network by Dejori and continuous dynamic Bayesian network in this research. Figure 2 shows the YOR263C subnetwork that is constructed in this research. The constructed network consists of 7 nodes (genes), 12 directed edges and 4 cyclic regulations. The main difference between the sub-networks constructed by Dejori (2002) with this research is that the edges in the network constructed in this research are directed and has cyclic regulation, which is it can show clearly the interactions between genes. For example, the edge formed between YER124C and YNR067C in subnetwork by Dejori (2002) cannot show which gene is regulating the gene. But in this research, the network can show clearly that YNR0673C is regulating YER124C. This means that the expression level of YER124C is depending on YNR0673C. Another interaction is between YOR264W and YOR263C which there are cyclic regulation. This means that the expression level of YOR264W is depending on YOR263C and expression level of YOR263C is depending on YOR264W. Figure 3 shows the YPL256C subnetwork that is constructed in this research. The network consists of 12 nodes and 23 directed edges. The main difference of the network formed in this research with the network done by Dejori (2002) is that the number of edges that are detected via this research is about 3 times more than the network constructed by Dejori (2002). Besides that, all the edges in the network developed through this research have at least one directed edge with other nodes. However, the network that constructed by Dejori (2002) has failed to construct any edge for node YGR108W that successfully construct in this research. In addition this research able to construct the cyclic regulation, up regulation and down regulation. All these proved that the continuous dynamic Bayesian networks implemented in this research is able to predict and construct more potential edges or interaction between genes in a subnetwork. YOR263C subnetwork/ % YPL256C subnetwork/ % a) Sensitivity 83.3 44.4 b) Specificity 80 84.9 Table 2: Sensitivity and specificity for YOR263C sub-networks and YPL256C subnetwork. Table 2 shows the sensitivity and specificity of YOR263C sub-network and YPL256C sub-network .The sensitivity of YOR263C sub-network (true positive rate) is 83.3%. This means that 5 edges that exist in Dejori (2002) are also exist in the network formed in this research as well. This shows that the Dynamic Bayesian networks implemented in this research has successfully predicted and formed all the possible existing edges or interactions between genes in the network like the one by Dejori (2002). While the specificity (true negative rate) is 80%, where there are 3 edges formed in the network in this research do not exist in the network by Dejori (2002). The 3 edges are YNR067C with YER124C, YOR263C with YNR067C and YJL196C with YNR067C. This shows that the Dynamic Bayesian networks implemented in this research is capable of uncovering more potential edges or interactions between genes if compared with Dejori (2002). While the sensitivity of YPL256C sub-network is for this network is approximately 44.4%, whereby 4 directed edges that exist in the network by Dejori (2002) have been captured by the program in this research as well. However, there is 5 directed edge that exists in network by Dejori (2002), but it does not exist in the network by this research. The specificity (true negative rate) for this sub-network is around 84.9%. This research has captured 20 new edges between nodes that were unable to be captured by Dejori (2002). This proposed method proved that it has the ability to discover more potential interaction between genes. ESCHERICHIA COLI DATASET Figure 4: Gene regulatory network construct for Escherichia coli. Activation are represented by arrows () and inhibitions by T ( l ). Rate / % a) Sensitivity 70 b) Specificity 88.9 Table 3: Sensitivity and specificity of Escherichia coli dataset compare to S.O.S. DNA Repair network of the Escherichia coli bacterium Figure 4 shows the network of Escherichia coli dataset that is constructed in this research. The constructed network consists of 8 nodes (genes), 12 directed edges and 2 cyclic regulations. The main difference between the S.O.S. DNA Repair network of the Escherichia coli bacterium with network in this research is that the edges in the network constructed in this research is it has cyclic regulation, which is it can show clearly the interactions between genes. For example, the edge formed between lexA and umuD and edge between lexA and recA in this research show the cyclic regulations. Besides that network that construct in this research is discover new interactions which are between gene polB and uvrA and gene between uvrA and uvrY. Table 3 shows the sensitivity is 70%. This means that 7 edges that exist in S.O.S. DNA Repair network of the Escherichia coli bacterium are also exist in the network formed in this research as well. This shows that the Dynamic Bayesian networks implemented in this research has successfully predicted and formed all the possible existing edges or interactions between genes in the network like the one by S.O.S. DNA Repair network of the Escherichia coli bacterium. While the specificity is 88.9%, where there are 5 edges formed in the network in this research do not exist in the network by S.O.S. DNA Repair network of the Escherichia coli bacterium. This shows that the continuous Dynamic Bayesian networks implemented in this research is capable of uncovering more potential edges or interactions between genes if compared with S.O.S. DNA Repair network. optimized performance and they performed better than the Bayesian networks developed in Dejori (2002)’s research and network by S.O.S. DNA Repair network of the Escherichia coli bacterium in discovering and predicting interactions between genes. DISCUSSION In this paper, new method for constructing gene regulatory network based on continuous dynamic Bayesian network. The advantages of this proposed method compared with other method such as Bayesian network is this proposed method can analyze the microarray data as the continuous data without the extra data pretreatments such as discretization. Even nonlinear relations can be detected and modeled by this proposed method. The gene network constructed in this research shows that the continuous Dynamic Bayesian networks implemented in this research not only have successfully predicted all the edges that have been constructed by Dejori (2002), but the continuous Dynamic Bayesian networks implemented are also able to discover more new possible edges or interactions between genes. Besides that the continuous DBN that implement in this research is able to construct cyclic regulation which is the Bayesian network not able to construct the cyclic regulation. Continuous DBN models tend to estimate many false positive in the cyclic regulation. Hence, it can be concluded that the continuous Dynamic Bayesian networks implemented in this research have REFERENCES [1] Matthew J. Beal, Francesco Falciani, Zoubin Ghahrami. (2004). “A Bayesian approach to reconstructing genetic regulatory networks with hidden factors.” [2] Sunyong Kim, Seiya Imoto, Satoru Miyano. (2004). “Dynamic bayesian network and nonparametric regression for nonlinear modelling of gene networks from time series gene expression data”. science direct . [3] Min Zou, Suzanne D. Conzen. (2004). “A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data.” [4] Zheng Li, Ping Li, Arun Krishnan, Jingdong Liu. (2011). “Large-scale dynamic gene regulatory network inference combining differential equation models with the local dynamic Bayesian network analysis.” [5] Akther Shermin and Mehmet A. Orgun (2009). “A 2-stage Approach for Inferring Gene Regulatory Networks using Dynamic Bayesian Networks.” IEEE International Conference on Bioinformatics and Biomedicine, 2009. [6] Wei-Po Lee and Wen-Shyong Tzou (2009). “Computational methods for discovering gene networks from expression data.”Briefings In Bioinformatics, 10 (4): 408- 423 [7] Bruno-Edouard Perrin, Liva Ralaivola, Aur´ elien Mazurie, Samuele Bottani, Jacques Mallet and Florence d’Alch´e–Buc (2003). “Gene networks inference using dynamic Bayesian networks” Bioinformatics, 19 (2) : 38-148