IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 56, NO. 7, JULY 2008
Inference of Noisy Nonlinear Differential Equation
Models for Gene Regulatory Networks Using Genetic
Programming and Kalman Filtering
Lijun Qian, Senior Member, IEEE, Haixin Wang, Member, IEEE, and Edward R. Dougherty, Member, IEEE
Abstract—A key issue in genomic signal processing is the inference of gene regulatory networks. These are used both to understand the role of biological regulation in phenotypic determination
and to derive therapeutic strategies for genetic-based diseases. In
this paper, gene regulatory networks are inferred via evolutionary
modeling based on time-series microarray measurements. A nonlinear differential equation model is adopted. It includes random
noise parameters for intrinsic noise arising from stochasticity in
transcription and translation and for external noise arising from
factors such as the amount of RNA polymerase, levels of regulatory proteins, and the effects of mRNA and protein degradation.
An iterative algorithm is proposed for model identification. Genetic
programming is applied to identify the structure of the model and
Kalman filtering is used to estimate the parameters in each iteration. Both standard and robust Kalman filtering are considered.
The effectiveness of the proposed scheme is demonstrated by using
synthetic data and by using microarray measurements pertaining
to yeast protein synthesis.
Index Terms—Gene regulatory network, genetic programming,
Kalman filter.
I. INTRODUCTION
The ultimate goal of the genomic revolution is to understand the genetic relations behind phenotypic characteristics of organisms. Such an understanding relies on a blueprint
that specifies the manner in which genes and proteins interact to
make a complex living system [10]. A critical step of obtaining
such a blueprint is to identify the interactions among genes via
the modeling of gene regulatory networks (GRNs). In light of
the recent development of high-throughput DNA microarray
technology, it becomes possible to discover GRNs, which are
complex and nonlinear in nature. Specifically, the increasing availability of microarray time-series data makes possible the characterization of dynamic nonlinear regulatory interactions among genes. The synthesis and analysis of GRNs constitute a major component of genomic signal processing [1].
Manuscript received June 26, 2007; revised December 30, 2007. The associate editor coordinating the review of this manuscript and approving it for publication was Xiaodong Cai. This work was supported in part by the Department of Electrical and Computer Engineering at Prairie View A&M University and the National Science Foundation (CCF-0514644) and the National Cancer Institute (R01 CA-104620).
L. Qian and H. Wang are with the Department of Electrical Engineering, Prairie View A&M University, Prairie View, TX 77446 USA (e-mail: LiQian@pvamu.edu; HWang@pvamu.edu).
E. R. Dougherty is with the Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843 USA, the Computational Biology Division of the Translational Genomics Research Institute, Phoenix, AZ 85004 USA, and the Department of Pathology of the University of Texas M. D. Anderson Cancer Center, Houston, TX 77030 USA (e-mail: edward@ee.tamu.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TSP.2008.919638
Because GRN models are difficult to deduce solely by means
of experimental techniques, computational and mathematical
methods are indispensable. Much research has been done on
GRN modeling by linear differential/difference equations using
time-series data, for example, [8], [9], [11]–[13], [15], [16], just
to name a few. The basic idea is to approximate the combined effects of different genes by means of a weighted sum of their expression levels. In [13], a connectionist model is used to model
small gene networks operating in the blastoderm of Drosophila.
In [8], the concentrations of mRNA and protein are modeled by
linear differential equations. A simple form of linear additive functions is suggested by [9], where the degradation rate of gene $i$'s mRNA and environmental effects are assumed to be incorporated in the model parameters, and their influence on gene $i$'s expression level is assumed to be linear. A method to obtain a continuous linear differential equation model from sampled time-series data is proposed in [15].
For added biological realism (all concentrations get saturated
at some point in time), a sigmoid (squashing) function may be
included into the equation. It has been shown that this sort of
quasi-linear model can be solved by first applying the inverse of
the squashing function [11].
Because GRNs are nonlinear in nature, nonlinear differential
equation models, such as an S-system [26], can model much
more complicated GRN behavior [14]. A linear model may satisfactorily model gene behavior if the GRN is operating around
a steady-state and the linear model corresponds to the linearized
model (from the nonlinear model) at that steady-state. In addition, the linear approximation holds only when the GRN has
slow dynamics around that steady-state. A possible way to make the GRN model hold (not only in the vicinity of the steady-state but also over a large range) is to include nonlinear terms such as second-order polynomials.
In our study, a GRN is modeled by continuous nonlinear
Ordinary Differential Equations (ODEs). Compared to linear
models, identification of the nonlinear differential equation
model is computationally more intensive and can require more
data; however, the range of nonlinear behaviors exhibited by
GRNs can be more thoroughly understood with nonlinear
differential equations. When more time-series data become
available owing to advances in microarray or other technologies, and assuming continued improvement in computational
capability, it can be expected that continuous nonlinear dynamic
models will play a critical role in revealing complicated gene
behavior.
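Assuming there are $n$ genes of interest and $x_i$ denotes the state (such as the microarray reading) of the $i$th gene,$^{1}$ then the dynamics of the GRN may be modeled as
$$\dot{x}_i = f_i(x_1, x_2, \ldots, x_n), \quad i = 1, 2, \ldots, n. \tag{1}$$
In this study we assume the functions $f_i$ are in the form
$$f_i(x_1, \ldots, x_n) = \sum_{k=1}^{L_i} (a_{ik} + w_{ik})\, \phi_{ik}(x_1, \ldots, x_n) + v_i \tag{2}$$
where $\phi_{ik}$ is the $k$th component of the nonlinear function $f_i$ with coefficient $a_{ik}$, $w_{ik}$ and $v_i$ are parameter noise and external noise, respectively, and it is assumed that $w_{ik}$ and $v_i$ are white Gaussian noise.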
In general, each component of the nonlinear function can be any nonlinear function. Popular choices with important biological implications include S-systems [27] and sigmoid functions [7]. For example, with power-law components and without noise, (1) becomes the well-known S-system [27], [38], that is
$$\dot{x}_i = \alpha_i \prod_{j} x_j^{g_{ij}} - \beta_i \prod_{j} x_j^{h_{ij}} \tag{3}$$
where $\alpha_i$ and $\beta_i$ are coefficients, and $g_{ij}$ and $h_{ij}$ are kinetic orders. If a sigmoid function is chosen and noise is excluded, (1) is the differential equation counterpart of the well-known weight matrix model$^{2}$ given by
(4)
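As an illustration of the S-system special case, the following is a minimal sketch that assumes the canonical S-system form, in which production and degradation are each a product of power laws. The function name `s_system_rhs` and the array layout are illustrative choices, not taken from the paper.

```python
import numpy as np

def s_system_rhs(x, alpha, beta, g, h):
    """Right-hand side of a canonical S-system (assumed form).

    x     : (n,) current states
    alpha : (n,) production rate coefficients
    beta  : (n,) degradation rate coefficients
    g, h  : (n, n) kinetic orders; row i belongs to equation i
    """
    x = np.asarray(x, dtype=float)
    # x ** g broadcasts x over rows, giving x_j ** g[i, j] at position (i, j)
    production = np.asarray(alpha) * np.prod(x ** np.asarray(g), axis=1)
    degradation = np.asarray(beta) * np.prod(x ** np.asarray(h), axis=1)
    return production - degradation

# Hypothetical two-gene example
rates = s_system_rhs(
    x=[2.0, 3.0],
    alpha=[1.0, 2.0],
    beta=[1.0, 1.0],
    g=[[1.0, 0.0], [0.0, 2.0]],
    h=[[0.0, 1.0], [1.0, 0.0]],
)
```

With these hypothetical values the first equation evaluates to 2 - 3 and the second to 18 - 2.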
In this work, polynomials are chosen as the nonlinear component in the proposed model and ODEs with dynamic polynomials are used in our test cases. The polynomials are utilized as
universal approximators. In order to mitigate the effect of “the
curse of dimensionality”, only second-degree polynomials are
selected. Note that an advantage of using low-degree polynomial models is that even when there exists some model mismatch, these models may be sufficiently accurate to represent
many real systems, and thus are widely utilized in practice [32].
We note that a similar GRN model has been adopted by [28],
but without noise being included in the model.
The proposed model includes all the major characteristics of a
gene regulatory network: it is nonlinear, dynamic, and noisy. To
the best of our knowledge, no previous work has used the same
model. The rationale behind the proposed model is twofold: first, the proposed model is general and sufficiently flexible to include many well known models and new models yet to be found; second, the noisy nature of GRNs is modeled explicitly. The deterministic model (without noise) corresponds to the nominal case, while the various stochastic effects are included as noise disturbances. For example, there is considerable experimental evidence that indicates the presence of significant stochasticity in transcriptional regulation in both eukaryotes and prokaryotes [4]. The inherent stochasticity of biochemical processes (transcription and translation) is modeled as noise in the parameters, which corresponds to the “intrinsic noise” mentioned in the literature [5]. Other effects, such as those from genes not included in the microarray, the amount of RNA polymerase, levels of regulatory proteins, and the effects of mRNA and protein degradation, are modeled by the external noise [5]. Previous work has modeled these noise types by Gaussian white noise processes [6]. The inclusion of noise also enables the proposed model to provide interpretation of the fact that GRNs are robust to noise, by which it is meant that the relationships among genes are not greatly affected by small changes caused by noise.
$^{1}$In this paper, we consider the case where the states ($x_i$) are the microarray readings. Thus the measurement equation is not needed.
$^{2}$In [11], the weight matrix model is a difference equation model rather than a differential equation model.
The nonlinear functions $f_i$ need to be identified from time-series microarray measurements such that the identification error is minimized and the simplest model structure is selected. In this paper, the criteria for selecting $f_i$ are represented by a fitness function, and modeling a GRN becomes a nonlinear optimization problem (minimization of fitness functions). We provide a framework to infer the proposed nonlinear
ODE model with noise using time-series data, where Genetic
Programming and Kalman filtering are applied. Both synthetic
data and experimental data from microarray measurements
are used to evaluate the proposed method. Note that although
the proposed method is tested only using polynomials as the
nonlinear terms, it is expected that it should perform similarly
well for other choices of nonlinear terms in the proposed
model, dependent of course on sufficient data for more complex
nonlinear models.
The remainder of the paper is organized as follows: The proposed framework and the iterative algorithm are illustrated in
Section II. Section III presents the Robust Kalman filter that
mitigates the effect of inaccuracy in noise statistics estimation.
Simulation results are given in Section IV. Discussions of applying the proposed method to nonlinear models other than the
polynomial case are provided in Section V. Section VI contains
some concluding remarks.
II. METHODOLOGY AND ALGORITHM DESCRIPTION
Several design challenges have to be addressed when solving
the nonlinear optimization problem. A common difficulty in
GRN inference is that the problem is under-determined. In a typical microarray experiment, the number of the sampled data is
much smaller than the number of genes involved. For example,
there are thousands of genes and only 17 data points in the yeast
data set [34]. Hence, the system is under-determined and there
are infinitely many solutions. As pointed out in several previous
studies, such as [10], choosing a solution from the many plausible ones is a difficult task. In [8], two algorithms (minimum
weight solutions to linear equations and Fourier transform for
stable systems) are provided to construct the GRN model from time-series data. A different approach is proposed by [9], where singular value decomposition is used to generate an initial solution, which is then refined by robust regression. Another proposed approach is to apply cubic interpolation between successive measurements to increase the total amount of data to the point that the linear equations become over-determined [12]. These techniques for linear models do not apply to nonlinear models.
Fig. 1. Block diagram of GRN identification using GP and Kalman filtering.
In this study, we have developed a systematic method to infer
a GRN represented by a nonlinear ODE with large dimensionality using rather short length time-series data. We rely on three
aspects of our approach to address this issue:
1) The identification problem is decoupled into sub-problems, with the $i$th sub-problem focusing on the $i$th gene. Because the time-series data of other genes are fixed (from measurements) when we are focusing on an individual gene, we can solve the identification problem one gene at a time. This approach makes the inference of large GRNs feasible. Similar decoupling procedures have been used in previous studies such as the inference of S-system models [18], [19]. In the $i$th sub-problem for the $i$th gene, the number of parameters needing to be estimated is denoted $r_i$.
2) According to a recent result by Sontag [20], $2r+1$ measurements are enough for identification of a set of differential equations with $r$ unknown parameters (if experiments are designed properly, such as the one mentioned in [20]). The $2r+1$ bound is an upper bound. In our case, the minimum number of data points needed to estimate the $r_i$ parameters within the $i$th sub-problem is $2r_i+1$. For example, the 17 data points in the yeast data set [34] will allow us to estimate up to 8 parameters in the $i$th equation. Usually, since the GRN tends to be a sparse network, we do not expect many terms on the right-hand side of the ODE (usually 8 being more than enough).
3) The Kalman filter provides optimal estimation with excellent convergence speed. Thus, relatively short-length timeseries data is sufficient for the Kalman filter to converge. In
the simulations (Section IV) we will show that the squared
error of the estimation quickly converges close to zero.
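The data requirement in item 2) reduces to simple arithmetic. The sketch below assumes the $2r+1$ measurement rule of [20] and shows how the 17 yeast data points translate into a parameter budget per equation. The function names are illustrative.

```python
def measurements_needed(num_parameters: int) -> int:
    """Measurements sufficient to identify `num_parameters` parameters (2r + 1 rule)."""
    if num_parameters < 0:
        raise ValueError("num_parameters must be non-negative")
    return 2 * num_parameters + 1

def max_identifiable_parameters(num_measurements: int) -> int:
    """Largest r such that 2r + 1 does not exceed the available measurements."""
    if num_measurements < 1:
        raise ValueError("at least one measurement is required")
    return (num_measurements - 1) // 2

# 17 data points, as in the yeast data set, give a budget of 8 parameters
budget = max_identifiable_parameters(17)
```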
There are, of course, limitations to the method when it comes to the number of genes. Given the computational environment currently being utilized, we are confident that the algorithm can handle more than enough genes for the application at which we are ultimately aiming, namely, utilization of control theory to derive intervention strategies to beneficially affect network dynamics.
As applied thus far in the context of discrete Markovian regulatory networks using dynamic programming in the finite horizon
case [2] or developing a stationary policy in the infinite horizon
case [3], owing to computational reasons the number of genes
has been kept small, typically no more than 15. The initial set of
genes can be selected via existing biological knowledge, some
data driven method to find a gene family in which there is substantial intergene interaction, or from a hybrid of the two.
In this paper, a two-step nested optimization procedure is proposed to identify the nonlinear differential equation for each individual gene. Genetic programming (GP) is applied to determine the nonlinear terms (global optimization [32], [37]) and
then the corresponding parameters associated with each term
are estimated by Kalman filtering (local optimization) in each
iteration. Such a decomposition of the problem into a structural
part solved by GP and a parameter optimization part solved
by Kalman filtering reduces the complexity significantly and
speeds up convergence. The optimization procedures are illustrated in Fig. 1. Note that a similar method has been used in [28];
however, Recursive Least Square (RLS) rather than Kalman filtering is applied in [28], since noise is not modeled in that study.
In a related work (again in which noise is not considered), a genetic algorithm is embedded in genetic programming, with the latter being employed to discover and optimize the model structure and the former being used to optimize its parameters [33].
Note that most of the previous linear and nonlinear differential equation models fix the linear or nonlinear terms in the equations, and then the inference problem becomes a parameter estimation problem. Our model assumes a quasi-structure of the model, i.e., we provide candidate terms and let Genetic Programming and Kalman filtering decide which terms should exist in the model and the corresponding parameters.
A. Genetic Programming
Within each sub-problem, the nonlinear terms in the equation first have to be determined. Genetic programming [21] is a type of evolutionary algorithm. All evolutionary algorithms work with a population of individuals, where each individual may be a solution of the optimization problem. GP operates on a tree structure, which is flexible enough to represent relationships efficiently. The leaves of a tree represent variables or constants, while the other nodes implement operators. An example of a tree structure is shown in Fig. 2, where two operations, multiplication and addition/subtraction, are used. Mutation and crossover operations may be performed to generate offspring. Selection of better performing individuals (with smaller fitness value, thus minimizing identification error while favoring the simplest model structure) ensures that the population evolves towards solving the optimization problem.
Fig. 2. Example tree structure of a nonlinear differential equation.
B. Kalman Filter
Another step to determine the equation of each gene is to estimate the parameters, while recognizing that noise effects need to be mitigated. As living systems are optimized to function in the presence of noise, the corresponding mathematical models that attempt to explain these systems should be robust relative to noise. Untreated noise in GRN inference may lead to impractical GRN models and eventually to incorrect biological or medical conclusions. Thus, noise modeling is essential for better descriptions of GRNs. Noise models not only can give more realistic simulations of biological systems but also can serve as a basis for analyzing the robustness of mathematical models with respect to noise. In this paper, noise is modeled via Gaussian white noise processes, and Kalman filtering is employed to optimize the GRN model by mitigating the effects of noise, thereby enhancing the obtained model relative to both robustness and stability.
Kalman filtering provides minimum-mean-square-error estimation of the state of a stochastic linear system disturbed by Gaussian white noise. In our proposed scheme, the Kalman filter is applied to estimate the coefficients of the GRN model. Although the proposed GRN model is nonlinear, it is linear in terms of its coefficients. In addition, the filtering problem is fully decoupled so that the Kalman filter can be applied to each individual equation. The corresponding state and measurement equations are
(5)
(6)
where the state vector contains the parameters to be estimated. The process noise vector $w$ represents the uncertainties in the parameters, and its covariance matrix is $Q$. The measurement matrix contains all the modules, i.e., the terms of the equation evaluated on the data. The measurement noise $v$ is the external noise in the GRN, and its covariance matrix is $R$. The noise vectors $w$ and $v$ are statistically independent.
For example, suppose the equation for the $i$th gene in the GRN model is
(7)
Then the state vector collects the coefficients of (7) and the measurement matrix collects its terms. Both the measurement and the measurement matrix are calculated from the measurement data obtained from microarray experiments.
The implementation of the Kalman filter (for the equation of the $i$th gene) is given by the following equations [23] (the subscript $i$ being dropped for simplicity):
(8)
(9)
(10)
(11)
(12)
where $K$ is the Kalman filter gain and $P$ is the covariance matrix of the error. The superscripts $-$ and $+$ indicate the a priori and a posteriori values of the variables, respectively.
$\hat{x}^-$ and $\hat{x}^+$ are the prior and posterior estimates, respectively. $Q$ and $R$ are the covariance matrices of the parameter noise and external noise, respectively. The filter is started from given initial conditions for the estimate and the error covariance.
Fig. 3. Block diagram of Kalman filter.
A block diagram of the Kalman filter is given in Fig. 3. In general, the Kalman filter may be interpreted as a one-step predictor with an appropriate gain calculator [22]. Specifically, the block “one-step predictor” corresponds to (10), the block “Kalman Filter gain calculator” corresponds to (11), and the block “Riccati equation solver” corresponds to (12).
Convergence of the Kalman filter is an important issue [23]. The rate of convergence is defined as the number of iterations to obtain the optimum estimates. The convergence of the Kalman filter includes the convergence of the estimates and the convergence of the estimation error. Convergence will be studied in detail in the simulations (Section IV).
In practice, noise statistics (such as the covariance matrices) may not be known and need to be estimated. The Kalman filter is sensitive to the estimation error of noise statistics. Poor estimates of the noise covariance can result in filter divergence. A robust Kalman filter is presented in Section III to compensate for the uncertainties in the estimates of the noise covariance.
In [17], the EM algorithm is used to estimate both the state transition matrix and the observation matrix in a linear state space model of a GRN, and the Kalman filter is applied as a smoother. The Kalman filter is also applied in [31], where a two-stage method is implemented to infer a GRN from time-series data. First, a genetic algorithm and expectation maximization algorithm are used to cluster the genes, and then a linear state space model is adopted and the Kalman filter is applied to estimate and predict gene expressions. However, in both [17] and [31], a linear (rather than nonlinear) state space model of a GRN is adopted. Furthermore, the sensitivity of the Kalman filter with respect to inaccurate noise statistics is not discussed.
C. Proposed Iterative Algorithm
The task of identifying GRNs may be considered as an optimization problem. The goal is to minimize the identification error and keep the model as simple as possible, which may be achieved by minimizing the following fitness function:
(13)
where the identification error is computed over the data points from the target time series and the time series given by the differential equation that a GP individual represents, and a penalty term is added. The penalty term depends on the specific model chosen for a GRN. In this paper, the penalty is chosen as the number of terms on the right-hand side of (2). Two weights are used for joint optimization of the identification error and the complexity of the GRN. If the weight on the identification error dominates, a very fine-grained model will be obtained with many terms. On the contrary, if the weight on the penalty dominates, a very rough model with few terms will be obtained. The scale of the identification-error term will vary depending on the characteristics of data sets (such as the number of time points, the number of genes, the amplitude of the signal). In this study, we set the weights by means of trial and error to balance the effects of the two weights.
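To make the trade-off between identification error and model complexity concrete, the following sketch represents a GP individual as a nested tuple and scores it with a weighted sum of a mean squared error and a term count. This is an illustration under stated assumptions, not a reproduction of (13): the tuple encoding, the names `evaluate`, `count_terms`, and `fitness`, and the weights `w_err` and `w_complexity` are all hypothetical, and the error is computed directly on supplied target values rather than on a time series obtained by integrating the ODE.

```python
import numpy as np

# A GP individual as a nested tuple (assumed encoding):
#   ("var", j), ("const", c), ("+", left, right), ("-", left, right), ("*", left, right)

def evaluate(tree, x):
    """Evaluate an expression tree at the state vector x."""
    kind = tree[0]
    if kind == "var":
        return x[tree[1]]
    if kind == "const":
        return tree[1]
    left, right = evaluate(tree[1], x), evaluate(tree[2], x)
    if kind == "+":
        return left + right
    if kind == "-":
        return left - right
    if kind == "*":
        return left * right
    raise ValueError(f"unknown node type: {kind}")

def count_terms(tree):
    """Number of additive terms, i.e., leaves of the top-level +/- chain."""
    if tree[0] in ("+", "-"):
        return count_terms(tree[1]) + count_terms(tree[2])
    return 1

def fitness(tree, X, target, w_err=1.0, w_complexity=0.1):
    """Weighted sum of mean squared error and number of terms (smaller is better)."""
    predicted = np.array([evaluate(tree, x) for x in X], dtype=float)
    mse = float(np.mean((np.asarray(target, dtype=float) - predicted) ** 2))
    return w_err * mse + w_complexity * count_terms(tree)

# Hypothetical individual: x0 * x1 + x2
individual = ("+", ("*", ("var", 0), ("var", 1)), ("var", 2))
X = [[1.0, 2.0, 3.0], [2.0, 2.0, 1.0]]
score = fitness(individual, X, target=[5.0, 5.0])
```

Raising `w_complexity` relative to `w_err` favors individuals with fewer terms, which mirrors the role of the two weights described above.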
Since it is a global nonlinear optimization problem, a nested
optimization structure is adopted, where GP is applied to determine the nonlinear terms (global optimization) and Kalman
filtering is employed to estimate the corresponding parameters
for each term (local optimization) in each iteration. Such decomposition into a structural part solved by GP and a parameter optimization part solved by Kalman filtering reduces the
complexity significantly and speeds up convergence [35]. The
detailed procedures of the proposed iterative algorithm are illustrated in Fig. 4. The GP process has four operations: reproduction, crossover, mutation and selection. Kalman filtering is
employed to estimate the parameters for every generation.
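The parameter-estimation step performed in every generation can be sketched as a Kalman filter whose state is the coefficient vector of one gene's equation. The sketch assumes a textbook formulation, not the paper's exact equations (8)-(12): the coefficients follow a random walk driven by process noise with covariance `Q`, each measurement is a scalar with noise variance `R`, and row `h_k` of `H` holds the candidate terms evaluated at sample `k`. All names are illustrative.

```python
import numpy as np

def kalman_estimate_coefficients(H, y, Q, R, theta0, P0):
    """Estimate coefficients theta in y_k = h_k . theta + v_k (assumed model).

    H      : (T, m) candidate terms evaluated at each sample
    y      : (T,)   scalar measurements
    Q      : (m, m) process (parameter) noise covariance
    R      : float  measurement (external) noise variance
    theta0 : (m,)   initial estimate
    P0     : (m, m) initial error covariance
    """
    theta = np.asarray(theta0, dtype=float).copy()
    P = np.asarray(P0, dtype=float).copy()
    history = []
    for h_k, y_k in zip(np.asarray(H, dtype=float), np.asarray(y, dtype=float)):
        # Time update: random-walk parameters keep their mean, covariance grows by Q
        P_prior = P + Q
        # Gain: scalar innovation variance for a scalar measurement
        innovation_var = h_k @ P_prior @ h_k + R
        K = P_prior @ h_k / innovation_var
        # Measurement update of the estimate and the error covariance
        theta = theta + K * (y_k - h_k @ theta)
        P = P_prior - np.outer(K, h_k) @ P_prior
        history.append(theta.copy())
    return theta, P, np.array(history)

# Hypothetical noise-free example with three coefficients
rng = np.random.default_rng(0)
theta_true = np.array([2.0, -1.0, 0.5])
H = rng.normal(size=(50, 3))
y = H @ theta_true
theta_hat, P_final, history = kalman_estimate_coefficients(
    H, y, Q=np.zeros((3, 3)), R=1e-4, theta0=np.zeros(3), P0=100.0 * np.eye(3)
)
```

Because the model is linear in its coefficients, no linearization is needed even though the terms in `H` may be nonlinear functions of the gene states.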
III. ROBUSTNESS ANALYSIS
The standard Kalman filter provides optimal estimates if the
noise is white Gaussian and the statistics (covariance matrices)
of the noise are known. The Kalman filter is optimal in the sense
that it minimizes the trace of the estimation error’s covariance matrix. Unfortunately, the noise covariance matrices are
usually unknown, or at least not known exactly, in practical situations, such as in microarray experiments. Hence, it is critical
to design a robust Kalman filter adaptive to the uncertainties in
noise statistics.
Robust Kalman filter design considering uncertainties in
noise statistics can be found in [24] for continuous-time systems and in [25] for discrete-time systems. We follow the
approach of [25] to derive the performance index of the robust
Kalman filter and propose a genetic algorithm based search
procedure to find the optimal robust Kalman filter gain.
Define the estimation error as
(14)
From the standard Kalman filter, the dynamics of the estimation
error can be derived as
(15)
Fig. 4. Genetic programming process with Kalman filter.
Then $P$, the estimation error’s covariance at steady-state, satisfies the following equation:
(16)
$P$ may be decoupled into two parts,
(17)
that represent the estimation error’s covariance due to process and measurement noise, respectively. It is straightforward to verify that
(18)
(19)
Suppose the noise covariance contains uncertainties
(20)
(21)
where the uncertainties enter through random variables that are uncorrelated with each other. The corresponding steady-state error covariance matrix becomes
(22)
Using a similar decomposition as before
(23)
where the two parts represent the estimation error’s covariance due to the uncertain process and measurement noise, respectively. Each of them contains a nominal term and a term due to noise uncertainty, i.e.,
(24)
(25)
where the terms due to noise uncertainty satisfy
(26)
(27)
From the above equations we may derive the following simple relation
(28)
The standard Kalman filter minimizes the performance index $\mathrm{tr}(P)$. The variation of the performance index is given by
(29)
The mean of this variation is zero. Its variance is
(30)
In order to make the filter robust to the noise uncertainties, the variance of the changes in the performance index needs to be minimized. In addition, the filter should perform well under nominal conditions. Hence, the following weighted performance index is adopted to address the tradeoff between nominal and off-nominal conditions [25]
(31)
where two weighting factors balance the nominal term and the variance term. When all the weight is placed on the nominal term, the filter becomes the standard Kalman filter; when all the weight is placed on the variance term, the filter is optimal under off-nominal conditions.
A gradient descent method is suggested by [25] to search for the robust Kalman filter gain. The authors in [25] point out that “special care has to be taken to come up with the gradient descent step size and the perturbation size to find the partial derivative. Computationally this method is time consuming but this is a straightforward method of realizing a new Kalman gain”. However, it is not given in [25] how the step size should be chosen.
Because it is very difficult to choose the step size to avoid local minima, and the method in [25] is computationally expensive, a genetic algorithm (GA) is used in this paper to search for the robust Kalman filter gain. Note that our method is different from the gradient descent in [25]. In addition, our method has the capability of avoiding local minima and converges fast because GA is computationally efficient and the search interval is limited to a reasonable range.
The procedure of our approach is as follows:
1) Find the standard Kalman filter gain at steady state.
2) Generate candidate robust Kalman filter gains in a limited range around the standard Kalman filter gain. Calculate their respective performance indices. Keep some small percentage of the top candidates for the next generation, perform mutation on another small percentage of the candidates, and perform crossover on the bulk of the candidates. Go to the next generation.
3) Stop if the performance index cannot be further improved or the maximum number of iterations is reached.
The intuition behind choosing such a search interval is that it is expected that the new gain will not be very far away from the standard Kalman filter gain. Simulation results show that the proposed robust Kalman filter gives much better parameter estimates than the standard Kalman filter when the noise covariances are not known exactly.
IV. SIMULATION EVALUATION
In the simulation study, both synthetic data and real microarray measurements are used to evaluate the proposed algorithm. A robust Kalman filter is also tested against a standard Kalman filter when the noise covariance matrices are not fixed.
A. Synthetic Data
In this part of the simulation, we use data of a metabolic network, called the E-cell system (a part of the biological phospholipid pathway), that consists of three substances and compare our algorithm with the approach in [28], where GP and RLS estimation were used without considering noise. This network can be approximated as
(32)
The last equation is added to the synthetic model for testing whether the proposed method would create false positives. Here the gene described by the last equation is not involved in regulatory interactions with other genes. It is included to see if it is omitted from the obtained GRN.
Assuming parameter and external noise in the E-cell network, the equations become
(33)
where parameter noise and external noise terms are added, with covariance matrices $Q$ and $R$, respectively. It is assumed that the parameter noise and the external noise are uncorrelated.
Since there are three substances in the E-cell system, it is assumed that the tree structure should include a subset of candidate terms of degree at most two in the state variables on the right-hand side of the differential equation. In other words, a degree-2 polynomial model is adopted. 1000 individuals are first produced and ranked according to the fitness value. The 5% of the individuals with the minimum fitness value are kept for the next generation. Crossover is performed on 80% of the individuals, mutation on 10%, and the remaining 5% are for other operations. The coefficients in the E-cell model are determined by Kalman filtering. Fig. 5 shows the convergence of the Kalman filter for the E-cell model.
Fig. 5. Convergence of the Kalman filter for the E-cell model: estimation error versus number of iterations.
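The degree-2 candidate set and the per-generation split of the population described above can be sketched as follows. The helper names, the variable naming `x1, x2, ...`, and the inclusion of a constant term are assumptions made for illustration; the exact candidate list used in the experiment is not reproduced here.

```python
from itertools import combinations_with_replacement

def degree2_terms(num_vars):
    """Monomials of degree at most two in num_vars variables (constant included by assumption)."""
    names = [f"x{i + 1}" for i in range(num_vars)]
    terms = ["1"] + names
    terms += ["*".join(pair) for pair in combinations_with_replacement(names, 2)]
    return terms

def generation_split(population_size, elite=0.05, crossover=0.80, mutation=0.10):
    """Number of individuals handled by each operation in one generation."""
    n_elite = round(population_size * elite)
    n_cross = round(population_size * crossover)
    n_mut = round(population_size * mutation)
    n_other = population_size - n_elite - n_cross - n_mut
    return {"elite": n_elite, "crossover": n_cross, "mutation": n_mut, "other": n_other}

candidates = degree2_terms(3)     # three substances
split = generation_split(1000)    # population size used in the simulation
```

For three variables this gives one constant, three linear terms, and six quadratic terms, so the number of coefficients per equation stays small compared with higher-degree models.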
TABLE I
OBTAINED PARAMETERS BY GP + RLS AND GP + KF WHEN NOISE IS PRESENT
Fig. 6. True positive rate versus false positive rate when noise level increases (the covariances are 0, 0.1, 1.0, 10, 30, 50, 90, respectively, and the same values are used for Q and R).
The resulting models using GP+RLS and GP+KF are listed in Table I. The true network structure is obtained for both methods (under reasonable noise levels). Since the gene added for testing is not involved with other genes, only one parameter associated with it is not zero in the obtained GRN. In other words, there are no (false) edges between this gene and the other genes.
Because predicting true edges is only useful if the number of false edges predicted is reasonably low, it is important to examine the True Positive rate versus the False Positive rate as the noise level increases. Results similar to (but different from) the ROC analysis in [36] are given in Fig. 6 for the E-cell simulation. It is observed that the proposed method is quite robust to increased noise. Even when the noise level grows to 10, the True Positive rate is still 100% and the False Positive rate is below 25%.
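The two rates plotted in Fig. 6 can be computed from a true and an inferred adjacency matrix as in the sketch below. It assumes that an edge is any nonzero entry and that all entries of the matrix are counted; the function name `edge_rates` and the example matrices are hypothetical.

```python
import numpy as np

def edge_rates(true_adj, inferred_adj):
    """Return (true positive rate, false positive rate) over all matrix entries."""
    true_adj = np.asarray(true_adj, dtype=bool)
    inferred_adj = np.asarray(inferred_adj, dtype=bool)
    tp = np.sum(true_adj & inferred_adj)     # edges correctly recovered
    fn = np.sum(true_adj & ~inferred_adj)    # edges missed
    fp = np.sum(~true_adj & inferred_adj)    # edges wrongly added
    tn = np.sum(~true_adj & ~inferred_adj)   # non-edges correctly left out
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return float(tpr), float(fpr)

# Hypothetical 3-gene example: all true edges found, one false edge added
true_net = [[1, 1, 0], [0, 1, 0], [0, 0, 1]]
inferred_net = [[1, 1, 1], [0, 1, 0], [0, 0, 1]]
tpr, fpr = edge_rates(true_net, inferred_net)
```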
The results of the concentration levels of the three substances
are shown in Fig. 7. We observe that under noisy conditions, GP
plus Kalman filter performs well and Kalman filtering is a much
better choice than the RLS algorithm with noise present.
B. Robust Kalman Filter
Fig. 7. E-cell simulation by RLS and Kalman filtering.
We now test the robust Kalman filter discussed in Section III
using the E-cell model (given in Section IV-A); however, instead
of fixed covariance matrices, it is assumed that the covariance
matrices are not known exactly, that is, the covariance matrices
of and are and , where and are the same as in Section IV-A, and are
random variables with and , . Variances are given by , , , , , and . The random
variables and are uncorrelated for all .
A genetic algorithm (GA) is used to search for the optimal robust Kalman filter gain for the objective function defined by
(31), where and . The results are summarized
in Table II. It is observed that when there are uncertainties in
the noise covariance matrices, the robust Kalman filter gives
much more accurate estimates of the parameters than
the standard Kalman filter. It is also interesting that when the noise
covariances are not known exactly, the robust Kalman filter can
achieve a level of performance similar to that of the standard Kalman
filter when the noise covariances are known exactly.
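The idea of searching for a gain that remains accurate under covariance uncertainty can be illustrated with a small sketch. Several assumptions are made here and none of them comes from the paper: the model is a scalar random-walk parameter observed in noise, the cost is a stand-in for the objective (31), which is not reproduced in this section (here the nominal steady-state error variance plus its standard deviation over sampled covariance perturbations), and a plain grid search replaces the genetic algorithm. All names and numerical values are hypothetical.

```python
import numpy as np


def steady_state_error_var(K, Q, R):
    """Steady-state error variance of the fixed-gain estimator
    theta_hat <- theta_hat + K * (y - theta_hat) for the scalar model
    theta[k+1] = theta[k] + w, y[k] = theta[k] + v, var(w) = Q, var(v) = R.
    It solves P = (1 - K)^2 (P + Q) + K^2 R for P (valid for 0 < K < 2)."""
    return ((1.0 - K) ** 2 * Q + K ** 2 * R) / (K * (2.0 - K))


def nominal_gain(Q, R):
    """Steady-state Kalman gain of the same scalar model."""
    M = 0.5 * (Q + np.sqrt(Q * Q + 4.0 * Q * R))  # a priori error variance
    return M / (M + R)


def robust_cost(K, Q, R, dQ, dR, w_mean=1.0, w_spread=1.0):
    """Assumed cost: nominal error variance plus its spread over perturbations."""
    P_nom = steady_state_error_var(K, Q, R)
    P_pert = steady_state_error_var(K, Q + dQ, R + dR)
    return w_mean * P_nom + w_spread * np.std(P_pert)


rng = np.random.default_rng(0)
Q, R = 0.1, 1.0                              # nominal covariances (made up)
dQ = rng.uniform(-0.05, 0.05, size=2000)     # sampled perturbations of Q
dR = rng.uniform(-0.8, 0.8, size=2000)       # sampled perturbations of R

K_nom = nominal_gain(Q, R)
candidates = np.append(np.linspace(0.01, 1.0, 200), K_nom)
costs = np.array([robust_cost(K, Q, R, dQ, dR) for K in candidates])
K_rob = candidates[np.argmin(costs)]         # grid search instead of a GA
```

By construction, the selected gain `K_rob` has a cost no larger than that of the nominal gain `K_nom`, while `K_nom` still gives the smallest error variance when the nominal covariances are exact.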
C. Scalability Analysis
In order to study the scalability of the proposed method, a
synthetic GRN with 50 genes is used (the detailed nonlinear
ODE model is available at: http://www.old.pvamu.edu/edir/lijun/GRN50.html).
The proposed method is tested under various noise levels and different lengths of available time series
data. It is observed in Fig. 8 that the mean square error between the exact model and the obtained model decreases with
increased length of available time series data and decreased
noise level, as expected. It is also observed that the proposed
method performs well when the noise level is not too high and
the length of available time series data is not too short.
TABLE II
OBTAINED PARAMETERS AND THE CORRESPONDING PERFORMANCE INDEX BY
GP + KF AND GP + RKF WHEN THERE ARE UNCERTAINTIES IN NOISE
COVARIANCE MATRICES. K: STANDARD KF GAIN; K: ROBUST KF GAIN
Fig. 8. Inference of a 50-gene synthetic GRN.
D. Yeast Data
We consider time-series gene-expression data corresponding
to yeast protein synthesis. Here, the data for 12 genes (HAP1,
CYB2, CYC7, ROX1, CYT1, HAP2/3/4, CYC1, COX5A,
COX5B_ex1, GPD2) are picked because the relations among
them have been revealed by biological experiments. For example, HAP1 represses the nuclear encoding cytochrome gene
CYC7 under the anaerobic condition; CYB2 activates CYC7;
HAP1 is a repressor and it represses other genes [29]. The states
of the 12 genes are represented by , respectively.
The trace of the time-series microarray measurement data
(raw data) from [34] of the 12 genes of interest is shown in
Fig. 9, where 17 sampling data points are provided for each
gene by the experiments. The data is plotted in log scale for the
convenience of representation only. The sampling data points
are evenly spaced and the observation interval is 10 minutes.
The measurement data is originally from
http://www.genomics.stanford.edu/yeast_cell_cycle/full_data.html, where related references are also available.
Fig. 9. Microarray measurement data of the 12 genes of interest (17 data points
per gene, sampled every 10 minutes).
It is assumed that the nonlinear terms in the nonlinear differential equation model are degree-2 polynomials. Because the
measurements have large values (ranging from several hundred
to more than a thousand) and the rates of change of the genes
are not large (as observed from the traces in Fig. 9), it is expected
that the parameters of the GRN model will be small. The noise
covariance matrices and are set to be diagonal matrices
with on the diagonal. In the simulation, 1000 individuals
are produced in each generation, and 100 generations are computed
to reach the minimum fitness values.
The following model is obtained by the proposed algorithm
(without loss of generality, all the noise terms are dropped in the
equations for simplicity of presentation):
The detailed interactions among the 12 genes deduced from
the obtained model are shown in Fig. 10. The obtained model
possesses the following benefits:
1) The obtained relationships among genes are in agreement
with biological experimental findings (as far as we know).
For example, we observe that the Heme activator protein
(HAP1) represses gene CYC7. HAP1 behaves as a repressor [29]. CYC7 is expressed under hypoxic conditions
and activated by CYB2. It is also observed that HAP1
activates COX5B. It is known that HAP1 functions as
a homodimer to activate oxygen-dependent expression
of COX5B [30]. ROX1 activates HAP4, HAP4 activates
HAP2, and HAP2 and HAP4 are the only 2 genes that
activate CYT1. Again, CYT1 is known to be activated by
HMG-domain site-specific DNA binding protein ROX1.
Budding yeast HAP2 is required in concert with HAP3
and HAP4 to form a heterotrimeric CCAAT-binding transcriptional activation complex at the UAS2 element of
CYC1. All of the above results agree with the findings in
[30], where detailed biological explanations are provided.
2) The obtained model reveals not only qualitative but also
quantitative relationships among genes.
3) The obtained model shows that there exist both negative
feedback and positive feedback in the GRN. For instance,
many genes (HAP1, CYC7, ROX1, CYT1, HAP2/3/4,
CYC1, COX5B_ex1) regulate themselves by negative
feedback. However, COX5A regulates itself by positive
feedback. More interestingly, there also exist both negative
feedback loops and positive feedback loops. For example,
CYB2 activates CYC7, and CYC7 represses CYB2, which
forms a negative feedback loop. On the contrary, HAP4 and
CYC1 activate each other through a positive feedback loop
between them. HAP4 and CYC1 will not be out of control
since they are also suppressed by many other genes.
4) The obtained model also shows the detailed process of how
genes work together to regulate other genes. For example,
the equation for CYC7 shows that CYB2 activates
CYC7, and the equation for CYB2 shows that HAP1
represses CYB2, which in turn shows that HAP1 represses
CYC7 through CYB2. This kind of detail is not available
in many other existing models.
5) One gene may play opposite roles when collaborating with
different genes. For instance, the equation for CYT1
shows that CYC7 will repress CYT1 by itself; however,
CYC7 will activate CYT1 when HAP2/4 is activated. In
this case, the relationship from CYC7 to CYT1 cannot be
determined during this time course.
6) Sometimes it is possible to determine a gene’s effect
during this time course even if it plays opposite roles
when collaborating with different genes. For example,
the equation for ROX1 shows that HAP3 will stimulate
the production of HMG-domain site-specific DNA
binding protein ROX1 when collaborating with HAP4;
however, HAP3 will repress the synthesis of ROX1
when collaborating with another gene, CYC1. Because
the repressing term dominates throughout the entire time
course, the collective effect shows that HAP3 will repress
ROX1 during this time course.
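The reasoning about indirect regulation and feedback loops in items 3) and 4) can be made concrete with a small signed-graph sketch. It uses only interactions stated above and the common convention that the sign of a path or loop is the product of the signs of its edges (+1 for activation, -1 for repression); the function names `path_sign` and `loop_sign` are hypothetical.

```python
# (regulator, target): sign, with +1 = activation and -1 = repression.
edges = {
    ("HAP1", "CYB2"): -1,   # HAP1 represses CYB2
    ("CYB2", "CYC7"): +1,   # CYB2 activates CYC7
    ("CYC7", "CYB2"): -1,   # CYC7 represses CYB2
    ("HAP4", "CYC1"): +1,   # HAP4 and CYC1 activate each other
    ("CYC1", "HAP4"): +1,
}


def path_sign(path, edges):
    """Net sign of a regulatory path given as a list of gene names."""
    sign = 1
    for regulator, target in zip(path, path[1:]):
        sign *= edges[(regulator, target)]
    return sign


def loop_sign(cycle, edges):
    """Net sign of a feedback loop; the last gene regulates the first."""
    return path_sign(list(cycle) + [cycle[0]], edges)


hap1_on_cyc7 = path_sign(["HAP1", "CYB2", "CYC7"], edges)   # indirect effect
cyb2_cyc7_loop = loop_sign(["CYB2", "CYC7"], edges)         # two-gene loop
hap4_cyc1_loop = loop_sign(["HAP4", "CYC1"], edges)         # two-gene loop
```

The products reproduce the statements above: the path from HAP1 through CYB2 to CYC7 is negative (repression), the CYB2/CYC7 loop is negative, and the HAP4/CYC1 loop is positive.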
In general, the proposed model shows the versatility of GRNs
and hopefully helps us better understand the structure and dynamics
of GRNs. Note that the number of possible choices
on the right-hand side of each equation is large, so the computational complexity is not low. A PC with a 3 GHz Intel Pentium-4
processor is used in the simulation. It takes the PC about 6 hours
to obtain the model.
V. DISCUSSIONS
The proposed approach of decoupling and state-space formulation for nonlinear system identification is not restricted to the
polynomial case. In fact, the proposed method may be applied to
many other nonlinear models. For example, it may
be applied to the sigmoid model of GRNs, which is not based on
polynomials. In this section, we demonstrate the use of the
proposed method with the GRN model based on sigmoidal functions.
The GRN model using sigmoidal functions can be written as
(34)
where and are two parameters, is the weight
value for gene on gene , is an offset parameter, and and
are intrinsic noise and external noise, respectively.
Again, the nonlinear identification problem can be decoupled
into sub-problems, with the $i$th sub-problem focusing on the
$i$th gene. Because the time-series data of the other genes are fixed
values when we focus on an individual gene, the identification
problem can be solved one gene at a time. The
fitness function of the $i$th sub-problem is given by
(35)
which involves the number of data points, the target time
series, and the obtained time series given by the obtained differential equation. The model for each individual gene is
given by
(36)
where
(37)
Now, instead of determining the parameters of all genes simultaneously,
only the parameters of a single gene need to be estimated in each sub-problem.
Thus the computational complexity is greatly reduced.
Fig. 10. Interactions among the 12 genes of yeast.
In order to apply a linear estimator, the above equation can be
rearranged as
(38)
Equation (38) is now linear in the parameters to be estimated,
so a linear estimator may be applied to
estimate them in each iteration. When noise is not considered,
a recursive least squares (RLS) estimator may be used. If the
noise is modeled as a Gaussian white noise process with known
statistics, a Kalman filter may be used to obtain the optimal estimate.
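Once the model is linear in the unknown parameters, the estimation step can be sketched as a Kalman filter whose state is the parameter vector. The sketch below is generic and rests on assumptions: it uses the regression form y[k] = h[k]·θ + v[k] with θ modeled as a random walk, it does not reproduce the regressors of (38), and the data, dimensions, and the name `kalman_parameter_estimate` are made up for illustration. With Q = 0 and R = 1 the recursion reduces to the standard RLS update with unit forgetting factor.

```python
import numpy as np


def kalman_parameter_estimate(H, y, Q, R, theta0, P0):
    """Estimate theta in y[k] = H[k] @ theta + v[k], var(v) = R, where theta
    follows the random walk theta[k+1] = theta[k] + w[k], cov(w) = Q."""
    theta = np.array(theta0, dtype=float)
    P = np.array(P0, dtype=float)
    for h, yk in zip(H, y):
        P = P + Q                             # time update of the covariance
        S = h @ P @ h + R                     # innovation variance (scalar)
        K = P @ h / S                         # Kalman gain vector
        theta = theta + K * (yk - h @ theta)  # measurement update
        P = P - np.outer(K, h @ P)            # covariance update
    return theta, P


# Synthetic test data (made up): two weights and one offset.
rng = np.random.default_rng(1)
theta_true = np.array([0.5, -0.3, 0.1])
X = rng.uniform(0.0, 1.0, size=(200, 2))       # stand-in regulator levels
H = np.column_stack([X, np.ones(len(X))])      # regressors [x1, x2, 1]
y = H @ theta_true + rng.normal(0.0, 0.05, size=len(X))

theta_hat, P = kalman_parameter_estimate(
    H, y, Q=np.zeros((3, 3)), R=0.05 ** 2,
    theta0=np.zeros(3), P0=10.0 * np.eye(3),
)
```

In the proposed scheme, such an estimator would be called once per candidate structure in each iteration; a nonzero Q lets the filter track slowly varying parameters.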
VI. CONCLUSIONS AND FUTURE WORK
The induction of GRN models from a sequence of microarray
measurements becomes attractive owing to the growing availability of time-series data. In this paper, a continuous nonlinear
ordinary differential equation model with parameter noise and
external noise is proposed. GRN inference is decoupled into
sub-problems with each sub-problem targeted for each individual gene. Then a joint genetic programming and Kalman filtering approach is proposed to infer the nonlinear differential
equation from time-series data. Simulations with synthetic and
yeast data demonstrate the effectiveness of the proposed algorithm. The proposed algorithm addresses the tradeoff between
diversification (flexibility to explore new regions) and intensification (convergence in local regions) by using genetic programming to provide the model flexibility needed to reduce the
bias (systematic error) of the model. It uses a Kalman filter to
provide fast convergence and to reduce the stochastic error of the
model by mitigating the effect of noise during parameter estimation. A robust Kalman filter is also presented to compensate
for inaccurate estimates of the noise statistics.
The results (of the yeast GRN) obtained in this paper reveal many interesting phenomena in GRNs. The inference of
GRNs by the proposed algorithm provides insight into a wide
range of biological processes. Specifically, the obtained nonlinear dynamic model answers whether (qualitatively) or how
much (quantitatively) a gene or external perturbation contributes
to the behavior transition of other genes or regulators (proteins)
in instances such as disease development or recovery, aging processes, cell differentiation, or other cellular phenomena. In addition, it characterizes how the parameter (intrinsic) noise and
external noise affect the process of gene expression. We would
like to point out that although only polynomials are used as test
cases in this study, our proposed methodology can be applied to
a broad range of models.
It is important to design control strategies to manipulate some
of the genes in the GRN and drive the system to desired target
states. The obtained GRN model using the methodology in this
paper will be utilized as a tool to study the dynamics and steady
states of the GRN under various control designs. It will have
direct applications in therapeutic target discovery. This will be
one of our research thrusts in the future.
Time delay is ubiquitous in gene regulatory activities and
incorporation of time delay may capture the system dynamics
more effectively. We plan to add time delay to our nonlinear
model in our future work. In addition, the statistics of the parameter noise and external noise may not be known at all. In
that case, even a robust Kalman filter may not be appropriate
for estimating parameters. Instead, an $H_\infty$ filter may be employed
to provide robust estimation of parameters even without
knowledge of the noise statistics. This will be one of our future efforts.
REFERENCES
[1] E. R. Dougherty, A. Datta, and C. Sima, “Research issues in genomic
signal processing,” IEEE Signal Process. Mag., vol. 22, no. 6, pp.
46–68, 2005.
[2] A. Datta, A. Choudhary, M. L. Bittner, and E. R. Dougherty, “External
control in Markovian genetic regulatory networks,” Machine Learning,
vol. 52, no. 1–2, pp. 169–191, 2003.
[3] R. Pal, A. Datta, and E. R. Dougherty, “Optimal infinite horizon control
for probabilistic Boolean networks,” IEEE Trans. Signal Process., vol.
54, no. 6, pt. 2, pp. 2375–2387, 2006.
[4] T. Kepler and T. Elston, “Stochasticity in transcriptional regulation:
Origins, consequences, and mathematical representations,” Biophys. J.,
vol. 81, no. 6, pp. 3116–3136, Dec. 2001.
[5] P. Swain, M. Elowitz, and E. Siggia, “Intrinsic and extrinsic contributions to stochasticity in gene expression,” Proc. Natl. Acad. Sci. USA,
vol. 99, pp. 12795–12800, 2002.
[6] J. Hasty, J. Pradines, M. Dolnik, and J. J. Collins, “Noise-based
switches and amplifiers for gene expression,” Proc. Natl. Acad. Sci.
USA, vol. 97, pp. 2075–2080, 2000.
[7] H. de Jong, “Modeling and simulation of genetic regulatory systems: A
literature review,” J. Computat. Biol., vol. 9, no. 1, pp. 67–103, 2002.
[8] T. Chen, H. L. He, and G. M. Church, “Modeling gene expression with
differential equations,” in Pacific Symp. Biocomput., 1999, vol. 4, pp.
29–40.
[9] M. K. S. Yeung, J. Tegnér, and J. J. Collins, “Reverse engineering gene
networks using singular value decomposition and robust regression,”
Proc. Natl. Acad. Sci. USA, vol. 99, pp. 6163–6168, 2002.
[10] V. Filkov, “Identifying gene regulatory networks from gene expression data,” in Handbook of Computational Molecular Biology. Boca
Raton, FL: CRC Press, 2005.
[11] D. C. Weaver, C. T. Workman, and G. D. Stormo, “Modeling regulatory
networks with weight matrices,” in Pacific Symp. Biocomput., 1999,
vol. 4, pp. 112–123.
[12] P. D’haeseleer, X. Wen, S. Fuhrman, and R. Somogyi, “Linear modeling of mRNA expression levels during CNS development and injury,”
in Pacific Symp. Biocomput., 1999, vol. 4, pp. 41–52.
[13] E. Mjolsness, D. H. Sharp, and J. Reinitz, “A connectionist model of
development,” J. Theor. Biol., vol. 152, no. 4, pp. 429–453, Oct. 1991.
[14] L. F. A. Wessels, E. P. Van Someren, and M. J. T. Reinders, “A comparison of genetic network models,” in Pacific Symp. Biocomput., 2001,
vol. 6, pp. 508–519.
[15] I. Tabus, C. D. Giurcaneanu, and J. Astola, “Genetic networks inferred
from time series of gene expression data,” in Proc. 1st Int. Symp.
Control, Commun. Signal Process., Hammamet, Tunisia, 2004, pp.
755–758.
[16] M. J. L. de Hoon, S. Imoto, K. Kobayashi, N. Ogasawara, and S.
Miyano, “Inferring gene regulatory networks from time-ordered gene
expression data of Bacillus subtilis using differential equations,” in
Pacific Symp. Biocomput., 2003, vol. 8, pp. 17–28.
[17] R. Yamaguchi, R. Yoshida, S. Imoto, T. Higuchi, and S. Miyano,
“Finding module-based gene networks in time-course gene expression
data with state space models,” IEEE Signal Process. Mag., vol. 24, no.
1, pp. 37–53, 2007.
[18] Y. Maki et al., “Inference of genetic network using the expression profile time course data of mouse P19 cells,” in Proc. Genome Informatics
2002, 2002, vol. 13, pp. 382–383.
[19] S. Kimura, M. Hatakeyama, and A. Konagaya, “Inference of S-system
models of genetic networks from noisy time-series data,” Chem-Bio
Inform. J., vol. 4, no. 1, pp. 1–14, 2004.
[20] E. D. Sontag, “For differential equations with r parameters,
2r + 1 experiments are enough for identification,” J. Nonlinear Sci., vol. 12,
pp. 553–583, 2002.
[21] J. R. Koza, Genetic Programming: On the Programming of Computers
by Means of Natural Selection. Cambridge, MA: MIT Press, 1992.
[22] S. Haykin, Adaptive Filter Theory, 4th ed. Englewood Cliffs, NJ:
Prentice-Hall, 2001.
[23] M. Grewal and A. Andrews, Kalman Filtering: Theory and Practice.
Englewood Cliffs, NJ: Prentice-Hall, 1993.
[24] S. Sasa, “Robustness of a Kalman filter against uncertainties of noise
covariances,” in Proc. American Control Conf., 1998, pp. 2344–2348.
[25] S. Kosanam and D. Simon, “Kalman filtering with uncertain noise covariances,” in Proc. IASTED Int. Conf. Intelligent Syst. Control, 2004,
pp. 375–379.
[26] M. A. Savageau, “Rules for the evolution of gene circuitry,” in Pacific
Symp. Biocomput., 1998, vol. 3, pp. 54–65.
[27] M. A. Savageau, “20 years of s-systems,” in Canonical Nonlinear Modeling: S-Systems Approach to Understand Complexity, E. Voit, Ed.
New York: Van Nostrand Reinhold, 1991, pp. 1–44.
[28] S. Ando, E. Sakamoto, and H. Iba, “Evolutionary modeling and inference of gene network,” Inf. Sci., vol. 145, pp. 237–259, 2002.
[29] P. Woolf and Y. Wang, “A fuzzy logic approach to analyzing gene
expression data,” Physiol. Genomics, vol. 3, pp. 9–15, 2000.
[30] J. Schneider and L. Guarente, “Regulation of the yeast CYT1 gene encoding cytochrome c1 by HAP1 and HAP2/3/4,” Molecular Cellular
Biol., vol. 11, no. 10, pp. 4934–4942, 1991.
[31] Z. Chan, N. Kasabov, and L. Collins, “A two-stage methodology for
gene regulatory network extraction from time-course gene expression
data,” Expert Systems With Applications, vol. 30, pp. 59–63, 2006.
[32] O. Nelles, Nonlinear System Identification. New York: Springer,
2001.
[33] H. Cao, L. Kang, and Y. Chen, “Evolutionary modeling of systems
of ordinary differential equations with genetic programming,” Genetic
Programming and Evolvable Machines, vol. 1, no. 4, pp. 309–337,
2000.
[34] L. Qian, Supplemental materials, yeast data set. Dept. Electr. Comput.
Eng., Prairie View A&M Univ., Prairie View, TX, 2007 [Online]. Available: http://old.pvamu.edu/edir/lijun/SupMatTSP2007.html
[35] H. Wang, L. Qian, and E. Dougherty, “Inference of gene regulatory
networks using genetic programming and Kalman filter,” presented at
the Gensips Conf., College Station, TX, 2006.
[36] D. Husmeier, “Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian
networks,” Bioinformatics, vol. 19, no. 17, pp. 2271–2282, 2003.
[37] A. Tsakonas, “A comparison of classification accuracy of four genetic
programming-evolved intelligent structures,” Inform. Sci., vol. 176, pp.
691–724, 2006.
[38] H. Wang, L. Qian, and E. Dougherty, “Inference of gene regulatory
networks using S-system: A unified approach,” in Proc. IEEE CIBCB,
2007, pp. 82–89.
Lijun Qian (M’01–SM’08) received the B.E. degree
from Tsinghua University, Beijing, China, the M.S.
degree from the Technion–Israel Institute of Technology, Haifa, and the Ph.D. degree from Rutgers
University, New Brunswick, NJ.
He is an Assistant Professor in the Department
of Electrical and Computer Engineering at Prairie
View A&M University (PVAMU), Prairie View, TX.
Before joining PVAMU, he was a Researcher at the
Mathematical Science Research Center of Bell Labs,
Murray Hill, NJ. His major research interests are in
network theory, control theory, and genomic signal processing.
Haixin Wang (M’07) received the B.S. degree in
electrical and mechanical engineering from Shandong University of Science and Technology, China,
in 1997. Since 2005, he has been pursuing the Ph.D.
degree in the Department of Electrical and Computer
Engineering at Prairie View A&M University, Prairie
View, TX.
His research interests include bioinformatics, statistical signal processing and genetic algorithms.
Edward R. Dougherty (M’05) received the M.S.
degree in computer science from the Stevens Institute of Technology, Hoboken, NJ, and the Ph.D.
degree in mathematics from Rutgers University,
New Brunswick, NJ.
He is a Professor in the Department of Electrical
and Computer Engineering at Texas A&M University, College Station, TX, where he holds the Robert
M. Kennedy Chair and is Director of the Genomic
Signal Processing Laboratory. He is also the Director
of the Computational Biology Division of the Translational
Genomics Research Institute in Phoenix, AZ. He is the author of 14
books, editor of five others, and author of more than 200 journal papers. He has
contributed extensively to the statistical design of nonlinear operators for image
processing and the consequent application of pattern recognition theory to nonlinear image processing. His research in genomic signal processing is aimed at
diagnosis and prognosis based on genetic signatures and using gene regulatory
networks to develop therapies based on the disruption or mitigation of aberrant
gene function contributing to the pathology of a disease.
Prof. Dougherty has been awarded the Doctor Honoris Causa by the Tampere
University of Technology in Finland. He is a fellow of SPIE, has received the
SPIE President’s Award, and served as the editor of the SPIE/IS&T Journal of
Electronic Imaging. At Texas A&M, he has received the Association of Former
Students Distinguished Achievement Award in Research, been named Fellow
of the Texas Engineering Experiment Station, and named Halliburton Professor
of the Dwight Look College of Engineering.