Stochastic Oscillations in Genetic Regulatory Networks. Application to Microarray Experiments.

Simon Rosenfeld
National Cancer Institute, Bethesda, MD 20892; sr212a@nih.gov

Summary. We analyze stochastic dynamics of genetic regulatory networks using a system of nonlinear differential equations. The system of S-functions is applied to capture the role of RNA polymerase in the transcription-translation mechanism. Combining the center manifold theorem of nonlinear dynamics and probabilistic properties of the chemical rate equations, we derive a system of stochastic differential equations which is analytically tractable despite the high dimension of the regulatory network. Using the stationary solutions of these equations, we explain the apparently paradoxical results of some recent time-course microarray experiments where mRNA transcription levels are found to correlate only weakly with the corresponding transcription rates. Combining analytical and simulation approaches, we determine a set of relations between the size of a regulatory network, its structural complexity, and chemical variability in the protein-mRNA system.

Key words: Genetic Regulatory Network, Gene Expression, Nonlinear Dynamics, Stochastic Processes

1 Introduction

According to the “central dogma” in molecular biology, the genetic regulatory process involves two key steps, namely “transcription”, i.e., deciphering the genetic code and creation of the messenger RNA (mRNA), and “translation”, i.e., synthesis of the proteins by ribosomes using mRNAs as templates. These processes run concurrently for all the genes comprising the genome. Importantly, each molecular assembly responsible for deciphering the genetic code is itself built from the proteins produced through transcription and translation of other genes, thus introducing nonlinear interactions into the regulatory process [Lew04].
In the human genome, for example, typically from 30 to 100 regulatory proteins are involved in each transcription event in each of about 60,000 genes. This means that the regulatory network has very high dimensionality and very high connectivity simultaneously. Mathematical description of such a network is a challenging task, both conceptually and computationally. Quite paradoxically, however, this seemingly unfavorable combination of two “highs” opens a new avenue for approximate solutions and understanding global behavior of regulatory systems through the application of asymptotic methods.

In this paper, we pay special attention to quantitative relations between the transcription levels (TL), i.e., number of mRNA molecules of a certain type per cell, and transcription rates (TR), i.e., number of mRNA molecules produced in the cell per unit of time. TLs are the quantities directly derived from microarray experiments, whereas TRs are usually unobservable. Although both of these quantities seem to be legitimate indicators for characterizing gene activity, they are generally different and capture different facets of the regulatory mechanism. Due to the fundamentally nonlinear nature of gene-to-gene interactions, there cannot be any direct relation between gene-specific TRs and TLs. Also, due to the inherent instability of high dimensional regulatory systems, nothing like time-independent “gene activity” may be attributed to a living cell. In our view, these conclusions may have serious consequences for the interpretation of microarray experiments where the fluctuating nature of the mRNA levels is frequently ignored, the mRNA abundance is often seen as a direct indicator of the corresponding gene’s activity, and the differential expression (i.e., difference in TLs) is taken as evidence of differences in the cells themselves.

2 Assumptions and Equations

The system of nonlinear ordinary differential equations (ODE) for the description of proteome-transcriptome dynamics first appeared in [CHC99]:

$$ \frac{dr}{dt} = F(p) - \beta r; \qquad \frac{dp}{dt} = \gamma r - \delta p \tag{1} $$

where $r$ and $p$ are $n$-dimensional column vectors of mRNA and protein concentrations measured in numbers of copies per cell; $n$ is the number of genes in the genome; $\beta$, $\gamma$ and $\delta$ are the non-degenerate diagonal matrices corresponding to the rates of production and degradation in transcription and translation. The $n$-dimensional vector-function $F$ is a strongly nonlinear function representing the mechanism of transcription. [CHC99] linearized the system (1) in the vicinity of a certain hypothesized initial point and formulated general requirements of stability. In what follows, we augment the system (1) by an explicitly specified model for $F(p)$ and attempt to extract the consequences from the essentially nonlinear nature of the problem. As is known from the biology of gene expression, generation of each copy of messenger RNA is preceded by a complex sequence of events in which a large number of proteins bind to the gene's regulatory sites and assemble a reading mechanism known as RNA polymerase [KMP03]. Each of these bindings represents a separate bio-chemical reaction involving DNA and proteins, and supported by a number of smaller molecules. According to the principles of chemical kinetics, the production term $F(p)$ should have the following general form [DeJ02]

$$ F_i(p_1, \dots, p_n) = \sum_{k=1}^{L_i} \omega_{ik} \prod_{m=1}^{n} p_m^{r_{ikm}} \tag{2} $$

where $L_i$ is the number of concurrent bio-chemical reactions for decoding the $i$-th gene; $\omega_{ik}$ are the rate constants; $r_{ikm}$ are the kinetic orders showing how many protein molecules of type $m$ participate in the $k$-th bio-chemical reaction for the transcription of the $i$-th gene. A detailed account of the assumptions underlying (2) may be found in [SSa89].
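As a concrete reading of the production term (2), the following minimal Python sketch evaluates $F_i$ for a single gene. The function name `production_rate` and the toy numbers are illustrative assumptions, not part of the original model specification.

```python
import numpy as np

def production_rate(p, omega, r):
    """Evaluate F_i(p) = sum_k omega[k] * prod_m p[m] ** r[k, m] for one gene.

    p     : (n,) protein concentrations (positive)
    omega : (L,) rate constants of the L concurrent reactions
    r     : (L, n) integer kinetic orders
    """
    p = np.asarray(p, dtype=float)
    omega = np.asarray(omega, dtype=float)
    r = np.asarray(r)
    # p[None, :] ** r has shape (L, n); the product over m gives one term per reaction.
    terms = omega * np.prod(p[None, :] ** r, axis=1)
    return float(terms.sum())

# Toy example (hypothetical numbers): n = 3 proteins, L = 2 reactions.
p = [2.0, 3.0, 5.0]
omega = [0.5, 1.5]
r = [[1, 0, 2],   # reaction 1 uses one copy of p1 and two copies of p3
     [0, 1, 0]]   # reaction 2 uses one copy of p2
F = production_rate(p, omega, r)   # 0.5 * (2 * 25) + 1.5 * 3
```

A kinetic order of zero simply drops the corresponding protein from the product, which is how sparse regulation is encoded in this form.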
Although these assumptions are not free from inevitable simplifications, they constitute a reasonably solid basis for studying the dynamics of genetic regulatory networks because they recognize the central role of RNA polymerase in the nonlinear mechanism of gene-to-gene interactions. The rate constants, $\omega_{ik}$, and kinetic orders, $r_{ikm}$, are assumed to be positive real and integer numbers, respectively. Due to the very nature of chemical thermodynamics, each chemical reaction is a trial-and-error process in which fixed values of molecular constants may be attributed only to big ensembles of reactions, but not to the individual ones. Therefore, we specify $\omega_{ik}$ and $r_{ikm}$ as random numbers drawn from the gamma and Poisson populations, respectively

$$ \Pr(\omega_{ik} = x) = \frac{x^{\alpha - 1} \exp(-x/\theta)}{\Gamma(\alpha)\, \theta^{\alpha}}; \qquad \Pr(r_{ikm} = n) = \frac{\lambda^{n} \exp(-\lambda)}{n!} \tag{3} $$

This choice of probabilistic characterization is a matter of mathematical convenience, and may be easily replaced by other assumptions compatible with the nature of the problem. Assumptions (3) provide the basis for ensemble averaging, and, as usual in statistical mechanics, allow for probabilistic statements when deterministic considerations are impossible due to complexity. Although small networks can conceivably be treated deterministically (e.g. [MSa95]), the probabilistic approach seems to be the only way of studying networks with tens of thousands of regulatory units. Naturally, a great variety of regulatory functions remains outside the scope of this model. In particular, we ignore such important processes as protein post-translational modifications. We also assume that each protein molecule produced by the ribosome has a specific regulatory function within the cell of origin and immediately becomes available everywhere within the network. Even with these simplifications, mathematical analysis of such systems is not simple, and it does not seem possible to move toward more complex systems without understanding the simpler ones.
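The random specification (3) can be sketched in a few lines of Python. The helper `sample_kinetics`, the use of the same number of reactions for every gene, and all parameter values below are assumptions made only for illustration.

```python
import numpy as np

def sample_kinetics(n_genes, n_reactions, alpha, theta, lam, seed=0):
    """Draw rate constants and kinetic orders as in (3).

    Assumption: every gene has the same number of reactions (L_i = n_reactions).
    Returns omega with shape (n_genes, n_reactions) and
    r with shape (n_genes, n_reactions, n_genes).
    """
    rng = np.random.default_rng(seed)
    # Gamma-distributed rate constants with shape alpha and scale theta.
    omega = rng.gamma(shape=alpha, scale=theta, size=(n_genes, n_reactions))
    # Poisson-distributed kinetic orders; most entries are zero when lam is small.
    r = rng.poisson(lam=lam, size=(n_genes, n_reactions, n_genes))
    return omega, r

omega, r = sample_kinetics(n_genes=200, n_reactions=3, alpha=2.0, theta=0.5, lam=0.05)

# Average number of protein molecules entering one reaction; expected value lam * n = 10.
mean_orders_per_reaction = r.sum(axis=2).mean()
```

With a small Poisson parameter the tensor of kinetic orders is sparse, so each reaction involves only a handful of the available proteins.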

3 Outline of the Solution

Following standard procedures in nonlinear dynamics [Car81], we first seek the equilibrium points of (1-2) and study the structure of the solutions in their vicinity. Let $P_0$ be the $n$-vector of equilibrium protein concentrations, and $X(t)$ be the vector of relative concentrations normalized by the equilibrium values. After some transformation, system (1-2) may be rewritten as

$$ \frac{d^2 x_i}{dt^2} + (\beta_i + \delta_i) \frac{dx_i}{dt} + \beta_i \delta_i x_i = \beta_i \delta_i \sum_{k=1}^{L_i} \Omega_{ik} Y_{ik} \tag{4} $$

where $Y_{ik}$ are the so-called S-functions [SVo87] defined as

$$ \log[Y_{ik}(X)] = \sum_{m=1}^{n} r_{ikm} \log(x_m) \tag{5} $$

and

$$ \Omega_{ik} = \frac{\omega_{ik} Y_{ik}(P_0)}{\sum_{k'=1}^{L_i} \omega_{ik'} Y_{ik'}(P_0)} $$

We are looking for a stationary solution of (4-5). According to the center manifold theory [Per01], this stationary solution may be sought as a linear combination of the eigenvectors corresponding to the eigenvalues with zero real parts. Because we are dealing with a system containing tens of thousands of regulatory elements, the center manifold may contain up to hundreds of such eigenvalues forming an essentially continuous spectrum of stationary oscillations. Oscillations with such a spectrum may be reasonably viewed as random processes. In order to find the stochastic structure of these processes, we apply a successive approximation approach. We first notice that the right hand side in (5) is the sum of identically distributed random functions. Because the first guess trajectories, $x_m(t)$, correspond to different eigenvalues from the center manifold, we may assume that they are statistically independent, and therefore the random processes $\eta_{ik}(t) = \log(Y_{ik}(X(t)))$ are approximately Gaussian. Although difficult to prove theoretically in all its generality, this conjecture has been confirmed by direct numerical simulation under the assumptions of our model. Figure 1 shows an example of such simulation.

[Figure 1: top panel, protein oscillations vs. time; bottom panel, transcription rate vs. time.]
Figure 1. Nonlinear transformation of linear combination of periodic oscillations.

In this Figure, the top panel shows 100 separate quasi-periodic oscillations covering a wide spectrum of periods. As shown in the bottom panel, the corresponding functions $F_i(t)$ in (2) tend to concentrate around the same stochastic process whose parameters are determined by the probabilistic properties of $\omega_{ik}$ and $r_{ikm}$. This tendency increases with the complexity of the network. In the first guess approximation, all the processes $\eta_{ik}(t)$ corresponding to different indexes $i$ and $k$ may be replaced by a single Ornstein-Uhlenbeck process [Gar83] with autocovariance

$$ R_\eta(\tau) = (\lambda + \lambda^2) \sum_{k=1}^{n} \mathrm{var}[\ln(x_k)] \, \exp(-\tau/\tau_0) $$

The correlation radius, $\tau_0$, depends on the width of the center manifold spectrum and becomes shorter with increasing $n$. Roughly speaking, $\tau_0$ is inversely proportional to the maximum absolute value of the characteristic roots within the center manifold. Unfortunately, currently known theoretical bounds for eigenvalues of a matrix are too conservative to be used in practice [Bro39]; however, $\tau_0$ can be easily estimated computationally through fitting $\eta(t)$ by the first order (i.e., Markov) process. The system (4) is now decoupled into a set of independent equations containing the same "random force"

$$ \frac{d^2 x_i}{dt^2} + (\beta_i + \delta_i) \frac{dx_i}{dt} + \beta_i \delta_i x_i = \beta_i \delta_i e^{\eta(t)} \tag{6} $$

The process $\xi(t) = \exp(\eta(t))$ is log-normally distributed with expectation $\exp(\sigma^2/2)$ and autocovariance

$$ R_\xi(\tau) = \exp(\sigma^2)\,[\exp(\sigma^2) - 1]\, \exp\left( -\frac{\tau}{\tau_0} \, \frac{\sigma^2}{1 - \exp(-\sigma^2)} \right) \tag{7} $$

where

$$ \sigma^2 = (\lambda + \lambda^2) \sum_{k=1}^{n} \mathrm{var}[\ln(x_k)] \tag{8} $$

It is important to note that the correlation radius of $\xi(t)$ is always smaller than $\tau_0$, which means that $\xi(t)$ is always closer to a white noise than $\eta(t)$.
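The relation between the two correlation radii can be checked numerically. The sketch below, with hypothetical function names and arbitrary parameter values, simulates a stationary Ornstein-Uhlenbeck process $\eta(t)$ by exact discretisation, forms $\xi = \exp(\eta)$, and computes the correlation radius of $\xi$ by reading the decay rate in (7) as $1/\tau_\xi$.

```python
import numpy as np

def xi_correlation_radius(tau0, sigma2):
    """Correlation radius of xi = exp(eta), taken as the inverse decay rate in (7)."""
    return tau0 * (1.0 - np.exp(-sigma2)) / sigma2

def simulate_ou(sigma2, tau0, dt, n_steps, seed=0):
    """Stationary zero-mean OU process with variance sigma2 and correlation radius tau0."""
    rng = np.random.default_rng(seed)
    a = np.exp(-dt / tau0)                      # one-step autocorrelation
    noise = rng.normal(0.0, np.sqrt(sigma2 * (1.0 - a * a)), size=n_steps)
    eta = np.empty(n_steps)
    eta[0] = rng.normal(0.0, np.sqrt(sigma2))   # start in the stationary distribution
    for t in range(1, n_steps):
        eta[t] = a * eta[t - 1] + noise[t]
    return eta

sigma2, tau0 = 0.25, 1.0                        # arbitrary illustrative values
eta = simulate_ou(sigma2, tau0, dt=0.1, n_steps=200_000)
xi = np.exp(eta)

tau_xi = xi_correlation_radius(tau0, sigma2)    # smaller than tau0 for any sigma2 > 0
sample_mean_xi = xi.mean()                      # compare with exp(sigma2 / 2)
```

Since $(1 - e^{-s})/s < 1$ for every $s > 0$, the ratio $\tau_\xi/\tau_0$ stays below one and shrinks as $\sigma^2$ grows, which is the "whitening" effect described in the text.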
Applying the Fourier transform, equations (6) may be easily solved, and the solutions are stochastic processes with expectations $E(x_i) = \exp(\sigma^2/2)$, variances

$$ \mathrm{var}(x_i) = \frac{\beta_i \delta_i}{\beta_i + \delta_i} \, \tau_0 \, \frac{[\exp(\sigma^2) - 1]^2}{\sigma^2}, \tag{9} $$

and autocorrelation functions

$$ R_{x_i}(\tau) = A_i e^{-\tau/\tau_0} + B_i e^{-\beta_i \tau} + \Delta_i e^{-\delta_i \tau} $$

(explicit expressions for the coefficients are cumbersome and omitted to save page space). These approximations are valid under the conditions that all $\beta_i \tau_0 \ll 1$ and $\delta_i \tau_0 \ll 1$, meaning that the spectrum of the collective random force, $\xi(t)$, should be wider than all the individual spectra in (6). We now can plug the first guess stochastic processes back into expressions (5) and repeat the procedure. It is worth mentioning that the center manifold spectrum remains the same throughout the iterative process because it is solely determined by the solution in the vicinity of equilibrium.

4 Interrelations between nonlinearity, instability and complexity

Parameter $\lambda$ in the Poisson distribution (3) is a natural measure of complexity of the system because the quantity $\lambda n$ is the average number of proteins participating in the transcription. We now formally introduce the "index of complexity", $I_c = (\lambda + \lambda^2) n$. If this index were small, the vast majority of characteristic roots would be stable (i.e., have negative real parts) and the center manifold spectrum would be narrow. Obviously, this is not the case in reality, where usually $I_c \approx 30$ to $100$ [Lew04]. In systems of such great complexity, a substantial number of characteristic roots will reside in the right half of the complex plane, thus signifying a greater instability. This is seen in Figures 2-3, where we show two examples of the distribution of characteristic roots over the complex plane for small and big $I_c$.
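As a small worked example of the index of complexity, the Python sketch below evaluates $I_c = (\lambda + \lambda^2) n$ for the two parameter sets quoted in the panel titles of Figures 2 and 3 ($n = 300$ with $\lambda = 0.05$ and $\lambda = 0.5$). The function name is a hypothetical choice.

```python
def complexity_index(lam, n):
    """Index of complexity I_c = (lam + lam**2) * n for Poisson parameter lam and n genes."""
    return (lam + lam ** 2) * n

ic_low = complexity_index(0.05, 300)    # low-complexity case of Figure 2
ic_high = complexity_index(0.5, 300)    # high-complexity case of Figure 3
mean_proteins_low = 0.05 * 300          # lam * n, average number of participating proteins
```

The two results, 15.75 and 225, agree with the complexity indices printed in the figure titles.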

[Figure 2: characteristic roots in the complex plane, real parts vs. imaginary parts. n=300; Poisson lambda=0.05; spectral width=2.21; complexity index=15.75; stability index=3.47.]
Figure 2. Positions of characteristic roots in case of low complexity

[Figure 3: characteristic roots in the complex plane, real parts vs. imaginary parts. n=300; Poisson lambda=0.5; spectral width=4.15; complexity index=225; stability index=1.93.]
Figure 3. Positions of characteristic roots in case of high complexity

We also define the "index of stability", $I_s$, as the ratio of the number of roots with negative real parts to that with positive ones. With increasing complexity, stability decreases and the spectral width of the center manifold increases, thus making the correlation radius, $\tau_0$, smaller and the spectrum of the collective "random force", $\xi(t)$, "whiter." Effectively, that means that the more complex the system, the more favorable are the conditions for applying the proposed approach. Figure 4, left panel, shows the dependence of stability on complexity. The right panel illustrates the fact that the correlation radius of $\xi(t)$ (open circles) is always substantially smaller than that of $\eta(t)$ (solid circles) and both drastically decrease with increasing $I_c$.

[Figure 4: left panel "stability vs complexity", log(stability index) vs. complexity index; right panel "correlation radii", correlation radius vs. complexity index.]
Figure 4. Stability and correlation radii vs. complexity of network

5 Interrelations between transcription levels and transcription rates

In the model adopted here, the entire gene expression mechanism is seen as driven by a collective random force which in turn is generated by all the individual transcription-translation events. This kind of "self-consistent" or "average field" approach is widely employed in physics, with such notable examples as the Thomas-Fermi equation in atomic physics [PYa89] and Landau-Vlasov equations in plasma physics [Che84], to name just a few.
Transcription levels (TL) and transcription rates (TR) are represented by the quantities $r_i$ and $F_i$ in (1), respectively. Since $F_i$ are stochastic processes generated, generally speaking, by the entire network, there is no noticeable correlation between them and $r_i$. Therefore, one cannot expect any substantial similarity between the temporal behavior of TR and TL. This conclusion is of fundamental importance for the interpretation of microarray experiments. Also, despite the fact that in our model each mRNA molecule entering the ribosome translates into exactly one protein, there is no similarity between the temporal behavior of protein and mRNA concentrations. The dissimilarities increase with the regulatory network complexity because a longer chain of intermediate events is involved in each act of gene expression. Figure 5 depicts the median correlation coefficient as a function of complexity. As seen from this Figure, in the case of high complexity half of all the protein-mRNA pairs are correlated at a level below 0.5. This level of correlation is close to that observed in the [GAP04] experiment, where about half of ~5500 TLs in the yeast genome were found to be uncorrelated with the corresponding TRs. Based on this comparison, we may conclude that the complexity of the yeast genome is about 45 to 60.

[Figure 5: median correlation coefficient vs. complexity index.]
Figure 5. Median correlation coefficients vs complexity

6 Interrelations between limits of complexity and total variability

Equation (9) expresses the individual variances of protein oscillations through the individual degradation rates and total variance, which in turn, due to (8), is a linear combination of variances of logarithms of individual concentrations.
It allows for the derivation of a nonlinear algebraic equation for the so far unknown variance $\sigma^2$

$$ \sigma^2 = I_c \, \frac{1}{n} \sum_{i=1}^{n} \ln\left\{ 1 + 2\tau_0 \, \frac{\beta_i \delta_i}{\beta_i + \delta_i} \, \frac{\cosh \sigma^2 - 1}{\sigma^2} \right\} \tag{10} $$

In a sense, solution of the original strongly nonlinear problem is now reduced to the solution of this equation. Simple analysis shows that the solution does exist and is unique if

$$ T_0 > I_c \tau_0; \qquad T_0 = \left[ \frac{1}{n} \sum_{i=1}^{n} \frac{\beta_i \delta_i}{\beta_i + \delta_i} \right]^{-1} \tag{11} $$

where parameter $T_0^{-1}$ characterizes a network-wide average degradation rate. The inequality above, rewritten as $I_c < T_0/\tau_0$, tells us that in a regulatory network with $n$ units there exists an upper limit of complexity determined by two global parameters, i.e., by the overall rate of degradation and the spectral width of the center manifold. If all the parameters reside within the limits required by (11), then equation (10) may be easily solved numerically using the well-known delta-method [Oeh92]. Here, again, we take advantage of the system being asymptotically large. Namely, suppose that we evaluate the sum $S_N = N^{-1} \sum_{i=1}^{N} f(z_i)$, where $\{z_i\}$ are iid random values drawn from the population with PDF $P(z)$. Using the delta-method, we obtain $E(S_N) \approx f(E(z)) + 0.5\, \sigma_z^2 f''(E(z))$. Assuming, in accordance with (10), that $z = \beta\delta/(\beta + \delta)$, and, in accordance with (3), that $\beta$ and $\delta$ are gamma-distributed, we easily find that $E(z) = \theta\alpha^2/(2\alpha + 1)$. Substituting this expression into (10), we end up with a simple transcendental algebraic equation with respect to $\sigma^2$ (not shown here). The total variance $\sigma^2$ is the global measure of intensity of fluctuations within the regulatory network and is somewhat analogous to temperature in statistical mechanics. This “network temperature” is a function of three global parameters, i.e., $I_c$, $\tau_0$, and $T_0$. It is of great interest to establish the asymptotic behavior of the network temperature as a function of the size of the network.
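To make the reduction to equation (10) concrete, here is a minimal Python sketch that finds the nontrivial root of (10) by plain bisection after checking condition (11), and that compares the closed form $E(z) = \theta\alpha^2/(2\alpha+1)$ with a Monte Carlo average. Bisection is a generic root finder substituted here for illustration and is not the delta-method itself; all function names, the bracket $[10^{-3}, 50]$, and the parameter values are assumptions.

```python
import numpy as np

def rhs(s, ic, tau0, z):
    """Right-hand side of (10); s stands for sigma^2, z_i = beta_i*delta_i/(beta_i+delta_i)."""
    z = np.asarray(z, dtype=float)
    return ic * np.mean(np.log1p(2.0 * tau0 * z * (np.cosh(s) - 1.0) / s))

def solve_sigma2(ic, tau0, z, lo=1e-3, hi=50.0, tol=1e-12):
    """Bisection for the nontrivial root of rhs(s) - s = 0 inside the bracket [lo, hi]."""
    t0 = 1.0 / np.mean(z)                       # T0 as defined in (11)
    if not t0 > ic * tau0:
        raise ValueError("condition (11) violated: T0 must exceed Ic * tau0")
    g = lambda s: rhs(s, ic, tau0, z) - s
    if not (g(lo) < 0.0 < g(hi)):
        raise ValueError("no sign change inside the chosen bracket")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mean_z_gamma(alpha, theta):
    """Closed form E[beta*delta/(beta+delta)] for iid gamma(alpha, theta) rates."""
    return theta * alpha ** 2 / (2.0 * alpha + 1.0)

# Illustrative values: identical units with z_i = 1, Ic = 30, tau0 = 0.01.
z = np.ones(100)
sigma2 = solve_sigma2(ic=30.0, tau0=0.01, z=z)
residual = rhs(sigma2, 30.0, 0.01, z) - sigma2

# Monte Carlo check of the closed-form mean for alpha = 2, theta = 1.
rng = np.random.default_rng(1)
beta = rng.gamma(2.0, 1.0, 200_000)
delta = rng.gamma(2.0, 1.0, 200_000)
z_mc = np.mean(beta * delta / (beta + delta))
z_exact = mean_z_gamma(2.0, 1.0)
```

The bracket excludes the trivial root at $\sigma^2 = 0$; the sign change inside it relies on (11) for the lower end and on $I_c > 1$ for the upper end.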
Preliminary analysis shows that depending on the structure of the center manifold generated by the matrix $\Omega_{ik}$ in (4), the network temperature may increase, may remain relatively constant, or may even decrease. The main obstacle to better understanding the asymptotic behavior of $\sigma^2$ is the lack of realistic knowledge regarding the bounds of characteristic roots of matrices, in general, and spectral radii of center manifolds (and therefore of $\tau_0$). The last of the three scenarios mentioned above, if it takes place in reality, would help to understand how a very complex network, although inherently unstable, is nevertheless able to maintain the regulatory regime with comparatively low variability. This kind of behavior could provide some insight into the nature of “functional determinism” of a high-dimensional dynamical system arising from nearly chaotic behavior of all its regulatory subunits. At this point, this intriguing question is open for further investigation.

7 Cautionary Notes Regarding Microarray Data Interpretation

There exist two sets of legitimate quantitative parameters which characterize “gene activity”, i.e., transcription level and transcription rate. Microarray experiments provide us with mRNA abundances, i.e., transcription levels. What we would rather like to know are the mRNA transcription rates, or number of mRNA copies produced per unit of time. This quantity, if available, would be a more direct measure of gene activity. The distinction between the TL and TR has been repeatedly highlighted in the literature [WCB02]; however, it seems to remain largely ignored by the microarray community. As shown above, in a complex regulatory network, transcription level, generally speaking, is a poor predictor for transcription rates.
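The weak link between transcription level and transcription rate can be illustrated with a single-gene toy version of the first equation in (1), $dr/dt = F(t) - \beta r$, driven by a log-normal rate $F = \exp(\eta)$ with $\eta$ an Ornstein-Uhlenbeck process. This is only a sketch under stated assumptions: the closed form in `predicted_correlation` is derived here for a rate with exactly exponential autocovariance and is not taken from the paper, and all names and parameter values are hypothetical.

```python
import math
import numpy as np

def predicted_correlation(beta, tau0):
    """Stationary corr(F, r) for dr/dt = F - beta*r when F has autocovariance ~ exp(-|tau|/tau0).

    Assumption of this sketch, not a result quoted from the paper.
    """
    return math.sqrt(beta * tau0 / (1.0 + beta * tau0))

def simulate_tl_tr(beta, tau0, sigma2, dt=0.05, n_steps=400_000, seed=0):
    """Euler integration of dr/dt = F - beta*r with F = exp(eta), eta a stationary OU process."""
    rng = np.random.default_rng(seed)
    a = math.exp(-dt / tau0)
    kicks = rng.normal(0.0, math.sqrt(sigma2 * (1.0 - a * a)), size=n_steps)
    eta = rng.normal(0.0, math.sqrt(sigma2))
    r = math.exp(sigma2 / 2.0) / beta           # start at the stationary mean of r
    rate = np.empty(n_steps)
    level = np.empty(n_steps)
    for t in range(n_steps):
        eta = a * eta + kicks[t]                # exact OU update
        f = math.exp(eta)                       # transcription rate (TR)
        r += dt * (f - beta * r)                # transcription level (TL)
        rate[t] = f
        level[t] = r
    return rate, level

beta, tau0 = 0.1, 0.5                           # beta * tau0 = 0.05, i.e. a "white" driving force
rate, level = simulate_tl_tr(beta=beta, tau0=tau0, sigma2=0.25)
corr = np.corrcoef(rate, level)[0, 1]
corr_pred = predicted_correlation(beta, tau0)
```

In this toy setting the correlation falls as $\beta\tau_0$ decreases, so a rapidly fluctuating rate leaves only a faint trace in the slowly relaxing mRNA level.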
It is often tacitly assumed in the interpretation of microarray data that there exists some kind of equilibrium between production and degradation of mRNA for each gene separately, in which case a direct proportionality would exist between TLs and TRs. Global equilibrium, however, is not possible in a complex network of interacting units with many channels of coregulation. In order to judge which TRs and TLs are in equilibrium and which are not, detailed information about the timing of the corresponding bio-chemical reactions would be required. This kind of information is usually unavailable from microarray experiments, unless it is a time-course experiment with a sufficiently high rate of sampling. Another important implication of nonlinearity and complexity of the regulatory network is that a living cell cannot reside in a global state of equilibrium, simply because such a state cannot be stable. Stochastic oscillatory behavior is in the very nature of the regulatory process. Figuratively speaking, the cell should continuously depart from the point of equilibrium in order to activate the mechanism of returning. A usual way of thinking in microarray data interpretation is to attribute the differences in mRNA abundances to the cells themselves. However, depending on the frequency of sampling and duration of the sample preparation, the cell can be arrested in different phases of its oscillatory cycle, thus mimicking differential expression. That means that covariances of expression profiles may be quite different on different time scales. These covariances, usually obtained through cluster analysis or classification, are often used as a basis for pathway analysis. However, if the temporal dynamics of the regulatory processes is ignored, this analysis may produce misleading results.
Many statistical procedures in microarray data analysis, especially in the context of disease biomarker discovery, include the notion that only a small subset of all genes participate in the disease process and therefore are actually differentially expressed, while a vast majority of genes are not involved in this process and "do business as usual." Contrary to this notion, it is quite possible that rapidly fluctuating components of the regulatory network are integral parts of the process as a whole, and their high-frequency variations manifest the preparatory work of supplying the mRNAs for slower processes with bigger amplitudes.

8 Discussion

The model formalized in equations (1-3) possesses a rich gamut of features capable of simulating the properties of living cells. We briefly discuss some of them here. Formally speaking, equations (1-3) are written for the entire genome, and therefore, as shown in [Lws91], there is only one global point of equilibrium. However, if random sets of $r_{ikm}$ and $\omega_{ik}$ are clustered into a number of comparatively independent subsets, then the entire system (1) is also decomposed into comparatively independent subsystems, with their own equilibria. In this case, it would be reasonable to expect that the system may switch between different equilibria and produce different oscillatory repertoires. The concept of differentiation, i.e., the ability of living cells to perform different functions despite the fact that they have basically identical molecular structures, has been extensively discussed within a number of previously proposed regulatory models [DeJ02]. The model proposed here has the capability of mimicking cell differentiation as well. Results of extensive simulation of “tunneling” between different oscillatory repertoires will be published elsewhere.
Regulatory mechanisms in living systems are highly redundant and able to maintain their functionality even when a number of regulatory elements are “knocked out.” In the model proposed herein, all the individual transcription-translation subunits are driven by the “collective” random force whose stochastic structure is basically determined by the spectrum of the center manifold. Because this spectrum is generated by a large number of individual processes, it follows that if a certain number of genes is “knocked out”, then the majority of the remaining genes will not generally change their behavior. For the same reasons, the model suggested here has wide basins of attraction [Wue99], i.e., low sensitivity to initial conditions. This property is considered desirable for any formal scheme in models of living systems. In this work, the S-system has been selected to represent nonlinear interactions within genetic regulatory networks for two reasons. First, the S-system originates from (and adequately represents the dynamics of) bio-chemical reactions, a material basis of all the intra-cellular processes. Second, the S-system is known to be a universal approximator, i.e., to have the capability of representing a wide range of nonlinear functions under mild restrictions on their regularity and differentiability [Voi91]. However, the S-approximation is in no way unique in this sense. Sometimes it would seem desirable to maintain a more general view on the nonlinear structure, such as that provided by artificial neural networks (ANN), for example. Our numerical experiments show that a properly constructed ANN retains many of the same features as the S-functions. In fact, the only requirement which must be fulfilled when selecting a nonlinear model is that it has the “mixing” capability, i.e., provides strong interaction between normal oscillatory modes resulting in stochastic-like behavior of $F(p)$.
It is not our goal in this paper to present a sort of “theory of everything”; however, it is worth noting that the system described by equations (1-3) allows for many extensions and generalizations, retaining the advantage of being compact and mathematically tractable. In particular, any bio-chemical network, alone or in conjunction with the genetic regulatory network, may be considered in a similar manner.

References

[Bro39] Browne, E.T.: Limits to the characteristic roots of the matrix. The American Mathematical Monthly, 5, 252:265 (1939)
[Car81] Carr, J.: Applications of Center Manifold Theory. Springer-Verlag (1981)
[CHC99] Chen, T., He, H., Church, G.: Modeling gene expression with differential equations. Pacific Symposium on Biocomputing (1999)
[Che84] Chen, F.: Introduction to Plasma Physics and Controlled Fusion. Plenum Press (1984)
[DeJ02] DeJong, H.: Modeling and Simulation of Genetic Regulatory Systems: a Literature Review. Journal of Computational Biology, 9(1), 63:103 (2002)
[GAP04] Garcia-Martinez, J., Aranda, A. and Perez-Ortin, J.: Genomic run-on evaluates transcription rates for all yeast genes and identifies gene regulatory mechanisms. Molecular Cell, 15, 303:313 (2004)
[Gar83] Gardiner, C.W.: Handbook of Stochastic Methods for Physics, Chemistry and the Natural Sciences. Springer-Verlag (1983)
[KMP03] Kim, J.T., Martinetz, T. and Polani, D.: Bioinformatic principles underlying the information content of transcription factor binding sites. J. Theor. Biol. 220, 529:544 (2003)
[Lew04] Lewin, B.: Genes VIII. Upper Saddle River, NJ (2004)
[Lws91] Lewis, D.: A Qualitative Analysis of S-Systems: Hopf Bifurcation. In: Voit, E., ed. Canonical Nonlinear Modeling. S-System Approach to Understanding Complexity. Van Nostrand Reinhold, NY (1991)
[MSa95] McAdams, H.H., and Shapiro, I.: Circuit simulation of genetic networks. Science 269, 650:656 (1995)
[Oeh92] Oehlert, G.W.: A note on the delta method. American Statistician 46, 27:29 (1992)
[PYa89] Parr, R.G. and Yang, W.: Density Functional Theory of Atoms and Molecules. Oxford University Press, New York (1989)
[Per01] Perko, L.: Differential Equations and Dynamical Systems, Third Edition. Springer-Verlag (2001)
[SVo87] Savageau, M. and Voit, E.: Recasting nonlinear differential equations as S-systems: a canonical nonlinear form. Mathematical Biosciences, 87, 83:115 (1987)
[SSa89] Sorribas, A., and Savageau, M.: Strategies for representing metabolic pathways within biochemical systems theory: reversible pathways. Mathematical Biosciences, 94, 239:269 (1989)
[Voi91] Voit, E., ed.: Canonical Nonlinear Modeling. S-System Approach to Understanding Complexity. Van Nostrand Reinhold, NY (1991)
[WCB02] Wang, W., Cherry, J.M., Botstein, D., and Li, H.: A systematic approach to reconstructing transcription networks in Saccharomyces cerevisiae. PNAS 99(26), 16893:16898 (2002)
[Wue99] Wuensche, A.: Genomic regulation modeled as a network with basins of attraction. In: Pacific Symposium on Biocomputing'98 (1999)