IMS 2003, 5th International Mathematica Symposium, London, Great Britain.

A Symbolic and Graphical Gene Regulation Model of the lac Operon

Garret Suen and Christian Jacob
Department of Computer Science
University of Calgary, Canada
{sueng, jacob}@cpsc.ucalgary.ca

1 Abstract

We present a symbolic, grammar-based model for the classic lac operon gene regulation system implemented in Mathematica. This functional model focuses on the information processing aspect of gene regulation through pattern matching on symbolic expressions. Our lac operon notebook provides a viewer component for animated, two-dimensional visualization of the simulated gene interaction processes, and is also connected to a 3D visualization engine.

2 Introduction

Biological research has changed drastically over the last decade. Nowadays, in an effort to minimize the time spent in the laboratory, data gathering and analysis are done primarily on computers. This has heralded new interdisciplinary fields between biology and computer science. Biological computing and bioinformatics seek a better understanding of biological phenomena through innovative programming techniques and algorithmic analysis.

Classic biological models that are well understood today can provide the basis for investigating larger, more complex models. For example, models of gene regulation in prokaryotic cells (without a nucleus) lead to a better understanding of gene regulation in more complex eukaryotic cells (with a nucleus).

In this paper, we present a symbolic model of the lactose operon, one of the simplest and best understood models of gene regulation in the bacterium Escherichia coli [1, 6]. We will show how gene regulation mechanisms, which mainly rely on key-lock matching among cell components, can be implemented directly through pattern matching on symbolic expressions. We will give examples of the basic data structures we use to encode an E.
coli cell proteome, demonstrate selected rules that model interactions among proteins and other cell units, and show a simple way to visualize the dynamics of these interactions over time, both in 2D and 3D.

3 The Lactose Operon

Escherichia coli (E. coli) is a single-celled bacterium that resides in the gut of humans. It is a prokaryotic organism, i.e., it does not have a nucleus to enclose its circular DNA. Hence, the DNA in E. coli is free to interact with all other elements within the cell. The lac operon, in particular, is a group of genes found on the E. coli genome, the constituents of which represent a classic and intensively studied model for gene regulation [1, 6, 7].

3.1 Gene Regulation in the lac Operon: The Key Players

In a lactose-rich environment E. coli uses the sugar lactose (short: lac) as its primary food source, which is converted into glucose, the bacterium's major source of energy, and galactose. The process of converting lactose into its constituents is controlled by the regulatory mechanisms of the lac operon structure, which encodes proteins that facilitate the breakdown and conversion of lactose. The operon itself regulates the synthesis of these proteins depending on the amount of lactose present in the cell.

More specifically, the lac operon consists of four genes (Fig. 1): the lacI gene, the lacZ gene, the lacY gene, and the lacA gene. The lacZ, lacY and lacA genes are adjacent to one another on the operon. They are preceded by a control complex consisting of an operator region and a promoter region. This promoter-operator complex allows the binding of a specific protein, RNA polymerase, which reads the genes and thereby produces the enzyme β-galactosidase, which in turn breaks down lactose into glucose and galactose (Fig. 1a).

Figure 1.
Key components involved in the lac operon gene regulation process [4]: (a) docking of the repressor complex at the operator site turns the lac operon off; (b) repressor inhibition, through a conformational change, turns the lac operon on.

The lacI gene, which encodes the repressor protein, is located upstream of the main gene complex and is preceded by its own control complex consisting of a single promoter region. The repressor serves as the basic control mechanism for the lac operon (Fig. 1b).

3.2 The lac Operon Model: DNA and Cell Cytoplasm

In contrast to a probabilistic or logical network approach [2], our symbolic, grammar-based model of the lac operon uses explicit representations of the components involved in the regulation process. The interaction and regulation mechanisms among these components are implemented as rewriting rules in Mathematica [5]. A similar grammar-based approach is suggested in [3], which, to our knowledge, has never been implemented as a working, computational model.

More specifically, we model the DNA strand, the cytoplasm, the energy compounds such as lactose, glucose, etc., and enzymes by explicit symbolic expressions. The following list data structure represents the two operon sections on the DNA and its surrounding cytoplasm (compare Fig. 1):

    Cytoplasm = {
      Operon[Promoter[], LacI[]],
      Operon[Promoter[], Operator[], LacZ[], LacY[]],
      RNAPolymerase[], Lactose, Glucose,
      LactosePermease[], Bgal[],
      RepressorTetramer[S1[], S2[], S3[], S4[]]}

We have made the following simplifications in our model of the lac operon:

(1) As the Watson-Crick complementarity of the double-strand encoding does not have any influence on the functional aspect of gene regulation, we represent DNA as a single-stranded list of Operon[...] expressions.

(2) We have not included explicit transcription and translation of proteins as this has no influence on the regulatory aspect of the gene.
Hence, proteins are transcribed and translated immediately after RNA polymerase has docked onto and read the single DNA strand.

(3) As we primarily focus on energy compounds and proteins that have a direct correlation to the operon, we do not consider galactose, a byproduct of lactose cleaving, and thiogalactoside transacetylase, an enzyme encoded by the lacA gene, whose function is still unknown.

(4) As we primarily focus on the lactose-β-galactosidase interaction, we also do not consider the CAP catabolite repressor section, which regulates the production of β-galactosidase based on glucose concentrations.

4 lac Operon Gene Regulation: Step-by-Step

In this section we give a detailed description of the simulation steps required in our grammar-based model of the lac operon system and of how the gene regulatory interactions are visualized. The iconic representations are generated automatically in the form of a frame-by-frame animation. Each frame in the animation represents the application of a single interaction rule among elements in the system. Elements (i.e., proteins, enzymes, and energy compounds) are represented as circles, squares, or diamonds of different colours or grey levels (Fig. 2). DNA elements are simple rectangular bands at the bottom of the display. All elements involved in the functional, grammar-based model are included in this visualization. To make the dynamic interactions easier to follow, arrows indicate which elements are interacting in each animation frame according to a particular rule.

[Figure 2: six animation frames (a) to (f), all in mutation state i+ o+ z+, showing the rules TetramerBindOperator (a, b), TetramerBindLactose (c, d), and BgalBindLactose (e, f). Each frame displays the cytoplasm elements (RNAPolymerase, Repressor, Lactose, LactosePermease, Bgal, and the Bgal-Lactose complex) above the DNA band Promoter, LacI, Promoter, Operator, LacZ, LacY.]

Figure 2. (a, b): binding of a repressor tetramer to the lacZ operator; (c, d): binding of four lactose molecules to a repressor tetramer, with subsequent conformational change; (e, f): β-galactosidase binds lactose in a β-gal-lactose complex and breaks it down.

4.1 Transcription and Translation of Structure Genes

RNA polymerase enzymes are present in large quantities in the cell. Therefore, we explicitly list them in the cytoplasm. RNA polymerase has a direct affinity for the promoter sites located on the operon.
Once docking has occurred, RNA polymerase moves along the operon and transcribes and translates the associated structure gene (Section 6). Hence, the repressor is synthesized from the lacI gene, lactose permease is synthesized from the lacY gene, and β-galactosidase is synthesized from the lacZ gene. This model skips the more complicated process of mRNA (messenger RNA) creation and its subsequent conversion into a protein through the action of ribosomes.

Here is an example of the rules we use to capture the docking of RNA polymerase onto a promoter site and its subsequent reading of the lacI gene:

    Cytoplasm /. {x___, Operon[Promoter[], LacI[]], y___, RNAPolymerase[], z___} :>
      {x, Operon[Promoter[RNAPolymerase[]], LacI[]], y, z}

    % /. Operon[Promoter[RNAPolymerase[]], LacI[]] :>
      Operon[Promoter[], LacI[RNAPolymerase[]]]

From the lacI gene, which RNA polymerase reads, a repressor is synthesized:

    % /. {x___, Operon[Promoter[], LacI[RNAPolymerase[]]], y___} :>
      {x, Operon[Promoter[], LacI[]], y, Repressor[], RNAPolymerase[]}

We use similar rules for the other interactions among cell components described in the following sections, where we use graphical representations instead.

4.2 Binding of Repressor Tetramer to the lacZ Operator

Four repressor molecules synthesized from the lacI gene form a repressor tetramer, which has an affinity for binding to the operator region that precedes the lacZ gene (Fig. 2a, 2b). Once the repressor tetramer has bound, the lacZ gene cannot be transcribed, as the docking site for the RNA polymerase is blocked by the repressor. Consequently, in the absence of lactose this mechanism turns the lacZ gene off in order to preserve cellular resources.

4.3 Lactose Entering the Cell

Lactose is free to enter the cell from the outside environment through the use of lactose permease, a protein encoded by the lacY gene. In the presence of lactose the lacZ gene is turned on.
This is accomplished through the binding of four lactose molecules to a single repressor tetramer (Fig. 2c, 2d). This binding causes a structural deformation of the repressor, such that it is no longer able to bind to the operator region (Fig. 1b). As a result, RNA polymerase is again free to bind to the promoter-operator complex and synthesize β-galactosidase.

4.4 β-Galactosidase Breaks Down Lactose

The release of the repressor tetramer from the operator allows the lacZ gene to be transcribed and translated again, resulting in an increase in the synthesis of β-galactosidase, which subsequently cleaves lactose into its constituent parts (glucose and galactose) and reduces the concentration of lactose in the cell (Fig. 2e, 2f). The removal of lactose from the cell allows the repressor to bind to the operator region again and reduce the production of β-galactosidase, thus regulating gene expression.

Figure 3. A snapshot during the synthesis of β-galactosidase: several RNA polymerases are attached to the circular double helix of the DNA. Codons, i.e., triplets of nucleotides (A, G, C, T), are represented by colour-coded spheres. Both mRNA strands and chains of amino acids are represented as long cylinders.

5 The lac Operon Notebook: Animations and Data Plots

Our lac operon gene regulation system iteratively applies the rewriting rules described in the previous section. We simulate interactions among the elements in rounds of rule application. In each round, the rewrite rules are executed on the current state of the cell. The order in which the rewrite rules are applied in each round is chosen randomly, so as to model the randomness found in the biological system of the lac operon. Users of the lac operon notebook can run the simulation interactively by choosing the number of rounds. The output for each round consists of a list of rewrite rules and their associated graphics, which display the current state of the cell (Fig. 2).
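The round-based rule application described above can be sketched as follows. This is a minimal sketch, not the notebook's actual code: it assumes a recent Mathematica version (for RandomSample), uses only three illustrative rules for RNA polymerase docking, reading of the lacI gene, and repressor synthesis, and the names cytoplasm, rules, applyRound, simulate, and count are hypothetical helpers introduced for illustration.

```mathematica
(* Initial cell state: two operon sections plus cytoplasm elements *)
cytoplasm = {
   Operon[Promoter[], LacI[]],
   Operon[Promoter[], Operator[], LacZ[], LacY[]],
   RNAPolymerase[], Lactose, Glucose,
   LactosePermease[], Bgal[],
   RepressorTetramer[S1[], S2[], S3[], S4[]]};

(* Three illustrative rewriting rules (an assumed, reduced rule set):
   docking, reading of lacI, and repressor synthesis *)
rules = {
   {x___, Operon[Promoter[], LacI[]], y___, RNAPolymerase[], z___} :>
    {x, Operon[Promoter[RNAPolymerase[]], LacI[]], y, z},
   {x___, Operon[Promoter[RNAPolymerase[]], LacI[]], y___} :>
    {x, Operon[Promoter[], LacI[RNAPolymerase[]]], y},
   {x___, Operon[Promoter[], LacI[RNAPolymerase[]]], y___} :>
    {x, Operon[Promoter[], LacI[]], y, Repressor[], RNAPolymerase[]}};

(* One round: try every rule once on the whole state, in random order.
   A rule whose pattern does not match leaves the state unchanged. *)
applyRound[state_List] :=
  Fold[Replace[#1, #2] &, state, RandomSample[rules]]

(* Run n rounds and keep the initial state plus every intermediate state *)
simulate[state_List, n_Integer] := NestList[applyRound, state, n]

(* Number of copies of an element, given either as f[] or as a bare symbol *)
count[state_List, head_Symbol] := Count[state, head[___] | head]

(* Repressor count after each of 6 rounds (values vary between runs) *)
history = simulate[cytoplasm, 6];
count[#, Repressor] & /@ history
```

Because the rule order is reshuffled in every round, the counts differ from run to run. With these three rules no repressor is ever consumed, so the sequence of repressor counts can only stay the same or grow from one round to the next.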
In addition, data plots are generated for each simulation run, which chart the concentrations of the following four elements over the simulation time period: Lactose, Glucose, Repressor, and β-galactosidase.

6 Visualization in 3D

In order to complement our functional, grammar-based model of gene regulation with a visualization component, we have designed a 3D visualization engine based on the Java 3D™ library. To model the lac operon realistically, important coding regions are included in the DNA strand. The lacZ gene, with its associated control complex, is incorporated into the DNA structure. In addition, the lacI gene along with its control complex is also part of the visualization. All other interactive elements, including RNA polymerase, repressor molecules, and β-galactosidase molecules, are rendered as spheres of different colours (Fig. 3).

There are two primary processes not yet considered in our symbolic, functional lac operon model, namely transcription and translation. One major objective of our 3D visualization is to model these processes, on an appropriate abstract level, as they occur in the cell. These intermediate processes are ultimately of interest for understanding gene regulation processes in general and the lac operon in particular.

Transcription is the process of converting a DNA template strand into an intermediate single strand of messenger RNA (mRNA). RNA polymerase reads a given template strand of DNA and transcribes the codons into mRNA. The use of colour-coded codons makes the various DNA and RNA structures easier to interpret.

Translation is the process of converting an mRNA strand into the appropriate protein. It is facilitated by two components: ribosomes and transfer RNA (tRNA). Ribosomes and tRNA carry out the creation of proteins based on the mRNA encoding. The colour coding of amino acids relates directly to specific codons on both mRNA and DNA strands.
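The two processes described above can be illustrated at the sequence level. The following is a minimal sketch and not part of the lac operon notebook or the 3D engine: the function names transcribe and translate, the three-entry codon table (an excerpt of the standard genetic code), and the 12-base template strand are assumptions chosen for illustration. The template is written in the 3' to 5' direction so that the resulting mRNA reads 5' to 3'. A reasonably recent Mathematica version is assumed (for TakeWhile).

```mathematica
(* Transcription: pair every base of the DNA template strand
   with its complementary RNA base *)
transcribe[template_String] :=
  StringReplace[template, {"A" -> "U", "T" -> "A", "G" -> "C", "C" -> "G"}]

(* Excerpt of the standard genetic code; unknown codons map to "?" *)
codonRules = {"AUG" -> "Met", "ACC" -> "Thr", "UAA" -> "Stop",
   _String -> "?"};

(* Translation: split the mRNA into codons (triplets), look up each
   codon, and stop at the first stop codon *)
translate[mrna_String] := Module[{codons},
  codons = StringJoin /@ Partition[Characters[mrna], 3];
  TakeWhile[Replace[codons, codonRules, {1}], # =!= "Stop" &]]

mrna = transcribe["TACTGGTACATT"]   (* "AUGACCAUGUAA" *)
translate[mrna]                     (* {"Met", "Thr", "Met"} *)
```

The one-to-one mapping from codons to amino acids is what a colour-coded display can exploit: each triplet and the amino acid it encodes can share a colour.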
7 Conclusion and Future Work

We have presented a symbolic, grammar-based model of the classic lac operon gene regulation system. The outlined functional model focuses on the information processing aspect of gene regulation. As we are using symbolic expressions to represent all involved structures and functions, it is relatively straightforward to combine the model with a genetic programming engine. This will enable us to evolve gene regulation mechanisms and compare our results with the regulatory techniques evolved by nature. We are currently working on the integration of the described gene regulation model into our evolutionary computation environment, Evolvica [5]. In addition, we will include activators, such as CAP (catabolite activator protein), in the model, which regulate the production of β-galactosidase based on glucose concentrations. Finally, we are working on an extension of the current 3D visualization engine, so that all aspects of the functional model are directly translated into animated, three-dimensional visualizations, which are automatically generated from the modeled genomic interaction processes.

Further information is available at: http://www.cpsc.ucalgary.ca/~jacob/LacOperon.

Acknowledgement

We would like to thank Julie Andreotti and Ian Burleigh for their help with the implementation of the 3D visualization.

References

[1] Beckwith, J. R., Zipser, D. (eds.): The Lactose Operon. The Operon. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York (1970).
[2] Bower, J. M., Bolouri, H. (eds.): Computational Modeling of Genetic and Biochemical Networks. MIT Press, Cambridge, MA (2001).
[3] Collado-Vides, J.: Towards a grammatical paradigm for the study of the regulation of gene expression. In: Goodwin, B., Saunders, P. (eds.): Theoretical Biology. Epigenetic and Evolutionary Order from Complex Systems. Johns Hopkins University Press, Baltimore, MD (1992): 211-224.
[4] Crotty, S., Basu, A., Onufryk, C., Ingram, V.: MIT Biology Hypertextbook.
http://cyberbio.mit.edu/esgbio (1996).
[5] Jacob, C.: Illustrating Evolutionary Computation with Mathematica. Morgan Kaufmann Publishers, San Francisco (2001).
[6] Müller-Hill, B.: The lac Operon. A Short History of a Genetic Paradigm. de Gruyter, Berlin (1996).
[7] Ptashne, M., Gann, A.: Genes & Signals. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York (2002).