CS5314 Paradigms in Bioinformatics
Midterm #2
Alexandru Cioaca

While prokaryotes are simple, mostly unicellular organisms concerned with basic interactions with the environment, eukaryotes have a much more complex organization, with cells that specialize in different functions (differentiation) and self-assemble into higher-order structures such as tissues and organs. By employing these varied functions in a more or less coordinated fashion, the cells work together to build upon the basic needs of survival and reproduction. They benefit from a larger set of skills through which they can interact with other cells, individuals, species or the environment.

However, almost all the cells in a eukaryotic organism contain a copy of the DNA of the individual. In other words, each one of these cells contains a copy of the genome, so it has immediate access to all the biological information about every aspect of that individual. This information is structured in basic units known as genes, a gene being nothing more than a specific section of the DNA, encoded as the order in which nucleotides are laid out in linear sequence. Through biochemical reactions, this information is used to synthesize (mostly) proteins, which accomplish various tasks inside the organism in order to support the processes of life. But if each body cell contains the information describing not only what that cell is designated to do but also what every other cell type is designated to do, then how do cells differentiate in the first place, and how does the organism know what subset of genes has to be active for each cell type? At the same time, it is intuitive that gene expression does not happen all the time for every gene.
While it is true that there are some genes (called housekeeping genes) responsible for the permanent cycle of routines sustaining metabolism, it is also true that there are genes that get expressed only under particular circumstances, usually when their products are really needed inside the cell. This suggests that at the cellular level there must exist mechanisms regulating the activity of genes. The activity of these mechanisms falls under the broad term of “gene regulation”. If life can be seen as a complex system of biochemical processes consisting, at the lowest level, of using genomic information to act in a specific manner, then gene regulation consists of controlling these processes and the way they are interconnected.

Since the pathway of gene expression has several steps, it is useful to review their order for a better understanding of where gene regulation can intervene. The DNA molecule resides in the nucleus of the cell and contains all the genes. When a particular gene is about to be expressed, a temporary copy of its information is created in the form of an RNA molecule (more specifically, mRNA). This process is called “transcription” and takes place in the nucleus. The mRNA molecule is then processed through several other chemical reactions that improve its robustness. Then the mRNA moves from the nucleus to the cytoplasm where, based on the information carried from the nucleus, it instructs the ribosomes to synthesize chains of amino acids called polypeptides, which form the protein product. This process is called “translation”.

The pathway can be extended both at the beginning and at the end. At the end, because the protein product is again subject to transport or to various other reactions which might stop it from accomplishing the task it is meant for.
At the beginning, because the DNA molecule, although identical in information in each body cell throughout the organism, differs from cell to cell in the set of genes that are active.

The steps in gene expression where control mechanisms can act are:
- Pre-transcription (before transcription is initiated, e.g. which genes are active)
- Transcription (copying DNA into mRNA might be blocked through certain mechanisms or might require some auxiliary cellular activity)
- Post-transcription (mRNA might not be robust enough to make it to the ribosomes)
- Translation (the structure of mRNA might not be suitable for use by ribosomes)
- Post-translation (the protein product might be obstructed from pursuing its actions)

We can say there are two big categories of control mechanisms: some that facilitate or induce a certain event and some that deny or inhibit a certain event. An important observation is that we cannot count as control mechanisms those breakdowns in the intermediate steps of protein synthesis that have a temporary or non-deterministic origin, such as random faults or insufficient resources. Regulation implies steadiness: under similar circumstances, similar results have to be obtained, with little to no variation. Regulation does not happen by chance but through means which are as deeply embedded in the organism as the processes under control. This also tells us that regulation is itself a product of evolution, so it appeared through the same natural hit-and-miss principles, from mutation to mutation, until it provided an advantage in survival.

What follows is a list of regulatory mechanisms of gene expression in chronological order, from pre-transcription to post-translation:

- Chromatin structure
In eukaryotic organisms, the DNA strand can reach a length of 2 meters, but its diameter is on the molecular scale.
In order for it to fit inside the microscopic nucleus of cells, it is packaged by being wound around proteins called histones in a tight thread-on-spool fashion. These structural units are called “nucleosomes” and are, in turn, tightly packed into structures known as chromosomes. Studies done on DNA in vitro have shown that it makes a difference whether key sites of a gene are wrapped around histones or suspended between them. These sites allow certain proteins to bind to them, proteins which play an important role in transcription to RNA. Two relevant examples are the transcriptional activator protein (TAP) and the TATA-box-binding protein (TBP). In the attached figures we can see the unfavorable case where the binding sites for these two proteins are inaccessible. However, there are several multiprotein complexes called chromatin-remodeling complexes (CRC) that can be employed to loosen DNA from the nucleosomes so that the binding sites for TAP and TBP become accessible. This is shown in the attached figures. Once this happens, the transcription factor can attach close to the gene marked by these sites and is joined by RNA polymerase, which transcribes the gene. The distribution of DNA over the nucleosomes is hard to predict from one cell to another, but because DNA is so tightly compressed, it is very likely that there will be genes whose binding sites are not accessible. In this case, CRCs act as regulators that make the genes transcribable. Since CRCs are nothing but protein complexes, they are themselves synthesized from information contained in other genes.

- Altering of DNA structure
Sometimes changes occur in the DNA sequence of somatic cells and are transmitted to their descendants. These changes are programmed and are either deletions or transpositions. They have direct effects on gene expression, since the sequence of nucleotides is no longer the same.
An example of programmed deletions takes place in the bone-marrow-derived cells (B cells) and thymus-derived cells (T cells) of vertebrate immune systems. They have complementary roles: B cells produce antibodies that mark antigens for destruction, while T cells recognize antigens displayed on cell surfaces and either destroy the affected cells or help coordinate the immune response. Each B cell is able to synthesize only one type of antibody, and it has been discovered that this antibody is the expression of a particular gene. However, each of the genes responsible for synthesizing antibodies was found to be assembled from subsequences of a longer initial sequence. This long sequence is cut and rejoined as the B cell develops, so that each cell lineage ends up with its own combination of segments, able to recognize a particular type of antigen. One of the most common antibodies in the organism is immunoglobulin G, whose structure was found to resemble the letter Y. Across antibodies targeting different antigens, most of the structure is chemically similar, except for the upper ends. Their configuration was shown to result from the way programmed deletions occur, which bring the DNA segment associated with the antibody in question next to the constant part of the gene.

Another example of altering DNA structure is that of programmed transpositions in regulating yeast mating type. This organism has two mating types, a and α. The difference lies only in phenotype, as studies have revealed that the genotype contains biological information about both mating types in the form of interchangeable cassettes. Through DNA rearrangement, yeast can switch to either a or α in the lineage of a particular cell and mate accordingly.

- Alternative promoters
There are genes that have more than one associated promoter. From the same protein-coding regions, depending on which one of the promoters is active, different transcripts can be obtained. In this case, the control mechanism is the choice of active promoter. Which promoter is active is determined by factors such as developmental stage.
For example, the gene for alcohol dehydrogenase in Drosophila uses one promoter in the larval stage and another one in the adult stage. This is a fascinating and elegant solution, comparable to that of dynamic pointers in high-level programming languages.

- Epigenetic control
“Epigenetic” is a term that means “on genes”. It refers to a type of control over gene expression that is not caused by altering the sequence of bases in DNA, but by an external factor that prevents a particular sequence from being read (transcribed) as it normally would be. One example is the addition of a methyl (CH3) group to the number-5 carbon atom of cytosine bases. This process is called methylation and it lowers the transcription rate of the methylated sequence. Another type of epigenetic mechanism involves specialized proteins that bind at a particular sequence of the DNA molecule, with the same effect on the transcription rate. Heavy methylation is associated with the inactivation of genes on the X chromosome, for example. As cells undergo division in females, there is a moment in the cell lineage where one of the X chromosomes becomes inactive, and all descendants of that cell inherit this state. Another example is that of “genomic imprinting” in mammals, where hundreds of genes are heavily methylated in the germ line, but in a different pattern in males and females. This pattern is retained throughout embryonic development, but it can be reversed later in development. Various theories suggest that this reflects a parental interest in conserving resources at the expense of the fetus, so that the exchange of resources between the mother and the progeny stays balanced.

- Transcriptional initiation
The initiation of transcription is probably the most widely used regulatory strategy. Transcription takes place in the nucleus and produces an mRNA molecule that carries information about a gene encoded in the DNA molecule to the cytoplasm.
This is achieved by RNA polymerase, which copies a sequence of nucleotides into mRNA. Thus, RNA polymerase has to know where this sequence of interest is located and when to start copying it. The latter issue usually involves proteins known as inducers (activators) and repressors. Inducers correspond to positive regulation and activate transcription of a certain gene. Without the activity or presence of an inducer, the gene is inactive. When the cell needs the gene to be expressed, a chain of events unfolds so that the inducer attaches at a location close to the gene (upstream), which signals the start of transcription. If the gene is constantly active, then it is probably regulated negatively, through proteins called repressors that bind upstream (by themselves or along with a protein complex) and disable the expression of the gene until further events make it necessary to resume.

The most important factor in transcriptional initiation is a protein called the transcriptional activator protein (TAP). It binds upstream of the gene and recruits the transcription complex which, in turn, triggers the recruitment of the RNA polymerase holoenzyme. Transcriptional activator proteins are mostly gene-specific; their action can be negatively regulated by proteins that bind to them and block the transcription complex. Common DNA-binding motifs found in TAPs include the helix-turn-helix and the zinc finger.

Because there are two types of regulation, positive and negative, a large variety of interactions with the environment becomes possible. A protein product that is required in special circumstances could be associated with a gene that is normally inactive. The lack of the protein product might negatively regulate another gene that is responsible for producing the TAP which enables the transcription of the gene associated with our missing protein product.
Another plausible scenario of regulation deals with synthesizing products that defend against a high concentration of an unwanted molecule in the cell. When the unwanted molecule is present, a normally inactive gene might be positively regulated by the intruder, and its associated product will start being produced.

Another class of regulatory mechanisms consists of DNA sequences found at a variety of locations around genes, called enhancers and silencers. As the names suggest, their molecular structure is designed either to hasten or strengthen transcription (enhancers) by binding the transcription complex or, on the contrary, to prevent transcription (silencers).

- Transcript Processing
Transcription of the same gene from the same promoter can still yield different mRNA molecules. This is due to an important feature of the genome called “alternative splicing”. Since most eukaryotic genes are non-contiguous blocks of coding sequence, the first draft of the mRNA contains two types of sequences: exons, which make up the final form of the mRNA, and introns, which are removed. However, by varying which segments are kept as exons and which are removed as introns during post-transcriptional processing, the cell can produce more than one product from the same gene. For example, the roughly 30000 human genes can encode an estimated 64000 to 90000 proteins through this variation. Thus, gene expression can be regulated by keeping certain sections of the initial mRNA molecule and discarding others. This is governed by decisional factors within the cell as it processes the mRNA in order to obtain sequences that are viable for translation. These decisional factors are means of regulating gene expression and act through the same biochemical algorithms developed by evolution.

- RNA Transport
Once DNA has been transcribed into mRNA and the mRNA has been processed, it heads towards the ribosomes in the cytoplasm for translation.
Regulatory factors have been found that can stop it on its way, RNA interference being one of them. RNA interference works through small RNA molecules that can guide the cleavage of mRNA into fragments that cannot be translated, or block it from being translated by the ribosome. These molecules are of two types: small interfering RNA (siRNA) and micro RNA (miRNA). They are produced in the cytoplasm from double-stranded RNA precursors, which are first chopped into short fragments by the Dicer enzyme. These cleavage products are recruited by the RNA-induced silencing complex (RISC) and target mRNA with complementary sequences. Their effect on the mRNA differs: RISC with siRNA cleaves the mRNA, while miRNA attaches to it and prevents translation.

- Transcript Stability
The mRNA molecule has a lifetime of about 3 hours in most eukaryotes and it is meant to be translated in the same cell. This is because each cell differentiates through the active set of genes that describes the function of that cell. Under special circumstances this rule does not hold, and the mechanisms that ensure a certain destination and lifetime for the mRNA molecule are overridden. An example occurs in newly fertilized eggs, whose metabolism translates preexisting cytoplasmic mRNAs transcribed by the mother. This is definitely not common practice in mature organisms. In Drosophila, for example, this becomes possible through the elongation of the poly-A tail of the mRNA. Another relevant example is that of silkworm fibroin mRNA. During cocoon formation, the silk gland synthesizes silk fibroin in large amounts. Three factors control this unusual behavior: the cells become highly polyploid, accumulating a large number of chromosomes and hence many copies of the fibroin gene; the promoter of this gene is strong and enhances the rate of transcription; and the transcribed mRNA is very stable, with a lifetime of days.
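To make the effect of transcript stability concrete, the sketch below compares how much of an mRNA pool survives after the same amount of time for a short-lived and a long-lived transcript. It rests on assumptions that are not in the text: decay is treated as first-order (exponential), the quoted “lifetime” is interpreted as a half-life, and the 72-hour figure is a hypothetical stand-in for “days”.

```python
def fraction_remaining(hours_elapsed, half_life_hours):
    """Fraction of an mRNA pool left after first-order (exponential) decay.

    Assumption: the 'lifetime' quoted in the text is treated as a half-life.
    """
    return 0.5 ** (hours_elapsed / half_life_hours)


# Typical transcript (about 3 hours) versus a hypothetical stable transcript
# such as fibroin mRNA (72 hours is an illustrative value, not a measurement).
typical = fraction_remaining(12, 3)
stable = fraction_remaining(12, 72)

print(f"typical transcript left after 12 h: {typical:.4f}")
print(f"stable transcript left after 12 h:  {stable:.4f}")
```

Under these assumptions, most of the short-lived transcript is gone after half a day while most of the stable one is still available for translation, which is the point the silk gland example makes.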
At the same time, there are factors that can speed up the degradation of mRNA. One of them is the deadenylation-dependent pathway, in which an enzyme trims the poly-A tail of the mRNA, making it susceptible to a decapping enzyme that removes the 5’ cap. Without the cap, the mRNA is unable to initiate translation and is rapidly degraded by exonucleases. The other is the deadenylation-independent pathway, which either decaps or cleaves the mRNA. These regulatory mechanisms are useful for preventing the synthesis of incomplete polypeptides in the cell.

- Initiation of Translation
Translation is the process through which mRNA is used by the ribosome to synthesize the polypeptides that compose the protein. This process takes place outside the nucleus and is independent of transcription. Eukaryotes can regulate gene expression at this level too. The two basic types of regulation that can be imposed here are the obstruction or facilitation of the translation of an mRNA, and the rate at which proteins are produced. In contrast with the examples given above for transcript stability, here we are dealing with an ordinary mRNA transcript whose translation is intensified or relaxed. The most interesting example of regulation at this level is given by recently discovered small regulatory RNA molecules complementary in sequence to mRNA. These are called “antisense RNA” and they act by pairing with mRNA over short sequences, the consequence being either inhibition or activation of translation. An example of inhibition can be found in E. coli, where the OxyS regulatory RNA affects the gene fhlA, which encodes a transcriptional activator protein. OxyS binds at critical sites of the mRNA, rendering it unable to bind to the ribosome.
On the other hand, the DsrA regulatory RNA activates the translation of the gene rpoS, which encodes a sigma factor for RNA polymerase. This sigma factor allows transcription of a new set of RNAs from a special set of promoters at stationary phase in cell cultures, when the cell density is high and cell proliferation is slow. The 5’ end of the rpoS mRNA is self-complementary and folds into a hairpin, trapping the ribosome-binding site and the translational start site. These sites become exposed when DsrA acts on the rpoS mRNA, so translation can be initiated. See attached figures.

- Post-Translational Modification
After the protein has been synthesized by the ribosome in the form of polypeptides (chains of amino acids), its functions can be extended through further chemical modifications that consist of joining other molecules to it, cleaving its structure at different sites, or changing some of its amino acid groups. These operations are usually performed by specialized enzymes, which can be considered the control mechanism active at this level. One of the organelles responsible for this type of regulation is the Golgi apparatus.

- Protein Stability
Some proteins degrade faster than others. The rate at which they decay can be a consequence of external molecules acting upon them, for example to reduce an excess of the protein in question or to correct a fault which caused a protein to be active in the wrong cell. Another cause of decay can be embedded in the protein itself, in the form of amino acid sequences that break down more easily over time. This means that once a protein is synthesized, there are still ways of controlling its behavior.

- Protein Transport
The last step in expressing a gene consists of transporting the protein to its designated “workplace”. The proteins are moved, unsurprisingly, by other proteins called carrier proteins.
The most challenging transport occurs through the cellular membrane, between two neighboring cells. This also hints at the possible reasons why protein transport should be controlled: there has to exist a mechanism to check outgoing or incoming proteins and make sure they are eligible for transport. Since these kinds of verification can only be carried out biochemically, it is the structure of the carrier proteins that enables them to verify the compatibility of the transported protein with its destination.

A gene regulatory network (GRN), as the name implies, is a set of genes responsible for influencing the expression of another target gene whose product is required by the cell. The term “network” suggests that the effect they have on the gene to be expressed is similar to that of a network of on-off switches and potentiometers, that is, both digital and analog controls. The scientific approach towards studying and modeling GRNs employs mathematical concepts such as graph theory and combinatorial logic. At a basic level, a GRN can be thought of as a black box that exerts one specific control action on the gene to be expressed, an action that can be represented as the resultant of the actions of all the genes in the network. But looking at what happens at the molecular level, the situation is far more complicated. The individual regulatory genes come into play at different times and determine different characteristics of the protein synthesis process. Some of them are interconnected as adjacent units, where the product of one gene directly communicates (reacts) with the next one in line, while others can be considered far-off branches of the network: they work in parallel and appear to be independent, but their products accumulate after other conditions (part of the same GRN) are met.
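The “network of on-off switches” view can be sketched as a small Boolean network, one of the simplest combinatorial-logic models of a GRN. Everything in the example is hypothetical: the gene names, the three update rules and the synchronous update scheme are illustrative assumptions, not a model of any real pathway, and the analog (“potentiometer”) side of regulation is deliberately left out.

```python
# Hypothetical three-gene network; names and rules are illustrative only.
# Each rule maps the current on/off state of the network to the next state
# of one gene. "signal" is an external input and has no rule of its own.
RULES = {
    "activator": lambda s: s["signal"],  # switched on by the input signal
    "repressor": lambda s: s["target"],  # feedback: made once target is on
    "target": lambda s: s["activator"] and not s["repressor"],
}


def step(state):
    """Synchronously update every regulated gene; inputs are held fixed."""
    new_state = dict(state)
    for gene, rule in RULES.items():
        new_state[gene] = bool(rule(state))
    return new_state


def simulate(state, n_steps):
    """Return the list of states visited, starting with the initial one."""
    history = [state]
    for _ in range(n_steps):
        state = step(state)
        history.append(state)
    return history


start = {"signal": True, "activator": False, "repressor": False, "target": False}
for t, s in enumerate(simulate(start, 6)):
    print(t, {gene: int(on) for gene, on in s.items()})
```

With the signal present, the target gene in this toy network switches on two steps after the activator appears and is then switched off again by its own repressor, so the states repeat in a cycle. Each gene is a node and each dependency in `RULES` is an arc of the graph.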
Considering that gene expression has the ultimate single goal of providing a finished product in the form of a protein, a GRN can be seen as a converging network of regulatory genes (like a funnel towards protein synthesis), orchestrating the intermediate steps by enabling or disabling, amplifying or attenuating certain biochemical reactions. Most definitions of GRNs place their action in the transcriptional scope, as the GRN determines when and how much RNA is transcribed for synthesis in the ribosome. However, considering the various types of gene regulation presented above, a larger scope for GRNs extends to all the steps undertaken inside the cell towards protein synthesis and usage.

Just as genes vary greatly in DNA sequence and proteins in their structure, GRNs come in different forms. Their common features are those given by their general role of regulation. First of all, GRNs need to be able to read the features of interest in the environment (cell, tissue, etc.) through input signals. These input signals could be the concentrations of particular molecules such as proteins and hormones. Then, GRNs need to be able to generate the appropriate output signals through which they influence the outcome of the target gene expression process. Since we are working in the same molecular context, it comes as no surprise that these output signals are molecules as well, mostly proteins. It is interesting to note, from an engineering perspective, that this type of communication is neither strictly synchronous nor asynchronous, and it involves neither closed nor open channels. The cell functions so well precisely because there are no constraints imposed on communication. Input signals are read from the “wild” and trigger the release of output signals into the “wild”; other similar mechanisms are responsible for transporting each output signal to its place of action, where it will act as a decisional factor.
Also, like all regulatory structures, GRNs need a feedback loop, which can be seen as nothing more than input signals (from the GRN perspective) that were generated by the environment after the GRN started taking action. As we can see, input, output and feedback signals are not that different in concept, all of them being molecules. It is their particular structure that makes the difference, but this structure is more than a symbol or a tag: it is also the actual function they are designated for. We are literally talking about a permanent circle of life, where nothing is enforced or requested, except that every molecule plays a small part, binding with the molecules with which it is compatible and treating them (or the event of binding) as an input signal; based on this, other molecules are synthesized which serve as output signals and influence the expression of other “coworkers”. Only when we place cellular life in an abstract framework and start dividing genes into regulatory and structural genes are we able to see causalities.

The relation between gene regulatory networks and the gene regulation mechanisms presented above is that GRNs can employ any of them as nodes of the network. Moreover, as we can see from the above, most of these mechanisms contain more than one step, so we can say that they are gene regulatory networks as well. Taking the most trivial example of regulation, we could have a housekeeping gene that is always active and whose biochemical pathway is not influenced by the action of any other regulatory gene. Even in this case, we still have more than one regulatory mechanism involved, as transcription has to be started by a transcriptional activator protein, which is the expression of another gene, and the mRNA has to be processed and then translated into a protein.
But even if we choose not to consider this a network, no other regulatory mechanism is as straightforward; they all increase the complexity of the pathway, so in every other case there is more than one gene contributing to the final product. In my opinion as a control engineer, “gene regulatory network” is nothing more than the appropriate term for what actually happens during gene expression. It is true, however, that we can establish degrees of complexity in these networks, based on the number of participating genes (number of nodes) and the interactions between them (number of arcs). For example, it would be unfair to place housekeeping genes in the same category as the complex system of genes controlling early development in embryos. The threshold between simple networks and complex networks is debatable, so approaches such as complexity theory and chaos theory could probably also be suitable for their study. Such approaches would also eliminate the bias of classifying some genes as structural and some as regulatory, which is only a matter of perspective.

The two papers studying chromatin structure and gene regulation I chose are:

“Mechanism of Protein Access to Specific DNA Sequences in Chromatin: A Dynamic Equilibrium Model for Gene Regulation”, K. J. Polach and J. Widom, J. Mol. Biol. (1995) 254, 130–149

“High-throughput mapping of the chromatin structure of human promoters”, Fatih Ozsolak, Jun S Song, X Shirley Liu and David E Fisher, Nature Biotechnology, Vol. 25, No. 2, Feb. 2007

The first paper deals with the problem that certain DNA sequences tightly wrapped around the nucleosomes should not be accessible to transcription regulatory proteins, yet they are still transcribed. The authors present three alternative explanations: proteins bind before DNA is packaged into chromosomes, the DNA sequences of interest are never actually packaged, or there are mechanisms of active invasion that allow those sequences to be transcribed.
They provide counter-arguments for all three models. For the first one, cells that are prevented from replicating their DNA still undergo transcription. The second one is dismissed on physical grounds, as there is nothing in the structure of DNA that could enforce the same distribution along the nucleosomes in every cell of the organism. And for the third one, the problem is that it lacks an explanation for how proteins are able to target the right nucleosome. The authors try to extend the third model by considering the nucleosomes as dynamic structures that temporarily expose stretches of their DNA. Their approach is to model the kinetics of nucleosomes mathematically and to argue for the correctness of the model through the laws of energy conservation. In parallel, they perform experiments in vitro with sea urchin cells by replacing the regulatory protein with a restriction enzyme and engineering nucleosomes with sites for E. In this way, they expect to detect through gel electrophoresis the effect of the enzyme on the nucleosomes. From their experiments they observe that all restriction sites are cleaved, hence accessible. At the same time, they estimate an equilibrium constant that quantifies how deep inside the nucleosome the restriction site is located; the further it is from exposure, the more energy is needed to dislodge it. The authors conclude that their assumption is true and that the model is a good approximation of the underlying mechanics. Nucleosomes are not static and temporarily expose the binding sites needed by regulatory proteins. These proteins use that short window of time to attach to the promoter, and they recruit other protein complexes which help displace the nucleosomes for proper transcription.

The second paper tries to address the problem of observing the positions of nucleosomes experimentally.
The authors present a high-throughput microarray approach and an analysis algorithm for examining nucleosome positioning in the promoters of 3600 human genes. First, they perform an in vivo footprinting experiment on the DNA of human cancer cells and hybridize isolated nucleosomal DNA and input DNA to the microarray. They study the data using signal-processing techniques such as wavelet decomposition against high-frequency noise and edge detection for curve profiling. The resulting curves have oscillatory shapes, and the location of each nucleosome is inferred from the peaks of the oscillations. Then they focus on extrapolating the locations of transcription factors from their data and find these to lie mostly between nucleosomes. The paper was interesting to read as it involved the use of signal processing and statistical analysis.
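The idea of inferring nucleosome locations from the peaks of a denoised, oscillating signal can be illustrated with a small sketch. This is not the algorithm from the paper: a centered moving average stands in for the wavelet denoising, a simple local-maximum test stands in for the edge detection, and the input is a synthetic signal with made-up period and noise level rather than microarray data.

```python
import math


def moving_average(signal, window):
    """Smooth with a centered moving average (odd window).

    Swapped-in substitute for wavelet denoising; edges use a truncated window.
    """
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        lo = max(0, i - half)
        hi = min(len(signal), i + half + 1)
        smoothed.append(sum(signal[lo:hi]) / (hi - lo))
    return smoothed


def find_peaks(signal, min_height):
    """Return indices of strict local maxima at or above min_height."""
    return [
        i
        for i in range(1, len(signal) - 1)
        if signal[i] >= min_height and signal[i - 1] < signal[i] > signal[i + 1]
    ]


# Synthetic "occupancy" profile: one hypothetical nucleosome every 20 probes,
# plus alternating high-frequency noise (all values are illustrative).
raw = [
    1.0 + math.cos(2 * math.pi * (i - 10) / 20) + 0.1 * (-1) ** i
    for i in range(100)
]
smoothed = moving_average(raw, 5)

print("peaks in raw signal:     ", find_peaks(raw, 1.0))
print("peaks in smoothed signal:", find_peaks(smoothed, 1.0))
```

On this synthetic profile the raw signal yields several spurious maxima around each true peak, while the smoothed signal yields one peak per oscillation, which is why a denoising step comes before peak calling. Flat-topped peaks would need a more tolerant test than the strict comparison used here.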