Chapter 1: THEORETICAL AND EXPERIMENTAL METHODS TO STUDY TRANSCRIPTION FACTORS AND THEIR INTERACTIONS

1.1 INTRODUCTION
1.2 MOLECULAR MECHANISM FOR THE CONTROL OF GENE EXPRESSION
1.3 APPROACHES TO STUDY EVOLUTION OF TRANSCRIPTION FACTORS
1.3.1 COMPARISON OF PROTEIN SEQUENCES TO INFER HOMOLOGY
1.3.2 DOMAINS OF A PROTEIN CAN BE USED TO INFER EVOLUTIONARY HISTORY
1.3.3 PROCEDURES TO ASSIGN DOMAINS TO PROTEIN SEQUENCES
1.3.4 PROTEIN EVOLUTION
1.4 APPROACHES TO STUDY GENE EXPRESSION PROGRAMS
1.4.1 METHODS TO STUDY PROTEIN-DNA INTERACTIONS
1.4.2 GENE EXPRESSION ANALYSIS
1.5 STRUCTURE OF TRANSCRIPTIONAL REGULATORY NETWORKS
1.5.1 MOTIFS
1.5.2 MODULES
1.5.3 GLOBAL NETWORK ORGANISATION
1.6 REFERENCES

Parts of this chapter appeared in the following research review articles:
1. Madan Babu, M.1*, Luscombe, N.1, Aravind, L., Gerstein, M., Teichmann, S.A.* (2004). Structure and evolution of gene regulatory networks. Current Opinion in Structural Biology, 14(3):283-91.
2. Madan Babu, M.* (2004). Introduction to Microarray Data Analysis. Book chapter in Computational Genomics, Horizon Bioscience Press (Grant, R., Editor).
3. Luscombe, N.* and Madan Babu, M.* (2004). GenCompass: a universal system to analyse gene expression for any genome. Trends in Biotechnology, in press.

1.1 Introduction

This chapter provides an overview of the experimental and computational methods that we employed to study the evolution of genes, with particular emphasis on transcription factors, their target genes, and the resulting regulatory network. We first present sequence-based and structure-based methods commonly used to study the evolution of genes and their products. These include methods to determine protein domains, infer protein sequence homology, and classify sequences and structures according to evolutionary principles. We then focus on experimental and computational methods to characterize sequences of transcriptional events. More specifically, we illustrate the principles behind the various strategies for identifying DNA binding proteins, DNA binding sites, and target genes for a given transcription factor. We include a discussion of microarray experiments and their analysis, a technique routinely used to monitor mRNA expression levels of genes at the genomic scale. We also illustrate the principles behind the different clustering procedures used to group genes according to properties of interest. Finally, we look at the organization of the sum total of individual transcriptional interactions into a comprehensive network at the genome level, and define properties of this network, including network motifs, modules, and the scale-free structure and topological properties it shares with non-biological networks such as the World Wide Web.
1.2 Molecular mechanism for the control of gene expression

The nature of any given organism is determined not only by the repertoire of genes that it encodes but also by the way in which these genes are precisely regulated at any given time point. Transcription factors are an important class of proteins that control the mRNA expression of other genes by binding to DNA near a gene, thereby affecting transcription of the nearby genes. Most transcription factors achieve this by responding to changes in external signals. In most prokaryotes, sensing changes in the levels of a small molecule can bring about active expression of genes that catabolize the small molecule (e.g. lactose) or repress expression of genes that synthesise it (e.g. tryptophan). Most transcription factors consist of two regions, one that recognises the signal and another that binds DNA (Jacob and Monod, 1961). An example of a transcription factor, Crp, is shown in Figure 1.1.

Figure 1.1: 3D protein structure of the transcription factor Crp bound to cAMP and DNA. The part of the protein that senses the external signal and the part that binds DNA are coloured blue and red respectively. In this case, the signal is a small molecule, cyclic AMP, coloured green. The DNA sequence to which Crp binds is coloured yellow. The coordinates for the structure were downloaded from the Protein Data Bank (PDB: 1J59).

In the case of higher eukaryotes, transcription factors have been shown to be crucial for the developmental process. A class of Hox transcription factors has been conserved throughout evolution and is involved in deciding the body plan in eukaryotes. Thus, these DNA binding transcription factors are the key proteins that bring about the execution of specific developmental programs by regulating expression of the required proteins in the right place at the right time.
Figure 1.2 below shows the arrangement of the Hox genes in the genomes of mammals and flies, suggesting a common program for the development of the body plan.

Figure 1.2: Developmental program and the Hox transcription factors. A set of Hox transcription factors in the genomes of mammals and flies. These genes direct the development of different segments in the body plans of many animals. The position of the gene also marks the position of its expression in the embryo. During embryonic development, the first genes (red) in the genome are expressed at the anterior of the embryo while other genes (orange, yellow) are expressed at more posterior parts. This pattern of expression in flies gives rise to mouth parts, thorax, wing segments, and the tail (Molecular Cell Biology; 6th Edition).

Thus we see that transcription factors are proteins that can sense changes in the environment and, depending upon the change, can either activate or repress expression of certain genes to bring about the required effect.

1.3 Approaches to study evolution of transcription factors

The following section explains concepts and methods that can be applied to understanding the evolution of proteins, and which were applied in this dissertation to the study of the evolution of transcription factors and their regulated target genes.

1.3.1 Comparison of protein sequences to infer homology

One can infer homology by comparing protein sequences. Some of the commonly used pairwise sequence comparison programs include Blast (Altschul et al., 1990), Fasta (Pearson and Lipman, 1988) and programs that implement the Smith-Waterman algorithm (Ssearch and MPsearch) (Needleman and Wunsch, 1970; Smith and Waterman, 1981). These can be broadly grouped into two classes: (i) Programs that perform an ‘exhaustive search’, where all possible combinations of amino acid positions are compared to get the best alignment.
Such methods are computationally intensive when it comes to comparing a given sequence against a huge database. (ii) Programs that employ heuristics, which search for words instead of individual amino acids. Such methods compromise on sensitivity, but are very fast when comparing a protein against a huge database, e.g. the UniProt sequence database or the Swiss-Prot database (Leinonen et al., 2004).

Pairwise comparison methods are generally reliable in detecting orthologs and closely related paralogs. For example, a bi-directional best-hit procedure using standard sequence comparison methods is routinely used to detect orthologous proteins in large-scale comparative genomics studies. It was used in this thesis (see chapter 4 for details). However, a problem arises when the compared sequences have diverged too far, as is true for very distant paralogs within the same genome. In such cases, regular sequence comparison methods fail to detect homology. This is where analysing protein sequences in the context of their domain content can help to reliably detect evolutionary relationships.

1.3.2 Domains of a protein can be used to infer evolutionary history

Proteins are made up of domains. Domains are defined based on either sequences, or structures, or both. Two different ways of defining a domain are shown in Figure 1.3. A domain is a defined region in a protein, which can be either the complete protein or a part of it, and which can occur independently or in combination with other domains in a different protein. A list of domain databases is provided in Table 1.1. Throughout this thesis, the Structural Classification of Proteins (SCOP) domain definition (Andreeva et al., 2004; Lo Conte et al., 2000; Murzin et al., 1995) will be used unless otherwise stated. In the SCOP database, domains are defined based on protein structure.
In SCOP, a domain is defined as an independent evolutionary and structural unit that can undergo duplication and recombination. Hence a domain can occur on its own or in combination with other domains in a different protein.

Figure 1.3: Sequence-based and structure-based domain definition. Panels: structure-based domain definition (Homeodomain superfamily in SCOP); sequence-based domain definition (Homeobox family in Pfam). SCOP (PDB: 9ANT) and Pfam (PF00046) definitions for the homeodomain family of proteins.

Small proteins consist of a single domain, for example hen egg white lysozyme (Figure 1.4a), whereas large proteins often consist of two or more domains and are referred to as multi-domain proteins. An example of a large multi-domain protein is shown in Figure 1.4b: a tyrosine kinase that consists of an SH3 domain, followed by an SH2 domain and a catalytic domain. Thus, domains in proteins can either function independently or contribute to the function of the multi-domain protein in cooperation with other domains (Teichmann et al., 1999; Vogel et al., 2004).

Figure 1.4: A single-domain protein and a multi-domain protein. (a) The 3D structure of hen egg white lysozyme (PDB: 193l), made of a single domain (domain architecture: Lysozyme-like). (b) The structure of Hck tyrosine kinase (PDB: 1AD5), made of an SH3 domain, an SH2 domain and a catalytic domain (domain architecture: SH3 domain : SH2 domain : Protein Kinase-like). The linear organisation of domains from the N-terminus to the C-terminus is referred to as the protein’s domain architecture and is shown below each protein.

Since a protein domain is an evolutionary unit (Murzin et al., 1995), by studying the domain composition of proteins one can understand their evolutionary history. Given that a domain can have a defined function, one can also predict possible functions of a protein with known domain composition.
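The notion of a domain architecture can be made concrete with a small sketch. The Python example below is only an illustration under stated assumptions: the domain assignments are hard-coded rather than computed, and the protein names `kinase_X` and `adaptor_Y` are hypothetical. Only the Hck and lysozyme architectures come from Figure 1.4. Each architecture is stored as a tuple of domain names ordered from the N-terminus to the C-terminus, so that proteins can be grouped by identical architecture or compared by shared domain content.

```python
from collections import defaultdict

# Domain architectures as tuples ordered from N-terminus to C-terminus.
# "Hck" and "lysozyme" follow Figure 1.4; the other two entries are
# hypothetical proteins added purely for illustration.
assignments = {
    "Hck": ("SH3", "SH2", "Protein kinase-like"),
    "kinase_X": ("SH3", "SH2", "Protein kinase-like"),
    "lysozyme": ("Lysozyme-like",),
    "adaptor_Y": ("SH2", "SH3"),
}


def group_by_architecture(assignments):
    """Group protein names that share an identical, ordered domain architecture."""
    groups = defaultdict(list)
    for protein, architecture in assignments.items():
        groups[tuple(architecture)].append(protein)
    return {arch: sorted(names) for arch, names in groups.items()}


def shared_domains(arch_a, arch_b):
    """Domains present in both architectures, ignoring order and copy number."""
    return set(arch_a) & set(arch_b)


groups = group_by_architecture(assignments)
for architecture, proteins in groups.items():
    print(" : ".join(architecture), "->", ", ".join(proteins))
```

Keeping the order in the tuple distinguishes the hypothetical `adaptor_Y` (SH2 followed by SH3) from the kinases (SH3 followed by SH2), even though both contain the same two domains.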
Some of the widely used domain databases are shown in Table 1.1.

Table 1.1a: A list of sequence-based domain databases

Database | Type | Description
Pfam | Sequence | Protein families database (Bateman et al., 2004) is a carefully curated database of protein sequence families which provides multiple sequence alignments and the hidden Markov models built using the HMMER (Eddy, 1998) package.
SMART | Sequence | Simple Modular Architecture Research Tool (Letunic et al., 2004) is similar to Pfam, however it consists only of domains that are seen in signalling, chromatin associated proteins and extra-cellular proteins. Hidden Markov models are built using the HMMER package.
TIGRFAM | Sequence | TIGRFAM (The Institute for Genomic Research Families) (Peterson et al., 2001) is another curated sequence based hidden Markov model library which is built by clustering protein sequences obtained primarily from the microbial genome sequencing projects.
ProDom | Sequence | ProDom (Servant et al., 2002) is a comprehensive collection of domain families which is obtained by automatically clustering all protein sequences available in the Swiss-Prot database.
COGs | Sequence | Clusters of Orthologous Groups (Tatusov et al., 2003) consist of groups of homologous proteins (i.e. proteins that have evolved from a common ancestor). Each group consists of paralogous proteins (duplicated proteins within one genome) and orthologous proteins (the same protein in different genomes). COGs are generated automatically by using the BLAST sequence analysis tool.
CDD | Sequence | Conserved domain database (Marchler-Bauer et al., 2003) uses Pfam, SMART and their own domain definition to build a position specific scoring matrix (PSSM). Reverse PSI-BLAST (Altschul et al., 1997) is then used to assign domains to proteins.
Table 1.1b: A list of structure-based domain databases

Database | Type | Description
MMDB | Structure | MMDB (Chen et al., 2003) groups sequences into clusters based on pairwise comparison of structures available in the Protein Data Bank (Bourne et al., 2004).
FSSP | Structure | FSSP (Holm and Sander, 1994) is a fully automatic classification scheme that clusters proteins into groups by carrying out structure-structure alignments.
CATH | Structure | CATH (Orengo et al., 1997; Orengo et al., 1999) is a hierarchical, semi-automatic method, which classifies protein domain structures into four major levels: Class, Architecture, Topology and Homologous superfamily.
SCOP | Structure | SCOP (Andreeva et al., 2004; Lo Conte et al., 2002; Murzin et al., 1995) is a manually curated hierarchical classification of all proteins of known structure. Proteins are broken into domains, which are then classified into the following levels: family, superfamily, fold and class.
Gene3D | Structure | Gene3D (Buchan et al., 2003) is a library of HMMs which are built using the SAM-T99 package using the CATH domain definition.
Superfamily | Structure | Superfamily (Madera et al., 2004) consists of a library of hidden Markov models which are built using the SAM-T99 package at the SCOP superfamily level using the SCOP domain definitions.

Table 1.1c: A list of sequence and structure based domain databases

Database | Type | Description
InterPro | Sequence and structure | InterPro (Mulder et al., 2003) is an integrated resource, which uses many different domain databases, including Pfam, SMART, and Superfamily.
HOMSTRAD | Sequence and structure | Homologous structure alignment database (Stebbings and Mizuguchi, 2004) consists of annotated structural alignments for homologous families and is based on a variety of other domain definitions.

Table 1.1: A list of databases that use protein sequences and structures to define domains.
For a comprehensive list of known domain databases, the reader is referred to the annual January database issue of the journal Nucleic Acids Research.

1.3.3 Procedures to assign domains to protein sequences

As mentioned above, one can infer the function and the evolutionary history of an uncharacterised protein by identifying the domains of which it is made. To do this, one has to be able to reliably detect regions in the protein sequence that belong to a particular domain family. There exist a variety of computational techniques that can detect domains, some of which are discussed below.

A. Profile based methods

Profile based methods like PSI-BLAST first pick up close homologs using standard pairwise sequence search methods. A multiple sequence alignment is built from the close homologs. For each position in the multiple sequence alignment, a position specific scoring matrix (PSSM) is generated that gives a measure of the variability in the amino acid composition at that position. Thus, if a particular amino acid is highly conserved at a position, the matrix assigns a strongly negative score to substitutions at that position. In this way, functionally important regions of the protein sequence are treated differently from regions that can accumulate mutations without strict constraints. Using this position specific scoring matrix and the multiple sequence alignment, the database is searched again to find distant homologs in an iterative manner. In programs implementing iterative PSI-BLAST (Schaffer et al., 2001), the user can specify the number of iterations or can ask the program to iterate this procedure until convergence, i.e. until no more new sequences are picked up.

B. Hidden Markov model (HMM) based methods

Another advancement in the area of protein homology detection was made with the use of hidden Markov models to search for specific protein domains.
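Because the HMM discussion builds on the notion of a profile, a small sketch of a position specific scoring matrix may be useful at this point. The Python example below is a deliberately simplified illustration and not the PSI-BLAST procedure itself: it assumes a uniform background frequency for the 20 amino acids, a flat pseudocount, no sequence weighting, and a toy gap-free alignment invented for the example. It shows how a conserved column yields positive scores for the conserved residue and negative scores for substitutions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dictionary per alignment column.

    Assumptions (simplifications, not PSI-BLAST's actual scheme): uniform
    background frequencies, a flat pseudocount, and no sequence weighting.
    """
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        # Count only standard residues; gap characters are ignored.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Sum of the column scores for a segment as long as the profile."""
    return sum(column[res] for column, res in zip(pssm, segment))


# Toy alignment of four hypothetical close homologs (five columns).
alignment = ["GKSTL", "GKSAL", "GKTSL", "GRSTL"]
pssm = build_pssm(alignment)

print(round(pssm[0]["G"], 2), round(pssm[0]["A"], 2))
print(score(pssm, "GKSTL") > score(pssm, "AAAAA"))
```

In the first column glycine is fully conserved, so it receives a positive score while every other residue receives a negative one. The variable fourth column penalises substitutions less.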
A hidden Markov model is a different way to represent a profile (Eddy, 1996). HMMs treat amino acid insertions and deletions more efficiently than profile based methods. An HMM can be thought of as a finite model that describes a probability distribution over a large number of possible sequences. Two commonly used programs are HMMER (Eddy, 1998) and SAM (Karplus et al., 1998). The HMMER package requires a multiple sequence alignment, which is then used to create HMMs, and to calibrate and score them. In the case of SAM, a single sequence can be used as input, which is then automatically searched in an iterative manner against a database to create a multiple sequence alignment and to generate an HMM (Madera and Gough, 2002). The Target 99 module carries out the multiple sequence alignment, hence the other name of the package, SAM-T99. The Pfam database uses HMMER for building its library of HMMs, whereas the Superfamily database uses SAM-T99.

Initial studies by Park et al (Park et al., 1998) and Brenner et al (Brenner et al., 1998) have shown that profile based methods perform better in detecting remote homologs than pairwise sequence comparison methods. More recently, a study by Madera and Gough showed that HMM based methods give better results than profile based methods. A thorough and systematic comparison of the performance of different methods is given in Madera and Gough (Madera and Gough, 2002).

1.3.4 Protein evolution by duplication, divergence and recombination of domains

It is now well established that most proteins in genomes have evolved by duplication, divergence and recombination of existing domains (Chothia et al., 2003; Vogel et al., 2004). When the first genome sequences of H. influenzae and M. genitalium were published, Brenner et al (Brenner et al., 1995) and Teichmann et al (Teichmann et al., 1998) showed that there has been extensive duplication of a limited number of distinct domain families.
Studies that probed the conservation of domain order by assigning domains to protein sequences from completely sequenced genomes showed that domain order is highly conserved in many proteins (Apic et al., 2001). A study on the Rossmann domain family of proteins by Bashton and Chothia (Bashton and Chothia, 2002) revealed that most proteins have a conserved domain order even under little functional constraint. Considered together, these results suggest that during evolution, proteins conserved their domain architecture following duplication and divergence. Thus proteins with identical domain architecture can be considered to have evolved by duplication of a common ancestor.

The original work reported in this thesis is a procedure to identify DNA binding transcription factors in E. coli based on domain assignments to the known DNA binding domain families (chapter 2). This procedure has been applied to predict transcription factors for many completely sequenced genomes, including the mouse genome and the Dictyostelium genome. The number of predicted transcription factors for five genomes is shown in Table 1.2, ranging from fewer than 300 in E. coli and S. cerevisiae to over 2,600 in H. sapiens. These proteins constitute between roughly 4% (in S. cerevisiae) and 8% (in H. sapiens) of all transcripts in these organisms.

Table 1.2: Predicted number of transcription factors in five model organisms

Organism | Number of transcripts | Number of transcripts with DNA-binding domains (a) | Percentage of transcripts containing DNA-binding domains
E. coli | 4,280 | 267 | 6.2%
S. cerevisiae | 6,357 | 245 | 3.9%
C. elegans | 31,677 | 1463 | 4.6%
H. sapiens | 32,036 | 2604 | 8.1%
A. thaliana | 28,787 | 1667 | 5.7%

Table 1.2: (a) DNA-binding domain assignments from Pfam and SUPERFAMILY were used to establish the repertoire of DNA-binding transcription factors in five model organisms. An expectation value threshold of 0.002 was used in making the assignments.
Co-regulators that do not bind DNA directly were excluded.

Our procedure allows us to assess the evolutionary relationships among transcription factors. Our studies and those of others using E. coli (Madan Babu and Teichmann, 2003; Perez-Rueda and Collado-Vides, 2000), archaea (Aravind and Koonin, 1999), plants and animals (Ledent and Vervoort, 2001; Riechmann et al., 2000) have consistently demonstrated that transcription factors draw their DNA-binding domains from a relatively small, ancient conserved repertoire. Different organisms have various parts of the ancient repertoire expanded in their genomes. For example, the winged-helix domain and the zinc ribbon domain are encountered in all three principal kingdoms of life. The ribbon-helix-helix (MetJ/Arc) domain is found only in prokaryotes (Aravind and Koonin, 1999), whereas the crown group eukaryotes display a proliferation of several novel DNA-binding domains, such as the C2H2 zinc finger domains, the AT hook domains, the HMG1 domains and the MADS box domains (Lander et al., 2001).

Shown in Figure 1.5 are examples of some of the DNA binding domains in the five genomes listed in Table 1.2. The DNA-binding domain families have been chosen to emphasize that many families are specific to individual phylogenetic groups and can be greatly expanded in some genomes. For example, the nuclear hormone receptor family transcription factors are very abundant in Caenorhabditis elegans compared with other organisms, whereas the Zn2/Cys6 fungal-type zinc finger is expanded in the fungi, but absent elsewhere. In contrast to the high level of conservation of other regulatory and signalling systems across the crown group eukaryotes, some of the transcription factor families are dramatically different in the various lineages.
This suggests a major role for recurrent, massive and lineage-specific expansions in the evolution of transcription factors in the crown group eukaryotes (Coulson and Ouzounis, 2003; Lespinet et al., 2002). In prokaryotes, several orthologous groups of transcription factors show a much wider spread across phylogenetically diverse organisms, suggesting a role for horizontal transfers, in addition to diversification through a lower level of lineage-specific duplications.

Figure 1.5: Lineage-specific expansion of DNA-binding domain families

DNA-binding domain family | E. coli | S. cerevisiae | C. elegans | H. sapiens | A. thaliana
C-terminal effector domain of the bipartite response regulators | 17 | 0 | 0 | 0 | 0
Zn2/Cys6 DNA-binding domain | 0 | 53 | 0 | 0 | 0
Glucocorticoid receptor-like (DNA-binding domain) | 0 | 10 | 361 | 19 | 48
C2H2 and C2HC zinc fingers | 0 | 30 | 125 | 1039 | 59
SRF-like | 0 | 4 | 3 | 7 | 113

Domains shown in the architecture diagrams: C-terminal effector domain of the bipartite response regulators; C2H2 and C2HC zinc fingers; CheY-like; Zn2/Cys6 DNA-binding domain; SRF-like; Nuclear receptor ligand-binding domain; Glucocorticoid receptor-like (DNA-binding domain).

Figure 1.5: Examples of DNA-binding domain families of transcription factors that are prevalent in one of the five genomes, but are rare in the others. The genomic occurrence of each family is provided in the table and we depict their most common domain architectures alongside. SRF, serum response factor.

1.4 Approaches to study gene expression programs

So far we have seen how one could use domains to characterise proteins and how domain definition of proteins can help to identify possible DNA binding transcription factors. In the following section, small and large-scale strategies to study gene expression programs that reveal the regulated target genes and their transcription factors will be discussed.

1.4.1 Methods to study protein-DNA interactions

Interaction of a protein with DNA on a chromosome can affect transcription of nearby genes.
Hence studying protein-DNA interactions will tell us which genes are controlled by a transcription factor and will allow us to identify the targets of transcription factors. There are conventional small-scale experiments that focus on studying individual transcription factors in great detail. Alternatively, with recent advances, we are now in a position to carry out large-scale experiments that allow simultaneous monitoring of protein-DNA interactions and provide a comprehensive set of interactions at a genome-wide level. The principles behind some of the commonly used strategies are described in Table 1.3 below.

Table 1.3a: Small-scale experimental methods to probe protein-DNA interactions

Method | Description
Band-shift | Free DNA molecules have a much higher mobility in a polyacrylamide gel than DNA bound to protein. Thus, under favourable conditions, free DNA can be distinguished from DNA bound to proteins due to the difference in the size of the migrating species (Garner and Revzin, 1981).
DNA footprinting | In DNA footprinting, a 5’ end labelled double stranded DNA is partially degraded by DNase both in the presence and absence of the binding protein. Degraded fragments are then loaded on to a polyacrylamide gel and are visualised by autoradiography. Since the region where the protein has bound the DNA will be protected from DNase, no fragments are seen in those regions. Thus by comparing lanes, one can identify the binding site (Galas and Schmitz, 1978).
Binding site detection using FRET | A method by Heyduk and Heyduk (Heyduk and Heyduk, 2002) uses a library of double stranded DNA with one of the two fluorophores attached to its end. Protein binding to two pieces of DNA (one from each library), where each comprises one-half of the binding site, induces FRET (Fluorescence Resonance Energy Transfer), which can be used to find protein bound to DNA.
Binding site detection using unnatural base analog | A method by Storek et al (Storek et al., 2002) uses a library of DNA sequences which have an unnatural base analog (one for each base). Following selection for protein bound DNA molecules, the DNA is cleaved specifically at the modified base. The site of incorporation can be identified by gel electrophoresis by running fragments generated from the unbound sample next to the fragments generated from the bound sample. Since the presence of an analog in the binding site impedes protein binding, this results in a depletion of the protein-bound pool.
In vivo cross linking and immunoprecipitation | Over expression of a DNA binding protein in a cell ensures that the protein exists in its DNA-bound form. The bound protein is then covalently linked to DNA by using a cross-linking agent such as formaldehyde. After cross-linking, DNA is sheared and the protein bound DNA is precipitated using specific antibodies to the protein. Reversal of the cross-linking releases the bound DNA, allowing the sequence of the fragments to be determined by regular sequencing methods. This method is called Chromatin Immunoprecipitation, or ChIP for short (Kuo and Allis, 1999).

Table 1.3b: Large-scale experimental strategies to probe protein-DNA interactions

Method | Description
Identifying DNA binding proteins by mass spectrometry | Immobilised short DNA probes are incubated with cell or nuclear extract, allowing the protein to bind DNA on the surface. Proteins are analysed directly off the solid support by MALDI/TOF mass spectrometry. If the determined molecular mass does not uniquely identify a protein, proteins are subjected to mass spectrometric peptide mapping. This provides a way of detecting post-translational modification of specific residues on the protein (Nordhoff et al., 1999).
Chromatin Immunoprecipitation-Chip experiments (ChIP-chip) | In ChIP-chip experiments, intergenic regions are spotted on to a microarray chip. Following a chromatin immunoprecipitation step, discussed previously, the bound fragments are reverse cross-linked and hybridized onto the chip. Complementary sequences will bind to specific spots on the chip, thus providing the exact intergenic region to which the protein was bound (Horak and Snyder, 2002). This method was used by Lee et al (Lee et al., 2002) to study binding sites of 106 yeast transcription factors and hence reconstruct the transcriptional regulatory network for these proteins.
DNA adenine methyl transferase Identification (DamID) | One of the problems in ChIP experiments is that artefacts can be produced due to non-specific cross-linking of protein and DNA. To overcome this problem, van Steensel and Henikoff (van Steensel and Henikoff, 2000) introduced the DamID technique. First the protein of interest is fused to an E. coli protein, DNA adenine methyl transferase (Dam). Dam methylates the N6 position of the adenine in the sequence GATC, which occurs at reasonably high frequency in any genome (~1 site in 256 bases). Upon binding DNA, the Dam fusion protein preferentially methylates adenine in the vicinity of the binding site. Subsequently, the genomic DNA is digested with the DpnI and DpnII restriction enzymes, which act on the GATC sequence depending on its methylation state, so that fragments that are not methylated are removed. The remaining methylated fragments are amplified by selective PCR and quantified by microarray analysis. Sun et al (Sun et al., 2003) successfully applied this technique to map protein-DNA interactions at high resolution along segments of genomic DNA from Drosophila using a combination of the DamID technique and genomic DNA tiling path microarrays.

Table 1.3: Small-scale and large-scale experimental strategies to study protein-DNA interactions and identify targets for transcription factors.
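The figure of roughly one GATC site per 256 bases quoted for DamID in Table 1.3b follows from 4^4 = 256, if the four bases are assumed to be equally frequent and independent of their neighbours. The short Python sketch below illustrates this arithmetic and counts GATC sites in a randomly generated sequence. Both the uniform-base assumption and the random sequence are illustrative choices only. Real genomes have biased base composition, so observed spacings will differ.

```python
import random


def expected_spacing(motif, alphabet_size=4):
    """Expected number of bases per motif occurrence for equiprobable bases."""
    return alphabet_size ** len(motif)


def find_sites(sequence, motif="GATC"):
    """0-based start positions of every occurrence of the motif."""
    sites = []
    start = sequence.find(motif)
    while start != -1:
        sites.append(start)
        start = sequence.find(motif, start + 1)
    return sites


print(expected_spacing("GATC"))  # 256

# Illustrative random sequence; the seed only makes the run repeatable.
random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(100_000))
sites = find_sites(genome)
print(len(sites), "sites found, one per",
      round(len(genome) / max(len(sites), 1)), "bases on average")
```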
1.4.2 Gene expression analysis

In the previous section, different experiments that can be done to identify targets for a given transcription factor were explained. However, such methods fail to provide information about the cellular conditions in which the gene will be regulated, i.e. when the gene is turned on and when it is turned off. Hence such methods only provide us with static information about transcriptional interactions and fail to provide us with a dynamic picture of when the interaction would be active. Thus, what would be useful to know is when and under what conditions a gene is differentially expressed and when it is not. This provides us with direct information about genes that need to be present in a given cellular condition, such as under stress or during sporulation. One of the ways to monitor the expression of many genes simultaneously is through monitoring mRNA levels on a microarray chip. The following section will explain the basic principle behind a microarray experiment. Chapter 5 will discuss how integrating gene expression data with transcriptional interactions can help us define the transcriptional interactions that will be active in a particular cellular condition.

A. Microarray

One of the ways to study gene expression programs is to monitor the expression levels of thousands of genes simultaneously under a particular condition. Microarray technology makes this possible, and the quantity of data generated from each experiment is enormous, dwarfing the amount of data generated by the genome sequencing projects. A microarray is typically a glass slide on to which DNA molecules are fixed in an orderly manner at specific locations called spots (or features). A microarray may contain thousands of spots and each spot may contain a few million copies of identical DNA molecules that correspond uniquely to a gene (Figure 1.6a).
The DNA in a spot may either be genomic DNA or a short oligonucleotide that corresponds to a gene. The spots are printed on to the glass slide by a robot or are synthesised by the process of photolithography. The principle of differential hybridization is explained in Figure 1.6b. More recently, to obviate the need to develop sequence-specific chips, Lizardi and co-workers have developed a procedure called GenCompass, which combines a traditional enzymatic manipulation step, a universal microarray representing all possible DNA hexamers, and advanced bioinformatics techniques to analyse the data obtained. For a review of possible future applications of the universal microarray system, see Luscombe and Madan Babu (Luscombe and Madan Babu, in press).

[Figure 1.6: Microarray experiment. Panel (a): a spot (feature) within a sub-array; each spot contains an oligonucleotide sequence or genomic DNA that “uniquely” represents a gene. Panel (b): Condition A and Condition B, mRNA extraction, cDNA labelling with dyes, hybridisation, excitation with laser, final image stored as a file.]

Figure 1.6: (a) A microarray may contain thousands of ‘spots’. Each spot contains many copies of the same DNA sequence that represents a single gene from an organism. (b) Schematic of the experimental protocol to study differential expression of genes. The organism is grown in two different conditions (a reference condition and a test condition). RNA is extracted from the cells grown in each condition, and is labelled with different dyes (red and green) during the synthesis of cDNA by reverse transcriptase. Following this step, the cDNA is hybridized onto the microarray slide, where each cDNA molecule representing a gene will bind to the spot containing its complementary DNA sequence.
The microarray slide is then excited with a laser at suitable wavelengths to detect the intensities of red and green fluorescence, which represent the relative expression level of a gene in the two conditions considered. The final image is stored as a file for further analysis.

B. Clustering methods

One of the goals of a microarray experiment is to identify genes or samples with similar expression profiles, in order to make meaningful biological inferences about the set of genes or samples. Clustering is an unsupervised approach to classify data into groups of genes or samples with similar patterns that are characteristic of the group. Note that clustering analysis can be applied to any kind of information as long as it is represented in a form amenable to clustering (i.e. a profile, a vector or a matrix). Chapter 4 of this thesis exploits the power of clustering to study the conservation of transcription factors and transcriptional interactions in different genomes by creating ‘conservation’ profiles across the genomes for the genes and their interactions. Clustering methods can be hierarchical (grouping objects into clusters and specifying relationships among objects in the cluster, resembling a phylogenetic tree) or non-hierarchical (grouping into clusters without specifying relationships between objects in the cluster), as shown in Figure 1.7. An object may refer to a gene or a sample, and a cluster refers to a set of objects that behave in a similar manner. The following sections describe hierarchical agglomerative clustering and K-means clustering. For a detailed introduction to microarray analysis, please refer to Causton et al. (Causton et al., 2003), Madan Babu (Madan Babu, 2004) or Appendix B.
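As an illustration of how the red and green intensities can be turned into profiles amenable to clustering, the sketch below converts pairs of intensities into log2 ratios and computes a Euclidean distance between two profiles. The use of the log2 ratio and of the Euclidean distance are common conventions assumed here for illustration, not choices stated in the text, and the gene names and intensity values are invented.

```python
import math
from typing import Dict, List, Sequence, Tuple


def log2_ratio(test_intensity: float, reference_intensity: float) -> float:
    """Relative expression of a gene as log2(test / reference)."""
    return math.log2(test_intensity / reference_intensity)


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical (test, reference) intensities for three genes in two experiments.
raw: Dict[str, List[Tuple[float, float]]] = {
    "geneA": [(800.0, 200.0), (400.0, 100.0)],  # induced in the test condition
    "geneB": [(100.0, 400.0), (50.0, 200.0)],   # repressed in the test condition
    "geneC": [(300.0, 300.0), (150.0, 150.0)],  # unchanged
}

# One expression profile (a vector of log ratios) per gene.
profiles = {
    gene: [log2_ratio(test, ref) for test, ref in spots]
    for gene, spots in raw.items()
}

print(profiles)
print(euclidean(profiles["geneA"], profiles["geneB"]))
```

Each gene is thus represented as a vector with one entry per experiment, which is the form of input assumed by the clustering procedures described in this section.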
[Figure 1.7: Clustering methods. Tree diagram with the labels: hierarchical (agglomerative, divisive); non-hierarchical (e.g. K-means, SOMs); single linkage, complete linkage, average linkage, centroid linkage.]

Figure 1.7: An overview of the different clustering methods.

Hierarchical clustering

Hierarchical clustering may be agglomerative (starting with the assumption that each object is a cluster and grouping similar objects into bigger clusters) or divisive (starting from grouping all objects into one cluster and subsequently breaking the big cluster into smaller clusters with similar properties).

Hierarchical agglomerative clustering

In hierarchical agglomerative clustering, the objects are successively fused until all the objects are included in a single cluster. Each object is initially considered as a cluster, and the pair-wise distances between the objects to be clustered are calculated. Based on these pair-wise distances, objects that are similar to each other are grouped into clusters. The pair-wise distances between the clusters are then re-calculated, and clusters that are similar are grouped together in an iterative manner until all the objects are included in a single cluster. This information can be represented as a dendrogram, where the distance from the branch point is indicative of the distance between the two clusters or objects. The comparison of a cluster with another cluster or an object can be done using four approaches (Figure 1.8).
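The agglomerative procedure described above can be sketched as a short loop. This is a minimal illustration under stated assumptions rather than an optimised implementation: the Euclidean distance, the single-linkage default and all function names are choices made here, and the one-dimensional points are invented.

```python
import math
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Cluster = Tuple[int, ...]  # a cluster is a tuple of object indices


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def single_linkage(a: Cluster, b: Cluster, points) -> float:
    """Minimum distance over all pairs of objects, one from each cluster."""
    return min(euclidean(points[i], points[j]) for i in a for j in b)


def agglomerate(points, linkage: Callable = single_linkage) -> List[Tuple[Cluster, Cluster, float]]:
    """Fuse the two closest clusters repeatedly; return the merge history."""
    clusters: List[Cluster] = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Re-calculate all pair-wise cluster distances and pick the closest pair.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: linkage(pair[0], pair[1], points))
        merges.append((a, b, linkage(a, b, points)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges


# Made-up one-dimensional "profiles" for five objects.
points = [(0.0,), (1.0,), (5.0,), (7.0,), (20.0,)]
merges = agglomerate(points)
for a, b, distance in merges:
    print(a, b, distance)
```

The merge history, together with the distance at which each fusion occurs, carries the information that a dendrogram displays.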
[Figure 1.8: Algorithms to group objects into clusters. Panels: single linkage, complete linkage, average linkage and centroid clustering, each showing the distance between Cluster-1 and Cluster-2. Legend: object in a cluster (may be a gene or a sample expression profile); centroid of a cluster (may be centroid of gene or sample expression profiles).]

Figure 1.8: Single linkage, complete linkage, average linkage and centroid linkage clustering procedures.

Single linkage clustering (Minimum distance)

In single linkage clustering, the distance between two clusters is calculated as the minimum distance between all possible pairs of objects, one from each cluster. This method has the advantage that it is insensitive to outliers. This method is also known as nearest neighbour linkage. In fact, the BLASTclust procedure, which will be described in Chapter 4 of this thesis, implements the single linkage clustering procedure to group orthologous proteins.

Complete linkage clustering (Maximum distance)

In complete linkage clustering, the distance between two clusters is calculated as the maximum distance between all possible pairs of objects, one from each cluster. The disadvantage of this method is that it is sensitive to outliers. This method is also known as farthest neighbour linkage.

Average linkage clustering

In average linkage clustering, the distance between two clusters is calculated as the average of the distances between all possible pairs of objects, one from each cluster.

Centroid linkage clustering

In centroid linkage clustering, the distance between two clusters is calculated in two steps. First, an average expression profile (called a centroid) is obtained for each cluster by calculating the mean in each dimension of the expression profiles of all objects in the cluster. Then, the distance between the clusters is measured as the distance between the centroids of the two clusters.
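The four ways of comparing clusters can be written as four small functions. The sketch below assumes Euclidean distance between profiles (the text does not prescribe a distance measure); the function names and the two toy clusters are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Profile = Tuple[float, ...]


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def pairwise(a: List[Profile], b: List[Profile]) -> List[float]:
    """Distances between all pairs of objects, one from each cluster."""
    return [euclidean(p, q) for p in a for q in b]


def single_linkage(a, b) -> float:
    return min(pairwise(a, b))       # nearest neighbour


def complete_linkage(a, b) -> float:
    return max(pairwise(a, b))       # farthest neighbour


def average_linkage(a, b) -> float:
    distances = pairwise(a, b)
    return sum(distances) / len(distances)


def centroid(cluster: List[Profile]) -> Profile:
    """Mean of the profiles in each dimension."""
    return tuple(sum(dim) / len(cluster) for dim in zip(*cluster))


def centroid_linkage(a, b) -> float:
    return euclidean(centroid(a), centroid(b))


# Two made-up clusters of two-dimensional profiles.
cluster_1 = [(0.0, 0.0), (0.0, 2.0)]
cluster_2 = [(4.0, 0.0), (4.0, 2.0)]

for linkage in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(linkage.__name__, linkage(cluster_1, cluster_2))
```

By construction the average linkage distance always lies between the single and complete linkage distances for the same pair of clusters.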
K-means

K-means is a popular non-hierarchical clustering method (Figure 1.9). In K-means clustering, the first step is to arbitrarily group objects into a predetermined number of clusters. The number of clusters can be chosen arbitrarily or estimated by first performing a hierarchical clustering on the data. The average expression profile (centroid) is then calculated for each cluster; this step is called initialization. Next, individual objects are moved from one cluster to another depending on which centroid is closest to the object. This procedure of calculating the centroid for each cluster and moving each object to the cluster with the closest centroid is performed iteratively, either a fixed number of times or until convergence (where the cluster composition remains unaltered). Typically, the number of iterations required to obtain stable clusters ranges from 20,000 to 100,000. However, there is no guarantee that the clusters will converge within a fixed number of iterations, and the final clusters depend on the initial grouping. This method has the advantage that it is scalable to large datasets. This method has been used to group genomes based on patterns in the conservation of genes, their interactions and the regulatory motifs (see Chapter 5).

[Figure 1.9: K-means clustering. Panels in gene expression space: initialisation, iteration, convergence (or after a fixed number of iterations).]

Figure 1.9: The principle behind K-means clustering. Objects are grouped into a predefined number of clusters during the initialization step. The centroid of each cluster is calculated, and objects are re-grouped depending on how close they are to the centroids. This step is performed iteratively until convergence, or for a fixed number of iterations, to obtain the final clusters of objects.
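The K-means procedure can be sketched in a few lines. The version below is a minimal illustration under stated assumptions: the arbitrary initial grouping is a deterministic round-robin assignment (chosen here only to make the example reproducible), distances are Euclidean, and the function names and toy points are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(members: List[Point]) -> Point:
    return tuple(sum(dim) / len(members) for dim in zip(*members))


def k_means(points: List[Point], k: int, max_iterations: int = 100):
    """Return (cluster index per point, centroids) after K-means iterations."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Initialisation: arbitrary grouping (round-robin), so no cluster is empty.
    assignment = [i % k for i in range(len(points))]
    centroids: List[Point] = [points[c] for c in range(k)]  # overwritten below
    for _ in range(max_iterations):
        # Re-calculate the centroid of every non-empty cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = centroid(members)
        # Move every object to the cluster with the closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # convergence: composition unchanged
            break
        assignment = new_assignment
    return assignment, centroids


# Two well-separated groups of made-up two-dimensional profiles.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
assignment, centroids = k_means(points, k=2)
print(assignment)
print(centroids)
```

Because the outcome depends on the initial grouping, practical analyses often repeat the procedure from several different starting groupings and compare the resulting clusters.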
1.5 Structure of transcriptional regulatory networks

So far we have seen how one can identify putative transcription factors using domain assignments, study protein-DNA interactions to identify targets for transcription factors, monitor expression levels of genes under particular conditions, and use different clustering procedures to group objects (genes) that have similar behaviour (expression or conservation). In the following section, the organisation of individual transcriptional interactions into a transcriptional regulatory network is discussed. Chapter 3 discusses how such networks evolve, Chapter 4 describes how such networks change in evolution and Chapter 5 explains how the structure of such networks changes under different cellular conditions.

The assembly of individual regulatory interactions linking transcription factors to their target genes in an organism can be viewed as a directed graph, in which the regulators and targets represent the nodes, and the regulatory interactions are the edges (Figure 1.10) (Madan Babu et al., 2004; Wei et al., 2004; Xia et al., 2004). The resulting network is a complex, multilayered system that can be examined at four levels of detail. At the most basic level, the network comprises a collection of transcription factors, downstream target genes and the binding sites in the DNA (Figure 1.10a). At the second level, these basic units are organised into recurrent patterns of interconnections called network motifs (Lee et al., 2002; Milo et al., 2002; Shen-Orr et al., 2002), which appear frequently throughout the network (Figure 1.10b). At the third level, the motifs cluster into semi-independent transcriptional units called modules (Figure 1.10c). Finally, at the top level, the regulatory network consists of interconnecting interactions among the modules, which build up the entire network (Figure 1.10d).
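The directed-graph view of a regulatory network can be made concrete with a small data structure. The sketch below stores each regulator with the set of genes it regulates and derives the outgoing and incoming connectivity that are discussed in Section 1.5.3; the gene names, the edge list and the function names are invented for illustration and do not come from RegulonDB or any other data set.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

# Hypothetical regulatory interactions: (transcription factor, target gene).
edges = [
    ("TF1", "geneA"), ("TF1", "geneB"), ("TF1", "TF2"),
    ("TF2", "geneB"), ("TF2", "geneC"),
    ("TF3", "geneC"),
]


def build_network(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Directed graph as a mapping from each regulator to its set of targets."""
    network: Dict[str, Set[str]] = {}
    for regulator, target in edges:
        network.setdefault(regulator, set()).add(target)
    return network


def out_connectivity(network: Dict[str, Set[str]]) -> Dict[str, int]:
    """Number of target genes regulated by each transcription factor."""
    return {regulator: len(targets) for regulator, targets in network.items()}


def in_connectivity(network: Dict[str, Set[str]]) -> Counter:
    """Number of transcription factors regulating each target gene."""
    return Counter(target for targets in network.values() for target in targets)


network = build_network(edges)
print(out_connectivity(network))
print(in_connectivity(network))
```

A mapping from regulators to target sets keeps the direction of each interaction explicit, which matters because regulator and target play different roles in the network.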
[Figure 1.10: Structural organisation of transcriptional regulatory networks. Panels: (a) basic unit (transcription factor, target gene and binding site); (b) motifs (SIM, MIM, FFM); (c) modules; (d) transcriptional regulatory network.]

Figure 1.10: (a) The ‘basic unit’ comprises the transcription factor, its target gene with DNA recognition site and the regulatory interaction between them. (b) Units are often organised into network ‘motifs’, which comprise specific patterns of inter-regulation that are over-represented in networks. Examples of motifs include single input (SIM), multiple input (MIM) and feed-forward motifs (FFM). (c) Network motifs can be interconnected to form semi-independent ‘modules’, many of which have been identified by integrating regulatory interaction data with gene expression data, and imposing evolutionary conservation. (d) The entire assembly of regulatory interactions constitutes the ‘transcriptional regulatory network’, which provides the blueprint for regulation of gene expression in an organism.

It should be noted that much of the work on regulatory networks has focused on E. coli and the yeast Saccharomyces cerevisiae, for which data are most abundant. The individual regulatory interactions in E. coli have been collected manually from the literature in the RegulonDB database (Salgado et al., 2004). In yeast, on the other hand, manually curated data (Guelzim et al., 2002; Matys et al., 2003) have been greatly augmented by large-scale DNA-binding data from chromatin immunoprecipitation-chip (ChIP-chip) experiments (Horak et al., 2002; Lee et al., 2002).

1.5.1 Motifs

At a local level, the transcriptional network can be broken down into a series of regulatory motifs. These represent the simplest units of network architecture, in which there are specific patterns of inter-regulation between transcription factors and target genes.
Motifs do not often represent independent units that are functionally separable from the rest of the network. However, they have been shown theoretically and experimentally to possess particular kinetic properties that determine the temporal program of expression of the target genes (Mangan and Alon, 2003; Mangan et al., 2003). Schematics of three prevalent motifs are shown in Figure 1.10b: Single Input Motif, Multiple Input Motif and Feed Forward Motif. The first two comprise direct-acting motifs, whereby a single or multiple transcription factors regulate their targets. Yu et al. (Yu et al., 2003) showed that target genes belonging to the same single and multiple input motifs tend to be co-expressed, and that the level of co-expression is higher when multiple transcription factors are involved. The feed-forward loop is composed of two transcription factors, whereby the first regulates the second and both regulate a final target gene. Further motifs identified by Lee et al. (Lee et al., 2002) in yeast represent patterns of interconnections of variable complexity, such as the autoregulatory and regulatory chain motifs.

1.5.2 Modules

The organisation of the regulatory network can also be captured at an intermediate level by examining its modularity (Hartwell et al., 1999; Lee et al., 2002). Intuitively, one might expect distinct cellular processes to be conveniently regulated by discrete and separable modules. Indeed, Guelzim et al. (Guelzim et al., 2002) reported global fragmentation of the regulatory network in yeast. The clustering coefficient (a measure of the propensity for nodes to form ‘cliques’) was fivefold higher than would be expected for a random network. There have been several different approaches to identifying modules and these studies have provided distinct outcomes with respect to the resulting modules.
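The feed-forward motif described in Section 1.5.1 (a first transcription factor regulates a second, and both regulate a final target gene) can be searched for directly in a directed graph. The sketch below is a naive enumeration on a toy network; it is not the motif-detection procedure of the cited studies, which additionally assess over-representation against randomised networks, and all names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical network: regulator -> set of genes it regulates.
network: Dict[str, Set[str]] = {
    "TF1": {"TF2", "geneA", "geneB"},
    "TF2": {"geneB", "geneC"},
    "TF3": {"geneC"},
}


def find_feed_forward_loops(network: Dict[str, Set[str]]) -> List[Tuple[str, str, str]]:
    """Return every (x, y, z) with edges x -> y, y -> z and x -> z."""
    loops = []
    for x, x_targets in network.items():
        for y in x_targets:
            # y must itself be a regulator, and self-regulation is ignored.
            if y == x or y not in network:
                continue
            for z in network[y]:
                if z in x_targets and z not in (x, y):
                    loops.append((x, y, z))
    return sorted(loops)


loops = find_feed_forward_loops(network)
print(loops)
```

Whether such a pattern qualifies as a network motif depends on its frequency relative to comparable random networks, which this sketch does not evaluate.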
The main conclusion, however, is that the regulatory network tends to be highly interconnected and very few modules are entirely separable from the rest of the network. In fact, many identified modules are nested within each other in a hierarchical organisation at differing levels of connectivity. Dobrin et al. (Dobrin et al., 2004) showed that many of the multiple input and feed-forward loop motifs in E. coli overlap, so that they share transcription factors or target genes. Thus, many small, highly connected motifs group into a few larger modules, which in turn integrate into even larger ones. These nested modules are interconnected through local regulatory hubs. Such an organisation combines the capacity for rapid regulatory changes through regulatory hubs with integration of the regulatory processes across several modules.

Other approaches to identifying modules have incorporated further data sources, such as gene expression data sets. Typical analyses have applied clustering algorithms to gene expression data to find sets of co-expressed genes. In one of the original studies, Tavazoie et al. (Tavazoie et al., 1999) reported that some of the major co-expression clusters coincided with functional groupings of genes. In an ambitious extension of our work (Teichmann and Madan Babu, 2002), Stuart et al. (Stuart et al., 2003) recently clustered over 3000 microarray experiments on four eukaryotic genomes and identified 22 163 gene pairs whose co-expression is conserved across all organisms. They grouped sets of orthologues into modules, suggesting that co-expression of gene pairs over large evolutionary distances implies a selective advantage for co-regulation, perhaps because the genes are functionally related. In another interesting study, Ihmels et al. (Ihmels et al., 2002) added a different perspective by taking the experimental conditions into account when defining the gene clusters.
Their ‘signature’ algorithm identifies clusters according to the experimental conditions in which the expression patterns of genes are most significantly correlated. The authors identified 86 transcriptional modules and the experimental conditions in which they operate. Segal et al. (Segal et al., 2003) used a probabilistic algorithm that first partitions genes into modules based on their expression profiles, and then identifies specific regulatory genes that are predicted to control the modules by comparing the expression profiles of candidate regulators and the gene modules. They were able to identify 50 different modules with distinct regulatory programs. Particularly illuminating was the formation of higher order groupings by the individual modules, which are regulated by partly overlapping but distinct regulators. Bar-Joseph et al. (Bar-Joseph et al., 2003) improved previous algorithms by explicitly linking gene expression data with the regulatory interaction data produced by Lee et al. (Lee et al., 2002) through ChIP-chip experiments. In this way, the authors were able to partition 655 distinct genes and 68 transcription factors into 106 regulatory modules. Many of the identified modules could be linked to particular cellular processes.

1.5.3 Global network organisation

At a global level, the overall structure or topology of the gene regulatory network can be described by parameters derived from graph theory. The incoming connectivity is the number of transcription factors regulating a target gene, which quantifies the combinatorial effect of gene regulation. A recent study by Guelzim et al. (Guelzim et al., 2002) reported that the fraction of target genes with a given incoming connectivity decreases exponentially.
The exponential behaviour indicates that most target genes are regulated by similar numbers of factors (93% of genes are regulated by 1–4 factors in yeast) and presumably reflects the molecular limits on the number of transcription factors that can affect a target gene simultaneously, which are imposed by protein and DNA structural constraints at promoters.

The outgoing connectivity, which is the number of target genes regulated by each transcription factor, is distributed according to a power law, in contrast to the incoming connectivity. This is indicative of a hub-containing network structure, in which a select few transcription factors participate in the regulation of a disproportionately large number of target genes. These hubs can be viewed as ‘global regulators’, as opposed to the remaining transcription factors, which can be considered ‘fine tuners’. Global regulators can be defined based on the number of genes they regulate. In the transcriptional network in yeast, regulatory hubs have a propensity to be lethal if removed (Guelzim et al., 2002). Martinez-Antonio and Collado-Vides (Martinez-Antonio and Collado-Vides, 2003) defined global regulators by taking into account additional factors, such as the number of co-regulators and the number of conditions.

As described in this chapter, a wealth of data on transcription factors, their regulatory interactions and the gene expression programmes of organisms has become available. Theoretical approaches can be used to study the structure and evolution of transcriptional regulatory networks. This thesis focuses on the analysis of transcriptional regulatory systems in single-celled organisms. With the advances being made in biology, we expect to be able to apply similar methods to multi-cellular organisms in future.

1.6 References

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol 215, 403-10.
Altschul, S.
F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389-402.
Andreeva, A., Howorth, D., Brenner, S. E., Hubbard, T. J., Chothia, C. and Murzin, A. G. (2004). SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res 32, D226-9.
Apic, G., Gough, J. and Teichmann, S. A. (2001). Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J Mol Biol 310, 311-25.
Aravind, L. and Koonin, E. V. (1999). DNA-binding proteins and evolution of transcription regulation in the archaea. Nucleic Acids Res 27, 4658-70.
Bar-Joseph, Z., Gerber, G. K., Lee, T. I., Rinaldi, N. J., Yoo, J. Y., Robert, F., Gordon, D. B., Fraenkel, E., Jaakkola, T. S., Young, R. A. et al. (2003). Computational discovery of gene modules and regulatory networks. Nat Biotechnol 21, 1337-42. Epub 2003 Oct 12.
Bashton, M. and Chothia, C. (2002). The geometry of domain combination in proteins. J Mol Biol 315, 927-39.
Bateman, A., Coin, L., Durbin, R., Finn, R. D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E. L. et al. (2004). The Pfam protein families database. Nucleic Acids Res 32, D138-41.
Bourne, P. E., Addess, K. J., Bluhm, W. F., Chen, L., Deshpande, N., Feng, Z., Fleri, W., Green, R., Merino-Ott, J. C., Townsend-Merino, W. et al. (2004). The distribution and query systems of the RCSB Protein Data Bank. Nucleic Acids Res 32, D223-5.
Brenner, S. E., Chothia, C. and Hubbard, T. J. (1998). Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc Natl Acad Sci U S A 95, 6073-8.
Brenner, S. E., Hubbard, T., Murzin, A. and Chothia, C. (1995). Gene duplications in H. influenzae. Nature 378, 140.
Buchan, D. W., Rison, S. C., Bray, J. E., Lee, D., Pearl, F., Thornton, J. M. and Orengo, C. A. (2003).
Gene3D: structural assignments for the biologist and bioinformaticist alike. Nucleic Acids Res 31, 469-73.
Chen, J., Anderson, J. B., DeWeese-Scott, C., Fedorova, N. D., Geer, L. Y., He, S., Hurwitz, D. I., Jackson, J. D., Jacobs, A. R., Lanczycki, C. J. et al. (2003). MMDB: Entrez's 3D-structure database. Nucleic Acids Res 31, 474-7.
Chothia, C., Gough, J., Vogel, C. and Teichmann, S. A. (2003). Evolution of the protein repertoire. Science 300, 1701-3.
Coulson, R. M. and Ouzounis, C. A. (2003). The phylogenetic diversity of eukaryotic transcription. Nucleic Acids Res 31, 653-60.
Dobrin, R., Beg, Q. K., Barabasi, A. L. and Oltvai, Z. N. (2004). Aggregation of topological motifs in the Escherichia coli transcriptional regulatory network. BMC Bioinformatics 5, 10.
Eddy, S. R. (1996). Hidden Markov models. Curr Opin Struct Biol 6, 361-5.
Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics 14, 755-63.
Galas, D. J. and Schmitz, A. (1978). DNAse footprinting: a simple method for the detection of protein-DNA binding specificity. Nucleic Acids Res 5, 3157-70.
Garner, M. M. and Revzin, A. (1981). A gel electrophoresis method for quantifying the binding of proteins to specific DNA regions: application to components of the Escherichia coli lactose operon regulatory system. Nucleic Acids Res 9, 3047-60.
Guelzim, N., Bottani, S., Bourgine, P. and Kepes, F. (2002). Topological and causal structure of the yeast transcriptional regulatory network. Nat Genet 31, 60-3.
Hartwell, L. H., Hopfield, J. J., Leibler, S. and Murray, A. W. (1999). From molecular to modular cell biology. Nature 402, C47-52.
Heyduk, T. and Heyduk, E. (2002). Molecular beacons for detecting DNA binding proteins. Nat Biotechnol 20, 171-6.
Holm, L. and Sander, C. (1994). The FSSP database of structurally aligned protein fold families. Nucleic Acids Res 22, 3600-9.
Horak, C. E., Luscombe, N. M., Qian, J., Bertone, P., Piccirrillo, S., Gerstein, M. and Snyder, M. (2002).
Complex transcriptional circuitry at the G1/S transition in Saccharomyces cerevisiae. Genes Dev 16, 3017-33.
Horak, C. E. and Snyder, M. (2002). ChIP-chip: a genomic approach for identifying transcription factor binding sites. Methods Enzymol 350, 469-83.
Ihmels, J., Friedlander, G., Bergmann, S., Sarig, O., Ziv, Y. and Barkai, N. (2002). Revealing modular organization in the yeast transcriptional network. Nat Genet 31, 370-7.
Jacob, F. and Monod, J. (1961). Genetic regulatory mechanisms in the synthesis of proteins. J Mol Biol 3, 318-56.
Karplus, K., Barrett, C. and Hughey, R. (1998). Hidden Markov models for detecting remote protein homologies. Bioinformatics 14, 846-56.
Kuo, M. H. and Allis, C. D. (1999). In vivo cross-linking and immunoprecipitation for studying dynamic Protein:DNA associations in a chromatin environment. Methods 19, 425-33.
Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W. et al. (2001). Initial sequencing and analysis of the human genome. Nature 409, 860-921.
Ledent, V. and Vervoort, M. (2001). The basic helix-loop-helix protein family: comparative genomics and phylogenetic analysis. Genome Res 11, 754-70.
Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N. M., Harbison, C. T., Thompson, C. M., Simon, I. et al. (2002). Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298, 799-804.
Leinonen, R., Diez, F. G., Binns, D., Fleischmann, W., Lopez, R. and Apweiler, R. (2004). UniProt Archive. Bioinformatics 25, 25.
Lespinet, O., Wolf, Y. I., Koonin, E. V. and Aravind, L. (2002). The role of lineage-specific gene family expansion in the evolution of eukaryotes. Genome Res 12, 1048-59.
Letunic, I., Copley, R. R., Schmidt, S., Ciccarelli, F. D., Doerks, T., Schultz, J., Ponting, C. P. and Bork, P. (2004). SMART 4.0: towards genomic data integration. Nucleic Acids Res 32, D142-4.
Lo Conte, L., Ailey, B., Hubbard, T.
J., Brenner, S. E., Murzin, A. G. and Chothia, C. (2000). SCOP: a structural classification of proteins database. Nucleic Acids Res 28, 257-9.
Lo Conte, L., Brenner, S. E., Hubbard, T. J., Chothia, C. and Murzin, A. G. (2002). SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res 30, 264-7.
Madan Babu, M. and Teichmann, S. A. (2003). Evolution of transcription factors and the gene regulatory network in Escherichia coli. Nucleic Acids Res 31, 1234-44.
Madan Babu, M., Luscombe, N. M., Aravind, L., Gerstein, M. and Teichmann, S. A. (2004). Structure and evolution of transcriptional regulatory networks. Curr Opin Struct Biol 14, 283-91.
Madera, M. and Gough, J. (2002). A comparison of profile hidden Markov model procedures for remote homology detection. Nucleic Acids Res 30, 4321-8.
Madera, M., Vogel, C., Kummerfeld, S. K., Chothia, C. and Gough, J. (2004). The SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Res 32, D235-9.
Mangan, S. and Alon, U. (2003). Structure and function of the feed-forward loop network motif. Proc Natl Acad Sci U S A 100, 11980-5. Epub 2003 Oct 6.
Mangan, S., Zaslaver, A. and Alon, U. (2003). The coherent feedforward loop serves as a sign-sensitive delay element in transcription networks. J Mol Biol 334, 197-204.
Marchler-Bauer, A., Anderson, J. B., DeWeese-Scott, C., Fedorova, N. D., Geer, L. Y., He, S., Hurwitz, D. I., Jackson, J. D., Jacobs, A. R., Lanczycki, C. J. et al. (2003). CDD: a curated Entrez database of conserved domain alignments. Nucleic Acids Res 31, 383-7.
Martinez-Antonio, A. and Collado-Vides, J. (2003). Identifying global regulators in transcriptional regulatory networks in bacteria. Curr Opin Microbiol 6, 482-9.
Matys, V., Fricke, E., Geffers, R., Gossling, E., Haubrock, M., Hehl, R., Hornischer, K., Karas, D., Kel, A. E., Kel-Margoulis, O. V. et al. (2003). TRANSFAC: transcriptional regulation, from patterns to profiles.
Nucleic Acids Res 31, 374-8.
Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D. and Alon, U. (2002). Network motifs: simple building blocks of complex networks. Science 298, 824-7.
Mulder, N. J., Apweiler, R., Attwood, T. K., Bairoch, A., Barrell, D., Bateman, A., Binns, D., Biswas, M., Bradley, P., Bork, P. et al. (2003). The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res 31, 315-8.
Murzin, A. G., Brenner, S. E., Hubbard, T. and Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247, 536-40.
Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48, 443-53.
Nordhoff, E., Krogsdam, A. M., Jorgensen, H. F., Kallipolitis, B. H., Clark, B. F., Roepstorff, P. and Kristiansen, K. (1999). Rapid identification of DNA-binding proteins by mass spectrometry. Nat Biotechnol 17, 884-8.
Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. and Thornton, J. M. (1997). CATH--a hierarchic classification of protein domain structures. Structure 5, 1093-108.
Orengo, C. A., Pearl, F. M., Bray, J. E., Todd, A. E., Martin, A. C., Lo Conte, L. and Thornton, J. M. (1999). The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res 27, 275-9.
Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T. and Chothia, C. (1998). Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J Mol Biol 284, 1201-10.
Pearson, W. R. and Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 85, 2444-8.
Perez-Rueda, E. and Collado-Vides, J. (2000). The repertoire of DNA-binding transcriptional regulators in Escherichia coli K-12. Nucleic Acids Res 28, 1838-47.
Peterson, J. D., Umayam, L.
A., Dickinson, T., Hickey, E. K. and White, O. (2001). The Comprehensive Microbial Resource. Nucleic Acids Res 29, 123-5.
Riechmann, J. L., Heard, J., Martin, G., Reuber, L., Jiang, C., Keddie, J., Adam, L., Pineda, O., Ratcliffe, O. J., Samaha, R. R. et al. (2000). Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes. Science 290, 2105-10.
Salgado, H., Gama-Castro, S., Martinez-Antonio, A., Diaz-Peredo, E., Sanchez-Solano, F., Peralta-Gil, M., Garcia-Alonso, D., Jimenez-Jacinto, V., Santos-Zavaleta, A., Bonavides-Martinez, C. et al. (2004). RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res 32, D303-6.
Schaffer, A. A., Aravind, L., Madden, T. L., Shavirin, S., Spouge, J. L., Wolf, Y. I., Koonin, E. V. and Altschul, S. F. (2001). Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 29, 2994-3005.
Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D. and Friedman, N. (2003). Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet 34, 166-76.
Servant, F., Bru, C., Carrere, S., Courcelle, E., Gouzy, J., Peyruc, D. and Kahn, D. (2002). ProDom: automated clustering of homologous domains. Brief Bioinform 3, 246-51.
Shen-Orr, S. S., Milo, R., Mangan, S. and Alon, U. (2002). Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 31, 64-8.
Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. J Mol Biol 147, 195-7.
Stebbings, L. A. and Mizuguchi, K. (2004). HOMSTRAD: recent developments of the Homologous Protein Structure Alignment Database. Nucleic Acids Res 32, D203-7.
Storek, M. J., Ernst, A. and Verdine, G. L. (2002). High-resolution footprinting of sequence-specific protein-DNA contacts.
Nat Biotechnol 20, 183-6. Stuart, J. M., Segal, E., Koller, D. and Kim, S. K. (2003). A gene-coexpression network for global discovery of conserved genetic modules. Science 302, 249-55. Epub 2003 Aug 21. Sun, L. V., Chen, L., Greil, F., Negre, N., Li, T. R., Cavalli, G., Zhao, H., Van Steensel, B. and White, K. P. (2003). Protein-DNA interaction mapping using genomic tiling path microarrays in Drosophila. Proc Natl Acad Sci U S A 100, 9428-33. Epub 2003 Jul 22. Tatusov, R. L., Fedorova, N. D., Jackson, J. D., Jacobs, A. R., Kiryutin, B., Koonin, E. V., Krylov, D. M., Mazumder, R., Mekhedov, S. L., Nikolskaya, A. N. et al. (2003). The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4, 41. Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J. and Church, G. M. (1999). Systematic determination of genetic network architecture. Nat Genet 22, 281-5. Teichmann, S. A. and Madan Babu, M. (2002). Conservation of gene co-regulation in prokaryotes and eukaryotes. Trends Biotechnol 20, 407-10; discussion 410. Teichmann, S. A., Chothia, C. and Gerstein, M. (1999). Advances in structural genomics. Curr Opin Struct Biol 9, 390-9. Teichmann, S. A., Park, J. and Chothia, C. (1998). Structural assignments to the Mycoplasma genitalium proteins show extensive gene duplications and domain rearrangements. Proc Natl Acad Sci U S A 95, 14658-63. van Steensel, B. and Henikoff, S. (2000). Identification of in vivo DNA targets of chromatin proteins using tethered dam methyltransferase. Nat Biotechnol 18, 424-8. Vogel, C., Bashton, M., Kerrison, N. D., Chothia, C. and Teichmann, S. A. (2004). Structure, function and evolution of multidomain proteins. Curr Opin Struct Biol 14, 208-16. Wei, G. H., Liu, D. P. and Liang, C. C. (2004). Charting gene regulatory networks: strategies, challenges and perspectives. Biochem J 381, 1-12. Xia, Y., Yu, H., Jansen, R., Seringhaus, M., Baxter, S., Greenbaum, D., Zhao, H. and Gerstein, M. (2004).
Analyzing cellular biochemistry in terms of molecular networks. Annu Rev Biochem 73, 1051-87. Yu, H., Luscombe, N. M., Qian, J. and Gerstein, M. (2003). Genomic analysis of gene expression relationships in transcriptional regulatory networks. Trends Genet 19, 422-7.