1.3 Approaches to study evolution of transcription factors

advertisement
Chapter
THEORETICAL AND EXPERIMENTAL
METHODS TO STUDY TRANSCRIPTION
FACTORS AND THEIR INTERACTIONS
1.1 INTRODUCTION ................................................................................................. 1-1
1.2 MOLECULAR MECHANISM FOR THE CONTROL OF GENE EXPRESSION 1-2
1.3 APPROACHES TO STUDY EVOLUTION OF TRANSCRIPTION FACTORS .. 1-3
1.3.1 COMPARISON OF PROTEIN SEQUENCES TO INFER HOMOLOGY ............................ 1-3
1.3.2 DOMAINS OF A PROTEINS CAN BE USED TO INFER EVOLUTIONARY HISTORY ......... 1-4
1.3.3 PROCEDURES TO ASSIGN DOMAINS TO PROTEIN SEQUENCES ............................. 1-7
1.3.4 PROTEIN EVOLUTION ....................................................................................... 1-8
1.4 APPROACHES TO STUDY GENE EXPRESSION PROGRAMS ................... 1-10
1.4.1 METHODS TO STUDY PROTEIN-DNA INTERACTIONS ........................................ 1-11
1.4.2 GENE EXPRESSION ANALYSIS ......................................................................... 1-12
1.5 STRUCTURE OF TRANSCRIPTIONAL REGULATORY NETWORKS .......... 1-17
1.5.1 MOTIFS......................................................................................................... 1-19
1.5.2 MODULES ..................................................................................................... 1-19
1.5.3 GLOBAL NETWORK ORGANISATION ................................................................. 1-20
1.6 REFERENCES .................................................................................................. 1-21
1
1.1 Introduction
THEORETICAL AND EXPERIMENTAL
METHODS TO STUDY TRANSCRIPTION
FACTORS AND THEIR INTERACTIONS
1
Parts of this chapter appeared in the following research review articles:
1. Madan Babu, M.1*, Luscombe, N.1, Aravind, L., Gerstein, M., Teichmann, S.A.* (2004).
Structure and evolution of gene regulatory networks. Current Opinion in Structural Biology,
14(3):283-91.
2. Madan Babu, M.* (2004) Introduction to Microarray Data Analysis, Book chapter in
Computational Genomics, Horizon bioscience press (Grant, R. Editor)
3. Luscombe, N.* and Madan Babu, M.* (2004) GenCompass: a universal system to analyse
gene expression for any genome, Trends in Biotechnology, in press.
1.1 Introduction
This chapter provides an overview of experimental and computational methods that we
employed to study the evolution of genes, with particular emphasis on transcription factors, their
target genes, and the resulting regulatory network.
We first present sequence and structure-based methods commonly used to study the evolution
of genes and their products. This includes methods to determine protein domains, infer protein
sequence homology, and classify sequences and structures according to evolutionary
principles.
We then focus on experimental and computational methods to characterize sequences of
transcriptional events. More specifically, we illustrate principles behind the various strategies for
the identification of DNA binding proteins, DNA binding sites, and target genes for a given
transcription factor. We include a discussion on microarrays experiments and analysis, a
technique routinely used to monitor mRNA expression levels of genes at the genomic scale. We
1-1
1.2 Molecular mechanism for the control of gene expression
also illustrates principles behind the different clustering procedures used to group genes
according to properties of interest.
Finally, we look at the organization of the sum total of individual transcription interactions into a
comprehensive network at the genome level, and define properties of this network, including
network motifs, modules, and the scale-free structure and topological properties it shares with
other non-biological networks like the world wide web.
1.2 Molecular mechanism for the control of gene expression
The nature of any given organism is not only determined by the repertoire of the genes that it
encodes for but also by the way in which these genes are precisely regulated at any given time
point. Transcription factors are an important class of protein molecules that can control the
mRNA expression of other genes by binding to DNA near the gene thereby affecting
transcription of the nearby genes. Most transcription factors achieve this by responding to
changes in external signals. In most prokaryotes, sensing changes in the levels of a small
molecule can bring about active expression of genes that catabolize the small molecule (e.g.
lactose) or repress expression of genes that synthesise them (e.g. tryptophan). Most
transcription factors consist of two regions, one which recognises the signal and the other which
binds DNA (Jacob and Monod, 1961). An example of a transcription factor, Crp is shown in
Figure 1.1.
Figure 1.1: 3D protein structure of a transcription factor Crp bound to cAMP and DNA
Figure 1.1: A part of the protein that senses the external signal and the part that binds DNA are
coloured blue and red respectively. In this case, the signal is a small-molecule cyclic AMP,
coloured green. The DNA sequence to which Crp binds is coloured yellow. The coordinates for
the structure were downloaded from the protein data bank (PDB: 1J59).
1-2
1.3 Approaches to study evolution of transcription factors
In the case of the higher eukaryotes, transcription factors have been shown to be crucial for the
developmental process. A class of Hox transcription factors has been conserved throughout
evolution and is involved in deciding the body plan in eukaryotes. Thus, these DNA binding
transcription factors are the key proteins which bring about the execution of specific
developmental programs by regulating expression of the required proteins in the right place at
the right time. Figure 1.2 below shows the arrangement of the Hox genes in the genomes of
mammals and flies suggesting a common program for the development of body plan.
Figure 1.2: Developmental program and the Hox transcription factors
Figure 1.2: A set of Hox transcription factors in the genomes of mammals and flies. These
genes direct the development of different segments in the body plans of many animals. The
position of the gene also marks the position of its expression in the embryo. During embryonic
development, the first genes (red) in the genome are expressed at the anterior of the embryo
while other genes (orange, yellow) are expressed at more distal parts. This pattern of
expression in flies gives rise to mouth parts, thorax, wing segments, and the tail (Molecular Cell
Biology; 6th Edition).
Thus we see that transcription factors are proteins that can sense changes in the environment
and depending upon the change can either activate or repress expression of certain genes to
bring about the required effect.
1.3 Approaches to study evolution of transcription factors
The following section will explain concepts and methods that can be applied to understanding
evolution of proteins, and which were applied in this dissertation to the study of the evolution of
transcription factors and their regulated target genes.
1.3.1 Comparison of protein sequences to infer homology
One can infer homology by comparing protein sequences. Some of the commonly used
pairwise sequence comparison programs include Blast (Altschul et al., 1990), Fasta (Pearson
and Lipman, 1988) and programs that implement the Smith-Waterman algorithm (Ssearch and
MPsearch) (Needleman and Wunsch, 1970; Smith and Waterman, 1981). These can be broadly
grouped in to two classes: (i) Programs that perform an ‘exhaustive search’, where all possible
1-3
1.3 Approaches to study evolution of transcription factors
combinations of amino acid positions are compared to get the best alignment. Such methods
are computationally intensive when it comes to comparing a given sequence against a huge
database. (ii) Programs that employ heuristics, which search words instead of individual amino
acids. Such methods compromise on sensitivity, but are very fast when comparing a protein
against a huge database, e.g. Uni-prot sequence database or the Swiss Prot database
(Leinonen et al., 2004). Such methods are generally reliable in detecting orthologs and closely
related paralogs. For example, a bi-directional best-hit procedure using standard sequence
comparison methods is a routinely used method to detect orthologous proteins in large-scale
comparative genomics studies. It had been used for this thesis, see chapter 4 for details.
However the problem arises when the compared sequences have diverged too far, as is true for
very distant paralogs within the same genome. In such cases, regular sequence comparison
methods fail to detect homology. This is where analysing protein sequences in the context of
their domain content can help to reliably detect evolutionary relationships.
1.3.2 Domains of a proteins can be used to infer evolutionary history
Proteins are made up of domains. Domains are defined based on either sequences, or
structures, or both. Two different ways of defining a domain are shown in Figure 1.3. A domain
is a defined region in a protein, which can either be the complete protein or be a part of a
protein, which can occur independently or in combination with other domains in a different
protein. A list of domain databases is provided in Table 1.1. Throughout this thesis, the
Structural Classification of Proteins (SCOP) domain (Andreeva et al., 2004; Lo Conte et al.,
2000; Murzin et al., 1995) definition will be used unless otherwise stated. In the Structural
Classification of Proteins (SCOP) database, domains are defined based on protein structure. In
SCOP, a domain is defined as an independent evolutionary and structural unit that can undergo
duplication and recombination. Hence a domain can occur on its own or in combination with
other domains in a different protein.
Figure 1.3: Sequence based and structure based domain definition
Structure based domain definition
(Homeodomain superfamily in SCOP)
Sequence based domain definition
(Homeobox family in Pfam)
Figure 1.3: SCOP (PDB: 9ANT) and Pfam (PF00046) definition for the homeodomain family of
proteins.
1-4
1.3 Approaches to study evolution of transcription factors
Small proteins consist of a single domain, for example the hen-egg lysozyme (Figure 1.4a),
whereas large proteins consist of one or more domains, referred to as multi-domain proteins. An
example of a large multidomain protein is shown in Figure 1.4b, it is a tyrosine kinase, which
consists of SH3 domain, followed by SH2 domain and a catalytic domain. Thus, domains in
proteins can either function independently or may contribute to the function of the multidomain
protein in cooperation with other domains (Teichmann et al., 1999; Vogel et al., 2004).
Figure 1.4: A single-domain protein and a multi-domain protein
a
b
Hen egg white lysozyme
Domain architecture
Lysozyme-like
Hck kinase
Domain architecture
SH3 domain : SH2 domain : Protein Kinase-like
Figure 1.4: (a) The 3D structure of T4 hen-egg lysozyme protein (PDB: 193l) made of a single
domain. (b) The structure of Hck tyrosine kinase (PDB: 1AD5) made of SH3 domain, SH2
domain and a catalytic domain. The linear organisation of domains from the N-terminus to the
C-terminus is referred to as the protein’s domain architecture and is shown below each protein.
Since a protein domain is an evolutionary unit (Murzin et al., 1995), by studying the domain
composition of proteins one can understand their evolutionary history. Given that a domain can
have a defined function, one can also predict possible functions of a protein with known domain
composition. Some of the widely used domain databases are shown in Table 1.1.
1-5
1.3 Approaches to study evolution of transcription factors
Table 1.1a: A list of sequence-based domain databases
Database
Type
Pfam
Sequence
SMART
Sequence
TIGRFAM
Sequence
ProDom
Sequence
COGs
Sequence
CDD
Sequence
Description
Protein families database (Bateman et al., 2004) is a carefully
curated database of protein sequence families which provides
multiple sequence alignments and the hidden Markov models
built using the HMMER (Eddy, 1998) package.
Simple Modular Architecture Research Tool (Letunic et al.,
2004) is similar to Pfam, however it consists only of domains
that are seen in signalling, chromatin associated proteins and
extra-cellular proteins. Hidden Markov models are built using
the HMMER package.
TIGRFAM (The Institute for Genomic Research Families)
(Peterson et al., 2001) is another curated sequence based
hidden Markov model library which is built by clustering protein
sequences obtained primarily from the microbial genome
sequencing projects.
ProDom (Servant et al., 2002) is a comprehensive collection of
domain families which is obtained by automatically clustering all
protein sequences available in the Swiss-Prot database.
Clusters of Orthologous Groups (Tatusov et al., 2003) consist of
groups of homologous proteins (i.e. proteins that have evolved
from a common ancestor). Each group consists of paralogous
proteins (duplicated proteins within one genome) and
orthologous proteins (the same protein in different genomes).
COGs are generated automatically by using BLAST sequence
analysis tool.
Conserved domain database (Marchler-Bauer et al., 2003) uses
Pfam, SMART and their own domain definition to build a
position specific scoring matrix (PSSM). Reverse PSI–BLAST
(Altschul et al., 1997) is then used to assign domains to
proteins.
Table 1.1b: A list of structure-based domain databases
Database
Type
MMDB
Structure
FSSP
Structure
CATH
Structure
Description
MMDB (Chen et al., 2003) groups sequences into clusters
based on pairwise comparison of structures available in the
Protein Data Bank (Bourne et al., 2004)
FSSP (Holm and Sander, 1994) is a fully automatic
classification scheme that clusters proteins into groups by
carrying out structure-structure alignments.
CATH (Orengo et al., 1997; Orengo et al., 1999) is a
hierarchical, semi-automatic method, which classifies protein
domain structures into four major levels: Class, Architecture,
Topology and Homologous superfamily.
1-6
1.3 Approaches to study evolution of transcription factors
Database
Type
SCOP
Structure
Gene3D
Structure
Superfamily
Structure
Description
SCOP (Andreeva et al., 2004; Lo Conte et al., 2002; Murzin et
al., 1995) is a manually curated hierarchical classification of all
proteins of known structure. Proteins are broken into domains,
which are then classified into the following levels: family,
superfamily, fold and class.
Gene3D (Buchan et al., 2003) is a library of HMMs which are
built using the SAM-T99 package using the CATH domain
definition.
Superfamily (Madera et al., 2004) consists of a library of hidden
Markov models which are built using the SAM-T99 package at
the SCOP superfamily level using the SCOP domain definitions.
Table 1.1c: A list of sequence and structure based domain databases
Database
Type
Sequence
InterPro
and
structure
Sequence
HOMSTRAD
and
structure
Description
InterPro (Mulder et al., 2003) is an integrated resource, which
uses many different domain databases, including Pfam,
SMART, and Superfamily.
Homologous structure alignment database (Stebbings and
Mizuguchi, 2004) consists of annotated structural alignments for
homologous families and is based on a variety of other domain
definitions.
.
Table 1.1: A list of databases that use protein sequences and structures to define domains. For
a comprehensive list of known domain databases, the reader is referred to the annual January
database issue of the Nucleic Acids Research journal.
1.3.3 Procedures to assign domains to protein sequences
As mentioned above, one can infer function and the evolutionary history of an uncharacterised
protein by identifying the domains of which it is made. To do this, one has to be able to reliably
detect regions in the protein sequence that belong to a particular domain family. There exist a
variety of computational techniques that can detect domains, some of which are discussed
below.
A. Profile based methods
Profile based methods like PSI-BLAST first pick up close homologs using standard pairwise
sequence search methods. A multiple sequence alignment is built from the close homologs. For
each position in the multiple sequence alignment, a position specific scoring matrix (PSSM) is
generated that gives a measure for the variability in the amino acid composition for the position.
Thus if a position has a high conservation of a particular amino acid, then the matrix for that
position will have a high negative value for a substitution. In this way, functionally important
regions on the protein sequence will be treated differently from any other region that can
1-7
1.3 Approaches to study evolution of transcription factors
accumulate mutations without strict constraints. Using this position specific scoring matrix and
the multiple sequence alignment, the database is searched again to find distant homologs in an
iterative manner. In programs implementing iterative PSI-BLAST (Schaffer et al., 2001), the
user can specify the number of iterations or can ask the program to iterate this procedure until
convergence, i.e. until no more new sequences are picked up.
B. Hidden Markov model (HMM) based methods
Another advancement in the area of protein homology detection was made with the use of
hidden Markov models to search for specific protein domains. A hidden Markov model is a
different way to represent a profile (Eddy, 1996). HMM treats amino acid insertions and
deletions more efficiently than profile based methods. HMM can be though of as a finite model
that describes a probability distribution over a large number of possible sequences. Two
commonly used programs are HMMER (Eddy, 1998) and SAM (Karplus et al., 1998). HMMER
package requires a multiple sequence alignment, which is then used to create HMMs, and
calibrate and score them. In the case of SAM, a single sequence can be used as an input,
which is then automatically searched in an iterative manner against a database to create a
multiple sequence alignment and to generate an HMM (Madera and Gough, 2002). Target 99
module carries out the multiple sequence alignment, hence the other name of the package
SAM-T99. Pfam database uses HMMER for building its library of HMMs, whereas Superfamily
database uses SAM-T99. Initial studies by Park et al (Park et al., 1998) and Brenner et al
(Brenner et al., 1998) have shown that profile based methods perform better in detecting remote
homologs than pairwise sequence comparison methods. More recently, a study by Madera and
Gough showed that HMM based methods give better results than profile based methods. A
thorough and systematic comparison of the performance of different methods is given in
Madera and Gough (Madera and Gough, 2002).
1.3.4 Protein evolution by duplication, divergence and recombination of domains
It is now well established that most proteins in genomes have evolved by duplication,
divergence and recombination of existing domains (Chothia et al., 2003; Vogel et al., 2004).
When the first genome sequences of H. influenzae and M. genitalium were published, Brenner
et al (Brenner et al., 1995) and Teichmann et al (Teichmann et al., 1998) showed that there has
been extensive duplication of a limited number of distinct domain families. Studies which probed
into the conservation of domain order by assigning domains to protein sequences from
completely sequenced genomes showed that domain order is highly conserved in many
proteins (Apic et al., 2001). A study on the Rossmann domain family of proteins by Bashton and
Chothia (Bashton and Chothia, 2002) revealed that most proteins have a conserved domain
order even under little functional constraint. Considered together, these results suggest that
during evolution, proteins conserved their domain architecture following duplication and
divergence. Thus proteins with identical domain architecture can be considered to have evolved
by duplication of a common ancestor.
1-8
1.3 Approaches to study evolution of transcription factors
The original work reported in this thesis is a procedure to identify DNA binding transcription
factors in E. coli based on domain assignments to the known DNA binding domain families
(chapter 2). This procedure has been applied to predict transcription factors for many
completely sequenced genomes, including the mouse genome and the Dictyostelium genome.
Number of the predicted transcription factors for the five genomes is shown in Table 1.2,
ranging from nearly 300 in E. coli to over 3000 in humans. These proteins constitute between
6% (in E. coli and yeast) and 8% (in human) of all proteins encoded in these organisms.
Table 1.2: Predicted number of transcription factors in five model organisms
Organism
Number of
transcripts
E. coli
S. cerevisiae
C. elegans
H. sapiens
A. thaliana
4,280
6,357
31,677
32,036
28,787
Number of
transcripts with
DNA-binding
domainsa
267
245
1463
2604
1667
Percentage of transcripts
containing DNA-binding
domains
6.2%
3.9%
4.6%
8.1%
5.7%
Table 1.2: aDNA-binding domain assignments from Pfam and SUPERFAMILY were used to
establish the repertoire of DNA-binding transcription factors in five model organisms. An
expectation value threshold of 0.002 was used in making the assignments. Co-regulators that
do not bind DNA directly were excluded.
Our procedure allows us to assess the evolutionary relationships among transcription factors.
Our studies and those of the others using E. coli (Madan Babu and Teichmann, 2003; PerezRueda and Collado-Vides, 2000), archaea (Aravind and Koonin, 1999), plants and animals
(Ledent and Vervoort, 2001; Riechmann et al., 2000) have consistently demonstrated that
transcription factors draw their DNA-binding domains from a relatively small, ancient conserved
repertoire. Different organisms have various parts of the ancient repertoire expanded in their
genomes. For example, the winged-helix domain and the zinc ribbon domain are encountered in
all three principal kingdoms of life. The ribbon-helix-helix (MetJ/Arc) domain is found only in
prokaryotes (Aravind and Koonin, 1999), whereas the crown group eukaryotes display a
proliferation of several novel DNA-binding domains, such as the C2H2 zinc finger domains, the
AT hook domains, the HMG1 domains and the MADS box domains (Lander et al., 2001).
Shown in Figure 1.5 are the examples of some of the DNA binding domains in the five
genomes listed in Table 1.2. The DNA-binding domain families have been chosen to emphasize
that many families are specific to individual phylogenetic groups and can be greatly expanded in
some genomes. For example, the nuclear hormone receptor family transcription factors are very
1-9
1.4 Approaches to study gene expression programs
abundant in Caenorhabditis elegans compared with other organisms, whereas the Zn2/Cys6
fungal-type zinc finger is expanded in the fungi, but absent elsewhere. In contrast to the high
level of conservation of other regulatory and signalling systems across the crown group
eukaryotes, some of the transcription factor families are dramatically different in the various
lineages. This suggests a major role for recurrent, massive and lineage-specific expansions in
the evolution of transcription factors in the crown group eukaryotes (Coulson and Ouzounis,
2003; Lespinet et al., 2002). In prokaryotes, several orthologous groups of transcription factors
show a much wider spread across phylogenetically diverse organisms, suggesting a role for
horizontal transfers, in addition to diversification through a lower level of lineage-specific
duplications.
Figure 1.5: Lineage-specific expansion of DNA-binding domain families
E. coli
C-terminal effector domain of the
bipartite response regulators
S. cerevisiae C. elegans H. sapiens A. thaliana
17
0
0
0
0
Zn2/Cys6 DNA-binding domain
0
53
0
0
0
Glucocorticoid receptor-like
(DNA-binding domain)
0
10
361
19
48
C2H2 and C2HC zinc fingers
0
30
125 1039
SRF-like
0
4
3
7
59
113
C-terminal effector domain
of the bipartite response regulators
C2H2 and C2HC zinc fingers
CheY-like
Zn2/Cys6 DNA-binding domain
SRF-like
Nuclear receptor ligand-binding
domain
Glucocorticoid receptor-like
(DNA-binding domain)
Figure 1.5: Examples of DNA-binding domain families of transcription factors that are prevalent
in one of the five genomes, but are rare in the others. The genomic occurrence of each family is
provided in the table and we depict their most common domain architectures alongside. SRF,
serum response factor.
1.4 Approaches to study gene expression programs
So far we have seen how one could use domains to characterise proteins and how domain
definition of proteins can help to identify possible DNA binding transcription factors. In the
following section, small and large-scale strategies to study gene expression programs that
reveal the regulated target genes and their transcription factors will be discussed.
1-10
1.4 Approaches to study gene expression programs
1.4.1 Methods to study Protein-DNA interactions
Interaction of a protein with DNA on a chromosome can affect transcription of nearby genes.
Hence studying protein-DNA interaction will tell us which genes are controlled by a transcription
factor and will allow us to identify targets of transcription factors. There are conventional smallscale experiments which focus on studying individual transcription factors in great detail.
Alternatively, with the recent advancement that have been made, we are now in a position to
carry out large-scale experiments that allow simultaneous monitoring of protein-DNA interaction
and provides us with a comprehensive set of interactions in a genome-wide level. The principle
behind some of the commonly used strategies is described in Table 1.3 below.
Table 1.3a: Small-scale experimental methods to probe protein-DNA interactions
Methods
Band-shift
DNA
footprinting
Binding site
detection using
FRET
Binding site
detection using
unnatural base
analog
In vivo cross
linking and
immunoprecipit
ation
Description
Since DNA molecules are more flexible than proteins, they have much
higher mobility in a polyacrylamide gel. Thus, under favourable conditions,
free DNA can be distinguished from DNA bound to proteins due to the
difference in molecular weight (Garner and Revzin, 1981).
In DNA foot printing, a 5’ end labelled double stranded DNA is partially
degraded by DNAase both in the presence and absence of the binding
protein. Degraded fragments are then loaded on to a polyacrylamide gel
and are visualised by autoradiography. Since the region where the protein
has bound the DNA will be protected from DNAase, no fragments are
seen in those regions. Thus by comparing lanes, one can identify the
binding site (Galas and Schmitz, 1978).
A method by Heyduk and Heyduk (Heyduk and Heyduk, 2002) uses a
library of double stranded DNA with one of the two fluorophores attached
to its end. Protein binding to two pieces of DNA (one from each library),
where each comprises one-half of the binding site, induces FRET
(Fluorescence Resonance Energy Transfer), which can be used to find
protein bound to DNA.
A method by Storek et al (Storek et al., 2002) uses a library of DNA
sequences which have an unnatural base analog (one for each base).
Following selection for protein bound DNA molecules, the DNA is cleaved
specifically at the modified base. The site of incorporation can be
identified by gel electrophoresis by running fragments generated from
unbound sample next to the fragments generated from the bound sample.
Since the presence of an analog in the binding site impedes protein
binding, this results in a depletion of the protein-bound pool.
Over expression of a DNA binding protein in a cell ensures that the
protein exists in its DNA-bound form. The bound protein is then covalently
linked to DNA by using a cross-linking agent such as formaldehyde. After
cross-linking, DNA is sheared and the protein bound DNA is precipitated
using specific antibodies to the protein. Reversal of the cross-linking
releases the bound DNA allowing the sequence of the fragments to be
determined by regular sequencing methods. This method is called
Chromatin Immunoprecipitation or ChIP in short (Kuo and Allis, 1999).
1-11
1.4 Approaches to study gene expression programs
Table 1.3b: Large-scale experimental strategies to probe protein-DNA interactions
Method
Identifying DNA
binding
proteins by
mass
spectrometry
ChromatinImmunoprecipit
ation-Chip
experiments
(ChIP-chip)
DNA adenine
methyl
transferase
Identification
Description
Immobilised short DNA probes are incubated with cell or nuclear extract,
allowing the protein to bind DNA on the surface. Proteins are analysed
directly off the solid support by MALDI/TOF mass spectrometry. If the
determined molecular mass does not uniquely identify a protein, proteins
are subjected to mass spectrometric peptide mapping. This provides a
way of detecting post-translational modification of specific residues on the
protein (Nordhoff et al., 1999).
In ChIP-chip experiments, intergenic regions are spotted on to a
microarray chip. Following a chromatin immunoprecipitation step,
discussed previously, the bound fragments are reverse cross-linked and
hybridized onto the chip. Complementary sequences will bind to specific
spots on the chip thus providing the exact intergenic region to which the
protein was bound (Horak and Snyder, 2002). This method was used by
Lee et al (Lee et al., 2002) to study binding sites of 106 yeast transcription
factors chip and hence reconstruct the transcriptional regulatory network
for these proteins.
One of the problems in ChIP experiments is that artefacts can be
produced due to non-specific cross-linking of protein and DNA. To
overcome this problem, van Steensel and Henikoff (van Steensel and
Henikoff, 2000) introduced the DamID technique. First the protein of
interest is fused to an E. coli protein, DNA adenine methyl transferase
(Dam). Dam methylates the N6 position of the adenine in the sequence
GATC, which occurs at reasonably high frequency in any genome (~1 site
in 256 bases). Upon binding DNA, the Dam protein preferentially
methylates adenine in the vicinity of binding. Subsequently, the genomic
DNA is digested by the DpnI and DpnII restriction enzymes that cleave
within the non-methylated GATC sequence, and remove fragments that
are not methylated. The remaining methylated fragments are amplified by
selective PCR and quantified by microarray analysis. Sun et al (Sun et al.,
2003) successfully applied this technique to map protein DNA interactions
at high resolution along segments of genomic DNA from Drosophila using
a combination of the DamID technique and genomic DNA tiling path
microarrays.
Table 1.3: small-scale and large-scale experimental strategies to study protein-DNA
interactions and identify targets for transcription factors.
1.4.2 Gene expression analysis
In the previous section, different experiments that can be done to identify targets for a given
transcription factor was explained. However such methods fail to provide information about the
cellular conditions in which the gene will be regulated, i.e. when the gene is turned on and when
it is turned off. Hence such methods only provide us with static information about transcriptional
1-12
1.4 Approaches to study gene expression programs
interactions and fail to provide us with a dynamic picture about when the interaction would be
active. Thus, what would be useful to know is when and under what conditions a gene is
differentially expressed and when it is not. This provides us with direct information about genes
that need to be present in a given cellular condition, such as under stress or sporulation, etc.
One of the ways to monitor gene expression of many genes simultaneously is through
monitoring mRNA levels on a microarray chip. The following section will explain the basic
principle behind a microarray experiment. Chapter 5 will discuss about how integrating gene
expression data with transcriptional interactions can help us define transcriptional interactions
that will active in a particular cellular condition.
A. Microarray
One of the ways to study gene expression programs is to monitor the expression levels of
thousands of genes simultaneously under a particular condition. Microarray technology makes
this possible and the quantity of data generated from each experiment is enormous, which
dwarfs the amount of data generated by the genome sequencing projects. A microarray is
typically a glass slide on to which DNA molecules are fixed in an orderly manner at specific
locations called spots (or features). A microarray may contain thousands of spots and each spot
may contain a few million copies of identical DNA molecules that correspond uniquely to a gene
(Figure 1.6a). The DNA in a spot may either be genomic DNA or a short stretch of oligonucleotide strand that correspond to a gene. The spots are printed on to the glass slide by a
robot or are synthesised by the process of photolithography. The principle of differential
hybridization is explained in Figure 1.6b below.
More recently, to obviate the need to develop sequence specific chips, Lizardi and co workers
have developed a procedure called GenCompass, which combines a traditional enzymatic
manipulation step, a universal microarray representing all possible DNA hexamers, and
advanced bioinformatics techniques to analyse data obtained. For a review about possible
future applications using the universal microarray system, see Luscombe and Madan Babu
(Luscombe and Madan Babu, in press).
1-13
1.4 Approaches to study gene expression programs
Figure 1.6: Microarray experiment
b
a
Condition B
Condition A
mRNA
extraction
cDNA
labelling with dyes
Each spot contains
Oligonucleotide sequence
or genomic DNA that “uniquely”
represents a gene
Hybridisation
Spot
(feature)
Excitation with laser
sub-array
Final image stored as a file
Figure 1.6: (a) A microarray may contain thousands of ‘spots’. Each spot contains many copies
of the same DNA sequence that represents a single gene from an organism. (b) Schematic of
the experimental protocol to study differential expression of genes. The organism is grown in
two different conditions (a reference condition and a test condition). RNA is extracted from the
two cells, and is labelled with different dyes (red and green) during the synthesis of cDNA by
reverse transcriptase. Following this step, cDNA is hybridized onto the microarray slide, where
each cDNA molecule representing a gene will bind to the spot containing its complementary
DNA sequence. The microarray slide is then excited with a laser at suitable wavelengths to
detect the intensities of red and green fluorescence value which represents the relative
expression level of a gene in the two conditions considered. The final image is stored as a file
for further analysis.
B. Clustering methods
One of the goals of carrying out a microarray experiment is to identify genes or samples with
similar expression profiles, to make meaningful biological inference about the set of genes or
samples. Clustering is one of the unsupervised approaches to classify data into groups of genes
or samples with similar patterns that are characteristic to the group.
One should note that clustering analysis can be applied to any kind of information as long as it
is represented in a way amenable for clustering (i.e. a profile, a vector or a matrix). Chapter 4 in
this thesis exploits the power of clustering to study transcription factor and transcriptional
interaction conservation in the different genomes by creating ‘conservation’ profiles across the
different genomes for the genes and their interactions.
Clustering methods can be hierarchical (grouping objects into clusters, specifying relationships
among objects in the cluster, resembling a phylogenetic tree) or non-hierarchical (grouping into
clusters without specifying relationships between objects in the cluster) as shown in Figure 1.7.
1-14
1.4 Approaches to study gene expression programs
It should be noted that an object may refer to a gene or a sample and a cluster refers to a set of
objects that behave in a similar manner. In the following section, hierarchical agglomerative
clustering and K-means clustering will be described. For a detailed introduction to microarray
analysis, please refer to Causton et al (Causton et al, 2003), Madan Babu, M (Madan Babu,
2004) or appendix B.
Figure 1.7: Clustering methods
Clustering methods
Hierarchical
Agglomerative
Non-hierarchical
e.g: K-means, SOMs
Divisive
Single linkage
Complete linkage
Average linkage
Centroid linkage
Figure 1.6: An overview of the different clustering methods.
Hierarchical clustering
Hierarchical clustering may be agglomerative (starting with the assumption that each object is a
cluster and grouping similar objects into bigger clusters) or divisive (starting from grouping all
objects into one cluster and subsequently breaking the big cluster into smaller clusters with
similar properties).
Hierarchical agglomerative clustering
In the case of a hierarchical agglomerative clustering, the objects are successively fused until all
the objects are included. For a hierarchical agglomerative clustering procedure, each object is
considered as a cluster and the pair-wise distance measures for the objects to be clustered are
first calculated. Based on the pair-wise distances between them, objects that are similar to each
other are grouped into clusters. After this is done, pair-wise distances between the clusters are
re-calculated, and clusters that are similar are grouped together in an iterative manner until all
the objects are included in a single cluster. This information can be represented as a
dendrogram, where the distance from the branch point is indicative of the distance between the
two clusters or objects.
The comparison of a cluster with another cluster or an object can be done using four
approaches (Figure 1.8).
1-15
1.4 Approaches to study gene expression programs
Figure 1.8: Algorithms to group objects into clusters
Single linkage clustering
Complete linkage clustering
Average linkage clustering
Centroid clustering
Cluster-1
Cluster-1
Cluster-1
Cluster-1
Cluster-2
Object in a cluster
(may be a gene or a
sample expression profile)
Cluster-2
Cluster-2
Distance between clusters
Cluster-2
Centroid of a cluster
(may be centroid of gene
or sample expression profiles)
Figure 1.8: Single linkage, complete linkage, average linkage and centroid linkage clustering
procedure.
Single linkage clustering (Minimum distance)
In single linkage clustering, the distance between two clusters is calculated as the minimum
distance between all possible pairs of objects, one from each cluster. This method has the
advantage that it is insensitive to outliers. This method is also known as nearest neighbour
linkage. In fact, the BLASTclust procedure, which will be described in chapter 4 in this thesis,
implements the single linkage clustering procedure to group orthologous proteins.
Complete linkage clustering (Maximum distance)
In complete linkage clustering, the distance between the two clusters is calculated as the
maximum distance between all possible pairs of objects, one from each cluster. The
disadvantage of this method is that it is sensitive to outliers. This method is also known as the
farthest neighbour linkage.
Average linkage clustering
In average linkage clustering, the distance between the two clusters is calculated as the
average of the distance between all possible pairs of objects in clusters.
Centroid linkage clustering
In centroid linkage clustering, an average expression profile (called a centroid) is calculated in
two steps. First, the mean in each dimension of the expression profiles is calculated for all
objects in a cluster. Then, the distance between the clusters is measured as the distance
between the average expression profiles of the two clusters.
1-16
1.5 Structure of transcriptional regulatory networks
K-means
K-means is a popular non-hierarchical clustering method (Figure 1.9). In K-means clustering,
the first step is to arbitrarily group objects into a predetermined number of clusters. The number
of clusters can be chosen randomly or estimated by first performing a hierarchical clustering on
the data. Following this step, the average expression profile (centroid) is calculated for each
cluster and this step is called initialization. Next, individual objects are moved from one cluster
to the other depending on which centroid is closer to the gene. This procedure of calculating the
centroid for each cluster and moving objects closer to the centroid is performed in an iterative
manner a fixed number of times, or until convergence (where the cluster composition remains
unaltered). Typically, the number of iterations required to obtain stable clusters ranges from
20,000 to 100,000 times. However, there is no guarantee that the clusters will converge. This
method has an advantage that it is scalable for large datasets. This method has been used to
group genomes based on patterns in the conservation of genes, their interactions and the
regulatory motifs (refer Chapter 5).
Figure 1.9: K-means clustering
Gene expression space
Initialisation
Iteration
Convergence
(or after fixed number
of iteration)
Figure 1.9: The principle behind K-means clustering. Objects are grouped into a predefined
number of clusters during the initialization step. Centroid for each cluster is calculated, and
objects are re-grouped depending on how close they are to their centroids. This step is
performed iteratively until convergence or is performed for a fixed number of iterations to get a
final cluster of objects.
1.5 Structure of transcriptional regulatory networks
So far we have seen how one can study and identify putative transcription factors using domain
assignments, study protein-DNA interaction to identify targets for transcription factors, monitor
expression levels of genes under particular conditions, and the different clustering procedure to
groups objects (genes) that have similar behaviour (expression or conservation). In the
following section, the organisation of individual transcriptional interactions into a transcriptional
interaction network will be discussed. Chapter 3 will discuss how such networks evolve, chapter
4 describes how such networks change in evolution and chapter 5 will explain how the structure
of such networks change during the different cellular conditions.
1-17
1.5 Structure of transcriptional regulatory networks
The assembly of individual regulatory interactions linking transcription factors to their target
genes in an organism can be viewed as a directed graph, in which the regulators and targets
represent the nodes, and the regulatory interactions are the edges (Figure 1.10) (Madan Babu
et al., 2004; Wei et al., 2004; Xia et al., 2004). This resulting network is a complex, multilayered
system that can be examined at four levels of detail. At the most basic level, the network
comprises a collection of transcription factors, downstream target genes and the binding sites in
the DNA (Figure 1.10a). At the second level, these basic units are organised into recurrent
patterns of interconnections called network motifs (Lee et al., 2002; Milo et al., 2002; Shen-Orr
et al., 2002), which appear frequently throughout the network (Figure 1.10b). At the third level,
the motifs cluster into semi-independent transcriptional units called modules (Figure 1.10c).
Finally, at the top level, the regulatory network consists of interconnecting interactions among
the modules, to build up the entire network (Figure 1.10d).
Figure 1.10: Structural organisation of transcriptional regulatory networks
SIM
Transcription factor
MIM
FFM
Target gene
and binding site
(a) Basic unit
(b) Motifs
(c) Modules
(d) Transcriptional
regulatory
network
Figure 1.10: (a) The ‘basic unit’ comprises the transcription factor, its target gene with DNA
recognition site and the regulatory interaction between them. (b) Units are often organised into
network ‘motifs’, which comprise specific patterns of inter-regulation that are over-represented
in networks. Examples of motifs include single input (SIM), multiple input (MIM) and feedforward motifs (FFM). (c) Network motifs can be interconnected to form semi-independent
‘modules’, many of which have been identified by integrating regulatory interaction data with
gene expression data, and imposing evolutionary conservation. (d) The entire assembly of
regulatory interactions constitutes the ‘transcriptional regulatory network’, which provides the
blueprint for regulation of gene expression in an organism.
It should be noted that much of the work on regulatory networks has focused on E. coli and the
yeast Saccharomyces cerevisiae, for which data are most abundant. The individual regulatory
interactions in E. coli have been collected manually from the literature in the RegulonDB
database (Salgado et al., 2004). In yeast, on the other hand, manually curated data (Guelzim et
1-18
1.5 Structure of transcriptional regulatory networks
al., 2002; Matys et al., 2003) have been greatly augmented by the output of large-scale DNAbinding data from chromatin immunoprecipitation-chip (ChIp-chip) experiments (Horak et al.,
2002; Lee et al., 2002).
1.5.1 Motifs
At a local level, the transcriptional network can be broken down into a series of regulatory
motifs. These represent the simplest units of network architecture, in which there are specific
patterns of inter-regulation between transcription factors and target genes. Motifs do not often
represent independent units that are functionally separable from the rest of the network.
However, they have been shown theoretically and experimentally to possess particular kinetic
properties that determine the temporal program of expression of the target genes (Mangan and
Alon, 2003; Mangan et al., 2003).
Schematics of three prevalent motifs are shown in Figure 1.10b: Single Input Motif, Multiple
Input Motif and Feed Forward Motif. The first two comprise direct-acting motifs, whereby a
single or multiple transcription factors regulate their targets. Yu et al. (Yu et al., 2003) showed
that target genes belonging to the same single and multiple input motifs tend to be coexpressed, and that the level of co-expression is higher when multiple transcription factors are
involved. The feed-forward loop is composed of two transcription factors, whereby the first
regulates the second and both regulate a final target gene. Further motifs identified by Lee et al.
(Lee et al., 2002) in yeast represent patterns of interconnections of variable complexity, such as
the autoregulatory and regulatory chain motifs.
1.5.2 Modules
The organisation of the regulatory network can also be captured at an intermediate level by
examining its modularity (Hartwell et al., 1999; Lee et al., 2002). Intuitively, one might expect
distinct cellular processes to be conveniently regulated by discrete and separable modules.
Indeed, Guelzim et al. (Guelzim et al., 2002) reported global fragmentation of the regulatory
network in yeast. The clustering coefficient — a measure of the propensity for nodes to form
‘cliques’ — was fivefold higher than would be expected for a random network.
There have been several different approaches to identifying modules and these studies have
provided distinct outcomes with respect to the resulting modules. The main conclusion,
however, is that regulatory network tends to be highly interconnected and very few modules are
entirely separable from the rest of the network. In fact, many identified modules are nested
within each other in a hierarchical organisation at differing levels of connectivities.
Dobrin et al. (Dobrin et al., 2004) showed that many of the multiple input and feed-forward loop
motifs in E. coli overlap, so that they share transcription factors or target genes. Thus, many
small, highly connected motifs group into a few larger modules, which in turn integrate into even
1-19
1.5 Structure of transcriptional regulatory networks
larger ones. These nested modules are interconnected through local regulatory hubs. Such an
organisation combines the capacity for rapid regulatory changes through regulatory hubs with
integration of the regulatory processes across several modules.
Other approaches to identifying modules have incorporated further data sources, such as gene
expression data sets. Typical analyses have applied clustering algorithms to gene expression
data to find sets of co-expressed genes. In one of the original studies by Tavazoie et al.
(Tavazoie et al., 1999), it was reported that some of the major co-expression clusters coincided
with functional groupings of genes. In an ambitious extension of our work (Teichmann and
Madan Babu, 2002), Stuart et al. (Stuart et al., 2003) recently clustered over 3000 microarray
experiments on four eukaryotic genomes and identified 22 163 gene pairs whose co-expression
is conserved across all organisms. They grouped sets of orthologues into modules, suggesting
that co-expression of gene pairs over large evolutionary distances implies a selective advantage
for co-regulation, perhaps because the genes are functionally related.
In another interesting study, Ihmels et al. (Ihmels et al., 2002) added a different perspective by
taking the experimental conditions into account when defining the gene clusters. Their
‘signature’ algorithm identifies clusters according to the experimental conditions in which the
expression patterns of genes are most significantly correlated. The authors identified 86
transcriptional modules and the experimental conditions in which they operate. Segal et al.
(Segal et al., 2003) used a probabilistic algorithm to partition gene modules first based on their
expression profiles, and then identified specific regulatory genes that are predicted to control
the modules by comparing the expression profiles of candidate regulators and the gene
modules. They were able to identify 50 different modules with distinct regulatory programs.
Particularly illuminating was the formation of higher order groupings by the individual modules,
which are regulated by partly overlapping but distinct regulators.
Bar-Joseph et al. (Bar-Joseph et al., 2003) improved previous algorithms by explicitly linking
gene expression data with the regulatory interaction data produced by Lee et al. (Lee et al.,
2002) through ChIp-chip experiments. In this way, the authors were able to partition 655 distinct
genes and 68 transcription factors into 106 regulatory modules. Many of the identified modules
could be linked to particular cellular processes.
1.5.3 Global network organisation
At a global level, the overall structure or topology of the gene regulatory network can be
described by parameters derived from graph theory. The incoming connectivity is the number of
transcription factors regulating a target gene, which quantifies the combinatorial effect of gene
regulation. A recent study by Guelzim et al. (Guelzim et al., 2002) reported that the fraction of
target genes with a given incoming connectivity decreases exponentially. The exponential
behaviour indicates that most target genes are regulated by similar numbers of factors (93% of
1-20
1.6 References
genes are regulated by 1–4 factors in yeast) and presumably reflects the molecular limits on the
number of transcription factors that can affect a target gene simultaneously, which are imposed
by protein and DNA structural constraints at promoters.
The outgoing connectivity, which is the number of target genes regulated by each transcription
factor, is distributed according to a power law, contrary to the incoming connectivity parameter.
This is indicative of a hub-containing network structure, in which a select few transcription
factors participate in the regulation of a disproportionately large number of target genes. These
hubs can be viewed as ‘global regulators’, as opposed to the remaining transcription factors that
can be considered ‘fine tuners’. Global regulators can be defined based on the number of genes
they regulate. In the transcriptional network in yeast, regulatory hubs have a propensity to be
lethal if removed (Guelzim et al., 2002). Martinez-Antonio and Collado-Vides (Martinez-Antonio
and Collado-Vides, 2003) defined global regulators by taking into account additional factors,
such as the number of co-regulators and the number of conditions.
As described in this Chapter, a wealth of data on transcription factors, their regulatory
interactions and the gene expression programmes of organisms have become available.
Theoretical approaches can be used to study the structure and evolution transcriptional
regulatory networks. This thesis focuses on the analysis of transcriptional regulatory systems in
single celled organisms. With the advancements being made in biology, we expect to be able to
apply similar methods to multi-cellular organisms in future.
1.6 References
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990). Basic local
alignment search tool. J Mol Biol 215, 403-10.
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman,
D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs. Nucleic Acids Res 25, 3389-402.
Andreeva, A., Howorth, D., Brenner, S. E., Hubbard, T. J., Chothia, C. and Murzin, A. G.
(2004). SCOP database in 2004: refinements integrate structure and sequence family data.
Nucleic Acids Res 32, D226-9.
Apic, G., Gough, J. and Teichmann, S. A. (2001). Domain combinations in archaeal,
eubacterial and eukaryotic proteomes. J Mol Biol 310, 311-25.
Aravind, L. and Koonin, E. V. (1999). DNA-binding proteins and evolution of transcription
regulation in the archaea. Nucleic Acids Res 27, 4658-70.
Bar-Joseph, Z., Gerber, G. K., Lee, T. I., Rinaldi, N. J., Yoo, J. Y., Robert, F., Gordon, D. B.,
Fraenkel, E., Jaakkola, T. S., Young, R. A. et al. (2003). Computational discovery of gene
modules and regulatory networks. Nat Biotechnol 21, 1337-42. Epub 2003 Oct 12.
1-21
1.6 References
Bashton, M. and Chothia, C. (2002). The geometry of domain combination in proteins. J Mol
Biol 315, 927-39.
Bateman, A., Coin, L., Durbin, R., Finn, R. D., Hollich, V., Griffiths-Jones, S., Khanna, A.,
Marshall, M., Moxon, S., Sonnhammer, E. L. et al. (2004). The Pfam protein families
database. Nucleic Acids Res 32, D138-41.
Bourne, P. E., Addess, K. J., Bluhm, W. F., Chen, L., Deshpande, N., Feng, Z., Fleri, W.,
Green, R., Merino-Ott, J. C., Townsend-Merino, W. et al. (2004). The distribution and query
systems of the RCSB Protein Data Bank. Nucleic Acids Res 32, D223-5.
Brenner, S. E., Chothia, C. and Hubbard, T. J. (1998). Assessing sequence comparison
methods with reliable structurally identified distant evolutionary relationships. Proc Natl Acad Sci
U S A 95, 6073-8.
Brenner, S. E., Hubbard, T., Murzin, A. and Chothia, C. (1995). Gene duplications in H.
influenzae. Nature 378, 140.
Buchan, D. W., Rison, S. C., Bray, J. E., Lee, D., Pearl, F., Thornton, J. M. and Orengo, C.
A. (2003). Gene3D: structural assignments for the biologist and bioinformaticist alike. Nucleic
Acids Res 31, 469-73.
Chen, J., Anderson, J. B., DeWeese-Scott, C., Fedorova, N. D., Geer, L. Y., He, S., Hurwitz,
D. I., Jackson, J. D., Jacobs, A. R., Lanczycki, C. J. et al. (2003). MMDB: Entrez's 3Dstructure database. Nucleic Acids Res 31, 474-7.
Chothia, C., Gough, J., Vogel, C. and Teichmann, S. A. (2003). Evolution of the protein
repertoire. Science 300, 1701-3.
Coulson, R. M. and Ouzounis, C. A. (2003). The phylogenetic diversity of eukaryotic
transcription. Nucleic Acids Res 31, 653-60.
Dobrin, R., Beg, Q. K., Barabasi, A. L. and Oltvai, Z. N. (2004). Aggregation of topological
motifs in the Escherichia coli transcriptional regulatory network. BMC Bioinformatics 5, 10.
Eddy, S. R. (1996). Hidden Markov models. Curr Opin Struct Biol 6, 361-5.
Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics 14, 755-63.
Galas, D. J. and Schmitz, A. (1978). DNAse footprinting: a simple method for the detection of
protein-DNA binding specificity. Nucleic Acids Res 5, 3157-70.
Garner, M. M. and Revzin, A. (1981). A gel electrophoresis method for quantifying the binding
of proteins to specific DNA regions: application to components of the Escherichia coli lactose
operon regulatory system. Nucleic Acids Res 9, 3047-60.
Guelzim, N., Bottani, S., Bourgine, P. and Kepes, F. (2002). Topological and causal structure
of the yeast transcriptional regulatory network. Nat Genet 31, 60-3.
Hartwell, L. H., Hopfield, J. J., Leibler, S. and Murray, A. W. (1999). From molecular to
modular cell biology. Nature 402, C47-52.
Heyduk, T. and Heyduk, E. (2002). Molecular beacons for detecting DNA binding proteins. Nat
Biotechnol 20, 171-6.
Holm, L. and Sander, C. (1994). The FSSP database of structurally aligned protein fold
families. Nucleic Acids Res 22, 3600-9.
1-22
1.6 References
Horak, C. E., Luscombe, N. M., Qian, J., Bertone, P., Piccirrillo, S., Gerstein, M. and
Snyder, M. (2002). Complex transcriptional circuitry at the G1/S transition in Saccharomyces
cerevisiae. Genes Dev 16, 3017-33.
Horak, C. E. and Snyder, M. (2002). ChIP-chip: a genomic approach for identifying
transcription factor binding sites. Methods Enzymol 350, 469-83.
Ihmels, J., Friedlander, G., Bergmann, S., Sarig, O., Ziv, Y. and Barkai, N. (2002). Revealing
modular organization in the yeast transcriptional network. Nat Genet 31, 370-7.
Jacob, F. and Monod, J. (1961). Genetic regulatory mechanisms in the synthesis of proteins. J
Mol Biol 3, 318-56.
Karplus, K., Barrett, C. and Hughey, R. (1998). Hidden Markov models for detecting remote
protein homologies. Bioinformatics 14, 846-56.
Kuo, M. H. and Allis, C. D. (1999). In vivo cross-linking and immunoprecipitation for studying
dynamic Protein:DNA associations in a chromatin environment. Methods 19, 425-33.
Lander, E. S. Linton, L. M. Birren, B. Nusbaum, C. Zody, M. C. Baldwin, J. Devon, K.
Dewar, K. Doyle, M. FitzHugh, W. et al. (2001). Initial sequencing and analysis of the human
genome. Nature 409, 860-921.
Ledent, V. and Vervoort, M. (2001). The basic helix-loop-helix protein family: comparative
genomics and phylogenetic analysis. Genome Res 11, 754-70.
Lee, T. I., Rinaldi, N. J., Robert, F., Odom, D. T., Bar-Joseph, Z., Gerber, G. K., Hannett, N.
M., Harbison, C. T., Thompson, C. M., Simon, I. et al. (2002). Transcriptional regulatory
networks in Saccharomyces cerevisiae. Science 298, 799-804.
Leinonen, R., Diez, F. G., Binns, D., Fleischmann, W., Lopez, R. and Apweiler, R. (2004).
UniProt Archive. Bioinformatics 25, 25.
Lespinet, O., Wolf, Y. I., Koonin, E. V. and Aravind, L. (2002). The role of lineage-specific
gene family expansion in the evolution of eukaryotes. Genome Res 12, 1048-59.
Letunic, I., Copley, R. R., Schmidt, S., Ciccarelli, F. D., Doerks, T., Schultz, J., Ponting, C.
P. and Bork, P. (2004). SMART 4.0: towards genomic data integration. Nucleic Acids Res 32,
D142-4.
Lo Conte, L., Ailey, B., Hubbard, T. J., Brenner, S. E., Murzin, A. G. and Chothia, C. (2000).
SCOP: a structural classification of proteins database. Nucleic Acids Res 28, 257-9.
Lo Conte, L., Brenner, S. E., Hubbard, T. J., Chothia, C. and Murzin, A. G. (2002). SCOP
database in 2002: refinements accommodate structural genomics. Nucleic Acids Res 30, 264-7.
Madan Babu, M. and Teichmann, S. A. (2003). Evolution of transcription factors and the gene
regulatory network in Escherichia coli. Nucleic Acids Res 31, 1234-44.
Madan Babu, M., Luscombe, N. M., Aravind, L., Gerstein, M. and Teichmann, S. A. (2004).
Structure and evolution of transcriptional regulatory networks. Curr Opin Struct Biol 14, 283-91.
Madera, M. and Gough, J. (2002). A comparison of profile hidden Markov model procedures
for remote homology detection. Nucleic Acids Res 30, 4321-8.
Madera, M., Vogel, C., Kummerfeld, S. K., Chothia, C. and Gough, J. (2004). The
SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Res 32, D235-9.
1-23
1.6 References
Mangan, S. and Alon, U. (2003). Structure and function of the feed-forward loop network motif.
Proc Natl Acad Sci U S A 100, 11980-5. Epub 2003 Oct 6.
Mangan, S., Zaslaver, A. and Alon, U. (2003). The coherent feedforward loop serves as a
sign-sensitive delay element in transcription networks. J Mol Biol 334, 197-204.
Marchler-Bauer, A., Anderson, J. B., DeWeese-Scott, C., Fedorova, N. D., Geer, L. Y., He,
S., Hurwitz, D. I., Jackson, J. D., Jacobs, A. R., Lanczycki, C. J. et al. (2003). CDD: a
curated Entrez database of conserved domain alignments. Nucleic Acids Res 31, 383-7.
Martinez-Antonio, A. and Collado-Vides, J. (2003). Identifying global regulators in
transcriptional regulatory networks in bacteria. Curr Opin Microbiol 6, 482-9.
Matys, V., Fricke, E., Geffers, R., Gossling, E., Haubrock, M., Hehl, R., Hornischer, K.,
Karas, D., Kel, A. E., Kel-Margoulis, O. V. et al. (2003). TRANSFAC: transcriptional
regulation, from patterns to profiles. Nucleic Acids Res 31, 374-8.
Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D. and Alon, U. (2002).
Network motifs: simple building blocks of complex networks. Science 298, 824-7.
Mulder, N. J., Apweiler, R., Attwood, T. K., Bairoch, A., Barrell, D., Bateman, A., Binns, D.,
Biswas, M., Bradley, P., Bork, P. et al. (2003). The InterPro Database, 2003 brings increased
coverage and new features. Nucleic Acids Res 31, 315-8.
Murzin, A. G., Brenner, S. E., Hubbard, T. and Chothia, C. (1995). SCOP: a structural
classification of proteins database for the investigation of sequences and structures. J Mol Biol
247, 536-40.
Needleman, S. B. and Wunsch, C. D. (1970). A general method applicable to the search for
similarities in the amino acid sequence of two proteins. J Mol Biol 48, 443-53.
Nordhoff, E., Krogsdam, A. M., Jorgensen, H. F., Kallipolitis, B. H., Clark, B. F.,
Roepstorff, P. and Kristiansen, K. (1999). Rapid identification of DNA-binding proteins by
mass spectrometry. Nat Biotechnol 17, 884-8.
Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. and Thornton, J. M.
(1997). CATH--a hierarchic classification of protein domain structures. Structure 5, 1093-108.
Orengo, C. A., Pearl, F. M., Bray, J. E., Todd, A. E., Martin, A. C., Lo Conte, L. and
Thornton, J. M. (1999). The CATH Database provides insights into protein structure/function
relationships. Nucleic Acids Res 27, 275-9.
Park, J., Karplus, K., Barrett, C., Hughey, R., Haussler, D., Hubbard, T. and Chothia, C.
(1998). Sequence comparisons using multiple sequences detect three times as many remote
homologues as pairwise methods. J Mol Biol 284, 1201-10.
Pearson, W. R. and Lipman, D. J. (1988). Improved tools for biological sequence comparison.
Proc Natl Acad Sci U S A 85, 2444-8.
Perez-Rueda, E. and Collado-Vides, J. (2000). The repertoire of DNA-binding transcriptional
regulators in Escherichia coli K-12. Nucleic Acids Res 28, 1838-47.
Peterson, J. D., Umayam, L. A., Dickinson, T., Hickey, E. K. and White, O. (2001). The
Comprehensive Microbial Resource. Nucleic Acids Res 29, 123-5.
Riechmann, J. L., Heard, J., Martin, G., Reuber, L., Jiang, C., Keddie, J., Adam, L., Pineda,
O., Ratcliffe, O. J., Samaha, R. R. et al. (2000). Arabidopsis transcription factors: genomewide comparative analysis among eukaryotes. Science 290, 2105-10.
1-24
1.6 References
Salgado, H., Gama-Castro, S., Martinez-Antonio, A., Diaz-Peredo, E., Sanchez-Solano, F.,
Peralta-Gil, M., Garcia-Alonso, D., Jimenez-Jacinto, V., Santos-Zavaleta, A., BonavidesMartinez, C. et al. (2004). RegulonDB (version 4.0): transcriptional regulation, operon
organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res 32, D303-6.
Schaffer, A. A., Aravind, L., Madden, T. L., Shavirin, S., Spouge, J. L., Wolf, Y. I., Koonin,
E. V. and Altschul, S. F. (2001). Improving the accuracy of PSI-BLAST protein database
searches with composition-based statistics and other refinements. Nucleic Acids Res 29, 29943005.
Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D. and Friedman, N.
(2003). Module networks: identifying regulatory modules and their condition-specific regulators
from gene expression data. Nat Genet 34, 166-76.
Servant, F., Bru, C., Carrere, S., Courcelle, E., Gouzy, J., Peyruc, D. and Kahn, D. (2002).
ProDom: automated clustering of homologous domains. Brief Bioinform 3, 246-51.
Shen-Orr, S. S., Milo, R., Mangan, S. and Alon, U. (2002). Network motifs in the
transcriptional regulation network of Escherichia coli. Nat Genet 31, 64-8.
Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. J
Mol Biol 147, 195-7.
Stebbings, L. A. and Mizuguchi, K. (2004). HOMSTRAD: recent developments of the
Homologous Protein Structure Alignment Database. Nucleic Acids Res 32, D203-7.
Storek, M. J., Ernst, A. and Verdine, G. L. (2002). High-resolution footprinting of sequencespecific protein-DNA contacts. Nat Biotechnol 20, 183-6.
Stuart, J. M., Segal, E., Koller, D. and Kim, S. K. (2003). A gene-coexpression network for
global discovery of conserved genetic modules. Science 302, 249-55. Epub 2003 Aug 21.
Sun, L. V., Chen, L., Greil, F., Negre, N., Li, T. R., Cavalli, G., Zhao, H., Van Steensel, B.
and White, K. P. (2003). Protein-DNA interaction mapping using genomic tiling path
microarrays in Drosophila. Proc Natl Acad Sci U S A 100, 9428-33. Epub 2003 Jul 22.
Tatusov, R. L., Fedorova, N. D., Jackson, J. D., Jacobs, A. R., Kiryutin, B., Koonin, E. V.,
Krylov, D. M., Mazumder, R., Mekhedov, S. L., Nikolskaya, A. N. et al. (2003). The COG
database: an updated version includes eukaryotes. BMC Bioinformatics 4, 41.
Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J. and Church, G. M. (1999).
Systematic determination of genetic network architecture. Nat Genet 22, 281-5.
Teichmann, S. A. and Madan Babu, M. (2002). Conservation of gene co-regulation in
prokaryotes and eukaryotes. Trends Biotechnol 20, 407-10; discussion 410.
Teichmann, S. A., Chothia, C. and Gerstein, M. (1999). Advances in structural genomics.
Curr Opin Struct Biol 9, 390-9.
Teichmann, S. A., Park, J. and Chothia, C. (1998). Structural assignments to the Mycoplasma
genitalium proteins show extensive gene duplications and domain rearrangements. Proc Natl
Acad Sci U S A 95, 14658-63.
van Steensel, B. and Henikoff, S. (2000). Identification of in vivo DNA targets of chromatin
proteins using tethered dam methyltransferase. Nat Biotechnol 18, 424-8.
Vogel, C., Bashton, M., Kerrison, N. D., Chothia, C. and Teichmann, S. A. (2004). Structure,
function and evolution of multidomain proteins. Curr Opin Struct Biol 14, 208-16.
1-25
1.6 References
Wei, G. H., Liu, D. P. and Liang, C. C. (2004). Charting gene regulatory networks: strategies,
challenges and perspectives. Biochem J 381, 1-12.
Xia, Y., Yu, H., Jansen, R., Seringhaus, M., Baxter, S., Greenbaum, D., Zhao, H. and
Gerstein, M. (2004). Analyzing Cellular Biochemistry in Terms of Molecular Networks. Annu
Rev Biochem 73, 1051-1087.
Yu, H., Luscombe, N. M., Qian, J. and Gerstein, M. (2003). Genomic analysis of gene
expression relationships in transcriptional regulatory networks. Trends Genet 19, 422-7.
1-26
Download