Theme: Computationally predicted sigma 70 promoters in Escherichia coli.
I. On the Computational method.
Defining the problem: The COVER program predicts promoters in the regulatory regions of the E. coli genome in two steps: scoring sequences and selecting the best candidates. COVER correctly identifies 86% of true E. coli promoters and generates an average of 4.7 putative promoters per 250 bp region as output.
Of these 4.7 signals, 79% occur in clusters, as a series of overlapping, potentially competing RNA polymerase-binding sites. Although the clusters may be interesting in themselves, clearly not all 4.7 predictions are functional promoters. This document outlines procedures to clean and select the best predictions.
How does COVER score a promoter-like signal?
We use the PATSER program to scan a target sequence with a position-specific weight matrix; the program assigns a score to each sequence segment of the same length as the matrix. The score is calculated according to the theory developed by Jerry Hertz and Gary Stormo [1,2]. The position-specific weight matrix is derived from a multiple sequence alignment in which functionally related sequences are compared with each other and similar sequence motifs are found. The matrix, or consensus matrix, shows which residues are conserved and which are variable, and assigns a weight to each base at each position by taking into account its frequency in the aligned motifs and its frequency in the background. Usually the background is the complete genomic sequence.
The weight assigned to a sequence segment, the score, is calculated in terms of the log-ratio between two probabilities:
$$\mathrm{Score} = \sum_{l=1}^{n} f(b,l)\,\log_2 \frac{f(b,l)}{p(b)}$$
where n is the length of the consensus matrix, f(b,l) is the probability of base b at position l, and p(b) is the probability of base b in the background, i.e. the a priori probability. The log-ratio compares how likely each base is to occur as an instance of the motif with how likely it is to occur by chance. This estimate is then used to modify, amplifying or lowering, the probability of base b at position l.
Because this theory takes into account the frequency of each base at each position, every base in the segment being scored receives a weight that reflects how far it is from the consensus at that position. If the base is the most conserved one at that position, its distance is close to zero. If it is the second most conserved, its distance is slightly greater than zero. If it is infrequent at that position, its distance from the consensus is the largest. In the end, the sequence score is roughly the sum of these distances.
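As a minimal sketch of the scoring idea described above, the following Python code builds a frequency matrix from a few aligned sites and scores sequence segments with log-ratios. The example sites, the pseudocount, the uniform background, and all function names are illustrative assumptions; this is not PATSER's actual implementation. By default each position contributes log2(f(b,l)/p(b)), the common log-odds form; setting weight_by_frequency=True additionally multiplies each term by f(b,l), as in the formula above.

```python
import math
from collections import Counter

BASES = "ACGT"

def frequency_matrix(sites, pseudocount=1.0):
    """Estimate f(b, l) from aligned sites of equal length.

    The pseudocount (an assumption of this sketch) avoids zero
    frequencies, which would make the logarithm undefined.
    """
    length = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    matrix = []
    for l in range(length):
        counts = Counter(site[l] for site in sites)
        matrix.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return matrix

def score_segment(segment, freqs, background, weight_by_frequency=False):
    """Sum the per-position log-ratios log2(f(b, l) / p(b)) for a segment."""
    if len(segment) != len(freqs):
        raise ValueError("segment and matrix must have the same length")
    total = 0.0
    for l, base in enumerate(segment):
        f = freqs[l][base]
        term = math.log2(f / background[base])
        total += f * term if weight_by_frequency else term
    return total

def scan(sequence, freqs, background):
    """Score every segment of the sequence that has the matrix length."""
    n = len(freqs)
    return [
        (i, score_segment(sequence[i:i + n], freqs, background))
        for i in range(len(sequence) - n + 1)
    ]

# Toy data (hypothetical, not the 116 promoters used by COVER).
example_sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
uniform_background = {b: 0.25 for b in BASES}
example_freqs = frequency_matrix(example_sites)

for start, score in scan("GGTATAATGG", example_freqs, uniform_background):
    print(start, round(score, 2))
```

Changing uniform_background to the base composition of the genome being scanned changes every score, which is one way to see how strongly the result depends on the a priori probabilities.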
What are the disadvantages of this scoring system?
It depends on the background model: if the a priori probabilities are not chosen carefully, the results will not be correct. If the molecule recognizes a broad spectrum of binding sites, the matrix will yield many false positives. Lastly, position weight matrices do not take correlations between positions into account; they are based on first-order statistics. Nevertheless, position weight matrices are models that can be very useful for discovering and predicting binding sites in genomic DNA. They have performed well and provide a very good approximation of the true nature of specific protein-DNA interactions [3,4].
Model for DNA sigma70 promoters. The canonical model of the σ70 promoter is a simple pair of hexamers, positioned at -35 and -10 base pairs (bp) from the transcription start (+1), with respective consensus sequences TTGACA and TATAAT, separated by a spacer of 15 to 21 bp. RNAP-σ70 can recognize and bind -35 and -10 motifs that differ substantially from their consensus sequences. On average, E. coli promoters preserve only eight of the 12 canonical bases of the -35 and -10 hexamers.
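To make the canonical model concrete, here is a small sketch that slides over a sequence, tries every spacer from 15 to 21 bp, and counts how many of the 12 canonical bases each candidate pair of hexamers preserves. Plain consensus matching is used only for illustration (COVER itself uses weight matrices), and the function names and return format are assumptions.

```python
MINUS35_CONSENSUS = "TTGACA"
MINUS10_CONSENSUS = "TATAAT"
HEXAMER = 6

def matches(hexamer, consensus):
    """Count positions where the hexamer agrees with the consensus."""
    return sum(a == b for a, b in zip(hexamer, consensus))

def best_pair(sequence, min_spacer=15, max_spacer=21):
    """Return (matched_bases, start_of_minus35, spacer) for the pair of
    hexamers that preserves the most canonical bases, or None if the
    sequence is too short to hold any pair."""
    sequence = sequence.upper()
    best = None
    for i in range(len(sequence) - HEXAMER + 1):
        m35 = matches(sequence[i:i + HEXAMER], MINUS35_CONSENSUS)
        for spacer in range(min_spacer, max_spacer + 1):
            j = i + HEXAMER + spacer          # start of the -10 hexamer
            if j + HEXAMER > len(sequence):
                break
            total = m35 + matches(sequence[j:j + HEXAMER], MINUS10_CONSENSUS)
            if best is None or total > best[0]:
                best = (total, i, spacer)
    return best

# Hypothetical sequence: perfect hexamers separated by a 17 bp spacer.
toy = "GGC" + "TTGACA" + "C" * 17 + "TATAAT" + "GC"
print(best_pair(toy))
```

A real promoter would typically score around eight matched bases rather than twelve, which is why a simple consensus count produces many near-ties and a weighted matrix is preferred.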
However, other elements can modify this sigma70 model: (i) A TG subregion 1 bp upstream of the -10 box, which may render the -35 box dispensable; it appears in 20% of known promoters. (ii) Upstream activator sites, which may substitute for the -35 region. (iii) A UP element located approximately 4 bp upstream of the -35 box; half of this element appears in 3% of the known promoters.
COVER uses two position weight matrices to search for the -10 and -35 boxes, respectively.
The matrices were created from an alignment of 116 known promoters.
We used the matrices to score each of the 116 pairs of -10 and -35 boxes. Of these 116 pairs, 98% had a score above a cutoff of [...]. When scanning the genome, every pair of sequence segments that scored above that cutoff was considered a putative promoter. The final score is the sum of the -10 box and -35 box scores.
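One way such a cutoff could be calibrated is sketched below: given the pair scores of the known promoters, pick the highest threshold that still retains a target fraction of them, then keep every scanned pair whose summed score reaches it. The function names, the use of >= rather than a strict >, the application of the cutoff to the summed score, and the toy numbers are all assumptions; the text does not say how COVER's cutoff was actually derived.

```python
import math

def cutoff_for_sensitivity(known_scores, fraction=0.98):
    """Return the highest cutoff such that at least `fraction` of the
    known promoter scores are greater than or equal to it."""
    if not known_scores:
        raise ValueError("need at least one known score")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(known_scores)
    # Small epsilon guards against floating-point noise in fraction * n.
    must_keep = max(1, math.ceil(fraction * len(ordered) - 1e-9))
    return ordered[len(ordered) - must_keep]

def putative_promoters(candidates, cutoff):
    """Keep candidates whose final score (sum of the -10 and -35 box
    scores) reaches the cutoff.

    `candidates` is an iterable of (position, score_10, score_35).
    """
    return [
        (position, score_10 + score_35)
        for position, score_10, score_35 in candidates
        if score_10 + score_35 >= cutoff
    ]

# Toy scores for 116 hypothetical known promoters.
toy_known = [float(i) for i in range(116)]
toy_cutoff = cutoff_for_sensitivity(toy_known, fraction=0.98)
print(toy_cutoff)
print(putative_promoters([(10, 2.0, 3.0), (50, 0.5, 0.5)], toy_cutoff))
```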
II. Annotation.
We want to assign a score to every prediction and, in fact, to every experimentally characterized promoter as well.
Step 1. Score every known and predicted -10 and -35 box of sigma 70 promoters. This is essentially done.
Step 2. Include additional elements of sigma 70 promoters.
Extend the set of matrices used to search for promoter-like sequences: a -10 matrix (TATAAT), a -35 matrix (TTGACA), a matrix for the extended motif (TGn), and a matrix for the UP element (nnAAA(A/T)(A/T)T(A/T)TTTTnnAAAAnnn).
Score these elements and add them to the description of every sigma 70 promoter.
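Before weight matrices exist for the extra elements, their consensus notation can be checked with a simple exact-match search. The sketch below converts the notation used here (n for any base, (A/T) for alternatives) into a regular expression. This is only an exact-match illustration, not a substitute for the matrices proposed in Step 2, and the function names are hypothetical.

```python
import re

UP_ELEMENT = "nnAAA(A/T)(A/T)T(A/T)TTTTnnAAAAnnn"
EXTENDED_MINUS10 = "TGn"

def motif_to_regex(motif):
    """Compile a motif: 'n' becomes any base, '(A/T)' becomes A or T."""
    pattern = re.sub(
        r"\(([ACGT](?:/[ACGT])+)\)",
        lambda m: "[" + m.group(1).replace("/", "") + "]",
        motif,
    )
    pattern = pattern.replace("n", "[ACGT]")
    return re.compile(pattern)

def find_motif(sequence, motif):
    """Return the start positions of all exact matches, including
    overlapping ones."""
    regex = motif_to_regex(motif)
    sequence = sequence.upper()
    return [i for i in range(len(sequence)) if regex.match(sequence, i)]

print(motif_to_regex(EXTENDED_MINUS10).pattern)
print(find_motif("CCTGATGC", EXTENDED_MINUS10))
```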
III. Do we want a smaller set of higher quality predictions?
Recall that we classified all evidence into two classes: strong and weak.
Computational predictions are weak. Nonetheless, we may want to have a smaller set of better predictions.
To clean the predictions generated by the COVER program, we could follow two steps:
1) We would say that the -10 box is the only pervasive element in the σ70 promoters described so far. The TG-extended motif and positive regulators can render the -35 box dispensable, so we would say that the -35 box, the TG-extended motif, and the positive regulators anchor RNAP-σ70 to the DNA. The UP element has been shown to confer additional strength to the promoter and has only been described in rRNA and tRNA promoters. It has been shown that “bad” -10 boxes can be functional when the TG-extended motif is present.
Methods. a) Analyze whether “good” -35 boxes and the presence of UP elements can compensate for -10 boxes that are far from the consensus. b) Define selection criteria based on the individual scores rather than on their sum:
Raise the cutoff for the -35 and -10 boxes.
Signals with -10 and -35 boxes over the new cutoff are accepted.
Signals with -10 boxes under the new cutoff and a TG motif are accepted.
Signals with -10 boxes under the new cutoff and -35 boxes over the new cutoff are accepted.
Signals with -10 boxes under the new cutoff and a UP element are accepted.
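The acceptance rules above could be encoded roughly as follows. The Signal fields, the function name, and the cutoff values are hypothetical. The sketch also assumes that any combination the list does not mention (for example, a -10 box over the cutoff with a -35 box under it) is rejected, which the rules themselves leave open.

```python
from dataclasses import dataclass

@dataclass
class Signal:
    """A predicted promoter signal (field names are illustrative)."""
    score_10: float
    score_35: float
    has_tg: bool = False
    has_up: bool = False

def accept(signal, cutoff_10, cutoff_35):
    """Apply the four acceptance rules in the order they are listed."""
    good_10 = signal.score_10 > cutoff_10
    good_35 = signal.score_35 > cutoff_35
    if good_10 and good_35:
        return True                    # both boxes over the new cutoff
    if not good_10 and signal.has_tg:
        return True                    # weak -10 box rescued by a TG motif
    if not good_10 and good_35:
        return True                    # weak -10 box rescued by the -35 box
    if not good_10 and signal.has_up:
        return True                    # weak -10 box rescued by a UP element
    return False                       # assumption: unlisted cases rejected

# Hypothetical cutoffs and signals.
for candidate in (Signal(5.0, 5.0), Signal(3.0, 3.0, has_tg=True), Signal(3.0, 3.0)):
    print(candidate, accept(candidate, cutoff_10=4.0, cutoff_35=4.0))
```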
2) Establish cleaning criteria based on the positions that are most important to the promoter, i.e. those under functional constraints.
Some assays have shown that the double-stranded -10 box stabilizes the initial binding of the polymerase and subsequently, as single-stranded DNA, promotes enzyme isomerization to the functional form. Two highly conserved bases of this box (the first TA) are the most important for interaction with RNAP, while all other positions make an accessory contribution. Studies of the -35 box have shown that positions -31, -33 and -34 would be positions of interaction with RNAP-σ70. The TG motif is bound by region 2 of σ70.
But searching the literature to find which nucleotides in the region covered by RNAP-σ70 (-2 to -50) render the promoter useless when mutated would be labor-intensive, and we would need to curate promoter by promoter. We prefer the option described above of raising the cutoffs for every box or element.
References.
1. Hertz, G.Z. and G.D. Stormo, Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 1999. 15(7-8): p. 563-77.
2. Hertz, G.Z., G.W. Hartzell, 3rd, and G.D. Stormo, Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput Appl Biosci, 1990. 6(2): p. 81-92.
3. Man, T.K., J.S. Yang, and G.D. Stormo, Quantitative modeling of DNA-protein interactions: effects of amino acid substitutions on binding specificity of the Mnt repressor. Nucleic Acids Res, 2004. 32(13): p. 4026-32.
4. Benos, P.V., M.L. Bulyk, and G.D. Stormo, Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Res, 2002. 30(20): p. 4442-51.