Mathematical Tools for Regulatory Signals Extraction. Mireille Régnier INRIA 78153 Le Chesnay-FRANCE Mireille.Regnier@inria.fr Phone: 33 1 39 63 54 78 Fax: 33 1 39 63 55 96 Keywords: regulatory signals, statistics, overrepresentation, protein binding sites, markov models Abstract: Statistical techniques provide an efficient way to analyse “in silico” the huge amount of data from large-scale sequencing. The key idea is to search for regulatory signals among exceptional words, e.g. words that are either underrepresented or overrepresented. We provide a few mathematical results to assess the significance of an exceptional word. 1. Motivation In Silico prediction of regulation signals relies on studies that pointed out common motifs on co-expressed genes [van Helden et al., 1998; Brazma et al.1998]. Searching common motifs in sequences of co-expressed genes promoters now is a key to the determination of transcriptional regulation networks. In recent works on yeast genes that were expressed at the same time of a cellular cycle, [Tavazoie et al, 1999; Spellman et al, 1998], the motifs are very specific: they are almost never found for other genes; hence, they are potential binding sites for transcription factors. Many methods have been proposed to search such motifs: word counting, sampling, phylogenetic footprinting…Statistical approaches search for exceptional words. The underlying assumption is as follows: a biological function that is enhanced or avoided is associated to a word, or a set of words that is over-represented, or underrepresented. An experimental validation has been possible for some motifs. For instance, the ChIP-chip method (Iyer et al, 2001) achieves this task by a simultaneous immunoprecipitation of a transcription factor and a hybridization on a chip of the target sequence. We provide below a brief formalization of the main issues addressed in biological applications. 2. Introduction In statistical approaches, one assumes, in a first approximation, that the genomic sequences are randomly generated . In this paper, we consider two probability models. In the Bernoulli model, the character distribution at a given position in a genomic sequence is independent of that position. In the Markovian model, it depends on the previous position, possibly of a few positions). Details and different background probability models are discussed in [Hampson et al., 2002]. The goal is the extraction of exceptional words -either overrepresented or underrepresented-. To achieve this goal, one needs first an algorithm that searches for candidate motifs. A recent survey on algorithms can be found in [Lonardi, 2001;Marsan, 2002]. It has been shown [Blanchette and Tompa, 2001] that the sequences that are recognized by the same transcription factor are (approximately) conserved during the evolution. Hence, it is relevant to look for a signal that is an approximate word. Second, one relies on mathematical tools to assess statistical significance of these candidates. A great interest arose recently for statistics for the number of occurrences of a set of words [van Helden et al., 1998; Hampson et al., 2002; Reinert et al., 2000; Regnier, 2000; Apostolico et al., 2000]. We present and discuss a wide range of tractable formulae and discuss widely used approximations. We state their validity domains and occasionally extend them. Some efficient algorithms that actually compute them are available in the QuickScore library. Different formalizations are possible for various applications. One can study a long text, typically a genome, and look for under or over-represented words. For example, the Chi motif is over-represented in many bacterial genomes. Restriction-modification systems provide an other typical example. It has been observed that the binding site of a restriction enzyme is an avoided word [Rocha et al, 2000]. [Panina et al, 2000] shows that word avoidance in bacterial genomes is correlated with taxonomy. This work is refined in [Vandenbogaert et al,2002]. Consensus words are taken into account: for instance, CYCGRC is a binding site for Aval in Anabaena variabilis. Word avoidance also allows to detect horizontal transfers [Vandenbogaert, 2003]. A search can be performed on a given set of sequences: typically, the upstream sequences of co-regulated or co-expressed genes. These sequences are relatively short, generated independently according to a common distribution, but may have different lengths [Van Helden et al., 1998; Buhler et al., 2001]. One looks for exceptional signals, a word H, or a set of words , that are potential binding sites. One counts the number of sequences where H, or , is actually observed and compares this number with its expected value. This scheme underlies the software RSA-oligonucleotides of van Helden [van Helden et al., 1998] that searches for regulatory signals. One can search for a known site in a set of upstream sequences . in order to point out a coregulation. One can also search for an unknown site. In that case, the most over-represented signal (possibly several top scoring signals) is predicted “in silico” as a potential candidate. A typical example is the characterization of polyadenylation signals in human genes [Beaudoing et al., 2000] from a set of ESTs. This approach is very appealing when microarrays experiments show a co-expression for a set of genes. 3. Large Sequences In this section, we concentrate on the statistical significance of word avoidance or overrepresentation in a large sequence. We assume that this sequence is randomly generated according to a Bernoulli or Markov model. For a given word H, one denotes P(H) its probability under the model. The probability of a set of words is denoted by P( =P(H) with H in . The expectation E(H) (respectively E( )) and the variance Var (H) (respectively Var( )) of a given word H (respectively a given set of words have been extensively studied by various authors [Pevzner et al., 1989; Bender et al., 1993; Regnier et al., 1997; Regnier, 2000; Reinert et al., 2000] and closed formulae have been derived. This allows for an efficient computation of the Z-score: Z O(H) E(H) Var(H) where O(H) (respectively O( )) is the observed number of occurrences of H (respectively words from ) in the sequence under study. A highly positive (respectively negative) Z-score shows that a word is exceptional. Still, this does not provide valuable information on the probability that this exceptional event occurs, the so-called p-value. In Theorem 1, we state a computable formula for the p-value. We discuss the quality of this formula. e.g. its precision and the efficiency of its computation. We illustrate on one example that the p-value is more sensitive than the Z-score. Additionnally, this result will allow to deal with conditional events in Section 5. All results rely on the overlapping structure of the words: Definition 1 [Guibas et al., 1981] Given two words H and F, the correlation set AH,F is the set of suffixes w of F such that F is a suffix of Hw. The correlation polynomial is, in the Bernoulli case: A H,F (z) P(w)z |w| wAH,F where |w| is the length of w. When H=F, one rewrites AH,H(z)=AH(z). Example: (i) Let H be the Chi motif GCTGGTGG. The correlation set is AH,H = {,CTGGTGG} where is the empty word. The correlation polynomial is AH(z)=1+P(CTGGTGG)z7. (ii) Let be the generalized Chi motif GNTGGTGG where N denotes any nucleotide from {A,C,G,T}. The correlation set between two words GNTGGTGG and GXTGGTGG from the generalized Chi motif only depend on the second word. If X=G, the correlation set AGNTGGTGG,GGTGGTGG is {TGGTGG,GTGGTGG}. When X ranges in {A,C,T}, the correlation set reduces to {XTGGTGG}. For any X, the contribution of XTGGTGG is P(XTGGTGG)z7. When TGGTGG is in the correlation set, its contribution is P(TGGTGG) z6. Hence, AGNTGGTGG,GXTGGTGG(z)=1+P(XTGGTGG)z7if X{A,C,T } or 1+P(TGGTGG)z6+P(GTGGTGG)z7 if X=G. Now, our recent combinatorial results on the distribution of word occurrences allow to prove [Denise et al., 2001;Regnier et al., 2003] the expression below for the p-value: Theorem 1 Let H be a given pattern. One denotes: D(z)=(1-z) AH(z) +P(H)zm. (1) Let a be a real number such that 0<a<1 and aP(H). The fundamental equation is D(z)2-(1+(a-1)z)D(z)-az(1-z)D'(z)=0, (2) If a>P(H),let za be the largest real positive root of (2) satisfying 0<za<1. If a<P(H), let za be the smallest real positive solution of (2) that satisfies 1<za. Then: nI(a) a Prob(NH na) 1 e na (3) where I(a)alog( D(za) )log za D(za) za 1 (4) and a and a are rational functions of a and za. Extending the reasoning in [Regnier, 2001], we observe here that Theorem 1 steadily extends to some sets of words: e.g. when the correlation sets AH,F do not depend on the first word H. This is the case for the generalized Chi motif. Example: (i) The Chi motif is overrepresented in many bacterial genomes. The motif GCTGGTGG occurs 499 times in E. coli, where the Bernoulli distribution of the nucleic acids is pA=0.2461913, pC=0.2542308, pG=0.2536579, pT=0.2459200. The p-value computed by Theorem 1 is: 1.417e-227. A computation by induction by the software G-Don [Nuel, 2001] yields a p-value: 1.403e-227. (ii) The generalized Chi motif GNTGGTGG occurs 159 times in H. influenzae. The p-value is .8597547482e-36 under the uniform model. Approximation Quality It turns out [Denise et al., 2001] that expression (3) is very closed to the exact expression of this probability provided in [Regnier et al., 1997], computed by the software Excep [Klaerr-Blanchard et al., 2000]. Still, as the computation reduces to the numerical solution of a polynomial equation, this computation is much faster and numerically very stable. Moreover, the possible range for the length n of the text and the probability P(H) is much larger. For example, the computation of (3) for the Chi motif in E. Coli, where the size of the genome is n=4639221, exceeds the computational limits of Excep. Applications with avoided words in restriction-modification systems for bacteria can be found in [Vandenbogaert et al., 2002]. This can be compared to some common approximations for the p-values, e.g. the p-values computed for the normal law with mean p=P(H) and variance V(H) or for the (compound) Poisson distribution with parameter p=P(H). In these cases, the rate function is well known [Dembo et al., 1992], e.g.: (a P(H)) 2 for the normal law, 2Var(H) (ii) alog a (p a) or the Poisson law. p (i) When a-p is not too large, a local development shows that: za 1- a p . 2AH(1)12mp This yields the following approximation for the rate function : (a p) I(a)alog a (pa)(AH(1)1) p p 2 (2) This proves theoretically that the Poisson approximation (or the compound Poisson approximation) is very tight, especially when the motif is not selfoverlapping (e.g. AH(1)-1=0). This fact was experimentally observed for some ranges of n and p in [Robin et al., 2001, Nicodeme01]. Equation (5) also shows that the normal approximation is tight only for word occurrences na that satisfy a-p <<p. This equation implies that a is close to p. Hence, such words are neither overrepresented nor underrepresented: they are not exceptional! For exceptional words (with a high Z-score), the ratio between the exact probability and the normal approximation is very small, e.g. the relative error is very high. Nevertheless, the difference, used for instance in [Robin et al., 2001] is upper bounded by the maximal value. Hence, we point out the paradox: the smaller the difference, the worst the approximation...As a conclusion, we claim that the normal approximation is always poor. Restriction enzymes For a given genome, the palindromic sites associated to the restriction enzymes of the genome are avoided [Vandenbogaert03]. In Table 1, these avoided words are sorted by decreasing Z-score. Their Z-rank, in column 4, is significantly different from their P-rank (column 6) derived from a sort by P-values. Column 2 yields the number of occurrences in E. Coli. Hexamer Occ Z-score Z-Rank GGCGCC 94 -41.9 1 GCATGC 588 -39.9 2 GCCGGC 293 -37.4 3 CGGCCG 284 -35.1 4 CACGTG 143 -32.3 5 GGGCCC 68 -30.7 6 TTGCAA 1024 -30.2 7 ATGCAT 839 -28.0 8 CCGCGG 657 -26.2 9 GCTAGC 157 -26.1 10 CAATTG 986 -25.7 11 GAGCTC 152 -23.9 12 GTGCAC 575 -23.1 13 TGGCCA 630 -23.0 14 GAATTC 645 -21.9 15 AAGCTT 556 -21.9 16 CTCGAG 177 -21.3 17 CCTAGG 16 -21.2 18 ACATGT 477 -20.4 19 TCTAGA 39 -19.8 20 TTCGAA 800 -19.4 21 CCATGG 612 -19.4 22 CGCGCG 2127 -118.7 23 P-val (M) P-Rank .3385654443e-3 1 .2514096486e-3 2 .2358169663e-3 3 .2050106769e-3 4 .1849237452e-3 5 .1763897841e-3 6 .1280869614e-3 7 9 .1095910354e-3 .980276091e-4 10 .1133125447e-3 8 .904588694e-4 12 .9436368542e-4 11 .753139868e-4 14 .738769289e-4 15 .664817246e-4 19 .671111153e-4 18 .7121997639e-4 17 .8539762880e-4 13 20 .5812779786e-4 .7189624736e-4 16 .498331920e-4 22 .5086498391e-4 21 .3641666775e-4 23 Table 1: Table of avoided palindromic hexanucleotides in E. Coli. The first two binding sites that exchange their ranks are GCTAGC and ATGCAT for enzymes Nhel and AvaIII: (8,10) changes to (9,8). Similarly, CCTAGG (enzyme AvrII) goes from rank 18 up to rank 13, and TCTAGA (enzyme Xbal) goes from rank 20 up to rank 16. For similar Z-scores, the pvalue, in column 5, may reverse such order. This is the case, for example, the three binding sites TCTAGA, TTCGAA and CCATGG for enzymes Xbal, AsuIII and NcoI. It is interesting to notice, that the last hexamer, CGCGCG, keeps its rank, with a sensible difference in p-value (a 0.13e-4 difference in the logarithms represents a 10-26 factor for the p-value), but is not a binding site. As a conclusion, the p-value appears to be more sensitive than the Z-score. 4. Small Sequences In this context, one searches for a signal that occurs in a set of L small sequences more often than expected. Typically, these sequences are upstream regions of (possibly co-regulated) genes. The signal may be a single word H, such as the G-box CACGTG, or a consensus word , such as the cleavage site CCWGG for the restriction enzyme EcoRII. It can also be a structured motif [Marsan et al., 2000] or a dyad [Eskin et al., 2002]: for example, the motif TCGGCGGCTAAAT N20GATTCGGAAGTAAA, where N20 denotes a sequence of 20 unspecified nucleotides, in gene YDR285W in S. Cerevisiae. One proceeds in two steps: (i) compute for a given signal, the probability p that it occurs at least once in the text; (ii) compute PL(k), the probability that k out of L sequences contain that motif at least once. When the p-value PL(k) is very small, one concludes that the extracted signal is relevant. Step (i): the combinatorial results on words [Guibas et al., 1981;Regnier, 2000] allow for an exact computation of p. With an indexation by the size n of the sequence, one proves [Guibas et al., 1981] that, for a single word, the probabilities pn satisfy: (1 p ) z n0 n n P(H) z D(z) n This extends to a set of words [Guibas et al., 1981] and to the Markov model [Regnier et al., 1997]. A formal development of the right hand size yields 1-pn for any n. Practically, when n is small, this is easily computed by a symbolic computation system. But, the larger n is, the trickier the computation, and, even worse, numerical instability may occur. The software Excep [Klaerr-Blanchard et al., 2000] relies on a careful implementation of the Fast Fourier Transform, and allows a computation for middle n, Still, it is stucked for large n by the numerical problems and the computational costs. The computation by induction in [Robin et al., 1999], suffers, even more deeply, of the same drawbacks. The computation for a set of words is even harder. A very common approximation [Hart et al., 2000; Beaudoing et al., 2000; Buhler et al., 2001; Vinogradov et al., 2002] for pn is: pn=1-(1-P( ))n-m+1, where n is the size of the sequence. It has been observed by many people [van Helden et al., 1998] that this approximation is bad when words in overlap, notably when reduces to a single self-overlapping word H. Incidentally, the exact results show that, anyhow, the Poisson approximation is rough for small n. When n is larger, we extend this formula for a set of possibly overlapping words, . The case of a unique self-overlapping word is addressed in [Vandenbogaert03], and a formal proof, including the Markov case, can be found in [Clement et al. 2003]. Theorem 2 Let be a set of patterns and P( ) be the probability of the words in : the probability pn to find at least one word from in a sequence of size n, satisfies, in the Bernoulli case, when nP( ) is small: where C(H)= p1-e-nC( H) (6) (2AH, F(1)1) . For a single pattern H, a tighter P(H) H F approximation holds: C(H) P(H) . AH (1) Hint for the proof: Now classical combinatorial analytics [Flajolet et al., 1995] pe n log provide a good asymptotic expression for pn. For a single word, , where ρ is the smallest real positive root of D(z)=0, where D(z) is the polynom defined in (1). It is easily checked that ρ~ P(H) . The approximation is very AH(1) _ tight. For a set H of q words of length m, one defines the q×q matrix (z) of the 2 q autocorrelation polynomials. Let H be a q×q matrix where all rows are identical m m to the vector (P(H1) , … , P(q) ). One computes approximately Trace(H. (1)-1), where the Trace is the sum of all diagonal coefficients. z z Step (ii) : our main observation here is that PL(k) is the tail distribution of a Bernoulli process. Hence, it is known [Dembo et al., 2002; Waterman, 1995] logPL (k) that n I(a) with a=k/n and: I(a)alog a (1a)log 1a P(H) 1 P(H) (7) This approximation is very tight. It is easier and faster to compute than the binomial formula for PL(k). Moreover, the numerical computation of these small numbers is very stable. It was recently implemented in RSA-tools. Remark: Given a set H, when nC( H ) is large, e.g. when n is large, then p is very close to 1: almost surely, a word from H occurs in the sequence, and no valuable information can be derived from an occurrence. In practice, the length n of the upstream sequences under study is fixed, and we can check a priori if nC(H ) is too large. In that case, if H is noisy, decreasing the noise decreases the size of H, hence nC(H ) and an H -occurrence may then be a significant event. This is a way to avoid the critical phenomena exhibited in [Buhler et al., 2001]. . 5. Conditional Events The overrepresentation -or under representation- of a signal (the Chi motif in bacteria, the Alu sequences,…) modifies the statistical properties of a genomic sequence. It is valuable to eliminate artefacts of a strong signal in order to extract a weaker signal. It is also of interest to cluster similar motifs in a single degenerate signal. Mathematically, this means that one is conditioning by a rare event, e.g. an event with a very small probability. The precise formulae in Theorem 1 allow to prove: Theorem 3 [Regnier et al., 2003)] Let H and F be two patterns. Assume that k occurrences of H are found in the text, with k=na. Then, the expected number of occurrences of F, knowing that H occurs k times is: E(F/NH=k)n(a) where m m [(1 z ) A (z ) P(F) z a ][(1 z ) A (z ) P(H) z a ] (a) z D(z )(D(z ) z 1) a a H, F a a a a F, H a a and P(H) and P(F) are their probabilities of occurrence , AH,F(z) and AF,H(z) are their correlation polynomials. We briefly present below two applications. Similar problems are studied in [Blanchette et al., 2001]. Polyadenylation In [Beaudoing et al., 2000], short (around 50 bp) upstream regions of EST of human genes are searched for polyadenylation signals. The largest Z-score is assigned to AAUAAA, that actually is the dominating signal. The computation of a Z-score using E(F/AAUAAA) as the new expected value for each pattern F drastically reduces the Z-scores of artefacts AUAAAN and NAAUAA and assigns the 2-nd rank in Z-scores to AUUAAA, that actually is the second (weak) signal [Denise et al., 2001]. Arabidopsis Thaliana RSA-tools was “blindly” used on upstream regions of Arabidopsis Thaliana, by M. Lescot. In one test, the two highest Z-scores were assigned to H=ACGTGG and F=CACGTG. The number of occurrences were NH=32 and NF=52. The conditional expectations are: E(H/F)=24 and E(F/H)=20. Our conclusion is that H is better explained by F than F is explained by H. This leads to chose F=CACGTG as the significant signal, although its Z-score is smaller. As a matter of fact, the biological knowledge confirms that CACGTG is the regulatory signal, the so-called G-box. 6. Conclusion It is an interesting challenge to derive the rate function for a set of consensus words, or degenerated words (IUPAC code). The special case of double strand counting is addressed in [Lescot et al., 2003]. One can also extend the approximation (5). An other interesting issue is the derivation of the conditional expectation when a set of words is overrepresented. One expects a simplification of the formulae for consensus words. Finally, it is worth extending (7) for sequences of different lengths. These formulae are implemented in the QuickScore library (http://algo.inria.fr/regnier/QuickScore/). Acknowledgements I am grateful to Magali Lescot (Marseille) for fruitful discussions and for providing data on Arabidopsis Thaliana. This research was partially supported by French-Russian Liapunov Institute and INTAS 99-1476. References [Apostolico et al., 2000] A. Apostolico, M.E. Bock, S. Lonardi, and X. Xu. Efficient detection of unusual words. Journal of Computational Biology, 7(1/2): 71-94, 2000. [Beaudoing et al., 2000]E. Beaudoing, S. Freier, J. Wyatt, J.M. Claverie, and D. Gautheret. Patterns of Variant Polyadenylation Signal Usage in Human Genes. Genome Research., 10: 1001-1010, 2000. [Bender et al., 1993] Edward A. Bender and Fred Kochman. The Distribution of Subwords Counts is Usually Normal. European Journal of Combinatorics, 14: 265-275, 1993. [Blanchette et al., 2001] M. Blanchette and S. Sinha. Separating real motifs from their artifacts. Bioinformatics (ISMB special issue), 817: 30-38, 2001. [Brazma et al., 1998] A. Brazma, I. Jonassen, J. Vilo andE. Ukkonen.Predicting gene regulatory elements in silico on a genomic scale. Genome Res, 11(8): 1202-1215, 1998. [Buhler et al., 2001] J. Buhler and M. Tompa. Finding Motifs Using Random Projections. In RECOMB’01, pages 69-76. ACM-, 2001. Proc.RECOMB’01, Montréal. [Clement et al., 2003] J. Clement and E. Dolley and M. Regnier.Approximate Words.submitted, 2003. [Denise et al., 2001] A. Denise, M. Régnier, and M. Vandenbogaert. Assessing statistical significance of overrepresented oligonucleotides. In WABI’01, pages 85-97. Springer-Verlag, 2001. in Proc. First Intern. Workshop on Algorithms in Bioinformatics, Aarhus, Denmark, August 20001; preliminary version as INRIA research report 4132. [Dembo et al., 1992] A. Dembo and O. Zeitouni. Large Deviations Techniques. Jones and Bartlett, Boston, 1992. [Eskin et al., 2002] E. Eskin and P. Pevzner. Finding Composite Regulatory Patterns in DNA Sequences. Bioinformatics, 1(1): 1-9, 2002. [Flajolet et al., 1995] Ph. Flajolet and R. Sedgewick. Analysis of Algorithms. Addison-Wesley, 1995. [Guibas et al., 1981] L. Guibas and A.M. Odlyzko. String Overlaps, Pattern Matching and Nontransitive Games. Journal of Combinatorial Theory, Series A, 30: 183-208, 1981. [Hampson et al., 2002] S. Hamspon, D. Kibler, and P. Baldi. Distribution patterns of overrepresented k-mers in non-coding yeast dna. Bioinformatics, 18: 513-528, 2002. [Hart et al., 2000] R. Hart, A.K. Royyuru, G. Stolovitzky, and A. Califano. Systematic and Automated Discovery of Patterns in PROSITE Families. In RECOMB’00, pages 147-154. ACM-, 2000. Proceedings RECOMB’00, Tokyo. [Iyer et al., 2001] VR. Iyer, CE Horak, CS Scafe, D. Botstein, M. Snyder and PO Brown.Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF.Nature,409: 533-538, 2001. [Klaerr-Blanchard et al., 2000] Maude Klaerr-Blanchard, Hélène Chiapello, and Eivind Coward. Detecting localized repeats in genomic sequences: A new strategy and its application to B. subtilis and A. thaliana sequences. Comput. Chem., 24(1): 57-70, 2000. [Lescot et al., 2003] M. Lescot and M. Régnier. In Silico prediction of regulatory signals on plant promoter sequences. 2003. Poster at MCCMB’03 and ECCB’03. [Lonardi, 2001] S. Lonardi. Global detectors of unusual words: design, implementation, and applications to pattern discovery in biosequences. Phdthesis, Department of Computer Sciences, Purdue University, 2001. 145 pages. [Marsan, 2002] L. Marsan. Inférence de motifs structurés: algorithmes et outils appliqués à la détection de sites de fixation dans des séquences génomiques. Phdthesis, University of Marnela-Vallée, 2002. [Marsan et al., 2000] L. Marsan and M.F. Sagot. Extracting structured motifs using a suffix treealgorithms and application to promoter consensus identification. In RECOMB’00, pages 210219. ACM-, 2000. Proceedings RECOMB’00, Tokyo. [Nicodeme, 2001] P. Nicodème. Fast Approximate Motif Statistics. Journal of Computational Biology, 8(3): 235-248, 2001. [Nuel, 2001] G. Nuel. Grandes déviations et chaines de Markov pour l’étude des mots exceptionnels dans les séquences biologiques. Phd thesis, Université René Descartes, Paris V, 2001. defended in July,2001. [Panina et al., 2000] E. Panina and A. Mironov and M. Gelfand. Statistical analysis of complete bacterial genomes: Avoidance of palindromes and restriction-modification systems..J. Mol. Biol.,34: 215-221, 2000. [Pevzner et al., 1989] P.A. Pevzner, M. Borodovski, and A. Mironov. Linguistic of Nucleotide sequences: The Significance of Deviations from the Mean: Statistical Characteristics and Prediction of the Frequency of Occurrences of Words. J. Biomol. Struct. Dynam., 6: 10131026, 1989. [Robin et al., 1999] S. Robin and J. J. Daudin. Exact distribution of word occurrences in a random sequence of letters. J. Appl. Prob., 36: 179-193, 1999. [Rocha et al., 2000] E. Rocha and A. Viari and A. Danchin. Oligonucleotides bias in Bacillus subtilis: general trends and taxonomic comparisons. Nucleic Acids Research,26: 2971-2980, 2000. [Régnier, 2000] M. Régnier. A Unified Approach to Word Occurrences Probabilities. Discrete Applied Mathematics, 104(1): 259-280, 2000. Special issue on Computational Biology;preliminary version at RECOMB’98. [Regnier, 2001] M. Régnier. Complexity of Unusual Words Counting. In JOBIM’00, volume 2066 of Lecture Notes in Computer Science, pages 101-117. Springer-Verlag, 2001. Proceedings of JOBIM’00, Montpellier. [Regnier et al., 1997] M. Régnier and W. Szpankowski. On Pattern Frequency Occurrences in a Markovian Sequence. Algorithmica, 22(4): 631-649, 1997. preliminary draft at ISIT’97. [Régnier et al., 2003] M. Régnier and A. Denise. Rare Events on Random Strings, 2003. submitted; http: //algo.inria.fr/regnier/index.html. [Robin et al., 2001] S. Robin and S. Schbath. Numerical Comparison of Several Approximations on the Word Count Distribution in Random Sequences. Journal of Computational Biology, 8(4): 349-359, 2001. [Reinert et al., 2000] G. Reinert, S. Schbath, and M. Waterman. Probabilistic and Statistical Properties of Words: An Overview. Journal of Computational Biology, 7(1): 1-46, 2000. [Vandenbogaert, 2003] M. Vandenbogaert. Recherche de signaux fonctionnels dans les zones de régulation. Thèse de 3ieme cycle, Université de Bordeaux I. [van Helden, 1998] J. van Helden, B. Andre, and J. Collado-Vides. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol., 281: 827-842, 1998. [Vinogradov et al., 2002] D.V. Vinogradov and A.A. Mironov. SITEPROB: Yet Another Algorithm to Find Regulatory Signals in Nucleotide Sequences. In Proceedings of BGRS’02, volume 1, pages 28-30, 2002. [Vandenbogaert et al., 2002] M. Vandenbogaert and Makeev V. Analysis of bacterial RM-systems through genome-scale analysis and related taxonomic issues, 2002. in Proceedings BGRS’02, Novossibirsk. submitted to ISB. [Waterman, 1995] M. Waterman. Introduction to Computational Biology. Chapman and Hall, London, 1995.