Mathematical tools for regulatory signals - Algorithms Project

Mathematical Tools for Regulatory Signals Extraction.
Mireille Régnier
INRIA
78153 Le Chesnay-FRANCE
Mireille.Regnier@inria.fr
Phone: 33 1 39 63 54 78 Fax: 33 1 39 63 55 96
Keywords: regulatory signals, statistics, overrepresentation, protein
binding sites, Markov models
Abstract: Statistical techniques provide an efficient way to analyse “in
silico” the huge amount of data from large-scale sequencing. The key
idea is to search for regulatory signals among exceptional words, i.e.
words that are either underrepresented or overrepresented. We provide a
few mathematical results to assess the significance of an exceptional
word.
1. Motivation
In silico prediction of regulatory signals relies on studies that pointed out
common motifs in co-expressed genes [van Helden et al., 1998; Brazma et
al., 1998]. Searching for common motifs in the promoter sequences of
co-expressed genes is now a key to the determination of transcriptional
regulation networks. In recent work on yeast genes that are expressed at the
same time of the cell cycle [Tavazoie et al., 1999; Spellman et al., 1998], the
motifs are very specific: they are almost never found for other genes; hence,
they are potential binding sites for transcription factors. Many methods have
been proposed to search for such motifs: word counting, sampling, phylogenetic
footprinting... Statistical approaches search for exceptional words. The
underlying assumption is as follows: a biological function that is enhanced or
avoided is associated with a word, or a set of words, that is overrepresented
or underrepresented. An experimental validation has been possible for some
motifs. For instance, the ChIP-chip method [Iyer et al., 2001] achieves this
task by a simultaneous immunoprecipitation of a transcription factor and a
hybridization on a chip of the target sequence.
We provide below a brief formalization of the main issues addressed in
biological applications.
2. Introduction
In statistical approaches, one assumes, as a first approximation, that the genomic
sequences are randomly generated. In this paper, we consider two probability
models. In the Bernoulli model, the character distribution at a given position in a
genomic sequence is independent of that position. In the Markovian model, it
depends on the previous position (possibly on a few previous positions). Details and
different background probability models are discussed in [Hampson et al., 2002].
The goal is the extraction of exceptional words, either overrepresented or
underrepresented. To achieve this goal, one first needs an algorithm that
searches for candidate motifs. Recent surveys on algorithms can be found in
[Lonardi, 2001; Marsan, 2002]. It has been shown [Blanchette and Tompa, 2001]
that the sequences that are recognized by the same transcription factor are
(approximately) conserved during evolution. Hence, it is relevant to look for
a signal that is an approximate word. Second, one relies on mathematical tools
to assess the statistical significance of these candidates. Statistics for the
number of occurrences of a set of words have recently attracted great interest
[van Helden et al., 1998; Hampson et al., 2002; Reinert et al., 2000; Regnier, 2000;
Apostolico et al., 2000]. We present a wide range of tractable
formulae and discuss widely used approximations. We state their validity
domains and occasionally extend them. Some efficient algorithms that actually
compute them are available in the QuickScore library.
Different formalizations are possible for various applications. One can study a
long text, typically a genome, and look for underrepresented or overrepresented
words. For example, the Chi motif is overrepresented in many bacterial genomes.
Restriction-modification systems provide another typical example. It has been
observed that the binding site of a restriction enzyme is an avoided word [Rocha
et al., 2000]. [Panina et al., 2000] shows that word avoidance in bacterial genomes
is correlated with taxonomy. This work is refined in [Vandenbogaert et al., 2002].
Consensus words are taken into account: for instance, CYCGRC is a binding site
for AvaI in Anabaena variabilis. Word avoidance also makes it possible to detect
horizontal transfers [Vandenbogaert, 2003].
A search can be performed on a given set of sequences: typically, the upstream
sequences of co-regulated or co-expressed genes. These sequences are relatively
short, generated independently according to a common distribution, but may
have different lengths [van Helden et al., 1998; Buhler et al., 2001]. One looks
for exceptional signals, a word H or a set of words $\mathcal{H}$, that are potential
binding sites. One counts the number of sequences where H, or $\mathcal{H}$, is actually
observed and compares this number with its expected value. This scheme underlies the
software RSA-oligonucleotides of van Helden [van Helden et al., 1998] that
searches for regulatory signals. One can search for a known site in a set of
upstream sequences in order to point out a coregulation. One can also search for
an unknown site. In that case, the most overrepresented signal (possibly several
top scoring signals) is predicted “in silico” as a potential candidate. A typical
example is the characterization of polyadenylation signals in human genes
[Beaudoing et al., 2000] from a set of ESTs. This approach is very appealing
when microarray experiments show a co-expression for a set of genes.
3. Large Sequences
In this section, we concentrate on the statistical significance of word avoidance
or overrepresentation in a large sequence. We assume that this sequence is
randomly generated according to a Bernoulli or Markov model.
For a given word H, one denotes by P(H) its probability under the model. The
probability of a set of words $\mathcal{H}$ is denoted by
$P(\mathcal{H}) = \sum_{H \in \mathcal{H}} P(H)$. The
expectation E(H) (respectively $E(\mathcal{H})$) and the variance Var(H) (respectively
$\mathrm{Var}(\mathcal{H})$) of the number of occurrences of a given word H
(respectively a given set of words $\mathcal{H}$) have been
extensively studied by various authors [Pevzner et al., 1989; Bender et al., 1993;
Regnier et al., 1997; Regnier, 2000; Reinert et al., 2000] and closed formulae
have been derived. This allows for an efficient computation of the Z-score:
$$Z = \frac{O(H) - E(H)}{\sqrt{\mathrm{Var}(H)}}$$
where O(H) (respectively $O(\mathcal{H})$) is the observed number of occurrences of H
(respectively words from $\mathcal{H}$) in the sequence under study. A highly positive
(respectively negative) Z-score shows that a word is exceptional. Still, this does
not provide valuable information on the probability that this exceptional event
occurs, the so-called p-value. In Theorem 1, we state a computable formula for
the p-value. We discuss the quality of this formula, i.e. its precision and the
efficiency of its computation. We illustrate on one example that the p-value is
more sensitive than the Z-score. Additionally, this result will make it possible
to deal with conditional events in Section 5. All results rely on the overlapping
structure of the words:
Definition 1 [Guibas et al., 1981] Given two words H and F, the correlation set
$A_{H,F}$ is the set of suffixes w of F such that F is a suffix of Hw. The correlation
polynomial is, in the Bernoulli case:
$$A_{H,F}(z) = \sum_{w \in A_{H,F}} P(w) z^{|w|}$$
where |w| is the length of w. When H=F, one writes $A_{H,H}(z) = A_H(z)$.
Example:
(i) Let H be the Chi motif GCTGGTGG. The correlation set is
$A_{H,H} = \{\epsilon, \mathrm{CTGGTGG}\}$, where $\epsilon$ is the empty word. The
correlation polynomial is $A_H(z) = 1 + P(\mathrm{CTGGTGG}) z^7$.
(ii) Let $\mathcal{H}$ be the generalized Chi motif GNTGGTGG, where N denotes any
nucleotide from {A,C,G,T}. The correlation set between two words
GNTGGTGG and GXTGGTGG from the generalized Chi motif only depends on
the second word. If X=G, the correlation set $A_{\mathrm{GNTGGTGG},\mathrm{GGTGGTGG}}$ is
{TGGTGG, GTGGTGG}. When X ranges in {A,C,T}, the correlation set reduces
to {XTGGTGG}. For any X, the contribution of XTGGTGG is $P(\mathrm{XTGGTGG}) z^7$.
When TGGTGG is in the correlation set, its contribution is $P(\mathrm{TGGTGG}) z^6$.
Hence, $A_{\mathrm{GNTGGTGG},\mathrm{GXTGGTGG}}(z) = 1 + P(\mathrm{XTGGTGG}) z^7$ if $X \in \{A,C,T\}$, or
$1 + P(\mathrm{TGGTGG}) z^6 + P(\mathrm{GTGGTGG}) z^7$ if X=G.
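Definition 1 translates directly into a few lines of code. The sketch below is only an illustration under a Bernoulli model; the function names (`correlation_set`, `correlation_polynomial`) and the uniform letter probabilities are assumptions made for the example, not part of any library mentioned in this paper.

```python
from math import prod

def correlation_set(h, f):
    """Words w such that a non-empty suffix of h is a prefix of f and f = overlap + w."""
    words = []
    for k in range(1, min(len(h), len(f)) + 1):  # k = length of the overlap
        if h[-k:] == f[:k]:
            words.append(f[k:])
    return sorted(words, key=len)

def word_probability(w, probs):
    """Bernoulli probability of a word; the empty word has probability 1."""
    return prod(probs[c] for c in w)

def correlation_polynomial(h, f, probs):
    """Coefficients [c_0, c_1, ...] of A_{h,f}(z) = sum over w of P(w) z^|w|."""
    coeffs = [0.0] * len(f)
    for w in correlation_set(h, f):
        coeffs[len(w)] += word_probability(w, probs)
    return coeffs

# Hypothetical uniform model, used only to make the example concrete.
uniform = {c: 0.25 for c in "ACGT"}
chi = "GCTGGTGG"
print(correlation_set(chi, chi))                  # empty word and CTGGTGG
print(correlation_polynomial(chi, chi, uniform))  # constant term and z^7 term
```

For the Chi motif this reproduces item (i) of the example: the only overlaps are the full word (giving the empty word) and the single letter G (giving CTGGTGG).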
Now, our recent combinatorial results on the distribution of word
occurrences make it possible to prove [Denise et al., 2001; Regnier et al., 2003] the
expression below for the p-value:
Theorem 1 Let H be a given pattern of length m, and let $N_H$ be its number of
occurrences in a text of length n. One denotes:
$$D(z) = (1-z) A_H(z) + P(H) z^m. \qquad (1)$$
Let a be a real number such that 0<a<1 and $a \neq P(H)$. The fundamental equation
is
$$D(z)^2 - (1+(a-1)z) D(z) - a z (1-z) D'(z) = 0. \qquad (2)$$
If a>P(H), let $z_a$ be the largest real positive root of (2) satisfying $0<z_a<1$. If
a<P(H), let $z_a$ be the smallest real positive root of (2) that satisfies $1<z_a$.
Then:
$$\mathrm{Prob}(N_H \geq na) \approx \frac{1}{\sigma_a \sqrt{2\pi n}} e^{-nI(a) + \delta_a} \qquad (3)$$
where
$$I(a) = a \log\left(\frac{D(z_a)}{D(z_a) + z_a - 1}\right) + \log z_a \qquad (4)$$
and $\sigma_a$ and $\delta_a$ are rational functions of a and $z_a$.
Extending the reasoning in [Regnier, 2001], we observe here that Theorem 1
readily extends to some sets of words, for instance when the correlation sets
$A_{H,F}$ do not depend on the first word H. This is the case for the generalized
Chi motif.
Example:
(i) The Chi motif is overrepresented in many bacterial genomes. The
motif GCTGGTGG occurs 499 times in E. coli, where the Bernoulli
distribution of the nucleic acids is pA=0.2461913, pC=0.2542308,
pG=0.2536579, pT=0.2459200. The p-value computed by Theorem 1 is
1.417e-227. A computation by induction by the software G-Don [Nuel,
2001] yields a p-value of 1.403e-227.
(ii) The generalized Chi motif GNTGGTGG occurs 159 times in H.
influenzae. The p-value is 0.8597547482e-36 under the uniform model.
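As an illustration of how equations (1), (2) and (4) can be evaluated, here is a minimal sketch for a single word under a Bernoulli model. It is not the QuickScore implementation: the function names, the step-doubling scan away from z = 1 and the bisection are assumptions chosen for simplicity, and the prefactor of (3) is not computed.

```python
import math

def poly_eval(coeffs, z):
    """Horner evaluation of sum_i coeffs[i] * z**i."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * z + c
    return acc

def make_d(word, probs):
    """Build D(z) = (1 - z) A_H(z) + P(H) z^m and its derivative (Bernoulli model)."""
    m = len(word)
    p = math.prod(probs[c] for c in word)
    # Autocorrelation polynomial: an overlap of length k contributes P(w) z^(m-k).
    A = [0.0] * m
    for k in range(1, m + 1):
        if word[-k:] == word[:k]:
            A[m - k] += math.prod(probs[c] for c in word[k:])
    dA = [i * c for i, c in enumerate(A)][1:]
    D = lambda z: (1 - z) * poly_eval(A, z) + p * z ** m
    dD = lambda z: -poly_eval(A, z) + (1 - z) * poly_eval(dA, z) + m * p * z ** (m - 1)
    return D, dD, p

def solve_za(word, probs, a, iterations=200):
    """Root z_a of equation (2): scan away from z = 1 for a sign change, then bisect."""
    D, dD, p = make_d(word, probs)
    g = lambda z: D(z) ** 2 - (1 + (a - 1) * z) * D(z) - a * z * (1 - z) * dD(z)
    direction = -1.0 if a > p else 1.0   # root below 1 if a > P(H), above 1 if a < P(H)
    g_one, near, step = g(1.0), 1.0, 1e-12
    far = 1.0 + direction * step
    while g(far) * g_one > 0:            # double the step until g changes sign
        near, step = far, 2 * step
        far = 1.0 + direction * step
        if far <= 0 or step > 1e6:
            raise ValueError("no sign change found")
    for _ in range(iterations):
        mid = 0.5 * (near + far)
        if g(mid) * g_one > 0:
            near = mid
        else:
            far = mid
    return 0.5 * (near + far)

def rate_function(word, probs, a):
    """I(a) from equation (4)."""
    D, _, _ = make_d(word, probs)
    za = solve_za(word, probs, a)
    return a * math.log(D(za) / (D(za) + za - 1)) + math.log(za)

# Chi motif in E. coli, with the letter frequencies quoted in the example above.
ecoli = {"A": 0.2461913, "C": 0.2542308, "G": 0.2536579, "T": 0.2459200}
n = 4639221
print(n * rate_function("GCTGGTGG", ecoli, 499 / n))  # exponent n*I(a) of (3)
```

A convenient sanity check is a one-letter word: D(z) is then linear, (2) can be solved by hand, and I(a) reduces to the usual binomial rate function $a \log(a/p) + (1-a)\log((1-a)/(1-p))$.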
Approximation Quality
It turns out [Denise et al., 2001] that expression (3) is very close to the exact
expression of this probability provided in [Regnier et al., 1997], computed by the
software Excep [Klaerr-Blanchard et al., 2000]. Moreover, as the computation reduces
to the numerical solution of a polynomial equation, it is much
faster and numerically very stable. The possible range for the length n
of the text and the probability P(H) is also much larger. For example, the
corresponding exact computation for the Chi motif in E. coli, where the size of
the genome is n=4639221, exceeds the computational limits of Excep. Applications with
avoided words in restriction-modification systems for bacteria can be found in
[Vandenbogaert et al., 2002].
This can be compared to some common approximations for the p-values, e.g.
the p-values computed for the normal law with mean p=P(H) and variance Var(H),
or for the (compound) Poisson distribution with parameter p=P(H). In these
cases, the rate function is well known [Dembo et al., 1992]:
(i) $\frac{(a - P(H))^2}{2\,\mathrm{Var}(H)}$ for the normal law,
(ii) $a \log\frac{a}{p} + (p - a)$ for the Poisson law.
When a-p is not too large, a local development shows that:
$$z_a \approx 1 - \frac{a - p}{2A_H(1) - 1 - 2mp}.$$
This yields the following approximation for the rate function:
$$I(a) \approx a \log\frac{a}{p} + (p - a) - (A_H(1) - 1)\frac{(a - p)^2}{p}. \qquad (5)$$
This proves theoretically that the Poisson approximation (or the compound
Poisson approximation) is very tight, especially when the motif is not
self-overlapping (i.e. $A_H(1)-1=0$). This fact was experimentally observed for
some ranges of n and p in [Robin et al., 2001; Nicodeme, 2001]. Equation (5) also
shows that the normal approximation is tight only for numbers of occurrences na that
satisfy a-p << p. This condition implies that a is close to p. Hence, such words are
neither overrepresented nor underrepresented: they are not exceptional! For
exceptional words (with a high Z-score), the ratio between the exact probability
and the normal approximation is very small, i.e. the relative error is very high.
Nevertheless, the difference, used for instance in [Robin et al., 2001], is upper
bounded by the maximal value. Hence, we point out the paradox: the smaller
the difference, the worse the approximation... As a conclusion, we claim that the
normal approximation is always poor.
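The gap between the two approximations can be made concrete on the Chi motif data quoted earlier (499 occurrences in E. coli). The sketch below is an illustration only. It assumes the leading-order Bernoulli variance $n P(H) (2A_H(1) - 1 - (2m-1)P(H))$ from the word-count literature, which this text does not spell out, and it compares the normal and Poisson rate functions (i) and (ii).

```python
import math

probs = {"A": 0.2461913, "C": 0.2542308, "G": 0.2536579, "T": 0.2459200}
n, word, observed = 4639221, "GCTGGTGG", 499

m = len(word)
p = math.prod(probs[c] for c in word)            # P(H) under the Bernoulli model
# A_H(1): sum of P(w) over the autocorrelation set of H.
A1 = sum(math.prod(probs[c] for c in word[k:])
         for k in range(1, m + 1) if word[-k:] == word[:k])

expected = (n - m + 1) * p
# Assumption: leading-order variance of the number of occurrences.
variance = n * p * (2 * A1 - 1 - (2 * m - 1) * p)
z_score = (observed - expected) / math.sqrt(variance)

a = observed / n
rate_normal = (a - p) ** 2 / (2 * variance / n)  # rate function (i), per symbol
rate_poisson = a * math.log(a / p) + (p - a)     # rate function (ii)

print("Z-score:", z_score)
print("n * I_normal :", n * rate_normal)
print("n * I_poisson:", n * rate_poisson)
```

Since the probability behaves like $e^{-nI(a)}$, a factor of two between the two exponents means that the normal approximation misses the p-value by hundreds of orders of magnitude for such an exceptional word.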
Restriction enzymes
For a given genome, the palindromic sites associated with
the restriction enzymes of the genome are avoided [Vandenbogaert, 2003]. In Table
1, these avoided words are sorted by decreasing Z-score. Their Z-rank, in
column 4, is significantly different from their P-rank (column 6) derived from a
sort by p-values. Column 2 gives the number of occurrences in E. coli.
Hexamer   Occ   Z-score   Z-Rank   P-val (M)         P-Rank
GGCGCC      94    -41.9        1   .3385654443e-3         1
GCATGC     588    -39.9        2   .2514096486e-3         2
GCCGGC     293    -37.4        3   .2358169663e-3         3
CGGCCG     284    -35.1        4   .2050106769e-3         4
CACGTG     143    -32.3        5   .1849237452e-3         5
GGGCCC      68    -30.7        6   .1763897841e-3         6
TTGCAA    1024    -30.2        7   .1280869614e-3         7
ATGCAT     839    -28.0        8   .1095910354e-3         9
CCGCGG     657    -26.2        9   .980276091e-4         10
GCTAGC     157    -26.1       10   .1133125447e-3         8
CAATTG     986    -25.7       11   .904588694e-4         12
GAGCTC     152    -23.9       12   .9436368542e-4        11
GTGCAC     575    -23.1       13   .753139868e-4         14
TGGCCA     630    -23.0       14   .738769289e-4         15
GAATTC     645    -21.9       15   .664817246e-4         19
AAGCTT     556    -21.9       16   .671111153e-4         18
CTCGAG     177    -21.3       17   .7121997639e-4        17
CCTAGG      16    -21.2       18   .8539762880e-4        13
ACATGT     477    -20.4       19   .5812779786e-4        20
TCTAGA      39    -19.8       20   .7189624736e-4        16
TTCGAA     800    -19.4       21   .498331920e-4         22
CCATGG     612    -19.4       22   .5086498391e-4        21
CGCGCG    2127   -118.7       23   .3641666775e-4        23
Table 1: Avoided palindromic hexanucleotides in E. coli.
The first two binding sites that exchange their ranks are ATGCAT (enzyme
AvaIII) and GCTAGC (enzyme NheI): their Z-ranks (8, 10) change to P-ranks
(9, 8). Similarly, CCTAGG (enzyme AvrII) goes from rank 18 up to rank 13, and
TCTAGA (enzyme XbaI) goes from rank 20 up to rank 16. For similar Z-scores, the
p-value, in column 5, may reverse the order. This is the case, for example, for the
three binding sites TCTAGA, TTCGAA and CCATGG for enzymes XbaI, AsuIII
and NcoI. It is interesting to notice that the last hexamer, CGCGCG, keeps its
rank, with a noticeable difference in p-value (a 0.13e-4 difference in the logarithms
represents a $10^{-26}$ factor for the p-value), but is not a binding site. As a
conclusion, the p-value appears to be more sensitive than the Z-score.
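The rank changes discussed above can be recomputed from the two columns of Table 1. The sketch below simply re-sorts the table; it assumes that the P-rank follows the "P-val (M)" column in decreasing order, which is what the printed ranks suggest, and the variable names are illustrative.

```python
# (hexamer, value of column "P-val (M)"), listed in the Z-rank order of Table 1.
table = [
    ("GGCGCC", 0.3385654443e-3), ("GCATGC", 0.2514096486e-3),
    ("GCCGGC", 0.2358169663e-3), ("CGGCCG", 0.2050106769e-3),
    ("CACGTG", 0.1849237452e-3), ("GGGCCC", 0.1763897841e-3),
    ("TTGCAA", 0.1280869614e-3), ("ATGCAT", 0.1095910354e-3),
    ("CCGCGG", 0.980276091e-4),  ("GCTAGC", 0.1133125447e-3),
    ("CAATTG", 0.904588694e-4),  ("GAGCTC", 0.9436368542e-4),
    ("GTGCAC", 0.753139868e-4),  ("TGGCCA", 0.738769289e-4),
    ("GAATTC", 0.664817246e-4),  ("AAGCTT", 0.671111153e-4),
    ("CTCGAG", 0.7121997639e-4), ("CCTAGG", 0.8539762880e-4),
    ("ACATGT", 0.5812779786e-4), ("TCTAGA", 0.7189624736e-4),
    ("TTCGAA", 0.498331920e-4),  ("CCATGG", 0.5086498391e-4),
    ("CGCGCG", 0.3641666775e-4),
]

z_rank = {hexamer: i + 1 for i, (hexamer, _) in enumerate(table)}
by_p = sorted(table, key=lambda row: row[1], reverse=True)
p_rank = {hexamer: i + 1 for i, (hexamer, _) in enumerate(by_p)}

# Hexamers whose rank differs between the two orderings.
moved = {h: (z_rank[h], p_rank[h]) for h in z_rank if z_rank[h] != p_rank[h]}
for hexamer, (zr, pr) in moved.items():
    print(f"{hexamer}: Z-rank {zr} -> P-rank {pr}")
```

The first seven hexamers and CGCGCG keep their rank; every change occurs among words whose Z-scores are close to each other.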
4. Small Sequences
In this context, one searches for a signal that occurs in a set of L small sequences
more often than expected. Typically, these sequences are upstream regions of
(possibly co-regulated) genes. The signal may be a single word H, such as the
G-box CACGTG, or a consensus word $\mathcal{H}$, such as the cleavage site CCWGG for
the restriction enzyme EcoRII. It can also be a structured motif [Marsan et al.,
2000] or a dyad [Eskin et al., 2002]: for example, the motif TCGGCGGCTAAAT
$N_{20}$ GATTCGGAAGTAAA, where $N_{20}$ denotes a sequence of 20 unspecified
nucleotides, in gene YDR285W in S. cerevisiae. One proceeds in two steps:
(i) compute, for a given signal, the probability p that it occurs at least once
in the text;
(ii) compute $P_L(k)$, the probability that at least k out of L sequences contain
that motif at least once.
When the p-value $P_L(k)$ is very small, one concludes that the extracted signal is
relevant.
Step (i): the combinatorial results on words [Guibas et al., 1981; Regnier, 2000]
allow for an exact computation of p. Indexing by the size n of the
sequence, one proves [Guibas et al., 1981] that, for a single word, the
probabilities $p_n$ satisfy:
$$\sum_{n \geq 0} (1 - p_n) z^n = \frac{A_H(z)}{D(z)}$$
This extends to a set of words [Guibas et al., 1981] and to the Markov model
[Regnier et al., 1997]. A formal expansion of the right-hand side yields $1-p_n$
for any n. In practice, when n is small, this is easily computed by a symbolic
computation system. But the larger n is, the trickier the computation and, even
worse, numerical instability may occur. The software Excep [Klaerr-Blanchard
et al., 2000] relies on a careful implementation of the Fast Fourier Transform,
and allows a computation for moderate n. Still, it breaks down for large n because
of numerical problems and computational costs. The computation by induction
in [Robin et al., 1999] suffers, even more severely, from the same drawbacks. The
computation for a set of words is even harder.
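For small n, the formal expansion can be carried out with a plain linear recurrence. The sketch below expands the classical Guibas-Odlyzko series $A_H(z)/D(z)$, whose coefficient of $z^n$ is $1-p_n$, for a single word under a Bernoulli model, and checks it against brute-force enumeration. It is an illustration with hypothetical function names, not the algorithm used by Excep.

```python
import math
from itertools import product

def no_occurrence_series(word, probs, n_max):
    """Coefficients c_n = 1 - p_n of A_H(z)/D(z), for n = 0..n_max."""
    m = len(word)
    # Autocorrelation polynomial A_H(z), padded to degree m.
    A = [0.0] * (m + 1)
    for k in range(1, m + 1):
        if word[-k:] == word[:k]:
            A[m - k] += math.prod(probs[c] for c in word[k:])
    # D(z) = (1 - z) A_H(z) + P(H) z^m, so D[0] = 1.
    D = [A[i] - (A[i - 1] if i > 0 else 0.0) for i in range(m + 1)]
    D[m] += math.prod(probs[c] for c in word)
    c = []
    for n in range(n_max + 1):
        a_n = A[n] if n <= m else 0.0
        s = sum(D[j] * c[n - j] for j in range(1, min(n, m) + 1))
        c.append(a_n - s)
    return c

def brute_force_p(word, probs, n):
    """p_n by enumerating all texts of length n (only feasible for tiny n)."""
    total = 0.0
    for letters in product(probs, repeat=n):
        text = "".join(letters)
        if word in text:
            total += math.prod(probs[ch] for ch in text)
    return total

# Hypothetical letter frequencies, chosen only for the check.
probs = {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3}
series = no_occurrence_series("ACA", probs, 7)
print(1 - series[7], brute_force_p("ACA", probs, 7))
```

The recurrence costs O(nm) operations, but each coefficient is obtained through cancelling sums, which is one way to see why such direct expansions become delicate when n is large.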
A very common approximation [Hart et al., 2000; Beaudoing et al., 2000;
Buhler et al., 2001; Vinogradov et al., 2002] for $p_n$ is:
$$p_n = 1 - (1 - P(\mathcal{H}))^{n-m+1},$$
where n is the size of the sequence. It has been observed by many people [van
Helden et al., 1998] that this approximation is bad when words in $\mathcal{H}$
overlap, notably when $\mathcal{H}$ reduces to a single self-overlapping word H.
Incidentally, the exact results show that, anyhow, the Poisson approximation is
rough for small n. When n is larger, we extend this formula to a set of possibly
overlapping words $\mathcal{H}$. The case of a unique self-overlapping word is
addressed in [Vandenbogaert, 2003], and a formal proof, including the Markov case,
can be found in [Clement et al., 2003].
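Theorem 2 Let $\mathcal{H}$ be a set of patterns and $P(\mathcal{H})$ be the
probability of the words in $\mathcal{H}$. The probability $p_n$ to find at least one
word from $\mathcal{H}$ in a sequence of size n satisfies, in the Bernoulli case,
when $nP(\mathcal{H})$ is small:
$$p_n \approx 1 - e^{-nC(\mathcal{H})} \qquad (6)$$
where the constant $C(\mathcal{H})$ is a sum over the words H and F of $\mathcal{H}$
that combines the probabilities P(H) with the correlation values $A_{H,F}(1)$.
For a single pattern H, a tighter approximation holds:
$$C(H) = \frac{P(H)}{A_H(1)}.$$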
Hint for the proof: By now classical analytic combinatorics [Flajolet et al., 1995]
provides a good asymptotic expression for $p_n$. For a single word,
$p_n \approx 1 - e^{-n \log \rho}$,
where ρ is the smallest real positive root of D(z)=0, and D(z) is the polynomial
defined in (1). It is easily checked that $\rho \approx 1 + \frac{P(H)}{A_H(1)}$.
The approximation is very tight. For a set $\mathcal{H}$ of q words of length m, one
defines the q×q matrix $\mathbb{A}(z)$ of the $q^2$ autocorrelation polynomials. Let
$\mathbb{H}$ be a q×q matrix where all rows are identical to the vector
$(P(H_1) z^m, \ldots, P(H_q) z^m)$. One computes approximately
$\mathrm{Trace}(\mathbb{H} \cdot \mathbb{A}(1)^{-1})$, where the Trace is the sum of
all diagonal coefficients.
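To see how much the overlap correction matters, the sketch below compares, for a single self-overlapping word, the common approximation $1-(1-P(H))^{n-m+1}$, the single-pattern form of Theorem 2 with $C(H) = P(H)/A_H(1)$, and an exact value obtained by expanding the Guibas-Odlyzko series $A_H(z)/D(z)$. The word AAAA, the uniform model and the function names are assumptions made for the illustration.

```python
import math

def autocorrelation(word, probs):
    """Coefficients of A_H(z), padded to degree m (Bernoulli model)."""
    m = len(word)
    A = [0.0] * (m + 1)
    for k in range(1, m + 1):
        if word[-k:] == word[:k]:
            A[m - k] += math.prod(probs[c] for c in word[k:])
    return A

def exact_p(word, probs, n):
    """p_n from the power series of A_H(z)/D(z), whose coefficients are 1 - p_n."""
    m = len(word)
    A = autocorrelation(word, probs)
    D = [A[i] - (A[i - 1] if i else 0.0) for i in range(m + 1)]
    D[m] += math.prod(probs[c] for c in word)
    c = []
    for i in range(n + 1):
        s = sum(D[j] * c[i - j] for j in range(1, min(i, m) + 1))
        c.append((A[i] if i <= m else 0.0) - s)
    return 1.0 - c[n]

def naive_p(word, probs, n):
    """Common approximation that ignores overlaps."""
    p = math.prod(probs[c] for c in word)
    return 1.0 - (1.0 - p) ** (n - len(word) + 1)

def theorem2_p(word, probs, n):
    """Single-pattern approximation 1 - exp(-n P(H) / A_H(1))."""
    p = math.prod(probs[c] for c in word)
    return 1.0 - math.exp(-n * p / sum(autocorrelation(word, probs)))

uniform = {c: 0.25 for c in "ACGT"}
for name, f in [("exact", exact_p), ("naive", naive_p), ("theorem 2", theorem2_p)]:
    print(name, f("AAAA", uniform, 100))
```

For a word without self-overlap, $A_H(1)=1$ and the two approximations nearly coincide; the difference appears only for self-overlapping words such as AAAA.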
Step (ii): our main observation here is that $P_L(k)$ is the tail distribution of a
Bernoulli process. Hence, it is known [Dembo et al., 1992; Waterman, 1995]
that $\frac{\log P_L(k)}{L} \approx -I(a)$ with a=k/L and:
$$I(a) = a \log\frac{a}{p} + (1-a) \log\frac{1-a}{1-p} \qquad (7)$$
where p is the probability computed in step (i).
This approximation is very tight. It is easier and faster to compute than the
binomial formula for $P_L(k)$. Moreover, the numerical computation of these small
numbers is very stable. It was recently implemented in RSA-tools.
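The following sketch compares the rate-function estimate of step (ii) with the exact binomial tail, computed in log space to avoid underflow. The names `log_binomial_tail` and `rate_function`, and the values L=100, p=0.1, k=30, are hypothetical and chosen only for the illustration; p stands for the per-sequence occurrence probability of step (i).

```python
import math

def log_binomial_tail(L, p, k):
    """log P(X >= k) for X ~ Binomial(L, p), summed with the log-sum-exp trick."""
    terms = [
        math.lgamma(L + 1) - math.lgamma(j + 1) - math.lgamma(L - j + 1)
        + j * math.log(p) + (L - j) * math.log1p(-p)
        for j in range(k, L + 1)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def rate_function(a, p):
    """Binomial rate function a log(a/p) + (1-a) log((1-a)/(1-p)), for 0 < a < 1."""
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

L, p, k = 100, 0.1, 30
exact = log_binomial_tail(L, p, k)
estimate = -L * rate_function(k / L, p)
print("exact log tail      :", exact)
print("rate-function value :", estimate)
```

For k/L > p the quantity $e^{-L I(k/L)}$ is an upper bound on the tail (the Chernoff bound), so the estimate is always on the conservative side; the two logarithms differ by a term that grows only logarithmically with L.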
Remark: Given a set $\mathcal{H}$, when $nC(\mathcal{H})$ is large, for instance when
n is large, p is very close to 1: almost surely, a word from $\mathcal{H}$ occurs in
the sequence, and no valuable information can be derived from an occurrence. In
practice, the length n of the upstream sequences under study is fixed, and we can
check a priori whether $nC(\mathcal{H})$ is too large. In that case, if $\mathcal{H}$
is noisy, decreasing the noise decreases the size of $\mathcal{H}$, hence
$nC(\mathcal{H})$, and an $\mathcal{H}$-occurrence may then be a significant event.
This is a way to avoid the critical phenomena exhibited in [Buhler et al., 2001].
5. Conditional Events
The overrepresentation, or underrepresentation, of a signal (the Chi motif in
bacteria, the Alu sequences, ...) modifies the statistical properties of a genomic
sequence. It is valuable to eliminate artefacts of a strong signal in order to
extract a weaker signal. It is also of interest to cluster similar motifs in a single
degenerate signal. Mathematically, this means that one is conditioning on a rare
event, i.e. an event with a very small probability. The precise formulae in
Theorem 1 make it possible to prove:
Theorem 3 [Regnier et al., 2003] Let H and F be two patterns. Assume that k
occurrences of H are found in the text, with k=na. Then, the expected number of
occurrences of F, knowing that H occurs k times, is:
E(F/NH=k)n(a)
where
m
m
[(1 z ) A (z ) P(F) z a ][(1 z ) A (z ) P(H) z a ]
(a) z
D(z )(D(z ) z 1)
a
a
H, F
a
a
a
a
F, H
a
a
and P(H) and P(F) are the probabilities of occurrence of H and F, and
$A_{H,F}(z)$ and $A_{F,H}(z)$ are their correlation polynomials.
We briefly present below two applications. Similar problems are studied in
[Blanchette et al., 2001].
Polyadenylation
In [Beaudoing et al., 2000], short (around 50 bp) upstream regions of ESTs of
human genes are searched for polyadenylation signals. The largest Z-score is
assigned to AAUAAA, which actually is the dominating signal. The computation
of a Z-score using E(F | AAUAAA) as the new expected value for each pattern F
drastically reduces the Z-scores of the artefacts AUAAAN and NAAUAA and
assigns the second rank in Z-scores to AUUAAA, which actually is the second
(weak) signal [Denise et al., 2001].
Arabidopsis thaliana
RSA-tools was “blindly” used on upstream regions of Arabidopsis thaliana, by
M. Lescot. In one test, the two highest Z-scores were assigned to H=ACGTGG
and F=CACGTG. The numbers of occurrences were $N_H$=32 and $N_F$=52. The
conditional expectations are E(H | F)=24 and E(F | H)=20. Our conclusion is that
H is better explained by F than F is explained by H. This leads us to choose
F=CACGTG as the significant signal, although its Z-score is smaller. As a
matter of fact, biological knowledge confirms that CACGTG is the regulatory
signal, the so-called G-box.
6. Conclusion
It is an interesting challenge to derive the rate function for a set of consensus
words, or degenerate words (IUPAC code). The special case of double strand
counting is addressed in [Lescot et al., 2003]. One can also extend the
approximation (5). Another interesting issue is the derivation of the conditional
expectation when a set of words is overrepresented. One expects a
simplification of the formulae for consensus words. Finally, it is worth
extending (7) to sequences of different lengths. These formulae are
implemented in the QuickScore library (http://algo.inria.fr/regnier/QuickScore/).
Acknowledgements
I am grateful to Magali Lescot (Marseille) for fruitful discussions and for
providing data on Arabidopsis thaliana. This research was partially supported
by the French-Russian Liapunov Institute and INTAS 99-1476.
References
[Apostolico et al., 2000] A. Apostolico, M.E. Bock, S. Lonardi, and X. Xu. Efficient detection of
unusual words. Journal of Computational Biology, 7(1/2): 71-94, 2000.
[Beaudoing et al., 2000] E. Beaudoing, S. Freier, J. Wyatt, J.M. Claverie, and D. Gautheret.
Patterns of Variant Polyadenylation Signal Usage in Human Genes. Genome Research, 10:
1001-1010, 2000.
[Bender et al., 1993] Edward A. Bender and Fred Kochman. The Distribution of Subwords Counts
is Usually Normal. European Journal of Combinatorics, 14: 265-275, 1993.
[Blanchette et al., 2001] M. Blanchette and S. Sinha. Separating real motifs from their artifacts.
Bioinformatics (ISMB special issue), 817: 30-38, 2001.
[Brazma et al., 1998] A. Brazma, I. Jonassen, J. Vilo and E. Ukkonen. Predicting gene regulatory
elements in silico on a genomic scale. Genome Res., 11(8): 1202-1215, 1998.
[Buhler et al., 2001] J. Buhler and M. Tompa. Finding Motifs Using Random Projections. In
RECOMB’01, pages 69-76. ACM, 2001. Proceedings of RECOMB’01, Montréal.
[Clement et al., 2003] J. Clement, E. Dolley, and M. Regnier. Approximate Words. Submitted,
2003.
[Denise et al., 2001] A. Denise, M. Régnier, and M. Vandenbogaert. Assessing statistical
significance of overrepresented oligonucleotides. In WABI’01, pages 85-97. Springer-Verlag,
2001. In Proc. First Intern. Workshop on Algorithms in Bioinformatics, Aarhus, Denmark,
August 2001; preliminary version as INRIA research report 4132.
[Dembo et al., 1992] A. Dembo and O. Zeitouni. Large Deviations Techniques. Jones and Bartlett,
Boston, 1992.
[Eskin et al., 2002] E. Eskin and P. Pevzner. Finding Composite Regulatory Patterns in DNA
Sequences. Bioinformatics, 1(1): 1-9, 2002.
[Flajolet et al., 1995] Ph. Flajolet and R. Sedgewick. Analysis of Algorithms. Addison-Wesley,
1995.
[Guibas et al., 1981] L. Guibas and A.M. Odlyzko. String Overlaps, Pattern Matching and
Nontransitive Games. Journal of Combinatorial Theory, Series A, 30: 183-208, 1981.
[Hampson et al., 2002] S. Hampson, D. Kibler, and P. Baldi. Distribution patterns of
overrepresented k-mers in non-coding yeast DNA. Bioinformatics, 18: 513-528, 2002.
[Hart et al., 2000] R. Hart, A.K. Royyuru, G. Stolovitzky, and A. Califano. Systematic and
Automated Discovery of Patterns in PROSITE Families. In RECOMB’00, pages 147-154.
ACM, 2000. Proceedings of RECOMB’00, Tokyo.
[Iyer et al., 2001] V.R. Iyer, C.E. Horak, C.S. Scafe, D. Botstein, M. Snyder and P.O. Brown. Genomic
binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature, 409: 533-538,
2001.
[Klaerr-Blanchard et al., 2000] Maude Klaerr-Blanchard, Hélène Chiapello, and Eivind Coward.
Detecting localized repeats in genomic sequences: A new strategy and its application to B.
subtilis and A. thaliana sequences. Comput. Chem., 24(1): 57-70, 2000.
[Lescot et al., 2003] M. Lescot and M. Régnier. In Silico prediction of regulatory signals on plant
promoter sequences. 2003. Poster at MCCMB’03 and ECCB’03.
[Lonardi, 2001] S. Lonardi. Global detectors of unusual words: design, implementation, and
applications to pattern discovery in biosequences. PhD thesis, Department of Computer
Sciences, Purdue University, 2001. 145 pages.
[Marsan, 2002] L. Marsan. Inférence de motifs structurés: algorithmes et outils appliqués à la
détection de sites de fixation dans des séquences génomiques. PhD thesis, University of
Marne-la-Vallée, 2002.
[Marsan et al., 2000] L. Marsan and M.F. Sagot. Extracting structured motifs using a suffix tree -
algorithms and application to promoter consensus identification. In RECOMB’00, pages
210-219. ACM, 2000. Proceedings of RECOMB’00, Tokyo.
[Nicodeme, 2001] P. Nicodème. Fast Approximate Motif Statistics. Journal of Computational
Biology, 8(3): 235-248, 2001.
[Nuel, 2001] G. Nuel. Grandes déviations et chaînes de Markov pour l’étude des mots
exceptionnels dans les séquences biologiques. PhD thesis, Université René Descartes, Paris V,
2001. Defended in July 2001.
[Panina et al., 2000] E. Panina, A. Mironov, and M. Gelfand. Statistical analysis of complete
bacterial genomes: Avoidance of palindromes and restriction-modification systems. J. Mol.
Biol., 34: 215-221, 2000.
[Pevzner et al., 1989] P.A. Pevzner, M. Borodovski, and A. Mironov. Linguistic of Nucleotide
sequences: The Significance of Deviations from the Mean: Statistical Characteristics and
Prediction of the Frequency of Occurrences of Words. J. Biomol. Struct. Dynam., 6:
1013-1026, 1989.
[Robin et al., 1999] S. Robin and J. J. Daudin. Exact distribution of word occurrences in a random
sequence of letters. J. Appl. Prob., 36: 179-193, 1999.
[Rocha et al., 2000] E. Rocha, A. Viari, and A. Danchin. Oligonucleotides bias in Bacillus
subtilis: general trends and taxonomic comparisons. Nucleic Acids Research, 26: 2971-2980,
2000.
[Régnier, 2000] M. Régnier. A Unified Approach to Word Occurrences Probabilities. Discrete
Applied Mathematics, 104(1): 259-280, 2000. Special issue on Computational
Biology; preliminary version at RECOMB’98.
[Regnier, 2001] M. Régnier. Complexity of Unusual Words Counting. In JOBIM’00, volume 2066
of Lecture Notes in Computer Science, pages 101-117. Springer-Verlag, 2001. Proceedings of
JOBIM’00, Montpellier.
[Regnier et al., 1997] M. Régnier and W. Szpankowski. On Pattern Frequency Occurrences in a
Markovian Sequence. Algorithmica, 22(4): 631-649, 1997. preliminary draft at ISIT’97.
[Régnier et al., 2003] M. Régnier and A. Denise. Rare Events on Random Strings, 2003.
submitted; http: //algo.inria.fr/regnier/index.html.
[Robin et al., 2001] S. Robin and S. Schbath. Numerical Comparison of Several Approximations
on the Word Count Distribution in Random Sequences. Journal of Computational Biology,
8(4): 349-359, 2001.
[Reinert et al., 2000] G. Reinert, S. Schbath, and M. Waterman. Probabilistic and Statistical
Properties of Words: An Overview. Journal of Computational Biology, 7(1): 1-46, 2000.
[Vandenbogaert, 2003] M. Vandenbogaert. Recherche de signaux fonctionnels dans les zones de
régulation. Thèse de 3ème cycle, Université de Bordeaux I, 2003.
[van Helden et al., 1998] J. van Helden, B. Andre, and J. Collado-Vides. Extracting regulatory sites from
the upstream region of yeast genes by computational analysis of oligonucleotide frequencies.
J. Mol. Biol., 281: 827-842, 1998.
[Vinogradov et al., 2002] D.V. Vinogradov and A.A. Mironov. SITEPROB: Yet Another
Algorithm to Find Regulatory Signals in Nucleotide Sequences. In Proceedings of BGRS’02,
volume 1, pages 28-30, 2002.
[Vandenbogaert et al., 2002] M. Vandenbogaert and V. Makeev. Analysis of bacterial RM-systems
through genome-scale analysis and related taxonomic issues, 2002. In Proceedings of BGRS’02,
Novosibirsk. Submitted to ISB.
[Waterman, 1995] M. Waterman. Introduction to Computational Biology. Chapman and Hall,
London, 1995.