Chapter 1

Transcription Factor Binding Sites Prediction based on Sequence Similarity
Jeong Seop Sim
simjs@etri.re.kr
Myung Eun Lim
melim@etri.re.kr
Myung Geun Chung
aobo@etri.re.kr
Soo-Jun Park
psj@etri.re.kr
Sun Hee Park
shp@etri.re.kr
Electronics and Telecommunications Research Institute,
Daejeon, Korea
Abstract
The availability of the whole human genome sequence, produced by the Human
Genome Project, makes it possible to study gene function much more
efficiently. We can find or predict the functions and positions of genes by
analyzing transcriptional regulation.
In this paper, we propose an algorithm for predicting transcription factor
(TF) binding sites from a given set of upstream-region sequences of genes,
using a suffix array and a local alignment algorithm. Our algorithm is based
on the following two assumptions: i) there tends to be a common TF that binds
to each of the upstream regions of functionally related genes, and ii) a TF
binds to similar DNA sequences.
Keywords
Transcription factor, transcription factor binding sites, suffix array, local
alignments
Chapters’
title
2
1. Introduction
Suffix trees and suffix arrays are important index data structures in
diverse applications in string processing and computational biology. Despite
the simplicity of suffix arrays, suffix trees have been the more fundamental
index data structure in the literature because suffix arrays were inferior to
suffix trees in two respects: construction time and search time. Recently,
however, there has been vigorous work on suffix arrays, and suffix arrays have
been shown to be as powerful as suffix trees [Kärkkäinen 2003, Kim 2003,
Ko 2003, Sim 2003].
The availability of the whole human genome sequence, produced by the
Human Genome Project, makes it possible to study gene function much more
efficiently. We can find or predict the functions and positions of genes by
analyzing transcriptional regulation. In general, there are three issues in
the field of transcriptional regulation: i) transcription factors (TF for
short), ii) transcription factor binding sites, and iii) regulatory proteins.
In this paper, we focus on the second issue and suggest an algorithm for
predicting TF binding sites from a given set of sequences of upstream regions
of genes (transcriptional regulation regions). For this, we need two
assumptions. First, we assume that there tends to be a common TF that binds
to each of the upstream regions of functionally related genes. Our second
assumption is that TF binding sites that bind to a common TF have similar
DNA sequences.
2. Preliminaries
2.1. Suffix array
To search for a pattern in a text, suffix trees and suffix arrays are widely
used. The suffix tree due to McCreight [McCreight 1976] is a compacted trie of
all the suffixes of a string. The suffix array due to Manber and Myers [Manber
1993] is basically a sorted list of all the suffixes of a string.
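As a minimal illustration of "a sorted list of all the suffixes", the Python sketch below builds a suffix array by plain sorting and then locates a pattern by binary search. It is not one of the cited linear-time constructions. The function names `build_suffix_array` and `find_occurrences` are hypothetical, positions are 1-based, and the ordering of symbols is assumed to be Python's character ordering, in which # sorts before the lowercase letters.

```python
def build_suffix_array(text):
    """Return the 1-based start positions of all suffixes of text, sorted
    by the suffixes themselves. Naive sorting, for illustration only."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def find_occurrences(text, sa, pattern):
    """Return the sorted 1-based positions where pattern occurs in text,
    found by two binary searches over the suffix array sa."""
    m = len(pattern)

    def prefix(k):
        # First m symbols of the k-th smallest suffix (k is a 0-based rank).
        return text[sa[k] - 1:sa[k] - 1 + m]

    # Lower bound: first rank whose m-symbol prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first rank whose m-symbol prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "abbabaababbb#"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ab"))   # [1, 4, 7, 9]
```

All suffixes that start with the pattern are adjacent in the sorted order, which is why two binary searches are enough to delimit them.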
To search for a pattern efficiently in a suffix array, LCP (longest common
prefix) information is used. An LCP array $L$ is an array of the lengths of
the common prefixes of adjacent suffixes in the suffix array $SA_T$ of a given
sequence $T$. That is, $L[i]$ ($1 \le i \le n-1$) stores the length of the
longest common prefix of the suffixes at $SA_T[i]$ and $SA_T[i+1]$, where $n$
is the length of $T$. For example, when the text is
$T = \texttt{abbabaababbb\#}$, where # is the smallest symbol in the alphabet,
the suffix array $SA_T$ of $T$ and the LCP array $L$ are as shown in Table 1.
 i   SA_T[i]   suffix           L[i]
 1   13        #                0
 2   6         aababbb#         1
 3   4         abaababbb#       3
 4   7         ababbb#          2
 5   1         abbabaababbb#    3
 6   9         abbb#            0
 7   12        b#               1
 8   5         baababbb#        2
 9   3         babaababbb#      3
10   8         babbb#           1
11   11        bb#              2
12   2         bbabaababbb#     2
13   10        bbb#             -
Table 1. Suffix array and LCP array
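The entries of Table 1 can be reproduced with a short Python sketch. It assumes 1-based positions as in the table, a naive sort instead of a linear-time construction, and Python's character ordering (in which # sorts before the letters); the helper names are hypothetical.

```python
def suffix_array(text):
    """1-based start positions of the suffixes of text in sorted order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])


def common_prefix_length(a, b):
    """Number of leading symbols that strings a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def lcp_array(text, sa):
    """List whose k-th entry (0-based) is L[k+1]: the length of the common
    prefix of the suffixes at SA[k+1] and SA[k+2] in 1-based table terms."""
    return [
        common_prefix_length(text[sa[k] - 1:], text[sa[k + 1] - 1:])
        for k in range(len(sa) - 1)
    ]


text = "abbabaababbb#"
sa = suffix_array(text)
lcp = lcp_array(text, sa)

# Print one row per suffix: i, SA[i], the suffix, and L[i] (undefined for i = n).
for i, pos in enumerate(sa, start=1):
    l_value = lcp[i - 1] if i <= len(lcp) else "-"
    print(i, pos, text[pos - 1:], l_value)
```

The direct pairwise comparison used here takes quadratic time in the worst case; it is chosen only because it mirrors the definition of $L[i]$ literally.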
2.2. Transcription Factor Binding Sites
In molecular biology, gene regulation is an active research topic, and there
are many methodologies for characterizing gene expression patterns and
profiles. Because of the huge number of cell types and organs, we cannot
characterize all combinations of conditional factors by experiment. Thus, we
need tools for the in silico identification of genomic signals that are
related to gene regulation, for example, CoreInspector [Ohler 2001] and
MCPromoter [Zhang 1998].
TRANSFAC [Matys 2003] is a database of TFs, TF binding sites, and
DNA-binding profiles. Most programs related to regulatory regions have been
developed based on TRANSFAC. We also use TRANSFAC to test the accuracy of our
algorithm.
3. Algorithm and Results
3.1. Algorithm
Given a set $S$ of DNA sequences $s_1, s_2, \ldots, s_n$, our algorithm first
finds the common sequences that satisfy given thresholds. In general, the
input is a set of sequences that seem to be functionally related. We use
suffix arrays to find common DNA sequences in the given set of transcriptional
regulation regions. First, we make a long sequence $T$ by concatenating all the
sequences in $S$. That is, $T = s_1 \#_1 s_2 \#_2 \cdots s_n \#_n$. Each
$\#_i$ is a special character that delimits the sequences. Now, we make the
suffix array $SA_T$ of $T$ and its LCP array $L$. We use the algorithm of
Kärkkäinen and Sanders [Kärkkäinen 2003] to construct the suffix array of a
given sequence in linear time.
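The concatenation step can be sketched in Python as follows. The encoding is an assumption made for illustration: bases are stored as their character codes and the i-th delimiter as the negative integer -i, which makes every delimiter unique and smaller than any base. The names `concatenate` and `owner` are hypothetical, positions are 0-based, and the suffix array is built by naive sorting rather than by the linear-time algorithm cited above.

```python
def concatenate(sequences):
    """Build T = s1 #1 s2 #2 ... sn #n as a list of integer codes.

    Returns (text, owner), where owner[p] is the 0-based index of the input
    sequence that position p of text came from (delimiters included).
    """
    text, owner = [], []
    for idx, seq in enumerate(sequences):
        text.extend(ord(c) for c in seq)       # the bases of s_(idx+1)
        text.append(-(idx + 1))                # its unique delimiter
        owner.extend([idx] * (len(seq) + 1))
    return text, owner


def suffix_array(text):
    """0-based start positions of the suffixes of text in sorted order."""
    return sorted(range(len(text)), key=lambda p: text[p:])


text, owner = concatenate(["ACG", "CG"])
sa = suffix_array(text)
print(text)    # [65, 67, 71, -1, 67, 71, -2]
print(owner)   # [0, 0, 0, 0, 1, 1, 1]
print(sa)      # [6, 3, 0, 4, 1, 5, 2]
```

Because each delimiter occurs only once in the concatenated sequence, a common prefix of two different suffixes can never extend across a delimiter, so every common prefix lies inside a single input sequence.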
We are given two thresholds. One is LEN, the minimum length of a candidate TF
binding site. The other is RTO, the minimum fraction of the given sequences in
which a candidate must appear. For example, if 10 sequences are given and LEN
and RTO are 4 and 0.8, respectively, our algorithm finds all the sequences
that have length at least 4 and appear in at least 8 of the given sequences.
See Figure 1.
Figure 1. Flow of algorithm. (1) Given input sequences, our
algorithm concatenates all the input sequences to make them into
one sequence. (2) Build the suffix array of the concatenated
sequence, and (3) make the LCP array. (4) Find sequences above
given thresholds. In this example, LEN is 5 and RTO is 0.8.
AGCTC is selected as a candidate.
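The four steps of Figure 1 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the implementation evaluated in this paper: `find_candidates`, `min_len` (standing in for LEN) and `ratio` (standing in for RTO) are hypothetical names, delimiters are encoded as unique negative integers, the suffix array is built by naive sorting instead of the linear-time construction, and the input sequences in the usage example are made up.

```python
def common_prefix(text, p, q):
    """Length of the longest common prefix of the suffixes starting at p and q."""
    n = 0
    while p + n < len(text) and q + n < len(text) and text[p + n] == text[q + n]:
        n += 1
    return n


def find_candidates(sequences, min_len, ratio):
    """Map each substring of length >= min_len that occurs in at least
    ratio * len(sequences) input sequences to the number of sequences
    containing it."""
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    # (1) Concatenate; the i-th delimiter is the unique negative code -i.
    text, owner = [], []
    for idx, seq in enumerate(sequences):
        text.extend(ord(c) for c in seq)
        text.append(-(idx + 1))
        owner.extend([idx] * (len(seq) + 1))
    # (2) Suffix array by naive sorting (0-based positions).
    sa = sorted(range(len(text)), key=lambda p: text[p:])
    # (3) LCP array: lcp[k] compares the suffixes ranked k and k + 1.
    lcp = [common_prefix(text, sa[k], sa[k + 1]) for k in range(len(sa) - 1)]
    # (4) For each length k, a maximal run of adjacent suffixes with LCP >= k
    #     is exactly the set of occurrences of one substring of length k.
    candidates = {}
    k = min_len
    while True:
        found = False
        start = 0
        for end in range(1, len(sa) + 1):
            if end == len(sa) or lcp[end - 1] < k:      # the run ends here
                prefix = text[sa[start]:sa[start] + k]
                if len(prefix) == k and all(c >= 0 for c in prefix):
                    count = len({owner[p] for p in sa[start:end]})
                    if count + 1e-9 >= ratio * len(sequences):
                        candidates["".join(map(chr, prefix))] = count
                        found = True
                start = end
        if not found:       # no length-k candidate, so no longer one either
            break
        k += 1
    return candidates


seqs = ["TTAGCTCA", "AGCTCGG", "CAGCTCT", "GAGCTC", "TTTTGG"]
print(find_candidates(seqs, 5, 0.8))   # {'AGCTC': 4}
```

The loop over k can stop at the first length with no candidate because any substring that meets the ratio threshold has a prefix one symbol shorter that meets it too.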
3.2. Experimental Results
To evaluate the accuracy of our algorithm, we obtained upstream regions and TF
binding sites from EMBL [Stoesser 2003] release 75 and TRANSFAC [Matys 2003]
release 3.2. We measured performance by the PPV (positive predictive value,
PPV = TP/(TP+FP), where TP is the number of true positives and FP is the
number of false positives) and obtained 35.5% in the best case, when LEN is
4 (bp) and RTO is 0.85. See Table 2.
RTO (%) \ LEN (bp)      4      5      6      7      8      9
100                  30.9   18.2      0      0      0      0
95                   30.9   18.2      0      0      0      0
90                   33.0   14.6      0      0      0      0
85                   35.3   17.5      0      0      0      0
80                   29.4   13.0   15.2   25.0      0      0
75                   29.6   16.9   14.3   25.0      0      0
70                   30.0    7.6   13.3   25.0      0      0
65                   30.2   18.1   13.3   25.0      0      0
60                   27.5   15.9   10.9   10.4    8.6    7.3
55                   27.6   16.7   12.1   10.0    8.2    7.3
Table 2. Positive predictive value (PPV, %) for each RTO (rows, %) and LEN (columns, bp)
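For reference, the PPV measure defined above is a one-line computation. The sketch below uses a hypothetical function name and made-up counts. Returning 0.0 when there are no predictions at all is an assumption of this sketch, since the text does not say how that case is scored.

```python
def positive_predictive_value(true_positives, false_positives):
    """PPV = TP / (TP + FP), as a fraction between 0 and 1.

    Assumption: when nothing was predicted (TP + FP == 0) the value is
    reported as 0.0 rather than left undefined.
    """
    predicted = true_positives + false_positives
    if predicted == 0:
        return 0.0
    return true_positives / predicted


# Made-up counts: 1 correct prediction out of 4 predictions in total.
print(positive_predictive_value(1, 3))   # 0.25
```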
4. Conclusion
In this paper, we proposed an algorithm that predicts TF binding sites from
transcriptional regulation regions by extracting common sequences using a
suffix array. Our algorithm extracts common sequences from the input sequences
and outputs them as candidate TF binding sites. We expect that searching for
pairs of common sequences, that is, gap-allowed motifs, would improve the
accuracy of our algorithm.
References
J. Kärkkäinen and P. Sanders, Simple linear work suffix array construction,
International Colloquium on Automata, Languages and Programming,
LNCS 2719, 943-955, 2003.
D.K. Kim, J.S. Sim, H. Park and K. Park, Linear-time construction of suffix
arrays, Combinatorial Pattern Matching, LNCS 2676, 186-199, 2003.
P. Ko and S. Aluru, Space efficient linear time construction of suffix arrays,
Combinatorial Pattern Matching, LNCS 2676, 200-210, 2003.
U. Manber and G. Myers, Suffix arrays: A new method for on-line string
searches, SIAM Journal on Computing, 22, 935-948, 1993.
V. Matys, E. Fricke, R. Geffers, E. Gößling, M. Haubrock, R. Hehl, K.
Hornischer, D. Karas, A. E. Kel, O.V. Kel-Margoulis, D.U. Kloos, S.
Land, B. Lewicki-Potapov, H. Michael, R. Munch, I. Reuter, S. Rotert, H.
Saxel, M. Scheer, S. Thiele, and E. Wingender, TRANSFAC:
transcriptional regulation, from patterns to profiles, Nucleic Acids
Research, 31(1), 374-378, 2003.
E.M. McCreight, A space-economical suffix tree construction algorithm,
Journal of the ACM, 23, 262-272, 1976.
U. Ohler, H. Niemann, G. Liao, G.M. Rubin, Joint modeling of DNA
sequence and physical properties to improve eukaryotic promoter
recognition, Bioinformatics, 17 Suppl 1, S199-206, 2001.
J.S. Sim, D.K. Kim, H. Park, and K. Park, Linear-time search in suffix arrays,
Australasian Workshop on Combinatorial Algorithms, 139-146, 2003.
G. Stoesser, W. Baker, A. Broek, M. Garcia-Pastor, C. Kanz, T. Kulikova, R.
Leinonen, Q. Lin, V. Lombard, R. Lopez, R. Mancuso, F. Nardone, P.
Stoehr, M.A. Tuli, K. Tzouvara, and R. Vaughan, The EMBL nucleotide
sequence database: major new developments, Nucleic Acids
Research, 31(1), 17-22, 2003.
M.Q. Zhang, Identification of human gene core promoters in silico, Genome
Research, 8(3), 319-326, 1998.