General description

DistanceScan is a tool for prediction of functional combinations of transcription factor
binding sites (TFBS).
DistanceScan uses the method of distance distributions of TFBS pairs (Shelest, 2006), which
models the random distribution of distances and compares it with the distribution observed in
the query sequences. Comparing the two profiles filters out the "noise" and retains the
potentially functional combinations. This approach has proved useful as a filtering technique
for selecting TFBS pairs in promoter modeling.
The effectiveness of the approach was first demonstrated by the correct re-identification of the
real, experimentally proven composite elements from the TRANSCompel® Professional
database (Shelest, 2006). You can see a brief report of these findings here.
How it works
The distance distribution approach considers the dependency of the number of TFBS pairs on
the distance at which they occur (distance distribution). For any set of sequences there will be
some background of noise due to unavoidable false positives among the predictions for TFBS.
This tool distinguishes the real signal (distances in functional TFBS pairs) from the
noise signal (distances in false positive TFBS pairs).
We assume that the positions of the false positive TFBSs are random and uniformly
distributed. Taking into account the frequencies of the found (predicted) TFBSs, the tool
calculates the distribution of distances for the random case (i.e., as if all TFBSs were
distributed randomly):
Fig. 1. Calculated random distance distribution in a pair of TFBS (green line: ±3
standard deviations).
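The random case can be illustrated with a small simulation. This is only a sketch under the stated assumption of uniformly distributed sites: the text does not describe how DistanceScan actually calculates the random distribution, so the simulation below stands in for that calculation. The function names, the use of absolute distances between site positions, and the inclusive interval from d to d+δ are assumptions made for this example.

```python
import random
import statistics


def count_pairs(sites_a, sites_b, d, delta):
    """Count (a, b) pairs whose absolute distance lies in [d, d + delta]."""
    return sum(1 for a in sites_a for b in sites_b if d <= abs(b - a) <= d + delta)


def simulate_random_counts(n_a, n_b, seq_len, n_seqs, d, delta, n_trials=200, seed=0):
    """Estimate the mean and standard deviation of the number of pairs in
    [d, d + delta] when n_a and n_b sites per sequence are placed uniformly
    at random (hypothetical helper, not part of DistanceScan)."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_trials):
        total = 0
        for _ in range(n_seqs):
            sites_a = [rng.randrange(seq_len) for _ in range(n_a)]
            sites_b = [rng.randrange(seq_len) for _ in range(n_b)]
            total += count_pairs(sites_a, sites_b, d, delta)
        totals.append(total)
    return statistics.mean(totals), statistics.stdev(totals)


# Example: 20 sequences of 1000 bp, 3 predicted sites of each type per sequence.
mean, sd = simulate_random_counts(n_a=3, n_b=3, seq_len=1000, n_seqs=20, d=50, delta=10)
print(f"expected pairs: {mean:.2f}, band of 3 standard deviations: +/-{3 * sd:.2f}")
```

The mean plays the role of the calculated random distribution at one distance, and three standard deviations correspond to the band drawn in Fig. 1.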
The number of pairs ($f^{obs}_{d,\delta}$) in the set of sequences in a distance interval from d to d+δ is
counted directly for each d (up to a specified maximum) and for each δ (up to a specified
maximum). We then consider the difference between the calculated and observed distance
distributions. A distance is considered over-represented if it exceeds a certain score
characterizing the probability that the distance is found not by chance. The over-represented
distances are considered non-random, hence potentially functional (see Fig. 2).
Fig. 2. The distribution of distances in a set of sequences containing known composite
elements (red line) compared with the random distance distribution calculated for the
same set (blue line). The green line shows the score corresponding to the p-value~0.05.
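The direct counting of observed pairs can be sketched as follows. This is an illustration, not the tool's code: the input layout (one pair of position lists per sequence), the function name, the use of absolute distances, and the inclusive interval are all assumptions.

```python
def observed_profile(sequences, max_distance, delta):
    """Return {d: number of pairs with distance in [d, d + delta]} summed over
    all sequences, for d = 0 .. max_distance.

    `sequences` is a list of (sites_a, sites_b) tuples, one per sequence,
    holding the predicted positions of the two site types (assumed layout).
    """
    profile = {}
    for d in range(max_distance + 1):
        total = 0
        for sites_a, sites_b in sequences:
            for a in sites_a:
                for b in sites_b:
                    if d <= abs(b - a) <= d + delta:
                        total += 1
        profile[d] = total
    return profile


# Three toy sequences; two of them carry the pair at a distance of about 65 bp.
toy_sequences = [([100], [165]), ([200], [268]), ([50], [300])]
profile = observed_profile(toy_sequences, max_distance=100, delta=5)
print(profile[65])  # pairs with distance between 65 and 70
```

Plotting such a profile against the random expectation for the same set gives the kind of comparison shown in Fig. 2.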
The workflow includes 4 main steps:
• Identification of pairs;
• Measuring the number of pairs in a distance interval;
• Calculation of random distance distributions for the sequence sets;
• Selection of the over-represented peaks.
DistanceScan currently accepts 3 kinds of input: (i) the results of a PWM-based
TFBS prediction (Match™ output), (ii) the results of motif prediction (Gibbs Sampler output),
and (iii) the results of a two-step prediction of motifs followed by the corresponding PWM
search (MEME/FIMO output).
How to start
1. Select the input format.
DistanceScan currently accepts 3 input formats: the output of the Match™
tool (see example file), the output of the Gibbs Sampler (see example file) and the output of
MEME/FIMO (see example file). Select the format via the corresponding link.
2. Fill in ALL the fields.
• Input file
Upload a text file containing your data. Don't forget to press the "submit"
button; otherwise the file is not uploaded and you will get the error
message "cannot open the connection".
• Length of sequences
Enter the length of your sequences. Note that the tool works only with
sequences of equal length. If the number you enter is too small (for instance,
your sequences are 1200 bp and you enter 1000), you will get the error message
"Wrong length of sequences (too short)". If you enter a length that is longer
than the actual one, this is not treated as an error, but the results will
change (since the predictions are normalized by the length). Strictly speaking,
what matters is not the real length of the sequences but the last coordinate in
the input file. The tool will not automatically trim the sequences to the
specified length.
IMPORTANT! For promoter analysis it is not sensible to consider very
long sequences. We recommend sequences of up to 1000-1500 bp, preferably
shorter.
• Maximal distance
This is the maximal allowed distance between the constituents of a pair (Fig. 3).
Fig. 3. Maximal distance and distance shift.
• Distance shift
The distances in genuine TFBS pairs (composite elements) are not rigid. To
allow a slight shift of the sites, we consider pairs within a distance interval
from d to d+δ (Fig. 3). This δ is the value you have to specify in this field.
For instance, if you expect your pairs at a distance of up to 100 bp but would
allow a shift of 10 bp, you put "100" in the "Maximal distance" field and "10"
in the "Distance shift" field. The tool will examine all shifts up to the
specified maximum and select the highest shift corresponding to the score (S)
(see below). Note that the results table shows the highest possible shift (and
the corresponding score), even if a lower shift gives a better result. For
instance, suppose you look for a pair at a distance of up to 100 with a shift of
10. The pair at distance 65 plus 10 has a score of 3.2, but the same pair at
distance 65+7 has a score of 4.5. The tool will report only the pair 65+10,
because it always gives the highest available shift. To see whether there are
other variants, consult the plot of the distribution.
• Portion of sequences with pairs
Not all sequences in the set necessarily contain the TFBS combinations.
Sometimes it is enough if, for instance, 80% of the set has them. In this field
you can specify this fraction (as a decimal, 80% = 0.8).
• Score (S)
To characterize the over-representation of a pair, we define its score as the
difference between the measured (observed) and calculated numbers of pairs,
divided by the standard deviation:
$$S = \frac{f^{obs}_{d,\delta} - f_{d,\delta}}{\sigma_{d,\delta}}$$
The probability of obtaining an over-represented pair with a given score by
chance depends on the input parameters and is determined individually for each
found pair. We have constructed score tables that are stored on the server.
When provided with the input data, the tool selects the minimal score
corresponding to a p-value of 0.05 and suggests this value to the user. The
user can, however, select a higher score to obtain a smaller p-value.
You have to enter a score on the input page, that is, before you know the
tool's recommendation. This is necessary because otherwise there would be too
many predictions. The score you enter at the beginning may differ from the one
suggested by the tool (you will see the recommendation together with the
results). We suggest that for the first run you select a score of 3 or 4 and
then adjust it according to the tool's recommendation.
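The score and the shift-reporting rule from the "Distance shift" field can be put together in a short sketch. It reflects one reading of the description above, namely that the tool reports the largest shift whose score reaches the chosen threshold; the function names, the dictionary layout, and the use of "greater than or equal" for the threshold test are assumptions, and the numbers are the ones from the example in the text.

```python
def score(f_obs, f_calc, sigma):
    """Score S: (observed pairs - calculated pairs) / standard deviation."""
    return (f_obs - f_calc) / sigma


def reported_shift(scores_by_shift, threshold):
    """Return (shift, score) for the largest shift whose score reaches the
    threshold, or None if no shift qualifies (assumed reporting rule)."""
    passing = [shift for shift, s in scores_by_shift.items() if s >= threshold]
    if not passing:
        return None
    largest = max(passing)
    return largest, scores_by_shift[largest]


# Pair at distance 65: shift 7 scores 4.5, shift 10 scores 3.2 (example from the text).
scores_at_65 = {7: 4.5, 10: 3.2}
print(reported_shift(scores_at_65, threshold=3.0))  # the larger shift is reported
```

With a threshold of 3, the shift of 10 is reported although the shift of 7 has the higher score, which is why the distribution plot is worth checking for alternatives.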
After all fields are filled in, you can start the search.
Fig. 4. Screenshot of a correctly filled input form. In this example, the length of each sequence in the set is 1000 bp, the maximal
allowed distance between the TFBSs is 120 bp, the allowed distance shift is 5, we expect to find pairs in 80% of all sequences, and the
score S = 4.
Getting the results
After the search is finished, you will see the results in the same window. If any pairs are
found, you will get a list of them with the corresponding distances, shifts, and scores. On the
right you will see some general information: the number of considered sequences (this lets
you check whether the program has recognized all sequences) and a short description of the
found pairs: the pair type (the names of the corresponding matrices for Match output, or the
motif numbers for MEME and Gibbs output) and two links to the plots. For each pair, two
plots are created: (i) a picture of the occurrences of the motifs (sites) and (ii) a plot of
the distance distribution (Fig. 5). To see the plots, copy the link into a new
browser tab or window. The green line in the distribution plot corresponds to the selected
score.
The results can be downloaded.
Fig.5. Two plots created for each pair type: A. Occurrences of the motifs (sites) in each sequence; B. Distance distribution; blue line
– calculated random distribution; red line – observed distance distribution; green line shows the score.