Scanning promoters to predict TF binding sites and

advertisement
ALLBIO training workshop
Analysing thousands of bacterial genomes: gene
annotation, metabolism, regulation
Marseille, June 22-25, 2014
Course book
CONTENTS
Contents......................................................................................................................................................... 2
RSAT tutorials .............................................................................................................................................. 3
Scanning promoters to predict TF binding sites and target genes ................................................... 3
Getting motifs from RegulonDB...................................................................................................... 3
Protocol ............................................................................................................................................ 3
Exercise ............................................................................................................................................ 4
Retrieving all upstream sequences from RSAT .............................................................................. 5
Protocol ............................................................................................................................................ 5
String-based pattern matching........................................................................................................... 6
Matrix-based pattern matching ......................................................................................................... 7
Exercise................................................................................................................................................. 7
2/8
RSAT TUTORIALS
Scanning promoters to predict TF binding sites and target genes
In this tutorial, we will address the situation when we are interested by one particular
transcription factor for which some binding sites and target genes already have been
characterized. Having at hand the genome of interest, we want to scan all the promoters in order
to predict putative binding sites, and hence infer putative target genes.
Getting motifs from RegulonDB
RegulonDB (http://regulondb.ccg.unam.mx/) is a database of transcription factors in the model
Bacteria Escherichia coli. It contains a detailed collection of binding sites with experimental
evidence. In addition, the curation team used the annotated binding sites to build motifs
(position-specific scoring matrices) that can be used to scan sequences.
Protocol
1. Open a connection to RegulonDB (http://regulondb.ccg.unam.mx/).
2. Select a transcription factor of interest. Each participant can select a different factor, we will
then summarize and compare the individual results. We recommend to choose a factor for
which RegulonDB contains between 10 and 50 binding sites (avoid CRP, Fis, FNR, H-NS
and other global TFs). We will illustrate the following steps with CpxR.
a. In the 'Search by type of object' window at the upper part of the website write the
name of the selected transcription factor and select the option “Regulon” in the
drop-down menu.
3. Read carefully the annotation and take some notes
a. Number of annotated binding sites. (CpxR = 57)
b. Number of target genes. (CpxR = 63)
c. Number of target operons. (CpxR = 37)
4. Copy the matrix and save it in a text file.
5. Open connection to the Regulatory Sequence Analysis Tools (http://www.rsat.eu/).
6. In the menu at the left side, click on the menu matrix tools and then select the program
convert-matrix
3/8
a. Enter the matrix (step 4), and choose the option “tab” as input format (tab-delimited
file).
b. Select the background model :
i. Organism-specific, Escherichia coli K12 substr MG1655 uid5779,
ii. Sequence type: upstream-noorf
c. Select output options: in addition to the default output options, activate the
“parameters” field.
d. Press the button 'Go'.
Exercise


Interpret the sequence logo. How good does the motif look ? Are there well-conserved
residues ? Dispersed or regrouped ?
Get the parameters (including consensus). Pay a particular attention to the information
content and “information per column”. Which spacing would you expect between
successive sites in a random sequence.
4/8
Retrieving all upstream sequences from RSAT
Protocol
Open connection to the Regulatory Sequence Analysis Tools (http://www.rsat.eu/).
In the menu at the left side, click on the menu sequence tools and then select the program
retrieve-sequences.
Select the option 'single organism' and select Escherichia coli K12 substr MG1655 uid5779 in the
menu.
Select 'all' the genes
Choose the option 'CDS' (coding sequence).
In the menu sequence type, select the option 'upstream'. For the next option 'From' and 'to',
you can click on them to see the suggested positions for TF activators or repressors. To
know if your selected TF is activator or repressor you can search in RegulonDB.
Check the option 'Prevent overlap with neighbour genes (noorf)'
Select 'Fasta' as sequence format and 'gene name' as sequence label.
Click on 'go' button.
Store the sequences on a text file.
5/8
String-based pattern matching
We will first apply a very rough approach : predicting binding sites on all the upstream sequences
of one genome based on the consensus extracted from the matrix.
Open connection to the Regulatory Sequence Analysis Tools (http://www.rsat.eu/).
In the menu at the left side, click on the menu pattern matching and then select the program
dna-pattern.
In the 'Query pattern' section, paste the consensus sequence of the selected TF.
In the 'Sequences' section, paste the sequence or upload a file with the sequences and select
the 'fasta' format.
Let the default options and click 'Go'.
6/8
Matrix-based pattern matching
We will apply a complex approach : predicting binding sites on all the upstream sequences of one
genome based on the position-specific scoring matrices.
Open connection to the Regulatory Sequence Analysis Tools (http://www.rsat.eu/).
In the menu at the left side, click on the menu pattern matching and then select the program
matrix-scan (quick).
In the 'Sequences' section, paste the sequence or upload a file with the sequences and select
the 'fasta' format.
In the 'Matrix' section, paste the position-specific scoring metrix taken from RegulonDB and
select the 'tab' format.
In the 'Background' section, select the Markov order of 1, select the background model
estimation 'organism-specific', choose Escherichia coli K12 substr MG1655 uid5779 and 'upstreamnoorf' as sequence type.
In the 'Scanning options' section, select the option 'sites + p-val' and set a value, i.e. 0.001.
You can change this threshold and see what happen with the results.
Click 'Go'.
Exercise
 How many promoters were detected in total with the consensus ?
 How many promoters were detected in total with the matrix ?
 How many promoters were annotated in RegulonDB as regulated by the TF ?
 How many of these were matched by the consensus ?
 How many of these were matched by the matrix ?
7/8

Compute the coverage rate.
8/8
Download