ALLBIO training workshop
Analysing thousands of bacterial genomes: gene annotation, metabolism, regulation
Marseille, June 22-25, 2014
Course book

CONTENTS

RSAT tutorials
   Scanning promoters to predict TF binding sites and target genes
      Getting motifs from RegulonDB
         Protocol
         Exercise
      Retrieving all upstream sequences from RSAT
         Protocol
      String-based pattern matching
      Matrix-based pattern matching
      Exercise
RSAT TUTORIALS

Scanning promoters to predict TF binding sites and target genes

In this tutorial, we address the situation where we are interested in one particular transcription factor for which some binding sites and target genes have already been characterized. With the genome of interest at hand, we want to scan all the promoters in order to predict putative binding sites, and hence infer putative target genes.

Getting motifs from RegulonDB

RegulonDB (http://regulondb.ccg.unam.mx/) is a database of transcription factors in the model bacterium Escherichia coli. It contains a detailed collection of binding sites with experimental evidence. In addition, the curation team used the annotated binding sites to build motifs (position-specific scoring matrices) that can be used to scan sequences.

Protocol

1. Open a connection to RegulonDB (http://regulondb.ccg.unam.mx/).
2. Select a transcription factor of interest. Each participant can select a different factor; we will then summarize and compare the individual results. We recommend choosing a factor for which RegulonDB contains between 10 and 50 binding sites (avoid CRP, Fis, FNR, H-NS and other global TFs). We will illustrate the following steps with CpxR.
   a. In the 'Search by type of object' box in the upper part of the website, type the name of the selected transcription factor and select the option “Regulon” in the drop-down menu.
3. Read the annotation carefully and note the following:
   a. Number of annotated binding sites (CpxR = 57).
   b. Number of target genes (CpxR = 63).
   c. Number of target operons (CpxR = 37).
4. Copy the matrix and save it in a text file.
5. Open a connection to the Regulatory Sequence Analysis Tools (http://www.rsat.eu/).
6. In the menu on the left side, click on matrix tools and then select the program convert-matrix.
   a. Enter the matrix (from step 4) and choose the option “tab” (tab-delimited file) as input format.
   b. Select the background model:
      i. Organism-specific, Escherichia coli K12 substr MG1655 uid5779
      ii. Sequence type: upstream-noorf
   c. Select output options: in addition to the default output options, activate the “parameters” field.
   d. Press the 'Go' button.

Exercise

Interpret the sequence logo. How good does the motif look? Are there well-conserved residues? Are they dispersed or grouped together?
Get the parameters (including the consensus). Pay particular attention to the information content and the “information per column”.
What spacing would you expect between successive sites in a random sequence?

Retrieving all upstream sequences from RSAT

Protocol

Open a connection to the Regulatory Sequence Analysis Tools (http://www.rsat.eu/).
In the menu on the left side, click on sequence tools and then select the program retrieve-sequences.
Select the option 'single organism' and select Escherichia coli K12 substr MG1655 uid5779 in the menu.
Select 'all' the genes.
Choose the option 'CDS' (coding sequence).
In the menu sequence type, select the option 'upstream'.
For the next options, 'From' and 'to', you can click on them to see the suggested positions for TF activators or repressors. To find out whether your selected TF is an activator or a repressor, look it up in RegulonDB.
Check the option 'Prevent overlap with neighbour genes (noorf)'.
Select 'Fasta' as sequence format and 'gene name' as sequence label.
Click the 'go' button.
Store the sequences in a text file.

String-based pattern matching

We will first apply a very rough approach: predicting binding sites on all the upstream sequences of one genome based on the consensus extracted from the matrix.

Open a connection to the Regulatory Sequence Analysis Tools (http://www.rsat.eu/).
In the menu on the left side, click on pattern matching and then select the program dna-pattern.
In the 'Query pattern' section, paste the consensus sequence of the selected TF.
In the 'Sequences' section, paste the sequences or upload a file containing them, and select the 'fasta' format.
Keep the default options and click 'Go'.

Matrix-based pattern matching

We will now apply a more elaborate approach: predicting binding sites on all the upstream sequences of one genome based on the position-specific scoring matrix.

Open a connection to the Regulatory Sequence Analysis Tools (http://www.rsat.eu/).
In the menu on the left side, click on pattern matching and then select the program matrix-scan (quick).
In the 'Sequences' section, paste the sequences or upload a file containing them, and select the 'fasta' format.
In the 'Matrix' section, paste the position-specific scoring matrix taken from RegulonDB and select the 'tab' format.
In the 'Background' section, select a Markov order of 1, select the background model estimation 'organism-specific', and choose Escherichia coli K12 substr MG1655 uid5779 and 'upstream-noorf' as sequence type.
In the 'Scanning options' section, select the option 'sites + p-val' and set a p-value threshold, e.g. 0.001. You can change this threshold and see how the results change.
Click 'Go'.

Exercise

How many promoters were detected in total with the consensus?
How many promoters were detected in total with the matrix?
How many promoters were annotated in RegulonDB as regulated by the TF?
How many of these were matched by the consensus?
How many of these were matched by the matrix?
Compute the coverage rate.
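
The counts asked for in this exercise can be tallied with a few lines of Python once the gene names have been extracted from each result. The sketch below is only an illustration under stated assumptions: the course book does not define "coverage rate", so it is taken here to mean the fraction of RegulonDB-annotated targets that are recovered by a prediction method. The function name `coverage` and all gene names are hypothetical placeholders, not actual results.

```python
def coverage(annotated, predicted):
    """Fraction of annotated target genes that also appear among the predictions.

    Assumption: "coverage" = (annotated AND predicted) / annotated.
    """
    annotated = set(annotated)
    predicted = set(predicted)
    if not annotated:
        raise ValueError("the list of annotated targets is empty")
    return len(annotated & predicted) / len(annotated)


# Toy placeholder gene names (hypothetical, for illustration only).
annotated_targets = ["geneA", "geneB", "geneC", "geneD"]        # from RegulonDB
consensus_hits = ["geneA", "geneX"]                             # from dna-pattern
matrix_hits = ["geneA", "geneB", "geneC", "geneY", "geneZ"]     # from matrix-scan

for label, hits in [("consensus", consensus_hits), ("matrix", matrix_hits)]:
    n_detected = len(set(hits))
    n_matched = len(set(hits) & set(annotated_targets))
    rate = coverage(annotated_targets, hits)
    print(f"{label}: {n_detected} promoters detected, "
          f"{n_matched} annotated targets matched, coverage = {rate:.2f}")
```

Replace the placeholder lists with the gene names collected from RegulonDB and from the two scanning results to answer the questions above for your own transcription factor.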