Peter Bakke
Laboratory Methods in Genomics

Tutorial: Species-specific Shine-Dalgarno sequence

Shine-Dalgarno sequences are sites that lie slightly upstream (on the 5’ side) of coding regions in bacterial and archaeal genomes. These sequences help recruit ribosomes to the mRNA in order to facilitate translation. Shine-Dalgarno (SD) sequences are generally considered to be six bases in length, lying four to ten nucleotides upstream of the gene’s start codon. The most common example of an SD sequence is AGGAGG. However, the position and base composition of SD sequences vary between species and between genes. When annotating a new genome, it may be helpful to find a consensus species-specific SD sequence in order to confirm proposed translation initiation sites. This tutorial provides simple methods for analyzing upstream sequences with the goal of generating a species-specific Shine-Dalgarno sequence.

1. Target highly conserved genes: In your annotation database, search for genes that code for DNA and RNA polymerase subunits, as well as those that code for large subunit (LSU) ribosomal proteins. These genes are highly conserved and highly expressed, so their upstream regions tend to be well conserved as well. For each gene, you should be able to display and export the nucleotide sequence of the gene, the 50-nucleotide sequence upstream of the gene, and the amino acid sequence of the gene.

For example, in the Joint Genome Institute’s Integrated Microbial Genomes Education Site (IMG/edu homepage), the user can search and analyze the putative protein-coding genes of a given genome. My class and I worked with the genome of a halophilic archaeon, Halorhabdus utahensis (H. utahensis page). JGI called 19 DNA/RNA polymerase genes and 32 LSU ribosomal protein genes in the H. utahensis genome.

[Figure: the display feature of the IMG database, where the user can visualize the upstream nucleotide sequence (shown in green), the start codon (shown in red), and the amino acid sequence.]

2. Browse highly conserved genes to get an idea of what to look for: Without spending too much time, look over a number of the sequences upstream of the highly conserved genes. In general, you should be looking for purine-rich areas 0 to 20 bases upstream of the start codon. Almost all archaea have at least half of the consensus SD sequence AGGAGG, so look for a portion of this sequence. Find a basic recurring motif that will later be used as a foundation.

3. Analyze a highly conserved gene and upstream sequences: Begin by creating a document to record your findings. For each gene, you should record the gene ID, DNA coordinates, strand, and proposed function, along with additional information that will be discussed later.

For each gene, start by verifying its start codon. To do this, obtain the amino acid sequence of the gene and run it through the BLASTp web application (BLASTp online, S. Simpson’s BLASTp tutorial). If the sequence aligns significantly with the beginning of other known proteins, record the start codon in your records and proceed. If it does not, locate the correct start codon, treat it as the new starting position of the gene, and record the altered start codon.

From this point, begin looking for the basic recurring motif that you identified earlier, moving from the start codon in the 5’ direction. If you find it, record a 14-base sequence that brackets this motif. For instance, the preliminary motif that I found for H. utahensis was GG. Therefore, an example of recorded SD sequence data is cgctGAGGTGacca. The preliminary SD sequence here is GAGGTG, but recording the surrounding bases as well makes it simple to adjust the SD sequence later.

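The motif search described in Step 3 can also be scripted. The following is a minimal Python sketch, not part of the original tutorial. The function name `bracket_motif`, its parameters, and the example upstream sequence are hypothetical; the 20-base search limit is an assumption taken from the "0 to 20 bases upstream" guideline in Step 2. It assumes the upstream sequence is written 5' to 3' and ends immediately before the start codon.

```python
def bracket_motif(upstream, motif="GG", window=14, max_spacer=20):
    """Find the motif occurrence closest to the start codon and bracket it.

    `upstream` is the sequence 5' of the start codon, written 5'->3', so its
    last base sits immediately before the start codon.
    Returns (window_sequence, spacer), or None if no occurrence lies within
    `max_spacer` bases of the start codon.
    """
    seq = upstream.upper()
    # rfind returns the rightmost occurrence, i.e. the one nearest the start
    # codon, which mirrors scanning from the start codon in the 5' direction.
    idx = seq.rfind(motif.upper())
    if idx == -1:
        return None
    # Number of bases between the last base of the motif and the start codon.
    spacer = len(seq) - (idx + len(motif))
    if spacer > max_spacer:
        return None
    # Equal flanks on both sides of the motif, clipped at the sequence ends.
    flank = (window - len(motif)) // 2
    lo = max(0, idx - flank)
    hi = min(len(seq), idx + len(motif) + flank)
    return seq[lo:hi], spacer


# Hypothetical 26-base upstream region containing the tutorial's example.
print(bracket_motif("TTACATCGCTGAGGTGACCATCATCA"))
# -> ('CGCTGAGGTGACCA', 12)
```

The second returned value is the number of bases between the motif and the start codon. If the motif lies closer to the start codon than the flank length, the window is simply shorter than 14 bases; extending it into the coding sequence would require passing the first bases of the gene as well.
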
Also record the spacer length, that is, the number of bases between the SD sequence and the start codon. For instance, in the example gene there are 12 bases between the second G of “GG” and the start codon.

4. Analyze all genes coding for DNA/RNA polymerase and LSU ribosomal proteins: Using the methods from Step 3, create a list of 14-base preliminary SD sequences, start codons, and spacer lengths for all of your highly conserved genes. If you cannot find an SD sequence in front of a gene, look for other patterns up to 50 bases upstream of the start codon. You may find adenine- and thymine-rich regions or other motifs. You can also record whether or not the gene is located in an operon, if that information is available. Remember, it is better to record too much information than too little.

5. Analyze preliminary SD sequences: Line up all of the preliminary SD sequences. Where is there the most repetition between the sequences? Is there more similarity between the sequences in the first half or the second half? In the H. utahensis sequences, I noticed that there was more repetition upstream of the “GG” than downstream of it. Using this information, crop the preliminary 14-base SD sequences to 8 bases, keeping the most conserved portion of the sequences intact. The original recurring motif (“GG” for H. utahensis) should lie in the same position in every sequence. For instance, every H. utahensis SD sequence was modified to fit the format nNNNNGGn. Assign a position number to each base in the sequence, the first being 0 and the last being 7. For H. utahensis, the two guanines are always in Positions 5 and 6. Update the spacer length for each SD sequence so that it accurately records the number of bases between Position 6 and the start codon.

6. Create a position-base frequency plot: Using Microsoft Excel, create a table listing the four bases as rows and the eight positions as columns. For each position, count how many times each of the four bases appears across all of the SD sequences.

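The counting can also be done with a short script instead of a spreadsheet. The following is a minimal Python sketch, not part of the original tutorial. The function name `count_matrix` and the four example sequences are hypothetical (they are made up to fit the nNNNNGGn format and are not H. utahensis data), and the sketch assumes every sequence has already been cropped to the same length and contains only A, C, G, and T.

```python
BASES = "ACGT"


def count_matrix(sd_sequences):
    """Count how often each base occurs at each position.

    Returns a dict mapping each base to a list of counts, one per position.
    """
    seqs = [s.upper() for s in sd_sequences]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all SD sequences must be cropped to the same length")
    matrix = {base: [0] * length for base in BASES}
    for s in seqs:
        for pos, base in enumerate(s):
            # A character other than A, C, G, or T raises KeyError here.
            matrix[base][pos] += 1
    return matrix


# Made-up 8-base sequences with "GG" at Positions 5 and 6.
example = ["ACGGAGGT", "TCGTAGGA", "ACAGTGGC", "GTGGAGGT"]
table = count_matrix(example)

# Print one row per base and one column per position.
print("base " + " ".join(str(p) for p in range(8)))
for base in BASES:
    print(base + "    " + " ".join(str(n) for n in table[base]))
```

Each column of the printed table should add up to the number of sequences supplied (four in this made-up example), which is a quick check that no sequence was skipped.
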
For example, if 5 sequences have adenine as the first base, enter 5 in the cell corresponding to Base “A” and Position “0”. The sum of each column should equal the total number of SD sequences that you have recorded.

[Figure: blank position weight matrix]

[Figure: completed position weight matrix]

7. Use the frequency plot to determine a consensus SD sequence: Look at the frequency plot, checking for particularly high or low frequencies at each position. If the majority of the hits at a position fall on one base, that base can be considered a consensus base. Using the completed matrix as an example, the adenine in Position 4 is a consensus base because it appeared there 20 out of 32 times. Consensus bases are listed as uppercase letters. Bases with moderately high frequencies (around 40-49 percent) can be listed as lowercase letters.

***You have now condensed a large amount of data into a simple string of letters (e.g., cgGAGGT). This information, used along with the average spacer length, will be useful as evidence when finding and adjusting start sites.***

Resources:
BIO 343: Laboratory Methods in Genomics, Davidson College, Fall 2008
“Ammonifex Ribosome Binding Sites” Excel Sheet: http://www.bio.davidson.edu/Courses/Bio343/AmmonifexRibosomeBindingSites.xls
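
Supplementary sketch for Step 7: the consensus rule can be applied to a table of counts with a few lines of Python. This is an illustration under stated assumptions, not part of the original tutorial. The function name `consensus`, the threshold parameters, and the three-position example counts are hypothetical. Two choices here are assumptions of this sketch: a base at exactly 50 percent is treated as moderately frequent (lowercase), and a position where no base reaches 40 percent is written as "n".

```python
def consensus(matrix, majority=0.5, moderate=0.4):
    """Turn a base -> per-position counts table into a consensus string.

    Uppercase: the top base accounts for more than `majority` of the column.
    Lowercase: the top base accounts for at least `moderate` of the column.
    "n":       no base reaches `moderate` (an assumption of this sketch).
    """
    bases = list(matrix)
    length = len(matrix[bases[0]])
    letters = []
    for pos in range(length):
        total = sum(matrix[b][pos] for b in bases)
        # On a tie, max() keeps the first base in the table's order.
        top = max(bases, key=lambda b: matrix[b][pos])
        share = matrix[top][pos] / total if total else 0.0
        if share > majority:
            letters.append(top.upper())
        elif share >= moderate:
            letters.append(top.lower())
        else:
            letters.append("n")
    return "".join(letters)


# Made-up counts for 32 sequences at three positions.
counts = {
    "A": [20, 14, 8],
    "C": [4, 6, 8],
    "G": [4, 6, 8],
    "T": [4, 6, 8],
}
print(consensus(counts))  # -> Aan
```

In the made-up table, the first column reproduces the 20-out-of-32 case from Step 7 (62.5 percent, uppercase), the second column falls in the moderate range (14 of 32, about 44 percent, lowercase), and the third column has no preferred base.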