32 LSU ribosomal proteins - Genomics and Bioinformatics

Peter Bakke
Laboratory Methods in Genomics Tutorial: Species-specific Shine-Dalgarno sequence
Shine-Dalgarno sequences are sites that exist slightly upstream (before the 5’ end) of
coding regions in bacterial and archaeal genomes. These sequences help attract
ribosomes to the mRNA in order to facilitate translation. Shine-Dalgarno (SD) sequences
are generally considered to be six bases in length, lying four to ten nucleotides upstream
of the gene’s start codon. The most common example of an SD sequence is AGGAGG.
However, the position and sequence of SD sites vary between species and between
genes. When annotating a new genome, it may be helpful to find a consensus species-specific SD sequence in order to confirm proposed translation initiation sites. This
tutorial will provide simple methods for analyzing upstream sequences with the goal of
generating a species-specific Shine-Dalgarno sequence.
1. Target highly conserved genes:
In your annotation database, search for genes that code for DNA and RNA polymerase
subunits, as well as those that code for large subunit (LSU) ribosomal proteins. These
genes are highly conserved and highly expressed, meaning that their upstream regions are
also well conserved. For each gene, you should be able to display and export the
nucleotide sequence of the gene, the 50-nucleotide sequence upstream of the gene, and
the amino acid sequence of the gene.
For example, in the Joint Genome Institute’s Integrated Microbial Genomes Education
Site (IMG/edu homepage), the user is able to search and analyze the putative protein
coding genes for a certain genome. My class and I worked with the genome of a
halophilic archaeon, Halorhabdus utahensis (H. utahensis page). JGI's annotation of the H.
utahensis genome called 19 DNA/RNA polymerase genes and 32 LSU ribosomal
protein genes.
The image above shows the display feature of the IMG database, where the user can visualize the upstream
nucleotide sequence (shown in green), the start codon (shown in red), and the amino acid sequence.
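If your annotation database does not export upstream regions directly, the 50-nucleotide window can be cut from the genome sequence yourself. The sketch below is only an illustration: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and the genome is treated as linear (no wrap-around for circular chromosomes).

```python
# Hypothetical helpers, not part of IMG. Assumes 1-based inclusive gene
# coordinates and a linear genome string.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(genome, start, end, strand, length=50):
    """Return up to `length` bases immediately 5' of a gene.

    The result is read 5'->3' on the gene's own strand, so its last base
    sits directly in front of the start codon.
    """
    if strand == "+":
        lo = max(0, start - 1 - length)
        return genome[lo:start - 1]
    if strand == "-":
        # On the reverse strand the gene begins at `end`, so its upstream
        # region lies at higher coordinates on the forward strand.
        return reverse_complement(genome[end:end + length])
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    toy_genome = "TTTTACGTATGAAATAA"  # made-up sequence for illustration
    print(upstream_region(toy_genome, 9, 17, "+", length=4))
```

Near the ends of the sequence the function simply returns fewer than `length` bases rather than raising an error.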
2. Browse highly conserved genes to get an idea of what to look for:
Without spending too much time, look over a number of the sequences upstream of the
highly conserved genes. In general, you should be looking for purine-rich areas 0 to 20
bases upstream of the start codon. Almost all archaea have at least half of the consensus
SD sequence AGGAGG, so look for a portion of this sequence. Find a basic recurring
motif that will later be used as a foundation.
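Browsing by eye is the point of this step, but a quick scan can help you decide where to look first. The sketch below flags purine-rich windows in the last 20 bases of an upstream sequence and lists which pieces of AGGAGG (three bases or longer) are present. The window size, the 80 percent purine threshold, and the function names are assumptions chosen for illustration, not values from the course.

```python
def purine_fraction(seq):
    """Fraction of bases in `seq` that are A or G."""
    seq = seq.upper()
    return sum(base in "AG" for base in seq) / len(seq) if seq else 0.0


def purine_rich_windows(upstream, window=6, search=20, min_fraction=0.8):
    """List (spacer, window_sequence) for purine-rich windows.

    Only the last `search` bases before the start codon are scanned.
    `spacer` is the number of bases between the window and the start codon.
    """
    region = upstream[-search:].upper()
    hits = []
    for i in range(len(region) - window + 1):
        candidate = region[i:i + window]
        if purine_fraction(candidate) >= min_fraction:
            hits.append((len(region) - (i + window), candidate))
    return hits


def partial_sd_matches(seq, consensus="AGGAGG", min_len=3):
    """Return the pieces of `consensus` (>= min_len bases) found in `seq`."""
    seq = seq.upper()
    pieces = {
        consensus[i:j]
        for i in range(len(consensus))
        for j in range(i + min_len, len(consensus) + 1)
    }
    return sorted(piece for piece in pieces if piece in seq)


if __name__ == "__main__":
    toy_upstream = "CTCTCTCTCTGGAGGTCTTC"  # made-up sequence
    print(purine_rich_windows(toy_upstream))
    print(partial_sd_matches(toy_upstream))
```

Treat the output as a list of places to inspect, not as a call; the recurring motif still has to be chosen by comparing many genes.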
3. Analyze a highly conserved gene and upstream sequences:
Begin by creating a document to record your findings. For each gene, you should record
the gene ID, DNA coordinates, strand, and proposed function, along with additional
information that will be discussed later.
For each gene, start by verifying the start codon of that gene. In order to do this, obtain
the amino acid sequence of the gene and run it through the BLASTp web application
(BLASTp online, S. Simpson’s BLASTp tutorial). If the sequence aligns significantly
with the beginning of other known proteins, record the start codon in your notebook and
proceed. If it does not, locate the correct start codon and treat it as the new starting
position of the gene. Record the altered start codon in your records.
From this point, begin looking for the basic recurring motif that you became aware of
earlier, moving from the start codon in the 5’ direction. If you find it, record a 14-base
sequence that brackets this motif. For instance, the preliminary motif that I found for H.
utahensis was GG. Therefore, an example of recorded SD sequence data is
cgctGAGGTGacca. The preliminary SD sequence here is GAGGTG, yet the practice of
recording the surrounding bases makes later alteration of the SD sequence simple. Also
record the spacer length, that is, how many bases there are between the SD sequence and
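the start codon. For instance, in the gene this example comes from, there are 12 bases
between the second G of “GG” and the start codon; the recorded 14-base string shows
only the first 6 of them (TGacca).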
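The search described in this step can be sketched in a few lines. The code below looks for the occurrence of the motif closest to the start codon, records the motif plus six flanking bases on each side (14 bases for a two-base motif, matching the layout of cgctGAGGTGacca), and reports the spacer length. The function name and the choice to take the nearest hit are assumptions; the nearest "GG" is not always the biologically relevant one, so check each result by eye.

```python
def find_sd_candidate(upstream, gene, motif="GG", flank=6):
    """Locate `motif` upstream of the start codon.

    `upstream` must end immediately before the start codon and `gene`
    must begin with it. Returns (bracket, spacer), or None if the motif
    is absent. `spacer` counts the bases between the last base of the
    motif and the start codon.
    """
    up = upstream.upper()
    idx = up.rfind(motif.upper())  # last hit = closest to the start codon
    if idx == -1:
        return None
    motif_end = idx + len(motif)
    spacer = len(up) - motif_end
    # Include the gene so the bracket can run past the start codon
    # when the spacer is shorter than `flank`.
    context = upstream + gene
    bracket = context[max(0, idx - flank):motif_end + flank]
    return bracket, spacer


if __name__ == "__main__":
    # The bases around cgctGAGGTGacca are made up so that the spacer is 12.
    toy_upstream = "aaaacgctGAGGTGaccattcttc"
    print(find_sd_candidate(toy_upstream, "ATGAAA"))
```

The bracket keeps whatever letter case the input had, so you can still mark the provisional SD sequence in uppercase as in the example above.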
4. Analyze all genes coding for DNA/RNA polymerase and LSU ribosomal proteins:
Using the methods from Step 3, create a list of 14-base preliminary SD sequences, start
codons, and spacer lengths for all of your highly conserved genes.
If you cannot find an SD sequence in front of a gene, look for other patterns up to 50
bases upstream of the start codon. You may find adenine- and thymine-rich regions or other
motifs. Also, you can record whether or not the gene is located in an operon if the
information is available. Remember, it is better to record too much information than too
little.
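One way to keep these records consistent across roughly fifty genes is a small record type that can be written out as a spreadsheet-readable CSV file. The field names below are hypothetical and simply mirror the items listed in Steps 3 and 4; add or rename fields to suit your own notes.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class GeneRecord:
    """One row of findings for a single gene (field names are illustrative)."""
    gene_id: str
    start: int
    end: int
    strand: str
    function: str
    start_codon: str
    start_codon_revised: bool = False  # True if BLASTp suggested a new start
    sd_bracket: str = ""               # 14-base preliminary SD sequence, if any
    spacer: Optional[int] = None       # bases between motif and start codon
    in_operon: Optional[bool] = None   # None when the information is unavailable
    notes: str = ""                    # e.g. AT-rich region, other motifs


def records_to_csv(records):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(GeneRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


if __name__ == "__main__":
    example = GeneRecord(
        gene_id="example_gene_1",  # made-up identifier
        start=100, end=400, strand="+",
        function="LSU ribosomal protein (example)",
        start_codon="ATG",
        sd_bracket="cgctGAGGTGacca",
        spacer=12,
    )
    print(records_to_csv([example]))
```

Fields left as None are written as empty cells, which keeps "not recorded" distinct from "no".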
5. Analyze preliminary SD sequences:
Line up all of the preliminary SD sequences. Where is there the most repetition between
the sequences? Is there more similarity between the sequences in the first half or the
second half? In the H. utahensis sequences, I noticed that there was more repetition
upstream of the “GG” than downstream of it.
Using this information, crop the preliminary, 14-base SD sequences to 8 bases, keeping
the most conserved portion of the sequences intact. The original recurring motif (“GG”
for H. utahensis) should lie in the same position for every sequence. For instance, every
H. utahensis SD sequence was cropped to the format nNNNNGGn.
Assign a position number to each base in the sequence, the first being 0 and the last being
7. For H. utahensis, the two guanines are always in Positions 5 and 6. Update the spacer
length for each SD sequence to accurately record the number of bases between Position 6
and the start codon.
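The cropping can be scripted once you know where the motif sits in each 14-base record. The sketch below assumes the motif starts at index 6 of the 14-base string (six flanking bases on each side, as in cgctGAGGTGacca) and should start at Position 5 of the 8-base result, which gives the nNNNNGGn layout. The function name and defaults are illustrative assumptions.

```python
def crop_bracket(bracket, motif_index=6, motif="GG", target_pos=5, width=8):
    """Crop a preliminary SD record to `width` bases.

    `motif_index` is where the motif starts in `bracket` (0-based);
    `target_pos` is where it should start in the cropped sequence.
    """
    found = bracket[motif_index:motif_index + len(motif)]
    if found.upper() != motif.upper():
        raise ValueError("motif is not at the expected index in this record")
    start = motif_index - target_pos
    end = start + width
    if start < 0 or end > len(bracket):
        raise ValueError("record is too short for this crop")
    return bracket[start:end]


if __name__ == "__main__":
    print(crop_bracket("cgctGAGGTGacca"))
```

With this layout Position 6 is the second G of the motif, so a spacer that was already measured from that G does not change; it needs updating only if the anchoring base shifted when you aligned the sequences.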
6. Create a position-base frequency plot:
Using Microsoft Excel, create a table listing the four bases as rows and the eight positions
as columns. For each position, count the frequency of the appearance of each of the four
bases throughout all of the SD sequences. For example, if there are 5 sequences that have
adenine as the first base, enter 5 in the cell corresponding to Base “A” and Position “0”.
The sum of each column should equal the total number of SD sequences that you have
recorded.
Above: Blank position weight matrix
Above: Completed position weight matrix
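The same table can be built without a spreadsheet. The sketch below counts each base at each position and checks the property stated above, that every column sums to the number of sequences. The three input sequences are made up for illustration and are not H. utahensis data.

```python
BASES = "ACGT"


def position_counts(sequences, width=8):
    """Count how often each base occurs at each position.

    Returns a dict mapping each base to a list of `width` counts,
    i.e. bases as rows and positions as columns.
    """
    table = {base: [0] * width for base in BASES}
    for seq in sequences:
        seq = seq.upper()
        if len(seq) != width:
            raise ValueError(f"expected {width} bases, got {len(seq)}: {seq}")
        for pos, base in enumerate(seq):
            if base not in table:
                raise ValueError(f"unexpected character {base!r} in {seq}")
            table[base][pos] += 1
    return table


if __name__ == "__main__":
    toy_sequences = ["gctGAGGT", "acgGAGGa", "atcGTGGc"]  # made-up examples
    table = position_counts(toy_sequences)
    for base in BASES:
        print(base, table[base])
    # Every column should sum to the number of sequences.
    for pos in range(8):
        assert sum(table[base][pos] for base in BASES) == len(toy_sequences)
```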
7. Use the frequency plot to determine a consensus SD sequence:
Look at the frequency plot, checking for particularly high or low frequencies at each
position. If the majority of the hits occur at one base at a position, this base can be
considered a consensus base. Using the frequency plot above as an example, the adenine
in Position 4 is a consensus base because it appeared there 20 out of 32 times (62.5 percent).
Consensus bases are listed as uppercase letters. Bases with moderately high frequencies
(around 40-49 percent frequency) can be listed as lowercase letters.
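The rule in this step can be written down as a short function. The sketch below takes a count table (bases as rows, positions as columns), writes the most frequent base in uppercase when it holds a majority, in lowercase when its frequency is at least 40 percent, and otherwise writes "n". The "n" placeholder, the exact cutoffs (including treating exactly 50 percent as lowercase), and the function name are assumptions; the tutorial itself only gives approximate ranges.

```python
def consensus_from_counts(table, strong=0.5, weak=0.4):
    """Derive a consensus string from a base-by-position count table.

    `table` maps each base to a list of counts, one per position.
    """
    width = len(next(iter(table.values())))
    letters = []
    for pos in range(width):
        column = {base: counts[pos] for base, counts in table.items()}
        total = sum(column.values())
        top_base = max(column, key=column.get)  # ties go to the first base listed
        freq = column[top_base] / total if total else 0.0
        if freq > strong:
            letters.append(top_base.upper())   # majority: consensus base
        elif freq >= weak:
            letters.append(top_base.lower())   # moderately frequent base
        else:
            letters.append("n")                # no clear preference
    return "".join(letters)


if __name__ == "__main__":
    # Made-up counts for ten sequences and three positions.
    toy_table = {
        "A": [7, 4, 3],
        "C": [1, 2, 3],
        "G": [1, 2, 2],
        "T": [1, 2, 2],
    }
    print(consensus_from_counts(toy_table))
```

For the example in the text, a base seen 20 times out of 32 has a frequency of 0.625 and is therefore written in uppercase.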
***You have now condensed a large amount of data into a simple string of letters (e.g.,
cgGAGGT). This information, used along with the average spacer length, will be useful
as evidence in finding and altering start sites.***
Resources:
BIO 343: Laboratory Methods in Genomics, Davidson College, Fall 2008
“Ammonifex Ribosome Binding Sites” Excel Sheet
http://www.bio.davidson.edu/Courses/Bio343/AmmonifexRibosomeBindingSites.xls