here - WordPress.com

advertisement
A tool to compare two genes across two species
and to present their genomic signature
differences as applied to
Streptococcus Mitis and Mutans
Mengyang (Vicky) Li
Josie Ryan-Small
11/12/11
Li & Ryan-Small 2
Introduction
The genus Streptococcus is known as a heterogeneous Gram-positive bacteria
group that plays a significant role in medicine and industry. Within this genus are
multiple species that can be found in the microbial flora of healthy animals and humans.
However, some of these species can cause diseases in humans, such as scarlet fever,
rheumatic heart disease, and pneumococcal pneumonia (Patterson 1996). The two species
we are focusing on in this study are Streptococcus Mitis and the UA159 steryope c strain
of Streptococcus Mutans.
Streptococcus Mitis can be found within the skin, the oropharynx, the
gastrointestinal tract, and the female genital tract in humans ( Yiş, Yüksel et al. 2011). It
has been found that people who have previously had spinal anesthesia or who had
bacterial endocarditis with neurological complications have gotten meningitis from
Streptococcus Mitis. However, it is very rare for a healthy person to get meningitis
because this species has a low pathogenicity and virulence rate, and therefore, only
affects those whose immune system is weak (Yiş, Yüksel et al. 2011). In addition, S.
Mitis can cause endocarditis (Lamas et al. 2003), an inflammation of the inner layer of
the heart.
The second species, the UA159 steryope c strain of Streptococcus Mutans is
known as the main cause for dental caries, or tooth decay. This genome has become well
adapted to surviving in the environment of the oral cavity. In addition, S. mutans causes
dental caries by fermenting mannitol and sorbitol sugars, along with others to make
adherent water-soluble glucan from sucrose (Hamada & Slade, 1980).
Li & Ryan-Small 3
This study focuses on the difference between the two Streptococcus genome
species Mitis and the Mutans strain, UA159, which demonstrates their different functions
in organisms. Even though there are 64 codons, most organisms do not use all of them in
their genes. This software analyzes the different usages of codons in the genes from the
two genomes, which provides the molecular and genetic evidence of how the two differ.
However, the similarity in the GC-percentages in the two strains demonstrates the genetic
similarities in the two species. As a result, this study illuminates both the different effects
and similar behaviors of the two species in organisms, such as human beings.
Overall, this research provides important genetic evidence for biologists because
it shows the reason for the differences in the two species. Furthermore, the methods used
in this study can be reapplied to other strains to find the differences between the genes.
Li & Ryan-Small 4
Methods
This program is designed to compare and contrast the usages of codons in the two
genes. In addition, it also analyzes the difference in the GC-percentages.
The input of this program is 2 FASTA formatted files of genomes download from
the NCBI website with 2.1MB and 2.2MB. These files should be stored in a directory
(folder) called allFiles. The user can analyze any genes they want to, such as the gene
dnaA and gene dnaN, and find the lengths of those genes by looking at the protein table
downloaded from the NCBI website. The length of the amino acids in gene dnaA is 453
in the Mitis species and 452 in the Mutans species. The length of the gene dnaN is 379
in both species.
The two forms of output that are presented in this program are the standard
console printout and the Excel spreadsheet. In the standard console output, the title of
the report is printed, a prompt for the user to enter the designated start and stop
locations of each gene, the name of each file name opened from the allFiles, the GCpercent in both genomes and the length of the amino acid sequences in each gene. In
the Excel spreadsheet, the following are printed for each of the input files from allFiles:
the filename, the total number of the amino acids in the file, the TAB-delimited
headings, including amino acids, frequency, and proportion, and the TAB-delimited
amino acids synonyms (single letter codes) alphabetically ordered and their
corresponding frequencies and proportion.
This software uses the subroutine getDNA() one at a time to open two FASTA
formatted files of the DNA genomes, which returns the file of DNA sequence as one
string. The user uses the protein table downloaded from the NCBI website to determine
Li & Ryan-Small 5
where specific genes start and stop. The software uses a while loop to run the following
algorithms twice for the two strains. The built-in function in Perl, substr() is then used
to find and snip out the gene sequence. The program uses a subroutine called
translate(), to find the amino acids of the corresponding gene. Then by using a foreach
loop, the frequency counts of each amino acid are stored in a hash table. It then sorts
the keys of the hash table alphabetically. Next, a while loop is used to print all the
single letter codes of amino acids. A proportion of each amino acid is calculated by
dividing the frequency by the length of the amino acid sequence every time the loop
runs. In addition, a built-in function called length() is used to find the length of the
DNA genome. Then the program uses tr(), a built-in function called translate, to count
the number of C’s, and G’s, individually. Adding the number of C’s and G’s gives the
total number of GC’s in the sequence, which will then be divided by the length of the
genome. This step calculates the GC-percent.
Li & Ryan-Small 6
Algorithms
0. glob 2 files grabs all the files in the same folder with the same extension
foreach $file (@allFiles)
1. read the DNA genome
$DNASequence = read DNA
1a. get the percentage of GC
1b. ask the user to enter valid starting and stop locations
1c. snip out the designated gene
2. translate into amino acids
$AA = translate ($DNASequence)
while loop that runs through the gene
2a. build a hash table %codonMap, which holds the synonym letters for
each AA
2b. match your $nextTriplet to the %codonMap
2c. concatenate the single letter to the protein string $protein
2d. return the protein string $protein
$AA = “MVHFWI…”
3. count the frequency of each amino acid
3a. calculate the proportion of each synonym letter
3b. foreach $key (sort keys %wordCount)
print $EXCEL $key \t $wordCount {$key}
\t proportion
4. print output results
4a. print the percentage of GC and other Chargaff’s number to an
EXCEL spread sheet
4b. print the frequency count of the keys in each gene and the
proportion to an EXCEL
5. return to 1 and repeat for the next file
Li & Ryan-Small 7
Results
The results from this program are as follows: the percentages of the GC’s
in Streptococcus Mitis and Mutans are 39.98% and 36.83% respectively, as
seen in Table 1.
Table 1.
The percentage of the G’s and C’s across the two species
In addition, the numbers of amino acids in Streptococcus Mitis and Mutans are
454 and 453, respectively (Table 2).
Table 2.
The
total
number
amino
acids
that are
within the two species from the gene dnaA
of
These results show the similarities between the two species within the same
genus. However, when the analysis is narrowed down to the amino acids, the
differences between the two are demonstrated. This study uses the one-letter
synonym codes to present the amino acid triplets, as shown in Table 3.
Li & Ryan-Small 8
Table 3.
The table of one-letter synonym codons that correspond to the amino acids
By operating the experiment, the software calculates the frequency counts
and the proportion values of amino acids of the gene dnaA, as shown in table
Li & Ryan-Small 9
Table 4.
This table shows the amino acid frequency counts and the amino acid proportion values of the
gene dnaA across the two species
In the first column are the synonym codes for the amino acids followed by the
second and third columns, which hold the amino acid frequency counts and
proportion values. This table demonstrates the differences between the two
species.
Furthermore, we create a graph consisting of the proportions and the oneletter synonyms in an Excel file.
Li & Ryan-Small 10
Figure 1.
This graph demonstrates the difference of amino acid proportion of the gene dnaA across the
two species. The X-axis has the amino acids (one-letter synonyms) and the Y-axis has the
amino acid proportion values
This graph shows that the amino acid proportions of the gene dnaA across the
two species are quite different. In comparing the amino acid proportions, only
one of the one-letter synonym codes has the same proportion value across the
two species, which is the synonym W. However, others have different
proportion values such as the synonyms I and N. Moreover, both the synonyms
I and N have the greatest difference in their amino acid proportions compared
Li & Ryan-Small 11
to all the other amino acids across the two species. In addition, this gene uses
all 20 amino acids.
Gene dnaN is also analyzed in this study. Both Streptococcus Mitis and
Mutans have the same number of amino acids, 379, as seen in Table 5.
Table 5.
The total number of amino acids of the gene dnaN across the two species
The frequency counts and proportion values of the gene across the two species
can be seen in Table 6, which is set up in the same format as Table 4. As
before, the amino acid frequency counts and proportion values are different
across the two species.
Table 6.
This table shows the amino acid frequency counts and the amino acid proportion values of the
gene dnaN across the two species
Li & Ryan-Small 12
Again, we create a graph consisting of the proportions and the one-letter
synonyms of dnaN in an Excel file.
Figure 2.
This graph demonstrates the difference of amino acid proportions of the gene dnaN across the two species.
The X-axis is labeled with the amino acids (one-letter synonym codons) and the Y-axis has the amino acid
proportion values
From this graph, it is observed that some of the amino acid proportion values across the
two species are the same, which are the synonyms F, M, and P. However, what is
interesting is that again, both the synonyms I and N have the greatest difference in their
amino acid proportion values compared to all of the other amino acids. Finally, this time
the gene dnaN uses only 19 codons, instead of all of the 20 amino acids.
Li & Ryan-Small 13
Discussion
Looking at the results of this program, this software is very successful because it
aids in finding the difference and similarity between the two species Streptococcus Mitis
and Mutans by calculating the frequency counts and proportions of amino acids within
the same gene across the two species and figuring the GC-percent of the two genomes. In
addition, the software allows users to compare any two genes obtained from the
corresponding protein table of any species due to the fact of the feature that enables the
user to type in the designated start and stop locations.
However, it should be noted that the program is currently designed to look at two
genes that are comparable to each other, such as two species within the same genus, and
are more likely to have the same usages of amino acids. Future studies will incorporate
the ability to look at genes that do not have the same usages of amino acids. The reason
for incorporating this new feature is that instead of omitting the amino acid that is not
used in one of the genes, the program will include that amino acid and place a zero for
the frequency count.
Furthermore, the program is now designed to work on only two genes at a time,
but future editions of the software will compare multiple genes at one time. In addition,
Li & Ryan-Small 14
future researchers can incorporate statistical analyses that show how the values from the
two species differ from each other in the context of statistics.
Overall, this program is very efficient because it not only shows the genomic
signatures of the two species and analyzes them, but also has the flexibility to interpret
any two genes that the users wish to compare.
Works Sited
Ajdic ́, D, McShan, M. W, McLaughlin, E. R, Savic ́, G, Chang, J, Carson, B. M,
Primeaux, C, Tian, Kenton, R. S, Shaoping Lin, H, J, Qian, Y, Li, S, Zhu, H,
Najar, F, Lai, H, White, J, Roe, A. B, and Ferretti, J. J. Genome Sequence of
Streptococcus Mutans UA159, a Cariogenic Dental Pathogen. October 23, 2002,
doi: 10.1073/pnas.172501299 PNAS October 29, 2002 vol. 99 no. 22 1443414439. http://www.pnas/org/content/99/22/14434.short.
Hamada S, Slade, H. D. Biology, Immunology, and Cariogenicity of Streptococcus
mutans. Microbiology and Molecular Biology Reviews. 1980, 44(2): 331.
http://mmbr.asm.org/content/44/2/331.citation.
Lamas CC, Eykyn SJ.vv, Blood culture negative endocarditis: analysis of 63 cases
presenting over 25 years. Heart. 2003 March; 89(3): 258–262
Lee HJ, Hong SH. Analysis of microRNA-size, small RNAs in Streptococcus mutans by
deep sequencing. FEMS Microbiol Lett. 2011 Oct 21.
http://www.ncbi.nlm.nih.gov/pubmed/22092283.
Patterson MJ (1996). Streptococcus. Medical Microbiology, 4th edition, Galveston (TX):
Li & Ryan-Small 15
University of Texas Medical Branch at Galveston, Chapter 13.
http://www.ncbi.nlm.nih.gov/pubmed/21413248.
Yiş R, Yüksel CN, Derundere U, Yiş, U, Meningitis and White Matter Lesions due to
Streptococcus mitis in a Previously Healthy Child, Mikrobiyol Bul. 2011
Oct;45(4):741-5. http://www.ncbi.nlm.nih.gov/pubmed/22090306
Download