Survey of the Difference in Distribution of Microsatellite DNA in Prokaryotic and Eukaryotic Cells
Chen-Hsiang Shen
Project Report

Abstract
Microsatellite DNA is a simple repeated sequence that exists in the genomes of different species. However, the function of this kind of DNA is not understood, and it is not translated into protein. This project analyzes the distribution of microsatellite DNA in a prokaryotic and a eukaryotic cell to gain a further understanding of it. The bioinformatics project includes writing a program to analyze the distribution of microsatellite DNA and using this program to survey microsatellite DNA in one prokaryote, Solibacter usitatus, and one eukaryote, Schizosaccharomyces pombe. AA and TT are the most abundant repeat units in S. pombe, and CG is the most abundant repeat unit in S. usitatus.

Introduction
Both prokaryotic and eukaryotic genomes contain microsatellite DNA, also called simple sequence repeats (SSRs). Structurally, microsatellite DNA consists of a repeating unit of one to five base pairs, expanded to a total size of 10 to 100 base pairs (Bennett 2000). Microsatellite DNA belongs to the tandem repetitive sequences (TRSs). A study of seven different species found that microsatellite DNA is present in non-coding regions (Metzgar, Bytof et al. 2000). In the past, the relevance of microsatellite DNA in genomes was vague because it does not carry any coding message; alternatively, it was regarded as a clue to evolution. However, the study of microsatellite DNA is now driven by the correlation between repetitive sequences and human diseases, such as gastric cancer (Vauhkonen, Vauhkonen et al. 2006), neuromuscular degenerative diseases (Wang 2007), and Huntington's disease (Levin, Richie et al. 2006). Even so, the function of microsatellite DNA in the genome remains controversial.
Thanks to the various genome projects, the distribution of microsatellite DNA in different species can now be analyzed easily. This project has two parts. The first is a program with three functions: one opens a DNA sequence file and prompts the user to input a specific repeat sequence; another uses the repeat sequence entered by the user to search the opened sequence; the last prints the positions of the repeat sequence in the genome and the total number of repeats in the genome. The second part is to choose one eukaryote, such as S. cerevisiae, Arabidopsis thaliana, or Drosophila, and one prokaryote, such as Escherichia coli, and to analyze the distribution pattern of microsatellite DNA in its DNA sequence. These species are model organisms in biology, and analyzing microsatellite DNA in them is expected to provide useful information about its function.

Methods or Algorithms
The Java program prompts the user for the DNA sequence file name, the output file name, and the repeat sequence file, which contains the repeat units to search for in the input DNA sequence file. It also counts the number of each of A, T, C, and G in the DNA sequence file. The repeat unit file is read with Java's readLine method: the program declares an array and, in a for loop, uses a BufferedReader to read each line and store its value in the array. The searching and counting functions use Java's Pattern and Matcher classes to find the microsatellite DNA and each base in the sequence file, with a while loop to find every occurrence of a repeat unit in the sequence file. These two functions are written as class methods, and a for loop then searches for the location of each microsatellite DNA sequence and counts it in the genomic sequence.

Once the coding was finished, the program was used to search prokaryotic and eukaryotic DNA. For the eukaryote, Schizosaccharomyces pombe chromosome I was used; S. pombe is an excellent model organism.
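The search-and-count step described above can be sketched in Java roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual source: the class name MicrosatelliteSearch and its method names are hypothetical, the sequence is assumed to be a non-empty upper-case string already held in memory, and matches are counted without overlap, which is what Matcher.find() does by default.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names are illustrative, not taken from the report.
final class MicrosatelliteSearch {

    /** Returns the 0-based start positions of non-overlapping matches of repeat. */
    static List<Integer> findPositions(String sequence, String repeat) {
        List<Integer> positions = new ArrayList<>();
        // Pattern.quote treats the repeat as a literal string, not a regex.
        Matcher matcher = Pattern.compile(Pattern.quote(repeat)).matcher(sequence);
        while (matcher.find()) {          // while loop over every occurrence
            positions.add(matcher.start());
        }
        return positions;
    }

    /** Divides each position by the sequence length, giving values in [0, 1). */
    static List<Double> normalize(List<Integer> positions, int sequenceLength) {
        List<Double> normalized = new ArrayList<>();
        for (int position : positions) {
            normalized.add((double) position / sequenceLength);
        }
        return normalized;
    }

    /** Returns the percentage of A, T, C, and G in the sequence. */
    static Map<Character, Double> baseComposition(String sequence) {
        Map<Character, Double> percentages = new LinkedHashMap<>();
        for (char base : new char[] {'A', 'T', 'C', 'G'}) {
            long count = sequence.chars().filter(c -> c == base).count();
            percentages.put(base, 100.0 * count / sequence.length());
        }
        return percentages;
    }

    public static void main(String[] args) {
        // Toy sequence for demonstration only, not real genome data.
        String sequence = "ACGTAAAAAAAATTGCACACACACGG";
        String[] repeats = {"AAAAAAAA", "ACACACAC", "CGCGCGCG"};
        for (String repeat : repeats) {   // for loop over each repeat sequence
            List<Integer> positions = findPositions(sequence, repeat);
            System.out.println(repeat + ": count=" + positions.size()
                    + ", normalized=" + normalize(positions, sequence.length()));
        }
        System.out.println("Base composition (%): " + baseComposition(sequence));
    }
}
```

Because the matches do not overlap, a run of sixteen A's counts as two occurrences of AAAAAAAA rather than nine. Whether the original program counted this way is not stated in the report, so the sketch simply follows the default behavior of Matcher.find().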
For the prokaryote, Solibacter usitatus was used; it is a soil bacterium isolated from pastureland in Victoria, Australia. Ten microsatellite sequences were used in the search. They are based on the repeat units AA, TT, AC, GT, AG, CT, AT, CG, CC, and GG, with each repeat unit repeated four times:
1. AAAAAAAA
2. TTTTTTTT
3. ACACACAC
4. GTGTGTGT
5. AGAGAGAG
6. CTCTCTCT
7. ATATATAT
8. CGCGCGCG
9. CCCCCCCC
10. GGGGGGGG
After searching for these microsatellite sequences in S. usitatus and S. pombe, Microsoft Excel was used to draw charts and to calculate the percentage of each base in the sequence file.

Results and Discussion
The location of each microsatellite DNA match is divided by the sequence length to normalize its location within the genome or chromosome. The distribution of the ten microsatellite sequences is generally linear; no clustering of microsatellite DNA in a specific region was observed. Repeat units based on AA or TT have a higher repeat rate in S. pombe chromosome I, a phenomenon also found in human chromosome 21 (Hsu, Chen et al. 2000).

[Figures 1 to 10: in each panel the vertical axis runs from 0 to 1 and the horizontal axis numbers the matches.]
Fig. 1a. AAAAAAAA distribution in S. pombe
Fig. 1b. AAAAAAAA distribution in S. usitatus
Fig. 2a. TTTTTTTT distribution in S. pombe
Fig. 2b. TTTTTTTT distribution in S. usitatus
Fig. 3a. ACACACAC distribution in S. pombe
Fig. 3b. ACACACAC distribution in S. usitatus
Fig. 4a. GTGTGTGT distribution in S. pombe
Fig. 4b. GTGTGTGT distribution in S. usitatus
Fig. 5a. AGAGAGAG distribution in S. pombe
Fig. 5b. AGAGAGAG distribution in S. usitatus
Fig. 6a. CTCTCTCT distribution in S. pombe
Fig. 6b. CTCTCTCT distribution in S. usitatus
Fig. 7a. ATATATAT distribution in S. pombe
Fig. 7b. ATATATAT distribution in S. usitatus
Fig. 8a. CGCGCGCG distribution in S. pombe
Fig. 8b. CGCGCGCG distribution in S. usitatus
Fig. 9a. CCCCCCCC distribution in S. pombe
Fig. 9b. CCCCCCCC distribution in S. usitatus
Fig. 10a. GGGGGGGG distribution in S. pombe
Fig. 10b. GGGGGGGG distribution in S. usitatus

The length of S. pombe chromosome I is 4,508,822 bp, and its base composition is 31.96% A, 32.09% T, 17.97% C, and 17.96% G. The length of the S. usitatus sequence is 10,861,681 bp, and its base composition is 17.45% A, 17.50% T, 28.47% C, and 36.57% G.

Table I. Sequence size and the percentage of each base
Species        A        T        C        G        Total
S. pombe       31.96%   32.09%   17.97%   17.96%   4,508,822 bp
S. usitatus    17.45%   17.50%   28.47%   36.57%   10,861,681 bp

Comparing the eukaryotic and the prokaryotic species, the highest counts in S.
pombe are for the microsatellite DNA based on AA and TT, 1425 and 1497 respectively. In S. usitatus, however, the pattern is completely different: the CG-based microsatellite DNA has the highest count, 1999.

[Fig. 11: chart with series S. pombe and S. usitatus; vertical axis from 0 to 2000; horizontal axis categories AA, TT, AC, GT, AG, CT, AT, CG, CC, GG.]

Future works
The Java program could be extended with three functions. The first is a graphics function, so that the program automatically generates a chart for each microsatellite sequence as it runs; the user would then not need another program to generate charts and analyze the results. The second is automatic generation of microsatellite patterns: starting from the definition, the program can use arrays and loops to generate every possible microsatellite pattern, which avoids typing errors when entering microsatellite sequences for each run. The last is a search for ORF locations: when plotting microsatellite locations in genomic DNA, it is useful to compare them with gene locations. On the biology side, genome sequencing of thousands of species is in progress, and DNA sequences are easy to obtain from the NCBI website (http://www.ncbi.nlm.nih.gov/sites/entrez). Testing the location of microsatellite DNA widely across genomic DNA might yield more hints about its biological function.

Conclusion
The presence of microsatellite DNA in both prokaryotic and eukaryotic cells suggests that it has some meaning beyond serving as a marker of evolution and of human disease. This study shows that the 10 microsatellite sequences are linearly distributed in genomic DNA. However, the functions, meaning, and importance of microsatellite DNA require further research.

References
Bennett, P. (2000). "Demystified ... microsatellites." Mol Pathol 53(4): 177-83.
Hsu, C.-M., W.-H. Chen, et al. (2000). "Differential Distribution of Dinucleotide Repeat Length and Frequency in Human Chromosome 21." Tzu Chi Med J 12(2): 6.
Levin, B. C., K. L. Richie, et al.
(2006). "Advances in Huntington's disease diagnostics: development of a standard reference material." Expert Rev Mol Diagn 6(4): 587-96.
Metzgar, D., J. Bytof, et al. (2000). "Selection against frameshift mutations limits microsatellite expansion in coding DNA." Genome Res 10(1): 72-80.
Vauhkonen, M., H. Vauhkonen, et al. (2006). "Pathology and molecular biology of gastric cancer." Best Pract Res Clin Gastroenterol 20(4): 651-74.
Wang, Y. H. (2007). "Chromatin structure of repeating CTG/CAG and CGG/CCG sequences in human disease." Front Biosci 12: 4731-41.