Survey of the difference in distribution of microsatellite DNA in prokaryotic
and eukaryotic cells
Chen-Hsiang Shen
Project Report
Abstract
Microsatellite DNA consists of simple repeated sequences that occur in the genomes of many
species. However, the function of this kind of DNA is not understood, and it is not translated
into protein. This project analyzes the distribution of microsatellite DNA in a prokaryotic and
a eukaryotic cell to gain further understanding of it. The bioinformatics project consists of
writing a program to analyze the distribution of microsatellite DNA and using that program to
survey the microsatellite DNA in one prokaryote, Solibacter usitatus, and one eukaryote,
Schizosaccharomyces pombe. AA and TT are the most abundant repeat units in S. pombe, and
CG is the most abundant repeat unit in S. usitatus.
Introduction
Both prokaryotic and eukaryotic genomes contain microsatellite DNA, also called simple
sequence repeats (SSRs). Structurally, microsatellite DNA has a repeating unit of one to five
base pairs and a total length of 10 to 100 base pairs (Bennett 2000). Microsatellite DNA
belongs to the tandem repetitive sequences (TRSs). A study of seven species found that
microsatellite DNA is located in non-coding regions (Metzgar, Bytof et al. 2000). In the past,
the relevance of microsatellite DNA in genomes was unclear because it carries no coding
information. Alternatively, microsatellite DNA has been regarded as a clue to evolution. The
study of microsatellite DNA is now driven by the correlation between repetitive sequences and
human disease, such as gastric cancer (Vauhkonen, Vauhkonen et al. 2006), neuromuscular
degenerative diseases (Wang 2007), and Huntington's disease (Levin, Richie et al. 2006). Even
so, the function of microsatellite DNA in the genome is still controversial. Thanks to the
various genome projects, the distribution of microsatellite DNA in different species can now
be analyzed easily.
This project has two parts. The first is a program with three functions. One function opens
a DNA sequence file and prompts the user to supply the repeat sequences of interest. Another
function searches the opened sequence for the repeat sequences the user supplied. The last
function prints the position of each repeat sequence in the genome and the total number of
repeats in the genome. The second part is to choose one eukaryote, such as S. cerevisiae,
Arabidopsis thaliana, or Drosophila, and one prokaryote, such as Escherichia coli, and to
analyze the distribution pattern of microsatellite DNA in its DNA sequence. These species are
model organisms in biology, so analyzing their microsatellite DNA is expected to provide
useful information about the function of microsatellite DNA.
Methods or Algorithms
The Java program prompts the user for the DNA sequence file name, the output file name, and
the repeat sequence file, which contains the repeat units to search for in the input DNA
sequence file. It also counts the number of A, T, C, and G bases in the DNA sequence file.
The repeat unit file is read with Java's readLine method. The program first declares an
array and then, in a for loop, uses a BufferedReader to read each line and store its value in
the array.
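The reading step might look roughly like the sketch below. It is not the report's code: the class and method names are hypothetical, a List with a while loop replaces the array and for loop described above, and a StringReader stands in for the repeat file so the example runs without any input file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not taken from the report.
class RepeatUnitReader {
    // Reads one repeat sequence per line, trimming whitespace and skipping blank lines.
    static List<String> readRepeatUnits(BufferedReader reader) {
        List<String> units = new ArrayList<>();
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String unit = line.trim().toUpperCase(Locale.ROOT);
                if (!unit.isEmpty()) {
                    units.add(unit);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return units;
    }

    public static void main(String[] args) {
        // Assumption: the repeat file holds one sequence per line.
        // A real run would wrap a FileReader instead of this StringReader.
        String fileContents = "AAAAAAAA\nTTTTTTTT\n\nACACACAC\n";
        BufferedReader reader = new BufferedReader(new StringReader(fileContents));
        System.out.println(readRepeatUnits(reader));
    }
}
```

A List avoids having to know the number of repeat units before reading the file, which an array would require.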
The searching and counting functions use Java's Pattern and Matcher classes to find each
microsatellite DNA and each base in the sequence file, with a while loop that continues until
every occurrence of the repeat unit in the sequence file has been found. These two functions
are written as class methods; a for loop then calls them to locate and count each
microsatellite DNA sequence in the genomic sequence.
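A minimal sketch of such a search with Pattern and Matcher follows. The class name, method name, and test sequence are hypothetical. Note that Matcher.find() resumes after the end of the previous match, so the sketch counts non-overlapping occurrences; the report does not say whether overlapping matches were counted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not taken from the report.
class RepeatSearch {
    // Returns the 0-based start index of every non-overlapping match of repeat in sequence.
    static List<Integer> findPositions(String sequence, String repeat) {
        List<Integer> positions = new ArrayList<>();
        // Pattern.quote treats the repeat as literal text rather than as a regex.
        Matcher matcher = Pattern.compile(Pattern.quote(repeat)).matcher(sequence);
        while (matcher.find()) {
            positions.add(matcher.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        String sequence = "GGACACACACTTAAAAAAAACG"; // made-up test sequence
        for (String repeat : new String[] {"ACACACAC", "AAAAAAAA", "CGCGCGCG"}) {
            List<Integer> positions = findPositions(sequence, repeat);
            System.out.println(repeat + " count=" + positions.size()
                    + " positions=" + positions);
        }
    }
}
```

The size of the returned list gives the total count for a repeat, and the list itself gives the locations to plot.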
Once the code was finished, the program was used to search prokaryotic and eukaryotic DNA.
For the eukaryote, chromosome I of Schizosaccharomyces pombe, an excellent model organism,
was used. For the prokaryote, Solibacter usitatus, a soil bacterium isolated from pastureland
in Victoria, Australia, was used. The search used 10 microsatellite DNA sequences, each
formed by repeating one of the units AA, TT, AC, GT, AG, CT, AT, CG, CC, and GG four times:
1. AAAAAAAA
2. TTTTTTTT
3. ACACACAC
4. GTGTGTGT
5. AGAGAGAG
6. CTCTCTCT
7. ATATATAT
8. CGCGCGCG
9. CCCCCCCC
10. GGGGGGGG
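The ten search sequences above can be derived from their two-base units rather than typed by hand. The sketch below assumes Java 11 or later for String.repeat; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the report.
class RepeatPatterns {
    // Builds one search sequence per unit by concatenating the unit `copies` times.
    static List<String> expand(List<String> units, int copies) {
        List<String> patterns = new ArrayList<>();
        for (String unit : units) {
            patterns.add(unit.repeat(copies)); // String.repeat requires Java 11+
        }
        return patterns;
    }

    public static void main(String[] args) {
        List<String> units =
                List.of("AA", "TT", "AC", "GT", "AG", "CT", "AT", "CG", "CC", "GG");
        List<String> patterns = expand(units, 4);
        for (int i = 0; i < patterns.size(); i++) {
            System.out.println((i + 1) + ". " + patterns.get(i));
        }
    }
}
```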
After searching for these microsatellite DNAs in S. usitatus and S. pombe, Microsoft Excel
was used to draw the charts and to calculate the percentage of each base in the sequence file.
Results and Discussion
Each microsatellite DNA location is divided by the sequence length to normalize its position
within the genome or chromosome. The distribution of each of the ten microsatellite DNAs is
generally linear, and no clustering of microsatellite DNA in a specific region was observed.
Repeats based on AA or TT occur more often than the others in S. pombe chromosome I; the same
phenomenon has been reported for human chromosome 21 (Hsu, Chen et al. 2000).
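The normalization described above is a single division per match. A minimal sketch, with hypothetical names and made-up positions, is:

```java
// Hypothetical helper, not taken from the report.
class PositionNormalizer {
    // Divides each match position by the sequence length, giving values from 0 up to 1.
    static double[] normalize(int[] positions, int sequenceLength) {
        if (sequenceLength <= 0) {
            throw new IllegalArgumentException("sequenceLength must be positive");
        }
        double[] normalized = new double[positions.length];
        for (int i = 0; i < positions.length; i++) {
            normalized[i] = (double) positions[i] / sequenceLength;
        }
        return normalized;
    }

    public static void main(String[] args) {
        int[] positions = {0, 50, 100, 150}; // made-up match positions
        for (double value : normalize(positions, 200)) {
            System.out.println(value);
        }
    }
}
```

Plotting these values in order of occurrence gives a straight line when the repeats are spread evenly along the sequence, while a cluster would show up as a flat stretch.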
Fig. 1. AAAAAAAA distribution in (a) S. pombe and (b) S. usitatus
Fig. 2. TTTTTTTT distribution in (a) S. pombe and (b) S. usitatus
Fig. 3. ACACACAC distribution in (a) S. pombe and (b) S. usitatus
Fig. 4. GTGTGTGT distribution in (a) S. pombe and (b) S. usitatus
Fig. 5. AGAGAGAG distribution in (a) S. pombe and (b) S. usitatus
Fig. 6. CTCTCTCT distribution in (a) S. pombe and (b) S. usitatus
Fig. 7. ATATATAT distribution in (a) S. pombe and (b) S. usitatus
Fig. 8. CGCGCGCG distribution in (a) S. pombe and (b) S. usitatus
Fig. 9. CCCCCCCC distribution in (a) S. pombe and (b) S. usitatus
Fig. 10. GGGGGGGG distribution in (a) S. pombe and (b) S. usitatus
The length of S. pombe chromosome I is 4,508,822 bp, and its base composition is 31.96% A,
32.09% T, 17.97% C, and 17.96% G. The S. usitatus sequence is 10,861,681 bp long, and its
base composition is 17.45% A, 17.50% T, 28.47% C, and 36.57% G.
              A        T        C        G        Total
S. pombe      31.96%   32.09%   17.97%   17.96%   4,508,822 bp
S. usitatus   17.45%   17.50%   28.47%   36.57%   10,861,681 bp
Table I. Sequence size and the percentage of each base
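Percentages like those in Table I could also be computed inside the program instead of in Excel. The sketch below is one way to do it, not the report's method: it counts bases with a plain loop rather than Pattern and Matcher, and it assumes that characters other than A, T, C, and G (for example N) are left out of the total. The names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not taken from the report.
class BaseComposition {
    private static final String BASES = "ATCG";

    // Returns the percentage of A, T, C, and G among the recognized bases of the sequence.
    static Map<Character, Double> percentages(String sequence) {
        long[] counts = new long[BASES.length()];
        long total = 0;
        for (int i = 0; i < sequence.length(); i++) {
            int index = BASES.indexOf(Character.toUpperCase(sequence.charAt(i)));
            if (index >= 0) { // assumption: any other character is skipped
                counts[index]++;
                total++;
            }
        }
        Map<Character, Double> result = new LinkedHashMap<>();
        for (int i = 0; i < BASES.length(); i++) {
            result.put(BASES.charAt(i), total == 0 ? 0.0 : 100.0 * counts[i] / total);
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sequence; a real run would pass the contents of the sequence file.
        System.out.println(percentages("AATCGGGG"));
    }
}
```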
Comparing the eukaryote with the prokaryote, the highest counts in S. pombe are for the
microsatellite DNAs based on AA and TT, with 1425 and 1497 occurrences respectively. In S.
usitatus the pattern is completely different: the CG-based microsatellite DNA has the highest
count, with 1999 occurrences.
Fig. 11. Count (axis from 0 to 2000) of each microsatellite DNA, by repeat unit AA, TT, AC,
GT, AG, CT, AT, CG, CC, and GG, in S. pombe and S. usitatus
Future works
The Java program could be extended with three functions. The first is a graphics function,
so that the program automatically generates a chart for each microsatellite DNA while it runs
and the user does not need another program to generate charts and analyze the results. The
second is automatic generation of the microsatellite DNA patterns. Working from the
definition, the program can use arrays and loops to generate every possible microsatellite
DNA pattern, which avoids typing errors when the patterns are entered for each run. The third
is a search for ORF locations: when plotting microsatellite locations in genomic DNA, it is
useful to compare them with gene locations.
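As an illustration of the second extension, the sketch below enumerates every repeat unit of a given length with nested loops. It is only a starting point under stated assumptions: the names are hypothetical, and the base order A, T, C, G is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the report.
class UnitGenerator {
    private static final char[] BASES = {'A', 'T', 'C', 'G'};

    // Returns all 4^length strings over A, T, C, G of the given length.
    static List<String> allUnits(int length) {
        List<String> units = new ArrayList<>();
        units.add("");
        for (int i = 0; i < length; i++) {
            List<String> next = new ArrayList<>();
            for (String prefix : units) {
                for (char base : BASES) {
                    next.add(prefix + base);
                }
            }
            units = next;
        }
        return units;
    }

    public static void main(String[] args) {
        List<String> units = allUnits(2);
        System.out.println(units.size() + " units: " + units);
    }
}
```

For two-base units this yields 16 candidates rather than the 10 used in this report. Units such as AC and CA describe the same tandem repeat shifted by one base, so a fuller version would probably merge such rotations before searching.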
On the biology side, genome sequencing is in progress for thousands of species, and DNA
sequences are easy to obtain from the NCBI website (http://www.ncbi.nlm.nih.gov/sites/entrez).
Testing the location of microsatellite DNA in the genomic DNA of many more species might
provide more hints about its biological function.
Conclusion
The presence of microsatellite DNA in both prokaryotic and eukaryotic cells suggests that it
has some meaning beyond serving as a marker of evolution and of human disease. This research
shows that the 10 microsatellite DNAs studied are linearly distributed in genomic DNA.
However, the functions, meaning, and importance of microsatellite DNA need further research.
References
Bennett, P. (2000). "Demystified ... microsatellites." Mol Pathol 53(4): 177-83.
Hsu, C.-M., W.-H. Chen, et al. (2000). "Differential Distribution of Dinucleotide Repeat
Length and Frequency in Human Chromosome 21." Tzu Chi Med J 12(2): 6.
Levin, B. C., K. L. Richie, et al. (2006). "Advances in Huntington's disease diagnostics:
development of a standard reference material." Expert Rev Mol Diagn 6(4): 587-96.
Metzgar, D., J. Bytof, et al. (2000). "Selection against frameshift mutations limits
microsatellite expansion in coding DNA." Genome Res 10(1): 72-80.
Vauhkonen, M., H. Vauhkonen, et al. (2006). "Pathology and molecular biology of gastric
cancer." Best Pract Res Clin Gastroenterol 20(4): 651-74.
Wang, Y. H. (2007). "Chromatin structure of repeating CTG/CAG and CGG/CCG sequences
in human disease." Front Biosci 12: 4731-41.