Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002)
Assembly and Gene Identification of Contigs of
Synechococcus sp. strain PCC 7002 Genome
Wang Zhu
College of Life Science, Peking University
Abstract
In this project, we aim to sequence and annotate the genome of a cyanobacterium, Synechococcus
sp. strain PCC 7002. Prokaryotic genomes are currently sequenced mostly by the shotgun
approach. We obtained 1-3 kb sequence reads from the Huada gene center; cosmid end
sequences were also used. We used the Phredphrap software package to assemble these
reads and have reduced the number of contigs to 242. The final goal is to construct a whole
genome sequence of several megabases with an error rate lower than 1 per 10000 nucleotides. The
largest contigs were analyzed with the program GeneMarkS to predict the genes in them and to
locate frame shift errors for correction. We also wrote a tool that extracts the predicted gene
sequences from the GeneMarkS report so that they can be studied further.
Introduction
The cyanobacteria are believed to be of very ancient origin and to be the ancestors of present-day
chloroplasts. It is therefore of great interest to analyze the structure and organization of genes in
these organisms. Synechococcus sp. strain PCC 7002 is a unicellular cyanobacterium. The genomes
of two other cyanobacteria, PCC 6803 and PCC 7120, have already been sequenced and are
available in public databases. The PCC 7002 genome is predicted to be 2.8 Mb. Knowing its whole
genome sequence may provide a basis for studies of metabolism and photosynthesis. At present, the
most widely used strategy for sequencing a microbial genome is whole-genome
shotgun sequencing. A large number of clones, from libraries representative of the whole
genome, are sequenced and assembled into contigs. The contigs are then linked together by a
variety of methods to obtain the whole genome sequence as a single contig. The process can be
divided roughly into several stages, including library construction, read assembly and the
closure phase, as shown in Figure 1.
Fig 1. Outline of genome sequencing, assembly and annotation
Both small-insert (1-2 kb) libraries and large-insert (20-300 kb) libraries such as BAC (bacterial
artificial chromosome) libraries are needed to ensure appropriate coverage of the genome and to obtain a
‘scaffold’ of the genome, which is used during the closure phase. The small-insert library data for
PCC 7002 came from the Huada gene center. The BAC library is being constructed in our lab.
The assembly phase is composed of three major steps: the conversion of the data from automated
sequencers to nucleotide sequences, the use of these sequences in the assembly process, and
the continuous assessment of this assembly process. Various software tools exist for such
work, and we chose the Phredphrap package. It is based on algorithms that perform
pairwise comparisons of all sequences and select thresholds automatically when
deciding whether two sequences overlap. Clusters of overlapping sequences are
constructed, and consensus sequences are deduced from these clusters. The assembly result file in
Phrap format can be viewed with the program Consed. In Consed, contigs are shown together with the
reads that compose them. It is easy to find and edit low-quality sequences, and the assembly
results can be assessed visually.
The goals of annotation include detecting and describing the protein-coding sequences, determining the
structure of these genes (including untranslated regions and control elements), and comparing
the sequences being analyzed with sequences available from public databases
at either the nucleic acid or the protein level. As soon as contigs have been constructed, preliminary
annotation can be carried out. In this research, we used the program GeneMarkS, which is based on
a heuristic Markov model, to predict genes in contigs and to help correct frame shift errors.
Methods and Results
Ⅰ. Assembly
1. Base calling
The operating system is Unix/Linux. The reads are stored as chromat files in the
directory “chromat_dir”. One example of such a file is shown in Figure 2.
Fig 2. Part of a “.abi” chromat file viewed with Consed
Phred uses simple Fourier methods to examine the four base traces in the region surrounding each
point in the data set in order to predict a series of evenly spaced peak locations. Next,
phred finds the centers of the actual (observed) peaks, and a dynamic programming algorithm is
used to match the observed peaks with the predicted peak locations found in the first step. Phred
then assigns each called base a quality value (QV) based on the surrounding trace. The quality value is
related to the base call error probability (P_e) by the formula:
P_e = 10^{-QV/10}
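As a quick numerical check of this relation, the following minimal Python sketch converts quality values to error probabilities and back. The function names are chosen here for illustration and are not part of phred.

```python
import math

def qv_to_error_prob(qv):
    """Error probability P_e = 10 ** (-QV / 10)."""
    return 10 ** (-qv / 10)

def error_prob_to_qv(p_e):
    """Inverse relation: QV = -10 * log10(P_e)."""
    return -10 * math.log10(p_e)

if __name__ == "__main__":
    # A few quality values of the kind seen in a .phd file
    for qv in (11, 20, 34, 42):
        print(qv, qv_to_error_prob(qv))
```

A quality value of 20 corresponds to one expected error per 100 bases, and a value of 40 to one per 10000 bases, which is the error rate targeted for the finished genome.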
Run phred with the options:
% phred -id chromat_dir -pd phd_dir
which causes phred to read the chromat files in “chromat_dir” and write the converted “.phd” files
to “phd_dir”. Each “.phd” file begins with comments on the file conversion process, followed by the
base information, as shown in Table 1.
<base>   <quality>   <position in chromat>
c        34          2622
a        42          2634
a        42          2647
t        33          2660
a        31          2674
t        33          2685
t        11          2698
g        11          2711
Tab 1. Part of the file “s_14291.y1.abi.phd” showing phd format
Then run the phd2fasta program to make FASTA files. Running
% phd2fasta -id phd_dir -os seqs_fasta -oq seqs_fasta.screen.qual
creates two files. The file “seqs_fasta” holds all the sequences in FASTA format (shown in Table
2), and the file “seqs_fasta.screen.qual” holds the quality value of each base in all the sequences.
>001-h10.urt.abi CHROMAT_FILE: 001-h10.urt.abi PHD_FILE: 001-h10.urt.abi.phd.1 CHEM: term DYE: big TIME: Sun Jun 2 17:06:07 2002 TEMPLATE: 001-h10 DIRECTION: fwd
GGCGGCGGCTTTGGCGGGCTTGAGTTGGGGGAGGGTTTTTTCTTGGAGGG
TGAGTTTTTTTTCTTCGAGGCGCAGCAGCTGTTGCTGGTTGACTTGTACC
GGTTCTAGCAGTTGGCGGTGGAGCTGGAGCGATCGCCGTTTTTCTTTCTC
TAATTCCCGTTGCCGGTGGCCCCATTGGTGCCGTTTTGCCTTTTGGGTGG
CGAGTTTGGACGCAACCGAAAGTTGTTCATCTTCCCCCAGGGCTTTTACC
TTGGCGTTCAAACTTTCTAGGGTTTGGCCAGTCTGAAAAATTTCCCCATC
CAAAGCGTTCAAATTTTGAGTTAACTGGGTGCGTTCCTGGGCGCTGCTGG
CGAGATCCCGTTGCAAGTTTTCCTCTTGAAGTTGGAGCGATCGCCATGTT
AAAAGAACTTTTTCCTGCTTGTTGGCGGCCAGGTCAATTTTGAGC
>001-h11.uft.abi ………...
Tab 2. Part of the file “seqs_fasta” showing FASTA format
2. Vector screening
All the reads contain vector sequence at the two ends of the insert sequence. This vector sequence
must be screened out before assembly, which is done with the program cross_match:
% cross_match seqs_fasta vector.seq -minmatch 12 -minscore 20 -screen > screen.out
The file vector.seq, also in FASTA format, contains all the vector sequences we want to screen for
(pUC19, pBluSKM, pBluSKP). The “-screen” option causes a file named “seqs_fasta.screen” to
be created, containing vector-masked versions of the original sequences. This “.screen” file is
what is later provided as input to Phrap. The output file “screen.out” lists the matches that were
found.
3. Assembly
The program phrap, like cross_match, is based on the Smith-Waterman algorithm (SWAT). It
scores pairwise alignments and constructs contig sequences as a mosaic of the highest-quality parts
of the reads.
Run phrap to perform the sequence assembly as follows:
% phrap seqs_fasta.screen -ace > phrap.out
Phrap writes the assembled contigs to the file “seqs_fasta.screen.contigs” (shown in Table 3) and
creates a “seqs_fasta.screen.contigs.ace” file that can be used to import the assembly into
Consed for assessment and editing. The assembly output information is written to the file
“phrap.out”.
>seqs_fasta.screen.Contig61
CACCCCGTAAGAGTGACCAGTGGAACGGTCAAAAAATTATGCGTGATCGC
CGCATTTCAATTACTTTTAGAAAAGTGATTATTTAGAAAGTGTTTTTATT
TAAAATCATTATTAATCTTGTCTGATGCAATGTTTTGAGTAATCTTTAAT
TATTTTTTGGCCATGCAAATACCAATTTCACCACGTCCTAAATATTATCC
AGTGAAACTTGAGTTTCCTAATCCTGTGACTCATAAATCTATCCTCTCCC
AGCAGCAGATTTCTAATAAAGCTTTTTTTCT………………………
Tab 3. Part of the file “seqs_fasta.screen.contigs”
4. Viewing with Consed
We used the program Consed to view the assembly results generated by phrap (shown in Figure 3).
The consensus is on the top line, and the reads that match it are listed below. The darker regions of a read
are the low-quality areas, and erroneous bases are marked in red. When a chromat file (“.abi”) is
displayed in Consed, erroneous bases in a sequence can be changed by clicking on them with the middle
mouse button and then editing them.
Fig 3. Viewing assembly results with Consed
Ⅱ. Results
In practice, all the assembly steps can be run through one combined script, phredphrap:
% phredphrap &
Before running it, we edited the file “phredphrap” and changed some parameters to meet our
needs. The parameters are listed below (their meaning is discussed later):
-trim_qual 20
-trim_start 10
-repeat_stringency 0.95
-forcelevel 1
-bypasslevel 1
-maxgap 35
-minmatch 12
-minscore 30
-maxmatch 30
-vector_bound 50
-max_subclone_size 8000
We submitted 26911 entries for assembly and obtained 242 contigs in total (see Figure 4). The average quality
value of these entries was 25.0. About 62.57% of them came from bidirectional clones and 33.10% from
unidirectional clones; 2.49% were walking sequences, 1.36% were cosmid end sequences and the remaining
0.48% were genes from EMBL and GenBank. The average full length of the entries was 801.4 bp,
which was reduced to 657.2 bp after trimming (removal of the low-quality bases at the beginning). The average quality
value of the consensus sequences is 48.0 per base. The number of confirmed reads is 25494; the
remainder are singlets that could not be linked to any other read. The depth of coverage reached 5.8.
Our preliminary estimate of the size of the Synechococcus sp. strain PCC 7002 genome is
2809470 bp.
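The reported depth of coverage can be cross-checked roughly from the figures above. The sketch below assumes that depth is simply the total trimmed read length divided by the genome size; the exact accounting behind the reported value of 5.8 is not stated, so a small difference is to be expected. The function and constant names are illustrative.

```python
def depth_of_coverage(n_reads, mean_read_len_bp, genome_size_bp):
    """Approximate depth: total sequenced bases / genome size."""
    return n_reads * mean_read_len_bp / genome_size_bp

GENOME_SIZE_BP = 2_809_470      # preliminary genome size estimate from the text
MEAN_TRIMMED_LEN_BP = 657.2     # average entry length after trimming
CONFIRMED_READS = 25_494        # reads reported as confirmed
ALL_ENTRIES = 26_911            # all entries submitted for assembly

if __name__ == "__main__":
    for n in (CONFIRMED_READS, ALL_ENTRIES):
        print(n, round(depth_of_coverage(n, MEAN_TRIMMED_LEN_BP, GENOME_SIZE_BP), 2))
```

Under this assumption the confirmed reads alone give a depth of about 6.0 and all entries about 6.3, close to the reported 5.8.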
Fig 4. Composition of contigs and the size of the largest ones (labelled in the figure: 327789bp, 10%; 155659bp, 5%; 121021bp, 4%)
Ⅲ. Gene Identification
1. Running GeneMarkS
The largest contig obtained from phredphrap is now 328 kb long. Such large contigs can serve as the
substrate for annotation. Gene identification is the first step of annotation, and we used the program
GeneMarkS to predict the genes in them. GeneMarkS combines an improved version of the gene-finding
program GeneMark.hmm, heuristic Markov models of coding and non-coding regions, and the
Gibbs sampling multiple alignment program. It is especially useful for a newly sequenced
prokaryotic genome for which no protein or rRNA genes are known in advance.
We submitted the contig sequences online at the website:
http://opal.biology.gatech.edu/GeneMark/genemarks.cgi
The results were returned by email. For each contig, we received a postscript graphics file
showing the predicted genes (see Figure 5) and a text file containing the begin and end
locations of the genes. Each plateau in the postscript graphics indicates the extent of a gene.
Genes can be found on both the direct and the complementary strand, and thus in
frames 1, 2, 3 or -1, -2, -3.
Fig 5. Predicted genes demonstrated by GeneMarkS
2. Frame shift correction
Besides gene identification, we found the GeneMarkS results very useful for identifying
frame shifts. A coding sequence can be translated in three frames, so any base
deletion or insertion may cause the translation to change from one frame to another, i.e. a frame shift.
A frame shift can easily be identified in the postscript graphics file, usually at the site where an
arrow points from one plateau to an adjacent plateau in another frame. One example is shown in
Figure 6.
Fig 6. Frame shift shown in postscript graphics
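To illustrate why a single missing base appears as a jump between frames, the short Python sketch below splits a made-up sequence into codons before and after one base is deleted. The sequence and the helper name are hypothetical and serve only as an illustration; they are not taken from the contig.

```python
def codons(seq, frame=1):
    """Split seq into complete codons, reading in frame 1, 2 or 3."""
    start = frame - 1
    return [seq[i:i + 3] for i in range(start, len(seq) - 2, 3)]

ORIGINAL = "ATGGCCAAAGGGTTTTAA"           # made-up coding sequence
DELETED = ORIGINAL[:7] + ORIGINAL[8:]     # same sequence with one base removed

if __name__ == "__main__":
    print(codons(ORIGINAL))               # intact reading frame
    print(codons(DELETED))                # frame 1 is disrupted after the deletion
    print(codons(DELETED, frame=3))       # downstream codons reappear in frame 3
```

After the deletion, the codons downstream of the error no longer appear in frame 1 but can be read in frame 3, which is the kind of frame jump that the GeneMarkS graphics reveal.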
We reexamined the target sequence in Consed (see Figure 7), and found one possible base deletion
at site 93432. File “s_6377.y1.abi” suggested a “c” here, but the consensus omitted it.
Fig 7. Reexamine the sequence at possible frame shift site
Then we checked the chromat files “s_6377.y1.abi” and “s_6849.g1.abi” and judged that there
should be a “c” at site 93432 (as shown in Figure 8). So a “c” was added to sequence
“s_6849.g1.abi” and thus to the consensus. The frame shift was corrected.
Fig 8. Confirmation and Correction of frame shift
Then we ran GeneMarkS again and found the frame shift error was indeed corrected (see Figure
9).
Fig 9. Frame shift error was corrected in the new postscript graphics
Running blastx against the PCC 6803 genome database also revealed the frame shift (see Table 4,
before correction). After the correction, we ran blastx again; this time the gene was complete and
matched very well (see Table 4, after correction).
                   E-value    Frame  Start site  End site  Match start  Match end  AA length
Before correction  3.00E-67   1      92056       93432     3            460        493
                   3.00E-67   3      93435       93527     462          492        493
After correction   3.00E-73   1      92054       93526     3            492        493
Tab 4. Comparison between before and after correction of the frame shift
With this method, most of the frame shift errors in contigs can be detected and corrected.
Sometimes it is difficult to decide confidently where the frame shift occurs because the
examined sequence is of low quality. In that case, we should not rush to correct the error according to what
we expect the sequence to be. More data should be added before the site is reconsidered.
3. Information extraction
Once the errors had been reduced to a minimum, the genes could be extracted from the contig
sequences and translated into protein sequences. We wrote a program called “extract” to extract all the
gene sequences from the contigs using their begin and end locations, and another
program called “translate” to translate these gene sequences into protein sequences. Both were
written in Perl. Because of the length limit of this paper, the program code is not listed here.
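Since the Perl programs are not listed, the following Python sketch shows one way the two steps could be carried out. It is not the authors' code: the function names, the argument layout and the example sequence are assumptions made for illustration, with coordinates taken as 1-based and inclusive on the direct strand.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code listed in TCAG order; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def extract_gene(contig, start, end, strand):
    """Cut out a gene given 1-based inclusive coordinates (assumed layout).

    strand is '+' for the direct strand and '-' for the complementary one.
    """
    sub = contig[start - 1:end]
    return sub if strand == "+" else reverse_complement(sub)

def translate(gene):
    """Translate codon by codon up to the first stop codon.

    Codons containing unknown characters are written as 'X'.
    """
    protein = []
    gene = gene.upper()
    for i in range(0, len(gene) - 2, 3):
        aa = CODON_TABLE.get(gene[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

if __name__ == "__main__":
    contig = "CCATGGCCAAATAAGG"            # made-up contig
    gene = extract_gene(contig, 3, 14, "+")
    print(gene, translate(gene))
```

The sketch translates every codon literally, so an alternative start codon such as GTG or TTG would be written as V or L rather than M and would need separate handling.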
4. Analysis of contig 241
We chose the second largest contig, contig 241 (156 kb), for analysis. After running
GeneMarkS and correcting frame shift errors (about one error per 10 kb), we found
about 142 genes in this contig (one gene per 1096 bp, counting both strands). The average length of
the genes is 918 bp, so the predicted genes cover roughly 84% of the contig if overlaps are
ignored (142 × 918 bp is about 130 kb of 156 kb). Coding sequences thus make up most of this cyanobacterial
genome, and genes are arranged consecutively with very short gaps between them. The
contig was then searched with blastx against the PCC 6803 protein database, using a threshold e-value
of 1E-20. Part of the blastx results is shown in Table 5. Comparing
these proteins with the genes predicted by GeneMarkS, we found that 102 genes share
homology with 6803 proteins, while the other 40 genes appear to be unique to 7002 by this criterion. These unique genes may
be of special interest for the morphology and metabolism of 7002 and need further experiments to
reveal their structure and function.
Tab 5. Part of results of the contig 241 blastx against PCC 6803
Query  Query  Frame  e-value  ORF No   Product                                                     Genetic  Length  Sbjct  Sbjct
init   end                                                                                         symbol   (aa)    init   end
202    1128   1      1E-126   sll1598  Mn transporter MntC                                         mntC     330     15     326
1189   1911   1      2E-24    sll0385  ABC transporter                                             -        284     42     276
1207   1851   1      8E-38    slr2044  ABC transporter                                             -        289     21     231
1210   1941   1      1E-110   sll1599  Mn transporter MntA                                         mntA     260     9      252
1213   1914   1      4E-23    sll0489  ABC transporter                                             -        342     2      224
1213   1911   1      1E-30    slr1318  iron(III) dicitrate transport system permease protein FecE  fecE     268     10     244
1213   1839   1      5E-21    sll1453  nitrate transport protein NrtD                              nrtD     332     19     225
1261   1845   1      4E-21    sll0778  ABC transporter                                             -        790     244    433
1980   2780   3      6E-22    slr2045  hypothetical protein                                        -        281     10     277
1998   2810   3      1E-107   sll1600  Mn transporter MntB                                         mntB     306     12     282
3037   3639   1      5E-33    slr0006  -                                                           -        217     18     217
5280   4426   -1     4E-77    slr1559  shikimate 5-dehydrogenase                                   aroE     290     4      288
6074   5361   -2     2E-55    sll1123  -                                                           -        256     15     255
7233   6142   -1     1E-142   sll0245  hypothetical protein                                        -        363     1      363
9115   8153   -3     3E-32    slr1225  protein kinase PknA                                         pknA     495     3      344
9118   8243   -3     4E-32    slr1697  eukaryotic protein kinase                                   pknA     574     1      296
12603  13283  3      1E-24    slr1113  ABC transporter                                             -        332     11     244
Discussion
1. We often received new sequence data not as chromat files but as text files containing the
sequences. In such cases, we first edited the files in Word and saved them in FASTA format.
The program SeqVerter, which can conveniently merge several sequence files into one or split one
file into several, may be especially useful for producing the proper file format. Next we used the
program mktrace to create an idealized chromat file (“.scf”), for example:
% mktrace 3e7-3-cos 3e7-3-cos.scf
To batch-process a directory of files, we wrote a simple shell script containing the line “mktrace $i $i.scf”
and saved the new chromat files in the directory “chromat_dir”.
With these chromat files in hand, we could simply run phredphrap again, and the new sequence
information was added to the assembly.
In fact, any FASTA file can be processed with mktrace to create a chromat file, which can then be
viewed and edited in Consed.
2. Many of the command line parameters of phrap can greatly affect the assembly results.
For instance, -minmatch sets the minimum length of the matching word needed to nucleate a SWAT
comparison. Increasing -minmatch can dramatically decrease the time required for the pairwise
sequence comparisons; it also tends to increase assembly stringency. For
example, when we changed this parameter from 12 to 14, the number of contigs increased from
242 to 253. However, a higher value may cause some significant matches to be missed. Parameter
-maxmatch sets the maximum length of the matching word. Parameter -minscore sets the minimum
alignment score. Phrap scores pairwise sequence alignments as follows: matching residues receive
a reward of +1, mismatches a penalty of -2, gap-opening residues a penalty of -4, and
gap-extension residues a penalty of -3. The score of an alignment must be above -minscore
before the sequences can be joined. Parameter -vector_bound sets the number of
potential vector bases at the beginning of each read. Matches that lie entirely within this region are
assumed to represent vector matches and are ignored. Parameter -max_subclone_size sets the
maximum subclone size used when checking forward-reverse read pairs. Parameter -trim_start sets the number of bases to be
removed at the beginning of each read, as these bases are often of low quality. Parameter
-repeat_stringency controls the stringency of match required for joins, and -forcelevel and
-bypasslevel regulate stringency during the final contig merge pass (0 is the most stringent and 10 the
least). We tested different values of these parameters and ran phredphrap after each
change. The current parameters were chosen to balance minimizing the number of contigs
against avoiding false joins.
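The scoring scheme described above can be made concrete with a small Python sketch that scores a pair of sequences that have already been aligned (gaps written as '-'). It illustrates the arithmetic only and is not phrap's implementation, which searches for the best-scoring alignment itself; the function name and the example strings are hypothetical.

```python
MATCH, MISMATCH, GAP_OPEN, GAP_EXTEND = 1, -2, -4, -3

def alignment_score(a, b):
    """Score two equal-length aligned strings; '-' marks a gap position."""
    if len(a) != len(b):
        raise ValueError("aligned strings must have equal length")
    score = 0
    gap_in = None  # which sequence the current run of gaps is in, if any
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            side = "a" if x == "-" else "b"
            # the first gap position costs GAP_OPEN, later ones GAP_EXTEND
            score += GAP_EXTEND if gap_in == side else GAP_OPEN
            gap_in = side
        else:
            score += MATCH if x == y else MISMATCH
            gap_in = None
    return score

if __name__ == "__main__":
    print(alignment_score("ACGTACGTAC", "ACGTACGTAC"))   # identical sequences
    print(alignment_score("ACGTACGT", "ACGAACGT"))       # one mismatch
    print(alignment_score("ACGT--ACGT", "ACGTTTACGT"))   # one gap of length 2
```

Because a single mismatch or gap cancels several matches, raising -minscore requires longer or cleaner overlaps before two reads are joined.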
3. Our work is still in progress, and the final goal is to obtain the whole genome sequence,
annotate the genome and set up a web database for public use. We are now in the gap-closure
phase, in which the contigs must be linked together with specific PCR products or cloned inserts
that span each gap so that a single contig is obtained. We have several methods for linking contigs.
From our blastx searches we found that if the ends of two contigs encode different parts of the
same protein, these contigs are probably neighbors. Our lab is also constructing BAC libraries
(40 kb). If the two terminal sequences of a single BAC clone belong to different contigs, this also
indicates that the two contigs are neighbors. All potential neighbor predictions have to be verified
by standard or long-range PCR. For contigs without identified neighbors, gaps in the genomic
sequence may be bridged by chromosomal-walking methods.
Annotation is another huge task to be completed. It includes gene identification (as we have done in
contigs) and characterization, deduction of metabolic pathways and prediction of protein structure.
Additionally, from the complete view of the whole genome, we may discover gene displacements
and horizontal transfers, which may help to trace evolutionary networks.
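The BAC-end criterion for neighboring contigs described above can be sketched in Python as follows. The clone names, contig names and the dictionary layout are hypothetical; in practice the assignment of each BAC end to a contig would come from aligning the end sequences to the contigs.

```python
from collections import defaultdict

def candidate_neighbors(bac_ends):
    """Collect contig pairs linked by the two ends of one BAC clone.

    bac_ends maps a clone name to a pair (contig of one end, contig of the
    other end); None means that end could not be placed in any contig.
    Returns {frozenset({contig_a, contig_b}): [supporting clone names]}.
    """
    links = defaultdict(list)
    for clone, (c1, c2) in bac_ends.items():
        if c1 is not None and c2 is not None and c1 != c2:
            links[frozenset((c1, c2))].append(clone)
    return dict(links)

if __name__ == "__main__":
    example = {
        "bac_001": ("Contig61", "Contig241"),
        "bac_002": ("Contig241", "Contig61"),
        "bac_003": ("Contig12", "Contig12"),   # both ends in the same contig
        "bac_004": ("Contig7", None),          # one end not placed
    }
    for pair, clones in candidate_neighbors(example).items():
        print(sorted(pair), clones)
```

Pairs supported by several clones are stronger candidates, but each predicted neighbor pair would still need PCR verification.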
Acknowledgements
I would like to express my sincere thanks to Professor Zhao Jindong, who led me to the frontier
of biological research. I thank Professor Luo Jingchu for his guidance on bioinformatics. I
also thank Lee Tao, who is in charge of the whole project and has helped me a great deal in my research
work. Finally, I thank Dr. Lee Tsung-Dao and the Chun-Tsung foundation for offering me this
research opportunity.
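About the author: Wang Zhu, born in 1982 in Jingzhou, Hubei, is an undergraduate of the class of 1999 in
the Department of Biotechnology, College of Life Sciences, Peking University. Wang began taking part in
research with the support of the Chun-Tsung Fund in May 2001 and has received the ESEC Scholarship
and the DuPont Scholarship.
Reflections: A year and a half as a Chun-Tsung scholar has given me a great deal. I not only learned the
basic experimental techniques of molecular biology but also, under the guidance of my supervisor
Professor Zhao Jindong, entered the field of bioinformatics. Professor Zhao's keen insight into the
frontiers of science and his original views on problems helped me truly understand what science is.
Scientific research requires diligence and creativity, and setbacks and failures always accompany
success along the way. Those who are brave and wise will find hope in difficulty and move from failure
to success. I thank Mr. Lee Tsung-Dao for his concern for the young students of our country; the
Chun-Tsung experience will always be a valuable asset in my life.
About the supervisor: Zhao Jindong is Associate Dean of the College of Life Sciences and a Changjiang
Distinguished Professor. He holds a Ph.D. from the University of Texas-Austin. His main research area
is photosynthesis and nitrogen fixation in cyanobacteria.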