Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002) Assembly and Gene Identification of Contigs of Synechococcus sp. strain PCC 7002 Genome Wang Zhu College of Life Science, Peking University Abstract In this project, we try to sequence and annotate the genome of a cyanobacterium -- Synechococcus sp. strain PCC 7002. Currently, prokaryotic genome sequencing is generally carried out by the shotgun approach. We obtained 1kb-3kb sequence reads from Huada gene center. Cosmid end sequences were also used. We utilized software package Phredphrap to perform the assembly of these reads and have reduced the number of contigs to 242. The final goal is to construct a whole genome sequence of several megabases with an error rate lower than 1 per 10000 nucleotides. The largest contigs was analyzed by program GeneMarkS to correct frame shift errors and predict genes in them. We also programmed a tool to extract these gene sequences from the report list of GeneMarkS so that they can be under further studies. Introduction The cyanobacteria are believed to be of very ancient origin, and are the answer of present-day chloroplasts. Therefore, it is of great interest to analyze the structure and organization of genes in this organism. Synechococcus sp. strain PCC 7002 is a unicellular cyanobacterium. The genomes of other two cyanobacteria PCC 6803 and PCC 7120 have already been sequenced and can be visited through public databases. 7002 genome is predicted to be 2.8Mb. Understanding its whole genome may provide the basis for the studies of metabolism and photosynthesis. At present, the most widely used strategy for the sequencing of a microbial genome is that of whole-genome shotgun sequencing. A large number of clones, from the libraries representative of the whole genome, are sequenced and assembled into contigs. The contigs are then linked together using a variety of methods to obtain the whole genome sequence in a single contig. The process can be divided roughly into several procedures, including library construction, reads assembly and closure phase, as shown in Figure 1. 300 Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002) Fig 1. Outline of genome sequencing, assembly and annotation Both small-insert (1-2 kb) libraries and large-insert (20-300kb) libraries like BAC (bacterial artificial chromosome) are needed to ensure appropriate coverage of the genome and obtain a ‘scaffold’ of the genome which is used during the closure phase. The small-insert library data of PCC 7002 came from Huada gene center. The BAC library is being constructed in our lab. The assembly phase is composed of three major steps: the conversion of the data from automated sequencers to nucleotide sequences, the utilization of these sequences in the assembly process and the continuous assessment of this assembly process. There are various software tools for such work and we choose the Phredphrap package. It’s based on complex algorithms which perform pairwise comparisons for all sequences, allowing automatic threshold selection with respect to the decision of whether two sequences overlap or not. Clusters of overlapping sequences are constructed and consensus sequences are deduced from these clusters. The assembly-result file in Phrap format can be viewed with software Consed. In Consed, contigs are shown with the sequences composing of them. It’s easy to find and edit low-quality sequences and the assembly results can be assessed vividly. The goals of annotation include detecting and describing the protein-coding sequence, the structure of these genes (including untranslated regions and control elements), homology comparison between the sequences being analyzed and sequences available from public databases at either nucleic acid or the protein level. As long as contigs are constructed, preliminary annotation can be carried out. In this research, we use software GeneMarkS, which is based on heuristic Markov model, to predict genes in contigs as well as correct frame shift errors. 301 Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002) Methods and Results Ⅰ. Assembly 1. Base calling The computer operation system is Unix/Linux. The reads sequences are stored as chromat files in directory “chromat_dir”. One example of the file is shown in Figure 2. Fig 2. Part of a “.abi” chromat file viewed with Consed Phred uses simple Fourier methods to examine the four base traces in the region surrounding each point in the data set in order to predict a series of evenly spaced predicted peak locations. Next phred finds the centers of the actual (observed) peaks and a dynamic programming algorithm is used to match the observed peaks with the predicted peak locations found in the first step. Phred evaluates the trace surrounding each called base using a quality value (QV). The quality value is related to the base call error probability (P_e) by the formula: - P_e=10 QV/10 Run phred with the options: % phred –id chromat_dir –pd phd_dir which causes phred to read the chromat files in “chromat_dir” and write the converted “.phd” files to “phd_dir”. In “.phd” files comments on the file conversion process are listed first, then are the bases information as shown below in Table 1. <base> c a a t a t t g <quality> 34 42 42 33 31 33 11 11 <position in chromat> 2622 2634 2647 2660 2674 2685 2698 2711 Tab 1. Part of the file “s_14291.y1.abi.phd” showing phd format Then run the phd2fasta program to make FASTA files. After running % phd2fasta –id phd_dir –os seqs_fasta –oq seqs_fasta.screen.qual two files are created. File “seqs_fasta” records all the sequences in FASTA format (shown in Table 2.) and file “seqs_fasta.screen.qual” records the quality value of each base in all the sequences. 302 Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002) >001-h10.urt.abi CHROMAT_FILE: 001-h10.urt.abi PHD_FILE: 001-h10.urt.abi.phd.1 CHEM: term DYE: big TIME: Sun Jun 2 17:06:07 2002 TEMPLATE: 001-h10 DIRECTION: fwd GGCGGCGGCTTTGGCGGGCTTGAGTTGGGGGAGGGTTTTTTCTTGGAGGG TGAGTTTTTTTTCTTCGAGGCGCAGCAGCTGTTGCTGGTTGACTTGTACC GGTTCTAGCAGTTGGCGGTGGAGCTGGAGCGATCGCCGTTTTTCTTTCTC TAATTCCCGTTGCCGGTGGCCCCATTGGTGCCGTTTTGCCTTTTGGGTGG CGAGTTTGGACGCAACCGAAAGTTGTTCATCTTCCCCCAGGGCTTTTACC TTGGCGTTCAAACTTTCTAGGGTTTGGCCAGTCTGAAAAATTTCCCCATC CAAAGCGTTCAAATTTTGAGTTAACTGGGTGCGTTCCTGGGCGCTGCTGG CGAGATCCCGTTGCAAGTTTTCCTCTTGAAGTTGGAGCGATCGCCATGTT AAAAGAACTTTTTCCTGCTTGTTGGCGGCCAGGTCAATTTTGAGC >001-h11.uft.abi ………... Tab 2. Part of the file “seqs_fasta” showing FASTA format 2. Vector screening All the sequencing results contain the vector sequence at the two ends of the insert sequence. They should be screened out before assembly. This is done by program cross_match: % cross_match seqs_fasta vector.seq –minmatch 12 –minscore 20 –screen > screen.out File vector.seq, also in FASTA format, contains all the vector sequences we want to screen for (pUC19, pBluSKM, pBluSKP). The “–screen” option causes a file named “seqs_fasta.screen” to be created, containing vector-masked versions of the original sequences. This “.screen” file is what later is provided as input to Phrap. The output file “screen.out” lists the matches that were found. 3. Assembly The program phrap is based on Smith-Waterman algorithm (SWAT) and so is cross_match. It scores pairwise alignment and constructs contig sequences as a mosaic of the highest quality parts of reads. Run phrap to perform the sequence assembly as follows: % phrap seqs_fasta.screen –ace > phrap.out Phrap writes the assembled contigs to the file “seqs_fasta.screen.contigs” (shown in Table 3.) and creates a “seqs_fasta.screen.contigs.ace” file that can be used for importing the assembly to Consed for assessment and editing. The assembly output information is contained in file “phrap.out”. >seqs_fasta.screen.Contig61 CACCCCGTAAGAGTGACCAGTGGAACGGTCAAAAAATTATGCGTGATCGC CGCATTTCAATTACTTTTAGAAAAGTGATTATTTAGAAAGTGTTTTTATT TAAAATCATTATTAATCTTGTCTGATGCAATGTTTTGAGTAATCTTTAAT TATTTTTTGGCCATGCAAATACCAATTTCACCACGTCCTAAATATTATCC AGTGAAACTTGAGTTTCCTAATCCTGTGACTCATAAATCTATCCTCTCCC AGCAGCAGATTTCTAATAAAGCTTTTTTTCT……………………… Tab 3. Part of the file “seqs_fasta.screen.contigs” 303 Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002) 4. Viewing with Consed We used program Consed to view the assembly results generated by phrap (shown in Figure 3). The consensus is on the top line and reads that match are listed below. The darker region of a read is the low-quality area and an erroneous base is marked by red color. When a chromat file “.abi” is displayed in consed, erroneous bases in a sequence can be changed by clicking the middle button of the mouse on the bases and then editing them. Fig 3. Viewing assembly results with Consed Ⅱ. Results Actually, all the assembly steps can be run under one combined program phredphrap: % phredphrap & Before running, we have edited the file “phredphrap” and changed some parameters to meet our needs. The parameters are listed below: (their meaning discussed later) -trim_qual 20 -trim_start 10 -repeat_stringency 0.95 -forcelevel 1 -bypasslevel 1 -maxgap 35 -minmatch 12 -minscore 30 -maxmatch 30 -vector_bound 50 -max_subclone_size 8000 We put 26911 entries for assembly and got 242 contigs (see Figure 4) in total. Average quality value of these entries was 25.0. About 62.57% of them were bidirectional clones, 33.10% unidirectional clones, 2.49% walking sequence, 1.36% cosmid end sequence and the remaining 0.48% were genes from EMBL and Genebank. The entries’ average full length was 801.4bp and was reduced to 657.2bp after trimming (remove the beginning low quality bases). Average quality value of consensus sequences is 48.0 per base. The number of confirmed reads is 25494 and the remaining are singlets that can’t be linked with any other reads. The depth of coverage reached 5.8. We estimated preliminarily that the size of Synechococcus sp. strain PCC 7002 genome is 2809470bp. 304 Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002) 327789bp, 10% 155659bp, 5% 121021bp, 4% Fig 4. Composition of contigs and the size of the largest ones Ⅲ. Gene Identification 1. Running GeneMarkS Now the largest contig obtained from phredphrap is 328kb in large. These large contigs can be the substrate of annotation. Gene identification is the first step of annotation and we use software GeneMarkS to predict genes in them. GeneMarkS uses an improved version of the gene finding program GeneMark.hmm, heuristic Markov models of coding and non-coding regions and the Gibbs sampling multiple alignment program. It’s especially useful for newly sequenced prokaryotic genome with no prior knowledge of any protein or rRNA genes. We submitted the contig sequences online at website: http://opal.biology.gatech.edu/GeneMark/genemarks.cgi The results were sent via email. For each contig, we acquired a postscript graphics file which demonstrated the predicted genes (see Figure 5) and a text file containing the begin and end locations of the genes. The plateau in the postscript graphics indicates the range of the gene. Genes could be found both in direct sequence and complementary sequence, and they could be in frames 1,2,3 or -1,-2,-3. Fig 5. Predicted genes demonstrated by GeneMarkS 305 Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002) 2. Frame shift correction Besides gene identification, we found GeneMarkS results very useful for the identification of frame shift. As we know, a gene-coding sequence can be translated in three frames. So any base deletion or insertion may cause the translation change from one frame to another, i.e. frame shift. A frame shift can be easily identified in the postscript graphics file, usually at the site where an arrow points from one plateau to adjacent plateau in another frame. One example is shown in figure 6. Fig 6. Frame shift shown in postscript graphics We reexamined the target sequence in Consed (see Figure 7), and found one possible base deletion at site 93432. File “s_6377.y1.abi” suggested a “c” here, but the consensus omitted it. Fig 7. Reexamine the sequence at possible frame shift site Then we checked the chromat files “s_6377.y1.abi” and “s_6849.g1.abi” and judged that there should be a “c” at site 93432 (as shown in Figure 8). So a “c” was added to sequence “s_6849.g1.abi” and thus to the consensus. The frame shift was corrected. 306 Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002) Fig 8. Confirmation and Correction of frame shift Then we ran GeneMarkS again and found the frame shift error was indeed corrected (see Figure 9). Fig 9. Frame shift error was corrected in the new postscript graphics Running blastx against PCC 6803 genome database also revealed the frame shift (see Table 4, Before correction). After correction, we ran blastx again, and this time the gene was complete and it matched very well (see Table 4, After correction). E-value Frame Start site End site Match start Match end AA length Before correction 3.00E-67 1 92056 93432 3 460 493 3.00E-67 3 93435 93527 462 492 493 After correction 3.00E-73 1 92054 93526 3 492 493 Tab 4. Comparison between before and after correction of the frame shift 307 Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002) With this method, most of the frame shift errors in contigs can be detected and corrected. Sometimes it’s difficult to decide confidently where the frame shift occurs for the low quality of the examined sequence. In that case, we shouldn’t rush to correct the error as we think what it should be. More data should be added before further consideration. 3. Information extraction When mistakes were reduced to minimum, we can extract these genes out from the contig sequences and get the protein sequences. So we wrote a program called “extract” to extract all the gene sequences from the information of their begin and end locations in the contigs, and another program called “translate” to translate these gene sequences to protein sequences. They were both written in Perl. For the length limitation of this paper, program lines are not listed here. 4. Analysis of contig 241 We chose the second largest contig - contig 241 (156kb) to be analyzed. After running GeneMarkS and correcting frame shift errors (about one error per 10kb), we know that there are about 142 genes in this contig (every 1096bp has a gene in either strand). The average length of the genes is 918bp. So it’s clear that coding sequences comprise most regions of this cyanobacteria genome and genes are arranged in line consecutively with very short gap between them. Then this contig was performed by running blastx against PCC 6803 protein database. A threshold e-value of 1E-20 was used for this analysis. Part of the blastx results are shown below in Table 5. We compared these proteins with the genes predicted by GeneMarkS, finding that 102 genes share homology with 6803 proteins and the other 40 genes are unique for 7002. These unique genes may be of special interest to 7002’s morphology and metabolism and thus need further experiments to reveal their structure and function. Tab 5. Part of results of the contig 241 blastx against PCC 6803 Query Query frame e-value ORF No product genetic Length Sbjct Sbjct symbol (aa) init end mntC 330 15 326 init end 202 1128 1 1E-126 sll1598 Mn transporter MntC 1189 1911 1 2E-24 sll0385 ABC transporter 284 42 276 1207 1851 1 8E-38 slr2044 ABC transporter 289 21 231 1210 1941 1 1E-110 sll1599 Mn transporter MntA 260 9 252 1213 1914 1 4E-23 sll0489 ABC transporter 342 2 224 1213 1911 1 1E-30 slr1318 iron(III) dicitrate fecE 268 10 244 nrtD 332 19 225 mntA transport system permease protein FecE 1213 1839 1 5E-21 sll1453 nitrate transport protein NrtD 1261 1845 1 4E-21 sll0778 ABC transporter 790 244 433 1980 2780 3 6E-22 slr2045 hypothetical protein 281 10 277 1998 2810 3 1E-107 sll1600 Mn transporter MntB 306 12 282 3037 3639 1 5E-33 slr0006 217 18 217 308 mntB Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002) Query Query init end 5280 4426 frame e-value -1 4E-77 ORF No product slr1559 shikimate genetic Length Sbjct Sbjct symbol (aa) init end aroE 290 4 288 256 15 255 363 1 363 5-dehydrogenase 6074 5361 -2 2E-55 sll1123 7233 6142 -1 1E-142 sll0245 hypothetical protein 9115 8153 -3 3E-32 slr1225 protein kinase PknA pknA 495 3 344 9118 8243 -3 4E-32 slr1697 eukariotic protein pknA 574 1 296 332 11 244 kinase 12603 13283 3 1E-24 slr1113 ABC transporter Discussion 1. Often we got new sequence data, which was not in chromat files, but text files containing the sequences. In such case, we first edited the files in Word and saved them in FASTA format. Program SeqVerter, which can conveniently merge several sequence files into one file or do the reverse process, may be especially useful for the creation of proper file format. Next we used program mktrace to create an ideal chromat file “.scf”, for example: % mktrace 3e7-3-cos 3e7-3-cos.scf We simply wrote a shell and add this sentence “mktrace $i $i.scf” to batch process a directory of files and saved the new chromat files in directory “chromat_dir”. Having these chromat files in hand, we could easily run phredphrap again and these new sequence information were added into assembly process. In fact, any FASTA file can be treated with mktrace to create a chromat file, which can then be viewed and edited in Consed. 2. Many parameters in command line options of phrap can greatly affect the results of assembly. For instance, -minmatch sets the minimum length of matching word to nucleate SWAT comparison. Increasing -minmatch can dramatically decrease the time required for the pairwise sequence comparisons; it also tends to have the effect of increasing assembly stringency. For example, when we changed this parameter from 12 to 14, the number of contigs increased from 242 to 253. However, it may cause some significant matches to be missed. Parameter - maxmatch sets the maximum length of matching word. Parameter -minscore sets the minimum alignment score. Phrap scores pairwise sequence alignment as follows: matching residues receive a reward of +1, mismatches get a penalty of -2, gap opening residues a penalty of -4, and gap extension residues a penalty of -3. So a sequence alignment score must be above the -minscore before these sequences can be put together. Parameter -vector_bound sets the number of potential vector bases at beginning of each read. Matches that lie entirely within this region are assumed to represent vector matches and are ignored. Parameter -max_subclone_size checks the maximum size of forward-reverse read pair. Parameter -trim_start sets the number of bases to be removed at the beginning of each read as these bases are often of low quality. Parameter - repeat_stringency controls stringency of match required for joins and -forcelevel and - 309 Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002) bypasslevel regulate stringency during final contig merge pass(0 is most stringent and 10 is the least). We have tested different values of these parameters and have run phredphrap after each change. The current parameters are set to reach a balance between minimizing number of contigs and avoiding mismatches. 3. Now our work is still going on and the final goal is to obtain the whole genome sequence, annotate the genome and set up a web database for public use. We are now at the gap-closure phase in which we should link the contigs together with specific PCR products or cloned inserts that span each gap so that a single contig is obtained. We have several methods to link contigs together. We have done blastx and found that if ends of two contigs encode different parts of the same protein, these contigs are probably neighbors. Our lab is also constructing BAC libraries (40kb). If the terminal sequences of a single BAC clone belong to different contigs, it also indicates neighborhood of these two contigs. All potential neighbor predictions have to be verified by standard or long-range PCR. For the contigs without identified neighbors, gaps in genomic sequences may be bridged by chromosomal-walking methods. Annotation is another huge task to be solved. It includes genes identification (as we have done in contigs) and characterization, deduction of metabolic pathways and prediction of protein structure. Additionally, from the complete view of the whole genome, we may discover gene displacements and horizontal transfers, which may help to trace evolutionary networks. Acknowledgements I’d like to express my sincere thanks to Professor Zhao Jindong. It’s he who led me to the frontier of biological research field. I thank Professor Luo Jingchu for his guidance on bioinformatics. I also thank Lee Tao who is in charge of the whole project and had helped me a lot in my research work. Finally, I should thank Dr. Lee Tsung-Dao and Chun-Tsung foundation for offering me this research opportunity. References Frangeul, L, et al. (1999). Cloning and assembly strategies in microbial genome projects. Microbiology,145,2625-2634. Fleischmann, W, et al. (1999). A novel method for automatic functional annotation of proteins. Bioinformatics,15,228-233. Besemer, J, et al. (2001). GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Research,29,12,2607-2618 Bouck, J, et al. (1998). Analysis of the quality and utility of random shotgun sequencing at low redundancies. Genome Research, 8,1074-1084. Frederick, R. et al. (1997). The complete genome sequence of Escherichia coli K-12. Science, 277, 1453-1462. Ewing, B. et al. (1998). Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research,8,186-194. McMurray, A. et al. (1998). Short-insert libraries as a method of problem solving in genome 310 Series of Selected Papers from Chun-Tsung Scholars, Peking University (2002) sequencing. Genome Research, 8, 562-566. Nelson, K. et al. (1999). Evidence for lateral gene transfer between Archaea and Bacteria from genome sequence of Thermotoga maritima. Nature,399,323-329. Gordon, D. et al. (1998). Consed: a graphical tool for sequence finishing. Genome Research,8,195-202. Barnes, W. (1994). PCR amplification of up to 35kb DNA with high fidelity and high yield from lambda bacteriophage templates. PNAS USA,91,2216-2220. Staden, R. (1979). A strategy of DNA sequencing employing computer programs. Nucleic Acids Research,10,4731-4751. Huang, X. et al. (1996). An improved sequence assembly program. Genomics,33,21-31. Makoto H. et al. (1995). Computer survey for likely genes in the one megabase contiguous genomic sequence data of Synechocystis sp. strain PCC 6803. DNA Research,2,239-246. Green, P. et al. (1997). Against a whole-genome shotgun. Genome Research, 7,410-417. Cole, S. et al. (1998). Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence. Nature,393,537-544. Takakazu, K. et al. (2001). Complete genomic sequence of the filamentous nitrogen-fixing cyanobacterium Anabaena sp. strain PCC 7120. DNA Research,8,205-213. 作者简介:王竹,1982年生于湖北荆州,北京大学大学生命科学学院生物技术系 99级本科生。2001年5月开始受“ 政基金”资助参加科研工作。曾获ESEC奖学 金,杜邦奖学金。 感悟与寄语:一年半的“ 政”经历使我收获良多。我不仅掌握了基本分子生物 学实验手段技术,而且在导师赵进东教授的指引下涉足了生物信息学领域。赵教 授对科学前沿的敏锐洞察力和对问题的独到见解帮助我真正了解了什么是科学。 科学研究需要勤奋加创造,在其过程中挫折失败总是与成功相伴。勇敢而智慧的 人将能从困难中找到希望,从失败走向成功。感谢李政道先生对祖国青年学子的 关切之情,“ 政”经历将永远是我的人生的一笔宝贵财富。 指导教师简介:赵进东,生命科学学院副院长,长江特聘教授。美国 University of Texas-Austin 博士学位。主要研究领域为蓝藻的光合和固氮作用。 311