Supplementary Text 1

Methods

A modified whole-genome shotgun strategy was used to sequence the ~9.2 Mb genome of C. hominis isolate TU502 [1,5], which was derived from an infected child from Uganda. TU502 was identified as C. hominis by restriction fragment length polymorphism analysis [2] and by analysis of the polymorphic regions of the small subunit ribosomal RNA [3] and β-tubulin [4] genes. The isolate was propagated in gnotobiotic piglets [5] and the oocysts were purified from the feces as previously described [6]. To generate DNA, the isolate was expanded in neonatal calves [1]. DNA was purified from surface-sterilized oocysts [7], shotgun and BAC clones were constructed, and end sequences were generated as previously described [8].

The analysis herein was performed on Data Version 2.0, which included over 220,000 sequence reads from small-insert clones and end sequences from approximately 2,000 BAC clones averaging 35 kbp in size, generated as of October 15, 2003. The data represent ~12-fold shotgun clone coverage of the genome with a quality score of Phred 20 [9], and 7-8-fold coverage with BAC clones. The Phrap-based assembly [9] of these sequence reads yielded 2,086 contigs totaling 9.16 Mb, after removal of contaminant contigs identified by BLASTN [10] searches against the GenBank databases. Contigs with E values < 1e-15 were identified as probable contaminants. The average gap size among the 1,426 contigs (9.05 Mb) that align with the 9.11 Mb C. parvum genome is estimated to be less than 260 bp.

The HAPPY map of the C. parvum chromosomes [11] and the assembled C. parvum genome [12] were used to align our assembly of the C. hominis genome as shown in Figure 1, since the C. hominis and C. parvum genomes have indistinguishable molecular karyotypes [5]. The existing HAPPY map allowed us to accurately assign and order over 323, and orient over 300, of the largest contigs, which contain the majority of the C. hominis sequence, to specific chromosomes. After the alignment, the C. parvum sequence covered ~9.05 Mb of the estimated 9.2 Mb C. hominis sequence. There remain 246 physical discontinuities in the C. hominis sequence, i.e., physical gaps spanned by no known clones. We estimate that greater than 99% of the C. hominis genome is present in this assembly.

DNA features (e.g., GC content, microsatellites and repeats, telomeric repeats, palindromic octamer sequences, and codon usage) were analyzed using EMBOSS programs [13]. EST sequences were extracted from the GenBank database and aligned with genomic DNA for intron identification. The tRNAs were identified using tRNAscan-SE [14], and rRNA genes were identified by BLAST [10] search. Genome features were viewed and examined using the Artemis [15] and Apollo [16] annotation tools.

The open reading frames (ORFs) were predicted using Glimmer [17], GeneMarkS [18], and EMBOSS [13], and checked manually. All predicted genes larger than 67 amino acids (201 bp) were analyzed. The gene list was used to search GenBank for putative homologous proteins, InterPro [19] for potential domains, and SignalP [20], TargetP [21] and TMHMM [22] for surface or membrane-associated proteins. To estimate the intron content, the C. hominis ORFs were first searched against the nr database using the FASTA [23] program. GeneWise [24] analysis was then used to predict introns.

For Gene Ontology (GO) [25] and manual analysis, the predicted proteins were categorized by searching for homologous proteins in the UniProt [26] database using the FASTA [23] program, with a cut-off threshold of E < 1e-6. The GO terms associated with proteins in UniProt were assigned to the putative homologous C. hominis genes. A high-level view using GO slims [27] was obtained from this assignment.

The contigs were split into 1.5 kb pieces overlapping by 50 bp using splitter from EMBOSS [13]. The split fragments were searched against UniRef 1.0 [26] using BLASTX [10] with a cut-off of E < 1e-6.
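The fragmentation step described above can be illustrated with a minimal Python sketch. It is not the EMBOSS splitter program itself: the function name split_with_overlap is hypothetical, and the sketch assumes that each fragment is at most 1,500 bases long and that consecutive fragments share 50 bases. The boundary convention used by splitter may differ.

```python
def split_with_overlap(seq, size=1500, overlap=50):
    """Return (start, end, fragment) tuples covering seq.

    Assumption (not taken from the source): each fragment is at most
    `size` bases long and consecutive fragments share `overlap` bases.
    Coordinates are 0-based and half-open.
    """
    if overlap < 0 or size <= overlap:
        raise ValueError("need size > overlap >= 0")
    if not seq:
        return []
    step = size - overlap          # distance between fragment starts
    pieces = []
    start = 0
    while True:
        end = min(start + size, len(seq))
        pieces.append((start, end, seq[start:end]))
        if end == len(seq):        # last fragment reaches the contig end
            break
        start += step
    return pieces


# Hypothetical 4,000 bp contig used only for illustration.
contig = "ACGT" * 1000
fragments = split_with_overlap(contig)
for start, end, _ in fragments:
    print(start, end)
```

Under these assumptions the 4,000 bp example contig yields three fragments with coordinates 0-1500, 1450-2950 and 2900-4000. The overlap ensures that a coding region spanning a fragment boundary is still partly represented in both neighbouring fragments during the translated search.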
The genes identified in this search were processed with in-house scripts to find the annotated UniRef [26] proteins (those with EC or GO numbers) and the metabolic pathways, as defined in KEGG [28], in which they participate. Metabolic pathways in C. hominis were automatically predicted by combining data from the KEGG [28], ENZYME [29], UniRef [26] and GO [25] databases using in-house scripts. The results were subsequently checked and refined manually.

The contigs of C. hominis were aligned with C. parvum using MUMmer [30]. The contigs that did not align were grouped as orphans. The genome architectures were compared and viewed using Apollo [16]. The detailed genomic DNA sequences were compared using ClustalW [31]. Indels (insertions and deletions) and substitutions (transitions and transversions) with quality scores higher than Phred 20 were collected and analyzed using in-house scripts. Indels were divided into three classes according to the resulting shift in reading frame: indel_1 causes a one-base shift, indel_2 a two-base shift, and indel_3 a three-base shift.

Reference List

1. Akiyoshi,D.E., Feng,X., Buckholt,M.A., Widmer,G. & Tzipori,S. Genetic analysis of a Cryptosporidium parvum human genotype 1 isolate passaged through different host species. Infect. Immun. 70, 5670-5675 (2002).
2. Feng,X. et al. Extensive polymorphism in Cryptosporidium parvum identified by multilocus microsatellite analysis. Appl. Environ. Microbiol. 66, 3344-3349 (2000).
3. Carraway,M., Tzipori,S. & Widmer,G. Identification of genetic heterogeneity in the Cryptosporidium parvum ribosomal repeat. Appl. Environ. Microbiol. 62, 712-716 (1996).
4. Widmer,G., Tchack,L., Chappell,C.L. & Tzipori,S. Sequence polymorphism in the beta-tubulin gene reveals heterogeneous and variable population structures in Cryptosporidium parvum. Appl. Environ. Microbiol. 64, 4477-4481 (1998).
5. Widmer,G. et al. Animal propagation and genomic survey of a genotype 1 isolate of Cryptosporidium parvum. Mol. Biochem. Parasitol. 108, 187-197 (2000).
6. Widmer,G., Tzipori,S., Fichtenbaum,C.J. & Griffiths,J.K. Genotypic and phenotypic characterization of Cryptosporidium parvum isolates from people with AIDS. J. Infect. Dis. 178, 834-840 (1998).
7. Robertson,L.J., Campbell,A.T. & Smith,H.V. In vitro excystation of Cryptosporidium parvum. Parasitology 106 (Pt 1), 13-19 (1993).
8. Birren,B., Mancino,V. & Shizuya,H. Genome Analysis: A Laboratory Manual. Birren,B. et al. (eds.), pp. 241-296 (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, 1999).
9. Gordon,D., Abajian,C. & Green,P. Consed: a graphical tool for sequence finishing. Genome Res. 8, 195-202 (1998).
10. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. & Lipman,D.J. Basic local alignment search tool. J. Mol. Biol. 215, 403-410 (1990).
11. Piper,M.B., Bankier,A.T. & Dear,P.H. A HAPPY map of Cryptosporidium parvum. Genome Res. 8, 1299-1307 (1998).
12. Abrahamsen,M.S. et al. Complete genome sequence of the apicomplexan, Cryptosporidium parvum. Science 304, 441-445 (2004).
13. Rice,P., Longden,I. & Bleasby,A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16, 276-277 (2000).
14. Lowe,T.M. & Eddy,S.R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955-964 (1997).
15. Rutherford,K. et al. Artemis: sequence visualization and annotation. Bioinformatics 16, 944-945 (2000).
16. Lewis,S.E. et al. Apollo: a sequence annotation editor. Genome Biol. 3, RESEARCH0082 (2002).
17. Salzberg,S.L., Pertea,M., Delcher,A.L., Gardner,M.J. & Tettelin,H. Interpolated Markov models for eukaryotic gene finding. Genomics 59, 24-31 (1999).
18. Besemer,J., Lomsadze,A. & Borodovsky,M. GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res. 29, 2607-2618 (2001).
19. Mulder,N.J. et al. InterPro: an integrated documentation resource for protein families, domains and functional sites. Brief. Bioinform. 3, 225-235 (2002).
20. Bendtsen,J.D., Nielsen,H., von Heijne,G. & Brunak,S. Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol. 340, 783-795 (2004).
21. Emanuelsson,O., Nielsen,H., Brunak,S. & von Heijne,G. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol. 300, 1005-1016 (2000).
22. Krogh,A., Larsson,B., von Heijne,G. & Sonnhammer,E.L. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 305, 567-580 (2001).
23. Pearson,W.R. Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 183, 63-98 (1990).
24. Birney,E. & Amitai,M. GeneWise. EMBL - European Bioinformatics Institute. 1-30-2004.
25. Harris,M.A. et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 32 Database issue, D258-D261 (2004).
26. Apweiler,R. et al. UniProt: the Universal Protein knowledgebase. Nucleic Acids Res. 32 Database issue, D115-D119 (2004).
27. Gene Ontology Consortium. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 32, D258-D261 (2004).
28. Kanehisa,M. & Goto,S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27-30 (2000).
29. Bairoch,A. The ENZYME database in 2000. Nucleic Acids Res. 28, 304-305 (2000).
30. Kurtz,S. et al. Versatile and open software for comparing large genomes. Genome Biol. 5, R12 (2004).
31. Kohli,D.K. & Bachhawat,A.K. CLOURE: Clustal Output Reformatter, a program for reformatting ClustalX/ClustalW outputs for SNP analysis and molecular systematics. Nucleic Acids Res. 31, 3501-3502 (2003).
