Supplementary Text 1

Methods
A modified whole genome shotgun strategy was used to sequence the ~9.2 Mb
genome of C. hominis isolate TU502 1,5, which was derived from an infected child from
Uganda. TU502 was identified as C. hominis by restriction fragment length polymorphism
analysis 2 and analysis of the polymorphic region of the small subunit ribosomal RNA 3 and
β-tubulin genes 4. The isolate was propagated in gnotobiotic piglets 5 and the oocysts
purified from the feces as previously described 6. To generate DNA, the isolate was
expanded in neonatal calves 1. DNA was purified from surface-sterilized oocysts 7, shotgun
and BAC clones were constructed, and end sequences were generated as previously
described 8.
The analysis herein was performed on Data Version 2.0, which included over
220,000 sequence reads from small insert clones and end sequences from approximately
2,000 BAC clones averaging 35 kbp in size, generated as of October 15, 2003. The data
represent ~12 fold shotgun clone coverage of the genome with a quality score of Phred
20 9, and 7-8 fold coverage with BAC clones. The Phrap based assembly 9 of these
sequence reads yielded 2086 contigs totaling 9.16 Mb, after removal of contaminant contigs
identified by BLASTN 10 database searches against the GenBank databases. Contigs with E
values < e-15 were identified as probable contaminants. Average gap size among 1,426 (9.05
Mb) contigs that align with the 9.11 Mb C. parvum genome is estimated as less than 260
bp. The HAPPY map of the C. parvum genome 11 and the assembled C. parvum
chromosomes 12 were used to align our assembly of the C. hominis genome as shown in
Figure 1, since the C. hominis and C. parvum genomes have indistinguishable molecular
karyotypes 5. The existing HAPPY map allowed us to accurately assign and order over 323,
and orient over 300 of the largest contigs containing the majority of the C. hominis
sequence to specific chromosomes. After the alignment, the C. parvum sequence covered
~9.05 Mb of the estimated 9.2 Mb C. hominis sequence. There remain 246 physical
discontinuities in the C. hominis sequence, i.e., physical gaps spanned by no known clones.
We estimate that greater than 99% of the C. hominis genome is present in this assembly.
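The coverage and completeness figures quoted above can be cross-checked with simple arithmetic. The following is a minimal sketch that uses only numbers stated in the text; the function name fold_coverage is illustrative and is not part of the original analysis.

```python
# Figures quoted in the text above.
GENOME_SIZE_BP = 9.2e6    # estimated C. hominis genome size
BAC_CLONES = 2_000        # "approximately 2,000 BAC clones"
BAC_INSERT_BP = 35_000    # "averaging 35 kbp in size"
ASSEMBLED_BP = 9.16e6     # total length of the 2086 contigs
ALIGNED_BP = 9.05e6       # contigs that align with C. parvum


def fold_coverage(n_clones, mean_insert_bp, genome_bp):
    """Clone coverage = total cloned bases divided by genome size."""
    return n_clones * mean_insert_bp / genome_bp


bac_coverage = fold_coverage(BAC_CLONES, BAC_INSERT_BP, GENOME_SIZE_BP)
assembled_fraction = ASSEMBLED_BP / GENOME_SIZE_BP
aligned_fraction = ALIGNED_BP / GENOME_SIZE_BP

print(f"BAC clone coverage: {bac_coverage:.1f} fold")
print(f"Assembled fraction: {assembled_fraction:.1%}")
print(f"Aligned fraction:   {aligned_fraction:.1%}")
```

The BAC result falls within the stated 7-8 fold range, and the assembled fraction is consistent with the estimate that more than 99% of the genome is present.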
All DNA features, e.g., GC content, microsatellites and repeats, telomeric repeats,
palindromic octamer sequences and codon usage, were analyzed using EMBOSS programs 13.
EST sequences were extracted from the GenBank database and aligned with genomic
DNA for intron identification. The tRNAs were identified using tRNAscan-SE 14, and
rRNA genes were identified by BLAST 10 search. Genome features were viewed and
examined using Artemis 15 and Apollo 16 annotation tools. The open reading frames (ORFs)
were predicted using Glimmer 17, GeneMarkS 18, and EMBOSS 13, and checked manually.
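To illustrate what an ORF scan with a minimum length involves, the following is a minimal sketch of a naive six-frame ATG-to-stop search. It is not the method used by Glimmer, GeneMarkS or EMBOSS, which are far more sophisticated. The 201 bp threshold comes from the text; applying it as "at least 201 bp of coding sequence, excluding the stop codon" is an assumption, as are all function names.

```python
MIN_CODING_BP = 201  # 67 codons, from the text (interpretation assumed)
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_coding_bp=MIN_CODING_BP):
    """Naive scan for ATG..stop ORFs in all six frames.

    Returns (strand, frame, start, end) tuples. Coordinates are 0-based,
    end-exclusive, relative to the scanned strand, and include the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if i - start >= min_coding_bp:
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs


# Hypothetical test sequences: 67 and 66 codons before the stop codon.
long_orf = "ATG" + "GCT" * 66 + "TAA"
short_orf = "ATG" + "GCT" * 65 + "TAA"
print(find_orfs(long_orf))
print(find_orfs(short_orf))
```

The sketch ignores introns and codon statistics, so it only conveys the role of the length cut-off, not the accuracy of the gene finders cited above.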
All predicted genes larger than 67 amino acids (201 bp) were analyzed. The gene list was
used to search GenBank for putative homologous proteins, InterPro 19 for
potential domains, and SignalP 20, TargetP 21 and TMHMM 22 for surface or membrane
associated proteins. To estimate the content of introns, the C. hominis ORFs were first
searched against the nr database using the FASTA 23 program. GeneWise 24 and manual
analysis were used to predict introns. For Gene Ontology (GO) 25 analysis, the predicted
proteins were categorized by searching for homologous proteins in the UniProt 26
database using the FASTA 23 program, with a cut-off threshold of e < 1e-6. The GO terms found
associated with proteins in UniProt were assigned to the putative homologous C. hominis
genes. The high-level view of GO slims 27 was obtained from the assignment. The contigs
were split into overlapping (50 bp) 1.5 kb pieces using splitter from EMBOSS 13. The split
fragments were searched against UniRef 1.0 26 using BLASTX 10 with cut-off e < 1e-6. The
resulting genes were processed by internal scripts to find the annotated UniRef 26 proteins
(EC or GO numbers) and the metabolic pathways, as defined in KEGG 28, in which they participate.
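The contig-splitting step can be pictured as a sliding window. The following is a minimal sketch of cutting a sequence into 1.5 kb pieces that overlap by 50 bp; it is a stand-in written for illustration, not the EMBOSS splitter program itself, and the handling of the final shorter piece is an assumption.

```python
def split_with_overlap(seq, size=1500, overlap=50):
    """Cut seq into pieces of at most `size` bp, consecutive pieces sharing `overlap` bp.

    Returns a list of (start, end, subsequence) with 0-based, end-exclusive
    coordinates. The last piece may be shorter than `size`.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    if not seq:
        return []
    step = size - overlap
    pieces = []
    start = 0
    while True:
        end = min(start + size, len(seq))
        pieces.append((start, end, seq[start:end]))
        if end == len(seq):
            break
        start += step
    return pieces


# Hypothetical 4 kb contig.
contig = "ACGT" * 1000
for start, end, piece in split_with_overlap(contig):
    print(start, end, len(piece))
```

The overlap ensures that a feature lying across a cut point is still fully contained in, or at least shared between, neighbouring fragments before the BLASTX search.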
Metabolic pathways in C. hominis were automatically predicted by combining data from the
KEGG 28, ENZYME 29, UniRef 26 and GO 25 databases, using internal scripts. The results
were subsequently checked and refined manually.
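The homology-based annotation transfer used in this section (GO terms from UniProt hits and EC or GO numbers from UniRef hits, both with e < 1e-6) can be sketched as follows. The internal scripts are not described in the text, so the data layout, the function name and the choice to pool terms from every passing hit rather than only the best hit are all assumptions; the query and subject identifiers are hypothetical.

```python
E_VALUE_CUTOFF = 1e-6  # cut-off stated in the text


def transfer_annotations(hits, annotations, cutoff=E_VALUE_CUTOFF):
    """Assign database annotations to query genes via significant hits.

    hits:        iterable of (query_id, subject_id, e_value)
    annotations: dict mapping subject_id -> set of terms (e.g. GO or EC numbers)
    Returns a dict mapping query_id -> set of transferred terms.
    Assumption: terms from all hits below the cut-off are pooled.
    """
    assigned = {}
    for query, subject, e_value in hits:
        if e_value < cutoff:
            terms = annotations.get(subject, set())
            if terms:
                assigned.setdefault(query, set()).update(terms)
    return assigned


# Hypothetical search results and database annotations.
hits = [
    ("orf_001", "subject_A", 1e-40),
    ("orf_001", "subject_B", 1e-3),   # fails the cut-off, ignored
    ("orf_002", "subject_C", 1e-12),  # passes, but subject has no annotation
]
annotations = {
    "subject_A": {"GO:0005524", "EC:2.7.1.1"},
    "subject_B": {"GO:0016301"},
}
print(transfer_annotations(hits, annotations))
```

Assignments produced this way are only putative, which is why the text specifies manual checking and refinement afterwards.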
The contigs of C. hominis were aligned with C. parvum using Mummer 30. The
contigs that did not align were grouped as orphans. The genome architectures were
compared and viewed using Apollo 16. The detailed genomic DNA sequences were
compared using ClustalW 31. Indels (insertions and deletions) and substitutions (transitions
and transversions) with quality scores higher than Phred 20 were collected and analyzed
using in-house scripts. Indels were divided into three patterns based on the
frameshift: indel_1 causes a one-base frameshift, indel_2 a two-base
frameshift, and indel_3 a three-base frameshift.
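The classification of sequence differences described above can be sketched in a few lines. This is an illustrative stand-in for the in-house scripts, which are not described in the text. The labels indel_1, indel_2 and indel_3 and the Phred 20 threshold come from the text; treating indels longer than three bases as "other" and requiring every supporting base to exceed the threshold are assumptions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
MIN_PHRED = 20  # differences must have quality "higher than Phred 20"


def classify_substitution(ref, alt):
    """Return 'transition' (purine<->purine or pyrimidine<->pyrimidine) or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid or ref == alt:
        raise ValueError("ref and alt must be two different bases from ACGT")
    pair = {ref, alt}
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def classify_indel(length):
    """Label an indel by the number of bases inserted or deleted.

    Assumption: lengths above three fall outside the three patterns in the text.
    """
    if length in (1, 2, 3):
        return f"indel_{length}"
    return "other"


def passes_quality(quals, min_phred=MIN_PHRED):
    """True if every supporting base quality is strictly above the threshold."""
    return all(q > min_phred for q in quals)


print(classify_substitution("A", "G"))
print(classify_substitution("A", "C"))
print(classify_indel(2))
print(passes_quality([35, 20]))
```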
Reference List
1. Akiyoshi,D.E., Feng,X., Buckholt,M.A., Widmer,G. & Tzipori,S. Genetic analysis of
a Cryptosporidium parvum human genotype 1 isolate passaged through different host
species. Infect. Immun. 70, 5670-5675 (2002).
2. Feng,X. et al. Extensive polymorphism in Cryptosporidium parvum identified by
multilocus microsatellite analysis. Appl. Environ. Microbiol. 66, 3344-3349 (2000).
3. Carraway,M., Tzipori,S. & Widmer,G. Identification of genetic heterogeneity in the
Cryptosporidium parvum ribosomal repeat. Appl. Environ. Microbiol. 62, 712-716
(1996).
4. Widmer,G., Tchack,L., Chappell,C.L. & Tzipori,S. Sequence polymorphism in the
beta-tubulin gene reveals heterogeneous and variable population structures in
Cryptosporidium parvum. Appl. Environ. Microbiol. 64, 4477-4481 (1998).
5. Widmer,G. et al. Animal propagation and genomic survey of a genotype 1 isolate of
Cryptosporidium parvum. Mol. Biochem. Parasitol. 108, 187-197 (2000).
6. Widmer,G., Tzipori,S., Fichtenbaum,C.J. & Griffiths,J.K. Genotypic and phenotypic
characterization of Cryptosporidium parvum isolates from people with AIDS. J.
Infect. Dis. 178, 834-840 (1998).
7. Robertson,L.J., Campbell,A.T. & Smith,H.V. In vitro excystation of Cryptosporidium
parvum. Parasitology 106 ( Pt 1), 13-19 (1993).
8. Birren,B., Mancino,V. & Shizuya,H. Genome Analysis: A Laboratory
Manual. Birren,B. et al. (eds), pp. 241-296 (Cold Spring Harbor Laboratory Press, Cold
Spring Harbor, NY, 1999).
9. Gordon,D., Abajian,C. & Green,P. Consed: a graphical tool for sequence finishing.
Genome Res. 8, 195-202 (1998).
10. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. & Lipman,D.J. Basic local alignment
search tool. J Mol. Biol. 215, 403-410 (1990).
11. Piper,M.B., Bankier,A.T. & Dear,P.H. A HAPPY map of Cryptosporidium parvum.
Genome Res. 8, 1299-1307 (1998).
12. Abrahamsen,M.S. et al. Complete genome sequence of the apicomplexan,
Cryptosporidium parvum. Science 304, 441-445 (2004).
13. Rice,P., Longden,I. & Bleasby,A. EMBOSS: the European Molecular Biology Open
Software Suite. Trends Genet. 16, 276-277 (2000).
14. Lowe,T.M. & Eddy,S.R. tRNAscan-SE: a program for improved detection of transfer
RNA genes in genomic sequence. Nucleic Acids Res. 25, 955-964 (1997).
15. Rutherford,K. et al. Artemis: sequence visualization and annotation. Bioinformatics.
16, 944-945 (2000).
16. Lewis,S.E. et al. Apollo: a sequence annotation editor. Genome Biol 3,
RESEARCH0082 (2002).
17. Salzberg,S.L., Pertea,M., Delcher,A.L., Gardner,M.J. & Tettelin,H. Interpolated
Markov models for eukaryotic gene finding. Genomics 59, 24-31 (1999).
18. Besemer,J., Lomsadze,A. & Borodovsky,M. GeneMarkS: a self-training method for
prediction of gene starts in microbial genomes. Implications for finding sequence
motifs in regulatory regions. Nucleic Acids Res. 29, 2607-2618 (2001).
19. Mulder,N.J. et al. InterPro: an integrated documentation resource for protein families,
domains and functional sites. Brief. Bioinform. 3, 225-235 (2002).
20. Bendtsen,J.D., Nielsen,H., von Heijne,G. & Brunak,S. Improved prediction of signal
peptides: SignalP 3.0. J. Mol Biol 340, 783-795 (2004).
21. Emanuelsson,O., Nielsen,H., Brunak,S. & von Heijne,G. Predicting subcellular
localization of proteins based on their N-terminal amino acid sequence. J. Mol Biol
300, 1005-1016 (2000).
22. Krogh,A., Larsson,B., von Heijne,G. & Sonnhammer,E.L. Predicting transmembrane
protein topology with a hidden Markov model: application to complete genomes. J.
Mol Biol 305, 567-580 (2001).
23. Pearson,W.R. Rapid and sensitive sequence comparison with FASTP and FASTA.
Methods Enzymol. 183, 63-98 (1990).
24. Birney,E. & Amitai,M. GeneWise. EMBL - European Bioinformatics Institute.
1-30-2004.
25. Harris,M.A. et al. The Gene Ontology (GO) database and informatics resource.
Nucleic Acids Res. 32 Database issue, D258-D261 (2004).
26. Apweiler,R. et al. UniProt: the Universal Protein knowledgebase. Nucleic Acids Res.
32 Database issue, D115-D119 (2004).
27. Gene Ontology Consortium. The Gene Ontology (GO) database and informatics
resource. Nucleic Acids Res. 32, D258-D261 (2004).
28. Kanehisa,M. & Goto,S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic
Acids Res. 28, 27-30 (2000).
29. Bairoch,A. The ENZYME database in 2000. Nucleic Acids Res. 28, 304-305 (2000).
30. Kurtz,S. et al. Versatile and open software for comparing large genomes. Genome
Biol 5, R12 (2004).
31. Kohli,D.K. & Bachhawat,A.K. CLOURE: Clustal Output Reformatter, a program for
reformatting ClustalX/ClustalW outputs for SNP analysis and molecular systematics.
Nucleic Acids Res. 31, 3501-3502 (2003).