TEXT S2 - Figshare

TEXT S2: EVALUATING SEQUENCE QUALITY AND COVERAGE USING
CYTOPLASMIC RIBOSOMAL PROTEINS
Martin Helmkampf and Jürgen Gadau
School of Life Sciences, Arizona State University, Tempe, AZ 85287, United States of America
The high degree of sequence conservation among cytoplasmic ribosomal
proteins (CRPs) (Text S1), coupled with the fact that genes encoding these proteins are
widely distributed in genomes studied so far (e.g., Drosophila melanogaster and
humans [1,2]), makes CRP genes ideally suited to assess the coverage of genome
assemblies or automatically annotated gene sets. Moreover, since they can be
annotated with high confidence by comparison to homologous genes from even
distantly related organisms, sequencing errors resulting in shifts of the reading frame
can be identified unambiguously, allowing the evaluation of sequence accuracy.
In Atta cephalotes, the presence of these numerous and widely distributed genes
(which are scattered across 44 scaffolds, including the 16 largest ones) indicates that
the assembly covers the gene space of the genome very well. The automatically
annotated gene set obtained by MAKER [3], though certainly biased towards highly
expressed genes with strong EST support, also contained all but one (very short) CRP
gene and should therefore provide a good representation of the proteome. An assembly
error was observed in only one case, indicated by exons that were missing from the
genomic sequence but present in the EST dataset. Further, a single case of
sequencing error associated with a homopolymer run was discovered and confirmed by
a comparison of corresponding genomic raw reads. Relative to the total number of
nucleotide positions coding for CRP genes (45,000, including RACK1), this amounts to
an error rate of 0.002 %. While this measure underestimates the actual error rate,
because only artifactual insertions and deletions, but not substitutions that leave the
reading frame intact, are accounted for, the latter are estimated to contribute less than
20 % to the total number of errors produced by 454 sequencing technology [4]. Even
though the error rate might be two to three times higher in non-coding regions [5], this
still represents an exceptionally high sequence accuracy. Thus, coverage and sequence
fidelity of the CRP genes testify to the high overall quality of the A. cephalotes genome
assembly.
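The error-rate estimate above can be retraced with a few lines of arithmetic. The sketch below is illustrative only and is not part of the original analysis. It assumes that exactly one frameshift-causing sequencing error enters the calculation (the count consistent with the reported 0.002 %), treats the 20 % substitution share [4] as an upper limit, and applies the two- to three-fold factor for non-coding regions [5] to that corrected value. The variable names and this way of combining the figures are assumptions.

```python
# Figures quoted in the text.
crp_coding_positions = 45_000  # nucleotide positions coding for CRP genes, incl. RACK1
frameshift_errors = 1          # assumed count: the single homopolymer-associated error


def percent(errors: int, positions: int) -> float:
    """Return errors per position, expressed in percent."""
    return 100.0 * errors / positions


# Observed rate, counting only indels that shift the reading frame.
observed = percent(frameshift_errors, crp_coding_positions)

# Assumption: substitutions contribute at most 20 % of all errors [4],
# so indels make up at least 80 % and the total rate is bounded above.
max_substitution_share = 0.20
upper_total = observed / (1.0 - max_substitution_share)

# Assumption: apply the two- to three-fold increase reported for
# non-coding regions [5] to the corrected upper bound.
noncoding_low, noncoding_high = (factor * upper_total for factor in (2, 3))

print(f"observed indel-based rate:  {observed:.4f} %")
print(f"upper bound incl. substitutions: {upper_total:.4f} %")
print(f"non-coding estimate: {noncoding_low:.4f} % to {noncoding_high:.4f} %")
```

Under these assumptions, every value stays below 0.01 %, in line with the conclusion drawn in the text.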
References
1. Marygold S, Roote J, Reuter G, Lambertsson A, Ashburner M, et al. (2007) The
ribosomal protein genes and Minute loci of Drosophila melanogaster. Genome
Biology 8: R216.
2. Uechi T, Tanaka T, Kenmochi N (2001) A Complete Map of the Human Ribosomal
Protein Genes: Assignment of 80 Genes to the Cytogenetic Map and Implications
for Human Disorders. Genomics 72: 223.
3. Cantarel BL, Korf I, Robb SMC, Parra G, Ross E, et al. (2008) MAKER: An
easy-to-use annotation pipeline designed for emerging model organism genomes.
Genome Res 18: 188-196.
4. Huse S, Huber J, Morrison H, Sogin M, Welch D (2007) Accuracy and quality of
massively parallel DNA pyrosequencing. Genome Biology 8: R143.
5. Moore M, Dhingra A, Soltis P, Shaw R, Farmerie W, et al. (2006) Rapid and accurate
pyrosequencing of angiosperm plastid genomes. BMC Plant Biology 6: 17.