TEXT S2: EVALUATING SEQUENCE QUALITY AND COVERAGE USING CYTOPLASMIC RIBOSOMAL PROTEINS

Martin Helmkampf and Jürgen Gadau

School of Life Sciences, Arizona State University, Tempe, AZ 85287, United States of America

The high degree of sequence conservation among cytoplasmic ribosomal proteins (CRPs) (Text S1), coupled with the fact that the genes encoding these proteins are widely distributed in the genomes studied so far (e.g., Drosophila melanogaster and humans [1,2]), makes CRP genes ideally suited to assess the coverage of genome assemblies or automatically annotated gene sets. Moreover, since they can be annotated with high confidence by comparison to homologous genes from even distantly related organisms, sequencing errors that shift the reading frame can be identified unambiguously, allowing sequence accuracy to be evaluated.

In Atta cephalotes, the presence of these numerous and widely distributed genes (which are scattered across 44 scaffolds, including the 16 largest ones) indicates that the assembly covers the gene space of the genome very well. The automatically annotated gene set obtained by MAKER [3], though certainly biased towards highly expressed genes with strong EST support, also contained all but one (very short) CRP gene and should therefore provide a good representation of the proteome. An assembly error was observed in only one case, indicated by exons that were missing from the genomic sequence but present in the EST dataset. Further, a single sequencing error associated with a homopolymer run was discovered and confirmed by comparison with the corresponding genomic raw reads. Relative to the total number of nucleotide positions coding for CRP genes (45,000, including RACK1), this amounts to an error rate of 0.002 %.
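The arithmetic behind this figure can be retraced with a minimal sketch. It assumes that the rate counts only the single homopolymer-associated sequencing error against the 45,000 coding positions; the function name and its argument names are hypothetical and serve only as illustration.

```python
def error_rate_percent(n_errors: int, n_positions: int) -> float:
    """Return the number of observed errors per position, as a percentage."""
    if n_positions <= 0:
        raise ValueError("n_positions must be positive")
    return 100.0 * n_errors / n_positions


# Assumption: one frame-shifting sequencing error among the
# 45,000 nucleotide positions coding for CRP genes (including RACK1).
rate = error_rate_percent(n_errors=1, n_positions=45_000)
print(f"{rate:.3f} %")  # prints "0.002 %"
```

Under this assumption the unrounded value is about 0.0022 %, which rounds to the figure reported in the text.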
While this measure underestimates the actual error rate, because it accounts only for artifactual insertions and deletions and not for substitutions that leave the reading frame intact, the latter are estimated to contribute less than 20 % of the total number of errors produced by 454 sequencing technology [4]. Even though the error rate might be two to three times higher in non-coding regions [5], this still represents exceptionally high sequence accuracy. Thus, the coverage and sequence fidelity of the CRP genes testify to the high overall quality of the A. cephalotes genome assembly.

References

1. Marygold S, Roote J, Reuter G, Lambertsson A, Ashburner M, et al. (2007) The ribosomal protein genes and Minute loci of Drosophila melanogaster. Genome Biology 8: R216.
2. Uechi T, Tanaka T, Kenmochi N (2001) A Complete Map of the Human Ribosomal Protein Genes: Assignment of 80 Genes to the Cytogenetic Map and Implications for Human Disorders. Genomics 72: 223.
3. Cantarel BL, Korf I, Robb SMC, Parra G, Ross E, et al. (2008) MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res 18: 188-196.
4. Huse S, Huber J, Morrison H, Sogin M, Welch D (2007) Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology 8: R143.
5. Moore M, Dhingra A, Soltis P, Shaw R, Farmerie W, et al. (2006) Rapid and accurate pyrosequencing of angiosperm plastid genomes. BMC Plant Biology 6: 17.