Supplementary Methods: Sequencing, assembly, and annotation of the human
influenza virus genome
When designing the high-throughput sequencing pipeline for the eight RNA molecules
that comprise the influenza virus, we strove to create a method that would be robust,
consistent, and flexible. We needed to minimize the number of finishing steps required
to obtain full genomic sequences, to automate as many steps as possible, and to
accommodate changes in primer design and protocols that might arise later. For
example, primers were set up in a 96-well format so that one-step RT-PCR reactions
could be efficiently performed. A key innovation in building this pipeline was very close
interaction between bioinformatics and laboratory staff, including co-development of
software and lab procedures.
Assembly and annotation. Several aspects of the assembly process are unique to the
influenza project and were designed specifically for it. First, trimming the non-influenza
sequence from each sequence “read” was a critical step. In addition to the normal M13
tags, sequence from the degenerate primers also had to be trimmed. This is important
because if an amplification primer contains a base that does not match the sequence of
the isolate being amplified, an incorrect base could be incorporated into the PCR product
and subsequently sequenced. Second, because the amplicons are short, primer sequence
often appeared on the 3’ ends of the sequences. Third, the amplicons spanning the ends
of the circularized segments had to be split computationally into two distinct sequences
prior to assembly.
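The primer-trimming step can be illustrated with a minimal Python sketch. It is not the pipeline's actual code: the function names, the mismatch tolerance, and the primer sequences in the usage lines are hypothetical, and only a primer sitting exactly at a read end is considered. Degenerate positions are interpreted with the standard IUPAC nucleotide codes.

```python
# IUPAC nucleotide codes mapped to the bases each one can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Complement table that also covers the degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Reverse-complement a sequence that may contain IUPAC codes."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def matches_primer(window, primer, max_mismatches=1):
    """True if window matches the degenerate primer within the tolerance.

    max_mismatches is an assumed tolerance, not a value from the text.
    """
    if len(window) < len(primer):
        return False
    mismatches = sum(
        1 for base, code in zip(window, primer) if base not in IUPAC[code]
    )
    return mismatches <= max_mismatches


def trim_primers(read, forward_primer, reverse_primer, max_mismatches=1):
    """Remove primer-derived bases from the 5' and 3' ends of a read."""
    read = read.upper()
    start, end = 0, len(read)
    # 5' end: the forward primer itself.
    if matches_primer(read[:len(forward_primer)], forward_primer, max_mismatches):
        start = len(forward_primer)
    # 3' end: short amplicons read through into the opposite primer,
    # which appears as its reverse complement.
    rc = reverse_complement(reverse_primer)
    if end - start >= len(rc) and matches_primer(
        read[end - len(rc):], rc, max_mismatches
    ):
        end -= len(rc)
    return read[start:end]


# Hypothetical primers and insert, for illustration only.
fwd, rev = "AGCRAAAGCAGG", "AGTAGAAACAAGG"
raw = "AGCGAAAGCAGG" + "TTTACGTACG" + reverse_complement(rev)
print(trim_primers(raw, fwd, rev))  # prints TTTACGTACG
```

Trimming the primer-derived bases rather than trusting them means that a base introduced by a mismatched primer never reaches the assembly.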
The assembler minimus, designed for small projects such as this, proved highly effective
at assembling the average of 185 reads per genome. Minimus is part of the open-source
AMOS¹ project (http://amos.sourceforge.net) and was easy to reconfigure to handle some
of the special requirements of this project. One novel element of the algorithm is that we
were able to use a reference genome as a guide to assembly, which allows the assembler
to tolerate much shorter overlap between reads than normal. Thus, reads that overlapped
by only 1-2 bases could be successfully assembled together.
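The idea of using reference coordinates to accept very short overlaps can be sketched as follows. This is a simplified illustration, not the minimus algorithm: it assumes each read has already been placed on the reference by an external aligner, that placements are gap-free, and that the reads agree wherever they overlap. The `PlacedRead` type and `layout_contigs` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class PlacedRead:
    name: str
    ref_start: int  # 0-based start of the read's placement on the reference
    sequence: str   # trimmed read, assumed gap-free relative to the reference

    @property
    def ref_end(self):
        return self.ref_start + len(self.sequence)  # exclusive end


def layout_contigs(reads, min_overlap=1):
    """Join reads whose reference placements overlap by >= min_overlap bases."""
    contigs = []
    for read in sorted(reads, key=lambda r: r.ref_start):
        if contigs and contigs[-1]["end"] - read.ref_start >= min_overlap:
            contig = contigs[-1]
            # Append only the bases that extend past the current contig end.
            new_bases = read.ref_end - contig["end"]
            if new_bases > 0:
                contig["sequence"] += read.sequence[-new_bases:]
                contig["end"] = read.ref_end
            contig["reads"].append(read.name)
        else:
            contigs.append({
                "start": read.ref_start,
                "end": read.ref_end,
                "sequence": read.sequence,
                "reads": [read.name],
            })
    return contigs


reads = [
    PlacedRead("r1", 0, "ACGTAC"),
    PlacedRead("r2", 5, "CGTAA"),  # overlaps r1 by a single base
    PlacedRead("r3", 12, "TTT"),   # no overlap: starts a new contig
]
for contig in layout_contigs(reads):
    print(contig["start"], contig["end"], contig["sequence"], contig["reads"])
```

Without the reference placement, a one-base overlap carries no information about whether two reads are adjacent. With it, the overlap only has to confirm that the reads abut.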
Further automation was accomplished by the AutoEditor program², which uses the
aligned reads to correct errors made by the basecalling software. This program corrects
approximately 80% of the mis-calls that otherwise would have to be reviewed by a
human editor. Following AutoEditor, all genomes went through at least one round of
manual review by human editors before being submitted to GenBank.
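The general principle of using aligned reads to flag and correct likely miscalls can be shown with a simple column-wise vote. This sketch is not the AutoEditor method (see ref. 2 for that). The function name, the coverage minimum, and the support threshold are assumptions chosen for illustration.

```python
from collections import Counter


def correct_column_miscalls(aligned_reads, min_support=0.75, min_coverage=3):
    """Overwrite minority basecalls in well-supported alignment columns.

    aligned_reads: equal-length strings; '-' marks positions a read
    does not cover. Thresholds are assumed values.
    """
    length = len(aligned_reads[0])
    corrected = [list(read) for read in aligned_reads]
    for col in range(length):
        calls = [read[col] for read in aligned_reads if read[col] != "-"]
        if len(calls) < min_coverage:
            continue  # too little coverage to outvote any single read
        base, count = Counter(calls).most_common(1)[0]
        if count / len(calls) >= min_support:
            for row in corrected:
                if row[col] != "-" and row[col] != base:
                    row[col] = base
        # Columns without a clear majority are left for manual review.
    return ["".join(row) for row in corrected]


reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAC"]
print(correct_column_miscalls(reads))
```

Columns in which no base reaches the support threshold are left unchanged, which mirrors the division of labour described above: clear-cut discrepancies are resolved automatically and ambiguous ones go to a human editor.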
Annotation of all genomes was accomplished using a customized influenza annotation
pipeline developed at NCBI. Because each genome contains the same genes, this process
was used not only to attach gene names and coordinates, but also to check each genome
for frameshifts or other possible sequencing errors.
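A reading-frame check of the kind described here might look like the following sketch. It is not the NCBI pipeline. It assumes the coding sequence has already been extracted using known gene coordinates, and it ignores spliced transcripts. The function name and the messages are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_reading_frame(cds):
    """Return a list of problems found in a coding sequence (empty if none)."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not cds.startswith("ATG"):
        problems.append("missing ATG start codon")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    # A stop codon before the last codon suggests a frameshift or miscall.
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            problems.append(f"premature stop codon at codon {index + 1}")
            break
    if codons and codons[-1] not in STOP_CODONS:
        problems.append("missing terminal stop codon")
    return problems


print(check_reading_frame("ATGAAATGA"))     # intact frame
print(check_reading_frame("ATGAATGA"))      # one base deleted
print(check_reading_frame("ATGTAAAAATGA"))  # internal stop codon
```

Because every isolate is expected to carry the same set of genes, any non-empty result is a useful prompt to re-examine the underlying reads.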
Sequence editing. After the eight segments of an isolate were assembled individually,
they were manually edited using CloE (Closure Editor), a TIGR program for editing
assemblies. The editors checked all apparent polymorphisms against reference data and
repaired frameshifts and other sequencing errors whenever they were discovered. After
editing, each isolate was submitted to a validation program, which checked segment
length, alignments with reference sequences, and fidelity of reading frames. Upon
validation, genomes were submitted to an annotation team at the National Center for
Biotechnology Information, which used a custom-designed gene finding pipeline to
attach gene coordinates. The finished sequences with annotation were then immediately
submitted to GenBank.
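The segment-length part of such a validation step can be sketched as below. The text does not describe the validation program in detail, so this is only an assumed shape: expected lengths are supplied by the caller, the 5% tolerance is arbitrary, and the lengths in the usage lines are placeholders rather than real segment lengths.

```python
def validate_segment_lengths(segments, expected_lengths, tolerance=0.05):
    """Return {segment_name: message} for segments failing the length check.

    segments: mapping of segment name to assembled sequence.
    expected_lengths: mapping of segment name to expected length in bases.
    tolerance: allowed fractional deviation (assumed value).
    """
    failures = {}
    for name, expected in expected_lengths.items():
        sequence = segments.get(name)
        if sequence is None:
            failures[name] = "segment missing"
            continue
        deviation = abs(len(sequence) - expected) / expected
        if deviation > tolerance:
            failures[name] = (
                f"length {len(sequence)} differs from expected {expected}"
            )
    return failures


# Placeholder lengths for illustration only.
expected = {"HA": 100, "NA": 80}
assembled = {"HA": "A" * 98}
print(validate_segment_lengths(assembled, expected))
```

A check of this kind is cheap to run on every isolate and catches missing or truncated segments before the alignment and reading-frame checks are attempted.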
1. Pop, M., Phillippy, A., Delcher, A. L. & Salzberg, S. L. Comparative genome
assembly. Brief Bioinform 5, 237-248 (2004).
2. Gajer, P., Schatz, M. & Salzberg, S. L. Automated correction of genome sequence
errors. Nucleic Acids Res 32, 562-569 (2004).