Supplementary Methods: Sequencing, assembly, and annotation of the human influenza virus genome

When designing the high-throughput sequencing pipeline for the eight RNA molecules that comprise the influenza virus, we strove to create a method that would be robust, consistent, and flexible. We needed to minimize the number of finishing steps required to obtain full genomic sequences, to automate as many steps as possible, and to accommodate changes in primer design and protocols that might arise later. For example, primers were set up in a 96-well format so that one-step RT-PCR reactions could be performed efficiently. A key innovation in building this pipeline was very close interaction between bioinformatics and laboratory staff, including co-development of software and lab procedures.

Assembly and annotation. Several aspects of the assembly process are unique to the influenza project and were designed specifically for it. First, trimming the non-influenza sequence from each sequence “read” was a critical step. In addition to the normal M13 tags, sequence from the degenerate primers also had to be trimmed. This is important because if an amplification primer contains a base that does not match the sequence of the isolate being amplified, an incorrect base could be incorporated into the PCR product and subsequently sequenced. Second, because the amplicons are short, primer sequence often appeared on the 3’ ends of the reads as well. Third, the amplicons spanning the ends of the circularized segments had to be split computationally into two distinct sequences prior to assembly.

The assembler Minimus, designed for small projects such as this, proved highly effective at assembling the average of 185 reads per genome. Minimus is part of the open-source AMOS project [1] (http://amos.sourceforge.net) and was easy to reconfigure to handle some of the special requirements of this project.
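As an illustration of the primer-trimming step described above, the following Python sketch removes a degenerate amplification primer from the 5’ end of a read and the reverse complement of the opposing primer from the 3’ end. It is a minimal sketch under stated assumptions, not the pipeline's actual implementation: the function names, the exact-match rule (no sequencing errors tolerated inside the primer region), and the convention that M13 tags have already been removed are all hypothetical.

```python
# IUPAC nucleotide codes mapped to the bases each code allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Complement table that also handles degenerate codes (e.g. R <-> Y).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Return the reverse complement of a (possibly degenerate) sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def matches(primer, window):
    """True if every base in `window` is allowed by the degenerate `primer`."""
    return len(primer) == len(window) and all(
        base in IUPAC[code] for code, base in zip(primer, window)
    )


def trim_primers(read, forward_primer, reverse_primer):
    """Trim primer-derived sequence from both ends of a read.

    Assumption (hypothetical): the read is oriented so that it starts with
    the forward primer; on a short amplicon it may run through to the
    reverse complement of the reverse primer at its 3' end.
    """
    read = read.upper()
    if matches(forward_primer, read[:len(forward_primer)]):
        read = read[len(forward_primer):]
    rc = reverse_complement(reverse_primer)
    if matches(rc, read[-len(rc):]):
        read = read[:-len(rc)]
    return read
```

A call such as `trim_primers(read, fwd, rev)` returns only the bases between the two primers, so that any primer-derived mismatch cannot propagate into the assembled consensus. A production version would presumably tolerate a small number of mismatches and handle reads in either orientation.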
One novel element of the algorithm is that we were able to use a reference genome as a guide to assembly, which allows the assembler to tolerate much shorter overlaps between reads than normal. Thus, reads that overlapped by only 1-2 bases could be successfully assembled together. Further automation was accomplished by the AutoEditor program [2], which uses the aligned reads to correct errors made by the basecalling software. This program corrects approximately 80% of the mis-calls that otherwise would have to be reviewed by a human editor. Following AutoEditor, all genomes went through at least one round of manual review by human editors before being submitted to GenBank.

Annotation of all genomes was accomplished using a customized influenza annotation pipeline developed at NCBI. Because each genome contains the same genes, this process was used not only to attach gene names and coordinates, but also to check each genome for frameshifts or other possible sequencing errors.

Sequence editing. After the eight segments of an isolate were assembled individually, they were manually edited using CloE (Closure Editor), a TIGR program for editing assemblies. The editors checked all apparent polymorphisms against reference data and repaired frameshifts and other sequencing errors whenever they were discovered. After editing, each isolate was submitted to a validation program, which checked segment length, alignments with reference sequences, and fidelity of reading frames. Upon validation, genomes were submitted to an annotation team at the National Center for Biotechnology Information, which used a custom-designed gene-finding pipeline to attach gene coordinates. The finished, annotated sequences were then immediately submitted to GenBank.

1. Pop, M., Phillippy, A., Delcher, A. L. & Salzberg, S. L. Comparative genome assembly. Brief Bioinform 5, 237-248 (2004).
2. Gajer, P., Schatz, M. & Salzberg, S. L. Automated correction of genome sequence errors. Nucleic Acids Res 32, 562-569 (2004).