The Biology of Genomes Marth INFORMATICS TOOLS FOR HUMAN GENOME RESEQUENCING Gabor T. Marth, Michael Stromberg, Chip Stewart, Weichun Huang, Aaron Quinlan Boston College, Chestnut Hill, MA 02467 Next-generation sequencing technologies are now capable of producing over a gigabase of useful data per machine per day. This vast throughput led to the sequencing of several notable individual human genomes, and the 1000 Genomes Project is gearing up for en-masse sequencing of thousands of more individuals. Because of rapid technological changes software tools for mammalian-scale resequencing analyses are currently in a flux. As next-generation human resequencing becomes more routine there is a growing need not only for efficient software but also for clear algorithmic behavior and well characterized performance. We developed a complete suite of software tools for mammalian-scale variation discovery. (1) For read alignments we updated our aligner/assembler program, MOSAIK, to work with paired fragment-end reads including 454 and dibase-encoded SOLiD sequences. The final MOSAIK read alignments are constructed with the Smith-Waterman algorithm producing gapped alignments necessary for short-INDEL discovery and for aligning reads that contain INDEL errors. This highly sensitive technique also allows us to accurately identify reads that map to a unique genome location, and report every alignment position for reads that map to multiple regions, a behavior critical for accurate SNP calling. (2) We have completely re-engineered our polymorphism discovery program, POLYBAYES, for heterozygous SNP and short-INDEL detection in diploid, whole-genome short-read sequence, and added algorithms for accurate individual genotype calling based on the aligned reads. (3) We developed a new program, SPANNER, for detecting structural variation events from paired-end read map positions, and quantifying copy number from the depth of read coverage. (4) We customized our assembly viewer program, EAGLEVIEW, for visual data validation. These tools form an integrated informatics pipeline using efficient, standardized read, assembly, and annotation data file formats. We also developed a benchmarking suite that allows us to test the performance of alignment and variation detection software based on synthetic datasets generated from informed models of sequence variations and technology-specific sequencing error profiles. Benchmarking has allowed us to measure, and subsequently improve, the accuracy and sensitivity of our analysis software, as we report in this presentation. We describe the application of our tools for SNP and short-INDEL discovery in whole-genome human short-fragment paired-end Illumina/Solexa sequencing reads collected from a normal human genome. We also demonstrate our pipeline for SV discovery in a whole-genome, normal human resequencing dataset consisting of 45 million 2x25-bp paired-end reads from ~2kb fragments (25x physical clone coverage) sequenced with the AB SOLiD system, and in SOLiD paired-end datasets from human disease genomes for which tiling microarray data is available to evaluate our computational structural variation candidates.