A Probabilistic Approach to Single-Nucleotide

advertisement
The Biology of Genomes
Marth
INFORMATICS TOOLS FOR HUMAN GENOME RESEQUENCING
Gabor T. Marth, Michael Stromberg, Chip Stewart, Weichun Huang, Aaron Quinlan
Boston College, Chestnut Hill, MA 02467
Next-generation sequencing technologies are now capable of producing over a
gigabase of useful data per machine per day. This vast throughput led to the sequencing
of several notable individual human genomes, and the 1000 Genomes Project is gearing
up for en-masse sequencing of thousands of more individuals. Because of rapid
technological changes software tools for mammalian-scale resequencing analyses are
currently in a flux. As next-generation human resequencing becomes more routine there
is a growing need not only for efficient software but also for clear algorithmic behavior
and well characterized performance.
We developed a complete suite of software tools for mammalian-scale variation
discovery. (1) For read alignments we updated our aligner/assembler program, MOSAIK,
to work with paired fragment-end reads including 454 and dibase-encoded SOLiD
sequences. The final MOSAIK read alignments are constructed with the Smith-Waterman
algorithm producing gapped alignments necessary for short-INDEL discovery and for
aligning reads that contain INDEL errors. This highly sensitive technique also allows us
to accurately identify reads that map to a unique genome location, and report every
alignment position for reads that map to multiple regions, a behavior critical for accurate
SNP calling. (2) We have completely re-engineered our polymorphism discovery
program, POLYBAYES, for heterozygous SNP and short-INDEL detection in diploid,
whole-genome short-read sequence, and added algorithms for accurate individual
genotype calling based on the aligned reads. (3) We developed a new program,
SPANNER, for detecting structural variation events from paired-end read map positions,
and quantifying copy number from the depth of read coverage. (4) We customized our
assembly viewer program, EAGLEVIEW, for visual data validation. These tools form an
integrated informatics pipeline using efficient, standardized read, assembly, and
annotation data file formats.
We also developed a benchmarking suite that allows us to test the performance
of alignment and variation detection software based on synthetic datasets generated from
informed models of sequence variations and technology-specific sequencing error
profiles. Benchmarking has allowed us to measure, and subsequently improve, the
accuracy and sensitivity of our analysis software, as we report in this presentation.
We describe the application of our tools for SNP and short-INDEL discovery in
whole-genome human short-fragment paired-end Illumina/Solexa sequencing reads
collected from a normal human genome. We also demonstrate our pipeline for SV
discovery in a whole-genome, normal human resequencing dataset consisting of 45
million 2x25-bp paired-end reads from ~2kb fragments (25x physical clone coverage)
sequenced with the AB SOLiD system, and in SOLiD paired-end datasets from human
disease genomes for which tiling microarray data is available to evaluate our
computational structural variation candidates.
Download