TechApp Example 2 - Albuquerque High Performance Computing

Computational Biology and Biomedicine

Polony Sequencing of the $1000 Genome

Antoine Ho 1, Norah Torrez-Martinez 1, and Jeremy S. Edwards 1,2,3

1 Department of Molecular Genetics and Microbiology; 2 Department of Chemical and Nuclear Engineering; 3 UNM Cancer Center, University of New Mexico, Albuquerque, NM 87131

E-mail: jsedwards@salud.unm.edu

Introduction

Next Generation Sequencing is becoming current generation sequencing, prompting the need for more computational advances. Next Generation Sequencing methods have moved away from Sanger Sequencing [1], which relied on electrophoresis, and toward sequencing methods that lend themselves to greatly increased throughput. These technologies use different library protocols, depending on whether sequencing will be done on a microarray, on beads, or on single molecules. The sequencing itself also varies by method: sequencing by hybridization [2], by ligation, or by extension.

Whatever the means utilized, high performance computing will be required to process the data, whether for the image processing that creates the raw tags themselves or for the computationally intensive process of aligning tags to a reference genome while trying to discern real variation, such as Single Nucleotide Polymorphisms (SNPs) or structural genome alterations, from noise.
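To make the alignment step concrete, the following is a minimal sketch of placing a short tag on a reference while tolerating a small number of mismatches, then reporting the mismatched positions as candidate SNPs. It is a naive sliding-window scan written for illustration, not the pipeline used in this work; the function names, the `max_mismatches` parameter, and the example sequences are all assumptions.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def align_tag(tag, reference, max_mismatches=1):
    """Return (position, mismatches) for the best placement of tag on
    reference, or None if no placement is within max_mismatches.
    Ties are resolved in favor of the leftmost position."""
    best = None
    for pos in range(len(reference) - len(tag) + 1):
        window = reference[pos:pos + len(tag)]
        d = hamming(tag, window)
        if d <= max_mismatches and (best is None or d < best[1]):
            best = (pos, d)
    return best


def candidate_snps(tag, reference, pos):
    """List (reference_index, reference_base, tag_base) for every mismatch
    between the tag and the reference at the given placement."""
    return [
        (pos + i, reference[pos + i], base)
        for i, base in enumerate(tag)
        if reference[pos + i] != base
    ]


# Hypothetical example data, chosen only to exercise the functions.
reference = "ACGTTGCAAGCT"
tag = "TGCTAG"
hit = align_tag(tag, reference)
if hit is not None:
    print(hit, candidate_snps(tag, reference, hit[0]))
```

A scan like this costs time proportional to the reference length for every tag, which is why real aligners rely on indexed references and why the full problem calls for high performance computing. A single mismatch in one tag is also not evidence of a SNP on its own; distinguishing real variants from sequencing error requires many overlapping tags.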

Polony Sequencing Methodology

Polony Sequencing utilizes a fixed bead array and Sequencing By Ligation (SBL) to obtain sequence information. Biotinylated template DNA is isolated in a PCR solution and added to streptavidin-coated beads. Through a process of emulsion PCR [3], the beads become clonally covered with multiple copies of a single template strand. In SBL, an anchor primer is hybridized to a known region of the template DNA, usually a linker or adaptor that has been ligated onto a fragment of unknown DNA sequence. Then, using a series of degenerate query oligos that carry fluorophores, the DNA adjacent to the known region is sequenced through repeated cycles of hybridizing an anchor primer, ligating a query oligo, and then denaturing the DNA to clear all signal before repeating [3]. Images captured in four fluorescent channels, one corresponding to each base, serve as data. On every frame, each bead's coordinates are recorded along with the signal the bead gave for a base position. Multiple cycles of biochemistry and imaging yield a sequence for every bead, which is then used as the raw sequence for alignment.
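The step from per-bead channel signals to a raw sequence can be sketched as follows. This is a minimal illustration, assuming that each bead already has one four-channel intensity tuple per sequenced position and that the channels map to bases in the order A, C, G, T. The channel order, the data layout, and the confidence score are assumptions made for the example, not the method used in this work.

```python
BASES = "ACGT"  # assumed mapping of the four fluorescent channels to bases


def call_base(intensities):
    """Call the base whose channel is brightest. Also return a simple
    confidence score in [0, 1]: the gap between the two brightest
    channels divided by their sum (0.0 when both are zero)."""
    ranked = sorted(range(4), key=lambda i: intensities[i], reverse=True)
    top = intensities[ranked[0]]
    second = intensities[ranked[1]]
    total = top + second
    confidence = (top - second) / total if total > 0 else 0.0
    return BASES[ranked[0]], confidence


def call_reads(signals):
    """Turn {bead_coordinates: [four-channel tuple per position]} into
    {bead_coordinates: read string}."""
    return {
        bead: "".join(call_base(channels)[0] for channels in positions)
        for bead, positions in signals.items()
    }


# Hypothetical signals for one bead at image coordinates (10, 20).
signals = {
    (10, 20): [(900, 50, 30, 20), (10, 40, 800, 30), (5, 5, 5, 700)],
}
print(call_reads(signals))
```

Keying the reads by bead coordinates mirrors the description above, where each bead's position is recorded on every frame so that signals from different cycles can be matched to the same bead. A confidence score of this kind could be used to discard ambiguous positions before alignment.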

SBL is limited in read length because base hybridization becomes less accurate the further a position is from the ligation point. Data suggest that ligation accuracy and efficiency remain high for the first few positions, with diminishing specificity at positions further away. This is one reason why gains in SBL read length have commonly come from anchor primer and library construction methods. More anchor primer sites allow more positions to be sequenced via SBL, but inserting a new anchor primer site requires significant library preparation, with potential loss of material and amplification bias. By utilizing an SBL variation that incorporates an Endonuclease V digestion, it is possible to extend read lengths beyond those of a typical SBL approach. This is the approach under investigation in this work.

Computational Analysis

Acknowledgements

UNM Integrating Nanotechnology with Cell Biology and Neuroscience (INCBN) NSF IGERT Program; UNM Cancer Center Shared Resource for Bioinformatics and Computational Biology (M Murphy, SM Wilson, SR Atlas); UNM Center for Advanced Research Computing
