In Silico Primer Design and Simulation for Targeted High-Throughput Sequencing
I519 – FALL 2010
Adam Thomas, Kanishka Jain, Tulip Nandu

BACKGROUND
• Major milestones
  • Molecular structure of DNA
  • Human Genome Project
  • High-Throughput Sequencing (HTS)
• HTS extended common experiments from single genes to entire genomes
  • Low cost
  • Multiple samples in every run (e.g. a 454 sequencer can sequence 400-600 Mb)
• Primers are short strands of nucleotides that serve as the starting point of DNA synthesis.
  • Approximately 20-25 nucleotides long.
  • Used to select the DNA region that needs amplification.
  • Complementary to the target DNA strand.

PCR
• Polymerase Chain Reaction
• Technique to amplify a small region of DNA
• 3-step process:
  • Denaturation: heat (approx. 90°C) separates the double strand into two single strands.
  • Annealing: primers bind to the individual strands (occurs at 45 to 60°C).
  • Extension: the temperature is raised to 72°C and the Taq DNA polymerase enzyme replicates the DNA strands.
• After the end of the first cycle, the process is repeated for approximately 30 to 40 cycles.

CURRENT PROCESS
• Primer3 is used to design the PCR primers.
• The primers then need to be validated. Validation is performed by simulation, alignment and re-assembly.
• MetaSim is used to simulate PCR and create the expected amplicons.
• CAP3 is used for re-assembly of the simulated sequences.
• BLASTing the simulated sequences against the original sequence gives a fairly accurate measure of how well the primers will perform.

ISSUES FACED WITH CURRENT PROCESS
• Each tool uses different input and output file formats.
• Users have to manually convert file formats for each tool.
• No existing tool integrates all of these functions and provides high-throughput analysis.
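
To make the primer and PCR background concrete, the following is a minimal Python sketch of how an amplicon could be predicted from a primer pair. It is an illustration only, not the method used by Primer3 or MetaSim: it assumes exact primer matches, ignores melting temperature and mismatches, and the function names `reverse_complement` and `predict_amplicon` are hypothetical.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def predict_amplicon(template, forward_primer, reverse_primer):
    """Return the region a primer pair would amplify, or None.

    Assumption: primers must match the template exactly. The forward
    primer is searched on the given strand; the reverse primer anneals
    to the opposite strand, so on the given strand its binding site
    reads as the reverse complement of the primer.
    """
    start = template.find(forward_primer)
    if start == -1:
        return None
    site = reverse_complement(reverse_primer)
    # Look for the reverse primer site downstream of the forward primer.
    end = template.find(site, start + len(forward_primer))
    if end == -1:
        return None
    return template[start:end + len(site)]


if __name__ == "__main__":
    template = "GGATCCAAAATTTTCCCCGGGGTACGTA"
    print(predict_amplicon(template, "ATCC", "ACCC"))
```

The amplicon includes both primer sites, which mirrors the fact that the primers become part of the amplified product.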

GOAL
Integrate the whole process involved in a high-throughput sequencing experiment and keep track of the parameters that are entered or changed.

OBJECTIVES
• A way to visualize the primers and amplicons in relation to the genome, edit the primers manually, and see how the edits affect the simulation.
• Optimization of the high-throughput process by minimizing the number of reads needed by the '454 process' while still being able to assemble the sequence.
• Validation of the simulated amplicon reads to check whether the simulation behaves as predicted, and to rectify any problems.

PROPOSED SOLUTION

VISUALIZATION TOOL
• GBrowse
  • Popular and open source.
  • Well-defined plugin architecture.
  • A plugin to design primers using Primer3 is already available.

PRIMER DESIGN
• The PrimerDesign.pm plugin already exists for GBrowse. It designs primers using Primer3.
• Designed to amplify only one specific region of DNA, with as few primers as possible and no overlapping amplicons.
• Tweaked to take two additional input parameters: Amplicon Overlap and Max Amplicon Length.
• Once the primers are created using GBrowse, they are output in Feature File Format (FFF).

PRIMER VALIDATION

SIMULATION
• Simulation is performed using MetaSim.
• MetaSim:
  • Generates sets of synthetic reads or mate-pairs based on adaptable sequencing error models (e.g. for Sanger chemistry, Roche's 454 and Illumina (formerly Solexa)).
  • Can be controlled via a graphical user interface or in command-line mode.
• A function written in Perl invokes MetaSim using its command-line options.
• Algorithm:
  1. Read the FFF file and extract the primer coordinates.
  2. Extract the corresponding sequence from the original sequence.
  3. Run the MetaSim simulation using command-line options.
  4. Each sequence generates its own FASTA file.

ASSEMBLY
• A Perl function was written to invoke CAP3 using its command-line interface.
• Each file generated by the MetaSim simulation is input into CAP3, which then assembles the contigs.
• CAP3:
  • Input: the simulated sequences as a FASTA file.
  • CAP3 is a sequence assembly program that assembles a set of short sequence reads into contigs.
  • Takes as input a file of sequence reads in FASTA format.
  • If a header contains a dot ('.'), CAP3 requires that the names of reads sequenced from the same subclone contain the same substring up to the first dot.
  • Can be invoked using a command-line interface.

BLAST
• Assembled contigs are then BLASTed against the original sequence for validation.
• GBrowse accepts the assembled sequence and BLASTs it against the original sequence.
• This plugin requires 4 steps:
  1. Exporting the assembled contigs and the original sequence from GBrowse.
  2. Creating a BLAST database.
  3. BLASTing the contigs against the sequence.
  4. Importing the result back into GBrowse.

DEMO

QUESTIONS
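
As a supplement to the SIMULATION algorithm, the coordinate-based extraction and per-sequence FASTA output could be sketched roughly as follows. This is a Python illustration under stated assumptions, not the Perl function described in the slides: it assumes the primer coordinates have already been parsed out of the FFF file into hypothetical `(name, start, end)` tuples with 1-based inclusive positions, and it does not invoke MetaSim. The names `extract_amplicons` and `to_fasta` are invented for this example.

```python
def extract_amplicons(reference, regions):
    """Cut one subsequence out of the reference per region.

    Assumption: each region is a (name, start, end) tuple using
    1-based, inclusive coordinates on the reference sequence.
    """
    amplicons = {}
    for name, start, end in regions:
        if not (1 <= start <= end <= len(reference)):
            raise ValueError(f"region {name!r} is outside the reference")
        # Convert 1-based inclusive coordinates to a Python slice.
        amplicons[name] = reference[start - 1:end]
    return amplicons


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with wrapped lines."""
    lines = [">" + name]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    regions = [("amp1", 2, 5), ("amp2", 6, 10)]
    for name, seq in extract_amplicons(reference, regions).items():
        # One FASTA record per extracted sequence; each could be
        # written to its own file before running the simulator.
        print(to_fasta(name, seq), end="")
```

Keeping each amplicon in its own FASTA record matches the slide's note that each sequence generates its own FASTA file, so each can be simulated and assembled independently.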