Molecular Biology: DNA sequencing
Author: Prof Marinda Oosthuizen
Licensed under a Creative Commons Attribution license.

SEQUENCE ANALYSIS

Once you have generated a sequence trace file, you will need to process it before you can analyse your data. The first step is to evaluate the quality of your sequence data to decide whether it is worth analysing further. If you have generated more than one sequence from your template, you will need to assemble those sequences into a single contig. Once your sequences have been assembled and edited, you can write out a consensus sequence and analyse it further. You may want to compare it with existing sequence data to find out how your organism is related to other known organisms, or to locate unique sequences that could be used to identify your organism in a diagnostic test.

Many programs are available for sequence manipulation and analysis. Some of these are freely available, and you can download them from the web and install them on your own computer.

Evaluate your sequence data

You will first need to evaluate the quality of your sequence data. A good sequence looks like this:

[Figure: chromatogram of a good sequence. Peaks should be sharp, well defined, and scaled high in the first several panels of printed analysed data.]

The characteristics of peaks on a chromatogram change from panel to panel in a predictable way for a typical sequencing run. In the first few panels, the peaks should be sharp, well defined and scaled high. Peak definition should remain fairly good for up to about 550 bases.

Several free programs are available for viewing AB1 trace files.
These include:

Chromas LITE (http://www.technelysium.com.au/chromas_lite.html)
BioEdit (http://www.mbio.ncsu.edu/BioEdit/page2.html)
Trev (a sequence editor which is part of the Staden Package: http://staden.sourceforge.net/staden_home.html)

Sequence Assembly

If you have generated more than one sequence from your template, you will need to assemble those sequences into a single contig. A contig is an aligned, contiguous section of sequence comprising two or more reads joined together by virtue of matching bases. It is always a good idea to obtain a forward and a reverse read for every template you wish to sequence so that you can check the accuracy of your data.

Gap4 is recommended for sequence assembly. It forms part of the Staden Package (http://staden.sourceforge.net/staden_home.html). The original version of gap4 was described in Bonfield et al. (1995). Four programs are installed with the Staden Package: pregap4, gap4, spin and trev.

Pregap4: Before entry into a gap4 database, the raw data from sequencing instruments need to be passed through several processes, such as screening for vectors, quality evaluation, and conversion of data formats. Pregap4 provides a graphical user interface for setting up the processing required to prepare trace data for assembly or analysis, and also provides a way to automate that processing.

Gap4: Gap4 is a Genome Assembly Program. It contains all the tools that would be expected from an assembly program, plus many unique features and an easy-to-use interface. It performs assembly, contig joining, assembly checking, repeat searching, experiment suggestion, read pair analysis and contig editing.

Spin: Spin is an interactive and graphical program for analysing and comparing sequences. It contains functions to search for restriction sites, consensus sequences/motifs and protein coding regions, and it can analyse the composition of a sequence and translate DNA to protein.
It also contains functions for locating segments of similarity within and between sequences, and for finding global and local alignments between pairs of sequences.

Trev: For some types of sequencing project it is convenient to view and edit the chromatogram data before assembly into a gap4 database; this is the function of the program trev (we have already met trev).

Sequence comparison using BLAST

Newly obtained sequences can be compared with databases of previously characterized genes and proteins to identify them, to assign possible functions or to obtain evolutionary information. One of the most commonly used comparison tools is BLAST (Basic Local Alignment Search Tool), a method for rapid searching of nucleotide and protein databases.

There are three main sequence databases, which share sequence data on a daily basis:

DNA Data Bank of Japan (DDBJ): http://www.ddbj.nig.ac.jp
EMBL Nucleotide Sequence Database: http://www.ebi.ac.uk
GenBank: http://www.ncbi.nlm.nih.gov

GenBank is maintained by the National Center for Biotechnology Information (NCBI) in the National Library of Medicine (NLM) at the National Institutes of Health (NIH) in Bethesda, Maryland, USA. The NCBI creates public databases, conducts research in computational biology and develops software for analysing sequence data. BLAST is NCBI's sequence similarity search tool.

Many different BLAST programs are available. The program to choose depends on the nature of the query sequence, the purpose of the search and the target database. For nucleotide queries, megablast is currently the best program for finding an identical match to a query sequence. Discontiguous megablast and blastn are the programs of choice for finding similar sequences from related organisms. To identify a query amino acid sequence and to find similar sequences from other organisms in protein databases, blastp is the program of choice.
blastx translates a nucleotide query sequence in all six reading frames and compares the translations to a protein database.

Sequence alignment

Sequence alignments can be produced to compare related sequences. If two biological sequences are sufficiently similar, they almost invariably have similar biological functions and are probably descended from a common ancestor. This implies that function is encoded in sequence and that there is redundancy in the encoding, such that some positions in the sequence may be changed without perceptible changes in function. Sequence alignments provide the basis for predicting de novo the secondary structure of proteins, for knowledge-based tertiary structure predictions, and for inferring phylogenetic trees and resolving questions of ancestry between species.

There are many sequence alignment programs available. One of the most commonly used is ClustalW (Higgins et al., 1994) (http://www.matfys.kvl.dk/bioinformatik/juni2001/exercise5.html), a general-purpose multiple sequence alignment program for DNA or proteins. It produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the selected sequences and lines them up so that the identities, similarities and differences can be seen. Evolutionary relationships can then be viewed as cladograms or phylograms.
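
The idea of lining sequences up so that identities and differences can be seen can be illustrated with a small sketch of pairwise global alignment using the Needleman-Wunsch dynamic programming method. This is not ClustalW itself, which builds multiple alignments progressively from pairwise comparisons; it is only a minimal two-sequence example. The function names (needleman_wunsch, percent_identity) and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for illustration, not values taken from any of the programs mentioned above.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align sequences a and b.

    Returns (aligned_a, aligned_b, score). The scoring values are
    illustrative assumptions, not those of any particular program.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair two bases
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns in which both sequences agree."""
    if not aligned_a:
        return 0.0
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)


if __name__ == "__main__":
    top, bottom, total = needleman_wunsch("ACGT", "AGT")
    print(top)     # ACGT
    print(bottom)  # A-GT
    print(total, percent_identity(top, bottom))  # 1 75.0
```

In the usage example, the shorter sequence receives one gap so that the remaining three bases line up. Real alignment programs use more elaborate scoring schemes, so this sketch is best read as a picture of the principle rather than a substitute for them.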