
Molecular Biology: DNA sequencing
Author: Prof Marinda Oosthuizen
Licensed under a Creative Commons Attribution license.
SEQUENCE ANALYSIS
Once you have generated a sequence trace file, you will need to manipulate it in some way in order to
analyse your data.

The first thing you’ll need to do is to evaluate the quality of your sequence data to see whether
it’s worth continuing to analyse it.
If you have generated more than one sequence from your template you will need to assemble
those sequences into a single contig.

Once your sequences have been assembled and edited, you can write out a consensus
sequence and analyse it further.

You may want to compare it with existing sequence data to find out how your organism is
related to other known organisms, or perhaps to locate unique sequences that could be used to
identify your organism in a diagnostic test.
Many programs are available for sequence manipulation and analysis. Some of these are freely
available and you can download them from the web and install them on your own computer.
Evaluate your sequence data
You will first need to evaluate the quality of your sequence data. In a good sequence, the peaks are
sharp, well defined and scaled high in the first several panels of the printed, analysed data.
[Figure: chromatogram of a good sequence]
The characteristics of peaks on a chromatogram change from panel to panel in a predictable way for
a typical sequencing run. In the first few panels, the peaks should be sharp, well defined and scaled
high. Peak definition should remain fairly good for up to about 550 bases.
Several programs for viewing AB1 trace files are freely available. These include:

Chromas LITE (http://www.technelysium.com.au/chromas_lite.html)

BioEdit (http://www.mbio.ncsu.edu/BioEdit/page2.html)

Trev (a sequence editor which is part of the Staden Package:
http://staden.sourceforge.net/staden_home.html)
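Visual inspection of the trace can be complemented by a numerical check. The sketch below assumes that your base caller has also produced one Phred-style quality score per base (an assumption; the viewers above are only described here as displaying traces) and trims the low-quality ends of a read. The function name, the threshold of 20 and the window of 10 bases are illustrative choices, not settings taken from any of the programs listed.

```python
def trim_by_quality(qualities, threshold=20, window=10):
    """Return (start, end) of the region worth keeping; end is exclusive.

    A window is slid in from each end of the read. Trimming stops at the
    first window whose mean quality reaches the threshold, and any
    low-quality bases left at the edge of that window are then removed.
    Returns (0, 0) if no window is good enough.
    """
    n = len(qualities)
    if n < window:
        return (0, 0)

    def window_mean(i):
        return sum(qualities[i:i + window]) / window

    good = [i for i in range(n - window + 1) if window_mean(i) >= threshold]
    if not good:
        return (0, 0)

    start = good[0]
    while qualities[start] < threshold:
        start += 1

    end = good[-1] + window
    while qualities[end - 1] < threshold:
        end -= 1

    return (start, end)


# Hypothetical read: poor start, good middle, poor tail.
scores = [5] * 10 + [40] * 30 + [5] * 10
start, end = trim_by_quality(scores)
print(f"keep bases {start} to {end - 1}")
```

A read for which the function returns (0, 0) is probably not worth analysing further.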
Sequence Assembly
If you have generated more than one sequence from your template you will need to assemble those
sequences into a single contig. A contig is an aligned contiguous section of sequence comprising two
or more reads joined together by virtue of matching bases. It is always a good idea to obtain a forward
and reverse read for every template you wish to sequence in order to check the accuracy of your
data.
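To make the idea of a contig concrete, the following sketch joins a forward read and a reverse read by reverse-complementing the reverse read and looking for an exact overlap. It is a simplified illustration only: real assembly programs tolerate mismatches and take base quality into account, and the function names and the minimum overlap of 10 bases are assumptions made for this example.

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_reads(forward, reverse, min_overlap=10):
    """Join a forward read and a reverse read into one sequence.

    The reverse read is reverse-complemented, then the longest exact
    match between the end of the forward read and the start of the
    reverse-complemented read is used to join them.
    Returns None if no overlap of at least min_overlap bases is found.
    """
    rc = reverse_complement(reverse)
    for size in range(min(len(forward), len(rc)), min_overlap - 1, -1):
        if forward[-size:] == rc[:size]:
            return forward + rc[size:]
    return None


# Hypothetical template, sequenced from both ends with a 15-base overlap.
template = "ATGGCGTACGTTAGCCTAGGCTTAACGGATCCGTAAGCTTGGCA"
forward_read = template[:30]
reverse_read = reverse_complement(template[15:])

contig = merge_reads(forward_read, reverse_read)
print(contig == template)
```

Positions covered by both reads are the ones where the two reads can be checked against each other.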
Gap4 is recommended for sequence assembly. Gap4 forms part of the Staden Package
(http://staden.sourceforge.net/staden_home.html). The original version of gap4 was described in
Bonfield et al. (1995). Installing the Staden Package installs four different programs:
pregap4, gap4, spin and trev.
Pregap4: Before entry into a gap4 database the raw data from sequencing instruments
needs to be passed through several processes, such as screening for vectors, quality
evaluation, and conversion of data formats. Pregap4 provides a graphical user interface to
set up the processing required to prepare trace data for assembly or analysis, and also
provides a way to automate that processing.
Gap4: Gap4 is a Genome Assembly Program. The program contains all the tools that would
be expected from an assembly program plus many unique features and a very easily used
interface. It performs assembly, contig joining, assembly checking, repeat searching,
experiment suggestion, read pair analysis and contig editing.
Spin: Spin is an interactive and graphical program for analysing and comparing sequences.
It contains functions to search for restriction sites, consensus sequences/motifs and protein
coding regions, can analyse the composition of the sequence and translate DNA to protein.
It also contains functions for locating segments of similarity within and between sequences,
and for finding global and local alignments between pairs of sequences.
Trev: For some types of sequencing project it is convenient to view and edit the
chromatogram data prior to assembly into a gap4 database, and this is the function of the
program trev (we have already met trev).
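Two of the analyses listed for Spin above, sequence composition and restriction site searching, are simple enough to sketch in a few lines. The code below is not how Spin is implemented; it is a minimal illustration with hypothetical function names, using the EcoRI recognition sequence GAATTC as the example site.

```python
def gc_content(seq):
    """Return the fraction of G and C bases in a sequence (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_sites(seq, site):
    """Return the 0-based start position of every occurrence of site."""
    seq, site = seq.upper(), site.upper()
    positions = []
    pos = seq.find(site)
    while pos != -1:
        positions.append(pos)
        pos = seq.find(site, pos + 1)  # step by one to allow overlapping hits
    return positions


# Hypothetical consensus sequence containing two EcoRI sites.
consensus = "AAGAATTCGGGAATTCTT"
print(find_sites(consensus, "GAATTC"))
print(round(gc_content(consensus), 2))
```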
Sequence comparison using BLAST
Newly obtained sequences can be compared with databases of previously characterized genes and
proteins to identify them, to assign possible functions or to obtain evolutionary information. One of the
most commonly used comparison tools is BLAST (Basic Local Alignment Search Tool), a method for
rapid searching of nucleotide and protein databases. There are three main sequence databases
which share sequence data on a daily basis:

DNA Data Bank of Japan (DDBJ)
http://www.ddbj.nig.ac.jp

EMBL Nucleotide Sequence Database
http://www.ebi.ac.uk

GenBank
http://www.ncbi.nlm.nih.gov
GenBank is maintained by the National Center for Biotechnology Information (NCBI) in the National
Library of Medicine (NLM) at the National Institutes of Health (NIH) in Bethesda, Maryland, USA. The
NCBI creates public databases, conducts research in computational biology and develops software
for analyzing sequence data. BLAST is NCBI’s sequence similarity search tool.
Many different BLAST programs are available. The type of program to choose will depend on the
nature of the query sequence, the purpose of the search and the target database.

For nucleotide queries, megablast is currently the best program to use to find an identical match
to a query sequence.
Discontiguous megablast and blastn are the programs of choice for finding similar sequences
from related organisms.

To identify a query amino acid sequence and to find similar sequences from other organisms in
protein databases, blastp is the program of choice.

blastx translates a nucleotide query sequence in all six reading frames and compares the
translation to a protein database.
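The six-frame translation that blastx performs on a nucleotide query can be illustrated as follows. This is a sketch of the translation step only, not of the database search, and it assumes the standard genetic code; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq):
    """Translate a DNA sequence from its first base; '*' marks a stop codon.

    Codons containing anything other than A, C, G or T become 'X', and
    trailing bases that do not form a full codon are ignored.
    """
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return the translations of all six reading frames as a dict."""
    seq = seq.upper()
    rc = seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])  # forward strand
        frames[f"-{offset + 1}"] = translate(rc[offset:])   # reverse strand
    return frames


for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```

Each of the six translations is then compared to the protein database, which is why blastx is useful when the reading frame of the query is unknown.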
Sequence alignment
It is possible to produce sequence alignments to compare related sequences. If two biological
sequences are sufficiently similar, they almost invariably have similar biological functions and are
probably descended from a common ancestor. This implies that function is encoded into sequence
and that there is a redundancy in the encoding, such that positions in the sequence may be changed
without perceptible changes in the function.
Sequence alignments provide the basis for predicting de novo the secondary structure of proteins, for
knowledge-based tertiary structure predictions and for inferring phylogenetic trees and resolving
questions of ancestry between species.
There are many sequence alignment programs available. One of the most commonly used is
ClustalW (Higgins et al., 1994) (http://www.matfys.kvl.dk/bioinformatik/juni2001/exercise5.html), a general
purpose multiple sequence alignment program for DNA or proteins. It produces biologically
meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the
selected sequences, and lines them up so that the identities, similarities and differences can be seen.
Evolutionary relationships can then be viewed as cladograms or phylograms.
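Once sequences have been aligned, the identities and differences between them can be summarised as a percentage identity for each pair. The sketch below works on an alignment that already exists, with gaps written as "-"; it does not perform the alignment itself. Skipping columns where either sequence has a gap is an assumption made here, and alignment programs may count gaps differently. The sequence names and data are invented for illustration.

```python
def percent_identity(a, b):
    """Percentage of identical bases between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def identity_matrix(alignment):
    """Return {(name1, name2): percent identity} for every pair of names."""
    return {
        (p, q): percent_identity(alignment[p], alignment[q])
        for p in alignment
        for q in alignment
    }


# Hypothetical aligned sequences.
alignment = {
    "isolate_A": "ATGC-TAGC",
    "isolate_B": "ATGCATAGC",
    "isolate_C": "ATGGATTGC",
}
for (p, q), identity in identity_matrix(alignment).items():
    if p < q:
        print(f"{p} vs {q}: {identity:.1f}%")
```

Pairs with a higher percentage identity would be expected to sit closer together in a tree built from the same alignment.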