MODULE 1
Sequence Information and File Formats
AIMS

To understand the conventions regarding the presentation of DNA and protein sequence
information.

To understand the logic underlying these conventions.

To become familiar with the commonly used sequence file formats.

To become familiar with the READSEQ programme for the interconversion of file formats.
OBJECTIVES
The student should be able to:

Present a nucleotide or protein sequence according to accepted conventions

Recognize different sequence file formats

Interconvert files between formats
INTRODUCTION
Virtually all the information one deals with in computational molecular biology is either in the form of
DNA or protein sequences. There are certain conventions applying to the way this sequence information
is presented both in the conventional literature and in databases. Furthermore, the way in which sequence
information is stored, retrieved and manipulated varies. That is to say there are different computer file
types. This module explains the conventions of sequence presentation, describes the various file types,
and illustrates how these file types can be interconverted. This material, while really not that exciting, is
of absolutely fundamental importance to anyone wishing to work in bioinformatics.
DNA
The DNA of living organisms is normally double stranded. However, whenever you look at a paper which
includes DNA sequence information it is the convention to show only one strand of the DNA. This raises
two questions: "which strand do you show?" and "which way round do you show it?"
It is usually the case, except in some viral genomes, that either strand can be the template (non-coding)
strand at any particular point, but not both. Given that the two strands are anti-parallel, the genes on one
strand will face in one direction and the genes on the other strand will face in the opposite direction.
As you will remember, the orientation of a DNA strand is determined by which end has a 5'-phosphate
group and which has a 3'-hydroxyl group. Thus, any DNA strand has a 5'-3' polarity. RNA polymerase in
all organisms moves along the template strand of the DNA in the 3'-5' direction producing RNA that
grows in the 5'-3' direction. So in fact the RNA sequence will be identical to that of the non-template
strand, except for the presence of uracil instead of thymine. Consequently, it has become the convention
to show the non-template strand of the DNA when presenting sequence information, because it resembles
the RNA encoded by that particular gene.
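The relationship between the two strands and the RNA can be checked with a few lines of Python. This is
only a minimal sketch: the example sequence and the function names are invented for illustration, and
every strand is written in the 5'-3' direction.

```python
# Complement of each unambiguous DNA base.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, also written 5'-3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def transcribe(template: str) -> str:
    """Return the RNA (5'-3') made from a template strand given 5'-3'."""
    # The polymerase reads the template 3'-5', so the RNA is the reverse
    # complement of the template, with uracil in place of thymine.
    return reverse_complement(template).replace("T", "U")


non_template = "ATGGTCTACTGC"              # hypothetical sequence, 5'-3'
template = reverse_complement(non_template)
rna = transcribe(template)

print(template)  # GCAGTAGACCAT
print(rna)       # AUGGUCUACUGC
```

The RNA printed on the last line is the non-template strand with U in place of T, which is why the
non-template strand is the one shown in publications and databases.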
I imagine for purely cultural reasons the sequence is shown running from left to right on the page, with
the 5' end of the sequence on the left. This corresponds to the protein sequence also running from
left to right.
Sometimes when sequencing projects are in the draft state there are ambiguities in the sequence that
still have to be resolved. IUPAC has defined a standard table of nucleotide ambiguity codes.
R = A or G
K = G or T
S = G or C
Y = C or T
M = A or C
W = A or T
B = not A (G or C or T)
H = not G (A or T or C)
N = any nucleotide
D = not C (G or A or T)
V = not T (A or G or C)
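In a programme the table above is naturally held as a lookup from each code to the bases it stands for.
The sketch below is one possible way of doing this; the function names expand and matches are
hypothetical and are not taken from any particular package.

```python
from itertools import product

# Each IUPAC code mapped to the set of bases it can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand(seq: str) -> list[str]:
    """List every unambiguous sequence that an ambiguous one could stand for."""
    choices = (IUPAC[code] for code in seq.upper())
    return ["".join(bases) for bases in product(*choices)]


def matches(pattern: str, seq: str) -> bool:
    """True if the unambiguous seq is compatible with the ambiguous pattern."""
    if len(pattern) != len(seq):
        return False
    return all(base in IUPAC[code]
               for code, base in zip(pattern.upper(), seq.upper()))


print(expand("GAY"))          # ['GAC', 'GAT']
print(matches("GAY", "GAT"))  # True
print(matches("GAY", "GAA"))  # False
```

Note that expand grows very quickly for sequences with many ambiguity codes, so it is only sensible
for short stretches.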
PROTEIN
Polypeptides, like DNA strands, have a polarity, with the N-terminal and C-terminal ends possessing a free
amino group and a free carboxyl group respectively. It follows both from the fact that the N-terminal part
of the protein is synthesized first and from the convention regarding the presentation of DNA sequences,
that a polypeptide is presented with its N-terminus on the left of the page and its C-terminus on the right.
H2N-Methionine-Valine-Tyrosine-Cysteine-Arginine-Glycine-Isoleucine-Lysine-COOH
To keep the polypeptide information in a form that can be conveniently handled by computers the amino
acids are each given a single letter code. Thus, the sequence above would be represented as:
MVYCRGIK
You will notice that each amino acid is not necessarily represented by its initial letter. This table provides
the standard one-letter code for amino acids.
Glycine G
Isoleucine I
Cysteine C
Tryptophan W
Arginine R
Alanine A
Phenylalanine F
Threonine T
Proline P
Lysine K
Valine V
Tyrosine Y
Methionine M
Aspartic acid D
Histidine H
Leucine L
Serine S
Asparagine N
Glutamic acid E
Glutamine Q
There is a recent agreement in IUPAC that selenocysteine, which occasionally occurs in proteins, should
be represented by the letter U.
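The table translates directly into a lookup. The following sketch converts the written-out polypeptide
shown earlier into its one-letter form. The function name to_one_letter is hypothetical, and the code
assumes that the residues are separated by hyphens and that the ends are labelled H2N and COOH, as in
the example above.

```python
# Full amino acid names mapped to the standard one-letter code.
ONE_LETTER = {
    "Glycine": "G", "Isoleucine": "I", "Cysteine": "C", "Tryptophan": "W",
    "Arginine": "R", "Alanine": "A", "Phenylalanine": "F", "Threonine": "T",
    "Proline": "P", "Lysine": "K", "Valine": "V", "Tyrosine": "Y",
    "Methionine": "M", "Aspartic acid": "D", "Histidine": "H", "Leucine": "L",
    "Serine": "S", "Asparagine": "N", "Glutamic acid": "E", "Glutamine": "Q",
    "Selenocysteine": "U",
}


def to_one_letter(polypeptide: str) -> str:
    """Convert 'H2N-Methionine-Valine-...-COOH' into one-letter code."""
    # Drop the labels for the free amino and carboxyl ends, keep the residues.
    residues = [part for part in polypeptide.split("-")
                if part not in ("H2N", "COOH")]
    return "".join(ONE_LETTER[name] for name in residues)


peptide = ("H2N-Methionine-Valine-Tyrosine-Cysteine-"
           "Arginine-Glycine-Isoleucine-Lysine-COOH")
print(to_one_letter(peptide))  # MVYCRGIK
```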
FILE TYPES
Many software packages have been developed for the analysis of DNA and protein sequences and an
unfortunate by-product of this is that a variety of different file formats have been developed to store DNA
and protein sequence information. The various software packages will usually only accept a specific file
format; however, there are programmes which will convert sequence information between the different
file formats. The situation is made even worse by the fact that different sequence databases hold the
information in different file formats. So an important set of basic skills is to be able to recognize the
different file formats and to be able to interconvert files between formats. The Table below lists many of
the file formats, and the most commonly used have hyperlinks to nucleotide sequence examples.
IG/Stanford
Fitch
Plain/Raw
GenBank/GB
Pearson/Fasta
PIR/CODATA
NBRF
Zuker
MSF
EMBL
Olsen
ASN.1
GCG
Phylip 3.2
PAUP/NEXUS
DNAStrider
Phylip
Pretty
READSEQ
There are many programmes for file conversion (the GCG suite alone has 23!). In this module we will use
READSEQ which is particularly useful as it automatically detects many sequence formats and
interconverts them.
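To get a feel for how a format can be detected automatically, here is a small sketch that guesses the
format from the first non-blank line of a file. It is not READSEQ's actual method, which is not described
here; the function name guess_format is hypothetical and only a handful of well-known signatures are
checked.

```python
def guess_format(text: str) -> str:
    """Guess a sequence file format from its first non-blank line."""
    for line in text.splitlines():
        if not line.strip():
            continue                      # skip leading blank lines
        if line.startswith("LOCUS"):
            return "GenBank/GB"           # GenBank flatfiles open with LOCUS
        if line.startswith("ID   "):
            return "EMBL"                 # EMBL entries open with an ID line
        if line.startswith(">"):
            # NBRF/PIR headers look like ">P1;name": a two-character
            # type code followed by a semicolon.
            if len(line) > 3 and line[3] == ";":
                return "NBRF/PIR"
            return "Pearson/Fasta"
        if line.strip().isalpha():
            return "Plain/Raw"            # nothing but sequence letters
        return "unknown"
    return "unknown"


print(guess_format(">seq1 example\nACGT"))      # Pearson/Fasta
print(guess_format("LOCUS       EXAMPLE1"))     # GenBank/GB
print(guess_format("\n\nACGTACGT\n"))           # Plain/Raw
```

A real converter has to look at far more than one line, but the first line alone is often enough for you
to recognize a format by eye.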
Web-based versions of READSEQ are available at NIH, BCM Search Launcher and Bioportal. Have a
look at them. All you need to do is select and copy a sequence, paste it into the READSEQ window
and select the format you want it converted to (there is help available if you really need it).
An extensive guide to sequence exchange and re-formatting in the GCG suite of programmes is available
in the EMBnet Biocomputing tutorials.
A DETAILED LOOK AT TWO FILE FORMATS
FASTA
The simplest sequence file format, and one used by many molecular biology analysis tools available on the
Web, is the FASTA (or Pearson) format. The first line of the file always begins with the > (greater than)
symbol. This is followed by the sequence identification, sometimes with some more informative
content, and then an end-of-line (carriage return). The sequence, DNA or protein, is then represented by a
simple string of characters on the lines that follow.
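Because the format is so simple, a reader for it takes only a few lines. The sketch below is a minimal
example and not a full implementation: the function name parse_fasta and the example record are
invented, and the code allows for the common case of several records in one file, each starting with
its own > line.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">example1 hypothetical peptide\nMVYC\nRGIK\n"
print(parse_fasta(example))
# [('example1 hypothetical peptide', 'MVYCRGIK')]
```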
GENBANK
Perhaps the most important file format to be familiar with is the GenBank flatfile. Let's start by explaining
why this format is so important. There are three international sequence databases (GenBank in the USA,
EMBL in Europe and DDBJ in Japan) which together constitute the International Nucleotide Sequence
Database Collaboration. A key feature of this collaboration is an agreement on the rules relating to the
way that data is stored and annotated (i.e. the file format) that permits the three databases to exchange
information on a daily basis.
The GenBank, EMBL, and DDBJ nucleic acid sequence data banks have from their inception used tables
of sites and features to describe the roles and locations of higher order sequence domains and elements
within the genome of an organism. In February 1986, GenBank and EMBL began a collaborative effort
(joined by DDBJ in 1987) to devise a common feature table format and common standards for annotation
practice. It is very useful to have a look at The DDBJ/EMBL/GenBank Feature Table Definition.
The term flatfile refers to the structure of the database (databases can either be flatfile or relational).
The sequence data in GenBank is held as ASN.1 files that are machine readable, but don't make much
sense to human beings (have a look at an example). The ASN.1 file can be converted to a human-readable
GenBank flatfile (GBFF). The GBFF consists of three distinct parts:
The Header contains database-specific information that gives the sequence its unique identification,
together with other information, e.g. source organism, associated literature references etc.
The Features section comprises the annotation of the sequence and describes features such as exons,
introns, coding sequences etc.
The nucleotide sequence itself constitutes the third part of the file.
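The three parts can be picked out by the keywords that introduce them: in a GenBank flatfile the
Features section starts at the line beginning FEATURES, the sequence follows the line beginning ORIGIN,
and the record ends with a line containing //. The sketch below splits a record on that basis. The
record itself is a made-up, heavily abbreviated example and the function name split_genbank is
hypothetical.

```python
SAMPLE = "\n".join([
    "LOCUS       EXAMPLE1                  24 bp    DNA     linear   UNK 01-JAN-2000",
    "DEFINITION  Hypothetical example record.",
    "ACCESSION   EXAMPLE1",
    "FEATURES             Location/Qualifiers",
    "     source          1..24",
    "     CDS             1..24",
    "ORIGIN",
    "        1 atggtctact gcagaggtat caaa",
    "//",
])


def split_genbank(text: str) -> tuple[str, str, str]:
    """Split one GenBank flatfile record into header, features and sequence."""
    header, features, sequence = [], [], []
    section = header                      # everything before FEATURES
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            section = features
        elif line.startswith("ORIGIN"):
            section = sequence
            continue                      # the ORIGIN line itself holds no bases
        elif line.startswith("//"):
            break                         # end of the record
        section.append(line)
    # Sequence lines carry position numbers and spaces; keep the letters only.
    bases = "".join(ch for line in sequence for ch in line if ch.isalpha())
    return "\n".join(header), "\n".join(features), bases


header, features, bases = split_genbank(SAMPLE)
print(bases)  # atggtctactgcagaggtatcaaa
```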
Have a look at a sample GBFF produced by the National Center for Biotechnology Information (NCBI),
which has a good explanation of what each part of the file means.
EXERCISES
1. Take a sequence in GenBank format and convert it to GCG and FASTA using one of the Web
READSEQ sites.
1.1 Go to the Table and click the GenBank link.
1.2 Select the whole of the GenBank file and copy it.
1.3 Go to one of the READSEQ links (e.g. NIH).
1.4 Click within the boundaries of the text box.
1.5 Paste the contents of the clipboard.
1.6 Select the desired output format from the dropdown menu.
1.7 Press the Submit button.
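If you want to check the FASTA result by hand, the conversion can be sketched in a few lines of Python.
This is not how READSEQ works internally; it is a simplified illustration with a hypothetical function
name, genbank_to_fasta, that uses the LOCUS name and the first DEFINITION line for the FASTA header
and ignores everything else in the header and Features sections.

```python
import textwrap


def genbank_to_fasta(text: str, width: int = 60) -> str:
    """Convert one GenBank flatfile record to FASTA-formatted text."""
    name, definition, residues = "unknown", "", []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            if len(fields) > 1:
                name = fields[1]          # locus name follows the keyword
        elif line.startswith("DEFINITION"):
            definition = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_sequence = True            # sequence lines follow ORIGIN
        elif line.startswith("//"):
            break                         # end of the record
        elif in_sequence:
            # Drop the position numbers and spaces, keep the bases.
            residues.extend(ch for ch in line if ch.isalpha())
    sequence = "".join(residues).upper()
    header = f">{name} {definition}".rstrip()
    return "\n".join([header] + textwrap.wrap(sequence, width)) + "\n"


record = "\n".join([
    "LOCUS       EXAMPLE1                  24 bp    DNA     linear   UNK 01-JAN-2000",
    "DEFINITION  Hypothetical example record.",
    "ORIGIN",
    "        1 atggtctact gcagaggtat caaa",
    "//",
])
print(genbank_to_fasta(record), end="")
# >EXAMPLE1 Hypothetical example record.
# ATGGTCTACTGCAGAGGTATCAAA
```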