MODULE 1
Sequence Information and File Formats

AIMS
To understand the conventions regarding the presentation of DNA and protein sequence information.
To understand the logic underlying these conventions.
To become familiar with the commonly used sequence file formats.
To become familiar with the READSEQ programme for the interconversion of file formats.

OBJECTIVES
The student should be able to:
Present a nucleotide or protein sequence according to accepted conventions
Recognize different sequence file formats
Interconvert files between formats

INTRODUCTION
Virtually all the information one deals with in computational molecular biology takes the form of DNA or protein sequences. Certain conventions govern the way this sequence information is presented, both in the conventional literature and in databases. Furthermore, the way in which sequence information is stored, retrieved and manipulated varies; that is to say, there are different computer file types. This module explains the conventions of sequence presentation, describes the various file types, and illustrates how these file types can be interconverted. This material, while not especially exciting, is of fundamental importance to anyone wishing to work in bioinformatics.

DNA
The DNA of living organisms is normally double stranded. However, whenever you look at a paper that includes DNA sequence information, the convention is to show only one strand of the DNA. This raises two questions: "which strand do you show?" and "which way round do you show it?" It is usually the case, except in some viral genomes, that at any particular point either strand can serve as the template for transcription, but not both. Given that the two strands are anti-parallel, the genes on one strand will face in one direction and the genes on the other strand will face in the opposite direction.
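The anti-parallel arrangement of the two strands can be illustrated with a short Python sketch. It is only an illustration under simple assumptions: the function name reverse_complement and the example sequence are made up for this module, and only the four unambiguous bases A, C, G and T are handled.

```python
# Illustrative sketch: the function name and the example sequence are
# made up, and only the four bases A, C, G and T are handled.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the partner strand of `seq`, written in the 5'->3' direction."""
    # Complement every base, then reverse the result, because the partner
    # strand runs in the opposite (anti-parallel) direction.
    return seq.translate(COMPLEMENT)[::-1]


top = "ATGGTCTAC"
bottom = reverse_complement(top)

print("5'-" + top + "-3'")
print("5'-" + bottom + "-3'")
```

Both printed lines describe the same double-stranded molecule, each read 5' to 3' along a different strand, which is why it matters which strand is chosen when a sequence is written down.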
As you will remember, the orientation of a DNA strand is determined by which end has a 5'-phosphate group and which has a 3'-hydroxyl group. Thus, any DNA strand has a 5'-3' polarity. RNA polymerase in all organisms moves along the template strand of the DNA in the 3'-5' direction, producing RNA that grows in the 5'-3' direction. The RNA sequence will therefore be identical to that of the non-template strand, except for the presence of uracil instead of thymine. Consequently, it has become the convention to show the non-template strand of the DNA when presenting sequence information, because it resembles the RNA encoded by that particular gene. I imagine for purely cultural reasons, the sequence is shown running from left to right on the page, with the 5' end of the sequence on the left. This corresponds to the protein sequence also running from left to right.

Sometimes, when sequencing projects are in the draft state, there are ambiguities in the sequence that still have to be resolved. IUPAC have defined a standard table for the nucleotide ambiguity codes.

R = A or G
K = G or T
S = G or C
Y = C or T
M = A or C
W = A or T
B = not A (G or C or T)
H = not G (A or T or C)
N = any nucleotide
D = not C (G or A or T)
V = not T (A or G or C)

PROTEIN
Polypeptides, like DNA strands, have a polarity: the N-terminal and C-terminal ends possess a free amino group and a free carboxyl group respectively. It follows, both from the fact that the N-terminal part of the protein is synthesized first and from the convention regarding the presentation of DNA sequences, that a polypeptide is presented with its N-terminus on the left of the page and its C-terminus on the right.

H2N-Methionine-Valine-Tyrosine-Cysteine-Arginine-Glycine-Isoleucine-Lysine-COOH

To keep the polypeptide information in a form that can be conveniently handled by computers, the amino acids are each given a single-letter code.
Thus, the sequence above would be represented as:

MVYCRGIK

You will notice that an amino acid is not necessarily represented by its initial letter. This table provides the standard one-letter code for amino acids.

Glycine          G
Isoleucine       I
Cysteine         C
Tryptophan       W
Arginine         R
Alanine          A
Phenylalanine    F
Threonine        T
Proline          P
Lysine           K
Valine           V
Tyrosine         Y
Methionine       M
Aspartic acid    D
Histidine        H
Leucine          L
Serine           S
Asparagine       N
Glutamic acid    E
Glutamine        Q

There is a recent agreement in IUPAC that selenocysteine, which occasionally occurs in proteins, should be represented by the letter U.

FILE TYPES
Many software packages have been developed for the analysis of DNA and protein sequences, and an unfortunate by-product of this is that a variety of different file formats have been developed to store DNA and protein sequence information. The various software packages will usually accept only a specific file format; however, there are programmes that will convert sequence information between the different file formats. The situation is made even worse by the fact that different sequence databases hold the information in different file formats. So an important set of basic skills is to be able to recognize the different file formats and to interconvert files between formats. The Table below lists many of the file formats, and the most commonly used have hyperlinks to nucleotide sequence examples.

IG/Stanford     Fitch            Plain/Raw
GenBank/GB      Pearson/Fasta    PIR/CODATA
NBRF            Zuker            MSF
EMBL            Olsen            ASN.1
GCG             Phylip 3.2       PAUP/NEXUS
DNAStrider      Phylip           Pretty

READSEQ
There are many programmes for file conversion (the GCG suite alone has 23!). In this module we will use READSEQ, which is particularly useful as it automatically detects many sequence formats and interconverts them. Web-based versions of READSEQ are available at NIH, BCM Search Launcher and Bioportal. Have a look at them.
All you need to do is select and copy a sequence, paste it into the Readseq window, and select the format you want it converted to (there is help available if you really need it). An extensive guide to sequence exchange and re-formatting in the GCG suite of programmes is available in the EMBnet Biocomputing tutorials.

A DETAILED LOOK AT TWO FILE FORMATS

FASTA
The simplest sequence file format, and one used by many molecular biology analysis tools available on the Web, is the FASTA (or Pearson) format. The first line of the file always begins with the > (greater-than) symbol. This is followed by the sequence identification, or sometimes by some more informative content, and then an end-of-line (carriage return). The sequence, DNA or protein, is then represented on the lines that follow by a simple string of characters.

GENBANK
Perhaps the most important file format to be familiar with is the GenBank flatfile. Let’s start by explaining why this format is so important. There are three international sequence databases (GenBank in the USA, EMBL in Europe and DDBJ in Japan), which together constitute the International Nucleotide Sequence Database Collaboration. A key feature of this collaboration is an agreement on the rules relating to the way that data is stored and annotated (i.e. the file format), which permits the three databases to exchange information on a daily basis. The GenBank, EMBL and DDBJ nucleic acid sequence data banks have from their inception used tables of sites and features to describe the roles and locations of higher-order sequence domains and elements within the genome of an organism. In February 1986, GenBank and EMBL began a collaborative effort (joined by DDBJ in 1987) to devise a common feature table format and common standards for annotation practice. It is very useful to have a look at The DDBJ/EMBL/GenBank Feature Table Definition.

The term flatfile refers to the structure of the database (databases can be either flatfile or relational).
The sequence data in GenBank is held as ASN.1 files that are machine readable, but don’t make much sense to human beings (have a look at an example). The ASN.1 file can be converted to a human-readable GenBank flatfile (GBFF). The GBFF consists of three distinct parts:

The Header contains database-specific information that gives the sequence its unique identification, together with other information, e.g. source organism, associated literature references etc.
The Features section comprises the annotation of the sequence and describes features such as exons, introns, coding sequences etc.
The nucleotide sequence itself constitutes the third part of the file.

Have a look at a sample GBFF, with a good explanation of what each part of the file means, produced by the National Center for Biotechnology Information.

EXERCISES
1. Take a sequence in GenBank format and convert it to GCG and FASTA using one of the Web READSEQ sites
1.1 Go to the Table and click the GenBank link.
1.2 Select the whole of the GenBank file and copy
1.3 Go to one of the Readseq links (e.g. NIH)
1.4 Click within the boundaries of the text box
1.5 Paste the contents of the clipboard
1.6 Select the desired output format from the dropdown menu
1.7 Press the Submit button
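
To see what the GenBank-to-FASTA conversion in Exercise 1 involves, here is a minimal Python sketch. It is not how READSEQ works and is no substitute for it. The function name genbank_to_fasta and the example record are made up for illustration, and the sketch assumes a single simple record whose sequence sits between an ORIGIN line and a closing // line. The Features section is ignored, and only the first line of the DEFINITION is used.

```python
# Minimal sketch, not a replacement for READSEQ. The function name and the
# example record below are made up; only the LOCUS, DEFINITION (first line),
# ORIGIN and // lines of a single record are interpreted.


def genbank_to_fasta(record: str, width: int = 60) -> str:
    """Convert one simple GenBank flatfile record to FASTA text."""
    identifier = "unknown"
    description = ""
    pieces = []
    in_sequence = False
    for line in record.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            if len(fields) > 1:
                identifier = fields[1]
        elif line.startswith("DEFINITION"):
            description = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_sequence = True  # sequence lines follow this keyword
        elif line.startswith("//"):
            break  # end-of-record marker
        elif in_sequence:
            # Sequence lines look like "        1 atggtctact gccgtggaat";
            # keep the letters, drop the position numbers and the spaces.
            pieces.append("".join(ch for ch in line if ch.isalpha()))
    sequence = "".join(pieces).upper()
    header = ">" + " ".join(part for part in (identifier, description) if part)
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + rows) + "\n"


# A made-up record, far shorter and simpler than a real GenBank entry.
EXAMPLE = """\
LOCUS       EXAMPLE01     24 bp    DNA
DEFINITION  Made-up example record.
ORIGIN
        1 atggtctact gccgtggaat taaa
//
"""

print(genbank_to_fasta(EXAMPLE), end="")
```

The output has the two parts described in the FASTA section: a first line beginning with > that carries the identification, followed by the sequence as a plain string of characters. Everything else in the GenBank record, including the annotation, is lost in the conversion.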