Bioinformatics Lecture 7

• Types of databases.
• Principles of organisation and functioning.
• Sequence formats.
• Conversion of one sequence format to another.
• Database search.
• FASTA, BLAST.

Protein and DNA/RNA Databases

• The first biological database was created by Margaret Dayhoff in the 1960s in response to the development of protein sequencing methods in the 1950s.
• Proteins in this and other databases (DB) were organised into families and superfamilies based on their degree of similarity.
• Tables reflecting the frequency of changes observed in the sequences of a group of closely related proteins were then derived (Percent Accepted Mutations, PAM matrices).
• These tables were used to align sequences and to reconstruct evolutionary pathways as phylogenetic trees.
• In the following years numerous protein and other databases were developed. SwissProt is an example.
• The first DNA DBs were developed in 1979-80 by American (GenBank) and European (EMBL) groups, also in response to the development of sequencing techniques.
• Hundreds of different specialised DBs have been constructed since then. Many DBs contain DNA or RNA and protein information. There are numerous links between DBs, and data and tools are exchanged regularly.
• The Entrez Nucleotides database is a collection of sequences from several sources, including GenBank, RefSeq, and PDB. The number of stored bases grows at an exponential rate. On 20.02.2004 the total number of bases stored in Entrez was 20,197,497,568.

ENTREZ

• DBs of different kinds have merged together and become global hubs of knowledge.

Protein and DNA/RNA Databases

• Each DB has a complicated internal structure.
• The section of GenBank dealing with sequences alone consists of 58 blocks. Each block has at least one, but usually several, links with other blocks.
• The major components of a DB include: storage of sequences; many blocks responsible for retrieval of sequences; several blocks responsible for alignment; blocks responsible for data input and quality control; blocks responsible for statistical analysis of sequences; and many other functions available in the DB.
• Complex DBs like GenBank contain several sets of blocks, which serve the DNA, protein, genome, taxonomy and other domains of the DB, with numerous links operating between them.
• The major banks receive hundreds of thousands or even millions of requests every day. The structure of the bank must accommodate this load.

Collecting sequences and other data

• Primary sequences of DNA, RNA and proteins constitute a significant portion of the information accumulated in DBs.
• All primary sequences that are to be in the public domain must be submitted to a DB; otherwise a publication is not accepted.
• DNA sequences are usually submitted on-line.

[Figure: on-line sequence submission form]

• Each sequence is provided with an ID.
• If two or more identical sequences are submitted by different people, a problem of redundancy in the DB emerges.
• As genome size varies dramatically, from 10,000 bp for simple viruses up to several billion bp in higher animals and plants, the number of sequences covering a whole genome also varies very significantly, from 10 to 10^6.
• DNA fragments held in DBs differ not only in length but also in origin. Some are large fragments of a genome, others represent genes or gene fragments, and some are repeats, noncoding sequences, etc.
• Many fragments overlap.
• Many sequences are annotated. This means that their position on genetic maps, the internal structure of genes (exon-intron) and their function are known or predicted. However, in many cases such information is missing.
Sequence formats

• It is important to ensure that sequence files do not contain special characters recognisable only by text editors. ASCII files are suitable for most sequence programs.
• However, independent DBs and some widely used programs have developed slightly different formats for sequences.
• Correct use of the different formats is critical, as is the ability to recognise a format and to convert a sequence, file or entry from one format to another.

[Figure: GenBank DNA sequence entry]

• There are many different (> 20) sequence formats, including GenBank, EMBL, SwissProt, FASTA, Genetics Computer Group (GCG) Sequence Format and several others.

1. FASTA/Pearson format

    >seq1
    agctagct actgg
    >seq2
    aactaact attcg

2. GenBank format

    LOCUS seq1 16bp
    DEFINITION seq1, 16 bases, 2688 checksum.
    ORIGIN
    1 agctagctag
    //
    LOCUS seq2 20bp

Conversion of one sequence format to another

• There are several computer programs able to convert formats.
• READSEQ is one such program and is very useful.

Sequence formats recognised by the format conversion program READSEQ:

1. Abstract Syntax Notation (ASN.1)
2. DNA Strider
3. EMBL
4. FASTA
5. Fitch (phylogenetic analysis)
6. GenBank
7. GCG
8. Intelligenetics
9. Multiple sequence format
10. Nat. Biomedical Research Foundation (NBRF)
11. Protein Information Resource (PIR)
12. And 6-8 additional specialised formats

[Figures: READSEQ; format conversion in GenBank]

Storage of information in a sequence database

• There are millions of entries in the major DNA and protein DBs, and each entry usually contains a significant amount of information.
• This information is organised in tabular form, as is usually done in relational DBs. The number of columns (fields) in such a DB is much larger than in the table below.
• An index of these fields can be made, which allows a very fast search of a DB using one or a few fields simultaneously.
• The information in one DB can be cross-referenced to that in another DB.
For instance, DNA, protein and reference DBs have all been cross-referenced, so that moving between them is readily accomplished.

    Accession No   Organism     Reference   Name                      Keywords                    Sequence
    123            E. coli      Medline1    LexA protein              SOS regulon, repressor,…    ATGCCGG…
    124            H. sapiens   Medline2    glucocorticoid receptor   transcriptional regulator   CCGATAAC

Database Types

• There are several types of DB; the two principal types are relational and object-oriented.
• A relational DB orders data in tables made up of rows, each giving a specific item in the DB, and columns, giving the features (attributes) of those items. Careful indexing and cross-referencing are essential, because each item in the DB has a unique set of identifying features.
• The object-oriented DB structure has been useful in the development of biological DBs. These DBs are necessary to deal with complex and constantly evolving biological objects such as genetic maps. A sophisticated architecture and a unifying common language, like the Interface Definition Language (IDL), are required to make such DBs functional and united. To plan and construct such object-oriented DBs, a specific set of procedures called the Unified Modelling Language (UML) was devised.

[Figure: example of an object-oriented DB]

Sequence retrieval from the public Databases

• The essential step in providing access to a functional DB is the development of software and Web pages that allow queries to be made.
• The ENTREZ system implemented at NCBI is a good example of such a retrieval system.
• Three major kinds of query search are implemented in a number of DBs:
  1. ID search,
  2. Molecule name or function search,
  3. Similarity search.
• The first two types are based on DB indexes and cross-references.
• The third is different, as it has to generate comparative data first and then use these data for retrieval and DB searches.
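The ID and name/keyword searches described above can be illustrated with a small relational table. The following is a minimal sketch using Python's built-in sqlite3 module, not the way any public DB is actually implemented. The table name `entries`, the index name and the helper functions are hypothetical; the two rows are the example entries from the table above, with the sequences truncated as on the slide.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table holding the two example entries."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries ("
        " accession INTEGER PRIMARY KEY,"  # the primary key is indexed automatically
        " organism TEXT, reference TEXT, name TEXT,"
        " keywords TEXT, sequence TEXT)"
    )
    conn.executemany(
        "INSERT INTO entries VALUES (?, ?, ?, ?, ?, ?)",
        [
            # Sequences are the truncated fragments shown on the slide.
            (123, "E. coli", "Medline1", "LexA protein",
             "SOS regulon, repressor", "ATGCCGG"),
            (124, "H. sapiens", "Medline2", "glucocorticoid receptor",
             "transcriptional regulator", "CCGATAAC"),
        ],
    )
    # An extra index on a frequently searched field.
    conn.execute("CREATE INDEX idx_entries_name ON entries (name)")
    return conn


def search_by_id(conn, accession):
    """ID search: return (name, organism) for an accession number, or None."""
    return conn.execute(
        "SELECT name, organism FROM entries WHERE accession = ?",
        (accession,),
    ).fetchone()


def search_by_keyword(conn, term):
    """Name/function search: accession numbers whose name or keywords contain term."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT accession FROM entries"
        " WHERE name LIKE ? OR keywords LIKE ? ORDER BY accession",
        (pattern, pattern),
    ).fetchall()
    return [row[0] for row in rows]
```

For example, `search_by_id(build_demo_db(), 123)` returns `("LexA protein", "E. coli")`. A similarity search cannot be answered from such indexes alone, which is why it is treated separately in the following slides.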
DB searches for similar sequences

• Since Charles Darwin, the idea of a common origin of species has been widely accepted; however, the level of molecular similarity between distant species remained unclear until the 1970s and 1980s.
• At that time it was established that many DNA and, particularly, protein molecules retain significant (>60-70%) or high (>85%) similarity hundreds of millions of years after separation from a common ancestor.
• This discovery, together with the practical need to search growing DBs, led to the development of effective methods of similarity search.
• Two programs that greatly facilitated similarity searches were developed: FASTA (Pearson and Lipman 1988) and BLAST (Altschul et al. 1990).

Basics of similarity searches

• The basic step in any similarity search is an alignment of two or more sequences. Principles of alignment will be considered in the next lecture.
• The search provides a list of DB sequences with which the query sequence can be aligned. A scoring procedure is then applied, which measures the degree of similarity, from 100% identity down to loose similarity.
• A common reason for performing a DB search is to find a related gene. A matched gene (or any other sequence) may provide a clue as to function.
• The reverse task is also possible: a sequence with a known function or role is used as a query to search the genome of a species.
• The search must be fast and sufficiently sensitive.

FASTA

• FASTA is a program for rapid alignment of pairs of protein and DNA sequences.
• Comparing all nucleotides or amino acids is not an option, even for powerful computers. FASTA instead searches for matching sequence patterns (“words”) called k-tuples. These patterns comprise k consecutive matches in the compared sequences.
• Using k-tuples, FASTA builds a local alignment.
• Finally, FASTA scores this alignment and outputs a list of sequences similar to the query, in descending order of score.
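The k-tuple idea can be made concrete with a short sketch. The code below is an illustration only, not the FASTA implementation: it indexes every word of length k in the query, looks up each word of the target, and counts hits per diagonal (target position minus query position). A diagonal with many hits marks a region where a local alignment is likely. The function names are hypothetical.

```python
from collections import defaultdict


def ktuple_index(seq, k):
    """Map each k-tuple (word of length k) to its start positions in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, target, k):
    """Count shared k-tuples per diagonal (target start minus query start)."""
    index = ktuple_index(query, k)
    hits = defaultdict(int)
    for j in range(len(target) - k + 1):
        # .get() does not add missing words to the index.
        for i in index.get(target[j:j + k], ()):
            hits[j - i] += 1
    return dict(hits)


def best_diagonal(query, target, k):
    """Return (diagonal, hit count) for the densest diagonal, or None if no hits."""
    hits = diagonal_hits(query, target, k)
    if not hits:
        return None
    # Ties are broken in favour of diagonals closer to the main diagonal.
    return max(hits.items(), key=lambda item: (item[1], -abs(item[0])))
```

For example, `best_diagonal("ATCGAACC", "GGATCGAACC", 4)` returns `(2, 5)`: all five 4-tuples of the query occur in the target, shifted by two positions. Only exact word matches are looked up, so the cost grows with the number of words rather than with every possible pairing of residues.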
[Figure: two DNA sequences compared, showing gaps and matching k-tuples labelled k = 6, k = 8 and k = 14]

    ATCGAACCTGGATCGTGGCCATCGAACCTGGATCGTGGCCATCGAACCTGGATCGTGGCC
    GGCGAACCCCTATCGTGGCGTTACCGCCTTATTGACGGCCATCGAACCTGGATCGTGGCC

FASTA

FASTA performs the following statistical tasks:

1. The average score for DB seq. of the same length is determined.
2. The average score is plotted against the log of the average seq. length in each length range.
3. The points are then fitted to a straight line by linear regression.
4. A z score, the number of standard deviations from the fitted line, is calculated for each score.
5. Low-scoring seq. are removed.
6. A statistical comparison with the Z distribution follows, which allows the E() value to be calculated. If E() = 0 and the z score is high, the two sequences are identical; when E() is higher than a threshold level, no clear similarity is observed.

Methods used by FASTA to locate sequence similarity:

A. Rapid location of the 10 best matching regions in each pair. For DNA seq. k = 4-6, for protein k = 1-2. The highest-density matches are identified.
B. The highest-density regions are evaluated using special scoring matrices (next lecture) and the best initial regions (INIT1) are found (* marks the best).
C. Longer regions of identity, with score INITN, are generated by joining INIT regions with scores higher than a certain threshold; the score includes positive terms for similarity and negative terms for gaps. An optimisation procedure follows.

Typical output of FASTA similarity search

Query: Motif2; #282 is a fragment from a DB.

    >>#282 (18 aa)
     initn: 48 init1: 48 opt: 71 z-score: 191.0 E(): 6.9e-06
    Smith-Waterman score: 71; 61.111% identity in 18 aa overlap

                    10        20
    Motif2 VKTYGFAATSVEEAKEVAEERGK
               X:.:::.::X.:.:.. :
    #282       GFVATSAEEAEEIAKKLG
                        10

BLAST

• The Basic Local Alignment Search Tool (BLAST) was developed as a new way to perform seq. similarity searches. BLAST is faster than FASTA while being nearly as sensitive.
• The minimal “word” (k-tuple) length is slightly higher than in FASTA: 3 for proteins and 11 for DNA.
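Before moving on to the BLAST procedure, the length normalisation in FASTA's statistical tasks 1 to 4 can be made concrete. The sketch below is a simplification and not FASTA's actual code: it regresses individual scores on the log of sequence length instead of first averaging scores within length ranges, and it reports plain z scores (residual divided by the standard deviation of the residuals). The function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all sequence lengths are equal; cannot fit a line")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def length_corrected_z_scores(lengths, scores):
    """z score of each similarity score relative to the fitted score-vs-log(length) line."""
    xs = [math.log(n) for n in lengths]
    slope, intercept = fit_line(xs, scores)
    residuals = [s - (slope * x + intercept) for x, s in zip(xs, scores)]
    sd = pstdev(residuals)
    if sd == 0:
        # Every score lies exactly on the line: nothing stands out.
        return [0.0 for _ in residuals]
    return [r / sd for r in residuals]
```

A DB sequence whose score lies far above the line for its length receives a large positive z score, and it is this length-corrected value, rather than the raw score, that feeds into the E() calculation in task 6.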
BLAST procedure

The steps used by the BLAST algorithm:

1) The query seq. is optionally filtered to remove low-complexity regions (AGAGAG…).
2) A list of words of a certain length is made from the query seq.
3) Using substitution score matrices (like PAM or BLOSUM62), the query seq. words are evaluated for matches with any DB seq., and these (log) scores are added.
4) A cutoff score (T) is selected to reduce the number of matches to the most significant ones.
5) The above procedure is repeated for each word in the query seq.
6) The remaining high-scoring words are organised into an efficient search tree and rapidly compared to the DB seq.
7) If a good match is found, an alignment is extended from the match area in both directions for as long as the score continues to grow. The latest version of BLAST uses a more time-efficient method, the essence of which is finding a diagonal connecting ungapped alignments and extending them.

[Figure: query sequence plotted against database sequence, with ungapped alignments lying on a common diagonal]

8) The next step is to determine those high-scoring pairs (HSP) of seq. that have a score greater than a cutoff score (S). S is determined empirically by examining the range of scores found by comparing random seq. and choosing a value that is significantly greater.
9) BLAST then determines the statistical significance of each HSP score. The probability p of observing a score S equal to or greater than x is given by the equation

    $p(S \ge x) = 1 - \exp\left(-e^{-\lambda (x - u)}\right)$, where $u = \frac{\log(K m' n')}{\lambda}$,

and K and λ are parameters that BLAST calculates for the amino acid or nucleotide substitution scoring matrix, n' is the effective length of the query seq. and m' is the effective length of the database seq.
10) In the next step a statistical assessment is made for the case where two or more HSP regions are found, and the matching pairs are listed in the output file in descending order of similarity score.

[Figures: on-line BLAST results]
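The significance calculation in step 9 can be tried out numerically. The sketch below evaluates the extreme-value formula from that step, taking log as the natural logarithm. The function names are hypothetical, and the values used for K and λ in the usage note are illustrative assumptions rather than values taken from the lecture. The expected-count function is not stated in the lecture; it follows from substituting u into the exponent, which gives e^{-λ(x-u)} = K m' n' e^{-λx}.

```python
import math


def gumbel_location(m_eff, n_eff, K, lam):
    """u = ln(K * m' * n') / lambda, the location of the extreme-value distribution."""
    return math.log(K * m_eff * n_eff) / lam


def hsp_p_value(x, m_eff, n_eff, K, lam):
    """p(S >= x) = 1 - exp(-exp(-lambda * (x - u)))."""
    u = gumbel_location(m_eff, n_eff, K, lam)
    # -expm1(-y) equals 1 - exp(-y) but stays accurate when y is tiny.
    return -math.expm1(-math.exp(-lam * (x - u)))


def hsp_expected_count(x, m_eff, n_eff, K, lam):
    """K * m' * n' * exp(-lambda * x): the exponent term of the formula above."""
    return K * m_eff * n_eff * math.exp(-lam * x)
```

As a usage example under assumed parameters, `hsp_p_value(60, 250, 1_000_000, K=0.13, lam=0.32)` gives the probability of a chance HSP scoring 60 or more for a 250-residue query against a DB of effective length one million. Raising x lowers the probability, and at x = u the probability is 1 - 1/e, about 0.63.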