Database Searches
FASTA

Database searches: Why?
• To discover or verify the identity of a newly sequenced gene
• To find other members of a multigene family
• To classify groups of genes

Database searching
• In practice, we cannot use Smith-Waterman to search for sequences in a database:
  – Databases are huge (GenBank ~30 million sequences, SwissProt >> 100,000 sequences)
  – S-W is slow: time is proportional to N·n², where n = sequence length and N = number of sequences in the database
• Instead, use faster heuristic approaches
  – FASTA
  – BLAST
• Tradeoff: sensitivity vs. false positives
• Smith-Waterman is slower, but more sensitive

Dot Plots
[Figure: dot plot of GATCAACTGACGTA against GTTCAGCTGCGTAC]

Dot Plots
[Figure: dot plot of the same two sequences, 4-base window and 75% identity]

FASTA
• Originally developed ~1985 by Lipman and Pearson
• Goal: perform fast, approximate local alignments to find sequences in the database that are related to the query sequence
• Based on the dot plot idea

FASTA: Step 1
• Look for exact matches between words in the query and the test sequence
  – Words are short
    • DNA words are usually 6 bases
    • Protein words are 1 or 2 amino acids
  – ktup denotes the word length
  – Use hash tables to locate words quickly

FASTA: Details
• Hashing: map a string of characters to an integer, e.g.,
  – AAA → 0
  – AAC → 1
  – ...
  – TTT → 63 (oversimplified)
• Preprocess the database and create a table that stores the locations of each possible k-tuple:
  – 20^k entries for amino acids (400 if k = 2)
  – 4^k entries for DNA (4096 if k = 6)
• Use the hash code computed from the query sequence k-tuples for quick lookup

FASTA: Step 2
• Find the 10 best diagonal runs (sequences of nearby hot spots on the same diagonal)
• Give each hot spot a positive score, and each space between consecutive hot spots a negative score that grows more negative with distance
  – Similar to affine gap costs in S-W
• Each diagonal run is composed of matches (the hot spots themselves) and mismatches (the interspot regions), but no indels

FASTA: Step 3
• Evaluate each diagonal run using an appropriate scoring matrix and find the best-scoring run
  – Discard runs with low scores (“filtration”)
• The highest-scoring diagonal is reported as init1

FASTA: Step 4
• After all diagonals are found, try to join diagonals by adding gaps
• Use a weighted directed acyclic graph over the segments, with edges between segments that could be combined using indels
• Find a maximum-weight path in this graph; it corresponds to a local alignment, reported as initn

Adding gaps

FASTA: Step 5
• If the score reaches a threshold value, compute an alternative local alignment
• Form a band around init1 in the dynamic programming table
  – Width depends on ktup
• Use Smith-Waterman to find the best alignment restricted to that band
• The result is called opt

FASTA: Final Steps
• Rank database sequences according to their opt scores
• Use the full Smith-Waterman method to align the query sequence against each of the highest-ranking sequences from the database
• Perform statistical analysis

!!SEQUENCE_LIST 1.0
(Nucleotide) FASTA of: b2.seq  from: 1  to: 693  December 9, 2002 14:02
TO: /u/browns02/Victor/Search-set/*.seq  Sequences: 2,050  Symbols: 913,285  Word Size: 6
Searching with both strands of the query.
Scoring matrix: GenRunData:fastadna.cmp
Constant pamfactor used
Gap creation penalty: 16  Gap extension penalty: 4

Histogram Key:
Each histogram symbol represents 4 search set sequences
Each inset symbol represents 1 search set sequences
z-scores computed from opt scores

z-score  obs   exp
         (=)   (*)

< 20     0     0:
  22     0     0:
  24     3     0:=
  26     2     0:=
  28     5     0:==
  30    11     3:*==
  32    19    11:==*==
  34    38    30:=======*==
  36    58    61:===============*
  38    79   100:====================    *
  40   134   140:==================================*
  42   167   171:==========================================*
  44   205   189:===============================================*====
  46   209   192:===============================================*=====
  48   177   184:=============================================*

List

The best scores are:                     init1 initn   opt    z-sc E(1018780)..

SW:PPI1_HUMAN Begin: 1 End: 269
! Q00169 homo sapiens (human). phosph...  1854  1854  1854  2249.3  1.8e-117
SW:PPI1_RABIT Begin: 1 End: 269
! P48738 oryctolagus cuniculus (rabbi...  1840  1840  1840  2232.4  1.6e-116
SW:PPI1_RAT Begin: 1 End: 270
! P16446 rattus norvegicus (rat). pho...  1543  1543  1837  2228.7  2.5e-116
SW:PPI1_MOUSE Begin: 1 End: 270
! P53810 mus musculus (mouse). phosph...  1542  1542  1836  2227.5  2.9e-116
SW:PPI2_HUMAN Begin: 1 End: 270
! P48739 homo sapiens (human). phosph...  1533  1533  1533  1861.0   7.7e-96
SPTREMBL_NEW:BAC25830 Begin: 1 End: 270
! Bac25830 mus musculus (mouse). 10, ...  1488  1488  1522  1847.6   4.2e-95
SP_TREMBL:Q8N5W1 Begin: 1 End: 268
! Q8n5w1 homo sapiens (human). simila...  1477  1477  1522  1847.6   4.3e-95
SW:PPI2_RAT Begin: 1 End: 269
! P53812 rattus norvegicus (rat). pho...  1482  1482  1516  1840.4   1.1e-94

Alignments

SCORES Init1: 1515 Initn: 1565 Opt: 1687 z-score: 1158.1 E(): 2.3e-58
>>GB_IN3:DMU09374 (2038 nt)
initn: 1565 init1: 1515 opt: 1687 Z-score: 1158.1 expect(): 2.3e-58
66.2% identity in 875 nt overlap (83-957:151-1022)

                   60        70        80        90       100       110
u39412.gb_pr CCCTTTGTGGCCGCCATGGACAATTCCGGGAAGGAAGCGGAGGCGATGGCGCTGTTGGCC
                                           || |||  | ||||| |    ||| |||||
DMU09374     AGGCGGACATAAATCCTCGACATGGGTGACAACGAACAGAAGGCGCTCCAACTGATGGCC
                    130       140       150       160       170       180

                  120       130       140       150       160       170
u39412.gb_pr GAGGCGGAGCGCAAAGTGAAGAACTCGCAGTCCTTCTTCTCTGGCCTCTTTGGAGGCTCA
             |||||||||   ||  |||   |   | ||  |||  |       || || ||||| ||
DMU09374     GAGGCGGAGAAGAAGTTGACCCAGCAGAAGGGCTTTCTGGGATCGCTGTTCGGAGGGTCC
                    190       200       210       220       230       240

                  180       190       200       210       220       230
u39412.gb_pr TCCAAAATAGAGGAAGCATGCGAAATCTACGCCAGAGCAGCAAACATGTTCAAAATGGCC
               |||  | ||||| ||   |||   ||||    | || |  |||||||| || ||| ||
DMU09374     AACAAGGTGGAGGACGCCATCGAGTGCTACCAGCGGGCGGGCAACATGTTTAAGATGTCC
                    250       260       270       280       290       300

                  240       250       260       270       280       290
u39412.gb_pr AAAAACTGGAGTGCTGCTGGAAACGCGTTCTGCCAGGCTGCACAGCTGCACCTGCAGCTC
             ||||||||||     |||||  |    |||||| |||| |||   || |||  || |
DMU09374     AAAAACTGGACAAAGGCTGGGGAGTGCTTCTGCGAGGCGGCAACTCTACACGCGCGGGCT
                    310       320       330       340       350       360
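
To make the filtered dot plot from the Dot Plots slides concrete (4-base window, 75% identity), here is a minimal Python sketch. It places a dot at (i, j) when enough bases agree inside the window that starts at position i of one sequence and position j of the other. The function name `dot_plot` and its parameters are hypothetical, chosen only for this illustration.

```python
def dot_plot(seq_a, seq_b, window=4, min_identity=0.75):
    """Return (i, j) pairs where the windows starting at seq_a[i] and seq_b[j]
    agree in at least min_identity of their positions."""
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count position-by-position matches inside the window.
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches / window >= min_identity:
                dots.append((i, j))
    return dots


# Sequences taken from the dot plot slide; print the surviving dots.
for i, j in dot_plot("GATCAACTGACGTA", "GTTCAGCTGCGTAC"):
    print(i, j)
```

With `window=1` and `min_identity=1.0` the same function gives the unfiltered dot plot, where every single matching base produces a dot.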
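
The k-tuple lookup of FASTA Step 1 and the "hot spots on the same diagonal" idea of Step 2 can be sketched in Python as follows. This is a simplified illustration, not the FASTA implementation: it only counts hot spots per diagonal and does not apply the distance-dependent scoring of diagonal runs. The names `encode_kmer`, `build_index`, and `hot_spots_by_diagonal` are hypothetical, and the sketch assumes the sequences contain only A, C, G, and T.

```python
from collections import defaultdict

BASES = "ACGT"  # assumption: DNA only, no ambiguity codes such as N


def encode_kmer(kmer):
    """Map a DNA k-tuple to a base-4 integer (AAA -> 0, AAC -> 1, ..., TTT -> 63)."""
    code = 0
    for base in kmer:
        code = code * 4 + BASES.index(base)
    return code


def build_index(seq, k):
    """Table from k-tuple code to every start position of that k-tuple in seq."""
    index = defaultdict(list)
    for j in range(len(seq) - k + 1):
        index[encode_kmer(seq[j:j + k])].append(j)
    return index


def hot_spots_by_diagonal(query, target, k):
    """Count exact k-tuple matches (hot spots) on each diagonal i - j."""
    index = build_index(target, k)
    diagonals = defaultdict(int)
    for i in range(len(query) - k + 1):
        # Look up every target position sharing this k-tuple with the query.
        for j in index.get(encode_kmer(query[i:i + k]), []):
            diagonals[i - j] += 1
    return dict(diagonals)


print(encode_kmer("TTT"))  # 63
print(hot_spots_by_diagonal("ACGTACGT", "TTACGTAA", 4))  # {-2: 2, 2: 2}
```

Diagonals with many hot spots are the candidates that Step 2 would go on to score as diagonal runs.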
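
The banded alignment of FASTA Step 5 can be illustrated with a small Python sketch of Smith-Waterman restricted to a band around one diagonal. Several details here are assumptions made for brevity rather than FASTA's actual settings: the function name `banded_smith_waterman`, a simple match/mismatch score in place of a scoring matrix, a linear gap penalty in place of separate gap creation and extension penalties, and returning only the score without a traceback.

```python
def banded_smith_waterman(query, target, diagonal, half_width,
                          match=5, mismatch=-4, gap=-8):
    """Best local alignment score using only cells (i, j) with
    |(i - j) - diagonal| <= half_width (i, j are 1-based table indices)."""
    best = 0
    prev = {}  # scores of the previous row, keyed by column j
    for i in range(1, len(query) + 1):
        curr = {}
        lo = max(1, i - diagonal - half_width)
        hi = min(len(target), i - diagonal + half_width)
        for j in range(lo, hi + 1):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            # Cells outside the band default to 0; since gap < 0 they can
            # never beat the local-alignment floor of 0.
            score = max(
                0,
                prev.get(j - 1, 0) + sub,  # extend along the diagonal
                prev.get(j, 0) + gap,      # gap in the target
                curr.get(j - 1, 0) + gap,  # gap in the query
            )
            curr[j] = score
            best = max(best, score)
        prev = curr
    return best


# Band of width 0 on diagonal -2: query position 0 lines up with target position 2.
print(banded_smith_waterman("ACGTACGT", "TTACGTAA", diagonal=-2, half_width=0))  # 25
```

Restricting the table to a band keeps the work proportional to the band width times the query length, rather than filling the full dynamic programming table.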