Sequence Comparison & Alignment
Computing in Molecular Biology
Hugues Sicotte
National Center for Biotechnology Information
sicotte@ncbi.nlm.nih.gov

COMPARATIVE ANALYSIS
Sequence alignment is similar to other types of comparative analysis.
It involves scoring similarities and differences among a group of related entities.
[Figure: finches of the Galápagos Islands observed by Charles Darwin on the voyage of HMS Beagle]

Homology
“Homology is the central concept for all of biology. Whenever we say that a mammalian hormone is the ‘same’ hormone as a fish hormone, that a human gene sequence is the ‘same’ as a sequence in a chimp or a mouse, that a HOX gene is the ‘same’ in a mouse, a fruit fly, a frog and a human - even when we argue that discoveries about a worm, a fruit fly, a frog, a mouse, or a chimp have relevance to the human condition - we have made a bold and direct statement about homology. The aggressive confidence of modern biomedical science implies that we know what we are talking about.”
David B. Wake

COMPARATIVE ANALYSIS
Alignment algorithms model evolutionary processes.
Derivation from a common ancestor through incremental change due to DNA replication errors, mutations, damage, or unequal crossing over.
[Figure: tree of sequences descending from the ancestor GATTACCA: GATGACCA, GATTATCA, GATCATCA, GATTGATCA, GATACCA. Changes along the branches are labeled substitution, insertion, and deletion.]
Only extant sequences are known; ancestral sequences are postulated.
Mutations that do not kill the host may carry over to the population.
Rarely are mutations kept or rejected by natural selection.
The term homology implies a common ancestry, which may be inferred from observations of sequence similarity.

Comparative Analysis of Genes
[Figure: divergence timeline (3000 Myr, 1000 Myr, 500 Myr) for Bacteria, Yeast, Worm, Fly, Mouse, Human]
Align extant sequences:
MSH2_Human  TGVIVLMAQIGCFVPCESAEVSIVDCILARVGAGDSQLKGVSTFMAEMLETASILRSATK
SPE1_DROME  VGTAVLMAHIGAFVPCSLATISMVDSILGRVGASDNIIKGLSTFMVEMIETSGIIRTATD
MSH2_Yeast  VGVISLMAQIGCFVPCEEAEIAIVDAILCRVGAGDSQLKGVSTFMVEILETASILKNASK
MUTS_ECOLI  TALIALMAYIGSYVPAQKVEIGPIDRIFTRVGAADDLASGRSTFMVEMTETANILRNATE
Asterisks on the slide mark conserved positions.
The human colon cancer MSH2 gene is homologous to DNA repair proteins.

Why align sequences?
- Finding similar sequences helps determine the properties and function of a new sequence. (Must be verified experimentally.)
- Conserved positions in homologous sequences hint at functionally important sites in proteins (active or catalytic sites, DNA binding domains, disulfide bridges, structural bends, hydrophobic pockets, protein binding domains, …).
- Conserved nucleotides can hint at regulatory elements, either pre-transcriptional or post-transcriptional.

Sound alignment methods reflect evolution.
DNA Evolution:
- Mutation: errors in DNA replication or DNA repair.
  - Substitutions: replacement of one base by another.
  - Deletions/insertions: by DNA mispairing during replication or unequal crossing over.
- Gene conversion or unequal crossing over: large segments of DNA can be inserted or deleted.
- Mutations that do not kill the host are propagated. Sometimes positive mutations are selected for.
Reference: Molecular Evolution, Wen-Hsiung Li, 1997, Sinauer Associates.

[Figure: bar chart of substitution rate per nucleotide site per billion years (axis from 0 to 5). Non-coding: 5' flank, 5' UTR, introns, 3' UTR, 3' flank, pseudogenes. Coding: non-degenerate, twofold degenerate, and 4-fold degenerate sites.]
Different regions evolve at different rates, consistent with evolutionary constraints.
Synonymous versus non-synonymous mutations.

Alignment definition and type
Alignment: each base is used at most once.
Global alignment: all bases are aligned with another base or with a gap (symbol “-” or sometimes “.”).
G-ATES
GRATED
Local alignments: do not need to align all the bases in all sequences.
Align BILLGATESLIKESCHEESE and GRATEDCHEESE:
G-ATESLIKESCHEESE
GRATED-----CHEESE
or
G-ATES & CHEESE
GRATED & CHEESE

COMPARATIVE ANALYSIS
Insertions and deletions (‘indels’) are represented by gaps in alignments.
GATTATACCA
GATTA---CA
gap of length 3

SEQUENCE ALIGNMENT
Alignment of trypsin sequences from mouse and crayfish
An alignment provides a mapping of residues in one sequence onto those of another.
Conserved residues are often of structural or functional importance.

Mouse     IVGGYNCEENSVPYQVSLNS-----GYHFCGGSLINEQWVVSAGHCYK-------SRIQV
Crayfish  IVGGTDAVLGEFPYQLSFQETFLGFSFHFCGASIYNENYAITAGHCVYGDDYENPSGLQI

Mouse     RLGEHNIEVLEGNEQFINAAKIIRHPQYDRKTLNNDIMLIKLSSRAVINARVSTISLPTA
Crayfish  VAGELDMSVNEGSEQTITVSKIILHENFDYDLLDNDISLLKLSGSLTFNNNVAPIALPAQ

Mouse     PPATGTKCLISGWGNTASSGADYPDELQCLDAPVLSQAKCEASYPG-KITSNMFCVGFLE
Crayfish  GHTATGNVIVTGWG-TTSEGGNTPDVLQKVTVPLVSDAECRDDYGADEIFDSMICAGVPE

Mouse     GGKDSCQGDSGGPVVCNG----QLQGVVSWGDGCAQKNKPGVYTKVYNYVKWIKNTIAAN
Crayfish  GGKDSCQGDSGGPLAASDTGSTYLAGIVSWGYGCARPGYPGVYTEVSYHVDWIKANAV--

Figure 7.1. Conserved positions are often of functional importance. Alignment of trypsin proteins of mouse (Swiss-Prot P07146) and crayfish (Swiss-Prot P00765). Identical residues are highlighted red and underlined. Indicated above the alignment are three disulfide bonds (-S-S-) whose participating cysteine residues are conserved, amino acids whose side chains are involved in the charge relay system (asterisk), and the active site residue which governs substrate specificity (diamond). The other conserved positions have no known role. These conserved residues could be coincidentally conserved or have some unknown structural role.

SEQUENCE ALIGNMENT
Human zeta crystallin vs E. coli quinone oxidoreductase
Stars indicate identical residues and dots indicate conservative substitutions.
CLUSTAL W (1.7) multiple sequence alignment

Human-Zcr  MATGQKLMRAVRVFEFGGPEVLKLRSDIAVPIPKDHQVLIKVHACGVNPVETYIRSGTYS
Ecoli-QOR  ------MATRIEFHKHGGPEVLQA-VEFTPADPAENEIQVENKAIGINFIDTYIRSGLYP

Human-Zcr  RKPLLPYTPGSDVAGVIEAVGDNASAFKKGDRVFTSSTISGGYAEYALAADHTVYKLPEK
Ecoli-QOR  -PPSLPSGLGTEAAGIVSKVGSGVKHIKAGDRVVYAQSALGAYSSVHNIIADKAAILPAA

Human-Zcr  LDFKQGAAIGIPYFTAYRALIHSACVKAGESVLVHGASGGVGLAACQIARAYGLKILGTA
Ecoli-QOR  ISFEQAAASFLKGLTVYYLLRKTYEIKPDEQFLFHAAAGGVGLIACQWAKALGAKLIGTV

Human-Zcr  GTEEGQKIVLQNGAHEVFNHREVNYIDKIKKYVGEKGIDIIIEMLANVNLSKDLSLLSHG
Ecoli-QOR  GTAQKAQSALKAGAWQVINYREEDLVERLKEITGGKKVRVVYDSVGRDTWERSLDCLQRR

Human-Zcr  GRVIVVG-SRGTIEINPRDTMAKES----SIIGVTLFSSTKEEFQQYAAALQAGMEIGWL
Ecoli-QOR  GLMVSFGNSSGAVTGVNLGILNQKGSLYVTRPSLQGYITTREELTEASNELFSLIASGVI

Human-Zcr  KPVIGSQ--YPLEKVAEAHENIIHGSGATGKMILLL
Ecoli-QOR  KVDVAEQQKYPLKDAQRAHE-ILESRATQGSSLLIP

Figure 7.2

Score and Statistics
Percent identity: can be misleading.
Score: a simple quality measure is the “score”. The score assigns points for each aligned base (or gap) of the alignment.
  identical bases: “match” score
  mismatching bases: “mismatch” score
  gaps: “gap opening” penalty for starting a gap, plus a “gap extension” penalty for each gap symbol
Example: match = +1, mismatch = -1, gap opening = -5, gap extension = -1
G-ATESLIKESCHEESE
GRATED-----CHEESE
Score = 10*(+1) + 1*(-1) + (-5-1) + (-5+5*(-1)) = -7
and/or
G-ATES & CHEESE
GRATED & CHEESE
(under the same rules these two local alignments score 4*(+1) + 1*(-1) + (-5-1) = -3 and 6*(+1) = +6)

SCORING SYSTEMS
Which alignment is “better”?
GCTACTAG-T-T--CGC-T-TAGC
GCTACTAGCTCTAGCGCGTATAGC
0 mismatches, 5 gaps
GCTACTAGTT------CGCTTAGC
GCTACTAGCTCTAGCGCGTATAGC
3 mismatches, 1 gap

SCORING SYSTEMS
High penalty for “opening” a gap (e.g. G = 5)
Lower penalty for “extending” a gap (e.g. L = 1)
GCTACTAG-T-T--CGC-T-TAGC
GCTACTAGCTCTAGCGCGTATAGC
Penalty = 5G + 6L = 31
GCTACTAGTT------CGCTTAGC
GCTACTAGCTCTAGCGCGTATAGC
Penalty = 1G + 6L = 11

LOCAL SIMILARITY
Mix-and-match protein modules confound alignment algorithms.
Protein modules in coagulation factor XII (F12) and tissue plasminogen activator (PLAT):
F12:  F2  E  F1  E  K  Catalytic
PLAT: F1  E  K  K  Catalytic
Legend: F1, F2 = fibronectin repeats; E = EGF similarity domain; K = kringle domain; Catalytic = serine protease activity.
Some modules appear in reverse order, and some modules are repeated.
Figure 7.3

DOT PLOTS
Dot-plot: Fitch, Biochem. Genet. (1969) 3, 99-108.
Horizontal axis is coordinates for one sequence.
Vertical axis is coordinates for the other.
    C G T A C C G T
A   0 0 0 1 0 0 0 0
C   1 0 0 0 1 1 0 0
G   0 1 0 0 0 0 1 0
T   0 0 1 0 0 0 0 1
Figure 7.4
Positions can also be scored in a sliding window rather than one at a time. For example, take a window of 3 nucleotides and score 1 for identical triplets and 0 for all other combinations.
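The dot-plot scoring described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name `dot_plot` and its `window` parameter are hypothetical, and a cell is set to 1 only when the two windows are exactly identical, as in the slide's example sequences CGTACCGT and ACGT.

```python
def dot_plot(horizontal, vertical, window=1):
    """Return a 0/1 matrix; rows follow `vertical`, columns follow `horizontal`.

    A cell is 1 when the `window`-letter words starting at that row and
    column are identical, and 0 otherwise (hypothetical helper).
    """
    n_cols = len(horizontal) - window + 1
    n_rows = len(vertical) - window + 1
    return [
        [1 if vertical[r:r + window] == horizontal[c:c + window] else 0
         for c in range(n_cols)]
        for r in range(n_rows)
    ]


# Single-position plot, then a window of 3 (identical triplets only).
for row in dot_plot("CGTACCGT", "ACGT"):
    print(row)
for row in dot_plot("CGTACCGT", "ACGT", window=3):
    print(row)
```

With a window of 3 only complete triplets are compared, so the matrix shrinks by two rows and two columns, and isolated single-letter matches disappear.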
Windowed dot plot (window of 3 nucleotides; rows and columns are window start positions):
    C G T A C C G T
A   0 0 0 0 0 0
C   1 0 0 0 0 1
G
T
Figure 7.4b

DOT PLOTS
Horizontal axis is coordinates for one sequence.
Vertical axis is coordinates for the other.
[Figure 7.4: dot plot of Tissue Plasminogen Activator (PLAT; modules F1, E, K, K, Catalytic) against Coagulation Factor XII (F12; modules F2, E, F1, E, K, Catalytic)]
Plot dots for high similarity within a short window.
Adjacent dots merge to form diagonal segments.
Repeated domains show a characteristic pattern.

PATH GRAPHS
Dot plots suggest paths through the alignment space.
EGF similarity domains of urokinase plasminogen activator (PLAU) and tissue plasminogen activator (PLAT)
[Figure 7.5: dot plot and path graph with axes PLAU and PLAT; coordinates 23 to 72 on one axis and 90 to 137 on the other]
Path graphs are more explicit representations.
Each path is a unique alignment.
EPKKVKDHCSKHSPCQKGGTCVNMP--SGPH-CLCPQHLTGNHCQKEK---CFE
ELHQVPSNCD----CLNGGTCVSNKYFSNIHWCNCPKKFGGQHCEIDKSKTCYE

PATH GRAPHS
Best-path problems are common in computer science (for example, routing a phone call from Washington DC to San Francisco).
A best-path algorithm used for sequence alignment is called ‘dynamic programming’.

DYNAMIC PROGRAMMING
Dynamic programming example
Construct an optimal alignment of these two sequences:
G A T A C T A
G A T T A C C A
Using these scoring rules:
Match: +1
Mismatch: -1
Gap: -1

Arrange the sequence residues along a two-dimensional lattice. Vertices of the lattice fall between letters.
The goal is to find the optimal path from one corner of the lattice to the opposite corner.
Each path corresponds to a unique alignment. Which one is optimal?
The score for a path is the sum of its incremental edge scores:
  A aligned with A: Match = +1
  A aligned with T: Mismatch = -1
  T aligned with NULL, or NULL aligned with T: Gap = -1

Incrementally extend the path.
Remember the best sub-path leading to each point on the lattice.
         G   A   T   A   C   T   A
     0  -1  -2  -3  -4  -5  -6  -7
G   -1  +1   0  -1  -2  -3  -4  -5
A   -2   0  +2  +1   0  -1  -2  -3
T   -3  -1  +1  +3  +2  +1   0  -1
T   -4  -2   0  +2  +2  +1  +2  +1
A   -5  -3  -1  +1  +3  +2  +1  +3
C   -6  -4  -2   0  +2  +4  +3  +2
C   -7  -5  -3  -1  +1  +3  +3  +2
A   -8  -6  -4  -2   0  +2  +2  +4

Trace back to get the optimal path and alignment.
Print out the alignment:
G A - T A C T A
G A T T A C C A

Two different types of alignment
Global alignment methods:
Needleman & Wunsch, J. Mol. Biol. (1970) 48, 443-453: the problem of finding the best path. Key insight: any partial subpath that ends at a point along the true optimal path must itself be the optimal path leading to that point. This provides a method to create a matrix of path scores, where each entry is the score of the best path leading to that point. Trace the optimal path from one end of the two sequences to the other.
Local alignment methods:
Smith & Waterman, J. Mol. Biol. (1981) 147, 195-197: use Needleman & Wunsch, but report all non-overlapping paths, starting at the highest-scoring points in the path graph.
FASTP (Lipman & Pearson (1985), Science 227, 1435-1441) and BLAST (Altschul et al. (1990), J. Mol. Biol. 215, 403-410): do not report all paths, but only attempt to find paths where there are high-scoring words. This speeds up the alignments considerably.
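The lattice-filling and trace-back steps walked through above can be sketched as follows. This is a minimal illustration, not the lecture's own code: the function name `needleman_wunsch` is hypothetical, it uses the slide's simple rules (match +1, mismatch -1, a flat gap score of -1 with no separate opening penalty), and when several optimal paths exist it returns just one of them, so the gap may be placed differently from the slide.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming with a flat gap score."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: paths made only of gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Remember the best sub-path score leading to each lattice point.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,  # align two letters
                              score[i - 1][j] + gap,       # letter of a vs gap
                              score[i][j - 1] + gap)       # gap vs letter of b
    # Trace back from the far corner to recover one optimal path.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))


best, row1, row2 = needleman_wunsch("GATACTA", "GATTACCA")
print(best)   # 4, the value in the far corner of the lattice
print(row1)
print(row2)
```

Each lattice point is filled once from its three neighbours, so the work grows with the product of the two sequence lengths.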
GLOBAL & LOCAL SIMILARITY
Implementations of dynamic programming for global and local similarities
Optimal global alignment: Needleman & Wunsch (1970). Sequences align essentially from end to end.
Optimal local alignment: Smith & Waterman (1981). Sequences align only in small, isolated regions.

Score and Statistics
Some amino acid mutations do not affect structure or function very much. Amino acids with similar physico-chemical and steric properties can often replace each other. A good scoring system therefore penalizes mutations to similar amino acids only lightly.
PAM matrices: Point Accepted Mutations. Defined in terms of a divergence of 1 percent (1 PAM). For distant sequences use PAM250, while for closer sequences use PAM100. Some sites accumulate mutations and others do not, so using the PAM100 matrix does not mean that the sequences compared were 100% mutated.
BLOSUM: BLOCKS substitution matrices. Started with the BLOCKS database of multiple alignments involving only distant sequences. BLOSUM62 means that the proteins compared were never closer than 62% identity. BLOSUM50 matrices involved alignments of more distant sequences.
Recommendation: use BLOSUM matrices (BLOSUM62) for most protein alignments.
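As a rough guide to how entries in substitution matrices such as PAM and BLOSUM are obtained, the usual log-odds form is sketched below. The symbols are assumptions introduced here, not taken from the slides: $p_{ab}$ is the observed frequency of residues $a$ and $b$ aligned in trusted alignments, $q_a$ and $q_b$ are their background frequencies, and $\lambda$ is a scaling constant.

```latex
s(a,b) \;=\; \frac{1}{\lambda}\,\ln\frac{p_{ab}}{q_a\,q_b}
% Hypothetical numbers, for illustration only:
% p_{ab} = 0.02,\; q_a = q_b = 0.05
% \Rightarrow \frac{p_{ab}}{q_a q_b} = 8,\quad \log_2 8 = 3 \text{ bits}
% In half-bit units (\lambda = \tfrac{1}{2}\ln 2): s(a,b) = 2 \times 3 = 6
```

A pair seen more often than chance predicts gets a positive score, and a pair seen less often gets a negative one.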
SCORING SYSTEMS
Some amino acid substitutions are more common than others.
Substitution scores come from an odds ratio based on measured substitution rates.
Identities get positive scores, but some are better than others.
Some non-identities have positive scores, but most are negative.

BLOSUM62
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   4
R  -1  5
N  -2  0  6
D  -2 -2  1  6
C   0 -3 -3 -3  9
Q  -1  1  0  0 -3  5
E  -1  0  0  2 -4  2  5
G   0 -2  0 -1 -3 -2 -2  6
H  -2  0  1 -1 -3  0  0 -2  8
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
Figure 7.8

DATABASE SEARCHING
Compare one query sequence against an entire database.
A typical search has four basic elements:
> fasta myquery swissprot -ktup 2
  fasta      search program
  myquery    query sequence
  swissprot  sequence database
  -ktup 2    optional parameters

DATABASE SEARCHING
With exponential database growth, searches keep taking more time.
> fasta myquery swissprot -ktup 2
searching . . . . . .

DATABASE SEARCHING
The “hit list” gives titles and scores for matched sequences.
> fasta myquery swissprot -ktup 2
The best scores are:                               initn init1   opt    z-sc  E(77110)
gi|1706794|sp|P49789|FHIT_HUMAN BIS(5'-ADENOSYL)-    996   996   996  1262.1  0
gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL)    412   382   395   507.6  1.4e-21
gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEI    238   133   316   407.4  5.4e-16
gi|3915958|sp|Q58276|Y866_METJA HYPOTHETICAL HIT-    153    98   190   253.1  2.1e-07
gi|3916020|sp|Q11066|YHIT_MYCTU HYPOTHETICAL 15.7    163   163   184   244.8  6.1e-07
gi|3023940|sp|O07513|HIT_BACSU HIT PROTEIN           164   164   170   227.2  5.8e-06
gi|2506515|sp|Q04344|HNT1_YEAST HIT FAMILY PROTEI    130    91   157   210.3  5.1e-05
gi|2495235|sp|P75504|YHIT_MYCPN HYPOTHETICAL 16.1    125   125   148   199.7  0.0002
gi|418447|sp|P32084|YHIT_SYNP7 HYPOTHETICAL 12.4      42    42   140   191.3  0.00058
gi|3025190|sp|P94252|YHIT_BORBU HYPOTHETICAL 15.9    128    73   139   188.7  0.00082
gi|1351828|sp|P47378|YHIT_MYCGE HYPOTHETICAL HIT-     76    76   133   181.0  0.0022
gi|418446|sp|P32083|YHIT_MYCHR HYPOTHETICAL 13.1      27    27   119   165.2  0.017
gi|1708543|sp|P49773|IPK1_HUMAN HINT PROTEIN (PRO     66    66   118   163.0  0.022
gi|2495231|sp|P70349|IPK1_MOUSE HINT PROTEIN (PRO     65    65   116   160.5  0.03
gi|1724020|sp|P49774|YHIT_MYCLE HYPOTHETICAL HIT-     52    52   117   160.3  0.031
gi|1170581|sp|P16436|IPK1_BOVIN HINT PROTEIN (PRO     66    66   115   159.3  0.035
gi|2495232|sp|P80912|IPK1_RABIT HINT PROTEIN (PRO     66    66   112   155.5  0.057
gi|1177047|sp|P42856|ZB14_MAIZE 14 KD ZINC-BINDIN     73    73   112   155.4  0.058
gi|1177046|sp|P42855|ZB14_BRAJU 14 KD ZINC-BINDIN     76    76   110   153.8  0.072
gi|1169825|sp|P31764|GAL7_HAEIN GALACTOSE-1-PHOSP     58    58   104   138.5  0.51
gi|113999|sp|P16550|APA1_YEAST 5',5'''-P-1,P-4-TE     47    47   103   137.8  0.56
gi|1351948|sp|P49348|APA2_KLULA 5',5'''-P-1,P-4-T     63    63    98   131.3  1.3
gi|123331|sp|P23228|HMCS_CHICK HYDROXYMETHYLGLUTA     58    58    99   129.4  1.6
gi|1170899|sp|P06994|MDH_ECOLI MALATE DEHYDROGENA     70    48    91   122.9  3.7
gi|3915666|sp|Q10798|DXR_MYCTU 1-DEOXY-D-XYLULOSE     75    50    92   121.9  4.3
gi|124341|sp|P05113|IL5_HUMAN INTERLEUKIN-5 PRECU     36    36    85   121.3  4.7
gi|1170538|sp|P46685|IL5_CERTO INTERLEUKIN-5 PREC     36    36    84   120.0  5.5
gi|121369|sp|P15124|GLNA_METCA GLUTAMINE SYNTHETA     45    45    90   118.9  6.3
gi|2506868|sp|P33937|NAPA_ECOLI PERIPLASMIC NITRA     48    48    92   117.4  7.6
gi|119377|sp|P10403|ENV1_DROME RETROVIRUS-RELATED     59    59    89   117.0  8
gi|1351041|sp|P48415|SC16_YEAST MULTIDOMAIN VESIC     48    48    97   117.0  8
gi|4033418|sp|O67501|IPYR_AQUAE INORGANIC PYROPHO     38    38    83   116.8  8.3

E-value
“Hits” can be sorted according to their E-value or their score. The E-value, short for EXPECT value, is a function of score, database size and query sequence length.
E-value: the number of alignments with a score >= S that you expect to find if the database were a collection of random letters.
E.g. a score of 1 requires only 1 match, so there should be an enormous number of such alignments. One expects to find fewer alignments with a score of 5, and so on. Eventually, when the score is big enough, one expects to find an insignificant number of alignments that could be due to chance.
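The dependence of the E-value on score, database size and query length can be made concrete with the standard Karlin-Altschul form for ungapped local alignments. This is offered as a reference sketch rather than the exact formula any particular program uses. The symbols are assumptions introduced here: $m$ is the query length, $n$ the total database length, and $K$ and $\lambda$ are constants that depend on the scoring system.

```latex
E(S) \;=\; K\,m\,n\,e^{-\lambda S}
% The expected count falls off exponentially as the score S grows,
% and grows in proportion to the search space m \times n.
% The probability of seeing at least one such alignment by chance:
P(S) \;=\; 1 - e^{-E(S)}
```

For small values $P \approx E$, which is why very low E-values can be read almost directly as probabilities.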
E-values of less than 1e-6 (1 × 10^-6 in scientific notation) are usually very good, and for proteins E < 1e-2 is usually considered significant. It is still possible for a hit with E > 1 to be biologically meaningful, but more analysis is required to confirm that. Even for VERY good hits, it is possible that the hit is due to a biological artifact (sequencing/cloning vector, repeats, low-complexity sequence…).

E-value
Another type of statistic is the P-value, which, given a score S for an alignment, is the probability that an alignment of the query against a database of random sequences has a score >= S. For gapless alignments the P-value can be computed from theory.
Sometimes the alignment algorithm, or a biologically complex database, does not allow the P-value to be computed from the statistical theory of a uniform database. In this case one uses an alternative statistic, the Z-value (e.g. FASTA suite), which shuffles the query sequence and thus creates many compositionally identical query sequences. Each random sequence is then re-queried against the database. When done enough times, this provides a distribution of scores that is approximately normally distributed (if lucky) around some mean.
Z-value = (score - mean) / standard deviation. A Z-value of 3 or greater is good.
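The shuffling procedure behind the Z-value can be sketched as below. This is a toy illustration under stated assumptions, not how the FASTA suite computes its z-scores: a single subject sequence stands in for the whole database, the helper names `local_score` and `z_value` are hypothetical, and the local score uses simple match/mismatch/gap values of +1/-1/-1.

```python
import random
import statistics


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score (Smith-Waterman style), score only."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0, prev[j - 1] + pair, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


def z_value(query, subject, shuffles=100, seed=0):
    """(observed score - mean of shuffled scores) / their standard deviation."""
    rng = random.Random(seed)
    observed = local_score(query, subject)
    letters = list(query)
    shuffled_scores = []
    for _ in range(shuffles):
        rng.shuffle(letters)  # same composition, random order
        shuffled_scores.append(local_score("".join(letters), subject))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.pstdev(shuffled_scores)
    if sd == 0:  # degenerate case: every shuffle scored the same
        return float("inf") if observed > mean else 0.0
    return (observed - mean) / sd


seq = "MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLV"
print(z_value(seq, seq))  # a sequence against itself gives a large Z-value
```

Shuffling keeps the amino acid composition of the query, so the comparison isolates the contribution of residue order to the score.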
[Figure: probability distribution of scores, plotted against score deviation from the mean, with the standard deviation and S, the score of the alignment, marked]

DATABASE SEARCHING
Detailed alignments are shown farther down in the output.
> fasta myquery swissprot -ktup 2
>>gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL)-TETR (182 aa)
 initn: 412 init1: 382 opt: 395 z-score: 507.6 E(): 1.4e-21
Smith-Waterman score: 395; 52.3% identity in 109 aa overlap
(query and library rows are both labeled gi|170; the coordinate rulers and match lines of the original display are not reproduced, so rows are not column-aligned)
query    MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRPVERFHDLRPDEVADLF
library  MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLVIPQRAVPRLKDLTPSELTDLF

query    QTTQRVGTVVEKHFHGTSLTFSMQDGPEAGQTVKHVHVHVLPRKAGDFHRNDSIYEELQK
library  TSVRKVQQVIEKVFSASASNIGIQDGVDAGQTVPHVHVHIIPRKKADFSENDLVYSELEK

query    HDKEDFPASWRSEEEMAAEAAALRVYFQ
library  NEGNLASLYLTGNERYAGDERPPTSMRQAIPKDEDRKPRTLEEMEKEAQWLKGYFSEEQE

>>gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEIN 2 (217 aa)
 initn: 238 init1: 133 opt: 316 z-score: 407.4 E(): 5.4e-16
Smith-Waterman score: 316; 37.4% identity in 131 aa overlap
query    MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRP-VER

HASHING METHODS
The simplest database search is one large dynamic programming problem.
For a query of N letters against a database of M letters, it requires MxN comparisons.
[Figure: dynamic programming matrix with the query sequence on one axis and the database sequence on the other]

HASHING METHODS
Hashing is a common method for accelerating database searches.
Compile a “dictionary” of words from the query sequence.
Put each word in a lookup table that points to the original position in the sequence.
Thus, given one word, you can know whether it is in the query in a single operation.
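The dictionary idea above can be sketched with an ordinary Python dictionary. This is only an illustration: the function names `build_word_index` and `find_word_hits` are hypothetical, the short database string is made up for the example, and a Python `dict` is used in place of an explicit integer-coded array.

```python
from collections import defaultdict


def build_word_index(query, k=3):
    """Map every overlapping k-letter word of the query to its start positions."""
    index = defaultdict(list)
    for pos in range(len(query) - k + 1):
        index[query[pos:pos + k]].append(pos)
    return index


def find_word_hits(index, database_seq, k=3):
    """Scan the database once; each lookup is a single dictionary operation.

    Returns (query_pos, database_pos, diagonal) for every shared word.
    Hits on the same diagonal suggest where an alignment may lie.
    """
    hits = []
    for db_pos in range(len(database_seq) - k + 1):
        word = database_seq[db_pos:db_pos + k]
        for q_pos in index.get(word, ()):
            hits.append((q_pos, db_pos, db_pos - q_pos))
    return hits


index = build_word_index("MLIIKRDELVISWASHERE")
print(len(index))          # 17 distinct words of size 3
# "GGWASHEREGG" is a made-up database sequence for illustration.
for hit in find_word_hits(index, "GGWASHEREGG"):
    print(hit)
```

Runs of hits that share a diagonal mark the regions worth examining more closely, which is the starting point for the banded and extension strategies used by FASTA and BLAST.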
HASHING METHODS
query sequence: MLIIKRDELVISWASHERE
all overlapping words of size 3:
MLI LII IIK IKR KRD RDE DEL ELV LVI VIS ISW SWA WAS ASH SHE HER ERE

Index lookup
Each word is assigned a unique integer. E.g. for a word of 3 letters made up of an alphabet of 20 letters:
1. Assign a code to each letter: Code(l), from 0 to 19.
2. For a word of 3 letters L1 L2 L3 the code is
   index = Code(L1)*20^2 + Code(L2)*20^1 + Code(L3)
3. Have an array with a list of the positions that have that word.
[Figure: array indexed 0, 1, 2, 3, ... where each entry holds the position in the query sequence of the word]

HASHING METHODS
Building the dictionary for the query sequence requires (N-2) operations. The database contains (M-2) words, and it takes only one operation to see if a word was in the query.

HASHING METHODS
Scan the database, looking up words in the dictionary.
Use word hits to determine where to search for alignments. This fills the dynamic programming matrix in (N-2)+(M-2) operations instead of MxN.
[Figure: matrix with the query sequence on one axis and the database sequence on the other, showing word hits]
FASTA searches in a band.
BLAST extends from word hits.

Multiple Alignment
Unaligned sequences:
FHIT_HUMAN  MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLV...
APH1_SCHPO  MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLV...
HNT2_YEAST  MILSKTKKPKSMNKPIYFSKFLVTEQVFYKSKYTYALVNLKPIVPGHVLI...
Y866_METJA  MCIFCKIINGEIPAKVVYEDEHVLAFLDINPRNKGHTLV...
Aligned:
FHIT_HUMAN  -----------MS-F RFGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...
APH1_SCHPO  -----------MPKQ LYFSKFPVG-SQVFY RTKLSAAFVNLKPIL PGHVLV...
HNT2_YEAST  MILSKTKKPKSMNKP IYFSKFLVT-EQVFY KSKYTYALVNLKPIV PGHVLI...
Y866_METJA  -----------MCIF CKIINGEIP-AKVVY EDEHVLAFLDINPRN KGHTLV...
A true multiple alignment method will align all the sequences together at the same time.
Unfortunately, there is no formal computationally tractable method for more than 3 sequences.
There are many approximate methods, such as progressive multiple alignment methods.

Progressive Multiple Alignment
Align all pairs of sequences.
Pairwise alignments: compute distance matrix
            FHIT_HUMAN  APH1_SCHPO  HNT2_YEAST  Y866_METJA
FHIT_HUMAN
APH1_SCHPO         395
HNT2_YEAST         316         380
Y866_METJA         290         300         340
[Figure: guide tree with leaves FHIT_HUMAN, APH1_SCHPO, HNT2_YEAST, Y866_METJA]

Multiple Alignment
FHIT_HUMAN  MSFR FGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...
APH1_SCHPO  MPKQ LYFSKFPVGSQVFY RTKLSAAFVNLKPIL PGHVLV...
HNT2_YEAST  MILSKTKKPKSMNKPIYFSKFLVTEQVFYKSKYTYALVNLKPIVPGHVLI...
Y866_METJA  MCIFCKIINGEIPAKVVYEDEHVLAFLDINPRNKGHTLV...
Align the two closest sequences. This alignment creates a consensus sequence that is next used to align subsequent sequences.
From the point of view of this pairwise alignment, the gap can be inserted anywhere in the green region (between the 1st M and base 13 (S)).

Multiple Alignment
FHIT_HUMAN  -----------MSFR FGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...
APH1_SCHPO  -----------MPKQ LYFSKFPVGSQVFY RTKLSAAFVNLKPIL PGHVLV...
HNT2_YEAST  MILSKTKKPKSMNKP IYFSKFLVTEQVFY KSKYTYALVNLKPIV PGHVLI...
Y866_METJA  MCIFCKIINGEIPAKVVYEDEHVLAFLDINPRNKGHTLV...
Align the next closest sequence to the consensus.
Once inserted, gap positions cannot move because they are part of the consensus.

Multiple Alignment
FHIT_HUMAN  -----------MSFR FGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...
APH1_SCHPO  -----------MPKQ LYFSKFPVGSQVFY RTKLSAAFVNLKPIL PGHVLV...
HNT2_YEAST  MILSKTKKPKSMNKP IYFSKFLVTEQVFY KSKYTYALVNLKPIV PGHVLI...
Y866_METJA  -----------MCIF CKIINGEIPAKVVY EDEHVLAFLDINPRN KGHTLV...
Align the next closest sequence to the new consensus.
Hopefully, the result should be similar to what a true multiple alignment method would have yielded.
We saw that the order of alignment determines the existence of gaps. Because of the order of alignments, the gap position cannot be changed to align these two P, which would have resulted in a higher score.

Clustalw
CLUSTALW is a progressive multiple alignment tool.
- Adaptive gap opening and extension scores make it relatively insensitive to small changes in gap parameters.
- Choice of DNA or protein gap penalty alignments.
- Available on the web or on PC/Mac/Unix.
http://dot.imgen.bcm.tmc.edu:9331/multi-align/Options/clustalw.html
The uppercase “O” in Options is relevant.

BLAST and BLAST2SEQUENCES
BLAST is a database search engine that uses hashing to accelerate the search.
- blastn (for nucleotides) or blastp (for proteins)
- blastx (translates a nucleotide query in all 6 reading frames and compares it to a protein database)
- tblastn (compares a protein against a nucleotide database translated in all 6 reading frames)
- tblastx (compares a nucleotide sequence against a nucleotide database by translating the query and database in all 6 reading frames)
http://www.ncbi.nlm.nih.gov/BLAST/
A pairwise alignment implementation of these programs is available at:
http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html

Clustalw:
Is a multiple alignment program. Every sequence is aligned to every other one.
Blast: Query-Anchored Alignments (master-slave)
NOT a multiple alignment program, but it may display query-anchored multiple pairwise alignments that look like a multiple alignment, although all the sequences are only aligned to the first sequence!
Gaps in the query mean NOTHING can be aligned to it. Gaps may optionally be shown (flat view), or the entire column omitted.
[Figure callouts: “Gap in subject sequence”; “This column is NOT aligned together. It is displayed there for convenience.”]

BLAST and BLAST2SEQUENCES
Exercises:
Use Entrez to find the protein sequences with LOCUS name FHIT_HUMAN and HNT2_YEAST.
Use clustalw to align these two sequences, and WITHOUT LOSING THAT RESULT SCREEN, use pairwise blast to align these two sequences as well.
EXERCISE: Try to reproduce the example of the ClustalW alignment (the order of input sequences is not important).

References
Textbook: "Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins", edited by Andy D. Baxevanis and B.F. Ouellette. Readings: chapters 7, 8, 9.
http://www.ncbi.nlm.nih.gov/BLAST/blast_overview.html