Multiple alignments, PATTERNS, PSI-BLAST
Swiss Institute of Bioinformatics Institut Suisse de Bioinformatique LF-2001.11

Overview
-Multiple alignments: How-to, Goal, problems, use
-Patterns: PROSITE database, syntax, use
-PSI-BLAST: BLAST, matrices, use
-[ Profiles/HMMs ] …

What is a multiple sequence alignment?
-What can it do for me?
-How can I produce one of these?
-How can I use it?

chite      ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat      --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr      KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse      -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

chite      AATAKQNYIRALQEYERNGG-
wheat      ANKLKGEYNKAIAAYNKGESA
trybr      AEKDKERYKREM---------
mouse      AKDDRIRYDNEMKSWEEQMAE

How can I use a multiple alignment?
-Extrapolation: unknown sequence, homology with SwissProt entries?
-Prosite patterns: unknown sequence, match with a pattern?
-Prosite profiles: more sensitive, more specific
-Phylogeny: evolution, paralogy/orthology
-Structure prediction: PHD for secondary structure prediction is 75% accurate; threading is improving but is not yet as good

Caution! Automatic multiple sequence alignment methods are not always perfect…

The problem: why is it difficult to compute a multiple sequence alignment?
-Biology: what is a good alignment?
-Computation: what is the good alignment?
-CIRCULAR PROBLEM: good sequences <-> good alignment

What do I need to know to make a good multiple alignment?
-How do sequences evolve?
-How does the computer align the sequences?
-How can I choose my sequences?
-What is the best program?
-How can I use my alignment?

An alignment is a story
Starting sequence:
ADKPKRPLSAYMLWLN
Mutations + selection (deletion, insertion, mutation):
ADKPRRPLS-YMLWLN
ADKPKRPLSAYMLWLN
ADKPKRPKPRLSAYMLWLN
Resulting alignment:
ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN

Homology
Same sequences -> same origin? -> same function? -> same 3D fold?
[Figure: % sequence identity versus length (100 marked); regions labelled "Same 3D fold" and "Twilight zone", boundary at 30%.]

Convergent evolution
-AFGP with (ThrAlaAla)n, similar to trypsinogen
-AFGP with (ThrAlaAla)n, NOT similar to trypsinogen
(figure labels: N, S)
Chen et al, 97, PNAS, 94, 3811-16

Residues and mutations
All residues are equal, but some more than others…
[Figure: Venn diagram of the amino acids grouped as aliphatic, aromatic, small, hydrophobic and polar.]
Accurate matrices are data driven rather than knowledge driven

Substitution matrices
Different flavors:
-PAM: 250, 350
-BLOSUM: 45, 62
-…

What is the best substitution matrix?
Mutation rates depend on families

Family             S      N
Histone3           6.4    0
Insulin            4.0    0.1
Interleukin I      4.6    1.4
a-Globin           5.1    0.6
Apolipoprot. AI    4.5    1.6
Interferon G       8.6    2.8

Rates in substitutions/site/billion years, as measured on mouse vs human (0.08 billion years)

Choosing the right matrix may be tricky
-Gonnet250 > BLOSUM62 > PAM250
-Depends on the family, the program used and its tuning

Insertions and deletions?
Affine gap penalty: Cost = GOP + GEP*L
[Figure: indel cost as a function of indel length L.]

How to align many sequences?
Exact algorithms (Needleman & Wunsch, Smith & Waterman) are computationally expensive -> heuristic required!

Globins    Time
2          1 sec
3          2 mn
4          5 hours
5          3 weeks
6          9 years
7          1000 years
8          150 000 years

Existing methods
1-Carrillo and Lipman:
-MSA, DCA.
-Few small, closely related sequences.
-Do well when they can run.
2-Segment based:
-DIALIGN, MACAW.
-May align too few residues
3-Iterative:
-HMMs, HMMER, SAM.
-Slow, sometimes inaccurate
-Good profile generators
4-Progressive:
-ClustalW, Pileup, Multalign…
-Fast and sensitive

Progressive alignment
Feng and Doolittle, 1980; Taylor 1981
[Figure: dynamic programming using a substitution matrix.]
-Depends on the CHOICE of the sequences.
-Depends on the ORDER of the sequences (tree).
-Depends on the PARAMETERS:
 •Substitution matrix.
 •Penalties (Gop, Gep).
 •Sequence weight.
 •Tree making algorithm.

Selecting sequences from a BLAST output

A common mistake
Sequences too closely related

PRVA_MACFU SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_HUMAN SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE
PRVA_GERSP SMTDLLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKTPDDVKKVFHILDKDKSGFIEE
PRVA_MOUSE SMTDVLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKNPDEVKKVFHILDKDKSGFIEE
PRVA_RAT   SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_RABIT AMTELLNAEDIKKAIGAFAAAESFDHKKFFQMVGLKKKSTEDVKKVFHILDKDKSGFIEE
           :**::*.*******:***:* :****************..::******:***********

PRVA_MACFU DELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_HUMAN DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_GERSP DELGFILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES
PRVA_MOUSE DELGSILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES
PRVA_RAT   DELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES
PRVA_RABIT EELGFILKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES
           :*** ******.******.**** *:************.:******:**

-Identical sequences bring no information
-Multiple sequence alignments thrive on diversity

Respect information!

PRVA_MACFU ------------------------------------------SMTDLLN----AEDIKKA
PRVA_HUMAN ------------------------------------------SMTDLLN----AEDIKKA
PRVA_GERSP ------------------------------------------SMTDLLS----AEDIKKA
PRVA_MOUSE ------------------------------------------SMTDVLS----AEDIKKA
PRVA_RAT   ------------------------------------------SMTDLLS----AEDIKKA
PRVA_RABIT ------------------------------------------AMTELLN----AEDIKKA
TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM

PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_HUMAN VGAFSATDS--FDHKKFFQMVG------LKKKSADDVKKVFHMLDKDKSGFIEEDELGFI
PRVA_GERSP IGAFAAADS--FDHKKFFQMVG------LKKKTPDDVKKVFHILDKDKSGFIEEDELGFI
PRVA_MOUSE IGAFAAADS--FDHKKFFQMVG------LKKKNPDEVKKVFHILDKDKSGFIEEDELGSI
PRVA_RAT   IGAFTAADS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGSI
PRVA_RABIT IGAFAAAES--FDHKKFFQMVG------LKKKSTEDVKKVFHILDKDKSGFIEEEELGFI
TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM

PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-
PRVA_HUMAN LKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES-
PRVA_GERSP LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES-
PRVA_MOUSE LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES-
PRVA_RAT   LKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES-
PRVA_RABIT LKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES-
TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE

-This alignment is not informative about the relation between TPCC_MOUSE and the rest of the sequences.
-A better spread of the sequences is needed

Selecting diverse sequences

PRVB_CYPCA -AFAGVLNDADIAAALEACKAADSFNHKAFFAKVGLTSKSADDVKKAFAIIDQDKSGFIE
PRVB_BOACO -AFAGILSDADIAAGLQSCQAADSFSCKTFFAKSGLHSKSKDQLTKVFGVIDRDKSGYIE
PRV1_SALSA MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE
PRVB_LATCH -AVAKLLAAADVTAALEGCKADDSFNHKVFFQKTGLAKKSNEELEAIFKILDQDKSGFIE
PRVB_RANES -SITDIVSEKDIDAALESVKAAGSFNYKIFFQKVGLAGKSAADAKKVFEILDRDKSGFIE
PRVA_MACFU -SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIE
PRVA_ESOLU --AKDLLKADDIKKALDAVKAEGSFNHKKFFALVGLKAMSANDVKKVFKAIDADASGFIE

PRVB_CYPCA EDELKLFLQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA-
PRVB_BOACO EDELKKFLQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG
PRV1_SALSA VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ-
PRVB_LATCH DEELELFLQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA-
PRVB_RANES QDELGLFLQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA-
PRVA_MACFU EDELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_ESOLU EEELKFVLKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA

-A REASONABLE model now exists.
-Going further: remote homologues.
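
The advice on the last few slides (identical sequences bring no information, a better spread is needed) can be checked numerically before going further. What follows is a minimal sketch, not part of the course material: the function names and the 90% cut-off are illustrative assumptions, and the fragments are simply the first 20 columns of three rows from the "A common mistake" alignment.

```python
from itertools import combinations


def percent_identity(row_a, row_b):
    """Percent identity over the columns where neither aligned row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def redundant_pairs(alignment, threshold=90.0):
    """Name pairs whose identity reaches the (assumed) redundancy threshold."""
    return [
        (name_a, name_b)
        for (name_a, row_a), (name_b, row_b) in combinations(alignment.items(), 2)
        if percent_identity(row_a, row_b) >= threshold
    ]


# First 20 columns of three rows from the "A common mistake" slide.
fragments = {
    "PRVA_MACFU": "SMTDLLNAEDIKKAVGAFSA",
    "PRVA_HUMAN": "SMTDLLNAEDIKKAVGAFSA",
    "PRVA_RABIT": "AMTELLNAEDIKKAIGAFAA",
}

for pair in redundant_pairs(fragments):
    print("near-identical, one of them adds little:", pair)
```

Identity is computed only over columns without gaps, which is one common convention among several; other definitions divide by the alignment length or by the shorter sequence.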
Aligning remote homologues

PRVA_MACFU ------------------------------------------SMTDLLNA----EDIKKA
PRVA_ESOLU -------------------------------------------AKDLLKA----DDIKKA
PRVB_CYPCA ------------------------------------------AFAGVLND----ADIAAA
PRVB_BOACO ------------------------------------------AFAGILSD----ADIAAG
PRV1_SALSA -----------------------------------------MACAHLCKE----ADIKTA
PRVB_LATCH ------------------------------------------AVAKLLAA----ADVTAA
PRVB_RANES ------------------------------------------SITDIVSE----KDIDAA
TPCS_RABIT -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCS_PIG   -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCC_MOUSE MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM

PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_ESOLU LDAVKAEGS--FNHKKFFALVG------LKAMSANDVKKVFKAIDADASGFIEEEELKFV
PRVB_CYPCA LEACKAADS--FNHKAFFAKVG------LTSKSADDVKKAFAIIDQDKSGFIEEDELKLF
PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF
PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF
PRVB_LATCH LEGCKADDS--FNHKVFFQKTG------LAKKSNEELEAIFKILDQDKSGFIEDEELELF
PRVB_RANES LESVKAAGS--FNYKIFFQKVG------LAGKSAADAKKVFEILDRDKSGFIEQDELGLF
TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI
TPCS_PIG   IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI
TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-
PRVA_ESOLU LKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA-
PRVB_CYPCA LQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA--
PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG-
PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ--
PRVB_LATCH LQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA--
PRVB_RANES LQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA--
TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCS_PIG   FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE

Going further…

PRVA_MACFU VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVB_BOACO LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF
PRV1_SALSA LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF
TPCS_RABIT IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI
TPCS_PIG   IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI
TPCC_MOUSE IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
TPC_PATYE  SDEMDEEATGRLNCDAWIQLFER---KLKEDLDERELKEAFRVLDKEKKGVIKVDVLRWI

PRVA_MACFU LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES--
PRVB_BOACO LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG--
PRV1_SALSA LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ---
TPCS_RABIT FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ-
TPCS_PIG   FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ-
TPCC_MOUSE LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE-
TPC_PATYE  LS---SLGDELTEEEIENMIAETDTDGSGTVDYEEFKCLMMSSDA

What makes a good alignment…
-The more divergent the sequences, the better
-The fewer indels, the better
-Nice ungapped blocks separated with indels
-Different classes of residues within a block:
 •Completely conserved
 •Size and hydropathy conserved
 •Size or hydropathy conserved
-The ultimate evaluation is a matter of personal judgment and knowledge

Avoiding pitfalls

Keep a biological perspective

chite      ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat      --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr      KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse      -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

chite      AATAKQNYIRALQEYERNGG-
wheat      ANKLKGEYNKAIAAYNKGESA
trybr      AEKDKERYKREM---------
mouse      AKDDRIRYDNEMKSWEEQMAE

chite      AD--K----PKR-PLYMLWLNS-ARESIKRENPDFK-VT-EVAKKGGELWRGL-
wheat      -DPNK----PKRAP-FFVFMGE-FREEFKQKNPKNKSVA-AVGKAAGERWKSLS
trybr      -K--KDSNAPKR-AMT-MFFSSDFR-S-KH-S-DLS-IV-EMSKAAGAAWKELG
mouse      ----K----PKR-PRYNIYVSESFQEA-K--D-D-S-AQGKL-KLVNEAWKNLS

chite      KSEWEAKAATAKQNY-I--RALQE-YERNG-G-
wheat      KAPYVAKANKLKGEY-N--KAIAA-YNK-GESA
trybr      RKVYEEMAEKDKERY----K--RE-M-------
mouse      KQAYIQLAKDDRIRYDNEMKSWEEQMAE-----

Do not overtune!!!
DIFFERENT PARAMETERS

chite      ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat      --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr      KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse      -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

chite      ---ADKPKRPL-SAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat      --DPNKPKRAP-SAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr      KKDSNAPKRAMTSFMFFSSDFRS-----KHSDLS-IVEMSKAAGAAWKELGP
mouse      -----KPKRPR-SAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

DO NOT PLAY WITH PARAMETERS!
IF YOU KNOW THE ALIGNMENT YOU WANT: MAKE IT YOURSELF!

Choosing the right method
[Table: PROBLEM versus PROGRAM (METHOD); the problem column was graphical and is not recoverable. Programs listed: ClustalW, ClustalW, MSA, DIALIGN II, DIALIGN II.]
Source: BaliBase, Thompson et al, NAR, 1999

Conclusion
-The best alignment method: your brain, the right data
-The best evaluation method: your eyes, experimental information (SwissProt)
-What can I conclude? Homology -> information extrapolation
-How can I go further? Patterns, Profiles, HMMs, …

The database

History
-Founded by Amos Bairoch
-1988: First release in the PC/Gene software
-1990: Synchronisation with Swiss-Prot
-1994: Integration of « profiles »
-1999: PROSITE joins InterPro
-November 2001: Current release 16.50

Database content
Official Release
-~1400 Patterns (PSxxxxx PATTERN)
-~100 Profiles (PSxxxxx MATRIX)
-4 Rules (PSxxxxx RULE)
-~1100 Documentations (PDOCxxxxx)
Pre-Release
-~250 Profiles (PSxxxxx MATRIX)
-~150 Documentations (QDOCxxxxx)

Pattern « philosophy »
-Target: definition of sites with biological information (catalytic, metal binding, S-S bridge, cofactor binding, prosthetic group, PTM)
-Easy to understand and to design, example:
Q-x(3)-N-[SA]-C-G-x(3)-[LIVM](2)-H-[SA]-[LIVM]-[SA]

Pattern syntax
Regular expression (REGEXP) language:
-Each position is separated by a dash « - »
-Amino acids are represented by their single letter code
-« x » represents any amino acid
-[] group of amino acids acceptable for a position
-{} group of amino acids not acceptable for a position
-() multiple or range, e.g. A(1,3) means 1 to 3 A
-< anchor at beginning of sequence
-> anchor at end of sequence

Profile « philosophy »
-Aim: identification of domains and not protein families
-Gene discovery vs automatic annotation
-Importance of score and calibration
-Possible manual tuning (by a well trained expert… ;-)
 -> allowed by the profile syntax
 -> no direct link to multiple alignment

Database content: PATTERN

ID   UCH_2_1; PATTERN.
AC   PS00972;
DT   JUN-1994 (CREATED); SEP-2000 (DATA UPDATE); SEP-2000 (INFO UPDATE).
DE   Ubiquitin carboxyl-terminal hydrolases family 2 signature 1.
PA   G-[LIVMFY]-x(1,3)-[AGC]-[NASM]-x-C-[FYW]-[LIVMFC]-[NST]-[SACV]-x-[LIVMS]-Q.
NR   /RELEASE=38,80000;
NR   /TOTAL=41(41); /POSITIVE=41(41); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=2; /PARTIAL=0;
CC   /TAXO-RANGE=??E??; /MAX-REPEAT=1;
CC   /SITE=7,active_site(?);
DR   Q93008, FAFX_HUMAN, T; O00507, FAFY_HUMAN, T; P55824, FAF_DROME , T;
DR   P70398, FAF_MOUSE , T; P54578, TGT_HUMAN , T; P40826, TGT_RABIT , T;
DR   P25037, UBP1_YEAST, T; O42726, UBP2_KLULA, T; Q01476, UBP2_YEAST, T;
(…)
DR   P38187, UBPD_YEAST, T; Q24574, UBPE_DROME, T; Q14694, UBPE_HUMAN, T;
DR   P52479, UBPE_MOUSE, T; P38237, UBPE_YEAST, T; P50101, UBPF_YEAST, T;
DR   Q02863, UBPG_YEAST, T; P43593, UBPH_YEAST, T; Q61068, UBPW_MOUSE, T;
DR   P34547, UBPX_CAEEL, T; Q09931, UBPY_CAEEL, T;
DR   P53874, UBPA_YEAST, N; Q17361, UBPT_CAEEL, N;
DO   PDOC00750;
//

Database content: Profile

ID   UCH_2_3; MATRIX.
AC   PS50235;
DT   SEP-2000 (CREATED); SEP-2000 (DATA UPDATE); SEP-2000 (INFO UPDATE).
DE   Ubiquitin carboxyl-terminal hydrolases family 2 profile.
MA   /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=193; TOPOLOGY=LINEAR;
MA   /DISJOINT: DEFINITION=PROTECT; N1=10; N2=185;
MA   /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=1.3922; R2=.00836191; TEXT='NScore';
MA   /CUT_OFF: LEVEL=0; SCORE=910; N_SCORE=9.0; MODE=1;
MA   /CUT_OFF: LEVEL=-1; SCORE=610; N_SCORE=6.5; MODE=1;
MA   /DEFAULT: B1=-100; E1=-100; MI=-105; MD=-105; IM=-105; DM=-105; I=-20; D=-20;
MA   /I: B1=0; BI=-105; BD=-105;
MA   /M: SY='T'; M=0,-14,2,-19,-16,-9,-21,-18,-6,-10,-5,-5,-12,-21,-15,-6,0,9,6,-29,-11,-16;
(…)
MA   /M: SY='D'; M=-11,12,-27,17,6,-21,-9,-4,-21,-4,-18,-14,5,-12,0,-6,-3,-8,-19,-26,-11,2;
MA   /I: E1=0;
NR   /RELEASE=38,80000;
NR   /TOTAL=47(47); /POSITIVE=47(47); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=0; /PARTIAL=0;
CC   /TAXO-RANGE=??E??; /MAX-REPEAT=1;
DR   Q01988, UBPB_CANFA, T; Q93008, FAFX_HUMAN, T; O00507, FAFY_HUMAN, T;
DR   P55824, FAF_DROME , T; P70398, FAF_MOUSE , T;
(…)
DR   P53010, PAN2_YEAST, T; Q09798, YAA4_SCHPO, T; P43589, YFH5_YEAST, T;
DO   PDOC00750;
//

Database content: documentation

{PDOC00750}
{PS00972; UCH_2_1}
{PS00973; UCH_2_2}
{PS50235; UCH_2_3}
{BEGIN}
**********************************************************************
* Ubiquitin carboxyl-terminal hydrolases family 2 signatures/profile *
**********************************************************************
Ubiquitin carboxyl-terminal hydrolases (EC 3.1.2.15) (UCH) (deubiquitinating enzymes) [1,2] are thiol proteases that recognize and hydrolyze the peptide bond at the C-terminal glycine of ubiquitin. These enzymes are involved in the processing of poly-ubiquitin precursors as well as that of ubiquinated proteins.
There are two distinct families of UCH. The second class consist of large proteins (800 to 2000 residues) and is currently represented by:
- Yeast UBP1, UBP2, UBP3, UBP4 (or DOA4/SSV7), UBP5, UBP7, UBP9, UBP10, UBP11, UBP12, UBP13, UBP14, UBP15 and UBP16.
- Human tre-2.
- Human isopeptidase T.
- Human isopeptidase T-3.
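
The /NORMALIZATION and /CUT_OFF lines of the profile entry on the previous slide can be checked against each other with a little arithmetic. The sketch below assumes that FUNCTION=LINEAR means N_score = R1 + R2 * raw score; the entry itself does not spell this out, but both of its cut-off lines are consistent with that reading. The function name is illustrative.

```python
# Values copied from the /NORMALIZATION line of PS50235 (UCH_2_3).
R1 = 1.3922
R2 = 0.00836191


def normalized_score(raw_score, r1=R1, r2=R2):
    """Assumed linear normalisation: N_score = R1 + R2 * raw_score."""
    return r1 + r2 * raw_score


# Raw scores and normalised scores quoted on the two /CUT_OFF lines.
cut_offs = [
    {"level": 0, "score": 910, "n_score": 9.0},
    {"level": -1, "score": 610, "n_score": 6.5},
]

for cut in cut_offs:
    computed = normalized_score(cut["score"])
    print(
        "LEVEL=%d  SCORE=%d  computed N_score=%.2f  quoted N_score=%.1f"
        % (cut["level"], cut["score"], computed, cut["n_score"])
    )
```

Both computed values land within 0.01 of the quoted N_SCORE, which is why the slides stress score and calibration: the raw score only becomes comparable across profiles after this normalisation.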
(…)

Database content: documentation

also probably implicated in the catalytic mechanism.
We have developed signature pattern for both conserved regions. We also developed a profile including the two regions covered by the patterns.
-Consensus pattern: G-[LIVMFY]-x(1,3)-[AGC]-[NASM]-x-C-[FYW]-[LIVMFC]-[NST]-[SACV]-x-[LIVMS]-Q
 [C is the putative active site residue]
-Sequences known to belong to this class detected by the pattern: ALL, except for two sequences.
(…)
-Note: these proteins belong to family C19 in the classification of peptidases [3,E1].
-Note: this documentation entry is linked to both a signature pattern and a profile. As the profile is much more sensitive than the pattern, you should use it if you have access to the necessary software tools to do so.
-Last update: September 2000 / Patterns and text revised; profile added.
[ 1] Jentsch S., Seufert W., Hauser H.-P. Biochim. Biophys. Acta 1089:127-139(1991).
[ 2] D'andrea A., Pellman D. Crit. Rev. Biochem. Mol. Biol. 33:337-352(1998).
[ 3] Rawlings N.D., Barrett A.J. Meth. Enzymol. 244:461-486(1994).
[E1] http://www.expasy.ch/cgi-bin/lists?peptidas.txt

Tools
-EMBOSS: fuzzpro, fuzztran, fuzznuc, patmatdb, patmatmotifs
-FINDPATTERN, SCANPROSITE...
-Pftools 2.2 (pfmake, pfw, pfscan, pfsearch): Fortran source code (open source); binaries (solaris, linux, hpux, irix, win32, macosX)
-PFSCAN & PFRAMESCAN
-GeneMatcher (http://www.paracel.com)
-http://www.isrec.isb-sib.ch/software
-http://www.expasy.org/tools/#pattern

PSI-BLAST
-What is it?
 •Derived from NCBI-BLAST2.0
 •Position Specific Iterative BLAST
-Difference with BLAST
 •PSSM / checkpoint
-Advantage / Disadvantage

PSI-BLAST
Position specific iterative BLAST (PSI-BLAST) refers to a feature of BLAST 2.0 in which a profile (or position specific scoring matrix, PSSM) is constructed automatically from a multiple alignment of the highest scoring hits in an initial BLAST search. The PSSM is generated by calculating position-specific scores for each position in the alignment. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. The profile is used to perform a second (etc.) BLAST search, replacing the normal matrix (e.g. BLOSUM62), and the results of each "iteration" are used to refine the profile. This iterative searching strategy results in increased sensitivity.

BLAST algorithm

Differences with BLAST
-The two E-values
-Automatically or manually selecting the matches
-The substitution matrix
-The iteration

PSI-BLAST E-values
Two different E-value settings need to be specified in the PSI-BLAST program. The first of these (upper) sets the threshold for the initial BLAST search. The default value is 10, as in the standard BLAST program. The second E-value (lower) is the threshold for inclusion in the position specific matrix used for PSI-BLAST iterations. The default setting is 0.001. Together, the two values let the user see all of the BLAST hits up to E=10 (and selectively include some of them, based on prior knowledge), while automatically including only those hits whose E-value is better than the relatively stringent threshold of 0.001.
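
The two thresholds and the profile-building step described above can be sketched in a few lines. This is a deliberately simplified illustration, not the PSI-BLAST implementation: the function names, the uniform background frequencies and the flat pseudocount are assumptions made for the example, and real PSI-BLAST additionally weights sequences and uses data-dependent pseudocounts.

```python
import math
from collections import Counter

REPORT_E = 10.0       # upper threshold: hits shown to the user
INCLUSION_E = 0.001   # lower threshold: hits used to build the PSSM

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption for the sketch: every residue is equally likely in the background.
BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def split_hits(hits, report_e=REPORT_E, inclusion_e=INCLUSION_E):
    """Split (name, e_value) hits into those included in the PSSM and
    those only reported; hits worse than report_e are dropped."""
    included = [hit for hit in hits if hit[1] <= inclusion_e]
    shown_only = [hit for hit in hits if inclusion_e < hit[1] <= report_e]
    return included, shown_only


def build_pssm(aligned_rows, background=BACKGROUND, pseudocount=1.0):
    """One log-odds column per alignment position (rows of equal length)."""
    alphabet = sorted(background)
    pssm = []
    for col in range(len(aligned_rows[0])):
        counts = Counter(row[col] for row in aligned_rows if row[col] in background)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        column = {}
        for aa in alphabet:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background[aa])
        pssm.append(column)
    return pssm


hits = [("hit_a", 1e-30), ("hit_b", 0.0005), ("hit_c", 0.5), ("hit_d", 50.0)]
included, shown_only = split_hits(hits)
print("included in PSSM:", [name for name, _ in included])
print("reported only:   ", [name for name, _ in shown_only])

# Toy aligned segments standing in for the included hits.
pssm = build_pssm(["KPKR", "KPRR", "KAKR"])
print("best residue per column:", [max(col, key=col.get) for col in pssm])
```

The resulting matrix has one column per aligned position, so it would replace the fixed substitution matrix in the next search round; repeating the split-and-build step on the new hits is the "iteration".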
BLAST PSSM or weight matrix
-A substitution matrix for an alphabet of size A is of size AxA
-A PSSM for an alphabet of size A is of size AxN, where N is the length of the query

Substitution matrix (excerpt):

      A    R    N   ..    Y    V
A     4   -1   -2   ..   -2    0
R    -1    5    0   ..   -2   -3
N    -2    0    6   ..   -2   -3
..
Y    -2   -2   -2   ..    7   -1
V     0   -3   -3   ..   -1    4

[Figure: excerpt of a PSSM with rows A, R, N, .., Y, V and one column per query position; the values cannot be recovered reliably from the extraction.]

BLAST Iteration

PHI-BLAST: a link with PATTERNS
PHI-BLAST means Pattern-Hit Initiated BLAST. PHI-BLAST expects as input a protein query sequence and a pattern contained in that sequence. It searches the specified database for other protein sequences that also contain the input pattern and have significant similarity to the query sequence in the vicinity of the pattern occurrences. Statistical significance is reported using E-values as for other forms of BLAST, but the statistical method for computing the E-values is different. PHI-BLAST is integrated with Position-Specific Iterated BLAST (PSI-BLAST), so that the results of a PHI-BLAST query can be used to initiate one or more rounds of PSI-BLAST searching.

The good and the bad
Advantages
-Fast
-User friendly interface
-Local bias statistics
-Single software
Disadvantages
-Could be confusing
-No position specific gap penalty
-Fixed query length
-Complex PSSM/checkpoint for reuse
-Difficult scan vs search

How to « PSI-BLAST » efficiently?
-Choose your query sequence carefully
-Limit the size to the domain, but maximize
-Check matches: include or exclude based on biological knowledge
-Do not overfit!!
-Try the reverse experiment to certify
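
One way to bring biological knowledge into the "check matches" step is to test whether a hit carries a known PROSITE signature, the same kind of pattern that PHI-BLAST takes as input. The sketch below translates the pattern syntax listed on the "Pattern syntax" slide into a Python regular expression. It is an illustration under stated assumptions, not the ScanProsite implementation: the function name is made up, only the syntax elements listed on the slide are handled, and the test peptide is synthetic.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern (syntax as on the slide) into a regex string."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):        # anchor at beginning of sequence
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # anchor at end of sequence
            suffix, element = "$", element[:-1]
        # Separate the residue part from an optional repeat such as (3) or (1,3).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                    # any amino acid
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # residues not acceptable
        # [..] groups and single letters carry over unchanged.
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# Pattern PS00972 (UCH_2_1) as quoted in the database entry.
uch_2_1 = "G-[LIVMFY]-x(1,3)-[AGC]-[NASM]-x-C-[FYW]-[LIVMFC]-[NST]-[SACV]-x-[LIVMS]-Q."
regex = prosite_to_regex(uch_2_1)
print(regex)

# Synthetic peptide written to satisfy the pattern; not a real protein fragment.
peptide = "MKGLAAANACFLNSALQTT"
hit = re.search(regex, peptide)
print(hit.group(0) if hit else "no match")
```

A pattern match is a yes/no answer with no score, which is the trade-off the slides draw between patterns and profiles: a missing match on a remote homologue does not rule it out, so the check complements rather than replaces the E-value.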