Visualisation of Multiple Sequence Alignments
VIZBI 2011
Des Higgins
Conway Institute, University College Dublin, Ireland

Multiple Alignment?
• Align 3 or more sequences together
  – Homologous residues lined up in columns

Whale myoglobin   ----VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin    GSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTP---EFFPKFKGLTT
Lupin globin      ---GALTESQAALVKSSWEEF--NIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

• Needed because of
  – Orthologues from different species
  But mainly:
  – Paralogues from gene duplications
• Multi-gene families
  – e.g. humans have approx. 500 protein kinases

Human Protein Kinases
The human kinome comprises 40 atypical PKs and 478 classical PKs. The latter consist of 388 serine/threonine kinases and 90 tyrosine kinases; 50 sequences lack a functional catalytic site. (Manning et al., Science, 2002)

Globin Multiple Alignment

Human beta        --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
Horse beta        --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
Human alpha       ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
Horse alpha       ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
Whale myoglobin   ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin    PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
Lupin globin      --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

Human beta        PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
Horse beta        PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
Human alpha       ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
Horse alpha       ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
Whale myoglobin   EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
Lamprey globin    ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
Lupin globin      VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV

Human beta        LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
Horse beta        LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
Human alpha       LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
Horse alpha       LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
Whale myoglobin   ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Lamprey globin    LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------
Lupin globin      VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---

[Conservation line under each block (symbols * : .): column positions not recoverable]

1. Visualise the residues/gaps?
[The globin alignment is repeated on the following slides, with the captions:]
• Alpha helices
• Haem binding Histidines

2. Visualise the sequence groupings?
[Figure: the globin alignment shown beside a tree of the seven sequences]
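To make the column view concrete, here is a minimal Python sketch that scans the first 60-column block of the globin alignment and reports the columns in which all seven sequences carry the same residue (the columns a Clustal-style conservation line marks with '*'). The function names are illustrative and not from the talk, and the rows are the slide's sequences re-split to equal width, which is an assumption of this sketch.

```python
from collections import Counter

# First block of the globin alignment from the slides, one row per sequence.
# Rows are assumed to be 60 columns wide, with '-' as the gap character.
ALIGNMENT = {
    "Human beta":      "--------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST",
    "Horse beta":      "--------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN",
    "Human alpha":     "---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-",
    "Horse alpha":     "---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-",
    "Whale myoglobin": "---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT",
    "Lamprey globin":  "PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT",
    "Lupin globin":    "--------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE",
}


def column_profile(alignment):
    """For each column return (commonest residue or None, its fraction of all rows)."""
    rows = list(alignment.values())
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all rows of an alignment must have the same length")
    profile = []
    for column in zip(*rows):
        residues = [c for c in column if c != "-"]  # gaps never count as residues
        if not residues:
            profile.append((None, 0.0))
            continue
        residue, count = Counter(residues).most_common(1)[0]
        profile.append((residue, count / len(column)))
    return profile


def conserved_columns(alignment):
    """0-based indices of columns where every sequence has the same residue."""
    return [i for i, (_, frac) in enumerate(column_profile(alignment)) if frac == 1.0]


def star_line(alignment):
    """Simplified conservation line: '*' for identical columns, ' ' otherwise."""
    hits = set(conserved_columns(alignment))
    width = len(next(iter(alignment.values())))
    return "".join("*" if i in hits else " " for i in range(width))


if __name__ == "__main__":
    for name, row in ALIGNMENT.items():
        print(f"{name:<18}{row}")
    print(f"{'':<18}{star_line(ALIGNMENT)}")
```

On this block the sketch flags four columns (a conserved L, W, P and F), which agrees with the four '*' marks on the slide's conservation line. The ':' and '.' marks depend on residue similarity groups and are not reproduced here.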
So: What is the Problem?
• What if N >> 100,000?
• e.g. SSU rRNA
  – www.arb-silva.de
  – 1,471,257 seqs
• e.g. ABC transporters
  – PFAM
  – ABC_tran PF00005
  – 127,458 seqs
• Metagenomics
• Sequence 10,000 vertebrate genomes!
  => 5,000,000 protein kinases, GPCRs

SequenceJuxtaposer: Fluid Navigation For Large-Scale Sequence Comparison In Context. James Slack, Kristian Hildebrand, Tamara Munzner, Katherine St. John. Proc. German Conference on Bioinformatics 2004, pp. 37-42.

Poster D03, VIZBI 2011
Sequence Surveyor: scalable multiple sequence alignment overview visualisation. Danielle Albers, Colin Dewey, Michael Gleicher

Poster D09, VIZBI 2011
JProfileGrid: visualising very large multiple sequence alignments. Alberto Roca, Aaron Abajian, David Vigerust

This talk
• How to make huge multiple alignments
• How to cluster > 100,000 sequences
• MDS/PCA on big datasets

Multiple Sequence Alignment
• NP complete
• Mainly use: “Progressive Alignment”
  – Greedy heuristic
  – Use a tree/clustering of the seqs
• Barton and Sternberg (1988)
  Feng and Doolittle (1987)
  Higgins and Sharp (1988)
  Hogeweg and Hesper (1984)
  Willie Taylor (1987)

“Guide Tree”
[Figure, repeated over several slides: the globin alignment with its guide tree. Tree leaf order: Horse beta, Human beta, Horse alpha, Human alpha, Whale myoglobin, Lamprey cyanohaemoglobin, Lupin leghaemoglobin]
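The guide-tree step of progressive alignment can be sketched as a small UPGMA routine. This is an illustration under stated assumptions, not the Clustal code: distances are simple p-distances read off the first block of the finished globin alignment, whereas a real aligner has to estimate them from unaligned sequences, and the names `p_distance` and `upgma` are made up for this example.

```python
from itertools import combinations

# First block of the globin alignment from the slides (gaps as '-').
ALIGNED = [
    ("Human beta",      "--------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST"),
    ("Horse beta",      "--------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN"),
    ("Human alpha",     "---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-"),
    ("Horse alpha",     "---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-"),
    ("Whale myoglobin", "---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT"),
    ("Lamprey globin",  "PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT"),
    ("Lupin globin",    "--------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE"),
]


def p_distance(a, b):
    """Fraction of differing residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(names, dist):
    """Build a UPGMA tree.

    names: leaf labels; dist: dict mapping frozenset({i, j}) -> distance.
    Returns (nested-tuple tree, list of merges in the order they were made).
    """
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (subtree, size)
    d = dict(dist)
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        # Greedy step: join the two closest clusters.
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: d[frozenset(p)])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        # Distance from the merged cluster to each other cluster is the
        # size-weighted average of the two old distances.
        for c in clusters:
            d[frozenset((next_id, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        merges.append((tree_a, tree_b))
        next_id += 1
    (tree, _), = clusters.values()
    return tree, merges


names = [name for name, _ in ALIGNED]
dist = {
    frozenset((i, j)): p_distance(ALIGNED[i][1], ALIGNED[j][1])
    for i, j in combinations(range(len(ALIGNED)), 2)
}
guide_tree, merge_order = upgma(names, dist)

if __name__ == "__main__":
    for step, (left, right) in enumerate(merge_order, start=1):
        print(step, left, "+", right)
```

On this block the two alpha chains join first and the two beta chains second; a progressive aligner would then align sequences and sub-alignments in the same order as the merges. Filling the full distance table takes N(N-1)/2 comparisons, which is why this step dominates for large N.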
Clustal
• 66,000 citations
• Clustal1-Clustal4
  – 1988, Paul Sharp, Dublin
• Clustal V 1992
  – EMBL Heidelberg
  – Rainer Fuchs
  – Alan Bleasby
• Clustal W, Clustal X 1994-2005
  – Toby Gibson, EMBL, Heidelberg
  – Julie Thompson, IGBMC, Strasbourg
• Clustal W and Clustal X 2.0 2007
  – University College Dublin
www.clustal.org

Complexity
• Guide tree construction O(N²)
• Later Progressive Alignment O(N)
• Guide tree construction is limiting
>10,000 seq alignment is tough

PartTree
• MAFFT Package
• Select n sequences where n << N
• UPGMA on n sequences
• Cluster the remainder (N-n) with their closest clusters
Katoh, K., Toh, H., 2007. PartTree: an algorithm to build an approximate tree from a large number of unaligned sequences. Bioinformatics 23, 372–374.

Embedding?
• Replace each sequence by a Vector
  – Vector-Vector distances are MUCH faster than Seq.-Seq. distances
• Vectors very fast/simple to cluster
  – e.g. cluster 10,000 vectors of length 150: <<1 min on 1 processor (UPGMA)
  – e.g. cluster 300,000 vectors of length 300: 6 mins (k-means, k = 300)

Embedding papers
• FastMap
  – Faloutsos, C., Lin, K. (1995) FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualisation of Traditional and Multimedia Datasets. Proc. 1995 ACM SIGMOD International Conf. on Management of Data, pp. 163–174.
• Sparsemap
  – G. Hristescu and M. Farach-Colton. Cluster-preserving embedding of proteins. Technical Report 99-50, Computer Science Department, Rutgers University, 1999.

mBED
• Select k seqs “randomly”
  – k << N
  – k ∝ log N
• Use distance to each of these k “references”
  – k long vector for each sequence
• Use heuristics
  – avoid duplicates
  – find outliers
• Very fast and simple
  – Complexity O(kN) i.e. O(N log N)
• Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG. (2010) Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms Mol Biol. 5:21.

[Figure: mBED schematic; labels: k seeds, N, k]

MDS visualisation?
• Do PCA on Embedded sequences
• 3994 H3N2 HA sequences
  – 1967 (blue) - 2008 (orange)

Guide Tree Quality
• 1000 random guide trees
• 1000 sparsemap trees
• Clustal tree
• mBED

Clustal Ω
• Release first version by April 2011
• Scalable
  – mBed
  – Gordon Blackshields
• Accurate
  – HMM-HMM alignment
  – HHalign
  – Johannes Söding, Munich
• Re-use old alignments
  – Kevin Karplus
  – UCSC
• Align 120,000 ABC transporters
  – 6 hours on 1 core
• More accurate than
  – MUSCLE or MAFFT
• Coming soon...
Fabian Sievers, Andreas Wilm, David Dineen

MDS/PCA etc.
• Dimension reduction
• Treat alignment columns as variables
  – PCA: Principal Components Analysis
  – CA: Correspondence Analysis, Jean Paul Benzécri
• Use N×N distance matrix
  – MDS
  – PCOORD

Use CA, PCA for Sequences?
• every alignment column:
  – 20 binary variables
  – Or several physicochemical properties

Trypsin-like serine proteases
[Figure: ordination plot, scale d = 0.05; the group label "15 Chymotrypsins" and sequence labels EC_1_*, EC_4_* and X5PTP_EC_4 are visible]
[Figure, continued: sequence labels EC_4_* and EC_36_*; group labels "10 Elastases" and "31 Trypsins"]

• Correspondence Analysis
• Supervise:
  – Between Groups Analysis
  – Dolédec and Chessel (1987) (similar to PLS discriminant analysis)

[Figure, shown three times: Between Groups Analysis plot, scale d = 0.1, separating Chymotrypsin, Trypsin and Elastase; axis ticks 0e+00, 4e-04, 8e-04. Residue labels: X54V X265S X232M X154T X95N X3N X93F X137C X243Q X82E X87L X7A X180Q X155T X14W X165N X229S X183L X181A X98W X66T X98Y X155S X93I X275G X154V X228K X162S X70R X229D X204S X132Y X273K X16S X18I X10N X92I X196Y X82G X232Q]

Wallace IM, Higgins DG. (2007) Supervised multivariate analysis of sequence groups to identify specificity determining residues. BMC Bioinformatics. 8:135.

MDS
• Multidimensional Scaling
• Fit distances to an N×N distance matrix
• Use Euclidean distances?
  – “Classical scaling” = Principal Co-Ordinates Analysis
• PCOORD, John Gower
  – Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53, 325-338.
  – Higgins, D.G. (1992) Sequence ordinations: a multivariate analysis approach to analysing large sequence data sets. CABIOS, 8, 15-22.
  – Complexity at least O(N²)

Large scale MDS?
• SC-MDS
  – Jengnan Tzeng, Henry Horng-Shing Lu, and Wen-Hsiung Li (2008) Multidimensional scaling for large genomic data sets. BMC Bioinformatics 9:179.
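The classical scaling named on the MDS slide can be sketched in plain Python as follows. This is a minimal illustration, assuming the distances are close to Euclidean: the squared distance matrix is double-centred and its leading eigenvectors give the coordinates. A small power-iteration loop stands in for a library eigensolver so the block has no dependencies; the function name and parameters are choices made for this example.

```python
import math
import random


def classical_mds(D, n_components=2, iterations=200, seed=0):
    """Classical scaling (principal co-ordinates analysis) of a distance matrix.

    D is a symmetric N x N list of lists; returns N coordinate lists.
    Assumes near-Euclidean distances, so the centred matrix has no large
    negative eigenvalues (power iteration would otherwise be misled).
    """
    n = len(D)
    # Double-centre the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[D[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    B = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * n_components for _ in range(n)]
    for axis in range(n_components):
        # Power iteration for the leading eigenvector of B; each pass is O(N^2).
        v = [rng.random() - 0.5 for _ in range(n)]
        for _ in range(iterations):
            w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(
            v[i] * sum(B[i][j] * v[j] for j in range(n)) for i in range(n)
        )
        scale = math.sqrt(max(eigenvalue, 0.0))
        for i in range(n):
            coords[i][axis] = v[i] * scale
        # Deflate B so the next pass converges to the next eigenvector.
        B = [[B[i][j] - eigenvalue * v[i] * v[j] for j in range(n)]
             for i in range(n)]
    return coords


if __name__ == "__main__":
    points = [(0.0, 0.0), (6.0, 0.0), (6.0, 1.0), (0.0, 1.0)]
    D = [[math.dist(p, q) for q in points] for p in points]
    for row in classical_mds(D):
        print([round(x, 3) for x in row])
```

For points that really lie in a plane, the recovered coordinates reproduce the input distances up to rotation and reflection. The N×N matrix alone costs O(N²) memory and time, which is the bottleneck the large-scale methods address.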
Easily
• mBED: do MDS on >100,000 seqs
  – Blackshields et al. (2010)
• PCOORD or MDS on a subset of the sequences
  – add the rest later
• Landmark MDS + Nyström approximation
  – V. de Silva, J.B. Tenenbaum, “Sparse multidimensional scaling using landmark points.” (2004) Technical report, Stanford University.
• 307,434 lentivirus (HIV etc.) sequences from UniProt

H3N2 flu sequences
• Weifeng Shi
• 8167 HA sequences
  – human H3N2 influenza viruses
• DNAdist in Phylip
  – K2P (Kimura two parameter) model
• Python: Matplotlib
[Figure: MDS plot of the H3N2 sequences; legend: 1960s, 1970s, 1980s, 1990s, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010]

Acknowledgements
• BGA, CIA: Aedin Culhane, Ian Jeffery, Stephen Madden, Iain Wallace, Guy Perriere (Lyons)
• mBED: Gordon Blackshields, Mark Larkin
• Clustal Omega: Fabian Sievers, Andreas Wilm, David Dineen, Johannes Soeding (Munich), Rodrigo Lopez (EBI)
• Flu MDS: Weifeng Shi

Supervised PCA or CA?
[Figure: Malate Dehydrogenases and Lactate Dehydrogenases]

ADE-4
http://pbil.univ-lyon1.fr/ADE-4/
Thioulouse J., Chessel D., Dolédec S., & Olivier J.M. (1997) ADE-4: a multivariate analysis and graphical display software. Statistics and Computing, 7, 1, 75-83.

Between Group Analysis (BGA)
• Supervised Correspondence Analysis or PCA
Dolédec, S. & Chessel, D. (1987) Acta Oecologica, Oecologica Generalis, 8, 3, 403-426.

Co-Inertia Analysis (CIA)
• 2 datasets; simultaneous CA or PCA
Dolédec, S. & Chessel, D. (1994) Freshwater Biology, 31, 277-294.
Thioulouse, J. & Lobry, J.R. (1995) CABIOS, 11, 321-329.

• MADE4
  – Culhane, A., Thioulouse, J., Perriere, G., Higgins, D.G. (2005) MADE4: an R package for multivariate analysis of gene expression data. Bioinformatics. 21(11):2789-2790.

Very large datasets
• e.g. 381,602 tRNA from RF00005
• 40 mins embedding
• Plus 6 mins to cluster with k-means
  – k = 300
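The embed-then-cluster pipeline can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the mBED implementation: the shared 2-mer distance is a stand-in for whatever sequence distance is actually used, references are picked by plain random sampling with k = ceil(log2 N), the duplicate and outlier heuristics are omitted, and all function names are invented for this example. The input sequences are the ungapped first-block globin fragments from the alignment slides.

```python
import math
import random

# Ungapped fragments taken from the first block of the globin alignment.
SEQS = {
    "Human beta":      "VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST",
    "Horse beta":      "VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSN",
    "Human alpha":     "VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLS",
    "Horse alpha":     "VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLS",
    "Whale myoglobin": "VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT",
    "Lamprey globin":  "PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT",
    "Lupin globin":    "GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE",
}


def kmer_distance(a, b, k=2):
    """Stand-in distance: 1 - shared k-mers / size of the smaller k-mer set.

    Assumes both sequences have at least k residues.
    """
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    return 1.0 - len(ka & kb) / min(len(ka), len(kb))


def embed(seqs, refs=None, seed=0):
    """Replace each sequence by its distances to k reference sequences.

    Costs N * k distance evaluations instead of N * (N - 1) / 2.
    """
    n = len(seqs)
    if refs is None:
        k = min(n, max(1, math.ceil(math.log2(n))))  # assumed constant: k = ceil(log2 N)
        refs = random.Random(seed).sample(range(n), k)
    return [[kmer_distance(s, seqs[r]) for r in refs] for s in seqs]


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's k-means on the embedded vectors; returns one label per vector."""
    rng = random.Random(seed)
    centres = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centre by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(v, centres[c]))
                  for v in vectors]
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centre if a cluster empties
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    names, seqs = list(SEQS), list(SEQS.values())
    vectors = embed(seqs, refs=[0, 2])  # Human beta and Human alpha as references
    for name, vec, label in zip(names, vectors, kmeans(vectors, k=2)):
        print(f"{name:<16} {[round(x, 2) for x in vec]} cluster {label}")
```

With Human beta and Human alpha as the two references, each beta chain sits close to the first axis and each alpha chain close to the second, so vector distances already separate the two families without any further sequence comparisons.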