Des Higgins

Visualisation of
Multiple Sequence Alignments
VIZBI 2011
Des Higgins
Conway Institute
University College Dublin
Ireland
Multiple Alignment?
• Align 3 or more sequences together
– Homologous residues lined up in columns
Whale myoglobin  ----VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin   GSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTP---EFFPKFKGLTT
Lupin globin     ---GALTESQAALVKSSWEEF--NIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE
• Needed because of
– Orthologues from different species
But mainly:
– Paralogues from Gene duplications
• Multi-gene families
– e.g. humans have approx. 500 protein kinases
Human Protein Kinases
The human kinome comprises 40 atypical PKs and 478 classical PKs. The latter consist of 388 serine/threonine kinases, 90 tyrosine kinases and 50 sequences which lack a functional catalytic site.
(Manning et al., Science, 2002)
Globin Multiple Alignment

Human beta       --------VHLTPEEKSAVTALWGKVN--VDEVGGEALGRLLVVYPWTQRFFESFGDLST
Horse beta       --------VQLSGEEKAAVLALWDKVN--EEEVGGEALGRLLVVYPWTQRFFDSFGDLSN
Human alpha      ---------VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-
Horse alpha      ---------VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHF-DLS-
Whale myoglobin  ---------VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKT
Lamprey globin   PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
Lupin globin     --------GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSE

Human beta       PDAVMGNPKVKAHGKKVLGAFSDGLAHLDN-----LKGTFATLSELHCDKLHVDPENFRL
Horse beta       PGAVMGNPKVKAHGKKVLHSFGEGVHHLDN-----LKGTFAALSELHCDKLHVDPENFRL
Human alpha      ----HGSAQVKGHGKKVADALTNAVAHVDD-----MPNALSALSDLHAHKLRVDPVNFKL
Horse alpha      ----HGSAQVKAHGKKVGDALTLAVGHLDD-----LPGALSNLSDLHAHKLRVDPVNFKL
Whale myoglobin  EAEMKASEDLKKHGVTVLTALGAILKKKGH-----HEAELKPLAQSHATKHKIPIKYLEF
Lamprey globin   ADQLKKSADVRWHAERIINAVNDAVASMDDT--EKMSMKLRDLSGKHAKSFQVDPQYFKV
Lupin globin     VP--QNNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVAD-AHFPV

Human beta       LGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH------
Horse beta       LGNVLVVVLARHFGKDFTPELQASYQKVVAGVANALAHKYH------
Human alpha      LSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR------
Horse alpha      LSHCLLSTLAVHLPNDFTPAVHASLDKFLSSVSTVLTSKYR------
Whale myoglobin  ISEAIIHVLHSRHPGDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Lamprey globin   LAAVIADTVAAG---D------AGFEKLMSMICILLRSAY-------
Lupin globin     VKEAILKTIKEVVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA---
1. Visualise the residues/gaps?
Globin Multiple Alignment
[Figure: the same seven-globin alignment, with the alpha helices highlighted]
Globin Multiple Alignment
[Figure: the same seven-globin alignment, with the haem binding histidines highlighted]
Globin Multiple Alignment
[Figure: the same seven-globin alignment beside a tree that groups the sequences; leaves: Horse beta, Human beta, Horse alpha, Human alpha, Whale myoglobin, Lamprey cyanohaemoglobin, Lupin leghaemoglobin]
2. Visualise the sequence groupings?
So: What is the Problem?
• What if N >> 100,000?
• e.g. SSU rRNA
– www.arb-silva.de
– 1,471,257 seqs
• e.g. ABC transporters
– PFAM
– ABC_tran PF00005
– 127,458 seqs
• Metagenomics
• Sequence 10,000 vertebrate genomes!
=> 5,000,000 protein kinases, GPCRs
SequenceJuxtaposer: Fluid Navigation For Large-Scale Sequence Comparison In Context.
James Slack, Kristian Hildebrand, Tamara Munzner, Katherine St. John.
Proc. German Conference on Bioinformatics 2004, pp. 37-42
Poster D03, VIZBI 2011
Sequence Surveyor: scalable multiple sequence alignment overview visualisation.
Danielle Albers, Colin Dewey, Michael Gleicher
Poster D09, VIZBI 2011
JProfileGrid: visualising very large multiple sequence alignments.
Alberto Roca, Aaron Abajian, David Vigerust
This talk
• How to make huge multiple alignments
• How to cluster > 100,000 sequences
• MDS/PCA on big datasets
Multiple Sequence Alignment
• NP complete
• Mainly use: “Progressive Alignment”
– Greedy heuristic
– Use a tree/clustering of the seqs
• Barton and Sternberg (1988)
Feng and Doolittle (1987)
Higgins and Sharp (1988)
Hogeweg and Hesper (1984)
Willie Taylor (1987)
[Figure: the seven-globin alignment shown beside its guide tree; leaves in tree order: Horse beta, Human beta, Horse alpha, Human alpha, Whale myoglobin, Lamprey cyanohaemoglobin, Lupin leghaemoglobin]
“Guide Tree”
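A guide tree of this kind can be sketched with UPGMA (average linkage) on a pairwise distance matrix. The block below is a minimal illustration under stated assumptions, not Clustal's implementation: the function name `upgma`, the nested-tuple tree format and the four distances are all made up for the example.

```python
from itertools import combinations

def upgma(names, dist):
    """Build a guide tree by UPGMA (average linkage).

    names: leaf labels; dist: {frozenset({a, b}): distance} for every pair.
    Returns the tree as nested tuples of leaf labels.
    """
    clusters = {name: (name, 1) for name in names}   # key -> (subtree, size)
    d = dict(dist)
    while len(clusters) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        merged = f"({a},{b})"
        # Distance to the new cluster is the size-weighted mean of the old ones.
        for c in clusters:
            d[frozenset((merged, c))] = (
                n_a * d[frozenset((a, c))] + n_b * d[frozenset((b, c))]
            ) / (n_a + n_b)
        clusters[merged] = ((tree_a, tree_b), n_a + n_b)
    tree, _size = next(iter(clusters.values()))
    return tree

# Made-up distances for four of the globins, for illustration only.
names = ["Human beta", "Horse beta", "Human alpha", "Whale myoglobin"]
toy = {
    frozenset(("Human beta", "Horse beta")): 0.17,
    frozenset(("Human beta", "Human alpha")): 0.59,
    frozenset(("Horse beta", "Human alpha")): 0.59,
    frozenset(("Human beta", "Whale myoglobin")): 0.77,
    frozenset(("Horse beta", "Whale myoglobin")): 0.77,
    frozenset(("Human alpha", "Whale myoglobin")): 0.75,
}
tree = upgma(names, toy)
print(tree)
```

Progressive alignment would then align sequences in the order the tree joins them, closest pairs first.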
Clustal
• 66,000 citations
• Clustal1-Clustal4
– 1988, Paul Sharp, Dublin
• Clustal V 1992
– EMBL Heidelberg,
– Rainer Fuchs
– Alan Bleasby
• Clustal W, Clustal X 1994-2005
– Toby Gibson, EMBL, Heidelberg
– Julie Thompson, ICGEB, Strasbourg
• Clustal W and Clustal X 2.0 2007
– University College Dublin
www.clustal.org
Complexity
• Guide tree construction
O(N^2)
• Later Progressive Alignment
O(N)
• Guide tree construction is limiting
>10,000 seq alignment is tough
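A quick back-of-the-envelope check shows why the quadratic step dominates. The helper name `pairwise_distance_count` is hypothetical; it only counts the sequence pairs that a full distance matrix would need.

```python
def pairwise_distance_count(n: int) -> int:
    """Number of sequence pairs in a full N x N distance matrix (upper triangle)."""
    return n * (n - 1) // 2

for n in (1_000, 10_000, 100_000):
    print(f"N = {n:>7,}: {pairwise_distance_count(n):>13,} pairwise distances")
```

Each of those distances is itself a sequence comparison, so the cost grows much faster than the number of sequences.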
PartTree
• MAFFT Package
• Select n sequences where n << N
• UPGMA on n sequences
• Cluster the remainder (N-n) with their closest clusters
Katoh, K., Toh, H., 2007. PartTree: an algorithm to build an
approximate tree from a large number of unaligned sequences.
Bioinformatics 23, 372–374.
Embedding?
• Replace each sequence by a Vector
– Vector-Vector distances
– MUCH faster than Seq.-Seq. distances
• Vectors very fast/simple to cluster
– e.g. cluster 10,000 vectors of length 150 with UPGMA: <<1 min on 1 processor
– e.g. cluster 300,000 vectors of length 300 with k-means, k = 300: 6 mins
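Once sequences are vectors, clustering needs nothing more than Euclidean arithmetic. The following is a plain Lloyd's k-means sketch in pure Python, assuming the first k vectors are used as starting centroids (a simplification chosen for determinism, not something the slides specify).

```python
import math

def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means; the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

toy_vectors = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]]
labels, centroids = kmeans(toy_vectors, k=2)
print(labels)
```

Each iteration costs on the order of N times k distance evaluations, which is why hundreds of thousands of vectors remain tractable.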
Embedding papers
• FastMap
• Faloutsos, C., Lin, K. (1995) FastMap: A Fast Algorithm for Indexing,
Data-Mining and Visualisation of Traditional and Multimedia Datasets,
Proc. 1995 ACM SIGMOD International Conference on Management of Data,
pp. 163–174.
• Sparsemap
• G. Hristescu and M. Farach-Colton. Cluster-preserving embedding of
proteins. Technical Report 99-50, Computer Science Department,
Rutgers University, 1999.
mBED
• Select k seqs “randomly”
– k << N
– k ∝ log N
• Use distance to each of these k “references”
– k long vector for each sequence
• Use heuristics
– avoid duplicates
– find outliers
• Very fast and simple
– Complexity O(kN)
i.e. O(N log N)
• Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG. (2010)
Sequence embedding for fast construction of guide trees for multiple
sequence alignment. Algorithms Mol Biol. 5:21.
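The embedding idea above can be sketched in a few lines. This is an illustration under assumptions, not the published mBed code: the k-mer Jaccard distance is a stand-in for whatever sequence distance is actually used, k is simply taken as the ceiling of log2(N) times a hypothetical `scale` factor, and the only heuristic shown is avoiding duplicate references.

```python
import math
import random

def kmer_distance(a, b, k=2):
    """Stand-in distance (an assumption): 1 - Jaccard similarity of k-mer sets."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = ka | kb
    return 1.0 - len(ka & kb) / len(union) if union else 0.0

def embed(seqs, scale=1.0, seed=0):
    """Replace each sequence by its distances to k randomly chosen references."""
    unique = sorted(set(seqs))                    # avoid duplicate references
    k = min(len(unique), max(1, math.ceil(scale * math.log2(len(seqs)))))
    refs = random.Random(seed).sample(unique, k)  # k << N for large N
    vectors = [[kmer_distance(s, r) for r in refs] for s in seqs]
    return refs, vectors

# N-terminal fragments of the seven globins, gaps removed.
seqs = [
    "VHLTPEEKSAVTALWGKVN",
    "VQLSGEEKAAVLALWDKVN",
    "VLSPADKTNVKAAWGKVGAHAGEYG",
    "VLSAADKTNVKAAWSKVGGHAGEYG",
    "VLSEGEWQLVLHVWAKVEADVAGHG",
    "PIVDTGSVAPLSAAEKTKIRSAWAP",
    "GALTESQAALVKSSWEEFNANIPKH",
]
refs, vectors = embed(seqs)
print(len(refs), "references;", len(vectors), "vectors")
```

Only N times k sequence comparisons are made, in place of the N(N-1)/2 needed for a full distance matrix.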
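mBED
[Diagram: N sequences, each described by its distances to k seeds, giving an N x k matrix in place of the N x N distance matrix]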
MDS visualisation?
• Do PCA on Embedded sequences
• 3994 H3N2 HA sequences
– 1967 (blue) to 2008 (orange)
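PCA on embedded vectors amounts to centring the N x k matrix and taking its leading singular vectors. The sketch below assumes NumPy is available; the function name `pca` and the toy matrix are illustrative only.

```python
import numpy as np

def pca(vectors, n_components=2):
    """Project row vectors onto their first principal components via SVD."""
    x = np.asarray(vectors, dtype=float)
    x = x - x.mean(axis=0)                        # centre each coordinate
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_components] * s[:n_components]  # scores, one row per sequence

toy = [[0, 0, 1], [1, 0, 1], [2, 1, 1], [3, 1, 1]]
coords = pca(toy)
print(coords.shape)
```

The two returned columns can be plotted directly as x and y, with one point per sequence coloured by, for example, isolation year.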
Guide Tree Quality
• 1000 random guide trees
• 1000 sparsemap trees
• Clustal tree
• mBED
Clustal Ω
• Release first version by April 2011
• Scalable
– mBed
– Gordon Blackshields
• Accurate
– HMM-HMM alignment
– HHalign
– Johannes Söding, Munich.
• Re-use old alignments
– Kevin Karplus
– UCSC
• Align 120,000 ABC transporters
– 6 hours on 1 core
• More accurate than
– MUSCLE or MAFFT
• Coming soon...
Fabian Sievers
Andreas Wilm
David Dineen
MDS/PCA etc.
• Dimension reduction
• Treat alignment columns as variables
– PCA
• Principal Components Analysis
– CA
• Correspondence Analysis, Jean Paul Benzécri
• Use NxN distance matrix
– MDS
– PCOORD
Use CA, PCA for Sequences?
• every alignment column:
– 20 binary variables
– Or several physicochemical properties
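The 20-binary-variables encoding can be written out directly. This is a minimal sketch: the helper names are hypothetical, gaps are encoded as all zeros (an assumption), and the `X<column><residue>` label style with 1-based columns is assumed for readability.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot_alignment(rows):
    """Encode aligned sequences as 20 binary variables per column.

    Gaps (and any non-standard symbol) become 20 zeros.
    """
    length = len(rows[0])
    if any(len(r) != length for r in rows):
        raise ValueError("all rows of an alignment must have the same length")
    return [
        [1 if residue == aa else 0 for residue in row for aa in AMINO_ACIDS]
        for row in rows
    ]

def variable_name(index):
    """Label a variable by 1-based column and residue, e.g. 'X1V'."""
    column, aa = divmod(index, len(AMINO_ACIDS))
    return f"X{column + 1}{AMINO_ACIDS[aa]}"

rows = ["VHLT", "VQLS", "-VLS"]
matrix = one_hot_alignment(rows)
print(len(matrix), "sequences x", len(matrix[0]), "variables")
```

The resulting 0/1 matrix, with sequences as rows, is the table that PCA or Correspondence Analysis would then be run on.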
Trypsin-like serine proteases
[Figure: plot (scale d = 0.05) of trypsin-like serine proteases in three groups: 15 Chymotrypsins, 10 Elastases, 31 Trypsins. Individual points carry labels of the form EC_1_n, EC_4_n and EC_36_n.]
• Correspondence Analysis
• Supervise:
– Between Groups Analysis
– Dolédec and Chessel (1987) (similar to PLS discriminant analysis)
[Figure: Between Groups Analysis plot (scale d = 0.1) separating Chymotrypsin, Trypsin and Elastase, with residue variables plotted around the three groups: X54V, X265S, X232M, X154T, X95N, X3N, X93F, X137C, X243Q, X82E, X87L, X7A, X180Q, X155T, X14W, X165N, X229S, X183L, X181A, X98W, X66T, X98Y, X155S, X93I, X275G, X154V, X228K, X162S, X70R, X229D, X204S, X132Y, X273K, X16S, X18I, X10N, X92I, X196Y, X82G, X232Q]
Wallace IM, Higgins DG. (2007)
Supervised multivariate analysis of sequence groups to identify specificity
determining residues. BMC Bioinformatics. 8:135.
MDS
• Multidimensional Scaling
• Fit distances to an NxN distance matrix
• Use Euclidean distances?
– “Classical scaling”
= Principal Co-Ordinates Analysis
• PCOORD, John Gower
– Gower, J. C. (1966). Some distance properties of latent root and
vector methods used in multivariate analysis. Biometrika 53,
325-328.
– Higgins, D.G. (1992) Sequence ordinations: a multivariate analysis
approach to analysing large sequence data sets. CABIOS, 8, 15-22.
– Complexity at least O(N^2)
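Classical scaling (Principal Co-Ordinates Analysis) can be sketched as double-centring the squared distances and taking the leading eigenvectors. The block assumes NumPy; the function name `classical_mds` and the four test points are illustrative.

```python
import numpy as np

def classical_mds(dist, n_components=2):
    """Classical scaling of an N x N distance matrix into n_components dimensions."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    b = -0.5 * j @ (d ** 2) @ j                  # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(b)               # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_components]
    vals = np.clip(vals[order], 0.0, None)       # non-Euclidean input can go negative
    return vecs[:, order] * np.sqrt(vals)

# Four points on a plane; their distances should be reproduced by the result.
pts = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
coords = classical_mds(dist)
print(coords.shape)
```

Both the N x N matrix and its eigendecomposition scale poorly with N, which is the motivation for the large-scale variants.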
Large scale MDS?
• SC-MDS
– Jengnan Tzeng, Henry Horng-Shing Lu, and Wen-Hsiung Li (2008)
Multidimensional scaling for large genomic data sets. BMC
Bioinformatics 9:179.
• mBED
– Easily do MDS on >100,000 seqs
– Blackshields et al. (2010)
• PCOORD or MDS on a subset of the sequences
– add the rest later
• Landmark MDS + Nyström approximation
• V. de Silva, J.B. Tenenbaum, “Sparse multidimensional scaling using
landmark points.” (2004) Technical report, Stanford University.
• 307,434 lentivirus (HIV etc) sequences from UniProt.
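The "subset first, add the rest later" idea can be sketched as Landmark MDS: classical scaling on k landmark sequences, then placing every sequence from its distances to those landmarks alone. This is a minimal sketch assuming NumPy and assuming the leading eigenvalues are positive; the function name and test points are illustrative, not taken from the cited report.

```python
import numpy as np

def landmark_mds(d_landmarks, d_to_landmarks, n_components=2):
    """Landmark MDS.

    d_landmarks: (k, k) distances among the landmarks.
    d_to_landmarks: (N, k) distances from every point to the landmarks.
    """
    sq = np.asarray(d_landmarks, dtype=float) ** 2
    k = sq.shape[0]
    j = np.eye(k) - np.ones((k, k)) / k
    vals, vecs = np.linalg.eigh(-0.5 * j @ sq @ j)   # classical MDS on landmarks
    order = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[order], vecs[:, order]         # assumed positive here
    pinv = vecs / np.sqrt(vals)                      # columns v_i / sqrt(lambda_i)
    mean_sq = sq.mean(axis=0)                        # mean squared distance per landmark
    delta = np.asarray(d_to_landmarks, dtype=float) ** 2
    return -0.5 * (delta - mean_sq) @ pinv           # distance-based triangulation

pts = np.array([[0, 0], [3, 0], [0, 4], [3, 4], [1, 1], [2, 5]], dtype=float)
full = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
landmarks = [0, 1, 2]
coords = landmark_mds(full[np.ix_(landmarks, landmarks)], full[:, landmarks])
print(coords.shape)
```

Only an N x k block of distances is ever needed, so the cost grows linearly in N once the landmarks are fixed.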
H3N2 flu sequences
• Weifeng Shi
• 8167 HA sequences
– human H3N2 influenza viruses
• DNAdist in Phylip
– K2P (Kimura two parameter) model
• Python: Matplotlib
[Figure: plot of the HA sequences coloured by period: 1960s, 1970s, 1980s, 1990s, then each year from 2000 to 2010]
BGA, CIA: Aedin Culhane, Ian Jeffery, Stephen Madden, Iain Wallace, Guy Perriere (Lyons)
mBED: Gordon Blackshields, Mark Larkin
Clustal Omega: Fabian Sievers, Andreas Wilm, David Dineen, Johannes Soeding (Munich), Rodrigo Lopez (EBI)
Flu MDS: Weifeng Shi
Supervised PCA or CA?
Malate Dehydrogenases
Lactate Dehydrogenases
ADE-4
http://pbil.univ-lyon1.fr/ADE-4/
Thioulouse J., Chessel D., Dolédec S., & Olivier J.M.
(1997) ADE-4: a multivariate analysis and graphical
display software. Statistics and Computing, 7, 1, 75-83.
Between Group Analysis BGA
Dolédec, S. & Chessel, D. (1987)
Acta Oecologica, Oecologica Generalis, 8, 3, 403-426.
Supervised Correspondence Analysis or PCA
CO-Inertia Analysis CIA
Dolédec, S. & Chessel, D. (1994) Freshwater Biology, 31, 277-294.
Thioulouse, J. & Lobry, J.R. (1995) CABIOS, 11, 321-329
2 datasets; Simultaneous CA or PCA
• MADE4
– Culhane, A., Thioulouse, J., Perriere, G., Higgins, D.G. (2005)
MADE4: an R package for multivariate analysis of gene expression
data. Bioinformatics. 21(11):2789-2790.
Very large datasets
• e.g. 381,602 tRNA from RF00005
• 40 mins embedding, plus 6 mins to cluster with k-means
– k = 300