Multiple alignments,
PATTERNS, PSI-BLAST
Swiss Institute of Bioinformatics
Institut Suisse de Bioinformatique
LF-2001.11
Overview
Multiple alignments
 - How-to, goal, problems, use
Patterns
 - PROSITE database, syntax, use
PSI-BLAST
 - BLAST, matrices, use
[ Profiles/HMMs ] …
What is a multiple sequence alignment?



What can it do for me?
How can I produce one of these?
How can I use it?
chite       ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat       --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr       KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse       -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

chite       AATAKQNYIRALQEYERNGG
wheat       ANKLKGEYNKAIAAYNKGESA
trybr       AEKDKERYKREM--------
mouse       AKDDRIRYDNEMKSWEEQMAE
How can I use a multiple alignment?
chite       ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat       --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr       KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
unknown     -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

chite       AATAKQNYIRALQEYERNGG
wheat       ANKLKGEYNKAIAAYNKGESA
trybr       AEKDKERYKREM--------
unknown     AKDDRIRYDNEMKSWEEQMAE
Extrapolation
Homology?
SwissProt
Unknown Sequence
Extrapolation
Prosite Patterns
SwissProt
Match?
Unknown Sequence
Extrapolation
Prosite Patterns
Prosite Profiles
L?
K>R
A
F
D
E
F
G
H
Q
I
V
L
W
-More Sensitive
-More Specific
Phylogeny
chite
wheat
trybr
mouse
-Evolution
-Paralogy/Orthology
Phylogeny
Struc. Prediction
PHD for secondary structure prediction: 75% accurate.
Threading is improving but is not yet as good.
Caution!
Automatic Multiple
Sequence Alignment methods
are not always perfect…
The problem

Why is it difficult to compute a multiple sequence alignment?
Biology: what is a good alignment?
Computation: what is the good alignment?
CIRCULAR PROBLEM: Good Sequences <-> Good Alignment
What do I need to know to make a good multiple alignment?





How do sequences evolve?
How does the computer align the sequences?
How can I choose my sequences?
What is the best program?
How can I use my alignment?
An alignment is a story
ADKPKRPLSAYMLWLN
Deletion
Insertion
ADKPRRP---LS-YMLWLN
ADKPKRPKPRLSAYMLWLN
Mutation
ADKPKRPLSAYMLWLN
Mutations
+
Selection
ADKPRRPLS-YMLWLN
ADKPKRPLSAYMLWLN
ADKPKRPKPRLSAYMLWLN
Homology

Same sequences -> same origin? -> same function? ->
same 3D fold?
[Plot of % sequence identity against alignment length, with regions labelled "Same 3D Fold" and "Twilight Zone"; axis marks at 30% identity and length 100]
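The identity axis of this plot can be computed directly from a pairwise alignment. The sketch below is one possible convention, not a standard: it counts identical residues over the columns where neither sequence has a gap (other definitions divide by the full alignment length or by the shorter sequence). The function name is illustrative; the two sequences are taken from the "An alignment is a story" slide.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over columns where both aligned sequences have a residue.

    Assumption: gapped columns are ignored in the denominator.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns without a gap in either sequence.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


pid = percent_identity("ADKPKRPLSAYMLWLN", "ADKPRRPLS-YMLWLN")
print(f"{pid:.1f}% identity")
```

With this convention the pair above has 14 identical residues over 15 ungapped columns, far above the 30% region where homology becomes hard to call.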
Convergent evolution
AFGP with (ThrAlaAla)n
Similar to Trypsinogen
N
S
Chen et al, 97, PNAS, 94, 3811-16
AFGP with (ThrAlaAla)n
NOT
Similar to Trypsinogen
Residues and mutations

All residues are equal, but some more than others…
[Venn diagram of the 20 amino acids grouped by overlapping properties: aliphatic, aromatic, small, hydrophobic, polar]
Accurate matrices are data driven rather than knowledge driven
Substitution matrices
Different Flavors:
• PAM: 250, 350
• BLOSUM: 45, 62
• …
What is the best substitution matrix?

Mutation rates depend on families

Family            S     N
Histone3          6.4   0
Insulin           4.0   0.1
Interleukin I     4.6   1.4
a-Globin          5.1   0.6
Apolipoprot. AI   4.5   1.6
Interferon G      8.6   2.8

Rates in substitutions/site/billion years, as measured on mouse vs human (0.08 billion years)

Choosing the right matrix may be tricky


Gonnet250 > BLOSUM62 > PAM250
Depends on the family, the program used and its tuning
Insertions and deletions?
Affine Gap Penalty
Cost=GOP+GEP*L
[Plots of indel cost against gap length L]
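A minimal sketch of the affine penalty above, Cost = GOP + GEP*L. The GOP and GEP values are hypothetical and chosen only for illustration; real programs use their own defaults and sometimes count the first gap residue differently.

```python
def affine_gap_cost(length: int, gop: float, gep: float) -> float:
    """Cost = GOP + GEP * L for a gap of L residues; no gap costs nothing."""
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0
    return gop + gep * length


GOP, GEP = 10.0, 0.5  # hypothetical penalties, for illustration only

one_long_gap = affine_gap_cost(4, GOP, GEP)         # a single indel of 4 residues
four_short_gaps = 4 * affine_gap_cost(1, GOP, GEP)  # four separate 1-residue indels
print(one_long_gap, four_short_gaps)
```

With these values one gap of length 4 costs 12.0 while four isolated gaps cost 42.0, which is why an affine scheme favours a few long indels over many scattered ones.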
How to align many sequences?

Exact algorithms are computationally expensive
Needleman & Wunsch
Smith & Waterman
-> heuristic required!
[Figure: computing time of an exact alignment as a function of the number of globin sequences (2 to 8), on a scale running from seconds through minutes, hours and weeks to years]
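To make the cost of exact methods concrete, here is a small sketch of Needleman & Wunsch global alignment for two sequences, returning the score only. It is a simplification under stated assumptions: match/mismatch scores stand in for a substitution matrix, the gap penalty is linear rather than affine, and all parameter values are illustrative. The helper `dp_cells` (a hypothetical name) shows how the dynamic programming table grows when the same idea is extended to N sequences.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score of a and b (linear gap penalty)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # a[:i] aligned against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


def dp_cells(length: int, n_sequences: int) -> int:
    """Number of cells in the full table for n sequences of equal length."""
    return (length + 1) ** n_sequences


print(needleman_wunsch_score("AC", "AGC"))
for n in (2, 3, 4):
    print(n, "sequences of length 150:", dp_cells(150, n), "cells")
```

The table size is multiplied by roughly the sequence length for every sequence added, which is the reason heuristics are required beyond a handful of sequences.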
Existing methods
1-Carrillo and Lipman:
-MSA, DCA.
-Few Small Closely Related Sequences.
-Do Well When They Can Run.
2-Segment Based:
-DIALIGN, MACAW.
-May Align Too Few Residues
3-Iterative:
-HMMs, HMMER, SAM.
-Slow, Sometimes Inaccurate
-Good Profile Generators
4-Progressive:
-ClustalW, Pileup, Multalign…
-Fast and Sensitive
Progressive alignment
Feng and Doolittle, 1980; Taylor 1981
Dynamic Programming Using A Substitution Matrix
-Depends on the CHOICE of the sequences.
-Depends on the ORDER of the sequences (Tree).
-Depends on the PARAMETERS:
•Substitution Matrix.
•Penalties (Gop, Gep).
•Sequence Weight.
•Tree making Algorithm.
Selecting sequences from a BLAST output
A common mistake



Sequences too closely related
PRVA_MACFU  SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_HUMAN  SMTDLLNAEDIKKAVGAFSATDSFDHKKFFQMVGLKKKSADDVKKVFHMLDKDKSGFIEE
PRVA_GERSP  SMTDLLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKTPDDVKKVFHILDKDKSGFIEE
PRVA_MOUSE  SMTDVLSAEDIKKAIGAFAAADSFDHKKFFQMVGLKKKNPDEVKKVFHILDKDKSGFIEE
PRVA_RAT    SMTDLLSAEDIKKAIGAFTAADSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIEE
PRVA_RABIT  AMTELLNAEDIKKAIGAFAAAESFDHKKFFQMVGLKKKSTEDVKKVFHILDKDKSGFIEE
            :**::*.*******:***:* :****************..::******:***********

PRVA_MACFU  DELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_HUMAN  DELGFILKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_GERSP  DELGFILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES
PRVA_MOUSE  DELGSILKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES
PRVA_RAT    DELGSILKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES
PRVA_RABIT  EELGFILKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES
            :*** ******.******.**** *:************.:******:**
Identical sequences bring no information
Multiple sequence alignments thrive on diversity
Respect information!
PRVA_MACFU  ------------------------------------------SMTDLLN----AEDIKKA
PRVA_HUMAN  ------------------------------------------SMTDLLN----AEDIKKA
PRVA_GERSP  ------------------------------------------SMTDLLS----AEDIKKA
PRVA_MOUSE  ------------------------------------------SMTDVLS----AEDIKKA
PRVA_RAT    ------------------------------------------SMTDLLS----AEDIKKA
PRVA_RABIT  ------------------------------------------AMTELLN----AEDIKKA
TPCC_MOUSE  MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM

PRVA_MACFU  VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_HUMAN  VGAFSATDS--FDHKKFFQMVG------LKKKSADDVKKVFHMLDKDKSGFIEEDELGFI
PRVA_GERSP  IGAFAAADS--FDHKKFFQMVG------LKKKTPDDVKKVFHILDKDKSGFIEEDELGFI
PRVA_MOUSE  IGAFAAADS--FDHKKFFQMVG------LKKKNPDEVKKVFHILDKDKSGFIEEDELGSI
PRVA_RAT    IGAFTAADS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGSI
PRVA_RABIT  IGAFAAAES--FDHKKFFQMVG------LKKKSTEDVKKVFHILDKDKSGFIEEEELGFI
TPCC_MOUSE  IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM

PRVA_MACFU  LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_HUMAN  LKGFSPDARDLSAKETKMLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_GERSP  LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVSES
PRVA_MOUSE  LKGFSSDARDLSAKETKTLLAAGDKDGDGKIGVEEFSTLVAES
PRVA_RAT    LKGFSSDARDLSAKETKTLMAAGDKDGDGKIGVEEFSTLVAES
PRVA_RABIT  LKGFSPDARDLSVKETKTLMAAGDKDGDGKIGADEFSTLVSES
TPCC_MOUSE  LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE
-This alignment is not informative about the relation between TPCC_MOUSE and the rest of the sequences.
-A better spread of the sequences is needed
Selecting diverse sequences
PRVB_CYPCA  -AFAGVLNDADIAAALEACKAADSFNHKAFFAKVGLTSKSADDVKKAFAIIDQDKSGFIE
PRVB_BOACO  -AFAGILSDADIAAGLQSCQAADSFSCKTFFAKSGLHSKSKDQLTKVFGVIDRDKSGYIE
PRV1_SALSA  MACAHLCKEADIKTALEACKAADTFSFKTFFHTIGFASKSADDVKKAFKVIDQDASGFIE
PRVB_LATCH  -AVAKLLAAADVTAALEGCKADDSFNHKVFFQKTGLAKKSNEELEAIFKILDQDKSGFIE
PRVB_RANES  -SITDIVSEKDIDAALESVKAAGSFNYKIFFQKVGLAGKSAADAKKVFEILDRDKSGFIE
PRVA_MACFU  -SMTDLLNAEDIKKAVGAFSAIDSFDHKKFFQMVGLKKKSADDVKKVFHILDKDKSGFIE
PRVA_ESOLU  --AKDLLKADDIKKALDAVKAEGSFNHKKFFALVGLKAMSANDVKKVFKAIDADASGFIE

PRVB_CYPCA  EDELKLFLQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA
PRVB_BOACO  EDELKKFLQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG
PRV1_SALSA  VEELKLFLQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ
PRVB_LATCH  DEELELFLQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA
PRVB_RANES  QDELGLFLQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA
PRVA_MACFU  EDELGFILKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_ESOLU  EEELKFVLKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA
-A REASONABLE model now exists.
-Going further: remote homologues.
Aligning remote homologues
PRVA_MACFU  ------------------------------------------SMTDLLNA----EDIKKA
PRVA_ESOLU  -------------------------------------------AKDLLKA----DDIKKA
PRVB_CYPCA  ------------------------------------------AFAGVLND----ADIAAA
PRVB_BOACO  ------------------------------------------AFAGILSD----ADIAAG
PRV1_SALSA  -----------------------------------------MACAHLCKE----ADIKTA
PRVB_LATCH  ------------------------------------------AVAKLLAA----ADVTAA
PRVB_RANES  ------------------------------------------SITDIVSE----KDIDAA
TPCS_RABIT  -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCS_PIG    -TDQQAEARSYLSEEMIAEFKAAFDMFDADGG-GDISVKELGTVMRMLGQTPTKEELDAI
TPCC_MOUSE  MDDIYKAAVEQLTEEQKNEFKAAFDIFVLGAEDGCISTKELGKVMRMLGQNPTPEELQEM

PRVA_MACFU  VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVA_ESOLU  LDAVKAEGS--FNHKKFFALVG------LKAMSANDVKKVFKAIDADASGFIEEEELKFV
PRVB_CYPCA  LEACKAADS--FNHKAFFAKVG------LTSKSADDVKKAFAIIDQDKSGFIEEDELKLF
PRVB_BOACO  LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF
PRV1_SALSA  LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF
PRVB_LATCH  LEGCKADDS--FNHKVFFQKTG------LAKKSNEELEAIFKILDQDKSGFIEDEELELF
PRVB_RANES  LESVKAAGS--FNYKIFFQKVG------LAGKSAADAKKVFEILDRDKSGFIEQDELGLF
TPCS_RABIT  IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI
TPCS_PIG    IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI
TPCC_MOUSE  IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM

PRVA_MACFU  LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES
PRVA_ESOLU  LKSFAADGRDLTDAETKAFLKAADKDGDGKIGIDEFETLVHEA
PRVB_CYPCA  LQNFKADARALTDGETKTFLKAGDSDGDGKIGVDEFTALVKA-
PRVB_BOACO  LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG
PRV1_SALSA  LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ-
PRVB_LATCH  LQNFSAGARTLTKTETETFLKAGDSDGDGKIGVDEFQKLVKA-
PRVB_RANES  LQNFRASARVLSDAETSAFLKAGDSDGDGKIGVEEFQALVKA-
TPCS_RABIT  FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCS_PIG    FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCC_MOUSE  LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE
Going further…
PRVA_MACFU  VGAFSAIDS--FDHKKFFQMVG------LKKKSADDVKKVFHILDKDKSGFIEEDELGFI
PRVB_BOACO  LQSCQAADS--FSCKTFFAKSG------LHSKSKDQLTKVFGVIDRDKSGYIEEDELKKF
PRV1_SALSA  LEACKAADT--FSFKTFFHTIG------FASKSADDVKKAFKVIDQDASGFIEVEELKLF
TPCS_RABIT  IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNADGYIDAEELAEI
TPCS_PIG    IEEVDEDGSGTIDFEEFLVMMVRQMKEDAKGKSEEELAECFRIFDRNMDGYIDAEELAEI
TPCC_MOUSE  IDEVDEDGSGTVDFDEFLVMMVRCMKDDSKGKSEEELSDLFRMFDKNADGYIDLDELKMM
TPC_PATYE   SDEMDEEATGRLNCDAWIQLFER---KLKEDLDERELKEAFRVLDKEKKGVIKVDVLRWI

PRVA_MACFU  LKGFSPDARDLSAKETKTLMAAGDKDGDGKIGVDEFSTLVAES-
PRVB_BOACO  LQNFDGKARDLTDKETAEFLKEGDTDGDGKIGVEEFVVLVTKG-
PRV1_SALSA  LQNFCPKARELTDAETKAFLKAGDADGDGMIGIDEFAVLVKQ--
TPCS_RABIT  FR---ASGEHVTDEEIESLMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCS_PIG    FR---ASGEHVTDEEIESIMKDGDKNNDGRIDFDEFLKMMEGVQ
TPCC_MOUSE  LQ---ATGETITEDDIEELMKDGDKNNDGRIDYDEFLEFMKGVE
TPC_PATYE   LS---SLGDELTEEEIENMIAETDTDGSGTVDYEEFKCLMMSSDA
What makes a good alignment…




The more divergent the sequences, the better
The fewer indels, the better
Nice ungapped blocks separated with indels
Different classes of residues within a block:
 - Completely conserved
 - Size and hydropathy conserved
 - Size or hydropathy conserved
The ultimate evaluation is a matter of personal judgment and knowledge
Avoiding pitfalls
Keep a biological perspective
chite       ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat       --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr       KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse       -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

chite       AATAKQNYIRALQEYERNGG
wheat       ANKLKGEYNKAIAAYNKGESA
trybr       AEKDKERYKREM--------
mouse       AKDDRIRYDNEMKSWEEQMAE

chite       AD--K----PKR-PLYMLWLNS-ARESIKRENPDFK-VT-EVAKKGGELWRGL-
wheat       DPNK----PKRAP-FFVFMGE-FREEFKQKNPKNKSVA-AVGKAAGERWKSLS
trybr       -K--KDSNAPKR-AMT-MFFSSDFR-S-KH-S-DLS-IV-EMSKAAGAAWKELG
mouse       ----K----PKR-PRYNIYVSESFQEA-K--D-D-S-AQGKL-KLVNEAWKNLS

chite       KSEWEAKAATAKQNY-I--RALQE-YERNG-G
wheat       KAPYVAKANKLKGEY-N--KAIAA-YNK-GESA
trybr       RKVYEEMAEKDKERY----K--RE-M------
mouse       KQAYIQLAKDDRIRYDNEMKSWEEQMAE----
Do not overtune!!!

chite       ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat       --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr       KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse       -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

chite       AATAKQNYIRALQEYERNGG
wheat       ANKLKGEYNKAIAAYNKGESA
trybr       AEKDKERYKREM--------
mouse       AKDDRIRYDNEMKSWEEQMAE

DIFFERENT PARAMETERS

chite       ---ADKPKRPL-SAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat       --DPNKPKRAP-SAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr       KKDSNAPKRAMTSFMFFSSDFRS-----KHSDLS-IVEMSKAAGAAWKELGP
mouse       -----KPKRPR-SAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

chite       AATAKQNYIRALQEYERNGG
wheat       ANKLKGEYNKAIAAYNKGESA
trybr       AEKDKERYKREM--------
mouse       AKDDRIRYDNEMKSWEEQMAE

DO NOT PLAY WITH PARAMETERS!
IF YOU KNOW THE ALIGNMENT YOU WANT: MAKE IT YOURSELF!
Choosing the right method
[Figure: PROBLEM / PROGRAM / METHOD chart matching types of alignment problem to programs: ClustalW, MSA, DIALIGN II]
Source: BaliBase, Thompson et al, NAR, 1999
Conclusion

The best alignment method:
 - Your brain
 - The right data
The best evaluation method:
 - Your eyes
 - Experimental information (SwissProt)
How can I go further?
 - Patterns
 - Profiles
 - HMMs
 - …
What can I conclude?
 - Homology -> information extrapolation
The PROSITE database
History






Founded by Amos Bairoch
1988 First release in the PC/Gene software
1990 Synchronisation with Swiss-Prot
1994 Integration of « profiles »
1999 PROSITE joins InterPro
November 2001 Current release 16.50
Database content








Official Release
 - ~1400 Patterns (PSxxxxx, PATTERN)
 - ~100 Profiles (PSxxxxx, MATRIX)
 - 4 Rules (PSxxxxx, RULE)
 - ~1100 Documentations (PDOCxxxxx)
Pre-Release
 - ~250 Profiles (PSxxxxx, MATRIX)
 - ~150 Documentations (QDOCxxxxx)
Pattern « philosophy »

Target: definition of sites with biological information


catalytic, metal binding, S-S bridge, cofactor binding,
prosthetic group, PTM
Easy to understand and to design, example

Q-x(3)-N-[SA]-C-G-x(3)-[LIVM](2)-H-[SA]-[LIVM]-[SA]
Pattern syntax

Regular expression (REGEXP) language:
• Each position is separated by a dash « - »
• Amino acids are represented by their single-letter code
• « x » represents any amino acid
• [] group of amino acids acceptable at a position
• {} group of amino acids not acceptable at a position
• () multiple or range, e.g. A(1,3) means 1 to 3 A
• < anchor at beginning of sequence
• > anchor at end of sequence
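Since the pattern language is a restricted regular expression, the rules above can be sketched as a small translator to Python's `re` syntax. This is an illustrative converter, not the one used by ScanProsite or any PROSITE tool: the function name is made up, and it only handles the constructs listed on this slide (anchors placed inside brackets, for example, are not supported).

```python
import re

# element followed by a repeat count "(n)" or range "(n,m)"
_REPEAT = re.compile(r"(.+?)\((\d+)(?:,(\d+))?\)")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")  # database entries end with a period
    parts = []
    for element in pattern.split("-"):     # positions are separated by dashes
        prefix = suffix = ""
        if element.startswith("<"):        # anchor at beginning of sequence
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # anchor at end of sequence
            suffix, element = "$", element[:-1]
        repeat = ""
        match = _REPEAT.fullmatch(element)
        if match:
            element, low, high = match.groups()
            repeat = "{" + low + ("," + high if high else "") + "}"
        if element == "x":                 # any amino acid
            core = "."
        elif element.startswith("[") and element.endswith("]"):
            core = element                 # acceptable residues
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"  # residues not acceptable
        elif len(element) == 1 and element.isupper():
            core = element                 # a single residue
        else:
            raise ValueError(f"unsupported pattern element: {element!r}")
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


example = "Q-x(3)-N-[SA]-C-G-x(3)-[LIVM](2)-H-[SA]-[LIVM]-[SA]"
regex = prosite_to_regex(example)
print(regex)
print(bool(re.search(regex, "QAAANSCGAAALIHSLA")))  # made-up test sequence
```

The example pattern is the one from the Pattern « philosophy » slide; the test sequence is invented to contain a match.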

Profile « philosophy »




Aim: identification of domains and not protein
families
Gene discovery vs automatic annotation
Importance of score and calibration
Possible manual tuning (by a well trained expert… ;-)


-> allowed by the profile syntax
-> no direct link to multiple alignment
Database content: PATTERN

ID   UCH_2_1; PATTERN.
AC   PS00972;
DT   JUN-1994 (CREATED); SEP-2000 (DATA UPDATE); SEP-2000 (INFO UPDATE).
DE   Ubiquitin carboxyl-terminal hydrolases family 2 signature 1.
PA   G-[LIVMFY]-x(1,3)-[AGC]-[NASM]-x-C-[FYW]-[LIVMFC]-[NST]-[SACV]-x-[LIVMS]-Q.
NR   /RELEASE=38,80000;
NR   /TOTAL=41(41); /POSITIVE=41(41); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=2; /PARTIAL=0;
CC   /TAXO-RANGE=??E??; /MAX-REPEAT=1;
CC   /SITE=7,active_site(?);
DR   Q93008, FAFX_HUMAN, T; O00507, FAFY_HUMAN, T; P55824, FAF_DROME , T;
DR   P70398, FAF_MOUSE , T; P54578, TGT_HUMAN , T; P40826, TGT_RABIT , T;
DR   P25037, UBP1_YEAST, T; O42726, UBP2_KLULA, T; Q01476, UBP2_YEAST, T;
(…)
DR   P38187, UBPD_YEAST, T; Q24574, UBPE_DROME, T; Q14694, UBPE_HUMAN, T;
DR   P52479, UBPE_MOUSE, T; P38237, UBPE_YEAST, T; P50101, UBPF_YEAST, T;
DR   Q02863, UBPG_YEAST, T; P43593, UBPH_YEAST, T; Q61068, UBPW_MOUSE, T;
DR   P34547, UBPX_CAEEL, T; Q09931, UBPY_CAEEL, T;
DR   P53874, UBPA_YEAST, N; Q17361, UBPT_CAEEL, N;
DO   PDOC00750;
//
Database content: Profile

ID   UCH_2_3; MATRIX.
AC   PS50235;
DT   SEP-2000 (CREATED); SEP-2000 (DATA UPDATE); SEP-2000 (INFO UPDATE).
DE   Ubiquitin carboxyl-terminal hydrolases family 2 profile.
MA   /GENERAL_SPEC: ALPHABET='ABCDEFGHIKLMNPQRSTVWYZ'; LENGTH=193; TOPOLOGY=LINEAR;
MA   /DISJOINT: DEFINITION=PROTECT; N1=10; N2=185;
MA   /NORMALIZATION: MODE=1; FUNCTION=LINEAR; R1=1.3922; R2=.00836191; TEXT='NScore';
MA   /CUT_OFF: LEVEL=0; SCORE=910; N_SCORE=9.0; MODE=1;
MA   /CUT_OFF: LEVEL=-1; SCORE=610; N_SCORE=6.5; MODE=1;
MA   /DEFAULT: B1=-100; E1=-100; MI=-105; MD=-105; IM=-105; DM=-105; I=-20; D=-20;
MA   /I: B1=0; BI=-105; BD=-105;
MA   /M: SY='T'; M=0,-14,2,-19,-16,-9,-21,-18,-6,-10,-5,-5,-12,-21,-15,-6,0,9,6,-29,-11,-16;
(…)
MA   /M: SY='D'; M=-11,12,-27,17,6,-21,-9,-4,-21,-4,-18,-14,5,-12,0,-6,-3,-8,-19,-26,-11,2;
MA   /I: E1=0;
NR   /RELEASE=38,80000;
NR   /TOTAL=47(47); /POSITIVE=47(47); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=0; /PARTIAL=0;
CC   /TAXO-RANGE=??E??; /MAX-REPEAT=1;
DR   Q01988, UBPB_CANFA, T; Q93008, FAFX_HUMAN, T; O00507, FAFY_HUMAN, T;
DR   P55824, FAF_DROME , T; P70398, FAF_MOUSE , T; P53010, PAN2_YEAST, T;
(…)
DR   Q09798, YAA4_SCHPO, T; P43589, YFH5_YEAST, T;
DO   PDOC00750;
//
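The /NORMALIZATION and /CUT_OFF lines of this entry fit together: with FUNCTION=LINEAR, the normalized score appears to be R1 + R2 * raw score, and the two cut-offs in the entry are consistent with that reading. The sketch below checks this arithmetic with the values copied from the entry; the function name is illustrative and the linear form is inferred from the entry itself rather than taken from the pftools source.

```python
def normalized_score(raw_score: float, r1: float, r2: float) -> float:
    """Linear normalization: N_SCORE = R1 + R2 * SCORE (assumed form)."""
    return r1 + r2 * raw_score


# Values copied from the /NORMALIZATION line of PS50235
R1, R2 = 1.3922, 0.00836191

# (raw SCORE, N_SCORE) pairs copied from the two /CUT_OFF lines
for raw, expected in ((910, 9.0), (610, 6.5)):
    n_score = normalized_score(raw, R1, R2)
    print(f"SCORE={raw} -> N_SCORE={n_score:.2f} (entry says {expected})")
```

Both raw cut-offs land within rounding distance of the normalized values given in the entry, which is what makes scores comparable across profiles of different lengths.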
Database content: documentation





















{PDOC00750}
{PS00972; UCH_2_1}
{PS00973; UCH_2_2}
{PS50235; UCH_2_3}
{BEGIN}
**********************************************************************
* Ubiquitin carboxyl-terminal hydrolases family 2 signatures/profile *
**********************************************************************
Ubiquitin carboxyl-terminal hydrolases (EC 3.1.2.15) (UCH) (deubiquitinating
enzymes) [1,2] are thiol proteases that recognize and hydrolyze the peptide
bond at the C-terminal glycine of ubiquitin. These enzymes are involved in the
processing of poly-ubiquitin precursors as well as that of ubiquinated
proteins.
There are two distinct families of UCH. The second class consist of large
proteins (800 to 2000 residues) and is currently represented by:
- Yeast UBP1, UBP2, UBP3, UBP4 (or DOA4/SSV7), UBP5, UBP7, UBP9, UBP10,
UBP11, UBP12, UBP13, UBP14, UBP15 and UBP16.
- Human tre-2.
- Human isopeptidase T.
- Human isopeptidase T-3.
(…)
Database content: documentation



also probably implicated in the catalytic mechanism. We have developed
signature pattern for both conserved regions. We also developed a profile
including the two regions covered by the patterns.

-Consensus pattern: G-[LIVMFY]-x(1,3)-[AGC]-[NASM]-x-C-[FYW]-[LIVMFC]-[NST]-[SACV]-x-[LIVMS]-Q
[C is the putative active site residue]
-Sequences known to belong to this class detected by the pattern: ALL, except
for two sequences.
(…)
-Note: these proteins belong to family C19 in the classification of peptidases
[3,E1].
-Note: this documentation entry is linked to both a signature pattern and a
profile. As the profile is much more sensitive than the pattern, you should
use it if you have access to the necessary software tools to do so.

-Last update: September 2000 / Patterns and text revised; profile added.

[ 1] Jentsch S., Seufert W., Hauser H.-P.
Biochim. Biophys. Acta 1089:127-139(1991).
[ 2] D'andrea A., Pellman D.
Crit. Rev. Biochem. Mol. Biol. 33:337-352(1998).
[ 3] Rawlings N.D., Barrett A.J.
Meth. Enzymol. 244:461-486(1994).
[E1] http://www.expasy.ch/cgi-bin/lists?peptidas.txt
















Tools

EMBOSS


FINDPATTERN, SCANPROSITE...


http://www.isrec.isb-sib.ch/software
Pftools 2.2 (pfmake, pfw, pfscan, pfsearch)



http://www.expasy.org/tools/#pattern
PFSCAN & PFRAMESCAN


fuzzpro, fuzztran, fuzznuc, patmatdb, patmatmotifs
Fortran source code (open source)
Binaries (solaris, linux, hpux, irix, win32, macosX)
GeneMatcher (http://www.paracel.com)
PSI-BLAST

What is it?





Derived from NCBI-BLAST2.0
Position Specific Iterative BLAST
Difference with BLAST
PSSM / checkpoint
Advantage / Disadvantage
PSI-BLAST




Position specific iterative BLAST (PSI-BLAST) refers to a feature of
BLAST 2.0 in which a profile (or position specific scoring matrix,
PSSM) is constructed (automatically) from a multiple alignment of the
highest scoring hits in an initial BLAST search.
The PSSM is generated by calculating position-specific scores for each
position in the alignment. Highly conserved positions receive high scores
and weakly conserved positions receive scores near zero.
The profile is used to perform a second (etc.) BLAST search (replacing
the normal matrix, e.g. BLOSUM62) and the results of each "iteration"
used to refine the profile.
This iterative searching strategy results in increased sensitivity.
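The idea of position-specific scores can be sketched in a few lines. This is a deliberately simplified model, not PSI-BLAST's actual procedure: it uses a flat pseudocount and a uniform background frequency, whereas PSI-BLAST applies sequence weighting, data-dependent pseudocounts and real background frequencies. The function names are illustrative; the four-row block is the conserved KPKR region of the chite/wheat/trybr/mouse alignment shown earlier.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores (in bits) per alignment column."""
    n_cols = len(alignment[0])
    if any(len(seq) != n_cols for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(n_cols):
        # Gaps and unknown symbols are simply not counted.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


block = ["KPKRP", "KPKRA", "APKRA", "KPKRP"]  # chite, wheat, trybr, mouse
pssm = build_pssm(block)
print(round(score_segment(pssm, "KPKRP"), 2), round(score_segment(pssm, "WWWWW"), 2))
```

A segment that follows the conserved residues scores well above one that does not, and the resulting matrix has one column per alignment position rather than one per residue pair.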
BLAST algorithm
Differences with BLAST




The two E-values
Automatically or manually selecting the matches
The substitution matrix
The iteration
PSI-BLAST E-values

Two different E value settings need to be specified in the PSI-BLAST program.

The first of these (upper) sets the threshold for the initial BLAST search. The default value is 10, as in the standard BLAST program.
The second E value (lower) is the threshold for inclusion in the position specific matrix used for PSI-BLAST iterations. The default setting is 0.001.
Together, these settings let the user see (and selectively include, based on prior knowledge) all of the BLAST hits up to E=10, while automatically including only those hits that pass the relatively rigorous E value threshold of 0.001.
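The role of the two thresholds can be sketched as a simple partition of the hit list. The hit names and E values below are invented for illustration, the names `Hit` and `partition_hits` are hypothetical, and whether a hit exactly at the threshold is kept is an assumption of this sketch, not a statement about PSI-BLAST.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    name: str
    evalue: float


def partition_hits(hits: List[Hit], report_threshold: float = 10.0,
                   inclusion_threshold: float = 0.001) -> Tuple[List[Hit], List[Hit]]:
    """Return (hits shown to the user, hits used to build the next PSSM)."""
    reported = [h for h in hits if h.evalue <= report_threshold]
    included = [h for h in reported if h.evalue <= inclusion_threshold]
    return reported, included


# Invented hits, for illustration only
hits = [Hit("hit_A", 1e-40), Hit("hit_B", 2e-5), Hit("hit_C", 0.3), Hit("hit_D", 25.0)]
reported, included = partition_hits(hits)
print([h.name for h in reported])  # visible in the output
print([h.name for h in included])  # automatically included in the profile
```

Hits that are reported but not included (here hit_C) are the ones a user may still add by hand when biological knowledge supports it.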
BLAST PSSM or weight matrix


A substitution matrix for an alphabet of size A is
of size AxA
A PSSM for an alphabet of size A is of size AxN
where N is the length of the query
Substitution matrix (AxA), excerpt:
     A   R   N  . .   Y   V
A    4  -1  -2       -2   0
R   -1   5   0       -2  -3
N   -2   0   6       -2  -3
.
.
Y   -2  -2  -2        7  -1
V    0  -3  -3  . .  -1   4

[PSSM (AxN), excerpt: rows A, R, N, …, Y, V; one column of scores per position of the query M I S E C U E N C I A]
BLAST Iteration
PHI-BLAST: a link with PATTERNS



PHI-BLAST means Pattern-Hit Initiated BLAST
PHI-BLAST expects as input a protein query sequence and a pattern
contained in that sequence.
PHI-BLAST searches the specified database for other protein
sequences that also contain the input pattern and have significant
similarity to the query sequence in the vicinity of the pattern
occurrences.


Statistical significance is reported using E-values as for other forms of BLAST,
but the statistical method for computing the E-values is different.
PHI-BLAST is integrated with Position-Specific Iterated BLAST (PSI-BLAST), so
that the results of a PHI-BLAST query can be used to initiate one or more rounds
of PSI-BLAST searching.
The good and the bad

Advantages
 - Fast
 - User friendly interface
 - Local bias statistics
 - Single software
Disadvantages
 - Could be confusing
 - No position specific gap penalty
 - Fixed query length
 - Complex PSSM/checkpoint for reuse
 - Difficult scan vs search
How to « PSI-BLAST » efficiently?
• Choose your query sequence carefully
• Limit the size to the domain, but maximize
• Check matches: include or exclude based on biological knowledge
• Do not overfit!!
• Try reverse experiment to certify
