CATH domain database
Orengo & Thornton 1994
Class
Architecture
Topology or Fold Group
Homologous Superfamily
The CATH domain database and associated resources
- DHS, Gene3D
How do we determine domain boundaries?
How do we identify fold groups and evolutionary
superfamilies?
What is the distribution of the CATH domain families
in the PDB and in the genomes?
Multidomain proteins
~20,000 chains from
Protein Databank (PDB)
~50,000 domains in
CATH structure
database
~40% of the entries in CATH are multidomain
Domains are important evolutionary units
Analysis by Teichmann and others suggests that ~60-80% of genes in genomes may be multidomain
Carboxypeptidase G2
(1cg2A)
Carboxypeptidase A
(2ctc)
~30% of multidomains in CATH are discontinuous
Algorithms for Recognising Domain
Boundaries

DETECTIVE Swindells 1995
each domain should have a recognisable hydrophobic core

DOMAK
Siddiqui & Barton, 1995
residues comprising a domain make more internal contacts
than external ones

PUU Holm & Sander, 1994
parser for protein folding units: maximal interaction within
domains and minimal interaction between domains
Consensus is sought between the three methods –
on average this occurs about 20% of the time
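The contact-based idea behind these methods (many contacts inside a domain, few between domains) can be illustrated with a minimal sketch. This is not the scoring function of DOMAK, PUU or DETECTIVE; the contact list, the function names `contact_counts` and `best_boundary`, and the toy data are all hypothetical, and the sketch only considers a single cut into two continuous domains.

```python
def contact_counts(contacts, boundary):
    """Count contacts within and between two candidate domains.

    contacts: iterable of (i, j) residue index pairs that are in contact.
    boundary: residues with index < boundary form domain A, the rest domain B.
    Returns (internal_a, internal_b, external).
    """
    internal_a = internal_b = external = 0
    for i, j in contacts:
        i_in_a, j_in_a = i < boundary, j < boundary
        if i_in_a and j_in_a:
            internal_a += 1
        elif not i_in_a and not j_in_a:
            internal_b += 1
        else:
            external += 1
    return internal_a, internal_b, external


def best_boundary(contacts, n_residues, min_size=2):
    """Pick the single cut that leaves the fewest between-domain contacts."""
    candidates = range(min_size, n_residues - min_size + 1)
    return min(candidates, key=lambda b: contact_counts(contacts, b)[2])


# Toy contact list (hypothetical): two tight clusters, residues 0-3 and 4-7,
# joined by a single contact between residues 3 and 4.
toy_contacts = [(0, 1), (0, 2), (1, 3), (2, 3),
                (4, 5), (4, 6), (5, 7), (6, 7), (3, 4)]
boundary = best_boundary(toy_contacts, n_residues=8)
internal_a, internal_b, external = contact_counts(toy_contacts, boundary)
```

A single cut cannot describe discontinuous domains, so a real method needs a richer partitioning than this sketch.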
[Figure: pie chart of relationship categories: close homologues, twilight zone, midnight zone, homologues/analogues. Extracted values: 74%, 29%, 21%, 4%, 11%.]
Algorithms for Recognising Homologues

Sequence Based Methods
close homologues
- BLAST (Altschul et al.)
- SSEARCH (Smith & Waterman)
remote homologues
- SAM-T99 (Karplus et al.)
Structure Based Methods
close & remote homologues
- CATHEDRAL (Harrison, Thornton & Orengo)
- SSAP (Taylor & Orengo)
- CORA (Orengo)
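[Figure: pie chart of relationship categories with the methods used for each: close homologues (SSEARCH); twilight zone (HMMs, SSAP); midnight zone (CATHEDRAL, SSAP); homologues/analogues (CATHEDRAL, SSAP). Extracted values: 74%, 29%, 21%, 4%, 11%.]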
Hidden Markov Models (HMMs)
SAM-T99 (Karplus Group), SAMOSA (Orengo Group)
query sequence -> scan against non-redundant GenBank database -> hits
These methods can currently identify ~70% of remote homologues
(3 times more powerful than BLAST)
Percentage of PDB structures classified in CATH by
different methods over the last 2 years
[Figure: pie chart. Categories and methods: near-identical (SSEARCH); close homologues, >30% (SSEARCH); remote homologues, <30% (HMMs); remote homologues, 8.6, and analogues, 1.9 (SSAP); novel folds. Other extracted values: 2.0, 7.6, 20.7, 59.2.]
Percentage of structural genomics PDB structures
classified in CATH by different methods over the
last 2 years
[Figure: pie chart. Categories and methods: near-identical (SSEARCH); close homologues, >30% (SSEARCH); remote homologues, <30% (HMMs); remote homologues (SSAP); analogues (SSAP); novel folds. Extracted values: 11.8, 7.7, 8.0, 22.0, 28.4, 22.0.]
Structure Based Algorithms for Recognising
Homologues
CATHEDRAL: pairwise alignment, secondary structure comparison
SSAP: pairwise alignment, residue comparison
CORA: multiple alignment, residue comparison
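[Figure: pie chart of relationship categories with the methods used for each: close homologues (SSEARCH); twilight zone (HMMs); midnight zone (CATHEDRAL, SSAP); homologues/analogues (CATHEDRAL, SSAP). Extracted values: 74%, 29%, 21%, 4%, 11%.]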
structure is much more highly conserved than
sequence
[Figure: structures of cholera toxin, heat labile enterotoxin and pertussis toxin. Structure similarity (SSAP) scores: 97, 81. Sequence identities: 79%, 12%.]
Pairwise Sequence Identities and Structure
Similarity (SSAP) Scores in CATH Domain Families
[Figure: scatter plot of structure similarity (SSAP) score against sequence identity (%), with pairs marked as same function or different function]
• Residue insertions in the loops connecting secondary
structures
• Shifts in the orientations of secondary structures
Structural variation in the P-loop Hydrolase Superfamily
Yeast Elongation factor complex
Yeast Guanylate kinase
Helicase domain of bacteriophage T7
ATP phosphorylase
Structural variation in the Galectin Binding Superfamily
Fast Structure Comparison
Method (CATHEDRAL)
Andrew Harrison et al., JMB, 2002
• ignore the variable loop regions and only
compare the secondary structures
• derive vectors through secondary structure
elements
• compare closest approach distances and vector
orientations using graph theory
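[Figure: two secondary structure vectors a and b separated by the closest approach distance d]
$\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}|\,|\mathbf{b}| \cos\theta$
+ dihedral angle
+ chirality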
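The two geometric quantities named here, the angle from the dot product and the closest approach distance, can be sketched in a few lines. This is an illustration only, not CATHEDRAL's code: the function names and the toy coordinates are hypothetical, and the distance is computed between infinite lines through the vectors rather than between finite segments, which is a simplifying assumption.

```python
import math


def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def norm(u):
    return math.sqrt(dot(u, u))


def angle_between(a, b):
    """Angle in degrees, from a . b = |a||b| cos(theta)."""
    cos_theta = dot(a, b) / (norm(a) * norm(b))
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
    return math.degrees(math.acos(cos_theta))


def line_distance(p1, a, p2, b):
    """Closest approach of the infinite lines p1 + s*a and p2 + t*b."""
    n = cross(a, b)
    w = sub(p2, p1)
    if norm(n) < 1e-9:
        # Parallel lines: distance from p2 to the line through p1.
        return norm(cross(w, a)) / norm(a)
    return abs(dot(w, n)) / norm(n)


# Toy coordinates (hypothetical): one element along x, another along y,
# lifted 5 units in z.
start_a, end_a = (0.0, 0.0, 0.0), (10.0, 0.0, 0.0)
start_b, end_b = (0.0, 0.0, 5.0), (0.0, 8.0, 5.0)
vec_a, vec_b = sub(end_a, start_a), sub(end_b, start_b)
theta = angle_between(vec_a, vec_b)
d = line_distance(start_a, vec_a, start_b, vec_b)
```

For these toy vectors the elements are perpendicular and 5 units apart; the dihedral angle and chirality terms are left out of the sketch.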
CATHEDRAL
CATH's Existing Domain Recognition ALgorithm
[Figure: protein graph. Each node is a secondary structure element (labelled H in the example); each edge is labelled with the distance d, angles and chirality.]
Compares graphs of proteins
Comparing proteins with similar folds identifies an
overlap graph with the largest common structural
motif
[Figure: graphs of two proteins, one with nodes A, B, C and one with nodes a, b, c, d, and their overlap graph with matched node pairs (A,a), (B,c), (C,d)]
overlap graph has a
structural motif of 3
secondary structures
Graphs are compared using the Bron-Kerbosch
algorithm to find the largest common graph
In this example the common graph contains 5
nodes.
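The graph comparison step can be sketched with the textbook Bron-Kerbosch algorithm (with pivoting) applied to an overlap graph. This is a minimal illustration under stated assumptions, not CATHEDRAL's implementation: edges are compared on distance alone (angles and chirality are omitted), and the function names, the tolerance `tol` and the toy distance matrices are hypothetical.

```python
from itertools import combinations


def bron_kerbosch(r, p, x, adj, cliques):
    """Textbook Bron-Kerbosch with pivoting; collects maximal cliques."""
    if not p and not x:
        cliques.append(set(r))
        return
    pivot = max(p | x, key=lambda v: len(adj[v] & p))
    for v in list(p - adj[pivot]):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, cliques)
        p = p - {v}
        x = x | {v}


def largest_common_graph(types1, dist1, types2, dist2, tol=2.0):
    """Largest set of matched elements whose pairwise distances agree."""
    # Overlap graph nodes: pairs of same-type secondary structure elements.
    nodes = [(i, j) for i in range(len(types1)) for j in range(len(types2))
             if types1[i] == types2[j]]
    adj = {n: set() for n in nodes}
    # Two pairings are compatible if they use distinct elements and the
    # corresponding distances in the two proteins agree within tol.
    for (i, j), (k, l) in combinations(nodes, 2):
        if i != k and j != l and abs(dist1[i][k] - dist2[j][l]) <= tol:
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))
    cliques = []
    bron_kerbosch(set(), set(nodes), set(), adj, cliques)
    return max(cliques, key=len) if cliques else set()


# Toy data (hypothetical): protein 2 has an extra strand at index 0.
types1 = ["H", "E", "H"]
dist1 = [[0, 10, 15], [10, 0, 8], [15, 8, 0]]
types2 = ["E", "H", "E", "H"]
dist2 = [[0, 30, 25, 40], [30, 0, 10.5, 14], [25, 10.5, 0, 8.5], [40, 14, 8.5, 0]]
common = largest_common_graph(types1, dist1, types2, dist2)
```

In this toy case the largest clique pairs elements 0, 1, 2 of the first protein with elements 1, 2, 3 of the second, giving a common motif of 3 secondary structures.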
1000 times faster than residue based methods
(e.g. SSAP)
Performance
statistical significance can be assessed by
scanning a protein ‘graph’ against ‘graphs’ of all
known structures
$\text{Score} \sim \dfrac{\text{common graph size}}{(\text{size}_{\text{protein 1}} \cdot \text{size}_{\text{protein 2}})^{1/2}}$
scores for unrelated structures exhibit an
extreme value distribution
$F = A\,e^{-b \cdot \text{score}}$
$\log F = \log A - b \cdot \text{score}$
allows you to calculate the probability (P-value, E-value) of
obtaining any score by chance
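A minimal sketch of how the normalised score and the log-linear relationship could be used follows. The slides do not define F precisely; here it is assumed to be the fraction of unrelated comparisons scoring at or above a given value. The function names and the synthetic data (generated from A = 2.0 and b = 5.0) are hypothetical and carry no real CATHEDRAL parameters.

```python
import math


def normalised_score(common_size, size1, size2):
    """Common graph size divided by the geometric mean of the graph sizes."""
    return common_size / math.sqrt(size1 * size2)


def fit_log_linear(scores, freqs):
    """Least-squares fit of log F = log A - b * score; returns (A, b)."""
    ys = [math.log(f) for f in freqs]
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def chance_frequency(score, a, b):
    """F = A * exp(-b * score) for a fitted A and b."""
    return a * math.exp(-b * score)


# Synthetic frequencies (hypothetical), generated from A = 2.0, b = 5.0.
toy_scores = [0.2, 0.4, 0.6, 0.8]
toy_freqs = [2.0 * math.exp(-5.0 * s) for s in toy_scores]
a_fit, b_fit = fit_log_linear(toy_scores, toy_freqs)
example_score = normalised_score(3, 4, 9)
p_chance = chance_frequency(example_score, a_fit, b_fit)
```

Fitting in log space turns the exponential tail into a straight line, which is why the second form of the equation is convenient.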
Using CATHEDRAL to Identify Domain
Boundaries
Graph based secondary structure
comparison is very fast - 1000 times
faster than residue based methods
New multi-domain structures can be
rapidly scanned against the library of
CATH domains. E-values can be used
to identify significant matches.
85-90% of domains in new multi-domain
structures have relatives in CATH
[Figure: CATHEDRAL secondary structure match by graph, followed by SSAP residue alignment. Residues in the new multi-domain structure are matched to residues in CATH domain family 1 (Fold A) and CATH domain family 2 (Fold B).]
SSAP
Taylor & Orengo, J. Mol. Biol. 1989
residue-based structure comparison method using dynamic programming
Scores range from 0-100
[Figure: alignment matrix of residues in protein A against residues in protein B]
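The dynamic programming step can be illustrated with a generic global alignment over a residue-by-residue score matrix. This is a single-level sketch, not SSAP itself: how SSAP derives its residue scores from structure is not shown, and the function name `align`, the gap penalty and the toy matrix are hypothetical.

```python
def align(score, gap=-1.0):
    """Global alignment by dynamic programming over a residue score matrix.

    score[i][j] is the similarity of residue i in protein A and residue j
    in protein B. Returns (total score, list of aligned (i, j) pairs).
    """
    n, m = len(score), len(score[0])
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(best[i - 1][j - 1] + score[i - 1][j - 1],
                             best[i - 1][j] + gap,
                             best[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs


# Toy score matrix (hypothetical): protein B has one inserted residue at
# index 1 relative to protein A.
toy_scores = [[5, 0, 0, 0],
              [0, 0, 5, 0],
              [0, 0, 0, 5]]
total, aligned = align(toy_scores)
```

The alignment skips the inserted residue in protein B at the cost of one gap penalty, which mirrors how loop insertions are tolerated when comparing related structures.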
CATHEDRAL
One third of known multi-domain structures are discontinuous
Reasons for Structural Similarity
• Divergence - similarity arises due to
divergent evolution from a common
ancestor - structure much more highly
conserved than sequence
• Convergence - similarity due to there
being a limited number of ways of packing
helices and strands in 3D space
Domain structure database
Orengo & Thornton 1994
Class
Architecture
Topology or Fold Group
Homologous Superfamily
~50,000 domains in PDB
~1500 domain superfamilies in CATH
CATH domain database
~50,000 domain entries (40,000 also shown)
Class: 3
Architecture: ~36
Topology or Fold Group: ~810
Homologous Superfamily (Domain Family): ~1500
Sequence Family (35%, 60%, 95%)
DHS
Dictionary of Homologous Superfamilies
http://www.biochem.ucl.ac.uk/bsm/dhs
Description of structural and functional characteristics for each
superfamily
Variation in Secondary Structures Across Superfamily
Functional annotations from GO, EC, COGs, KEGG
Multiple structure alignments with conserved residues
highlighted
DHS:Dictionary of Homologous superfamilies
http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D
Population of CATH Families and
Structural Groups
~50,000 structural domains
S: ~4000 sequence families (35%), cluster proteins with similar sequences
H: ~1,500 homologous superfamilies, cluster proteins with similar structures and functions
T: ~810 fold groups, cluster proteins with similar structures
A: ~36 architectures
C: 3 major protein classes
CATH
nearly one third of the superfamilies belong to <10 fold groups
[Figure: fold groups shown: Arc repressor-like, Up-down, Rossmann fold, SH3-like, OB fold, Immunoglobulin, Alpha-beta plait, Jelly roll, TIM barrel]
CATH numbering scheme
2.40.50.100
2 Class: Mainly beta
40 Architecture: Barrel
50 Topology: OB Fold
100 Homology: Heat labile enterotoxin superfamily
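The numbering scheme maps directly onto a four-field record. The sketch below parses a code such as 2.40.50.100 and reports the deepest level two codes share; the names `CathCode`, `parse_cath_code` and `deepest_shared_level` are hypothetical, and the second code in the example is made up for illustration.

```python
from typing import NamedTuple, Optional


class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homology: int


def parse_cath_code(code: str) -> CathCode:
    """Split a C.A.T.H code into its four integer fields."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {code!r}")
    return CathCode(*(int(p) for p in parts))


def deepest_shared_level(a: CathCode, b: CathCode) -> Optional[str]:
    """Return 'C', 'A', 'T' or 'H' for the deepest level shared, else None."""
    shared = None
    for level, x, y in zip("CATH", a, b):
        if x != y:
            break
        shared = level
    return shared


enterotoxin = parse_cath_code("2.40.50.100")
other = parse_cath_code("2.40.99.1")  # made-up code, for illustration only
shared = deepest_shared_level(enterotoxin, other)
```

Here the two codes agree on class and architecture but differ at the topology field, so the deepest shared level is A.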
CATH
http://www.biochem.ucl.ac.uk/bsm/cath
CATH domain structure database
CATH class level
CATH architecture level
CATH Topology or fold group level
CATH homologous superfamilies in each fold group
CATH homologous superfamily level
CATH sequence families (>=35% identity) in each superfamily
CATH classification information for individual domains
CATH structural relatives listed for each domain
CATH server
http://www.biochem.ucl.ac.uk/cgi-bin/cath/CathServer.pl
structural matches and statistics listed for query domain
Expanding CATH with sequence
relatives from genomes
Library of HMMs built for representative sequences from
each CATH domain superfamily
protein sequences from genomes -> scan against CATH HMM library -> assign domains to CATH superfamilies
Expanding CATH
~1400 Domain Structure Superfamilies
Structures only: ~50,000 sequences, ~4,000 sequence families
Sequences added from GenBank, genomes, SWISS-PROT/TrEMBL using CATH-HMMs
After expansion: ~600,000 sequences, ~24,000 sequence families
[Figure: a homologous superfamily (H) growing from sequence families S1-S3 to S1-S5]
Up to 70% of sequences in completed genomes can be
assigned to CATH domain superfamilies
Gene3D
[Figure: fold groups shown: Arc repressor-like, Up-down, Four helix bundle, Alpha horseshoe, SH3-like (SH3-type barrel), OB fold, Rossmann fold, Immunoglobulin-like, Jelly roll, Alpha-beta plait, TIM barrel]
Gene3D
http://www.biochem.ucl.ac.uk/bsm/Gene3D
CATH domain structure annotations for complete genomes
Individual genome statistics
Assignment of sequences to Gene3D protein families
Functional annotations for individual sequences
Domain annotations for individual sequences
Summary
• CATH currently identifies ~1500 superfamilies in the
~50,000 structural domains from the PDB
• These domain families contain over 600,000 domain
sequences from the genomes and sequence databases
• Up to 70% of genome sequences can be assigned to
domain structure families using HMMs and threading
Acknowledgements
Frances Pearl
Ian Sillitoe
Oliver Redfern
Mark Dibley
Tony Lewis
Chris Bennett
Andrew Harrison
Gabrielle Reeves
Alastair Grant
David Lee
Janet Thornton
http://www.biochem.ucl.ac.uk/bsm/cath
Medical Research Council,
Wellcome Trust, NIH
Biotechnology and Biological Sciences Research Council