Exploiting Structural and Comparative Genomics to Reveal Protein Functions

How many domain families can we find in the genomes, and can we predict the functions of relatives?
Exploiting protein structure to predict protein functions
Using correlated phylogenetic profiles based on CATH domains to reveal functional associations
CATH: domain families of known structure
Gene3D: protein families and domain annotations for completed genomes
CATHEDRAL
Oliver Redfern and Andrew Harrison
Combines a rapid graph-theory secondary structure filter with dynamic programming for accurate residue alignment
An SVM is used to combine scores and assess the significance of a match
CATH version 3.0
1100 fold groups
2100 homologous superfamilies
86,000 Domains
Fold Recognition Performance
[Figure: "% Correct Fold" (axis 0.8 to 1) against rank (0 to 25) for CATHEDRAL, CE, DALI, LSQMAN, STRUCTAL, SSAP and DDP]
Gene3D:Domain annotations in genome sequences
>2 million protein sequences from 300 completed genomes and UniProt
scan against a library of HMM models (~2000 CATH, ~9000 Pfam)
assign domains to CATH and Pfam superfamilies
Benchmarking against structural data shows that 76% of remote homologues can be identified using the HMMs
Gene3D:Domain annotations in genome sequences
DomainFinder: structural domains from CATH take precedence
[Figure: domain assignments (CATH-1, Pfam-1, Pfam-2, NewFam) laid out along a protein chain from N to C terminus]
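The precedence rule can be illustrated with a small sketch. This is not the DomainFinder implementation: the greedy strategy, the `Assignment` record, the residue ranges, the `Pfam-X` hit and the Pfam-before-NewFam ordering are all assumptions made for illustration. Only "CATH takes precedence" comes from the slide.

```python
from dataclasses import dataclass

# Lower number wins. Only "CATH first" is stated on the slide;
# ranking Pfam above NewFam is an assumption.
PRIORITY = {"CATH": 0, "Pfam": 1, "NewFam": 2}


@dataclass(frozen=True)
class Assignment:
    source: str  # "CATH", "Pfam" or "NewFam"
    family: str
    start: int   # first residue, inclusive
    end: int     # last residue, inclusive


def overlaps(a, b):
    """True if the two residue ranges share at least one position."""
    return a.start <= b.end and b.start <= a.end


def resolve(assignments):
    """Accept hits in priority order (longer first within a source),
    discarding any hit that overlaps one already accepted."""
    ranked = sorted(assignments,
                    key=lambda a: (PRIORITY[a.source], -(a.end - a.start)))
    kept = []
    for candidate in ranked:
        if not any(overlaps(candidate, k) for k in kept):
            kept.append(candidate)
    return sorted(kept, key=lambda a: a.start)


# Made-up residue ranges; "Pfam-X" is a hypothetical hit overlapping CATH-1.
hits = [
    Assignment("Pfam", "Pfam-1", 1, 90),
    Assignment("CATH", "CATH-1", 100, 250),
    Assignment("Pfam", "Pfam-X", 180, 270),
    Assignment("NewFam", "NewFam", 260, 330),
    Assignment("Pfam", "Pfam-2", 340, 450),
]
resolved = resolve(hits)
for a in resolved:
    print(a.family, a.start, a.end)
```

In this toy input the hypothetical `Pfam-X` hit is dropped because it overlaps the CATH domain, while the non-overlapping Pfam and NewFam hits are retained.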
[Figure: percentage of all domain family sequences against domain families ranked by size (number of domain sequences), with curves for CATH superfamilies of known structure, Pfam families of unknown structure and NewFam families of unknown structure]
~90% of domain sequences in the genomes and UniProt can be assigned to ~7000 domain families
<100 families account for 50% of domain sequences of known fold
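A statement such as "N families account for 50% of sequences" is a cumulative count over families ranked by size. The sketch below shows that calculation on made-up family sizes; the function name and the numbers are illustrative and are not taken from Gene3D.

```python
def families_needed(family_sizes, fraction=0.5):
    """Smallest number of families, taken largest first, whose members
    make up at least `fraction` of all domain sequences."""
    sizes = sorted(family_sizes, reverse=True)
    target = fraction * sum(sizes)
    covered = 0
    for rank, size in enumerate(sizes, start=1):
        covered += size
        if covered >= target:
            return rank
    return len(sizes)


# Made-up sizes (number of domain sequences per family).
toy_sizes = [400, 300, 150, 80, 40, 20, 10]
print(families_needed(toy_sizes, 0.5))
print(families_needed(toy_sizes, 0.9))
```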
[Figure: a structural superfamily (CATH) divided into subfamilies of relatives F1 to F5; relatives within a subfamily are likely to have similar functions]
Only ~3% of diverse sequences in large CATH domain families have known structures
Gene3D: Domain mappings for 300 Completed Genomes
300 genomes, >2 million sequences including UniProt and RefSeq
structural domain assignments from CATH
functional domain assignments from Pfam
Iterative Profile Search Methodology
Also: SWISS-PROT, EC, COGs, GO, KEGG, MIPS, BIND, IntAct
http://www.biochem.ucl.ac.uk:8080/Gene3D
Russell Marsden, Corin Yeats, Michael Maibaum, David Lee
Nucleic Acids Res. 2006
Yeats et al. Nucleic Acids Res. 2006.
Conservation of enzyme function in homologous domains
with same multidomain architecture (MDA) in Gene3D
[Figure: Protein 1 and Protein 2 drawn with the domains Pfam-1, CATH-1, NewFam and Pfam-2, illustrating function conservation for domains in the same architectures]
[Figure: conservation of EC number to 3 levels (%, 0 to 100; 3rd-level EC string match) against sequence identity in bins from 11-20 to 91-100, with one series per overlap level from no overlap to 100% overlap in steps of 10%]
Sequence identity thresholds for 95% conservation of
enzyme function (to 3 EC Levels)
[Figure: number of families (0 to 200) and number of sequences (log scale, 1 to 1,000,000) at each sequence identity threshold, in bins from 11-20% to 91-100%]
332 highly conserved families
60 highly variable families
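One way to read "sequence identity threshold for 95% conservation of enzyme function (to 3 EC levels)" is sketched below. This is an assumed procedure, not the analysis behind the slide: the function names, the 10-point bins starting at multiples of 10 (the slide uses 11-20, 21-30, ...) and the toy pairs are all illustrative.

```python
def same_ec(ec_a, ec_b, levels=3):
    """True if two EC numbers agree on their first `levels` fields.
    Incomplete annotations (a '-' in those fields) never match."""
    a = ec_a.split(".")[:levels]
    b = ec_b.split(".")[:levels]
    return len(a) == levels and a == b and "-" not in a


def conservation_by_bin(pairs, bin_width=10):
    """pairs: iterable of (percent_identity, ec_a, ec_b).
    Returns {bin_start: fraction of pairs with conserved EC}."""
    totals, conserved = {}, {}
    for identity, ec_a, ec_b in pairs:
        start = min(int(identity) // bin_width * bin_width, 100 - bin_width)
        totals[start] = totals.get(start, 0) + 1
        conserved[start] = conserved.get(start, 0) + same_ec(ec_a, ec_b)
    return {b: conserved[b] / totals[b] for b in sorted(totals)}


def identity_threshold(pairs, required=0.95, bin_width=10):
    """Lowest bin start such that it and every populated bin above it
    reach the required conservation; None if the top bin already fails."""
    fractions = conservation_by_bin(pairs, bin_width)
    threshold = None
    for start in sorted(fractions, reverse=True):
        if fractions[start] < required:
            break
        threshold = start
    return threshold


# Made-up homologous pairs: (percent identity, EC of A, EC of B).
toy_pairs = [
    (95, "1.1.1.1", "1.1.1.2"),
    (85, "1.1.1.1", "1.1.1.1"),
    (45, "1.1.1.1", "1.1.1.5"),
    (35, "1.1.1.1", "1.2.1.1"),
    (25, "1.1.1.1", "1.1.1.1"),
]
print(conservation_by_bin(toy_pairs))
print(identity_threshold(toy_pairs))
```

Applied per family, a threshold of this kind would separate families that stay conserved down to low identity from those that diverge early, which is the distinction the slide draws between highly conserved and highly variable families.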
Conservation of Enzyme Function in CATH Domain Families
[Figure: structural similarity (SSAP) score (40 to 100) against pairwise sequence identity (0 to 100%) for pairs of relatives with the same function and with different functions]
Correlation of structural variability with number of
different functional groups
[Figure: COGs vs SSGs. Number of COGs (0 to 90) against number of structural sub-groups (diverse structural clusters within a family, 0 to 60); legend bins 0-25, 25-50, 50-75 and 75-100; the P-loop hydrolases (COG-270, SSG-67) are labelled]
Some families show great structural diversity
Gabrielle Reeves
Multiple structural alignment by CORA allows identification of consensus secondary structure and embellishments
2DSEC algorithm
In 117 superfamilies, relatives have expanded more than 2-fold
These families represent more than half the genome sequences of known fold
Structural embellishments can modify the active site
Galectin binding superfamily
Structural embellishments can modulate domain interactions
[Figure: glucose 6-phosphate dehydrogenase and dihydrodipicolinate reductase shown in side and face orientations, with a region marked (a)]
Additional secondary structures shown at (a) are involved in subunit interactions
Structural embellishments can modify function by modifying active site geometry and mediating new domain and subunit interactions
[Figure: biotin carboxylase and D-alanine-D-alanine ligase from the ATP Grasp superfamily, and a dimer of biotin carboxylase]
Secondary structure insertions are distributed along the chain but aggregate in 3D
[Figure: frequency (%, 0 to 80) against size of indel (number of secondary structures, up to 12); bars with indel frequency < 1% are labelled 0.85%, 0.38%, 0.23%, 0.11%, 0.06% and 0.02%]
85% of residue insertions comprise only 1 or 2 secondary structures
60% of domains have secondary structure embellishments co-located in 3D with 3 or more other embellishments
In 80% of domains, 1 or more embellishments contact other domains or subunits
~80% of variable families adopt regular layered architectures
[Figure: example architectures: 2 Layer Alpha Beta Sandwich, 2 Layer Beta Sandwich, Alpha / Beta Barrel, 3 Layer Alpha Beta Sandwich]
Function prediction to Guide Target Selection for Structural
Genomics
[Figure: a structural superfamily (CATH) divided into subfamilies F1 to F5 of close relatives with the same MDA; relatives within a subfamily are likely to have similar functions]
Only ~3% of diverse sequence families (S30 clusters) in large CATH families have known structures
Conservation of Enzyme Function in Homologous Domains
[Figure: conservation of EC levels (% frequency, 0 to 100) against structure similarity (SSAP) score in bins 50-60, 60-70, 70-80, 80-90 and 90-100, with categories Not Conserved, Less than 3 EC, EC3 and EC4]
FLORA: structural templates for assigning structures to functional subgroups in CATH
Perform CORA multiple structural alignment on functional subfamilies within a CATH superfamily
Use CORAXplode (HMMs) to find related sequences in UniProt and identify conserved residues (seeds)
Explore the local structural environment of seed residues to find conserved structural motifs
Dataset of 84 enzyme superfamilies in CATH, of which 21 are functionally very diverse
Finding conserved residue positions (seeds): Scorecons
Multiple sequence alignment of relatives from a functional family, guided by structure alignment
Identify the most highly conserved residue positions (seed positions) using Scorecons (Valdar and Thornton, 2001)
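The idea of picking seed positions from an alignment can be sketched with a simple per-column score. Note that this is not Scorecons: it substitutes a plain Shannon-entropy score scaled by column occupancy, and the function names and toy alignment are made up for illustration.

```python
import math
from collections import Counter


def column_conservation(column, n_symbols=20):
    """Score in [0, 1]: 1 for an ungapped, invariant column.
    Entropy-based stand-in for a real conservation measure."""
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    occupancy = total / len(column)  # penalise gapped columns
    return (1.0 - entropy / math.log(n_symbols)) * occupancy


def seed_positions(alignment, top_n=3):
    """Indices (0-based) of the `top_n` most conserved alignment columns."""
    columns = list(zip(*alignment))
    scores = [column_conservation(col) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:top_n])


# Made-up alignment of four relatives from one functional family.
toy_alignment = [
    "HGDKS",
    "HADRS",
    "HGDK-",
    "HCDES",
]
seeds = seed_positions(toy_alignment, top_n=2)
print(seeds)
```

Here the two invariant, ungapped columns are selected; the final column is also invariant but is ranked lower because one sequence has a gap there.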
FLORA Algorithm for Identifying Structural Homologues with Similar Functions
assign conserved sequence seeds
expand to local environment of 12 Å
identify structurally conserved residue cliques and generate template
new structures are scanned against a library of FLORA templates, and SVMs are used to assess the significance of matches
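The "expand to local environment" step can be pictured as a distance query around each seed residue. The sketch below is only an illustration of that one step, assuming one representative coordinate per residue (for example C-alpha, which the slide does not specify); the function name and coordinates are hypothetical and FLORA itself may define the environment differently.

```python
import math


def local_environment(seed_ids, coords, radius=12.0):
    """Residues within `radius` Å of any seed residue.

    coords maps residue id -> (x, y, z) for one representative atom
    per residue (an assumption made for this sketch)."""
    environment = set()
    for seed in seed_ids:
        for res_id, xyz in coords.items():
            if math.dist(coords[seed], xyz) <= radius:
                environment.add(res_id)
    return sorted(environment)


# Made-up coordinates in Å.
toy_coords = {
    10: (0.0, 0.0, 0.0),
    11: (3.8, 0.0, 0.0),
    50: (0.0, 11.9, 0.0),
    80: (0.0, 0.0, 12.5),
    120: (30.0, 0.0, 0.0),
}
print(local_environment([10], toy_coords))
```

Residues that are far apart in sequence (10 and 50 here) can fall in the same environment, which is why a structure-based neighbourhood captures more than a sequence window would.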
Performance of FLORA vs Global Structure Comparison (SSAP)
[Figure: coverage (0 to 1) against error rate (0 to 0.2) for SSAP and FLORA]
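A coverage-versus-error curve of this kind can be computed from a ranked list of scored matches. The definitions used below are assumptions, since the slide does not state them: coverage is taken as the fraction of true matches recovered, and error rate as the fraction of accepted matches that are false. The scores and labels are made up.

```python
def coverage_vs_error(scored_hits):
    """scored_hits: list of (score, is_true_match).
    Returns one (error_rate, coverage) point per score cutoff,
    walking down the list from the best score."""
    total_true = sum(1 for _, is_true in scored_hits if is_true)
    if total_true == 0:
        return []
    curve, true_pos, false_pos = [], 0, 0
    for _, is_true in sorted(scored_hits, key=lambda h: -h[0]):
        if is_true:
            true_pos += 1
        else:
            false_pos += 1
        curve.append((false_pos / (true_pos + false_pos),
                      true_pos / total_true))
    return curve


def coverage_at_error(curve, max_error):
    """Best coverage reachable without exceeding `max_error`."""
    return max((cov for err, cov in curve if err <= max_error), default=0.0)


# Made-up benchmark: (match score, whether the functional assignment is right).
toy_hits = [(0.9, True), (0.8, True), (0.7, False),
            (0.6, True), (0.5, False), (0.4, True)]
curve = coverage_vs_error(toy_hits)
print(coverage_at_error(curve, 0.05))
print(coverage_at_error(curve, 0.30))
```

Comparing two methods then amounts to comparing their coverage at the same tolerated error rate.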
Eisenberg Phylogenetic Profiles for Detecting Functional
Associations
Organism         sp1  sp2  sp3  sp4
Superfamily 1     1    0    1    0
Superfamily 2     1    0    1    0
Superfamily 3     0    0    1    1
Each entry records the presence or absence of the superfamily in the organism.
Functionally linked: Superfamily 1 and Superfamily 2 (identical profiles)
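A minimal sketch of presence/absence profile matching, using the three profiles from the slide. The Hamming-distance criterion and the function names are illustrative choices, not a statement of how the published method scores profile similarity.

```python
from itertools import combinations

# Presence (1) or absence (0) in sp1..sp4, as on the slide.
profiles = {
    "Superfamily 1": (1, 0, 1, 0),
    "Superfamily 2": (1, 0, 1, 0),
    "Superfamily 3": (0, 0, 1, 1),
}


def hamming(a, b):
    """Number of organisms in which the two profiles disagree."""
    return sum(x != y for x, y in zip(a, b))


def linked_pairs(profiles, max_mismatches=0):
    """Pairs of families whose profiles differ in at most
    `max_mismatches` organisms."""
    return [(p, q) for p, q in combinations(sorted(profiles), 2)
            if hamming(profiles[p], profiles[q]) <= max_mismatches]


links = linked_pairs(profiles)
print(links)
```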
Gene3D Phylogenetic Occurrence Profiles
CATH Domain Superfamily   sp1  sp2  sp3  sp4
Superfamily 1              12   13   14   11
Superfamily 2              35    0   12   60
Superfamily 3               6    0    0    0
Each entry is the number of relatives from the superfamily in the organism.
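Because occurrence profiles hold counts rather than presence or absence, they can be compared with a correlation measure; later slides report Pearson correlation coefficients. The sketch below computes Pearson's r for the counts shown on this slide. The column order of the flattened table is uncertain, but r is unaffected as long as all rows use the same order. Returning 0.0 for a constant profile is a convention chosen here, not something from the source.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles.
    Returns 0.0 if either profile is constant (chosen convention)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


# Counts of relatives per organism, as listed on the slide.
occurrence = {
    "Superfamily 1": (12, 13, 14, 11),
    "Superfamily 2": (35, 0, 12, 60),
    "Superfamily 3": (6, 0, 0, 0),
}
r12 = pearson(occurrence["Superfamily 1"], occurrence["Superfamily 2"])
print(round(r12, 3))
```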
Phylogenetic Occurrence Profiles Based on Domain
Superfamily and Subfamilies in Gene3D
[Figure: a superfamily subdivided into nested clusters at 30%, 40% and 50% sequence identity]
Phylogenetic Profiles for Families and Subfamilies
Juan Ranea and Corin Yeats
Domains clustered at different levels of sequence similarity: superfamily, 30%, 40%, 50%, 60% … 100%
Phylogenetic occurrence profile matrix:
             Sp1  Sp2  Sp3  Sp4   …   Spn
Cluster 1     3    3    5    7    …    5
Cluster 2     0    2    4    5    …    4
Cluster 3     1    0    1    0    …    1
Cluster 4     0    2    0    0    …    6
Cluster 5     1    0    2    1    …    0
Cluster 6     0    3    1    2    …    1
Cluster 7     0    0    0    1    …    2
...
Cluster n     0    1    0    1    …    0
Comparison of Pairs of
Phylogenetic Profiles
             Sp1  Sp2  Sp3  Sp4  Sp5   …   Spn
Cluster 1     6    9    6    9    5    …    9
Cluster 2     4    3    7    5    3    …    5
Cluster 3     1    0    1    0    2    …    1
Cluster 4     0    2    0    0    1    …    6
Cluster 5     1    4    1    4    1    …    4
Cluster 6     0    3    1    2    0    …    1
Cluster 7     4    8    4    8    4    …    8
...
Cluster n     0    1    0    1    1    …    0
[Figure: occurrence profiles across Sp1 … Spn (y axis 5 to 10) plotted in pairs for Cluster 1 with Cluster 2, Cluster 5 and Cluster 7, with the distances E1 and E2 marked]
Euclidean distance: E1 >> E2
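The pairwise comparison can be reproduced on the visible columns of the matrix above (Sp1 to Sp5 and Spn), as read from the flattened slide. Which pair of clusters the slide labels E1 and which E2 is not recoverable from the text, so the mapping in the comments is an assumption. The distance is computed on raw counts; any normalisation used in the original analysis is not shown here.

```python
import math

# Visible columns (Sp1..Sp5, Spn) of three rows of the matrix.
cluster_1 = (6, 9, 6, 9, 5, 9)
cluster_2 = (4, 3, 7, 5, 3, 5)
cluster_7 = (4, 8, 4, 8, 4, 8)

# Assumed reading: E1 is the dissimilar pair, E2 the similar pair.
e1 = math.dist(cluster_1, cluster_2)
e2 = math.dist(cluster_1, cluster_7)

print(f"Cluster 1 vs Cluster 2: {e1:.2f}")
print(f"Cluster 1 vs Cluster 7: {e2:.2f}")
```

Clusters 1 and 7 rise and fall together across the species, so their distance is the smaller of the two.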
Statistical Significance of Correlated Pairs
(Comparison against 3 randomised models)
[Figure: frequency (0 to 80) of Pearson correlation coefficients in bins from (-0.3)-(-0.2) to (0.9)-(1.0), for the real matrix and for random matrices I, II and III]
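The logic of testing against randomised matrices can be sketched as follows. The slide does not say how its three random models were built, so the null model used here (shuffling each profile's counts across species independently) is an assumption, as are the function names, the 0.9 cutoff and the number of shuffles. The toy matrix reuses the visible columns of the seven clusters from the comparison slide.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson's r; 0.0 if either profile is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def count_high_pairs(matrix, cutoff=0.9):
    """Number of profile pairs with r at or above the cutoff."""
    return sum(1 for a, b in combinations(matrix, 2) if pearson(a, b) >= cutoff)


def shuffled(matrix, rng):
    """Assumed null model: permute each row's counts across species."""
    rows = []
    for row in matrix:
        row = list(row)
        rng.shuffle(row)
        rows.append(row)
    return rows


def empirical_p(matrix, cutoff=0.9, n_random=200, seed=0):
    """Observed count of high pairs, and the fraction of shuffled
    matrices that reach at least that count (add-one corrected)."""
    rng = random.Random(seed)
    real = count_high_pairs(matrix, cutoff)
    at_least = sum(count_high_pairs(shuffled(matrix, rng), cutoff) >= real
                   for _ in range(n_random))
    return real, (at_least + 1) / (n_random + 1)


toy_matrix = [
    (6, 9, 6, 9, 5, 9), (4, 3, 7, 5, 3, 5), (1, 0, 1, 0, 2, 1),
    (0, 2, 0, 0, 1, 6), (1, 4, 1, 4, 1, 4), (0, 3, 1, 2, 0, 1),
    (4, 8, 4, 8, 4, 8),
]
real_count, p_value = empirical_p(toy_matrix)
print(real_count, p_value)
```

A real matrix with far more highly correlated pairs than its shuffled counterparts, as in the figure, is what makes the correlations worth interpreting.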
Domain Associations Network from 13 Eukaryotes
[Figure: network of domain associations with labelled groups: Actin & VCP-like ATPases; DNA replication and repair; Chaperones and Cytoskeleton; DNA Topoisomerase & Elongation factor G]
[Figure: number of domain relatives (0 to 10) across species 1 to 13 for DNA topoisomerase & Elongation Factor G]
Highly correlated profiles correspond to pairs of families
with significant similarity in GO functions
[Figure: %Frq and %Sum_SS/Frq (0 to 60) for biological processes: frequency of significant GO semantic similarity scores against distances of correlated profile scores, in bins from (0)-(1) to (>=19)]
Summary
– On average 85% of domain sequences in genomes can be assigned to ~6000 domain families in CATH and Pfam
– Information on multidomain architectures (MDAs) can extend functional annotations obtained through domain-based homologies
– Specific structural templates for functional subgroups within domain families can also help in assigning functions as more structures are solved
– Analysis of Gene3D phylogenetic occurrence profiles allows detection of functional associations between families
Acknowledgements
CATH
Lesley Greene
Alison Cuff
Ian Sillitoe
Tony Lewis
Mark Dibley
Oliver Redfern
Tim Dallman
Gene3D
Corin Yeats
Sarah Addou
Russell Marsden
David Lee
Alastair Grant
Ilhem Diboun
Juan Garcia Ranea
http://www.biochem.ucl.ac.uk/bsm/cath_new
Medical Research Council, Wellcome Trust, NIH
EU funded Biosapiens, EU funded Embrace, BBSRC