Structural Phylogenomic Inference of Protein Function
Kimmen Sjölander
Extend function prediction through inclusion of structure prediction and analysis
Anti-fungal defensin
(Radish)
Drosomycin
(Drosophila)
Scorpion toxin
Predict active site & subfamily specificity positions
VirB4
• Status quo approach to protein function prediction
– Given a gene (or protein) of unknown function
• Run BLAST to find homologs
• Identify the top BLAST hit(s)
• If the score is significant, transfer the annotation
– If resources permit, predict domains using PFAM or CDD
• Problems:
– Approach fails completely for ~30% of genes
– Of those with annotations, only 3% have any supporting experimental evidence
• 97% have had functions predicted by homology alone*
– High error rate
* Based on analysis of >300K proteins in the UniProt database
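The status quo procedure above can be sketched in a few lines. This is a minimal illustration, not any particular tool's implementation: the `Hit` record, the `transfer_annotation` helper, and the 1e-5 significance cutoff are assumptions made for the example, and the hits are made-up values standing in for parsed BLAST output.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One parsed database hit (hypothetical record, not a BLAST API type)."""
    subject_id: str
    evalue: float
    annotation: Optional[str]


def transfer_annotation(hits: List[Hit], max_evalue: float = 1e-5) -> Optional[str]:
    """Return the annotation of the most significant hit, or None.

    The cutoff is an assumed value; the slides do not specify one.
    """
    significant = [h for h in hits if h.evalue <= max_evalue and h.annotation]
    if not significant:
        return None  # the approach fails: nothing to transfer
    best = min(significant, key=lambda h: h.evalue)
    return best.annotation


# Made-up hits for illustration only.
hits = [
    Hit("subject_1", 1e-40, "receptor-like kinase"),
    Hit("subject_2", 1e-12, "LRR protein"),
    Hit("subject_3", 0.5, "unrelated"),
]
print(transfer_annotation(hits))
```

Nothing in this procedure checks domain architecture or orthology, which is where the errors discussed on the following slides come from.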
Tomato Cf-2 Bioinformatics Analysis
Domain fusion and fission events complicate function prediction by homology, particularly for very common domains (e.g., LRR regions).
Domain structure analysis (e.g., PFAM) is often critical.
Tomato Cf-2 (GI:1587673)
Dixon, Jones, Keddie, Thomas, Harrison and Jones JDG
Cell (1996)
BLAST against Arabidopsis
Top BLAST hit in Arabidopsis is an RLK!
Panther
PFAM results
Errors due to domain shuffling
(sic)
Error presumably due to non-orthology of database hits used for annotation
Phylogenetic analysis suggests it’s more likely a Biogenic
Amine GPCR
Human neutral sphingomyelinase
Main sources of annotation errors:
1. Domain shuffling
2. Gene duplication (failure to discriminate between orthologs and paralogs)
3. Existing database annotation errors
Propagation of existing database annotation errors
Errors in gene structure
Contamination
Other…
Galperin and Koonin, “Sources of systematic error in functional annotation of genomes: domain rearrangement, nonorthologous gene displacement and operon disruption.” In Silico Biol. 1998
Eisen, “Phylogenomics: Improving Functional Predictions for Uncharacterized Genes by Evolutionary Analysis,” Genome Research 1998
Sjölander, “Phylogenomic inference of protein molecular function: advances and challenges,” Bioinformatics 2004
Piet Hein, Grooks
There is nothing more difficult to take in hand, more perilous to conduct, or more uncertain in its success, than to take the lead in the introduction of a new order of things.
Because the innovator has for enemies all those who have done well under the old conditions, and lukewarm defenders in those who may do well under the new.
This coolness arises partly from the incredulity of men, who do not readily believe in new things until they have had a long experience of them.
Construction of genome-scale phylogenomic libraries
Cluster genome into global homology groups
Include homologs from other species
Construct multiple sequence alignment
Predict cellular localization.
Predict protein structure
Predict key residues
Deposit book in library
Construct HMMs for the family and for individual subfamilies.
Construct phylogenetic trees.
Overlay with annotation data.
Identify subfamilies.
Retrieve key literature
Berkeley Universal Proteome
Phylogenomic Explorer
9,707 protein family “books” and 708K HMMs and expanding daily http://phylogenomics.berkeley.edu/UniversalProteome
12% identity
VirB4 TrwB structure (1E9RA)
Active site
Example Book: Voltage-gated K+ channels
SCI-PHY subfamilies supported by ML tree, and also consistent with subtype and phylogenetic distribution (only one branch of ML tree displayed)
Look up protein family “books” based on the annotations associated with any sequence.
Queries can be based on GO biological process, PFAM domains, UniProt accession numbers, etc.
What clustering methods are appropriate for inference of protein function?
What alignment methods are accurate?
How to mask?
What tree methods to use?
How to root a tree?
Can we define functional subfamilies automatically?
Fraction of superposable positions drops with evolutionary divergence
Pairwise alignment MSA-pw Sequence-profile methods
%ID     #pair  %Superpos  BLAST  ClustalW (pairwise)  Tcoffee  ClustalW (MSA-pw)  MAFFT  MUSCLE
>70     107    90.6       0.954  0.955                0.955    0.955              0.954  0.954
50-70   63     87.2       0.862  0.903                0.894    0.901              0.919  0.911
40-50   46     83.4       0.824  0.872                0.855    0.856              0.862  0.846
30-40   65     85.4       0.811  0.874                0.867    0.87               0.892  0.925
25-30   41     82.1       0.779  0.782                0.788    0.795              0.837  0.836
20-25   53     77.9       0.612  0.599                0.627    0.633              0.678  0.661
15-20 84
10-15 151
5-10
0-5
204
122
73
64.4
50.4
39.5
0.381
0.451
0.457
0.16
0.186
0.234
0.49
0.302
0.496
0.35
-0.007
-0.014
0 -0.047
0.098
-0.033
-0.049 -0.051
-0.034
-0.024
HMM TreeHMM TreeHMM-Opt
0.951
0.954
0.96
0.903
0.855
0.899
0.868
0.727
0.904
0.855
0.892
0.866
0.728
0.554
0.351
0.578
0.387
0.572
0.363
0.075
0.096
0.085
-0.022
-0.026
-0.025
0.929
0.934
0.953
0.91
0.813
0.72
0.551
0.29
0.127
Clustering global (or glocal) homologs
Minimize profile drift
Improved alignment accuracy
Nandini Krishnamurthy, Ph.D.
Step 1: Construct SearchDB
Q=query
Construct SearchDB using PSI-BLAST against target database
Q
Step 2: Select and align core set.
Q
Inclusion criteria:
E-value 1e-10
Bi-directional coverage
MUSCLE multiple alignment (Edgar, 2003)
Step 3: Run SCI-PHY to identify subfamilies and build subfamily HMMs (SHMMs)
Q
BETE subfamily identification: Sjölander 1998
SHMM construction: Brown et al, 2004
Step 4: SHMMs compete for sequences from SearchDB. Sequences meeting criteria are aligned to their closest SHMM.
Q
Step 5: Run SCI-PHY on extended alignment to identify new subfamilies and construct SHMMs.
Q
Iterate until convergence
Q
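The inclusion criteria in Step 2 can be illustrated with a small filter. This is a sketch under stated assumptions: "bi-directional coverage" is read here as requiring the alignment to cover a large fraction of both the query and the hit, the 0.7 coverage threshold is an assumed value that the slides do not give, and `passes_inclusion` is a hypothetical helper name.

```python
def passes_inclusion(evalue: float,
                     query_len: int, hit_len: int,
                     query_aligned: int, hit_aligned: int,
                     max_evalue: float = 1e-10,
                     min_coverage: float = 0.7) -> bool:
    """Accept a hit only if it is significant and aligns along most of
    both sequences (assumed reading of 'bi-directional coverage')."""
    query_cov = query_aligned / query_len  # fraction of the query aligned
    hit_cov = hit_aligned / hit_len        # fraction of the hit aligned
    return (evalue <= max_evalue
            and query_cov >= min_coverage
            and hit_cov >= min_coverage)


# A global homolog: both sequences are mostly covered.
print(passes_inclusion(1e-50, 300, 320, 290, 295))
# A local match to one domain of a much longer protein: rejected.
print(passes_inclusion(1e-50, 300, 900, 290, 290))
```

Requiring coverage in both directions is what excludes hits that share only one domain with the query, the domain-shuffling case described earlier in the talk.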
Comparing FlowerPower, BLAST, PSI-BLAST and UCSC T2K
Test: Clustering global homologs
Agreement at domain structure determined by PFAM. SCOP used to cluster PFAM domains into structural equivalence classes.
Subfamily Classification In PHYlogenomics (SCI-PHY)
Seq1 LE R Y-K
Seq2 LD R FPR
Seq3 IE R YGK
Seq4 MD R F-K
Seq5 VE R YGK
Nandini Krishnamurthy, Ph.D.
Duncan Brown
Multiple sequence alignment
Phylogenetic tree & subfamily decomposition (leaf order in the tree: 5, 3, 1, 4, 2)
Agglomerative clustering
Input: MSA
Initialize: construct a profile [1] for each row in the MSA
While (#clusters > 1) {
    Join the closest [2] pair of clusters
    Re-estimate the profile [1] of the merged cluster
    Compute the encoding cost for this stage
}
/* cut tree using minimum encoding cost */
[1] Use Dirichlet mixture densities
[2] Distance function: relative entropy
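The clustering loop can be sketched as runnable code. This is a simplified illustration, not the SCI-PHY implementation: profiles are estimated with uniform add-one pseudocounts instead of Dirichlet mixture densities, the relative entropy is symmetrised and summed over columns (an assumption), and the encoding-cost cut is left out. The function names are hypothetical; the five sequences are the example alignment from the slide.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def profile(rows, pseudocount=1.0):
    """Per-column amino acid distributions; gap characters are ignored.
    Uniform pseudocounts stand in for a Dirichlet mixture prior."""
    columns = []
    for col in zip(*rows):
        counts = Counter(ch for ch in col if ch in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        columns.append([(counts[a] + pseudocount) / total for a in ALPHABET])
    return columns


def relative_entropy(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def distance(prof_a, prof_b):
    """Symmetrised relative entropy, summed over alignment columns."""
    return sum(relative_entropy(p, q) + relative_entropy(q, p)
               for p, q in zip(prof_a, prof_b))


def agglomerate(msa):
    """Join the closest pair of clusters until one remains.
    Returns the merges in order, each as a pair of name tuples."""
    clusters = {(name,): [seq] for name, seq in msa.items()}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        profs = {k: profile(clusters[k]) for k in keys}  # re-estimate profiles
        a, b = min(((x, y) for i, x in enumerate(keys) for y in keys[i + 1:]),
                   key=lambda pair: distance(profs[pair[0]], profs[pair[1]]))
        clusters[a + b] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b))
    return merges


msa = {"Seq1": "LERY-K", "Seq2": "LDRFPR", "Seq3": "IERYGK",
       "Seq4": "MDRF-K", "Seq5": "VERYGK"}
merges = agglomerate(msa)
for left, right in merges:
    print(left, "+", right)
```

Each merge corresponds to one stage of the algorithm; choosing where to cut the resulting tree requires the encoding cost, which this sketch does not compute.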
Sjölander, K. "Phylogenetic inference in protein superfamilies: Analysis of SH2 domains" Proceedings of Conference Intelligent Systems for Molecular Biology (ISMB) 1998
Detection of critical positions
• Each stage of the algorithm defines a different set of alignments, one for each cluster (“subfamily”).
• Find the point during the clustering where the encoding cost of the alignments is minimal. This defines the subfamily decomposition.
[Plot: encoding cost (y-axis) versus number of classes (x-axis, between N and 1)]
N = number of sequences; S = number of subfamilies; n_{c,1}, …, n_{c,S} are the amino acids aligned by subfamilies 1 through S at column c.
The prior is a Dirichlet mixture.
SCI-PHY analysis of selected GPCRs
Venter et al, The sequence of the human genome (2001) Science.
Sjölander, "Phylogenomic inference of protein molecular function: advances and challenges," (2004) Bioinformatics
D558
R627
D628
G629
Y221
W222
H745
Y743 A744
Elizabeth Hua-Mei Kellogg
Ryan Ritterson
Nandini Krishnamurthy
D RD E YA H
Parker JS, Roe SM, Barford D., EMBO J., 2004
Tanaka Hall, T., Structure 2005
Rivas et al, 2005
Family: 7TM GPCR, ABC Transporter, Amidohydrolase, ATPase
Subfamily: 3.5.2.2 Dihydropyrimidinase; 3.5.4.1 Cytosine deaminase; 3.5.2.3 Dihydroorotase; 3.5.1.5 Urease
1. At completely conserved positions and at subfamily gapped positions: use the match state distributions estimated for the general (family) HMM.
2. At other positions:
   1. Estimate the Dirichlet mixture density posterior for each subfamily at each position separately.
   2. Use the Dirichlet density posteriors to weight contributions from other subfamilies.
   3. Compute the amino acid distribution using weighted counts and the standard Dirichlet procedure.
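Step 2.3 (weighted counts followed by a Dirichlet posterior mean) can be illustrated at a single alignment column. This is a heavily simplified sketch: the weights are passed in by the caller rather than derived from Dirichlet mixture posteriors as in step 2.2, a single symmetric Dirichlet with parameter `alpha` replaces the mixture, and `subfamily_distribution` is a hypothetical name.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def subfamily_distribution(own_counts, other_counts, weights, alpha=1.0):
    """Amino acid distribution for one subfamily at one column.

    own_counts:   dict of residue -> count observed in this subfamily
    other_counts: list of such dicts, one per other subfamily
    weights:      assumed similarity weights, one per other subfamily
    alpha:        symmetric Dirichlet parameter (stand-in for a mixture)
    """
    combined = {a: float(own_counts.get(a, 0)) for a in ALPHABET}
    for counts, w in zip(other_counts, weights):
        for residue, n in counts.items():
            if residue in combined:  # skip gaps and unknown characters
                combined[residue] += w * n
    total = sum(combined.values()) + alpha * len(ALPHABET)
    # Posterior mean under a Dirichlet prior: (count + alpha) / total
    return {a: (combined[a] + alpha) / total for a in ALPHABET}


# Made-up counts: this subfamily has Y; one dissimilar subfamily has F,
# one similar subfamily also has Y.
dist = subfamily_distribution({"Y": 3}, [{"F": 4}, {"Y": 2}], [0.1, 0.9])
print(round(dist["Y"], 3), round(dist["F"], 3))
```

Down-weighting dissimilar subfamilies lets a small subfamily borrow statistical strength from its close relatives without having its own signal swamped at the positions where subfamilies differ.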
Brown et al,“Subfamily HMMs in functional genomics” (2005) Pacific Symposium on Biocomputing
Subfamily HMMs increase the separation between true and false positives
• 515 unique SCOP folds
• PFAM full MSAs
• Scored against Astral PDB90
1.5% error rate in subfamily classification using top-scoring SHMM
SATCHMO: Simultaneous Alignment and Tree Construction using Hidden Markov mOdels
Xia Jiang
Nandini Krishnamurthy
Duncan Brown
Michael Tung
Jake Gunn-Glanville
Bob Edgar
Edgar, R., and Sjölander, K., "SATCHMO: Sequence Alignment and Tree Construction using Hidden Markov models," Bioinformatics. 2003 Jul 22;19(11):1404-11
• Structural divergence within a superfamily means that…
– Multiple sequence alignment (MSA) is hard
– The set of alignable positions varies according to the degree of divergence
• Current MSA methods not designed to handle this variability
– Assume globally alignable, all columns (e.g. ClustalW)…
• Over-aligns, i.e. aligns regions that are not superposable
– …or identify and align only highly conserved positions (e.g., SAM software with HMM “surgery”)
• Challenge
– Different degrees of alignability in different sequence pairs, different regions
– Masking protocols are lossy: loop regions may be variable across the family but may be critical for function!
• Input: unaligned sequences
• Initialize: a profile HMM is constructed for each sequence.
• While (#clusters > 1) {
– Use profile-profile scoring to select clusters to join
– Align clusters to each other, keeping columns fixed
– Analyze joint MSA to predict which positions appear to be structurally similar; these are retained, the remainder are masked.
– Construct a profile HMM for the new masked MSA
}
• Output: tree and MSA
Assessing sequence alignment with respect to structural alignment
Xia Jiang Duncan Brown Nandini Krishnamurthy
Alignment accuracy as a function of % ID (including homologs, full-length sequences)
[Figure: alignment accuracy by percent ID bin (10-15%, 15-20%, 20-25%, 25-30%, 30-35%, 35-40%) for CLUSTALW, MUSCLE, MAFFT and SATCHMO]
Future work: Interactive specificity position identification
Catalytic residues colored red
• Enable users to select subtrees for analysis
• Identify positions conserved within each subtree, but which differentiate the two**
• Plot over MSA and on structure (if available)
** Donald and Shakhnovich, NAR 2005
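The second bullet can be sketched directly on an alignment. This is a minimal illustration assuming strict conservation (one residue per subtree at the column) and no gaps; a real tool would likely tolerate some variation. The function name is hypothetical, and the split of the five example sequences from the SCI-PHY slide into two subtrees is chosen for illustration.

```python
def specificity_positions(subtree_a, subtree_b):
    """Return 0-based columns that are conserved within each subtree
    but carry different residues in the two subtrees."""
    positions = []
    columns = zip(zip(*subtree_a), zip(*subtree_b))
    for i, (col_a, col_b) in enumerate(columns):
        res_a, res_b = set(col_a), set(col_b)
        conserved = len(res_a) == 1 and len(res_b) == 1
        if conserved and res_a != res_b and "-" not in res_a | res_b:
            positions.append(i)
    return positions


# Rows of the example MSA, grouped into two assumed subtrees.
group_a = ["LERY-K", "IERYGK", "VERYGK"]
group_b = ["LDRFPR", "MDRF-K"]
print(specificity_positions(group_a, group_b))
```

The returned column indices are what would then be plotted over the MSA and, where one is available, mapped onto the structure.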
Major challenge: Phylogenetic uncertainty
Given: A (gene of unknown function), genes B and C (characterized function)
Predict function for A.
[Figure: three alternative tree topologies relating A, B and C]
Problem: use three phylogenetic tree methods, get 3 or more trees! Change the MSA, you also change the tree…
Need: Better simulation studies, benchmark datasets
http://phylogenomics.berkeley.edu
Berkeley Phylogenomics Group
PI: Kimmen Sjölander
Nandini Krishnamurthy, Ph.D.
Duncan Brown
Sriram Sankararaman
Xia Jiang
Jake Gunn-Glanville
Lead programmer and web administrator:
Dan Kirshner
This work is supported in part by a Presidential Early Career Award for Scientists and Engineers from the NSF, and by an R01 from the NHGRI (NIH).