Exploring Functional Landscapes of Proteins via Manifold Embeddings of the Gene Ontology

IMA, UMN, 11/14/07
IPAM, UCLA, 11/29/07
Gilad Lerman
Department of Mathematics
University of Minnesota
Fundamental Problem in
Molecular Evolution
How do we quantify the relationship between
structure and function?
More specifically:
Given two protein domains, how
similar are they in terms of function?
(i.e. form a functional distance for protein
domains)
“Nothing in Biology makes sense except in the light of evolution”
Theodosius Dobzhansky (1900-1975)
• Guides the construction of our functional metric
• Relevant in interpreting our results
Evolution of This Talk…
• Background
• Framework to study Structure-Function
• Functional distance between protein domains
• Function-Structure correlation
• Convergent and divergent evolution
• What’s next?
Structure
Background
(Proteins)
• Proteins are assembled spatially out of distinct structural units
• These structural units are called protein domains
• Protein domains fold independently
Transferase (Methyltransferase) 1adm
Decomposing a Protein into its Domains
Fibronectin protein–1fnf
3-D Structure
DALI
Comparison
• Automated comparison of 3D protein structures
by 2D distance matrices
• Z-score – structure similarity score
Holm L, Sander C., JMB 1993, 233: 123-138
Function
Function: Gene Ontology (GO)
• GO Goal: controlled vocabulary of genes +
products in any organism (since 1998)
Gene Ontology: tool for the unification of biology
M. Ashburner et al. (The Gene Ontology Consortium). Nature Genet 25, 2000
• 3 structured vocabularies (species-independent)
to describe gene products in terms of:
1) biological processes
2) cellular components
3) molecular functions
• GO is friendly (google
)
GO Demonstration
Sequence-Structure-Function
• Sequences of amino acids
• Structures folding into domains (protein domains)
• Molecular Functions: Gene Ontology (GO)
Shakhnovich BE. et al. BMC Bioinformatics. 2003, 4:34
Shakhnovich BE. PLoS Comput. Biol. 2005 Jun;1(1):e9.
Similarity Measures
• Structure (protein domains): Z-scores
  Holm L, Sander C. JMB 1993, 233: 123-138
• Sequences: BLAST
  Altschul SF, et al. JMB 1990 Oct 5;215(3):403-10.
• Phylogenetic Information: MI score
  Pellegrini M, et al. Proc Natl Acad Sci U S A. 1999 Apr 13;96(8):4285-8.
• Function Scores ????
Previous Functional “Distances”?
1. Similarity measures of ontologies (individual nodes)
Lord PW et al., Bioinformatics, 2003, 19(10): 1275-1283.
• Assign local fractions $p(n)$ for each node
• $p_{ms}(n_1, n_2) = \min\{p(n)\}$ among parents $n$ of $n_1$ and $n_2$
• $\mathrm{sim}(n_1, n_2) = -\ln\{p_{ms}(n_1, n_2)\}$
2. “Distance” between protein domains (subgraphs)
Shakhnovich BE, PLoS Comput Biol, 2005 Jun;1(1):e9.
$F_{A,B} = \frac{1}{L} \sum_{i \in \{\text{functions}\}} (p_{A,i} - p_{B,i})^2$
$p_{A,i}$ / $p_{B,i}$: percentage of sequences that fold into structure A/B and are annotated as function $i$
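The node similarity of item 1 can be sketched in a few lines. This is only an illustration under assumptions: the toy ontology, the node names and the fractions p(n) are made up, and a node is treated as one of its own ancestors, which the slide does not specify.

```python
import math

def ancestors(node, parents):
    """Return the node together with all of its ancestors in the DAG."""
    seen = {node}
    stack = [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def lord_similarity(n1, n2, parents, p):
    """sim(n1, n2) = -ln(p_ms), with p_ms the minimum p(n) over shared ancestors."""
    shared = ancestors(n1, parents) & ancestors(n2, parents)
    p_ms = min(p[n] for n in shared)
    return -math.log(p_ms)

# Hypothetical toy ontology (child -> parents) and made-up fractions p(n).
toy_parents = {
    "binding": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
}
toy_p = {"root": 1.0, "binding": 0.5, "dna_binding": 0.2, "rna_binding": 0.1}

sim_value = lord_similarity("dna_binding", "rna_binding", toy_parents, toy_p)
```

In this toy case the most specific shared ancestor is "binding", so the similarity is -ln(0.5). Two nodes whose only shared ancestor is the root get similarity 0.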
Our Goal: Forming Distances
• What’s given?
GO graph & subgraphs of protein domains
• Questions:
1. How to form meaningful similarities (between
nodes)?
2. How to form distances from similarities (nodes)?
3. How to use these to form distances between
domains (subgraphs)?
Using Similarities to Create
Distances for Nodes
Machine Learning Framework:
• Given: points (nodes) $\{x_i\}_{i=1,\dots,N}$ and similarities $K(x_i, x_j)$, such that $K$ is symmetric and positive semidefinite
• Distance: $d^2(x_i, x_j) = K(x_i, x_i) + K(x_j, x_j) - 2K(x_i, x_j)$
• Interpretation: if $K(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$, then
$d^2(x_i, x_j) = \|\varphi(x_i) - \varphi(x_j)\|^2$
• $\varphi$: embedding from input to feature space ($\mathbb{R}^N$)
$K$: the kernel
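A minimal sketch of the kernel-to-distance step, assuming the kernel is available as a dense NumPy matrix. The function name and the three example points are hypothetical. A linear kernel is used only because its induced distance is the familiar Euclidean one.

```python
import numpy as np

def kernel_to_squared_distances(K):
    """d^2(x_i, x_j) = K_ii + K_jj - 2 K_ij for a symmetric PSD kernel matrix K."""
    K = np.asarray(K, dtype=float)
    diag = np.diag(K)
    D2 = diag[:, None] + diag[None, :] - 2.0 * K
    return np.maximum(D2, 0.0)  # clip tiny negative values caused by round-off

# Hypothetical example: linear kernel K = X X^T on three made-up points.
X_points = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]])
K_linear = X_points @ X_points.T
D2_linear = kernel_to_squared_distances(K_linear)
```

For the linear kernel the result is the matrix of squared Euclidean distances between the points, which gives a quick sanity check of the formula.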
The mapping φ
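It can be obtained in either of two ways:
1. Find the eigenpairs $(u_1, \lambda_1), \dots, (u_N, \lambda_N)$ of $K$ and set
$(\varphi(x_i))_j = (u_j)_i \sqrt{\lambda_j}, \quad j = 1, \dots, N$
• Note that then $K(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$
2. Form the RKHS induced by $K$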
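Option 1 can be illustrated numerically. The sketch below assumes a small, made-up positive definite matrix and scales each eigenvector by the square root of its eigenvalue. The names `feature_map` and `K_psd` are hypothetical.

```python
import numpy as np

def feature_map(K):
    """Rows are phi(x_i); entry (i, j) equals (u_j)_i * sqrt(lambda_j)."""
    eigenvalues, eigenvectors = np.linalg.eigh(K)   # K is assumed symmetric
    eigenvalues = np.clip(eigenvalues, 0.0, None)   # guard against round-off
    return eigenvectors * np.sqrt(eigenvalues)      # scales column j by sqrt(lambda_j)

# Made-up symmetric positive definite kernel matrix.
K_psd = np.array([[2.0, 1.0, 0.0],
                  [1.0, 2.0, 1.0],
                  [0.0, 1.0, 2.0]])
Phi = feature_map(K_psd)

# Inner products of the embedded points reproduce the kernel.
K_rebuilt = Phi @ Phi.T
```

As the remark on the next slide says, the talk itself only uses the kernel K, so the explicit embedding serves for interpretation and visualization.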
The “manifold embedding”
→”φ”→
Remark: we do not use ”φ”, only the kernel K
Figure by Todd Wittman (mani)
→”φ”→
Figure by Coifman & Lafon
How to Assign Similarities?
• Local/ad hoc similarity
• Global similarity:
obtained by propagating local similarities
(diffusion on graph mimicking evolutionary process)
Forming a Diffusion Kernel
$n_{ij}$: number of domains (subgraphs) shared by nodes $i$ and $j$
Local similarity weight: $w_{ij} = 1 / n_{ij}$
Degree matrix $D$ (diagonal): $d_{ii} = \sum_j w_{ij}$
Transition probabilities: $P := D^{-1} W$
Symmetric version of $P$: $\tilde{P} := D^{-1/2} W D^{-1/2}$
$K_m = \tilde{P}^{2m}$
$K_m$ is a diffusion kernel with parameter $m$
Fiedler M. 1975. Czech. Math. Journal, 25: 619-633
Chung F. 1997 (book). AMS
Kondor R, Lafferty JD. ICML 2002: 315-322
Belkin M, Niyogi P. Tech Report 2002, U. Chicago
Ham J. et al. ICML 2004: 369-376
Coifman et al. PNAS 2005, 102 (21): 7426
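The construction on this slide can be sketched with NumPy. Assumptions are flagged in the comments: the shared-domain counts are made up, pairs sharing no domain get weight 0, self-loops are dropped, and m is taken to be a positive integer. None of this is stated on the slide.

```python
import numpy as np

def weights_from_shared_counts(n):
    """w_ij = 1 / n_ij where n_ij > 0; assumption: weight 0 otherwise, no self-loops."""
    n = np.asarray(n, dtype=float)
    W = np.zeros_like(n)
    mask = n > 0
    W[mask] = 1.0 / n[mask]
    np.fill_diagonal(W, 0.0)
    return W

def transition_matrix(W):
    """P = D^{-1} W (row-stochastic)."""
    return W / W.sum(axis=1, keepdims=True)

def diffusion_kernel(W, m):
    """K_m = (D^{-1/2} W D^{-1/2})^{2m}, assuming integer m and positive degrees."""
    inv_sqrt_d = 1.0 / np.sqrt(W.sum(axis=1))
    P_sym = inv_sqrt_d[:, None] * W * inv_sqrt_d[None, :]
    return np.linalg.matrix_power(P_sym, 2 * m)

# Made-up shared-domain counts for three annotation nodes.
n_toy = np.array([[0, 1, 0],
                  [1, 0, 2],
                  [0, 2, 0]])
W_toy = weights_from_shared_counts(n_toy)
K_toy = diffusion_kernel(W_toy, m=1)
```

Raising the symmetric matrix to an even power is what makes the kernel positive semidefinite, which the distance construction requires.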
Forming a Diffusion Distance
• Formally:
$K_m = \tilde{P}^{2m}$ ($\tilde{P}$: symmetric version of $P$)
$d_m^2(x_i, x_j) = K_m(x_i, x_i) + K_m(x_j, x_j) - 2K_m(x_i, x_j)$
• Interpretation:
$d_m^2(x_i, x_j) = \sum_{k=1}^{N} \left( K_{m/2}(x_i, x_k) - K_{m/2}(x_j, x_k) \right)^2$
It describes the rate of connectivity between vertices according to paths of length $m$
Szummer M, Jaakkola T. NIPS 2001, 14
Ham J. et al. ICML 2004: 369-376
Coifman et al. PNAS 2005, 102 (21): 7426
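The formal definition and the interpretation agree because the symmetric matrix raised to the power 2m is the product of its m-th power with itself. The sketch below checks this numerically on a made-up weight matrix. The function names are hypothetical and m is assumed to be a positive integer.

```python
import numpy as np

def symmetric_transition(W):
    """D^{-1/2} W D^{-1/2} for a symmetric weight matrix with positive degrees."""
    inv_sqrt_d = 1.0 / np.sqrt(W.sum(axis=1))
    return inv_sqrt_d[:, None] * W * inv_sqrt_d[None, :]

def diffusion_sq_distance(W, m):
    """Formal definition: K_m(i,i) + K_m(j,j) - 2 K_m(i,j), with K_m = P_sym^(2m)."""
    K_m = np.linalg.matrix_power(symmetric_transition(W), 2 * m)
    diag = np.diag(K_m)
    return diag[:, None] + diag[None, :] - 2.0 * K_m

def diffusion_sq_distance_rows(W, m):
    """Interpretation: squared Euclidean distance between rows of K_{m/2} = P_sym^m."""
    K_half = np.linalg.matrix_power(symmetric_transition(W), m)
    diff = K_half[:, None, :] - K_half[None, :, :]
    return (diff ** 2).sum(axis=2)

# Made-up symmetric weights on four nodes.
W_demo = np.array([[0.0, 1.0, 0.5, 0.0],
                   [1.0, 0.0, 1.0, 0.0],
                   [0.5, 1.0, 0.0, 2.0],
                   [0.0, 0.0, 2.0, 0.0]])
formal = diffusion_sq_distance(W_demo, m=3)
by_rows = diffusion_sq_distance_rows(W_demo, m=3)
```

Both functions return the same matrix up to round-off. Row i of the half-step kernel acts as the coordinates of node i after m diffusion steps.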
Another Diffusion Distance
• Another kernel…
$L^{\dagger} = (D - W)^{\dagger} = (I - P)^{\dagger} D^{-1}$ ($\dagger$: pseudoinverse)
• The distance corresponding to this kernel,
$d_{L^{\dagger}}^2(x_i, x_j) = L^{\dagger}(x_i, x_i) + L^{\dagger}(x_j, x_j) - 2L^{\dagger}(x_i, x_j)$,
is the expected time to travel from one vertex to another and then back again
Coifman et al. PNAS 2005, 102 (21): 7426
Ham J. et al. ICML 2004: 369-376
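A small sketch of this second distance, assuming the kernel is the Moore-Penrose pseudoinverse of the graph Laplacian D - W. The example graph is a made-up path on three nodes with unit weights. The quantity computed is the effective resistance between nodes, and the expected round-trip time of the random walk equals it multiplied by the sum of all degrees.

```python
import numpy as np

def laplacian_pinv_sq_distance(W):
    """Squared distance induced by the pseudoinverse of the Laplacian L = D - W."""
    W = np.asarray(W, dtype=float)
    laplacian = np.diag(W.sum(axis=1)) - W
    L_pinv = np.linalg.pinv(laplacian)
    diag = np.diag(L_pinv)
    return diag[:, None] + diag[None, :] - 2.0 * L_pinv

# Made-up example: path graph 0 - 1 - 2 with unit weights.
W_path = np.array([[0.0, 1.0, 0.0],
                   [1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0]])
R_path = laplacian_pinv_sq_distance(W_path)

# Expected commute times: scale by the total degree (volume) of the graph.
commute_times = W_path.sum() * R_path
```

On the path graph the end nodes are twice as far apart as neighbouring nodes, matching the intuition that a random walker needs longer to go there and back.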
“Distances” Between Domains
Given: $d(x, y)$, the diffusion distance between annotations $x$ and $y$
Compute:
$d(x, A)$, the distance between annotation $x$ and domain $A$:
$d(x, A) = \min_{y \in A} d(x, y)$
$d(A, B)$, the “distance” between domains $A$ and $B$:
$d(A, B) = \frac{1}{2} \left[ \frac{1}{|B|} \sum_{x_i \in B} d^2(x_i, A) + \frac{1}{|A|} \sum_{x_i \in A} d^2(x_i, B) \right]$
Dubuisson MP, Jain AK. IAPR 1994. 566-568.
Memoli F, Sapiro G. Found. Comput. Math. 2005. 313-347.
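A sketch of the two steps, reading the domain formula as half the sum of two averaged squared annotation-to-domain distances. The annotations here are made-up numbers on a line and the node distance is their absolute difference, standing in for the diffusion distance. All names are hypothetical.

```python
def annotation_to_domain(x, domain, d):
    """d(x, A): distance from annotation x to the closest annotation in the domain."""
    return min(d(x, y) for y in domain)

def domain_distance(A, B, d):
    """Average the squared distances from B to A and from A to B, then halve the sum."""
    from_b = sum(annotation_to_domain(x, A, d) ** 2 for x in B) / len(B)
    from_a = sum(annotation_to_domain(x, B, d) ** 2 for x in A) / len(A)
    return 0.5 * (from_b + from_a)

def toy_distance(x, y):
    """Stand-in for the diffusion distance between two annotations."""
    return abs(x - y)

# Made-up domains, each given as a list of annotation "positions".
domain_a = [0.0, 1.0]
domain_b = [1.0, 3.0]
f_ab = domain_distance(domain_a, domain_b, toy_distance)
```

Averaging over annotations, rather than taking the worst case as in the classical Hausdorff distance, keeps a single unusual annotation from dominating the comparison of two domains.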
Quick Summary
• Formed diffusion distance between
functional annotation (nodes)
• Formed functional distances between
protein domains (subgraphs)
What’s Next
• We put those distances in context with the
geometric structure
• We indicate how those distances can infer
evolutionary information
Comparisons
Functional Domain Universe Graph
FDUG:
• Connect all pairs of nodes (domains) with functional distance < Fmax
• Color the nine most commonly occurring folds (use SCOP)
• Identify main functional domains, e.g. B: DNA Binding, C: RNA Binding, D: Exonucleases, E: Transcription Factors
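The first step of building the graph amounts to thresholding the matrix of functional distances. A minimal sketch, with made-up domain names, distances and threshold:

```python
def build_fdug_edges(domains, F, f_max):
    """Return an edge for every pair of domains with functional distance below f_max."""
    edges = []
    for i in range(len(domains)):
        for j in range(i + 1, len(domains)):
            if F[i][j] < f_max:
                edges.append((domains[i], domains[j]))
    return edges

# Hypothetical domains and a made-up symmetric matrix of functional distances.
toy_domains = ["dom_a", "dom_b", "dom_c"]
toy_F = [[0.0, 0.1, 0.9],
         [0.1, 0.0, 0.8],
         [0.9, 0.8, 0.0]]
toy_edges = build_fdug_edges(toy_domains, toy_F, f_max=0.5)
```

Only the pair with distance 0.1 falls below the threshold here, so the toy graph has a single edge and one isolated node. Coloring nodes by fold would be a separate lookup on top of this edge list.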
Observation
• Domains sharing a fold classification form clusters with common functions
• Domains with related functions are proximal
Figure labels: B: DNA Binding, C: RNA Binding, D: Exonucleases, E: Transcription Factors, H: Oxidoreductases, I: Dehydrogenases
Traversing the FDUG
1hlv Centromere Binding Protein
1gdt Site Specific Resolvase
2hdd Engrailed Transcription Factor
Divergent Evolution
• Biological characteristics with a common evolutionary
origin that have diverged over evolutionary time.
• Previous example may indicate a case of divergent
evolution (common ancestry)
Convergent Evolution
• Definition in evolutionary biology: organisms acquiring similar characteristics while evolving in separate and sometimes varying ecosystems
• In molecular evolution: two proteins with no apparent homology performing the same function
• We may identify such cases by searching for low F-scores (small functional distances) and low Z-scores (large structural distances)
• Example: convergence of tRNA synthetases
1pys and 1a8h, F-score = .001, Z-score < 2
• This example is well-documented
Mosyak L. et al. Nat Struct Biol 1995, 2:537-47
Sugiura I. et al. Nucleic Acids Res 2004 32, D189-92
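The search described above is a simple filter over domain pairs. The sketch below uses made-up pair names and scores. The Z cutoff of 2 mirrors the example on the slide, while the F cutoff is an arbitrary assumption.

```python
def convergent_candidates(scored_pairs, f_max, z_max):
    """Keep pairs that are functionally close (low F) yet structurally dissimilar (low Z)."""
    return [(a, b) for a, b, f_score, z_score in scored_pairs
            if f_score < f_max and z_score < z_max]

# Hypothetical pairs: (domain 1, domain 2, F-score, Z-score).
toy_pairs = [
    ("p1", "p2", 0.001, 1.5),   # similar function, different structure
    ("p3", "p4", 0.002, 12.0),  # similar function, similar structure
    ("p5", "p6", 0.400, 1.0),   # different function, different structure
]
candidates = convergent_candidates(toy_pairs, f_max=0.01, z_max=2.0)
```

Pairs that pass the filter are only candidates. As with the tRNA synthetase example, a claim of convergence still needs independent evidence that the proteins lack common ancestry.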
Summary
• Defined a distance between protein functions
(nodes) and functional distance between
protein domains (subgraphs)
• Shown correlation with structure, sequence
and phylogeny
• Explored structure-function relation via
FDUG (functional domain universe graph)
• Indicated examples of divergent and
convergent evolution
Some Future Projects
• Extension to cellular components and processes
and their use in quantitative research of
convergent evolution
• Infer function from structure (or vice versa) via
supervised/semisupervised learning.
Another Direction: Hybrid Linear Modeling
Very Recent Interests in Evolution
• Study of evolution of transcriptional response to osmotic stress
• Applying recent tools of knowledge discovery
Thanks
Contact: lerman@umn.edu
Supplementary webpage:
http://www.math.umn.edu/~lerman/supp/protein_distance/
Collaborator:
Borya Shakhnovich, O’shea Lab, Harvard
Support: NSF
Thanks:
• IPAM (Mark Green) for 2003 proteomics workshop
• R.R. Coifman (Yale), S. Lafon (Google), M. Maggioni (Duke)
• Organizers of current workshop
Embedding Annotations on
top 2 coordinates
Embedding Protein Domains