Exploring Functional Landscapes of Proteins via Manifold Embeddings of the Gene Ontology

IMA, UMN, 11/14/07
IPAM, UCLA, 11/29/07

Gilad Lerman
Department of Mathematics
University of Minnesota

Fundamental Problem in Molecular Evolution
How do we quantify the relationship between structure and function?
More specifically: given two protein domains, how similar are they in terms of function? (i.e., form a functional distance for protein domains)

“Nothing in Biology makes sense except in the light of evolution”
Theodosius Dobzhansky (1900-1975)
• Guides the construction of our functional metric
• Relevant in interpreting our results

Evolution of This Talk…
• Background
• Framework to study Structure-Function
• Functional distance between protein domains
• Function-Structure correlation
• Convergent and divergent evolution
• What’s next?

Structure

Background (Proteins)
• Proteins are assembled spatially out of distinct structural units
• These structural units are called protein domains
• Protein domains fold independently
Figure: Transferase (Methyltransferase), 1adm

Decomposing a Protein into its Domains
Figure: Fibronectin protein, 1fnf

3-D Structure: DALI Comparison
• Automated comparison of 3D protein structures by 2D distance matrices
• Z-score – structure similarity score
Holm L, Sander C. JMB 1993, 233: 123-138

Function

Function: Gene Ontology (GO)
• GO Goal: controlled vocabulary of genes + products in any organism (since 1998)
  Gene Ontology: tool for the unification of biology. M. Ashburner et al. (the Gene Ontology Consortium). Nature Genet 25, 2000
• 3 structured vocabularies (species-independent) to describe gene products in terms of:
  1) biological processes
  2) cellular components
  3) molecular functions
• GO is friendly (google )

GO Demonstration

Sequence-Structure-Function
Diagram: Sequences of amino acids, folding into domains; Structures (protein domains); Molecular Functions (Gene Ontology (GO))
Shakhnovich BE et al. BMC Bioinformatics 2003, 4:34
Shakhnovich BE. PLoS Comp. Biol. 2005 Jun;1(1):e9

Similarity Measures
• Structure (protein domains): Z-scores
• Sequences: BLAST
• Phylogenetic Information: MI score
• Function Scores: ????
Holm L, Sander C. JMB 1993, 233: 123-138
Altschul SF et al. JMB 1990 Oct 5;215(3):403-10
Pellegrini M et al. Proc Natl Acad Sci U S A. 1999 Apr 13;96(8):4285-8

Previous Functional “Distances”?
1. Similarity measures of ontologies (individual nodes)
   Lord PW et al. Bioinformatics 2003, 19(10): 1275-1283
   • Assign local fractions $p(n)$ for each node
   • $p_{ms}(n_1, n_2) = \min\{p(n)\}$ among the common parents $n$ of $n_1$ and $n_2$, and $\mathrm{sim}(n_1, n_2) = -\ln p_{ms}(n_1, n_2)$
2. “Distance” between protein domains (subgraphs)
   Shakhnovich BE. PLoS Comput Biol 2005 Jun;1(1):e9
   $F_{A,B} = \frac{1}{L} \sum_{i \in \{\text{functions}\}} (p_{A,i} - p_{B,i})^2$
   $p_{A,i}$ / $p_{B,i}$ – percentage of sequences that fold into structure A / B and are annotated as function $i$

Our Goal: Forming Distances
• What’s given? GO graph & subgraphs of protein domains
• Questions:
  1. How to form meaningful similarities (between nodes)?
  2. How to form distances from similarities (nodes)?
  3. How to use these to form distances between domains (subgraphs)?

Using Similarities to Create Distances for Nodes
Machine Learning Framework:
• Given: points (nodes) $\{x_i\}_{i=1,\dots,N}$ and similarities $K(x_i, x_j)$, such that $K$ is symmetric and positive (semidefinite)
• Distance: $d^2(x_i, x_j) = K(x_i, x_i) + K(x_j, x_j) - 2K(x_i, x_j)$
• Interpretation: if $K(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$, then $d^2(x_i, x_j) = \|\varphi(x_i) - \varphi(x_j)\|^2$
• $\varphi$ – embedding from input space to feature space ($\mathbb{R}^N$); $K$ – the kernel

The mapping φ
It can be obtained in either of two ways:
1. Find the eigenpairs $(u_1, \lambda_1), \dots, (u_N, \lambda_N)$ of $K$ and set
   $(\varphi(x_i))_j = (u_j)_i \sqrt{\lambda_j}, \quad j = 1, \dots, N$
   • Note that $K(x_i, x_j) = \langle \varphi(x_i), \varphi(x_j) \rangle$
2. Form the RKHS induced by $K$

The “manifold embedding”
Figures: two illustrations of the map “φ” (figure by Todd Wittman (mani); figure by Coifman & Lafon)
Remark: we do not use “φ”, only the kernel $K$

How to Assign Similarities?
• Local/ad hoc similarity
• Global similarity: obtained by propagating local similarities (diffusion on graph mimicking evolutionary process)

Forming a Diffusion Kernel
• $n_{ij}$ – number of domains (subgraphs) shared by nodes $i$ and $j$
• Local similarity weight: $w_{ij} = 1/n_{ij}$
• Degree matrix $D$ (diagonal): $d_{ii} = \sum_j w_{ij}$
• Transition probabilities: $P := D^{-1} W$
• Symmetric version of $P$: $\tilde{P} := D^{-1/2} W D^{-1/2}$
• $K_m = \tilde{P}^{2m}$ is a diffusion kernel with parameter $m$
Fiedler M. 1975. Czech. Math. Journal, 25: 619-633
Chung F. 1997 (book). AMS
Kondor R, Lafferty JD. ICML 2002: 315-322
Belkin M, Niyogi P. Tech Report 2002, U. Chicago
Ham J. et al. ICML 2004: 369-376
Coifman et al. PNAS 2005, 102 (21): 7426

Forming a Diffusion Distance
• Formally: $K_m = \tilde{P}^{2m}$ and
  $d_m^2(x_i, x_j) = K_m(x_i, x_i) + K_m(x_j, x_j) - 2K_m(x_i, x_j)$
• Interpretation:
  $d_m^2(x_i, x_j) = \sum_{k=1}^{N} \left( K_{m/2}(x_i, x_k) - K_{m/2}(x_j, x_k) \right)^2$
  It describes the rate of connectivity between vertices according to paths of length $m$
(Szummer M, Jaakkola T. NIPS 2001, 14)
(Ham J. et al. ICML 2004: 369-376)
Coifman et al. PNAS 2005, 102 (21): 7426

Another Diffusion Distance
• Another kernel: $L^{\dagger} = (D - W)^{\dagger} = (I - P)^{\dagger} D^{-1}$ († denotes the pseudo-inverse)
• The corresponding distance to this kernel,
  $d_L^2(x_i, x_j) = L^{\dagger}(x_i, x_i) + L^{\dagger}(x_j, x_j) - 2L^{\dagger}(x_i, x_j)$,
  is (up to a constant factor) the expected time to travel from one vertex to another and then back again
Coifman et al. PNAS 2005, 102 (21): 7426
Ham J. et al. ICML 2004: 369-376

“Distances” Between Domains
Given: $d(x, y)$ – diffusion distance between annotations $x$ and $y$
Compute:
• $d(x, A)$ – distance between annotation $x$ and domain $A$:
  $d(x, A) = \min_{y \in A} d(x, y)$
• $d(A, B)$ – “distance” between domains $A$ and $B$:
  $d(A, B) = \frac{1}{2} \left[ \frac{1}{|B|} \sum_{x_i \in B} d^2(x_i, A) + \frac{1}{|A|} \sum_{x_i \in A} d^2(x_i, B) \right]$
Dubuisson MP, Jain AK. IAPR 1994. 566-568
Memoli F, Sapiro G. Found. Comput. Math. 2005. 313-347

Quick Summary
• Formed diffusion distances between functional annotations (nodes)
• Formed functional distances between protein domains (subgraphs)
What’s Next
• We put those distances in context with the geometric structure
• We indicate how those distances can be used to infer evolutionary information

Comparisons

Functional Domain Universe Graph
FDUG:
• Connect all pairs of domains (vertices) with functional distance < $F_{\max}$
• Color the top nine commonly occurring folds (use SCOP)
• Identify main functional domains, e.g. B: DNA Binding, C: RNA Binding, D: Exonucleases, E: Transcription Factors

Observation
• Domains sharing a fold classification form clusters with common functions
• Domains with related functions are proximal
Figure labels: B: DNA Binding, C: RNA Binding, D: Exonucleases, E: Transcription Factors, H: Oxidoreductases, I: Dehydrogenases

Traversing the FDUG
• 1hlv: Centromere Binding Protein
• 1gdt: Site Specific Resolvase
• 2hdd: Engrailed Transcription Factor

Divergent Evolution
• Biological characteristics with a common evolutionary origin that have diverged over evolutionary time
• The previous example may indicate a case of divergent evolution (common ancestry)

Convergent Evolution
• Definition in evolutionary biology: organisms acquiring similar characteristics while evolving in separate and sometimes varying ecosystems
• In molecular evolution: two proteins with no apparent homology performing the same function
• We may identify such cases by searching for low F-scores (small functional distance) together with low Z-scores (large structural distance)
• Example: convergence of tRNA synthetases 1pys and 1a8h, F-score = .001, Z-score < 2
• This example is well-documented
Mosyak L. et al. Nat Struct Biol 1995, 2:537-47
Sugiura I. et al. Nucleic Acids Res 2004, 32, D189-92

Summary
• Defined a distance between protein functions (nodes) and a functional distance between protein domains (subgraphs)
• Showed correlation with structure, sequence and phylogeny
• Explored the structure-function relation via the FDUG (functional domain universe graph)
• Indicated examples of divergent and convergent evolution

Some Future Projects
• Extension to cellular components and processes and their use in quantitative research of convergent evolution
• Infer function from structure (or vice versa) via supervised/semisupervised learning
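
As a recap of the domain-level construction summarized in the talk, the following is a minimal sketch, not the authors' implementation. It assumes a precomputed matrix `D2` of squared diffusion distances between annotations, and it assumes that the domain “distance” is the symmetrized average of squared point-to-set distances (in the spirit of the modified Hausdorff distance of Dubuisson and Jain); the exact placement of the squares is an assumption. The names `domain_distance`, `fdug_edges`, `dom1`, `dom2`, `dom3` and the toy numbers are hypothetical.

```python
import numpy as np

def domain_distance(D2, A, B):
    """Symmetrized average of squared annotation-to-domain distances.

    D2 : (N, N) array of squared distances between annotations (nodes).
    A, B : lists of node indices annotating each domain (subgraph).
    Assumption: d(x, A)^2 = min over y in A of D2[x, y].
    """
    A, B = list(A), list(B)
    mean_B_to_A = np.mean([D2[x, A].min() for x in B])
    mean_A_to_B = np.mean([D2[x, B].min() for x in A])
    return 0.5 * (mean_B_to_A + mean_A_to_B)

def fdug_edges(D2, domains, f_max):
    """Connect every pair of domains whose functional distance is below f_max."""
    names = sorted(domains)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if domain_distance(D2, domains[a], domains[b]) < f_max:
                edges.append((a, b))
    return edges

# Toy data (hypothetical): four annotations placed on a line.
positions = np.array([0.0, 1.0, 2.0, 10.0])
D2 = (positions[:, None] - positions[None, :]) ** 2

# Hypothetical domains, each a set of annotation indices.
domains = {"dom1": [0, 1], "dom2": [1, 2], "dom3": [3]}

edges = fdug_edges(D2, domains, f_max=1.0)
print(edges)
```

In this toy example only the two domains with overlapping annotations fall below the threshold; the threshold `f_max` plays the role of Fmax in the FDUG slide.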

Another Direction: Hybrid Linear Modeling

Very Recent Interests in Evolution
• Study of evolution of transcriptional response to osmotic stress
• Applying recent tools of knowledge discovery

Thanks
Contact: lerman@umn.edu
Supplementary webpage: http://www.math.umn.edu/~lerman/supp/protein_distance/
Collaborator: Borya Shakhnovich, O’Shea Lab, Harvard
Support: NSF
Thanks:
• IPAM (Mark Green) for 2003 proteomics workshop
• R.R. Coifman (Yale), S. Lafon (Google), M. Maggioni (Duke)
• Organizers of current workshop

Embedding Annotations on top 2 coordinates

Embedding Protein Domains
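
The embedding slides refer to coordinates obtained from the diffusion kernel. The following is a minimal sketch, under stated assumptions, of how such an embedding on the top 2 coordinates could be computed: it builds the symmetric normalized matrix D^{-1/2} W D^{-1/2}, raises it to the power 2m to get the kernel, forms squared kernel distances, and scales the leading eigenvectors by the square roots of their eigenvalues. The function names and the toy weight matrix `W` are hypothetical; `W` is assumed symmetric with positive row sums, and `m` is assumed to be a positive integer.

```python
import numpy as np

def diffusion_kernel(W, m):
    """K_m = P_sym^(2m), where P_sym = D^{-1/2} W D^{-1/2}."""
    W = np.asarray(W, dtype=float)
    degrees = W.sum(axis=1)                  # d_ii = sum_j w_ij
    inv_sqrt = 1.0 / np.sqrt(degrees)
    P_sym = inv_sqrt[:, None] * W * inv_sqrt[None, :]
    return np.linalg.matrix_power(P_sym, 2 * m)

def kernel_distance_sq(K):
    """d^2(x_i, x_j) = K_ii + K_jj - 2 K_ij for all pairs (i, j)."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2.0 * K

def embed(K, n_coords=2):
    """Row i holds the coordinates (u_j)_i * sqrt(lambda_j) for the largest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(K)     # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_coords]
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))  # guard tiny negative round-off
    return eigvecs[:, top] * scale

# Toy weight matrix (hypothetical): a triangle plus one weakly attached node.
W = np.array([[0.0, 1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.1],
              [0.0, 0.0, 0.1, 0.0]])

K = diffusion_kernel(W, m=2)
D2 = kernel_distance_sq(K)        # squared diffusion distances between nodes
coords = embed(K, n_coords=2)     # one 2-D point per node
print(coords)
```

Keeping all N coordinates reproduces the kernel distances exactly; truncating to two coordinates, as in the slide title, only approximates them and is meant for visualization.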