
Protein Function

Arthur M. Lesk, Bologna Winter School 2011

Mochida K, Shinozaki K. Genomics and bioinformatics resources for crop improvement. Plant Cell Physiol. 2010 Apr;51(4):497-523.

Basis of this topic

Extended central dogma:

DNA → RNA → Protein amino acid sequence → Protein structure → Protein function

(I am happy with most of this)


Basic questions:

How do proteins evolve changed or novel functions?

Given the amino acid sequences of proteins inferred from genomic sequences, how can we assign functions to them?


Genomics gives us many new protein sequences

• Often there is little experimental information about the proteins themselves
• What can we deduce about proteins from their amino acid sequences?
  – … from the amino acid sequence of one protein alone?
  – … from comparisons of amino acid sequences of related proteins from different species?
  – … from interaction networks?

What properties of proteins do we want to learn about, and how do we measure and analyse them?

• amino acid sequence
• three-dimensional structure
• FUNCTION
• expression pattern
• regulation

How do we learn these?

• amino acid sequence – genomic sequences
• three-dimensional structure – X-ray, NMR, ... modelling
• FUNCTION – experiment? inference?
• expression pattern – microarrays, sequencing
• regulation – ChIP-chip experiments

Function is difficult

• Sequence determines structure determines function
• From knowing the sequence and structure of one protein alone, can we deduce its function?
  – Identify binding site?
  – Identify catalytic residues?
  – Identify ligand?
• Analogy to drug-design problem.

Part of problem: what is function?

• It is essential that we have a good idea of what we mean by protein function
• The literature probably gives too optimistic a point of view about that
• So we will have to discuss this point fairly carefully if we are to make sense

Assignment of protein function

• Given a catalogue or catalogues of protein function, how do we assign one or more functions to some protein?

Experimental

• Direct measurement in laboratory
• Phenotypes of knockout mutants

Theoretical

• Mostly reasoning from homology

Combined

• Theory, or structural information, may give hypotheses
• Experiment can test, confirm, refine

Experimental methods: classical in vitro studies of purified proteins

• Isolate protein
• Expose to many possible substrates (often can make educated guesses from sequence)
• Explicitly verify one or more enzymatic activities
• Explore range of substrate specificity

Classically, one often had a target activity and sought the enzyme

• Working out metabolic pathways
• Radioactive tracers gave some idea of intermediates in metabolic pathway
• Separate proteins into fractions, and measure the activity of each fraction with respect to some particular reaction
• Keep doing this until the protein is pure

Function inference from phenotype of knock-out mutants

• Labour-intensive if you have to make them
• If lethal, no clue
• Has been done for yeast: knock-out strains lacking every protein individually
• For humans, nature has done a lot of experiments; the results present clinically
• Example: mutants in phenylalanine hydroxylase → failure of conversion of phenylalanine to tyrosine → disease phenylketonuria → clue to function of phenylalanine hydroxylase

In vitro measurement of function of purified protein

• Fine, but very labour-intensive
• Not suitable for genome-wide function assignment
• Ignores context
• Also: one may miss alternative functions
• What about location-dependent function?
• What about structural control by post-translational modification? (for instance, phosphorylation)

Multiple functions

• Recruitment: proteins that have more than one function
• Location-dependent function
• Function depending on post-translational modification (for instance, phosphorylation of proteins in signalling cascades)
• Function depending on temperature

Recruitment

• Recruitment: use of an active protein for an unrelated function with little or even no sequence change
• In many cases, if a new function is needed, it is easier to use or adapt a protein that already exists, rather than to create a new folding pattern
• Example: eye lens crystallins

Avian eye-lens proteins

• In the duck, crystallins have identical sequences to liver enolase and lactate dehydrogenase
• They never see the substrates in the eye
• In other birds, sequences have changed enough to lose catalytic activity. This shows that enzymatic activity is not necessary in the eye

Localization-dependent function

• Phosphoglucose isomerase = neuroleukin = autocrine motility factor = differentiation and maturation indicator
• In cytoplasm: glycolytic enzyme
• Outside cell: nerve growth factor, cytokine

Purify and measure function in vitro???

Proteinase Do = DegP

• Chaperone at low temperatures
• Proteinase at high temperatures
• Logic: under moderate stress, try to rescue proteins; under more extreme stress, give up and recycle

Ambiguity in function: a warning

• The fact that in many cases it is not possible to ascribe a unique function to a protein makes both experimental and theoretical function assignment more complicated
• Judging whether a function has been assigned correctly is also difficult

Theoretical methods

• Inference of function from sequence?
• Inference of function from structure?
• It would be very nice if we could infer function from sequence. We have lots of sequence data.
• But unquestionably structure provides much more detailed information. We might need it.

Theoretical methods

• Primary approach: reasoning by homology
• We know the functions of haemoglobin in humans
• If you sequence the aardvark genome, and find haemoglobin chains very similar to those of human haemoglobin, you are entitled to infer the function of aardvark haemoglobin
• Assumption: evolution preserves function

Relationship between protein sequence, structure and function

• Within a domain, similar sequences determine similar structures. Almost no exceptions
• Often, similar structures determine similar functions
• Many more exceptions, especially in case of:
  – Multiple functions
  – Change in domain partners
• Similar structures generally require similar sequences
• Similar functions often do NOT require similar sequences and structure

Examples

• Human, horse haemoglobin: similar sequences / similar structures / similar functions
• Malate / lactate dehydrogenases: closely-related enzymes
  – sequences of human MDH / LDH have ~20% identical residues
  – one residue change (Gln → Arg) is enough to convert Bacillus stearothermophilus LDH → MDH
  – MDH from a trichomonad is much more similar to LDHs than to other MDHs

Many families of proteases

• There are families of homologous proteases
  – Serine proteases – trypsin, chymotrypsin, elastase, thrombin
  – Cysteine proteases – papain, actinidin
  – Aspartate proteases
  – Metalloproteases
• Different families of proteinases have:
  – similar functions
  – completely different amino acid sequences
  – completely different three-dimensional structures
• Subtilisin and chymotrypsin share the catalytic triad in their mechanisms (they differ in sequence and structure)

Sequence-function relationships

• In many cases, similar proteins retain similar functions (example: mammalian globins)
• Distantly-related proteins can retain function or diverge in function
• But closely-related proteins can have very different functions
• And don’t forget recruitment: even a single protein can have multiple functions

Maybe it is time to consider how we define and classify protein functions

• A classification of protein function is a list of known protein functions
• For instance: ‘catalyses dehydrogenation of ethanol’ is a potential protein function
• Assignment of this function to horse liver alcohol dehydrogenase is a separate problem

What do we mean by protein function?

• Keep in mind that much of the literature is written from the point of view that it is possible to give a one-to-one matching of proteins with functions
• In some cases, recognition that proteins can have multiple functions takes the form of discussing “moonlighting” proteins
• Importance of context-dependent function: what is the function, in playing the piano, of the second finger on the right hand?

Much protein function is an organized set of activities among interacting proteins

• The problem is not so much that individual proteins have multiple activities
• It is that the activities of proteins are dependent on the context in which they are found, and on the activities of the other proteins, in the vicinity and beyond
• Also on the state of modification of the protein, e.g. phosphorylation
• Provided we recognize this, we can perhaps start with primary activities of individual proteins

How could we test predictions of function?


How to measure distance between functions?

• For sequences and structures, there are natural measures of divergence
  – Sequence: count identical residues
  – Structures: r.m.s.d. of well-fitting parts

(Specialists may argue about details, or propose alternatives, but basically the answers aren't too different.)
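As a rough illustration of these two "natural" measures, here is a minimal Python sketch. It assumes the two sequences are already aligned (gaps written as `-`) and that the two coordinate sets are already superposed and in one-to-one correspondence; the function names are illustrative, not taken from any particular package.

```python
import math

def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of compared alignment columns with identical residues.

    Columns in which either sequence has a gap are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two lists of (x, y, z) points.

    Assumes the structures are already optimally superposed; no fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))
```

A real structural comparison would first select the well-fitting parts and superpose them; this sketch covers only the final counting and averaging steps.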

Function: no natural measure of difference

• Derive measure from structure of schemes that classify protein functions

Classifications of protein functions


How do we describe and classify protein functions?

• First attempt at a catalogue of protein functions was the Enzyme Commission
• Third International Congress of Biochemistry (Brussels, 1955): General Assembly of the International Union of Biochemistry (IUB), in consultation with the International Union of Pure and Applied Chemistry (IUPAC), set up an International Commission on Enzymes
• They thought the problems were primarily nomenclatural

Enzyme Commission

• Terms of reference: To consider the classification and nomenclature of enzymes and coenzymes …
• Reported at the 1961 IUB meeting (Moscow)
• Results published first as a book, then on web
• Limited to description of enzyme-catalysed reactions
• Catalogue of reactions, not of enzymes

Enzyme Commission numbers

• EC numbers have four parts: 1.1.1.1
• They look suspiciously like IP addresses
• First number: general class
• General class specifies type of reaction
• Second, third, and fourth: subclasses
• Definitions of subclasses depend on the main class
• Note that EC classification is a strict hierarchy

Main Enzyme Commission Classes


Enzyme Commission / EC numbers

• (EC numbers NOT European Commission)
• Authorized by International Union of Biochemistry and Commission on Enzyme Nomenclature
• EC set up by International Union of Biochemistry in 1955.
• Report in 1961, modified 1964, several supplements since then.
• Published as book, now available on web

What does EC classify?

• Enzyme nomenclature
• Classification of reactions catalysed by enzymes
• NOT a set of assignments of function to proteins
  – That is a different task

(Note that Gene Ontology – another classification scheme – also does not assign functions to proteins)

Example of full EC Classification

• What does EC 1.1.1.1 correspond to?

an alcohol + NAD+ = an aldehyde or ketone + NADH + H+

• EC 1 Oxidoreductases
• EC 1.1 Acting on the CH-OH group of donors
• EC 1.1.1 With NAD+ or NADP+ as acceptor
  – EC 1.1.1.1 alcohol dehydrogenase
  – [Also: EC 1.1.1.2 alcohol dehydrogenase (NADP+), EC 1.1.1.3 homoserine dehydrogenase, …]
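Because the EC scheme is a strict four-level hierarchy, the nested categories of any EC number can be read off by taking successive prefixes. The short Python sketch below illustrates this; the function name `ec_levels` and the small lookup table are illustrative assumptions, with the names in the table copied from the EC 1.1.1.1 example above.

```python
def ec_levels(ec_number):
    """Return the four nested classification levels of an EC number.

    Example: "1.1.1.1" -> ["1", "1.1", "1.1.1", "1.1.1.1"]
    """
    parts = ec_number.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec_number!r}")
    return [".".join(parts[: i + 1]) for i in range(4)]

# Toy lookup table holding only the entries quoted in the example above.
EC_NAMES = {
    "1": "Oxidoreductases",
    "1.1": "Acting on the CH-OH group of donors",
    "1.1.1": "With NAD+ or NADP+ as acceptor",
    "1.1.1.1": "alcohol dehydrogenase",
}

for level in ec_levels("1.1.1.1"):
    print(f"EC {level}: {EC_NAMES.get(level, '(not in this toy table)')}")
```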

Enzyme Commission numbers

• Four-level hierarchy
• Example: isopentenyl diphosphate ∆-isomerase, EC number 5.3.3.2:
  – 5 = general category (of isomerases)
  – 5.3 = intramolecular isomerases
  – 5.3.3 = enzymes that transpose C=C bonds
  – 5.3.3.2 = specific reaction
• EC classifies reactions and names the enzymes that catalyse them; it does not name proteins.

Problems with EC numbers

• EC numbers classify reactions, not enzymes
• Assignment of enzymes to reactions is a separate issue
• Substrate specificity is not addressed
• The EC classification is restricted to enzymes
• Proteins with functions other than enzyme catalysis are not covered
• Not a criticism of EC Task Group – 1961 was a long time ago

Gene Ontology Consortium

• Michael Ashburner was leading a group with the job of annotating the D. melanogaster genome
• Decided to work out new classification scheme
• Described in his 2006 autobiographical memoir: “Won for All: How the Drosophila Genome was Sequenced”

Gene Ontology project

• Initiated by Michael Ashburner (early 1990’s).
• Has since grown, become de facto standard
• References:
  – Lewis, S.E. (2004). Gene Ontology: looking backwards and forwards. Genome Biology 6:103.
  – Ashburner, M. (2006). Won for All: How the Drosophila Genome was Sequenced. Cold Spring Harbor Laboratory Press.

Gene Ontology

• EC limited to enzymes
• Gene Ontology consortium produced new, more general classification of protein function
• Three independent categories:
  – Molecular function (overlaps EC)
  – Biological process
  – Subcellular location (cellular component)
• GO: not tree structure, directed acyclic graph

What is an ontology?

• Specification of how to describe a body of knowledge
• Nomenclature (fixed vocabulary)
• Rules of syntax of terms
• Types of relationships among entities:
  – ‘Is a’: for instance: ‘A cat is a mammal.’
  – ‘Part of’: for instance: ‘A tail is part of a cat.’
• Note that ‘A cat is a mammal. A mammal is an animal’ implies that ‘A cat is an animal’
• But ‘A tail is part of a cat. A cat is a mammal.’ does NOT imply that a tail is a mammal.

Features of Gene Ontology (GO)

• Includes all functions, not only enzymatic catalysis
• Three separate catalogues:
  – Molecular function
  – Cellular component
  – Biological process
• Form of each catalogue NOT a tree or hierarchy, but a directed acyclic graph
• Like EC, GO provides catalogues of functions, NOT assignment of proteins to categories
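The difference between ‘is a’ and ‘part of’ relationships in an ontology can be made concrete with a small sketch. The code below stores the cat example as typed edges and follows only ‘is a’ edges when asking what something is. The data structure and the function name `is_a_ancestors` are assumptions made for illustration, not the GO file format.

```python
# Toy ontology: (child, relation, parent) triples taken from the cat example.
EDGES = [
    ("cat", "is_a", "mammal"),
    ("mammal", "is_a", "animal"),
    ("tail", "part_of", "cat"),
]

def is_a_ancestors(term, edges=EDGES):
    """All terms reachable from `term` by following only 'is_a' edges.

    'is_a' is treated as transitive; 'part_of' edges are deliberately ignored,
    so being part of a cat does not make a tail a mammal.
    """
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child, relation, parent in edges:
            if child == current and relation == "is_a" and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found

print(sorted(is_a_ancestors("cat")))   # a cat is a mammal and an animal
print(sorted(is_a_ancestors("tail")))  # a tail is none of these
```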

GO example: molecular function

GO example: biological process

GO example: cellular component

GO example: molecular function

1. Molecular function = root
2. Not a tree: multiple paths between nodes
3. Edges are directed. Each edge connects a more general category to a more specific one
4. If there is a path from node A to node B, there is NOT a path from node B to node A. That is, there are no cycles.

Thus: Directed Acyclic Graph
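The properties listed above (directed edges, more than one path to a node, no cycles) can be checked mechanically. The following is a minimal sketch on a made-up graph; the term `term X` and the helper names are hypothetical, and the graph is not a copy of the real GO molecular function DAG.

```python
from collections import deque

# Hypothetical toy graph: each key is a term, each value lists its more specific children.
CHILDREN = {
    "molecular_function": ["catalytic activity", "binding"],
    "catalytic activity": ["isomerase activity", "oxidoreductase activity"],
    "binding": [],
    "isomerase activity": ["term X"],
    "oxidoreductase activity": ["term X"],
    "term X": [],
}

def is_acyclic(children):
    """Kahn's algorithm: the graph has no cycles if every node can be removed in topological order."""
    indegree = {node: 0 for node in children}
    for kids in children.values():
        for kid in kids:
            indegree[kid] += 1
    queue = deque(node for node, degree in indegree.items() if degree == 0)
    visited = 0
    while queue:
        node = queue.popleft()
        visited += 1
        for kid in children[node]:
            indegree[kid] -= 1
            if indegree[kid] == 0:
                queue.append(kid)
    return visited == len(children)

def count_paths(children, start, goal):
    """Number of distinct directed paths from start to goal (assumes the graph is acyclic)."""
    if start == goal:
        return 1
    return sum(count_paths(children, kid, goal) for kid in children[start])

print(is_acyclic(CHILDREN))                                   # no cycles in the toy graph
print(count_paths(CHILDREN, "molecular_function", "term X"))  # more than one path: not a tree
```

In a tree the path count from the root to any node would always be 1; a count above 1 is what distinguishes a DAG such as GO from a strict hierarchy such as EC.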

Are EC and GO consistent?

• Enzyme Commission identifiers form a strict four-level hierarchy, or tree. For example, isopentenyl-diphosphate Δ-isomerase is assigned EC number 5.3.3.2:
  – the initial 5 specifies the most general category, 5 = isomerases
  – 5.3 comprises intramolecular isomerases
  – 5.3.3 those enzymes that transpose C=C bonds
  – 5.3.3.2 specifies a particular reaction
• In contrast, the GO classification is not a tree, but a Directed Acyclic Graph
• [Figure: path from isopentenyl-diphosphate Δ-isomerase to the root node of the molecular function DAG]

Are EC and GO consistent?

• There are four intervening nodes, progressively more general categories as we move up the figure.
• The GO description of this enzyme as an oxidoreductase is inconsistent with the EC classification
• In the EC classification a committed choice between oxidoreductase and isomerase must be made at the highest level of the hierarchy.

isopentenyl-diphosphate ∆-isomerase: EC and GO

• EC number 5.3.3.2:
  – 5 = general category (isomerases)
  – 5.3 = intramolecular isomerases
  – 5.3.3 = enzymes that transpose C=C bonds
  – 5.3.3.2 = specific reaction

Inconsistencies

• Different databases use different versions of GO
• Different versions of different databases
• Downloaded versions of different databases may not be updated to reflect changes in parent databases
• What can be done?

Errors in databases

1. Keep them out – But how?
2. Natural language processing by computer? (Automatic: literature → database)???
3. If you find them, correct them (you = WHO?)
4. Correct them where?
   – Master copy of database?
   – What about copies? Errors propagate?
   – How to propagate corrections?

Correction of Errors in Databases?

• Eternal vigilance at each installation?
• Community involvement – curation by experts?
• Open source idea – bulletin board?
• ‘Knowbots’ running around web? Security?
• Distribute programs for ‘health checks’?

Distributed updating of databases

Park, Park & Kim (2004). Bioinformatics Appl. Note.

• Gene Ontology classification provides basis for database annotations
• Updates to GO include:
  – new terms
  – new obsoletions
  – term name changes
  – new definitions
  – new term merges
  – term movements
• These require updating of annotations

GOChase (Park, Park & Kim)

• Recommends updates (security considerations require that file changes be made locally)
• Web-based interfaces:
  – GOChase-History: evolution of GO ID
  – GOChase-Correct: suggests change
  – Health check of your database: flag problems
  – Submit GO ID: report its use in annotation in a list of common databases
• http://www.strubi.org/software/GOChase/
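The kind of check described here, tracing the history of a GO ID and flagging annotations that use merged or obsolete terms, can be sketched in a few lines. This is not GOChase's code or interface: the change log, the function names and the GO IDs below are all made up for illustration.

```python
# Hypothetical change log: old GO ID -> replacement ID, or None if the term
# was made obsolete without a successor. These IDs are invented, not real GO terms.
GO_CHANGES = {
    "GO:9000001": "GO:9000002",  # merged into another term
    "GO:9000002": "GO:9000003",  # merged again later
    "GO:9000010": None,          # obsolete, no replacement
}

def resolve_go_id(go_id, changes=GO_CHANGES):
    """Follow the chain of merges to a current ID; return None if the term is obsolete."""
    seen = set()
    while go_id in changes:
        if go_id in seen:
            raise ValueError("cycle in change log")
        seen.add(go_id)
        go_id = changes[go_id]
        if go_id is None:
            return None
    return go_id

def health_check(annotations, changes=GO_CHANGES):
    """Report outdated annotations as {protein: [(old_id, suggested_id_or_None), ...]}.

    Nothing is modified; the report only recommends changes for local review.
    """
    report = {}
    for protein, go_ids in annotations.items():
        stale = [(g, resolve_go_id(g, changes)) for g in go_ids if g in changes]
        if stale:
            report[protein] = stale
    return report

example = {"protein 1": ["GO:9000001", "GO:9000099"], "protein 2": ["GO:9000099"]}
print(health_check(example))
```

Returning a report instead of rewriting the annotations mirrors the point made above: the tool recommends, and the database owner applies the change locally.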


Function assignment from sequences

Analogy to ‘Homology Modelling’ of structure?


Function assignment from sequence

• Basic idea: homologous proteins have similar sequences
• How close must sequences be for this to be valid?
• We know how to measure divergence of sequences
• How do we measure differences in function among homologues? (Use EC or GO?)

Many similar proteins have similar functions, don't they?

• In many cases closely-related proteins have closely-related functions.
• Example: human and horse haemoglobin
  – 43 residue differences out of 287 (α+β chains, 141 + 146 residues)
  – 85% residue identity
  – SAME FUNCTION

Function assignment from homology?

• OK, if the sequences differ greatly then the function may differ
• But if the sequences are similar, the functions will be the same – WON'T THEY?
• Well, sometimes ...

Commonly used for function annotation in databases

• Proteins appear in databases when their sequences are known
• Annotation of function?
  – Experimental evidence for function
  – Transfer of function from homologue
• How well does this work?
• How can we tell?
• Requires measure of distance between functions
• Large literature reporting ‘howlers’ in function transfer

GO defines: Sources of annotation

GO categories of sources of annotation:

• IDA: Inferred from direct assay
• TAS: Traceable author statement
• IMP: Inferred from mutant phenotype
• IGI: Inferred from genetic interaction
• IPI: Inferred from physical interaction
• ISS: Inferred from sequence similarity
• IEA: Inferred from electronic annotation
• NAS: Non-traceable author statement
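These evidence codes make it possible to separate annotations backed by experiment from those that were themselves inferred, which matters when testing how well annotation transfer works. A minimal sketch follows; the grouping of IDA, IMP, IGI and IPI as experimental follows GO's own evidence-code categories, while the annotation records and the function name are hypothetical.

```python
# Evidence codes listed above, with their meanings.
EVIDENCE_CODES = {
    "IDA": "Inferred from direct assay",
    "TAS": "Traceable author statement",
    "IMP": "Inferred from mutant phenotype",
    "IGI": "Inferred from genetic interaction",
    "IPI": "Inferred from physical interaction",
    "ISS": "Inferred from sequence similarity",
    "IEA": "Inferred from electronic annotation",
    "NAS": "Non-traceable author statement",
}

# Codes from the list above that denote experimental evidence.
EXPERIMENTAL = {"IDA", "IMP", "IGI", "IPI"}

def experimental_only(annotations):
    """Keep only (protein, go_id, evidence_code) records with an experimental evidence code."""
    return [record for record in annotations if record[2] in EXPERIMENTAL]

# Hypothetical annotation records; protein names and GO IDs are made up.
records = [
    ("protein 1", "GO:9000003", "IDA"),
    ("protein 2", "GO:9000003", "ISS"),
    ("protein 3", "GO:9000004", "IEA"),
    ("protein 4", "GO:9000004", "IMP"),
]
print(experimental_only(records))
```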

GO does not do function assignment

• Emphasise: GO is a classification of protein functions.
• It does not itself assign functions to proteins
• That is a separate job
• However, GO makes available a controlled vocabulary about sources of function assignment for use by people who do assign function

Sources of Annotation: Experiment / Inferred From: Thomas, P.D., Mi, H. & Lewis, S. (2007). Curr. Opin. Chem. Biol. 11, 4-11.


‘Homology modelling’ of function?

• Sequence determines structure determines function
• Small changes in sequence produce small changes in structure
• BUT: dependence of function on sequence (and even on structure) doesn't have simple ‘topology’

Similar sequences produce similar structures

Two questions

1. How does protein function diverge as amino acid sequence diverges?

2. How can we evaluate the accuracy of transfer of annotation among homologous proteins?

Problems associated with question 2 make question 1 harder to answer

How do proteins change function as their sequences diverge?

• Divergence v. recruitment
• Divergence:
  – Change in specificity (chymotrypsin, trypsin)
  – Change in regulation (myoglobin, haemoglobin)
  – Related functions with similar mechanisms (adaptation of catalytic site) (Gerlt & Babbitt)

Gene duplication and divergence

• General way to develop new functions
• Very old theory about how metabolic pathways developed – new protein developed to provide substrate for current initial step:
  – Now growing on B (B → C → D … → ATP)
  – Medium runs out of B.
  – B → C enzyme duplicates, diverges to catalyze A → B
  – Now you can grow on A (A → B → C → D … → ATP)
• Attractive because:
  – B → C enzyme has binding site for B
  – explains gene organization in operon

WRONG: the mechanism of A → B is in general different from that of B → C, and needs a different structure and different catalytic residues

Simple approach: predict function by sequence similarity

• Given a protein of known sequence and unknown structure and function
• Find the protein of known function that has the highest sequence similarity to the unknown
• Assign to the unknown protein the function(s) of the known protein

How well does this work?
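The simple approach amounts to a nearest-neighbour lookup, as the sketch below shows. It assumes, purely for illustration, that the query and database sequences are already aligned and of equal length, and it uses fraction of identical positions as the similarity score; a real pipeline would use a sequence search program. All names, sequences and functions here are made up.

```python
def identity(seq_a, seq_b):
    """Fraction of identical positions between two pre-aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(seq_a, seq_b)) / len(seq_a)

def transfer_function(query, annotated):
    """Return (best_hit_name, identity_to_best_hit, transferred_function).

    `annotated` maps a protein name to (aligned_sequence, known_function).
    The query simply inherits the function of its most similar annotated protein.
    """
    best = max(annotated, key=lambda name: identity(query, annotated[name][0]))
    best_seq, function = annotated[best]
    return best, identity(query, best_seq), function

# Hypothetical database of annotated proteins.
database = {
    "protein A": ("ACDEFGHIKV", "function 1"),
    "protein B": ("ACDQQQQQQQ", "function 2"),
}
print(transfer_function("ACDEFGHIKL", database))
```

Note that the method always returns an answer, however low the best identity is. Reporting the identity alongside the transferred function at least lets the user apply a threshold.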

How to measure how well a function assignment method is working?

If you are predicting EC numbers, a simple metric: how many of the 4 components of the EC number did you get right?

• Recall: EC 1.1.1.1 corresponds to: an alcohol + NAD+ = an aldehyde or ketone + NADH + H+
  – EC 1 Oxidoreductases
  – EC 1.1 Acting on the CH-OH group of donors
  – EC 1.1.1 With NAD+ or NADP+ as acceptor
  – EC 1.1.1.1 alcohol dehydrogenase (NAD+)
  – EC 1.1.1.2 alcohol dehydrogenase (NADP+)
  – EC 1.1.1.3 homoserine dehydrogenase

Several groups have measured the relationship between sequence divergence and functional divergence using the EC classification

• Example: Todd, Orengo & Thornton, JMB 2001
• For enzymes with sequence identity > 40%, all four EC numbers conserved
• With sequence identity > 30%, three levels of EC numbers conserved for 70% of pairs
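The "number of EC levels conserved" used in such comparisons can be computed by counting leading fields that agree. The sketch below is one straightforward way to do it; the function name is an assumption. Counting stops at the first mismatch, because in a strict hierarchy a match at a lower level means nothing once a higher level differs.

```python
def ec_levels_conserved(ec_a, ec_b):
    """Number of leading EC fields (0 to 4) shared by two EC numbers."""
    count = 0
    for field_a, field_b in zip(ec_a.split("."), ec_b.split(".")):
        if field_a != field_b:
            break
        count += 1
    return count

print(ec_levels_conserved("1.1.1.1", "1.1.1.2"))  # same first three levels
print(ec_levels_conserved("1.1.1.1", "5.3.3.2"))  # different main class
```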

C.A. Wilson, J. Kreychman & M. Gerstein

Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J Mol Biol. 2000 Mar 17;297(1):233-49.

• One of many such studies
• How well does sequence similarity predict EC numbers?
• Expect:
  – very close sequences: get all four components of EC number correct
  – distant sequences: get fewer components right

From Wilson, Kreychman & Gerstein

[Figure. Curves shown: general similarity; non-enzymes with same functional class; enzymes with same functional class; non-enzymes with same precise function; enzymes with the same precise function.]

Take-home message:

• Wilson, Kreychman & Gerstein come to conclusions similar to Todd, Orengo & Thornton: for single-domain proteins, enzyme function, as defined by the first three EC numbers, is almost completely conserved above a sequence identity threshold of 40%

Extension of sequence-function correlation to GO

• Several groups have measured relationship between sequence divergence and functional divergence using EC classification
• How to define metric on functions for GO?
• Try using distal GO-IDs
• How to measure distance between SETS of GO-IDs?

How to define metric on functions?

 Minimal path length through GO DAG?


Distal GO-IDs

• Distinguish pairs of nodes in same branch of DAG from those for which minimal path between them includes root node

How to measure distance between SETS of GO-IDs

Proteins may be assigned to multiple nodes in GO DAG

• What is the distance between x and o in these three graphs?

(a) 4, (b) 2, (c) 2, 4
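The ideas on the last few slides (minimal path length through the DAG, whether that path runs through the root, and distances between sets of GO-IDs) can be sketched on a made-up graph. The toy DAG below is hypothetical and does not reproduce the three graphs on the slide; the function names are illustrative.

```python
from collections import deque

# Hypothetical toy DAG: each term maps to the list of its parent terms.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
    "B1": ["B"],
}

def path_length(x, y, parents=PARENTS):
    """Minimal number of edges between x and y, ignoring edge direction (breadth-first search)."""
    neighbours = {term: set() for term in parents}
    for child, term_parents in parents.items():
        for parent in term_parents:
            neighbours[child].add(parent)
            neighbours[parent].add(child)
    distance = {x: 0}
    queue = deque([x])
    while queue:
        node = queue.popleft()
        if node == y:
            return distance[node]
        for nxt in neighbours[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
    raise ValueError("terms are not connected")

def path_via_root(x, y, parents=PARENTS, root="root"):
    """True if a minimal path between x and y passes through the root node."""
    return path_length(x, root, parents) + path_length(root, y, parents) == path_length(x, y, parents)

def set_distances(terms_a, terms_b, parents=PARENTS):
    """Sorted list of all pairwise path lengths between two sets of terms."""
    return sorted(path_length(a, b, parents) for a in terms_a for b in terms_b)

print(path_length("A1", "A2"), path_via_root("A1", "A2"))  # same branch
print(path_length("A1", "B1"), path_via_root("A1", "B1"))  # different branches
print(set_distances({"A1"}, {"A2", "B1"}))                 # one protein with two annotations
```

When a protein carries several GO-IDs, the comparison yields a list of distances, not a single number. Whether to summarise it by the minimum, the maximum or an average is a design choice that this sketch leaves open.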

To study accuracy of annotation transfer, use experimental annotation only?

• Obviously.
• But there are problems.
  – Many fewer data
  – Inconsistencies
  – Sometimes annotation correct, but source of annotation incorrect

Sangar, Blankenberg, Altman & Lesk

Quantitative sequence-function relationships in proteins based on gene ontology. BMC Bioinformatics. 2007 Aug 8;8:294.

• Similar project, but asked how well one can predict GO categories from sequence similarity
• For proteins with more than 50% residue identity, transfer of annotation between homologues will lead to an erroneous attribution in only a small fraction of cases.

Dependence of function divergence on sequence divergence: the EF-hand family

[Figure: fraction of pairs plotted against GO distance]

Conclusions

• It is possible to define a statistical distribution describing the relationship between divergence of sequence and divergence of function
• General rule: as sequences diverge, function diverges. But: exceptions exist
• Threshold at about 50% sequence identity, below which function starts to diverge more radically
• Databases contain many errors and are incomplete; annotation is still a human, labour-intensive activity

Use of structure to assign function

• Basic fact: structure changes more conservatively than sequence
• It is possible to recognise homology between more distant relatives from structure than from sequence
• BUT, the more distant the relationship the more likely that the function has changed
• However, can certainly check that active-site residues have been conserved
• Sometimes infer change in specificity – for instance, serine proteinases

Specificity of chymotrypsin-like serine proteinases

• Cleave polypeptide chains adjacent to specific residues
• Specificity determined by complementarity of pocket in enzyme and sidechain of residue
• Chymotrypsin, elastase, and trypsin differ in specificity through changes in lining of specificity pocket
  – Chymotrypsin – large flat sidechains (Phe, Trp)
  – Trypsin – long positively charged sidechains (Lys, Arg)
  – Elastase – short nonpolar sidechains (Ala, Val)

Chemical mechanism of catalysis for serine proteases.

Ser, His, Asp catalytic triad Specificity pocket

Perona J J , Craik C S J. Biol. Chem. 1997;272:29987-29990

©1997 by American Society for Biochemistry and Molecular Biology

General fold and specific active site

• The activity of many proteins depends on the structure of an active site, which often involves very few residues
• Rest of the protein exists to create the geometry of the active site
• (also involved in energetics of activity)
• Nevertheless, there is a correlation between folding pattern and function

Active site and general fold

• The active site has a direct relation to function
• The general folding pattern has an indirect relation to function
• And yet the correlation between folding pattern and function may be a reliable indicator of function
• Example: helix-turn-helix motif in DNA-binding proteins

Given a protein structure can we predict function directly?

• Sometimes … To some extent …
• What are reasonable goals?
• Sometimes structure gives general idea, guiding laboratory work to pin it down
• Often it is possible to identify putative active site
• Allows educated guesses about ligand

Is knowing ligand, knowing function?

• Well, not precisely.
• However, it is a big step forward
• Even knowledge of general type of ligand is useful in guiding additional work
• Example: Haemophilus influenzae structural genomics project

HI1679

• α/β-hydrolase fold, putative remote homology to L-2-haloacid dehalogenases
• Several substrates tried.
• HI1679 cleaved 6-phosphogluconate, phosphotyrosine
• This was a success (Trust but verify.)

HI1434

• related to a region in tRNA synthetases.
• contains putative binding site, likely to bind nucleotide
• no specific ligand has yet been identified
• This has not been a success

Nuclear Transport Factor-2

• Protein known to be involved in trafficking across nuclear membrane
• Crystal structure determined
• Mechanism of function not obvious
• ???

NTF-2 homologous to scytalone dehydratase

• Alexei Murzin spotted a similarity of fold between NTF-2 and scytalone dehydratase
• This structure shows scytalone dehydratase binding an inhibitor

Scytalone dehydratase

Scytalone dehydratase is an enzyme in the pathway for melanin synthesis

NTF-2 Superposition


Search for ligands

• On the basis of the structural similarity, many ligands were designed and tested
• So far, none has shown any binding or catalysed reaction
• Conclusion: structural similarity is a useful guide to hypotheses about function, but doesn’t always work …

Structure-based function assignment

• Extract functional residues from structures of known function
• Residues contributing to function of entire homologous family conserved in whole family
• Residues contributing to specific function of subfamily conserved only in subfamily
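The distinction between family-wide and subfamily-specific conservation can be illustrated on a multiple sequence alignment. The sketch below uses strict conservation (every sequence has the same residue in a column), which is a simplifying assumption; the alignment and subfamily names are made up.

```python
def conserved_columns(sequences):
    """Indices of alignment columns in which every sequence has the same (non-gap) residue."""
    first = sequences[0]
    return {
        i
        for i in range(len(first))
        if first[i] != "-" and all(seq[i] == first[i] for seq in sequences)
    }

def classify_positions(subfamilies):
    """Split conserved columns into family-wide and subfamily-specific sets.

    `subfamilies` maps a subfamily name to a list of aligned sequences.
    Returns (family_wide, {subfamily_name: columns conserved only within that subfamily}).
    """
    all_sequences = [seq for seqs in subfamilies.values() for seq in seqs]
    family_wide = conserved_columns(all_sequences)
    specific = {
        name: conserved_columns(seqs) - family_wide
        for name, seqs in subfamilies.items()
    }
    return family_wide, specific

# Hypothetical aligned sequences for two subfamilies of one family.
alignment = {
    "subfamily X": ["HDSKWA", "HDSKWG"],
    "subfamily Y": ["HDSAFA", "HDSAFC"],
}
family_wide, specific = classify_positions(alignment)
print(sorted(family_wide))  # candidate positions for the shared function
print({name: sorted(cols) for name, cols in specific.items()})  # candidates for specificity
```

In practice one would tolerate conservative substitutions and then map the resulting positions onto a structure to see whether they cluster around a binding or catalytic site.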

Screening structures for binding sites

• We know the residues that bind ligands in proteins whose structures have been determined experimentally in the liganded state
• Can write programs to search for these spatial patterns in a protein of known structure and unknown ligand and function
• Many examples of this are known
• Reviewed by: Domingues & Lengauer, in: Bioinformatics – From Genomes to Therapies (Wiley-VCH, Weinheim, 2007), Vol. 3, pp. 1211-1252

Combined sequence-structure methods

• Homologous proteins may have diverged in sequence and function (leave aside recruitment)
• Assume no strong sequence similarity to protein of known function
• Align sequences
• Use structure to get better alignments
• Check for conservation of binding site, catalytic residues

Several groups have applied these ideas

• Cohen & Lichtarge, ‘Evolutionary Trace Method’ (J. Mol. Biol. 1996)
• Irving, Whisstock, Lesk (Proteins 2001)
• Hannenhalli & Russell (J. Mol. Biol. 2000)
• Sternberg and coworkers (PNAS 2004, Phil. Trans. Roy. Soc. 2006)
• See also: Automated Function Prediction, ISMB Special Interest Group Meeting, 2005

Evolutionary trace method (Cohen & Lichtarge)

Tools available from European Bioinformatics Institute (EBI)

• PDBeMotif (Golovin & Henrick): exploration of the Protein Data Bank (PDB) by combining protein sequence, chemical structure and 3D data in a single search.
• Tempura (Laskowski R A, Watson J D, Thornton J M): search the PDB for similarities to the active site of a selected protein.

A priori modelling of ligand

• Given a cleft in a protein, can you model what ligand would bind there?
• Problem common to assignment of protein function, and design of lead compounds in drug discovery
• One possibility – dock in everything (computer screening)
• Constructive methods

Ligand in chymotrypsin specificity pocket

[Figure labels: Asp – His – Ser catalytic triad; ligand; structure 2CHA]

Modelling ligand to chymotrypsin

[Figure labels: experimental ligand; prediction]

Modelling ligand to chymotrypsin

• Subsequent experiments showed that 2-methyl tryptophan binds in this pocket
• This extra atom is real

Advantage of constructive methods

• We have assumed that it is possible to identify active site
• If there is a deep cleft in the molecule, lined with charged residues … BOH …
• But some molecules develop subsidiary binding sites, which might be missed
• Viral 3C proteases have developed a subsidiary binding site for RNA

What other relationships among properties of organisms are useful in assigning function?


What are we looking for?

• We might try to identify proteins that have similar functions in same or different species
  – Human and horse haemoglobin
  – We may be able to find these if they are homologues
• We might try to identify proteins that have coordinated functions in same or different species
  – Two or more proteins in same metabolic pathway, or part of same macromolecular complex
  – These may in general NOT be homologues

Various clues that proteins have coordinated activities

• Linked on genome? (Best for bacteria, not for archaea; occasionally for eukaryotes)
• Appear as separate (monomeric) proteins in one species, and as single multidomain protein in other species
• Often separate proteins in prokaryotes are fused in eukaryotes (but some examples of opposite are known)

Function assignment by reconstruction of metabolic pathways


Reconstruction of metabolic pathways

• Suppose you have a new genome sequence, and can assign function to many but not all proteins
• Reconstruct metabolic pathways, and see which essential enzymes you have already identified
• Look for essential steps in a pathway to which no enzyme is assigned, and proteins for which no function is assigned: try to match “orphans”

Use of synteny

• In bacteria, proteins that catalyse successive steps in a metabolic pathway often appear successively in the genome
• In archaea this is less general
• This can provide useful information in assigning protein function

Tryptophan synthetase pathway in E. coli

• In E. coli, shikimate kinase is an enzyme in the pathway of synthesis of chorismate from erythrose-4-phosphate
• Chorismate is a branch-point compound for the synthesis of aromatic amino acids
• The tryptophan synthetase pathway is one of the best worked-out in E. coli, in terms of enzymology and regulation

Pathway of synthesis of shikimate from erythrose-4-P in E. coli

From: Daugherty et al., J Bacteriol. 2001 January; 183(1): 292–300.

Cross-table of metabolic steps and genes

• Match up known genes and known metabolic steps
• No recognized protein for a metabolic step?
  – Maybe the metabolic step is missing from that organism
• No recognized function for some gene?
  – Maybe we can match up the missing function with a gene that is missing a function assignment
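The cross-table idea can be sketched as a comparison of two lists. This is a minimal sketch, assuming the pathway is given as a list of required enzyme functions and the genome annotation as a mapping from gene identifier to assigned function (or `None` when nothing could be assigned). The function name `cross_table` and the gene identifiers are hypothetical.

```python
def cross_table(pathway_steps, gene_functions):
    """Compare the steps a pathway needs with the functions assigned to genes.

    pathway_steps: list of enzyme functions required by the pathway.
    gene_functions: dict mapping gene id -> assigned function, or None
                    when no function could be assigned.
    Returns (missing_steps, orphan_genes).
    """
    assigned = {f for f in gene_functions.values() if f is not None}
    # Steps for which no gene in this genome has been given that function.
    missing_steps = [s for s in pathway_steps if s not in assigned]
    # Genes that have no function assignment at all.
    orphan_genes = sorted(g for g, f in gene_functions.items() if f is None)
    return missing_steps, orphan_genes

# Steps from 3-dehydroquinate to chorismate; gene ids are hypothetical.
steps = [
    "3-dehydroquinate dehydratase",
    "shikimate dehydrogenase",
    "shikimate kinase",
    "EPSP synthase",
    "chorismate synthase",
]
annotation = {
    "gene_001": "3-dehydroquinate dehydratase",
    "gene_002": "shikimate dehydrogenase",
    "gene_003": "EPSP synthase",
    "gene_004": None,
    "gene_005": "chorismate synthase",
    "gene_006": None,
}
missing, orphans = cross_table(steps, annotation)
print("Steps without an enzyme:", missing)
print("Genes without a function:", orphans)
```

The output is only a list of candidates: each orphan gene is a possible match for each missing step, and further evidence is needed to decide between them.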

Matching gene with function

• Check for homologues
  – Maybe find several
  – Maybe find none
• Look in the genome for operons containing a succession of genes for steps in the pathway
  – Usually works in bacteria
  – Less common in archaea
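The operon search can be sketched as a scan along the gene order. This is an illustrative sketch only, assuming the genome is available as an ordered list of gene identifiers and that the pathway genes and the genes without an assignment are known as sets. The function name `orphans_near_pathway_genes`, the `window` parameter and all gene names are hypothetical.

```python
def orphans_near_pathway_genes(gene_order, pathway_genes, unassigned, window=1):
    """List unassigned genes lying within `window` positions of a pathway gene.

    gene_order: list of gene ids in the order they occur on the genome.
    pathway_genes: set of gene ids already assigned to the pathway.
    unassigned: set of gene ids with no function assignment.
    Returns (gene, number_of_pathway_neighbours) pairs, best first.
    """
    candidates = []
    for i, gene in enumerate(gene_order):
        if gene not in unassigned:
            continue
        lo = max(0, i - window)
        neighbours = gene_order[lo:i] + gene_order[i + 1:i + 1 + window]
        n_pathway = sum(1 for g in neighbours if g in pathway_genes)
        if n_pathway:
            candidates.append((gene, n_pathway))
    # Most pathway neighbours first; ties keep genome order (stable sort).
    return sorted(candidates, key=lambda c: -c[1])

# Hypothetical genome segment: orf1 sits inside a run of pathway genes,
# orf2 does not.
order = ["x1", "pathA", "pathB", "orf1", "pathC", "pathD", "x2", "orf2", "x3"]
pathway = {"pathA", "pathB", "pathC", "pathD"}
no_function = {"orf1", "orf2"}
near = orphans_near_pathway_genes(order, pathway, no_function)
print(near)
```

Because clustering of pathway genes is usual in bacteria but less common in archaea, an empty result says little about an archaeal genome.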

Aromatic amino acid biosynthesis

R. Boyer

E. coli trp operon

Note collinearity of genes with order of reactions in pathway

From: Garrett, R.H. & Grisham, C.M. (1999) Biochemistry. 2nd ed. (Thomson Higher Education, Belmont, CA)

Function assignment from metabolic pathway reconstruction

• Methanococcus jannaschii has a pathway for the synthesis of chorismate from 3-dehydroquinate

[Pathway diagram: 3-dehydroquinate → … → shikimate → shikimate-5-P → … → chorismate; the step shikimate → shikimate-5-P is catalysed by shikimate kinase]

Methanococcus jannaschii shikimate kinase

• One step of the tryptophan synthesis pathway is catalysed by shikimate kinase
• No shikimate kinase is identifiable in the M. jannaschii genome by homology to known shikimate kinases in other species
• Pathway genes are NOT clustered in the M. jannaschii genome

Shikimate kinase in Methanococcus jannaschii

• In M. jannaschii, the steps of the shikimate pathway are NOT catalysed by enzymes encoded consecutively in the genome in an operon
• Sequence similarity identified most enzymes but not shikimate kinase
• In another archaeon, A. pernix, the genes in this pathway ARE collinear
• From this it was possible to identify the A. pernix shikimate kinase, and from that the M. jannaschii homologue
• NOT homologous to shikimate kinases in bacteria or eukarya

Reference: Daugherty et al., J. Bacteriology (2001). 183, 292–300.

Mapping of genes for shikimate synthesis in several prokaryotic genomes

[Figures, shown over three slides] From: Daugherty et al., J Bacteriol. 2001 January; 183(1): 292–300.

Why didn’t homology search work?

• Archaeal shikimate kinase is NOT related to bacterial or eukaryotic shikimate kinases
• It is distantly related to homoserine kinases of the GHMP kinase superfamily
• M. jannaschii homoserine kinase IS identifiable by homology
• The two enzymes are substrate-specific

Phylogenetic profiles

• Clues to function from genes shared among different organisms
• Different groups of organisms need different sets of genes
• For instance, some bacteria have flagella
• Genes found in bacteria that have flagella, but not in other bacteria or other groups of organisms, are likely to be involved in flagellar function

Phylogenetic Profiles

• Developed by Marcotte, Eisenberg et al. (PNAS 96, 4285-4288, 1999 and elsewhere)
• Tabulate homologues of E. coli proteins in 16 other genomes
• (Note: assume homologues share function – this is input to the method, not a result)
• Table: column = organism, row = gene
• Put a mark in a cell if the organism has the gene

From: Pellegrini et al. (1999). Proc. Natl. Acad. Sci. U.S.A. 96, 4285-4288

Phylogenetic profile

• Pattern of a row = barcode of which organisms a gene occurs in
• Result: genes that share patterns are ‘functionally linked’
• Functionally linked = participate in some coordinated way in some structure or process
• Note: proteins can be functionally linked even if they are not homologous
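The barcode comparison can be sketched as follows. This is a minimal illustration, assuming presence or absence of a homologue in each organism has already been decided. Profiles are compared here by the number of differing positions (Hamming distance), which is a simple choice made for this sketch and not necessarily the scoring used in the cited papers. All names (`build_profile`, `linked_pairs`, the genes and species) are hypothetical.

```python
from itertools import combinations

def build_profile(present_in, organisms):
    """Encode presence/absence across organisms as a tuple of 0/1 (the 'barcode')."""
    return tuple(1 if org in present_in else 0 for org in organisms)

def hamming(p, q):
    """Number of organisms in which two profiles differ."""
    return sum(a != b for a, b in zip(p, q))

def linked_pairs(profiles, max_distance=0):
    """Return gene pairs whose profiles differ in at most max_distance organisms."""
    pairs = []
    for (g1, p1), (g2, p2) in combinations(sorted(profiles.items()), 2):
        if hamming(p1, p2) <= max_distance:
            pairs.append((g1, g2))
    return pairs

# Hypothetical table: row = gene, column = organism.
organisms = ["sp1", "sp2", "sp3", "sp4"]
presence = {
    "geneP": {"sp1", "sp3"},
    "geneQ": {"sp1", "sp3"},
    "geneR": {"sp1", "sp2", "sp3", "sp4"},
    "geneS": {"sp2"},
}
profiles = {g: build_profile(orgs, organisms) for g, orgs in presence.items()}
exact_links = linked_pairs(profiles)
print(exact_links)
```

Nothing in this comparison uses sequence similarity between the two genes of a pair, which is why linked proteins need not be homologous.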

Example: ribosomal proteins

• Homologues of E. coli protein RL7 are found in 10 bacterial genomes and yeast, not in archaea
• Proteins that match this phylogenetic profile have functions associated with the ribosome
• Sets of ribosomal proteins have been pulled out on the basis of phylogenetic profile
• Linked proteins need not be homologues, nor be localized together in the genome

Combine phylogenetic profiling with matching ‘orphans’

• Create a metabolic network for an organism
• Assign functions by homology when possible
• Missing enzymes in a pathway?
• Genes that lack assignment?
• Try to match these up (recall archaeal shikimate kinase)
• Phylogenetic profiles can assist in this

From: Chen & Vitkup (2006). Genome Biol. 7, R17

Phylogenetic profiles / orphan assignment

Chen & Vitkup (2006). Genome Biol. 7, R17

• Phylogenetic profiles can link proteins in a metabolic pathway
• Even more, a better fit of profile implies closer in the metabolic network
• Test, using yeast:
  – remove a gene from the network
  – try to recover it from a pool of ~6000 genes
  – results: 22.8% top prediction correct (37.3% correct answer in top 10)
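The two percentages quoted above are top-1 and top-10 recovery rates. How such figures are computed can be sketched as follows, assuming each removed gene is scored against every candidate and the candidates are ranked by score. This is not the scoring or the data of Chen & Vitkup; the function names and the example ranks are hypothetical.

```python
def rank_of_true_gene(scores, true_gene):
    """Rank (1 = best) of the removed gene among all candidates.

    scores: dict mapping candidate gene -> score, higher = better fit.
    Ties are broken alphabetically, an arbitrary choice for this sketch.
    """
    ordered = sorted(scores, key=lambda g: (-scores[g], g))
    return ordered.index(true_gene) + 1

def top_k_accuracy(ranks, k):
    """Fraction of trials in which the removed gene was ranked within the top k."""
    if not ranks:
        raise ValueError("no trials to evaluate")
    return sum(1 for r in ranks if r <= k) / len(ranks)

# One hypothetical trial: g3 was removed and is ranked second on recovery.
trial_rank = rank_of_true_gene({"g1": 0.9, "g2": 0.5, "g3": 0.7}, "g3")

# Hypothetical ranks from five remove-and-recover trials.
ranks = [1, 4, 12, 1, 7]
top1 = top_k_accuracy(ranks, 1)
top10 = top_k_accuracy(ranks, 10)
print(trial_rank, top1, top10)
```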

State of the art in function assignment

• We have a ‘bag of tricks’ – that is, many methods, all of which work sometimes and fail sometimes.
• In some cases, no method works except to go back to the lab and work it out.
• We do not have a unified framework or a systematic approach to function assignment

Conclusion

• Inferring protein function from knowledge of the function of a close relative is like solving a clue in an American crossword puzzle: finding the precise word is difficult, but the task is in principle straightforward
• Inferring function a priori from structure is like solving a British crossword puzzle: which clues are real? Which clues are misleading?

Reference

• Lengauer, T. (ed.) Bioinformatics – From Genomes to Therapies (3 vols.) Wiley-VCH, Weinheim, 2007
