Protein Structure and Function

CHAPTER 4. From Sequence to Function : Case Studies in Structural and Functional Genomics

4-0. Overview : From Sequence to Function in the Age of Genomics
- Genomics is making an increasing contribution to the study of protein structure and function.
- Many computational and experimental tools are now available.
- Different experimental methods are required to define a protein's function.
- In this chapter : methods of comparing amino-acid sequences to determine their similarity and to search for related sequences in the sequence databases; predicting a protein's function from its structure.
Figure 4-1. Time and distance scales in functional genomics

4-1. Sequence Alignment and Comparison
Sequence comparison provides a measure of the relationship between genes
- Homologous : genes or proteins related by divergent evolution from a common ancestor.
- Homology : the similarity between two sequences that results from descent from a common ancestor.

Alignment is the first step in determining whether two sequences are similar to each other
- Alignment : the procedure of comparing two or more sequences by matching up corresponding residues.
- Insertions and deletions sometimes mean that one sequence must be slid relative to the other to restore the match. Sliding creates gaps.
Figure 4-2. Pairwise alignment

- E-value : the probability that an alignment score as good as the one found between two sequences would be found in a comparison between two random sequences.
- Up to an E-value of approximately 10^-10, the likelihood of an identical function is reasonably high, but then it starts to decrease substantially.
Figure 4-3. Plot of percentage of protein pairs having the same biochemical function as sequence changes

Multiple alignments and phylogenetic trees
- The alignment process can be expanded to give a multiple sequence alignment.
- Any residue, or short stretch of sequence, that is identical in all sequences in a given set is said to be CONSERVED.
Figure 4-4. Multiple alignment
- Multiple sequence alignments of homologous protein or gene sequences from different species are used to derive a so-called evolutionary distance.
- These distances can be used to construct phylogenetic trees that attempt to reflect evolutionary relationships between species.
Figure 4-5. Phylogenetic tree comparing the three major MAP kinase subgroups

4-2. Protein Profiling
Structural data can help sequence comparison find related proteins
- Straightforward sequence alignment does not indicate any relationship between the prokaryotic and eukaryotic domains.
- However, when the alignment is performed by comparing residues in the corresponding secondary structure elements of the prokaryotic and eukaryotic domains, some regions of sequence conservation appear.
Figure 4-6. Some Examples of Small Functional Protein Domains

Sequence and structural motifs and patterns can identify proteins with similar biochemical functions
- Sometimes only a part of a protein sequence can be aligned with that of another protein.
- Local alignment can identify a functional module within a protein.
- These function-specific blocks of sequence are called functional motifs.
- Two broad classes :
  - short, contiguous motifs = usually specify binding sites
  - discontinuous or non-contiguous motifs = catalytic sites
Figure 4-7. Representative examples of short contiguous binding motifs

- PSI-BLAST : position-specific iterated BLAST.
Figure 4-8. Construction of a profile (figure labels : amino acid position; five sequences; probability for Cys)

4-3. Deriving Function from Sequence
Sequence information is increasing exponentially
- The growth of sequence information is exponential, and shows no sign of slowing down.
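Looking back at the profile construction summarized in Figure 4-8 (section 4-2), the idea can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the five aligned sequences are hypothetical and invented for illustration, `build_profile` is an illustrative name rather than part of any BLAST tool, and the profile holds plain observed frequencies. PSI-BLAST itself goes further, with sequence weighting, pseudocounts and log-odds scores, none of which is shown here.

```python
from collections import Counter


def build_profile(aligned):
    """Return one dict per alignment column: residue -> observed frequency."""
    if not aligned:
        raise ValueError("need at least one sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    profile = []
    for column in zip(*aligned):  # iterate over alignment positions
        counts = Counter(column)
        profile.append({res: n / len(aligned) for res, n in counts.items()})
    return profile


# Hypothetical five-sequence alignment (not taken from the figure).
SEQS = ["ACDCW", "ACECW", "GCDCW", "ACDSW", "TCDCF"]

PROFILE = build_profile(SEQS)

# Probability of observing Cys at each amino acid position.
CYS_PROB = [col.get("C", 0.0) for col in PROFILE]

# Positions (0-based) where a single residue occurs in every sequence,
# i.e. positions that are conserved in the sense used in section 4-1.
CONSERVED = [i for i, col in enumerate(PROFILE) if max(col.values()) == 1.0]

print(CYS_PROB)
print(CONSERVED)
```

In this toy alignment Cys is invariant at the second position and present in four of the five sequences at the fourth, so a profile search would weight Cys heavily at those two positions when scoring a new sequence.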
Figure 4-9. The growth of DNA and protein sequence information collected by GenBank over 20 years
- As one proceeds from prokaryotes to eukaryotes, and from single-celled to multicellular organisms, the number of genes increases markedly.
Figure 4-10. Table of the size of the genomes of some representative organisms

In some cases function can be inferred from sequence
- If a protein has more than about 40% sequence identity to another protein whose biochemical function is known, and if the functionally important residues are conserved between them, it is a reasonable working assumption that the two proteins share a biochemical function.
Figure 4-11. Relationship of sequence similarity to similarity of function (green : non-enzymatic; blue : enzymatic)
- Local alignments of functional motifs in the sequence can often identify at least one biochemical function of a protein (e.g. helix-turn-helix and zinc finger motifs).
- Walker motif : ATP- or GTP-binding motif.
Figure 4-12. The P loop of the Walker motif
- Sequence comparison is an active area of research because it is now the easiest technique to apply to a new protein sequence.
- A large proportion of assigned functions are inferred only from overall sequence similarity to known proteins.
Figure 4-13. Analysis of the functions of the protein-coding sequences in the yeast genome

4-4. Experimental Tools for Probing Protein Function
Gene function can sometimes be established experimentally without information from protein structure or sequence homology
- Experience suggests that genes of similar function often display similar patterns of expression.
- Expression can be measured at the level of mRNA or protein.
- The mRNA-based techniques : DNA microarrays and SAGE.
- Microarray technology can provide expression patterns for up to 20,000 genes at a time.
Figure 4-14. DNA microarray

4-4.
Experimental Tools for Probing Protein Function
- High-throughput monitoring of protein expression can be achieved by two-dimensional gel electrophoresis (2-D GE).
- Protein spots can be identified by mass spectrometry.
- 2-D GE can detect both the amount of a protein and its modifications.
- But it is slow and expensive.
- It can fail to detect proteins that are present in only a few copies per cell.
Figure 4-15. 2-D protein gel

- The phenotype produced by inactivating a gene, a gene knockout, is highly informative about the cellular pathway.
- A knockout can be obtained by classical mutagenesis, targeted mutation, RNA interference, the use of antisense RNA, or antibody binding.
Figure 4-16. The phenotype of a gene knockout can give clues to the role of the gene

- The location of a protein in the cell often provides a valuable clue to its function.
- Technique : attachment of a tag sequence to the gene in question. A commonly used method is to fuse the sequence encoding GFP (green fluorescent protein).
Figure 4-17. Protein localization in the cell

- Interacting proteins can be found with the yeast two-hybrid system.
- Two distinct domains are necessary to activate transcription in yeast :
  ① a DNA-binding domain (DBD), which binds to the promoter
  ② an activation domain (AD)
- Protein A is fused to the DBD and protein Y is fused to the AD.
- If proteins A and Y interact with each other, the DBD and AD are brought close together and transcription starts.
Figure 4-18. Two-hybrid system for finding interacting proteins

4-5. Divergent and Convergent Evolution
- In general, if the overall identity between the two sequences is greater than about 40%, they will code for proteins of similar fold.
- Rmsd : root-mean-square difference in the spatial positions of backbone atoms.
Figure 4-19. Relationship between sequence and structural divergence of proteins

4-5.
Divergent and Convergent Evolution
Proteins with low sequence similarity but very similar overall structure and active sites are likely to be homologous
- Benzoylformate decarboxylase and pyruvate decarboxylase : low sequence similarity, similar structure.
Figure 4-20. Ribbon diagram of the structure of a monomer of benzoylformate decarboxylase (BFD) and pyruvate decarboxylase (PDC)

Divergent evolution can produce proteins with sequence and structural similarity but different function
- Similar structure, different function : steroid delta-isomerase, nuclear transport factor-2, scytalone dehydratase.
Figure 4-21. Superposition of the three-dimensional structures of steroid delta-isomerase, nuclear transport factor-2 and scytalone dehydratase

4-6. Structure from Sequence : Homology Modeling
Homology modeling is used to deduce the structure of a sequence with reference to the structure of a close homolog
- Upper region : sequence similarity is likely to yield enough structural similarity for homology modeling.
- Lower region : homology modeling is highly problematic.
Figure 4-22. The threshold for structural homology

- Conservation is measured by Gstat : a high value means more conserved.
- Homology modeling : the integral membrane protein rhodopsin, with the cluster of conserved interacting residues (red) identified on the basis of conservation.
Figure 4-23. Evolutionary conservation and interactions between residues in the protein-interaction domain PDZ and in rhodopsin

- Plasminogen (blue) and chymotrypsinogen (red) are very similar.
- Chymotrypsin (green), plasminogen (blue) and chymotrypsinogen (red) : different active-site conformations.
Figure 4-24. Structural changes in closely related proteins

4-7.
Structure from Sequence : Profile-Based Threading and “Rosetta”
Profile-based threading tries to predict the structure of a sequence even if no sequence homologs are known
- A computer program forces the sequence to adopt every known protein fold in turn, and in each case a scoring function is calculated that measures the suitability of the sequence for that particular fold.
- A very high Z-score for a fold indicates that the sequence almost certainly adopts that fold.
Figure 4-25. The method of profile-based threading

The ROSETTA method attempts to predict protein structure from sequence without the aid of a homologous sequence or structure
- The idea behind Rosetta is that the distribution of conformations sampled for a given short segment of the sequence is approximated by the distribution of structures adopted by that segment, and closely related sequences, in known protein structures.
- Each calculated structure is similar to the real crystal structure, but not identical.
Figure 4-26. Some decoy structures produced by the Rosetta method
- The level of agreement with the known native structure varies, but in many cases the overall fold is predicted well enough to be recognizable.
Figure 4-27. Examples of the best-center cluster found by Rosetta for a number of different test proteins

4-8. Deducing Function from Structure : Protein Superfamilies
- In contrast to the exponential increase in sequence information, structural information (from X-ray crystallography or NMR) has up to now been increasing at a much lower rate.
- Superfamily : loosely defined as a set of homologous proteins with similar three-dimensional structures.
- Within each superfamily, there are families with more closely related functions and significant (>50%) sequence identity.
Figure 4-28. Growth in the number of structures in the Protein Data Bank

The four superfamilies of serine proteases are examples of convergent evolution
- Serine proteases fall into several structural superfamilies, which are recognizable from their amino-acid sequences and the particular disposition of the three catalytically important residues in the active site.
- Chymotrypsin and subtilisin belong to different superfamilies.
Figure 4-29. The overall folds of two members of different superfamilies of serine proteases

- Another large enzyme superfamily with numerous different biological roles is characterized by the so-called polymerase fold, which resembles an open hand.
Figure 4-30. A comparison of primer-template DNA bound to three DNA polymerases (figure labels : Taq DNA polymerase; reverse transcriptase; DNA polymerase)

4-9. Strategies for Identifying Binding Sites
Binding sites are identified as regions where the computed interaction energy between the probe and the protein is favorable for binding
- Zone 1 : good site for binding a positively charged group.
- Zone 2 : good site for binding a hydrophobic group.
- Zone 3 : good site for binding a negatively charged group.
- Overlay of three pieces of a known inhibitor of dihydrofolate reductase onto the zones, computed with the GRID program.
Figure 4-31. Example of the use of GRID

- MSCS (multiple solvent crystal structures) is a crystallographic technique that identifies energetically favorable binding sites and orientations of small organic molecules on the surface of proteins.
Figure 4-32. Some organic solvents used as probes for binding sites for functional groups

- Small organic molecules bind on the protein surface.
Figure 4-33.
Structure of subtilisin in 100% acetonitrile
- The binding sites for different organic solvent molecules were obtained by X-ray crystallography of crystals of thermolysin soaked in the solvent.
Figure 4-34. Ribbon representation showing the experimentally derived functionality map of thermolysin

4-10. Strategies for Identifying Catalytic Residues
Active-site residues in a structure can sometimes be recognized computationally by their geometry
- The method searches the structure for geometrical arrangements of chemically reactive side chains that match those in the active sites of known enzymes.
- The geometry of the catalytic triad of the serine proteases as used to locate similar sites in other proteins.
Figure 4-35. An active-site template

- THEMATICS : the net charge of potentially ionizable groups on each residue in the protein structure is calculated as a function of pH.
- Amino acids that show abnormal ionization curves (green His 95 and blue Glu 165 in triosephosphate isomerase) are possibly catalytic residues.
Figure 4-36. Theoretical microscopic titration curves
- Structure of triosephosphate isomerase : His 95 and Glu 165 are both located in the active site.
Figure 4-37. Residues that show abnormal ionization behavior with changing pH define the active site

4-11. TIM Barrels : One Structure with Diverse Functions
- Mandelate racemase : interconverts R- and S-mandelate.
Figure 4-38. The chemical reaction catalyzed by mandelate racemase
- Muconate lactonizing enzyme : transforms the cis,cis-muconic acid derived from mandelate into muconolactone.
Figure 4-39. The chemical reaction catalyzed by muconate lactonizing enzyme

4-11.
TIM Barrels : One Structure with Diverse Functions
- Mandelate racemase and muconate lactonizing enzyme have 26% sequence identity, and their overall folds are essentially identical.
Figure 4-40. Mandelate racemase (left) and muconate lactonizing enzyme (right) have almost identical folds
- The amino acids that coordinate the metal ion are conserved between the two enzymes, and the catalytic residues are similar.
Figure 4-41. A comparison of the active sites of mandelate racemase (left) and muconate lactonizing enzyme (right)

4-12. PLP Enzymes : Diverse Structures with One Function
- L-aspartate aminotransferase : transfers the amino group of L-aspartate to α-ketoglutarate, producing oxaloacetate and L-glutamate.
- It uses the cofactor pyridoxal phosphate (PLP).
Figure 4-42. The overall reaction catalyzed by the pyridoxal phosphate-dependent enzyme L-aspartate aminotransferase

- Step 1 : The amino group of the amino acid substrate displaces the side-chain amino group of the lysine residue that holds the cofactor PLP in the active site.
- Step 2 : PLP catalyzes a rearrangement of the amino acid substrate.
- Step 3 : This is followed by hydrolysis of the keto-acid portion, leaving the nitrogen of the amino acid bound to the cofactor to form the intermediate PMP (pyridoxamine phosphate).
Figure 4-43. The general mechanism for PLP-dependent catalysis of transamination, the interconversion of α-amino acids and α-keto acids

- L-aspartate aminotransferase and D-amino acid aminotransferase : no sequence identity, and totally different folds.
Figure 4-44. The three-dimensional structures of L-aspartate aminotransferase (left) and D-amino acid aminotransferase (right)
- However, the active sites are found to be strikingly similar.
Figure 4-45.
Comparison of the active sites of L-aspartate aminotransferase (left) and D-amino acid aminotransferase (right)
- Bacterial D-amino acid aminotransferase and human branched-chain L-amino acid aminotransferase : similar structures, although the human enzyme recognizes only L-amino acids.
Figure 4-46. The three-dimensional structures of bacterial D-amino acid aminotransferase (left) and human mitochondrial branched-chain L-amino acid aminotransferase (right)

4-13. Moonlighting : Proteins with More than One Function
In multicellular organisms, multifunctional proteins help expand the number of protein functions that can be derived from relatively small genomes
Figure 4-47. Some examples of multifunctional proteins with their various functions

- The cytokine macrophage inhibitory factor (MIF) is a proinflammatory cytokine that activates T cells and macrophages.
- It also catalyzes the tautomerization of phenylpyruvic acid.
Figure 4-48. The three-dimensional structure of the monomer of macrophage inhibitory factor, MIF (figure label : substrate binding and active site)

4-14. Chameleon Sequences : One Sequence with More than One Fold
- Chameleon sequence : a sequence that exists in different conformations in different environments.
- LITTAHA (red) has different conformations in two different enzymes, cyclodextrin glycosyltransferase and beta-galactosidase.
Figure 4-49. Chameleon sequences

- Dimerization of the sequence-specific DNA-binding protein Fis.
- A single-site mutation (Pro 26 → Ala 26) can convert a beta strand to an alpha helix.
Figure 4-50. Chameleon sequences in the DNA-binding protein Fis

- Some proteins contain natural chameleon sequences that may be important to their function.
- Example : a DNA-binding transcriptional regulator from yeast.
Figure 4-51. Chameleon sequence in the DNA-binding protein MATα2 from yeast

4-15.
Prions, Amyloids and Serpins : Metastable Protein Folds
- Some structures may be metastable, that is, able to change into one or more different stable structures.
- The best characterized of these changeable structures is the prion.
- The precise structure of the disease-causing form is not yet known, but it is known to have much more beta sheet than the cellular form.
Figure 4-52. The prion protein

- Alzheimer's, Parkinson's and type II diabetes : each disease is associated with a particular protein, and extracellular aggregates of these proteins are thought to be the origin of the disease.
- These proteins produce fibrous aggregates of identical, largely beta-sheet, structure.
Figure 4-53. A possible mechanism for the formation of amyloid fibrils by a globular protein

- The loop is cleaved by the protease.
- Cleavage triggers a refolding of the cleaved structure that makes it more stable.
Figure 4-54. Structural transformation in a serine protease inhibitor on binding protease

4-16. Functions for Uncharacterized Genes : Galactonate Dehydratase
- Members of the same family have similar structures and mechanisms.
- Examples : MR, MLE, enolase.
Figure 4-55. Active sites of MR, MLE, and enolase
- The previously unknown enzyme F587 has now been identified as the product of the gene dgoD, which encodes galactonate dehydratase.
Figure 4-56. The pathway for the utilization of galactonate in E. coli (figure label : carbon source)
- The fold is the same as those of MR, MLE and enolase (it belongs to the same family).
Figure 4-57. Structure of galactonate dehydratase
- The active site is the same as those of MR, MLE, and enolase (it belongs to the same family).
Figure 4-58. Schematic diagram of a model of the active site of galactonate dehydratase with substrate bound

4-17.
Starting from Scratch : A Gene Product of Unknown Function
- Alanine racemase and YBL036c in yeast : the yeast protein lacks the largely antiparallel beta-sheet domain of the racemase; however, the active sites, indicated by the presence of the bound cofactor, are similar.
Figure 4-59. The three-dimensional structures of bacterial alanine racemase and yeast YBL036c
- Enzyme-cofactor binding residues are preserved.
Figure 4-60. Comparison of the active sites of bacterial alanine racemase and YBL036c

CHAPTER 5. Structure Determination

5-1. The Interpretation of Structural Information
- The objective end-product of a crystallographic structure determination is an electron density map.
Figure 5-1. Portion of a protein electron density map at three different resolutions (3 Å, 2 Å, 1 Å)
- The figure shows the superposition of the set of models derived from the internuclear distances measured for this protein in solution.
Figure 5-2. NMR structure ensemble

5-2. Structure Determination by X-Ray Crystallography and NMR
Figure 5-3. Structure determination by X-ray crystallography
Figure 5-4. Structure determination by NMR

5-3. Quality and Representation of Crystal and NMR Structures
(a). Wire model : useful, for example, in comparisons of two conformations.
(b). Ribbon diagram : alpha helices and beta strands are easily recognizable.
(c). Ball-and-stick model : bonded and non-bonded distances can be assessed, which is important for evaluating interactions.
Figure 5-5. Different ways of presenting a protein structure
(d). Space-filling model : useful for assessing the fit of a ligand to a binding site.
(e).
Surface topography : can be colored according to different local properties, such as the electrostatic potential at different points on the molecule.
Figure 5-5. Different ways of presenting a protein structure