Protein structure prediction: The holy grail of bioinformatics

Proteins: Four levels of structural organization
• Primary structure
• Secondary structure
• Tertiary structure
• Quaternary structure

Primary structure = the linear amino acid sequence

Secondary structure = spatial arrangement of amino-acid residues that are adjacent in the primary structure

α helix = a helical structure whose chain coils tightly as a right-handed screw, with all the side chains sticking outward in a helical array. The tight structure of the α helix is stabilized by same-strand hydrogen bonds between -NH groups and -CO groups spaced at four amino-acid residue intervals.

The β-pleated sheet is made of loosely coiled β strands that are stabilized by hydrogen bonds between -NH and -CO groups from adjacent strands.

An antiparallel β sheet: adjacent β strands run in opposite directions. Hydrogen bonds between NH and CO groups connect each amino acid to a single amino acid on an adjacent strand, stabilizing the structure.

A parallel β sheet: adjacent β strands run in the same direction. Hydrogen bonds connect each amino acid on one strand with two different amino acids on the adjacent strand.

Silk fibroin

• α helix
• β sheet (parallel and antiparallel)
• tight turns
• flexible loops
• irregular elements (random coil)

Tertiary structure = three-dimensional structure of the protein
The tertiary structure is formed by the folding of secondary structures, held together by covalent and non-covalent forces such as hydrogen bonds, hydrophobic interactions, salt bridges between positively and negatively charged residues, and disulfide bonds between pairs of cysteines.

Quaternary structure = spatial arrangement of subunits and their contacts.

Holoproteins & Apoproteins
Holoprotein = Apoprotein + Prosthetic group
Apohemoglobin = 2α + 2β
Prosthetic group: Heme
Hemoglobin = Apohemoglobin + 4 Heme

Christian B. Anfinsen 1916-1995
Sela M, White FH, & Anfinsen CB. 1959.
The reductive cleavage of disulfide bonds and its application to problems of protein structure. Biochim. Biophys. Acta 31:417-426.

Not all proteins fold independently. Chaperones.

Reducing agents:
• Ammonium thioglycolate (alkaline), pH 9.0-10
• Glycerylmonothioglycolate (acid), pH 6.5-8.2
Oxidant

What do we need to know in order to state that the tertiary structure of a protein has been solved?
Ideally: We need to determine the position of all atoms and their connectivity.
Less ideally: We need to determine the position of all Cα atoms (backbone structure).

Protein structure: Limitations and caveats
• Not all proteins or parts of proteins assume a well-defined 3D structure in solution.
• Protein structure is not static; there are various degrees of thermal motion for different parts of the structure.
• There may be a number of slightly different conformations in solution.
• Some proteins undergo conformational changes when interacting with other molecules.

Experimental Protein Structure Determination
• X-ray crystallography
  - most accurate
  - in vitro
  - needs crystals
  - ~$100-200K per structure
• NMR
  - fairly accurate
  - in vivo
  - no need for crystals
  - limited to very small proteins
• Cryo-electron microscopy
  - imaging technology
  - low resolution

Why predict protein structure?
• Structural knowledge = some understanding of function and mechanism of action
• Predicted structures can be used in structure-based drug design
• It can help us understand the effects of mutations on structure and function
• It is a very interesting scientific problem (still unsolved in its most general form after more than 50 years of effort)

Secondary structure prediction
• Historically, the first structure prediction methods predicted secondary structure
• Can be used to improve alignment accuracy
• Can be used to detect domain boundaries within proteins with remote sequence homology
• Often the first step towards 3D structure prediction
• Informative for mutagenesis studies

Protein Secondary Structures (Simplifications)
• α-HELIX
• β-STRAND
• COIL (everything else)

Assumptions
• The entire information for forming secondary structure is contained in the primary sequence
• Side groups of residues will determine structure
• Examining windows of 13-17 residues is sufficient to predict secondary structure
  - α-helices: 5-40 residues long
  - β-strands: 5-10 residues long

Predicting Secondary Structure From Primary Structure
• Accuracy 64-75%
• Higher accuracy for α-helices than for β-sheets
• Accuracy is dependent on protein family
• Predictions of engineered (artificial) proteins are less accurate

A surprising result! Chameleon sequences

The "Chameleon" sequence
Original sequence (sequence 1 and sequence 2 occupy two different sites):
TEAVDAATAEKVFKQYANDNGVDGEWTYDDATKTFTVTEK
Replace both sequences with an engineered peptide ("chameleon"):
TEAVDAWTVEKAFKTFANDNGVDGAWTVEKAFKTFTVTEK
The same peptide forms an α-helix at the first site and a β-strand at the second.
Source: Minor and Kim. 1996. Nature 380:730-734

Measures of prediction accuracy
• Qindex and Q3
• Correlation coefficient

Qindex
Qindex: (Qhelix, Qstrand, Qcoil, Q3) - percentage of residues correctly predicted as α-helix, β-strand, coil, or for all 3 conformations.
$Q_3 = \frac{N_{\text{predicted}}}{N_{\text{observed}}} \times 100$
where $N_{\text{predicted}}$ is the number of correctly predicted residues and $N_{\text{observed}}$ is the number of observed residues.

Drawbacks:
- even a random assignment of structure can achieve a high score (Holley & Karplus 1991)

Correlation coefficient
$C_\alpha = \frac{p_\alpha n_\alpha - u_\alpha o_\alpha}{\sqrt{(n_\alpha + u_\alpha)(n_\alpha + o_\alpha)(p_\alpha + u_\alpha)(p_\alpha + o_\alpha)}}$
$C_\alpha = 1$ (= 100%) for a perfect prediction.
• $p_\alpha$ = true positives
• $o_\alpha$ = false positives (overpredicted)
• $n_\alpha$ = true negatives
• $u_\alpha$ = false negatives (underpredicted)

Methods of secondary structure prediction

First generation methods: single residue statistics
Chou & Fasman (1974 & 1978): Some residues have particular secondary-structure preferences. Based on empirical frequencies of residues in α-helices, β-sheets, and coils.
Examples: Glu prefers the α-helix; Val prefers the β-strand.

Chou-Fasman method

Name            P(H)  P(E)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine         142   83    66       0.06   0.076   0.035   0.058
Arginine        98    93    95       0.07   0.106   0.099   0.085
Aspartic Acid   101   54    146      0.147  0.11    0.179   0.081
Asparagine      67    89    156      0.161  0.083   0.191   0.091
Cysteine        70    119   119      0.149  0.05    0.117   0.128
Glutamic Acid   151   37    74       0.056  0.06    0.077   0.064
Glutamine       111   110   98       0.074  0.098   0.037   0.098
Glycine         57    75    156      0.102  0.085   0.19    0.152
Histidine       100   87    95       0.14   0.047   0.093   0.054
Isoleucine      108   160   47       0.043  0.034   0.013   0.056
Leucine         121   130   59       0.061  0.025   0.036   0.07
Lysine          114   74    101      0.055  0.115   0.072   0.095
Methionine      145   105   60       0.068  0.082   0.014   0.055
Phenylalanine   113   138   60       0.059  0.041   0.065   0.065
Proline         57    55    152      0.102  0.301   0.034   0.068
Serine          77    75    143      0.12   0.139   0.125   0.106
Threonine       83    119   96       0.086  0.108   0.065   0.079
Tryptophan      108   137   96       0.077  0.013   0.064   0.167
Tyrosine        69    147   114      0.082  0.065   0.114   0.125
Valine          106   170   50       0.062  0.048   0.028   0.053

Chou-Fasman Method
• Accuracy: Q3 = 50-60%

Second generation methods: segment statistics
• Similar to single-residue methods, but incorporating additional information (adjacent residues, segmental statistics).
• Problems:
  - Low accuracy: reported Q3 below 66%.
  - Q3 of β-strands (E): 28%-48%.
  - Predicted structures were too short.
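The two accuracy measures defined above (Q3 and the per-state correlation coefficient) can be sketched in a few lines of Python. This is a minimal illustration, assuming that observed and predicted structures are given as equal-length strings over the three-state alphabet H (helix), E (strand), C (coil); the function names `q3` and `state_correlation` and the example strings are hypothetical, not taken from any published tool.

```python
from math import sqrt


def q3(observed: str, predicted: str) -> float:
    """Percentage of residues whose predicted state equals the observed state."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def state_correlation(observed: str, predicted: str, state: str) -> float:
    """Correlation coefficient for one state (e.g. 'H'), using the slide's symbols:
    p = true positives, n = true negatives,
    o = false positives (overpredicted), u = false negatives (underpredicted)."""
    p = n = o = u = 0
    for obs, pred in zip(observed, predicted):
        if pred == state and obs == state:
            p += 1
        elif pred != state and obs != state:
            n += 1
        elif pred == state and obs != state:
            o += 1
        else:
            u += 1
    denominator = sqrt((n + u) * (n + o) * (p + u) * (p + o))
    # By convention (an assumption here) return 0 when the coefficient is undefined.
    return (p * n - u * o) / denominator if denominator else 0.0


if __name__ == "__main__":
    observed = "HHHHCCEEEE"   # made-up example, not real data
    predicted = "HHHCCCEEEC"
    print("Q3 =", q3(observed, predicted))
    print("C_helix =", state_correlation(observed, predicted, "H"))
```

Unlike Q3, the correlation coefficient penalises both over- and underprediction of a state, which is why it is less easily inflated by a trivial assignment.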
The GOR method
• developed by Garnier, Osguthorpe & Robson
• builds on Chou-Fasman Pij values
• evaluates each residue PLUS the adjacent 8 N-terminal and 8 carboxyl-terminal residues
• sliding window of 17 residues
• underpredicts β-strand regions
• GOR method accuracy: Q3 = ~64%

Third generation methods
• Third generation methods reached 77% accuracy.
• They rest on two new ideas:
  1. A biological idea: using evolutionary information based on conservation analysis of multiple sequence alignments.
  2. A technological idea: using neural networks.

Artificial Neural Networks
An attempt to imitate the human brain (assuming that this is the way it works).

Neural network models
- machine learning approach
- provide training sets of structures (e.g. α-helices, non-α-helices)
- computers are trained to recognize patterns in known secondary structures
- provide a test set (proteins with known structures)
- accuracy ~70-75%

Reasons for improved accuracy
• Align the sequence with other related proteins of the same protein family
• Find members that have a known structure
• If there are significant matches between structure and sequence, assign secondary structures to the corresponding residues

New and Improved Third-Generation Methods
Exploit evolutionary information. Based on conservation analysis of multiple sequence alignments.
• PHD (Q3 ~ 70%) Rost B, Sander C. (1993) J. Mol. Biol. 232, 584-599.
• PSIPRED (Q3 ~ 77%) Jones DT. (1999) J. Mol. Biol. 292, 195-202. Arguably remains the top secondary structure prediction method (won all CASP competitions since 1998).
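The idea of scoring each residue from a 17-residue window (the residue plus 8 neighbours on each side), as described for the GOR method, can be illustrated with a toy predictor. This sketch is an assumption-laden simplification: it merely averages the Chou-Fasman P(H) and P(E) propensities from the table over the window and picks the larger one if it exceeds 100. It implements neither the Chou-Fasman nucleation rules nor the GOR information-theoretic scoring; the function name `predict_states` and the threshold rule are hypothetical.

```python
# Chou-Fasman propensities from the table, keyed by one-letter amino-acid code.
P_HELIX = {
    "A": 142, "R": 98, "D": 101, "N": 67, "C": 70, "E": 151, "Q": 111,
    "G": 57, "H": 100, "I": 108, "L": 121, "K": 114, "M": 145, "F": 113,
    "P": 57, "S": 77, "T": 83, "W": 108, "Y": 69, "V": 106,
}
P_STRAND = {
    "A": 83, "R": 93, "D": 54, "N": 89, "C": 119, "E": 37, "Q": 110,
    "G": 75, "H": 87, "I": 160, "L": 130, "K": 74, "M": 105, "F": 138,
    "P": 55, "S": 75, "T": 119, "W": 137, "Y": 147, "V": 170,
}


def predict_states(sequence: str, window: int = 17) -> str:
    """Assign H, E or C to each residue from window-averaged propensities.

    The window is centred on the residue and truncated at the sequence ends.
    """
    half = window // 2
    states = []
    for i in range(len(sequence)):
        segment = sequence[max(0, i - half): i + half + 1]
        helix = sum(P_HELIX[res] for res in segment) / len(segment)
        strand = sum(P_STRAND[res] for res in segment) / len(segment)
        if helix > 100 and helix >= strand:
            states.append("H")
        elif strand > 100 and strand > helix:
            states.append("E")
        else:
            states.append("C")
    return "".join(states)


if __name__ == "__main__":
    # The chameleon-containing sequence quoted earlier in the slides.
    print(predict_states("TEAVDAWTVEKAFKTFANDNGVDGAWTVEKAFKTFTVTEK"))
```

Because the two chameleon copies sit in different sequence contexts, a window-based score can assign them different states even though the central residues are identical, which is the point of using a window rather than single-residue statistics.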
Secondary Structure Prediction Summary
1st Generation - 1970s
• Q3 = 50-55%
• Chou & Fasman, GOR
2nd Generation - 1980s
• Q3 = 60-65%
• Qian & Sejnowski, GORIII
3rd Generation - 1990s
• Q3 = 70-80%
• PHD, PSIPRED

Many 3rd+ generation methods exist:
PSI-PRED - http://bioinf.cs.ucl.ac.uk/psipred/
JPRED - http://www.compbio.dundee.ac.uk/~www-jpred/
PHD - http://www.embl-heidelberg.de/predictprotein/predictprotein.html
NNPRED - http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html

The sequence-structure gap
More than 13,137,813 known protein sequences, 76,495 experimentally determined structures.
The gap is getting bigger.
[Figure: growth in the number of known sequences and known structures over time, 1980-2005]

Beyond Secondary Structure, Before Tertiary Structure
Supersecondary structures (motifs): small, discrete, commonly observed aggregates of secondary structures
• helix-loop-helix
• β-α-β
Domains: independent units of structure
• β barrel
• four-helix bundle
The terms "domain" and "motif" are sometimes used interchangeably.

Folds: compact folding arrangements of a polypeptide chain (a protein or part of a protein).
The terms "domain" and "fold" are sometimes used interchangeably.
EF Fold
Found in calcium-binding proteins such as calmodulin

Leucine Zipper

Rossmann Fold
• The beta-alpha-beta-alpha-beta subunit
• Often present in nucleotide-binding proteins

β sandwich
β barrel
α/β horseshoe

Four helix bundle
• 24 amino acid peptide with a hydrophobic surface
• Assembles into a 4-helix bundle through hydrophobic regions
• Maintains solubility of membrane proteins

TIM Barrel

PDB New Fold Growth
[Figure: PDB growth, old folds vs. new folds]
• The number of unique folds in nature is fairly small (possibly a few thousand)
• 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB

Protein data bank
• http://www.rcsb.org/pdb/

Protein 3D structure data:
The structure of a protein consists of the 3D (X,Y,Z) coordinates of each non-hydrogen atom of the protein. Some protein structures also include coordinates of covalently linked prosthetic groups, non-covalently linked ligand molecules, or metal ions. For some purposes (e.g. structural alignment) only the Cα coordinates are needed.

Example of PDB format:

Record  Serial  Atom  Residue  Res. no.  X       Y        Z       Occupancy  Temp. factor
ATOM    18      N     GLY      27        40.315  161.004  11.211  1.00       10.11
ATOM    19      CA    GLY      27        39.049  160.737  10.462  1.00       14.18
ATOM    20      C     GLY      27        38.729  159.239  10.784  1.00       20.75
ATOM    21      O     GLY      27        39.507  158.484  11.404  1.00       21.88

Note: ATOM records provide no information about connectivity between atoms. The last two numbers (occupancy, temperature factor) relate to disorder of atomic positions in crystals.

Protein structure: Some computational tasks
• Building a protein structure model from X-ray data
• Building a protein structure model from NMR data
• Computing the energy for a given protein structure (conformation)
• Energy minimization: finding the structure with the minimal energy according to some empirical "force fields".
• Simulating the protein folding process (molecular dynamics)
• Structure visualization
• Computing secondary structure from atomic coordinates
• Protein superposition, structural alignment
• Protein fold classification
• Threading: finding a fold (prototype structure) that fits a sequence
• Docking: fitting ligands onto a protein surface by molecular dynamics or energy minimization
• Protein 3D structure prediction from sequence

Viewing protein structures
When looking at a protein structure, we may ask the following types of questions:
• Is a particular residue on the inside or outside of a protein?
• Which amino acids interact with each other?
• Which amino acids are in contact with a ligand (DNA, peptide hormone, small molecule, etc.)?
• Is an observed mutation likely to disturb the protein structure?

Standard capabilities of protein structure software:
• Display of protein structures in different ways (wireframe, backbone, sticks, spacefill, ribbon)
• Highlighting of individual atoms, residues or groups of residues
• Calculation of interatomic distances
• Advanced feature: superposition of related structures

Example: c-abl oncoprotein SH2 domain, display wireframe
Example: c-abl oncoprotein SH2 domain, display sticks
Example: c-abl oncoprotein SH2 domain, display backbone
Example: c-abl oncoprotein SH2 domain, display spacefill
Example: c-abl oncoprotein SH2 domain, display ribbons

Predicting protein 3D structure
Goal: 3D structure from 1D sequence
• An existing fold: fold recognition, homology modeling
• A new fold: ab initio

Homology modeling
Based on two major observations (and some simplifications):
1. The structure of a protein is uniquely defined by its amino acid sequence.
2. Similar sequences adopt similar structures. (Distantly related sequences may still fold into similar structures.)

Homology modeling needs three items of input:
• The sequence of a protein with unknown 3D structure, the "target sequence."
• A 3D "template": a structure having the highest sequence identity with the target sequence (>30% sequence identity)
• A sequence alignment between the target sequence and the template sequence

Homology Modeling: How it works
• Find template
• Align target sequence with template
• Generate model:
  - add loops
  - add sidechains
• Refine model

Two zones of homology modeling [Rost, Protein Eng. 1999]

Automated Web-Based Homology Modelling
SWISS Model : http://www.expasy.org/swissmod/SWISS-MODEL.html
WHAT IF : http://www.cmbi.kun.nl/swift/servers/
The CPHModels Server : http://www.cbs.dtu.dk/services/CPHmodels/
3D Jigsaw : http://www.bmm.icnet.uk/~3djigsaw/
SDSC1 : http://cl.sdsc.edu/hm.html
EsyPred3D : http://www.fundp.ac.be/urbm/bioinfo/esypred/

Fold recognition = Protein Threading
Which of the known folds is likely to be similar to the (unknown) fold of a new protein when only its amino-acid sequence is known?

Protein Threading
• The goal: find the "correct" sequence-structure alignment between a target sequence and its native-like fold in PDB
  MTYKLILN …. NGVDGEWTYTE
• Energy function: knowledge (or statistics) based rather than physics based
  - Should be able to distinguish correct structural folds from incorrect structural folds
  - Should be able to distinguish correct sequence-fold alignments from incorrect sequence-fold alignments

Protein Threading
• Basic premise: the number of unique structural (domain) folds in nature is fairly small (possibly a few thousand)
• Statistics from the Protein Data Bank (~2,000 structures): 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB
• Chances for a protein to have a structural fold that already exists in PDB are quite good.
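The >30% sequence identity criterion for choosing a homology-modelling template can be checked directly from a pairwise alignment. The sketch below assumes the two sequences are already aligned and padded with "-" for gaps, and it counts identity over aligned (non-gap) columns only; other definitions of percent identity use a different denominator, so treat this choice as an assumption. The function name `percent_identity` and the example alignment are hypothetical.

```python
def percent_identity(aligned_target: str, aligned_template: str, gap: str = "-") -> float:
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    aligned_columns = 0
    identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == gap or b == gap:
            continue  # skip columns containing a gap
        aligned_columns += 1
        if a == b:
            identical += 1
    return 100.0 * identical / aligned_columns if aligned_columns else 0.0


if __name__ == "__main__":
    target = "MTYKLILNGK"    # made-up toy alignment, not real data
    template = "MTYKV-LNGE"
    identity = percent_identity(target, template)
    verdict = "usable" if identity > 30 else "below the 30% threshold"
    print(f"{identity:.1f}% identity: template {verdict}")
```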
Protein Threading
Basic components:
- Structure database
- Energy function
- Sequence-structure alignment algorithm
- Prediction reliability assessment

Protein Threading: structure database
• Build a template database

Process
• Threading: a protein fold recognition technique that involves incrementally replacing the sequence of a known protein structure with a query sequence of unknown structure. The new "model" structure is evaluated using a simple heuristic measure of protein fold quality. The process is repeated against all known 3D structures until an optimal fit is found.

Fold recognition methods
• 3D-PSSM http://www.sbg.bio.ic.ac.uk/~3dpssm/
• Fugue http://www-cryst.bioc.cam.ac.uk/~fugue/
• HHpred http://protevo.eb.tuebingen.mpg.de/toolkit/index.php?view=hhpred

Ab initio folding
Goal: Predict structure from "first principles"
Requires:
- A free energy function, sufficiently close to the "true potential"
- A method for searching the conformational space
Advantages:
- Works for novel folds
- Shows that we understand the process
Disadvantages:
- Applicable to short sequences only

Rosetta [Simons et al. 1997]
http://www.bioinfo.rpi.edu/~bystrc/hmmstr/server.php

Qian et al. (Nature, 2007) used distributed computing* to predict the 3D structure of a protein from its amino-acid sequence. Here, their predicted structure (grey) of a protein is overlaid with the experimentally determined crystal structure (color) of that protein. The agreement between the two is excellent.
*70,000 home computers for about two years.
Overall Approach
[Flowchart]
1. Protein sequence → database searching and multiple sequence alignment
2. Homologue in PDB?
   - Yes → homology modelling → 3-D protein model
   - No → secondary structure prediction → fold recognition
3. Predicted fold?
   - Yes → sequence-structure alignment → 3-D protein model
   - No → ab initio structure prediction → 3-D protein model

ExPASy Proteomics Server (Expert Protein Analysis System): links to lots of protein prediction resources
http://expasy.org/

RMSDmin
The root mean square deviation (RMSD) is a measure of the average distance between the backbones of superimposed proteins. In the study of globular protein conformations, one customarily measures the similarity in three-dimensional structure by the RMSD of the Cα atomic coordinates after optimal rigid body superposition.
A widely used way to compare the structures of biomolecules or solid bodies is to translate and rotate one structure with respect to the other so as to minimize the RMSD. This RMSDmin can be used as a distance measure between two proteins.
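For a fixed superposition, the RMSD over N paired Cα atoms is the square root of the mean squared distance between corresponding atoms. The sketch below computes that quantity in plain Python and also removes the translational component by moving both structures to a common centroid. It is deliberately incomplete with respect to RMSDmin: the search for the optimal rotation (for example with the Kabsch algorithm) is not implemented, so the value returned is an upper bound on RMSDmin. The function names and the toy coordinates are hypothetical.

```python
from math import sqrt


def center(coords):
    """Translate a list of (x, y, z) tuples so that its centroid is at the origin."""
    n = len(coords)
    cx = sum(x for x, _, _ in coords) / n
    cy = sum(y for _, y, _ in coords) / n
    cz = sum(z for _, _, z in coords) / n
    return [(x - cx, y - cy, z - cz) for x, y, z in coords]


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of paired (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return sqrt(total / len(coords_a))


if __name__ == "__main__":
    # Toy "C-alpha traces": structure_b is structure_a shifted by 5 units along each axis.
    structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    structure_b = [(x + 5.0, y + 5.0, z + 5.0) for x, y, z in structure_a]
    print("RMSD as given:       ", rmsd(structure_a, structure_b))
    print("RMSD after centering:", rmsd(center(structure_a), center(structure_b)))
```

The example shows why superposition matters: two identical structures that differ only by a translation have a large raw RMSD, which drops to essentially zero once the translation is removed.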