http://bioinformatics.icmb.utexas.edu/OPD
~400,000 peptide mass spectra

A few diverse examples of proteins
[Figure labels: a muscle protein; aspirin; a virus protein shell ("capsid"). Watercolors by David Goodsell, Scripps]

Outline

Part I: What dictates the 3D shape ("fold") of proteins?
1. Primary structure of proteins
   - amino acids & peptide bonds
2. Secondary structure of proteins
   - "local" folding topology & predicting 2° structure
3. Tertiary structure of proteins
   - "global" folding topology
   - X-ray crystallography & NMR
   - aligning structures computationally
   - protein folding
   - designing new structures

Part II: How do proteins interact with each other in the cell?

The levels of protein structure

Different representations of a typical globular protein (myoglobin)
- "ribbon" = Cα backbone
- ribbon + stick-figure side chains
- all atoms drawn at van der Waals radii
- solvent accessible surface

Due to resonance forms of the peptide bond:
Peptide bonds (N-CO) are planar, so the only allowed rotations along the amino acid backbone are around the Cα-N and Cα-CO bonds
==> by convention, these angles are called φ & ψ

Protein folding = the selection of φ/ψ angles & side chain angles leading to low energy packing of the atoms

A Ramachandran plot shows that only certain φ/ψ combinations are sampled, dictated by steric hindrance of atoms neighboring the peptide bond
Favored regions correspond to secondary structures ==> allowable "local" structural conformations

3 of the most common secondary structures
- α helix: 3.6 aa's/turn
http://www.rtc.riken.go.jp/jouhou/image/protein/2ndst/2ndst.html

Amino acids vary in their intrinsic propensities to adopt the different secondary structures

Given aa sequence, how to predict 2° structure?
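A minimal sketch of the windowing step that window-based predictors rely on: each residue is described by itself plus its neighbours on either side. The 13-residue width follows the window size quoted for the predictor discussed next. The function name `sliding_windows`, the "-" padding character, and the example sequence are illustrative assumptions, and no prediction is made here.

```python
def sliding_windows(sequence, width=13, pad="-"):
    """Return one window of `width` residues centred on each position.

    Positions near the ends are padded with `pad` (an assumption made
    here so that every residue gets a full-width window).
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so each window has a central residue")
    half = width // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + width] for i in range(len(sequence))]


seq = "MEALKQWVDNRSTGH"  # hypothetical amino acid sequence
windows = sliding_windows(seq)

for residue, window in zip(seq, windows):
    # The central character of each window is the residue being described.
    print(residue, window)
```

Each window would then be encoded numerically and passed to a classifier that outputs a state for the central residue.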
==> PHD: input = 13 aa sliding window
- neural network; predicts 3 states (α helix, β strand, coil) & relative level of solvent accessibility
==> 3-state prediction accuracy ~72%
http://maple.bioc.columbia.edu/predictprotein/

Some proteins have unusual secondary structures that span the membrane => membrane proteins

How to identify transmembrane segments in a protein?
Current best approach, TMHMM, is based on hidden Markov models.

A generic HMM:
- Hidden states X and Y, connected by transition probabilities
- Each hidden state has its own emission probabilities over the observable symbols:
      symbol    A    B    C   ...
      state X   0.1  0.3  0.4 ...
      state Y   0.4  0.1  0.3 ...
- Hidden state seq: XXXXYYYYXXXY
- Observable seq:   CCBCCAAABCAC
Goal = recover the hidden state sequence by analyzing the emissions

TMHMM hidden Markov model
- inside & outside loop models, helix cap models
- HMM for 5-25 aa helix core
- Correctly predicts >90% of the transmembrane helices
- Discriminates between soluble and membrane proteins with false positive rate ~1%
Krogh et al., J Mol Biol. 305:567-80 (2001)

Packing of secondary structures leads to more complex 3D assemblies ("motifs")

Tertiary structure = 3D packing of secondary structural elements
- Hydrophobic residues (Phe, Ile, Leu, Trp) buried in the core
- Core densely packed; no room even for H2O, comparable to a typical crystal
- Core atoms so close that van der Waals interactions contribute significantly
- Charged and polar R groups (e.g., Arg, Lys, Glu, Asp, His) on outside and hydrated

Experimental approaches to protein structure I: X-ray crystallography
- Crystal of pure protein
- Rotate crystal, collect amplitudes of diffracted X-rays as a function of the incident angle of the X-rays
- Find phases of diffracted X-rays (by experiment or computation)

Electrons in the crystal diffract X-rays according to Bragg's Law:

$$n\lambda = 2d \sin\theta$$

where λ = wavelength, d = distance between atomic layers in the crystal, θ = angle of X-rays to the plane of atoms
From B. Rupp's X-ray crystallography intro: http://www-structure.llnl.gov/Xray/101index.html

- With phases & amplitudes, Fourier transform to find the distribution of electrons ("electron density") in the protein
- Build atomic model into electron density, refine

Experimental approaches to protein structure II: Nuclear magnetic resonance
[Figure labels: very strong magnet; protein in solution in center; coils to send/detect radio waves]
- Vary radio wave pulses, measure the field generated in response over time => function of the chemical environment of each nucleus
- Assign identities to nuclei, measure distances between amino acid atoms
- Use distance geometry to solve for an ensemble of 3D structures consistent with the distance constraints

Basic principle:
- Atomic nuclei w/ odd mass #'s have spin ==> charged, spinning particles that produce a magnetic field
- In an external magnetic field, this nuclear magnetic field precesses around an axis
- Can observe this process by applying radio wave pulses at frequencies related to the precession frequencies & measuring the resulting induced electric current
Flemming Poulsen, A Brief Introduction to NMR Spectroscopy of Proteins.

3 broadest classes of protein 3D structures
- Fibrous, e.g., collagen
- Membrane, e.g., K+ channel
- Globular ...

Examples of globular protein "folds": all α, α/β, all β, α+β

>24,000 experimentally determined protein structures stored in the PDB database: http://www.rcsb.org/pdb/

Atomic coordinates of a protein structure (PDB format) - first 3 aa's = Met-Glu-Ala...
Columns: record; atom # & name; aa type, chain & #; atomic coordinates x, y, z; occupancy; B-factor; atom type

ATOM      1  N   MET A   1      32.632 -11.712  53.840  1.00 63.20           N
ATOM      2  CA  MET A   1      31.203 -12.125  53.853  1.00 63.20           C
ATOM      3  C   MET A   1      30.947 -12.743  55.207  1.00 63.20           C
ATOM      4  O   MET A   1      31.741 -13.533  55.685  1.00 63.20           O
ATOM      5  CB  MET A   1      30.931 -13.144  52.733  1.00 96.70           C
ATOM      6  CG  MET A   1      29.500 -13.132  52.189  1.00 96.70           C
ATOM      7  SD  MET A   1      28.784 -14.774  52.145  1.00 96.70           S
ATOM      8  CE  MET A   1      27.934 -14.832  53.770  1.00 96.70           C
ATOM      9  N   GLU A   2      29.841 -12.367  55.822  1.00 61.59           N
ATOM     10  CA  GLU A   2      29.498 -12.881  57.128  1.00 61.59           C
ATOM     11  C   GLU A   2      28.134 -12.349  57.527  1.00 61.59           C
ATOM     12  O   GLU A   2      28.043 -11.213  57.995  1.00 61.59           O
ATOM     13  CB  GLU A   2      30.533 -12.408  58.152  1.00 51.85           C
ATOM     14  CG  GLU A   2      30.050 -12.440  59.600  1.00 51.85           C
ATOM     15  CD  GLU A   2      30.843 -11.520  60.513  1.00 51.85           C
ATOM     16  OE1 GLU A   2      31.432 -10.532  60.018  1.00 51.85           O
ATOM     17  OE2 GLU A   2      30.858 -11.780  61.737  1.00 51.85           O
ATOM     18  N   ALA A   3      27.077 -13.140  57.353  1.00 71.14           N
ATOM     19  CA  ALA A   3      25.751 -12.666  57.749  1.00 71.14           C
ATOM     20  C   ALA A   3      25.735 -12.594  59.298  1.00 71.14           C
ATOM     21  O   ALA A   3      25.475 -13.591  59.986  1.00 71.14           O
ATOM     22  CB  ALA A   3      24.678 -13.608  57.214  1.00 34.69           C

Some of the major computational questions in structural biology
1. How to distinguish membrane proteins from soluble proteins?
2. How to align protein structures & start organizing them into families, etc.?
3. How to predict folded protein structure from the linear amino acid sequence?
4. How to identify the active/functional region of the protein from the structure?
5. How to predict the interactions of drugs or other proteins from the structure?
6. How to computationally predict the structural consequences of mutations?
7. How to predict protein function from structure?
8. How to design new or unnatural protein structures?

How to find the best superposition of 2 protein structures?
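As a first step toward comparing structures, the ATOM records in the Met-Glu-Ala excerpt above can be reduced to Cα coordinates and a matrix of pairwise Cα-Cα distances. The sketch below is only an illustration: it uses whitespace-split parsing, which is an assumption that happens to work for these three records (real PDB files are fixed-column), and the helper names `parse_ca` and `distance_matrix` are hypothetical.

```python
import math

# Cα records taken from the Met-Glu-Ala excerpt (coordinates only,
# whitespace-separated for simplicity).
CA_RECORDS = """
ATOM 2 CA MET A 1 31.203 -12.125 53.853
ATOM 10 CA GLU A 2 29.498 -12.881 57.128
ATOM 19 CA ALA A 3 25.751 -12.666 57.749
"""


def parse_ca(text):
    """Return a list of (x, y, z) tuples for the Cα atoms in `text`."""
    coords = []
    for line in text.strip().splitlines():
        fields = line.split()
        if fields[0] == "ATOM" and fields[2] == "CA":
            coords.append(tuple(float(v) for v in fields[6:9]))
    return coords


def distance_matrix(coords):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    return [[math.dist(a, b) for b in coords] for a in coords]


d = distance_matrix(parse_ca(CA_RECORDS))
for row in d:
    print("  ".join(f"{value:6.3f}" for value in row))
```

Consecutive Cα atoms come out a little under 4 Å apart, as expected for residues joined by a peptide bond. A distance matrix of this kind does not depend on how the molecule is oriented, which is why it is convenient for comparing two structures before any superposition is known.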
Note: superimposing 2 structures is easy if you know the equivalent amino acids -> the hard part is to find this mapping of atoms from 1 structure to the other

One now-classic approach: DALI
- Start from the structure of protein #1, Cα coordinates only
- Calculate the matrix of all pairwise Cα-Cα distances (amino acid # vs. amino acid #)
- Repeat for protein #2
- Align sequence #1 to sequence #2 so as to maximize similarity in contact patterns
Holm & Sander, J Mol Biol. 233:123-38 (1993)

Best structural alignment corresponds to maximizing

$$S = \sum_{i=1}^{L} \sum_{j=1}^{L} \phi(i, j)$$

where i, j = aligned pairs of matched residues, i = (i_A, i_B), j = (j_A, j_B), and φ = similarity of the 2 Cα-Cα distance matrices, d^A_{ij} and d^B_{ij}.

In the simplest case,

$$\phi^{R}(i, j) = \theta^{R} - \left| d^{A}_{ij} - d^{B}_{ij} \right|$$

where d^A_{ij} and d^B_{ij} are the Cα-Cα distances between equivalenced residues in proteins A and B, and θ^R = minimum level of similarity.

Choose the mapping of residues (e.g., i_A to i_B) to minimize |d^A_{ij} - d^B_{ij}|
[Figure: residues i_A and j_A in protein A separated by distance d^A_{ij}; residues i_B and j_B in protein B separated by distance d^B_{ij}]

The ability to compare structures has led to recognition of a hierarchy of 3° structures ("folds"), as organized in the CATH or SCOP or FSSP databases:
- Class
- Architecture
- Topology
- Homologous Superfamily (H), e.g., flavodoxin homologues
Manual classification at architecture level, automated at topology level

Protein Folding
Classic experiment from the 1960's (Chris Anfinsen): purified small protein RNase A refolded in a few minutes in solution
==> all information necessary for correct folding was captured in the linear amino acid sequence
Corollary: Proteins do not fold by randomly testing conformations.
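A back-of-envelope check of this corollary, using the figures from this section (100 residues, 10 conformations per residue). The sampling rate of 10^13 conformations per second is a hypothetical value chosen only for illustration and does not come from the slides.

```python
residues = 100
conformations_per_residue = 10

# Every residue chooses independently, so the counts multiply.
total_conformations = conformations_per_residue ** residues  # 10**100

SAMPLES_PER_SECOND = 10 ** 13  # assumption: one conformation per 0.1 ps
SECONDS_PER_YEAR = 365.25 * 24 * 3600

years_to_enumerate = total_conformations / SAMPLES_PER_SECOND / SECONDS_PER_YEAR

print(f"{total_conformations:.0e} conformations")
print(f"about {years_to_enumerate:.1e} years to try them all")
```

Even with a generous sampling rate, the enumeration time exceeds any biological timescale by many orders of magnitude, whereas real proteins fold in seconds to minutes.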
Given a 100 amino acid protein & 10 possible conformations per amino acid = 10^100 possible conformations for the protein
==> not possible to randomly sample; clearly a constrained search

An energetic view of the folding process
[Figure: free energy along the folding trajectory, U (unfolded) -> M (molten globule) -> T (transition state) -> F (folded)]
- Unfolded (U): large # of different conformations
- Fast: "hydrophobic collapse" to the molten globule (M) = collection of conformationally similar molecules interconverting
- Slow: optimize packing, through the transition state (T), to the folded state (F) = unique or small # of final conformations
- Local secondary structures form first
Adapted from Branden & Tooze

One long-time goal of biologists/biophysicists: solve the Protein Folding Problem = computationally predict protein 3D fold from 1D amino acid sequence

Two general approaches:
- 1st principles/ab initio: e.g., atomistic molecular dynamics simulations of proteins, modeling force fields w/ electrostatic & van der Waals forces, solvent, etc. over long times
- Empirical: fold recognition/threading; reverses the process: given a set of structures, learn empirical rules that predict folds
Empirical is currently more successful at predicting the final structure, but gives no information about the folding trajectory

An example of a successful design of a new protein fold by a combination of empirical & ab initio structural modeling
- designed 93 amino acid protein with topology not in the PDB database
[Figure: designed model vs. solved structure]
Kuhlman et al., Science 302:1364-1368 (2003)

The Kuhlman et al. design strategy

Starting model
- Choose predefined 3D topology
- Assemble 3D model from 3 and 9 amino acid fragments of known structure
==> Generated 172 backbone-only starting models

Initialization
- Choose optimal sequence for each starting model using an energy function that captures:
  - 12-6 Lennard-Jones potential
  - orientation-dependent hydrogen bonding term
  - implicit solvation model
- Choose amino acid side chain orientations ("rotamers") by sampling from known structures

Iterate between:
- Optimize choice of amino acid sequence for a fixed backbone conformation
- Optimize amino acid backbone coordinates for a fixed sequence

Same energy function used at all stages
Only the previous lowest energy sequence/structure is optimized at each stage
Final designed sequence is not similar to any known protein sequence
Kuhlman et al., Science 302:1364-1368 (2003)

References
A good introduction to structural biology = Introduction to Protein Structure, by Carl Branden & John Tooze
Web resources:
- Protein Data Bank = >24,000 protein structures, atomic coordinates, & the "protein of the month": http://www.rcsb.org/pdb
- CATH/SCOP protein structure hierarchies: http://www.biochem.ucl.ac.uk/bsm/cath/ and http://scop.mrc-lmb.cam.ac.uk/scop/
Several of the illustrations in this tutorial were taken from Lehninger Principles of Biochemistry, by Nelson & Cox

Part II

[Figure: "Macrophage and Bacterium 2,000,000X", watercolor by David S. Goodsell, 2002. Labels: macrophage ("white blood cell"); blood serum; bacterium]

Typical size ranges of known protein structures & assemblies
[Figure labels: single protein domain; dimeric protein; aquaporin (membrane channel); ribosome]
From a (recommended) review article ==> Sali et al., Nature 422:216-225 (2003)

Outline

Part I: What dictates the 3D shape ("fold") of proteins?

Part II: How do proteins interact with each other in the cell?
4. "Quaternary" structure of proteins & protein interactions
5. Experimental approaches to determine interactions
   - yeast 2 hybrid, mass spectrometry
6. Testing the accuracy of the interactions
7. Moving back to the atomic resolution world
   - electron microscopy & tomography
   - modeling structures of complexes

Why study interactions?
Proteins interact all the time (e.g., bump into each other non-specifically)
We're interested in specific interactions ==> e.g., those w/ downstream consequences
For example, consequences might include:
- Inducing a change in the structure of an interaction partner
- Stabilizing or destabilizing an interaction partner
- Modifying the activity of a protein (activate, inhibit, or otherwise regulate)
- Causing an interaction partner to move to another location
- Cutting an interaction partner
- Chemically modifying an interaction partner (phosphorylate, dephosphorylate, glycosylate, deglycosylate, ubiquitinate, sumoylate, etc.)
==> more than 200 modifications to proteins known, many catalyzed by other proteins
So, defining interactions helps to define these processes & their functional consequences

Experimental/computational methods for observing/inferring protein interactions
Sali et al., Nature 422:216-225 (2003)

[Figure: X-ray structure of ATP synthase; schematic version; network representation. Subunit labels: α, β, δ, γ, b2, ε, a, c12]
Total set = protein complex
Sum of direct + indirect interactions
Some methods measure direct interactions, some indirect
Xenarios & Eisenberg, Curr. Op. Biotech. 12:334-9 (2001)

Interactions between yeast proteins

Experimental approaches to protein interactions I: Yeast two-hybrid
[Figure labels: DBD = DNA binding domain, fused to "bait"; Act = transcription activation domain, fused to "prey"; core transcription machinery; operator or upstream activating sequence; reporter gene; transcription]
Basic idea = screen a library of "prey" proteins to test which ones interact with a given "bait" protein
Fields & Song, Nature 340:245-6 (1989)

Experimental approaches to protein interactions I: High-throughput yeast two-hybrid
- Haploid yeast cells expressing activation domain-prey fusion proteins
- Diploid yeast probed with DNA-binding domain-Pcf11 bait fusion protein
Uetz et al., Nature 403 (2000)

A second group (Ito et al.), with a related yeast two-hybrid approach, also mapped a large number of interactions, then compared the interactions w/ the Uetz data.
A surprise at the time was the apparent inconsistency among the interaction sets
==> either the # of potential interactions is large or the false positive rate is high (or both)
Ito et al., PNAS 98:4569-74 (2001)

Experimental approaches to protein interactions II: Mapping complexes by mass spectrometry
- Tag the "bait" protein (493 bait proteins)
- Interaction partners co-purified with "bait" on an affinity column
- SDS-PAGE separates the co-purified proteins (protein 1 ... protein 6)
- Trypsin digest, identify peptides by mass spectrometry
==> 3617 "interactions"
Ho et al., Nature 415 (2002)

Experimental approaches to protein interactions II: A variant, tandem affinity purification (TAP) + mass spectrometry
- Bait carries two tags (Tag1, Tag2)
- Affinity column 1, + protease, then affinity column 2
- SDS-PAGE separates the co-purified proteins (protein 1 ... protein 6)
- Trypsin digest, identify peptides by mass spectrometry
Rigaut et al., Nature Biotech. 17:1030-2 (1999)
Gavin et al., Nature 415 (2002)

How accurate are these high-throughput screens?
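One simple way to start answering this for two screens (for example, the kind of comparison made between the Ito and Uetz sets) is to count how many interactions they share. The sketch below treats interactions as unordered pairs. The protein names and both toy interaction lists are hypothetical, and the helper names are illustrative only.

```python
def as_pair_set(interactions):
    """Treat each interaction as unordered, so (A, B) equals (B, A)."""
    return {frozenset(pair) for pair in interactions}


def overlap(screen_a, screen_b):
    """Return (# shared, fraction of A that is shared, fraction of B that is shared)."""
    a, b = as_pair_set(screen_a), as_pair_set(screen_b)
    shared = a & b
    return len(shared), len(shared) / len(a), len(shared) / len(b)


# Hypothetical toy data, not taken from any published screen.
screen_1 = [("P1", "P2"), ("P2", "P3"), ("P4", "P5"), ("P6", "P7")]
screen_2 = [("P2", "P1"), ("P4", "P5"), ("P8", "P9")]

n_shared, frac_1, frac_2 = overlap(screen_1, screen_2)
print(f"{n_shared} shared; {frac_1:.0%} of screen 1, {frac_2:.0%} of screen 2")
```

A small overlap on its own does not say which explanation applies: it is expected both when each screen samples a small part of a large true interaction set and when the false positive rate is high.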
Can compare to known interactions, but these are incomplete
A different strategy is to identify properties that correlate with interactions & test versus those properties

Three tests:
1. Comparison of interactions to a reference interaction set
2. Comparison of mRNA co-expression of interacting partners
3. Comparison of functions of predicted interaction partners

Test #1
Estimate accuracy by comparing to a well-determined reference set of interactions (tends to underestimate accuracy)
von Mering, Krause et al., Nature, May 8, 2002

Test #2
Estimate interaction assay accuracy by assessing mRNA co-expression of putative interaction partners
[Figure: distributions for random protein pairs vs. true interactions; x-axis = correlation coefficient between expression vectors derived from many DNA microarray experiments]
Estimate % false positives from observed vs. expected genes w/ correlated expression
Estimated false positive rates based on this test: Mrowka et al., Genome Research 11:1971-3 (2001)

A related strategy: fit the distribution of co-expression relationships as a mix of those from random & well-characterized interactions ==> mixture % indicates accuracy.
Deane, Salwinski et al., Mol. Cell. Proteomics (2002)

Estimated true positive rates based on this test
[Figure categories: genome-wide yeast two-hybrid; at least 1 small-scale expmt; >1 independent expmt; >2 independent expmt; paralogs also interact; increasing # of Interaction Sequence Tags]
Deane, Salwinski et al., Mol. Cell. Proteomics (2002)

Test #3
Estimate accuracy by measuring functional similarity of putative partners
==> in particular, measure tendency to be in the same cellular system or process
From literature & pathway databases (KEGG/GO), we know ~1,000-3,000 yeast protein functions, e.g.:
- Swi4 (protein A): cell cycle; MAPK signaling pathway
- Cdc27 (protein B): cell cycle; ubiquitin-mediated proteolysis

Pathways of A = pw1, pathways of B = pw2
Jaccard coefficient = # pathways in common / # total pathways (for Swi4 & Cdc27: 1 shared of 3 total = 1/3)

$$\langle \text{pathway similarity} \rangle = \frac{1}{n} \sum_{\text{pairs}} \frac{\left| pw_1 \cap pw_2 \right|}{\left| pw_1 \cup pw_2 \right|}$$

where n = # of protein pairs
Systematically test every pair of characterized proteins

Quality of the observed protein-protein interactions as measured by the pathway overlap test
[Figure labels: small-scale experiments; max agreement of interacting proteins' pathways; large-scale yeast two-hybrid interaction experiments]
Date & Marcotte, Nature Biotech. 2003

The various accuracy tests agree to a first approximation (at least as regards the ranking of accuracies)

Estimated true positive rate, in column order: co-expression (Mrowka, Deane), test set (von Mering), pathways (Date)
Rows with fewer than four values list them in column order; blank cells are not marked.

Authors                      Method             # interactions   Estimated true positive rates
Ito et al.                   Y2H                4081             9%, 22%, 6-10%, ~18%
Ho et al.                    MS                 ~3617            ~10%, 1-3%
Gavin et al.                 MS                 ~1440            ~85%, ~10%
Uetz et al.                  Y2H                957              53%, 50%, ~57%
Tong et al.                  synthetic lethal   295              ~20%
>1 independent experiment                       ~2000            85%, ~30-40%, ~60-70%
>2 independent experiments                      1167             88%, ~87%, ~95%

The current highest throughput protein interaction screens:

Organism     Authors                  Method   # interactions
Yeast        Ito et al.               Y2H      4081
Yeast        Tong et al.              SL       ~4000
Yeast        Ho et al.                MS       ~3617
Yeast        Gavin et al.             MS       ~1440
Yeast        Uetz et al.              Y2H      957
Yeast        Fromont-Racine et al.    Y2H      357
Yeast        Tong et al.              SL       295
Yeast        Newman et al.            Y2H      152
C. elegans   Li et al.                Y2H      ~4000
C. elegans   Walhout et al.           Y2H      148
C. elegans   Davy et al.              Y2H      138
Fly          Giot et al.              Y2H      20,405
Human        Bouwmeester et al.       MS       221
& several others, including Hepatitis C & H. pylori
Y2H = yeast two hybrid, MS = mass spectrometry, SL = synthetic lethal

How many meaningful physical protein-protein interactions are there?
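The rough estimate for this question is a product of two quantities: the number of proteins and an assumed number of interactions per protein. The sketch below reproduces that arithmetic with the figures quoted in the estimate (about 5,800 yeast proteins, at least 40,000 human proteins, 2-10 interactions per protein). It follows the slide's convention of simply multiplying the two numbers, and the function name is illustrative.

```python
def interaction_range(n_proteins, per_protein=(2, 10)):
    """Return the (low, high) estimate of total interactions."""
    low, high = per_protein
    return n_proteins * low, n_proteins * high


yeast_low, yeast_high = interaction_range(5_800)
human_low, human_high = interaction_range(40_000)  # a lower bound: >>40,000 proteins

print(f"yeast: {yeast_low:,} - {yeast_high:,} interactions")
print(f"human: at least {human_low:,} - {human_high:,} interactions")

# Coverage of the human map if fewer than 5,000 interactions are known.
known_human = 5_000
print(f"human coverage: below {known_human / human_high:.1%} to {known_human / human_low:.1%}")
```

The yeast range rounds to the ~12,000 - 60,000 quoted in the estimate, and the human coverage lands in the low single-digit percent range, consistent with "approx. 1%".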
At a rough estimate:

Yeast
- ~5,800 genes ==> ~5,800 proteins
- x 2-10 interactions/protein ==> ~12,000 - 60,000 interactions
- >10-20,000 known, perhaps ~1/2 correct
==> ~1/3 of the way to a complete map!

Human
- ~40,000 genes ==> >>40,000 proteins
- x 2-10 interactions/protein ==> >>80,000 - 400,000 interactions
- <5,000 known
==> approx. 1% of the complete map!

==> We're a long way from the complete map of the human "interactome"

Can we relate these interactions back to the protein structure?
==> A growing area of research is the combination of low resolution structures with atomic models to build structures of protein complexes. For example:
Experimental or computational protein models + low resolution electron density map from electron microscopy or electron tomography ==> rough estimate of atomic model of protein complex

Example 1: Electron microscopy of a protein complex
- Experimental electron microscopy data
- Reconstructed electron density map of protein complex
- Dock atomic models into electron density maps
Sali et al., Nature 422:216-225 (2003)

Example 2: Electron tomography of a protein complex/assembly
- Measure projections of molecules after illuminating with an electron beam from different angles
- Reconstruct density distribution ("tomogram") as sum of back-projected densities
Sali et al., Nature 422:216-225 (2003)

Reconstructing cellular organization of molecular complexes by fitting structures into electron tomograms
- "Noisy" tomogram (3D density map) of single cell
- Fit known structures ("templates") into density
Sali et al., Nature 422:216-225 (2003)

Some Protein Interaction Resources on the Internet

Protein interaction databases
- Biomolecular Interaction Network Database (BIND): http://www.blueprint.org/bind/bind.php
  Currently 73,000 interactions
- Database of Interacting Proteins (DIP): http://dip.doe-mbi.ucla.edu
  Currently 44,000 interactions
- Protein Quaternary Structure database (PQS): http://pqs.ebi.ac.uk
  Atomic structures of interacting proteins

Interactive visualization of networks
- Cytoscape: http://www.cytoscape.org
  Interactive display of protein networks
- LGL (Large Graph Layout): http://bioinformatics.icmb.utexas.edu/LGL
  Visualization of networks with up to millions of edges, 100,000's of vertices