2. Introduction to Rosetta and structural modeling
• Approaches for structural modeling of proteins
• The Rosetta framework and its prediction modes
• Cartesian and polar coordinates
• Sampling (finding the structure) and scoring (selecting the structure)

Structural Modeling of Proteins: Approaches

Prediction of Structure from Sequence (flowchart)
• Compare the query sequence to the nr database
• Similar to a sequence of known structure?
  – Yes: Homology Modeling (Comparative Modeling)
  – No: Fold Recognition (Threading)
• Fits a known fold? If not: Ab initio prediction
• Protocols: ab initio, loops, side chains, active sites…

The Rosetta framework and its prediction modes

A short history of Rosetta
• In the beginning: ab initio modeling of protein structure starting from sequence (e.g. ATCSFFGRKLL…)
• Short fragments of known proteins are assembled by a Monte Carlo strategy to yield native-like protein conformations
• Reliable fold identification for short proteins; recently improved to high-resolution models (within 2 Å RMSD)

A short history of Rosetta
The success of the ab initio protocol led to extensions to:
• Protein design
• Design of a new fold: TOP7
• Protein loop modeling; homology modeling
• Protein-protein docking; protein interface design
• Protein-ligand docking
• Protein-DNA interactions; RNA modeling
• Many more, e.g.
solving the phase problem in X-ray crystallography

The Rosetta Strategy
• Observation: local sequence preferences bias, but do not uniquely define, the local structure of a protein
• Goal: mimic the interplay of local and global interactions that determine protein structure
• Local interactions: fragments derived from known structures (sampled for similar sequences/secondary structure propensity)
• Global (non-local) interactions: buried hydrophobic residues, paired β strands, specific side chain interactions, etc.

The Rosetta Strategy
• Local interactions – fragments
  – Fragment library representing accessible local structures for all short sequences in a protein chain, derived from known structures
• Global (non-local) interactions – scoring function
  – Derived from conformational statistics of known structures

More recent additions
• BOINC (Rosetta@home)
• FoldIt
• RosettaScripts; RosettaDiagrams
• PyRosetta

Scoring and Sampling

The basic assumption in structure prediction
The native structure is located at the global minimum (free) energy conformation (GMEC)
➜ A good energy function can select the correct model among decoys
➜ A good sampling technique can find the GMEC in the rugged landscape
(Figure: energy E versus conformation space, with the GMEC at the lowest minimum)

Two-Step Procedure
1. Low-resolution step locates potential minima (fast)
2. Cluster analysis identifies the broadest basins in the landscape
3. High-resolution step can identify the lowest energy minimum in the basins (slow)
(Figure: energy E versus conformation space, GMEC marked)

Low-Resolution Step
Structure representation:
• Equilibrium bonds and angles (Engh & Huber 1991)
• Centroid: average location of the center of mass of the side chain (centroid | aa, φ, ψ)
• No modeling of side chains
• Fast

Low-Resolution Scoring Function
Bayes' theorem:
P(str | seq) = P(str) * P(seq | str) / P(seq)
• P(str): structure-dependent features
• P(seq | str): sequence-dependent features
• P(seq): constant
• Independent components prevent over-counting
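The link between Bayes' theorem and an additive score can be written out in one step. The sketch below assumes, as is usual for knowledge-based potentials, that the score is the negative logarithm of the posterior probability; the factorization of P(seq | str) into independent terms is the independence assumption named on the slide, and the symbols f_k are placeholders introduced here for those terms, not names from the slides.

```latex
\begin{aligned}
P(\mathrm{str}\mid\mathrm{seq}) &= \frac{P(\mathrm{str})\,P(\mathrm{seq}\mid\mathrm{str})}{P(\mathrm{seq})} \\[4pt]
\mathrm{Score} = -\ln P(\mathrm{str}\mid\mathrm{seq})
  &= \underbrace{-\ln P(\mathrm{str})}_{\text{structure-dependent}}
   \;\underbrace{-\,\ln P(\mathrm{seq}\mid\mathrm{str})}_{\text{sequence-dependent}}
   \;+\; \underbrace{\ln P(\mathrm{seq})}_{\text{constant}} \\[4pt]
\text{if } P(\mathrm{seq}\mid\mathrm{str}) = \prod_k f_k
  \quad &\Rightarrow \quad -\ln P(\mathrm{seq}\mid\mathrm{str}) = \sum_k \left(-\ln f_k\right)
\end{aligned}
```

Because ln P(seq) does not depend on the structure, it does not affect the ranking of models for a given sequence. The product turns into a sum of score terms only when the factors are independent, which is why independent components matter for avoiding over-counting.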
Sequence-Dependent Components
Bayes' theorem: P(str | seq) = P(str) * P(seq | str) / P(seq)
Score = S_env + S_pair + …
• Neighbors: Cβ–Cβ < 10 Å
• Rohl et al. (2004) Methods in Enzymology 383:66
• Origin: Simons et al., JMB 1997; Simons et al., Proteins 1999

Structure-Dependent Components
P(str | seq) = P(str) * P(seq | str) / P(seq)
Score = … + S_rg + S_cb + S_vdw + …
Score = … + S_ss + …
Score = … + S_sheet + S_hs + … + S_rama

High-Resolution Step
Slow, exact step
• Locates global energy minimum
Structure representation:
• All-atom (including polar and non-polar hydrogens, but no water)
• Side chains as rotamers from a backbone-dependent library
• Side chain conformation adjusted frequently
Dunbrack 1997

High-Resolution Step: Rotamer Libraries
• Side chains have preferred conformations
• They are summarized in rotamer libraries
• Select one rotamer for each position
• Best conformation: lowest-energy combination of rotamers
• Serine χ1 preferences: t = 180°, g+ = +60°, g− = −60°

High-Resolution Scoring Function
• Major contributions:
  – Burial of hydrophobic groups away from water
  – Void-free packing of buried groups and atoms
  – Buried polar atoms form intra-molecular hydrogen bonds

High-Resolution Scoring Function: Packing interactions
Score = S_LJ(atr + rep) + …
• Linearized repulsive part
• ε: well depth from CHARMm19
• r_ij: interatomic distance

High-Resolution Scoring Function: Implicit solvation
Score = … + S_solvation + …
• Terms in x_ij² and x_ji², with x_ij = (r_ij − R_i)/λ_i
• Solvation free energy density of atom i
• Lazaridis & Karplus, Proteins 1999

High-Resolution Scoring Function: Hydrogen Bonds (original function)
Score = … + S_hb(srbb + lrbb + sc) + …
srbb: short range, backbone HB
lrbb: long range, backbone HB
sc: HB with side chain atom
(Figure: hydrogen bond geometry N–H···O=C with the distance δ marked) (Kortemme, 2003; Morozov 2004)

Hydrogen Bonding Energy (Kortemme, Morozov & Baker 2003 JMB)
Based on statistics from high-resolution structures in the Protein Data Bank (rcsb.org)
E_HB = W_HB [ E(δ_HA) + E(Θ) + E(Ψ) + E(X) ]
G = −kT ln P
Slide from Jeff Gray

High-Resolution Scoring Function: Rotamer preference
Score = … + S_dunbrack + …
Dunbrack, 1997

Scoring Function: Summary
One long, generic function …
Score = S_env + S_pair + S_rg + S_cb + S_vdw + S_ss + S_sheet + S_hs + S_rama + S_hb(srbb + lrbb) + docking_score + S_disulf_cent + S_rs + S_co + S_contact_prediction + S_dipolar + S_projection + S_pc + S_tether + S_f + S_w + S_symmetry + S_splicemsd + …
docking_score = S_d_env + S_d_pair + S_d_contact + S_d_vdw + S_d_site_constr + S_d + S_fab_score
Score = S_LJ(atr + rep) + S_solvation + S_hb(srbb + lrbb + sc) + S_dunbrack + S_pair – S_ref + S_prob1b + S_intrares + S_gb_elec + S_gsolt + S_h2o(solv + hb) + S_plane

Representations of protein structure: Cartesian and polar coordinates

PDB (Cartesian coordinates):
                                   x        y        z
ATOM   490  N    GLN  A  31   52.013  -87.359   -8.797   1.00   7.06   N
ATOM   491  CA   GLN  A  31   52.134  -87.762  -10.201   1.00   8.67   C
ATOM   492  C    GLN  A  31   51.726  -89.222  -10.343   1.00  10.90   C
ATOM   493  O    GLN  A  31   51.015  -89.601  -11.275   1.00   9.63   O
…

Polar (internal) coordinates:
Position     PHI      PSI     OMEGA     CHI1    CHI2    CHI3    CHI4
1           0.00   -60.00   -180.00   -60.00    0.00    0.00    0.00
2
3
…
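Distances fall straight out of Cartesian coordinates, which is why energy terms are easy to evaluate in this representation. As a small illustration, the sketch below takes the four Gln 31 backbone atoms from the ATOM records above and computes the three covalent bond lengths. The dictionary layout and the function name are choices made for this example only.

```python
import math

# Backbone atoms of Gln 31 (chain A), copied from the ATOM records: (x, y, z) in Å.
atoms = {
    "N":  (52.013, -87.359, -8.797),
    "CA": (52.134, -87.762, -10.201),
    "C":  (51.726, -89.222, -10.343),
    "O":  (51.015, -89.601, -11.275),
}

def distance(a, b):
    """Euclidean distance between two points given as (x, y, z) tuples."""
    return math.dist(a, b)

# Covalently bonded pairs along the backbone of this residue.
bonds = [("N", "CA"), ("CA", "C"), ("C", "O")]
bond_lengths = {pair: distance(atoms[pair[0]], atoms[pair[1]]) for pair in bonds}

for (first, second), length in bond_lengths.items():
    print(f"{first}-{second}: {length:.2f} Å")
```

The three values come out close to 1.47, 1.52 and 1.23 Å, which is consistent with the statement that bond lengths stay near their equilibrium values and need not be treated as free variables.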
Two ways to represent the protein structure

Cartesian coordinates (x, y, z; PDB format)
+ Intuitive: look at molecules in space
+ Easy calculation of energy score (based on atom-atom distances)
– Difficult to change the conformation of the structure (while keeping bond lengths and bond angles unchanged)

Polar coordinates (Φ, Ψ, Ω; equilibrium angles and bond lengths)
+ Compact (3 values/residue)
+ Easy changes of protein structure (turn around one or more dihedral angles)
– Non-intuitive
– Difficult to evaluate energy score (calculation of the neighbor matrix is complicated)

A snake in the 2D world
• Cartesian representation:
  – points: 1 (0,0), 2 (1,1), 3 (1,2), 4 (2,2), 5 (3,3)
  – connections (predefined): 1-2, 2-3, 3-4, 4-5
• Internal coordinates:
  – bond lengths (predefined): √2, 1, 1, √2
  – angles: 45°, 90°, 0°, 45°
(Figure from Wikipedia)

A snake wiggling in the 2D world
• Constraint: keep bond length fixed
• Move in Cartesian representation:
  (0,0),(1,1),(1,2),(2,2),(3,3) → (0,0),(1,1),(1,2),(2,2),(3,0)
  Bond length changed! (last bond: √2 → √5)
• Move in polar coordinates:
  45°, 90°, 0°, 45° → 45°, 90°, 45°, 45°
  Bond length unchanged! Large impact on structure

Polar → Cartesian coordinates
Convert r and θ to x and y (for each bond, Δx = r cos θ and Δy = r sin θ)
√2, 1, 1, √2 and 45°, 90°, 0°, 45° → (0,0),(1,1),(1,2),(2,2),(3,3)
(Figure from Wikipedia)

Cartesian → polar coordinates
Convert x and y to r and θ (for each bond, r = √(Δx² + Δy²) and θ = atan2(Δy, Δx))
(0,0),(1,1),(1,2),(2,2),(3,3) → √2, 1, 1, √2 and 45°, 90°, 0°, 45°

Moving the snake to the 3D world
• Cartesian representation:
  – points, with an additional z-axis: (0,0,0),(1,1,0),(1,2,0),(2,2,0),(3,3,0)
  – connections (predefined): 1-2, 2-3, 3-4, 4-5
• Internal coordinates:
  – bond lengths (predefined): √2, 1, 1, √2
  – angles: 45°, 90°, 0°, 45°
  – dihedral angles: 180°, 180°

Proteins: bond lengths and angles fixed.
Only dihedral angles are varied

Dihedral angles
• Dihedral angle: defines the geometry of 4 consecutive atoms (given bond lengths and angles)
• Dihedral angles χ1–χ4 define the side chain
(Figure from Wikipedia)

What we learned from our snake
• Cartesian representation: easy to look at, difficult to move
  – Moves do not preserve bond length (and angles in 3D)
• Internal coordinates: easy to move, difficult to see
  – Calculation of distances between points is not trivial
Proteins: bond lengths and angles fixed. Only dihedral angles are varied

Solution: toggle
• CALCULATE ENERGY (Cartesian coordinates): derive distance matrix (neighbor list) for energy score calculation
• Transform: calculate dihedral angles from coordinates
• MOVE STRUCTURE (polar coordinates): introduce changes in structure by rotating around dihedral angle(s) (change Φ values)
• Transform: build positions in space according to dihedral angles
(Figure: the PDB ATOM records for atoms 490–493 of Gln 31 next to the PHI/PSI/OMEGA/CHI1–CHI4 table; snake analogue: (0,0),(1,1),(1,2),(2,2),(3,3) and 45°, 90°, 0°, 45°)

Cartesian → polar coordinates
How to calculate polar from Cartesian coordinates: example Φ: C′–N–Cα–C
– define the plane perpendicular to the N–Cα (b2) vector
– calculate the projections of Cα–C (b3) and C′–N (b1) onto the plane
– calculate the angle between the projections

PDB (Cartesian coordinates):
                                   x        y        z
…
ATOM   490  C    GLN  A  31   52.013  -87.359   -8.797   1.00   7.06   N
ATOM   491  N    GLY  A  32   52.134  -87.762  -10.201   1.00   8.67   C
ATOM   492  CA   GLY  A  32   51.726  -89.222  -10.343   1.00  10.90   C
ATOM   493  O    GLY  A  32   51.015  -89.601  -11.275   1.00   9.63   O
…

Polar (internal) coordinates:
Position     PHI      PSI     OMEGA     CHI1    CHI2    CHI3    CHI4
…
32        -59.00   -60.00   -180.00     0.00    0.00    0.00    0.00
33
34
…
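The projection recipe for a dihedral angle, and the reverse step of placing a fourth atom from three known atoms plus a bond length, bond angle and dihedral, can be sketched in plain Python as follows. This is an illustrative implementation, not Rosetta code: all function names are made up for the example, and the demonstration uses the N, CA, C and O coordinates of Gln 31 from the first ATOM table simply because they are four bonded atoms with known positions (their dihedral is not one of the named backbone angles).

```python
import math

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def add(a, b): return tuple(x + y for x, y in zip(a, b))
def scale(a, s): return tuple(x * s for x in a)
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def unit(a): return scale(a, 1.0 / math.sqrt(dot(a, a)))
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def perpendicular(v, axis):
    """Projection of v onto the plane perpendicular to the unit vector axis."""
    return sub(v, scale(axis, dot(v, axis)))

def dihedral(p1, p2, p3, p4):
    """Signed dihedral p1-p2-p3-p4 in degrees (cis = 0, trans = 180)."""
    axis = unit(sub(p3, p2))                   # central bond (b2)
    v = perpendicular(sub(p1, p2), axis)       # projection of the p2->p1 bond
    w = perpendicular(sub(p4, p3), axis)       # projection of the p3->p4 bond
    return math.degrees(math.atan2(dot(cross(axis, v), w), dot(v, w)))

def bond_angle(p1, p2, p3):
    """Angle at p2 between p2->p1 and p2->p3, in degrees."""
    c = dot(unit(sub(p1, p2)), unit(sub(p3, p2)))
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))

def place_atom(p1, p2, p3, length, angle_deg, dihedral_deg):
    """Position of p4 bonded to p3, given |p3-p4|, angle p2-p3-p4 and dihedral p1-p2-p3-p4."""
    axis = unit(sub(p3, p2))
    v = unit(perpendicular(sub(p1, p2), axis))  # reference direction for dihedral = 0
    u = cross(axis, v)                          # direction for dihedral = +90
    theta, phi = math.radians(angle_deg), math.radians(dihedral_deg)
    along = scale(axis, -length * math.cos(theta))
    radial = add(scale(v, length * math.sin(theta) * math.cos(phi)),
                 scale(u, length * math.sin(theta) * math.sin(phi)))
    return add(p3, add(along, radial))

# Demonstration on four bonded atoms (Gln 31 N, CA, C, O from the first ATOM table).
n = (52.013, -87.359, -8.797)
ca = (52.134, -87.762, -10.201)
c = (51.726, -89.222, -10.343)
o = (51.015, -89.601, -11.275)

torsion = dihedral(n, ca, c, o)
rebuilt_o = place_atom(n, ca, c, math.dist(c, o), bond_angle(ca, c, o), torsion)
print(f"N-CA-C-O dihedral: {torsion:.1f} degrees")
print("rebuilt O:", tuple(round(x, 3) for x in rebuilt_o))
```

Measuring the internal coordinates and then rebuilding the atom from them returns the original position, which is the round trip that the toggle between the two representations relies on. In a protein builder the measured length and angle would be replaced by equilibrium values.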
Polar → Cartesian coordinates
Find the x, y, z coordinates of C, based on the atom positions of C′, N and Cα, and a given Φ value (Φ: C′–N–Cα–C)
• Create the Cα–C vector:
  – length Cα–C = 1.51 Å (equilibrium bond length)
  – angle N–Cα–C = 111° (equilibrium value for the N–Cα–C angle)
• Rotate the vector around the N–Cα axis until the projections of Cα–C and N–C′ enclose the desired Φ
(Figure: the same ATOM records and torsion table as on the Cartesian → polar slide; snake analogue: (0,0),(1,1),(1,2),(2,2),(3,3) and 45°, 90°, 0°, 45°)

Representation of protein structure: Rosetta folding
(Figure: chain of residues 1–8)
• 3 backbone dihedral angles per residue
• Build the coordinates of the structure starting from the first atom, according to the dihedral angles (and equilibrium bond lengths and angles)
• Sampling and minimization in TORSIONAL space: change an angle and rebuild, starting from the changed angle
Based on slides by Chu Wang

Representation of protein structure: Rosetta folding and Rosetta docking
• Rosetta folding: 3 backbone dihedral angles per residue; sampling and minimization in TORSIONAL space
• Rosetta docking: backbone dihedral angles fixed (rigid-body); sampling and minimization in RIGID-BODY space
  – 6 rigid-body DOFs: 3 translational vectors and 3 rotational angles
(Figure: chains of residues 1–8 and 1′–8′)
How can those two types of degrees of freedom be combined?

Fold tree representation
• Originally developed to improve sampling of strand registers in β-sheet proteins.
• Allows simultaneous optimization of rigid-body and backbone/sidechain torsional degrees of freedom.
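One way to picture how a fold tree combines the two kinds of degrees of freedom is as a list of edges, where each edge along the chain contributes 3 backbone dihedral angles per residue and each rigid-body connection contributes 6 DOFs. The sketch below is a toy data structure for counting those DOFs and is not Rosetta's fold tree implementation. The class and function names, the numbering of the second chain as residues 9–16, and the choice of residues 1 and 9 as the endpoints of the rigid-body edge are all assumptions made for illustration. It also counts 3 dihedrals for every residue, ignoring that some terminal dihedrals are undefined.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    start: int       # first residue on the edge
    stop: int        # last residue on the edge
    kind: str        # "peptide" (along the chain) or "jump" (rigid-body connection)
    flexible: bool   # True if the edge's degrees of freedom are sampled

def count_dofs(edges):
    """Count sampled DOFs: 3 per residue on flexible peptide edges, 6 per flexible jump."""
    total = 0
    for edge in edges:
        if not edge.flexible:
            continue
        if edge.kind == "peptide":
            n_residues = abs(edge.stop - edge.start) + 1
            total += 3 * n_residues
        elif edge.kind == "jump":
            total += 6
        else:
            raise ValueError(f"unknown edge kind: {edge.kind}")
    return total

def docking_tree(flexible_backbone):
    """Two 8-residue chains joined by one rigid-body edge (hypothetical numbering)."""
    return [
        Edge(1, 8, "peptide", flexible_backbone),    # first chain
        Edge(1, 9, "jump", True),                    # rigid-body connection between chains
        Edge(9, 16, "peptide", flexible_backbone),   # second chain
    ]

rigid_body_dofs = count_dofs(docking_tree(flexible_backbone=False))
fully_flexible_dofs = count_dofs(docking_tree(flexible_backbone=True))
print(rigid_body_dofs, fully_flexible_dofs)   # 6 and 54
```

Switching the `flexible` flag on individual edges is the toy equivalent of choosing between rigid-body docking, folding, and fully flexible docking within one representation.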
Example: fold-tree based docking
• “peptide” edge (residues 1–8): 3 backbone dihedral angles
• “long-range” edge (between the two chains): 6 rigid-body DOFs
• “peptide” edge (residues 1′–8′): 3 backbone dihedral angles
Construct fold-trees to treat a variety of protein folding and docking problems.
Fold tree: Bradley and Baker, Proteins (2006)

Fold-trees for different modeling tasks
Legend:
• Color: flexible bb; gray: fixed bb; pale: symmetry operation
• Flexible “peptide” edge; rigid “peptide” edge
• Rigid “jump” (1 to 1′); flexible “jump” (1 to 1′)
• N: N-terminal; C: C-terminal; X: chain break; O: root of the tree
• Filled colored circles: flexible sc
• Empty colored circles: flexible amino acid (design)
Tasks shown (one fold-tree diagram each):
• protein folding
• loop modeling
• fully flexible docking
• docking w/ loop modeling
• docking w/ hinge motion

The Rosetta sampling strategy: a general overview
Fragment sampling; local optimization
• 9 residue fragments
• 3 residue fragments
• Gradual addition of parameters to scoring function
• Quick quenching
• Strategies to keep fragment insertion/perturbation local
• Monte Carlo (MC) sampling
• MC sampling with minimization
• Repacking and refinement (side chain rearrangement)
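The toggle between representations and the Monte Carlo step can be tried on the 2D snake from the coordinate slides: moves are made on the angles, the points are rebuilt, and the score is computed from point-to-point distances. The following is a minimal sketch under stated assumptions. The energy function is invented for this toy (squared radius of gyration plus a clash penalty) and is not a Rosetta score term; the temperature, step size, step count and seed are arbitrary choices.

```python
import math
import random

LENGTHS = [math.sqrt(2), 1.0, 1.0, math.sqrt(2)]   # fixed bond lengths of the snake
START_ANGLES = [45.0, 90.0, 0.0, 45.0]             # direction of each bond, in degrees

def build(lengths, angles_deg):
    """Internal -> Cartesian: walk from (0, 0), adding one bond at a time."""
    points = [(0.0, 0.0)]
    for r, a in zip(lengths, angles_deg):
        x, y = points[-1]
        points.append((x + r * math.cos(math.radians(a)),
                       y + r * math.sin(math.radians(a))))
    return points

def energy(points):
    """Toy score: squared radius of gyration plus a penalty for non-bonded clashes."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    rg2 = sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 for p in points) / n
    clashes = sum(1 for i in range(n) for j in range(i + 2, n)
                  if math.dist(points[i], points[j]) < 0.8)
    return rg2 + 10.0 * clashes

def monte_carlo(steps=2000, kT=0.1, max_step=20.0, seed=0):
    """Metropolis Monte Carlo in angle space; returns the best angles and their energy."""
    rng = random.Random(seed)
    angles = list(START_ANGLES)
    current = energy(build(LENGTHS, angles))
    best_angles, best = list(angles), current
    for _ in range(steps):
        trial = list(angles)
        i = rng.randrange(len(trial))
        trial[i] += rng.uniform(-max_step, max_step)   # perturb one angle
        e = energy(build(LENGTHS, trial))              # rebuild, then score in Cartesian space
        # Metropolis criterion: always accept downhill, sometimes accept uphill.
        if e <= current or rng.random() < math.exp(-(e - current) / kT):
            angles, current = trial, e
            if current < best:
                best_angles, best = list(angles), current
    return best_angles, best

start_points = build(LENGTHS, START_ANGLES)
start_energy = energy(start_points)
best_angles, best_energy = monte_carlo()
final_points = build(LENGTHS, best_angles)
print(f"start energy {start_energy:.3f}, best energy {best_energy:.3f}")
```

Because every move changes only an angle and the points are rebuilt from fixed lengths, the bond lengths of the final conformation are identical to those of the starting one, whatever the sampler does.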