6. Homology Modeling

Prediction of structure from sequence
Flowchart:
• Compare the query sequence to the nr database
• Similar to a sequence of known structure?
  – Yes: homology modeling (comparative modeling)
  – No: fold recognition (threading)
• Fits a known fold?
  – Yes: homology modeling (comparative modeling)
  – No: ab initio prediction

Homology modeling
4 steps:
1. Detect template
2. Align sequence onto template
3. Build model (loop modeling)
4. Refine model (relax)

Errors in comparative modeling
a. Wrong side chain conformations
b. Small backbone deviations
c. Wrong loop modeling
d. Wrong alignment
e. Wrong template
(Marti-Renom & Sali, 2000)

Sequence identity threshold for similar structure depends on length of protein
• No dissimilar pairs above the threshold line
(Sander & Schneider, 1991)

Template matching
• Given: sequence
• Wanted:
  – structural template
  – sequence-structure alignment
• Easiest approach:
  – BLAST / PSI-BLAST against sequences with known structure
  – select template based on sequence identity
    • >70%: straightforward
    • ~40-50%: usually clear
    • lower seqid: alignment is the challenge
• State-of-the-art protocols include more sophisticated searches and additional information for improved template selection
  – Profile-profile comparison (HHSEARCH)
  – Seq-structure compatibility (Threading: RAPTOR)
Alignment step is critical

Sequence-sequence alignment
• Information content:
  1. Sequence
  2. Profile (position-specific scoring matrix, PSSM): aa preferences for each position
  3. Hidden Markov Models (HMM): contain in addition position-specific in/del penalties
     (states: M match, D deletion, I insertion)

Sequence-sequence alignment
• Information content:
  – Sequence-sequence comparison
    • e.g. BLAST
  – Profile-sequence comparison
    • e.g. PSI-BLAST
  – Profile-profile comparison
    • e.g. LAMA, PROF_SIM, COMPASS
  – HMM-HMM comparison
    • e.g.
      HHSEARCH
More information – increased sensitivity in detecting template

No new folds & superfamilies lately -> template available for everyone
[Figure: SCOP folds by year (# of folds, # of new folds); ~1400 folds. No new folds in the last years!]
[Figure: SCOP superfamilies by year (# of superfamilies, # of new superfamilies); ~2300 superfamilies. No new superfamilies in the last years!]
Many sequences – few folds: How can I detect my fold?

Additional ways to include structural information: Threading
Evaluate compatibility of sequence with fold, based on pairwise residue potentials
Essential components:
• structural template
• neighbor definition
• energy function
Energy of a threading, summed over neighboring positions i,j, with E_{ab} the pairwise energy matrix:
  E = \sum_{i,j} E_{a_i b_j}
Example: sequence ACCECADAAC on a 10-position template: E = -3 -1 -4 -4 -1 -4 -3 -3 = -23
[Figure: 10-position template with residues A, C, D, E placed on it, and the pairwise energy matrix E_ab]

Threading (fold recognition): Find best template for given sequence
[Figure: the query sequence MAHFPGFGQSLLFGYPVYVFGD... is threaded onto each potential fold 1) ... 56) ... n), giving one score per fold (values shown: 20.5, -10, -123)]

RAPTOR
State-of-the-art threading method of choice
• Successful for “low-homology” proteins (few homologous sequences – low entropy in alignment)
• State-of-the-art threading protocol: uses linear programming to efficiently find best seq-str threading (linear combination of regression trees)
• Optimizes use of several templates
http://raptorx.uchicago.edu/
Jian Peng and Jinbo Xu. RaptorX: exploiting structure information for protein alignment by statistical inference. PROTEINS, 2011; A multiple-template approach to protein threading. PROTEINS, 2011.

Combine sequence-structure and sequence-sequence comparisons
• Example 1: GENTHREADER (ANN)
  – How likely are 2 aas to be neighbors?
  – How likely is an aa to be buried/exposed?

Combine sequence and structure for template selection
Example 2: HHSEARCH*:
• Based on hidden Markov models (HMM)
• Sequence-HMM alignment
• Here: extended to HMM-HMM alignment
* Söding. Protein homology detection by HMM-HMM comparison.
  Bioinformatics (2005) 21: 951

HHSEARCH: HMM-HMM alignment
• Formalization:
• more sensitive (for hard cases with <20% seqid) than:
  – Profile-profile comparison
  – Profile-sequence comparison
  – Sequence-sequence comparison

HHSEARCH includes structural information about template
Include secondary structure preference in model:
• Score pairs of aligned secondary structure elements with substitution matrix
• Query sequence: predicted secondary structure (PSIPRED: H/E/C) with confidence [0..9]
• Structural template: secondary structure (DSSP: H/E/B/G/I/T/S)
DSSP:
  H = alpha helix
  E = extended strand
  B = residue in isolated beta-bridge
  G = 3-helix (3/10 helix)
  I = 5 helix (pi helix)
  T = hydrogen bonded turn
  S = bend
10 x 3 x 7 substitution values (confidence levels x PSIPRED states x DSSP states)

Build model
1. Copy aligned regions from template
2. Rebuild missing pieces: model loops
3.
   Refine model: add side chains (and minimize; relax)

Build model: Loop modeling
Input:
• 2 anchors
• length of missing residues
2 approaches:
• Loop libraries: construct loops from fragments of known structures
• Loop closure algorithms
  – model new conformations
  – good for longer loops

Fold-trees for loop modeling tasks
[Figure: fold tree for loop modeling: N, 1 x 1’, 2 x 2’, C]
Legend:
• Color: flexible bb; Gray: fixed bb
• Flexible “peptide” edge; rigid “peptide” edge
• Rigid “jump” (1 to 1’); flexible “jump” (1 to 1’)
• N: N-terminal; C: C-terminal; X: chain break; O: root of the tree

Rosetta loop modeling
• Define regions that are flexible, and perturb these in a fixed background
  – Same moves as described in ab initio, but more restricted
  – Use fold tree architecture: connect take-off and landing segment by a jump, cut loop (at defined place, or arbitrarily), apply perturbation, reclose loop
  – Loop closure: using cyclic coordinate descent (CCD) or kinematic loop closure (KC)
• Fragments can be used to improve knowledge-based modeling

Cyclic Coordinate Descent (CCD)
• Closure by moving each joint separately to maximally approach the end
• Repeat to obtain several conformations; refine, and select the best
Canutescu & Dunbrack, Protein Sci. 12, 963–972 (2003).

Loop closure and degrees of freedom
• Over-constrained for <6 DOF
• Under-constrained for >6 DOF: infinite number of solutions
• A molecular loop closure problem with 6 DOF has at most 16 solutions
• Kinematic loop closure allows calculation of an analytical solution

Kinematic loop closure
Coutsias (2004)
From robotics: analytical solution of loop closure for 6 degrees of freedom
Challenge:
• find an analytical formulation to extract
• all possible backbone structures of a chain segment that are
• geometrically consistent with the preceding and following parts of the given structure
Setup:

Kinematic loop closure, cont.
[Figure: solutions aligned to each other; solutions aligned to the constant part]

Kinematic closure (KC)
• Analytical solution of loop closure for 6 degrees of freedom
• Extension: analytical determination of all mechanically accessible conformations for 6 torsions of a peptide chain of any length (e.g. 25 residues)
  (1) Randomly perturb non-pivot positions
  (2) Apply KC to pivot positions

Kinematic closure (KC) in Rosetta
• Embedded into MCM protocol (low-res + high-res)
  – 720 steps
  – Repeat 1000 times
[Figure: protocol cycle: perturbation + KC, then loop backbone minimization]

Kinematic closure (KC)
• Improves median modeling quality from 2.0 Å (CCD) to 0.8 Å RMSD (on a set of 25 loops)

Improve loop modeling by sampling along Principal Components (PC) of natural variation
• Collect loops of a set of homologous templates
• Perform Principal Component Analysis (PCA): the collection of loops can be described by a few (3) PCs only
  ➜ Improves model quality: more similar to the final structure than to the template
• Depends on a set of known homologous structures
[Figure: 8 protein structures plotted along PCA1, PCA2, PCA3]
Qian 2004 PNAS

Free-energy optimization along PC of natural variation: example
• Red: model (2.36 Å RMSD)
• Blue: native
• Green: refined (1.42 Å RMSD)
Qian 2004 PNAS

Rosetta: Refine model with relax protocol
Same as in last ab initio modeling step*:
• Introduce general flexibility
• Relax protocol finds near-by minima (within 4-5 Å RMSD)
[Figure: relax cycle; labels: vdw repulsive; small backbone moves and MCM; side chain optimization; backbone optimization; side chain optimization + minimization; backbone optimization]
* MCM protocol: small & shear moves (120 steps; see lecture 5)

Homology modeling with Rosetta
Summary - Basic protocol:
1. Detect template and align sequence: based on HHSEARCH (alignment of two HMMs) or RAPTOR (threading)
2.
   Define aligned regions and loop regions; copy aligned regions and complete protein structure with loop modeling (with KIC kinematic loop closure, or CCD)
3. Refine structure with the “relax” protocol

CASP - Template-based modeling (TBM)
Improvement over single best target
• single impressive improvements
• many targets better than template
Reasons
• multiple templates
• free modeling
• refinement
[Figure: axis labeled worse / better]

CASP7: Example for improved TBM with Rosetta (T330)
• Blue: Native
• Green: Baker Model04
• Red: Template
[Figure: % of residues aligned vs. distance cutoff]

Rosetta in CASP7 & 8: use of several templates improves prediction
Templates that produce lower energy structures produce better models

Homology modeling - summary
• Homology modeling to high resolution is challenging (~ ab initio modeling)
• Today models are already better than the template – GOOD NEWS!
• Good alignment and template selection are critical
• Sophisticated new approaches have improved homology modeling in recent years
  – Include additional information during template selection, alignment and refinement
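
Appendix sketch: threading score
The threading slides describe scoring a sequence against a fold by summing pairwise residue energies over neighboring positions, then picking the fold with the best (lowest) score. The Python sketch below illustrates that idea only. The energy table PAIR_ENERGY, the fold names, and the contact lists are made-up toy values (assumptions, not taken from the slides or from any published potential); the function names are likewise hypothetical.

```python
# Toy threading score: E = sum over neighbor pairs (i, j) of E[a_i][a_j].
# All numbers below are hypothetical illustration values.

# Symmetric pairwise energies for a 4-letter toy alphabet, keyed by sorted pair.
PAIR_ENERGY = {
    ("A", "A"): -3, ("A", "C"): -1, ("A", "D"): 0, ("A", "E"): 1,
    ("C", "C"): -4, ("C", "D"): 0, ("C", "E"): 2,
    ("D", "D"): 1, ("D", "E"): -2,
    ("E", "E"): 1,
}

# Each fold is reduced to its neighbor definition: pairs of positions in contact.
FOLDS = {
    "fold_a": [(0, 8), (1, 4), (2, 9), (5, 7)],
    "fold_b": [(0, 3), (1, 6), (2, 5), (4, 7)],
}


def pair_energy(a, b, table):
    """Look up the energy of residue types a and b, ignoring their order."""
    return table[tuple(sorted((a, b)))]


def threading_energy(sequence, contacts, table):
    """Sum pairwise energies over all contacting positions of one fold."""
    return sum(pair_energy(sequence[i], sequence[j], table) for i, j in contacts)


def best_fold(sequence, folds, table):
    """Score the sequence on every fold; return the lowest-energy fold and all scores."""
    scores = {name: threading_energy(sequence, contacts, table)
              for name, contacts in folds.items()}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    best, scores = best_fold("ACCECADAAC", FOLDS, PAIR_ENERGY)
    print(best, scores)
```

A real threading method also has to search over alignments (which residue goes to which template position) and uses a far richer energy function; this sketch fixes the alignment to the identity mapping to keep the scoring step visible.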
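
Appendix sketch: cyclic coordinate descent
The CCD slide describes closing a loop by moving each joint separately so that the loop end approaches its target as closely as possible, and repeating. The Python sketch below shows that iteration on a simplified planar chain, which is an assumption made for brevity: the method of Canutescu & Dunbrack operates on backbone torsion angles in 3D and matches several anchor atoms, whereas here each joint is a 2D hinge and only the end point is matched. Function names and parameters (chain_points, ccd_close, tol, max_sweeps) are hypothetical.

```python
import math


def chain_points(angles, lengths):
    """Joint positions of a planar chain starting at the origin.

    angles[i] is the turn applied at joint i relative to the previous segment.
    """
    x = y = heading = 0.0
    points = [(x, y)]
    for turn, length in zip(angles, lengths):
        heading += turn
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        points.append((x, y))
    return points


def ccd_close(angles, lengths, target, tol=1e-3, max_sweeps=500):
    """Planar CCD: adjust one joint at a time so the chain end approaches target.

    Returns the final angles and a flag saying whether the end is within tol.
    """
    angles = list(angles)
    for _ in range(max_sweeps):
        if math.dist(chain_points(angles, lengths)[-1], target) < tol:
            return angles, True
        for i in range(len(angles)):
            points = chain_points(angles, lengths)
            pivot, end = points[i], points[-1]
            # Rotating about the pivot moves the end on a circle; the closest
            # approach is where pivot, end and target are aligned.
            to_end = math.atan2(end[1] - pivot[1], end[0] - pivot[0])
            to_target = math.atan2(target[1] - pivot[1], target[0] - pivot[0])
            angles[i] += to_target - to_end
    closed = math.dist(chain_points(angles, lengths)[-1], target) < tol
    return angles, closed


if __name__ == "__main__":
    segment_lengths = [1.0, 1.0, 1.0, 1.0]
    final_angles, ok = ccd_close([0.3, 0.3, 0.3, 0.3], segment_lengths, (2.0, 1.0))
    print(ok, chain_points(final_angles, segment_lengths)[-1])
```

Running the closure from different random starting angles gives different closed conformations, which corresponds to the "repeat to obtain several conformations, refine, and select the best" step on the slide.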