Protein Structure Prediction
Mason Bially

Types of Structure
• Primary Structure
  – The linear amino acid sequence.
• Secondary Structure
  – The local three-dimensional structure.
  – Defined by hydrogen bonding patterns.
• Tertiary Structure
  – The global three-dimensional structure.
  – Defined in atomic coordinates.
  – Determines the actual function.
• Quaternary Structure
  – The arrangement of multiple protein chains.

How do we find Secondary Structure?
• A couple of algorithms:
  – DSSP (original, slight errors)
  – STRIDE (newer, sliding window)
• Both require the primary and tertiary structure.
  – Because of this they are assignments from known coordinates, not guesswork.
• Both find hydrogen bonds.
  – Using potential energy functions.
    • Based on amino acid locations and orientations.
    • STRIDE's is slightly more accurate.
  – Each returns one of 8 types of secondary structure for each amino acid:
    • 3 helix types
    • 2 beta-sheet types
    • 2 turn types
    • and 'other'

X-Ray Crystallography
• Shoot X-rays through a crystal; the angles and intensities of the diffracted X-rays allow the structure to be determined.
• Some proteins are challenging to crystallize (intrinsic membrane proteins).
• Can handle arbitrarily large proteins.

NMR Protein Spectroscopy
• Uses nuclear magnetic resonance, a phenomenon in which atomic nuclei in a magnetic field respond to electromagnetic radiation by re-emitting it.
• Has difficulty with large proteins.
• Works on almost anything (including proteins with unstable tertiary structure).

Why do we need Structure Prediction?
• Finding tertiary structure experimentally has problems.
  – Slow and difficult.
  – Some structures can't be determined experimentally.
• We need to cover more ground, quicker.
  – Drug design.
  – Bioinformatics tool development.
  – More detailed interactome information.

But isn't it computationally hard?
• Yes.
• Secondary structure prediction.
  – Machine learning methods.
• Tertiary structure prediction.
  – Homology Modeling
  – Fold Recognition (AKA Protein Threading)
  – From scratch (AKA de novo, AKA ab initio)

Basis for Prediction (Comparative Modeling)
• Protein structure (secondary and tertiary) is evolutionarily more conserved than the DNA or amino acid sequence.
  – Structure is function; changing it would prevent the protein from doing its job.
• Therefore related proteins will probably share structure with each other.

Secondary Structure Prediction
• Early attempts (~60% accuracy).
  – Chou-Fasman
    • Uses the probability of a secondary structure containing an amino acid.
  – GOR
    • Bayesian inference applied to the same basic idea.
• Machine learning methods (~70% accuracy).
  – Neural networks.
  – Support vector machines.
  – Hidden Markov models.
• Future.
  – Secondary structure also depends on the environment the protein is folded in.
  – Including this metadata may improve the methods.

Homology Modeling
• Requires primary structure and a template tertiary structure.
  – Relies on the idea that if one protein has a specific structure, so do related proteins.
• Only works with relatively similar sequences.
  – Sequence identity above 50% gives high-quality models.
    • Comparable to low-quality X-ray crystallography.
  – Sequence identity above 30% gives medium-quality models.
    • Anything lower degrades rapidly.
  – Limited by availability of suitable templates.
  – Limited by the ability to accurately align and choose distant templates.
• Sometimes function/structure will diverge for seemingly similar targets and templates.
  – The method will happily generate models against incorrect templates.

Homology Modeling
1. Template Selection and Sequence Alignment
  – Crucial, but relatively simple if a similar sequence exists (BLAST).
  – For edge cases:
    • PSI-BLAST, HMM, or profile-profile alignment based methods.
2. Model Generation
  – Multiple methods.
  – Construct the model by placing the amino acids where the aligned template suggests.
  – Then refine by going back to the chemistry/physics and fixing errors.
3. Model Assessment
  – Make sure the resulting fold is correct.
  – Detects errors in alignments and template selection.
  – Sometimes chooses the best of many potential models.

Fold Recognition (AKA Protein Threading)
• Requires primary structure and a library of tertiary structures.
  – Relies on the idea that there are (relatively) few folds (tertiary structures) of proteins.
• Often feeds the resulting structure back to homology modeling techniques as a template to get the final model.
• Can use a number of different scoring algorithms.
  – Most popular is free energy.
• Attempts to find which templates in the library minimize the score.
  – Threading.
  – Dynamic programming (optimization technique).
  – Machine learning.
• Often finds a large number of results.

How do we know these models work?
• CASP (Critical Assessment of Techniques for Protein Structure Prediction)
  – Every two years.
  – Blind tests of prediction algorithms.
    • In many different categories.
  – Since 1994.
• Other variations.

Future
• Mix it all together!
• Including evolutionary information.
  – Improves alignment.
  – Helps find better folds.
• Structural information.
  – Predicted secondary structure can help.
• Mixing with ab initio/de novo methods.

Questions?
• Computational Structural Biology: Methods and Applications
  – By Torsten Schwede and Manuel C. Peitsch
• Images from Wikipedia or sources.
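
Appendix: sequence identity sketch
The Homology Modeling slides tie expected model quality to the sequence identity between target and template (above 50% high, above 30% medium, lower degrades rapidly). A minimal Python sketch of that rule of thumb follows. The function names are hypothetical, the input is assumed to be an already-aligned pair using '-' for gaps, and identity is assumed to be counted over gap-free columns only; other tools may define the denominator differently.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence has a gap.

    Assumption: both strings come from the same pairwise alignment,
    so they must have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns (an assumption, see lead-in)
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


def expected_model_quality(identity: float) -> str:
    """Map percent identity to the rough quality tiers from the slides."""
    if identity > 50.0:
        return "high"
    if identity > 30.0:
        return "medium"
    return "low"


if __name__ == "__main__":
    # Toy aligned pair, made up for illustration only.
    target = "ACDE-FG"
    template = "ACDQKFG"
    ident = percent_identity(target, template)
    print(f"{ident:.1f}% identity, expected quality: {expected_model_quality(ident)}")
```

The thresholds are only the slide's rule of thumb; a real pipeline would still run the model assessment step, since a high identity does not guarantee a correct template.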