BCB 444/544 Lecture 19
A bit of: Protein Structure - Basics
Protein Structure Visualization, Classification & Comparison
#19_Oct05
BCB 444/544 F07 ISU Dobbs #19- Protein Structure Basics & Classification 10/5/07

Required Reading (before lecture)
√ Mon Oct 1 - Lecture 17: Protein Motifs & Domain Prediction
• Chp 7 - pp 85-96
√ Wed Oct 3 - Lecture 18: Protein Structure: Basics (Note change in Lecture Schedule online)
• Chp 12 - pp 173-186
√ Thurs Oct 4 & Fri Oct 5 - Lab 6 & Lecture 19: Protein Structure: Basics, Databases, Visualization, Classification & Comparison
• Chp 13 - pp 187-199

BCB 544 - Extra Required Reading
Assigned Mon Sept 24
BCB 544 Extra Required Reading Assignment: for 544 Extra HW#1 Task 2
• Pollard KS, …., Haussler D. (2006) An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443: 167-172.
• http://www.nature.com/nature/journal/v443/n7108/abs/nature05113.html doi:10.1038/nature05113
• PDF available on class website - under Required Reading Link

BCB 544 Projects (Optional for BCB 444)
• For a better idea about what's involved in the Team Projects, please look over last year's expectations for projects: http://www.public.iastate.edu/~f2007.com_s.544/project.htm
Please note: a wrong URL (instead of the one shown above) was included in the originally posted 544ExtraHW#1; a corrected version is posted now
• Criteria for evaluation of projects (oral presentations) are summarized here: http://www.public.iastate.edu/%7Ef2007.com_s.544/homework/HW7.pdf

Assignments & Announcements - #1
Students registered for BCB 444: Two Grading Options
1) Take Final Exam per original Grading Policies
2) Instead of taking Final Exam, you may participate in a Team Research Project
If you choose #2, please do 3 things:
1) Contact Drena (in person)
2) Send email to Michael Terribilini (terrible@iastate.edu)
3) Complete 544 Extra HW#1 - Task 1.1 by noon on Mon Oct 1

Assignments & Announcements - #2
BCB 444s (Standard):
  Midterm Exams (100 points each)                      200 pts
  Homework & Laboratory assignments                    200
  Final Exam                                           100
  Total for BCB 444                                    500 pts
BCB 444p (Project):
  Midterm Exams (100 points each)                      200 pts
  Homework & Laboratory assignments                    200
  Team Research Project                                190
  Total for BCB 444p                                   590 pts
BCB 544:
  Midterm Exams (100 points each)                      200 pts
  Homework & Laboratory assignments                    200
  Final Exam                                           100
  Discussion Questions & Team Research Projects        200
  Total for BCB 544                                    700 pts

Assignments & Announcements - #3
ALL: HomeWork #3 - Due: Mon Oct 8 by 5 PM
• HW544: HW544Extra #1
  √ Due: Task 1.1 - Mon Oct 1 by noon
  Due: Task 1.2 & Task 2 - Fri Oct 12 by 5 PM (not Monday)
• 444 "Project-instead-of-Final" students should also submit:
  • HW544Extra #1
  • Due: Task 1.1 - Mon Oct 8 by noon
  • Due: Task 1.2 - Fri Oct 12 by 5 PM (not Monday)
  Task 2 NOT required!

QUESTIONS re: HW#3? Due Mon

(This is a new slide)
HMM example from Eddy HMM paper: Toy HMM for Splice Site Prediction

An HMM for Occasionally Dishonest Casino
Transition probabilities
• Prob(Fair → Loaded) = 0.01
• Prob(Loaded → Fair) = 0.2
But, where do you start? "Begin" state not shown

Occasionally Dishonest Casino - HW#3
"Begin" state? 50:50 chance of starting with F vs L die

(This slide has been changed)
Calculating Different Paths to an Observed Sequence
Calculations such as those shown below are used to fill a matrix with probability values for every state at every position.
Observed sequence: $x = x_1, x_2, x_3 = 6, 2, 6$
Paths considered: $\pi^{(1)} = FFF$, $\pi^{(2)} = LLL$, $\pi^{(3)} = LFL$
($a$ = transition probability, $e$ = emission probability; end-state transitions $a_{k0}$ are taken as 1 in these products)
$\Pr(x, \pi^{(1)}) = a_{0F}\, e_F(6)\, a_{FF}\, e_F(2)\, a_{FF}\, e_F(6) = 0.5 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \approx 0.00227$
$\Pr(x, \pi^{(2)}) = a_{0L}\, e_L(6)\, a_{LL}\, e_L(2)\, a_{LL}\, e_L(6) = 0.5 \times 0.5 \times 0.8 \times 0.1 \times 0.8 \times 0.5 = 0.008$
$\Pr(x, \pi^{(3)}) = a_{0L}\, e_L(6)\, a_{LF}\, e_F(2)\, a_{FL}\, e_L(6) = 0.5 \times 0.5 \times 0.2 \times \tfrac{1}{6} \times 0.01 \times 0.5 \approx 0.0000417$

Calculate optimal path?
Construct a matrix of probability values for every state at every residue
How: one way = Viterbi Algorithm
• Initialization ($i = 0$): $v_0(0) = 1$, $v_k(0) = 0$ for $k > 0$
• Recursion ($i = 1, \ldots, L$): for each state $k$, $v_k(i) = e_k(x_i) \max_r \left[ v_r(i-1)\, a_{rk} \right]$
• Termination: $\Pr(x, \pi^*) = \max_k \left[ v_k(L)\, a_{k0} \right]$
To find $\pi^*$, use trace-back, as in dynamic programming

Viterbi for Calculating Most Probable Path*
* Path within HMM that matches query sequence with highest probability
$v_k(i) = e_k(x_i) \max_r \left[ v_r(i-1)\, a_{rk} \right]$
x: | (begin) | 6 | 2 | 6
B: | 1 | 0 | 0 | 0
F: | 0 | (1/6)(1/2) = 1/12 | (1/6) max{(1/12)(0.99), (1/4)(0.2)} = 0.01375 | (1/6) max{(0.01375)(0.99), (0.02)(0.2)} = 0.00226875
L: | 0 | (1/2)(1/2) = 1/4 | (1/10) max{(1/12)(0.01), (1/4)(0.8)} = 0.02 | (1/2) max{(0.01375)(0.01), (0.02)(0.8)} = 0.008

Total Probability
Several different paths can result in observation x
Probability that our model will emit x is: $\Pr(x) = \sum_{\pi} \Pr(x, \pi)$

(This slide has been changed)
Calculating the Total Probability:
Note: This is not the same as the matrix on the previous slide!
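As a numerical check on the hand-filled matrices on these slides, here is a minimal Python sketch of the Viterbi and total-probability (forward) calculations for the casino HMM. The start and transition probabilities are the ones given on the slides. The loaded-die emissions (0.5 for a six, 0.1 for every other face) are an assumption extended from the 1/2 and 1/10 entries in the matrices, end-state transitions are treated as 1, and the names `START`, `TRANS`, `EMIT`, `viterbi` and `forward` are illustrative only.

```python
# Occasionally dishonest casino: F = fair die, L = loaded die.
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}                      # 50:50 "Begin" state
TRANS = {"F": {"F": 0.99, "L": 0.01},             # Prob(Fair -> Loaded) = 0.01
         "L": {"F": 0.2, "L": 0.8}}               # Prob(Loaded -> Fair) = 0.2
EMIT = {
    "F": {face: 1 / 6 for face in range(1, 7)},
    # Assumption: loaded die shows a six half the time, other faces 0.1 each.
    "L": {**{face: 0.1 for face in range(1, 6)}, 6: 0.5},
}


def viterbi(rolls):
    """Return (probability, path) for the single most probable state path."""
    column = {k: START[k] * EMIT[k][rolls[0]] for k in STATES}
    pointers = []                                  # one back-pointer dict per step
    for x in rolls[1:]:
        new_column, ptr = {}, {}
        for k in STATES:
            best_prev = max(STATES, key=lambda r: column[r] * TRANS[r][k])
            ptr[k] = best_prev
            new_column[k] = EMIT[k][x] * column[best_prev] * TRANS[best_prev][k]
        column = new_column
        pointers.append(ptr)
    last = max(STATES, key=lambda k: column[k])
    path = [last]
    for ptr in reversed(pointers):                 # trace-back
        path.append(ptr[path[-1]])
    return column[last], "".join(reversed(path))


def forward(rolls):
    """Return Pr(x): same recursion as Viterbi, with sum in place of max."""
    column = {k: START[k] * EMIT[k][rolls[0]] for k in STATES}
    for x in rolls[1:]:
        column = {k: EMIT[k][x] * sum(column[r] * TRANS[r][k] for r in STATES)
                  for k in STATES}
    return sum(column.values())


if __name__ == "__main__":
    prob, path = viterbi([6, 2, 6])
    print(path, round(prob, 6))                    # LLL 0.008
    print(round(forward([6, 2, 6]), 6))            # 0.012457
```

For the rolls 6, 2, 6 the sketch returns LLL with probability 0.008 as the most probable path (the same value as the LLL path product above) and a total probability of about 0.0125, in line with the total on this slide. The only difference between the two functions is `max` versus `sum` over the previous column.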
Here, each entry is a sum (not a max) over the previous column; the total probability is the sum of the last column
x: | (begin) | 6 | 2 | 6
B: | 1 | 0 | 0 | 0
F: | 0 | (1/6)(1/2) = 1/12 | (1/6) sum{(1/12)(0.99), (1/4)(0.2)} = 0.022083 | (1/6) sum{(0.022083)(0.99), (0.020083)(0.2)} = 0.004313
L: | 0 | (1/2)(1/2) = 1/4 | (1/10) sum{(1/12)(0.01), (1/4)(0.8)} = 0.020083 | (1/2) sum{(0.022083)(0.01), (0.020083)(0.8)} = 0.008144
Total probability = $\sum_{\pi} \Pr(x, \pi)$ = 0 + 0.004313 + 0.008144 = 0.012

A few more Details re: Profiles & HMMs
• Smoothing or "Regularization" - method used to avoid "over-fitting"
  • Common problem in machine learning (data-driven) approaches
  • Limited training sample size causes over-representation of observed characters while "ignoring" unobserved characters
  • Result? Miss members of family not yet sampled (too many false negative hits)
• Pseudocounts - adding artificial values for 'extra' amino acid(s) not observed in the training set
  • Treated as 'real' values in calculating probabilities
  • Improve predictive power of profiles & HMMs
• Dirichlet mixture - commonly used mathematical model to simulate the aa distribution in a sequence alignment
  • To "correct" problems in an observed alignment based on limited number of sequences

Chp 7 - Protein Motifs & Domain Prediction
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 7 Protein Motifs and Domain Prediction
• √ Identification of Motifs & Domains in MSAs
• √ Motif & Domain Databases Using Regular Expressions
• √ Motif & Domain Databases Using Statistical Models
• Protein Family Databases
• Motif Discovery in Unaligned Sequences
• √ Sequence Logos

Motifs & Domains
• Motif - short conserved sequence pattern
  • Associated with distinct function in protein or DNA
  • Avg = 10 residues (usually 6-20 residues)
  • e.g., zinc finger motif - in protein
  • e.g., TATA box - in DNA
• Domain - "longer" conserved sequence pattern, defined as an independent functional and/or structural unit
  • Avg = 100 residues (range from 40-700 in proteins)
  • e.g., kinase domain or transmembrane domain - in protein
• Domains may (or may not) include motifs

2 Approaches for Representing "Consensus" Information in Motifs & Domains
• Regular expression - symbolic representation of information from MSA
  • e.g., protein phosphorylation site motif: [S,T]-X-[R,K]
  • Symbols represent specific or unspecified residues, spaces, etc.
  • 2 mechanisms for matching:
    • Exact
    • "Fuzzy" (inexact, approximate) - flexible, more permissive to detect "near matches"
• Statistical model - includes probability information derived from MSA
  • e.g., PSSM, Profile, or HMM

Motif & Domain Databases
Based on regular expressions:
• Prosite (InterPro includes Prosite, PRINTS, etc.)
• Emotif
Limitation: these don't take probability info into account
Based on statistical models:
• PRINTS
• BLOCKS
• ProDom
• Pfam
• SMART
• CDART
• Reverse PSI-BLAST
• READ your textbook & try some of these at home; there are distinct advantages/disadvantages associated with each
• TAKE HOME LESSON: Always try several methods! (not just one!)
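To make the regular-expression approach concrete, here is a minimal Python sketch of exact matching for the phosphorylation-site pattern [S,T]-X-[R,K] quoted above. It uses only the standard `re` module; the names `PHOSPHO_SITE` and `find_sites` and the example sequence are made up for illustration. A lookahead is used so that overlapping sites are also reported, and "fuzzy" matching is not shown.

```python
import re

# [S,T]-X-[R,K]: Ser or Thr, then any one residue, then Arg or Lys.
# Wrapping the pattern in a lookahead lets finditer report overlapping sites.
PHOSPHO_SITE = re.compile(r"(?=([ST].[RK]))")


def find_sites(sequence):
    """Return (1-based start position, matched tripeptide) for every exact match."""
    return [(m.start() + 1, m.group(1)) for m in PHOSPHO_SITE.finditer(sequence)]


if __name__ == "__main__":
    # Hypothetical peptide, not taken from any database entry.
    print(find_sites("MSARTKKSPR"))   # [(2, 'SAR'), (5, 'TKK'), (8, 'SPR')]
```

A pattern this short matches by chance very often, which is one reason the regular-expression databases are limited compared with statistical models that carry probability information.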

Protein Family Databases
• In addition to databases of "related" protein sequences, based on shared motifs or domains (Pfam, BLOCKS, CDART), some databases "cluster" sequences into families based on near full-length sequence comparisons
• COGs - Clusters of Orthologous Groups (at NCBI)
  • Mostly prokaryotic sequences
  • KOG = newer eukaryotic version
  • COGnitor - software to search database
• ProtoNet - also clusters of homologous protein sequences
  • Advantages: tree-like hierarchical structure
  • Provides GO (gene ontology) annotations
  • Provides InterPro keywords

Motif Discovery in Unaligned Sequences
Expectation Maximization - generate "random" alignment of all sequences, derive PSSM, iteratively match individual sequences to PSSM to edit & improve it
Problems? Can hit a local optimum (premature convergence); sensitive to initial alignment
• MEME - Multiple EM for Motif Elicitation - modified EM, avoids local optimum issues; two-step procedure
Gibbs Sampling - generate "trial" PSSM from random alignment first, as in EM, but leave one sequence out of initial alignment, then iteratively match PSSM to left-out sequences
• Gibbs Sampler - web-based motif search via Gibbs sampling
• Not mentioned in textbook:
  • Stochastic context-free grammars
  • Other "state of the art" approaches in recent literature, but not available in web-based servers (yet)

Chp 12 - Protein Structure Basics
SECTION V STRUCTURAL BIOINFORMATICS
Xiong: Chp 12 Protein Structure Basics
• LAB 6
  • Introduction to Protein DataBank - PDB
  • PyMol
  • Cn3D?

Chp 12 - Protein Structure Basics
SECTION V STRUCTURAL BIOINFORMATICS
Xiong: Chp 12 Protein Structure Basics
• Amino Acids
• Peptide Bond Formation
• Dihedral Angles
• Hierarchy
• Secondary Structures
• Tertiary Structures
• Determination of Protein 3-Dimensional Structure
• Protein Structure DataBank (PDB)

Protein Structure & Function
• Protein structure - primarily determined by sequence
• Protein function - primarily determined by structure
• Globular proteins: compact hydrophobic core & hydrophilic surface
• Membrane proteins: special hydrophobic surfaces
• Folded proteins are only marginally stable
• Some proteins do not assume a stable "fold" until they bind to something = Intrinsically disordered
Predicting protein structure and function can be very hard -- & fun!

4 Basic Levels of Protein Structure

Primary & Secondary Structure
• Primary
  • Linear sequence of amino acids
  • Description of covalent bonds linking aa's
• Secondary
  • Local spatial arrangement of amino acids
  • Description of short-range non-covalent interactions
  • Periodic structural patterns: α-helix, β-sheet

Tertiary & Quaternary Structure
• Tertiary
  • Overall 3-D "fold" of a single polypeptide chain
  • Spatial arrangement of 2° (secondary) structural elements; packing of these into compact "domains"
  • Description of long-range non-covalent interactions (plus disulfide bonds)
• Quaternary
  • In proteins with > 1 polypeptide chain, spatial arrangement of subunits

"Additional" Structural Levels
• Super-secondary elements
• Motifs
• Domains
• Foldons

Amino Acids
• Each of 20 different amino acids has a different "R-group" or side chain attached to Cα

Peptide Bond is Rigid and Planar

Hydrophobic Amino Acids

Charged Amino Acids

Polar Amino Acids

Certain Side-chain Configurations are Energetically Favored (Rotamers)
Ramachandran plot: "Allowable" psi & phi angles

Glycine is Smallest Amino Acid
R group = H atom
• Glycine residues increase backbone flexibility because their R group is only an H atom (no bulky side chain)

Proline is Cyclic
• Proline residues reduce flexibility of polypeptide chain
• Proline cis-trans isomerization is often a rate-limiting step in protein folding
• Recent work suggests it may also regulate ligand binding in native proteins - Andreotti (BBMB)

Cysteines can Form Disulfide (S-S) Bonds
• Disulfide bonds (covalent) stabilize 3-D structures
• In eukaryotes, disulfide bonds are often found in secreted proteins or extracellular domains

Globular Proteins Have a Compact Hydrophobic Core
• Packing of hydrophobic side chains into interior is main driving force for folding
• Problem? Polypeptide backbone is highly polar (hydrophilic) due to polar -NH and C=O groups in each peptide unit (which carry partial charges at the neutral pH, ~7, found in biological systems); these polar groups must be neutralized
• Solution? Form regular secondary structures,
  • e.g., α-helix, β-sheet, stabilized by H-bonds

Exterior Surface of Globular Proteins is Generally Hydrophilic
• Hydrophobic core formed by packed secondary structural elements provides compact, stable core
• "Functional groups" of protein are attached to this framework; exterior has more flexible regions (loops) and polar/charged residues
• Hydrophobic "patches" on protein surface are often involved in protein-protein interactions

Protein Secondary Structures
• α-Helices
• β-Sheets
• Loops
• Coils

α-Helix: Stabilized by H-bonds between every ~4th residue in Backbone
C = black, O = red, N = blue, H = white
Look! - Partial charges on backbone are "neutralized" by hydrogen bonds (H-bonds) - red fuzzy vertical bonds

Certain Amino Acids are "Preferred" & Others are Rare in α-Helices
• Ala, Glu, Leu, Met = good helix formers
• Pro, Gly, Tyr, Ser = very poor
• Amino acid composition & distribution varies, depending on location of helix in 3-D structure

β-Sheets - also Stabilized by H-bonds Between Backbone Atoms
Anti-parallel
Parallel

Loops
• Connect helices and sheets
• Vary in length and 3-D configurations
• Are located on surface of structure
• Are more "tolerant" of mutations
• Are more flexible and can adopt multiple conformations
• Tend to have charged and polar amino acids
• Are frequently components of active sites
• Some fall into distinct structural families (e.g., hairpin loops, reverse turns)

Coils
• Regions of 2° structure that are not helices, sheets, or recognizable turns
• Intrinsically disordered regions appear to play important functional roles

Chp 13 - Protein Structure Visualization, Comparison & Classification
SECTION V STRUCTURAL BIOINFORMATICS
Xiong: Chp 13 Protein Structure Visualization, Comparison & Classification
• Protein Structural Visualization
• Protein Structure Comparison
• Protein Structure Classification