#19 - Protein Structure Basics & Classification 10/5/07

BCB 444/544
Lecture 19
A bit of:
Protein Structure - Basics
Protein Structure Visualization, Classification & Comparison
#19_Oct05
BCB 444/544 F07 ISU Dobbs #19- Protein Structure Basics & Classification 10/5/07

Required Reading (before lecture)
√ Mon Oct 1 - Lecture 17: Protein Motifs & Domain Prediction
  • Chp 7 - pp 85-96
√ Wed Oct 3 - Lecture 18: Protein Structure: Basics (Note chg in Lecture Schedule online)
  • Chp 12 - pp 173-186
√ Thurs Oct 4 & Fri Oct 5 - Lab 6 & Lecture 19: Protein Structure: Basics, Databases, Visualization, Classification & Comparison
  • Chp 13 - pp 187-199

BCB 544 - Extra Required Reading
BCB 544 Extra Required Reading Assignment: for 544 Extra HW#1 Task 2
• Pollard KS, …., Haussler D. (2006) An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443: 167-172.
• http://www.nature.com/nature/journal/v443/n7108/abs/nature05113.html doi:10.1038/nature05113
• PDF available on class website - under Required Reading Link

BCB 544 Projects (Optional for BCB 444)
Assigned Mon Sept 24
• For a better idea about what's involved in the Team Projects, please look over last year's expectations for projects: http://www.public.iastate.edu/~f2007.com_s.544/project.htm
  Please note: wrong URL (instead of that shown above) was included in originally posted 544ExtraHW#1; corrected version is posted now
• Criteria for evaluation of projects (oral presentations) are summarized here: http://www.public.iastate.edu/%7Ef2007.com_s.544/homework/HW7.pdf

Assignments & Announcements - #1
BCB 444s (Standard):
  Midterm Exams = 100 points each                        200 pts
  Homework & Laboratory assignments = 200 points         200
  Final Exam                                             100
  Total for BCB 444                                      500 pts
BCB 444p (Project):
  Midterm Exams = 100 points each                        200 pts
  Homework & Laboratory assignments = 200 points         200
  Team Research Project                                  190
  Total for BCB 444p                                     590 pts
BCB 544:
  Midterm Exams = 100 points each                        200 pts
  Homework & Laboratory assignments                      200
  Final Exam                                             100
  Discussion Questions & Team Research Projects          200
  Total for BCB 544                                      700 pts

Assignments & Announcements - #2
Students registered for BCB 444: Two Grading Options
1) Take Final Exam per original Grading Policies
2) Instead of taking Final Exam - you may participate in a Team Research Project
If you choose #2, please do 3 things:
1) Contact Drena (in person)
2) Send email to Michael Terribilini (terrible@iastate.edu)
3) Complete 544 Extra HW#1 - Task 1.1 by noon on Mon Oct 1

QUESTIONS re: HW#3?
Due Mon

Assignments & Announcements #3
ALL: HomeWork #3
  Due: Mon Oct 8 by 5 PM
• HW544: HW544Extra #1
  √ Due: Task 1.1 - Mon Oct 1 by noon
  Due: Task 1.2 & Task 2 - Fri Oct 12 by 5 PM (not Monday)
• 444 "Project-instead-of-Final" students should also submit:
  • HW544Extra #1
  • Due: Task 1.1 - Mon Oct 8 by noon
  • Due: Task 1.2 - Fri Oct 12 by 5 PM (not Monday)
  Task 2 NOT required!

(This is a new slide)
HMM example from Eddy HMM paper: Toy HMM for Splice Site Prediction

An HMM for Occasionally Dishonest Casino
Transition probabilities
• Prob(Fair → Loaded) = 0.01
• Prob(Loaded → Fair) = 0.2
But, where do you start? "Begin" state not shown

(This slide has been changed)
Occasionally Dishonest Casino - HW#3
"Begin" state? 50:50 chance of starting with F vs L die

Calculating Different Paths to an Observed Sequence
Calculations such as those shown below are used to fill a matrix with probability values for every state at every position. In each product, the a terms are transition probabilities and the e terms are emission probabilities.

$$x = x_1, x_2, x_3 = 6, 2, 6$$

$$\pi^{(1)} = FFF: \quad \Pr(x, \pi^{(1)}) = a_{0F}\, e_F(6)\, a_{FF}\, e_F(2)\, a_{FF}\, e_F(6) = 0.5 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \times 0.99 \times \tfrac{1}{6} \approx 0.00227$$

$$\pi^{(2)} = LLL: \quad \Pr(x, \pi^{(2)}) = a_{0L}\, e_L(6)\, a_{LL}\, e_L(2)\, a_{LL}\, e_L(6) = 0.5 \times 0.5 \times 0.8 \times 0.1 \times 0.8 \times 0.5 = 0.008$$

$$\pi^{(3)} = LFL: \quad \Pr(x, \pi^{(3)}) = a_{0L}\, e_L(6)\, a_{LF}\, e_F(2)\, a_{FL}\, e_L(6) = 0.5 \times 0.5 \times 0.2 \times \tfrac{1}{6} \times 0.01 \times 0.5 \approx 0.0000417$$

Calculate optimal path?
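For a sequence this short, the question can be answered by brute force. The sketch below enumerates all 2^3 state paths for the rolls 6, 2, 6, reproduces the three joint probabilities worked out above, and reports both the most probable path and the sum over all paths. It assumes the loaded die emits a 6 with probability 0.5 and every other face with probability 0.1 (the slides only show the values for 6 and 2); the names `START`, `TRANS`, `emit` and `joint_probability` are illustrative, not from the lecture.

```python
from itertools import product

STATES = "FL"                       # F = fair die, L = loaded die
START = {"F": 0.5, "L": 0.5}        # 50:50 chance of starting with F vs L
TRANS = {("F", "F"): 0.99, ("F", "L"): 0.01,
         ("L", "L"): 0.80, ("L", "F"): 0.20}


def emit(state, roll):
    """Emission probability e_state(roll)."""
    if state == "F":
        return 1 / 6
    # Assumption: the loaded die gives 0.5 for a six and 0.1 for any other face.
    return 0.5 if roll == 6 else 0.1


def joint_probability(rolls, path):
    """Pr(x, path): product of transition and emission terms along one path."""
    prob = START[path[0]] * emit(path[0], rolls[0])
    for i in range(1, len(rolls)):
        prob *= TRANS[(path[i - 1], path[i])] * emit(path[i], rolls[i])
    return prob


rolls = (6, 2, 6)
table = {"".join(p): joint_probability(rolls, p)
         for p in product(STATES, repeat=len(rolls))}

best = max(table, key=table.get)    # path with the largest joint probability
total = sum(table.values())         # probability of the rolls over all paths

for path, prob in sorted(table.items(), key=lambda kv: -kv[1]):
    print(f"{path}  {prob:.7f}")
print("most probable path:", best)
print("sum over all paths:", round(total, 6))
```

Enumeration grows as 2^L with sequence length L, so it is only practical as a check on very short examples like this one.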
Construct a matrix of probability values for every state at every residue of x.

Viterbi for Calculating Most Probable Path*
* Path within HMM that matches query sequence with highest probability
How: one way = Viterbi Algorithm
• Initialization (i = 0):
$$v_0(0) = 1, \quad v_k(0) = 0 \ \text{for } k > 0$$
• Recursion (i = 1, . . . , L): for each state k
$$v_k(i) = e_k(x_i)\, \max_r \big( v_r(i-1)\, a_{rk} \big)$$
• Termination:
$$\Pr(x, \pi^*) = \max_k \big( v_k(L)\, a_{k0} \big)$$
To find π*, use trace-back, as in dynamic programming

Viterbi matrix for x = 6, 2, 6, filled in with the recursion above:
State | ε | 6 | 2 | 6
B | 1 | 0 | 0 | 0
F | 0 | (1/6)×(1/2) = 1/12 | (1/6)×max{(1/12)×0.99, (1/4)×0.2} = 0.01375 | (1/6)×max{0.01375×0.99, 0.02×0.2} = 0.00226875
L | 0 | (1/2)×(1/2) = 1/4 | (1/10)×max{(1/12)×0.01, (1/4)×0.8} = 0.02 | (1/2)×max{0.01375×0.01, 0.02×0.8} = 0.008

(This slide has been changed)
Total Probability
Several different paths can result in observation x
Probability that our model will emit x is:
$$\Pr(x) = \sum_{\pi} \Pr(x, \pi)$$

Calculating the Total Probability:
Note: This is not the same as the matrix on the previous slide! Here, entries are computed with sum{...} in place of max{...}, and the last column is added up to give the total.
State | ε | 6 | 2 | 6
B | 1 | 0 | 0 | 0
F | 0 | (1/6)×(1/2) = 1/12 | (1/6)×sum{(1/12)×0.99, (1/4)×0.2} = 0.022083 | (1/6)×sum{0.022083×0.99, 0.020083×0.2} = 0.004313
L | 0 | (1/2)×(1/2) = 1/4 | (1/10)×sum{(1/12)×0.01, (1/4)×0.8} = 0.020083 | (1/2)×sum{0.022083×0.01, 0.020083×0.8} = 0.008144

$$\text{Total probability} = \sum_{\pi} \Pr(x, \pi) = 0 + 0.004313 + 0.008144 \approx 0.012$$

Chp 7 - Protein Motifs & Domain Prediction
SECTION II SEQUENCE ALIGNMENT
Xiong: Chp 7 Protein Motifs and Domain Prediction
• √ Identification of Motifs & Domains in MSAs
• √ Motif & Domain Databases Using Regular Expressions
• √ Motif & Domain Databases Using Statistical Models
• Protein Family Databases
• Motif Discovery in Unaligned Sequences
• √ Sequence Logos

A few more Details re: Profiles & HMMs
• Smoothing or "Regularization" - method used to avoid "over-fitting"
  • Common problem in machine learning (data-driven) approaches
  • Limited training sample size causes over-representation of observed characters while "ignoring" unobserved characters
  • Result? Miss members of family not yet sampled (too many false negative hits)
• Pseudocounts - adding artificial values for 'extra' amino acid(s) not observed in the training set
  • Treated as 'real' values in calculating probabilities
  • Improve predictive power of profiles & HMMs
• Dirichlet mixture - commonly used mathematical model to simulate the aa distribution in a sequence alignment
  • To "correct" problems in an observed alignment based on limited number of sequences

Motifs & Domains
• Motif - short conserved sequence pattern
  • Associated with distinct function in protein or DNA
  • Avg = 10 residues (usually 6-20 residues)
  • e.g., zinc finger motif - in protein
  • e.g., TATA box - in DNA
• Domain - "longer" conserved sequence pattern, defined as an independent functional and/or structural unit
  • Avg = 100 residues (range from 40-700 in proteins)
  • e.g., kinase domain or transmembrane domain - in protein
• Domains may (or may not) include motifs

2 Approaches for Representing "Consensus" Information in Motifs & Domains
• Regular expression - symbolic representation of information from MSA
  • e.g., protein phosphorylation site motif: [S,T]-X-[R,K]
  • Symbols represent specific or unspecified residues, spaces, etc.
  • 2 mechanisms for matching:
    • Exact
    • "Fuzzy" (inexact, approximate) - flexible, more permissive to detect "near matches"
• Statistical model - includes probability information derived from MSA
  • e.g., PSSM, Profile, or HMM

Motif & Domain Databases
Based on regular expressions:
• Prosite (Interpro includes Prosite, PRINTS, etc)
• Emotif
Limitation: these don't take probability info into account
Based on statistical models:
• PRINTS
• BLOCKS
• ProDom
• Pfam
• SMART
• CDART
• Reverse PsiBLAST

Protein Family Databases
• In addition to databases of "related" protein sequences, based on shared motifs or domains (Pfam, BLOCKS, CDART), some databases "cluster" sequences into families based on near full-length sequence comparisons
• COGs - Clusters of Orthologous Groups (at NCBI)
  • Mostly Prokaryotic sequences
  • KOG = newer Eukaryotic version
  • COGnitor - software to search database
• ProtoNet - also clusters of homologous protein sequences

• READ your textbook & try some of these at home; there are distinct advantages/disadvantages associated with each
• TAKE HOME LESSON: Always try several methods! (not just one!)
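The regular-expression notation used above for the phosphorylation site motif, [S,T]-X-[R,K], maps directly onto the regular expressions found in most programming languages. The sketch below shows "exact" matching only, under the assumption that X stands for any single residue. The helper names `slide_pattern_to_regex` and `find_motif` and the test sequence are made up for illustration; this is not how Prosite or Emotif are implemented.

```python
import re


def slide_pattern_to_regex(pattern):
    """Turn slide-style notation such as '[S,T]-X-[R,K]' into a Python regex.

    Assumption: positions are separated by '-', 'X' means any residue, and
    bracketed residues are alternatives at a single position.
    """
    parts = pattern.replace(" ", "").replace(",", "").split("-")
    return "".join("." if part.upper() == "X" else part for part in parts)


def find_motif(sequence, pattern):
    """Return (1-based start position, matched residues) for every exact hit.

    A lookahead is used so that overlapping hits are all reported.
    """
    regex = re.compile("(?=(" + slide_pattern_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


print(slide_pattern_to_regex("[S,T]-X-[R,K]"))   # [ST].[RK]
print(find_motif("MASTRKSAK", "[S,T]-X-[R,K]"))  # [(3, 'STR'), (4, 'TRK'), (7, 'SAK')]
```

Because such a short pattern matches by chance in many sequences, hits like these say nothing about probability, which is the limitation noted for the regular-expression databases.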
• ProtoNet advantages: tree-like hierarchical structure; provides GO (gene ontology) annotations; provides InterPro keywords

Motif Discovery in Unaligned Sequences
• Expectation Maximization (EM) - generate "random" alignment of all sequences, derive PSSM, iteratively match individual sequences to PSSM to edit & improve it
  • Problems? Can hit a local optimum (premature convergence); sensitive to initial alignment
  • MEME - Multiple EM for Motif Elicitation - modified EM, avoids local optimum issues; two step procedure
• Gibbs Sampling - generate "trial" PSSM from random alignment first, as in EM, but leave one sequence out of initial alignment, then iteratively match PSSM to left-out sequences
  • Gibbs Sampler - web-based motif search via Gibbs sampling
• Not mentioned in textbook:
  • Stochastic context-free grammars
  • Other "state of the art" approaches in recent literature, but not available in web-based servers (yet)

Chp 12 - Protein Structure Basics
SECTION V STRUCTURAL BIOINFORMATICS
Xiong: Chp 12 Protein Structure Basics
• LAB 6
  • Introduction to Protein DataBank - PDB
  • PyMol
  • Cn3D?
• Amino Acids
• Peptide Bond Formation
• Dihedral Angles
• Hierarchy
• Secondary Structures
• Tertiary Structures
• Determination of Protein 3-Dimensional Structure
• Protein Structure DataBank (PDB)

Protein Structure & Function
• Protein structure - primarily determined by sequence
• Protein function - primarily determined by structure
• Globular proteins: compact hydrophobic core & hydrophilic surface
• Membrane proteins: special hydrophobic surfaces
• Folded proteins are only marginally stable
• Some proteins do not assume a stable "fold" until they bind to something = Intrinsically disordered
Predicting protein structure and function can be very hard -- & fun!
4 Basic Levels of Protein Structure

Primary & Secondary Structure
• Primary
  • Linear sequence of amino acids
  • Description of covalent bonds linking aa's
• Secondary
  • Local spatial arrangement of amino acids
  • Description of short-range non-covalent interactions
  • Periodic structural patterns: α-helix, β-sheet

Tertiary & Quaternary Structure
• Tertiary
  • Overall 3-D "fold" of a single polypeptide chain
  • Spatial arrangement of 2' structural elements; packing of these into compact "domains"
  • Description of long-range non-covalent interactions (plus disulfide bonds)
• Quaternary
  • In proteins with > 1 polypeptide chain, spatial arrangement of subunits

"Additional" Structural Levels
• Super-secondary elements
• Motifs
• Domains
• Foldons

Amino Acids
• Each of 20 different amino acids has different "R-Group" or side chain attached to Cα

Peptide Bond is Rigid and Planar

Hydrophobic Amino Acids

Charged Amino Acids

Polar Amino Acids

Certain Side-chain Configurations are Energetically Favored (Rotamers)

Ramachandran plot: "Allowable" psi & phi angles
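The psi and phi angles shown on a Ramachandran plot are dihedral (torsion) angles, each defined by four consecutive backbone atoms: phi by C(i-1), N(i), Cα(i), C(i) and psi by N(i), Cα(i), C(i), N(i+1). As a rough sketch of the geometry, the function below computes the dihedral angle for any four points given as (x, y, z) tuples. The function name `dihedral` and the sample coordinates are illustrative and are not taken from a real structure.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees (-180 to 180) about the p1 -> p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    length = math.sqrt(_dot(b1, b1))
    b1 = tuple(c / length for c in b1)          # unit vector along the central bond
    # Project the two outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Four made-up points: the outer atoms are on the same side (cis) or at right angles.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))   # 0.0
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))   # 90.0
```

With atom coordinates read from a PDB entry, calling this function once per residue for phi and once for psi would give the points of a Ramachandran plot.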
Glycine is Smallest Amino Acid
R group = H atom
• Glycine residues increase backbone flexibility because they have no side chain beyond a single H atom

Proline is Cyclic
• Proline residues reduce flexibility of polypeptide chain
• Proline cis-trans isomerization is often a rate-limiting step in protein folding
• Recent work suggests it may also regulate ligand binding in native proteins - Andreotti (BBMB)

Cysteines can Form Disulfide (S-S) Bonds
• Disulfide bonds (covalent) stabilize 3-D structures
• In eukaryotes, disulfide bonds are often found in secreted proteins or extracellular domains

Globular Proteins Have a Compact Hydrophobic Core
• Packing of hydrophobic side chains into interior is main driving force for folding
• Problem? Polypeptide backbone is highly polar (hydrophilic) due to polar -NH and C=O in each peptide unit (which carry partial charges); these polar groups must be neutralized
• Solution? Form regular secondary structures,
  • e.g., α-helix, β-sheet, stabilized by H-bonds

Exterior Surface of Globular Proteins is Generally Hydrophilic
• Hydrophobic core formed by packed secondary structural elements provides compact, stable core
• "Functional groups" of protein are attached to this framework; exterior has more flexible regions (loops) and polar/charged residues
• Hydrophobic "patches" on protein surface are often involved in protein-protein interactions

Protein Secondary Structures
• α-Helices
• β-Sheets
• Loops
• Coils

α-Helix: Stabilized by H-bonds between every ~ 4th residue in Backbone
C = black, O = red, N = blue, H = white
Look! - Charges on backbone are "neutralized" by hydrogen bonds (H-bonds) - red fuzzy vertical bonds

Certain Amino Acids are "Preferred" & Others are Rare in α-Helices
• Ala, Glu, Leu, Met = good helix formers
• Pro, Gly, Tyr, Ser = very poor
• Amino acid composition & distribution varies, depending on location of helix in 3-D structure

β-Sheets - also Stabilized by H-bonds Between Backbone Atoms
Anti-parallel
Parallel

Loops
• Connect helices and sheets
• Vary in length and 3-D configurations
• Are located on surface of structure
• Are more "tolerant" of mutations
• Are more flexible and can adopt multiple conformations
• Tend to have charged and polar amino acids
• Are frequently components of active sites
• Some fall into distinct structural families (e.g., hairpin loops, reverse turns)

Coils
• Regions of 2' structure that are not helices, sheets, or recognizable turns
• Intrinsically disordered regions appear to play important functional roles

Chp 13 - Protein Structure Visualization, Comparison & Classification
SECTION V STRUCTURAL BIOINFORMATICS
Xiong: Chp 13 Protein Structure Visualization, Comparison & Classification
• Protein Structural Visualization
• Protein Structure Comparison
• Protein Structure Classification
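Looking back at the α-helix slides above, two of their statements are easy to make concrete: the H-bond pattern "between every ~4th residue" (the C=O of residue i pairs with the N-H of residue i+4), and the lists of good and very poor helix formers. The sketch below is a toy tally under those assumptions, not a secondary structure predictor; the function names and example sequences are made up for illustration.

```python
GOOD_HELIX_FORMERS = set("AELM")   # Ala, Glu, Leu, Met
POOR_HELIX_FORMERS = set("PGYS")   # Pro, Gly, Tyr, Ser


def helix_hbond_pairs(n_residues):
    """1-based (i, i+4) residue pairs that can H-bond in an ideal helix of this length."""
    return [(i, i + 4) for i in range(1, n_residues - 3)]


def helix_former_counts(sequence):
    """Count (good, very poor) helix formers in a one-letter amino acid sequence."""
    seq = sequence.upper()
    good = sum(1 for aa in seq if aa in GOOD_HELIX_FORMERS)
    poor = sum(1 for aa in seq if aa in POOR_HELIX_FORMERS)
    return good, poor


print(helix_hbond_pairs(6))               # [(1, 5), (2, 6)]
print(helix_former_counts("MAEELLKK"))    # (6, 0)
print(helix_former_counts("GPSGYPSG"))    # (0, 8)
```

A segment rich in good formers and free of Pro or Gly is only a hint of helix; as the slide notes, composition also depends on where the helix sits in the 3-D structure.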