Bioinformatics and Intrinsically Disordered Proteins (IDPs)

A. Keith Dunker
Biochemistry and Molecular Biology & Center for Computational Biology / Bioinformatics
Indiana University School of Medicine
Presented at: Center for Computational Biology and Bioinformatics, October 22, 2010

Outline
• What are "Intrinsically Disordered Proteins"?
• Bioinformatics Applications to IDPs
  – Why don't IDPs form structure?
  – Predicting IDPs from amino acid sequence
  – Some important results from IDP prediction
  – An improved order / disorder amino acid scale
  – Predicting phosphorylation sites
  – Disorder and function: two examples
• Importance of bioinformatics to IDP research

Definitions: Intrinsically Disordered Proteins (IDPs) and ID Regions (IDRs)
• Whole proteins and regions of proteins are intrinsically disordered if they lack stable 3D structure under physiological conditions,
• but exist instead as highly dynamic, rapidly interconverting ensembles without particular equilibrium values for their coordinates or bond angles and with noncooperative conformational changes.

Why are IDPs / IDRs unstructured?
• From the 1950s to now, >> 1,000 IDPs / IDRs studied and characterized
• Visit: http://www.disprot.org
• Why do IDPs & IDRs lack structure?
  – Lack a ligand or partner?
  – Denatured during isolation?
  – Folding requires conditions found inside cells?
  – Lack of folding encoded by amino acid sequence?
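One way to test whether lack of folding is encoded in the sequence is to compare the amino acid composition of disordered and ordered segments using the relative difference (Disorder − Order) / Order. The following is only a minimal sketch of that calculation: the function names and the toy sequences are hypothetical, and real analyses would use curated sets of ordered and disordered regions.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(sequences):
    """Fraction of each standard residue, pooled over all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard residues found")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def relative_difference(disordered, ordered):
    """(Disorder - Order) / Order per residue.

    Residues absent from the ordered set are skipped to avoid
    dividing by zero.
    """
    d = composition(disordered)
    o = composition(ordered)
    return {aa: (d[aa] - o[aa]) / o[aa] for aa in AMINO_ACIDS if o[aa] > 0}


# Toy input (hypothetical sequences, for illustration only).
toy = relative_difference(disordered=["KKEE"], ordered=["KEWW"])
for aa, value in sorted(toy.items(), key=lambda item: item[1]):
    print(aa, round(value, 2))
```

Positive values mark residues enriched in the disordered set and negative values mark residues depleted in it.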
[Figure: Amino Acid Compositions, (Disorder − Order) / Order, per residue. X-axis residue order: W C F I Y V L H M A T R G Q S N P D E K. Y-axis from −1.0 to 1.0. Plot labels: Surface, Buried. Legend (disordered-region length L): 4 aa ≤ L ≤ 14 aa (14579); 15 aa ≤ L ≤ 29 aa (10381); 30 aa ≤ L (58147).]

Why are IDPs / IDRs unstructured?
• To a first approximation, amino acid composition determines whether a protein folds or remains intrinsically disordered.
• Given a composition that favors folding, the sequence details determine which fold.
• Given a composition that favors not folding, the sequence details provide motifs for biological function.

Prediction of Intrinsic Disorder
[Flowchart, in order:]
• Ordered / Disordered Sequence Data
• Attribute Selection or Extraction: Aromaticity, Hydropathy, Charge, Complexity
• Separate Training and Testing Sets
• Predictor Training: Neural Networks, SVMs, etc.
• Predictor Validation on Out-of-Sample Data
• Prediction

First Machine-learning Predictor: SDR/MDR/LDR Predictors
1. Region classes by number of missing AA:
   Short Disordered Regions (SDR): 7 – 21
   Medium Disordered Regions (MDR): 22 – 44
   Long Disordered Regions (LDR): 45 or more
2. SDR / MDR / LDR predictors: neural networks
3. Training dataset: proteins with missing AA
   SDR: 34 proteins, 11,050 AA, 38 IDR, 411 IDAA
   MDR: 20 proteins, 4,764 AA, 22 IDR, 464 IDAA
   LDR: 7 proteins, 2,069 AA, 7 IDR, 465 IDAA
4. Feature selection: standard sequential forward selection
5. Accuracy: 59 – 67%, estimated by 5-fold cross-validation
6. Better than chance; better on self than on not self
Romero P et al., Proc. IEEE International Conference on Neural Networks 1:90-95 (1997)

Next: PONDR®VL-XT
[Diagram labels: XN(1); sequence positions 11, 14, N-14, N-11; VL1(2); XC(1); combined output VL-XT(2).]
• XN, VL1, and XC: neural networks
• Input features: XN: 8; VL1: 10; XC: 8
(1) Li X et al., Genome Informat. 9:201-213 (1999)
(2) Romero P et al., Proteins 42:38-48 (2001)

Inputs for PONDR®VL-XT
[Table: input attributes of the XN, VL1, and XC networks. Recoverable entries: coordination number, net charge, hydropathy, and amino acid sets such as VIYFW, WFY, PEVK, M, N, H, D, E, K, R, T, V, W, Y, F. The assignment of attributes to networks is not recoverable from the extracted text.]
• Accuracy (ACC) = (%Corr-O)/2 + (%Corr-D)/2
• ACC (estimated by cross-validation) ~ 72 ± 4%
Li X et al., Genome Informat. 9:201-213 (1999)
Romero P et al., Proteins 42:38-48 (2001)

Disorder Prediction in CASP
• Critical Assessment of Structure Prediction
• http://predictioncenter.org
• CASP1 (1994) to CASP9 (2010)
• Experimentalists provide amino acid sequences as they are determining the structures of proteins
• Groups register and make structure predictions
• After structures are determined, predictions are evaluated
• Disorder predictions introduced in CASP5 (2002)
CASP PREDICTIONS ARE TRULY BLIND!!!

Disorder Prediction in CASP
[Figure, two panels, 2002 to 2010. Left: number of CASP predictors by year (y-axis 0 to 40). Right: area under ROC curve by year (y-axis 0.6 to 1.0), with points labeled VSL2, VSL2, and PreDisorder. Note on figure: "CASP5 (2002), sensitivity replaced AUC".]

Our Performance in CASP
• In CASP5 we used VL-XT: it did poorly on short disordered regions but very well on long disordered regions.
• VL was trained mainly on long disordered regions.
• We changed predictors for CASP6 and CASP7; the new predictor ranked #1. Big improvement!!
• We did not participate in CASP 8, but would not have ranked #1 with current predictors.
• What change led to the large improvement in CASP6?
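The accuracy measure quoted above for VL-XT, ACC = (%Corr-O)/2 + (%Corr-D)/2, averages the percentage of correctly predicted ordered residues and the percentage of correctly predicted disordered residues, so a predictor cannot score well simply by favoring the larger class. A minimal sketch follows; the function name and the 'O' / 'D' string encoding of per-residue labels are assumptions made for illustration.

```python
def balanced_accuracy(true_labels, predicted_labels):
    """Return ACC = (%Corr-O)/2 + (%Corr-D)/2 for per-residue 'O'/'D' labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    percent_correct = {}
    for cls in ("O", "D"):
        # Predictions made for residues whose true state is this class.
        predictions = [p for t, p in zip(true_labels, predicted_labels) if t == cls]
        if not predictions:
            raise ValueError(f"no residues of class {cls!r} in the true labels")
        hits = sum(p == cls for p in predictions)
        percent_correct[cls] = 100.0 * hits / len(predictions)
    return percent_correct["O"] / 2 + percent_correct["D"] / 2


# Hypothetical labels: 3 of 4 ordered and 1 of 2 disordered residues correct.
truth = "OOOODD"
predicted = "OOODDO"
print(balanced_accuracy(truth, predicted))
```

Under this measure a predictor that calls every residue ordered scores 50%, the same as chance, regardless of how rare disorder is in the test set.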
Predictors of Natural Disordered Regions: PONDR®VL-XT and PONDR®VSL2
[Diagram labels, VL-XT: N(1); sequence positions 11, 14, N-14, N-11; VL1(2); C(1); combined output VL-XT(2). Diagram labels, VSL2: M1(3) with outputs OM and 1−OM; VL2(3) with output OL; VS2(3) with output OS; combined output VSL2(3).]
PONDR®VL-XT
• N, VL1, and C are neural networks
• N-term: 8 inputs; VL1: 10 inputs; C-term: 8 inputs
PONDR®VSL2
• VSL2 Score = OL×OM + OS×(1−OM)
• M1, VSL2-L, and VSL2-S are support vector machines
• M1: 54 inputs; VL2: 20 inputs; VS2: 20 inputs
(1) Li X et al., Genome Informat. 9:201-213 (1999)
(2) Romero P et al., Proteins 42:38-48 (2001)
(3) Peng K et al., BMC Bioinfo. 7:208 (2006)

Comparison on CASP 8 Dataset
[Figure: ROC comparison of predictors on the CASP 8 dataset.]
• AUC = 0.89, ACC = 80%
• AUC = Area Under Curve
• ACC = (%Corr-O)/2 + (%Corr-D)/2
Zhang P et al. (unpublished results; not quite the same as the CASP evaluation)

PONDR®VL-XT, PONDR®VSL2B and PreDisorder
[Figure: disorder score (0.0 to 1.0) versus residue index (0 to 250) for XPA, with curves for VL-XT, VSL2, and PreDisorder. High scores (+) indicate disordered, low scores (–) indicate structured.]
Iakoucheva L et al., Protein Sci 3: 561-571 (2001)
Dunker AK et al., FEBS J 272: 5129-5148 (2005)
Deng X et al., BMC Bioinformatics 10:436 (2009)

Published Predictors of Disordered Proteins
[Figure: number of predictors of IDPs by year, 1979 to 2009 (y-axis 0 to 60). Annotations on figure: "# +,- / # phobics"; "CASP"; "PONDRs".]
• PONDR VSL2: ranked #1 in CASP 7 (2006)
• PONDR VSL1: ranked #1 in CASP 6 (2004)
He B et al., Cell Res 19: 929-949 (2009)

How Abundant are IDRs/IDPs?
• To estimate the abundance of IDPs/IDRs: predict on whole proteomes from many organisms.
ALERT!!
• There are no membrane-protein-specific disorder predictors, so estimates of disorder will be too low by a small percentage.
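Two of the proteome-level summaries reported in this kind of survey, the percentage of disordered residues and the percentage of proteins containing a disordered region longer than 30 residues, can be derived from per-residue disorder scores. The sketch below assumes scores in the range 0 to 1 with residues scoring above 0.5 called disordered; that cutoff, the function names, and the toy scores are assumptions for illustration, not the published VSL2 pipeline.

```python
def longest_disordered_run(scores, threshold=0.5):
    """Length of the longest stretch of consecutive residues above threshold."""
    longest = current = 0
    for score in scores:
        current = current + 1 if score > threshold else 0
        longest = max(longest, current)
    return longest


def proteome_summary(proteome, threshold=0.5, min_idr_length=30):
    """Summarize predicted disorder for a dict of {protein_id: per-residue scores}."""
    total_residues = sum(len(scores) for scores in proteome.values())
    if not proteome or total_residues == 0:
        raise ValueError("proteome contains no residues")
    disordered = sum(
        sum(1 for score in scores if score > threshold)
        for scores in proteome.values()
    )
    # Proteins with at least one disordered region longer than min_idr_length.
    with_long_idr = sum(
        1 for scores in proteome.values()
        if longest_disordered_run(scores, threshold) > min_idr_length
    )
    return {
        "pct_disordered_aa": 100.0 * disordered / total_residues,
        "pct_proteins_idr_gt_30": 100.0 * with_long_idr / len(proteome),
    }


# Toy proteome with made-up scores: one protein with a 40-residue
# disordered stretch and one fully ordered protein.
toy_proteome = {
    "p1": [0.9] * 40 + [0.1] * 60,
    "p2": [0.2] * 100,
}
print(proteome_summary(toy_proteome))
```

Because the summary depends only on per-residue scores, the same functions would work with the output of any disorder predictor, which also makes it easy to see how a systematic bias in the predictor carries through to the abundance estimate.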
VSL2 Prediction of Abundance** of Intrinsically Disordered Proteins

Organisms            # Orgs.  # Proteins     Avg. # Proteins  % Disordered AA  % Proteins IDR >30  % Proteins Natively Unfolded
Archaea              73       536 – 4234     2199             12.5 – 37.2%     0 – 60.0%           3.2 – 31.5%
Bacteria             951      182 – 9320     3331             12.0 – 36.1%     11.5 – 53.7%        2.7 – 29.2%
Single-cell Eukarya  58       1909 – 16365   9098             22.3 – 49.9%     17.0 – 76.8%        16.8 – 47.6%
Multi-cell Eukarya   51       1775 – 35942   11295            10.4 – 49.0%     4.4 – 66.5%         6.9 – 48.7%

**Are organism-specific predictors sometimes needed?

Archaea Phylogenetic Tree
[Figure: archaeal phylogenetic tree with branches marked by predicted disorder: >30%, >21%, >17%, >14%, <14%.]
Todd Lowe (http://archaea.ucsc.edu/)

Predicted Disorder vs. Proteome Size
[Figure: average fraction of disordered residues (0.0 to 0.6) versus proteome size (10^2 to 10^5) for Bacteria, Archaea, SC eukaryotes, and MC eukaryotes.]

Why So Much Disorder? Hypothesis: Disorder Used for Signaling
• Sequence → Structure → Function
  – Catalysis,
  – Membrane transport,
  – Binding small molecules.
• Sequence → Disordered Ensemble → Function
  – Signaling,
  – Regulation,
  – Recognition,
  – Control.
Dunker AK, et al., Biochemistry 41: 6573-6582 (2002)
Dunker AK, et al., Adv. Prot. Chem. 62: 25-49 (2002)
Xie H, et al., Proteome Res. 6: 1882-1932 (2007)

A New Order / Disorder AA Scale, Part 1
• Collect equal numbers of O and D windows of length 21.
• Calculate the value of attribute, x, for each window.
• For each interval of x, count how many windows are O and D; from this, determine P(O | x) and P(D | x).
• Plot P(O | x) and P(D | x) versus x.
• Determine the areas between the two curves.
• Area Ratio Value = (area between curves / total area)
• Apply to 517 aa scales: http://www.genome.jp/aaindex .
• Rank scales from smallest to largest
Campen A, et al., Protein Pept Lett 15: 956-963 (2008)

A New Order / Disorder AA Scale, Part 2
• Overall idea: make random changes to a scale, test for higher ARV, repeat until no larger value is found.
• Genetic Algorithm Pseudocode:
  – Choose initial population
  – Repeat
  –   Evaluate the fitness of each individual
  –   Select a certain portion of best-ranking individuals
  –   Breed new population through crossover + mutation
  – Until terminating condition
• ARV value improved from 0.69 for best of 517 scales to 0.76 for new scale, called TOP-IDP
Campen A, et al., Protein Pept Lett 15: 956-963 (2008)

P(D | x) and P(O | x) Versus x Plots: Area Between Curves Used to Rank Attributes, x
[Figure, four panels:]
• Flexibility: ARV = 0.69, Rank = #1/517
• Positive Charge: ARV = 0.36, Rank = #238/517
• Extracellular Protein AA Composition: ARV = 0.07, Rank = #517/517
• TOP-IDP: ARV = 0.76
Campen A et al., Protein & Peptide Lett 15: 956-963 (2008)

[Figure: Analysis of the disorder propensity in p53 by TOP-IDP (A), PONDR® VLXT (B) and PONDR® VSL1 (C).]

Chronology of Amino Acid Evolution
DISORDER TO ORDER, NON-LIFE TO LIFE
Di Mauro E, et al., in Genesis: Origin of Life on Earth and Other Planets (In press)

New Phosphorylation Predictor
[Flowchart, in order:]
• Data collection from high quality sources, such as UniProt/Swiss-Prot, Phospho.ELM, PhosphoPep, and PhosPhAt
• Non-redundant datasets built by BLASTclust: phosphorylation sites and non-phosphorylation sites
• Feature extraction:
  – KNN scores: similarity to known sites (+ / -) of phosphorylation
  – Disorder scores: used VSL2
  – Amino acid frequencies: at sequence positions before and after phosphorylation sites
• Features from positive set and features from negative set form the training data and the control data
• Bootstrap: bootstrap sample 1 ... bootstrap sample m
• Training: classifier 1 ... classifier m
• Aggregating: phosphorylation prediction model
• Making predictions on new data
• Specificity estimation
Gao J et al., Mol and Cell Proteomics (In press)

Disorder Score vs. Phosphorylation
[Figure: histograms of occurrence versus disorder score (0 to 1), six panels: (A) Phospho-S/T in H. sapiens; (B) Non-phospho-S/T in H. sapiens; (C) Phospho-S/T in A. thaliana; (D) Non-phospho-S/T in A. thaliana; (E) Phospho-Y in H. sapiens; (F) Non-phospho-Y in H. sapiens. Panel annotations give the fraction with disorder score > 0.5: 91.3%, 87.6%, 50.5%, and 54.9%. A side axis lists residue positions −6 to +6.]
Gao J et al., Mol & Cell Proteomics 9 (Epub) (2010)

Signaling Example 1: Calcineurin and Calmodulin
[Figure labels: B-Subunit, A-Subunit, Active Site, Autoinhibitory Peptide.]
Meador W et al., Science 257: 1251-1255 (1992)
Kissinger C et al., Nature 378:641-644 (1995)

Example 2: p27kip1: A Disordered Domain
[Figure labels: Cyclin A, CDK, p27kip1 (69 residues).]
3D Structure: Russo AA et al., Nature 382: 325-331 (1996)
DD: Tompa P et al., Bioessays 4: 328-340 (2008)

The p27kip1 Disordered Domain: Used for Signal Integration
[Figure: four numbered stages showing Y88 and T187 becoming pY88 and pT187 (ATP), followed by ubiquitination (Ub'n).]
1. NRTK phosphorylation @ Y88, signal #1.
2. Intra-molecular phosphorylation @ T187, #2.
3. Ubiquitination @ several possible loci, #3.
4. Proteasome digestion of p27, then cell cycle progression.
Galea CA et al., J Mol Biol 376: 827-838 (2008)
Dunker AK & Uversky VN, Nat Chem Biol 4: 229-230 (2008)

Importance of Bioinformatics to IDP and Protein Research
• Thousands of IDPs and IDRs have been found.
• Not one IDP or IDR is discussed in any current biochemistry textbook!
• Why? IDPs and IDRs don't fit Sequence → Structure → Function
• New paradigm developed from bioinformatics: Sequence → Disordered Ensemble → Function
IDP prediction is changing fundamental views of structure-function relationships!

Thank You!!!
Indiana University Bin Xue Jake Chen Bill Sullivan Predrag Radivojac Jennifer Chen Pedro Romero Marc Cortese Derrick Johnson Chris Oldfield Amrita Mohan Yunlong Liu Ann Roman Tom Hurley Anna DePaoli-Roach Yuro Tagaki Siama Zaidi Jingwei Meng Wei-Lun Hsu Hua Lu Fei Huang Vladimir Uversky Collaborators Harbin Engineering University Bo He Kejun Wang University of Idaho Celeste J. Brown Chris Williams Molecular Kinetics Yugong Cheng Tanguy LeGall Aaron Santner Plant and Food Research UCSD Lilia Iakoucheva Sebat Temple University Zoran Obradovic Slobodan Vucetic Vladimir Vacic Kang Peng Hiongbo Xie Siyuan Ren Uros Midic Enzyme Institute Gary Daughdrill Peter Tompa Zsuzsanna Dosztanyi Istvan Simon Monika Fuxreiter Wright State University USU Oleg Paliy Robert Williams Xaiolin Sun USF