Intrinsically Disordered Proteins (IDPs)

Bioinformatics and
Intrinsically Disordered Proteins (IDPs)
A. Keith Dunker
Biochemistry and Molecular Biology &
Center for Computational Biology / Bioinformatics
Indiana University School of Medicine
Presented at:
Center for Computational Biology and Bioinformatics
October 22, 2010
Outline
• What are “Intrinsically Disordered Proteins” ?
• Bioinformatics Applications to IDPs
– Why don’t IDPs form structure?
– Predicting IDPs from amino acid sequence
– Some important results from IDP prediction
– An improved order / disorder amino acid scale
– Predicting phosphorylation sites
– Disorder and function: two examples
• Importance of bioinformatics to IDP research
Definitions: Intrinsically Disordered
Proteins (IDPs) and ID Regions (IDRs)
• Whole proteins and regions of proteins are
intrinsically disordered if they lack stable 3D
structure under physiological conditions
• and exist instead as highly dynamic, rapidly
interconverting ensembles, without
particular equilibrium values for their
coordinates or bond angles and with non-cooperative conformational changes.
Why are IDPs / IDRs unstructured?
• From the 1950s to now, >> 1,000 IDPs / IDRs
studied and characterized
• Visit: http://www.disprot.org
• Why do IDPs & IDRs lack structure?
– Lack a ligand or partner?
– Denatured during isolation?
– Folding requires conditions found inside cells?
– Lack of folding encoded by amino acid sequence?
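Amino Acid Compositions
[Figure: (Disorder − Order) / Order, on a scale from −1.0 to 1.0, plotted for each residue in the order W C F I Y V L H M A T R G Q S N P D E K. Series by disordered-region length L: 4 aa ≤ L ≤ 14 aa (14579), 15 aa ≤ L ≤ 29 aa (10381), 30 aa ≤ L (58147). Additional labels: Surface, Buried.]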
Why are IDPs / IDRs unstructured?
• To a first approximation, amino acid composition
determines whether a protein folds or remains
intrinsically disordered.
• Given a composition that favors folding, the
sequence details determine which fold.
• Given a composition that favors not folding, the
sequence details provide motifs for biological
function.
Prediction of Intrinsic Disorder
• Ordered / Disordered Sequence Data
• Attribute Selection or Extraction
– Aromaticity, Hydropathy, Charge, Complexity
• Separate Training and Testing Sets
• Predictor Training
– Neural Networks, SVMs, etc.
• Predictor Validation on Out-of-Sample Data
• Prediction
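The attribute-extraction step in the workflow above can be sketched in a few lines. This is a minimal illustration, not the PONDR feature code: the function names, the 21-residue default window, the use of the Kyte-Doolittle scale for hydropathy, and Shannon entropy as the complexity measure are all assumptions made for the example.

```python
from collections import Counter
from math import log2

# Kyte-Doolittle hydropathy scale (assumed here; the slides do not name a scale).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def window_features(window):
    """Return the four attribute families named on the slide for one window.

    Assumes the window contains only the 20 standard residues.
    """
    n = len(window)
    counts = Counter(window)
    aromaticity = sum(counts[a] for a in "FWY") / n
    hydropathy = sum(KD[a] for a in window) / n
    # Net charge per residue: K, R positive; D, E negative.
    net_charge = (counts["K"] + counts["R"] - counts["D"] - counts["E"]) / n
    # Shannon entropy of the composition, used as a simple complexity measure.
    complexity = -sum((c / n) * log2(c / n) for c in counts.values())
    return {"aromaticity": aromaticity, "hydropathy": hydropathy,
            "net_charge": net_charge, "complexity": complexity}

def sliding_windows(seq, size=21):
    """Yield (position, window) pairs; windows are truncated at the termini."""
    half = size // 2
    for i in range(len(seq)):
        yield i, seq[max(0, i - half): i + half + 1]
```

Each residue then gets one feature vector (from the window centred on it), which is what a neural network or SVM would be trained on after the data are split into training and testing sets.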
First Machine-learning Predictor
SDR/MDR/LDR Predictors
1. Region classes by number of missing AA:
   Short Disordered Regions (SDR): 7 – 21 missing AA
   Medium Disordered Regions (MDR): 22 – 44
   Long Disordered Regions (LDR): 45 or more
2. SDR / MDR / LDR predictors: Neural networks
3. Training dataset: proteins with missing AA
   SDR: 34 proteins, 11,050 AA, 38 IDR, 411 IDAA
   MDR: 20 proteins, 4,764 AA, 22 IDR, 464 IDAA
   LDR: 7 proteins, 2,069 AA, 7 IDR, 465 IDAA
4. Feature selection: standard sequential forward selection
5. Accuracy: 59 – 67% estimated by 5-fold cross-validation
6. Better than chance; better on self than on not self
Romero P, et al., Proc. IEEE International Conference on Neural Networks 1:90-95 (1997)
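Sequential forward selection and 5-fold cross-validation, as named in the list above, can be sketched generically. This is an illustration under assumptions, not the 1997 implementation: the function names are hypothetical, and `score` stands in for whatever cross-validated accuracy estimate a real predictor would supply.

```python
def k_fold_indices(n_samples, k=5):
    """Split range(n_samples) into k contiguous, near-equal test folds."""
    folds, start = [], 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def sequential_forward_selection(candidates, score, max_features=None):
    """Greedily add the feature that most improves score(subset).

    `score` is any callable mapping a list of feature names to a number,
    for example a cross-validated accuracy. Selection stops when no
    remaining feature improves the score or max_features is reached.
    """
    selected, best_score = [], float("-inf")
    remaining = list(candidates)
    while remaining and (max_features is None or len(selected) < max_features):
        top_score, top_feature = max(
            ((score(selected + [f]), f) for f in remaining),
            key=lambda pair: pair[0],
        )
        if top_score <= best_score:
            break  # no candidate improves on the current subset
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score
```

In use, `score` would train a predictor on four folds from `k_fold_indices` and test on the fifth, averaging over all five choices of test fold.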
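Next: PONDR®VL-XT
• VL-XT(2) combines three predictors: XN(1), VL1(2), and XC(1)
• XN, VL1, and XC: neural networks
• Input features: XN: 8, VL1: 10, XC: 8
[Diagram: XN, VL1, and XC placed along the sequence, with positions 11, 14, N-14, and N-11 marked.]
(1) Li X et al., Genome Informat. 9:201-213 (1999)
(2) Romero P et al., Proteins 42:38-48 (2001)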
Inputs for PONDR®VL-XT
[Table: input attributes for XN, VL1, and XC. Legible entries: Coordination No., Net charge, Hydropathy, and the residue groups VIYFW, WFY, PEVK, W Y F D E, K R, M, N, T, H, D, V, R.]
Accuracy (ACC) = (%Corr-O)/2 + (%Corr-D)/2
ACC (estimated by cross-validation) ~ 72 ± 4%
Li X et al., Genome Informat. 9:201-213 (1999)
Romero P et al., Proteins 42:38-48 (2001)
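The accuracy measure ACC = (%Corr-O)/2 + (%Corr-D)/2 averages the per-class percentages, so it is not inflated when ordered residues far outnumber disordered ones. A minimal sketch follows; the function name and the "O"/"D" label encoding are assumptions for the example.

```python
def balanced_accuracy(true_labels, predicted_labels):
    """ACC = (%Corr-O)/2 + (%Corr-D)/2, with labels 'O' (order) and 'D' (disorder)."""
    correct = {"O": 0, "D": 0}
    total = {"O": 0, "D": 0}
    for truth, predicted in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    if total["O"] == 0 or total["D"] == 0:
        raise ValueError("both classes must be present")
    pct_corr_o = 100.0 * correct["O"] / total["O"]
    pct_corr_d = 100.0 * correct["D"] / total["D"]
    return pct_corr_o / 2 + pct_corr_d / 2

# 8 ordered residues all correct, 1 of 2 disordered residues correct:
# plain accuracy would be 90%, but ACC is 75%.
print(balanced_accuracy("OOOOOOOODD", "OOOOOOOOOD"))  # 75.0
```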
Disorder Prediction in CASP
• Critical Assessment of Structure Prediction
• http://predictioncenter.org
• CASP1 (1994) to CASP9 (2010)
• Experimentalists provide amino acid sequences
as they are determining the structures of proteins
• Groups register and make structure predictions
• After structures are determined, predictions are evaluated
• Disorder predictions introduced in CASP5 (2002)
CASP PREDICTIONS ARE TRULY BLIND!!!
Disorder Prediction in CASP
[Figure: two panels plotted against year (2002, 2004, 2006, 2008, 2010). One panel shows the number of CASP predictors (0 to 40); the other shows the area under the ROC curve (0.6 to 1.0), with points labelled VSL2, VSL2, and PreDisorder.]
CASP5 (2002): sensitivity replaced AUC
Our Performance in CASP
• Used VL-XT in CASP5: performed poorly on short disordered regions,
but very well on long disordered regions.
• VL was trained mainly on long disordered regions.
• Changed predictor for CASP6 and CASP7; the new
predictor ranked #1. Big improvement !!
• Did not participate in CASP8, but would not have
ranked #1 with current predictors.
• What was the change that led to the large improvement in
CASP6??
Predictors of Natural Disordered Regions
PONDR®VL-XT and PONDR®VSL2
VL-XT(2)
• Components: N(1), VL1(2), and C(1), with sequence positions 11, 14, N-14, and N-11 marked in the diagram
• N, VL1, and C are neural networks
• N-term: 8 inputs; VL1: 10 inputs; C-term: 8 inputs
VSL2(3)
• Components: M1(3), VL2(3), and VS2(3), with outputs labelled OM and 1-OM, OL, and OS in the diagram
• VSL2 Score = OL×OM + OS×(1-OM)
• M1, VSL2-L, and VSL2-S are support vector machines
• M1: 54 inputs; VL2: 20 inputs; VS2: 20 inputs
(1) Li X et al., Genome Informat. 9:201-213 (1999)
(2) Romero P et al., Proteins 42:38-48 (2001)
(3) Peng K et al., BMC Bioinfo. 7:208 (2006)
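The combination rule VSL2 Score = OL×OM + OS×(1-OM) is a per-residue weighted average. The sketch below assumes, from the diagram labels, that OL and OS are the outputs of the long-region and short-region predictors and that OM is the output of the meta-predictor M1; the function names are hypothetical.

```python
def vsl2_score(o_l, o_s, o_m):
    """Blend long- and short-region outputs using the meta-predictor weight o_m."""
    return o_l * o_m + o_s * (1.0 - o_m)

def vsl2_profile(long_outputs, short_outputs, meta_outputs):
    """Apply the combination rule residue by residue."""
    return [vsl2_score(o_l, o_s, o_m)
            for o_l, o_s, o_m in zip(long_outputs, short_outputs, meta_outputs)]
```

When OM is near 1 the score follows the long-region predictor; when it is near 0 the score follows the short-region predictor.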
Comparison on CASP 8 Dataset
AUC = 0.89
ACC = 80%
AUC = Area Under Curve
ACC = (%Corr-O)/2 + (%Corr-D)/2
Zhang P, et al. (unpublished results; not quite the same as CASP evaluation)
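The area under the ROC curve can be computed without drawing the curve: it equals the probability that a randomly chosen disordered residue receives a higher score than a randomly chosen ordered residue, with ties counted as one half. The sketch below uses that pairwise definition; the function name and the 1/0 label encoding are assumptions, and the quadratic loop is meant for clarity rather than speed.

```python
def roc_auc(scores, labels):
    """AUC from per-residue scores; labels are 1 (disordered) or 0 (ordered)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))
```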
PONDR®VL-XT, PONDR®VSL2B
and PreDisorder
[Figure: disorder score (0.0 to 1.0; (+) disordered, (–) structured) versus residue index (0 to 250) for XPA, with curves for VL-XT, VSL2, and PreDisorder.]
Iakoucheva L et al., Protein Sci 3: 561-571 (2001)
Dunker AK et al., FEBS J 272: 5129-5148 (2005)
Deng X et al., BMC Bioinformatics 10:436 (2009)
Published Predictors of
Disordered Proteins
[Figure: number of predictors of IDPs (0 to 60) by year, 1979 to 2009. Labels: # +,- / # phobics; PONDRs; CASP.]
PONDRs:
- VSL1: Ranked #1 in CASP 6 (2004)
- VSL2: Ranked #1 in CASP 7 (2006)
He B, et al., Cell Res 19: 929-949 (2009)
How Abundant are IDRs/IDPs?
• To estimate the abundance of IDPs/IDRs:
predict on whole proteomes from many
organisms.
ALERT!!
• The lack of membrane-protein-specific disorder
predictors means that estimates of disorder
will be too low by a small percentage.
VSL2 Prediction of Abundance**
of Intrinsically Disordered Proteins
Organisms           | # Orgs. | # Proteins   | Avg. # Proteins | % Disordered AA | % Proteins IDR >30 | % Proteins Natively Unfolded
Archaea             | 73      | 536 – 4234   | 2199            | 12.5 – 37.2%    | 0 – 60.0%          | 3.2 – 31.5%
Bacteria            | 951     | 182 – 9320   | 3331            | 12.0 – 36.1%    | 11.5 – 53.7%       | 2.7 – 29.2%
Single-cell Eukarya | 58      | 1909 – 16365 | 9098            | 22.3 – 49.9%    | 17.0 – 76.8%       | 16.8 – 47.6%
Multi-cell Eukarya  | 51      | 1775 – 35942 | 11295           | 10.4 – 49.0%    | 4.4 – 66.5%        | 6.9 – 48.7%
**Are organism-specific predictors sometimes needed?
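Two of the table's columns, % disordered AA and % proteins with an IDR longer than 30 residues, can be derived from per-residue disorder scores. The sketch below is an assumption-laden illustration: the 0.5 score threshold, the function names, and the dictionary input format are not taken from the slides, and the "natively unfolded" column is omitted because its definition is not given here.

```python
def longest_disordered_run(scores, threshold=0.5):
    """Length of the longest run of consecutive residues scoring >= threshold."""
    longest = current = 0
    for s in scores:
        current = current + 1 if s >= threshold else 0
        longest = max(longest, current)
    return longest

def proteome_summary(proteome, threshold=0.5, min_idr=30):
    """proteome maps protein id -> list of per-residue disorder scores."""
    total_residues = sum(len(scores) for scores in proteome.values())
    disordered = sum(1 for scores in proteome.values()
                     for s in scores if s >= threshold)
    with_long_idr = sum(1 for scores in proteome.values()
                        if longest_disordered_run(scores, threshold) > min_idr)
    return {
        "pct_disordered_aa": 100.0 * disordered / total_residues,
        "pct_proteins_long_idr": 100.0 * with_long_idr / len(proteome),
    }
```

Running this over every proteome in a kingdom and taking the minimum and maximum would give ranges of the kind shown in the table.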
Archaea Phylogenetic Tree
[Figure: archaeal phylogenetic tree with legend bins >30%, >21%, >17%, >14%, <14%.]
Todd Lowe (http://archaea.ucsc.edu/)
Predicted Disorder vs. Proteome Size
[Figure: average fraction of disordered residues (0.0 to 0.6) versus proteome size (10^2 to 10^5) for Bacteria, Archaea, SC eukaryotes, and MC eukaryotes.]
Why So Much Disorder?
Hypothesis: Disorder Used for Signaling
• Sequence → Structure → Function
– Catalysis,
– Membrane transport,
– Binding small molecules.
• Sequence → Disordered Ensemble → Function
– Signaling,
– Regulation,
– Recognition,
– Control.
Dunker AK, et al., Biochemistry 41: 6573-6582 (2002)
Dunker AK, et al., Adv. Prot. Chem. 62: 25-49 (2002)
Xie H, et al., J. Proteome Res. 6: 1882-1932 (2007)
A New Order / Disorder AA Scale, Part 1
• Collect equal numbers of O and D windows of length 21.
• Calculate the value of an attribute, x, for each window.
• For each interval of x, count how many windows are O
and D; from this, determine P(O | x) and P(D | x).
• Plot P(O | x) and P(D | x) versus x.
• Determine the area between the two curves.
• Area Ratio Value (ARV) = (area between curves / total area)
• Apply to 517 aa scales: http://www.genome.jp/aaindex .
• Rank scales from smallest to largest ARV.
Campen A, et al., Protein Pept Lett 15: 956-963 (2008)
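The Area Ratio Value procedure above can be sketched as a histogram calculation. This is one reading of the slide, not the published code: the function name, the equal-width binning, the default of 10 bins, and the choice to skip empty bins are assumptions. Since P(O | x) + P(D | x) = 1 in every non-empty bin, the "total area" here is just the number of non-empty bins.

```python
def area_ratio_value(ordered_x, disordered_x, n_bins=10):
    """ARV = area between the P(O|x) and P(D|x) curves / total area.

    ordered_x and disordered_x hold the attribute value x of each ordered
    and each disordered window (equal numbers of each, per the slide).
    """
    values = list(ordered_x) + list(disordered_x)
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        width = 1.0  # all windows share one x value; everything lands in bin 0

    def bin_of(x):
        return min(int((x - lo) / width), n_bins - 1)

    o_counts = [0] * n_bins
    d_counts = [0] * n_bins
    for x in ordered_x:
        o_counts[bin_of(x)] += 1
    for x in disordered_x:
        d_counts[bin_of(x)] += 1

    between = total = 0.0
    for o, d in zip(o_counts, d_counts):
        if o + d == 0:
            continue  # no windows in this interval of x
        p_o, p_d = o / (o + d), d / (o + d)
        between += abs(p_o - p_d)
        total += p_o + p_d
    return between / total
```

A scale that separates ordered from disordered windows perfectly gives an ARV of 1; a scale that cannot distinguish them gives an ARV near 0.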
A New Order / Disorder AA Scale, Part 2
• Overall idea: make random changes to a scale, test for
higher ARV, repeat until no larger value is found.
• Genetic Algorithm Pseudocode:
– Choose initial population
– Repeat
– Evaluate the fitness of each individual
– Select a certain portion of best-ranking individuals
– Breed new population through crossover + mutation
– Until terminating condition
• ARV improved from 0.69 for the best of 517 scales to
0.76 for the new scale, called TOP-IDP
Campen A, et al., Protein Pept Lett 15: 956-963 (2008)
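The genetic-algorithm pseudocode above maps onto a short runnable sketch. This is a generic illustration rather than the procedure used to derive the published scale: the function name, population size, single-point crossover, Gaussian mutation, and a fixed generation count as the terminating condition are all assumptions. In the setting of the slide, `fitness` would be the ARV of a candidate 20-value amino acid scale.

```python
import random

def genetic_search(fitness, n_values=20, pop_size=30, generations=50,
                   keep=0.2, mutation_sd=0.1, seed=0):
    """Maximise fitness(scale) over lists of n_values numbers (n_values >= 2)."""
    rng = random.Random(seed)
    # Choose initial population: random scales with values in [-1, 1].
    population = [[rng.uniform(-1.0, 1.0) for _ in range(n_values)]
                  for _ in range(pop_size)]
    n_keep = max(2, int(pop_size * keep))
    for _ in range(generations):
        # Evaluate fitness and select the best-ranking portion.
        population.sort(key=fitness, reverse=True)
        parents = population[:n_keep]
        # Breed a new population through crossover + mutation.
        children = []
        while len(parents) + len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_values)
            child = mother[:cut] + father[cut:]
            child[rng.randrange(n_values)] += rng.gauss(0.0, mutation_sd)
            children.append(child)
        # Parents survive unchanged, so the best fitness never decreases.
        population = parents + children
    return max(population, key=fitness)
```

Keeping the parents in each new generation is a design choice (elitism) that guarantees the returned scale is at least as fit as the best initial one.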
P(D | x) and P(O | x) Versus x Plots:
Area Between Curves Used to Rank Attributes, x
• Flexibility: ARV = 0.69, Rank = #1/517
• Positive Charge: ARV = 0.36, Rank = #238/517
• Extracellular Protein AA Composition: ARV = 0.07, Rank = #517/517
• TOP-IDP: ARV = 0.76
Campen A et al., Protein & Peptide Lett 15: 956-963 (2008)
Analysis of the disorder propensity in p53 by
TOP-IDP (A), PONDR® VL-XT (B) and PONDR® VSL1 (C).
Chronology of Amino Acid Evolution
DISORDER TO ORDER, NON-LIFE TO LIFE
Di Mauro E, et al., in Genesis: Origin of Life on Earth and
Other Planets (In press)
New Phosphorylation
Predictor
• Data collection from high-quality sources,
such as UniProt/Swiss-Prot, Phospho.ELM,
PhosphoPep, and PhosPhAt
• Non-redundant datasets built by BLASTclust
– Phosphorylation sites
– Non-phosphorylation sites
• Feature extraction
– KNN scores: similarity to known
sites (+ / -) of phosphorylation
– Disorder scores: used VSL2
– Amino acid frequencies: at sequence
positions before and after
phosphorylation sites
• Features from positive set and features from negative set
– Training data
– Control data
• Bootstrap sample 1 ... Bootstrap sample m
• Training
• Classifier 1 ... Classifier m
• Aggregating
• Phosphorylation prediction model
– Making predictions on new data
– Specificity estimation
Gao J et al., Mol and Cell Proteomics (In press)
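The bootstrap, train, and aggregate steps in the workflow above follow the general pattern of bootstrap aggregating (bagging). The sketch below shows that pattern only: the single-feature threshold classifier, the function names, and the parameters are assumptions for illustration, and the actual predictor combines KNN scores, disorder scores, and amino acid frequencies.

```python
import random

def train_stump(samples):
    """Toy base classifier: threshold midway between the two class means.

    samples is a list of (feature_value, label) pairs with label 1 or 0.
    Returns None if the sample is missing one of the classes.
    """
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    if not pos or not neg:
        return None
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def bagged_model(samples, m=25, seed=0):
    """Train m classifiers on bootstrap samples; predict by averaging votes."""
    if len({y for _, y in samples}) < 2:
        raise ValueError("training data must contain both classes")
    rng = random.Random(seed)
    thresholds = []
    while len(thresholds) < m:
        # Bootstrap sample: draw len(samples) items with replacement.
        boot = [rng.choice(samples) for _ in samples]
        threshold = train_stump(boot)
        if threshold is not None:
            thresholds.append(threshold)

    def predict(x):
        # Fraction of classifiers voting "phosphorylated"; assumes positives score higher.
        return sum(1 for t in thresholds if x > t) / len(thresholds)

    return predict
```

Applying `predict` to held-out control (negative) data and counting how often the vote falls below a chosen cutoff gives a specificity estimate of the kind named in the last step.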
Disorder Score vs. Phosphorylation
[Figure: six panels of occurrence versus disorder score (0 to 1):
(A) Phospho-S/T in H. sapiens
(B) Non-phospho-S/T in H. sapiens
(C) Phospho-S/T in A. thaliana
(D) Non-phospho-S/T in A. thaliana
(E) Phospho-Y in H. sapiens
(F) Non-phospho-Y in H. sapiens
A further axis is labelled Residue Positions (-6 to +6). Panel annotations: 91.3% > 0.5, 87.6% > 0.5, 50.5% > 0.5, 54.9% > 0.5.]
Gao J et al., Mol & Cell Proteomics 9 (Epub) (2010)
Signaling Example 1:
Calcineurin and Calmodulin
[Figure: structures with labels B-Subunit, A-Subunit, Active Site, and Autoinhibitory Peptide.]
Meador W et al., Science 257: 1251-1255 (1992)
Kissinger C et al., Nature 378: 641-644 (1995)
Example 2:
p27kip1: A Disordered Domain
Cyclin A
CDK
p27kip1 (69 residues)
3D Structure: Russo AA et al., Nature 382: 325-331 (1996)
DD: Tompa P et al., Bioessays 4: 328-340 (2008)
The p27kip1 Disordered Domain:
Used for Signal Integration
[Figure: steps 1 to 4 of p27kip1 signal integration, with labels Y88, pY88, ATP, T187, pT187, and Ub’n.]
1. NRTK phosphorylation @ Y88, signal #1.
2. Intra-molecular phosphorylation @ T187, #2.
3. Ubiquitination @ several possible loci, #3.
4. Proteasome digestion of p27, then cell cycle
progression.
Galea CA et al., J Mol Biol 376: 827-838 (2008)
Dunker AK & Uversky VN, Nat Chem Biol 4: 229-230 (2008)
Importance of Bioinformatics to
IDP and Protein Research
• Thousands of IDPs and IDRs have been found.
• Not one IDP or IDR is discussed in any current
biochemistry textbook!
• Why? IDPs and IDRs don’t fit
Sequence → Structure → Function
• New paradigm developed from bioinformatics:
Sequence → Disordered Ensemble → Function
IDP prediction is changing fundamental views of
structure-function relationships!
Thank You ! ! !
Indiana University
Bin Xue
Jake Chen
Bill Sullivan
Predrag Radivojac
Jennifer Chen
Pedro Romero
Marc Cortese
Derrick Johnson
Chris Oldfield
Amrita Mohan
Yunlong Liu
Ann Roman
Tom Hurley
Anna DePaoli-Roach
Yuro Tagaki
Siama Zaidi
Jingwei Meng
Wei-Lun Hsu
Hua Lu
Fei Huang
Vladimir Uversky
Collaborators
Harbin Engineering University
Bo He
Kejun Wang
University of Idaho
Celeste J. Brown
Chris Williams
Molecular Kinetics
Yugong Cheng
Tanguy LeGall
Aaron Santner
Plant and Food Research
UCSD
Lilia Iakoucheva Sebat
Temple University
Zoran Obradovic
Slobodan Vucetic
Vladimir Vacic
Kang Peng
Hiongbo Xie
Siyuan Ren
Uros Midic
Enzyme Institute
Gary Daughdrill
Peter Tompa
Zsuzsanna Dosztanyi
Istvan Simon
Monika Fuxreiter
Wright State University
USU
Oleg Paliy
Robert Williams
Xaiolin Sun
USF